Extracting Domain Information using Deep Learning
Machine Learning/Artificial Intelligence
Time: Tuesday, July 30, 11:30am - 12pm
Description: Across various scientific domains, digital publication of technical documents, often in the form of conference or journal article submissions, is the first accessible instance of new human knowledge in the respective field. Synthesizing and curating this information is a slow and difficult process that often requires non-trivial human expertise. Given the ever-increasing rate of publication and the natural limitations of manual approaches, a computational solution to this problem is urgently needed. We have developed a computational tool (DIVE) that provides entity extraction and expert-curation functionality for large document collections, listing the biological entities central to each article. However, these methods depend on prior knowledge, and the underlying language models often yield low precision in practice, caused by new, unseen vocabulary and improper weighting of existing entities.
In this paper, we investigate how deep learning methods may be used to address this issue. Using the author-feedback mechanism in our deployed tool, we created an expert-annotated gold-standard dataset from articles submitted over an entire year, which enables us to contrast several supervised machine learning methods on the entity extraction task. Our results show that DIVE's ensemble of methods (regular expression rules, keyword dictionaries, ontology files) achieves higher precision and recall than ABNER's CRF-based models. However, both tools in general show low precision. We further investigate using deep learning (via the NeuroNER tool) to improve the precision and recall of DIVE, with promising early results.
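To illustrate the kind of ensemble described above, here is a minimal sketch of an entity extractor that combines a regex rule with a keyword dictionary. All names, patterns, and labels are illustrative assumptions for exposition; this is not DIVE's actual implementation.

```python
import re

# Hypothetical rule: gene-symbol-like tokens (e.g. "TP53", "BRCA1").
GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{2,}\b")

# Hypothetical keyword dictionary mapping terms to entity labels.
KEYWORD_DICT = {"zebrafish": "ORGANISM", "hippocampus": "ANATOMY"}

def extract_entities(text):
    """Return (entity_text, label) pairs found by either method."""
    entities = []
    # Rule-based pass: regex matching for gene-symbol-like tokens.
    for match in GENE_PATTERN.finditer(text):
        entities.append((match.group(), "GENE"))
    # Dictionary pass: case-insensitive keyword lookup.
    lowered = text.lower()
    for keyword, label in KEYWORD_DICT.items():
        if keyword in lowered:
            entities.append((keyword, label))
    return entities

print(extract_entities("TP53 expression in the zebrafish hippocampus"))
```

A sketch like this makes the precision problem concrete: the regex rule will also fire on unrelated all-caps tokens (acronyms, dataset names), which is one source of the low precision that a learned sequence model such as NeuroNER aims to reduce.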