Patent mining: combining dictionary-based and machine-learning approaches

Exploration of the chemical patent space is essential for early-stage medicinal chemistry activities. The BioCreative CHEMDNER-patents task focuses on the recognition of chemical compounds in patents. It comprises three subtasks: recognition of chemical named entities in patents (CEMP), classification of chemical-related patent titles and abstracts (CPD), and recognition of genes and proteins in patent abstracts (GPRO). In this study we tackled the CEMP and CPD tasks. We investigated an ensemble system in which a dictionary-based approach is combined with a machine-learning approach to extract compounds from text. To this end, we assessed the performance of several lexical resources using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a CRF-based chemical recognizer. To improve the performance of tmChem, three additional feature types were introduced (POS tags, lemmas, and word-vector clusters). When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task and an accuracy of 91.53% for the CPD task. On the test set, our system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second for CPD with an accuracy of 94.23%.
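The abstract does not state how the dictionary-based and CRF-based outputs were merged, so the following is only a minimal sketch under stated assumptions: mentions are represented as character-offset spans, the two outputs are combined by taking their union, and overlapping spans are resolved by keeping the longer one. The function names (merge_spans, f_score) and the overlap rule are hypothetical and are not taken from Peregrine or tmChem. The F-score shown is the usual harmonic mean of precision and recall over exact span matches.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_spans(dictionary_spans: List[Span], crf_spans: List[Span]) -> List[Span]:
    """Union of both span lists; on overlap, keep the longer span.

    Assumption: this merge rule is illustrative only, not the authors' method.
    """
    # Consider longer spans first so they win over shorter overlapping ones.
    candidates = sorted(
        set(dictionary_spans) | set(crf_spans),
        key=lambda s: (-(s[1] - s[0]), s[0]),
    )
    kept: List[Span] = []
    for span in candidates:
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)


def f_score(predicted: List[Span], gold: List[Span]) -> float:
    """Balanced F-score over exact span matches."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up offsets (not data from the study).
dictionary_spans = [(0, 7), (20, 27)]
crf_spans = [(0, 12), (40, 48)]
merged = merge_spans(dictionary_spans, crf_spans)
gold = [(0, 12), (20, 27), (50, 55)]
score = f_score(merged, gold)
```

A union of this kind tends to favour recall, since either recognizer can contribute a mention; an intersection or a voting scheme would instead favour precision.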
Type of Publication:
In Proceedings
Book title:
Proceedings of the fifth BioCreative challenge evaluation workshop