The identification of biomedical terms in natural language is essential for information extraction from text. To increase the number of terms in the Unified Medical Language System (UMLS) that are suitable for text mining, we implemented seventeen term rewrite and suppress rules. Five of the nine rewrite rules were found to generate additional synonyms and spelling variations that correctly corresponded to the meaning of the original terms, and seven of the eight suppress rules were found to suppress only undesired terms. We recommend applying the five rewrite rules and seven suppress rules that passed our evaluation when the UMLS is to be used for term identification in free text, and we provide the free software tool Casper, which can apply these rules to UMLS data.
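To make the distinction between the two kinds of rule concrete, here is a minimal Java sketch. It is not taken from Casper's source code, and the two rules shown are hypothetical stand-ins chosen for illustration: a rewrite rule that adds the natural word order for an inverted term, and a suppress rule that drops terms with no informative token. The class and method names (`TermRuleSketch`, `rewriteInversion`, `suppressShort`, `apply`) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Illustrative only: hypothetical rules, not Casper's actual implementation.
final class TermRuleSketch {

    // Hypothetical rewrite rule: turn an inverted term "B, A" into "A B".
    // Returns an empty Optional when the term does not have exactly one ", ".
    static Optional<String> rewriteInversion(String term) {
        int comma = term.indexOf(", ");
        if (comma <= 0 || comma != term.lastIndexOf(", ")) {
            return Optional.empty();
        }
        String head = term.substring(0, comma).trim();
        String modifier = term.substring(comma + 2).trim();
        if (head.isEmpty() || modifier.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(modifier + " " + head);
    }

    // Hypothetical suppress rule: suppress a term unless at least one token
    // is two or more characters long and contains a letter.
    static boolean suppressShort(String term) {
        for (String token : term.split("[^\\p{L}\\p{N}]+")) {
            if (token.length() >= 2 && token.chars().anyMatch(Character::isLetter)) {
                return false;
            }
        }
        return true;
    }

    // Apply the suppress rule first, then add any rewritten variant
    // next to the original term.
    static List<String> apply(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (suppressShort(term)) {
                continue;
            }
            out.add(term);
            rewriteInversion(term).ifPresent(out::add);
        }
        return out;
    }
}
```

In this sketch a rewrite rule only ever adds a variant and keeps the original term, while a suppress rule removes a term entirely, which mirrors the two roles described above.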
Casper operates on UMLS data, for which a licence is required.
Casper requires Java. If you do not have Java installed, download and install it first.
Once you have access to UMLS data and Java is installed on your computer, download the zip archive containing the files needed to run Casper.
Casper runs as an executable JAR file and has been tested on Linux and Windows. Please read the README.txt file contained in the zip archive before running Casper. Note: Casper needs 1.5 GB of RAM to run.
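If Casper stops with an out-of-memory error, it may help to check how much heap the Java virtual machine is allowed to use. The small program below is a sketch, not part of Casper, and it assumes that the 1.5 GB figure refers to JVM heap. The class name `HeapCheck` and its method are hypothetical. The maximum heap can be raised with the standard JVM option `-Xmx`; see README.txt for the exact command used to start Casper.

```java
// Illustrative helper, not part of Casper: reports the JVM's maximum heap.
final class HeapCheck {

    // 1.5 GB in bytes, assuming the stated requirement refers to JVM heap.
    static final long REQUIRED_BYTES = 1536L * 1024L * 1024L;

    // True when the given maximum heap size meets the assumed requirement.
    static boolean hasEnoughHeap(long maxHeapBytes) {
        return maxHeapBytes >= REQUIRED_BYTES;
    }

    public static void main(String[] args) {
        // maxMemory() is the most heap this JVM will attempt to use.
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap (MB): " + max / (1024L * 1024L));
        System.out.println(hasEnoughHeap(max)
                ? "Heap limit looks sufficient"
                : "Heap limit may be too small; try a larger -Xmx value");
    }
}
```

Run the check with the same `-Xmx` value you intend to use for Casper. The value reported by `maxMemory()` can be slightly lower than the `-Xmx` setting, so a limit somewhat above 1.5 GB is the safer choice.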
If you use Casper in your study, please cite:
Kristina M Hettne, Erik M van Mulligen, Martijn J Schuemie, Bob JA Schijvenaars and Jan A Kors (2010). Rewriting and suppressing UMLS terms for improved biomedical term identification. J. Biomedical Semantics, 1(5). doi:10.1186/2041-1480-1-5