BioLemmatizer

 

The BioLemmatizer is a domain-specific lemmatization tool for the morphological analysis of biomedical literature. It is tailored to the biological domain through the integration of several published lexical resources related to molecular biology. It focuses on the inflectional morphology of English, including the plural forms of nouns, the conjugations of verbs, and the comparative and superlative forms of adjectives and adverbs. The BioLemmatizer retrieves lemmas from a lexicon that covers an exhaustive list of inflected word forms and their corresponding lemmas in both general English and the biomedical domain. For words that are not found in the lexicon, it falls back on a set of rules that generalize morphological transformations and derive a lemma heuristically.
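The lexicon-first, rules-as-fallback strategy described above can be illustrated with a minimal sketch. This is not the BioLemmatizer API: the class name ToyLemmatizer, the handful of lexicon entries, and the two suffix rules are hypothetical and chosen only to show the control flow. The real lexicon and rule set are far larger.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy class; it only illustrates "lexicon lookup, then rules".
public class ToyLemmatizer {

    // Tiny illustrative lexicon: inflected form -> lemma (assumed entries).
    private static final Map<String, String> LEXICON = new HashMap<>();

    // Illustrative suffix rules, tried in insertion order (longest suffix first).
    private static final Map<String, String> SUFFIX_RULES = new LinkedHashMap<>();

    static {
        LEXICON.put("mice", "mouse");
        LEXICON.put("nuclei", "nucleus");
        LEXICON.put("phosphorylated", "phosphorylate");

        SUFFIX_RULES.put("ies", "y"); // antibodies -> antibody
        SUFFIX_RULES.put("s", "");    // assays -> assay (deliberately crude)
    }

    public static String lemmatize(String word) {
        String lower = word.toLowerCase(Locale.ROOT);

        // Step 1: exact lookup in the lexicon.
        String fromLexicon = LEXICON.get(lower);
        if (fromLexicon != null) {
            return fromLexicon;
        }

        // Step 2: heuristic fallback for words missing from the lexicon.
        for (Map.Entry<String, String> rule : SUFFIX_RULES.entrySet()) {
            String suffix = rule.getKey();
            if (lower.length() > suffix.length() && lower.endsWith(suffix)) {
                String stem = lower.substring(0, lower.length() - suffix.length());
                return stem + rule.getValue();
            }
        }

        // Step 3: no rule applied, so return the word unchanged.
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("mice"));       // lexicon hit: mouse
        System.out.println(lemmatize("assays"));     // rule fallback: assay
        System.out.println(lemmatize("antibodies")); // rule fallback: antibody
    }
}
```

Irregular forms such as "mice" can only be resolved through the lexicon, which is why the lookup runs first. The rules exist to cover regular inflections of words the lexicon has never seen.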

The BioLemmatizer 1.2 release adds an optional feature that normalizes British English spellings to American English spellings before retrieving the corresponding lemmas. For instance, the lemma of "haemangioblastomata" is "hemangioblastoma".
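The two-stage idea (normalize the spelling first, then lemmatize) can be sketched as follows. Again, this is an illustration under stated assumptions rather than the BioLemmatizer's own code: the class ToySpellingNormalizer, its three rewrite pairs, and the single "-omata" rule are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy class; not part of the BioLemmatizer distribution.
public class ToySpellingNormalizer {

    // Assumed British -> American substring rewrites, applied in order.
    private static final Map<String, String> REWRITES = new LinkedHashMap<>();

    static {
        REWRITES.put("haem", "hem");
        REWRITES.put("oedema", "edema");
        REWRITES.put("tumour", "tumor");
    }

    // Stage 1: rewrite British spellings to their American equivalents.
    public static String normalize(String word) {
        String result = word.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rewrite : REWRITES.entrySet()) {
            result = result.replace(rewrite.getKey(), rewrite.getValue());
        }
        return result;
    }

    // Stage 2: lemmatize the normalized form. Only one toy rule is shown:
    // the classical plural "-omata" becomes "-oma".
    public static String lemmatize(String word) {
        String normalized = normalize(word);
        if (normalized.endsWith("omata")) {
            // Drop the trailing "ta" so that "...omata" becomes "...oma".
            return normalized.substring(0, normalized.length() - 2);
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Prints: hemangioblastoma
        System.out.println(lemmatize("haemangioblastomata"));
    }
}
```

Normalizing before lemmatizing means the lexicon and rules only need to cover one spelling variant of each word.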

The BioLemmatizer 1.1 public release is the FULL version of the BioLemmatizer. It includes the data from the EBI term repository, the publicly available part of the BioLexicon database. The lemmatization accuracy of the BioLemmatizer 1.1 is 99% on a sampled set of CRAFT, a richly annotated corpus of 97 full-text biomedical journal articles.

If you use the BioLemmatizer to support academic research, please cite the following paper:

Haibin Liu, Tom Christiansen, William A Baumgartner Jr, and Karin Verspoor. BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics, 2012, 3:3.

Source code and resources pertaining to the BioLemmatizer 1.2 release are available here
Java API documentation can be found at http://biolemmatizer.sourceforge.net/apidocs/

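Version 1.2 of the BioLemmatizer is available via a Maven repository. If you use Maven as your build tool, you can add the BioLemmatizer as a dependency by adding the following to your pom.xml file. The dependency entries belong inside the <dependencies> element, and the repository entry belongs inside the <repositories> element: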

<dependency>
  <groupId>edu.ucdenver.ccp</groupId>
  <artifactId>biolemmatizer-core</artifactId>
  <version>1.2</version>
</dependency>
<dependency>
  <groupId>edu.ucdenver.ccp</groupId>
  <artifactId>biolemmatizer-uima</artifactId>
  <version>1.2</version>
</dependency>

<repository>
  <id>bionlp-sourceforge</id>
  <url>http://svn.code.sf.net/p/bionlp/code/repo</url>
</repository>