edu.ucdenver.ccp.nlp.biolemmatizer
Class BioLemmatizer

java.lang.Object
  extended by edu.ucdenver.ccp.nlp.biolemmatizer.BioLemmatizer

public class BioLemmatizer
extends Object

BioLemmatizer: Lemmatizes a word from biomedical text and returns its lemma; supplying the word's part of speech (POS) is optional.

Usage:

java -Xmx1G -jar biolemmatizer-core-1.0-jar-with-dependencies.jar [-l] <input_string> [POS tag] or
java -Xmx1G -jar biolemmatizer-core-1.0-jar-with-dependencies.jar [-l] -i <input_file_name> -o <output_file_name> or
java -Xmx1G -jar biolemmatizer-core-1.0-jar-with-dependencies.jar [-l] -t

Example:

java -Xmx1G -jar biolemmatizer-core-1.0-jar-with-dependencies.jar catalyses NNS

Please see the README file for more usage examples.

Author:
Haibin Liu, William A Baumgartner Jr and Karin Verspoor

Field Summary
static String lemmaSeparator
          Lemma separator character
 edu.northwestern.at.utils.corpuslinguistics.lemmatizer.Lemmatizer lemmatizer
          BioLemmatizer
protected static String mappingFileName
          the Part-Of-Speech mapping file
 Map<String,String[]> mappingMajorClasstoPennPOS
          Hierarchical mapping file from major class to Penn Treebank POS
 Map<String,String[]> mappingPennPOStoNUPOS
          Hierarchical mapping file from PennPOS to NUPOS
 edu.northwestern.at.utils.corpuslinguistics.partsofspeech.PartOfSpeechTags partOfSpeechTags
          NUPOS tags
 edu.ucdenver.ccp.nlp.biolemmatizer.POSEntry posEntry
          POSEntry object to retrieve POS tag information
 edu.northwestern.at.utils.corpuslinguistics.tokenizer.WordTokenizer spellingTokenizer
          Extract individual word parts from a contracted word.
 edu.northwestern.at.utils.corpuslinguistics.lexicon.Lexicon wordLexicon
          Word lexicon for lemma lookup
 
Constructor Summary
BioLemmatizer()
          Default constructor loads the lexicon from the classpath
BioLemmatizer(File lexiconFile)
          Constructor to initialize the class fields
 
Method Summary
 LemmataEntry lemmatizeByLexicon(String spelling, String partOfSpeech)
          Lemmatize a string with POS tag using Lexicon only
 LemmataEntry lemmatizeByLexiconAndRules(String spelling, String partOfSpeech)
          Lemmatize a string with POS tag using both lexicon lookup and lemmatization rules. This is the preferred method as it gives the best lemmatization performance.
 LemmataEntry lemmatizeByRules(String spelling, String partOfSpeech)
          Lemmatize a string with POS tag using lemmatization rules only
static void main(String[] args)
          Input arguments are parsed into a BioLemmatizerCmdOpts object.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

lemmaSeparator

public static String lemmaSeparator
Lemma separator character


lemmatizer

public edu.northwestern.at.utils.corpuslinguistics.lemmatizer.Lemmatizer lemmatizer
BioLemmatizer


wordLexicon

public edu.northwestern.at.utils.corpuslinguistics.lexicon.Lexicon wordLexicon
Word lexicon for lemma lookup


partOfSpeechTags

public edu.northwestern.at.utils.corpuslinguistics.partsofspeech.PartOfSpeechTags partOfSpeechTags
NUPOS tags


spellingTokenizer

public edu.northwestern.at.utils.corpuslinguistics.tokenizer.WordTokenizer spellingTokenizer
Extract individual word parts from a contracted word.


mappingPennPOStoNUPOS

public Map<String,String[]> mappingPennPOStoNUPOS
Hierarchical mapping file from PennPOS to NUPOS


mappingMajorClasstoPennPOS

public Map<String,String[]> mappingMajorClasstoPennPOS
Hierarchical mapping file from major class to Penn Treebank POS
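
To make the shape of the two mapping fields concrete, here is a minimal sketch of chained lookups over two `Map<String,String[]>` instances: a major class is expanded into Penn Treebank tags, and each Penn tag into NUPOS tags. The class name `PosMappingSketch`, the method `expand`, and every tag pairing shown are illustrative assumptions; the actual entries come from the mapping file distributed with BioLemmatizer and may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of the BioLemmatizer API.
class PosMappingSketch {
    // Illustrative entries only; the real mapping file may differ.
    static final Map<String, String[]> PENN_TO_NUPOS = new HashMap<>();
    static final Map<String, String[]> MAJOR_TO_PENN = new HashMap<>();

    static {
        PENN_TO_NUPOS.put("NN", new String[] {"n1"});
        PENN_TO_NUPOS.put("NNS", new String[] {"n2"});
        PENN_TO_NUPOS.put("VBZ", new String[] {"vvz"});
        MAJOR_TO_PENN.put("noun", new String[] {"NN", "NNS"});
        MAJOR_TO_PENN.put("verb", new String[] {"VBZ"});
    }

    // Expand a major class into every NUPOS tag reachable through its Penn tags.
    static List<String> expand(String majorClass) {
        List<String> result = new ArrayList<>();
        String[] pennTags = MAJOR_TO_PENN.getOrDefault(majorClass, new String[0]);
        for (String penn : pennTags) {
            for (String nupos : PENN_TO_NUPOS.getOrDefault(penn, new String[0])) {
                result.add(nupos);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("noun")); // prints [n1, n2]
    }
}
```

An unknown major class yields an empty list in this sketch rather than an exception, which keeps callers free to fall back to a POS-agnostic lookup.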


mappingFileName

protected static String mappingFileName
the Part-Of-Speech mapping file


posEntry

public edu.ucdenver.ccp.nlp.biolemmatizer.POSEntry posEntry
POSEntry object to retrieve POS tag information

Constructor Detail

BioLemmatizer

public BioLemmatizer()
Default constructor loads the lexicon from the classpath


BioLemmatizer

public BioLemmatizer(File lexiconFile)
Constructor to initialize the class fields

Parameters:
lexiconFile - a reference to the lexicon file to use. If null, the lexicon that comes with the BioLemmatizer distribution is loaded from the classpath.
Method Detail

lemmatizeByLexicon

public LemmataEntry lemmatizeByLexicon(String spelling,
                                       String partOfSpeech)
Lemmatize a string with POS tag using Lexicon only

Parameters:
spelling - an input string
partOfSpeech - POS tag of the input string
Returns:
a LemmataEntry object containing lemma and POS information

lemmatizeByRules

public LemmataEntry lemmatizeByRules(String spelling,
                                     String partOfSpeech)
Lemmatize a string with POS tag using lemmatization rules only

Parameters:
spelling - an input string
partOfSpeech - POS tag of the input string
Returns:
a LemmataEntry object containing lemma and POS information

lemmatizeByLexiconAndRules

public LemmataEntry lemmatizeByLexiconAndRules(String spelling,
                                               String partOfSpeech)
Lemmatize a string with POS tag using both lexicon lookup and lemmatization rules. This is the preferred method as it gives the best lemmatization performance.

Parameters:
spelling - an input string
partOfSpeech - POS tag of the input string
Returns:
a LemmataEntry object containing lemma and POS information
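
One way to picture how lexicon lookup and rules can be combined is a lookup with a fallback. The sketch below is an assumption about the general strategy, not BioLemmatizer's actual implementation: the class `LexiconAndRulesSketch`, its tiny lexicon, and its two suffix rules are all hypothetical, and it returns a plain String where the real methods return a LemmataEntry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the BioLemmatizer implementation.
class LexiconAndRulesSketch {
    // Toy lexicon keyed by "spelling/POS"; entries are illustrative.
    private static final Map<String, String> LEXICON = new HashMap<>();

    static {
        LEXICON.put("mice/NNS", "mouse");
        LEXICON.put("phenomena/NNS", "phenomenon");
    }

    // Lexicon only: returns null when the lexicon has no entry.
    static String byLexicon(String spelling, String pos) {
        return LEXICON.get(spelling.toLowerCase(Locale.ROOT) + "/" + pos);
    }

    // Rules only: toy suffix stripping for plural nouns, identity otherwise.
    static String byRules(String spelling, String pos) {
        String w = spelling.toLowerCase(Locale.ROOT);
        if ("NNS".equals(pos)) {
            if (w.endsWith("ies") && w.length() > 3) {
                return w.substring(0, w.length() - 3) + "y";
            }
            if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 1) {
                return w.substring(0, w.length() - 1);
            }
        }
        return w;
    }

    // Assumed ordering: try the lexicon first, fall back to rules on a miss.
    static String byLexiconAndRules(String spelling, String pos) {
        String lemma = byLexicon(spelling, pos);
        return lemma != null ? lemma : byRules(spelling, pos);
    }

    public static void main(String[] args) {
        System.out.println(byLexiconAndRules("mice", "NNS"));       // prints mouse
        System.out.println(byLexiconAndRules("assays", "NNS"));     // prints assay
        System.out.println(byLexiconAndRules("antibodies", "NNS")); // prints antibody
    }
}
```

The fallback shows why a combined method can outperform either approach alone: irregular forms such as "mice" need a lexicon entry, while unseen regular forms are still handled by rules.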

main

public static void main(String[] args)
Input arguments are parsed into a BioLemmatizerCmdOpts object. Valid input arguments include:
  VAL    : Single input to be lemmatized
  VAL    : Part of speech of the single input to be lemmatized
  -f VAL : optional path to a lexicon file. If not set, the default lexicon 
           available on the classpath is used
  -i VAL : the path to the input file
  -l     : if present, only the lemma is returned (part-of-speech information is 
           suppressed)
  -o VAL : the path to the output file
  -t     : if present, the interactive mode is used
 

Parameters:
args - the command-line arguments described above
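
As a rough illustration of the argument handling listed for main, the following sketch hand-parses the same flags into a plain options object. It is only a stand-in under stated assumptions: the class `CmdOptsSketch` and its field names are hypothetical, and the real BioLemmatizerCmdOpts class may be populated quite differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an options holder; not BioLemmatizerCmdOpts.
class CmdOptsSketch {
    String lexiconFile;   // -f VAL
    String inputFile;     // -i VAL
    String outputFile;    // -o VAL
    boolean lemmaOnly;    // -l
    boolean interactive;  // -t
    // Positional values: the input string, then an optional POS tag.
    final List<String> positional = new ArrayList<>();

    static CmdOptsSketch parse(String[] args) {
        CmdOptsSketch opts = new CmdOptsSketch();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-l": opts.lemmaOnly = true; break;
                case "-t": opts.interactive = true; break;
                case "-f": opts.lexiconFile = value(args, ++i); break;
                case "-i": opts.inputFile = value(args, ++i); break;
                case "-o": opts.outputFile = value(args, ++i); break;
                default: opts.positional.add(args[i]);
            }
        }
        return opts;
    }

    // Fetch the value that must follow a flag, failing clearly if it is absent.
    private static String value(String[] args, int i) {
        if (i >= args.length) {
            throw new IllegalArgumentException("Missing value after " + args[i - 1]);
        }
        return args[i];
    }

    public static void main(String[] args) {
        CmdOptsSketch o = parse(new String[] {"-l", "catalyses", "NNS"});
        System.out.println(o.lemmaOnly + " " + o.positional); // prints true [catalyses, NNS]
    }
}
```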


Copyright © 2013. All Rights Reserved.