BioLemmatizer (biolemmatizer 1.2 API)

edu.ucdenver.ccp.nlp.biolemmatizer
Class BioLemmatizer

java.lang.Object
  edu.ucdenver.ccp.nlp.biolemmatizer.BioLemmatizer

public class BioLemmatizer
extends Object

BioLemmatizer: Lemmatize a word in biomedical texts and return its lemma; the part of speech (POS) of the word is optional.

Usage:

java -Xmx1G -jar biolemmatizer-core-1.0-jar-with-dependencies.jar [-l] <input_string> [POS tag]
or
java -Xmx1G -jar biolemmatizer-core-1.0-jar-with-dependencies.jar [-l] -i <input_file_name> -o <output_file_name>
or
java -Xmx1G -jar biolemmatizer-core-1.0-jar-with-dependencies.jar [-l] -t

Example:

java -Xmx1G -jar biolemmatizer-core-1.0-jar-with-dependencies.jar catalyses NNS

Please see the README file for more usage examples.

Author:: Haibin Liu, William A Baumgartner Jr and Karin Verspoor

Field Summary
`static String`	`lemmaSeparator` Lemma separator character
`edu.northwestern.at.utils.corpuslinguistics.lemmatizer.Lemmatizer`	`lemmatizer` BioLemmatizer
`protected static String`	`mappingFileName` the Part-Of-Speech mapping file
`Map<String,String[]>`	`mappingMajorClasstoPennPOS` Hierarchical mapping from major class to Penn Treebank POS
`Map<String,String[]>`	`mappingPennPOStoNUPOS` Hierarchical mapping from Penn Treebank POS to NUPOS
`edu.northwestern.at.utils.corpuslinguistics.partsofspeech.PartOfSpeechTags`	`partOfSpeechTags` NUPOS tags
`edu.ucdenver.ccp.nlp.biolemmatizer.POSEntry`	`posEntry` POSEntry object to retrieve POS tag information
`edu.northwestern.at.utils.corpuslinguistics.tokenizer.WordTokenizer`	`spellingTokenizer` Extract individual word parts from a contracted word.
`edu.northwestern.at.utils.corpuslinguistics.lexicon.Lexicon`	`wordLexicon` Word lexicon for lemma lookup

Constructor Summary
`BioLemmatizer()` Default constructor loads the lexicon from the classpath
`BioLemmatizer(File lexiconFile)` Constructor to initialize the class fields

Method Summary
`LemmataEntry`	`lemmatizeByLexicon(String spelling, String partOfSpeech)` Lemmatize a string with POS tag using Lexicon only
`LemmataEntry`	`lemmatizeByLexiconAndRules(String spelling, String partOfSpeech)` Lemmatize a string with POS tag using both lexicon lookup and lemmatization rules. This is the preferred method as it gives the best lemmatization performance.
`LemmataEntry`	`lemmatizeByRules(String spelling, String partOfSpeech)` Lemmatize a string with POS tag using lemmatization rules only
`static void`	`main(String[] args)` Input arguments are parsed into a `BioLemmatizerCmdOpts` object.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

lemmaSeparator

public static String lemmaSeparator

Lemma separator character

lemmatizer

public edu.northwestern.at.utils.corpuslinguistics.lemmatizer.Lemmatizer lemmatizer

BioLemmatizer

wordLexicon

public edu.northwestern.at.utils.corpuslinguistics.lexicon.Lexicon wordLexicon

Word lexicon for lemma lookup

partOfSpeechTags

public edu.northwestern.at.utils.corpuslinguistics.partsofspeech.PartOfSpeechTags partOfSpeechTags

NUPOS tags

spellingTokenizer

public edu.northwestern.at.utils.corpuslinguistics.tokenizer.WordTokenizer spellingTokenizer

Extract individual word parts from a contracted word.

mappingPennPOStoNUPOS

public Map<String,String[]> mappingPennPOStoNUPOS

Hierarchical mapping from Penn Treebank POS to NUPOS

mappingMajorClasstoPennPOS

public Map<String,String[]> mappingMajorClasstoPennPOS

Hierarchical mapping from major class to Penn Treebank POS
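The two `Map<String,String[]>` fields suggest a two-level lookup: a major class expands to Penn Treebank tags, and each Penn tag expands to NUPOS tags. The following is a minimal sketch of that idea only, not the BioLemmatizer implementation. The class name `PosMappingSketch`, the method `toNupos`, the major-class names, and the table entries are all illustrative assumptions; the real mappings are loaded from the mapping file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PosMappingSketch {
    // Illustrative entries only; the real tables come from the POS mapping file.
    static final Map<String, String[]> PENN_TO_NUPOS = new HashMap<>();
    static final Map<String, String[]> MAJOR_TO_PENN = new HashMap<>();
    static {
        PENN_TO_NUPOS.put("NN", new String[] {"n1"});
        PENN_TO_NUPOS.put("NNS", new String[] {"n2"});
        PENN_TO_NUPOS.put("VBZ", new String[] {"vvz"});
        MAJOR_TO_PENN.put("noun", new String[] {"NN", "NNS"});
        MAJOR_TO_PENN.put("verb", new String[] {"VBZ"});
    }

    /**
     * Resolve a tag to candidate NUPOS tags. A Penn tag is looked up directly;
     * otherwise the tag is treated as a major class and expanded via its Penn tags.
     * Unknown tags yield an empty list.
     */
    public static List<String> toNupos(String tag) {
        List<String> result = new ArrayList<>();
        String[] direct = PENN_TO_NUPOS.get(tag);
        if (direct != null) {
            for (String nupos : direct) {
                result.add(nupos);
            }
            return result;
        }
        String[] pennTags = MAJOR_TO_PENN.get(tag);
        if (pennTags != null) {
            for (String penn : pennTags) {
                String[] nuposTags = PENN_TO_NUPOS.get(penn);
                if (nuposTags != null) {
                    for (String nupos : nuposTags) {
                        result.add(nupos);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toNupos("NNS"));
        System.out.println(toNupos("noun"));
    }
}
```

A coarse tag such as a major class yields several candidates, which is why the values are arrays rather than single strings.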

mappingFileName

protected static String mappingFileName

the Part-Of-Speech mapping file

posEntry

public edu.ucdenver.ccp.nlp.biolemmatizer.POSEntry posEntry

POSEntry object to retrieve POS tag information

Constructor Detail

BioLemmatizer

public BioLemmatizer()

Default constructor loads the lexicon from the classpath

BioLemmatizer

public BioLemmatizer(File lexiconFile)

Constructor to initialize the class fields

Parameters:: lexiconFile - a reference to the lexicon file to use. If null, the lexicon that comes with the BioLemmatizer distribution is loaded from the classpath

Method Detail

lemmatizeByLexicon

public LemmataEntry lemmatizeByLexicon(String spelling,
                                       String partOfSpeech)

Lemmatize a string with POS tag using Lexicon only

Parameters:: spelling - an input string; partOfSpeech - POS tag of the input string
Returns:: a LemmataEntry object containing lemma and POS information

lemmatizeByRules

public LemmataEntry lemmatizeByRules(String spelling,
                                     String partOfSpeech)

Lemmatize a string with POS tag using lemmatization rules only

Parameters:: spelling - an input string; partOfSpeech - POS tag of the input string
Returns:: a LemmataEntry object containing lemma and POS information

lemmatizeByLexiconAndRules

public LemmataEntry lemmatizeByLexiconAndRules(String spelling,
                                               String partOfSpeech)

Lemmatize a string with POS tag using both lexicon lookup and lemmatization rules. This is the preferred method as it gives the best lemmatization performance.

Parameters:: spelling - an input string; partOfSpeech - POS tag of the input string
Returns:: a LemmataEntry object containing lemma and POS information
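The three methods above differ only in which resources they consult. The sketch below illustrates one plausible way to combine them, assuming the lexicon is tried first and the rules act as a fallback; this page does not state the actual order. It is a toy model and does not use the BioLemmatizer classes: `LexiconThenRulesSketch`, its methods, the lexicon entries, and the suffix rules are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LexiconThenRulesSketch {
    // Toy lexicon keyed by "spelling/POS"; entries are illustrative only.
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("mice/NNS", "mouse");
        LEXICON.put("analyses/NNS", "analysis");
    }

    /** Lexicon only: returns null when the spelling/POS pair is unknown. */
    public static String byLexicon(String spelling, String pos) {
        return LEXICON.get(spelling.toLowerCase(Locale.ROOT) + "/" + pos);
    }

    /** Rules only: a few toy suffix-stripping rules selected by the POS tag. */
    public static String byRules(String spelling, String pos) {
        String w = spelling.toLowerCase(Locale.ROOT);
        if (pos.equals("NNS")) {
            if (w.endsWith("ies") && w.length() > 4) {
                return w.substring(0, w.length() - 3) + "y";
            }
            if (w.endsWith("s") && !w.endsWith("ss")) {
                return w.substring(0, w.length() - 1);
            }
        }
        if (pos.equals("VBG") && w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);
        }
        return w; // no rule applies: return the lowercased spelling unchanged
    }

    /** Assumed combination: lexicon lookup first, rules as a fallback. */
    public static String byLexiconAndRules(String spelling, String pos) {
        String lemma = byLexicon(spelling, pos);
        return lemma != null ? lemma : byRules(spelling, pos);
    }

    public static void main(String[] args) {
        System.out.println(byLexiconAndRules("mice", "NNS"));    // found in the toy lexicon
        System.out.println(byLexiconAndRules("kinases", "NNS")); // falls back to the rules
    }
}
```

In this toy model the lexicon handles irregular forms that no suffix rule recovers, while the rules cover regular forms missing from the lexicon, which is consistent with the combined method being described as the preferred one.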

main

public static void main(String[] args)

Input arguments are parsed into a BioLemmatizerCmdOpts object. Valid input arguments include:

  VAL    : Single input to be lemmatized
  VAL    : Part of speech of the single input to be lemmatized
  -f VAL : optional path to a lexicon file. If not set, the default lexicon 
           available on the classpath is used
  -i VAL : the path to the input file
  -l     : if present, only the lemma is returned (part-of-speech information is 
           suppressed)
  -o VAL : the path to the output file
  -t     : if present, the interactive mode is used

Parameters:: args - the command-line arguments listed above
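To make the option list concrete, here is a hand-rolled sketch of how such arguments could be collected into an options object. It is not the `BioLemmatizerCmdOpts` class, whose fields and parsing mechanism are not shown on this page; `CmdOptsSketch` and all of its field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CmdOptsSketch {
    public String lexiconPath;   // -f VAL
    public String inputPath;     // -i VAL
    public String outputPath;    // -o VAL
    public boolean lemmaOnly;    // -l
    public boolean interactive;  // -t
    // Positional arguments: the input string and, optionally, its POS tag.
    public final List<String> positional = new ArrayList<>();

    /** Parse the flags listed above; anything else is kept as a positional argument. */
    public static CmdOptsSketch parse(String[] args) {
        CmdOptsSketch opts = new CmdOptsSketch();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            switch (arg) {
                case "-l": opts.lemmaOnly = true; break;
                case "-t": opts.interactive = true; break;
                case "-f": opts.lexiconPath = valueAt(args, ++i, arg); break;
                case "-i": opts.inputPath = valueAt(args, ++i, arg); break;
                case "-o": opts.outputPath = valueAt(args, ++i, arg); break;
                default: opts.positional.add(arg); break;
            }
        }
        return opts;
    }

    // Returns the value following a flag, or fails if the flag is the last argument.
    private static String valueAt(String[] args, int index, String flag) {
        if (index >= args.length) {
            throw new IllegalArgumentException(flag + " requires a value");
        }
        return args[index];
    }

    public static void main(String[] args) {
        CmdOptsSketch opts = parse(new String[] {"-l", "catalyses", "NNS"});
        System.out.println(opts.lemmaOnly + " " + opts.positional);
    }
}
```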
