cc.mallet.extract
Interface Extractor

All Superinterfaces:
java.io.Serializable
All Known Implementing Classes:
ACRFExtractor, CRFExtractor

public interface Extractor
extends java.io.Serializable

Generic interface for objects that do information extraction. Typically, this means extracting database records (see Record) from Strings, but the interface is not specific to that case.


Method Summary
 Extraction extract(java.util.Iterator<Instance> source)
          Performs extraction on a set of raw documents.
 Extraction extract(java.lang.Object o)
          Performs extraction given a raw object.
 Extraction extract(Tokenization toks)
          Performs extraction from an object that has already been tokenized.
 Pipe getFeaturePipe()
          Returns the feature pipe used by this extractor.
 Alphabet getInputAlphabet()
          Returns an alphabet of the features used by the extractor.
 LabelAlphabet getTargetAlphabet()
          Returns an alphabet of the labels used by the extractor.
 Pipe getTokenizationPipe()
          Returns the pipe used by this extractor to tokenize the input.
 void setTokenizationPipe(Pipe pipe)
          Sets the pipe used by this extractor for tokenization.
 

Method Detail

extract

Extraction extract(java.lang.Object o)
Performs extraction given a raw object. The object will be passed through the Extractor's pipe.

Parameters:
o - The document to extract from (often a String).
Returns:
The results of performing extraction, as an Extraction

extract

Extraction extract(Tokenization toks)
Performs extraction from an object that has already been tokenized. This method will pass spans through the extractor's pipe.

Parameters:
toks - A tokenized document
Returns:
The results of performing extraction, as an Extraction

extract

Extraction extract(java.util.Iterator<Instance> source)
Performs extraction on a set of raw documents. The Instances output from source will be passed through both the tokenization pipe and the feature extraction pipe.

Parameters:
source - A source of raw documents
Returns:
The results of performing extraction, as an Extraction
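
The three extract overloads differ in how much preprocessing the input still needs. The sketch below illustrates one plausible arrangement, assuming that raw input passes through the tokenization pipe and then the feature pipe, while tokenized input skips the first stage. SketchExtractor and whitespaceExample are hypothetical names, and plain lists of strings stand in for Tokenization and Extraction; none of this is MALLET code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in, NOT MALLET's Extractor: plain lists of strings
// take the place of Tokenization and Extraction.
final class SketchExtractor {
    // Stand-in for the tokenization pipe: raw text -> tokens.
    private final Function<String, List<String>> tokenizationPipe;
    // Stand-in for the feature pipe: tokens -> one feature string per token.
    private final Function<List<String>, List<String>> featurePipe;

    SketchExtractor(Function<String, List<String>> tokenizationPipe,
                    Function<List<String>, List<String>> featurePipe) {
        this.tokenizationPipe = tokenizationPipe;
        this.featurePipe = featurePipe;
    }

    // Raw input: tokenize first, then continue as for tokenized input.
    List<String> extract(String raw) {
        return extract(tokenizationPipe.apply(raw));
    }

    // Already tokenized input: only the feature pipe is applied.
    List<String> extract(List<String> tokens) {
        return featurePipe.apply(tokens);
    }

    // Many raw documents: each one passes through both pipes.
    List<List<String>> extract(Iterator<String> source) {
        List<List<String>> results = new ArrayList<>();
        while (source.hasNext()) {
            results.add(extract(source.next()));
        }
        return results;
    }

    // Example wiring (assumed for illustration): whitespace tokens,
    // lower-cased word features.
    static SketchExtractor whitespaceExample() {
        return new SketchExtractor(
            raw -> Arrays.asList(raw.trim().split("\\s+")),
            tokens -> {
                List<String> features = new ArrayList<>();
                for (String token : tokens) {
                    features.add("word=" + token.toLowerCase());
                }
                return features;
            });
    }
}
```

With this wiring, calling extract on the raw string "John Smith" and on the token list [John, Smith] yields the same result, because the raw overload only adds the tokenization step in front.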

getFeaturePipe

Pipe getFeaturePipe()
Returns the feature pipe used by this extractor. The pipe takes an Instance and converts it into a form usable by the particular extraction algorithm. This pipe expects the Instance's data field to be a Tokenization. For example, pipes often perform feature extraction. The type of raw object expected by the pipe depends on the particular subclass of extractor.

Returns:
a pipe

getTokenizationPipe

Pipe getTokenizationPipe()
Returns the pipe used by this extractor to tokenize the input. The type of Instance this pipe expects is specific to the individual extractor. This pipe will return an Instance whose data is a Tokenization.

Returns:
a pipe

setTokenizationPipe

void setTokenizationPipe(Pipe pipe)
Sets the pipe used by this extractor for tokenization. The pipe should take a raw object and convert it into a Tokenization.

The pipe edu.umass.cs.mallet.base.pipe.CharSequence2TokenSequence is an example of a pipe that could be used here.
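
To make the role of a tokenization pipe concrete, here is a minimal sketch of a regex-based tokenizer that turns a raw character sequence into tokens with character offsets. SketchTokenizer and Span are hypothetical stand-ins written for illustration. They are not MALLET's Pipe or Tokenization classes, and the offsets (start inclusive, end exclusive) are an assumed convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a tokenization pipe; not a MALLET class.
final class SketchTokenizer {
    // Hypothetical token record: the matched text plus its character
    // offsets in the raw input (start inclusive, end exclusive).
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    private final Pattern tokenPattern;

    SketchTokenizer(Pattern tokenPattern) {
        this.tokenPattern = tokenPattern;
    }

    // Each regex match becomes one token; unmatched text is skipped.
    List<Span> tokenize(CharSequence raw) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = tokenPattern.matcher(raw);
        while (matcher.find()) {
            spans.add(new Span(matcher.group(), matcher.start(), matcher.end()));
        }
        return spans;
    }
}
```

Keeping the offsets alongside the token text lets a later stage map an extracted field back to its position in the original document.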


getInputAlphabet

Alphabet getInputAlphabet()
Returns an alphabet of the features used by the extractor. The alphabet maps strings describing the features to indices.

Returns:
the input alphabet
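
The mapping from feature strings to indices can be pictured with a small two-way lookup table. The sketch below is a hypothetical stand-in (SketchAlphabet, indexFor, and entryAt are names invented here) and does not reproduce MALLET's Alphabet class. It assumes indices are assigned densely in order of first appearance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an alphabet; not MALLET's Alphabet class.
final class SketchAlphabet {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    // Returns the index of a feature string, assigning the next free
    // index the first time the string is seen.
    int indexFor(String feature) {
        Integer existing = indexOf.get(feature);
        if (existing != null) {
            return existing;
        }
        int index = entries.size();
        indexOf.put(feature, index);
        entries.add(feature);
        return index;
    }

    // Reverse lookup: the feature string stored at a given index.
    String entryAt(int index) {
        return entries.get(index);
    }

    int size() {
        return entries.size();
    }
}
```

Looking up the same string twice returns the same index, so feature vectors built at different times stay aligned as long as they share one alphabet.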

getTargetAlphabet

LabelAlphabet getTargetAlphabet()
Returns an alphabet of the labels used by the extractor. Labels include entity types (such as PERSON) and slot names (such as EMPLOYEE-OF).

Returns:
the target alphabet