cc.mallet.pipe
Class Pipe

java.lang.Object
  extended by cc.mallet.pipe.Pipe
All Implemented Interfaces:
AlphabetCarrying, java.io.Serializable
Direct Known Subclasses:
AddClassifierTokenPredictions, Array2FeatureVector, AugmentableFeatureVectorAddConjunctions, AugmentableFeatureVectorLogScale, BranchingPipe, CharSequence2CharNGrams, CharSequence2TokenSequence, CharSequenceArray2TokenSequence, CharSequenceLowercase, CharSequenceRemoveHTML, CharSequenceRemoveUUEncodedBlocks, CharSequenceReplace, CharSubsequence, Classification2ConfidencePredictingFeatureVector, Clusterings2Clusterer.ClusteringPipe, ConllNer2003Sentence2TokenSequence, ConllNer2003Sentence2TokenSequence, CountMatches, CountMatchesAlignedWithOffsets, CountMatchesMatching, Csv2Array, Csv2FeatureVector, Directory2FileIterator, EnronMessage2TokenSequence, FeatureCountPipe, FeatureDocFreqPipe, FeatureSequence2AugmentableFeatureVector, FeatureSequence2FeatureVector, FeatureSequenceConvolution, FeaturesInWindow, FeaturesOfFirstMention, FeatureValueString2FeatureVector, FeatureVectorConjunctions, FeatureVectorSequence2FeatureVectors, FeatureWindow, Filename2CharSequence, FilterEmptyFeatureVectors, GenericAcrfData2TokenSequence, Input2CharSequence, InstanceListTrimFeaturesByCount, LabelsSequence2Assignment, LengthBins, LexiconMembership, LineGroupString2TokenSequence, ListMember, LongRegexMatches, MakeAmpersandXMLFriendly, Noop, OffsetConjunctions, OffsetFeatureConjunction, OffsetPropertyConjunctions, PrintInput, PrintInputAndTarget, PrintTokenSequenceFeatures, RegexMatches, RememberTokenizationPipe, SaveDataInSource, SelectiveSGML2TokenSequence, SequencePrintingPipe, SerialPipes, SGML2TokenSequence, SimpleTagger.SimpleTaggerSentence2FeatureVectorSequence, SimpleTaggerSentence2TokenSequence, SimpleTokenizer, SliceLabelsSequence, SourceLocation2TokenSequence, StringAddNewLineDelimiter, StringList2FeatureSequence, SvmLight2FeatureVectorAndLabel, Target2BIOFormat, Target2FeatureSequence, Target2Label, Target2LabelSequence, TargetRememberLastLabel, TargetStringToFeatures, TestCRF.TestCRF2String, TestCRF.TestCRFTokenSequenceRemoveSpaces, TestInstancePipe.Array2ArrayIterator, 
TestMEMM.TestMEMM2String, TestMEMM.TestMEMMTokenSequenceRemoveSpaces, TestSGML2TokenSequence.Array2ArrayIterator, Token2FeatureVector, TokenFirstPosition, TokenSequence2FeatureSequence, TokenSequence2FeatureSequenceWithBigrams, TokenSequence2FeatureVectorSequence, TokenSequence2TokenInstances, TokenSequence2Tokenization, TokenSequenceDocHeader, TokenSequenceLowercase, TokenSequenceMatchDataAndTarget, TokenSequenceNGrams, TokenSequenceParseFeatureString, TokenSequenceRemoveNonAlpha, TokenSequenceRemoveStopwords, TokenText, TokenTextCharNGrams, TokenTextCharPrefix, TokenTextCharSuffix, TokenTextNGrams, TrieLexiconMembership

public abstract class Pipe
extends java.lang.Object
implements java.io.Serializable, AlphabetCarrying

The abstract superclass of all Pipes, which transform one data type to another. Pipes are most often used for feature extraction.

Although Pipe declares no abstract methods, a usable Pipe subclass must override either the pipe method or the newIteratorFrom method. The former is appropriate when the pipe's processing of an Instance is strictly one-to-one: for every Instance coming in, exactly one Instance comes out. The latter is appropriate when the pipe's processing may result in more or fewer Instances than arrive through its source iterator.
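The choice between the two override points can be sketched in plain Java. The classes below (MiniPipe, Lowercase, SplitWords) are hypothetical stand-ins that work on strings; they are not MALLET's Pipe or Instance, and the eager buffering in SplitWords is a simplification made for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT MALLET's classes.
class OverrideChoiceSketch {

    /** Hypothetical base class with the two override points described above. */
    static abstract class MiniPipe {
        // One-to-one hook: the default is the identity transformation.
        String pipe(String inst) { return inst; }

        // Iterator hook: the default applies pipe() to each element of the source.
        Iterator<String> newIteratorFrom(Iterator<String> source) {
            return new Iterator<String>() {
                public boolean hasNext() { return source.hasNext(); }
                public String next() { return pipe(source.next()); }
            };
        }
    }

    /** One instance in, one instance out: override pipe(). */
    static class Lowercase extends MiniPipe {
        @Override String pipe(String inst) { return inst.toLowerCase(); }
    }

    /** One instance in, possibly many out: override newIteratorFrom(). */
    static class SplitWords extends MiniPipe {
        @Override Iterator<String> newIteratorFrom(Iterator<String> source) {
            // Buffers all output eagerly to keep the sketch short.
            List<String> out = new ArrayList<>();
            while (source.hasNext()) {
                for (String w : source.next().trim().split("\\s+")) {
                    if (!w.isEmpty()) out.add(w);
                }
            }
            return out.iterator();
        }
    }

    /** Collects everything an iterator produces into a list. */
    static List<String> drain(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Hello World", "MALLET Pipes");
        // Two instances in, two out.
        System.out.println(drain(new Lowercase().newIteratorFrom(input.iterator())));
        // Two instances in, four out.
        System.out.println(drain(new SplitWords().newIteratorFrom(input.iterator())));
    }
}
```

Overriding only pipe keeps the default iterator behaviour, which is why Lowercase needs no iterator code of its own.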

A pipe operates on an Instance, which is a carrier of data. A pipe reads from and writes to fields in the Instance when it is requested to process the instance. It is up to the pipe which fields in the Instance it reads from and writes to, but usually a pipe will read its input from and write its output to the "data" field of an instance.

A pipe doesn't have any direct notion of input or output - it merely modifies instances that are handed to it. A set of helper classes, which implement the interface Iterator, iterate over commonly encountered input data structures and feed the elements of these data structures to a pipe as instances.

A pipe is frequently used in conjunction with an InstanceList. As instances are added to the list, they are processed by the pipe associated with the instance list, and the processed Instance is kept in the list.

In one common usage, a FileIterator is given a list of directories to operate over. The FileIterator walks through each directory, creating an instance for each file and putting the data from the file in the data field of the instance. The directory of the file is stored in the target field of the instance. The FileIterator feeds instances to an InstanceList, which processes the instances through its associated pipe and keeps the results.
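The flow described above (an iterator creates instances, the list pushes each one through its pipe and keeps the result) might look roughly like this. MiniInstance, MiniInstanceList, fromFile and tokenize are hypothetical stand-ins, and file contents are passed in as strings so the sketch needs no real directories; MALLET's own FileIterator and InstanceList are not used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Simplified stand-ins for illustration only; NOT MALLET's Instance, FileIterator, or InstanceList.
class InstanceListSketch {

    /** Carrier with a "data" field and a "target" field. */
    static class MiniInstance {
        Object data;
        Object target;
        MiniInstance(Object data, Object target) { this.data = data; this.target = target; }
    }

    /** A list that pushes every added instance through its pipe and keeps the result. */
    static class MiniInstanceList {
        private final UnaryOperator<MiniInstance> pipe;
        private final List<MiniInstance> kept = new ArrayList<>();
        MiniInstanceList(UnaryOperator<MiniInstance> pipe) { this.pipe = pipe; }
        void add(MiniInstance inst) { kept.add(pipe.apply(inst)); }
        MiniInstance get(int i) { return kept.get(i); }
        int size() { return kept.size(); }
    }

    /** Plays the role of the file walker: data = file text, target = directory name. */
    static MiniInstance fromFile(String path, String text) {
        String directory = path.substring(0, path.lastIndexOf('/'));
        return new MiniInstance(text, directory);
    }

    /** A pipe that reads the data field and writes a token list back into it. */
    static MiniInstance tokenize(MiniInstance inst) {
        inst.data = Arrays.asList(((String) inst.data).trim().split("\\s+"));
        return inst;
    }

    public static void main(String[] args) {
        MiniInstanceList list = new MiniInstanceList(InstanceListSketch::tokenize);
        list.add(fromFile("sports/a.txt", "late goal wins match"));
        list.add(fromFile("politics/b.txt", "vote count continues"));
        for (int i = 0; i < list.size(); i++) {
            System.out.println(list.get(i).target + " -> " + list.get(i).data);
        }
    }
}
```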

Pipes can be hierarchically composed. In a typical usage, a SerialPipes is created, which holds other pipes in an ordered list. Piping an instance through a SerialPipes means piping the instance through each of the child pipes in sequence.
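A minimal sketch of this kind of composition, assuming a hypothetical string-based MiniPipe interface in place of MALLET's classes. Because the serial container is itself a pipe, it can be nested inside another one, which is what makes the composition hierarchical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for illustration only; NOT MALLET's Pipe or SerialPipes.
class SerialSketch {

    /** Hypothetical one-method pipe over strings. */
    interface MiniPipe {
        String pipe(String data);
    }

    /** Holds child pipes in order and applies them one after another. */
    static class MiniSerialPipes implements MiniPipe {
        private final List<MiniPipe> children;

        MiniSerialPipes(MiniPipe... children) {
            this.children = Arrays.asList(children);
        }

        public String pipe(String data) {
            for (MiniPipe child : children) {
                data = child.pipe(data);
            }
            return data;
        }
    }

    /** Builds a nested pipeline and runs one input through it. */
    static String run(String input) {
        // Inner composite: trim, then lowercase.
        MiniPipe clean = new MiniSerialPipes(String::trim, String::toLowerCase);
        // Outer composite: the inner composite, then replace spaces.
        MiniPipe all = new MiniSerialPipes(clean, s -> s.replace(' ', '_'));
        return all.pipe(input);
    }

    public static void main(String[] args) {
        System.out.println(run("  Hello Pipe World ")); // prints "hello_pipe_world"
    }
}
```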

A pipe holds two separate Alphabets: one for the symbols (feature names) encountered in the data fields of the instances processed through the pipe, and one for the symbols (e.g. class labels) encountered in the target fields.
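To illustrate why the two alphabets are kept separate, here is a small sketch in which feature names and class labels are indexed independently. MiniAlphabet and MiniPipe are hypothetical stand-ins; the symbol-to-index behaviour shown is an assumption about what such a dictionary typically does, not a description of MALLET's Alphabet implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only; NOT MALLET's Alphabet or Pipe.
class AlphabetSketch {

    /** Maps each distinct symbol to a stable integer index, growing on demand. */
    static class MiniAlphabet {
        private final Map<String, Integer> indexOf = new HashMap<>();
        private final List<String> symbols = new ArrayList<>();

        int lookupIndex(String symbol) {
            Integer idx = indexOf.get(symbol);
            if (idx == null) {
                idx = symbols.size();
                indexOf.put(symbol, idx);
                symbols.add(symbol);
            }
            return idx;
        }

        int size() { return symbols.size(); }
    }

    /** Holds one alphabet for data symbols and another for target symbols. */
    static class MiniPipe {
        final MiniAlphabet dataAlphabet = new MiniAlphabet();
        final MiniAlphabet targetAlphabet = new MiniAlphabet();

        int[] encodeData(String... features) {
            int[] out = new int[features.length];
            for (int i = 0; i < features.length; i++) {
                out[i] = dataAlphabet.lookupIndex(features[i]);
            }
            return out;
        }

        int encodeTarget(String label) {
            return targetAlphabet.lookupIndex(label);
        }
    }

    public static void main(String[] args) {
        MiniPipe pipe = new MiniPipe();
        System.out.println(Arrays.toString(pipe.encodeData("goal", "match", "goal"))
                + " label=" + pipe.encodeTarget("sports"));
        System.out.println(Arrays.toString(pipe.encodeData("vote", "match"))
                + " label=" + pipe.encodeTarget("politics"));
        System.out.println("features=" + pipe.dataAlphabet.size()
                + " labels=" + pipe.targetAlphabet.size());
    }
}
```

Feature index 0 and label index 0 refer to different symbols here, which is only unambiguous because each field has its own alphabet.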

Author:
Andrew McCallum mccallum@cs.umass.edu
See Also:
Serialized Form

Constructor Summary
Pipe()
          Construct a pipe with no data or target dictionaries.
Pipe(Alphabet dataDict, Alphabet targetDict)
          Construct pipe with data and target dictionaries.
 
Method Summary
 boolean alphabetsMatch(AlphabetCarrying object)
           
 Alphabet getAlphabet()
           
 Alphabet[] getAlphabets()
           
 Alphabet getDataAlphabet()
           
 java.rmi.dgc.VMID getInstanceId()
           
 Alphabet getTargetAlphabet()
           
 Instance instanceFrom(Instance inst)
           
 Instance[] instancesFrom(Instance inst)
           
 Instance[] instancesFrom(java.util.Iterator<Instance> source)
          A convenience method that will pull all instances from source through this pipe, and return the results as an array.
 boolean isDataAlphabetSet()
           
 boolean isTargetProcessing()
          Return true iff this pipe expects and processes information in the target slot.
 java.util.Iterator<Instance> newIteratorFrom(java.util.Iterator<Instance> source)
          Given an InstanceIterator, return a new InstanceIterator whose instances have also been processed by this pipe.
 Instance pipe(Instance inst)
          Really this should be 'protected', but isn't for historical reasons.
protected  void preceedingPipeDataAlphabetNotification(Alphabet a)
           
protected  void preceedingPipeTargetAlphabetNotification(Alphabet a)
           
 boolean precondition(Instance inst)
          Each instance processed is tested by this method.
 java.lang.Object readResolve()
          This gets called after readObject; it lets the object decide whether to return itself or return a previously read in version.
 void setDataAlphabet(Alphabet dDict)
           
 void setOrCheckDataAlphabet(Alphabet a)
           
 void setOrCheckTargetAlphabet(Alphabet a)
           
 void setTargetAlphabet(Alphabet tDict)
           
 void setTargetProcessing(boolean lookForAndProcessTarget)
          Set whether input is taken from target field of instance during processing.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Pipe

public Pipe()
Construct a pipe with no data or target dictionaries.


Pipe

public Pipe(Alphabet dataDict,
            Alphabet targetDict)
Construct a pipe with data and target dictionaries. Because dataDictClass and targetDictClass default to null, passing null for either argument means this pipe step will never create the corresponding dictionary.

Parameters:
dataDict - Alphabet that will be used as the data dictionary.
targetDict - Alphabet that will be used as the target dictionary.
Method Detail

precondition

public boolean precondition(Instance inst)
Each instance processed is tested by this method. If it returns false, the instance bypasses processing by this Pipe and is passed along unchanged; the default implementation returns true. Common usage is to override this method in an anonymous inner subclass of Pipe:

    SerialPipes sp = new SerialPipes (new Pipe[] {
        new CharSequence2TokenSequence() {
            // Only tokenize instances whose data field holds character data.
            public boolean precondition (Instance inst) {
                return inst.getData() instanceof CharSequence;
            }
        },
        new TokenSequence2FeatureSequence(),
    });
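The gating behaviour can be sketched without MALLET on the classpath. This sketch assumes the reading in which an instance is transformed only when precondition returns true and is otherwise handed on untouched, which is what the condition in the example suggests. MiniInstance, MiniPipe and process are hypothetical names introduced for illustration.

```java
// Simplified stand-ins for illustration only; NOT MALLET's Pipe or Instance.
class PreconditionSketch {

    /** Carrier with a single "data" field. */
    static class MiniInstance {
        Object data;
        MiniInstance(Object data) { this.data = data; }
    }

    /** Hypothetical pipe whose transformation is guarded by precondition(). */
    static abstract class MiniPipe {
        // Default: every instance qualifies for processing.
        boolean precondition(MiniInstance inst) { return true; }

        abstract MiniInstance pipe(MiniInstance inst);

        // What an iterator over this pipe would do for each instance (assumed behaviour).
        final MiniInstance process(MiniInstance inst) {
            return precondition(inst) ? pipe(inst) : inst;
        }
    }

    /** Lowercases text data and lets anything else through unchanged. */
    static MiniPipe lowercaseIfText() {
        return new MiniPipe() {
            @Override boolean precondition(MiniInstance inst) {
                return inst.data instanceof CharSequence;
            }
            @Override MiniInstance pipe(MiniInstance inst) {
                inst.data = inst.data.toString().toLowerCase();
                return inst;
            }
        };
    }

    public static void main(String[] args) {
        MiniPipe p = lowercaseIfText();
        System.out.println(p.process(new MiniInstance("ABC")).data); // prints "abc"
        System.out.println(p.process(new MiniInstance(42)).data);    // prints "42"
    }
}
```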


pipe

public Instance pipe(Instance inst)
Really this should be 'protected', but isn't for historical reasons.


newIteratorFrom

public java.util.Iterator<Instance> newIteratorFrom(java.util.Iterator<Instance> source)
Given an InstanceIterator, return a new InstanceIterator whose instances have also been processed by this pipe. If you override this method, be sure to check and obey this pipe's precondition(Instance) method.


instancesFrom

public Instance[] instancesFrom(java.util.Iterator<Instance> source)
A convenience method that will pull all instances from source through this pipe, and return the results as an array.


instancesFrom

public Instance[] instancesFrom(Instance inst)

instanceFrom

public Instance instanceFrom(Instance inst)

setTargetProcessing

public void setTargetProcessing(boolean lookForAndProcessTarget)
Set whether input is taken from target field of instance during processing. If argument is false, don't expect to find input material for the target. By default, this is true.


isTargetProcessing

public boolean isTargetProcessing()
Return true iff this pipe expects and processes information in the target slot.


getDataAlphabet

public Alphabet getDataAlphabet()

getTargetAlphabet

public Alphabet getTargetAlphabet()

getAlphabet

public Alphabet getAlphabet()
Specified by:
getAlphabet in interface AlphabetCarrying

getAlphabets

public Alphabet[] getAlphabets()
Specified by:
getAlphabets in interface AlphabetCarrying

alphabetsMatch

public boolean alphabetsMatch(AlphabetCarrying object)

setDataAlphabet

public void setDataAlphabet(Alphabet dDict)

isDataAlphabetSet

public boolean isDataAlphabetSet()

setOrCheckDataAlphabet

public void setOrCheckDataAlphabet(Alphabet a)

setTargetAlphabet

public void setTargetAlphabet(Alphabet tDict)

setOrCheckTargetAlphabet

public void setOrCheckTargetAlphabet(Alphabet a)

preceedingPipeDataAlphabetNotification

protected void preceedingPipeDataAlphabetNotification(Alphabet a)

preceedingPipeTargetAlphabetNotification

protected void preceedingPipeTargetAlphabetNotification(Alphabet a)

getInstanceId

public java.rmi.dgc.VMID getInstanceId()

readResolve

public java.lang.Object readResolve()
                             throws java.io.ObjectStreamException
This gets called after readObject; it lets the object decide whether to return itself or a previously read-in version. A hash map keyed by instance ID is used to determine whether this object has already been read in.

Returns:
this object, or the previously read-in object with the same instance ID
Throws:
java.io.ObjectStreamException
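The readResolve technique described above can be demonstrated with standard Java serialization alone. In this sketch the Node class, the SEEN map, and the use of java.util.UUID as the identifier are all assumptions made for illustration; the documented method returns a java.rmi.dgc.VMID from getInstanceId, and the actual bookkeeping inside Pipe may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; NOT MALLET's Pipe serialization code.
class ReadResolveSketch {

    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;

        // Objects already read in, keyed by their instance id (static, so not serialized).
        private static final Map<UUID, Node> SEEN = new ConcurrentHashMap<>();

        private final UUID instanceId = UUID.randomUUID();

        // Invoked by Java serialization after the object's fields have been read.
        private Object readResolve() throws ObjectStreamException {
            Node previous = SEEN.putIfAbsent(instanceId, this);
            return previous == null ? this : previous;
        }
    }

    static byte[] toBytes(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    /** Reads the same serialized object twice and reports whether both reads yield one object. */
    static boolean roundTripSharesIdentity() {
        try {
            byte[] bytes = toBytes(new Node());
            Object first = fromBytes(bytes);
            Object second = fromBytes(bytes);
            return first == second;
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripSharesIdentity()); // prints "true"
    }
}
```

Without readResolve, the two reads would produce two distinct objects. In this sketch only deserialized objects are registered, so the original in-memory Node is not shared with its copies.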