Package cc.mallet.pipe

Classes for processing arbitrary data into instances.


Class Summary
AddClassifierTokenPredictions This pipe uses a Classifier to label each token (i.e., using a 0-th order Markov assumption), then adds the predictions as features to each token.
AddClassifierTokenPredictions.TokenClassifiers This inner class represents the trained token classifiers.
Array2FeatureVector Converts a Java array of numerical types to a FeatureVector, where the Alphabet is the data array index wrapped in an Integer object.
AugmentableFeatureVectorAddConjunctions Add specified conjunctions to each instance.
AugmentableFeatureVectorLogScale Given an AugmentableFeatureVector, set those values greater than or equal to 1 to log(value)+1.
BranchingPipe Deprecated.
CharSequence2CharNGrams Transform a character sequence into a token sequence of character N grams.
CharSequence2TokenSequence Pipe that tokenizes a character sequence.
CharSequenceArray2TokenSequence Transform an array of character sequences into a token sequence.
CharSequenceLowercase Replace the data string with a lowercased version.
CharSequenceRemoveHTML This pipe removes HTML from a CharSequence.
CharSequenceReplace Given a string, repeatedly look for matches of the regex, and replace the entire match with the given replacement string.
CharSubsequence Given a string, return only the portion of the string inside a regex parenthesized group.
Classification2ConfidencePredictingFeatureVector Pipe features from the underlying classifier to the confidence prediction instance list.
Csv2Array Converts a string of comma separated values to an array.
Csv2FeatureVector Converts a string of the form feature_1:val_1 feature_2:val_2 ...
Directory2FileIterator Convert a File object representing a directory into a FileIterator which iterates over files in the directory matching a pattern and which extracts a label from each file path to become the target field of the instance.
FeatureCountPipe Pruning low-count features can be a good way to save memory and computation.
FeatureDocFreqPipe Pruning low-count features can be a good way to save memory and computation.
FeatureSequence2AugmentableFeatureVector Convert the data field from a feature sequence to an augmentable feature vector.
FeatureSequence2FeatureVector Convert the data field from a feature sequence to a feature vector.
FeatureVectorConjunctions Include in the FeatureVector conjunctions of all its features.
FeatureVectorSequence2FeatureVectors Given instances with a FeatureVectorSequence in the data field, break up the sequence into the individual FeatureVectors, producing one FeatureVector per Instance.
Filename2CharSequence Given a filename contained in a string, read the contents of the file into a CharSequence.
Input2CharSequence Pipe that can read from various kinds of text sources (either URI, File, or Reader) into a CharSequence.
InstanceListTrimFeaturesByCount Unimplemented.
MakeAmpersandXMLFriendly Convert & to &amp; in tokens of a token sequence.
Noop A pipe that does nothing to the instance fields but which has side effects on the dictionary.
Pipe The abstract superclass of all Pipes, which transform one data type to another.
PipeUtils Created: Aug 28, 2005
PrintInput Print the data field of each instance.
PrintInputAndTarget Print the data and target fields of each instance.
PrintTokenSequenceFeatures Print properties of the token sequence in the data field and the corresponding value of any token in a token sequence or feature in a feature sequence in the target field.
SaveDataInSource Set the source field of each instance to its data field.
SelectiveSGML2TokenSequence Similar to SGML2TokenSequence, except that only the tags listed in allowedTags are converted to Labels.
SerialPipes Convert an instance through a sequence of pipes.
SGML2TokenSequence Converts a string containing simple SGML tags into a data TokenSequence of words, paired with a target TokenSequence containing the SGML tags in effect for each word.
SimpleTaggerSentence2StringTokenization This extends SimpleTaggerSentence2TokenSequence to use StringTokenizations for use with the extract package.
SimpleTaggerSentence2TokenSequence Converts an external encoding of a sequence of elements with binary features to a TokenSequence.
SimpleTokenizer A simple unicode tokenizer that accepts sequences of letters as tokens.
SourceLocation2TokenSequence Read from the File or BufferedReader in the data field and produce a TokenSequence.
StringAddNewLineDelimiter Pipe that can add special text between lines to explicitly represent line breaks.
StringList2FeatureSequence Convert a list of strings into a feature sequence.
SvmLight2FeatureVectorAndLabel This Pipe converts a line in SVMLight format to a Mallet instance with FeatureVector data and Label target.
Target2FeatureSequence Convert a token sequence in the target field into a feature sequence in the target field.
Target2Label Convert object in the target field into a label in the target field.
Target2LabelSequence Convert a token sequence in the target field into a label sequence in the target field.
TargetRememberLastLabel For each position in the target, remember the last non-background label.
Token2FeatureVector Convert the property list on a token into a feature vector.
TokenSequence2FeatureSequence Convert the token sequence in the data field of each instance to a feature sequence.
TokenSequence2FeatureSequenceWithBigrams Convert the token sequence in the data field of each instance to a feature sequence that preserves bigram information.
TokenSequence2FeatureVectorSequence Convert the token sequence in the data field of each instance to a feature vector sequence.
TokenSequenceLowercase Convert the text in each token in the token sequence in the data field to lower case.
TokenSequenceMatchDataAndTarget Run a regular expression over the text of each token; replace the text with the substring matching one regex group; create a target TokenSequence from the text matching another regex group.
TokenSequenceNGrams Convert the token sequence in the data field to a token sequence of ngrams.
TokenSequenceParseFeatureString Convert the string in each field Token.text to a list of Strings (space delimited).
TokenSequenceRemoveNonAlpha Remove tokens that contain non-alphabetic characters.
TokenSequenceRemoveStopwords Remove tokens from the token sequence in the data field whose text is in the stopword list.
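
Many of the classes summarized above are meant to be applied one after another, for example lowercasing text, tokenizing it, removing stopwords, and mapping the remaining tokens to feature indices. The following is a minimal plain-Java sketch of that sequence of transformations. It does not use the MALLET API: the class TextToFeaturesSketch, its methods, and the letters-only token pattern are hypothetical names and choices made for illustration, and the real pipes operate on instance fields rather than on bare strings and lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; none of these names are MALLET classes or methods.
final class TextToFeaturesSketch {
    // Hypothetical stand-in for an alphabet: maps each distinct token to a dense index.
    private final Map<String, Integer> alphabet = new LinkedHashMap<>();

    // Step 1: lowercase the raw text (the idea described for CharSequenceLowercase).
    static String lowercase(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Step 2: split into tokens; assumes a token is a maximal run of letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{L}+").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Step 3: drop tokens whose text is in the stopword set.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Step 4: replace each token by its alphabet index, growing the alphabet as needed.
    int[] toFeatureSequence(List<String> tokens) {
        int[] indices = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            Integer idx = alphabet.get(t);
            if (idx == null) {
                idx = alphabet.size();
                alphabet.put(t, idx);
            }
            indices[i] = idx;
        }
        return indices;
    }

    // Runs the four steps in order on one piece of text.
    int[] process(String text, Set<String> stopwords) {
        return toFeatureSequence(removeStopwords(tokenize(lowercase(text)), stopwords));
    }
}
```

Keeping the alphabet as state on the object means repeated calls to process share one token-to-index mapping, which is the property a feature sequence relies on across instances.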

Exception Summary

Package cc.mallet.pipe Description

Classes for processing arbitrary data into instances. Every class in this directory should be a subclass of Pipe. Other classes should go in base.pipe.util.
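
The convention that every class here extends one abstract superclass is what lets a composite such as SerialPipes treat all steps uniformly. The sketch below illustrates that design with stand-in classes. StepSketch, LowercaseStep, WhitespaceSplitStep, and SerialStepsSketch are hypothetical names, not the MALLET Pipe or SerialPipes classes, and the single apply(Object) method is an assumed simplification of whatever the real Pipe requires of its subclasses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative stand-ins only; these are not the MALLET Pipe or SerialPipes classes.
abstract class StepSketch {
    // Each concrete step converts its input to the next representation.
    abstract Object apply(Object data);
}

// Converts a CharSequence to a lowercased String.
final class LowercaseStep extends StepSketch {
    @Override
    Object apply(Object data) {
        return ((CharSequence) data).toString().toLowerCase(Locale.ROOT);
    }
}

// Converts a String to a List<String> by splitting on runs of whitespace.
final class WhitespaceSplitStep extends StepSketch {
    @Override
    Object apply(Object data) {
        return Arrays.asList(((String) data).trim().split("\\s+"));
    }
}

// A composite step: it is itself a StepSketch and runs its children in order.
final class SerialStepsSketch extends StepSketch {
    private final List<StepSketch> steps;

    SerialStepsSketch(List<StepSketch> steps) {
        this.steps = steps;
    }

    @Override
    Object apply(Object data) {
        Object current = data;
        for (StepSketch step : steps) {
            current = step.apply(current);
        }
        return current;
    }
}
```

Because the composite shares the superclass of its children, a serial chain can be nested inside another chain without special handling.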