cc.mallet.pipe
Class FeatureCountPipe

java.lang.Object
  extended by cc.mallet.pipe.Pipe
      extended by cc.mallet.pipe.FeatureCountPipe
All Implemented Interfaces:
AlphabetCarrying, java.io.Serializable

public class FeatureCountPipe
extends Pipe

Pruning low-count features can save memory and computation. Doing so with Vectors2Vectors, however, requires writing out the unpruned instance list, reading it back into memory, collecting statistics, creating new instances, and then writing everything back out.

This class supports a simpler method that makes two passes over the data: one to collect statistics and create an augmented "stop list", and a second to actually create instances.
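
The two-pass idea can be sketched in plain Java, independent of MALLET. This is an illustration only, not this class's implementation: the class `TwoPassPruningSketch` and its methods are hypothetical, and documents are assumed to be already tokenized into lists of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of two-pass pruning; not part of MALLET.
final class TwoPassPruningSketch {

    // Pass 1: count how often each token occurs across all documents.
    static Map<String, Integer> countTokens(List<List<String>> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> document : documents) {
            for (String token : document) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Build the augmented stop list: every token seen fewer than minimumCount times.
    static Set<String> prunedTokens(Map<String, Integer> counts, int minimumCount) {
        Set<String> pruned = new HashSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() < minimumCount) {
                pruned.add(entry.getKey());
            }
        }
        return pruned;
    }

    // Pass 2: rebuild each document, skipping tokens that are on the stop list.
    static List<List<String>> filter(List<List<String>> documents, Set<String> stoplist) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> document : documents) {
            List<String> kept = new ArrayList<>();
            for (String token : document) {
                if (!stoplist.contains(token)) {
                    kept.add(token);
                }
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "cat", "sat"),
            Arrays.asList("the", "dog", "sat"));
        Set<String> stoplist = prunedTokens(countTokens(docs), 2);
        System.out.println(filter(docs, stoplist)); // prints [[the, sat], [the, sat]]
    }
}
```

Because the rare tokens are dropped before the second pass builds its output, the unpruned instances never need to be written out and read back.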

See Also:
Serialized Form

Constructor Summary
FeatureCountPipe()
           
FeatureCountPipe(Alphabet dataAlphabet, Alphabet targetAlphabet)
           
 
Method Summary
 void addPrunedWordsToStoplist(SimpleTokenizer tokenizer, int minimumCount)
          Adds all pruned words to the internal stoplist of a SimpleTokenizer.
 Alphabet getPrunedAlphabet(int minimumCount)
          Returns a new alphabet that contains only features whose count is at or above minimumCount.
 Instance pipe(Instance instance)
          Really this should be 'protected', but isn't for historical reasons.
 void writeCommonWords(java.io.File commonFile, int totalWords)
          Lists the most common words, for addition to a stop file.
 void writePrunedWords(java.io.File prunedFile, int minimumCount)
          Writes the features whose count is below the specified cutoff to the pruned file, one per line.
 
Methods inherited from class cc.mallet.pipe.Pipe
alphabetsMatch, getAlphabet, getAlphabets, getDataAlphabet, getInstanceId, getTargetAlphabet, instanceFrom, instancesFrom, instancesFrom, isDataAlphabetSet, isTargetProcessing, newIteratorFrom, preceedingPipeDataAlphabetNotification, preceedingPipeTargetAlphabetNotification, precondition, readResolve, setDataAlphabet, setOrCheckDataAlphabet, setOrCheckTargetAlphabet, setTargetAlphabet, setTargetProcessing
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FeatureCountPipe

public FeatureCountPipe()

FeatureCountPipe

public FeatureCountPipe(Alphabet dataAlphabet,
                        Alphabet targetAlphabet)
Method Detail

pipe

public Instance pipe(Instance instance)
Description copied from class: Pipe
Really this should be 'protected', but isn't for historical reasons.

Overrides:
pipe in class Pipe

getPrunedAlphabet

public Alphabet getPrunedAlphabet(int minimumCount)
Returns a new alphabet that contains only features whose count is at or above minimumCount.
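
As a rough illustration of the "at or above" boundary, the following plain-Java sketch rebuilds a feature index from per-feature counts. It is not MALLET code: `PrunedAlphabetSketch` is a hypothetical class, and an ordered `List<String>` stands in for an alphabet, with a feature's position in the list acting as its index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for alphabet pruning; not part of MALLET.
final class PrunedAlphabetSketch {

    // features.get(i) is the feature with index i; counts[i] is its total count.
    static List<String> prune(List<String> features, int[] counts, int minimumCount) {
        List<String> kept = new ArrayList<>();
        for (int index = 0; index < features.size(); index++) {
            // The boundary is inclusive: a count equal to minimumCount survives.
            if (counts[index] >= minimumCount) {
                kept.add(features.get(index));
            }
        }
        // A surviving feature's position in this list is its new index.
        return kept;
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList("the", "cat", "sat");
        int[] counts = {5, 1, 2};
        System.out.println(prune(features, counts, 2)); // prints [the, sat]
    }
}
```

Note that surviving features are renumbered: in this sketch "sat" moves from index 2 to index 1, so indices from the old and new structures are not interchangeable.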


writePrunedWords

public void writePrunedWords(java.io.File prunedFile,
                             int minimumCount)
                      throws java.io.IOException
Writes the features whose count is below the specified cutoff to the pruned file, one per line. This file can then be passed to a stopword filter as "additional stopwords".

Throws:
java.io.IOException
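
The one-feature-per-line file format can be illustrated with standard Java I/O. This is a sketch under stated assumptions rather than this method's implementation: `PrunedWordFileSketch` is hypothetical, counts are held in a plain `Map`, and the sorted output order is a choice made here for reproducibility.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of writing a pruned-word file; not part of MALLET.
final class PrunedWordFileSketch {

    // Collect features whose count is below minimumCount, in sorted order.
    static List<String> belowCutoff(Map<String, Integer> counts, int minimumCount) {
        List<String> pruned = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : new TreeMap<>(counts).entrySet()) {
            if (entry.getValue() < minimumCount) {
                pruned.add(entry.getKey());
            }
        }
        return pruned;
    }

    // Write the pruned features to the file, one per line.
    static void writePruned(Path prunedFile, Map<String, Integer> counts, int minimumCount)
            throws IOException {
        Files.write(prunedFile, belowCutoff(counts, minimumCount), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 12);
        counts.put("zygote", 1);
        counts.put("aardvark", 2);

        Path file = Files.createTempFile("pruned", ".txt");
        writePruned(file, counts, 3);
        // Reading the file back yields the extra stopwords, one per line.
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8)); // prints [aardvark, zygote]
        Files.delete(file);
    }
}
```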

addPrunedWordsToStoplist

public void addPrunedWordsToStoplist(SimpleTokenizer tokenizer,
                                     int minimumCount)
Adds all pruned words to the internal stoplist of a SimpleTokenizer.


writeCommonWords

public void writeCommonWords(java.io.File commonFile,
                             int totalWords)
                      throws java.io.IOException
Lists the most common words, for addition to a stop file.

Throws:
java.io.IOException
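
Selecting the most common words can be sketched as a sort by descending count. The code below is illustrative and not this method's implementation: `CommonWordsSketch` is hypothetical, `totalWords` is assumed to mean the number of words to list, and breaking ties alphabetically is a choice made here, not documented behavior.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of listing the most common words; not part of MALLET.
final class CommonWordsSketch {

    // Return up to totalWords words, most frequent first.
    static List<String> mostCommon(Map<String, Integer> counts, int totalWords) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey()); // ties: alphabetical
        });

        List<String> words = new ArrayList<>();
        int limit = Math.min(totalWords, entries.size());
        for (int i = 0; i < limit; i++) {
            words.add(entries.get(i).getKey());
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 12);
        counts.put("of", 9);
        counts.put("cat", 2);
        counts.put("dog", 2);
        System.out.println(mostCommon(counts, 3)); // prints [the, of, cat]
    }
}
```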