cc.mallet.pipe
Class FeatureDocFreqPipe
java.lang.Object
cc.mallet.pipe.Pipe
cc.mallet.pipe.FeatureDocFreqPipe
- All Implemented Interfaces:
- AlphabetCarrying, java.io.Serializable
public class FeatureDocFreqPipe
- extends Pipe
Pruning low-count features can be a good way to save memory and computation.
However, to prune with Vectors2Vectors, you must write out the unpruned
instance list, read it back into memory, collect statistics, create new
instances, and then write everything back out.
This class supports a simpler method that makes two passes over the data:
the first collects statistics and builds an augmented "stop list", and the
second actually creates the instances.
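The two-pass idea described above can be sketched in plain Java. This is an illustration only, not MALLET's implementation: it uses no MALLET classes, and the class name TwoPassPruningSketch, the simple letter-based tokenizer, and the helper methods are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration of two-pass pruning; it does not use MALLET classes.
class TwoPassPruningSketch {

    // Hypothetical tokenizer: lowercase the text and split on non-letters.
    static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String token : doc.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Pass 1: count how many documents contain each word, then collect the
    // words whose document proportion is strictly greater than the cutoff.
    static Set<String> buildStopList(List<String> docs, double docFrequencyCutoff) {
        Map<String, Integer> docCounts = new HashMap<>();
        for (String doc : docs) {
            // A set ensures each word is counted at most once per document.
            for (String word : new HashSet<>(tokenize(doc))) {
                docCounts.merge(word, 1, Integer::sum);
            }
        }
        Set<String> stopList = new HashSet<>();
        for (Map.Entry<String, Integer> entry : docCounts.entrySet()) {
            double proportion = (double) entry.getValue() / docs.size();
            if (proportion > docFrequencyCutoff) {
                stopList.add(entry.getKey());
            }
        }
        return stopList;
    }

    // Pass 2: tokenize again, dropping stop-listed words, to build the instances.
    static List<List<String>> buildInstances(List<String> docs, Set<String> stopList) {
        List<List<String>> instances = new ArrayList<>();
        for (String doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String word : tokenize(doc)) {
                if (!stopList.contains(word)) {
                    kept.add(word);
                }
            }
            instances.add(kept);
        }
        return instances;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the cat sat", "the dog ran", "the bird flew", "a fish swam");
        Set<String> stopList = buildStopList(docs, 0.5);
        System.out.println(stopList);
        System.out.println(buildInstances(docs, stopList));
    }
}
```

In this toy corpus only "the" appears in more than half of the documents, so it is the only word added to the stop list, and the second pass produces instances without it. Nothing is written to disk between the passes, which is the saving the class description points to.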
- See Also:
- Serialized Form
Methods inherited from class cc.mallet.pipe.Pipe
alphabetsMatch, getAlphabet, getAlphabets, getDataAlphabet, getInstanceId, getTargetAlphabet, instanceFrom, instancesFrom, instancesFrom, isDataAlphabetSet, isTargetProcessing, newIteratorFrom, preceedingPipeDataAlphabetNotification, preceedingPipeTargetAlphabetNotification, precondition, readResolve, setDataAlphabet, setOrCheckDataAlphabet, setOrCheckTargetAlphabet, setTargetAlphabet, setTargetProcessing
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
FeatureDocFreqPipe
public FeatureDocFreqPipe()
FeatureDocFreqPipe
public FeatureDocFreqPipe(Alphabet dataAlphabet, Alphabet targetAlphabet)
pipe
public Instance pipe(Instance instance)
- Description copied from class: Pipe
- Really this should be 'protected', but isn't for historical reasons.
- Overrides:
- pipe in class Pipe
addPrunedWordsToStoplist
public void addPrunedWordsToStoplist(SimpleTokenizer tokenizer, double docFrequencyCutoff)
- Add all pruned words to the internal stoplist of a SimpleTokenizer.
- Parameters:
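- docFrequencyCutoff - Remove words that occur in more than this proportion of documents. For example, a cutoff of 0.05 keeps only words with an IDF of about 3 or higher, assuming IDF is computed with the natural log (ln(1/0.05) ≈ 3.0).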
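The relationship between the cutoff and IDF can be checked with a few lines of arithmetic. The sketch below assumes IDF is defined as the natural log of the inverse document proportion, which is the reading under which 0.05 maps to roughly 3; the class and method names are hypothetical and not part of MALLET.

```java
// Plain-Java arithmetic sketch; it does not use MALLET classes.
class DocFrequencyCutoffSketch {

    // IDF of a word that occurs in the given proportion of documents,
    // assuming the natural-log definition ln(N / df) = ln(1 / proportion).
    static double idf(double docProportion) {
        return Math.log(1.0 / docProportion);
    }

    // Mirrors the documented rule: a word is pruned when its document
    // proportion is strictly greater than the cutoff.
    static boolean isPruned(int docsContainingWord, int totalDocs, double cutoff) {
        return (double) docsContainingWord / totalDocs > cutoff;
    }

    public static void main(String[] args) {
        // A word exactly at the 0.05 cutoff has IDF = ln(20), about 2.996.
        System.out.println(idf(0.05));
        // 6 of 100 documents is above the cutoff, so the word is pruned.
        System.out.println(isPruned(6, 100, 0.05));
        // 5 of 100 documents is not above the cutoff, so the word is kept.
        System.out.println(isPruned(5, 100, 0.05));
    }
}
```

Because IDF falls as document frequency rises, pruning every word above the cutoff is equivalent to keeping only words whose IDF is at or above ln(1/cutoff).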