cc.mallet.pipe
Class TokenSequenceRemoveStopwords

java.lang.Object
  extended by cc.mallet.pipe.Pipe
      extended by cc.mallet.pipe.TokenSequenceRemoveStopwords
All Implemented Interfaces:
AlphabetCarrying, java.io.Serializable

public class TokenSequenceRemoveStopwords
extends Pipe
implements java.io.Serializable

Removes from the token sequence in the data field every token whose text appears in the stopword list.
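The filtering described above can be illustrated without MALLET on the classpath. The following is a minimal sketch, not the library's implementation: the class `StopwordFilterSketch` and its method are hypothetical, tokens are modeled as plain strings rather than MALLET `Token` objects, and it assumes that case-insensitive matching lower-cases each token before looking it up in a lower-case stoplist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; it is not part of MALLET.
final class StopwordFilterSketch {

    // Returns the tokens whose text is not in the stoplist, preserving order.
    // Assumption: when caseSensitive is false, the token is lower-cased before lookup.
    static List<String> removeStopwords(List<String> tokens,
                                        Set<String> stoplist,
                                        boolean caseSensitive) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String key = caseSensitive ? token : token.toLowerCase(Locale.ROOT);
            if (!stoplist.contains(key)) {
                kept.add(token); // keep the original text, not the lookup key
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "quick", "fox", "and", "the", "dog");
        Set<String> stoplist = Set.of("the", "and");
        System.out.println(removeStopwords(tokens, stoplist, false)); // [quick, fox, dog]
        System.out.println(removeStopwords(tokens, stoplist, true));  // [The, quick, fox, dog]
    }
}
```

With case-sensitive matching, the capitalized "The" survives because only the lower-case form is in the stoplist.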

Author:
Andrew McCallum mccallum@cs.umass.edu
See Also:
Serialized Form

Constructor Summary
TokenSequenceRemoveStopwords()
           
TokenSequenceRemoveStopwords(boolean caseSensitive)
           
TokenSequenceRemoveStopwords(boolean caseSensitive, boolean markDeletions)
           
TokenSequenceRemoveStopwords(java.io.File stoplistFile, java.lang.String encoding, boolean includeDefault, boolean caseSensitive, boolean markDeletions)
          Load a stoplist from a file.
 
Method Summary
 TokenSequenceRemoveStopwords addStopWords(java.io.File wordlist)
          Add whitespace-separated tokens in file "wordlist" to the stoplist.
 TokenSequenceRemoveStopwords addStopWords(java.lang.String[] words)
           
 Instance pipe(Instance carrier)
          Really this should be 'protected', but isn't for historical reasons.
 TokenSequenceRemoveStopwords removeStopWords(java.io.File wordlist)
          Remove whitespace-separated tokens in file "wordlist" from the stoplist.
 TokenSequenceRemoveStopwords removeStopWords(java.lang.String[] words)
           
 TokenSequenceRemoveStopwords setCaseSensitive(boolean flag)
           
 TokenSequenceRemoveStopwords setMarkDeletions(boolean flag)
           
 
Methods inherited from class cc.mallet.pipe.Pipe
alphabetsMatch, getAlphabet, getAlphabets, getDataAlphabet, getInstanceId, getTargetAlphabet, instanceFrom, instancesFrom, instancesFrom, isDataAlphabetSet, isTargetProcessing, newIteratorFrom, preceedingPipeDataAlphabetNotification, preceedingPipeTargetAlphabetNotification, precondition, readResolve, setDataAlphabet, setOrCheckDataAlphabet, setOrCheckTargetAlphabet, setTargetAlphabet, setTargetProcessing
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TokenSequenceRemoveStopwords

public TokenSequenceRemoveStopwords(boolean caseSensitive,
                                    boolean markDeletions)

TokenSequenceRemoveStopwords

public TokenSequenceRemoveStopwords(boolean caseSensitive)

TokenSequenceRemoveStopwords

public TokenSequenceRemoveStopwords()

TokenSequenceRemoveStopwords

public TokenSequenceRemoveStopwords(java.io.File stoplistFile,
                                    java.lang.String encoding,
                                    boolean includeDefault,
                                    boolean caseSensitive,
                                    boolean markDeletions)
Load a stoplist from a file.

Parameters:
stoplistFile - The file to load
encoding - The encoding of the stoplist file (e.g., UTF-8)
includeDefault - Whether to include the standard MALLET English stoplist
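To make the roles of `encoding` and `includeDefault` concrete, here is a rough stdlib-only sketch of how a stoplist might be assembled from a file. It is not MALLET's code: `StoplistLoaderSketch`, `parse`, `load`, and `PLACEHOLDER_DEFAULTS` are hypothetical names, the three-word default set merely stands in for the real built-in English stoplist, and splitting the file on whitespace is an assumption.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; it is not part of MALLET.
final class StoplistLoaderSketch {

    // Placeholder for a built-in stoplist; NOT MALLET's actual default list.
    static final Set<String> PLACEHOLDER_DEFAULTS = Set.of("a", "an", "the");

    // Decodes the raw bytes with the named encoding and collects the words.
    // Assumption: words in the file are separated by whitespace.
    static Set<String> parse(byte[] raw, String encoding, boolean includeDefault) {
        Set<String> stoplist = new HashSet<>();
        if (includeDefault) {
            stoplist.addAll(PLACEHOLDER_DEFAULTS);
        }
        String content = new String(raw, Charset.forName(encoding));
        for (String word : content.split("\\s+")) {
            if (!word.isEmpty()) { // split can yield a leading empty string
                stoplist.add(word);
            }
        }
        return stoplist;
    }

    // Convenience wrapper that reads the file before parsing it.
    static Set<String> load(Path stoplistFile, String encoding, boolean includeDefault)
            throws IOException {
        return parse(Files.readAllBytes(stoplistFile), encoding, includeDefault);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("stoplist", ".txt");
        Files.write(file, "\u00fcber\nnicht  und\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(load(file, "UTF-8", true).size());  // 6
        System.out.println(load(file, "UTF-8", false).size()); // 3
        Files.delete(file);
    }
}
```

Passing the wrong encoding name would not fail here; it would silently produce different strings for non-ASCII words, which is why the encoding is an explicit parameter.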

Method Detail

setCaseSensitive

public TokenSequenceRemoveStopwords setCaseSensitive(boolean flag)

setMarkDeletions

public TokenSequenceRemoveStopwords setMarkDeletions(boolean flag)

addStopWords

public TokenSequenceRemoveStopwords addStopWords(java.lang.String[] words)

removeStopWords

public TokenSequenceRemoveStopwords removeStopWords(java.lang.String[] words)

removeStopWords

public TokenSequenceRemoveStopwords removeStopWords(java.io.File wordlist)
Remove whitespace-separated tokens in file "wordlist" from the stoplist.


addStopWords

public TokenSequenceRemoveStopwords addStopWords(java.io.File wordlist)
Add whitespace-separated tokens in file "wordlist" to the stoplist.
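Each of the add and remove methods is declared to return `TokenSequenceRemoveStopwords`, presumably so that calls can be chained. The sketch below illustrates that fluent pattern with a hypothetical `FluentStoplistSketch` class built on a plain `HashSet`. It is not MALLET's implementation, and `addFromText`, `removeFromText`, and `contains` are invented here to stand in for the `java.io.File` overloads by taking the word list's contents as a string.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; it is not part of MALLET.
final class FluentStoplistSketch {
    private final Set<String> words = new HashSet<>();

    // Adds each word and returns this object so calls can be chained.
    FluentStoplistSketch addStopWords(String[] toAdd) {
        for (String w : toAdd) {
            words.add(w);
        }
        return this;
    }

    // Removes each word and returns this object so calls can be chained.
    FluentStoplistSketch removeStopWords(String[] toRemove) {
        for (String w : toRemove) {
            words.remove(w);
        }
        return this;
    }

    // Adds whitespace-separated words, as the contents of a word-list file would be.
    FluentStoplistSketch addFromText(String text) {
        return addStopWords(splitOnWhitespace(text));
    }

    // Removes whitespace-separated words.
    FluentStoplistSketch removeFromText(String text) {
        return removeStopWords(splitOnWhitespace(text));
    }

    boolean contains(String word) {
        return words.contains(word);
    }

    private static String[] splitOnWhitespace(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        FluentStoplistSketch stoplist = new FluentStoplistSketch()
                .addStopWords(new String[] {"the", "and", "of"})
                .addFromText("um  uh\nlike")
                .removeStopWords(new String[] {"of"})
                .removeFromText("like");
        System.out.println(stoplist.contains("uh")); // true
        System.out.println(stoplist.contains("of")); // false
    }
}
```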


pipe

public Instance pipe(Instance carrier)
Description copied from class: Pipe
Really this should be 'protected', but isn't for historical reasons.

Overrides:
pipe in class Pipe