cc.mallet.pipe
Class CharSequenceRemoveHTML

java.lang.Object
  extended by cc.mallet.pipe.Pipe
      extended by cc.mallet.pipe.CharSequenceRemoveHTML
All Implemented Interfaces:
AlphabetCarrying, java.io.Serializable

public class CharSequenceRemoveHTML
extends Pipe

This pipe removes HTML markup from a CharSequence. Because the HTML is actually parsed rather than matched with a regular expression, less markup should slip through. The trade-off is that parsing is almost certainly much slower than a regular expression and may fail on malformed HTML.
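
The trade-off described above can be illustrated with a small standalone sketch. It is not the source of this class: the class name HtmlStripSketch and both helper methods are hypothetical, and the use of the JDK's Swing HTML parser (javax.swing.text.html.parser.ParserDelegator) is an assumption about one reasonable way to parse HTML, since this page does not show the implementation. The first method keeps only the text the parser reports; the second is a naive regular expression for comparison.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical illustration only; not part of MALLET.
class HtmlStripSketch {

    /** Parser-based stripping: keep only the text chunks the parser reports. */
    static String strip(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space so adjacent text nodes do not run together.
                out.append(data).append(' ');
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    /** Regex-based stripping for comparison: simple, but blind to document structure. */
    static String stripWithRegex(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(strip(html));
        System.out.println(stripWithRegex(html));
    }
}
```

The regular expression treats every `<...>` run as a tag, so it cannot distinguish markup from, say, a literal `<` inside text, whereas a parser works from the document structure at the cost of more processing per document.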

Author:
Greg Druck gdruck@cs.umass.edu
See Also:
Serialized Form

Constructor Summary
CharSequenceRemoveHTML()
           
 
Method Summary
static void main(java.lang.String[] args)
           
 Instance pipe(Instance carrier)
          This method should really be 'protected', but is public for historical reasons.
 
Methods inherited from class cc.mallet.pipe.Pipe
alphabetsMatch, getAlphabet, getAlphabets, getDataAlphabet, getInstanceId, getTargetAlphabet, instanceFrom, instancesFrom, instancesFrom, isDataAlphabetSet, isTargetProcessing, newIteratorFrom, preceedingPipeDataAlphabetNotification, preceedingPipeTargetAlphabetNotification, precondition, readResolve, setDataAlphabet, setOrCheckDataAlphabet, setOrCheckTargetAlphabet, setTargetAlphabet, setTargetProcessing
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CharSequenceRemoveHTML

public CharSequenceRemoveHTML()

Method Detail

pipe

public Instance pipe(Instance carrier)
Description copied from class: Pipe
This method should really be 'protected', but is public for historical reasons.

Overrides:
pipe in class Pipe

main

public static void main(java.lang.String[] args)