cc.mallet.extract
Class HierarchicalTokenizationFilter

java.lang.Object
  extended by cc.mallet.extract.HierarchicalTokenizationFilter
All Implemented Interfaces:
TokenizationFilter

public class HierarchicalTokenizationFilter
extends java.lang.Object
implements TokenizationFilter

Tokenization filter that will create nested spans based on a hierarchical labeling of the data. The labels should be of the form LBL1[|LBLk]*. For example,

   A   A|B   A|B|C   A|B|C  A|B  A   A
   w1  w2    w3      w4     w5   w6  w7
 
will result in LabeledSpans like <A>w1 <B>w2 <C>w3 w4</C> w5</B> w6 w7</A>. Labels of the form <B-field> force a new instance of the field to begin, even if that field is already active. The prefix I- is ignored, so BIO labeling can be used.

Created: Nov 12, 2004
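The nesting rule described above can be illustrated without MALLET. The following is a minimal sketch, not this class's actual implementation: HierarchicalLabelSketch, render, and the "O" background label are hypothetical names chosen for illustration, and the B- handling shown (the match against open fields stops at the segment carrying the prefix) is an assumption about how the rule applies to nested labels.

```java
import java.util.ArrayList;
import java.util.List;

public class HierarchicalLabelSketch {

    // Renders tokens with nested tags derived from labels of the form LBL1[|LBLk]*.
    // Hypothetical helper for illustration only; it does not use any MALLET types.
    public static String render(String[] labels, String[] tokens, String background) {
        List<String> open = new ArrayList<>(); // currently open fields, outermost first
        StringBuilder out = new StringBuilder();

        for (int i = 0; i < tokens.length; i++) {
            // A background label (assumed here) means the token belongs to no field.
            String[] path = labels[i].equals(background)
                    ? new String[0]
                    : labels[i].split("\\|");

            // Count how many leading fields of this label continue already-open spans.
            int common = 0;
            boolean matching = true;
            for (int k = 0; k < path.length; k++) {
                boolean forceNew = path[k].startsWith("B-");
                if (forceNew || path[k].startsWith("I-")) {
                    path[k] = path[k].substring(2); // strip the BIO prefix
                }
                if (matching && !forceNew && k < open.size() && open.get(k).equals(path[k])) {
                    common = k + 1;
                } else {
                    matching = false; // a B- prefix or a differing field ends the match
                }
            }

            // Close spans that do not continue, innermost first.
            for (int k = open.size() - 1; k >= common; k--) {
                out.append("</").append(open.remove(k)).append(">");
            }
            if (i > 0) {
                out.append(' ');
            }
            // Open the spans that start at this token, outermost first.
            for (int k = common; k < path.length; k++) {
                open.add(path[k]);
                out.append("<").append(path[k]).append(">");
            }
            out.append(tokens[i]);
        }

        // Close whatever is still open at the end of the sequence.
        for (int k = open.size() - 1; k >= 0; k--) {
            out.append("</").append(open.get(k)).append(">");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] labels = {"A", "A|B", "A|B|C", "A|B|C", "A|B", "A", "A"};
        String[] tokens = {"w1", "w2", "w3", "w4", "w5", "w6", "w7"};
        System.out.println(render(labels, tokens, "O"));
        // prints: <A>w1 <B>w2 <C>w3 w4</C> w5</B> w6 w7</A>
    }
}
```

Under the same assumptions, the labels B-A, I-A, B-A over tokens x, y, z render as <A>x y</A> <A>z</A>: the second B-A closes the active A field and starts a new one.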

Version:
$Id: HierarchicalTokenizationFilter.java,v 1.1 2007/10/22 21:37:44 mccallum Exp $
Constructor Summary
HierarchicalTokenizationFilter()
           
HierarchicalTokenizationFilter(java.util.regex.Pattern ignorePattern)
           
 
Method Summary
 LabeledSpans constructLabeledSpans(LabelAlphabet dict, java.lang.Object document, Label backgroundTag, Tokenization input, Sequence seq)
          Converts the sequence of labels into a set of labeled spans.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HierarchicalTokenizationFilter

public HierarchicalTokenizationFilter()

HierarchicalTokenizationFilter

public HierarchicalTokenizationFilter(java.util.regex.Pattern ignorePattern)
Method Detail

constructLabeledSpans

public LabeledSpans constructLabeledSpans(LabelAlphabet dict,
                                          java.lang.Object document,
                                          Label backgroundTag,
                                          Tokenization input,
                                          Sequence seq)
Description copied from interface: TokenizationFilter
Converts the sequence of labels into a set of labeled spans. Essentially, this converts the output of sequence labeling into an extraction output.

Specified by:
constructLabeledSpans in interface TokenizationFilter