SimpleTagger example
SimpleTagger is a command-line interface to the MALLET Conditional Random Field (CRF) class. Here we present an extremely simple example showing the use of SimpleTagger to label a sequence of text. For a more general introduction, see this tutorial on conditional random fields (http://www.cs.umass.edu/~casutton/publications/crf-tutorial.pdf) by Sutton and McCallum (2006).
Your input file should be in the following format:
Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
That is, each line represents one token, and has the format:
feature1 feature2 ... featureN label
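To make the line format concrete, here is a minimal Python sketch that splits a training line into its features and its label (the last token) and writes one back out. The helper names parse_training_line and format_training_line are hypothetical and not part of MALLET; the sketch simply skips blank lines.

```python
def parse_training_line(line):
    """Split one training line into (features, label); the label is the last token."""
    tokens = line.split()
    if not tokens:
        return None  # blank line: nothing to parse in this sketch
    return tokens[:-1], tokens[-1]


def format_training_line(features, label):
    """Join the features and the label with single spaces, label last."""
    return " ".join(list(features) + [label])


sample = """Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun"""

for line in sample.splitlines():
    parsed = parse_training_line(line)
    if parsed is not None:
        features, label = parsed
        print(label, features)
```

Note that the word itself ("Bill", "slept", "here") is just another feature; only the position at the end of the line marks the label.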
Then you can train a CRF using SimpleTagger like this (on one line):
hough@gobur:~/tagger-test$ java -cp "/home/hough/mallet/class:/home/hough/mallet/lib/mallet-deps.jar" edu.umass.cs.mallet.base.fst.SimpleTagger --train true --model-file nouncrf sample
This assumes that MALLET has been installed and built in /home/hough/mallet. Note that the classpath specifies both the MALLET build directory (/home/hough/mallet/class) and the necessary MALLET jar file (/home/hough/mallet/lib/mallet-deps.jar). The --train true option specifies that we are training, and --model-file nouncrf specifies the file that the trained CRF is written to.
This produces a trained CRF in the file "nouncrf".
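If you script these runs, the command line can be assembled programmatically. The following Python sketch is one possible way to do that, under some assumptions: the function simpletagger_command and its parameters are hypothetical, the classpath layout (class directory plus lib/mallet-deps.jar) is copied from the example command, and the ":" classpath separator is the Unix one. It uses only the options shown on this page.

```python
import shlex

# Class name taken from the example command on this page.
TAGGER_CLASS = "edu.umass.cs.mallet.base.fst.SimpleTagger"


def simpletagger_command(mallet_home, model_file, data_file, train=False):
    """Build the argument list for a SimpleTagger run (hypothetical helper)."""
    # Assumed layout: <mallet_home>/class and <mallet_home>/lib/mallet-deps.jar
    classpath = f"{mallet_home}/class:{mallet_home}/lib/mallet-deps.jar"
    cmd = ["java", "-cp", classpath, TAGGER_CLASS]
    if train:
        cmd += ["--train", "true"]
    cmd += ["--model-file", model_file, data_file]
    return cmd


cmd = simpletagger_command("/home/hough/mallet", "nouncrf", "sample", train=True)
print(shlex.join(cmd))  # the command as a single shell-quoted line
```

To actually run it, the list could be passed to subprocess.run(cmd, check=True), provided java and a built MALLET are available at the given path. Leaving train at its default omits --train true, which matches the labeling command used further down this page.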
If we have a file "stest" we would like labelled:
CAPITAL Al
slept
here
we can do this with the CRF in file nouncrf by typing:
hough@gobur:~/tagger-test$ java -cp "/home/hough/mallet/class:/home/hough/mallet/lib/mallet-deps.jar" edu.umass.cs.mallet.base.fst.SimpleTagger --model-file nouncrf stest
which produces the following output:
Number of predicates: 5
noun CAPITAL Al
non-noun slept
non-noun here
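To use the predictions in a script, the output can be parsed back into (label, features) pairs. This Python sketch assumes the output is printed one token per line with the predicted label first, followed by that token's input features, and preceded by a "Number of predicates:" header line. The function name parse_tagger_output is hypothetical.

```python
def parse_tagger_output(text):
    """Return a list of (predicted_label, features) pairs, skipping the header line."""
    results = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or line.startswith("Number of predicates:"):
            continue  # skip blank lines and the header
        results.append((tokens[0], tokens[1:]))
    return results


output = """Number of predicates: 5
noun CAPITAL Al
non-noun slept
non-noun here"""

for label, features in parse_tagger_output(output):
    # In this example the word is the last feature on each line.
    print(f"{features[-1]}\t{label}")
```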
A list of all the options available with SimpleTagger can be obtained by specifying the --help option:
hough@gobur:~/tagger-test$ java -cp "/home/hough/mallet/class:/home/hough/mallet/lib/mallet-deps.jar" edu.umass.cs.mallet.base.fst.SimpleTagger --help