Advanced Machine Learning for Language

Command-Line CRFs in GRMM

There is a command-line interface to training CRFs with arbitrary graphical structure. This document walks you through training a two-level factorial CRF. The main class for CRFs with general structure is called ACRF, which stands for abstract CRF.

Your Data Files

We assume that you have a fully observed training set containing a number of labels and a number of features. We have to index the features and labels in your data somehow, so we assume your data is split up into a sequence. We call each sequence position a time step. Each time step has a set of labels and a feature vector.

You should have training and testing files that look like this:

 LABEL11 LABEL12 ... LABEL1k ---- feature11 feature12 ...
 LABEL21 LABEL22 ... LABEL2k ---- feature21 feature22 ...
 LABEL31 LABEL32 ... LABEL3k ---- feature31 feature32 ...
 ....

That is, each line is one time step. It has a bunch of space-separated labels, followed by the special token ----, followed by the names of all the binary features that are on at that time step. Using the ---- separator allows different time steps to have different numbers of labels. (I've never actually tried this, however.)

The command-line interface supports only binary features, but the underlying inference and learning code fully supports continuous features. If you wanted to use continuous features, you'd have to make some minor changes to the command-line code, which I won't go into.
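To make the line format concrete, here is a minimal Python sketch that splits one time step into its labels and its active binary features. It is not part of GRMM; the function name parse_time_step and the example labels and feature names are made up for illustration.

```python
SEPARATOR = "----"  # special token between labels and features


def parse_time_step(line):
    """Split one data line into (labels, features).

    labels   -- list of label strings, in the order they appear
    features -- set of binary feature names that are "on" at this time step
    """
    tokens = line.split()
    if SEPARATOR not in tokens:
        raise ValueError(f"missing {SEPARATOR!r} separator: {line!r}")
    cut = tokens.index(SEPARATOR)
    labels = tokens[:cut]
    features = set(tokens[cut + 1:])  # a feature is on simply by being listed
    return labels, features


# Hypothetical time step with two labels and two active features.
labels, features = parse_time_step("B-NP NN ---- word=dog cap=false")
print(labels, sorted(features))
```

Because the split is driven by the ---- token rather than a fixed column count, the same function handles lines with different numbers of labels.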

Now, GRMM doesn't assume that there's a chain structure among the time steps unless you tell it to. If you're doing document classification, for example, each time step could be an entire document. Or if you were doing 2D classification in an image, then each "timestep" would be a node in the grid, and the feature vector would be the features local to that node. The time-step framework is just to make it easier to read in your training data, and output results.

Examples of such data files are in the distribution under data/grmm. We will be using conll2000.train1k.txt and conll2000.test1k.txt.

Templates

Parameter tying is accomplished by the notion of templates. For each training instance, the ACRF trainer creates an unrolled graph, which is the flat factor graph for p(y|x) that is defined by the CRF. (Imagine unrolling a DBN.)

A template object represents a bunch of factors that have tied parameters. Given a training instance, the template object tells the ACRF trainer which sets of variables should have factors, by adding them to an object called an UnrolledGraph. Then the ACRF training code knows to give all those factors the same parameters, because they come from the same template. For example, a linear-chain CRF has one template object that adds all of the first-order edges. (This class is ACRF.BigramTemplate.)

The major method implemented by Template is addInstantiatedCliques. This method is called by the ACRF training harness. It is expected to take an UnrolledGraph, and add all the factors that belong in the graph based on the input. These factors should be instances of ACRF.UnrolledVarSet.
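The idea of templates and unrolling can be illustrated with a toy Python sketch. This is not GRMM's API: the functions within_level_edges, between_level_edges, and unroll are hypothetical stand-ins, and variables are represented simply as (time step, label level) pairs. The sketch assumes a model with two label levels, with chain edges inside each level and one edge joining the two levels at every time step.

```python
def within_level_edges(level):
    """Toy template: first-order (bigram) cliques inside one label level."""
    def cliques(num_steps):
        return [((t, level), (t + 1, level)) for t in range(num_steps - 1)]
    return cliques


def between_level_edges(level_a, level_b):
    """Toy template: one clique per time step joining two label levels."""
    def cliques(num_steps):
        return [((t, level_a), (t, level_b)) for t in range(num_steps)]
    return cliques


def unroll(templates, num_steps):
    """Build a toy unrolled graph for one instance.

    Returns a dict mapping template index -> list of cliques. Every clique
    in the same list would share one set of parameters (parameter tying).
    """
    return {idx: tmpl(num_steps) for idx, tmpl in enumerate(templates)}


templates = [
    within_level_edges(0),
    within_level_edges(1),
    between_level_edges(0, 1),
]
unrolled = unroll(templates, num_steps=3)
for idx, cliques in unrolled.items():
    print(idx, cliques)
```

The number of cliques grows with the length of the instance, but the number of parameter sets stays fixed at one per template.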

To create your own models, the main thing you need to do is create your subclass of ACRF.Template. A convenient class to subclass is ACRF.SequenceTemplate. See ACRF.BigramTemplate for an example.

The Command-Line Interface

The command-line interface is handled through GenericAcrfTui. Its most important parameters are

 --training    Name of the training file
 --testing     Name of the testing file
 --model-file  A text file specifying which template objects to use.

As an example, we'll train a factorial DCRF on the CoNLL 2000 data, as in Sutton, Rohanimanesh, and McCallum (ICML 2004). (Before doing any of this, you need to build GRMM using make.)

First, set up your data files. Toy data subsets are provided in data/grmm/conll2000.train1k.txt and data/grmm/conll2000.test1k.txt.

Second, create a text file that describes the templates we want to use. We want three templates: one for the linear-chain edges across NP, one for the linear-chain edges across POS, and one for the in-between edges. So create a text file called tmpls.txt, with three lines:

 new ACRF.BigramTemplate (0)
 new ACRF.BigramTemplate (1)
 new ACRF.PairwiseFactorTemplate (0,1)
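Before training, it can help to check the template file against your data. The Python sketch below is not part of GRMM. It assumes that the numbers in parentheses are zero-based positions of labels within a time step, which is an interpretation of the example above rather than something this page states. The helper names template_indices and check_templates are hypothetical.

```python
import re

TEMPLATE_LINES = [
    "new ACRF.BigramTemplate (0)",
    "new ACRF.BigramTemplate (1)",
    "new ACRF.PairwiseFactorTemplate (0,1)",
]


def template_indices(line):
    """Extract the integer arguments from one template line."""
    match = re.search(r"\(([^)]*)\)", line)
    if match is None:
        raise ValueError(f"no argument list found in: {line!r}")
    return [int(tok) for tok in match.group(1).split(",")]


def check_templates(template_lines, num_labels):
    """Raise ValueError if a template refers to a label position that
    does not exist (assumption: indices are zero-based label positions)."""
    for line in template_lines:
        for idx in template_indices(line):
            if not 0 <= idx < num_labels:
                raise ValueError(f"label index {idx} out of range in: {line!r}")
    return True


# A two-level model has two labels per time step.
print(check_templates(TEMPLATE_LINES, num_labels=2))
```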

Third, train and test the ACRF. The following command line will do it:

 java -cp $GRMM/class:$GRMM/lib/mallet-deps.jar:$GRMM/lib/grmm-deps.jar \
   edu.umass.cs.mallet.grmm.learning.GenericAcrfTui \
   --training $GRMM/data/grmm/conll2000.train1k.txt \
   --testing  $GRMM/data/grmm/conll2000.test1k.txt \
   --model-file tmpls.txt > stdout.txt 2> stderr.txt

This TUI will print lots of diagnostic information, and save a serialized copy of the trained model.
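If you prefer to launch the run from a script, the same invocation can be assembled in Python. This is only a sketch: build_command and run_training are hypothetical helper names, and the classpath layout simply mirrors the directories named in the command above. Only the three flags documented on this page are used.

```python
import os
import subprocess

MAIN_CLASS = "edu.umass.cs.mallet.grmm.learning.GenericAcrfTui"


def build_command(grmm_root, training, testing, model_file):
    """Assemble the java argument list for GenericAcrfTui."""
    classpath = os.pathsep.join([
        os.path.join(grmm_root, "class"),
        os.path.join(grmm_root, "lib", "mallet-deps.jar"),
        os.path.join(grmm_root, "lib", "grmm-deps.jar"),
    ])
    return [
        "java", "-cp", classpath, MAIN_CLASS,
        "--training", training,
        "--testing", testing,
        "--model-file", model_file,
    ]


def run_training(cmd, stdout_path="stdout.txt", stderr_path="stderr.txt"):
    """Run the command, sending diagnostics to two files; returns the exit code."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        return subprocess.run(cmd, stdout=out, stderr=err).returncode


# Example argument list; the GRMM location is a placeholder path.
cmd = build_command(
    "/path/to/grmm",
    "/path/to/grmm/data/grmm/conll2000.train1k.txt",
    "/path/to/grmm/data/grmm/conll2000.test1k.txt",
    "tmpls.txt",
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues, such as a stray space after a line-continuation backslash.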

Retrieved from "http://mallet.cs.umass.edu/index.php/Command-Line_CRFs_in_GRMM"

This page has been accessed 11911 times. This page was last modified 22:31, 3 Jul 2006.

