MAchine Learning for LanguagE Toolkit

Document Classification with Generalized Expectation (GE)

In this tutorial we describe training maximum entropy document classifiers with generalized expectation (GE) constraints that specify affinities between words and labels. See [Druck, Mann, and McCallum 2008] for more information. We assume that the task is classifying baseball and hockey documents and that we have processed data sets baseball-hockey.train.vectors and baseball-hockey.test.vectors.
GE training requires unlabeled training data. We can hide labels using Vectors2Vectors.
java cc.mallet.classify.tui.Vectors2Vectors \
--input baseball-hockey.train.vectors \
--output baseball-hockey.unlabeled.vectors \
--hide-labels

User-provided Constraints

Suppose we know a priori that the words baseball and puck are good indicators of labels baseball and hockey respectively. Specifically, suppose that we estimate that 90% of the documents in which the word puck occurs should be labeled hockey, and similarly for baseball. We may specify these constraints in a file as follows.
baseball hockey:0.1 baseball:0.9
puck hockey:0.9 baseball:0.1
The general format for a constraints file is:
feature_name label_name:probability label_name:probability ...
The number of probabilities must be equal to the number of labels. The feature and label names must match the names in the data and target alphabets exactly.
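To make the format concrete, the following Python sketch parses constraint lines written in the label:probability form used in the example above and checks that each line carries exactly one probability per label. The function name parse_constraints and the sum-to-one check are assumptions made for illustration; they are not part of MALLET, which performs its own parsing.

```python
def parse_constraints(lines, labels):
    """Parse lines such as 'puck hockey:0.9 baseball:0.1' into a dict.

    Hypothetical helper for illustration only; not a MALLET API.
    """
    constraints = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        feature, *pairs = line.split()
        dist = {}
        for pair in pairs:
            # Split on the last colon so the label name is kept whole.
            label, prob = pair.rsplit(":", 1)
            dist[label] = float(prob)
        # One probability per label, with names matching exactly.
        if set(dist) != set(labels):
            raise ValueError(f"{feature}: labels {sorted(dist)} do not match {sorted(labels)}")
        # Assumption: the targets for a feature form a probability distribution.
        if abs(sum(dist.values()) - 1.0) > 1e-6:
            raise ValueError(f"{feature}: probabilities do not sum to 1")
        constraints[feature] = dist
    return constraints


example = [
    "baseball hockey:0.1 baseball:0.9",
    "puck hockey:0.9 baseball:0.1",
]
constraints = parse_constraints(example, labels=["baseball", "hockey"])
print(constraints["puck"])
```

A check like this can catch a misspelled label name before training, since the names must match the target alphabet exactly.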
The following command trains a MaxEnt classifier using the above constraints (assumed to be in the file baseball-hockey.constraints). We specify the constraints file using constraintsFile and a regularization penalty using gaussianPriorVariance.
mallet train-classifier \
--training-file   baseball-hockey.unlabeled.vectors \
--testing-file    baseball-hockey.test.vectors \
--trainer "MaxEntGETrainer,gaussianPriorVariance=0.1,constraintsFile=\"baseball-hockey.constraints\"" \
--report test:accuracy
Below, we discuss machine-assisted methods for obtaining constraints.

User-provided Labeled Features

Rather than specifying the target expectations directly, we may instead specify "labels" for features, and have these converted into target expectations. Suppose we know that the word puck is associated with hockey, and the word baseball is associated with the label baseball. We may specify these labeled features in a file (baseball-hockey.labeled_features) as follows.
baseball baseball
puck hockey
The general format for a file with labeled features is:
feature_name label_name label_name ...
Vectors2FeatureConstraints can estimate target expectations from a file with labeled features. A simple heuristic for obtaining expectations from labeled features is to uniformly divide a constant probability mass among the labels for a feature. By default, a probability of 0.9 is allocated to the labels for a feature. This estimation method is selected by passing heuristic to the targets command option.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.labeled_features \
--targets heuristic 
The option majority-prob can be used to specify a value other than 0.9. We can use the constraints file baseball-hockey.constraints to perform GE training as above.
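The arithmetic behind the heuristic can be sketched in Python as follows. The function heuristic_targets is a hypothetical illustration, not MALLET code. The text above states only that the majority mass is divided uniformly among the labels given for a feature; spreading the remaining mass uniformly over the other labels, and falling back to a uniform distribution when a feature is labeled with every label, are assumptions of this sketch.

```python
def heuristic_targets(feature_labels, all_labels, majority_prob=0.9):
    """Convert the labels given for one feature into target expectations.

    Hypothetical sketch of the heuristic; not a MALLET API.
    """
    if not feature_labels:
        raise ValueError("a labeled feature needs at least one label")
    n_in = len(feature_labels)
    n_out = len(all_labels) - n_in
    if n_out == 0:
        # Assumption: a feature labeled with every label gets a uniform target.
        return {label: 1.0 / n_in for label in all_labels}
    targets = {}
    for label in all_labels:
        if label in feature_labels:
            # Majority mass is shared equally by the feature's labels.
            targets[label] = majority_prob / n_in
        else:
            # Assumption: the remaining mass is shared equally by the other labels.
            targets[label] = (1.0 - majority_prob) / n_out
    return targets


# The labeled feature 'puck hockey' in the two-class task.
print(heuristic_targets(["hockey"], ["baseball", "hockey"]))
```

With two classes and the default of 0.9, this gives roughly 0.9 for hockey and 0.1 for baseball, the same values as the hand-written constraint for puck earlier in the tutorial.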

Machine-provided Candidate Features

We may obtain a set of candidate features for which constraints may be expressed using the Latent Dirichlet Allocation (LDA) based method of [Druck, Mann, and McCallum 2008].
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.features \
--feature-selection lda \
--lda-file baseball-hockey.train.lda \
--targets none \
--num-constraints 10 
The lda-file is a serialized LDA model file. See the topic modeling tutorial for more information. Setting targets to none tells Vectors2FeatureConstraints to output candidate features only. baseball-hockey.features will then contain a list of ten candidate features, one per line.
The above method is unsupervised (i.e., it does not look at the true labels). We can also select candidate features using an "oracle" information gain method (infogain) that looks at the true labels. (Note that when true labels are used to obtain constraints, baseball-hockey.train.vectors, rather than baseball-hockey.unlabeled.vectors, must be used as input.)
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.features \
--feature-selection infogain \
--targets none \
--num-constraints 10
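
As a rough illustration of what information gain measures, the following Python sketch ranks words by how much knowing whether a word occurs in a document reduces uncertainty about the document's label. It uses the textbook definition on a toy data set; the function names and the toy documents are made up for this example, and MALLET's infogain implementation may differ in its details.

```python
import math
from collections import Counter


def entropy(label_counts):
    """Entropy in bits of a label distribution given as counts."""
    total = sum(label_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in label_counts.values() if c > 0)


def information_gain(docs, feature):
    """docs is a list of (set_of_words, label) pairs."""
    all_labels = Counter(label for _, label in docs)
    with_f = Counter(label for words, label in docs if feature in words)
    without_f = Counter(label for words, label in docs if feature not in words)
    gain = entropy(all_labels)
    for part in (with_f, without_f):
        size = sum(part.values())
        if size:
            # Subtract the entropy remaining after splitting on the feature.
            gain -= (size / len(docs)) * entropy(part)
    return gain


# Toy labeled documents (invented for illustration).
docs = [
    ({"puck", "ice"}, "hockey"),
    ({"puck", "goal"}, "hockey"),
    ({"pitcher", "bat"}, "baseball"),
    ({"pitcher", "goal"}, "baseball"),
]
candidates = ["goal", "bat", "puck"]
ranked = sorted(candidates, key=lambda f: information_gain(docs, f), reverse=True)
print(ranked)
```

In this toy set, puck separates the two classes perfectly and so ranks first, while goal occurs equally often in both classes and carries no information about the label.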

Machine-provided Target Expectations

Given a set of candidate features, we may estimate constraints using two methods. The first method is to have the machine label the features (by revealing the true labels and using the method of [Druck, Mann, and McCallum 2008]), and convert these labels into expectations using the same heuristic as above.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.features \
--targets heuristic
Note that if the candidate features are also machine-provided, we may perform both steps at the same time using, for example, the command:
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--feature-selection lda \
--lda-file baseball-hockey.train.lda \
--num-constraints 10 \
--targets heuristic
The second method is to compute the exact target expectations from the labeled data. The targets option for this method is oracle.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.features \
--targets oracle
Note that when using heuristic targets, the machine may discard candidate features in the labeling process (cf. [Druck, Mann, and McCallum 2008]). However, the machine does not discard any candidate features when using --targets oracle.
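
One way to read the oracle targets is as the empirical label distribution over the labeled documents in which a feature occurs. The Python sketch below computes this on toy data. It is an interpretation for illustration only: the function oracle_targets and the documents are hypothetical, and the sketch counts feature presence, whereas MALLET may weight documents differently (for example, by feature counts).

```python
from collections import Counter


def oracle_targets(docs, feature, labels):
    """Label distribution among documents that contain the feature.

    docs is a list of (set_of_words, label) pairs. Hypothetical helper,
    not a MALLET API.
    """
    counts = Counter(label for words, label in docs if feature in words)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"feature {feature!r} does not occur in any document")
    return {label: counts[label] / total for label in labels}


# Toy labeled documents (invented for illustration).
docs = [
    ({"puck", "ice"}, "hockey"),
    ({"puck", "goal"}, "hockey"),
    ({"puck", "stick"}, "hockey"),
    ({"puck", "bat"}, "baseball"),
    ({"pitcher", "bat"}, "baseball"),
]
targets = oracle_targets(docs, "puck", ["baseball", "hockey"])
print(targets)
```

Here puck occurs in four documents, three of them labeled hockey, so the resulting target gives hockey a probability of 0.75 and baseball 0.25.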