Document Classification with Generalized Expectation (GE)
In this tutorial we describe training maximum entropy document classifiers with generalized
expectation (GE) constraints that specify affinities between words and labels. See
[Druck, Mann, and McCallum 2008] for more information. We assume that the task is classifying
baseball and hockey documents and that we have processed data sets baseball-hockey.train.vectors
and baseball-hockey.test.vectors.
GE training requires unlabeled training data. We can hide labels using Vectors2Vectors.
java cc.mallet.classify.tui.Vectors2Vectors \
--input baseball-hockey.train.vectors \
--output baseball-hockey.unlabeled.vectors \
--hide-labels
User-provided Constraints
Suppose we know a priori that the words baseball and puck are good indicators of the labels
baseball and hockey, respectively. Specifically, suppose we estimate that 90% of the documents in
which the word puck occurs should be labeled hockey, and similarly for baseball. We may specify
these constraints in a file as follows.
baseball hockey:0.1 baseball:0.9
puck hockey:0.9 baseball:0.1
The general format for a constraints file is:
feature_name label_name:probability label_name:probability ...
The number of probabilities must be equal to the number of labels. The feature and label names must match the names in the data and target alphabets exactly.
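Before training, it can help to sanity-check a constraints file. The following Python sketch is not part of MALLET: it parses lines in the label:probability form used by the example above and checks that each feature has one probability per label. The function name parse_constraints is hypothetical, and the check that the probabilities sum to 1 is an added assumption, not a rule stated in this tutorial.

```python
def parse_constraints(text, labels):
    """Parse lines of the form 'feature label:prob label:prob ...'.

    Hypothetical helper for illustration; it does not call MALLET.
    """
    constraints = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        feature, pairs = fields[0], fields[1:]
        probs = {}
        for pair in pairs:
            # Split on the last colon so the label keeps any earlier colons.
            label, _, value = pair.rpartition(":")
            if label not in labels:
                raise ValueError(f"line {line_no}: unknown label {label!r}")
            probs[label] = float(value)
        # The tutorial requires one probability per label.
        if set(probs) != set(labels):
            raise ValueError(f"line {line_no}: expected one probability per label")
        # Assumption: the targets form a probability distribution.
        if abs(sum(probs.values()) - 1.0) > 1e-6:
            raise ValueError(f"line {line_no}: probabilities do not sum to 1")
        constraints[feature] = probs
    return constraints


EXAMPLE = """\
baseball hockey:0.1 baseball:0.9
puck hockey:0.9 baseball:0.1
"""

constraints = parse_constraints(EXAMPLE, labels=["baseball", "hockey"])
print(constraints["puck"])
```

This sketch does not check that feature names exist in the data alphabet; doing so would require reading the vectors file itself.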
The following command trains a MaxEnt classifier using the above constraints (assumed to be in the
file baseball-hockey.constraints). We specify the constraints file using constraintsFile and a
regularization penalty using gaussianPriorVariance.
mallet train-classifier \
--training-file baseball-hockey.unlabeled.vectors \
--testing-file baseball-hockey.test.vectors \
--trainer "MaxEntGETrainer,gaussianPriorVariance=0.1,constraintsFile=\"baseball-hockey.constraints\"" \
--report test:accuracy
Below, we discuss machine-assisted methods for obtaining constraints.
User-provided Labeled Features
Rather than specifying the target expectations directly, we may instead specify "labels" for
features and have these converted into target expectations. Suppose we know that the word puck is
associated with the label hockey, and the word baseball is associated with the label baseball. We
may specify these labeled features in a file (baseball-hockey.labeled_features) as follows.
baseball baseball
puck hockey
The general format for a file with labeled features is:
feature_name label_name label_name ...
Vectors2FeatureConstraints can estimate target expectations from a file with labeled features. A
simple heuristic for obtaining expectations from labeled features is to uniformly divide a constant
probability mass among the labels for a feature. By default, 0.9 probability is allocated to the
labels for a feature. This estimation method is selected by passing heuristic to the targets
command option.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.labeled_features \
--targets heuristic
The option majority-prob can be used to specify a value other than 0.9. We can use the constraints
file baseball-hockey.constraints to perform GE training as above.
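To make the heuristic concrete, here is a minimal Python sketch of the arithmetic. It is an illustration, not MALLET's code. The tutorial only states that the majority mass (0.9 by default) is divided uniformly among a feature's labels; spreading the remaining mass uniformly over the other labels, and falling back to a uniform distribution when a feature is assigned every label, are assumptions made here. The function name heuristic_targets is hypothetical.

```python
def heuristic_targets(feature_labels, all_labels, majority_prob=0.9):
    """Convert the labels assigned to one feature into target expectations."""
    assigned = set(feature_labels)
    if not assigned:
        raise ValueError("a labeled feature needs at least one label")
    n_other = len(all_labels) - len(assigned)
    if n_other == 0:
        # Assumption: a feature labeled with every class carries no
        # preference, so use a uniform distribution.
        return {label: 1.0 / len(all_labels) for label in all_labels}
    targets = {}
    for label in all_labels:
        if label in assigned:
            # Majority mass is shared equally by the feature's labels.
            targets[label] = majority_prob / len(assigned)
        else:
            # Assumption: the remaining mass is shared by the other labels.
            targets[label] = (1.0 - majority_prob) / n_other
    return targets


labels = ["baseball", "hockey"]
for feature, feature_labels in [("baseball", ["baseball"]), ("puck", ["hockey"])]:
    print(feature, heuristic_targets(feature_labels, labels))
```

With two labels and the default of 0.9, this yields targets of about 0.9 and 0.1, matching the hand-written constraints in the first section.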
Machine-provided Candidate Features
We may obtain a set of candidate features for which constraints may be expressed using
the Latent Dirichlet Allocation (LDA) based method of
[Druck, Mann, and McCallum 2008].
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.features \
--feature-selection lda \
--lda-file baseball-hockey.train.lda \
--targets none \
--num-constraints 10
The lda-file is a serialized LDA model file. See the topic modeling tutorial for more information.
Setting targets to none tells Vectors2FeatureConstraints to output candidate features only. The
file baseball-hockey.features will then contain a list of ten candidate features, one per line.
The above method is unsupervised (i.e., it does not look at the true labels). We can also select
candidate features using an "oracle" information gain method (infogain) that looks at the true
labels. (Note that when using true labels to obtain constraints, baseball-hockey.train.vectors,
rather than baseball-hockey.unlabeled.vectors, must be used.)
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.features \
--feature-selection infogain \
--targets none \
--num-constraints 10
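As an illustration of what information gain measures, the Python sketch below ranks words by how much knowing whether a word occurs in a document reduces uncertainty about the document's label. It uses the textbook definition with binary word presence on a toy corpus invented for this example; MALLET's own infogain computation may differ in details such as how feature values are handled.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, word):
    """Label entropy minus the expected entropy after splitting on `word`.

    docs is a list of word sets; labels is a parallel list of label names.
    """
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    gain = entropy(Counter(labels).values())
    for subset in (with_word, without_word):
        if subset:
            weight = len(subset) / len(labels)
            gain -= weight * entropy(Counter(subset).values())
    return gain


# Toy corpus (hypothetical data, not from the baseball-hockey set).
docs = [{"puck", "ice"}, {"puck", "goal"}, {"bat", "pitch"}, {"bat", "goal"}]
labels = ["hockey", "hockey", "baseball", "baseball"]

vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary, key=lambda w: information_gain(docs, labels, w), reverse=True)
print(ranked[:2])  # the two highest-scoring candidate features
```

In this toy corpus, puck and bat separate the two classes perfectly and rank first, while goal occurs equally in both classes and has zero gain.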
Machine-provided Target Expectations
Given a set of candidate features, we may estimate constraints using two methods. The first method is to have the machine label the features (by revealing the true labels and using the method of
[Druck, Mann, and McCallum 2008]), and convert these labels into expectations using the same
heuristic as above.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.features \
--targets heuristic
Note that if the candidate features are also machine-provided, we may perform both steps at the
same time using, for example, the command:
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--feature-selection lda \
--lda-file baseball-hockey.train.lda \
--num-constraints 10 \
--targets heuristic
The second method is to use the exact target expectations computed from the labeled data. The
targets option to do this is oracle.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.features \
--targets oracle
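One plausible reading of "exact target expectations" is the empirical label distribution over the labeled documents that contain a feature. The Python sketch below computes that quantity on a toy corpus invented for this example. It is an assumption about what the oracle option estimates, not a description of MALLET's implementation, which may, for instance, weight documents by feature counts or apply smoothing.

```python
from collections import Counter


def oracle_targets(docs, labels, word, all_labels):
    """Fraction of documents containing `word` that carry each label.

    docs is a list of word sets; labels is a parallel list of label names.
    """
    counts = Counter(y for d, y in zip(docs, labels) if word in d)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"feature {word!r} does not occur in the data")
    return {label: counts[label] / total for label in all_labels}


# Toy corpus (hypothetical data, not from the baseball-hockey set).
docs = [{"puck", "ice"}, {"puck", "goal"}, {"bat", "pitch"}, {"bat", "goal"}]
labels = ["hockey", "hockey", "baseball", "baseball"]
all_labels = ["baseball", "hockey"]

for word in ["puck", "goal"]:
    targets = oracle_targets(docs, labels, word, all_labels)
    # Print in the constraints-file layout: feature label:prob label:prob
    print(word, " ".join(f"{label}:{p}" for label, p in targets.items()))
```

Unlike the heuristic, these targets come directly from label counts, so a word such as goal that is spread evenly across classes receives an uninformative 0.5/0.5 target rather than being assigned a majority label.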
Note that when using heuristic targets, the machine may discard candidate features in the labeling
process (cf. [Druck, Mann, and McCallum 2008]). However, the machine does not discard any candidate
features when using --targets oracle.