Use Mallet 2.0.7 or greater for this code.
The implementation of GE training of MaxEnt models in pre-2.0.7 versions of Mallet contains a bug that often results in low accuracy when the number of constraints is small. Specifically,
the Gaussian prior was not always being included in the objective function value, which caused problems in numerical optimization.
(Published experiments, i.e. [Druck, Mann, and McCallum 2008], used a different implementation and are not affected by this bug.)
To report problems with this code (including obtaining unexpected results), please contact gdruck@cs.umass.edu.
Document Classification with Expectation Constraints
In this tutorial we describe training maximum entropy document classifiers with expectation constraints that specify affinities between words and labels. See [Druck, Mann, and McCallum 2008] for more information. We assume that the task is classifying baseball and hockey documents and that we have processed data sets baseball-hockey.train.vectors and baseball-hockey.test.vectors.
These methods require unlabeled training data. We can hide the labels of an existing data set using Vectors2Vectors.
java cc.mallet.classify.tui.Vectors2Vectors \
--input baseball-hockey.train.vectors \
--output baseball-hockey.unlabeled.vectors \
--hide-targets
If the data is truly unlabeled, then the easiest way to import it is to assign an arbitrary label to each document, ensuring that each label is used at least once.
Generalized Expectation
Suppose we know a priori that the words baseball and puck are good indicators of the labels baseball and hockey, respectively. Specifically, suppose that we estimate that 90% of the documents in which the word puck occurs should be labeled hockey, and similarly for baseball. We may specify these constraints in a file as follows.
baseball hockey:0.1 baseball:0.9
puck hockey:0.9 baseball:0.1
The general format for a constraints file is:
feature_name label_name:probability label_name:probability ...
The number of probabilities must be equal to the number of labels. The feature and label names must match the names in the data and target alphabets exactly.
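As a sanity check before training, a constraints file can be validated with a few lines of Python. The sketch below is not Mallet's parser; it assumes the colon-separated form used in the example above, and the function name parse_constraints and the sum-to-one check are our own additions.

```python
def parse_constraints(lines, labels):
    """Parse lines of the form 'feature label:prob label:prob ...'.

    Hypothetical helper for checking a constraints file by hand;
    it is not part of Mallet.
    """
    constraints = {}
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        feature, pairs = fields[0], fields[1:]
        probs = {}
        for pair in pairs:
            label, _, value = pair.rpartition(":")
            if label not in labels:
                raise ValueError(f"line {line_no}: unknown label {label!r}")
            probs[label] = float(value)
        # One probability per label, as the tutorial requires.
        if set(probs) != set(labels):
            raise ValueError(f"line {line_no}: need one probability per label")
        # Assumption: targets should form a distribution over labels.
        if abs(sum(probs.values()) - 1.0) > 1e-6:
            raise ValueError(f"line {line_no}: probabilities do not sum to 1")
        constraints[feature] = probs
    return constraints
```

Calling parse_constraints(open("baseball-hockey.constraints"), ["baseball", "hockey"]) would then either return a dictionary of targets or raise an error that names the offending line.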
The following command trains a MaxEnt classifier with the above constraints (assumed to be in file baseball-hockey.constraints) using Generalized Expectation (GE), as described in [Druck, Mann, and McCallum 2008]. We specify the constraints file using constraintsFile and specify a regularization penalty with gaussianPriorVariance.
mallet train-classifier \
--training-file baseball-hockey.unlabeled.vectors \
--testing-file baseball-hockey.test.vectors \
--trainer "MaxEntGETrainer,gaussianPriorVariance=0.1,constraintsFile=\"baseball-hockey.constraints\"" \
--report test:accuracy
L2 Penalty
By default, the difference between the target and model expectations is penalized using KL divergence (as in [Druck, Mann, and McCallum 2008]). Instead, we can impose an L2 penalty using the L2 option.
mallet train-classifier \
--training-file baseball-hockey.unlabeled.vectors \
--testing-file baseball-hockey.test.vectors \
--trainer "MaxEntGETrainer,gaussianPriorVariance=0.1,L2=true,constraintsFile=\"baseball-hockey.constraints\"" \
--report test:accuracy
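To make the difference between the two penalties concrete, the following Python sketch evaluates both for a single constraint. It is an illustration only, not Mallet's code: it assumes the KL divergence is taken from the target distribution to the model distribution, that the L2 penalty is the squared Euclidean distance, and that all model probabilities are positive. The function names are ours.

```python
import math

def kl_penalty(target, model):
    """KL(target || model) for two distributions over the same labels.

    Assumes every model probability is positive, which holds for
    MaxEnt predictions.
    """
    return sum(t * math.log(t / m) for t, m in zip(target, model) if t > 0)

def l2_penalty(target, model):
    """Squared Euclidean distance between the two distributions."""
    return sum((t - m) ** 2 for t, m in zip(target, model))

target = [0.9, 0.1]   # target expectation for the feature "puck"
model = [0.5, 0.5]    # hypothetical model expectation early in training
print(kl_penalty(target, model), l2_penalty(target, model))
```

Both penalties are zero when the model expectation matches the target. The KL penalty grows without bound as the model assigns vanishing probability to a label with nonzero target mass, whereas the L2 penalty stays bounded.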
API
The underlying trainer is cc.mallet.classify.MaxEntGETrainer. New GE constraints and penalties for training MaxEnt models can be defined by implementing cc.mallet.classify.constraints.ge.MaxEntGEConstraint.
Generalized Expectation with Target Ranges
It is also possible to specify L2 constraints that do not impose a penalty if the model expectation is within some target range. For example, we can encourage model expectations to be in the range 90-100%.
baseball baseball:0.9,1
hockey hockey:0.9,1
In general, the format for range constraints is:
feature_name label_name:lower_probability,upper_probability ...
Support for such constraints is provided by MaxEntGERangeTrainer.
mallet train-classifier \
--training-file baseball-hockey.unlabeled.vectors \
--testing-file baseball-hockey.test.vectors \
--trainer "MaxEntGERangeTrainer,gaussianPriorVariance=0.1,constraintsFile=\"baseball-hockey.range_constraints\"" \
--report test:accuracy
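The behavior of a range constraint can be sketched as a penalty that is zero inside the target range and quadratic outside it. The Python below is an illustration under that assumption (squared distance to the nearest bound); the exact form used by MaxEntGERangeTrainer may differ, and range_penalty is a name of our own.

```python
def range_penalty(expectation, lower, upper):
    """Zero inside [lower, upper]; squared distance to the nearest bound outside."""
    if expectation < lower:
        return (lower - expectation) ** 2
    if expectation > upper:
        return (expectation - upper) ** 2
    return 0.0

# With the constraint "baseball baseball:0.9,1":
print(range_penalty(0.95, 0.9, 1.0))  # inside the range, no penalty
print(range_penalty(0.50, 0.9, 1.0))  # below the range, penalized
```

Compared with a point target, a range leaves the model free to settle anywhere in the interval, which can be useful when the exact target value is uncertain.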
API
The underlying trainer is cc.mallet.classify.MaxEntGERangeTrainer. New GE constraints and penalties for training MaxEnt models can be defined by implementing cc.mallet.classify.constraints.ge.MaxEntGEConstraint.
Posterior Regularization
There is also support for training MaxEnt models with Posterior Regularization (PR) [Ganchev, Graça, Gillenwater, and Taskar 2010]. The following command trains a MaxEnt classifier using the above constraints (assumed to be in file baseball-hockey.constraints) with PR for 100 iterations. We specify the constraints file using constraintsFile and specify a regularization penalty for each step (cf. [Bellare, Druck, and McCallum 2009]) with pGaussianPriorVariance and qGaussianPriorVariance.
mallet train-classifier \
--training-file baseball-hockey.unlabeled.vectors \
--testing-file baseball-hockey.test.vectors \
--trainer "MaxEntPRTrainer,minIterations=100,maxIterations=100,pGaussianPriorVariance=0.1,qGaussianPriorVariance=1000,constraintsFile=\"baseball-hockey.constraints\"" \
--report test:accuracy
API
The underlying trainer is cc.mallet.classify.MaxEntPRTrainer. New PR constraints and penalties for training MaxEnt models can be defined by implementing cc.mallet.classify.constraints.pr.MaxEntPRConstraint.
Automated Methods for Obtaining Constraints
Below, we discuss machine-assisted methods for obtaining constraints. Note that these methods do not yet support target ranges.
User-provided Labeled Features
Rather than specifying the target expectations directly, we may instead specify "labels" for features, and have these converted into target expectations. Suppose we know that the word puck is associated with the label hockey, and the word baseball is associated with the label baseball. We may specify these labeled features in a file (baseball-hockey.labeled_features) as follows.
baseball baseball
puck hockey
The general format for a file with labeled features is:
feature_name label_name label_name ...
Vectors2FeatureConstraints can estimate target expectations from a file with labeled features. A simple heuristic for obtaining expectations from labeled features is to uniformly divide a constant probability mass among the labels for a feature. By default, 0.9 probability is allocated to the labels for a feature. This estimation method is selected by passing heuristic to the targets command option.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.labeled_features \
--targets heuristic
The option majority-prob can be used to specify a value other than 0.9. We can use the constraints file baseball-hockey.constraints to perform GE training as above.
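The heuristic can be reproduced by hand to see what targets a labeled feature will receive. The Python sketch below is not Mallet's implementation: it divides majority_prob uniformly among the labels given for the feature, as described above, and assumes that the remaining mass is divided uniformly among the other labels. The function name heuristic_targets is ours.

```python
def heuristic_targets(feature_labels, all_labels, majority_prob=0.9):
    """Convert the labels given for one feature into target expectations."""
    chosen = [label for label in all_labels if label in feature_labels]
    others = [label for label in all_labels if label not in feature_labels]
    if not chosen:
        raise ValueError("feature has no known labels")
    if not others:
        # Every label was given for the feature: fall back to uniform.
        return {label: 1.0 / len(chosen) for label in chosen}
    targets = {label: majority_prob / len(chosen) for label in chosen}
    # Assumption: the leftover mass is shared equally by the other labels.
    for label in others:
        targets[label] = (1.0 - majority_prob) / len(others)
    return targets

print(heuristic_targets({"hockey"}, ["baseball", "hockey"]))
```

For the labeled feature "puck hockey" this gives hockey a target of 0.9 and baseball the remainder, which matches the hand-written constraints file shown earlier.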
Machine-provided Candidate Features
We may obtain a set of candidate features for which constraints may be expressed using the Latent Dirichlet Allocation (LDA) based method of [Druck, Mann, and McCallum 2008].
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.features \
--feature-selection lda \
--lda-file baseball-hockey.train.lda \
--targets none \
--num-constraints 10
The lda-file is a serialized LDA model file. See the topic modeling tutorial for more information. Setting targets to none tells Vectors2FeatureConstraints to output candidate features only. baseball-hockey.features will then contain a list of ten candidate features, one per line.
The above method is unsupervised (i.e. it does not look at the true labels). We can also select candidate features using an "oracle" information gain method (infogain) that looks at the true labels. (Note that when using the true labels to obtain constraints, baseball-hockey.train.vectors, rather than baseball-hockey.unlabeled.vectors, must be used.)
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.features \
--feature-selection infogain \
--targets none \
--num-constraints 10
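For intuition about what information gain selection rewards, the following Python sketch scores a feature by how much knowing its presence reduces the entropy of the label. It is a textbook formulation over binary feature presence, not Mallet's implementation, which may compute the quantity differently; all names are ours.

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy in bits of the distribution given by raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, feature):
    """docs: list of sets of words; labels: parallel list of label names."""
    with_f = Counter(y for d, y in zip(docs, labels) if feature in d)
    without_f = Counter(y for d, y in zip(docs, labels) if feature not in d)
    gain = entropy(Counter(labels).values())
    for part in (with_f, without_f):
        size = sum(part.values())
        if size:
            gain -= size / len(docs) * entropy(part.values())
    return gain

docs = [{"puck", "ice"}, {"puck"}, {"bat"}, {"bat", "ice"}]
labels = ["hockey", "hockey", "baseball", "baseball"]
ranked = sorted({"puck", "ice", "bat"},
                key=lambda f: information_gain(docs, labels, f), reverse=True)
print(ranked[0] in {"puck", "bat"})
```

In this toy data set, puck and bat separate the labels perfectly and receive the maximum gain, while ice occurs equally under both labels and receives none.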
Machine-provided Target Expectations
Given a set of candidate features, we may estimate constraints using two methods. The first method is to have the machine label the features (by revealing the true labels and using the method of [Druck, Mann, and McCallum 2008]), and convert these labels into expectations using the same heuristic as above.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.features \
--targets heuristic
Note that if the candidate features are also machine-provided, we may perform both steps at the
same time using, for example, the command:
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--feature-selection lda \
--lda-file baseball-hockey.train.lda \
--num-constraints 10 \
--targets heuristic
The second method is to use the exact target expectations computed from the labeled data. The targets option to do this is oracle.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.features \
--targets oracle
Note that when using heuristic targets, the machine may discard candidate features in the labeling process (cf. [Druck, Mann, and McCallum 2008]). However, the machine does not discard any candidate features when using --targets oracle.
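The idea behind oracle targets can be sketched in Python as follows. This is an illustration, not Mallet's code: it assumes that the target expectation for a feature is the distribution of true labels among the documents that contain the feature, in line with the "90% of the documents in which the word puck occurs" reading used earlier. The function name oracle_targets is ours.

```python
from collections import Counter

def oracle_targets(docs, labels, feature, all_labels):
    """docs: list of sets of words; labels: parallel list of true labels."""
    counts = Counter(y for d, y in zip(docs, labels) if feature in d)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"feature {feature!r} does not occur in the data")
    # Fraction of the documents containing the feature that carry each label.
    return {label: counts[label] / total for label in all_labels}

docs = [{"puck"}] * 9 + [{"puck", "bat"}]
labels = ["hockey"] * 9 + ["baseball"]
print(oracle_targets(docs, labels, "puck", ["baseball", "hockey"]))
```

Because these targets come from the true labels, they are mainly useful for experiments that measure how well training could do with ideal constraints.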
Tips
- For GE training, a gaussianPriorVariance of 1 is a reasonable default choice.
- For PR training, in our experience large values for qGaussianPriorVariance and small values for pGaussianPriorVariance work best.
- The command line interfaces only provide basic functionality. In some cases it may be necessary to tweak the optimization code (for example, by setting convergence tolerances or step sizes) in order to obtain good results.
- As a rule of thumb, try to specify a set of constraints that is balanced among labels and covers many documents.
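The last rule of thumb can be checked before training with a short script. The Python sketch below is a hypothetical helper, not part of Mallet: it reports the fraction of documents that contain at least one constrained feature, and counts how many constraints favor each label, taking a constraint's favored label to be the one with the highest target probability.

```python
from collections import Counter

def constraint_summary(docs, constraints):
    """docs: list of sets of words; constraints: {feature: {label: prob}}."""
    covered = sum(1 for d in docs if any(f in d for f in constraints))
    favored = Counter(max(p, key=p.get) for p in constraints.values())
    return covered / len(docs), dict(favored)

docs = [{"puck", "ice"}, {"puck"}, {"baseball"}, {"bat"}]
constraints = {
    "baseball": {"hockey": 0.1, "baseball": 0.9},
    "puck": {"hockey": 0.9, "baseball": 0.1},
}
coverage, per_label = constraint_summary(docs, constraints)
print(coverage, per_label)
```

Low coverage, or a count that is heavily skewed toward one label, suggests adding constraints before training.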