SimpleTaggerWithConstraints is a command line interface for training linear chain CRFs with expectation constraints and unlabeled data. It is very similar to SimpleTagger, described here. If the data is truly unlabeled, then the easiest way to import it is to assign an arbitrary label to each token, ensuring that each label is
used at least once.
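As a sketch of that import step (the class, file layout, and label names here are illustrative, not part of Mallet), unlabeled text can be converted to SimpleTagger format by cycling arbitrary labels over the tokens so that every label appears at least once:

```java
import java.util.ArrayList;
import java.util.List;

public class DummyLabeler {
    // Assigns labels round-robin so each label in `labels` is used at least
    // once (assuming there are at least as many tokens as labels).
    static List<String> tagTokens(List<String> tokens, List<String> labels) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            // SimpleTagger format: features first, label last on each line
            lines.add(tokens.get(i) + " " + labels.get(i % labels.size()));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : tagTokens(List.of("the", "cat", "sat"),
                                     List.of("O", "NOUN"))) {
            System.out.println(line); // prints: "the O", "cat NOUN", "sat O"
        }
    }
}
```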

Mallet CRFs can be trained with expectation constraints using Generalized Expectation (GE). For example, parameters can be estimated to match prior distributions over labels for
particular words.
For more information, see the reference below. The new implementation (added 11/29/10) uses a new algorithm (see Chapter 6) that is O(NL^2) (where L is the number of labels and N is the sequence length) for both one and two state constraints (rather than O(NL^3) and O(NL^4)).

See also the tutorial for training MaxEnt models with expectation constraints.

Generalized Expectation Criteria for Semi-Supervised Learning of Conditional Random Fields. Gideon Mann and Andrew McCallum. ACL 2008.

To train a CRF with expectation constraints using GE, specify `--learning ge` when running SimpleTaggerWithConstraints.
Available constraint violation penalties include `--penalty kl` for KL divergence and `--penalty l2` for L2. Note that when
using a KL divergence penalty, the constraint must specify a complete target label distribution. SimpleTaggerWithConstraints currently
does not support transition (two label) constraints.
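To make the two penalty types concrete, here is a self-contained sketch (not Mallet code; the class and method names are illustrative) of the KL divergence and squared L2 distance between a target distribution and the model's expectation:

```java
public class PenaltyDemo {
    // KL(target || expectation) = sum_i t_i * log(t_i / e_i).
    // Requires a complete target distribution (entries summing to 1),
    // and assumes expectation entries are positive.
    static double kl(double[] target, double[] expectation) {
        double d = 0.0;
        for (int i = 0; i < target.length; i++) {
            if (target[i] > 0) {
                d += target[i] * Math.log(target[i] / expectation[i]);
            }
        }
        return d;
    }

    // Squared L2 distance: sum_i (t_i - e_i)^2. Unlike KL, this is also
    // usable when only some entries of the target are specified.
    static double l2sq(double[] target, double[] expectation) {
        double d = 0.0;
        for (int i = 0; i < target.length; i++) {
            double diff = target[i] - expectation[i];
            d += diff * diff;
        }
        return d;
    }

    public static void main(String[] args) {
        double[] t = {0.9, 0.1};
        double[] e = {0.5, 0.5};
        System.out.println(kl(t, e));   // positive, since t != e
        System.out.println(l2sq(t, e)); // 0.4^2 + 0.4^2 = 0.32
    }
}
```

Both penalties are zero exactly when the model expectation matches the target.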
```
java cc.mallet.fst.semi_supervised.tui.SimpleTaggerWithConstraints \
  --train true --test lab --penalty kl --learning ge \
  --threads 4 --orders 0,1 \
  train test constraints
```

Here `train` and `test` contain the training and testing data in SimpleTagger format. The format of the constraints file is:

```
feature_name label_name=probability label_name=probability ...
```

or, when using target ranges instead of values (currently only compatible with `--learning ge --penalty l2`):

```
feature_name label_name=lower_probability,upper_probability ...
```
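For illustration (this parser is a sketch, not part of Mallet), a line in the value-based constraints format can be parsed into a map from label names to target probabilities:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConstraintLineParser {
    // Parses "feature_name label_name=prob label_name=prob ..." into a
    // map of label -> target probability; parts[0] is the feature name.
    static Map<String, Double> parseTargets(String line) {
        String[] parts = line.trim().split("\\s+");
        Map<String, Double> targets = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=");
            targets.put(kv[0], Double.parseDouble(kv[1]));
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, Double> t = parseTargets("bank B-LOC=0.6 O=0.4");
        System.out.println(t); // {B-LOC=0.6, O=0.4}
    }
}
```

Remember that with `--penalty kl` the probabilities on each line must form a complete distribution over the labels.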

```java
// Assumes `constraintsMap` maps feature indices to target label distributions (double[]).
OneLabelKLGEConstraints geConstraints = new OneLabelKLGEConstraints();
for (int featureIndex : constraintsMap.keySet()) {
  geConstraints.addConstraint(featureIndex, constraintsMap.get(featureIndex), weight);
}
ArrayList<GEConstraint> constraintsList = new ArrayList<GEConstraint>();
constraintsList.add(geConstraints);
```

```java
int numThreads = 1;
CRFTrainerByGE trainer = new CRFTrainerByGE(crf, constraintsList, numThreads);
trainer.setGaussianPriorVariance(gaussianPriorVariance);
trainer.train(unlabeled, Integer.MAX_VALUE);
```

Notes:

- The labels of the unlabeled data are never considered by the code, so the targets for unlabeled instances may be present (so that `TransducerEvaluator`s can use them) or `null`.
- If using this method with no labeled data, use a CRF with dense weights and fully connected transitions.
- The built-in `GEConstraint`s use constraint features that are binary and normalized by the total count of the input feature. This means the targets and expectations are probability distributions. However, constraint features that are not binary or normalized can be created by implementing a new `GEConstraint`.
- The included two label constraints disregard the transition into the first position to avoid complications with the start state.
- The `StateLabelMap` maps between CRF states and labels. In most cases, a default one-to-one `StateLabelMap` is sufficient; this type of map is created by default by `CRFTrainerByGE`. However, a custom `StateLabelMap` can be specified using the `setStateLabelMap` method of `CRFTrainerByGE`.
- If using a special CRF start state that is not included in the label set, create a `StateLabelMap`, call `addStartState` with the state index of the start state, and pass this mapping to `CRFTrainerByGE` using `setStateLabelMap`.
- In some cases it may be necessary to tweak the optimization code (for example, by setting convergence tolerances or step sizes) in order to obtain good results.
- As a rule of thumb, try to specify a set of constraints that is balanced among labels and covers many tokens.
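Since the built-in `GEConstraint`s expect targets that are probability distributions, one simple way to produce them (a sketch with hypothetical variable names, not Mallet code) is to normalize raw per-label counts for a feature:

```java
public class TargetDistribution {
    // Normalizes raw per-label counts into a probability distribution,
    // suitable as a target for a built-in GE constraint.
    static double[] normalize(double[] counts) {
        double total = 0.0;
        for (double c : counts) total += c;
        double[] dist = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            dist[i] = counts[i] / total;
        }
        return dist;
    }

    public static void main(String[] args) {
        // e.g. a word observed 30 times with label 0 and 10 times with label 1
        double[] dist = normalize(new double[]{30, 10});
        System.out.println(dist[0] + " " + dist[1]); // 0.75 0.25
    }
}
```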

Mallet CRFs can also be trained with expectation constraints and unlabeled data using Posterior Regularization (PR). For example, parameters can be estimated to match prior distributions over labels for particular words.
For more information, see [Bellare, Druck, and McCallum 2009] and [Ganchev, Graça, Gillenwater, and Taskar 2010].
See also the tutorial for training MaxEnt models with expectation constraints.

To train a CRF with expectation constraints using PR, specify `--learning pr` when running SimpleTaggerWithConstraints.
Currently only `--penalty l2` is available and range constraints are not supported.
```
java cc.mallet.fst.semi_supervised.tui.SimpleTaggerWithConstraints \
  --train true --test lab --penalty l2 --learning pr \
  --threads 4 --orders 0,1 \
  train test constraints
```

Here `train` and `test` contain the training and testing data in SimpleTagger format. The format of the constraints file is:

```
feature_name label_name=probability label_name=probability ...
```

```java
// Assumes `constraintsMap` maps feature indices to target label distributions (double[]).
OneLabelL2PRConstraints prConstraints = new OneLabelL2PRConstraints();
for (int featureIndex : constraintsMap.keySet()) {
  prConstraints.addConstraint(featureIndex, constraintsMap.get(featureIndex), weight);
}
ArrayList<PRConstraint> constraintsList = new ArrayList<PRConstraint>();
constraintsList.add(prConstraints);
```

```java
int numThreads = 1;
CRFTrainerByPR trainer = new CRFTrainerByPR(crf, constraintsList, numThreads);
trainer.setPGaussianPriorVariance(gaussianPriorVariance);
trainer.train(unlabeled, 100, 100);
```

Notes:

- The current implementation only supports fully connected finite state machines.
- In some cases it may be necessary to tweak the optimization code (by for example setting convergence tolerances, step sizes, number of iterations) in order to obtain good results.
- As a rule of thumb, try to specify a set of constraints that is balanced among labels and covers many tokens.
- For PR training, in our experience large values for the constraint `weight` and small values for `pGaussianPriorVariance` work best.

This semi-supervised learning method aims to maximize the conditional log-likelihood of labeled data while minimizing the conditional entropy of the model's predictions on unlabeled data. For more information, see the following papers:

Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling. Feng Jiao, Shaojun Wang, Chi-Hoon Lee, Russell Greiner, and Dale Schuurmans. ACL 2006.

Efficient Computation of Entropy Gradient for Semi-Supervised Conditional Random Fields. Gideon Mann and Andrew McCallum. HLT/NAACL 2007.

Mallet includes an implementation of Entropy Regularization (ER) for training CRFs. The implementation is based on the O(nS^2) algorithm of [Mann and McCallum 07], where n is the sequence length and S is the number of states. As in [Jiao et al. 06], the Mallet implementation uses the maximum likelihood parameter estimate as a starting point for optimizing the complete objective function. The weight of the ER term in the objective function can be set using the `setEntropyWeight` method in the `CRFTrainerByEntropyRegularization` class.
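The quantity being regularized is the Shannon entropy of the model's predicted label distributions on unlabeled data, scaled by the ER weight. A toy sketch (not Mallet code) of the entropy for a single position:

```java
public class EntropyTerm {
    // Shannon entropy H(p) = -sum_i p_i * ln(p_i) of a predicted label
    // distribution (terms with p_i = 0 contribute nothing).
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0) h -= pi * Math.log(pi);
        }
        return h;
    }

    public static void main(String[] args) {
        // A confident prediction has lower entropy than an uncertain one,
        // so minimizing entropy on unlabeled data pushes the model toward
        // confident predictions there.
        System.out.println(entropy(new double[]{0.99, 0.01}));
        System.out.println(entropy(new double[]{0.5, 0.5})); // ln 2, about 0.693
    }
}
```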

Example code:

```java
CRFTrainerByEntropyRegularization trainer = new CRFTrainerByEntropyRegularization(crf);
trainer.setEntropyWeight(gamma);
trainer.setGaussianPriorVariance(sigma);
trainer.addEvaluator(eval);
trainer.train(trainingData, unlabeledData, Integer.MAX_VALUE);
```

Notes:

- You must use the method `train(InstanceList trainingData, InstanceList unlabeledData, int numIterations)` to perform training.
- Labeled data is only used in the likelihood term, and unlabeled data is only used in the ER term. This means the labels of the unlabeled data are never considered by the code, so the targets for unlabeled instances may be present (so that `TransducerEvaluator`s can use them) or `null`.
- In our experience, the performance of this method is highly dependent on the weighting factor. We have often observed ER decrease performance because the entropy term dominates the objective function (or gradient).