cc.mallet.classify
Class FeatureConstraintUtil

java.lang.Object
  extended by cc.mallet.classify.FeatureConstraintUtil

public class FeatureConstraintUtil
extends java.lang.Object

Utility functions for creating feature constraints that can be used with GE training.

Author:
Gregory Druck gdruck@cs.umass.edu

Constructor Summary
FeatureConstraintUtil()
           
 
Method Summary
static double[][] getFeatureLabelCounts(InstanceList list, boolean useValues)
           
static java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labelFeatures(InstanceList list, java.util.ArrayList<java.lang.Integer> features)
           
static java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labelFeatures(InstanceList list, java.util.ArrayList<java.lang.Integer> features, boolean reject)
          Label features using heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.
static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFile(java.lang.String filename, InstanceList data)
          Reads feature constraints from a file, whether they are stored using Strings or indices.
static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFileIndex(java.lang.String filename, InstanceList data)
          Reads feature constraints stored using indices from a file.
static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFileString(java.lang.String filename, InstanceList data)
          Reads feature constraints stored using strings from a file.
static java.util.HashMap<java.lang.Integer,double[][]> readRangeConstraintsFromFile(java.lang.String filename, InstanceList data)
          Reads range constraints stored using strings from a file.
static java.util.ArrayList<java.lang.Integer> selectFeaturesByInfoGain(InstanceList list, int numFeatures)
          Select features with the highest information gain.
static java.util.ArrayList<java.lang.Integer> selectTopLDAFeatures(int numSelFeatures, ParallelTopicModel lda, Alphabet alphabet)
          Select top features in LDA topics.
static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingData(InstanceList list, java.util.ArrayList<java.lang.Integer> features)
           
static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingData(InstanceList list, java.util.ArrayList<java.lang.Integer> features, boolean normalize)
           
static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingData(InstanceList list, java.util.ArrayList<java.lang.Integer> features, boolean useValues, boolean normalize)
          Set target distributions using estimates from data.
static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingFeatureVoting(java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labeledFeatures, InstanceList trainingData)
          Set target distributions using feature voting heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.
static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingHeuristic(java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labeledFeatures, int numLabels, double majorityProb)
          Set target distributions using "Schapire" heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FeatureConstraintUtil

public FeatureConstraintUtil()
Method Detail

readRangeConstraintsFromFile

public static java.util.HashMap<java.lang.Integer,double[][]> readRangeConstraintsFromFile(java.lang.String filename,
                                                                                           InstanceList data)
Reads range constraints stored using strings from a file. Format can be either:
feature_name (label_name:lower_probability,upper_probability)+
or
feature_name (label_name:probability)+
Constraints are only added for feature-label pairs that are present.

Parameters:
filename - File with feature constraints.
data - InstanceList used for alphabets.
Returns:
Constraints.
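
The line format above can be illustrated with a small standalone parser. This is only a sketch and not the MALLET implementation: the class and method names are hypothetical, splitting entries on whitespace is an assumption, and treating a single probability as the degenerate range [p, p] is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MALLET.
final class RangeConstraintLineSketch {

    /**
     * Parses the label entries of one constraint line into a map from
     * label name to {lower, upper}. The first token is the feature name.
     */
    static Map<String, double[]> parseRanges(String line) {
        // Assumption: entries are separated by whitespace.
        String[] tokens = line.trim().split("\\s+");
        Map<String, double[]> ranges = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in entry " + tokens[i]);
            }
            String label = tokens[i].substring(0, colon);
            String[] bounds = tokens[i].substring(colon + 1).split(",");
            double lower = Double.parseDouble(bounds[0]);
            // Assumption: a single probability p stands for the range [p, p].
            double upper = bounds.length > 1 ? Double.parseDouble(bounds[1]) : lower;
            ranges.put(label, new double[] {lower, upper});
        }
        return ranges;
    }

    public static void main(String[] args) {
        Map<String, double[]> r = parseRanges("goal hockey:0.7,0.9 baseball:0.1");
        for (Map.Entry<String, double[]> e : r.entrySet()) {
            System.out.println(e.getKey() + " -> [" + e.getValue()[0] + ", " + e.getValue()[1] + "]");
        }
    }
}
```

In the real method, feature and label names would be resolved to indices through the alphabets of the supplied InstanceList; the sketch keeps the names as strings to stay self-contained.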

readConstraintsFromFile

public static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFile(java.lang.String filename,
                                                                                    InstanceList data)
Reads feature constraints from a file, whether they are stored using Strings or indices.

Parameters:
filename - File with feature constraints.
data - InstanceList used for alphabets.
Returns:
Constraints.

readConstraintsFromFileString

public static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFileString(java.lang.String filename,
                                                                                          InstanceList data)
Reads feature constraints stored using strings from a file. Format: feature_name (label_name:probability)+ Labels that do not appear get probability 0.

Parameters:
filename - File with feature constraints.
data - InstanceList used for alphabets.
Returns:
Constraints.
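
As a rough illustration of the string format, the sketch below turns one line into a target array in which unlisted labels keep probability 0. It is not the MALLET code: the class name, the method name, and the plain Map used in place of the label alphabet are all hypothetical, and whitespace-separated entries are an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of MALLET.
final class StringConstraintLineSketch {

    /**
     * Parses "feature_name label:prob label:prob ..." into a target array
     * indexed by label. labelIndex stands in for the label alphabet and is
     * assumed to map each label name to an index in 0..size-1.
     */
    static double[] parseTarget(String line, Map<String, Integer> labelIndex) {
        String[] tokens = line.trim().split("\\s+");
        // Java zero-initializes the array, so unlisted labels stay at 0.
        double[] target = new double[labelIndex.size()];
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in entry " + tokens[i]);
            }
            Integer li = labelIndex.get(tokens[i].substring(0, colon));
            if (li == null) {
                throw new IllegalArgumentException("Unknown label in entry " + tokens[i]);
            }
            target[li] = Double.parseDouble(tokens[i].substring(colon + 1));
        }
        return target;
    }

    public static void main(String[] args) {
        Map<String, Integer> labels = new HashMap<>();
        labels.put("hockey", 0);
        labels.put("baseball", 1);
        labels.put("politics", 2);
        double[] t = parseTarget("puck hockey:0.9 baseball:0.1", labels);
        System.out.println(java.util.Arrays.toString(t));
    }
}
```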

readConstraintsFromFileIndex

public static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFileIndex(java.lang.String filename,
                                                                                         InstanceList data)
Reads feature constraints stored using indices from a file. Format: feature_index label_0_prob label_1_prob ... label_n_prob Here each label must appear.

Parameters:
filename - File with feature constraints.
data - InstanceList used for alphabets.
Returns:
Constraints.
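
The index format can be sketched in the same way. The helper below is hypothetical (it is not the MALLET implementation) and assumes whitespace-separated fields; it rejects lines that do not supply one probability per label, matching the requirement that each label must appear.

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical helper, not part of MALLET.
final class IndexConstraintLineSketch {

    /**
     * Parses "feature_index p_0 p_1 ... p_n" into a (feature index, target) pair.
     * numLabels is the number of labels, so the line must have numLabels + 1 fields.
     */
    static Map.Entry<Integer, double[]> parseLine(String line, int numLabels) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length != numLabels + 1) {
            throw new IllegalArgumentException(
                "Expected " + numLabels + " probabilities but found " + (tokens.length - 1));
        }
        int featureIndex = Integer.parseInt(tokens[0]);
        double[] target = new double[numLabels];
        for (int li = 0; li < numLabels; li++) {
            target[li] = Double.parseDouble(tokens[li + 1]);
        }
        return new AbstractMap.SimpleEntry<>(featureIndex, target);
    }

    public static void main(String[] args) {
        Map.Entry<Integer, double[]> e = parseLine("17 0.8 0.1 0.1", 3);
        System.out.println(e.getKey() + " -> " + java.util.Arrays.toString(e.getValue()));
    }
}
```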

selectFeaturesByInfoGain

public static java.util.ArrayList<java.lang.Integer> selectFeaturesByInfoGain(InstanceList list,
                                                                              int numFeatures)
Select features with the highest information gain.

Parameters:
list - InstanceList for computing information gain.
numFeatures - Number of features to select.
Returns:
List of features with the highest information gains.
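
For readers unfamiliar with the criterion, the sketch below computes the standard information gain of a binary (present/absent) feature from label counts: IG = H(Y) - p(f) H(Y|f) - p(not f) H(Y|not f). It is a generic illustration with hypothetical names, not the code MALLET uses to rank features.

```java
// Hypothetical helper, not part of MALLET.
final class InfoGainSketch {

    /** Entropy in bits of the distribution obtained by normalizing counts. */
    static double entropy(double[] counts) {
        double total = 0;
        for (double c : counts) total += c;
        if (total == 0) return 0;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /**
     * withFeature[y] counts instances of label y that contain the feature;
     * withoutFeature[y] counts instances of label y that do not.
     */
    static double infoGain(double[] withFeature, double[] withoutFeature) {
        int numLabels = withFeature.length;
        double[] all = new double[numLabels];
        double nWith = 0, nWithout = 0;
        for (int y = 0; y < numLabels; y++) {
            all[y] = withFeature[y] + withoutFeature[y];
            nWith += withFeature[y];
            nWithout += withoutFeature[y];
        }
        double total = nWith + nWithout;
        if (total == 0) return 0;
        // Label entropy minus the expected entropy after observing the feature.
        return entropy(all)
            - (nWith / total) * entropy(withFeature)
            - (nWithout / total) * entropy(withoutFeature);
    }

    public static void main(String[] args) {
        // A feature that perfectly separates two balanced labels.
        System.out.println(infoGain(new double[] {5, 0}, new double[] {0, 5}));
        // A feature that carries no information about the label.
        System.out.println(infoGain(new double[] {2, 2}, new double[] {3, 3}));
    }
}
```

Selecting the top numFeatures features then amounts to sorting feature indices by this score in descending order and keeping the first numFeatures.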

selectTopLDAFeatures

public static java.util.ArrayList<java.lang.Integer> selectTopLDAFeatures(int numSelFeatures,
                                                                          ParallelTopicModel lda,
                                                                          Alphabet alphabet)
Select top features in LDA topics.

Parameters:
numSelFeatures - Number of features to select.
lda - ParallelTopicModel which provides the LDA model.
alphabet - The vector dataset alphabet.
Returns:
ArrayList with the int indices of the selected features.

setTargetsUsingData

public static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingData(InstanceList list,
                                                                                java.util.ArrayList<java.lang.Integer> features)

setTargetsUsingData

public static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingData(InstanceList list,
                                                                                java.util.ArrayList<java.lang.Integer> features,
                                                                                boolean normalize)

setTargetsUsingData

public static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingData(InstanceList list,
                                                                                java.util.ArrayList<java.lang.Integer> features,
                                                                                boolean useValues,
                                                                                boolean normalize)
Set target distributions using estimates from data.

Parameters:
list - InstanceList used to estimate targets.
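features - List of features for constraints.
useValues - Whether to use feature values, rather than binary feature presence, when computing counts.
normalize - Whether to normalize by feature counts.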
Returns:
Constraints (map of feature index to target), with targets set using estimates from supplied data.
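
One plausible reading of this method is: count how often each feature co-occurs with each label, then optionally normalize each feature's row into a distribution over labels. The sketch below follows that reading with plain arrays in place of an InstanceList. All names are hypothetical, the interpretation of useValues (feature value versus a count of 1) is an assumption, and any smoothing the library may apply is omitted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of MALLET.
final class DataTargetsSketch {

    /**
     * featureIndices[i] and featureValues[i] describe the sparse features of
     * instance i; labels[i] is its label index.
     */
    static double[][] countFeatureLabels(int[][] featureIndices, double[][] featureValues,
                                         int[] labels, int numFeatures, int numLabels,
                                         boolean useValues) {
        double[][] counts = new double[numFeatures][numLabels];
        for (int i = 0; i < labels.length; i++) {
            for (int loc = 0; loc < featureIndices[i].length; loc++) {
                // Assumption: useValues adds the feature value, otherwise 1 per occurrence.
                double val = useValues ? featureValues[i][loc] : 1.0;
                counts[featureIndices[i][loc]][labels[i]] += val;
            }
        }
        return counts;
    }

    /** Builds feature index -> target, normalizing each row to sum to 1 if requested. */
    static Map<Integer, double[]> targetsFromCounts(double[][] counts, List<Integer> features,
                                                    boolean normalize) {
        Map<Integer, double[]> targets = new HashMap<>();
        for (int fi : features) {
            double[] row = counts[fi].clone();
            if (normalize) {
                double sum = 0;
                for (double c : row) sum += c;
                if (sum > 0) {
                    for (int li = 0; li < row.length; li++) row[li] /= sum;
                }
            }
            targets.put(fi, row);
        }
        return targets;
    }
}
```

With normalize set to false, the targets are the raw counts, which is why the return description speaks of targets rather than target distributions.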

setTargetsUsingHeuristic

public static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingHeuristic(java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labeledFeatures,
                                                                                     int numLabels,
                                                                                     double majorityProb)
Set target distributions using "Schapire" heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.

Parameters:
labeledFeatures - HashMap of feature indices to lists of label indices for that feature.
numLabels - Total number of labels.
majorityProb - Probability mass divided among majority labels.
Returns:
Constraints (map of feature index to target distribution), with target distributions set using heuristic.
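
Based on the parameter descriptions above, the heuristic can be sketched as follows: majorityProb is split evenly among the labels assigned to the feature, and the remaining mass is split evenly among all other labels. The names below are hypothetical, and the fallback to a uniform distribution when no labels or all labels are listed is an assumption rather than documented behavior.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of MALLET.
final class HeuristicTargetSketch {

    /**
     * majorityLabels holds the distinct label indices assigned to one feature.
     * Returns a distribution over numLabels labels.
     */
    static double[] heuristicTarget(List<Integer> majorityLabels, int numLabels,
                                    double majorityProb) {
        double[] dist = new double[numLabels];
        int k = majorityLabels.size();
        // Assumption: with no listed labels, or with every label listed,
        // fall back to a uniform distribution.
        if (k == 0 || k >= numLabels) {
            Arrays.fill(dist, 1.0 / numLabels);
            return dist;
        }
        // Remaining mass is shared by the labels that were not listed.
        Arrays.fill(dist, (1.0 - majorityProb) / (numLabels - k));
        // majorityProb is shared by the listed labels.
        for (int li : majorityLabels) {
            dist[li] = majorityProb / k;
        }
        return dist;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(heuristicTarget(Arrays.asList(0), 3, 0.9)));
        System.out.println(Arrays.toString(heuristicTarget(Arrays.asList(0, 2), 4, 0.8)));
    }
}
```

For example, one labeled class out of three with majorityProb 0.9 gives that class 0.9 and each other class about 0.05.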

setTargetsUsingFeatureVoting

public static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingFeatureVoting(java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labeledFeatures,
                                                                                         InstanceList trainingData)
Set target distributions using feature voting heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.

Parameters:
labeledFeatures - HashMap of feature indices to lists of label indices for that feature.
trainingData - InstanceList to use for computing expectations with feature voting.
Returns:
Constraints (map of feature index to target distribution), with target distributions set using feature voting.

labelFeatures

public static java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labelFeatures(InstanceList list,
                                                                                                        java.util.ArrayList<java.lang.Integer> features,
                                                                                                        boolean reject)
Label features using heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.

Parameters:
list - InstanceList used to compute statistics for labeling features.
features - List of features to label.
reject - Whether the labeler may reject (decline to label) a feature.
Returns:
Labeled features, HashMap mapping feature indices to list of labels.

labelFeatures

public static java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labelFeatures(InstanceList list,
                                                                                                        java.util.ArrayList<java.lang.Integer> features)

getFeatureLabelCounts

public static double[][] getFeatureLabelCounts(InstanceList list,
                                               boolean useValues)