cc.mallet.types
Class InstanceList

java.lang.Object
  extended by java.util.AbstractCollection<E>
      extended by java.util.AbstractList<E>
          extended by java.util.ArrayList<Instance>
              extended by cc.mallet.types.InstanceList
All Implemented Interfaces:
AlphabetCarrying, java.io.Serializable, java.lang.Cloneable, java.lang.Iterable<Instance>, java.util.Collection<Instance>, java.util.List<Instance>, java.util.RandomAccess
Direct Known Subclasses:
MultiInstanceList, PagedInstanceList

public class InstanceList
extends java.util.ArrayList<Instance>
implements java.io.Serializable, java.lang.Iterable<Instance>, AlphabetCarrying

A list of machine learning instances, typically used for training or testing of a machine learning algorithm.

All of the instances in the list will have been passed through the same Pipe, and thus must also share the same data and target Alphabets. InstanceList keeps a reference to the pipe and the two alphabets.

The most common way of adding instances to an InstanceList is through the addThruPipe(Iterator<Instance>) method. Instance iterators are a way of mapping general data sources into instances suitable for processing through a pipe. As each Instance is pulled from the iterator, the InstanceList copies the instance and runs the copy through its pipe (with resultant destructive modifications) before saving the modified instance on its list. This is the usual way in which instances are transformed by pipes.
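
The difference between storing an item as-is and storing it after pipe processing can be illustrated with a minimal sketch that does not depend on MALLET. PipedListSketch and its UnaryOperator "pipe" are hypothetical stand-ins for InstanceList and Pipe, not part of the library; the sketch only shows the transform-then-store pattern described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for an InstanceList: a list that owns one "pipe"
// (modeled here as a UnaryOperator) and applies it only on request.
final class PipedListSketch<T> extends ArrayList<T> {
    private static final long serialVersionUID = 1L;

    private final transient UnaryOperator<T> pipe; // stand-in for a MALLET Pipe

    PipedListSketch(UnaryOperator<T> pipe) {
        this.pipe = pipe;
    }

    // Transforms the item with the pipe, then stores the result.
    void addThruPipe(T raw) {
        super.add(pipe.apply(raw));
    }

    // Drains an iterator, piping every item before storing it.
    void addThruPipe(Iterator<T> source) {
        while (source.hasNext()) {
            addThruPipe(source.next());
        }
    }

    public static void main(String[] args) {
        PipedListSketch<String> list = new PipedListSketch<>(String::toLowerCase);
        list.addThruPipe(List.of("Foo", "BAR").iterator()); // piped
        list.add("Baz");                                    // stored unchanged
        System.out.println(list); // prints "[foo, bar, Baz]"
    }
}
```

The inherited add method bypasses the pipe, which mirrors the distinction this class draws between add(Instance) and addThruPipe(Instance).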

InstanceList also provides constructors for randomly generating lists of feature vectors, methods for splitting lists into non-overlapping subsets (useful for test/train splits), and iterators for cross validation.
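
Cross validation over a list can be sketched in plain Java as follows. FoldSketch and Fold are hypothetical names, and assigning items to folds by shuffled index modulo nfolds is an assumption made for illustration; InstanceList.CrossValidationIterator may partition its instances differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class FoldSketch {
    // One training/testing pair.
    static final class Fold<T> {
        final List<T> train = new ArrayList<>();
        final List<T> test = new ArrayList<>();
    }

    // Builds nfolds pairs; each item lands in exactly one test list.
    static <T> List<Fold<T>> folds(List<T> items, int nfolds, long seed) {
        if (nfolds < 2) {
            throw new IllegalArgumentException("nfolds must be at least 2");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed)); // seeded for repeatability

        List<Fold<T>> result = new ArrayList<>();
        for (int fold = 0; fold < nfolds; fold++) {
            Fold<T> pair = new Fold<>();
            for (int i = 0; i < shuffled.size(); i++) {
                // Assumption: position modulo nfolds selects the held-out fold.
                if (i % nfolds == fold) {
                    pair.test.add(shuffled.get(i));
                } else {
                    pair.train.add(shuffled.get(i));
                }
            }
            result.add(pair);
        }
        return result;
    }
}
```

Passing the same seed reproduces the same folds, which is the role the seed argument of crossValidationIterator(int nfolds, int seed) would be expected to play.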

Author:
Andrew McCallum mccallum@cs.umass.edu
See Also:
Instance, Pipe, Serialized Form

Nested Class Summary
 class InstanceList.CrossValidationIterator
          CrossValidationIterator allows iterating over pairs of InstanceLists, where each pair is split into training/testing based on nfolds.
 
Field Summary
static java.lang.String TARGET_PROPERTY
           
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
InstanceList()
          Deprecated. 
InstanceList(Alphabet dataAlphabet, Alphabet targetAlphabet)
          Construct an InstanceList with initial capacity of 10, with a Noop default pipe.
InstanceList(Pipe pipe)
          Construct an InstanceList with initial capacity of 10, with given default pipe.
InstanceList(Pipe pipe, int capacity)
          Construct an InstanceList having given capacity, with given default pipe.
InstanceList(Randoms r, Alphabet vocab, java.lang.String[] classNames, int meanInstancesPerLabel)
           
InstanceList(Randoms r, Dirichlet classCentroidDistribution, double classCentroidAverageAlphaMean, double classCentroidAverageAlphaVariance, double featureVectorSizePoissonLambda, double classInstanceCountPoissonLambda, java.lang.String[] classNames)
          Creates a list consisting of randomly-generated FeatureVectors.
InstanceList(Randoms r, int vocabSize, int numClasses)
           
 
Method Summary
 boolean add(Instance instance)
          Appends the instance to this list without passing the instance through the InstanceList's pipe.
 boolean add(Instance instance, double instanceWeight)
          Appends the instance to this list without passing it through this InstanceList's pipe, assigning it the specified weight.
 void add(int index, Instance element)
           
 boolean add(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source)
          Deprecated. Use trainingset.addThruPipe (new Instance(data,target,name,source)) instead.
 boolean add(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source, double instanceWeight)
          Deprecated. Use trainingset.addThruPipe (new Instance(data,target,name,source)) instead.
 boolean addAll(java.util.Collection<? extends Instance> instances)
           
 boolean addAll(int index, java.util.Collection<? extends Instance> c)
           
 void addThruPipe(Instance inst)
          Adds the input instance to this list, after passing it through the InstanceList's pipe.
 void addThruPipe(java.util.Iterator<Instance> ii)
          Adds to this list every instance generated by the iterator, passing each one through this InstanceList's pipe.
 void clear()
           
 java.lang.Object clone()
           
 InstanceList cloneEmpty()
           
protected  InstanceList cloneEmptyInto(InstanceList ret)
           
 InstanceList.CrossValidationIterator crossValidationIterator(int nfolds)
           
 InstanceList.CrossValidationIterator crossValidationIterator(int nfolds, int seed)
           
 Alphabet getAlphabet()
           
 Alphabet[] getAlphabets()
           
 Alphabet getDataAlphabet()
          Returns the Alphabet mapping features of the data to integers.
 java.lang.Class getDataClass()
          Returns the Java Class of the 'data' field of Instances in this list.
 FeatureSelection getFeatureSelection()
           
 double getInstanceWeight(Instance instance)
           
 double getInstanceWeight(int index)
           
 FeatureSelection[] getPerLabelFeatureSelection()
           
 Pipe getPipe()
          Returns the pipe through which each added Instance is passed; the returned pipe may be null.
 Alphabet getTargetAlphabet()
          Returns the Alphabet mapping target output labels to integers.
 java.lang.Class getTargetClass()
          Returns the Java Class of the 'target' field of Instances in this list.
 void hideSomeLabels(java.util.BitSet bs)
           
 void hideSomeLabels(double proportionToHide, Randoms r)
           
static InstanceList load(java.io.File file)
          Constructs a new InstanceList, deserialized from file.
 double noisify(double ratio)
          Deprecated. 
 boolean remove(Instance instance)
           
 Instance remove(int index)
           
 void removeSources()
          Sets the "source" field to null in all instances.
 void removeTargets()
          Sets the "target" field to null in all instances.
 InstanceList sampleWithInstanceWeights(java.util.Random r)
          Deprecated. 
 InstanceList sampleWithReplacement(java.util.Random r, int numSamples)
           
 InstanceList sampleWithWeights(java.util.Random r, double[] weights)
          Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights.
 void save(java.io.File file)
          Saves this InstanceList to file.
 Instance set(int index, Instance instance)
           
 void setFeatureSelection(FeatureSelection selectedFeatures)
           
 void setInstance(int index, Instance instance)
          Replaces the Instance at position index with a new one.
 void setInstanceWeight(Instance instance, double weight)
           
 void setInstanceWeight(int index, double weight)
           
 void setPerLabelFeatureSelection(FeatureSelection[] selectedFeatures)
           
 void setPipe(Pipe p)
          Change the default Pipe associated with InstanceList.
 InstanceList shallowClone()
           
 void shuffle(java.util.Random r)
           
 InstanceList[] split(double[] proportions)
           
 InstanceList[] split(java.util.Random r, double[] proportions)
          Shuffles the elements of this list among several smaller lists.
 InstanceList[] splitInOrder(double[] proportions)
          Chops this list into several sequential sublists.
 InstanceList[] splitInOrder(int[] counts)
           
 InstanceList[] splitInTwoByModulo(int m)
          Returns a pair of new lists such that the first list in the pair contains every mth element of this list, starting with the first.
 InstanceList subList(double proportion)
           
 InstanceList subList(int start, int end)
           
 LabelVector targetLabelDistribution()
           
 void unhideAllLabels()
           
 
Methods inherited from class java.util.ArrayList
contains, ensureCapacity, get, indexOf, isEmpty, lastIndexOf, remove, removeRange, size, toArray, toArray, trimToSize
 
Methods inherited from class java.util.AbstractList
equals, hashCode, iterator, listIterator, listIterator
 
Methods inherited from class java.util.AbstractCollection
containsAll, removeAll, retainAll, toString
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.lang.Iterable
iterator
 
Methods inherited from interface java.util.List
containsAll, equals, hashCode, iterator, listIterator, listIterator, removeAll, retainAll
 

Field Detail

TARGET_PROPERTY

public static final java.lang.String TARGET_PROPERTY
See Also:
Constant Field Values
Constructor Detail

InstanceList

public InstanceList(Pipe pipe,
                    int capacity)
Construct an InstanceList having the given capacity, with the given default pipe. Typically Instances added to this InstanceList will have gone through the pipe (for example using instanceList.addThruPipe), but this is not required. This InstanceList will obtain its dataAlphabet and targetAlphabet from the pipe. It is required that all Instances in this InstanceList share these Alphabets.

Parameters:
pipe - The default pipe used to process instances added via the addThruPipe methods.
capacity - The initial capacity of the list; will grow further as necessary.

InstanceList

public InstanceList(Pipe pipe)
Construct an InstanceList with initial capacity of 10, with the given default pipe. Typically Instances added to this InstanceList will have gone through the pipe (for example using instanceList.addThruPipe), but this is not required. This InstanceList will obtain its dataAlphabet and targetAlphabet from the pipe. It is required that all Instances in this InstanceList share these Alphabets.

Parameters:
pipe - The default pipe used to process instances added via the addThruPipe methods.

InstanceList

public InstanceList(Alphabet dataAlphabet,
                    Alphabet targetAlphabet)
Construct an InstanceList with initial capacity of 10, with a Noop default pipe. Used in those infrequent circumstances when Instances typically would not have further processing, and objects containing vocabularies are entered directly into the InstanceList; for example, the creation of a random InstanceList using Dirichlets and Multinomials.

Parameters:
dataAlphabet - The vocabulary for added instances' data fields
targetAlphabet - The vocabulary for added instances' targets

InstanceList

@Deprecated
public InstanceList()
Deprecated. 

Creates a list that will have its pipe set later when its first Instance is added.


InstanceList

public InstanceList(Randoms r,
                    Dirichlet classCentroidDistribution,
                    double classCentroidAverageAlphaMean,
                    double classCentroidAverageAlphaVariance,
                    double featureVectorSizePoissonLambda,
                    double classInstanceCountPoissonLambda,
                    java.lang.String[] classNames)
Creates a list consisting of randomly-generated FeatureVectors.


InstanceList

public InstanceList(Randoms r,
                    Alphabet vocab,
                    java.lang.String[] classNames,
                    int meanInstancesPerLabel)

InstanceList

public InstanceList(Randoms r,
                    int vocabSize,
                    int numClasses)
Method Detail

shallowClone

public InstanceList shallowClone()

clone

public java.lang.Object clone()
Overrides:
clone in class java.util.ArrayList<Instance>

subList

public InstanceList subList(int start,
                            int end)
Specified by:
subList in interface java.util.List<Instance>
Overrides:
subList in class java.util.AbstractList<Instance>

subList

public InstanceList subList(double proportion)

addThruPipe

public void addThruPipe(java.util.Iterator<Instance> ii)
Adds to this list every instance generated by the iterator, passing each one through this InstanceList's pipe.


addThruPipe

public void addThruPipe(Instance inst)
Adds the input instance to this list, after passing it through the InstanceList's pipe.

If several instances are to be added, accumulate them in a List<Instance> and use addThruPipe(Iterator<Instance>) instead.


add

@Deprecated
public boolean add(java.lang.Object data,
                              java.lang.Object target,
                              java.lang.Object name,
                              java.lang.Object source,
                              double instanceWeight)
Deprecated. Use trainingset.addThruPipe (new Instance(data,target,name,source)) instead.

Constructs and appends an instance to this list, passing it through this list's pipe and assigning it the specified weight.

Returns:
true

add

@Deprecated
public boolean add(java.lang.Object data,
                              java.lang.Object target,
                              java.lang.Object name,
                              java.lang.Object source)
Deprecated. Use trainingset.addThruPipe (new Instance(data,target,name,source)) instead.

Constructs and appends an instance to this list, passing it through this list's pipe. Default weight is 1.0.

Returns:
true

add

public boolean add(Instance instance)
Appends the instance to this list without passing the instance through the InstanceList's pipe. The alphabets of this Instance must match the alphabets of this InstanceList.

Specified by:
add in interface java.util.Collection<Instance>
Specified by:
add in interface java.util.List<Instance>
Overrides:
add in class java.util.ArrayList<Instance>
Returns:
true

add

public boolean add(Instance instance,
                   double instanceWeight)
Appends the instance to this list without passing it through this InstanceList's pipe, assigning it the specified weight.

Returns:
true

set

public Instance set(int index,
                    Instance instance)
Specified by:
set in interface java.util.List<Instance>
Overrides:
set in class java.util.ArrayList<Instance>

add

public void add(int index,
                Instance element)
Specified by:
add in interface java.util.List<Instance>
Overrides:
add in class java.util.ArrayList<Instance>

remove

public Instance remove(int index)
Specified by:
remove in interface java.util.List<Instance>
Overrides:
remove in class java.util.ArrayList<Instance>

remove

public boolean remove(Instance instance)

addAll

public boolean addAll(java.util.Collection<? extends Instance> instances)
Specified by:
addAll in interface java.util.Collection<Instance>
Specified by:
addAll in interface java.util.List<Instance>
Overrides:
addAll in class java.util.ArrayList<Instance>

addAll

public boolean addAll(int index,
                      java.util.Collection<? extends Instance> c)
Specified by:
addAll in interface java.util.List<Instance>
Overrides:
addAll in class java.util.ArrayList<Instance>

clear

public void clear()
Specified by:
clear in interface java.util.Collection<Instance>
Specified by:
clear in interface java.util.List<Instance>
Overrides:
clear in class java.util.ArrayList<Instance>

noisify

@Deprecated
public double noisify(double ratio)
Deprecated. 


cloneEmpty

public InstanceList cloneEmpty()

cloneEmptyInto

protected InstanceList cloneEmptyInto(InstanceList ret)

shuffle

public void shuffle(java.util.Random r)

split

public InstanceList[] split(java.util.Random r,
                            double[] proportions)
Shuffles the elements of this list among several smaller lists.

Parameters:
proportions - A list of numbers (not necessarily summing to 1) which, when normalized, correspond to the proportion of elements in each returned sublist. This method, like all the split methods, does not transfer the Instance weights to the resulting InstanceLists.
r - The source of randomness to use in shuffling.
Returns:
one InstanceList for each element of proportions

split

public InstanceList[] split(double[] proportions)

splitInOrder

public InstanceList[] splitInOrder(double[] proportions)
Chops this list into several sequential sublists.

Parameters:
proportions - A list of numbers corresponding to the proportion of elements in each returned sublist. If not already normalized to sum to 1.0, it will be normalized here.
Returns:
one InstanceList for each element of proportions
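
The proportion-based splitting described for split and splitInOrder can be sketched on ordinary lists. SplitSketch is a hypothetical class, and the boundary rule used here (rounded cumulative shares, with the last sublist taking any remainder) is an assumption; it is not taken from the MALLET source and may size sublists slightly differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SplitSketch {
    // Chops items into consecutive sublists sized by the normalized proportions.
    static <T> List<List<T>> splitInOrder(List<T> items, double[] proportions) {
        double total = 0.0;
        for (double p : proportions) {
            if (p < 0) {
                throw new IllegalArgumentException("negative proportion");
            }
            total += p;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("proportions must sum to a positive value");
        }

        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (int k = 0; k < proportions.length; k++) {
            cumulative += proportions[k] / total; // normalize on the fly
            // Assumption: round the cumulative share; the last part takes the rest.
            int end = (k == proportions.length - 1)
                    ? items.size()
                    : (int) Math.round(cumulative * items.size());
            end = Math.max(start, Math.min(end, items.size()));
            parts.add(new ArrayList<>(items.subList(start, end)));
            start = end;
        }
        return parts;
    }

    // Random variant: shuffle a copy, then chop the copy in order.
    static <T> List<List<T>> split(List<T> items, double[] proportions, Random r) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, r);
        return splitInOrder(shuffled, proportions);
    }
}
```

For example, proportions of {8, 2} over ten items yield sublists of eight and two items, the same as {0.8, 0.2} would.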

splitInOrder

public InstanceList[] splitInOrder(int[] counts)

splitInTwoByModulo

public InstanceList[] splitInTwoByModulo(int m)
Returns a pair of new lists such that the first list in the pair contains every mth element of this list, starting with the first. The second list contains all remaining elements.
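
The modulo rule above can be sketched on a plain list. ModuloSplitSketch is a hypothetical class written for illustration, not MALLET code; it places the elements at indices 0, m, 2m, ... in the first list and everything else in the second.

```java
import java.util.ArrayList;
import java.util.List;

final class ModuloSplitSketch {
    // Returns two lists: every mth element (starting with the first), and the rest.
    static <T> List<List<T>> splitInTwoByModulo(List<T> items, int m) {
        if (m <= 0) {
            throw new IllegalArgumentException("m must be positive");
        }
        List<T> everyMth = new ArrayList<>();
        List<T> rest = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            if (i % m == 0) {
                everyMth.add(items.get(i));
            } else {
                rest.add(items.get(i));
            }
        }
        return List.of(everyMth, rest);
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c", "d", "e", "f", "g");
        // prints "[[a, d, g], [b, c, e, f]]"
        System.out.println(splitInTwoByModulo(items, 3));
    }
}
```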


sampleWithReplacement

public InstanceList sampleWithReplacement(java.util.Random r,
                                          int numSamples)

sampleWithInstanceWeights

@Deprecated
public InstanceList sampleWithInstanceWeights(java.util.Random r)
Deprecated. 

Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the instance weights. The new instances all have their weights set to one.


sampleWithWeights

public InstanceList sampleWithWeights(java.util.Random r,
                                      double[] weights)
Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights. The length of the weight array must be the same as the length of this list. The new instances all have their weights set to one.
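
Weighted sampling with replacement can be sketched with a cumulative-sum lookup. WeightedSampleSketch is a hypothetical class, and the linear scan over cumulative weights is one common technique chosen for clarity; nothing here is taken from the MALLET implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class WeightedSampleSketch {
    // Draws items.size() samples, each chosen with probability proportional to its weight.
    static <T> List<T> sampleWithWeights(List<T> items, double[] weights, Random r) {
        if (weights.length != items.size()) {
            throw new IllegalArgumentException("need exactly one weight per item");
        }
        double[] cumulative = new double[weights.length];
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] < 0) {
                throw new IllegalArgumentException("negative weight");
            }
            total += weights[i];
            cumulative[i] = total;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }

        List<T> sample = new ArrayList<>(items.size());
        for (int n = 0; n < items.size(); n++) {
            double u = r.nextDouble() * total; // uniform draw over the total weight
            int idx = 0;
            // Advance to the first item whose cumulative weight exceeds u.
            while (idx < cumulative.length - 1 && u >= cumulative[idx]) {
                idx++;
            }
            sample.add(items.get(idx));
        }
        return sample;
    }
}
```

Because sampling is with replacement, the same item may appear several times in the result, and items with zero weight are never drawn.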


getDataClass

public java.lang.Class getDataClass()
Returns the Java Class of the 'data' field of Instances in this list.


getTargetClass

public java.lang.Class getTargetClass()
Returns the Java Class of the 'target' field of Instances in this list.


setInstance

public void setInstance(int index,
                        Instance instance)
Replaces the Instance at position index with a new one.


getInstanceWeight

public double getInstanceWeight(Instance instance)

getInstanceWeight

public double getInstanceWeight(int index)

setInstanceWeight

public void setInstanceWeight(int index,
                              double weight)

setInstanceWeight

public void setInstanceWeight(Instance instance,
                              double weight)

setFeatureSelection

public void setFeatureSelection(FeatureSelection selectedFeatures)

getFeatureSelection

public FeatureSelection getFeatureSelection()

setPerLabelFeatureSelection

public void setPerLabelFeatureSelection(FeatureSelection[] selectedFeatures)

getPerLabelFeatureSelection

public FeatureSelection[] getPerLabelFeatureSelection()

removeTargets

public void removeTargets()
Sets the "target" field to null in all instances. This makes unlabeled data.


removeSources

public void removeSources()
Sets the "source" field to null in all instances. This will often save memory when the raw data had been placed in that field.


load

public static InstanceList load(java.io.File file)
Constructs a new InstanceList, deserialized from file. If the string value of file is "-", then deserialize from System.in.


save

public void save(java.io.File file)
Saves this InstanceList to file. If the string value of file is "-", then serialize to System.out.
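
The save and load convention, including the "-" file name that selects the standard streams, can be sketched with ordinary Java serialization. SerializationSketch is a hypothetical helper that works on any Serializable object; it is an illustration of the pattern, not the MALLET implementation, and its exception handling is an assumption.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializationSketch {
    // Writes the object to file, or to System.out when the file name is "-".
    static void save(Serializable object, File file) throws IOException {
        if (file.toString().equals("-")) {
            ObjectOutputStream out = new ObjectOutputStream(System.out);
            out.writeObject(object);
            out.flush(); // flush but leave System.out open
        } else {
            try (ObjectOutputStream out =
                    new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(object);
            }
        }
    }

    // Reads one object from file, or from System.in when the file name is "-".
    static Object load(File file) throws IOException, ClassNotFoundException {
        if (file.toString().equals("-")) {
            return new ObjectInputStream(System.in).readObject();
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }
}
```

As with any use of Java serialization, data should only be deserialized from sources that are trusted.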


getPipe

public Pipe getPipe()
Returns the pipe through which each added Instance is passed; the returned pipe may be null.


setPipe

public void setPipe(Pipe p)
Change the default Pipe associated with InstanceList. This method is very dangerous and should only be used in extreme circumstances!!


getDataAlphabet

public Alphabet getDataAlphabet()
Returns the Alphabet mapping features of the data to integers.


getTargetAlphabet

public Alphabet getTargetAlphabet()
Returns the Alphabet mapping target output labels to integers.


getAlphabet

public Alphabet getAlphabet()
Specified by:
getAlphabet in interface AlphabetCarrying

getAlphabets

public Alphabet[] getAlphabets()
Specified by:
getAlphabets in interface AlphabetCarrying

targetLabelDistribution

public LabelVector targetLabelDistribution()

crossValidationIterator

public InstanceList.CrossValidationIterator crossValidationIterator(int nfolds,
                                                                    int seed)

crossValidationIterator

public InstanceList.CrossValidationIterator crossValidationIterator(int nfolds)

hideSomeLabels

public void hideSomeLabels(double proportionToHide,
                           Randoms r)

hideSomeLabels

public void hideSomeLabels(java.util.BitSet bs)

unhideAllLabels

public void unhideAllLabels()