cc.mallet.types
Class PagedInstanceList

java.lang.Object
  extended by java.util.AbstractCollection<E>
      extended by java.util.AbstractList<E>
          extended by java.util.ArrayList<Instance>
              extended by cc.mallet.types.InstanceList
                  extended by cc.mallet.types.PagedInstanceList
All Implemented Interfaces:
AlphabetCarrying, java.io.Serializable, java.lang.Cloneable, java.lang.Iterable<Instance>, java.util.Collection<Instance>, java.util.List<Instance>, java.util.RandomAccess

public class PagedInstanceList
extends InstanceList

TODO: the split() methods are still unreliable.

An InstanceList that avoids OutOfMemoryErrors by saving Instances to disk when there is not enough memory to create a new Instance. It implements a fixed-size paging scheme in which each page on disk stores instancesPerPage Instances. The number of Instances per page is therefore constant, but the size in bytes of each page may vary. Using this class instead of InstanceList means the number of Instances you can store is limited essentially only by disk size (and patience).

The paging scheme is optimized for the most frequent case of looping through the InstanceList from index 0 to n. With p = instancesPerPage, instances 0 through p-1 are stored together on page 1, instances p through 2p-1 on page 2, and so on. This way, instances adjacent in the list will usually be on the same page.

The paging scheme also tries to keep only one page in memory at a time. The justification is that the page size is near the limit of the maximum number of instances that can be kept in memory. Since we assume the frequent case is looping from instance 0 to n, keeping other Instances in memory would be a waste of resources.

About instancesPerPage: if instancesPerPage = -1, its value is set automatically as follows. When the first OutOfMemoryError is thrown, count how many instances are currently in memory, then divide by two. This is a conservative estimate of how many Instance objects can fit in memory simultaneously. If you know this value beforehand, simply pass it to the constructor.

NOTE: The event that causes an OutOfMemoryError is the instantiation of a new Instance, _not_ the addition of this Instance to an InstanceList. Therefore, if you want to avoid OutOfMemoryErrors, let PagedInstanceList instantiate the new Instance for you. In other words, do this:

    Pipe p = ...;
    PagedInstanceList ilist = new PagedInstanceList (p);
    ilist.add (data, target, name, source);

Or this:

    Instance.Iterator iter = ...;
    Pipe p = ...;
    PagedInstanceList ilist = new PagedInstanceList (p);
    ilist.add (iter);

But not this:

    Pipe p = ...;
    PagedInstanceList ilist = new PagedInstanceList (p);
    ilist.add (new Instance (data, target, name, source));

If memory is low, the last example will throw an OutOfMemoryError before control has been passed to PagedInstanceList to catch the error.

NOTE ALSO: To save write time, the same Instance is never written to disk more than once, i.e., there are no dirty bits or write-throughs. This assumes that an Instance is no longer modified after it has been passed through its Pipe. One way around this is to call PagedInstanceList.setInstance (Instance inst), which _will_ overwrite an Instance that has been paged to disk.
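
The page arithmetic and the automatic choice of instancesPerPage described above can be sketched in a few lines. This is a minimal illustration, not MALLET's implementation: the class PagingSketch and both method names are hypothetical, pages are numbered from 0 here, and the clamp to a minimum of 1 is an added assumption.

```java
/** Hypothetical helper illustrating the fixed-size paging arithmetic; not MALLET code. */
final class PagingSketch {
    private PagingSketch() {}

    /** Zero-based page that would hold the instance at the given list index. */
    static int pageOf(int index, int instancesPerPage) {
        if (index < 0 || instancesPerPage <= 0) {
            throw new IllegalArgumentException("index must be >= 0 and instancesPerPage > 0");
        }
        return index / instancesPerPage;
    }

    /**
     * Conservative page size for the instancesPerPage = -1 case: half the number
     * of instances in memory when the first OutOfMemoryError occurred.
     * Clamping to at least 1 is an assumption made for this sketch.
     */
    static int estimateInstancesPerPage(int instancesInMemory) {
        return Math.max(1, instancesInMemory / 2);
    }
}
```

With 100 instances per page, indices 0 through 99 map to the first page and index 100 starts the second, so a loop from 0 to n touches each page exactly once.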

Author:
Aron Culotta culotta@cs.umass.edu
See Also:
InstanceList, Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class cc.mallet.types.InstanceList
InstanceList.CrossValidationIterator
 
Field Summary
 
Fields inherited from class cc.mallet.types.InstanceList
TARGET_PROPERTY
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
PagedInstanceList()
           
PagedInstanceList(Pipe pipe)
           
PagedInstanceList(Pipe pipe, int size)
           
PagedInstanceList(Pipe pipe, int size, int instancesPerPage, java.io.File swapDir)
          Creates a PagedInstanceList in which pages of "instancesPerPage" instances are swapped to disk in directory "swapDir" when memory runs low.
 
Method Summary
 boolean add(Instance instance)
          Appends the instance to this list.
 InstanceList cloneEmpty()
           
 boolean collectGarbage()
           
 Instance get(int index)
          Returns the Instance at the specified index.
static InstanceList load(java.io.File file)
          Constructs a new InstanceList, deserialized from file.
 InstanceList sampleWithReplacement(java.util.Random r, int numSamples)
          Overridden to add samples in original order to reduce thrashing.
 InstanceList sampleWithWeights(java.util.Random r, double[] weights)
          Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights.
 Instance set(int index, Instance instance)
          Replaces the Instance at position index with a new one.
 void setCollectGarbage(boolean b)
          Sets the collectGarbage flag, whose current value is returned by collectGarbage().
 InstanceList shallowClone()
           
 InstanceList[] split(double[] proportions)
           
 InstanceList[] split(java.util.Random r, double[] proportions)
          Shuffles the elements of this list among several smaller lists.
 InstanceList[] splitByModulo(int m)
          Returns a pair of new lists such that the first list in the pair contains every mth element of this list, starting with the first.
 void swapOutAll()
          Save all instances to disk and set to null to free memory.
 
Methods inherited from class cc.mallet.types.InstanceList
add, add, add, add, addAll, addAll, addThruPipe, addThruPipe, clear, clone, cloneEmptyInto, crossValidationIterator, crossValidationIterator, getAlphabet, getAlphabets, getDataAlphabet, getDataClass, getFeatureSelection, getInstanceWeight, getInstanceWeight, getPerLabelFeatureSelection, getPipe, getTargetAlphabet, getTargetClass, hideSomeLabels, hideSomeLabels, noisify, remove, remove, removeSources, removeTargets, sampleWithInstanceWeights, save, setFeatureSelection, setInstance, setInstanceWeight, setPerLabelFeatureSelection, setPipe, shuffle, splitInOrder, splitInOrder, splitInTwoByModulo, subList, subList, targetLabelDistribution, unhideAllLabels
 
Methods inherited from class java.util.ArrayList
contains, ensureCapacity, indexOf, isEmpty, lastIndexOf, remove, removeRange, size, toArray, toArray, trimToSize
 
Methods inherited from class java.util.AbstractList
equals, hashCode, iterator, listIterator, listIterator
 
Methods inherited from class java.util.AbstractCollection
containsAll, removeAll, retainAll, toString
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.lang.Iterable
iterator
 
Methods inherited from interface java.util.List
containsAll, equals, hashCode, iterator, listIterator, listIterator, removeAll, retainAll
 

Constructor Detail

PagedInstanceList

public PagedInstanceList(Pipe pipe,
                         int size,
                         int instancesPerPage,
                         java.io.File swapDir)
Creates a PagedInstanceList in which pages of "instancesPerPage" instances are swapped to disk in directory "swapDir" when memory runs low.

Parameters:
pipe - instance pipe
instancesPerPage - number of Instances to store in each page. If -1, determine at first call to swapOutExcept
swapDir - where the pages on disk live.

PagedInstanceList

public PagedInstanceList(Pipe pipe,
                         int size)

PagedInstanceList

public PagedInstanceList(Pipe pipe)

PagedInstanceList

public PagedInstanceList()

Method Detail

split

public InstanceList[] split(java.util.Random r,
                            double[] proportions)
Shuffles the elements of this list among several smaller lists. Overrides InstanceList.split to add instances in original order, to prevent thrashing.

Overrides:
split in class InstanceList
Parameters:
proportions - A list of numbers (not necessarily summing to 1) which, when normalized, correspond to the proportion of elements in each returned sublist.
r - The source of randomness to use in shuffling.
Returns:
one InstanceList for each element of proportions
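
One way to combine a random split with in-order insertion is to shuffle the indices, cut them by the normalized proportions, and then sort each part before reading. The sketch below works on plain integer indices and only illustrates the idea; OrderedSplitSketch is a hypothetical name, and the rounding rule for part boundaries is an assumption rather than MALLET's behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: random split whose parts are read in ascending index order. */
final class OrderedSplitSketch {
    private OrderedSplitSketch() {}

    static List<List<Integer>> split(int n, Random r, double[] proportions) {
        double total = 0;
        for (double p : proportions) total += p;
        if (!(total > 0)) throw new IllegalArgumentException("proportions must sum to > 0");

        // Shuffle all indices once so that membership in each part is random.
        List<Integer> indices = new ArrayList<>(n);
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, r);

        List<List<Integer>> parts = new ArrayList<>();
        double cumulative = 0;
        int start = 0;
        for (int k = 0; k < proportions.length; k++) {
            cumulative += proportions[k];
            // Normalize by the total; the last part takes whatever remains.
            int end = (k == proportions.length - 1)
                    ? n
                    : (int) Math.round(n * cumulative / total);
            end = Math.max(start, Math.min(end, n));
            List<Integer> part = new ArrayList<>(indices.subList(start, end));
            // Ascending order means a paged list is scanned front to back.
            Collections.sort(part);
            parts.add(part);
            start = end;
        }
        return parts;
    }
}
```

Sorting each part keeps neighboring reads on the same page, which is the thrashing concern the override addresses.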

split

public InstanceList[] split(double[] proportions)
Overrides:
split in class InstanceList

splitByModulo

public InstanceList[] splitByModulo(int m)
Returns a pair of new lists such that the first list in the pair contains every mth element of this list, starting with the first. The second list contains all remaining elements. Overrides InstanceList.splitByModulo to use PagedInstanceLists.
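
The selection rule can be illustrated on an ordinary list. This is a minimal sketch of the "every mth element, starting with the first" rule only; ModuloSplitSketch is a hypothetical name, and it returns plain java.util lists rather than PagedInstanceLists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the split-by-modulo selection rule. */
final class ModuloSplitSketch {
    private ModuloSplitSketch() {}

    /** Returns two lists: elements at indices 0, m, 2m, ... and all remaining elements. */
    static <T> List<List<T>> splitByModulo(List<T> items, int m) {
        if (m <= 0) throw new IllegalArgumentException("m must be positive");
        List<T> everyMth = new ArrayList<>();
        List<T> rest = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            if (i % m == 0) {
                everyMth.add(items.get(i));
            } else {
                rest.add(items.get(i));
            }
        }
        return Arrays.asList(everyMth, rest);
    }
}
```

For a list of the integers 0 through 6 and m = 3, the first list holds 0, 3, 6 and the second holds 1, 2, 4, 5.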


sampleWithReplacement

public InstanceList sampleWithReplacement(java.util.Random r,
                                          int numSamples)
Overridden to add samples in original order to reduce thrashing.

Overrides:
sampleWithReplacement in class InstanceList
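
A plausible way to sample with replacement while still reading in original order is to draw all indices first and sort them before fetching anything. The sketch below shows only that index step; OrderedSampleSketch and sampleIndices are hypothetical names, and this is not taken from MALLET's source.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch: sample indices with replacement, then sort them. */
final class OrderedSampleSketch {
    private OrderedSampleSketch() {}

    static int[] sampleIndices(int listSize, Random r, int numSamples) {
        if (listSize <= 0) throw new IllegalArgumentException("listSize must be positive");
        if (numSamples < 0) throw new IllegalArgumentException("numSamples must be >= 0");
        int[] indices = new int[numSamples];
        for (int i = 0; i < numSamples; i++) {
            indices[i] = r.nextInt(listSize); // duplicates are allowed
        }
        // Sorting lets the caller walk the paged list front to back.
        Arrays.sort(indices);
        return indices;
    }
}
```

Sorting does not change which instances are drawn, only the order in which they are fetched and added.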

sampleWithWeights

public InstanceList sampleWithWeights(java.util.Random r,
                                      double[] weights)
Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights. The length of the weight array must be the same as the length of this list. The new instances all have their weights set to one.

Overrides:
sampleWithWeights in class InstanceList
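
Weighted sampling with replacement is commonly done by inverting a cumulative sum of the weights. The sketch below uses that general technique on index positions; it is not MALLET's code, the names WeightedSampleSketch and sample are hypothetical, and the rejection of negative weights is an assumption.

```java
import java.util.Random;

/** Hypothetical sketch: draw weights.length indices with replacement, proportional to weight. */
final class WeightedSampleSketch {
    private WeightedSampleSketch() {}

    static int[] sample(Random r, double[] weights) {
        int n = weights.length;
        double[] cumulative = new double[n];
        double total = 0;
        for (int i = 0; i < n; i++) {
            if (weights[i] < 0) throw new IllegalArgumentException("negative weight");
            total += weights[i];
            cumulative[i] = total;
        }
        if (!(total > 0)) throw new IllegalArgumentException("weights must sum to > 0");

        int[] picks = new int[n]; // same size as the original list
        for (int s = 0; s < n; s++) {
            double u = r.nextDouble() * total;
            // Binary search for the first index whose cumulative weight exceeds u.
            int lo = 0, hi = n - 1;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (cumulative[mid] > u) hi = mid; else lo = mid + 1;
            }
            picks[s] = lo;
        }
        return picks;
    }
}
```

Each returned index would then be used to copy the corresponding instance into the new list with weight one.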

swapOutAll

public void swapOutAll()
Save all instances to disk and set to null to free memory.


get

public Instance get(int index)
Returns the Instance at the specified index. If this Instance is not in memory, swap a block of instances back into memory.

Specified by:
get in interface java.util.List<Instance>
Overrides:
get in class java.util.ArrayList<Instance>
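
The swap-in behavior can be pictured with a toy list that keeps a single page resident. This is only a sketch under stated assumptions: OnePageCacheSketch is a hypothetical class, an in-memory map stands in for the page files on disk, and strings stand in for Instances.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of get() with one resident page; a map stands in for disk. */
final class OnePageCacheSketch {
    private final int instancesPerPage;
    private final int size;
    private final Map<Integer, List<String>> disk = new HashMap<>();
    private int residentPage = -1;
    private List<String> resident;
    private int pageLoads = 0;

    OnePageCacheSketch(List<String> items, int instancesPerPage) {
        if (instancesPerPage <= 0) throw new IllegalArgumentException("page size must be positive");
        this.instancesPerPage = instancesPerPage;
        this.size = items.size();
        // Cut the items into consecutive fixed-size pages.
        for (int start = 0; start < size; start += instancesPerPage) {
            int end = Math.min(start + instancesPerPage, size);
            disk.put(start / instancesPerPage, new ArrayList<>(items.subList(start, end)));
        }
    }

    String get(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException("index " + index);
        int page = index / instancesPerPage;
        if (page != residentPage) {
            // Swap the needed page in; the previous page is simply dropped.
            resident = disk.get(page);
            residentPage = page;
            pageLoads++;
        }
        return resident.get(index % instancesPerPage);
    }

    int pageLoads() {
        return pageLoads;
    }
}
```

Scanning ten items with four per page from index 0 upward loads three pages in total, whereas alternating between the first and last index would trigger a load on every call.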

set

public Instance set(int index,
                    Instance instance)
Replaces the Instance at position index with a new one. Note that this is the only sanctioned way of changing an Instance.

Specified by:
set in interface java.util.List<Instance>
Overrides:
set in class InstanceList

add

public boolean add(Instance instance)
Appends the instance to this list. Note that since memory for the Instance has already been allocated, no check is made to catch OutOfMemoryError.

Specified by:
add in interface java.util.Collection<Instance>
Specified by:
add in interface java.util.List<Instance>
Overrides:
add in class InstanceList
Returns:
true if successful

setCollectGarbage

public void setCollectGarbage(boolean b)
Sets the collectGarbage flag, whose current value is returned by collectGarbage().


collectGarbage

public boolean collectGarbage()

shallowClone

public InstanceList shallowClone()
Overrides:
shallowClone in class InstanceList

cloneEmpty

public InstanceList cloneEmpty()
Overrides:
cloneEmpty in class InstanceList

load

public static InstanceList load(java.io.File file)
Constructs a new InstanceList, deserialized from file. If the string value of file is "-", then deserialize from System.in.
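
The "-" convention can be sketched as a small stream-selection helper. This shows only how the input source might be chosen, not how MALLET deserializes the list; LoadSourceSketch and open are hypothetical names, and wrapping the IOException in an unchecked exception is a choice made for this sketch.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical sketch: pick System.in when the file's string value is "-". */
final class LoadSourceSketch {
    private LoadSourceSketch() {}

    static InputStream open(File file) {
        if (file.toString().equals("-")) {
            return System.in; // read the serialized list from standard input
        }
        try {
            return new BufferedInputStream(new FileInputStream(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would typically wrap the returned stream in an ObjectInputStream and close it after reading, except when it is System.in.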