Proposed changes to Pipes
This proposal was culled from a couple of conversations among Charles Sutton, Fernando Pereira, and others on the mallet-dev mailing list. Charles did the culling, however, so blame him for any obviously wrong ideas.
- Pipes should not remember their context of use. Right now, every pipe has a parent slot, and checks it so that you can't put the same pipe in more than one SerialPipes. There's no reason to enforce this restriction. A pipe should be allowed to be used in arbitrarily many SerialPipes. Each SerialPipes should enforce, however, that all of its subpipes have the same (==) alphabet.
- Should Pipes be allowed to have a null alphabet? I believe this was possible in the past, but was discarded for a reason that I (Charles) don't know.
- It is possible to hurt yourself royally by doing the following: Generate some training data and serialize it. Now generate some testing data and serialize it. When you deserialize these later, the order matters: you must deserialize first whichever one was saved with the larger alphabet, because the first alphabet to be deserialized is the one that gets used.
The Mallet TUI tools prevent you from doing this by resaving the *training* data when you save new testing data. This is a half measure, though. You can still mess yourself up by piping two different test sets from the training set. If you deserialize one of the test sets before you do the training set, you will not have Alphabet entries for the features in the second test set. Great pain will ensue.
The underlying problem is that the alphabet deserialization trick needs to keep a timestamp and use the MOST RECENT version of the alphabet, not the first one that happens to be deserialized.
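A minimal sketch of the timestamp idea, not Mallet's actual Alphabet code. The class below and its members (`id`, `version`, `CANONICAL`, `save`, `load`) are hypothetical, and a growth counter stands in for the timestamp. It relies on Java's standard `readResolve` hook so that every deserialized copy with the same id resolves to one shared object, which is upgraded in place when a more recent copy shows up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AlphabetVersionSketch {
    /** Hypothetical, simplified alphabet: an append-only list of entries. */
    static final class Alphabet implements Serializable {
        private static final long serialVersionUID = 1L;
        // One shared in-memory alphabet per id, filled in as copies are deserialized.
        private static final Map<String, Alphabet> CANONICAL = new HashMap<>();

        final String id;
        long version;  // assumption: a growth counter plays the role of the timestamp
        final List<String> entries = new ArrayList<>();

        Alphabet(String id) { this.id = id; }

        /** Returns the index of the entry, adding it (and bumping the version) if new. */
        int lookupIndex(String entry) {
            int i = entries.indexOf(entry);
            if (i < 0) { entries.add(entry); version++; i = entries.size() - 1; }
            return i;
        }

        /** Called by Java serialization after this copy has been read. */
        private Object readResolve() {
            Alphabet known = CANONICAL.get(id);
            if (known == null) { CANONICAL.put(id, this); return this; }
            if (version > known.version) {
                // The incoming copy is more recent: upgrade the shared object in place,
                // so data deserialized earlier sees the new entries too.
                known.entries.clear();
                known.entries.addAll(entries);
                known.version = version;
            }
            return known;
        }
    }

    static byte[] save(Alphabet a) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(a);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Alphabet load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Alphabet) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Alphabet a = new Alphabet("features");
        a.lookupIndex("w=cat");
        a.lookupIndex("w=dog");
        byte[] train = save(a);   // saved with two entries
        a.lookupIndex("w=emu");   // piping the test data grows the alphabet
        byte[] test = save(a);    // saved with three entries

        Alphabet fromTrain = load(train);  // the smaller copy is deserialized first
        Alphabet fromTest = load(test);
        System.out.println(fromTrain == fromTest);      // true
        System.out.println(fromTrain.entries.size());   // 3
    }
}
```

The sketch assumes alphabets only ever grow, so a more recent version keeps every index of an older one. Under that assumption the order of deserialization stops mattering.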
- One may wish to have many:one or one:many Instance mappings in Pipes (e.g., if you want to label sequences using local maxent classifiers). One way to do this is to say that Pipes map instance iterators to instance iterators, i.e., a pipe would have a spec like
public interface Pipe {
    InstanceIterator pipedIterator (InstanceIterator in);
}
A garden-variety per-instance pipe would look like
public abstract class Instancewise implements Pipe {
    public Instancewise() {}

    // Subclasses transform exactly one instance at a time.
    public abstract Instance pipe (Instance i);

    // Lift the per-instance transformation to an iterator-to-iterator mapping.
    public InstanceIterator pipedIterator (final InstanceIterator ii) {
        return new InstanceIterator() {
            public Instance nextInstance() { return pipe(ii.nextInstance()); }
            public boolean hasNext() { return ii.hasNext(); }
        };
    }
}
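For the one:many case, a rough self-contained sketch of how a sequence could be split into per-token instances under this spec. Everything here is an assumption made for illustration: `java.util.Iterator` stands in for InstanceIterator, `Instance` is reduced to a wrapper around a data object, and `SequenceToTokens` and `run` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class IteratorPipeSketch {
    /** Hypothetical stand-in for Instance: just wraps a data object. */
    static final class Instance {
        final Object data;
        Instance(Object data) { this.data = data; }
    }

    /** The proposed spec, with java.util.Iterator standing in for InstanceIterator. */
    interface Pipe {
        Iterator<Instance> pipedIterator(Iterator<Instance> in);
    }

    /** One:many pipe: each incoming instance holds a List and yields one instance per element. */
    static final class SequenceToTokens implements Pipe {
        public Iterator<Instance> pipedIterator(final Iterator<Instance> in) {
            return new Iterator<Instance>() {
                private Iterator<?> tokens = Collections.emptyIterator();

                public boolean hasNext() {
                    // Advance past exhausted or empty sequences until a token is available.
                    while (!tokens.hasNext() && in.hasNext()) {
                        tokens = ((List<?>) in.next().data).iterator();
                    }
                    return tokens.hasNext();
                }

                public Instance next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    return new Instance(tokens.next());
                }
            };
        }
    }

    /** Drains the piped iterator and collects the data of every resulting instance. */
    static List<Object> run(Pipe pipe, List<Instance> input) {
        List<Object> out = new ArrayList<>();
        Iterator<Instance> it = pipe.pipedIterator(input.iterator());
        while (it.hasNext()) out.add(it.next().data);
        return out;
    }

    public static void main(String[] args) {
        List<Instance> sequences = Arrays.asList(
            new Instance(Arrays.asList("the", "cat")),
            new Instance(new ArrayList<String>()),   // empty sequence yields nothing
            new Instance(Arrays.asList("sat")));
        System.out.println(run(new SequenceToTokens(), sequences));  // [the, cat, sat]
    }
}
```

A many:one pipe would follow the same shape in reverse, pulling several instances from the incoming iterator before returning one.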