java.lang.Object
  cc.mallet.pipe.Pipe
    cc.mallet.pipe.TokenSequenceParseFeatureString
public class TokenSequenceParseFeatureString
Converts the string in each token's Token.text field to a list of Strings (space delimited) and adds each string as a feature of the token. If realValued is true, the position in the list is treated as the feature name and the string is parsed as a double value. Otherwise, the feature name is the string itself and the value is 1.0.
Modified to allow feature names and values to be specified, e.g. featureName1=featureValue1 featureName2=featureValue2 ... The name/value separator (here '=') can be specified.
If your data consists of feature/value pairs (e.g. height=10.7 width=3.6 length=1.7), use new TokenSequenceParseFeatureString(true, true, "="). This format is typically used for sparse data, in which most features are equal to 0 in any given instance.
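The name/value format can be illustrated with a minimal stand-alone sketch. This is not MALLET code: the class NameValueFeatureSketch and its parse method are hypothetical names, and rejecting tokens that lack the separator is a choice made for this sketch only, not documented behavior of TokenSequenceParseFeatureString.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of MALLET.
class NameValueFeatureSketch {

    // Parse "name<sep>value name<sep>value ..." into an ordered feature map.
    static Map<String, Double> parse(String text, String separator) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input splits into one empty token
            }
            int at = token.indexOf(separator);
            if (at < 0) {
                // Assumption of this sketch: a token without the separator is an error.
                throw new IllegalArgumentException(
                        "No '" + separator + "' in token: " + token);
            }
            String name = token.substring(0, at);
            double value = Double.parseDouble(token.substring(at + separator.length()));
            features.put(name, value);
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(parse("height=10.7 width=3.6 length=1.7", "="));
    }
}
```

Features that are absent from a line are simply not stored, which is why this layout suits sparse data.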
If your data consists only of values, and the position determines which feature the value is for (e.g. 10.7 3.6 1.7), use new TokenSequenceParseFeatureString(true). This format is typically used for data that has a small number of features that all have non-zero values most of the time.
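As a rough illustration of the positional format, the sketch below names each value by its position, using the "Feature#K" convention (K starting at 0) described for the _realValued parameter. PositionalFeatureSketch and parse are hypothetical names for this example and do not exist in MALLET.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of MALLET.
class PositionalFeatureSketch {

    // Parse "v0 v1 v2 ..." into features named Feature#0, Feature#1, ...
    static Map<String, Double> parse(String text) {
        Map<String, Double> features = new LinkedHashMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return features; // nothing to parse
        }
        String[] tokens = trimmed.split("\\s+");
        for (int k = 0; k < tokens.length; k++) {
            // The position in the list supplies the feature name.
            features.put("Feature#" + k, Double.parseDouble(tokens[k]));
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(parse("10.7 3.6 1.7"));
    }
}
```

Because names come from positions, every line must list its values in the same order.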
If your data is in the form of named binary indicator variables (e.g. yellow quacks has_webbed_feet), use the constructor new TokenSequenceParseFeatureString(false). Each token will be interpreted as the name of a feature, whose value is 1.0.
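The binary-indicator case is the simplest of the three and can be sketched as follows. BinaryIndicatorSketch and parse are hypothetical names used only for illustration; the sketch mirrors the description above rather than MALLET's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of MALLET.
class BinaryIndicatorSketch {

    // Treat every whitespace-separated token as a feature name with value 1.0.
    static Map<String, Double> parse(String text) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                features.put(token, 1.0);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(parse("yellow quacks has_webbed_feet"));
    }
}
```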
Constructor Summary |
---|
TokenSequenceParseFeatureString(boolean _realValued) |
TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames) |
TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames, java.lang.String _nameValueSeparator) |
Method Summary | |
---|---|
Instance | pipe(Instance carrier) Really this should be 'protected', but isn't for historical reasons. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames, java.lang.String _nameValueSeparator)
Parameters:
_realValued - interpret each data token as a double, and associate it with a feature called "Feature#K", where K is the position of the token, starting with 0. Note that this option is currently ignored if _specifyFeatureNames is true.
_specifyFeatureNames - interpret each data token as a feature name/value pair, separated by a delimiter, which is the equals sign ("=") unless otherwise specified.
_nameValueSeparator - use a string other than "=" to separate name/value pairs. Colon (":") is a common choice. Note that this string cannot contain any whitespace, as the token stream will already have been split on whitespace.
public TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames)
public TokenSequenceParseFeatureString(boolean _realValued)
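How the two boolean flags interact can be summarized in a small sketch. The names ConstructorFlagsSketch, Mode, resolveMode and checkSeparator are hypothetical and not part of MALLET. The precedence follows the parameter notes above (_realValued is ignored when _specifyFeatureNames is true). The up-front separator check is an assumption of this sketch; the documentation only states that a whitespace separator cannot work, not that the class validates it.

```java
// Hypothetical stand-alone sketch; not part of MALLET.
class ConstructorFlagsSketch {

    enum Mode { NAME_VALUE_PAIRS, POSITIONAL_VALUES, BINARY_INDICATORS }

    // Decide how tokens are interpreted for a given flag combination.
    static Mode resolveMode(boolean realValued, boolean specifyFeatureNames) {
        if (specifyFeatureNames) {
            // realValued is ignored in this case, per the parameter notes.
            return Mode.NAME_VALUE_PAIRS;
        }
        return realValued ? Mode.POSITIONAL_VALUES : Mode.BINARY_INDICATORS;
    }

    // Assumption of this sketch: reject empty or whitespace-containing separators,
    // since tokens have already been split on whitespace.
    static String checkSeparator(String separator) {
        if (separator.isEmpty() || separator.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("Separator must be non-empty and contain no whitespace");
        }
        return separator;
    }

    public static void main(String[] args) {
        System.out.println(resolveMode(true, true));
        System.out.println(resolveMode(true, false));
        System.out.println(resolveMode(false, false));
        System.out.println(checkSeparator(":"));
    }
}
```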
Method Detail |
---|
public Instance pipe(Instance carrier)
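Description copied from class: Pipe
Really this should be 'protected', but isn't for historical reasons.
Overrides: pipe in class Pipe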