Jython:Examples
From Mallet
Training
In order to train, we need to:
- Create a pipe to read in the training data and add the training file to the list of instances for that pipe.
- Create a CRF object and initialize it
- Train the CRF and save the resulting model.
Let's take a look at the code to do this. First of all, we define some parameters for later use. The meaning of each parameter is described when it is used.
""" define some variables so we don't have to search to change them """
defaultLabel = "O"
orders = range(1,2)
iterations = 500
variance = 10
trainingFileName = "CRFtrain"
modelFileName = "CRFmodel"
Next, we create a pipe to read in the input data. Mallet always uses pipes to read in training data. We create the pipe by specifying a list of pipes (or filters) for the data. The first takes the raw strings in the text file and creates a token sequence from them. The second converts that token sequence to the feature vector sequence required for training. Mallet provides a variety of other pipes for manipulating the features (e.g. for adding prefix and suffix features).
"""
Create a pipe to read in the training data, add the default label to that
pipe's target alphabet, and tell the pipe to expect labels on the input
"""
p = List2Pipe((SimpleTaggerSentence2TokenSequence(),
               TokenSequence2FeatureVectorSequence()),
              defaultLabel)
p.setTargetProcessing(1)
trainingData = LineGroupInstanceList(p, trainingFileName)
printDataInfo(p)
List2Pipe takes a Python sequence of Pipes and creates a single Pipe object from them. The preprocessing section has examples of data preprocessing done by adding more pipes. List2Pipe is really just an interface to the SerialPipes Java object. After constructing the pipes, it also adds the default label to the pipe's target alphabet. If you just want to train a CRF model, don't worry about this; it has to do with Mallet's sparse vector representation.
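To illustrate the idea of chaining pipes, here is a minimal sketch in plain Python that does not depend on Mallet. The names `serial_pipe`, `to_tokens`, and `to_feature_sets` are hypothetical stand-ins, not part of mallethon or Mallet; the sketch only shows how a sequence of stages can be folded into a single callable, which is roughly the role the text attributes to List2Pipe.

```python
def serial_pipe(stages):
    """Combine a sequence of one-argument stages into a single callable."""
    def run(data):
        # Feed the output of each stage into the next one, in order.
        for stage in stages:
            data = stage(data)
        return data
    return run

# Hypothetical stages, loosely mirroring the two pipes used above.
def to_tokens(lines):
    # Split each raw line into whitespace-separated tokens.
    return [line.split() for line in lines]

def to_feature_sets(token_rows):
    # Represent each row as the set of features that are present.
    return [set(row) for row in token_rows]

pipe = serial_pipe((to_tokens, to_feature_sets))
result = pipe(["NNP Rockwell", "VBD said"])
```

Adding another preprocessing step then amounts to adding one more stage to the sequence passed to `serial_pipe`.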
The call to setTargetProcessing tells the pipe to expect labels on the input (i.e. the input format should be <feature> <feature> <...> <label> as opposed to just <feature> <feature> <...>).
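The difference between the two input formats can be sketched in plain Python. This is only an illustration, not Mallet's parser: `parse_line` and `read_groups` are hypothetical helpers, and the assumption that instances are separated by blank lines is taken from the `^\s*$` pattern used in the shallow parser example.

```python
def parse_line(line, has_label):
    """Split one input line into (features, label).

    With has_label true, the last whitespace-separated field is the label;
    otherwise every field is a feature and the label is None.
    """
    fields = line.split()
    if has_label and fields:
        return fields[:-1], fields[-1]
    return fields, None

def read_groups(lines, has_label):
    """Group lines into instances separated by blank lines (assumed format)."""
    groups, current = [], []
    for line in lines:
        if line.strip():
            current.append(parse_line(line, has_label))
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

sample = ["Rockwell NNP B-NP", "said VBD B-VP", "", "unit NN B-NP"]
labeled = read_groups(sample, True)
```

Here `labeled` holds two instances; in each token the final field has been split off as the label.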
Next we create the CRF and perform training. The default label is used as context before the start and after the end of an instance. For example: when doing named entity recognition we want this to be the outside label so that we never start a sentence in the middle of a name.
crf = initNewCRF(trainingData, orders, defaultLabel, variance)
crf.train(trainingData,None, None, None, iterations)
Finally, we save the model to a file.
saveModel(crf,modelFileName)
Putting it all together and adding import statements, we have:
""" import statements. """
from mallethon.crfs import *
from edu.umass.cs.mallet.base.fst import CRF4
from edu.umass.cs.mallet.base.pipe.iterator import LineGroupIterator
from edu.umass.cs.mallet.base.fst import SimpleTaggerSentence2TokenSequence
from edu.umass.cs.mallet.base.pipe import TokenSequence2FeatureVectorSequence
import jarray
""" define some variables so we don't have to search to change them """
defaultLabel = "O"
orders = range(1,2)
iterations = 500
variance = 10
trainingFileName = "CRFtrain"
modelFileName = "CRFmodel"
"""
Create a pipe to read in the training data, add the default label to that
pipe's target alphabet, and tell the pipe to expect labels on the input
"""
p = List2Pipe((SimpleTaggerSentence2TokenSequence(),
               TokenSequence2FeatureVectorSequence()),
              defaultLabel)
p.setTargetProcessing(1)
trainingData = LineGroupInstanceList(p, trainingFileName)
printDataInfo(p)
crf = initNewCRF(trainingData, orders, defaultLabel, variance)
crf.train(trainingData,None, None, None, iterations)
saveModel(crf,modelFileName)
Testing
To test the performance of a CRF, we:
- read in the CRF,
- read in the testing data,
- create an evaluator and call its test function
To read in the CRF:
crf = loadModel(modelName)
To read in the testing data, we need to use the same input pipe that was used during training; otherwise the internal representation of the features and labels might be different, leading to incorrect results.
p = crf.getInputPipe()
p.setTargetProcessing(1) # expect labels
testingData = LineGroupInstanceList(p, testFileName)
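Why the pipe must be reused can be seen with a toy alphabet in plain Python. The `Alphabet` class below is a hypothetical simplification, not Mallet's class; it assumes only that features are mapped to integer indices in the order they are first seen.

```python
class Alphabet(object):
    """Toy mapping from entries to integer indices (illustration only)."""
    def __init__(self):
        self.index_of = {}

    def lookup(self, entry, grow=True):
        # Assign the next free index to an unseen entry while growing.
        if entry not in self.index_of:
            if not grow:
                return -1
            self.index_of[entry] = len(self.index_of)
        return self.index_of[entry]

# Alphabet built while reading the training data.
train_alphabet = Alphabet()
train_ids = [train_alphabet.lookup(f) for f in ("NNP", "VBD", "NN")]

# A fresh alphabet built from test data seen in a different order.
fresh_alphabet = Alphabet()
fresh_ids = [fresh_alphabet.lookup(f) for f in ("NN", "NNP", "VBD")]

# The same feature ends up with a different index in each alphabet.
same_meaning = train_alphabet.lookup("NN") == fresh_alphabet.lookup("NN")
```

With a fresh alphabet, "NN" gets index 0 instead of 2, so weights learned for index 2 would be applied to the wrong feature. Taking the pipe from the loaded CRF avoids this.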
Finally, we create an evaluator object and perform the actual evaluation:
TokenAccuracyEvaluator().test(crf, testingData, "Testing", None)
Putting these all together with some import statements, we get:
""" import statements. """
from mallethon.crfs import *
from edu.umass.cs.mallet.base.fst import TokenAccuracyEvaluator
testFileName = "CRFtest"
modelName = "CRFmodel"
crf = loadModel(modelName)
p = crf.getInputPipe()
p.setTargetProcessing(1)
testingData = LineGroupInstanceList(p, testFileName)
TokenAccuracyEvaluator().test(crf, testingData, "Testing", None)
A Shallow Parser
We can get a shallow parser with a very simple modification to the CRF from the previous examples. First, obtain the CoNLL-2000 training and testing data from http://www.cnts.ua.ac.be/conll2000/chunking/ (get train.txt.gz and test.txt.gz). Once these are unzipped, the only change necessary for training is to set the input file name.
trainingData.add(LineGroupIterator(FileReader(File("train.txt")),
                                   Pattern.compile("^\\s*$"), 1))
For evaluation, we now want to use a MultiSegmentationEvaluator. This takes a Java array of start tags and a Java array of continuation tags.
# eval = TokenAccuracyEvaluator() # 0 = don't print viterbi path
startTags = jarray.array(("B-ADJP", "B-ADVP", "B-CONJP", "B-INTJ", "B-LST",
                          "B-NP", "B-PP", "B-PRT", "B-SBAR", "B-VP",), java.lang.String)
continueTags = jarray.array(("I-ADJP", "I-ADVP", "I-CONJP", "I-INTJ", "I-LST",
                             "I-NP", "I-PP", "I-PRT", "I-SBAR", "I-VP",), java.lang.String)
eval = MultiSegmentationEvaluator(startTags, continueTags)
Here we have listed all the B- tags and the corresponding I- tags. It is important that the arrays be the same length and have elements in the same order, since this is how the evaluator matches up the tags. The jarray.array function creates a Java array from a Python sequence. It is documented on the Jython website at http://www.jython.org/docs/jarray.html.
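The role of the parallel tag arrays can be sketched in plain Python. This is not the MultiSegmentationEvaluator implementation; `extract_segments` is a hypothetical helper that assumes a segment opens at a start tag and is extended only by the continuation tag at the same position in the other array.

```python
def extract_segments(tags, start_tags, continue_tags):
    """Return (start_tag, first_index, last_index) for each segment.

    start_tags[k] opens a segment that only continue_tags[k] may extend,
    which is why the two sequences must be parallel.
    """
    continuation = dict(zip(start_tags, continue_tags))
    segments = []
    current = None  # [start_tag, first_index, last_index]
    for i, tag in enumerate(tags):
        if current is not None and tag == continuation[current[0]]:
            current[2] = i  # extend the open segment
            continue
        if current is not None:
            # Any other tag closes the open segment.
            segments.append(tuple(current))
            current = None
        if tag in continuation:
            current = [tag, i, i]
    if current is not None:
        segments.append(tuple(current))
    return segments

start_tags = ("B-NP", "B-VP")
continue_tags = ("I-NP", "I-VP")
tags = ["B-NP", "I-NP", "I-NP", "B-NP", "I-NP", "I-NP", "B-VP"]
segments = extract_segments(tags, start_tags, continue_tags)
```

If the two tuples were given in different orders, "B-NP" would be paired with "I-VP", and no noun phrase longer than one token would be recognized.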
After running jython Train.py and jython Test.py we get the following output:
Testing tokenaccuracy=0.939
B-ADJP segments true=438 pred=404 correct=279 misses=159 alarms=125 precision=0.6906 recall=0.637 f1=0.6627
B-ADVP segments true=866 pred=833 correct=659 misses=207 alarms=174 precision=0.7911 recall=0.761 f1=0.7758
B-CONJP segments true=9 pred=8 correct=5 misses=4 alarms=3 precision=0.625 recall=0.5556 f1=0.5882
B-INTJ segments true=2 pred=3 correct=1 misses=1 alarms=2 precision=0.3333 recall=0.5 f1=0.4
B-LST segments true=5 pred=2 correct=0 misses=5 alarms=2 precision=0 recall=0 f1=0
B-NP segments true=12422 pred=12350 correct=11118 misses=1304 alarms=1232 precision=0.9002 recall=0.895 f1=0.8976
B-PP segments true=4811 pred=4886 correct=4650 misses=161 alarms=236 precision=0.9517 recall=0.9665 f1=0.9591
B-PRT segments true=106 pred=103 correct=74 misses=32 alarms=29 precision=0.7184 recall=0.6981 f1=0.7081
B-SBAR segments true=535 pred=475 correct=414 misses=121 alarms=61 precision=0.8716 recall=0.7738 f1=0.8198
B-VP segments true=4658 pred=4614 correct=4211 misses=447 alarms=403 precision=0.9127 recall=0.904 f1=0.9083
OVERALL segments true=23852 pred=23678 correct=21411 misses=2441 alarms=2267 precision=0.9043 recall=0.8977 f1=0.9009
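The per-tag figures can be checked by hand from the counts. The sketch below uses the standard definitions of precision, recall, and F1, which agree with the numbers printed above; that the evaluator computes them in exactly this way is an assumption, and `segment_scores` is a hypothetical helper.

```python
def segment_scores(true, pred, correct):
    """Compute precision, recall, and F1 from segment counts."""
    precision = correct / float(pred) if pred else 0.0
    recall = correct / float(true) if true else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the B-NP line of the output above.
p, r, f1 = segment_scores(true=12422, pred=12350, correct=11118)
misses = 12422 - 11118   # true segments that were not found
alarms = 12350 - 11118   # predicted segments that were wrong
```

Rounded to four places, these values match the precision, recall, f1, misses, and alarms fields reported for B-NP.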
Getting the Predictions
In order to get the predictions (rather than just an evaluation score), we replace the call to eval (in the evaluation or parser example above) with:
for i in range(0, testingData.size()):
    input = testingData.getInstance(i).getData()
    output = crf.viterbiPath(input).output()
    for j in range(0, output.size()):
        fv = input.get(j)
        print (fv.toString(1).encode()+" "+output.get(j).encode())
    print
On the data from CoNLL-2000, the first few lines of output for this are:
NNP Rockwell B-NP
NNP International I-NP
NNP Corp. I-NP
's POS B-NP
NNP Tulsa I-NP
NN unit I-NP
said VBD B-VP