Main Page
MALLET is an integrated collection of Java code useful for statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text.
Getting Started
Find out about obtaining MALLET and look at a few tutorials.
Features
The toolkit provides facilities for:
- Several classification methods, including naive Bayes, maximum entropy, boosting, and Winnow.
- Highly efficient maximum entropy classifier training, using Nocedal's "Limited-Memory BFGS", a quasi-Newton optimization technique. The classifier also handles arbitrary real-valued features.
- A general framework for finite state transducers.
- An implementation of finite-state Conditional Random Fields, also trained by Limited-Memory BFGS.
- A general framework for optimization (based on "Numerical Recipes in C").
- Recursively descending directories, finding text files.
- Quite arbitrary pipelines of text processing steps.
- Tokenizing a text file, according to arbitrary regular expressions.
- Including N-grams among the tokens.
- Creating real-valued feature vectors, and feature vector sequences.
- Mapping strings to integers and back again, very efficiently.
- Selecting features by information gain, or other measures.
- Building and manipulating feature vectors.
- Saving trained models to disk.
- Performing test-train splits.
- Various evaluation procedures for performing multiple trials and calculating accuracy, precision, recall, F1, etc.
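To make a few of the text-processing items above concrete (regular-expression tokenizing, adding N-grams, mapping strings to integers and back, and building real-valued feature vectors), here is a minimal stand-alone sketch in plain Java. It does not use MALLET's own classes: the names `FeatureSketch`, `Vocabulary`, `tokenize`, `withBigrams`, and `countFeatures` are hypothetical and chosen only for illustration, and MALLET's actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeatureSketch {

    /** Hypothetical two-way map between strings and dense integer indices. */
    static final class Vocabulary {
        private final Map<String, Integer> indexOf = new HashMap<>();
        private final List<String> entries = new ArrayList<>();

        /** Returns the index of the entry, assigning the next free index if it is new. */
        int lookupIndex(String entry) {
            Integer index = indexOf.get(entry);
            if (index == null) {
                index = entries.size();
                indexOf.put(entry, index);
                entries.add(entry);
            }
            return index;
        }

        /** Maps an index back to its string. */
        String lookupEntry(int index) {
            return entries.get(index);
        }

        int size() {
            return entries.size();
        }
    }

    /** Tokenizes text by collecting every match of the given regular expression. */
    static List<String> tokenize(String text, Pattern tokenPattern) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = tokenPattern.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    /** Returns the original tokens followed by all adjacent-pair bigrams. */
    static List<String> withBigrams(List<String> tokens) {
        List<String> result = new ArrayList<>(tokens);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            result.add(tokens.get(i) + "_" + tokens.get(i + 1));
        }
        return result;
    }

    /** Builds a sparse real-valued feature vector: feature index -> count. */
    static Map<Integer, Double> countFeatures(List<String> tokens, Vocabulary vocabulary) {
        Map<Integer, Double> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(vocabulary.lookupIndex(token), 1.0, Double::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Pattern words = Pattern.compile("\\p{L}+"); // runs of letters
        Vocabulary vocabulary = new Vocabulary();
        List<String> tokens = withBigrams(tokenize("the cat saw the dog", words));
        Map<Integer, Double> vector = countFeatures(tokens, vocabulary);
        for (Map.Entry<Integer, Double> feature : vector.entrySet()) {
            System.out.println(vocabulary.lookupEntry(feature.getKey()) + " = " + feature.getValue());
        }
    }
}
```

Keeping the string-to-integer map separate from the feature vectors means each vector only stores integer indices and values, which is one common way to keep such representations compact.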
Developing in MALLET
If you're writing code that uses MALLET (as opposed to using the command line tools), then we have several helpful notes in the developer's corner.
About the MALLET project
MALLET was written by [http://www.cs.umass.edu/~mccallum Andrew McCallum], with contributions from several graduate students and staff at the University of Massachusetts Amherst, including Aron Culotta, Al Hough, Wei Li, David Pinto, Charles Sutton, and Jerod Weinman, as well as contributions from Fernando Pereira, Ryan McDonald, and others at the University of Pennsylvania.
The toolkit is Open Source Software, released under the Common Public License (http://www.opensource.org/licenses/cpl.php). You are welcome to use the code under the terms of the license for research or commercial purposes; however, please acknowledge its use with a citation:
McCallum, Andrew Kachites. "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu. 2002.
Here is a BibTeX entry:
@unpublished{McCallumMALLET,
  author = "Andrew Kachites McCallum",
  title = "MALLET: A Machine Learning for Language Toolkit",
  note = "http://mallet.cs.umass.edu",
  year = 2002
}
Development on MALLET has been supported by these funding agencies.
Mailing Lists
There is a mailing list for MALLET announcements. To subscribe, send an email to mallet-announce-request at cs.umass.edu with "subscribe" as the body of the message.
There is also a mailing list for MALLET developers. To subscribe, send an email to mallet-dev-request at cs.umass.edu with "subscribe" as the body of the message.
Bugs and other issues should be reported to mallet-dev.
Other relevant software
You might also be interested in other similar software packages for machine learning applied to text.