Importing data
MALLET represents data as lists of "instances".
All MALLET instances include a data object. An instance can also include a name and
(in classification contexts) a label. For example, if the application is guessing
the language of web pages, an instance might consist of a vector of word counts
(data), the URL of the page (name) and the language of the page (label).
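As a rough illustration of this structure, the following Python sketch models an instance as a record with data, name and label fields. The Instance class and the example URL are hypothetical stand-ins chosen for illustration; they are not MALLET's own Java classes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    """Hypothetical stand-in for a MALLET instance (not MALLET's own class)."""
    data: Counter                 # word counts for the page
    name: Optional[str] = None    # e.g. the URL of the page
    label: Optional[str] = None   # e.g. the language of the page


# Made-up example page for the language-guessing application.
page = Instance(
    data=Counter("the cat saw the dog".split()),
    name="http://example.com/index.html",
    label="en",
)
print(page.name, page.label, page.data["the"])
```

Only the data field is required here, mirroring the statement that every instance includes a data object while the name and label are optional.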
There are two primary methods for importing data into MALLET format:
one for source data that consists of many separate files, and one for
data contained in a single file, with one instance per line.
One instance per file:
After downloading and building MALLET, change to the MALLET directory.
Assume that text-only (.txt) versions of English web pages are
in files in a directory called sample-data/web/en and text-only
versions of German pages are in sample-data/web/de (both are
available in the downloadable sample data).
Now run this command:
bin/mallet import-dir --input sample-data/web/* --output web.mallet
MALLET will use the directory names as labels and the filenames as instance names.
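The mapping from directory layout to labels and names can be sketched in Python as follows. This is only an illustration of the idea, not MALLET's implementation: the function list_instances is hypothetical, and the exact label and name strings that MALLET stores (for example, full paths rather than bare names) may differ.

```python
from pathlib import Path


def list_instances(directories):
    """Yield (name, label, text) for every regular file in each directory."""
    for directory in directories:
        directory = Path(directory)
        for path in sorted(directory.iterdir()):
            if path.is_file():
                # Label = directory name, instance name = filename.
                yield path.name, directory.name, path.read_text(encoding="utf-8")
```

Under these assumptions, calling list(list_instances(["sample-data/web/en", "sample-data/web/de"])) would yield one tuple per page, labeled "en" or "de" according to the directory it came from.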
One file, one instance per line:
Assume the data is in the following format:
[URL] [language] [text of the page...]
After downloading and building MALLET, change to the MALLET directory
and run the following command:
bin/mallet import-file --input /data/web/data.txt --output web.mallet
In this case, the first token of each line (whitespace delimited, with optional
comma) becomes the instance name, the second token becomes the label, and all
additional text on the line is interpreted as a sequence of word tokens. Note that
the data in this case will be a vector of feature/value pairs, such that a feature
consists of a distinct word type and the value is the number of times that word occurs
in the text.
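The line format described above can be sketched in Python as follows. This is a simplified illustration, not MALLET's parser: the function parse_line and the example line are hypothetical, and the tokenization rule (lowercased runs of letters) and the handling of commas are assumptions made for this sketch.

```python
import re
from collections import Counter


def parse_line(line):
    """Split one line into (name, label, word_counts)."""
    # First token = name, second = label, remainder = text of the page.
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError("expected at least a name and a label")
    # Assumption: a trailing comma after either of the first two tokens is dropped.
    name, label = parts[0].rstrip(","), parts[1].rstrip(",")
    text = parts[2] if len(parts) > 2 else ""
    # Assumption: a word token is a run of letters, converted to lowercase.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Feature/value pairs: distinct word type -> number of occurrences.
    return name, label, Counter(words)


name, label, counts = parse_line("http://example.com/a en The cat saw the dog")
print(name, label, counts["the"])
```

The Counter returned here plays the role of the vector of feature/value pairs: each key is a distinct word type and each value is the number of times that word occurs in the text.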
There are many additional options to the import-dir and import-file
commands. Add the --help option to either of these commands to get a full list.
Some commonly used options to either command are:
--keep-sequence. This option preserves the document as a sequence of word
features, rather than a vector of word feature counts. Use this option for sequence
labeling tasks. The MALLET topic modeling toolkit also requires feature sequences rather
than feature vectors.
--preserve-case. MALLET by default converts all word features to lowercase.
This option keeps the original capitalization instead.
--remove-stopwords. This option tells MALLET to ignore a standard list
of very common English adverbs, conjunctions, pronouns and prepositions.
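The effect of these three options can be approximated with the Python sketch below. It is an illustration only: the features function is hypothetical, the five-word stopword list is a made-up stand-in for MALLET's much longer standard list, and matching stopwords without regard to case is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"the", "and", "of", "in", "it"}


def features(tokens, keep_sequence=False, preserve_case=False, remove_stopwords=False):
    """Mimic the effect of the three options on a list of word tokens."""
    if not preserve_case:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        # Assumption: stopwords are matched without regard to case.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    # A sequence keeps order and repeats; a vector keeps only counts per word type.
    return tokens if keep_sequence else Counter(tokens)


tokens = "The cat and the dog".split()
print(features(tokens))
print(features(tokens, keep_sequence=True, remove_stopwords=True))
print(features(tokens, preserve_case=True))
```

The first call shows the default behavior (a lowercased count vector), the second shows an ordered sequence with stopwords removed, and the third shows that "The" and "the" remain separate features when case is preserved.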