Topic Modeling - Multiple languages
This guide describes topic modeling over several languages jointly.
We align the languages by specifying connections between documents across languages.
We start by importing documents. We need to create one instance list for each language.
The order in which documents appear in the instance lists is important. The Nth document in list 1 is assumed to have the same topic distribution (though a different vocabulary) as the Nth document in list 2, and so on.
As a result, each document can only be aligned with one other document per language: we cannot specify that two English documents align with one French document unless we concatenate the two English documents into a single new document. If a document in one language does not have a comparable document in another language, there must still be a place-holder document in that language, but it can have zero words.
A demonstration corpus is available
here. It contains three eight-line files consisting of short excerpts from Wikipedia articles in English, French, and German, along with small stoplists for each language. Note that the French article for "Pope_Francis" and the German article for "ides_of_march" are missing, but their files still contain lines for them with empty text fields (they're not really missing from the wikis, I just didn't include them).
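To make the alignment concrete, here's a rough sketch of what a few aligned lines of the English and French files might look like, assuming the default import-file line format of name, label, and text separated by tabs (the file names, label values, and text placeholders below are only illustrative). Note the empty text field acting as a place-holder for the missing French "Pope_Francis" article:
en.txt (one article per line):
Pope_Francis    en    <English excerpt text>
ides_of_march   en    <English excerpt text>
fr.txt (same articles, in the same order):
Pope_Francis    fr
ides_of_march   fr    <French excerpt text>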
For the importing step, we can treat each language separately. Here's an example command that will import the English pages. For non-English languages it's especially important to use the --token-regex option. Here I'm defining a token as any sequence of Unicode letter characters.
bin/mallet import-file --input en.txt --stoplist-file en.stops \
--output en.sequences --keep-sequence --token-regex '\p{L}+'
The --print-output option may also be useful to make sure that the text is being properly handled.
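The French and German files can be imported the same way. Assuming the demonstration corpus uses parallel file names (fr.txt and fr.stops, de.txt and de.stops), the commands would be:
bin/mallet import-file --input fr.txt --stoplist-file fr.stops \
--output fr.sequences --keep-sequence --token-regex '\p{L}+'
bin/mallet import-file --input de.txt --stoplist-file de.stops \
--output de.sequences --keep-sequence --token-regex '\p{L}+'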
Next we train a model. The command-line syntax is similar to that of standard Mallet topic models. Instead of specifying a single input instance list, we specify a series of instance lists, one for each language. Since I included eight articles, I'll ask for eight topics (boring, I know). The documents are also fairly short, so I'll set the sum of the alpha hyperparameters to 1.0, reflecting a prior guess of about one eighth of a "pseudo-word" worth of confidence for each topic.
bin/mallet run cc.mallet.topics.PolylingualTopicModel \
--language-inputs de.sequences en.sequences fr.sequences \
--num-topics 8 --alpha 1.0
The order in which we specify the language-specific files doesn't affect the topic model training procedure, but it determines how Mallet refers to each language. Since I specified German, English, and French, the output lists topic words in that order: German is language 0, English is 1, French is 2. Here's some of the output:
6 0 0 0 0 1 0 0 1 0 672 -12319.77767801979
1 0 0 1 0 0 1 0 0 1 676 -12272.078133744493
0 0 1 0 0 1 0 0 1 0 679 -12361.619915606001
0 1 0 0 1 0 0 1 0 0 682 -12240.14886367504
0 0 1 0 0 1 0 0 1 0 685 -12305.073608981767
0 0.125
0 49 0.01 per vor bis dynastie sondervollmachten
verfassungsentscheide mehrere venezuela staatspräsident politiker
caracas sabaneta juli tʃaβes el uɣo chávez rafael lebewesen
1 65 0.01 he his from until death son years five when
psuv united merged foundation republic leader rafaˈel contains
crystal iii
2 98 0.01 il sa que son politique venezuela d ses
raison parti république mort e calendrier maladie meurt serment
quatrième réélu
The block of numbers at the beginning lists the number of milliseconds taken by each iteration (either 1 or 0, most of the time), the total number of milliseconds taken so far (a little over half a second), and the total log probability of the model (numbers closer to 0 are better). Every 50 iterations, we show the top words in each language for each topic. I've copied the output for topic 0 above. The first number indicates the topic (0), followed by its alpha hyperparameter (0.125 = 1.0 / 8). On the next lines we see the three languages. The row for language 0 (German) has its ID, the total number of words currently assigned to the topic (49), the beta hyperparameter for the language (0.01), and the top words. As we might expect, these aren't very good topics!
Many of the options available for standard Mallet topic models are available for multiple languages. Let's say we want to learn hyperparameters:
bin/mallet run cc.mallet.topics.PolylingualTopicModel \
--language-inputs de.sequences en.sequences fr.sequences \
--num-topics 8 --alpha 1.0 --optimize-interval 10 --optimize-burn-in 20
The previous command will iteratively learn weights for the topics (replacing the 0.125 values) and language-specific smoothing parameters (replacing the 0.01 values). For a toy corpus of this size, we might get odd results, but hyperparameter optimization should work fine for any substantial document collection.
Many standard Mallet topic model options write their output to a file. In the multilingual case, these options create one file per language, with the language ID appended to the filename. For example, the command
bin/mallet run cc.mallet.topics.PolylingualTopicModel \
--language-inputs de.sequences en.sequences fr.sequences \
--num-topics 8 --alpha 1.0 --optimize-interval 10 --optimize-burn-in 20 \
--inferencer-filename inferencer
will create three files containing serialized Mallet topic inferencers, which can be used to estimate topic distributions for new documents. The files will be named inferencer.0 for German, inferencer.1 for English, and inferencer.2 for French. These inferencers are exactly like monolingual topic inferencers trained with the standard code. They can infer topics for new monolingual documents using the bin/mallet infer-topics ... command, but they cannot take advantage of relationships between documents in different languages.
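As a sketch of how one of these inferencers might be used: first import the new English documents through the same pipe as the English training data (so the vocabulary and stoplist match), then run the standard inference command. The file names new-en.txt, new-en.sequences, and new-en-doc-topics.txt are just placeholders here.
bin/mallet import-file --input new-en.txt --use-pipe-from en.sequences \
--output new-en.sequences --keep-sequence
bin/mallet infer-topics --inferencer inferencer.1 \
--input new-en.sequences --output-doc-topics new-en-doc-topics.txt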