Kudo (2018) proposes yet another subword segmentation algorithm, the unigram language model. In this post I explain this technique and its advantages over the Byte-Pair Encoding algorithm.
Suppose you have a subword sequence x = [x1, x2, …, xn]. The unigram language model assumes that the subwords of the sentence are independent of one another, so the probability of the sequence is simply the product of the subword probabilities:

P(x) = p(x1) · p(x2) · … · p(xn), with xi ∈ V and ∑x∈V p(x) = 1,
where V is the pre-defined vocabulary. The most probable segmentation x* of the input sentence X is then:

x* = arg max over x ∈ S(X) of P(x),
where S(X) denotes the set of segmentation candidates built from the input sentence X.
x* can be found with the Viterbi algorithm, and the subword probabilities p(x) can be estimated with the Expectation-Maximization (EM) algorithm by maximizing the marginal likelihood of the sentences, treating the segmentations as hidden. However, the vocabulary set is also unknown, so the method starts from a large seed vocabulary and uncovers the final one with the following iterative steps:
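To make the Viterbi step concrete, here is a minimal dynamic program that recovers the most probable segmentation given fixed subword probabilities. The vocabulary and probability values below are made up for illustration, not taken from any trained model.

```python
import math

# Toy unigram probabilities (made-up values for illustration).
probs = {"h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
         "he": 0.1, "ll": 0.1, "lo": 0.1, "hell": 0.15, "hello": 0.3}

def viterbi_segment(text, probs):
    """Most probable segmentation of `text` under a unigram model."""
    n = len(text)
    # best[i] = (log-prob of the best segmentation of text[:i], backpointer)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow the backpointers to recover the subwords.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], math.exp(best[n][0])

pieces, p = viterbi_segment("hello", probs)
```

Here the single piece "hello" (probability 0.3) beats any split, e.g. "hell" + "o" (0.15 × 0.05), because under the independence assumption probabilities multiply.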
- Define a training corpus, a target vocabulary size |V|, and a seed vocabulary. The seed vocabulary can be built from the unique characters of the corpus plus their most frequent combinations, and it must be considerably larger than |V|, since the procedure only removes subwords. Because all characters are included, out-of-vocabulary tokens can never occur.
- Fix the vocabulary and optimize p(x) by the EM algorithm.
- Compute the loss for each subword. This loss is defined as the reduction in the likelihood of the corpus if the subword is removed from the vocabulary.
- Sort the subwords by their losses in decreasing order and keep only the top η% of them. Single-character subwords are always kept, so that any sentence remains segmentable.
- Repeat steps 2–4 until the vocabulary shrinks to the desired size |V|.
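The steps above can be sketched in a few dozen lines. This is a deliberately simplified toy: the seed vocabulary is every substring of a tiny corpus, and a hard-EM approximation (counting subwords in the current Viterbi segmentations) stands in for the full EM step of the paper. The corpus, η value, and helper names are all assumptions for illustration.

```python
import math
from collections import Counter

def viterbi(text, probs):
    """Most probable segmentation of `text` and its log-probability."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)   # (best log-prob up to i, backpointer)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best[n][0]

def train_unigram(corpus, target_size, eta=0.8, em_iters=2):
    """Toy unigram-LM vocabulary learner; `target_size` must exceed the
    number of distinct characters, which are never pruned."""
    # Step 1: seed vocabulary = every substring of the corpus (fine for toy data).
    counts = Counter(s[i:j] for s in corpus
                     for i in range(len(s)) for j in range(i + 1, len(s) + 1))
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}
    chars = {c for s in corpus for c in s}      # always kept: prevents OOV
    while len(probs) > target_size:
        # Step 2: re-estimate p(x). Full EM would use expected counts over all
        # segmentations; counting Viterbi pieces (hard EM) is a simple stand-in.
        for _ in range(em_iters):
            seg = Counter()
            for s in corpus:
                seg.update(viterbi(s, probs)[0])
            total = sum(seg.values())
            probs = {w: (seg[w] + 1e-6) / (total + 1e-6 * len(probs))
                     for w in probs}            # smoothed so no prob is zero
        # Step 3: loss = drop in corpus log-likelihood if the subword is removed.
        base = sum(viterbi(s, probs)[1] for s in corpus)
        losses = {}
        for w in probs:
            if w in chars:
                continue                         # single characters stay
            reduced = {k: v for k, v in probs.items() if k != w}
            losses[w] = base - sum(viterbi(s, reduced)[1] for s in corpus)
        # Step 4: keep single characters plus the top eta fraction by loss.
        ranked = sorted(losses, key=losses.get, reverse=True)
        keep = chars | set(ranked[:max(1, int(eta * len(ranked)))])
        probs = {w: p for w, p in probs.items() if w in keep}
    return probs

vocab = train_unigram(["low", "lower", "lowest"], target_size=12)
```

On this corpus the pruning keeps subwords whose removal would hurt the corpus likelihood most (here, the frequent whole words), while rarely used substrings are dropped first.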
Unigram language model segmentation is based on the same idea as Byte-Pair Encoding (BPE) but offers more flexibility. BPE is deterministic and always produces a single segmentation, while the unigram language model is probabilistic and can output several candidate segmentations together with their probabilities.
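To illustrate this flexibility, the short sketch below enumerates every segmentation of a word under a toy vocabulary and ranks them by probability; the vocabulary and values are made up. Sampling from such a ranked list (rather than always taking the top candidate) is the basis of subword regularization.

```python
def all_segmentations(text, probs):
    """Every segmentation of `text` into known subwords, with its probability."""
    if not text:
        return [([], 1.0)]
    results = []
    for i in range(1, len(text) + 1):
        first = text[:i]
        if first in probs:
            # Multiply the piece's probability into every completion of the rest.
            for rest, p in all_segmentations(text[i:], probs):
                results.append(([first] + rest, probs[first] * p))
    return sorted(results, key=lambda r: -r[1])

# Toy subword probabilities (made-up values).
probs = {"n": 0.1, "e": 0.1, "w": 0.1, "ne": 0.15, "ew": 0.15, "new": 0.3}
candidates = all_segmentations("new", probs)
# The best candidate comes first; the lower-ranked splits ["n","ew"],
# ["ne","w"], and ["n","e","w"] follow with smaller probabilities.
```

Exhaustive enumeration is exponential in the word length, so real implementations use lattice-based n-best or sampling algorithms instead, but the ranked output is the same idea.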