Unigram language model based subword segmentation

Kudo (2018) proposes yet another subword segmentation algorithm, based on a unigram language model. In this post I explain this technique and its advantages over the Byte-Pair Encoding algorithm.

Suppose you have a subword sentence x = [x1, x2, …, xn]. The unigram language model then makes the assumption that the subwords of the sentence are independent of one another, that is

P(\mathbf{x}) = \prod_{i=1}^{n} p(x_i), \qquad \forall x_i \in \mathcal{V}, \quad \sum_{x \in \mathcal{V}} p(x) = 1

where \mathcal{V} is the pre-defined vocabulary. The most probable segmentation x* of the input sentence is then:

\mathbf{x}^{*} = \operatorname*{arg\,max}_{\mathbf{x} \in S(X)} P(\mathbf{x})

where S(X) denotes the set of segmentation candidates created from the input sentence X.
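
To make the two definitions concrete, here is a small sketch with a hypothetical vocabulary and made-up probabilities: it enumerates S(X) for a toy input and picks the segmentation that maximizes P(x).

```python
# Hypothetical vocabulary with made-up unigram probabilities (they sum to 1).
p = {"h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
     "he": 0.2, "ll": 0.2, "llo": 0.15, "hello": 0.25}

def candidates(text):
    """Enumerate S(X): every split of `text` into in-vocabulary subwords."""
    if not text:
        return [[]]
    segs = []
    for end in range(1, len(text) + 1):
        if text[:end] in p:
            segs += [[text[:end]] + rest for rest in candidates(text[end:])]
    return segs

def seg_prob(seg):
    """P(x): the product of the independent subword probabilities."""
    prob = 1.0
    for piece in seg:
        prob *= p[piece]
    return prob

best = max(candidates("hello"), key=seg_prob)
print(best, seg_prob(best))  # ['hello'] 0.25
```

Brute-force enumeration like this is exponential in the sentence length; it only serves to illustrate the definitions.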

x* can be determined with the Viterbi algorithm, and the subword occurrence probabilities can be estimated with the Expectation-Maximization (EM) algorithm by maximizing the marginal likelihood of the sentences, assuming the subword probabilities are unknown. However, the vocabulary set is also unknown, so we treat it as a hidden variable that we uncover with the following iterative procedure:
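
The Viterbi search can be sketched as a simple dynamic program. The vocabulary and probabilities below are made up for illustration, not learned by EM:

```python
import math

# Hypothetical vocabulary with made-up log-probabilities (not learned by EM).
logp = {w: math.log(pr) for w, pr in
        {"h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
         "he": 0.2, "ll": 0.2, "llo": 0.15, "hello": 0.25}.items()}

def viterbi_segment(text):
    """Return the highest-probability segmentation of `text`."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, i = [], n  # trace the backpointers to recover the segmentation
    while i > 0:
        start = best[i][1]
        pieces.append(text[start:i])
        i = start
    return pieces[::-1]

print(viterbi_segment("hello"))  # ['hello']
print(viterbi_segment("hell"))   # ['he', 'll']
```

Unlike brute-force enumeration of S(X), this runs in time quadratic in the sentence length (and linear if subword length is capped).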

  1. Define a training corpus, a maximum vocabulary size |V|, and a seed vocabulary. The seed vocabulary can be defined as the unique characters of the corpus plus the most frequent combinations of them. Since all characters are included, the out-of-vocabulary problem is removed.
  2. Fix the vocabulary and optimize p(x) by the EM algorithm.
  3. Compute the loss for each subword, defined as the reduction in the likelihood of the corpus if that subword is removed from the vocabulary.
  4. Sort the subwords by their losses in decreasing order and keep only the top η% of them.
  5. Repeat steps 2-4 until the vocabulary shrinks to the maximum vocabulary size |V|.
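
The pruning steps (3-4) can be sketched as follows. This toy version takes two shortcuts for brevity: it scores the corpus with the best-segmentation (Viterbi) likelihood rather than the true marginal likelihood, and it uses made-up probabilities in place of the EM step, so it only illustrates the shape of the loop:

```python
import math

corpus = ["hello", "hell", "hole"]  # toy corpus (illustrative)

def viterbi_logprob(text, logp):
    """Log-probability of the best segmentation (a stand-in for the marginal likelihood)."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp:
                best[end] = max(best[end], best[start] + logp[piece])
    return best[n]

def corpus_loglik(vocab_probs):
    logp = {w: math.log(p) for w, p in vocab_probs.items()}
    return sum(viterbi_logprob(s, logp) for s in corpus)

# Seed vocabulary: all characters plus some substrings, with made-up probabilities.
probs = {"h": 0.1, "e": 0.1, "l": 0.1, "o": 0.1,
         "he": 0.2, "ll": 0.1, "lo": 0.1, "hello": 0.2}

chars = {w for w in probs if len(w) == 1}  # single characters are never pruned
base = corpus_loglik(probs)
losses = {}
for w in probs:
    if w in chars:
        continue
    reduced = {k: v for k, v in probs.items() if k != w}
    losses[w] = base - corpus_loglik(reduced)  # likelihood drop if w is removed

# Keep the top η% of the multi-character subwords by loss (here η = 50%).
keep = sorted(losses, key=losses.get, reverse=True)[: len(losses) // 2]
print(sorted(keep))  # ['he', 'hello'] — 'lo' never helps on this corpus, so its loss is 0
```

In the real algorithm the probabilities would be re-estimated by EM after every pruning round, and the loss would be computed from the marginal likelihood over all segmentations.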

The unigram language model segmentation is based on the same idea as Byte-Pair Encoding (BPE) but offers more flexibility: BPE is a deterministic model, while the unigram language model segmentation is probabilistic and can output several candidate segmentations together with their corresponding probabilities.
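
Because every candidate in S(X) has a probability, the model can return an n-best list or sample segmentations rather than a single deterministic output (this is what enables the subword regularization training scheme in the paper). A toy sketch, again with a hypothetical vocabulary and made-up probabilities:

```python
# Hypothetical vocabulary with made-up unigram probabilities.
p = {"h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
     "he": 0.2, "ll": 0.2, "llo": 0.15, "hello": 0.25}

def candidates(text):
    """Enumerate every split of `text` into in-vocabulary subwords."""
    if not text:
        return [[]]
    segs = []
    for end in range(1, len(text) + 1):
        if text[:end] in p:
            segs += [[text[:end]] + rest for rest in candidates(text[end:])]
    return segs

def seg_prob(seg):
    prob = 1.0
    for piece in seg:
        prob *= p[piece]
    return prob

segs = candidates("hello")
total = sum(seg_prob(s) for s in segs)
top = sorted(segs, key=seg_prob, reverse=True)[:3]
for s in top:  # the three most probable segmentations, normalized over S(X)
    print(s, round(seg_prob(s) / total, 3))
```

A BPE tokenizer would always emit the same segmentation for this input; here every alternative split carries an explicit probability that downstream training can exploit.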
