This article explains SentencePiece, a language-independent subword tokenizer and detokenizer introduced by Kudo et al., 2018 and implemented in Python and C++. SentencePiece implements two subword segmentation algorithms, the Byte-Pair Encoding (BPE, Sennrich et al., 2016) and the Unigram language model (Kudo et al., 2018).
Byte Pair Encoding
In information theory, byte pair encoding (BPE) or digram coding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. Look up Wikipedia for a good example of using BPE on a single string.
This technique is also employed in natural language processing models, such as the GPT-2, to tokenize word sequences. Continue reading “Byte Pair Encoding”
With the high performance of Google’s BERT model, we can hear more and more about the Wordpiece tokenisation. There is even a multilingual BERT model, as it was trained on 104 different languages. But how is it possible to apply the same model for 104 languages? The idea of using a shared vocabulary for above 100 languages intrigued me so I drove into it!
Unigram language based subword segmentation
Kudo et al. 2018 proposes yet another subword segmentation algorithm, the unigram language model. In this post I explain this technique and its advantages over the Byte-Pair Encoding algorithm.
Continue reading “Unigram language based subword segmentation”