This article explains SentencePiece, a language-independent subword tokenizer and detokenizer introduced by Kudo and Richardson (2018) and implemented in C++ with a Python wrapper. SentencePiece implements two subword segmentation algorithms: Byte-Pair Encoding (BPE, Sennrich et al., 2016) and the Unigram language model (Kudo, 2018).
Subword tokenization is useful in many areas of Natural Language Processing, and especially in Neural Machine Translation. Many languages are not straightforward to tokenize: some do not put spaces between their text units, and some have a large character inventory in which a character's meaning or pronunciation changes depending on the characters that precede or follow it.
Therefore, subword tokenization techniques are often used in NLP. They provide two additional advantages over word tokenization:
- They remove the out-of-vocabulary problem. Imagine, for instance, translating text from French to English. When a word-level model sees a previously unseen French word, it is not able to translate it. One solution is subword tokenization. If the vocabulary also contains the individual characters, the out-of-vocabulary problem is eliminated, as even previously unseen words can be tokenized into subwords or characters.
- Subword tokenization is able to create a link between words such as “big” and “bigger”, “small” and “smaller”, etc. For instance, if there are a lot of adjectives in our corpus, it is probable that the vocabulary will pick up on common subwords such as “er”.
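To make the two advantages concrete, here is a minimal sketch. It uses a greedy longest-match segmenter, which is neither BPE nor the Unigram language model, only a simple stand-in, and the toy vocabulary is hypothetical rather than learned from a corpus. Because the vocabulary contains every lowercase letter, any lowercase word can be segmented, and the shared piece “er” links the comparative forms.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation with single-character fallback."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and shrink until the vocabulary matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched (character not in the vocabulary): emit it as-is.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary: a few subwords plus all lowercase letters.
vocab = {"big", "small", "er", "est"} | set("abcdefghijklmnopqrstuvwxyz")

print(segment("bigger", vocab))   # ['big', 'g', 'er']
print(segment("smaller", vocab))  # ['small', 'er']
print(segment("tallest", vocab))  # ['t', 'a', 'l', 'l', 'est']
```

The word “tallest” never appears in the vocabulary as a whole, yet it is still segmented: the unknown stem falls back to characters while the known suffix “est” is kept as one piece.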
I described three tokenization techniques in previous articles: WordPiece tokenization (used by BERT), Byte-Pair Encoding and the Unigram language model. SentencePiece is a language-independent subword tokenizer that can be trained on a corpus without pre-tokenization, that is, on raw text that has not first been split into words.
It has four elements:
- The normalizer converts semantically equivalent Unicode characters to a canonical form.
- The trainer trains the subword segmentation model on the normalized corpus. For this, we need to choose the subword model, either BPE or the Unigram language model.
- The encoder runs the normalizer on the input text and then tokenizes it into a subword sequence using the model produced by the trainer. That is, it applies the tokenization to the text.
- The decoder converts a subword sequence back to the normalized text, that is, it detokenizes it.
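The normalizer, encoder and decoder can be sketched with the Python standard library alone. This is a simplified illustration under stated assumptions, not the library's implementation: NFKC normalization stands in for the normalizer, single characters stand in for the pieces a trained model would produce, and the function names are hypothetical. One detail that is not described above but that SentencePiece does rely on is escaping whitespace with the meta symbol “▁” (U+2581), which is what lets the decoder restore the text without language-specific rules.

```python
import unicodedata

WS = "\u2581"  # "▁", the meta symbol used to mark whitespace


def normalize(text):
    # Stand-in normalizer: map equivalent Unicode characters to one form.
    return unicodedata.normalize("NFKC", text)


def encode(text):
    # Normalize, escape spaces, then split into pieces.
    # Assumption: character-level pieces replace a trained subword model.
    escaped = normalize(text).replace(" ", WS)
    return list(escaped)


def decode(pieces):
    # Concatenate the pieces and turn the meta symbol back into spaces.
    return "".join(pieces).replace(WS, " ")


raw = "\ufb01ne tuning"  # starts with the single-character ligature "ﬁ"
pieces = encode(raw)
print(pieces)
print(decode(pieces))  # fine tuning
```

Note that decoding returns the normalized text (“fine tuning” with a plain “f” and “i”), not the raw input containing the ligature.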