WordPiece Tokenisation

With the strong performance of Google’s BERT model, we hear more and more about WordPiece tokenisation. There is even a multilingual BERT model, trained on 104 different languages. But how is it possible to apply the same model to 104 languages? The idea of using a shared vocabulary for more than 100 languages intrigued me, so I dove into it!

Wordpiece is a tokenisation algorithm that was originally proposed in 2015 by Google (see the article here) and was used for translation. The idea of the algorithm is that instead of trying to tokenise a large corpus of text into words, it will try to tokenise it into subwords or wordpieces.

But why is it needed? Why can’t we simply tokenise the text into words? Readers of English and other Latin-script languages may not immediately see the motivation, so let me give you some. The following text is a Japanese poem:


Notice that there are no spaces between the characters. Moreover, there are over 50 000 characters (although people do not usually use that many). Large character inventories and characters that exist in multiple forms (for example, multiple widths) pose problems for NLP researchers, for instance in voice recognition and translation tasks. Furthermore, English text might occur in Japanese text messages (when referring to a webpage, for instance). So not only is it complicated to find a way to tokenise such a complex language, but even if this is done, we would still need to add tokenisation techniques for other languages (English in this example).
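To make these two problems concrete, here is a small sketch using only the Python standard library. The sample sentence is my own illustration (it is not the poem above), and NFKC normalisation is just one common way of folding width variants together; I am not claiming it is what any particular WordPiece implementation does.

```python
import unicodedata

# Illustrative sample: a Japanese sentence with no spaces, then an English fragment.
text = "吾輩は猫である example.com"

# Whitespace splitting treats the whole Japanese sentence as a single "word".
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['吾輩は猫である', 'example.com']

# Full-width Latin letters and half-width katakana are different code points
# from their usual forms, so a naive vocabulary would store them separately.
full_width = "ＢＥＲＴ"
half_width_kana = "ｶﾀｶﾅ"

# NFKC normalisation maps these compatibility variants to one canonical form.
normalised_latin = unicodedata.normalize("NFKC", full_width)
normalised_kana = unicodedata.normalize("NFKC", half_width_kana)
print(normalised_latin)  # BERT
print(normalised_kana)   # カタカナ
```

So a word-level tokeniser based on spaces gets nowhere with the Japanese part, while the English part splits fine: exactly the mixed situation described above.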

What can be done then?

Some tokenisation techniques start from the characters instead of focusing on the words. WordPiece tokenisation is one such method: instead of using word units, it uses subword (wordpiece) units.

It is an iterative algorithm. First, we choose a large enough training corpus and we define either the maximum vocabulary size or the minimum change in the likelihood of the language model fitted on the data. Then the iterative algorithm is constructed in the following manner:

  1. Initialise a vocabulary with the individual characters found in the corpus
  2. Build a language model on the corpus by using the vocabulary from step 1.
  3. Generate one new word unit by combining two elements of the vocabulary. Choose the combined, new subword that increases the likelihood of the language model by the most when added to the model.
  4. Repeat steps 2 and 3 until the maximum vocabulary size is reached or the increase in the likelihood falls below the predefined threshold.

If implemented naively, finding the new token that increases the likelihood the most can be computationally quite expensive (O(|V|^2), where |V| is the vocabulary size). Therefore, training can be sped up by only testing new subwords that actually exist in the corpus, or by only testing those that are likely in the corpus (have a high frequency).
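The whole loop can be sketched in a few lines of Python. This is a toy version under loud assumptions: the "language model" is reduced to simple counts, and the likelihood gain of a merge is approximated by count(ab) / (count(a) * count(b)), a stand-in score that is often used to explain WordPiece, not necessarily the criterion Google used. The function name `train_wordpiece` and the toy corpus are hypothetical, and only the maximum-vocabulary-size stopping rule is implemented. Note that the shortcut from the paragraph above comes for free: only pairs that actually occur next to each other in the corpus are ever scored.

```python
from collections import Counter

def train_wordpiece(corpus_words, max_vocab_size):
    """Toy WordPiece-style training on a {word: frequency} dictionary."""
    # Step 1: start from single characters; non-initial pieces get a '##' prefix.
    splits = {
        word: [c if i == 0 else "##" + c for i, c in enumerate(word)]
        for word in corpus_words
    }
    vocab = {piece for pieces in splits.values() for piece in pieces}

    while len(vocab) < max_vocab_size:
        # Step 2 (simplified): the "model" is just unit counts and adjacent-pair counts.
        unit_counts, pair_counts = Counter(), Counter()
        for word, freq in corpus_words.items():
            pieces = splits[word]
            for piece in pieces:
                unit_counts[piece] += freq
            for pair in zip(pieces, pieces[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single unit

        # Step 3: choose the pair with the best (assumed) likelihood-gain score.
        best = max(
            pair_counts,
            key=lambda p: pair_counts[p] / (unit_counts[p[0]] * unit_counts[p[1]]),
        )
        merged = best[0] + best[1][2:]  # the second piece always starts with '##'
        vocab.add(merged)

        # Apply the chosen merge to every word before the next iteration (step 4).
        for word, pieces in splits.items():
            out, i = [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(pieces[i])
                    i += 1
            splits[word] = out
    return vocab

# Hypothetical toy corpus: word -> number of occurrences.
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
toy_vocab = train_wordpiece(toy_corpus, max_vocab_size=10)
print(sorted(toy_vocab))
```

On this toy corpus the first merge is "##g" + "##s", because the pair is rare but its parts almost never appear apart, which is what the score rewards.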

For multilingual BERT, this procedure was run on the Wikipedia dumps of over 100 languages, weighting each Wiki dump by its inverse size. Altogether, the final vocabulary contains 119 547 wordpieces.

Now if we feed French or German text into the model, it can split the words into subwords it knows. This technique sometimes tokenises a language in an unexpected fashion, but that is not a problem as long as it can tokenise all text (no words are out of the vocabulary) and as long as it is consistent.

Let’s see some tokenisation examples:

from pytorch_transformers import BertTokenizer  # this package has since been renamed to `transformers`

# Load the pretrained multilingual (uncased) WordPiece vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')

# English
tokenizer.tokenize('Can we see each other tomorrow for a drink?')
# ['can', 'we', 'see', 'each', 'other', 'tomorrow', 'for', 'a', 'drink', '?']

# French
tokenizer.tokenize('On se voit demain pour boire un verre?')
# ['on', 'se', 'voit', 'dem', '##ain', 'pour', 'boi', '##re', 'un', 'verre', '?']

# German
tokenizer.tokenize('Können wir uns morgen auf einen Drink sehen?')
# ['konnen', 'wir', 'uns', 'morgen', 'auf', 'einen', 'drink', 'sehen', '?']

We can see that the English and German sentences are tokenised quite well, while the French tokenisation has some surprising elements: “demain” is broken down into two wordpieces, “dem” and “##ain”, where the “##” shows that the subword “ain” continues another subword. Likewise, “boire” is split into “boi” and “##re”.
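How does a word end up split like this? BERT's reference tokenizer is documented as using a greedy longest-match-first strategy, and the sketch below is my simplified version of that idea, not the library code. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical, picked so that the split of “demain” matches the output above.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits here, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (the real multilingual one has 119 547 entries).
toy_vocab = {"dem", "##ain", "boi", "##re", "verre", "un"}

print(wordpiece_tokenize("demain", toy_vocab))  # ['dem', '##ain']
print(wordpiece_tokenize("boire", toy_vocab))   # ['boi', '##re']
print(wordpiece_tokenize("verre", toy_vocab))   # ['verre']
print(wordpiece_tokenize("bière", toy_vocab))   # ['[UNK]']
```

With a vocabulary that contains every single character, the last case would fall back to character-level pieces instead of “[UNK]”, which is why out-of-vocabulary words become rare in practice.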

All in all, this technique adapts surprisingly well to many languages. So even if your data is in a less widely used language, you can use the multilingual BERT model immediately, without pre-training a language model from scratch.


If this is not enough, there is more information in the original article and on this very interesting blog!

Until next time! 🙂


