This article explains SentencePiece, a language-independent subword tokenizer and detokenizer introduced by Kudo et al., 2018 and implemented in Python and C++. SentencePiece implements two subword segmentation algorithms, the Byte-Pair Encoding (BPE, Sennrich et al., 2016) and the Unigram language model (Kudo et al., 2018).

Unigram language based subword segmentation

Kudo et al. 2018 proposes yet another subword segmentation algorithm, the unigram language model. In this post I explain this technique and its advantages over the Byte-Pair Encoding algorithm.

Byte Pair Encoding

In information theory, byte pair encoding (BPE) or digram coding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. Look up Wikipedia for a good example of using BPE on a single string. This technique is also employed in natural language processing models, such as the GPT-2, to tokenize word sequences.


XLNet: Generalized Autoregressive Pretraining for Language Understanding by Yang et al. was published in June 2019. The article claims that it overcomes shortcomings of BERT and achieves SOTA results in many NLP tasks. In this article I explain XLNet and show the code of a binary classification example on the IMDB dataset. I compare the two model as I did the same classification with BERT (see here). For the complete code, see my github (here).

Transformer… Transformer…

Neural Machine Translation [NMT] is a recently proposed task of machine learning that builds and trains a single, large neural network that reads a sentence and outputs a correct translation. Previous state of the art methods [here] use Recurrent Neural Networks and LSTM architectures to model long sequences, however, the recurrent nature of these methods prevents parallelization within training examples and this in turn leads to longer training time. Vaswani et al. 2017 proposes a novel technique, the Transformer, that relies entirely on the Attention Mechanism to model long sequences, thus can be parallelized… Read More

WordPiece Tokenisation

With the high performance of Google’s BERT model, we can hear more and more about the Wordpiece tokenisation. There is even a multilingual BERT model, as it was trained on 104 different languages. But how is it possible to apply the same model for 104 languages? The idea of using a shared vocabulary for above 100 languages intrigued me so I drove into it!

BERT: Bidirectional Transformers for Language Understanding

One of the major advances in deep learning in 2018 has been the development of effective NLP transfer learning methods, such as ULMFiT, ELMo and BERT. The Transformer Bidirectional Encoder Representations aka BERT has shown strong empirical performance therefore BERT will certainly continue to be a core method in NLP for years to come.

Expectation Maximization for MAP estimation

  “Expectation is the root of a heartache.” – William Shakespeare    Expectation–maximization (EM) is an iterative method that attempts to find the maximum likelihood estimator of a parameter θ of a parametric probability distribution. The algorithm computes maximum likelihood estimates of unknown parameters in probabilistic models involving latent variables. Therefore the EM algorithm is an iterative method that alternates between computing a conditional expectation and solving a maximization problem, hence its name.

Introduction to Graph Models

  “Graphical models are a marriage between probability theory and graph theory.” – Michael Jordan, 1998. Probability is very important in modern pattern recognition problems. These problems could be assessed by formulating and  solving difficult probabilistic models, however, using a graphic representation of these probabilistic problems is often highly advantageous for the following reasons: 1) The visualisation of models makes the models themselves easier to understand and handle. They can also help us to distinguish new models or to point out similarities between already existing model structures, that we have not assumed.

Forecasting recessions with economic indicators

Abstract This study compares three economic indicators often used in forecasting recessions: the Yield Spread, the Chicago Index and the Leading index. We find that the latter two predict recessions well one and two quarters ahead, but fail in forecasting recessions on a longer time period. On the contrary, the Yield Spread performs better when forecasting recessions four and six quarters ahead.