This article summarises some discrete probability distributions (the Bernoulli, Binomial and Poisson distributions) and some continuous ones, such as the normal, Student's t and exponential distributions, and graphs them in Python.
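To give a flavour of the plotting code, here is a minimal sketch (assuming scipy and matplotlib; the exact code in the post may differ) that draws the PMF of a Binomial distribution next to the PDF of a standard normal:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Binomial(n=20, p=0.3): discrete, so we plot its probability mass function as bars
    k = np.arange(0, 21)
    plt.bar(k, stats.binom.pmf(k, n=20, p=0.3), alpha=0.6, label="Binomial(20, 0.3)")

    # Normal(0, 1): continuous, so we plot its probability density function as a curve
    x = np.linspace(-4, 4, 200)
    plt.plot(x, stats.norm.pdf(x), color="red", label="Normal(0, 1)")

    plt.legend()
    plt.show()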
Essential continuous probability theory
This blog post explains, in accessible terms, how the notions from the previous post carry over to the continuous case. Probability theory is a large area of mathematics and this post is by no means a complete overview, only the essential background knowledge needed to understand ML models. Let's get into it then!
Essential discrete probability theory
Probability theory is an essential area of mathematics for understanding Machine Learning techniques. This post is a short, accessible introduction to its most important notions. The material is logical and easy to grasp through examples, so don't be afraid of reading it through.
XLNet
XLNet: Generalized Autoregressive Pretraining for Language Understanding by Yang et al. was published in June 2019. The paper claims that it overcomes shortcomings of BERT and achieves SOTA results on many NLP tasks.
In this article I explain XLNet and walk through the code of a binary classification example on the IMDB dataset. I also compare the two models, since I performed the same classification with BERT (see here). For the complete code, see my GitHub (here).
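As a taste of what such a classifier looks like, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name below is the standard pretrained XLNet, but the actual preprocessing, training loop and hyperparameters in the post differ (see the GitHub repository for the real code):

    import torch
    from transformers import XLNetTokenizer, XLNetForSequenceClassification

    # Pretrained XLNet with a binary classification head on top
    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)

    # One IMDB-style review; label 1 stands for a positive opinion
    inputs = tokenizer("A surprisingly moving film.", return_tensors="pt")
    labels = torch.tensor([1])

    outputs = model(**inputs, labels=labels)
    print(outputs.loss, outputs.logits)  # loss to backpropagate, raw class scores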
BERT: Bidirectional Transformers for Language Understanding
One of the major advances in deep learning in 2018 was the development of effective NLP transfer learning methods such as ULMFiT, ELMo and BERT. Bidirectional Encoder Representations from Transformers, aka BERT, has shown strong empirical performance, so it will likely remain a core method in NLP for years to come.
Continue reading “BERT: Bidirectional Transformers for Language Understanding”
Equity codes prediction using Naive Bayesian Classifier with scikit-learn
The aim of this article is to give an introduction to naive Bayesian classification using scikit-learn. Naive Bayesian classification is a simple probabilistic classification method based on Bayes' theorem with strong (so-called naive) independence assumptions between the features. In this article, we will use it to build a basic text prediction system: we will predict equity codes in a search-form fashion (i.e. prediction starts as soon as the user starts typing).
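To illustrate the idea, here is a minimal scikit-learn sketch; the ticker data below is made up, and the article uses its own dataset and feature setup:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy training data: partial queries a user has typed, and the equity code they picked
    queries = ["app", "appl", "apple", "goog", "googl", "micr", "micro"]
    codes   = ["AAPL", "AAPL", "AAPL", "GOOG", "GOOG", "MSFT", "MSFT"]

    # Character n-grams turn each partial query into count features;
    # MultinomialNB then estimates P(code | query) via Bayes' theorem
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
        MultinomialNB(),
    )
    model.fit(queries, codes)

    print(model.predict(["goo"]))  # -> ['GOOG']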
Continue reading “Equity codes prediction using Naive Bayesian Classifier with scikit-learn”
SentencePiece
This article explains SentencePiece, a language-independent subword tokenizer and detokenizer introduced by Kudo et al., 2018 and implemented in C++ with Python bindings. SentencePiece implements two subword segmentation algorithms, Byte-Pair Encoding (BPE, Sennrich et al., 2016) and the unigram language model (Kudo, 2018).
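A minimal usage sketch of the Python package (the file name corpus.txt and the parameters are illustrative; the article goes into both model types in more detail):

    import sentencepiece as spm

    # Train a unigram-LM model on a raw text file (one sentence per line);
    # switch model_type to "bpe" for Byte-Pair Encoding
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="sp", vocab_size=8000, model_type="unigram"
    )

    sp = spm.SentencePieceProcessor(model_file="sp.model")
    print(sp.encode("This is a test.", out_type=str))  # subword pieces, e.g. ['▁This', '▁is', ...]
    print(sp.decode(sp.encode("This is a test.")))     # lossless detokenization back to the input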
Sentiment Analysis: Supervised Learning with SVM and Apache Spark
The objective is two-class discrimination (positive or negative opinion) of movie reviews, using data from the IMDB database (50,000 reviews).
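A minimal PySpark sketch of such a pipeline is given below; the column names and toy data are illustrative, and the article itself covers the real data loading, feature extraction and evaluation (it may also use the older MLlib API rather than pyspark.ml):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LinearSVC

    spark = SparkSession.builder.appName("imdb-svm").getOrCreate()

    # Toy DataFrame standing in for the IMDB reviews (label 1.0 = positive, 0.0 = negative)
    train = spark.createDataFrame(
        [("great movie, loved it", 1.0), ("terrible plot and acting", 0.0)],
        ["review", "label"],
    )

    # Tokenize the text, hash the tokens into term-frequency vectors, then fit a linear SVM
    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="review", outputCol="words"),
        HashingTF(inputCol="words", outputCol="features"),
        LinearSVC(featuresCol="features", labelCol="label"),
    ])
    model = pipeline.fit(train)
    model.transform(train).select("review", "prediction").show()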
Continue reading “Sentiment Analysis: Supervised Learning with SVM and Apache Spark”
Byte Pair Encoding
In information theory, byte pair encoding (BPE) or digram coding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. Look up Wikipedia for a good example of using BPE on a single string.
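As a minimal sketch of that compression step, assuming the data fits in a Python string and that a few unused characters are available as replacement symbols:

    from collections import Counter

    def bpe_compress(data: str, spare_symbols: str = "ZYXW"):
        """Repeatedly replace the most frequent pair of adjacent symbols with a spare symbol."""
        table = {}
        for symbol in spare_symbols:
            pairs = Counter(zip(data, data[1:]))
            if not pairs:
                break
            (a, b), count = pairs.most_common(1)[0]
            if count < 2:
                break  # no pair occurs more than once, so nothing is gained by replacing
            data = data.replace(a + b, symbol)
            table[symbol] = a + b
        return data, table

    # Wikipedia's example string compresses from 11 symbols to 5
    # (ties between equally frequent pairs may be broken differently there)
    print(bpe_compress("aaabdaaabac"))  # ('XdXac', {'Z': 'aa', 'Y': 'Za', 'X': 'Yb'})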
This technique is also employed in natural language processing models, such as GPT-2, to tokenize word sequences. Continue reading “Byte Pair Encoding”
Transformer… Transformer…
Neural Machine Translation [NMT] is an approach to machine translation that builds and trains a single, large neural network that reads a sentence and outputs a correct translation. Earlier state-of-the-art methods [here] use Recurrent Neural Networks and LSTM architectures to model long sequences; however, their recurrent nature prevents parallelization within training examples, which in turn leads to longer training times. Vaswani et al. (2017) propose a novel architecture, the Transformer, which relies entirely on the attention mechanism to model long sequences and can therefore be parallelized and trained faster.
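To make the central idea concrete, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the Transformer (single head, no masking; the paper adds multi-head projections, masking and positional encodings on top of this):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. 2017."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                    # similarity of every query to every key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
        return weights @ V                                 # weighted sum of the values

    # 4 query positions attend to 6 key/value positions of dimension 8 in one shot:
    # no recurrence, just matrix multiplications, hence the easy parallelization
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)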