Probability theory is an essential area of mathematics for understanding machine learning techniques. This post is a short introduction to it, a popularisation of its most important notions. The level is logical and easy to grasp through examples, so don’t be afraid of reading it through.

**What is it?**

Probability is an area of mathematics that quantifies uncertainty. As machine learning tries to deduce patterns that generalise to unobserved data, it is strongly based on the notion of uncertainty.

**A little motivation**

This article covers only the essential basics of probability theory that we need to acquire in order to understand ML. As already said, this level is logical and easy to grasp through examples.

However, for those not yet convinced, let me give some additional motivation through a little historical anecdote. You see, probability theory as we know it today was not the only way people thought about uncertainty. On the contrary, there was quite a big debate about it in the 17th century. Mathematicians thought about it as “chance”, “frequentist probability” and “uncertainty”. What really motivated the development of our comprehension of the subject was something quite noble: *gambling*. Gerolamo Cardano, a struggling Italian mathematician who gambled to keep himself solvent, wrote about basic probability principles (and effective cheating methods) as early as the 16th century (although his work was published only about 100 years later!), while the “Gambler’s dispute” led to the mathematical foundations of probability by Blaise Pascal and Pierre de Fermat in the 17th century. Something that arises so naturally from gambling can only be fun!

Even Thomas Bayes, who contributed massively to probability theory in the 18th century, took a deep interest in it as a hobby; he was a minister by profession.

**The unbearable lightness of uncertainty**

Uncertainty may arise through *incomplete modelling*, *small datasets* or *noisy data*. Take the example of classifying emails as spam or not. We train a model on selected, labelled emails, let’s say on 100 of them. Is that a large enough dataset to ensure that all possible types of spam emails are present? That all important patterns are revealed? The answer is obviously no. What about a dataset of 10 000 emails? Would that make you feel that you can get rid of this sticky feeling of uncertainty? The answer is, again, no. We could only get rid of uncertainty by considering all possibilities – something that is impossible with limited resources. What about emails that contain spelling mistakes? Or emails that were mislabelled in the training data? These will surely not ease our task of classification. However, pretending that human mistakes and noisy data are not present would be an unrealistic assumption – these are other important sources of uncertainty. The bottom line is this: *uncertainty arises because we try to generalise knowledge gained from a limited training set to all unobserved data.* Such a generalisation has to involve uncertainty!

Now let’s stay for a moment with our spam classification example. Billy, your adolescent cousin, who has never sent you an email before, sends you an email containing a bad joke. Almost simultaneously, you receive a real spam email containing some content from a website you have never visited (I believe you). Your algorithm might classify both emails as spam. But would you not prefer to introduce some measure of graduality in your model? Would you not be happier with the model if it said: well, I consider that both emails are spam, but Billy’s email seems less so? This is where probabilistic modelling can help us. We will now predict the probability of spam, and decide (with decision theory) how to make the final call about what should really be considered spam.

Other sources of uncertainty may include unobserved variables that drive a system. Imagine that you have to choose among three options, A, B and C. The three options correspond to a prize you can win, A and B being nothing and C being a Ferrari. Before choosing an option, you don’t know which one will result in your dream car, as your prior knowledge is incomplete. This does not mean, however, that the problem you face is a stochastic one; even deterministic problems can present uncertainty when we have incomplete observations and/or knowledge.

Another source of uncertainty may arise from the nature of the system itself, since there may be stochasticity in it. Examples of this phenomenon are stochastic physical processes (for instance processes in quantum mechanics). Time series modelling can involve both stochastic processes and unobserved variables. Well, we need not talk endlessly about every type of uncertainty we might face; the bottom line is this: uncertainty is inevitable, but probability and decision theory are there to help guide us through it!

# A simple example

Let’s get into probability then by considering a simple example. There are two cylinder-shaped boxes that your grandmother and grandfather prepared for you. Both contain candies, star and triangle shaped ones. You can choose a box and pick one candy without looking into the box. The only thing you know is that grandfather’s box (the grey one) usually contains more triangle shaped candies than star shaped ones, while grandmother usually fills her box with more star shaped candies than triangle shaped ones, the former being healthier for your teeth.

Let’s formalise a few notions. Let *candy* denote the shape of the candy you pick (*star* or *triangle*), and let *box* denote the box you choose (*grandma*’s or *grandpa*’s).

The probability that you pick a star shaped candy from either of the boxes is therefore *P(candy = star)*. Taking a frequentist approach, imagine that you pick candies many, many times from the boxes; each time you observe the shape of the candy and put it back into the box it came from. The fraction of all these trials in which you picked a star shaped candy approximates the probability of star shaped candies. If we continue this endlessly, the approximation gets closer and closer to the true probability of a star shaped candy, which is just the ratio of star shaped candies to the total number of candies: *P(candy = star) = (number of star shaped candies) / (total number of candies)*.

In the example given above this is 0.5 as we have 6 star shaped candies and 12 candies altogether in the two boxes. By similar reasoning, the probability that you pick a triangle shaped candy from all candies is just the ratio of triangles compared to the total number of candies, 0.5.
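This frequentist idea can be sketched in a few lines of Python. The pooled counts (6 star and 6 triangle shaped candies) come from the example above; the fixed random seed and the number of trials are arbitrary choices for illustration:

```python
import random

# Pooled candy counts from the example: 6 star and 6 triangle shaped
# candies across the two boxes (12 candies altogether).
candies = ["star"] * 6 + ["triangle"] * 6

random.seed(42)  # fixed seed so the run is reproducible
trials = 100_000

# Draw with replacement: pick a candy, observe it, put it back.
star_picks = sum(random.choice(candies) == "star" for _ in range(trials))

print(f"relative frequency of star: {star_picks / trials:.3f}")  # close to 0.5
```

As the number of trials grows, the relative frequency tends to the true ratio 6/12 = 0.5.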

Now we see that the probabilities of picking a triangle or a star shaped candy sum to one. Indeed, probabilities should always sum to one over the set of all possible events. This is logical, since we either pick a star or a triangle shaped candy.

Let’s further assume that you are indifferent between the boxes – you like both your grandparents the same way – so the probability that you choose either box is 0.5.

By now we are ready to introduce the idea of the **joint probability**. It refers to the probability that two events occur together, simultaneously. For instance, the probability that you pick grandma’s box and a triangle shaped candy is *P(box = grandma, candy = triangle)*.

Returning to the previous experiment of picking candies endlessly from the two boxes (observing the candy type and the box, and returning the candy to the box it came from), the joint probability of picking a triangle shaped candy and grandma’s box is the fraction of all trials in which these two events occurred together.

There is a relationship between joint probabilities and simple (single-event) probabilities: the probability of choosing a triangle shaped candy is the sum of the joint probabilities of

- triangle shaped candy and grandma’s box
- triangle shaped candy and grandpa’s box

We say that we marginalise out the box, so the probability of choosing a triangle shaped candy is said to be a **marginal probability**: *P(triangle) = P(triangle, grandma) + P(triangle, grandpa)*.

In general, the marginal probability of an event *X* is the sum, over all possible events *Y*, of the joint probabilities of *X* and *Y*. This is the **sum rule** of probability: *P(X) = Σ_Y P(X, Y)*.

Now from the joint probability, we can deduce the **conditional probability**. The probability that we choose a triangle shaped candy and grandma’s box can be broken down into the probability that we choose grandma’s box times the probability of picking a triangle shaped candy *given that we chose grandma’s box* – the latter being the conditional probability of a triangle shaped candy, given grandma’s box: *P(triangle, grandma) = P(triangle | grandma) P(grandma)*.

In general, the joint probability of two events *X* and *Y* can be computed as the product of the conditional probability of *X* given *Y* times the probability of *Y* (and vice versa). This is the **product rule** of probability: *P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)*.

Here we used the symmetry of joint probabilities, that is, *P(X, Y) = P(Y, X)*.

Let’s use now our knowledge about the conditional and joint probabilities to answer some questions:

#### 1) What is the conditional probability *P(triangle | grandma)*?

This is the ratio of triangle shaped candies in grandma’s box: *P(triangle | grandma) = (number of triangles in grandma’s box) / (number of candies in grandma’s box)*.

#### 2) What is the joint probability *P(triangle, grandma)*?

By using the product rule, this is the conditional probability of a triangle shaped candy given grandma’s box times the probability of grandma’s box: *P(triangle, grandma) = P(triangle | grandma) P(grandma)*.

#### 3) We can show that *P(triangle)* = 0.5 by marginalising the box out with the sum rule: *P(triangle) = P(triangle, grandma) + P(triangle, grandpa)*.

By the same reasoning as in questions 1) and 2), *P(triangle, grandpa) = P(triangle | grandpa) P(grandpa)*.

Assume now that you choose a box randomly and without observing it, then you pick a candy and you observe that it is of a triangle shape. What is the conditional probability that the box you chose was grandpa’s, given the triangle shape candy?

**What is P(box = grandpa | candy = triangle)?**

Now to answer this, we will call upon the product rule. The joint probability of two events *X* and *Y* can be computed as the product of the conditional probability and the marginal probability: *P(grandpa, triangle) = P(grandpa | triangle) P(triangle)*.

Now, dividing both sides by *P(triangle)*, we see that *P(grandpa | triangle) = P(grandpa, triangle) / P(triangle)*.

And by applying the product rule again, together with the symmetry property of the joint probability (*P(X, Y) = P(Y, X)*), we get a relationship between the conditional probabilities: *P(grandpa | triangle) = P(triangle | grandpa) P(grandpa) / P(triangle)*.

In general, the relationship between the conditional probabilities of two events *X* and *Y* can be described as: *P(Y | X) = P(X | Y) P(Y) / P(X)*.

And we refer to this relationship as the **Bayes Rule**.
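Spelled out as a short derivation, the rule follows from nothing more than the product rule and the symmetry of the joint probability:

```latex
\begin{align}
P(X, Y) &= P(X \mid Y)\, P(Y)                 && \text{(product rule)} \\
P(Y, X) &= P(Y \mid X)\, P(X)                 && \text{(product rule)} \\
P(X \mid Y)\, P(Y) &= P(Y \mid X)\, P(X)      && \text{(symmetry: } P(X, Y) = P(Y, X)\text{)} \\
P(Y \mid X) &= \frac{P(X \mid Y)\, P(Y)}{P(X)} && \text{(Bayes' rule)}
\end{align}
```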

Now we are ready to respond to the question:

**4) What is *P(box = grandpa | candy = triangle)*?**

By Bayes’ rule, *P(grandpa | triangle) = P(triangle | grandpa) P(grandpa) / P(triangle)*.
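The whole candy example can be sketched in code. The per-box counts below are an assumption made for concreteness – the text only fixes the totals (6 stars and 6 triangles overall, grandma’s box holding mostly stars and grandpa’s mostly triangles):

```python
# Hypothetical per-box counts, consistent with the totals in the text.
boxes = {
    "grandma": {"star": 4, "triangle": 2},  # assumed split
    "grandpa": {"star": 2, "triangle": 4},  # assumed split
}
p_box = {"grandma": 0.5, "grandpa": 0.5}  # prior: indifferent between boxes

def p_candy_given_box(candy, box):
    """Likelihood: ratio of that candy shape within the chosen box."""
    return boxes[box][candy] / sum(boxes[box].values())

def p_candy(candy):
    """Marginal probability via the sum rule."""
    return sum(p_candy_given_box(candy, b) * p_box[b] for b in boxes)

def p_box_given_candy(box, candy):
    """Posterior via Bayes' rule."""
    return p_candy_given_box(candy, box) * p_box[box] / p_candy(candy)

print(p_candy("triangle"))                       # ≈ 0.5, matching the text
print(p_box_given_candy("grandpa", "triangle"))  # ≈ 2/3 under the assumed counts
```

Seeing a triangle raises the probability of grandpa’s box from the prior 0.5 to about 2/3 (with these assumed counts) – exactly the updating that Bayes’ rule formalises.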

**Probability revisited**

Before finishing with the most essential ideas of probability theory applied to ML, let’s go far, far ahead in time. I am interested in the probability that humans exist in 1 million years. Can we approximate this by trials similar to the candy picking example? What about humans in 10 million years? What is the probability that we exist by then?

We can clearly see now the shortcomings of the approach presented in the candy example: it is based on a frequentist interpretation of repeatable events. Repetition is often not possible, and so we search for another, more flexible interpretation of probability. We started this article by stating that “*Probability is an area of mathematics that quantifies uncertainty*”. So let’s now try to interpret probability as a measure of uncertainty. Even though we can’t repeatedly observe whether humanity survives 1 million years, we can have some idea by considering the time of existence of Homo sapiens, the speed of technological evolution and the state we push our lovely Earth into. So we can have some idea about the probability of surviving. This is not enough, however. If in 2000 years humans find another habitable planet, we should be able to update our measure of uncertainty (or certainty) about surviving, and this is what the Bayesian interpretation of probability offers in such an elegant way.

Now we see why Bayesian probability is so important in machine learning. This way of thinking allows us to incorporate uncertainty about the model parameters into ML. Let me explain by naming probabilities in ML terms. In our candy example, *P(box)* (the box being either grandma’s or grandpa’s) is called a **prior probability**, since we assume this probability before picking a candy (and so prior to our trial).

The conditional probability of choosing a candy type given a box, *P(candy | box)*, is called a **likelihood**, since it shows how likely a candy type is given a box.

Finally, the conditional probability of the box given the candy type, *P(box | candy)*, is called the **posterior probability**, since this is a probability after having observed the candy type, so after the trial. The probability of a box *given the candy* was altered by observing the candy we picked.

With this naming, Bayes’ rule (or Bayes’ theorem) can be rewritten for our example as: *P(box | candy) = P(candy | box) P(box) / P(candy)*.

This means that the *posterior probability is proportional to the product of the likelihood and the prior probability*.

Now this is extremely important in machine learning for the following reason. Suppose we have some data *D* and we try to describe the data by a polynomial model. Our model has a vector of parameters, **w**. We can have some idea about the probability of the parameters (*P(w)* – our prior probability), then we observe the data and we can compute the probability of the observed data given the parameter vector, *P(D | w)* – the *likelihood* of the data given the parameter vector. But eventually, what interests us is to find the best model, that is, the best parameters describing the data: *P(w | D)* – the *posterior probability* of the parameters, given the data! There are ML methods that represent more of the frequentist approach, while there are those that stay closer to the Bayesian interpretation of probability. One thing is sure: Bayes’ theorem is central in ML.

So in general terms:
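For the model parameters **w** and the data *D*, Bayes’ rule reads:

```latex
P(\mathbf{w} \mid D) = \frac{P(D \mid \mathbf{w})\, P(\mathbf{w})}{P(D)},
\qquad
\underbrace{P(\mathbf{w} \mid D)}_{\text{posterior}}
\;\propto\;
\underbrace{P(D \mid \mathbf{w})}_{\text{likelihood}}
\times
\underbrace{P(\mathbf{w})}_{\text{prior}}
```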

And that’s it! Let’s do a quick summary about the notions of discrete probability discussed above and then we are done!

# Summary

Let *X* and *Y* be two events. We define the following notions:

- joint probability *P(X, Y)*: the probability that *X* and *Y* occur together
- marginal probability *P(X)*: the probability of *X*, regardless of *Y*
- conditional probability *P(X | Y)*: the probability of *X*, given that *Y* occurred

Rules of discrete probability:

- sum rule: *P(X) = Σ_Y P(X, Y)*
- product rule: *P(X, Y) = P(X | Y) P(Y)*
- Bayes’ rule: *P(Y | X) = P(X | Y) P(Y) / P(X)*

We say that *X* and *Y* are independent if *P(X, Y) = P(X) P(Y)*, or equivalently, if *P(X) = P(X | Y)*.

Finally, the posterior probability of an event is **proportional to the product of the likelihood and the prior probability.**

I hope the basic rules and notions of probability that are essential in ML are a little clearer now. As I wrote in the introduction, this is by no means a complete introduction to probability; if you are interested in more, check out the references! Thanks for reading!

# References

Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.

Murphy, Kevin P. *Machine learning: a probabilistic perspective*. MIT press, 2012.

Bertsekas, Dimitri P., and John N. Tsitsiklis. *Introduction to Probability*, Vol. 1. 2002.
