This blog post popularises some notions from the previous post, applied to the continuous case. Probability theory is a large area of mathematics and this post is by no means a complete overview, only some essential background knowledge to understand ML models! Let’s get into it then!
Why do we need it?
In the previous post you can read about some reasons why we need probability theory in ML: we need to model the uncertainty inherent in our learning problems, and we might gain a better understanding by predicting probabilities instead of hard-coded class memberships.
But why do we need continuous probability theory instead of the discrete one? Well, if whatever you try to model or predict is a discrete variable, the corresponding probability distribution is discrete. In contrast, when trying to predict, say, the height of a person, the duration of an employment, or the weight of a baby, all these variables are continuous, and thus the corresponding probability distributions are continuous. In practice you will encounter continuous probability distributions much more often than discrete ones. However, the underlying methods are quite similar, and once you understand the discrete case and the way it maps to the continuous one, you are ready to tackle any ML algorithm!
PDF and CDF and all this stuff?
Let x be a continuous random variable. The probability density function (PDF) of x is a mapping that assigns a probability density to each real value of x. As x is continuous, we cannot assign a probability to a given real value directly. You can think of this as drawing a line with a pencil on a piece of paper: although you can tell the length of the line you drew, you cannot tell the length of a single point on the line; it does not exist. Similarly, you cannot specify the probability of someone being exactly 178 cm tall, since the height of people is a continuous variable, so it makes more sense to ask: what is the probability that someone is between 178 and 179 cm?
Formally, the probability that a real-valued random variable x falls in the interval (x, x + 𝛿x) is given by p(x)𝛿x as 𝛿x → 0, and p(x) is called the probability density over x. The probability that x belongs to the interval [a, b] is the integral of p(x) between these two values:
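$$p(x \in [a, b]) = \int_a^b p(x)\,\mathrm{d}x$$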
A probability density is non-negative and must integrate to 1 over all real values:
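$$p(x) \geq 0, \qquad \int_{-\infty}^{\infty} p(x)\,\mathrm{d}x = 1$$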
The probability that x is less than or equal to a real value w is given by the cumulative distribution function (CDF) of x:
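$$P(w) = p(x \leq w) = \int_{-\infty}^{w} p(x)\,\mathrm{d}x$$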
The derivative of the CDF is the PDF; conversely, we can obtain the CDF by integrating the PDF.
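To make this concrete, here is a minimal sketch in Python that checks these relationships numerically. The choice of SciPy and of a standard normal distribution here is just for illustration:

```python
from scipy import stats, integrate

# A standard normal distribution as a concrete example
dist = stats.norm(loc=0.0, scale=1.0)

# Interval probability via the CDF: p(a <= x <= b) = CDF(b) - CDF(a)
a, b = 0.5, 1.5
interval_prob = dist.cdf(b) - dist.cdf(a)

# The same probability obtained by integrating the PDF over [a, b]
integral, _ = integrate.quad(dist.pdf, a, b)
print(f"CDF(b) - CDF(a) = {interval_prob:.6f}")
print(f"integral of PDF = {integral:.6f}")  # matches the CDF difference

# The numerical derivative of the CDF recovers the PDF
x0, h = 1.0, 1e-6
cdf_slope = (dist.cdf(x0 + h) - dist.cdf(x0 - h)) / (2 * h)
print(f"d/dx CDF at {x0} = {cdf_slope:.6f}, PDF at {x0} = {dist.pdf(x0):.6f}")
```

Both printed pairs should agree to several decimal places, which is exactly the PDF–CDF relationship stated above.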
Mapping between discrete and continuous cases
In the previous post we saw the notions of joint and conditional probabilities. These notions also exist in the continuous world, with sums replaced by integrals. For instance, the sum rule (marginalisation) takes the form:
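$$p(x) = \int p(x, y)\,\mathrm{d}y$$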
And the product rule is:
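$$p(x, y) = p(y \mid x)\,p(x)$$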
I give some references at the end of the article if you want to dig deeper, but going further requires measure theory, so I’ll stop here! Instead, let’s have a look at some important density functions in the next article! Thanks for reading and see you soon!
References
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
Bertsekas, Dimitri P., and John N. Tsitsiklis. Introduction to Probability. Athena Scientific, 2002.