This blog post covers the basic concepts of information theory needed for ML. Information theory studies the measurement, communication and storage of information.

**What is Information Theory?**

Information theory is the area of mathematics that studies the measurement, communication and storage of information. Its origins lie in signal processing and data compression. Information can be thought of in terms of certainty, or a lack thereof.

The question we pose in this article is the following: how much information do you receive when you observe a random variable? Let this random variable be whether there is oxygen in the air. You observe it, and then you are asked this frustrating question: how much information is conveyed by the occurrence of this event? Well, if you think like the majority of people, you shrug your shoulders and say: “*Since what I observed is always true, not much information is conveyed by this.*”

Now take another example. How much information is conveyed by observing the random variable, *x*, of whether the sun has risen today? Suppose this random variable takes the value that the sun has not risen today. I guess it is quite clear that this news would have a very high *information content*, as the probability of its occurrence is very low. We can think of the information content as some kind of measure of how much information was transferred or observed. Conversely, observing that the sun has risen today would be less surprising, as its probability is higher.

So the point is the following: the information content of a random variable *is inversely related to its probability of occurrence*. If I tell you about something that was highly likely to occur, you are less interested, and you might think that what I just told you was not very important. However, if I tell you about something that is extremely unlikely to happen, you will think that it is quite important. This measure of “*importance*” or “*surprise*” associated with a random variable is called its **information content**.

**Information content**

Let the measure of how much information is conveyed by observing a random variable *x* be *h(x)*.

The measure of the information content, *h(x)*, therefore depends on the probability distribution of the event, *p(x)*. If the probability of *x* is very low, the corresponding information content should be high, while if the probability of *x* is high, the corresponding information content should be low. We therefore look for a functional form of *h(x)* that is monotonically decreasing in the probability of *x*. Finally, we want the information content of two independent variables to be the sum of their individual information contents (as the joint probability of two independent variables is the product of their probabilities).

We happily remember that the logarithmic function satisfies all of these expectations. Therefore, we define the information content of *x* to be:

h(x) = −log₂ p(x)

The choice of the base of the logarithm does not really affect our reasoning, since all bases satisfy the above criteria. However, base 2 and base *e* are often used. If we take the base of the logarithm to be 2, the units of the information content are called *bits*, while if we take base *e*, they are called *nats*.
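As a minimal sketch (the function name and example probabilities below are my own, using only the Python standard library), the information content can be computed directly from an event’s probability:

```python
import math

def information_content(p, base=2):
    """Information content h(x) = -log_base p(x) of an event with probability p."""
    return -math.log(p, base)

# A highly probable event carries little information...
print(information_content(0.999))              # ~0.0014 bits
# ...while a rare event carries a lot.
print(information_content(0.001))              # ~9.97 bits
# The same rare event measured in nats (natural logarithm):
print(information_content(0.001, base=math.e)) # ~6.91 nats
```

Note how the two units differ only by the constant factor ln 2.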

# Entropy

Now suppose that you want to transmit information to someone about the random variable, *x*: whether the sun has risen today. What would be the *expected information content* of this random variable? It has two forms, since *x* can take two values: the sun either rises today or not. So the expected information content is the probability-weighted information content:

H[x] = −∑ₓ p(x) log₂ p(x)

This value is called the *entropy* of the random variable, *x*. It serves as a measure of the uncertainty associated with a given probability distribution. A probability distribution with high entropy is one for which the outcome is more uncertain. Perhaps not surprisingly, the uniform distribution has the highest entropy: our uncertainty is high since all outcomes are equally likely. (Proof here.) Now let’s compute the entropy of a random variable for a concrete example (example taken from: Cover and Thomas, 1991).

Let the random variable be the outcome of tossing an 8-sided die (8-sided instead of the usual 6-sided for mathematical convenience). The outcomes are {1, 2, 3, 4, 5, 6, 7, 8}, and each outcome has equal probability (1/8). Therefore the entropy of this variable is:

H[x] = −8 × (1/8) log₂(1/8) = 3 bits

Now let’s compute the entropy of tossing a biased die with the following probabilities: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64):

H[x] = −(1/2) log₂(1/2) − (1/4) log₂(1/4) − (1/8) log₂(1/8) − (1/16) log₂(1/16) − 4 × (1/64) log₂(1/64) = 2 bits

This example shows what we stated before, that the uniform distribution has the most uncertainty and thus the highest entropy.
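The two dice can be compared in a few lines of Python (a sketch; the `entropy` helper is my own, and the biased probabilities follow the Cover–Thomas example of (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum p * log(p), skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair = [1/8] * 8
biased = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(fair))    # 3.0 bits
print(entropy(biased))  # 2.0 bits -- less uncertainty than the fair die
```

As expected, the fair (uniform) die has the higher entropy.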

Another way of thinking about entropy is in terms of how we can transmit information. In the first example, we toss an 8-sided die and all outcomes are equally likely. If we wanted to write the outcome of the die in binary code, we would need 3 bits to do that, as each bit can signal 2 possibilities, and 2³ = 8. Indeed, the entropy of a random variable can also be seen as a lower bound on the number of bits needed to transmit the state of the random variable. Now we can understand why we called the unit of information content a *bit* when we used base 2. If we use the natural logarithm, the entropy is measured in *nats*.

For continuous probability distributions, the entropy (called the *differential entropy*) is computed as:

H[x] = −∫ p(x) ln p(x) dx

For a joint probability distribution over *(x, y)*, we define the average *conditional entropy* of *x* given *y* (H[x | y]) as the average additional information needed to specify the value of *x* after having observed the value of *y*:

H[x | y] = −∬ p(x, y) ln p(x | y) dx dy

Using the product rule, the *joint entropy* of *x* and *y* (*H[x, y]*) decomposes as:

H[x, y] = H[x | y] + H[y]
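For a discrete distribution, this chain rule between joint, conditional and marginal entropy can be checked numerically. A minimal sketch with a small, hypothetical joint distribution of my own choosing:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A small, hypothetical joint distribution p(x, y) over x, y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(y).
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# Conditional entropy H[x | y] = -sum_{x,y} p(x, y) * log2 p(x | y).
H_x_given_y = -sum(p * math.log2(p / py[y]) for (_, y), p in joint.items())

# Chain rule: H[x, y] = H[x | y] + H[y] -- the two printed values agree.
print(H(joint.values()))
print(H_x_given_y + H(py.values()))
```

The identity holds for any joint distribution, not just this one; it follows directly from p(x, y) = p(x | y) p(y).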

# Implications for ML and the KL divergence

Now how does all this fit into ML? Suppose we want to model the outcome of a random variable, *x*, but we are unable to observe its probability distribution, *p(x)*. Instead, we use an approximating distribution, *q(x)*. When using *q(x)* to model the outcome of *x*, we define the average additional amount of information needed to specify the value of *x* as a result of using *q(x)* instead of the true distribution as the *relative entropy*, or *Kullback–Leibler (KL) divergence*:

KL(p ‖ q) = −∫ p(x) ln (q(x) / p(x)) dx

The KL divergence is always greater than or equal to zero, and it is zero exactly when *p(x)* and *q(x)* are equal. It is non-symmetric: the KL divergence between the distributions *p(x)* and *q(x)* is not equal to the KL divergence between *q(x)* and *p(x)*.
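These properties are easy to see on a small discrete example (a sketch; the helper function and the two distributions below are my own choices):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum p * log(p / q) for discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]

print(kl_divergence(p, q))  # positive: q is not a perfect model of p
print(kl_divergence(p, p))  # 0.0: the divergence vanishes when the distributions match
print(kl_divergence(q, p))  # differs from KL(p || q): the divergence is non-symmetric
```

In ML, minimizing KL(p ‖ q) over the parameters of *q* is one way of fitting a model to data drawn from *p*.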

# Summary

In this blog post, we reviewed the most important notions of information theory used in ML and built up an understanding of entropy. The entropy associated with a probability distribution serves as a measure of uncertainty.

As I always note at the end of my posts, this is in no way a complete introduction to information theory, so feel free to discover more of it if it interests you! I have left some references below. Thanks for reading!

# References

https://www.freecodecamp.org/news/will-the-sun-rise-tomorrow-255afc810682/

Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.

Cover, Thomas M., and Joy A. Thomas. *Elements of Information Theory*. Wiley, 1991.
