We mentioned in the previous articles that probability theory can be combined with decision theory in order to predict a target. What is the link, and what is the place of decision theory in ML? This article tries to find answers to these questions.

# Introduction

The previous article showed the place of probability in ML. Our aim is to predict either the joint probability of the target and the data (p(t, x) – we call this the inference) or the conditional probability of the target given the data (p(t | x) – the posterior probability). Whether we predict the posterior or the joint is largely a conceptual question that depends on the model, as the product rule, p(t, x) = p(t | x) p(x), links these notions together.

Probability theory helps us estimate the posterior or the joint. Now we are left with the question: given that we know either of these quantities, how can we predict the class of a new data point x? Let me be more concrete with an often-cited example of cancer patients: suppose your algorithm analyses MRI scans and you wish to classify the input data into two classes, healthy or sick patients. One way to do this is to choose the rule that minimises the misclassification rate. Would this approach be correct?

Our decision depends on the question we wish to answer. Let's first take a little detour and talk about the different types of errors. If we classify a healthy person as a cancer patient, it is an error and would most probably give the patient some sleepless nights; however, after further tests it would become clear that the patient is healthy, and he would continue his life relieved, happy and, most importantly, in good health. This is called a false positive error, as we falsely classified a healthy person as a cancer patient.

The other possible error is classifying a sick person as healthy. This is called a false negative error, and in this example it is much more dangerous than the false positive one, as the sick patient will continue his life without treatment. The gravity of false negative and false positive errors therefore depends on the problem. This gives an intuitive explanation of why we are not interested in simply minimising the misclassification rate. We want to do more than that: we want a minimal misclassification rate weighted by our knowledge of the problem. In the above example, this means that we do not mind false positives as much as false negatives.

Another way to draw the decision boundary is to try to predict the posterior probability of cancer given the data; let this quantity be p(sick | x). The question is then: what threshold do you set in order to predict cancer? For instance, you might predict cancer whenever the posterior of cancer is greater than the posterior of being healthy, that is:

p(sick | x) > p(healthy | x)

This means that we assign the data point to the class with the highest posterior probability. This is actually the idea behind Naive Bayes, which you can read more about in a separate article. But is this decision boundary optimal? Again, the answer depends on how we weight the different errors. To account for this, we introduce a loss function. In our example, since we would penalise a false negative much more heavily than a false positive, let's say the loss is 10 000 for a false negative classification and only 1 for a false positive one. The total loss is then the loss for each error type times the number of misclassified cases; in our example:

loss = 10 000 × N(false negatives) + 1 × N(false positives)
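As a minimal numerical sketch of this weighted loss (the error counts below are made up purely for illustration):

```python
# Hypothetical counts of each error type on some test set.
n_false_negative = 3    # sick patients classified as healthy
n_false_positive = 30   # healthy patients classified as sick

# Asymmetric losses from the example above.
loss_false_negative = 10_000
loss_false_positive = 1

# Total loss: each error type weighted by its cost.
total_loss = (loss_false_negative * n_false_negative
              + loss_false_positive * n_false_positive)

print(total_loss)  # 30030
```

Even though false positives are ten times more numerous here, the three false negatives dominate the total loss because of their weight.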

The optimal solution minimises the loss function; however, the loss depends on the true class, which we do not know. Therefore, our best guess is to minimise the expected loss, which can be written in terms of probabilities. Let's write the expected loss in more general terms. To do this, we need the notion of a decision region: let Rk be the region of input space in which we assign data x to class k (with k ∈ {sick, healthy}). That is, let's formalise our decision rule as follows:

if x ∈ Rk, we predict class k for x.

Now we can formalise the probability of a mistake, for data x and the two classes {sick, healthy}:

p(mistake) = p(x ∈ Rsick, healthy) + p(x ∈ Rhealthy, sick) = ∫Rsick p(x, healthy) dx + ∫Rhealthy p(x, sick) dx

A visual representation of this problem is shown below. Suppose the joint distributions of the data for sick people and for healthy people are slightly different, and that, as a first step, we have estimated these distributions. Now we define the decision region Rsick as all x less than a threshold X̂: all points that fall in this region are classified as sick. Conversely, all data points greater than or equal to X̂ are classified as healthy.
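The decision rule just described can be sketched as a simple threshold function (the value of X̂, here `x_hat`, is arbitrary and purely illustrative):

```python
X_HAT = 2.5  # hypothetical threshold separating the two regions

def predict(x, x_hat=X_HAT):
    """Assign x to 'sick' if it lies in Rsick = (-inf, x_hat),
    otherwise to 'healthy' (Rhealthy = [x_hat, +inf))."""
    return "sick" if x < x_hat else "healthy"

print(predict(1.0))  # sick
print(predict(3.0))  # healthy
```

The whole decision problem then reduces to choosing `x_hat` well, which is exactly what the expected-loss machinery below formalises.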

Let's now return to minimising the expected loss. In more general terms, for k classes, the probability of error can be seen as the sum of the misclassification probabilities over the cases where we assign a data point x to a class j that is not its true class:

p(mistake) = ∑j ∫Rj ∑k≠j p(x, k) dx

Finally, let's denote by Lkj the loss incurred when we assign class j to a data point whose true class is k. (For instance, in our example this corresponds to the factors of 10 000 and 1 for the false negative and false positive cases.)

Now the expected loss can be written as:

E[L] = ∑k ∑j ∫Rj Lkj p(x, k) dx

Our decision-theory question thus simplifies to the following: for each data point x, choose the decision region Rj (that is, the class j) that minimises the expected loss. This is the same as choosing, for each x, the class j that minimises:

∑k Lkj p(x, k)

By using the product rule, p(x, k) = p(k | x) p(x), we can rewrite this quantity as:

∑k Lkj p(k | x) p(x)

This expression contains a common factor, p(x), and the posterior of each class. Therefore, the decision that minimises the expected loss is the one that, for each new data point, chooses the class j minimising the posterior-weighted loss ∑k Lkj p(k | x); when all errors are penalised equally, this reduces to assigning the class with the highest posterior probability. So even though we could see intuitively at the beginning of the article that it is logical to choose the most probable class, now we can see how this is justified mathematically.
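This decision rule is easy to sketch in code: given the posteriors p(k | x) and a loss matrix Lkj, pick the class with the smallest posterior-weighted loss. The posterior values below are hypothetical, and the losses are the 10 000 / 1 weights from the example above.

```python
classes = ["sick", "healthy"]

# loss[(k, j)]: cost of predicting class j when the true class is k.
loss = {("sick", "healthy"): 10_000,  # false negative
        ("sick", "sick"): 0,
        ("healthy", "sick"): 1,       # false positive
        ("healthy", "healthy"): 0}

def decide(posterior):
    """posterior: dict mapping each class to p(class | x).
    Returns the class j minimising sum_k L[k][j] * p(k | x)."""
    expected = {j: sum(loss[(k, j)] * posterior[k] for k in classes)
                for j in classes}
    return min(expected, key=expected.get)

# Even a small posterior probability of cancer triggers a 'sick'
# prediction, because of the asymmetric loss.
print(decide({"sick": 0.01, "healthy": 0.99}))  # sick
```

Note that if all the off-diagonal losses were equal, `decide` would simply return the class with the highest posterior.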

Now we can visualise this idea by considering the joint distribution of the data and cancer (p(x, sick)) and the joint distribution of the data and the absence of cancer (p(x, healthy)). Suppose the two distributions are as in the following graph:

Now the question is where to allocate the decision regions. Suppose the decision is done as shown on the following graph:

New data points with values less than the red line's value are classified as cancer patients (sick), and higher values are classified as healthy patients. Does this threshold minimise the error? The error is shown by the area filled with orange lines; we see that this decision region is not optimal, as many points that correspond to a much higher p(x, sick) are predicted to be healthy. As the following graph shows, the decision region that respects the highest posterior probability does minimise the expected error.

Here the decision boundary is where the two distributions meet. If the posterior probability of "sick" is higher than that of "healthy", the prediction is "sick", and vice versa. We can see that the corresponding error is minimised by this decision.
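We can check this claim numerically. The sketch below models the two joint distributions as equally weighted Gaussians (the means, variances and priors are illustrative choices, not from the article's figures); with equal priors and variances they cross at the midpoint, and the misclassification probability is indeed lowest when the threshold sits at that crossing.

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Illustrative joint densities: p(x, class) = p(class) * p(x | class),
# with equal priors of 0.5.
def p_sick(x):    return 0.5 * gauss(x, 0.0, 1.0)
def p_healthy(x): return 0.5 * gauss(x, 2.0, 1.0)

def error(threshold, lo=-8.0, hi=10.0, n=2000):
    """p(mistake): healthy mass below the threshold (classified sick)
    plus sick mass above it (classified healthy), by midpoint rule."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        total += (p_healthy(x) if x < threshold else p_sick(x)) * dx
    return total

# The two densities cross at x = 1; the error there beats nearby thresholds.
print(error(1.0) < error(0.0), error(1.0) < error(2.0))  # True True
```

Moving the threshold away from the crossing in either direction only adds area under the larger of the two curves, which is exactly the effect shown by the orange-shaded region in the figure.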

Until now, we have seen that the expected loss is minimised by choosing the class with the highest (loss-weighted) posterior, so the decision rule is trivial once we know the joint or posterior probabilities.

We can also include an additional constraint in our decision rule: we might say that when the model's prediction is too uncertain, we do some additional tests and do not make a clear decision. This is known as the reject region, and you can read more about it in Bishop's Pattern Recognition and Machine Learning (the reference for this article and my absolute favourite in ML). This would correspond, for instance, to the following graph:
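A minimal sketch of this reject option: when the largest posterior falls below a confidence threshold theta, no class is assigned and the case is flagged for further tests. The value of theta here is a hypothetical tuning parameter.

```python
def decide_with_reject(posterior, theta=0.9):
    """posterior: dict mapping each class to p(class | x).
    Return the most probable class, or 'reject' if even that
    class's posterior is below the confidence threshold theta."""
    best = max(posterior, key=posterior.get)
    if posterior[best] < theta:
        return "reject"
    return best

print(decide_with_reject({"sick": 0.95, "healthy": 0.05}))  # sick
print(decide_with_reject({"sick": 0.55, "healthy": 0.45}))  # reject
```

Raising theta widens the reject region: more borderline cases go to additional testing instead of receiving an automatic label.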

# Summary

We separated the prediction/classification problem into two subproblems: first, inferring the joint distribution (and from it the posterior); second, applying a decision rule to assign new data points to the optimal classes. In fact, these two steps can be approached in more than one way:

1. First, we solve the inference step and then the decision step. These methods are called generative models, as they model the joint distribution and therefore need some knowledge of the prior probabilities.
2. Second, we can determine only the posterior probabilities and then use decision theory. These methods do not necessarily need knowledge of the priors (or of the joint distribution) and are called discriminative models.
3. Finally, we can model the outcome with a discriminant function that does not try to estimate probabilities at all but uses some other rule for the decision making: it maps each input directly to a class.

That’s it about decision theory for ML, thanks for reading and check out the references for more!

# References:

Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
