This blog post summarizes the most commonly used evaluation metrics for binary classification.
In the decision theory article I already talked about how the decision regions of a classifier can be chosen. The aim is to choose the decision regions that minimize the loss (error) function. The problem becomes straightforward once we know what kind of classification we wish to implement, and this is exactly what the loss function represents.
Let me make this clearer by using the same example as in the decision theory post: suppose we wish to classify patients' MRIs as cancerous or healthy. What is the aim of our classification?
If we classify a healthy person as a cancer patient (let cancer patient be class 1), it is an error that would most probably give the patient some sleepless nights. However, after further tests it would become clear that the patient is healthy, and they would continue their life relieved, happy and, most importantly, in good health. This error is called a false positive (FP), as we falsely classified a healthy person as a cancer patient.
The other possible error is classifying a sick person as healthy (let healthy be class 0). This error is called a false negative (FN), and in this example it is much more dangerous than the false positive, as a sick patient would continue their life without treatment. The gravity of false negative and false positive errors therefore depends on the problem. Of course, when we classify a healthy person as healthy or a sick person as sick, we have a true negative (TN) or true positive (TP) respectively; these are correct classifications.
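To make these four outcomes concrete, here is a minimal Python sketch that counts them from true and predicted labels (the labels below are made up for illustration, not taken from this example):

```python
# Hypothetical labels for illustration only (1 = cancer, 0 = healthy)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the four possible outcomes of a binary prediction
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # sick predicted sick
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # healthy predicted sick
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # sick predicted healthy
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # healthy predicted healthy

print(tp, fp, fn, tn)  # -> 3 1 1 3
```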
Now that we have reviewed the types of errors we can make, let's introduce some evaluation metrics with a toy example. Suppose we are in the same binary tumour classification problem as above, and suppose the numbers of observations are as follows:
Accuracy
Accuracy is simply the ratio of correct predictions to all predictions.
Thus, we can compute it as:
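$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$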
And using the example above, we find a high accuracy:
Accuracy is a very easy metric to compute and use; however, it is not meaningful when the classes show a large imbalance. Let's say, for instance, that our data is distributed as:
Now the accuracy is even higher than in the previous example:
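To see just how misleading this can be, here is a minimal sketch with made-up counts (not the table above): a model that always predicts "healthy" on a heavily imbalanced dataset reaches 99% accuracy while missing every single tumour.

```python
# Hypothetical, heavily imbalanced data: 990 healthy (0) and 10 sick (1) patients
y_true = [0] * 990 + [1] * 10

# A useless model that predicts "healthy" for everyone
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99 -- looks great, yet every cancer case is missed
```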
This shows the limitations of accuracy: it should not be used with a large class imbalance, and even when the imbalance is not severe, accuracy is not the most informative metric. Thus, we introduce other, more informative ones!
Precision
Precision shows what fraction of the positive classifications was actually correct. It is defined as:
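$$\text{Precision} = \frac{TP}{TP + FP}$$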
In the above example, precision is:
The precision is around 0.76. This tells us that when the model predicts cancer (class 1, sick), this prediction is correct 76% of the time.
Recall
Recall focuses on another question: what fraction of the actual positives was identified correctly? It can be computed as:
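$$\text{Recall} = \frac{TP}{TP + FN}$$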
In our example:
As the model has a recall of 0.94, it correctly identifies 94% of tumours.
We can see that precision and recall give us a better understanding of the model's performance than accuracy. Precision tells us how much we can trust a class 1 prediction, while recall tells us how likely the model is to predict class 1 when the true class is indeed class 1.
Precision and recall should both be considered when evaluating a classification model.
Now we could simply say that the problem of finding the right threshold should be based on maximising both precision and recall. Unfortunately, however, they are often in tension: when precision increases, recall tends to decrease, and vice versa, as the sketch below illustrates.
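The labels and predicted probabilities below are invented for illustration: sweeping the decision threshold upward makes the positive predictions more conservative, so precision rises while recall falls.

```python
# Hypothetical true labels and predicted probabilities of class 1
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.4, 0.5, 0.6, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
# Precision climbs from 0.71 to 1.00 while recall drops from 1.00 to 0.60
```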
Combining precision and recall: F1 score
We can account for precision and recall at the same time by constructing the F1 score, the harmonic mean of the two:
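$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$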
In our example it is around 0.84. The F1 score reaches its maximum of 1 when we have perfect precision and recall.
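In practice these metrics rarely need to be computed by hand; scikit-learn, for example, provides them directly. A minimal sketch, again with made-up labels rather than the example from this post:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.8
print("precision:", precision_score(y_true, y_pred))  # 0.833...
print("recall   :", recall_score(y_true, y_pred))     # 0.833...
print("f1       :", f1_score(y_true, y_pred))         # 0.833...
```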
Conclusion
This brief post goes through accuracy, precision and recall for binary classification. We saw that accuracy is not sufficient to choose the best threshold; we gain more insight by considering precision and recall. Since these two often move in opposite directions, we can use the F1 score, their harmonic mean, when we want to account for both.