Logistic regression

Logistic regression can be thought of as a generalisation of the linear model to classification. This article provides a quick overview and an example in Python for binary and multi-class logistic regressions.

Introduction to classification

Classification is the prediction of a discrete target that takes one of two or more values: the goal is to take an input vector x and assign it to one of K possible classes (let the classes be Ck with k ∈ {1, …, K}). If the classes are mutually exclusive (so an input is assigned to one and only one class), the input space is divided into decision regions, separated by decision boundaries.

Binary classification is the case of two classes, usually coded as y ∈ {0, 1}: the target is either in class 0 (y=0, C₀) or in class 1 (y=1, C₁). In this case, the model's predicted value for y can be interpreted as the probability that the observation belongs to class 1 (C₁).

For more than two classes (K > 2), we often predict a K-dimensional vector such that, if the observation belongs to the jth class, all elements of the target vector are 0 except the jth element, which is 1. For instance, in a multi-class classification with 5 possible outcomes, the following vector corresponds to the 4th class: y=[0, 0, 0, 1, 0]^T. Again, the jth element of the predicted vector can be interpreted as the probability of belonging to the jth class (and so the elements must sum to 1). For non-probabilistic models, other ways to represent the outcomes might be preferred.

In the previous post on linear regression we defined the outcome y as a linear function of the inputs:
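With w denoting the weight vector (the symbol used later in this post) and the bias assumed to be absorbed into w through a constant input feature, the linear model can be written as:

```latex
y(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x}
```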

y is a real number here. When we do classification, however, we would like to interpret the jth value of the output vector as the probability of belonging to the jth class, so its value must lie between 0 and 1. We would also like the elements of the predicted target vector to sum to 1. Thus, we use a nonlinear function f that squashes the values of the linear model into the interval [0, 1]. This function is called the activation function:

Its inverse is referred to as the link function. A decision boundary corresponds to the model output being constant, which (for a monotonic f) means that wᵀx is constant. Hence, the decision boundaries of the classification are linear in x, even though the activation function is non-linear. These types of models are called generalised linear models.
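Written out, with w the weight vector of the linear model and f the activation function, a sketch of the general form is:

```latex
y(\mathbf{x}) = f\left(\mathbf{w}^{T}\mathbf{x}\right)
```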

Introduction to Logistic regression: binary version

Logistic regression can be seen as an adaptation of linear regression to classification. Hence the name, although it should be remembered that logistic regression is a classification method, named “regression” only to show the similarity between linear regression and its classification equivalent. First I introduce binary logistic regression in a probabilistic way, then I give some motivation from a purely linear model perspective.

So let’s first derive the logistic regression from the probabilistic interpretation of the linear regression. We recall the probabilistic way we can define the linear regression:
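A standard way to write this, assuming Gaussian noise with variance σ² around the linear prediction (σ² is notation introduced here for illustration), is:

```latex
p(y \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}\left(y \mid \mathbf{w}^{T}\mathbf{x}, \sigma^{2}\right)
```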

We can define the logistic regression in a similar probabilistic setting. Let’s first focus on the case of binary classification (two possible targets). The normal distribution would no longer be a good choice for describing the target value’s conditional distribution, thus we turn to the Bernoulli distribution.

Recalling the Bernoulli distribution, we know that its mean is the probability of class 1: μ(x)=E[y|x]=p(y=1|x). Thus, the second change we need is to squash the linear function wᵀx into the interval (0, 1). We can do this by using the logistic sigmoid function:

The sigmoid function squashes its input into the interval (0, 1), which allows the probabilistic interpretation. Hence we derived binary logistic regression from linear regression by introducing two small changes: first, replacing the normal distribution with the Bernoulli distribution (which makes much more sense in a binary setting), and second, passing the values of the linear function through the sigmoid. This second change keeps the values between 0 and 1, so that they are a valid mean for the Bernoulli distribution.
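For reference, the logistic sigmoid and the resulting mean can be written as follows (the argument name a is just a placeholder for the linear term):

```latex
\sigma(a) = \frac{1}{1 + e^{-a}}, \qquad \mu(\mathbf{x}) = \sigma\left(\mathbf{w}^{T}\mathbf{x}\right)
```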

Thus, binary logistic regression can be thought of as:
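In the same notation, with Ber denoting the Bernoulli distribution and σ the logistic sigmoid, a compact way to state the model is:

```latex
p(y \mid \mathbf{x}, \mathbf{w}) = \mathrm{Ber}\left(y \mid \sigma(\mathbf{w}^{T}\mathbf{x})\right)
= \sigma(\mathbf{w}^{T}\mathbf{x})^{y}\left(1 - \sigma(\mathbf{w}^{T}\mathbf{x})\right)^{1-y}
```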

Now that we know the exact form of binary logistic regression, we can write down the likelihood of the data and derive the Negative Log Likelihood (NLL), which the Maximum Likelihood Estimator (MLE) minimises:

Unlike in linear regression, minimising the NLL (which coincides with the cross-entropy) has no closed-form solution for the MLE. Instead, we use gradient descent or another optimisation algorithm (such as Newton’s method).

The second way to introduce logistic regression needs a few notions first. Let’s stick with binary logistic regression for a moment, with two classes, C₀ and C₁. Let the probability of belonging to class 1 given the input be p (so p = p(C₁|x)). Then we define the log-odds as the logarithm of the ratio of p and 1-p. (Of course, in the binary setting 1 – p = p(C₀|x).) Logistic regression assumes a linear relationship between the inputs x and the log-odds.
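Assuming N independent observations (x_i, y_i) with y_i ∈ {0, 1}, and writing μ_i for the predicted probability of class 1 (both N and μ_i are notation introduced here), the NLL takes the familiar cross-entropy form:

```latex
\mathrm{NLL}(\mathbf{w}) = -\sum_{i=1}^{N}\left[\, y_i \log \mu_i + (1 - y_i)\log(1 - \mu_i) \,\right],
\qquad \mu_i = \sigma\left(\mathbf{w}^{T}\mathbf{x}_i\right)
```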
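To make the optimisation step concrete, here is a minimal sketch of plain gradient descent on the NLL. It is not the solver scikit-learn uses; the function names, the learning rate, the number of iterations and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, maps real numbers into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y):
    """Negative log likelihood (cross-entropy) of a binary logistic regression."""
    mu = sigmoid(X @ w)
    eps = 1e-12  # guards against log(0)
    return -np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))

def fit_logreg_gd(X, y, lr=0.1, n_iter=2000):
    """Gradient descent on the average NLL; its gradient is X^T (mu - y) / N."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

# Synthetic data (illustrative only): an intercept column plus two features
rng = np.random.default_rng(0)
n = 500
features = rng.normal(size=(n, 2))
X = np.hstack([np.ones((n, 1)), features])
true_w = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=n) < sigmoid(X @ true_w)).astype(float)

w_hat = fit_logreg_gd(X, y)
print("estimated weights:", np.round(w_hat, 2))
print("NLL at zero weights:", round(nll(np.zeros(3), X, y), 2))
print("NLL at fitted weights:", round(nll(w_hat, X, y), 2))
```

The fitted weights should land near the weights used to simulate the data, and the NLL should be lower than at the starting point. Newton’s method would use the second derivatives as well and typically needs far fewer iterations.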
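In symbols, with w the weight vector (and the bias assumed to be absorbed into it), this assumption reads:

```latex
\log\frac{p}{1 - p} = \mathbf{w}^{T}\mathbf{x}
```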

By exponentiating the log-odds, we find the odds:
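Assuming the log-odds are given by the linear term wᵀx, exponentiating both sides gives:

```latex
\frac{p}{1 - p} = \exp\left(\mathbf{w}^{T}\mathbf{x}\right)
```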

Then by writing the same for C₀, substituting it and simplifying the above expression, we find:
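Solving the odds for p (and using p(C₀|x) = 1 - p) leads back to the logistic sigmoid of the linear term:

```latex
p = \frac{\exp(\mathbf{w}^{T}\mathbf{x})}{1 + \exp(\mathbf{w}^{T}\mathbf{x})}
  = \frac{1}{1 + \exp(-\mathbf{w}^{T}\mathbf{x})}
  = \sigma\left(\mathbf{w}^{T}\mathbf{x}\right)
```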

Again, we need to apply MLE to find the optimal parameters. In practice, once we have the parameter estimates from maximum likelihood estimation, we compute the probability that a new input belongs to class 1 by substituting the parameters into the above expression, and the probability of class 0 as 1-p.

Now we still have two questions to answer before moving on to a practical binary example. First, what is the optimal threshold for separating the two classes? As explained above, the logistic regression gives probabilities of belonging to class 0 and 1 for an input x. However, what threshold should we choose to say that x belongs to class 0 or 1? This question is the decision theory part of the problem.

The difficulty is that the optimal threshold depends on the problem. Suppose that we care equally about false negatives (FN: the true target is 1 but we predict class 0) and false positives (FP: we predict class 1 when the true target is 0). In this case, the ROC curve can give some insight into which threshold to choose. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) for different threshold levels. A reasonable choice is the threshold at which the difference between the two, TPR - FPR, is the largest (see more here).

However, scikit-learn's LogisticRegression has no threshold parameter: its predict method assigns the class with the higher predicted probability, which in the binary case means a cut-off of 0.5. This is not a problem if the classes are balanced, since 0.5 will then be close to the optimal threshold anyway.

Fixing the threshold at 0.5 (and balancing the dataset) is also simpler in terms of data. The train set is used to find the parameters of the model, and these give the probabilities of belonging to each class. If we then use the test set to construct the ROC curve and pick a threshold, we have spent the test set on the choice of a hyperparameter, and the score measured on it becomes optimistic (a form of overfitting). We would need an additional validation set to see how the model really performs.

To summarise: the ROC curve can be used to choose the threshold, but the choice needs a separate set of data and is not built into LogisticRegression. So I suggest balancing the classes instead. This pushes the optimal threshold towards 0.5 without using additional data.

The last question is how to evaluate the model. We can use accuracy, log loss, precision, recall, or AUC for this purpose (more about classification evaluation here).
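As a small illustration of these metrics, the sketch below computes them with scikit-learn on a hand-made toy example. The labels and probabilities are invented for the illustration and have nothing to do with the Titanic data used later.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             log_loss, roc_auc_score)

# Toy data (made up): true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_proba = np.array([0.1, 0.4, 0.6, 0.8, 0.7, 0.3, 0.9, 0.2])

# Hard labels at the usual 0.5 threshold
y_hat = (y_proba >= 0.5).astype(int)

# Accuracy, precision and recall need hard labels
toy_accuracy = accuracy_score(y_true, y_hat)
toy_precision = precision_score(y_true, y_hat)
toy_recall = recall_score(y_true, y_hat)

# Log loss and AUC work on the probabilities themselves
toy_log_loss = log_loss(y_true, y_proba)
toy_auc = roc_auc_score(y_true, y_proba)

print(f"accuracy={toy_accuracy:.3f} precision={toy_precision:.3f} recall={toy_recall:.3f}")
print(f"log_loss={toy_log_loss:.3f} auc={toy_auc:.3f}")
```

Accuracy, precision and recall change with the threshold, while log loss and AUC only depend on the predicted probabilities.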

In what follows I show a classical binary classification example (Titanic survival), then talk about multi-class logistic regression.

Binary classification case in practice

To see a binary classification example in practice, I use the data about Titanic survivors downloaded from Kaggle: https://www.kaggle.com/c/titanic/data

The data contains information about the passengers: their class on the Titanic, their name, age, sex, the number of siblings and spouses on the ship (“SibSp”), the number of parents and children on the ship (“Parch”), and information about their ticket, fare and cabin.

import numpy as np

import pandas as pd

train, test = pd.read_csv('titanic/train.csv'), pd.read_csv('titanic/test.csv')
train.head()

First I clean the data: I replace the missing values of “Age” with the median age, and I create a new column “NoRel” (number of relatives) that adds up the number of siblings/spouses (“SibSp”) and the number of parents/children (“Parch”). I also convert “Sex” and “Embarked” to numerical values (although I do not use the latter).

# Clean data
# Missing values: count and percentage per column
total = train.isnull().sum().sort_values(ascending=False)
missing_rate = round(train.isnull().sum()/train.isnull().count()*100, 1)
missing_data = pd.concat([total, missing_rate], axis=1, keys=['Total', 'Rate'])
print(missing_data)
# I will not use the "Cabin" feature as too many values are missing
# Replace missing age with the median value
train["Age"] = train["Age"].fillna(train["Age"].median(skipna=True))

# Replace SibSp and Parch by NoRel (Number of relatives)
train["NoRel"] = train["SibSp"] + train["Parch"]

# Convert the "Sex", "Embarked" to numerical values 
genders = {"male": 0, "female": 1}
train['Sex'] = train['Sex'].map(genders)
ports = {"S": 0, "C": 1, "Q": 2}
train["Embarked"] = train["Embarked"].map(ports)
train.head()
             Total  Rate
Cabin          687  77.1
Age            177  19.9
Embarked         2   0.2
Fare             0   0.0
Ticket           0   0.0
Parch            0   0.0
SibSp            0   0.0
Sex              0   0.0
Name             0   0.0
Pclass           0   0.0
Survived         0   0.0
PassengerId      0   0.0

Using train_test_split and 4 features

After cleaning the data, I select some variables (“Age”, “Sex”, “NoRel”, “Pclass”) and fit a first logistic regression, splitting the data into train and test sets with “train_test_split”. Since the data is not balanced, I add the class_weight="balanced" option to the logistic regression; this weights each observation inversely proportionally to the frequency of its class, so that both classes carry the same total weight. scikit-learn’s logistic regression predicts with a default threshold of 0.5 (a good threshold for a balanced dataset, but not necessarily for an unbalanced one!). The results of the regression are shown below:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             roc_curve, auc, log_loss, accuracy_score)

# Separate the data into train and test set
selected_features = ["Age", "Sex", "NoRel", "Pclass"]
X_train, X_test, y_train, y_test = train_test_split(train[selected_features].values, train.Survived.values, test_size=0.2, random_state=0)

logreg = LogisticRegression(class_weight="balanced")
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)

# Results 
print('Train/Test split results:')
print(f"Model accuracy is : {accuracy_score(y_test, y_pred):.3f}")
print(f"Model log_loss is {log_loss(y_test, y_pred_proba):.3f}")
print(f"Model auc is {auc(fpr, tpr):.3f}")
Train/Test split results:
Model accuracy is : 0.799
Model log_loss is 0.455
Model auc is 0.863

Now, before going further, let’s see the intuition behind the ROC curve. I choose the threshold where the difference between the true positive rate and the false positive rate is the largest. Since TPR - FPR = TPR + (1 - FPR) - 1, this maximises the sum of the TPR and 1 - FPR (sensitivity plus specificity). In this example the chosen threshold lies close to the point where the TPR curve meets the (1 - FPR) curve when both are graphed against the threshold. Moving away from this threshold, any further gain in TPR is paid for by an increase in FPR that is at least as large. In the present case this threshold is around 0.54.

import matplotlib.pyplot as plt

[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)
# Index of the threshold that maximises TPR - FPR
ind = np.argmax(tpr - fpr)
optimal_threshold = thr[ind]

plt.plot(thr, tpr, color="#5DADE2", label="TPR")
plt.plot(thr, 1-fpr, color="#000080", label="1-FPR")
plt.axvline(x=optimal_threshold, color="blue", label=f"Optimal threshold={optimal_threshold:.3f}", ymax=0.9, linestyle="--")
plt.xlabel("Threshold")
plt.ylabel("TPR/ 1-FPR")
plt.title("TPR and (1-FPR) for different thresholds")
plt.legend()
plt.show()

plt.plot(fpr, tpr, color='coral', label=f"ROC curve (AUC={auc(fpr, tpr):0.3f})")
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0,fpr[ind]], [tpr[ind],tpr[ind]], linestyle="--", color="blue", label="Optimal threshold")
plt.plot([fpr[ind],fpr[ind]], [0,tpr[ind]], linestyle="--", color="blue")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR = recall)')
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend()
plt.show()

print(f"Using a threshold of {thr[ind]:.3f} guarantees a tpr of {tpr[ind] :.3f} and a 1-fpr of {1-fpr[ind]:.3f}") 
print(f"So that the fpr is: {fpr[ind]*100:.2f}.")
Using a threshold of 0.539 guarantees a tpr of 0.826 and a 1-fpr of 0.818
So that the fpr is: 18.18.

LogisticRegression does not provide an option to choose the threshold; we could add one by wrapping the LogisticRegression and adding a threshold parameter. However, this can be considered as overfitting the data, as the parameter is chosen after “learning” the model and needs additional data, as explained in the introduction. Of course we could avoid this second point by cross-validation.
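For illustration, a thresholded prediction can be sketched with a small helper around predict_proba. The helper name predict_with_threshold and the synthetic data are hypothetical and not part of scikit-learn or of the Titanic example; the threshold itself would still have to be chosen on data that is not used for the final evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def predict_with_threshold(model, X, threshold=0.5):
    """Label an observation 1 when its predicted P(y=1|x) reaches `threshold`."""
    proba_class1 = model.predict_proba(X)[:, 1]
    return (proba_class1 >= threshold).astype(int)

# Synthetic binary data, only to make the sketch runnable
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

default_labels = predict_with_threshold(model, X_demo, threshold=0.5)
strict_labels = predict_with_threshold(model, X_demo, threshold=0.7)

print("positives at threshold 0.5:", default_labels.sum())
print("positives at threshold 0.7:", strict_labels.sum())
```

Raising the threshold can only turn predicted positives into negatives, so it trades recall for precision.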

Cross validated logistic regression

from sklearn.model_selection import cross_val_score

logreg = LogisticRegression(class_weight="balanced")
# Use cross_val_score function
# We pass the entirety of X and y, not X_train or y_train: cross_val_score takes care of splitting the data
# cv=10 for 10 folds
# scoring is one of 'accuracy', 'neg_log_loss', 'roc_auc' (among many other available metrics)
scores_accuracy = cross_val_score(logreg, train[selected_features], train.Survived, cv=10, scoring='accuracy')
scores_log_loss = cross_val_score(logreg, train[selected_features].values, train.Survived.values, cv=10, scoring='neg_log_loss')
scores_auc = cross_val_score(logreg, train[selected_features], train.Survived, cv=10, scoring='roc_auc')
print("K-fold cross-validation results:")
print(f"logistic regression average accuracy is {scores_accuracy.mean():2.3f}")
print(f"logistic regression average log_loss is {-scores_log_loss.mean():2.3f}")
print(f"logistic regression average auc is {scores_auc.mean():2.3}" )
K-fold cross-validation results:
logistic regression average accuracy is 0.787
logistic regression average log_loss is 0.466
logistic regression average auc is 0.85

Multi-class logistic regression

First let’s expand binary logistic regression to the mutually exclusive multi-class case. Let the number of classes be K (an integer, K > 2) and let’s assume that each observation belongs to one and only one class. Let the probability that an observation x belongs to class i be denoted p(Ci|x).

The multi-class logistic regression assumes a linear relationship between the log-odds of the different classes and the inputs, thus it is specified in terms of K-1 log-odds transformations:
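One common way to write these transformations takes the last class, C_K, as the reference class and gives each other class its own weight vector w_i (both the choice of reference class and the symbol w_i are assumptions of this sketch):

```latex
\log\frac{p(C_i \mid \mathbf{x})}{p(C_K \mid \mathbf{x})} = \mathbf{w}_i^{T}\mathbf{x},
\qquad i = 1, \dots, K-1
```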

Then by simple calculations, we find:
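Assuming class K is the reference class with weight vectors w_1, …, w_{K-1} for the other classes, exponentiating the log-odds and using the fact that the K probabilities sum to 1 gives:

```latex
p(C_i \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_i^{T}\mathbf{x})}{1 + \sum_{j=1}^{K-1}\exp(\mathbf{w}_j^{T}\mathbf{x})},
\quad i = 1, \dots, K-1,
\qquad
p(C_K \mid \mathbf{x}) = \frac{1}{1 + \sum_{j=1}^{K-1}\exp(\mathbf{w}_j^{T}\mathbf{x})}
```

For K = 2 this reduces to the sigmoid expression of the binary case.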

The optimal values of the parameters are then found by MLE (with more than two classes, the target follows a multinomial distribution). Just as in the binary case, we write the log likelihood, and the MLE parameters minimise the negative log likelihood. We find the parameters by iterative optimisation, as there is no closed-form solution.
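A minimal multi-class sketch with scikit-learn could look as follows. It uses the Iris dataset bundled with scikit-learn (three classes) rather than the Titanic data, and the variable names, split size and max_iter value are assumptions made for the example. In recent scikit-learn versions the default solver fits a multinomial (softmax) model, which uses one weight vector per class instead of K-1 but describes the same family of probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Three mutually exclusive classes, four numeric features
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris)

# max_iter is raised so the optimiser has room to converge on unscaled features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# One probability per class for each observation; each row sums to 1
proba = clf.predict_proba(X_te)
test_accuracy = clf.score(X_te, y_te)

print("first rows of predicted probabilities:")
print(proba[:3].round(3))
print(f"test accuracy: {test_accuracy:.3f}")
```

The predicted class is simply the one with the largest probability in each row.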

Conclusion

In this post we went through logistic regression for binary and multi-class data. We motivated logistic regression as an extension of linear regression to distinct classes, as well as from another perspective: assuming that the log-odds are linear in the inputs.

Logistic regression is an efficient and straightforward algorithm: it is easy to implement and does not require much computational power or feature scaling. However, as its parameters are derived by MLE, it is prone to overfitting. It also becomes less efficient when used with a large number of categorical features. Finally, it does not handle highly correlated features well.

