Like logistic regression, probit regression is called a regression only because of its similarity to linear regression; it is in fact a classification method. This post gives a brief introduction to probit regression.
Probit regression is quite similar to logistic regression: it is a classification algorithm based on a generalised linear model (GLM). The logistic regression article contains a longer introduction covering classification and GLMs, but in summary, all we need to keep in mind for the moment is that generalised linear models can be written as:

y = f(w^T x)

where f is an activation function acting on a linear model of the input x (the vector product with the weights w).
Classification tries to predict two or more distinct target values. It can thus be thought of as a special case of linear regression: one that maps the continuous target variable to distinct values. To do this mapping, we use an activation function (f) to squash the continuous output of the generalised linear model into the interval [0, 1], allowing a probabilistic interpretation. Binary logistic regression uses the logistic sigmoid as activation function, while multi-class logistic regression uses the softmax function. Probit regression is very similar to logistic regression, but it uses the standard normal CDF as activation function.
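To make the two activation functions concrete, here is a small sketch (Python with NumPy/SciPy; the function names and test points are my own, not from the post) that evaluates both the logistic sigmoid and the standard normal CDF on the same inputs. Both map any real number into [0, 1]:

```python
import numpy as np
from scipy.stats import norm

def logistic_sigmoid(a):
    # Logistic sigmoid: squashes a real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def probit_activation(a):
    # Standard normal CDF, the activation used by probit regression.
    return norm.cdf(a)

a = np.linspace(-4.0, 4.0, 9)
print(logistic_sigmoid(a))   # values in (0, 1), 0.5 at a = 0
print(probit_activation(a))  # also 0.5 at a = 0, but its tails decay faster
```

The two curves look almost identical near zero; the difference is in the tails, which is why the two models behave differently on outliers (more on this below).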
Binary classification happens when we try to map the input into one of two distinct classes. Probit regression assumes that the posterior class probabilities are given by a transformation acting on a linear function of the input variables, x:

p(y = 1 | x) = f(w^T x)
where x is the input, y is the target vector, and f is an activation function.
We could use a simple threshold for the classification, in the following way: for each input x_i, we first evaluate the linear model (w^T x_i), then compare its value with a threshold, t. If the value is higher than the chosen threshold, the corresponding target variable is 1; otherwise it is 0.
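The thresholding procedure above can be sketched as follows (a minimal illustration; the function name and example data are my own, not from the post):

```python
import numpy as np

def threshold_classify(X, w, t=0.0):
    # Evaluate the linear model w^T x for each row of X,
    # then assign class 1 where the value exceeds the threshold t.
    a = X @ w
    return (a > t).astype(int)

X = np.array([[2.0, 1.0],
              [-1.0, 0.5]])
w = np.array([1.0, -1.0])
print(threshold_classify(X, w))  # → [1 0]
```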
If the threshold (t) is drawn from a probability distribution p(t), then the activation function (f) is the cumulative distribution function:

f(a) = p(t ≤ a) = ∫_{−∞}^{a} p(t) dt
If we assume that the distribution of the threshold is a standard normal, the activation function becomes the inverse probit function, defined as:

Φ(a) = ∫_{−∞}^{a} N(θ | 0, 1) dθ
A linear rescaling of this function does not matter (it is equivalent to rescaling the parameters w), so we can use the closely related erf function instead:

erf(a) = (2 / √π) ∫_{0}^{a} exp(−θ²) dθ
The erf function is related to the inverse probit function by:

Φ(a) = (1/2) { 1 + erf(a / √2) }
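This identity is easy to check numerically; a quick sketch using SciPy's `norm.cdf` and `erf`, both of which follow the standard definitions above:

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

a = np.linspace(-3.0, 3.0, 13)
lhs = norm.cdf(a)                          # inverse probit, Φ(a)
rhs = 0.5 * (1.0 + erf(a / np.sqrt(2.0)))  # ½{1 + erf(a/√2)}
print(np.allclose(lhs, rhs))  # → True
```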
The GLM that uses the inverse probit function as activation function is called probit regression:

p(y = 1 | x) = Φ(w^T x)
As with logistic regression, we can find the parameters of the model by maximum likelihood estimation (MLE).
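As a sketch of what the MLE looks like in practice (the synthetic data and the use of SciPy's generic optimiser are my own choices, not anything prescribed by the post), we can minimise the negative Bernoulli log-likelihood with p(y = 1 | x) = Φ(w^T x):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic data drawn from a probit data-generating process:
# y = 1 exactly when w_true^T x plus standard normal noise exceeds 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -2.0])
y = (X @ w_true + rng.normal(size=200) > 0).astype(float)

def neg_log_likelihood(w):
    # Negative Bernoulli log-likelihood of the probit model.
    p = norm.cdf(X @ w)
    eps = 1e-9  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

w_hat = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(w_hat)  # should be close to w_true
```

In practice one would use a dedicated implementation (e.g. `statsmodels`' `Probit`) rather than a hand-rolled likelihood, but the fitting principle is the same.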
This post went through the theory behind the probit model. It is quite similar to logistic regression: a classification method built on a generalised linear model, with the standard normal CDF as activation function. However, probit regression is a little more sensitive to outliers than logistic regression, as the logistic sigmoid has slightly fatter tails. The parameters of the probit model are also harder to interpret.