Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation (MLE) is a method for estimating the parameters of a probability distribution from observed data. It works by finding the parameter values that maximise the likelihood function of the observed data, i.e. the values of the probability density function under which the observed data are most probable. This blog post gives a brief overview of MLE.

Introduction

Suppose that I show you a sample of random variables (x1, x2, …, xN) generated from a Poisson distribution, but I am mean and don’t tell you the parameter of the distribution, θ. You are naturally very keen on estimating θ.

One approach to the problem is to pose the question: what is the probability of observing the sample I showed you, assuming that the sample was generated from a Poisson distribution with an as yet unknown parameter θ? This probability is the joint probability of the data and, assuming that the data are i.i.d., we can write it as:

p(x1, x2, …, xN | θ) = p(x1 | θ) · p(x2 | θ) · … · p(xN | θ)

Of course, this joint probability depends on the unknown parameter θ. So the question we pose now is: what value of θ would maximise the probability of observing the sample I showed you? If you can find the value of θ that maximises this probability, it should be a pretty good estimate of the distribution’s true parameter, right?

Thus, MLE estimates parameters by choosing the values that maximise the probability of observing the sample we actually got: it returns the parameter values that make the observed data most probable.

MLE in practice

We started the example above by assuming that you observe a sample (x1, x2, …, xN) generated from a Poisson distribution. In other words, you observed a sample of random variables and assumed that the true distribution belongs to the family of Poisson distributions. This is the hypothesis you make for MLE. Once you have made this assumption, you can use MLE to estimate the unknown parameter of the distribution; however, your hypothesis about the family of distributions must be good for MLE to work well.

The MLE approach is based on the likelihood function: the probability of observing the data given a value of the unknown parameter. Our goal is to find the parameter values of the distribution that maximise this likelihood of the observed data.

In summary, the steps of maximum likelihood estimation are:

  1. Assume a family for the underlying distribution (for instance Poisson or Normal distribution). Let the probability density function, or pdf, for a random variable, x, conditioned on a set of parameters, θ, be denoted f(x|θ).
  2. Create the likelihood function of the observed data. The likelihood function (the joint density) of N independent and identically distributed (i.i.d.) random variables is the product of the individual densities:

     L(θ | x) = f(x1 | θ) · f(x2 | θ) · … · f(xN | θ) = ∏ f(xi | θ)

We call this joint density the likelihood function, conditional on the unobserved parameter(s) θ; here x denotes the collection of all observed data. The likelihood describes the underlying data-generating process: once we have found the parameters, we could use it to generate artificial data.

It is usually easier to work with the log likelihood function:

ln L(θ | x) = ln f(x1 | θ) + ln f(x2 | θ) + … + ln f(xN | θ) = Σ ln f(xi | θ)

  3. Now we face an optimisation problem: we need to find the parameter(s) θ that maximise the likelihood function. The value of θ that maximises the log likelihood also maximises the likelihood (since the log function is monotonically increasing), therefore

     θ_ML = arg maxθ ln L(θ | x) = arg maxθ L(θ | x)

This means that the ML estimate of the parameter(s) θ is the value of θ that maximises the log likelihood function. Let’s see two examples: one discrete case and one continuous case for the MLE!

Example of ML with discrete data

Suppose you observe 10 i.i.d. observations {1, 5, 0, 1, 2, 3, 4, 1, 0, 3} and you assume that the underlying distribution is a Poisson distribution, that is:

f(x | θ) = θ^x · e^(−θ) / x!,   x = 0, 1, 2, …

As you assume that the observations are i.i.d., their joint density, the likelihood, is the product of the individual densities:

L(θ | x) = ∏ θ^(xi) · e^(−θ) / xi! = θ^(Σ xi) · e^(−10θ) / (x1! · x2! · … · x10!)

And the log likelihood is:

ln L(θ | x) = (Σ xi) · ln θ − 10θ − Σ ln(xi!)

For our observed sample {1, 5, 0, 1, 2, 3, 4, 1, 0, 3} we have

Σ xi = 1 + 5 + 0 + 1 + 2 + 3 + 4 + 1 + 0 + 3 = 20

and

Σ ln(xi!) = ln(1! · 5! · 0! · 1! · 2! · 3! · 4! · 1! · 0! · 3!) ≈ 12.242

so the log likelihood of the observed sample is:

ln L(θ | x) = 20 · ln θ − 10θ − 12.242

Now the question is: what value of θ maximises the likelihood of observing the above sample? Taking the derivative of the log likelihood function with respect to θ, setting it to zero and solving for θ, we find:

d ln L(θ | x) / dθ = 20/θ − 10 = 0   →   θ_ML = 20/10 = 2

And since the second derivative, −20/θ², is negative, this stationary point is indeed a maximum.
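For completeness (a quick check not in the original post), the same stationary point can be verified symbolically, for example with sympy:

from sympy import symbols, log, diff, solve

theta = symbols("theta", positive=True)
log_likelihood = 20 * log(theta) - 10 * theta - 12.242

# Solve d/dθ ln L(θ) = 0 and inspect the second derivative
print(solve(diff(log_likelihood, theta), theta))  # [2]
print(diff(log_likelihood, theta, 2))             # -20/theta**2, negative, so a maximum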

The following code plots the likelihood and the log likelihood for different values of θ and shows that θ = 2 is indeed the maximiser. (The likelihood and log likelihood functions are scaled and shifted so that they fit visibly on the same graph.)

import numpy as np
import matplotlib.pyplot as plt
from math import exp, log, factorial


x_values = [1, 5, 0, 1, 2, 3, 4, 1, 0, 3]
theta_values = [0.7 + x * 0.1 for x in range(0, 30)]

# Poisson likelihood of a single observation; each term is scaled by 10**0.9
# so that the product of the 10 terms fits visibly on the same graph
likelihood = lambda x_i, theta: theta**x_i / factorial(x_i) * exp(-theta) * 10**0.9
# Log likelihood of the whole sample: 20*ln(theta) - 10*theta - sum(ln(x_i!)),
# where sum(ln(x_i!)) ≈ 12.242; the +25 just shifts the curve for the plot
log_likeli = lambda theta: -10 * theta + 20 * log(theta) - 12.242 + 25

likelihoods, log_likelihoods = [], []

# Evaluate the (scaled) likelihood and log likelihood for each theta
for theta in theta_values:
    likelihoods.append(np.prod([likelihood(x_i, theta) for x_i in x_values]))
    log_likelihoods.append(log_likeli(theta))


# Plot both curves and mark the ML estimate theta = 2
plt.plot(theta_values, log_likelihoods, "o--", color="#000080", label="Log likelihood")
plt.plot(theta_values, likelihoods, "o--", color="#5DADE2", label="Likelihood")
plt.axvline(x=2)
plt.xlabel("θ")
plt.ylabel("L(θ)*10**9 / log L(θ) + 25")
plt.title("The likelihood and log likelihood for different values of θ")
plt.legend()
plt.show()
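As a sanity check (not part of the original post), we can recover the same estimate numerically by minimising the negative Poisson log likelihood with scipy.optimize; the optimum coincides with the sample mean:

from scipy.optimize import minimize_scalar
from math import lgamma, log

x = [1, 5, 0, 1, 2, 3, 4, 1, 0, 3]

# Negative Poisson log likelihood of the sample; lgamma(k + 1) = ln(k!)
def neg_log_likelihood(theta):
    return -sum(x_i * log(theta) - theta - lgamma(x_i + 1) for x_i in x)

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20), method="bounded")
print(result.x)         # ≈ 2.0
print(sum(x) / len(x))  # 2.0, the sample mean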

Example of ML with continuous data

For the same set of observations, let’s now suppose that the underlying distribution is a Normal distribution. In this case, we estimate the mean (μ) and the variance (σ²) by ML.

Assuming that the observations are i.i.d., the likelihood can be written as:

L(μ, σ² | x) = ∏ 1/√(2πσ²) · exp(−(xi − μ)² / (2σ²))

And so the log likelihood is:

ln L(μ, σ² | x) = −1/(2σ²) · Σ (xi − μ)² − N/2 · ln σ² − N/2 · ln(2π)

Maximising the log likelihood w.r.t. μ and σ², we find the ML estimates of the mean and the variance:

μ_ML = 1/N · Σ xi    and    σ²_ML = 1/N · Σ (xi − μ_ML)²

The ML estimates coincide with the sample mean and with the sample variance measured w.r.t. the sample mean.
Since the ML solution for the variance depends on the mean, we first compute the ML mean and then the variance.

I plot the likelihood and the log likelihood of the observed sample {1, 5, 0, 1, 2, 3, 4, 1, 0, 3} for different values of μ. I do this in the following way: for each candidate mean I fix μ and compute the corresponding ML variance, so the graph shows the curves as functions of μ only and does not display the ML variance used at each point.

# ML for a Normal distribution

from math import exp, log, pi, sqrt

x_values = [1, 5, 0, 1, 2, 3, 4, 1, 0, 3]
mean_values = [-0.7 + x * 0.2 for x in range(0, 30)]

# Normal likelihood of a single observation (var is the variance, not the std);
# each term is scaled by 10**0.9 so the product fits visibly on the same graph
likelihood_normal = lambda x_i, mean, var: 1/sqrt(2*pi*var) * exp(-1/(2*var)*(x_i-mean)**2) * 10**0.9
# Log likelihood of the whole sample; the +25 just shifts the curve for the plot
log_likelihood_normal = lambda x_values, mean, var: -1/(2*var) * np.sum([(x_i-mean)**2 for x_i in x_values]) \
                                                    - 10/2*log(var) - 10/2*log(2*pi) + 25

normal_likelihoods, normal_log_likelihoods = [], []

# Get the (scaled) likelihood and log likelihood for each candidate mean
for mean in mean_values:
    # fix the mean and compute the corresponding ML variance
    var = np.sum([(x_i - mean)**2 for x_i in x_values]) / 10
    normal_likelihoods.append(np.prod([likelihood_normal(x_i, mean, var) for x_i in x_values]))
    normal_log_likelihoods.append(log_likelihood_normal(x_values, mean, var))

# Plot both curves and mark the ML estimate μ = 2
plt.plot(mean_values, normal_log_likelihoods, "o--", color="#000080", label="Log likelihood")
plt.plot(mean_values, normal_likelihoods, "o--", color="#5DADE2", label="Likelihood")
plt.axvline(x=2)
plt.xlabel("μ")
plt.ylabel("L(μ, σ²)*10**9 / log L(μ, σ²) + 25")
plt.title("The likelihood and log likelihood for different values of μ")
plt.legend()
plt.show()

var = np.sum([(x_i-2)**2 for x_i in x_values])/10
print(f"The ML variance is: {var}")
The ML variance is: 2.6
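As a quick cross-check (not in the original post), scipy.stats.norm.fit returns the ML estimates directly: the location is the sample mean and the scale is the square root of the biased (ML) variance:

from scipy.stats import norm

x = [1, 5, 0, 1, 2, 3, 4, 1, 0, 3]

mu_ml, sigma_ml = norm.fit(x)  # MLE: loc = sample mean, scale = sqrt(ML variance)
print(mu_ml)          # 2.0
print(sigma_ml ** 2)  # ≈ 2.6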

Again, we find that the maximum of the log likelihood (and of the likelihood) is achieved at μ = 2, which coincides with the sample mean, and the corresponding ML variance is 2.6. One limitation of the ML approach is that it systematically underestimates the variance of a Gaussian. The intuitive explanation is that we measure the variance w.r.t. the sample mean and not the true mean. The ML approach gives a correct mean but a biased variance:

E[σ²_ML] = (N − 1)/N · σ²
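A quick numerical illustration of the two estimators, using the sample above: NumPy’s ddof argument switches between the biased ML variance (divide by N) and the bias-corrected sample variance (divide by N − 1):

import numpy as np

x = np.array([1, 5, 0, 1, 2, 3, 4, 1, 0, 3])

print(np.var(x, ddof=0))  # 2.6     -> ML (biased) estimate, divides by N
print(np.var(x, ddof=1))  # ≈ 2.889 -> bias-corrected estimate, divides by N - 1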

Where do we use MLE in Machine Learning?

MLE shows up in machine learning problems in several settings.

First, we can view fitting a machine learning model to data as a density estimation problem in the following way: the choice of the model and of its parameters plays the role of the hypothesis. (In comparison, our hypothesis in the two examples above was that the underlying distribution is a Poisson or a Normal distribution.)

Consider, for instance, the problem of predicting a continuous target t. Assume the model is linear, y(x) = wᵀx. On its own this gives only a point estimate of the target for a new value of x. We can express the model’s uncertainty by assuming that the target, conditional on the input x, the parameter vector w and the noise variance σ², is normally distributed with mean wᵀx and variance σ². Now we can use MLE to estimate w and σ²! In this way, MLE finds the modelling hypothesis that maximises the likelihood function.
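Here is a minimal sketch of this view on hypothetical synthetic data (the data and noise level below are illustrative, not from the original post): maximising the Gaussian likelihood of t given wᵀx with respect to w gives exactly the least-squares solution, and the ML estimate of σ² is the mean squared residual.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: t = w_true^T x + Gaussian noise
N, D = 100, 3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(N, D))
t = X @ w_true + rng.normal(scale=0.8, size=N)

# Maximising the Gaussian log likelihood w.r.t. w is equivalent to
# minimising the squared error, so the ML solution is least squares
w_ml, *_ = np.linalg.lstsq(X, t, rcond=None)

# ML estimate of the noise variance: mean squared residual (divides by N)
sigma2_ml = np.mean((t - X @ w_ml) ** 2)

print(w_ml)       # close to w_true
print(sigma2_ml)  # close to 0.8**2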

Second, MLE as a density estimation tool can also be applied in clustering algorithms, where it is used to estimate the parameters of the different clusters.
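For instance, a Gaussian mixture model estimates the cluster means and covariances by (approximately) maximising the likelihood with the EM algorithm; a minimal sketch with scikit-learn on hypothetical two-cluster data (the data below are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical data drawn from two Gaussian clusters
data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=5.0, scale=1.5, size=(200, 2)),
])

# EM iteratively increases the likelihood of the mixture model
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print(gmm.means_)    # roughly [0, 0] and [5, 5]
print(gmm.weights_)  # roughly [0.5, 0.5]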

Third, we can use MLE for models where we are interested in the conditional probability of the output, given the input.
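One common example of this setting (an illustration, not from the original post) is logistic regression, which is fit by maximising the conditional Bernoulli log likelihood of the labels given the inputs; with a very weak penalty, scikit-learn’s solver approximates the plain conditional MLE:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical binary classification data generated from a logistic model
X = rng.normal(size=(300, 2))
w_true = np.array([2.0, -1.0])
p = 1.0 / (1.0 + np.exp(-(X @ w_true)))
y = rng.binomial(1, p)

# A very large C makes the L2 penalty negligible, so the fit is close
# to the unregularised maximum (conditional) likelihood solution
clf = LogisticRegression(C=1e6).fit(X, y)

print(clf.coef_)  # roughly [[2, -1]], i.e. close to w_true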

Pros of MLE:

1) MLEs are intuitive: we look for the parameter values under which the observed data are most likely. The method is widely applicable, and many machine learning problems can be interpreted as special cases of MLE.

2) MLEs have very desirable large sample properties:

– as the sample size increases, they become unbiased minimum variance estimators

– they have approximate normal distributions and approximate sample variances that can be calculated and used to generate confidence bounds

– likelihood functions can be used to test hypotheses about models and parameters

Cons of MLE:

1) With small sample sizes, MLEs can be imprecise and heavily biased, since they depend strongly on the particular sample observed.
2) Calculating MLEs can often be complicated, requiring the solution of complex, non-linear equations.

References

Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
