Maximum a posteriori (MAP) estimation is yet another method of density estimation. Unlike maximum likelihood (ML) estimation, however, it is a Bayesian method, as it is based on the posterior probability. This post gives a brief introduction to MAP estimation.
MAP estimation finds the parameters of a probability distribution that maximise the posterior distribution of the parameters conditional on the data. Thus, compared to ML estimation, which finds the parameters that maximise the likelihood, MAP estimation also brings in the prior.
Recall that Bayes' theorem gives the relation between the posterior and the likelihood: the posterior is proportional to the product of the likelihood and the prior. (We can ignore the normalising constant because it does not depend on the parameters. The goal of MAP estimation is to find the optimal parameter values by optimisation, so proportionality is sufficient.)
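As with ML estimation, we observe the data (x1, x2, …, xN), assumed here to be independent draws from a distribution f(x | θ), and we want to estimate the parameters θ of this distribution. To do so, we search for the parameters that maximise the posterior (or, equivalently, the log posterior), which can be written as:

$$\log f(\theta \mid x_1, \dots, x_N) = \sum_{i=1}^{N} \log f(x_i \mid \theta) + \log f(\theta) + \text{const}$$

Therefore, we find the MAP estimate of the parameters by choosing the value that maximises the posterior:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \left[ \sum_{i=1}^{N} \log f(x_i \mid \theta) + \log f(\theta) \right]$$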
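To make the optimisation concrete, here is a minimal sketch in Python. It is only an illustration under assumptions that are not part of the discussion above: the data are made up, the likelihood is Gaussian with a known standard deviation of 1, and the prior on the mean is N(0, 1). The names `log_likelihood`, `log_prior` and `log_posterior` are hypothetical helpers, and the maximiser is a crude grid search rather than a proper optimiser.

```python
import math

def log_likelihood(mu, data, sigma=1.0):
    """Gaussian log-likelihood of the data for a given mean mu (sigma assumed known)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in data
    )

def log_prior(mu, mu0=0.0, tau=1.0):
    """Log density of the assumed N(mu0, tau^2) prior on mu."""
    return -0.5 * math.log(2 * math.pi * tau**2) - (mu - mu0) ** 2 / (2 * tau**2)

def log_posterior(mu, data):
    """Unnormalised log posterior: log-likelihood plus log prior."""
    return log_likelihood(mu, data) + log_prior(mu)

data = [1.8, 2.3, 1.9, 2.6, 2.1]  # made-up observations

# Crude grid search over candidate values of mu in [-5, 5], step 0.001.
grid = [i / 1000 for i in range(-5000, 5001)]
mu_map = max(grid, key=lambda mu: log_posterior(mu, data))

# Closed-form posterior mode for this conjugate model, used only as a check.
n, sigma, mu0, tau = len(data), 1.0, 0.0, 1.0
mu_closed = (tau**2 * sum(data) + sigma**2 * mu0) / (n * tau**2 + sigma**2)

mu_mle = sum(data) / n  # the ML estimate of a Gaussian mean is the sample mean

print(mu_map, mu_closed, mu_mle)
```

With these made-up numbers the grid search lands within one grid step of the closed-form mode (about 1.783), while the sample mean is 2.14: the prior centred at 0 pulls the MAP estimate towards zero.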
Compared to the MLE, which is defined as:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \sum_{i=1}^{N} \log f(x_i \mid \theta)$$

we see that the MAP optimisation problem has an additional term, the prior f(θ) (or log f(θ) on the log scale). This term does not depend on the number of observations, while the log-likelihood is a sum of N terms and so grows with N. Therefore, as we increase the number of observations, the data overwhelm the prior, and the MAP estimate converges to the ML estimate.
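A small numerical sketch can illustrate this convergence. The setting is an assumption chosen for convenience, not something taken from the text above: Bernoulli observations with a Beta(5, 5) prior on the success probability, for which the posterior mode has a simple closed form. The helper names and the 70% success rate are made up.

```python
def bernoulli_mle(k, n):
    """ML estimate of the success probability: the observed success rate."""
    return k / n

def bernoulli_map(k, n, a=5.0, b=5.0):
    """Posterior mode under an assumed Beta(a, b) prior (valid for a, b > 1)."""
    return (k + a - 1) / (n + a + b - 2)

gaps = []
for n in [10, 100, 1000, 10000]:
    k = 7 * n // 10  # pretend 70% of the trials were successes
    mle = bernoulli_mle(k, n)
    map_ = bernoulli_map(k, n)
    gaps.append(abs(map_ - mle))
    print(f"N={n:>5}  MLE={mle:.4f}  MAP={map_:.4f}  gap={gaps[-1]:.4f}")
```

The prior contributes the same fixed amount (a - 1 and b - 1 pseudo-counts) whatever N is, so its influence on the estimate shrinks as the observed counts grow.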
If the assumed model for the underlying distribution is correct (so the form of f is well specified) and the prior does not rule out the true parameter values, both the ML and MAP estimates converge to the true values as N grows, and so we say that the ML and MAP estimators are consistent.
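Now, instead of giving a complete example as I did in the ML article, let’s consider a simple case with discrete values and a uniform prior. Let’s assume that θ can take 10 possible values and that they are equally likely, so that the prior is f(θ) = 0.1 for each of the 10 values. The MAP estimate of the parameters is by definition:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \left[ \sum_{i=1}^{N} \log f(x_i \mid \theta) + \log 0.1 \right] = \arg\max_{\theta} \sum_{i=1}^{N} \log f(x_i \mid \theta) = \hat{\theta}_{MLE}$$

The term log 0.1 is the same for every candidate θ, so it does not affect which value attains the maximum.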
Thus, we can think of MLE as a special case of MAP: a MAP estimate with a uniform prior distribution. However, if we use a different prior, the MAP estimate is weighted by that prior, and so the ML and MAP estimates will generally differ for finite N (even though the MAP estimate converges to the MLE as we increase N).
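The discrete case is easy to check numerically. The sketch below is an illustration under assumptions of my own choosing: the 10 candidate values of θ form an evenly spaced grid of Bernoulli success probabilities, the data are a made-up 7 successes in 10 trials, and the non-uniform prior (which halves in weight at each step up the grid) is purely hypothetical.

```python
import math

# Ten candidate values for theta: 0.05, 0.15, ..., 0.95 (an assumed grid).
thetas = [(2 * i + 1) / 20 for i in range(10)]
k, n = 7, 10  # made-up data: 7 successes in 10 Bernoulli trials

def log_lik(theta):
    """Bernoulli log-likelihood of k successes in n trials."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

uniform_prior = [0.1] * 10

# Hypothetical non-uniform prior that strongly favours small theta.
weights = [2 ** (9 - i) for i in range(10)]
skewed_prior = [w / sum(weights) for w in weights]

def map_estimate(prior):
    """Return the theta that maximises log-likelihood plus log prior."""
    scores = [log_lik(t) + math.log(p) for t, p in zip(thetas, prior)]
    return thetas[scores.index(max(scores))]

theta_mle = max(thetas, key=log_lik)
theta_map_uniform = map_estimate(uniform_prior)
theta_map_skewed = map_estimate(skewed_prior)

print(theta_mle, theta_map_uniform, theta_map_skewed)
```

Under the uniform prior the MAP estimate coincides with the ML estimate on this grid (0.65), whereas the skewed prior moves the MAP estimate down to 0.55.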