An Introduction to Maximum Likelihood Estimation (MLE)
- Author: Dmitri Eulerov
Maximum Likelihood Estimation is a method of estimating the parameters of an assumed probability distribution given some observed data. This is done by maximizing a likelihood function so that, under the assumed model, the observed data is most probable.
Let's walk through a visual example:

How would we model a probability distribution that best describes the data?
A simple Gaussian could be a good fit here. Let's start with that. This is our assumed probability distribution.
Note: In practice, the functional form can vary (Gaussian, exponential, gamma, etc.).

As we've decided to use a Gaussian as our probability distribution, what are the optimal parameters of this Gaussian that maximize the likelihood of observing this data?
The question above can be answered using MLE.
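To make the setup concrete, here is a small, purely illustrative sketch of the kind of data we might be working with; the distribution parameters and sample size are invented for this example.

```python
# Hypothetical one-dimensional observations we would like to describe with a distribution.
import numpy as np

rng = np.random.default_rng(42)
observations = rng.normal(loc=10.0, scale=3.0, size=500)  # "true" parameters unknown to us in practice

print(observations[:5])  # a handful of raw measurements, nothing more
```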

Estimating our optimal parameters
In MLE, we are attempting to estimate a parameter $\theta$ of a given probability distribution. Our objective is to find the value of $\theta$ that maximizes the likelihood function $L(\theta)$, which is defined as:

$$L(\theta) = \prod_{i=1}^{N} f(x_i \mid \theta)$$

where $x_1, x_2, \dots, x_N$ are the observations from the dataset and $f(x \mid \theta)$ is the probability density function of the probability distribution with parameter $\theta$.
The Maximum Likelihood Estimate of $\theta$ is the value that maximizes the likelihood function, so we can solve the optimization problem:

$$\hat{\theta} = \underset{\theta}{\arg\max}\; L(\theta)$$
As we are assuming a Gaussian distribution, let's recall the Gaussian probability density function:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, and $x$ is the independent variable.
In our case $\theta = (\mu, \sigma)$, and we are maximizing the likelihood of observing our data by discovering the optimal $\mu$ and $\sigma$.
Our likelihood function over $N$ samples is then:

$$L(\mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
We can take the natural log of the likelihood function to make the numbers easier to work with and to convert products into sums. The cumulative product of probabilities will be a tiny number, and taking the logarithm makes it much more manageable. Also, since the logarithm is monotonically increasing, the parameters that maximize the log-likelihood are the same ones that maximize the likelihood.
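For completeness, writing out the Gaussian log-likelihood explicitly (a standard expansion of the product above) gives:

$$\ln L(\mu, \sigma) = -\frac{N}{2}\ln(2\pi) - N\ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$$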
Next, we need to find the peak of the log-likelihood, which maximizes the likelihood of our observed samples. We can do this through differentiation. Let's take the derivatives of the log-likelihood function with respect to the parameters $\mu$ and $\sigma$ and set them equal to zero:

$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0, \qquad \frac{\partial \ln L}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{N}(x_i - \mu)^2 = 0$$

(Note: In other distributions, setting the derivative to zero may not always work. Some likelihood functions have multiple local peaks where the derivatives are zero; in our case, since we know the functional form is Gaussian and its log-likelihood has a single peak, this works fine.)
Solving for $\mu$ and $\sigma$:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$$
Look familiar? The maximum likelihood estimates of the parameters $\mu$ and $\sigma^2$ are the sample mean and sample variance, respectively!
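As a quick numerical sanity check, here is a short sketch (my own, with made-up data, not part of the original derivation) that maximizes the Gaussian log-likelihood with a generic optimizer and compares the answer to the plain sample statistics; the two should agree closely.

```python
# Maximize the Gaussian log-likelihood numerically and compare to the closed-form estimates.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)  # hypothetical observations

def neg_log_likelihood(params, x):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print("MLE via optimizer:", mu_hat, sigma_hat)
print("Closed form:      ", data.mean(), data.std())  # np.std divides by N, matching the MLE
```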
...But we can just directly calculate the mean and standard deviation of our sample set... WTF was the point of this?
The idea here is that MLE provides a general framework for estimating the parameters of an assumed distribution; our work here has verified that the intuitive estimates are in fact the maximum likelihood estimates, given that we already know how to fit a Gaussian. This formalizes the why and how. The method of finding optimal parameters varies depending on the distribution and, unlike the example above, may not always have a closed-form solution, but I hope you've grasped why this framework is useful.
This StackOverflow answer by Aksakal answers it quite well:
"In this case, the average of your sample happens to also be the maximum likelihood estimator. So doing all the work derive the MLE feels like an unnecessary exercise, as you get back to your intuitive estimate of the mean you would have used in the first place. Well, this wasn't "just by chance"; this was specifically chosen to show that MLE estimators often lead to intuitive estimators.
But what if there was no intuitive estimator? For example, suppose you had a sample of iid gamma random variables and you were interested in estimating the shape and the rate parameters. Perhaps you could try to reason out an estimator from the properties you know about Gamma distributions. But what would be the best way to do it? Using some combination of the estimated mean and variance? Why not use the estimated median instead of the mean? Or the log-mean? These all could be used to create some sort of estimator, but which will be a good one?
As it turns out, MLE theory gives us a great way of succinctly getting an answer to that question: take the values of the parameters that maximize the likelihood of the observed data (which seems pretty intuitive) and use that as your estimate. In fact, we have theory that states that under certain conditions, this will be approximately the best estimator. This is a lot better than trying to figure out a unique estimator for each type of data and then spending lots of time worrying if it's really the best choice.
In short: while MLE doesn't provide new insight in the case of estimating the mean of normal data, it in general is a very, very useful tool."
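As a concrete illustration of that last point, here is a short sketch (my own example, not part of the quoted answer) fitting a gamma distribution's shape and rate by maximum likelihood with scipy; `scipy.stats.gamma.fit` maximizes the likelihood numerically, since there is no closed-form solution for the gamma shape parameter.

```python
# Fit gamma shape and rate parameters by MLE -- no obvious "intuitive" estimator exists here.
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(1)
data = rng.gamma(shape=2.5, scale=1.5, size=2_000)  # hypothetical iid gamma sample

# floc=0 pins the location parameter so only shape and scale are estimated.
shape_hat, loc_hat, scale_hat = gamma.fit(data, floc=0)
rate_hat = 1.0 / scale_hat

print(f"estimated shape: {shape_hat:.3f}, estimated rate: {rate_hat:.3f}")
```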
Final Thoughts
As always, thanks for reading. I know some of the math looks cryptic and can be quite dense, so it's alright to gloss over it a bit if you're not super comfortable with the notation. More importantly, I hope this helped develop some intuition behind what MLE is, and why it might be useful. Stay safe, and happy holidays!