Maximum Likelihood Estimation (MLE) is an approach to estimating the parameters of a model. It is one of the core techniques of Item Response Theory (IRT), especially for estimating item parameters (analyzing questions) and person parameters (scoring). This article provides an introduction to the concepts of MLE.
- History behind Maximum Likelihood Estimation
- Defining Maximum Likelihood Estimation
- Comparison of likelihood and probability
- Calculating Maximum Likelihood Estimation
- Key characteristics of Maximum Likelihood Estimation
- Weaknesses of Maximum Likelihood Estimation
- Application of Maximum Likelihood Estimation
- Summarizing remarks about Maximum Likelihood Estimation
History behind Maximum Likelihood Estimation
Even though early ideas about MLE appeared in the mid-1700s, it was Sir Ronald Aylmer Fisher who developed them into a formalized method much later. Fisher produced his seminal work on maximum likelihood between 1912 and 1922, repeatedly criticizing and revising his own justifications. In 1925, he finally published “Statistical Methods for Research Workers”, one of the 20th century’s most influential books on statistical methods. The development of the maximum likelihood concept was a breakthrough in statistics.
Defining Maximum Likelihood Estimation
Wikipedia defines MLE as follows:
In statistics, Maximum Likelihood Estimation is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.
Merriam Webster has a slightly different definition for MLE:
A statistical method for estimating population parameters (as the mean and variance) from sample data that selects as estimates those parameter values maximizing the probability of obtaining the observed data.
To sum up, MLE is a method that finds the parameter values of a model. These values are chosen so that they maximize the likelihood that the process described by the model produced the data that were actually observed. Put simply, MLE answers the question:
For which parameter value does the observed data have the biggest probability?
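This question can be made concrete with a small sketch. Assume (hypothetically) that we flipped a coin 20 times and observed 14 heads, model the flips with a binomial distribution, and ask which value of p = P(heads) makes the observed data most probable:

```python
# Grid search illustrating the MLE question: which parameter value
# makes the observed data most probable?
# Hypothetical setup: 20 coin flips, 14 heads; parameter p = P(heads).
from math import comb

n, k = 20, 14  # observed data (assumed for illustration)

def likelihood(p):
    # Binomial likelihood of observing k heads in n flips
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Evaluate the likelihood on a grid of candidate parameter values
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # → 0.7, the sample proportion k/n
```

The grid search lands exactly on the sample proportion k/n, which is also the closed-form maximum likelihood estimate for a binomial model.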
Comparison of likelihood and probability
The definitions above mention “probability”, but it is important not to confuse these two different concepts. Let us look at some differences between likelihood and probability so that you can tell them apart.
| Likelihood | Probability |
| --- | --- |
| Refers to events that have already occurred, with known outcomes | Refers to events that will occur in the future |
| Likelihoods do not add up to 1 | Probabilities add up to 1 |
| Example 1: I flipped a coin 20 times and obtained 20 heads. What is the likelihood that the coin is fair? | Example 1: I flipped a coin 20 times. What is the probability of the coin landing heads (or tails) every time? |
| Example 2: Given the fixed outcomes (data), what is the likelihood of different parameter values? | Example 2: Given the fixed parameter p = 0.5, what is the probability of different outcomes? |
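The two columns can be checked numerically with the coin example. In this sketch the data and parameter values are assumed for illustration: fixing the parameter and summing over all possible outcomes gives 1, while fixing the data and summing over a few candidate parameter values does not:

```python
# Contrasting probability (parameter fixed, outcomes vary) with
# likelihood (data fixed, parameter varies), using a binomial model.
from math import comb

def binom_pmf(k, n, p):
    # Probability of k heads in n flips when P(heads) = p
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability: p = 0.5 is fixed; probabilities over all 21 possible
# outcomes (0..20 heads) sum to 1.
total = sum(binom_pmf(k, 20, 0.5) for k in range(21))
print(total)  # → 1.0 (up to rounding)

# Likelihood: the data are fixed (20 heads out of 20); likelihoods
# over candidate parameter values need not sum to 1.
like_sum = sum(binom_pmf(20, 20, p) for p in [0.1, 0.5, 0.9])
print(like_sum)  # clearly not 1
```

Note also that the likelihood of a fair coin given 20 heads, binom_pmf(20, 20, 0.5) ≈ 9.5e-7, is tiny, which is exactly what Example 1 in the likelihood column is asking about.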
Calculating Maximum Likelihood Estimation
The MLE is calculated by taking the derivative of the log-likelihood with respect to each parameter (for a normal distribution, the mean μ and the variance σ²) and setting it equal to 0. There are four general steps in estimating the parameters:
- Assume a distribution for the observed data
- Estimate the distribution’s parameters by maximizing the log-likelihood
- Plug the estimated parameters into the distribution’s probability function
- Evaluate the fitted distribution against the observed data
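The four steps above can be sketched for normally distributed data. The observations below are hypothetical; setting the derivatives of the log-likelihood with respect to μ and σ² to zero yields the well-known closed-form estimates (the sample mean and the uncorrected sample variance):

```python
# The four estimation steps, sketched for a normal model.
import math

data = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0]  # hypothetical observations

# Step 1: assume a distribution (normal, with parameters mu and sigma^2).
# Step 2: maximize the log-likelihood. Setting its derivatives with
# respect to mu and sigma^2 to zero gives closed-form estimates:
n = len(data)
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n

# Step 3: plug the estimates into the distribution's density function.
def density(x):
    return math.exp(-(x - mu_hat) ** 2 / (2 * var_hat)) / math.sqrt(2 * math.pi * var_hat)

# Step 4: evaluate the fitted distribution on the observed data.
log_lik = sum(math.log(density(x)) for x in data)
print(mu_hat, var_hat, log_lik)
```

Any other choice of μ or σ² produces a lower log-likelihood on these data, which is what makes mu_hat and var_hat the maximum likelihood estimates.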
Key characteristics of Maximum Likelihood Estimation
- MLE operates with one-dimensional data
- MLE uses only “clean” data (e.g. no outliers)
- MLE is usually computationally manageable
- MLE is often real-time on modern computers
- MLE works well for simple cases (e.g. binomial distribution)
Weaknesses of Maximum Likelihood Estimation
- MLE is sensitive to outliers
- MLE often requires optimization of speed and memory usage to obtain useful results
- MLE is sometimes poor at differentiating between models with similar distributions
- MLE can be technically challenging, especially for multidimensional data and complex models
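The first weakness is easy to demonstrate. For a normal model the MLE of the mean is the sample average, so a single outlier can drag the estimate far from the bulk of the data. The numbers below are hypothetical:

```python
# Sketch of MLE's sensitivity to outliers: under a normal model the
# MLE of the mean is the sample average, which one bad point can skew.
clean = [5.0, 5.1, 4.9, 5.2, 4.8]   # hypothetical clean data
with_outlier = clean + [50.0]       # one corrupted observation

mle_clean = sum(clean) / len(clean)
mle_outlier = sum(with_outlier) / len(with_outlier)
print(mle_clean, mle_outlier)  # the second estimate lies far above every clean point
```

This is one reason robust alternatives (or outlier screening before fitting) are often preferred when the data cannot be assumed “clean”.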
Application of Maximum Likelihood Estimation
In order to apply MLE, two important assumptions (typically referred to as the i.i.d. assumption) need to be made:
- Data must be independently distributed, i.e. the observation of any given data point does not depend on the observation of any other data point (each data point is an independent experiment)
- Data must be identically distributed, i.e. each data point is generated from the same distribution family with the same parameters
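The i.i.d. assumption is what makes the likelihood tractable: the joint likelihood of the sample factorizes into a product of per-observation densities, so the log-likelihood becomes a simple sum. A sketch with a Bernoulli model (the data and candidate parameter are assumed for illustration):

```python
# Under i.i.d. the joint likelihood is a product of per-observation
# terms, and the log-likelihood is therefore a sum.
import math

data = [1, 0, 1, 1, 0, 1]  # hypothetical coin flips (1 = heads)
p = 0.6                    # candidate parameter value

# Joint likelihood as a product (valid only because flips are i.i.d.)
joint = math.prod(p if x == 1 else 1 - p for x in data)

# Equivalent log-likelihood as a sum
log_lik = sum(math.log(p if x == 1 else 1 - p) for x in data)
print(joint, math.exp(log_lik))  # both equal p**4 * (1 - p)**2
```

If the observations were dependent or drawn from different distributions, the joint likelihood would not factorize this way and the simple sum-of-logs objective would no longer be correct.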
Let us consider several well-known applications of MLE:
- Global Positioning System (GPS)
- Smart keyboard programs for iOS and Android operating systems (e.g. Swype)
- Speech recognition programs (e.g. Carnegie Mellon open source SPHINX speech recognizer, Dragon Naturally Speaking)
- Detection and measurement of the properties of the Higgs boson at the European Organization for Nuclear Research (CERN) by means of the Large Hadron Collider (François Englert and Peter Higgs were awarded the 2013 Nobel Prize in Physics for the theory of the Higgs boson)
Generally speaking, MLE is employed in agriculture, economics, finance, physics, medicine and many other fields.
Summarizing remarks about Maximum Likelihood Estimation
Despite some practical issues, such as the technical challenges posed by multidimensional data and complex multiparameter models, which complicate many real-world problems, MLE remains a powerful and widely used statistical approach for classification and parameter estimation. MLE has brought many successes to mankind in both the scientific and commercial worlds.
References
Aldrich, J. (1997). R. A. Fisher and the making of maximum likelihood 1912–1922. Statistical Science, 12(3), 162–176.
Stigler, S. M. (2007). The epic story of maximum likelihood. Statistical Science, 22(4), 598–620.