Maximum Likelihood Estimation

Given data X1, X2, ..., Xn, a random sample (iid) from a distribution with unknown parameter Θ, we want to find the value of Θ in the parameter space that maximizes the probability (or probability density) of observing that data.

If X1, X2, ..., Xn are discrete we can look at P(X1=x1, X2=x2, ..., Xn=xn) as a function of Θ and find the Θ that maximizes it.

If continuous we maximize the joint pdf with respect to Θ.

The MLE is asymptotically unbiased (its bias goes to 0 as n grows).

Notation

The pmf/pdf for any one of X1, X2, ..., Xn is denoted by f(x). No subscripts because all X's have the same distribution. You can emphasize the dependence of f on a parameter Θ by writing it as f(x;Θ).

The joint pmf/pdf for all n of them is f(x1, x2, ..., xn;Θ) = product(i=1 to n) f(xsubi;Θ).

The data (the x's) are fixed and the joint pdf can be thought of as a function of Θ.

Call this the likelihood function and denote it by L(Θ).
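
In code terms (a minimal Python sketch, not from the notes), the likelihood is just the joint pmf/pdf with the data held fixed and Θ varying, and a crude MLE can be found by scanning a grid of candidate Θ values:

    import numpy as np

    def likelihood(theta, data, f):
        # L(theta) = product over i of f(x_i; theta), with the x_i held fixed
        return np.prod([f(x, theta) for x in data])

    def grid_mle(data, f, grid):
        # crude numerical MLE: return the grid point with the highest likelihood
        return max(grid, key=lambda theta: likelihood(theta, data, f))

For example, grid_mle([1, 0, 1, 1], lambda x, p: p**x * (1 - p)**(1 - x), np.linspace(0.01, 0.99, 99)) comes out at 0.75, the sample mean.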

Example

Discrete example (coin flips)

X1, X2, ..., Xn ~ Bernoulli(p) (iid)

the pmf for one of them is:

f(x;p) = p^x (1-p)^(1-x) I{0,1}(x)

the joint pmf for all of them is:

f(vec<x>;p)
    = product(i=1 to n) f(xsubi;p)
    = product(i=1 to n) p^xsubi (1-p)^(1-xsubi) I{0,1}(xsubi)

so a likelihood is:

L(p) = p^(sum(i=1 to n)xsubi) (1-p)^(n-sum(i=1 to n)xsubi)

it is almost always easier to maximize the "log-likelihood" (the log turns the product into a sum and has the same maximizer)

l(p) = sum(i=1 to n)xsubi*ln(p) + (n-sum(i=1 to n)xsubi)ln(1-p)

you want to max with respect to p, so take derivative with respect to p and set
equal to 0:

∂/∂p l(p) = [sum(i=1 to n)xsubi]/p - [n - sum(i=1 to n)xsubi]/(1-p) = 0

...solve for p 😅

...switch to capitals to make your estimate an estimator

phat = [sum(i=1 to n)Xsubi]/n = Xbar
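
Quick numerical check (a sketch with made-up data, assuming numpy and scipy are available): minimizing the negative log-likelihood lands on the sample mean.

    import numpy as np
    from scipy.optimize import minimize_scalar

    x = np.array([1, 0, 1, 1, 0, 1, 1, 0])   # hypothetical coin flips

    def neg_log_lik(p):
        # -l(p) = -[sum(x) ln(p) + (n - sum(x)) ln(1 - p)]
        return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x, x.mean())   # both approximately 0.625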

Continuous example (exponential)

X1, X2, ..., Xn ~ exp(rate = lambda) (iid)

pdf for one:

f(x;lambda) = lambda e^(-lambda x) I(0,inf)(x)

joint pdf for all:

f(vec<x>;lambda) = product(i=1 to n) lambda e^(-lambda xsubi) I(0,inf)(xsubi)

a likelihood:

L(lambda) = lambda^n e^(-lambda sum(i=1 to n)xsubi)

and the log-likelihood:

l(lambda) = n ln(lambda) - lambda sum(i=1 to n)xsubi

take derivative with respect to lambda and set to 0:

∂/∂lambda l(lambda) = n/lambda - sum(i=1 to n)xsubi = 0

...solve for lambda and switch to capitals:

lambdahat = n/[sum(i=1 to n)Xsubi] = 1/Xbar

same result as method of moments
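
Same kind of check (a sketch with hypothetical waiting times): the numerically maximized exponential log-likelihood agrees with 1/Xbar.

    import numpy as np
    from scipy.optimize import minimize_scalar

    x = np.array([0.4, 1.2, 0.7, 2.3, 0.9])   # hypothetical waiting times

    def neg_log_lik(lam):
        # -l(lambda) = -[n ln(lambda) - lambda * sum(x)]
        return -(len(x) * np.log(lam) - lam * x.sum())

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
    print(res.x, 1 / x.mean())   # both approximately 0.91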

MLE with multiple parameters

When you get to the derivative step, take a partial derivative with respect to each parameter, set each one to 0, and solve the resulting system of equations for all the parameters simultaneously.

It's best to look at specific examples to understand the full process.
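
For instance (a sketch, assuming a normal sample with both mu and sigma unknown): the two partial derivatives give muhat = Xbar and sigmahat^2 = (1/n) sum(i=1 to n)(xsubi - Xbar)^2, and a generic two-parameter optimizer reproduces the same answers.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=2.0, size=200)   # hypothetical normal sample

    def neg_log_lik(params):
        mu, sigma = params
        # normal log-likelihood: sum of -0.5 ln(2 pi sigma^2) - (x_i - mu)^2 / (2 sigma^2)
        return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

    res = minimize(neg_log_lik, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])
    print(res.x)                              # approximately [muhat, sigmahat]
    print(x.mean(), np.sqrt(x.var(ddof=0)))   # the closed-form MLEs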

Sample Variance

If you calculate an MLE for the population variance you get a natural "sample variance" (with normal data it comes out with a denominator of n), but it is biased. You can correct for the bias by using n-1 in the denominator instead, which gives an estimator that is natural because it is unbiased and consistent (it converges to the true population variance as n grows). This is denoted S^2.

The denominator of S^2, n-1, is called the "degrees of freedom".
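
A small numerical illustration of the two denominators (a sketch with simulated data):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.0, scale=3.0, size=20)   # true population variance is 9

    mle_var = np.sum((x - x.mean())**2) / len(x)         # divide by n: the MLE, biased low
    s2      = np.sum((x - x.mean())**2) / (len(x) - 1)   # divide by n-1: the unbiased S^2
    print(mle_var, s2)   # same as np.var(x) and np.var(x, ddof=1)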

Support parameters

If you cannot drop the pdf's indicator functions because they involve the parameter you are solving for, write the likelihood as a piecewise function and pull the indicators out to help you reason about how to maximize over the other parameters. This usually means you will not be taking a derivative; you have to draw everything out and think it through. For example, with a Uniform(0, theta) sample: the likelihood (1/theta)^n grows as theta shrinks toward 0, but theta cannot be smaller than the largest xsubi, so the MLE is the largest xsubi (the sample maximum).
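
A sketch of that reasoning for a Uniform(0, theta) sample (hypothetical values): the likelihood is (1/theta)^n when theta >= max(x) and 0 otherwise, so it is maximized at theta = max(x).

    import numpy as np

    x = np.array([1.3, 4.1, 0.7, 3.6, 2.2])   # hypothetical Uniform(0, theta) data

    def likelihood(theta):
        # L(theta) = (1/theta)^n if every x_i fits inside (0, theta), else 0
        return (1 / theta)**len(x) if theta >= x.max() else 0.0

    for theta in [3.0, x.max(), 5.0, 10.0]:
        print(theta, likelihood(theta))
    # zero below max(x), decreasing above it, so the MLE is max(x) = 4.1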

The Invariance Property

If you want to estimate a function of a parameter using MLE, you can find the MLE of the parameter and just plug it into the function: the MLE of g(theta) is g(thetahat). (The result is not necessarily unbiased.)
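
For example (a sketch continuing the exponential example): to estimate the tail probability P(X > t) = e^(-lambda t), plug lambdahat = 1/Xbar into that function.

    import numpy as np

    x = np.array([0.4, 1.2, 0.7, 2.3, 0.9])   # hypothetical exponential sample
    lam_hat = 1 / x.mean()                     # MLE of the rate lambda

    t = 2.0
    tail_hat = np.exp(-lam_hat * t)            # by invariance, this is the MLE of P(X > t)
    print(lam_hat, tail_hat)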

Large Sample Properties

Let X1, …, Xn be a random sample from a distribution with pdf f(x;theta).

Let theta_hat be an MLE for theta.

  • theta_hat exists and is unique (under the usual regularity conditions on f).
  • theta_hat converges in probability to theta. We say theta_hat is a consistent estimator of theta.
  • theta_hat is an asymptotically unbiased estimator of theta:
        lim(n to inf) E[theta_hat] = theta
  • theta_hat is asymptotically efficient:
        lim(n to inf) CRLB_sub_theta / Var[theta_hat] = 1
  • theta_hat is asymptotically normal:
        theta_hat ~ N(theta, CRLB_sub_theta) approximately, i.e.
        (theta_hat - theta) / sqrt(CRLB_sub_theta) approaches N(0, 1) in distribution
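
A simulation sketch of the last property (assuming exponential data, where CRLB_sub_lambda = lambda^2/n): standardized MLEs from many repeated samples should look approximately standard normal.

    import numpy as np

    rng = np.random.default_rng(2)
    lam, n, reps = 2.0, 500, 5000

    z = np.empty(reps)
    for r in range(reps):
        x = rng.exponential(scale=1 / lam, size=n)     # numpy parameterizes by scale = 1/rate
        lam_hat = 1 / x.mean()                         # MLE of the rate
        z[r] = (lam_hat - lam) / np.sqrt(lam**2 / n)   # standardize by sqrt(CRLB)

    print(z.mean(), z.std())   # should be close to 0 and 1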