
Methods of Estimation

import matplotlib.pyplot as plt
import numpy as np

Method of moments estimation

Want to find a statistic $T(X_1,\ldots,X_n)$ such that $E_\theta[T(X_1,\ldots,X_n)] = h(\theta)$, where $h$ has a well-defined inverse. Then we can set $T(X_1,\ldots,X_n) = h(\hat\theta)$, so that $\hat\theta = h^{-1}(T)$.

If $X_1,\ldots,X_n$ are independent and $E_\theta(X_i) = h(\theta)$, then by the substitution principle we can estimate $E_\theta(X_i)$ by $\bar X$, set $\bar X = h(\hat\theta)$, and so $\hat\theta = h^{-1}(\bar X)$.

Example: Exponential Distribution

$X_1,\ldots,X_n$ independent with density $f(x;\lambda) = \lambda \exp(-\lambda x)$, $x\geq 0$, where $\lambda > 0$ is unknown.

Note that for $r > 0$, $E_\lambda(X_i^r) = \lambda^{-r}\Gamma(r+1)$, so we have the MoM estimator

$$n^{-1}\sum^n X_i^r = \frac{\Gamma(r+1)}{\hat\lambda^r}
\;\Rightarrow\; \hat\lambda(r) = \bigg\{\frac{\sum^n X_i^r}{n\,\Gamma(r+1)}\bigg\}^{-1/r}$$

Using $r = 1$ gives the best estimator (it minimizes the standard deviation of $\hat\lambda(r)$).
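A quick simulation sketch (not part of the original notes; it assumes scipy is available for the Gamma function, and the names true_lambda, lam_hat, gamma_fn are only illustrative) comparing the spread of $\hat\lambda(r)$ for a few choices of $r$:

import numpy as np
from scipy.special import gamma as gamma_fn  # Gamma function

true_lambda = 2.0
n, n_rep = 50, 2000
rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / true_lambda, size=(n_rep, n))  # scale = 1/lambda

for r in (0.5, 1.0, 2.0, 3.0):
    # lambda_hat(r) = { (1/(n*Gamma(r+1))) * sum x_i^r }^{-1/r}
    lam_hat = ((x**r).mean(axis=1) / gamma_fn(r + 1)) ** (-1 / r)
    print(f"r = {r}: mean = {lam_hat.mean():.3f}, sd = {lam_hat.std():.3f}")

In this sketch the sampling standard deviation is smallest at $r = 1$, consistent with the claim above.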

Example: Gamma Distribution

$X_1,\ldots,X_n$ independent with density $f(x;\lambda, a) = \lambda^a x^{a-1}\exp(-\lambda x)\,\Gamma(a)^{-1}$, $x\geq 0$, where $\lambda, a > 0$ are unknown. Note that $E(X_i) = a/\lambda$ and $\mathrm{var}(X_i) = a/\lambda^2$, so MoM gives

$$\bar X = \hat a / \hat\lambda, \qquad S^2 = \hat a / \hat\lambda^2
\;\Rightarrow\; \hat a = \bar X^2 / S^2, \qquad \hat\lambda = \bar X / S^2$$
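A minimal sketch (not part of the original notes) of this MoM fit on simulated gamma data; the true values a_true and lambda_true are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(1)
a_true, lambda_true = 3.0, 1.5
x = rng.gamma(shape=a_true, scale=1 / lambda_true, size=1000)  # scale = 1/lambda

x_bar = x.mean()
s2 = x.var(ddof=1)          # sample variance S^2
a_hat = x_bar**2 / s2       # a_hat = X_bar^2 / S^2
lambda_hat = x_bar / s2     # lambda_hat = X_bar / S^2
print(a_hat, lambda_hat)    # should be close to (3, 1.5)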

Confidence Interval

An interval $\mathcal I = [l(X_1,\ldots,X_n),\, u(X_1,\ldots,X_n)]$ is a CI with coverage $100p\%$ if

$$P[l(X_1,\ldots,X_n)\leq \theta \leq u(X_1,\ldots,X_n)] = p, \quad \forall\, \theta\in\Theta$$

Example

For $X_1,\ldots,X_{20}$ independent $\sim N(\mu, \sigma^2)$, the $95\%$ CI for $\mu$ is $[\bar X \pm 2.093\, S/\sqrt{20}]$, where $2.093$ is the $0.975$ quantile of the $t_{19}$ distribution.

The following example draws 200 samples of size 20 from $N(0, 1)$; approximately $95\%$ of the resulting intervals cover the true mean $0$.

# 200 samples of size 20 from N(0, 1)
samples = np.random.randn(200, 20)
mean = samples.mean(axis=1)
sd = samples.std(axis=1, ddof=1)        # sample standard deviation S
half_width = 2.093 * sd / 20**0.5       # t_{0.975, 19} * S / sqrt(n)
# indices of samples whose CI does not cover the true mean 0
not_in_CI = np.concatenate((np.where(mean + half_width < 0)[0],
                            np.where(mean - half_width > 0)[0]))
plt.figure(figsize=(12, 4))
plt.errorbar(x=np.arange(samples.shape[0]), y=mean,
             yerr=half_width,
             fmt=" ", label="CI covers true mean")
plt.errorbar(x=not_in_CI, y=mean[not_in_CI],
             yerr=half_width[not_in_CI],
             fmt=" ", color="red", label="CI does not cover")
plt.axhline(0, linestyle=":", color="grey")
plt.xlabel("sample"); plt.ylabel("CI")
plt.title(r"95% CIs for $\mu$"); plt.legend();

[Figure: 95% confidence intervals for $\mu$ across the 200 simulated samples; intervals that do not cover 0 are shown in red.]

The pivotal method

It is not often that we can compute such a coverage probability directly. One way to work around this is to find a random variable $g(X_1,\ldots,X_n,\theta)$, called a pivot, whose distribution does not depend on $\theta$ or any other unknown parameters.
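For instance, in the normal example above the studentized mean is a pivot: its $t_{n-1}$ distribution involves neither $\mu$ nor $\sigma$, and inverting the probability statement gives exactly the CI used earlier.

$$g(X_1,\ldots,X_n,\mu) = \frac{\bar X - \mu}{S/\sqrt n} \sim t_{n-1}
\;\Rightarrow\;
P\Big[\bar X - t_{0.975,\,n-1}\tfrac{S}{\sqrt n} \leq \mu \leq \bar X + t_{0.975,\,n-1}\tfrac{S}{\sqrt n}\Big] = 0.95$$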

Maximum Likelihood Estimation

Given $(X_1,\ldots,X_n)$ r.v. with joint pdf

$$f(x_1,\ldots,x_n;\, \theta_1,\ldots,\theta_k)$$

where the $\theta$'s are unknown parameters.
The likelihood is defined as

$$\mathcal L(\theta_1,\ldots,\theta_k) = f(x_1,\ldots,x_n;\, \theta_1,\ldots,\theta_k)$$

Note that $x_1,\ldots,x_n$ are the fixed observations.

Suppose that for each $\vec x$, $(T_1(\vec x),\ldots,T_k(\vec x))$ maximizes $\mathcal L(\Theta)$. Then the maximum likelihood estimators (MLEs) of $\Theta$ are

$$\hat\theta_j = T_j(X_1,\ldots,X_n), \quad j = 1,\ldots,k$$
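For example, for the exponential model above the MLE has a closed form, found by maximizing the log-likelihood:

$$\mathcal L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda\sum_i x_i},
\qquad \ln\mathcal L(\lambda) = n\ln\lambda - \lambda\sum_{i=1}^n x_i$$

$$\frac{d}{d\lambda}\ln\mathcal L(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0
\;\Rightarrow\; \hat\lambda = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar X},$$

which coincides with the MoM estimator $\hat\lambda(1)$.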

Existence and uniqueness

  • MLE is essentially an ad hoc procedure, albeit one that works very well in many problems.
  • MLEs need not be unique, although in most cases they are.
  • MLEs may not exist, typically when the sample size is too small.

Sufficient Statistic

A statistic $T = (T_1(\vec X),\ldots,T_m(\vec X))$ is sufficient for $\theta$ if the conditional distribution of $\vec X$ given $T = t$ depends only on $t$ (and not on $\theta$).

Neyman Factorization Theorem

$T$ is sufficient for $\theta$ if and only if the joint density factors as

$$f(\vec x;\theta) = g(T(\vec x);\theta)\, h(\vec x)$$

for some functions $g$ and $h$, where $h$ does not depend on $\theta$.
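For example, for the exponential sample above,

$$f(\vec x;\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i}
= \underbrace{\lambda^n e^{-\lambda\sum_i x_i}}_{g(T(\vec x);\,\lambda)}\cdot \underbrace{1}_{h(\vec x)},
\qquad T(\vec x) = \sum_{i=1}^n x_i,$$

so $T(\vec X) = \sum_i X_i$ (equivalently $\bar X$) is sufficient for $\lambda$.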

Observed Fisher Information

Given the MLE $\hat\theta$, the observed Fisher information is

$$\mathcal I(\hat\theta) = -\frac{d^2}{d\theta^2}\ln \mathcal L(\theta)\bigg|_{\theta = \hat\theta}$$

The observed Fisher information provides an estimate of the standard error of $\hat\theta$:

$$\hat{\mathrm{se}}(\hat\theta) = \{\mathcal I(\hat\theta)\}^{-1/2}$$

Mathematically, $\mathcal I(\hat\theta)$ is the curvature of the log-likelihood function at its maximum. If the curvature is large, the maximum is sharply defined and the estimated standard error is small; a flat log-likelihood gives a large standard error.
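A small numerical sketch (not in the original notes; the helper log_lik and the step size eps are illustrative choices) that computes the observed information for the exponential MLE by finite differences and compares it with the closed form $n/\hat\lambda^2$:

import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 2.0, size=100)   # simulated data, true lambda = 2
n = x.size
lam_hat = 1 / x.mean()                         # exponential MLE

def log_lik(lam):
    # log-likelihood of the exponential model: n*log(lambda) - lambda*sum(x)
    return n * np.log(lam) - lam * x.sum()

eps = 1e-4
# central finite difference for -d^2/d lambda^2 log L, evaluated at lambda_hat
obs_info = -(log_lik(lam_hat + eps) - 2 * log_lik(lam_hat) + log_lik(lam_hat - eps)) / eps**2
se_hat = obs_info ** -0.5
print(obs_info, n / lam_hat**2)   # numerical vs closed-form observed information
print(se_hat, lam_hat / n**0.5)   # the two standard-error estimates agree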

Approximate normality of MLEs

Theorem. For $X_1,\ldots,X_n$ independent with pdf $f(\cdot;\theta)$ for some real-valued $\theta\in\Theta$, if

  • $\Theta$ is an open set,
  • $A = \{x: f(x;\theta) > 0\}$ does not depend on $\theta$ (true for exponential families),
  • $l(x;\theta) = \ln f(x;\theta)$ is three times differentiable w.r.t. $\theta$ for each $x\in A$.

Then, with $\theta_0$ being the true parameter, we have

$$\begin{aligned}
\sqrt n(\hat\theta_n - \theta_0) &\approx \bigg\{-n^{-1}\sum^n l''(X_i; \theta_0)\bigg\}^{-1}\frac{1}{\sqrt n} \sum^n l'(X_i;\theta_0)\\
&\approx \frac{1}{\sqrt n} \sum^n \frac{l'(X_i; \theta_0)}{\mathcal I(\theta_0)}\\
&\rightarrow^d N(0,\, \mathcal I(\theta_0)^{-1})
\end{aligned}$$

where $\mathcal I(\theta_0) = E_{\theta_0}\{-l''(X_i;\theta_0)\}$ is the Fisher information of a single observation.

Proof. The conclusion follows from a Taylor expansion of the score equation $0 = n^{-1/2}\sum^n l'(X_i;\hat\theta_n)$ around $\theta_0$:

$$0 = \frac{1}{\sqrt n} \sum^n l'(X_i;\theta_0) + \{\sqrt n (\hat\theta_n - \theta_0)\} \frac{\sum^n l''(X_i; \theta_0)}{n} + \sqrt n(\hat\theta_n - \theta_0) \times R_n$$

where $R_n = \frac{1}{2}(\hat\theta_n - \theta_0) \frac{\sum^n l'''(X_i; \theta_n^*)}{n}$ is the Taylor remainder, with $\theta_n^*$ between $\hat\theta_n$ and $\theta_0$, so that

$$\sqrt n (\hat\theta_n - \theta_0) = \bigg\{-n^{-1}\sum^n l''(X_i; \theta_0) - R_n\bigg\}^{-1}\frac{1}{\sqrt n} \sum^n l'(X_i;\theta_0)$$

Then note that $\hat\theta_n - \theta_0 \rightarrow^p 0$ (i.e. $\hat\theta_n$ is a consistent estimator, by assumption) and that $\frac{1}{n}\sum l'''(X_i; \theta_n^*)$ is bounded in probability, so that their product

$$R_n = \frac{1}{2}(\hat\theta_n - \theta_0) \frac{\sum^n l'''(X_i; \theta_n^*)}{n}\rightarrow^p 0$$

Therefore $-n^{-1}\sum^n l''(X_i;\theta_0) - R_n \rightarrow^p \mathcal I(\theta_0)$ by the law of large numbers, and since $n^{-1/2}\sum^n l'(X_i;\theta_0) \rightarrow^d N(0,\,\mathcal I(\theta_0))$ by the CLT, we conclude $\sqrt n(\hat\theta_n - \theta_0)\rightarrow^d N(0,\, \mathcal I(\theta_0)^{-1})$. $\blacksquare$
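A simulation sketch (again not part of the original notes) checking this approximation for the exponential MLE $\hat\lambda = 1/\bar X$, for which $\mathcal I(\lambda) = \lambda^{-2}$ and hence $\sqrt n(\hat\lambda_n - \lambda_0)\approx N(0, \lambda_0^2)$:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

lambda_0, n, n_rep = 2.0, 100, 5000
rng = np.random.default_rng(3)
x = rng.exponential(scale=1 / lambda_0, size=(n_rep, n))
lam_hat = 1 / x.mean(axis=1)                  # MLE for each replication
z = np.sqrt(n) * (lam_hat - lambda_0)         # centred and scaled MLE

# histogram of sqrt(n)(lambda_hat - lambda_0) against the N(0, lambda_0^2) density
plt.hist(z, bins=50, density=True, alpha=0.5, label="simulated")
grid = np.linspace(z.min(), z.max(), 200)
plt.plot(grid, stats.norm.pdf(grid, scale=lambda_0), label=r"$N(0, \lambda_0^2)$")
plt.legend();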