
Summary Sheet

Pooled Two-Sample t-tests

  • assume equal population variances
  • \(s^2_p = \frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2}\)
  • \(t = \frac{\bar{x} - \bar{y} -D_0}{\sqrt{s_p^2(n_x^{-1}+n_y^{-1})} }\sim t_{n_x+n_y-2}\)
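A quick numeric check, as a sketch with simulated samples `x` and `y` (scipy's `equal_var=True` option implements exactly this pooled statistic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, size=20)  # sample 1
y = rng.normal(4.0, 2.0, size=25)  # sample 2

# pooled variance and t statistic by hand (D0 = 0)
nx, ny = len(x), len(y)
sp2 = ((nx - 1)*x.var(ddof=1) + (ny - 1)*y.var(ddof=1)) / (nx + ny - 2)
t = (x.mean() - y.mean()) / np.sqrt(sp2*(1/nx + 1/ny))

# matches scipy's pooled two-sample test, df = nx + ny - 2
t_scipy, p = stats.ttest_ind(x, y, equal_var=True)
```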

SLM Dummy variable

\(Y_i=\beta_0+\beta_1X_i + \epsilon_i\) assumptions

  • linear model is appropriate
  • \(\epsilon_i \sim N(0,\sigma^2)\)

Hypothesis test:
\(H_0:\beta_1 = 0, H_a:\beta_1\neq 0\)
\(t=b_1/se(b_1)\sim t_{N-2}\), \(N\) is the total number of observations
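A minimal sketch of this test on simulated data (statsmodels reports \(t = b_1/se(b_1)\) and its \(t_{N-2}\) p-value):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.5*x + rng.normal(size=50)

res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.tvalues[1], res.pvalues[1])  # t = b1/se(b1) and its t_{N-2} p-value
```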

One Way ANOVA & GLM

\(\vec{Y}=X\vec{\beta}+\vec{\epsilon}\)

  • assumptions: same as dummy variable, jointly normally distributed errors
  • \(F=MSReg/MSE = \frac{SSR/(G-1)}{SSE/(N-G)}\sim F_{G-1,N-G}\)
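A sketch of the one-way F-test with simulated groups (`scipy.stats.f_oneway` computes exactly this ratio):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(m, 1.0, size=15) for m in (0.0, 0.2, 1.0)]  # G = 3 groups

F, p = stats.f_oneway(*groups)  # F ~ F_{G-1, N-G} under H0
```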

Multiple Comparisons

Bonferroni's Method:

  • \(P(\cup A_i)\leq \sum P(A_i)\)
  • \(k= {G \choose 2}\)
  • test each comparison at level \(\alpha/k\)

Tukey's Method: less conservative than Bonferroni's method
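A sketch of both adjustments on simulated data (group labels `A`, `B`, `C` and the group means are made up; Bonferroni just divides the level by \(k\)):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(m, 1.0, size=20) for m in (0.0, 0.5, 1.5)])
g = np.repeat(["A", "B", "C"], 20)

# Tukey HSD at family-wise level 0.05
print(pairwise_tukeyhsd(y, g, alpha=0.05))

# Bonferroni: run each of the k = C(3,2) = 3 pairwise t-tests at level 0.05/3
```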

Two Way ANOVA

Overall vs. Partial F-tests
\(H_0:\) a subset of the \(\beta\)'s are 0. \(H_a:\) some \(\beta\)'s in the subset are not 0.
Let FULL model be with all explanatory variables, REDUCED be without the coefficients in testing.

\[F=\frac{(RSS_r - RSS_f)/\# \beta \text{ tested} }{MSE_f}\sim F_{\# \beta \text{ tested},\ \text{residual d.f. of FULL}}\]
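A sketch of the partial F-test via `anova_lm` (simulated data; here we test whether the `x2` and `x3` coefficients are jointly 0):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(60, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1 + 2*df.x1 + 0.5*df.x2 + rng.normal(size=60)

reduced = ols("y ~ x1", df).fit()         # without the betas under test
full = ols("y ~ x1 + x2 + x3", df).fit()  # all explanatory variables
print(anova_lm(reduced, full))            # F = ((RSS_r - RSS_f)/2) / MSE_f
```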

Describing "interactions"

GLM vs. Transformation

Transformation: transform \(Y\) so it has an approximately normal distribution with constant variance.

GLM: distribution of \(Y\) not restricted to Normal;
model parameters describe \(g(E(Y))\) rather than \(E(g(Y))\)
GLMs provide a unified theory of modeling that encompasses the most important models for continuous and discrete variables.

GLM tests

Wald
\(H_0:\beta_j = 0, H_a: \beta_j\neq 0\)
\(z=\hat\beta_j / se(\hat\beta_j)\sim N(0,1)\).
CI: \(\hat\beta_j \pm z_{\alpha/2} se(\hat\beta_j)\)
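A sketch of the Wald test and CI for a logistic GLM on simulated data (exponentiating the CI endpoints gives the odds-ratio CI, as noted later):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = rng.binomial(1, 1/(1 + np.exp(-X @ np.array([-0.5, 1.0]))))

res = sm.GLM(y, X, family=sm.families.Binomial()).fit()
z = res.params / res.bse  # Wald z for each beta_j
ci = res.conf_int()       # beta_j +/- z_{alpha/2} se(beta_j)
or_ci = np.exp(ci)        # CI for the corresponding odds ratios
```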

LRT
\(H_0:\) some \(\beta\)'s are 0, \(H_a:\) at least one tested \(\beta\) is not 0.

\[G^2 = (-2\log \mathcal L_R) - (-2\log \mathcal L_F) = -2\log (\mathcal L_R / \mathcal L_F)\sim \chi^2_k\]

\(k= \# \beta\) tested
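A sketch of the LRT for one dropped \(\beta\) on simulated data (`llf` is the fitted log-likelihood in statsmodels):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = rng.binomial(1, 1/(1 + np.exp(-X[:, 0])))

full = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
reduced = sm.GLM(y, sm.add_constant(X[:, [0]]), family=sm.families.Binomial()).fit()

G2 = 2*(full.llf - reduced.llf)  # = (-2 log L_R) - (-2 log L_F)
p = chi2.sf(G2, df=1)            # k = 1 beta tested
```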

Global LRT
LRT comparing the fitted model to the NULL model (null deviance)

AIC, BIC

  • combines log-likelihood with a penalty
  • \(AIC = -2\log\mathcal L + 2(p+1)\)
  • \(BIC = -2\log\mathcal L + (p+1)\log N\)
  • \(p\) number of explanatory variables, \(N\) sample size
  • Smaller is better
  • Better = \(\Delta AIC > 10\) (the smaller-AIC model is clearly better)
  • Same = \(\Delta AIC < 2\) (models are essentially equivalent)
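The formulas as a tiny helper (the log-likelihood value below is hypothetical):

```python
import numpy as np

def aic_bic(loglik, p, N):
    """AIC = -2 log L + 2(p+1); BIC = -2 log L + (p+1) log N."""
    return -2*loglik + 2*(p + 1), -2*loglik + (p + 1)*np.log(N)

print(aic_bic(loglik=-120.3, p=2, N=200))  # hypothetical fitted values
```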

SLR vs. Binary LR

  • both use MLE

  • Binary LR has fewer assumptions

  • no outliers assumption
  • no residual plots
  • no constant-variance assumption (variance is \(\pi(1-\pi)\))

Binary Logistic Regression

underlying distribution for each independent observation: \(Bernoulli(\pi_i)\)

We cannot estimate \(\pi_i\) for individual \(i\).

  • Let \(\pi = P(success)\),
  • ODDS: \(\pi/(1-\pi)\)
  • LOG ODDS: \(\log(\pi/(1-\pi))\)
  • ODDS RATIO is the ratio of two ODDS

\(E(Y\mid X)=\pi, var(Y\mid X) = \pi(1-\pi)\)

The model

\[\log(\pi/(1-\pi)) = X\beta\]
\[\log(\frac{\pi_i}{1-\pi_i}) = X_i\beta \quad\text{(no error term)}\]

MLE: \(P(Y_i=y_i)=\pi_i^{y_i}(1-\pi_i)^{1-y_i}\)

\[\mathcal{L} = \prod_1^n\pi_i^{y_i}(1-\pi_i)^{1-y_i}\]

where \(\pi_i = \frac{\exp(X_i\beta)}{1+\exp(X_i\beta)}=e^{\mu_i}/(1+e^{\mu_i})\) with \(\mu_i = X_i\beta\), and

\[1-\pi_i = 1-\frac{e^{\mu_i}}{1+e^{\mu_i}} = (1+e^{\mu_i})^{-1}\]
\[\log\mathcal{L} = \sum_1^n y_i(X_i\beta) - y_i\log(1+\exp(X_i\beta))-(1-y_i)\log(1+\exp(X_i\beta)) = \sum_1^n y_i(X_i\beta) - \log(1+\exp(X_i\beta))\]
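A minimal sketch of maximizing this log-likelihood directly on simulated data (`np.logaddexp(0, xb)` is a numerically stable \(\log(1+e^{xb})\)); statsmodels reaches the same \(\hat\beta\) via IRLS:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = rng.binomial(1, 1/(1 + np.exp(-X @ np.array([-0.5, 1.2]))))

def negloglik(beta):
    xb = X @ beta
    return -np.sum(y*xb - np.logaddexp(0, xb))  # -[sum y_i X_i b - log(1+e^{X_i b})]

beta_hat = minimize(negloglik, x0=np.zeros(2)).x
```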

Let \((a,b)\) be the CI for \(\beta_j\); the CI for the odds ratio is \((e^a, e^b)\). We cannot compute a CI for \(\pi\) this way since \(\hat\pi\) is not normally distributed.

Assumptions:

  • underlying model for Y is Bernoulli
  • independent observations
  • Correct form of model (linear relationship, included all relevant variables and excluded irrelevant)
  • large enough sample size

Binomial Logistic Regression

Let \(Y\) be the count of "successes" out of \(m\) trials

\(P(Y=y)={m\choose y}\pi^y (1-\pi)^{m-y}\)

\(E(Y)=m\pi, var(Y)=m\pi(1-\pi)\)

Then the proportion of successes \(E(Y/m)=\pi, var(Y/m)=\pi(1-\pi)/m\)

Assume each group of observations is independent.

We can estimate \(\pi_i\) in this case

MLE:

\[P(Y_i=y_i) = {m_i\choose y_i}\pi_i^{y_i}(1-\pi_i)^{m_i-y_i}\]
\[\mathcal L = \prod_1^n {m_i\choose y_i}\pi_i^{y_i}(1-\pi_i)^{m_i-y_i}\]

where \(\pi_i = \frac{e^{\mu_i}}{1+e^{\mu_i}}\), \(\mu_i = X_i\beta\)

\[\log\mathcal L = \sum y_i\log(\pi_i)+(m_i-y_i)\log(1-\pi_i) + \log{m_i\choose y_i}\]
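A sketch of a grouped (binomial) fit in statsmodels, with hypothetical counts; the response is passed as a two-column array of (successes, failures):

```python
import numpy as np
import statsmodels.api as sm

# hypothetical grouped data: y_i successes out of m_i trials at each x_i
x = np.array([0., 1., 2., 3., 4.])
m = np.array([50, 50, 50, 50, 50])
y = np.array([4, 9, 22, 34, 45])

res = sm.GLM(np.column_stack([y, m - y]), sm.add_constant(x),
             family=sm.families.Binomial()).fit()
```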

Deviance \(=-2\log(\mathcal L_M/\mathcal L_S) = -2(\log \mathcal L_M - \log \mathcal L_S)\).

The saturated model has deviance 0 (its log likelihood ratio to itself is 0).

Logistic Regression Problems

  • Extrapolation: model outside of range of observed data may not be appropriate
  • Multicollinearity
    • unstable fitted equation
    • implausible coefficient significance and signs
    • large standard errors of coefficients
    • MLE may not converge
  • Influential points

Specific to logistic

  • Complete separation
    • one explanatory variable, or a linear combination of explanatory variables, perfectly predicts \(Y\); then the MLE cannot be computed
  • Quasi-complete separation
    • almost perfectly predicts \(Y\)
    • Solution: simplify the model, or try other options

Extra-binomial variation

  • when Bernoulli observations are not independent
  • use quasibinomial
  • model for variance: \(var(Y_i)=\phi m_i \pi_i(1-\pi_i)\)
  • \(\hat\phi =\) sum of squared Pearson residuals / d.f.
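Continuing from the grouped binomial fit `res` above, a sketch of the dispersion estimate and the quasibinomial-style SE adjustment:

```python
import numpy as np

phi_hat = (res.resid_pearson**2).sum() / res.df_resid  # Pearson X^2 / d.f.
se_adj = res.bse * np.sqrt(phi_hat)                    # inflated standard errors
```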

GOF

To check model adequacy using LRT

\(H_0:\) the fitted model fits the data as well as the saturated model. \(H_a:\) the saturated model is better; the fitted model is inadequate.

\(G^2 = -2\log(\mathcal L_F /\mathcal L_S)\sim \chi^2_{n-(p+1)}\)
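Continuing from the grouped binomial fit `res` above, a sketch of the deviance GOF test:

```python
from scipy.stats import chi2

G2 = res.deviance              # -2 log(L_F / L_S)
p = chi2.sf(G2, res.df_resid)  # df = n - (p+1); small p => model inadequate
```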

Log linear Model

  • Why not linear regression?
    • outcome is counts, often small numbers
    • counts won't have a normal distribution conditional on the covariates (e.g. age)
  • Why not logistic?
    • not a binary outcome
    • not a binomial outcome, since there is no fixed number of trials

\(P(Y=y)=\mu^y e^{-\mu} / y!, E(Y)=var(Y)=\mu\)

\[\mathcal L = \prod_1^n \mu_i^{y_i} e^{-\mu_i} / y_i!\]
\[\log\mathcal L = \sum_1^n [y_i \log (\mu_i) -\mu_i - \log(y_i!)]\]
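A sketch of a Poisson GLM fit (simulated counts with \(\log\mu = 0.3 + 0.8x\)):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
x = rng.uniform(0, 2, size=100)
y = rng.poisson(np.exp(0.3 + 0.8*x))  # log link: log(mu) = 0.3 + 0.8 x

res = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
```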

Two Factor Independence

Binomial Sampling
For \(2\times 2\) table
\(H_0: \pi_a = \pi_b, H_a: \pi_a\neq \pi_b\)

\[z=\frac{\hat\pi_a - \hat\pi_b}{se(\hat\pi_a - \hat\pi_b)}\sim N(0,1)\]

Assumption:

  • each trial is a Bernoulli
  • the number of trials in each group (\(n_a, n_b\)) is fixed
  • The underlying distribution is \(y_a\sim binomial(n_a, \pi_a), y_b\sim binomial(n_b, \pi_b)\)
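A sketch of the two-proportion z-test above with hypothetical counts:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count = np.array([34, 52])   # hypothetical successes in groups a, b
nobs = np.array([100, 120])  # trials in groups a, b

z, p = proportions_ztest(count, nobs)  # z ~ N(0,1) under H0: pi_a = pi_b
```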

Contingency Table

test statistic \(\chi^2 = \sum_j\sum_i (y_{ij} - \hat\mu_{ij})^2 / \hat\mu_{ij}\sim \chi^2_{(I-1)(J-1)}\) where \(\hat\mu_{ij} = n\hat\pi_{i.}\hat\pi_{.j} = y_{i.}y_{.j}/n\)

Contingency table model:
\(Y_{ij}\) be the r.v. representing the number of observations in the cell
\(y_{ij}\) be the observed cell counts

The underlying distribution of \(Y=(Y_{11},\ldots,Y_{IJ})\sim Multinomial\)

\[P(Y=y)=\frac{n!}{y_{11}!\cdots y_{IJ}!} \prod_{i,j}\pi_{ij}^{y_{ij} }\]

Maximizing the likelihood subject to \(\sum_{ij}\pi_{ij} = 1\), we get \(\hat\pi_{ij} = y_{ij} / n\)

Under the null hypothesis of independence, \(\hat\pi_{ij} = \hat\pi_{i.}\hat\pi_{.j}\). Then we can use an LRT where the full model contains the interaction terms

\[\log\mathcal L_F = \sum_{ij}y_{ij}\log(y_{ij}/n)\]
\[\log\mathcal L_R = \sum_{ij}y_{ij}\log(y_{i.}y_{.j}/n^2)\]
\[d.f. = (IJ-1)-(I+J-2) = (I-1)(J-1)\]

The full model loses 1 parameter for the constraint \(\sum_{ij}\pi_{ij} = 1\); the reduced model loses 2, one for each of \(\sum_i \pi_{i.}=1\) and \(\sum_j\pi_{.j}=1\).
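A sketch of both the Pearson \(\chi^2\) test and the LRT \(G^2\) on a hypothetical table (`correction=False` matches the formulas above):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 30],
                  [25, 25]])  # hypothetical I x J counts

x2, p, dof, expected = chi2_contingency(table, correction=False)  # Pearson chi^2
g2, p_lrt, _, _ = chi2_contingency(table, correction=False,
                                   lambda_="log-likelihood")      # LRT G^2
```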

Fisher's Exact Test

  • randomization test
  • appropriate for small sample size
  • assumes the row and column totals are fixed
  • p-value is calculated from the hypergeometric distribution
\[P=\frac{ {a+b\choose a}{c+d\choose c} }{ {n\choose a+c} }\]
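A sketch with a hypothetical small-count table:

```python
from scipy.stats import fisher_exact

table = [[8, 2],
         [1, 5]]  # hypothetical 2x2 counts [[a, b], [c, d]]

odds_ratio, p = fisher_exact(table)
```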

Poisson Regression

  • counts aren't fixed
  • treat the \(IJ\) cell counts as realizations of Poisson random variables

Test independence by comparing models with and without the interaction term.

Three-way interactions

  • complete independence: does not have any interaction terms
  • block independence: the joint probability of two factors (say \(A,B\)) is independent of the third (\(C\)); include the \(AB\) interaction term
  • partial independence: \(P(AB\mid C)=P(A\mid C)P(B\mid C)\), i.e. \(A,B\) are conditionally independent given \(C\); include the \(AC\) and \(BC\) interactions
  • Uniform association: include all two-way interactions
  • Saturated model: include three-way interactions
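A sketch of the corresponding Poisson log-linear fits on a hypothetical \(2\times2\times2\) table (factor names and counts are made up):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# hypothetical 2x2x2 cell counts for factors A, B, C
df = pd.DataFrame({"A": [0, 0, 0, 0, 1, 1, 1, 1],
                   "B": [0, 0, 1, 1, 0, 0, 1, 1],
                   "C": [0, 1, 0, 1, 0, 1, 0, 1],
                   "n": [12, 7, 9, 15, 20, 8, 5, 11]})

pois = sm.families.Poisson()
complete = smf.glm("n ~ A + B + C", df, family=pois).fit()       # complete independence
block = smf.glm("n ~ A*B + C", df, family=pois).fit()            # (A,B) independent of C
partial = smf.glm("n ~ A*C + B*C", df, family=pois).fit()        # A,B cond. indep. given C
uniform = smf.glm("n ~ A*B + A*C + B*C", df, family=pois).fit()  # uniform association
```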