Skip to content

Binary logistic regression

Case Study: Donner Party

library(Sleuth2)
donner = case2001
Age = donner$Age
Sex = donner$Sex
Status = donner$Status
# contingency table
xtabs(~Status+Sex, data=donner)
          Sex
Status     Male Female
  Died       20      5
  Survived   10     10
summary(Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   15.0    24.0    28.0    31.8    40.0    65.0

Response \(Y_i=\{0,1\}\) survived or died
Predictor: \(X_i\) age/sex

Model

\(Y_i\mid X_i=\mathbb{I}\)(response is in category of interest)\(\sim Bernoulli(\pi_i)\)
\(E(Y_i\mid X_i)=\pi_i, var(Y_i\mid X_i)=\pi_1(1-\pi_i)\)

Generalized Linear Models

Having response \(Y\) can a set of explanatory variables \(X_1,...,X_p\)

Want \(E(Y)\) as a linear function in the parameters, ie. \(g(E(Y))=X\beta\)

Key idea: find the link function \(g\) such that the model is true

Some link functions

link function usual distribution of \(Y\mid X\)
identity \(g(\mu)=\mu\) normal
log \(g(\mu)=\log\mu,\mu>0\) Poisson(count data)
logistic \(g(\mu)=\log(\mu/(1-\mu)),0<\mu<1\) Bernoulli(binary),Binomial

GLM vs. Transformation

  • For transformation, transforming \(Y\) so it has an approximate normal distribution with constant variance
  • For GLM, distribution of \(Y\) not restricted to Normal
  • Model parameters describe \(g(E(Y))\) rather than \(E(g(Y))\)
  • GLM provide a unified theory of modeling that encompasses the most important models for continuous and discrete variables

Binary Logistic Regression

  • Let \(\pi=P("success"),0<\pi<1\)
  • The odds in favor of success is \(\frac{\pi}{1-\pi}\)
  • The log odds is \(\log(\frac{\pi}{1-\pi})\)

Model

\(\log(\frac{\pi}{1-\pi}) = \beta_0 + \beta_1X_1+...+\beta_pX_p\)

Let the linear predictor \(\eta=\beta_0 + \beta_1X_1+...+\beta_pX_p\), then Logistic function is \(\pi(\eta)=\frac{e^{\eta}}{1+e^\eta}\)

eq <- function(x){exp(x)/(1+exp(x))}
plot(eq(-10:10), type='l')

png

  • log-odds \(\log(\frac{\pi}{1-\pi})\in(-\infty, \infty)\), is increasing function.
  • Predicts the natural log of the odds for a subject bing in one category or another
  • Regression coefficients can be used to estimate odds ratio for each of the independent variables
  • Tell which predictors can be used to determine if a subject was in a category of interest

MLE

  • Date: \(Y_i=\mathbb{I}(\text{if response is in category of interest})\)
  • Model: \(P(Y_i=y_1)=\pi_i^{y_{i}}(1-\pi_i)^{1-y_{i}}\)
  • Assumption: The observations are independent.
  • Joint density:

    \[P(Y_1=y_1,...,Y_n=y_n)=\prod_1^n \pi_i^{y_{i}}(1-\pi_i)^{1-y_{i}}\]
    \[\pi_i = \frac{\exp(\beta_0+\beta_1X_{i1}+...+\beta_pX_{ip})}{1+\exp(\beta_0+\beta_1X_{i1}+...+\beta_pX_{ip})}\]

Likelihood function: plug in observed data and think of the joint density as a function of \(\beta\)'s

\[\mathcal{L}(\beta_0,...,\beta_p)=\prod_{1^n}\pi_i\vec\beta^{y_i}(1-\pi_i{\vec\beta})^{1-y_i}\]

There is no explicit expression for the max likelihood estimators. Two iterative numerical solution methods are Newton-Raphson algorithm and Fisher scoring or IWLS (In R, glm())

# additive
fita<-glm(Status~Age+Sex, family=binomial, data=donner)
summary(fita)
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7445  -1.0441  -0.3029   0.8877   2.0472

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.63312    1.11018   1.471   0.1413  
Age         -0.07820    0.03728  -2.097   0.0359 *
SexFemale    1.59729    0.75547   2.114   0.0345 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 51.256  on 42  degrees of freedom
AIC: 57.256

Number of Fisher Scoring iterations: 4
anova(fita)
A anova: 3 × 4
DfDevianceResid. DfResid. Dev
<int><dbl><int><dbl>
NULLNA NA4461.82654
Age 15.5358204356.29072
Sex 15.0344374251.25628
# modelling "Died"
Status2 = relevel(Status, ref="Survived")
fitad <- glm(Status2 ~ Age + Sex, family=binomial, data=donner)
summary(fitad)
Call:
glm(formula = Status2 ~ Age + Sex, family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0472  -0.8877   0.3029   1.0441   1.7445

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -1.63312    1.11018  -1.471   0.1413  
Age          0.07820    0.03728   2.097   0.0359 *
SexFemale   -1.59729    0.75547  -2.114   0.0345 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 51.256  on 42  degrees of freedom
AIC: 57.256

Number of Fisher Scoring iterations: 4
# Reference group as "Male"
Sex2 = relevel(Sex, ref="Male")
fitadf <- glm(Status2~Age + Sex2, family=binomial, data=donner)
summary(fitadf)
Call:
glm(formula = Status2 ~ Age + Sex2, family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0472  -0.8877   0.3029   1.0441   1.7445

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -1.63312    1.11018  -1.471   0.1413  
Age          0.07820    0.03728   2.097   0.0359 *
Sex2Female  -1.59729    0.75547  -2.114   0.0345 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 51.256  on 42  degrees of freedom
AIC: 57.256

Number of Fisher Scoring iterations: 4
# Additive model for survived
fitasf <- glm(Status~Age + Sex2, family=binomial, data=donner)
summary(fitasf)
Call:
glm(formula = Status ~ Age + Sex2, family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7445  -1.0441  -0.3029   0.8877   2.0472

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.63312    1.11018   1.471   0.1413  
Age         -0.07820    0.03728  -2.097   0.0359 *
Sex2Female   1.59729    0.75547   2.114   0.0345 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 51.256  on 42  degrees of freedom
AIC: 57.256

Number of Fisher Scoring iterations: 4
# the order of independent variable does not metter
fitsaf <- glm(Status~ Sex2+Age, family=binomial, data=donner)
summary(fitsaf)
Call:
glm(formula = Status ~ Sex2 + Age, family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7445  -1.0441  -0.3029   0.8877   2.0472

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.63312    1.11018   1.471   0.1413  
Sex2Female   1.59729    0.75547   2.114   0.0345 *
Age         -0.07820    0.03728  -2.097   0.0359 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 51.256  on 42  degrees of freedom
AIC: 57.256

Number of Fisher Scoring iterations: 4
anova(fitasf)
A anova: 3 × 4
DfDevianceResid. DfResid. Dev
<int><dbl><int><dbl>
NULLNA NA4461.82654
Age 15.5358204356.29072
Sex2 15.0344374251.25628
anova(fitsaf)
A anova: 3 × 4
DfDevianceResid. DfResid. Dev
<int><dbl><int><dbl>
NULLNA NA4461.82654
Sex2 14.5402674357.28628
Age 16.0299914251.25628
# Higher order model with 3 higher order/interaction terms
fitfull <- glm(Status~Age + Sex2 + Age:Sex2+I(Age^2)+I(Age^2):Sex2, family=binomial, data=donner)
summary(fitfull)
Call:
glm(formula = Status ~ Age + Sex2 + Age:Sex2 + I(Age^2) + I(Age^2):Sex2, 
    family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3396  -0.9757  -0.3438   0.5269   1.5901

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)         -3.318484   3.940184  -0.842    0.400
Age                  0.183031   0.226632   0.808    0.419
Sex2Female           0.265286  10.455222   0.025    0.980
I(Age^2)            -0.002803   0.002985  -0.939    0.348
Age:Sex2Female       0.299877   0.696050   0.431    0.667
Sex2Female:I(Age^2) -0.007356   0.010689  -0.688    0.491

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 45.361  on 39  degrees of freedom
AIC: 57.361

Number of Fisher Scoring iterations: 5
# interaction model
fitas <- glm(Status~Age*Sex2, family=binomial,data=donner)
summary(fitas)
Call:
glm(formula = Status ~ Age * Sex2, family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2279  -0.9388  -0.5550   0.7794   1.6998

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)     0.31834    1.13103   0.281   0.7784  
Age            -0.03248    0.03527  -0.921   0.3571  
Sex2Female      6.92805    3.39887   2.038   0.0415 *
Age:Sex2Female -0.16160    0.09426  -1.714   0.0865 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 47.346  on 41  degrees of freedom
AIC: 55.346

Number of Fisher Scoring iterations: 5
anova(fitasf, test="Chisq")
A anova: 3 × 5
DfDevianceResid. DfResid. DevPr(>Chi)
<int><dbl><int><dbl><dbl>
NULLNA NA4461.82654 NA
Age 15.5358204356.290720.01863103
Sex2 15.0344374251.256280.02484816
vcov(fitasf, digits=3)
A matrix: 3 × 3 of type dbl
(Intercept)AgeSex2Female
(Intercept) 1.23250837-0.038472741 0.06007099
Age-0.03847274 0.001390134-0.00823197
Sex2Female 0.06007099-0.008231970 0.57073339
cbind(bhat=coef(fitasf), confint.default(fitasf)) # CI for beta's
A matrix: 3 × 3 of type dbl
bhat2.5 %97.5 %
(Intercept) 1.63312031-0.5428002 3.809040837
Age-0.07820407-0.1512803-0.005127799
Sex2Female 1.59729350 0.1166015 3.077985503
exp(coef(fitasf)) # odd ratios
(Intercept) 5.11982524825608
Age         0.924775685984312
Sex2Female  4.93964517580916
exp(cbind(OR=coef(fitasf), confint.default(fitasf))) # Ci for odd ratios
A matrix: 3 × 3 of type dbl
OR2.5 %97.5 %
(Intercept)5.11982520.581118745.1071530
Age0.92477570.8596067 0.9948853
Sex2Female4.93964521.123671621.7146143

Wald Chi-sq Procedures

logistic regression inference on a single \(\beta\)

  • Hypotheses: \(H_0:\beta_j=0,H_a:\beta_j\neq 0\)
  • Test statistic: \(z=\hat\beta_j/se(\hat\beta_j)\), \(\hat\beta_j\) is MLE for \(\beta_j\)
  • By standard large-sample results, MLE's are normally distributed. Thus, for large \(n\), under \(H_0\), z is an observation from an approx. \(N(0,1)\). Note that \(Z^2\sim \chi_1\), we can use chisq for p-value
  • 95% CI: \(\hat\beta_j \pm 1.96se(\hat\beta_j)\)

Since the logistic function is monotonic, we can expand CI by doing monotonic transformation (e.x. multiplication and exp)

# Wald test
library(aod)
wald.test(Sigma=vcov(fitasf), b=coef(fitasf), Terms=2:3)
Wald test:
----------
Chi-squared test:
X2 = 6.9, df = 2, P(> X2) = 0.032
wald.test(Sigma=vcov(fitas), b=coef(fitas), Terms=4)
Wald test:
----------
Chi-squared test:
X2 = 2.9, df = 1, P(> X2) = 0.086

Model Assumptions

  • Underlying probability model for response is Bernoulli
  • Observations are independent
  • The form of the model is correct
    • Linear relationship between logits and explanatory variables
    • All relevant variables are included; irrelevant ones excluded
  • Sample size is large enough for valid inference

B Logistic R vs. Linear R - both utilize MLR's for the \(\beta\)'s - Less assumption to check for than in linear regression - No need to check for outliers - No residual plots - Variance is not constant

Need to test - (designing aspect) Independence of observations - (inference) Form of the model: test higher-order terms such as non-linear in X, interaction, and higher order interaction

Wald cs. LRT test

  • Walk test on single \(\beta = 0\), LRT find \(k\geq 1\) \(\beta\)'s.
  • LRT compares nested models
  • LRT allow small to moderate sample sizes \(\beta\) near boundary of parameter space

Comparing models: Likelihood Ratio Test

  • Idea: compare likelihood of data under Full model, \(\mathcal{L}_F\) to likelihood under REDUCED model \(\mathcal{L}_R\) of same data. the likelihood ratio \(\mathcal{L}_R/\mathcal{L}F\) where \(\mathcal{L}_R\leq \mathcal{L}_F\)
  • Hypothesis: \(H_0:\beta_1=...\beta_k=0\) (Reduced model is appropriate, fits data as swll as Full model)\(H_a:\) at least one of betas is not 0 (full model is better)
  • Test statistic: Deviance (residual)
\[G^2 = -2\log\mathcal{L}_R - (-2\log\mathcal{L}_F)=-2\log(\mathcal{L}_R/\mathcal{L}_F)\sim \chi^2_k\]

For large \(n\), under \(H_0\), \(G^2\) is an observation from a chi-square distribution with \(k\) d.f.

Exercise

Models

\[\begin{align*} log(\frac{\pi_i}{1-\pi_i}) &= \\&(1) a_0+a_1Age_i+a_2I_{Female, i}\\ &(2) \beta_0 +\beta_1Age_i+\beta_2 I_{female_i,i} + \beta_3Age_i*I_{female,i}+\beta_4Age_i^2 + \beta_5Age_i^2*I_{Female, i} \\ &(3) \gamma_0 + \gamma_1Age_i + \gamma_2I_{Female, i} + \gamma_3Age_i *I_{Female, i}\end{align*}\]
# additive model (model 1)
fitadd <- glm(Status~Age + Sex, family=binomial, donner)
summary(fitadd)
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7445  -1.0441  -0.3029   0.8877   2.0472

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.63312    1.11018   1.471   0.1413  
Age         -0.07820    0.03728  -2.097   0.0359 *
SexFemale    1.59729    0.75547   2.114   0.0345 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 51.256  on 42  degrees of freedom
AIC: 57.256

Number of Fisher Scoring iterations: 4
# full model (model 2)
fitfull <- glm(Status~Age + Sex + Age: Sex + I(Age^2) + I(Age^2):Sex, 
               family=binomial, donner)
summary(fitfull)
Call:
glm(formula = Status ~ Age + Sex + Age:Sex + I(Age^2) + I(Age^2):Sex, 
    family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3396  -0.9757  -0.3438   0.5269   1.5901

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)        -3.318484   3.940184  -0.842    0.400
Age                 0.183031   0.226632   0.808    0.419
SexFemale           0.265286  10.455222   0.025    0.980
I(Age^2)           -0.002803   0.002985  -0.939    0.348
Age:SexFemale       0.299877   0.696050   0.431    0.667
SexFemale:I(Age^2) -0.007356   0.010689  -0.688    0.491

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 45.361  on 39  degrees of freedom
AIC: 57.361

Number of Fisher Scoring iterations: 5
# interative model (model 3)
fitint <- glm(Status~Age*Sex, family=binomial, donner)
summary(fitint)
Call:
glm(formula = Status ~ Age * Sex, family = binomial, data = donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2279  -0.9388  -0.5550   0.7794   1.6998

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept)    0.31834    1.13103   0.281   0.7784  
Age           -0.03248    0.03527  -0.921   0.3571  
SexFemale      6.92805    3.39887   2.038   0.0415 *
Age:SexFemale -0.16160    0.09426  -1.714   0.0865 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 47.346  on 41  degrees of freedom
AIC: 55.346

Number of Fisher Scoring iterations: 5
# p-value for model2 vs. model1
1 - pchisq(5.895, 3)
0.11683196297711

Q1

Using R output determine whether a model with the 3 higher order polynomial terms and interaction term is an improvement over the additive model

Higher order interaction vs. additive Hypotheses: \(H_0:\beta_3=\beta_4=\beta_5 = 0\)

Test statistic \(G^2 = 51.256 - 45.361 = 5.895\sim \chi^2_3\)

p-value: 0.12

Conclusion: There is some evidence that the additive model is better than the higher order model

Q2

determine whether the effect of Age on the odds of survival differ with Sex

Hypothesis: \(H_0: \gamma_3 = 0\)

Test statistic (LRT): \(51.256 - 47.346 = 3.91\sim \chi^2_1\)

p-value(LRT): 0.048

Test statistic (Wald): \(-0.16160/0.09426 = -1.71\sim N(0,1)\)

p-value(Wald): 0.0865

(Shown in R-output)

Conclusion: There is inconclusive evidence that the interactive model is better than the additive model

# p-value for Q2 LRT
1 - pchisq(51.256-47.346, 1)
0.0479995973077516

Q3

determine whether or not the additive model fits better then the Null model

Hypothesis: \(H_0: a_i = 0, i\neq 0\)

Test statistic: \(G^2 = 61.827 - 51.256=10.571\sim \chi_2^2\)

p-value: 0.005

Conclusion: There is strong evidence that age and sex are relevant factors for estimating the odds of survival

# p-value for Q3
1 - pchisq(61.827-51.256, 2)
0.00506449930534014

AIC & BIC

Akaike's Information Criterion

\[AIC=-2log\mathcal{L}+2(p+1)\]

Schwarz's (Bayesian Information) Criterion

\[-2log\mathcal{L}+(p+1)\log N\]

\(p\) number of variables, \(N\) sample size

Smaller means better model fit
BIC applies stronger penalty for model complexity than AIC

AIC Rule of Thumb: one model fits better if \(diff(AIC) > 10\), equivalent if \(diff(AIC)<2\)

For the example, we can choose the simplest additive model

Estimated probability of survival

phats <- predict.glm(fitasf, type="response")
phats[1:5]
1   0.458700966259221
2   0.525540478323231
3   0.183166067768785
4   0.328935876664949
5   0.364335972178779
library(ggplot2)
ggplot(donner, aes(x=Age, y=Status, color=Sex))+geom_point()

png

ggplot(donner, aes(x=Age, y=phats)) + ylab("Estimated Probablities")+geom_line(aes(color=Sex), size=1)

png