Skip to content

Binomial Logistic Regression

Y ~ Binomial(m, π)

\(P(Y=y)={m\choose y} \pi^y (1-\pi)^{m-y}\) v.s. Bernoulli => \(\pi\) remains the same in binomial, while in Bernoulli case it is independent for each \(X_i\)

\(E(Y)=m\pi\), \(var(Y)=m\pi(1-\pi)\)

Consider modelling \(Y.m\) be the proportion of success out of \(m\) independent Bernoulli trials \(E(Y/m)=\pi, var(Y/m)=\pi(1-\pi)/m\)

library(Sleuth3)
library(ggplot2)
krunnit = case2101

Extinct = krunnit$Extinct # number of "success"
AtRisk = krunnit$AtRisk
Area = krunnit$Area

pis = Extinct/AtRisk
NExtinct = AtRisk - Extinct # number of "failure"
logitpi = log(pis/(1-(pis)))
logarea = log(Area)
case2101
A data.frame: 18 × 4
IslandAreaAtRiskExtinct
<fct><dbl><int><int>
Ulkokrunni 185.8075 5
Maakrunni 105.8067 3
Ristikari 30.706610
Isonkivenletto 8.5051 6
Hietakraasukka 4.8028 3
Kraasukka 4.5020 4
Lansiletto 4.3043 8
Pihlajakari 3.6031 3
Tyni 2.6028 5
Tasasenletto 1.7032 6
Raiska 1.2030 8
Pohjanletto 0.7020 2
Toro 0.7031 9
Luusiletto 0.6016 5
Vatunginletto 0.4015 7
Vatunginnokka 0.3033 8
Tiirakari 0.204013
Ristikarenletto 0.07 6 3

Let \(\pi_i\) be the probability of extinction for each island, assume that this is the same for each species for bird on a particular island

Assume species survival is independent.
Then \(Y_i\sim Binomial(m_i,\pi_i)\)

Unlike binary logistic model for Bernoulli distribution, we can estimate \(\pi_i\) from the data.

Observed response proportion \(\bar{\pi_i} = y_i/m_i\)

Observed or empirical logits: (S-"saturated)

\[\log(\frac{\bar{\pi}_{S,i}}{1 - \bar{\pi}_{S,i}}) = \log(\frac{y_i}{m_i-y_i})\]
ggplot(krunnit, aes(x=Area, y=logitpi)) + geom_point()

png

ggplot(krunnit, aes(x=log(Area), y=logitpi)) + geom_point()

png

Then the proposed model is based on plot

\[\log(\frac{\pi_i}{1-\pi_i})=\beta_0 + \beta_1 \log(Area_i)\]
# cbind(Extinct, NExtinct ~ log(Area) is the way to represent it in R
fitbl <- glm(cbind(Extinct, NExtinct)~log(Area), family=binomial, data=krunnit)
summary(fitbl)
Call:
glm(formula = cbind(Extinct, NExtinct) ~ log(Area), family = binomial, 
    data = krunnit)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.71726  -0.67722   0.09726   0.48365   1.49545

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.19620    0.11845 -10.099  < 2e-16 ***
log(Area)   -0.29710    0.05485  -5.416 6.08e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 45.338  on 17  degrees of freedom
Residual deviance: 12.062  on 16  degrees of freedom
AIC: 75.394

Number of Fisher Scoring iterations: 4
anova(fitbl, test="Chisq")
A anova: 2 × 5
DfDevianceResid. DfResid. DevPr(>Chi)
<int><dbl><int><dbl><dbl>
NULLNA NA1745.33802 NA
log(Area) 133.276511612.061517.994246e-09
vcov(fitbl) # variance co-variance matrix
A matrix: 2 × 2 of type dbl
(Intercept)log(Area)
(Intercept) 0.014029452-0.002602237
log(Area)-0.002602237 0.003008830
# 95% CI for beta
CL = cbind(bhat=coef(fitbl), confint.default(fitbl))
CL
A matrix: 2 × 3 of type dbl
bhat2.5 %97.5 %
(Intercept)-1.1961955-1.4283454-0.9640456
log(Area)-0.2971037-0.4046132-0.1895942

Model summary

# observations: 18
# coefficients: 2
fitted model \(logit(\hat\pi) = -1.196-0.297\log(Area)\)

Wald procedures still work the same

\(H_0:\beta_1=0\)
\(z=\frac{\hat\beta_1}{se(\hat\beta_1)}\sim N(0,1)\)
CI: \(\hat\beta_1\pm t_{\alpha,1}se(\hat\beta_1)\)

Interpretation of beta1

Model:

\[logit(\pi) = \beta_0 + \beta_1\log(x)\Rightarrow\\ \\frac{pi}{1-\pi} = e^{\beta_0}e^{\beta_1\log(x)} = e^{\beta_0}x^{\beta_1}\]

CHanging \(x\) by a factor of \(h\) will change the odds by a multiplicative factor of \(h^{\beta_1}\)

Example: Halving island area changes odds by a factor of \(0.5^{-0.2971} = 1.23\)

Therefore, the odds of extinction on a smaller island are 123% of the odds of extinction on an island double its size.

Halving of area is associated with an increase in the odds of extinction by an estimated 23%. An approximate 95% confidence interval for the percentage change in odds is 14%-32%

# CI of beta's for halfing area
0.5^(CL)
A matrix: 2 × 3 of type dbl
bhat2.5 %97.5 %
(Intercept)2.2913462.6913791.950773
log(Area)1.2286751.3237341.140443
phats = predict.glm(fitbl, type="response")
options(digits=4)
rbind(Extinct, NExtinct, pis, phats)
A matrix: 4 × 18 of type dbl
123456789101112131415161718
Extinct 5.00000 3.0000010.00000 6.0000 3.0000 4.000 8.0000 3.00000 5.0000 6.0000 8.0000 2.0000 9.0000 5.00007.0000 8.000013.00003.0000
NExtinct70.0000064.0000056.0000045.000025.000016.00035.000028.0000023.000026.000022.000018.000022.000011.00008.000025.000027.00003.0000
pis 0.06667 0.04478 0.15152 0.1176 0.1071 0.200 0.1860 0.09677 0.1786 0.1875 0.2667 0.1000 0.2903 0.31250.4667 0.2424 0.32500.5000
phats 0.06017 0.07036 0.09854 0.1380 0.1595 0.162 0.1639 0.17125 0.1854 0.2052 0.2226 0.2516 0.2516 0.26030.2842 0.3019 0.32780.3998

Question: estimate the probability of extinction for a species on the Ulkokrunni island (Area \(= 185.5 km^2\))

\(logit(\hat\pi_{M,1}=)-1.196 - 0.297\log(185.5) = -2.75\)

\(\hat\pi_{M,1} = 0.06\)

logit_result = -1.196 - 0.297 * log(185.5)
pi_m1 = exp(logit_result) / (1+ exp(logit_result))
print(logit_result)
print(pi_m1)
[1] -2.747
[1] 0.06024

Diagnostics

Model assumptions

  • Underlying probability model is Binomial variance non constant, is a function of the mean
  • Observation independent
  • The form of the model is correct
  • Liner relationship between logits and explanatory variables
  • All relevant variables are included, irrelevant ones excluded
  • Sample size is large enough for valid inferences
  • outliers

Saturated Model

Model that fits exactly with the data
Most general model possible for the data

Consider one explanatory variable, X with \(n\) unique levels for the outcome, \(Y\sim Binomial(m,\pi)\)

Saturated Model as many parameter coefficients as \(n\)

Fitted Model nested within a FULL model, has (p+1) parameters

Null ModelL intercept only

Checking model adequacy: Form of the model - Deviance Goodness of Fit Test

Form of Hypothesis: $H_0: $ reduced model, $H_a: $ Full model

Deviance GOF test compares the fitted model \(M\) to the saturated model \(S\)

Test Statistic:

\[Deviance = -2\log(\mathcal{L}_R/\mathcal{L}_F)\sim \chi^2_{n-p-1}\]

This is an asymptotic approximation, so it works better if each \(m_i>5\)

summary(fitbl)
Call:
glm(formula = cbind(Extinct, NExtinct) ~ log(Area), family = binomial, 
    data = krunnit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7173  -0.6772   0.0973   0.4837   1.4954

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.1962     0.1184  -10.10  < 2e-16 ***
log(Area)    -0.2971     0.0549   -5.42  6.1e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 45.338  on 17  degrees of freedom
Residual deviance: 12.062  on 16  degrees of freedom
AIC: 75.39

Number of Fisher Scoring iterations: 4

Example Deviance GOF with R-output \(H_0:\) Fitted model \(logit(\pi)~\log(Area)\)
$H_a: $ Saturated model

Test statistic: \(Deviance = 12.062\) (Residual deviance)

Distribution: \(Deviance\sim \chi^2_{18-2}\)

p-value: \(P(\chi^2_{16}\leq 12.062) = 0.74\)

Conclusion: the data are consistent with \(H_0\); the simpler model with linear function of log(Area) is adequate (fit as well as the saturated model)

# calculate p-value for example
1 - pchisq(12.062, 16)

0.739700862578766

Small deviance leads to larger p-value and vice versa.

Large p-value means: - fitted model is adequate - Test is not powerful enough to detect inadequacies

Small p-value means: - fitted model is not adequate, consider a more complex model with more explanatory variables or higher order terms and so on OR - response distribution is not adequately modeled by the Binomial distribution, OR - There are severe outliers

Outliers

Response (raw) residuals

\[\hat\pi_{S,i}-\hat\pi_{M,i}=y_i/M_i-\hat\pi_{M,i}\]

Standardized residuals
- Pearson Residuals: uses estimate of s.d. of \(Y\)

\[P_{res,i}=\frac{y_i-m_i\hat\pi_{M,i}}{\sqrt{m_i\hat\pi_{M,i}(1-\hat\pi_{M,i})}}\]
  • Deviance Residuals: defined so that the sum of the square of the residuals is the deviance
\[D_{res,i}=sign(y_i-m_i\hat\pi_{<i})\sqrt{2(y_i\log(\frac{y_i}{m_i\hat\pi_{M,i}})+(m_i-y_i)\log(\frac{m_i-y_i}{m_i-m_i\pi\hat\pi_{M,i}}))}\]
# response residuals
rres = residuals(fitbl, type="response")
# pearson residuals
pres = residuals(fitbl, type="pearson")
# Deviance residuals
dres = residuals(fitbl, type="deviance")
cbind(pis, phats, rres, pres, dres)
pisphatsrrespresdres
10.066670.06017 0.006493 0.23646 0.23266
20.044780.07036-0.025585-0.81883-0.87369
30.151520.09854 0.052975 1.44400 1.34958
40.117650.13800-0.020351-0.42139-0.43071
50.107140.15946-0.052319-0.75619-0.79584
60.200000.16205 0.037951 0.46058 0.44746
70.186050.16389 0.022155 0.39247 0.38577
80.096770.17125-0.074480-1.10075-1.18097
90.178570.18542-0.006844-0.09318-0.09363
100.187500.20524-0.017742-0.24850-0.25127
110.266670.22264 0.044030 0.57969 0.56727
120.100000.25158-0.151576-1.56220-1.71726
130.290320.25158 0.038747 0.49717 0.48934
140.312500.26030 0.052203 0.47588 0.46659
150.466670.28415 0.182515 1.56733 1.49545
160.242420.30185-0.059429-0.74367-0.75939
170.325000.32783-0.002828-0.03810-0.03813
180.500000.39984 0.100157 0.50082 0.49570
par(mfrow=c(1,3))
plot(log(Area), rres)
plot(log(Area), pres)
plot(log(Area), dres)

png

Pearson vs Deviance Residuals

  • Both asymptotically to \(N(0,1)\)
  • Possible outlier if \(|res|>2\)
  • Definite outlier if \(|res|>3\)
  • Under small \(n\) \(D_{res}\) closer to \(N(0,1)\) then \(P_{res}\)
  • When \(\hat\pi\) close to extreme (0, 1), \(P_{res}\) are unstable

Problems and Solutions

extrapolation

  • Don't make inferences/predictions outside range of observed data
  • Model may no longer be appropriate

Multicollinearity
Consequences includes

  • unstable fitted equation
  • Coefficient that should be statistically significant while not
  • Coefficient may have wrong sign
  • Sometimes large \(se(\beta)\)
  • Sometimes numerical procedure to find MLEs does not converge

Influential Points

Model: overfit issue, should build model on training data (cross validation)

Problems specific to Logistic Regression

Extra-binomial variation (over dispersion) - variance of \(Y_i\) greater than \(m_i\pi_i(1-\pi_i)\) - Does not bias \(\hat\beta\)'s but se of \(\hat\beta\)'s will be too small

Solution: add one more parameter to the model, \(\phi\)-dispersion parameter. Then \(var(Y_i)=\phi m_i\pi_i(1-\pi_i)\)

Complete separation

  • One or a linear combination of explanatory variables perfectly predict whether \(Y=1\lor 0\)
  • In Binary response, when \(y_i=1,\hat y_i=1\), then \(\sum_{i=1}^n\{y_i\log(\hat y_i + (1-y_i)\log(1-\hat y_i))=0\}\)
  • MLE cannot be computed

Quasi-complete separation:

  • explanatory variables predict \(Y=1\lor 0\) almost perfectly
  • MLE are numerically unstable

Solution: simplify the model. Try other methods like penalized maximum likelihood, exact logistic regression, Bayesian methods

Conclusions about Logistic Regression

False Logistic regression describes population proportion of probability as a linear function of explanatory variables.

Non-linear \(\hat\pi = e^{\hat\mu}/(1+e^{\hat\mu})\)

Hence Logistic regression is a nonlinear regression model

Extra-binomial variation

Suppose \(X_1,...,X_m\) are not independent but identically distributed \(\sim Bernoulli(\pi)\). Further suppose \(\forall X_i,X_j. cov(X_1,X_j)=\rho, \rho>0\).

Let \(Y_1=\sum_i^{m_1}X_i\)

Then

\[var(Y_i)= \sum_i^{m_1}var(X_i) + \sum_{i\neq j} cov(X_i,X_j)\\ = m_1\pi(1-\pi)+\sum_{i\neq j} \rho\sqrt{var(X_i)var(X_j)}\\ =m_1\pi(1-\pi) + m_1(m_1-1)\rho\pi(1-\pi)\\ >m_1\pi(1-\pi)\]

Therefore, the model for variance:

\[var(Y_i)=\psi m_i\pi_i(1-\pi_i)\]

estimate of \(\hat\psi\) by scaled Pearson chi-square statistic

\[\hat\psi = \sum_i^n \frac{P^2_{res,i}}{n-(p+1)}=\frac{\text{sum of squared Pearson residuals}}{d.f.}\]

\(\hat\psi>>1\) indicates evidence of over-dispersion

\(\psi\) does not effect \(E(Y_i)\), hence using over-dispersion does not change \(\hat\beta\)

The standard errors under this assumption is \(se_{\psi}(\hat\beta)=\sqrt{\hat\psi}se(\hat\beta)\)

The following two ways are equivalent

# manually conpute the psi hat
psihat = sum(residuals(fitbl, type="pearson")^2/fitbl$df.residual)
psihat
0.73257293737338
# apply the dispersion
summary(fitbl, dispersion=psihat)
Call:
glm(formula = cbind(Extinct, NExtinct) ~ log(Area), family = binomial, 
    data = krunnit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7173  -0.6772   0.0973   0.4837   1.4954

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.1962     0.1014  -11.80  < 2e-16 ***
log(Area)    -0.2971     0.0469   -6.33  2.5e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 0.7326)

    Null deviance: 45.338  on 17  degrees of freedom
Residual deviance: 12.062  on 16  degrees of freedom
AIC: 75.39

Number of Fisher Scoring iterations: 4
# use quasi binomial family
fitbl2 <- glm(cbind(Extinct, NExtinct)~log(Area), family=quasibinomial, data=krunnit)
fitbl2
Call:  glm(formula = cbind(Extinct, NExtinct) ~ log(Area), family = quasibinomial, 
    data = krunnit)

Coefficients:
(Intercept)    log(Area)  
     -1.196       -0.297

Degrees of Freedom: 17 Total (i.e. Null);  16 Residual
Null Deviance:      45.3 
Residual Deviance: 12.1     AIC: NA

Using Logistic Regression for Classification

Want: \(y^*\mid (x_1^*, ...,x_p^*)= \mathbb{I}\)

Do: calculate \(\hat\pi_M^*\) be the estimated probability that \(y^*=1\) based on the fitted model given \(X_i=x_i^*\), then predict that \(y^* = \mathbb{I}(\hat\pi_M^* \text{is large enough})\)

Need: a good cutoff \(\hat\pi_M^*\)

Classification

Try different cutoffs and see which gives fewest incorrect classifications

  • Useful if proportion of 1's and 0's in date reflect their relative proportions in the population
  • Likely to overestimate. To overcome,cross-validation (training group vs. validation group)

Confusion Matrix

https://en.wikipedia.org/wiki/Confusion_matrix

Choose a cutoff probability based on one of the 5 criteria for success of classification that is most important to you

Examples

  • High sensitivity makes good screening test
  • High specificity makes a good confirmatory test
  • A screening test followed by a confirmatory test is good (but expensive) diagnostic procedure