Binomial Logistic Regression
Y ~ Binomial(m, π)
\(P(Y=y)={m\choose y} \pi^y (1-\pi)^{m-y}\). Compared with the Bernoulli case: in the binomial model the same \(\pi\) applies to all \(m\) trials within an observation, whereas in binary (Bernoulli) logistic regression each observation is a single trial with its own \(\pi_i\) determined by its covariates \(X_i\).
\(E(Y)=m\pi\), \(var(Y)=m\pi(1-\pi)\)
Consider modelling \(Y/m\), the proportion of successes out of \(m\) independent Bernoulli trials: \(E(Y/m)=\pi\), \(var(Y/m)=\pi(1-\pi)/m\)
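As a quick sanity check of these moment formulas, proportions can be simulated in R (the values of m and pi0 below are arbitrary illustrative choices, not from the notes):
# simulate proportions of successes out of m trials and compare with the theoretical variance
set.seed(1)
m = 50; pi0 = 0.3
props = rbinom(10000, size = m, prob = pi0) / m
c(mean = mean(props), var = var(props), theory_var = pi0 * (1 - pi0) / m)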
library(Sleuth3)
library(ggplot2)
krunnit = case2101
Extinct = krunnit$Extinct # number of "successes"
AtRisk = krunnit$AtRisk
Area = krunnit$Area
pis = Extinct/AtRisk
NExtinct = AtRisk - Extinct # number of "failures"
logitpi = log(pis/(1-(pis)))
logarea = log(Area)
case2101
Island | Area | AtRisk | Extinct |
---|---|---|---|
<fct> | <dbl> | <int> | <int> |
Ulkokrunni | 185.80 | 75 | 5 |
Maakrunni | 105.80 | 67 | 3 |
Ristikari | 30.70 | 66 | 10 |
Isonkivenletto | 8.50 | 51 | 6 |
Hietakraasukka | 4.80 | 28 | 3 |
Kraasukka | 4.50 | 20 | 4 |
Lansiletto | 4.30 | 43 | 8 |
Pihlajakari | 3.60 | 31 | 3 |
Tyni | 2.60 | 28 | 5 |
Tasasenletto | 1.70 | 32 | 6 |
Raiska | 1.20 | 30 | 8 |
Pohjanletto | 0.70 | 20 | 2 |
Toro | 0.70 | 31 | 9 |
Luusiletto | 0.60 | 16 | 5 |
Vatunginletto | 0.40 | 15 | 7 |
Vatunginnokka | 0.30 | 33 | 8 |
Tiirakari | 0.20 | 40 | 13 |
Ristikarenletto | 0.07 | 6 | 3 |
Let \(\pi_i\) be the probability of extinction on island \(i\); assume it is the same for every species of bird on that island.
Assume species go extinct or survive independently of one another.
Then \(Y_i\sim Binomial(m_i,\pi_i)\)
Unlike the binary logistic model for Bernoulli responses, here we can estimate each \(\pi_i\) directly from the data.
Observed response proportion: \(\bar{\pi}_i = y_i/m_i\)
Observed or empirical logits (these correspond to the saturated model, "S"): \(\text{logit}(\bar{\pi}_i)=\log\left(\frac{\bar{\pi}_i}{1-\bar{\pi}_i}\right)\)
The proposed model, based on a plot of the empirical logits against \(\log(\text{Area})\), is \(\text{logit}(\pi_i)=\beta_0+\beta_1\log(\text{Area}_i)\); a sketch of the plot follows.
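A sketch of that exploratory plot (the original figure is not reproduced in these notes); it reuses the empirical logits and log areas computed above:
# empirical logits against log(Area); a roughly linear trend supports the proposed model
ggplot(data.frame(logarea, logitpi), aes(x = logarea, y = logitpi)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "log(Area)", y = "empirical logit")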
# cbind(Extinct, NExtinct) ~ log(Area) is how a (successes, failures) binomial response is written in R
fitbl <- glm(cbind(Extinct, NExtinct)~log(Area), family=binomial, data=krunnit)
summary(fitbl)
Call:
glm(formula = cbind(Extinct, NExtinct) ~ log(Area), family = binomial,
data = krunnit)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.71726 -0.67722 0.09726 0.48365 1.49545
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.19620 0.11845 -10.099 < 2e-16 ***
log(Area) -0.29710 0.05485 -5.416 6.08e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 45.338 on 17 degrees of freedom
Residual deviance: 12.062 on 16 degrees of freedom
AIC: 75.394
Number of Fisher Scoring iterations: 4
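The analysis-of-deviance table, coefficient covariance matrix, and confidence intervals shown next were presumably produced by calls along these lines (a sketch; the exact code is not in the notes):
anova(fitbl, test = "Chisq")                       # analysis of deviance
vcov(fitbl)                                        # covariance matrix of the coefficient estimates
cbind(bhat = coef(fitbl), confint.default(fitbl))  # estimates with Wald 95% confidence intervals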
 | Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi)
---|---|---|---|---|---
 | <int> | <dbl> | <int> | <dbl> | <dbl>
NULL | NA | NA | 17 | 45.33802 | NA |
log(Area) | 1 | 33.27651 | 16 | 12.06151 | 7.994246e-09 |
 | (Intercept) | log(Area)
---|---|---
(Intercept) | 0.014029452 | -0.002602237 |
log(Area) | -0.002602237 | 0.003008830 |
 | bhat | 2.5 % | 97.5 %
---|---|---|---
(Intercept) | -1.1961955 | -1.4283454 | -0.9640456 |
log(Area) | -0.2971037 | -0.4046132 | -0.1895942 |
Model summary
# observations: 18
# coefficients: 2
Fitted model: \(\text{logit}(\hat\pi) = -1.196-0.297\log(\text{Area})\)
Wald procedures still work the same
\(H_0:\beta_1=0\)
\(z=\frac{\hat\beta_1}{se(\hat\beta_1)}\sim N(0,1)\)
CI: \(\hat\beta_1\pm z_{\alpha/2}\,se(\hat\beta_1)\)
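For the fitted model these quantities can be computed directly from the fit (a minimal sketch):
# Wald z statistic and approximate 95% CI for the log(Area) coefficient
b1  = unname(coef(fitbl)["log(Area)"])
se1 = sqrt(vcov(fitbl)["log(Area)", "log(Area)"])
c(z = b1 / se1, lower = b1 - 1.96 * se1, upper = b1 + 1.96 * se1)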
Interpretation of \(\beta_1\)
Model: \(\text{logit}(\pi) = \beta_0 + \beta_1\log(x)\), so the odds of extinction are \(e^{\beta_0}x^{\beta_1}\)
Changing \(x\) by a factor of \(h\) changes the odds by a multiplicative factor of \(h^{\beta_1}\)
Example: halving island area changes the odds by a factor of \(0.5^{-0.2971} = 1.23\)
Therefore, the odds of extinction on an island are an estimated 123% of the odds of extinction on an island twice its area.
Halving the area is associated with an increase in the odds of extinction of an estimated 23%. An approximate 95% confidence interval for the percentage change in the odds is 14% to 32%.
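The table below gives \(0.5^{\hat\beta}\) (the multiplicative change in odds from halving the area) together with the transformed interval endpoints; a sketch of how it could be obtained (note the "2.5 %" / "97.5 %" labels are not swapped even though the transformation is decreasing):
0.5^cbind(bhat = coef(fitbl), confint.default(fitbl))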
 | bhat | 2.5 % | 97.5 %
---|---|---|---
(Intercept) | 2.291346 | 2.691379 | 1.950773 |
log(Area) | 1.228675 | 1.323734 | 1.140443 |
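In the table that follows, `pis` are the observed extinction proportions computed earlier and `phats` are presumably the fitted probabilities from the model, e.g.:
phats = fitted(fitbl)  # fitted extinction probability for each island (assumed definition)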
 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Extinct | 5.00000 | 3.00000 | 10.00000 | 6.0000 | 3.0000 | 4.000 | 8.0000 | 3.00000 | 5.0000 | 6.0000 | 8.0000 | 2.0000 | 9.0000 | 5.0000 | 7.0000 | 8.0000 | 13.0000 | 3.0000 |
NExtinct | 70.00000 | 64.00000 | 56.00000 | 45.0000 | 25.0000 | 16.000 | 35.0000 | 28.00000 | 23.0000 | 26.0000 | 22.0000 | 18.0000 | 22.0000 | 11.0000 | 8.0000 | 25.0000 | 27.0000 | 3.0000 |
pis | 0.06667 | 0.04478 | 0.15152 | 0.1176 | 0.1071 | 0.200 | 0.1860 | 0.09677 | 0.1786 | 0.1875 | 0.2667 | 0.1000 | 0.2903 | 0.3125 | 0.4667 | 0.2424 | 0.3250 | 0.5000 |
phats | 0.06017 | 0.07036 | 0.09854 | 0.1380 | 0.1595 | 0.162 | 0.1639 | 0.17125 | 0.1854 | 0.2052 | 0.2226 | 0.2516 | 0.2516 | 0.2603 | 0.2842 | 0.3019 | 0.3278 | 0.3998 |
Question: estimate the probability of extinction for a species on the Ulkokrunni island (Area \(= 185.5 km^2\))
\(\text{logit}(\hat\pi_{M,1}) = -1.196 - 0.297\log(185.5) = -2.75\)
\(\hat\pi_{M,1} = 0.06\)
logit_result = -1.196 - 0.297 * log(185.5)
pi_m1 = exp(logit_result) / (1+ exp(logit_result))
print(logit_result)
print(pi_m1)
[1] -2.747
[1] 0.06024
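Equivalently, the prediction can come straight from the fitted model object (a sketch using predict(); small differences from the hand calculation are due to rounding of the coefficients):
predict(fitbl, newdata = data.frame(Area = 185.5), type = "response")  # probability scale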
Diagnostics
Model assumptions
- The underlying probability model is Binomial (so the variance is not constant; it is a function of the mean)
- Observations are independent
- The form of the model is correct
  - Linear relationship between the logits and the explanatory variables
  - All relevant variables are included, irrelevant ones excluded
- The sample size is large enough for valid inferences
- No severe outliers
Saturated Model
A model that fits the data exactly
The most general model possible for the data
Consider one explanatory variable \(X\) with \(n\) unique levels; at each level the outcome is \(Y_i\sim Binomial(m_i,\pi_i)\)
Saturated model: as many parameters as unique levels, i.e. \(n\) coefficients
Fitted model: nested within the full (saturated) model; has \(p+1\) parameters
Null model: intercept only
Checking model adequacy: Form of the model - Deviance Goodness of Fit Test
Form of the hypotheses: \(H_0\): reduced model; \(H_a\): full model
The deviance GOF test compares the fitted model \(M\) to the saturated model \(S\)
Test statistic: \(\text{Deviance} = 2\left(\log L_S - \log L_M\right)\), which under \(H_0\) is approximately \(\chi^2_{n-(p+1)}\)
This is an asymptotic approximation, so it works better if each \(m_i>5\)
Call:
glm(formula = cbind(Extinct, NExtinct) ~ log(Area), family = binomial,
data = krunnit)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7173 -0.6772 0.0973 0.4837 1.4954
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1962 0.1184 -10.10 < 2e-16 ***
log(Area) -0.2971 0.0549 -5.42 6.1e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 45.338 on 17 degrees of freedom
Residual deviance: 12.062 on 16 degrees of freedom
AIC: 75.39
Number of Fisher Scoring iterations: 4
Example: Deviance GOF test with the R output above. \(H_0\): fitted model, \(\text{logit}(\pi) = \beta_0 + \beta_1\log(\text{Area})\)
\(H_a\): saturated model
Test statistic: \(\text{Deviance} = 12.062\) (the residual deviance)
Distribution: \(\text{Deviance}\sim \chi^2_{18-2} = \chi^2_{16}\)
p-value: \(P(\chi^2_{16}\geq 12.062) = 0.74\)
Conclusion: the data are consistent with \(H_0\); the simpler model, linear in \(\log(\text{Area})\) on the logit scale, is adequate (it fits about as well as the saturated model)
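The p-value is the upper tail of the \(\chi^2_{16}\) distribution and can be computed in R:
pchisq(12.062, df = 16, lower.tail = FALSE)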
0.739700862578766
Small deviance leads to larger p-value and vice versa.
A large p-value means either:
- the fitted model is adequate, or
- the test is not powerful enough to detect its inadequacies.

A small p-value means one or more of:
- the fitted model is not adequate; consider a more complex model with more explanatory variables, higher-order terms, and so on,
- the response distribution is not adequately modeled by the Binomial distribution, or
- there are severe outliers.
Outliers
Response (raw) residuals: observed proportion minus fitted probability, \(\bar{\pi}_i - \hat{\pi}_i\)
Standardized residuals
- Pearson residuals: the raw residual scaled by the estimated s.d. of \(Y\)
- Deviance residuals: defined so that the sum of the squared residuals equals the deviance
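For reference, the standard definitions for grouped binomial data (these formulas are assumed here, not spelled out in the notes):
\(P_{res,i} = \dfrac{y_i - m_i\hat\pi_i}{\sqrt{m_i\hat\pi_i(1-\hat\pi_i)}}\)
\(D_{res,i} = \text{sign}(y_i - m_i\hat\pi_i)\sqrt{2\left[y_i\log\dfrac{y_i}{m_i\hat\pi_i} + (m_i - y_i)\log\dfrac{m_i - y_i}{m_i - m_i\hat\pi_i}\right]}\)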
# response residuals
rres = residuals(fitbl, type="response")
# pearson residuals
pres = residuals(fitbl, type="pearson")
# Deviance residuals
dres = residuals(fitbl, type="deviance")
cbind(pis, phats, rres, pres, dres)
 | pis | phats | rres | pres | dres
---|---|---|---|---|---
1 | 0.06667 | 0.06017 | 0.006493 | 0.23646 | 0.23266 |
2 | 0.04478 | 0.07036 | -0.025585 | -0.81883 | -0.87369 |
3 | 0.15152 | 0.09854 | 0.052975 | 1.44400 | 1.34958 |
4 | 0.11765 | 0.13800 | -0.020351 | -0.42139 | -0.43071 |
5 | 0.10714 | 0.15946 | -0.052319 | -0.75619 | -0.79584 |
6 | 0.20000 | 0.16205 | 0.037951 | 0.46058 | 0.44746 |
7 | 0.18605 | 0.16389 | 0.022155 | 0.39247 | 0.38577 |
8 | 0.09677 | 0.17125 | -0.074480 | -1.10075 | -1.18097 |
9 | 0.17857 | 0.18542 | -0.006844 | -0.09318 | -0.09363 |
10 | 0.18750 | 0.20524 | -0.017742 | -0.24850 | -0.25127 |
11 | 0.26667 | 0.22264 | 0.044030 | 0.57969 | 0.56727 |
12 | 0.10000 | 0.25158 | -0.151576 | -1.56220 | -1.71726 |
13 | 0.29032 | 0.25158 | 0.038747 | 0.49717 | 0.48934 |
14 | 0.31250 | 0.26030 | 0.052203 | 0.47588 | 0.46659 |
15 | 0.46667 | 0.28415 | 0.182515 | 1.56733 | 1.49545 |
16 | 0.24242 | 0.30185 | -0.059429 | -0.74367 | -0.75939 |
17 | 0.32500 | 0.32783 | -0.002828 | -0.03810 | -0.03813 |
18 | 0.50000 | 0.39984 | 0.100157 | 0.50082 | 0.49570 |
Pearson vs Deviance Residuals
- Both are asymptotically \(N(0,1)\)
- Possible outlier if \(|res|>2\)
- Definite outlier if \(|res|>3\)
- Under small \(n\), \(D_{res}\) is closer to \(N(0,1)\) than \(P_{res}\)
- When \(\hat\pi\) is close to the extremes (0 or 1), \(P_{res}\) are unstable
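A quick check against these thresholds using the residuals computed above (a sketch):
which(abs(pres) > 2)  # possible outliers by the Pearson residuals (none exceed 2 in this fit)
which(abs(dres) > 2)  # possible outliers by the deviance residuals (none exceed 2 in this fit)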
Problems and Solutions
Issues Related to General Linear Models
Extrapolation
- Don't make inferences/predictions outside range of observed data
- Model may no longer be appropriate
Multicollinearity
Consequences include:
- Unstable fitted equation
- Coefficients that should be statistically significant are not
- Coefficients may have the wrong sign
- Sometimes large \(se(\hat\beta)\)
- Sometimes the numerical procedure for finding the MLEs does not converge
Influential Points
Model overfitting: build the model on training data and check it on separate validation data (cross-validation)
Problems specific to Logistic Regression
Extra-binomial variation (over-dispersion):
- the variance of \(Y_i\) is greater than \(m_i\pi_i(1-\pi_i)\)
- this does not bias the \(\hat\beta\)'s, but the \(se(\hat\beta)\)'s will be too small

Solution: add one more parameter to the model, the dispersion parameter \(\psi\). Then \(var(Y_i)=\psi\, m_i\pi_i(1-\pi_i)\)
Complete separation
- One explanatory variable, or a linear combination of them, perfectly predicts whether \(Y=1\) or \(Y=0\)
- In the binary-response case, when \(\hat y_i=y_i\) for every observation (e.g. \(\hat y_i=1\) whenever \(y_i=1\)), the log-likelihood \(\sum_{i=1}^n\left\{y_i\log(\hat y_i) + (1-y_i)\log(1-\hat y_i)\right\}=0\), its largest possible value
- The MLE cannot be computed (the estimates diverge)
Quasi-complete separation:
- The explanatory variables predict \(Y=1\) or \(Y=0\) almost perfectly
- The MLEs are numerically unstable
Solution: simplify the model. Try other methods like penalized maximum likelihood, exact logistic regression, Bayesian methods
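A minimal sketch of complete separation (hypothetical toy data, not from the notes): the single predictor perfectly splits the 0's and 1's, glm() warns that fitted probabilities of 0 or 1 occurred, and the coefficient estimates blow up.
# toy example: y = 1 exactly when x > 3, so x separates the responses perfectly
x = 1:6
y = c(0, 0, 0, 1, 1, 1)
fit_sep = glm(y ~ x, family = binomial)  # warning: fitted probabilities numerically 0 or 1 occurred
coef(fit_sep)                            # huge, unstable estimates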
Conclusions about Logistic Regression
False: "logistic regression describes the population proportion (probability) as a linear function of the explanatory variables."
The relationship is non-linear: \(\hat\pi = e^{\hat\mu}/(1+e^{\hat\mu})\), where \(\hat\mu\) is the linear predictor.
Hence logistic regression is a nonlinear regression model.
Extra-binomial variation
Suppose \(X_1,\dots,X_{m_1}\) are identically distributed \(Bernoulli(\pi)\) but not independent; further suppose \(corr(X_i,X_j)=\rho\) for all \(i\neq j\), with \(\rho>0\).
Let \(Y_1=\sum_{i=1}^{m_1}X_i\)
Then \(var(Y_1)=m_1\pi(1-\pi)\left[1+(m_1-1)\rho\right] > m_1\pi(1-\pi)\)
Therefore, the model for the variance is \(var(Y_i)=\psi\, m_i\pi_i(1-\pi_i)\), where \(\psi\) is the dispersion parameter
Estimate \(\psi\) by the scaled Pearson chi-square statistic \(\hat\psi=\sum_{i=1}^n P_{res,i}^2\,/\,(n-(p+1))\)
\(\hat\psi>>1\) indicates evidence of over-dispersion
\(\psi\) does not affect \(E(Y_i)\); hence allowing for over-dispersion does not change \(\hat\beta\)
The standard errors under this assumption are \(se_{\psi}(\hat\beta)=\sqrt{\hat\psi}\,se(\hat\beta)\)
The following two ways are equivalent
# manually compute psi hat (scaled Pearson chi-square statistic)
psihat = sum(residuals(fitbl, type="pearson")^2) / fitbl$df.residual
psihat
0.73257293737338
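The first way then passes this estimate to summary(); the output below was presumably produced by something like:
summary(fitbl, dispersion = psihat)  # rescales the standard errors by sqrt(psihat)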
Call:
glm(formula = cbind(Extinct, NExtinct) ~ log(Area), family = binomial,
data = krunnit)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7173 -0.6772 0.0973 0.4837 1.4954
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1962 0.1014 -11.80 < 2e-16 ***
log(Area) -0.2971 0.0469 -6.33 2.5e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 0.7326)
Null deviance: 45.338 on 17 degrees of freedom
Residual deviance: 12.062 on 16 degrees of freedom
AIC: 75.39
Number of Fisher Scoring iterations: 4
# use quasi binomial family
fitbl2 <- glm(cbind(Extinct, NExtinct)~log(Area), family=quasibinomial, data=krunnit)
fitbl2
Call: glm(formula = cbind(Extinct, NExtinct) ~ log(Area), family = quasibinomial,
data = krunnit)
Coefficients:
(Intercept) log(Area)
-1.196 -0.297
Degrees of Freedom: 17 Total (i.e. Null); 16 Residual
Null Deviance: 45.3
Residual Deviance: 12.1 AIC: NA
Using Logistic Regression for Classification
Want: to predict \(y^*\in\{0,1\}\) for a new case with explanatory values \((x_1^*,\dots,x_p^*)\)
Do: calculate \(\hat\pi_M^*\), the estimated probability that \(y^*=1\) from the fitted model given \(X_i=x_i^*\); then predict \(\hat y^* = \mathbb{I}(\hat\pi_M^* \text{ is large enough})\)
Need: a good cutoff for \(\hat\pi_M^*\)
Classification
Try different cutoffs and see which gives fewest incorrect classifications
- Useful if the proportions of 1's and 0's in the data reflect their relative proportions in the population
- Assessing the classifications on the same data used to fit the model likely overestimates performance; to overcome this, use cross-validation (training group vs. validation group)
Confusion Matrix
https://en.wikipedia.org/wiki/Confusion_matrix
Choose a cutoff probability based on whichever of the 5 criteria for classification success is most important to you; a sketch of the mechanics follows.
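A minimal sketch with hypothetical binary data (the Krunnit data are grouped counts, so a toy binary example is used; the 0.5 cutoff is an arbitrary choice):
# hypothetical binary response and one predictor
set.seed(2)
x2 = rnorm(100)
y2 = rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * x2))
fit_cls = glm(y2 ~ x2, family = binomial)

cutoff = 0.5
yhat = as.integer(fitted(fit_cls) > cutoff)  # predicted class at this cutoff
table(observed = y2, predicted = yhat)       # confusion matrix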
Examples
- High sensitivity makes good screening test
- High specificity makes a good confirmatory test
- A screening test followed by a confirmatory test is good (but expensive) diagnostic procedure