binary logistic regression - model selection basics - r

I have a binary outcome variable and 4 predictors: two binary and two continuous (truncated to whole numbers). I have 1158 observations, and the objective of the analysis is to predict the probability of the binary outcome (infection), then check the goodness of fit and predictive quality of the final model.
> str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1158 obs. of 5 variables:
$ age : num 25 49 41 19 55 37 30 31 52 37 ...
$ gender: num 1 1 1 0 0 0 1 0 1 1 ...
$ var1 : num 0 0 0 0 0 0 0 0 0 0 ...
$ y : num 1 0 0 1 1 0 1 1 0 1 ...
$ var2 : num 26 33 25 30 28 20 28 21 17 25 ...
I have seen that the data is sometimes split in two, a training and a testing data set, but not always. I assume this depends on the original sample size? Is it advisable to split the data for my analysis?
For now, I have not split the data. I conducted various variable selection procedures:
- manual LRT-based backward selection
- manual LRT-based forward selection
- automated LRT-based backward selection
- automated AIC-based backward selection
- automated AIC-based forward selection
And they all lead to the same result: only age and gender should be included in my model.
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2716 -0.8767 -0.7361 1.3008 1.9353
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.785753 0.238634 3.293 0.000992 ***
age -0.031504 0.004882 -6.453 1.1e-10 ***
gender -0.223195 0.129774 -1.720 0.085455 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1444.9 on 1157 degrees of freedom
Residual deviance: 1398.7 on 1155 degrees of freedom
AIC: 1404.7
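For reference, the procedures listed above can be run along these lines. This is a minimal sketch; the simulated data frame below is only a hypothetical stand-in for the real data, with made-up effect sizes:

```r
set.seed(1)
n <- 1158
# hypothetical stand-in for the real data
data <- data.frame(
  age    = sample(18:60, n, replace = TRUE),
  gender = rbinom(n, 1, 0.5),
  var1   = rbinom(n, 1, 0.1),
  var2   = rpois(n, 25)
)
data$y <- rbinom(n, 1, plogis(0.8 - 0.03 * data$age - 0.2 * data$gender))

full <- glm(y ~ age + gender + var1 + var2, family = binomial, data = data)

# LRT-based selection: inspect drop1() (or add1()) with test = "LRT" each round
drop1(full, test = "LRT")

# AIC-based backward and forward selection
backward <- step(full, direction = "backward", trace = 0)
null     <- glm(y ~ 1, family = binomial, data = data)
forward  <- step(null, scope = formula(full), direction = "forward", trace = 0)
```

`step()` automates the AIC-based search, while the manual LRT route repeats `drop1(..., test = "LRT")` and refits after removing the least significant term each round.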
Now, I want to see if any interactions or polynomial terms are significant. The dot (.) denotes the full model with 4 predictors.
full.twoway <- glm(y ~ (.)^2 , family = binomial, data=data) # includes 2-way interactions
summary(full.twoway)
model.aic.backward_2w <- step(full.twoway, direction = "backward", trace = 1)
summary(model.aic.backward_2w)
full.threeway <- glm(y ~ (.)^3 , family = binomial, data=data) # includes 3-way interactions
summary(full.threeway)
model.aic.backward_3w <- step(full.threeway, direction = "backward", trace = 1)
summary(model.aic.backward_3w)
# polynomials
model.polynomial <- glm(y ~ age + gender + I(age^2), family = binomial, data=data)
# only age, gender significant
Again, only age and gender are significant. This seems very strange to me; I would have expected some interaction or polynomial term to be significant. Am I doing something wrong? Are there other variable selection techniques I should try?
EDIT:
I have partitioned the dataset into training and testing sets. The training dataset consists of 868 observations. The results of the selection procedure now indicate that only the variable age is significant...
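A 75/25 split and a simple out-of-sample check could look like the following. This is a sketch only, again using a simulated stand-in for the real data:

```r
set.seed(123)
n <- 1158
# hypothetical stand-in for the real data
data <- data.frame(age = sample(18:60, n, replace = TRUE),
                   gender = rbinom(n, 1, 0.5))
data$y <- rbinom(n, 1, plogis(0.8 - 0.03 * data$age - 0.2 * data$gender))

idx   <- sample(nrow(data), size = 868)   # 868 training observations
train <- data[idx, ]
test  <- data[-idx, ]

fit  <- glm(y ~ age + gender, family = binomial, data = train)
pred <- predict(fit, newdata = test, type = "response")

# simple predictive check: misclassification rate at a 0.5 cutoff
mean((pred > 0.5) != test$y)
```

Note that with a smaller training set the power to detect weaker effects (like gender here) drops, which is consistent with terms falling out of the model after splitting.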

Related

as.factor not working to convert numerical variable to categorical variable

I am trying to convert some numerical variables to categorical variables using as.factor, but it is not working. Since I am a relatively new R user I am not sure about my code. When I run a regression model, the results still treat these variables as numerical. Does anyone have any suggestions?
as.factor(data$race_fup)
is.factor(data$sex)
str(data)
$ sex : num [1:812] 0 0 1 0 0 0 1 1 1 1 ...
$ race_fup : num [1:812] 1 3 2 3 2 2 3 3 3 3 ...
$ education_degree: num [1:812] 2 1 3 4 2 0 1 4 6 7 ...
GLM.1 <- glm(intubation ~ sex + education_degree + race_fup, family = binomial(logit), data=data)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(GLM.1)
Call:
glm(formula = intubation ~ sex + education_degree + race_fup,
    family = binomial(logit), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0701 -1.0002 -0.9472 1.3353 1.5188
Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     -2.577e-01  1.514e-01  -1.702   0.0888 .
sex             -3.769e-02  1.455e-01  -0.259   0.7956
education_degree -6.837e-02  3.877e-02  -1.764   0.0778 .
race_fup        -6.052e-05  5.799e-04  -0.104   0.9169
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1074.1 on 806 degrees of freedom
Residual deviance: 1069.9 on 803 degrees of freedom
(5 observations deleted due to missingness)
AIC: 1077.9
Number of Fisher Scoring iterations: 6
Some functions modify objects in place, i.e. if we use := in data.table, it modifies the original object, as do the set* functions, e.g. setDT converts the original object from a data.frame to a data.table:
library(data.table)
nm1 <- c("race_fup", "sex")
setDT(data)[, (nm1) := lapply(.SD, as.factor), .SDcols = nm1]
Another option is the compound assignment operator (%<>%) from magrittr:
library(dplyr)
library(magrittr)
data %<>%
mutate(across(c(race_fup, sex), as.factor))
whereas if we use %>%, it won't change the original object unless we assign the result back, either with <- at the beginning:
data <- data %>%
mutate(across(c(race_fup, sex), as.factor))
or with -> at the end:
data %>%
mutate(across(c(race_fup, sex), as.factor)) -> data
In base R, the original object stays the same unless we explicitly assign the result back:
data[c('race_fup', 'sex')] <- lapply(data[c('race_fup', 'sex')], as.factor)
NOTE: If there are multiple columns, we could use either one of the methods
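To make the original problem concrete: calling as.factor() on a column without assigning the result leaves the data frame unchanged, while assigning it back converts the columns. A quick demonstration with a toy data frame:

```r
# toy data frame mimicking the question's columns
data <- data.frame(race_fup = c(1, 3, 2), sex = c(0, 1, 1))

as.factor(data$race_fup)   # returns a factor, but changes nothing in `data`
is.factor(data$race_fup)   # still FALSE

# assign the converted columns back
data[c('race_fup', 'sex')] <- lapply(data[c('race_fup', 'sex')], as.factor)
sapply(data, is.factor)    # now TRUE for both
```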

Calculating odds ratio from glm output

It is my first time doing logistic regression, and I am currently trying to teach myself how to find the odds ratio. I got my coefficients from R as shown below.
(Intercept) totalmins
0.2239254 1.2424020
To exponentiate the regression coefficient I did the following:
exp(1.242) / (exp(1.242) + 1) = 0.77
Really not sure if this is the correct process or not.
Any advice on how I would go about calculating odds ratio would be greatly appreciated
detection - 1/0 data indicating whether the animal was detected at the site
totalmins - time the animal spent at the site
here's the output
glm(formula = detection ~ totalmins, family = binomial(link = "logit"),
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.81040 -0.63571 0.00972 0.37355 1.16771
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.49644 0.81818 -1.829 0.0674 .
totalmins 0.21705 0.08565 2.534 0.0113 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 41.194 on 33 degrees of freedom
Residual deviance: 21.831 on 32 degrees of freedom
(1 observation deleted due to missingness)
AIC: 25.831
Number of Fisher Scoring iterations: 8
This model evaluates the log odds of detecting an animal at the site based on the time in minutes that the animal spent on the site. The model output indicates:
log odds(animal detected | time on site) = -1.49644 + 0.21705 * minutes animal on site
To convert to odds ratios, we exponentiate the coefficients:
odds(animal detected) = exp(-1.49644) * exp(0.21705 * minutes animal on site)
Therefore, the odds of detection if the animal spends 0 minutes on site are exp(-1.49644) = 0.2239, corresponding to a probability of 0.2239 / 1.2239 ≈ 0.183. The odds of detection for an animal on site for X minutes are calculated as follows. We'll compute the odds for minutes 0 through 10, along with the associated probability of detection.
# odds of detection if animal on site for X minutes
coef_df <- data.frame(intercept=rep(-1.49644,11),
slopeMinutes=rep(0.21705,11),
minutesOnSite=0:10)
coef_df$minuteValue <- coef_df$minutesOnSite * coef_df$slopeMinutes
coef_df$intercept_exp <- exp(coef_df$intercept)
coef_df$slope_exp <- exp(coef_df$minuteValue)
coef_df$odds <- coef_df$intercept_exp * coef_df$slope_exp
coef_df$probability <- coef_df$odds / (1 + coef_df$odds)
...and the output:
> coef_df[,c(3:4,6:8)]
minutesOnSite intercept_exp slope_exp odds probability
1 0 0.2239 1.000 0.2239 0.1830
2 1 0.2239 1.242 0.2782 0.2177
3 2 0.2239 1.544 0.3456 0.2569
4 3 0.2239 1.918 0.4294 0.3004
5 4 0.2239 2.383 0.5335 0.3479
6 5 0.2239 2.960 0.6629 0.3986
7 6 0.2239 3.678 0.8235 0.4516
8 7 0.2239 4.569 1.0232 0.5057
9 8 0.2239 5.677 1.2712 0.5597
10 9 0.2239 7.053 1.5793 0.6123
11 10 0.2239 8.763 1.9622 0.6624
>
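The same numbers can be obtained more compactly from the linear predictor, since the probability is just the inverse logit of b0 + b1 * minutes (plogis() in R). A compact alternative to the data frame above:

```r
# coefficients from the fitted model above
b0 <- -1.49644
b1 <- 0.21705
mins <- 0:10

odds <- exp(b0 + b1 * mins)      # odds of detection at each minute count
prob <- plogis(b0 + b1 * mins)   # identical to odds / (1 + odds)
round(cbind(mins, odds, prob), 4)
```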
See also How to get probability from GLM output for another example using space shuttle autolander data from the MASS package.

SAS PROC MIXED vs lmerTest output

I am trying to reproduce output from the PROC MIXED procedure in SAS, using the Satterthwaite approximation, with the lmerTest package in R.
This is my data:
Participant Condition Data
1 0 -1.032941629
1 0 0.869267841
1 0 -1.636722191
1 0 -1.15451393
1 0 0.340454836
1 0 -0.399315906
1 1 0.668983169
1 1 1.937817592
1 1 3.110013393
1 1 3.23409718
2 0 0.806881925
2 1 2.71020911
2 1 3.406864275
2 1 1.494288182
2 1 0.741827047
2 1 2.532062685
2 1 3.702118917
2 1 1.825046681
2 1 4.37167021
2 1 1.85125279
3 0 0.288743786
3 0 1.024396121
3 1 2.051281876
3 1 0.24543851
3 1 3.349677964
3 1 1.565395822
3 1 3.077031712
3 1 1.087494708
3 1 1.546150033
3 1 0.440249347
Using the following statement in SAS:
proc mixed data=mbd;
class participant;
model data = condition / solution ddfm=sat;
random intercept condition / sub=participant;
run;
I get this output:
My problem is that I can't seem to reproduce these results using lmerTest in R.
I thought that lmer(Data ~ Condition + (1 | Participant) + (Condition | Participant), REML=TRUE) was the equivalent of what I did in SAS, but this gives different results. Note that the degrees of freedom are way off from the SAS output, so I think I'm estimating parameters in R that I'm not estimating in SAS. I tried several other statements in R but didn't manage to get the exact same output. However, this should be possible, as lmer() with the lmerTest package also uses the Satterthwaite approximation and should match the SAS PROC MIXED procedure.
Does anybody know what I'm doing wrong in R?
Thanks a lot!
Bart
You don't specify the same random effects as in your SAS example. (Condition | Participant) is expanded internally to (1 + Condition | Participant), which fits a random intercept, a random slope and the covariance between them [1]. So, you have two additional parameters (an intercept variance and the covariance) in your model. Uncorrelated random effects can be specified using || in lme4 syntax. Note how the formula gets expanded in the summary output.
library(lmerTest)
fit <- lmer(Data ~ Condition + (Condition || Participant), REML=TRUE, data = DF)
summary(fit)
#Linear mixed model fit by REML
#t-tests use Satterthwaite approximations to degrees of freedom ['lmerMod']
#Formula: Data ~ Condition + ((1 | Participant) + (0 + Condition | Participant))
# Data: DF
#
#REML criterion at convergence: 90.6
#
#Scaled residuals:
# Min 1Q Median 3Q Max
#-1.58383 -0.78970 -0.06993 0.87801 1.91237
#
#Random effects:
# Groups Name Variance Std.Dev.
# Participant (Intercept) 0.00000 0.000
# Participant.1 Condition 0.07292 0.270
# Residual 1.20701 1.099
#Number of obs: 30, groups: Participant, 3
#
#Fixed effects:
# Estimate Std. Error df t value Pr(>|t|)
#(Intercept) -0.09931 0.36621 26.50400 -0.271 0.788363
#Condition 2.23711 0.46655 12.05700 4.795 0.000432 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Correlation of Fixed Effects:
# (Intr)
#Condition -0.785

How can I compare regression coefficients across three (or more) groups using R?

Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 3 fictional young people, 3 fictional middle age people, and 3 fictional senior citizens, along with their height and their weight. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
So, how can I compare regression coefficients (slope mainly) across three (or more) groups using R?
Sample data:
age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269
There is an elegant answer to this on Cross Validated.
But briefly,
require(emmeans)
data <- data.frame(age = factor(c(1,1,1,2,2,2,3,3,3)),
height = c(56,60,64,56,60,64,74,75,82),
weight = c(140,155,142,117,125,133,245,241,269))
model <- lm(weight ~ height*age, data)
anova(model) #check the results
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
height 1 25392.3 25392.3 481.5984 0.0002071 ***
age 2 2707.4 1353.7 25.6743 0.0129688 *
height:age 2 169.0 84.5 1.6027 0.3361518
Residuals 3 158.2 52.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
slopes <- emtrends(model, 'age', var = 'height') #gets each slope
slopes
age height.trend SE df lower.CL upper.CL
1 0.25 1.28 3 -3.84 4.34
2 2.00 1.28 3 -2.09 6.09
3 3.37 1.18 3 -0.38 7.12
Confidence level used: 0.95
pairs(slopes) #gets their comparisons two by two
contrast estimate SE df t.ratio p.value
1 - 2 -1.75 1.82 3 -0.964 0.6441
1 - 3 -3.12 1.74 3 -1.790 0.3125
2 - 3 -1.37 1.74 3 -0.785 0.7363
P value adjustment: tukey method for comparing a family of 3 estimates
To determine whether the regression coefficients "differ across three age groups" we can use the anova function in R. For example, using the data in the question, shown reproducibly in Note 4 at the end:
fm1 <- lm(weight ~ height, DF)
fm3 <- lm(weight ~ age/(height - 1), DF)
This gives the following, which is significant at the 2.7% level, so we would conclude that there are differences in the regression coefficients of the groups if we were using a 5% cutoff, but not if we were using a 1% cutoff. The age/(height - 1) in the formula for fm3 means that height is nested within age and the overall intercept is omitted; thus the model estimates separate intercepts and slopes for each age group. This is equivalent to age + age:height - 1.
> anova(fm1, fm3)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ age/(height - 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 2991.57
2 3 149.01 4 2842.6 14.307 0.02696 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note 1: Above fm3 has 6 coefficients, an intercept and slope for each group. If you want 4 coefficients, a common intercept and separate slopes, then use
lm(weight ~ age:height, DF)
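To see these parameterizations concretely, count the coefficients each formula produces (the data from Note 4 is rebuilt inline here so the snippet stands alone):

```r
DF <- data.frame(age    = factor(rep(1:3, each = 3)),
                 height = c(56, 60, 64, 56, 60, 64, 74, 75, 82),
                 weight = c(140, 155, 143, 117, 125, 133, 245, 241, 269))

fm3 <- lm(weight ~ age/(height - 1), DF)  # separate intercept and slope per group
length(coef(fm3))                          # 6 coefficients

fm4 <- lm(weight ~ age:height, DF)         # common intercept, separate slopes
length(coef(fm4))                          # 4 coefficients
```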
Note 2: We can also compare a model in which subsets of levels are the same. For example, we can compare a model in which ages 1 and 2 are the same to models in which they are all the same (fm1) and all different (fm3):
fm2 <- lm(weight ~ age/(height - 1), transform(DF, age = factor(c(1, 1, 3)[age])))
anova(fm1, fm2, fm3)
If you do a large number of tests you can get significance on some just by chance, so you will want to lower the cutoff for p-values.
Note 3: There are some notes on lm formulas here: https://sites.google.com/site/r4naturalresources/r-topics/fitting-models/formulas
Note 4: We used this as the input:
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)

Ordinal independent variables for logistic regression in R using ordered() function

I'm doing binary logistic regression in R, and some of the independent variables represent ordinal data. I just want to make sure I'm doing it correctly. In the example below, I created sample data and ran glm() on the assumption that the independent variable "I" represents continuous data. Then I ran it again using ordered(I) instead. The results came out a little differently, so it seems like a successful test. My question is whether it's doing what I think it's doing, e.g. seeing the integer data, coercing it to ordinal data based on the values of the integers, and fitting the glm() with a different parameterization to account for the idea that the distances between "1", "2", "3", etc. may not be equal, hence making it "correct" if this represents ordinal data. Is that right?
> str(gorilla)
'data.frame': 14 obs. of 2 variables:
$ I: int 1 1 1 2 2 2 3 3 4 4 ...
$ D: int 0 0 1 0 0 1 1 1 0 1 ...
> glm.out = glm(D ~ I, family=binomial(logit), data=gorilla)
> summary(glm.out)
...tried it again with ordered:
glm.out = glm(D ~ ordered(I), family=binomial(logit), data=gorilla)
> summary(glm.out)
PS: In case it helps, here's the full output from these tests (one thing I'm noticing is the very large standard errors):
Call:
glm(formula = D ~ I, family = binomial(logit), data = gorilla)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7067 -1.0651 0.7285 1.0137 1.4458
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0624 1.2598 -0.843 0.399
I 0.4507 0.3846 1.172 0.241
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 19.121 on 13 degrees of freedom
Residual deviance: 17.621 on 12 degrees of freedom
AIC: 21.621
Number of Fisher Scoring iterations: 4
> glm.out = glm(D ~ ordered(I), family=binomial(logit), data=gorilla)
> summary(glm.out)
Call:
glm(formula = D ~ ordered(I), family = binomial(logit), data = gorilla)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.66511 -0.90052 0.00013 0.75853 1.48230
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.6557 922.4405 0.004 0.997
ordered(I).L 1.3524 1.2179 1.110 0.267
ordered(I).Q -9.5220 2465.3259 -0.004 0.997
ordered(I).C 0.1282 1.2974 0.099 0.921
ordered(I)^4 13.6943 3307.5816 0.004 0.997
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 19.121 on 13 degrees of freedom
Residual deviance: 14.909 on 9 degrees of freedom
AIC: 24.909
Number of Fisher Scoring iterations: 17
Data used:
I,D
1,0
1,0
1,1
2,0
2,0
2,1
3,1
3,1
4,0
4,1
5,0
5,1
5,1
5,1
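What ordered() actually does can be seen by inspecting the contrast matrix it attaches: for a 5-level ordered factor, R generates orthogonal polynomial contrasts (linear, quadratic, cubic, quartic), and glm() fits one coefficient per contrast, which is exactly where the .L, .Q, .C and ^4 terms in the summary come from:

```r
I <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5)
contrasts(ordered(I))   # columns .L, .Q, .C, ^4: orthogonal polynomial contrasts
```

So the ordered() fit spends 4 degrees of freedom instead of 1, which is also why the standard errors blow up on a dataset of only 14 observations.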
