Model selection gives back intercept-only model - R

I am performing logistic regression of CHD status against a few variables (see the data frame below).
ind sbp tobacco ldl adiposity typea obesity alcohol age chd
1 1 160 12.00 5.73 23.11 49 25.30 97.20 52 1
2 2 144 0.01 4.41 28.61 55 28.87 2.06 63 1
...
I performed backward stepwise selection on this model to obtain the best model, but the result is a model that contains only the intercept. Why does this happen, and what does it mean?
model <- glm(chd ~ ., data = CHD, family = binomial(link = "logit"))
intercept_only <- glm(chd ~ 1, data = CHD, family = binomial(link = "logit"))
#perform backward stepwise regression
back <- step(intercept_only, direction='backward', scope=formula(model), trace=0)
#view results of backward stepwise regression
Step Df Deviance Resid. Df Resid. Dev AIC
1 NA NA 461 596.1084 598.1084

To do backward selection, you should start with the full model that contains the predictors, rather than the intercept-only model:
back <- step(model, direction='backward', scope=formula(model), trace=0)
The intercept_only model should only be used as the starting point if you set direction='forward' or direction='both', as sketched below.
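For reference, a minimal sketch of those two variants, reusing the intercept_only and model objects from above:
#forward selection: start from the intercept-only model and grow toward the scope
forward <- step(intercept_only, direction='forward', scope=formula(model), trace=0)
#stepwise selection in both directions from the same starting point
both <- step(intercept_only, direction='both', scope=formula(model), trace=0)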

lrtest() not working: post-hoc testing for multinomial with vglm() not working with lrtest()

I am running a multinomial analysis with vglm(). It all works, but then I try to follow the instructions from the following website (https://rcompanion.org/handbook/H_08.html) to do a pairwise test, because emmeans cannot handle pairwise comparisons for vglm models. The lrtest() part gives me the following error:
Error in lrtest.default(model) :
'list' object cannot be coerced to type 'double'
I cannot figure out what is wrong; I even copied and pasted the exact code that the website used (see below) and get the same error with their own code and dataset. Any ideas?
Their code and suggestion for doing pairwise testing with vglm() is the only option I found for vglm() anywhere on the web.
Here is the code along with all the expected output and extra details from their website (it is simpler than mine but gives the same error anyway).
Input = ("
County Sex Result Count
Bloom Female Pass 9
Bloom Female Fail 5
Bloom Male Pass 7
Bloom Male Fail 17
Cobblestone Female Pass 11
Cobblestone Female Fail 4
Cobblestone Male Pass 9
Cobblestone Male Fail 21
Dougal Female Pass 9
Dougal Female Fail 7
Dougal Male Pass 19
Dougal Male Fail 9
Heimlich Female Pass 15
Heimlich Female Fail 8
Heimlich Male Pass 14
Heimlich Male Fail 17
")
Data = read.table(textConnection(Input),header=TRUE)
### Order factors otherwise R will alphabetize them
Data$County = factor(Data$County,
levels=unique(Data$County))
Data$Sex = factor(Data$Sex,
levels=unique(Data$Sex))
Data$Result = factor(Data$Result,
levels=unique(Data$Result))
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
### Remove unnecessary objects
rm(Input)
Multinomial regression
library(VGAM)
model = vglm(Result ~ Sex + County + Sex:County,
family=multinomial(refLevel=1),
weights = Count,
data = Data)
summary(model)
library(car)
Anova(model,
type="II",
test="Chisq")```
Analysis of Deviance Table (Type II tests)
Response: Result
Df Chisq Pr(>Chisq)
Sex 1 6.7132 0.00957 **
County 3 4.1947 0.24120
Sex:County 3 7.1376 0.06764 .
library(rcompanion)
nagelkerke(model)
$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
McFadden 0.0797857
Cox and Snell (ML) 0.7136520
Nagelkerke (Cragg and Uhler) 0.7136520
$Likelihood.ratio.test
Df.diff LogLik.diff Chisq p.value
7 -10.004 20.009 0.0055508
library(lmtest)
lrtest(model)
Likelihood ratio test
Model 1: Result ~ Sex + County + Sex:County
Model 2: Result ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 8 -115.39
2 15 -125.39 7 20.009 0.005551 **
Post-hoc analysis
At the time of writing, the lsmeans package cannot be used with vglm models.
One option for post-hoc analysis would be to conduct analyses on reduced models that include only two levels of a factor. For example, if the County × Sex interaction term had been significant, the following code could be used to create a reduced dataset with only Bloom–Female and Bloom–Male, and analyze those data with vglm.
Data.b = Data[Data$County=="Bloom" &
(Data$Sex=="Female"| Data$Sex=="Male") , ]
Data.b$County = factor(Data.b$County)
Data.b$Sex = factor(Data.b$Sex)
summary(Data.b)
 County      Sex      Result      Count
 Bloom:4   Female:2   Pass:2   Min.   : 5.0
           Male  :2   Fail:2   1st Qu.: 6.5
                               Median : 8.0
                               Mean   : 9.5
                               3rd Qu.:11.0
                               Max.   :17.0
library(VGAM)
model.b = vglm(Result ~ Sex,
family=multinomial(refLevel=1),
weights = Count,
data = Data.b)
lrtest(model.b)
Likelihood ratio test
#Df LogLik Df Chisq Pr(>Chisq)
1 2 -23.612
2 3 -25.864 1 4.5041 0.03381 *
Summary table of results
Comparison p-value
Bloom–Female - Bloom–Male 0.034
Cobblestone–Female - Cobblestone–Male 0.0052
Dougal–Female - Dougal–Male 0.44
Heimlich–Female - Heimlich–Male 0.14
p.value = c(0.034, 0.0052, 0.44, 0.14)
p.adj = p.adjust(p.value,
method = "fdr")
p.adj = signif(p.adj,
2)
p.adj
[1] 0.068 0.021 0.440 0.190
Comparison p-value p.adj
Bloom–Female - Bloom–Male 0.034 0.068
Cobblestone–Female - Cobblestone–Male 0.0052 0.021
Dougal–Female - Dougal–Male 0.44 0.44
Heimlich–Female - Heimlich–Male 0.14 0.19
It looks to me like qdrg() can be used. As I commented, you can't use the lazy interface; you have to supply all of the needed pieces explicitly:
> library(emmeans)
> RG = qdrg(formula(model), Data, coef(model), vcov(model), link = "log")
> RG
'emmGrid' object with variables:
Sex = Female, Male
County = Bloom, Cobblestone, Dougal, Heimlich
Transformation: “log”
> emmeans(RG, consec ~ Sex | County)
$emmeans
County = Bloom:
Sex emmean SE df asymp.LCL asymp.UCL
Female -0.588 0.558 Inf -1.68100 0.5054
Male 0.887 0.449 Inf 0.00711 1.7675
County = Cobblestone:
Sex emmean SE df asymp.LCL asymp.UCL
Female -1.012 0.584 Inf -2.15597 0.1328
Male 0.847 0.398 Inf 0.06643 1.6282
County = Dougal:
Sex emmean SE df asymp.LCL asymp.UCL
Female -0.251 0.504 Inf -1.23904 0.7364
Male -0.747 0.405 Inf -1.54032 0.0459
County = Heimlich:
Sex emmean SE df asymp.LCL asymp.UCL
Female -0.629 0.438 Inf -1.48668 0.2295
Male 0.194 0.361 Inf -0.51320 0.9015
Results are given on the log (not the response) scale.
Confidence level used: 0.95
$contrasts
County = Bloom:
contrast estimate SE df z.ratio p.value
Male - Female 1.475 0.716 Inf 2.060 0.0394
County = Cobblestone:
contrast estimate SE df z.ratio p.value
Male - Female 1.859 0.707 Inf 2.630 0.0085
County = Dougal:
contrast estimate SE df z.ratio p.value
Male - Female -0.496 0.646 Inf -0.767 0.4429
County = Heimlich:
contrast estimate SE df z.ratio p.value
Male - Female 0.823 0.567 Inf 1.450 0.1470
Results are given on the log (not the response) scale.
If I understand this model correctly, the response is the log of the ratio of the 2nd multinomial response category to the 1st. So what we see above are estimated differences of logs, and estimated differences of those differences. If run with type = "response", you would get estimated ratios, and ratios of those ratios.
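A sketch of that variant, reusing the RG object built above:
emmeans(RG, consec ~ Sex | County, type = "response")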
Probably something changed in either the VGAM package or the lmtest package since that page was written.
But the following will work for a likelihood ratio test for vglm models:
VGAM::lrtest(model)
VGAM::lrtest(model, model2)
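For instance, the null model from the lrtest output above (Result ~ 1) could be fitted explicitly; a sketch using the same Data and model objects:
model2 <- VGAM::vglm(Result ~ 1,
                     family = multinomial(refLevel = 1),
                     weights = Count,
                     data = Data)
VGAM::lrtest(model, model2)   #full model vs. intercept-only model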

extract random effects from MCMCglmm

I have a pipeline which runs an MCMCglmm model a number of times.
shuffles <- 1:10
names(shuffles) <- paste0("shuffle_", shuffles)
library(MCMCglmm)
library(dplyr)
library(tibble)
library(purrr)
ddd <- purrr::map(shuffles,
                  ~ df %>%
                      mutate(Trait = sample(Trait)) %>%
                      MCMCglmm(fixed = Trait ~ 1,
                               random = ~ Year,
                               data = .,
                               family = "categorical",
                               verbose = FALSE)) %>%
    purrr::map(~ tibble::as_tibble(summary(.x)$solutions, rownames = "model_term")) %>%
    dplyr::bind_rows(., .id = 'shuffle')
ddd
This part of the pipeline extracts the fixed effects only:
tibble::as_tibble(summary(.x)$solutions, rownames = "model_term")
But note that I am now running a model without any fixed effects, and so the output is empty.
How can I extract random effects using the same or similar code?
I guess I can change 'solutions' to something else to extract random effects from a model I have run without any fixed effects.
Note that this is an extension to a previous question (with example df) here - lapply instead of for loop for randomised hypothesis testing r
A relatively easy way to do this is with broom.mixed::tidy. It's not clear whether you mean you want to extract the summary for the top-level random effects parameters (i.e. the variances of the random effects), or for the estimates of the group-level effects.
library(broom.mixed)
tidy(m, effects="ran_pars")
##
## effect group term estimate std.error
## 1 ran_pars Year var__(Intercept) 0.00212 629.
## 2 ran_pars Residual var__Observation 40465. 24211.
If you want the group-level effects you need effects="ran_vals", but you have to re-run your model with pr=TRUE (or do it that way in the first place) in order to have these effects saved in the model object:
m <- MCMCglmm(Trait ~ ID, random = ~ Year, data = df, family = "categorical", pr=TRUE)
tidy(m, effects="ran_vals")
effect group level term estimate std.error
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 ran_vals Year 1992 (Intercept) 2.65e-8 4.90
2 ran_vals Year 1993 (Intercept) 1.14e-8 6.23
3 ran_vals Year 1994 (Intercept) 1.28e-8 4.88
4 ran_vals Year 1995 (Intercept) -6.83e-9 5.31
5 ran_vals Year 1996 (Intercept) -1.36e-8 5.07
6 ran_vals Year 1997 (Intercept) 1.31e-8 5.24
7 ran_vals Year 1998 (Intercept) -2.80e-9 5.25
8 ran_vals Year 1999 (Intercept) 3.52e-8 5.68
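To fold this into the purrr pipeline from the question, something like the following sketch should work (assuming df and shuffles as defined above; pr = TRUE is needed so the group-level effects are stored):
ddd_ranef <- purrr::map(shuffles,
                        ~ df %>%
                            mutate(Trait = sample(Trait)) %>%
                            MCMCglmm(fixed = Trait ~ 1,
                                     random = ~ Year,
                                     data = .,
                                     family = "categorical",
                                     pr = TRUE,       #store group-level effects
                                     verbose = FALSE)) %>%
    purrr::map(~ broom.mixed::tidy(.x, effects = "ran_vals")) %>%
    dplyr::bind_rows(.id = 'shuffle')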

Estimating effect size with emmeans for post hoc comparisons

Is there a way to obtain an effect size (such as Cohen's d, or whichever is most appropriate) directly using emmeans()?
I cannot find anything about obtaining effect sizes with emmeans().
post <- emmeans(fit, pairwise~ favorite.pirate | sex)
emmip(fit, ~ favorite.pirate | sex)
There is not a built-in provision for effect-size calculations, but you can cobble one together by defining a custom contrast function that divides each pairwise comparison by a value of sigma:
mypw.emmc = function(..., sigma = 1) {
    # start from the standard pairwise contrasts...
    result = emmeans:::pairwise.emmc(...)
    # ...then rescale each contrast column by sigma
    for (i in seq_along(result[1, ]))
        result[[i]] = result[[i]] / sigma
    result
}
Here's a test run:
> mypw.emmc(1:3, sigma = 4)
1 - 2 1 - 3 2 - 3
1 0.25 0.25 0.00
2 -0.25 0.00 0.25
3 0.00 -0.25 -0.25
With your model, the error SD is 9.246 (look at summary(fit)); so, ...
> emmeans(fit, mypw ~ sex, sigma = 9.246, name = "effect.size")
NOTE: Results may be misleading due to involvement in interactions
$emmeans
sex emmean SE df lower.CL upper.CL
female 63.8 0.434 3.03 62.4 65.2
male 74.5 0.809 15.82 72.8 76.2
other 68.8 1.439 187.08 65.9 71.6
Results are averaged over the levels of: favorite.pirate
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
$contrasts
effect.size estimate SE df t.ratio p.value
female - male -1.158 0.0996 399 -11.624 <.0001
female - other -0.537 0.1627 888 -3.299 0.0029
male - other 0.621 0.1717 981 3.617 0.0009
Results are averaged over the levels of: favorite.pirate
Degrees-of-freedom method: kenward-roger
P value adjustment: tukey method for comparing a family of 3 estimates
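As a side note (a sketch; sigma() on an lmer fit returns the residual SD), the value can be pulled from the model instead of typing it in:
emmeans(fit, mypw ~ sex, sigma = sigma(fit), name = "effect.size")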
Some words of caution though:
The SEs of the effect sizes are misleading because they don't account for the variation in sigma.
This is not a very good example, because:
a. the factors interact (Edward Low's profile differs from the others'); see also the warning message above;
b. the model is singular (as warned when the model was fitted), yielding an estimated variance of zero for college.
library(yarrr)
View(pirates)
library(lme4)
library(lmerTest)
fit <- lmer(weight~ favorite.pirate * sex +(1|college), data = pirates)
anova(fit, ddf = "Kenward-Roger")
post <- emmeans(fit, pairwise~ sex)
post

R logistic regression and marginal effects - how to exclude NA values in categorical independent variable

I am a beginner with R. I am using glm to conduct logistic regression and then the 'margins' package to calculate marginal effects, but I don't seem to be able to exclude the missing values in my categorical independent variable.
I have tried to ask R to exclude NAs from the regression. The categorical variable is weight status at age 9 (wgt9), and it has three levels (1, 2, 3) and some NAs.
What am I doing wrong? Why do I get a wgt9NA result in my outputs and how can I correct it?
Thanks in advance for any help/advice.
Conduct logistic regression
summary(logit.phbehav <- glm(obese13 ~ gender + as.factor(wgt9) + aded08b,
data = gui, weights = bdwg01, family = binomial(link = "logit")))
Regression output
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3.99 0.293 -13.6 2.86e- 42
2 gender 0.387 0.121 3.19 1.42e- 3
3 as.factor(wgt9)2 2.49 0.177 14.1 3.28e- 45
4 as.factor(wgt9)3 4.65 0.182 25.6 4.81e-144
5 as.factor(wgt9)NA 2.60 0.234 11.1 9.94e- 29
6 aded08b -0.0755 0.0224 -3.37 7.47e- 4
Calculate the marginal effects
effects_logit_phtotal = margins(logit.phtot)
print(effects_logit_phtotal)
summary(effects_logit_phtotal)
Marginal effects output
> summary(effects_logit_phtotal)
factor AME SE z p lower upper
aded08a -0.0012 0.0002 -4.8785 0.0000 -0.0017 -0.0007
gender 0.0115 0.0048 2.3899 0.0169 0.0021 0.0210
wgt92 0.0941 0.0086 10.9618 0.0000 0.0773 0.1109
wgt93 0.4708 0.0255 18.4569 0.0000 0.4208 0.5207
wgt9NA 0.1027 0.0179 5.7531 0.0000 0.0677 0.1377
First of all, welcome to Stack Overflow. Please check the answer here to see how to write a great R question. Not providing a sample of your data sometimes makes it impossible to answer the question. However, taking a guess, I think you have not set your NA values correctly; they are stored as strings. This behavior can be seen in the dummy data below.
First let's create the dummy data:
v1 <- c(2,3,3,3,2,2,2,2,NA,NA,NA)
v2 <- c(2,3,3,3,2,2,2,2,"NA","NA","NA")
v3 <- c(11,5,6,7,10,8,7,6,2,5,3)
obese <- c(0,1,1,0,0,1,1,1,0,0,0)
df <- data.frame(obese, v1, v2, v3)   # include v3, which is used in the models below
Using the variable named v1, the model does not include NA as a category:
glm(formula = obese ~ as.factor(v1) + v3, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
1 2 3 4 5 6 7 8
-2.110e-08 2.110e-08 1.168e-05 -1.105e-05 -2.110e-08 3.094e-06 2.110e-08 2.110e-08
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 401.48 898581.15 0 1
as.factor(v1)3 -96.51 326132.30 0 1
v3 -46.93 106842.02 0 1
Using v2, where the string "NA" becomes a factor level, gives an output similar to the one in the question:
glm(formula = obese ~ as.factor(v2) + v3, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.402e-05 -2.110e-08 -2.110e-08 2.110e-08 1.472e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 394.21 744490.08 0.001 1
as.factor(v2)3 -95.33 340427.26 0.000 1
as.factor(v2)NA -327.07 613934.84 -0.001 1
v3 -45.99 84477.60 -0.001 1
Try the following to replace NAs that are strings:
gui$wgt9[ gui$wgt9 == "NA" ] <- NA
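If the data are read from a file, another option is to declare those strings as missing at import time; a sketch with a hypothetical file name:
gui <- read.csv("gui.csv", na.strings = c("NA", ""))   #hypothetical file name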
Don't forget to accept any answer that solved your problem.

No effect of corstr argument on correlation structure in gee

Here is an extract from my data frame, representing results of a longitudinal study (A is an outcome parameter measured at two time points):
wide <- structure(list(ID = c(9000296L, 9001104L, 9001400L, 9001695L,
9001897L, 9002316L), BMI = c(29.8, 30.7, 23.5, 28.6, 25.9,
25.1), A.1 = c(100, 70.83, 100, 89.29, 100, 92.86), A.5 = c(100,
NA, 92.86, NA, 100, 89.29)), .Names = c("ID", "BMI", "A.1",
"A.5"), class = "data.frame", row.names = c(2L, 5L, 6L,
7L, 8L, 10L))
wide
ID BMI A.1 A.5
2 9000296 29.8 100.0 100.0
5 9001104 30.7 70.8 NA
6 9001400 23.5 100.0 92.9
7 9001695 28.6 89.3 NA
8 9001897 25.9 100.0 100.0
10 9002316 25.1 92.9 89.3
As you can see, there is correlation between A.1 and A.5, as there should be in a longitudinal study:
library(psych)
corr.test(wide[, c(3, 4)])
Call:corr.test(x = wide[, c(3, 4)])
Correlation matrix
A.1 A.5
A.1 1.00 0.78
A.5 0.78 1.00
Then I transform my data to long format:
long<- reshape (wide, varying = c(3,4), direction="long")
long
ID BMI time A id
1.1 9000296 29.8 1 100.0 1
2.1 9001104 30.7 1 70.8 2
3.1 9001400 23.5 1 100.0 3
4.1 9001695 28.6 1 89.3 4
5.1 9001897 25.9 1 100.0 5
6.1 9002316 25.1 1 92.9 6
1.5 9000296 29.8 5 100.0 1
2.5 9001104 30.7 5 NA 2
3.5 9001400 23.5 5 92.9 3
4.5 9001695 28.6 5 NA 4
5.5 9001897 25.9 5 100.0 5
6.5 9002316 25.1 5 89.3 6
Then I try to fit a gee model, first using the independence working correlation structure:
library(gee)
model1 <- gee(A ~ time + BMI, id = ID, corstr = "independence", data = long)
Beginning Cgee S-function, #(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) time BMI
122.389 0.508 -1.127
summary (model1)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Independent
Call:
gee(formula = A ~ time + BMI, id = ID, data = long, corstr = "independence")
Summary of Residuals:
Min 1Q Median 3Q Max
-17.46 -4.62 1.11 5.79 10.69
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 122.389 34.18 3.580 31.00 3.949
time 0.508 1.60 0.317 1.12 0.453
BMI -1.127 1.23 -0.919 1.23 -0.913
Estimated Scale Parameter: 93.6
Number of Iterations: 1
Working Correlation
[,1] [,2]
[1,] 1 0
[2,] 0 0
And then using the exchangeable working correlation structure:
model2 <- gee(A ~ time + BMI, id = ID, corstr = "exchangeable", data = long)
Beginning Cgee S-function, #(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) time BMI
122.389 0.508 -1.127
summary (model2)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Exchangeable
Call:
gee(formula = A ~ time + BMI, id = ID, data = long, corstr = "exchangeable")
Summary of Residuals:
Min 1Q Median 3Q Max
-17.46 -4.62 1.11 5.79 10.69
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 122.389 34.18 3.580 31.00 3.949
time 0.508 1.60 0.317 1.12 0.453
BMI -1.127 1.23 -0.919 1.23 -0.913
Estimated Scale Parameter: 93.6
Number of Iterations: 1
Working Correlation
[,1] [,2]
[1,] 1 0
[2,] 0 0
As you can see, the outputs are identical despite the different correlation structures specified in the two gee models; in both cases the off-diagonal entries of the working correlation matrix are zero.
In my actual data I have many more observations and time points, and there are significant within-subject correlations. Nevertheless, all my gee models (with different dependent variables as well) have zero correlations in their working correlation matrices, and changing the corstr argument does not change the model output.
All this seems very strange.
Could you please suggest what I did wrong?
I have found a solution!
The ID variable should be sorted: gee treats rows with the same id as one cluster only when they are contiguous, so with an unsorted id variable the within-subject pairs are never seen together.
long <- long[order(long$ID), ]
model1<- gee(A~time+BMI, id=ID, corstr= "independence", data = long)
Beginning Cgee S-function, #(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) time BMI
122.389 0.508 -1.127
model2<- gee(A~time+BMI, id=ID, corstr= "exchangeable", data = long)
Beginning Cgee S-function, #(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) time BMI
122.389 0.508 -1.127
Warning message:
In gee(A ~ time + BMI, id = ID, corstr = "exchangeable", data = long) :
Working correlation estimate not positive definite
> model1
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Independent
Call:
gee(formula = A ~ time + BMI, id = ID, data = long, corstr = "independence")
Number of observations : 10
Maximum cluster size : 2
Coefficients:
(Intercept) time BMI
122.389 0.508 -1.127
Estimated Scale Parameter: 93.6
Number of Iterations: 1
Working Correlation[1:4,1:4]
[,1] [,2]
[1,] 1 0
[2,] 0 1
Returned Error Value:
[1] 0
model2
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Exchangeable
Call:
gee(formula = A ~ time + BMI, id = ID, data = long, corstr = "exchangeable")
Number of observations : 10
Maximum cluster size : 2
Coefficients:
(Intercept) time BMI
180.00 -1.70 -3.16
Estimated Scale Parameter: 154
Number of Iterations: 5
Working Correlation[1:4,1:4]
[,1] [,2]
[1,] 1.0 2.8
[2,] 2.8 1.0
Returned Error Value:
[1] 1000
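More generally (a sketch, extending the fix above), sorting by both cluster and time keeps each subject's rows contiguous and in temporal order, which is what gee expects when it builds the working correlation:
long <- long[order(long$ID, long$time), ]
model2 <- gee(A ~ time + BMI, id = ID, corstr = "exchangeable", data = long)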
