hmftest for a multinomial logit model in R: "system is computationally singular"

I have a multinomial logit model with two individual-specific variables (first and age).
I would like to conduct the hmftest to check whether the IIA assumption holds.
My dataset looks like this:
head(df)
mode choice first age
1 both 1 0 24
2 pre 1 1 23
3 both 1 2 53
4 post 1 3 43
5 no 1 1 55
6 both 1 2 63
I reshaped it for mlogit to:
mode choice first age idx
1 TRUE 1 0 24 1:both
2 FALSE 1 0 24 1:no
3 FALSE 1 0 24 1:post
4 FALSE 1 0 24 1:pre
5 FALSE 1 1 23 2:both
6 FALSE 1 1 23 2:no
7 FALSE 1 1 23 2:post
8 TRUE 1 1 23 2:pre
9 TRUE 1 2 53 3:both
10 FALSE 1 2 53 3:no
~~~ indexes ~~~~
id1 id2
1 1 both
2 1 no
3 1 post
4 1 pre
5 2 both
6 2 no
7 2 post
8 2 pre
9 3 both
10 3 no
indexes: 1, 2
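A long-format object like this is typically produced with mlogit's data tools; a minimal sketch, assuming df has exactly the columns shown in head(df) above:
library(mlogit)
# wide shape: one row per individual, the "mode" column holds the chosen alternative
# (mlogit.data(df, choice = "mode", shape = "wide") is the older equivalent)
df_mlogit <- dfidx(df, choice = "mode", shape = "wide")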
My original (full) model runs as follows:
full <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no")
leading to the following result:
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
method = "nr")
Frequencies of alternatives:choice
no both post pre
0.2 0.4 0.2 0.2
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 8.11E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 2.0077e+01 1.0441e+04 0.0019 0.9985
(Intercept):post -4.1283e-01 1.4771e+04 0.0000 1.0000
(Intercept):pre 5.3346e-01 1.4690e+04 0.0000 1.0000
first1:both -4.0237e+01 1.1059e+04 -0.0036 0.9971
first1:post -8.9168e-01 1.4771e+04 -0.0001 1.0000
first1:pre -6.6805e-01 1.4690e+04 0.0000 1.0000
first2:both -1.9674e+01 1.0441e+04 -0.0019 0.9985
first2:post -1.8975e+01 1.5683e+04 -0.0012 0.9990
first2:pre -1.8889e+01 1.5601e+04 -0.0012 0.9990
first3:both -2.1185e+01 1.1896e+04 -0.0018 0.9986
first3:post 1.9200e+01 1.5315e+04 0.0013 0.9990
first3:pre 1.9218e+01 1.5237e+04 0.0013 0.9990
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 9.3377e-03 2.3157e-02 0.4032 0.6868
age:pre -1.2338e-02 2.2812e-02 -0.5408 0.5886
Log-Likelihood: -61.044
McFadden R^2: 0.54178
Likelihood ratio test : chisq = 144.35 (p.value = < 2.22e-16)
To test for IIA, I exclude one alternative from the model (here "pre") and run the model as follows:
part <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
alt.subset = c("no", "post", "both"))
leading to
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, alt.subset = c("no",
"post", "both"), reflevel = "no", method = "nr")
Frequencies of alternatives:choice
no both post
0.25 0.50 0.25
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 6.88E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 1.9136e+01 6.5223e+03 0.0029 0.9977
(Intercept):post -9.2040e-01 9.2734e+03 -0.0001 0.9999
first1:both -3.9410e+01 7.5835e+03 -0.0052 0.9959
first1:post -9.3119e-01 9.2734e+03 -0.0001 0.9999
first2:both -1.8733e+01 6.5223e+03 -0.0029 0.9977
first2:post -1.8094e+01 9.8569e+03 -0.0018 0.9985
first3:both -2.0191e+01 1.1049e+04 -0.0018 0.9985
first3:post 2.0119e+01 1.1188e+04 0.0018 0.9986
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 1.9879e-02 2.7872e-02 0.7132 0.4757
Log-Likelihood: -27.325
McFadden R^2: 0.67149
Likelihood ratio test : chisq = 111.71 (p.value = < 2.22e-16)
However, when I want to conduct the hmftest, the following error occurs:
> hmftest(full, part)
Error in solve.default(diff.var) :
system is computationally singular: reciprocal condition number = 4.34252e-21
Does anyone have an idea where the problem might be?

I believe the issue here could be that hmftest checks whether the probability ratio of two alternatives depends only on the characteristics of those alternatives (the IIA property).
Since there are only individual-level variables here, the test won't work in this case.
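For contrast, here is a minimal sketch of the setting hmftest is designed for, using the Fishing data shipped with mlogit, which does contain alternative-specific covariates (price and catch); the particular subset of alternatives chosen here is only illustrative:
library(mlogit)
data("Fishing", package = "mlogit")
# price and catch are alternative-specific and stored in wide columns 2:9
Fish <- dfidx(Fishing, varying = 2:9, shape = "wide", choice = "mode")
full_fish <- mlogit(mode ~ price + catch, data = Fish)
part_fish <- mlogit(mode ~ price + catch, data = Fish,
                    alt.subset = c("beach", "pier", "boat"))
hmftest(full_fish, part_fish)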

Related

Multilevel Logit Model in R - not including all the values for random intercept

I am building a random intercept model in R using the glmer function, with the 2nd-level variable being country. When I run my model, however, it only includes 24 countries and 27005 observations even though there are 60 countries and 75047 observations.
I can provide other info if necessary, but I'm just wondering if anyone has an initial idea why this might be; I cannot find anything online.
Generalized linear mixed model fit by maximum likelihood (Adaptive Gauss-Hermite Quadrature, nAGQ = 0) ['glmerMod']
Family: binomial ( logit )
Formula: serve ~ age + sex + income + religion + proud + trusting + outgoing + (1 | country)
Data: WVS
Control: glmerControl(optimizer = "bobyqa")
AIC BIC logLik deviance df.resid
30102.4 30250.1 -15033.2 30066.4 26987
Scaled residuals:
Min 1Q Median 3Q Max
-4.2087 -0.8943 0.4331 0.6737 3.8525
Random effects:
Groups Name Variance Std.Dev.
country (Intercept) 0.6272 0.7919
Number of obs: 27005, groups: country, 24
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.188730 0.181939 1.037 0.299584
age -0.004503 0.001229 -3.666 0.000247 ***
sexmale 0.672997 0.028757 23.403 < 2e-16 ***
income -0.005812 0.007070 -0.822 0.411024
religionRather important 0.117421 0.049464 2.374 0.017604 *
religionVery important 0.269977 0.048460 5.571 2.53e-08 ***
proud2 -0.210176 0.033430 -6.287 3.23e-10 ***
proud3 -0.306502 0.054530 -5.621 1.90e-08 ***
proud4 -0.601837 0.099568 -6.044 1.50e-09 ***
trusting2 0.134689 0.055366 2.433 0.014987 *
trusting3 0.195169 0.056104 3.479 0.000504 ***
trusting4 0.309589 0.054498 5.681 1.34e-08 ***
trusting5 0.294739 0.059784 4.930 8.22e-07 ***
outgoing2 -0.160543 0.062618 -2.564 0.010352 *
outgoing3 -0.119559 0.062781 -1.904 0.056861 .
outgoing4 0.120816 0.060180 2.008 0.044689 *
outgoing5 0.238158 0.063453 3.753 0.000175 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here is a sample of the data:
conscription serve country sex education income religion immigrant proud trusting outgoing age
1 1 Yes ALG male 3 5 Very important 0 1 2 2 -15.7403361
2 1 Yes ALG female 3 6 Rather important 0 2 4 2 -12.7403361
3 1 Yes ALG female 3 6 Very important 0 1 3 3 -10.7403361
4 1 Yes ALG female 3 5 Very important 0 1 3 4 -8.7403361
5 1 Yes ALG female 2 7 Very important 0 1 4 4 -1.7403361
6 1 Yes ALG male 4 5 Very important 0 1 3 4 -0.7403361
7 1 Yes ALG male 3 7 Very important 0 1 2 2 4.2596639
8 1 Yes ALG female 2 2 Rather important 0 1 3 4 7.2596639
9 1 Yes ALG male 1 5 Rather important 0 1 3 2 22.2596639
11 1 Yes ALG female 4 5 Very important 0 1 3 1 -13.7403361
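A quick thing to check in a situation like this: glmer drops any row with an NA in one of the model variables by default, which shrinks both the number of observations and the number of groups. A sketch of that check, assuming the variable names from the formula above:
model_vars <- c("serve", "age", "sex", "income", "religion",
                "proud", "trusting", "outgoing", "country")
colSums(is.na(WVS[, model_vars]))                  # NAs per model variable
sum(complete.cases(WVS[, model_vars]))             # rows glmer can actually use
length(unique(WVS$country[complete.cases(WVS[, model_vars])]))  # countries retained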

Check for statistically significant differences between groups after running a logistic regression w/ interaction & random effect?

I ran an ordinal logistic regression (using the function clmm from the R package ordinal) with a two-factor interaction and a random effect.
The response is a factor with 5 levels (Likert scale: 1 2 3 4 5); the independent variables are a factor with 2 levels (time) and a factor with 3 levels (group).
The code looks like this:
library(ordinal)
# dataset
ID time group response
person1 1 a 3
person2 1 a 5
person3 1 c 5
person4 1 b 2
person5 1 c 2
person6 1 a 4
person1 2 a 2
person2 2 a 2
person3 2 c 1
person4 2 b 4
person5 2 c 3
person6 2 a 4
... ... ... ...
# model
model <- clmm(response ~ time * group + (1 | ID), data = dataset)
# model results
formula: response ~ time * group + (1 | ID)
data: dataset
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 168 -226.76 473.52 508(4150) 9.42e-05 1.8e+02
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 5.18 2.276
Number of groups: ID 84
Coefficients:
Estimate Std. Error z value Pr(>|z|)
time2 0.2837 0.5289 0.536 0.59170
group_b 1.8746 0.6946 2.699 0.00695 **
group_c 4.0023 0.9383 4.265 2e-05 ***
time2:group_b -0.5100 0.7294 -0.699 0.48447
time2:group_c -0.8830 0.9749 -0.906 0.36508
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -2.6223 0.6440 -4.072
2|3 0.2474 0.5427 0.456
3|4 2.5384 0.5824 4.359
4|5 4.6786 0.7143 6.550
As you can see, the model results only show whether there are differences compared to the intercept (i.e. time1:group_a). However, what I am also interested in is whether the difference between time1:group_b and time2:group_b is statistically significant, and likewise for group_c.
Since I have to account for the random effect, I cannot use a simple chi-square test to check for statistically significant differences between groups. I therefore tried the function contrast from the R package emmeans, which takes the output of the function emmeans; see the code below:
library(emmeans)
em <- emmeans(model, ~ time | group) #calculates the estimated marginal means
contrast(em, "consec", simple = "each")
# contrast results
$`simple contrasts for time`
group = a:
contrast estimate SE df z.ratio p.value
2 - 1 0.284 0.529 Inf 0.536 0.5917
group = b:
contrast estimate SE df z.ratio p.value
2 - 1 -0.226 0.482 Inf -0.470 0.6386
group = c:
contrast estimate SE df z.ratio p.value
2 - 1 -0.599 0.816 Inf -0.734 0.4629
Note: contrasts are still on the as.factor scale
$`simple contrasts for group`
time = 1:
contrast estimate SE df z.ratio p.value
b - a 1.87 0.695 Inf 2.699 0.0137
c - b 2.13 0.871 Inf 2.443 0.0284
time = 2:
contrast estimate SE df z.ratio p.value
b - a 1.36 0.687 Inf 1.986 0.0897
c - b 1.75 0.838 Inf 2.095 0.0695
Note: contrasts are still on the as.factor scale
P value adjustment: mvt method for 2 tests
My questions are:
a) Is this a correct and valid method to check whether the differences are significant?
b) If not, what is the correct way to do this?
Of course any other suggestion is extremely welcome! Thanks a lot.
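One complementary check, sketched under the assumption that model and dataset are defined as above, is a likelihood-ratio test of the interaction as a whole via anova() on nested clmm fits:
# refit without the interaction and compare by likelihood ratio
model_add <- clmm(response ~ time + group + (1 | ID), data = dataset)
anova(model_add, model)  # does the time:group interaction improve the fit?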

Setting covariates to the mean for marginal effects using ggpredict

I have some data (df):
inter out time int
0 1 21 0
0 0 32 0
0 1 44 0
0 0 59 0
0 1 88 0
0 1 111 0
0 0 54 0
1 0 63 63
1 1 73 73
1 1 83 83
1 0 93 93
1 1 52 52
1 0 33 33
1 1 10 10
And I run a glm model:
m <- glm(out ~ inter + time + int, data = df, family = binomial(link = "logit"))
The model coefficients are:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.00916 1.82200 -0.554 0.580
inter 2.00906 2.64959 0.758 0.448
time 0.02293 0.03010 0.762 0.446
int -0.03502 0.04215 -0.831 0.406
I want to get the marginal effects, which according to my understanding are the predicted probabilities at certain levels of a predictor, holding the other variables constant; in this case the levels are 0 vs. 1 for my binary predictor 'inter'. If this is incorrect, please let me know. According to https://rdrr.io/cran/ggeffects/man/ggpredict.html, "All remaining covariates that are not specified in terms are held constant (see 'Details')". The means of time and int are 58.29 and 29.07, respectively, so the formula to get the predicted probabilities is:
Level 0:
sum <- -1.00916 + (0.02293 * 58.29) + (-0.03502 * 29.07)
sume <- exp(sum)
sumee <- sume / (1 + sume)
sumee  # 0.33
Level 1:
sum <- -1.00916 + 2.00906 + (0.02293 * 58.29) + (-0.03502 * 29.07)
sume <- exp(sum)
sumee <- sume / (1 + sume)
sumee  # 0.79
The predicted probability holding other variables constant is 0.79 for level 1 compared to 0.33 for level 0, which is exactly what these ggpredict statements produce:
ggpredict(m, terms = c("inter", "time [mean]"))
ggpredict(m, terms = c("inter"))
However, when I specify 'int' at the mean with "int [mean]", it produces different results:
ggpredict(m, terms = c("inter", "time [mean]", "int [mean]"))
ggpredict(m, terms = c("inter", "int [mean]"))
It says level 0 has a predicted probability of 0.19, compared to 0.64 for level 1. Why? Shouldn't all four commands produce the same results, since R automatically holds covariates at their mean? Using other functions for 'int', such as min and max, e.g. ggpredict(m, terms = c("inter", "time [mean]", "int [min]")), produces predictable results based on the formula.
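For what it's worth, the by-hand numbers above can be reproduced directly with predict(), which makes explicit which covariate values a set of predictions corresponds to; a sketch assuming m and df as defined above:
nd <- data.frame(inter = c(0, 1),
                 time  = mean(df$time),
                 int   = mean(df$int))
predict(m, newdata = nd, type = "response")  # ~0.33 for inter = 0, ~0.79 for inter = 1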

using emmeans for lmer

I've been trying to calculate marginal means for my lmer and glmer models in R. I found the emmeans function and have been trying to understand it and apply it to my model. It seems hard to get the means for an interaction, so I'm starting with just additive predictors, but the function doesn't work the way it's presented in the examples (e.g. here: https://cran.r-project.org/web/packages/emmeans/vignettes/sophisticated.html)
emmeans(Oats.lmer, "nitro")
nitro emmean SE df lower.CL upper.CL
0.0 78.89207 7.294379 7.78 61.98930 95.79484
0.2 97.03425 7.136271 7.19 80.25029 113.81822
0.4 114.19816 7.136186 7.19 97.41454 130.98179
0.6 124.06857 7.070235 6.95 107.32795 140.80919
what I'm getting is:
emmeans(model2, "VariableA")
VariableA emmean SE df lower.CL upper.CL
0.4657459 2649.742 120.8955 19.07 2396.768 2902.715
Only one line is returned, and the variable is averaged instead of being split into 0 and 1 (which are the values in the dataset; maybe the problem is that it isn't coded as categorical?).
The model I'm running is :
model2 = lmer (rt ~ variableA + variableB + (1 |participant) + (1 |sequence/item), data=memoryData, REML=FALSE)
EDIT: The data file is quite big and I wasn't sure how to extract useful information from it, but here is the structure:
> str(memoryData)
'data.frame': 3168 obs. of 123 variables:
$ participant : int 10 10 10 10 10 10 10 10 10 10 ...
$ variableA : int 1 1 1 1 1 1 1 1 1 1 ...
$ variableB : int 1 1 1 1 1 1 1 1 1 1 ...
$ sequence: int 1 1 1 1 1 1 1 1 1 1 ...
$ item : int 25 26 27 28 29 30 31 32 33 34 ...
$ accuracy : int 1 1 1 1 1 1 0 1 1 1 ...
$ rt : num 1720 1628 1728 2247 1247 ...
Why is the function not working for me?
And as a further question, is there a way to get these means when I include interaction between variables A and B?
EDIT 2: OK, it did work when I changed the variable to a factor; I guess my way of doing it was incorrect. But I'm still not sure how to calculate the means when there is an interaction, because with this method R says "NOTE: Results may be misleading due to involvement in interactions".
To see marginal means of interactions, add all variables of the interaction term to emmeans(), and use the at argument if you want the marginal means at specific levels of the interaction terms.
Here are some examples, for the average effect of the interaction and for marginal effects at different levels of the interaction term. The latter is more useful for plotting.
library(ggeffects)
library(lme4)
library(emmeans)
data("sleepstudy")
sleepstudy$inter <- sample(1:5, size = nrow(sleepstudy), replace = T)
m <- lmer(Reaction ~ Days * inter + (1 + Days | Subject), data = sleepstudy)
# average marginal effect of interaction
emmeans(m, c("Days", "inter"))
#> Days inter emmean SE df lower.CL upper.CL
#> 4.5 2.994444 298.3427 8.84715 16.98 279.6752 317.0101
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
# marginal effects at different levels of interactions -
# useful for plotting
ggpredict(m, c("Days [3,5,7]", "inter"))
#>
#> # Predicted values of Reaction
#> # x = Days
#>
#> # inter = 1
#> x predicted std.error conf.low conf.high
#> 3 279.349 8.108 263.458 295.240
#> 5 304.839 9.818 285.597 324.082
#> 7 330.330 12.358 306.109 354.551
#>
#> # inter = 2
#> x predicted std.error conf.low conf.high
#> 3 280.970 7.624 266.028 295.912
#> 5 304.216 9.492 285.613 322.819
#> 7 327.462 11.899 304.140 350.784
#>
#> # inter = 3
#> x predicted std.error conf.low conf.high
#> 3 282.591 7.446 267.997 297.185
#> 5 303.593 9.384 285.200 321.985
#> 7 324.594 11.751 301.562 347.626
#>
#> # inter = 4
#> x predicted std.error conf.low conf.high
#> 3 284.212 7.596 269.325 299.100
#> 5 302.969 9.502 284.345 321.594
#> 7 321.726 11.925 298.353 345.099
#>
#> # inter = 5
#> x predicted std.error conf.low conf.high
#> 3 285.834 8.055 270.046 301.621
#> 5 302.346 9.839 283.062 321.630
#> 7 318.858 12.408 294.540 343.177
#>
#> Adjusted for:
#> * Subject = 308
emmeans(m, c("Days", "inter"), at = list(Days = c(3, 5, 7), inter = 1:5))
#> Days inter emmean SE df lower.CL upper.CL
#> 3 1 279.3488 8.132335 23.60 262.5493 296.1483
#> 5 1 304.8394 9.824196 20.31 284.3662 325.3125
#> 7 1 330.3300 12.366296 20.69 304.5895 356.0704
#> 3 2 280.9700 7.630745 18.60 264.9754 296.9646
#> 5 2 304.2160 9.493225 17.77 284.2529 324.1791
#> 7 2 327.4621 11.901431 17.84 302.4420 352.4822
#> 3 3 282.5912 7.445982 16.96 266.8786 298.3038
#> 5 3 303.5927 9.383978 16.98 283.7927 323.3927
#> 7 3 324.5942 11.751239 16.98 299.7988 349.3896
#> 3 4 284.2124 7.601185 18.34 268.2639 300.1609
#> 5 4 302.9694 9.504102 17.85 282.9900 322.9487
#> 7 4 321.7263 11.927612 17.99 296.6666 346.7860
#> 3 5 285.8336 8.076779 23.02 269.1264 302.5409
#> 5 5 302.3460 9.845207 20.48 281.8399 322.8521
#> 7 5 318.8584 12.416642 21.02 293.0380 344.6788
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
And a plotting example:
ggpredict(m, c("Days", "inter [1,3,5]")) %>% plot()
You say that "changing the vari[a]ble to factor doesn't help", but I would think this would (as documented in the emmeans FAQ):
md <- transform(memoryData,
variableA=factor(variableA),
variableB=factor(variableB))
model2 = lmer (rt ~ variableA + variableB +
(1 |participant) + (1 |sequence/item), data=md, REML=FALSE)
emmeans(model2, ~variableA)
emmeans(model2, ~variableB)
emmeans(model2, ~variableA + variableB)
If this really doesn't work, then we need a reproducible example ...

Conditional logistic regression: within subject matching

I'm trying to compare the prevalence of a specific lesion (binary) at the symptomatic side to the asymptomatic side within a group of patients.
I've already performed a McNemar test to compare the prevalence at the symptomatic versus asymptomatic side within patients.
However, I'm also asked to perform a conditional logistic regression. I'm not sure if my syntax is correct with respect to the stratification:
summary(clogit(ds$symp ~ ds$asymp, strata(ds$ID), data=ds, method = "exact"))
Does R compare both sides (symptomatic vs. asymptomatic) within each patient? Or do I have to manually duplicate the patient ID (one ID for the symptomatic side and one for the asymptomatic side)?
Thanks,
An example:
ID symp asymp
1 0 0
2 1 0
3 0 0
4 0 0
5 1 0
6 1 1
7 0 0
8 0 0
9 0 1
10 0 0
As an example: patient 2 has a lesion at the symptomatic side only, and patient 9 only at the asymptomatic side. Patient 6 has lesions on both sides.
An exact McNemar test shows:
test <- table(df$symp, df$asymp)
compare <- exact2x2(test, paired = TRUE, alternative = "two.sided", tsmethod = "central")
print(compare)
Exact McNemar test (with central confidence intervals)
data: test
b = 1, c = 2, p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.00847498 9.60452988
sample estimates:
odds ratio
0.5
However, a conditional logistic regression model:
> summary(clogit(df$symp ~ df$asymp, strata(df$ID), data=df, method = "exact"))
Call:
coxph(formula = Surv(rep(1, 10L), df$symp) ~ df$asymp, data = df,
method = "exact")
n= 10, number of events= 3
coef exp(coef) se(coef) z Pr(>|z|)
df$asymp 0.973 2.646 1.524 0.638 0.523
exp(coef) exp(-coef) lower .95 upper .95
df$asymp 2.646 0.378 0.1334 52.46
Rsquare= 0.039 (max possible= 0.616 )
Likelihood ratio test= 0.4 on 1 df, p=0.528
Wald test = 0.41 on 1 df, p=0.5232
Score (logrank) test = 0.43 on 1 df, p=0.5127
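For what it's worth, the within-patient comparison described in the question is usually set up with one row per side per patient, the side as the predictor, and the patient as the stratum. A sketch under that assumption (the reshaped object ds_long and the variable name side are only illustrative):
library(survival)
# one row per side per patient: outcome = lesion present on that side
ds_long <- data.frame(
  ID     = rep(ds$ID, times = 2),
  lesion = c(ds$symp, ds$asymp),
  side   = rep(c("symptomatic", "asymptomatic"), each = nrow(ds))
)
summary(clogit(lesion ~ side + strata(ID), data = ds_long, method = "exact"))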
