I'm trying to compare the prevalence of a specific lesion (binary) at the symptomatic side to the asymptomatic side within a group of patients.
I've already performed a McNemar test to compare the prevalence at the symptomatic versus asymptomatic side within patients.
However, I'm asked to perform also a conditional logistic regression. I'm not sure if my syntax is correct with respect to the stratification:
summary(clogit(ds$symp ~ ds$asymp, strata(ds$ID), data=ds, method = "exact"))
Does R compare both sides of the patient (symptomatic vs asymptomatic) within the patient(s)? Or do I have to duplicate manually the patient ID (one ID for the symptomatic side AND one ID for the asymptomatic side)?
Thanks,
An example:
ID symp asymp
1 0 0
2 1 0
3 0 0
4 0 0
5 1 0
6 1 1
7 0 0
8 0 0
9 0 1
10 0 0
As an example: patient 2 has a lesion at the symptomatic side and patient 9 only at the asymptomatic side. Patients 6 at both sides.
A Exact McNemar test showes:
test <- table(df$symp, df$asymp)
compare <- exact2x2(test, paired = TRUE, alternative = "two.sided", tsmethod = "central")
print(compare)
Exact McNemar test (with central confidence intervals)
data: test
b = 1, c = 2, p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.00847498 9.60452988
sample estimates:
odds ratio
0.5
However, a conditional logistic regression model:
> summary(clogit(df$symp ~ df$asymp, strata(df$ID), data=df, method = "exact"))
Call:
coxph(formula = Surv(rep(1, 10L), df$symp) ~ df$asymp, data = df,
method = "exact")
n= 10, number of events= 3
coef exp(coef) se(coef) z Pr(>|z|)
df$symp 0.973 2.646 1.524 0.638 0.523
exp(coef) exp(-coef) lower .95 upper .95
df$asymp 2.646 0.378 0.1334 52.46
Rsquare= 0.039 (max possible= 0.616 )
Likelihood ratio test= 0.4 on 1 df, p=0.528
Wald test = 0.41 on 1 df, p=0.5232
Score (logrank) test = 0.43 on 1 df, p=0.5127
Related
I have a multinomial logit model with two individual specific variables (first and age).
I would like to conduct the hmftest to check if the IIA holds.
My dataset looks like this:
head(df)
mode choice first age
1 both 1 0 24
2 pre 1 1 23
3 both 1 2 53
4 post 1 3 43
5 no 1 1 55
6 both 1 2 63
I adjusted it for the mlogit to:
mode choice first age idx
1 TRUE 1 0 24 1:both
2 FALSE 1 0 24 1:no
3 FALSE 1 0 24 1:post
4 FALSE 1 0 24 1:pre
5 FALSE 1 1 23 2:both
6 FALSE 1 1 23 2:no
7 FALSE 1 1 23 2:post
8 TRUE 1 1 23 2:pre
9 TRUE 1 2 53 3:both
10 FALSE 1 2 53 3:no
~~~ indexes ~~~~
id1 id2
1 1 both
2 1 no
3 1 post
4 1 pre
5 2 both
6 2 no
7 2 post
8 2 pre
9 3 both
10 3 no
indexes: 1, 2
My original (full) model runs as follows:
full <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no")
leading to the following result:
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
method = "nr")
Frequencies of alternatives:choice
no both post pre
0.2 0.4 0.2 0.2
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 8.11E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 2.0077e+01 1.0441e+04 0.0019 0.9985
(Intercept):post -4.1283e-01 1.4771e+04 0.0000 1.0000
(Intercept):pre 5.3346e-01 1.4690e+04 0.0000 1.0000
first1:both -4.0237e+01 1.1059e+04 -0.0036 0.9971
first1:post -8.9168e-01 1.4771e+04 -0.0001 1.0000
first1:pre -6.6805e-01 1.4690e+04 0.0000 1.0000
first2:both -1.9674e+01 1.0441e+04 -0.0019 0.9985
first2:post -1.8975e+01 1.5683e+04 -0.0012 0.9990
first2:pre -1.8889e+01 1.5601e+04 -0.0012 0.9990
first3:both -2.1185e+01 1.1896e+04 -0.0018 0.9986
first3:post 1.9200e+01 1.5315e+04 0.0013 0.9990
first3:pre 1.9218e+01 1.5237e+04 0.0013 0.9990
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 9.3377e-03 2.3157e-02 0.4032 0.6868
age:pre -1.2338e-02 2.2812e-02 -0.5408 0.5886
Log-Likelihood: -61.044
McFadden R^2: 0.54178
Likelihood ratio test : chisq = 144.35 (p.value = < 2.22e-16)
To test for IIA, I exclude one alternative from the model (here "pre") and run the model as follows:
part <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
alt.subset = c("no", "post", "both"))
leading to
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, alt.subset = c("no",
"post", "both"), reflevel = "no", method = "nr")
Frequencies of alternatives:choice
no both post
0.25 0.50 0.25
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 6.88E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 1.9136e+01 6.5223e+03 0.0029 0.9977
(Intercept):post -9.2040e-01 9.2734e+03 -0.0001 0.9999
first1:both -3.9410e+01 7.5835e+03 -0.0052 0.9959
first1:post -9.3119e-01 9.2734e+03 -0.0001 0.9999
first2:both -1.8733e+01 6.5223e+03 -0.0029 0.9977
first2:post -1.8094e+01 9.8569e+03 -0.0018 0.9985
first3:both -2.0191e+01 1.1049e+04 -0.0018 0.9985
first3:post 2.0119e+01 1.1188e+04 0.0018 0.9986
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 1.9879e-02 2.7872e-02 0.7132 0.4757
Log-Likelihood: -27.325
McFadden R^2: 0.67149
Likelihood ratio test : chisq = 111.71 (p.value = < 2.22e-16)
However when I want to codnuct the hmftest then the following error occurs:
> hmftest(full, part)
Error in solve.default(diff.var) :
system is computationally singular: reciprocal condition number = 4.34252e-21
Does anyone have an idea where the problem might be?
I believe the issue here could be that the hmftest checks if the probability ratio of two alternatives depends only on the characteristics of these alternatives.
Since there are only individual-level variables here, the test won't work in this case.
I ran an ordinal logistic regression (using the function clmm from the R package ordinal) with a two-factor interaction and a random effect.
The response is a factor w/ 5 levels (Liker scale: 1 2 3 4 5), the independent variables are a factor w/ 2 levels (time) and a factor w/ 3 levels (group)
The code looks like this:
library(ordinal)
# dataset
ID time group response
person1 1 a 3
person2 1 a 5
person3 1 c 5
person4 1 b 2
person5 1 c 2
person6 1 a 4
person1 2 a 2
person2 2 a 2
person3 2 c 1
person4 2 b 4
person5 2 c 3
person6 2 a 4
... ... ... ...
# model
model <- clmm(response ~ time*group + (1|ID))
# model results
formula: response ~ time * group + (1 | ID)
data: dataset
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 168 -226.76 473.52 508(4150) 9.42e-05 1.8e+02
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 5.18 2.276
Number of groups: ID 84
Coefficients:
Estimate Std. Error z value Pr(>|z|)
time2 0.2837 0.5289 0.536 0.59170
group_b 1.8746 0.6946 2.699 0.00695 **
group_c 4.0023 0.9383 4.265 2e-05 ***
time2:group_b -0.5100 0.7294 -0.699 0.48447
time2:group_c -0.8830 0.9749 -0.906 0.36508
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -2.6223 0.6440 -4.072
2|3 0.2474 0.5427 0.456
3|4 2.5384 0.5824 4.359
4|5 4.6786 0.7143 6.550
As you can see, the model results show only whether there are differences compared to the intercept (i.e. time1:group_a). However, what I am also interested in is to check if the difference between time1:group_b and time2:group_b is statistically significant, same for group_c.
Since I have to account for the random effect, I cannot use a simple chi-square test to check for statistically significant differences between groups. I therefore tried to run the function contrast from the R package emmeans, which uses the output of the function emmeans, see the code below:
library(emmeans)
em <- emmeans(model, ~ time | group) #calculates the estimated marginal means
contrast(em, "consec", simple = "each")
# contrast results
$`simple contrasts for time`
group = a:
contrast estimate SE df z.ratio p.value
2 - 1 0.284 0.529 Inf 0.536 0.5917
group = b:
contrast estimate SE df z.ratio p.value
2 - 1 -0.226 0.482 Inf -0.470 0.6386
group = c:
contrast estimate SE df z.ratio p.value
2 - 1 -0.599 0.816 Inf -0.734 0.4629
Note: contrasts are still on the as.factor scale
$`simple contrasts for group`
time = 1:
contrast estimate SE df z.ratio p.value
b - a 1.87 0.695 Inf 2.699 0.0137
c - b 2.13 0.871 Inf 2.443 0.0284
time = 2:
contrast estimate SE df z.ratio p.value
b - a 1.36 0.687 Inf 1.986 0.0897
c - b 1.75 0.838 Inf 2.095 0.0695
Note: contrasts are still on the as.factor scale
P value adjustment: mvt method for 2 tests
My questions are:
a) Is this a correct and valid method to check whether the differences are significant?
b) If not, what is the correct way to do this?
Of course any other suggestion is extremely welcome! Thanks a lot.
I have some data (df):
inter out time int
0 1 21 0
0 0 32 0
0 1 44 0
0 0 59 0
0 1 88 0
0 1 111 0
0 0 54 0
1 0 63 63
1 1 73 73
1 1 83 83
1 0 93 93
1 1 52 52
1 0 33 33
1 1 10 10
And I run a glm model:
m <- glm(out ~ inter + time + int, data = df, family = binomial(link = "logit"))
The model coefficients are:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.00916 1.82200 -0.554 0.580
inter 2.00906 2.64959 0.758 0.448
time 0.02293 0.03010 0.762 0.446
int -0.03502 0.04215 -0.831 0.406
I want to get the marginal effects, which according to my understanding is the predicted probabilities at certain levels holding other variables constant, which in this case is 0 vs. 1 for my binary predictor, 'inter'. If this in incorrect, please let me know. According to [https://rdrr.io/cran/ggeffects/man/ggpredict.html], "All remaining covariates that are not specified in terms are held constant (see 'Details')". The mean of time and int are 58.29 and 29.07, respectively, so the formula to get the predicted probabilities is:
Level 0:
sum = -1.00916 + (.02293 * 58.29) + (-.03502 * 29.07)
sume = exp(sum)
sumee <- sume/(1+sume)
sumee = 0.33
Level 1:
sum = -1.00916 + 2.00906 + (.02293 * 58.29) + (-.03502 * 29.07)
sume = exp(sum)
sumee <- sume/(1+sume)
sumee = 0.79
The predicted probability holding other variables constant is 0.79 for level 1 compared to 0.33 for level 0, which is exactly what these ggpredict statements produce:
ggpredict(m, terms = c("inter", "time [mean]"))
ggpredict(m, terms = c("inter"))
However, when I specify 'int' at the mean with "int [mean]", it produces different results:
ggpredict(m, terms = c("inter", "time [mean]", "int [mean]"))
ggpredict(m, terms = c("inter", "int [mean]"))
It says level 0 has a predicted probability of 0.19, compared to 0.64 for level 1. Why? Shouldn't all four commands produce the same results since r automatically calculates covariates at the mean? Using other functions for 'int', such as min and max ggpredict(m, terms = c("inter", "time [mean]", "int [min]")) produces predictable results based on the formula.
I have a datset that looks like:
Treatment Surface ex.time excision antib.time antibiotic inf.time infection
1 0 15 12 0 12 0 12 0
2 0 20 9 0 9 0 9 0
3 0 15 13 0 13 0 7 1
4 0 20 11 1 29 0 29 0
5 0 70 28 1 31 0 4 1
6 0 20 11 0 11 0 8 1
he variables represented in the dataset are as follows:
Observation number
Treatment
0-routine bathing 1-Body cleansing
Surface
Percentage of total surface area burned
Exis.time
Time to excision or on study time
Excision
indicator: 1=yes 0=no
Antib.time
Time to prophylactic antibiotic treatment or on study time
antibiotic
indicator: 1=yes 0=no
inf.time Time to straphylocous aureaus infection or on study time
infection
indicator: 1=yes 0=no
I want to model the time until infection as a function of treatment, surface, time until antibiotic treatment and time until excision. According to other posts this dataset must be transformed from wide to long. However I am not sure how to do it? Then, once the data is in the right format i would use this formula:
coxph(Surv(start, stop, event) ~ m, data=times)
Until now i have run just a normal Cox's regression, but i guess this is not correct because the time dependency is not accounted for?
coxph(formula = Surv(inf.time, infection) ~ Treatment + Surface +
ex.time + antib.time, data = BurnData)
n= 154, number of events= 48
coef exp(coef) se(coef) z Pr(>|z|)
Treatment -0.453748 0.635243 0.300805 -1.508 0.131
Surface 0.006932 1.006956 0.007333 0.945 0.345
ex.time 0.013503 1.013595 0.018841 0.717 0.474
antib.time 0.009546 1.009592 0.009560 0.999 0.318
exp(coef) exp(-coef) lower .95 upper .95
Treatment 0.6352 1.5742 0.3523 1.145
Surface 1.0070 0.9931 0.9926 1.022
ex.time 1.0136 0.9866 0.9768 1.052
antib.time 1.0096 0.9905 0.9909 1.029
Concordance= 0.576 (se = 0.046 )
Rsquare= 0.041 (max possible= 0.942 )
Likelihood ratio test= 6.5 on 4 df, p=0.1648
Wald test = 6.55 on 4 df, p=0.1618
Score (logrank) test = 6.71 on 4 df, p=0.1519
Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 3 fictional young people, 3 fictional middle age people, and 3 fictional senior citizens, along with their height and their weight. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
So, how can I compare regression coefficients (slope mainly) across three (or more) groups using R?
Sample data:
age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269
There is an elegant answer to this in CrossValidated.
But briefly,
require(emmeans)
data <- data.frame(age = factor(c(1,1,1,2,2,2,3,3,3)),
height = c(56,60,64,56,60,64,74,75,82),
weight = c(140,155,142,117,125,133,245,241,269))
model <- lm(weight ~ height*age, data)
anova(model) #check the results
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
height 1 25392.3 25392.3 481.5984 0.0002071 ***
age 2 2707.4 1353.7 25.6743 0.0129688 *
height:age 2 169.0 84.5 1.6027 0.3361518
Residuals 3 158.2 52.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
slopes <- emtrends(model, 'age', var = 'height') #gets each slope
slopes
age height.trend SE df lower.CL upper.CL
1 0.25 1.28 3 -3.84 4.34
2 2.00 1.28 3 -2.09 6.09
3 3.37 1.18 3 -0.38 7.12
Confidence level used: 0.95
pairs(slopes) #gets their comparisons two by two
contrast estimate SE df t.ratio p.value
1 - 2 -1.75 1.82 3 -0.964 0.6441
1 - 3 -3.12 1.74 3 -1.790 0.3125
2 - 3 -1.37 1.74 3 -0.785 0.7363
P value adjustment: tukey method for comparing a family of 3 estimates
To determine whether the regression coefficients "differ across three age groups" we can use anova function in R. For example, using the data in the question and shown reproducibly in the note at the end:
fm1 <- lm(weight ~ height, DF)
fm3 <- lm(weight ~ age/(height - 1), DF)
giving the following which is significant at the 2.7% level so we would conclude that there are differences in the regression coefficients of the groups if we were using a 5% cutoff but not if we were using a 1% cutoff. The age/(height - 1) in the formula for fm3 means that height is nested in age and the overall intercept is omitted. Thus the model estimates separate intercepts and slopes for each age group. This is equivalent to age + age:height - 1.
> anova(fm1, fm3)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ age/(height - 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 2991.57
2 3 149.01 4 2842.6 14.307 0.02696 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note 1: Above fm3 has 6 coefficients, an intercept and slope for each group. If you want 4 coefficients, a common intercept and separate slopes, then use
lm(weight ~ age:height, DF)
Note 2: We can also compare a model in which subsets of levels are the same. For example, we can compare a model in which ages 1 and 2 are the same to models in which they are all the same (fm1) and all different (fm3):
fm2 <- lm(weight ~ age/(height - 1), transform(DF, age = factor(c(1, 1, 3)[age])))
anova(fm1, fm2, fm3)
If you do a large number of tests you can get significance on some just by chance so you will want to lower the cutoff for p values.
Note 3: There are some notes on lm formulas here: https://sites.google.com/site/r4naturalresources/r-topics/fitting-models/formulas
Note 4: We used this as the input:
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)