I have some data (df):
inter out time int
0 1 21 0
0 0 32 0
0 1 44 0
0 0 59 0
0 1 88 0
0 1 111 0
0 0 54 0
1 0 63 63
1 1 73 73
1 1 83 83
1 0 93 93
1 1 52 52
1 0 33 33
1 1 10 10
And I run a glm model:
m <- glm(out ~ inter + time + int, data = df, family = binomial(link = "logit"))
The model coefficients are:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.00916 1.82200 -0.554 0.580
inter 2.00906 2.64959 0.758 0.448
time 0.02293 0.03010 0.762 0.446
int -0.03502 0.04215 -0.831 0.406
I want to get the marginal effects, which according to my understanding is the predicted probabilities at certain levels holding other variables constant, which in this case is 0 vs. 1 for my binary predictor, 'inter'. If this in incorrect, please let me know. According to [https://rdrr.io/cran/ggeffects/man/ggpredict.html], "All remaining covariates that are not specified in terms are held constant (see 'Details')". The mean of time and int are 58.29 and 29.07, respectively, so the formula to get the predicted probabilities is:
Level 0:
sum = -1.00916 + (.02293 * 58.29) + (-.03502 * 29.07)
sume = exp(sum)
sumee <- sume/(1+sume)
sumee = 0.33
Level 1:
sum = -1.00916 + 2.00906 + (.02293 * 58.29) + (-.03502 * 29.07)
sume = exp(sum)
sumee <- sume/(1+sume)
sumee = 0.79
The predicted probability holding other variables constant is 0.79 for level 1 compared to 0.33 for level 0, which is exactly what these ggpredict statements produce:
ggpredict(m, terms = c("inter", "time [mean]"))
ggpredict(m, terms = c("inter"))
However, when I specify 'int' at the mean with "int [mean]", it produces different results:
ggpredict(m, terms = c("inter", "time [mean]", "int [mean]"))
ggpredict(m, terms = c("inter", "int [mean]"))
It says level 0 has a predicted probability of 0.19, compared to 0.64 for level 1. Why? Shouldn't all four commands produce the same results since r automatically calculates covariates at the mean? Using other functions for 'int', such as min and max ggpredict(m, terms = c("inter", "time [mean]", "int [min]")) produces predictable results based on the formula.
Related
I have a multinomial logit model with two individual specific variables (first and age).
I would like to conduct the hmftest to check if the IIA holds.
My dataset looks like this:
head(df)
mode choice first age
1 both 1 0 24
2 pre 1 1 23
3 both 1 2 53
4 post 1 3 43
5 no 1 1 55
6 both 1 2 63
I adjusted it for the mlogit to:
mode choice first age idx
1 TRUE 1 0 24 1:both
2 FALSE 1 0 24 1:no
3 FALSE 1 0 24 1:post
4 FALSE 1 0 24 1:pre
5 FALSE 1 1 23 2:both
6 FALSE 1 1 23 2:no
7 FALSE 1 1 23 2:post
8 TRUE 1 1 23 2:pre
9 TRUE 1 2 53 3:both
10 FALSE 1 2 53 3:no
~~~ indexes ~~~~
id1 id2
1 1 both
2 1 no
3 1 post
4 1 pre
5 2 both
6 2 no
7 2 post
8 2 pre
9 3 both
10 3 no
indexes: 1, 2
My original (full) model runs as follows:
full <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no")
leading to the following result:
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
method = "nr")
Frequencies of alternatives:choice
no both post pre
0.2 0.4 0.2 0.2
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 8.11E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 2.0077e+01 1.0441e+04 0.0019 0.9985
(Intercept):post -4.1283e-01 1.4771e+04 0.0000 1.0000
(Intercept):pre 5.3346e-01 1.4690e+04 0.0000 1.0000
first1:both -4.0237e+01 1.1059e+04 -0.0036 0.9971
first1:post -8.9168e-01 1.4771e+04 -0.0001 1.0000
first1:pre -6.6805e-01 1.4690e+04 0.0000 1.0000
first2:both -1.9674e+01 1.0441e+04 -0.0019 0.9985
first2:post -1.8975e+01 1.5683e+04 -0.0012 0.9990
first2:pre -1.8889e+01 1.5601e+04 -0.0012 0.9990
first3:both -2.1185e+01 1.1896e+04 -0.0018 0.9986
first3:post 1.9200e+01 1.5315e+04 0.0013 0.9990
first3:pre 1.9218e+01 1.5237e+04 0.0013 0.9990
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 9.3377e-03 2.3157e-02 0.4032 0.6868
age:pre -1.2338e-02 2.2812e-02 -0.5408 0.5886
Log-Likelihood: -61.044
McFadden R^2: 0.54178
Likelihood ratio test : chisq = 144.35 (p.value = < 2.22e-16)
To test for IIA, I exclude one alternative from the model (here "pre") and run the model as follows:
part <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
alt.subset = c("no", "post", "both"))
leading to
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, alt.subset = c("no",
"post", "both"), reflevel = "no", method = "nr")
Frequencies of alternatives:choice
no both post
0.25 0.50 0.25
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 6.88E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 1.9136e+01 6.5223e+03 0.0029 0.9977
(Intercept):post -9.2040e-01 9.2734e+03 -0.0001 0.9999
first1:both -3.9410e+01 7.5835e+03 -0.0052 0.9959
first1:post -9.3119e-01 9.2734e+03 -0.0001 0.9999
first2:both -1.8733e+01 6.5223e+03 -0.0029 0.9977
first2:post -1.8094e+01 9.8569e+03 -0.0018 0.9985
first3:both -2.0191e+01 1.1049e+04 -0.0018 0.9985
first3:post 2.0119e+01 1.1188e+04 0.0018 0.9986
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 1.9879e-02 2.7872e-02 0.7132 0.4757
Log-Likelihood: -27.325
McFadden R^2: 0.67149
Likelihood ratio test : chisq = 111.71 (p.value = < 2.22e-16)
However when I want to codnuct the hmftest then the following error occurs:
> hmftest(full, part)
Error in solve.default(diff.var) :
system is computationally singular: reciprocal condition number = 4.34252e-21
Does anyone have an idea where the problem might be?
I believe the issue here could be that the hmftest checks if the probability ratio of two alternatives depends only on the characteristics of these alternatives.
Since there are only individual-level variables here, the test won't work in this case.
I have a datset that looks like:
Treatment Surface ex.time excision antib.time antibiotic inf.time infection
1 0 15 12 0 12 0 12 0
2 0 20 9 0 9 0 9 0
3 0 15 13 0 13 0 7 1
4 0 20 11 1 29 0 29 0
5 0 70 28 1 31 0 4 1
6 0 20 11 0 11 0 8 1
he variables represented in the dataset are as follows:
Observation number
Treatment
0-routine bathing 1-Body cleansing
Surface
Percentage of total surface area burned
Exis.time
Time to excision or on study time
Excision
indicator: 1=yes 0=no
Antib.time
Time to prophylactic antibiotic treatment or on study time
antibiotic
indicator: 1=yes 0=no
inf.time Time to straphylocous aureaus infection or on study time
infection
indicator: 1=yes 0=no
I want to model the time until infection as a function of treatment, surface, time until antibiotic treatment and time until excision. According to other posts this dataset must be transformed from wide to long. However I am not sure how to do it? Then, once the data is in the right format i would use this formula:
coxph(Surv(start, stop, event) ~ m, data=times)
Until now i have run just a normal Cox's regression, but i guess this is not correct because the time dependency is not accounted for?
coxph(formula = Surv(inf.time, infection) ~ Treatment + Surface +
ex.time + antib.time, data = BurnData)
n= 154, number of events= 48
coef exp(coef) se(coef) z Pr(>|z|)
Treatment -0.453748 0.635243 0.300805 -1.508 0.131
Surface 0.006932 1.006956 0.007333 0.945 0.345
ex.time 0.013503 1.013595 0.018841 0.717 0.474
antib.time 0.009546 1.009592 0.009560 0.999 0.318
exp(coef) exp(-coef) lower .95 upper .95
Treatment 0.6352 1.5742 0.3523 1.145
Surface 1.0070 0.9931 0.9926 1.022
ex.time 1.0136 0.9866 0.9768 1.052
antib.time 1.0096 0.9905 0.9909 1.029
Concordance= 0.576 (se = 0.046 )
Rsquare= 0.041 (max possible= 0.942 )
Likelihood ratio test= 6.5 on 4 df, p=0.1648
Wald test = 6.55 on 4 df, p=0.1618
Score (logrank) test = 6.71 on 4 df, p=0.1519
I have binary outcome variable and 4 predictors: 2 binary one and 2 continuous (truncated to whole numbers). I have some 1158 observations and the objective of the analysis is to predict the probability of the binary outcome (infection), check Goodness of fit and predictive quality of the final model.
> str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1158 obs. of 5 variables:
$ age : num 25 49 41 19 55 37 30 31 52 37 ...
$ gender: num 1 1 1 0 0 0 1 0 1 1 ...
$ var1 : num 0 0 0 0 0 0 0 0 0 0 ...
$ y : num 1 0 0 1 1 0 1 1 0 1 ...
$ var2 : num 26 33 25 30 28 20 28 21 17 25 ...
I have seen that the data is sometimes split in 2: testing and training data set, but not always. I assume this depends on the original sample size? Is it advisable to split the data for my analysis?
For now, I have not split the data. I conducted varius variable selection procedures:
manual LRT based backward selection manual
LRT based forward selection automated
LRT based backward selection
AIC backward selection procedure
AIC forward selection procedure
And the all lead to the same results: Only age and gender should be included in my model.
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2716 -0.8767 -0.7361 1.3008 1.9353
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.785753 0.238634 3.293 0.000992 ***
age -0.031504 0.004882 -6.453 1.1e-10 ***
gender -0.223195 0.129774 -1.720 0.085455 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1444.9 on 1157 degrees of freedom
Residual deviance: 1398.7 on 1155 degrees of freedom
AIC: 1404.7
Now, i want to see if any interactions or polynomials are significant. The dot (.) denotes the full model with 4 predictors.
full.twoway <- glm(y ~ (.)^2 , family = binomial, data=data) # includes 2-way interactions
summary(full.twoway)
model.aic.backward_2w <- step(full.twoway, direction = "backward", trace = 1)
summary(model.aic.backward_2w)
full.treeway <- glm(y ~ (.)^3 , family = binomial, data=data) # includes 3-way interactions
summary(full.treeway)
full.treeway <- glm(reject ~ (.)^3 , family = binomial, data=renal) # includes 3-way interactions
summary(full.treeway)
# significant interaction: age:male:cardio at 0.5
model.aic.backward_3w <- step(full.treeway, direction = "backward", trace = 1)
summary(model.aic.backward_3w)
# polynomials
model.polynomial <- glm(y ~ age + gender + I(age^2), family = binomial, data=data)
# only age, gender significant
Also only age and gender are significant. This seems very strange to me. I would have expected some interaction or polynomial term to be significant. Am I doing something wrong? Are there some other variable selection techniques?
EDIT:
I have partitioned the dataset in training and testing. Training dataset consist of 868 observations. The results of the selection procedure indicate that only the variable age is significant now...
I'm trying to compare the prevalence of a specific lesion (binary) at the symptomatic side to the asymptomatic side within a group of patients.
I've already performed a McNemar test to compare the prevalence at the symptomatic versus asymptomatic side within patients.
However, I'm asked to perform also a conditional logistic regression. I'm not sure if my syntax is correct with respect to the stratification:
summary(clogit(ds$symp ~ ds$asymp, strata(ds$ID), data=ds, method = "exact"))
Does R compare both sides of the patient (symptomatic vs asymptomatic) within the patient(s)? Or do I have to duplicate manually the patient ID (one ID for the symptomatic side AND one ID for the asymptomatic side)?
Thanks,
An example:
ID symp asymp
1 0 0
2 1 0
3 0 0
4 0 0
5 1 0
6 1 1
7 0 0
8 0 0
9 0 1
10 0 0
As an example: patient 2 has a lesion at the symptomatic side and patient 9 only at the asymptomatic side. Patients 6 at both sides.
A Exact McNemar test showes:
test <- table(df$symp, df$asymp)
compare <- exact2x2(test, paired = TRUE, alternative = "two.sided", tsmethod = "central")
print(compare)
Exact McNemar test (with central confidence intervals)
data: test
b = 1, c = 2, p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.00847498 9.60452988
sample estimates:
odds ratio
0.5
However, a conditional logistic regression model:
> summary(clogit(df$symp ~ df$asymp, strata(df$ID), data=df, method = "exact"))
Call:
coxph(formula = Surv(rep(1, 10L), df$symp) ~ df$asymp, data = df,
method = "exact")
n= 10, number of events= 3
coef exp(coef) se(coef) z Pr(>|z|)
df$symp 0.973 2.646 1.524 0.638 0.523
exp(coef) exp(-coef) lower .95 upper .95
df$asymp 2.646 0.378 0.1334 52.46
Rsquare= 0.039 (max possible= 0.616 )
Likelihood ratio test= 0.4 on 1 df, p=0.528
Wald test = 0.41 on 1 df, p=0.5232
Score (logrank) test = 0.43 on 1 df, p=0.5127
Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 3 fictional young people, 3 fictional middle age people, and 3 fictional senior citizens, along with their height and their weight. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
So, how can I compare regression coefficients (slope mainly) across three (or more) groups using R?
Sample data:
age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269
There is an elegant answer to this in CrossValidated.
But briefly,
require(emmeans)
data <- data.frame(age = factor(c(1,1,1,2,2,2,3,3,3)),
height = c(56,60,64,56,60,64,74,75,82),
weight = c(140,155,142,117,125,133,245,241,269))
model <- lm(weight ~ height*age, data)
anova(model) #check the results
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
height 1 25392.3 25392.3 481.5984 0.0002071 ***
age 2 2707.4 1353.7 25.6743 0.0129688 *
height:age 2 169.0 84.5 1.6027 0.3361518
Residuals 3 158.2 52.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
slopes <- emtrends(model, 'age', var = 'height') #gets each slope
slopes
age height.trend SE df lower.CL upper.CL
1 0.25 1.28 3 -3.84 4.34
2 2.00 1.28 3 -2.09 6.09
3 3.37 1.18 3 -0.38 7.12
Confidence level used: 0.95
pairs(slopes) #gets their comparisons two by two
contrast estimate SE df t.ratio p.value
1 - 2 -1.75 1.82 3 -0.964 0.6441
1 - 3 -3.12 1.74 3 -1.790 0.3125
2 - 3 -1.37 1.74 3 -0.785 0.7363
P value adjustment: tukey method for comparing a family of 3 estimates
To determine whether the regression coefficients "differ across three age groups" we can use anova function in R. For example, using the data in the question and shown reproducibly in the note at the end:
fm1 <- lm(weight ~ height, DF)
fm3 <- lm(weight ~ age/(height - 1), DF)
giving the following which is significant at the 2.7% level so we would conclude that there are differences in the regression coefficients of the groups if we were using a 5% cutoff but not if we were using a 1% cutoff. The age/(height - 1) in the formula for fm3 means that height is nested in age and the overall intercept is omitted. Thus the model estimates separate intercepts and slopes for each age group. This is equivalent to age + age:height - 1.
> anova(fm1, fm3)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ age/(height - 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 2991.57
2 3 149.01 4 2842.6 14.307 0.02696 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note 1: Above fm3 has 6 coefficients, an intercept and slope for each group. If you want 4 coefficients, a common intercept and separate slopes, then use
lm(weight ~ age:height, DF)
Note 2: We can also compare a model in which subsets of levels are the same. For example, we can compare a model in which ages 1 and 2 are the same to models in which they are all the same (fm1) and all different (fm3):
fm2 <- lm(weight ~ age/(height - 1), transform(DF, age = factor(c(1, 1, 3)[age])))
anova(fm1, fm2, fm3)
If you do a large number of tests you can get significance on some just by chance so you will want to lower the cutoff for p values.
Note 3: There are some notes on lm formulas here: https://sites.google.com/site/r4naturalresources/r-topics/fitting-models/formulas
Note 4: We used this as the input:
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)