How to get value of group = 0 in linear mixed model - r

I have a very simple stat question probably.
So, I am fitting linear mixed models like this:
lme(dependent ~ Group + Sex + Age + npgs, data=boookclub, random = ~ 1| subject)
Group is a factor variable with levels = 0, 1 , 2 , 3
The dependent variables are continuous and standardized (mean 0), and the others are covariates: Sex is a factor with Male/Female levels, Age is numerical, and npgs is numerical, continuous and also standardized.
When I get the table with beta, standard error, t and p values, I get this:
Value Std.Error DF t-value p-value
(Intercept) -0.04550502 0.02933385 187 -1.551280 0.0025
Group1 0.04219801 0.03536929 181 1.193069 0.2344
Group2 0.03350827 0.03705896 181 0.904188 0.3671
Group3 0.00192119 0.03012654 181 0.063771 0.9492
SexMale 0.03866387 0.05012901 181 0.771287 0.4415
Age -0.00011675 0.00148684 181 -0.078520 0.9375
npgs 0.15308844 0.01637163 181 9.350835 0.0000
SexMale:Age 0.00492966 0.00276117 181 1.785352 0.0759
My problem is: how do I get the beta for Group0? In this case the intercept corresponds to Group0, but also to the average of npgs, since npgs is standardized. How can I check whether Group0 is significantly associated with the dependent variable? I'd like to see the effect of all Group levels.
Thanks

The easiest way to do what you want may be with the emmeans package, but you may also have some conceptual issues. Technical details first, then conceptual:
Technical
Fitting an example (this isn't necessarily statistically sensible, but I wanted an example with a categorical fixed effect)
library(nlme)
m1 <- lme(Yield~Variety, random = ~1|Block, data=Alfalfa)
As with your example, the effects are "intercept" (= mean of the baseline group, which is the "Cossack" variety in this case [by default, the alphabetically first group]), "Ladak" (the difference between the Ladak and Cossack means) and "Ranger" (similarly). (As @Ben hints in the comments above, R automatically generates dummies for [most of] the levels of the categorical variables [factors] in your model.)
coef(summary(m1))
## Value Std.Error DF t-value p-value
## (Intercept) 1.57166667 0.11665326 64 13.4729767 2.373343e-20
## VarietyLadak 0.09458333 0.07900687 64 1.1971532 2.356624e-01
## VarietyRanger -0.01916667 0.07900687 64 -0.2425949 8.090950e-01
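If you want to see the dummy variables that R generates behind the scenes, the following is a quick way to inspect them (a sketch using the same example data):
contrasts(Alfalfa$Variety)                     ## treatment-contrast coding; the first level (Cossack here) is the baseline
head(model.matrix(~ Variety, data = Alfalfa))  ## the dummy columns that actually enter the fixed-effects design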
The emmeans package is a convenient way to see predicted values for each group without recoding.
library(emmeans)
emmeans(m1, spec = ~Variety)
## Variety emmean SE df lower.CL upper.CL
## Cossack 1.57 0.117 5 1.27 1.87
## Ladak 1.67 0.117 5 1.37 1.97
## Ranger 1.55 0.117 5 1.25 1.85
Conceptual
You can't "check if Group0 is significantly associated with the dependent [response] variable". You can only check whether the response variables differs significantly between two groups, or whether it differs significantly among all groups (e.g. the results of anova()). You have to pick a baseline. (If you insist, you can test all pairwise comparisons among groups; emmeans can help with this too.) If you "remove the intercept" (by fitting Variety ~Yield-1, or by looking at the results that emmeans produces) then the difference you are quantifying is the difference between the mean of a particular group and zero. This is usually not a meaningful question; in the example here, for instance, this would be testing whether a wheat variety gave a yield that was significantly greater than zero — probably not very interesting.
On the other hand, if you are just interested in estimating the expected value in each group (conditioning on the baseline values of the other variables in the model), along with the standard errors/CIs, then the answers you get from emmeans are perfectly sensible.
There's a related question here that explains why you get an NA value if you manually create dummies for every level of your factor ...

Related

Failed to contrast intercepts through emmeans in R

I would like to test the symmetry of an observer's response to contrast stimuli of different polarity, positive (white) and negative (black). I took reaction time (RT) as the dependent variable, across four different contrast levels. It is known that response time follows a Piéron curve whose asymptotes are located (1) at the observer's threshold (Inf) and (2) at a base RT somewhere between 250 and 450 msec.
This knowledge allows us to linearize the relationship by transforming the independent variable (effective contrast, EC) as 1/EC^2 (tEC), so the equation linking RT to EC becomes:
RT = m * tEC + RT0
To test the symmetry I established the criteria: same slope and same intercept in the two polarities implies symmetry.
To obtain the coefficients I fitted a linear model with an interaction (coding Polarity, Positive or Negative, through a dummy variable). The output of lm is clear to me, but some colleagues prefer something more similar to an ANOVA output, so I decided to use emmeans to make the contrasts. The slopes are fine, but the problem starts when computing the intercepts: the intercepts computed by lm are very different from the output of emmeans, and the conclusions are also different. In what follows I reproduce the example.
The question is twofold: is it possible to use emmeans to solve my problem? If not, is it possible to make the contrasts with another package (and which one)?
Data
   RT1000   EC      tEC  Polarity
 596.3564  -25 0.001600  Negative
 648.2471  -20 0.002500  Negative
 770.7602  -17 0.003460  Negative
 831.2971  -15 0.004444  Negative
1311.3331   15 0.004444  Positive
1173.8942   17 0.003460  Positive
1113.7240   20 0.002500  Positive
 869.3635   25 0.001600  Positive
Code
# Model
model <- lm(RT1000 ~ tEC * Polarity, data = Data)
# emmeans
library(emmeans)
# Slopes
m.slopes <- lstrends(model, "Polarity", var="tEC")
# Intercepts
m.intercept <- lsmeans(model, "Polarity")
# Contrasts
pairs(m.slopes)
pairs(m.intercept)
Outputs
Model
term                   estimate  std.error statistic p.value
(Intercept)             449.948     66.829     6.733   0.003
tEC                   87205.179  20992.976     4.154   0.014
PolarityPositive        230.946     94.511     2.444   0.071
tEC:PolarityPositive  58133.172  29688.551     1.958   0.122
Slopes (it is all right)
Polarity  tEC.trend       SE df  lower.CL  upper.CL
Negative   87205.18 20992.98  4  28919.33  145491.0
Positive  145338.35 20992.98  4  87052.51  203624.2

contrast             estimate       SE df   t.ratio p.value
Negative - Positive -58133.17 29688.55  4 -1.958101 0.12182
Intercepts (problem)
Polarity     lsmean      SE df  lower.CL  upper.CL
Negative   711.6652 22.2867  4  649.7874   773.543
Positive  1117.0787 22.2867  4 1055.2009  1178.957

contrast             estimate       SE df   t.ratio  p.value
Negative - Positive -405.4135 31.51816  4 -12.86285 0.000211
The intercepts computed through emmeans differ from the ones computed by lm. I think the problem is that the model is not defined for EC = 0, but I'm not sure.
What you are calling the intercepts are not; they are the model predictions at the mean value of tEC. If you want the intercepts, use instead:
m.intercept <- lsmeans(model, "Polarity", at = list(tEC = 0))
You can tell what reference levels are being used via
ref_grid(model) # or str(m.intercept)
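As a quick sanity check (a sketch, using the numbers from the lm table above), the Polarity contrast at tEC = 0 should reproduce the PolarityPositive coefficient from lm, with the sign flipped because the contrast is Negative - Positive:
pairs(m.intercept)   ## estimate should be about -230.9, i.e. -coef(model)[["PolarityPositive"]]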
Please note that the model fitted here consists of two lines with different slopes; hence the difference between the predictions changes depending on the value of tEC. Thus, I would strongly recommend against testing the comparison of the intercepts; those are predictions at a tEC value that, as you say, can't even occur. Instead, try to be less of a mathematician and do the comparisons at a few representative values of tEC, e.g.,
LSMs <- lsmeans(model, "Polarity", at = list(tEC = c(0.001, 0.003, 0.005)))
pairs(LSMs, by = "tEC")
You can also easily visualize the fitted lines:
emmip(model, Polarity ~ tEC, cov.reduce = range)

Multivariable regression interaction term with categorical variables

I am kind of new to R and am working on a glm model. I wanted to look at the interaction effect of BMI group and patient group (4 groups) on mortality (binary) in a subgroup analysis. I have the following code:
model <- glm(death~patient.group*bmi.group, data = data, family = "binomial")
summary(model)
and I get the following:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4798903 0.0361911 -96.153 < 2e-16 ***
patient.group2 0.0067614 0.0507124 0.133 0.894
patient.group3 0.0142658 0.0503444 0.283 0.777
patient.group4 0.0212416 0.0497523 0.427 0.669
bmi.group2 0.1009282 0.0478828 2.108 0.035 *
bmi.group3 0.2397047 0.0552043 4.342 1.41e-05 ***
patient.group2:bmi.group2 -0.0488768 0.0676473 -0.723 0.470
patient.group3:bmi.group2 -0.0461319 0.0672853 -0.686 0.493
patient.group4:bmi.group2 -0.1014986 0.0672675 -1.509 0.131
patient.group2:bmi.group3 -0.0806240 0.0791977 -1.018 0.309
patient.group3:bmi.group3 -0.0008951 0.0785683 -0.011 0.991
patient.group4:bmi.group3 -0.0546519 0.0795683 -0.687 0.492
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So as displayed I will have a p-value for each of the patient.group:bmi.group. My question is, is there a way I can get a single p-value for patient.group:bmi.group instead of one for each subgroup? I have tried to look for answers online but I still could not find the answer :(
Many thanks in advance.
It depends on whether you regard your patient and BMI groups as factors or continuous covariates. If they are covariates, @jay.sf's suggestion is appropriate. It fits a single degree of freedom term for the interaction between the linear effect of patient group and the linear effect of BMI group.
But this depends on both the ordering and definition of the groups. It assumes, for example, that the "difference" between patient groups 1 and 2 is the same as that between patient groups 2 and 3 and so on. Is the ordering of patient groups such that, in some way, group 1 < group 2 < group 3 < group 4? Similarly for BMI. This model would also assume that a change of 1 unit on the patient scale was "the same" as a change of one unit on the BMI scale. I don't know if these are reasonable assumptions.
It would be more usual to consider both patient group and BMI group as factors. This assumes no ordering in groups, nor that the difference between any two groups was equal to that between any other two. In this case, jay.sf's suggestion would give a misleading answer.
To illustrate my point...
First, generate some artificial data, as you haven't provided any:
library(dplyr)   # tibble(), mutate(), select(), %>%
library(tidyr)   # expand()
library(purrr)   # rbernoulli()
data <- tibble() %>%
  expand(patient.group = 1:4, bmi.group = 1:3, rep = 1:5) %>%
  mutate(
    z = -0.25 * patient.group + 0.75 * bmi.group,
    death = rbernoulli(nrow(.), exp(z) / exp(1 + z))
  ) %>%
  select(-z)
Fit a simple continuous covariate model with interaction, as per jay.sf's suggestion:
covariateModel <- glm(death~patient.group * bmi.group, data = data, family = "binomial")
summary(covariateModel)
Giving, in part
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6962 1.8207 -1.481 0.139
patient.group 0.7407 0.6472 1.144 0.252
bmi.group 1.2697 0.8340 1.523 0.128
patient.group:bmi.group -0.3807 0.2984 -1.276 0.202
Here, the p value for the patient.group:bmi.group interaction is a Wald test based on a single degree of freedom z test.
A slightly more complicated approach is necessary to fit the factor model with interaction and obtain a test for the "overall" interaction effect.
mainEffectModel <- glm(death~as.factor(patient.group) + as.factor(bmi.group), data = data, family = "binomial")
interactionModel <- glm(death~as.factor(patient.group) * as.factor(bmi.group), data = data, family = "binomial")
anova(mainEffectModel, interactionModel, test="Chisq")
Giving
Analysis of Deviance Table
Model 1: death ~ as.factor(patient.group) + as.factor(bmi.group)
Model 2: death ~ as.factor(patient.group) * as.factor(bmi.group)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 54 81.159
2 48 70.579 6 10.58 0.1023
Here, the change in deviance is a likelihood-ratio test statistic, distributed as chi-squared on (4-1) x (3-1) = 6 degrees of freedom.
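An equivalent one-liner on the same factor interaction model is drop1(), which performs the same likelihood-ratio test by dropping the interaction (a sketch):
drop1(interactionModel, test = "Chisq")   # 6-df chi-squared test for dropping the interaction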
The covariate and factor approaches give similar answers with my particular dataset, but they may not always do so. Both are statistically correct, but which one is most appropriate depends on your particular situation. We don't have enough information to comment.
This excellent post provides more context.

wilcoxon test using data stratification

I have a really basic problem. I have the concentrations of one chemical stored in one column and the gender of the study participant in a second column.
What is the code to do a Wilcoxon test to see whether there is a difference between the concentrations found in boys and those found in girls? Some explanation of the code would also be useful, to help me understand how it works. Thanks!
I got this code for an ANOVA test to work, which is also fine. Can anyone tell me whether it does what I need?
av <- aov(UC_MEHP ~ BQF05C1, data=data)
av
summary(av)
the output looks like this
> av <- aov(UC_MEHP ~ BQF05C1, data=data)
> av
Call:
aov(formula = UC_MEHP ~ BQF05C1, data = data)
Terms:
BQF05C1 Residuals
Sum of Squares 0.3445 2917.4564
Deg. of Freedom 1 151
Residual standard error: 4.395555
Estimated effects may be unbalanced
21 observations deleted due to missingness
> summary(av)
Df Sum Sq Mean Sq F value Pr(>F)
BQF05C1 1 0.3 0.344 0.018 0.894
Residuals 151 2917.5 19.321
21 observations deleted due to missingness
I'm sorry, I know it's not a very advanced question...
From ?wilcox.test:
## S3 method for class 'formula'
wilcox.test(formula, data, subset, na.action, ...)
...
formula: a formula of the form ‘lhs ~ rhs’ where ‘lhs’ is a numeric
variable giving the data values and ‘rhs’ a factor with two
levels giving the corresponding groups.
So wilcox.test(UC_MEHP ~ BQF05C1, data=data) should work (assuming that BQF05C1 is the column specifying gender and UC_MEHP is the concentration).
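For instance, a minimal self-contained sketch (the column names come from the question; the values below are invented purely for illustration):
set.seed(1)
data <- data.frame(
  UC_MEHP = c(rnorm(60, mean = 5), rnorm(60, mean = 5.5)),  # invented concentrations
  BQF05C1 = factor(rep(c("boy", "girl"), each = 60))        # two-level gender factor
)
wilcox.test(UC_MEHP ~ BQF05C1, data = data)   # two-sided Wilcoxon rank-sum test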

Within and between factors in regression models in R

I'm trying to run an rmANOVA and a corresponding regression model. In the experiment, participants completed a questionnaire evaluating how much of trait X they have (SCORE). Then they performed a task in which each participant was exposed to three conditions (COND: nSCM, SCM, SC). Their brain responses were measured (ERP).
This is how it looks like:
> head(df)
code SEX AGE SCORE COND ERP
1 AA1407 male 29 14 nSCM -3.0348373
2 AN0312 male 26 13 nSCM -1.8799240
3 BR1410 male 23 30 nSCM 0.4284033
4 EZ2404 male 23 23 nSCM -0.7615117
5 HA1012 female 27 22 nSCM -2.9301698
6 HS3004 male 30 16 nSCM -0.5468492
Since I am a bit confused about how to use different types of variables in R, maybe someone could also reassure me about the following:
> sapply(df,class)
code SEX AGE SCORE COND ERP
"factor" "factor" "numeric" "numeric" "factor" "numeric"
Based on the experimental design, the ANOVA design has one between-subject IV: SCORE, one within-subject IV: COND and the DV is ERP (right?).
This is the model I used and the summary:
> anERP <- aov(ERP ~ COND*SCORE, data = df)
> summary(anERP)
Df Sum Sq Mean Sq F value Pr(>F)
COND 2 0.21 0.105 0.027 0.9736
SCORE 1 16.87 16.868 4.297 0.0419 *
COND:SCORE 2 0.58 0.289 0.074 0.9291
Residuals 69 270.85 3.925
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, IF this is right (please let me know if anything doesn't seem right), I should also find an effect of SCORE when I build a regression model, right? Also, I'm not sure how to interpret this effect, since SCORE is an interval variable (range 6-35). I would appreciate a little help here.
Now I'm very confused about how this model should look like for regression. I started with simple lm model with SCORE and COND as fixed effects:
> lmERP <- lm(ERP ~ SCORE*COND, data = df)
> summary(lmERP)
Call:
lm(formula = ERP ~ SCORE * COND, data = df)
Residuals:
Min 1Q Median 3Q Max
-5.2554 -1.0916 0.1975 1.4582 3.3097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.04269 1.06193 -2.865 0.00552 **
SCORE 0.06458 0.05229 1.235 0.22108
CONDSCM -0.08141 1.50180 -0.054 0.95693
CONDnSCM 0.36646 1.50180 0.244 0.80795
SCORE:CONDSCM 0.01111 0.07396 0.150 0.88104
SCORE:CONDnSCM -0.01707 0.07396 -0.231 0.81814
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.981 on 69 degrees of freedom
Multiple R-squared: 0.0612, Adjusted R-squared: -0.006827
F-statistic: 0.8997 on 5 and 69 DF, p-value: 0.4864
However, here the main effect of SCORE doesn't reach significance. How is that possible? Shouldn't the rmANOVA and the regression show roughly similar results (at least for the main effects)?
I guess I'm not applying the right linear model here, since it doesn't seem to recognise there are both within and between subject factors in the design.
I have read hundreds of webpages, tutorials and forums and I'm still completely confused about these models. Thank you in advance for any piece of advice!
Repeated-measures or mixed-model designs can be very confusing to specify using R's base aov function. In the code you have written, for example, aov will treat all the specified factors as independent (i.e., between-subject). I highly recommend using a library that makes it easier to specify these types of designs.
The ez library contains ezANOVA, which makes these tests simple to perform, provided that all your cases are complete (all factors are fully crossed, with no missing data). Assuming that your code column uniquely identifies each subject and you wanted to include all factors from your data set, the test would look something like this:
my.aov <- ezANOVA(data = df, dv = ERP, wid = code, between = .(SEX, AGE, SCORE), within = COND)
It is also possible to implement these designs with the lme4 package (the ez package itself offers ezMixed as a wrapper around lme4). While lme4 allows more flexible model specifications and can tolerate incomplete data, its syntax is more difficult. Bodo Winter's tutorial on lme4 is a good start, if you want to go really deep.
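For reference, a minimal sketch of the mixed-model route (assuming code identifies subjects, as in your head(df) output; lmerTest is optional but adds p-values):
library(lme4)
library(lmerTest)                              # optional: Satterthwaite p-values for the F tests
mmERP <- lmer(ERP ~ COND * SCORE + (1 | code), data = df)
anova(mmERP)                                   # F tests for COND, SCORE and their interaction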
As an aside, there is usually little point in performing both an ANOVA and a linear regression. Unless the two tests are specified in a way that treats the factors differently, the results will be equivalent.
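You can check that equivalence directly: the sequential ANOVA table of the regression fit reproduces the aov() table above, whereas summary(lmERP) tests individual coefficients at the baseline condition, which is a different question and is why SCORE appears non-significant there (a sketch):
anova(lm(ERP ~ COND * SCORE, data = df))   # same sequential (Type I) sums of squares as the aov() fit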

Testing for multiple determinants in cox

I'm doing an analysis where I want to find what are determinants for (cardiovascular) events in my cohort of patients. I want to do a cox analysis (coxph from the survival package). Now I want to find which determinants are independent determinants.
Now the question is: do I simply make one model in which I throw all my prespecified determinants (age, gender, BMI, cholesterol levels etc) and then see what happens? Or do I have to do univariate testing first? And then only add the significant/near significant (e.g. p value <0.10) in the full cox model?
This was the model I am using now:
model1 <- coxph(Surv(time,event==1)~age+gender+ckdepi+smoking+cholesterol+hdl+bpsys+t2dm+bmi, data=data)
And this is my output:
coef exp(coef) se(coef) z p
age 0.04235 1.04326 0.00911 4.65 3.3e-06***
gender -0.36583 0.69362 0.11538 -3.17 0.00152**
ckdepi -0.01078 0.98927 0.00309 -3.49 0.00048***
smoking 0.14560 1.15674 0.03070 4.74 2.1e-06***
chol 0.10312 1.10862 0.03746 2.75 0.00590**
hdl -0.04485 0.95614 0.13231 -0.34 0.73465
sysbp 0.00356 1.00357 0.00225 1.58 0.11392
t2dm 0.39539 1.48496 0.11763 3.36 0.00078***
bmi -0.00898 0.99106 0.01227 -0.73 0.46427
Likelihood ratio test=97.4 on 9 df, p=0
n= 3084, number of events= 525
(2 observations deleted due to missingness)
Also, I THINK I need to test for linearity (e.g. use restricted cubic splines or add quadratic terms) in the fully adjusted model, but I'm not sure. Is this correct?
In that case, my next step would be to test at least ckdepi and bmi since I'm pretty sure they might not be linear.
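If it helps, one way to probe non-linearity within the same model is a penalized spline term from the survival package (a sketch only; pspline() is one alternative to restricted cubic splines, rms::rcs() is another):
library(survival)
# hypothetical refit with spline terms for the two variables suspected of non-linearity
model2 <- coxph(Surv(time, event == 1) ~ age + gender + pspline(ckdepi) + smoking +
                  cholesterol + hdl + bpsys + t2dm + pspline(bmi), data = data)
model2   # the printout splits each spline into a linear and a non-linear component, each with its own test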
