Means of different groups in ANCOVA with 2 factors in R

I am currently studying the influence of different traits on the shell volume of a snail.
I have a data frame where each row represents a given individual, with several columns for its attributes (length, shell volume, sex, infection status).
I fitted the ANCOVA: mod <- aov(log(volume) ~ infection * sex * log(length)).
I got this:
                     Df Sum Sq Mean Sq F value Pr(>F)
inf                   1  4.896   4.896 258.126 <2e-16 ***
sex                   1  3.653   3.653 192.564 <2e-16 ***
log(length)           1 14.556  14.556 767.335 <2e-16 ***
inf:sex               1  0.028   0.028   1.472  0.227
inf:log(length)       1  0.020   0.020   1.064  0.304
sex:log(length)       1  0.001   0.001   0.076  0.783
inf:sex:log(length)   1  0.010   0.010   0.522  0.471
Residuals           174  3.301   0.019
So there are significant effects of infection, sex and length, but no significant interaction terms.
Since there are no interactions, I would like to know, for a given sex, whether the intercept of log(volume) = f(log(length)) is larger for infected or for uninfected individuals.
I tried to use summary.lm(mod), which gave me this:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)              -0.42806    0.15429  -2.774  0.00613 **
infmic                   -0.54963    0.40895  -1.344  0.18070
sexM                     -0.11542    0.35508  -0.325  0.74554
log(length)               2.41915    0.11144  21.709  < 2e-16 ***
infmic:sexM               0.52459    0.63956   0.820  0.41320
infmic:log(length)        0.43215    0.33717   1.282  0.20166
sexM:log(length)          0.04207    0.28113   0.150  0.88122
infmic:sexM:log(length)  -0.38222    0.52920  -0.722  0.47110
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1377 on 174 degrees of freedom
Multiple R-squared: 0.8753, Adjusted R-squared: 0.8703
F-statistic: 174.5 on 7 and 174 DF, p-value: < 2.2e-16
But I have trouble interpreting the results, and I still don't see how to conclude.
I also have a few other questions:
Why aren't sex and infection significant in the lm output?
I know it is not significant here, but how should I interpret the lines about the interaction terms?
What I think is that infmic:sexM represents the change in the slope of log(volume) = f(log(length)) for infected males compared with uninfected females. Then, would infmic:log(length) be the change of slope between infected females and uninfected females, and sexM:log(length) the change between uninfected males and uninfected females? Is this true?
And what does the triple interaction term represent?
Thanks a lot!

EDIT: I found part of the answer.
Let's split the data into 4 groups (F-NI, F-I, M-NI, M-I) and look for the equation of the regression line log(volume) = f(log(length)) in each group. Here the coefficients are the ones given by summary.lm(mod).
The equations are:
For non-infected females: intercept = (Intercept), slope = log(length)
For infected females: intercept = (Intercept) + infmic, slope = log(length) + infmic:log(length)
For non-infected males: intercept = (Intercept) + sexM, slope = log(length) + sexM:log(length)
For infected males: intercept = (Intercept) + infmic + sexM + infmic:sexM, slope = log(length) + infmic:log(length) + sexM:log(length) + infmic:sexM:log(length)
In each case the fitted line is log(volume) = intercept + slope × log(length), where each coefficient name stands for its estimate in the summary.lm(mod) output.
It might be obvious for some of you, but I really didn't understand what each coefficient represented at first, so I prefer to put it here!
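To make this concrete, here is a minimal sketch of how those group-specific intercepts and slopes can be assembled from the fitted coefficients (assuming the model object is mod from above):

b <- coef(mod)  # named vector of the estimates shown by summary.lm(mod)

# non-infected females (the reference level)
int_F_NI   <- b["(Intercept)"]
slope_F_NI <- b["log(length)"]

# infected males: add every term involving infmic and/or sexM
int_M_I   <- b["(Intercept)"] + b["infmic"] + b["sexM"] + b["infmic:sexM"]
slope_M_I <- b["log(length)"] + b["infmic:log(length)"] +
             b["sexM:log(length)"] + b["infmic:sexM:log(length)"]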
Alice
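Since none of the interaction terms are significant, a more direct route to the headline question is to refit the additive model and compare adjusted means between infection groups within each sex. A minimal sketch using emmeans, assuming the data frame is called snails (it is not named in the original post):

library(emmeans)
# additive model: common slope, one intercept shift per factor level
mod2 <- aov(log(volume) ~ infection + sex + log(length), data = snails)
emmeans(mod2, pairwise ~ infection | sex)  # infected vs uninfected, within each sex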

Related

Adjusted mean with emmeans for a factor with 3 levels

I know how to get the adjusted mean via emmeans when a variable has 2 levels, as with sex:
sex == 1: men, sex == 2: women --> 2 levels.
The associated model with the subsequent adjusted mean (EMM) calculation is:
mean_MF <- lm(LZ~age + SES_3 + sex, data = MF)
summary(mean_MF)
emmeans(mean_MF, ~ sex)
and the output looks like this:
> emmeans(mean_MF, ~ sex)
 sex emmean     SE    df lower.CL upper.CL
   1   7.05 0.0193 20894     7.02     7.09
   2   6.96 0.0187 20894     6.93     7.00
Results are averaged over the levels of: belastet_SZ, belastet_SNZ, guteSeiten_SZ, guteSeiten_SNZ, SES_3
Confidence level used: 0.95
But if I want to calculate the adjusted mean of a variable with 3 levels, I only get a single adjusted mean at one averaged value, instead of one for each of the 3 levels.
e.g. for age (Alter), I have 3 categories, coded as follows:
18-30 years: 1
31-40 years: 2
41-51 years: 3
What else do I need to add to the emmeans call so that I get the adjusted means for all three levels?
F_Alter <- lm(LZ~ SES_3 + Alter, data = Frauen)
summary(F_Alter)
emmeans(F_Alter, ~ Alter)
The summary of F_Alter looks as follows:
> summary(F_Alter)

Call:
lm(formula = LZ ~ SES_3 + Alterfactor, data = Frauen)

Residuals:
    Min      1Q  Median      3Q     Max
-7.2303 -1.1162  0.1951  1.1220  3.8838

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.44956    0.05653 131.777  < 2e-16 ***
SES_3mittel  -0.42539    0.04076 -10.437  < 2e-16 ***
SES_3niedrig -1.11411    0.05115 -21.781  < 2e-16 ***
Alterfactor  -0.07309    0.02080  -3.513 0.000444 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.889 on 14481 degrees of freedom
(5769 observations deleted due to missingness)
Multiple R-squared: 0.03287, Adjusted R-squared: 0.03267
F-statistic: 164 on 3 and 14481 DF, p-value: < 2.2e-16
In the following output I only get a single row at the averaged value 1.93, instead of my 3 levels and their specific EMMs.
emmeans(F_Alter, ~ Alter)
 Alter emmean     SE    df lower.CL upper.CL
  1.93    6.8 0.0179 14481     6.76     6.83
Results are averaged over the levels of: SES_3
Confidence level used: 0.95
What can I change in the emmeans call to get the output for my 3 age levels (1, 2, 3)?
The predictor Alter in the original question was not coded as a factor, so it was being treated as a continuous numeric variable both in the model estimation and by emmeans.
The problem is fixed by creating a new factor variable,
Frauen$Alterfactor <- as.factor(Frauen$Alter)
and then using this new variable as the predictor in the model.
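A minimal sketch of the corrected workflow (names as in the question; the age-range labels are only illustrative):

library(emmeans)

# recode the numeric 1/2/3 age variable as a factor with readable labels
Frauen$Alterfactor <- factor(Frauen$Alter, levels = c(1, 2, 3),
                             labels = c("18-30", "31-40", "41-51"))

F_Alter <- lm(LZ ~ SES_3 + Alterfactor, data = Frauen)
emmeans(F_Alter, ~ Alterfactor)  # now one EMM per age group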

Is there any way to split up interaction effects in a linear model?

I have a 2x2 factorial design: control vs enriched, and strain1 vs strain2. I wanted to make a linear model, which I did as follows:
anova(lmer(length ~ Strain + Insect + Strain:Insect + BW_final + (1|Pen), data = mydata))
Here length is one of the dependent variables I want to analyse, Strain and Insect are the treatments, Strain:Insect is the interaction effect, BW_final is a covariate, and Pen is a random effect.
As output I get this:
              Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Strain         3.274   3.274     1    65  0.1215 0.7285
Insect        14.452  14.452     1    65  0.5365 0.4665
BW_final      45.143  45.143     1    65  1.6757 0.2001
Strain:Insect 52.813  52.813     1    65  1.9604 0.1662
As you can see, I only get one interaction term, Strain:Insect. However, I'd like to see the four combinations: Strain1:Control, Strain1:Enriched, Strain2:Control, Strain2:Enriched.
Is there any way to do this in R? (A sketch follows the summary output below.)
Using summary instead of anova I get:
> summary(linearmer)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
Formula: length ~ Strain + Insect + Strain:Insect + BW_final + (1 | Pen)
   Data: mydata_young

REML criterion at convergence: 424.2

Scaled residuals:
     Min       1Q   Median       3Q      Max
-1.95735 -0.52107  0.07014  0.43928  2.13383

Random effects:
 Groups   Name        Variance Std.Dev.
 Pen      (Intercept)  0.00    0.00
 Residual             26.94    5.19
Number of obs: 70, groups: Pen, 27

Fixed effects:
                    Estimate Std. Error         df t value Pr(>|t|)
(Intercept)       101.646129   7.530496  65.000000  13.498   <2e-16 ***
StrainRoss          0.648688   1.860745  65.000000   0.349    0.729
Insect              0.822454   2.062696  65.000000   0.399    0.691
BW_final           -0.005188   0.004008  65.000000  -1.294    0.200
StrainRoss:Insect  -3.608430   2.577182  65.000000  -1.400    0.166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) StrnRs Insect BW_fnl
StrainRoss   0.253
Insect      -0.275  0.375
BW_final    -0.985 -0.378  0.169
StrnRss:Ins  0.071 -0.625 -0.775  0.016

convergence code: 0
boundary (singular) fit: see ?isSingular
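A minimal sketch of one way to get estimates for the four Strain x Insect cells (not from the original thread; it assumes the fitted model object is linearmer and that both Strain and Insect are coded as factors):

library(emmeans)
# estimated marginal means for each Strain x Insect cell,
# plus all pairwise comparisons between the cells
emmeans(linearmer, pairwise ~ Strain * Insect)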

Test if intercepts in an ANCOVA model are significantly different in R

I ran a model explaining the weight of some plant as a function of time, incorporating a treatment effect:
mod <- lm(weight ~ time + treatment, data = df)
The model summary is:
Call:
lm(formula = weight ~ time + treatment, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-21.952  -7.674   0.770   6.851  21.514

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.5790     3.2897 -11.423  < 2e-16 ***
time          4.7478     0.2541  18.688  < 2e-16 ***
treatmentB    8.2000     2.4545   3.341  0.00113 **
treatmentC    5.4633     2.4545   2.226  0.02797 *
treatmentD   20.3533     2.4545   8.292 2.36e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.506 on 115 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.7788
F-statistic: 105.7 on 4 and 115 DF, p-value: < 2.2e-16
The ANOVA table:

Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)
time        1 31558.1 31558.1 349.227 < 2.2e-16 ***
treatment   3  6661.9  2220.6  24.574 2.328e-12 ***
Residuals 115 10392.0    90.4
I want to test the H0 that intercept1 = intercept2 = intercept3 = intercept4. Is this done simply by interpreting the t-value and p-value of the intercept term? I guess not, because that is the baseline (treatment A in this case). I'm a bit puzzled by this, as most sources I looked up pay little attention to differences in intercepts.
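Since all four groups share a common slope in this model, the H0 of equal intercepts is the same as the H0 that all treatment coefficients are zero, so a nested-model F-test answers it. A minimal sketch, using the model and data names from the question:

# common-intercept model vs one-intercept-per-treatment model
mod0 <- lm(weight ~ time, data = df)
mod  <- lm(weight ~ time + treatment, data = df)
anova(mod0, mod)  # F-test of H0: all four intercepts are equal

This is the same test as the treatment line of the ANOVA table above (F = 24.574 on 3 and 115 df).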

Difference between lm(y~x1/x2) and aov(y~x1+Error(x2))

I have trouble understanding the difference between these two notations.
According to the R intro, y ~ x1/x2 indicates that x2 is nested within x1. If x1 is a factor and x2 a continuous variable, is lm(y ~ x1/x2) a correct representation of a nested ANCOVA?
What is confusing is that some online help topics suggest using aov(y ~ x1 + Error(x2)) to represent a nested ANOVA, yet these two codes give completely different results.
For example:
x2 <- rnorm(1000, 2)
x1 <- rep(c("A", "B"), each = 500)
y  <- x2 * 3 + rnorm(1000)
Under this scenario I would expect x2 to be significant and x1 to be non-significant.
summary(aov(y ~ x1 + Error(x2)))

Error: x2
   Df Sum Sq Mean Sq
x1  1   9262    9262

Error: Within
           Df Sum Sq Mean Sq F value Pr(>F)
x1          1    0.0  0.0003       0  0.985
Residuals 997  967.9  0.9708
aov() works as expected. However, with lm():
summary(lm(y ~ x1/x2))

Call:
lm(formula = y ~ x1/x2)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4468 -0.6352  0.0092  0.6526  2.8294

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.08727    0.09566   0.912   0.3618
x1B         -0.24501    0.13715  -1.786   0.0743 .
x1A:x2       2.94012    0.04362  67.401   <2e-16 ***
x1B:x2       3.06272    0.04326  70.806   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9838 on 996 degrees of freedom
Multiple R-squared: 0.9058, Adjusted R-squared: 0.9055
F-statistic: 3191 on 3 and 996 DF, p-value: < 2.2e-16
Here x1 is marginally significant, and in many iterations it is highly significant. How can these results be so different?
What am I missing? Aren't those two formulas supposed to represent the same thing, or am I misunderstanding something about the underlying statistics?
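One point of reference (not from the original thread): in R's formula language, x1/x2 is shorthand for x1 + x1:x2, i.e. a main effect of the factor plus a separate x2 slope within each level of x1, whereas Error(x2) declares x2 as an error stratum. A quick sketch showing the expansion:

# x1/x2 expands to x1 + x1:x2, so these two calls fit the same model
fit1 <- lm(y ~ x1/x2)
fit2 <- lm(y ~ x1 + x1:x2)
all.equal(fitted(fit1), fitted(fit2))  # TRUE: identical fitted values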

(Quasi)-Complete separation according to a random effect in logistic GLMM

I am getting convergence warnings and a very large group variance while fitting a binary logistic GLMM with lme4. I am wondering whether this could be related to (quasi-)complete separation with respect to the random effect, i.e. the fact that many individuals (the random-effect/grouping variable) have only 0s in the dependent variable, resulting in low within-individual variation. If this could be a problem, are there alternative modelling strategies to deal with such cases?
More precisely, I am studying the chance that an individual is observed in a given status (having children while living with their parents) at a given age. In other words, I have several observations for each individual (typically 50) specifying whether the individual was observed in this state at a given age. Here is an example:
id age status
1 21 0
1 22 0
1 23 0
1 24 1
1 25 0
1 26 1
1 27 0
...
The chance of observing a status of 1 is quite low (between 1% and 5% depending on the case) and I have a lot of data (150,000 observations on 3,000 individuals).
The model was fitted using glmer, specifying id (individual) as a random effect and including some explanatory factors (age category, parental education and the period in which the status was observed). I get the following convergence warnings (except when using nAGQ = 0) and a very large group variance (here more than 25):
"Model failed to converge with max|grad| = 2.21808 (tol = 0.001, component 2)"
"Model is nearly unidentifiable: very large eigenvalue\n - Rescale variables?"
Here is the resulting model:
     AIC      BIC   logLik deviance df.resid
  9625.0   9724.3  -4802.5   9605.0   151215

Scaled residuals:
   Min     1Q Median     3Q    Max
-2.529 -0.003 -0.002 -0.001 47.081

Random effects:
 Groups Name        Variance Std.Dev.
 id     (Intercept) 28.94    5.38
Number of obs: 151225, groups: id, 3067

Fixed effects:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -10.603822   0.496392 -21.362  < 2e-16 ***
agecat[18,21) -0.413018   0.075119  -5.498 3.84e-08 ***
agecat[21,24) -1.460205   0.095315 -15.320  < 2e-16 ***
agecat[24,27) -2.844713   0.137484 -20.691  < 2e-16 ***
agecat[27,30) -3.837227   0.199644 -19.220  < 2e-16 ***
parent_educ   -0.007390   0.003609  -2.048   0.0406 *
period_cat80s  0.126521   0.113044   1.119   0.2630
period_cat90s -0.105139   0.176732  -0.595   0.5519
period_cat00s -0.507052   0.263580  -1.924   0.0544 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) a[18,2 a[21,2 a[24,2 a[27,3 prnt_d pr_80s pr_90s
agct[18,21) -0.038
agct[21,24) -0.006  0.521
agct[24,27)  0.006  0.412  0.475
agct[27,30)  0.011  0.325  0.393  0.378
parent_educ -0.557  0.059  0.087  0.084  0.078
perd_ct80s  -0.075 -0.258 -0.372 -0.380 -0.352 -0.104
perd_ct90s  -0.048 -0.302 -0.463 -0.471 -0.448 -0.151  0.732
perd_ct00s  -0.019 -0.293 -0.459 -0.434 -0.404 -0.138  0.559  0.739
You could try one of a few different optimizers, available through the nloptr and optimx packages. There's even an allFit function, available through the afex package, that tries them all for you (see the allFit help file), e.g.:
all_mod <- allFit(exist_model)
That will let you check how stable your estimates are; the help file also points to more resources on the gradient topic.
If you're worried about complete separation, see Ben Bolker's answer suggesting the bglmer function from the blme package. It operates much like glmer, but allows you to add priors to the model specification.
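A minimal sketch of both suggestions (it assumes the fitted model object is exist_model and a data frame dat with the variables shown in the output; the prior choice is only illustrative):

library(lme4)  # recent lme4 versions also ship allFit directly

# refit the same model with every available optimizer
all_mod <- allFit(exist_model)
summary(all_mod)$fixef  # compare fixed effects across optimizers

# if separation is the issue, bglmer adds a weakly informative
# prior that keeps the estimates finite
library(blme)
bmod <- bglmer(status ~ agecat + parent_educ + period_cat + (1 | id),
               data = dat, family = binomial,
               fixef.prior = normal)  # normal prior on the fixed effects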
