Model simplification (two way ANOVA) - r

I am using ANOVA to analyse results from an experiment to see whether there are any effects of my explanatory variables (Heating and Dungfauna) on my response variable (Biomass). I started by looking at the main effects and interaction:
full.model <- lm(log(Biomass) ~ Heating*Dungfauna, data= df)
anova(full.model)
I understand that it is necessary to complete model simplification, removing non-significant interactions or effects to eventually reach the simplest model which still explains the results. I tried two ways of removing the interaction. However, when I manually remove the interaction (Heating*Fauna -> Heating+Fauna), the new ANOVA gives a different output to when I use this model simplification 'shortcut':
new.model <- update(full.model, .~. -Dungfauna:Heating)
anova(model)
Which way is the appropriate way to remove the interaction and simplify the model?
In both cases the data is log transformed -
lm(log(CC_noAcari_EmergencePatSoil)~ Dungfauna*Heating, data= biomass)
ANOVA output from manually changing Heating*Dungfauna to Heating+Dungfauna:
Response: log(CC_noAcari_EmergencePatSoil)
Df Sum Sq Mean Sq F value Pr(>F)
Heating 2 4.806 2.403 5.1799 0.01012 *
Dungfauna 1 37.734 37.734 81.3432 4.378e-11 ***
Residuals 39 18.091 0.464
ANOVA output from using simplification 'shortcut':
Response: log(CC_noAcari_EmergencePatSoil)
Df Sum Sq Mean Sq F value Pr(>F)
Dungfauna 1 41.790 41.790 90.0872 1.098e-11 ***
Heating 2 0.750 0.375 0.8079 0.4531
Residuals 39 18.091 0.464

R's anova and aov functions compute the Type I or "sequential" sums of squares. The order in which the predictors are specified matters. A model that specifies y ~ A + B is asking for the effect of A conditioned on B, whereas Y ~ B + A is asking for the effect of B conditioned on A. Notice that your first model specifies Dungfauna*Heating, while your comparison model uses Heating+Dungfauna.
Consider this simple example using the "mtcars" data set. Here I specify two additive models (no interactions). Both models specify the same predictors, but in different orders:
add.model <- lm(log(mpg) ~ vs + cyl, data = mtcars)
anova(add.model)
Df Sum Sq Mean Sq F value Pr(>F)
vs 1 1.22434 1.22434 48.272 1.229e-07 ***
cyl 1 0.78887 0.78887 31.103 5.112e-06 ***
Residuals 29 0.73553 0.02536
add.model2 <- lm(log(mpg) ~ cyl + vs, data = mtcars)
anova(add.model2)
Df Sum Sq Mean Sq F value Pr(>F)
cyl 1 2.00795 2.00795 79.1680 8.712e-10 ***
vs 1 0.00526 0.00526 0.2073 0.6523
Residuals 29 0.73553 0.02536
You could specify Type II or Type III sums of squares using car::Anova:
car::Anova(add.model, type = 2)
car::Anova(add.model2, type = 2)
Which gives the same result for both models:
Sum Sq Df F value Pr(>F)
vs 0.00526 1 0.2073 0.6523
cyl 0.78887 1 31.1029 5.112e-06 ***
Residuals 0.73553 29
summary also provides equivalent (and consistent) metrics regardless of the order of predictors, though it's not quite a formal ANOVA table:
summary(add.model)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.92108 0.20714 18.930 < 2e-16 ***
vs -0.04414 0.09696 -0.455 0.652
cyl -0.15261 0.02736 -5.577 5.11e-06 ***

Related

Interpreting output from emmeans::contrast

I have data from a longitudinal study and calculated the regression using the lme4::lmer function. After that I calculated the contrasts for these data but I am having difficulty interpreting my results, as they were unexpected. I think I might have made a mistake in the code. Unfortunately I couldn't replicate my results with an example, but I will post both the failed example and my actual results below.
My results:
library(lme4)
library(lmerTest)
library(emmeans)
#regression
regmemory <- lmer(memory ~ as.factor(QuartileConsumption)*Age+
(1 + Age | ID) + sex + education +
HealthScore, CognitionData)
#results
summary(regmemory)
#Fixed effects:
# Estimate Std. Error df t value Pr(>|t|)
#(Intercept) -7.981e-01 9.803e-02 1.785e+04 -8.142 4.15e-16 ***
#as.factor(QuartileConsumption)2 -8.723e-02 1.045e-01 2.217e+04 -0.835 0.40376
#as.factor(QuartileConsumption)3 5.069e-03 1.036e-01 2.226e+04 0.049 0.96097
#as.factor(QuartileConsumption)4 -2.431e-02 1.030e-01 2.213e+04 -0.236 0.81337
#Age -1.709e-02 1.343e-03 1.989e+04 -12.721 < 2e-16 ***
#sex 3.247e-01 1.520e-02 1.023e+04 21.355 < 2e-16 ***
#education 2.979e-01 1.093e-02 1.061e+04 27.266 < 2e-16 ***
#HealthScore -1.098e-06 5.687e-07 1.021e+04 -1.931 0.05352 .
#as.factor(QuartileConsumption)2:Age 1.101e-03 1.842e-03 1.951e+04 0.598 0.55006
#as.factor(QuartileConsumption)3:Age 4.113e-05 1.845e-03 1.935e+04 0.022 0.98221
#as.factor(QuartileConsumption)4:Age 1.519e-03 1.851e-03 1.989e+04 0.821 0.41174
#contrasts
emmeans(regmemory, poly ~ QuartileConsumption * Age)$contrast
#$contrasts
# contrast estimate SE df z.ratio p.value
# linear 0.2165 0.0660 Inf 3.280 0.0010
# quadratic 0.0791 0.0289 Inf 2.733 0.0063
# cubic -0.0364 0.0642 Inf -0.567 0.5709
The interaction terms in the regression results are not significant, but the linear contrast is. Shouldn't the p-value for the contrast be non-significant?
Below is the code I wrote to try to recreate these results, but failed:
library(dplyr)
library(lme4)
library(lmerTest)
library(emmeans)
data("sleepstudy")
#create quartile column
sleepstudy$Quartile <- sample(1:4, size = nrow(sleepstudy), replace = T)
#regression
model1 <- lmer(Reaction ~ Days * as.factor(Quartile) + (1 + Days | Subject), data = sleepstudy)
#results
summary(model1)
#Fixed effects:
# Estimate Std. Error df t value Pr(>|t|)
#(Intercept) 258.1519 9.6513 54.5194 26.748 < 2e-16 ***
#Days 9.8606 2.0019 43.8516 4.926 1.24e-05 ***
#as.factor(Quartile)2 -11.5897 11.3420 154.1400 -1.022 0.308
#as.factor(Quartile)3 -5.0381 11.2064 155.3822 -0.450 0.654
#as.factor(Quartile)4 -10.7821 10.8798 154.0820 -0.991 0.323
#Days:as.factor(Quartile)2 0.5676 2.1010 152.1491 0.270 0.787
#Days:as.factor(Quartile)3 0.2833 2.0660 155.5669 0.137 0.891
#Days:as.factor(Quartile)4 1.8639 2.1293 153.1315 0.875 0.383
#contrast
emmeans(model1, poly ~ Quartile*Days)$contrast
#contrast estimate SE df t.ratio p.value
# linear -1.91 18.78 149 -0.102 0.9191
# quadratic 10.40 8.48 152 1.227 0.2215
# cubic -18.21 18.94 150 -0.961 0.3379
In this example, the p-value for the linear contrast is non-significant just as the interactions from the regression. Did I do something wrong, or these results are to be expected?
Look at the emmeans() call for the original model:
emmeans(regmemory, poly ~ QuartileConsumption * Age)
This requests that we obtain marginal means for combinations of QuartileConsumption and Age, and obtain polynomial contrasts from those results. It appears that Age is a quantitative variable, so in computing the marginal means, we just use the mean value of Age (see documentation for ref_grid() and vignette("basics", "emmeans")). So the marginal means display, which wasn't shown in the OP, will be in this general form:
QuartileConsumption Age emmean
------------------------------------
1 <mean> <est1>
2 <mean> <est2>
3 <mean> <est3>
4 <mean> <est4>
... and the contrasts shown will be the linear, quadratic, and cubic trends of those four estimates, in the order shown.
Note that these marginal means have nothing to do with the interaction effect; they are just predictions from the model for the four levels of QuartileConsumption at the mean Age (and mean education, mean health score), averaged over the two sexes, if I understand the data structure correctly. So essentially the polynomial contrasts estimate polynomial trends of the 4-level factor at the mean age. And note in particular that age is held constant, so we certainly are not looking at any effects of Age.
I am guessing what you want to be doing to examine the interaction is to assess how the age trend varies over the four levels of that factor. If that is the case, one useful thing to do would be something like
slopes <- emtrends(regmemory, ~ QuartileConsumption, var = "age")
slopes # display the estimated slope at each level
pairs(slopes) # pairwise comparisons of these slopes
See vignette("interactions", "emmeans") and the section on interactions with covariates.

R AOV will not do the interaction between two variables

I have in the past had R perform aov's with interaction between two varbles, however I am unable to get it to do so now.
Code:
x.aov <- aov(thesis_temp$`Transformed Time to Metamorphosis` ~ thesis_temp$Sex + thesis_temp$Mature + thesis_temp$Sex * thesis_temp$Mature)
Output:
Df Sum Sq Mean Sq F value Pr(>F)
thesis_temp$Sex 1 0.000332 0.0003323 1.370 0.2452
thesis_temp$Mature 1 0.000801 0.0008005 3.301 0.0729 .
Residuals 82 0.019886 0.0002425
I want it to also include a Sex x Mature interaction, but it will not produce this. Any suggestions of how to get R to also do the interaction analysis?

Anova table for pscl:zeroinfl

We're trying to model a count variable with excessive zeros using a zero-inflated poisson (as implemented in pscl package). Here is a (simplified) output showing both categorical and continuous explanatory variables:
library(pscl)
> m1 <- zeroinfl(y ~ treatment + some_covar, data = d, dist =
"poisson")
> summary(m1)
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.189253 0.102256 31.189 < 2e-16 ***
treatmentB -0.282478 0.107965 -2.616 0.00889 **
treatmentC 0.227633 0.103605 2.197 0.02801 *
some_covar 0.002190 0.002329 0.940 0.34706
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.67251 0.74961 0.897 0.3696
treatmentB -1.72728 0.89931 -1.921 0.0548 .
treatmentC -0.31761 0.77668 -0.409 0.6826
some_covar -0.03736 0.02684 -1.392 0.1640
summary gave us some good answers but we are looking for a ANOVA-like table. So, the question is: is it ok to use car::Anova to obtain such table?
> Anova(m1)
Analysis of Deviance Table (Type II tests)
Response: y
Df Chisq Pr(>Chisq)
treatment 2 30.7830 2.068e-07 ***
some_covar 1 0.8842 0.3471
It seems to work fine but i'm not really sure whether is a valid approach since documentation is missing (seems like is only considering the 'count model' part?). Do you recommend to follow this approach or there is a better way?

Why Error() in aov() gives three levels?

I'm trying to understand how to properly run an Repeated Measures or Nested ANOVA in R, without using mixed models. From consulting tutorials, the formula for a one-variable repeated measures anova is:
aov(Y ~ IV+ Error(SUBJECT/IV) )
where IV is the within subjects and subject is the identity of the subjects. However, most examples show outputs with two strata: Error:subject and Error: subject:WS. Meanwhile I am getting three strata ( Error:subject and Error: subject:WS, Error:within). Why do I have three strata, when I'm trying to specify only two (Within and Between)?
Here is an reproducible example:
data(beavers)
id = rep(c("beaver1","beaver2"),times=c(nrow(beaver1),nrow(beaver2)))
data = data.frame(id=id,rbind(beaver1,beaver2))
data$activ=factor(data$activ)
aov(temp~activ+Error(id/activ),data=data)
temp is a continuous measure of temperature, id is the identity of the beaver activ is binary factor for activity. The output of the model is:
Error: id
Df Sum Sq Mean Sq
activ 1 28.74 28.74
Error: id:activ
Df Sum Sq Mean Sq F value Pr(>F)
activ 1 15.313 15.313 18.51 0.145
Residuals 1 0.827 0.827
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 210 7.85 0.03738

Confused about ANOVA in R

I am new to R and statistics and am trying to do two-factor ANOVA on a dataset in csv file where values of each factor are in its own column. I was using
> mydata <- read.csv("myfile.csv")
> model = lm(result ~ factor1 * factor2, data=mydata)
As a check, I tried the ChickWeight data from R sample dataset.
> anova(with(ChickWeight, lm(weight ~ Time + Diet)))
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
Time 1 2042344 2042344 1576.460 < 2.2e-16 ***
Diet 3 129876 43292 > 33.417 < 2.2e-16 ***
Residuals 573 742336 1296
> write.csv(file="ChickWeight.csv", x=ChickWeight, row.names=F)
> data = read.csv("ChickWeight.csv", header=T)
> anova(lm(weight ~ Time + Diet, data=data))
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
Time 1 2042344 2042344 1537.033 < 2.2e-16 ***
Diet 1 108177 108177 81.412 < 2.2e-16 ***
Residuals 575 764036 1329
Noticeably, degrees of freedom are lost for Diet column with the data read from csv into a dataframe. What am I missing here?
Got the clue from this post: Why do R and statsmodels give slightly different ANOVA results?
When the data is being read from CSV file, the Diet column is becoming an ordinary numeric column, but for ANOVA it has to be a factor variable (I am still not clear why it is a separate class/kind in R and why it cannot take care of it automatically: inexact binary representation of floats? ).
So the solution was:
> data$Diet = factor(data$Diet)
> anova(lm("weight ~ Time + Diet", data=data))
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
Time 1 2042344 2042344 1576.460 < 2.2e-16 ***
Diet 3 129876 43292 33.417 < 2.2e-16 ***
Residuals 573 742336 1296

Resources