I want to get the equation of the linear model for the following Latin square experiment, stored in the data frame mat.
data <- c(12.5,11,13,11.4)
row <- factor(rep(1:2,2))
col <- factor(rep(1:2,each=2))
car <- c("B","A","A","B")
mat <- data.frame(row,col,car,data)
mat
# row col car data
# 1 1 1 B 12.5
# 2 2 1 A 11.0
# 3 1 2 A 13.0
# 4 2 2 B 11.4
I might recommend using a mixed model approach to this.
mat <- data.frame(data=c(12.5,11,13,11.4),
row=factor(rep(1:2,2)),
col=factor(rep(1:2,each=2)),
car=c("B","A","A","B"))
I'm using lmerTest because it will more easily provide you with (approximate) p-values.
By default anova() uses the Satterthwaite approximation, or you can tell it to use the more accurate Kenward-Roger approximation. In either case you can see that the denominator df are exactly or nearly zero, and the p-value is either missing or very close to 1, indicating that your model doesn't make sense (i.e. even with the mixed model it's overparameterized).
library("lmerTest")
anova(m1 <- lmer(data~car+(1|row)+(1|col),data=mat))
anova(m1,ddf="Kenward-Roger")
## Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
## car 0.0025 0.0025 1 9.6578e-06 2.0019 0.9999
Try for a bigger design:
set.seed(101)
mat2 <- data.frame(data=rnorm(36),
row=gl(6,6),
col=gl(6,1,36),
car=sample(LETTERS[1:2],size=36,replace=TRUE))
m2A <- lm(data~car+row+col,data=mat2)
anova(m2A)
## (excerpt)
## Df Sum Sq Mean Sq F value Pr(>F)
## car 1 1.2571 1.25709 1.6515 0.211
m2B <- lmer(data~car+(1|row)+(1|col),data=mat2)
anova(m2B)
## Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
## car 1.178 1.178 1 17.098 1.56 0.2285
anova(m2B,ddf="Kenward-Roger")
## Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
## car 1.178 1.178 1 17.005 1.1029 0.3083
It surprises me a little bit that the lm and lmerTest answers are so far apart here -- I would have thought this was an example where there was a well-formulated "classic" answer -- but I'm not sure. Might be worth following up on CrossValidated or Google.
fit <- lm(data~row+col+car,mat)
coef(fit)
# (Intercept) row2 col2 carB
# 12.55 -1.55 0.45 -0.05
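So the effect of the row factor (row 2 relative to row 1) is -1.55, the effect of the col factor (col 2 relative to col 1) is 0.45, and the effect of the car factor (B relative to A) is -0.05. The intercept term is the value of data expected when all the factors are at their first level (row=1, col=1, car=A). In equation form: data = 12.55 - 1.55*(row=2) + 0.45*(col=2) - 0.05*(car=B), where each term in parentheses is 1 if the condition holds and 0 otherwise.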
Notice that your design is over-specified: you have only 4 pieces of data, which is enough to specify the effects of two factors and their interaction, but you have set it up so that car is the interaction. So there are no degrees of freedom left for error.
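A quick way to confirm that no degrees of freedom are left is to look at the residual df and the fitted values. This is only a minimal sketch on the same four observations, using base R and the object names from above:

```r
mat <- data.frame(data = c(12.5, 11, 13, 11.4),
                  row  = factor(rep(1:2, 2)),
                  col  = factor(rep(1:2, each = 2)),
                  car  = factor(c("B", "A", "A", "B")))

fit <- lm(data ~ row + col + car, data = mat)

# Four observations, four coefficients: nothing is left to estimate error
resid_df <- df.residual(fit)
resid_df

# A saturated model reproduces the observations exactly
exact_fit <- isTRUE(all.equal(unname(fitted(fit)), mat$data))
exact_fit
```

With zero residual df, anova(fit) cannot compute F statistics or p-values for any term.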
Related
I am using ANOVA to analyse results from an experiment to see whether there are any effects of my explanatory variables (Heating and Dungfauna) on my response variable (Biomass). I started by looking at the main effects and interaction:
full.model <- lm(log(Biomass) ~ Heating*Dungfauna, data= df)
anova(full.model)
I understand that it is necessary to simplify the model, removing non-significant interactions or effects until I reach the simplest model that still explains the results. I tried two ways of removing the interaction. However, when I manually remove the interaction (Heating*Dungfauna -> Heating+Dungfauna), the new ANOVA gives a different output from when I use this model simplification 'shortcut':
new.model <- update(full.model, .~. -Dungfauna:Heating)
anova(new.model)
Which way is the appropriate way to remove the interaction and simplify the model?
In both cases the data is log transformed -
lm(log(CC_noAcari_EmergencePatSoil)~ Dungfauna*Heating, data= biomass)
ANOVA output from manually changing Heating*Dungfauna to Heating+Dungfauna:
Response: log(CC_noAcari_EmergencePatSoil)
Df Sum Sq Mean Sq F value Pr(>F)
Heating 2 4.806 2.403 5.1799 0.01012 *
Dungfauna 1 37.734 37.734 81.3432 4.378e-11 ***
Residuals 39 18.091 0.464
ANOVA output from using simplification 'shortcut':
Response: log(CC_noAcari_EmergencePatSoil)
Df Sum Sq Mean Sq F value Pr(>F)
Dungfauna 1 41.790 41.790 90.0872 1.098e-11 ***
Heating 2 0.750 0.375 0.8079 0.4531
Residuals 39 18.091 0.464
R's anova and aov functions compute Type I or "sequential" sums of squares, so the order in which the predictors are specified matters. A model that specifies y ~ A + B tests A on its own (ignoring B) and then B after adjusting for A, whereas y ~ B + A tests B on its own and then A after adjusting for B. Notice that your first model specifies Dungfauna*Heating, while your comparison model uses Heating+Dungfauna.
Consider this simple example using the "mtcars" data set. Here I specify two additive models (no interactions). Both models specify the same predictors, but in different orders:
add.model <- lm(log(mpg) ~ vs + cyl, data = mtcars)
anova(add.model)
Df Sum Sq Mean Sq F value Pr(>F)
vs 1 1.22434 1.22434 48.272 1.229e-07 ***
cyl 1 0.78887 0.78887 31.103 5.112e-06 ***
Residuals 29 0.73553 0.02536
add.model2 <- lm(log(mpg) ~ cyl + vs, data = mtcars)
anova(add.model2)
Df Sum Sq Mean Sq F value Pr(>F)
cyl 1 2.00795 2.00795 79.1680 8.712e-10 ***
vs 1 0.00526 0.00526 0.2073 0.6523
Residuals 29 0.73553 0.02536
You could specify Type II or Type III sums of squares using car::Anova:
car::Anova(add.model, type = 2)
car::Anova(add.model2, type = 2)
Which gives the same result for both models:
Sum Sq Df F value Pr(>F)
vs 0.00526 1 0.2073 0.6523
cyl 0.78887 1 31.1029 5.112e-06 ***
Residuals 0.73553 29
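If you would rather not depend on the car package, base R's drop1() gives the same marginal tests for a model without interactions, since it drops each term in turn from the full model. A minimal sketch, refitting the two models from above:

```r
add.model  <- lm(log(mpg) ~ vs + cyl, data = mtcars)
add.model2 <- lm(log(mpg) ~ cyl + vs, data = mtcars)

# Each row compares the full model with the model lacking that one term,
# so the result does not depend on the order of the predictors
d1 <- drop1(add.model,  test = "F")
d2 <- drop1(add.model2, test = "F")
d1

# The F statistic for vs is the same whichever order was used
same_F <- isTRUE(all.equal(d1["vs", "F value"], d2["vs", "F value"]))
same_F
```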
summary also provides equivalent (and consistent) metrics regardless of the order of predictors, though it's not quite a formal ANOVA table:
summary(add.model)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.92108 0.20714 18.930 < 2e-16 ***
vs -0.04414 0.09696 -0.455 0.652
cyl -0.15261 0.02736 -5.577 5.11e-06 ***
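To check that the update() shortcut itself is not the source of the discrepancy, you can look at the term order it leaves behind. The sketch below uses mtcars as a stand-in for your data, and the object names (full, shortcut, same_order) are made up for illustration:

```r
full     <- lm(log(mpg) ~ vs * cyl, data = mtcars)
shortcut <- update(full, . ~ . - vs:cyl)

# update() keeps the remaining terms in their original order
term_order <- attr(terms(shortcut), "term.labels")
term_order

# Writing the additive model by hand in the same order gives the same table
same_order <- lm(log(mpg) ~ vs + cyl, data = mtcars)
identical_ss <- isTRUE(all.equal(anova(shortcut)[["Sum Sq"]],
                                 anova(same_order)[["Sum Sq"]]))
identical_ss
```

So both routes are equivalent as long as the predictors appear in the same order; the two tables in the question differ because the order was swapped.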
I have in the past had R perform aov's with an interaction between two variables, but I am unable to get it to do so now.
Code:
x.aov <- aov(thesis_temp$`Transformed Time to Metamorphosis` ~ thesis_temp$Sex + thesis_temp$Mature + thesis_temp$Sex * thesis_temp$Mature)
Output:
Df Sum Sq Mean Sq F value Pr(>F)
thesis_temp$Sex 1 0.000332 0.0003323 1.370 0.2452
thesis_temp$Mature 1 0.000801 0.0008005 3.301 0.0729 .
Residuals 82 0.019886 0.0002425
I want it to also include a Sex x Mature interaction, but it will not produce this. Any suggestions of how to get R to also do the interaction analysis?
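Note that Sex * Mature already expands to both main effects plus the interaction, so the formula is asking for the interaction. One possible cause, which cannot be confirmed without the data, is that some Sex by Mature combination has no observations, so the interaction coefficient is not estimable. The sketch below uses a small hypothetical data set (toy, with made-up values) purely to show how to check for this:

```r
# Hypothetical data: no mature males were observed
toy <- data.frame(
  Sex    = factor(c("F", "F", "F", "F", "M", "M")),
  Mature = factor(c("N", "N", "Y", "Y", "N", "N")),
  y      = c(1.2, 0.9, 1.6, 1.4, 1.1, 1.3)
)

# A zero in this table means one combination is missing
cell_counts <- table(toy$Sex, toy$Mature)
cell_counts

# The interaction coefficient then comes back as NA (not estimable)
fit <- lm(y ~ Sex * Mature, data = toy)
coef(fit)
```

Running table(thesis_temp$Sex, thesis_temp$Mature) on the real data would show whether this applies here.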
I'm trying to understand how to properly run a repeated measures or nested ANOVA in R, without using mixed models. From consulting tutorials, the formula for a one-variable repeated measures ANOVA is:
aov(Y ~ IV+ Error(SUBJECT/IV) )
where IV is the within-subjects variable and SUBJECT is the identity of the subjects. However, most examples show outputs with two strata: Error: subject and Error: subject:WS. Meanwhile I am getting three strata (Error: subject, Error: subject:WS and Error: Within). Why do I have three strata, when I'm trying to specify only two (within and between)?
Here is a reproducible example:
data(beavers)
id = rep(c("beaver1","beaver2"),times=c(nrow(beaver1),nrow(beaver2)))
data = data.frame(id=id,rbind(beaver1,beaver2))
data$activ=factor(data$activ)
aov(temp~activ+Error(id/activ),data=data)
temp is a continuous measure of temperature, id is the identity of the beaver, and activ is a binary factor for activity. The output of the model is:
Error: id
Df Sum Sq Mean Sq
activ 1 28.74 28.74
Error: id:activ
Df Sum Sq Mean Sq F value Pr(>F)
activ 1 15.313 15.313 18.51 0.145
Residuals 1 0.827 0.827
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 210 7.85 0.03738
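The third stratum most likely appears because each beaver has many temperature readings at each activity level, so the variation among readings inside an id:activ cell forms its own Within stratum. Tutorial examples usually have one value per subject and condition, which leaves nothing for that stratum. A minimal sketch, assuming it is acceptable to average the readings in each cell (dat, cell_means and fit2 are names introduced here):

```r
data(beavers)
dat <- data.frame(
  id = factor(rep(c("beaver1", "beaver2"),
                  times = c(nrow(beaver1), nrow(beaver2)))),
  rbind(beaver1, beaver2)
)
dat$activ <- factor(dat$activ)

# Collapse to one mean temperature per beaver and activity level
cell_means <- aggregate(temp ~ id + activ, data = dat, FUN = mean)

# With a single value per id:activ cell there is no Within stratum left
fit2 <- aov(temp ~ activ + Error(id / activ), data = cell_means)
strata <- names(fit2)
strata
```

With only two beavers the test of activ has a single error degree of freedom either way, so the result should be read with caution.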
I perform the following ezANOVA:
RMANOVAGHB1 <- ezANOVA(GHB1, dv=DIF.SCORE.STARTLE, wid=RAT.ID, within=TRIAL.TYPE, between=GROUP, detailed = TRUE, return_aov = TRUE)
My dataset looks like this:
RAT.ID DIF.SCORE.STARTLE GROUP TRIAL.TYPE
1 1 170.73 SAL TONO
2 1 80.07 SAL NOAL
3 2 456.40 PROP TONO
4 2 290.40 PROP NOAL
5 3 507.20 SAL TONO
6 3 261.60 SAL NOAL
7 4 208.67 PROP TONO
8 4 137.60 PROP NOAL
9 5 500.50 SAL TONO
10 5 445.73 SAL NOAL
up until rat.id 16.
My supervisors don't work with R, so they can't help me. I need code that will give me all post hoc contrasts, but looking it up only confuses me more and more.
I already tried to do TukeyHSD on the aov output of ezANOVA and tried pairwise.t.test next (as I found out bonferroni is a more appropriate correction in this case), but none seem to work. I've also found things about using a linear model and then multcomp, but I don't know if that would be a good solution in this case. I feel like the problem with everything I tried is either that I have between and within variables or that my dataset is not set up right. Another complicating factor is that I'm just a beginner with R and my statistical knowledge is still pretty basic as this is one of my first practical experiences with doing analyses.
If it's important, this is the output of the anova:
$ANOVA
Effect DFn DFd SSn SSd F p p<.05 ges
1 (Intercept) 1 14 1233568.9 1076460.9 16.043280 0.001302172 * 0.508451750
2 GROUP 1 14 212967.9 1076460.9 2.769771 0.118272657 0.151521743
3 TRIAL.TYPE 1 14 137480.6 116097.9 16.578499 0.001143728 * 0.103365833
4 GROUP:TRIAL.TYPE 1 14 11007.2 116097.9 1.327335 0.268574391 0.009145489
$aov
Call:
aov(formula = formula(aov_formula), data = data)
Grand Mean: 196.3391
Stratum 1: RAT.ID
Terms:
GROUP Residuals
Sum of Squares 212967.9 1076460.9
Deg. of Freedom 1 14
Residual standard error: 277.2906
1 out of 2 effects not estimable
Estimated effects are balanced
Stratum 2: RAT.ID:TRIAL.TYPE
Terms:
TRIAL.TYPE GROUP:TRIAL.TYPE Residuals
Sum of Squares 137480.6 11007.2 116097.9
Deg. of Freedom 1 1 14
Residual standard error: 91.0643
Estimated effects may be unbalanced
My solution, using the first 5 rats of your dataset:
1. Let's build the linear model:
model.lm = lm(DIF_SCORE_STARTLE ~ GROUP * TRIAL_TYPE, data = dat)
2. Let's check the homogeneity of variance (leveneTest, from the car package) and the normality of the model residuals (Shapiro-Wilk). We are looking for normally distributed residuals and homogeneous variance. Two tests for this:
> shapiro.test(resid(model.lm))
Shapiro-Wilk normality test
data: resid(model.lm)
W = 0.91783, p-value = 0.3392
> leveneTest(DIF_SCORE_STARTLE ~ GROUP * TRIAL_TYPE, data = dat)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 0.066 0.976
6
Our p-values are higher than 0.05 in both cases, so we have no evidence that the variance differs between groups. For the normality test we can likewise conclude that the residuals do not deviate detectably from normality. In summary, we can use parametric tests such as ANOVA or pairwise t-tests.
3. You can also run:
hist(resid(model.lm))
to check what the distribution of the residuals looks like. And check the model:
plot(model.lm)
At https://stats.stackexchange.com/questions/58141/interpreting-plot-lm/65864 you'll find an interpretation of the plots produced by this function. From what I saw, the data look fine.
4. Now finally we can do the ANOVA test and the post hoc HSD test (HSD.test, from the agricolae package):
> anova(model.lm)
Analysis of Variance Table
Response: DIF_SCORE_STARTLE
Df Sum Sq Mean Sq F value Pr(>F)
GROUP 1 7095 7095 0.2323 0.6469
TRIAL_TYPE 1 39451 39451 1.2920 0.2990
GROUP:TRIAL_TYPE 1 84 84 0.0027 0.9600
Residuals 6 183215 30536
> (result.hsd = HSD.test(model.lm, list('GROUP', 'TRIAL_TYPE')))
$statistics
Mean CV MSerror HSD r.harmonic
305.89 57.12684 30535.91 552.2118 2.4
$parameters
Df ntr StudentizedRange alpha test name.t
6 4 4.895599 0.05 Tukey GROUP:TRIAL_TYPE
$means
DIF_SCORE_STARTLE std r Min Max
PROP:NOAL 214.0000 108.0459 2 137.60 290.40
PROP:TONO 332.5350 175.1716 2 208.67 456.40
SAL:NOAL 262.4667 182.8315 3 80.07 445.73
SAL:TONO 392.8100 192.3561 3 170.73 507.20
$comparison
NULL
$groups
trt means M
1 SAL:TONO 392.8100 a
2 PROP:TONO 332.5350 a
3 SAL:NOAL 262.4667 a
4 PROP:NOAL 214.0000 a
As you can see, all our 'pairs' have been placed in one big group a, which means that there is no significant difference between them. However, the means are somewhat higher for TONO than for NOAL in both the SAL and PROP groups.
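Note that the lm above treats the two scores from each rat as independent observations. As a rough alternative that keeps the pairing, here is a sketch in base R using only the five rats shown in the question (so the numbers are illustrative, not your real result). It assumes the rows are ordered by rat with TONO before NOAL, as in the question, and it uses a Bonferroni correction as mentioned there:

```r
# First five rats from the question (the full data set has 16)
dat <- data.frame(
  RAT_ID = rep(1:5, each = 2),
  GROUP = rep(c("SAL", "PROP", "SAL", "PROP", "SAL"), each = 2),
  TRIAL_TYPE = rep(c("TONO", "NOAL"), times = 5),
  DIF_SCORE_STARTLE = c(170.73, 80.07, 456.40, 290.40, 507.20,
                        261.60, 208.67, 137.60, 500.50, 445.73)
)

# One value per rat for each trial type, in rat order
tono  <- dat$DIF_SCORE_STARTLE[dat$TRIAL_TYPE == "TONO"]
noal  <- dat$DIF_SCORE_STARTLE[dat$TRIAL_TYPE == "NOAL"]
group <- dat$GROUP[dat$TRIAL_TYPE == "TONO"]

# Within-rat contrast: paired t-test of TONO against NOAL
within_test <- t.test(tono, noal, paired = TRUE)

# Between-group contrast: Welch t-test on each rat's mean score
rat_mean <- (tono + noal) / 2
between_test <- t.test(rat_mean ~ group)

# Bonferroni adjustment across the two contrasts
p_adj <- p.adjust(c(TRIAL_TYPE = within_test$p.value,
                    GROUP = between_test$p.value),
                  method = "bonferroni")
p_adj
```

Since both factors have only two levels, each main effect is a single contrast already; post hoc comparisons mainly matter for following up the interaction, for example by running the paired test separately within each group.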
I'm conducting a simulation study in R. Basically, I generate fake data sets and then run an ANOVA on the data using the aov function. But I'm having difficulty extracting p-values. Previous questions do not help (Extract p-value from aov) because I am running a mixed ANOVA.
First I have an ANOVA:
results <- summary(aov(dv~(A*B*C*D*E)+Error(subj/(A*B*C*D)), data = mdata)) # conduct repeated measures ANOVA
which generates this output:
Error: subj
Df Sum Sq Mean Sq F value Pr(>F)
E 1 1039157 1039157 0.95 0.334
Residuals 58 63428016 1093586
Error: subj:A
Df Sum Sq Mean Sq F value Pr(>F)
A 1 1996 1996 0.220 0.641
A:E 1 2294 2294 0.253 0.617
Residuals 58 526389 9076
...
I'm truncating the output for space. What I want is a list of p-values with the effect names (A or A:E). I have halfway succeeded, but it's messy. I can extract the p-values using this get_p function that I made.
#Function
get_p = function(results){
results[[1]]$'Pr(>F)'
}
#Get p-values
p <- sapply(results, get_p)
I end up with this:
$`Error: subj`
[1] 0.3337094 NA
$`Error: subj:A`
[1] 0.6408826 0.6170181 NA
...
Any ideas on how to get a list of p-values (.6408, .6170) and effect names ('A', 'A:E')?
I found the answer, which seems to be:
get_p1 = function(results){
results[[1]]$'Pr(>F)'[[1]]
}
get_p2 = function(results){
results[[1]]$'Pr(>F)'[[2]]
}
pvals <- c(sapply(results, get_p1), sapply(results, get_p2))
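A variation that keeps the effect names and does not hard-code positions might look like the sketch below. It is not the code from the question: the simulated mdata is a smaller made-up design (one within factor A, one between factor E), and get_named_p is a hypothetical helper name.

```r
set.seed(1)
# Made-up design: 20 subjects, within factor A, between factor E
mdata <- expand.grid(subj = factor(1:20), A = factor(c("a1", "a2")))
mdata$E  <- factor(ifelse(as.integer(mdata$subj) <= 10, "e1", "e2"))
mdata$dv <- rnorm(nrow(mdata))

results <- summary(aov(dv ~ A * E + Error(subj / A), data = mdata))

# For one stratum, return its p-values named by effect, without Residuals
get_named_p <- function(stratum) {
  tab <- stratum[[1]]
  p <- tab[["Pr(>F)"]]
  if (is.null(p)) return(numeric(0))   # stratum containing only Residuals
  names(p) <- trimws(rownames(tab))    # row names are padded with spaces
  p[!is.na(p)]
}

# unname() stops unlist() from prefixing the stratum labels
pvals <- unlist(lapply(unname(results), get_named_p))
pvals
```

The result is a named numeric vector, so pvals["A:E"] picks out a single effect inside a simulation loop.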