R AOV will not do the interaction between two variables

In the past I have had R run aov models with an interaction between two variables, but I am unable to get it to do so now.
Code:
x.aov <- aov(thesis_temp$`Transformed Time to Metamorphosis` ~ thesis_temp$Sex + thesis_temp$Mature + thesis_temp$Sex * thesis_temp$Mature)
Output:
Df Sum Sq Mean Sq F value Pr(>F)
thesis_temp$Sex 1 0.000332 0.0003323 1.370 0.2452
thesis_temp$Mature 1 0.000801 0.0008005 3.301 0.0729 .
Residuals 82 0.019886 0.0002425
I want it to also include a Sex x Mature interaction, but it will not produce this. Any suggestions of how to get R to also do the interaction analysis?
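One possible cause, which the post does not confirm: the output has no interaction row at all, which is what aov shows when the interaction cannot be estimated. That happens, for example, when one Sex x Mature combination has no observations, so the interaction column is aliased and summary() silently omits it. Note also that Sex + Mature + Sex*Mature is the same model as Sex*Mature. The sketch below uses made-up data (the data frame d and its columns are hypothetical, not the thesis data) to reproduce the symptom and check the cell counts:

```r
# Hypothetical data: no mature males, so one Sex x Mature cell is empty.
set.seed(1)
d <- data.frame(
  Sex    = factor(rep(c("F", "M"), each = 20)),
  Mature = factor(c(rep(c("no", "yes"), each = 10), rep("no", 20)))
)
d$y <- rnorm(40)

# Step 1: count observations per combination; a zero means the
# interaction cannot be estimated.
cells <- table(d$Sex, d$Mature)
print(cells)

# Step 2: fit the full factorial model. Sex * Mature already expands to
# Sex + Mature + Sex:Mature.
fit <- aov(y ~ Sex * Mature, data = d)
print(summary(fit))        # no Sex:Mature row is printed

# Step 3: the aliased coefficient is stored as NA in the fitted object.
print(fit$coefficients)
```

If table() on the real Sex and Mature columns shows every combination populated, this is not the cause and something else is going on.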

Related

Denominator degrees of freedom (when using lmer) are different in loop than outside loop

I am looping an lmer function over 20+ y variables, with 159 rows of observations. When I run an lmer function inside the loop, denominator degrees of freedom are lost. Outside the loop (or even if I specify one y variable inside the loop), denominator df is as expected.
I have a data frame (df) with 20 y variables for plants in two chambers with 4 treatments (replicated in both chambers).
lmer_test <- lmer(leaves_mean~Treatment + (1|Chamber), data = df)
aov_test <- anova(lmer_test)
aov_test
This gives:
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Treatment 97.593 32.531 3 153 1.1966 0.3131
I have a loop:
for(u in colnames(df)[6:ncol(df)])
{
Y_variable_Rex <- names(df[u])
lmer_u_Rex <-lmer(get(u) ~ Treatment + (1|Chamber), data = df)
aov_u_Rex <- anova(lmer_u_Rex)
aov_u_Rex
}
That gives
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Treatment 208420 69473 3 4.6406e-14 3.5911e+31 1
If I specify the exact code as outside the loop (replacing get(u) with "leaves_mean")... I get the correct result:
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Treatment 97.593 32.531 3 153 1.1966 0.3131
The get(u) specification should produce exactly the same result inside the loop. What is happening that makes the denominator degrees of freedom different (effectively 0)?
I answered my own question...
The loop was reporting denominator df for a different variable (the last one).
There are missing values in that column as that was a y variable of sampled values, with many missing rows.
I was able to reproduce the error and then resolve it within the loop.
I was approaching debugging here as you might in a Matlab loop (stepping through each row), which is not a good method for fixing R loops.
Thank you for reading this question and considering answering!
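For anyone hitting the same thing: inside a for loop a bare object name such as aov_u_Rex is not auto-printed, so whatever you inspect after the loop comes from the last iteration only. A minimal sketch of one way around that is to store each table in a named list and print it explicitly. To keep it runnable without extra packages it uses lm() in place of lmer() (so there is no random effect here), and the data frame and its columns y1 and y2 are made up for illustration:

```r
# Hypothetical data: two responses, one of them with missing values.
set.seed(42)
df <- data.frame(
  Treatment = factor(rep(c("A", "B", "C", "D"), each = 10)),
  y1 = rnorm(40),
  y2 = rnorm(40)
)
df$y2[c(3, 14, 25, 36)] <- NA

results <- list()
for (u in c("y1", "y2")) {
  # Build the formula from the column name instead of using get(u),
  # so the response is labelled correctly in the output.
  f <- reformulate("Treatment", response = u)
  fit <- lm(f, data = df)

  # Keep every table, keyed by response name, and print it explicitly.
  results[[u]] <- anova(fit)
  cat("\nResponse:", u, "\n")
  print(results[[u]])
}
```

With the tables kept in results, each response can be checked on its own (for example results$y2), and columns with missing rows show up through their smaller residual df.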

Model simplification (two way ANOVA)

I am using ANOVA to analyse results from an experiment to see whether there are any effects of my explanatory variables (Heating and Dungfauna) on my response variable (Biomass). I started by looking at the main effects and interaction:
full.model <- lm(log(Biomass) ~ Heating*Dungfauna, data= df)
anova(full.model)
I understand that it is necessary to simplify the model, removing non-significant interactions or effects until I reach the simplest model that still explains the results. I tried two ways of removing the interaction. However, when I manually remove the interaction (Heating*Dungfauna -> Heating+Dungfauna), the new ANOVA gives a different output from the one I get with this model simplification 'shortcut':
new.model <- update(full.model, .~. -Dungfauna:Heating)
anova(new.model)
Which way is the appropriate way to remove the interaction and simplify the model?
In both cases the response is log-transformed:
lm(log(CC_noAcari_EmergencePatSoil)~ Dungfauna*Heating, data= biomass)
ANOVA output from manually changing Heating*Dungfauna to Heating+Dungfauna:
Response: log(CC_noAcari_EmergencePatSoil)
Df Sum Sq Mean Sq F value Pr(>F)
Heating 2 4.806 2.403 5.1799 0.01012 *
Dungfauna 1 37.734 37.734 81.3432 4.378e-11 ***
Residuals 39 18.091 0.464
ANOVA output from using simplification 'shortcut':
Response: log(CC_noAcari_EmergencePatSoil)
Df Sum Sq Mean Sq F value Pr(>F)
Dungfauna 1 41.790 41.790 90.0872 1.098e-11 ***
Heating 2 0.750 0.375 0.8079 0.4531
Residuals 39 18.091 0.464
R's anova and aov functions compute Type I ("sequential") sums of squares, so the order in which the predictors are specified matters. A model specified as y ~ A + B tests A on its own (ignoring B) and then B after adjusting for A, whereas y ~ B + A tests B on its own and then A after adjusting for B. Notice that your first model specifies Dungfauna*Heating, while your comparison model uses Heating+Dungfauna. update() keeps the term order of the full model, so the 'shortcut' fits Dungfauna + Heating.
Consider this simple example using the "mtcars" data set. Here I specify two additive models (no interactions). Both models specify the same predictors, but in different orders:
add.model <- lm(log(mpg) ~ vs + cyl, data = mtcars)
anova(add.model)
Df Sum Sq Mean Sq F value Pr(>F)
vs 1 1.22434 1.22434 48.272 1.229e-07 ***
cyl 1 0.78887 0.78887 31.103 5.112e-06 ***
Residuals 29 0.73553 0.02536
add.model2 <- lm(log(mpg) ~ cyl + vs, data = mtcars)
anova(add.model2)
Df Sum Sq Mean Sq F value Pr(>F)
cyl 1 2.00795 2.00795 79.1680 8.712e-10 ***
vs 1 0.00526 0.00526 0.2073 0.6523
Residuals 29 0.73553 0.02536
You could specify Type II or Type III sums of squares using car::Anova:
car::Anova(add.model, type = 2)
car::Anova(add.model2, type = 2)
Which gives the same result for both models:
Sum Sq Df F value Pr(>F)
vs 0.00526 1 0.2073 0.6523
cyl 0.78887 1 31.1029 5.112e-06 ***
Residuals 0.73553 29
summary also provides equivalent (and consistent) metrics regardless of the order of predictors, though it's not quite a formal ANOVA table:
summary(add.model)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.92108 0.20714 18.930 < 2e-16 ***
vs -0.04414 0.09696 -0.455 0.652
cyl -0.15261 0.02736 -5.577 5.11e-06 ***
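If you would rather stay in base R, drop1() is another option (my suggestion, not something the question used). It removes each term in turn from the fitted model and F-tests it, so every term is adjusted for all the others and the order of the predictors stops mattering. A minimal sketch with the same mtcars models; the object names m1, m2, d1 and d2 are just for illustration:

```r
# Same two additive models, differing only in term order.
m1 <- lm(log(mpg) ~ vs + cyl, data = mtcars)
m2 <- lm(log(mpg) ~ cyl + vs, data = mtcars)

# drop1() drops one term at a time from the full model and tests it,
# so each term is adjusted for every other term.
d1 <- drop1(m1, test = "F")
d2 <- drop1(m2, test = "F")
print(d1)
print(d2)

# The per-term sums of squares agree between the two orderings.
print(all.equal(d1["vs", "Sum of Sq"], d2["vs", "Sum of Sq"]))
print(all.equal(d1["cyl", "Sum of Sq"], d2["cyl", "Sum of Sq"]))
```

For a model that still contains the interaction, drop1() only offers the interaction for removal, which respects marginality and fits the usual simplification workflow.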

How do I run Kruskal and post HOC on multiple variables in R?

Please excuse me if I have not formatted my code correctly, as I am new to the site. I also do not know how to provide sample data properly.
I have a data set of 42 obs. and 37 variables (first column being the group, 3 groups) of non normal distributed data; I want to compare all of my 36 parameters between the 3 groups and do a subsequent post hoc (pairwise.wilcox?).
The data are flow cell counts for three different patient groups. I have been able to perform the initial comparison creating a formula and running an aov (though I would like to do Kruskal) but have not found a way to perform the post hoc to all variables in the same way.
#Data
Type Neutrophils Monocytes NKC .....
------------------------------------------
IN 546 2663 545
IN 0797 7979 008
OUT 0899 3899 345
OUT 6868 44533 689
HC 9898 43443 563
#Cbind all variable together to run model on all
formula <- as.formula(paste0("cbind(", paste(names(LessCount)[-1],
collapse = ","), ") ~ Type"))
print(formula)
#Run test on model
fit <- aov(formula, data=LessCount)
#Print results
summary(fit)
Response Neutrophils :
Df Sum Sq Mean Sq F value Pr(>F)
Type 2 18173966 9086983 1.8099 0.1771
Residuals 39 195806220 5020672
Response Monocytes :
Df Sum Sq Mean Sq F value Pr(>F)
Type 2 694945 347472 0.7131 0.4964
Residuals 39 19004809 487303
Response Mono.Classic :
Df Sum Sq Mean Sq F value Pr(>F)
Type 2 1561778 780889 2.5842 0.08833 .
Residuals 39 11785116 302182
###export anova####
capture.output(summary(fit),file="test1.csv")
#If Significant,Check which# (currently doing by hand individually)
pairwise.wilcox.test(LessCount$pDCs, LessCount$Type,
p.adjust.method = "BH")
I get a table of the aov results for every variable in my console, but I would like to do the same for the post hoc test, since I need every p-value.
Thank you in advance.
Maybe you can directly use the function kruskal.test() and get the p.values.
Here is an example with the iris dataset. I use the function apply() in order to apply the kruskal.test function to each variable (except Species, which is the variable with group information).
data(iris)
apply(iris[-5], 2, function(x) kruskal.test(x = x, g = iris$Species)$p.value)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 8.918734e-22 1.569282e-14 4.803974e-29 3.261796e-29
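The post hoc step can probably be handled the same way. Here is a rough sketch, again on iris, that runs pairwise.wilcox.test() on every variable and keeps the matrix of BH-adjusted p-values for each one. The names vars and posthoc are made up for the example, and exact = FALSE is passed through to wilcox.test() only to avoid the warnings about ties:

```r
data(iris)
vars <- names(iris)[-5]   # every column except the grouping variable

# One pairwise Wilcoxon test per variable; keep only the p-value matrix.
posthoc <- lapply(vars, function(v) {
  pairwise.wilcox.test(iris[[v]], iris$Species,
                       p.adjust.method = "BH",
                       exact = FALSE)$p.value
})
names(posthoc) <- vars

# Each element is a lower-triangular matrix of adjusted p-values,
# one entry per pair of groups.
print(posthoc$Sepal.Length)
```

For your data you would swap iris for LessCount and iris$Species for LessCount$Type. Note that the BH adjustment here is applied within each variable, not across all 36 of them.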

Why Error() in aov() gives three levels?

I'm trying to understand how to properly run a repeated-measures or nested ANOVA in R without using mixed models. From consulting tutorials, the formula for a one-variable repeated-measures ANOVA is:
aov(Y ~ IV+ Error(SUBJECT/IV) )
where IV is the within-subjects factor and SUBJECT identifies the subjects. However, most examples show outputs with two strata: Error: subject and Error: subject:WS. Meanwhile I am getting three strata (Error: subject, Error: subject:WS and Error: Within). Why do I have three strata when I'm trying to specify only two (within and between)?
Here is a reproducible example:
data(beavers)
id = rep(c("beaver1","beaver2"),times=c(nrow(beaver1),nrow(beaver2)))
data = data.frame(id=id,rbind(beaver1,beaver2))
data$activ=factor(data$activ)
aov(temp~activ+Error(id/activ),data=data)
temp is a continuous measure of temperature, id is the identity of the beaver, and activ is a binary factor for activity. The output of the model is:
Error: id
Df Sum Sq Mean Sq
activ 1 28.74 28.74
Error: id:activ
Df Sum Sq Mean Sq F value Pr(>F)
activ 1 15.313 15.313 18.51 0.145
Residuals 1 0.827 0.827
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 210 7.85 0.03738
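A likely explanation: the tutorial examples usually have exactly one observation per subject and condition, so nothing is left over below the subject:IV stratum. The beaver data have many temperature readings for each id and activ combination, and that leftover variation between readings within a cell is reported as the extra Within stratum. The sketch below illustrates this by averaging down to one value per cell before fitting; the aggregation step and the names dat, full, cell_means and reduced are my own additions for illustration, not a recommendation for how to analyse these data:

```r
data(beavers)
id <- rep(c("beaver1", "beaver2"), times = c(nrow(beaver1), nrow(beaver2)))
dat <- data.frame(id = factor(id), rbind(beaver1, beaver2))
dat$activ <- factor(dat$activ)

# Many readings per beaver and activity level.
print(table(dat$id, dat$activ))

# Raw data: replicate readings inside each id:activ cell form a
# "Within" stratum.
full <- aov(temp ~ activ + Error(id / activ), data = dat)
print(names(full))

# One mean per id:activ cell: no replication is left, so the lowest
# stratum is id:activ and "Within" disappears.
cell_means <- aggregate(temp ~ id + activ, data = dat, FUN = mean)
reduced <- aov(temp ~ activ + Error(id / activ), data = cell_means)
print(names(reduced))
```

With only two beavers the test of activ has a single error degree of freedom either way, which is why the p-value in the id:activ stratum is so unconvincing.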

R: lm() with factors. Don't understand how ANOVA table calculates "Sum Sq"

I'm learning R and trying to understand how lm() handles factor variables & how to make sense of the ANOVA table. I'm fairly new to statistics, so please be gentle with me.
Here's some movie data from Rotten Tomatoes. I'm trying to model the score of each movie based on the mean scores for all of the movies in 4 groups: those rated G, PG, PG-13, and R.
download.file("http://www.rossmanchance.com/iscam2/data/movies03RT.txt", destfile = "./movies.txt")
movies <- read.table("./movies.txt", sep = "\t", header = T, quote = "")
lm1 <- lm(movies$score ~ as.factor(movies$rating))
anova(lm1)
and the ANOVA output:
## Analysis of Variance Table
##
## Response: movies$score
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(movies$rating) 3 570 190 0.92 0.43
## Residuals 136 28149 207
I understand how to get all the numbers in this table, EXCEPT Sum Sq and Mean Sq for as.factor(movies$rating). Can someone please explain how that Sum Sq is calculated from my data? I know that Mean Sq is just Sum Sq divided by Df.
There are various ways to get that. One of them is to use the equation:
http://en.wikipedia.org/wiki/Sum_of_squares_(statistics)
SS_total = SS_reg + SS_error
So:
y = movies$score
sum((y - mean(y))^2) - sum(lm1$residuals^2)
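The same number can also be built directly from the group means: for a single factor, Sum Sq is the sum over all observations of (group mean - grand mean)^2. A minimal sketch on the built-in PlantGrowth data, swapped in here so that it runs without the download; the same steps should apply with movies$score and as.factor(movies$rating):

```r
fit <- lm(weight ~ group, data = PlantGrowth)
y <- PlantGrowth$weight
g <- PlantGrowth$group

# ave() replaces each observation by the mean of its own group.
group_means <- ave(y, g)

# Between-group sum of squares: squared distance of each observation's
# group mean from the grand mean, summed over all observations.
ss_between <- sum((group_means - mean(y))^2)

# Same quantity from the decomposition SS_total = SS_reg + SS_error.
ss_total <- sum((y - mean(y))^2)
ss_resid <- sum(residuals(fit)^2)

# Value reported in the ANOVA table.
ss_table <- anova(fit)["group", "Sum Sq"]

print(c(direct = ss_between, by_difference = ss_total - ss_resid, table = ss_table))
```

All three values should agree up to floating-point rounding.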
