I'm conducting a simulation study in R. Basically, I generate fake data sets and then run an ANOVA on each one using the aov function, but I'm having difficulty extracting the p-values. Previous questions (e.g. "Extract p-value from aov") do not help because I am running a mixed ANOVA.
First I have an ANOVA:
results <- summary(aov(dv~(A*B*C*D*E)+Error(subj/(A*B*C*D)), data = mdata)) # conduct repeated measures ANOVA
which generates this output:
Error: subj
Df Sum Sq Mean Sq F value Pr(>F)
E 1 1039157 1039157 0.95 0.334
Residuals 58 63428016 1093586
Error: subj:A
Df Sum Sq Mean Sq F value Pr(>F)
A 1 1996 1996 0.220 0.641
A:E 1 2294 2294 0.253 0.617
Residuals 58 526389 9076
...
I'm truncating the output for space. What I want is a list of p-values together with the effect names (A or A:E). I have halfway succeeded, but it's messy. I can extract the p-values using this get_p function that I wrote.
#Function
get_p = function(results){
results[[1]]$'Pr(>F)'
}
#Get p-values
p <- sapply(results, get_p)
I end up with this:
$`Error: subj`
[1] 0.3337094 NA
$`Error: subj:A`
[1] 0.6408826 0.6170181 NA
...
Any ideas on how to get a list of p-values (.6408, .6170) and effect names ('A', 'A:E')?
I found the answer, which seems to be:
get_p1 = function(results){
results[[1]]$'Pr(>F)'[[1]]
}
get_p2 = function(results){
results[[1]]$'Pr(>F)'[[2]]
}
pvals <- c(sapply(results, get_p1), sapply(results, get_p2))
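A more general variant can be sketched as follows. It keeps the effect names and does not hard-code how many effects each stratum has. The data set here is simulated purely for illustration, and get_named_p is a hypothetical helper name, not something from the original post.

```r
# Simulated data (assumption): 10 subjects, within-subject factor A,
# between-subject factor E, random response
set.seed(1)
mdata <- expand.grid(subj = factor(1:10), A = factor(c("a1", "a2")))
mdata$E <- factor(ifelse(as.integer(mdata$subj) <= 5, "e1", "e2"))
mdata$dv <- rnorm(nrow(mdata))

results <- summary(aov(dv ~ A * E + Error(subj / A), data = mdata))

# For one error stratum: pull the ANOVA table, name each p-value by its
# effect, and drop rows without a p-value (the Residuals row)
get_named_p <- function(stratum) {
  tab <- stratum[[1]]
  p <- tab[["Pr(>F)"]]
  if (is.null(p)) return(numeric(0))   # stratum with residuals only
  names(p) <- trimws(rownames(tab))    # row names are space-padded
  p[!is.na(p)]
}

# Drop the "Error: ..." names so the result is named by effect only
strata <- unname(unclass(results))
pvals <- unlist(lapply(strata, get_named_p))
print(pvals)
```

The result is a single named numeric vector, so pvals["A:E"] retrieves one effect directly.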
Related
In the past I have had R perform aov's with an interaction between two variables, but I am unable to get it to do so now.
Code:
x.aov <- aov(thesis_temp$`Transformed Time to Metamorphosis` ~ thesis_temp$Sex + thesis_temp$Mature + thesis_temp$Sex * thesis_temp$Mature)
Output:
Df Sum Sq Mean Sq F value Pr(>F)
thesis_temp$Sex 1 0.000332 0.0003323 1.370 0.2452
thesis_temp$Mature 1 0.000801 0.0008005 3.301 0.0729 .
Residuals 82 0.019886 0.0002425
I want it to also include a Sex x Mature interaction, but it will not produce this. Any suggestions of how to get R to also do the interaction analysis?
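One possible cause, which the post itself does not confirm, is an empty Sex x Mature cell: when one combination of levels has no observations, the interaction is aliased and summary() silently leaves it out. The sketch below uses simulated data (the data frame d and its column names are assumptions) to show the check and the more compact data= form of the call.

```r
# Simulated data (assumption): 2 x 2 design with every cell populated
set.seed(42)
d <- data.frame(
  Sex    = factor(rep(c("F", "M"), each = 20)),
  Mature = factor(rep(c("no", "yes"), times = 20)),
  time   = rnorm(40)
)
print(table(d$Sex, d$Mature))   # check for empty cells first

# Sex * Mature expands to Sex + Mature + Sex:Mature
fit <- aov(time ~ Sex * Mature, data = d)
effects_full <- trimws(rownames(summary(fit)[[1]]))
print(effects_full)

# Remove one cell entirely: the interaction can no longer be estimated
d2 <- subset(d, !(Sex == "M" & Mature == "yes"))
fit2 <- aov(time ~ Sex * Mature, data = d2)
effects_missing <- trimws(rownames(summary(fit2)[[1]]))
print(effects_missing)
```

If table() on the real data shows a zero count, the missing interaction row is expected behaviour rather than a formula problem.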
Please excuse me if I have not formatted my code correctly, as I am new to the site. I also do not know how to provide sample data properly.
I have a data set of 42 obs. and 37 variables (the first column being the group, with 3 groups) of non-normally distributed data. I want to compare all 36 parameters between the 3 groups and then run a post hoc test (pairwise.wilcox?).
The data are flow cell counts for three different patient groups. I have been able to perform the initial comparison creating a formula and running an aov (though I would like to do Kruskal) but have not found a way to perform the post hoc to all variables in the same way.
#Data
Type Neutrophils Monocytes NKC .....
------------------------------------------
IN 546 2663 545
IN 0797 7979 008
OUT 0899 3899 345
OUT 6868 44533 689
HC 9898 43443 563
#Cbind all variable together to run model on all
formula <- as.formula(paste0("cbind(", paste(names(LessCount)[-1],
collapse = ","), ") ~ Type"))
print(formula)
#Run test on model
fit <- aov(formula, data=LessCount)
#Print results
summary(fit)
Response Neutrophils :
Df Sum Sq Mean Sq F value Pr(>F)
Type 2 18173966 9086983 1.8099 0.1771
Residuals 39 195806220 5020672
Response Monocytes :
Df Sum Sq Mean Sq F value Pr(>F)
Type 2 694945 347472 0.7131 0.4964
Residuals 39 19004809 487303
Response Mono.Classic :
Df Sum Sq Mean Sq F value Pr(>F)
Type 2 1561778 780889 2.5842 0.08833 .
Residuals 39 11785116 302182
###export anova####
capture.output(summary(fit),file="test1.csv")
#If Significant,Check which# (currently doing by hand individually)
pairwise.wilcox.test(LessCount$pDCs, LessCount$Type,
p.adjust.method = "BH")
I get a table of the aov results for every variable in my console, but I would like to do the same for the post hoc test, since I need every p-value.
Thank you in advance.
Maybe you can use the function kruskal.test() directly and extract the p-values.
Here is an example with the iris dataset. I use the function apply() in order to apply the kruskal.test function to each variable (except Species, which is the variable with group information).
data(iris)
apply(iris[-5], 2, function(x) kruskal.test(x = x, g = iris$Species)$p.value)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 8.918734e-22 1.569282e-14 4.803974e-29 3.261796e-29
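The same pattern can be extended to the post hoc step. This is a minimal sketch on iris, not on the original flow cytometry data. exact = FALSE is passed through to wilcox.test() because iris contains ties; whether that is appropriate for the real data is an assumption.

```r
data(iris)

# One matrix of BH-adjusted pairwise p-values per measured variable
posthoc <- lapply(iris[-5], function(x)
  pairwise.wilcox.test(x, iris$Species,
                       p.adjust.method = "BH",
                       exact = FALSE)$p.value)

# Each element is a lower-triangular matrix of group comparisons
print(posthoc$Sepal.Length)
```

Because posthoc is a named list, the matrix for any single variable can be pulled out by name or written to a file in a loop.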
I'm trying to understand how to properly run a repeated measures or nested ANOVA in R without using mixed models. According to the tutorials I consulted, the formula for a one-variable repeated measures ANOVA is:
aov(Y ~ IV+ Error(SUBJECT/IV) )
where IV is the within-subjects factor and SUBJECT identifies the subjects. However, most examples show output with two strata: Error: subject and Error: subject:WS. Meanwhile, I am getting three strata (Error: subject, Error: subject:WS, and Error: Within). Why do I have three strata when I'm trying to specify only two (within and between)?
Here is a reproducible example:
data(beavers)
id = rep(c("beaver1","beaver2"),times=c(nrow(beaver1),nrow(beaver2)))
data = data.frame(id=id,rbind(beaver1,beaver2))
data$activ=factor(data$activ)
aov(temp~activ+Error(id/activ),data=data)
temp is a continuous measure of temperature, id is the identity of the beaver, and activ is a binary factor for activity. The output of the model is:
Error: id
Df Sum Sq Mean Sq
activ 1 28.74 28.74
Error: id:activ
Df Sum Sq Mean Sq F value Pr(>F)
activ 1 15.313 15.313 18.51 0.145
Residuals 1 0.827 0.827
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 210 7.85 0.03738
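One way to explore this, offered as a sketch rather than a definitive answer: aov() adds a Within stratum whenever there are more observations than the Error() terms can account for, which happens here because each beaver has many readings per activity level. Collapsing to one mean per id x activ cell (the cell_means data frame below is an illustrative assumption) removes that extra stratum.

```r
data(beavers)
id <- rep(c("beaver1", "beaver2"), times = c(nrow(beaver1), nrow(beaver2)))
dat <- data.frame(id = factor(id), rbind(beaver1, beaver2))
dat$activ <- factor(dat$activ)

# Many replicate readings per id x activ cell
print(table(dat$id, dat$activ))

fit_full <- aov(temp ~ activ + Error(id / activ), data = dat)

# One observation (the mean) per id x activ cell
cell_means <- aggregate(temp ~ id + activ, data = dat, FUN = mean)
fit_agg <- aov(temp ~ activ + Error(id / activ), data = cell_means)

print(names(fit_full))  # strata for the raw data
print(names(fit_agg))   # strata for the aggregated data
```

With only two beavers the aggregated model has a single residual degree of freedom, so this illustrates where the stratum comes from rather than providing a usable test.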
I want to get the equation of the linear model for the following experiment, mat, which is arranged as a Latin square.
data <- c(12.5,11,13,11.4)
row <- factor(rep(1:2,2))
col <- factor(rep(1:2,each=2))
car <- c("B","A","A","B")
mat <- data.frame(row,col,car,data)
mat
# row col car data
# 1 1 1 B 12.5
# 2 2 1 A 11.0
# 3 1 2 A 13.0
# 4 2 2 B 11.4
I might recommend using a mixed model approach to this.
mat <- data.frame(data=c(12.5,11,13,11.4),
row=factor(rep(1:2,2)),
col=factor(rep(1:2,each=2)),
car=c("B","A","A","B"))
I'm using lmerTest because it will more easily provide you with (approximate) p-values.
By default anova() uses the Satterthwaite approximation, or you can tell it to use the more accurate Kenward-Roger approximation. In either case you can see that the denominator df are exactly or nearly zero, and the p-value is either missing or very close to 1, indicating that your model doesn't make sense (i.e., even as a mixed model it is overparameterized).
library("lmerTest")
anova(m1 <- lmer(data~car+(1|row)+(1|col),data=mat))
anova(m1,ddf="Kenward-Roger")
## Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
## car 0.0025 0.0025 1 9.6578e-06 2.0019 0.9999
Try for a bigger design:
set.seed(101)
mat2 <- data.frame(data=rnorm(36),
row=gl(6,6),
col=gl(6,1,36),
car=sample(LETTERS[1:2],size=36,replace=TRUE))
m2A <- lm(data~car+row+col,data=mat2)
anova(m2A)
## (excerpt)
## Df Sum Sq Mean Sq F value Pr(>F)
## car 1 1.2571 1.25709 1.6515 0.211
m2B <- lmer(data~car+(1|row)+(1|col),data=mat2)
anova(m2B)
## Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
## car 1.178 1.178 1 17.098 1.56 0.2285
anova(m2B,ddf="Kenward-Roger")
## Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
## car 1.178 1.178 1 17.005 1.1029 0.3083
It surprises me a little bit that the lm and lmerTest answers are so far apart here -- I would have thought this was an example where there was a well-formulated "classic" answer -- but I'm not sure. Might be worth following up on CrossValidated or Google.
fit <- lm(data~row+col+car,mat)
coef(fit)
# (Intercept) row2 col2 carB
# 12.55 -1.55 0.45 -0.05
So the effect of the row factor is -1.55, the effect of the col factor is 0.45, and the effect of the car factor is -0.05. The intercept term is the value of data expected when all the factors are at their first level (row=1, col=1, car=A).
Notice that your design is over-specified: you have only 4 pieces of data, which is enough to specify the effects of two factors and their interaction, but you have set it up so that car is the interaction. So there are no degrees of freedom left for error.
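The saturation can be checked directly. This small sketch rebuilds mat with car as an explicit factor and confirms that the fitted equation reproduces all four observations with zero residual degrees of freedom.

```r
mat <- data.frame(data = c(12.5, 11, 13, 11.4),
                  row  = factor(rep(1:2, 2)),
                  col  = factor(rep(1:2, each = 2)),
                  car  = factor(c("B", "A", "A", "B")))

fit <- lm(data ~ row + col + car, data = mat)

# Fitted equation, with indicator variables for the non-reference levels:
# data = 12.55 - 1.55*[row == 2] + 0.45*[col == 2] - 0.05*[car == "B"]
print(coef(fit))

# Four observations, four coefficients: nothing is left for error
print(df.residual(fit))
print(cbind(observed = mat$data, fitted = fitted(fit)))
```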
I'm learning R and trying to understand how lm() handles factor variables & how to make sense of the ANOVA table. I'm fairly new to statistics, so please be gentle with me.
Here's some movie data from Rotten Tomatoes. I'm trying to model the score of each movie based on the mean scores for all of the movies in 4 groups: those rated G, PG, PG-13, and R.
download.file("http://www.rossmanchance.com/iscam2/data/movies03RT.txt", destfile = "./movies.txt")
movies <- read.table("./movies.txt", sep = "\t", header = T, quote = "")
lm1 <- lm(movies$score ~ as.factor(movies$rating))
anova(lm1)
and the ANOVA output:
## Analysis of Variance Table
##
## Response: movies$score
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(movies$rating) 3 570 190 0.92 0.43
## Residuals 136 28149 207
I understand how to get all the numbers in this table, EXCEPT Sum Sq and Mean Sq for as.factor(movies$rating). Can someone please explain how that Sum Sq is calculated from my data? I know that Mean Sq is just Sum Sq divided by Df.
There are various ways to get that. One of them is to use the equation:
http://en.wikipedia.org/wiki/Sum_of_squares_(statistics)
SS_total = SS_reg + SS_error
So:
y = movies$score
sum((y - mean(y))^2) - sum(lm1$residuals^2)
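The same quantity can also be computed directly as the between-group sum of squares: for each group, the squared distance of the group mean from the grand mean, weighted by the group size. The sketch below uses the built-in PlantGrowth data set instead of the movies file (an assumption, made so that the example runs without a download).

```r
data(PlantGrowth)
y <- PlantGrowth$weight
g <- PlantGrowth$group

group_means <- tapply(y, g, mean)    # one mean per group
group_n     <- tapply(y, g, length)  # group sizes

# Between-group sum of squares: sum over groups of n_g * (mean_g - grand mean)^2
ss_between <- sum(group_n * (group_means - mean(y))^2)

# Compare with the Sum Sq that anova() reports for the factor
ss_table <- anova(lm(weight ~ group, data = PlantGrowth))[["Sum Sq"]][1]
print(all.equal(ss_between, ss_table))
```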