How to conduct an LSD test with interactions in R?

I have the following field data:
sowing_date<- rep(c("Early" ,"Normal"), each=12)
herbicide<- rep (rep(c("No" ,"Yes"), each=6),2)
nitrogen<- rep (rep(c("No" ,"Yes"), each=3),4)
Block<- rep(c("Block 1" ,"Block 2", "Block 3"), times=8)
Yield<- c(30,27,25,40,41,42,37,38,40,48,47,46,25,27,26,
41,41,42,38,39,42,57,59,60)
DataA<- data.frame(sowing_date,herbicide,nitrogen,Block,Yield)
and I ran a three-way ANOVA:
anova3way <- aov (Yield ~ sowing_date + herbicide + nitrogen +
sowing_date:herbicide + sowing_date:nitrogen +
herbicide:nitrogen + sowing_date:herbicide:nitrogen +
factor(Block), data=DataA)
summary(anova3way)
There is a three-way interaction among the 3 factors, so I'd like to see which combination gives the greatest yield.
I know how to compare mean differences for a single factor, as below, but how can I do that for interactions?
library(agricolae)
LSD_Test<- LSD.test(anova3way,"sowing_date")
LSD_Test
I'd like to check the mean differences under the three-way interaction, and also under the interaction between each pair of factors.
For example, I'd like to get this LSD result in R.
Could you tell me how I can do that?
Many thanks,

One way, which does take some manual work, is to encode the experimental parameters as -1 and 1 in order to properly separate the two- and three-parameter interactions.
Once the values are encoded, you can pull the residual degrees of freedom and the residual mean square (the residual sum of squares divided by its degrees of freedom) from the ANOVA model and pass them to the LSD.test function.
See the example below:
sowing_date<- rep(c("Early" ,"Normal"), each=12)
herbicide<- rep (rep(c("No" ,"Yes"), each=6),2)
nitrogen<- rep (rep(c("No" ,"Yes"), each=3),4)
Block<- rep(c("Block 1" ,"Block 2", "Block 3"), times=8)
Yield<- c(30,27,25,40,41,42,37,38,40,48,47,46,25,27,26,
41,41,42,38,39,42,57,59,60)
DataA<- data.frame(sowing_date,herbicide,nitrogen,Block,Yield)
anova3way <- aov (Yield ~ sowing_date * herbicide * nitrogen +
factor(Block), data=DataA)
summary(anova3way)
#Encode the experiment's parameters as -1 and 1
DataA$codeSD <- ifelse(DataA$sowing_date == "Early", -1, 1)
DataA$codeherb <- ifelse(DataA$herbicide == "No", -1, 1)
DataA$codeN2 <- ifelse(DataA$nitrogen == "No", -1, 1)
library(agricolae)
LSD_Test<- LSD.test(anova3way, c("sowing_date"))
LSD_Test
# Manually define the treatment and pass the residual degrees of freedom
# and the residual mean square (taken from the residuals of the ANOVA)
df_err <- df.residual(anova3way)        # 14
ms_err <- deviance(anova3way) / df_err  # residual sum of squares (about 34.3) divided by 14
# MSerror expects the mean square, not the sum of squares
print(LSD.test(y=DataA$Yield, trt=DataA$sowing_date, DFerror=df_err, MSerror=ms_err))
# Example for a two-parameter interaction
print(LSD.test(y=DataA$Yield, trt=(DataA$codeSD*DataA$codeherb), DFerror=df_err, MSerror=ms_err))
# Example for the three-parameter interaction
print(LSD.test(y=DataA$Yield, trt=(DataA$codeSD*DataA$codeherb*DataA$codeN2), DFerror=df_err, MSerror=ms_err))
# Calculate the means and standard deviations (as a check; requires dplyr)
#DataA %>% group_by(sowing_date) %>% summarize(mean=mean(Yield), sd=sd(Yield))
#DataA %>% group_by(codeSD*codeherb*codeN2) %>% summarize(mean=mean(Yield), sd=sd(Yield))
You will need to manually track which run/condition goes with the -1 and the 1 in the final report.
Edit:
My answer above will show the overall effect based on interactions, for example how the interaction of herbicide and nitrogen affects yield.
Based on your comment that you want to determine which combination provides the greatest yield, you can use the LSD.test() function again, this time passing a vector of parameter names.
LSD_Test<- LSD.test(anova3way, c("sowing_date", "herbicide", "nitrogen"))
LSD_Test
From the groups part of the output you can see that Normal, Yes and Yes gives the highest yield. The "groups" column identifies clusters of results that are not significantly different from each other: combinations that share a letter have similar yields. For example, the last 2 rows provide a similar yield.
...
$groups
                  Yield groups
Normal:Yes:Yes 58.66667      a
Early:Yes:Yes  47.00000      b
Normal:No:Yes  41.33333      c
Early:No:Yes   41.00000     cd
Normal:Yes:No  39.66667     cd
Early:Yes:No   38.33333      d
Early:No:No    27.33333      e
Normal:No:No   26.00000      e
...
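As a cross-check on the $groups table, the cell means and the LSD threshold can be recomputed with base R alone. This is a minimal sketch, assuming the data and model formula from the question; the names fit, cell_means, ms_err and lsd are illustrative and are not part of agricolae.

```r
# Rebuild the data from the question
sowing_date <- rep(c("Early", "Normal"), each = 12)
herbicide   <- rep(rep(c("No", "Yes"), each = 6), 2)
nitrogen    <- rep(rep(c("No", "Yes"), each = 3), 4)
Block       <- rep(c("Block 1", "Block 2", "Block 3"), times = 8)
Yield <- c(30,27,25,40,41,42,37,38,40,48,47,46,
           25,27,26,41,41,42,38,39,42,57,59,60)
DataA <- data.frame(sowing_date, herbicide, nitrogen, Block, Yield)

fit <- aov(Yield ~ sowing_date * herbicide * nitrogen + factor(Block), data = DataA)

# Mean yield of each of the 8 treatment combinations, largest first
cell_means <- aggregate(Yield ~ sowing_date + herbicide + nitrogen,
                        data = DataA, FUN = mean)
cell_means <- cell_means[order(-cell_means$Yield), ]
print(cell_means)

# Residual mean square and the least significant difference at alpha = 0.05
df_err <- df.residual(fit)
ms_err <- deviance(fit) / df_err
r      <- 3  # replicates (blocks) per treatment combination
lsd    <- qt(1 - 0.05 / 2, df_err) * sqrt(2 * ms_err / r)
print(lsd)
```

Two combinations whose means differ by less than lsd should share a letter in the groups column; a larger difference should put them under different letters.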

Related

How can I show significant comparisons from Tukey post-hoc test in ggplot2 bar plot?

I have a dataset with several variables that looks like this:
Competitor Disturbance Group MT CVt
1 M P A 17.416667 63.39274
2 M P A 11.055556 91.32450
3 M C N 13.928571 78.11438
4 B C N 13.500000 61.20542
5 B T E 12.700000 48.11819
6 B T E 27.250000 63.44356
I've made a GLMM (mMT1) with 3 predictors (Competitor, Disturbance and Group), one response (MT) and one random factor (Species, not shown in example dataset).
After fitting and checking the model, I calculated ls means with the package emmeans:
ls_MT <- emmeans(mMT1, pairwise~Disturbance*Competitor*Group, type="response")
And performed a post-hoc test:
post_MT <- emmeans(mMT1, transform="response", component="cond",list(~Disturbance|Competitor|Group,~Competitor|Group|Disturbance,~Group|Disturbance|Competitor))
pairs(post_MT)
Finally I produced a bar chart with ggplot2, based on the ls means and se
ggplot(ls_MT, aes(x=Disturbance, fill=Competitor, y=response))+
geom_bar(stat="identity",position=position_dodge())+
facet_grid(cols=vars(Group))+labs(y = "log10(MT)")+
scale_color_manual(values=c("#2ca02c","#d62728"))+
geom_errorbar(aes(ymin=ls_MT$lower, ymax=ls_MT$upper), width=.2,
position=position_dodge(.9))+
theme_light()
Which produces a plot like this:
At this point I'm struggling with two things:
1. Is there a way to have the bars side by side without faceting by group? It looks OK, but I'd prefer a single plot rather than 3 panels side by side.
2. How can I annotate the graph with the comparisons from the post-hoc test? I've seen that several functions can perform separate tests (e.g. stat_compare_means) and plot those results, but I can't find a solution for Tukey's post-hoc test. Another option could be to add
geom_signif(map_signif_level = c("***"=0.001, "**"=0.01, "*"=0.05), comparisons = list(c("P","C"),c("P","T"),c("C","T")))
but that pools together the B and M (Competitor) values of each Disturbance (P, T, C). In this case I'd actually like to compare B and M within each of P, T and C. Or even better, how can I specify between which bars I'd like to show the results of the post-hoc test? For example, I'd like to compare A-T-B with E-T-B, etc.

pairwise ANOVA of subset of data

I need to perform multiple pairwise ANOVAs in R and correct the p-values using Bonferroni. However, I don't need to compare every CLASS to every other. Below are my data format and selcontrasts, the pairs for which I need to contrast log10relquant. Does anyone know how I could do this? I use the dplyr, lsmeans and broom packages.
SEX EXPERIENCED AGE CLASS compound relquant log10relquant
1 FEMALE NO 1D 1F C14 0.004012910 -2.396541
2 FEMALE NO 1D 1F C14 0.003759812 -2.424834
3 FEMALE NO 1D 1F C14 0.003838553 -2.415832
4 FEMALE NO 1D 1F C14 0.003582754 -2.445783
5 MALE NO 1D 1M C14 0.005099237 -2.292495
6 MALE NO 1D 1M C14 0.005379093 -2.269291
selcontrasts <- c("1F - 1M", "4F - 4M", "4EF - 4EM",
"7F - 7M", "7EF - 7EM", # sex differences
"1M - 4M", "4M - 7M", "1M - 7M", "1F - 4F",
"4F - 7F", "1F - 7F", # age differences
"4M - 4EM", "7M - 7EM", "4F - 4EF",
"7F - 7EF") # social experience
x=list(selcontrasts)
Currently I'm using this to run pairwise comparisons over the whole dataset (so comparing every class with every other) instead of only the selected contrasts:
pvalsage=data.frame(datagr %>%
do( data.frame(summary(contrast(lsmeans(
aov(log10relquant ~ CLASS, data = .), ~ CLASS ),
method="pairwise",adjust="none"))) ))
To only do the selected contrasts in list x, I tried:
pvalsage=data.frame(datagr %>%
do(data.frame(summary(contrast(lsmeans(
aov(log10relquant ~ CLASS, data = .),~ CLASS),
method = x, adjust="none"))) ))
But I get the error:
error in contrast.ref.grid(lsmeans(aov(log10relquant ~ CLASS, data = .), :
Nonconforming number of contrast coefficients
If I understand the question correctly (and I very well might not), there are really three factors involved: SEX (two levels), EXPERIENCED (two levels), and AGE (3 levels, namely 1, 4, and 7). And what is required is separate comparisons of the levels of each factor, for each combination of the other two.
If that is the case, then combining the three factors into one named CLASS just makes it a lot harder, because it makes it much harder to keep track of the levels of the factors separately. What's simpler is to fit a model that accounts for all three factors, estimate the means for each factor combination, and then do the required comparisons using by variables. Thus, for each dataset dat, you do:
require(emmeans)
mod = aov(log10relquant ~ SEX * EXPERIENCED * AGE, data = dat)
emm = emmeans(mod, ~ SEX * EXPERIENCED * AGE)
rbind(pairs(emm, by = c("EXPERIENCED", "AGE")),
pairs(emm, by = c("SEX", "EXPERIENCED")),
pairs(emm, by = c("SEX", "AGE")),
adjust = "bonferroni")
I did not try to embed this in the functional-programming paradigm; I'll leave it to the OP to figure out those details.
Note: The emmeans package (estimated marginal means) is a continuation of lsmeans, which will be deprecated in the future. It works the same way.
PS -- Looking at the code in the question, I am concerned that the end results will not show the actual estimates being compared (the EMMs), only the comparisons; and the further implication in the naming that really, only P values are sought. This grates on me. I don't like to watch people go straight to statistical tests without even looking at the quantities being tested.
You could run all the pairwise contrasts anyway, filter the rows that match your selcontrasts into a new data frame, and then apply p.adjust with method = "bonferroni" to only the contrasts you are interested in.
Or you could write a mycontr.lsmc function that defines selcontrasts and use that as method = "mycontr".
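The filter-then-adjust suggestion can be sketched in base R as follows. This is only an illustration: the all_pairs data frame and its p-values are hypothetical stand-ins for the unadjusted pairwise results, and the column names contrast and p.value are assumed to match what the summary of the contrasts returns.

```r
# Hypothetical unadjusted results; in practice this would be the data frame
# produced by the pairwise contrast call with adjust = "none"
all_pairs <- data.frame(
  contrast = c("1F - 1M", "1F - 4F", "1F - 4M", "1M - 4M", "4F - 4M"),
  p.value  = c(0.004, 0.030, 0.200, 0.012, 0.650)
)

# Contrasts of interest (a shortened version of selcontrasts)
selcontrasts <- c("1F - 1M", "4F - 4M", "1M - 4M", "1F - 4F")

# Keep only the selected rows, then correct for that number of tests
sel <- all_pairs[all_pairs$contrast %in% selcontrasts, ]
sel$p.bonf <- p.adjust(sel$p.value, method = "bonferroni")
print(sel)
```

The labels have to match exactly, including the order within each pair ("1F - 1M" is not the same string as "1M - 1F").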

Confusing p values with ANOVA on a big dataframe

I am trying to analyse the significant differences between different car company performance values across different countries. I am using ANOVA to do this.
Running ANOVA on my real dataset (30 countries, 1000 car companies and 90000 measurement scores) gave every car a zero p-value.
Confused by this, I created a reproducible example (below) with 30 groups, 3 car companies and 90000 random scores. I deliberately kept a constant score of 1 for the Benz company, so there should be no difference between countries. After running the ANOVA, I see a p-value of 0.46 instead of 1.
Does anyone know why this is?
Reproducible example
set.seed(100000)
qqq <- 90000
df = data.frame(id = c(1:90000), country = c(rep("usa",3000), rep("usb",3000), rep("usc",3000), rep("usd",3000), rep("use",3000), rep("usf",3000), rep("usg",3000), rep("ush",3000), rep("usi",3000), rep("usj",3000), rep("usk",3000), rep("usl",3000), rep("usm",3000), rep("usn",3000), rep("uso",3000), rep("usp",3000), rep("usq",3000), rep("usr",3000), rep("uss",3000), rep("ust",3000), rep("usu",3000), rep("usv",3000), rep("usw",3000), rep("usx",3000), rep("usy",3000), rep("usz",3000), rep("usaa",3000), rep("usab",3000), rep("usac",3000), rep("usad",3000)), tesla=runif(90000), bmw=runif(90000), benz=rep(1, each=qqq))
str(df)
out<-data.frame()
for(j in 3:ncol(df)){
amod2 <- aov(df[,j]~df$country)
out[(j-2),1]<-colnames(df)[j]
out[(j-2),2]<-summary(amod2, test = adjusted("bonferroni"))[[1]][[1,"Pr(>F)"]]
}
colnames(out)<-c("cars","pvalue")
write.table(out,"df.output")
df.output
"cars" "pvalue"
"1" "tesla" 0.245931589754359
"2" "bmw" 0.382730335188437
"3" "benz" 0.465083026215268
With respect to the "benz" p-value in your reproducible example: an ANOVA requires positive variance (i.e., non-constant data). If you violate this assumption, the model is degenerate. Technically, the p-value is based on an F-statistic, which is the ratio of the mean square attributable to the "country" effect (for "benz" in your example, zero) to the residual mean square (for "benz" in your example, also zero), so your F-statistic has "value" 0/0, or NaN.
Because of the approach R takes to calculating the F-statistic (using a QR matrix decomposition to improve numerical stability in "nearly" degenerate cases), it calculates an F-statistic equal to 1 (with 29 and 89970 degrees of freedom). This gives a p-value of:
> pf(1, 29, 89970, lower=FALSE)
[1] 0.465083
>
but it is, of course, largely meaningless.
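The degrees of freedom in that call follow directly from the design, which a short sketch makes explicit. The variable names are illustrative; the counts (30 countries, 90000 rows) are those of the reproducible example.

```r
n_groups <- 30       # countries
n_obs    <- 90000    # rows in df

df_between <- n_groups - 1       # numerator degrees of freedom
df_within  <- n_obs - n_groups   # denominator (residual) degrees of freedom

# Upper-tail probability of F = 1 under the F(29, 89970) distribution
p <- pf(1, df_between, df_within, lower.tail = FALSE)
print(p)
```

A simple guard against this situation is to skip any column whose variance is zero before calling aov().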
With respect to your original problem, with large datasets relatively small effects will yield very small p-values. For example, if you add the following after your df definition above to introduce a difference in country usa:
df = within(df, {
o = country=="usa"
tesla[o] = tesla[o] + .1
bmw[o] = bmw[o] + .1
benz[o] = benz[o] + .1
rm(o)
})
you will find that out looks like this:
> out
cars pvalue
1 tesla 9.922166e-74
2 bmw 5.143542e-74
3 benz 0.000000e+00
>
Is this what you're seeing, or are you seeing all of them exactly zero?
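To see how a tiny p-value can sit next to a small effect, it can help to report an effect size alongside it. The sketch below assumes a simulated layout like the one above (30 countries of 3000 rows, a 0.1 shift in one country) and uses eta squared, the share of the total sum of squares explained by country; the country labels and variable names are made up for illustration.

```r
set.seed(1)
country <- factor(rep(sprintf("c%02d", 1:30), each = 3000))
score   <- runif(90000)

# Introduce a small shift in a single country
shifted <- country == "c01"
score[shifted] <- score[shifted] + 0.1

tab <- summary(aov(score ~ country))[[1]]

p_value <- tab[1, "Pr(>F)"]                          # country row
eta_sq  <- tab[1, "Sum Sq"] / sum(tab[, "Sum Sq"])   # explained share of variation

print(p_value)
print(eta_sq)
```

With this many observations the p-value is extremely small even though country accounts for well under 1% of the variation in the scores.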

T-test with grouping variable

I've got a data frame with 36 variables and 74 observations. I'd like to run a two-sample paired t-test on 35 of the variables, using the remaining variable (with two levels) as the grouping variable.
For example: the data frame contains "age", "body weight" and "group" variables.
I suppose I can do the t-test for each variable with this code:
t.test(age~group)
But is there a way to test all 35 variables with one piece of code, rather than one by one?
Sven has provided you with a great way of implementing what you wanted to have implemented. I, however, want to warn you about the statistical aspect of what you are doing.
Recall that if you are using the standard significance level of 0.05, each t-test performed has a 5% chance of a Type 1 error (incorrectly rejecting a true null hypothesis). Running 35 individual t-tests inflates the probability of making at least one Type 1 error far beyond 5%. Assuming the tests are independent and every null hypothesis is true:
Pr(at least one Type 1 error) = 1 - (0.95)^35 = 0.834
This means you have about an 83.4% chance of falsely rejecting at least one null hypothesis. In other words, by running so many t-tests, there is a very high probability that at least one of them is going to give an incorrect result.
Just FYI.
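The figure above, and one common remedy, can be sketched in a few lines of R. The p_raw values are hypothetical and only serve to show the call; Bonferroni is used here as one possible correction, not necessarily the best one for your data.

```r
alpha <- 0.05
m     <- 35   # number of t-tests

# Probability of at least one false rejection across m independent tests
fwer <- 1 - (1 - alpha)^m
print(round(fwer, 3))

# Bonferroni correction: multiply each p-value by the number of tests (capped at 1)
p_raw <- c(0.001, 0.010, 0.020, 0.040, 0.300)   # hypothetical raw p-values
p_adj <- p.adjust(p_raw, method = "bonferroni", n = m)
print(p_adj)
```

Setting n = m tells p.adjust() that the five p-values shown belong to a family of 35 tests.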
An example data frame:
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
weight = rnorm(10, 30), group = gl(2,5))
You can use lapply:
lapply(dat[1:3], function(x)
t.test(x ~ dat$group, paired = TRUE, na.action = na.pass))
In the command above, 1:3 represents the numbers of the columns including the variables. The argument paired = TRUE is necessary to perform a paired t-test.
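Newer R releases (4.4.0 onward, as far as I know) no longer accept paired = TRUE in the formula interface of t.test(). A variant that avoids the formula interface is sketched below; it assumes the rows are ordered so that the i-th observation in group 1 is paired with the i-th observation in group 2, and the names vars, res and p_values are illustrative.

```r
set.seed(42)
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
                  weight = rnorm(10, 30), group = gl(2, 5))

# All columns except the grouping variable
vars <- setdiff(names(dat), "group")

res <- lapply(dat[vars], function(x) {
  g <- split(x, dat$group)                # values of x for each group level
  t.test(g[[1]], g[[2]], paired = TRUE)   # paired test on the two vectors
})

# Collect the p-values into a named vector
p_values <- sapply(res, function(r) r$p.value)
print(p_values)
```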

Gompertz Aging analysis in R

I have survival data from an experiment in flies that examines rates of aging in various genotypes. The data are available to me in several layouts, so the choice of layout is up to you, whichever suits the answer best.
One data frame (wide.df) looks like this: each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98, with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2<-data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2)<-c("Exp","Day4","Day6","Day8","Day10","Day12","Day14","Day16","Day18","Day20","Day22","Day24","Day26","Day28","Day30","Day32","Day34","Day36")
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2<-data.frame(c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),c(0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2),c(4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36))
colnames(df2)<-c("Exp","Deaths","Day")
What I would like to do is perform a Gompertz Analysis (See second paragraph of "the life table" here). The equation is:
μx = α * e^(β*x)
Where μx is probability of death at a given time, α is initial mortality rate, and β is the rate of aging.
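Taking logarithms of this equation makes the roles of the two parameters easier to see, since the model becomes a straight line in x:

```latex
\mu_x = \alpha e^{\beta x}
\quad\Longrightarrow\quad
\ln \mu_x = \ln \alpha + \beta x
```

On a plot of log mortality against age, ln α is the intercept and β is the slope.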
I would like to be able to get a dataframe which has α and β estimates for each of my ~640 genotypes for further analysis later.
I need help going from the above dataframes to an output of these values for each of my genotypes in R.
I have looked through the package flexsurv which may house the answer but I have failed in attempts to find and implement it.
This should get you started...
Firstly, for the flexsurvreg function to work, you need to specify your input data as a Surv object (from package:survival). This means one row per observation.
The first thing is to re-create the 'raw' data from the summary tables you provide.
(I know rbind is not efficient, but you can always switch to data.table for large sets).
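### get rows with >1 death (columns 2:3 are Deaths and Day)
df3 <- df2[df2$Deaths>1, 2:3]
### expand to give one row per death per time (repeat each Day by its Deaths count)
df3 <- sapply(df3, FUN=function(x) rep(df3[, 2], df3[, 1]))
### each death is 1 (occurs once)
df3[, 1] <- 1
### add this to the rows with <=1 death
### (note: rows with 0 deaths enter the Surv object as censored observations)
df3 <- rbind(df3, df2[!df2$Deaths>1, 2:3])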
### convert to Surv object
library(survival)
s1 <- with(df3, Surv(Day, Deaths))
### get parameters for Gompertz distribution
library(flexsurv)
f1 <- flexsurvreg(s1 ~ 1, dist="gompertz")
giving
> f1$res
est L95% U95%
shape 0.165351912 0.1281016481 0.202602176
rate 0.001767956 0.0006902161 0.004528537
Note that this is an intercept-only model as all your genotypes are A.
You can loop this over multiple survival objects once you have re-created the per-observation data as above.
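One possible way to write that loop is sketched below. It assumes the long layout from the question; genotype "B" is a hypothetical copy of "A" added only so that there is something to loop over, and the names long.df, per_fly and estimates are illustrative. Unlike the code above, rows with zero deaths are simply dropped rather than entering as censored observations, so the estimates need not match the f1$res output.

```r
# Toy input in the long layout; "B" is a made-up duplicate of "A"
deaths <- c(0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
days   <- seq(4, 36, by = 2)
long.df <- data.frame(Exp    = rep(c("A", "B"), each = length(days)),
                      Deaths = rep(deaths, 2),
                      Day    = rep(days, 2))

# One row per death: repeat each row Deaths times (rows with 0 deaths drop out)
per_fly <- long.df[rep(seq_len(nrow(long.df)), long.df$Deaths), c("Exp", "Day")]
per_fly$status <- 1   # every row is an observed death; no censoring is assumed

# Fit one intercept-only Gompertz model per genotype, if flexsurv is installed
if (requireNamespace("flexsurv", quietly = TRUE)) {
  library(survival)   # provides Surv()
  library(flexsurv)
  fits <- lapply(split(per_fly, per_fly$Exp), function(d)
    flexsurvreg(Surv(Day, status) ~ 1, data = d, dist = "gompertz"))
  # Collect the point estimates: shape corresponds to beta, rate to alpha
  est <- do.call(rbind, lapply(fits, function(f) f$res[, "est"]))
  estimates <- data.frame(Exp   = rownames(est),
                          beta  = est[, "shape"],
                          alpha = est[, "rate"])
  print(estimates)
}
```

With ~640 genotypes it may be worth wrapping the flexsurvreg() call in tryCatch() so that a single fit that fails to converge does not stop the whole loop.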
From the flexsurv docs:
Gompertz distribution with shape parameter a and rate parameter
b has hazard function
H(x: a, b) = b.e^{ax}
So it appears your alpha is b, the rate, and beta is a, the shape.
