leveneTest between 2 levels of a 3-level factor in R?

What I have done so far:
I have a data.frame results with response Fail, and three factors PREP, CLEAN & ADHES.
ADHES has 3 levels: Crest, Cryst, Poly.
I calculated the variances:
sigma..k <- tapply(Fail, ADHES, var)
print(sqrt(sigma..k))
   Crest    Cryst     Poly
17.56668 41.64679 39.42669
then used leveneTest to test for constancy of variance:
print(leveneTest(Fail ~ ADHES))
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)
group  2   3.929 0.02588 *
      51
The Question:
Now I want to use Levene's test between only the Cryst and Poly levels of the factor ADHES, but I can't work out the R syntax to do this.

Thanks to the hint @PauloCardoso gave me, I worked it out:
leveneTest(subset(results, ADHES == 'Cryst' | ADHES == 'Poly')[, 5],
           subset(results, ADHES == 'Cryst' | ADHES == 'Poly')[, 3])
('Fail' & 'ADHES' are columns 5 & 3 respectively in my data.frame 'results')
Thanks a lot!
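A tidier equivalent, as a sketch (this assumes leveneTest comes from the car package, and that the unused Crest level should be dropped before testing):
# a sketch: subset once, drop the unused level, use the formula interface
library(car)
sub <- droplevels(subset(results, ADHES %in% c("Cryst", "Poly")))
leveneTest(Fail ~ ADHES, data = sub)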


How to conduct LSD test with interactions in R?

I have field data:
sowing_date <- rep(c("Early", "Normal"), each = 12)
herbicide <- rep(rep(c("No", "Yes"), each = 6), 2)
nitrogen <- rep(rep(c("No", "Yes"), each = 3), 4)
Block <- rep(c("Block 1", "Block 2", "Block 3"), times = 8)
Yield <- c(30, 27, 25, 40, 41, 42, 37, 38, 40, 48, 47, 46, 25, 27, 26,
           41, 41, 42, 38, 39, 42, 57, 59, 60)
DataA <- data.frame(sowing_date, herbicide, nitrogen, Block, Yield)
and I conducted a 3-way ANOVA:
anova3way <- aov (Yield ~ sowing_date + herbicide + nitrogen +
sowing_date:herbicide + sowing_date:nitrogen +
herbicide:nitrogen + sowing_date:herbicide:nitrogen +
factor(Block), data=DataA)
summary(anova3way)
There is a 3-way interaction among the 3 factors, so I'd like to see which combination shows the greatest yield.
I know how to compare mean differences for a single factor, as below, but how can I do that in the case of interactions?
library(agricolae)
LSD_Test<- LSD.test(anova3way,"sowing_date")
LSD_Test
For example, I'd like to check the mean differences under the 3-way interaction, and also under each two-factor interaction, and get the corresponding LSD result in R.
Could you tell me how I can do that?
Many thanks,
One way, which does take some manual work, is to encode the experimental factors as -1 and 1 in order to properly separate the two- and three-factor interactions.
Once you have the values encoded, you can pull the residual degrees of freedom and the residual mean square from the ANOVA model and pass them to the LSD.test function.
See the example below:
sowing_date <- rep(c("Early", "Normal"), each = 12)
herbicide <- rep(rep(c("No", "Yes"), each = 6), 2)
nitrogen <- rep(rep(c("No", "Yes"), each = 3), 4)
Block <- rep(c("Block 1", "Block 2", "Block 3"), times = 8)
Yield <- c(30, 27, 25, 40, 41, 42, 37, 38, 40, 48, 47, 46, 25, 27, 26,
           41, 41, 42, 38, 39, 42, 57, 59, 60)
DataA <- data.frame(sowing_date, herbicide, nitrogen, Block, Yield)
anova3way <- aov (Yield ~ sowing_date * herbicide * nitrogen +
factor(Block), data=DataA)
summary(anova3way)
#Encode the experiment's parameters as -1 and 1
DataA$codeSD <- ifelse(DataA$sowing_date == "Early", -1, 1)
DataA$codeherb <- ifelse(DataA$herbicide == "No", -1, 1)
DataA$codeN2 <- ifelse(DataA$nitrogen == "No", -1, 1)
library(agricolae)
LSD_Test<- LSD.test(anova3way, c("sowing_date"))
LSD_Test
#Manually defining the treatment and specifying the
# degrees of freedom and mean square error (from the residuals of the ANOVA)
print(LSD.test(y=DataA$Yield, trt=DataA$sowing_date, DFerror=14, MSerror=34.3))
#Example for a two parameter value
print(LSD.test(y=DataA$Yield, trt=(DataA$codeSD*DataA$codeherb), DFerror=14, MSerror=34.3))
print(LSD.test(y=DataA$Yield, trt=(DataA$codeSD*DataA$codeherb*DataA$codeN2), DFerror=14, MSerror=34.3))
#calculate the means and SDs (as a check; requires dplyr)
#DataA %>% group_by(sowing_date) %>% summarize(mean=mean(Yield), sd=sd(Yield))
#DataA %>% group_by(codeSD*codeherb*codeN2) %>% summarize(mean=mean(Yield), sd=sd(Yield))
You will need to manually track which run/condition goes with the -1 and 1 in the final report.
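Rather than hard-coding DFerror=14 and MSerror=34.3, here is a sketch of how to pull them from the fitted model (assuming the anova3way object above):
# a sketch: extract the residual df and residual mean square from the model
DFerr <- df.residual(anova3way)        # 14 for this design
MSerr <- deviance(anova3way) / DFerr   # residual SS divided by residual df
print(LSD.test(y = DataA$Yield, trt = DataA$sowing_date,
               DFerror = DFerr, MSerror = MSerr))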
Edit:
My answer above shows the overall effect of the interactions, for example how the interaction of herbicide and nitrogen affects yield.
Based on your comment that you want to determine which combination provides the greatest yield, you can use the LSD.test() function again, but passing a vector of factor names.
LSD_Test<- LSD.test(anova3way, c("sowing_date", "herbicide", "nitrogen"))
LSD_Test
From the $groups part of the output you can see that Normal, Yes and Yes gives the optimal yield. The "groups" column identifies clusters of results that are not significantly different; for example, the last 2 rows provide similar yields.
...
$groups
                  Yield groups
Normal:Yes:Yes 58.66667      a
Early:Yes:Yes  47.00000      b
Normal:No:Yes  41.33333      c
Early:No:Yes   41.00000     cd
Normal:Yes:No  39.66667     cd
Early:Yes:No   38.33333      d
Early:No:No    27.33333      e
Normal:No:No   26.00000      e
...
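The same pattern should work for the two-factor combinations, as a sketch analogous to the three-factor call above:
# a sketch: means and groupings for each sowing_date x herbicide combination
LSD_Test2 <- LSD.test(anova3way, c("sowing_date", "herbicide"))
LSD_Test2$groups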

Sign of Cohen's d is unaffected by reversing order of factor levels in R

I'm using Cohen's d (implemented using cohen.d() from the effsize package) as a measure of effect size in my dependent variable between two levels of a factor.
My code looks like this: cohen.d(d, f) where d is a vector of numeric values and f is a factor with two levels: "A" and "B".
Based on my understanding, the sign of Cohen's d is dependent on the order of means (i.e. factor levels) entered into the formula. However, my cohen.d() command returns a negative value (and negative CIs), even if I reverse the order of levels in f.
Here is a reproducible example:
library('effsize')
# Load in Chickweight data
a <- ChickWeight
# Cohen's d requires two levels in factor f, so take the first two in Diet
# (note: %in%, not ==, is needed to subset on multiple values)
a <- a[a$Diet %in% c(1, 2), ]
a$Diet <- droplevels(a$Diet)
# Compute cohen's d with default order of Diet
d1 = a$weight
f1 = a$Diet
cohen1 = cohen.d(d1,f1)
# Re-order levels of Diet
a$Diet = relevel(a$Diet, ref=2)
# Re-compute cohen's d
d2 = a$weight
f2 = a$Diet
cohen2 = cohen.d(d2,f2)
# Compare values
cohen1
cohen2
Can anyone explain why this is the case, and/or if I'm doing something wrong?
Thanks in advance for any advice!
I'm not entirely sure of the reasoning behind the issue in your example (maybe someone else can comment here), but if you look at the examples under ?cohen.d, there are a few different ways to call it:
treatment = rnorm(100,mean=10)
control = rnorm(100,mean=12)
d = (c(treatment,control))
f = rep(c("Treatment","Control"),each=100)
## compute Cohen's d
## treatment and control
cohen.d(treatment,control)
## data and factor
cohen.d(d,f)
## formula interface
cohen.d(d ~ f)
If you use the first example of cohen.d(treatment, control) and reverse that to cohen.d(control, treatment) you get the following:
cohen.d(treatment, control)
Cohen's d
d estimate: -1.871982 (large)
95 percent confidence interval:
inf sup
-2.206416 -1.537547
cohen.d(control, treatment)
Cohen's d
d estimate: 1.871982 (large)
95 percent confidence interval:
inf sup
1.537547 2.206416
So using the two-vector method from the examples with your data, we can do:
a1 <- a[a$Diet == 1,"weight"]
a2 <- a[a$Diet == 2,"weight"]
cohen3a <- cohen.d(a1, a2)
cohen3b <- cohen.d(a2, a1)
I noticed that f in the ?cohen.d examples is not a factor but a character vector. I tried playing around with the cohen.d(d, f) method but didn't find a solution; I'd like to see if someone else has anything regarding that.
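Until that's explained, a hedged workaround is to split the data by factor level yourself, so the sign is pinned to an order you choose:
# a sketch: pick the groups explicitly, so d is computed as g1 minus g2
g1 <- d1[f1 == levels(f1)[1]]
g2 <- d1[f1 == levels(f1)[2]]
cohen.d(g1, g2)  # swapping the two arguments flips the sign, as shown above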

Am I following the correct procedures with the dunn.test function?

I tested differences among sampling sites in abundance values using kruskal.test, and now I want to determine which pairs of sites differ.
The dunn.test function can take either a data vector with a categorical grouping vector, or a formula expression as in lm.
I wrote the call so it works on a data frame with many columns, but I have not found an example that confirms my procedure.
library(dunn.test)
df<-data.frame(a=runif(5,1,20),b=runif(5,1,20), c=runif(5,1,20))
kruskal.test(df)
dunn.test(df)
My results were:
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.04929
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.05
Comparison of df by group
Between 1 and 2 2.050609, 0.0202
Between 1 and 3 -0.141421, 0.4438
Between 2 and 3 -2.192031, 0.0142
I took a look at your code, and you are close. One issue is that you should specify a method to correct for multiple comparisons, using the method argument.
Correcting for Multiple Comparisons
For your example data, I'll use the Benjamini-Yekutieli variant of the False Discovery Rate (FDR). The reasons why I think this is a good performer for your data are beyond the scope of StackOverflow, but you can read more about it and other correction methods here. I also suggest you read the associated papers; most of them are open-access.
library(dunn.test)
set.seed(711) # set pseudorandom seed
df <- data.frame(a = runif(5, 1, 20),
                 b = runif(5, 1, 20),
                 c = runif(5, 1, 20))
dunn.test(df, method = "by") # correct for multiple comparisons using "B-Y" procedure
# Output
data: df and group
Kruskal-Wallis chi-squared = 3.62, df = 2, p-value = 0.16
Comparison of df by group
(Benjamini-Yekuteili)
Col Mean-|
Row Mean |          1          2
---------+----------------------
       2 |   0.494974
         |     0.5689
         |
       3 |  -1.343502  -1.838477
         |     0.2463     0.1815
alpha = 0.05
Reject Ho if p <= alpha/2
Interpreting the Results
The first row in each cell provides the Dunn's pairwise z test statistic for each comparison, and the second row provides your corrected p-values.
Notice that, once corrected for multiple comparisons, none of your pairwise tests are significant at an alpha of 0.05, which is not surprising given that each of your example "sites" was generated by exactly the same distribution. I hope this has been helpful. Happy analyzing!
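One more reference point: a sketch of the vector-plus-group interface the question mentions. Reshaping the wide data frame to long format should give the same comparisons:
# a sketch: the x, g calling convention of dunn.test
long <- stack(df)                        # columns: values, ind (source column)
dunn.test(long$values, long$ind, method = "by")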
P.S. In the future, you should use set.seed() if you're going to construct example data frames using runif (or any other kind of pseudorandom number generation). Also, if you have other questions about statistical analysis, it's better to ask at https://stats.stackexchange.com/.

Confusing p values with ANOVA on a big dataframe

I am trying to analyse the significant differences between different car company performance values across different countries. I am using ANOVA to do this.
Running ANOVA on my real dataset (30 countries, 1000 car companies and 90000 measurement scores) gave every car a zero p-value.
Confused by this, I created a reproducible example (below) with 30 groups, 3 car companies and 90000 random scores. I purposely kept a constant score of 1 for the Benz company, where you shouldn't see any difference between countries. After running the ANOVA, I see a p-value of 0.46 instead of 1.
Does anyone know why this is?
Reproducible example
set.seed(100000)
qqq <- 90000
countries <- c(paste0("us", letters), paste0("usa", c("a", "b", "c", "d")))
df <- data.frame(id = 1:qqq,
                 country = rep(countries, each = 3000),
                 tesla = runif(qqq),
                 bmw = runif(qqq),
                 benz = rep(1, qqq))
str(df)
out <- data.frame()
for (j in 3:ncol(df)) {
  amod2 <- aov(df[, j] ~ df$country)
  out[j - 2, 1] <- colnames(df)[j]
  # note: summary.aov() has no 'test' argument, so the original
  # test = adjusted("bonferroni") was not doing anything here
  out[j - 2, 2] <- summary(amod2)[[1]][1, "Pr(>F)"]
}
colnames(out)<-c("cars","pvalue")
write.table(out,"df.output")
df.output
"cars" "pvalue"
"1" "tesla" 0.245931589754359
"2" "bmw" 0.382730335188437
"3" "benz" 0.465083026215268
With respect to the "benz" p-value in your reproducible example: an ANOVA analysis requires positive variance (i.e., non-constant data). If you violate this assumption, the model is degenerate. Technically, the p-value is based on an F-statistic whose value is a normalized ratio of the variance attributable to the "country" effect (for "benz" in your example, zero) divided by the total variance (for "benz" in your example, zero), so your F-statistic has "value" 0/0 or NaN.
Because of the approach R takes to calculating the F-statistic (using a QR matrix decomposition to improve numerical stability in "nearly" degenerate cases), it calculates an F-statistic equal to 1 (w/ 29 and 89970 degrees of freedom). This gives a p-value of:
> pf(1, 29, 89970, lower=FALSE)
[1] 0.465083
>
but it is, of course, largely meaningless.
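A quick way to catch such degenerate columns up front, as a sketch:
# a sketch: screen the response columns for zero variance before fitting
sapply(df[3:ncol(df)], var)   # benz returns 0, so its F statistic is 0/0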
With respect to your original problem, with large datasets relatively small effects will yield very small p-values. For example, if you add the following after your df definition above to introduce a difference in country usa:
df <- within(df, {
  o <- country == "usa"
  tesla[o] <- tesla[o] + .1
  bmw[o] <- bmw[o] + .1
  benz[o] <- benz[o] + .1
  rm(o)
})
you will find that out looks like this:
> out
   cars       pvalue
1 tesla 9.922166e-74
2   bmw 5.143542e-74
3  benz 0.000000e+00
>
Is this what you're seeing, or are you seeing all of them exactly zero?

How would I create an index to be used in regression?

I have 2 continuous variables, each taking values in the range [0, 1]. Each can be categorized as Low (≤ 0.25), Medium (0.25-0.70) and High (≥ 0.70). I need to create an index using both variables and use this index in a regression model. The generated index follows this truth table:
Var1 \ Var2 |  Low  | Medium |  High  |
=======================================
Low         |  Low  |  Low   |  Low   |
Medium      |  Low  | Medium | Medium |
High        |  Low  | Medium |  High  |
=======================================
Straightforward multiplication of the two variables is not the solution, as some value pairs would wrongly yield a Medium output (var1 = 0.75 and var2 = 0.8, whose product is 0.6, for example).
In the model, I would like to use the index expression (rather than the categorical transformation), to preserve the variation in the data.
What f(var1, var2) will give me this index for use in lm in R?
Help!!!
I do not know whether there is a built-in function for this, and I couldn't find one instantly. Can you use something like the following?
get_index <- function(var1, var2) {
  if (var1 < 0 || var1 > 1 || var2 < 0 || var2 > 1)
    return("out of range")
  low <- min(var1, var2)
  if (low < 0.25)
    return("Low")
  if (low <= 0.70)
    return("Medium")
  return("High")
}
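Note that get_index() works on single values; for whole columns, one hedged option is to wrap it:
# a sketch: vectorize the scalar helper for use on data-frame columns
get_index_v <- Vectorize(get_index)
get_index_v(c(0.10, 0.50, 0.90), c(0.80, 0.60, 0.95))  # "Low" "Medium" "High"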
How about:
cfun <- function(x) cut(x, c(-0.01, 0.25, 0.7, 1.01),
                        labels = c("low", "medium", "high"))
var1c <- cfun(var1)
var2c <- cfun(var2)
comb <- ifelse(var1c == "low" | var2c == "low", "low",
               ifelse(var1c == "medium" | var2c == "medium", "medium",
                      "high"))
or actually, as suggested by other answers (pmin rather than min, so it works on whole vectors):
cfun(pmin(var1, var2))
After re-reading your request, my (second) guess is that you want only the "numerical index", dispensing with the character labels. If entered as a numerical variable in a regression formula, the p-value for that synthetic variable would give you a "test of trend" for the joint "minimum" discretized level condition.
inter.n <- pmin(findInterval(x, c(0, .25, .7, 1)),
                findInterval(y, c(0, .25, .7, 1)))
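As a sketch of that regression use (resp here is a hypothetical response vector, not something from the question):
# hypothetical sketch: regress an assumed response on the numeric index
fit <- lm(resp ~ inter.n)
summary(fit)  # the inter.n coefficient tests the linear trend over levels 1-3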
Earlier comments:
At the moment it is unclear how you want the inequalities to work at the boundaries. The findInterval function can be used when the intervals are closed on either the right (the default) or the left. You say: "Low (≤ 0.25), Medium (0.25-0.70) and High (≥ 0.7)", which would make a value of 0.25 or 0.7 a member of two groups. There is fairly simple code for Low (< 0.25), Medium (≥ 0.25 and < 0.70) and High (≥ 0.7):
x <- runif(1000)
y <- runif(1000)
inter <- c("Low", "Middle", "High")[ pmin(findInterval(x, c(0, .25, .7, 1)),
                                          findInterval(y, c(0, .25, .7, 1))) ]
> table(inter)
inter
High Low Middle
78 383 539
If you use a modification of #BenBolker's cfun that makes ordered factors, you can get pmin to work directly on the values:
cfun2 <- function(x) cut(x, c(0, 0.25, 0.7, 1.01), include.lowest = TRUE,
                         labels = c("low", "medium", "high"), ordered_result = TRUE)
inter.f <- pmin(cfun2(x), cfun2(y))
table(inter.f)
#--------
inter.f
low medium high
449 473 78
And that is in some ways superior because the table function automagically honors the ordering of the factor labels.
I am a beginner with the R language and syntax, but it seems you are looking for a function rather than a procedure.
What about using f(var1, var2) = min(var1, var2)? Clearly, you have to apply this to the numeric values and then categorize the result.
From my point of view, since you want to use this new index in a regression, you are trying to do what is known as feature elimination. Generally, it is best to use all the variables you have if the total number of variables is small. If the number of variables is big and you therefore need to eliminate some, there are multiple ways to do it, including stepwise elimination, recursive feature elimination, etc.
In your case you only have 2 variables, and essentially you want to combine them without losing any variance. To my mind, one thing you can use is Principal Component Analysis (PCA). Let's see an example:
#create data
var1 <- runif(100)
var2 <- runif(100)
df <- data.frame(var1, var2)
#the line below builds a PCA model
#(note: as written this is a PCA of the single vector var1 + var2; true
# formula syntax would be princomp(~ var1 + var2, data = df), which
# would return two components)
PCAmod <- princomp(var1 + var2, data = df)
> summary(PCAmod)
Importance of components:
Comp.1
Standard deviation 0.4052599
Proportion of Variance 1.0000000
Cumulative Proportion 1.0000000
The above shows that a single principal component has been created, i.e. a vector of 100 new elements. Note that the 100% "Proportion of Variance" in the table is trivial here: a one-column PCA always explains all of its own variance, so it reflects the variance of the sum var1 + var2 rather than of var1 and var2 jointly.
newvar <- PCAmod$scores #the new vector
Essentially, the newvar can be used instead of var1 and var2
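As a sketch of the intended use (resp is a hypothetical response vector, not defined above):
# hypothetical sketch: use the component scores as the single predictor
fit <- lm(resp ~ newvar[, 1])
summary(fit)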
If you need the vector to be numbers ranging between [0,1] then you can scale it:
scaled_newvar <- scale(newvar,center=min(newvar), scale=max(newvar)-min(newvar) )
> summary(scaled_newvar)
Comp.1
Min. :0.0000
1st Qu.:0.2991
Median :0.4607
Mean :0.4788
3rd Qu.:0.6566
Max. :1.0000
However, the above will probably not reproduce your Low/Medium/High condition table, but I think it is the right approach if you will use the result in a regression.
If the above is not satisfying enough (and I wouldn't recommend these), then:
Just use min(var1, var2) for each pair and use that, or
Multiply the two, substituting a boundary value if the product falls outside the range you want, e.g. if both var1 and var2 are High and their product is Medium, then use 0.75 as the corrected value.
According to your final edit, you could just multiply the two together without caring about Low/Medium/High.
