What is the ggplot2/plyr way to calculate statistical tests between two subgroups? - r

I am a rather novice user of R and have come to appreciate the elegance of ggplot2 and plyr. Right now, I am trying to analyze a large dataset that I cannot share here, but I have reconstructed my problem with the diamonds dataset (shortened for convenience).
Without further ado:
library(ggplot2)
diam <- diamonds[diamonds$cut == "Fair" | diamonds$cut == "Ideal", ]
boxplots <- ggplot(diam, aes(x = cut, y = price)) +
  geom_boxplot(aes(fill = cut)) +
  facet_wrap(~ color)
print(boxplots)
What the plot produces is a set of boxplots, comparing the price of the two cuts "Fair" and "Ideal".
I would now very much like to proceed by statistically comparing the two cuts for each color subgroup (D,E,F,..,J) using either t.test or wilcox.test.
How would I implement this in a way that is as elegant as the ggplot2 syntax? I assume I would use ddply from the plyr package, but I couldn't figure out how to feed two subgroups into a function that calculates the appropriate statistics.

I think you're looking for:
library(plyr)
ddply(diam,"color",
function(x) {
w <- wilcox.test(price~cut,data=x)
with(w,data.frame(statistic,p.value))
})
(Substituting t.test for wilcox.test seems to work fine too.)
results:
color statistic p.value
1 D 339753.5 4.232833e-24
2 E 591104.5 6.789386e-19
3 F 731767.5 2.955504e-11
4 G 950008.0 1.176953e-12
5 H 611157.5 2.055857e-17
6 I 213019.0 3.299365e-04
7 J 56870.0 2.364026e-01
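For example, the same pattern works with t.test, also pulling out the two group means (a minimal sketch, not part of the original answer; the column names are just illustrative):
ddply(diam, "color", function(x) {
  tt <- t.test(price ~ cut, data = x)
  data.frame(mean.Fair  = unname(tt$estimate[1]),   # mean price for cut == "Fair"
             mean.Ideal = unname(tt$estimate[2]),   # mean price for cut == "Ideal"
             p.value    = tt$p.value)
})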

ddply returns a data frame as output and, assuming that I am reading your question properly, that isn't what you are looking for. I believe you would like to conduct a series of t-tests using a series of subsets of data, so the only real task is compiling a list of those subsets. Once you have them, you can use a function like lapply() to run a t-test for each subset in your list. I am sure this isn't the most elegant solution, but one approach would be to create a list of unique pairs of your colors using a function like this:
get.pairs <- function(v){
  l <- length(v)
  n <- l * (l - 1) / 2   # number of unique pairs
  a <- vector("list", n)
  j <- 1
  k <- 2
  for(i in 1:n){
    a[[i]] <- c(v[j], v[k])
    if(k < l){
      k <- k + 1
    } else {
      j <- j + 1
      k <- j + 1
    }
  }
  return(a)
}
Now you can use that function to get your list of unique pairs of colors:
> (color.pairs <- get.pairs(levels(diam$color)))
[[1]]
[1] "D" "E"
[[2]]
[1] "D" "F"
...
[[21]]
[1] "I" "J"
Now you can use each of these lists to run a t.test (or whatever you would like) on your subset of your data frame, like so:
> t.test(price~cut,data=diam[diam$color %in% color.pairs[[1]],])
Welch Two Sample t-test
data: price by cut
t = 8.1594, df = 427.272, p-value = 3.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1008.014 1647.768
sample estimates:
mean in group Fair mean in group Ideal
3938.711 2610.820
Now use lapply() to run your test for each subset in your list of color pairs:
> lapply(color.pairs,function(x) t.test(price~cut,data=diam[diam$color %in% x,]))
[[1]]
Welch Two Sample t-test
data: price by cut
t = 8.1594, df = 427.272, p-value = 3.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1008.014 1647.768
sample estimates:
mean in group Fair mean in group Ideal
3938.711 2610.820
...
[[21]]
Welch Two Sample t-test
data: price by cut
t = 0.8813, df = 375.996, p-value = 0.3787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-260.0170 682.3882
sample estimates:
mean in group Fair mean in group Ideal
4802.912 4591.726
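If you would rather see a compact summary than 21 full printouts, you can name the list and pull out just the p-values (a sketch, assuming color.pairs and diam as above):
pair.tests <- lapply(color.pairs,
                     function(x) t.test(price ~ cut, data = diam[diam$color %in% x, ]))
names(pair.tests) <- sapply(color.pairs, paste, collapse = "-")
sapply(pair.tests, function(x) x$p.value)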

Related

Sign of Cohen's d is unaffected by reversing order of factor levels in R

I'm using Cohen's d (implemented using cohen.d() from the effsize package) as a measure of effect size in my dependent variable between two levels of a factor.
My code looks like this: cohen.d(d, f) where d is a vector of numeric values and f is a factor with two levels: "A" and "B".
Based on my understanding, the sign of Cohen's d is dependent on the order of means (i.e. factor levels) entered into the formula. However, my cohen.d() command returns a negative value (and negative CIs), even if I reverse the order of levels in f.
Here is a reproducible example:
library('effsize')
# Load in Chickweight data
a=ChickWeight
# Cohens d requires two levels in factor f, so take the first two available in Diet
a=a[a$Diet %in% c(1,2),]
a$Diet=droplevels(a$Diet)
# Compute cohen's d with default order of Diet
d1 = a$weight
f1 = a$Diet
cohen1 = cohen.d(d1,f1)
# Re-order levels of Diet
a$Diet = relevel(a$Diet, ref=2)
# Re-compute cohen's d
d2 = a$weight
f2 = a$Diet
cohen2 = cohen.d(d2,f2)
# Compare values
cohen1
cohen2
Can anyone explain why this is the case, and/or if I'm doing something wrong?
Thanks in advance for any advice!
I'm not entirely sure what the reasoning behind the issue in your example is (maybe someone else can comment here), but if you look at the examples under ?cohen.d, there are a few different methods for calculating it:
treatment = rnorm(100,mean=10)
control = rnorm(100,mean=12)
d = (c(treatment,control))
f = rep(c("Treatment","Control"),each=100)
## compute Cohen's d
## treatment and control
cohen.d(treatment,control)
## data and factor
cohen.d(d,f)
## formula interface
cohen.d(d ~ f)
If you use the first example of cohen.d(treatment, control) and reverse that to cohen.d(control, treatment) you get the following:
cohen.d(treatment, control)
Cohen's d
d estimate: -1.871982 (large)
95 percent confidence interval:
inf sup
-2.206416 -1.537547
cohen.d(control, treatment)
Cohen's d
d estimate: 1.871982 (large)
95 percent confidence interval:
inf sup
1.537547 2.206416
So using the two-vector method from the examples with your data, we can do:
a1 <- a[a$Diet == 1,"weight"]
a2 <- a[a$Diet == 2,"weight"]
cohen3a <- cohen.d(a1, a2)
cohen3b <- cohen.d(a2, a1)
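As a quick check (assuming, as in current versions of effsize, that the returned object exposes $estimate), the two estimates should have equal magnitude and opposite sign:
c(diet1.vs.diet2 = unname(cohen3a$estimate),
  diet2.vs.diet1 = unname(cohen3b$estimate))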
I noticed that f in the ?cohen.d examples is not a factor but a character vector. I tried playing around with the cohen.d(d, f) method but didn't find a solution; I'd like to see if someone else has anything regarding that.

Use lapply and show variable names

I am fairly new to R, though I've programmed a lot in Python and Java. I have searched questions about using a for loop to run through a list of variables, and everyone keeps recommending lapply. I have done that, and my code works in the sense that it gives me the answers, but it doesn't work in the sense that the answers hide important details. Here's my code and some of the output.
> bat <- read.csv(file="mlbTeam2016-B.csv", header=TRUE)
> varlist <- names(bat)[6:32]
> varlist
[1] "AB.B" "R.B" "H.B" "X2B.B" "X3B.B" "HR.B" "RBI.B"
[8] "BB.B" "SO.B" "SB.B" "CS.B" "AVG.B" "OBP.B" "SLG.B"
[15] "OPS.B" "IBB.B" "HBP.B" "SAC.B" "SF.B" "TB.B" "XBH.B"
[22] "GDP.B" "GO.B" "AO.B" "GO_AO.B" "NP.B" "PA.B"
> lapply(varlist, function(i){
+ var <- eval(parse(text=paste("bat$",i)))
+ cor.test(bat$W, var, alternative="two.sided", method="pearson")
+ })
[[1]]
Pearson's product-moment correlation
data: bat$W and var
t = 0.35067, df = 28, p-value = 0.7285
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3013221 0.4164731
sample estimates:
cor
0.06612551
etc
The problem is that each output says data: bat$W and var without telling me which variable is being tested in that step. This is fine, except that I have to go back and look up which variable it corresponds to. That is better than typing this code dozens of times, but not ideal. I also know that using eval(parse()) is bad, but I can't figure out another way to handle that line.
This is my desired output:
[[1]]
Pearson's product-moment correlation
data: bat$W and bat$AB.B
t = 0.35067, df = 28, p-value = 0.7285
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3013221 0.4164731
sample estimates:
cor
0.06612551
I would suggest creating a correlation matrix rather than doing this using lapply. This link will walk you through how to do that:
http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
You can select the variables you want using dplyr:
select(bat, one_of(varlist))
This should be a bit easier than the approach you are using.
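For instance, combining the two suggestions (a minimal sketch, assuming bat and varlist are defined as in the question and the selected columns are numeric):
library(dplyr)
vars <- select(bat, one_of(varlist))
# correlations of bat$W with each selected variable, labelled by column name
round(cor(bat$W, vars, use = "pairwise.complete.obs"), 3)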

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to, but different from, "R - How to perform the same operation on multiple variables", and nearly the same as "Apply t-test on many columns in a dataframe split by factor", but the solution to that question no longer works.
Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 10, replace = T)
head((df <- data.frame(a,b)))
a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity
correction
data: X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
-0.2557295 0.1638177
sample estimates:
prop 1 prop 2
0.4852941 0.5312500
Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of your tests can be dramatically higher than 5%(!):
set.seed(11)
sum(
replicate(1e4, {
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
df <- data.frame(a,b)
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
results <- lapply(cTbls, prop.test, correct = FALSE)
any(sapply(results, function(x) x$p.value < .05))
})
) / 1e4
[1] 0.1642
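If you do want to interpret the individual tests, one simple option is to adjust the p-values for multiplicity (a sketch, assuming the results list from above):
p.vals <- sapply(results, function(x) x$p.value)
p.adjust(p.vals, method = "holm")   # or "BH", "bonferroni", ...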
I don't exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output of the t.test() calls you run above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew<-df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
lapply(1:ncol(dfNew), function(x)
  t.test(c(rep(1, dfNew[1, x]), rep(0, length(b) - dfNew[1, x])), b))
This will save you the typing of t.test(foo, bar) continuously, and also eliminates the need for dummy variables.
Edit: I don't think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I don't know the goal of this procedure), then you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new) - 1), function(x)
  t.test(new[, x], new[, ncol(new)]))
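To keep track of which dummy column each test belongs to, you can name the list elements (a small sketch building on the code above):
res <- lapply(1:(ncol(new) - 1), function(x) t.test(new[, x], new[, ncol(new)]))
names(res) <- names(new)[1:(ncol(new) - 1)]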

Grabbing certain results out of multiple t.test outputs to create a table

I have run 48 t-tests (coded by hand instead of writing a loop) and would like to pull out certain results of those t-tests to create a table of the things I'm most interested in.
Specifically, I would like to keep only the p-value, confidence interval, and the mean of x and mean of y for each of these 48 tests and then build a table of the results.
Is there an elegant, quick way to do this beyond the top answer detailed here, wherein I would go in for all 48 tests and grab all three desired outputs with something along the lines of ttest$p.value? Perhaps a loop?
Below is a sample of the coded input for one t-test, followed by the output delivered by R.
# t.test comparing means of Change_Unemp for 2005 government employment (ix)
lowgov6 <- met_res[met_res$Gov_Emp_2005 <= 93310, "Change_Unemp"]
highgov6 <- met_res[met_res$Gov_Emp_2005 > 93310, "Change_Unemp"]
t.test(lowgov6,highgov6,pool.sd=FALSE,na.rm=TRUE)
Welch Two Sample t-test
data: lowgov6 and highgov6
t = 1.5896, df = 78.978, p-value = 0.1159
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1813909 1.6198399
sample estimates:
mean of x mean of y
4.761224 4.042000
Save all of your t-tests into a list:
tests <- list()
tests[[1]] <- t.test(lowgov6,highgov6,pool.sd=FALSE,na.rm=TRUE)
# repeat for all tests
# there are probably faster ways than doing all of that by hand
# extract your values using `sapply`
sapply(tests, function(x) {
  c(x$estimate[1],
    x$estimate[2],
    ci.lower = x$conf.int[1],
    ci.upper = x$conf.int[2],
    p.value = x$p.value)
})
The output is something like the following:
[,1] [,2]
mean of x 0.12095949 0.03029474
mean of y -0.05337072 0.07226999
ci.lower -0.11448679 -0.31771191
ci.upper 0.46314721 0.23376141
p.value 0.23534905 0.76434012
But it will have 48 columns. You can t() the result if you'd like it transposed.
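If you prefer the result as a data frame with one row per test, you can transpose and convert it (a sketch, assuming the tests list built above):
results.df <- as.data.frame(t(sapply(tests, function(x) {
  c(mean.x   = unname(x$estimate[1]),
    mean.y   = unname(x$estimate[2]),
    ci.lower = x$conf.int[1],
    ci.upper = x$conf.int[2],
    p.value  = x$p.value)
})))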

Permutations of correlation coefficients

My question is on the permutation of correlation coefficients.
A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
cor(A,B)
# A2 B2 C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588
I obtained this correlation and then wanted to perform permutation tests to check if the correlation still holds.
I did the permutation as follows:
A<-as.vector(t(A))
B<-as.vector(t(B))
corperm <- function(A,B,1000) {
# n is the number of permutations
# x and y are the vectors to correlate
obs = abs(cor(A,B))
tmp = sapply(1:n,function(z) {abs(cor(sample(A,replace=TRUE),B))})
return(1-sum(obs>tmp)/n)
}
The result was
[1] 0.645
and using "cor.test"
cor.test(A,B)
Pearson's product-moment correlation
data: A and B
t = 0.4753, df = 13, p-value = 0.6425
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4089539 0.6026075
sample estimates:
cor
0.1306868
How could I draw a plot or a histogram to show the actual correlation and the permuted correlation values from the permuted data?
First of all, you can't have done it exactly this way, as ...
> corperm = function(A,B,1000) {
Error: unexpected numeric constant in "corperm = function(A,B,1000"
The third argument has no name but it should have one! Perhaps you meant
> corperm <- function(A, B, n=1000) {
# etc
Then you need to think about what you want to achieve. Initially you have two data sets with 3 variables each, and then you collapse them into two vectors and compute a correlation between the permuted vectors. Why would that make sense? The structure of the permuted data set should be the same as that of the original data set.
obs = abs(cor(A,B))
tmp = sapply(1:n,function(z) {abs(cor(sample(A,replace=TRUE),B))})
return(1-sum(obs>tmp)/n)
Why do you use replace=TRUE here? This would make sense if you wanted bootstrap CIs, but (a) it would be better to use a dedicated function for that, e.g. boot from the boot package, and (b) you'd need to do the same with B, i.e. sample(B, replace=TRUE).
For permutation test you sample without replacement and it makes no difference whether you do it for both A and B or only A.
And how do you get the histogram? Well, hist(tmp) would draw a histogram of the permuted values, and obs is the absolute value of the observed correlation.
HTHAB
(edit)
corperm <- function(x, y, N=1000, plot=FALSE){
  reps <- replicate(N, cor(sample(x), y))
  obs <- cor(x, y)
  p <- mean(reps > obs)   # shortcut for sum(reps > obs)/N
  if(plot){
    hist(reps)
    abline(v=obs, col="red")
  }
  p
}
Now you can use this on a single pair of variables:
corperm(A[,1], B[,1])
To apply it to all pairs, use for or mapply. for is easier to understand, so I wouldn't insist on using mapply to get all possible pairs.
res <- matrix(NA, nrow=NCOL(A), ncol=NCOL(B))
for(iii in 1:3) for(jjj in 1:3) res[iii,jjj] <- corperm(A[,iii], B[,jjj], plot=FALSE)
rownames(res)<-names(A)
colnames(res) <- names(B)
print(res)
To make all histograms, use plot=TRUE above.
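For example (a sketch reusing the loop above):
par(mfrow = c(3, 3))   # one panel per pair of variables
for(iii in 1:3) for(jjj in 1:3) corperm(A[, iii], B[, jjj], plot = TRUE)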
I think there is not much point in doing a permutation test for the correlation between two variables, because cor.test() already provides a p-value, which serves much the same purpose as a permutation test.