I would like to use the R package plyr to run a pairwise t test on a really large data frame, but I'm not sure how to do it. I recently learned how to do correlations using plyr, and I really like how you can specify which groups you want to compare and then plyr breaks down the data for you. For example, you could have plyr calculate the correlation between sepal length and sepal width for each species of iris in the iris dataset like this:
Correlations <- ddply(iris, "Species", function(x) cor(x$Sepal.Length, x$Sepal.Width))
I could break the data frame down myself by specifying that the data for the setosa species of iris are in rows 1:50 and so on, but plyr would be less likely than me to mess up and accidentally say rows 1:51, for example.
So how do I do something similar with a paired t test? How can I specify which observations are the pairs? Here's some example data that are similar to what I'm working with, and I'd like the pairs to be the Subject and I'd like to break the data down by Pesticide:
Exposure <- data.frame("Subject" = rep(1:4, 6),
                       "Season" = rep(c(rep("summer", 4), rep("winter", 4)), 3),
                       "Pesticide" = rep(c("atrazine", "metolachlor", "chlorpyrifos"), each = 8),
                       "Exposure" = sample(1:100, size = 24))
Exposure$Subject <- as.factor(Exposure$Subject)
In other words, the question I'd like to evaluate is whether there is a difference in pesticide exposure for each person during the winter versus during the summer, and I'd like to answer that question separately for each of the three pesticides.
Much thanks in advance!
An edit: To clarify, this is how to do an unpaired t test in plyr:
TTests <- dlply(Exposure, "Pesticide", function(x) t.test(x$Exposure ~ x$Season))
And if I add "paired=T" in there, plyr will do a paired t test, but it assumes that I always have the pairs in the same order. While I do have them all in the same order in the example data frame above, I don't in my real data because I sometimes have missing data.
Do you want this?
library(data.table)
# convert to data.table in place
setDT(Exposure)
# make sure data is sorted correctly
setkey(Exposure, Pesticide, Season, Subject)
Exposure[, list(res = list(t.test(Exposure[Season == "summer"],
                                  Exposure[Season == "winter"],
                                  paired = TRUE))),
         by = Pesticide]$res
#[[1]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -4.1295, df = 3, p-value = 0.02576
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -31.871962 -4.128038
#sample estimates:
#mean of the differences
# -18
#
#
#[[2]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -6.458, df = 3, p-value = 0.007532
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -73.89299 -25.10701
#sample estimates:
#mean of the differences
# -49.5
#
#
#[[3]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -2.5162, df = 3, p-value = 0.08646
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -30.008282 3.508282
#sample estimates:
#mean of the differences
# -13.25
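The question mentions that in the real data the pairs are not always in the same order because of missing observations. A minimal sketch of one way to guarantee correct pairing (my own extension, not part of the answer above; it assumes the column names from the example data and that Exposure is the data.table created above) is to reshape to one row per Subject per Pesticide, drop incomplete pairs, and run the paired test on the matched columns:

# Match summer and winter by Subject, dropping subjects that are missing
# either season, before running the paired t-test within each pesticide.
wide <- dcast(Exposure, Pesticide + Subject ~ Season, value.var = "Exposure")
wide <- wide[!is.na(summer) & !is.na(winter)]
wide[, list(res = list(t.test(summer, winter, paired = TRUE))), by = Pesticide]$res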
I don't know ddply, but here's how I would do it using some base functions.
by(data = Exposure, INDICES = Exposure$Pesticide, FUN = function(x) {
  t.test(Exposure ~ Season, data = x)
})
Exposure$Pesticide: atrazine
Welch Two Sample t-test
data: Exposure by Season
t = -0.1468, df = 5.494, p-value = 0.8885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-49.63477 44.13477
sample estimates:
mean in group summer mean in group winter
60.50 63.25
----------------------------------------------------------------------------------------------
Exposure$Pesticide: chlorpyrifos
Welch Two Sample t-test
data: Exposure by Season
t = -0.8932, df = 4.704, p-value = 0.4151
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-83.58274 41.08274
sample estimates:
mean in group summer mean in group winter
52.25 73.50
----------------------------------------------------------------------------------------------
Exposure$Pesticide: metolachlor
Welch Two Sample t-test
data: Exposure by Season
t = 0.8602, df = 5.561, p-value = 0.4252
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-39.8993 81.8993
sample estimates:
mean in group summer mean in group winter
62.5 41.5
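If the pairing by Subject from the question matters here as well, a minimal sketch using only base functions (my own addition, assuming Exposure is still the plain data frame from the question) would merge the summer and winter rows on Subject within each group before calling a paired test:

by(data = Exposure, INDICES = Exposure$Pesticide, FUN = function(x) {
  # merge() keeps only subjects present in both seasons,
  # so the paired vectors line up by Subject
  s <- x[x$Season == "summer", c("Subject", "Exposure")]
  w <- x[x$Season == "winter", c("Subject", "Exposure")]
  m <- merge(s, w, by = "Subject")
  t.test(m$Exposure.x, m$Exposure.y, paired = TRUE)
})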
How can I extract statistics from this model? To conduct several t-tests I used this:
A<-lapply(merged_DF_final[2:6], function(x) t.test(x ~ merged_DF_final$Group))
How can I extract information about the p-value, t statistic, confidence interval, and group means for each specific subtest and output them in a single table?
This is what is saved on A:
$HC_HC_L_amygdala_baseline

Welch Two Sample t-test

data: x by merged_DF_final$Group
t = 0.039543, df = 47.412, p-value = 0.9686
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4694404 0.4882694
sample estimates:
mean in group CONN mean in group HC
0.2954200 0.2860055

$HC_HC_L_culmen_baseline

Welch Two Sample t-test

data: x by merged_DF_final$Group
t = 0.81387, df = 53.695, p-value = 0.4193
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2970321 0.7028955
sample estimates:
mean in group CONN mean in group HC
0.4020883 0.1991566

$HC_HC_L_fusiform_baseline

Welch Two Sample t-test

data: x by merged_DF_final$Group
t = 0.024945, df = 53.851, p-value = 0.9802
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5768786 0.5914136
sample estimates:
mean in group CONN mean in group HC
0.5552184 0.5479509

$HC_HC_L_insula_baseline

Welch Two Sample t-test

data: x by merged_DF_final$Group
t = 0.79659, df = 52.141, p-value = 0.4293
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3000513 0.6951466
sample estimates:
mean in group CONN mean in group HC
0.12436946 -0.07317818

$HC_HC_L_lingual_gyrus_baseline

Welch Two Sample t-test

data: x by merged_DF_final$Group
t = -0.11033, df = 53.756, p-value = 0.9126
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5172863 0.4633268
sample estimates:
mean in group CONN mean in group HC
0.4395066 0.4664864
Look at names(A[[1]]) or str(A[[1]]) to see what the components are, then use $ or [[ to extract them, e.g.
names(t.test(extra ~ group, data = sleep))
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "stderr" "alternative" "method" "data.name"
You can then sapply(A, "[[", "statistic") or (being more careful) vapply(A, "[[", "statistic", FUN.VALUE = numeric(1))
If you like the tidyverse, you can use purrr::map_dbl(A, "statistic") (for results with a single value); you'll need purrr::map(A, ~ .$estimate[1]) for the mean of the first group, etc. (sapply() will automatically collapse the result to a matrix.)
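For example, a minimal sketch that collects the usual components into a single data frame (assuming A is the list produced by the lapply call above, and that the two groups print as CONN and HC as in the output shown):

# one row per test, one column per extracted component
results <- data.frame(
  t         = vapply(A, function(x) unname(x$statistic), numeric(1)),
  p.value   = vapply(A, function(x) x$p.value, numeric(1)),
  conf.low  = vapply(A, function(x) x$conf.int[1], numeric(1)),
  conf.high = vapply(A, function(x) x$conf.int[2], numeric(1)),
  mean.CONN = vapply(A, function(x) unname(x$estimate[1]), numeric(1)),
  mean.HC   = vapply(A, function(x) unname(x$estimate[2]), numeric(1))
)
results   # row names are taken from names(A)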
I have a data frame with 2 columns, age and sex. I'm doing a statistical analysis to determine whether there's a difference in the age distribution between the two sex groups. I know that if I don't supply data= it will give an error (I believe it's something to do with the dplyr library). I was wondering what the single . in the data argument does. Does it refer to the data frame we used before the %>%?
age_sex.htest <- d %>%
  t.test(formula = age ~ sex, data = .)
As #markus has pointed out, d is passed to the data argument in t.test. Here is the output from data(sleep) using the dot.
library(dplyr)
data(sleep)
sleep %>% t.test(formula=extra ~ group, data = .)
# Output
Welch Two Sample t-test
data: extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean in group 1 mean in group 2
0.75 2.33
If you put sleep directly into the data argument of t.test, then you will get the same result, as t.test is run on the exact same data.
t.test(formula=extra ~ group, data = sleep)
# Output
Welch Two Sample t-test
data: extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean in group 1 mean in group 2
0.75 2.33
In this case, the . is not that beneficial, though some people prefer this stylistically (I generally do).
However, it is extremely useful when you want to run the analysis on a slight alteration of the dataframe. So, with the sleep dataset, for example, if you wanted to remove ID == 10 from both groups, then you could remove those with filter, and then run the t.test.
sleep %>%
filter(ID != 10) %>%
t.test(formula = extra ~ group, data = .)
Here, we pass an altered version of the sleep dataset, without the rows where ID is 10, so we see a change in the output:
Welch Two Sample t-test
data: extra by group
t = -1.7259, df = 15.754, p-value = 0.1039
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-3.5677509 0.3677509
sample estimates:
mean in group 1 mean in group 2
0.6111111 2.2111111
In a one-sample t-test, we compare the average (or mean) of one group against a set average (or mean). This set average can be any theoretical value (or it can be the population mean).
I am trying to test the mean of a small group of 300 observations against 1500 observations using a one-sided t.test. Is this approach correct? If not, is there an alternative to this?
head(data$BMI)
attach(data)
tester<-mean(BMI)
table(BMI)
set.seed(123)
sampler<-sample(min(BMI):max(BMI),300,replace = TRUE)
mean(sampler)
t.test(sampler,tester)
The last line of the code yields:
Error in t.test.default(sampler, tester) : not enough 'y' observations
For testing your sample in t.test, you can do:
d <- rnorm(1500,mean = 3, sd = 1)
s <- sample(d,300)
Then, test for the normality of d and s:
> shapiro.test(d)
Shapiro-Wilk normality test
data: d
W = 0.9993, p-value = 0.8734
> shapiro.test(s)
Shapiro-Wilk normality test
data: s
W = 0.99202, p-value = 0.1065
Here the p-values are greater than 0.05, so you could consider that both d and s are normally distributed. So, you can run the t.test:
> t.test(d,s)
Welch Two Sample t-test
data: d and s
t = 0.32389, df = 444.25, p-value = 0.7462
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09790144 0.13653776
sample estimates:
mean of x mean of y
2.969257 2.949939
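A side note on the error in the question: tester is a single number, which is why t.test(sampler, tester) complains about not enough 'y' observations. Comparing a sample mean against a single fixed value is a one-sample test, and a minimal sketch of that (using t.test's mu argument with the tester value from the question) is:

# one-sample t-test: compare mean(sampler) against the fixed value tester;
# mu takes a single number, avoiding the "not enough 'y' observations" error
t.test(sampler, mu = tester)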
I was provided with three t-tests:
Two Sample t-test
data: cammol by gender
t = -3.8406, df = 175, p-value = 0.0001714
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.11460843 -0.03680225
sample estimates:
mean in group 1 mean in group 2
2.318132 2.393837
Welch Two Sample t-test
data: alkphos by gender
t = -2.9613, df = 145.68, p-value = 0.003578
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-22.351819 -4.458589
sample estimates:
mean in group 1 mean in group 2
85.81319 99.21839
Two Sample t-test
data: phosmol by gender
t = -3.4522, df = 175, p-value = 0.0006971
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.14029556 -0.03823242
sample estimates:
mean in group 1 mean in group 2
1.059341 1.148605
And I want to construct a table with these t-test results in R Markdown, in the layout shown in the linked image (wanted_table_format).
I've tried reading some instructions for using "knitr" and "kable" functions, but honestly, I do not know how to apply the t-test results to those functions.
What could I do?
Suppose your three t-tests are saved as t1, t2, and t3.
t1 <- t.test(rnorm(100), rnorm(100))
t2 <- t.test(rnorm(100), rnorm(100, 1))
t3 <- t.test(rnorm(100), rnorm(100, 2))
You could turn them into one data frame (that can then be printed as a table) with the broom and purrr packages:
library(broom)
library(purrr)
tab <- map_df(list(t1, t2, t3), tidy)
On the above data, this would become:
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
1 0.07889713 -0.008136139 -0.08703327 0.535986 5.925840e-01 193.4152 -0.2114261 0.3692204
2 -0.84980010 0.132836627 0.98263673 -6.169076 3.913068e-09 194.2561 -1.1214809 -0.5781193
3 -1.95876967 -0.039048940 1.91972073 -13.270232 3.618929e-29 197.9963 -2.2498519 -1.6676875
method alternative
1 Welch Two Sample t-test two.sided
2 Welch Two Sample t-test two.sided
3 Welch Two Sample t-test two.sided
Some of the columns probably don't matter to you, so you could do something like this to get just the columns you want:
tab[c("estimate", "statistic", "p.value", "conf.low", "conf.high")]
As noted in the comments, you'd have to first do install.packages("broom") and install.packages("purrr").
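Since the question mentions knitr and kable, a minimal sketch of rendering the reduced table in an R Markdown chunk (assuming tab was built as above) is:

library(knitr)
# print the selected columns as a markdown table, rounded to 3 digits
kable(tab[c("estimate", "statistic", "p.value", "conf.low", "conf.high")],
      digits = 3)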
totaldata$Age2 <- ifelse(totaldata$Age<=50, 0, 1)
t.test(totaldata$concernsubscorehiv, totaldata$Age2, alternative = 'two.sided', na.rm = TRUE, conf.level = .95, paired = FALSE)
This code yields this result:
Welch Two Sample t-test
data: totaldata$concernsubscorehiv and totaldata$Age2
t = 33.19, df = 127.42, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.370758 3.798164
sample estimates:
mean of x mean of y
4.336842 0.752381
As you can see, the mean of group y is 0.752381.
Then I estimate the mean of each group using this:
aggregate(totaldata$concernsubscorehiv~totaldata$Age2,data=totaldata,mean)
This yields
totaldata$Age2 totaldata$concernsubscorehiv
1 0 4.354286
2 1 4.330612
As you can see, the mean of group 0 is 4.354286, not 0.752381 as estimated by the t-test. What is the problem?
You are not using t.test correctly: 0.752381 is simply the fraction of people for whom Age2 is 1. You are supplying a vector of your outcome data and a vector of zeros and ones, when instead you want to split up the first vector based on the grouping in the second.
Consider the following:
out <- rnorm(10)*5+100
bin <- rbinom(n=10, size=1, prob=0.5)
mean(out)
[1] 101.9462
mean(bin)
[1] 0.4
From the ?t.test helpfile, we know that x and y are:
x a (non-empty) numeric vector of data values.
y an optional (non-empty) numeric vector of data values.
So, by supplying both out and bin, I compare these two vectors to each other, which probably does not make much sense in this example. See:
t.test(out, bin)
Welch Two Sample t-test
data: out and bin
t = 86.665, df = 9.3564, p-value = 6.521e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
98.91092 104.18149
sample estimates:
mean of x mean of y
101.9462 0.4000
Here, you see that t.test correctly estimated the means for my two supplied vectors, as I have shown above. What you want to do is to split up the first vector based on whether or not the second is 0 or 1.
In my toy example, I can do this easily by writing:
t.test(out[which(bin==1)], out[which(bin==0)])
Welch Two Sample t-test
data: out[which(bin == 1)] and out[which(bin == 0)]
t = 0.34943, df = 5.1963, p-value = 0.7405
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.828182 7.686176
sample estimates:
mean of x mean of y
102.5036 101.5746
Here, these two means correspond exactly to
tapply(out, bin, mean)
0 1
101.5746 102.5036
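Applied to the data in the question, a minimal sketch of the same split using t.test's formula interface (assuming the column names from the question) would be:

# split concernsubscorehiv by the two levels of Age2; the reported group
# means should match the aggregate() output above
t.test(concernsubscorehiv ~ Age2, data = totaldata)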