df <- data.frame (rating1 = c(1,5,2,4,5),
rating2 = c(2,1,2,4,2),
rating3 = c(0,2,1,2,0),
race = c("black", "asian", "white","black","white"),
gender = c("male","female","female","male","female")
)
I'd like to conduct a t-test comparing a group mean (e.g. the mean for Asians in rating1) with the overall mean of each rating (e.g. rating1). Below is my code for Asians in rating1.
asian_df <- df %>%
filter(race == "asian")
t.test(asian_df$rating1, mean(df$rating1))
Then for Blacks in rating 2, I'd run
black_df <- df %>%
filter(race == "black")
t.test(black_df$rating2, mean(df$rating2))
How can I write a function that automates the t-test for each group? So far I have to manually change the variable name to essentially run for each race, each gender and on each rating (rating 1 to rating 3). Thanks!
Performing multiple t-tests inflates your risk of a Type I error, so you will need to adjust for multiple comparisons for your results to be valid/meaningful. You can run the t-tests by looping over the groups, e.g.
library(tidyverse)
df <- data.frame (rating1 = c(5,8,7,8,9,6,9,7,8,5,8,5),
rating2 = c(2,7,8,4,9,3,6,1,7,3,9,1),
rating3 = c(0,6,1,2,7,2,9,1,6,2,3,1),
race = c("asian", "asian", "asian","black","asian","black","white","black","white","black","white","black"),
gender = c("male","female","female","male","female","male","female","male","female","male","female","male")
)
for (rac in unique(df$race)){
tmp_df <- df %>%
filter(race == rac)
print(rac)
print(t.test(tmp_df$rating1,
rep(mean(df$rating1),
length(tmp_df$rating1))))
}
[1] "asian"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = 0.19518, df = 3, p-value = 0.8577
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.550864 2.884198
sample estimates:
mean of x mean of y
7.250000 7.083333
[1] "black"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = -1.5149, df = 4, p-value = 0.2044
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.5022651 0.7355985
sample estimates:
mean of x mean of y
6.200000 7.083333
[1] "white"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = 3.75, df = 2, p-value = 0.06433
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1842176 2.6842176
sample estimates:
mean of x mean of y
8.333333 7.083333
for (gend in unique(df$gender)){
tmp_df <- df %>%
filter(gender == gend)
print(gend)
print(t.test(tmp_df$rating1,
rep(mean(df$rating1),
length(tmp_df$rating1))))
}
[1] "male"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = -2.0979, df = 5, p-value = 0.09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.4107761 0.2441094
sample estimates:
mean of x mean of y
6.000000 7.083333
[1] "female"
Welch Two Sample t-test
data: tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = 3.5251, df = 5, p-value = 0.01683
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2933469 1.8733198
sample estimates:
mean of x mean of y
8.166667 7.083333
Because you are running multiple tests (in this example, 5 t-tests), your chance of at least one false positive is 1 - (1 - 0.05)^5 = 22.62% if the tests are independent, which is very high. To account for this, you can apply the Bonferroni correction, which takes your significance threshold (in this case, p < 0.05) and divides it by the number of tests (i.e. the new threshold for rejecting the null is p < 0.01). When you apply this correction, even the 'best' t-test result (gender; p-value = 0.01683) is not statistically significant.
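As a rough sketch of that correction, base R's p.adjust() can be applied to the five p-values printed above. The values are copied by hand from the output, and the name p_values is only illustrative. Holm's method is shown purely as a common, less conservative alternative; nothing above depends on it.

```r
# p-values copied from the five t-test outputs above
p_values <- c(asian = 0.8577, black = 0.2044, white = 0.06433,
              male = 0.09, female = 0.01683)

# Chance of at least one false positive across k independent tests
alpha <- 0.05
k <- length(p_values)
fwer <- 1 - (1 - alpha)^k

# Bonferroni multiplies each p-value by k (capped at 1), which is
# equivalent to comparing the raw p-values against alpha / k
p_bonf <- p.adjust(p_values, method = "bonferroni")
p_holm <- p.adjust(p_values, method = "holm")

print(round(fwer, 4))
print(cbind(raw = p_values, bonferroni = p_bonf, holm = p_holm))
print(any(p_bonf < alpha))
```

Adjusting the p-values rather than the threshold keeps the usual "compare with 0.05" reading and gives the same decisions.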
An alternative approach would be to compare means in all conditions using ANOVA, then use Tukey's HSD to determine which groups are different. Tukey's HSD adjusts for the multiple pairwise comparisons itself, so you don't need a separate correction for multiple testing. Adapting this approach to your problem might be a better way to go. Note that the example below uses the sum of the three ratings as the response, e.g.
anova_one_way <- aov(rating1 + rating2 + rating3 ~ race + gender, data = df)
summary(anova_one_way)
Df Sum Sq Mean Sq F value Pr(>F)
race 2 266.70 133.35 14.01 0.00243 **
gender 1 140.08 140.08 14.72 0.00497 **
Residuals 8 76.13 9.52
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
TukeyHSD(anova_one_way)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = rating1 + rating2 + rating3 ~ race + gender, data = df)
$race
diff lwr upr p adj
black-asian -7.050000 -12.963253 -1.136747 0.0224905
white-asian 4.416667 -2.315868 11.149201 0.2076254
white-black 11.466667 5.029132 17.904201 0.0023910
$gender
diff lwr upr p adj
male-female -3.416667 -7.523829 0.6904958 0.0913521
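Coming back to the request for a function that covers every group and every rating: the following is a minimal sketch, not the code used above. It assumes a one-sample t-test with mu set to the overall mean is what is wanted, instead of the two-sample test against a repeated constant in the loop above, so its numbers will differ from the printed output. The helper name group_vs_overall and its arguments are hypothetical. Treating the overall mean as a fixed value is itself a simplification, because that mean is estimated from data that include the group being tested.

```r
# Hypothetical helper: one-sample t-test of each group's mean against the
# overall mean of the same rating column (treated as a fixed value)
group_vs_overall <- function(data, group_cols, rating_cols) {
  rows <- list()
  for (g in group_cols) {
    for (r in rating_cols) {
      overall <- mean(data[[r]])
      for (lvl in unique(as.character(data[[g]]))) {
        x <- data[[r]][data[[g]] == lvl]
        # skip groups that are too small or have no variation
        if (length(x) < 2 || sd(x) == 0) next
        tt <- t.test(x, mu = overall)
        rows[[length(rows) + 1]] <- data.frame(
          grouping = g, level = lvl, rating = r,
          group_mean = mean(x), overall_mean = overall,
          t = unname(tt$statistic), p_value = tt$p.value
        )
      }
    }
  }
  out <- do.call(rbind, rows)
  # adjust across every test that was actually run
  out$p_bonferroni <- p.adjust(out$p_value, method = "bonferroni")
  out
}

# Same example data as in the answer above
df <- data.frame(rating1 = c(5,8,7,8,9,6,9,7,8,5,8,5),
                 rating2 = c(2,7,8,4,9,3,6,1,7,3,9,1),
                 rating3 = c(0,6,1,2,7,2,9,1,6,2,3,1),
                 race = c("asian","asian","asian","black","asian","black",
                          "white","black","white","black","white","black"),
                 gender = c("male","female","female","male","female","male",
                            "female","male","female","male","female","male"))

res <- group_vs_overall(df, c("race", "gender"),
                        c("rating1", "rating2", "rating3"))
print(res)
```

Returning one data frame instead of printing each test makes it easy to adjust all p-values together.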
This may be a trivial question.
In my data, I have two groups, grp1 and grp2. In each group, some observations are assigned to the treatment group and some to the control group.
My question is whether the effect of the treatment on dv differs significantly between grp1 and grp2. In a way, this is a difference in differences.
I want to estimate if the following difference is significant:
dd = mean(dv_grp1_treat-dv_grp1_control)-mean(dv_grp2_treat-dv_grp2_control)
# create data
install.packages("librarian")
librarian::shelf(librarian,tidyverse,truncnorm)
aud_tr<- as.data.frame(list(dv=rtruncnorm(625, a=0,b=4, mean=2.1, sd=1))) %>% mutate(group="grp1_treat")
aud_notr <- as.data.frame(list(dv=rtruncnorm(625, a=0,b=4, mean=2, sd=1))) %>% mutate(group="grp1_control")
noaud_tr<- as.data.frame(list(dv=rtruncnorm(625, a=0,b=4, mean=2.4, sd=1))) %>% mutate(group="grp2_treat")
noaud_notr<- as.data.frame(list(dv=rtruncnorm(625, a=0,b=4, mean=2.1, sd=1))) %>% mutate(group="grp2_control")
df<- bind_rows(aud_tr,aud_notr,noaud_tr,noaud_notr)
unique(df$group)
[1] "grp1_treat" "grp1_control" "grp2_treat" "grp2_control"
I know how to run t.test for the difference in means within each group, but how do I do it if I want to examine the difference across groups?
t.test(df$dv[df$group=="grp1_treat"],df$dv[df$group=="grp1_control"])
t.test(df$dv[df$group=="grp2_treat"],df$dv[df$group=="grp2_control"])
It sounds like you need a two-way analysis of variance (ANOVA). Firstly, you should ensure that you separate out "group membership" and "treatment versus control" into two columns, since these are really two distinct variables:
df$treatment <- ifelse(grepl('treat', df$group), 'treat', 'control')
df$group <- ifelse(grepl('1', df$group), 'grp1', 'grp2')
Then you can carry out a two way ANOVA using aov
summary(aov(dv ~ group + treatment, data = df))
#> Df Sum Sq Mean Sq F value Pr(>F)
#> group 1 1.18 1.175 1.362 0.245
#> treatment 1 26.14 26.145 30.307 1.14e-07 ***
#> Residuals 197 169.95 0.863
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
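This tells you that, in this sample, the main effect of treatment was significant, but the main effect of group membership was not. Note that this additive model has no group:treatment interaction term, so on its own it does not test whether the treatment effect differs between the two groups.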
Data
Obviously, we don't have your data since it wasn't supplied in the question, but the following sample data frame has the same names and structure as your own:
set.seed(1)
df <- data.frame(dv = c(rnorm(50, 3.2), rnorm(50, 3.8),
rnorm(50, 3.5), rnorm(50, 4.1)),
group = rep(c('grp1_control', 'grp1_treat',
'grp2_control', 'grp2_treat'), each = 50))
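The difference in differences described in the question corresponds to the group-by-treatment interaction, which the additive model above leaves out. As a hedged sketch on the same simulated data, one could fit the interaction with lm() and read off its coefficient. The names fit, cell and dd are illustrative, and this is one possible way to set it up, not the answer's own method.

```r
# Same simulated data and column split as in the answer above
set.seed(1)
df <- data.frame(dv = c(rnorm(50, 3.2), rnorm(50, 3.8),
                        rnorm(50, 3.5), rnorm(50, 4.1)),
                 group = rep(c('grp1_control', 'grp1_treat',
                               'grp2_control', 'grp2_treat'), each = 50))
df$treatment <- ifelse(grepl('treat', df$group), 'treat', 'control')
df$group <- ifelse(grepl('1', df$group), 'grp1', 'grp2')

# group * treatment expands to group + treatment + group:treatment
fit <- lm(dv ~ group * treatment, data = df)
print(coef(summary(fit)))

# The same quantity computed by hand from the four cell means
cell <- tapply(df$dv, list(df$group, df$treatment), mean)
dd <- (cell["grp1", "treat"] - cell["grp1", "control"]) -
      (cell["grp2", "treat"] - cell["grp2", "control"])
print(dd)
```

With R's default contrasts the coefficient named groupgrp2:treatmenttreat is the grp2 treatment effect minus the grp1 treatment effect, i.e. -dd, and its row in the coefficient table carries the t statistic and p-value for that difference.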
I am trying to use apply() on an array of matrices.
Here is an example:
data(UCBAdmissions)
fisher.test(UCBAdmissions[,,1]) #This works great
apply(UCBAdmissions, c(1,2,3), fisher.test) #This fails
The UCBAdmissions data contains six 2 x 2 contingency tables, one for each level of Dept: "A", "B", "C", "D", "E", and "F".
dimnames(UCBAdmissions)
#$Admit
#[1] "Admitted" "Rejected"
#$Gender
#[1] "Male" "Female"
#$Dept
#[1] "A" "B" "C" "D" "E" "F"
You can apply fisher.test to each of these six tables. It is not clear to me from your code apply(UCBAdmissions, c(1,2,3), fisher.test) to which part of the six tables you want to apply fisher.test.
If you want to apply fisher.test to the first three of the six tables, namely "A", "B", and "C", you need to subset the UCBAdmissions data first, and then set MARGIN (the second argument of apply) to 3, the Dept dimension.
apply(UCBAdmissions[,,1:3], 3, fisher.test)
# $A
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 1.669e-05
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.1970420 0.5920417
# sample estimates:
# odds ratio
# 0.3495628
#
#
# $B
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 0.6771
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.2944986 2.0040231
# sample estimates:
# odds ratio
# 0.8028124
#
#
# $C
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 0.3866
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.8452173 1.5162918
# sample estimates:
# odds ratio
# 1.1329
Another option is to replace 3 with the dimension name:
apply(UCBAdmissions[,,1:3], "Dept", fisher.test)
This will give exactly the same result as the previous code.
In another case, if you want to apply fisher.test to contingency tables between Admit and Dept for "A", "B", "C", grouped by Gender, you can use:
apply(UCBAdmissions[,,1:3], "Gender", fisher.test)
# $Male
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 7.217e-16
# alternative hypothesis: two.sided
#
#
# $Female
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value < 2.2e-16
# alternative hypothesis: two.sided
To show the part being tested more clearly, I reshape the data and then filter it so that I have only male students in depts A, B, and C. Then, I apply fisher.test to the data.
library(tidyverse)
DF <- UCBAdmissions %>%
as.data.frame() %>%
filter(Gender == "Male",
Dept %in% c("A", "B", "C")) %>%
# keep Dept as the row identifier; Admit levels become columns
pivot_wider(id_cols = Dept, names_from = Admit, values_from = Freq)
DF
# # A tibble: 3 x 3
# Dept Admitted Rejected
# <fct> <dbl> <dbl>
# 1 A 512 313
# 2 B 353 207
# 3 C 120 205
fisher.test(DF[1:3, 2:3])
#
# Fisher's Exact Test for Count Data
#
# data: DF[1:3, 2:3]
# p-value = 7.217e-16
# alternative hypothesis: two.sided
The result is exactly the same as the one obtained from apply(UCBAdmissions[,,1:3], "Gender", fisher.test) for the Male group.
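For reference, the original call fails because MARGIN = c(1,2,3) hands fisher.test one cell at a time, so each call receives a single count and never a 2 x 2 table. If only the p-values are needed for all six departments, a small sketch along these lines may be more convenient than printing every test. The name p_by_dept is illustrative.

```r
data(UCBAdmissions)

# One Admit x Gender table per department; keep only the p-value
p_by_dept <- apply(UCBAdmissions, "Dept",
                   function(tab) fisher.test(tab)$p.value)
print(p_by_dept)

# Six tests are run, so an adjustment for multiple testing may be sensible
print(p.adjust(p_by_dept, method = "bonferroni"))
```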
Something like this:
Personally I do it this way:
First, make a list, UCB_list.
Then bind the list elements into a data frame with rbindlist from data.table.
Finally, use lapply, passing the column y = df$Gender that each column should be tested against:
library(data.table)
UCB_list <- list(UCBAdmissions)
df <- rbindlist(lapply(UCB_list, data.frame))
lapply(df, fisher.test, y = df$Gender)
> lapply(df, fisher.test, y = df$Gender)
$Admit
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.1537975 6.5020580
sample estimates:
odds ratio
1
$Gender
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 7.396e-07
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
16.56459 Inf
sample estimates:
odds ratio
Inf
$Dept
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 1
alternative hypothesis: two.sided
$Freq
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 0.4783
alternative hypothesis: two.sided
I have a (big) dataset which looks like this:
dat <- data.frame(m=c(rep("a",4),rep("b",3),rep("c",2)),
n1 =round(rnorm(mean = 20,sd = 10,n = 9)))
g <- rnorm(20,10,5)
dat
m n1
1 a 15.132
2 a 17.723
3 a 3.958
4 a 19.239
5 b 11.417
6 b 12.583
7 b 32.946
8 c 11.970
9 c 26.447
I want to perform a t-test on each category of "m" against the vector g, e.g.
n1.a <- c(15.132,17.723,3.958,19.239)
I need to do a t-test like t.test(n1.a,g)
I initially thought about breaking them up into a list using split(dat,dat$m) and
then using lapply, but it is not working.
Any thoughts on how to go about it?
Here's a tidyverse solution using map from purrr:
dat %>%
split(.$m) %>%
map(~ t.test(.x$n1, g))
Or, using lapply as you mentioned, which will store all of your t-test results in a list (or a shorter version using by, thanks @markus):
dat_split <- split(dat, dat$m)
res <- lapply(dat_split, function(x) t.test(x$n1, g))
Or
res <- by(dat, dat$m, function(x) t.test(x$n1, g))
Which gives us:
$a
Welch Two Sample t-test
data: .x$n1 and g
t = 1.5268, df = 3.0809, p-value = 0.2219
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.61161 33.64902
sample estimates:
mean of x mean of y
21.2500 10.2313
$b
Welch Two Sample t-test
data: .x$n1 and g
t = 1.8757, df = 2.2289, p-value = 0.1883
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.325666 20.863073
sample estimates:
mean of x mean of y
17.0000 10.2313
$c
Welch Two Sample t-test
data: .x$n1 and g
t = 10.565, df = 19, p-value = 2.155e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
7.031598 10.505808
sample estimates:
mean of x mean of y
19.0000 10.2313
In base R you can do
lapply(split(dat, dat$m), function(x) t.test(x$n1, g))
Output
$a
Welch Two Sample t-test
data: x$n1 and g
t = 1.9586, df = 3.2603, p-value = 0.1377
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.033451 27.819258
sample estimates:
mean of x mean of y
21.0000 10.1071
$b
Welch Two Sample t-test
data: x$n1 and g
t = 2.3583, df = 2.3202, p-value = 0.1249
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.96768 25.75349
sample estimates:
mean of x mean of y
20.0000 10.1071
$c
Welch Two Sample t-test
data: x$n1 and g
t = 13.32, df = 15.64, p-value = 6.006e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
13.77913 19.00667
sample estimates:
mean of x mean of y
26.5000 10.1071
Data
set.seed(1)
dat <- data.frame(m=c(rep("a",4),rep("b",3),rep("c",2)),
n1 =round(rnorm(mean = 20,sd = 10,n = 9)))
g <- rnorm(20,10,5)
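With a big dataset, a list of printed tests gets unwieldy. As a sketch built on the same seeded data, the statistics could be collected into one data frame. The names tests and summary_tbl are illustrative.

```r
set.seed(1)
dat <- data.frame(m = c(rep("a", 4), rep("b", 3), rep("c", 2)),
                  n1 = round(rnorm(mean = 20, sd = 10, n = 9)))
g <- rnorm(20, 10, 5)

# One Welch t-test per category of m, each against the same vector g
tests <- lapply(split(dat$n1, dat$m), function(x) t.test(x, g))

# Pull the pieces of interest out of each htest object
summary_tbl <- data.frame(
  m       = names(tests),
  mean_x  = sapply(tests, function(tt) unname(tt$estimate[1])),
  t       = sapply(tests, function(tt) unname(tt$statistic)),
  df      = sapply(tests, function(tt) unname(tt$parameter)),
  p_value = sapply(tests, function(tt) tt$p.value),
  row.names = NULL
)
print(summary_tbl)
```

Splitting only the n1 column, not the whole data frame, keeps the anonymous function short; the tests themselves are unchanged.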
I can't find the bug in my code, and/or the flaw in my logic. I have a matrix, X, of 0's and 1's and a vector y of continuous values and I want to do a 2 sample t-test in R where the rows of X indicate the different groups of y.
For example:
x = matrix(rbinom(60,1,.5),ncol=10)
y = abs(rnorm(ncol(x)))
apply(x,1,function(x,y=y)t.test(y[x==1],y[x==0]))
So using this code I would have expected to get 6 t-tests where each row of X corresponds to the two groups of y. However, I get this error when I run my code:
Error in t.test(y[x == 1], y[x == 0]) :
promise already under evaluation: recursive default argument reference or earlier problems?
Can someone explain the error and modify my code to get what I want?
The problem comes from reusing the same variable names for your function arguments: the default y=y refers to the argument itself, not to the global y. This should work:
apply(x,1,function(x.f,y.f=y)t.test(y.f[x.f==1],y.f[x.f==0]))
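To see the scoping issue in isolation, here is a small sketch with made-up values; f_bad and f_ok are hypothetical names. A default argument is evaluated inside the function, so y = y looks up the argument y, which is the very default being evaluated.

```r
y <- c(0.2, 1.5, 0.7)

# The default refers to itself, so evaluating it triggers the error
f_bad <- function(x, y = y) y[x == 1]
msg <- tryCatch(f_bad(c(1, 0, 1)), error = function(e) conditionMessage(e))
print(msg)

# A differently named parameter lets the default find the global y
f_ok <- function(x.f, y.f = y) y.f[x.f == 1]
print(f_ok(c(1, 0, 1)))
```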
What about
apply(x,1,function(x,z)t.test(z[x==1],z[x==0]),y)
If you want to use a second argument within the function, you should also pass it to apply.
The following works:
> apply(x,1,function(a)t.test(y[a==1],y[a==0]))
You should give better names to data in data.frames and vectors so that x and y etc. can be used as general variables. Also, there is no need to send y to the function since it will be the same for all tests.
Output:
[[1]]
Welch Two Sample t-test
data: y[a == 1] and y[a == 0]
t = 0.43835, df = 5.377, p-value = 0.6782
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6356057 0.9036413
sample estimates:
mean of x mean of y
0.5807408 0.4467230
[[2]]
Welch Two Sample t-test
data: y[a == 1] and y[a == 0]
t = -0.80208, df = 5.5382, p-value = 0.4555
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0985419 0.5644195
sample estimates:
mean of x mean of y
0.4337110 0.7007722
[[3]]
Welch Two Sample t-test
data: y[a == 1] and y[a == 0]
t = 0.58194, df = 7.3884, p-value = 0.5779
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5584942 0.9283034
sample estimates:
mean of x mean of y
0.6329878 0.4480832
[[4]]
Welch Two Sample t-test
data: y[a == 1] and y[a == 0]
t = 1.1148, df = 4.8236, p-value = 0.3174
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4919082 1.2308641
sample estimates:
mean of x mean of y
0.7622223 0.3927443
[[5]]
Welch Two Sample t-test
data: y[a == 1] and y[a == 0]
t = 0.23436, df = 5.5539, p-value = 0.8231
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7818960 0.9439901
sample estimates:
mean of x mean of y
0.5729543 0.4919073
[[6]]
Welch Two Sample t-test
data: y[a == 1] and y[a == 0]
t = -1.015, df = 7.9168, p-value = 0.3401
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0152988 0.3954558
sample estimates:
mean of x mean of y
0.3855747 0.6954962
For only p values:
> apply(x,1,function(a)t.test(y[a==1],y[a==0])$p.value)
[1] 0.6781895 0.4555338 0.5779255 0.3173567 0.8231019 0.3400979
I would like to do the following paired t-test:
str1<-' ENSEMBLE 0.934 0.934 0.934 0.934 '
str2<-' J48 0.934 0.934 0.934 0.934 '
df1 <- read.table(text=scan(text=str1, what='', quiet=TRUE), header=TRUE)
df2 <- read.table(text=scan(text=str2, what='', quiet=TRUE), header=TRUE)
t.test ( df1$ENSEMBLE, df2$J48, mu=0 , alt="two.sided", paired = T, conf.level = 0.95)
I get the following result:
Paired t-test
data: df1$ENSEMBLE and df2$J48
t = NaN, df = 3, p-value = NA
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
NaN NaN
sample estimates:
mean of the differences
0
Why do I get it?
It's because the two datasets are exactly the same. Change a single value and the test gives a result:
df2[1,1] <- .935
t.test ( df1$ENSEMBLE, df2$J48, mu=0 , alt="two.sided", paired = T, conf.level = 0.95)
Paired t-test
data: df1$ENSEMBLE and df2$J48
t = -1, df = 3, p-value = 0.391
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0010456116 0.0005456116
sample estimates:
mean of the differences
-0.00025
Your two vectors are exactly the same, so every paired difference is zero. The differences have no variance and therefore a standard error of zero, so the t statistic is 0/0 and your answer is undefined.
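The arithmetic behind the NaN can be reproduced by hand. This sketch uses the standard paired t statistic, the mean of the differences divided by their standard error; the variable names are illustrative.

```r
ensemble <- rep(0.934, 4)
j48 <- rep(0.934, 4)

d <- ensemble - j48               # paired differences, all exactly zero
se <- sd(d) / sqrt(length(d))     # standard error of the mean difference
t_stat <- mean(d) / se            # 0 / 0

print(c(mean_diff = mean(d), se = se, t = t_stat))

# A simple guard one might add before running the test
if (sd(d) == 0) message("All paired differences are identical; t is undefined.")
```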