Conduct Multiple T-Tests in R, Condensed

I wish to conduct multiple t-tests in R without having to copy and paste each test. Each test checks whether differences exist between the "Type" groups ("Left" or "Right") for a given "Level_#" column. Currently, I might have:
t.test(Level_1 ~ Type, alternative="two.sided", conf.level=0.99)
t.test(Level_2 ~ Type, alternative="two.sided", conf.level=0.99)
Type Level_1 Level_2 Level_3
Left 17 50 98
Right 18 65 65
Left 23 7 19
Left 65 7 100
Right 9 13 17
The issue is that I have hundreds of "Level_#" and would like to know how to automate this process and output a data frame of the results. My thought is to somehow incorporate an apply function.

You can do this with a tidyverse approach, using the purrr and broom packages:
require(tidyverse)
require(broom)
df %>%
gather(var, level, -type) %>%
nest(-var) %>%
mutate(model = purrr::map(data, function(x) {
t.test(level ~ type, alternative="two.sided", conf.level=0.99,
data = x)}),
value = purrr::map(model, tidy),
conf.low = purrr::map_dbl(value, "conf.low"),
conf.high = purrr::map_dbl(value, "conf.high"),
pvalue = purrr::map_dbl(value, "p.value")) %>%
select(-data, -model, -value)
Output:
var conf.low conf.high pvalue
1 level1 -3.025393 4.070641 0.6941518
2 level2 -3.597754 3.356125 0.9260015
3 level3 -3.955293 3.673493 0.9210724
Sample data:
set.seed(123)
df <- data.frame(type = rep(c("left", "right"), 25),
level1 = rnorm(50, mean = 85, sd = 5),
level2 = rnorm(50, mean = 75, sd = 5),
level3 = rnorm(50, mean = 65, sd = 5))
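Since the question mentions an apply function, here is a minimal base R sketch of the same idea, assuming the sample data above. The helper name run_one_test is hypothetical; reformulate() builds the formula level ~ type from the column name, and the one-row data frames are stacked with rbind.

```r
# Sample data in the same shape as the answer's example.
set.seed(123)
df <- data.frame(type   = rep(c("left", "right"), 25),
                 level1 = rnorm(50, mean = 85, sd = 5),
                 level2 = rnorm(50, mean = 75, sd = 5),
                 level3 = rnorm(50, mean = 65, sd = 5))

# Every column except the grouping column is treated as a response.
level_cols <- setdiff(names(df), "type")

# Hypothetical helper: run one t-test and return a one-row data frame.
run_one_test <- function(col) {
  tt <- t.test(reformulate("type", response = col), data = df,
               alternative = "two.sided", conf.level = 0.99)
  data.frame(var       = col,
             conf.low  = tt$conf.int[1],
             conf.high = tt$conf.int[2],
             pvalue    = tt$p.value)
}

# One row per tested column.
results <- do.call(rbind, lapply(level_cols, run_one_test))
results
```

With hundreds of columns, only level_cols changes; the rest of the code stays the same.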

Related

Using a temporal inner variable in dplyr outside of the group

I need to calculate an FDR variable per group, using an expected random distribution of p values (corresponds to the "Random" type).
library(dplyr)
library(data.table)
calculate_empirical_fdr = function(control_pVal, test_pVal) {
m_control = length(control_pVal)
m_test = length(test_pVal)
unlist(lapply(test_pVal, function(significance_threshold) {
m_control = length(control_pVal)
m_test = length(test_pVal)
FP_expected = length(control_pVal[control_pVal<=significance_threshold])*m_test/m_control # number of expected false positives in a p-value sequence with the size m_test
S = length(test_pVal[test_pVal<=significance_threshold]) # number of significant hits (FP + TP)
return(FP_expected/S)
}))
}
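To make clear what this function returns, here is a small worked example. It is a sketch that restates the same logic with vapply and sum (an equivalent rewrite, not the original code), and the p-values are made up for illustration.

```r
# Same logic as the function above: for each test p-value used as a threshold,
# divide the expected number of false positives by the number of observed hits.
calculate_empirical_fdr <- function(control_pVal, test_pVal) {
  m_control <- length(control_pVal)
  m_test <- length(test_pVal)
  vapply(test_pVal, function(threshold) {
    # Control hits at this threshold, rescaled from control size to test size
    fp_expected <- sum(control_pVal <= threshold) * m_test / m_control
    # Observed hits in the test set (false positives + true positives)
    s <- sum(test_pVal <= threshold)
    fp_expected / s
  }, numeric(1))
}

control <- c(0.1, 0.3, 0.5, 0.9)  # made-up control p-values
test    <- c(0.05, 0.2)           # made-up test p-values

calculate_empirical_fdr(control, test)
# threshold 0.05: 0 control hits -> 0 * 2/4 = 0   expected FP; 1 test hit  -> 0
# threshold 0.20: 1 control hit  -> 1 * 2/4 = 0.5 expected FP; 2 test hits -> 0.25
```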
An example dataset with groups I need to control for in the "Group" variable:
set.seed(42)
library(dplyr); library(data.table)
dataset_test = data.table(Type = c(rep("Random", 500),
rep("test1", 500),
rep("test2", 500)),
Group = sample(c("group1", "group2", "group3"), 1500, replace = T),
Pvalue = c(runif(n = 500),
rbeta(n = 500, shape1 = 1, shape2 = 4),
rbeta(n = 500, shape1 = 1, shape2 = 6))
)
I have found that the best way to apply my function per group would be to use a temporary variable that stores the p-values of the "Random" type, but this does not work:
dataset_test %>%
group_by(Group) %>%
{filter(Type=="Random") %>% select(Pvalue) ->> control_set } %>%
group_by(Type, add = T) %>%
mutate(FDR_empirical = calculate_empirical_fdr(control_pVal = control_set,
test_pVal = Pvalue)) %>%
data.table()
Error in filter(Type == "Random") : object 'Type' not found
I understand that the temporary variable probably does not "see" the environment within the data.table. I would be glad to hear any suggestions on how to fix it.
You can do something like this, which filters the control-group p-values using the data.table special symbol .BY:
setDT(dataset_test)
dataset_test[
i= Type!="Random",
j = FDR_empirical:=calculate_empirical_fdr(dataset_test[Type=="Random" & Group ==.BY$Group, Pvalue], Pvalue),
by = .(Group, Type)
]
Output:
Type Group Pvalue FDR_empirical
1: Random group1 0.70292111 NA
2: Random group1 0.72383117 NA
3: Random group1 0.76413459 NA
4: Random group1 0.87942702 NA
5: Random group2 0.71229213 NA
---
1496: test2 group3 0.34817178 0.3681791
1497: test2 group1 0.22419118 0.2308988
1498: test2 group3 0.07258545 0.2314655
1499: test2 group2 0.24687976 0.2849462
1500: test2 group1 0.12206777 0.1760657
Two possible solutions
Use the dot .
dataset_test %>%
group_by(Group) %>%
{filter(., Type=="Random") %>% select(Pvalue) ->> control_set; . } %>%
group_by(Type, add = T)
Use the tee-pipe %T>% from the magrittr package
library(magrittr)
dataset_test %>%
group_by(Group) %T>% {
filter(., Type=="Random") %>% select(Pvalue) ->> control_set} %>%
group_by(Type, add = T)
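Another option, sketched here as an assumption rather than as part of either answer, avoids the global assignment entirely: store the "Random" p-values of each group in a list column and join them back before computing the FDR. The names dataset_small, control_by_group and control_p are hypothetical, and the data set is a smaller stand-in for the one in the question.

```r
library(dplyr)

# Same logic as the question's function, restated so the block is self-contained.
calculate_empirical_fdr <- function(control_pVal, test_pVal) {
  m_control <- length(control_pVal)
  m_test <- length(test_pVal)
  vapply(test_pVal, function(threshold) {
    fp_expected <- sum(control_pVal <= threshold) * m_test / m_control
    fp_expected / sum(test_pVal <= threshold)
  }, numeric(1))
}

# Smaller stand-in for the question's data set.
set.seed(42)
dataset_small <- data.frame(
  Type   = rep(c("Random", "test1", "test2"), each = 60),
  Group  = sample(c("group1", "group2", "group3"), 180, replace = TRUE),
  Pvalue = c(runif(60), rbeta(60, 1, 4), rbeta(60, 1, 6))
)

# One row per Group, holding that group's control p-values as a list element.
control_by_group <- dataset_small %>%
  filter(Type == "Random") %>%
  group_by(Group) %>%
  summarise(control_p = list(Pvalue))

# Attach the control vector to every test row, then compute per Group and Type.
fdr_result <- dataset_small %>%
  filter(Type != "Random") %>%
  left_join(control_by_group, by = "Group") %>%
  group_by(Group, Type) %>%
  mutate(FDR_empirical = calculate_empirical_fdr(control_p[[1]], Pvalue)) %>%
  ungroup() %>%
  select(-control_p)
```

Within each Group/Type block, every row carries the same control vector, so control_p[[1]] picks the one that belongs to the current Group.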

How to easily generate/simulate example data with different groups for modelling

How can I easily generate or simulate meaningful example data for modelling, e.g. by asking for n rows of data for 2 groups whose sex distributions and mean ages differ by X and Y units, respectively? Is there a simple way to do this automatically? Are there any packages for it?
For example, what would be the simplest way for generating such data?
groups: two groups: A, B
sex: different sex distributions: A 30%, B 70%
age: different mean ages: A 50, B 70
PS! Tidyverse solutions are especially welcome.
My best try so far is still quite a lot of code:
n=100
d = bind_rows(
#group A females
tibble(group = rep("A"),
sex = rep("Female"),
age = rnorm(n*0.4, 50, 4)),
#group B females
tibble(group = rep("B"),
sex = rep("Female"),
age = rnorm(n*0.3, 45, 4)),
#group A males
tibble(group = rep("A"),
sex = rep("Male"),
age = rnorm(n*0.20, 60, 6)),
#group B males
tibble(group = rep("B"),
sex = rep("Male"),
age = rnorm(n*0.10, 55, 4)))
d %>% group_by(group, sex) %>%
summarise(n = n(),
mean_age = mean(age))
There are lots of ways to sample from vectors and to draw from random distributions in R. For example, the data set you requested could be created like this:
set.seed(69) # Makes samples reproducible
df <- data.frame(groups = rep(c("A", "B"), each = 100),
sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
age = c(runif(100, 25, 75), runif(100, 50, 90)))
And we can use the tidyverse to show it does what was expected:
library(dplyr)
df %>%
group_by(groups) %>%
summarise(age = mean(age),
percent_male = length(which(sex == "M")))
#> # A tibble: 2 x 3
#> groups age percent_male
#> <chr> <dbl> <int>
#> 1 A 49.4 29
#> 2 B 71.0 50
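If several groups need to be generated this way, the sampling can be wrapped in a small helper. The following is a sketch only: simulate_group and its parameters are hypothetical names, ages are drawn from a normal distribution instead of the uniform one used above, and the 30%/70% split is assumed to refer to the share of males.

```r
# Hypothetical helper: simulate one group with a given share of males and mean age.
simulate_group <- function(group, n, p_male, mean_age, sd_age = 5) {
  data.frame(
    group = group,
    sex   = sample(c("M", "F"), n, replace = TRUE, prob = c(p_male, 1 - p_male)),
    age   = rnorm(n, mean = mean_age, sd = sd_age),
    stringsAsFactors = FALSE
  )
}

set.seed(69)  # makes the draws reproducible

# One row of specifications per group.
specs <- data.frame(group    = c("A", "B"),
                    n        = c(100, 100),
                    p_male   = c(0.3, 0.7),
                    mean_age = c(50, 70),
                    stringsAsFactors = FALSE)

# Simulate each group from its specification row and stack the results.
sim <- do.call(rbind, Map(simulate_group,
                          specs$group, specs$n, specs$p_male, specs$mean_age))
rownames(sim) <- NULL

# Check the simulated data against the specification.
tapply(sim$age, sim$group, mean)          # mean age per group
tapply(sim$sex == "M", sim$group, mean)   # share of males per group
```

Adding a third group only requires another row in specs.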

R package "infer" - Iterative bootstrapping / looping over column names

I'm bootstrapping with the infer package.
The statistic of interest is the mean. Example data is given by a tibble with 3 columns and 5 rows; my real tibble has 86 rows and 40 columns. For every column I want to run a bootstrap simulation, as shown below for the column "x" in the tibble "test_tibble".
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15)
# A tibble: 5 x 3
x y z
<int> <int> <int>
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
specify(test_tibble, response = x) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
# A tibble: 1 x 2
lower_CI upper_CI
<dbl> <dbl>
1 2.10 4
I am now looking for a way of doing the same thing for the other columns in my tibble. I have tried a for-loop like this:
for (i in 1:ncol(test_tibble)){
var_name <- names(test_tibble)[i]
specify(test_tibble, response = var_name) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
}
Unfortunately, this returns the following error:
Error: The response variable `var_name` cannot be found in this dataframe.
Is there any way of iterating over the columns x, y and z without entering them manually as arguments for "response"? That'd be quite tedious for 40 columns.
This is a tricky question with a tricky answer.
Take a look at the response argument of the specify function in documentation:
The variable name in x that will serve as the response. This is alternative to using the formula argument.
With this in mind I modified the code to automate the process, adding one more column to the original dataframe and using the formula argument to obtain the same result, using a column of ones as explanatory variable.
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15, w = seq(1, 1, length.out = 5))
for (i in 1:ncol(test_tibble)){
var_name <- names(test_tibble)[i]
specify(test_tibble, formula = eval(parse(text = paste0(var_name, "~", "w"))))[, 1] %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
}
Hope it helps
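Note that a pipeline inside a for loop is neither printed nor stored unless it is assigned or passed to print(). As a package-free alternative (a plain percentile bootstrap in base R, not the infer API), the following sketch computes the same kind of interval for every column and collects the results in one matrix. The helper name boot_mean_ci is hypothetical.

```r
# Hypothetical helper: percentile bootstrap confidence interval for the mean.
boot_mean_ci <- function(x, reps = 1000, probs = c(0.025, 0.975)) {
  # Resample x with replacement and record the mean of each resample.
  boot_means <- replicate(reps, mean(sample(x, size = length(x), replace = TRUE)))
  quantile(boot_means, probs = probs, names = FALSE)
}

set.seed(1)  # makes the resampling reproducible
test_df <- data.frame(x = 1:5, y = 6:10, z = 11:15)

# sapply() iterates over the columns; t() puts one column of test_df per row.
ci <- t(sapply(test_df, boot_mean_ci))
colnames(ci) <- c("lower_CI", "upper_CI")
ci
```

The row names of ci are the column names of the input, so the 40-column case needs no manual listing of variables.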

dplyr with stats test

I have the following data setup:
library(dplyr)
library(broom)
pop.mean = 0.10
df = data.frame(
trial = as.integer(runif(1000, min = 5, max = 20)),
success = as.integer(runif(1000, min = 0, max = 20)),
my.group = factor(rep(c("a","b","c","d"), each = 250))
)
I want to group on my.group and apply binom.test
bi.test <- df %>% group_by(my.group) %>%
do(test = binom.test(sum(success),
sum(trial),
pop.mean,
alternative = c("two.sided"),
conf.level = 0.95))
I get an error message saying that success cannot be found. What am I doing wrong here?
We need to extract the columns using $ within do
res <- df %>%
group_by(my.group) %>%
do(test = binom.test(sum(.$success),
sum(.$trial),
pop.mean,
alternative = c("two.sided"),
conf.level = 0.95))
If we are using the broom functions, then
res1 <- df %>%
group_by(my.group) %>%
do(test = tidy(binom.test(sum(.$success),
sum(.$trial),
pop.mean,
alternative = c("two.sided"),
conf.level = 0.95)))
res1$test %>%
bind_rows %>%
bind_cols(res1[1], .)
# A tibble: 4 x 9
# my.group estimate statistic p.value parameter conf.low conf.high method alternative
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <fctr>
#1 a 0.7908251 2310 0 2921 0.7756166 0.8054487 Exact binomial test two.sided
#2 b 0.7525138 2320 0 3083 0.7368831 0.7676640 Exact binomial test two.sided
#3 c 0.8446337 2479 0 2935 0.8310152 0.8575612 Exact binomial test two.sided
#4 d 0.7901683 2395 0 3031 0.7752305 0.8045438 Exact binomial test two.sided
NOTE: The dataset was created with a seed of 24 i.e. set.seed(24)
Thanks @akrun.
I came up with a solution using tidyr::nest and purrr::map after reading your answer:
res <- df %>%
group_by(my.group) %>%
tidyr::nest() %>%
mutate(bi.test =
purrr::map(data, function(df) broom::tidy(
binom.test(sum(df$success),
sum(df$trial),
pop.mean,
alternative = c("two.sided"),
conf.level = 0.95)))) %>%
select(my.group, bi.test) %>%
tidyr::unnest(bi.test)
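For comparison, the same per-group test can be sketched in base R with split() and lapply(), which makes explicit that each group's rows are passed to the function as their own data frame (the role played by . inside do). This is an illustrative alternative under the question's data setup, not a replacement for the answers above.

```r
pop.mean <- 0.10

set.seed(24)  # the seed mentioned in the answer above
df <- data.frame(
  trial    = as.integer(runif(1000, min = 5, max = 20)),
  success  = as.integer(runif(1000, min = 0, max = 20)),
  my.group = factor(rep(c("a", "b", "c", "d"), each = 250))
)

# split() yields one data frame per group; each is tested separately.
per_group <- lapply(split(df, df$my.group), function(g) {
  bt <- binom.test(sum(g$success), sum(g$trial), p = pop.mean,
                   alternative = "two.sided", conf.level = 0.95)
  data.frame(my.group  = as.character(g$my.group[1]),
             estimate  = unname(bt$estimate),
             p.value   = bt$p.value,
             conf.low  = bt$conf.int[1],
             conf.high = bt$conf.int[2])
})

# Stack the one-row results into a single table.
res_base <- do.call(rbind, per_group)
res_base
```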

Create t.test table with dplyr?

Suppose I have data that looks like this:
set.seed(031915)
myDF <- data.frame(
Name= rep(c("A", "B"), times = c(10,10)),
Group = rep(c("treatment", "control", "treatment", "control"), times = c(5,5,5,5)),
X = c(rnorm(n=5,mean = .05, sd = .001), rnorm(n=5,mean = .02, sd = .001),
rnorm(n=5,mean = .08, sd = .02), rnorm(n=5,mean = .03, sd = .02))
)
I want to create a t.test table with a row for "A" and one for "B"
I can write my own function that does that:
ttestbyName <- function(Name) {
b <- t.test(myDF$X[myDF$Group == "treatment" & myDF$Name==Name],
myDF$X[myDF$Group == "control" & myDF$Name==Name],
conf.level = 0.90)
dataNameX <- data.frame(Name = Name,
treatment = round(b$estimate[[1]], digits = 4),
control = round(b$estimate[[2]], digits = 4),
CI = paste('(',round(b$conf.int[[1]],
digits = 4),', ',
round(b$conf.int[[2]],
digits = 4), ')',
sep=""),
pvalue = round(b$p.value, digits = 4),
ntreatment = nrow(myDF[myDF$Group == "treatment" & myDF$Name==Name,]),
ncontrol = nrow(myDF[myDF$Group == "control" & myDF$Name==Name,]))
}
library(parallel)
Test_by_Name <- mclapply(unique(myDF$Name), ttestbyName)
Test_by_Name <- do.call("rbind", Test_by_Name)
and the output looks like this:
Name treatment control CI pvalue ntreatment ncontrol
1 A 0.0500 0.0195 (0.0296, 0.0314) 0.0000 5 5
2 B 0.0654 0.0212 (0.0174, 0.071) 0.0161 5 5
I'm wondering if there is a cleaner way of doing this with dplyr. I thought about using group_by, but I'm a little lost.
Thanks!
Not much cleaner, but here's an improvement:
library(dplyr)
ttestbyName <- function(myName) {
bt <- filter(myDF, Group=="treatment", Name==myName)
bc <- filter(myDF, Group=="control", Name==myName)
b <- t.test(bt$X, bc$X, conf.level=0.90)
dataNameX <- data.frame(Name = myName,
treatment = round(b$estimate[[1]], digits = 4),
control = round(b$estimate[[2]], digits = 4),
CI = paste('(',round(b$conf.int[[1]],
digits = 4),', ',
round(b$conf.int[[2]],
digits = 4), ')',
sep=""),
pvalue = round(b$p.value, digits = 4),
ntreatment = nrow(bt), # changes only in
ncontrol = nrow(bc)) # these 2 nrow() args
}
You should really replace the do.call function with rbindlist from data.table:
library(data.table)
Test_by_Name <- lapply(unique(myDF$Name), ttestbyName)
Test_by_Name <- rbindlist(Test_by_Name)
or, even better, use the %>% pipes:
Test_by_Name <- myDF$Name %>%
unique %>%
lapply(., ttestbyName) %>%
rbindlist
> Test_by_Name
Name treatment control CI pvalue ntreatment ncontrol
1: A 0.0500 0.0195 (0.0296, 0.0314) 0.0000 5 5
2: B 0.0654 0.0212 (0.0174, 0.071) 0.0161 5 5
An old question, but the broom package has since been made available for this exact purpose (as well as other statistical tests):
library(broom)
library(dplyr)
myDF %>% group_by(Name) %>%
do(tidy(t.test(X~Group, data = .)))
Source: local data frame [2 x 9]
Groups: Name [2]
Name estimate estimate1 estimate2 statistic p.value
(fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
1 A -0.03050475 0.01950384 0.05000860 -63.838440 1.195226e-09
2 B -0.04423181 0.02117864 0.06541046 -3.104927 1.613625e-02
Variables not shown: parameter (dbl), conf.low (dbl), conf.high (dbl)
library(tidyr)
library(dplyr)
myDF %>% group_by(Group) %>% mutate(rowname=1:n())%>%
spread(Group, X) %>%
group_by(Name) %>%
do(b = t.test(.$control, .$treatment)) %>%
mutate(
treatment = round(b[['estimate']][[2]], digits = 4),
control = round(b[['estimate']][[1]], digits = 4),
CI = paste0("(", paste(b[['conf.int']], collapse=", "), ")"),
pvalue = b[['p.value']]
)
# Name treatment control CI pvalue
#1 A 0.0500 0.0195 (-0.031677109707283, -0.0293323994902097) 1.195226e-09
#2 B 0.0654 0.0212 (-0.0775829100729602, -0.010880719830447) 1.613625e-02
You can add ncontrol, ntreatment manually.
You can do it with a custom t.test function and do:
my.t.test <- function(data, formula, ...)
{
tt <- t.test(formula=formula, data=data, ...)
ests <- tt$estimate
names(ests) <- sub("mean in group ()", "\\1",names(ests))
counts <- xtabs(formula[c(1,3)],data)
names(counts) <- paste0("n",names(counts))
cbind(
as.list(ests),
data.frame(
CI = paste0("(", paste(tt$conf.int, collapse=", "), ")"),
pvalue = tt$p.value,
stringsAsFactors=FALSE
),
as.list(counts)
)
}
myDF %>% group_by(Name) %>% do(my.t.test(.,X~Group))
Source: local data frame [2 x 7]
Groups: Name
Name control treatment CI pvalue ncontrol ntreatment
1 A 0.01950384 0.05000860 (-0.031677109707283, -0.0293323994902097) 1.195226e-09 5 5
2 B 0.02117864 0.06541046 (-0.0775829100729602, -0.010880719830447) 1.613625e-02 5 5
