Recap statistic after lapply


I have a dataframe with multiple columns and two different groups - see below.
set.seed(123)
d <- data.frame(
q1 = rnorm(20),
q2 = rnorm(20),
q3 = rnorm(20),
group = sample(c("A", "B"), size = 20, replace = TRUE))
I use lapply to run a t-test on each column, comparing the two groups, as shown below:
lapply(d[,-4], function(i) t.test(i ~ d$group))
For each column, lapply returns the full test output with several statistics (only column q1 is shown here):
$q1
Welch Two Sample t-test
data: i by d$group
t = -0.76262, df = 17.323, p-value = 0.4559
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.2294678 0.5759458
sample estimates:
mean in group A mean in group B
-0.05443279 0.27232820
I want to summarise the main statistics (t, df, p-value) in a single table with one row per column (q1, q2, q3, ...).

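You can use lapply() again to extract each parameter and combine the results with bind_rows(). Here l is the list of t.test() results built in the Example at the end of this answer: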
library(dplyr)
lapply(l, function(x) {
data.frame(t = x$statistic,
df = x$parameter,
pv = x$p.value) # returns a dataframe for each element in l
}) %>% bind_rows()
# t df pv
# 1 -1.031983 13.533116 0.32017136
# 2 -2.458574 9.771018 0.03427922
# 3 1.421821 11.416813 0.18181697
You can do this in one shot:
lapply(d[,-4], function(i) {
res <- t.test(i ~ d$group)
data.frame(t = res$statistic,
df = res$parameter,
pv = res$p.value)
}) %>% bind_rows()
If you want to keep reference to the column names pass .id to bind_rows():
lapply(d[,-4], function(i) {
res <- t.test(i ~ d$group)
data.frame(t = res$statistic,
df = res$parameter,
pv = res$p.value)
}) %>% bind_rows(.id='id')
# id t df pv
# 1 q1 -0.7626249 17.32329 0.4559469
# 2 q2 -1.6467070 17.73117 0.1172263
# 3 q3 0.5288851 13.01589 0.6057874
Example:
set.seed(123)
d <- data.frame(
q1 = rnorm(20),
q2 = rnorm(20),
q3 = rnorm(20),
group = sample(c("A", "B"), size = 20, replace = TRUE))
l <- lapply(d[,-4], function(i) {
t.test(i ~ d$group)
})
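For readers who prefer to stay in base R, the same summary table can be sketched without dplyr. This is only an illustrative alternative under the question's own data; the object name `stats` and the output column names are assumptions chosen for this example.

```r
set.seed(123)
d <- data.frame(
  q1 = rnorm(20),
  q2 = rnorm(20),
  q3 = rnorm(20),
  group = sample(c("A", "B"), size = 20, replace = TRUE))

# One named numeric vector per tested column; sapply() stacks them
# into a matrix with one column per variable (q1, q2, q3)
stats <- sapply(d[, -4], function(i) {
  res <- t.test(i ~ d$group)
  c(t  = unname(res$statistic),
    df = unname(res$parameter),
    pv = res$p.value)
})

# Transpose so that each tested column becomes a row, and keep its name
stats <- data.frame(id = colnames(stats), t(stats), row.names = NULL)
stats
```

Because each element returned inside sapply() is a plain numeric vector, no list-to-data-frame binding step is needed; the trade-off is that every extracted value has to be numeric.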

Related

How can I automate t-test for nested variables in R?

I would like to automate the collection of summary statistics that arise from t-tests. In the example below I have nested variables Age, Location, and Treatment. For each Age & Location I would like to run a t-test based on Treatment which has the two categorical names Control & Treatment. Put another way, I would like to know about the difference between the Control and Treatment means at each Location for each Age.
I would like to run the t-tests using the col_t_welch function in matrixTests because the output already has several of the summary statistics I'm looking for (i.e., mean.diff, stderr, and pvalue). How could I set up my dataframe (df1) to be able to run a for-loop for a nested t-test?
Reproducible Example:
library(matrixTests)
library(ggplot2)
set.seed(123)
df1 <- data.frame(matrix(ncol = 4, nrow = 36))
x <- c("Age","Location","Treatment","Value")
colnames(df1) <- x
df1$Age <- as.factor(rep(c(1,2,3), each = 12))
df1$Location <- as.factor(rep(c("Central","North"), each = 6))
df1$Treatment <- as.factor(rep(c("Control","Treatment"), each = 3))
df1$Value <- round(rnorm(36,200,25),0)
# I can't get the for-loop below to work because I'm not sure how to set up the data frame, but I was thinking something along these lines.
i <- 1
p <- numeric(length = 3*2)
mean_diff <- numeric(length = 3*2)
SE_diff <- numeric(length = 3*2)
for(j in c("1", "2", "3")){
for(k in c("Control", "Treatment")){
ttest <- col_t_welch(Value, data = df1, subset = Age == j & Treatment == k)
p[i] <- ttest$pvalue
mean_diff[i] <- ttest$mean.diff
SE_diff[i] <- ttest$stderr
i <- i + 1
}
}
The ideal final data frame would look like d2 below.
d2 <- expand.grid(Age = rep(c(1,2,3), 1),
Location = rep(c("Central","North"), 1),
mean_diff = NA,
SE_diff = NA,
pvalue = NA)
C1 <- df1[c(1:6),3:4]
N1 <- df1[c(7:12),3:4]
C2 <- df1[c(13:18),3:4]
N2 <- df1[c(19:24),3:4]
C3 <- df1[c(25:30),3:4]
N3 <- df1[c(31:36),3:4]
c1_mod <- col_t_welch(x=C1[1:3,2], y=C1[4:6,2])
n1_mod <- col_t_welch(x=N1[1:3,2], y=N1[4:6,2])
c2_mod <- col_t_welch(x=C2[1:3,2], y=C2[4:6,2])
n2_mod <- col_t_welch(x=N2[1:3,2], y=N2[4:6,2])
c3_mod <- col_t_welch(x=C3[1:3,2], y=C3[4:6,2])
n3_mod <- col_t_welch(x=N3[1:3,2], y=N3[4:6,2])
d2[1,3] <- c1_mod$mean.diff
d2[1,4] <- c1_mod$stderr
d2[1,5] <- c1_mod$pvalue
d2[2,3] <- c2_mod$mean.diff
d2[2,4] <- c2_mod$stderr
d2[2,5] <- c2_mod$pvalue
d2[3,3] <- c3_mod$mean.diff
d2[3,4] <- c3_mod$stderr
d2[3,5] <- c3_mod$pvalue
d2[4,3] <- n1_mod$mean.diff
d2[4,4] <- n1_mod$stderr
d2[4,5] <- n1_mod$pvalue
d2[5,3] <- n2_mod$mean.diff
d2[5,4] <- n2_mod$stderr
d2[5,5] <- n2_mod$pvalue
d2[6,3] <- n3_mod$mean.diff
d2[6,4] <- n3_mod$stderr
d2[6,5] <- n3_mod$pvalue
d2

I think this might help you
Libraries
library(matrixTests)
library(tidyverse)
Data
set.seed(123)
df1 <- data.frame(matrix(ncol = 4, nrow = 36))
x <- c("Age","Location","Treatment","Value")
colnames(df1) <- x
df1$Age <- as.factor(rep(c(1,2,3), each = 12))
df1$Location <- as.factor(rep(c("Central","North"), each = 6))
df1$Treatment <- as.factor(rep(c("Control","Treatment"), each = 3))
df1$Value <- round(rnorm(36,200,25),0)
How to
df1 %>%
group_nest(Age,Location,Treatment) %>%
pivot_wider(names_from = Treatment,values_from = data) %>%
mutate(
test = map2(
.x = Control,
.y = Treatment,
.f = ~col_t_welch(.x,.y)
)
) %>%
unnest(test) %>%
select(Age,Location,pvalue,mean.diff,stderr)
Result
# A tibble: 6 x 5
Age Location pvalue mean.diff stderr
<fct> <fct> <dbl> <dbl> <dbl>
1 1 Central 0.675 -9.67 21.3
2 1 North 0.282 -22 17.7
3 2 Central 0.925 -3 28.4
4 2 North 0.570 9.33 14.6
5 3 Central 0.589 -14.7 25.0
6 3 North 0.311 -11.3 8.59
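As a cross-check that needs no extra packages, the same per-cell comparison can be sketched in base R with split() and t.test(). This is an illustrative alternative, not the method used in the answer above; the result columns are named after the d2 layout in the question, and the standard error is computed by hand as the usual Welch standard error, which is an assumption about what the stderr column represents.

```r
set.seed(123)
df1 <- data.frame(
  Age       = factor(rep(1:3, each = 12)),
  Location  = factor(rep(rep(c("Central", "North"), each = 6), times = 3)),
  Treatment = factor(rep(rep(c("Control", "Treatment"), each = 3), times = 6)),
  Value     = round(rnorm(36, 200, 25), 0)
)

# One data frame per Age x Location cell
cells <- split(df1, list(df1$Age, df1$Location))

res <- do.call(rbind, lapply(cells, function(g) {
  x <- g$Value[g$Treatment == "Control"]
  y <- g$Value[g$Treatment == "Treatment"]
  tt <- t.test(x, y)  # Welch test by default (var.equal = FALSE)
  data.frame(
    Age       = g$Age[1],
    Location  = g$Location[1],
    mean_diff = mean(x) - mean(y),                             # Control minus Treatment
    SE_diff   = sqrt(var(x) / length(x) + var(y) / length(y)), # Welch standard error
    pvalue    = tt$p.value
  )
}))
rownames(res) <- NULL
res
```

The rows come out ordered by Location and then Age, which is the same order that expand.grid() produces for d2 in the question.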

variable length df subsampling function r

I need to write a function involving subsetting a df by a variable n bins. Like, if n is 2, then subsample the df some number of times in two bins (from the first half, then from the second half). If n is 3, subsample in 3 bins (first 1/3, second 1/3, third 1/3). I've been doing this for different lengths of n manually so far, and I know there must be a better way to do it. I want to write it into a function with n as an input, but I can't make it work so far. Code below.
# create df
df <- data.frame(year = c(1:46),
sample = seq(from=10,to=30,length.out = 46) + rnorm(46,mean=0,sd=2) )
# real df has some NAs, so we'll add some here
df[c(20,32),2] <- NA
This df represents 46 years of sampling. I want to pretend that instead of 46 samples I only took 2: one at a random year in the first half (1:23) and one at a random year in the second half (24:46).
# to subset in 2 groups, say, 200 times
# I'll make a df of elements to sample
samplelist <- data.frame(firstsample = sample(1:(nrow(df)/2),200,replace = T), # first sample in first half of vector
secondsample = sample((nrow(df)/2 + 1):nrow(df),200, replace = T) )# second sample in second half of vector (starts at row 24, so the halves do not overlap)
samplelist <- as.matrix(samplelist)
# start a df to add to
plot_df <- df %>% mutate(first='all',
second = 'all',
group='full')
# fill the df using coords from expand.grid
for(i in 1:nrow(samplelist)){
plot_df <<- rbind(plot_df,
df[samplelist[i,] , ] %>%
mutate(
first = samplelist[i,1],
second = samplelist[i,2],
group = i
))
print(i)
}
(If we can make it skip samples on "NA" sample years, that would be extra good).
So, if I wanted to do this for three points instead of two, I'd repeat the process like this:
# to subset in 3 groups 200 times
# I'll make a df of elements to sample
samplelist <- data.frame(firstsample = sample(1:round(nrow(df)/3),200,replace = T), # first sample in first 1/3
secondsample = sample((round(nrow(df)/3) + 1):round(nrow(df)*(2/3)),200, replace = T), # second sample in second 1/3
thirdsample = sample((round(nrow(df)*(2/3)) + 1):nrow(df), 200, replace=T) # third sample in last 1/3
)
samplelist <- as.matrix(samplelist)
# start a df to add to
plot_df <- df %>% mutate(first='all',
second = 'all',
third = 'all',
group='full')
# fill the df using coords from expand.grid
for(i in 1:nrow(samplelist)){
plot_df <<- rbind(plot_df,
df[samplelist[i,] , ] %>%
mutate(
first = samplelist[i,1],
second = samplelist[i,2],
third = samplelist[i,3],
group = i
))
print(i)
}
But I want to do this for many different numbers of samples, up to ~20 (so 20 bins), so this manual method is not sustainable. Can you help me write a function that says "pick one sample from each of n bins, x times"?
By the way, this is the plot I am making with the complete df:
plot_df %>%
ggplot(aes(x=year,y=sample)) +
geom_point(color="grey40") +
stat_smooth(geom="line",
method = "lm",
alpha=.3,
aes(color=group,
group=group),
se=F,
show.legend = F) +
geom_line(color="grey40") +
geom_smooth(data = plot_df %>% filter(group %in% c("full")),
method = "lm",
alpha=.7,
color="black",
size=2,
#se=F,
# fill="grey40
show.legend = F
) +
theme_classic()

If I understood you correctly, the following function splits your df into n bins, draws x samples from each, and puts the results back into columns of a df:
library(tidyverse)
set.seed(42)
df <- data.frame(year = c(1:46),
sample = seq(from=10,to=30,length.out = 46) + rnorm(46,mean=0,sd=2) )
get_df_sample <- function(df, n, x) {
df %>%
# bin df in n bins of (approx.) equal length
mutate(bin = ggplot2::cut_number(seq_len(nrow(.)), n, labels = seq_len(n))) %>%
# split by bin
split(.$bin) %>%
# sample x times from each bin
map(~ .x[sample(seq_len(nrow(.x)), x, replace = TRUE),]) %>%
# keep only column "sample"
map(~ select(.x, sample)) %>%
# Rename: Add number of df-bin from which sample is drawn
imap(~ rename(.x, !!sym(paste0("sample_", .y)) := sample)) %>%
# bind
bind_cols() %>%
# Add group = rownames
rownames_to_column(var = "group")
}
get_df_sample(df, 3, 200) %>%
head()
#> sample_1 sample_2 sample_3 group
#> 1 12.58631 18.27561 24.74263 1
#> 2 19.46218 24.24423 23.44881 2
#> 3 12.92179 18.47367 27.40558 3
#> 4 15.22020 18.47367 26.29243 4
#> 5 12.58631 24.24423 24.43108 5
#> 6 19.46218 23.36464 27.40558 6
Created on 2020-03-24 by the reprex package (v0.3.0)

Here's a function using loops, closer to what you started doing:
library(dplyr)
df <- data.frame(year = c(1:46),
sample = seq(from=10, to=30, length.out = 46) +
rnorm(46,mean=0,sd=2))
df[c(20,32), 2] <- NA
my_function <- function(n, sample_size, data = df) {
plot_df <- data %>% mutate(group = 'full')
sample_matrix <- matrix(data = NA, nrow = sample_size, ncol = n)
first_row <- 1 # First subset has 1 as first row, no matter how many subsets
for (i in 1:n) {
last_row <- round(i * nrow(data)/n) # Last row of i-th subset; the n-th subset always ends at nrow(data)
sample_matrix[, i] <- sample(first_row:last_row, sample_size, replace = T) # Store sample directly in matrix
first_row <- last_row + 1 # Next subset starts right after this one
group_name <- paste("group", i, sep = "_") # Column name for i-th group
plot_df[[group_name]] <- "all" # Column for i-th group
}
for (j in 1:sample_size) {
# Creating a new data frame for new observations
new_obs <- data[sample_matrix[j,], ] # Use the data argument, not the global df
new_obs[["group"]] <- j
for (group_n in 1:n) {
new_obs[[paste0("group_", group_n)]] <- sample_matrix[j, group_n]
}
plot_df <- rbind(plot_df, new_obs)
}
plot_df # Return the result instead of assigning it globally with <<-
}
plot_df <- my_function(2, 200, data = df)
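The question also asks for draws that skip years with NA samples. A minimal base-R sketch of "pick one sample from each of n bins, x times" with that extra condition could look like the following. The function name `sample_bins`, its arguments, and the long output layout (one row per drawn point, with `group` and `bin` columns) are all assumptions made for illustration, and the sketch assumes n >= 2 and that every bin contains at least one non-NA row.

```r
set.seed(42)
df <- data.frame(year = 1:46,
                 sample = seq(from = 10, to = 30, length.out = 46) +
                   rnorm(46, mean = 0, sd = 2))
df[c(20, 32), 2] <- NA

sample_bins <- function(data, n, times, value_col = "sample") {
  # Assign each row to one of n contiguous bins of roughly equal width
  bin <- cut(seq_len(nrow(data)), breaks = n, labels = FALSE)
  # Only rows with a non-missing value are eligible for sampling
  ok <- which(!is.na(data[[value_col]]))
  picks <- lapply(seq_len(n), function(b) {
    rows <- ok[bin[ok] == b]
    # sample.int() on positions avoids the sample() pitfall for length-1 input
    rows[sample.int(length(rows), times, replace = TRUE)]
  })
  picks <- matrix(unlist(picks), nrow = times)  # times x n matrix of row indices
  # Row-wise order: draw 1 (bins 1..n), then draw 2, and so on
  out <- data[as.vector(t(picks)), , drop = FALSE]
  out$group <- rep(seq_len(times), each = n)
  out$bin <- rep(seq_len(n), times = times)
  rownames(out) <- NULL
  out
}

res <- sample_bins(df, n = 4, times = 200)
head(res)
```

The long layout keeps one `group` id per draw, so it can be bound to the full data and passed to the ggplot call in the question with `group = group`.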

Efficient way to get a matrix of high and low expressions for multiple variables to be used for simulations

I want to have a matrix including one high (1 sd above average) and low (1 sd below median) expression for each variable out of multiple variables.
In one variant, for each variable I would like to have one high expression, while all other variables are low.
In addition, I would like to have a variant in which all other variables are set to 0 and then there is a high and a low expression for each variable.
I want to use it for model predictions.
For three variables I would already need for variant 1:
pred_da <- data.frame(var1 = c(median(da$var1)+1*sd(da$var1), median(da$var1)-1*sd(da$var1), median(da$var1)-1*sd(da$var1)), var2 = c(median(da$var2)-1*sd(da$var2), median(da$var2)+1*sd(da$var2), median(da$var2)-1*sd(da$var2)), var3 = c(median(da$var3)-1*sd(da$var3), median(da$var3)-1*sd(da$var3), median(da$var3)+1*sd(da$var3)))
For variant 2 it would be even more...
There should be a more efficient way to do it?

I think Adam B.'s solution puts the medians instead of median - sd as results (see code below in reproducible example).
Also, your example code uses median +/- sd, while the text defines "high" as 1 sd above average (not median), so it is not clear which one you want. I went with median in both cases.
You can achieve the same quite easily with base R by filling a matrix with the "low" expression for each column and adding the "high" expression in the diagonal:
# data (common to all versions)
set.seed(1)
da <-
data.frame(
ID = 1:10,
var1 = rnorm(10, 0, 1),
var2 = rpois(10, 2),
var3 = rexp(10, 1),
stringsAsFactors = FALSE
)
varnames <- colnames(da)[-1]
# my version
mat <- data.matrix(da[, -1])
median_da <- apply(mat, 2, median)
sds <- apply(mat, 2, sd)
lower <- median_da - sds
higher <- median_da + sds
res_mat <-
matrix(
rep(lower, each = length(varnames)),
nrow = length(varnames),
dimnames = list(seq_along(varnames), varnames)
)
diag(res_mat) <- higher
data.frame(res_mat)
#> var1 var2 var3
#> 1 1.0371615 -0.4337209 -0.1102957
#> 2 -0.5240104 2.4337209 -0.1102957
#> 3 -0.5240104 -0.4337209 1.3406680
## your version:
pred_da <-
data.frame(
var1 = c(
median(da$var1) + 1 * sd(da$var1),
median(da$var1) - 1 * sd(da$var1),
median(da$var1) - 1 * sd(da$var1)
),
var2 = c(
median(da$var2) - 1 * sd(da$var2),
median(da$var2) + 1 * sd(da$var2),
median(da$var2) - 1 * sd(da$var2)
),
var3 = c(
median(da$var3) - 1 * sd(da$var3),
median(da$var3) - 1 * sd(da$var3),
median(da$var3) + 1 * sd(da$var3)
)
)
# check for equality of results:
all.equal(data.frame(res_mat), pred_da, check.attributes = FALSE)
#> [1] TRUE
# Adam B.'s version:
library(tidyverse)
median_da <- da %>%
select(- ID) %>%
mutate_all(~ median(.x)) %>%
slice(1)
sds <- da %>%
select(- ID) %>%
summarise_all(sd)
add_sd <- function(varname, sd) {
median <- median_da %>%
pluck(varname)
median_da %>%
mutate(!!varname := median + sd)
}
preds_da <- map2(varnames, sds, ~ add_sd(varname = .x, sd = .y)) %>% bind_rows()
preds_da
#> var1 var2 var3
#> 1 1.0371615 1.000000 0.6151862
#> 2 0.2565755 2.433721 0.6151862
#> 3 0.2565755 1.000000 1.3406680
median_da
#> var1 var2 var3
#> 1 0.2565755 1 0.6151862

It's a bit of a mind-squeezer with nonstandard eval, but I managed to get it to work with my example data:
library(tidyverse)
da <- tibble(ID = 1:10, V1 = rnorm(10, 0, 1), V2 = rpois(10, 2), V3 = rexp(10, 1))
varnames <- colnames(da)[-1]
median_da <- da %>%
select(- ID) %>%
mutate_all(~ median(.x)) %>%
slice(1)
sds <- da %>%
select(- ID) %>%
summarise_all(sd)
add_sd <- function(varname, sd) {
median <- median_da %>%
pluck(varname)
median_low <- median_da %>%
mutate(!!varname := median - sd)
median_high <- median_da %>%
mutate(!!varname := median + sd)
median_low %>%
bind_rows(median_high)
}
preds_da <- map2(varnames, sds, ~ add_sd(varname = .x, sd = .y)) %>% bind_rows()
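Neither answer spells out the second variant from the question, where all other variables are set to 0 and each variable gets one low and one high row. A base-R sketch using matrix indexing is shown below. It is an illustration only: it follows the first answer in using the median as the centre, and the object names (`centre`, `variant2`, `pred_v2`) are assumptions.

```r
set.seed(1)
da <- data.frame(
  ID = 1:10,
  var1 = rnorm(10, 0, 1),
  var2 = rpois(10, 2),
  var3 = rexp(10, 1)
)

mat <- data.matrix(da[, -1])
centre <- apply(mat, 2, median)
sds <- apply(mat, 2, sd)
k <- ncol(mat)

# Start from all zeros: two rows (low, high) per variable
variant2 <- matrix(0, nrow = 2 * k, ncol = k,
                   dimnames = list(NULL, colnames(mat)))

# (row, column) index pairs: odd rows hold "low", even rows hold "high"
low_idx  <- cbind(seq(1, 2 * k, by = 2), seq_len(k))
high_idx <- cbind(seq(2, 2 * k, by = 2), seq_len(k))
variant2[low_idx]  <- centre - sds
variant2[high_idx] <- centre + sds

pred_v2 <- data.frame(variant2)
pred_v2
```

Indexing a matrix with a two-column matrix of (row, column) pairs sets exactly one cell per pair, which generalises the diag() trick to the two-rows-per-variable layout.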

How to randomly select and bind data columns based on their median values in R?

I have two dataframes in wide format. Each of the columns is a time series of page hits for various wikipedia articles.
set.seed(123)
library(tidyr)
time = as.Date('2009-01-01') + 0:9
wiki_1 <- data.frame(
W = sample(1:1000,10,replace = T),
X = sample(1:100,10,replace = T),
Y = sample(1:10,10,replace = T),
Z = sample(1:10,10, replace = T)
)
wiki_2 <- data.frame(
A = sample(500:1000,10,replace = T),
B = sample(90:100,10,replace = T),
C = sample(1:10,10,replace = T),
D = sample(1:10,10,replace = T)
)
I want to combine one of the columns from the first dataset (wiki_1) with n columns from the second dataset (wiki_2). But this selection should be based on how close the median values of the columns in wiki_2 are to those in wiki_1 e.g. by order of magnitude.
In this example, for n = 2, Y should be matched with C and D because of how close their median values are.
median(wiki_1$Y) # 7
median(wiki_2$C) # 6
median(wiki_2$D) # 4.5
I'm not sure how to implement the difference in median values criterion to get the desired result.
Additionally, it would be useful to be able to randomly sample from the columns in wiki_2 that satisfy the criterion as my real dataset has many more columns.
This is what I'm working with so far:
library(zoo)
df <- zoo(cbind(subset(wiki_1,select="Y"),
subset(wiki_2,select=c("C","D"))),time)

I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.
set.seed(123)
library(tidyr)
time = as.Date('2009-01-01') + 0:9
wiki_1 <- data.frame(
W = sample(1:1000,10,replace = T),
X = sample(1:100,10,replace = T),
Y = sample(1:10,10,replace = T),
Z = sample(1:10,10, replace = T)
)
wiki_2 <- data.frame(
A = sample(500:1000,10,replace = T),
B = sample(90:100,10,replace = T),
C = sample(1:10,10,replace = T),
D = sample(1:10,10,replace = T),
E = sample(1:20,10,replace = T)
)
selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {
#' Select Columns By Median
#' @description Select any number of columns from a test data.frame whose median value is
#' close to the median value of a specified column from a reference data.frame. "Close to"
#' is determined as the absolute value of the difference in medians being less than the specified cutoff.
#' Outputs a new data.frame containing the reference data.frame's test column and all matching columns
#' from the test data.frame
#' @param df1 reference data.frame
#' @param df2 test data.frame
#' @param ref_v column from reference data.frame to test against
#' @param n_v number of columns from df2 to select
#' @param cutoff_v value to use to determine if test columns' medians are close enough
#' @return data.frame with 1 column from df1 and matching columns from df2
## Get median of ref
med_v <- median(df1[,ref_v], na.rm = T)
## Get other medians (use the df2 argument rather than the global wiki_2)
otherMed_v <- apply(df2, 2, function(x) median(x, na.rm = T))
## Get differences
medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))
## Get whoever is within range (and order them)
inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])
inRangeCols_v <- names(inRange_v)
## Select random sample, if needed
if (length(inRangeCols_v) > n_v){
whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)
} else {
whichRandom_v <- 1:length(inRangeCols_v)
}
finalCols_v <- inRangeCols_v[whichRandom_v]
## Final output
out_df <- cbind(df1[,ref_v], df2[,finalCols_v])
colnames(out_df) <- c(ref_v, finalCols_v)
## Return
return(out_df)
} # selectColsByMedian
### 3 matching columns, select 2
match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)
match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)
### 2 matching columns, select 2
match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)

Here is my solution. I've added more rows to wiki_2 to allow for subsetting (but it also works if nrow(wiki_1) == nrow(wiki_2)).
set.seed(123)
wiki_1 <- data.frame(
W = sample(1:1000,10,replace = T),
X = sample(1:100,10,replace = T),
Y = sample(1:10,10,replace = T),
Z = sample(1:10,10, replace = T)
)
wiki_2 <- data.frame(
A = sample(500:1000,100,replace = T),
B = sample(90:100,100,replace = T),
C = sample(1:10,100,replace = T),
D = sample(1:10,100,replace = T)
)
combineMedianComp <- function(data1, data2, col, n){
if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")
medRef <- median(data1[[col]], na.rm = T) # median of desired column
medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # absolute difference between 'medRef' and the median of each column in data2 ('wiki_2')
cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n
d2 <- data2[, cols, drop = FALSE] # select columns in data2 that have medians closest to 'medRef'
d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), , drop = FALSE] # randomly subset rows so their number matches data1
# merge data
res <- do.call(cbind, list(data1[col], d2))
return(res)
}
combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)

You can do:
time = as.Date('2009-01-01') + 0:9
close_median <- function(df1, df2, to_match = NULL){
# get median
m <- median(df1[[to_match]])
# get difference of median from other data
v <- apply(df2, 2, function(x) abs(m - median(x)))
# get the 2 closest columns
cols <- sort(names(sort(v)[1:2]))
return(cbind(df1[to_match], df2[cols], row.names=time))
}
close_median(wiki_1, wiki_2, 'Y')
Y C D
2009-01-01 8 9 10
2009-01-02 7 8 1
2009-01-03 1 7 7
2009-01-04 10 3 10
2009-01-05 2 1 1
2009-01-06 3 10 3
2009-01-07 6 2 3
2009-01-08 5 8 10
2009-01-09 3 8 5
2009-01-10 10 8 3
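The question mentions matching "by order of magnitude" and randomly sampling among the columns that qualify. One way to read that criterion is sketched below: a column of wiki_2 qualifies when floor(log10(median)) equals that of the reference column. This interpretation, the function name `pick_by_magnitude`, and its arguments are assumptions for illustration; the sketch also assumes strictly positive medians and equal row counts in both data frames.

```r
set.seed(123)
wiki_1 <- data.frame(
  W = sample(1:1000, 10, replace = TRUE),
  X = sample(1:100, 10, replace = TRUE),
  Y = sample(1:10, 10, replace = TRUE),
  Z = sample(1:10, 10, replace = TRUE)
)
wiki_2 <- data.frame(
  A = sample(500:1000, 10, replace = TRUE),
  B = sample(90:100, 10, replace = TRUE),
  C = sample(1:10, 10, replace = TRUE),
  D = sample(1:10, 10, replace = TRUE)
)

pick_by_magnitude <- function(df1, df2, ref, n) {
  # Order of magnitude of a column's median (assumes median > 0)
  mag <- function(x) floor(log10(median(x, na.rm = TRUE)))
  ref_mag <- mag(df1[[ref]])
  # Columns of df2 whose median has the same order of magnitude
  candidates <- names(df2)[vapply(df2, mag, numeric(1)) == ref_mag]
  # Randomly keep at most n of them
  keep <- sample.int(length(candidates), min(n, length(candidates)))
  cbind(df1[ref], df2[candidates[keep]])
}

res <- pick_by_magnitude(wiki_1, wiki_2, ref = "Y", n = 2)
res
```

If fewer than n columns qualify, the sketch returns all of them rather than failing, which may or may not be the behaviour wanted for a larger dataset.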

Create t.test table with dplyr?

Suppose I have data that looks like this:
set.seed(031915)
myDF <- data.frame(
Name= rep(c("A", "B"), times = c(10,10)),
Group = rep(c("treatment", "control", "treatment", "control"), times = c(5,5,5,5)),
X = c(rnorm(n=5,mean = .05, sd = .001), rnorm(n=5,mean = .02, sd = .001),
rnorm(n=5,mean = .08, sd = .02), rnorm(n=5,mean = .03, sd = .02))
)
I want to create a t.test table with a row for "A" and one for "B"
I can write my own function that does that:
ttestbyName <- function(Name) {
b <- t.test(myDF$X[myDF$Group == "treatment" & myDF$Name==Name],
myDF$X[myDF$Group == "control" & myDF$Name==Name],
conf.level = 0.90)
dataNameX <- data.frame(Name = Name,
treatment = round(b$estimate[[1]], digits = 4),
control = round(b$estimate[[2]], digits = 4),
CI = paste('(',round(b$conf.int[[1]],
digits = 4),', ',
round(b$conf.int[[2]],
digits = 4), ')',
sep=""),
pvalue = round(b$p.value, digits = 4),
ntreatment = nrow(myDF[myDF$Group == "treatment" & myDF$Name==Name,]),
ncontrol = nrow(myDF[myDF$Group == "control" & myDF$Name==Name,]))
}
library(parallel)
Test_by_Name <- mclapply(unique(myDF$Name), ttestbyName)
Test_by_Name <- do.call("rbind", Test_by_Name)
and the output looks like this:
Name treatment control CI pvalue ntreatment ncontrol
1 A 0.0500 0.0195 (0.0296, 0.0314) 0.0000 5 5
2 B 0.0654 0.0212 (0.0174, 0.071) 0.0161 5 5
I'm wondering if there is a cleaner way of doing this with dplyr. I thought about using group_by, but I'm a little lost.
Thanks!

Not much cleaner, but here's an improvement:
library(dplyr)
ttestbyName <- function(myName) {
bt <- filter(myDF, Group=="treatment", Name==myName)
bc <- filter(myDF, Group=="control", Name==myName)
b <- t.test(bt$X, bc$X, conf.level=0.90)
dataNameX <- data.frame(Name = myName,
treatment = round(b$estimate[[1]], digits = 4),
control = round(b$estimate[[2]], digits = 4),
CI = paste('(',round(b$conf.int[[1]],
digits = 4),', ',
round(b$conf.int[[2]],
digits = 4), ')',
sep=""),
pvalue = round(b$p.value, digits = 4),
ntreatment = nrow(bt), # changes only in
ncontrol = nrow(bc)) # these 2 nrow() args
}
You should really replace the do.call function with rbindlist from data.table:
library(data.table)
Test_by_Name <- lapply(unique(myDF$Name), ttestbyName)
Test_by_Name <- rbindlist(Test_by_Name)
or, even better, use the %>% pipes:
Test_by_Name <- myDF$Name %>%
unique %>%
lapply(., ttestbyName) %>%
rbindlist
> Test_by_Name
Name treatment control CI pvalue ntreatment ncontrol
1: A 0.0500 0.0195 (0.0296, 0.0314) 0.0000 5 5
2: B 0.0654 0.0212 (0.0174, 0.071) 0.0161 5 5

An old question, but the broom package has since been made available for this exact purpose (as well as other statistical tests):
library(broom)
library(dplyr)
myDF %>% group_by(Name) %>%
do(tidy(t.test(X~Group, data = .)))
Source: local data frame [2 x 9]
Groups: Name [2]
Name estimate estimate1 estimate2 statistic p.value
(fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
1 A -0.03050475 0.01950384 0.05000860 -63.838440 1.195226e-09
2 B -0.04423181 0.02117864 0.06541046 -3.104927 1.613625e-02
Variables not shown: parameter (dbl), conf.low (dbl), conf.high (dbl)

library(tidyr)
library(dplyr)
myDF %>% group_by(Group) %>% mutate(rowname=1:n())%>%
spread(Group, X) %>%
group_by(Name) %>%
do(b = t.test(.$control, .$treatment)) %>%
mutate(
treatment = round(b[['estimate']][[2]], digits = 4),
control = round(b[['estimate']][[1]], digits = 4),
CI = paste0("(", paste(b[['conf.int']], collapse=", "), ")"),
pvalue = b[['p.value']]
)
# Name treatment control CI pvalue
#1 A 0.0500 0.0195 (-0.031677109707283, -0.0293323994902097) 1.195226e-09
#2 B 0.0654 0.0212 (-0.0775829100729602, -0.010880719830447) 1.613625e-02
You can add ncontrol, ntreatment manually.

You can do it with a custom t.test function and do:
my.t.test <- function(data, formula, ...)
{
tt <- t.test(formula=formula, data=data, ...)
ests <- tt$estimate
names(ests) <- sub("mean in group ()", "\\1",names(ests))
counts <- xtabs(formula[c(1,3)],data)
names(counts) <- paste0("n",names(counts))
cbind(
as.list(ests),
data.frame(
CI = paste0("(", paste(tt$conf.int, collapse=", "), ")"),
pvalue = tt$p.value,
stringsAsFactors=FALSE
),
as.list(counts)
)
}
myDF %>% group_by(Name) %>% do(my.t.test(.,X~Group))
Source: local data frame [2 x 7]
Groups: Name
Name control treatment CI pvalue ncontrol ntreatment
1 A 0.01950384 0.05000860 (-0.031677109707283, -0.0293323994902097) 1.195226e-09 5 5
2 B 0.02117864 0.06541046 (-0.0775829100729602, -0.010880719830447) 1.613625e-02 5 5
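For comparison, the table layout requested in the question can also be sketched in base R with split(), which avoids repeating the filtering conditions inside the function. This is an illustrative alternative rather than a dplyr solution; the helper name `ttest_row` is an assumption, and the 90% confidence level is taken from the question's code.

```r
set.seed(031915)
myDF <- data.frame(
  Name = rep(c("A", "B"), times = c(10, 10)),
  Group = rep(c("treatment", "control", "treatment", "control"), times = c(5, 5, 5, 5)),
  X = c(rnorm(n = 5, mean = .05, sd = .001), rnorm(n = 5, mean = .02, sd = .001),
        rnorm(n = 5, mean = .08, sd = .02), rnorm(n = 5, mean = .03, sd = .02))
)

# Build one summary row from the rows belonging to a single Name
ttest_row <- function(g) {
  tr <- g$X[g$Group == "treatment"]
  ct <- g$X[g$Group == "control"]
  b <- t.test(tr, ct, conf.level = 0.90)  # treatment minus control
  data.frame(
    Name = g$Name[1],
    treatment = round(mean(tr), 4),
    control = round(mean(ct), 4),
    CI = sprintf("(%.4f, %.4f)", b$conf.int[1], b$conf.int[2]),
    pvalue = round(b$p.value, 4),
    ntreatment = length(tr),
    ncontrol = length(ct)
  )
}

# split() hands each Name's rows to ttest_row(); rbind stacks the results
Test_by_Name <- do.call(rbind, lapply(split(myDF, myDF$Name), ttest_row))
rownames(Test_by_Name) <- NULL
Test_by_Name
```

Passing the two vectors explicitly keeps the direction of the difference (treatment minus control) the same as in the question, whereas the formula interface orders the groups alphabetically.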
