Multiple paired t-tests on multiple variables simultaneously using dplyr/tidyverse - r

Assume a data structure like this:
ID testA_wave1 testA_wave2 testA_wave3 testB_wave1 testB_wave2 testB_wave3
1 1 3 2 3 6 5 3
2 2 4 4 4 3 6 6
3 3 10 2 1 4 4 4
4 4 5 3 12 2 7 4
5 5 5 3 9 2 4 2
6 6 10 0 2 6 6 5
7 7 6 8 4 6 8 3
8 8 1 5 4 5 6 0
9 9 3 2 7 8 4 4
10 10 4 9 5 11 8 8
What I want to achieve is to calculate a paired t-test for every test separately (in this case meaning testA and testB, but in real life I have many more tests). I want to compare the first wave of a given test with every subsequent wave of the same test (meaning testA_wave1 vs testA_wave2 and testA_wave1 vs testA_wave3 in the case of testA).
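For reference, a paired t-test is a one-sample t-test on the within-subject differences, which is why the values of the two waves must line up by ID. A minimal sketch, using the testA_wave1 and testA_wave2 values from the ten rows printed above (the names w1, w2, d and t_hand are only illustrative):

```r
# testA_wave1 and testA_wave2 for the ten rows shown above
w1 <- c(3, 4, 10, 5, 5, 10, 6, 1, 3, 4)
w2 <- c(2, 4, 2, 3, 3, 0, 8, 5, 2, 9)

paired     <- t.test(w1, w2, paired = TRUE)
one_sample <- t.test(w1 - w2)   # the same test, expressed on the differences

# The statistic can also be computed by hand from the differences
d      <- w1 - w2
t_hand <- mean(d) / (sd(d) / sqrt(length(d)))

all.equal(unname(paired$statistic), t_hand)
all.equal(paired$p.value, one_sample$p.value)
```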
This way, I was able to achieve it:
df %>%
gather(variable, value, -ID) %>%
mutate(wave_ID = paste0("wave", parse_number(variable)),
variable = ifelse(grepl("testA", variable), "testA",
ifelse(grepl("testB", variable), "testB", NA_character_))) %>%
group_by(wave_ID, variable) %>%
summarise(value = list(value)) %>%
spread(wave_ID, value) %>%
group_by(variable) %>%
mutate(p_value_w1w2 = t.test(unlist(wave1), unlist(wave2), paired = TRUE)$p.value,
p_value_w1w3 = t.test(unlist(wave1), unlist(wave3), paired = TRUE)$p.value) %>%
select(variable, matches("(p_value)"))
variable p_value_w1w2 p_value_w1w3
<chr> <dbl> <dbl>
1 testA 0.664 0.921
2 testB 0.146 0.418
However, I would like to see different/more elegant solutions that give similar results. I'm looking mostly for dplyr/tidyverse solutions, but if there is a completely different way to achieve it, I'm not against it.
Sample data:
set.seed(123)
df <- data.frame(ID = 1:20,
testA_wave1 = round(rnorm(20, 5, 3), 0),
testA_wave2 = round(rnorm(20, 5, 3), 0),
testA_wave3 = round(rnorm(20, 5, 3), 0),
testB_wave1 = round(rnorm(20, 5, 3), 0),
testB_wave2 = round(rnorm(20, 5, 3), 0),
testB_wave3 = round(rnorm(20, 5, 3), 0))

Since dplyr 0.8.0 we can use group_split to split a data frame into a list of data frames.
We gather the data frame into long format and then separate the column names (key) into two columns (test and wave). We then use group_split to split the data frame into a list by the test column. For every data frame in the list, we spread it back into wide format, calculate the t.test p-values, and row-bind the results into one data frame using map_dfr.
library(tidyverse)
df %>%
gather(key, value, -ID) %>%
separate(key, c("test", "wave")) %>%
group_split(test) %>% #Previously we had to do split(.$test) here
map_dfr(. %>%
spread(wave, value) %>%
summarise(test = first(test),
p_value_w1w2 = t.test(wave1, wave2, paired = TRUE)$p.value,
p_value_w1w3 = t.test(wave1, wave3, paired = TRUE)$p.value))
# A tibble: 2 x 3
# test p_value_w1w2 p_value_w1w3
# <chr> <dbl> <dbl>
#1 testA 0.664 0.921
#2 testB 0.146 0.418
We wrote out the t-tests by hand above because only two comparisons were needed. With more wave... columns this becomes cumbersome. In such cases we could do
df %>%
gather(key, value, -ID) %>%
separate(key, c("test", "wave")) %>%
group_split(test) %>%
map_dfr(function(data)
data %>%
spread(wave, value) %>%
summarise_at(vars(setdiff(unique(data$wave), "wave1")),
function(x) t.test(.$wave1, x, paired = TRUE)$p.value) %>%
mutate(test = first(data$test)))
# wave2 wave3 test
# <dbl> <dbl> <chr>
#1 0.664 0.921 testA
#2 0.146 0.418 testB
Here it will perform the t-test for every "wave.." column with "wave1" column.
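With tidyr 1.0.0 or later and dplyr 1.0.0 or later, the same idea can probably be written without gather(), spread() or the split step. This is only a sketch of that approach, assuming the column names follow the test_wave pattern of the sample data; p_values is an illustrative name:

```r
library(dplyr)
library(tidyr)

set.seed(123)
df <- data.frame(ID = 1:20,
                 testA_wave1 = round(rnorm(20, 5, 3), 0),
                 testA_wave2 = round(rnorm(20, 5, 3), 0),
                 testA_wave3 = round(rnorm(20, 5, 3), 0),
                 testB_wave1 = round(rnorm(20, 5, 3), 0),
                 testB_wave2 = round(rnorm(20, 5, 3), 0),
                 testB_wave3 = round(rnorm(20, 5, 3), 0))

p_values <- df %>%
  # "testA_wave1" becomes test = "testA", wave = "wave1"
  pivot_longer(-ID, names_to = c("test", "wave"), names_sep = "_") %>%
  # one row per ID and test, one column per wave
  pivot_wider(names_from = wave, values_from = value) %>%
  group_by(test) %>%
  # compare wave1 with every other wave column of the same test
  summarise(across(c(starts_with("wave"), -wave1),
                   ~ t.test(wave1, .x, paired = TRUE)$p.value))
p_values
```

The result has one row per test and one p-value column per later wave, so adding waves does not require changing the code.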
Since you are also open to other solutions, here is a pure base R approach
sapply(split.default(df[-1], sub("_.*", "", names(df[-1]))), function(x)
c(p_value_w1w2 = t.test(x[[1]], x[[2]],paired = TRUE)$p.value,
p_value_w1w3 = t.test(x[[1]], x[[3]],paired = TRUE)$p.value))
# testA testB
#p_value_w1w2 0.6642769 0.1456059
#p_value_w1w3 0.9209554 0.4184603
We split the columns based on test* and create a list of dataframes and apply t.test on different combinations of columns for each dataframe.
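The base R version can be generalised in the same spirit so that it does not hard-code waves 2 and 3. A sketch, assuming the first column of each test is wave 1; first_vs_later and by_test are illustrative names:

```r
set.seed(123)
df <- data.frame(ID = 1:20,
                 testA_wave1 = round(rnorm(20, 5, 3), 0),
                 testA_wave2 = round(rnorm(20, 5, 3), 0),
                 testA_wave3 = round(rnorm(20, 5, 3), 0),
                 testB_wave1 = round(rnorm(20, 5, 3), 0),
                 testB_wave2 = round(rnorm(20, 5, 3), 0),
                 testB_wave3 = round(rnorm(20, 5, 3), 0))

# One data frame of wave columns per test
by_test <- split.default(df[-1], sub("_.*", "", names(df[-1])))

first_vs_later <- function(d) {
  # Assumes the first column of d holds wave 1
  names(d) <- sub(".*_", "", names(d))   # "testA_wave2" -> "wave2"
  vapply(d[-1],
         function(col) t.test(d[[1]], col, paired = TRUE)$p.value,
         numeric(1))
}

# Rows are the later waves, columns are the tests
res <- sapply(by_test, first_vs_later)
res
```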

Update 03/16/2022
The tidyverse has evolved and so should this solution.
First I make a simplifying assumption: If we designed the experiment, then we know what the groups are and how many waves we followed them through. If we don't know, then we can extract this information from the column names. See the appendix below.
library("broom")
library("tidyverse")
tests <- c("A", "B")
waves <- 3
comparisons <-
list(
test = tests,
first = 1,
later = seq(2, waves)
) %>%
cross_df()
comparisons
#> # A tibble: 4 × 3
#> test first later
#> <chr> <dbl> <int>
#> 1 A 1 2
#> 2 B 1 2
#> 3 A 1 3
#> 4 B 1 3
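Note that cross_df() has been deprecated since purrr 1.0.0 in favour of tidyr::expand_grid(). A sketch of the same grid built with expand_grid(); the rows come out in a different order, but the set of comparisons is the same:

```r
library(tidyr)

tests <- c("A", "B")
waves <- 3

# Every combination of test, first wave and later wave
comparisons <- expand_grid(
  test  = tests,
  first = 1,
  later = seq(2, waves)
)
comparisons
```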
Transform the data from wide format to long format.
data <- df %>%
pivot_longer(
-ID,
names_to = "test_wave"
) %>%
extract(
test_wave, c("test", "wave"),
regex = "test(.+)_wave(.+)",
convert = TRUE
)
Then pair the comparisons we want to make with the data we collected. I've added lots of rename statements to make for more readable code but it's not strictly necessary.
comparisons %>%
inner_join(
data,
by = c("test", "first" = "wave")
) %>%
rename(
value.first = value
) %>%
inner_join(
data,
by = c("test", "later" = "wave", "ID")
) %>%
rename(
value.later = value
) %>%
group_by(
test, first, later
) %>%
group_modify(
~ tidy(t.test(.x$value.first, .x$value.later, paired = TRUE))
) %>%
ungroup() %>%
pivot_wider(
id_cols = test,
names_from = later,
names_glue = "wave1_vs_wave{later}",
values_from = p.value
)
#> # A tibble: 2 × 3
#> test wave1_vs_wave2 wave1_vs_wave3
#> <chr> <dbl> <dbl>
#> 1 A 0.664 0.921
#> 2 B 0.146 0.418
Appendix: Extract test names and number of waves from column names.
design <- df %>%
select(starts_with("test")) %>%
colnames() %>%
str_match("test(.+)_wave(.+)")
tests <- unique(design[, 2])
waves <- max(as.integer(design[, 3]))
Created on 2022-03-16 by the reprex package (v2.0.1)
Old solution
Here is one way to do it, using purrr quite a bit.
library("tidyverse")
set.seed(123)
df <- tibble(
ID = 1:20,
testA_wave1 = round(rnorm(20, 5, 3), 0),
testA_wave2 = round(rnorm(20, 5, 3), 0),
testA_wave3 = round(rnorm(20, 5, 3), 0),
testB_wave1 = round(rnorm(20, 5, 3), 0),
testB_wave2 = round(rnorm(20, 5, 3), 0),
testB_wave3 = round(rnorm(20, 5, 3), 0)
)
pvalues <- df %>%
# From wide tibble to long tibble
gather(test, value, -ID) %>%
separate(test, c("test", "wave")) %>%
# Not strictly necessary; will order the waves alphabetically instead
mutate(wave = parse_number(wave)) %>%
inner_join(., ., by = c("ID", "test")) %>%
# If there are two waves w1 and w2,
# we end up with pairs (w1, w1), (w1, w2), (w2, w1) and (w2, w2),
# so filter out to keep the pairing (w1, w2) only
filter(wave.x == 1, wave.x < wave.y) %>%
nest(ID, value.x, value.y) %>%
mutate(pvalue = data %>%
# Perform the test
map(~t.test(.$value.x, .$value.y, paired = TRUE)) %>%
map(broom::tidy) %>%
# Also not strictly necessary; you might want to keep all
# information about the test: estimate, statistic, etc.
map_dbl(pluck, "p.value"))
pvalues
#> # A tibble: 4 x 5
#> test wave.x wave.y data pvalue
#> <chr> <dbl> <dbl> <list> <dbl>
#> 1 testA 1 2 <tibble [20 x 3]> 0.664
#> 2 testA 1 3 <tibble [20 x 3]> 0.921
#> 3 testB 1 2 <tibble [20 x 3]> 0.146
#> 4 testB 1 3 <tibble [20 x 3]> 0.418
pvalues %>%
# Drop the data in order to pivot the table
select(- data) %>%
unite("waves", wave.x, wave.y, sep = ":") %>%
spread(waves, pvalue)
#> # A tibble: 2 x 3
#> test `1:2` `1:3`
#> <chr> <dbl> <dbl>
#> 1 testA 0.664 0.921
#> 2 testB 0.146 0.418
Created on 2019-03-08 by the reprex package (v0.2.1)

To throw in a data.table solution:
library(stringr)
library(data.table)
library(magrittr) ## for the pipe operator
dt_sol <- function(df) {
## create patterns for the melt operation:
## all columns from the same wave should go in one column
grps <- str_extract(names(df)[-1],
"[0-9]+$") %>%
unique() %>%
paste0("wave", ., "$")
grp_names <- sub("\\$", "", grps)
## melt the data table: all test*_wave_i data go into column wave_i
df.m <- melt(df,
measure = patterns(grps),
value.name = grp_names,
variable.name = "test")
## define the names for the new column, we want to extract estimate and p.value
new_cols <- c(outer(c("p.value", "estimate"),
grp_names[-1],
paste, sep = "_"))
## use lapply on .SD which equals to all wave_i columns but the first one
## return estimate and p.value
df.m[,
setNames(unlist(lapply(.SD,
function(col) {
t.test(wave1, col, paired = TRUE)[c("p.value", "estimate")]
}), recursive = FALSE), new_cols),
test, ## group by each test
.SDcols = grp_names[-1]]
}
dt <- copy(df)
setDT(dt)
dt_sol(dt)
# test p.value_wave2 estimate_wave2 p.value_wave3 estimate_wave3
# 1: 1 0.6642769 0.40 0.9209554 -0.1
# 2: 2 0.1456059 -1.45 0.4184603 0.7
Benchmark
Comparing the data.table solution to the tidyverse solution, we get a roughly 3-fold speed increase with the data.table solution:
dp_sol <- function(df) {
df %>%
gather(test, value, -ID) %>%
separate(test, c("test", "wave")) %>%
inner_join(., ., by = c("ID", "test")) %>%
filter(wave.x == 1, wave.x < wave.y) %>%
nest(ID, value.x, value.y) %>%
mutate(pvalue = data %>%
map(~t.test(.$value.x, .$value.y, paired = TRUE)) %>%
map(broom::tidy) %>%
map_dbl(pluck, "p.value"))
}
library(microbenchmark)
microbenchmark(dplyr = dp_sol(df),
data.table = dt_sol(dt))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 6.119273 6.897456 7.639569 7.348364 7.996607 14.938182 100 b
# data.table 1.902547 2.307395 2.790910 2.758789 3.133091 4.923153 100 a
With a slightly bigger input:
make_df <- function(nr_tests = 2,
nr_waves = 3,
n_per_wave = 20) {
mat <- cbind(seq(1, n_per_wave),
matrix(round(rnorm(nr_tests * nr_waves * n_per_wave), 0),
nrow = n_per_wave))
c_names <- c(outer(1:nr_waves, 1:nr_tests, function(w, t) glue::glue("test{t}_wave{w}")))
colnames(mat) <- c("ID", c_names)
as.data.frame(mat)
}
df2 <- make_df(100, 100, 10)
dt2 <- copy(df2)
setDT(dt2)
microbenchmark(dplyr = dp_sol(df2),
data.table = dt_sol(dt2))
# Unit: seconds
# expr min lq mean median uq max neval cld
# dplyr 3.469837 3.669819 3.877548 3.821475 3.984518 5.268596 100 b
# data.table 1.018939 1.126244 1.193548 1.173175 1.252855 1.743075 100 a

Using all combinations without replacement:
Just for testA group:
comb <- arrangements::combinations(names(df)[grep("testA",names(df))], k = 2,n = 3,replace = F )
tTest <- function(x, data = df){
ttest <- t.test(x =data[x[1]] , y = data[x[2]])
return(data.frame(var1 = x[1],
var2 = x[2],
t = ttest[["statistic"]][["t"]],
pvalue = ttest[["p.value"]]))
}
result <- apply(comb, 1, tTest, data = df)
Result:
dplyr::bind_rows(result)
var1 var2 t pvalue
1 testA_wave1 testA_wave2 0.5009236 0.6193176
2 testA_wave1 testA_wave3 -0.6426433 0.5243146
3 testA_wave2 testA_wave3 -1.1564854 0.2547069
For all groups:
comb <- arrangements::combinations(x = names(df)[-1], k = 2,n = 6, replace = F )
result <- apply(comb, 1, tTest, data = df)
Result:
dplyr::bind_rows(result)
var1 var2 t pvalue
1 testA_wave1 testA_wave2 0.5009236 0.6193176
2 testA_wave1 testA_wave3 -0.6426433 0.5243146
3 testA_wave1 testB_wave1 0.4199215 0.6769510
4 testA_wave1 testB_wave2 -0.3447992 0.7321465
5 testA_wave1 testB_wave3 0.0000000 1.0000000
6 testA_wave2 testA_wave3 -1.1564854 0.2547069
7 testA_wave2 testB_wave1 -0.1070172 0.9153442
8 testA_wave2 testB_wave2 -0.8516264 0.3997630
9 testA_wave2 testB_wave3 -0.5640491 0.5762010
10 testA_wave3 testB_wave1 1.1068781 0.2754186
11 testA_wave3 testB_wave2 0.2966237 0.7683692
12 testA_wave3 testB_wave3 0.7211103 0.4755291
13 testB_wave1 testB_wave2 -0.7874100 0.4360152
14 testB_wave1 testB_wave3 -0.4791735 0.6346043
15 testB_wave2 testB_wave3 0.3865414 0.7013933
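The calls above run t.test() without paired = TRUE, so they are unpaired Welch tests. If paired tests are wanted, as in the question, a sketch with base R's combn() (used here in place of the arrangements package) could look like this; paired_test is an illustrative helper name:

```r
set.seed(123)
df <- data.frame(ID = 1:20,
                 testA_wave1 = round(rnorm(20, 5, 3), 0),
                 testA_wave2 = round(rnorm(20, 5, 3), 0),
                 testA_wave3 = round(rnorm(20, 5, 3), 0),
                 testB_wave1 = round(rnorm(20, 5, 3), 0),
                 testB_wave2 = round(rnorm(20, 5, 3), 0),
                 testB_wave3 = round(rnorm(20, 5, 3), 0))

paired_test <- function(vars, data) {
  # [[ ]] extracts plain numeric vectors, as t.test() expects
  tt <- t.test(data[[vars[1]]], data[[vars[2]]], paired = TRUE)
  data.frame(var1 = vars[1], var2 = vars[2],
             t = unname(tt$statistic), pvalue = tt$p.value)
}

# All unordered pairs of testA columns
pairs  <- combn(grep("testA", names(df), value = TRUE), 2, simplify = FALSE)
result <- do.call(rbind, lapply(pairs, paired_test, data = df))
result
```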

To throw another, somewhat more concise, data.table solution into the mix, in which we melt the data into long format:
setDT(df)
x = melt(df[,-1])[, tname := sub('_.+','',variable)][, wave := sub('.+_','',variable)]
x[wave != 'wave1', .(p.value =
t.test(x[tname==test & wave == 'wave1', value], value, paired = TRUE)$p.value),
by = .(test=tname,wave)]
# test wave p.value
# 1: testA wave2 0.6642769
# 2: testA wave3 0.9209554
# 3: testB wave2 0.1456059
# 4: testB wave3 0.4184603

Related

Programmatically count grouped data using logic rules and string

I have a grouped data frame which I want to summarise into "count of values less than x, y, z by group". I can manually generate the wide dataframe I want using code similar to this below
library(tidyverse)
set.seed(1337)
df <- data.frame(cbind(group = seq(1:5), num = sample(x = 1:400, size = 100, replace = T)))
manual <- df %>%
group_by(group) %>%
summarise(less_than_50 = sum(num < 50),
less_than_100 = sum(num < 100),
less_than_150 = sum(num < 150))
However, I'd like to be able to define a list of "less thans" and generate these columns by referring to a list. I've done something similar in the past, though using enframe(quantile()) to generate a long list of quantiles before pivoting
pc <- c(0.1, 0.5, 0.9)
quantiles <- df %>%
group_by(group) %>%
summarise(enframe(quantile(num, pc))) %>%
pivot_wider(
id_cols = group,
names_from = name,
values_from = value
)
But I don't know / understand the way to define a custom function within the enframe(). Ideally I'd like to apply this in something like the code below (though this obviously doesn't work), with or without the pivot step, in order to get back to the same output as "manual"
levels <- c(50, 100, 150)
programmatic <- df %>%
group_by(group) %>%
summarise(cols = ("less_than", x), num < levels) %>%
pivot...
Any help greatly appreciated
One way you could do it:
library(tidyverse)
set.seed(1337)
df <- data.frame(cbind(group = seq(1:5), num = sample(x = 1:400, size = 100, replace = T)))
less_than <- function(x) {
df %>%
group_by(group) %>%
summarise(less_than_ = sum(num < x)) %>%
rename_with(~ str_c(., x), .cols = -group)
}
levels <- c(50, 100, 150)
map_dfr(levels, less_than) |>
group_by(group) |>
summarise(across(everything(), mean, na.rm = TRUE))
#> # A tibble: 5 × 4
#> group less_than_50 less_than_100 less_than_150
#> <int> <dbl> <dbl> <dbl>
#> 1 1 4 5 10
#> 2 2 2 2 5
#> 3 3 2 6 11
#> 4 4 4 5 5
#> 5 5 1 7 9
# Manual result for comparison
df %>%
group_by(group) %>%
summarise(less_than_50 = sum(num < 50),
less_than_100 = sum(num < 100),
less_than_150 = sum(num < 150))
#> # A tibble: 5 × 4
#> group less_than_50 less_than_100 less_than_150
#> <int> <int> <int> <int>
#> 1 1 4 5 10
#> 2 2 2 2 5
#> 3 3 2 6 11
#> 4 4 4 5 5
#> 5 5 1 7 9
Created on 2022-06-06 by the reprex package (v2.0.1)
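The thresholds can also be applied in base R without a helper that re-groups the data for each level. A sketch under the same sample data (the group column is written with rep(), which gives the same values as the recycled cbind() above); counts and programmatic are illustrative names:

```r
set.seed(1337)
df <- data.frame(group = rep(1:5, times = 20),
                 num = sample(x = 1:400, size = 100, replace = TRUE))

levels <- c(50, 100, 150)

# One column per threshold: number of values below it within each group
counts <- sapply(levels, function(l) {
  tapply(df$num, df$group, function(x) sum(x < l))
})
colnames(counts) <- paste0("less_than_", levels)

# tapply() orders its result by the sorted group values
programmatic <- data.frame(group = sort(unique(df$group)), counts,
                           row.names = NULL)
programmatic
```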

Perform multiple two-sample t-test using dplyr in R

I would like to perform multiple pairwise t-tests on a dataset containing about 400 different column variables and 3 subject groups, and extract p-values for every comparison. A shorter representative example of the data, using only 2 variables could be the following;
df <- tibble(var1 = rnorm(90, 1, 1), var2 = rnorm(90, 1.5, 1), group = rep(1:3, each = 30))
Ideally the end result will be a summarised data frame containing four columns; one for the variable being tested (var1, var2 etc.), two for the groups being tested every time and a final one for the p-value.
I've tried duplicating the group column in the long form, and doing a double group_by in order to do the comparisons but with no result
result <- df %>%
pivot_longer(var1:var2, "var", "value") %>%
rename(group_a = group) %>%
mutate(group_b = group_a) %>%
group_by(group_a, group_b) %>%
summarise(n = n())
We can reshape the data into 'long' format with pivot_longer, then, grouped by 'group', apply pairwise.t.test, convert the result into a tibble with tidy (from broom) stored in a list column, and unnest that list column
library(dplyr)
library(tidyr)
library(broom)
df %>%
pivot_longer(cols = -group, names_to = 'grp') %>%
group_by(group) %>%
summarise(out = list(pairwise.t.test(value, grp
) %>%
tidy)) %>%
unnest(c(out))
-output
# A tibble: 3 x 4
group group1 group2 p.value
<int> <chr> <chr> <dbl>
1 1 var2 var1 0.0760
2 2 var2 var1 0.0233
3 3 var2 var1 0.000244
In case you end up wanting more information about the t-tests, here is an approach that will allow you to extract more information such as the degrees of freedom and value of the test statistic:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
df <- tibble(
var1 = rnorm(90, 1, 1),
var2 = rnorm(90, 1.5, 1),
group = rep(1:3, each = 30)
)
df %>%
select(-group) %>%
names() %>%
map_dfr(~ {
y <- .
combn(3, 2) %>%
t() %>%
as.data.frame() %>%
pmap_dfr(function(V1, V2) {
df %>%
select(group, all_of(y)) %>%
filter(group %in% c(V1, V2)) %>%
t.test(as.formula(sprintf("%s ~ group", y)), ., var.equal = TRUE) %>%
tidy() %>%
transmute(y = y,
group_1 = V1,
group_2 = V2,
df = parameter,
t_value = statistic,
p_value = p.value
)
})
})
#> # A tibble: 6 x 6
#> y group_1 group_2 df t_value p_value
#> <chr> <int> <int> <dbl> <dbl> <dbl>
#> 1 var1 1 2 58 -0.337 0.737
#> 2 var1 1 3 58 -1.35 0.183
#> 3 var1 2 3 58 -1.06 0.295
#> 4 var2 1 2 58 -0.152 0.879
#> 5 var2 1 3 58 1.72 0.0908
#> 6 var2 2 3 58 1.67 0.100
And here is @akrun's answer tweaked to give the same p-values as the above approach. Note the p.adjust.method = "none": the p-values are then not adjusted for multiple comparisons, which inflates the overall Type I error rate.
df %>%
pivot_longer(
cols = -group,
names_to = "y"
) %>%
group_by(y) %>%
summarise(
out = list(
tidy(
pairwise.t.test(
value,
group,
p.adjust.method = "none",
pool.sd = FALSE
)
)
)
) %>%
unnest(c(out))
#> # A tibble: 6 x 4
#> y group1 group2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 var1 2 1 0.737
#> 2 var1 3 1 0.183
#> 3 var1 3 2 0.295
#> 4 var2 2 1 0.879
#> 5 var2 3 1 0.0909
#> 6 var2 3 2 0.100
Created on 2021-07-30 by the reprex package (v1.0.0)
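If the unadjusted p-values are kept, they can still be corrected afterwards with p.adjust(). A sketch using the six p-values printed above; whether to adjust within each variable or across all comparisons at once (as done here) is a design decision, not something the output dictates:

```r
# Unadjusted p-values copied from the output above
p_raw <- c(0.737, 0.183, 0.295, 0.879, 0.0909, 0.100)

p_holm <- p.adjust(p_raw, method = "holm")        # step-down, less conservative
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # multiply by the number of tests

data.frame(p_raw, p_holm, p_bonf)
```

Inside pairwise.t.test() itself, leaving p.adjust.method at its default ("holm") applies the correction within each call.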

applying function to each group using dplyr and return specified dataframe

I used group_map for the first time and think I do it correctly. This is my code:
library(REAT)
df <- data.frame(value = c(1,1,1, 1,0.5,0.1, 0,0,0,1), group = c(1,1,1, 2,2,2, 3,3,3,3))
haves <- df %>%
group_by(group) %>%
group_map(~gini(.x$value, coefnorm = TRUE))
The thing is that haves is a list rather than a data frame. What would I have to do to obtain this df
wants <- data.frame(group = c(1,2,3), gini = c(0,0.5625,1))
group gini
1 0.0000
2 0.5625
3 1.0000
Thanks!
You can use dplyr::summarize:
df %>%
group_by(group) %>%
summarize(gini = gini(value, coefnorm = TRUE))
#> # A tibble: 3 x 2
#> group gini
#> <dbl> <dbl>
#> 1 1 0
#> 2 2 0.562
#> 3 3 1
According to the documentation, group_map always produces a list. group_modify is an alternative that produces a tibble if the function does, but gini just outputs a vector. So, you could do something like this...
df %>%
group_by(group) %>%
group_modify(~tibble(gini = gini(.x$value, coefnorm = TRUE)))
# A tibble: 3 x 2
# Groups: group [3]
group gini
<dbl> <dbl>
1 1 0
2 2 0.562
3 3 1
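The difference between the two verbs can be seen without the REAT package. In this sketch value_range() is a hypothetical stand-in for gini(): any function that returns a single number per group behaves the same way.

```r
library(dplyr)

df <- data.frame(value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# Stand-in for gini(): returns one number per group
value_range <- function(x) max(x) - min(x)

# group_map() collects the per-group results in a plain list
as_list <- df %>%
  group_by(group) %>%
  group_map(~ value_range(.x$value))

# group_modify() needs a data frame per group and binds them together
as_tbl <- df %>%
  group_by(group) %>%
  group_modify(~ tibble(range = value_range(.x$value)))

class(as_list)
as_tbl
```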
Using data.table
library(data.table)
setDT(df)[, .(gini = gini(value, coefnorm = TRUE)), group]
For grouped datasets, we can use the .data pronoun in case we don't want to refer to column names as bare symbols
library(dplyr)
df %>%
group_by(group) %>%
summarize(gini = gini(.data$value, coefnorm = TRUE))

Averaging daily observations to a single day across multiple columns, R [duplicate]

I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
df %>% group_by(grp) %>% summarise(mean(a))
This gives me the mean for column "a" for each group indicated by "grp".
My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column?
What I would like to have is something like
df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work
In dplyr (>= 1.0.0) you may use across(everything(), ...) in summarise to apply a function to all variables:
library(dplyr)
df %>% group_by(grp) %>% summarise(across(everything(), mean))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Alternatively, the purrrlyr package provides the same functionality:
library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Also don't forget about data.table (use keyby to sort groups):
library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678
Let's try to compare performance.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3
We can summarize with summarize_at, summarize_all and summarize_if in dplyr 0.7.4. Multiple columns and functions are specified through the .vars and .funs arguments, as in the code below. The names given on the left-hand side in .funs become the suffixes of the summarized columns. In dplyr 0.7.4, summarise_each (and mutate_each) is already deprecated, so these functions should no longer be used.
options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)
library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.4’
set.seed(123)
df <- data_frame(
a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
)
df %>% group_by(grp) %>%
summarise_each(.vars = letters[1:4],
.funs = c(mean="mean"))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over a selection of variables, use `summarise_at()`
# Error: Strings must match column names. Unknown columns: mean
Use the following code instead. All of these variants give the same result.
# summarise_at
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = names(.)[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = vars(a,b,c,d),
.funs = c(mean="mean"))
# summarise_all
df %>% group_by(grp) %>%
summarise_all(.funs = c(mean="mean"))
# summarise_if
df %>% group_by(grp) %>%
summarise_if(.predicate = function(x) is.numeric(x),
.funs = funs(mean="mean"))
# A tibble: 3 x 5
# grp a_mean b_mean c_mean d_mean
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 3.6 3.00
# 2 2 4.25 2.75 4.0 3.75
# 3 3 3.00 5.00 1.0 2.00
You can also have multiple functions.
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:2],
.funs = c(Mean="mean", Sd="sd"))
# A tibble: 3 x 5
# grp a_Mean b_Mean a_Sd b_Sd
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 1.4832397 1.870829
# 2 2 4.25 2.75 0.9574271 1.258306
# 3 3 3.00 5.00 NA NA
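In dplyr 1.0.0 and later the scoped verbs (summarise_at, summarise_all, summarise_if) are superseded by across(). A sketch of the multiple-function case with across(), on the same kind of sample data; res is an illustrative name:

```r
library(dplyr)

set.seed(123)
df <- tibble(
  a = sample(1:5, 10, replace = TRUE),
  b = sample(1:5, 10, replace = TRUE),
  c = sample(1:5, 10, replace = TRUE),
  d = sample(1:5, 10, replace = TRUE),
  grp = as.character(sample(1:3, 10, replace = TRUE))
)

res <- df %>%
  group_by(grp) %>%
  # a named list of functions gives columns named {column}_{function}
  summarise(across(a:b, list(Mean = mean, Sd = sd)))
names(res)
```

A group with a single row still gets a mean, but its standard deviation is NA, as in the output above.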
You can simply pass more arguments to summarise:
df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))
Source: local data frame [3 x 5]
grp mean(a) mean(b) mean(c) mean(d)
1 1 2.500000 3.500000 2.000000 3.0
2 2 3.800000 3.200000 3.200000 2.8
3 3 3.666667 3.333333 2.333333 3.0
For completeness: as of dplyr v0.2, ddply with colwise (both from the plyr package) will also do this:
> ddply(df, .(grp), colwise(mean))
grp a b c d
1 1 4.333333 4.00 1.000000 2.000000
2 2 2.000000 2.75 2.750000 2.750000
3 3 3.000000 4.00 4.333333 3.666667
but it is slower, at least in this case:
> microbenchmark(ddply(df, .(grp), colwise(mean)),
df %>% group_by(grp) %>% summarise_each(funs(mean)))
Unit: milliseconds
expr min lq mean
ddply(df, .(grp), colwise(mean)) 3.278002 3.331744 3.533835
df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337
median uq max neval
3.353633 3.378089 7.592209 100
1.121954 1.133428 2.292216 100
All the examples are great, but I figure I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package which shifts the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in a wide format, just tack on an additional call to the spread() function.
library(tidyverse)
# Create reproducible df
set.seed(101)
df <- tibble(a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = sample(1:3, 10, replace=T))
# Convert to tidy format using gather
df %>%
gather(key = variable, value = value, a:d) %>%
group_by(grp, variable) %>%
summarize(mean = mean(value)) %>%
spread(variable, mean)
#> Source: local data frame [3 x 5]
#> Groups: grp [3]
#>
#> grp a b c d
#> * <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.000000 3.5 3.250000 3.250000
#> 2 2 1.666667 4.0 4.666667 2.666667
#> 3 3 3.333333 3.0 2.333333 2.333333
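In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same long-then-wide workflow with the newer functions, on the same sample data; wide is an illustrative name:

```r
library(dplyr)
library(tidyr)

set.seed(101)
df <- tibble(a = sample(1:5, 10, replace = TRUE),
             b = sample(1:5, 10, replace = TRUE),
             c = sample(1:5, 10, replace = TRUE),
             d = sample(1:5, 10, replace = TRUE),
             grp = sample(1:3, 10, replace = TRUE))

wide <- df %>%
  # long format: one row per observation and variable
  pivot_longer(a:d, names_to = "variable", values_to = "value") %>%
  group_by(grp, variable) %>%
  summarise(mean = mean(value), .groups = "drop") %>%
  # back to one column per variable
  pivot_wider(names_from = variable, values_from = mean)
wide
```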

mapping over nested tibbles and run regressions

I have some data which looks similar to the following:
Data:
rank_A <- sample(x = 1:5, size = 100, replace = TRUE)
rank_B <- sample(x = 1:5, size = 100, replace = TRUE)
Y_A <- rnorm(n = 100)
Y_B <- rnorm(n = 100)
X <- rnorm(n = 100)
df <- data.frame(rank_A, rank_B, Y_A, Y_B, X)
Which looks like:
> tibble(df)
# A tibble: 100 x 1
df$rank_A $rank_B $Y_A $Y_B $X
<int> <int> <dbl> <dbl> <dbl>
1 5 1 0.128 -0.833 1.15
2 2 1 0.165 0.325 0.225
3 1 3 0.525 -0.632 0.390
4 5 3 -1.32 0.718 -0.377
5 4 2 -0.900 -0.364 -0.259
6 5 3 -1.17 0.556 -0.0702
7 4 1 -0.560 -1.66 -1.64
8 5 1 -2.01 -0.898 0.306
9 1 1 -2.56 0.693 1.34
10 4 2 0.338 -0.733 -0.481
# ... with 90 more rows
So I have two rank columns and a corresponding Y for each rank. A and B here are just two categories. I want to run two regressions, one for Y_A and one for Y_B, each regressed on X. I have the following code which works for rank_A.
df_regs <- df %>%
group_by(rank_A) %>%
nest() %>%
mutate(
Reg_A = map(data,
~lm(Y_A ~ X,
data = .)
),
Reg_A_summary = map(Reg_A,
~summary(.)
)
)
df_regs$Reg_A_summary
I would like to construct it such that I do not have to write a new piece of code for rank_B. I have many different regressions (different X variables I will give to each Y_A and Y_B). I thought about doing something like the following (which doesn't work).
df %>%
mutate(Class_A = group_by(rank_A) %>%
nest() %>%
mutate(
Reg_A = map(data,
~lm(Y_a ~ X,
data = .)
)
),
Class_B = group_by(rank_B) %>%
nest() %>%
mutate(
Reg_B = map(data,
~lm(Y_b ~ X,
data = .)
)
)
)
If you know of a cleaner method, please also let me know your thoughts as I will have many regressions.
Here is one approach, using the new pivot functions from the latest release of tidyr...
df2 <- df %>% pivot_longer(cols=-X,
names_to = c(".value", "Class"),
names_sep = "_") %>%
group_by(Class, rank) %>%
summarise(Reg = list(lm(Y ~ X)))
This creates a tibble with columns Class, rank, and the list output Reg of the lm model for Y~X for each combination. You can then map through it to extract the summary, coefficients, or whatever.
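As a sketch of that last step, the coefficients can be pulled out of the list column with purrr. The seed and the names coefs, intercept and slope are assumptions made for this illustration only:

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(
  rank_A = sample(x = 1:5, size = 100, replace = TRUE),
  rank_B = sample(x = 1:5, size = 100, replace = TRUE),
  Y_A = rnorm(n = 100),
  Y_B = rnorm(n = 100),
  X = rnorm(n = 100)
)

df2 <- df %>%
  pivot_longer(cols = -X,
               names_to = c(".value", "Class"),
               names_sep = "_") %>%
  group_by(Class, rank) %>%
  summarise(Reg = list(lm(Y ~ X)), .groups = "drop")

# One row per Class and rank, with the fitted intercept and slope
coefs <- df2 %>%
  mutate(intercept = map_dbl(Reg, ~ coef(.x)[["(Intercept)"]]),
         slope     = map_dbl(Reg, ~ coef(.x)[["X"]])) %>%
  select(-Reg)
coefs
```

For full coefficient tables, map(Reg, broom::tidy) followed by unnest() is a common alternative.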
We can do this with map2 over the corresponding group_by and dependent columns
library(dplyr)
library(purrr)
library(stringr)
library(broom)
map2(c('rank_A', 'rank_B'), c("Y_A", "Y_B"), ~ {
grp <- .x
predCol <- .y
newCol <- str_replace(predCol, "Y", "Reg")
df %>%
group_by_at(grp) %>%
nest %>%
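mutate(!! newCol := map(data,
# pass each nested data frame to lm(); without data = .x the model is not fit on the group's rows
~lm(reformulate('X', response = predCol), data = .x)))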
}
)

Resources