How can I summarize the multiple columns? [duplicate] - r

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Can dplyr summarise over several variables without listing each one? [duplicate]
(2 answers)
Closed 5 years ago.
I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
df %>% group_by(grp) %>% summarise(mean(a))
This gives me the mean for column "a" for each group indicated by "grp".
My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column?
What I would like to have is something like
df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work

In dplyr (>=1.00) you may use across(everything() in summarise to apply a function to all variables:
library(dplyr)
df %>% group_by(grp) %>% summarise(across(everything(), list(mean)))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Alternatively, the purrrlyr package provides the same functionality:
library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Also don't forget about data.table (use keyby to sort sort groups):
library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678
Let's try to compare performance.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3

We can summarize by using summarize_at, summarize_all and summarize_if on dplyr 0.7.4. We can set the multiple columns and functions by using vars and funs argument as below code. The left-hand side of funs formula is assigned to suffix of summarized vars. In the dplyr 0.7.4, summarise_each(and mutate_each) is already deprecated, so we cannot use these functions.
options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)
library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.4’
set.seed(123)
df <- data_frame(
a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
)
df %>% group_by(grp) %>%
summarise_each(.vars = letters[1:4],
.funs = c(mean="mean"))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over a selection of variables, use `summarise_at()`
# Error: Strings must match column names. Unknown columns: mean
You should change to the following code. The following codes all have the same result.
# summarise_at
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = names(.)[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = vars(a,b,c,d),
.funs = c(mean="mean"))
# summarise_all
df %>% group_by(grp) %>%
summarise_all(.funs = c(mean="mean"))
# summarise_if
df %>% group_by(grp) %>%
summarise_if(.predicate = function(x) is.numeric(x),
.funs = funs(mean="mean"))
# A tibble: 3 x 5
# grp a_mean b_mean c_mean d_mean
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 3.6 3.00
# 2 2 4.25 2.75 4.0 3.75
# 3 3 3.00 5.00 1.0 2.00
You can also have multiple functions.
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:2],
.funs = c(Mean="mean", Sd="sd"))
# A tibble: 3 x 5
# grp a_Mean b_Mean a_Sd b_Sd
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 1.4832397 1.870829
# 2 2 4.25 2.75 0.9574271 1.258306
# 3 3 3.00 5.00 NA NA

You can simply pass more arguments to summarise:
df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))
Source: local data frame [3 x 5]
grp mean(a) mean(b) mean(c) mean(d)
1 1 2.500000 3.500000 2.000000 3.0
2 2 3.800000 3.200000 3.200000 2.8
3 3 3.666667 3.333333 2.333333 3.0

For completeness: with dplyr v0.2 ddply with colwise will also do this:
> ddply(df, .(grp), colwise(mean))
grp a b c d
1 1 4.333333 4.00 1.000000 2.000000
2 2 2.000000 2.75 2.750000 2.750000
3 3 3.000000 4.00 4.333333 3.666667
but it is slower, at least in this case:
> microbenchmark(ddply(df, .(grp), colwise(mean)),
df %>% group_by(grp) %>% summarise_each(funs(mean)))
Unit: milliseconds
expr min lq mean
ddply(df, .(grp), colwise(mean)) 3.278002 3.331744 3.533835
df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337
median uq max neval
3.353633 3.378089 7.592209 100
1.121954 1.133428 2.292216 100

All the examples are great, but I figure I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package which shifts the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in a wide format, just tack on an additional call to the spread() function.
library(tidyverse)
# Create reproducible df
set.seed(101)
df <- tibble(a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = sample(1:3, 10, replace=T))
# Convert to tidy format using gather
df %>%
gather(key = variable, value = value, a:d) %>%
group_by(grp, variable) %>%
summarize(mean = mean(value)) %>%
spread(variable, mean)
#> Source: local data frame [3 x 5]
#> Groups: grp [3]
#>
#> grp a b c d
#> * <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.000000 3.5 3.250000 3.250000
#> 2 2 1.666667 4.0 4.666667 2.666667
#> 3 3 3.333333 3.0 2.333333 2.333333

Related

How to determine the mean of each columns in R with the command summarize() and group_by() of several columns? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Can dplyr summarise over several variables without listing each one? [duplicate]
(2 answers)
Closed 5 years ago.
I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
df %>% group_by(grp) %>% summarise(mean(a))
This gives me the mean for column "a" for each group indicated by "grp".
My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column?
What I would like to have is something like
df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work
In dplyr (>=1.00) you may use across(everything() in summarise to apply a function to all variables:
library(dplyr)
df %>% group_by(grp) %>% summarise(across(everything(), list(mean)))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Alternatively, the purrrlyr package provides the same functionality:
library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Also don't forget about data.table (use keyby to sort sort groups):
library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678
Let's try to compare performance.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3
We can summarize by using summarize_at, summarize_all and summarize_if on dplyr 0.7.4. We can set the multiple columns and functions by using vars and funs argument as below code. The left-hand side of funs formula is assigned to suffix of summarized vars. In the dplyr 0.7.4, summarise_each(and mutate_each) is already deprecated, so we cannot use these functions.
options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)
library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.4’
set.seed(123)
df <- data_frame(
a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
)
df %>% group_by(grp) %>%
summarise_each(.vars = letters[1:4],
.funs = c(mean="mean"))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over a selection of variables, use `summarise_at()`
# Error: Strings must match column names. Unknown columns: mean
You should change to the following code. The following codes all have the same result.
# summarise_at
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = names(.)[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = vars(a,b,c,d),
.funs = c(mean="mean"))
# summarise_all
df %>% group_by(grp) %>%
summarise_all(.funs = c(mean="mean"))
# summarise_if
df %>% group_by(grp) %>%
summarise_if(.predicate = function(x) is.numeric(x),
.funs = funs(mean="mean"))
# A tibble: 3 x 5
# grp a_mean b_mean c_mean d_mean
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 3.6 3.00
# 2 2 4.25 2.75 4.0 3.75
# 3 3 3.00 5.00 1.0 2.00
You can also have multiple functions.
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:2],
.funs = c(Mean="mean", Sd="sd"))
# A tibble: 3 x 5
# grp a_Mean b_Mean a_Sd b_Sd
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 1.4832397 1.870829
# 2 2 4.25 2.75 0.9574271 1.258306
# 3 3 3.00 5.00 NA NA
You can simply pass more arguments to summarise:
df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))
Source: local data frame [3 x 5]
grp mean(a) mean(b) mean(c) mean(d)
1 1 2.500000 3.500000 2.000000 3.0
2 2 3.800000 3.200000 3.200000 2.8
3 3 3.666667 3.333333 2.333333 3.0
For completeness: with dplyr v0.2 ddply with colwise will also do this:
> ddply(df, .(grp), colwise(mean))
grp a b c d
1 1 4.333333 4.00 1.000000 2.000000
2 2 2.000000 2.75 2.750000 2.750000
3 3 3.000000 4.00 4.333333 3.666667
but it is slower, at least in this case:
> microbenchmark(ddply(df, .(grp), colwise(mean)),
df %>% group_by(grp) %>% summarise_each(funs(mean)))
Unit: milliseconds
expr min lq mean
ddply(df, .(grp), colwise(mean)) 3.278002 3.331744 3.533835
df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337
median uq max neval
3.353633 3.378089 7.592209 100
1.121954 1.133428 2.292216 100
All the examples are great, but I figure I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package which shifts the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in a wide format, just tack on an additional call to the spread() function.
library(tidyverse)
# Create reproducible df
set.seed(101)
df <- tibble(a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = sample(1:3, 10, replace=T))
# Convert to tidy format using gather
df %>%
gather(key = variable, value = value, a:d) %>%
group_by(grp, variable) %>%
summarize(mean = mean(value)) %>%
spread(variable, mean)
#> Source: local data frame [3 x 5]
#> Groups: grp [3]
#>
#> grp a b c d
#> * <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.000000 3.5 3.250000 3.250000
#> 2 2 1.666667 4.0 4.666667 2.666667
#> 3 3 3.333333 3.0 2.333333 2.333333

Averaging daily observations to a single day across multiple columns, R [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Can dplyr summarise over several variables without listing each one? [duplicate]
(2 answers)
Closed 5 years ago.
I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
df %>% group_by(grp) %>% summarise(mean(a))
This gives me the mean for column "a" for each group indicated by "grp".
My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column?
What I would like to have is something like
df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work
In dplyr (>=1.00) you may use across(everything() in summarise to apply a function to all variables:
library(dplyr)
df %>% group_by(grp) %>% summarise(across(everything(), list(mean)))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Alternatively, the purrrlyr package provides the same functionality:
library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Also don't forget about data.table (use keyby to sort sort groups):
library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678
Let's try to compare performance.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3
We can summarize by using summarize_at, summarize_all and summarize_if on dplyr 0.7.4. We can set the multiple columns and functions by using vars and funs argument as below code. The left-hand side of funs formula is assigned to suffix of summarized vars. In the dplyr 0.7.4, summarise_each(and mutate_each) is already deprecated, so we cannot use these functions.
options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)
library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.4’
set.seed(123)
df <- data_frame(
a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
)
df %>% group_by(grp) %>%
summarise_each(.vars = letters[1:4],
.funs = c(mean="mean"))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over a selection of variables, use `summarise_at()`
# Error: Strings must match column names. Unknown columns: mean
You should change to the following code. The following codes all have the same result.
# summarise_at
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = names(.)[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = vars(a,b,c,d),
.funs = c(mean="mean"))
# summarise_all
df %>% group_by(grp) %>%
summarise_all(.funs = c(mean="mean"))
# summarise_if
df %>% group_by(grp) %>%
summarise_if(.predicate = function(x) is.numeric(x),
.funs = funs(mean="mean"))
# A tibble: 3 x 5
# grp a_mean b_mean c_mean d_mean
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 3.6 3.00
# 2 2 4.25 2.75 4.0 3.75
# 3 3 3.00 5.00 1.0 2.00
You can also have multiple functions.
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:2],
.funs = c(Mean="mean", Sd="sd"))
# A tibble: 3 x 5
# grp a_Mean b_Mean a_Sd b_Sd
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 1.4832397 1.870829
# 2 2 4.25 2.75 0.9574271 1.258306
# 3 3 3.00 5.00 NA NA
You can simply pass more arguments to summarise:
df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))
Source: local data frame [3 x 5]
grp mean(a) mean(b) mean(c) mean(d)
1 1 2.500000 3.500000 2.000000 3.0
2 2 3.800000 3.200000 3.200000 2.8
3 3 3.666667 3.333333 2.333333 3.0
For completeness: with dplyr v0.2 ddply with colwise will also do this:
> ddply(df, .(grp), colwise(mean))
grp a b c d
1 1 4.333333 4.00 1.000000 2.000000
2 2 2.000000 2.75 2.750000 2.750000
3 3 3.000000 4.00 4.333333 3.666667
but it is slower, at least in this case:
> microbenchmark(ddply(df, .(grp), colwise(mean)),
df %>% group_by(grp) %>% summarise_each(funs(mean)))
Unit: milliseconds
expr min lq mean
ddply(df, .(grp), colwise(mean)) 3.278002 3.331744 3.533835
df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337
median uq max neval
3.353633 3.378089 7.592209 100
1.121954 1.133428 2.292216 100
All the examples are great, but I figure I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package which shifts the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in a wide format, just tack on an additional call to the spread() function.
library(tidyverse)
# Create reproducible df
set.seed(101)
df <- tibble(a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = sample(1:3, 10, replace=T))
# Convert to tidy format using gather
df %>%
gather(key = variable, value = value, a:d) %>%
group_by(grp, variable) %>%
summarize(mean = mean(value)) %>%
spread(variable, mean)
#> Source: local data frame [3 x 5]
#> Groups: grp [3]
#>
#> grp a b c d
#> * <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.000000 3.5 3.250000 3.250000
#> 2 2 1.666667 4.0 4.666667 2.666667
#> 3 3 3.333333 3.0 2.333333 2.333333

Multiple paired t-tests on multiple variables simultaneously using dplyr/tidyverse

Assume a data structure like this:
ID testA_wave1 testA_wave2 testA_wave3 testB_wave1 testB_wave2 testB_wave3
1 1 3 2 3 6 5 3
2 2 4 4 4 3 6 6
3 3 10 2 1 4 4 4
4 4 5 3 12 2 7 4
5 5 5 3 9 2 4 2
6 6 10 0 2 6 6 5
7 7 6 8 4 6 8 3
8 8 1 5 4 5 6 0
9 9 3 2 7 8 4 4
10 10 4 9 5 11 8 8
What I want to achieve is to calculate a paired t-test for every test separately (in this case meaning testA and testB, but in real-life I have much more tests). I want to do it that way that I compare the first wave of a given test with every other subsequent wave of the same test (meaning testA_wave1 vs testA_wave2 and testA_wave1 vs testA_wave3 in the case of testA).
This way, I was able to achieve it:
df %>%
gather(variable, value, -ID) %>%
mutate(wave_ID = paste0("wave", parse_number(variable)),
variable = ifelse(grepl("testA", variable), "testA",
ifelse(grepl("testB", variable), "testB", NA_character_))) %>%
group_by(wave_ID, variable) %>%
summarise(value = list(value)) %>%
spread(wave_ID, value) %>%
group_by(variable) %>%
mutate(p_value_w1w2 = t.test(unlist(wave1), unlist(wave2), paired = TRUE)$p.value,
p_value_w1w3 = t.test(unlist(wave1), unlist(wave3), paired = TRUE)$p.value) %>%
select(variable, matches("(p_value)"))
variable p_value_w1w2 p_value_w1w3
<chr> <dbl> <dbl>
1 testA 0.664 0.921
2 testB 0.146 0.418
However, I would like to see different/more elegant solutions that give similar results. I'm looking mostly for dplyr/tidyverse solutions, but if there is a completely different way to achieve it, I'm not against it.
Sample data:
set.seed(123)
df <- data.frame(ID = 1:20,
testA_wave1 = round(rnorm(20, 5, 3), 0),
testA_wave2 = round(rnorm(20, 5, 3), 0),
testA_wave3 = round(rnorm(20, 5, 3), 0),
testB_wave1 = round(rnorm(20, 5, 3), 0),
testB_wave2 = round(rnorm(20, 5, 3), 0),
testB_wave3 = round(rnorm(20, 5, 3), 0))
Since dplyr 0.8.0 we can use group_split to split a dataframe into list of dataframes.
We gather the dataframe and convert it into long format and then separate the names of the column (key) into different columns (test and wave). We then use group_split to split the dataframe into list based on test column. For every dataframe in the list we spread it into wide format and then calculate the t.test values and rbind them into one dataframe using map_dfr.
library(tidyverse)
df %>%
gather(key, value, -ID) %>%
separate(key, c("test", "wave")) %>%
group_split(test) %>% #Previously we had to do split(.$test) here
map_dfr(. %>%
spread(wave, value) %>%
summarise(test = first(test),
p_value_w1w2 = t.test(wave1, wave2, paired = TRUE)$p.value,
p_value_w1w3 = t.test(wave1, wave3, paired = TRUE)$p.value))
# A tibble: 2 x 3
# test p_value_w1w2 p_value_w1w3
# <chr> <dbl> <dbl>
#1 testA 0.664 0.921
#2 testB 0.146 0.418
We manually perform the t-test above as there were only 2 values which needed to be calculated. If there are more number of wave... columns then this could become cumbersome. In such cases we could do
df %>%
gather(key, value, -ID) %>%
separate(key, c("test", "wave")) %>%
group_split(test) %>%
map_dfr(function(data)
data %>%
spread(wave, value) %>%
summarise_at(vars(setdiff(unique(data$wave), "wave1")),
function(x) t.test(.$wave1, x, paired = TRUE)$p.value) %>%
mutate(test = first(data$test)))
# wave2 wave3 test
# <dbl> <dbl> <chr>
#1 0.664 0.921 testA
#2 0.146 0.418 testB
Here it will perform the t-test for every "wave.." column with "wave1" column.
Since you are also open to other solutions, here is an attempt with purely base R solution
sapply(split.default(df[-1], sub("_.*", "", names(df[-1]))), function(x)
c(p_value_w1w2 = t.test(x[[1]], x[[2]],paired = TRUE)$p.value,
p_value_w1w3 = t.test(x[[1]], x[[3]],paired = TRUE)$p.value))
# testA testB
#p_value_w1w2 0.6642769 0.1456059
#p_value_w1w3 0.9209554 0.4184603
We split the columns based on test* and create a list of dataframes and apply t.test on different combinations of columns for each dataframe.
Update 03/16/2022
The tidyverse has evolved and so should this solution.
First I make a simplifying assumption: If we designed the experiment, then we know what the groups are and how many waves we followed them through. If we don't know, then we can extract this information from the column names. See at below.
library("broom")
library("tidyverse")
tests <- c("A", "B")
waves <- 3
comparisons <-
list(
test = tests,
first = 1,
later = seq(2, waves)
) %>%
cross_df()
comparisons
#> # A tibble: 4 × 3
#> test first later
#> <chr> <dbl> <int>
#> 1 A 1 2
#> 2 B 1 2
#> 3 A 1 3
#> 4 B 1 3
Transform the data from wide format to long format.
data <- df %>%
pivot_longer(
-ID,
names_to = "test_wave"
) %>%
extract(
test_wave, c("test", "wave"),
regex = "test(.+)_wave(.+)",
convert = TRUE
)
Then pair the comparisons we want to make with the data we collected. I've added lots of rename statements to make for more readable code but it's not strictly necessary.
comparisons %>%
inner_join(
data,
by = c("test", "first" = "wave")
) %>%
rename(
value.first = value
) %>%
inner_join(
data,
by = c("test", "later" = "wave", "ID")
) %>%
rename(
value.later = value
) %>%
group_by(
test, first, later
) %>%
group_modify(
~ tidy(t.test(.x$value.first, .x$value.later, paired = TRUE))
) %>%
ungroup() %>%
pivot_wider(
id_cols = test,
names_from = later,
names_glue = "wave1_vs_wave{later}",
values_from = p.value
)
#> # A tibble: 2 × 3
#> test wave1_vs_wave2 wave1_vs_wave3
#> <chr> <dbl> <dbl>
#> 1 A 0.664 0.921
#> 2 B 0.146 0.418
Appendix: Extract test names and number of waves from column names.
design <- df %>%
select(starts_with("test")) %>%
colnames() %>%
str_match("test(.+)_wave(.+)")
tests <- unique(design[, 2])
waves <- max(as.integer(design[, 3]))
Created on 2022-03-16 by the reprex package (v2.0.1)
Old solution
Here is one way to do it, using purrr quite a bit.
library("tidyverse")
set.seed(123)
df <- tibble(
ID = 1:20,
testA_wave1 = round(rnorm(20, 5, 3), 0),
testA_wave2 = round(rnorm(20, 5, 3), 0),
testA_wave3 = round(rnorm(20, 5, 3), 0),
testB_wave1 = round(rnorm(20, 5, 3), 0),
testB_wave2 = round(rnorm(20, 5, 3), 0),
testB_wave3 = round(rnorm(20, 5, 3), 0)
)
pvalues <- df %>%
# From wide tibble to long tibble
gather(test, value, -ID) %>%
separate(test, c("test", "wave")) %>%
# Not stricly necessary; will order the waves alphabetically instead
mutate(wave = parse_number(wave)) %>%
inner_join(., ., by = c("ID", "test")) %>%
# If there are two waves w1 and w2,
# we end up with pairs (w1, w1), (w1, w2), (w2, w1) and (w2, w2),
# so filter out to keep the pairing (w1, w2) only
filter(wave.x == 1, wave.x < wave.y) %>%
nest(ID, value.x, value.y) %>%
mutate(pvalue = data %>%
# Perform the test
map(~t.test(.$value.x, .$value.y, paired = TRUE)) %>%
map(broom::tidy) %>%
# Also not strictly necessary; you might want to keep all
# information about the test: estimate, statistic, etc.
map_dbl(pluck, "p.value"))
pvalues
#> # A tibble: 4 x 5
#> test wave.x wave.y data pvalue
#> <chr> <dbl> <dbl> <list> <dbl>
#> 1 testA 1 2 <tibble [20 x 3]> 0.664
#> 2 testA 1 3 <tibble [20 x 3]> 0.921
#> 3 testB 1 2 <tibble [20 x 3]> 0.146
#> 4 testB 1 3 <tibble [20 x 3]> 0.418
pvalues %>%
# Drop the data in order to pivot the table
select(- data) %>%
unite("waves", wave.x, wave.y, sep = ":") %>%
spread(waves, pvalue)
#> # A tibble: 2 x 3
#> test `1:2` `1:3`
#> <chr> <dbl> <dbl>
#> 1 testA 0.664 0.921
#> 2 testB 0.146 0.418
Created on 2019-03-08 by the reprex package (v0.2.1)
To throw in a data.table solution:
library(stringr)
library(data.table)
library(magrittr) ## for the pipe operator
dt_sol <- function(df) {
## create patterns for the melt operation:
## all columns from the same wave should go in one column
grps <- str_extract(names(df)[-1],
"[0-9]+$") %>%
unique() %>%
paste0("wave", ., "$")
grp_names <- sub("\\$", "", grps)
## melt the data table: all test*_wave_i data go into column wave_i
df.m <- melt(df,
measure = patterns(grps),
value.name = grp_names,
variable.name = "test")
## define the names for the new column, we want to extract estimate and p.value
new_cols <- c(outer(c("p.value", "estimate"),
grp_names[-1],
paste, sep = "_"))
## use lapply on .SD which equals to all wave_i columns but the first one
## return estimate and p.value
df.m[,
setNames(unlist(lapply(.SD,
function(col) {
t.test(wave1, col, paired = TRUE)[c("p.value", "estimate")]
}), recursive = FALSE), new_cols),
test, ## group by each test
.SDcols = grp_names[-1]]
}
dt <- copy(df)
setDT(dt)
dt_sol(dt)
# test p.value_wave2 estimate_wave2 p.value_wave3 estimate_wave3
# 1: 1 0.6642769 0.40 0.9209554 -0.1
# 2: 2 0.1456059 -1.45 0.4184603 0.7
Benchmark
Comparing the data.table solution to the tidyverse solution we get an 3-fold speed increase with teh data.tablesolution:
dp_sol <- function(df) {
df %>%
gather(test, value, -ID) %>%
separate(test, c("test", "wave")) %>%
inner_join(., ., by = c("ID", "test")) %>%
filter(wave.x == 1, wave.x < wave.y) %>%
nest(ID, value.x, value.y) %>%
mutate(pvalue = data %>%
map(~t.test(.$value.x, .$value.y, paired = TRUE)) %>%
map(broom::tidy) %>%
map_dbl(pluck, "p.value"))
}
library(microbenchmark)
microbenchmark(dplyr = dp_sol(df),
data.table = dt_sol(dt))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 6.119273 6.897456 7.639569 7.348364 7.996607 14.938182 100 b
# data.table 1.902547 2.307395 2.790910 2.758789 3.133091 4.923153 100 a
With a slightly bigger input:
make_df <- function(nr_tests = 2,
nr_waves = 3,
n_per_wave = 20) {
mat <- cbind(seq(1, n_per_wave),
matrix(round(rnorm(nr_tests * nr_waves * n_per_wave), 0),
nrow = n_per_wave))
c_names <- c(outer(1:nr_waves, 1:nr_tests, function(w, t) glue::glue("test{t}_wave{w}")))
colnames(mat) <- c("ID", c_names)
as.data.frame(mat)
}
df2 <- make_df(100, 100, 10)
dt2 <- copy(df2)
setDT(dt2)
microbenchmark(dplyr = dp_sol(df2),
data.table = dt_sol(dt2)
# Unit: seconds
# expr min lq mean median uq max neval cld
# dplyr 3.469837 3.669819 3.877548 3.821475 3.984518 5.268596 100 b
# data.table 1.018939 1.126244 1.193548 1.173175 1.252855 1.743075 100 a
Using all combinations without replacement:
Just for testA group:
comb <- arrangements::combinations(names(df)[grep("testA",names(df))], k = 2,n = 3,replace = F )
tTest <- function(x, data = df){
ttest <- t.test(x =data[x[1]] , y = data[x[2]])
return(data.frame(var1 = x[1],
var2 = x[2],
t = ttest[["statistic"]][["t"]],
pvalue = ttest[["p.value"]]))
}
result <- apply(comb, 1, tTest, data = df)
Result:
dplyr::bind_rows(result)
var1 var2 t pvalue
1 testA_wave1 testA_wave2 0.5009236 0.6193176
2 testA_wave1 testA_wave3 -0.6426433 0.5243146
3 testA_wave2 testA_wave3 -1.1564854 0.2547069
For all groups:
comb <- arrangements::combinations(x = names(df)[-1], k = 2,n = 6, replace = F )
result <- apply(comb, 1, tTest, data = df)
Result:
dplyr::bind_rows(result)
var1 var2 t pvalue
1 testA_wave1 testA_wave2 0.5009236 0.6193176
2 testA_wave1 testA_wave3 -0.6426433 0.5243146
3 testA_wave1 testB_wave1 0.4199215 0.6769510
4 testA_wave1 testB_wave2 -0.3447992 0.7321465
5 testA_wave1 testB_wave3 0.0000000 1.0000000
6 testA_wave2 testA_wave3 -1.1564854 0.2547069
7 testA_wave2 testB_wave1 -0.1070172 0.9153442
8 testA_wave2 testB_wave2 -0.8516264 0.3997630
9 testA_wave2 testB_wave3 -0.5640491 0.5762010
10 testA_wave3 testB_wave1 1.1068781 0.2754186
11 testA_wave3 testB_wave2 0.2966237 0.7683692
12 testA_wave3 testB_wave3 0.7211103 0.4755291
13 testB_wave1 testB_wave2 -0.7874100 0.4360152
14 testB_wave1 testB_wave3 -0.4791735 0.6346043
15 testB_wave2 testB_wave3 0.3865414 0.7013933
To throw another, somewhat more concise, data.table solution into the mix, in which we melt the data into long format:
setDT(df)
x = melt(df[,-1])[, tname := sub('_.+','',variable)][, wave := sub('.+_','',variable)]
x[wave != 'wave1', .(p.value =
t.test(x[tname==test & wave == 'wave1', value], value, paired = TRUE)$p.value),
by = .(test=tname,wave)]
# test wave p.value
# 1: testA wave2 0.6642769
# 2: testA wave3 0.9209554
# 3: testB wave2 0.1456059
# 4: testB wave3 0.4184603

How can I calculate mean for all columns in a dataframe using dplyr? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Can dplyr summarise over several variables without listing each one? [duplicate]
(2 answers)
Closed 5 years ago.
I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
df %>% group_by(grp) %>% summarise(mean(a))
This gives me the mean for column "a" for each group indicated by "grp".
My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column?
What I would like to have is something like
df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work
In dplyr (>=1.00) you may use across(everything() in summarise to apply a function to all variables:
library(dplyr)
df %>% group_by(grp) %>% summarise(across(everything(), list(mean)))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Alternatively, the purrrlyr package provides the same functionality:
library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Also don't forget about data.table (use keyby to sort sort groups):
library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678
Let's try to compare performance.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3
We can summarize by using summarize_at, summarize_all and summarize_if on dplyr 0.7.4. We can set the multiple columns and functions by using vars and funs argument as below code. The left-hand side of funs formula is assigned to suffix of summarized vars. In the dplyr 0.7.4, summarise_each(and mutate_each) is already deprecated, so we cannot use these functions.
options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)
library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.4’
set.seed(123)
df <- data_frame(
a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
)
df %>% group_by(grp) %>%
summarise_each(.vars = letters[1:4],
.funs = c(mean="mean"))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over a selection of variables, use `summarise_at()`
# Error: Strings must match column names. Unknown columns: mean
You should change to the following code. The following codes all have the same result.
# summarise_at
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = names(.)[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = vars(a,b,c,d),
.funs = c(mean="mean"))
# summarise_all
df %>% group_by(grp) %>%
summarise_all(.funs = c(mean="mean"))
# summarise_if
df %>% group_by(grp) %>%
summarise_if(.predicate = function(x) is.numeric(x),
.funs = funs(mean="mean"))
# A tibble: 3 x 5
# grp a_mean b_mean c_mean d_mean
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 3.6 3.00
# 2 2 4.25 2.75 4.0 3.75
# 3 3 3.00 5.00 1.0 2.00
You can also have multiple functions.
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:2],
.funs = c(Mean="mean", Sd="sd"))
# A tibble: 3 x 5
# grp a_Mean b_Mean a_Sd b_Sd
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 1.4832397 1.870829
# 2 2 4.25 2.75 0.9574271 1.258306
# 3 3 3.00 5.00 NA NA
You can simply pass more arguments to summarise:
df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))
Source: local data frame [3 x 5]
grp mean(a) mean(b) mean(c) mean(d)
1 1 2.500000 3.500000 2.000000 3.0
2 2 3.800000 3.200000 3.200000 2.8
3 3 3.666667 3.333333 2.333333 3.0
For completeness: with dplyr v0.2 ddply with colwise will also do this:
> ddply(df, .(grp), colwise(mean))
grp a b c d
1 1 4.333333 4.00 1.000000 2.000000
2 2 2.000000 2.75 2.750000 2.750000
3 3 3.000000 4.00 4.333333 3.666667
but it is slower, at least in this case:
> microbenchmark(ddply(df, .(grp), colwise(mean)),
df %>% group_by(grp) %>% summarise_each(funs(mean)))
Unit: milliseconds
expr min lq mean
ddply(df, .(grp), colwise(mean)) 3.278002 3.331744 3.533835
df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337
median uq max neval
3.353633 3.378089 7.592209 100
1.121954 1.133428 2.292216 100
All the examples are great, but I figure I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package which shifts the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in a wide format, just tack on an additional call to the spread() function.
library(tidyverse)
# Create reproducible df
set.seed(101)
df <- tibble(a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = sample(1:3, 10, replace=T))
# Convert to tidy format using gather
df %>%
gather(key = variable, value = value, a:d) %>%
group_by(grp, variable) %>%
summarize(mean = mean(value)) %>%
spread(variable, mean)
#> Source: local data frame [3 x 5]
#> Groups: grp [3]
#>
#> grp a b c d
#> * <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.000000 3.5 3.250000 3.250000
#> 2 2 1.666667 4.0 4.666667 2.666667
#> 3 3 3.333333 3.0 2.333333 2.333333

Summarizing multiple columns with dplyr? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Can dplyr summarise over several variables without listing each one? [duplicate]
(2 answers)
Closed 5 years ago.
I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
df %>% group_by(grp) %>% summarise(mean(a))
This gives me the mean for column "a" for each group indicated by "grp".
My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column?
What I would like to have is something like
df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work
In dplyr (>=1.00) you may use across(everything() in summarise to apply a function to all variables:
library(dplyr)
df %>% group_by(grp) %>% summarise(across(everything(), list(mean)))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Alternatively, the purrrlyr package provides the same functionality:
library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Also don't forget about data.table (use keyby to sort sort groups):
library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678
Let's try to compare performance.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3
We can summarize by using summarize_at, summarize_all and summarize_if on dplyr 0.7.4. We can set the multiple columns and functions by using vars and funs argument as below code. The left-hand side of funs formula is assigned to suffix of summarized vars. In the dplyr 0.7.4, summarise_each(and mutate_each) is already deprecated, so we cannot use these functions.
options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)
library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.4’
set.seed(123)
df <- data_frame(
a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
)
df %>% group_by(grp) %>%
summarise_each(.vars = letters[1:4],
.funs = c(mean="mean"))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over a selection of variables, use `summarise_at()`
# Error: Strings must match column names. Unknown columns: mean
You should change to the following code. The following codes all have the same result.
# summarise_at
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = names(.)[1:4],
.funs = c(mean="mean"))
df %>% group_by(grp) %>%
summarise_at(.vars = vars(a,b,c,d),
.funs = c(mean="mean"))
# summarise_all
df %>% group_by(grp) %>%
summarise_all(.funs = c(mean="mean"))
# summarise_if
df %>% group_by(grp) %>%
summarise_if(.predicate = function(x) is.numeric(x),
.funs = funs(mean="mean"))
# A tibble: 3 x 5
# grp a_mean b_mean c_mean d_mean
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 3.6 3.00
# 2 2 4.25 2.75 4.0 3.75
# 3 3 3.00 5.00 1.0 2.00
You can also have multiple functions.
df %>% group_by(grp) %>%
summarise_at(.vars = letters[1:2],
.funs = c(Mean="mean", Sd="sd"))
# A tibble: 3 x 5
# grp a_Mean b_Mean a_Sd b_Sd
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.80 3.00 1.4832397 1.870829
# 2 2 4.25 2.75 0.9574271 1.258306
# 3 3 3.00 5.00 NA NA
You can simply pass more arguments to summarise:
df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))
Source: local data frame [3 x 5]
grp mean(a) mean(b) mean(c) mean(d)
1 1 2.500000 3.500000 2.000000 3.0
2 2 3.800000 3.200000 3.200000 2.8
3 3 3.666667 3.333333 2.333333 3.0
For completeness: with dplyr v0.2 ddply with colwise will also do this:
> ddply(df, .(grp), colwise(mean))
grp a b c d
1 1 4.333333 4.00 1.000000 2.000000
2 2 2.000000 2.75 2.750000 2.750000
3 3 3.000000 4.00 4.333333 3.666667
but it is slower, at least in this case:
> microbenchmark(ddply(df, .(grp), colwise(mean)),
df %>% group_by(grp) %>% summarise_each(funs(mean)))
Unit: milliseconds
expr min lq mean
ddply(df, .(grp), colwise(mean)) 3.278002 3.331744 3.533835
df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337
median uq max neval
3.353633 3.378089 7.592209 100
1.121954 1.133428 2.292216 100
All the examples are great, but I figure I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package which shifts the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in a wide format, just tack on an additional call to the spread() function.
library(tidyverse)
# Create reproducible df
set.seed(101)
df <- tibble(a = sample(1:5, 10, replace=T),
b = sample(1:5, 10, replace=T),
c = sample(1:5, 10, replace=T),
d = sample(1:5, 10, replace=T),
grp = sample(1:3, 10, replace=T))
# Convert to tidy format using gather
df %>%
gather(key = variable, value = value, a:d) %>%
group_by(grp, variable) %>%
summarize(mean = mean(value)) %>%
spread(variable, mean)
#> Source: local data frame [3 x 5]
#> Groups: grp [3]
#>
#> grp a b c d
#> * <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.000000 3.5 3.250000 3.250000
#> 2 2 1.666667 4.0 4.666667 2.666667
#> 3 3 3.333333 3.0 2.333333 2.333333

Resources