Suppose I have a tibble like tb_1 here
# A tibble: 7 x 2
Grp Srt
<chr> <int>
1 A 10
2 B 4
3 B 7
4 A 5
5 A 1
6 A 3
7 B 2
which I have reproduced below:
tb_1 <- structure(
  list(
    Grp = c("A", "B", "B", "A", "A", "A", "B"),
    Srt = c(10L, 4L, 7L, 5L, 1L, 3L, 2L)
  ),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -7L)
)
I would like a performant function arrange_groups(), in the style of dplyr, which sorts the observations within each existing group (by the given variables) while preserving the positions over which that group is distributed.
library(dplyr)
tb_2 <- tb_1 %>%
# Group by 'Grp'.
group_by(Grp) %>%
# Sort by 'Srt' WITHIN each group.
arrange_groups(Srt)
In the resulting tb_2, the 4 observations from the "A" group should remain distributed among the 1st, 4th, 5th, and 6th rows, after being sorted by Srt among themselves. Likewise, the 3 observations from the "B" group should remain distributed among the 2nd, 3rd, and 7th rows.
# A tibble: 7 x 2
# Groups: Grp [2]
Grp Srt
<chr> <int>
1 A 1
2 B 2
3 B 4
4 A 3
5 A 5
6 A 10
7 B 7
I have reproduced tb_2 below:
tb_2 <- structure(
  list(
    Grp = c("A", "B", "B", "A", "A", "A", "B"),
    Srt = c(1L, 2L, 4L, 3L, 5L, 10L, 7L)
  ),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -7L),
  groups = structure(
    list(
      Grp = c("A", "B"),
      .rows = structure(
        list(
          c(1L, 4L, 5L, 6L),
          c(2L, 3L, 7L)
        ),
        ptype = integer(0),
        class = c("vctrs_list_of", "vctrs_vctr", "list")
      )
    ),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -2L),
    .drop = TRUE
  )
)
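For concreteness, the desired behaviour amounts to replacing each group's values with their sorted counterparts in place. A minimal base-R illustration of the semantics (not the requested dplyr-style function), assuming tb_1 as defined above:
# Base-R illustration of the desired semantics (not the requested function).
tb_check <- tb_1
for (g in unique(tb_check$Grp)) {
  idx <- which(tb_check$Grp == g)               # positions held by this group
  tb_check$Srt[idx] <- sort(tb_check$Srt[idx])  # sort within those positions only
}
tb_check  # matches tb_2 above (apart from the grouping attributes)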
Update
While I was able to answer my own question, I am leaving the floor open for other solutions. I am curious to see what alternatives exist, especially those that are more performant, more creative, or work with different ecosystems like data.table.
Toward Optimization
Further solutions should ideally (1) avoid recalculating order(Srt_1, Srt_2, ...) for every column in df, and (2) be no slower than the existing data.table suggestions.
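On the first point, the idea is to compute the ordering index once per group and reuse it for every column; a minimal sketch (my own illustration, independent of the solutions below):
# Sketch only: one order() call per group, reused across all columns.
reorder_block <- function(block, sort_cols) {
  i <- do.call(order, unname(block[sort_cols]))  # computed once...
  lapply(block, `[`, i)                          # ...applied to every column
}
reorder_block(list(Grp = c("A", "A", "A"), Srt = c(5L, 1L, 3L)), sort_cols = "Srt")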
Solutions
Within the tidyverse, that goal can be accomplished by either a simple workflow or (among others) the following two functions.
Workflow
You could simply ignore arrange_groups() and instead implement a dplyr workflow with mutate(), since operations (like order()) will be applied within groups anyway.
library(dplyr)
tb_1 %>%
group_by(Grp) %>%
# Arbitrary sorting variables go HERE:
mutate(across(everything(), ~.[order(Srt)]))
# ^^^
Reordering Function
This arrange_groups_1() function sorts first by existing groups, and then by the variables given in .... With the data thus sorted within its groups, arrange_groups_1() then maps those groups back to their original locations.
arrange_groups_1 <- function(.data, ...) {
  # Arrange into group "regions", and sort within each region; then...
  dplyr::arrange(.data = .data, ... = ..., .by_group = TRUE)[
    # ...map the results back to the original structure.
    order(order(dplyr::group_indices(.data = .data))),
  ]
}
It is compatible with dplyr:
library(dplyr)
tb_1 %>%
group_by(Grp) %>%
arrange_groups_1(Srt)
Mutating Function
Less clever but more straightforward than arrange_groups_1(), the arrange_groups_2() solution simply implements the workflow in functional form.
arrange_groups_2 <- function(.data, ...) {
  # Capture the symbols for the sorting variables.
  dots <- dplyr::enquos(...)
  dplyr::mutate(
    .data = .data,
    dplyr::across(
      # Sort across the entire dataset.
      .cols = dplyr::everything(),
      # Sort each group "in place", by variables captured from the proper scope.
      .fns = ~.[order(!!!dots)]
    )
  )
}
It too is compatible with dplyr:
library(dplyr)
tb_1 %>%
group_by(Grp) %>%
arrange_groups_2(Srt)
Result
Given a tb_1 like yours, all of these solutions will yield the desired result:
# A tibble: 7 x 2
# Groups: Grp [2]
Grp Srt
<chr> <int>
1 A 1
2 B 2
3 B 4
4 A 3
5 A 5
6 A 10
7 B 7
Performance
On large datasets, the disparity in performance might become significant. Given a df with 1 million observations and several variables for grouping (Grp_*) and sorting (Srt_*)
set.seed(0)
df <- data.frame(
Record_ID = 1:1000000,
Grp_1 = sample(x = letters[ 1:6 ] , size = 1000000, replace = TRUE ),
Grp_2 = sample(x = letters[ 7:12] , size = 1000000, replace = TRUE ),
Grp_3 = sample(x = letters[13:18] , size = 1000000, replace = TRUE ),
Grp_4 = sample(x = letters[19:26] , size = 1000000, replace = TRUE ),
Srt_1 = sample(x = 1:1000000, size = 1000000, replace = FALSE),
Srt_2 = sample(x = 1000001:2000000, size = 1000000, replace = FALSE),
Srt_3 = sample(x = 2000001:3000000, size = 1000000, replace = FALSE),
Srt_4 = sample(x = 3000001:4000000, size = 1000000, replace = FALSE)
)
the relative performance of each solution is measured below:
library(dplyr)
library(microbenchmark)
performances <- list(
one_var = microbenchmark(
arrange_groups_1 = df %>%
group_by(Grp_1) %>%
arrange_groups_1(Srt_1),
arrange_groups_2 = df %>%
group_by(Grp_1) %>%
arrange_groups_2(Srt_1),
workflow = df %>%
group_by(Grp_1) %>%
mutate(across(everything(), ~.[order(Srt_1)])),
times = 50
),
two_vars = microbenchmark(
arrange_groups_1 = df %>%
group_by(Grp_1, Grp_2) %>%
arrange_groups_1(Srt_1, Srt_2),
arrange_groups_2 = df %>%
group_by(Grp_1, Grp_2) %>%
arrange_groups_2(Srt_1, Srt_2),
workflow = df %>%
group_by(Grp_1, Grp_2) %>%
mutate(across(everything(), ~.[order(Srt_1, Srt_2)])),
times = 50
),
three_vars = microbenchmark(
arrange_groups_1 = df %>%
group_by(Grp_1, Grp_2, Grp_3) %>%
arrange_groups_1(Srt_1, Srt_2, Srt_3),
arrange_groups_2 = df %>%
group_by(Grp_1, Grp_2, Grp_3) %>%
arrange_groups_2(Srt_1, Srt_2, Srt_3),
workflow = df %>%
group_by(Grp_1, Grp_2, Grp_3) %>%
mutate(across(everything(), ~.[order(Srt_1, Srt_2, Srt_3)])),
times = 50
),
four_vars = microbenchmark(
arrange_groups_1 = df %>%
group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
arrange_groups_1(Srt_1, Srt_2, Srt_3, Srt_4),
arrange_groups_2 = df %>%
group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
arrange_groups_2(Srt_1, Srt_2, Srt_3, Srt_4),
workflow = df %>%
group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
mutate(across(everything(), ~.[order(Srt_1, Srt_2, Srt_3, Srt_4)])),
times = 50
)
)
Clearly arrange_groups_1() is outclassed. I suspect arrange_groups_2() can hold its own against the workflow while offering more ergonomic usage. However, this suspicion should be verified on other (and better) machines, and with larger sets of grouping and sorting variables.
#> performances
$one_var
Unit: milliseconds
expr min lq mean median uq max neval
arrange_groups_1 2066.4674 2155.8859 2231.3547 2199.7442 2283.5782 2565.0542 50
arrange_groups_2 352.3775 385.1829 435.2595 444.8746 464.1493 607.0927 50
workflow 337.2756 391.0174 428.9049 435.8385 454.7347 546.4498 50
$two_vars
Unit: milliseconds
expr min lq mean median uq max neval
arrange_groups_1 3580.5395 3688.1506 3842.2048 3799.5430 3979.9716 4317.7100 50
arrange_groups_2 230.1166 239.9141 265.0786 249.3640 287.1006 359.1822 50
workflow 221.6627 234.2732 256.6200 243.3707 281.2269 365.9102 50
$three_vars
Unit: milliseconds
expr min lq mean median uq max neval
arrange_groups_1 5113.6341 5340.5483 5441.3399 5443.5068 5535.0578 5946.6958 50
arrange_groups_2 261.9329 274.1785 295.6854 282.4638 323.5710 412.0139 50
workflow 224.8709 236.9958 263.2440 252.6042 292.7043 339.6351 50
$four_vars
Unit: milliseconds
expr min lq mean median uq max neval
arrange_groups_1 6810.3864 7035.7077 7237.6941 7156.7051 7314.4667 8051.8558 50
arrange_groups_2 581.9000 603.7822 640.8977 626.4116 672.6488 859.8239 50
workflow 349.7786 361.6454 391.7517 375.1532 429.3643 485.9227 50
Update
Hybrid Function
Inspired by @akrun's answer, here is a function that integrates the power of data.table...
arrange_groups_3 <- function(.data, ...) {
  # Name the variables for grouping, and their complement in '.data'.
  group_vars <- dplyr::group_vars(.data)
  other_vars <- setdiff(names(.data), group_vars)
  # For proper scoping, generate here the expression for sorting.
  sort_expr <- substitute(order(...))
  dplyr::as_tibble(data.table::as.data.table(.data)[,
    (other_vars) := lapply(
      # Sort each column, using an index...
      .SD, \(x, i) x[i],
      # ...which we need to calculate only once.
      i = eval(sort_expr)
    ),
    group_vars
  ])
}
...with the ergonomics of dplyr.
library(dplyr)
tb_1 %>%
group_by(Grp) %>%
arrange_groups_3(Srt)
However, my implementation drops the original grouping in .data, so it's still a work in progress.
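One possible remedy (untested beyond the toy data; the re-grouping step is my assumption rather than part of the solution above) is to capture the grouping up front and re-apply it on the way out:
# Sketch: same body as arrange_groups_3(), plus restoring the captured grouping.
arrange_groups_3b <- function(.data, ...) {
  group_syms <- dplyr::groups(.data)       # capture the grouping symbols up front
  group_vars <- dplyr::group_vars(.data)
  other_vars <- setdiff(names(.data), group_vars)
  sort_expr <- substitute(order(...))
  out <- dplyr::as_tibble(data.table::as.data.table(.data)[,
    (other_vars) := lapply(.SD, \(x, i) x[i], i = eval(sort_expr)),
    group_vars
  ])
  dplyr::group_by(out, !!!group_syms)      # re-apply the original grouping
}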
Fast Mutate
This rather fast implementation was inspired by @Henrik's suggestion to use dtplyr, a data.table backend for dplyr.
arrange_groups_4 <- function(.data, ...) {
  # Capture the symbols for the sorting and grouping variables.
  sort_syms <- dplyr::enquos(...)
  group_syms <- dplyr::groups(.data)
  .data |>
    # Use a "data.table" backend.
    dtplyr::lazy_dt() |>
    # Preserve the grouping.
    dplyr::group_by(!!!group_syms) |>
    # Perform the sorting.
    dplyr::mutate(
      dplyr::across(
        # Sort across the entire dataset.
        .cols = dplyr::everything(),
        # Sort each group "in place": subscript using the index...
        .fns = `[`,
        # ...generated when ordering by the sorting variables.
        i = order(!!!sort_syms)
      )
    )
}
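For usage, it slots into the same pipeline as the earlier functions; since it returns a dtplyr lazy step, a final as_tibble() (or collect()) materialises the result:
# Usage sketch: collect the lazy dtplyr result back into a tibble.
library(dplyr)
tb_1 %>%
  group_by(Grp) %>%
  arrange_groups_4(Srt) %>%
  as_tibble()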
Although I have yet to test it with more than 4 grouping and sorting variables, it appears to scale roughly linearly with the number of variables:
$one_var
Unit: milliseconds
expr min lq mean median uq max neval
arrange_groups_4 30.738 31.8028 46.81692 37.6586 59.8274 95.4703 50
$two_vars
Unit: milliseconds
expr min lq mean median uq max neval
arrange_groups_4 41.4364 41.9118 52.91332 46.4306 66.1674 80.171 50
$three_vars
Unit: milliseconds
expr min lq mean median uq max neval
arrange_groups_4 47.8605 48.6225 62.06675 51.9562 71.487 237.0102 50
$four_vars
Unit: milliseconds
expr min lq mean median uq max neval
arrange_groups_4 67.306 69.1426 78.68869 73.81695 88.7874 108.2624 50
The question asked about dplyr. Here is an attempt with data.table, since the question also concerns efficiency. Benchmarks with the OP's big dataset df are below.
library(data.table)
system.time({
df %>%
group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
mutate(across(everything(), ~.[order(Srt_1, Srt_2, Srt_3, Srt_4)]))
})
# user system elapsed
# 0.552 0.013 0.564
system.time({
grpnms <- grep("Grp", names(df), value = TRUE)
othernms <- setdiff(names(df), grpnms)
setDT(df)[, (othernms) := lapply(.SD, \(x)
x[order(Srt_1, Srt_2, Srt_3, Srt_4)]), grpnms]
})
# user system elapsed
# 0.348 0.012 0.360
Here's another dplyr solution relying on a join that preserves the row order. (The id column can of course be dropped as a last step, and the temporary objects don't need to be created separately, but the method is clearest presented this way.)
group_order = tb_1 %>%
select(Grp) %>%
group_by(Grp) %>%
mutate(id = row_number())
row_order = tb_1 %>%
arrange(Srt) %>%
group_by(Grp) %>%
mutate(id = row_number())
result = group_order %>% left_join(row_order, by = c("Grp", "id"))
result
# # A tibble: 7 × 3
# # Groups: Grp [2]
# Grp id Srt
# <chr> <int> <int>
# 1 A 1 1
# 2 B 1 2
# 3 B 2 4
# 4 A 2 3
# 5 A 3 5
# 6 A 4 10
# 7 B 3 7
Benchmarking, this is better than arrange_groups_1 but otherwise not great:
four_vars = microbenchmark(
arrange_groups_2 = df %>%
group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
arrange_groups_2(Srt_1, Srt_2, Srt_3, Srt_4),
workflow = df %>%
group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
mutate(across(everything(), ~.[order(Srt_1, Srt_2, Srt_3, Srt_4)])),
join = {
df %>%
group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
mutate(id = row_number()) %>%
left_join(
df %>%
arrange(Srt_1, Srt_2, Srt_3, Srt_4) %>%
group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
mutate(id = row_number()),
by = c("Grp_1", "Grp_2", "Grp_3", "Grp_4", "id")
)
},
times = 10
)
four_vars
# Unit: milliseconds
# expr min lq mean median uq max neval
# arrange_groups_2 356.7114 366.2305 393.7209 377.6245 389.1009 537.6800 10
# workflow 242.6982 245.5079 252.8441 252.3410 257.7656 267.5277 10
# join 366.6957 400.1438 438.5274 443.0696 477.5481 505.2293 10
Related
I'm trying to get multiple summary statistics in R, grouped by Team. I used code like the below, but the output is not what I want.
Please point me in a better direction. Thanks!
set.seed(77)
data <- data.frame(Team =sample(c("A","B"),30, replace=TRUE),
gender=sample(c("female","male"),30, replace=TRUE),
Age =sample(c(0:100),30, replace=T))
dat <- data %>%
group_by(Team, gender) %>%
dplyr::summarize_all(list(my_mean = mean,
my_sum = sum,
my_sd = sd)) %>%
as.data.frame()
df <- data %>%
group_by(Team) %>%
summarize(total = n(),
mean = mean(Age),
Max_Age = max(Age),
Min_Age = min(Age),
sd = sd(Age),
)
I want to get something like this picture.
You can create a dataframe of summary statistics for Age per Team (age_summary in the example below) and another with the count of Team members per gender and Team (gender_summary below), and then merge them into one dataframe (say summary_df).
library(tidyverse)
set.seed(77)
data <- data.frame(
Team = sample(c("A", "B"), 30, replace = TRUE),
gender = sample(c("female", "male"), 30, replace = TRUE),
Age = sample(c(0:100), 30, replace = T)
)
age_summary <- data %>%
group_by(Team) %>%
summarize(
mean = mean(Age),
Max = max(Age),
Min = min(Age),
sd = sd(Age)
) %>%
column_to_rownames("Team") %>%
t() %>%
as_tibble(
rownames = "age_summary"
)
gender_summary <- data %>%
group_by(Team) %>%
count(gender) %>%
ungroup() %>%
pivot_wider(names_from = Team, values_from = n)
summary_df <- full_join(
age_summary,
gender_summary
) %>%
mutate(
"item" = if_else(
is.na(gender),
"Age",
"Sex"
)
) %>%
unite("summary", c(age_summary, gender), na.rm = TRUE, remove = FALSE) %>%
relocate(item, .before = 1) %>%
select(-c(age_summary, gender))
# # A tibble: 6 × 4
# item summary A B
# <chr> <chr> <dbl> <dbl>
# 1 Age mean 45.6 57.8
# 2 Age Max 92 82
# 3 Age Min 5 14
# 4 Age sd 30.1 22.1
# 5 Sex female 8 9
# 6 Sex male 7 6
I have a dataframe with 40 rows and ~40000 columns. The 40 rows are split into group "A" and group "B" (20 each). For each column, I would like to apply a statistical test (wilcox.test()) comparing the two groups. I started using a for loop to run through the 40000 columns but it was very slow.
Minimal Reproducible Example (MRE):
library(tidyverse)
set.seed(123)
metrics <- paste("metric_", 1:40000, sep = "")
patient_IDs <- paste("patientID_", 1:40, sep = "")
m <- matrix(sample(1:20, 1600000, replace = TRUE), ncol = 40000, nrow = 40,
dimnames=list(patient_IDs, metrics))
test_data <- as.data.frame(m)
test_data$group <- c(rep("A", 20), rep("B", 20))
# Collate list of metrics to analyse ("check") for significance
list_to_check <- colnames(test_data)[1:40000]
Original 'loop' method (this is what I want to vectorise):
# Create a variable to store the results
results_A_vs_B <- c()
# Loop through the "list_to_check" and,
# for each 'value', compare group A with group B
# and load the results into the "results_A_vs_B" variable
for (i in list_to_check) {
outcome <- wilcox.test(test_data[test_data$group == "A", ][[i]],
test_data[test_data$group == "B", ][[i]],
exact = FALSE)
if (!is.nan(outcome$p.value) && outcome$p.value <= 0.05) {
results_A_vs_B[i] <- paste(outcome$p.value, sep = "\t")
}
}
# Format the results into a dataframe
summarised_results_A_vs_B <- as.data.frame(results_A_vs_B) %>%
rownames_to_column(var = "A vs B") %>%
rename("Wilcox Test P-value" = "results_A_vs_B")
Benchmarking the answers so far:
# Ronak Shah's "Map" approach
Map_func <- function(dataset, list_to_check) {
tmp <- split(dataset[list_to_check], dataset$group)
stack(Map(function(x, y) wilcox.test(x, y, exact = FALSE)$p.value, tmp[[1]], tmp[[2]]))
}
# @Onyambu's data.table method
dt_func <- function(dataset, list_to_check) {
melt(setDT(dataset), measure.vars = list_to_check)[, dcast(.SD, rowid(group) + variable ~ group)][, wilcox.test(A, B, exact = FALSE)$p.value, variable]
}
# @Park's dplyr method (with some minor tweaks)
dplyr_func <- function(dataset, list_to_check){
dataset %>%
summarise(across(all_of(list_to_check),
~ wilcox.test(.x ~ group, exact = FALSE)$p.value)) %>%
pivot_longer(cols = everything(),
names_to = "Metrics",
values_to = "Wilcox Test P-value")
}
library(microbenchmark)
res_map <- microbenchmark(Map_func(test_data, list_to_check), times = 10)
res_dplyr <- microbenchmark(dplyr_func(test_data, list_to_check), times = 2)
library(data.table)
res_dt <- microbenchmark(dt_func(test_data, list_to_check), times = 10)
autoplot(rbind(res_map, res_dt, res_dplyr))
# Excluding dplyr
autoplot(rbind(res_map, res_dt))
Running the code on a server took a couple of seconds longer but the difference between Map and data.table was more pronounced (laptop = 4 cores, server = 8 cores):
autoplot(rbind(res_map, res_dt))
Here is another option -
Map_approach <- function(dataset, list_to_check) {
tmp <- split(dataset[list_to_check], dataset$group)
stack(Map(function(x, y) wilcox.test(x, y)$p.value, tmp[[1]], tmp[[2]]))
}
Map_approach(data_subset, list_to_check)
# values ind
#1 5.359791e-05 value_1
#2 5.499685e-08 value_2
#3 1.503951e-06 value_3
#4 6.179352e-08 value_4
#5 5.885650e-08 value_5
Testing it on a larger sample, Map is slightly faster than the for loop.
n <- 1e6
data_subset <- data.frame(patient_ID = 1:n,
group = c(rep("A", n/2),
rep("B", n/2)),
value_1 = c(sample(1:10, n/2, replace = TRUE),
sample(5:15, n/2, replace = TRUE)),
value_2 = c(sample(1:5, n/2, replace = TRUE),
sample(15:n/2, n/2, replace = TRUE)),
value_3 = c(sample(1:12, n/2, replace = TRUE),
sample(8:17, n/2, replace = TRUE)),
value_4 = c(sample(5:10, n/2, replace = TRUE),
sample(15:25, n/2, replace = TRUE)),
value_5 = c(sample(20:40, n/2, replace = TRUE),
sample(10:15, n/2, replace = TRUE)))
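Note that wilcox_loop() and list_to_check are not defined in this answer; presumably they are the OP's for loop wrapped into a function and the value_* column names. A rough reconstruction (my assumption, not the answerer's code):
# Assumed helpers for the benchmark below (reconstruction only):
list_to_check <- paste0("value_", 1:5)
wilcox_loop <- function(dataset, list_to_check) {
  results <- c()
  for (i in list_to_check) {
    results[i] <- wilcox.test(dataset[dataset$group == "A", ][[i]],
                              dataset[dataset$group == "B", ][[i]],
                              exact = FALSE)$p.value
  }
  results
}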
microbenchmark::microbenchmark(loop = wilcox_loop(data_subset, list_to_check),
Map = Map_approach(data_subset, list_to_check))
#Unit: seconds
# expr min lq mean median uq max neval cld
# loop 5.254573 5.631162 5.788624 5.734480 5.920424 6.756319 100 b
# Map 4.710790 5.084783 5.201711 5.160722 5.309048 5.721540 100 a
You may want to try this code; it's slightly faster on my computer.
wilcox_loop2 <- function(data_subset, list_to_check){
A = data_subset[data_subset$group == "A",]
B = data_subset[data_subset$group == "B",]
outcome <- sapply(list_to_check, function(x) wilcox.test(A[[x]],
B[[x]],
exact = FALSE)$p.value)
as.data.frame(outcome) %>%
rownames_to_column(var = "A vs B") %>%
rename("Wilcox Test P-value" = "outcome")
}
I'm not sure whether it's OK to split the data into A and B...
My timings are as follows:
microbenchmark::microbenchmark(origin = wilcox_loop(data_subset, list_to_check),
test = wilcox_loop2(data_subset, list_to_check))
Unit: milliseconds
expr min lq mean median uq max neval cld
origin 4.815601 5.006951 6.490757 5.385502 6.790752 21.5876 100 b
test 3.817801 4.116151 5.146963 4.330500 4.870651 15.8271 100 a
I am interested in the difference in the mean of some variable according to a binary covariate.
I am computing the confidence interval of this difference by bootstrapping:
library(tidyverse)
df = mtcars %>%
select(disp, vs) %>%
mutate(vs=factor(vs, labels=c("vshaped", "straight")))
by1="straight"
by2="vshaped"
R=1000
set.seed(1)
beffect = numeric(length=R)
for (i in 1:R) {
ib = sample(1:nrow(df), replace = TRUE)
xi = df$disp[ib]
byi = df$vs[ib]
beffect[i] = mean(xi[byi==by2], na.rm = TRUE) - mean(xi[byi==by1], na.rm = TRUE)
}
mean(beffect)
#> [1] 175.9203
sd(beffect)
#> [1] 29.3409
Created on 2021-06-13 by the reprex package (v2.0.0)
This works, but I find it quite unreadable, and I wonder about its efficiency, as for loops are often considered bad practice in R.
Being a heavy user of the tidyverse, I would like to rewrite this using this framework.
Is there a fast and readable way to do so?
PS: Here is the closest I could get, but it is far from being more readable and it is 250 times slower:
beffect2 = replicate(R, {
df %>%
slice_sample(prop=1, replace = TRUE) %>%
group_by(vs) %>%
summarise(m=mean(disp)) %>%
pivot_wider(names_from = "vs", values_from = "m") %>%
transmute(x=!!ensym(by2) - !!ensym(by1))
}, simplify = FALSE) %>%
map_dbl(identity)
EDIT: here are the benchmarks of all methods so far:
# with R=50 ***********
# microbenchmark::microbenchmark(f_dc(50), f_akrun(50), f_akrun_diff(50), f_akrun_bindout(50), f_cole(50), f_forloop(50), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_dc() 497.4559 524.9582 560.94690 553.6271 572.2261 656.4672 5
# f_akrun() 101.6295 108.5232 111.22400 110.7238 111.4105 123.8330 5
# f_akrun_diff() 270.0261 283.3257 308.92806 283.6411 314.7233 392.9241 5
# f_akrun_bindout() 21.8185 21.9725 76.68770 22.9811 30.2129 286.4535 5
# f_cole() 2.7685 3.1343 3.63484 3.2679 4.4346 4.5689 5
# f_forloop() 2.1136 2.1277 3.14156 3.4968 3.6740 4.2957 5
# with R=500 **********
# microbenchmark::microbenchmark(f_dc(500), f_akrun(500), f_akrun_diff(500), f_akrun_bindout(500), f_cole(500), f_forloop(500), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_dc() 4270.2451 4535.4618 4543.85930 4539.3032 4613.5823 4760.7041 5
# f_akrun() 936.3249 951.3230 970.27424 956.3674 992.3162 1015.0397 5
# f_akrun_diff() 2501.3871 2509.5429 2589.47288 2608.5254 2649.3819 2678.5271 5
# f_akrun_bindout() 108.3761 108.7238 113.26746 112.2521 118.4673 118.5180 5
# f_cole() 23.1283 23.4074 24.75386 23.9244 26.4594 26.8498 5
# f_forloop() 20.4243 21.1367 23.26222 21.2130 22.5616 30.9755 5
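The f_*() wrappers are not defined in the post; presumably each wraps one of the approaches in this thread as a function of R. For instance, f_forloop() would be something like (my reconstruction, mirroring the loop above):
# Presumed wrapper used in the benchmarks (reconstruction, not from the post):
f_forloop <- function(R) {
  beffect <- numeric(length = R)
  for (i in 1:R) {
    ib <- sample(1:nrow(df), replace = TRUE)
    xi <- df$disp[ib]
    byi <- df$vs[ib]
    beffect[i] <- mean(xi[byi == by2], na.rm = TRUE) - mean(xi[byi == by1], na.rm = TRUE)
  }
  beffect
}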
This may be overlooking the obvious, but the tidyverse equivalent of a for loop would involve something like purrr::map(). The simplest conversion would be to use purrr::map_dbl(1:R, ...) such as:
library(purrr)
## better for memory and performance to extract vectors ahead of loop
disp = df$disp
vs = df$vs
map_dbl(1:R,
~ {
ib = sample(nrow(df), replace = TRUE)
xi = disp[ib]
byi = vs[ib]
mean(xi[byi == by2], na.rm = TRUE) - mean(xi[byi == by1], na.rm = TRUE)
})
Also, since by is binary, you may be able to improve performance by translating this into Rcpp.
We could use map and avoid the multiple pivot_wider steps
library(purrr)
library(dplyr)
set.seed(1)
out <- map_dfr(seq_len(R), ~ {
ib <- sample(1:nrow(df), replace = TRUE)
df %>%
slice(ib) %>%
summarise(beffect = mean(disp[vs == by2], na.rm = TRUE) -
mean(disp[vs == by1], na.rm = TRUE))
})
Checking:
mean(out$beffect)
#[1] 175.9203
sd(out$beffect)
#[1] 29.3409
Or we may use diff instead of pivot_wider:
set.seed(1)
out2 <- replicate(R, df %>%
slice_sample(prop = 1, replace = TRUE) %>%
group_by(vs) %>%
summarise(m = mean(disp), .groups = 'drop') %>%
summarise(beffect = diff(m[2:1])), simplify = FALSE) %>%
bind_rows
Checking:
mean(out2$beffect)
#[1] 175.9203
Another option would be to draw the samples, bind them together with a group identifier, use that identifier to extract the column values, and then group by the identifier and 'vs' to get the mean:
set.seed(1)
out3 <- replicate(R, sample(seq_len(nrow(df)), replace = TRUE) %>%
as_tibble, simplify = FALSE) %>%
bind_rows(.id = 'grp') %>%
mutate(vs = df$vs[value], disp = df$disp[value]) %>%
group_by(grp, vs) %>%
summarise(beffect = mean(disp), .groups = 'drop_last') %>%
group_by(grp) %>%
summarise(beffect = diff(beffect[2:1]), .groups = 'drop')
Checking:
mean(out3$beffect)
#[1] 175.9203
Benchmarks
system.time({set.seed(1)
out3 <- replicate(R, sample(seq_len(nrow(df)), replace = TRUE) %>%
as_tibble, simplify = FALSE) %>%
bind_rows(.id = 'grp') %>%
mutate(vs = df$vs[value], disp = df$disp[value]) %>%
group_by(grp, vs) %>%
summarise(beffect = mean(disp), .groups = 'drop_last') %>%
group_by(grp) %>%
summarise(beffect = diff(beffect[2:1]), .groups = 'drop')})
# user system elapsed
# 0.202 0.007 0.208
Or with map
system.time({
set.seed(1)
out <- map_dfr(seq_len(R), ~ {
ib <- sample(1:nrow(df), replace = TRUE)
df %>%
slice(ib) %>%
summarise(beffect = mean(disp[vs == by2], na.rm = TRUE) -
mean(disp[vs == by1], na.rm = TRUE))
})
})
# user system elapsed
# 1.329 0.013 1.338
Or instead of pivot_wider, take the diff
system.time({set.seed(1)
out2 <- replicate(R, df %>%
slice_sample(prop = 1, replace = TRUE) %>%
group_by(vs) %>%
summarise(m = mean(disp), .groups = 'drop') %>%
summarise(beffect = diff(m[2:1])), simplify = FALSE) %>%
bind_rows
})
# user system elapsed
# 3.753 0.027 3.758
Or a similar approach in data.table
library(data.table)
system.time({
setDT(df)
set.seed(1)
out3 <- rbindlist(
replicate(R,
df[df[, .I[sample(seq_len(.N), replace = TRUE)]
]][, .(m = mean(disp)), vs][, .(beffect = m[2]- m[1])],
simplify = FALSE)
)
})
# user system elapsed
# 1.181 0.055 1.230
OP's method:
system.time({replicate(R, {
df %>%
slice_sample(prop=1, replace = TRUE) %>%
group_by(vs) %>%
summarise(m=mean(disp)) %>%
pivot_wider(names_from = "vs", values_from = "m") %>%
transmute(x=!!ensym(by2) - !!ensym(by1))
}, simplify = FALSE)})
user system elapsed
6.991 0.063 7.009
microbenchmark::microbenchmark(f_dc(), f_akrun1(), f_akrun2(), f_akrun3(), f_forloop(), times = 5)
Unit: milliseconds
expr min lq mean median uq max neval cld
f_dc() 6453.14052 6512.34196 6772.0079 6534.08171 6939.61358 7420.86152 5 d
f_akrun1() 1288.96812 1328.96075 1377.0833 1353.79346 1372.30852 1541.38573 5 b
f_akrun2() 3685.33619 3703.33018 3814.8367 3801.52657 3915.75432 3968.23609 5 c
f_akrun3() 178.30997 179.77604 194.0712 189.18425 205.37485 217.71095 5 a
f_forloop() 30.11329 33.37171 35.0534 36.80903 36.95909 38.01389 5 a
Consider the following: I create a df_tbl with 1,000,000 rows, large enough that computation time isn't trivial.
I copy the dataframe into Spark, and then perform computations both on the dataframe held in RAM and on the Spark dataframe.
Microbenchmark suggests that the computation with the Spark dataframe is faster, as would be expected; yet when I'm working interactively, the computation involving the Spark dataframe is noticeably slower to return a result.
I'm curious as to what is going on. Example code given below:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.2.0
# main --------------------------------------------------------------------
N <- 1000000
df <- data_frame(
CASENO = 1000001:(1000000 + N),
sex = sample(1:2, N, rep = TRUE),
group = sample(1:5, N, rep = TRUE),
x = abs(rnorm(N)),
y = rnorm(N),
z = rnorm(N)
)
spark_df <- sdf_copy_to(sc, df, "spark_df", memory = FALSE, overwrite = TRUE)
benchmark <- microbenchmark::microbenchmark(
df %>% group_by(sex, group) %>% summarise(sum_x = sum(x)) %>% mutate(prop = sum_x/sum(sum_x)),
spark_df %>% group_by(sex, group) %>% summarise(sum_x = sum(x)) %>% mutate(prop = sum_x/sum(sum_x))
)
summary(benchmark)
#> expr
#> 1 df %>% group_by(sex, group) %>% summarise(sum_x = sum(x)) %>% mutate(prop = sum_x/sum(sum_x))
#> 2 spark_df %>% group_by(sex, group) %>% summarise(sum_x = sum(x)) %>% mutate(prop = sum_x/sum(sum_x))
#> min lq mean median uq max neval
#> 1 36.92519 39.119954 43.993727 41.522914 45.885576 107.71227 100
#> 2 1.12158 1.279999 1.855679 1.423407 1.551012 20.22911 100
start1 <- proc.time()
df %>% group_by(sex, group) %>% summarise(sum_x = sum(x)) %>% mutate(prop = sum_x/sum(sum_x))
end1 <- proc.time() - start1
start2 <- proc.time()
spark_df %>% group_by(sex, group) %>% summarise(sum_x = sum(x)) %>% mutate(prop = sum_x/sum(sum_x))
end2 <- proc.time() - start2
end1
#> user system elapsed
#> 0.33 0.04 0.37
end2
#> user system elapsed
#> 0.18 0.00 7.51
Created on 2018-03-27 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
I am using dplyr's summarise() function. My data contain NAs, so I need to include na.rm = TRUE in each call. For example:
group <- rep(c('a', 'b'), 3)
value <- c(1:4, NA, NA)
df = data.frame(group, value)
library(dplyr)
group_by(df, group) %>% summarise(
mean = mean(value, na.rm=TRUE),
sd = sd(value, na.rm=TRUE),
min = min(value, na.rm=TRUE))
Is there a way to write the argument na.rm = TRUE only once, rather than in each call?
You should use summarise_at, which lets you compute multiple functions for the supplied columns and set arguments that are shared among them:
df %>% group_by(group) %>%
summarise_at("value",
funs(mean = mean, sd = sd, min = min),
na.rm = TRUE)
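As an aside, summarise_at() and funs() have since been superseded; with dplyr 1.0+ the same idea can be written with across(). A hedged equivalent (passing na.rm through across()'s ... works in dplyr 1.0 but is deprecated from 1.1 in favour of anonymous functions such as \(x) mean(x, na.rm = TRUE)):
# Equivalent with across() (dplyr >= 1.0); na.rm is still written only once.
df %>%
  group_by(group) %>%
  summarise(across(value,
                   list(mean = mean, sd = sd, min = min),
                   na.rm = TRUE,
                   .names = "{.fn}"))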
If you're planning to apply your functions to one column only, you can use filter(!is.na(value)) to drop the NA values of that variable only (i.e. NAs in other variables won't affect the process).
group <- rep(c('a', 'b'), 3)
value <- c(1:4, NA, NA)
df = data.frame(group, value)
library(dplyr)
group_by(df, group) %>%
filter(!is.na(value)) %>%
summarise(mean = mean(value),
sd = sd(value),
min = min(value))
# # A tibble: 2 x 4
# group mean sd min
# <fctr> <dbl> <dbl> <dbl>
# 1 a 2 1.414214 1
# 2 b 3 1.414214 2