How to define a keyed (grouped) join in data.table?

I would like to overload the [.data.table operator so that grouped joins can be performed. I will use dplyr to illustrate what I mean by a grouped join. Take, for example, the function below, which wraps dplyr::inner_join:
inner_join_grp <- function(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"),
                           ...) {
  stopifnot(identical(group_vars(x), group_vars(y)))
  grp <- group_vars(x)
  by <- c(by, grp)
  return(inner_join(x, y, by, copy, suffix, ...))
}
All this function does is check that the input data frames x and y have the same groups and, if they do, add the grouping variables to the join key specified by by = ....
An application would be like this:
n_grp <- 5
n_grp2 <- 3
set.seed(0)
tmp_df <-
  data.frame(grp = rep(letters[1:n_grp2], each = n_grp),
             grp2 = rep(1:n_grp, times = n_grp2),
             x = runif(n_grp * n_grp2))
tmp_df_2 <-
  data.frame(grp = rep(letters[1:n_grp2], each = n_grp),
             grp2 = rep(1:n_grp, times = n_grp2),
             y = runif(n_grp * n_grp2))
Using my function:
tmp_df %>%
  group_by(grp) %>%
  inner_join_grp(tmp_df_2 %>%
                   group_by(grp),
                 by = "grp2")
This produces the expected fifteen-row table:
# A tibble: 15 x 4
# Groups: grp [?]
grp grp2 x y
<fct> <int> <dbl> <dbl>
1 a 1 0.897 0.770
2 a 2 0.266 0.498
3 a 3 0.372 0.718
4 a 4 0.573 0.992
5 a 5 0.908 0.380
...
whereas using dplyr::inner_join produces a 45-row table, since the join is only on grp2:
# A tibble: 45 x 5
# Groups: grp.x [?]
grp.x grp2 x grp.y y
<fct> <int> <dbl> <fct> <dbl>
1 a 1 0.897 a 0.770
2 a 1 0.897 b 0.777
3 a 1 0.897 c 0.267
...
My question is whether it is possible to achieve something similar using [ when the input data.tables are keyed. I know I can overload merge.data.table in the same way (see below for a rough example), but I would prefer to pass an additional argument to [ to achieve a keyed join.
merge_dt_keyed <-
  function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE,
           all.x = all, all.y = all, sort = TRUE,
           suffixes = c(".x", ".y"),
           allow.cartesian = getOption("datatable.allow.cartesian"),
           ...) {
    stopifnot(identical(key(x), key(y)))
    new_key <- key(x)
    by.x <- c(new_key, by, by.x)
    by.y <- c(new_key, by, by.y)
    # pass the user-supplied arguments through rather than re-hardcoding their defaults
    return(merge(x, y, by.x = by.x, by.y = by.y, all = all,
                 all.x = all.x, all.y = all.y, sort = sort,
                 suffixes = suffixes,
                 allow.cartesian = allow.cartesian,
                 ...))
  }
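For reference, a quick usage sketch (the keyed copies of the example data and the setkey calls are my own additions, not from the question):
library(data.table)
tmp_dt   <- as.data.table(tmp_df);   setkey(tmp_dt, grp)
tmp_dt_2 <- as.data.table(tmp_df_2); setkey(tmp_dt_2, grp)
# both tables are keyed on grp, so the key is folded into the join columns
merge_dt_keyed(tmp_dt, tmp_dt_2, by = "grp2")  # joins on c("grp", "grp2"): fifteen rows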

Something along these lines should work:
`[.data.table` = function(x, i, ...) {
  args = match.call()
  if ('on' %in% names(args)) {
    # evaluate the user-supplied `on`, then extend it with the
    # key columns that x and i have in common
    args[['on']] = union(eval(args[['on']], parent.frame()),
                         intersect(key(x), key(i)))
  }
  args[[1]] = data.table:::`[.data.table`
  # evaluate the modified call in the caller's frame
  eval(args, parent.frame())
}
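Using the keyed tables tmp_dt and tmp_dt_2 from the sketch above, the override should silently extend on = "grp2" to c("grp2", "grp"), so an inner join returns the same fifteen rows as inner_join_grp (nomatch = NULL gives inner-join semantics in current data.table; older versions use nomatch = 0):
tmp_dt[tmp_dt_2, on = "grp2", nomatch = NULL]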

Related

Different results of a full_join in arrow and dplyr

I get different results when using full_join on a tibble and on an arrow_table. Can somebody help me understand what is going on?
library(arrow)
library(dplyr)
xa1 <- arrow_table(x = 1L)
xa2 <- arrow_table(x = 2L)
x1 <- tibble(x = 1L)
x2 <- tibble(x = 2L)
full_join(xa1, xa2, on = c("x")) %>% collect() %>% compute()
# A tibble: 2 × 1
      x
  <int>
1     1
2    NA

full_join(x1, x2)
Joining, by = "x"
# A tibble: 2 × 1
      x
  <int>
1     1
2     2
There is no on argument in the dplyr *_join() functions. The usage according to ?dplyr::full_join is:
full_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL
)
on is a data.table join argument; here we need by. In the arrow call above, the unrecognized on argument is matched to ... instead of by, so the join columns were never actually set (a sketch of this pitfall follows after the source-code excerpt below).
library(arrow)
library(dplyr)
full_join(xa1, xa2, by = "x") %>%
collect() %>%
compute()
Output:
# A tibble: 2 × 1
x
<int>
1 1
2 2
By looking at the methods and source code:
> methods("full_join")
[1] full_join.arrow_dplyr_query* full_join.ArrowTabular* full_join.data.frame* full_join.Dataset* full_join.RecordBatchReader*
> getAnywhere(full_join.ArrowTabular)
function (x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"),
    ..., keep = FALSE)
{
    query <- do_join(x, y, by, copy, suffix, ..., keep = keep,
        join_type = "FULL_OUTER")
    if (!keep) {
        query$selected_columns <- post_join_projection(names(x),
            names(y), handle_join_by(by, x, y), suffix)
    }
    query
}
by is the argument that is actually passed on to the functions called inside.
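The silent failure comes from R's argument matching: an argument name that a function does not declare is simply absorbed by ... without complaint. A minimal sketch (f is a hypothetical stand-in for full_join):
f <- function(x, by = NULL, ...) by
f(1, on = "x")
#> NULL   # `on` vanished into `...`, so `by` kept its default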

How to combine dplyr group_by, summarise, across and multiple function outputs?

I have the following tibble:
tTest = tibble(Cells = rep(c("C1", "C2", "C3"), times = 3),
               Gene = rep(c("G1", "G2", "G3"), each = 3),
               Experiment_score = 1:9,
               Pattern1 = 1:9,
               Pattern2 = -(1:9),
               Pattern3 = 9:1) %>%
  group_by(Gene)
and I would like to correlate Experiment_score with each of the Pattern columns, for each Gene.
Looking at the tidyverse across page and examples, I thought this would work:
# `corList` is a simple wrapper around `cor.test` that returns exactly two values:
corList = function(x, y) {
  result = cor.test(x, y)
  return(list(stat = result$estimate, pval = result$p.value))
}
tTest %>% summarise(across(starts_with("Pattern"), ~ corList(Experiment_score, .x),
                           .names = "{.col}_corr_{.fn}"))
but the result was not the expected pair of stat and p-value columns per Pattern column.
I have found a workaround by melting the Pattern columns, and I will post it below for completeness. The challenge is that I have dozens of Pattern columns and millions of rows; if I melt the Pattern columns, I end up with half a billion rows, which seriously hampers my ability to work with the data.
EDIT:
My own imperfect solution:
# `corVect` is a simple wrapper around `cor.test` that returns exactly two values:
corVect = function(x, y) {
  result = cor.test(x, y)
  return(c(stat = result$estimate, pval = result$p.value))
}
tTest %>%
  pivot_longer(starts_with("Pattern"), names_to = "Pattern", values_to = "Strength") %>%
  group_by(Gene, Pattern) %>%
  summarise(CorrVal = corVect(Experiment_score, Strength)) %>%
  mutate(CorrType = c("corr", "corr_pval")) %>%
  # Reformat
  pivot_wider(id_cols = c(Gene, Pattern), names_from = CorrType, values_from = CorrVal)
To get the desired result in one step, wrap the function's return value in a tibble rather than a list, and set .unpack = TRUE in across() (available in dplyr 1.1.0 and later). Here using a conveniently-named corTibble function:
library(tidyverse)
tTest = tibble(
  Cells = rep(c("C1", "C2", "C3"), times = 3),
  Gene = rep(c("G1", "G2", "G3"), each = 3),
  Experiment_score = 1:9,
  Pattern1 = 1:9 + rnorm(9), # added some noise
  Pattern2 = -(1:9 + rnorm(9)),
  Pattern3 = 9:1 + rnorm(9)
) %>%
  group_by(Gene)

corTibble = function(x, y) {
  result = cor.test(x, y)
  return(tibble(stat = result$estimate, pval = result$p.value))
}
tTest %>% summarise(across(
  starts_with("Pattern"),
  ~ corTibble(Experiment_score, .x),
  .names = "{.col}_corr",
  .unpack = TRUE
))
#> # A tibble: 3 × 7
#> Gene Pattern1_corr_stat Pattern1_corr_pval Pattern2…¹ Patte…² Patte…³ Patte…⁴
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 G1 0.947 0.208 -0.991 0.0866 -1.00 0.0187
#> 2 G2 0.964 0.172 -0.872 0.325 -0.981 0.126
#> 3 G3 0.995 0.0668 -0.680 0.524 -0.409 0.732
#> # … with abbreviated variable names ¹​Pattern2_corr_stat, ²​Pattern2_corr_pval,
#> # ³​Pattern3_corr_stat, ⁴​Pattern3_corr_pval
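The reason this works: corTibble() returns a one-row tibble, so across() produces one packed data-frame column per Pattern column, and .unpack = TRUE then spreads each of those into separate stat and pval columns using across()'s default "{outer}_{inner}" naming.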

Calculate mean and standard deviation for subgroups

I want to calculate the mean and standard deviation for subgroups of every column in my dataset.
Membership of a subgroup is based on the values in the column of interest, and these subgroups are therefore specific to each column of interest.
# Example data
set.seed(1)
library(data.table)
df <- data.frame(baseline = runif(100),
                 `Week0_12` = runif(100),
                 `Week12_24` = runif(100))
So for column baseline, a row may be assigned to a different subgroup than for column Week0_12.
I can of course create these 'subgroup columns' manually for each column and then calculate the statistics for each column by column subgroup:
df$baseline_subgroup <- ifelse(df$baseline < 0.2, "subgroup_1", "subgroup_2")
df <- as.data.table(df)
df[, .(mean = mean(baseline), sd = sd(baseline)), by = baseline_subgroup]
Giving this output:
baseline_subgroup mean sd
1: subgroup_2 0.58059314 0.22670071
2: subgroup_1 0.09793105 0.05317809
Doing this for every column separately is too much repetition, especially given that I have many columns in my actual data:
df$Week0_12_subgroup <- ifelse(df$Week0_12 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week0_12), sd = sd(Week0_12)), by = Week0_12_subgroup]
df$Week12_24_subgroup <- ifelse(df$Week12_24 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week12_24), sd = sd(Week12_24)), by = Week12_24_subgroup]
What is a more elegant approach to do this?
Here's a tidyverse method that gives an easy-to-read and easy-to-plot output:
library(tidyverse)
set.seed(1)
df <- data.frame(baseline = runif(100),
                 `Week0_12` = runif(100),
                 `Week12_24` = runif(100))
df2 <- df %>%
  summarize(across(everything(), list(mean_subgroup1 = ~mean(.x[.x < 0.2]),
                                      sd_subgroup1 = ~sd(.x[.x < 0.2]),
                                      mean_subgroup2 = ~mean(.x[.x > 0.2]),
                                      sd_subgroup2 = ~sd(.x[.x > 0.2])))) %>%
  pivot_longer(everything(), names_pattern = '^(.*)_(.*)_(.*$)',
               names_to = c('time', 'measure', 'subgroup')) %>%
  pivot_wider(names_from = measure, values_from = value)
df2
#> # A tibble: 6 x 4
#> time subgroup mean sd
#> <chr> <chr> <dbl> <dbl>
#> 1 baseline subgroup1 0.0979 0.0532
#> 2 baseline subgroup2 0.581 0.227
#> 3 Week0_12 subgroup1 0.117 0.0558
#> 4 Week0_12 subgroup2 0.594 0.225
#> 5 Week12_24 subgroup1 0.121 0.0472
#> 6 Week12_24 subgroup2 0.545 0.239
ggplot(df2, aes(time, mean, group = subgroup)) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd, color = subgroup),
                width = 0.1) +
  geom_point() +
  theme_minimal(base_size = 16)
Created on 2022-07-14 by the reprex package (v2.0.1)
You could apply a subgroup function to each column and collect the per-column summaries in a list, i.e.:
library(data.table)

subgroup <- function(x) {
  # x is the column that we are interested in
  dt <- data.table(value = x,
                   current_subgroup = ifelse(x < 0.2, "subgroup_1", "subgroup_2"))
  dt[, .(mean = mean(value), sd = sd(value)), by = current_subgroup]
}

# lapply iterates over the columns of df; the result is a list of summary tables
summaries <- lapply(df, subgroup)
You can create a custom function and apply it using .SD, i.e.
library(data.table)
f1 <- function(x) {
  i_mean <- mean(x)
  i_sd <- sd(x)
  list(Avg = i_mean, standard_dev = i_sd)
}
setDT(df)[, unlist(lapply(.SD, f1), recursive = FALSE), by = baseline_subgroup][]
baseline_subgroup baseline.Avg baseline.standard_dev Week0.12.Avg Week0.12.standard_dev Week12.24.Avg Week12.24.standard_dev
1: subgroup_2 0.5950020 0.22556590 0.5332555 0.2651810 0.5467046 0.2912027
2: subgroup_1 0.1006693 0.04957005 0.5947161 0.2645519 0.5137543 0.3213723

Passing names of objects from ellipsis as strings to left_join

Background
I have a simple helper function that applies left_join to any number of passed tables in order to gather them and return one object.
Example
# Settings ----------------------------------------------------------------
library("tidyverse")
set.seed(123)
# Data --------------------------------------------------------------------
sample_one <-
  tibble(
    column_a = c(1, 2),
    column_b = runif(n = 2),
    column_other = runif(n = 2)
  )
sample_two <-
  tibble(
    column_a = c(1, 2),
    column_b = runif(n = 2),
    column_other = runif(n = 2)
  )
sample_three <-
  tibble(
    column_a = c(1, 2),
    column_b = runif(n = 2),
    column_other = runif(n = 2)
  )
# Function ----------------------------------------------------------------
left_join_on_column_a <- function(keep_var, ...) {
  keep_var <- enquo(keep_var)
  dots <- list(...)
  clean_dfs <- map(dots, select, !!keep_var, "column_a")
  reduce(.x = clean_dfs,
         .f = left_join,
         "column_a") %>%
    gather(key = "model_type", !!keep_var, -column_a)
}
# Test --------------------------------------------------------------------
left_join_on_column_a(keep_var = column_b, sample_one, sample_two, sample_three)
Problem
I would like to be able to programmatically modify the suffix argument of left_join:
suffix
If there are non-joined duplicate variables in x and y, these
suffixes will be added to the output to disambiguate them. Should be a
character vector of length 2.
Current results
# A tibble: 6 x 3
column_a model_type column_b
<dbl> <chr> <dbl>
1 1 column_b.x 0.288
2 2 column_b.x 0.788
3 1 column_b.y 0.940
4 2 column_b.y 0.0456
5 1 column_b 0.551
6 2 column_b 0.457
Desired results
# A tibble: 6 x 3
column_a model_type column_b
<dbl> <chr> <dbl>
1 1 sample_one 0.288
2 2 sample_one 0.788
3 1 sample_two 0.940
4 2 sample_two 0.0456
5 1 sample_three 0.551
6 2 sample_three 0.457
In the desired results, the model_type column reflects the name of the object passed via ....
Attempts
I was trying to capture the names of the objects passed via ..., but the arguments are not named, so names(list(...)) gives nothing useful:
left_join_on_column_a <- function(keep_var, ...) {
  keep_var <- enquo(keep_var)
  dots <- list(...)
  table_names <- names(dots)
  clean_dfs <- map(dots, select, !!keep_var, "column_a")
  reduce(.x = clean_dfs,
         .f = left_join,
         "column_a",
         table_names) %>%
    gather(key = "model_type", !!keep_var, -column_a)
}
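That said, the expressions passed to ... can be captured and converted to strings with rlang; a minimal sketch (capture_names is a hypothetical helper, mirroring the enexprs() line in the answer below):
capture_names <- function(...) unname(sapply(rlang::enexprs(...), rlang::as_label))
capture_names(sample_one, sample_two, sample_three)
#> [1] "sample_one" "sample_two" "sample_three"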
Maybe rename column_b so that you don't have to worry about the suffix:
left_join_on_column_a <- function(keep_var, common_var, ...) {
  nm = unname(sapply(rlang::enexprs(...), as.character))
  keep_var <- as.character(substitute(keep_var))
  common_var = as.character(substitute(common_var))
  foo = function(x, y) {
    x %>% select(!!common_var, !!y := !!keep_var)
  }
  reduce(.x = Map(foo, list(...), nm),
         .f = left_join,
         common_var) %>%
    gather("model_type", !!keep_var, -!!common_var)
}
left_join_on_column_a(column_b, column_a, sample_one, sample_two, sample_three)
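Assuming the renaming works as intended, this call should reproduce the desired six-row tibble shown above, with model_type carrying the object names instead of suffixed column names.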

R Quasiquotation & tidyeval for dynamic variable references in R in own functions

I'm trying to get my head around using quasiquotation from the tidyverse in R in my own functions. I've read this one here: Passing a list of arguments to a function with quasiquotation and the whole thing here: https://tidyeval.tidyverse.org/
But I still don't get it to work.
Assume I have the following data:
dat <- data.frame(time = runif(20),
                  group1 = rep(1:2, times = 10),
                  group2 = rep(1:2, each = 10),
                  group3 = rep(3:4, each = 10))
What I want to do now is to write a function that does the following:
take a data set
specify the variable that contains the time (note, in another data set this might be called "hours" or "qtime" or whatever)
specify by which groups I want to do operations/statistics on
So what I want the user to do is to use a function like:
test_function(data = dat, time_var = "time", group_vars = c("group1", "group3"))
Note, I might choose different grouping variables, or none, next time.
Let's say within the function I want to:
calculate certain statistics on the time variable, e.g. the quantiles. Note: I want to split this up by my grouping variables
Here's one of my latest tries:
test_function <- function(data, time_var = NULL, group_vars = NULL)
{
  # Note I initialize the variables with NULL, since e.g. the user might not
  # specify a grouping, and I want to check for that in my function at some point
  time_var <- enquo(time_var)
  group_vars <- enquos(group_vars)
  # Here I try to group by my grouping variables
  temp_data <- data %>%
    group_by_at(group_vars) %>%
    mutate(!!sym(time_var) := !!sym(time_var) / 60)
  # Here I'm calculating some stats
  time_stats <- temp_data %>%
    summarize_at(vars(!!time_var),
                 list(p0.1_time = ~quantile(., probs = 0.1, na.rm = T),
                      p0.2_time = ~quantile(., probs = 0.2, na.rm = T),
                      p0.3_time = ~quantile(., probs = 0.3, na.rm = T),
                      p0.4_time = ~quantile(., probs = 0.4, na.rm = T),
                      p0.5_time = ~quantile(., probs = 0.5, na.rm = T),
                      p0.6_time = ~quantile(., probs = 0.6, na.rm = T),
                      p0.7_time = ~quantile(., probs = 0.7, na.rm = T),
                      p0.8_time = ~quantile(., probs = 0.8, na.rm = T),
                      p0.9_time = ~quantile(., probs = 0.9, na.rm = T),
                      p0.95_time = ~quantile(., probs = 0.95, na.rm = T)))
}
What is wrong with my code? I specifically struggle with the !!, !!!, sym, enquo, and enquos pieces. Why does group_by_at not need !!, whereas my summarize and mutate do?
Make these changes:
use sym and syms rather than enquo and enquos
use !! and !!! respectively
create po as a list and then use unnest_wider to expand it into columns
quantile is already vectorized, so we don't need map
the mutate can be incorporated right into the quantile call, eliminating it
consolidate the pipelines into a single pipeline
use TRUE rather than T, since the latter can be masked by a variable of that name whereas no variable may be called TRUE
we can use plain group_by and summarize
there is no group3 in the sample data, so we used group2 instead
this does not make sense without time_var, so remove its default of NULL
This gives the following code:
test_function <- function(data, time_var, group_vars = NULL) {
  p <- c(1:9/10, 0.95)
  time_var <- sym(time_var)
  group_vars <- syms(group_vars)
  data %>%
    group_by(!!!group_vars) %>%
    summarize(po = list(quantile(!!time_var / 60, p, na.rm = TRUE))) %>%
    ungroup %>%
    unnest_wider(po)
}
test_function(data = dat, time_var = "time", group_vars = c("group1", "group2"))
giving:
# A tibble: 4 x 12
group1 group2 `10%` `20%` `30%` `40%` `50%` `60%` `70%` `80%` `90%` `95%`
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.00237 0.00432 0.00654 0.00903 0.0115 0.0120 0.0124 0.0133 0.0147 0.0154
2 1 2 0.00244 0.00251 0.00281 0.00335 0.00388 0.00410 0.00432 0.00493 0.00591 0.00640
3 2 1 0.00371 0.00381 0.00468 0.00632 0.00796 0.0101 0.0122 0.0136 0.0143 0.0147
4 2 2 0.00385 0.00538 0.00630 0.00660 0.00691 0.00725 0.00759 0.00907 0.0117 0.0130
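As an aside on the enquo/sym confusion in the question: enquo() captures the expression the user typed, while sym() turns a string you already have into a symbol. Since time_var and group_vars arrive here as character strings, sym()/syms() are the right capture tools; a minimal sketch:
library(rlang)
sym("time")                 # the symbol `time`, ready to be spliced with !!
syms(c("group1", "group2")) # a list of symbols, spliced with !!!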
