Different results of a full_join in arrow and dplyr - r

I get different results when using full_join on a tibble and on an arrow_table. Can somebody explain what is going on?
library(arrow)
library(dplyr)
xa1 <- arrow_table(x = 1L)
xa2 <- arrow_table(x = 2L)
x1 <- tibble(x = 1L)
x2 <- tibble(x = 2L)
full_join(xa1,xa2,on = c("x")) %>% collect() %>% compute()
# A tibble: 2 × 1
x
<int>
1 1
2 NA
full_join(x1,x2)
Joining, by = "x"
# A tibble: 2 × 1
x
<int>
1 1
2 2

There is no on argument in the dplyr::*_join functions. The usage according to ?dplyr::full_join is
full_join(
x,
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = NULL
)
on is a data.table join argument. With dplyr joins, including the arrow methods, we need by:
library(arrow)
library(dplyr)
full_join(xa1, xa2, by = "x") %>%
collect() %>%
compute()
-output
# A tibble: 2 × 1
x
<int>
1 1
2 2
Looking at the methods and the source code:
> methods("full_join")
[1] full_join.arrow_dplyr_query* full_join.ArrowTabular* full_join.data.frame* full_join.Dataset* full_join.RecordBatchReader*
> getAnywhere(full_join.ArrowTabular)
function (x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"),
..., keep = FALSE)
{
query <- do_join(x, y, by, copy, suffix, ..., keep = keep,
join_type = "FULL_OUTER")
if (!keep) {
query$selected_columns <- post_join_projection(names(x),
names(y), handle_join_by(by, x, y), suffix)
}
query
}
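Here by is the argument that gets passed on to the functions called inside (do_join, handle_join_by). on is not a named argument of this method, so it is only captured by ....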
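To make the by usage concrete, here is a minimal sketch with plain tibbles (the arrow tables are left out so that it runs with dplyr alone). The tables left and right and their columns id, key, v and w are made up for illustration and are not part of the question.

```r
library(dplyr)

x1 <- tibble(x = 1L)
x2 <- tibble(x = 2L)

# Name the join column explicitly with `by`
same_name <- full_join(x1, x2, by = "x")

# Hypothetical tables whose key columns have different names:
# a named vector maps the left-hand column to the right-hand one
left  <- tibble(id = 1:2, v = c("a", "b"))
right <- tibble(key = 2:3, w = c("B", "C"))
diff_name <- full_join(left, right, by = c("id" = "key"))
```

same_name has one row per distinct x, and diff_name keeps the left-hand column name (id) for the key.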

Related

create function to pass into dplyr::summarise

In my data preparation, I want to create a helper function for computations that are repeated inside summarise. The idea is to create a function like so:
my_func <-
function(criteria){
sum(case_when(eval(rlang::parse_expr(criteria)))*100, na.rm = TRUE)
}
So then, I can use that function to parse different criteria:
DT %>%
group_by(group_var) %>%
summarise(
# Indicator A:
ia = my_func(var_x %in% c(1,2,3)~1,TRUE ~ 0),
# Indicator B:
ft = my_func(var_x %in% c(4,5)~1,TRUE ~ 0)
)
But, with the above code, I got an error. I really appreciate any idea on how to make this work.
IMHO there is no reason to use rlang::parse_expr. Instead you could use ... like so:
library(dplyr)
my_func <- function(...) {
sum(case_when(...) * 100, na.rm = TRUE)
}
mtcars %>%
group_by(am) %>%
summarise(
ia = my_func(cyl %in% c(4, 6) ~ 1, TRUE ~ 0)
)
#> # A tibble: 2 × 2
#> am ia
#> <dbl> <dbl>
#> 1 0 700
#> 2 1 1100
EDIT: To pass a column to scale the result instead of the hard-coded 100, you could do:
my_func <- function(..., scale) {
sum(case_when(...) * {{ scale }}, na.rm = TRUE)
}
mtcars %>%
group_by(am) %>%
summarise(
ia = my_func(cyl %in% c(4, 6) ~ 1, TRUE ~ 0, scale = mpg)
)
#> # A tibble: 2 × 2
#> am ia
#> <dbl> <dbl>
#> 1 0 145.
#> 2 1 286.
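As a rough illustration of why the one-parameter version fails while the ... version works, the sketch below calls both outside of summarise. The names my_func_one_arg and my_func_dots are introduced here only to keep the two variants apart; mtcars$cyl is used directly so that no grouping is needed.

```r
library(dplyr)

# Original shape from the question: a single `criteria` parameter
my_func_one_arg <- function(criteria) {
  sum(case_when(eval(rlang::parse_expr(criteria))) * 100, na.rm = TRUE)
}

# Two formulas are supplied but only one parameter exists,
# so R stops while matching the arguments
res_old <- tryCatch(
  my_func_one_arg(mtcars$cyl %in% c(4, 6) ~ 1, TRUE ~ 0),
  error = function(e) conditionMessage(e)
)

# Forwarding with `...` hands both formulas to case_when() unchanged
my_func_dots <- function(...) {
  sum(case_when(...) * 100, na.rm = TRUE)
}
res_new <- my_func_dots(mtcars$cyl %in% c(4, 6) ~ 1, TRUE ~ 0)
```

res_old holds the error message from the failed call, and res_new is the count of 4- and 6-cylinder cars times 100.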

I want to write a custom function with tidyverse verbs/syntax that accepts the grouping parameters of my function as strings

I want to write a function that takes a data set, a grouping variable, and a value to filter on. I want to write it in such a way that I can afterwards apply map() to it and pass the grouping variables to map() as a character vector. However, I don't know how to make my custom function rating() accept the grouping variable as a string. This is what I have tried:
data = tibble(a = seq.int(1:10),
g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
na = NA,
stat=c(23,43,53,2,43,18,54,94,43,87))
rating = function(data, by, no){
data %>%
select(a, {{by}}, stat) %>%
group_by({{by}}) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
rating(data = data, by = g2, no = 5) #this works
And this is the way I want to use my function:
map(.x = c("g1", "g2"), .f = ~rating(data = data, by = .x, no = 1))
... but I get
Error: Must group by variables found in `.data`.
* Column `.x` is not found.
As we are passing character elements, it would be better to convert them to symbols with ensym() and evaluate with !!:
library(dplyr)
library(purrr)
rating <- function(data, by, no){
by <- rlang::ensym(by)
data %>%
select(a, !! by, stat) %>%
group_by(!!by) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
-testing
> map(.x = c("g1", "g2"), .f = ~rating(data = data, by = !!.x, no = 1))
[[1]]
# A tibble: 1 × 4
a g1 stat rank
<int> <chr> <dbl> <dbl>
1 1 blue 23 1
[[2]]
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 1 pink 23 1
It also works with unquoted input
> rating(data, by = g2, no = 5)
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 5 hotpink 43 3
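If the function only ever needs to accept strings, a different technique from the ensym() answer above is to select and group with all_of(), which takes a character vector directly. This is a sketch assuming dplyr 1.0 or later; rating_chr is a name made up here, and it does not accept unquoted column names the way the answer's version does.

```r
library(dplyr)
library(purrr)

data <- tibble(a = 1:10,
               g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
               g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
               stat = c(23, 43, 53, 2, 43, 18, 54, 94, 43, 87))

# `by` is expected to be a character vector of column names
rating_chr <- function(data, by, no) {
  data %>%
    select(a, all_of(by), stat) %>%
    group_by(across(all_of(by))) %>%
    mutate(rank = rank(stat)) %>%
    ungroup() %>%
    filter(a == no)
}

# No `!!` is needed at the call site
res <- map(c("g1", "g2"), ~ rating_chr(data, by = .x, no = 1))
```

The trade-off is at the call site: strings work without unquoting, but rating_chr(data, by = g2, no = 5) would no longer work.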

Passing names of objects from ellipsis as strings to left_join

Background
I have a simple helper function that applies left_join to any number of passed tables in order to gather them and return one object.
Example
# Settings ----------------------------------------------------------------
library("tidyverse")
set.seed(123)
# Data --------------------------------------------------------------------
sample_one <-
tibble(
column_a = c(1, 2),
column_b = runif(n = 2),
column_other = runif(n = 2)
)
sample_two <-
tibble(
column_a = c(1, 2),
column_b = runif(n = 2),
column_other = runif(n = 2)
)
sample_three <-
tibble(
column_a = c(1, 2),
column_b = runif(n = 2),
column_other = runif(n = 2)
)
# Function ----------------------------------------------------------------
left_join_on_column_a <- function(keep_var, ...) {
keep_var <- enquo(keep_var)
dots <- list(...)
clean_dfs <- map(dots, select, !!keep_var, "column_a")
reduce(.x = clean_dfs,
.f = left_join,
"column_a") %>%
gather(key = "model_type", !!keep_var, -column_a)
}
# Test --------------------------------------------------------------------
left_join_on_column_a(keep_var = column_b, sample_one, sample_two, sample_three)
Problem
I would like to be able to programmatically modify the suffix argument of left_join:
suffix
If there are non-joined duplicate variables in x and y, these
suffixes will be added to the output to disambiguate them. Should be a
character vector of length 2.
Current results
# A tibble: 6 x 3
column_a model_type column_b
<dbl> <chr> <dbl>
1 1 column_b.x 0.288
2 2 column_b.x 0.788
3 1 column_b.y 0.940
4 2 column_b.y 0.0456
5 1 column_b 0.551
6 2 column_b 0.457
Desired results
# A tibble: 6 x 3
column_a model_type column_b
<dbl> <chr> <dbl>
1 1 sample_one 0.288
2 2 sample_one 0.788
3 1 sample_two 0.940
4 2 sample_two 0.0456
5 1 sample_three 0.551
6 2 sample_three 0.457
The model_type column reflects the name of each object passed via ....
Attempts
I was trying to capture names of the objects passed within ... but it's not a named object so it doesn't make sense:
left_join_on_column_a <- function(keep_var, ...) {
keep_var <- enquo(keep_var)
dots <- list(...)
table_names <- names(dots)
clean_dfs <- map(dots, select, !!keep_var, "column_a")
reduce(.x = clean_dfs,
.f = left_join,
"column_a",
table_names) %>%
gather(key = "model_type", !!keep_var, -column_a)
}
Maybe rename column_b in each table so that you don't have to worry about the suffix:
left_join_on_column_a <- function(keep_var, common_var, ...) {
nm = unname(sapply(rlang::enexprs(...), as.character))
keep_var <- as.character(substitute(keep_var))
common_var = as.character(substitute(common_var))
foo = function(x, y) {
x %>% select(!!common_var, !!y := !!keep_var)
}
reduce(.x = Map(foo, list(...), nm),
.f = left_join,
common_var) %>%
gather("model_type", !!keep_var, -!!common_var)
}
left_join_on_column_a(column_b, column_a, sample_one, sample_two, sample_three)
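Since the desired result is in long format anyway, another possible route is to skip the join and stack the tables, using the captured object names as an id column. This is a sketch of that alternative, not the answer's method: stack_with_names is a hypothetical helper, the key column column_a is hard-coded, and the sample values are made up so the result is reproducible.

```r
library(dplyr)

sample_one   <- tibble(column_a = c(1, 2), column_b = c(0.1, 0.2), column_other = c(9, 9))
sample_two   <- tibble(column_a = c(1, 2), column_b = c(0.3, 0.4), column_other = c(9, 9))
sample_three <- tibble(column_a = c(1, 2), column_b = c(0.5, 0.6), column_other = c(9, 9))

stack_with_names <- function(keep_var, ...) {
  keep_var <- rlang::enquo(keep_var)
  # Turn the unevaluated arguments in `...` into their object names
  nms <- vapply(rlang::enexprs(...), rlang::as_label, character(1),
                USE.NAMES = FALSE)
  dfs <- list(...)
  names(dfs) <- nms
  # Keep only the key column and the requested column in each table
  dfs <- lapply(dfs, function(d) select(d, column_a, !!keep_var))
  # The list names become the values of the id column
  bind_rows(dfs, .id = "model_type")
}

res <- stack_with_names(column_b, sample_one, sample_two, sample_three)
```

The column order differs from the desired output (model_type comes first), which can be changed with select() if it matters.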

How to pass second parameter to function while using the map function of purrr package in R

Apologies for what might be a very simple question.
I am new to using the purrr package in R and I'm struggling with trying to pass a second parameter to a function.
library(dplyr)
library(purrr)
my_function <- function(x, y = 2) {
z = x + y
return(z)
}
my_df_2 <- my_df %>%
mutate(new_col = map_dbl(.x = old_col, .f = my_function))
This works and most often I don't need to change the value of y, but if I had to pass a different value for y (say y = 3) through the mutate & map combination, what is the syntax for it?
Thank you very much in advance!
Another idea is to use the formula (lambda) syntax, which lets you set y explicitly:
library(dplyr)
library(purrr)
# The function
my_function <- function(x, y = 2) {
z = x + y
return(z)
}
# Example data frame
my_df <- tibble(old_col = 1:5) # data_frame() is deprecated in favour of tibble()
# Apply the function
my_df_2 <- my_df %>%
mutate(new_col = map_dbl(old_col, ~my_function(.x, y = 3)))
my_df_2
# # A tibble: 5 x 2
# old_col new_col
# <int> <dbl>
# 1 1 4.
# 2 2 5.
# 3 3 6.
# 4 4 7.
# 5 5 8.
I think all you need to do is modify map_dbl like so:
library(dplyr)
library(purrr)
df <- data.frame(a = c(2, 3, 4, 5.5))
my_function <- function(x, y = 2) {
z = x + y
return(z)
}
df %>%
mutate(new_col = map_dbl(.x = a, y = 3, .f = my_function))
a new_col
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 5.5 8.5
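The two answers can be put side by side in one short sketch. It also adds a third variant that is not in the answers: because x + y is already vectorised, this particular function can be called on the whole column without map_dbl. The object names via_lambda, via_dots and via_vector are made up here.

```r
library(dplyr)
library(purrr)

my_function <- function(x, y = 2) {
  x + y
}

my_df <- tibble(old_col = 1:5)

# 1. Formula (lambda) syntax: y is set inside the lambda
via_lambda <- my_df %>%
  mutate(new_col = map_dbl(old_col, ~ my_function(.x, y = 3)))

# 2. Extra named arguments after .f are passed on to the function
via_dots <- my_df %>%
  mutate(new_col = map_dbl(old_col, my_function, y = 3))

# 3. No map at all, which works here only because `+` is vectorised
via_vector <- my_df %>%
  mutate(new_col = my_function(old_col, y = 3))
```

All three produce the same new_col. The lambda form tends to be the easiest to read when more than one extra argument is involved.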

How to define a keyed (grouped) join in data table?

I would like to overload the [.data.table operator so that grouped joins can be achieved. I will use dplyr to illustrate what I mean by a grouped join. Take for example the function below that wraps around dplyr::inner_join:
inner_join_grp <- function(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"),
...) {
stopifnot(identical(group_vars(x), group_vars(y)))
grp <- group_vars(x)
by <- c(by, grp)
return(inner_join(x, y, by, copy, suffix, ...))
}
All this function does is to check that input data.frames x and y have the same groups, and if they do, to add the grouping variables to the join key specified by by = ....
An application would be like this:
n_grp <- 5
n_grp2 <- 3
set.seed(0)
tmp_df <-
data.frame(grp = rep(letters[1:n_grp2], each = n_grp),
grp2 = rep(1:n_grp, times = n_grp2),
x = runif(n_grp * n_grp2))
tmp_df_2 <-
data.frame(grp = rep(letters[1:n_grp2], each = n_grp),
grp2 = rep(1:n_grp, times = n_grp2),
y = runif(n_grp * n_grp2))
Using my function:
tmp_df %>%
group_by(grp) %>%
inner_join_grp(tmp_df_2 %>%
group_by(grp),
by = "grp2")
Produces the expected fifteen row table:
# A tibble: 15 x 4
# Groups: grp [?]
grp grp2 x y
<fct> <int> <dbl> <dbl>
1 a 1 0.897 0.770
2 a 2 0.266 0.498
3 a 3 0.372 0.718
4 a 4 0.573 0.992
5 a 5 0.908 0.380
...
whereas using dplyr::inner_join produces a 45 row table since the join is only on grp2:
# A tibble: 45 x 5
# Groups: grp.x [?]
grp.x grp2 x grp.y y
<fct> <int> <dbl> <fct> <dbl>
1 a 1 0.897 a 0.770
2 a 1 0.897 b 0.777
3 a 1 0.897 c 0.267
...
My question is whether it is possible to achieve something similar using [, when the input data tables are keyed. I know I can overload merge.data.table (see below for a rough example) in the same way but I would prefer to pass an additional argument to [ to achieve a keyed join.
merge_dt_keyed <-
function (x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE,
all.x = all, all.y = all, sort = TRUE,
suffixes = c(".x", ".y"),
allow.cartesian = getOption("datatable.allow.cartesian"),
...){
stopifnot(identical(key(x), key(y)))
new_key <- key(x)
by.x <- c(new_key, by, by.x)
by.y <- c(new_key, by, by.y)
return(merge(x, y, by.x = by.x, by.y = by.y, all = all,
all.x = all.x, all.y = all.y, sort = sort,
suffixes = suffixes,
allow.cartesian = allow.cartesian,
...))
}
Something along these lines should work:
`[.data.table` = function(x, i, ...) {
args = match.call()
if ('on' %in% names(args)) {
args[['on']] = union(args[['on']], intersect(key(x), key(i)))
}
args[[1]] = data.table:::`[.data.table`
eval(args)
}
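A less invasive variant of the same idea is an ordinary helper function that adds the shared key to on, instead of replacing the [ method. The sketch below assumes data.table 1.12.0 or later (for nomatch = NULL); keyed_join, lhs and rhs are hypothetical names, and the data uses fixed values rather than runif so the result is easy to check.

```r
library(data.table)

# Inner join on `on` plus the key columns that both tables share
keyed_join <- function(lhs, rhs, on = NULL) {
  stopifnot(identical(key(lhs), key(rhs)))
  lhs[rhs, on = union(key(lhs), on), nomatch = NULL]
}

lhs <- data.table(grp = rep(letters[1:3], each = 5),
                  grp2 = rep(1:5, times = 3),
                  x = 1:15)
rhs <- data.table(grp = rep(letters[1:3], each = 5),
                  grp2 = rep(1:5, times = 3),
                  y = 101:115)
setkey(lhs, grp)
setkey(rhs, grp)

# Key-aware join: one row per (grp, grp2) pair
keyed <- keyed_join(lhs, rhs, on = "grp2")

# Plain join on grp2 only, for comparison: every grp2 value matches
# three rows, so allow.cartesian = TRUE is needed
plain <- lhs[rhs, on = "grp2", allow.cartesian = TRUE]
```

This keeps the standard [ behaviour untouched for every other call, at the cost of not being usable as lhs[rhs, ...] syntax.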
