Julia: Using groupby with DataFramesMeta and Lazy

I'm new to Julia, and can't quite figure out how to use the groupby function with Lazy/DataFramesMeta. It seems as if Lazy has a namespace conflict with DataFrames, but I'm not sure how to resolve it.
using DataFrames, DataFramesMeta, RDatasets
#works
iris = dataset("datasets", "iris")
iris = @linq iris |>
    groupby(:Species) |>
    transform(mean_sepal_length = mean(:SepalLength))
using Lazy
#doesn't work
iris2 = dataset("datasets", "iris")
iris2 = @> begin
    iris2
    @groupby(:Species)
    @transform(mean_sepal_length = mean(:SepalLength))
end
#doesn't work
iris2 = dataset("datasets", "iris")
iris2 = @> begin
    iris2
    @DataFrames.groupby(:Species)
    @transform(mean_sepal_length = mean(:SepalLength))
end
#this works
iris2 = dataset("datasets", "iris")
iris2 = @> begin
    iris2
    @transform(mean_sepal_length = mean(:SepalLength))
end

You have to pass a qualified function name to @> and remember that there is no @groupby macro (groupby is a function), e.g.:
julia> using DataFrames, DataFramesMeta, RDatasets
julia> iris = dataset("datasets", "iris");
julia> a = @linq iris |>
           groupby(:Species) |>
           transform(mean_sepal_length = mean(:SepalLength));
julia> using Lazy
WARNING: using Lazy.groupby in module Main conflicts with an existing identifier.
julia> b = @> begin
           iris
           DataFrames.groupby(:Species)
           @transform(mean_sepal_length = mean(:SepalLength))
       end;
julia> a == b
true
Actually, the only problem you would have is when you want to use @linq, as it does not accept qualified names:
julia> a = @linq iris |>
           DataFrames.groupby(:Species) |>
           transform(mean_sepal_length = mean(:SepalLength))
ERROR: MethodError: Cannot `convert` an object of type Expr to an object of type DataFramesMeta.SymbolParameter
This may have arisen from a call to the constructor DataFramesMeta.SymbolParameter(...),
since type constructors fall back to convert methods.
A workaround is to create a variable that references the function you want:
julia> gb = DataFrames.groupby
groupby (generic function with 4 methods)
julia> a = @linq iris |>
           gb(:Species) |>
           transform(mean_sepal_length = mean(:SepalLength))
goes through.


How to pass a dataframe as object and as string in a function

I would like to customize the pin_write function from the pins package.
The original works this way:
library(pins)
# create board:
board_versioned <- board_folder("your path", versioned = TRUE)
board_versioned %>%
  pin_write(iris, "iris")
# gives:
# Guessing `type = 'rds'`
# Creating new version '20221030T182552Z-f2bf1'
# Writing to pin 'iris'
Now I want to create a custom function:
library(pins)
my_pin_write <- function(board, df) {
  board %>%
    pin_write(df, deparse(substitute(df)))
}
my_pin_write(board_versioned, iris)
#gives:
# Guessing `type = 'rds'`
# Replacing version '20221030T182736Z-f2bf1' with '20221030T182750Z-f2bf1'
# Writing to pin 'df'
The problem is Writing to pin 'df'.
I would expect:
Writing to pin 'iris'
I can't figure out how to pass the dataframe's name as a string in this situation. Many thanks!
You are using a pipe call. In that case df is looked up within the environment created by the pipe, so substitute(df) simply returns df. You have two options: either do not use the pipe, i.e.
pin_write(board, df, deparse(substitute(df)))
so that substitute uses the function environment; or, if you use the pipe, call substitute outside of the pipe, e.g.
nm <- deparse(substitute(df))
board %>%
  pin_write(df, nm)
You could decide to use the rlang::enexpr function:
board %>%
  pin_write(df, deparse(rlang::enexpr(df)))
We could do
my_pin_write <- function(board, df) {
  board %>%
    pin_write(df, rlang::as_string(rlang::ensym(df)))
}
Testing:
> my_pin_write(board_versioned, iris)
Another option is to replace magrittr's pipe (%>%) with the R native pipe (|>), available since R 4.1.0. The native pipe is expanded at the syntax level rather than evaluated through a function call, so substitute(df) still resolves to the caller's argument.
library(pins)
board_versioned <- board_folder("your path", versioned = TRUE)
my_pin_write <- function(board, df) {
  board |>
    pin_write(df, deparse(substitute(df)))
}
my_pin_write(board_versioned, iris)
#> Guessing `type = 'rds'`
#> Creating new version '20221031T091813Z-911fb'
#> Writing to pin 'iris'
Created on 2022-10-31 with reprex v2.0.2

Why do Tidyeval quotes fail in lambdas?

Below is a simple example of how a quote is used to dynamically rename a tibble column.
quoteExample = function() {
  new_name = quo("new_name_value")
  tibble(old_name = list(1, 2, 3)) %>%
    rename(!!quo_name(new_name) := old_name)
}
quoteExample()
Result: tibble(new_name_value = list(1, 2, 3))
Below is the same simple example, except this time in a lambda.
{function ()
new_name = quo("new_name_value");
tibble(old_name=list(1,2,3)) %>%
rename( !! quo_name(new_name) := old_name)
} ()
Result: Error in is_quosure(quo) : object 'new_name' not found
Why do quotes fail in a lambda but not in a named function? Where does this difference come from? Am I doing something wrong?
EDIT: The example above has been solved by akrun, but below is another example that fails although the suggested solution has been applied:
df = tibble(data = list(tibble(old_name = c(1, 2, 3))))
df %>%
  mutate(data = map(data, (function(d) {
    new_name = quo("new_value")
    d %>% rename(!!quo_name(new_name) := old_name)
  })))
Result: Error in is_quosure(quo) : object 'new_name' not found
Is this failing because of another issue?
This is basically the same issue as the one here. The main cause is the !! operator forcing immediate evaluation of its argument, before the anonymous function environment is created. In your case, !!quo_name(new_name) attempts to find the definition of new_name relative to the expression as a whole (i.e., the entire mutate(...) expression). Since new_name is defined in the expression itself, you end up with a circular dependency that results in an "object not found" error.
Your three options are:
1) Pull your lambda out into a standalone function to ensure its environment is created first, thus having all variables in that environment properly initialized before the !! operator forces their evaluation:
f <- function(d) {
  new_name = sym("new_value")
  d %>% rename(!!new_name := old_name)
}
df %>% mutate(data = map(data, f))
2) Define new_name outside the expression that attempts to force its evaluation with !!
new_name = sym("new_value")
df %>%
  mutate(data = map(data, function(d) { d %>% rename(!!new_name := old_name) }))
3) Rewrite your expression such that it doesn't use the !! operator to evaluate variables that have not been initialized yet (new_name in this case):
df %>%
  mutate(data = map(data, function(d) {
    new_name = "new_value"
    do.call(partial(rename, d), set_names(syms("old_name"), new_name))
  }))
SIDE NOTE: You will notice that I replaced your quo() calls with sym(). The function quo() captures an expression together with its environment. Since the string literal "new_value" will always evaluate to the same value, there is no need to tag along its environment. In general, the proper verb for capturing column names as symbols is sym().
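For illustration, a minimal comparison of the two verbs (assuming rlang is attached):
library(rlang)
quo("new_value") # a quosure: the expression plus the environment it was created in
sym("new_value") # a bare symbol: just the name, with no environment attached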
If we make it self-contained with a () or with {} it should work
(function() {
  new_name = quo("new_name_value")
  tibble(old_name = list(1, 2, 3)) %>%
    rename(!!quo_name(new_name) := old_name)
})()
# A tibble: 3 x 1
# new_name_value
# <list>
#1 <dbl [1]>
#2 <dbl [1]>
#3 <dbl [1]>
If the anonymous function contains only a single expression, {} is not needed, but if it has more than one expression, we wrap the body in {}. According to ?body:
The bodies of all but the simplest are braced expressions, that is calls to {: see the ‘Examples’ section for how to create such a call.
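For illustration, a minimal sketch of both cases:
# single expression: braces are optional
(function(x) x + 1)(2)
# [1] 3
# multiple expressions: the body must be wrapped in {}
(function(x) {
  y <- x + 1
  y * 2
})(2)
# [1] 6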

enquo() inside a magrittr pipeline

I just would like to understand what's going wrong here.
In the first case (working), I assign the enquo()-ted argument to a variable, in the second case, I use the enquoted argument directly in my call to mutate.
library("dplyr")
df <- tibble(x = 1:5, y= 1:5, z = 1:5)
# works
myfun <- function(df, transformation) {
  my_transformation <- rlang::enquo(transformation)
  df %>%
    gather("key", "value", x, y, z) %>%
    mutate(value = UQ(my_transformation))
}
myfun(df, exp(value))
# does not work
myfun_2 <- function(df, transformation) {
  df %>%
    gather("key", "value", x, y, z) %>%
    mutate(value = UQ(rlang::enquo(transformation)))
}
myfun_2(df, exp(value))
#>Error in mutate_impl(.data, dots) : Column `value` is of unsupported type closure
Edit
Here are some more lines to think about :)
Wrapping the call in quo(), it looks as if the expression to evaluate is "built" correctly:
# looks as if the whole thing should be working
myfun_2_1 <- function(df, transformation) {
  quo(df %>%
        gather("key", "value", x, y, z) %>%
        mutate(value = UQ(rlang::enquo(transformation))))
}
myfun_2_1(df, exp(value))
If you pass this to eval_tidy, it works (it doesn't work without quo()):
# works
myfun_2_2 <- function(df, transformation) {
  eval_tidy(quo(df %>%
                  gather("key", "value", x, y, z) %>%
                  mutate(value = UQ(rlang::enquo(transformation)))))
}
myfun_2_2(df, exp(value))
If you don't use the pipe, it also works
# works
myfun_2_3 <- function(df, transformation) {
  mutate(gather(df, "key", "value", x, y, z), value = UQ(rlang::enquo(transformation)))
}
myfun_2_3(df, exp(value))
Regarding the error message, this is what one gets when one tries to pass types that are not supported by data.frames, e.g.
mutate(df, value = function(x) x)
# Error in mutate_impl(.data, dots) : Column value is of unsupported type closure
To me it looks as if the quosure in myfun_2 isn't evaluated by mutate, which is somehow interesting/non-intuitive behaviour. Do you think I should report this to the developers?
This limitation is solved in rlang 0.2.0.
Technically: the core of the issue was that magrittr evaluates its arguments in a child of the current environment; this is the environment that contains the . pronoun. As of 0.2.0, capture of arguments with enquo() and variants is lexically scoped, which means it looks up the stack of parent environments to find the argument to capture. This solves the magrittr problem.
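For illustration, with rlang >= 0.2.0 the direct capture in myfun_2 above should work as written; a minimal self-contained sketch (using !! in place of the older UQ()):
library(dplyr)
library(tidyr)
df <- tibble(x = 1:5, y = 1:5, z = 1:5)
myfun_2 <- function(df, transformation) {
  df %>%
    gather("key", "value", x, y, z) %>%
    mutate(value = !!rlang::enquo(transformation)) # capture now works inside the pipe
}
myfun_2(df, exp(value))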

R dplyr 0.7.2 - functional programming. Resolving dataframe name

I am writing a R function using dplyr 0.7.2 syntax to pass input and output data frame names and a column name to sort on. The following is the code I have.
# test data frame creation
lb <- data.frame(study = replicate(25, "ABC"),
                 subjid = c("x1", "x2", "x3", "x4", "x5"),
                 visit = c("SCREENING", "VISIT1", "VISIT2", "VISIT3", "EOT"),
                 visitn = c(-1, 1, 2, 3, 4),
                 param = c("ALB", "AST", "HGB", "HCT", "LDL"),
                 aval = replicate(5, sample(c(20:100), 1, rep = TRUE)))
# sort function - user provides input/output df names and the column name to sort on
sortdf <- function(ind, outd, col) {
  col <- enquo(col)
  outd <- ind %>% arrange(!!col)
  outd <<- outd # return dataframe to workspace
}
sortdf(lb,lb_sort, visitn)
The above code works, but the output df name is not getting resolved to lb_sort; the output df is named after the parameter itself (outd). Need some help!
Thanks,
Prasanna
You do not need to make use of <<- in this context. In effect, your function is a wrapper for arrange:
my_sort <- function(df, col) {
  col <- enquo(col)
  df %>%
    arrange(!!col)
}
my_sort(df = lb, col = visitn)
Then you could create your objects as usual:
my_sort(df = lb, col = visitn) -> sorted_stuff
Edit
As per request, forcing creation of a named object in the parent environment.
my_sort <- function(df, col, some_name) {
  col <- enquo(col)
  df %>%
    arrange(!!col) -> dta_a
  # Gather environment info
  e <- environment() # current environment
  p <- parent.env(e)
  # Create object in parent env.
  assign(x = some_name,
         value = dta_a,
         envir = p)
  # If desired, return another object
  # return(some_other_data)
}
my_sort(df = lb, col = visitn, some_name ="created_data")
Explanation
The e/p objects are used to gather information about the function's current and parent environments.
assign takes a string and creates a named object in the function's parent environment: the global environment, if called as in the example.
Remarks
This is odd behaviour, when called as shown:
>> ls()
[1] "lb" "my_sort"
>> my_sort(df = lb, col = visitn, some_name ="created_data")
>> ls()
[1] "created_data" "lb" "my_sort"
The function leaves a "created_data" object in the global environment. This is inconsistent with the expected behaviour, where the user would create objects explicitly:
my_sort(df = lb, col = visitn) -> created_data
and I wouldn't encourage using it. If the actual problem is concerned with returning multiple objects, a potentially better approach may involve packing all the results into a list and returning that one list, as sketched below:
list(result_1 = mtcars,
     result_2 = airquality)
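For instance, a minimal sketch (my_sort_list is a hypothetical name):
my_sort_list <- function(df, col) {
  col <- enquo(col)
  sorted <- df %>% arrange(!!col)
  # return everything in one named list instead of assigning into the caller
  list(sorted = sorted, n_rows = nrow(sorted))
}
res <- my_sort_list(lb, visitn)
res$sorted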

Error: cannot join on columns: index out of bounds [duplicate]

I am trying to perform an inner join of two tables using dplyr, and I think I'm getting tripped up by non-standard evaluation rules. When using the by = c("a" = "b") argument, everything works as expected when "a" and "b" are actual strings. Here's a toy example that works:
library(dplyr)
data(iris)
inner_join(iris, iris, by=c("Sepal.Length" = "Sepal.Width"))
But let's say I was putting inner_join in a function:
library(dplyr)
data(iris)
myfn <- function(xname, yname) {
  data(iris)
  inner_join(iris, iris, by = c(xname = yname))
}
myfn("Sepal.Length", "Sepal.Width")
This returns the following error:
Error: cannot join on columns 'xname' x 'Sepal.Width': index out of bounds
I suspect there is some fancy expression, deparsing, quoting, or unquoting that I could do to make this work, but I'm a bit murky on those details.
You can use
myfn <- function(xname, yname) {
  data(iris)
  inner_join(iris, iris, by = setNames(yname, xname))
}
The suggested syntax in the ?inner_join documentation,
by = c("a" = "b") # same as by = c(a = "b")
is slightly misleading, because not both of those values are proper character values: you're actually creating a named character vector. Dynamically setting the values to the left of the equals sign is different from setting those on the right; you can use setNames() to set the names of the vector dynamically.
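For illustration, setNames(yname, xname) builds exactly the named vector the join needs:
xname <- "Sepal.Length"
yname <- "Sepal.Width"
setNames(yname, xname)
#  Sepal.Length
# "Sepal.Width"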
I like MrFlick's answer and fber's addendum, but I prefer structure. To me, setNames feels like something at the end of a pipe, not an on-the-fly constructor. On another note, both setNames and structure enable the use of variables in the function call.
myfn <- function(xnames, ynames) {
  data(iris)
  inner_join(iris, iris, by = structure(names = xnames, .Data = ynames))
}
x <- "Sepal.Length"
myfn(x, "Sepal.Width")
A named vector argument would run into problems here:
myfn <- function(byvars) {
  data(iris)
  inner_join(iris, iris, by = byvars)
}
x <- "Sepal.Length"
myfn(c(x = "Sepal.Width"))
You could solve that, though, by using setNames or structure in the function call.
I know I'm late to the party, but how about:
myfn <- function(byvar) {
  data(iris)
  inner_join(iris, iris, by = byvar)
}
This way you can do what you want with:
myfn(c("Sepal.Length"="Sepal.Width"))
I faced a nearly identical challenge to @Peter's, but needed to pass multiple different sets of by = join parameters at one time. I chose to use the map() function from the tidyverse package purrr.
This is the subset of the tidyverse that I used.
library(magrittr)
library(dplyr)
library(rlang)
library(purrr)
First, I adapted myfn to use map() for the case posted by Peter. 42's comment and Felipe Gerard's answer made it clear that the by argument can take a named vector. map() requires a list over which to iterate.
myfn_2 <- function(xname, yname) {
  by_names <- list(setNames(nm = xname, yname))
  data(iris)
  # map() returns a single-element list; index into it to retrieve the dataframe.
  map(.x = by_names,
      .f = ~inner_join(x = iris,
                       y = iris,
                       by = .x)) %>%
    `[[`(1)
}
myfn_2("Sepal.Length", "Sepal.Width")
I found that I didn't need quo_name() / !! in building the function.
Then, I adapted the function to take a list of by parameters. For each by_i in by_grps, we could extend x and y to add named values on which to join.
by_grps <- list(by_1 = list(x = c("Sepal.Length"), y = c("Sepal.Width")),
                by_2 = list(x = c("Sepal.Width"), y = c("Petal.Width")))
myfn_3 <- function(by_grps_list, nm_dataset) {
  by_named_vectors_list <- lapply(by_grps_list,
                                  function(by_grp) setNames(object = by_grp$y,
                                                            nm = by_grp$x))
  map(.x = by_named_vectors_list,
      .f = ~inner_join(nm_dataset, nm_dataset, by = .x))
}
myfn_3(by_grps, iris)
