How to evaluate empty quosure programmatically? - r

In my dataset, I have a few possible grouping variables a, b, c. How do I programmatically tell dplyr to not group by any variables?
For example:
granularity <- NA
if(isTRUE(granularity == 'all')){
# all group variables
group_variables <- quos(a, b, c)
}else if(isTRUE(granularity == 'no_c')){
# all except c
group_variables <- quos(a, b)
}else{
# no group variables
group_variables <- quo()
}
data_summary <- mydata %>%
group_by(!!! group_variables) %>%
summarise(
x_mean = mean(x)
)
This will run correctly if I set granularity to 'all' or 'no_c', but it fails when I assign group_variables to the empty quosure. Does anyone know how to make this work?
Edit: This question also applies to functions like select, so assume I wanted to run
data_select <- mydata %>%
select(!!! select_variables, d, e, f)
How do I set select_variables to sometimes be quos(a, b, c) or sometimes be empty?
Thanks!

Use group_variables <- NULL in that clause:
}else{
# no group variables
group_variables <- NULL
}
also note the massive warning:
Error in grouped_df_impl(data, unname(vars), drop) :
Column `<empty>` is unknown
In addition: Warning message:
Unquoting language objects with `!!!` is soft-deprecated as of rlang 0.3.0.
Please use `!!` instead.
# Bad:
dplyr::select(data, !!!enquo(x))
# Good:
dplyr::select(data, !!enquo(x)) # Unquote single quosure
dplyr::select(data, !!!enquos(x)) # Splice list of quosures
You might want to consider not using packages with unstable APIs.

Related

Function containing dataframe and variable using lapply

I have two dataframes and a function, which works when I use it on a single variable.
library(tidyverse)
iris1<-iris
iris2<-iris
iris_fn<-function(df,species_type){
df1<-df%>%
filter((Species==species_type))
return(df1)}
new_df<-iris_fn(df=iris1, species_type="setosa")
I want to pass a vector of variables to the function with the expected output being a list of dataframes (3), one filtered to each variable, for which I have been experimenting using lapply:
variables<-c("setosa","versicolor","virginica")
new_df<-lapply(df=iris1, species_type="setosa", FUN= iris_fn)
The error message is Error in is.vector(X) : argument "X" is missing, with no default which I dont understand because I have stated the variables of the function and what the name of the function is.
Can anyone suggest a solution to get the desired output? I essentially need a version of lapply or purrr function that will allow a dataframe and a vector as inputs.
lapply expects an argument called X as the main input. You could re-write it so that the function expects X instead of species_type e.g.
iris_fn <- function(df, X){
df1 <- df %>% filter((Species==X))
return(df1)
}
variables <- c("setosa", "versicolor", "virginica")
new_df <- lapply(X=variables, FUN=iris_fn, df=iris1)
EDIT:
Alternatively to avoid using X, you need the first argument of the function to match the lapply input e.g.
iris_fn <- function(species_type, df){
df1 <- df %>% filter((Species==species_type))
return(df1)
}
new_df <- lapply(variables, FUN=iris_fn, df=iris1)
Check out the split function for a convenient way to split a data.frame to a list e.g. split(iris, f=iris$Species)
From ?lapply : lapply(X, FUN, ...) , by naming all your arguments there's no X that could be passed to function as the first arg.
Try something like this:
library(dplyr)
iris1<-iris
# note the changes arg. order
iris_fn<-function(species_type, df){
df1<-df%>%
filter((Species==species_type))
return(df1)}
variables<-c("setosa","versicolor","virginica")
new_df_list <-lapply(variables, iris_fn, df=iris1 )
Or with just an anonymous function:
new_df_list <-lapply(variables, \(x) filter(iris1, Species == x))
As you already use Tidyverse, perhaps with purrr::map() instead:
library(purrr)
new_df_list <- map(variables, ~ filter(iris1, Species == .x))
Created on 2022-11-14 with reprex v2.0.2

Access result later in pipe

Access result later in pipe
I am trying to create functions which print the number of rows excluded in a dataset at each step in a pipe.
Something like this:
iris %>%
function_which_save_nrows_and_return_the_data() %>%
filter(exclude some rows) %>%
function_which_prints_difference_in_rows_before_after_exlusion_and_returns_data %>%
function_which_save_nrows_and_return_the_data() %>%
function_which_prints_difference_in_rows_before_after_exlusion_and_returns_data ...etc
These are the functions I have attempted:
n_before = function(x) {assign("rows", nrow(x), .GlobalEnv); return(x)}
n_excluded = function(x) {
print(rows - nrow(x))
return(x)
}
This successfully saves the object rows:
But if I add two more links, the object is NOT saved:
So how can I create and access the rows-object later the pipe?
This is due to R's lazy evaluation. It occurs even if pipes are not used. See code below. In that code the argument to n_excluded is filter(n_before(iris), Species != 'setosa') and at the point that rows is used in the print statement the argument has not been referenced from within n_excluded so the entire argument will not have been evaluated and so rows does not yet exist.
if (exists("rows")) rm(rows) # ensure rows does not exist
n_excluded(filter(n_before(iris), Species != 'setosa'))
## Error in h(simpleError(msg, call)) :
## error in evaluating the argument 'x' in selecting a method for function
## 'print': object 'rows' not found
To fix this
1) we can force x before the print statement.
n_excluded = function(x) {
force(x)
print(rows - nrow(x))
return(x)
}
2) Alternately, we can use the magrittr sequential pipe which guarantees that legs are run in order. magrittr makes it available but does not provide an operator for it but we can assign it to an operator like this.
`%s>%` <- magrittr::pipe_eager_lexical
iris %>%
n_before() %>%
filter(Species != 'setosa') %s>% # note use of %s>% on this line
n_excluded()
The magrittr developer has stated that he will add it as an operator if there is sufficient demand for it so you might want to add such request to magrittr issue #247 on github.
You can also use the extended capabilities of pipeR.
library(dplyr)
library(pipeR)
n_excluded = function(x) {
print(rows - nrow(x))
return(x)
}
p <- iris %>>%
(~rows=nrow(.)) %>>%
filter(Species != "setosa") %>>%
n_excluded()

enquo() inside a magrittr pipeline

I just would like to understand what's going wrong here.
In the first case (working), I assign the enquo()-ted argument to a variable, in the second case, I use the enquoted argument directly in my call to mutate.
library("dplyr")
df <- tibble(x = 1:5, y= 1:5, z = 1:5)
# works
myfun <- function(df, transformation) {
my_transformation <- rlang::enquo(transformation)
df %>%
gather("key","value", x,y,z) %>%
mutate(value = UQ(my_transformation))
}
myfun(df,exp(value))
# does not work
myfun_2 <- function(df, transformation) {
df %>%
gather("key","value", x,y,z) %>%
mutate(value = UQ(rlang::enquo(transformation)))
}
myfun_2(df,exp(value))
#>Error in mutate_impl(.data, dots) : Column `value` is of unsupported type closure
Edit
Here are some more lines to think about :)
Wrapping the call into quo() it looks as if the expression to evaluate is "built" correctly
# looks as if the whole thing should be working
myfun_2_1 <- function(df, transformation) {
quo(df %>%
gather("key","value", x,y,z) %>%
mutate(value = UQ(rlang::enquo(transformation))))
}
myfun_2_1(df,exp(value))
If you tell this to eval_tidy, it works (it doesn't work without quo())
# works
myfun_2_2 <- function(df, transformation) {
eval_tidy(quo(df %>%
gather("key","value", x,y,z) %>%
mutate(value = UQ(rlang::enquo(transformation)))))
}
myfun_2_2(df,exp(value))
If you don't use the pipe, it also works
# works
myfun_2_3 <- function(df, transformation) {
mutate(gather(df,"key","value", x,y,z), value = UQ(rlang::enquo(transformation)))
}
myfun_2_3(df,exp(value))
Regarding the error message, this is what one gets, when one tries to pass types that are not supported by data.frames, eg.
mutate(df, value = function(x) x)
# Error in mutate_impl(.data, dots) : Column value is of unsupported type closure
To me it looks as if the quosure in myfun_2 isn't evaluated by mutate, which is somehow interesting/non-intuitive behaviour. Do you think I should report this to the developers?
This limitation is solved in rlang 0.2.0.
Technically: The core of the issue was that magrittr evaluates its arguments in a child of the current environment. This is this environment that contains the . pronoun. As of 0.2.0, capture of arguments with enquo() and variants is now lexically scoped, which means it looks up the stack of parent environments to find the argument to capture. This solves the magrittr problem.

dplyr exclude columns using the dot argument

How can we write a function that let user drop multiple columns using the ... argument dplyr style?
E.g.
mydrop=function(x,...){function body}
mydrop(npk,N:K)
returns npk[,c("block","yield")].
Note that it is important that the ... argument is compatible with all the ?select_helpers functions.
Similar to #akrun, but allowing for the N:K , dplyr style column selection the OP requested for (...), as well as some error handling:
mydrop <- function(x,...){
try(
todrop <- x %>%
select(...) %>% names(.)
, silent = TRUE)
if(exists('todrop')){
x %>% select(setdiff(current_vars(), todrop))
}else x
}
Perhaps we can use
mydrop <- function(x,...){
nm <- list(...)
if(length(nm)>0) {
x %>%
select(-one_of(unlist(nm)))
} else x
}
mydrop(npk, "N", "K")
Using reproducible example
mydrop(mtcars, 'mpg', 'cyl')
mydrop(mtcars)
mydrop(mtcars, names(mtcars)[-1])
mydrop(mtcars, names(mtcars))

multidplyr : assign functions to cluster

(see working solution below)
I want to use multidplyr to parallelize a function :
calculs.R
f <- function(x){
return(x+1)
}
main.R
library(dplyr)
library(multidplyr)
source("calculs.R")
d <- data.frame(a=1:1000,b=sample(1:2,1000),replace=T)
result <- d %>%
partition(b) %>%
do(f(.)) %>%
collect()
I then get:
Initialising 3 core cluster.
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
2 nodes produced errors; first error: could not find function "f"
In addition: Warning message:
group_indices_.grouped_df ignores extra arguments
How can I assign sourced functions to each core?
==================
Here is the flawless script:
Must extract the value to update, and turn the result into a dataframe
calcul.R
f <- function(x){
return(data.frame(x$a+1))
}
Must set the clusters and assign the sourced functions
main.R
library(dplyr)
library(multidplyr)
source("calculs.R")
cl <- create_cluster(3)
set_default_cluster(cl)
cluster_copy(cl, f)
d <- data.frame(a=1:10,b=c(rep(1,5),rep(2,5)))
result <- d %>%
partition(b) %>%
do(f(.)) %>%
collect()
It looks like you initialized a cluster (though you don't show this part). You need to export variables/function from your global environment to each worker. Assuming you made your cluster as
cl <- create_cluster(3)
set_default_cluster(cl)
Can you try
cluster_copy(cl, f)
This will copy-and-export f to each worker (I think...)
Extra
You'll likely run into another problem which is that your function accepts x as an argument, to which you add 1
f <- function(x){
return(x+1)
}
Since you're passing a data frame to f, you are asking for data.frame+1, which doesn't make sense. You might want to change your function to something like
f <- function(x){
return(x$a+1)
}

Resources