Programming with dplyr::arrange in dplyr v.0.7

I am trying to get my head around the new implementations in dplyr with respect to programming and non-standard evaluation. As I understand it, the verb_ functions are replaced by calling enquo on the argument and then applying !! in the regular verb function. Translating select from old to new works fine; the following functions give similar results:
select_old <- function(x, ...) {
  vars <- as.character(match.call())[-(1:2)]
  x %>% select(vars)
}
select_new <- function(x, ...) {
  vars <- as.character(match.call())[-(1:2)]
  vars_enq <- enquo(vars)
  x %>% select(!!vars_enq)
}
However, when I try to use arrange in the new programming style I get an error:
arrange_old <- function(x, ...) {
  vars <- as.character(match.call())[-(1:2)]
  x %>% arrange_(vars)
}
arrange_new <- function(x, ...){
  vars <- as.character(match.call())[-(1:2)]
  vars_enq <- enquo(vars)
  x %>% arrange(!!vars_enq)
}
mtcars %>% arrange_new(cyl)
# Error in arrange_impl(.data, dots) :
# incorrect size (1) at position 1, expecting : 32
32 is obviously the number of rows of mtcars; the inner function of dplyr apparently expects a vector of this length. My questions are: why does the new programming style not translate to arrange, and how do I do this in the new style?

You are overthinking it. Use the appropriate function to deal with .... No need to use match.call at all (also not in the old versions, really).
arrange_new <- function(x, ...){
  dots <- quos(...)
  x %>% arrange(!!!dots)
}
Of course this function does the exact same as the normal arrange, but I guess you are just using this as an example.
You can write a select function in the same way.
The arrange_old should probably have looked something like:
arrange_old <- function(x, ...){
  dots <- lazyeval::lazy_dots(...)
  x %>% arrange_(.dots = dots)
}
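As for why the original arrange_new fails: vars is an ordinary character vector, so enquo(vars) appears to hand arrange the string "cyl" itself (a length-one value) rather than the column, which would explain the size error. If the column names really do arrive as strings, one option is to convert them to symbols first. A minimal sketch, assuming a hypothetical helper arrange_chr that takes a character vector of column names:

```r
library(dplyr)

# Hypothetical helper: `vars` is a character vector of column names.
arrange_chr <- function(x, vars) {
  # syms() turns each string into a symbol; !!! splices the symbols in
  # as separate arguments, as if arrange(x, cyl, mpg) had been typed.
  x %>% arrange(!!!rlang::syms(vars))
}

res <- arrange_chr(mtcars, c("cyl", "mpg"))
head(res)
```

This keeps the tidy-eval machinery inside the helper, so callers only deal with plain strings.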

You don't actually need rlang in this situation. This will work:
my_arrange <- function(x, ...) arrange(x, ...)
# test
DF <- data.frame(a = c(2, 2, 1, 1), b = 4:1)
DF %>% my_arrange(a, b)

Related

Error with tidy select when feeding column names into purrr::map for user function

I have a long function that uses a dataframe column name as an input and am trying to apply it to several different column names without a new line of code each time. I am having issues with tidyselect within the function called by map. I believe the issue is related to defusing, but I cannot figure it out. A toy example using mtcars data is below.
This works correctly with map:
library(tidyverse)
sum_dplyr <- function(df, x) {
  res <- df %>% summarise(mean = mean({{x}}, na.rm = TRUE))
  return(res)
}
sum_dplyr(mtcars, disp)
map(names(mtcars), ~ sum_dplyr(mtcars, mtcars[[.]])) # all columns -> works fine
While this gives the error "Must subset columns with a valid subscript vector" when feeding the function through map:
library(tidyverse)
sel_dplyr <- function(df, x) {
  res <- df %>% dplyr::select({{x}})
  return(res)
}
sel_dplyr(mtcars, disp) # ok
map(names(mtcars), ~ sel_dplyr(mtcars, mtcars[[.]])) # all columns -> error
What am I missing here ? Many thanks !
It may be better to rewrite the functions so that they accept both unquoted and quoted column names. With map, we are passing a character string, so instead of {{}} we can use ensym with !!:
sum_dplyr <- function(df, x) {
  x <- rlang::ensym(x)
  res <- df %>%
    summarise(mean = mean(!!x, na.rm = TRUE))
  return(res)
}
Similarly for sel_dplyr
sel_dplyr <- function(df, x) {
  x <- rlang::ensym(x)
  res <- df %>%
    dplyr::select(!! x)
  return(res)
}
and then test as
library(purrr)
library(dplyr)
map(names(mtcars), ~ sel_dplyr(mtcars, !!.x))
sel_dplyr(mtcars, carb)
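A note on why the two original calls behaved differently: mtcars[[.]] passes the column's values, not its name. mean() is happy with a vector of values, whereas select() expects a column reference, which is likely why only the second call failed. If the functions only ever need to accept strings, another option is to avoid unquoting altogether. A minimal sketch, assuming hypothetical string-only variants sum_chr and sel_chr, and a dplyr version that provides the .data pronoun and all_of():

```r
library(dplyr)
library(purrr)

# Hypothetical variants that take the column name as a string only.
sum_chr <- function(df, x) {
  # .data[[x]] looks up the column whose name is stored in x
  df %>% summarise(mean = mean(.data[[x]], na.rm = TRUE))
}

sel_chr <- function(df, x) {
  # all_of() selects the columns named in a character vector
  df %>% select(all_of(x))
}

means <- map(names(mtcars), ~ sum_chr(mtcars, .x))
cols <- map(names(mtcars), ~ sel_chr(mtcars, .x))
```

The trade-off is that these variants no longer accept bare column names such as disp.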

Applying a Function to a Data Frame : lapply vs traditional way

I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <-lapply(c(data_frame$x, data_frame$y),some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!
When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function to rows set MARGIN = 1; to apply it to columns set MARGIN = 2.
lapply, as the name suggests, is a list-apply. As a data.frame is a list of columns you can use it to compute over columns but within rectangular data, apply is often the easiest.
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: Functions written for a single pair of values often turn out to be perfectly vectorized. Probably the best way to call some_function is without any function of the apply family, as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)
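On the original error: expand.grid names its columns Var1 and Var2, so data_frame$x and data_frame$y are both NULL and lapply has nothing to iterate over. For a function that genuinely cannot be vectorized, base R's mapply walks several vectors in parallel. A minimal sketch, assuming a deliberately non-vectorized some_function made up for illustration:

```r
x <- seq(1, 10, 0.1)
y <- seq(1, 10, 0.1)
data_frame <- expand.grid(x, y)   # columns are named Var1 and Var2

# Deliberately not vectorized: if () needs a single TRUE/FALSE
some_function <- function(x, y) {
  if (x > y) x - y else x + y
}

# mapply() calls some_function once per pair (Var1[i], Var2[i])
data_frame$new_column <- mapply(some_function, data_frame$Var1, data_frame$Var2)
head(data_frame)
```

Unlike apply, mapply passes each column with its own type, so it also works when the columns are not all numeric.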

Dplyr indirection / pipe doesn't work inside a closure

I have a code which uses dplyr indirection:
library(dplyr)
createGenerator <- function(data, column)
{
  values <- data %>% pull({{column}})
  function(n)
  {
    values %>% sample(n)
  }
}
df <- data.frame(x = 1:10, y = 1:10)
df %>% createGenerator(x)(1)
It gives me an error
Error in pull(., { : object 'x' not found
However if I don't create a closure it works, like in code below
createGenerator <- function(data, column, n)
{
  values <- data %>% pull({{column}}) %>% sample(n)
}
But I need a possibility to create a closure. What am I missing in closure creation code?
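There is a problem with the pipes, though it appears to be the outer pipe in the call rather than the one inside the function. magrittr inserts the left-hand side as the first argument of the outermost call on the right, so df %>% createGenerator(x)(1) is evaluated as createGenerator(x)(df, 1). data is then bound to x, which does not exist outside the data frame, hence "object 'x' not found". Calling createGenerator(df, x) directly avoids this.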
No pipe (which I personally prefer, but I guess that's taste)
library(dplyr)
createGenerator <- function(data, column) {
  values <- pull(data, {{ column }})
  function(n) {
    sample(values, n)
  }
}
df <- data.frame(x = 1:10, y = 1:10)
createGenerator(df, x)(2)
#> [1] 4 5
or you create values within the enclosed function. Then the pipe works.
createGenerator <- function(data, column) {
  function(n) {
    values <- data %>% pull({{column}})
    values %>% sample(n)
  }
}
createGenerator(df, x)(2)
#> [1] 7 5
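Note that both versions above are called without an outer pipe, which may be what actually makes them work. The way magrittr treats a right-hand side of the form f(a)(b) can be checked in isolation. A minimal sketch, assuming a hypothetical function factory make_adder that stands in for createGenerator:

```r
library(magrittr)

# Hypothetical function factory, standing in for createGenerator()
make_adder <- function(a) {
  function(b) a + b
}

# Direct call: build the closure, then call it
direct <- make_adder(10)(5)

# Piped call: magrittr rewrites `1 %>% make_adder(10)(5)` to
# `make_adder(10)(1, 5)`, so the inner function receives two arguments
piped <- tryCatch(1 %>% make_adder(10)(5), error = function(e) e)
inherits(piped, "error")

# Building the closure first and piping into it avoids the rewrite
gen <- make_adder(10)
via_pipe <- 5 %>% gen()
```

In the original question the same rewrite would put df into the inner function and leave x as the data argument.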

Ways to add multiple columns to data frame using plyr/dplyr/purrr

I often need to mutate a data frame by adding several columns at once using a custom function, preferably with parallelization. Below are the ways I already know how to do this.
Setup
library(dplyr)
library(plyr)
library(purrr)
library(doMC)
registerDoMC(2)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
Suppose that I want two new columns, foocol = x + y and barcol = (x + y) * 100, but that these are actually complex calculations done in a custom function.
Method 1: Add columns separately using rowwise and mutate
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
df_out1 <- df %>% rowwise() %>% mutate(foocol = foo(x, y), barcol = bar(x, y))
This is not a good solution since it requires two function calls for each row and two "expensive" calculations of x + y. It's also not parallelized.
Method 2: Trick ddply into rowwise operation
df2 <- df
df2$id <- 1:nrow(df2)
df_out2 <- ddply(df2, .(id), function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(cbind(r, foocol, barcol))
}, .parallel = T)
Here I trick ddply into calling a function on each row by splitting on a unique id column I just created. It's clunky, though, and requires maintaining a useless column.
Method 3: splat
foobar <- function(x, y, ...) {
  foocol <- x + y
  barcol <- foocol * 100
  return(data.frame(x, y, ..., foocol, barcol))
}
df_out3 <- splat(foobar)(df)
I like this solution since you can reference the columns of df in the custom function (which can be anonymous if desired) without array comprehension. However, this method isn't parallelized.
Method 4: by_row
df_out4 <- df %>% by_row(function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(data.frame(foocol = foocol, barcol = barcol))
}, .collate = "cols")
The by_row function from purrr eliminates the need for the unique id column, but this operation isn't parallelized.
Method 5: pmap_df
df_out5 <- pmap_df(df, foobar)
# or equivalently...
df_out5 <- df %>% pmap_df(foobar)
This is the best option I've found. The pmap family of functions also accept anonymous functions to apply to the arguments. I believe pmap_df converts df to a list and back, though, so maybe there is a performance hit.
It's also a bit annoying that I need to reference all the columns I plan on using for calculation in the function definition function(x, y, ...) instead of just function(r) for the row object.
Am I missing any good or better options? Are there any concerns with the methods I described?
How about using data.table?
library(data.table)
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
dt <- as.data.table(df)
dt[, foocol:=foo(x,y)]
dt[, barcol:=bar(x,y)]
The data.table library is quite fast and has at least some potential for parallelization.
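The two separate assignments above still compute x + y twice, which was one of the concerns in the question. data.table can assign several columns from one call if the function returns a list. A minimal sketch, assuming a hypothetical foobar that is vectorized over x and y and does the expensive step only once:

```r
library(data.table)

set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
dt <- as.data.table(df)

# Hypothetical function: does the expensive work once and returns
# both results as a named list
foobar <- function(x, y) {
  s <- x + y
  list(foocol = s, barcol = s * 100)
}

# Assign both columns by reference from a single call
dt[, c("foocol", "barcol") := foobar(x, y)]
head(dt)
```

If the real calculation cannot be vectorized, this pattern would need a per-row wrapper and loses most of its speed advantage.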

Writing R function with uncertain numbers of variables, using for table()

I'm not very familiar with how R functions handle their arguments.
Here's the problem:
I want to build a function whose ... arguments are column names of the data frame, to be passed to table().
f <- function (data, ...){
  T <- with(data, table(...)) # ... variables input
  return(T)
}
How can I deal with the code?
Thanks a lot for answering!
The order of evaluation doesn't quite work right with with(), apparently. Here's an alternative that should work (using sample data from @DavidArenburg):
set.seed(1)
data1 <- data.frame(a = sample(5,5), b = sample(5,5))
f <- function (data, ...) {
  xx <- lapply(substitute(...()), eval, data, parent.frame())
  T <- do.call(table, xx)
  return(T)
}
f(data = data1, a,b)
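The substitute(...()) construct is compact but not very self-explanatory. The same idea can be spelled out with the more common eval(substitute(alist(...))) idiom, which also makes it easy to label the table's dimensions. A minimal sketch of that variant, not the answer's exact method; the intermediate names exprs, caller and cols are made up for illustration:

```r
set.seed(1)
data1 <- data.frame(a = sample(5, 5), b = sample(5, 5))

f <- function(data, ...) {
  # Capture the unevaluated arguments, e.g. list(quote(a), quote(b))
  exprs <- eval(substitute(alist(...)))
  # Environment of whoever called f, used for names not found in data
  caller <- parent.frame()
  # Evaluate each expression with the data frame's columns in scope
  cols <- lapply(exprs, eval, data, caller)
  # Label the table dimensions with the original expressions
  names(cols) <- vapply(exprs, deparse, character(1))
  do.call(table, cols)
}

res <- f(data1, a, b)
res
```

Because the arguments are evaluated against the data frame, expressions such as a > 2 should work as well as bare column names.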
It is often far easier to avoid non-standard evaluation and use character strings to reference the columns within a data.frame.
set.seed(1)
data1 <- data.frame(a = sample(5,5), b = sample(5,5))
f <- function (data, ...) {
  do.call(table, data[unlist(list(...))])
}
# the following calls to `f` return the same results
f(data = data1, 'a','b')
f(data = data1, c('a','b'))
a <- c('a','b')
f(data = data1, a)
