I'm doing some programming using dplyr, and am curious about how to pass an expression as an argument (specifically, as an element of MoreArgs) to mapply.
Consider a simple function F that subsets a data.frame based on some ids and a time_range, then outputs a summary statistic based on some other column x.
require(dplyr)
F <- function(ids, time_range, df, date_column, x) {
  date_column <- enquo(date_column)
  x <- enquo(x)
  df %>%
    filter(person_id %chin% ids) %>%
    filter(time_range[1] <= (!!date_column) & (!!date_column) <= time_range[2]) %>%
    summarise(newvar = sum(!!x))
}
We can make up some example data to which we can apply our function F.
person_ids <- lapply(1:2, function(i) sample(letters, size = 10))
time_ranges <- lapply(list(c("2014-01-01", "2014-12-31"),
                           c("2015-01-01", "2015-12-31")), as.Date)
require(data.table)
dt <- CJ(person_id = letters,
         date_col = seq.Date(from = as.Date('2014-01-01'), to = as.Date('2015-12-31'), by = '1 day'))
dt[, z := rnorm(nrow(dt))] # The variable we will later sum over, i.e. apply F to.
We can successfully apply our function to each of our inputs.
F(person_ids[[1]], time_ranges[[1]], dt, date_col, z)
F(person_ids[[2]], time_ranges[[2]], dt, date_col, z)
And so if I wanted, I could write a simple for-loop to solve my problem. But if we try to apply syntactic sugar and wrap everything within mapply, we get an error.
mapply(F, ids = person_ids, time_range = time_ranges, MoreArgs = list(df = dt, date_column = date_col, x = z))
# Error in mapply... object 'date_col' not found
In mapply, MoreArgs is provided as a list, and R evaluates the list's elements as soon as the list is built, which causes the error. As suggested by @Gregor, you can quote those MoreArgs that we don't want evaluated immediately, preventing the error and allowing the function to proceed. This can be done with base quote or dplyr quo:
mapply(F, person_ids, time_ranges, MoreArgs = list(dt, quote(date_col), quote(z)))
mapply(F, person_ids, time_ranges, MoreArgs = list(dt, quo(date_col), quo(z)))
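To see why quoting helps: list() evaluates its elements as soon as the list is built, whereas quote() stores an unevaluated symbol that F only resolves later, inside the data. A minimal illustration (assuming date_col exists only as a column of dt, not as a standalone object):
try(list(date_col))   # Error: object 'date_col' not found
list(quote(date_col)) # fine: the list holds the symbol date_col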
Another option is to use map2 from the purrr package, which is the tidyverse equivalent of mapply with two input vectors. tidyverse functions are set up to work with non-standard evaluation, which avoids the error you're getting with mapply without the need for quoting the arguments:
library(purrr)
map2(person_ids, time_ranges, F, dt, date_col, z)
[[1]]
    newvar
1 40.23419

[[2]]
    newvar
1 71.42327
More generally, you could use pmap, which iterates in parallel over any number of input vectors:
pmap(list(person_ids, time_ranges), F, dt, date_col, z)
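As a side note, pmap also accepts a named list whose names are matched to the function's formals, so the same call can be written more explicitly (a sketch using the objects defined above):
pmap(list(ids = person_ids, time_range = time_ranges), F, dt, date_col, z)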
I would like to compute in R something of the following kind:

$$\sum_{x=1}^{n} \sum_{y=0}^{x} f(y, x)$$

It is important that the summand could be any function f(y, x).
My approach so far is with nested for loops:
n <- 5
fun <- function(y,x){y^2 + sqrt(y*x)} # might be any function of y and x
sum_x <- c()
for (x in 1:n) {
  sum_y <- c()
  for (y in 0:x) {
    sum_y[y + 1] <- fun(y, x)
  }
  sum_x[x] <- sum(sum_y)
}
sum(sum_x) # 147.6317
However, I do not like this approach. It's pretty ugly and becomes very inconvenient if the lower and upper bounds need to be more flexible. I thought about using expand.grid and then applying fun to it with mapply, but couldn't figure out how to express the nested structure of the sums. Any suggestions on how to do this?
You can perform an outer product based on a function. The outer product evaluates the function at all combinations of the two input variables and places the results in a matrix; it takes the following form:
outer(<rows>, <cols>, FUN)
In your case specifically, the following suffices:
n <- 5
fun <- function(x, y) {ifelse(y > x, 0, y^2 + sqrt(x * y))}
sum(outer(1:n, 1:n, FUN = fun)) # 147.6317
Since the summand y^2 + sqrt(x * y) is 0 at y = 0, dropping the y = 0 terms from the grid changes nothing (by chance). Regardless, it's necessary in this case to build the bound into the function definition, since the inner sum over y depends on x.
You could use nested sapply calls (with fun as defined in the question), which apply fun only to the required terms, and then take the sum:
sum(unlist(sapply(seq_len(n), function(x) sapply(0:x, fun, x))))
#[1] 147.6317
We can also use outer (again with the question's fun) together with rowCumsums from matrixStats:
library(matrixStats)
sum(outer(seq_len(n), seq_len(n), FUN = fun) * rowCumsums(diag(n)))
#[1] 147.6317
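Here rowCumsums(diag(n)) builds a 0/1 mask with ones where the column index (x) is at least the row index (y), so the elementwise product zeroes out exactly the terms the nested sum skips.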
Or with crossing from tidyr
library(tidyr)
library(dplyr)
crossing(x = seq_len(n), y = seq_len(n)) %>%
  filter(y <= x) %>%
  transmute(out = fun(y, x)) %>%
  summarise(out = sum(out)) %>%
  pull(out)
#[1] 147.6317
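For completeness, the expand.grid + mapply idea from the question works the same way: enumerate the (x, y) pairs, keep those with y <= x, and sum fun (as defined in the question) over them. A base-R sketch:
g <- expand.grid(x = seq_len(n), y = 0:n)
g <- g[g$y <= g$x, ]       # keep only the terms the nested sum actually visits
sum(mapply(fun, g$y, g$x)) # 147.6317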
I am trying to get my head around the new implementations in dplyr with respect to programming and non-standard evaluation. The verb_ functions are replaced by enquo-ing the argument and then applying !! inside the regular verb function. Translating select from old to new works fine; the following functions give similar results:
select_old <- function(x, ...) {
  vars <- as.character(match.call())[-(1:2)]
  x %>% select(vars)
}
select_new <- function(x, ...) {
  vars <- as.character(match.call())[-(1:2)]
  vars_enq <- enquo(vars)
  x %>% select(!!vars_enq)
}
However when I try to use arrange in the new programming style I'll get an error:
arrange_old <- function(x, ...) {
  vars <- as.character(match.call())[-(1:2)]
  x %>% arrange_(vars)
}
arrange_new <- function(x, ...) {
  vars <- as.character(match.call())[-(1:2)]
  vars_enq <- enquo(vars)
  x %>% arrange(!!vars_enq)
}
mtcars %>% arrange_new(cyl)
# Error in arrange_impl(.data, dots) :
# incorrect size (1) at position 1, expecting : 32
32 is obviously the number of rows of mtcars; the inner function of dplyr apparently expects a vector of this length. My questions are: why does the new programming style not translate for arrange, and how do I do it in the new style?
You are overthinking it. Use the appropriate function, quos, to deal with .... There is no need to use match.call at all (nor was there in the old versions, really).
arrange_new <- function(x, ...) {
  dots <- quos(...)
  x %>% arrange(!!!dots)
}
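A quick usage check with the question's example:
mtcars %>% arrange_new(cyl)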
Of course this function does exactly the same thing as the normal arrange, but I guess you are just using it as an example.
You can write a select function in the same way.
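For instance, a sketch of the analogous select wrapper using the same quos()/!!! pattern:
select_new <- function(x, ...) {
  dots <- quos(...)
  x %>% select(!!!dots)
}
mtcars %>% select_new(cyl, mpg)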
The arrange_old should probably have looked something like:
arrange_old <- function(x, ...) {
  dots <- lazyeval::lazy_dots(...)
  x %>% arrange_(.dots = dots)
}
You don't actually need rlang in this situation. This will work:
my_arrange <- function(x, ...) arrange(x, ...)
# test
DF <- data.frame(a = c(2, 2, 1, 1), b = 4:1)
DF %>% my_arrange(a, b)
I often have a need to mutate a data frame through the addition of several columns at once using a custom function, preferably with parallelization. Below are the ways I already know how to do this.
Setup
library(dplyr)
library(plyr)
library(purrr)
library(doMC)
registerDoMC(2)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
Suppose that I want two new columns, foocol = x + y and barcol = (x + y) * 100, but that these are actually complex calculations done in a custom function.
Method 1: Add columns separately using rowwise and mutate
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
df_out1 <- df %>% rowwise() %>% mutate(foocol = foo(x, y), barcol = bar(x, y))
This is not a good solution since it requires two function calls for each row and two "expensive" calculations of x + y. It's also not parallelized.
Method 2: Trick ddply into rowwise operation
df2 <- df
df2$id <- 1:nrow(df2)
df_out2 <- ddply(df2, .(id), function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(cbind(r, foocol, barcol))
}, .parallel = T)
Here I trick ddply into calling a function on each row by splitting on a unique id column I just created. It's clunky, though, and requires maintaining a useless column.
Method 3: splat
foobar <- function(x, y, ...) {
  foocol <- x + y
  barcol <- foocol * 100
  return(data.frame(x, y, ..., foocol, barcol))
}
df_out3 <- splat(foobar)(df)
I like this solution since you can reference the columns of df directly in the custom function (which can be anonymous if desired) without explicit indexing. However, this method isn't parallelized.
Method 4: by_row
df_out4 <- df %>% by_row(function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(data.frame(foocol = foocol, barcol = barcol))
}, .collate = "cols")
The by_row function from purrr eliminates the need for the unique id column, but this operation isn't parallelized.
Method 5: pmap_df
df_out5 <- pmap_df(df, foobar)
# or equivalently...
df_out5 <- df %>% pmap_df(foobar)
This is the best option I've found. The pmap family of functions also accepts anonymous functions to apply to the arguments. I believe pmap_df converts df to a list and back, though, so there may be a performance hit.
It's also a bit annoying that I need to reference all the columns I plan on using for the calculation in the function signature, function(x, y, ...), instead of just function(r) for a row object.
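One way around that last annoyance, sketched here, is to accept the row through ... and rebuild it as a named list inside the function:
df_out6 <- pmap_df(df, function(...) {
  r <- list(...) # the current row as a named list
  foocol <- r$x + r$y
  barcol <- foocol * 100
  data.frame(foocol = foocol, barcol = barcol) # returns just the new columns
})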
Am I missing any good or better options? Are there any concerns with the methods I described?
How about using data.table?
library(data.table)
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
dt <- as.data.table(df)
dt[, foocol := foo(x, y)]
dt[, barcol := bar(x, y)]
The data.table library is quite fast and has at least some potential for parallelization.
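If the two columns share an expensive intermediate, a sketch using data.table's multi-column := computes x + y only once:
dt[, c("foocol", "barcol") := {
  s <- x + y # the shared intermediate, computed once
  list(s, s * 100)
}]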
I usually like to use lapply() instead of a for loop:
lx <- split(x, x$hr) # with the next step being lapply(lx, function(x) ...)
But now each element of lx includes the column hr, which is inefficient because that information is already in names(lx).
So now I must do:
lx <- lapply(lx, function(X) select(X, -hr))
(An alternative is:
HR <- x$hr
x2 <- select(x, -hr)
lx <- split(x2, HR)
)
The whole point of lapply() over a for loop is to be efficient, so these extra lines bother me. It seems like such a common use case, and in my experience R usually has something more efficient, or I'm missing something.
Can this be achieved in a single function call or one-liner?
EDIT: Specific Example
DF <- data.frame(A = 1:2, B = 2:3, C = 3:4)
DF <- split(DF, factor(DF$A)) # but each list element still contains the column A, which is
                              # redundant (because the names() of the list elements already
                              # equal A), so I have to write the following line if I want
                              # to be efficient, especially with large datasets
DF <- lapply(DF, function(x) select(x, -A)) # I hate always writing this line!
Remove the split column first:
split(DF[-1], DF[[1]])
or
split(subset(DF, select = -A), DF$A)
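Applied to the original DF from the example, either call returns list elements without the redundant column:
$`1`
  B C
1 2 3

$`2`
  B C
2 3 4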
I usually use the combination of colwise and tapply to calculate grouped values in a data frame. However, I unexpectedly found that the FUN parameter of tapply does not work correctly with colwise from plyr. Here is an example:
Data:
df <- data.frame(a = 1:10, b = rep(1:2, each = 5), c = 2:11)
Normal:
library(plyr)
colwise(tapply)(subset(df, select = c(a, c)), df$b, function(x){sum(x[x > 2])})
The above code is correct and works normally. But if I pass the function as a named FUN argument, it fails:
colwise(tapply)(subset(df, select = c(a, c)), df$b, FUN = function(x){sum(x[x > 2])})
Error is:
Error in FUN(X[[1L]], ...) :
  unused arguments (function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    ... # the message continues by printing the entire source of tapply
Could anyone explain the reason? Thank you in advance.
Well, the issue is that both lapply and tapply have an optional FUN argument. Note that colwise(tapply) is a function containing the following line:
out <- do.call("lapply", c(list(filtered, .fun, ...), dots))
Let's go to this line with our debugger by writing
ct <- colwise(tapply); trace(ct, quote(browser()), at = 6)
and then running
ct(subset(df, select = c(a, c)), df$b, FUN = function(x){sum(x[x > 2])})
Now let's print c(list(filtered, .fun, ...), dots). Notice that the first three (unnamed) arguments are the data frame, tapply, and df$b, with the FUN argument from our call coming last. However, that argument is named. Since this is a do.call on lapply, instead of becoming an optional argument passed on to tapply, it becomes the function that lapply itself applies! So what is happening is that you are effectively running:
lapply(subset(df, select = c(a, c)), function(x){sum(x[x > 2])}, tapply, df$b)
This, of course, makes no sense, and if you execute the above manually (still in your debugger), you will get the exact same error you are seeing. For a simple workaround, try:
tapply2 <- function(.FUN, ...) tapply(FUN = .FUN, ...)
colwise(tapply2)(subset(df, select = c(a, c)), df$b, .FUN = function(x){sum(x[x > 2])})
The plyr package should be checking ... for arguments named FUN (or anything else that can interfere with lapply's job), but it doesn't seem the author included this. You could submit a pull request to the plyr package that implements any of the following workarounds:
1. Define a local
.lapply <- function(`*X*`, `*FUN*`, ...) lapply(X = `*X*`, `*FUN*`, ...)
(minimizing interference further).
2. Check names(list(...)) within the colwise(tapply) function for X and FUN (this can introduce problems if the author intended to prevent evaluation of promises until the child call).
3. Call do.call("lapply", ...) explicitly with named X and FUN, so that you get the intended error, "formal argument 'FUN' matched by multiple actual arguments" (see the sketch below).