Tidyeval: pass list of columns as quosure to select()

I want to pass a bunch of columns to pmap() inside mutate(). Later, I want to select those same columns.
At the moment, I'm passing a list of column names to pmap() as a quosure, which works fine, although I have no idea whether this is the "right" way to do it. But I can't figure out how to use the same quosure/list for select().
I've got almost no experience with tidyeval, I've only got this far by playing around. I imagine there must be a way to use the same thing both for pmap() and select(), preferably without having to put each of my column names in quotation marks, but I haven't found it yet.
library(dplyr)
library(rlang)
library(purrr)
df <- tibble(a = 1:3,
b = 101:103) %>%
print
#> # A tibble: 3 x 2
#> a b
#> <int> <int>
#> 1 1 101
#> 2 2 102
#> 3 3 103
cols_quo <- quo(list(a, b))
df2 <- df %>%
mutate(outcome = !!cols_quo %>%
pmap_int(function(..., word) {
args <- list(...)
# just to be clear this isn't what I actually want to do inside pmap
return(args[[1]] + args[[2]])
})) %>%
print()
#> # A tibble: 3 x 3
#> a b outcome
#> <int> <int> <int>
#> 1 1 101 102
#> 2 2 102 104
#> 3 3 103 106
# I get why this doesn't work, but I don't know how to do something like this that does
df2 %>%
select(!!cols_quo)
#> Error in .f(.x[[i]], ...): object 'a' not found

This is a bit tricky because of the mix of semantics involved in this problem. pmap() takes a list and passes each element as its own argument to a function (it's kind of equivalent to !!! in that sense). Your quoting function thus needs to quote its arguments and somehow pass a list of columns to pmap().
Our quoting function can go one of two ways. Either quote (i.e., delay) the list creation, or create an actual list of quoted expressions right away:
quoting_fn1 <- function(...) {
exprs <- enquos(...)
# For illustration purposes, return the quoted inputs instead of
# doing something with them. Normally you'd call `mutate()` here:
exprs
}
quoting_fn2 <- function(...) {
expr <- quo(list(!!!enquos(...)))
expr
}
Since our first variant does nothing but return a list of quoted inputs, it's actually equivalent to quos():
quoting_fn1(a, b)
#> <list_of<quosure>>
#>
#> [[1]]
#> <quosure>
#> expr: ^a
#> env: global
#>
#> [[2]]
#> <quosure>
#> expr: ^b
#> env: global
The second version returns a quoted expression that instructs R to create a list with quoted inputs:
quoting_fn2(a, b)
#> <quosure>
#> expr: ^list(^a, ^b)
#> env: 0x7fdb69d9bd20
There is a subtle but important difference between the two. The first version creates an actual list object:
exprs <- quoting_fn1(a, b)
typeof(exprs)
#> [1] "list"
On the other hand, the second version does not return a list, it returns an expression for creating a list:
expr <- quoting_fn2(a, b)
typeof(expr)
#> [1] "language"
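To see the difference between the two objects outside of dplyr, here is a minimal sketch using rlang::eval_tidy(). The plain list `data` is an assumption standing in for the data frame; it is not part of the original example.

```r
library(rlang)

# A plain list stands in for the data frame columns (assumed example data).
data <- list(a = 1:3, b = 101:103)

exprs <- quos(a, b)            # an actual list of quosures
expr  <- quo(list(!!!exprs))   # a single quosure wrapping a call to list()

# Evaluating the delayed call yields the list of columns that pmap() needs.
cols_from_expr <- eval_tidy(expr, data)

# The list of quosures has to be evaluated element by element instead.
cols_from_exprs <- lapply(exprs, eval_tidy, data = data)
```

Both routes end up with the same two vectors; the delayed version just gets there in one evaluation step.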
Let's find out which version is more appropriate for interfacing with pmap(). But first we'll give a name to the pmapped function to make the code clearer and easier to experiment with:
myfunction <- function(..., word) {
args <- list(...)
# just to be clear this isn't what I actually want to do inside pmap
args[[1]] + args[[2]]
}
Understanding how tidy eval works is hard in part because we usually don't get to observe the unquoting step. We'll use rlang::qq_show() to reveal the result of unquoting expr (the delayed list) and exprs (the actual list) with !!:
rlang::qq_show(
mutate(df, outcome = pmap_int(!!expr, myfunction))
)
#> mutate(df, outcome = pmap_int(^list(^a, ^b), myfunction))
rlang::qq_show(
mutate(df, outcome = pmap_int(!!exprs, myfunction))
)
#> mutate(df, outcome = pmap_int(<S3: quosures>, myfunction))
When we unquote the delayed list, mutate() calls pmap_int() with list(a, b), evaluated in the data frame, which is exactly what we need:
mutate(df, outcome = pmap_int(!!expr, myfunction))
#> # A tibble: 3 x 3
#> a b outcome
#> <int> <int> <int>
#> 1 1 101 102
#> 2 2 102 104
#> 3 3 103 106
On the other hand, if we unquote an actual list of quoted expressions, we get an error:
mutate(df, outcome = pmap_int(!!exprs, myfunction))
#> Error in mutate_impl(.data, dots) :
#> Evaluation error: Element 1 is not a vector (language).
That's because the quoted expressions inside the list are not evaluated in the data frame. In fact, they are not evaluated at all. pmap() gets the quoted expressions as is, which it doesn't understand. Recall what qq_show() has shown us:
#> mutate(df, outcome = pmap_int(<S3: quosures>, myfunction))
Anything inside angle brackets is passed as is. This is a sign that we should somehow have used !!! instead, to inline each element of the list of quosures in the surrounding expression. Let's try it:
rlang::qq_show(
mutate(df, outcome = pmap_int(!!!exprs, myfunction))
)
#> mutate(df, outcome = pmap_int(^a, ^b, myfunction))
Hmm... Doesn't look right. We're supposed to pass a list to pmap_int(), and here it gets each quoted input as separate argument. Indeed we get a type error:
mutate(df, outcome = pmap_int(!!!exprs, myfunction))
#> Error in mutate_impl(.data, dots) :
#> Evaluation error: `.x` is not a list (integer).
That's easy to fix, just splice into a call to list():
rlang::qq_show(
mutate(df, outcome = pmap_int(list(!!!exprs), myfunction))
)
#> mutate(df, outcome = pmap_int(list(^a, ^b), myfunction))
And voilà!
mutate(df, outcome = pmap_int(list(!!!exprs), myfunction))
#> # A tibble: 3 x 3
#> a b outcome
#> <int> <int> <int>
#> 1 1 101 102
#> 2 2 102 104
#> 3 3 103 106
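Putting the pieces together, a quoting function could capture the columns once and reuse them for both pmap_int() and select(). This is only a sketch: the name sum_then_select() and the extra column z are made up for illustration.

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical helper: captures the columns once, then uses them twice.
sum_then_select <- function(data, ...) {
  cols <- enquos(...)
  data %>%
    # Splice the quosures into list() so pmap_int() receives a list of columns.
    mutate(outcome = pmap_int(list(!!!cols), function(...) {
      args <- list(...)
      args[[1]] + args[[2]]
    })) %>%
    # Splice the same quosures directly into select().
    select(!!!cols, outcome)
}

df <- tibble(a = 1:3, b = 101:103, z = 7:9)
res <- sum_then_select(df, a, b)
```

The column names are passed unquoted, and z is dropped because it was never captured.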

We can use quos() when there is more than one column, and splice the result with !!!
cols_quo <- quos(a, b)
df2 %>%
select(!!!cols_quo)
The object 'df2' can be created with
df %>%
mutate(output = list(!!! cols_quo) %>%
reduce(`+`))
If we want to use the quosure as in the OP's post
cols_quo <- quo(list(a, b))
df2 %>%
select(!!! as.list(quo_expr(cols_quo))[-1])
# A tibble: 3 x 2
# a b
# <int> <int>
#1 1 101
#2 2 102
#3 3 103
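Note that quo_expr() has since been deprecated in rlang in favour of quo_squash(). A sketch of the same idea with the newer name, assuming a small df2-like tibble built inline:

```r
library(dplyr)
library(rlang)

# Assumed stand-in for df2 from the question.
df2 <- tibble(a = 1:3, b = 101:103, outcome = a + b)

cols_quo <- quo(list(a, b))

# Turn the quoted call list(a, b) into a list and drop the function name,
# leaving the symbols a and b.
col_syms <- as.list(quo_squash(cols_quo))[-1]

res <- select(df2, !!!col_syms)
```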

Related

Access to the new column name from the function inside across

Is it possible to get the name of the new column from the across function?
For example
data.frame(a=1) %>%
mutate(across(c(b=a, c=a), function(i) if("new column name"=="b") a+1 else a+0.5))
Expected result:
#> a b c
#> 1 1 2 1.5
Created on 2021-12-09 by the reprex package (v2.0.0)
I attempted to use cur_column(), but it appears to return "a" in this case.
I apologise that this example is too simple to achieve the desired result in other ways, but my actual code is a large nested dataframe that is difficult to provide.
Interesting question. It seems that because you are defining b and c within the across call, they aren't available through cur_column().
data.frame(a=1) %>%
mutate(across(c(b=a, c=a), function(i) print(cur_column())))
#> [1] "a"
#> [1] "a"
It works fine if they are already defined in the data.frame. Using tibble() here so I can refer to a in the constructor.
tibble(a=1, b=a, c=a) %>%
mutate(across(-a, ~ if (cur_column() == "b") .x + 1 else .x + 0.5))
#> # A tibble: 1 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 1 2 1.5
And similarly works even if you are in the same mutate() call, just making sure to define b and c prior to across().
data.frame(a=1) %>%
mutate(across(c(b=a, c=a), ~.x),
across(-a, ~ if (cur_column() == "b") .x + 1 else .x + 0.5))
#> a b c
#> 1 1 2 1.5
I believe this happens because across() works on the right-hand side (a) and only assigns the values to the left-hand side (b, c) after the function has been applied. I'm not sure whether this is the intended behavior (although it does seem right), so I will open a GitHub issue, because I think it's an interesting example!
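A different way to get there, which sidesteps cur_column() entirely, is to give across() a named list of functions and let the `.names` argument name the outputs after the functions. This is a swapped-in technique rather than a fix for the behaviour above, and it only fits when the per-column logic can be written as separate functions:

```r
library(dplyr)

res <- data.frame(a = 1) %>%
  mutate(across(
    a,
    list(b = ~ .x + 1, c = ~ .x + 0.5),  # one function per new column
    .names = "{.fn}"                      # name outputs after the functions
  ))
```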

R: `Error in f(x): could not find function "f"` when trying to use column of functions as argument in a tibble

I'm experimenting with using functions in dataframes (tidyverse tibbles) in R and I ran into some difficulties. The following is a minimal (trivial) example of my problem.
Suppose I have a function that takes in three arguments: x and y are numbers, and f is a function. It performs f(x) + y and returns the output:
func_then_add = function(x, y, f) {
result = f(x) + y
return(result)
}
And I have some simple functions it might use as f:
squarer = function(x) {
result = x^2
return(result)
}
cuber = function(x) {
result = x^3
return(result)
}
Done on its own, func_then_add works as advertised:
> func_then_add(5, 2, squarer)
[1] 27
> func_then_add(6, 11, cuber)
[1] 227
But let's say I have a dataframe (tidyverse tibble) with two columns for the numeric arguments, and one column for which function I want:
library(tidyverse)
library(magrittr)
test_frame = tribble(
~arg_1, ~arg_2, ~func,
5, 2, squarer,
6, 11, cuber
)
> test_frame
# A tibble: 2 x 3
arg_1 arg_2 func
<dbl> <dbl> <list>
1 5 2 <fn>
2 6 11 <fn>
I then want to make another column result that is equal to func_then_add applied to those three columns. It should be 27 and 227 like before. But when I try this, I get an error:
> test_frame %>% mutate(result=func_then_add(.$arg_1, .$arg_2, .$func))
Error in f(x) : could not find function "f"
Why does this happen, and how do I get what I want properly? I confess that I'm new to "functional programming", so maybe I'm just making an obvious syntax error ...
Not the most elegant but we can do:
test_frame %>%
mutate(Res= map(seq_along(.$func), function(x)
func_then_add(.$arg_1, .$arg_2, .$func[[x]])))
EDIT: The above passes the entire arg_1 and arg_2 columns on every iteration, which isn't really what the OP wants. As suggested by @January, this is better written as:
Result <- test_frame %>%
mutate(Res= map(seq_along(.$func), function(x)
func_then_add(.$arg_1[x], .$arg_2[x], .$func[[x]])))
Result$Res
The above again is not very efficient since it returns a list. A better alternative (again as suggested by @January) is to use map_dbl, which returns a double vector instead of a list:
test_frame %>%
mutate(Res= map_dbl(seq_along(.$func), function(x)
func_then_add(.$arg_1[x], .$arg_2[x], .$func[[x]])))
# A tibble: 2 x 4
arg_1 arg_2 func Res
<dbl> <dbl> <list> <dbl>
1 5 2 <fn> 27
2 6 11 <fn> 227
This is because you should map instead of mutating. Mutate calls the function once, and supplies the whole columns as arguments.
The second problem is that test_frame$func[1] is not a function, but a list with one element. You can't have "function" columns, only list columns.
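The `[` versus `[[` distinction can be checked in plain R. A minimal sketch, with a bare list standing in for the func column:

```r
squarer <- function(x) x^2
cuber <- function(x) x^3

# A list column is just a list, so this stands in for test_frame$func.
funcs <- list(squarer, cuber)

single_bracket <- funcs[1]    # still a list, of length one
double_bracket <- funcs[[1]]  # the function itself

is_fn_single <- is.function(single_bracket)
is_fn_double <- is.function(double_bracket)
value <- double_bracket(5)
```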
Try this:
test_frame$result <- with(test_frame,
map_dbl(1:2, ~ func_then_add(arg_1[.], arg_2[.], func[[.]])))
Result:
# A tibble: 2 x 4
arg_1 arg_2 func result
<dbl> <dbl> <list> <dbl>
1 5 2 <fn> 27
2 6 11 <fn> 227
EDIT: a simpler solution using dplyr, mutate and rowwise:
test_frame %>% rowwise %>% mutate(res=func_then_add(arg_1, arg_2, func))
Quite frankly, I am slightly puzzled by this last one. Why func and not func[[1]]? func should be a list, and not function. mutate and rowwise are doing here something sinister, like automatically converting a list to a vector.
Edit 2: actually, this is written explicitly in the rowwise manual:
Its main impact is to allow you to work with list-variables in
‘summarise()’ and ‘mutate()’ without having to use ‘[[1]]’.
Final edit: I became so fixated on tidyverse recently that I did not think of the simplest option – using base R:
apply(test_frame, 1, function(x) func_then_add(x$arg_1, x$arg_2, x$func))
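Another option that stays inside mutate() is purrr::pmap_dbl(), which walks the three columns in parallel and hands one element of each to the function. A sketch, assuming the list passed to pmap_dbl() is named after the arguments of func_then_add (the tibble is built with tibble() here instead of tribble()):

```r
library(dplyr)
library(purrr)

func_then_add <- function(x, y, f) f(x) + y
squarer <- function(x) x^2
cuber <- function(x) x^3

test_frame <- tibble(
  arg_1 = c(5, 6),
  arg_2 = c(2, 11),
  func = list(squarer, cuber)
)

# Each row supplies one x, one y and one f; list elements are matched by name.
res <- test_frame %>%
  mutate(result = pmap_dbl(list(x = arg_1, y = arg_2, f = func), func_then_add))
```

Because pmap_dbl() extracts each element of the list column itself, there is no need for `[[` here.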

Renaming doesn't work for column names starting with two dots

I updated my tidyverse and my read_excel() function (from readxl) has also changed. Columns without titles are now called ..1, ..2 and so on, when they used to be called X__1, X__2.
I'm trying to rename() these columns starting with two dots, but I'm getting an error message.
Here's an example:
library(tidyverse)
df <- tibble(a = 1:3,
..1 = 4:6)
df <- df %>%
rename(b = ..1)
Throws the error:
Error in .f(.x[[i]], ...) :
..1 used in an incorrect context, no ... to look in
I get the same error if I use backticks around the name: rename(b = `..1`).
..1 is a reserved word in R. See help("reserved") and help("..1"). Try quoting it:
df %>% rename(b = "..1")
giving:
# A tibble: 3 x 2
a b
<int> <int>
1 1 4
2 2 5
3 3 6
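If you would rather avoid evaluating the name at all, plain base R works too, since names() only ever deals with strings. A sketch, assuming a small data frame whose second column has been given the name ..1 by hand:

```r
# Build a data frame and force the awkward name onto the second column.
df <- data.frame(a = 1:3, tmp = 4:6)
names(df)[2] <- "..1"

# Rename by string comparison; ..1 is never evaluated as a symbol.
names(df)[names(df) == "..1"] <- "b"
```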
The janitor package has a very handy function clean_names for tasks like this. In this case, it replaces any .. that come from readxl with x. I added another .. column to show how the replacement works.
library(tidyverse)
df <- tibble(a = 1:3,
..1 = 4:6,
..5 = 10:12)
df %>%
janitor::clean_names()
#> # A tibble: 3 x 3
#> a x1 x5
#> <int> <int> <int>
#> 1 1 4 10
#> 2 2 5 11
#> 3 3 6 12
It seems like the naming setup in readxl is a topic of debate: see this issue, among others on the best way to convert unusable names from Excel sheets. There's also a vignette on it. To be honest, the last couple times I've needed to mess with readxl names, I just passed the data frame to janitor.

tryCatch inside dplyr's mutate?

Is there any exception handling mechanism in dplyr's mutate()? What I mean is a way to catch exceptions and handle them.
Suppose I have a function that throws an error in some cases (in this example, when the input is negative). For simplicity I define the function myself, but in real life it would come from some R package. Suppose this function is vectorized:
# function throwing an error
my_func <- function(x){
if(x > 0) return(sqrt(x))
stop('x must be positive')
}
my_func_vect <- Vectorize(my_func)
Now, let's suppose I want to use this function inside mutate().
If this function is used inside a mutate(), it stops at the first error and no result is returned:
library(dplyr)
# dummy data
data <- data.frame(x = c(1, -1, 4, 9))
data %>% mutate(y = my_func_vect(x))
# Error in mutate_impl(.data, dots) : Evaluation error: x must be positive.
Is there a way to catch the error, and do something (e.g. return an NA) in this case, while getting results for the other elements?
The result I expect is what would be achieved using a loop with tryCatch(), i.e. something along the lines of:
y <- rep(NA_real_, length(data$x))
for(i in seq_along(data$x)) {
tryCatch({
y[i] <- my_func_vect(data$x[i])
}, error = function(err){})
}
y
# Result is: 1 NA 2 3
We can also make use of purrr's safely() or possibly() functions.
From the purrr help:
safely: wrapped function instead returns a list with components result and error. One value is always NULL.
quietly: wrapped function instead returns a list with components result, output, messages and warnings.
possibly: wrapped function uses a default value (otherwise) whenever an error occurs.
It doesn't change the fact that you have to apply the function to each row separately.
library(dplyr)
library(purrr)
# function throwing an error
my_func <- function(x){
if(x > 0) return(sqrt(x))
stop('x must be positive')
}
my_func_vect <- Vectorize(my_func)
# dummy data
data <- data.frame(x = c(1, -1, 4, 9))
With map:
data %>%
mutate(y = map_dbl(x, ~possibly(my_func_vect, otherwise = NA_real_)(.x)))
#> x y
#> 1 1 1
#> 2 -1 NA
#> 3 4 2
#> 4 9 3
Using rowwise():
data %>%
rowwise() %>%
mutate(y = possibly(my_func_vect, otherwise = NA_real_)(x))
#> Source: local data frame [4 x 2]
#> Groups: <by row>
#>
#> # A tibble: 4 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 -1 NA
#> 3 4 2
#> 4 9 3
The other functions are somewhat more difficult to use and apply in a 'data-frame environment', as they are better suited to working with lists and return lists themselves.
Created on 2018-05-15 by the reprex package (v0.2.0).
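For comparison, this is roughly what the extra work looks like with safely(). A sketch on a bare vector, assuming we want NA wherever an error was recorded:

```r
library(purrr)

my_func <- function(x) {
  if (x > 0) return(sqrt(x))
  stop("x must be positive")
}

# safely() returns a function that yields list(result = , error = ).
safe_func <- safely(my_func)
results <- map(c(1, -1, 4, 9), safe_func)

# Pull out the result, or NA when an error was captured.
y <- map_dbl(results, function(res) {
  if (is.null(res$error)) res$result else NA_real_
})
```

The error objects stay available in `results`, which is the main reason to prefer safely() over possibly().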
If you want to handle each error individually, maybe you shouldn't use the vectorized function. Instead use map from the purrr package, which is effectively the same as lapply here.
If you want NA values whenever an error occurs, write a small wrapper function that catches the error:
try_my_func <- function(x) {
tryCatch(my_func(x), error = function(err){NA})
}
Then use mutate with map
data %>% mutate(y = purrr::map(x, try_my_func))
x y
1 1 1
2 -1 NA
3 4 2
4 9 3
Or similarly, if you don't want to declare a new function.
data %>% mutate(y = purrr::map(x, ~ tryCatch(my_func(.), error = function(err){NA})))
And lastly, if you do want to use a vectorized function, you can skip the map function altogether. But personally I never use Vectorize, so I'd do it with map.
data %>% mutate(y = Vectorize(try_my_func)(x))
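Note that map() gives back a list column. If a plain numeric column is wanted, one variation (a sketch, assuming the handler returns NA_real_ so that every element is a double) is to switch to map_dbl():

```r
library(dplyr)
library(purrr)

my_func <- function(x) {
  if (x > 0) return(sqrt(x))
  stop("x must be positive")
}

# Return a double NA on error so map_dbl() gets the type it expects.
try_my_func <- function(x) {
  tryCatch(my_func(x), error = function(err) NA_real_)
}

data <- data.frame(x = c(1, -1, 4, 9))
res <- data %>% mutate(y = map_dbl(x, try_my_func))
```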

Mutating columns of a data frame based on a predicate function (dplyr::mutate_if)

I would like to use dplyr's mutate_if() function to convert list-columns to data-frame-columns, but run into a puzzling error when I try to do so. I am using dplyr 0.5.0, purrr 0.2.2, R 3.3.0.
The basic setup looks like this: I have a data frame d, some of whose columns are lists:
d <- dplyr::data_frame(
A = list(
list(list(x = "a", y = 1), list(x = "b", y = 2)),
list(list(x = "c", y = 3), list(x = "d", y = 4))
),
B = LETTERS[1:2]
)
I would like to convert the column of lists (in this case, d$A) to a column of data frames using the following function:
tblfy <- function(x) {
x %>%
purrr::transpose() %>%
purrr::simplify_all() %>%
dplyr::as_data_frame()
}
That is, I would like the list-column d$A to be replaced by the list lapply(d$A, tblfy), which is
[[1]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 a 1
2 b 2
[[2]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 c 3
2 d 4
Of course, in this simple case, I could just do a simple reassignment. The point, however, is that I would like to do this programmatically, ideally with dplyr, in a generally applicable way that could deal with any number of list-columns.
Here's where I stumble: When I try to convert the list-columns to data-frame-columns using the following application
d %>% dplyr::mutate_if(is.list, funs(tblfy))
I get an error message that I don't know how to interpret:
Error: Each variable must be named.
Problem variables: 1, 2
Why does mutate_if() fail? How can I properly apply it to get the desired result?
Remark
A commenter has pointed out that the function tblfy() should be vectorized. That is a reasonable suggestion. But — unless I have vectorized incorrectly — that does not seem to get at the root of the problem. Plugging in a vectorized version of tblfy(),
tblfy_vec <- Vectorize(tblfy)
into mutate_if() fails with the error
Error: wrong result size (4), expected 2 or 1
Update
After gaining some experience with purrr, I now find the following approach natural, if somewhat long-winded:
d %>%
map_if(is.list, ~ map(., ~ map_df(., identity))) %>%
as_data_frame()
This is more or less identical to @alistaire's solution, below, but uses map_if(), resp. map(), in place of mutate_if(), resp. Vectorize().
The original tblfy function errors out for me (even when its elements are chained directly), so let's rebuild it a bit, adding vectorization as well, which lets us avoid an otherwise-necessary prior rowwise() call:
tblfy <- Vectorize(function(x){x %>% purrr::map_df(identity) %>% list()})
Now we can use mutate_if nicely:
d %>% mutate_if(purrr::is_list, tblfy)
## Source: local data frame [2 x 2]
##
## A B
## <list> <chr>
## 1 <tbl_df [2,2]> A
## 2 <tbl_df [2,2]> B
...and if we unnest to see what's there,
d %>% mutate_if(purrr::is_list, tblfy) %>% tidyr::unnest()
## Source: local data frame [4 x 3]
##
## B x y
## <chr> <chr> <dbl>
## 1 A a 1
## 2 A b 2
## 3 B c 3
## 4 B d 4
A couple notes:
map_df(identity) seems to be more efficient at building a tibble than any of the alternative formulations. I know the identity call seems unnecessary, but most everything else breaks.
I'm not sure how widely useful tblfy will be, as it's somewhat dependent on the structure of the lists in the list column, which can vary enormously. If you have a lot with a similar structure, I suppose it's useful, though.
There may be a way to do this with pmap instead of Vectorize, but I can't get it to work with some cursory tries.
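With more recent dplyr versions (1.0 and later), the same idea can be sketched with across() and where() instead of mutate_if() and Vectorize(). The helper records_to_tibble() is made up for illustration and assumes every element of the list column is a list of records with the same fields:

```r
library(dplyr)
library(purrr)
library(tibble)

d <- tibble(
  A = list(
    list(list(x = "a", y = 1), list(x = "b", y = 2)),
    list(list(x = "c", y = 3), list(x = "d", y = 4))
  ),
  B = LETTERS[1:2]
)

# Hypothetical helper: one list of records in, one tibble out.
records_to_tibble <- function(records) {
  bind_rows(lapply(records, as_tibble))
}

# Apply the helper to each element of every list column.
d2 <- d %>%
  mutate(across(where(is.list), ~ map(.x, records_to_tibble)))
```

The explicit map() inside the lambda plays the role that Vectorize() plays above: the function is applied per element rather than to the whole column.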
In-place conversion without any copying:
library(data.table)
for (col in d) if (is.list(col)) lapply(col, setDF)
d
#Source: local data frame [2 x 2]
#
# A B
#1 <S3:data.frame> A
#2 <S3:data.frame> B
