Accessing the new column name from the function inside across()

Is it possible to get the name of the new column from inside the function passed to across()?
For example:
data.frame(a=1) %>%
mutate(across(c(b=a, c=a), function(i) if("new column name"=="b") a+1 else a+0.5))
Expected result:
#> a b c
#> 1 1 2 1.5
Created on 2021-12-09 by the reprex package (v2.0.0)
I attempted to use cur_column(), but its return value appears to be "a" in this case.
I apologise that this example is simple enough to solve in other ways, but my actual code involves a large nested data frame that is difficult to provide.

Interesting question. It seems that because you are defining b and c within the across call, they aren't available through cur_column().
data.frame(a=1) %>%
mutate(across(c(b=a, c=a), function(i) print(cur_column())))
#> [1] "a"
#> [1] "a"
It works fine if they are already defined in the data.frame. Using tibble() here so I can refer to a in the constructor.
tibble(a=1, b=a, c=a) %>%
mutate(across(-a, ~ if (cur_column() == "b") .x + 1 else .x + 0.5))
#> # A tibble: 1 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 1 2 1.5
And it works within the same mutate() call too, as long as b and c are defined before the across() that uses cur_column().
data.frame(a=1) %>%
mutate(across(c(b=a, c=a), ~.x),
across(-a, ~ if (cur_column() == "b") .x + 1 else .x + 0.5))
#> a b c
#> 1 1 2 1.5
I believe this happens because across() iterates over the right-hand side (a) and assigns the results to the left-hand-side names (b and c) only after the function inside across() has been applied. I'm not sure whether this is the expected behaviour (it does seem reasonable), so I'll open a GitHub issue, because I think it's an interesting example!
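Until that's resolved, one possible workaround (a sketch of my own, not from the issue; it assumes dplyr >= 1.0.0 and purrr) is to key the transformations by the new column names yourself, so the name is always at hand:
library(purrr)
# transformations of a, keyed by the *new* column names
new_cols <- list(
  b = function(x) x + 1,
  c = function(x) x + 0.5
)
data.frame(a = 1) %>%
  mutate(map_dfc(new_cols, ~ .x(a)))
#>   a b   c
#> 1 1 2 1.5
Here map_dfc() builds a small data frame whose columns take their names from the list, and mutate() unpacks the unnamed data-frame result into columns of the output.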

Related

Renaming doesn't work for column names starting with two dots

I updated my tidyverse, and my read_excel() function (from readxl) has also changed. Columns without titles are now called ..1, ..2 and so on, when they used to be called X__1, X__2.
I'm trying to rename() these columns starting with two dots, but I'm getting an error message.
Here's an example:
library(tidyverse)
df <- tibble(a = 1:3,
..1 = 4:6)
df <- df %>%
rename(b = ..1)
Throws the error:
Error in .f(.x[[i]], ...) :
..1 used in an incorrect context, no ... to look in
I get the same error if I use backticks around the name: rename(b = `..1`).
..1 is a reserved word in R. See help("reserved") and help("..1"). Try quoting it:
df %>% rename(b = "..1")
giving:
# A tibble: 3 x 2
a b
<int> <int>
1 1 4
2 2 5
3 3 6
The janitor package has a very handy function clean_names for tasks like this. In this case, it replaces any .. that come from readxl with x. I added another .. column to show how the replacement works.
library(tidyverse)
df <- tibble(a = 1:3,
..1 = 4:6,
..5 = 10:12)
df %>%
janitor::clean_names()
#> # A tibble: 3 x 3
#> a x1 x5
#> <int> <int> <int>
#> 1 1 4 10
#> 2 2 5 11
#> 3 3 6 12
It seems like the naming setup in readxl is a topic of debate: see this issue, among others, on the best way to convert unusable names from Excel sheets. There's also a vignette on it. To be honest, the last couple of times I've needed to mess with readxl names, I've just passed the data frame to janitor.
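If you'd rather stay within dplyr and keep the numbers, here's a sketch (assuming dplyr >= 1.0.0 for rename_with()) that renames every ..N column in one pass:
df %>%
  rename_with(~ sub("^\\.\\.", "x", .x), starts_with(".."))
#> # A tibble: 3 x 3
#>       a    x1    x5
#>   <int> <int> <int>
#> 1     1     4    10
#> 2     2     5    11
#> 3     3     6    12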

Tidyeval: pass list of columns as quosure to select()

I want to pass a bunch of columns to pmap() inside mutate(). Later, I want to select those same columns.
At the moment, I'm passing a list of column names to pmap() as a quosure, which works fine, although I have no idea whether this is the "right" way to do it. But I can't figure out how to use the same quosure/list for select().
I've got almost no experience with tidyeval, I've only got this far by playing around. I imagine there must be a way to use the same thing both for pmap() and select(), preferably without having to put each of my column names in quotation marks, but I haven't found it yet.
library(dplyr)
library(rlang)
library(purrr)
df <- tibble(a = 1:3,
b = 101:103) %>%
print
#> # A tibble: 3 x 2
#> a b
#> <int> <int>
#> 1 1 101
#> 2 2 102
#> 3 3 103
cols_quo <- quo(list(a, b))
df2 <- df %>%
mutate(outcome = !!cols_quo %>%
pmap_int(function(..., word) {
args <- list(...)
# just to be clear this isn't what I actually want to do inside pmap
return(args[[1]] + args[[2]])
})) %>%
print()
#> # A tibble: 3 x 3
#> a b outcome
#> <int> <int> <int>
#> 1 1 101 102
#> 2 2 102 104
#> 3 3 103 106
# I get why this doesn't work, but I don't know how to do something like this that does
df2 %>%
select(!!cols_quo)
#> Error in .f(.x[[i]], ...): object 'a' not found
This is a bit tricky because of the mix of semantics involved in this problem. pmap() takes a list and passes each element as its own argument to a function (it's kind of equivalent to !!! in that sense). Your quoting function thus needs to quote its arguments and somehow pass a list of columns to pmap().
Our quoting function can go one of two ways. Either quote (i.e., delay) the list creation, or create an actual list of quoted expressions right away:
quoting_fn1 <- function(...) {
exprs <- enquos(...)
# For illustration purposes, return the quoted inputs instead of
# doing something with them. Normally you'd call `mutate()` here:
exprs
}
quoting_fn2 <- function(...) {
expr <- quo(list(!!!enquos(...)))
expr
}
Since our first variant does nothing but return a list of quoted inputs, it's actually equivalent to quos():
quoting_fn1(a, b)
#> <list_of<quosure>>
#>
#> [[1]]
#> <quosure>
#> expr: ^a
#> env: global
#>
#> [[2]]
#> <quosure>
#> expr: ^b
#> env: global
The second version returns a quoted expression that instructs R to create a list with quoted inputs:
quoting_fn2(a, b)
#> <quosure>
#> expr: ^list(^a, ^b)
#> env: 0x7fdb69d9bd20
There is a subtle but important difference between the two. The first version creates an actual list object:
exprs <- quoting_fn1(a, b)
typeof(exprs)
#> [1] "list"
On the other hand, the second version does not return a list, it returns an expression for creating a list:
expr <- quoting_fn2(a, b)
typeof(expr)
#> [1] "language"
Let's find out which version is more appropriate for interfacing with pmap(). But first we'll give a name to the pmapped function to make the code clearer and easier to experiment with:
myfunction <- function(..., word) {
args <- list(...)
# just to be clear this isn't what I actually want to do inside pmap
args[[1]] + args[[2]]
}
Understanding how tidy eval works is hard in part because we usually don't get to observe the unquoting step. We'll use rlang::qq_show() to reveal the result of unquoting expr (the delayed list) and exprs (the actual list) with !!:
rlang::qq_show(
mutate(df, outcome = pmap_int(!!expr, myfunction))
)
#> mutate(df, outcome = pmap_int(^list(^a, ^b), myfunction))
rlang::qq_show(
mutate(df, outcome = pmap_int(!!exprs, myfunction))
)
#> mutate(df, outcome = pmap_int(<S3: quosures>, myfunction))
When we unquote the delayed list, mutate() calls pmap_int() with list(a, b), evaluated in the data frame, which is exactly what we need:
mutate(df, outcome = pmap_int(!!expr, myfunction))
#> # A tibble: 3 x 3
#> a b outcome
#> <int> <int> <int>
#> 1 1 101 102
#> 2 2 102 104
#> 3 3 103 106
On the other hand, if we unquote an actual list of quoted expressions, we get an error:
mutate(df, outcome = pmap_int(!!exprs, myfunction))
#> Error in mutate_impl(.data, dots) :
#> Evaluation error: Element 1 is not a vector (language).
That's because the quoted expressions inside the list are not evaluated in the data frame. In fact, they are not evaluated at all. pmap() gets the quoted expressions as is, which it doesn't understand. Recall what qq_show() has shown us:
#> mutate(df, outcome = pmap_int(<S3: quosures>, myfunction))
Anything inside angle brackets is passed as is. This is a sign that we should somehow have used !!! instead, to inline each element of the list of quosures into the surrounding expression. Let's try it:
rlang::qq_show(
mutate(df, outcome = pmap_int(!!!exprs, myfunction))
)
#> mutate(df, outcome = pmap_int(^a, ^b, myfunction))
Hmm... that doesn't look right. We're supposed to pass a list to pmap_int(), but here it gets each quoted input as a separate argument. Indeed, we get a type error:
mutate(df, outcome = pmap_int(!!!exprs, myfunction))
#> Error in mutate_impl(.data, dots) :
#> Evaluation error: `.x` is not a list (integer).
That's easy to fix, just splice into a call to list():
rlang::qq_show(
mutate(df, outcome = pmap_int(list(!!!exprs), myfunction))
)
#> mutate(df, outcome = pmap_int(list(^a, ^b), myfunction))
And voilà!
mutate(df, outcome = pmap_int(list(!!!exprs), myfunction))
#> # A tibble: 3 x 3
#> a b outcome
#> <int> <int> <int>
#> 1 1 101 102
#> 2 2 102 104
#> 3 3 103 106
Another option: we can use quos() when there is more than one element and splice with !!!:
cols_quo <- quos(a, b)
df2 %>%
select(!!!cols_quo)
The object 'df2' can be created with
df %>%
mutate(outcome = list(!!! cols_quo) %>%
reduce(`+`))
If we want to use the quosure as in the OP's post
cols_quo <- quo(list(a, b))
df2 %>%
select(!!! as.list(quo_expr(cols_quo))[-1])
# A tibble: 3 x 2
# a b
# <int> <int>
#1 1 101
#2 2 102
#3 3 103
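If you don't mind writing the column names as strings once, a further option (a sketch of mine, building on the answers above) is to store them as a list of symbols with rlang::syms() and splice the same list into both verbs with !!!:
cols <- rlang::syms(c("a", "b"))
df %>%
  mutate(outcome = pmap_int(list(!!!cols), myfunction)) %>%
  select(!!!cols)
#> # A tibble: 3 x 2
#>       a     b
#>   <int> <int>
#> 1     1   101
#> 2     2   102
#> 3     3   103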

dplyr summarise() and summarise_each() make extra calls to the provided functions

It seems that summarise and summarise_each are making unnecessary extra calls to the callback functions they are provided with. Suppose that we have the following
X <- data.frame( Group = rep(c("G1","G2"),2:3), Var1 = 1:5, Var2 = 11:15 )
which looks like this:
Group Var1 Var2
1 G1 1 11
2 G1 2 12
3 G2 3 13
4 G2 4 14
5 G2 5 15
Further suppose that we have a (potentially expensive) function
f <- function(v)
{
cat( "Calling f with vector", v, "\n" )
## ...additional bookkeeping and processing...
mean(v)
}
that we would like to apply to each of our variables in each group. Using dplyr, we might go about it in the following way:
X %>% group_by( Group ) %>% summarise_each( funs(f) )
However, the output shows that f was called one additional time for each variable in G1:
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
Calling f with vector 11 12
Calling f with vector 11 12
Calling f with vector 13 14 15
# A tibble: 2 x 3
Group Var1 Var2
<fctr> <dbl> <dbl>
1 G1 1.5 11.5
2 G2 4.0 14.0
The same issue is present when using summarize:
> X %>% group_by( Group ) %>% summarise( test = f(Var1) )
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
# A tibble: 2 × 2
Group test
<fctr> <dbl>
1 G1 1.5
2 G2 4.0
Why is this happening and how would one go about preventing summarise and summarise_each from making those extra calls?
(This is using R version 3.3.0 and dplyr version 0.5.0)
EDIT: It appears that the issue has to do with the interplay between group_by and summarise/summarise_each. Without the grouping, no extra calls are made. Also, mutate and mutate_each do not suffer from this issue. (Credit: eddi and eipi10 for these findings)
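For reference, both findings are easy to check with the same X and f (this is just an illustration of the behaviour described above, on dplyr 0.5.0):
# Ungrouped summarise(): f is called exactly once
X %>% summarise( test = f(Var1) )
#> Calling f with vector 1 2 3 4 5
# Grouped mutate(): one call per group, no duplicates
X %>% group_by( Group ) %>% mutate( test = f(Var1) )
#> Calling f with vector 1 2
#> Calling f with vector 3 4 5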
Although this issue is still present in dplyr 0.5.0 (published 2016-06-24), it has been fixed in the dplyr GitHub repo, with this commit made on 2016-09-24. I've confirmed that I can reproduce the issue when I check out and build the version at the previous commit, but not when building from that commit or subsequent ones.
(And yes, I tried a whole bunch of other commits before I found it. Why I go to such lengths in the hope of earning imaginary internet points, I leave as a question for my therapist. :)
In particular, in the function SEXP process_data(const Data& gdf) in inst/include/dplyr/Result/CallbackProcessor.h, note these changes:
CLASS* obj = static_cast<CLASS*>(this);
typename Data::group_iterator git = gdf.group_begin();
RObject first_result = obj->process_chunk(*git);
++git; // This line was added
and
for (int i = 1; i < ngroups; ++git, ++i) { // changed from starting at i = 0
RObject chunk = obj->process_chunk(*git);
[Comments added by me, not part of the actual source]
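If upgrading dplyr isn't an option and the duplicated call is only a performance concern, one possible stopgap (my suggestion, not part of the fix above) is to memoise f so that the repeated first-group call is answered from a cache. Note that side effects such as the cat() above are then also skipped on the cached call:
library(memoise)
f_cached <- memoise(f)
X %>% group_by( Group ) %>% summarise_each( funs(f_cached) )
#> Calling f with vector 1 2
#> Calling f with vector 3 4 5
#> Calling f with vector 11 12
#> Calling f with vector 13 14 15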

Mutating columns of a data frame based on a predicate function (dplyr::mutate_if)

I would like to use dplyr's mutate_if() function to convert list-columns to data-frame-columns, but run into a puzzling error when I try to do so. I am using dplyr 0.5.0, purrr 0.2.2, R 3.3.0.
The basic setup looks like this: I have a data frame d, some of whose columns are lists:
d <- dplyr::data_frame(
A = list(
list(list(x = "a", y = 1), list(x = "b", y = 2)),
list(list(x = "c", y = 3), list(x = "d", y = 4))
),
B = LETTERS[1:2]
)
I would like to convert the column of lists (in this case, d$A) to a column of data frames using the following function:
tblfy <- function(x) {
x %>%
purrr::transpose() %>%
purrr::simplify_all() %>%
dplyr::as_data_frame()
}
That is, I would like the list-column d$A to be replaced by the list lapply(d$A, tblfy), which is
[[1]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 a 1
2 b 2
[[2]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 c 3
2 d 4
Of course, in this simple case, I could just do a simple reassignment. The point, however, is that I would like to do this programmatically, ideally with dplyr, in a generally applicable way that could deal with any number of list-columns.
Here's where I stumble: when I try to convert the list-columns to data-frame-columns with the following call
d %>% dplyr::mutate_if(is.list, funs(tblfy))
I get an error message that I don't know how to interpret:
Error: Each variable must be named.
Problem variables: 1, 2
Why does mutate_if() fail? How can I properly apply it to get the desired result?
Remark
A commenter has pointed out that the function tblfy() should be vectorized. That is a reasonable suggestion. But — unless I have vectorized incorrectly — that does not seem to get at the root of the problem. Plugging in a vectorized version of tblfy(),
tblfy_vec <- Vectorize(tblfy)
into mutate_if() fails with the error
Error: wrong result size (4), expected 2 or 1
Update
After gaining some experience with purrr, I now find the following approach natural, if somewhat long-winded:
d %>%
map_if(is.list, ~ map(., ~ map_df(., identity))) %>%
as_data_frame()
This is more or less identical to alistaire's solution below, but uses map_if() and map() in place of mutate_if() and Vectorize(), respectively.
The original tblfy function errors out for me (even when its elements are chained directly), so let's rebuild it a bit, adding vectorization as well, which lets us avoid an otherwise-necessary prior rowwise() call:
tblfy <- Vectorize(function(x){x %>% purrr::map_df(identity) %>% list()})
Now we can use mutate_if nicely:
d %>% mutate_if(purrr::is_list, tblfy)
## Source: local data frame [2 x 2]
##
## A B
## <list> <chr>
## 1 <tbl_df [2,2]> A
## 2 <tbl_df [2,2]> B
...and if we unnest to see what's there,
d %>% mutate_if(purrr::is_list, tblfy) %>% tidyr::unnest()
## Source: local data frame [4 x 3]
##
## B x y
## <chr> <chr> <dbl>
## 1 A a 1
## 2 A b 2
## 3 B c 3
## 4 B d 4
A couple notes:
map_df(identity) seems to be more efficient at building a tibble than any of the alternative formulations. I know the identity call seems unnecessary, but almost everything else breaks.
I'm not sure how widely useful tblfy will be, as it's somewhat dependent on the structure of the lists in the list column, which can vary enormously. If you have a lot with a similar structure, I suppose it's useful, though.
There may be a way to do this with pmap instead of Vectorize, but I can't get it to work with some cursory tries.
In-place conversion without any copying:
library(data.table)
for (col in d) if (is.list(col)) lapply(col, setDF)
d
#Source: local data frame [2 x 2]
#
# A B
#1 <S3:data.frame> A
#2 <S3:data.frame> B
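For completeness, on a current tidyverse (dplyr >= 1.0.0, where across() with where() replaces the mutate_if() predicate form), the same conversion of the original d can be sketched as:
library(dplyr)
library(purrr)
d %>%
  mutate(across(where(is.list), ~ map(.x, ~ map_df(.x, identity))))
# each element of A is now a 2 x 2 tibble, as in the answers above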
