Unexpected dplyr::bind_rows() behavior

Short Version:
I'm encountering an error with dplyr::bind_rows() that I don't understand. I want to split my data based on some condition (e.g. a == 1), operate on one part (e.g. b = b * 10), and bind it back to the other part using dplyr::bind_rows(), all in a single pipe chain. It works fine if I supply the input to the two parts explicitly, but if I instead pipe it in with ., bind_rows() complains about the data type of argument 2.
Here's a minimal reproducible example of the issue:
library(tidyverse)
# sim data
d <- tibble(a = 1:4, b = 1:4)
# works when 'd' is supplied directly to bind_rows()
bind_rows(d %>% filter(a == 1),
          d %>% filter(!a == 1) %>% mutate(b = b * 10))
#> # A tibble: 4 x 2
#>       a     b
#>   <int> <dbl>
#> 1     1     1
#> 2     2    20
#> 3     3    30
#> 4     4    40
# fails when 'd' is piped in to bind_rows()
d %>%
  bind_rows(. %>% filter(a == 1),
            . %>% filter(!a == 1) %>% mutate(b = b * 10))
#> Error: Argument 2 must be a data frame or a named atomic vector.
Long Version:
If I capture what the bind_rows() call is receiving as input with a list() instead, I can see that two things are happening which I did not expect:
1. Instead of evaluating the pipe chains I provided, it captures them as functional sequences.
2. The input (.) is invisibly supplied in addition to the two explicit arguments, so I get 3 items in the list instead of 2.
# capture intermediate values for diagnostics
d %>%
  list(. %>% filter(a == 1),
       . %>% filter(!a == 1) %>% mutate(b = b * 10))
#> [[1]]
#> # A tibble: 4 x 2
#>       a     b
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     3     3
#> 4     4     4
#>
#> [[2]]
#> Functional sequence with the following components:
#>
#> 1. filter(., a == 1)
#>
#> Use 'functions' to extract the individual functions.
#>
#> [[3]]
#> Functional sequence with the following components:
#>
#> 1. filter(., !a == 1)
#> 2. mutate(., b = b * 10)
#>
#> Use 'functions' to extract the individual functions.
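For reference, a chain that starts with the bare dot is magrittr's shorthand for defining a function rather than computing a value, which seems to be what's happening to items [[2]] and [[3]] here. A quick check with the same toy data:

# a chain starting with the bare '.' builds a function, not a result
fseq <- . %>% filter(a == 1)
class(fseq)
#> [1] "fseq"     "function"
fseq(d)  # same as d %>% filter(a == 1)

So bind_rows() was being handed functions rather than data frames, which is what the "Argument 2 must be a data frame" error is complaining about.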
This leads me to the following inelegant solution. Piping to the inner function (filter(., a == 1) rather than . %>% filter(a == 1)) forces evaluation, because magrittr only builds a functional sequence when a chain starts with the bare dot; a dot nested inside another call is simply substituted with the piped value. That solves the first problem. But because the dot then appears only in nested calls, magrittr still inserts the input as the first argument of list(), so I solve the second problem by subsetting that first element away before performing the bind_rows() operation.
# hack solution to force evaluation and drop the duplicated input
d %>%
  list(filter(., a == 1),
       filter(., !a == 1) %>% mutate(b = b * 10)) %>%
  .[-1] %>%
  bind_rows()
#> # A tibble: 4 x 2
#>       a     b
#>   <int> <dbl>
#> 1     1     1
#> 2     2    20
#> 3     3    30
#> 4     4    40
Created on 2022-01-24 by the reprex package (v2.0.1)
It seems like it might be related to this issue, but I can't quite see how. It would be great to understand why this is happening and to find a way to code this without assigning intermediate variables or resorting to this weird hack of subsetting the intermediate list.
EDIT:
Knowing this was related to curly braces ({}) enabled me to find a few more helpful links:
1, 2, 3

If we want to use ., then block the automatic first-argument insertion with the scope operator ({}):
library(dplyr)
d %>%
  {
    bind_rows({.} %>% filter(a == 1),
              {.} %>% filter(!a == 1) %>% mutate(b = b * 10))
  }
Output:
# A tibble: 4 × 2
      a     b
  <int> <dbl>
1     1     1
2     2    20
3     3    30
4     4    40
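Why this works: inside a braced expression, magrittr does not insert the piped value as a hidden first argument; . is just an ordinary variable bound to the input. The {.} wrappers additionally stop each inner chain from being parsed as a functional sequence (which is what a chain beginning with the bare . becomes). Once the whole call is braced you can also skip the wrappers and pass . to filter() directly; a minimal equivalent sketch:

d %>%
  {
    bind_rows(filter(., a == 1),
              filter(., !a == 1) %>% mutate(b = b * 10))
  }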

Related

Mutate All columns in a list of tibbles

Let's suppose I have the following list of tibbles:
a_list_of_tibbles <- list(
  a = tibble(a = rnorm(10)),
  b = tibble(a = runif(10)),
  c = tibble(a = letters[1:10])
)
Now I want to map them all into a single data frame/tibble, which is not directly possible due to the differing column types.
How would I go about this?
I have tried the following, but I want to get rid of the for loop:
for (i in seq_along(a_list_of_tibbles)) {
  a_list_of_tibbles[[i]] <- a_list_of_tibbles[[i]] %>% mutate_all(as.character)
}
Then I run:
map_dfr(.x = a_list_of_tibbles, .f = as_tibble)
We could do the computation within the map. Use across() instead of the _all suffix variants (which are superseded) to loop over the columns of each dataset:
library(dplyr)
library(purrr)
map_dfr(a_list_of_tibbles,
        ~ .x %>%
          mutate(across(everything(), as.character)))
Output:
# A tibble: 30 × 1
a
<chr>
1 0.735200825884485
2 1.4741501589461
3 1.39870958697574
4 -0.36046362308853
5 -0.893860999301402
6 -0.565468636033674
7 -0.075270267983768
8 2.33534260196058
9 0.69667906338348
10 1.54213170143702
# … with 20 more rows
Another alternative is to use:
library(tidyverse)
map_depth(a_list_of_tibbles, 2, as.character) %>%
bind_rows()
#> # A tibble: 30 × 1
#> a
#> <chr>
#> 1 0.0894618169853206
#> 2 -1.50144637645091
#> 3 1.44795821718513
#> 4 0.0795342912030257
#> 5 -0.837985570593029
#> 6 -0.050845557103668
#> 7 0.031194556366589
#> 8 0.0989551909839589
#> 9 1.87007290229274
#> 10 0.67816212007413
#> # … with 20 more rows
Created on 2021-12-20 by the reprex package (v2.0.1)
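For completeness, the question's own two-step approach can also be written without the for loop: map() over the list, then bind. A sketch (newer purrr, >= 1.0, also spells the binding step as list_rbind(); with older versions dplyr::bind_rows() does the same job):

library(dplyr)
library(purrr)
# convert every column of every tibble to character, then row-bind
a_list_of_tibbles %>%
  map(~ mutate(.x, across(everything(), as.character))) %>%
  bind_rows()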

assigning id values from values, not names, with purrr::map_dfr

I think this question is related to Using map_dfr and .id for list names and list of list names but not identical ...
I often use map_dfr for a case where I want to use the value of each argument, not its name, as the .id variable. Here's a silly example: I am computing the mean of mtcars$mpg raised to the second, fourth, and sixth power:
library(tidyverse)
list(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
## name x
## <chr> <dbl>
## 1 1 439.
## 2 2 262350.
## 3 3 198039783.
I would like the name variable to be 2, 4, 6 instead of 1, 2, 3. I can hack this by including setNames(.data) in the pipeline:
list(2,4,6) %>%
setNames(.data) %>%
map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
but I wonder if there is a more idiomatic approach I'm missing?
As for the suggestion of using something like ~ tibble(name = ., ...): nice, but slightly less convenient for the case where the mapping function already returns a tibble, because we have to add an otherwise unnecessary tibble() call:
list(2, 4, 6) %>%
  map_dfr(~ tibble(name = .,
                   broom::tidy(lm(mpg ~ cyl, data = mtcars, offset = rep(., nrow(mtcars))))))
OK, I think I found the answer shortly before posting (so I'll answer it myself). This answer points out that tibble::lst() is a self-naming list function, so as long as we use tibble::lst(2, 4, 6) instead of list(2, 4, 6), it Just Works, e.g.
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
This can work too:
library(tidyverse)
## Ben Bolker's answer
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="power")
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
list(2, 4, 6) %>% map_df(~ tibble(power = as.character(.x) , x = mean(mtcars$mpg^.)))
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
# another option
seq(2, 6, 2) %>%
  map2_df(rerun(length(.), mtcars$mpg),
          ~ c(x = as.character(.x), mean = round(mean(.y^.x), 0)))
#> # A tibble: 3 x 2
#> x mean
#> <chr> <chr>
#> 1 2 439
#> 2 4 262350
#> 3 6 198039783
Created on 2021-06-06 by the reprex package (v2.0.0)
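Note that rerun() used in that last pipe has since been deprecated in purrr; base replicate() with simplify = FALSE is a drop-in stand-in here. A sketch:

# replicate(..., simplify = FALSE) plays the role of the deprecated rerun()
seq(2, 6, 2) %>%
  map2_df(replicate(3, mtcars$mpg, simplify = FALSE),
          ~ c(x = as.character(.x), mean = round(mean(.y^.x), 0)))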
This is also possible, though it would not have been my first choice, since a single map suffices:
library(purrr)
list(2, 4, 6) %>%
pmap_dfr(~ tibble(power = c(...), x = map_dbl(c(...), ~ mean(mtcars$mpg ^ .x))))
# A tibble: 3 x 2
power x
<dbl> <dbl>
1 2 439.
2 4 262350.
3 6 198039783.
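Another self-naming option worth mentioning (an addition to the answers above, worth verifying on your purrr version): purrr::set_names() defaults to naming elements after their own values when nm is omitted, so it can stand in for the setNames() hack directly:

library(purrr)
library(tibble)
# set_names() with no nm argument names the list after its own values
list(2, 4, 6) %>%
  set_names() %>%
  map_dfr(~ tibble(x = mean(mtcars$mpg^.)), .id = "power")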

curly curly tidy evaluation programming with multiple inputs and custom function across columns

My question is similar to this question, but I need to apply a more complex function across columns, and I can't figure out how to adapt Lionel's suggested solution to a custom function used with a scoped verb like filter_at() or a filter() + across() equivalent. It doesn't look like a "superstache"/{{{}}} operator has been introduced.
Here is a non-programmed example of what I want to do (doesn't use NSE):
library(dplyr)
library(magrittr)
foo <- tibble(group = c(1, 1, 2, 2, 3, 3),
              a = c(1, 1, 0, 1, 2, 2),
              b = c(1, 1, 2, 2, 0, 1))
foo %>%
  group_by(group) %>%
  filter_at(vars(a, b), any_vars(n_distinct(.) != 1)) %>%
  ungroup()
#> # A tibble: 4 x 3
#> group a b
#> <dbl> <dbl> <dbl>
#> 1 2 0 2
#> 2 2 1 2
#> 3 3 2 0
#> 4 3 2 1
I haven't found an equivalent of this filter_at line with filter+across() yet, but since the new(ish) tidyeval functions predate dplyr 1.0 I assume that issue can be set aside. Here is my attempt to make a programmed version where the filtering variables are user-supplied with dots:
my_function <- function(data, ..., by) {
  dots <- enquos(..., .named = TRUE)
  helperfunc <- function(arg) {
    return(any_vars(n_distinct(arg) != length(arg)))
  }
  dots <- lapply(dots, function(dot) call("helperfunc", dot))
  data %>%
    group_by({{ by }}) %>%
    filter(!!!dots) %>%
    ungroup()
}
foo %>%
  my_function(a, b, group)
#> Error: Problem with `filter()` input `..1`.
#> x Input `..1` is named.
#> i This usually means that you've used `=` instead of `==`.
#> i Did you mean `a == helperfunc(a)`?
I'd love it if there were a way to just plug an NSE operator into the vars() argument of filter_at() and not have to make all these extra calls (I assume this is what a {{{}}} operator would do?).
Maybe I'm misunderstanding what the issue is, but the standard pattern of forwarding the dots seems to work fine here:
my_function <- function(data, ..., by) {
  data %>%
    group_by({{ by }}) %>%
    filter_at(vars(...), any_vars(n_distinct(.) != 1)) %>%
    ungroup()
}
foo %>%
  my_function(a, b, by = group)  # works
Here is a way to use across() to achieve this that is covered in vignette("colwise").
my_function <- function(data, vars, by) {
  data %>%
    group_by({{ by }}) %>%
    filter(n_distinct(across({{ vars }}, ~ .x)) != 1) %>%
    ungroup()
}
foo %>%
  my_function(c(a, b), by = group)
# A tibble: 4 x 3
group a b
<dbl> <dbl> <dbl>
1 2 0 2
2 2 1 2
3 3 2 0
4 3 2 1
An option with across
my_function <- function(data, by, ...) {
  dots <- enquos(..., .named = TRUE)
  nm1 <- purrr::map_chr(dots, rlang::as_label)
  data %>%
    dplyr::group_by({{ by }}) %>%
    dplyr::mutate(across(all_of(nm1), ~ n_distinct(.) != 1, .names = "{col}_ind")) %>%
    dplyr::ungroup() %>%
    dplyr::filter(dplyr::select(., ends_with('ind')) %>% purrr::reduce(`|`)) %>%
    dplyr::select(-ends_with('ind'))
}
my_function(foo, group, a, b)
# A tibble: 4 x 3
# group a b
# <dbl> <dbl> <dbl>
#1 2 0 2
#2 2 1 2
#3 3 2 0
#4 3 2 1
Or with filter/across
foo %>%
  group_by(group) %>%
  filter(any(!across(c(a, b), ~ n_distinct(.) == 1)))
# A tibble: 4 x 3
# Groups: group [2]
# group a b
# <dbl> <dbl> <dbl>
#1 2 0 2
#2 2 1 2
#3 3 2 0
#4 3 2 1
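For what it's worth, newer dplyr (>= 1.0.4) added if_any()/if_all() as the documented replacement for the filter_at() + any_vars() pattern, which gives a direct spelling of the original filter. A sketch:

# keep groups where any of the selected columns takes more than one value
foo %>%
  group_by(group) %>%
  filter(if_any(c(a, b), ~ n_distinct(.x) != 1)) %>%
  ungroup()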

Replace infinite values in an R data frame [why doesn't `is.infinite()` behave like `is.na()`]

library(tidyverse)
df <- tibble(col1 = c("A", "B", "C"),
             col2 = c(NA, Inf, 5))
#> # A tibble: 3 x 2
#> col1 col2
#> <chr> <dbl>
#> 1 A NA
#> 2 B Inf
#> 3 C 5
I can use the base R is.na() function to easily replace NAs with 0s, shown below:
df %>% replace(is.na(.), 0)
#> # A tibble: 3 x 2
#> col1 col2
#> <chr> <dbl>
#> 1 A 0
#> 2 B Inf
#> 3 C 5
If I try to duplicate this logic with is.infinite(), things break:
df %>% replace(is.infinite(.), 1)
#> Error in is.infinite(.) : default method not implemented for type 'list'
Looking at this older answer about Inf and R data frames I can hack together the solution shown below. This takes my original data frame and turns all NA into 0 and all Inf into 1. Why doesn't is.infinite() behave like is.na() and what is (perhaps) a better way to do what I want?
df %>%
replace(is.na(.), 0) %>%
mutate_if(is.numeric, list(~na_if(abs(.), Inf))) %>% # line 3
replace(is.na(.), 1)
#> # A tibble: 3 x 2
#> col1 col2
#> <chr> <dbl>
#> 1 A 0
#> 2 B 1
#> 3 C 5
is.infinite() expects its input 'x' to be an atomic vector, according to ?is.infinite:
x - object to be tested: the default methods handle atomic vectors.
whereas is.na() can take a vector, matrix, or data.frame as input, per ?is.na:
an R object to be tested: the default method for is.na and anyNA handle atomic vectors, lists, pairlists, and NULL.
Also, by checking the methods,
methods('is.na')
#[1] is.na.data.frame is.na.data.table* is.na.numeric_version is.na.POSIXlt is.na.raster* is.na.vctrs_vctr*
methods('is.infinite') # only for vectors
#[1] is.infinite.vctrs_vctr*
We can modify the replace step in the code to:
library(dplyr)
library(tidyr)  # for replace_na()
df %>%
  mutate_if(is.numeric, ~ replace_na(., 0) %>%
              replace(., is.infinite(.), 1))
# A tibble: 3 x 2
# col1 col2
# <chr> <dbl>
#1 A 0
#2 B 1
#3 C 5
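A more current spelling of that answer, for readers on dplyr >= 1.0 where mutate_if() is superseded by across() (a sketch):

library(dplyr)
library(tidyr)
# NA -> 0, then Inf/-Inf -> 1, for every numeric column
df %>%
  mutate(across(where(is.numeric),
                ~ replace(replace_na(.x, 0), is.infinite(.x), 1)))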

dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output

When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?
Here's an example with fake data.
library(dplyr)
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
# Now add an extra level to df$b that has no corresponding value in df$a
df$b = factor(df$b, levels=1:3)
# Summarise with plyr, keeping categories with a count of zero
plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)
b count_a
1 1 6
2 2 6
3 3 0
# Now try it with dplyr
df %>%
  group_by(b) %>%
  summarise(count_a = length(a), .drop = FALSE)
b count_a .drop
1 1 6 FALSE
2 2 6 FALSE
Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?
The issue is still open, but in the meantime, especially since your data are already factored, you can use complete from "tidyr" to get what you might be looking for:
library(tidyr)
df %>%
  group_by(b) %>%
  summarise(count_a = length(a)) %>%
  complete(b)
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (int)
# 1 1 6
# 2 2 6
# 3 3 NA
If you wanted the replacement value to be zero, you need to specify that with fill:
df %>%
  group_by(b) %>%
  summarise(count_a = length(a)) %>%
  complete(b, fill = list(count_a = 0))
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (dbl)
# 1 1 6
# 2 2 6
# 3 3 0
Since dplyr 0.8, group_by() has gained a .drop argument that does just what you asked for:
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)
df %>%
  group_by(b, .drop = FALSE) %>%
  summarise(count_a = length(a))
#> # A tibble: 3 x 2
#> b count_a
#> <fct> <int>
#> 1 1 6
#> 2 2 6
#> 3 3 0
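The same result can be had a little more compactly with count(), which forwards .drop to the grouping step (assuming dplyr >= 0.8 here as well):

df %>% count(b, .drop = FALSE)  # n: 6, 6, 0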
One additional note to go with @Moody_Mudskipper's answer: using .drop = FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See the examples below:
library(dplyr)
data(iris)
# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))
# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally
#> Species n
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
#> 4 empty_level 0
# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))
# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 versicolor A 25
#> 4 versicolor B 25
#> 5 virginica B 25
#> 6 virginica C 25
#> 7 empty_level <NA> 0
# Turn group2 into a factor
iris$group2 = factor(iris$group2)
# Now all possible combinations of Species and group2 are included in the output,
# whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 setosa C 0
#> 4 versicolor A 25
#> 5 versicolor B 25
#> 6 versicolor C 0
#> 7 virginica A 0
#> 8 virginica B 25
#> 9 virginica C 25
#> 10 empty_level A 0
#> 11 empty_level B 0
#> 12 empty_level C 0
Created on 2019-03-13 by the reprex package (v0.2.1)
dplyr solution:
First make a grouped df:
by_b <- tbl_df(df) %>% group_by(b)
then we summarise the levels that occur by counting with n():
res <- by_b %>% summarise(count_a = n())
then we merge our results into a data frame that contains all factor levels:
expanded_res <- left_join(expand.grid(b = levels(df$b)), res)
finally, since we are looking at counts here, the NA values are changed to 0:
expanded_res[is.na(expanded_res)] <- 0
final_counts <- expanded_res
This can also be implemented functionally, see answers:
Add rows to grouped data with dplyr?
A hack:
I thought I would post a terrible hack that works in this case, for interest's sake. I seriously doubt you should ever actually do this, but it shows how group_by() generates the attributes as if df$b were a character vector rather than a factor with levels. Also, I don't pretend to understand this properly; I'm posting it in the hope that it helps me learn. That's the only reason I'm posting it!
by_b <- tbl_df(df) %>% group_by(b)
define an "out-of-bounds" value that cannot exist in the dataset:
oob_val <- nrow(by_b)+1
modify attributes to "trick" summarise():
attr(by_b, "indices")[[3]] <- rep(NA,oob_val)
attr(by_b, "group_sizes")[3] <- 0
attr(by_b, "labels")[3,] <- 3
do the summary:
res <- by_b %>% summarise(count_a = n())
index and replace all occurrences of oob_val:
res[res == oob_val] <- 0
which gives the intended:
> res
Source: local data frame [3 x 2]
b count_a
1 1 6
2 2 6
3 3 0
This is not exactly what was asked in the question, but at least for this simple example you can get the same result using xtabs, for example:
using the pipe:
df %>%
  xtabs(formula = ~ b) %>%
  as.data.frame()
or shorter:
as.data.frame(xtabs( ~ b, df))
result (equal in both cases):
b Freq
1 1 6
2 2 6
3 3 0
