R rlang: use .x in map() with quosure?

I am trying to pass a set of variables/values in a data.frame to a map function, but am not sure how to deal with the fact that .x refers to a quosure that needs to be evaluated: mutate(df2 = map2(variable, value, ~filter(df1, .x==.y))) A naive !!.x will not work.
Here my data.frame has one column for variable, one for value, that will be mapped in a filter call:
tibble(variable=c("wool", "tension"),
value= c("A", "L"))
#> # A tibble: 2 x 2
#> variable value
#> <chr> <chr>
#> 1 wool A
#> 2 tension L
How can I pass these to filter? Should I declare instead variable as quosure? I tried a few approaches:
library(tidyverse)
data(warpbreaks)
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~filter(warpbreaks, .x==.y)))
#> # A tibble: 2 x 3
#> variable value data_filtered
#> <chr> <chr> <list>
#> 1 wool A <data.frame [0 × 3]>
#> 2 tension L <data.frame [0 × 3]>
tibble(variable=c(quo(wool), quo(tension)),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~filter(warpbreaks, eval_tidy(.x)==.y)))
#> Error in eval_tidy(.x): object 'wool' not found

In your example you're trying to use dplyr verbs in a nested way: there's a filter() inside mutate(). This works well for the normal use, but we need to be a little careful when using tidy eval features because they are applied very early, when the outer function is called. For this reason there's often a timing problem if you try to use !! or .data in the inner verb.
@zack's answer shows how you can decompose the problem into two steps to avoid the nesting issue. In this case, another possibility is to omit the mutate() step by mapping directly over df (credit to @Spacedman for the idea). Here we're going to use pmap(), which maps in parallel over a list or data frame:
# For pretty-printing
options(tibble.print_max = 5, tibble.print_min = 5)
warpbreaks <- as_tibble(warpbreaks)
# df is the two-column tibble of variables and values from the question
df <- tibble(variable = c("wool", "tension"), value = c("A", "L"))
pmap(df, ~ filter(warpbreaks, .data[[.x]] == .y))
#> [[1]]
#> # A tibble: 27 x 3
#> breaks wool tension
#> <dbl> <fct> <fct>
#> 1 26 A L
#> 2 30 A L
#> 3 54 A L
#> 4 25 A L
#> 5 70 A L
#> # … with 22 more rows
#>
#> [[2]]
#> # A tibble: 18 x 3
#> breaks wool tension
#> <dbl> <fct> <fct>
#> 1 26 A L
#> 2 30 A L
#> 3 54 A L
#> 4 25 A L
#> 5 70 A L
#> # … with 13 more rows
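The same idea can also be sketched without pmap(). The timing problem only arises when filter() is nested inside mutate(), so calling map2() at the top level and attaching the result afterwards should work too. The names vars and filtered are made up for this illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical name for the lookup table from the question
vars <- tibble(variable = c("wool", "tension"), value = c("A", "L"))

# map2() is called outside of any dplyr verb, so .data[[.x]] is only
# resolved by the inner filter(), against the columns of warpbreaks
filtered <- map2(vars$variable, vars$value,
                 ~ filter(warpbreaks, .data[[.x]] == .y))

# Attach the list of data frames as a list-column
vars$data_filtered <- filtered
row_counts <- vapply(filtered, nrow, integer(1))
row_counts
```

The row counts should match the 27 and 18 rows shown in the output above.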

You can use R's native substitution tools. rlang is more valuable when dealing with environments, but for more complex symbol substitution (nested calls, for example) base R is easier (for me at least).
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~eval(bquote(
filter(warpbreaks, .(sym(.x)) ==.y)))))
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~eval(substitute(
filter(warpbreaks, X ==.y), list(X = sym(.x))))))
# output for either
# # A tibble: 2 x 3
# variable value data_filtered
# <chr> <chr> <list>
# 1 wool A <data.frame [27 x 3]>
# 2 tension L <data.frame [18 x 3]>
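To see what the substitution does, here is a small base-R-only sketch that builds the call first and evaluates it afterwards. It uses subset() instead of filter() so that it runs without any packages; var, val and cl are illustrative names.

```r
var <- "wool"
val <- "A"

# .() evaluates its argument and splices the result into the quoted call,
# so the string "wool" becomes the symbol wool before anything is evaluated
cl <- bquote(subset(warpbreaks, .(as.name(var)) == .(val)))
print(cl)

# Evaluating the finished call runs an ordinary subset()
res <- eval(cl)
nrow(res)
```

Printing cl before evaluating it is a handy way to check that the expression is the one you would have typed by hand.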

Something weird goes on with the evaluation of .x in the anonymous function. To be honest I'm not sure what, but defining a function outside of the map2 call seems to work alright (credit to @Lionel Henry for the ~ filter(df1, !!sym(.x) == .y) bit):
library(tidyverse)
df <- tibble(variable=c("wool", "tension"),
value= c("A", "L"))
data(warpbreaks)
# doesn't work with anonymous function
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~ filter(warpbreaks, !!sym(.x) == .y)))
#> Error in is_symbol(x): object '.x' not found
# works when you define function outside of map2
temp <- function(x, y, data){
filter(data, !!sym(x) == y)
}
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, temp, warpbreaks))
#> # A tibble: 2 x 3
#> variable value data_filtered
#> <chr> <chr> <list>
#> 1 wool A <data.frame [27 x 3]>
#> 2 tension L <data.frame [18 x 3]>
Created on 2019-05-07 by the reprex package (v0.2.1)
You can also do the following without the externally defined function:
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered = map2(variable, value, ~ filter(..3, ..3[[..1]] == ..2), warpbreaks))
#> # A tibble: 2 x 3
#> variable value data_filtered
#> <chr> <chr> <list>
#> 1 wool A <data.frame [27 x 3]>
#> 2 tension L <data.frame [18 x 3]>

Related

Unnest a list-column of tibbles with different data type (cannot combine double and character)

I would like to unnest a nested tibble, however, I get an error.
Example data:
library(tidyverse)
df <- tribble(
~x, ~y,
1, tibble(a=1, b=2),
2, tibble(a=4:5, b=c("thank","you"),c=1:2))
df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [1 x 2]>
#> 2 2 <tibble [2 x 3]>
df %>% unnest(y)
#> Error: Can't combine `..1$b` <double> and `..2$b` <character>.
Created on 2021-11-03 by the reprex package (v2.0.1)
I think I have to change the data type of all tibbles listed in y to character, but I got stuck with that.
Maybe the following?
library(tidyverse)
df <- tribble(
~x, ~y,
1, tibble(a=1, b=2),
2, tibble(a=4:5, b=c("thank","you"),c=1:2))
df %>%
mutate(y = map(y, ~ mutate(.x, b = as.character(b)))) %>%
unnest(cols = c(y))
#> # A tibble: 3 × 4
#> x a b c
#> <dbl> <dbl> <chr> <int>
#> 1 1 1 2 NA
#> 2 2 4 thank 1
#> 3 2 5 you 2
There might be a more elegant way, but this works: turn every column of the nested tibbles into character and then you can combine them. You can most likely just change column 'b' to the same class and it will be fine.
library(tidyverse)
df <- tribble(
~x, ~y,
1, tibble(a=1, b=2),
2, tibble(a=4:5, b=c("thank","you"),c=1:2))
df$y[[1]] <- purrr::map_df(df$y[[1]], as.character)
df$y[[2]] <- purrr::map_df(df$y[[2]], as.character)
df %>% unnest(y)
Starting with the original data all you really need to do is change the class of b in the first nested tibble to character.
df$y[[1]]$b <- as.character(df$y[[1]]$b )
df %>% unnest(y)
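If there are many nested tibbles, the same fix can be sketched as a loop over the list-column. This assumes dplyr 1.0.0 or later for across(); any_of() is used so that a nested tibble without a b column is left alone.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

df <- tribble(
  ~x, ~y,
  1, tibble(a = 1, b = 2),
  2, tibble(a = 4:5, b = c("thank", "you"), c = 1:2)
)

# Convert b to character in every nested tibble before combining them
df$y <- map(df$y, ~ mutate(.x, across(any_of("b"), as.character)))

out <- unnest(df, y)
out
```

Only b needs converting here: the double and integer versions of a can be combined without help.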

Pass a vector of arguments to map function

I'm trying to create a function that will map across a nested tibble. This function needs to take a vector of parameters that will vary for each row.
When I call purrr::map2() on the nested data, purrr tries to loop over all values of the parameter vector and all rows in the dataset. What can I do to pass the entire vector as a single argument?
library(tidyverse)
myf <- function(x, params) {
print(params)
x %>%
mutate(new_mpg = mpg + rnorm(n(), params[1], params[2])) %>%
summarise(old = mean(mpg), new = mean(new_mpg)) %>%
as.list()
}
# Calling function with params defined is great!
myf(mtcars, params = c(5, 10))
#> [1] 5 10
#> $old
#> [1] 20.09062
#>
#> $new
#> [1] 25.62049
# Does not work in purrr as a vector: map2() tries to loop over params
mtcars %>%
group_by(cyl) %>% # from dplyr
nest() %>%
mutate(
newold = map2(data, c(5, 10), myf),
)
#> [1] 5
#> Warning in rnorm(n(), params[1], params[2]): NAs produced
#> [1] 10
#> Warning in rnorm(n(), params[1], params[2]): NAs produced
#> Error: Problem with `mutate()` column `newold`.
#> ℹ `newold = map2(data, c(5, 10), myf)`.
#> ℹ `newold` must be size 1, not 2.
#> ℹ The error occurred in group 1: cyl = 4.
# New function wrapper with hard-coded params
myf2 <- function(x){
myf(x, c(5, 10))
}
# works great! but not what I need
mtcars %>%
group_by(cyl) %>% # from dplyr
nest() %>%
mutate(
mean = 5,
sd = 10,
newold = map(data, myf2),
)
#> [1] 5 10
#> [1] 5 10
#> [1] 5 10
#> # A tibble: 3 × 5
#> # Groups: cyl [3]
#> cyl data mean sd newold
#> <dbl> <list> <dbl> <dbl> <list>
#> 1 6 <tibble [7 × 10]> 5 10 <named list [2]>
#> 2 4 <tibble [11 × 10]> 5 10 <named list [2]>
#> 3 8 <tibble [14 × 10]> 5 10 <named list [2]>
Created on 2021-11-29 by the reprex package (v2.0.0)
Skip the group_by() step and just use nest() - otherwise your data will remain grouped after nesting and need to be ungrouped. To get your function to work, just pass the parameters as a list.
library(tidyverse)
mtcars %>%
nest(data = -cyl) %>%
mutate(
newold = map2_df(data, list(c(5, 10)), myf)
) %>%
unpack(newold)
# A tibble: 3 x 4
cyl data old new
<dbl> <list> <dbl> <dbl>
1 6 <tibble [7 x 10]> 19.7 30.7
2 4 <tibble [11 x 10]> 26.7 31.1
3 8 <tibble [14 x 10]> 15.1 17.0
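Why wrapping the vector in list() helps can be seen in a stripped-down sketch: map2() iterates over the elements of its second argument, and a list of length 1 is recycled, so every call receives the whole vector. The function f below is a made-up stand-in for myf.

```r
library(purrr)

# Hypothetical stand-in for myf(): uses both elements of params
f <- function(x, params) x + params[1] * params[2]

# c(5, 10) on its own has length 2, so map2() would pair x[1] with 5 and
# x[2] with 10. list(c(5, 10)) has length 1 and is recycled for each x.
res <- map2(1:3, list(c(5, 10)), f)
unlist(res)
```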
You don't need map2. I think what you need is map.
mtcars %>%
group_by(cyl) %>% # from dplyr
nest() %>%
mutate(
newold = map(data, myf, params = c(5, 10)),
)
# [1] 5 10
# [1] 5 10
# [1] 5 10
# # A tibble: 3 x 3
# # Groups: cyl [3]
# cyl data newold
# <dbl> <list> <list>
# 1 6 <tibble [7 x 10]> <named list [2]>
# 2 4 <tibble [11 x 10]> <named list [2]>
# 3 8 <tibble [14 x 10]> <named list [2]>
If you have multiple sets of params, you can ungroup your data frame, add a list column with your params, and use map2.
mtcars %>%
group_by(cyl) %>%
nest() %>%
ungroup() %>%
# Add different sets of params
mutate(Params = list(a = c(5, 10), b = c(6, 11), c = c(7, 12))) %>%
mutate(
newold = map2(data, Params, myf)
)
# [1] 5 10
# [1] 6 11
# [1] 7 12
# # A tibble: 3 x 4
# cyl data Params newold
# <dbl> <list> <named list> <list>
# 1 6 <tibble [7 x 10]> <dbl [2]> <named list [2]>
# 2 4 <tibble [11 x 10]> <dbl [2]> <named list [2]>
# 3 8 <tibble [14 x 10]> <dbl [2]> <named list [2]>

Removing duplicate records in a dataframe based on the values of a list column

I have a dataframe which contains duplicate values in a list column and I want to keep only the first appearance of each unique value.
Let's say I have the following tibble:
df <- tribble(
~x, ~y,
1, tibble(a = 1:2, b = 2:3),
2, tibble(a = 1:2, b = 2:3),
3, tibble(a = 0:1, b = 0:1)
)
df
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
The desired outcome is:
desired_df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>
If y weren't a list column, I'd be able to use distinct(df, y, .keep_all = TRUE), but the function doesn't support list columns properly, as shown:
distinct(df, y, .keep_all = TRUE)
#> Warning: distinct() does not fully support columns of type `list`.
#> List elements are compared by reference, see ?distinct for details.
#> This affects the following columns:
#> - `y`
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
Is there any "clean" way to achieve what I want?
One option is to use filter() with duplicated():
library(dplyr)
df %>%
filter(!duplicated(y))
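A quick way to check what this does, sketched on the example data: as far as I can tell, base R's duplicated() compares list elements by value rather than by reference, which is why it succeeds where distinct() does not. The name keep is only for illustration.

```r
library(tibble)

df <- tribble(
  ~x, ~y,
  1, tibble(a = 1:2, b = 2:3),
  2, tibble(a = 1:2, b = 2:3),
  3, tibble(a = 0:1, b = 0:1)
)

# TRUE for the first occurrence of each distinct nested tibble
keep <- !duplicated(df$y)
keep

# Plain row indexing gives the same result as filter(!duplicated(y))
deduped <- df[keep, ]
deduped$x
```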
I have come to an answer, but I think it's quite "wordy" (and I suspect it might be slow as well):
df <- df %>%
mutate(unique_list_id = match(y, unique(y))) %>%
group_by(unique_list_id) %>%
slice(1) %>%
ungroup() %>%
select(-unique_list_id)
df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>

Multiple tests with pairwise combinations using dplyr/tidyverse

My question is related to this one but a more complex example, in which I would like to statistically compare multiple columns in all combinations, and each of the columns has a different number of samples.
Consider the original data:
# A tibble: 51 x 3
trial person score
<chr> <chr> <dbl>
1 foo a 0.266
2 bar b 0.372
3 foo c 0.573
4 bar a 0.908
5 foo b 0.202
6 bar c 0.898
7 foo a 0.945
8 bar b 0.661
9 foo c 0.629
10 foo b 0.206
For each trial type, I'd like to run a statistical test comparing the scores of each person. So, I need the following test results:
Trial foo, compare all score samples of persons A–B, B–C, C–A
Trial bar, compare all score samples of persons A–B, B–C, C–A
Of course, there are more than two trials, and more than three persons.
Hence, the solution using group_split given in the other question does not work, as it implies always testing against the first person (in my case), not all pairwise combinations.
So, in the following code, I'm stuck at two points:
library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#> method from
#> [.quosures rlang
#> c.quosures rlang
#> print.quosures rlang
library(broom)
set.seed(1)
df = tibble::tibble(
trial = rep(c("foo", "bar"), 30),
person = rep(c("a", "b", "c"), 20),
score = runif(60)
) %>%
filter(score > 0.2)
df %>%
group_by(person, trial) %>%
summarize(scores = list(score)) %>%
spread(person, scores) %>%
group_split(trial) %>%
map_df(function(data) {
data %>%
summarize_at(vars(b:c), function(x) {
wilcox.test(.$a, x, paired = FALSE) %>% broom::tidy
})
})
#> Error in wilcox.test.default(.$a, x, paired = FALSE): 'x' must be numeric
Created on 2019-05-29 by the reprex package (v0.3.0)
The value of x is apparently not just the actual list of scores, but the column vector of scores for a single trial. But I don't know how else to deal with the fact that the number of samples in each person is different.
Also, I still have to manually specify the column names, which would already be a combinatorial nightmare if there were more than, say, four persons.
I can somehow get the combinations as such:
df %>%
group_split(trial) %>%
map_df(function(data) {
combinations = expand(tibble(x = unique(data$person), y = unique(data$person)), x, y) %>% filter(x != y)
})
… but that doesn't really help in creating columns for comparison.
What could I do to make this work?
This will allow you to programmatically specify combinations and get around the error you were hitting in wilcox.test().
combos <- unique(df$person) %>%
combn(2, simplify = FALSE) %>%
set_names(map_chr(., ~ paste(., collapse = "_")))
df %>%
group_split(trial) %>%
set_names(map_chr(., ~ unique(.$trial))) %>%
map_df(function(x) {
map_df(combos, function(y) {
filter(x, person %in% y) %>%
wilcox.test(score ~ person, data = .) %>%
broom::tidy()
}, .id = "contrast")
}, .id = "trial")
# A tibble: 6 x 6
trial contrast statistic p.value method alternative
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 bar a_b 34 0.878 Wilcoxon rank sum test two.sided
2 bar a_c 32 1 Wilcoxon rank sum test two.sided
3 bar b_c 31 0.959 Wilcoxon rank sum test two.sided
4 foo a_b 41 1 Wilcoxon rank sum test two.sided
5 foo a_c 41 1 Wilcoxon rank sum test two.sided
6 foo b_c 43 0.863 Wilcoxon rank sum test two.sided
Since this differs a lot from the pattern you started with, I'm not sure it will work for your real world case, but it works here so I wanted to share.
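The part that removes the manual column names is the combos object. A base-R sketch of just that step, with a hard-coded vector of persons as an assumption, shows what it contains:

```r
# Assumed set of persons; in the answer this comes from unique(df$person)
persons <- c("a", "b", "c")

# combn() with simplify = FALSE returns one character vector per unordered pair
pairs <- combn(persons, 2, simplify = FALSE)

# Name each pair so the names can later serve as the .id column
names(pairs) <- vapply(pairs, paste, character(1), collapse = "_")
names(pairs)
```

With n persons this yields choose(n, 2) pairs, so adding a fourth person gives six contrasts without any change to the code.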
Here is an alternative solution that uses nesting to handle groups (persons) with different numbers of measurements.
library("broom")
library("tidyverse")
set.seed(1)
df <-
tibble(
trial = rep(c("foo", "bar"), 30),
person = rep(c("a", "b", "c"), 20),
score = runif(60)
) %>%
filter(score > 0.2)
comparisons <- df %>%
expand(
trial,
group1 = person,
group2 = person
) %>%
filter(
group1 < group2
)
comparisons
#> # A tibble: 6 × 3
#> trial group1 group2
#> <chr> <chr> <chr>
#> 1 bar a b
#> 2 bar a c
#> 3 bar b c
#> 4 foo a b
#> 5 foo a c
#> 6 foo b c
df <- df %>% nest_by(trial, person)
df
#> # A tibble: 6 × 3
#> # Rowwise: trial, person
#> trial person data
#> <chr> <chr> <list<tibble[,1]>>
#> 1 bar a [8 × 1]
#> 2 bar b [8 × 1]
#> 3 bar c [8 × 1]
#> 4 foo a [9 × 1]
#> 5 foo b [9 × 1]
#> 6 foo c [9 × 1]
comparisons %>%
inner_join(
df, by = c("trial", "group1" = "person")
) %>%
inner_join(
df, by = c("trial", "group2" = "person")
) %>%
mutate(
p.value = map2_dbl(
data.x, data.y, ~ wilcox.test(.x$score, .y$score)$p.value
)
)
#> # A tibble: 6 × 6
#> trial group1 group2 data.x data.y p.value
#> <chr> <chr> <chr> <list<tibble[,1]>> <list<tibble[,1]>> <dbl>
#> 1 bar a b [8 × 1] [8 × 1] 0.878
#> 2 bar a c [8 × 1] [8 × 1] 1
#> 3 bar b c [8 × 1] [8 × 1] 0.959
#> 4 foo a b [9 × 1] [9 × 1] 1
#> 5 foo a c [9 × 1] [9 × 1] 1
#> 6 foo b c [9 × 1] [9 × 1] 0.863
Created on 2022-03-17 by the reprex package (v2.0.1)

dplyr: summarise each column and return list columns

I am looking to summarize each column in a tibble with a custom summary function that will return different sized tibbles depending on the data.
Let’s say my summary function is this:
mysummary <- function(x) {quantile(x)[1:sample(1:5, 1)] %>% as_tibble}
It can be applied to one column as such:
cars %>% summarise(speed.summary = list(mysummary(speed)))
But I can't figure out a way to achieve this using summarise_all (or something similar).
Using the cars data, the desired output would be:
tribble(
~speed.summary, ~dist.summary,
mysummary(cars$speed), mysummary(cars$dist)
)
# A tibble: 1 x 2
speed.summary dist.summary
<list> <list>
1 <tibble [5 x 1]> <tibble [2 x 1]>
Of course the actual data has many more columns...
Suggestions?
We can use
res <- cars %>%
summarise_all(funs(summary = list(mysummary(.)))) %>%
as_tibble
res
# A tibble: 1 x 2
# speed_summary dist_summary
# <list> <list>
#1 <tibble [3 x 1]> <tibble [2 x 1]>
res$speed_summary
#[[1]]
# A tibble: 3 x 1
# value
#* <dbl>
#1 4.00
#2 12.0
#3 15.0
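Note that funs() has since been deprecated and summarise_all() superseded in dplyr. A sketch of the same summary with across(), assuming dplyr 1.0.0 or later, might look like this. mysummary() is rewritten slightly here to build the tibble explicitly, and the .names pattern is chosen to match the column names asked for in the question.

```r
library(dplyr)

# Variation of mysummary() from the question: a random-length prefix of the quantiles
mysummary <- function(x) {
  qs <- quantile(x)[1:sample(1:5, 1)]
  tibble(value = unname(qs))
}

# Wrap each result in list() so that every column becomes a list-column of length 1
res <- cars %>%
  summarise(across(everything(), ~ list(mysummary(.x)), .names = "{.col}.summary"))
res
```

Because mysummary() samples its length, the sizes of the nested tibbles will differ from run to run.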
Is this what you had in mind?
# loading necessary libraries and the data
library(tibble)
library(purrr)
#> Warning: package 'purrr' was built under R version 3.4.2
data(cars)
# custom summary function (only for numeric variables)
mysummary <- function(x) {
if (is.numeric(x)) {
df <- quantile(x)[1:sample(1:5, 1)]
df <- tibble::as_tibble(df)
}
}
# return a list of different sized tibbles depending on the data
purrr::map(.x = cars, .f = mysummary)
#> $speed
#> # A tibble: 5 x 1
#> value
#> * <dbl>
#> 1 4.00
#> 2 12.0
#> 3 15.0
#> 4 19.0
#> 5 25.0
#>
#> $dist
#> # A tibble: 1 x 1
#> value
#> * <dbl>
#> 1 2.00
Created on 2018-01-27 by the reprex package (v0.1.1.9000).
