I would like to use the map function with the tidyverse to create a column of data frames based on arguments from some, but not all, of the columns of the original data frame/tibble.
I would prefer to be able to use the map function so that I can replace this with future_map to utilize parallel computing.
Apart from not using map, the following produces the correct end result (see also this question and answer: How to use rowwise to create a list column based on a function):
library(tidyverse)
library(purrr)
df <- data.frame(a= c(1,2,3), b=c(2,3,4), c=c(6,5,8))
fun <- function(q,y) {
r <- data.frame(col1 = c(q+y, q, q, y), col2 = c(q,q,q,y))
r
}
result1 <- df %>% rowwise(a) %>% mutate(list1 = list(fun(a, b)))
> result1
# A tibble: 3 × 4
# Rowwise: a
a b c list1
<dbl> <dbl> <dbl> <list>
1 1 2 6 <df [4 × 2]>
2 2 3 5 <df [4 × 2]>
3 3 4 8 <df [4 × 2]>
How can I instead do this with map? Here are three incorrect attempts:
Incorrect attempt 1:
wrong1 <- df %>% mutate(list1 = map(list(a,b), fun))
Incorrect attempt 2:
wrong2 <- df %>% mutate(list1 = map(c(a,b), fun))
Incorrect attempt 3:
wrong3 <- df %>% mutate(list1 = list(map(list(a,b), fun)))
The error I get is: argument "y" is missing, with no default. I am not sure how to pass multiple arguments in a situation like this.
I would like a solution with multiple arguments, but if that is not possible, let's move to a function with one argument.
fun_one_arg <- function(q) {
r <- data.frame(col1 = c(q, q, q, q+q), col2 = c(3*q,q,q,q/2))
r
}
wrong4 <- df %>% mutate(list1 = map(a, fun_one_arg))
wrong5 <- df %>% mutate(list1 = list(map(a, fun_one_arg)))
These run, but the fourth column does not contain data frames, as I would have expected.
We can use map2 as there are two arguments
library(dplyr)
df %>%
mutate(list1 = map2(a, b, fun)) %>%
as_tibble
# A tibble: 3 x 4
a b c list1
<dbl> <dbl> <dbl> <list>
1 1 2 6 <df [4 × 2]>
2 2 3 5 <df [4 × 2]>
3 3 4 8 <df [4 × 2]>
Another option is pmap, which can also take more than two columns. Here ..1 and ..2 refer to the columns in the order they are passed.
df %>%
mutate(list1 = pmap(across(c(a, b)), ~ fun(..1, ..2))) %>%
as_tibble
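As a minimal sketch of the same idea, pmap can also be given a named list, in which case the list names are matched to the function's argument names instead of relying on position. The names result_pmap and result_map2 are only illustrative; df and fun are the ones from the question.

```r
library(dplyr)
library(purrr)

df <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(6, 5, 8))

fun <- function(q, y) {
  data.frame(col1 = c(q + y, q, q, y), col2 = c(q, q, q, y))
}

# The list names (q, y) are matched to the arguments of fun()
result_pmap <- df %>%
  mutate(list1 = pmap(list(q = a, y = b), fun)) %>%
  as_tibble()

# The map2() version, for comparison
result_map2 <- df %>%
  mutate(list1 = map2(a, b, fun)) %>%
  as_tibble()

# Both list columns should hold the same data frames
identical(result_pmap$list1, result_map2$list1)
```

This also suggests why the first attempts fail: map(list(a, b), fun) iterates over a list of two whole columns and calls fun with a single argument each time, so y is never supplied.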
Related
So let's say we have this df:
a = c(rep(1,5),rep(0,5),rep(1,5),rep(0,5))
b = c(rep(4,5),rep(3,5),rep(2,5),rep(1,5))
c = c(rep("w",5),rep("x",5),rep("y",5),rep("z",5))
df = data.frame(a,b,c)
df = df %>%
nest(data=c(a,b))
I want to use parameters from inside the nested "data" column to do things to the entire dataframe, for example use filter() to eliminate rows where the sum of "a" inside the nested "data" is equal to 0. Or to arrange the rows of the dataframe by the max() of b. How can I do this?
I came up with a pretty dumb way of doing this, but I am not happy with it, as it isn't really applicable to the larger datasets I'm working with:
sum_column = function(df){
df = df %>%
summarize(value=sum(a))
return(df[[1]][1])
}
# so make a new column with the sum of a, and THEN filter by that
df = df %>%
mutate(sum_of_a = map(data, ~sum_column(.x))) %>%
filter(!sum_of_a==0)
map returns a list, perhaps you want map_dbl?
library(dplyr)
library(purrr)
df %>%
mutate(sum_of_a = map_dbl(data, ~ sum(.x$a))) %>%
filter(!sum_of_a == 0)
# # A tibble: 2 × 3
# c data sum_of_a
# <chr> <list> <dbl>
# 1 w <tibble [5 × 2]> 5
# 2 y <tibble [5 × 2]> 5
or more directly (in case you no longer need sum_of_a):
df %>%
filter(abs(map_dbl(data, ~ sum(.x$a))) > 0)
# # A tibble: 2 × 2
# c data
# <chr> <list>
# 1 w <tibble [5 × 2]>
# 2 y <tibble [5 × 2]>
(The only reason I changed from ! . == 0 to abs(.) > 0 is due to floating-point tests of equality, not wanting to assume the precision and scale of numbers you're actually using. C.f., Why are these numbers not equal?, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f.)
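The question also asks about arranging rows by the max() of b. The same pattern should carry over: a sketch, assuming the nested data from the question, that computes one number per nested tibble with map_dbl and hands it to arrange. The name sorted is only illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(
  a = c(rep(1, 5), rep(0, 5), rep(1, 5), rep(0, 5)),
  b = c(rep(4, 5), rep(3, 5), rep(2, 5), rep(1, 5)),
  c = c(rep("w", 5), rep("x", 5), rep("y", 5), rep("z", 5))
) %>%
  nest(data = c(a, b))

# One value per nested tibble (its largest b), used as the sort key
sorted <- df %>%
  arrange(map_dbl(data, ~ max(.x$b)))

sorted$c
```

Wrapping the key in desc() would sort from largest to smallest instead.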
I want to run some modelling on each group of variables a and b. The problem is that nest() doesn't include the grouping variables which are needed by the model.
expand.grid(a = LETTERS[1:3], b = LETTERS[1:2], c=1:3, d=1:3) %>%
group_by(a, b) %>%
nest()
The resulting table includes a and b on the "outside" and c and d in the nested tibble. How can I add a and b to the nested tibble?
Using cur_data_all(), this creates a 3-column data frame in which the last column, nest, is a list, each of whose components is the 4-column data frame for one a, b group.
ans <- expand.grid(a = LETTERS[1:3], b = LETTERS[1:2], c=1:3, d=1:3) %>%
group_by(a, b) %>%
summarize(nest = list(cur_data_all()), .groups = "drop")
giving:
> ans
# A tibble: 6 x 3
a b nest
<fct> <fct> <list>
1 A A <tibble [9 x 4]>
2 A B <tibble [9 x 4]>
3 B A <tibble [9 x 4]>
4 B B <tibble [9 x 4]>
5 C A <tibble [9 x 4]>
6 C B <tibble [9 x 4]>
> names(ans$nest[[1]])
[1] "a" "b" "c" "d"
If a data frame with just a single column, nest, is desired, equal to the nest column above (except for attributes), then this code would work.
expand.grid(a = LETTERS[1:3], b = LETTERS[1:2], c=1:3, d=1:3) %>%
group_modify(~ tibble(nest = group_split(., a, b)))
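Another possibility, offered only as a sketch, is dplyr's group_nest(), whose keep argument controls whether the grouping columns stay inside the nested tibbles. Note that group_nest() is marked experimental in dplyr, so its behaviour may change; the names d and ans_keep are illustrative.

```r
library(dplyr)

d <- expand.grid(a = LETTERS[1:3], b = LETTERS[1:2], c = 1:3, d = 1:3)

# keep = TRUE leaves the grouping columns a and b inside each nested tibble
ans_keep <- d %>%
  group_nest(a, b, keep = TRUE)

names(ans_keep$data[[1]])
```

The list column is called data by default; the .key argument renames it.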
I downloaded a data set from the web. It's got 6 columns, and the 6th column is filled with other dataframes. So, an example:
id homeTeam homeScore awayTeam away stats
401112436 Louisville 17 Notre Dame 35 <data.frame [4 × 4]>
401112114 Oklahoma 49 Houston 31 <data.frame [4 × 4]>
401114218 USC 31 Fresno State 23 <data.frame [4 × 4]>
I want to create a new column in the original dataframe with the value in row 1, column 2 of the "stats" dataframe for each row.
I added a row_id column with the row number, and tried
df$new_col <- df$stats[[df$row_id]][1,2]
but I'm getting a recursive error. When I hard code a number
df$stats[[1]][1,2]
it returns the correct number. I don't know why it wouldn't work with the row_id value just the same.
We can use pluck from purrr
library(dplyr)
library(purrr)
df %>% mutate(new_col = map_dbl(stats, pluck, 2, 1))
Using a reproducible example :
temp <- data.frame(a = 1:4, b = 2:5)
df <- tibble(a = 1:2, b = 6:7, c = list(temp, temp))
df %>% mutate(new_col = map_dbl(c, purrr::pluck, 2, 1))
# a b c new_col
# <int> <int> <list> <dbl>
#1 1 6 <df[,2] [4 × 2]> 2
#2 2 7 <df[,2] [4 × 2]> 2
With map, we loop over the 'stats' column and extract the first element of the second column to create 'new_col' in mutate, then unnest the resulting list column.
library(purrr)
library(dplyr)
library(tidyr)
df <- df %>%
mutate(new_col = map(stats, ~ .x[[2]][1])) %>%
unnest(c(new_col))
df
# A tibble: 2 x 4
# a b stats new_col
# <int> <int> <list> <int>
#1 1 6 <df[,2] [4 × 2]> 2
#2 2 7 <df[,2] [4 × 2]> 2
If the column is character, use map_chr; if it is double, use map_dbl. If we don't know the type, simply use map to return a list column and then unnest it.
Or in base R
df$new_col <- sapply(df$stats, function(x) x[[2]][1])
data
temp <- data.frame(a = 1:4, b = 2:5)
df <- tibble(a = 1:2, b = 6:7, stats = list(temp, temp))
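As for why the hard-coded index works but the row_id version does not: a likely explanation is that `[[` given a vector of length greater than one indexes recursively rather than once per row. The sketch below, using the reproducible data above, illustrates this; the names stats and new_col are just for the example.

```r
library(purrr)

temp <- data.frame(a = 1:4, b = 2:5)
stats <- list(temp, temp)

# `[[` with a vector drills down: stats[[c(1, 2)]] is stats[[1]][[2]],
# i.e. column b of the first data frame, not rows 1 and 2 of the list
identical(stats[[c(1, 2)]], stats[[1]][[2]])

# Looping over the list returns one value per element instead
new_col <- map_int(stats, ~ .x[1, 2])
new_col
```

With a longer index vector the recursion goes deeper than the structure allows, which would be consistent with the recursive indexing error in the question.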
Hi, I am trying to apply a very simple function using purrr::map; however, I keep getting this error:
Error in mutate_impl(.data, dots) :
Evaluation error: unused argument (.x[[i]]).
The code is below:
data = data.frame(name = c('A', 'B', 'C'), metric = c(0.29, 0.39,0.89))
get_sample_size = function(metric, threshold = 0.01){
sample_size = ceiling((1.96^2)*(metric*(1-metric))/(threshold^2))
return(data.frame(sample_size))
}
data %>% group_by(name) %>% tidyr::nest() %>%
dplyr::mutate(result = purrr::map( .x = data, .f = get_sample_size, metric = metric, threshold = 0.01 ))
You don't need nest. The metric argument of the get_sample_size function should be a numeric vector, but if you nest, the data column is a list of data frames, which cannot be the input for the metric argument.
I think you can use summarize and map to apply your function to the metric column.
library(tidyverse)
data %>%
group_by(name) %>%
summarize(result = purrr::map(.x = metric,
.f = get_sample_size,
threshold = 0.01))
# # A tibble: 3 x 2
# name result
# <fct> <list>
# 1 A <data.frame [1 x 1]>
# 2 B <data.frame [1 x 1]>
# 3 C <data.frame [1 x 1]>
When you pass metric in the ... part of map, it's not clear that that is a column in the nested data frame. But once you nest the data like you've done, metric isn't a column in data, it's a column in the nested frame...also called "data." (This is a good example of why you want more specific variable names btw.)
If you're mapping over the data column, you can use $metric to point to that column, either in writing out a function, as I've done here (such as df$metric), or in formula notation (such as .$metric).
As @www said, you don't need nested data frames in this case. But for a more complicated case, you might need nested data frames to work with, such as for building models, so it's good to know how to reference exactly the data you want.
library(tidyverse)
data %>%
group_by(name) %>%
tidyr::nest() %>%
mutate(result = map(data, function(df) {
get_sample_size(metric = df$metric, threshold = 0.01)
}))
#> # A tibble: 3 x 3
#> name data result
#> <fct> <list> <list>
#> 1 A <tibble [1 × 1]> <data.frame [1 × 1]>
#> 2 B <tibble [1 × 1]> <data.frame [1 × 1]>
#> 3 C <tibble [1 × 1]> <data.frame [1 × 1]>
Created on 2019-01-16 by the reprex package (v0.2.1)
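If a plain numeric column is enough, mapping may not be needed at all, because the arithmetic in the formula is already vectorized. A sketch under that assumption, where sample_size_vec is a hypothetical variant of get_sample_size that returns a numeric vector instead of a data frame:

```r
library(dplyr)

data <- data.frame(name = c("A", "B", "C"), metric = c(0.29, 0.39, 0.89))

# Hypothetical variant of get_sample_size() returning a numeric vector
sample_size_vec <- function(metric, threshold = 0.01) {
  ceiling((1.96^2) * (metric * (1 - metric)) / (threshold^2))
}

# The function is applied to the whole column at once
result <- data %>%
  mutate(sample_size = sample_size_vec(metric, threshold = 0.01))

result
```

The map-based versions above remain the better fit when each group needs its own data frame or model object.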
I'm trying to use map() of purrr package to apply filter() function to the data stored in a nested data frame.
"Why wouldn't you filter first, and then nest?" you might ask.
That will work (and I'll show my desired outcome using such process), but I'm looking for ways to do it with purrr.
I want to have just one data frame, with two list-columns, both being nested data frames - one full and one filtered.
I can achieve it now by performing nest() twice: once on all data, and second on filtered data:
library(tidyverse)
df <- tibble(
a = sample(x = rep(c('x','y'),5), size = 10),
b = sample(c(1:10)),
c = sample(c(91:100))
)
df_full_nested <- df %>%
group_by(a) %>%
nest(.key = 'full')
df_filter_nested <- df %>%
filter(c >= 95) %>% ##this is the key step
group_by(a) %>%
nest(.key = 'filtered')
## Desired outcome - one data frame with 2 nested list-columns: one full and one filtered.
## How to achieve this without breaking it out into 2 separate data frames?
df_nested <- df_full_nested %>%
left_join(df_filter_nested, by = 'a')
The objects look like this:
> df
# A tibble: 10 x 3
a b c
<chr> <int> <int>
1 y 8 93
2 x 9 94
3 y 10 99
4 x 5 97
5 y 2 100
6 y 3 95
7 x 7 96
8 y 6 92
9 x 4 91
10 x 1 98
> df_full_nested
# A tibble: 2 x 2
a full
<chr> <list>
1 y <tibble [5 x 2]>
2 x <tibble [5 x 2]>
> df_filter_nested
# A tibble: 2 x 2
a filtered
<chr> <list>
1 y <tibble [3 x 2]>
2 x <tibble [3 x 2]>
> df_nested
# A tibble: 2 x 3
a full filtered
<chr> <list> <list>
1 y <tibble [5 x 2]> <tibble [3 x 2]>
2 x <tibble [5 x 2]> <tibble [3 x 2]>
So, this works. But it is not clean. And in real life, I group by several columns, which means I also have to join on several columns... It gets hairy fast.
I'm wondering if there is a way to apply filter to the nested column. This way, I'd operate within the same object. Just cleaner and more understandable code.
I'm thinking it'd look like
df_full_nested %>% mutate(filtered = map(full, ...))
But I am not sure how to map filter() properly
Thanks!
You can use map(full, ~ filter(., c >= 95)), where . stands for each individual nested tibble, so the filter is applied to it directly:
df_nested_2 <- df_full_nested %>% mutate(filtered = map(full, ~ filter(., c >= 95)))
identical(df_nested, df_nested_2)
# [1] TRUE
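The whole thing can also be sketched as a single pipeline that starts from the raw data, with no join. This assumes a tidyr version that accepts the nest(new_col = c(...)) syntax (1.0.0 or later) and uses the fixed values printed in the question rather than sample(), so the result is reproducible; df_one_pipe is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Fixed copy of the data printed in the question
df <- tibble(
  a = c("y", "x", "y", "x", "y", "y", "x", "y", "x", "x"),
  b = c(8L, 9L, 10L, 5L, 2L, 3L, 7L, 6L, 4L, 1L),
  c = c(93L, 94L, 99L, 97L, 100L, 95L, 96L, 92L, 91L, 98L)
)

df_one_pipe <- df %>%
  nest(full = c(b, c)) %>%                              # one row per value of a
  mutate(filtered = map(full, ~ filter(.x, c >= 95)))   # filter each nested tibble

# Number of rows kept in each filtered tibble
map_int(df_one_pipe$filtered, nrow)
```

Because every other column is kept outside the nested tibble, grouping by several columns needs no extra code here.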