Get Mutate Error When Applying Purrr::Map on Grouped Data - r

Hi, I am trying to apply a very simple function using purrr::map, but I keep getting this error:
Error in mutate_impl(.data, dots) :
Evaluation error: unused argument (.x[[i]]).
The code is below:
data = data.frame(name = c('A', 'B', 'C'), metric = c(0.29, 0.39,0.89))
get_sample_size = function(metric, threshold = 0.01){
sample_size = ceiling((1.96^2)*(metric*(1-metric))/(threshold^2))
return(data.frame(sample_size))
}
data %>% group_by(name) %>% tidyr::nest() %>%
dplyr::mutate(result = purrr::map( .x = data, .f = get_sample_size, metric = metric, threshold = 0.01 ))

You don't need nest. The metric argument of get_sample_size should be a numeric vector, but after nest the data column is a list of data frames, which cannot be passed as the metric argument.
I think you can use summarize and map to apply your function to the metric column.
library(tidyverse)
data %>%
group_by(name) %>%
summarize(result = purrr::map(.x = metric,
.f = get_sample_size,
threshold = 0.01))
# # A tibble: 3 x 2
# name result
# <fct> <list>
# 1 A <data.frame [1 x 1]>
# 2 B <data.frame [1 x 1]>
# 3 C <data.frame [1 x 1]>

When you pass metric in the ... part of map, it isn't clear that it refers to a column in the nested data frame. Once you nest the data as you've done, metric is no longer a column in the outer data frame; it's a column in each nested frame, which is also called "data". (This is a good example of why you want more specific variable names, by the way.)
If you're mapping over the data column, you can use $metric to point to that column, either in writing out a function, as I've done here (such as df$metric), or in formula notation (such as .$metric).
As @www said, you don't need nested data frames in this case. In a more complicated case, such as building models, you might need them, so it's good to know how to reference exactly the data you want.
library(tidyverse)
data %>%
group_by(name) %>%
tidyr::nest() %>%
mutate(result = map(data, function(df) {
get_sample_size(metric = df$metric, threshold = 0.01)
}))
#> # A tibble: 3 x 3
#> name data result
#> <fct> <list> <list>
#> 1 A <tibble [1 × 1]> <data.frame [1 × 1]>
#> 2 B <tibble [1 × 1]> <data.frame [1 × 1]>
#> 3 C <tibble [1 × 1]> <data.frame [1 × 1]>
Created on 2019-01-16 by the reprex package (v0.2.1)
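The formula notation mentioned above is not shown in the answer, so here is a minimal sketch of it, assuming the same data and get_sample_size from the question. The final unnest step is my own addition for inspecting the numbers and is not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

data <- data.frame(name = c("A", "B", "C"), metric = c(0.29, 0.39, 0.89))

get_sample_size <- function(metric, threshold = 0.01) {
  sample_size <- ceiling((1.96^2) * (metric * (1 - metric)) / (threshold^2))
  data.frame(sample_size)
}

nested <- data %>%
  group_by(name) %>%
  nest() %>%
  # .x is the nested tibble for one group, so .x$metric is that group's numeric vector
  mutate(result = map(data, ~ get_sample_size(.x$metric, threshold = 0.01)))

# Flatten the list column to see the computed values (added for illustration)
nested %>%
  select(name, result) %>%
  unnest(result)
```

Inside mutate, data refers to the list column created by nest, not to the original data frame of the same name, which is another reason to pick more specific names.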

Related

How can I wrangle the data frame based on the parameters inside a nested column?

So let's say we have this df:
a = c(rep(1,5),rep(0,5),rep(1,5),rep(0,5))
b = c(rep(4,5),rep(3,5),rep(2,5),rep(1,5))
c = c(rep("w",5),rep("x",5),rep("y",5),rep("z",5))
df = data.frame(a,b,c)
df = df %>%
nest(data=c(a,b))
I want to use parameters from inside the nested "data" column to do things to the entire dataframe, for example use filter() to eliminate rows where the sum of "a" inside the nested "data" is equal to 0. Or to arrange the rows of the dataframe by the max() of b. How can I do this?
I came up with a pretty dumb way of doing this, but I am not happy with it, as it isn't really applicable to the larger datasets I'm working with:
sum_column = function(df){
df = df %>%
summarize(value=sum(a))
return(df[[1]][1])
}
#so make a new column with the sum of a, and THEN filter by that
df = df %>%
mutate(sum_of_a = map(data, ~sum_column(.x))) %>%
filter(!sum_of_a==0)
map returns a list, perhaps you want map_dbl?
library(dplyr)
library(purrr)
df %>%
mutate(sum_of_a = map_dbl(data, ~ sum(.x$a))) %>%
filter(!sum_of_a == 0)
# # A tibble: 2 × 3
# c data sum_of_a
# <chr> <list> <dbl>
# 1 w <tibble [5 × 2]> 5
# 2 y <tibble [5 × 2]> 5
or more directly (in case you no longer need sum_of_a):
df %>%
filter(abs(map_dbl(data, ~ sum(.x$a))) > 0)
# # A tibble: 2 × 2
# c data
# <chr> <list>
# 1 w <tibble [5 × 2]>
# 2 y <tibble [5 × 2]>
(The only reason I changed from ! . == 0 to abs(.) > 0 is due to floating-point tests of equality, not wanting to assume the precision and scale of numbers you're actually using. C.f., Why are these numbers not equal?, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f.)
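The question also asks about arranging rows by the max() of b, which the answer above does not cover. The same map_dbl pattern should work inside arrange; this is a sketch under that assumption, using the df from the question. The tolerance tol is a hypothetical value I chose for illustration, for cases where floating-point noise around zero is a real concern.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(
  a = c(rep(1, 5), rep(0, 5), rep(1, 5), rep(0, 5)),
  b = c(rep(4, 5), rep(3, 5), rep(2, 5), rep(1, 5)),
  c = c(rep("w", 5), rep("x", 5), rep("y", 5), rep("z", 5))
) %>%
  nest(data = c(a, b))

# Order the outer rows by the largest b inside each nested tibble (ascending)
sorted <- df %>%
  arrange(map_dbl(data, ~ max(.x$b)))

# Filter against a small tolerance instead of exact zero
# (tol is an assumed, illustrative threshold; pick one that suits your data)
tol <- 1e-8
kept <- df %>%
  filter(abs(map_dbl(data, ~ sum(.x$a))) > tol)
```

Wrap the map_dbl call in desc() to sort from largest to smallest instead.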

Conditionally join R tables based on nested membership

Consider this example nested dataframe of 3 counties, 3 towns, and a range of zip codes associated with them. Two of the towns share the same name (B), but are in different counties:
df <- tibble(
county = c(1,1,1,2,2,2,2,3),
town = c("A","A","A","B","B","B","B","B"),
zip = c(12864,12865,12866,89501,89502,89503,89504,76512)) %>%
nest(data=c(zip))
I have another dataframe that contains town names, a zipcode, and a placeholder value, but is missing the county field:
df2 <- tibble(
town = c("A", "B", "B"),
zip = c(12866, 89504, 76512),
value = c("foo", "bar", "ski"))
My real data has hundreds of instances of these duplicated town names, and I need to join these two tables together so that each town gets the correct placeholder value based on the zip code (not the town name, which has duplicates). However, dplyr only seems to join on equality. As such, I'm stuck - what I'm after is something like inner_join(df, df2, by = c(df2$zip %in% df$data$zip)), but that obviously doesn't work.
I'm also aware of data.table being able to handle inequality in joins, but this always seems to relate to greater than/less than conditionals. How can I successfully join these tables to return the following output, in situations where I have more than 3 neatly matched rows between the dataframes?
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 x 1]> foo
2 2 B <tibble [4 x 1]> bar
3 3 B <tibble [1 x 1]> ski
We could do this with map
library(purrr)
library(dplyr)
df %>%
mutate(value = map_chr(data, ~ inner_join(.x, df2, by = 'zip') %>%
pull(value)))
-output
# A tibble: 3 × 4
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 × 1]> foo
2 2 B <tibble [4 × 1]> bar
3 3 B <tibble [1 × 1]> ski
Or another option is regex_inner_join
library(fuzzyjoin)
library(stringr)
library(dplyr)
library(purrr)
df %>%
mutate(zip = map_chr(data, ~ str_c(.x$zip, collapse="|"))) %>%
regex_inner_join(df2 %>%
select(-town), by = "zip") %>%
select(-starts_with('zip'))
-output
# A tibble: 3 × 4
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 × 1]> foo
2 2 B <tibble [4 × 1]> bar
3 3 B <tibble [1 × 1]> ski
I think you'll have to "roll your own join":
df %>% mutate(value = df2$value[
sapply(data, function(x) match(unlist(x), df2$zip) %>% .[!is.na(.)])
])
This works for the example provided, but I'm not clear whether there could be multiple matches to df2$zip within a group of df$data$zip's.
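If multiple matches per group are possible, one way to avoid assuming a single value is to keep all matches in a list column and unnest afterwards. This is only a sketch, assuming the df and df2 from the question; it is not taken from any of the answers above.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble(
  county = c(1, 1, 1, 2, 2, 2, 2, 3),
  town = c("A", "A", "A", "B", "B", "B", "B", "B"),
  zip = c(12864, 12865, 12866, 89501, 89502, 89503, 89504, 76512)
) %>%
  nest(data = c(zip))

df2 <- tibble(
  town = c("A", "B", "B"),
  zip = c(12866, 89504, 76512),
  value = c("foo", "bar", "ski")
)

# Keep every value whose zip appears in the group's nested zip column
matched <- df %>%
  mutate(value = map(data, ~ df2$value[df2$zip %in% .x$zip]))

# One row per (group, match); groups without any match are dropped by default
matched %>% unnest(value)
```

Unlike map_chr, map does not fail when a group has zero or several matches, so the decision about how to handle those cases can be made after the lookup.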

Create column of data frames based on function

I would like to use the map function with the tidyverse to create a column of data frames based on arguments from some, but not all, of the columns of the original data frame/tibble.
I would prefer to be able to use the map function so that I can replace this with future_map to utilize parallel computing.
With the exception of this solution not using map, this solution produces the correct end result (see also this question and answer: How to use rowwise to create a list column based on a function):
library(tidyverse)
library(purrr)
df <- data.frame(a= c(1,2,3), b=c(2,3,4), c=c(6,5,8))
fun <- function(q,y) {
r <- data.frame(col1 = c(q+y, q, q, y), col2 = c(q,q,q,y))
r
}
result1 <- df %>% rowwise(a) %>% mutate(list1 = list(fun(a, b)))
> result1
# A tibble: 3 × 4
# Rowwise: a
a b c list1
<dbl> <dbl> <dbl> <list>
1 1 2 6 <df [4 × 2]>
2 2 3 5 <df [4 × 2]>
3 3 4 8 <df [4 × 2]>
How can I instead do this with map? Here are three incorrect attempts:
Incorrect attempt 1:
wrong1 <- df %>% mutate(list1 = map(list(a,b), fun))
Incorrect attempt 2:
wrong2 <- df %>% mutate(list1 = map(c(a,b), fun))
Incorrect attempt 3:
wrong3 <- df %>% mutate(list1 = list(map(list(a,b), fun)))
The error I get is x argument "y" is missing, with no default. And I am not sure how to pass multiple arguments into a situation like this.
I would like a solution with multiple arguments, but if that is not possible, let's move to a function with one argument.
fun_one_arg <- function(q) {
r <- data.frame(col1 = c(q, q, q, q+q), col2 = c(3*q,q,q,q/2))
r
}
wrong4 <- df %>% mutate(list1 = map(a, fun_one_arg))
wrong5 <- df %>% mutate(list1 = list(map(a, fun_one_arg)))
These run, but the fourth columns are not data frames, as I would have expected.
We can use map2 as there are two arguments
library(dplyr)
df %>%
mutate(list1 = map2(a, b, fun)) %>%
as_tibble
# A tibble: 3 x 4
a b c list1
<dbl> <dbl> <dbl> <list>
1 1 2 6 <df [4 × 2]>
2 2 3 5 <df [4 × 2]>
3 3 4 8 <df [4 × 2]>
Another option is pmap, which can also take more than two columns. Here ..1 and ..2 refer to the columns in the order they are passed.
df %>%
mutate(list1 = pmap(across(c(a, b)), ~ fun(..1, ..2))) %>%
as_tibble
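A variant of the pmap call that may be easier to read passes a named list, so that the list names are matched to the function's argument names rather than relying on position. A minimal sketch, assuming the df and fun from the question:

```r
library(dplyr)
library(purrr)

df <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(6, 5, 8))

fun <- function(q, y) {
  data.frame(col1 = c(q + y, q, q, y), col2 = c(q, q, q, y))
}

# List names are matched to fun's arguments, so the order in the list does not matter
result <- df %>%
  mutate(list1 = pmap(list(y = b, q = a), fun)) %>%
  as_tibble()
```

Since the question mentions future_map: the furrr package provides future_map2() and future_pmap(), which mirror these signatures, so the swap should be mostly mechanical, though that is not tested here.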

nesting categorical variable, bootstrap, then extract median in R

I'm having trouble with what seems like a simple solution. I have a data frame with some locations and each location has a value associated with it. I nested the data.frame by the locations and then bootstrapped the values using purrr (see below).
library(tidyverse)
library(modelr)
library(purrr)
locations <- c("grave","pinkham","lower pinkham", "meadow", "dodge", "young")
values <- rnorm(n = 100, mean = 3, sd = .5)
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))
df.boot <- df %>%
nest(-locations) %>%
mutate(boot = map(data,~bootstrap(.,n=100, id = "values")))
Now I'm trying to get the median from each bootstrap in the final list df.boot$boot, but can't seem to figure it out? I've tried to apply map(boot, median) but the more I dig in the more that doesn't make sense. The wanted vector in the boot list is idx from which I can get the median value and then store it (pretty much what boot function does but iterating by unique categorical variables). Any help would be much appreciated. I might just be going at this the wrong way...
If we need to extract the median
library(dplyr)
library(purrr)
library(modelr)
out <- df %>%
group_by(locations) %>%
nest %>%
mutate(boot = map(data, ~ bootstrap(.x, n = 100, id = 'values') %>%
pull('strap') %>%
map_dbl(~ as_tibble(.x) %>%
pull('values') %>%
median)))
out
# A tibble: 6 x 3
# Groups: locations [6]
# locations data boot
# <fct> <list> <list>
#1 pinkham <tibble [12 × 1]> <dbl [100]>
#2 lower pinkham <tibble [17 × 1]> <dbl [100]>
#3 meadow <tibble [16 × 1]> <dbl [100]>
#4 dodge <tibble [22 × 1]> <dbl [100]>
#5 grave <tibble [21 × 1]> <dbl [100]>
#6 young <tibble [12 × 1]> <dbl [100]>
data
df <- data.frame(values, locations = sample(locations, 100, replace = TRUE))
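If a single summary per location is wanted rather than 100 medians, the vector of bootstrap medians can be reduced further. The sketch below is one way this might look; the seed, the column names medians, boot_median, ci_low and ci_high, and the 95% percentile interval are my own illustrative choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
locations <- c("grave", "pinkham", "lower pinkham", "meadow", "dodge", "young")
df <- data.frame(
  values = rnorm(n = 100, mean = 3, sd = 0.5),
  locations = sample(locations, 100, replace = TRUE)
)

boot_medians <- df %>%
  nest(data = -locations) %>%
  mutate(
    # For each location: draw 100 resamples and take the median of each one
    medians = map(data, function(d) {
      straps <- bootstrap(d, n = 100)$strap
      map_dbl(straps, function(s) median(as.data.frame(s)$values))
    }),
    # Collapse the 100 medians into one estimate and a percentile interval
    boot_median = map_dbl(medians, median),
    ci_low = map_dbl(medians, ~ quantile(.x, 0.025, names = FALSE)),
    ci_high = map_dbl(medians, ~ quantile(.x, 0.975, names = FALSE))
  )
```

Each resample object only stores row indices, so as.data.frame() is what materialises the resampled rows before the median is taken.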

Use filter() (and other dplyr functions) inside nested data frames with map()

I'm trying to use map() from the purrr package to apply filter() to data stored in a nested data frame.
"Why wouldn't you filter first, and then nest?" you might ask.
That will work (and I'll show my desired outcome using such process), but I'm looking for ways to do it with purrr.
I want to have just one data frame, with two list-columns, both being nested data frames - one full and one filtered.
I can achieve it now by performing nest() twice: once on all data, and second on filtered data:
library(tidyverse)
df <- tibble(
a = sample(x = rep(c('x','y'),5), size = 10),
b = sample(c(1:10)),
c = sample(c(91:100))
)
df_full_nested <- df %>%
group_by(a) %>%
nest(.key = 'full')
df_filter_nested <- df %>%
filter(c >= 95) %>% ##this is the key step
group_by(a) %>%
nest(.key = 'filtered')
## Desired outcome - one data frame with 2 nested list-columns: one full and one filtered.
## How to achieve this without breaking it out into 2 separate data frames?
df_nested <- df_full_nested %>%
left_join(df_filter_nested, by = 'a')
The objects look like this:
> df
# A tibble: 10 x 3
a b c
<chr> <int> <int>
1 y 8 93
2 x 9 94
3 y 10 99
4 x 5 97
5 y 2 100
6 y 3 95
7 x 7 96
8 y 6 92
9 x 4 91
10 x 1 98
> df_full_nested
# A tibble: 2 x 2
a full
<chr> <list>
1 y <tibble [5 x 2]>
2 x <tibble [5 x 2]>
> df_filter_nested
# A tibble: 2 x 2
a filtered
<chr> <list>
1 y <tibble [3 x 2]>
2 x <tibble [3 x 2]>
> df_nested
# A tibble: 2 x 3
a full filtered
<chr> <list> <list>
1 y <tibble [5 x 2]> <tibble [3 x 2]>
2 x <tibble [5 x 2]> <tibble [3 x 2]>
So, this works. But it is not clean. And in real life, I group by several columns, which means I also have to join on several columns... It gets hairy fast.
I'm wondering if there is a way to apply filter to the nested column. This way, I'd operate within the same object. Just cleaner and more understandable code.
I'm thinking it'd look like
df_full_nested %>% mutate(filtered = map(full, ...))
But I am not sure how to map filter() properly
Thanks!
You can use map(full, ~ filter(., c >= 95)), where . stands for each individual nested tibble, so you can apply filter to it directly:
df_nested_2 <- df_full_nested %>% mutate(filtered = map(full, ~ filter(., c >= 95)))
identical(df_nested, df_nested_2)
# [1] TRUE
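The same pattern should extend to other dplyr verbs and to scalar summaries of the nested tibbles. Here is a small sketch under that assumption; it uses the newer nest(full = -a) syntax instead of group_by plus .key, and the seed and the columns n_kept and mean_c are hypothetical names added for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
df <- tibble(
  a = sample(x = rep(c("x", "y"), 5), size = 10),
  b = sample(1:10),
  c = sample(91:100)
)

df_nested <- df %>%
  nest(full = -a) %>%
  mutate(
    # Apply a dplyr verb to each nested tibble
    filtered = map(full, ~ filter(.x, c >= 95)),
    # Scalar summaries use the typed map variants so the result is a plain column
    n_kept = map_int(filtered, nrow),
    mean_c = map_dbl(full, ~ mean(.x$c))
  )
```

Because mutate evaluates its arguments in order, filtered is already available when n_kept is computed, so everything stays in one object without a join.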
