I want to run some modelling on each group of variables a and b. The problem is that nest() doesn't include the grouping variables which are needed by the model.
expand.grid(a = LETTERS[1:3], b = LETTERS[1:2], c=1:3, d=1:3) %>%
group_by(a, b) %>%
nest()
The resulting table includes a and b on the "outside" and c and d in the nested tibble. How can I add a and b to the nested tibble?
Using cur_data_all(), this creates a 3-column data frame whose last column, nest, is a list. Each element of that list is the 4-column data frame for one a, b group.
ans <- expand.grid(a = LETTERS[1:3], b = LETTERS[1:2], c=1:3, d=1:3) %>%
group_by(a, b) %>%
summarize(nest = list(cur_data_all()), .groups = "drop")
giving:
> ans
# A tibble: 6 x 3
a b nest
<fct> <fct> <list>
1 A A <tibble [9 x 4]>
2 A B <tibble [9 x 4]>
3 B A <tibble [9 x 4]>
4 B B <tibble [9 x 4]>
5 C A <tibble [9 x 4]>
6 C B <tibble [9 x 4]>
> names(ans$nest[[1]])
[1] "a" "b" "c" "d"
If you want a data frame with just a single column, nest, equal to the nest column above (except for attributes), then this code would work.
expand.grid(a = LETTERS[1:3], b = LETTERS[1:2], c=1:3, d=1:3) %>%
group_modify(~ tibble(nest = group_split(., a, b)))
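If you would rather not depend on cur_data_all() (it was deprecated in later dplyr releases, as far as I know), one possible workaround is to copy the keys before nesting, so one copy stays outside and the other goes into the nested tibble. This is only a sketch: a_key and b_key are made-up helper names, not anything from the question.

```r
library(dplyr)
library(tidyr)

df <- expand.grid(a = LETTERS[1:3], b = LETTERS[1:2], c = 1:3, d = 1:3)

nested <- df %>%
  mutate(a_key = a, b_key = b) %>%   # hypothetical copies of the grouping columns
  nest(data = c(a, b, c, d)) %>%     # the originals go inside; the copies stay outside as keys
  rename(a = a_key, b = b_key)       # give the outer keys their original names back

names(nested$data[[1]])
```

Each element of nested$data then carries a and b alongside c and d, so a model fitted per group can see them.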
So let's say we have this df:
a = c(rep(1,5),rep(0,5),rep(1,5),rep(0,5))
b = c(rep(4,5),rep(3,5),rep(2,5),rep(1,5))
c = c(rep("w",5),rep("x",5),rep("y",5),rep("z",5))
df = data.frame(a,b,c)
df = df %>%
nest(data=c(a,b))
I want to use parameters from inside the nested "data" column to do things to the entire dataframe, for example use filter() to eliminate rows where the sum of "a" inside the nested "data" is equal to 0. Or to arrange the rows of the dataframe by the max() of b. How can I do this?
I came up with a pretty dumb way of doing this, but I am not happy with it, as it isn't really applicable to the larger datasets I'm working with:
sum_column = function(df){
df = df %>%
summarize(value=sum(a))
return(df[[1]][1])
}
# so make a new column with the sum of a, and THEN filter by that
df = df %>%
mutate(sum_of_a = map(data, ~sum_column(.x))) %>%
filter(!sum_of_a==0)
map returns a list, perhaps you want map_dbl?
library(dplyr)
library(purrr)
df %>%
mutate(sum_of_a = map_dbl(data, ~ sum(.x$a))) %>%
filter(!sum_of_a == 0)
# # A tibble: 2 × 3
# c data sum_of_a
# <chr> <list> <dbl>
# 1 w <tibble [5 × 2]> 5
# 2 y <tibble [5 × 2]> 5
or more directly (in case you no longer need sum_of_a):
df %>%
filter(abs(map_dbl(data, ~ sum(.x$a))) > 0)
# # A tibble: 2 × 2
# c data
# <chr> <list>
# 1 w <tibble [5 × 2]>
# 2 y <tibble [5 × 2]>
(The only reason I changed from ! . == 0 to abs(.) > 0 is due to floating-point tests of equality, not wanting to assume the precision and scale of numbers you're actually using. C.f., Why are these numbers not equal?, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f.)
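The question also asked about arranging by the max() of b, and the same pattern should carry over: compute one number per nested tibble with map_dbl and hand it to arrange(). A minimal sketch that rebuilds the example data as a tibble:

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble(
  a = rep(c(1, 0, 1, 0), each = 5),
  b = rep(c(4, 3, 2, 1), each = 5),
  c = rep(c("w", "x", "y", "z"), each = 5)
) %>%
  nest(data = c(a, b))

# Order the outer rows by the largest b found inside each nested tibble
sorted <- df %>%
  arrange(map_dbl(data, ~ max(.x$b)))

sorted$c
```

Wrap the map_dbl() call in desc() to sort from largest to smallest instead.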
Consider this example nested dataframe of 3 counties, 3 towns, and a range of zip codes associated with them. Two of the towns share the same name (B), but are in different counties:
df <- tibble(
county = c(1,1,1,2,2,2,2,3),
town = c("A","A","A","B","B","B","B","B"),
zip = c(12864,12865,12866,89501,89502,89503,89504,76512)) %>%
nest(data=c(zip))
I have another dataframe that contains town names, a zipcode, and a placeholder value, but is missing the county field:
df2 <- tibble(
town = c("A", "B", "B"),
zip = c(12866, 89504, 76512),
value = c("foo", "bar", "ski"))
My real data has hundreds of instances of these duplicated town names, and I need to join these two tables together so that each town gets the correct placeholder value based on the zip code (not the town name, which has duplicates). However, dplyr only seems to join on equality. As such, I'm stuck - what I'm after is something like inner_join(df, df2, by = c(df2$zip %in% df$data$zip)), but that obviously doesn't work.
I'm also aware of data.table being able to handle inequality in joins, but this always seems to relate to greater than/less than conditionals. How can I successfully join these tables to return the following output, in situations where I have more than 3 neatly matched rows between the dataframes?
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 x 1]> foo
2 2 B <tibble [4 x 1]> bar
3 3 B <tibble [1 x 1]> ski
We could do this with map
library(purrr)
library(dplyr)
df %>%
mutate(value = map_chr(data, ~ inner_join(.x, df2, by = 'zip') %>%
pull(value)))
-output
# A tibble: 3 × 4
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 × 1]> foo
2 2 B <tibble [4 × 1]> bar
3 3 B <tibble [1 × 1]> ski
Or another option is regex_inner_join
library(fuzzyjoin)
library(stringr)
library(dplyr)
library(purrr)
df %>%
mutate(zip = map_chr(data, ~ str_c(.x$zip, collapse="|"))) %>%
regex_inner_join(df2 %>%
select(-town), by = "zip") %>%
select(-starts_with('zip'))
-output
# A tibble: 3 × 4
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 × 1]> foo
2 2 B <tibble [4 × 1]> bar
3 3 B <tibble [1 × 1]> ski
I think you'll have to "roll your own join":
df %>% mutate(value = df2$value[
sapply(data, function(x) match(unlist(x), df2$zip) %>% .[!is.na(.)])
])
This works for the example provided, but I'm not clear whether there could be multiple matches to df2$zip within a group of df$data$zip's.
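If multiple or missing matches are a concern, another option (a sketch, not tested on anything beyond the example data) is to unnest temporarily, do an ordinary equality join on zip, and then attach the result back to the nested table by its keys. The name lookup is just a helper I introduced.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  county = c(1, 1, 1, 2, 2, 2, 2, 3),
  town = c("A", "A", "A", "B", "B", "B", "B", "B"),
  zip = c(12864, 12865, 12866, 89501, 89502, 89503, 89504, 76512)
) %>%
  nest(data = c(zip))

df2 <- tibble(
  town = c("A", "B", "B"),
  zip = c(12866, 89504, 76512),
  value = c("foo", "bar", "ski")
)

# One row per (county, town, value) that has at least one matching zip
lookup <- df %>%
  unnest(data) %>%
  inner_join(select(df2, zip, value), by = "zip") %>%
  distinct(county, town, value)

# Attach the values to the still-nested table
result <- df %>%
  left_join(lookup, by = c("county", "town"))

result$value
```

A group with no match would get NA, and a group matching several distinct values would come back as several rows, so both cases at least become visible instead of raising an error.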
I would like to use the map function with the tidyverse to create a column of data frames based on arguments from some, but not all, of the columns of the original data frame/tibble.
I would prefer to be able to use the map function so that I can replace this with future_map to utilize parallel computing.
With the exception of this solution not using map, this solution produces the correct end result (see also this question and answer: How to use rowwise to create a list column based on a function):
library(tidyverse)
library(purrr)
df <- data.frame(a= c(1,2,3), b=c(2,3,4), c=c(6,5,8))
fun <- function(q,y) {
r <- data.frame(col1 = c(q+y, q, q, y), col2 = c(q,q,q,y))
r
}
result1 <- df %>% rowwise(a) %>% mutate(list1 = list(fun(a, b)))
> result1
# A tibble: 3 × 4
# Rowwise: a
a b c list1
<dbl> <dbl> <dbl> <list>
1 1 2 6 <df [4 × 2]>
2 2 3 5 <df [4 × 2]>
3 3 4 8 <df [4 × 2]>
How can I instead do this with map? Here are three incorrect attempts:
Incorrect attempt 1:
wrong1 <- df %>% mutate(list1 = map(list(a,b), fun))
Incorrect attempt 2:
wrong2 <- df %>% mutate(list1 = map(c(a,b), fun))
Incorrect attempt 3:
wrong3 <- df %>% mutate(list1 = list(map(list(a,b), fun)))
The error I get is x argument "y" is missing, with no default. And I am not sure how to pass multiple arguments into a situation like this.
I would like a solution with multiple arguments, but if that is not possible, let's move to a function with one argument.
fun_one_arg <- function(q) {
r <- data.frame(col1 = c(q, q, q, q+q), col2 = c(3*q,q,q,q/2))
r
}
wrong4 <- df %>% mutate(list1 = map(a, fun_one_arg))
wrong5 <- df %>% mutate(list1 = list(map(a, fun_one_arg)))
These run, but the fourth columns are not data frames, as I would have expected.
We can use map2 as there are two arguments
library(dplyr)
library(purrr)
df %>%
mutate(list1 = map2(a, b, fun)) %>%
as_tibble
# A tibble: 3 x 4
a b c list1
<dbl> <dbl> <dbl> <list>
1 1 2 6 <df [4 × 2]>
2 2 3 5 <df [4 × 2]>
3 3 4 8 <df [4 × 2]>
Another option is pmap, which can also take more than two columns. Here ..1 and ..2 refer to the selected columns, in order:
df %>%
mutate(list1 = pmap(across(c(a, b)), ~ fun(..1, ..2))) %>%
as_tibble
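pmap can also match by name, which may be easier to read than ..1 and ..2: name the list elements after the function's arguments. The sketch below also runs the one-argument function from the question through plain map. As far as I can tell that attempt already produced data frames, and it is the base data.frame print method that makes the list column look wrong, hence the as_tibble() at the end.

```r
library(dplyr)
library(purrr)

df <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(6, 5, 8))

fun <- function(q, y) {
  data.frame(col1 = c(q + y, q, q, y), col2 = c(q, q, q, y))
}

fun_one_arg <- function(q) {
  data.frame(col1 = c(q, q, q, q + q), col2 = c(3 * q, q, q, q / 2))
}

result <- df %>%
  mutate(
    list1 = pmap(list(q = a, y = b), fun),  # list names are matched to fun's arguments
    list2 = map(a, fun_one_arg)             # one data frame per element of a
  ) %>%
  as_tibble()                               # tibbles print list columns compactly

result$list1[[1]]
```

The furrr package offers future_map2() and future_pmap() as parallel counterparts, so the same call shape should carry over, though I have not run this in parallel here.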
I downloaded a data set from the web. It's got 6 columns, and the 6th column is filled with other dataframes. So, an example:
id homeTeam homeScore awayTeam away stats
401112436 Louisville 17 Notre Dame 35 <data.frame [4 × 4]>
401112114 Oklahoma 49 Houston 31 <data.frame [4 × 4]>
401114218 USC 31 Fresno State 23 <data.frame [4 × 4]>
I want to create a new column in the original dataframe with the value in row 1, column 2 of the "stats" dataframe for each row.
I added a row_id column with the row number, and tried
df$new_col <- df$stats[[df$row_id]][1,2]
but I'm getting a recursive error. When I hard code a number
df$stats[[1]][1,2]
it returns the correct number. I don't know why it wouldn't work with the row_id value just the same.
We can use pluck from purrr
library(dplyr)
library(purrr)
df %>% mutate(new_col = map_dbl(stats, pluck, 2, 1))
Using a reproducible example :
temp <- data.frame(a = 1:4, b = 2:5)
df <- tibble(a = 1:2, b = 6:7, c = list(temp, temp))
df %>% mutate(new_col = map_dbl(c, purrr::pluck, 2, 1))
# a b c new_col
# <int> <int> <list> <dbl>
#1 1 6 <df[,2] [4 × 2]> 2
#2 2 7 <df[,2] [4 × 2]> 2
With map, we loop over the 'stats' column, extract the second column, first element to create the 'new_col' in mutate and unnest the list element
library(purrr)
library(dplyr)
library(tidyr)
df <- df %>%
mutate(new_col = map(stats, ~ .x[[2]][1])) %>%
unnest(c(new_col))
df
# A tibble: 2 x 4
# a b stats new_col
# <int> <int> <list> <int>
#1 1 6 <df[,2] [4 × 2]> 2
#2 2 7 <df[,2] [4 × 2]> 2
If the column is character, use map_chr; if it is double, use map_dbl. If we don't know the type, simply use map to return a list column and then unnest it.
Or in base R
df$new_col <- sapply(df$stats, function(x) x[[2]][1])
data
temp <- data.frame(a = 1:4, b = 2:5)
df <- tibble(a = 1:2, b = 6:7, stats = list(temp, temp))
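As for why the original attempt fails: [[ given a vector of indices does recursive indexing (it drills down one level per index) rather than picking one element per row, so it has to be applied row by row. A base R sketch with vapply, which also checks that every extracted value is a single number; the data here is made up purely for illustration:

```r
# Two small made-up "stats" data frames standing in for the downloaded ones
s1 <- data.frame(stat = c("x", "y"), value = c(2.5, 11))
s2 <- data.frame(stat = c("x", "y"), value = c(7, 14))

df <- data.frame(id = c(1, 2))
df$stats <- list(s1, s2)   # list column: one data frame per row

# Row 1, column 2 of each nested data frame; numeric(1) enforces one number each
df$new_col <- vapply(df$stats, function(x) x[1, 2], numeric(1))

df$new_col
```

If column 2 were character in your data, character(1) would be the matching template.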
I'm trying to use map() of purrr package to apply filter() function to the data stored in a nested data frame.
"Why wouldn't you filter first, and then nest?" you might ask.
That will work (and I'll show my desired outcome using such process), but I'm looking for ways to do it with purrr.
I want to have just one data frame, with two list-columns, both being nested data frames - one full and one filtered.
I can achieve it now by performing nest() twice: once on all data, and second on filtered data:
library(tidyverse)
df <- tibble(
a = sample(x = rep(c('x','y'),5), size = 10),
b = sample(c(1:10)),
c = sample(c(91:100))
)
df_full_nested <- df %>%
group_by(a) %>%
nest(.key = 'full')
df_filter_nested <- df %>%
filter(c >= 95) %>% ##this is the key step
group_by(a) %>%
nest(.key = 'filtered')
## Desired outcome - one data frame with 2 nested list-columns: one full and one filtered.
## How to achieve this without breaking it out into 2 separate data frames?
df_nested <- df_full_nested %>%
left_join(df_filter_nested, by = 'a')
The objects look like this:
> df
# A tibble: 10 x 3
a b c
<chr> <int> <int>
1 y 8 93
2 x 9 94
3 y 10 99
4 x 5 97
5 y 2 100
6 y 3 95
7 x 7 96
8 y 6 92
9 x 4 91
10 x 1 98
> df_full_nested
# A tibble: 2 x 2
a full
<chr> <list>
1 y <tibble [5 x 2]>
2 x <tibble [5 x 2]>
> df_filter_nested
# A tibble: 2 x 2
a filtered
<chr> <list>
1 y <tibble [3 x 2]>
2 x <tibble [3 x 2]>
> df_nested
# A tibble: 2 x 3
a full filtered
<chr> <list> <list>
1 y <tibble [5 x 2]> <tibble [3 x 2]>
2 x <tibble [5 x 2]> <tibble [3 x 2]>
So, this works. But it is not clean. And in real life, I group by several columns, which means I also have to join on several columns... It gets hairy fast.
I'm wondering if there is a way to apply filter to the nested column. This way, I'd operate within the same object. Just cleaner and more understandable code.
I'm thinking it'd look like
df_full_nested %>% mutate(filtered = map(full, ...))
But I am not sure how to map filter() properly
Thanks!
You can use map(full, ~ filter(., c >= 95)), where . stands for each individual nested tibble, so the filter is applied to it directly:
df_nested_2 <- df_full_nested %>% mutate(filtered = map(full, ~ filter(., c >= 95)))
identical(df_nested, df_nested_2)
# [1] TRUE
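Since the question mentions grouping by several columns in real life, here is a sketch of the same idea with two keys. No join is needed because the filtered column is built from the full column row by row. The column names g and score and the data are invented for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble(
  a = rep(c("x", "y"), each = 4),
  g = rep(c("p", "q"), times = 4),
  score = 93:100
)

nested <- df %>%
  nest(full = c(score)) %>%                            # a and g remain as the keys
  mutate(
    filtered = map(full, ~ filter(.x, score >= 95)),   # filter inside each nested tibble
    n_kept = map_int(filtered, nrow)                   # rows surviving per group
  )

nested$n_kept
```

Groups where nothing passes the filter would keep their row, with a zero-row tibble in filtered, which is different from the filter-then-nest approach, where such groups disappear.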