Conditionally join R tables based on nested membership - r

Consider this example nested dataframe of 3 counties, 3 towns, and a range of zip codes associated with them. Two of the towns share the same name (B), but are in different counties:
df <- tibble(
county = c(1,1,1,2,2,2,2,3),
town = c("A","A","A","B","B","B","B","B"),
zip = c(12864,12865,12866,89501,89502,89503,89504,76512)) %>%
nest(data=c(zip))
I have another dataframe that contains town names, a zipcode, and a placeholder value, but is missing the county field:
df2 <- tibble(
town = c("A", "B", "B"),
zip = c(12866, 89504, 76512),
value = c("foo", "bar", "ski"))
My real data has hundreds of instances of these duplicated town names, and I need to join these two tables together so that each town gets the correct placeholder value based on the zip code (not the town name, which has duplicates). However, dplyr only seems to join on equality. As such, I'm stuck - what I'm after is something like inner_join(df, df2, by = c(df2$zip %in% df$data$zip)), but that obviously doesn't work.
I'm also aware of data.table being able to handle inequality in joins, but this always seems to relate to greater than/less than conditionals. How can I successfully join these tables to return the following output, in situations where I have more than 3 neatly matched rows between the dataframes?
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 x 1]> foo
2 2 B <tibble [4 x 1]> bar
3 3 B <tibble [1 x 1]> ski

We could do this with map
library(purrr)
library(dplyr)
df %>%
mutate(value = map_chr(data, ~ inner_join(.x, df2, by = 'zip') %>%
pull(value)))
-output
# A tibble: 3 × 4
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 × 1]> foo
2 2 B <tibble [4 × 1]> bar
3 3 B <tibble [1 × 1]> ski
Or another option is regex_inner_join
library(fuzzyjoin)
library(stringr)
library(dplyr)
library(purrr)
df %>%
mutate(zip = map_chr(data, ~ str_c(.x$zip, collapse="|"))) %>%
regex_inner_join(df2 %>%
select(-town), by = "zip") %>%
select(-starts_with('zip'))
-output
# A tibble: 3 × 4
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 × 1]> foo
2 2 B <tibble [4 × 1]> bar
3 3 B <tibble [1 × 1]> ski

I think you'll have to "roll your own join":
df %>% mutate(value = df2$value[
sapply(data, function(x) match(unlist(x), df2$zip) %>% .[!is.na(.)])
])
This works for the example provided, but I'm not clear whether there could be multiple matches to df2$zip within a group of df$data$zip's.
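If several zips inside one nested group could match df2, one alternative is to flatten the nested column, do an ordinary equality join on zip, and attach the result back to the nested table. The following is only a sketch under that assumption; the names lookup and result are illustrative and not from the question.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  county = c(1, 1, 1, 2, 2, 2, 2, 3),
  town = c("A", "A", "A", "B", "B", "B", "B", "B"),
  zip = c(12864, 12865, 12866, 89501, 89502, 89503, 89504, 76512)
) %>%
  nest(data = c(zip))

df2 <- tibble(
  town = c("A", "B", "B"),
  zip = c(12866, 89504, 76512),
  value = c("foo", "bar", "ski")
)

# One row per zip, plain equality join on zip, then keep the
# distinct county/town/value combinations (hypothetical helper table).
lookup <- df %>%
  unnest(data) %>%
  inner_join(select(df2, zip, value), by = "zip") %>%
  distinct(county, town, value)

# Attach the values back to the nested table.
result <- df %>%
  left_join(lookup, by = c("county", "town"))
```

With this approach, a group whose zips match more than one distinct value in df2 appears once per value instead of raising an error, so duplicates stay visible and can be handled explicitly.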

Related

How can I wrangle the data frame based on the parameters inside a nested column?

So let's say we have this df:
a = c(rep(1,5),rep(0,5),rep(1,5),rep(0,5))
b = c(rep(4,5),rep(3,5),rep(2,5),rep(1,5))
c = c(rep("w",5),rep("x",5),rep("y",5),rep("z",5))
df = data.frame(a,b,c)
df = df %>%
nest(data=c(a,b))
I want to use parameters from inside the nested "data" column to do things to the entire dataframe, for example use filter() to eliminate rows where the sum of "a" inside the nested "data" is equal to 0. Or to arrange the rows of the dataframe by the max() of b. How can I do this?
I came up with a pretty dumb way of doing this, but I am not happy with it, as it isn't really applicable to the larger datasets I'm working with:
sum_column = function(df){
df = df %>%
summarize(value=sum(a))
return(df[[1]][1])
}
# so make a new column with the sum of a, and THEN filter by that
df = df %>%
mutate(sum_of_a = map(data, ~sum_column(.x))) %>%
filter(!sum_of_a==0)
map returns a list, perhaps you want map_dbl?
library(dplyr)
library(purrr)
df %>%
mutate(sum_of_a = map_dbl(data, ~ sum(.x$a))) %>%
filter(!sum_of_a == 0)
# # A tibble: 2 × 3
# c data sum_of_a
# <chr> <list> <dbl>
# 1 w <tibble [5 × 2]> 5
# 2 y <tibble [5 × 2]> 5
or more directly (in case you no longer need sum_of_a):
df %>%
filter(abs(map_dbl(data, ~ sum(.x$a))) > 0)
# # A tibble: 2 × 2
# c data
# <chr> <list>
# 1 w <tibble [5 × 2]>
# 2 y <tibble [5 × 2]>
(The only reason I changed from ! . == 0 to abs(.) > 0 is due to floating-point tests of equality, not wanting to assume the precision and scale of numbers you're actually using. C.f., Why are these numbers not equal?, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f.)
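The question also asks about arranging rows by the max() of b inside the nested column. The same map_dbl idea appears to carry over to arrange(); this is a minimal sketch, with the object names nested and sorted chosen only for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

nested <- tibble(
  a = c(rep(1, 5), rep(0, 5), rep(1, 5), rep(0, 5)),
  b = c(rep(4, 5), rep(3, 5), rep(2, 5), rep(1, 5)),
  c = c(rep("w", 5), rep("x", 5), rep("y", 5), rep("z", 5))
) %>%
  nest(data = c(a, b))

# Order the outer rows by the largest b found in each nested tibble.
sorted <- nested %>%
  arrange(map_dbl(data, ~ max(.x$b)))
```

Wrapping the expression in desc() should give the reverse order.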

In R, a column of my dataframe is populated with other dataframes. I want to return a specific value as a new column in the original dataframe

I downloaded a data set from the web. It's got 6 columns, and the 6th column is filled with other dataframes. So, an example:
id homeTeam homeScore awayTeam away stats
401112436 Louisville 17 Notre Dame 35 <data.frame [4 × 4]>
401112114 Oklahoma 49 Houston 31 <data.frame [4 × 4]>
401114218 USC 31 Fresno State 23 <data.frame [4 × 4]>
I want to create a new column in the original dataframe with the value in row 1, column 2 of the "stats" dataframe for each row.
I added a row_id column with the row number, and tried
df$new_col <- df$stats[[df$row_id]][1,2]
but I'm getting a recursive error. When I hard code a number
df$stats[[1]][1,2]
it returns the correct number. I don't know why it wouldn't work with the row_id value just the same.
We can use pluck from purrr
library(dplyr)
library(purrr)
df %>% mutate(new_col = map_dbl(stats, pluck, 2, 1))
Using a reproducible example:
temp <- data.frame(a = 1:4, b = 2:5)
df <- tibble(a = 1:2, b = 6:7, c = list(temp, temp))
df %>% mutate(new_col = map_dbl(c, purrr::pluck, 2, 1))
# a b c new_col
# <int> <int> <list> <dbl>
#1 1 6 <df[,2] [4 × 2]> 2
#2 2 7 <df[,2] [4 × 2]> 2
With map, we loop over the 'stats' column, extract the second column, first element to create the 'new_col' in mutate and unnest the list element
library(purrr)
library(dplyr)
library(tidyr)
df <- df %>%
mutate(new_col = map(stats, ~ .x[[2]][1])) %>%
unnest(c(new_col))
df
# A tibble: 2 x 4
# a b stats new_col
# <int> <int> <list> <int>
#1 1 6 <df[,2] [4 × 2]> 2
#2 2 7 <df[,2] [4 × 2]> 2
If the column is character, use map_chr, if it is double, use map_dbl or if we don't know the type, then simply use map to return a list column and then unnest
Or in base R
df$new_col <- sapply(df$stats, function(x) x[[2]][1])
data
temp <- data.frame(a = 1:4, b = 2:5)
df <- tibble(a = 1:2, b = 6:7, stats = list(temp, temp))
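As for why the original attempt fails: [[ given a vector index does recursive indexing, so df$stats[[df$row_id]] is read as df$stats[[1]][[2]][[3]]... and not as "element i for row i", which is likely where the recursive error comes from. A small sketch with made-up data (the stats list and row_id vector below are stand-ins, not the downloaded data set):

```r
# Two small data frames standing in for the nested "stats" column.
stats <- list(
  data.frame(a = 1:4, b = 2:5),
  data.frame(a = 1:4, b = 12:15)
)
row_id <- seq_along(stats)

# A vector inside [[ ]] indexes recursively:
# stats[[c(1, 2)]] is the same as stats[[1]][[2]], i.e. column b
# of the first data frame, not one value per row.
recursive <- stats[[c(1, 2)]]

# Extracting row 1, column 2 separately for each element instead.
new_col <- vapply(row_id, function(i) stats[[i]][1, 2], integer(1))
```

This is why hard-coding a single number works while passing the whole row_id column does not.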

Use filter() (and other dplyr functions) inside nested data frames with map()

I'm trying to use map() of purrr package to apply filter() function to the data stored in a nested data frame.
"Why wouldn't you filter first, and then nest?" you might ask.
That will work (and I'll show my desired outcome using such process), but I'm looking for ways to do it with purrr.
I want to have just one data frame, with two list-columns, both being nested data frames - one full and one filtered.
I can achieve it now by performing nest() twice: once on all data, and second on filtered data:
library(tidyverse)
df <- tibble(
a = sample(x = rep(c('x','y'),5), size = 10),
b = sample(c(1:10)),
c = sample(c(91:100))
)
df_full_nested <- df %>%
group_by(a) %>%
nest(.key = 'full')
df_filter_nested <- df %>%
filter(c >= 95) %>% ##this is the key step
group_by(a) %>%
nest(.key = 'filtered')
## Desired outcome - one data frame with 2 nested list-columns: one full and one filtered.
## How to achieve this without breaking it out into 2 separate data frames?
df_nested <- df_full_nested %>%
left_join(df_filter_nested, by = 'a')
The objects look like this:
> df
# A tibble: 10 x 3
a b c
<chr> <int> <int>
1 y 8 93
2 x 9 94
3 y 10 99
4 x 5 97
5 y 2 100
6 y 3 95
7 x 7 96
8 y 6 92
9 x 4 91
10 x 1 98
> df_full_nested
# A tibble: 2 x 2
a full
<chr> <list>
1 y <tibble [5 x 2]>
2 x <tibble [5 x 2]>
> df_filter_nested
# A tibble: 2 x 2
a filtered
<chr> <list>
1 y <tibble [3 x 2]>
2 x <tibble [3 x 2]>
> df_nested
# A tibble: 2 x 3
a full filtered
<chr> <list> <list>
1 y <tibble [5 x 2]> <tibble [3 x 2]>
2 x <tibble [5 x 2]> <tibble [3 x 2]>
So, this works. But it is not clean. And in real life, I group by several columns, which means I also have to join on several columns... It gets hairy fast.
I'm wondering if there is a way to apply filter to the nested column. This way, I'd operate within the same object. Just cleaner and more understandable code.
I'm thinking it'd look like
df_full_nested %>% mutate(filtered = map(full, ...))
But I am not sure how to map filter() properly
Thanks!
You can use map(full, ~ filter(., c >= 95)), where . stands for individual nested tibble, to which you can apply the filter directly:
df_nested_2 <- df_full_nested %>% mutate(filtered = map(full, ~ filter(., c >= 95)))
identical(df_nested, df_nested_2)
# [1] TRUE
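For reference, here is a self-contained variant of the same idea with deterministic data, written with the nest(name = columns) form used by current tidyr rather than .key. The n_kept column is a hypothetical addition for checking the result.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Deterministic stand-in for the sampled data in the question.
df <- tibble(
  a = rep(c("x", "y"), each = 5),
  b = 1:10,
  c = 91:100
)

df_nested <- df %>%
  nest(full = c(b, c)) %>%
  mutate(
    # Apply filter() to each nested tibble separately.
    filtered = map(full, ~ filter(.x, c >= 95)),
    # Row count of each filtered tibble, for a quick sanity check.
    n_kept = map_int(filtered, nrow)
  )
```

Because everything happens inside one mutate(), no join is needed, however many grouping columns are involved.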

How to add calculated columns to nested data frames (list columns) using purrr

I would like to perform calculations on a nested data frame (stored as a list-column), and add the calculated variable back to each dataframe using purrr functions. I'll use this result to join to other data, and keeping it compact helps me to organize and examine it better. I can do this in a couple of steps, but it seems like there may be a solution I haven't come across. If there is a solution out there, I haven't been able to find it easily.
Load libraries. The example requires the following packages (available on CRAN):
library(dplyr)
library(tidyr) # for nest() and unnest()
library(purrr)
library(RcppRoll) # to calculate rolling mean
Example data with 3 subjects, and repeated measurements over time:
test <- data_frame(
id= rep(1:3, each=20),
time = rep(1:20, 3),
var1 = rnorm(60, mean=10, sd=3),
var2 = rnorm(60, mean=95, sd=5)
)
Store the data as nested dataframe:
t_nest <- test %>% nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 3]>
2 2 <tibble [20 x 3]>
3 3 <tibble [20 x 3]>
Perform calculations. I will calculate multiple new variables based on the data, although a solution for just one could be expanded later. The result of each calculation will be a numeric vector, same length as the input (n=20):
t1 <- t_nest %>%
mutate(var1_rollmean4 = map(data, ~RcppRoll::roll_mean(.$var1, n=4, align="right", fill=NA)),
var2_delta4 = map(data, ~(.$var2 - lag(.$var2, 3))*0.095),
var3 = map2(var1_rollmean4, var2_delta4, ~.x -.y))
id data var1_rollmean4 var2_delta4 var3
<int> <list> <list> <list> <list>
1 1 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
2 2 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
3 3 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
My solution is to unnest this data, and then nest again. There doesn't seem to be anything wrong with this, but it seems like a better solution may exist.
t1 %>% unnest %>%
nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 6]>
2 2 <tibble [20 x 6]>
3 3 <tibble [20 x 6]>
This other solution (from SO 42028710) is close, but not quite because it is a list rather than nested dataframes:
map_df(t_nest$data, ~ mutate(.x, var1calc = .$var1*100))
I've found quite a bit of helpful information using the purrr Cheatsheet but can't quite find the answer.
You can wrap another mutate when mapping through the data column and add the columns in each nested tibble:
t11 <- t_nest %>%
mutate(data = map(data,
~ mutate(.x,
var1_rollmean4 = RcppRoll::roll_mean(var1, n=4, align="right", fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = var1_rollmean4 - var2_delta4
)
))
t11
# A tibble: 3 x 2
# id data
# <int> <list>
#1 1 <tibble [20 x 6]>
#2 2 <tibble [20 x 6]>
#3 3 <tibble [20 x 6]>
unnest-nest method, and then reorder the columns inside:
nest_unnest <- t1 %>%
unnest %>% nest(-id) %>%
mutate(data = map(data, ~ select(.x, time, var1, var2, var1_rollmean4, var2_delta4, var3)))
identical(nest_unnest, t11)
# [1] TRUE
It seems like for what you're trying to do, nesting is not necessary
library(tidyverse)
library(zoo)
test %>%
group_by(id) %>%
mutate(var1_rollmean4 = rollapplyr(var1, 4, mean, fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = (var1_rollmean4 - var2_delta4))
# A tibble: 60 x 7
# Groups: id [3]
# id time var1 var2 var1_rollmean4 var2_delta4 var3
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 9.865199 96.45723 NA NA NA
# 2 1 2 9.951429 92.78354 NA NA NA
# 3 1 3 12.831509 95.00553 NA NA NA
# 4 1 4 12.463664 95.37171 11.277950 -0.10312483 11.381075
# 5 1 5 11.781704 92.05240 11.757076 -0.06945881 11.826535
# 6 1 6 12.756932 92.15666 12.458452 -0.27064269 12.729095
# 7 1 7 12.346409 94.32411 12.337177 -0.09952197 12.436699
# 8 1 8 10.223695 100.89043 11.777185 0.83961377 10.937571
# 9 1 9 4.031945 87.38217 9.839745 -0.45357658 10.293322
# 10 1 10 11.859477 97.96973 9.615382 0.34633428 9.269047
# ... with 50 more rows
Edit: You could still nest the result with %>% nest(-id).
If you still prefer to nest or are nesting for other reasons, it would go like
t1 <- t_nest %>%
mutate(data = map(data, ~.x %>% mutate(...)))
That is, you mutate on .x within the map statement. This will treat data as a data.frame and mutate will column-bind results to it.
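A compact, runnable version of the "mutate inside map" pattern, using small fixed data and only dplyr::lag so that no rolling-mean package is required. It assumes a tidyr version where nest(data = -id) is the spelling of the older nest(-id); the column var2_delta is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

test <- tibble(
  id = rep(1:3, each = 5),
  time = rep(1:5, 3),
  var2 = rep(c(90, 92, 95, 97, 100), 3)
)

# Nest everything except id into a list-column called data.
t_nest <- test %>% nest(data = -id)

# mutate() runs on each nested tibble, so the new column
# ends up inside the list-column rather than next to it.
t_out <- t_nest %>%
  mutate(data = map(data, ~ mutate(.x, var2_delta = var2 - dplyr::lag(var2, 3))))
```

The outer table keeps its shape (one row per id), and each nested tibble gains one column.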

How to count rows in nested data_frames with dplyr

Here's a dumb example dataframe:
df <- data_frame(A = c(rep(1, 5), rep(2, 4)), B = 1:9) %>%
group_by(A) %>%
nest()
which looks like this:
> df
# A tibble: 2 × 2
A data
<dbl> <list>
1 1 <tibble [5 × 1]>
2 2 <tibble [4 × 1]>
I would like to add a third column called N with entries equal to the number of rows in each nested data_frame in data. I figured this would work:
> df %>%
+ mutate(N = nrow(data))
Error: Unsupported type NILSXP for column "N"
What's going wrong?
Combining dplyr and purrr you could do:
library(tidyverse)
df %>%
mutate(n = map_dbl(data, nrow))
#> # A tibble: 2 × 3
#> A data n
#> <dbl> <list> <dbl>
#> 1 1 <tibble [5 × 1]> 5
#> 2 2 <tibble [4 × 1]> 4
I like this approach, because you stay within your usual workflow, creating a new column within mutate, but leveraging the map_*-family, since you need to operate on a list.
You could do:
df %>%
rowwise() %>%
mutate(N = nrow(data))
Which gives:
#Source: local data frame [2 x 3]
#Groups: <by row>
#
## A tibble: 2 × 3
# A data N
# <dbl> <list> <int>
#1 1 <tibble [5 × 1]> 5
#2 2 <tibble [4 × 1]> 4
With dplyr:
df %>%
group_by(A) %>%
mutate(N = nrow(data.frame(data)))
A data N
<dbl> <list> <int>
1 1 <tibble [5 × 1]> 5
2 2 <tibble [4 × 1]> 4
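On the "What's going wrong?" part: inside mutate(), data refers to the whole list-column, not to one nested tibble at a time. nrow() of a list returns NULL because a list has no dimensions, and mutate() cannot build a column from NULL. A sketch of the difference, using tibble() and nest(data = B) as assumed current equivalents of the data_frame()/group_by()/nest() calls in the question:

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble(A = c(rep(1, 5), rep(2, 4)), B = 1:9) %>%
  nest(data = B)

# nrow() on a plain list is NULL: a list has no dim attribute.
list_rows <- nrow(list(1, 2))

# Calling nrow() once per element gives the per-group row counts.
df_n <- df %>%
  mutate(N = map_int(data, nrow))
```

map_int() is used here so that N is an integer column; map_dbl() as in the answer above works as well.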
