Dplyr logical membership test of nested data - r

I'm trying to create a logical test for membership of a dataframe variable in a nested column. Using mtcars as a stand-in, I can generally replicate what I'm trying to do (though the process may seem inefficient or circuitous, since it's not my real data):
library(dplyr)
library(tidyr) # provides nest()
m <- mtcars %>%
group_by(cyl) %>%
summarize(grz = unique(gear)) %>%
nest(data = c(cyl))
Which produces a nested column of cylinders (data) associated with the grz variable:
# A tibble: 3 x 2
grz data
<dbl> <list>
1 4 <grouped_df [2 x 1]>
2 3 <grouped_df [3 x 1]>
3 5 <grouped_df [3 x 1]>
I want to add a column testing if the value of grz is present in the nested data column, and can't seem to figure out why this doesn't work:
library(purrr)
m %>% mutate(test = map2_lgl(.x = data, .y = grz, ~ .y %in% .x))
# A tibble: 3 x 3
grz data test
<dbl> <list> <lgl>
1 4 <grouped_df [2 x 1]> FALSE
2 3 <grouped_df [3 x 1]> FALSE
3 5 <grouped_df [3 x 1]> FALSE
The first row of grz (value of 4) should produce a TRUE boolean, while the other two should be FALSE.

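We need to extract the column, because the table argument of %in% should be a vector. A data frame is treated as a list of columns, so grz is matched against each whole column rather than against the values inside it: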
library(dplyr)
library(purrr)
m %>%
mutate(test = map2_lgl(data, grz, ~ .y %in% .x$cyl))
Output:
# A tibble: 3 × 3
grz data test
<dbl> <list> <lgl>
1 4 <grouped_df [2 × 1]> TRUE
2 3 <grouped_df [3 × 1]> FALSE
3 5 <grouped_df [3 × 1]> FALSE
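The difference can be reproduced in base R without any nesting. This is only a minimal sketch: the data frame `nested` is a hypothetical stand-in for a single element of the `data` list column, not part of the original question.

```r
# Hypothetical stand-in for one element of the nested `data` column
nested <- data.frame(cyl = c(4, 8))

# The data frame is handled as a list of columns, so 4 is compared with the
# whole column c(4, 8) and no match is found
in_data_frame <- 4 %in% nested
print(in_data_frame)

# Extracting the column hands %in% a plain numeric vector to search
in_column <- 4 %in% nested$cyl
print(in_column)
```

The same reasoning applies inside map2_lgl(): `.x` is a whole tibble, so the column has to be pulled out with `.x$cyl` before testing membership.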

Related

Removing duplicate records in a dataframe based on the values of a list column

I have a dataframe which contains duplicate values in a list column, and I want to keep only the first appearance of each unique value.
Let's say I have the following tibble:
df <- tribble(
~x, ~y,
1, tibble(a = 1:2, b = 2:3),
2, tibble(a = 1:2, b = 2:3),
3, tibble(a = 0:1, b = 0:1)
)
df
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
The desired outcome is:
desired_df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>
If y weren't a list column, I could use distinct(df, y, .keep_all = TRUE), but the function doesn't properly support list columns, as shown:
distinct(df, y, .keep_all = TRUE)
#> Warning: distinct() does not fully support columns of type `list`.
#> List elements are compared by reference, see ?distinct for details.
#> This affects the following columns:
#> - `y`
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
Is there any "clean" way to achieve what I want?
One option is to use filter with duplicated
library(dplyr)
df %>%
filter(!duplicated(y))
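As a rough illustration of why this works: base R's duplicated() compares list elements by value, so identical data frames in a list are flagged after their first occurrence. The sketch below uses a plain list `y_list` of base data frames (a hypothetical stand-in for the `y` column) so that it runs without any packages.

```r
# Hypothetical stand-in for the list column `y`: three data frames,
# the first two with identical contents
y_list <- list(
  data.frame(a = 1:2, b = 2:3),
  data.frame(a = 1:2, b = 2:3),
  data.frame(a = 0:1, b = 0:1)
)

# TRUE marks an element that repeats an earlier one
dup <- duplicated(y_list)
print(dup)

# Positions of the rows that would be kept by filter(!duplicated(y))
keep_rows <- which(!dup)
print(keep_rows)
```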
I have come up with an answer, but I think it's quite "wordy" (and I suspect it might be slow as well):
df <- df %>%
mutate(unique_list_id = match(y, unique(y))) %>%
group_by(unique_list_id) %>%
slice(1) %>%
ungroup() %>%
select(-unique_list_id)
df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>

Use filter() (and other dplyr functions) inside nested data frames with map()

I'm trying to use map() of purrr package to apply filter() function to the data stored in a nested data frame.
"Why wouldn't you filter first, and then nest?" you might ask.
That will work (and I'll show my desired outcome using such process), but I'm looking for ways to do it with purrr.
I want to have just one data frame, with two list-columns, both being nested data frames - one full and one filtered.
I can achieve it now by performing nest() twice: once on all data, and second on filtered data:
library(tidyverse)
df <- tibble(
a = sample(x = rep(c('x','y'),5), size = 10),
b = sample(c(1:10)),
c = sample(c(91:100))
)
df_full_nested <- df %>%
group_by(a) %>%
nest(.key = 'full')
df_filter_nested <- df %>%
filter(c >= 95) %>% ##this is the key step
group_by(a) %>%
nest(.key = 'filtered')
## Desired outcome - one data frame with 2 nested list-columns: one full and one filtered.
## How to achieve this without breaking it out into 2 separate data frames?
df_nested <- df_full_nested %>%
left_join(df_filter_nested, by = 'a')
The objects look like this:
> df
# A tibble: 10 x 3
a b c
<chr> <int> <int>
1 y 8 93
2 x 9 94
3 y 10 99
4 x 5 97
5 y 2 100
6 y 3 95
7 x 7 96
8 y 6 92
9 x 4 91
10 x 1 98
> df_full_nested
# A tibble: 2 x 2
a full
<chr> <list>
1 y <tibble [5 x 2]>
2 x <tibble [5 x 2]>
> df_filter_nested
# A tibble: 2 x 2
a filtered
<chr> <list>
1 y <tibble [3 x 2]>
2 x <tibble [3 x 2]>
> df_nested
# A tibble: 2 x 3
a full filtered
<chr> <list> <list>
1 y <tibble [5 x 2]> <tibble [3 x 2]>
2 x <tibble [5 x 2]> <tibble [3 x 2]>
So, this works. But it is not clean. And in real life, I group by several columns, which means I also have to join on several columns... It gets hairy fast.
I'm wondering if there is a way to apply filter to the nested column. This way, I'd operate within the same object. Just cleaner and more understandable code.
I'm thinking it'd look like
df_full_nested %>% mutate(filtered = map(full, ...))
But I am not sure how to map filter() properly
Thanks!
You can use map(full, ~ filter(., c >= 95)), where . stands for each individual nested tibble, to which you can apply the filter directly:
df_nested_2 <- df_full_nested %>% mutate(filtered = map(full, ~ filter(., c >= 95)))
identical(df_nested, df_nested_2)
# [1] TRUE
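For a version that does not depend on random sampling, here is a small sketch with fixed values. The data and the helper columns `n_full` and `n_filtered` are made up for illustration; they are only there to make the effect of the filter visible.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Small deterministic data set (hypothetical values)
df <- tibble(
  a = rep(c("x", "y"), each = 3),
  b = 1:6,
  c = c(91L, 95L, 99L, 92L, 93L, 100L)
)

df_nested <- df %>%
  nest(full = -a) %>%                                  # one row per value of a
  mutate(
    filtered   = map(full, function(d) filter(d, c >= 95)),  # filter each nested tibble
    n_full     = map_int(full, nrow),                  # rows before filtering
    n_filtered = map_int(filtered, nrow)               # rows after filtering
  )

print(df_nested)
```

Because the filter is applied inside map(), both list columns live in the same object and no join is needed, however many grouping columns there are.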

How to add calculated columns to nested data frames (list columns) using purrr

I would like to perform calculations on a nested data frame (stored as a list-column), and add the calculated variable back to each dataframe using purrr functions. I'll use this result to join to other data, and keeping it compact helps me to organize and examine it better. I can do this in a couple of steps, but it seems like there may be a solution I haven't come across. If there is a solution out there, I haven't been able to find it easily.
Load libraries. The example requires the following packages (available on CRAN):
library(dplyr)
library(tidyr) # provides nest() and unnest()
library(purrr)
library(RcppRoll) # to calculate rolling mean
Example data with 3 subjects, and repeated measurements over time:
test <- tibble( # tibble() replaces the deprecated data_frame()
id= rep(1:3, each=20),
time = rep(1:20, 3),
var1 = rnorm(60, mean=10, sd=3),
var2 = rnorm(60, mean=95, sd=5)
)
Store the data as nested dataframe:
t_nest <- test %>% nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 3]>
2 2 <tibble [20 x 3]>
3 3 <tibble [20 x 3]>
Perform calculations. I will calculate multiple new variables based on the data, although a solution for just one could be expanded later. The result of each calculation will be a numeric vector, same length as the input (n=20):
t1 <- t_nest %>%
mutate(var1_rollmean4 = map(data, ~RcppRoll::roll_mean(.$var1, n=4, align="right", fill=NA)),
var2_delta4 = map(data, ~(.$var2 - lag(.$var2, 3))*0.095),
var3 = map2(var1_rollmean4, var2_delta4, ~.x -.y))
id data var1_rollmean4 var2_delta4 var3
<int> <list> <list> <list> <list>
1 1 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
2 2 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
3 3 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
My solution is to unnest this data, and then nest it again. There doesn't seem to be anything wrong with this, but it seems like a better solution may exist.
t1 %>% unnest %>%
nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 6]>
2 2 <tibble [20 x 6]>
3 3 <tibble [20 x 6]>
This other solution (from SO 42028710) is close, but not quite because it is a list rather than nested dataframes:
map_df(t_nest$data, ~ mutate(.x, var1calc = .$var1*100))
I've found quite a bit of helpful information using the purrr Cheatsheet but can't quite find the answer.
You can wrap another mutate when mapping through the data column and add the columns in each nested tibble:
t11 <- t_nest %>%
mutate(data = map(data,
~ mutate(.x,
var1_rollmean4 = RcppRoll::roll_mean(var1, n=4, align="right", fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = var1_rollmean4 - var2_delta4
)
))
t11
# A tibble: 3 x 2
# id data
# <int> <list>
#1 1 <tibble [20 x 6]>
#2 2 <tibble [20 x 6]>
#3 3 <tibble [20 x 6]>
For comparison, use the unnest-nest method and then reorder the columns inside each nested tibble:
nest_unnest <- t1 %>%
unnest %>% nest(-id) %>%
mutate(data = map(data, ~ select(.x, time, var1, var2, var1_rollmean4, var2_delta4, var3)))
identical(nest_unnest, t11)
# [1] TRUE
It seems that nesting is not necessary for what you're trying to do:
library(tidyverse)
library(zoo)
test %>%
group_by(id) %>%
mutate(var1_rollmean4 = rollapplyr(var1, 4, mean, fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = (var1_rollmean4 - var2_delta4))
# A tibble: 60 x 7
# Groups: id [3]
# id time var1 var2 var1_rollmean4 var2_delta4 var3
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 9.865199 96.45723 NA NA NA
# 2 1 2 9.951429 92.78354 NA NA NA
# 3 1 3 12.831509 95.00553 NA NA NA
# 4 1 4 12.463664 95.37171 11.277950 -0.10312483 11.381075
# 5 1 5 11.781704 92.05240 11.757076 -0.06945881 11.826535
# 6 1 6 12.756932 92.15666 12.458452 -0.27064269 12.729095
# 7 1 7 12.346409 94.32411 12.337177 -0.09952197 12.436699
# 8 1 8 10.223695 100.89043 11.777185 0.83961377 10.937571
# 9 1 9 4.031945 87.38217 9.839745 -0.45357658 10.293322
# 10 1 10 11.859477 97.96973 9.615382 0.34633428 9.269047
# ... with 50 more rows
Edit: You could still nest the result afterwards with %>% nest(-id).
If you still prefer to nest or are nesting for other reasons, it would go like
t1 <- t_nest %>%
mutate(data = map(data, ~.x %>% mutate(...)))
That is, you mutate on .x within the map statement. This will treat data as a data.frame and mutate will column-bind results to it.
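To make the `mutate(...)` placeholder concrete, here is a minimal sketch with a made-up data set and a single made-up calculated column, `var1_delta`. It assumes tidyr 1.0 or later for the `nest(data = -id)` syntax.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical data: two subjects with four measurements each
test_small <- tibble(
  id   = rep(1:2, each = 4),
  time = rep(1:4, times = 2),
  var1 = c(1, 2, 3, 4, 10, 20, 30, 40)
)

t_nest_small <- test_small %>% nest(data = -id)

# mutate() runs once per nested tibble, so lag() never crosses subjects
t_new <- t_nest_small %>%
  mutate(data = map(data, ~ .x %>%
    mutate(var1_delta = var1 - dplyr::lag(var1))))

print(t_new$data[[2]])
```

Each nested tibble keeps its original columns and gains the new one, so no unnest-nest round trip is required.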

filter tibble of tibbles by nrow

I have a tibble like this:
>dat
# A tibble: 556 × 3
sample run abc
<chr> <chr> <list>
1 206_03_07_2013 21102016 <tibble [304 × 21]>
2 206_04_07_2017 7082017 <tibble [229 × 21]>
3 206_04_10_2015 25112015 <tibble [2,687 × 21]>
4 206_07_08_2013 15102015 <tibble [460 × 21]>
5 206_08_12_2016 3032017 <tibble [3,250 × 21]>
6 206_11_03_2014 21102016 <tibble [975 × 21]>
7 206_13_02_2013 21112016 <tibble [101 × 21]>
8 206_13_03_2013 21112016 <tibble [345 × 21]>
9 206_14_08_2014 8092016 <tibble [1,952 × 21]>
10 206_19_03_2015 25012016 <tibble [11 × 21]>
# ... with 546 more rows
The abc column contains tibbles of different lengths. I want to filter the dat tibble by the number of rows in each nested tibble (>100 rows).
I could do something like this:
dat[sapply(dat$abc, nrow) > 100, ]
but I would like to follow the dplyr philosophy.
Any ideas?
Thanks
A way could be:
library(dplyr)
library(purrr)
dat <- tribble(
~foo, ~bar,
1, as_tibble(head(iris, 3)),
2, as_tibble(head(iris, 7))
)
# # A tibble: 2 x 2
# foo bar
# <dbl> <list>
# 1 1 <tibble [3 x 5]>
# 2 2 <tibble [7 x 5]>
res <- filter(dat, map_int(bar, nrow) > 5)
# # A tibble: 1 x 2
# foo bar
# <dbl> <list>
# 1 2 <tibble [7 x 5]>
desired_output <- dat[sapply(dat$bar,nrow)>5,]
identical(res, desired_output)
# [1] TRUE
There is not really any added value here, compared to what you tried, it's a matter of using drop-in replacements to [ and sapply (with filter and map_int respectively). Base R functions are not incompatible with the so-called "dplyr philosophy". If you mean the use of the magrittr pipe %>%, dat %>% .[sapply(.$bar, nrow) > 5, ] and dat %>% filter(map_int(bar, nrow) > 5) work equally well.
Note: I usually prefer all.equal over identical but couldn't make it work:
all.equal(res, desired_output)
# Error in equal_data_frame(target, current, ignore_col_order = ignore_col_order, :
# Can't join on 'bar' x 'bar' because of incompatible types (list / list)
(See https://github.com/tidyverse/dplyr/issues/2194)

How to count rows in nested data_frames with dplyr

Here's a dumb example dataframe:
df <- data_frame(A = c(rep(1, 5), rep(2, 4)), B = 1:9) %>%
group_by(A) %>%
nest()
which looks like this:
> df
# A tibble: 2 × 2
A data
<dbl> <list>
1 1 <tibble [5 × 1]>
2 2 <tibble [4 × 1]>
I would like to add a third column called N with entries equal to the number of rows in each nested data_frame in data. I figured this would work:
> df %>%
+ mutate(N = nrow(data))
Error: Unsupported type NILSXP for column "N"
What's going wrong?
Combining dplyr and purrr you could do:
library(tidyverse)
df %>%
mutate(n = map_dbl(data, nrow))
#> # A tibble: 2 × 3
#> A data n
#> <dbl> <list> <dbl>
#> 1 1 <tibble [5 × 1]> 5
#> 2 2 <tibble [4 × 1]> 4
I like this approach, because you stay within your usual workflow, creating a new column within mutate, but leveraging the map_*-family, since you need to operate on a list.
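As for what goes wrong in the original attempt, a likely explanation is that inside mutate() the name data refers to the whole list column, and nrow() of a plain list is NULL, which cannot be stored as a column. A base R sketch, using a hypothetical list `data_col` in place of the nested column:

```r
# Hypothetical stand-in for the list column `data`
data_col <- list(data.frame(B = 1:5), data.frame(B = 6:9))

# A plain list has no dim attribute, so nrow() returns NULL
whole_list <- nrow(data_col)
print(whole_list)

# Applying nrow() to each element gives one count per nested data frame
per_element <- vapply(data_col, nrow, integer(1))
print(per_element)
```

map_dbl(data, nrow) does the same element-wise work inside mutate(); map_int() would return the counts as integers instead of doubles.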
You could do:
df %>%
rowwise() %>%
mutate(N = nrow(data))
Which gives:
#Source: local data frame [2 x 3]
#Groups: <by row>
#
## A tibble: 2 × 3
# A data N
# <dbl> <list> <int>
#1 1 <tibble [5 × 1]> 5
#2 2 <tibble [4 × 1]> 4
With dplyr:
df %>%
group_by(A) %>%
mutate(N = nrow(data.frame(data)))
A data N
<dbl> <list> <int>
1 1 <tibble [5 × 1]> 5
2 2 <tibble [4 × 1]> 4
