filter tibble of tibbles by nrow - r

I have a tibble like this:
>dat
# A tibble: 556 × 3
sample run abc
<chr> <chr> <list>
1 206_03_07_2013 21102016 <tibble [304 × 21]>
2 206_04_07_2017 7082017 <tibble [229 × 21]>
3 206_04_10_2015 25112015 <tibble [2,687 × 21]>
4 206_07_08_2013 15102015 <tibble [460 × 21]>
5 206_08_12_2016 3032017 <tibble [3,250 × 21]>
6 206_11_03_2014 21102016 <tibble [975 × 21]>
7 206_13_02_2013 21112016 <tibble [101 × 21]>
8 206_13_03_2013 21112016 <tibble [345 × 21]>
9 206_14_08_2014 8092016 <tibble [1,952 × 21]>
10 206_19_03_2015 25012016 <tibble [11 × 21]>
# ... with 546 more rows
The abc column contains tibbles with different numbers of rows. I want to filter the dat tibble by the size of those nested tibbles (keeping only those with more than 100 rows).
I could do something like this:
dat[sapply(dat$abc, nrow) > 100, ]
but I would like to follow the dplyr philosophy.
Any ideas?
Thanks

A way could be:
library(dplyr)
library(purrr)
dat <- tribble(
~foo, ~bar,
1, as_tibble(head(iris, 3)),
2, as_tibble(head(iris, 7))
)
# # A tibble: 2 x 2
# foo bar
# <dbl> <list>
# 1 1 <tibble [3 x 5]>
# 2 2 <tibble [7 x 5]>
res <- filter(dat, map_int(bar, nrow) > 5)
# # A tibble: 1 x 2
# foo bar
# <dbl> <list>
# 1 2 <tibble [7 x 5]>
desired_output <- dat[sapply(dat$bar,nrow)>5,]
identical(res, desired_output)
# [1] TRUE
There is not really any added value here compared to what you tried; it is a matter of using drop-in replacements for [ and sapply (filter and map_int, respectively). Base R functions are not incompatible with the so-called "dplyr philosophy". If you mean the use of the magrittr pipe %>%, dat %>% .[sapply(.$bar, nrow) > 5, ] and dat %>% filter(map_int(bar, nrow) > 5) work equally well.
Note: I usually prefer all.equal over identical but couldn't make it work:
all.equal(res, desired_output)
# Error in equal_data_frame(target, current, ignore_col_order = ignore_col_order, :
# Can't join on 'bar' x 'bar' because of incompatible types (list / list)
(See https://github.com/tidyverse/dplyr/issues/2194)
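Applied to the column names from the question, the same pattern could look like the sketch below. The data is a made-up stand-in (three rows, a single column v inside each nested tibble), not the asker's real data; only the names sample, run and abc and the threshold of 100 come from the question.

```r
library(dplyr)
library(purrr)

# Toy stand-in for the question's data: all values here are invented
dat <- tibble(
  sample = c("s1", "s2", "s3"),
  run    = c("r1", "r2", "r3"),
  abc    = list(
    tibble(v = seq_len(150)),
    tibble(v = seq_len(20)),
    tibble(v = seq_len(101))
  )
)

# Keep only the rows whose nested tibble has more than 100 rows
big <- dat %>% filter(map_int(abc, nrow) > 100)
big$sample
```

map_int is used rather than map_dbl because nrow returns an integer.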

Related

using apply on listcolumns in R seems inconsistent

..or at least inconsistent with my intuition.
I'm trying to extract data from inside a listcolumn using apply - in the example I've got a column of tibbles called eagles:
df1 <- tibble(
location = c(1,2),
eagles = list(
tibble(
talons = c(2,3,4),
beaks = c("blue","red","red")),
tibble(
talons = c(2,3),
beaks = c("red","red"))))
and extracting the beaks values as vectors using apply:
df1$beakz <- apply(df1, 1, \(x) x$eagles$beaks)
which works as expected:
> df1
# A tibble: 2 x 3
location eagles beakz
<dbl> <list> <list>
1 1 <tibble [3 x 2]> <chr [3]>
2 2 <tibble [2 x 2]> <chr [2]>
However if I add another row to one of the nested tibbles, the apply function won't play along anymore:
df2 <- tibble(
location = c(1,2),
eagles = list(
tibble(
talons = c(2,3,4),
beaks = c("blue","red","red")),
tibble(
talons = c(2,3,2),
beaks = c("red","red","yellow"))))
df2$beakz <- apply(df2, 1, \(x) x$eagles$beaks)
Error:
! Assigned data `apply(df2, 1, function(x) x$eagles$beaks)` must be compatible with existing data.
x Existing data has 2 rows.
x Assigned data has 3 rows.
i Only vectors of size 1 are recycled.
The expected output would be adding a listcolumn beakz with two vectors (of length 3) as elements.
Additionally, if both the nested tibbles have two rows only, the apply function does work, but instead of a single new listcolumn, I get two new columns:
df3 <- tibble(
location = c(1,2),
eagles = list(
tibble(
talons = c(2,3),
beaks = c("blue","red")),
tibble(
talons = c(2,3),
beaks = c("red","red"))))
df3$beakz <- apply(df3, 1, \(x) x$eagles$beaks)
df3
# A tibble: 2 x 3
location eagles beakz[,1] [,2]
<dbl> <list> <chr> <chr>
1 1 <tibble [2 x 2]> blue red
2 2 <tibble [2 x 2]> red red
This is a grossly simplified example, but basically, I would expect apply to function the same way in all three cases: I would like to extract a column as a vector and bring it up a level. Ideally using apply, although I'm sure there are purrr ways of doing this. But mainly I would just like to understand why this works this way, because debugging it has not been much fun :lolsob:
(also would appreciate it if someone with enough reputation could add listcolumn or list-column to the tags)
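This happens because apply() simplifies its result: when every call returns a vector of the same length, it binds them into a matrix instead of returning a list. For df2 that is a 3x2 matrix, which has too many rows to be put into df2. For df3 it is a 2x2 matrix, which happens to fit and therefore shows up as a matrix column; only in df1, where the lengths differ, does apply() fall back to a list. To get it to do what you want you could e.g. coerce the matrix to a data frame (to give the columns names) and then to a list. There's probably a more elegant way to do it. But basically apply() does not play well with the list structure of your data, whereas the purrr functions do.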
apply(df2, 1, \(x) x$eagles$beaks)
#> [,1] [,2]
#> [1,] "blue" "red"
#> [2,] "red" "red"
#> [3,] "red" "yellow"
class(apply(df2, 1, \(x) x$eagles$beaks))
#> [1] "matrix" "array"
df2$beakz <- as.list(data.frame(apply(df2, 1, \(x) x$eagles$beaks)))
df2
#> # A tibble: 2 × 3
#> location eagles beakz
#> <dbl> <list> <named list>
#> 1 1 <tibble [3 × 2]> <chr [3]>
#> 2 2 <tibble [3 × 2]> <chr [3]>
df2$beakz
#> $X1
#> [1] "blue" "red" "red"
#>
#> $X2
#> [1] "red" "red" "yellow"
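As a possibly simpler base R alternative, here is a minimal sketch that avoids the matrix simplification altogether. It assumes R 4.1 or later (already needed for the \(x) syntax used in the question), where apply() accepts simplify = FALSE; the names beaks_lapply and beaks_apply are just illustrative.

```r
library(tibble)

# Same data as df2 in the question
df2 <- tibble(
  location = c(1, 2),
  eagles = list(
    tibble(talons = c(2, 3, 4), beaks = c("blue", "red", "red")),
    tibble(talons = c(2, 3, 2), beaks = c("red", "red", "yellow"))
  )
)

# lapply() always returns a list, one element per nested tibble
beaks_lapply <- lapply(df2$eagles, \(e) e$beaks)

# apply() with simplify = FALSE (R >= 4.1) keeps the per-row results as a list
beaks_apply <- apply(df2, 1, \(x) x$eagles$beaks, simplify = FALSE)

# A list of length 2 fits into the 2-row tibble as a list column
df2$beakz <- beaks_lapply
```

Iterating over df2$eagles directly with lapply() also avoids the conversion of the whole tibble to a matrix that apply() performs first.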
Purely for reference (not debugging OP), purrr works without issue:
library(dplyr)
library(purrr)
> mutate(df1, beaks=map(eagles, ~ .x$beaks))
# A tibble: 2 × 3
location eagles beaks
<dbl> <list> <list>
1 1 <tibble [3 × 2]> <chr [3]>
2 2 <tibble [2 × 2]> <chr [2]>
> mutate(df2, beaks=map(eagles, ~ .x$beaks))
# A tibble: 2 × 3
location eagles beaks
<dbl> <list> <list>
1 1 <tibble [3 × 2]> <chr [3]>
2 2 <tibble [3 × 2]> <chr [3]>
> mutate(df3, beaks=map(eagles, ~ .x$beaks))
# A tibble: 2 × 3
location eagles beaks
<dbl> <list> <list>
1 1 <tibble [2 × 2]> <chr [2]>
2 2 <tibble [2 × 2]> <chr [2]>

Dplyr logical membership test of nested data

I'm trying to create a logical test for membership of a data frame variable in a nested column. Using mtcars as a stand-in, I can roughly replicate what I'm trying to do (the process may seem inefficient or circuitous, since it's not my real data):
library(dplyr)
library(tidyr) # nest() comes from tidyr
m <- mtcars %>%
group_by(cyl) %>%
summarize(grz = unique(gear)) %>%
nest(data = c(cyl))
Which produces a nested column of cylinders (data) associated with the grz variable:
# A tibble: 3 x 2
grz data
<dbl> <list>
1 4 <grouped_df [2 x 1]>
2 3 <grouped_df [3 x 1]>
3 5 <grouped_df [3 x 1]>
I want to add a column testing if the value of grz is present in the nested data column, and can't seem to figure out why this doesn't work:
library(purrr)
m %>% mutate(test = map2_lgl(.x = data, .y = grz, ~ .y %in% .x))
# A tibble: 3 x 3
grz data test
<dbl> <list> <lgl>
1 4 <grouped_df [2 x 1]> FALSE
2 3 <grouped_df [3 x 1]> FALSE
3 5 <grouped_df [3 x 1]> FALSE
The first row of grz (value of 4) should produce a TRUE boolean, while the other two should be FALSE.
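We need to extract the column, because the table argument of %in% should be a vector or matrix. Given a data frame, %in% compares .y against whole columns rather than against the values inside them.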
library(dplyr)
library(purrr)
m %>%
mutate(test = map2_lgl(data, grz, ~ .y %in% .x$cyl))
Output:
# A tibble: 3 × 3
grz data test
<dbl> <list> <lgl>
1 4 <grouped_df [2 × 1]> TRUE
2 3 <grouped_df [3 × 1]> FALSE
3 5 <grouped_df [3 × 1]> FALSE
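The difference can be seen outside of mutate with a small base R sketch. The data frame nested below is a hypothetical stand-in for one element of the data column.

```r
# Hypothetical stand-in for one nested element of the data column
nested <- data.frame(cyl = c(4, 6))

# The data frame is treated as a list of columns, so 4 is compared
# against the whole cyl column, not against its values
in_df <- 4 %in% nested

# Extracting the column gives %in% a plain vector to search
in_col <- 4 %in% nested$cyl

c(in_df, in_col)
```

The first test is FALSE and the second is TRUE, which mirrors the all-FALSE result in the question and the corrected result above.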

How to convert a list of tibbles/dataframes into a nested tibble/dataframe

Sample Data
ex_list <- list(a = tibble(x = 1:4, y = 5:8),
b = mtcars)
How do I convert this list of tibbles/dataframes into a nested tibble as shown below:
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
Tidy solutions appreciated!
We may use enframe
library(tibble)
enframe(ex_list)
# A tibble: 2 x 2
name value
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
If we need to change the column names, use the name and value arguments:
> enframe(ex_list, name = 'data_name', value = 'data')
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
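For comparison, the same result can be built by hand, which is roughly what enframe appears to do for a named list. This is only a sketch; the variable name nested is illustrative.

```r
library(tibble)

ex_list <- list(a = tibble(x = 1:4, y = 5:8),
                b = mtcars)

# Names go into one column, the data frames into a list column
nested <- tibble(data_name = names(ex_list), data = unname(ex_list))
nested
```

unname() keeps the list column free of names, since they are already stored in data_name.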
Is this what you want?
library(tidyverse)
lapply(ex_list, nest) %>%
dplyr::bind_rows(., .id = "data_name")
# # A tibble: 2 x 2
# data_name data
# <chr> <list>
# 1 a <tibble [4 x 2]>
# 2 b <tibble [32 x 11]>
#OR map
#map(ex_list, nest) %>%
# bind_rows(., .id = "data_name")

Removing duplicate records in a dataframe based on the values of a list column

I have a data frame which contains duplicate values in a list column and I want to keep only the first appearance of each unique value.
Let's say I have the following tibble:
df <- tribble(
~x, ~y,
1, tibble(a = 1:2, b = 2:3),
2, tibble(a = 1:2, b = 2:3),
3, tibble(a = 0:1, b = 0:1)
)
df
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
The desired outcome is:
desired_df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>
If y weren't a list column, I could use distinct(df, y, .keep_all = TRUE), but the function doesn't support list columns properly, as shown:
distinct(df, y, .keep_all = TRUE)
#> Warning: distinct() does not fully support columns of type `list`.
#> List elements are compared by reference, see ?distinct for details.
#> This affects the following columns:
#> - `y`
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
Is there any "clean" way to achieve what I want?
One option is to use filter with duplicated
library(dplyr)
df %>%
filter(!duplicated(y))
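This works because base R's duplicated() compares list elements by value rather than by reference. A minimal sketch using the data from the question (the names dup and deduped are illustrative):

```r
library(dplyr)
library(tibble)

df <- tribble(
  ~x, ~y,
  1, tibble(a = 1:2, b = 2:3),
  2, tibble(a = 1:2, b = 2:3),
  3, tibble(a = 0:1, b = 0:1)
)

# TRUE marks an element that repeats an earlier one
dup <- duplicated(df$y)

# Drop the repeated rows, keeping the first appearance of each tibble
deduped <- df %>% filter(!dup)
deduped$x
```

Only the second row is flagged as a duplicate, so the rows with x equal to 1 and 3 remain.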
I have come to an answer, but I think it's quite "wordy" (and I suspect it might be slow as well):
df <- df %>%
mutate(unique_list_id = match(y, unique(y))) %>%
group_by(unique_list_id) %>%
slice(1) %>%
ungroup() %>%
select(-unique_list_id)
df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>

How to count rows in nested data_frames with dplyr

Here's a dumb example dataframe:
library(tidyverse)
# data_frame() is deprecated; tibble() is the current equivalent
df <- tibble(A = c(rep(1, 5), rep(2, 4)), B = 1:9) %>%
group_by(A) %>%
nest()
which looks like this:
> df
# A tibble: 2 × 2
A data
<dbl> <list>
1 1 <tibble [5 × 1]>
2 2 <tibble [4 × 1]>
I would like to add a third column called N with entries equal to the number of rows in each nested data_frame in data. I figured this would work:
> df %>%
+ mutate(N = nrow(data))
Error: Unsupported type NILSXP for column "N"
What's going wrong?
Combining dplyr and purrr you could do:
library(tidyverse)
df %>%
mutate(n = map_dbl(data, nrow))
#> # A tibble: 2 × 3
#> A data n
#> <dbl> <list> <dbl>
#> 1 1 <tibble [5 × 1]> 5
#> 2 2 <tibble [4 × 1]> 4
I like this approach, because you stay within your usual workflow, creating a new column within mutate, but leveraging the map_*-family, since you need to operate on a list.
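As for what goes wrong in the original attempt: inside mutate, data refers to the whole list column, and a plain list has no dimensions, so nrow() returns NULL instead of one count per row. A small sketch, using a hand-built stand-in for the nested data frame (an assumption made to keep the example short):

```r
library(dplyr)
library(purrr)

# Hand-built stand-in with the same shape as the nested df in the question
df <- tibble(
  A = c(1, 2),
  data = list(tibble(B = 1:5), tibble(B = 6:9))
)

# The list column as a whole has no dim attribute, so this is NULL
whole_column <- nrow(df$data)

# Applying nrow() to each element gives one count per nested tibble
res <- df %>% mutate(N = map_int(data, nrow))
res$N
```

map_int is used here instead of map_dbl so that N is stored as an integer, which matches what nrow returns.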
You could do:
df %>%
rowwise() %>%
mutate(N = nrow(data))
Which gives:
#Source: local data frame [2 x 3]
#Groups: <by row>
#
## A tibble: 2 × 3
# A data N
# <dbl> <list> <int>
#1 1 <tibble [5 × 1]> 5
#2 2 <tibble [4 × 1]> 4
With dplyr:
df %>%
group_by(A) %>%
mutate(N = nrow(data.frame(data)))
A data N
<dbl> <list> <int>
1 1 <tibble [5 × 1]> 5
2 2 <tibble [4 × 1]> 4

Resources