How to create a nested data frame by collapsing columns - R

I have a dataframe and want to collapse some columns (y and z) to create a nested dataframe. For instance:
df <- data.frame(x = rep(c(1,2,3,4),times=3), y = rep(c("Y","W","T","R"),times=3), z = rep(c("A","B","C","D"),times=3))
x y z
=========
1 Y A
2 W B
3 T C
4 R D
1 Y A
2 W B
3 T C
4 R D
1 Y A
2 W B
I want to collapse the y and z columns and nest them for each unique group of x. The resulting dataframe should look like this:
x zy
======
1 <dataframe>
2 <dataframe>
3 <dataframe>
4 <dataframe>
How do I accomplish this?

library(tidyverse)
df %>%
group_by(x) %>%
nest(data = c(z, y))
# A tibble: 4 × 2
# Groups: x [4]
x data
<dbl> <list>
1 1 <tibble [3 × 2]>
2 2 <tibble [3 × 2]>
3 3 <tibble [3 × 2]>
4 4 <tibble [3 × 2]>

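Try a list column. Note that cbind() on two character vectors returns a character matrix, so each element here is a matrix rather than a data frame: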
library(dplyr)
df %>%
group_by(x) %>%
summarise(zy = list(cbind(y, z)))
# A tibble: 4 × 2
x zy
<dbl> <list>
1 1 <chr [3 × 2]>
2 2 <chr [3 × 2]>
3 3 <chr [3 × 2]>
4 4 <chr [3 × 2]>
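If the nested column should be called zy, as in the desired output, a minimal sketch is to name it directly in nest(). This assumes tidyr 1.0 or later, where nest() takes name = columns pairs; the variable name nested is only illustrative.

```r
library(tidyr)

# Example data from the question
df <- data.frame(
  x = rep(c(1, 2, 3, 4), times = 3),
  y = rep(c("Y", "W", "T", "R"), times = 3),
  z = rep(c("A", "B", "C", "D"), times = 3)
)

# Nest y and z into a list column named zy; the remaining column (x)
# identifies the groups, so no group_by() is needed
nested <- nest(df, zy = c(y, z))
```

Each element of nested$zy is then a data frame holding the y and z values for one value of x, and the result is not grouped.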

Related

How to convert a list of tibbles/dataframes into a nested tibble/dataframe

Sample Data
ex_list <- list(a = tibble(x = 1:4, y = 5:8),
b = mtcars)
How do I convert this list of tibbles/dataframes into a nested tibble as shown below:
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
Tidy solutions appreciated!
We may use enframe
library(tibble)
enframe(ex_list)
# A tibble: 2 x 2
name value
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
If we need to change the column names, use the name and value arguments
> enframe(ex_list, name = 'data_name', value = 'data')
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
Is this what you want?
library(tidyverse)
lapply(ex_list, nest) %>%
dplyr::bind_rows(., .id = "data_name")
# # A tibble: 2 x 2
# data_name data
# <chr> <list>
# 1 a <tibble [4 x 2]>
# 2 b <tibble [32 x 11]>
#OR map
#map(ex_list, nest) %>%
# bind_rows(., .id = "data_name")
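Another possibility, sketched here under the assumption that the list is named, is to build the tibble by hand: the names become one column and the list itself becomes a list column. The variable name out is only illustrative.

```r
library(tibble)

# Sample data from the question
ex_list <- list(a = tibble(x = 1:4, y = 5:8),
                b = mtcars)

# names() supplies the labels; unname() keeps the list column free of names
out <- tibble(data_name = names(ex_list), data = unname(ex_list))
```

Unlike the nest() approach, this leaves each element as it was, so b stays a plain data frame instead of being converted to a tibble.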

mutate_if, across or map_if: how to process each iteration conditional on the value in another field

Simple example:
mydf <- data.frame(
x = 1:3,
y = c(1, 0, 1),
z = 1:3
) %>% group_by(x) %>% nest
mydf %>% mutate(blah = map_dbl(.x = data, ~ .x$z * 2))
Returns:
# A tibble: 3 x 3
# Groups: x [3]
x data blah
<int> <list> <dbl>
1 1 <tibble [1 × 2]> 2
2 2 <tibble [1 × 2]> 4
3 3 <tibble [1 × 2]> 6
I would like to mutate or map conditional on y: if y = 1, compute .x$z * 2; otherwise (y = 0), just return NA.
Desired result:
# A tibble: 3 x 3
# Groups: x [3]
x data blah
<int> <list> <dbl>
1 1 <tibble [1 × 2]> 2
2 2 <tibble [1 × 2]> NA
3 3 <tibble [1 × 2]> 6
Should I use mutate_if, across, or map_if? How can I get this result?
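In case the OP needs to retain the map model in their real use case, map2() is one possibility. Because y sits inside the nested data column of mydf, unnest it first; map2_dbl() then returns a numeric column rather than a list:
mydf %>% unnest(data) %>% mutate(blah = map2_dbl(z, y, ~ifelse(.y == 1, .x * 2, NA)))
x y z blah
1 1 1 1 2
2 2 0 2 NA
3 3 1 3 6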
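If the goal is to keep the data nested, a minimal sketch is to map over the data column itself and test y inside each nested tibble. This assumes, as in the example, that every nested tibble has exactly one row, because if () needs a single TRUE or FALSE; the variable name res is only illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Nested example data from the question
mydf <- data.frame(
  x = 1:3,
  y = c(1, 0, 1),
  z = 1:3
) %>%
  group_by(x) %>%
  nest()

# For each nested tibble: double z when y is 1, otherwise return NA
res <- mydf %>%
  mutate(blah = map_dbl(data, ~ if (.x$y == 1) .x$z * 2 else NA_real_))
```

Using NA_real_ rather than a bare NA keeps every branch a double, which is what map_dbl() expects.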

Removing duplicate records in a dataframe based on the values of a list column

I have a dataframe which contains duplicate values in a list column and I want to keep only the first appearance of each unique value.
Let's say I have the following tibble:
df <- tribble(
~x, ~y,
1, tibble(a = 1:2, b = 2:3),
2, tibble(a = 1:2, b = 2:3),
3, tibble(a = 0:1, b = 0:1)
)
df
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
The desired outcome is:
desired_df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>
If y weren't a list column, I could use distinct(df, y, .keep_all = TRUE), but the function doesn't support list columns properly, as shown:
distinct(df, y, .keep_all = TRUE)
#> Warning: distinct() does not fully support columns of type `list`.
#> List elements are compared by reference, see ?distinct for details.
#> This affects the following columns:
#> - `y`
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
Is there any "clean" way to achieve what I want?
One option is to use filter with duplicated
library(dplyr)
df %>%
filter(!duplicated(y))
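A quick check of this approach on the question's data is sketched below. It relies on base R's duplicated() comparing list elements by their contents; the variable names dup and deduped are only illustrative.

```r
library(dplyr)
library(tibble)

# Example data from the question: rows 1 and 2 hold equal tibbles
df <- tribble(
  ~x, ~y,
  1, tibble(a = 1:2, b = 2:3),
  2, tibble(a = 1:2, b = 2:3),
  3, tibble(a = 0:1, b = 0:1)
)

# Flags the second row as a repeat of the first
dup <- duplicated(df$y)

# Keep only the first appearance of each distinct nested tibble
deduped <- df %>%
  filter(!duplicated(y))
```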
I have come up with an answer, but I think it's quite "wordy" (and I suspect it might be slow as well):
df <- df %>%
mutate(unique_list_id = match(y, unique(y))) %>%
group_by(unique_list_id) %>%
slice(1) %>%
ungroup() %>%
select(-unique_list_id)
df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>

filter tibble of tibbles by nrow

I have a tibble like this:
>dat
# A tibble: 556 × 3
sample run abc
<chr> <chr> <list>
1 206_03_07_2013 21102016 <tibble [304 × 21]>
2 206_04_07_2017 7082017 <tibble [229 × 21]>
3 206_04_10_2015 25112015 <tibble [2,687 × 21]>
4 206_07_08_2013 15102015 <tibble [460 × 21]>
5 206_08_12_2016 3032017 <tibble [3,250 × 21]>
6 206_11_03_2014 21102016 <tibble [975 × 21]>
7 206_13_02_2013 21112016 <tibble [101 × 21]>
8 206_13_03_2013 21112016 <tibble [345 × 21]>
9 206_14_08_2014 8092016 <tibble [1,952 × 21]>
10 206_19_03_2015 25012016 <tibble [11 × 21]>
# ... with 546 more rows
The abc column contains tibbles of different lengths. I want to filter the dat tibble by the number of rows in each nested tibble (> 100 rows).
I could do something like this:
dat[sapply(dat$abc,nrow)>100,]
but I would like to follow the dplyr philosophy.
Any ideas?
Thanks
A way could be:
library(dplyr)
library(purrr)
dat <- tribble(
~foo, ~bar,
1, as_tibble(head(iris, 3)),
2, as_tibble(head(iris, 7))
)
# # A tibble: 2 x 2
# foo bar
# <dbl> <list>
# 1 1 <tibble [3 x 5]>
# 2 2 <tibble [7 x 5]>
res <- filter(dat, map_int(bar, nrow) > 5)
# # A tibble: 1 x 2
# foo bar
# <dbl> <list>
# 1 2 <tibble [7 x 5]>
desired_output <- dat[sapply(dat$bar,nrow)>5,]
identical(res, desired_output)
# [1] TRUE
There is not really any added value here compared to what you tried; it is a matter of using drop-in replacements for [ and sapply (filter and map_int, respectively). Base R functions are not incompatible with the so-called "dplyr philosophy". If you mean the use of the magrittr pipe %>%, then dat %>% .[sapply(.$bar, nrow) > 5, ] and dat %>% filter(map_int(bar, nrow) > 5) work equally well.
Note: I usually prefer all.equal over identical but couldn't make it work:
all.equal(res, desired_output)
# Error in equal_data_frame(target, current, ignore_col_order = ignore_col_order, :
# Can't join on 'bar' x 'bar' because of incompatible types (list / list)
(See https://github.com/tidyverse/dplyr/issues/2194)

How to count rows in nested data_frames with dplyr

Here's a dumb example dataframe:
df <- data_frame(A = c(rep(1, 5), rep(2, 4)), B = 1:9) %>%
group_by(A) %>%
nest()
which looks like this:
> df
# A tibble: 2 × 2
A data
<dbl> <list>
1 1 <tibble [5 × 1]>
2 2 <tibble [4 × 1]>
I would like to add a third column called N with entries equal to the number of rows in each nested data_frame in data. I figured this would work:
> df %>%
+ mutate(N = nrow(data))
Error: Unsupported type NILSXP for column "N"
What's going wrong?
Combining dplyr and purrr you could do:
library(tidyverse)
df %>%
mutate(n = map_dbl(data, nrow))
#> # A tibble: 2 × 3
#> A data n
#> <dbl> <list> <dbl>
#> 1 1 <tibble [5 × 1]> 5
#> 2 2 <tibble [4 × 1]> 4
I like this approach, because you stay within your usual workflow, creating a new column within mutate, but leveraging the map_*-family, since you need to operate on a list.
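To see why the original attempt fails and why mapping helps, here is a minimal sketch: the data column is a list, so nrow() applied to the whole column returns NULL, while mapping nrow() over its elements returns one count per row. It uses tibble() in place of the deprecated data_frame() and map_int() to get integer counts; the variable name counts is only illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same structure as the question's example
df <- tibble(A = c(rep(1, 5), rep(2, 4)), B = 1:9) %>%
  group_by(A) %>%
  nest()

# A list column has no dim attribute, so nrow() on it is NULL
whole_column <- nrow(df$data)

# Apply nrow() to each nested tibble instead
counts <- df %>%
  ungroup() %>%
  mutate(N = map_int(data, nrow))
```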
You could do:
df %>%
rowwise() %>%
mutate(N = nrow(data))
Which gives:
#Source: local data frame [2 x 3]
#Groups: <by row>
#
## A tibble: 2 × 3
# A data N
# <dbl> <list> <int>
#1 1 <tibble [5 × 1]> 5
#2 2 <tibble [4 × 1]> 4
With dplyr:
df %>%
group_by(A) %>%
mutate(N = nrow(data.frame(data)))
A data N
<dbl> <list> <int>
1 1 <tibble [5 × 1]> 5
2 2 <tibble [4 × 1]> 4
