Simplifying the list for nested data frame - r

Sorry, I am new to R.
I need to get a data frame ready for JSON output, but I have trouble putting the variable back into its original format c(1, 2, 3, ...). For example:
library(tidyr)
x<-tibble(x = 1:3, y = list(c(1,5), c(1,5,10), c(1,2,3,20)))
View(x)
This shows
1 1 c(1, 5)
2 2 c(1, 5, 10)
3 3 c(1, 2, 3, 20)
x1<-x %>% unnest(y)
x2<-x1 %>% nest(data=c(y))
View(x2)
This shows
1 1 1 variable
2 2 1 variable
3 3 1 variable
The desired format is c(...) rather than "1 variable", so that the data are ready for the JSON file:
1 1 c(1, 5)
2 2 c(1, 5, 10)
3 3 c(1, 2, 3, 20)
Please help

x$y is a list-column of doubles, whereas x2$data is a list-column of tibbles.
Use map and unlist to turn the tibbles into doubles.
library(tidyverse)
x2 %>%
mutate(data = map(data, unlist))
#> # A tibble: 3 x 2
#> x data
#> <int> <list>
#> 1 1 <dbl [2]>
#> 2 2 <dbl [3]>
#> 3 3 <dbl [4]>
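One detail that may matter for the JSON step: unlist() on a one-column tibble returns a named vector (y1, y2, ...), whereas the elements of the original x$y are unnamed. A minimal sketch, assuming the nested column is called y as in the question and tidyr 1.0.0 or later, extracts that column by name instead:

```r
library(tidyr)
library(dplyr)
library(purrr)

x  <- tibble(x = 1:3, y = list(c(1, 5), c(1, 5, 10), c(1, 2, 3, 20)))
x2 <- x %>% unnest(y) %>% nest(data = c(y))

# unlist() keeps names derived from the column name
via_unlist <- x2 %>% mutate(data = map(data, unlist))
names(via_unlist$data[[1]])

# extracting the column by name gives plain, unnamed vectors
via_pluck <- x2 %>% mutate(data = map(data, "y"))
names(via_pluck$data[[1]])
```

Whether the names matter depends on how the data are serialised afterwards, so treat this as an optional refinement.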
Alternatively, instead of nesting, you can use summarise.
x1 %>%
group_by(x) %>%
summarise(data = list(y))
#> # A tibble: 3 x 2
#> x data
#> <int> <list>
#> 1 1 <dbl [2]>
#> 2 2 <dbl [3]>
#> 3 3 <dbl [4]>
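Since the goal is a JSON file, here is a hedged sketch of the final step. The jsonlite package is an assumption (the question does not say which JSON library is used); the point is only that a list-column of plain vectors should serialise as one array per row.

```r
library(dplyr)
library(tidyr)
library(jsonlite)  # assumption: any JSON package could be substituted

x  <- tibble(x = 1:3, y = list(c(1, 5), c(1, 5, 10), c(1, 2, 3, 20)))
x1 <- x %>% unnest(y)

res <- x1 %>%
  group_by(x) %>%
  summarise(data = list(y))

# each list-column element should serialise as a JSON array
json <- toJSON(res)
json
```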

Related

How to group_by and summarise vectors inside a tibble into a single vector?

tibble(x = rep(1:3, 2),
y = list(1:5, 1:10, 10:20, 20:40, 1:50, 5:10)) -> df
df
#> # A tibble: 6 × 2
#> x y
#> <int> <list>
#> 1 1 <int [5]>
#> 2 2 <int [10]>
#> 3 3 <int [11]>
#> 4 1 <int [21]>
#> 5 2 <int [50]>
#> 6 3 <int [6]>
I want to group_by 'x' and summarise the vectors of each group into a single vector. I tried using c(), but it didn't help.
df %>%
group_by(x) %>%
summarise(z = c(y))
#> `summarise()` has grouped output by 'x'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 2
#> # Groups: x [3]
#> x z
#> <int> <list>
#> 1 1 <int [5]>
#> 2 1 <int [21]>
#> 3 2 <int [10]>
#> 4 2 <int [50]>
#> 5 3 <int [11]>
#> 6 3 <int [6]>
I also want a union of elements in a group or any other similar function applied to these kinds of datasets.
df %>%
group_by(x) %>%
summarise(z = union(y))
#> Error in `summarise()`:
#> ! Problem while computing `z = union(y)`.
#> ℹ The error occurred in group 1: x = 1.
#> Caused by error in `base::union()`:
#> ! argument "y" is missing, with no default
If you want the data to remain nested, you can do
df %>%
group_by(x) %>%
summarise(z = list(unlist(y)))
The c() function won't work because it doesn't flatten lists. For example, compare
c(list(1:3, 4:5))
unlist(list(1:3, 4:5))
The c function doesn't return a single vector, but unlist does. This matters because your function will receive a list of matching row values when you use summarize.
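The difference can be checked directly in base R. This small sketch uses the same two-element list as the comparison above (the variable names are illustrative only):

```r
nested <- list(1:3, 4:5)

# c() with a single list argument just returns the list unchanged
combined <- c(nested)
is.list(combined)   # TRUE
length(combined)    # 2

# unlist() flattens the list into one atomic vector
flat <- unlist(nested)
is.list(flat)       # FALSE
length(flat)        # 5
```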
Also note that if you leave off the list(), the values won't be nested anymore
df %>%
group_by(x) %>%
summarise(z = unlist(y))
# x z
# <int> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 1 20
# 7 1 21
# ...
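For the union part of the question, one possible approach (a sketch, not the only way) is to fold the two-argument base::union over the list of vectors in each group with Reduce, which is why union(y) on its own fails with a missing-argument error:

```r
library(dplyr)

df <- tibble(x = rep(1:3, 2),
             y = list(1:5, 1:10, 10:20, 20:40, 1:50, 5:10))

res <- df %>%
  group_by(x) %>%
  # Reduce() applies union pairwise across all vectors in the group
  summarise(z = list(Reduce(base::union, y)))

lengths(res$z)
```

The same pattern should work for other two-argument set functions such as intersect.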

Create a list column with ranges set by existing columns

I am trying to create a list column within a data frame, specifying the range using existing columns, something like:
# A tibble: 3 x 3
A B C
<dbl> <dbl> <list>
1 1 6 c(1, 2, 3, 4, 5, 6)
2 2 5 c(2, 3, 4, 5)
3 3 4 c(3, 4)
The catch is that it would need to be created as follows:
df %>% mutate(C = c(A:B))
I have a dataset containing integers entered as ranges, i.e. someone has entered "7 to 26". I've separated the ranges into two columns, A and B (or "start" and "end"), and was hoping to use c(A:B) to create a list, but using dplyr I keep getting:
Warning messages:
1: In a:b : numerical expression has 3 elements: only the first used
2: In a:b : numerical expression has 3 elements: only the first used
Which gives:
# A tibble: 3 x 3
A B C
<dbl> <dbl> <list>
1 1 6 list(1:6)
2 2 5 list(1:6)
3 3 4 list(1:6)
Has anyone had a similar issue and found a workaround?
You can use map2() in purrr
library(dplyr)
df %>%
mutate(C = purrr::map2(A, B, seq))
or do rowwise() before mutate()
df %>%
rowwise() %>%
mutate(C = list(A:B)) %>%
ungroup()
Both methods give
# # A tibble: 3 x 3
# A B C
# <int> <int> <list>
# 1 1 6 <int [6]>
# 2 2 5 <int [4]>
# 3 3 4 <int [2]>
Data
df <- tibble::tibble(A = 1:3, B = 6:4)
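Putting the pieces together, here is a hedged end-to-end sketch starting from the text ranges the question describes. The raw column name range and the " to " separator are assumptions made for illustration; the map2() step is the one from the answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

# hypothetical raw input: ranges typed as text
raw <- tibble(range = c("7 to 26", "1 to 6", "3 to 4"))

ranges <- raw %>%
  # split "7 to 26" into start and end, converting both to integers
  separate(range, into = c("A", "B"), sep = " to ", convert = TRUE) %>%
  # build one sequence per row
  mutate(C = map2(A, B, seq))

lengths(ranges$C)
```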

summarize to vector output

Let's say I have the following (simplified) tibble containing a group and values in vectors:
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
# A tibble: 5 x 2
group values
<fct> <list>
1 A <int [3]>
2 A <int [3]>
3 B <int [3]>
4 B <int [3]>
5 B <int [3]>
tb_vec[[1,2]]
[1] 1 3 2
I would like to summarize the values vectors per group by summing them (vectorized) and tried the following:
tb_vec %>% group_by(group) %>%
summarize(vec_sum = colSums(purrr::reduce(values, rbind)))
Error: Column vec_sum must be length 1 (a summary value), not 3
The error surprises me, because tibbles (the output format) can contain vectors as well.
My expected output would be the following summarized tibble:
# A tibble: 2 x 2
group vec_sum
<fct> <list>
1 A <dbl [3]>
2 B <dbl [3]>
Is there a tidyverse solution that accommodates vector output from summarize? I want to avoid splitting the tibble, because then I lose the factor.
You just need to wrap the result in list() within summarise, so that the output column has 2 elements, each of which is a vector of 3 values:
library(tidyverse)
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
tb_vec %>%
group_by(group) %>%
summarize(vec_sum = list(colSums(purrr::reduce(values, rbind)))) -> res
res$vec_sum
# [[1]]
# [1] 2 4 6
#
# [[2]]
# [1] 6 5 7
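As a possible variation, the element-wise sum can also be written with Reduce and `+`, which avoids building a matrix. It also copes with a group that has a single row, where reduce(values, rbind) returns a plain vector and colSums would fail. The data below are a small deterministic stand-in chosen for illustration, not the sampled tb_vec:

```r
library(dplyr)

# illustrative data: group B deliberately has only one row
tb <- tibble(group  = factor(c("A", "A", "B")),
             values = list(c(1, 2, 3), c(3, 2, 1), c(1, 1, 1)))

res <- tb %>%
  group_by(group) %>%
  # add the vectors of each group element by element
  summarise(vec_sum = list(Reduce(`+`, values)))

res$vec_sum
```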

how to "spread" a list-column?

Consider this simple example
mydf <- data_frame(regular_col = c(1,2),
normal_col = c('a','b'),
weird_col = list(list('hakuna', 'matata'),
list('squash', 'banana')))
> mydf
# A tibble: 2 x 3
regular_col normal_col weird_col
<dbl> <chr> <list>
1 1 a <list [2]>
2 2 b <list [2]>
I would like to extract the elements of weird_col (programmatically, the number of elements may change) so that each element is placed on a different column. That is, I expect the following output
> data_frame(regular_col = c(1,2),
+ normal_col = c('a','b'),
+ weirdo_one = c('hakuna', 'squash'),
+ weirdo_two = c('matata', 'banana'))
# A tibble: 2 x 4
regular_col normal_col weirdo_one weirdo_two
<dbl> <chr> <chr> <chr>
1 1 a hakuna matata
2 2 b squash banana
However, I am unable to do so in simple terms. For instance, using the classic unnest fails here, as it expands the dataframe instead of placing each element of the list in a different column.
> mydf %>% unnest(weird_col)
# A tibble: 4 x 3
regular_col normal_col weird_col
<dbl> <chr> <list>
1 1 a <chr [1]>
2 1 a <chr [1]>
3 2 b <chr [1]>
4 2 b <chr [1]>
Is there any solution in the tidyverse for that?
You can extract the values from the output of unnest, do a little processing to build the column names, and then spread back out. Note that I use flatten_chr because your list-column is only one level deep; if it is nested more deeply you can use flatten instead, and spread works just as well on list-cols.
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.5.1
mydf <- data_frame(
regular_col = c(1, 2),
normal_col = c("a", "b"),
weird_col = list(
list("hakuna", "matata"),
list("squash", "banana")
)
)
mydf %>%
unnest(weird_col) %>%
group_by(regular_col, normal_col) %>%
mutate(
weird_col = flatten_chr(weird_col),
weird_colname = str_c("weirdo_", row_number())
) %>% # or just as.character
spread(weird_colname, weird_col)
#> # A tibble: 2 x 4
#> # Groups: regular_col, normal_col [2]
#> regular_col normal_col weirdo_1 weirdo_2
#> <dbl> <chr> <chr> <chr>
#> 1 1 a hakuna matata
#> 2 2 b squash banana
Created on 2018-08-12 by the reprex package (v0.2.0).
unnest expands lists and vectors vertically, and one-row data frames horizontally. So we can convert your lists into data frames (with suitable column names) and unnest afterwards.
mydf %>%
  mutate(weird_col = map(weird_col, ~ as_tibble(
    setNames(., paste0("weirdo_", seq_along(.)))
  ))) %>%
  unnest(weird_col)
# # A tibble: 2 x 4
# regular_col normal_col weirdo_1 weirdo_2
# <dbl> <chr> <chr> <chr>
# 1 1 a hakuna matata
# 2 2 b squash banana
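With newer tidyr (1.0.0 or later, which is an assumption about the installed version), unnest_wider() may do this in a single step. Because the inner lists are unnamed, names_sep is supplied so that column names can be generated; they come out as weird_col_1 and weird_col_2 rather than the weirdo_ names used above.

```r
library(dplyr)
library(tidyr)

mydf <- tibble(regular_col = c(1, 2),
               normal_col  = c("a", "b"),
               weird_col   = list(list("hakuna", "matata"),
                                  list("squash", "banana")))

# spread each element of the list-column into its own column
wide <- mydf %>%
  unnest_wider(weird_col, names_sep = "_")

names(wide)
```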

How to separate a column list of fixed size X to X different columns?

I have a tibble with one column being a list column, always having two numeric values named a and b (e.g. as a result of calling purrr::map with a function that returns a list), say:
df <- tibble(x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)))
df
# A tibble: 3 × 2
x y
<int> <list>
1 1 <list [2]>
2 2 <list [2]>
3 3 <list [2]>
How do I separate the list column y into two columns a and b, and get:
df_res <- tibble(x = 1:3, a = c(1,3,5), b = c(2,4,6))
df_res
# A tibble: 3 × 3
x a b
<int> <dbl> <dbl>
1 1 1 2
2 2 3 4
3 3 5 6
Looking for something like tidyr::separate to deal with a list instead of a string.
Using dplyr (current release: 0.7.0):
bind_cols(df[1], bind_rows(df$y))
# # A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
# 1 1 1 2
# 2 2 3 4
# 3 3 5 6
edit based on OP's comment:
To embed this in a pipe and in case you have many non-list columns, we can try:
df %>% select(-y) %>% bind_cols(bind_rows(df$y))
We could also make use of map_df from purrr
library(tidyverse)
df %>%
summarise(x = list(x), new = list(map_df(.$y, bind_rows))) %>%
unnest
# A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
#1 1 1 2
#2 2 3 4
#3 3 5 6
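In tidyr 1.0.0 and later (newer than the releases these answers were written against), unnest_wider() is probably the closest thing to a "separate for lists". A minimal sketch, assuming every element of y is a named list with the same names:

```r
library(dplyr)
library(tidyr)

df <- tibble(x = 1:3,
             y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)))

# each named element of y becomes its own column
df_wide <- df %>% unnest_wider(y)
df_wide
```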
