how to "spread" a list-column? - r

Consider this simple example
mydf <- data_frame(regular_col = c(1,2),
normal_col = c('a','b'),
weird_col = list(list('hakuna', 'matata'),
list('squash', 'banana')))
> mydf
# A tibble: 2 x 3
regular_col normal_col weird_col
<dbl> <chr> <list>
1 1 a <list [2]>
2 2 b <list [2]>
I would like to extract the elements of weird_col (programmatically, the number of elements may change) so that each element is placed on a different column. That is, I expect the following output
> data_frame(regular_col = c(1,2),
+ normal_col = c('a','b'),
+ weirdo_one = c('hakuna', 'squash'),
+ weirdo_two = c('matata', 'banana'))
# A tibble: 2 x 4
regular_col normal_col weirdo_one weirdo_two
<dbl> <chr> <chr> <chr>
1 1 a hakuna matata
2 2 b squash banana
However, I am unable to do so in simple terms. For instance, using the classic unnest fails here, as it expands the dataframe instead of placing each element of the list in a different column.
> mydf %>% unnest(weird_col)
# A tibble: 4 x 3
regular_col normal_col weird_col
<dbl> <chr> <list>
1 1 a <chr [1]>
2 1 a <chr [1]>
3 2 b <chr [1]>
4 2 b <chr [1]>
Is there any solution in the tidyverse for that?

You can extract the values from the output of unnest, process a little to make your column names, and then spread back out. Note that I use flatten_chr because of your depth-one list-column, but if it is nested you can use flatten and spread works just as well on list-cols.
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.5.1
mydf <- data_frame(
regular_col = c(1, 2),
normal_col = c("a", "b"),
weird_col = list(
list("hakuna", "matata"),
list("squash", "banana")
)
)
mydf %>%
unnest(weird_col) %>%
group_by(regular_col, normal_col) %>%
mutate(
weird_col = flatten_chr(weird_col),
weird_colname = str_c("weirdo_", row_number())
) %>% # or just as.character
spread(weird_colname, weird_col)
#> # A tibble: 2 x 4
#> # Groups: regular_col, normal_col [2]
#> regular_col normal_col weirdo_1 weirdo_2
#> <dbl> <chr> <chr> <chr>
#> 1 1 a hakuna matata
#> 2 2 b squash banana
Created on 2018-08-12 by the reprex package (v0.2.0).

unnest develops lists and vectors vertically, and one row data frames horizontally. So what we can do is change your lists into data frames (with adequate column names) and unnest afterwards.
mydf %>% mutate(weird_col = map(weird_col,~ as_data_frame(
setNames(.,paste0("weirdo_",1:length(.)))
))) %>%
unnest
# # A tibble: 2 x 4
# regular_col normal_col weirdo_1 weirdo_2
# <dbl> <chr> <chr> <chr>
# 1 1 a hakuna matata
# 2 2 b squash banana

Related

Calculate correlation for two data frames for all columns after group_by in R

Sample data:
A <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"), a = c(1,2,3,4,5,8), b = c(4,5,6,7,8,2), c = c(9,6,5,43,2,5))
B <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"),a = c(7,4,3,6,9,81), b = c(9,4,3,7,0,35), c = c(22,5,6,2,9,33))
Expected output:
(x is the correlation for the column between two data frames in the region)
I have tried:
Binding two data frames into one and calculate correlation between two columns in one data frame. It is a bit tedious to type every column names, which also creates too many columns. Is there a simpler way to do this?
If my understanding is not off, then here is a solution using dplyr and tidyr.
library(dplyr)
library(tidyr)
rbind(cbind(set = "A", A), cbind(set = "B", B)) %>%
pivot_longer(-c(set, region)) %>%
group_by(region, name) %>%
summarise(value = cor(value[set == "A"], value[set == "B"]), .groups = "drop") %>%
pivot_wider()
Output
# A tibble: 3 x 4
region a b c
<chr> <dbl> <dbl> <dbl>
1 AUS 1 -1 1
2 UK 1 1 -1
3 US -1 -1 1
This is a little convoluted but it's an alternative way to do it.
library(tidyverse)
A <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"), a = c(1,2,3,4,5,8), b = c(4,5,6,7,8,2), c = c(9,6,5,43,2,5))
B <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"),a = c(7,4,3,6,9,81), b = c(9,4,3,7,0,35), c = c(22,5,6,2,9,33))
(df <- map(list(A, B), ~nest_by(.x, region)) %>%
reduce(inner_join, by = 'region'))
#> # A tibble: 3 × 3
#> # Rowwise: region
#> region data.x data.y
#> <chr> <list<tibble[,3]>> <list<tibble[,3]>>
#> 1 AUS [2 × 3] [2 × 3]
#> 2 UK [2 × 3] [2 × 3]
#> 3 US [2 × 3] [2 × 3]
bind_cols(select(df, region), map2_dfr(df$data.x, df$data.y, ~map2_dfc(.x, .y, ~cor(.x, .y))))
#> # A tibble: 3 × 4
#> # Rowwise: region
#> region a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 AUS 1 -1 1
#> 2 UK 1 1 -1
#> 3 US -1 -1 1
Created on 2022-01-06 by the reprex package (v2.0.1)

How to extract first value from lists in data.frames columns?

This question is similar to R: How to extract a list from a dataframe?
But I could not implement it to my question in an easy way.
weird_df <- data_frame(col1 =c('hello', 'world', 'again'),col_weird = list(list(12,23), list(23,24), NA),col_weird2 = list(list(0,45), list(4,45),list(45,45.45,23)))
weird_df
# A tibble: 3 x 3
col1 col_weird col_weird2
<chr> <list> <list>
1 hello <list [2]> <list [2]>
2 world <list [2]> <list [2]>
3 again <lgl [1]> <list [3]>
>
I want in the columns col_weirdand col_weird2 to only display the first value of the current list.
col1 col_weird col_weird2
1 hello 12 0
2 world 23 4
3 again NA 45
My real problem has a lot of columns.I tried this (altered acceptend answer in posted link)
library(tidyr)
library(purrr)
weird_df %>%
mutate(col_weird = map(c(col_weird,col_weird2), toString ) ) %>%
separate(col_weird, into = c("col1"), convert = TRUE) %>%
separate(col_weird2, into = c("col2",convert = T)
One solution would be to write a simple function that extracts the first value from each list in a vector of lists . This you can then apply to the relevant columns in your data frame.
library(tibble)
#create data
weird_df <- tibble(col1 =c('hello', 'world', 'again'),
col_weird = list(list(12,23), list(23,24), NA),
col_weird2 = list(list(0,45), list(4,45), list(45,45.45,23)))
#function to extract first values from a vector of lists
fnc <- function(x) {
sapply(x, FUN = function(y) {y[[1]]})
}
#apply function to the relevant columns
weird_df[,2:3] <- apply(weird_df[,2:3], MARGIN = 2, FUN = fnc)
weird_df
# A tibble: 3 x 3
col1 col_weird col_weird2
<chr> <dbl> <dbl>
1 hello 12 0
2 world 23 4
3 again NA 45
Here is a dplyr solution
library(dplyr)
weird_df %>% mutate(across(c(col_weird, col_weird2), ~vapply(., `[[`, numeric(1L), 1L)))
Output
# A tibble: 3 x 3
col1 col_weird col_weird2
<chr> <dbl> <dbl>
1 hello 12 0
2 world 23 4
3 again NA 45

Storing and calling variables in a column in dplyr within a function

I want to store some variables within a column cell within a tibble. I then want to call that column and either paste the names of those variables or call that column and paste the columns which those variables correspond to together. In addition, all of this occurs within a function and this is the only piece of hard coding left so I'd really like to find a way to solve this.
library("tidyverse")
myData<-tibble("c1"=c("a","b","c"),
"c2"=c("1","2","3"),
"c3"=c("A","B","C"),
factors=c(list(c("c1","c2")),list(c("c2","c3")),list(c("c1","c2","c3"))))
myData%>%mutate(factors1=interaction(!!!quos(factors),sep=":",lex.order=TRUE))
# A tibble: 3 x 5
c1 c2 c3 factors factors1
<chr> <chr> <chr> <list> <fct>
1 a 1 A <chr [2]> c1:c2:c1
2 b 2 B <chr [2]> c2:c3:c2
3 c 3 C <chr [3]> c1:c2:c3
So this allows me to concatenate the names of the variables but as you can see, if one list is longer than the others, it loops.
For the second problem in which I would like to use the $factors column to specifically call the values of other columns, I can hardcode this like so:
myData%>%
mutate(factors2=interaction(!!!syms(c("c1","c2")),sep=":",lex.order=TRUE))
# A tibble: 3 x 5
c1 c2 c3 factors factors2
<chr> <chr> <chr> <list> <fct>
1 a 1 A <chr [2]> a:1
2 b 2 B <chr [2]> b:2
3 c 3 C <chr [3]> c:3
However if I try this:
myData%>%
mutate(factors2=interaction(!!!syms(factors),sep=":",lex.order=TRUE))
Error in lapply(.x, .f, ...) : object 'factors' not found
The same happens if I try to unlist the factors or use other rlang expressions. I have also tried nesting rlang expressions but so far haven't found one that works as I intended.
I feel like this is something that should be possible but so far I haven't found a question on stack overflow or a tutorial that indicates that it is so maybe I'm on a wild goose chase. Thank you all for your time and help.
My code in full:
library("tidyverse")
myData<-tibble("c1"=c("a","b","c"),
"c2"=c("1","2","3"),
"c3"=c("A","B","C"),
factors=c(list(c("c1","c2")),list(c("c2","c3")),list(c("c1","c2","c3"))))%>%
mutate(factors1=interaction(!!!quos(factors),sep=":",lex.order=TRUE))%>%
mutate(factors2=interaction(!!!syms(factors),sep=":",lex.order=TRUE))
My desired output is:
# A tibble: 3 x 6
c1 c2 c3 factors factors1 factors2
<chr> <chr> <chr> <list> <fct> <fct>
1 a 1 A <chr [2]> c1:c2 a:1
2 b 2 B <chr [2]> c2:c3 2:B
3 c 3 C <chr [3]> c1:c2:c3 c:3:C
Here is a method using map and imap:
library(tidyverse)
myData %>%
mutate(factor1 = factors %>% map(~interaction(as.list(.), sep=':', lex.order = TRUE)) %>% unlist(),
factor2 = factors %>% imap(~interaction(myData[.y, match(.x, names(myData))], sep=":", lex.order = TRUE)) %>% unlist())
For factor1, instead of splicing the arguments into dots, I pass a list into interaction.
For factor2, I match factors in each row with the names in myData and uses the column index (match(.x, names(myData))) in combination with the row index (.y from imap) to subset the appropriate elements to feed into interaction.
Both factor1 and factor2 require an unlist because map and imap returns lists.
Output:
# A tibble: 3 x 6
c1 c2 c3 factors factor1 factor2
<chr> <chr> <chr> <list> <fct> <fct>
1 a 1 A <chr [2]> c1:c2 a:1
2 b 2 B <chr [2]> c2:c3 2:B
3 c 3 C <chr [3]> c1:c2:c3 c:3:C
You first question can be addressed with purrr::map and purrr::lift families of functions:
myData %>%
mutate( factors1 = map(factors, lift_dv(interaction, sep=":", lex.order=TRUE)) ) %>%
mutate_at( "factors1", lift(fct_c) )
# # A tibble: 3 x 5
# c1 c2 c3 factors factors1
# <chr> <chr> <chr> <list> <fct>
# 1 a 1 A <chr [2]> c1:c2
# 2 b 2 B <chr [2]> c2:c3
# 3 c 3 C <chr [3]> c1:c2:c3
The second question is more tricky, because !!! causes the evaluation of its argument immediately, which can sometimes lead to unintuitive operator precedence inside a dplyr chain. The cleanest way is to define a standalone function that composes your interaction expressions:
f <- function(fct) {expr( interaction(!!!syms(fct), sep=":", lex.order=TRUE) )}
# Example usage
f( myData$factors[[1]] ) # interaction(c1, c2, sep = ":", lex.order = TRUE)
f( myData$factors[[2]] ) # interaction(c2, c3, sep = ":", lex.order = TRUE)
myData %>% mutate( e = map(factors, f) )
# # A tibble: 3 x 5
# c1 c2 c3 factors e
# <chr> <chr> <chr> <list> <list>
# 1 a 1 A <chr [2]> <language>
# 2 b 2 B <chr [2]> <language>
# 3 c 3 C <chr [3]> <language>
Unfortunately, we can't evaluate e directly, because it will feed the entire columns c1, c2, and c3 to the expressions, whereas you only want a single value that is in the same row as the expression. For this reason, we need to encapsulate columns c1 through c3 in a row-wise fashion.
X <- myData %>% mutate( e = map(factors, f) ) %>%
rowwise() %>% mutate( d = list(data_frame(c1,c2,c3)) ) %>% ungroup()
# # A tibble: 3 x 6
# c1 c2 c3 factors e d
# <chr> <chr> <chr> <list> <list> <list>
# 1 a 1 A <chr [2]> <language> <tibble [1 × 3]>
# 2 b 2 B <chr [2]> <language> <tibble [1 × 3]>
# 3 c 3 C <chr [3]> <language> <tibble [1 × 3]>
Now you have expressions in e that need to be applied to data in d, so it's just a simple map2 traversal from here. Putting everything together and cleaning up, we get:
myData %>%
mutate( factors1 = map(factors, lift_dv(interaction, sep=":", lex.order=TRUE)) ) %>%
mutate( e = map(factors, f) ) %>%
rowwise() %>% mutate( d = list(data_frame(c1,c2,c3)) ) %>% ungroup() %>%
mutate( factors2 = map2( e, d, rlang::eval_tidy ) ) %>%
mutate_at( vars(factors1,factors2), lift(fct_c) ) %>%
select( -e, -d )
# # A tibble: 3 x 6
# c1 c2 c3 factors factors1 factors2
# <chr> <chr> <chr> <list> <fct> <fct>
# 1 a 1 A <chr [2]> c1:c2 a:1
# 2 b 2 B <chr [2]> c2:c3 2:B
# 3 c 3 C <chr [3]> c1:c2:c3 c:3:C

Unnest one column list to many columns in tidyr

For example, I have a tidy data frame like this:
df <- tibble(id=1:2,
ctn=list(list(a="x",b=1),
list(a="y",b=2)))
# A tibble: 2 x 2
id ctn
<int> <list>
1 1 <list [2]>
2 2 <list [2]>
How could I unnest ctn column to the right so that the data frame will be like this:
# A tibble: 2 x 3
id a b
<int> <chr> <dbl>
1 1 x 1
2 2 y 2
With dplyr and purrr
df %>%
mutate(ctn = map(ctn, as_tibble)) %>%
unnest()
# A tibble: 2 x 3
id a b
<int> <chr> <dbl>
1 1 x 1
2 2 y 2
One option is
library(data.table)
setDT(df)[, unlist(ctn, recursive = FALSE), id]
# id a b
#1: 1 x 1
#2: 2 y 2
Or with tidyr
library(tidyverse)
df$ctn %>%
setNames(., df$id) %>%
bind_rows(., .id = 'id')
# A tibble: 2 x 3
# id a b
# <chr> <chr> <dbl>
#1 1 x 1
#2 2 y 2
In a tidy way we can now (dplyr 1.0.2 and above) do this using rowwise():
df %>% rowwise() %>% mutate(as_tibble(ctn))
# A tibble: 2 x 4
# Rowwise:
id ctn a b
<int> <list> <chr> <dbl>
1 1 <named list [2]> x 1
2 2 <named list [2]> y 2
And sticking to purrr we can also:
df %>% mutate(map_dfr(ctn, as_tibble))
# A tibble: 2 x 4
id ctn a b
<int> <list> <chr> <dbl>
1 1 <named list [2]> x 1
2 2 <named list [2]> y 2

How to separate a column list of fixed size X to X different columns?

I have a tibble with one column being a list column, always having two numeric values named a and b (e.g. as a result of calling purrr:map to a function which returns a list), say:
df <- tibble(x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)))
df
# A tibble: 3 × 2
x y
<int> <list>
1 1 <list [2]>
2 2 <list [2]>
3 3 <list [2]>
How do I separate the list column y into two columns a and b, and get:
df_res <- tibble(x = 1:3, a = c(1,3,5), b = c(2,4,6))
df_res
# A tibble: 3 × 3
x a b
<int> <dbl> <dbl>
1 1 1 2
2 2 3 4
3 3 5 6
Looking for something like tidyr::separate to deal with a list instead of a string.
Using dplyr (current release: 0.7.0):
bind_cols(df[1], bind_rows(df$y))
# # A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
# 1 1 1 2
# 2 2 3 4
# 3 3 5 6
edit based on OP's comment:
To embed this in a pipe and in case you have many non-list columns, we can try:
df %>% select(-y) %>% bind_cols(bind_rows(df$y))
We could also make use the map_df from purrr
library(tidyverse)
df %>%
summarise(x = list(x), new = list(map_df(.$y, bind_rows))) %>%
unnest
# A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
#1 1 1 2
#2 2 3 4
#3 3 5 6

Resources