Preserve column-of-tibbles when using grouped summarise with reduce

Question
When using tibble %>% group_by() %>% summarise(...=reduce(...)) on a column containing tibbles, I would like the output to remain a column of tibbles. How do I do that most efficiently?
Minimal Example:
Setup
library(tidyverse)
vec1 = rnorm(10)
vec2 = rnorm(10)
vec3 = rnorm(10)
vec4 = rnorm(10)
tib=tibble(grpvar=factor(c('a','a','b','b')))
tib$col2=1
tib$col2[1]=tibble(vec1)
tib$col2[2]=tibble(vec2)
tib$col2[3]=tibble(vec3)
tib$col2[4]=tibble(vec4)
This is what it looks like:
grpvar col2
<fct> <list>
1 a <dbl [10]>
2 a <dbl [10]>
3 b <dbl [10]>
4 b <dbl [10]>
A very minimal tibble with a variable that will be used for grouping, and another column containing tibbles which contain vectors of length 10.
Problem
Using reduce within summarise simplifies the output...
tib %>% group_by(grpvar) %>% summarise(aggr=reduce(col2,`+`))
yields:
grpvar aggr
<fct> <dbl>
1 a -0.0206
...
10 a -0.101
...
20 b 0.520
Here, the tibble becomes very long ... I don't want 10 rows per group variable, but instead just one tibble containing the 10 values.
Desired output:
This is what it should look like
desired_output<-tibble(grpvar=c('a','b'),aggr=NA)
desired_output$aggr[1]=tibble(reduce(tib$col2[1:2],`+`))
desired_output$aggr[2]=tibble(reduce(tib$col2[3:4],`+`))
which looks like:
# A tibble: 2 x 2
grpvar aggr
<chr> <list>
1 a <dbl [10]>
2 b <dbl [10]>
i.e., it retains the column-of-tibbles structure (which, I believe, is internally a list of vectors).

Wrap reduce with list:
tib %>% group_by(grpvar) %>% summarise(aggr=list(reduce(col2,`+`)))
Output:
# A tibble: 2 x 2
grpvar aggr
<fct> <list>
1 a <dbl [10]>
2 b <dbl [10]>
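Why this works: summarise() expects one value per group, and list() turns the reduced vector into a single list element. The following is a minimal sketch with small fixed vectors (made-up values, not the random data from the question), so the result can be checked by hand:

```r
library(dplyr)
library(purrr)

# Illustrative data: a grouping factor and a list-column of numeric vectors
tib <- tibble(
  grpvar = factor(c("a", "a", "b", "b")),
  col2 = list(c(1, 2, 3), c(10, 20, 30), c(4, 5, 6), c(40, 50, 60))
)

# list() wraps the element-wise sum so each group contributes exactly one row
res <- tib %>%
  group_by(grpvar) %>%
  summarise(aggr = list(reduce(col2, `+`)))

res$aggr[[1]]  # 11 22 33
res$aggr[[2]]  # 44 55 66
```

Without the list() wrapper, each group's length-3 result would be spread over several rows (or rejected, depending on the dplyr version).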

Related

Removing NULL values from the table

I have a table with NULL values in one column, and those NULL values add an extra label to the Highcharts graph. How can I use dplyr to get rid of the rows that have NULL values in that specific column?
I was thinking of changing the backend SQL queries to filter the result for the desired output, but that does not seem like an appropriate way.
This is not working,
dplyr::filter(!is.na(ColumnWithNullValues)) %>%
Actual code:
df <- data() %>%
dplyr::filter(CreatedBy == 'owner') %>%
dplyr::group_by(`Reason for creation`) %>%
dplyr::arrange(ReasonOrder) %>%
Here, ColumnWithNullValues is the column that has NULL values.
Here is one option with base R
df[!sapply(df$ColumnWithNullValues, is.null),]
data
library(tibble)
df <- tibble(
ColumnWithNullValues = list(c(1:5), NULL, c(6:10)))
Here is a small example df:
library(dplyr)
library(purrr)
df <- tibble(
ColumnWithNullValues = list(c(1:5), NULL, c(6:10))
)
df
#> # A tibble: 3 x 1
#> ColumnWithNullValues
#> <list>
#> 1 <int [5]>
#> 2 <NULL>
#> 3 <int [5]>
In this case, the most logical approach might seem to be:
df %>%
filter(!is.null(ColumnWithNullValues))
#> # A tibble: 3 x 1
#> ColumnWithNullValues
#> <list>
#> 1 <int [5]>
#> 2 <NULL>
#> 3 <int [5]>
But as you can see, that does not work. Instead, we need to use map/sapply/vapply to get inside the list. For example like this:
df %>%
filter(map_lgl(ColumnWithNullValues, function(x) !all(is.null(x))))
#> # A tibble: 2 x 1
#> ColumnWithNullValues
#> <list>
#> 1 <int [5]>
#> 2 <int [5]>
But as @akrun has explained in the comments, an element of the list cannot contain NULL alongside other values. So we can simplify the code to this:
df %>%
filter(!map_lgl(ColumnWithNullValues, is.null))
#> # A tibble: 2 x 1
#> ColumnWithNullValues
#> <list>
#> 1 <int [5]>
#> 2 <int [5]>
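To see why the naive filter keeps every row, it may help to look at the list on its own. This is a small sketch on a plain list (same values as the example df): is.null() applied to the whole list returns a single FALSE, which filter() then recycles across all rows, whereas map_lgl() tests each element.

```r
library(purrr)

x <- list(1:5, NULL, 6:10)

# The list itself is not NULL, so this is a single FALSE
whole_list <- is.null(x)

# One test per element: FALSE TRUE FALSE
per_element <- map_lgl(x, is.null)

# Logical index of the elements to keep: TRUE FALSE TRUE
keep <- !per_element
x_clean <- x[keep]
```

The base R equivalent of the per-element test is vapply(x, is.null, logical(1)).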

summarize to vector output

Let's say I have the following (simplified) tibble containing a group and values in vectors:
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
# A tibble: 5 x 2
group values
<fct> <list>
1 A <int [3]>
2 A <int [3]>
3 B <int [3]>
4 B <int [3]>
5 B <int [3]>
tb_vec[[1,2]]
[1] 1 3 2
I would like to summarize the values vectors per group by summing them (vectorized) and tried the following:
tb_vec %>% group_by(group) %>%
summarize(vec_sum = colSums(purrr::reduce(values, rbind)))
Error: Column vec_sum must be length 1 (a summary value), not 3
The error surprises me, because tibbles (the output format) can contain vectors as well.
My expected output would be the following summarized tibble:
# A tibble: 2 x 2
group vec_sum
<fct> <list>
1 A <dbl [3]>
2 B <dbl [3]>
Is there a tidyverse solution to accommodate the vector output of summarize? I want to avoid splitting the tibble, because then I lose the factor.
You just need to add list(.) within summarise in your solution, in order to be able to have a column with 2 elements, where each element is a vector of 3 values:
library(tidyverse)
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
tb_vec %>%
group_by(group) %>%
summarize(vec_sum = list(colSums(purrr::reduce(values, rbind)))) -> res
res$vec_sum
# [[1]]
# [1] 2 4 6
#
# [[2]]
# [1] 6 5 7
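As a possible variation, the rbind/colSums step can be replaced by an element-wise reduce with `+`. The sketch below uses fixed vectors (made up for illustration, not the sampled values from the question) and computes both forms side by side:

```r
library(dplyr)
library(purrr)

# Illustrative data with hand-picked integer vectors
tb_vec <- tibble(
  group = factor(c("A", "A", "B", "B", "B")),
  values = list(c(1L, 3L, 2L), c(3L, 1L, 2L),
                c(2L, 3L, 1L), c(1L, 2L, 3L), c(3L, 2L, 1L))
)

res <- tb_vec %>%
  group_by(group) %>%
  summarise(
    via_rbind = list(colSums(reduce(values, rbind))),  # stack rows, then sum columns
    via_plus  = list(reduce(values, `+`))              # element-wise sum directly
  )

res$via_plus[[1]]  # 4 4 4
res$via_plus[[2]]  # 6 7 5
```

One difference to be aware of: colSums() needs a matrix, so the rbind form fails for a group that has only a single row, while reduce(values, `+`) simply returns that row's vector.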

Turning variables from a list nested in a data frame into columns

I have a tbl_df for which one column is a list of named variables (which themselves are mostly lists). Preferably using tidyverse code, I would like to apply a function over a set of these variables, and turn the output of each of these function calls into a new column in the data frame (kind of like what mutate_at does, but for these nested variables).
For example, my current data resembles something like this:
d <- tibble(
l = list(list("a"=list("a1","a2","a3","a4"),
"b"=list("b1","b2","b3")),
list("a"=list("x1","x2"),
"b"=list("y3")))
)
# A tibble: 2 x 1
l
<list>
1 <list [2]>
2 <list [2]>
I would like to apply functions to the variables in d$l (i.e., a and b) in the same way that mutate_at does when you give it multiple functions, automatically creating new columns that are named after the variables that created them. For example, one function I would like to apply would return their lengths as new columns, i.e.:
# A tibble: 2 x 3
l n_a n_b
<list> <dbl> <dbl>
1 <list [2]> 4. 3.
2 <list [2]> 2. 1.
Does anyone know an easy way of doing this? So far I've been doing stuff like this:
d %>%
mutate(n_a = purrr::map(l, ~length(.$a)) %>%
purrr::simplify(),
n_b = purrr::map(l, ~length(.$b)) %>%
purrr::simplify())
But I don't want to have to write that out for every variable in l (the real data has ~24 variables).
EDIT: Also, to be clear, the example of getting the lengths is just one function that I'd like to apply. I really want a more general way of applying arbitrary functions on arbitrary subsets of the variables in l.
You can use lengths to get the length of each element without looping through them, and use bind_cols instead of mutate to add multiple columns to the data frame:
d %>% bind_cols(map_dfr(.$l, ~ as.list(lengths(.))))
# A tibble: 2 x 3
# l a b
# <list> <int> <int>
#1 <list [2]> 4 3
#2 <list [2]> 2 1
Or use compose to chain as.list and lengths:
d %>% bind_cols(map_dfr(.$l, compose(as.list, lengths)))
# A tibble: 2 x 3
# l a b
# <list> <int> <int>
#1 <list [2]> 4 3
#2 <list [2]> 2 1
Notice that this method checks the names in your list dynamically; if an element with a specific name is missing, the result is NA:
d <- tibble(
l = list(list("a"=list("a1","a2","a3","a4")),
list("a"=list("x1","x2"),
"b"=list("y3")))
)
d %>% bind_cols(map_dfr(.$l, ~ as.list(lengths(.))))
# A tibble: 2 x 3
# l a b
# <list> <int> <int>
#1 <list [1]> 4 NA
#2 <list [2]> 2 1
Another option is to use transpose from the purrr package and the lengths function.
bind_cols(d, map(transpose(d$l), lengths))
# # A tibble: 2 x 3
# l a b
# <list> <int> <int>
# 1 <list [2]> 4 3
# 2 <list [2]> 2 1
You can use sapply and assign the result to your new columns:
d[,c("a","b")] <- t(sapply(d$l,lengths))
# # A tibble: 2 x 3
# l a b
# <list> <int> <int>
# 1 <list [2]> 4 3
# 2 <list [2]> 2 1
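The answers above all rely on lengths. For the more general case raised in the question's EDIT (an arbitrary function over a chosen subset of the nested variables), one possible sketch is a small helper. The name add_nested_cols and its arguments are hypothetical, and it assumes the function returns a single integer per element:

```r
library(tibble)
library(purrr)

d <- tibble(
  l = list(list(a = list("a1", "a2", "a3", "a4"), b = list("b1", "b2", "b3")),
           list(a = list("x1", "x2"), b = list("y3")))
)

# Hypothetical helper: apply fn to each named variable inside the list-column l
# and store the results in new columns named <prefix><variable>.
add_nested_cols <- function(data, vars, fn, prefix) {
  for (v in vars) {
    data[[paste0(prefix, v)]] <- map_int(data$l, function(el) fn(el[[v]]))
  }
  data
}

res <- add_nested_cols(d, c("a", "b"), length, "n_")
res$n_a  # 4 2
res$n_b  # 3 1
```

For functions that return doubles or characters, map_int would need to be swapped for map_dbl or map_chr.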

How to calculate length of vector within a list column (nested)

I have the following code
library(tidyverse)
dat <- iris %>%
group_by(Species) %>%
summarise(summary = list(fivenum(Petal.Width)))
dat
#> # A tibble: 3 x 2
#> Species summary
#> <fct> <list>
#> 1 setosa <dbl [5]>
#> 2 versicolor <dbl [5]>
#> 3 virginica <dbl [5]>
Basically I used the Iris data, grouped it by Species and then calculated fivenum().
What I want to do is to simply calculate the length of the summary values:
This is what I have tried, but it doesn't produce what I expect:
dat %>%
mutate(nof_values = length(summary))
# A tibble: 3 x 3
# Species summary nof_values
# <fct> <list> <int>
#1 setosa <dbl [5]> 3
#2 versicolor <dbl [5]> 3
#3 virginica <dbl [5]> 3
The nof_values should all be equal to 5. What's the right way to do it?
We can use lengths to calculate the length of each element of a nested list:
library(tidyverse)
dat %>%
mutate(nof_values = lengths(summary))
# Species summary nof_values
# <fct> <list> <int>
#1 setosa <dbl [5]> 5
#2 versicolor <dbl [5]> 5
#3 virginica <dbl [5]> 5
whose equivalent in base R is
dat$nof_values <- lengths(dat$summary)
Side note: length is different from lengths
length(dat$summary)
#[1] 3
lengths(dat$summary)
#[1] 5 5 5
You can use the map_int command from the purrr package (which is part of the tidyverse)
dat <- iris %>%
group_by(Species) %>%
summarise(summary = list(fivenum(Petal.Width))) %>%
mutate(nof_values = map_int(summary, length))
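The difference between the approaches can be checked on a plain list, outside of any data frame. In this sketch (values are made up), length() counts the elements of the list, which is why the original mutate() call recycled a single 3 down the column, while lengths() and map_int(x, length) both return one count per element:

```r
library(purrr)

# A plain list standing in for the summary list-column
x <- list(c(0.1, 0.2, 0.3, 0.4, 0.5), c(1, 2, 3, 4, 5), c(6, 7, 8, 9, 10))

n_elements      <- length(x)           # 3: number of list elements
per_element     <- lengths(x)          # 5 5 5: length of each element
per_element_map <- map_int(x, length)  # 5 5 5: same result via purrr
```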

How to add calculated columns to nested data frames (list columns) using purrr

I would like to perform calculations on a nested data frame (stored as a list-column), and add the calculated variable back to each dataframe using purrr functions. I'll use this result to join to other data, and keeping it compact helps me to organize and examine it better. I can do this in a couple of steps, but it seems like there may be a solution I haven't come across. If there is a solution out there, I haven't been able to find it easily.
Load libraries. The example requires the following packages (available on CRAN):
library(dplyr)
library(tidyr) # for nest() and unnest()
library(purrr)
library(RcppRoll) # to calculate rolling mean
Example data with 3 subjects, and repeated measurements over time:
test <- tibble(
id= rep(1:3, each=20),
time = rep(1:20, 3),
var1 = rnorm(60, mean=10, sd=3),
var2 = rnorm(60, mean=95, sd=5)
)
Store the data as nested dataframe:
t_nest <- test %>% nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 3]>
2 2 <tibble [20 x 3]>
3 3 <tibble [20 x 3]>
Perform calculations. I will calculate multiple new variables based on the data, although a solution for just one could be expanded later. The result of each calculation will be a numeric vector, same length as the input (n=20):
t1 <- t_nest %>%
mutate(var1_rollmean4 = map(data, ~RcppRoll::roll_mean(.$var1, n=4, align="right", fill=NA)),
var2_delta4 = map(data, ~(.$var2 - lag(.$var2, 3))*0.095),
var3 = map2(var1_rollmean4, var2_delta4, ~.x -.y))
id data var1_rollmean4 var2_delta4 var3
<int> <list> <list> <list> <list>
1 1 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
2 2 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
3 3 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
My solution is to unnest this data, and then nest it again. There doesn't seem to be anything wrong with this, but it seems like a better solution may exist.
t1 %>% unnest %>%
nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 6]>
2 2 <tibble [20 x 6]>
3 3 <tibble [20 x 6]>
This other solution (from SO 42028710) is close, but not quite because it is a list rather than nested dataframes:
map_df(t_nest$data, ~ mutate(.x, var1calc = .$var1*100))
I've found quite a bit of helpful information using the purrr Cheatsheet but can't quite find the answer.
You can wrap another mutate when mapping through the data column and add the columns in each nested tibble:
t11 <- t_nest %>%
mutate(data = map(data,
~ mutate(.x,
var1_rollmean4 = RcppRoll::roll_mean(var1, n=4, align="right", fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = var1_rollmean4 - var2_delta4
)
))
t11
# A tibble: 3 x 2
# id data
# <int> <list>
#1 1 <tibble [20 x 6]>
#2 2 <tibble [20 x 6]>
#3 3 <tibble [20 x 6]>
The unnest-nest method, followed by reordering the columns inside each nested tibble, gives the same result:
nest_unnest <- t1 %>%
unnest %>% nest(-id) %>%
mutate(data = map(data, ~ select(.x, time, var1, var2, var1_rollmean4, var2_delta4, var3)))
identical(nest_unnest, t11)
# [1] TRUE
It seems like, for what you're trying to do, nesting is not necessary:
library(tidyverse)
library(zoo)
test %>%
group_by(id) %>%
mutate(var1_rollmean4 = rollapplyr(var1, 4, mean, fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = (var1_rollmean4 - var2_delta4))
# A tibble: 60 x 7
# Groups: id [3]
# id time var1 var2 var1_rollmean4 var2_delta4 var3
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 9.865199 96.45723 NA NA NA
# 2 1 2 9.951429 92.78354 NA NA NA
# 3 1 3 12.831509 95.00553 NA NA NA
# 4 1 4 12.463664 95.37171 11.277950 -0.10312483 11.381075
# 5 1 5 11.781704 92.05240 11.757076 -0.06945881 11.826535
# 6 1 6 12.756932 92.15666 12.458452 -0.27064269 12.729095
# 7 1 7 12.346409 94.32411 12.337177 -0.09952197 12.436699
# 8 1 8 10.223695 100.89043 11.777185 0.83961377 10.937571
# 9 1 9 4.031945 87.38217 9.839745 -0.45357658 10.293322
# 10 1 10 11.859477 97.96973 9.615382 0.34633428 9.269047
# ... with 50 more rows
Edit: You could still nest the result afterwards with %>% nest(-id).
If you still prefer to nest or are nesting for other reasons, it would go like
t1 <- t_nest %>%
mutate(data = map(data, ~.x %>% mutate(...)))
That is, you mutate on .x within the map statement. This will treat data as a data.frame and mutate will column-bind results to it.
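The pattern of calling mutate on .x inside map can be sketched with a simpler calculation than the rolling mean. This example assumes tidyr 1.0 or later (for the nest(data = -id) syntax); the column var1_centered is made up for illustration:

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)
test <- tibble(
  id   = rep(1:3, each = 5),
  time = rep(1:5, 3),
  var1 = rnorm(15, mean = 10, sd = 3)
)

# One row per id, with the remaining columns stored as a nested tibble
t_nest <- test %>% nest(data = -id)

# The inner mutate runs once per nested tibble, so mean(var1) is the
# per-id mean and the new column is added inside each nested data frame
t_new <- t_nest %>%
  mutate(data = map(data, ~ mutate(.x, var1_centered = var1 - mean(var1))))

names(t_new$data[[1]])  # "time" "var1" "var1_centered"
```

Because the new column lives inside each nested tibble, no unnest-nest round trip is needed afterwards.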

Resources