Is there a way to automatically propagate NULL in purrr::map?

I have some data, some of which is replicated, some not. I can only fit
my model for the replicated data.
library(tidyverse)
library(magrittr) # provides %<>%, which library(tidyverse) does not attach
d <- tribble(
~env, ~val,
"A", 1,
"A", 2,
"B", 3
)
I am using tidyr::nest() and purrr::map() functions to fit my model.
However, in every function I use for map() I had to cater for the special
case that a particular set of data is not modellable, which I achieved
via calls in the style of
map(col, function(elem){ if(!is.null(elem)) DO_STUFF(elem) else NULL})
After a while, I managed to extract this behaviour to a purrr-style adverb
function which takes another function and wraps it such that this behaviour
for NULL elements is automatic:
maybe <- function(fun){
function(val,...){ if(!is.null(val)) fun(val, ...) else NULL}
}
However, this left me wondering: Am I duplicating behaviour which is
already achievable using tidyverse functions?
Bonus question: Is there a word in functional programming for a function like maybe?
This is an example to test my adverb:
Simple Model: A mean for data in environment A and no model for data in
environment B (since the data is unreplicated):
modelFuns <- list(A = mean, B = NULL)
Group data by environment and determine the appropriate model for each group
d <- d %>% group_by(env) %>% nest(.key = "data")
d %<>% mutate(model = modelFuns[env])
d
## # A tibble: 2 x 3
## env data model
## <chr> <list> <list>
## 1 A <tibble [2 × 1]> <fn>
## 2 B <tibble [1 × 1]> <NULL>
Perform modelling:
d %<>% mutate(out = pmap(list(model, data), maybe(function(m,d) m(d$val))))
d
## # A tibble: 2 x 4
## env data model out
## <chr> <list> <list> <list>
## 1 A <tibble [2 × 1]> <fn> <dbl [1]>
## 2 B <tibble [1 × 1]> <NULL> <NULL>
Which is equivalent to the following code which does not use my maybe adverb:
d %<>% mutate(out = pmap(list(model, data), function(m,d){if(!is.null(m)) m(d$val) else NULL}))
d
## # A tibble: 2 x 4
## env data model out
## <chr> <list> <list> <list>
## 1 A <tibble [2 × 1]> <fn> <dbl [1]>
## 2 B <tibble [1 × 1]> <NULL> <NULL>
The fact that there might be a value or there might be NULL propagates to
everything I want to do with the modelling results downstream, which is
why the adverb maybe is useful. Does something like this already exist
in the tidyverse?
isModelNice <- function(val) val > 0
d %<>% mutate(nice = map(out, maybe(isModelNice)))
d
## # A tibble: 2 x 5
## env data model out nice
## <chr> <list> <list> <list> <list>
## 1 A <tibble [2 × 1]> <fn> <dbl [1]> <lgl [1]>
## 2 B <tibble [1 × 1]> <NULL> <NULL> <NULL>

Could you use purrr::possibly()?
library(tidyverse)
d <- tribble(
~env, ~val,
"A", 1,
"A", 2,
"B", 3
)
modelFuns <- list(A = mean, B = NULL)
d %>% group_by(env) %>%
nest(.key = "data") %>%
mutate(model = modelFuns[env]) %>%
mutate(out = pmap(list(model, data), possibly(function(m,d) m(d$val), NULL)))
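One difference worth keeping in mind: possibly() turns any error into the fallback value, while the maybe adverb from the question only skips NULL input. The sketch below is a minimal illustration of that difference; apply_model, models and inputs are hypothetical names made up for this example, not part of the question.

```r
library(purrr)

# The adverb from the question: skip the call when the first argument is NULL
maybe <- function(fun) {
  function(val, ...) if (!is.null(val)) fun(val, ...) else NULL
}

# Hypothetical helper and data, for illustration only
apply_model <- function(m, d) m(d)
models <- list(mean, NULL, "not a function")
inputs <- list(c(1, 2), 3, 4)

# possibly(): every error becomes NULL, including the broken third model
out_possibly <- map2(models, inputs, possibly(apply_model, otherwise = NULL))

# maybe(): only NULL is skipped, so the broken third model still raises an error
out_maybe <- tryCatch(
  map2(models, inputs, maybe(apply_model)),
  error = function(e) "error surfaced"
)
```

If silently dropping unexpected failures is not acceptable, the NULL-only check of maybe is the more conservative choice.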

Related

using apply on listcolumns in R seems inconsistent

..or at least inconsistent with my intuition.
I'm trying to extract data from inside a listcolumn using apply - in the example I've got a column of tibbles called eagles:
df1 <- tibble(
location = c(1,2),
eagles = list(
tibble(
talons = c(2,3,4),
beaks = c("blue","red","red")),
tibble(
talons = c(2,3),
beaks = c("red","red"))))
and extracting the beaks values as vectors using apply:
df1$beakz <- apply(df1, 1, \(x) x$eagles$beaks)
which works as expected:
> df1
# A tibble: 2 x 3
location eagles beakz
<dbl> <list> <list>
1 1 <tibble [3 x 2]> <chr [3]>
2 2 <tibble [2 x 2]> <chr [2]>
However if I add another row to one of the nested tibbles, the apply function won't play along anymore:
df2 <- tibble(
location = c(1,2),
eagles = list(
tibble(
talons = c(2,3,4),
beaks = c("blue","red","red")),
tibble(
talons = c(2,3,2),
beaks = c("red","red","yellow"))))
df2$beakz <- apply(df2, 1, \(x) x$eagles$beaks)
Error:
! Assigned data `apply(df2, 1, function(x) x$eagles$beaks)` must be compatible with existing data.
x Existing data has 2 rows.
x Assigned data has 3 rows.
i Only vectors of size 1 are recycled.
The expected output would be adding a listcolumn beakz with two vectors (of length 3) as elements.
Additionally, if both the nested tibbles have two rows only, the apply function does work, but instead of a single new listcolumn, I get two new columns:
df3 <- tibble(
location = c(1,2),
eagles = list(
tibble(
talons = c(2,3),
beaks = c("blue","red")),
tibble(
talons = c(2,3),
beaks = c("red","red"))))
df3$beakz <- apply(df3, 1, \(x) x$eagles$beaks)
df3
# A tibble: 2 x 3
location eagles beakz[,1] [,2]
<dbl> <list> <chr> <chr>
1 1 <tibble [2 x 2]> blue red
2 2 <tibble [2 x 2]> red red
This is a grossly simplified example, but basically, I would expect apply to function the same way in all three cases: I would like to extract a column as a vector and bring it up a level. Ideally using apply, although I'm sure there are purrr ways of doing this. But mainly I would just like to understand why this works this way, because debugging it has not been much fun :lolsob:
(also would appreciate it if someone with enough reputation could add listcolumn or list-column to the tags)

This is happening because apply() does not return a list, it returns a 3x2 matrix, which has too many rows to be put into df2. To get it to do what you want you could e.g. coerce it to a data frame (to give the columns names) and then to a list. There's probably a more elegant way to do it. But basically apply() does not play well with the list-structure of your data, whereas the purrr functions do.
apply(df2, 1, \(x) x$eagles$beaks)
#> [,1] [,2]
#> [1,] "blue" "red"
#> [2,] "red" "red"
#> [3,] "red" "yellow"
class(apply(df2, 1, \(x) x$eagles$beaks))
#> [1] "matrix" "array"
df2$beakz <- as.list(data.frame(apply(df2, 1, \(x) x$eagles$beaks)))
df2
#> # A tibble: 2 × 3
#> location eagles beakz
#> <dbl> <list> <named list>
#> 1 1 <tibble [3 × 2]> <chr [3]>
#> 2 2 <tibble [3 × 2]> <chr [3]>
df2$beakz
#> $X1
#> [1] "blue" "red" "red"
#>
#> $X2
#> [1] "red" "red" "yellow"
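Two base R alternatives avoid the simplification step altogether. This is a sketch assuming R 4.1.0 or later, where apply() accepts a simplify argument; lapply() over the list-column works in any version. The name via_apply is made up for this example.

```r
library(tibble)

df2 <- tibble(
  location = c(1, 2),
  eagles = list(
    tibble(talons = c(2, 3, 4), beaks = c("blue", "red", "red")),
    tibble(talons = c(2, 3, 2), beaks = c("red", "red", "yellow"))
  )
)

# apply() with simplify = FALSE (R >= 4.1.0) keeps one list element per row
via_apply <- apply(df2, 1, function(x) x$eagles$beaks, simplify = FALSE)

# lapply() over the list-column always returns a list, whatever the lengths
df2$beakz <- lapply(df2$eagles, function(x) x$beaks)
```

Because neither call tries to simplify, the result has the same shape whether the nested tibbles have equal or different numbers of rows.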

Purely for reference (not debugging OP), purrr works without issue:
library(dplyr) # for mutate()
library(purrr)
> mutate(df1, beaks=map(eagles, ~ .x$beaks))
# A tibble: 2 × 3
location eagles beaks
<dbl> <list> <list>
1 1 <tibble [3 × 2]> <chr [3]>
2 2 <tibble [2 × 2]> <chr [2]>
> mutate(df2, beaks=map(eagles, ~ .x$beaks))
# A tibble: 2 × 3
location eagles beaks
<dbl> <list> <list>
1 1 <tibble [3 × 2]> <chr [3]>
2 2 <tibble [3 × 2]> <chr [3]>
> mutate(df3, beaks=map(eagles, ~ .x$beaks))
# A tibble: 2 × 3
location eagles beaks
<dbl> <list> <list>
1 1 <tibble [2 × 2]> <chr [2]>
2 2 <tibble [2 × 2]> <chr [2]>

Pass a vector of arguments to map function

I'm trying to create a function that will map across a nested tibble. This function needs to take a vector of parameters that will vary for each row.
When I call purrr::map2() on the nested data, purrr tries to loop over all values of the parameter vector and all rows in the dataset. What can I do to pass the entire vector as a single argument?
library(tidyverse)
myf <- function(x, params) {
print(params)
x %>%
mutate(new_mpg = mpg + rnorm(n(), params[1], params[2])) %>%
summarise(old = mean(mpg), new = mean(new_mpg)) %>%
as.list()
}
# Calling function with params defined is great!
myf(mtcars, params = c(5, 10))
#> [1] 5 10
#> $old
#> [1] 20.09062
#>
#> $new
#> [1] 25.62049
# Does not work in purrr when passed as a vector: map2() tries to loop over params
mtcars %>%
group_by(cyl) %>% # group_by() is from dplyr
nest() %>%
mutate(
newold = map2(data, c(5, 10), myf),
)
#> [1] 5
#> Warning in rnorm(n(), params[1], params[2]): NAs produced
#> [1] 10
#> Warning in rnorm(n(), params[1], params[2]): NAs produced
#> Error: Problem with `mutate()` column `newold`.
#> ℹ `newold = map2(data, c(5, 10), myf)`.
#> ℹ `newold` must be size 1, not 2.
#> ℹ The error occurred in group 1: cyl = 4.
# New function wrapper with hard-coded params
myf2 <- function(x){
myf(x, c(5, 10))
}
# works great! but not what I need
mtcars %>%
group_by(cyl) %>% # group_by() is from dplyr
nest() %>%
mutate(
mean = 5,
sd = 10,
newold = map(data, myf2),
)
#> [1] 5 10
#> [1] 5 10
#> [1] 5 10
#> # A tibble: 3 × 5
#> # Groups: cyl [3]
#> cyl data mean sd newold
#> <dbl> <list> <dbl> <dbl> <list>
#> 1 6 <tibble [7 × 10]> 5 10 <named list [2]>
#> 2 4 <tibble [11 × 10]> 5 10 <named list [2]>
#> 3 8 <tibble [14 × 10]> 5 10 <named list [2]>
Created on 2021-11-29 by the reprex package (v2.0.0)

Skip the group_by() step and just use nest() - otherwise your data will remain grouped after nesting and need to be ungrouped. To get your function to work, just pass the parameters as a list.
library(tidyverse)
mtcars %>%
nest(data = -cyl) %>%
mutate(
newold = map2_df(data, list(c(5, 10)), myf)
) %>%
unpack(newold)
# A tibble: 3 x 4
cyl data old new
<dbl> <list> <dbl> <dbl>
1 6 <tibble [7 x 10]> 19.7 30.7
2 4 <tibble [11 x 10]> 26.7 31.1
3 8 <tibble [14 x 10]> 15.1 17.0
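The reason wrapping the vector in list() helps is that map2() recycles a length-1 input, so each call receives the whole vector instead of one of its elements. A small sketch of that behaviour follows; show_params and xs are hypothetical names used only for illustration.

```r
library(purrr)

# Hypothetical helper that reports what it received
show_params <- function(x, params) paste(x, "gets", toString(params))

xs <- list("a", "b", "c")

# A length-1 list is recycled, so every call sees the full vector c(5, 10)
wrapped <- map2_chr(xs, list(c(5, 10)), show_params)

# Extra arguments given after the function are also passed unchanged to each call
via_dots <- map_chr(xs, show_params, params = c(5, 10))
```

Both forms give the same result here; the list-column form becomes necessary once each row needs its own parameter vector.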

You don't need map2. I think what you need is map.
mtcars %>%
group_by(cyl) %>% # group_by() is from dplyr
nest() %>%
mutate(
newold = map(data, myf, params = c(5, 10)),
)
# [1] 5 10
# [1] 5 10
# [1] 5 10
# # A tibble: 3 x 3
# # Groups: cyl [3]
# cyl data newold
# <dbl> <list> <list>
# 1 6 <tibble [7 x 10]> <named list [2]>
# 2 4 <tibble [11 x 10]> <named list [2]>
# 3 8 <tibble [14 x 10]> <named list [2]>
If you have multiple sets of params, you can ungroup your data frame, add a list column with your params, and use map2.
mtcars %>%
group_by(cyl) %>%
nest() %>%
ungroup() %>%
# Add different sets of params
mutate(Params = list(a = c(5, 10), b = c(6, 11), c = c(7, 12))) %>%
mutate(
newold = map2(data, Params, myf)
)
# [1] 5 10
# [1] 6 11
# [1] 7 12
# # A tibble: 3 x 4
# cyl data Params newold
# <dbl> <list> <named list> <list>
# 1 6 <tibble [7 x 10]> <dbl [2]> <named list [2]>
# 2 4 <tibble [11 x 10]> <dbl [2]> <named list [2]>
# 3 8 <tibble [14 x 10]> <dbl [2]> <named list [2]>

How to store both function and its input data inside designated tibble columns, then iterate over rows to execute?

I'm trying to run a data wrangling procedure inside a tibble using tools from the {purrr} package. My method is to organize everything I need inside a tibble:
the input data inside a column
the function to apply upon the input data gets its own column too
My problem: how can I use purrr's mapping functions to say "take the function stored in column x and apply it over the data in column y"?
Below is a minimal example, based on mtcars and iris. I want to summarise each data set, in the same workflow: first subset columns, then do some aggregation. For the aggregation part, I preemptively set up 2 functions, one for each data.
summarise_iris()
summarise_mtcars()
Then I organize all I need inside a tibble (see trb object below).
The first part, the subsetting, works well. As can be seen in trb_1 below, dat_selected is a new column I mutated, which stores the output of the subset step.
However, the second part is not working. I want to take the function in column summarise_func and apply it over the data stored in column dat_selected. But it's not working. Why not? I purposely used map() because it maps only 1 input to the function.
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
summarise_iris <- function(.dat) {
.dat %>%
group_by(Species) %>%
summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}
# to test: iris %>% summarise_iris()
summarise_mtcars <- function(.dat) {
.dat %>%
group_by(am) %>%
summarise(mpg_median = median(mpg))
}
# to test: mtcars %>% summarise_mtcars()
trb <-
tribble(~original_data, ~cols_to_select, ~summarise_func,
mtcars, c("am", "disp", "mpg"), ~summarise_mtcars(.),
iris, c("Species", "Sepal.Length", "Sepal.Width"), ~summarise_iris(.)
)
trb_1 <-
trb %>%
mutate(dat_selected = map2(.x = original_data, .y = cols_to_select, .f = ~select(.x, all_of(.y))))
trb_1
#> # A tibble: 2 x 4
#> original_data cols_to_select summarise_func dat_selected
#> <list> <list> <list> <list>
#> 1 <df [32 x 11]> <chr [3]> <formula> <df [32 x 3]>
#> 2 <df [150 x 5]> <chr [3]> <formula> <df [150 x 3]>
trb_1 %>%
mutate(dat_summarised = map(.x = dat_selected, .f = summarise_func))
#> Error: Problem with `mutate()` column `dat_summarised`.
#> i `dat_summarised = map(.x = dat_selected, .f = summarise_func)`.
#> x Index 1 must have length 1, not 2
Created on 2021-12-02 by the reprex package (v2.0.1.9000)
How can I achieve the desired output (see below) using the in-table method I'm trying to incorporate? I.e.:
trb_1 %>%
mutate(dat_summarised = map(.x = dat_selected, .f = summarise_func))
## to give the desired output that's equivalent to what we get if we run:
summar_mtcars <- mtcars %>% summarise_mtcars()
summar_iris <- iris %>% summarise_iris()
trb_1 %>%
tibble::add_column(dat_summarised = list(summar_mtcars, summar_iris))
## # A tibble: 2 x 5
## original_data cols_to_select summarise_func dat_selected dat_summarised
## <list> <list> <list> <list> <list>
## 1 <df [32 x 11]> <chr [3]> <formula> <df [32 x 3]> <tibble [2 x 2]>
## 2 <df [150 x 5]> <chr [3]> <formula> <df [150 x 3]> <tibble [3 x 3]>
UPDATE
I don't know if the following is in the right direction, but based on this answer, I thought to utilize rlang::as_function() such that:
trb_1 %>%
mutate(dat_summarised = map(.x = dat_selected, .f = ~rlang::as_function(summarise_func)))
But it gives a different error now:
x Can't convert a list to function

I think you can take a simpler approach. First, we don't need to select columns, that's inherent to summarize anyway. Let's create columns that define the columns to group by, the columns to summarize, and functions to use.
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
trb <-
tribble(~original_data, ~cols_to_group, ~cols_to_summarize, ~summarise_func,
mtcars, "am", "mpg", \(x) mean(x, na.rm = T),
iris, "Species", ~starts_with("Sepal"), median
)
The \(x) mean(x, na.rm = TRUE) syntax is the new anonymous function syntax in R 4.1. If using an earlier version, just change to function(x) mean(...)
Now we can define a function (to eventually use in pmap) that accepts the data, grouping columns, columns to analyse, and the summarize functions.
summarize_fun <- function(
.dat, .group_cols, .summ_cols, .funs
) {
.dat %>%
group_by(across(!!.group_cols)) %>%
summarize(across(!!.summ_cols, .funs))
}
And now we can just use these within mutate(pmap(...)) to get the result we want. I rely on !! for unquoting expressions because that works for passing in things like ~starts_with("Sepal"), which don't work with {{ }} to my knowledge.
trb_final <- trb %>%
mutate(dat_summarized = pmap(
list(
.dat=original_data,
.group_cols=cols_to_group,
.summ_cols=cols_to_summarize,
.funs=summarise_func
),
summarize_fun
))
trb_final
#> # A tibble: 2 × 5
#> original_data cols_to_group cols_to_summarize summarise_func dat_summarized
#> <list> <chr> <list> <list> <list>
#> 1 <df [32 × 11]> am <chr [1]> <fn> <tibble [2 × 2]>
#> 2 <df [150 × 5]> Species <formula> <fn> <tibble [3 × 3]>
trb_final$dat_summarized
#> [[1]]
#> # A tibble: 2 × 2
#> am mpg
#> <dbl> <dbl>
#> 1 0 17.1
#> 2 1 24.4
#>
#> [[2]]
#> # A tibble: 3 × 3
#> Species Sepal.Length Sepal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 5 3.4
#> 2 versicolor 5.9 2.8
#> 3 virginica 6.5 3
General functions
If instead, as in the comments, we just want to apply generic functions to summarize, then just rely on pmap with 2 arguments, the data and the summarizing function.
summarize_mtcars <- function(.dat) {
.dat %>%
group_by(am) %>%
summarise(mpg_median = median(mpg))
}
summarize_iris <- function(.dat) {
.dat %>%
group_by(Species) %>%
summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}
Now we can just define our data frame to analyze using the original data and the two summarize_... functions we defined for the datasets.
trb <-
tribble(~original_data, ~summarize_func,
mtcars, summarize_mtcars,
iris, summarize_iris
)
And then just use pmap as before (can also use map2 of course).
trb_final <- trb %>%
mutate(dat_summarized = pmap(
list(
original_data,
summarize_func
),
\(.d, .f) .f(.d)
))
trb_final
#> # A tibble: 2 × 3
#> original_data summarize_func dat_summarized
#> <list> <list> <list>
#> 1 <df [32 × 11]> <fn> <tibble [2 × 2]>
#> 2 <df [150 × 5]> <fn> <tibble [3 × 3]>
trb_final$dat_summarized
#> [[1]]
#> # A tibble: 2 × 2
#> am mpg_median
#> <dbl> <dbl>
#> 1 0 17.3
#> 2 1 22.8
#>
#> [[2]]
#> # A tibble: 3 × 3
#> Species Sepal.Length Sepal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 3.43
#> 2 versicolor 5.94 2.77
#> 3 virginica 6.59 2.97
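The map2 variant mentioned above can be sketched as follows. It uses nrow and ncol as stand-in functions purely to keep the example short; the column name result is an assumption made for this illustration.

```r
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)

# Functions are stored directly in a list-column, one per data set
trb <- tribble(
  ~original_data, ~summarize_func,
  mtcars,         nrow,
  iris,           ncol
)

# map2() walks both columns in parallel and calls each function on its data
trb_final <- trb %>%
  mutate(result = map2(original_data, summarize_func, function(.d, .f) .f(.d)))

trb_final$result
```

The key point is that the function column is supplied as data to iterate over, not as the .f argument; passing a list as .f makes purrr treat it as an extractor, which is a likely source of the "Index 1 must have length 1" error in the question.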

I would store the functions as strings:
trb <-
tribble(~original_data, ~cols_to_select, ~summarise_func,
mtcars, c("am", "disp", "mpg"), "summarise_mtcars",
iris, c("Species", "Sepal.Length", "Sepal.Width"), "summarise_iris"
)
Then you can simply use do.call in your map call. Or you convert your functions to strings on the fly with mutate:
trb_2 <- trb_1 %>%
mutate(summarise_func = as.character(summarise_func)) %>%
mutate(dat_summarised = map2(summarise_func, original_data, ~ do.call(what = .x, args = list(.dat = .y))))
trb_2
#> # A tibble: 2 × 5
#> original_data cols_to_select summarise_func dat_selected dat_summarised
#> <list> <list> <chr> <list> <list>
#> 1 <df [32 × 11]> <chr [3]> summarise_mtcars <df [32 × 3]> <tibble [2 × 2]>
#> 2 <df [150 × 5]> <chr [3]> summarise_iris <df [150 × 3]> <tibble [3 × 3]>
Created on 2021-12-02 by the reprex package (v2.0.1)
Update: Storing functions or rather function names as strings can be problematic if the underlying function changes (I get that now). The problem is getting the function into the tibble in the first place. What you do in the question is storing it as a formula. A better way is (imo) to store it in a list column:
trb <-
tribble(~original_data, ~cols_to_select, ~summarise_func,
mtcars, c("am", "disp", "mpg"), list(fun = summarise_mtcars),
iris, c("Species", "Sepal.Length", "Sepal.Width"), list(fun = summarise_iris)
)
With a slight adaptation, this original answer then works like this:
trb_3 <- trb_1 %>%
mutate(dat_summarised = map2(summarise_func, original_data, ~ do.call(what = .x$fun, args = list(.dat = .y))))
trb_3
#> # A tibble: 2 × 5
#> original_data cols_to_select summarise_func dat_selected dat_summarised
#> <list> <list> <list> <list> <list>
#> 1 <df [32 × 11]> <chr [3]> <named list [1]> <df [32 × 3]> <tibble [2 × 2]>
#> 2 <df [150 × 5]> <chr [3]> <named list [1]> <df [150 × 3]> <tibble [3 × 3]>
Created on 2021-12-02 by the reprex package (v2.0.1)

Tune recipe in workflow set with custom range (or value)

I'm trying to use the workflow_set() function in tidymodels to evaluate a batch of models.
I understand that it is possible to modify a model specification in order to change the search range, so, for example, given this specification:
spec_lin <- linear_reg( penalty = tune(),
mixture = tune() ) %>%
set_engine('glmnet')
I can modify the range using:
rec_base <- recipe( price ~ feat_1) %>%
step_novel(feat_1) %>%
step_other(feat_1,threshold=.2 ) %>%
step_dummy(feat_1)
rec_adv_param <- rec_base %>%
parameters() %>%
update ( mixture = mixture(c(0.1,0.01)) )
My attempt is to do the same but with the parameters in the recipe. For example:
rec_tuned <- recipe( price ~ feat_1) %>%
step_novel(feat_1) %>%
step_other(feat_1,threshold=tune() ) %>%
step_dummy(feat_1)
followed by
rec_adv_param <- rec_tuned %>%
parameters() %>%
update ( threshold = threshold(c(0.1,0.2)) )
However when I try to use it in the workflow_set() definition if I use something like
wf_set <- workflow_set(recipes, models, cross = TRUE )
option_add(param_info = rec_adv_param, id = "rec_tuned_spec_lin")
The final wf_set loses its original tuning parameters, which are replaced with
threshold = threshold(c(0.1,0.2))
Is there a way to add the parameters specification for the recipe in all workflow_set models?
Thanks

You can add the parameters for a recipe via option_add(), either for a single workflow by id or for all workflows if you leave id = NULL. When you go to tune or fit on resampled data, these options will be used.
For example, if we want to try 0 to 20 PCA components (instead of the default):
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(Chicago)
data("chi_features_set")
time_val_split <-
sliding_period(
Chicago,
date,
"month",
lookback = 38,
assess_stop = 1
)
## notice that there are no options; defaults will be used
chi_features_set
#> # A workflow set/tibble: 3 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 date_lm <tibble [1 × 4]> <opts[0]> <list [0]>
#> 2 plus_holidays_lm <tibble [1 × 4]> <opts[0]> <list [0]>
#> 3 plus_pca_lm <tibble [1 × 4]> <opts[0]> <list [0]>
## make new params
pca_param <-
parameters(num_comp()) %>%
update(num_comp = num_comp(c(0, 20)))
## add new params to workflowset like this:
chi_features_set %>%
option_add(param_info = pca_param, id = "plus_pca_lm")
#> # A workflow set/tibble: 3 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 date_lm <tibble [1 × 4]> <opts[0]> <list [0]>
#> 2 plus_holidays_lm <tibble [1 × 4]> <opts[0]> <list [0]>
#> 3 plus_pca_lm <tibble [1 × 4]> <opts[1]> <list [0]>
## now these new parameters can be used by `workflow_map()`:
chi_features_set %>%
option_add(param_info = pca_param, id = "plus_pca_lm") %>%
workflow_map(resamples = time_val_split, grid = 21, seed = 1)
#> # A workflow set/tibble: 3 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 date_lm <tibble [1 × 4]> <opts[2]> <rsmp[+]>
#> 2 plus_holidays_lm <tibble [1 × 4]> <opts[2]> <rsmp[+]>
#> 3 plus_pca_lm <tibble [1 × 4]> <opts[3]> <tune[+]>
Created on 2021-07-30 by the reprex package (v2.0.0)

How to add calculated columns to nested data frames (list columns) using purrr

I would like to perform calculations on a nested data frame (stored as a list-column), and add the calculated variable back to each dataframe using purrr functions. I'll use this result to join to other data, and keeping it compact helps me to organize and examine it better. I can do this in a couple of steps, but it seems like there may be a solution I haven't come across. If there is a solution out there, I haven't been able to find it easily.
Load libraries. The example requires the following packages (available on CRAN):
library(dplyr)
library(tidyr) # for nest() and unnest()
library(purrr)
library(RcppRoll) # to calculate rolling mean
Example data with 3 subjects, and repeated measurements over time:
test <- data_frame(
id= rep(1:3, each=20),
time = rep(1:20, 3),
var1 = rnorm(60, mean=10, sd=3),
var2 = rnorm(60, mean=95, sd=5)
)
Store the data as nested dataframe:
t_nest <- test %>% nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 3]>
2 2 <tibble [20 x 3]>
3 3 <tibble [20 x 3]>
Perform calculations. I will calculate multiple new variables based on the data, although a solution for just one could be expanded later. The result of each calculation will be a numeric vector, same length as the input (n=20):
t1 <- t_nest %>%
mutate(var1_rollmean4 = map(data, ~RcppRoll::roll_mean(.$var1, n=4, align="right", fill=NA)),
var2_delta4 = map(data, ~(.$var2 - lag(.$var2, 3))*0.095),
var3 = map2(var1_rollmean4, var2_delta4, ~.x -.y))
id data var1_rollmean4 var2_delta4 var3
<int> <list> <list> <list> <list>
1 1 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
2 2 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
3 3 <tibble [20 x 3]> <dbl [20]> <dbl [20]> <dbl [20]>
My solution is to unnest this data and then nest it again. There doesn't seem to be anything wrong with this, but it seems like a better solution may exist.
t1 %>% unnest %>%
nest(-id)
id data
<int> <list>
1 1 <tibble [20 x 6]>
2 2 <tibble [20 x 6]>
3 3 <tibble [20 x 6]>
This other solution (from SO 42028710) is close, but not quite because it is a list rather than nested dataframes:
map_df(t_nest$data, ~ mutate(.x, var1calc = .$var1*100))
I've found quite a bit of helpful information using the purrr Cheatsheet but can't quite find the answer.

You can wrap another mutate when mapping through the data column and add the columns in each nested tibble:
t11 <- t_nest %>%
mutate(data = map(data,
~ mutate(.x,
var1_rollmean4 = RcppRoll::roll_mean(var1, n=4, align="right", fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = var1_rollmean4 - var2_delta4
)
))
t11
# A tibble: 3 x 2
# id data
# <int> <list>
#1 1 <tibble [20 x 6]>
#2 2 <tibble [20 x 6]>
#3 3 <tibble [20 x 6]>
For comparison, the unnest-nest method, followed by reordering the columns inside:
nest_unnest <- t1 %>%
unnest %>% nest(-id) %>%
mutate(data = map(data, ~ select(.x, time, var1, var2, var1_rollmean4, var2_delta4, var3)))
identical(nest_unnest, t11)
# [1] TRUE

It seems that for what you're trying to do, nesting is not necessary:
library(tidyverse)
library(zoo)
test %>%
group_by(id) %>%
mutate(var1_rollmean4 = rollapplyr(var1, 4, mean, fill=NA),
var2_delta4 = (var2 - lag(var2, 3))*0.095,
var3 = (var1_rollmean4 - var2_delta4))
# A tibble: 60 x 7
# Groups: id [3]
# id time var1 var2 var1_rollmean4 var2_delta4 var3
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 9.865199 96.45723 NA NA NA
# 2 1 2 9.951429 92.78354 NA NA NA
# 3 1 3 12.831509 95.00553 NA NA NA
# 4 1 4 12.463664 95.37171 11.277950 -0.10312483 11.381075
# 5 1 5 11.781704 92.05240 11.757076 -0.06945881 11.826535
# 6 1 6 12.756932 92.15666 12.458452 -0.27064269 12.729095
# 7 1 7 12.346409 94.32411 12.337177 -0.09952197 12.436699
# 8 1 8 10.223695 100.89043 11.777185 0.83961377 10.937571
# 9 1 9 4.031945 87.38217 9.839745 -0.45357658 10.293322
# 10 1 10 11.859477 97.96973 9.615382 0.34633428 9.269047
# ... with 50 more rows
Edit: You could still nest the result with %>% nest(-id).
If you still prefer to nest, or are nesting for other reasons, it would go like this:
t1 <- t_nest %>%
mutate(data = map(data, ~.x %>% mutate(...)))
That is, you mutate on .x within the map statement. This will treat data as a data.frame and mutate will column-bind results to it.
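A concrete version of this pattern, as a minimal sketch on made-up deterministic data (the column var1_diff and the small test table are assumptions for illustration, not the data from the question):

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)

# Small deterministic stand-in for the question's data
test <- tibble(
  id   = rep(1:2, each = 4),
  time = rep(1:4, 2),
  var1 = c(1, 2, 3, 4, 10, 20, 30, 40)
)

t_nest <- test %>% nest(data = -id)

# The inner mutate() runs once per nested tibble and appends the new column
t_new <- t_nest %>%
  mutate(data = map(data, ~ mutate(.x, var1_diff = var1 - lag(var1))))

t_new$data[[2]]
```

Because lag() is evaluated inside each nested tibble, values never leak across ids, which is the same guarantee group_by() gives in the unnested version.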
