I see from the purrr documentation that it should be possible to map a list of functions onto arguments using the map(list(fn1, fn2, fn3), exec, !!!args) syntax or something similar. How would this work for the broom functions tidy, glance, and augment, which usually must be supplemented with do? These are three functions I almost always like to execute at the same time on the same data and model. Of course I can do this explicitly:
# works but is repetitive
MY_MODEL <- hp ~ cyl
my_glance <- mtcars %>% do(glance(lm(data = ., formula = MY_MODEL)))
my_tidy <- mtcars %>% do(tidy(lm(data = ., formula = MY_MODEL)))
my_augment <- mtcars %>% do(augment(lm(data = ., formula = MY_MODEL)))
I suspect there is a better, more compact way to do this without having to retype ...lm(data = ., formula = MY_MODEL... every time, but I couldn't figure it out. I tried
# doesn't work
omnibroom <- function(df, model){
map(list(glance, tidy, augment),
exec,
~{(do(.x(lm(data = df, formula = model))))}
)
}
omnibroom(mtcars, MY_MODEL)
but I think I don't understand the !!! syntax appropriately.
Is there a compact idiom for calling these three broom functions on the same model and data?
It's possible to do this in two lines with simple re-factoring. No do or !!! necessary.
mdl <- mtcars %>% lm(data=., formula=MY_MODEL)
res1 <- map( list(glance, tidy, augment), exec, mdl )
If you really want to squish it down into a single line, use { to help guide pipe input to the correct place in lm:
res2 <- mtcars %>%
{map( list(glance, tidy, augment), exec, lm(data=., formula=MY_MODEL) )}
Verification:
identical( res1, list(my_glance, my_tidy, my_augment) ) # TRUE
identical( res1, res2 ) # TRUE
EDIT to address grouping
Arbitrary functions like lm don't respect data frame groups. While do is a popular approach to handle grouping in this case, I personally think that tidyr::nest() is more intuitive because it places all intermediates and results alongside the data:
## "Listify" broom functions: f -> map( ..., f )
omnibroom <- map( list(glance, tidy, augment), ~function(l) map(l, .x) ) %>%
set_names( c("glance","tidy","augment") )
result <- mtcars %>% nest( data = -gear ) %>%
mutate( model = map(data, lm, formula=MY_MODEL) ) %>%
mutate_at( "model", omnibroom )
# # A tibble: 3 x 6
# gear data model glance tidy augment
# <dbl> <list> <list> <list> <list> <list>
# 1 4 <tibble [12 × 10… <lm> <tibble [1 × 11… <tibble [2 × … <tibble [12 × …
# 2 3 <tibble [15 × 10… <lm> <tibble [1 × 11… <tibble [2 × … <tibble [15 × …
# 3 5 <tibble [5 × 10]> <lm> <tibble [1 × 11… <tibble [2 × … <tibble [5 × 9…
This format also naturally lends itself to unnesting, since broom functions produce data frames:
result %>% select( gear, tidy ) %>% unnest( tidy )
# # A tibble: 6 x 6
# gear term estimate std.error statistic p.value
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 4 (Intercept) -5.00 25.3 -0.198 0.847
# 2 4 cyl 20.2 5.30 3.82 0.00339
# 3 3 (Intercept) -47.5 56.1 -0.847 0.412
# 4 3 cyl 30.0 7.42 4.04 0.00142
# 5 5 (Intercept) -101. 51.9 -1.94 0.148
# 6 5 cyl 49.4 8.28 5.96 0.00944
Related
I have a dataframe of 4 columns: Dataset, X, Y, Group.
The task is to fit a linear model to each of the five groups (The group column contains 5 groups: a, b, c, d, e) in the dataframe and then compare the slope with the dataframe test_2. For the test_2 I have already fitted a model, as there was no group separation like in the test_1. For the test_1 we have been suggested to use the function nest_by to compute a group-wise linear models
I have tried to fit a model with the function nest_by
Input:
model <- test_1 %>%
nest_by(Group) %>%
mutate(model = list(lm(y ~ x, data = test_1)))
model
Output:
A tibble: 5 x 3
# Rowwise: Group
Group data model
<fct> <list<tibble[,3]>> <list>
1 a [58 x 3] <lm>
2 b [35 x 3] <lm>
3 c [47 x 3] <lm>
4 d [44 x 3] <lm>
5 e [38 x 3] <lm>
I do not know now how to proceed. I thought that I could ungroup them and do a summary(), but would be similar to just fit a model separately with the function filter() and create 5 separated models.
Yes, you can proceed further using tidy from broom package which is better option than summary and then doing unnest.
For example, for mtcars, for each cyl group, we can do the following,
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
mtcars_model <- mtcars %>%
nest(data = -cyl) %>%
mutate(
model = map(data, ~ lm(mpg ~ wt, data = .))
)
# now simply for each cyl, tidy the model output and unnest it
mtcars_model %>%
mutate(
tidy_summary = map(model, tidy)
) %>%
unnest(tidy_summary)
#> # A tibble: 6 × 8
#> cyl data model term estimate std.error statistic p.value
#> <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 <tibble [7 × 10]> <lm> (Interce… 28.4 4.18 6.79 1.05e-3
#> 2 6 <tibble [7 × 10]> <lm> wt -2.78 1.33 -2.08 9.18e-2
#> 3 4 <tibble [11 × 10]> <lm> (Interce… 39.6 4.35 9.10 7.77e-6
#> 4 4 <tibble [11 × 10]> <lm> wt -5.65 1.85 -3.05 1.37e-2
#> 5 8 <tibble [14 × 10]> <lm> (Interce… 23.9 3.01 7.94 4.05e-6
#> 6 8 <tibble [14 × 10]> <lm> wt -2.19 0.739 -2.97 1.18e-2
Created on 2022-07-09 by the reprex package (v2.0.1)
For additional Information with examples, check here
Applying functions in a loop fashion with purrr::map is super handy, but extrating objects by its indices seems "too magical" to me, for example, the r.squared from lm summary method. How does it work internally?
library(tidyverse)
data("mtcars")
mtcars %>%
nest(data = -c(vs)) %>%
mutate(model = map(data, ~lm(mpg ~ wt, data = .x)),
summary = map(model, summary),
r2 = map_dbl(summary, "r.squared"))
# # A tibble: 2 x 5
# vs data model summary r2
# <dbl> <list> <list> <list> <dbl>
# 1 0 <tibble [18 x 10]> <lm> <smmry.lm> 0.672
# 2 1 <tibble [14 x 10]> <lm> <smmry.lm> 0.726
I am trying to translate this basic for loop using the purr package. The idea is to apply a function using data frame elements as parameters.
Creating the data frame to loop on using the mpg dataset from ggplot2:
param <- mpg %>% select(manufacturer, year) %>% distinct() %>% rename(man = manufacturer, y = year)
The function to apply:
fcn <- function(man, y) {
df <- mpg %>% filter(manufacturer == man & year == y)
mod <- lm(data = df, cty ~ hwy)
out <- summary(mod)
return(out)
}
And the loop to apply fcn for each man and y combination :
for (i in 1:nrow(param)) {
fcn(man = param$man[i],
y = param$y[i])
}
I am very new to purr and struggle how general specifications of purr::map work.
Thanks a lot.
EDIT :
I used here a very basic example with fcn and param to understand how to include function parameters (from param) inside the map specification. As a results, I was not particularly interested in a nesting beforehand but only the dull translation of the loop using map that could work for any king of function with multiple parameters.
If I have understood correctly you want to model the cty based on hwy for each year and manufacturer combinations.
library(tidyverse)
library(ggplot2)
library(purrr)
I have changed the definition of your function to fit to the map function settings.
fcn <- function(df){
mod <- lm(data = df, cty ~ hwy)
return(summary(mod))
}
The code below should produce the summary of the model for each year and manufacturer
mpg %>% group_by(manufacturer, year) %>%
nest() %>% mutate(model = map(data, fcn))
You can nest the data first within manufacturer and year, then map using a function, except below, I used the .x directly, which would be each element of the data you map through. You can also use tidy() from broom to put the summary() result into a data.frame:
library(purrr)
library(tidyr)
library(dplyr)
library(broom)
mpg = ggplot2::mpg
result = mpg %>%
select(manufacturer, year,cty,hwy) %>%
nest(data=c(cty, hwy)) %>%
mutate(
model=map(data,~lm(cty ~ hwy,data=.x)),
summary=map(model,~tidy(summary(.x)))
)
# A tibble: 30 x 5
manufacturer year data model summary
<chr> <int> <list> <list> <list>
1 audi 1999 <tibble [9 × 2]> <lm> <tibble [2 × 5]>
2 audi 2008 <tibble [9 × 2]> <lm> <tibble [2 × 5]>
3 chevrolet 2008 <tibble [12 × 2]> <lm> <tibble [2 × 5]>
4 chevrolet 1999 <tibble [7 × 2]> <lm> <tibble [2 × 5]>
5 dodge 1999 <tibble [16 × 2]> <lm> <tibble [2 × 5]>
6 dodge 2008 <tibble [21 × 2]> <lm> <tibble [2 × 5]>
If you want to look at the results of summary:
result %>% unnest(summary)
# A tibble: 55 x 9
manufacturer year data model term estimate std.error statistic p.value
<chr> <int> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 audi 1999 <tibbl… <lm> (Inte… -5.85 6.15 -0.951 3.73e-1
2 audi 1999 <tibbl… <lm> hwy 0.879 0.235 3.74 7.27e-3
3 audi 2008 <tibbl… <lm> (Inte… -0.5 3.68 -0.136 8.96e-1
4 audi 2008 <tibbl… <lm> hwy 0.695 0.137 5.08 1.43e-3
The following post helped me to achieve the desired outcome, general enough to be applied in many situations and ignoring nesting: https://stackoverflow.com/a/52309113/10580543.
Using pmap:
output <- param %>% pmap(~fcn(.x, .y))
I have a nested dataset, where each row contains in a column a tibble with a subset of the main dataset.
I want to run regressions over each of these subsets, where I vary the dependent variable.
I built a function to do this taking only one variable as the dependent variable.
I'm trying to use map to loop over multiple possible dependent variables but it seems as if the formula is not being interpreted correctly.
library(purrr)
library(tidyr)
library(rlang)
reg_across_groups <- function(df, var) {
df <- df %>%
nest(data = -c("cyl"))
var_enq <- rlang::enquo(var)
model_formula <- formula(paste0(rlang::quo_name(var_enq), "~ hp"))
df %>%
dplyr::mutate(model = purrr::map(data, ~lm(model_formula, data = .x)))
}
# Works
reg_across_groups(mtcars, mpg)
reg_across_groups(mtcars, "mpg")
# Does not work
c("mpg", "wt") %>%
map(~ reg_across_groups(mtcars, .))
I get the following error message:
Error in eval(predvars, data, env) : object '.' not found
We can convert to symbol with ensym
library(dplyr)
library(tidyr)
library(purrr)
reg_across_groups <- function(df, var) {
var <- ensym(var)
df <- df %>%
nest(data = -c("cyl"))
model_formula <- formula(paste0(var, "~ hp"))
df %>%
dplyr::mutate(model = purrr::map(data, ~lm(model_formula, data = .x)))
}
c("mpg", "wt") %>%
map(~ reg_across_groups(mtcars, !!.x))
#[[1]]
# A tibble: 3 x 3
# cyl data model
# <dbl> <list<df[,10]>> <list>
#1 6 [7 × 10] <lm>
#2 4 [11 × 10] <lm>
#3 8 [14 × 10] <lm>
#[[2]]
# A tibble: 3 x 3
# cyl data model
# <dbl> <list<df[,10]>> <list>
#1 6 [7 × 10] <lm>
#2 4 [11 × 10] <lm>
#3 8 [14 × 10] <lm>
reg_across_groups(mtcars, mpg)
# A tibble: 3 x 3
# cyl data model
# <dbl> <list<df[,10]>> <list>
#1 6 [7 × 10] <lm>
#2 4 [11 × 10] <lm>
#3 8 [14 × 10] <lm>
reg_across_groups(mtcars, "mpg")
# A tibble: 3 x 3
# cyl data model
# <dbl> <list<df[,10]>> <list>
#1 6 [7 × 10] <lm>
#2 4 [11 × 10] <lm>
#3 8 [14 × 10] <lm>
I have data of the following form
date data
<chr> <list>
1 2012-01-05 <tibble [796 x 5]>
2 2012-01-12 <tibble [831 x 5]>
3 2012-01-19 <tibble [820 x 5]>
... ...
I would like to use something analogous to map() to calculate the mean and standard deviation.
I can currently use the following separately, but it is possible to calculate both at the same time.
mutate(stats = map(data, ~ sd(.$metric)))
mutate(stats = map(data, ~ mean(.$metric)))
Another alternative is to make a function that is like summary, which returns quartiles and the mean. but calculate the mean and sd instead. then I could use that new function in map as follows:
mutate(stats = map(data, ~ new_function(.$metric)))
Is there a better alternative?
A simple option to add multiple columns is to just make another list column of the desired summary statistics and unnest it:
library(tidyverse)
set.seed(47)
df <- data_frame(date = seq(as.Date('1970-01-01'), by = 1, length = 4),
data = map(date, ~data_frame(metric = rnorm(10))))
df
#> # A tibble: 4 x 2
#> date data
#> <date> <list>
#> 1 1970-01-01 <tibble [10 × 1]>
#> 2 1970-01-02 <tibble [10 × 1]>
#> 3 1970-01-03 <tibble [10 × 1]>
#> 4 1970-01-04 <tibble [10 × 1]>
df %>%
mutate(stats = map(data, ~data.frame(mean = mean(.x$metric),
sd = sd(.x$metric)))) %>%
unnest(stats)
#> # A tibble: 4 x 4
#> date data mean sd
#> <date> <list> <dbl> <dbl>
#> 1 1970-01-01 <tibble [10 × 1]> -0.106 0.992
#> 2 1970-01-02 <tibble [10 × 1]> -0.102 0.875
#> 3 1970-01-03 <tibble [10 × 1]> -0.833 0.979
#> 4 1970-01-04 <tibble [10 × 1]> 0.184 0.671
A more programmatic approach (which may scale better) is to iterate within the anonymous function over a list of functions. lst will automatically name them, so the results will be named, and map_dfc will cbind them into a data frame:
df %>%
mutate(stats = map(data,
~map_dfc(lst(mean, sd),
function(.fun) .fun(.x$metric)))) %>%
unnest(stats)
purrr has a purpose-built function for iterating over functions/parameters like this: invoke_map. If you want the function or parameters to be recycled, they have to be in a length-1 list. Since parameters should already be collected in a list, here it has to be a nested list.
df %>%
mutate(stats = map(data,
~invoke_map_dfc(lst(mean, sd),
list(list(.x$metric))))) %>%
unnest(stats)
All approaches return the same thing.