Use map to loop over multiple variables to regress in nested dataset - r

I have a nested dataset, where each row holds, in a list column, a tibble containing a subset of the main dataset.
I want to run regressions over each of these subsets, where I vary the dependent variable.
I built a function to do this taking only one variable as the dependent variable.
I'm trying to use map to loop over multiple possible dependent variables but it seems as if the formula is not being interpreted correctly.
library(purrr)
library(tidyr)
library(rlang)
reg_across_groups <- function(df, var) {
df <- df %>%
nest(data = -c("cyl"))
var_enq <- rlang::enquo(var)
model_formula <- formula(paste0(rlang::quo_name(var_enq), "~ hp"))
df %>%
dplyr::mutate(model = purrr::map(data, ~lm(model_formula, data = .x)))
}
# Works
reg_across_groups(mtcars, mpg)
reg_across_groups(mtcars, "mpg")
# Does not work
c("mpg", "wt") %>%
map(~ reg_across_groups(mtcars, .))
I get the following error message:
Error in eval(predvars, data, env) : object '.' not found

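The error occurs because enquo captures the expression . itself rather than the string it holds, so the formula becomes . ~ hp. We can convert the argument to a symbol with ensym and unquote it with !! when calling the function from map: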
library(dplyr)
library(tidyr)
library(purrr)
reg_across_groups <- function(df, var) {
var <- ensym(var)
df <- df %>%
nest(data = -c("cyl"))
model_formula <- formula(paste0(var, "~ hp"))
df %>%
dplyr::mutate(model = purrr::map(data, ~lm(model_formula, data = .x)))
}
c("mpg", "wt") %>%
map(~ reg_across_groups(mtcars, !!.x))
#[[1]]
# A tibble: 3 x 3
# cyl data model
# <dbl> <list<df[,10]>> <list>
#1 6 [7 × 10] <lm>
#2 4 [11 × 10] <lm>
#3 8 [14 × 10] <lm>
#[[2]]
# A tibble: 3 x 3
# cyl data model
# <dbl> <list<df[,10]>> <list>
#1 6 [7 × 10] <lm>
#2 4 [11 × 10] <lm>
#3 8 [14 × 10] <lm>
reg_across_groups(mtcars, mpg)
# A tibble: 3 x 3
# cyl data model
# <dbl> <list<df[,10]>> <list>
#1 6 [7 × 10] <lm>
#2 4 [11 × 10] <lm>
#3 8 [14 × 10] <lm>
reg_across_groups(mtcars, "mpg")
# A tibble: 3 x 3
# cyl data model
# <dbl> <list<df[,10]>> <list>
#1 6 [7 × 10] <lm>
#2 4 [11 × 10] <lm>
#3 8 [14 × 10] <lm>
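If the dependent variables are always passed as strings, a simpler route may be to skip tidy evaluation and build the formula with base R's reformulate(). The sketch below assumes that; reg_by_cyl is a hypothetical helper name, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper: `var` is a character string naming the response column.
reg_by_cyl <- function(df, var) {
  model_formula <- reformulate("hp", response = var)  # builds e.g. mpg ~ hp
  df %>%
    nest(data = -cyl) %>%
    mutate(model = map(data, ~ lm(model_formula, data = .x)))
}

responses <- c("mpg", "wt")

# Naming the input vector gives a named list: one nested tibble per response.
fits <- map(set_names(responses), ~ reg_by_cyl(mtcars, .x))
names(fits)
```

This version no longer accepts a bare column name such as mpg, so it trades some flexibility for a function body without quoting or unquoting.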

Related

group-wise linear models function nest_by

I have a dataframe of 4 columns: Dataset, X, Y, Group.
The task is to fit a linear model to each of the five groups in the dataframe (the Group column contains 5 groups: a, b, c, d, e) and then compare the slopes with the dataframe test_2. For test_2 I have already fitted a model, as it has no group separation like test_1. For test_1 it was suggested that we use the function nest_by to compute group-wise linear models.
I have tried to fit a model with the function nest_by:
Input:
model <- test_1 %>%
nest_by(Group) %>%
mutate(model = list(lm(y ~ x, data = test_1)))
model
Output:
A tibble: 5 x 3
# Rowwise: Group
Group data model
<fct> <list<tibble[,3]>> <list>
1 a [58 x 3] <lm>
2 b [35 x 3] <lm>
3 c [47 x 3] <lm>
4 d [44 x 3] <lm>
5 e [38 x 3] <lm>
I do not know how to proceed from here. I thought that I could ungroup them and call summary(), but that would be similar to fitting a model separately on each group with filter() and creating 5 separate models.
Yes, you can proceed using tidy from the broom package, which is a better option than summary, and then unnest the result.
For example, with mtcars, for each cyl group we can do the following:
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
mtcars_model <- mtcars %>%
nest(data = -cyl) %>%
mutate(
model = map(data, ~ lm(mpg ~ wt, data = .))
)
# now simply for each cyl, tidy the model output and unnest it
mtcars_model %>%
mutate(
tidy_summary = map(model, tidy)
) %>%
unnest(tidy_summary)
#> # A tibble: 6 × 8
#> cyl data model term estimate std.error statistic p.value
#> <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 <tibble [7 × 10]> <lm> (Interce… 28.4 4.18 6.79 1.05e-3
#> 2 6 <tibble [7 × 10]> <lm> wt -2.78 1.33 -2.08 9.18e-2
#> 3 4 <tibble [11 × 10]> <lm> (Interce… 39.6 4.35 9.10 7.77e-6
#> 4 4 <tibble [11 × 10]> <lm> wt -5.65 1.85 -3.05 1.37e-2
#> 5 8 <tibble [14 × 10]> <lm> (Interce… 23.9 3.01 7.94 4.05e-6
#> 6 8 <tibble [14 × 10]> <lm> wt -2.19 0.739 -2.97 1.18e-2
Created on 2022-07-09 by the reprex package (v2.0.1)
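One thing to watch in the question's attempt: lm(y ~ x, data = test_1) refits on the whole data frame in every row, so all five models would be identical. Inside a nest_by() pipeline the list column is called data, and referring to it gives the current group's subset. Here is a minimal sketch of that idea; since test_1 is not available, mtcars stands in for it and cyl plays the role of Group.

```r
library(dplyr)
library(tidyr)
library(broom)

per_group <- mtcars %>%
  nest_by(cyl) %>%                                        # rowwise tibble with a `data` list column
  mutate(model = list(lm(mpg ~ wt, data = data))) %>%     # `data` is this row's subset only
  mutate(coefs = list(tidy(model)))                       # coefficient table for each model

# Keep one slope per group so the groups can be compared side by side.
slopes <- per_group %>%
  ungroup() %>%
  select(cyl, coefs) %>%
  unnest(coefs) %>%
  filter(term == "wt")

slopes
```

The slopes tibble has one row per group, which makes it straightforward to set them next to the slope of a separately fitted model.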

Grouped regression with dplyr using different formulas

I am trying to adapt the problem from this post to a setting where a different formula is used in the lm() function for each group.
Here a basic setup to reproduce the problem:
library(dplyr)
library(broom)
library(purrr)
library(tidyr)
# Generate data
set.seed(324)
dt <- data.frame(
t = sort(rep(c(1,2), 50)),
w1 = rnorm(100),
w2 = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100)
)
# Generate formulas
fm <- map(1:2, ~as.formula(paste0("w", .x, "~ x", .x)))
Now I try to run a different regression for each group t, with the models specified in the list of formulas fm:
# Approach 1:
dt %>% group_by(t) %>%
do(fit = tidy(map(fm, ~lm(.x, data = .)))) %>%
unnest(fit)
# Approach 2
dt %>% nest(-t) %>%
mutate(
fit = map(fm, ~lm(.x, data = .)),
tfit = tidy(fit)
)
This produces an error indicating that the formula cannot be converted to a data.frame. What am I doing wrong?
This needs map2 instead of map: the data column created by nest is itself a list of data frames, so we must loop over the corresponding elements of the 'fm' list and the data column in parallel, which is what map2 does (here both have length 2).
library(tidyr)
library(purrr)
library(dplyr)
library(broom)
out <- dt %>%
nest(data = -t) %>%
mutate(
fit = map2(fm, data, ~lm(.x, data = .y)),
tfit = map(fit, tidy))
-output
> out
# A tibble: 2 × 4
t data fit tfit
<dbl> <list> <list> <list>
1 1 <tibble [50 × 4]> <lm> <tibble [2 × 5]>
2 2 <tibble [50 × 4]> <lm> <tibble [2 × 5]>
> bind_rows(out$tfit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459
Or we may also use
> imap_dfr(fm, ~ lm(.x, data = dt %>%
filter(t == .y)) %>%
tidy)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459
If we want to have all the combinations of 'fm' for each level of 't', then use crossing
dt %>%
nest(data = -t) %>%
crossing(fm) %>%
mutate(fit = map2(fm, data, ~ lm(.x, data = .y)),
tfit = map(fit, tidy))
-output
# A tibble: 4 × 5
t data fm fit tfit
<dbl> <list> <list> <list> <list>
1 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
2 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
3 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
4 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>

Use a function name that's a string in map loop?

Some code:
mymtcars <- mtcars %>% head %>% rownames_to_column('model') %>% group_by(vs) %>% nest
mymtcars
vs data
<dbl> <list>
1 0 <tibble [3 × 11]>
2 1 <tibble [3 × 11]>
I can fit a linear model on this list column df like so:
mymtcars %>%
  mutate(mod = map(.x = data, ~ lm(.x$mpg ~ .x$cyl)))
# A tibble: 2 x 3
# Groups: vs [2]
vs data mod
<dbl> <list> <list>
1 0 <tibble [3 × 11]> <lm>
2 1 <tibble [3 × 11]> <lm>
What if my function name is a field?
mymtcars2 <- mtcars %>% head %>% rownames_to_column('model') %>% group_by(vs) %>% nest %>% crossing(func = c('lm'))
> mymtcars2
# A tibble: 2 x 3
vs data func
<dbl> <list> <chr>
1 0 <tibble [3 × 11]> lm
2 1 <tibble [3 × 11]> lm
I gave it a try with:
mymtcars2 %>%
  mutate(mod = map2(.x = data, .y = func, ~ .y(.x$mpg ~ .x$cyl)))
Error: Problem with `mutate()` input `mod`.
x could not find function ".y"
ℹ Input `mod` is `map2(.x = data, .y = func, ~.y(.x$mpg ~ .x$cyl))`.
How can I pass the function to call in map and then call it in the above block?
Maybe use match.fun inside map2, as below:
models <- mymtcars2 %>%
mutate(mod = map2(.x = data, .y = func, ~ match.fun(.y)(.x$mpg ~ .x$cyl)))
Output:
[[1]]
Call:
match.fun(.y)(formula = .x$mpg ~ .x$cyl)
Coefficients:
(Intercept) .x$cyl
36.926733 -2.728218
[[2]]
Call:
match.fun(.y)(formula = .x$mpg ~ .x$cyl)
Coefficients:
(Intercept) .x$cyl
41.9400 -3.8025
I also found that I can use get:
mymtcars2 %>%
mutate(mod = map2(.x = data, .y = func, ~ get(.y)(.x$mpg ~ .x$cyl)))
I am unsure when to use one over the other.
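On the match.fun versus get question, the main practical difference I know of is that match.fun only ever returns a function, while get returns whatever object is bound to the name unless you pass mode = "function". A small base R sketch of that difference, where the variable named mean is created purely for illustration:

```r
mean <- 42                      # a non-function object that shadows base::mean

by_match <- match.fun("mean")   # function lookup only: skips the number above
by_get   <- get("mean")         # plain lookup: returns the number 42
by_get_f <- get("mean", mode = "function")  # get() restricted to functions

is.function(by_match)           # TRUE
is.function(by_get)             # FALSE
identical(by_match, by_get_f)   # TRUE

# match.fun() also accepts a function object and returns it unchanged.
identical(match.fun(sum), sum)  # TRUE

rm(mean)                        # remove the shadowing variable again
```

So when the column may hold either function names or function objects, match.fun looks like the safer default; get is the more general tool for fetching any object by name.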
A different option could be:
mymtcars2 %>%
mutate(mod = map2(.x = data,
.y = func,
~ exec(.y, mpg ~ cyl, data = .x)))
vs data func mod
<dbl> <list> <chr> <list>
1 0 <tibble [3 × 11]> lm <lm>
2 1 <tibble [3 × 11]> lm <lm>
Since {dplyr} >= 1.0, this kind of problem can be solved with dplyr::rowwise. We can use it either with a classic do.call, in which case we have to wrap the arguments in list(), or with rlang::exec. With dplyr::rowwise we don't need map2, which makes things more readable since there is no lambda function with .x and .y. However, since the output column stores lm objects (and not an atomic vector), the result has to be wrapped in mod = list(...).
library(tidyverse)
mymtcars2 %>%
rowwise %>%
mutate(mod = list(do.call(func, list(mpg ~ cyl, data = data))))
#> # A tibble: 2 x 4
#> # Rowwise:
#> vs data func mod
#> <dbl> <list> <chr> <list>
#> 1 0 <tibble [3 × 11]> lm <lm>
#> 2 1 <tibble [3 × 11]> lm <lm>
mymtcars2 %>%
rowwise %>%
mutate(mod = list(exec(func, mpg ~ cyl, data = data)))
#> # A tibble: 2 x 4
#> # Rowwise:
#> vs data func mod
#> <dbl> <list> <chr> <list>
#> 1 0 <tibble [3 × 11]> lm <lm>
#> 2 1 <tibble [3 × 11]> lm <lm>
Created on 2021-08-28 by the reprex package (v0.3.0)

(tidy, glance, augment) with exec

I see from the purrr documentation that it should be possible to map a list of functions onto arguments using the map(list(fn1, fn2, fn3), exec, !!!args) syntax or something similar. How would this work for the broom functions tidy, glance, and augment, which usually must be supplemented with do? These are three functions I almost always like to execute at the same time on the same data and model. Of course I can do this explicitly:
# works but is repetitive
MY_MODEL <- hp ~ cyl
my_glance <- mtcars %>% do(glance(lm(data = ., formula = MY_MODEL)))
my_tidy <- mtcars %>% do(tidy(lm(data = ., formula = MY_MODEL)))
my_augment <- mtcars %>% do(augment(lm(data = ., formula = MY_MODEL)))
I suspect there is a better, more compact way to do this without having to retype ...lm(data = ., formula = MY_MODEL... every time, but I couldn't figure it out. I tried
# doesn't work
omnibroom <- function(df, model){
map(list(glance, tidy, augment),
exec,
~{(do(.x(lm(data = df, formula = model))))}
)
}
omnibroom(mtcars, MY_MODEL)
but I think I don't understand the !!! syntax appropriately.
Is there a compact idiom for calling these three broom functions on the same model and data?
It's possible to do this in two lines with simple re-factoring. No do or !!! necessary.
mdl <- mtcars %>% lm(data=., formula=MY_MODEL)
res1 <- map( list(glance, tidy, augment), exec, mdl )
If you really want to squish it down into a single line, use { to help guide pipe input to the correct place in lm:
res2 <- mtcars %>%
{map( list(glance, tidy, augment), exec, lm(data=., formula=MY_MODEL) )}
Verification:
identical( res1, list(my_glance, my_tidy, my_augment) ) # TRUE
identical( res1, res2 ) # TRUE
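A small variation that may be convenient: if the list of functions is named, map() carries the names over, so each result can be pulled out by name instead of by position. This is only a sketch; the names fns and res are my own, and a plain anonymous function is used in place of exec.

```r
library(purrr)
library(broom)

mdl <- lm(hp ~ cyl, data = mtcars)

# A named list of broom functions; map() keeps the names on its output.
fns <- list(glance = glance, tidy = tidy, augment = augment)
res <- map(fns, function(f) f(mdl))

names(res)
res$tidy
```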
EDIT to address grouping
Arbitrary functions like lm don't respect data frame groups. While do is a popular approach to handle grouping in this case, I personally think that tidyr::nest() is more intuitive because it places all intermediates and results alongside the data:
## "Listify" broom functions: f -> map( ..., f )
omnibroom <- map( list(glance, tidy, augment), ~function(l) map(l, .x) ) %>%
set_names( c("glance","tidy","augment") )
result <- mtcars %>% nest( data = -gear ) %>%
mutate( model = map(data, lm, formula=MY_MODEL) ) %>%
mutate_at( "model", omnibroom )
# # A tibble: 3 x 6
# gear data model glance tidy augment
# <dbl> <list> <list> <list> <list> <list>
# 1 4 <tibble [12 × 10… <lm> <tibble [1 × 11… <tibble [2 × … <tibble [12 × …
# 2 3 <tibble [15 × 10… <lm> <tibble [1 × 11… <tibble [2 × … <tibble [15 × …
# 3 5 <tibble [5 × 10]> <lm> <tibble [1 × 11… <tibble [2 × … <tibble [5 × 9…
This format also naturally lends itself to unnesting, since broom functions produce data frames:
result %>% select( gear, tidy ) %>% unnest( tidy )
# # A tibble: 6 x 6
# gear term estimate std.error statistic p.value
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 4 (Intercept) -5.00 25.3 -0.198 0.847
# 2 4 cyl 20.2 5.30 3.82 0.00339
# 3 3 (Intercept) -47.5 56.1 -0.847 0.412
# 4 3 cyl 30.0 7.42 4.04 0.00142
# 5 5 (Intercept) -101. 51.9 -1.94 0.148
# 6 5 cyl 49.4 8.28 5.96 0.00944

adding summarize output to original tibble

I would like to do something in between mutate and summarize.
I would like to calculate a summary statistic on groups but retain the original data as a nested object. I assume this is a pretty generic task, but I can't figure out how to do it without invoking a join as well as grouping twice. Example code below:
mtcars %>%
group_by(cyl) %>%
nest() %>%
left_join(mtcars %>%
group_by(cyl) %>%
summarise(mean_mpg = mean(mpg)))
which produces the desired output:
# A tibble: 3 x 3
cyl data mean_mpg
<dbl> <list> <dbl>
1 6 <tibble [7 x 10]> 19.74286
2 4 <tibble [11 x 10]> 26.66364
3 8 <tibble [14 x 10]> 15.10000
but I feel like this is not the "correct" way to do this.
Here is one way to do this without a join: use map_dbl (essentially map, but returning a vector of type double) from the purrr package (a member of the tidyverse family) to calculate the mean of mpg inside each tibble in the data column:
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(mean_mpg = map_dbl(data, ~ mean(.x$mpg)))
# A tibble: 3 x 3
# cyl data mean_mpg
# <dbl> <list> <dbl>
#1 6 <tibble [7 x 10]> 19.74286
#2 4 <tibble [11 x 10]> 26.66364
#3 8 <tibble [14 x 10]> 15.10000
Or you can calculate mean_mpg before nesting, and add mean_mpg as one of the group variables:
mtcars %>%
group_by(cyl) %>%
mutate(mean_mpg = mean(mpg)) %>%
group_by(mean_mpg, .add = TRUE) %>%
nest()
# A tibble: 3 x 3
# cyl mean_mpg data
# <dbl> <dbl> <list>
#1 6 19.74286 <tibble [7 x 10]>
#2 4 26.66364 <tibble [11 x 10]>
#3 8 15.10000 <tibble [14 x 10]>
