group-wise linear models function nest_by - r

I have a data frame with four columns: Dataset, X, Y, and Group.
The task is to fit a linear model to each of the five groups (the Group column contains five groups: a, b, c, d, e) and then compare the slopes with the slope from the data frame test_2. For test_2 I have already fitted a model, since it has no group separation like test_1. For test_1 it was suggested that we use the function nest_by to compute group-wise linear models.
I have tried to fit a model with the function nest_by:
Input:
model <- test_1 %>%
  nest_by(Group) %>%
  # `data` is the nested tibble for the current group; data = test_1 would fit all rows in every group
  mutate(model = list(lm(y ~ x, data = data)))
model
Output:
A tibble: 5 x 3
# Rowwise: Group
Group data model
<fct> <list<tibble[,3]>> <list>
1 a [58 x 3] <lm>
2 b [35 x 3] <lm>
3 c [47 x 3] <lm>
4 d [44 x 3] <lm>
5 e [38 x 3] <lm>
I do not know how to proceed from here. I thought that I could ungroup the result and call summary(), but that would be similar to fitting five separate models after splitting the data with filter().

Yes, you can proceed by using tidy() from the broom package, which is a better option than summary() here, and then unnesting the result.
For example, with mtcars, we can fit one model per cyl group as follows:
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
mtcars_model <- mtcars %>%
nest(data = -cyl) %>%
mutate(
model = map(data, ~ lm(mpg ~ wt, data = .))
)
# now simply for each cyl, tidy the model output and unnest it
mtcars_model %>%
mutate(
tidy_summary = map(model, tidy)
) %>%
unnest(tidy_summary)
#> # A tibble: 6 × 8
#> cyl data model term estimate std.error statistic p.value
#> <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 <tibble [7 × 10]> <lm> (Interce… 28.4 4.18 6.79 1.05e-3
#> 2 6 <tibble [7 × 10]> <lm> wt -2.78 1.33 -2.08 9.18e-2
#> 3 4 <tibble [11 × 10]> <lm> (Interce… 39.6 4.35 9.10 7.77e-6
#> 4 4 <tibble [11 × 10]> <lm> wt -5.65 1.85 -3.05 1.37e-2
#> 5 8 <tibble [14 × 10]> <lm> (Interce… 23.9 3.01 7.94 4.05e-6
#> 6 8 <tibble [14 × 10]> <lm> wt -2.19 0.739 -2.97 1.18e-2
Created on 2022-07-09 by the reprex package (v2.0.1)
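Since the question asks specifically about nest_by(), here is a minimal sketch of the same idea in that style. It uses mtcars as a stand-in because the asker's test_1 is not available, and the names by_cyl and slopes are only illustrative. nest_by() returns a row-wise tibble, so `data` inside mutate() refers to the current group's rows, and list() is needed to store each result in a list column.

```r
library(dplyr)
library(tidyr)
library(broom)

# One row per cyl value; `data` holds the rows of that group
by_cyl <- mtcars %>%
  nest_by(cyl) %>%
  mutate(model = list(lm(mpg ~ wt, data = data))) %>%  # fit on the group's own rows
  mutate(coefs = list(tidy(model)))                    # coefficient table per group

# Leave row-wise mode, expand the coefficient tables, keep only the slopes
slopes <- by_cyl %>%
  ungroup() %>%
  unnest(coefs) %>%
  filter(term == "wt") %>%
  select(cyl, estimate, std.error)

slopes
```

The resulting table has one slope per group, which can then be compared with the slope of a separately fitted model.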

Related

Adjusting the p-values on a subset of regression coefficients

Edited for Clarity
I frequently do stratified analyses. However, to avoid spending Type I error on hypothesis tests that aren't of interest, I would like to exclude certain p-values before using p.adjust().
library(purrr)
library(dplyr, warn.conflicts = FALSE)
library(broom)
library(tidyr)
mtcars_fit <- mtcars %>%
group_by(cyl) %>% # one model per value of cyl
nest() %>%
mutate(
model = map(data, ~ lm(mpg ~ wt, data = .)),
coeff = map(model, tidy, conf.int = FALSE)
) %>%
unnest(coeff) %>%
select(-statistic)
mtcars_fit
#> # A tibble: 6 × 7
#> # Groups: cyl [3]
#> cyl data model term estimate std.error p.value
#> <dbl> <list> <list> <chr> <dbl> <dbl> <dbl>
#> 1 6 <tibble [7 × 10]> <lm> (Intercept) 28.4 4.18 0.00105
#> 2 6 <tibble [7 × 10]> <lm> wt -2.78 1.33 0.0918
#> 3 4 <tibble [11 × 10]> <lm> (Intercept) 39.6 4.35 0.00000777
#> 4 4 <tibble [11 × 10]> <lm> wt -5.65 1.85 0.0137
#> 5 8 <tibble [14 × 10]> <lm> (Intercept) 23.9 3.01 0.00000405
#> 6 8 <tibble [14 × 10]> <lm> wt -2.19 0.739 0.0118
#If I want to adjust the p-values for multiple comparisons for the weight only and
#save the Type I error as I don't want to test the intercept, I would do something like this
mtcars_adjusted <- mtcars_fit %>%
mutate(
p.value2 = if_else(term != "(Intercept)", p.value, NA_real_),
p.value_adj = if_else(term != "(Intercept)", p.adjust(p.value2, method = "fdr"), NA_real_),
.after = "p.value"
) %>%
select(-p.value2)
mtcars_adjusted
#> # A tibble: 6 × 8
#> # Groups: cyl [3]
#> cyl data model term estimate std.error p.value p.val…¹
#> <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 <tibble [7 × 10]> <lm> (Intercept) 28.4 4.18 1.05e-3 NA
#> 2 6 <tibble [7 × 10]> <lm> wt -2.78 1.33 9.18e-2 0.0918
#> 3 4 <tibble [11 × 10]> <lm> (Intercept) 39.6 4.35 7.77e-6 NA
#> 4 4 <tibble [11 × 10]> <lm> wt -5.65 1.85 1.37e-2 0.0137
#> 5 8 <tibble [14 × 10]> <lm> (Intercept) 23.9 3.01 4.05e-6 NA
#> 6 8 <tibble [14 × 10]> <lm> wt -2.19 0.739 1.18e-2 0.0118
#> # … with abbreviated variable name ¹​p.value_adj
Since a discussion on Stack Overflow suggests that dplyr and p.adjust() often don't work well together, I also applied the function outside the pipe, as suggested there.
#To check I will filter the dataset and make sure p adjusted values are the same
p.adj <- mtcars_fit %>%
filter(term != "(Intercept)") %>%
mutate(p.value_adj = NA_real_)
p.adj$p.value_adj = p.adjust(p.adj$p.value, method = "fdr")
p.adj
#> # A tibble: 3 × 8
#> # Groups: cyl [3]
#> cyl data model term estimate std.error p.value p.value_adj
#> <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 <tibble [7 × 10]> <lm> wt -2.78 1.33 0.0918 0.0918
#> 2 4 <tibble [11 × 10]> <lm> wt -5.65 1.85 0.0137 0.0206
#> 3 8 <tibble [14 × 10]> <lm> wt -2.19 0.739 0.0118 0.0206
Created on 2022-08-18 by the reprex package (v2.0.1)
The two approaches (the objects mtcars_adjusted and p.adj) give different adjusted p-values, which is concerning, and I am unsure which is correct. The adjusted p-values for each object:
mtcars_adjusted: 0.0918, 0.0137, 0.0118
p.adj: 0.0918, 0.0206, 0.0206.
I want the final dataset to keep the intercept estimates without including them in the p-value adjustment. It would look something like mtcars_adjusted, but I want to make sure the p-values are adjusted correctly. How would I go about doing this?
Implementing your adjustment within the pipe chain
You don't need to adjust your p-values outside of mutate() in your example. Below, I show that the identical result can be produced within the pipe chain.
# Adjust p-values for "wt" parameter estimates using your approach
p.adj <- mtcars_fit %>%
filter(term != "(Intercept)") %>%
mutate(p.value_adj = NA_real_)
p.adj$p.value_adj = p.adjust(p.adj$p.value, method = "fdr")
# Alternative approach: ungroup first, then adjust the p.value column inside mutate()
p.adj_alt <- mtcars_fit %>%
  ungroup() %>%
  filter(term != "(Intercept)") %>%
  mutate(p.value_adj = p.adjust(p.value, method = "fdr"))
# Show they are identical once ungrouped (which you should do once you are
# done with all by-group operations)
identical(ungroup(p.adj), p.adj_alt)
#> [1] TRUE
Whether you are accomplishing what you intended with your "outside of the pipe" approach is a different question than what you asked in your post, but I encourage you to make sure it is.
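A likely explanation for the discrepancy in the question is grouping: mtcars_fit is still grouped by cyl, so p.adjust() inside mutate() sees only one slope p-value per group, and adjusting a single p-value leaves it unchanged. The sketch below illustrates this with a small hypothetical tibble (pvals) built from the rounded p-values printed above, not the real mtcars_fit object.

```r
library(dplyr)

# Hypothetical stand-in for the slope rows of mtcars_fit (rounded p-values)
pvals <- tibble(
  cyl     = c(6, 4, 8),
  term    = "wt",
  p.value = c(0.0918, 0.0137, 0.0118)
)

# Grouped: p.adjust() runs once per cyl, each time on a single p-value
grouped <- pvals %>%
  group_by(cyl) %>%
  mutate(p.value_adj = p.adjust(p.value, method = "fdr")) %>%
  ungroup()

# Ungrouped: p.adjust() sees all three p-values together
ungrouped <- pvals %>%
  mutate(p.value_adj = p.adjust(p.value, method = "fdr"))

grouped$p.value_adj    # same as the raw p-values
ungrouped$p.value_adj  # adjusted across the three comparisons
```

Calling ungroup() before the adjustment is therefore what makes the in-pipe version agree with the version computed outside the pipe.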
Adding the intercepts
Once you have your adjusted estimates, you can add in the intercept rows by filter()ing them from the original object and passing them with your adjusted data to bind_rows(). You can also combine the two p-values columns into a single column if you'd like using coalesce().
# Get intercepts, bind into a single data.frame, and create a coalesced
# column that combined the (un)adjusted p-values
mtcars_fit %>%
filter(term == "(Intercept)") %>%
bind_rows(p.adj) %>%
ungroup() %>%
mutate(p.value_combined = coalesce(p.value, p.value_adj))
#> # A tibble: 6 × 9
#> cyl data model term estim…¹ std.e…² p.value p.val…³ p.val…⁴
#> <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6 <tibble [7 × 10]> <lm> (Inte… 28.4 4.18 1.05e-3 NA 1.05e-3
#> 2 4 <tibble [11 × 10]> <lm> (Inte… 39.6 4.35 7.77e-6 NA 7.77e-6
#> 3 8 <tibble [14 × 10]> <lm> (Inte… 23.9 3.01 4.05e-6 NA 4.05e-6
#> 4 6 <tibble [7 × 10]> <lm> wt -2.78 1.33 9.18e-2 0.0918 9.18e-2
#> 5 4 <tibble [11 × 10]> <lm> wt -5.65 1.85 1.37e-2 0.0206 1.37e-2
#> 6 8 <tibble [14 × 10]> <lm> wt -2.19 0.739 1.18e-2 0.0206 1.18e-2
#> # … with abbreviated variable names ¹​estimate, ²​std.error, ³​p.value_adj,
#> # ⁴​p.value_combined
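Note that coalesce() returns the first non-missing value, so the order of its arguments decides which column wins. As a hedged alternative, the sketch below keeps the intercept rows in place and adjusts only the slopes in a single ungrouped mutate(), relying on p.adjust() returning NA for NA inputs and not counting them as comparisons. The tibble fit is a hypothetical stand-in built from the rounded p-values shown above.

```r
library(dplyr)

# Hypothetical stand-in for ungroup(mtcars_fit), rounded p-values only
fit <- tibble(
  cyl     = c(6, 6, 4, 4, 8, 8),
  term    = rep(c("(Intercept)", "wt"), 3),
  p.value = c(1.05e-3, 9.18e-2, 7.77e-6, 1.37e-2, 4.05e-6, 1.18e-2)
)

adjusted <- fit %>%
  mutate(
    # Mask intercepts as NA so they are excluded from the adjustment
    p.value_adj = p.adjust(
      if_else(term == "(Intercept)", NA_real_, p.value),
      method = "fdr"
    ),
    # Adjusted value where available, raw p-value otherwise
    p.value_combined = coalesce(p.value_adj, p.value)
  )

adjusted
```

With the arguments in this order, the combined column holds adjusted p-values for the wt rows and unadjusted p-values for the intercept rows.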

Grouped regression with dplyr using different formulas

I am trying to transfer the problem from this post to a setting where different formulas are used in the lm() function in R.
Here is a basic setup to reproduce the problem:
library(dplyr)
library(broom)
library(purrr)
library(tidyr)
# Generate data
set.seed(324)
dt <- data.frame(
t = sort(rep(c(1,2), 50)),
w1 = rnorm(100),
w2 = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100)
)
# Generate formulas
fm <- map(1:2, ~as.formula(paste0("w", .x, "~ x", .x)))
Now I try to run a different regression for each group t, with the models specified in the list of formulas fm:
# Approach 1:
dt %>% group_by(t) %>%
do(fit = tidy(map(fm, ~lm(.x, data = .)))) %>%
unnest(fit)
# Approach 2
dt %>% nest(-t) %>%
mutate(
fit = map(fm, ~lm(.x, data = .)),
tfit = tidy(fit)
)
This produces an error indicating that the formula cannot be converted to a data.frame. What am I doing wrong?
This needs map2 instead of map. The data column created by nest is itself a list of data frames, so we need to loop over the corresponding elements of the fm list and the data column in parallel, which is what map2 does.
library(tidyr)
library(purrr)
library(dplyr)
library(broom)
out <- dt %>%
nest(data = -t) %>%
mutate(
fit = map2(fm, data, ~lm(.x, data = .y)),
tfit = map(fit, tidy))
-output
> out
# A tibble: 2 × 4
t data fit tfit
<dbl> <list> <list> <list>
1 1 <tibble [50 × 4]> <lm> <tibble [2 × 5]>
2 2 <tibble [50 × 4]> <lm> <tibble [2 × 5]>
> bind_rows(out$tfit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459
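Alternatively, use imap_dfr. Here .y is the position of each formula in fm, which works only because those positions (1 and 2) happen to match the values of t: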
> imap_dfr(fm, ~ lm(.x, data = dt %>%
filter(t == .y)) %>%
tidy)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0860 0.128 0.670 0.506
2 x1 0.262 0.119 2.19 0.0331
3 (Intercept) -0.00285 0.152 -0.0187 0.985
4 x2 -0.115 0.154 -0.746 0.459
If we want to have all the combinations of 'fm' for each level of 't', then use crossing
dt %>%
nest(data = -t) %>%
crossing(fm) %>%
mutate(fit = map2(fm, data, ~ lm(.x, data = .y)),
tfit = map(fit, tidy))
-output
# A tibble: 4 × 5
t data fm fit tfit
<dbl> <list> <list> <list> <list>
1 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
2 1 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
3 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
4 2 <tibble [50 × 4]> <formula> <lm> <tibble [2 × 5]>
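The crossed output above does not show which formula produced which fit. As a sketch, one way to make this visible is to add a text label for each formula and unnest the tidied coefficients. The column name formula and the object res are illustrative, and fm is written out here as a literal list, which should be equivalent to the paste0() construction in the question.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(324)
dt <- data.frame(
  t  = rep(c(1, 2), each = 50),
  w1 = rnorm(100), w2 = rnorm(100),
  x1 = rnorm(100), x2 = rnorm(100)
)
fm <- list(w1 ~ x1, w2 ~ x2)

res <- dt %>%
  nest(data = -t) %>%
  crossing(fm = fm) %>%                       # every formula for every level of t
  mutate(
    formula = map_chr(fm, ~ paste(deparse(.x), collapse = " ")),  # readable label
    tfit    = map2(fm, data, ~ tidy(lm(.x, data = .y)))
  ) %>%
  select(t, formula, tfit) %>%
  unnest(tfit)

res
```

Each row of res is now one coefficient, identified by the group t and the formula it came from.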

Lost column name when applying lm with summarise/across

I want to use summarise/across with lm to fit regressions using different columns in a tibble. Like this:
library(tidyverse)
library(broom)
fits <- tibble(mtcars) %>%
summarise(across(c(vs, am), ~list(tidy(lm(wt ~ .x + mpg)))))
But the columns that get passed into lm as '.x', end up labeled as .x in the regression output.
fits %>% unnest(vs)
# A tibble: 3 x 6
term estimate std.error statistic p.value am
<chr> <dbl> <dbl> <dbl> <dbl> <list>
1 (Intercept) 6.10 0.353 17.3 8.36e-17 <tibble [3 × 5]>
2 .x 0.0738 0.239 0.308 7.60e- 1 <tibble [3 × 5]>
3 mpg -0.145 0.0200 -7.24 5.63e- 8 <tibble [3 × 5]>
I can preserve the name if I build the lm formula on the fly, and use cur_column(), but this feels kludgy:
tibble(mtcars) %>%
summarise(across(c(vs, am),
~list(tidy(lm(formula(paste0("wt ~ ", cur_column(), " + mpg"))))))) %>%
unnest(vs)
# A tibble: 3 x 6
term estimate std.error statistic p.value am
<chr> <dbl> <dbl> <dbl> <dbl> <list>
1 (Intercept) 6.10 0.353 17.3 8.36e-17 <tibble [3 × 5]>
2 vs 0.0738 0.239 0.308 7.60e- 1 <tibble [3 × 5]>
3 mpg -0.145 0.0200 -7.24 5.63e- 8 <tibble [3 × 5]>
I want the output to correctly use the true column name of .x, without having to do this workaround, but still using the summarise/across motif, without incorporating map.
Seems like this should be possible. Any suggestions?
*Copying my comment from @akrun's answer to clarify what I'm looking for:
What I really want to know is, is the column name preserved in the summarise/across operation in a way that I can reference it directly in lm. Something like {{.x}} or rlang::as_name(.x). I mean, I know those don't work, but it seems like name information should be preserved, aside from just the string version in cur_column.
You can make it shorter with reformulate(), which builds the formula from character vectors of term names:
library(dplyr)
library(broom)
library(tidyr)
tibble(mtcars) %>%
summarise(across(c(vs, am), ~
list(tidy(lm(reformulate(c(cur_column(), "mpg"), "wt")))))) %>%
unnest(vs)
-output
# A tibble: 3 x 6
# term estimate std.error statistic p.value am
# <chr> <dbl> <dbl> <dbl> <dbl> <list>
#1 (Intercept) 6.10 0.353 17.3 8.36e-17 <tibble [3 × 5]>
#2 vs 0.0738 0.239 0.308 7.60e- 1 <tibble [3 × 5]>
#3 mpg -0.145 0.0200 -7.24 5.63e- 8 <tibble [3 × 5]>
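To see why this preserves the column name, it may help to look at reformulate() on its own, outside of summarise/across. In this small base R sketch the string "vs" plays the role of cur_column(); the names f and fit are illustrative.

```r
# Build the formula from strings: predictors first, then the response
f <- reformulate(c("vs", "mpg"), response = "wt")
f

# The fitted model's terms carry the real column names, not ".x"
fit <- lm(f, data = mtcars)
names(coef(fit))
```

Because the formula is built from the column name rather than from the `.x` placeholder, the term labels in the tidy() output match the original columns.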

Translating a 'for loop' to 'purrr::map'

I am trying to translate this basic for loop using the purrr package. The idea is to apply a function using data frame elements as parameters.
Creating the data frame to loop on using the mpg dataset from ggplot2:
param <- mpg %>% select(manufacturer, year) %>% distinct() %>% rename(man = manufacturer, y = year)
The function to apply:
fcn <- function(man, y) {
df <- mpg %>% filter(manufacturer == man & year == y)
mod <- lm(data = df, cty ~ hwy)
out <- summary(mod)
return(out)
}
And the loop to apply fcn for each man and y combination :
for (i in 1:nrow(param)) {
fcn(man = param$man[i],
y = param$y[i])
}
I am very new to purrr and struggle to understand how the general specification of purrr::map works.
Thanks a lot.
EDIT :
I used a very basic example with fcn and param here to understand how to include function parameters (from param) inside the map specification. As a result, I was not particularly interested in nesting beforehand, only in a plain translation of the loop using map that could work for any kind of function with multiple parameters.
If I have understood correctly, you want to model cty based on hwy for each combination of year and manufacturer.
library(tidyverse)
library(ggplot2)
library(purrr)
I have changed the definition of your function so that it takes a data frame, which is what map passes to it.
fcn <- function(df){
mod <- lm(data = df, cty ~ hwy)
return(summary(mod))
}
The code below should produce the model summary for each combination of year and manufacturer:
mpg %>% group_by(manufacturer, year) %>%
nest() %>% mutate(model = map(data, fcn))
You can nest the data first within manufacturer and year, then map a function over the nested data. Below, instead of a named function, I use a lambda with .x, which stands for each element of the data column being mapped over. You can also use tidy() from broom to put the summary() result into a data.frame:
library(purrr)
library(tidyr)
library(dplyr)
library(broom)
mpg = ggplot2::mpg
result = mpg %>%
select(manufacturer, year,cty,hwy) %>%
nest(data=c(cty, hwy)) %>%
mutate(
model=map(data,~lm(cty ~ hwy,data=.x)),
summary=map(model,~tidy(summary(.x)))
)
# A tibble: 30 x 5
manufacturer year data model summary
<chr> <int> <list> <list> <list>
1 audi 1999 <tibble [9 × 2]> <lm> <tibble [2 × 5]>
2 audi 2008 <tibble [9 × 2]> <lm> <tibble [2 × 5]>
3 chevrolet 2008 <tibble [12 × 2]> <lm> <tibble [2 × 5]>
4 chevrolet 1999 <tibble [7 × 2]> <lm> <tibble [2 × 5]>
5 dodge 1999 <tibble [16 × 2]> <lm> <tibble [2 × 5]>
6 dodge 2008 <tibble [21 × 2]> <lm> <tibble [2 × 5]>
If you want to look at the results of summary:
result %>% unnest(summary)
# A tibble: 55 x 9
manufacturer year data model term estimate std.error statistic p.value
<chr> <int> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 audi 1999 <tibbl… <lm> (Inte… -5.85 6.15 -0.951 3.73e-1
2 audi 1999 <tibbl… <lm> hwy 0.879 0.235 3.74 7.27e-3
3 audi 2008 <tibbl… <lm> (Inte… -0.5 3.68 -0.136 8.96e-1
4 audi 2008 <tibbl… <lm> hwy 0.695 0.137 5.08 1.43e-3
The following post helped me to achieve the desired outcome, general enough to be applied in many situations and ignoring nesting: https://stackoverflow.com/a/52309113/10580543.
Using pmap:
output <- param %>% pmap(~fcn(.x, .y))
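In the pmap call above, .x and .y refer to the first and second columns of param. As a sketch of an alternative, because the columns of param are named man and y, exactly like the arguments of fcn, pmap can also match them by name when given the function directly. For brevity, the fcn below returns the coefficients instead of the full summary(), and the names added to output are purely illustrative.

```r
library(dplyr)
library(purrr)

mpg <- ggplot2::mpg

# One row per manufacturer/year combination, named after fcn's arguments
param <- mpg %>%
  select(manufacturer, year) %>%
  distinct() %>%
  rename(man = manufacturer, y = year)

fcn <- function(man, y) {
  df <- mpg %>% filter(manufacturer == man & year == y)
  coef(lm(cty ~ hwy, data = df))  # intercept and slope for this subset
}

# Each row of param becomes one call fcn(man = ..., y = ...)
output <- pmap(param, fcn)
names(output) <- paste(param$man, param$y)

output[["audi 1999"]]
```

Unlike the for loop in the question, which discards the return value of fcn, this keeps every result in a list with one element per row of param.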

Use purrr to map to 2 functions

I have data of the following form
date data
<chr> <list>
1 2012-01-05 <tibble [796 x 5]>
2 2012-01-12 <tibble [831 x 5]>
3 2012-01-19 <tibble [820 x 5]>
... ...
I would like to use something analogous to map() to calculate the mean and standard deviation.
I can currently use the following separately, but is it possible to calculate both at the same time?
mutate(stats = map(data, ~ sd(.$metric)))
mutate(stats = map(data, ~ mean(.$metric)))
Another alternative is to write a function similar to summary(), which returns the quartiles and the mean, but have it calculate the mean and sd instead. Then I could use that new function in map as follows:
mutate(stats = map(data, ~ new_function(.$metric)))
Is there a better alternative?
A simple option to add multiple columns is to just make another list column of the desired summary statistics and unnest it:
library(tidyverse)
set.seed(47)
df <- tibble(date = seq(as.Date('1970-01-01'), by = 1, length.out = 4),
             data = map(date, ~tibble(metric = rnorm(10))))
df
#> # A tibble: 4 x 2
#> date data
#> <date> <list>
#> 1 1970-01-01 <tibble [10 × 1]>
#> 2 1970-01-02 <tibble [10 × 1]>
#> 3 1970-01-03 <tibble [10 × 1]>
#> 4 1970-01-04 <tibble [10 × 1]>
df %>%
mutate(stats = map(data, ~data.frame(mean = mean(.x$metric),
sd = sd(.x$metric)))) %>%
unnest(stats)
#> # A tibble: 4 x 4
#> date data mean sd
#> <date> <list> <dbl> <dbl>
#> 1 1970-01-01 <tibble [10 × 1]> -0.106 0.992
#> 2 1970-01-02 <tibble [10 × 1]> -0.102 0.875
#> 3 1970-01-03 <tibble [10 × 1]> -0.833 0.979
#> 4 1970-01-04 <tibble [10 × 1]> 0.184 0.671
A more programmatic approach (which may scale better) is to iterate within the anonymous function over a list of functions. lst will automatically name them, so the results will be named, and map_dfc will cbind them into a data frame:
df %>%
mutate(stats = map(data,
~map_dfc(lst(mean, sd),
function(.fun) .fun(.x$metric)))) %>%
unnest(stats)
purrr also has a purpose-built function for iterating over functions/parameters like this: invoke_map. If you want the function or parameters to be recycled, they have to be in a length-1 list. Since parameters should already be collected in a list, here it has to be a nested list. Note that the invoke_*() family has been deprecated in recent purrr releases, so the previous approaches are preferable in new code.
df %>%
mutate(stats = map(data,
~invoke_map_dfc(lst(mean, sd),
list(list(.x$metric))))) %>%
unnest(stats)
All approaches return the same thing.
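When only a couple of scalar statistics are needed, another option (a sketch, not part of the answers above) is to skip the intermediate list column and create one numeric column per statistic with map_dbl. The column names metric_mean and metric_sd and the object stats are illustrative.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(47)
df <- tibble(
  date = seq(as.Date("1970-01-01"), by = 1, length.out = 4),
  data = map(date, ~ tibble(metric = rnorm(10)))
)

# map_dbl returns a plain numeric vector, so no unnest() is needed
stats <- df %>%
  mutate(
    metric_mean = map_dbl(data, ~ mean(.x$metric)),
    metric_sd   = map_dbl(data, ~ sd(.x$metric))
  )

stats
```

This is more repetitive than iterating over a list of functions, but it keeps each column's type explicit and fails early if a function returns something other than a single number.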

Resources