I'm trying to run a simple single linear regression over a large number of variables, grouped according to another variable. Using the mtcars dataset as an example, I'd like to run a separate linear regression between mpg and each other variable (mpg ~ disp, mpg ~ hp, etc.), grouped by another variable (for example, cyl).
Running lm over each variable independently can easily be done using purrr::map (modified from this great tutorial - https://sebastiansauer.github.io/EDIT-multiple_lm_purrr_EDIT/):
library(dplyr)
library(tidyr)
library(purrr)
mtcars %>%
select(-mpg) %>% #exclude outcome, leave predictors
map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>%
map_df(glance, .id='variable') %>%
select(variable, r.squared, p.value)
# A tibble: 10 x 3
variable r.squared p.value
<chr> <dbl> <dbl>
1 cyl 0.726 6.11e-10
2 disp 0.718 9.38e-10
3 hp 0.602 1.79e- 7
4 drat 0.464 1.78e- 5
5 wt 0.753 1.29e-10
6 qsec 0.175 1.71e- 2
7 vs 0.441 3.42e- 5
8 am 0.360 2.85e- 4
9 gear 0.231 5.40e- 3
10 carb 0.304 1.08e- 3
And running a linear model over grouped variables is also easy using map:
mtcars %>%
split(.$cyl) %>% #split by grouping variable
map(~ lm(mpg ~ wt, data = .)) %>%
map_df(broom::glance, .id='cyl') %>%
select(cyl, variable, r.squared, p.value)
# A tibble: 3 x 3
cyl r.squared p.value
<chr> <dbl> <dbl>
1 4 0.509 0.0137
2 6 0.465 0.0918
3 8 0.423 0.0118
So I can run by variable, or by group. However, I can't figure out how to combine these two (grouping everything by cyl, then running lm(mpg ~ each other variable, separately). I'd hoped to do something like this:
mtcars %>%
select(-mpg) %>% #exclude outcome, leave predictors
split(.$cyl) %>% # group by grouping variable
map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>% #run lm across all variables
map_df(glance, .id='cyl') %>%
select(cyl, variable, r.squared, p.value)
and get a result that gives me cyl(group), variable, r.squared, and p.value (a combination of 3 groups * 10 variables = 30 model outputs).
But split() turns the dataframe into a list, which the construction from part 1 [ map(~ lm(mtcars$mpg ~ .x, data = mtcars)) ] can't handle. I have tried to modify it so that it doesn't explicitly refer to the original data structure, but can't figure out a working solution. Any help is greatly appreciated!
IIUC, you can use group_by and group_modify, with a map inside that iterates over predictors.
If you can isolate your predictor variables in advance, it'll make it easier, as with ivs in this solution.
library(tidyverse)
ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs
mtcars %>%
group_by(cyl) %>%
group_modify(function(data, key) {
map_df(ivs, function(iv) {
frml <- as.formula(paste("mpg", "~", iv))
lm(frml, data = data) %>% broom::glance()
}, .id = "iv")
}) %>%
select(cyl, iv, r.squared, p.value)
# A tibble: 27 × 4
# Groups: cyl [3]
cyl iv r.squared p.value
<dbl> <chr> <dbl> <dbl>
1 4 disp 0.648 0.00278
2 4 hp 0.274 0.0984
3 4 drat 0.180 0.193
4 4 wt 0.509 0.0137
5 4 qsec 0.0557 0.485
6 4 vs 0.00238 0.887
7 4 am 0.287 0.0892
8 4 gear 0.115 0.308
9 4 carb 0.0378 0.567
10 6 disp 0.0106 0.826
11 6 hp 0.0161 0.786
# ...
Related
for the purposes of this question, let's create the following setup:
mtcars %>%
group_split(carb) %>%
map(select, mpg) -> criterion
mtcars %>%
group_split(carb) %>%
map(select, qsec) -> predictor
This code will create two lists of length 6. What I want to do is to perform 6 linear regressions within each of these 6 groups. I read about the map2 function and I thought that the code should look like this:
map2(criterion, predictor, lm(criterion ~ predictor))
But that doesn't seem to work. So in which way could this be done?
simplify2array (you need a list of vectors, not a list of data frames) and use a lambda-function with ~:
map2(simplify2array(criterion), simplify2array(predictor), ~ lm(.x ~ .y))
While the direct answer to your question is already given, note that we can also use dplyr::nest_by() and then proceed automatically rowwise.
Now your models are stored in the mod column and we can use broom::tidy etc. to work with the models.
library(dplyr)
library(tidyr)
mtcars %>%
nest_by(carb) %>%
mutate(mod = list(lm(mpg ~ qsec, data = data)),
res = list(broom::tidy(mod))) %>%
unnest(res) %>%
filter(term != "(Intercept)")
#> # A tibble: 6 x 8
#> # Groups: carb [6]
#> carb data mod term estimate std.error statistic p.value
#> <dbl> <list<tibble[,10]>> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 [7 x 10] <lm> qsec -1.26 4.51 -0.279 0.791
#> 2 2 [10 x 10] <lm> qsec 0.446 0.971 0.460 0.658
#> 3 3 [3 x 10] <lm> qsec -2.46 2.41 -1.02 0.493
#> 4 4 [10 x 10] <lm> qsec 0.0597 0.991 0.0602 0.953
#> 5 6 [1 x 10] <lm> qsec NA NA NA NA
#> 6 8 [1 x 10] <lm> qsec NA NA NA NA
Created on 2022-09-30 by the reprex package (v2.0.1)
I found this code online at tidyverse.org at this link:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map_dbl("r.squared")
The code works as expected. I'm now practicing with this same structure but using a long dataframe. You can see the code; it's mostly the same. First I convert to a tibble, add rownames for cars, select numeric variables, and make the dataframe a long data frame.
mtcars <- as_tibble(mtcars, rownames = 'car')
mtcars_numeric <- mtcars %>%
select(car, mpg, disp, hp, drat, wt, qsec)
mtcars_long_numeric <- pivot_longer(mtcars_numeric, names_to = 'names', values_to = 'values', 3:7)
mtcars_long_numeric %>%
split(.$names) %>%
map(~ lm(mpg ~ values, data = .)) %>%
map(summary) %>%
map_df("r.squared") %>%
pivot_longer(., names_to = 'explanatory_variable_to_mpg', values_to = 'r_squared', 1:5) %>%
arrange(desc(r_squared))
But what about other model statistics like p-value? How do I extract that? If I just change "r.squared" to "p.value" it doesn't work. I've tried other variations like "p_value" and "pvalue" and it doesn't work. I also don't know how to find the right names for these objects.
I can create a linear model object and look at the r.squared in the summary and get the right value.
mtcars_linear_model <- lm(mpg ~ wt, mtcars)
summary(mtcars_linear_model)$r.squared
...But outside of this vignette I don't know how I would have known that r.squared existed in the summary of linear model. If I just type the dollar sign after the summary(lm) I get values that don't exist. (Is this a bug?)
Then I tried a different tactic. I can see that if I use broom and tidy the linear model object I have other statistics:
broom::tidy(mtcars_linear_model)
Is there any way to add the broom::tidy function to these data frames involving purrr:map? The purpose would be to figure out how to extract other model statistics like p-value. Also, how do I find a comprehensive list of items I can extract from the summary of a linear model object summary(lm)$'?'
The following code doesn't work. I tried a few variations like %>% tidy() or else to wrap tidy around map(summary) like this: tidy(map(summary)) but it doesn't work.
mtcars_long_numeric %>%
split(.$names) %>%
map(~ lm(mpg ~ values, data = .)) %>%
map(summary) %>%
tidy() %>% #### ????????
map_df("r.squared") %>%
pivot_longer(., names_to = 'explanatory_variable_to_mpg', values_to = 'r_squared', 1:5) %>%
arrange(desc(r_squared))
This?. You need to use glance instead of tidy for model statistics.
mtcars_long_numeric %>%
nest_by(names) %>%
mutate(model = list(lm(mpg ~ values, data = data))) %>%
summarise(glance(model))
`summarise()` has grouped output by 'names'. You can override using the `.groups` argument.
# A tibble: 5 × 13
# Groups: names [5]
names r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 disp 0.718 0.709 3.25 76.5 9.38e-10 1 -82.1 170. 175. 317. 30 32
2 drat 0.464 0.446 4.49 26.0 1.78e- 5 1 -92.4 191. 195. 604. 30 32
3 hp 0.602 0.589 3.86 45.5 1.79e- 7 1 -87.6 181. 186. 448. 30 32
4 qsec 0.175 0.148 5.56 6.38 1.71e- 2 1 -99.3 205. 209. 929. 30 32
5 wt 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170. 278. 30 32
I'm having trouble figuring out how to use purrr::map() with mutate(across(...)).
I want to do a linear model and pull out the estimate for the slope of multiple columns as predicted by a single column.
Here is what I'm attempting with an example data set:
mtcars %>%
mutate(across(-mpg),
map(.x, lst(slope = ~lm(.x ~ mpg, data = .x) %>%
tidy() %>%
filter(term != "(Intercept") %>%
pull(estimate)
)))
The output I'm looking for would be new columns for each non-mpg column with _slope appended to the name, ie cyl_slope
In my actual data, I'll be grouping by another variable as well in case that matters, as I need the slope for each group for each predicted variable. I have this working in a standard mutate doing one variable at a time as follows:
df %>%
group_by(unitid) %>%
nest() %>%
mutate(tuition_and_fees_as_pct_total_rev_slope = map_dbl(data, ~lm(tuition_and_fees_as_pct_total_rev ~ year, data = .x) %>%
tidy() %>%
filter(term == "year") %>%
pull(estimate)
))
So:
I think my issue is how to pass the column name being predicted into the lm
I don't know if the solution requires nesting or not, so it would be appreciated if in the mtcars example that is considered.
If we wanted to do lm on all other columns with independent variable as 'mpg', one option is to loop over the column names of the 'mtcars' except the 'mpg', create the formula with reformulate, apply the lm, convert to a tidy format, filter out the 'Intercept' and select the 'estimate' column
library(dplyr)
library(tidyr)
library(broom)
map_dfc(setdiff(names(mtcars), 'mpg'), ~
lm(reformulate('mpg', response = .x), data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(estimate))
-output
# A tibble: 1 x 10
# estimate...1 estimate...2 estimate...3 estimate...4 estimate...5 estimate...6 estimate...7 estimate...8 estimate...9 estimate...10
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
Or this can be done more easily with a matrix as dependent
library(stringr)
lm(as.matrix(mtcars[setdiff(names(mtcars), "mpg")]) ~ mpg,
data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(response, estimate) %>%
mutate(response = str_c(response, '_slope'))
-output
# A tibble: 10 x 2
# response estimate
# <chr> <dbl>
# 1 cyl_slope -0.253
# 2 disp_slope -17.4
# 3 hp_slope -8.83
# 4 drat_slope 0.0604
# 5 wt_slope -0.141
# 6 qsec_slope 0.124
# 7 vs_slope 0.0555
# 8 am_slope 0.0497
# 9 gear_slope 0.0588
#10 carb_slope -0.148
Or another option is summarise with across
mtcars %>%
summarise(across(-mpg, ~ list(lm(reformulate('mpg',
response = cur_column())) %>%
tidy %>%
filter(term != "(Intercept)") %>%
pull(estimate)), .names = "{.col}_slope")) %>%
unnest(everything())
# A tibble: 1 x 10
# cyl_slope disp_slope hp_slope drat_slope wt_slope qsec_slope vs_slope am_slope gear_slope carb_slope
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
One option could be:
map_dfr(.x = names(select(mtcars, -c(mpg, vs))),
~ mtcars %>%
group_by(vs) %>%
nest() %>%
mutate(variable = .x,
estimate = map_dbl(data, function(y) lm(!!sym(.x) ~ mpg, data = y) %>%
tidy() %>%
filter(term != "(Intercept)") %>%
pull(estimate))) %>%
select(-data))
vs variable estimate
<dbl> <chr> <dbl>
1 0 cyl -0.242
2 1 cyl -0.116
3 0 disp -22.5
4 1 disp -8.01
5 0 hp -10.1
6 1 hp -3.26
7 0 drat 0.0748
8 1 drat 0.0529
9 0 wt -0.192
10 1 wt -0.113
11 0 qsec -0.0357
12 1 qsec -0.0432
13 0 am 0.0742
14 1 am 0.0710
15 0 gear 0.114
16 1 gear 0.0492
17 0 carb -0.0883
18 1 carb -0.0790
I want to calculate the pair-wise correlations between "mpg" and all other numeric variables of interest for each cyl in the mtcars dataset. I would like to adopt the tidy data principle.
It's rather easy with corrr::correlate().
library(dplyr)
library(tidyr)
library(purrr)
library(corrr)
data(mtcars)
mtcars2 <- mtcars[,1:7] %>%
group_nest(cyl) %>%
mutate(cors = map(data, corrr::correlate),
stretch = map(cors, corrr::stretch)) %>%
unnest(stretch)
mtcars2 %>%
filter(x == "mpg")
By using corrr::correlate(), all available pair-wise correlations have been calculated. I could use dplyr::filter() to select the correlations of interest.
However, when datasets are large, a lot of calculations go to the unwanted correlations, making this approach very time-consuming. So I tried to calculate only mpg vs. others. I'm not very familiar with purrr, and the following code doesn't work.
mtcars2 <- mtcars[,1:7] %>%
group_nest(cyl) %>%
mutate(comp = map(data, ~colnames),
corr = map(comp, ~cor.test(data[["mpg"]], data[[.]])))
If you need to use cor.test, below is an option using broom:
library(broom)
library(tidyr)
library(dplyr)
mtcars[,1:7] %>%
pivot_longer(-c(mpg,cyl)) %>%
group_by(cyl,name) %>%
do(tidy(cor.test(.$mpg,.$value)))
# A tibble: 15 x 10
# Groups: cyl, name [15]
cyl name estimate statistic p.value parameter conf.low conf.high method
<dbl> <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
1 4 disp -0.805 -4.07 0.00278 9 -0.947 -0.397 Pears…
2 4 drat 0.424 1.41 0.193 9 -0.236 0.816 Pears…
3 4 hp -0.524 -1.84 0.0984 9 -0.855 0.111 Pears…
4 4 qsec -0.236 -0.728 0.485 9 -0.732 0.424 Pears…
5 4 wt -0.713 -3.05 0.0137 9 -0.920 -0.198 Pears…
6 6 disp 0.103 0.232 0.826 5 -0.705 0.794 Pears…
7 6 drat 0.115 0.258 0.807 5 -0.699 0.799 Pears…
If you just need the correlation, for big datasets, the nesting etc might be costly and unnecessary because you can simply do cor(,) and melt that:
#define columns to correlate
cor_vars = setdiff(colnames(mtcars)[1:7],"cyl")
split(mtcars[,1:7],mtcars$cyl) %>%
map_dfr(~data.frame(x="mpg",y=cor_vars,
cyl=unique(.x$cyl),rho=as.numeric(cor(.x$mpg,.x[,cor_vars]))))
x y cyl rho
1 mpg mpg 4 1.00000000
2 mpg disp 4 -0.80523608
3 mpg hp 4 -0.52350342
4 mpg drat 4 0.42423947
5 mpg wt 4 -0.71318483
6 mpg qsec 4 -0.23595389
7 mpg mpg 6 1.00000000
8 mpg disp 6 0.10308269
9 mpg hp 6 -0.12706785
10 mpg drat 6 0.11471598
11 mpg wt 6 -0.68154982
12 mpg qsec 6 -0.41871779
13 mpg mpg 8 1.00000000
14 mpg disp 8 -0.51976704
15 mpg hp 8 -0.28363567
16 mpg drat 8 0.04793248
17 mpg wt 8 -0.65035801
18 mpg qsec 8 -0.10433602
Would this work for you? I have done this in the past but on smallish datasets and have not bench marked it so not sure of performance. I use pivot_longer to reshape the data prior to nesting. The variables you pass essentially work as the filtering step, sort of
mtcars2 <- mtcars[,1:7] %>%
pivot_longer(c(-mpg, -cyl), names_to = "y.var", values_to = "value" ) %>%
group_nest(cyl, y.var) %>%
mutate(x.var = "mpg", #just so you can see this in the output
cor = map_dbl(data, ~ {cor <- cor.test(.x$mpg, .x$value)
cor$estimate})) %>%
select(data, cyl, x.var , y.var, cor) %>%
arrange(cyl, y.var)
I am running multiple models on multiple sections of my data set, similar to (but with many more models)
library(tidyverse)
d1 <- mtcars %>%
group_by(cyl) %>%
do(mod_linear = lm(mpg ~ disp + hp, data = ., x = TRUE))
d1
# Source: local data frame [3 x 3]
# Groups: <by row>
#
# # A tibble: 3 x 3
# cyl mod_linear
# * <dbl> <list>
# 1 4. <S3: lm>
# 2 6. <S3: lm>
# 3 8. <S3: lm>
I then tidy this tibble and save my parameter estimates using tidy() in the broom package.
I also want to calculate the standard deviation of the predictors (stored in models above as I set x = TRUE) to create and then compare re-scaled parameters. I can do the former of these using
d1 %>%
# group_by(cyl) %>%
do(term = colnames(.$mod$x),
pred_sd = apply(X = .$mod$x, MARGIN = 2, FUN = sd)) %>%
unnest()
# # A tibble: 9 x 2
# term pred_sd
# <chr> <dbl>
# 1 (Intercept) 0.00000
# 2 disp 26.87159
# 3 hp 20.93453
# 4 (Intercept) 0.00000
# 5 disp 41.56246
# 6 hp 24.26049
# 7 (Intercept) 0.00000
# 8 disp 67.77132
# 9 hp 50.97689
However, the result is not a grouped tibble so I end up loosing the cyl column to tell me which terms belong to which model. How can avoid this loss? - Adding in group_by again seems to throw an error.
n.b. I want avoid using purrr for at least for the first part (fitting the models) as I run different types of models and then need to reshape the results (d1), and I like the progress bar with do.
n.b. I want to work with the $x component of the models rather than the raw data as they have the data on correct scale (I am experimenting with different transformations of the predictors)
We can do this by nesting initially and then do the unnest
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_linear = map(data, ~ lm(mpg ~ disp + hp, data = .x, x = TRUE)),
term = map(mod_linear, ~ names(coef(.x))),
pred = map(mod_linear, ~ .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist )) %>%
select(-data, -mod_linear) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
Or instead of calling the map multiple times, this can be further made compact with
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_contents = map(data, ~ {
mod <- lm(mpg ~ disp + hp, data = .x, x = TRUE)
term <- names(coef(mod))
pred <- mod$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
tibble(term, pred)
}
)) %>%
select(-data) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
If we start from 'd1' (based on the OP's code)
d1 %>%
ungroup %>%
mutate(mod_contents = map(mod_linear, ~ {
pred <- .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
term <- .x %>%
coef %>%
names
tibble(term, pred)
})) %>%
select(-mod_linear) %>%
unnest