I'm having trouble figuring out how to use purrr::map() with mutate(across(...)).
I want to fit a linear model for each of several columns, each predicted by the same single column, and pull out the slope estimate.
Here is what I'm attempting with an example data set:
mtcars %>%
mutate(across(-mpg),
map(.x, lst(slope = ~lm(.x ~ mpg, data = .x) %>%
tidy() %>%
filter(term != "(Intercept") %>%
pull(estimate)
)))
The output I'm looking for would be a new column for each non-mpg column, with _slope appended to the name, e.g. cyl_slope.
In my actual data, I'll be grouping by another variable as well in case that matters, as I need the slope for each group for each predicted variable. I have this working in a standard mutate doing one variable at a time as follows:
df %>%
group_by(unitid) %>%
nest() %>%
mutate(tuition_and_fees_as_pct_total_rev_slope = map_dbl(data, ~lm(tuition_and_fees_as_pct_total_rev ~ year, data = .x) %>%
tidy() %>%
filter(term == "year") %>%
pull(estimate)
))
So:
I think my issue is how to pass the column name being predicted into the lm
I don't know whether the solution requires nesting, so I would appreciate it if the mtcars example took grouping into account.
To run lm on every other column with 'mpg' as the independent variable, one option is to loop over the column names of 'mtcars' except 'mpg', create the formula with reformulate, apply lm, convert the result to a tidy format, filter out the '(Intercept)' row, and select the 'estimate' column.
library(dplyr)
library(tidyr)
library(broom)
library(purrr) # provides map_dfc()
map_dfc(setdiff(names(mtcars), 'mpg'), ~
lm(reformulate('mpg', response = .x), data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(estimate))
-output
# A tibble: 1 x 10
# estimate...1 estimate...2 estimate...3 estimate...4 estimate...5 estimate...6 estimate...7 estimate...8 estimate...9 estimate...10
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
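If named values are preferred over the auto-generated estimate...1 column names, the same loop can be sketched with map_dbl over a named character vector. This is only a minimal variant of the approach above; the names responses and slopes are illustrative and not part of the original answer.

```r
library(purrr)

# Response columns: everything except the predictor 'mpg'
responses <- setdiff(names(mtcars), "mpg")

# map_dbl() keeps the names of its input, so naming the vector first
# yields a named numeric vector such as cyl_slope, disp_slope, ...
slopes <- map_dbl(
  set_names(responses, paste0(responses, "_slope")),
  ~ coef(lm(reformulate("mpg", response = .x), data = mtcars))[["mpg"]]
)

slopes
```

Extracting the coefficient with coef() avoids the tidy/filter/select steps when only the slope is needed.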
Or this can be done more easily with a single lm call that takes a matrix of all the response columns as the dependent variable
library(stringr)
lm(as.matrix(mtcars[setdiff(names(mtcars), "mpg")]) ~ mpg,
data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(response, estimate) %>%
mutate(response = str_c(response, '_slope'))
-output
# A tibble: 10 x 2
# response estimate
# <chr> <dbl>
# 1 cyl_slope -0.253
# 2 disp_slope -17.4
# 3 hp_slope -8.83
# 4 drat_slope 0.0604
# 5 wt_slope -0.141
# 6 qsec_slope 0.124
# 7 vs_slope 0.0555
# 8 am_slope 0.0497
# 9 gear_slope 0.0588
#10 carb_slope -0.148
Or another option is summarise with across
mtcars %>%
summarise(across(-mpg, ~ list(lm(reformulate('mpg',
response = cur_column())) %>%
tidy %>%
filter(term != "(Intercept)") %>%
pull(estimate)), .names = "{.col}_slope")) %>%
unnest(everything())
# A tibble: 1 x 10
# cyl_slope disp_slope hp_slope drat_slope wt_slope qsec_slope vs_slope am_slope gear_slope carb_slope
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
One option could be:
map_dfr(.x = names(select(mtcars, -c(mpg, vs))),
~ mtcars %>%
group_by(vs) %>%
nest() %>%
mutate(variable = .x,
estimate = map_dbl(data, function(y) lm(!!sym(.x) ~ mpg, data = y) %>%
tidy() %>%
filter(term != "(Intercept)") %>%
pull(estimate))) %>%
select(-data))
vs variable estimate
<dbl> <chr> <dbl>
1 0 cyl -0.242
2 1 cyl -0.116
3 0 disp -22.5
4 1 disp -8.01
5 0 hp -10.1
6 1 hp -3.26
7 0 drat 0.0748
8 1 drat 0.0529
9 0 wt -0.192
10 1 wt -0.113
11 0 qsec -0.0357
12 1 qsec -0.0432
13 0 am 0.0742
14 1 am 0.0710
15 0 gear 0.114
16 1 gear 0.0492
17 0 carb -0.0883
18 1 carb -0.0790
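If the wide cyl_slope-style columns from the question are wanted per group, one possible sketch skips lm entirely and uses the closed-form slope of a simple regression, cov(y, x) / var(x), inside summarise(across(...)). This swaps in a different technique from the lm-based answers above and assumes dplyr 1.0 or later; grouping by vs mirrors the example above, and the name grouped_slopes is illustrative.

```r
library(dplyr)

grouped_slopes <- mtcars %>%
  group_by(vs) %>%
  summarise(
    # For y ~ mpg, the least-squares slope equals cov(y, mpg) / var(mpg).
    # across() skips the grouping column (vs) automatically.
    across(-mpg, ~ cov(.x, mpg) / var(mpg), .names = "{.col}_slope")
  )

grouped_slopes
```

This returns one row per group and one _slope column per response, but it gives no standard errors or p-values, so the lm-based versions remain the better fit when those are needed.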
Related
I'm trying to run a simple single linear regression over a large number of variables, grouped according to another variable. Using the mtcars dataset as an example, I'd like to run a separate linear regression between mpg and each other variable (mpg ~ disp, mpg ~ hp, etc.), grouped by another variable (for example, cyl).
Running lm over each variable independently can easily be done using purrr::map (modified from this great tutorial - https://sebastiansauer.github.io/EDIT-multiple_lm_purrr_EDIT/):
library(dplyr)
library(tidyr)
library(purrr)
library(broom) # provides glance()
mtcars %>%
select(-mpg) %>% #exclude outcome, leave predictors
map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>%
map_df(glance, .id='variable') %>%
select(variable, r.squared, p.value)
# A tibble: 10 x 3
variable r.squared p.value
<chr> <dbl> <dbl>
1 cyl 0.726 6.11e-10
2 disp 0.718 9.38e-10
3 hp 0.602 1.79e- 7
4 drat 0.464 1.78e- 5
5 wt 0.753 1.29e-10
6 qsec 0.175 1.71e- 2
7 vs 0.441 3.42e- 5
8 am 0.360 2.85e- 4
9 gear 0.231 5.40e- 3
10 carb 0.304 1.08e- 3
And running a linear model over grouped variables is also easy using map:
mtcars %>%
split(.$cyl) %>% #split by grouping variable
map(~ lm(mpg ~ wt, data = .)) %>%
map_df(broom::glance, .id='cyl') %>%
select(cyl, r.squared, p.value)
# A tibble: 3 x 3
cyl r.squared p.value
<chr> <dbl> <dbl>
1 4 0.509 0.0137
2 6 0.465 0.0918
3 8 0.423 0.0118
So I can run by variable, or by group. However, I can't figure out how to combine the two (grouping everything by cyl, then running lm(mpg ~ each other variable), separately). I'd hoped to do something like this:
mtcars %>%
select(-mpg) %>% #exclude outcome, leave predictors
split(.$cyl) %>% # group by grouping variable
map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>% #run lm across all variables
map_df(glance, .id='cyl') %>%
select(cyl, variable, r.squared, p.value)
and get a result that gives me cyl(group), variable, r.squared, and p.value (a combination of 3 groups * 10 variables = 30 model outputs).
But split() turns the dataframe into a list, which the construction from part 1 [ map(~ lm(mtcars$mpg ~ .x, data = mtcars)) ] can't handle. I have tried to modify it so that it doesn't explicitly refer to the original data structure, but can't figure out a working solution. Any help is greatly appreciated!
IIUC, you can use group_by and group_modify, with a map inside that iterates over predictors.
If you can isolate your predictor variables in advance, it'll make it easier, as with ivs in this solution.
library(tidyverse)
ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs
mtcars %>%
group_by(cyl) %>%
group_modify(function(data, key) {
map_df(ivs, function(iv) {
frml <- as.formula(paste("mpg", "~", iv))
lm(frml, data = data) %>% broom::glance()
}, .id = "iv")
}) %>%
select(cyl, iv, r.squared, p.value)
# A tibble: 27 × 4
# Groups: cyl [3]
cyl iv r.squared p.value
<dbl> <chr> <dbl> <dbl>
1 4 disp 0.648 0.00278
2 4 hp 0.274 0.0984
3 4 drat 0.180 0.193
4 4 wt 0.509 0.0137
5 4 qsec 0.0557 0.485
6 4 vs 0.00238 0.887
7 4 am 0.287 0.0892
8 4 gear 0.115 0.308
9 4 carb 0.0378 0.567
10 6 disp 0.0106 0.826
11 6 hp 0.0161 0.786
# ...
I am trying to do the same thing as below, except that the order of the names changes (vs should come before am). I got the code from here
mtcars; rownames(mtcars) <- NULL
df <- mtcars[,c(2,8,9)]
head(df)
(df
%>% pivot_longer(-cyl) ## spread out variables (vs, am)
%>% group_by(cyl,name)
%>% dplyr::mutate(n=n()) ## obs per cyl/var combo
%>% group_by(cyl,name,value)
%>% dplyr::summarise(prop=n()/n) ## proportion of 0/1 per cyl/var
%>% unique() ## not sure why I need this?
%>% pivot_wider(id_cols=c(cyl,name),names_from=value,values_from=prop)
)
Expected answer
cyl name `0` `1`
4 vs 0.0909 0.909
4 am 0.273 0.727
6 vs 0.429 0.571
6 am 0.571 0.429
8 vs 1 NA
8 am 0.857 0.143
One possible solution involves adding three lines below your code.
Basically, you convert the name variable to a factor whose levels are given in the desired order, so that it is internally coded as 1, 2, ...
Then you group by cyl and sort by name within each group.
(df
%>% pivot_longer(-cyl) ## spread out variables (vs, am)
%>% group_by(cyl,name)
%>% dplyr::mutate(n=n()) ## obs per cyl/var combo
%>% group_by(cyl,name,value)
%>% dplyr::summarise(prop=n()/n) ## proportion of 0/1 per cyl/var
%>% unique() ## not sure why I need this?
%>% pivot_wider(id_cols=c(cyl,name),names_from=value,values_from=prop)
%>% mutate(name = factor(name, levels = c("vs", "am")))
%>% group_by(cyl)
%>% arrange(name, .by_group = TRUE)
)
# A tibble: 6 x 4
# Groups: cyl [3]
cyl name `0` `1`
<dbl> <fct> <dbl> <dbl>
1 4 vs 0.0909 0.909
2 4 am 0.273 0.727
3 6 vs 0.429 0.571
4 6 am 0.571 0.429
5 8 vs 1 NA
6 8 am 0.857 0.143
Different take:
df %>% pivot_longer(!cyl) %>% group_by(cyl, name, value) %>% mutate(cnt = n()) %>%
ungroup() %>% group_by(cyl, name) %>% mutate(prop = cnt/n()) %>% distinct() %>%
pivot_wider(id_cols = c(cyl, name), names_from = value, values_from = prop) %>%
arrange(cyl, desc(name))
# A tibble: 6 x 4
# Groups: cyl, name [6]
cyl name `0` `1`
<dbl> <chr> <dbl> <dbl>
1 4 vs 0.0909 0.909
2 4 am 0.273 0.727
3 6 vs 0.429 0.571
4 6 am 0.571 0.429
5 8 vs 1 NA
6 8 am 0.857 0.143
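Another way to get the same table, sketched under the assumption that df holds the cyl, vs and am columns of mtcars as in the question, is to count each cyl/name/value combination and divide by the per-group total. Setting the factor levels before counting fixes the vs-before-am order; the name props is illustrative.

```r
library(dplyr)
library(tidyr)

df <- mtcars[, c("cyl", "vs", "am")]

props <- df %>%
  pivot_longer(-cyl) %>%
  # The factor level order controls the later sort: vs first, then am
  mutate(name = factor(name, levels = c("vs", "am"))) %>%
  count(cyl, name, value) %>%          # rows per cyl/name/value
  group_by(cyl, name) %>%
  mutate(prop = n / sum(n)) %>%        # share within each cyl/name pair
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = value, values_from = prop) %>%
  arrange(cyl, name)

props
```

Because count() already returns one row per combination, no unique() or distinct() step is needed.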
I am looking for a way to calculate multiple t-tests for all the combinations of 2 list elements in R tidyverse environment.
I would like to test the means of Miles/(US) gallon based on Transmission for each combination of cyl and vs. My working example is this code:
mtcars %>%
filter(cyl==8 & vs == 0) %>%
mutate(am = as.factor(am)) %>%
# independent t-test
t.test(mpg ~ am, data = ., paired = FALSE)%>%
broom::tidy() %>%
mutate(cyl = 8) %>%
mutate(vs = 0) %>%
select(cyl, vs, everything())
I wrote this piece of code:
cyl_list <- list(unique(mtcars$cyl)) # 6 4 8
vs_list <- list(unique(mtcars$vs)) # 0 1
complete_t_test <- function(cyl_par, vs_par){
mtcars %>%
filter(cyl=={cyl_par} & vs == {vs_par}) %>%
mutate(am = as.factor(am)) %>%
# independent t-test
t.test(mpg ~ am, data = ., paired = FALSE) %>%
broom::tidy() %>%
mutate(cyl = {cyl_par}) %>%
mutate(vs = {vs_par}) %>%
select(cyl, vs, everything())}
I was thinking of something similar to purrr::map2(cyl_list, vs_list, complete_t_test)
but it did not work.
List columns may be a viable solution (see book R for Data Science, chapter 25). I create a list column using nest(), then do the t-tests, and unnest() again to see the results.
NB: Your example fails for several combinations in the mtcars data, and therefore I use possibly() to do the t-test only if appropriate data are available.
library("tidyverse")
f1 <- possibly(~t.test(mpg ~ am, data = .x), otherwise = NULL)
mtcars %>%
group_by(cyl, vs) %>%
nest() %>% # create list columns
mutate(res = map(data, ~f1(.x))) %>% # do t-tests
mutate(res = map(res, broom::tidy)) %>% # tidy()
unnest(res) %>% # unnest list columns
select(1:8) # show some columns for stackoverflow
#> # A tibble: 2 x 8
#> # Groups: cyl, vs [2]
#> cyl vs data estimate estimate1 estimate2 statistic p.value
#> <dbl> <dbl> <list<df[,9]>> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 [10 x 9] -5.47 22.9 28.4 -2.76 0.0254
#> 2 8 0 [14 x 9] -0.350 15.0 15.4 -0.391 0.704
Created on 2019-11-04 by the reprex package (v0.3.0)
Write a function that runs the t-test for a single combination of cyl and vs. Then use cross_df to create all combinations and apply the function complete_t_test to each one; tryCatch returns NA for combinations where the test cannot be run.
library(tidyverse)
complete_t_test <- function(cyl_par, vs_par) {
tryCatch({
mtcars %>%
filter(cyl== cyl_par & vs == vs_par) %>%
t.test(mpg ~ am, data = ., paired = FALSE) %>%
broom::tidy()
}, error = function(e) return(NA))
}
cyl_list <- unique(mtcars$cyl)
vs_list <- unique(mtcars$vs)
cross_df(list(a = cyl_list, b = vs_list)) %>%
mutate(t_test = map2(a, b, complete_t_test))
# a b t_test
# <dbl> <dbl> <list>
#1 6 0 <lgl [1]>
#2 4 0 <lgl [1]>
#3 8 0 <tibble [1 × 10]>
#4 6 1 <lgl [1]>
#5 4 1 <tibble [1 × 10]>
#6 8 1 <lgl [1]>
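As far as I know, cross_df() was deprecated in purrr 1.0.0 in favour of tidyr::expand_grid(), so on a recent tidyverse the same idea might be sketched as below. The wrapper name safe_t_test and the result name results are illustrative; possibly() stands in for the tryCatch above and returns NULL instead of NA when a test fails.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Run the t-test for one cyl/vs combination; return NULL if it fails
# (for example when only one level of am is present in the subset).
safe_t_test <- possibly(function(cyl_par, vs_par) {
  d <- filter(mtcars, cyl == cyl_par, vs == vs_par)
  broom::tidy(t.test(mpg ~ am, data = d))
}, otherwise = NULL)

results <- expand_grid(cyl = unique(mtcars$cyl), vs = unique(mtcars$vs)) %>%
  mutate(t_test = map2(cyl, vs, safe_t_test))

results
```

Rows whose t_test entry is NULL correspond to combinations where the test could not be run; calling unnest(results, t_test) should keep only the successful ones.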
I want to use the size of a group as part of a groupwise operation in dplyr::summarise.
E.g., calculate the proportion of manuals by cylinder by grouping the cars data by cyl and dividing the number of manuals by the size of the group:
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/group_size(.))
But (I think) because group_size expects a grouped tbl_df and . is ungrouped, this returns
Error in mutate_impl(.data, dots) : basic_string::resize
Is there a way to do this?
You can probably use n() to get the number of rows in each group
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/n())
# cyl zz
# <dbl> <dbl>
#1 4.00 0.727
#2 6.00 0.429
#3 8.00 0.143
Since am is coded 0/1, this is just the mean of am by group
mtcars %>%
group_by(cyl) %>%
summarise(zz = mean(am))
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
If we need to use group_size
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest %>%
mutate(zz = map_dbl(data, ~ sum(.x$am)/group_size(.x))) %>%
arrange(cyl) %>%
select(-data)
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
Or using do
mtcars %>%
group_by(cyl) %>%
do(data.frame(zz = sum(.$am)/group_size(.)))
# A tibble: 3 x 2
# Groups: cyl [3]
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
I am running multiple models on multiple sections of my data set, similar to (but with many more models)
library(tidyverse)
d1 <- mtcars %>%
group_by(cyl) %>%
do(mod_linear = lm(mpg ~ disp + hp, data = ., x = TRUE))
d1
# Source: local data frame [3 x 3]
# Groups: <by row>
#
# # A tibble: 3 x 3
# cyl mod_linear
# * <dbl> <list>
# 1 4. <S3: lm>
# 2 6. <S3: lm>
# 3 8. <S3: lm>
I then tidy this tibble and save my parameter estimates using tidy() in the broom package.
I also want to calculate the standard deviation of the predictors (stored in models above as I set x = TRUE) to create and then compare re-scaled parameters. I can do the former of these using
d1 %>%
# group_by(cyl) %>%
do(term = colnames(.$mod$x),
pred_sd = apply(X = .$mod$x, MARGIN = 2, FUN = sd)) %>%
unnest()
# # A tibble: 9 x 2
# term pred_sd
# <chr> <dbl>
# 1 (Intercept) 0.00000
# 2 disp 26.87159
# 3 hp 20.93453
# 4 (Intercept) 0.00000
# 5 disp 41.56246
# 6 hp 24.26049
# 7 (Intercept) 0.00000
# 8 disp 67.77132
# 9 hp 50.97689
However, the result is not a grouped tibble, so I end up losing the cyl column that tells me which terms belong to which model. How can I avoid this loss? Adding in group_by again seems to throw an error.
n.b. I want to avoid using purrr, at least for the first part (fitting the models), as I run different types of models and then need to reshape the results (d1), and I like the progress bar with do.
n.b. I want to work with the $x component of the models rather than the raw data, as it has the data on the correct scale (I am experimenting with different transformations of the predictors)
We can do this by nesting first, fitting the models on the nested data, and then unnesting, which keeps the cyl column
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_linear = map(data, ~ lm(mpg ~ disp + hp, data = .x, x = TRUE)),
term = map(mod_linear, ~ names(coef(.x))),
pred = map(mod_linear, ~ .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist )) %>%
select(-data, -mod_linear) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
Or, instead of calling map multiple times, this can be made more compact with a single map call
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_contents = map(data, ~ {
mod <- lm(mpg ~ disp + hp, data = .x, x = TRUE)
term <- names(coef(mod))
pred <- mod$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
tibble(term, pred)
}
)) %>%
select(-data) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
If we start from 'd1' (based on the OP's code)
d1 %>%
ungroup %>%
mutate(mod_contents = map(mod_linear, ~ {
pred <- .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
term <- .x %>%
coef %>%
names
tibble(term, pred)
})) %>%
select(-mod_linear) %>%
unnest
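The calls nest(-cyl) and unnest without arguments use older tidyr syntax. A minimal sketch of the same idea with the explicit-column syntax, assuming tidyr 1.0 or later, is shown below; the name pred_sd is illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

pred_sd <- mtcars %>%
  nest(data = -cyl) %>%                       # one row per cyl, data in a list column
  mutate(mod_contents = map(data, ~ {
    mod <- lm(mpg ~ disp + hp, data = .x, x = TRUE)
    # mod$x is the model matrix; take the sd of each of its columns
    tibble(term = colnames(mod$x),
           pred = unname(apply(mod$x, 2, sd)))
  })) %>%
  select(-data) %>%
  unnest(mod_contents)                        # cyl is kept alongside term and pred

pred_sd
```

Because cyl stays in the outer tibble throughout, it is still there after unnest() to identify which terms belong to which model.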