Use broom and tidyverse to run regressions on different dependent variables - r

I'm looking for a Tidyverse / broom solution that can solve this puzzle:
Let's say I have different DVs and a specific set of IVS and I want to perform a regression that considers every DV and this specific set of IVs.
I know I can use something like for i in or apply family, but I really want to run that using tidyverse.
The following code works as an example
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
mod1 <- lm(income ~ sex + faculty, ds)
mod2 <- lm(happiness ~ sex + faculty, ds)
mod3 <- lm(health ~ sex + faculty, ds)
summary(mod1)
summary(mod2)
summary(mod3)
Income, happiness, and health are DVs. Sex and Faculty are IVs and they will be used for all regressions.
That was the closest I found
Let me know If I need to clarify my question.
Thanks.

As you have different dependent variables but the same independent, you can form a matrix of these and pass to lm.
mod = lm(cbind(income, happiness, health) ~ sex + faculty, ds)
And I think broom::tidy works
library(broom)
tidy(mod)
# response term estimate std.error statistic p.value
# 1 income (Intercept) 1019.35703873 31.0922529 32.7849205 2.779199e-54
# 2 income sex -54.40337314 40.1399258 -1.3553431 1.784559e-01
# 3 income faculty 19.74808081 17.9511206 1.1001030 2.740100e-01
# 4 happiness (Intercept) 5.97334562 0.1675340 35.6545278 1.505026e-57
# 5 happiness sex 0.05345555 0.2162855 0.2471528 8.053124e-01
# 6 happiness faculty -0.02525431 0.0967258 -0.2610918 7.945753e-01
# 7 health (Intercept) 19.76489553 0.5412676 36.5159396 1.741411e-58
# 8 health sex 0.32399380 0.6987735 0.4636607 6.439296e-01
# 9 health faculty 0.10808545 0.3125010 0.3458723 7.301877e-01

Another method is to gather the dependent variables and use a grouped data frame to fit the models with do. This is the method explained in the broom and dplyr vignette.
library(tidyverse)
library(broom)
ds <- data.frame(
income = rnorm(100, mean = 1000, sd = 200),
happiness = rnorm(100, mean = 6, sd = 1),
health = rnorm(100, mean = 20, sd = 3),
sex = c(0, 1),
faculty = c(0, 1, 2, 3)
)
ds %>%
gather(dv_name, dv_value, income:health) %>%
group_by(dv_name) %>%
do(tidy(lm(dv_value ~ sex + faculty, data = .)))
#> # A tibble: 9 x 6
#> # Groups: dv_name [3]
#> dv_name term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 happiness (Intercept) 6.25 0.191 32.7 3.14e-54
#> 2 happiness sex 0.163 0.246 0.663 5.09e- 1
#> 3 happiness faculty -0.172 0.110 -1.56 1.23e- 1
#> 4 health (Intercept) 20.1 0.524 38.4 1.95e-60
#> 5 health sex 0.616 0.677 0.909 3.65e- 1
#> 6 health faculty -0.653 0.303 -2.16 3.36e- 2
#> 7 income (Intercept) 1085. 32.8 33.0 1.43e-54
#> 8 income sex -12.9 42.4 -0.304 7.62e- 1
#> 9 income faculty -25.1 19.0 -1.32 1.89e- 1
Created on 2018-08-01 by the reprex package (v0.2.0).

We can loop through the column names that are dependent variables, use paste to create the formula to be passed into lm and get the summary statistics with tidy (from broom)
library(tidyverse)
library(broom)
map(names(ds)[1:3], ~
lm(formula(paste0(.x, "~",
paste(names(ds)[4:5], collapse=" + "))), data = ds) %>%
tidy)
If we want it in a single data.frame with a column identifier for dependent variable,
map_df(set_names(names(ds)[1:3]), ~
lm(formula(paste0(.x, "~",
paste(names(ds)[4:5], collapse=" + "))), data = ds) %>%
tidy, .id = "Dep_Variable")

Related

R loop over linear regression

I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression a couple of times. Each time using a different column as my independent variable, the dependent variable stays the same. After running it I want to be able to extract the R squared from each of the regressions. My thought process was to use a simple for loop. However, I cannot make it work.
Assume I work with the following data:
num value person1 person2 person3
0 1 229 29 81 0
1 2 203 17 75 0
2 3 244 62 0 55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for(i in 3:5){
temp <- df[,c("value", column[i])]
lm.test <- lm(value ~ ., data = temp)
i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3) which I think makes sense but when trying to rewrite it as: lm.test[i] <- lm(value ~ ., data = temp) I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.
You can create formula in a loop and then run the lm. For instance, if I want to run regression on mtcars for regressing mpg on each of cyl, wt, hp, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
lm_formula <- as.formula(paste0("mpg ~ ", col))
lm(lm_formula, data = mtcars)
})
You can then again iterate over lm_results to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().

Extraction LM stats into table

I have made a graph that displays r2, p-value and equation from linear regressions in the top left corner using stat_poly_eq.
Now I wish to have the stats from the linear regression extracted into a table.
For an example, in the mtcars dataset, if I want to do linear regression on plots of hp against disp for each cylinder group (e.g. 4, 6, 8) and then extract the linear regression stats into a table, how could I do that?
Thanks!
Here's the graph I have:
library(ggplot2)
library(ggpmisc)
formula <- y~x
ggplot(mtcars, aes(disp, hp)) +
geom_point() +
geom_smooth(method = "lm",formula = formula) +
theme_bw()+
facet_wrap(~cyl, scales = "free")+
stat_poly_eq(
aes(label = paste(stat(adj.rr.label), stat(eq.label), stat(p.value.label), sep = "*\", \"*")),
formula = formula, parse = TRUE, size=3)
Do you mean something like this?
With nest_by, divide the rest of the columns in separated tibbles by each cyl
With summarise, calculate each lm. You need to set it into a list.
Operate like a normal list with map and calculate the stuff you need: coefficients (extractable with broom::tidy) and adj.r.squared (with summary(.)$adj.r.squared)
unnest the result of broom::tidy to make a unique tibble.
library(dplyr)
library(tidyr)
library(purrr)
mtcars %>%
nest_by(cyl) %>%
summarise(mdl = list(lm(hp ~ disp, data)), .groups = "drop") %>%
mutate(adjrsquared = map_dbl(mdl, ~summary(.)$adj.r.squared ),
mdl = map(mdl, broom::tidy)) %>%
unnest(mdl)
#> # A tibble: 6 x 7
#> cyl term estimate std.error statistic p.value adjrsquared
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 (Intercept) 47.0 25.3 1.86 0.0960 0.0988
#> 2 4 disp 0.339 0.234 1.45 0.182 0.0988
#> 3 6 (Intercept) 177. 42.0 4.22 0.00829 0.117
#> 4 6 disp -0.300 0.224 -1.34 0.238 0.117
#> 5 8 (Intercept) 178. 77.4 2.30 0.0405 -0.0682
#> 6 8 disp 0.0890 0.216 0.413 0.687 -0.0682

extract random effects from MCMCglmm

I have a function which runs an MCMCglmm a bunch of times.
shuffles <- 1:10
names(shuffles) <- paste0("shuffle_", shuffles)
library(MCMCglmm)
library(dplyr)
library(tibble)
library(purrr)
ddd <- purrr::map(shuffles,
~ df %>%
mutate(Trait = sample(Trait)) %>%
MCMCglmm(fixed = Trait ~ 1,
random = ~ Year,
data = .,
family = "categorical",
verbose = FALSE)) %>%
purrr::map( ~ tibble::as_tibble(summary(.x)$solutions, rownames = "model_term")) %>%
dplyr::bind_rows(., .id = 'shuffle')
ddd
This section extracts fixed effects only.
(summary(.x)$Solutions, rownames = "model_term")
But note that I am not running a model without any fixed effects and so the output is empty.
How can I extract random effects using the same or similar code?
I guess I can change 'solutions' to something else to extract random effects from a model I have run without any fixed effects.
Note that this is an extension to a previous question (with example df) here - lapply instead of for loop for randomised hypothesis testing r
A relatively easy way to do this is with broom.mixed::tidy. It's not clear whether you mean you want to extract the summary for the top-level random effects parameters (i.e. the variances of the random effects), or for the estimates of the group-level effects.
library(broom.mixed)
tidy(m, effects="ran_pars")
##
## effect group term estimate std.error
## 1 ran_pars Year var__(Intercept) 0.00212 629.
## 2 ran_pars Residual var__Observation 40465. 24211.
If you want the group-level effects you need effects="ran_vals", but you have to re-run your model with pr=TRUE (or do it that way in the first place) in order to have these effects saved in the model object:
m <- MCMCglmm(Trait ~ ID, random = ~ Year, data = df, family = "categorical", pr=TRUE)
tidy(m, effects="ran_vals")
effect group level term estimate std.error
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 ran_vals Year 1992 (Intercept) 2.65e-8 4.90
2 ran_vals Year 1993 (Intercept) 1.14e-8 6.23
3 ran_vals Year 1994 (Intercept) 1.28e-8 4.88
4 ran_vals Year 1995 (Intercept) -6.83e-9 5.31
5 ran_vals Year 1996 (Intercept) -1.36e-8 5.07
6 ran_vals Year 1997 (Intercept) 1.31e-8 5.24
7 ran_vals Year 1998 (Intercept) -2.80e-9 5.25
8 ran_vals Year 1999 (Intercept) 3.52e-8 5.68

Many regressions using tidyverse and broom: Same dependent variable, different independent variables

This link shows how to answer my question in the case where we have the same independent variables, but potentially many different dependent variables: Use broom and tidyverse to run regressions on different dependent variables.
But my question is, how can I apply the same approach (e.g., tidyverse and broom) to run many regressions where we have the reverse situation: same dependent variables but different independent variable. In line with the code in the previous link, something like:
mod = lm(health ~ cbind(sex,income,happiness) + faculty, ds) %>% tidy()
However, this code does not do exactly what I want, and instead, produces:
Call:
lm(formula = income ~ cbind(sex, health) + faculty, data = ds)
Coefficients:
(Intercept) cbind(sex, health)sex
945.049 -47.911
cbind(sex, health)health faculty
2.342 1.869
which is equivalent to:
lm(formula = income ~ sex + health + faculty, data = ds)
Basically you'll need some way to create all the different formulas you want. Here's one way
qq <- expression(sex,income,happiness)
formulae <- lapply(qq, function(v) bquote(health~.(v)+faculty))
# [[1]]
# health ~ sex + faculty
# [[2]]
# health ~ income + faculty
# [[3]]
# health ~ happiness + faculty
Once you have all your formula, you can map them to lm and then to tidy()
library(purrr)
library(broom)
formulae %>% map(~lm(.x, ds)) %>% map_dfr(tidy, .id="model")
# A tibble: 9 x 6
# model term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 19.5 0.504 38.6 1.13e-60
# 2 1 sex 0.755 0.651 1.16 2.49e- 1
# 3 1 faculty -0.00360 0.291 -0.0124 9.90e- 1
# 4 2 (Intercept) 19.8 1.70 11.7 3.18e-20
# 5 2 income -0.000244 0.00162 -0.150 8.81e- 1
# 6 2 faculty 0.143 0.264 0.542 5.89e- 1
# 7 3 (Intercept) 18.4 1.88 9.74 4.79e-16
# 8 3 happiness 0.205 0.299 0.684 4.96e- 1
# 9 3 faculty 0.141 0.262 0.539 5.91e- 1
Using sample data
set.seed(11)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
You could use the combn function to get all combinations of n independent variables and then iterate over them. Let's say n=3 here:
library(tidyverse)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
ivs = combn(names(ds)[names(ds)!="income"], 3, simplify=FALSE)
# Or, to get all models with 1 to 4 variables:
# ivs = map(1:4, ~combn(names(ds)[names(ds)!="income"], .x, simplify=FALSE)) %>%
# flatten()
names(ivs) = map(ivs, ~paste(.x, collapse="-"))
models = map(ivs,
~lm(as.formula(paste("income ~", paste(.x, collapse="+"))), data=ds))
map_df(models, broom::tidy, .id="model")
model term estimate std.error statistic p.value
* <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 happiness-health-sex (Intercept) 1086. 201. 5.39 5.00e- 7
2 happiness-health-sex happiness -25.4 21.4 -1.19 2.38e- 1
3 happiness-health-sex health 3.58 6.99 0.512 6.10e- 1
4 happiness-health-sex sex 11.5 41.5 0.277 7.82e- 1
5 happiness-health-faculty (Intercept) 1085. 197. 5.50 3.12e- 7
6 happiness-health-faculty happiness -25.8 20.9 -1.23 2.21e- 1
7 happiness-health-faculty health 3.45 6.98 0.494 6.23e- 1
8 happiness-health-faculty faculty 7.86 18.2 0.432 6.67e- 1
9 happiness-sex-faculty (Intercept) 1153. 141. 8.21 1.04e-12
10 happiness-sex-faculty happiness -25.9 21.4 -1.21 2.28e- 1
11 happiness-sex-faculty sex 3.44 46.2 0.0744 9.41e- 1
12 happiness-sex-faculty faculty 7.40 20.2 0.366 7.15e- 1
13 health-sex-faculty (Intercept) 911. 143. 6.35 7.06e- 9
14 health-sex-faculty health 3.90 7.03 0.554 5.81e- 1
15 health-sex-faculty sex 15.6 45.6 0.343 7.32e- 1
16 health-sex-faculty faculty 7.02 20.4 0.345 7.31e- 1

How to run multiple linear regressions with different independent variables and dependent variables adding standardized coefficients in R?

I'm currently trying to run a loop performing linear regression for multiple independent variables (n = 6) with multiple dependent variables (n=1000).
Here is some example data, with age, sex, and education representing my independent variables of interest and testscore_* being my dependent variables.
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006,1007, 1008, 1009, 1010, 1011),
age = as.numeric(c('56', '43','59','74','61','62','69','80','40','55','58')),
sex = as.numeric(c('0','1','0','0','1','1','0','1','0','1','0')),
testscore_1 = as.numeric(c('23','28','30','15','7','18','29','27','14','22','24')),
testscore_2 = as.numeric(c('1','3','2','5','8','2','5','6','7','8','2')),
testscore_3 = as.numeric(c('18','20','19','15','20','23','19','25','10','14','12')),
education = as.numeric(c('5','4','3','5','2', '1','4','4','3','5','2')))
I have working code that allows me to run a regression model for multiple DVs (which I'm sure more experienced R users will dislike for its lack of efficiency):
y <- as.matrix(df[4:6])
#model for age
lm_results <- lm(y ~ age, data = df)
write.csv((broom::tidy(lm_results)), "lm_results_age.csv")
regression_results <-broom::tidy(lm_results)
standardized_coefficients <- lm.beta(lm_results)
age_standardize_results <- coef(standardized_coefficients)
write.csv(age_standardize_results, "lm_results_age_standardized_coefficients.csv")
I would then repeat this all by manually replacing age with sex and education
Does anyone have a more elegant way of running this - for example, by way of a loop for all IVs of interest (i.e. age, sex and education)?
Also would greatly appreciate anyone who would suggest a quick way of combining broom::tidy(lm_results) with standardized coefficients from lm.beta::lm.beta, i.e. combining the standardized regression coefficients with the main model output.
This is an adaptation for a similar workflow I had to use in the past. Remember to really penalize yourself for running a crazy number of models. I added a couple predictor columns to your dataframe. Good luck!!
Solution:
# Creating pedictor and outcome vectors
ivs_vec <- names(df)[c(2:6, 10)]
dvs_vec <- names(df)[7:9]
# Creating formulas and running the models
ivs <- paste0(" ~ ", ivs_vec)
dvs_ivs <- unlist(lapply(ivs, function(x) paste0(dvs_vec, x)))
formulas <- lapply(dvs_ivs, formula)
lm_results <- lapply(formulas, function(x) {
lm(x, data = df)
})
# Creating / combining results
tidy_results <- lapply(lm_results, broom::tidy)
dv_list <- lapply(as.list(stringi::stri_extract_first_words(dvs_ivs)), rep, 2)
tidy_results <- Map(cbind, dv_list, tidy_results)
standardized_results <- lapply(lm_results, function(x) coef(lm.beta::lm.beta(x)))
combined_results <- Map(cbind, tidy_results, standardized_results)
# Cleaning up final results
names(combined_results) <- dvs_ivs
combined_results <- lapply(combined_results, function(x) {row.names(x) <- c(NULL); x})
new_names <- c("Outcome", "Term", "Estimate", "Std. Error", "Statistic", "P-value", "Standardized Estimate")
combined_results <- lapply(combined_results, setNames, new_names)
Results:
combined_results[1:5]
$`testscore_1 ~ age`
Outcome Term Estimate Std. Error Statistic P-value
Standardized Estimate
1 testscore_1 (Intercept) 18.06027731 12.3493569 1.4624468 0.1776424 0.00000000
2 testscore_1 age 0.05835152 0.2031295 0.2872627 0.7804155 0.09531823
$`testscore_2 ~ age`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_2 (Intercept) 3.63788676 4.39014570 0.8286483 0.4287311 0.0000000
2 testscore_2 age 0.01367313 0.07221171 0.1893478 0.8540216 0.0629906
$`testscore_3 ~ age`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_3 (Intercept) 6.1215175 6.698083 0.9139208 0.3845886 0.0000000
2 testscore_3 age 0.1943125 0.110174 1.7636870 0.1116119 0.5068026
$`testscore_1 ~ sex`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_1 (Intercept) 22.5 3.099283 7.2597435 4.766069e-05 0.0000000
2 testscore_1 sex -2.1 4.596980 -0.4568217 6.586248e-01 -0.1505386
$`testscore_2 ~ sex`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_2 (Intercept) 3.666667 1.041129 3.521816 0.006496884 0.0000000
2 testscore_2 sex 1.733333 1.544245 1.122447 0.290723029 0.3504247
Data:
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006,1007, 1008, 1009, 1010, 1011),
age = as.numeric(c('56', '43','59','74','61','62','69','80','40','55','58')),
sex = as.numeric(c('0','1','0','0','1','1','0','1','0','1','0')),
pred1 = sample(1:11, 11),
pred2 = sample(1:11, 11),
pred3 = sample(1:11, 11),
testscore_1 = as.numeric(c('23','28','30','15','7','18','29','27','14','22','24')),
testscore_2 = as.numeric(c('1','3','2','5','8','2','5','6','7','8','2')),
testscore_3 = as.numeric(c('18','20','19','15','20','23','19','25','10','14','12')),
education = as.numeric(c('5','4','3','5','2', '1','4','4','3','5','2')))
Stumbled upon this a year later and documenting a tidyverse solution same data as #Andrew.
library(dplyr)
library(purrr)
library(tidyr)
library(stringi)
# Creating pedictor and outcome vectors
ivs_vec <- names(df)[c(2:6, 10)]
dvs_vec <- names(df)[7:9]
# Creating formulas and running the models
ivs <- paste0(" ~ ", ivs_vec)
dvs_ivs <- unlist(map(ivs, ~paste0(dvs_vec, .x)))
models <- map(setNames(dvs_ivs, dvs_ivs),
~ lm(formula = as.formula(.x),
data = df))
basics <-
map(models, ~ broom::tidy(.)) %>%
map2_df(.,
names(.),
~ mutate(.x, which_dependent = .y)) %>%
select(which_dependent, everything()) %>%
mutate(term = gsub("\\(Intercept\\)", "Intercept", term),
which_dependent = stringi::stri_extract_first_words(which_dependent))
basics$std_estimate <-
map_dfr(models, ~ coef(lm.beta::lm.beta(.)), .id = "which_dependent") %>%
pivot_longer(.,
cols = -which_dependent,
names_to = "term",
values_to = "std_estimate",
values_drop_na = TRUE) %>%
pull(std_estimate)
basics
#> # A tibble: 36 x 7
#> which_dependent term estimate std.error statistic p.value std_estimate
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 testscore_1 Intercept 18.1 12.3 1.46 0.178 0
#> 2 testscore_1 age 0.0584 0.203 0.287 0.780 0.0953
#> 3 testscore_2 Intercept 3.64 4.39 0.829 0.429 0
#> 4 testscore_2 age 0.0137 0.0722 0.189 0.854 0.0630
#> 5 testscore_3 Intercept 6.12 6.70 0.914 0.385 0
#> 6 testscore_3 age 0.194 0.110 1.76 0.112 0.507
#> 7 testscore_1 Intercept 22.5 3.10 7.26 0.0000477 0
#> 8 testscore_1 sex -2.10 4.60 -0.457 0.659 -0.151
#> 9 testscore_2 Intercept 3.67 1.04 3.52 0.00650 0
#> 10 testscore_2 sex 1.73 1.54 1.12 0.291 0.350
#> # … with 26 more rows

Resources