I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression a couple of times. Each time using a different column as my independent variable, the dependent variable stays the same. After running it I want to be able to extract the R squared from each of the regressions. My thought process was to use a simple for loop. However, I cannot make it work.
Assume I work with the following data:
num value person1 person2 person3
0 1 229 29 81 0
1 2 203 17 75 0
2 3 244 62 0 55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for(i in 3:5){
temp <- df[,c("value", column[i])]
lm.test <- lm(value ~ ., data = temp)
i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3) which I think makes sense but when trying to rewrite it as: lm.test[i] <- lm(value ~ ., data = temp) I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.
You can create formula in a loop and then run the lm. For instance, if I want to run regression on mtcars for regressing mpg on each of cyl, wt, hp, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
lm_formula <- as.formula(paste0("mpg ~ ", col))
lm(lm_formula, data = mtcars)
})
You can then again iterate over lm_results to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().
I have made a graph that displays r2, p-value and equation from linear regressions in the top left corner using stat_poly_eq.
Now I wish to have the stats from the linear regression extracted into a table.
For an example, in the mtcars dataset, if I want to do linear regression on plots of hp against disp for each cylinder group (e.g. 4, 6, 8) and then extract the linear regression stats into a table, how could I do that?
Thanks!
Here's the graph I have:
library(ggplot2)
library(ggpmisc)
formula <- y~x
ggplot(mtcars, aes(disp, hp)) +
geom_point() +
geom_smooth(method = "lm",formula = formula) +
theme_bw()+
facet_wrap(~cyl, scales = "free")+
stat_poly_eq(
aes(label = paste(stat(adj.rr.label), stat(eq.label), stat(p.value.label), sep = "*\", \"*")),
formula = formula, parse = TRUE, size=3)
Do you mean something like this?
With nest_by, divide the rest of the columns in separated tibbles by each cyl
With summarise, calculate each lm. You need to set it into a list.
Operate like a normal list with map and calculate the stuff you need: coefficients (extractable with broom::tidy) and adj.r.squared (with summary(.)$adj.r.squared)
unnest the result of broom::tidy to make a unique tibble.
library(dplyr)
library(tidyr)
library(purrr)
mtcars %>%
nest_by(cyl) %>%
summarise(mdl = list(lm(hp ~ disp, data)), .groups = "drop") %>%
mutate(adjrsquared = map_dbl(mdl, ~summary(.)$adj.r.squared ),
mdl = map(mdl, broom::tidy)) %>%
unnest(mdl)
#> # A tibble: 6 x 7
#> cyl term estimate std.error statistic p.value adjrsquared
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 (Intercept) 47.0 25.3 1.86 0.0960 0.0988
#> 2 4 disp 0.339 0.234 1.45 0.182 0.0988
#> 3 6 (Intercept) 177. 42.0 4.22 0.00829 0.117
#> 4 6 disp -0.300 0.224 -1.34 0.238 0.117
#> 5 8 (Intercept) 178. 77.4 2.30 0.0405 -0.0682
#> 6 8 disp 0.0890 0.216 0.413 0.687 -0.0682
I have a function which runs an MCMCglmm a bunch of times.
shuffles <- 1:10
names(shuffles) <- paste0("shuffle_", shuffles)
library(MCMCglmm)
library(dplyr)
library(tibble)
library(purrr)
ddd <- purrr::map(shuffles,
~ df %>%
mutate(Trait = sample(Trait)) %>%
MCMCglmm(fixed = Trait ~ 1,
random = ~ Year,
data = .,
family = "categorical",
verbose = FALSE)) %>%
purrr::map( ~ tibble::as_tibble(summary(.x)$solutions, rownames = "model_term")) %>%
dplyr::bind_rows(., .id = 'shuffle')
ddd
This section extracts fixed effects only.
(summary(.x)$Solutions, rownames = "model_term")
But note that I am not running a model without any fixed effects and so the output is empty.
How can I extract random effects using the same or similar code?
I guess I can change 'solutions' to something else to extract random effects from a model I have run without any fixed effects.
Note that this is an extension to a previous question (with example df) here - lapply instead of for loop for randomised hypothesis testing r
A relatively easy way to do this is with broom.mixed::tidy. It's not clear whether you mean you want to extract the summary for the top-level random effects parameters (i.e. the variances of the random effects), or for the estimates of the group-level effects.
library(broom.mixed)
tidy(m, effects="ran_pars")
##
## effect group term estimate std.error
## 1 ran_pars Year var__(Intercept) 0.00212 629.
## 2 ran_pars Residual var__Observation 40465. 24211.
If you want the group-level effects you need effects="ran_vals", but you have to re-run your model with pr=TRUE (or do it that way in the first place) in order to have these effects saved in the model object:
m <- MCMCglmm(Trait ~ ID, random = ~ Year, data = df, family = "categorical", pr=TRUE)
tidy(m, effects="ran_vals")
effect group level term estimate std.error
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 ran_vals Year 1992 (Intercept) 2.65e-8 4.90
2 ran_vals Year 1993 (Intercept) 1.14e-8 6.23
3 ran_vals Year 1994 (Intercept) 1.28e-8 4.88
4 ran_vals Year 1995 (Intercept) -6.83e-9 5.31
5 ran_vals Year 1996 (Intercept) -1.36e-8 5.07
6 ran_vals Year 1997 (Intercept) 1.31e-8 5.24
7 ran_vals Year 1998 (Intercept) -2.80e-9 5.25
8 ran_vals Year 1999 (Intercept) 3.52e-8 5.68
This link shows how to answer my question in the case where we have the same independent variables, but potentially many different dependent variables: Use broom and tidyverse to run regressions on different dependent variables.
But my question is, how can I apply the same approach (e.g., tidyverse and broom) to run many regressions where we have the reverse situation: same dependent variables but different independent variable. In line with the code in the previous link, something like:
mod = lm(health ~ cbind(sex,income,happiness) + faculty, ds) %>% tidy()
However, this code does not do exactly what I want, and instead, produces:
Call:
lm(formula = income ~ cbind(sex, health) + faculty, data = ds)
Coefficients:
(Intercept) cbind(sex, health)sex
945.049 -47.911
cbind(sex, health)health faculty
2.342 1.869
which is equivalent to:
lm(formula = income ~ sex + health + faculty, data = ds)
Basically you'll need some way to create all the different formulas you want. Here's one way
qq <- expression(sex,income,happiness)
formulae <- lapply(qq, function(v) bquote(health~.(v)+faculty))
# [[1]]
# health ~ sex + faculty
# [[2]]
# health ~ income + faculty
# [[3]]
# health ~ happiness + faculty
Once you have all your formula, you can map them to lm and then to tidy()
library(purrr)
library(broom)
formulae %>% map(~lm(.x, ds)) %>% map_dfr(tidy, .id="model")
# A tibble: 9 x 6
# model term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 19.5 0.504 38.6 1.13e-60
# 2 1 sex 0.755 0.651 1.16 2.49e- 1
# 3 1 faculty -0.00360 0.291 -0.0124 9.90e- 1
# 4 2 (Intercept) 19.8 1.70 11.7 3.18e-20
# 5 2 income -0.000244 0.00162 -0.150 8.81e- 1
# 6 2 faculty 0.143 0.264 0.542 5.89e- 1
# 7 3 (Intercept) 18.4 1.88 9.74 4.79e-16
# 8 3 happiness 0.205 0.299 0.684 4.96e- 1
# 9 3 faculty 0.141 0.262 0.539 5.91e- 1
Using sample data
set.seed(11)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
You could use the combn function to get all combinations of n independent variables and then iterate over them. Let's say n=3 here:
library(tidyverse)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
ivs = combn(names(ds)[names(ds)!="income"], 3, simplify=FALSE)
# Or, to get all models with 1 to 4 variables:
# ivs = map(1:4, ~combn(names(ds)[names(ds)!="income"], .x, simplify=FALSE)) %>%
# flatten()
names(ivs) = map(ivs, ~paste(.x, collapse="-"))
models = map(ivs,
~lm(as.formula(paste("income ~", paste(.x, collapse="+"))), data=ds))
map_df(models, broom::tidy, .id="model")
model term estimate std.error statistic p.value
* <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 happiness-health-sex (Intercept) 1086. 201. 5.39 5.00e- 7
2 happiness-health-sex happiness -25.4 21.4 -1.19 2.38e- 1
3 happiness-health-sex health 3.58 6.99 0.512 6.10e- 1
4 happiness-health-sex sex 11.5 41.5 0.277 7.82e- 1
5 happiness-health-faculty (Intercept) 1085. 197. 5.50 3.12e- 7
6 happiness-health-faculty happiness -25.8 20.9 -1.23 2.21e- 1
7 happiness-health-faculty health 3.45 6.98 0.494 6.23e- 1
8 happiness-health-faculty faculty 7.86 18.2 0.432 6.67e- 1
9 happiness-sex-faculty (Intercept) 1153. 141. 8.21 1.04e-12
10 happiness-sex-faculty happiness -25.9 21.4 -1.21 2.28e- 1
11 happiness-sex-faculty sex 3.44 46.2 0.0744 9.41e- 1
12 happiness-sex-faculty faculty 7.40 20.2 0.366 7.15e- 1
13 health-sex-faculty (Intercept) 911. 143. 6.35 7.06e- 9
14 health-sex-faculty health 3.90 7.03 0.554 5.81e- 1
15 health-sex-faculty sex 15.6 45.6 0.343 7.32e- 1
16 health-sex-faculty faculty 7.02 20.4 0.345 7.31e- 1
I'm currently trying to run a loop performing linear regression for multiple independent variables (n = 6) with multiple dependent variables (n=1000).
Here is some example data, with age, sex, and education representing my independent variables of interest and testscore_* being my dependent variables.
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006,1007, 1008, 1009, 1010, 1011),
age = as.numeric(c('56', '43','59','74','61','62','69','80','40','55','58')),
sex = as.numeric(c('0','1','0','0','1','1','0','1','0','1','0')),
testscore_1 = as.numeric(c('23','28','30','15','7','18','29','27','14','22','24')),
testscore_2 = as.numeric(c('1','3','2','5','8','2','5','6','7','8','2')),
testscore_3 = as.numeric(c('18','20','19','15','20','23','19','25','10','14','12')),
education = as.numeric(c('5','4','3','5','2', '1','4','4','3','5','2')))
I have working code that allows me to run a regression model for multiple DVs (which I'm sure more experienced R users will dislike for its lack of efficiency):
y <- as.matrix(df[4:6])
#model for age
lm_results <- lm(y ~ age, data = df)
write.csv((broom::tidy(lm_results)), "lm_results_age.csv")
regression_results <-broom::tidy(lm_results)
standardized_coefficients <- lm.beta(lm_results)
age_standardize_results <- coef(standardized_coefficients)
write.csv(age_standardize_results, "lm_results_age_standardized_coefficients.csv")
I would then repeat this all by manually replacing age with sex and education
Does anyone have a more elegant way of running this - for example, by way of a loop for all IVs of interest (i.e. age, sex and education)?
Also would greatly appreciate anyone who would suggest a quick way of combining broom::tidy(lm_results) with standardized coefficients from lm.beta::lm.beta, i.e. combining the standardized regression coefficients with the main model output.
This is an adaptation for a similar workflow I had to use in the past. Remember to really penalize yourself for running a crazy number of models. I added a couple predictor columns to your dataframe. Good luck!!
Solution:
# Creating pedictor and outcome vectors
ivs_vec <- names(df)[c(2:6, 10)]
dvs_vec <- names(df)[7:9]
# Creating formulas and running the models
ivs <- paste0(" ~ ", ivs_vec)
dvs_ivs <- unlist(lapply(ivs, function(x) paste0(dvs_vec, x)))
formulas <- lapply(dvs_ivs, formula)
lm_results <- lapply(formulas, function(x) {
lm(x, data = df)
})
# Creating / combining results
tidy_results <- lapply(lm_results, broom::tidy)
dv_list <- lapply(as.list(stringi::stri_extract_first_words(dvs_ivs)), rep, 2)
tidy_results <- Map(cbind, dv_list, tidy_results)
standardized_results <- lapply(lm_results, function(x) coef(lm.beta::lm.beta(x)))
combined_results <- Map(cbind, tidy_results, standardized_results)
# Cleaning up final results
names(combined_results) <- dvs_ivs
combined_results <- lapply(combined_results, function(x) {row.names(x) <- c(NULL); x})
new_names <- c("Outcome", "Term", "Estimate", "Std. Error", "Statistic", "P-value", "Standardized Estimate")
combined_results <- lapply(combined_results, setNames, new_names)
Results:
combined_results[1:5]
$`testscore_1 ~ age`
Outcome Term Estimate Std. Error Statistic P-value
Standardized Estimate
1 testscore_1 (Intercept) 18.06027731 12.3493569 1.4624468 0.1776424 0.00000000
2 testscore_1 age 0.05835152 0.2031295 0.2872627 0.7804155 0.09531823
$`testscore_2 ~ age`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_2 (Intercept) 3.63788676 4.39014570 0.8286483 0.4287311 0.0000000
2 testscore_2 age 0.01367313 0.07221171 0.1893478 0.8540216 0.0629906
$`testscore_3 ~ age`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_3 (Intercept) 6.1215175 6.698083 0.9139208 0.3845886 0.0000000
2 testscore_3 age 0.1943125 0.110174 1.7636870 0.1116119 0.5068026
$`testscore_1 ~ sex`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_1 (Intercept) 22.5 3.099283 7.2597435 4.766069e-05 0.0000000
2 testscore_1 sex -2.1 4.596980 -0.4568217 6.586248e-01 -0.1505386
$`testscore_2 ~ sex`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_2 (Intercept) 3.666667 1.041129 3.521816 0.006496884 0.0000000
2 testscore_2 sex 1.733333 1.544245 1.122447 0.290723029 0.3504247
Data:
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006,1007, 1008, 1009, 1010, 1011),
age = as.numeric(c('56', '43','59','74','61','62','69','80','40','55','58')),
sex = as.numeric(c('0','1','0','0','1','1','0','1','0','1','0')),
pred1 = sample(1:11, 11),
pred2 = sample(1:11, 11),
pred3 = sample(1:11, 11),
testscore_1 = as.numeric(c('23','28','30','15','7','18','29','27','14','22','24')),
testscore_2 = as.numeric(c('1','3','2','5','8','2','5','6','7','8','2')),
testscore_3 = as.numeric(c('18','20','19','15','20','23','19','25','10','14','12')),
education = as.numeric(c('5','4','3','5','2', '1','4','4','3','5','2')))
Stumbled upon this a year later and documenting a tidyverse solution same data as #Andrew.
library(dplyr)
library(purrr)
library(tidyr)
library(stringi)
# Creating pedictor and outcome vectors
ivs_vec <- names(df)[c(2:6, 10)]
dvs_vec <- names(df)[7:9]
# Creating formulas and running the models
ivs <- paste0(" ~ ", ivs_vec)
dvs_ivs <- unlist(map(ivs, ~paste0(dvs_vec, .x)))
models <- map(setNames(dvs_ivs, dvs_ivs),
~ lm(formula = as.formula(.x),
data = df))
basics <-
map(models, ~ broom::tidy(.)) %>%
map2_df(.,
names(.),
~ mutate(.x, which_dependent = .y)) %>%
select(which_dependent, everything()) %>%
mutate(term = gsub("\\(Intercept\\)", "Intercept", term),
which_dependent = stringi::stri_extract_first_words(which_dependent))
basics$std_estimate <-
map_dfr(models, ~ coef(lm.beta::lm.beta(.)), .id = "which_dependent") %>%
pivot_longer(.,
cols = -which_dependent,
names_to = "term",
values_to = "std_estimate",
values_drop_na = TRUE) %>%
pull(std_estimate)
basics
#> # A tibble: 36 x 7
#> which_dependent term estimate std.error statistic p.value std_estimate
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 testscore_1 Intercept 18.1 12.3 1.46 0.178 0
#> 2 testscore_1 age 0.0584 0.203 0.287 0.780 0.0953
#> 3 testscore_2 Intercept 3.64 4.39 0.829 0.429 0
#> 4 testscore_2 age 0.0137 0.0722 0.189 0.854 0.0630
#> 5 testscore_3 Intercept 6.12 6.70 0.914 0.385 0
#> 6 testscore_3 age 0.194 0.110 1.76 0.112 0.507
#> 7 testscore_1 Intercept 22.5 3.10 7.26 0.0000477 0
#> 8 testscore_1 sex -2.10 4.60 -0.457 0.659 -0.151
#> 9 testscore_2 Intercept 3.67 1.04 3.52 0.00650 0
#> 10 testscore_2 sex 1.73 1.54 1.12 0.291 0.350
#> # … with 26 more rows