Linear Regression Loops in R

I need the beta coefficients and residual variance for multiple stocks. My question is: how can I create a loop that runs a linear regression for each stock and extracts the aforementioned values into the output?
Here is what my data looks like: MR is my independent variable and the rest of the columns are dependent variables, on each of which I have to perform a separate linear regression.
Thank you very much!
Edit:
> dput(head(Beta_market_model_test))
structure(list(...1 = structure(c(1422748800, 1425168000, 1427846400,
1430438400, 1433116800, 1435708800), tzone = "UTC", class = c("POSIXct",
"POSIXt")), R1 = c(-0.0225553678146582, 0.084773882172773, -0.00628335525823254,
0.189767902403849, -0.129765571642446, -0.02268699227135), R2 = c(-0.000634819869861802,
0.0566396021070485, 0.0504313735522286, -0.0275926732076482,
0.0473125483284236, -0.0501700832780339), R3 = c(-0.0607564272876455,
0.0915928283206455, -0.116429377153136, 0.0338313435925748, -0.0731748018356279,
-0.082292041771696), R4 = c(0.036716647443291, 0.0409790469126645,
-0.0594941218382615, 0.0477272727272728, 0.0115690527838033,
-0.0187634024303074), R5 = c(0.00286365940192601, 0.0128875748616479,
0.000174637626924046, 0.0238214018458469, 0.0120599342185406,
-0.0627587867116033), R6 = c(-0.0944601447872712, 0.090838356632893,
-0.0577132600192821, 0.136928528648433, -0.0137770071043408,
0.0214549609033041), MR = c(-0.0388483879770769, 0.0858362570727453,
-0.0178553084990147, 0.0567646974926548, -0.0391124787432181,
-0.014626289866472)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))

We could use cbind to specify multiple dependent variables in lm:
model <- lm(cbind(R1, R2, R3, R4, R5, R6) ~ MR, data = df1)
s1 <- summary(model)
NOTE: We assume that 'R1' to 'R6' are numeric columns, i.e. any decimal commas (,) should be replaced with decimal points (.) while reading the data into R.
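For instance, a minimal sketch of reading such a file (the file name and the ";" separator are assumptions; many European CSV exports use ";" with a "," decimal mark):
# base R: read.csv2() defaults to sep = ";" and dec = ","
df1 <- read.csv2("returns.csv")
# readr: set the decimal mark explicitly through a locale
library(readr)
df1 <- read_delim("returns.csv", delim = ";",
                  locale = locale(decimal_mark = ","))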
Update
If there are many dependent columns in a sequence, extract those columns and convert them to a matrix:
dep_data <- as.matrix(Beta_market_model_test[startsWith(
  names(Beta_market_model_test), "R")])
model <- lm(dep_data ~ MR, data = Beta_market_model_test)
Checking the summary
summary(model)
Response R1 :
Call:
lm(formula = R1 ~ MR, data = Beta_market_model_test)
Residuals:
1 2 3 4 5 6
0.03757 -0.06851 0.01791 0.08624 -0.06919 -0.00402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.006368 0.028060 0.227 0.8316
MR 1.711625 0.577571 2.963 0.0414 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06831 on 4 degrees of freedom
Multiple R-squared: 0.6871, Adjusted R-squared: 0.6088
F-statistic: 8.782 on 1 and 4 DF, p-value: 0.04141
Response R2 :
Call:
lm(formula = R2 ~ MR, data = Beta_market_model_test)
Residuals:
1 2 3 4 5 6
-0.01047 0.03882 0.03925 -0.04355 0.03750 -0.06155
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01232 0.02079 0.593 0.585
MR 0.06402 0.42797 0.150 0.888
Residual standard error: 0.05062 on 4 degrees of freedom
Multiple R-squared: 0.005564, Adjusted R-squared: -0.243
F-statistic: 0.02238 on 1 and 4 DF, p-value: 0.8883
Response R3 :
Call:
lm(formula = R3 ~ MR, data = Beta_market_model_test)
Residuals:
1 2 3 4 5 6
0.035081 0.014541 -0.049701 -0.002909 0.023029 -0.020041
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04197 0.01431 -2.934 0.04266 *
MR 1.38661 0.29449 4.709 0.00925 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03483 on 4 degrees of freedom
Multiple R-squared: 0.8472, Adjusted R-squared: 0.8089
F-statistic: 22.17 on 1 and 4 DF, p-value: 0.009249
Response R4 :
Call:
lm(formula = R4 ~ MR, data = Beta_market_model_test)
Residuals:
1 2 3 4 5 6
0.0438966 0.0002996 -0.0603723 0.0182067 0.0188503 -0.0208810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.007732 0.016804 0.46 0.669
MR 0.383843 0.345886 1.11 0.329
Residual standard error: 0.04091 on 4 degrees of freedom
Multiple R-squared: 0.2354, Adjusted R-squared: 0.04425
F-statistic: 1.232 on 1 and 4 DF, p-value: 0.3293
Response R5 :
Call:
lm(formula = R5 ~ MR, data = Beta_market_model_test)
Residuals:
1 2 3 4 5 6
0.013692 -0.001676 0.006728 0.015178 0.022942 -0.056863
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002917 0.013351 -0.218 0.838
MR 0.203653 0.274801 0.741 0.500
Residual standard error: 0.0325 on 4 degrees of freedom
Multiple R-squared: 0.1207, Adjusted R-squared: -0.09909
F-statistic: 0.5492 on 1 and 4 DF, p-value: 0.4998
Response R6 :
Call:
lm(formula = R6 ~ MR, data = Beta_market_model_test)
Residuals:
1 2 3 4 5 6
-0.04498 -0.03837 -0.03832 0.04938 0.03608 0.03622
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.006197 0.020555 0.302 0.7781
MR 1.433135 0.423083 3.387 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05004 on 4 degrees of freedom
Multiple R-squared: 0.7415, Adjusted R-squared: 0.6769
F-statistic: 11.47 on 1 and 4 DF, p-value: 0.0276
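If you only need the beta coefficients themselves, they are available directly as a matrix, one column per response:
coef(model)
# a 2 x 6 matrix with rows "(Intercept)" and "MR" and columns R1, ..., R6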
We could easily get the summary output in tabular form as a data.frame with tidy from broom:
library(purrr)
library(broom)
map_dfr(summary(model), tidy, .id = 'dep_var')
# A tibble: 12 x 6
# dep_var term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Response R1 (Intercept) 0.00637 0.0281 0.227 0.832
# 2 Response R1 MR 1.71 0.578 2.96 0.0414
# 3 Response R2 (Intercept) 0.0123 0.0208 0.593 0.585
# 4 Response R2 MR 0.0640 0.428 0.150 0.888
# 5 Response R3 (Intercept) -0.0420 0.0143 -2.93 0.0427
# 6 Response R3 MR 1.39 0.294 4.71 0.00925
# 7 Response R4 (Intercept) 0.00773 0.0168 0.460 0.669
# 8 Response R4 MR 0.384 0.346 1.11 0.329
# 9 Response R5 (Intercept) -0.00292 0.0134 -0.218 0.838
#10 Response R5 MR 0.204 0.275 0.741 0.500
#11 Response R6 (Intercept) 0.00620 0.0206 0.302 0.778
#12 Response R6 MR 1.43 0.423 3.39 0.0276
Or, to get the model-level statistics, use glance:
map_dfr(summary(model), glance, .id = 'dep_var')
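Since the original question asks specifically for the betas and the residual variance, those can also be pulled straight out of each per-response summary. A minimal sketch (the column names beta and resid_var are just illustrative):
library(purrr)
library(tibble)
# summary() on a multi-response lm returns one summary.lm per response
map_dfr(summary(model), ~ tibble(
  beta      = coef(.x)["MR", "Estimate"],  # slope on MR
  resid_var = .x$sigma^2                   # residual variance = squared residual standard error
), .id = "dep_var")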

Here is another approach, reshaping the data to long format and nesting by response:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
df %>%
  select(-...1) %>%
  pivot_longer(R1:R6) %>%
  group_by(name) %>%
  nest(data = c(MR, value)) %>%
  mutate(model = map(data, ~ lm(value ~ MR, data = .)),
         glance = map(model, ~ glance(.x))) %>%
  unnest(glance) %>%
  select(-c(data, model))
# A tibble: 6 x 13
# Groups: name [6]
name r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 R1    0.687         0.609   0.0683    8.78   0.0414      1   8.80 -11.6 -12.2  0.0187
2 R2    0.00556      -0.243   0.0506    0.0224 0.888       1  10.6  -15.2 -15.8  0.0102
3 R3    0.847         0.809   0.0348   22.2    0.00925     1  12.8  -19.7 -20.3  0.00485
4 R4    0.235         0.0443  0.0409    1.23   0.329       1  11.9  -17.8 -18.4  0.00669
5 R5    0.121        -0.0991  0.0325    0.549  0.500       1  13.3  -20.5 -21.1  0.00423
6 R6    0.742         0.677   0.0500   11.5    0.0276      1  10.7  -15.3 -16.0  0.0100
# ... with 2 more variables: df.residual <int>, nobs <int>
Update
Thanks to my dear friend @akrun, who always provides me with valuable suggestions.
In case you would like to avoid pivoting the data (with really big data, pivoting can increase the number of rows to the point of exceeding memory limits), you can use the following code as well:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
df %>%
  select(-1) %>%
  summarise(across(-MR, ~ list(lm(reformulate('MR', response = cur_column()),
                                  data = df) %>%
                                 summary))) %>%
  unclass %>%
  map_dfr(~ tidy(.x[[1]]))
# A tibble: 12 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.00637 0.0281 0.227 0.832
2 MR 1.71 0.578 2.96 0.0414
3 (Intercept) 0.0123 0.0208 0.593 0.585
4 MR 0.0640 0.428 0.150 0.888
5 (Intercept) -0.0420 0.0143 -2.93 0.0427
6 MR 1.39 0.294 4.71 0.00925
7 (Intercept) 0.00773 0.0168 0.460 0.669
8 MR 0.384 0.346 1.11 0.329
9 (Intercept) -0.00292 0.0134 -0.218 0.838
10 MR 0.204 0.275 0.741 0.500
11 (Intercept) 0.00620 0.0206 0.302 0.778
12 MR 1.43 0.423 3.39 0.0276
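If you also want a column identifying each response, map_dfr() takes an .id argument; replacing the final step with
map_dfr(~ tidy(.x[[1]]), .id = 'dep_var')
prepends a dep_var column (R1, ..., R6).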


Edit string text in dataframe variable

I want to tidy up a dataframe and automate the process. Given the following data.frame:
library(survival)
library(rms)
library(broom)
library(tidyverse)
res.cox <- coxph(Surv(time, status) ~ rcs(age, 3) + sex + ph.ecog +
rcs(meal.cal, 4), data = lung)
output <- tidy(res.cox)
output
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 rcs(age, 3)age -0.00306 0.0219 -0.140 0.889
# 2 rcs(age, 3)age' 0.0154 0.0261 0.592 0.554
# 3 sex -0.525 0.192 -2.74 0.00620
# 4 ph.ecog 0.421 0.131 3.22 0.00128
# 5 rcs(meal.cal, 4)meal.cal -0.000416 0.00104 -0.400 0.689
# 6 rcs(meal.cal, 4)meal.cal' 0.00118 0.00232 0.509 0.611
# 7 rcs(meal.cal, 4)meal.cal'' -0.00659 0.0114 -0.577 0.564
I want to remove the rcs-spline information from term variable and be left with:
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 s1 age -0.00306 0.0219 -0.140 0.889
# 2 s2 age 0.0154 0.0261 0.592 0.554
# 3 sex -0.525 0.192 -2.74 0.00620
# 4 ph.ecog 0.421 0.131 3.22 0.00128
# 5 s1 meal.cal -0.000416 0.00104 -0.400 0.689
# 6 s2 meal.cal 0.00118 0.00232 0.509 0.611
# 7 s3 meal.cal -0.00659 0.0114 -0.577 0.564
I want the solution to easily work for other cases too so when you increase the number of knots:
res.cox2 <- coxph(Surv(time, status) ~ rcs(age, 4) + rcs(meal.cal, 6) +
sex + ph.ecog, data = lung)
output2 <- tidy(res.cox2)
output2
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 rcs(age, 4)age 0.0419 0.0403 1.04 0.298
# 2 rcs(age, 4)age' -0.101 0.0806 -1.26 0.208
# 3 rcs(age, 4)age'' 0.569 0.388 1.47 0.142
# 4 rcs(meal.cal, 6)meal.cal -0.000974 0.00155 -0.631 0.528
# 5 rcs(meal.cal, 6)meal.cal' 0.00751 0.0115 0.655 0.512
# 6 rcs(meal.cal, 6)meal.cal'' -0.0217 0.0358 -0.607 0.544
# 7 rcs(meal.cal, 6)meal.cal''' 0.0614 0.123 0.501 0.616
# 8 rcs(meal.cal, 6)meal.cal'''' -0.0775 0.163 -0.475 0.634
# 9 sex -0.552 0.195 -2.83 0.00465
# 10 ph.ecog 0.440 0.132 3.34 0.000835
you would be left with:
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 s1 age 0.0419 0.0403 1.04 0.298
# 2 s2 age -0.101 0.0806 -1.26 0.208
# 3 s3 age 0.569 0.388 1.47 0.142
# 4 s1 meal.cal -0.000974 0.00155 -0.631 0.528
# 5 s2 meal.cal 0.00751 0.0115 0.655 0.512
# 6 s3 meal.cal -0.0217 0.0358 -0.607 0.544
# 7 s4 meal.cal 0.0614 0.123 0.501 0.616
# 8 s5 meal.cal -0.0775 0.163 -0.475 0.634
# 9 sex -0.552 0.195 -2.83 0.00465
# 10 ph.ecog 0.440 0.132 3.34 0.000835
etc...
My attempt so far gets me some of the way, but I am not sure of the best way to deal with the trailing ', '', etc. (note that the first spline term does not contain a '):
output %>%
mutate(rcs_indicator = str_detect(term, fixed("rcs(")),
term = str_replace_all(term, "rcs\\(.+?\\)", ""))
# term estimate std.error statistic p.value rcs_indicator
# <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
# 1 age -0.00306 0.0219 -0.140 0.889 TRUE
# 2 age' 0.0154 0.0261 0.592 0.554 TRUE
# 3 sex -0.525 0.192 -2.74 0.00620 FALSE
# 4 ph.ecog 0.421 0.131 3.22 0.00128 FALSE
# 5 meal.cal -0.000416 0.00104 -0.400 0.689 TRUE
# 6 meal.cal' 0.00118 0.00232 0.509 0.611 TRUE
# 7 meal.cal'' -0.00659 0.0114 -0.577 0.564 TRUE
It might be useful to just work with the terms I need to change directly:
unique(str_subset(output$term, fixed("rcs(")) %>%
str_replace_all("'", ""))
# [1] "rcs(age, 3)age" "rcs(meal.cal, 4)meal.cal"
I feel there is a simpler way to do this than the steps I am taking.
Any suggestions?
Thanks
This one is clunky but it should work:
library(dplyr)
library(stringr)
output %>%
  group_by(group = str_extract(term, 'rcs\\(.')) %>%
  mutate(row = row_number()) %>%
  mutate(term = str_replace_all(term, 'rcs\\(', paste0("s", row, " "))) %>%
  mutate(term = ifelse(str_detect(term, 's\\d'),
                       str_extract(term, '.\\d\\s.*\\s'), term)) %>%
  mutate(term = str_trim(term)) %>%
  mutate(term = str_replace_all(term, '\\,', '')) %>%
  ungroup() %>%
  select(-c(group, row))
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 s1 age -0.00306 0.0219 -0.140 0.889
2 s2 age 0.0154 0.0261 0.592 0.554
3 sex -0.525 0.192 -2.74 0.00620
4 ph.ecog 0.421 0.131 3.22 0.00128
5 s1 meal.cal -0.000416 0.00104 -0.400 0.689
6 s2 meal.cal 0.00118 0.00232 0.509 0.611
7 s3 meal.cal -0.00659 0.0114 -0.577 0.564
This is also less elegant than desired, but it should work for multiple knots (again using dplyr and stringr):
output %>%
  mutate(is_spline = grepl("^rcs\\(.*?, \\d\\)", term),
         n_term = str_count(term, "'") + 1,
         pre = ifelse(is_spline, paste0('s', n_term, ' '), ""),
         term = paste0(pre, gsub("(^rcs\\(.*?, \\d\\))|(\\'+$)", "", term))) %>%
  select(-is_spline, -n_term, -pre)
#> # A tibble: 7 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 s1 age -0.00306 0.0219 -0.140 0.889
#> 2 s2 age 0.0154 0.0261 0.592 0.554
#> 3 sex -0.525 0.192 -2.74 0.00620
#> 4 ph.ecog 0.421 0.131 3.22 0.00128
#> 5 s1 meal.cal -0.000416 0.00104 -0.400 0.689
#> 6 s2 meal.cal 0.00118 0.00232 0.509 0.611
#> 7 s3 meal.cal -0.00659 0.0114 -0.577 0.564

Model fit of SEM in Lavaan

What is the reason for CFI = 0 in a SEM model in lavaan? Statistic values are attached
Well, first let's check how the CFI estimator works:
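The usual definition (consistent with the hand calculation below) compares the user model's chi-square statistic with the baseline model's:
CFI = 1 - (χ²_user - df_user) / (χ²_baseline - df_baseline)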
Usually, SEM programs do not present CFI values below 0; if a negative value is obtained, the software shows 0 instead.
An example:
library(lavaan)
#> This is lavaan 0.6-8
#> lavaan is FREE software! Please report any bugs.
HS.model <- ' visual =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE)
#> lavaan 0.6-8 ended normally after 35 iterations
#>
#> Estimator ML
#> Optimization method NLMINB
#> Number of model parameters 21
#>
#> Number of observations 301
#>
#> Model Test User Model:
#>
#> Test statistic 85.306
#> Degrees of freedom 24
#> P-value (Chi-square) 0.000
#>
#> Model Test Baseline Model:
#>
#> Test statistic 918.852
#> Degrees of freedom 36
#> P-value 0.000
#>
#> User Model versus Baseline Model:
#>
#> Comparative Fit Index (CFI) 0.931
#> Tucker-Lewis Index (TLI) 0.896
#>
#> Loglikelihood and Information Criteria:
#>
#> Loglikelihood user model (H0) -3737.745
#> Loglikelihood unrestricted model (H1) -3695.092
#>
#> Akaike (AIC) 7517.490
#> Bayesian (BIC) 7595.339
#> Sample-size adjusted Bayesian (BIC) 7528.739
#>
#> Root Mean Square Error of Approximation:
#>
#> RMSEA 0.092
#> 90 Percent confidence interval - lower 0.071
#> 90 Percent confidence interval - upper 0.114
#> P-value RMSEA <= 0.05 0.001
#>
#> Standardized Root Mean Square Residual:
#>
#> SRMR 0.065
#>
#> Parameter Estimates:
#>
#> Standard errors Standard
#> Information Expected
#> Information saturated (h1) model Structured
#>
#> Latent Variables:
#> Estimate Std.Err z-value P(>|z|)
#> visual =~
#> x1 1.000
#> x2 0.554 0.100 5.554 0.000
#> x3 0.729 0.109 6.685 0.000
#> textual =~
#> x4 1.000
#> x5 1.113 0.065 17.014 0.000
#> x6 0.926 0.055 16.703 0.000
#> speed =~
#> x7 1.000
#> x8 1.180 0.165 7.152 0.000
#> x9 1.082 0.151 7.155 0.000
#>
#> Covariances:
#> Estimate Std.Err z-value P(>|z|)
#> visual ~~
#> textual 0.408 0.074 5.552 0.000
#> speed 0.262 0.056 4.660 0.000
#> textual ~~
#> speed 0.173 0.049 3.518 0.000
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|)
#> .x1 0.549 0.114 4.833 0.000
#> .x2 1.134 0.102 11.146 0.000
#> .x3 0.844 0.091 9.317 0.000
#> .x4 0.371 0.048 7.779 0.000
#> .x5 0.446 0.058 7.642 0.000
#> .x6 0.356 0.043 8.277 0.000
#> .x7 0.799 0.081 9.823 0.000
#> .x8 0.488 0.074 6.573 0.000
#> .x9 0.566 0.071 8.003 0.000
#> visual 0.809 0.145 5.564 0.000
#> textual 0.979 0.112 8.737 0.000
#> speed 0.384 0.086 4.451 0.000
As you can see, this example model's χ² is 85.306 with 24 degrees of freedom, and the baseline model's is 918.852 with 36 degrees of freedom.
With that we can easily calculate CFI by hand:
1-((85.306-24)/(918.852-36))
#> [1] 0.9305591
Which you can compare with the CFI reported by the summary() function (i.e., 0.931).
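If you want just this value rather than the full summary, lavaan also provides the fitMeasures() accessor:
fitMeasures(fit, "cfi")
#>   cfi
#> 0.931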
The statistics you reported allow us to check that your CFI would indeed be negative if the software did not floor it at 0:
1-((5552.006-94)/(3181.455-21))
#> [1] -0.7269684
Created on 2021-03-27 by the reprex package (v1.0.0)

Performing a linear model in R of a single response with a single predictor from a large dataframe and repeat for each column

It might not be very clear from the title but what I wish to do is:
I have a dataframe df with, say, 200 columns, where the first 80 columns are response variables (y1, y2, y3, ...) and the remaining 120 are predictors (x1, x2, x3, ...).
I wish to compute a linear model for each pair – lm(yi ~ xi, data = df).
Many problems and solutions I have looked through online have either a fixed response vs many predictors or the other way around, using lapply() and its related functions.
Could anyone who is familiar with this point me in the right direction?
Use the tidyverse:
library(tidyverse)
library(broom)
df <- mtcars
y <- names(df)[1:3]
x <- names(df)[4:7]
result <- expand_grid(x, y) %>%
  rowwise() %>%
  mutate(frm = list(reformulate(x, y)),
         model = list(lm(frm, data = df)))
result$model <- purrr::set_names(result$model, nm = paste0(result$y, " ~ ", result$x))
result$model[1:2]
#> $`mpg ~ hp`
#>
#> Call:
#> lm(formula = frm, data = df)
#>
#> Coefficients:
#> (Intercept) hp
#> 30.09886 -0.06823
#>
#>
#> $`cyl ~ hp`
#>
#> Call:
#> lm(formula = frm, data = df)
#>
#> Coefficients:
#> (Intercept) hp
#> 3.00680 0.02168
map_df(result$model, tidy)
#> # A tibble: 24 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.1 1.63 18.4 6.64e-18
#> 2 hp -0.0682 0.0101 -6.74 1.79e- 7
#> 3 (Intercept) 3.01 0.425 7.07 7.41e- 8
#> 4 hp 0.0217 0.00264 8.23 3.48e- 9
#> 5 (Intercept) 21.0 32.6 0.644 5.25e- 1
#> 6 hp 1.43 0.202 7.08 7.14e- 8
#> 7 (Intercept) -7.52 5.48 -1.37 1.80e- 1
#> 8 drat 7.68 1.51 5.10 1.78e- 5
#> 9 (Intercept) 14.6 1.58 9.22 2.93e-10
#> 10 drat -2.34 0.436 -5.37 8.24e- 6
#> # ... with 14 more rows
map_df(result$model, glance)
#> # A tibble: 12 x 12
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.602 0.589 3.86 45.5 1.79e- 7 1 -87.6 181. 186.
#> 2 0.693 0.683 1.01 67.7 3.48e- 9 1 -44.6 95.1 99.5
#> 3 0.626 0.613 77.1 50.1 7.14e- 8 1 -183. 373. 377.
#> 4 0.464 0.446 4.49 26.0 1.78e- 5 1 -92.4 191. 195.
#> 5 0.490 0.473 1.30 28.8 8.24e- 6 1 -52.7 111. 116.
#> 6 0.504 0.488 88.7 30.5 5.28e- 6 1 -188. 382. 386.
#> 7 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170.
#> 8 0.612 0.599 1.13 47.4 1.22e- 7 1 -48.3 103. 107.
#> 9 0.789 0.781 57.9 112. 1.22e-11 1 -174. 355. 359.
#> 10 0.175 0.148 5.56 6.38 1.71e- 2 1 -99.3 205. 209.
#> 11 0.350 0.328 1.46 16.1 3.66e- 4 1 -56.6 119. 124.
#> 12 0.188 0.161 114. 6.95 1.31e- 2 1 -196. 398. 402.
#> # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Created on 2020-12-11 by the reprex package (v0.3.0)
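For reference, here is a minimal base-R sketch of the same all-pairs idea (assuming the df, x and y defined above; the pairs and models names are just illustrative):
# every (response, predictor) combination
pairs <- expand.grid(y = y, x = x, stringsAsFactors = FALSE)
# one lm per pair, with the formula built by reformulate()
models <- Map(function(yv, xv) lm(reformulate(xv, yv), data = df),
              pairs$y, pairs$x)
names(models) <- paste(pairs$y, "~", pairs$x)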

How to extract confidence intervals from multiple regression models?

I am extracting the regression results for two different groups as shown in the example below. In the temp data.frame I get the estimate, std.error, statistic and p-value. However, I don't get the confidence intervals. Is there a simple way to extract them as well?
library(dplyr)
library(tidyr)
library(broom)
df <- tibble(
  a = rnorm(1000),
  b = rnorm(1000),
  c = rnorm(1000),
  d = rnorm(1000),
  group = rbinom(n = 1000, size = 1, prob = 0.5)
)
df$group <- as.factor(df$group)
temp <- df %>%
  group_by(group) %>%
  do(model1 = tidy(lm(a ~ b + c + d, data = .))) %>%
  gather(model_name, model, -group) %>%
  unnest()
You are running tidy on an lm object. If you check the help page, there is an option to include the confidence interval: conf.int = TRUE.
temp <- df %>%
  group_by(group) %>%
  do(model1 = tidy(lm(a ~ b + c + d, data = .), conf.int = TRUE)) %>%
  gather(model_name, model, -group) %>%
  unnest()
# A tibble: 8 x 9
group model_name term estimate std.error statistic p.value conf.low conf.high
<fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 model1 (Int… 0.0616 0.0423 1.46 0.146 -0.0215 0.145
2 0 model1 b 0.00178 0.0421 0.0424 0.966 -0.0808 0.0844
3 0 model1 c -0.00339 0.0431 -0.0787 0.937 -0.0881 0.0813
4 0 model1 d -0.0537 0.0445 -1.21 0.228 -0.141 0.0337
5 1 model1 (Int… -0.0185 0.0454 -0.408 0.683 -0.108 0.0707
6 1 model1 b 0.00128 0.0435 0.0295 0.976 -0.0842 0.0868
7 1 model1 c -0.0972 0.0430 -2.26 0.0244 -0.182 -0.0126
8 1 model1 d 0.0734 0.0457 1.60 0.109 -0.0165 0.163
If your version of dplyr is 1.0.0 or later, you can use:
df %>%
  group_by(group) %>%
  summarise(tidy(lm(a ~ b + c + d), conf.int = TRUE), .groups = "drop")
#> # A tibble: 8 x 8
#> group term estimate std.error statistic p.value conf.low conf.high
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 (Intercept) 0.0734 0.0468 1.57 0.117 -0.0185 0.165
#> 2 0 b -0.101 0.0461 -2.19 0.0292 -0.191 -0.0102
#> 3 0 c 0.0337 0.0464 0.726 0.468 -0.0575 0.125
#> 4 0 d -0.101 0.0454 -2.23 0.0265 -0.190 -0.0118
#> 5 1 (Intercept) -0.0559 0.0468 -1.20 0.232 -0.148 0.0360
#> 6 1 b -0.0701 0.0474 -1.48 0.140 -0.163 0.0230
#> 7 1 c 0.0319 0.0477 0.668 0.504 -0.0619 0.126
#> 8 1 d -0.0728 0.0466 -1.56 0.119 -0.164 0.0188
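By default these are 95% intervals; tidy() also accepts a conf.level argument if you need a different width, e.g.:
tidy(lm(a ~ b + c + d, data = df), conf.int = TRUE, conf.level = 0.90)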

Running regressions and extract model estimates to a dataframe in R

I have 3 exposure variables x1-x3, 10 outcome variables y1-y10 and 3 covariates cv1-cv3.
I would like to regress each outcome on each exposure, adjusted for all covariates. Then I would like the model estimates, i.e. beta, SE and p-value, placed in a dataframe. Is there a way to automate this in R? Thank you!
The models I want to run look like this:
y1 ~ x1+cv1+cv2+cv3 ... y10 ~ x1+cv1+cv2+cv3
y1 ~ x2+cv1+cv2+cv3 ... y10 ~ x2+cv1+cv2+cv3
y1 ~ x3+cv1+cv2+cv3 ... y10 ~ x3+cv1+cv2+cv3
Without data and a reproducible example, it is hard to help you, but here's an example with simulated data. First, create a fake dataset, called data:
library(tidyverse)
make_df <- function(y_i) {
  tibble(y_var = y_i, y_i = rnorm(100),
         x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
         cv1 = runif(100), cv2 = runif(100), cv3 = runif(100))
}
ys <- paste0("Y_", sprintf("%02d", 1:10))
ys
#> [1] "Y_01" "Y_02" "Y_03" "Y_04" "Y_05" "Y_06" "Y_07" "Y_08" "Y_09" "Y_10"
data <- ys %>%
  map_dfr(make_df)
data
#> # A tibble: 1,000 x 8
#> y_var y_i x1 x2 x3 cv1 cv2 cv3
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Y_01 0.504 0.892 -0.806 -1.56 0.145 0.436 0.701
#> 2 Y_01 0.967 1.24 -1.19 0.920 0.866 0.00100 0.567
#> 3 Y_01 -0.824 -0.729 -0.0855 -1.06 0.0665 0.780 0.471
#> 4 Y_01 0.294 2.37 -0.514 -0.955 0.397 0.0462 0.209
#> 5 Y_01 -0.893 0.0298 0.0369 0.0787 0.640 0.709 0.0485
#> 6 Y_01 0.670 -0.347 1.56 2.11 0.843 0.542 0.793
#> 7 Y_01 -1.59 1.04 0.228 0.573 0.185 0.151 0.558
#> 8 Y_01 -2.04 0.289 -0.435 -0.113 0.833 0.0898 0.653
#> 9 Y_01 -0.637 0.818 -0.454 0.606 0.294 0.378 0.315
#> 10 Y_01 -1.61 -0.628 -2.75 1.06 0.353 0.0863 0.332
#> # ... with 990 more rows
At this point, you have options. One way is to use the group_by %>% do(tidy(*)) recipe:
data %>%
  gather(x_var, x_value, -c(y_var, y_i, cv1:cv3)) %>%
  group_by(y_var, x_var) %>%
  do(broom::tidy(lm(y_i ~ x_value + cv1 + cv2 + cv3, data = .)))
#> # A tibble: 150 x 7
#> # Groups: y_var, x_var [30]
#> y_var x_var term estimate std.error statistic p.value
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Y_01 x1 (Intercept) -0.111 0.344 -0.324 0.747
#> 2 Y_01 x1 x_value -0.0440 0.111 -0.396 0.693
#> 3 Y_01 x1 cv1 0.286 0.372 0.769 0.444
#> 4 Y_01 x1 cv2 0.0605 0.379 0.160 0.873
#> 5 Y_01 x1 cv3 -0.0690 0.378 -0.182 0.856
#> 6 Y_01 x2 (Intercept) -0.146 0.336 -0.434 0.665
#> 7 Y_01 x2 x_value 0.117 0.105 1.12 0.265
#> 8 Y_01 x2 cv1 0.287 0.362 0.793 0.430
#> 9 Y_01 x2 cv2 0.0564 0.376 0.150 0.881
#> 10 Y_01 x2 cv3 0.0125 0.379 0.0330 0.974
#> # ... with 140 more rows
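Since usually only the exposure estimates (not the intercepts and covariates) are of interest, the same pipeline with a final filter keeps just the beta, SE and p-value per exposure-outcome pair:
data %>%
  gather(x_var, x_value, -c(y_var, y_i, cv1:cv3)) %>%
  group_by(y_var, x_var) %>%
  do(broom::tidy(lm(y_i ~ x_value + cv1 + cv2 + cv3, data = .))) %>%
  filter(term == "x_value")  # keep only the exposure rows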
Another approach is to use a split variable and then a map function from purrr:
data %>%
  gather(x_var, x_value, -c(y_var, y_i, cv1:cv3)) %>%
  mutate(y_var_x_var = paste0(y_var, x_var)) %>%
  split(.$y_var_x_var) %>%
  map(~ lm(y_i ~ x_value + cv1 + cv2 + cv3, data = .))
#> $Y_01x1
#>
#> Call:
#> lm(formula = y_i ~ x_value + cv1 + cv2 + cv3, data = .)
#>
#> Coefficients:
#> (Intercept) x_value cv1 cv2 cv3
#> -0.11144 -0.04396 0.28585 0.06051 -0.06896
#>
#>
#> $Y_01x2
#>
#> Call:
#> lm(formula = y_i ~ x_value + cv1 + cv2 + cv3, data = .)
#>
#> Coefficients:
#> (Intercept) x_value cv1 cv2 cv3
#> -0.14562 0.11732 0.28726 0.05642 0.01249
#>
#>
# ...and so on...
#>
#>
#> $Y_10x2
#>
#> Call:
#> lm(formula = y_i ~ x_value + cv1 + cv2 + cv3, data = .)
#>
#> Coefficients:
#> (Intercept) x_value cv1 cv2 cv3
#> -0.45689 -0.02530 0.61375 0.34377 -0.02357
#>
#>
#> $Y_10x3
#>
#> Call:
#> lm(formula = y_i ~ x_value + cv1 + cv2 + cv3, data = .)
#>
#> Coefficients:
#> (Intercept) x_value cv1 cv2 cv3
#> -0.44423 -0.18377 0.64739 0.27688 -0.02013
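To land these estimates in a single data frame, as the question asks, the same broom step applies to the split version; a sketch (the fits name is just illustrative):
fits <- data %>%
  gather(x_var, x_value, -c(y_var, y_i, cv1:cv3)) %>%
  mutate(y_var_x_var = paste0(y_var, x_var)) %>%
  split(.$y_var_x_var) %>%
  map(~ lm(y_i ~ x_value + cv1 + cv2 + cv3, data = .))
map_dfr(fits, broom::tidy, .id = "model")  # one row per term, labelled by model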
