How to run many regressions across rows and columns with vectorization - r

I want to run a series of linear regressions for multiple groups across columns. For the group stratification across rows, I can use the idea suggested here (Fitting several regression models with dplyr). In addition to that, I also need to regress them across different columns. See below the code I achieved with the loop. I wonder whether I can do both in a vectorized manner using the map function in package purrr together with the function of group_by in dplyr package and export the estimated beta coefficients and p values accordingly.
library(dplyr)
library(broom)
head(mtcars)
vec<-names(mtcars)[3:9]
data=NULL
for (i in 1:length(vec)){
df<-mtcars%>%
group_by(cyl)%>%
do( fit = lm( paste('mpg ~disp+',vec[i]), data = .))
dfCoef = tidy(df, fit)
res<-dfCoef %>%
filter(term=='disp')
res$con=vec[i]
data=bind_rows(data,res)
}
data

Using tidyr::(un)nest to perform the regressions by groups and a helper function this could be achieved like so:
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
vec <- names(mtcars)[3:9]
lm_help <- function(vec) {
mtcars %>%
tidyr::nest(data = -cyl) %>%
mutate(con = vec,
fit = purrr::map(data, lm, formula = as.formula(paste0("mpg ~ disp + ", vec))),
tidy = purrr::map(fit, tidy)) %>%
select(cyl, con, tidy) %>%
tidyr::unnest(tidy) %>%
filter(term == "disp")
}
purrr::map(vec, lm_help) %>%
bind_rows()
#> # A tibble: 21 x 7
#> cyl con term estimate std.error statistic p.value
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 disp disp 0.00361 0.0156 0.232 0.826
#> 2 4 disp disp -0.135 0.0332 -4.07 0.00278
#> 3 8 disp disp -0.0196 0.00932 -2.11 0.0568
#> 4 6 hp disp 0.00180 0.0202 0.0890 0.933
#> 5 4 hp disp -0.120 0.0369 -3.24 0.0120
#> 6 8 hp disp -0.0186 0.00946 -1.97 0.0746
#> 7 6 drat disp 0.0224 0.0292 0.770 0.484
#> 8 4 drat disp -0.133 0.0406 -3.27 0.0114
#> 9 8 drat disp -0.0196 0.00977 -2.01 0.0697
#> 10 6 wt disp 0.0191 0.0109 1.75 0.154
#> # ... with 11 more rows

Related

How to reduce the regression output in markdown

I want to show a regression output in markdown but it contains a lot of character variables which result in a lot of independent variables. Is there any way to only show in the summary the first 5 variables? The summary function in combination with the options(max.print=80) does not provide the solution I want.
You can use tidy() function from broom package
library(broom)
library(magrittr)
lm(mpg ~ ., data = mtcars) %>% tidy() %>% head(n = 5)
#> # A tibble: 5 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 12.3 18.7 0.657 0.518
#> 2 cyl -0.111 1.05 -0.107 0.916
#> 3 disp 0.0133 0.0179 0.747 0.463
#> 4 hp -0.0215 0.0218 -0.987 0.335
#> 5 drat 0.787 1.64 0.481 0.635
Created on 2022-07-08 by the reprex package (v2.0.1)
If I understand you correctly, you could for example subset the coefficients from the variables you want like this (I use mtcars dataset as an example):
model = lm(mpg ~ ., data=mtcars)
smy = summary(model)
smy$coefficients[1:5,]
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.30337416 18.71788443 0.6573058 0.5181244
#> cyl -0.11144048 1.04502336 -0.1066392 0.9160874
#> disp 0.01333524 0.01785750 0.7467585 0.4634887
#> hp -0.02148212 0.02176858 -0.9868407 0.3349553
#> drat 0.78711097 1.63537307 0.4813036 0.6352779
Created on 2022-07-07 by the reprex package (v2.0.1)

Summarize results of mutliple regression in a data table

I want to summarize the results of mutliple regressions in a data table.
Packages used in the example :
library(data.table)
library(fixest)
library(broom)
library(tidyr)
Example data
dt <- data.table(mtcars)
First I create all the formulas that will be used.
y_vars <- c("mpg","cyl")
x_vars <- c("disp", "hp")
vars <- tidyr::crossing(y_vars, x_vars)
vars$formula <- paste(vars$y_vars, "~", vars$x_vars)
formulas <- vars$formula
Then I estimate all the models and summarize the results using tidy() :
res <- lapply(formulas ,function(i) tidy(feols(as.formula(i),data=dt)))
data.table::rbindlist(res)
Here is the resulting data table :
term estimate std.error statistic p.value
1: (Intercept) 3.18856797 0.296387718 10.758097 8.121618e-12
2: disp 0.01299804 0.001135649 11.445474 1.802838e-12
3: (Intercept) 3.00679525 0.425485225 7.066744 7.405351e-08
4: hp 0.02168354 0.002635142 8.228604 3.477861e-09
5: (Intercept) 29.59985476 1.229719515 24.070411 3.576586e-21
6: disp -0.04121512 0.004711833 -8.747152 9.380327e-10
7: (Intercept) 30.09886054 1.633920950 18.421246 6.642736e-18
8: hp -0.06822828 0.010119304 -6.742389 1.787835e-07
The problem is I cannot identify the y variable in this summary table.
Ideally, I'd like to have one more column taking the value of the y variable.
I looked in tidy() documentation but did not found how to add it.
Any idea how to do this please ?
Consider either Map from base R (which can take multiple arguments)
library(data.table)
rbindlist(Map(function(fmla, yvar) transform(tidy(feols(as.formula(fmla),
data = dt)), yvar = yvar), formulas, vars$y_vars))
term estimate std.error statistic p.value yvar
1: (Intercept) 3.18856797 0.296387718 10.758097 8.121618e-12 cyl
2: disp 0.01299804 0.001135649 11.445474 1.802838e-12 cyl
3: (Intercept) 3.00679525 0.425485225 7.066744 7.405351e-08 cyl
4: hp 0.02168354 0.002635142 8.228604 3.477861e-09 cyl
5: (Intercept) 29.59985476 1.229719515 24.070411 3.576586e-21 mpg
6: disp -0.04121512 0.004711833 -8.747152 9.380327e-10 mpg
7: (Intercept) 30.09886054 1.633920950 18.421246 6.642736e-18 mpg
8: hp -0.06822828 0.010119304 -6.742389 1.787835e-07 mpg
or use map2 from purrr
library(dplyr)
library(purrr)
library(tidyr)
vars %>%
transmute(out = map2(y_vars, formula,
~ tidy(feols(as.formula(.y), data = dt)) %>%
mutate(y_var = .x))) %>%
unnest(out)
-output
# A tibble: 8 x 6
term estimate std.error statistic p.value y_var
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) 3.19 0.296 10.8 8.12e-12 cyl
2 disp 0.0130 0.00114 11.4 1.80e-12 cyl
3 (Intercept) 3.01 0.425 7.07 7.41e- 8 cyl
4 hp 0.0217 0.00264 8.23 3.48e- 9 cyl
5 (Intercept) 29.6 1.23 24.1 3.58e-21 mpg
6 disp -0.0412 0.00471 -8.75 9.38e-10 mpg
7 (Intercept) 30.1 1.63 18.4 6.64e-18 mpg
8 hp -0.0682 0.0101 -6.74 1.79e- 7 mpg

How to extract the coefficients of a linear model and store in a variable in R?

I have a data frame and I did a linear model. I want to extract the coefficients and store each coefficient into a variable using R.
This is my data frame
df <- mtcars
fit <- lm(mpg~., data = df)
This is how I extract one coefficient
beta_0 = fit$coefficients[1]
I want to do this automatically for all coefficients in my model. I tried to use a loop but is not working. I know is not the right code but that was what I found
for (i in fit$coefficients(1:11)) {
d["s{0}".format(x)] = variable1
}
df <- mtcars
fit <- lm(mpg~., data = df)
beta_0 = fit$coefficients[1]
#base R approach
coef_base <- coef(fit)
coef_base
#> (Intercept) cyl disp hp drat wt
#> 12.30337416 -0.11144048 0.01333524 -0.02148212 0.78711097 -3.71530393
#> qsec vs am gear carb
#> 0.82104075 0.31776281 2.52022689 0.65541302 -0.19941925
#tidyverse approach with the broom package
coef_tidy <- broom::tidy(fit)
coef_tidy
#> # A tibble: 11 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 12.3 18.7 0.657 0.518
#> 2 cyl -0.111 1.05 -0.107 0.916
#> 3 disp 0.0133 0.0179 0.747 0.463
#> 4 hp -0.0215 0.0218 -0.987 0.335
#> 5 drat 0.787 1.64 0.481 0.635
#> 6 wt -3.72 1.89 -1.96 0.0633
#> 7 qsec 0.821 0.731 1.12 0.274
#> 8 vs 0.318 2.10 0.151 0.881
#> 9 am 2.52 2.06 1.23 0.234
#> 10 gear 0.655 1.49 0.439 0.665
#> 11 carb -0.199 0.829 -0.241 0.812
for (i in coef_base) {
#do work on i
print(i)
}
#> [1] 12.30337
#> [1] -0.1114405
#> [1] 0.01333524
#> [1] -0.02148212
#> [1] 0.787111
#> [1] -3.715304
#> [1] 0.8210407
#> [1] 0.3177628
#> [1] 2.520227
#> [1] 0.655413
#> [1] -0.1994193
In most cases, as.numeric(coef(fit)[i]) is sufficient to isolate the coefficients:
fit <- lm(mpg~.,mtcars)
for(i in 1:length(coef(fit))){
print(as.numeric(coef(fit)[i]))
}
#[1] 12.30337
#[1] -0.1114405
#[1] 0.01333524
#[1] -0.02148212
#[1] 0.787111
#[1] -3.715304
#[1] 0.8210407
#[1] 0.3177628
#[1] 2.520227
#[1] 0.655413
#[1] -0.1994193
If you have need to put coefficients into a data frame, this code will put each coefficient into a separate variable (variable1, variable2,..) within a dataframe (vars):
fit <- lm(mpg~.,mtcars)
ce <- coef(fit)
vars <- data.frame(col = (NA))
for(i in 1:length(ce)) {
new_col <- as.numeric(ce[i])
vars[ 1, i] <- new_col
colnames(vars)[i] <- paste0("variable", i)
}
vars
# variable1 variable2 variable3 variable4 variable5 variable6 variable7 variable8 variable9 variable10 variable11
# 1 12.30337 -0.1114405 0.01333524 -0.02148212 0.787111 -3.715304 0.8210407 0.3177628 2.520227 0.655413 -0.1994193

Extract lists of coefficients and p-values for multiple invariant independent variables in R

I try to do 1104 linear regressions with the same model. My independent variables do not change. However, my dependent variable does. Indeed, I have 1104 dependent variables. I do not know how to extract all the coefficients (intercepts included) and p-values in order to compute means of each (coefficients & p-values). How to do that with an easy way ? This is my model :
testMCFG1 <- lapply(101:1204, function(i) lm(recexp[,i]~recexp[,"rm"] + recexp[,"zdy"] + recexp[,"ztbl"] + recexp[,"ztms"] + recexp[,"zdfy"] + recexp[,"rm_zdy"] + recexp[,"rm_ztbl"] + recexp[,"rm_ztms"] + recexp[,"rm_zdfy"] + recexp[,"contexte"] + recexp[,"rm_contexte"]))
However, someone here has already showed me how to do that with only one invariant independent variable. That works. Find below the codes for this case:
y <- 'rm'
x <- names(recexp[101:1204])
models <- map(setNames(x, x),
~ lm(as.formula(paste(.x, y, sep="~")),
data=recexp))
pvalues <-
data.frame(rsquared = unlist(map(models, ~ summary(.)$r.squared)),
RSE = unlist(map(models, ~ summary(.)$sigma))) %>%
rownames_to_column(var = "which_dependent")
results <- full_join(basic_information, pvalues)
results %>% group_by(term) %>% summarise(mean_estimate = mean(estimate))
results %>% group_by(term) %>% summarise(mean_p = mean(p.value))
Here is a solution using several tidyverse packages. You don't provide your data so I'll use mtcars as an example. Put your independent variables into a fixed string called independents and we'll grab your dependents using a slice as you did with your code producing a character vector
#####
independents <- 'mpg + vs + am + gear'
dependent <- names(mtcars[2:7])
Load the libraries
library(dplyr)
library(purrr)
library(broom)
library(tidyr)
library(tibble)
Make a list of all the models using purrr::map
models <- map(setNames(dependent, dependent),
~ lm(as.formula(paste(.x, independents, sep="~")),
data=mtcars))
Take that list of lm models and feed it to broom::tidy to extract the basic information about beta estimates, and p values etc. To keep it neat use the name of the list item (which is the dependent variable) and add it as a column. Remove the parens from intercept and add a zero so it is always first and you know it's beta0
basics <-
map(models, ~ broom::tidy(.)) %>%
map2_df(.,
names(.),
~ mutate(.x, which_dependent = .y)) %>%
select(which_dependent, everything()) %>%
mutate(term = gsub("\\(Intercept\\)", "0Intercept", term))
Feed the list in again this time extract r squared and sigma a.k.a. "Residual standard error"
model_summary <-
data.frame(rsquared = unlist(map(models, ~ summary(.)$r.squared)),
RSE = unlist(map(models, ~ summary(.)$sigma))) %>%
rownames_to_column(var = "which_dependent")
Join the two based on which dependent variable
results <- full_join(basics, model_summary)
#> Joining, by = "which_dependent"
results
#> # A tibble: 30 x 8
#> which_dependent term estimate std.error statistic p.value rsquared RSE
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 cyl 0Inter… 10.4 1.14 9.13 9.58e-10 0.861 0.714
#> 2 cyl mpg -0.117 0.0382 -3.06 4.98e- 3 0.861 0.714
#> 3 cyl vs -1.80 0.374 -4.81 5.09e- 5 0.861 0.714
#> 4 cyl am -0.414 0.502 -0.826 4.16e- 1 0.861 0.714
#> 5 cyl gear -0.258 0.290 -0.891 3.81e- 1 0.861 0.714
#> 6 disp 0Inter… 571. 94.1 6.07 1.76e- 6 0.804 58.8
#> 7 disp mpg -9.50 3.14 -3.02 5.47e- 3 0.804 58.8
#> 8 disp vs -85.9 30.8 -2.79 9.49e- 3 0.804 58.8
#> 9 disp am -31.9 41.3 -0.774 4.45e- 1 0.804 58.8
#> 10 disp gear -26.8 23.9 -1.12 2.71e- 1 0.804 58.8
#> # … with 20 more rows
It's in long format so you can do things like summarise grouped by term
results %>%
group_by(term) %>%
summarise(mean_p = mean(p.value)) %>%
arrange(term)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 2
#> term mean_p
#> <chr> <dbl>
#> 1 0Intercept 0.000168
#> 2 am 0.359
#> 3 gear 0.287
#> 4 mpg 0.0538
#> 5 vs 0.159
Or you can make it wider if you prefer...
wide_results <-
results %>%
pivot_wider(names_from = term,
values_from = estimate:p.value)
wide_results
#> # A tibble: 6 x 23
#> which_dependent rsquared RSE estimate_0Inter… estimate_mpg estimate_vs
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 cyl 0.861 0.714 10.4 -0.117 -1.80
#> 2 disp 0.804 58.8 571. -9.50 -85.9
#> 3 hp 0.736 37.7 241. -8.17 -41.4
#> 4 drat 0.667 0.331 2.07 0.0228 0.166
#> 5 wt 0.804 0.464 5.90 -0.104 -0.146
#> 6 qsec 0.734 0.988 17.5 0.0894 2.29
#> # … with 17 more variables: estimate_am <dbl>, estimate_gear <dbl>,
#> # std.error_0Intercept <dbl>, std.error_mpg <dbl>, std.error_vs <dbl>,
#> # std.error_am <dbl>, std.error_gear <dbl>, statistic_0Intercept <dbl>,
#> # statistic_mpg <dbl>, statistic_vs <dbl>, statistic_am <dbl>,
#> # statistic_gear <dbl>, p.value_0Intercept <dbl>, p.value_mpg <dbl>,
#> # p.value_vs <dbl>, p.value_am <dbl>, p.value_gear <dbl>
names(wide_results)
#> [1] "which_dependent" "rsquared" "RSE"
#> [4] "estimate_0Intercept" "estimate_mpg" "estimate_vs"
#> [7] "estimate_am" "estimate_gear" "std.error_0Intercept"
#> [10] "std.error_mpg" "std.error_vs" "std.error_am"
#> [13] "std.error_gear" "statistic_0Intercept" "statistic_mpg"
#> [16] "statistic_vs" "statistic_am" "statistic_gear"
#> [19] "p.value_0Intercept" "p.value_mpg" "p.value_vs"
#> [22] "p.value_am" "p.value_gear"

Combing tidyverse + survey [R]: How to use svyglm in Nest-Map-Unnest-Chain?

I am currently struggling to run weighted regression models on multiple variables in R.
When using (non-weighted) glm, I was successful by running the following:
mtcars_1 <- mtcars %>%
nest(-gear)%>%
mutate(model_0 = map(data, ~ glm(vs ~ drat, family = "binomial", data = .)))%>%
mutate(model_0_tidy = map(model_0, tidy))%>%
select(gear, model_0_tidy)%>%
ungroup()%>%
unnest(model_0_tidy)
That is I receive the following:
# A tibble: 6 x 6
gear term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) -15.3 22.6 -0.677 0.499
2 4 drat 4.26 5.76 0.740 0.459
3 3 (Intercept) -3.91 7.39 -0.529 0.597
4 3 drat 0.801 2.32 0.345 0.730
5 5 (Intercept) 5.20 14.4 0.362 0.718
6 5 drat -1.71 3.77 -0.453 0.651
However, when I would like to weight my observations and thus use svyglm from the survey-package, nesting does not work.
This was my approach:
design_0 <- svydesign(ids=~0, data = mtcars, weights = mtaars$wt)
mtcars_2 <- mtcars%>%
nest(-gear)%>%
mutate(model_1 = map(data, ~ svyglm(vs ~ drat, family = quasibinomial(logit), design = design_0, data = .)))%>%
mutate(model_1_tidy = map(model_1, tidy))%>%
select(gear, model_1_tidy)%>%
ungroup()%>%
unnest(model_1_tidy)
# If suggested that wt serves as frequency weight
# Outcome
gear term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) -8.12 3.88 -2.09 0.0451
2 4 drat 2.12 1.07 1.99 0.0554
3 3 (Intercept) -8.12 3.88 -2.09 0.0451
4 3 drat 2.12 1.07 1.99 0.0554
5 5 (Intercept) -8.12 3.88 -2.09 0.0451
6 5 drat 2.12 1.07 1.99 0.0554
Estimates for each type of gear (that is 3,4,5) turns out to be the same.
It appears as if nesting was essentially ignored here.
Are there any solutions for combining svyglm with nest-map-unnest? Or will I have to look for other, less comfortable ways?
Thank you!
try to do it this way
mtcars%>%
nest(-gear) %>%
mutate(design = map(data, ~ svydesign(ids=~0, data = .x, weights = ~ wt)),
model = map(.x = design,
.f = ~ svyglm(vs ~ drat,
family = quasibinomial(logit),
design = .x))) %>%
mutate(model_tidy = map(model, tidy)) %>%
select(gear, model_tidy)%>%
ungroup()%>%
unnest(model_tidy)

Resources