I want to summarize the results of multiple regressions in a data table.
Packages used in the example:
library(data.table)
library(fixest)
library(broom)
library(tidyr)
Example data
dt <- data.table(mtcars)
First I create all the formulas that will be used.
y_vars <- c("mpg","cyl")
x_vars <- c("disp", "hp")
vars <- tidyr::crossing(y_vars, x_vars)
vars$formula <- paste(vars$y_vars, "~", vars$x_vars)
formulas <- vars$formula
Then I estimate all the models and summarize the results using tidy():
res <- lapply(formulas, function(i) tidy(feols(as.formula(i), data = dt)))
data.table::rbindlist(res)
Here is the resulting data table:
term estimate std.error statistic p.value
1: (Intercept) 3.18856797 0.296387718 10.758097 8.121618e-12
2: disp 0.01299804 0.001135649 11.445474 1.802838e-12
3: (Intercept) 3.00679525 0.425485225 7.066744 7.405351e-08
4: hp 0.02168354 0.002635142 8.228604 3.477861e-09
5: (Intercept) 29.59985476 1.229719515 24.070411 3.576586e-21
6: disp -0.04121512 0.004711833 -8.747152 9.380327e-10
7: (Intercept) 30.09886054 1.633920950 18.421246 6.642736e-18
8: hp -0.06822828 0.010119304 -6.742389 1.787835e-07
The problem is I cannot identify the y variable in this summary table.
Ideally, I'd like to have one more column taking the value of the y variable.
I looked in the tidy() documentation but did not find how to add it.
Any idea how to do this, please?
Consider either Map from base R (which can take multiple arguments):
library(data.table)
rbindlist(Map(function(fmla, yvar) transform(tidy(feols(as.formula(fmla),
  data = dt)), yvar = yvar), formulas, vars$y_vars))
term estimate std.error statistic p.value yvar
1: (Intercept) 3.18856797 0.296387718 10.758097 8.121618e-12 cyl
2: disp 0.01299804 0.001135649 11.445474 1.802838e-12 cyl
3: (Intercept) 3.00679525 0.425485225 7.066744 7.405351e-08 cyl
4: hp 0.02168354 0.002635142 8.228604 3.477861e-09 cyl
5: (Intercept) 29.59985476 1.229719515 24.070411 3.576586e-21 mpg
6: disp -0.04121512 0.004711833 -8.747152 9.380327e-10 mpg
7: (Intercept) 30.09886054 1.633920950 18.421246 6.642736e-18 mpg
8: hp -0.06822828 0.010119304 -6.742389 1.787835e-07 mpg
or use map2 from purrr:
library(dplyr)
library(purrr)
library(tidyr)
vars %>%
  transmute(out = map2(y_vars, formula,
                       ~ tidy(feols(as.formula(.y), data = dt)) %>%
                         mutate(y_var = .x))) %>%
  unnest(out)
Output:
# A tibble: 8 x 6
term estimate std.error statistic p.value y_var
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) 3.19 0.296 10.8 8.12e-12 cyl
2 disp 0.0130 0.00114 11.4 1.80e-12 cyl
3 (Intercept) 3.01 0.425 7.07 7.41e- 8 cyl
4 hp 0.0217 0.00264 8.23 3.48e- 9 cyl
5 (Intercept) 29.6 1.23 24.1 3.58e-21 mpg
6 disp -0.0412 0.00471 -8.75 9.38e-10 mpg
7 (Intercept) 30.1 1.63 18.4 6.64e-18 mpg
8 hp -0.0682 0.0101 -6.74 1.79e- 7 mpg
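A third option, since res in the question is already a list: name the list elements by the response variable and let data.table::rbindlist() record the names via its idcol argument. A minimal sketch, reusing the question's objects:
res <- lapply(formulas, function(i) tidy(feols(as.formula(i), data = dt)))
# Naming the list lets rbindlist() write the names into a "yvar" column
names(res) <- vars$y_vars
data.table::rbindlist(res, idcol = "yvar")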
Related
I want to show a regression output in markdown, but my data contains a lot of character variables, which result in a lot of independent variables. Is there any way to show only the first 5 variables in the summary? The summary() function in combination with options(max.print = 80) does not provide the solution I want.
You can use the tidy() function from the broom package:
library(broom)
library(magrittr)
lm(mpg ~ ., data = mtcars) %>% tidy() %>% head(n = 5)
#> # A tibble: 5 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 12.3 18.7 0.657 0.518
#> 2 cyl -0.111 1.05 -0.107 0.916
#> 3 disp 0.0133 0.0179 0.747 0.463
#> 4 hp -0.0215 0.0218 -0.987 0.335
#> 5 drat 0.787 1.64 0.481 0.635
Created on 2022-07-08 by the reprex package (v2.0.1)
If I understand you correctly, you could, for example, subset the coefficients of the variables you want like this (I use the mtcars dataset as an example):
model = lm(mpg ~ ., data=mtcars)
smy = summary(model)
smy$coefficients[1:5,]
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.30337416 18.71788443 0.6573058 0.5181244
#> cyl -0.11144048 1.04502336 -0.1066392 0.9160874
#> disp 0.01333524 0.01785750 0.7467585 0.4634887
#> hp -0.02148212 0.02176858 -0.9868407 0.3349553
#> drat 0.78711097 1.63537307 0.4813036 0.6352779
Created on 2022-07-07 by the reprex package (v2.0.1)
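If you want specific variables rather than the first five, you can also subset the coefficient matrix by row name; a small variation on the above, where the variable names are just examples:
# Select coefficients by row name instead of by position
smy$coefficients[c("cyl", "hp", "wt"), ]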
These are the regression models that I want to obtain. I want to select many variables at the same time to develop a multivariate model, since my data frame has 357 variables.
summary(lm(formula = bci_bci ~ bti_acp, data = qog))
summary(lm(formula = bci_bci ~ wdi_pop, data = qog))
summary(lm(formula = bci_bci ~ ffp_sl, data = qog))
Instead of listing all your variables using + signs, you can also use the shorthand notation . to add all variables in the data as explanatory variables (except, of course, the target variable on the left-hand side).
data("mtcars")
mod <- lm(mpg ~ ., data = mtcars)
summary(mod)
#>
#> Call:
#> lm(formula = mpg ~ ., data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.4506 -1.6044 -0.1196 1.2193 4.6271
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.30337 18.71788 0.657 0.5181
#> cyl -0.11144 1.04502 -0.107 0.9161
#> disp 0.01334 0.01786 0.747 0.4635
#> hp -0.02148 0.02177 -0.987 0.3350
#> drat 0.78711 1.63537 0.481 0.6353
#> wt -3.71530 1.89441 -1.961 0.0633 .
#> qsec 0.82104 0.73084 1.123 0.2739
#> vs 0.31776 2.10451 0.151 0.8814
#> am 2.52023 2.05665 1.225 0.2340
#> gear 0.65541 1.49326 0.439 0.6652
#> carb -0.19942 0.82875 -0.241 0.8122
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.65 on 21 degrees of freedom
#> Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
#> F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
par(mfrow=c(2,2))
plot(mod)
par(mfrow=c(1,1))
Created on 2021-12-21 by the reprex package (v2.0.1)
If you want to include all two-way interactions, the notation would be this:
lm(mpg ~ (.)^2, data = mtcars)
If you want to include all three-way interactions, the notation would be this:
lm(mpg ~ (.)^3, data = mtcars)
If you create very large models (with many variables or interactions), make sure that you also perform some model size reduction after that, e.g. using the function step(). It's very likely that not all your predictors are actually going to be informative, and many could be correlated, which causes problems in multivariate models. One way out of this could be to remove any predictors that are highly correlated to other predictors from the model.
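For instance, a backward selection starting from the full model could look like this (a minimal sketch using step() from the stats package; trace = 0 merely suppresses the step-by-step log):
mod <- lm(mpg ~ ., data = mtcars)
# Drop predictors one at a time as long as the AIC improves
mod_reduced <- step(mod, direction = "backward", trace = 0)
summary(mod_reduced)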
I have a data frame and I did a linear model. I want to extract the coefficients and store each coefficient into a variable using R.
This is my data frame
df <- mtcars
fit <- lm(mpg~., data = df)
This is how I extract one coefficient
beta_0 = fit$coefficients[1]
I want to do this automatically for all coefficients in my model. I tried to use a loop, but it is not working. I know this is not the right code, but it's what I found:
for (i in fit$coefficients(1:11)) {
d["s{0}".format(x)] = variable1
}
df <- mtcars
fit <- lm(mpg~., data = df)
beta_0 = fit$coefficients[1]
# base R approach
coef_base <- coef(fit)
coef_base
#> (Intercept) cyl disp hp drat wt
#> 12.30337416 -0.11144048 0.01333524 -0.02148212 0.78711097 -3.71530393
#> qsec vs am gear carb
#> 0.82104075 0.31776281 2.52022689 0.65541302 -0.19941925
# tidyverse approach with the broom package
coef_tidy <- broom::tidy(fit)
coef_tidy
#> # A tibble: 11 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 12.3 18.7 0.657 0.518
#> 2 cyl -0.111 1.05 -0.107 0.916
#> 3 disp 0.0133 0.0179 0.747 0.463
#> 4 hp -0.0215 0.0218 -0.987 0.335
#> 5 drat 0.787 1.64 0.481 0.635
#> 6 wt -3.72 1.89 -1.96 0.0633
#> 7 qsec 0.821 0.731 1.12 0.274
#> 8 vs 0.318 2.10 0.151 0.881
#> 9 am 2.52 2.06 1.23 0.234
#> 10 gear 0.655 1.49 0.439 0.665
#> 11 carb -0.199 0.829 -0.241 0.812
for (i in coef_base) {
#do work on i
print(i)
}
#> [1] 12.30337
#> [1] -0.1114405
#> [1] 0.01333524
#> [1] -0.02148212
#> [1] 0.787111
#> [1] -3.715304
#> [1] 0.8210407
#> [1] 0.3177628
#> [1] 2.520227
#> [1] 0.655413
#> [1] -0.1994193
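If you literally want one standalone variable per coefficient (as with beta_0 in the question), a minimal sketch using assign() could look like the following; the beta_ naming scheme is just illustrative:
ce <- coef(fit)
# Create beta_0, beta_1, ... in the current environment
for (i in seq_along(ce)) {
  assign(paste0("beta_", i - 1), unname(ce[i]))
}
beta_0
#> [1] 12.30337
That said, a named vector or list is usually easier to work with than many loose variables.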
In most cases, as.numeric(coef(fit)[i]) is sufficient to isolate the coefficients:
fit <- lm(mpg~.,mtcars)
for(i in 1:length(coef(fit))){
print(as.numeric(coef(fit)[i]))
}
#[1] 12.30337
#[1] -0.1114405
#[1] 0.01333524
#[1] -0.02148212
#[1] 0.787111
#[1] -3.715304
#[1] 0.8210407
#[1] 0.3177628
#[1] 2.520227
#[1] 0.655413
#[1] -0.1994193
If you need to put the coefficients into a data frame, this code will put each coefficient into a separate column (variable1, variable2, ...) within a data frame (vars):
fit <- lm(mpg~.,mtcars)
ce <- coef(fit)
vars <- data.frame(col = NA)
for (i in seq_along(ce)) {
  new_col <- as.numeric(ce[i])
  vars[1, i] <- new_col
  colnames(vars)[i] <- paste0("variable", i)
}
vars
# variable1 variable2 variable3 variable4 variable5 variable6 variable7 variable8 variable9 variable10 variable11
# 1 12.30337 -0.1114405 0.01333524 -0.02148212 0.787111 -3.715304 0.8210407 0.3177628 2.520227 0.655413 -0.1994193
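As a side note, a more compact way to get a one-row data frame is to transpose the coefficient vector, which keeps the coefficient names as column headers; a hedged sketch (by default as.data.frame() converts names like "(Intercept)" to syntactic ones such as X.Intercept., which optional = TRUE should prevent):
# One-row data frame keyed by the coefficient names themselves
vars2 <- as.data.frame(t(ce), optional = TRUE)
vars2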
I want to run a series of linear regressions for multiple groups across columns. For the group stratification across rows, I can use the idea suggested here (Fitting several regression models with dplyr). In addition, I also need to regress across different columns. Below is the code I wrote with a loop. I wonder whether I can do both in a vectorized manner using the map function from the purrr package together with group_by from the dplyr package, and export the estimated beta coefficients and p-values accordingly.
library(dplyr)
library(broom)
head(mtcars)
vec<-names(mtcars)[3:9]
data <- NULL
for (i in seq_along(vec)) {
  df <- mtcars %>%
    group_by(cyl) %>%
    do(fit = lm(paste('mpg ~ disp +', vec[i]), data = .))
  dfCoef <- tidy(df, fit)
  res <- dfCoef %>%
    filter(term == 'disp')
  res$con <- vec[i]
  data <- bind_rows(data, res)
}
data
Using tidyr::nest()/unnest() to perform the regressions by group, together with a helper function, this could be achieved like so:
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
vec <- names(mtcars)[3:9]
lm_help <- function(vec) {
  mtcars %>%
    tidyr::nest(data = -cyl) %>%
    mutate(con = vec,
           fit = purrr::map(data, lm, formula = as.formula(paste0("mpg ~ disp + ", vec))),
           tidy = purrr::map(fit, tidy)) %>%
    select(cyl, con, tidy) %>%
    tidyr::unnest(tidy) %>%
    filter(term == "disp")
}
purrr::map(vec, lm_help) %>%
  bind_rows()
#> # A tibble: 21 x 7
#> cyl con term estimate std.error statistic p.value
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 disp disp 0.00361 0.0156 0.232 0.826
#> 2 4 disp disp -0.135 0.0332 -4.07 0.00278
#> 3 8 disp disp -0.0196 0.00932 -2.11 0.0568
#> 4 6 hp disp 0.00180 0.0202 0.0890 0.933
#> 5 4 hp disp -0.120 0.0369 -3.24 0.0120
#> 6 8 hp disp -0.0186 0.00946 -1.97 0.0746
#> 7 6 drat disp 0.0224 0.0292 0.770 0.484
#> 8 4 drat disp -0.133 0.0406 -3.27 0.0114
#> 9 8 drat disp -0.0196 0.00977 -2.01 0.0697
#> 10 6 wt disp 0.0191 0.0109 1.75 0.154
#> # ... with 11 more rows
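As a side note, the final map() plus bind_rows() pair can be collapsed into a single call with purrr::map_dfr():
# Equivalent one-step version of map(vec, lm_help) %>% bind_rows()
purrr::map_dfr(vec, lm_help)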
I am currently struggling to run weighted regression models on multiple variables in R.
When using (non-weighted) glm, I was successful by running the following:
mtcars_1 <- mtcars %>%
  nest(-gear) %>%
  mutate(model_0 = map(data, ~ glm(vs ~ drat, family = "binomial", data = .))) %>%
  mutate(model_0_tidy = map(model_0, tidy)) %>%
  select(gear, model_0_tidy) %>%
  ungroup() %>%
  unnest(model_0_tidy)
That is, I receive the following:
# A tibble: 6 x 6
gear term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) -15.3 22.6 -0.677 0.499
2 4 drat 4.26 5.76 0.740 0.459
3 3 (Intercept) -3.91 7.39 -0.529 0.597
4 3 drat 0.801 2.32 0.345 0.730
5 5 (Intercept) 5.20 14.4 0.362 0.718
6 5 drat -1.71 3.77 -0.453 0.651
However, when I want to weight my observations and thus use svyglm from the survey package, nesting does not work.
This was my approach:
design_0 <- svydesign(ids = ~0, data = mtcars, weights = mtcars$wt)
mtcars_2 <- mtcars %>%
  nest(-gear) %>%
  mutate(model_1 = map(data, ~ svyglm(vs ~ drat, family = quasibinomial(logit), design = design_0, data = .))) %>%
  mutate(model_1_tidy = map(model_1, tidy)) %>%
  select(gear, model_1_tidy) %>%
  ungroup() %>%
  unnest(model_1_tidy)
# Assume here that wt serves as a frequency weight
# Outcome
gear term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) -8.12 3.88 -2.09 0.0451
2 4 drat 2.12 1.07 1.99 0.0554
3 3 (Intercept) -8.12 3.88 -2.09 0.0451
4 3 drat 2.12 1.07 1.99 0.0554
5 5 (Intercept) -8.12 3.88 -2.09 0.0451
6 5 drat 2.12 1.07 1.99 0.0554
The estimates for each type of gear (that is, 3, 4, and 5) turn out to be the same.
It appears as if the nesting was essentially ignored here.
Are there any solutions for combining svyglm with nest-map-unnest? Or will I have to look for other, less comfortable ways?
Thank you!
Try to do it this way. Your design_0 was built once from the full data set, so every model was fitted against the same design; instead, build the survey design inside each nested group:
mtcars %>%
  nest(-gear) %>%
  mutate(design = map(data, ~ svydesign(ids = ~0, data = .x, weights = ~ wt)),
         model = map(.x = design,
                     .f = ~ svyglm(vs ~ drat,
                                   family = quasibinomial(logit),
                                   design = .x))) %>%
  mutate(model_tidy = map(model, tidy)) %>%
  select(gear, model_tidy) %>%
  ungroup() %>%
  unnest(model_tidy)