I would like to create a function where the dependent variable (y) regressed with individual independent variables (x1, x2, etc.) but not in the form of multiple regression. And I would like to include another function in the same formula is to calculate AIC value. So, both of these functions in the same formula. Can somebody have any idea how to do it? I have a huge dataset and I need to find a regression for an individual dependent variable with multiple independent variables. I would really appreciate it if somebody guides me here.
The following code will give you the results of the dependent variable (y) regressed with individual independent variables
data(mtcars)
x = names(mtcars[,-1])
out <- unlist(lapply(1, function(n) combn(x, 1, FUN=function(row) paste0("mpg ~ ", paste0(row, collapse = "+")))))
out
#> [1] "mpg ~ cyl" "mpg ~ disp" "mpg ~ hp" "mpg ~ drat" "mpg ~ wt"
#> [6] "mpg ~ qsec" "mpg ~ vs" "mpg ~ am" "mpg ~ gear" "mpg ~ carb"
library(broom)
#> Warning: package 'broom' was built under R version 3.5.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#To have the regression coefficients
tmp1 = bind_rows(lapply(out, function(frml) {
a = tidy(lm(frml, data=mtcars))
a$frml = frml
return(a)
}))
head(tmp1)
#> # A tibble: 6 x 6
#> term estimate std.error statistic p.value frml
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 (Intercept) 37.9 2.07 18.3 8.37e-18 mpg ~ cyl
#> 2 cyl -2.88 0.322 -8.92 6.11e-10 mpg ~ cyl
#> 3 (Intercept) 29.6 1.23 24.1 3.58e-21 mpg ~ disp
#> 4 disp -0.0412 0.00471 -8.75 9.38e-10 mpg ~ disp
#> 5 (Intercept) 30.1 1.63 18.4 6.64e-18 mpg ~ hp
#> 6 hp -0.0682 0.0101 -6.74 1.79e- 7 mpg ~ hp
#To have the regression results i.e. R2, AIC, BIC
tmp2 = bind_rows(lapply(out, function(frml) {
a = glance(lm(frml, data=mtcars))
a$frml = frml
return(a)
}))
head(tmp2)
#> # A tibble: 6 x 12
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.726 0.717 3.21 79.6 6.11e-10 2 -81.7 169. 174.
#> 2 0.718 0.709 3.25 76.5 9.38e-10 2 -82.1 170. 175.
#> 3 0.602 0.589 3.86 45.5 1.79e- 7 2 -87.6 181. 186.
#> 4 0.464 0.446 4.49 26.0 1.78e- 5 2 -92.4 191. 195.
#> 5 0.753 0.745 3.05 91.4 1.29e-10 2 -80.0 166. 170.
#> 6 0.175 0.148 5.56 6.38 1.71e- 2 2 -99.3 205. 209.
#> # ... with 3 more variables: deviance <dbl>, df.residual <int>, frml <chr>
write.csv(tmp1, "Try_lm_coefficients.csv")
write.csv(tmp2, "Try_lm_results.csv")
Created on 2019-11-11 by the reprex package (v0.3.0)
The results can be found in "Try_lm_coefficients.csv" and "Try_lm_results.csv" files.
Related
I want to show a regression output in markdown but it contains a lot of character variables which result in a lot of independent variables. Is there any way to only show in the summary the first 5 variables? The summary function in combination with the options(max.print=80) does not provide the solution I want.
You can use tidy() function from broom package
library(broom)
library(magrittr)
lm(mpg ~ ., data = mtcars) %>% tidy() %>% head(n = 5)
#> # A tibble: 5 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 12.3 18.7 0.657 0.518
#> 2 cyl -0.111 1.05 -0.107 0.916
#> 3 disp 0.0133 0.0179 0.747 0.463
#> 4 hp -0.0215 0.0218 -0.987 0.335
#> 5 drat 0.787 1.64 0.481 0.635
Created on 2022-07-08 by the reprex package (v2.0.1)
If I understand you correctly, you could for example subset the coefficients from the variables you want like this (I use mtcars dataset as an example):
model = lm(mpg ~ ., data=mtcars)
smy = summary(model)
smy$coefficients[1:5,]
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.30337416 18.71788443 0.6573058 0.5181244
#> cyl -0.11144048 1.04502336 -0.1066392 0.9160874
#> disp 0.01333524 0.01785750 0.7467585 0.4634887
#> hp -0.02148212 0.02176858 -0.9868407 0.3349553
#> drat 0.78711097 1.63537307 0.4813036 0.6352779
Created on 2022-07-07 by the reprex package (v2.0.1)
After doing ps matching, I'm running a poisson model like so:
model <- glm(outcome ~ x1 + x2 + x3 ... ,
data = d,
weights = psweights$weights,
family = "poisson")
And then want to create a new data frame with the variable names, coefficients, and upper and lower confidence intervals. Just doing:
d2 <- summary(model)$coef
gets me the variable names, coefficients, standard errors, and z values. What is the easiest way to compute confidence intervals, convert them into columns and bind it all into one data frame?
How about this, using the broom package:
library(broom)
mod <- glm(hp ~ disp + drat + cyl, data=mtcars, family=poisson)
tidy(mod, conf.int=TRUE)
#> # A tibble: 4 × 7
#> term estimate std.error statistic p.value conf.low conf.high
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 2.40 0.196 12.3 1.30e-34 2.02 2.79
#> 2 disp 0.000766 0.000259 2.96 3.07e- 3 0.000258 0.00127
#> 3 drat 0.240 0.0386 6.22 4.89e-10 0.164 0.315
#> 4 cyl 0.236 0.0195 12.1 1.21e-33 0.198 0.274
Created on 2022-06-30 by the reprex package (v2.0.1)
I have a data frame and I did a linear model. I want to extract the coefficients and store each coefficient into a variable using R.
This is my data frame
df <- mtcars
fit <- lm(mpg~., data = df)
This is how I extract one coefficient
beta_0 = fit$coefficients[1]
I want to do this automatically for all coefficients in my model. I tried to use a loop but is not working. I know is not the right code but that was what I found
for (i in fit$coefficients(1:11)) {
d["s{0}".format(x)] = variable1
}
df <- mtcars
fit <- lm(mpg~., data = df)
beta_0 = fit$coefficients[1]
#base R approach
coef_base <- coef(fit)
coef_base
#> (Intercept) cyl disp hp drat wt
#> 12.30337416 -0.11144048 0.01333524 -0.02148212 0.78711097 -3.71530393
#> qsec vs am gear carb
#> 0.82104075 0.31776281 2.52022689 0.65541302 -0.19941925
#tidyverse approach with the broom package
coef_tidy <- broom::tidy(fit)
coef_tidy
#> # A tibble: 11 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 12.3 18.7 0.657 0.518
#> 2 cyl -0.111 1.05 -0.107 0.916
#> 3 disp 0.0133 0.0179 0.747 0.463
#> 4 hp -0.0215 0.0218 -0.987 0.335
#> 5 drat 0.787 1.64 0.481 0.635
#> 6 wt -3.72 1.89 -1.96 0.0633
#> 7 qsec 0.821 0.731 1.12 0.274
#> 8 vs 0.318 2.10 0.151 0.881
#> 9 am 2.52 2.06 1.23 0.234
#> 10 gear 0.655 1.49 0.439 0.665
#> 11 carb -0.199 0.829 -0.241 0.812
for (i in coef_base) {
#do work on i
print(i)
}
#> [1] 12.30337
#> [1] -0.1114405
#> [1] 0.01333524
#> [1] -0.02148212
#> [1] 0.787111
#> [1] -3.715304
#> [1] 0.8210407
#> [1] 0.3177628
#> [1] 2.520227
#> [1] 0.655413
#> [1] -0.1994193
In most cases, as.numeric(coef(fit)[i]) is sufficient to isolate the coefficients:
fit <- lm(mpg~.,mtcars)
for(i in 1:length(coef(fit))){
print(as.numeric(coef(fit)[i]))
}
#[1] 12.30337
#[1] -0.1114405
#[1] 0.01333524
#[1] -0.02148212
#[1] 0.787111
#[1] -3.715304
#[1] 0.8210407
#[1] 0.3177628
#[1] 2.520227
#[1] 0.655413
#[1] -0.1994193
If you have need to put coefficients into a data frame, this code will put each coefficient into a separate variable (variable1, variable2,..) within a dataframe (vars):
fit <- lm(mpg~.,mtcars)
ce <- coef(fit)
vars <- data.frame(col = (NA))
for(i in 1:length(ce)) {
new_col <- as.numeric(ce[i])
vars[ 1, i] <- new_col
colnames(vars)[i] <- paste0("variable", i)
}
vars
# variable1 variable2 variable3 variable4 variable5 variable6 variable7 variable8 variable9 variable10 variable11
# 1 12.30337 -0.1114405 0.01333524 -0.02148212 0.787111 -3.715304 0.8210407 0.3177628 2.520227 0.655413 -0.1994193
I try to do 1104 linear regressions with the same model. My independent variables do not change. However, my dependent variable does. Indeed, I have 1104 dependent variables. I do not know how to extract all the coefficients (intercepts included) and p-values in order to compute means of each (coefficients & p-values). How to do that with an easy way ? This is my model :
testMCFG1 <- lapply(101:1204, function(i) lm(recexp[,i]~recexp[,"rm"] + recexp[,"zdy"] + recexp[,"ztbl"] + recexp[,"ztms"] + recexp[,"zdfy"] + recexp[,"rm_zdy"] + recexp[,"rm_ztbl"] + recexp[,"rm_ztms"] + recexp[,"rm_zdfy"] + recexp[,"contexte"] + recexp[,"rm_contexte"]))
However, someone here has already showed me how to do that with only one invariant independent variable. That works. Find below the codes for this case:
y <- 'rm'
x <- names(recexp[101:1204])
models <- map(setNames(x, x),
~ lm(as.formula(paste(.x, y, sep="~")),
data=recexp))
pvalues <-
data.frame(rsquared = unlist(map(models, ~ summary(.)$r.squared)),
RSE = unlist(map(models, ~ summary(.)$sigma))) %>%
rownames_to_column(var = "which_dependent")
results <- full_join(basic_information, pvalues)
results %>% group_by(term) %>% summarise(mean_estimate = mean(estimate))
results %>% group_by(term) %>% summarise(mean_p = mean(p.value))
Here is a solution using several tidyverse packages. You don't provide your data so I'll use mtcars as an example. Put your independent variables into a fixed string called independents and we'll grab your dependents using a slice as you did with your code producing a character vector
#####
independents <- 'mpg + vs + am + gear'
dependent <- names(mtcars[2:7])
Load the libraries
library(dplyr)
library(purrr)
library(broom)
library(tidyr)
library(tibble)
Make a list of all the models using purrr::map
models <- map(setNames(dependent, dependent),
~ lm(as.formula(paste(.x, independents, sep="~")),
data=mtcars))
Take that list of lm models and feed it to broom::tidy to extract the basic information about beta estimates, and p values etc. To keep it neat use the name of the list item (which is the dependent variable) and add it as a column. Remove the parens from intercept and add a zero so it is always first and you know it's beta0
basics <-
map(models, ~ broom::tidy(.)) %>%
map2_df(.,
names(.),
~ mutate(.x, which_dependent = .y)) %>%
select(which_dependent, everything()) %>%
mutate(term = gsub("\\(Intercept\\)", "0Intercept", term))
Feed the list in again this time extract r squared and sigma a.k.a. "Residual standard error"
model_summary <-
data.frame(rsquared = unlist(map(models, ~ summary(.)$r.squared)),
RSE = unlist(map(models, ~ summary(.)$sigma))) %>%
rownames_to_column(var = "which_dependent")
Join the two based on which dependent variable
results <- full_join(basics, model_summary)
#> Joining, by = "which_dependent"
results
#> # A tibble: 30 x 8
#> which_dependent term estimate std.error statistic p.value rsquared RSE
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 cyl 0Inter… 10.4 1.14 9.13 9.58e-10 0.861 0.714
#> 2 cyl mpg -0.117 0.0382 -3.06 4.98e- 3 0.861 0.714
#> 3 cyl vs -1.80 0.374 -4.81 5.09e- 5 0.861 0.714
#> 4 cyl am -0.414 0.502 -0.826 4.16e- 1 0.861 0.714
#> 5 cyl gear -0.258 0.290 -0.891 3.81e- 1 0.861 0.714
#> 6 disp 0Inter… 571. 94.1 6.07 1.76e- 6 0.804 58.8
#> 7 disp mpg -9.50 3.14 -3.02 5.47e- 3 0.804 58.8
#> 8 disp vs -85.9 30.8 -2.79 9.49e- 3 0.804 58.8
#> 9 disp am -31.9 41.3 -0.774 4.45e- 1 0.804 58.8
#> 10 disp gear -26.8 23.9 -1.12 2.71e- 1 0.804 58.8
#> # … with 20 more rows
It's in long format so you can do things like summarise grouped by term
results %>%
group_by(term) %>%
summarise(mean_p = mean(p.value)) %>%
arrange(term)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 2
#> term mean_p
#> <chr> <dbl>
#> 1 0Intercept 0.000168
#> 2 am 0.359
#> 3 gear 0.287
#> 4 mpg 0.0538
#> 5 vs 0.159
Or you can make it wider if you prefer...
wide_results <-
results %>%
pivot_wider(names_from = term,
values_from = estimate:p.value)
wide_results
#> # A tibble: 6 x 23
#> which_dependent rsquared RSE estimate_0Inter… estimate_mpg estimate_vs
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 cyl 0.861 0.714 10.4 -0.117 -1.80
#> 2 disp 0.804 58.8 571. -9.50 -85.9
#> 3 hp 0.736 37.7 241. -8.17 -41.4
#> 4 drat 0.667 0.331 2.07 0.0228 0.166
#> 5 wt 0.804 0.464 5.90 -0.104 -0.146
#> 6 qsec 0.734 0.988 17.5 0.0894 2.29
#> # … with 17 more variables: estimate_am <dbl>, estimate_gear <dbl>,
#> # std.error_0Intercept <dbl>, std.error_mpg <dbl>, std.error_vs <dbl>,
#> # std.error_am <dbl>, std.error_gear <dbl>, statistic_0Intercept <dbl>,
#> # statistic_mpg <dbl>, statistic_vs <dbl>, statistic_am <dbl>,
#> # statistic_gear <dbl>, p.value_0Intercept <dbl>, p.value_mpg <dbl>,
#> # p.value_vs <dbl>, p.value_am <dbl>, p.value_gear <dbl>
names(wide_results)
#> [1] "which_dependent" "rsquared" "RSE"
#> [4] "estimate_0Intercept" "estimate_mpg" "estimate_vs"
#> [7] "estimate_am" "estimate_gear" "std.error_0Intercept"
#> [10] "std.error_mpg" "std.error_vs" "std.error_am"
#> [13] "std.error_gear" "statistic_0Intercept" "statistic_mpg"
#> [16] "statistic_vs" "statistic_am" "statistic_gear"
#> [19] "p.value_0Intercept" "p.value_mpg" "p.value_vs"
#> [22] "p.value_am" "p.value_gear"
I want to run a series of linear regressions for multiple groups across columns. For the group stratification across rows, I can use the idea suggested here (Fitting several regression models with dplyr). In addition to that, I also need to regress them across different columns. See below the code I achieved with the loop. I wonder whether I can do both in a vectorized manner using the map function in package purrr together with the function of group_by in dplyr package and export the estimated beta coefficients and p values accordingly.
library(dplyr)
library(broom)
head(mtcars)
vec<-names(mtcars)[3:9]
data=NULL
for (i in 1:length(vec)){
df<-mtcars%>%
group_by(cyl)%>%
do( fit = lm( paste('mpg ~disp+',vec[i]), data = .))
dfCoef = tidy(df, fit)
res<-dfCoef %>%
filter(term=='disp')
res$con=vec[i]
data=bind_rows(data,res)
}
data
Using tidyr::(un)nest to perform the regressions by groups and a helper function this could be achieved like so:
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
vec <- names(mtcars)[3:9]
lm_help <- function(vec) {
mtcars %>%
tidyr::nest(data = -cyl) %>%
mutate(con = vec,
fit = purrr::map(data, lm, formula = as.formula(paste0("mpg ~ disp + ", vec))),
tidy = purrr::map(fit, tidy)) %>%
select(cyl, con, tidy) %>%
tidyr::unnest(tidy) %>%
filter(term == "disp")
}
purrr::map(vec, lm_help) %>%
bind_rows()
#> # A tibble: 21 x 7
#> cyl con term estimate std.error statistic p.value
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 disp disp 0.00361 0.0156 0.232 0.826
#> 2 4 disp disp -0.135 0.0332 -4.07 0.00278
#> 3 8 disp disp -0.0196 0.00932 -2.11 0.0568
#> 4 6 hp disp 0.00180 0.0202 0.0890 0.933
#> 5 4 hp disp -0.120 0.0369 -3.24 0.0120
#> 6 8 hp disp -0.0186 0.00946 -1.97 0.0746
#> 7 6 drat disp 0.0224 0.0292 0.770 0.484
#> 8 4 drat disp -0.133 0.0406 -3.27 0.0114
#> 9 8 drat disp -0.0196 0.00977 -2.01 0.0697
#> 10 6 wt disp 0.0191 0.0109 1.75 0.154
#> # ... with 11 more rows