How can I export the corresponding statistics from a linear regression model to a table in R?

I am fitting a multiple linear regression model with 1 dependent variable and 45 independent variables. I would like to put the corresponding effect, 95% CI, and p-value into a table.
My data set is as below:
a   b   x1  x2  x3  x4  ...  x45
23  15   1  34   4  45        8
10  45   2  15   2  55       18
Is there a more convenient way to write the regression syntax than lm(y ~ x1 + x2 + x3 + x4 + ... + x45, data = df), given that I have 45 variables with different names?
I've tried
mod1 <- lm(a ~ x1 + x2 + ....+x45, data=df)
mod2 <- lm(b ~ x1 + x2 + .... +x45, data=df)
But I don't know how to put the outcomes into table as following:
model  variable  effect    95% CI               p
mod1   x1        177.93     79.16 to 276.71     0.003
mod1   x2        -75.13   -116.46 to -33.80     0.003
...
mod1   x45       118.61     53.09 to 184.13     0.005
mod2   x1         79.53     36.94 to 122.13     0.004
mod2   x2        201.93     60.48 to 343.38     0.01
...
mod2   x45        61.56     20.87 to 102.25     0.005
It would be great if anyone can help. Many thanks!

Assuming that your dataset contains only the dependent variable and the relevant predictors, you can specify the regression like this:
lm(a ~ ., data = df)
In your case you could use dplyr::select to remove b or a:
library(dplyr)

mod1 <- df %>%
  select(-b) %>%
  lm(a ~ ., data = .)
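The model for b follows the same pattern, dropping a instead:
mod2 <- df %>%
  select(-a) %>%
  lm(b ~ ., data = .)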
To summarise outcomes in a table, look at the R packages stargazer, sjPlot, or broom. I like sjPlot::tab_model, as it generates clean HTML tables in RStudio and in R Markdown documents.
library(sjPlot)
iris %>%
  lm(Petal.Length ~ ., data = .) %>%
  tab_model()
Or using broom::tidy:
library(broom)
iris %>%
  lm(Petal.Length ~ ., data = .) %>%
  tidy()
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.11 0.270 -4.12 6.45e- 5
2 Sepal.Length 0.608 0.0502 12.1 1.07e-23
3 Sepal.Width -0.181 0.0804 -2.25 2.62e- 2
4 Petal.Width 0.602 0.121 4.96 1.97e- 6
5 Speciesversicolor 1.46 0.173 8.44 3.14e-14
6 Speciesvirginica 1.97 0.245 8.06 2.60e-13
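To get exactly the layout asked for (model, variable, effect, 95% CI, p), here is a minimal sketch, assuming mod1 and mod2 have been fitted as above; tidy(conf.int = TRUE) adds conf.low and conf.high columns, and bind_rows() stacks the two models under a model identifier:
library(dplyr)
library(broom)

# Sketch: stack the tidied models, drop intercepts, rename to match the question
bind_rows(list(mod1 = tidy(mod1, conf.int = TRUE),
               mod2 = tidy(mod2, conf.int = TRUE)),
          .id = "model") %>%
  filter(term != "(Intercept)") %>%
  select(model, variable = term, effect = estimate,
         conf.low, conf.high, p = p.value)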

When I used the lmer function for the regression, this code extracted the results table. Please refer to the code!
library(lme4)
library(broom)

# Fit the mixed model and keep its summary
RESULT_model1 <- summary(model1 <- lmer(
  LOGTHSMON_SELNG_AMT ~ . - BLK_CD - LOGPERTHSMON_SELNG_AMT
    - STDR_YM_CD - bub_dong - M201507 - M201508 - M201509
    - STOR_CO
    + (1 | BLK_CD),
  data = dta_fin_WEST))

# Turn the coefficient matrix into a data frame
RESULT_model1_cof <- as.data.frame(tidy(RESULT_model1$coefficients))

# lmer reports t-values but no p-values; approximate them from the normal distribution
RESULT_model1_cof$p.value <- 2 * (1 - pnorm(abs(RESULT_model1_cof$t.value)))
RESULT_model1_cof$VAR <- RESULT_model1_cof$.rownames
RESULT_model1_cof$STDERR <- RESULT_model1_cof$Std..Error
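A more compact alternative, assuming the broom.mixed package (the broom companion for mixed models) is available, is to tidy the fitted lmer model directly; conf.int = TRUE adds Wald confidence intervals for the fixed effects:
library(broom.mixed)

# Sketch, assuming model1 is the lmer fit from above
tidy(model1, effects = "fixed", conf.int = TRUE)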

Related

R loop over linear regression

I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression several times, each time using a different column as my independent variable; the dependent variable stays the same. After running them, I want to extract the R squared from each of the regressions. My thought process was to use a simple for loop, but I cannot make it work.
Assume I work with the following data:
  num  value  person1  person2  person3
0   1    229       29       81        0
1   2    203       17       75        0
2   3    244       62        0       55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for (i in 3:5) {
  temp <- df[, c("value", column[i])]
  lm.test <- lm(value ~ ., data = temp)
  i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3), which I think makes sense. But when I try to rewrite it as lm.test[i] <- lm(value ~ ., data = temp), I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.
You can create a formula in a loop and then run lm(). For instance, if I want to run a regression on mtcars, regressing mpg on each of cyl, wt, and hp, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col) {
  lm_formula <- as.formula(paste0("mpg ~ ", col))
  lm(lm_formula, data = mtcars)
})
You can then iterate over lm_results again to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
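If you would rather have the R² values in a single data frame than a list, a small sketch using sapply():
data.frame(predictor = vars,
           r.squared = sapply(lm_results, function(x) summary(x)$r.squared))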
Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
  set_names(names(df)[-2]),
  ~ glance(lm(
    as.formula(paste("value ~", .x)),
    data = df
  )),
  .id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().
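For instance, a sketch of that variant, reusing df from above; tidy() returns one row per coefficient rather than one row per model:
coef_results <- map_dfr(
  set_names(names(df)[-2]),
  ~ tidy(lm(as.formula(paste("value ~", .x)), data = df)),
  .id = "predictor"
)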

How to add interaction terms in multinomial regression

I am using the mlogit function from the mlogit package to run a multinomial logit regression. I am not sure how to add interaction terms into my model. Here is a toy dataset and my attempt to add interactions:
library(mlogit)
data <- data.frame(y = sample(1:3, 24, replace = TRUE),
                   x1 = c(rep(1, 12), rep(2, 12)),
                   x2 = rep(c(rep(1, 4), rep(2, 4), rep(3, 4)), 2),
                   x3 = rnorm(24),
                   z1 = sample(1:10, 24, replace = TRUE))

# Model with only main effects
m0 <- mlogit(y ~ 0 | x1 + x2 + x3 + z1, shape = "wide", data = data)

# Model with (presumably) all possible 2-way interactions?
m1 <- mlogit(y ~ 0 | (x1 + x2 + x3 + z1)^2, shape = "wide", data = data)
The output from summary(m1) shows:
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):2 86.41088 164.93831 0.5239 0.6003
(Intercept):3 62.43859 163.57346 0.3817 0.7027
x1:2 -32.27065 82.62474 -0.3906 0.6961
x1:3 0.24661 84.07429 0.0029 0.9977
x2:2 -75.09247 81.36496 -0.9229 0.3561
x2:3 -85.16452 81.40983 -1.0461 0.2955
x3:2 113.11778 119.15990 0.9493 0.3425
x3:3 112.77622 117.74567 0.9578 0.3382
z1:2 11.18665 22.32508 0.5011 0.6163
z1:3 13.15552 22.26441 0.5909 0.5546
x1:2 34.01298 39.66983 0.8574 0.3912
x1:3 32.19141 39.48373 0.8153 0.4149
x1:2 -53.86747 59.75696 -0.9014 0.3674
x1:3 -47.97693 59.09055 -0.8119 0.4168
x1:2 -6.98799 11.29920 -0.6185 0.5363
x1:3 -10.41574 11.52313 -0.9039 0.3660
x2:2 0.59185 6.68807 0.0885 0.9295
x2:3 2.63458 4.94419 0.5329 0.5941
x2:2 0.80945 2.03769 0.3972 0.6912
x2:3 2.60383 2.21878 1.1735 0.2406
x3:2 -0.64112 1.64678 -0.3893 0.6970
x3:3 -2.14289 1.98436 -1.0799 0.2802
It is not clear to me from the first column which specific interactions were output. Any pointers would be greatly appreciated!
This might be a clearer way to do it:
library(dplyr)
library(broom)
library(nnet)
multinom(formula = y ~ (x1 + x2 + x3 + z1)^2, data = data) %>%
  tidy()
# A tibble: 22 x 6
y.level term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 (Intercept) -158. 247. -0.640 0.522
2 2 x1 -388. 247. -1.57 0.116
3 2 x2 -13.4 248. -0.0543 0.957
4 2 x3 120. 334. 0.360 0.719
5 2 z1 173. 968. 0.179 0.858
6 2 x1:x2 337. 248. 1.36 0.174
7 2 x1:x3 40.2 334. 0.120 0.904
8 2 x1:z1 -53.8 968. -0.0555 0.956
9 2 x2:x3 -137. 1018. -0.135 0.893
10 2 x2:z1 -76.6 910. -0.0841 0.933
# … with 12 more rows

R run linear model by group in dataset [duplicate]

This question already has answers here: Linear Regression and group by in R (10 answers). Closed 2 years ago.
My dataset looks like this:
df <- data.frame(site = c(rep('A', 95), rep('B', 110), rep('C', 250)),
                 nps_score = floor(runif(455, min = 0, max = 10)),
                 service_score = floor(runif(455, min = 0, max = 10)),
                 food_score = floor(runif(455, min = 0, max = 10)),
                 clean_score = floor(runif(455, min = 0, max = 10)))
I'd like to run a linear model on each group (i.e. for each site), and produce the coefficients for each group in a dataframe, along with the significance levels of each variable.
I am trying to group_by the site variable and then run the model for each site, but it doesn't seem to be working. I've looked at some existing solutions on Stack Overflow but cannot seem to adapt the code to my case.
# Trying to run this by group, and output the resulting coefficients per site in a separate df with their significance levels.
library(MASS)
summary(ols <- rlm(nps_score ~ ., data = df))
Any help on this would be greatly appreciated.
library(tidyverse)
library(broom)
library(MASS)
# We first create a formula object
my_formula <- as.formula(
  paste("nps_score ~ ",
        paste(df %>% select(-site, -nps_score) %>% names(), collapse = "+")))

# Now we can group by site and use the formula object within the pipe.
results <- df %>%
  group_by(site) %>%
  do(tidy(rlm(formula(my_formula), data = .)))
which gives:
# A tibble: 12 x 5
# Groups: site [3]
site term estimate std.error statistic
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 5.16 0.961 5.37
2 A service_score -0.0656 0.110 -0.596
3 A food_score -0.0213 0.102 -0.209
4 A clean_score -0.0588 0.110 -0.536
5 B (Intercept) 2.22 0.852 2.60
6 B service_score 0.221 0.103 2.14
7 B food_score 0.163 0.104 1.56
8 B clean_score -0.0383 0.0928 -0.413
9 C (Intercept) 5.47 0.609 8.97
10 C service_score -0.0367 0.0721 -0.509
11 C food_score -0.0585 0.0724 -0.808
12 C clean_score -0.0922 0.0691 -1.33
Note: I'm not familiar with the rlm function and whether it provides p-values in the first place, but at least the tidy function doesn't offer p-values for rlm. If a simple linear regression suits your needs, you could replace rlm with lm, in which case a sixth column with p-values would be added.
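For completeness, a sketch of that lm() variant, which keeps the rest of the pipeline intact and adds the p.value column:
results_lm <- df %>%
  group_by(site) %>%
  do(tidy(lm(my_formula, data = .)))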

How to run multiple linear regressions with different independent and dependent variables, adding standardized coefficients, in R?

I'm currently trying to run a loop performing linear regression for multiple independent variables (n = 6) with multiple dependent variables (n = 1000).
Here is some example data, with age, sex, and education representing my independent variables of interest and testscore_* being my dependent variables.
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
                 age = as.numeric(c('56', '43', '59', '74', '61', '62', '69', '80', '40', '55', '58')),
                 sex = as.numeric(c('0', '1', '0', '0', '1', '1', '0', '1', '0', '1', '0')),
                 testscore_1 = as.numeric(c('23', '28', '30', '15', '7', '18', '29', '27', '14', '22', '24')),
                 testscore_2 = as.numeric(c('1', '3', '2', '5', '8', '2', '5', '6', '7', '8', '2')),
                 testscore_3 = as.numeric(c('18', '20', '19', '15', '20', '23', '19', '25', '10', '14', '12')),
                 education = as.numeric(c('5', '4', '3', '5', '2', '1', '4', '4', '3', '5', '2')))
I have working code that allows me to run a regression model for multiple DVs (which I'm sure more experienced R users will dislike for its lack of efficiency):
library(lm.beta)

# Matrix of the three test-score DVs
y <- as.matrix(df[4:6])

# Model for age
lm_results <- lm(y ~ age, data = df)
write.csv(broom::tidy(lm_results), "lm_results_age.csv")
regression_results <- broom::tidy(lm_results)

# Standardized coefficients
standardized_coefficients <- lm.beta(lm_results)
age_standardize_results <- coef(standardized_coefficients)
write.csv(age_standardize_results, "lm_results_age_standardized_coefficients.csv")
I would then repeat all of this manually, replacing age with sex and education.
Does anyone have a more elegant way of running this - for example, by way of a loop for all IVs of interest (i.e. age, sex and education)?
I would also greatly appreciate suggestions for a quick way of combining broom::tidy(lm_results) with the standardized coefficients from lm.beta::lm.beta, i.e. combining the standardized regression coefficients with the main model output.
This is an adaptation of a similar workflow I had to use in the past. Remember to really penalize yourself for running a crazy number of models. I added a couple of predictor columns to your dataframe. Good luck!!
Solution:
# Creating predictor and outcome vectors
ivs_vec <- names(df)[c(2:6, 10)]
dvs_vec <- names(df)[7:9]

# Creating formulas and running the models
ivs <- paste0(" ~ ", ivs_vec)
dvs_ivs <- unlist(lapply(ivs, function(x) paste0(dvs_vec, x)))
formulas <- lapply(dvs_ivs, formula)
lm_results <- lapply(formulas, function(x) {
  lm(x, data = df)
})

# Creating / combining results
tidy_results <- lapply(lm_results, broom::tidy)
dv_list <- lapply(as.list(stringi::stri_extract_first_words(dvs_ivs)), rep, 2)
tidy_results <- Map(cbind, dv_list, tidy_results)
standardized_results <- lapply(lm_results, function(x) coef(lm.beta::lm.beta(x)))
combined_results <- Map(cbind, tidy_results, standardized_results)

# Cleaning up final results
names(combined_results) <- dvs_ivs
combined_results <- lapply(combined_results, function(x) { row.names(x) <- NULL; x })
new_names <- c("Outcome", "Term", "Estimate", "Std. Error", "Statistic",
               "P-value", "Standardized Estimate")
combined_results <- lapply(combined_results, setNames, new_names)
Results:
combined_results[1:5]
$`testscore_1 ~ age`
      Outcome        Term    Estimate Std. Error Statistic   P-value Standardized Estimate
1 testscore_1 (Intercept) 18.06027731 12.3493569 1.4624468 0.1776424            0.00000000
2 testscore_1         age  0.05835152  0.2031295 0.2872627 0.7804155            0.09531823
$`testscore_2 ~ age`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_2 (Intercept) 3.63788676 4.39014570 0.8286483 0.4287311 0.0000000
2 testscore_2 age 0.01367313 0.07221171 0.1893478 0.8540216 0.0629906
$`testscore_3 ~ age`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_3 (Intercept) 6.1215175 6.698083 0.9139208 0.3845886 0.0000000
2 testscore_3 age 0.1943125 0.110174 1.7636870 0.1116119 0.5068026
$`testscore_1 ~ sex`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_1 (Intercept) 22.5 3.099283 7.2597435 4.766069e-05 0.0000000
2 testscore_1 sex -2.1 4.596980 -0.4568217 6.586248e-01 -0.1505386
$`testscore_2 ~ sex`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_2 (Intercept) 3.666667 1.041129 3.521816 0.006496884 0.0000000
2 testscore_2 sex 1.733333 1.544245 1.122447 0.290723029 0.3504247
Data:
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
                 age = as.numeric(c('56', '43', '59', '74', '61', '62', '69', '80', '40', '55', '58')),
                 sex = as.numeric(c('0', '1', '0', '0', '1', '1', '0', '1', '0', '1', '0')),
                 pred1 = sample(1:11, 11),
                 pred2 = sample(1:11, 11),
                 pred3 = sample(1:11, 11),
                 testscore_1 = as.numeric(c('23', '28', '30', '15', '7', '18', '29', '27', '14', '22', '24')),
                 testscore_2 = as.numeric(c('1', '3', '2', '5', '8', '2', '5', '6', '7', '8', '2')),
                 testscore_3 = as.numeric(c('18', '20', '19', '15', '20', '23', '19', '25', '10', '14', '12')),
                 education = as.numeric(c('5', '4', '3', '5', '2', '1', '4', '4', '3', '5', '2')))
Stumbled upon this a year later; documenting a tidyverse solution with the same data as @Andrew.
library(dplyr)
library(purrr)
library(tidyr)
library(stringi)
# Creating predictor and outcome vectors
ivs_vec <- names(df)[c(2:6, 10)]
dvs_vec <- names(df)[7:9]

# Creating formulas and running the models
ivs <- paste0(" ~ ", ivs_vec)
dvs_ivs <- unlist(map(ivs, ~ paste0(dvs_vec, .x)))
models <- map(setNames(dvs_ivs, dvs_ivs),
              ~ lm(formula = as.formula(.x), data = df))

basics <-
  map(models, ~ broom::tidy(.)) %>%
  map2_df(., names(.), ~ mutate(.x, which_dependent = .y)) %>%
  select(which_dependent, everything()) %>%
  mutate(term = gsub("\\(Intercept\\)", "Intercept", term),
         which_dependent = stringi::stri_extract_first_words(which_dependent))

basics$std_estimate <-
  map_dfr(models, ~ coef(lm.beta::lm.beta(.)), .id = "which_dependent") %>%
  pivot_longer(cols = -which_dependent,
               names_to = "term",
               values_to = "std_estimate",
               values_drop_na = TRUE) %>%
  pull(std_estimate)
basics
#> # A tibble: 36 x 7
#> which_dependent term estimate std.error statistic p.value std_estimate
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 testscore_1 Intercept 18.1 12.3 1.46 0.178 0
#> 2 testscore_1 age 0.0584 0.203 0.287 0.780 0.0953
#> 3 testscore_2 Intercept 3.64 4.39 0.829 0.429 0
#> 4 testscore_2 age 0.0137 0.0722 0.189 0.854 0.0630
#> 5 testscore_3 Intercept 6.12 6.70 0.914 0.385 0
#> 6 testscore_3 age 0.194 0.110 1.76 0.112 0.507
#> 7 testscore_1 Intercept 22.5 3.10 7.26 0.0000477 0
#> 8 testscore_1 sex -2.10 4.60 -0.457 0.659 -0.151
#> 9 testscore_2 Intercept 3.67 1.04 3.52 0.00650 0
#> 10 testscore_2 sex 1.73 1.54 1.12 0.291 0.350
#> # … with 26 more rows

How to build a summary table of glm's parameters and AICcWt

I have a dataset that I am using to build generalised linear models. The response variable is binary (absence/presence) and the explanatory variables are categorical.
CODE
library(tidyverse)
library(AICcmodavg)
# Data
set.seed(123)
t <- tibble(ID = 1:100,
            A = as.factor(sample(c(0, 1), 100, TRUE)),
            B = as.factor(sample(c("black", "white"), 100, TRUE)),
            C = as.factor(sample(c("pos", "neg", "either"), 100, TRUE)))
# Candidate set of models - Binomial family because response variable
# is binary (0 for absent & 1 for present)
# Global model is A ~ B_black + C_either
m1 <- glm(A ~ 1, binomial, t)
m2 <- glm(A ~ B, binomial, t)
m3 <- glm(A ~ C, binomial, t)
m4 <- glm(A ~ B + C, binomial, t)
# List with all models
ms <- list(null = m1, m_B = m2, m_C = m3, m_BC = m4)
# Summary table
aic_tbl <- aictab(ms)
PROBLEM
I want to build a table that summarises the coefficients, standard errors, and Akaike weights of the models within my candidate set.
QUESTION
Can anyone suggest how to best build this table using my list of models and AIC table?
Just to point it out: broom gets you halfway to where you want to be by turning the model output into a dataframe, which you can then reshape.
library(broom)
bind_rows(lapply(ms, tidy), .id="key")
    key        term estimate std.error statistic p.value
1  null (Intercept)   -0.120     0.200    -0.600   0.549
2   m_B (Intercept)    0.000     0.283     0.000   1.000
3   m_B      Bwhite   -0.241     0.401    -0.601   0.548
4   m_C (Intercept)   -0.480     0.353    -1.359   0.174
5   m_C        Cneg    0.805     0.507     1.588   0.112
6   m_C        Cpos    0.308     0.490     0.629   0.530
7  m_BC (Intercept)   -0.363     0.399    -0.910   0.363
8  m_BC      Bwhite   -0.251     0.408    -0.615   0.538
9  m_BC        Cneg    0.811     0.508     1.597   0.110
10 m_BC        Cpos    0.327     0.492     0.665   0.506
And if you must insist on the layout of your table, I came up with the following (arguably clumsy) way of rearranging everything:
out <- bind_rows(lapply(ms, tidy), .id="mod")
t1 <- out %>% select(mod, term, estimate) %>% spread(term, estimate) %>% base::t
t2 <- out %>% select(mod, term, std.error) %>% spread(term, std.error) %>% base::t
rownames(t2) <- paste0(rownames(t2), "_std_e")
tmp <- rbind(t1, t2[-1,])
new_t <- as.data.frame(tmp[-1,])
colnames(new_t) <- tmp[1,]
new_t
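To also fold in the Akaike weights the question asks for, one sketch: aictab() returns a data frame whose Modnames column matches the names of the ms list, so its AICcWt column can be joined onto the tidied coefficients:
# Sketch, assuming aic_tbl is the aictab() result from the question
wts <- as.data.frame(aic_tbl)[, c("Modnames", "AICcWt")]
left_join(out, wts, by = c("mod" = "Modnames"))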
Alternatively, you may want to familiarise yourself with packages that are meant to display model output for publication, e.g. texreg or stargazer come to mind:
library(texreg)
screenreg(ms)
==================================================
null m_B m_C m_BC
--------------------------------------------------
(Intercept) -0.12 0.00 -0.48 -0.36
(0.20) (0.28) (0.35) (0.40)
Bwhite -0.24 -0.25
(0.40) (0.41)
Cneg 0.80 0.81
(0.51) (0.51)
Cpos 0.31 0.33
(0.49) (0.49)
--------------------------------------------------
AIC 140.27 141.91 141.66 143.28
BIC 142.87 147.12 149.48 153.70
Log Likelihood -69.13 -68.95 -67.83 -67.64
Deviance 138.27 137.91 135.66 135.28
Num. obs. 100 100 100 100
==================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
