I am performing logistic regression on the model with CHD sickness vs a few variables (see the data frame).
ind sbp tobacco ldl adiposity typea obesity alcohol age chd
1 1 160 12.00 5.73 23.11 49 25.30 97.20 52 1
2 2 144 0.01 4.41 28.61 55 28.87 2.06 63 1
...
I performed backward stepwise selection on this model to receive the best model, but I get as the result the model that contains only the intercept. Why can it be? What does it mean?
model <-glm(chd ~ ., data = CHD, family = "binomial"(link = logit))
intercept_only <- glm(chd ~ 1, data=CHD, family = "binomial"(link = logit))
#perform backward stepwise regression
back <- step(intercept_only, direction='backward', scope=formula(model), trace=0)
#view results of backward stepwise regression
Step Df Deviance Resid. Df Resid. Dev AIC
1 NA NA 461 596.1084 598.1084```
To do backward regression, you should start with a model that contains variables, rather than the model with intercept only:
back <- step(model, direction='backward', scope=formula(model), trace=0)
The intercept_only model should only be used if you set direction='forward' or direction='both'.
I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression a couple of times. Each time using a different column as my independent variable, the dependent variable stays the same. After running it I want to be able to extract the R squared from each of the regressions. My thought process was to use a simple for loop. However, I cannot make it work.
Assume I work with the following data:
num value person1 person2 person3
0 1 229 29 81 0
1 2 203 17 75 0
2 3 244 62 0 55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for(i in 3:5){
temp <- df[,c("value", column[i])]
lm.test <- lm(value ~ ., data = temp)
i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3) which I think makes sense but when trying to rewrite it as: lm.test[i] <- lm(value ~ ., data = temp) I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.
You can create formula in a loop and then run the lm. For instance, if I want to run regression on mtcars for regressing mpg on each of cyl, wt, hp, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
lm_formula <- as.formula(paste0("mpg ~ ", col))
lm(lm_formula, data = mtcars)
})
You can then again iterate over lm_results to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().
This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 2 years ago.
My dataset looks like this
df = data.frame(site=c(rep('A',95),rep('B',110),rep('C',250)),
nps_score=c(floor(runif(455, min=0, max=10))),
service_score=c(floor(runif(455, min=0, max=10))),
food_score=c(floor(runif(455, min=0, max=10))),
clean_score=c(floor(runif(455, min=0, max=10))))
I'd like to run a linear model on each group (i.e. for each site), and produce the coefficients for each group in a dataframe, along with the significance levels of each variable.
I am trying to group_by the site variable and then run the model for each site but it doesn't seem to be working. I've looked at some existing solutions on stack overflow but cannot seem to adapt the code to my solution.
#Trying to run this by group, and output the resulting coefficients per site in a separate df with their signficance levels.
library(MASS)
summary(ols <- rlm(nps_score ~ ., data = df))
Any help on this would be greatly appreciated
library(tidyverse)
library(broom)
library(MASS)
# We first create a formula object
my_formula <- as.formula(paste("nps_score ~ ", paste(df %>% select(-site, -nps_score) %>% names(), collapse= "+")))
# Now we can group by site and use the formula object within the pipe.
results <- df %>%
group_by(site) %>%
do(tidy(rlm(formula(my_formula), data = .)))
which gives:
# A tibble: 12 x 5
# Groups: site [3]
site term estimate std.error statistic
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 5.16 0.961 5.37
2 A service_score -0.0656 0.110 -0.596
3 A food_score -0.0213 0.102 -0.209
4 A clean_score -0.0588 0.110 -0.536
5 B (Intercept) 2.22 0.852 2.60
6 B service_score 0.221 0.103 2.14
7 B food_score 0.163 0.104 1.56
8 B clean_score -0.0383 0.0928 -0.413
9 C (Intercept) 5.47 0.609 8.97
10 C service_score -0.0367 0.0721 -0.509
11 C food_score -0.0585 0.0724 -0.808
12 C clean_score -0.0922 0.0691 -1.33
Note: i'm not familiar with the rlm function and if it provides p-values in the first place. But at least the tidy function doesn't offer p-values for rlm. If a simple linear regression would fit your suits, you could replace the rlm function by lm in which case a sixth column with p-values would be added.
I'm currently trying to run a loop performing linear regression for multiple independent variables (n = 6) with multiple dependent variables (n=1000).
Here is some example data, with age, sex, and education representing my independent variables of interest and testscore_* being my dependent variables.
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006,1007, 1008, 1009, 1010, 1011),
age = as.numeric(c('56', '43','59','74','61','62','69','80','40','55','58')),
sex = as.numeric(c('0','1','0','0','1','1','0','1','0','1','0')),
testscore_1 = as.numeric(c('23','28','30','15','7','18','29','27','14','22','24')),
testscore_2 = as.numeric(c('1','3','2','5','8','2','5','6','7','8','2')),
testscore_3 = as.numeric(c('18','20','19','15','20','23','19','25','10','14','12')),
education = as.numeric(c('5','4','3','5','2', '1','4','4','3','5','2')))
I have working code that allows me to run a regression model for multiple DVs (which I'm sure more experienced R users will dislike for its lack of efficiency):
y <- as.matrix(df[4:6])
#model for age
lm_results <- lm(y ~ age, data = df)
write.csv((broom::tidy(lm_results)), "lm_results_age.csv")
regression_results <-broom::tidy(lm_results)
standardized_coefficients <- lm.beta(lm_results)
age_standardize_results <- coef(standardized_coefficients)
write.csv(age_standardize_results, "lm_results_age_standardized_coefficients.csv")
I would then repeat this all by manually replacing age with sex and education
Does anyone have a more elegant way of running this - for example, by way of a loop for all IVs of interest (i.e. age, sex and education)?
Also would greatly appreciate anyone who would suggest a quick way of combining broom::tidy(lm_results) with standardized coefficients from lm.beta::lm.beta, i.e. combining the standardized regression coefficients with the main model output.
This is an adaptation for a similar workflow I had to use in the past. Remember to really penalize yourself for running a crazy number of models. I added a couple predictor columns to your dataframe. Good luck!!
Solution:
# Creating pedictor and outcome vectors
ivs_vec <- names(df)[c(2:6, 10)]
dvs_vec <- names(df)[7:9]
# Creating formulas and running the models
ivs <- paste0(" ~ ", ivs_vec)
dvs_ivs <- unlist(lapply(ivs, function(x) paste0(dvs_vec, x)))
formulas <- lapply(dvs_ivs, formula)
lm_results <- lapply(formulas, function(x) {
lm(x, data = df)
})
# Creating / combining results
tidy_results <- lapply(lm_results, broom::tidy)
dv_list <- lapply(as.list(stringi::stri_extract_first_words(dvs_ivs)), rep, 2)
tidy_results <- Map(cbind, dv_list, tidy_results)
standardized_results <- lapply(lm_results, function(x) coef(lm.beta::lm.beta(x)))
combined_results <- Map(cbind, tidy_results, standardized_results)
# Cleaning up final results
names(combined_results) <- dvs_ivs
combined_results <- lapply(combined_results, function(x) {row.names(x) <- c(NULL); x})
new_names <- c("Outcome", "Term", "Estimate", "Std. Error", "Statistic", "P-value", "Standardized Estimate")
combined_results <- lapply(combined_results, setNames, new_names)
Results:
combined_results[1:5]
$`testscore_1 ~ age`
Outcome Term Estimate Std. Error Statistic P-value
Standardized Estimate
1 testscore_1 (Intercept) 18.06027731 12.3493569 1.4624468 0.1776424 0.00000000
2 testscore_1 age 0.05835152 0.2031295 0.2872627 0.7804155 0.09531823
$`testscore_2 ~ age`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_2 (Intercept) 3.63788676 4.39014570 0.8286483 0.4287311 0.0000000
2 testscore_2 age 0.01367313 0.07221171 0.1893478 0.8540216 0.0629906
$`testscore_3 ~ age`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_3 (Intercept) 6.1215175 6.698083 0.9139208 0.3845886 0.0000000
2 testscore_3 age 0.1943125 0.110174 1.7636870 0.1116119 0.5068026
$`testscore_1 ~ sex`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_1 (Intercept) 22.5 3.099283 7.2597435 4.766069e-05 0.0000000
2 testscore_1 sex -2.1 4.596980 -0.4568217 6.586248e-01 -0.1505386
$`testscore_2 ~ sex`
Outcome Term Estimate Std. Error Statistic P-value Standardized Estimate
1 testscore_2 (Intercept) 3.666667 1.041129 3.521816 0.006496884 0.0000000
2 testscore_2 sex 1.733333 1.544245 1.122447 0.290723029 0.3504247
Data:
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006,1007, 1008, 1009, 1010, 1011),
age = as.numeric(c('56', '43','59','74','61','62','69','80','40','55','58')),
sex = as.numeric(c('0','1','0','0','1','1','0','1','0','1','0')),
pred1 = sample(1:11, 11),
pred2 = sample(1:11, 11),
pred3 = sample(1:11, 11),
testscore_1 = as.numeric(c('23','28','30','15','7','18','29','27','14','22','24')),
testscore_2 = as.numeric(c('1','3','2','5','8','2','5','6','7','8','2')),
testscore_3 = as.numeric(c('18','20','19','15','20','23','19','25','10','14','12')),
education = as.numeric(c('5','4','3','5','2', '1','4','4','3','5','2')))
Stumbled upon this a year later and documenting a tidyverse solution same data as #Andrew.
library(dplyr)
library(purrr)
library(tidyr)
library(stringi)
# Creating pedictor and outcome vectors
ivs_vec <- names(df)[c(2:6, 10)]
dvs_vec <- names(df)[7:9]
# Creating formulas and running the models
ivs <- paste0(" ~ ", ivs_vec)
dvs_ivs <- unlist(map(ivs, ~paste0(dvs_vec, .x)))
models <- map(setNames(dvs_ivs, dvs_ivs),
~ lm(formula = as.formula(.x),
data = df))
basics <-
map(models, ~ broom::tidy(.)) %>%
map2_df(.,
names(.),
~ mutate(.x, which_dependent = .y)) %>%
select(which_dependent, everything()) %>%
mutate(term = gsub("\\(Intercept\\)", "Intercept", term),
which_dependent = stringi::stri_extract_first_words(which_dependent))
basics$std_estimate <-
map_dfr(models, ~ coef(lm.beta::lm.beta(.)), .id = "which_dependent") %>%
pivot_longer(.,
cols = -which_dependent,
names_to = "term",
values_to = "std_estimate",
values_drop_na = TRUE) %>%
pull(std_estimate)
basics
#> # A tibble: 36 x 7
#> which_dependent term estimate std.error statistic p.value std_estimate
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 testscore_1 Intercept 18.1 12.3 1.46 0.178 0
#> 2 testscore_1 age 0.0584 0.203 0.287 0.780 0.0953
#> 3 testscore_2 Intercept 3.64 4.39 0.829 0.429 0
#> 4 testscore_2 age 0.0137 0.0722 0.189 0.854 0.0630
#> 5 testscore_3 Intercept 6.12 6.70 0.914 0.385 0
#> 6 testscore_3 age 0.194 0.110 1.76 0.112 0.507
#> 7 testscore_1 Intercept 22.5 3.10 7.26 0.0000477 0
#> 8 testscore_1 sex -2.10 4.60 -0.457 0.659 -0.151
#> 9 testscore_2 Intercept 3.67 1.04 3.52 0.00650 0
#> 10 testscore_2 sex 1.73 1.54 1.12 0.291 0.350
#> # … with 26 more rows
I'm looking for a Tidyverse / broom solution that can solve this puzzle:
Let's say I have different DVs and a specific set of IVS and I want to perform a regression that considers every DV and this specific set of IVs.
I know I can use something like for i in or apply family, but I really want to run that using tidyverse.
The following code works as an example
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
mod1 <- lm(income ~ sex + faculty, ds)
mod2 <- lm(happiness ~ sex + faculty, ds)
mod3 <- lm(health ~ sex + faculty, ds)
summary(mod1)
summary(mod2)
summary(mod3)
Income, happiness, and health are DVs. Sex and Faculty are IVs and they will be used for all regressions.
That was the closest I found
Let me know If I need to clarify my question.
Thanks.
As you have different dependent variables but the same independent, you can form a matrix of these and pass to lm.
mod = lm(cbind(income, happiness, health) ~ sex + faculty, ds)
And I think broom::tidy works
library(broom)
tidy(mod)
# response term estimate std.error statistic p.value
# 1 income (Intercept) 1019.35703873 31.0922529 32.7849205 2.779199e-54
# 2 income sex -54.40337314 40.1399258 -1.3553431 1.784559e-01
# 3 income faculty 19.74808081 17.9511206 1.1001030 2.740100e-01
# 4 happiness (Intercept) 5.97334562 0.1675340 35.6545278 1.505026e-57
# 5 happiness sex 0.05345555 0.2162855 0.2471528 8.053124e-01
# 6 happiness faculty -0.02525431 0.0967258 -0.2610918 7.945753e-01
# 7 health (Intercept) 19.76489553 0.5412676 36.5159396 1.741411e-58
# 8 health sex 0.32399380 0.6987735 0.4636607 6.439296e-01
# 9 health faculty 0.10808545 0.3125010 0.3458723 7.301877e-01
Another method is to gather the dependent variables and use a grouped data frame to fit the models with do. This is the method explained in the broom and dplyr vignette.
library(tidyverse)
library(broom)
ds <- data.frame(
income = rnorm(100, mean = 1000, sd = 200),
happiness = rnorm(100, mean = 6, sd = 1),
health = rnorm(100, mean = 20, sd = 3),
sex = c(0, 1),
faculty = c(0, 1, 2, 3)
)
ds %>%
gather(dv_name, dv_value, income:health) %>%
group_by(dv_name) %>%
do(tidy(lm(dv_value ~ sex + faculty, data = .)))
#> # A tibble: 9 x 6
#> # Groups: dv_name [3]
#> dv_name term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 happiness (Intercept) 6.25 0.191 32.7 3.14e-54
#> 2 happiness sex 0.163 0.246 0.663 5.09e- 1
#> 3 happiness faculty -0.172 0.110 -1.56 1.23e- 1
#> 4 health (Intercept) 20.1 0.524 38.4 1.95e-60
#> 5 health sex 0.616 0.677 0.909 3.65e- 1
#> 6 health faculty -0.653 0.303 -2.16 3.36e- 2
#> 7 income (Intercept) 1085. 32.8 33.0 1.43e-54
#> 8 income sex -12.9 42.4 -0.304 7.62e- 1
#> 9 income faculty -25.1 19.0 -1.32 1.89e- 1
Created on 2018-08-01 by the reprex package (v0.2.0).
We can loop through the column names that are dependent variables, use paste to create the formula to be passed into lm and get the summary statistics with tidy (from broom)
library(tidyverse)
library(broom)
map(names(ds)[1:3], ~
lm(formula(paste0(.x, "~",
paste(names(ds)[4:5], collapse=" + "))), data = ds) %>%
tidy)
If we want it in a single data.frame with a column identifier for dependent variable,
map_df(set_names(names(ds)[1:3]), ~
lm(formula(paste0(.x, "~",
paste(names(ds)[4:5], collapse=" + "))), data = ds) %>%
tidy, .id = "Dep_Variable")