Write a function to run multiple regression models with changing independent variables and changing dependent variables in R - r

Using the data set mtcars as an example: The goal is to write a function to run multiple regression models with changing independent variables and changing dependent variables.
In the code that I wrote (below), var holds the independent variables and mpg is the dependent variable. I used map to run the regression repeatedly, with vs and am as the changing independent variable each time.
var = c("vs", "am")
mtcars %>% select(all_of(var)) %>%
map(~ glm(mpg ~ .x + cyl + disp + splines::ns(wt, 2) + hp,
family = gaussian(link = "identity"),
data = mtcars)) %>%
map_dfr(tidy, conf.int = T, .id = 'source') %>%
select(source, source, term, estimate, std.error, conf.low, conf.high, p.value)
I would like to run the same regression with a different set of independent variables, and also with a y that I can specify (e.g., I ran with mpg above, and I would like to change it to qsec or some other variables). So I envision a function like this:
function_name <- function(x, y, dataset){
dataset %>% select(all_of(x)) %>%
map(~ glm(y ~ .x + cyl + disp + splines::ns(wt, 2) + hp,
family = gaussian(link = "identity"),
data = dataset)) %>%
map_dfr(tidy, conf.int = T, .id = 'source') %>%
select(source, source, term, estimate, std.error, conf.low, conf.high, p.value)
}
But this function didn't work. Any suggestions?

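You could achieve your desired result like so:
The issue with your code is that y ~ ... will not work: inside a formula, y is taken literally as a variable named y rather than being replaced by the string you passed in. Instead, you could use reformulate (or as.formula) to create the formula for your regression model dynamically.
To make this work, loop directly over the character vector x, or more precisely over setNames(x, x), instead of looping over dataset %>% select(all_of(x)). Naming the vector is what lets .id = "source" label each model with its variable name.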
library(dplyr)
library(purrr)
library(broom)
function_name <- function(x, y, dataset) {
map(setNames(x, x), ~ glm(reformulate(
termlabels = c(.x, "cyl", "disp", "splines::ns(wt, 2)", "hp"),
response = y
),
family = gaussian(link = "identity"),
data = dataset
)) %>%
map_dfr(tidy, conf.int = TRUE, .id = "source") %>%
select(source, term, estimate, std.error, conf.low, conf.high, p.value)
}
var <- c("vs", "am")
function_name(x = var, y = "mpg", mtcars)
#> # A tibble: 14 × 7
#> source term estimate std.error conf.low conf.high p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 vs (Intercept) 32.7 3.49 25.8 39.5 1.24e- 9
#> 2 vs vs 1.03 1.52 -1.95 4.01 5.05e- 1
#> 3 vs cyl -0.187 0.821 -1.80 1.42 8.21e- 1
#> 4 vs disp 0.000545 0.0119 -0.0228 0.0239 9.64e- 1
#> 5 vs splines::ns(wt, 2)1 -22.4 4.82 -31.9 -13.0 9.02e- 5
#> 6 vs splines::ns(wt, 2)2 -9.48 3.16 -15.7 -3.28 6.09e- 3
#> 7 vs hp -0.0202 0.0115 -0.0427 0.00226 9.02e- 2
#> 8 am (Intercept) 34.6 2.65 29.4 39.8 1.15e-12
#> 9 am am 0.0113 1.57 -3.06 3.08 9.94e- 1
#> 10 am cyl -0.470 0.714 -1.87 0.931 5.17e- 1
#> 11 am disp 0.000796 0.0125 -0.0236 0.0252 9.50e- 1
#> 12 am splines::ns(wt, 2)1 -21.5 5.86 -33.0 -10.0 1.14e- 3
#> 13 am splines::ns(wt, 2)2 -9.21 3.34 -15.8 -2.66 1.07e- 2
#> 14 am hp -0.0214 0.0136 -0.0480 0.00527 1.28e- 1
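To see why the literal y fails and what reformulate builds instead, here is a minimal base R sketch. The helper name fit_for_response, the shortened predictor list, and the choice of qsec as a second response are illustrative assumptions, not part of the answer above.

```r
# A formula does not substitute the value of `y`; it keeps the symbol itself.
y <- "mpg"
all.vars(y ~ vs + cyl)   # the response is the symbol y, not mpg

# reformulate() builds the formula from strings instead.
f <- reformulate(termlabels = c("vs", "cyl"), response = y)
all.vars(f)              # now the response is mpg

# Hypothetical helper: fit the same predictors against a given response.
fit_for_response <- function(response, predictors, data) {
  lm(reformulate(predictors, response = response), data = data)
}

# Loop over several dependent variables; naming the vector names the results.
responses <- c("mpg", "qsec")
fits <- lapply(setNames(responses, responses), fit_for_response,
               predictors = c("vs", "cyl", "disp", "hp"), data = mtcars)

# One row of coefficients per response variable.
coef_table <- t(sapply(fits, coef))
coef_table
```

The same pattern could be combined with the function above by mapping over a vector of y values and binding the results.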

Related

Tidymodels Error: Can't rename variables in this context

I recently picked up Tidymodels after having used R for a few months in my school.
I was trying to make my first model using the Titanic Dataset on Kaggle, but ran into some issues when fitting the model. Could someone help me?
titanic_rec <- recipe(Survived ~ Sex + Age + Pclass + Embarked + Family_Size + Name, data = titanic_train) %>%
step_impute_knn(all_predictors(), k = 3) %>%
step_dummy(Sex, Pclass, Embarked, Family_Size, Name) %>%
step_interact(~ Sex:Age + Sex:Pclass + Pclass:Age)
log_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
fitted_log_model <- workflow() %>%
add_model(log_model) %>%
add_recipe(titanic_rec) %>%
fit(data = titanic_train) %>%
pull_workflow_fit() %>%
tidy()
Every feature has a factor data type except Age and Survived which are doubles. The error seems to come about when I include the fit(data = ...) onwards.
Error: Can't rename variables in this context. Run `rlang::last_error()` to see where the error occurred.
24. stop(fallback)
23. signal_abort(cnd)
22. abort("Can't rename variables in this context.")
21. eval_select_recipes(to_impute, training, info)
20. impute_var_lists(to_impute = x$terms, impute_using = x$impute_with, training = training, info = info)
19. prep.step_impute_knn(x$steps[[i]], training = training, info = x$term_info)
18. prep(x$steps[[i]], training = training, info = x$term_info)
17. prep.recipe(blueprint$recipe, training = data, fresh = blueprint$fresh)
16. recipes::prep(blueprint$recipe, training = data, fresh = blueprint$fresh)
15. blueprint$mold$process(blueprint = blueprint, data = data)
14. run_mold.recipe_blueprint(blueprint, data)
13. run_mold(blueprint, data)
12. mold.recipe(recipe, data, blueprint = blueprint)
11. hardhat::mold(recipe, data, blueprint = blueprint)
10. fit.action_recipe(action, workflow = workflow, data = data)
9. fit(action, workflow = workflow, data = data)
8. .fit_pre(workflow, data)
7. fit.workflow(., data = titanic_train)
6. fit(., data = titanic_train)
5. is_workflow(x)
4. validate_is_workflow(x)
3. pull_workflow_fit(.)
2. tidy(.)
1. workflow() %>% add_model(log_model) %>% add_recipe(titanic_rec) %>% fit(data = titanic_train) %>% pull_workflow_fit() %>% tidy()
The posted error comes from step_impute_knn(), where the number of neighbors should be specified with the neighbors argument rather than k. Secondly, I would advise against using Name as a predictor, since it creates a separate dummy variable for each name, which would mess with the fit.
The final error comes from step_interact(). You can't use step_interact(~ Sex:Age) after step_dummy(Sex) because there won't be any column named Sex once step_dummy() is done. Instead there will be Sex_male (since female is part of the intercept). A way to catch all the created dummy variables is to use starts_with() inside step_interact().
library(tidymodels)
titanic_train <- readr::read_csv("your/path/to/data/train.csv")
titanic_train <- titanic_train %>%
mutate(Survived = factor(Survived),
Pclass = factor(Pclass),
Family_Size = SibSp + Parch + 1)
titanic_rec <- recipe(Survived ~ Sex + Age + Pclass + Embarked + Family_Size,
data = titanic_train) %>%
step_impute_knn(all_predictors(), neighbors = 3) %>%
step_dummy(Sex, Pclass, Embarked) %>%
step_interact(~ starts_with("Sex_"):Age +
starts_with("Sex_"):starts_with("Pclass_") +
starts_with("Pclass_"):Age)
log_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
fitted_log_model <- workflow() %>%
add_model(log_model) %>%
add_recipe(titanic_rec) %>%
fit(data = titanic_train) %>%
pull_workflow_fit() %>%
tidy()
fitted_log_model
#> # A tibble: 13 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 3.85 0.921 4.18 0.0000289
#> 2 Age 0.0117 0.0226 0.516 0.606
#> 3 Family_Size -0.226 0.0671 -3.36 0.000769
#> 4 Sex_male -2.22 0.886 -2.50 0.0124
#> 5 Pclass_X2 1.53 1.16 1.31 0.189
#> 6 Pclass_X3 -2.42 0.884 -2.74 0.00615
#> 7 Embarked_Q -0.0461 0.368 -0.125 0.900
#> 8 Embarked_S -0.548 0.243 -2.26 0.0241
#> 9 Sex_male_x_Age -0.0488 0.0199 -2.46 0.0140
#> 10 Sex_male_x_Pclass_X2 -1.28 0.879 -1.46 0.144
#> 11 Sex_male_x_Pclass_X3 1.48 0.699 2.11 0.0347
#> 12 Age_x_Pclass_X2 -0.0708 0.0263 -2.69 0.00714
#> 13 Age_x_Pclass_X3 -0.0341 0.0209 -1.63 0.103
Created on 2021-07-01 by the reprex package (v2.0.0)
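The point about the Sex column disappearing after dummy encoding can be illustrated without recipes. The sketch below uses base R's model.matrix as an analogue; the toy data frame is made up for illustration, and base R names the generated column Sexmale, whereas step_dummy() names it Sex_male.

```r
# Base R analogue of dummy encoding: the factor column is replaced by
# indicator columns, and the first level is absorbed into the intercept.
toy <- data.frame(
  Sex = factor(c("female", "male", "male", "female")),
  Age = c(22, 38, 26, 35)
)
encoded <- model.matrix(~ Sex + Age, data = toy)
colnames(encoded)     # "(Intercept)" "Sexmale" "Age"

# The interaction column is likewise built from the generated indicator,
# not from a column called Sex.
interacted <- model.matrix(~ Sex * Age, data = toy)
colnames(interacted)  # "(Intercept)" "Sexmale" "Age" "Sexmale:Age"
```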

Extracting LM stats into a table

I have made a graph that displays r2, p-value and equation from linear regressions in the top left corner using stat_poly_eq.
Now I wish to have the stats from the linear regression extracted into a table.
For example, in the mtcars dataset, if I want to run a linear regression of hp against disp for each cylinder group (i.e. 4, 6, 8) and then extract the regression stats into a table, how could I do that?
Thanks!
Here's the graph I have:
library(ggplot2)
library(ggpmisc)
formula <- y~x
ggplot(mtcars, aes(disp, hp)) +
geom_point() +
geom_smooth(method = "lm",formula = formula) +
theme_bw()+
facet_wrap(~cyl, scales = "free")+
stat_poly_eq(
aes(label = paste(stat(adj.rr.label), stat(eq.label), stat(p.value.label), sep = "*\", \"*")),
formula = formula, parse = TRUE, size=3)
Do you mean something like this?
With nest_by, nest the remaining columns into a separate tibble for each value of cyl.
With summarise, fit an lm for each group. The model has to be wrapped in a list so it can be stored in a column.
Treat that column like a normal list and use map to calculate what you need: the coefficients (extractable with broom::tidy) and the adjusted R-squared (with summary(.)$adj.r.squared).
unnest the result of broom::tidy to get a single tibble.
library(dplyr)
library(tidyr)
library(purrr)
mtcars %>%
nest_by(cyl) %>%
summarise(mdl = list(lm(hp ~ disp, data)), .groups = "drop") %>%
mutate(adjrsquared = map_dbl(mdl, ~summary(.)$adj.r.squared ),
mdl = map(mdl, broom::tidy)) %>%
unnest(mdl)
#> # A tibble: 6 x 7
#> cyl term estimate std.error statistic p.value adjrsquared
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 (Intercept) 47.0 25.3 1.86 0.0960 0.0988
#> 2 4 disp 0.339 0.234 1.45 0.182 0.0988
#> 3 6 (Intercept) 177. 42.0 4.22 0.00829 0.117
#> 4 6 disp -0.300 0.224 -1.34 0.238 0.117
#> 5 8 (Intercept) 178. 77.4 2.30 0.0405 -0.0682
#> 6 8 disp 0.0890 0.216 0.413 0.687 -0.0682
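For comparison, the same per-group extraction can be sketched in base R with split and lapply. This is an alternative under the assumption that only the intercept, slope, slope p-value, and R-squared values are wanted; the object names models and lm_stats are illustrative.

```r
# Fit hp ~ disp separately within each cylinder group.
models <- lapply(split(mtcars, mtcars$cyl), function(d) lm(hp ~ disp, data = d))

# Collect one row of summary statistics per group.
lm_stats <- do.call(rbind, lapply(names(models), function(k) {
  s <- summary(models[[k]])
  data.frame(
    cyl           = as.numeric(k),
    intercept     = coef(s)["(Intercept)", "Estimate"],
    slope         = coef(s)["disp", "Estimate"],
    slope_p_value = coef(s)["disp", "Pr(>|t|)"],
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared
  )
}))
lm_stats
```

This gives one row per group instead of one row per coefficient, which may be closer to what the plot labels show.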

How to run many regressions across rows and columns with vectorization

I want to run a series of linear regressions for multiple groups across columns. For the group stratification across rows, I can use the idea suggested here (Fitting several regression models with dplyr). In addition, I need to repeat the regression with a different column as a covariate each time. See below the code I wrote with a loop. I wonder whether I can do both in a vectorized manner, using map from the purrr package together with group_by from the dplyr package, and export the estimated beta coefficients and p-values accordingly.
library(dplyr)
library(broom)
head(mtcars)
vec<-names(mtcars)[3:9]
data=NULL
for (i in 1:length(vec)){
df<-mtcars%>%
group_by(cyl)%>%
do( fit = lm( paste('mpg ~disp+',vec[i]), data = .))
dfCoef = tidy(df, fit)
res<-dfCoef %>%
filter(term=='disp')
res$con=vec[i]
data=bind_rows(data,res)
}
data
Using tidyr::(un)nest to perform the regressions by group, together with a helper function, this could be achieved like so:
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
vec <- names(mtcars)[3:9]
lm_help <- function(vec) {
mtcars %>%
tidyr::nest(data = -cyl) %>%
mutate(con = vec,
fit = purrr::map(data, lm, formula = as.formula(paste0("mpg ~ disp + ", vec))),
tidy = purrr::map(fit, tidy)) %>%
select(cyl, con, tidy) %>%
tidyr::unnest(tidy) %>%
filter(term == "disp")
}
purrr::map(vec, lm_help) %>%
bind_rows()
#> # A tibble: 21 x 7
#> cyl con term estimate std.error statistic p.value
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 disp disp 0.00361 0.0156 0.232 0.826
#> 2 4 disp disp -0.135 0.0332 -4.07 0.00278
#> 3 8 disp disp -0.0196 0.00932 -2.11 0.0568
#> 4 6 hp disp 0.00180 0.0202 0.0890 0.933
#> 5 4 hp disp -0.120 0.0369 -3.24 0.0120
#> 6 8 hp disp -0.0186 0.00946 -1.97 0.0746
#> 7 6 drat disp 0.0224 0.0292 0.770 0.484
#> 8 4 drat disp -0.133 0.0406 -3.27 0.0114
#> 9 8 drat disp -0.0196 0.00977 -2.01 0.0697
#> 10 6 wt disp 0.0191 0.0109 1.75 0.154
#> # ... with 11 more rows
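The group-by-covariate structure can also be made explicit with a grid of combinations. The following base R sketch is one possible alternative, not the answer's method; the helper name disp_row is hypothetical. Note that the covariate vector includes disp itself, so for that entry the model reduces to mpg ~ disp.

```r
covariates <- names(mtcars)[3:9]

# One row per (group, covariate) combination: 3 groups x 7 covariates.
grid <- expand.grid(cyl = sort(unique(mtcars$cyl)), con = covariates,
                    stringsAsFactors = FALSE)

# Hypothetical helper: estimate and p-value of disp for one combination.
disp_row <- function(group, covariate) {
  fit <- lm(reformulate(c("disp", covariate), response = "mpg"),
            data = mtcars[mtcars$cyl == group, ])
  coef(summary(fit))["disp", c("Estimate", "Pr(>|t|)")]
}

# Apply the helper to every row of the grid and attach the results.
res <- t(mapply(disp_row, grid$cyl, grid$con))
grid$estimate <- res[, "Estimate"]
grid$p.value  <- res[, "Pr(>|t|)"]
head(grid)
```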

Combining tidyverse + survey [R]: How to use svyglm in Nest-Map-Unnest-Chain?

I am currently struggling to run weighted regression models on multiple variables in R.
When using (non-weighted) glm, I was successful by running the following:
mtcars_1 <- mtcars %>%
nest(-gear)%>%
mutate(model_0 = map(data, ~ glm(vs ~ drat, family = "binomial", data = .)))%>%
mutate(model_0_tidy = map(model_0, tidy))%>%
select(gear, model_0_tidy)%>%
ungroup()%>%
unnest(model_0_tidy)
That is I receive the following:
# A tibble: 6 x 6
gear term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) -15.3 22.6 -0.677 0.499
2 4 drat 4.26 5.76 0.740 0.459
3 3 (Intercept) -3.91 7.39 -0.529 0.597
4 3 drat 0.801 2.32 0.345 0.730
5 5 (Intercept) 5.20 14.4 0.362 0.718
6 5 drat -1.71 3.77 -0.453 0.651
However, when I would like to weight my observations and thus use svyglm from the survey-package, nesting does not work.
This was my approach:
design_0 <- svydesign(ids=~0, data = mtcars, weights = mtcars$wt)
mtcars_2 <- mtcars%>%
nest(-gear)%>%
mutate(model_1 = map(data, ~ svyglm(vs ~ drat, family = quasibinomial(logit), design = design_0, data = .)))%>%
mutate(model_1_tidy = map(model_1, tidy))%>%
select(gear, model_1_tidy)%>%
ungroup()%>%
unnest(model_1_tidy)
# Assuming that wt serves as a frequency weight
# Outcome
gear term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) -8.12 3.88 -2.09 0.0451
2 4 drat 2.12 1.07 1.99 0.0554
3 3 (Intercept) -8.12 3.88 -2.09 0.0451
4 3 drat 2.12 1.07 1.99 0.0554
5 5 (Intercept) -8.12 3.88 -2.09 0.0451
6 5 drat 2.12 1.07 1.99 0.0554
Estimates for each type of gear (that is 3, 4, 5) turn out to be the same.
It appears as if nesting was essentially ignored here.
Are there any solutions for combining svyglm with nest-map-unnest? Or will I have to look for other, less comfortable ways?
Thank you!
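Try it this way: create the survey design inside the pipeline, once per nested group, and pass that design to svyglm. In your attempt, design_0 was built from the full mtcars, so every group was fit to the same data.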
mtcars%>%
nest(-gear) %>%
mutate(design = map(data, ~ svydesign(ids=~0, data = .x, weights = ~ wt)),
model = map(.x = design,
.f = ~ svyglm(vs ~ drat,
family = quasibinomial(logit),
design = .x))) %>%
mutate(model_tidy = map(model, tidy)) %>%
select(gear, model_tidy)%>%
ungroup()%>%
unnest(model_tidy)

How to pipe the tidy-ed lm model CI's into ggplot2?

I have the following code that computes, for every year between 1961 and 2018, the effects of two predictor variables, base on balls per game (BB) and home runs per game (HR), on the response variable runs per game (R):
rm(list = ls())
library(dbplyr)
library(tidyverse)
library(broom)
library(Lahman)
fit <- Teams %>%
filter(yearID %in% 1961:2018) %>%
mutate(BB = BB / G,
HR = HR / G,
R = R / G) %>%
group_by(yearID) %>%
do(tidy(lm(R ~ BB + HR, data = .), conf.int = TRUE)) %>% filter(term=="BB")
fit
> fit
# A tibble: 58 x 8
# Groups: yearID [58]
yearID term estimate std.error statistic p.value conf.low conf.high
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 BB 0.0845 0.168 0.502 0.623 -0.274 0.443
2 1962 BB 0.142 0.273 0.520 0.610 -0.434 0.718
3 1963 BB 0.339 0.242 1.40 0.178 -0.171 0.849
4 1964 BB -0.105 0.302 -0.349 0.731 -0.742 0.532
5 1965 BB 0.235 0.253 0.928 0.366 -0.299 0.768
6 1966 BB 0.104 0.216 0.482 0.636 -0.351 0.559
7 1967 BB 0.0660 0.223 0.296 0.771 -0.405 0.537
8 1968 BB -0.199 0.203 -0.983 0.340 -0.627 0.229
9 1969 BB 0.153 0.163 0.942 0.357 -0.185 0.492
10 1970 BB 0.239 0.157 1.52 0.143 -0.0874 0.566
# ... with 48 more rows
I now would like to output this "fit" which is actually a tibble (or modernized data frame) into ggplot to show the estimates per year as points but also the regression line along with the CI's computed by the lm model and not simply recomputing it with geom_smooth(method = "lm").
I have tried the following without success. I know that the augment from broom should operate on the lm model output directly and therefore the following code is wrong but it illustrates what I'm trying to achieve:
augment(fit) %>%
ggplot() +
geom_point(aes(yearID, estimate)) +
geom_line(aes(yearID, .fitted), col = "blue")
How can I do that without "cheating" (double computing the lm once and then on the ggplot as well) and doing:
fit %>% ggplot(aes(yearID,estimate)) + geom_point() + geom_smooth(method = "lm")
I took a similar route to Patrick, using map() and nest():
library(tidyverse)
library(broom)
library(Lahman)
library(magrittr)
fit <- Teams %>%
filter(yearID %in% 1961:2018) %>%
mutate(
BB = BB / G,
HR = HR / G,
R = R / G
) %>%
nest(data = -yearID) %>%
mutate(
model = map(data, ~ lm(R ~ BB + HR, .x)), # apply model to all nested groups
m_tidy = map(model, tidy), # tidy up
est = map_dbl(m_tidy, ~ .x %>% # pull BB estimate from each group
filter(term == "BB") %>%
pull(estimate)),
)
Now at this point you could just %$% right into this next portion, but I've kept them separate here to talk about mimicking the confidence interval properly. The geom_smooth() confidence interval is based on the t-distribution and not the normal distribution. Thus, we'll have to do a bit of extra work to get our intervals to match:
fit %$%
lm(est ~ yearID) %>%
augment(se_fit = TRUE) %>% # recent broom versions only return .se.fit on request
mutate(m.se.fit = .se.fit * qt(1 - (1 - 0.95) / 2, df = nrow(fit) - 2)) %>% # 95% conf int, residual df = n - 2
ggplot(aes(yearID, est)) +
geom_point() +
geom_line(aes(y = .fitted), col = "blue") +
geom_ribbon(aes(ymin = .fitted - m.se.fit, ymax = .fitted + m.se.fit), alpha = .2)
This plot essentially mirrors the desired plot:
fit %>% ggplot(aes(yearID, est)) +
geom_point() +
geom_smooth(method = "lm")
Created on 2019-10-23 by the reprex package (v0.3.0)
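The t-based interval described above can be checked in base R. The sketch below uses synthetic data as a stand-in for the per-year estimates (an assumption made so the example runs without Lahman; the numbers are not the real results) and compares the interval returned by predict with one built by hand from the standard error and a t quantile on the residual degrees of freedom.

```r
# Synthetic stand-in for the per-year estimates (NOT the Lahman results).
set.seed(42)
toy <- data.frame(yearID = 1961:2018)
toy$est <- 0.1 + 0.002 * (toy$yearID - 1961) + rnorm(nrow(toy), sd = 0.1)

m <- lm(est ~ yearID, data = toy)

# Interval computed by lm itself.
band <- predict(m, interval = "confidence", level = 0.95)

# The same interval by hand: standard error times a t quantile
# with the residual degrees of freedom (n - 2 for one predictor).
p <- predict(m, se.fit = TRUE)
half_width <- p$se.fit * qt(0.975, df = p$df)
manual <- cbind(lwr = p$fit - half_width, upr = p$fit + half_width)

# TRUE when the two constructions agree.
all.equal(unname(band[, c("lwr", "upr")]), unname(manual))
```

The lwr and upr columns of band could be bound to the data and passed to geom_ribbon, which avoids refitting the model inside ggplot.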
You can try the map functions from the purrr package, which is included in the tidyverse. One possible solution to the problem you describe is listed below. It should also be possible with lapply if you are not that familiar with the purrr package.
library(tidyverse)
library(broom)
library(Lahman)
fit <- Teams %>%
filter(yearID %in% 1961:2018) %>%
mutate(BB = BB / G,
HR = HR / G,
R = R / G) %>%
group_by(yearID) %>%
# consolidate your data
nest() %>%
# creates new nested column with your regression data
mutate(model = map(data, function(df)
tidy(lm(R ~ BB + HR, data = df), conf.int = TRUE) %>%
filter(term=="BB")
),
# extract the column estimate
model_est = map_dbl(model, function(df)
df %>% pull(estimate)
),
# extract the column conf.low
model_conf.low = map_dbl(model, function(df)
df %>% pull(conf.low)
),
# extract the column conf.high
model_conf.high = map_dbl(model, function(df)
df %>% pull(conf.high)
)
)
fit %>% ggplot(aes(yearID,model_est)) + geom_point() +
geom_line(aes(yearID, model_conf.low)) +
geom_line(aes(yearID, model_conf.high))
