Using purrr to map right-hand-side variables to regression functions - r

I have a large number of regression models specified by reference to right-hand-side variables, and I want to use purrr to generate a set of lists with the models. Here is an example of the desired end result using toy data:
m.1.1 <- "cyl"
m.1.2 <- paste(c(m.1.1, "disp"), collapse = " + ")
m.1.1_reg <- lm(
data = mtcars,
formula = as.formula(paste0("mpg ~ ", m.1.1)))
m.1.2_reg <- lm(
data = mtcars,
formula = as.formula(paste0("mpg ~ ", m.1.2)))
How can I achieve the same outcome (i.e., a list named m.1.1_reg and another list named m.1.2_reg) using purrr?

Just loop over the 'm.1.' objects in map and create the formula
library(purrr)
out <- map(dplyr::lst(m.1.1, m.1.2),
~ lm(data = mtcars, formula = as.formula(paste0("mpg ~ ", .x))))
-checking the names
> names(out)
[1] "m.1.1" "m.1.2"

Related

lapply for multiple lmer models [duplicate]

I have seen an example of list apply (lapply) that works nicely to take a list of data objects,
and return a list of regression output, which we can pass to Stargazer for nicely formatted output.
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd)lm(crim ~ indus,data=dd))
stargazer(fit, type = "text")
What i would like to do is, instead of passing a list of datasets to do the same regression on each data set (as above),
pass a list of independent variables to do different regressions on the same data set. In long hand it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
with lapply, i tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar)lm(dvar ~ indus,data=Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from string:
fit <- lapply(myvars, function(dvar)
lm(as.formula(paste0(dvar, " ~ indus")),data=Boston))
This should work:
fit <- lapply(myvars, function(dvar) lm(eval(paste0(dvar,' ~ wt')), data = Boston))
You can also use a dplyr & purrr approach, keep everything in a tibble, pull out what you want, when you need it. No difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox","crim"), data = list(Boston))
analysis <- var_tibble %>%
mutate(models = map2(data, vars, ~lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
mutate(tables = map2(models, vars, ~stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
You can also use get():
# make a list of independent variables
list_x <- list("nox","crim")
# create regression function
my_reg <- function(x) { lm(indus ~ get(x), data = Boston) }
# run regression
results <- lapply(list_x, my_reg)

ANOVA on multiple models stored in a tibble

It is possible to use anova on multible models stored in a tibble without listing them manually.
An example prediction of wage from age in Wage dataset from the ISLR2 library. I have a tibble a column for polynomial degrees in one column, GLM models in another and CV errors in the third column.
I can use anova through do.call but it does not show p-values without passing test = 'F' as an argument.
library(ISLR2)
library(tidyverse)
library(boot)
GLM <- function(n) {
result <- glm(wage ~ poly(age, n), data = Wage)
return(result)
}
CV <- function(n) {
glm_fit <- glm(wage ~ poly(age, n), data = Wage)
result <- cv.glm(Wage, glm_fit, K = 10)$delta[1]
return(result)
}
set.seed(1)
models <- tibble(polynom_degrees = 1:10) %>%
mutate(linear_model = map(polynom_degrees, GLM)) %>%
mutate(CV_error = map(polynom_degrees, CV)) %>%
mutate(CV_error = unlist(CV_error))
do.call(anova, models$linear_model)

Function or package to compare model fits side by side?

Is there a more elegant way of doing this?
m1 <- lm(price ~ carat, data = diamonds)
m2 <- lm(price ~ carat + cut, data = diamonds)
m3 <- lm(price ~ carat + cut + depth, data = diamonds)
m1r2 <- summary(m1)$r.squared
m2r2 <- summary(m2)$r.squared
m3r2 <- summary(m3)$r.squared
data.frame(
model = c("m1", "m2", "m3"),
RSqd = c(m1r2, m2r2, m3r2)
)
With caret I often use the following to compare multiple models side by side:
resamples(list_of_models) %>% summary()
Is there a conventional approach to comparing models on fit such as R.Squared, AIC, RSE? As opposed to crudely writing a dataframe in the way I have above?
An option is to use mget
stack(mget(ls(pattern = "^m\\d+r\\d+$")))
From the input 'm's, get thee objects in to a list with mget and apply the summary by looping over the list and extract the r.squared
lapply(mget(ls(pattern = "^m\\d+$")), function(x) summary(x)$r.squared)
Also, this can be done with reformulate by passing the independent variables in a list
lapply(list('carat', c('carat', 'cut'), c('carat', 'cut', 'depth')),
function(nm) summary(lm(reformulate(nm, 'price'),
data = diamonds))$r.squared)
If we want to get multiple components
library(broom)
lapply(mget(ls(pattern = "^m\\d+$")), glance)
Here is a similar more tidyverse-based approach.
With only three variables it won't save you much typing, but once you have a two digit number of variables the 'many models approach' is really convenient.
library(dplyr)
library(purrr)
library(broom)
library(ggplot2)
reg_vars <- c("carat", "cut", "depth")
tibble(id = 1:3) %>%
mutate(equ = map(id, ~ reformulate(reg_vars[1:.x], response = "price")),
mod = map(equ, ~ lm(.x, data = diamonds)),
res = map(mod, glance)) %>%
pull(res) %>%
bind_rows(., .id = "model")
You could also try texreg package.
library(texreg)
screenreg(list(m1, m2, m3))

Use lapply for multiple regression with formula changing, not the dataset

I have seen an example of list apply (lapply) that works nicely to take a list of data objects,
and return a list of regression output, which we can pass to Stargazer for nicely formatted output.
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd)lm(crim ~ indus,data=dd))
stargazer(fit, type = "text")
What i would like to do is, instead of passing a list of datasets to do the same regression on each data set (as above),
pass a list of independent variables to do different regressions on the same data set. In long hand it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
with lapply, i tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar)lm(dvar ~ indus,data=Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from string:
fit <- lapply(myvars, function(dvar)
lm(as.formula(paste0(dvar, " ~ indus")),data=Boston))
This should work:
fit <- lapply(myvars, function(dvar) lm(eval(paste0(dvar,' ~ wt')), data = Boston))
You can also use a dplyr & purrr approach, keep everything in a tibble, pull out what you want, when you need it. No difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox","crim"), data = list(Boston))
analysis <- var_tibble %>%
mutate(models = map2(data, vars, ~lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
mutate(tables = map2(models, vars, ~stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
You can also use get():
# make a list of independent variables
list_x <- list("nox","crim")
# create regression function
my_reg <- function(x) { lm(indus ~ get(x), data = Boston) }
# run regression
results <- lapply(list_x, my_reg)

How to pass a large amount of models to gather_predictions

In the modelr package the function gather_predictions can be used to add predictions from multiple models to a data frame, I'm however unsure on how to specify these models in the function call. The help documentation gives the following exmaple:
df <- tibble::data_frame(
x = sort(runif(100)),
y = 5 * x + 0.5 * x ^ 2 + 3 + rnorm(length(x))
)
m1 <- lm(y ~ x, data = df)
grid <- data.frame(x = seq(0, 1, length = 10))
grid %>% add_predictions(m1)
m2 <- lm(y ~ poly(x, 2), data = df)
grid %>% spread_predictions(m1, m2)
grid %>% gather_predictions(m1, m2)
here the models are specifically mentioned in the function call. That works fine if we have a few models we want predictions for, but what if we have a large or unknown amount of models? In this case manually specifying the models isn't really workable anymore.
the way the help documentation phrases the arguments segment seems to suggest you need to add every model as a separate argument.
gather_predictions and spread_predictions take multiple models. The
name will be taken from either the argument name of the name of the
model.
And for example inputting a list of models into gather_predictions doesn't work.
Is there some easy way to input a list / large amount of models to gather_predictions?
example for 10 models in a list:
modelslist <- list()
for (N in 1:10) {
modelslist[[N]] <- lm(y ~ poly(x, N), data = df)
}
If having the models stored some other way than a list works better, that's fine as well.
m <- grid %>% gather_predictions(lm(y ~ poly(x, 1), data = df))
for (N in 2:10) {
m <- rbind(m, grid %>% gather_predictions(lm(y ~ poly(x, N), data = df)))
}
There are workarounds to solve this problem. My approach was to:
1. build a list of models with specific names
2. use a tweaked version of modelr::gather_predictions() to apply all models in the list to data
# prerequisites
library(tidyverse)
set.seed(1363)
# I'll use generic name 'data' throughout the code, so you can easily try other datasets.
# for this example I'll use your data df
data=df
# data visualization
ggplot(data, aes(x, y)) +
geom_point(size=3)
your sample data
# build a list of models
models <-vector("list", length = 5)
model_names <- vector("character", length=5)
for (i in 1:5) {
modelformula <- str_c("y ~ poly(x,", i, ")", sep="")
models[[i]] <- lm(as.formula(modelformula), data = data)
model_names[[i]] <- str_c('model', i) # remember we name the models here sequantially
}
# apply names to the models list
names(models) <- model_names
# this is modified verison of modelr::gather_predictions() in order to accept list of models
gather.predictions <- function (data, models, .pred = "pred", .model = "model")
{
df <- map2(models, .pred, modelr::add_predictions, data = data)
names(df) <- names(models)
bind_rows(df, .id = .model)
}
# the rest is the same as modelr's function...
grids <- gather.predictions(data = data, models = models, .pred = "y")
ggplot(data, aes(x, y)) +
geom_point() +
geom_line(data = grids, colour = "red") +
facet_wrap(~ model)
example of polynomial models (degree 1:5) applied to your sample data
side note: there are good reasons why I chose strings to build the model...to discuss.

Resources