Function or package to compare model fits side by side? - r

Is there a more elegant way of doing this?
m1 <- lm(price ~ carat, data = diamonds)
m2 <- lm(price ~ carat + cut, data = diamonds)
m3 <- lm(price ~ carat + cut + depth, data = diamonds)
m1r2 <- summary(m1)$r.squared
m2r2 <- summary(m2)$r.squared
m3r2 <- summary(m3)$r.squared
data.frame(
model = c("m1", "m2", "m3"),
RSqd = c(m1r2, m2r2, m3r2)
)
With caret I often use the following to compare multiple models side by side:
resamples(list_of_models) %>% summary()
Is there a conventional approach to comparing models on fit such as R.Squared, AIC, RSE? As opposed to crudely writing a dataframe in the way I have above?

An option is to use mget
stack(mget(ls(pattern = "^m\\d+r\\d+$")))
From the input 'm's, get thee objects in to a list with mget and apply the summary by looping over the list and extract the r.squared
lapply(mget(ls(pattern = "^m\\d+$")), function(x) summary(x)$r.squared)
Also, this can be done with reformulate by passing the independent variables in a list
lapply(list('carat', c('carat', 'cut'), c('carat', 'cut', 'depth')),
function(nm) summary(lm(reformulate(nm, 'price'),
data = diamonds))$r.squared)
If we want to get multiple components
library(broom)
lapply(mget(ls(pattern = "^m\\d+$")), glance)

Here is a similar more tidyverse-based approach.
With only three variables it won't save you much typing, but once you have a two digit number of variables the 'many models approach' is really convenient.
library(dplyr)
library(purrr)
library(broom)
library(ggplot2)
reg_vars <- c("carat", "cut", "depth")
tibble(id = 1:3) %>%
mutate(equ = map(id, ~ reformulate(reg_vars[1:.x], response = "price")),
mod = map(equ, ~ lm(.x, data = diamonds)),
res = map(mod, glance)) %>%
pull(res) %>%
bind_rows(., .id = "model")

You could also try texreg package.
library(texreg)
screenreg(list(m1, m2, m3))

Related

lapply for multiple lmer models [duplicate]

I have seen an example of list apply (lapply) that works nicely to take a list of data objects,
and return a list of regression output, which we can pass to Stargazer for nicely formatted output.
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd)lm(crim ~ indus,data=dd))
stargazer(fit, type = "text")
What i would like to do is, instead of passing a list of datasets to do the same regression on each data set (as above),
pass a list of independent variables to do different regressions on the same data set. In long hand it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
with lapply, i tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar)lm(dvar ~ indus,data=Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from string:
fit <- lapply(myvars, function(dvar)
lm(as.formula(paste0(dvar, " ~ indus")),data=Boston))
This should work:
fit <- lapply(myvars, function(dvar) lm(eval(paste0(dvar,' ~ wt')), data = Boston))
You can also use a dplyr & purrr approach, keep everything in a tibble, pull out what you want, when you need it. No difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox","crim"), data = list(Boston))
analysis <- var_tibble %>%
mutate(models = map2(data, vars, ~lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
mutate(tables = map2(models, vars, ~stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
You can also use get():
# make a list of independent variables
list_x <- list("nox","crim")
# create regression function
my_reg <- function(x) { lm(indus ~ get(x), data = Boston) }
# run regression
results <- lapply(list_x, my_reg)

Using purrr to map right-hand-side variables to regression functions

I have a large number of regression models specified by reference to right-hand-side variables, and I want to use purrr to generate a set of lists with the models. Here is an example of the desired end result using toy data:
m.1.1 <- "cyl"
m.1.2 <- paste(c(m.1.1, "disp"), collapse = " + ")
m.1.1_reg <- lm(
data = mtcars,
formula = as.formula(paste0("mpg ~ ", m.1.1)))
m.1.2_reg <- lm(
data = mtcars,
formula = as.formula(paste0("mpg ~ ", m.1.2)))
How can I achieve the same outcome (i.e., a list named m.1.1_reg and another list named m.1.2_reg) using purrr?
Just loop over the 'm.1.' objects in map and create the formula
library(purrr)
out <- map(dplyr::lst(m.1.1, m.1.2),
~ lm(data = mtcars, formula = as.formula(paste0("mpg ~ ", .x))))
-checking the names
> names(out)
[1] "m.1.1" "m.1.2"

R: Predictions from a list of coxph objects on newdata

I am building a series of Cox regression models, and getting predictions from those models on new data. I am able to get the expected number of events in some cases, but not others.
For example, if the formula in the coxph call is written out, then the predictions are calculated. But, if the the formula is stored in an object and that object called, I get an error. I also cannot get the predictions if I try to create them within a dplyr piped mutate function (for the function I am writing, this would be the most ideal place to get the predictions to work properly).
Any assistance is greatly appreciated!
Thank you,
Daniel
require(survival)
require(tidyverse)
n = 15
# creating tibble of tibbles.
results =
tibble(id = 1:n) %>%
group_by(id) %>%
do(
# creating tibble to evaluate model on
tbl0 = tibble(time = runif(n), x = runif(n)),
# creating tibble to build model on
tbl = tibble(time = runif(n), x = runif(n))
) %>%
ungroup
#it works when the formula is added the the coxph function already written out
map2(results$tbl, results$tbl0, ~ predict(coxph( Surv(time) ~ x, data = .x), newdata = .y, type = "expected"))
#but if the formula is previously defined, I get an error
f = as.formula(Surv(time) ~ x)
map2(results$tbl, results$tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
# I also get an error when I try to include in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
)
I figured it out (with the help of a friend). If you define the formula as a string, and within the function call coerce it to a formula everything runs smoothly. I am not sure why it works, but it does!
#define the formula as a string, and call it in the function with as.formula(.)
f = "Surv(time) ~ x"
map2(results$tbl, results$tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
#also works in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
)

Use lapply for multiple regression with formula changing, not the dataset

I have seen an example of list apply (lapply) that works nicely to take a list of data objects,
and return a list of regression output, which we can pass to Stargazer for nicely formatted output.
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd)lm(crim ~ indus,data=dd))
stargazer(fit, type = "text")
What i would like to do is, instead of passing a list of datasets to do the same regression on each data set (as above),
pass a list of independent variables to do different regressions on the same data set. In long hand it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
with lapply, i tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar)lm(dvar ~ indus,data=Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from string:
fit <- lapply(myvars, function(dvar)
lm(as.formula(paste0(dvar, " ~ indus")),data=Boston))
This should work:
fit <- lapply(myvars, function(dvar) lm(eval(paste0(dvar,' ~ wt')), data = Boston))
You can also use a dplyr & purrr approach, keep everything in a tibble, pull out what you want, when you need it. No difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox","crim"), data = list(Boston))
analysis <- var_tibble %>%
mutate(models = map2(data, vars, ~lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
mutate(tables = map2(models, vars, ~stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
You can also use get():
# make a list of independent variables
list_x <- list("nox","crim")
# create regression function
my_reg <- function(x) { lm(indus ~ get(x), data = Boston) }
# run regression
results <- lapply(list_x, my_reg)

How to pass a large amount of models to gather_predictions

In the modelr package the function gather_predictions can be used to add predictions from multiple models to a data frame, I'm however unsure on how to specify these models in the function call. The help documentation gives the following exmaple:
df <- tibble::data_frame(
x = sort(runif(100)),
y = 5 * x + 0.5 * x ^ 2 + 3 + rnorm(length(x))
)
m1 <- lm(y ~ x, data = df)
grid <- data.frame(x = seq(0, 1, length = 10))
grid %>% add_predictions(m1)
m2 <- lm(y ~ poly(x, 2), data = df)
grid %>% spread_predictions(m1, m2)
grid %>% gather_predictions(m1, m2)
here the models are specifically mentioned in the function call. That works fine if we have a few models we want predictions for, but what if we have a large or unknown amount of models? In this case manually specifying the models isn't really workable anymore.
the way the help documentation phrases the arguments segment seems to suggest you need to add every model as a separate argument.
gather_predictions and spread_predictions take multiple models. The
name will be taken from either the argument name of the name of the
model.
And for example inputting a list of models into gather_predictions doesn't work.
Is there some easy way to input a list / large amount of models to gather_predictions?
example for 10 models in a list:
modelslist <- list()
for (N in 1:10) {
modelslist[[N]] <- lm(y ~ poly(x, N), data = df)
}
If having the models stored some other way than a list works better, that's fine as well.
m <- grid %>% gather_predictions(lm(y ~ poly(x, 1), data = df))
for (N in 2:10) {
m <- rbind(m, grid %>% gather_predictions(lm(y ~ poly(x, N), data = df)))
}
There are workarounds to solve this problem. My approach was to:
1. build a list of models with specific names
2. use a tweaked version of modelr::gather_predictions() to apply all models in the list to data
# prerequisites
library(tidyverse)
set.seed(1363)
# I'll use generic name 'data' throughout the code, so you can easily try other datasets.
# for this example I'll use your data df
data=df
# data visualization
ggplot(data, aes(x, y)) +
geom_point(size=3)
your sample data
# build a list of models
models <-vector("list", length = 5)
model_names <- vector("character", length=5)
for (i in 1:5) {
modelformula <- str_c("y ~ poly(x,", i, ")", sep="")
models[[i]] <- lm(as.formula(modelformula), data = data)
model_names[[i]] <- str_c('model', i) # remember we name the models here sequantially
}
# apply names to the models list
names(models) <- model_names
# this is modified verison of modelr::gather_predictions() in order to accept list of models
gather.predictions <- function (data, models, .pred = "pred", .model = "model")
{
df <- map2(models, .pred, modelr::add_predictions, data = data)
names(df) <- names(models)
bind_rows(df, .id = .model)
}
# the rest is the same as modelr's function...
grids <- gather.predictions(data = data, models = models, .pred = "y")
ggplot(data, aes(x, y)) +
geom_point() +
geom_line(data = grids, colour = "red") +
facet_wrap(~ model)
example of polynomial models (degree 1:5) applied to your sample data
side note: there are good reasons why I chose strings to build the model...to discuss.

Resources