R code for looping through regression models, starting from Stata code - r

New R user here, coming from Stata. Most of my work consists of running several regression models with different combinations of dependent and independent variables and storing the results. For this, I make extensive use of macros and loops, which to my understanding are not preferred in R.
Using the "mtcars" dataset as an example, and assuming I'm interested in using mpg, disp and wt as dependent variables, hp and carb as independent variables and adjusting all models for vs, am and gear, in Stata I would do something like this:
local depvars mpg disp wt // create list of dependent variables
local indepvars hp carb // create list of independent variables
local confounders vs am gear // create list of control variables
foreach depvar of local depvars {
foreach indepvar of local indepvars {
reg `depvar' `indepvar' `confounders'
estimates store `depvar'_`indepvar'
}
}
Is there a way to do it in R? Potentially using the tidyverse approach which I'm starting to get familiar with?

This R code mirrors your Stata loop:
depvars <- c('mpg', 'disp', 'wt')
indepvars <- c('hp', 'carb')
confounders <- c('vs', 'am', 'gear')
for (i in seq(length(depvars))) {
for (j in seq(length(indepvars))) {
my_model <- lm(as.formula(paste(depvars[i], "~", paste(c(indepvars[j], confounders), collapse = "+"))), data = mtcars)
assign(paste0(depvars[i], "_", indepvars[j]), my_model)
}
}
or with shorter code:
for (i in seq_along(depvars)) {
for (j in seq_along(indepvars)) {
assign(paste0(depvars[i], "_", indepvars[j]), lm(as.formula(paste(depvars[i], "~", paste(c(indepvars[j], confounders), collapse = "+"))), data = mtcars))
}
}
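Because assign() scatters the fitted models across the workspace, a common alternative is to keep them in one named list. The following is a minimal base R sketch, not the answerer's code: it assumes the same depvars, indepvars and confounders vectors as above, and the names grid, models and coefs are only illustrative.

```r
depvars     <- c("mpg", "disp", "wt")
indepvars   <- c("hp", "carb")
confounders <- c("vs", "am", "gear")

# One row per (dependent, independent) combination
grid <- expand.grid(depvar = depvars, indepvar = indepvars,
                    stringsAsFactors = FALSE)

# Fit one model per row; reformulate() builds e.g. mpg ~ hp + vs + am + gear
models <- Map(
  function(depvar, indepvar) {
    lm(reformulate(c(indepvar, confounders), response = depvar), data = mtcars)
  },
  grid$depvar, grid$indepvar
)
# Name the list elements like the Stata stored estimates, e.g. "mpg_hp"
names(models) <- paste(grid$depvar, grid$indepvar, sep = "_")

# Collect estimates and p-values from every model in one data frame
coefs <- do.call(rbind, lapply(names(models), function(nm) {
  est <- coef(summary(models[[nm]]))
  data.frame(model = nm, term = rownames(est),
             estimate = est[, "Estimate"], p_value = est[, "Pr(>|t|)"],
             row.names = NULL)
}))
head(coefs)
```

A single model is then available as models$mpg_hp or models[["mpg_hp"]], and the whole set can be passed to lapply() for further processing.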

Related

Relative importance for several groups in R

How do I calculate the relative importance using relaimpo package in R when I want to run it for several groups? As an example, in the mtcars dataframe I want to calculate the relative importance of several variables on mpg for every cyl. I calculated the relative importance of the variables on mpg, but I don't know how to make it per group. I tried to insert group_by(cyl) but I did not succeed. How would I do that in R?
library(relaimpo)
df <- mtcars
model <- lm(mpg ~ disp + hp + drat + wt, data=df)
rel_importance = calc.relimp(model, type = "lmg", rela=TRUE)
rel_importance
I'm not familiar with this package but in general if you want to apply a function by group in R you can split the data frame into a list of one data frame per group, and then apply the function to each element of the list.
In this case:
cyl_list <- split(df, df$cyl)
rel_importance_cyl <- lapply(
cyl_list,
\(df) {
model <- lm(mpg ~ disp + hp + drat + wt, data = df)
calc.relimp(model, type = "lmg", rela = TRUE)
}
)
names(rel_importance_cyl) # "4" "6" "8"
You can access this list either by name (e.g. rel_importance_cyl[["4"]]) or by index (e.g. rel_importance_cyl[[1]]), to see the values for each group.
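To compare groups side by side, the per-group results can be collected into one object. The sketch below reuses the same split/lapply pattern but extracts plain lm coefficients with coef() instead of calling relaimpo, so it runs without that package; the names fits_by_cyl and coef_by_cyl are illustrative.

```r
# Split mtcars by cyl and fit the same model in each subset
cyl_list <- split(mtcars, mtcars$cyl)
fits_by_cyl <- lapply(cyl_list, function(d) {
  lm(mpg ~ disp + hp + drat + wt, data = d)
})

# sapply() simplifies the per-group coefficient vectors into a matrix:
# one row per term, one column per cyl group
coef_by_cyl <- sapply(fits_by_cyl, coef)
coef_by_cyl
```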

Iterating and looping over multiple columns in glm in r using a name from another variable

I am trying to iterate over multiple columns for a glm function in R.
view(mtcars)
names <- names(mtcars[-c(1,2)])
for(i in 1:length(names)){
print(paste0("Starting iterations for ",names[i]))
model <- glm(mpg ~ cyl + paste0(names[i]), data=mtcars, family = gaussian())
summary(model)
print(paste0("Iterations for ",names[i], " finished"))
}
however, I am getting the following error:
[1] "Starting iterations for disp"
Error in model.frame.default(formula = mpg ~ cyl + paste0(names[i]), data = mtcars, :
variable lengths differ (found for 'paste0(names[i])')
Not sure, how I can correct this.
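mpg ~ cyl + paste0(names[i]), or even mpg ~ cyl + names[i], does not build the formula you want: the string is not substituted as a column name, so R treats the length-1 character vector itself as a variable, which is why the variable lengths differ. Use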
reformulate(c("cyl", names[i]), "mpg")
instead, which dynamically creates a formula from variable names.
Since you need to build your model formula dynamically from strings, you need as.formula. Alternatively, consider reformulate, which takes the response and right-hand-side variable names:
...
fml <- reformulate(c("cyl", names[i]), "mpg")
model <- glm(fml, data=mtcars, family = gaussian())
summary(model)
...
glm takes a formula which you can create using as.formula()
predictors <- names(mtcars[-c(1,2)])
for(predictor in predictors){
print(paste0("Starting iterations for ",predictor))
model <- glm(as.formula(paste0("mpg ~ cyl + ",predictor)),
data=mtcars,
family = gaussian())
print(summary(model))
print(paste0("Iterations for ",predictor, " finished"))
}
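If the fitted models are needed after the loop, they can be stored in a named list instead of only being printed. This is a minimal sketch along the lines of the answers above, using reformulate(); the names predictors, fits and aic are illustrative.

```r
# Every column except the response (mpg) and the fixed covariate (cyl)
predictors <- setdiff(names(mtcars), c("mpg", "cyl"))

# One glm per predictor, each of the form mpg ~ cyl + <predictor>
fits <- lapply(predictors, function(p) {
  glm(reformulate(c("cyl", p), response = "mpg"),
      data = mtcars, family = gaussian())
})
names(fits) <- predictors

# Pull one number out of every stored model, here the AIC
aic <- vapply(fits, AIC, numeric(1))
sort(aic)
```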

Best way to report multiple regression models on several dimensions (evolving model formulation and year of data)

Situation
I am fitting a series of evolving regression models. For the purposes of this question, we can think of these models in terms of Model A, Model B, and Model C. All models share at least one same covariate.
I am also fitting these models for two separate years of data. Again, for the purposes of this question, the years will be 2000 and 2010.
In an attempt to simplify the reporting of results, I am attempting to combine the reporting of the regressions into a single table that would have some kind of the following format:
              2000    2010
Model A
  Coef Ex1
Model B
  Coef Ex1
  Coef Ex2
Model C
  Coef Ex1
  Coef Ex2
  Coef Ex3
The idea being that someone can look quickly at Coef Ex1 across several models and years.
What Have I Tried
I have tried to achieve the above table using both the R stargazer and kable packages. With stargazer I can get the fully formatted table for a single model formulation across many years (e.g., stargazer(modelA2000, modelA2010)), but I cannot figure out how to stack additional model formulations on the rows.
For kable I have been able to stack horizontal models, but I have not been able to add in additional years (e.g., coefs <- bind_rows(tidy(modelA2000), tidy(modelB2000), tidy(modelC2000)); coefs %>% kable()).
Question: how can I use stargazer or kable to report evolving regression models (which share the same covariates) in the rows but also with year of cross section on the column? I think I can somehow extend the answer posted here, although I'm not sure how.
Reproducible example
# Load the data
mtcars <- mtcars
# Create example results for models A, B, and C for 2000
modelA2000 <- lm(mpg ~ cyl, data = mtcars)
modelB2000 <- lm(mpg ~ cyl + wt, data = mtcars)
modelC2000 <- lm(mpg ~ cyl + wt + disp, data = mtcars)
# Slightly modify data for second set of results
mtcars$cyl <- mtcars$cyl*runif(1)
# Fit second set of results. Same models, pretending it's a different year.
modelA2010 <- lm(mpg ~ cyl, data = mtcars)
modelB2010 <- lm(mpg ~ cyl + wt, data = mtcars)
modelC2010 <- lm(mpg ~ cyl + wt + disp, data = mtcars)
Two notes before starting:
You want a pretty "custom" table, so it is almost inevitable that some manual operations will be required.
My answer relies on the development version of modelsummary, which you can install like this:
library(remotes)
install_github("vincentarelbundock/modelsummary")
We will need 4 concepts, many of them related to the broom package:
broom::tidy: a function that takes a statistical model and returns a data.frame of estimates with one row per coefficient.
broom::glance: a function that takes a statistical model and returns a one-row data.frame with model characteristics (e.g., number of observations).
modelsummary_list: a list with 2 elements called "tidy" and "glance", and with a class name of "modelsummary_list".
The modelsummary package allows you to draw regression tables. Under the hood, it uses broom::tidy and broom::glance to extract information from those models. Users can also supply their own information about a model by supplying a list to which we assign the class modelsummary_list, as documented here.
EDIT: The recommended way to do this in modelsummary is now to use the group argument. Scroll to the end of this post for illustrative code.
Obsolete example with useful discussion
modelsummary_wide is a function that was initially designed to "stack" results from several models with several groups of coefficients. This is useful for things like multinomial models, but it also helps in your case, where you have multiple models in multiple groups (here: years).
First, we load packages, tweak the data, and estimate our models:
library(modelsummary)
library(broom)
library(dplyr)
mtcars2010 <- mtcars
mtcars2010$cyl <- mtcars$cyl * runif(1)
models <- list(
"A" = list(
lm(mpg ~ cyl, data = mtcars),
lm(mpg ~ cyl, data = mtcars2010)),
"B" = list(
lm(mpg ~ cyl + wt, data = mtcars),
lm(mpg ~ cyl + wt, data = mtcars2010)),
"C" = list(
lm(mpg ~ cyl + wt + disp, data = mtcars),
lm(mpg ~ cyl + wt + disp, data = mtcars2010)))
Notice that we saved our models in three groups, in a list of lists.
Then, we define a tidy_model function that accepts a list of two models (one per year), combines the information on those two models, and creates a modelsummary_list object (again, please refer to the documentation). Note that we assign the "year" information to a "group" column in the tidy object.
We apply this function to each of our three groups of models using lapply.
tidy_model <- function(model_list) {
# tidy estimates
tidy_2000 <- broom::tidy(model_list[[1]])
tidy_2010 <- broom::tidy(model_list[[2]])
# create a "group" column
tidy_2000$group <- 2000
tidy_2010$group <- 2010
ti <- bind_rows(tidy_2000, tidy_2010)
# glance estimates
gl <- data.frame("N" = stats::nobs(model_list[[1]]))
# output
out <- list(tidy = ti, glance = gl)
class(out) <- "modelsummary_list"
return(out)
}
models <- lapply(models, tidy_model)
Finally, we call modelsummary_wide with the stacking="vertical" argument to obtain this table:
modelsummary_wide(models, stacking = "vertical")
Of course, the table can be adjusted, coefficients renamed, etc. using the other arguments of the modelsummary_wide function or with kableExtra or some other package supported by the output argument.
More modern example without detailed explanation
library("modelsummary")
library("broom")
library("quantreg")
mtcars2010 <- mtcars
mtcars2010$cyl <- mtcars$cyl * runif(1)
models <- list(
"A" = list(
"2000" = rq(mpg ~ cyl, data = mtcars),
"2010" = rq(mpg ~ cyl, data = mtcars2010)),
"B" = list(
"2000" = rq(mpg ~ cyl + wt, data = mtcars),
"2010" = rq(mpg ~ cyl + wt, data = mtcars2010)),
"C" = list(
"2000" = rq(mpg ~ cyl + wt + disp, data = mtcars),
"2010" = rq(mpg ~ cyl + wt + disp, data = mtcars2010)))
tidy_model <- function(model_list) {
# tidy estimates
tidy_2000 <- broom::tidy(model_list[[1]])
tidy_2010 <- broom::tidy(model_list[[2]])
# create a "group" column
tidy_2000$group <- "2000"
tidy_2010$group <- "2010"
ti <- dplyr::bind_rows(tidy_2000, tidy_2010)
# output
out <- list(tidy = ti, glance = data.frame("nobs 2010" = length(model_list[[1]]$fitted.values)))
class(out) <- "modelsummary_list"
return(out)
}
models <- lapply(models, tidy_model)
modelsummary(models,
group = model + term ~ group,
statistic = "conf.int")
                  2000                2010
A  (Intercept)    36.800              36.800
                  [30.034, 42.403]    [30.034, 42.403]
   cyl            -2.700              -67.944
                  [-3.465, -1.792]    [-87.204, -45.102]
B  (Intercept)    38.871              38.871
                  [30.972, 42.896]    [30.972, 42.896]
   cyl            -1.743              -43.858
                  [-2.154, -0.535]    [-54.215, -13.472]
   wt             -2.679              -2.679
                  [-5.313, -1.531]    [-5.313, -1.531]
C  (Intercept)    40.683              40.683
                  [31.235, 47.507]    [31.235, 47.507]
   cyl            -1.993              -50.162
                  [-3.137, -1.322]    [-78.948, -33.258]
   wt             -2.937              -2.937
                  [-5.443, -1.362]    [-5.443, -1.362]
   disp           0.003               0.003
                  [-0.009, 0.035]     [-0.009, 0.035]
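If only the point estimates are needed, the same model-by-year layout can be approximated without modelsummary. This is a rough base R sketch rather than the approach described above: it assumes a fixed seed for the fake 2010 data, and formulas, years, long and wide are illustrative names.

```r
set.seed(1)  # assumption: fixed seed so the fake "2010" data is reproducible
mtcars2010 <- mtcars
mtcars2010$cyl <- mtcars$cyl * runif(1)

formulas <- list(A = mpg ~ cyl,
                 B = mpg ~ cyl + wt,
                 C = mpg ~ cyl + wt + disp)
years <- list("2000" = mtcars, "2010" = mtcars2010)

# Long format: one row per model, term and year
long <- do.call(rbind, lapply(names(formulas), function(m) {
  do.call(rbind, lapply(names(years), function(y) {
    est <- coef(lm(formulas[[m]], data = years[[y]]))
    data.frame(model = m, term = names(est), year = y,
               estimate = unname(est), row.names = NULL)
  }))
}))

# Wide format: models and terms on the rows, one estimate column per year
wide <- reshape(long, idvar = c("model", "term"),
                timevar = "year", direction = "wide")
rownames(wide) <- NULL
wide
```

The resulting data frame can be passed to kable() or a similar table function for formatting.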

Pass dynamically variable names in lm formula inside a function

I have a function that takes two parameters:
dataRead (a data frame supplied by the user)
variableChosen (the independent variable the user wants to use in the model)
Note: the dependent variable will always be the first column.
If the user gives me, for example, a data frame called dataGiven whose column names are "Doses" and "Weight", I want those names to appear in the model results.
My current function fits the lm correctly, but the column names from the data frame are lost; the output shows how I indexed the data inside the function instead:
Results_REG<- function (dataRead, variableChosen){
fit1 <- lm(formula = dataRead[,1]~dataRead[,variableChosen])
return(fit1)
}
When I call:
test1 <- Results_REG(dataGiven, "Weight")
names(test1$model)
shows:
"dataRead[, 1]" "dataRead[, variableChosen]"
I wanted to show my dataframe columns names, like:
"Doses" "Weight"
First off, it's always difficult to help without a reproducible code example. For future posts I recommend familiarising yourself with how to provide such a minimal reproducible example.
I'm not entirely clear on what you're asking, so I assume this is about how to create a function that fits a simple linear model based on data with a single user-chosen predictor var.
Here is an example based on mtcars:
results_LM <- function(data, var) {
lm(data[, 1] ~ data[, var])
}
results_LM(mtcars, "disp")
#Call:
#lm(formula = data[, 1] ~ data[, var])
#
#Coefficients:
#(Intercept) data[, var]
# 29.59985 -0.04122
You can confirm that this gives the same result as
lm(mpg ~ disp, data = mtcars)
Or perhaps you're asking how to carry through the column names for the predictor? In that case we can use as.formula to construct a formula that we use together with the data argument in lm.
results_LM <- function(data, var) {
fm <- as.formula(paste(colnames(data)[1], "~", var))
lm(fm, data = data)
}
fit <- results_LM(mtcars, "disp")
fit
#Call:
#lm(formula = fm, data = data)
#
#Coefficients:
#(Intercept) disp
# 29.59985 -0.04122
names(fit$model)
#[1] "mpg" "disp"
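Note that the printed call above still reads lm(formula = fm, data = data). If the call should show the actual formula as well, one option is to overwrite that element of the stored call. This is a cosmetic tweak sketched under the same assumptions as the function above; the name results_LM2 is hypothetical.

```r
results_LM2 <- function(data, var) {
  # First column is the response, var is the user-chosen predictor
  fm <- reformulate(var, response = colnames(data)[1])
  fit <- lm(fm, data = data)
  # Replace the symbol `fm` in the stored call with the formula itself,
  # so that print(fit) shows e.g. mpg ~ disp
  fit$call$formula <- fm
  fit
}

fit2 <- results_LM2(mtcars, "disp")
fit2$call
names(fit2$model)
```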
outcome <- 'mpg'
model <- lm(mtcars[,outcome] ~ . ,mtcars)
yields the same result as:
data(mtcars)
model <- lm( mpg ~ . ,mtcars)
but allows you to pass the outcome as a variable holding the column name. However, because . expands to every column of mtcars, mpg also ends up on the right-hand side of the formula as a predictor of itself. Not sure if anyone knows how to fix that.
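One way around the issue described above is to build the formula from the outcome name as a string, so that lm() knows which column is the response and leaves it out when expanding the dot. A minimal sketch, assuming outcome holds a valid column name of mtcars:

```r
outcome <- "mpg"

# Build mpg ~ . from the string; lm() expands "." to all columns of the
# data except the response
fm <- as.formula(paste(outcome, "~ ."))
model <- lm(fm, data = mtcars)

# The response does not appear among the predictors
attr(terms(model), "term.labels")
```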

Loop over column names in regression

I want to run a whole batch of regressions over every variable in a data frame, and then store the residual deviance value from each regression in a new vector as the loop goes along.
The frame is called "cw". The first few variables are just metadata, so ignore those. I try the following:
deviances<-c()
for (x in colnames(cw)[1:8]){deviances[x]<-NA}
for (x in colnames(cw)[8:27]){
model<-glm(cwonset ~ x, fmaily = binomial, data = cw)
append(deviances, model$deviance)
}
However, it gives the error:
Error in model.frame.default(formula = cwonset ~ x, data = cw, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
Any idea why?
Without your data, I used mtcars as a stand-in, with mpg as the dependent variable. No for loop is needed.
Logic: sapply iterates over the column names one at a time, and each iteration fits a regression on that column. Internally it is still a loop, though.
sapply(colnames(mtcars[-1]), function(x) {
form <- as.formula(paste0("mpg~", x))
model <- glm(form, data = mtcars)
model$deviance})
# cyl disp hp drat wt qsec vs am gear carb
# 308.3342 317.1587 447.6743 603.5667 278.3219 928.6553 629.5193 720.8966 866.2980 784.2711
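For completeness, the original loop can also be kept once its two problems are addressed: cwonset ~ x uses the length-1 string x itself as a variable (hence "variable lengths differ"), and append() returns a new vector without modifying deviances. The sketch below is an assumption-laden stand-in, since cw is not available: it uses mtcars with am as a binary response and three arbitrarily chosen predictors.

```r
# Stand-ins for cwonset and colnames(cw)[8:27]
predictors <- c("mpg", "hp", "wt")

# Preallocate a named numeric vector for the results
deviances <- setNames(rep(NA_real_, length(predictors)), predictors)

for (x in predictors) {
  # reformulate() turns the column name into a formula, e.g. am ~ mpg
  model <- glm(reformulate(x, response = "am"),
               family = binomial, data = mtcars)
  # Assign by name; the result of append() would be discarded
  deviances[x] <- model$deviance
}
deviances
```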
