Loop over column names in regression - r

I want to run a whole batch of regressions over every variable in a data frame, and then store the residual deviance value from each regression in a new vector as the loop goes along.
The data frame is called "cw". The first few variables are just metadata, so ignore those. I tried the following:
deviances<-c()
for (x in colnames(cw)[1:8]){deviances[x]<-NA}
for (x in colnames(cw)[8:27]){
model<-glm(cwonset ~ x, family = binomial, data = cw)
append(deviances, model$deviance)
}
However, it gives the error:
Error in model.frame.default(formula = cwonset ~ x, data = cw, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
Any idea why?

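Without your data, I used mtcars as a stand-in and assumed mpg is the dependent variable. You don't need an explicit for loop either.
The error arises because the x in cwonset ~ x is taken literally as a variable named x (here a length-one character string), not as the column it names, so its length does not match the response.
Logic: sapply loops over the column names one at a time; for each name I build the formula from a string and fit the regression. (Internally it is still a for loop.)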
sapply(colnames(mtcars[-1]), function(x) {
form <- as.formula(paste0("mpg~", x))
model <- glm(form, data = mtcars)
model$deviance})
# cyl disp hp drat wt qsec vs am gear carb
# 308.3342 317.1587 447.6743 603.5667 278.3219 928.6553 629.5193 720.8966 866.2980 784.2711
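For completeness, here is a minimal sketch of the original for loop with both problems addressed. It assumes mtcars as a stand-in for cw and its binary column vs as a stand-in for cwonset; the three predictor columns are also an arbitrary choice for illustration. The formula is built from the column name with reformulate(), and each result is assigned by name, because append() returns a new vector rather than modifying deviances in place.

```r
# Assumption: mtcars replaces `cw`, and the 0/1 column `vs` replaces `cwonset`
predictors <- colnames(mtcars)[c(1, 4, 6)]   # "mpg" "hp" "wt"

# Pre-allocate a named numeric vector, one slot per predictor
deviances <- setNames(rep(NA_real_, length(predictors)), predictors)

for (x in predictors) {
  # Build vs ~ <column name> from the string held in x
  form <- reformulate(x, response = "vs")
  model <- glm(form, family = binomial, data = mtcars)
  # Assign by name; append() alone would discard its result
  deviances[x] <- model$deviance
}

deviances
```

Pre-allocating the vector keeps the names and the order of the predictors, which makes the result easy to line up with the columns afterwards.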

Related

R code for looping through regression models, starting from Stata code

New R user here, coming from Stata. Most of my work consists of running several regression models with different combinations of dependent and independent variables and storing the results. For this, I make extensive use of macros and loops, which to my understanding are not preferred in R.
Using the "mtcars" dataset as an example, and assuming I'm interested in using mpg, disp and wt as dependent variables, hp and carb as independent variables and adjusting all models for vs, am and gear, in Stata I would do something like this:
local depvars mpg disp wt // create list of dependent variables
local indepvars hp carb // create list of independent variables
local confounders vs am gear // create list of control variables
foreach depvar of local depvars {
foreach indepvar of local indepvars {
reg `depvar' `indepvar' `confounders'
estimates store `depvar'_`indepvar'
}
}
Is there a way to do it in R? Potentially using the tidyverse approach which I'm starting to get familiar with?
This makes R follow your Stata code:
depvars <- c('mpg', 'disp', 'wt')
indepvars <- c('hp', 'carb')
confounders <- c('vs', 'am', 'gear')
for (i in seq(length(depvars))) {
for (j in seq(length(indepvars))) {
my_model <- lm(as.formula(paste(depvars[i], "~", paste(c(indepvars[j], confounders), collapse = "+"))), data = mtcars)
assign(paste0(depvars[i], "_", indepvars[j]), my_model)
}
}
or with shorter code:
for (i in seq_along(depvars)) {
for (j in seq_along(indepvars)) {
assign(paste0(depvars[i], "_", indepvars[j]), lm(as.formula(paste(depvars[i], "~", paste(c(indepvars[j], confounders), collapse = "+"))), data = mtcars))
}
}
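As an alternative to assign(), the fitted models can be kept in a single named list, which is closer to how Stata's stored estimates are usually used afterwards. The following is only a sketch: the names grid and models are introduced here for illustration, and expand.grid() plus Map() is one of several ways to enumerate the combinations.

```r
depvars     <- c("mpg", "disp", "wt")
indepvars   <- c("hp", "carb")
confounders <- c("vs", "am", "gear")

# One row per (dependent, independent) combination
grid <- expand.grid(dep = depvars, indep = indepvars,
                    stringsAsFactors = FALSE)

# Fit one model per row; reformulate() builds dep ~ indep + confounders
models <- Map(function(dep, indep) {
  lm(reformulate(c(indep, confounders), response = dep), data = mtcars)
}, grid$dep, grid$indep)

# Name each model like the Stata stored estimates, e.g. "mpg_hp"
names(models) <- paste(grid$dep, grid$indep, sep = "_")

names(models)
```

A model is then retrieved with models[["mpg_hp"]], and functions such as lapply(models, coef) apply to all of them at once.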

Relative importance for several groups in R

How do I calculate the relative importance using relaimpo package in R when I want to run it for several groups? As an example, in the mtcars dataframe I want to calculate the relative importance of several variables on mpg for every cyl. I calculated the relative importance of the variables on mpg, but I don't know how to make it per group. I tried to insert group_by(cyl) but I did not succeed. How would I do that in R?
library(relaimpo)
df <- mtcars
model <- lm(mpg ~ disp + hp + drat + wt, data=df)
rel_importance = calc.relimp(model, type = "lmg", rela=TRUE)
rel_importance
I'm not familiar with this package but in general if you want to apply a function by group in R you can split the data frame into a list of one data frame per group, and then apply the function to each element of the list.
In this case:
cyl_list <- split(df, df$cyl)
rel_importance_cyl <- lapply(
cyl_list,
\(df) {
model <- lm(mpg ~ disp + hp + drat + wt, data = df)
calc.relimp(model, type = "lmg", rela = TRUE)
}
)
names(rel_importance_cyl) # "4" "6" "8"
You can access this list either by name (e.g. rel_importance_cyl[["4"]]) or by index (e.g. rel_importance_cyl[[1]]), to see the values for each group.

Iterating and looping over multiple columns in glm in r using a name from another variable

I am trying to iterate over multiple columns for a glm function in R.
view(mtcars)
names <- names(mtcars[-c(1,2)])
for(i in 1:length(names)){
print(paste0("Starting iterations for ",names[i]))
model <- glm(mpg ~ cyl + paste0(names[i]), data=mtcars, family = gaussian())
summary(model)
print(paste0("Iterations for ",names[i], " finished"))
}
however, I am getting the following error:
[1] "Starting iterations for disp"
Error in model.frame.default(formula = mpg ~ cyl + paste0(names[i]), data = mtcars, :
variable lengths differ (found for 'paste0(names[i])')
Not sure, how I can correct this.
mpg ~ cyl + paste0(names[i]), or even mpg ~ cyl + names[i], does not work as a model formula: the term is not replaced by the column name it evaluates to. Use
reformulate(c("cyl", names[i]), "mpg")
instead, which dynamically creates a formula from variable names.
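A small sketch of the difference, using all.vars() to list the variables each formula refers to. The names x, f_literal and f_built are introduced here only for illustration.

```r
x <- "disp"

# Written literally, the formula refers to a variable called `x`
f_literal <- mpg ~ cyl + x
vars_literal <- all.vars(f_literal)

# reformulate() substitutes the string's value into the formula
f_built <- reformulate(c("cyl", x), response = "mpg")
vars_built <- all.vars(f_built)

vars_literal
vars_built
```

The first formula looks for a variable named x, finds the length-one string, and fails with "variable lengths differ"; the second refers to the disp column as intended.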
Since you need to build your model formula dynamically from a string, you need as.formula. Alternatively, consider reformulate, which takes the response and right-hand-side variable names:
...
fml <- reformulate(c("cyl", names[i]), "mpg")
model <- glm(fml, data=mtcars, family = gaussian())
summary(model)
...
glm takes a formula, which you can create from a string using as.formula():
predictors <- names(mtcars[-c(1,2)])
for(predictor in predictors){
print(paste0("Starting iterations for ",predictor))
model <- glm(as.formula(paste0("mpg ~ cyl + ",predictor)),
data=mtcars,
family = gaussian())
print(summary(model))
print(paste0("Iterations for ",predictor, " finished"))
}

Is there a way that I can put into a barplot the significant variables from regression?

Take mtcars for example:
> reg <- lm(mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
> sigvar <- data.frame(summary(reg)$coef[summary(reg)$coef[,4] <= .05, 4]) #extracts significant variables with p-values
> rownames <- rownames(sigvar) #extracts the variables only
I hope to put the rownames on the x-axis of a barplot and the height of the barplot would be the average of the said rownames. Thanks.
I'm unclear what you want the heights of the bars to be in the barplot. However:
reg = lm(mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
## extracts significant variables with p-values
## note the -1 means you skip the intercept
sigvar = summary(reg)$coef[-1,4] <= .05
If I understand correctly, you want a barplot in which the height of each bar is the average value of a significant variable. For that you need to match the significant variables with the variable names in the data frame:
i = match(names(sigvar)[sigvar], names(mtcars))
i now contains the indices of the columns in the original data frame that correspond to the significant variables. Unfortunately, for the mtcars data only one variable is significant, so mtcars[,i] selects a single column. Normally I would do something like
barplot(sapply(mtcars[,i], mean))
but that doesn't do the right thing here because mtcars[,i] returns a vector rather than a data frame. Let's assume for the sake of argument that i = c(5,6); then this will work:
i = 5:6
barplot(sapply(mtcars[,i], mean))
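One way to make the single-column case work without changing i is to pass drop = FALSE when subsetting, so that the result stays a data frame. This is a sketch of that variation rather than part of the answer above; the name means is introduced here for illustration.

```r
reg <- lm(mpg ~ cyl + disp + hp + drat + wt, data = mtcars)

# TRUE/FALSE per predictor (intercept skipped) for p-value <= .05
sigvar <- summary(reg)$coef[-1, 4] <= .05

# Column indices of the significant predictors in mtcars
i <- match(names(sigvar)[sigvar], names(mtcars))

# drop = FALSE keeps a data frame even when only one column is selected,
# so sapply() returns one named mean per significant variable
means <- sapply(mtcars[, i, drop = FALSE], mean)
barplot(means)
```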

Any pitfalls to using programmatically constructed formulas?

I want to run through a long vector of potential explanatory variables,
regressing a response variable on each in turn. Rather than paste together
the model formula, I'm thinking of using reformulate(),
as demonstrated here.
The function fun() below seems to do the job, fitting the desired model. Notice, though, that
it records in its call element the name of the constructed formula object
rather than its value.
## (1) Function using programmatically constructed formula
fun <- function(XX) {
ff <- reformulate(response="mpg", termlabels=XX)
lm(ff, data=mtcars)
}
fun(XX=c("cyl", "disp"))
#
# Call:
# lm(formula = ff, data = mtcars) <<<--- Note recorded call
#
# Coefficients:
# (Intercept) cyl disp
# 34.66099 -1.58728 -0.02058
## (2) Result of directly specified formula (just for purposes of comparison)
lm(mpg ~ cyl + disp, data=mtcars)
#
# Call:
# lm(formula = mpg ~ cyl + disp, data = mtcars) <<<--- Note recorded call
#
# Coefficients:
# (Intercept) cyl disp
# 34.66099 -1.58728 -0.02058
My question: Is there any danger in this? Can this become a
problem if, for instance, I want to later apply update, or predict or
some other function to the model fit object, (possibly from some other environment)?
A slightly more awkward alternative that does, nevertheless, get the recorded
call right is to use eval(substitute()). Is this in any way a generally safer construct?
fun2 <- function(XX) {
ff <- reformulate(response="mpg", termlabels=XX)
eval(substitute(lm(FF, data=mtcars), list(FF=ff)))
}
fun2(XX=c("cyl", "disp"))$call
## lm(formula = mpg ~ cyl + disp, data = mtcars)
I'm always hesitant to claim there are no situations in which something involving R environments and scoping might bite, but ... after some more exploration, my first usage above does look safe.
It turns out that the printed call is a bit of a red herring.
The formula that actually gets used by other functions (and the one extracted by formula() and as.formula()) is the one stored in the terms element of the fit object, and it gets the actual formula right. (The terms element contains an object of class "terms", which is just a "formula" with a bunch of attached attributes.)
To see that all of the proposals in my question and the associated comments store the same "formula" object (up to the associated environment), run the following.
## First the three approaches in my post
formula(fun(XX=c("cyl", "disp")))
# mpg ~ cyl + disp
# <environment: 0x026d2b7c>
formula(lm(mpg ~ cyl + disp, data=mtcars))
# mpg ~ cyl + disp
formula(fun2(XX=c("cyl", "disp"))$call)
# mpg ~ cyl + disp
# <environment: 0x02c4ce2c>
## Then Gabor Grothendieck's idea
XX = c("cyl", "disp")
ff <- reformulate(response="mpg", termlabels=XX)
formula(do.call("lm", list(ff, quote(mtcars))))
## mpg ~ cyl + disp
To confirm that formula() really is deriving its output from the terms element of the fit object, have a look at stats:::formula.lm and stats:::formula.terms.
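As a quick check of this, the sketch below compares a fit from fun() with a directly specified fit. It only covers formula(), predict() with new data, and update() with a changed formula; it is not a guarantee for every function or environment. The names fit, direct, preds_equal and fit2 are introduced here for illustration.

```r
fun <- function(XX) {
  ff <- reformulate(response = "mpg", termlabels = XX)
  lm(ff, data = mtcars)
}

fit    <- fun(c("cyl", "disp"))
direct <- lm(mpg ~ cyl + disp, data = mtcars)

# The stored formula is the real one, despite the printed call showing `ff`
formula(fit)

# Predictions on new data agree with the directly specified model
preds_equal <- isTRUE(all.equal(
  predict(fit,    newdata = head(mtcars)),
  predict(direct, newdata = head(mtcars))
))
preds_equal

# update() with a new formula works, since it starts from formula(fit)
fit2 <- update(fit, . ~ . + hp)
names(coef(fit2))
```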

Resources