Create a variable to replace variables in a model in R

When writing a statistical model, I usually use a lot of covariates to adjust the model, so I have to write the variables out again and again. Even though I could copy and paste, the model ends up looking very long. Could I create a single variable that stands in for many variables? E.g.:
fm <- lm(y ~ a+b+c+d+e, data)
If I could create a variable like model1 = a+b+c+d+e, then the model would look like:
fm <- lm(y ~ model1, data)
I tried many ways, such as model1 <- c(a+b+c+d), but none was successful.
Could someone help me with this?

How about saving it as a formula?
model <- ~ a + b + c + d
You can then extract the terms using terms(), or update the formula using update().
Example:
model <- mpg ~ disp + wt + cyl
lm(model, mtcars)
## Call:
## lm(formula = model, data = mtcars)
##
## Coefficients:
## (Intercept) disp wt cyl
## 41.107678 0.007473 -3.635677 -1.784944
model <- update(model, ~. + qsec)
lm(model, mtcars)
## Call:
## lm(formula = model, data = mtcars)
##
## Coefficients:
## (Intercept) disp wt cyl qsec
## 30.17771 0.01029 -4.55318 -1.24109 0.55277
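Since terms() was mentioned above, here is a minimal sketch of pulling the term labels back out of the updated formula:
attr(terms(model), "term.labels")
## [1] "disp" "wt"   "cyl"  "qsec"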
Edit:
As Kristoffer Winther Balling mentioned in the comments, a cleverer way to do this is to save the formula as a string (e.g. "mpg ~ disp + wt + cyl") and then use as.formula. You can then use the familiar paste or other string-manipulation functions to change the formula.
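A minimal sketch of that string-based approach (the object names here are just illustrative):
covars <- c("disp", "wt", "cyl")
fml <- as.formula(paste("mpg ~", paste(covars, collapse = " + ")))
fml
## mpg ~ disp + wt + cyl
lm(fml, mtcars)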

Related

Iterating and looping over multiple columns in glm in r using a name from another variable

I am trying to iterate over multiple columns for a glm function in R.
View(mtcars)
names <- names(mtcars[-c(1, 2)])
for (i in 1:length(names)) {
  print(paste0("Starting iterations for ", names[i]))
  model <- glm(mpg ~ cyl + paste0(names[i]), data = mtcars, family = gaussian())
  summary(model)
  print(paste0("Iterations for ", names[i], " finished"))
}
However, I am getting the following error:
[1] "Starting iterations for disp"
Error in model.frame.default(formula = mpg ~ cyl + paste0(names[i]), data = mtcars, :
variable lengths differ (found for 'paste0(names[i])')
Not sure how I can correct this.
mpg ~ cyl + paste0(names[i]), or even mpg ~ cyl + names[i], is not valid syntax for a formula. Use
reformulate(c("cyl", names[i]), "mpg")
instead, which dynamically creates a formula from variable names.
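For the first predictor ("disp"), for example, the resulting formula looks like this:
reformulate(c("cyl", "disp"), "mpg")
## mpg ~ cyl + disp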
Since you need to build your model formula dynamically from a string, you need as.formula. Alternatively, consider reformulate, which takes the response and the RHS variable names:
...
fml <- reformulate(c("cyl", names[i]), "mpg")
model <- glm(fml, data=mtcars, family = gaussian())
summary(model)
...
glm takes a formula, which you can create using as.formula():
predictors <- names(mtcars[-c(1, 2)])
for (predictor in predictors) {
  print(paste0("Starting iterations for ", predictor))
  model <- glm(as.formula(paste0("mpg ~ cyl + ", predictor)),
               data = mtcars,
               family = gaussian())
  print(summary(model))
  print(paste0("Iterations for ", predictor, " finished"))
}

Using column name of dataframe as predictor variable in linear regression

I'm trying to loop through all the column names of my data.frame and use them as predictor variables in a linear regression.
What I currently have is:
for (i in 1:11) {
  for (j in 1:11) {
    if (i != j) {
      var1 = names(newData)[i]
      var2 = names(newData)[j]
      glm.fit = glm(re78 ~ as.name(var1):as.name(var2), data = newData)
      summary(glm.fit)
      cv.glm(newData, glm.fit, K = 10)$delta[1]
    }
  }
}
Here newData is my data.frame with 11 columns in total. This code gives me the following error:
Error in model.frame.default(formula = re78 ~ as.name(var1), data = newData, :
invalid type (symbol) for variable 'as.name(var1)'
How can I fix this, and make it work?
It looks like you want models that use all combinations of two variables. Here's another way to do that using the built-in mtcars data frame for illustration and using mpg as the outcome variable.
We get all combinations of two variables (excluding the outcome variable, mpg in this case) using combn. combn returns a list where each list element is a vector containing the names of a pair of variables. Then we use map (from the purrr package) to create models for each pair of variables and store the results in a list.
We use reformulate to construct the model formula. .x refers back to the vector of variable names (each element of vars). If you run, for example, reformulate(paste(c("cyl", "disp"), collapse="*"), "mpg"), you can see what reformulate is doing.
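For reference, that call just builds and parses the string "mpg ~ cyl*disp":
reformulate(paste(c("cyl", "disp"), collapse="*"), "mpg")
## mpg ~ cyl * disp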
library(purrr)
# Get all combinations of two variables
vars = combn(names(mtcars)[-grep("mpg", names(mtcars))], 2, simplify=FALSE)
Now we want to run regression models on all pairs of variables and store results in a list:
# No interaction
models = map(vars, ~ glm(reformulate(.x, "mpg"), data=mtcars))
# Interaction only (no main effects)
models = map(vars, ~ glm(reformulate(paste(.x, collapse=":"), "mpg"), data=mtcars))
# Interaction and main effects
models = map(vars, ~ glm(reformulate(paste(.x, collapse="*"), "mpg"), data=mtcars))
Name each list element with the formula for that model:
names(models) = map(models, ~ .x[["terms"]])
To create the model formulas using paste instead of reformulate you could do (change + to : or *, depending on what combination of interactions and main effects you want to include):
models = map(vars, ~ glm(paste("mpg ~", paste(.x, collapse=" + ")), data=mtcars))
To see how paste is being used here, you can run:
paste("mpg ~", paste(c("cyl", "disp"), collapse=" * "))
Here's what the first two models look like when the models include both main effects and the interaction:
models[1:2]
$`mpg ~ cyl * disp`
Call: glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"),
data = mtcars)
Coefficients:
(Intercept) cyl disp cyl:disp
49.03721 -3.40524 -0.14553 0.01585
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 1126
Residual Deviance: 198.1 AIC: 159.1
$`mpg ~ cyl * hp`
Call: glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"),
data = mtcars)
Coefficients:
(Intercept) cyl hp cyl:hp
50.75121 -4.11914 -0.17068 0.01974
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 1126
Residual Deviance: 247.6 AIC: 166.3
To assess model output, you can use functions from the broom package. The code below returns data frames with, respectively, the coefficients and performance statistics for each model.
library(broom)
model_coefs = map_df(models, tidy, .id="Model")
model_performance = map_df(models, glance, .id="Model")
Here are what the results look like for models with both main effects and the interaction:
head(model_coefs, 8)
Model term estimate std.error statistic p.value
1 mpg ~ cyl * disp (Intercept) 49.03721186 5.004636297 9.798357 1.506091e-10
2 mpg ~ cyl * disp cyl -3.40524372 0.840189015 -4.052950 3.645320e-04
3 mpg ~ cyl * disp disp -0.14552575 0.040002465 -3.637919 1.099280e-03
4 mpg ~ cyl * disp cyl:disp 0.01585388 0.004947824 3.204212 3.369023e-03
5 mpg ~ cyl * hp (Intercept) 50.75120716 6.511685614 7.793866 1.724224e-08
6 mpg ~ cyl * hp cyl -4.11913952 0.988229081 -4.168203 2.672495e-04
7 mpg ~ cyl * hp hp -0.17068010 0.069101555 -2.469989 1.987035e-02
8 mpg ~ cyl * hp cyl:hp 0.01973741 0.008810871 2.240120 3.320219e-02
You can use fit <- glm(as.formula(paste0("re78 ~ ", var1)), data=newData) as #akrun suggests. Further, you likely do not want to call your object glm.fit, as there is a function with the same name.
Caveat: I do not know why you have the double loop and the :. Do you not want a regression with a single covariate? I am not sure what you are trying to achieve otherwise.
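If a single covariate per model is what you are after, here is a minimal sketch (illustrated with mtcars and mpg, since newData is not shown; the object names are just placeholders):
library(boot)  # for cv.glm

predictors <- setdiff(names(mtcars), "mpg")
cv_errors <- sapply(predictors, function(v) {
  fml <- as.formula(paste0("mpg ~ ", v))  # build the formula from the column name
  fit <- glm(fml, data = mtcars)
  cv.glm(mtcars, fit, K = 10)$delta[1]    # 10-fold cross-validation error
})
cv_errors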

Map function in R for multiple regression

My goal is to run a multiple regression on each dependent variable in a list, using all of the independent variables in another list. I then would like to store the best model for each dependent variable by AIC.
I have written the below function guided by this post. However, instead of employing each independent variable individually, I'd like to run the model against the entire list as a multiple regression.
Any tips on how to build this function?
dep<-list("mpg~","cyl~","disp~") # list of unique dependent variables with ~
indep<-list("hp","drat","wt") # list of first unique independent variables
models <- Map(function(x, y)
  step(lm(as.formula(paste(x, paste(y), collapse="+")), data=mtcars),
       direction="backward"),
  dep, indep)
Start: AIC=88.43
mpg ~ hp
Df Sum of Sq RSS AIC
<none> 447.67 88.427
- hp 1 678.37 1126.05 115.943
Start: AIC=18.56
cyl ~ drat
Df Sum of Sq RSS AIC
<none> 50.435 18.558
- drat 1 48.44 98.875 38.100
Start: AIC=261.74
disp ~ wt
Df Sum of Sq RSS AIC
<none> 100709 261.74
- wt 1 375476 476185 309.45
[[1]]
Call:
lm(formula = mpg ~ hp, data = mtcars)
Coefficients:
(Intercept) hp
30.09886 -0.06823
[[2]]
Call:
lm(formula = cyl ~ drat, data = mtcars)
Coefficients:
(Intercept) drat
14.596 -2.338
[[3]]
Call:
lm(formula = disp ~ wt, data = mtcars)
Coefficients:
(Intercept) wt
-131.1 112.5
The y needs to be collapsed with + and then pasted to the x, and y needs to be passed as a whole vector for each value of x:
models <- lapply(dep, function(x, y)
  step(lm(as.formula(paste(x, paste(y, collapse="+"))), data=mtcars),
       direction="backward"), y = indep)

Any pitfalls to using programmatically constructed formulas?

I'm wanting to run through a long vector of potential explanatory variables,
regressing a response variable on each in turn. Rather than paste together
the model formula, I'm thinking of using reformulate(),
as demonstrated here.
The function fun() below seems to do the job, fitting the desired model. Notice, though, that
it records in its call element the name of the constructed formula object
rather than its value.
## (1) Function using programmatically constructed formula
fun <- function(XX) {
  ff <- reformulate(response="mpg", termlabels=XX)
  lm(ff, data=mtcars)
}
fun(XX=c("cyl", "disp"))
#
# Call:
# lm(formula = ff, data = mtcars) <<<--- Note recorded call
#
# Coefficients:
# (Intercept) cyl disp
# 34.66099 -1.58728 -0.02058
## (2) Result of directly specified formula (just for purposes of comparison)
lm(mpg ~ cyl + disp, data=mtcars)
#
# Call:
# lm(formula = mpg ~ cyl + disp, data = mtcars) <<<--- Note recorded call
#
# Coefficients:
# (Intercept) cyl disp
# 34.66099 -1.58728 -0.02058
My question: Is there any danger in this? Can this become a problem if, for instance, I later want to apply update, predict, or some other function to the model fit object (possibly from some other environment)?
A slightly more awkward alternative that does, nevertheless, get the recorded
call right is to use eval(substitute()). Is this in any way a generally safer construct?
fun2 <- function(XX) {
  ff <- reformulate(response="mpg", termlabels=XX)
  eval(substitute(lm(FF, data=mtcars), list(FF=ff)))
}
fun2(XX=c("cyl", "disp"))$call
## lm(formula = mpg ~ cyl + disp, data = mtcars)
I'm always hesitant to claim there are no situations in which something involving R environments and scoping might bite, but ... after some more exploration, my first usage above does look safe.
It turns out that the printed call is a bit of a red herring.
The formula that actually gets used by other functions (and the one extracted by formula() and as.formula()) is the one stored in the terms element of the fit object, and it gets the actual formula right. (The terms element contains an object of class "terms", which is just a "formula" with a bunch of attached attributes.)
To see that all of the proposals in my question and the associated comments store the same "formula" object (up to the associated environment), run the following.
## First the three approaches in my post
formula(fun(XX=c("cyl", "disp")))
# mpg ~ cyl + disp
# <environment: 0x026d2b7c>
formula(lm(mpg ~ cyl + disp, data=mtcars))
# mpg ~ cyl + disp
formula(fun2(XX=c("cyl", "disp"))$call)
# mpg ~ cyl + disp
# <environment: 0x02c4ce2c>
## Then Gabor Grothendieck's idea
XX = c("cyl", "disp")
ff <- reformulate(response="mpg", termlabels=XX)
formula(do.call("lm", list(ff, quote(mtcars))))
## mpg ~ cyl + disp
To confirm that formula() really is deriving its output from the terms element of the fit object, have a look at stats:::formula.lm and stats:::formula.terms.
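As a small sketch of this, formula() applied to the fit and to its terms element agree:
fit <- fun(XX=c("cyl", "disp"))
identical(formula(fit), formula(fit$terms))
## [1] TRUE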

Linear regression with interaction fails in the rms-package

I'm playing around with interaction terms in the formula. I wondered if it's possible to do a regression with an interaction for only one of the two dummy variables. This seems to work in regular linear regression using the lm() function, but with the ols() function in the rms package the same formula fails. Does anyone know why?
Here's my example
data(mtcars)
mtcars$gear <- factor(mtcars$gear)
regular_lm <- lm(mpg ~ wt + cyl + gear + cyl:gear, data=mtcars)
summary(regular_lm)
regular_lm <- lm(mpg ~ wt + cyl + gear + cyl:I(gear == "4"), data=mtcars)
summary(regular_lm)
And now the rms example
library(rms)
dd <- datadist(mtcars)
options(datadist = "dd")
regular_ols <- ols(mpg ~ wt + cyl + gear + cyl:gear, data=mtcars)
regular_ols
# Fails with:
# Error in if (!length(fname) || !any(fname == zname)) { :
# missing value where TRUE/FALSE needed
regular_ols <- ols(mpg ~ wt + cyl + gear + cyl:I(gear == "4"), data=mtcars)
This experiment might not be the wisest statistics to run, as the estimates seem to change considerably, but I'm a little curious why ols() fails, since it should use the "same fitting routines used by lm".
I don't know exactly, but it has to do with the way the formula is evaluated rather than with the way the fit is done once the model has been translated. Using traceback() shows that the problem occurs within Design(eval.parent(m)); using options(error=recover) gets you to the point where you can see that
Browse[1]> fname
[1] "wt" "cyl" "gear"
Browse[1]> zname
[1] NA
in other words, zname is some internal variable that hasn't been set right because the Design function can't quite handle defining the interaction between cylinders and the (gear==4) dummy on the fly.
This works though:
mtcars$cylgr <- with(mtcars, interaction(cyl, gear == "4"))
regular_ols <- ols(mpg ~ wt + cyl + gear + cylgr, data=mtcars)
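A possible follow-up: if you later call summary() or Predict() on regular_ols, you will probably want to rebuild the datadist so it also covers the new cylgr column:
dd <- datadist(mtcars)   # recompute now that cylgr exists in mtcars
options(datadist = "dd")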
