Any pitfalls to using programmatically constructed formulas?

I want to run through a long vector of potential explanatory variables,
regressing a response variable on each in turn. Rather than paste together
the model formula, I'm thinking of using reformulate(),
as demonstrated here.
The function fun() below seems to do the job, fitting the desired model. Notice, though, that
it records in its call element the name of the constructed formula object
rather than its value.
## (1) Function using programmatically constructed formula
fun <- function(XX) {
    ff <- reformulate(response = "mpg", termlabels = XX)
    lm(ff, data = mtcars)
}
fun(XX=c("cyl", "disp"))
#
# Call:
# lm(formula = ff, data = mtcars) <<<--- Note recorded call
#
# Coefficients:
# (Intercept)          cyl         disp
#    34.66099     -1.58728     -0.02058
## (2) Result of directly specified formula (just for purposes of comparison)
lm(mpg ~ cyl + disp, data=mtcars)
#
# Call:
# lm(formula = mpg ~ cyl + disp, data = mtcars) <<<--- Note recorded call
#
# Coefficients:
# (Intercept)          cyl         disp
#    34.66099     -1.58728     -0.02058
My question: Is there any danger in this? Can it become a
problem if, for instance, I later want to apply update(), predict(), or
some other function to the fitted model object (possibly from some other environment)?
A slightly more awkward alternative that does, nevertheless, get the recorded
call right is to use eval(substitute()). Is that construct generally any safer?
fun2 <- function(XX) {
    ff <- reformulate(response = "mpg", termlabels = XX)
    eval(substitute(lm(FF, data = mtcars), list(FF = ff)))
}
fun2(XX=c("cyl", "disp"))$call
## lm(formula = mpg ~ cyl + disp, data = mtcars)

I'm always hesitant to claim there are no situations in which something involving R environments and scoping might bite, but ... after some more exploration, my first usage above does look safe.
It turns out that the printed call is a bit of a red herring.
The formula that actually gets used by other functions (and the one extracted by formula() and as.formula()) is the one stored in the terms element of the fit object, and it gets the actual formula right. (The terms element contains an object of class "terms", which is just a "formula" with a bunch of attached attributes.)
To see that all of the proposals in my question and the associated comments store the same "formula" object (up to the associated environment), run the following.
## First the three approaches in my post
formula(fun(XX=c("cyl", "disp")))
# mpg ~ cyl + disp
# <environment: 0x026d2b7c>
formula(lm(mpg ~ cyl + disp, data=mtcars))
# mpg ~ cyl + disp
formula(fun2(XX=c("cyl", "disp"))$call)
# mpg ~ cyl + disp
# <environment: 0x02c4ce2c>
## Then Gabor Grothendieck's idea
XX = c("cyl", "disp")
ff <- reformulate(response="mpg", termlabels=XX)
formula(do.call("lm", list(ff, quote(mtcars))))
## mpg ~ cyl + disp
To confirm that formula() really is deriving its output from the terms element of the fit object, have a look at stats:::formula.lm and stats:::formula.terms.
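As a further sanity check (a minimal sketch, assuming fun() from above): calling update() with an explicit new formula also behaves, because update.formula() starts from formula(fit), i.e. from the terms element, rather than from the literal ff recorded in the call.
fit <- fun(XX = c("cyl", "disp"))
## update.formula() expands ". ~ . - disp" against formula(fit),
## so the refit even records the real formula in its call:
update(fit, . ~ . - disp)$call
## lm(formula = mpg ~ cyl, data = mtcars)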

Related

How to understand the arguments of "data" and "subset" in randomForest R package?

Arguments
data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from
subset: an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)
My questions:
Why is the data argument "optional"? If data is optional, where does the training data come from? And what exactly is the meaning of "By default the variables are taken from the environment which randomForest is called from"?
Why do we need the subset parameter? Say we have the iris data set. If I want to use the first 100 rows as the training data set, I can just select training_data <- iris[1:100, ]. Why bother? What's the benefit of using subset?
This is not an uncommon methodology, and it is certainly not unique to randomForest.
mpg <- mtcars$mpg
disp <- mtcars$disp
lm(mpg~disp)
# Call:
# lm(formula = mpg ~ disp)
# Coefficients:
# (Intercept)         disp
#    29.59985     -0.04122
So when lm (in this case) is attempting to resolve the variables referenced in the formula mpg~disp, it looks at data if provided, then in the calling environment. Further example:
rm(mpg,disp)
mpg2 <- mtcars$mpg
lm(mpg2~disp)
# Error in eval(predvars, data, env) : object 'disp' not found
lm(mpg2~disp, data=mtcars)
# Call:
# lm(formula = mpg2 ~ disp, data = mtcars)
# Coefficients:
# (Intercept)         disp
#    29.59985     -0.04122
(Notice that mpg2 is not in mtcars, so this used both methods for finding the data.) I don't use this functionality myself, preferring the more resilient step of providing all data in the call; it is not difficult to think of examples where reproducibility suffers when that is not done.
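For instance, a minimal sketch of how a fit that leans on the calling environment can silently change out from under you:
mpg2 <- mtcars$mpg
fit <- lm(mpg2 ~ disp, data = mtcars)
mpg2 <- rev(mpg2)   # some unrelated later code reassigns mpg2 ...
update(fit)         # ... and the refit quietly uses the new values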
Similarly, many modeling functions (including lm) allow this subset= argument, so the fact that randomForest includes it is consistent. I believe it is merely a convenience argument, as the following are roughly equivalent:
lm(mpg ~ disp, data = mtcars, subset = cyl == 4)
lm(mpg ~ disp, data = mtcars[mtcars$cyl == 4, ])
mt <- mtcars[mtcars$cyl == 4, ]
lm(mpg ~ disp, data = mt)
The use of subset allows slightly simpler referencing (cyl versus mtcars$cyl), and its utility is compounded when the number of referenced variables increases (i.e., for "code golf" purposes). But this could also be done with other mechanisms such as with, so ... mostly personal preference.
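For example, an untested sketch of the with() route:
## Variables in both the formula and subset= resolve against mtcars:
with(mtcars, lm(mpg ~ disp, subset = cyl == 4))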
Edit: as joran pointed out, randomForest (and others, but notably not lm) can be called either with a formula, which is where you'd typically use the data argument, or by specifying the predictor and response separately via the arguments x and y, as in the following examples taken from ?randomForest (ignore the other arguments being inconsistent):
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)
iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)

Linear regression output "Call Formula" can't be exported to csv file

I'm working on a for loop that uses the input data to generate several linear regression models, a~c. The model generation works perfectly fine, but I need to export the results to a csv file. I'm using the tidy() and glance() functions to obtain p values, intercepts, r2, etc. That part of the code works fine, but the output file does not provide the "Call Formula:" of the linear regression, so I'm having problems interpreting the output. Can someone tell me, please, how to make the call formula become the header of the csv file?
I don't think csv files have a singular header that you can save the call to, but if you simply want to capture the call to lm alongside the results, then (using mtcars data since you didn't provide much context):
library(broom)
m <- lm(disp ~ hp + cyl, data = mtcars)
xxx <- tidy(m)
xxx$call <- toString(m$call)
xxx
produces...
         term     estimate  std.error statistic      p.value                        call
1 (Intercept) -144.5694333 37.6522356 -3.8395976 6.173165e-04 lm, disp ~ hp + cyl, mtcars
2          hp    0.2358181  0.2578106  0.9146953 3.678948e-01 lm, disp ~ hp + cyl, mtcars
3         cyl   55.0625843  9.8975401  5.5632595 5.310020e-06 lm, disp ~ hp + cyl, mtcars
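From there it's an ordinary write.csv() call; the file name below is just a placeholder:
write.csv(xxx, "model_results.csv", row.names = FALSE)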

create a variable to replace variables in a model in R

When writing statistical models I usually use a lot of covariates to adjust the model, so I end up writing the same variables out again and again. Even though I can copy and paste, the model looks very long. Could I create one variable that stands in for many variables? E.g.:
fm <- lm(y ~ a + b + c + d + e, data)
I could create a variable like model1 = a+b+c+d+e, so the model looks like:
fm <- lm(y ~ model1, data)
I tried many ways, but none was successful, e.g. model1 <- c(a+b+c+d).
Could someone help me with this?
How about saving it as a formula?
model <- ~ a + b + c + d
You can then extract the terms using terms(), or update the formula using update().
Example:
model <- mpg ~ disp + wt + cyl
lm(model, mtcars)
## Call:
## lm(formula = model, data = mtcars)
##
## Coefficients:
## (Intercept)        disp          wt         cyl
##   41.107678    0.007473   -3.635677   -1.784944
model <- update(model, ~. + qsec)
lm(model, mtcars)
## Call:
## lm(formula = model, data = mtcars)
##
## Coefficients:
## (Intercept)        disp          wt         cyl        qsec
##    30.17771     0.01029    -4.55318    -1.24109     0.55277
Edit:
As Kristoffer Winther Balling mentioned in the comments, a cleverer way to do this is to save the formula as a string (e.g. "mpg ~ disp + wt + cyl") and then use as.formula. You can then use familiar paste or other string manipulation functions to change the formula.
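A minimal sketch of that string-based approach (covars is just an illustrative name):
covars <- c("disp", "wt", "cyl")
## paste the pieces together, then convert the string to a formula
f <- as.formula(paste("mpg ~", paste(covars, collapse = " + ")))
lm(f, data = mtcars)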

Loop over column names in regression

I want to run a whole batch of regressions, one for each variable in a data frame, and store the residual deviance from each regression in a new vector as the loop goes along.
The data frame is called "cw". The first few variables are just metadata, so ignore those. I tried the following:
deviances <- c()
for (x in colnames(cw)[1:8]) { deviances[x] <- NA }
for (x in colnames(cw)[8:27]) {
    model <- glm(cwonset ~ x, family = binomial, data = cw)
    append(deviances, model$deviance)
}
However, it gives the error:
Error in model.frame.default(formula = cwonset ~ x, data = cw, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
Any idea why?
Without your data I had to rely on mtcars to help you out; there is no need for an explicit for loop either. I assumed mpg as the dependent variable.
Logic: sapply loops through one column name at a time, and I regress on each. Internally it is a for loop, though.
sapply(colnames(mtcars[-1]), function(x) {
    form <- as.formula(paste0("mpg~", x))
    model <- glm(form, data = mtcars)
    model$deviance
})
#      cyl     disp       hp     drat       wt     qsec       vs       am     gear     carb
# 308.3342 317.1587 447.6743 603.5667 278.3219 928.6553 629.5193 720.8966 866.2980 784.2711
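Translated back to your cw data, an untested sketch (I can't run it without the data):
## Loop over columns 8:27, build each single-predictor formula,
## and collect the residual deviances in a named vector:
deviances <- sapply(colnames(cw)[8:27], function(x) {
    form <- reformulate(x, response = "cwonset")
    glm(form, family = binomial, data = cw)$deviance
})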

Linear regression with interaction fails in the rms-package

I'm playing around with interactions in the formula. I wondered if it's possible to do a regression with an interaction for just one of the two dummy variables. This seems to work in regular linear regression using the lm() function, but with the ols() function in the rms package the same formula fails. Does anyone know why?
Here's my example
data(mtcars)
mtcars$gear <- factor(mtcars$gear)
regular_lm <- lm(mpg ~ wt + cyl + gear + cyl:gear, data=mtcars)
summary(regular_lm)
regular_lm <- lm(mpg ~ wt + cyl + gear + cyl:I(gear == "4"), data=mtcars)
summary(regular_lm)
And now the rms example
library(rms)
dd <- datadist(mtcars)
options(datadist = "dd")
regular_ols <- ols(mpg ~ wt + cyl + gear + cyl:gear, data=mtcars)
regular_ols
regular_ols <- ols(mpg ~ wt + cyl + gear + cyl:I(gear == "4"), data=mtcars)
# Fails with:
# Error in if (!length(fname) || !any(fname == zname)) { :
#   missing value where TRUE/FALSE needed
This experiment might not be the wisest statistical move, as the estimates seem to change significantly, but I'm a little curious as to why ols() fails, since it is supposed to use the "same fitting routines used by lm".
I don't know exactly, but it has to do with the way the formula is evaluated rather than with the way the fit is done once the model has been translated. Using traceback() shows that the problem occurs within Design(eval.parent(m)); using options(error=recover) gets you to the point where you can see that
Browse[1]> fname
[1] "wt" "cyl" "gear"
Browse[1]> zname
[1] NA
in other words, zname is an internal variable that hasn't been set right, because the Design() function can't quite handle defining the interaction between cyl and the (gear == "4") dummy on the fly.
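For reference, a minimal sketch of the debugging recipe used above:
ols(mpg ~ wt + cyl + gear + cyl:I(gear == "4"), data = mtcars)  # fails
traceback()                  # shows the error arising inside Design(eval.parent(m))
options(error = recover)     # the next failure drops into an interactive browser
ols(mpg ~ wt + cyl + gear + cyl:I(gear == "4"), data = mtcars)
## ... choose the Design frame from the menu and inspect fname and zname
options(error = NULL)        # restore the default error handler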
This works though:
mtcars$cylgr <- with(mtcars, interaction(cyl, gear == "4"))
regular_ols <- ols(mpg ~ wt + cyl + gear + cylgr, data=mtcars)
