How to dynamically reference datasets in function call of linear regression - r

Let's say I have a function like this:
data("mtcars")
ncol(mtcars)
test <- function(string){
fit <- lm(mpg ~ cyl,
data = string)
return(fit)
}
I'd like to be able to have the "string" variable evaluated as the dataset for a linear regression like so:
test("mtcars")
However, I get an error:
Error in eval(predvars, data, env) : invalid 'envir' argument of
type 'character'
I've tried using combinations of eval and parse, but to no avail. Any ideas?

You can use get() to search by name for an object.
test <- function(string){
fit <- lm(mpg ~ cyl, data = get(string))
return(fit)
}
test("mtcars")
# Call:
# lm(formula = mpg ~ cyl, data = get(string))
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
You can add one more line to make the output look better. Notice the change of the Call part in the output. It turns from data = get(string) to data = mtcars.
test <- function(string){
fit <- lm(mpg ~ cyl, data = get(string))
fit$call$data <- as.name(string)
return(fit)
}
test("mtcars")
# Call:
# lm(formula = mpg ~ cyl, data = mtcars)
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
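A slightly more defensive variant of the same idea, as a sketch only. The names fit_by_name and env are made up for illustration, and starting the lookup in the caller's environment is an assumption about what you want; adjust it if your data frames live elsewhere.

```r
# Hypothetical helper: look up a data frame by name, then fit the model.
fit_by_name <- function(name, env = parent.frame()) {
  dat <- get(name, envir = env)   # search starts in the caller's environment
  stopifnot(is.data.frame(dat))   # fail early if the name is not a data frame
  fit <- lm(mpg ~ cyl, data = dat)
  fit$call$data <- as.name(name)  # keep the printed call readable
  fit
}

fit <- fit_by_name("mtcars")
```

The is.data.frame() check means that passing the name of some other object gives an immediate error instead of a confusing one from inside lm().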

Try this slight change to your code:
#Code
test <- function(string){
fit <- lm(mpg ~ cyl,
data = eval(parse(text=string)))
return(fit)
}
#Apply
test("mtcars")
Output:
Call:
lm(formula = mpg ~ cyl, data = eval(parse(text = string)))
Coefficients:
(Intercept) cyl
37.885 -2.876
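One caveat: eval(parse(text = string)) runs whatever code the string contains, not just object names. If you only ever pass a name, a possible alternative (a sketch, not the only way) is do.call() with as.name(), which also leaves data = mtcars in the stored call. The function name test_docall is hypothetical.

```r
# Hypothetical variant: build the lm() call with the dataset passed as a symbol.
test_docall <- function(string) {
  do.call("lm", list(formula = mpg ~ cyl, data = as.name(string)))
}

fit <- test_docall("mtcars")
fit$call$data  # the symbol mtcars, not an eval(parse(...)) expression
```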

Related

How to reference objects in a different environment inside a formula

I'm using fixest::feols() and I have a function to which I want to pass an argument in order to subset the data using the subset = argument. However, I keep getting the error: The argument 'subset' is a formula whose variables must be in the data set given in argument 'data'.
I have tried the following code:
library(fixest)
cars <- mtcars
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = ~ hp > substitute(hp.c.off))
}
my_fun(data = cars, 150)
My expected outcome would be the same as if one typed:
feols(mpg ~ disp + drat,
data = cars,
subset = ~ hp > 150)
I know I have to replace the value of hp.c.off before passing it into the formula. One could do this by creating a string expression first and then using as.formula(); however, I was wondering if there is a better way to build the expression programmatically that doesn't require creating a string first and then converting it into a formula.
Thanks!
1) Create the formula as a character string and then convert it to a formula.
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = as.formula(paste("~ hp >", hp.c.off)))
}
2) or just don't use the subset= argument and instead use the data argument with subset.
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = subset(data, hp > hp.c.off))
}
3) or use the fact that subset= can be a logical vector
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = data$hp > hp.c.off)
}
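To check that pre-filtering the data and using a logical vector pick the same rows, here is a small sketch that uses lm() rather than feols() so that it runs without fixest installed. The names fit_above and hp_cutoff are made up for illustration.

```r
# Hypothetical check with base R only: both routes should keep the same rows.
fit_above <- function(data, hp_cutoff) {
  keep <- data$hp > hp_cutoff  # logical vector, one entry per row
  list(
    by_subset_fn = lm(mpg ~ disp + drat, data = subset(data, hp > hp_cutoff)),
    by_logical   = lm(mpg ~ disp + drat, data = data[keep, ])
  )
}

fits <- fit_above(mtcars, 150)
sapply(fits, nobs)  # number of observations used by each fit
```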
You can use rlang::new_formula(), with rlang::expr() to quote the rhs and !!rlang::enexpr() to capture and inject the hp.c.off argument.
I don’t have fixest installed, but this demonstrates building the formula inside a function:
library(rlang)
cars <- mtcars
my_fun <- function(data, hp.c.off) {
new_formula(lhs = NULL, rhs = expr(hp > !!enexpr(hp.c.off)))
}
my_fun(data = cars, 150)
# ~hp > 150
# <environment: 0x1405e38>
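If you would rather not depend on rlang, base R's bquote() can do the same kind of injection. This is only a sketch: make_subset_formula and cutoff are hypothetical names, and I have not run the result through feols().

```r
# Hypothetical helper: build ~ hp > <cutoff> without going through a string.
make_subset_formula <- function(cutoff) {
  # bquote() splices the value of cutoff into the expression;
  # eval() then turns the resulting `~` call into a formula object.
  eval(bquote(~ hp > .(cutoff)))
}

f <- make_subset_formula(150)
f[[2]]  # the right-hand side: hp > 150
```

The right-hand side can be evaluated against a data frame directly, e.g. eval(f[[2]], mtcars), which gives the logical vector of rows to keep.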
A simple option is to pass a one-sided formula as an argument to the function
my_fun <- function(data,expr = ~ hp > 150){
feols(mpg ~ disp + drat,
data = data,
subset = expr)
}
-testing
> my_fun(data = cars)
OLS estimation, Dep. Var.: mpg
Observations: 13
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.414923 8.019808 2.919636 0.015310 *
disp -0.021349 0.008284 -2.577276 0.027545 *
drat -0.201284 2.014207 -0.099932 0.922373
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 2.16851 Adj. R2: 0.300667

lapply function to pass single and + arguments to LM

I am stuck trying to pass "+" arguments to lm.
My 2 lines of code below work fine for single arguments like:
model_combinations=c('.', 'Long', 'Lat', 'Elev')
lm_models = lapply(model_combinations, function(x) {
lm(substitute(Y ~ i, list(i=as.name(x))), data=climatol_ann)})
But same code fails if I add 'Lat+Elev' at end of list of model_combinations as in:
model_combinations=c('.', 'Long', 'Lat', 'Elev', 'Lat+Elev')
Error in eval(expr, envir, enclos) : object 'Lat+Elev' not found
I've scanned posts but am unable to find solution.
I've generally found it more robust/easier to understand to use reformulate to construct formulas via string manipulations rather than trying to use substitute() to modify an expression, e.g.
model_combinations <- c('.', 'Long', 'Lat', 'Elev', 'Lat+Elev')
model_formulas <- lapply(model_combinations,reformulate,
response="Y")
lm_models <- lapply(model_formulas,lm,data=climatol_ann)
Because reformulate works at a string level, it doesn't have a problem if the elements are themselves non-atomic (e.g. Lat+Elev). The only tricky situation here is if your data argument or variables are constructed in some environment that can't easily be found, but passing an explicit data argument usually avoids problems.
(You can also use as.formula(paste(...)) or as.formula(sprintf(...)); reformulate() is just a convenient wrapper.)
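Since climatol_ann isn't available here, the same pattern on mtcars looks roughly like this (a sketch; rhs_terms, formulas and fits are just illustrative names):

```r
# Right-hand sides as strings, including a compound one.
rhs_terms <- c(".", "hp", "cyl", "hp+cyl")

# reformulate() turns each string into a formula with mpg as the response.
formulas <- lapply(rhs_terms, reformulate, response = "mpg")
fits <- lapply(formulas, lm, data = mtcars)

# Label each fit with its formula so the list is easy to read.
names(fits) <- vapply(formulas,
                      function(f) paste(deparse(f), collapse = " "),
                      character(1))
names(fits)
```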
With as.formula you can do:
models = lapply(model_combinations,function(x) lm(as.formula(paste("Y ~", x)), data=climatol_ann))
For the mtcars dataset:
model_combs = c("hp","cyl","hp+cyl")
testModels = lapply(model_combs,function(x) lm(as.formula(paste("mpg ~ ",x)), data=mtcars) )
testModels
#[[1]]
#
#Call:
#lm(formula = as.formula(paste("mpg ~ ", x)), data = mtcars)
#
#Coefficients:
#(Intercept) hp
# 30.09886 -0.06823
#
#
#[[2]]
#
#Call:
#lm(formula = as.formula(paste("mpg ~ ", x)), data = mtcars)
#
#Coefficients:
#(Intercept) cyl
# 37.885 -2.876
#
#
#[[3]]
#
#Call:
#lm(formula = as.formula(paste("mpg ~ ", x)), data = mtcars)
#
#Coefficients:
#(Intercept) hp cyl
# 36.90833 -0.01912 -2.26469

Combining cbind and paste in linear model

I would like to know how can I come up with a lm formula syntax that would enable me to use paste together with cbind for multiple multivariate regression.
Example
In my model I have a set of variables, which corresponds to the primitive example below:
data(mtcars)
depVars <- paste("mpg", "disp")
indepVars <- paste("qsec", "wt", "drat")
Problem
I would like to create a model with my depVars and indepVars. The model, typed by hand, would look like that:
modExample <- lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)
I'm interested in generating the same formula without referring to variable names and only using depVars and indepVars vectors defined above.
Attempt 1
For example, what I had on mind would correspond to:
mod1 <- lm(formula = formula(paste(cbind(paste(depVars, collapse = ",")), " ~ ",
indepVars)), data = mtcars)
Attempt 2
I tried this as well:
mod2 <- lm(formula = formula(cbind(depVars), paste(" ~ ",
paste(indepVars,
collapse = " + "))),
data = mtcars)
Side notes
I found a number of good examples on how to use paste with formula but I would like to know how I can combine with cbind.
This is mostly a syntax question; in my real data I have a number of variables I would like to introduce to the model, and making use of the previously generated vectors is more parsimonious and makes the code more presentable. In effect, I'm only interested in creating a formula object that would contain cbind with variable names corresponding to one vector and the remaining variables corresponding to another vector.
In a word, I want to arrive at the formula in modExample without having to type variable names.
I think this works:
data(mtcars)
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")
lm(formula(paste('cbind(',
paste(depVars, collapse = ','),
') ~ ',
paste(indepVars, collapse = '+'))), data = mtcars)
All the solutions below use these definitions:
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")
1) character string formula Create a character string representing the formula and then run lm using do.call. Note that the formula shown in the output displays correctly and is written out.
fo <- sprintf("cbind(%s) ~ %s", toString(depVars), paste(indepVars, collapse = "+"))
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = "cbind(mpg, disp) ~ qsec+wt+drat", data = mtcars)
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
1a) This would also work:
fo <- sprintf("cbind(%s) ~.", toString(depVars))
do.call("lm", list(fo, quote(mtcars[c(depVars, indepVars)])))
giving:
Call:
lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars[c(depVars,
indepVars)])
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
2) reformulate @akrun and @Konrad, in comments below the question, suggest using reformulate. This approach produces a "formula" object whereas the ones above produce a character string as the formula. (If this were desired for the prior solutions above it would be possible using fo <- formula(fo).) Note that it is important that the response argument to reformulate be a call object and not a character string or else reformulate will interpret the character string as the name of a single variable.
fo <- reformulate(indepVars, parse(text = sprintf("cbind(%s)", toString(depVars)))[[1]])
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
3) lm.fit Another way that does not use a formula at all is:
m <- as.matrix(mtcars)
fit <- lm.fit(cbind(1, m[, indepVars]), m[, depVars])
The output is a list with these components:
> names(fit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
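If you want to avoid parse() altogether, the cbind(...) call can also be assembled from symbols and handed to reformulate() as the response. This is a sketch along the lines of 2) above; lhs is just an illustrative name.

```r
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")

# Build the call cbind(mpg, disp) from symbols instead of parsing text.
lhs <- as.call(c(list(as.name("cbind")), lapply(depVars, as.name)))

# reformulate() accepts a call as the response.
fo <- reformulate(indepVars, response = lhs)
fit <- lm(fo, data = mtcars)
fo
```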

Using lapply to fit multiple model -- how to keep the model formula self-contained in lm object

The following code fits 4 different model formulas to the mtcars dataset, using either for loop or lapply. In both cases, the formula stored in the result is referred to as formulas[[1]], formulas[[2]], etc. instead of the human-readable formula.
formulas <- list(
mpg ~ disp,
mpg ~ I(1 / disp),
mpg ~ disp + wt,
mpg ~ I(1 / disp) + wt
)
res <- vector("list", length=length(formulas))
for (i in seq_along(formulas)) {
res[[i]] <- lm(formulas[[i]], data=mtcars)
}
res
lapply(formulas, lm, data=mtcars)
Is there a way to make the full, readable formula show up in the result?
This should work
lapply(formulas, function(x, data) eval(bquote(lm(.(x),data))), data=mtcars)
And it returns
[[1]]
Call:
lm(formula = mpg ~ disp, data = data)
Coefficients:
(Intercept) disp
29.59985 -0.04122
[[2]]
Call:
lm(formula = mpg ~ I(1/disp), data = data)
Coefficients:
(Intercept) I(1/disp)
10.75 1557.67
....etc
We use bquote to insert the formula into the call to lm and then evaluate the expression.
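If you also want the call to show data = mtcars rather than data = data, one option is to name the dataset directly inside bquote(). A sketch, with a shortened formula list for brevity:

```r
formulas <- list(mpg ~ disp, mpg ~ I(1 / disp))

# Only the formula is spliced in; mtcars stays in the call as a plain symbol.
fits <- lapply(formulas, function(f) eval(bquote(lm(.(f), data = mtcars))))

fits[[2]]$call  # shows the spliced-in formula and data = mtcars
```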
Why not just:
lapply( formulas, function(frm) lm( frm, data=mtcars))
#------------------
[[1]]
Call:
lm(formula = frm, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
[[2]]
Call:
lm(formula = frm, data = mtcars)
Coefficients:
(Intercept) I(1/disp)
10.75 1557.67
snipped....
If you wanted the names of the result to have the 'character'-ized version of the formulas it would just be:
names(res) <- as.character(formulas)
res[1]
#-----
$`mpg ~ disp`
Call:
lm(formula = frm, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
you can also try something like
library(purrr)
library(tibble)
models <- map(formulas, lm, data = mtcars)
models

Easily performing the same regression on different datasets

I'm performing the same regression on several different datasets (same dependent and independent variables). However, there are many independent variables, and I often want to test adding/removing different variables. I'd like to avoid making all these changes to different lines of code, just because they use different datasets. Can I instead just copy the formula that was used to create some object, and then create a new object using a different dataset? For example, something like:
fit1 <- lm(y ~ x1 + x2 + x3 + ..., data = dataset1)
fit2 <- lm(fit1$call, data = dataset2) # this doesn't work
fit3 <- lm(fit1$call, data = dataset3) # this doesn't work
This way, if I want to update numerous regressions, I just update the first one and then rerun them all.
Can this be done? Preferably without using a loop or paste().
Thanks!
Or use update
(fit <- lm(mpg ~ wt, data = mtcars))
# Call:
# lm(formula = mpg ~ wt, data = mtcars)
#
# Coefficients:
# (Intercept) wt
# 37.285 -5.344
update(fit, data = mtcars[mtcars$hp < 100, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[mtcars$hp < 100, ])
#
# Coefficients:
# (Intercept) wt
# 39.295 -5.379
update(fit, data = mtcars[1:10, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[1:10, ])
#
# Coefficients:
# (Intercept) wt
# 33.774 -4.285
Collect your datasets into a list and then use lapply. E.g.:
dsets <- list(dataset1,dataset2,dataset3)
lapply(dsets, function(x) lm(y ~ x1 + x2, data=x) )
I'm not entirely sure this is what you want, but you can do it as follows:
formula <- y ~ x1 + x2 + x3 + ...
fit1 <- lm(formula, data = dataset1)
fit2 <- lm(formula, data = dataset2)
fit3 <- lm(formula, data = dataset3)
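Putting these ideas together, you can also pull the formula back out of the first fit with formula() and reuse it across a named list of datasets. A sketch on mtcars subsets, since dataset1 etc. aren't available here; the list names are made up.

```r
# Edit the model once, here.
fit1 <- lm(mpg ~ wt + hp, data = mtcars)

# Stand-ins for dataset2, dataset3, ...
datasets <- list(
  low_hp  = mtcars[mtcars$hp < 100, ],
  first10 = mtcars[1:10, ]
)

# Reuse the formula stored in fit1 for every other dataset.
refits <- lapply(datasets, function(d) lm(formula(fit1), data = d))
sapply(refits, nobs)
```

Changing the formula in fit1 and rerunning the last two statements updates all the refits.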

Resources