I'm using fixest::feols() and I have a function to which I want to pass an argument that is used to subset the data via the subset = argument. However, I keep getting the error: The argument 'subset' is a formula whose variables must be in the data set given in argument 'data'.
I have tried the following code:
library(fixest)
cars <- mtcars
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = ~ hp > substitute(hp.c.off))
}
my_fun(data = cars, 150)
My expected outcome would be the same as if one typed:
feols(mpg ~ disp + drat,
data = cars,
subset = ~ hp > 150)
I know I have to replace the value of hp.c.off before passing it on to a formula. One could do this by creating a string expression first and then using as.formula(); however, I was wondering if there is a better way to build the expression programmatically that doesn't require creating a string first and then converting it into a formula.
Thanks!
1) Create the formula as a character string and then convert it to a formula.
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = as.formula(paste("~ hp >", hp.c.off)))
}
2) or just don't use the subset= argument and instead use the data argument with subset.
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = subset(data, hp > hp.c.off))
}
3) or use the fact that subset= can be a logical vector
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = data$hp > hp.c.off)
}
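A fourth possibility, sketched here in base R and not tested against fixest, is to build the formula with bquote(), which injects the value of the cutoff without a string round trip. The helper name make_subset_formula is hypothetical, and the row count is checked with plain subsetting rather than with feols().

```r
# Hypothetical helper: build ~ hp > <cutoff> without pasting strings.
make_subset_formula <- function(cutoff) {
  # bquote() substitutes the value of `cutoff` into the quoted call;
  # eval() then turns the call `~ hp > 150` into a formula object.
  eval(bquote(~ hp > .(cutoff)))
}

f <- make_subset_formula(150)
print(f)

# The right-hand side of a one-sided formula is its second element,
# so it can be evaluated against the data to see which rows it keeps.
keep <- eval(f[[2]], mtcars)
sum(keep)
```

Inside my_fun() the result should be usable as subset = make_subset_formula(hp.c.off), assuming feols() treats it like a hand-typed one-sided formula.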
You can use rlang::new_formula(), with rlang::expr() to quote the rhs and !!rlang::enexpr() to capture and inject the hp.c.off argument.
I don’t have fixest installed, but this demonstrates building the formula inside a function:
library(rlang)
cars <- mtcars
my_fun <- function(data, hp.c.off) {
new_formula(lhs = NULL, rhs = expr(hp > !!enexpr(hp.c.off)))
}
my_fun(data = cars, 150)
# ~hp > 150
# <environment: 0x1405e38>
A simple option is to pass the subset formula itself as an argument to the function
my_fun <- function(data,expr = ~ hp > 150){
feols(mpg ~ disp + drat,
data = data,
subset = expr)
}
-testing
> my_fun(data = cars)
OLS estimation, Dep. Var.: mpg
Observations: 13
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.414923 8.019808 2.919636 0.015310 *
disp -0.021349 0.008284 -2.577276 0.027545 *
drat -0.201284 2.014207 -0.099932 0.922373
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 2.16851 Adj. R2: 0.300667
Related
I am trying to create a call to mice::with.mids(), then evaluate it. It appears the call is being created, but then it cannot be evaluated (must be some environment/scoping issue?). I've created a pared down reproducible example below. Any assistance is greatly appreciated! Thank you!
mice_in_tbl_uvregression <-
function(data, # mice data of class mids
method, # regression method, eg. lm, glm
formula = "hp ~ mpg", # character formula (needs to be character for other reasons)
method.args = NULL # named list of other args that will be passed to `method=`
) {
# construct the call
fun_call <-
rlang::call2(
rlang::expr(with),
data = data,
expr = rlang::expr((!!method)(formula = !!as.formula(formula), !!!method.args))
)
# evaluate call
eval(fun_call)
}
set.seed(12345)
mice_in_tbl_uvregression(
data = mice::mice(mtcars, m = 2),
method = lm
)
#> Error in eval(predvars, data, env): object 'hp' not found
Created on 2021-06-27 by the reprex package (v2.0.0)
We could build the call as a string and parse it (to extract the language object) before evaluating it
mice_in_tbl_uvregression <-
function(data, # mice data of class mids
method, # regression method, eg. lm, glm
formula = "hp ~ mpg", # character formula (needs to be character for other reasons)
method.args = NULL # named list of other args that will be passed to `method=`
) {
# construct the call
fun_call <- parse(text = glue::glue("with(data = {deparse(substitute(data))}, expr = {deparse(substitute(method))}(as.formula({formula})))"))
print(fun_call[[1]])
out <- eval(fun_call)
out$call$expr[[2]] <- out$call$expr[[2]][[2]]
out
}
-testing
set.seed(12345)
out1 <- mice_in_tbl_uvregression(
data = mice::mice(mtcars, m = 2),
method = lm
)
-output
out1
call :
with.mids(data = mice::mice(mtcars, m = 2), expr = lm(hp ~ mpg))
call1 :
mice::mice(data = mtcars, m = 2)
nmis :
mpg cyl disp hp drat wt qsec vs am gear carb
0 0 0 0 0 0 0 0 0 0 0
analyses :
[[1]]
Call:
lm(formula = as.formula(hp ~ mpg))
Coefficients:
(Intercept) mpg
324.08 -8.83
[[2]]
Call:
lm(formula = as.formula(hp ~ mpg))
Coefficients:
(Intercept) mpg
324.08 -8.83
Let's say I have a function like this:
data("mtcars")
ncol(mtcars)
test <- function(string){
fit <- lm(mpg ~ cyl,
data = string)
return(fit)
}
I'd like to be able to have the "string" variable evaluated as the dataset for a linear regression like so:
test("mtcars")
However, I get an error:
Error in eval(predvars, data, env) : invalid 'envir' argument of
type 'character'
I've tried using combinations of eval and parse, but to no avail. Any ideas?
You can use get() to search by name for an object.
test <- function(string){
fit <- lm(mpg ~ cyl, data = get(string))
return(fit)
}
test("mtcars")
# Call:
# lm(formula = mpg ~ cyl, data = get(string))
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
You can add one more line to make the output look better. Notice the change of the Call part in the output. It turns from data = get(string) to data = mtcars.
test <- function(string){
fit <- lm(mpg ~ cyl, data = get(string))
fit$call$data <- as.name(string)
return(fit)
}
test("mtcars")
# Call:
# lm(formula = mpg ~ cyl, data = mtcars)
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
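Another way to get a readable Call, shown here as a sketch that assumes the string names an object visible from the function, is to build the lm() call with the dataset's name already in it. The name test2 is hypothetical.

```r
# Hypothetical variant: inject the dataset name as a symbol before evaluating.
test2 <- function(string) {
  # as.name() turns "mtcars" into the symbol mtcars;
  # bquote() splices it into the call, so lm() records data = mtcars.
  eval(bquote(lm(mpg ~ cyl, data = .(as.name(string)))))
}

fit <- test2("mtcars")
fit$call
```

With this variant the call does not need to be patched after fitting, because the symbol is already in place when lm() captures it.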
Try this slight change to your code:
#Code
test <- function(string){
fit <- lm(mpg ~ cyl,
data = eval(parse(text=string)))
return(fit)
}
#Apply
test("mtcars")
Output:
Call:
lm(formula = mpg ~ cyl, data = eval(parse(text = string)))
Coefficients:
(Intercept) cyl
37.885 -2.876
I've been using map() to calculate and extract certain statistics from multiple lm() models.
To give a reproducible example, using the mtcars dataset, I start with an input vector of formulae to be estimated using lm() models:
library(tidyverse)
df <- mtcars
input_char <- c("mpg ~ disp",
"mpg ~ disp + hp")
input_formula <- map(input_char, formula)
I've then got a function that calculates and extracts the relevant statistics for each model. For simplicity and reproducibility, here's a simplified function that just extracts the R-squared of the model.
get_rsquared <- function(a_formula) {
model1 <- lm(a_formula, data = df)
rsquared <- summary(model1)$r.squared
c(model = a_formula, rsquared = rsquared)
}
I've then used map to iterate through the formulae and extract the R-squared from each model.
models <- map(input_formula, get_rsquared)
models
which gives the output:
[[1]]
[[1]]$model
mpg ~ disp
<environment: 0x7f98987f4000>
[[1]]$rsquared
[1] 0.7183433
[[2]]
[[2]]$model
mpg ~ disp + hp
<environment: 0x7f98987f4000>
[[2]]$rsquared
[1] 0.7482402
My question is regarding the output being a list.
Is there a simple way to make the output a dataframe?
My desired output is:
#> model rsquared
#> 1 mpg ~ disp 0.7183433
#> 2 mpg ~ disp + hp 0.7482402
Keep the formulas as character strings and call as.formula() inside get_rsquared(), since character strings are easier to work with than formula objects.
library(purrr)
library(dplyr)
df <- mtcars
input_char <- c("mpg ~ disp",
"mpg ~ disp + hp")
get_rsquared <- function(a_formula) {
model1 <- lm(as.formula(a_formula), data = df)
rsquared <- summary(model1)$r.squared
list(model = a_formula, rsquared = rsquared)
}
map_df(input_char, get_rsquared)
# A tibble: 2 x 2
model rsquared
<chr> <dbl>
1 mpg ~ disp 0.718
2 mpg ~ disp + hp 0.748
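For comparison, roughly the same table can be built without purrr. This is a base-R sketch, not a tidyverse one; get_rsquared_df and results are hypothetical names.

```r
df <- mtcars
input_char <- c("mpg ~ disp",
                "mpg ~ disp + hp")

# Hypothetical helper: return a one-row data frame per model.
get_rsquared_df <- function(a_formula) {
  model1 <- lm(as.formula(a_formula), data = df)
  data.frame(model = a_formula,
             rsquared = summary(model1)$r.squared,
             stringsAsFactors = FALSE)
}

# Fit each model, then stack the one-row data frames.
results <- do.call(rbind, lapply(input_char, get_rsquared_df))
results
```

Returning a one-row data frame from the helper keeps the stacking step trivial, at the cost of slightly more work per model.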
Let's say I use the mtcars dataset to set up an arbitrary formula:
data(mtcars)
myFormula <- as.formula("mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)")
I would like to use that formula inside the lm function, but before that I would like to remove any rows that contain Inf, NaN, or NA. For example, if disp / hp results in any Inf values, I would like to remove the rows that contain them. I know I can do that by generating a new variable first, removing Inf, and then running lm with the formula, but I would like to do it using the formula terms, since this is part of a Shiny application and the formula is an input.
My try:
formulaTerms <- terms(myFormula)
formulaTerms <- gsub("I", "", labels(formulaTerms))
formulaTermsRatio <- formulaTerms[grep("/", formulaTerms)]
mtcarsDT <- setDT(mtcars)
mtcarsDT <- mtcarsDT[, formulaTermsRatio[1] := sym(formulaTermsRatio[1])]
Use drop.terms. Assuming that each term is represented by a single column in the model matrix (i.e. no factors with > 2 levels) we compute the model matrix mm and find the column numbers, wx, of the bad columns. Then use drop.terms to drop those columns from the terms object and extract the formula from the revised terms object.
mtcars[1, 3] <- Inf
# is.na is TRUE for NA or NaN; is.infinite is TRUE for Inf or -Inf
is.bad <- function(x) any(is.na(x) | is.infinite(x))
fo_terms <- terms(myFormula) # myFormula is taken from question
mm <- model.matrix(myFormula, mtcars)
wx <- which(apply(mm[, -1], 2, is.bad))
fo_terms2 <- drop.terms(fo_terms, wx, keep.response = TRUE)
fo2 <- formula(fo_terms2)
myFormula
## mpg ~ cyl + I(disp/hp) + I(wt^2) + I((qsec + vs)/gear)
fo2
## mpg ~ cyl + I(wt^2) + I((qsec + vs)/gear)
Update
If you want to remove bad rows rather than terms from the formula then:
lm(myFormula, mtcars, subset = !apply(mm, 1, is.bad))
Note that lm will automatically remove rows with NAs and NaNs (depending on the na.action argument) so in this case you could simplify is.bad to only check for Inf and -Inf.
Another approach would be to replace Inf and -Inf with NA.
# is.infinite() has no data frame method, so apply it column by column
mtcars[sapply(mtcars, is.infinite)] <- NA
and then perform lm normally.
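Replacing Inf in the raw columns does not catch infinities created by the formula itself, such as disp / hp when hp is 0. The sketch below checks the evaluated model matrix instead. The names dat, mf and keep are hypothetical, and the zero in hp is planted only to trigger the case.

```r
dat <- mtcars
dat$hp[1] <- 0   # planted zero: disp / hp becomes Inf in row 1

myFormula <- mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)

# Evaluate every term, keeping all rows (na.pass) so row positions line up.
mf <- model.frame(myFormula, dat, na.action = na.pass)
mm <- model.matrix(attr(mf, "terms"), mf)

# A row is kept only if every evaluated term is finite (no NA, NaN, Inf, -Inf).
keep <- apply(mm, 1, function(row) all(is.finite(row)))

fit <- lm(myFormula, data = dat[keep, ])
nobs(fit)
```

Because the check runs on the evaluated terms, it does not need to know which variables appear in a denominator, which may suit a case where the formula is user input.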
You can remove these values from the data you're regressing on. Inf will occur where hp==0 or gear==0.
data(mtcars)
df <- mtcars
myFormula <- as.formula("mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)")
df <- df[!(df$hp==0 | df$gear==0),]
lm(myFormula,df)
> lm(myFormula,df)
Call:
lm(formula = myFormula, data = df)
Coefficients:
(Intercept) cyl I(disp/hp) I(wt^2) I((qsec + vs)/gear)
35.5847 -1.9639 1.0707 -0.3671 -0.1699
I would like to know how can I come up with a lm formula syntax that would enable me to use paste together with cbind for multiple multivariate regression.
Example
In my model I have a set of variables, which corresponds to the primitive example below:
data(mtcars)
depVars <- paste("mpg", "disp")
indepVars <- paste("qsec", "wt", "drat")
Problem
I would like to create a model with my depVars and indepVars. The model, typed by hand, would look like that:
modExample <- lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)
I'm interested in generating the same formula without referring to variable names and only using depVars and indepVars vectors defined above.
Attempt 1
For example, what I had on mind would correspond to:
mod1 <- lm(formula = formula(paste(cbind(paste(depVars, collapse = ",")), " ~ ",
indepVars)), data = mtcars)
Attempt 2
I tried this as well:
mod2 <- lm(formula = formula(cbind(depVars), paste(" ~ ",
paste(indepVars,
collapse = " + "))),
data = mtcars)
Side notes
I found a number of good examples on how to use paste with formula but I would like to know how I can combine with cbind.
This is mostly a syntax question; in my real data I have a number of variables I would like to introduce to the model, and making use of the previously generated vectors is more parsimonious and makes the code more presentable. In effect, I'm only interested in creating a formula object that would contain cbind with variable names corresponding to one vector and the remaining variables corresponding to another vector.
In a word, I want to arrive at the formula in modExample without having to type variable names.
I think this works.
data(mtcars)
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")
lm(formula(paste('cbind(',
paste(depVars, collapse = ','),
') ~ ',
paste(indepVars, collapse = '+'))), data = mtcars)
All the solutions below use these definitions:
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")
1) character string formula Create a character string representing the formula and then run lm using do.call. Note that the formula shown in the output displays correctly and is written out.
fo <- sprintf("cbind(%s) ~ %s", toString(depVars), paste(indepVars, collapse = "+"))
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = "cbind(mpg, disp) ~ qsec+wt+drat", data = mtcars)
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
1a) This would also work:
fo <- sprintf("cbind(%s) ~.", toString(depVars))
do.call("lm", list(fo, quote(mtcars[c(depVars, indepVars)])))
giving:
Call:
lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars[c(depVars,
indepVars)])
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
2) reformulate @akrun and @Konrad, in comments below the question, suggest using reformulate. This approach produces a "formula" object whereas the ones above produce a character string as the formula. (If this were desired for the prior solutions above it would be possible using fo <- formula(fo) .) Note that it is important that the response argument to reformulate be a call object and not a character string or else reformulate will interpret the character string as the name of a single variable.
fo <- reformulate(indepVars, parse(text = sprintf("cbind(%s)", toString(depVars)))[[1]])
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
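The response call can also be assembled without parse(), which keeps the whole construction free of string parsing. This is a sketch of that idea; lhs is a hypothetical name.

```r
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")

# Build the call cbind(mpg, disp) directly from symbols.
lhs <- as.call(c(list(as.name("cbind")), lapply(depVars, as.name)))

# reformulate() accepts a call as the response.
fo <- reformulate(indepVars, response = lhs)
fo

fit <- lm(fo, data = mtcars)
coef(fit)
```

Building the call from symbols avoids any quoting issues that could arise if a variable name were not syntactically valid as plain text.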
3) lm.fit Another way that does not use a formula at all is:
m <- as.matrix(mtcars)
fit <- lm.fit(cbind(1, m[, indepVars]), m[, depVars])
The output is a list with these components:
> names(fit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
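As a quick check, sketched here with a named intercept column (an addition to the lm.fit call shown above), the coefficients from lm.fit can be compared with those from the formula interface. The names X, fit_matrix and fit_formula are hypothetical.

```r
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")

m <- as.matrix(mtcars)

# Naming the column of ones gives the intercept row a label.
X <- cbind("(Intercept)" = 1, m[, indepVars])
fit_matrix <- lm.fit(X, m[, depVars])

fit_formula <- lm(cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)

# Compare the two coefficient matrices, ignoring dimnames.
all.equal(unname(fit_matrix$coefficients), unname(coef(fit_formula)))
```

Note that lm.fit returns a plain list and not an "lm" object, so methods such as summary() are not available on it directly.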