I am trying to create a call to mice::with.mids(), then evaluate it. It appears the call is being created, but then it cannot be evaluated (must be some environment/scoping issue?). I've created a pared down reproducible example below. Any assistance is greatly appreciated! Thank you!
mice_in_tbl_uvregression <-
function(data, # mice data of class mids
method, # regression method, eg. lm, glm
formula = "hp ~ mpg", # character formula (needs to be character for other reasons)
method.args = NULL # named list of other args that will be passed to `method=`
) {
# construct the call
fun_call <-
rlang::call2(
rlang::expr(with),
data = data,
expr = rlang::expr((!!method)(formula = !!as.formula(formula), !!!method.args))
)
# evaluate call
eval(fun_call)
}
set.seed(12345)
mice_in_tbl_uvregression(
data = mice::mice(mtcars, m = 2),
method = lm
)
#> Error in eval(predvars, data, env): object 'hp' not found
Created on 2021-06-27 by the reprex package (v2.0.0)
We could parse a string created (to extract the language call) before doing the evaluation
mice_in_tbl_uvregression <-
function(data, # mice data of class mids
method, # regression method, eg. lm, glm
formula = "hp ~ mpg", # character formula (needs to be character for other reasons)
method.args = NULL # named list of other args that will be passed to `method=`
) {
# construct the call
fun_call <- parse(text = glue::glue("with(data = {deparse(substitute(data))}, expr = {deparse(substitute(method))}(as.formula({formula})))"))
print(fun_call[[1]])
out <- eval(fun_call)
out$call$expr[[2]] <- out$call$expr[[2]][[2]]
out
}
-testing
set.seed(12345)
out1 <- mice_in_tbl_uvregression(
data = mice::mice(mtcars, m = 2),
method = lm
)
-output
out1
call :
with.mids(data = mice::mice(mtcars, m = 2), expr = lm(hp ~ mpg))
call1 :
mice::mice(data = mtcars, m = 2)
nmis :
mpg cyl disp hp drat wt qsec vs am gear carb
0 0 0 0 0 0 0 0 0 0 0
analyses :
[[1]]
Call:
lm(formula = as.formula(hp ~ mpg))
Coefficients:
(Intercept) mpg
324.08 -8.83
[[2]]
Call:
lm(formula = as.formula(hp ~ mpg))
Coefficients:
(Intercept) mpg
324.08 -8.83
Related
I'm using fixest::feols() and I have a function I want to pass an argument to in order to subset the data using the subset = argument. However when keep getting the error: The argument 'subset' is a formula whose variables must be in the data set given in argument 'data'.
I have tried the following code:
library(fixest)
cars <- mtcars
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = ~ hp > substitute(hp.c.off))
}
my_fun(data = cars, 150)
My expected outcome would be the same as if one typed:
feols(mpg ~ disp + drat,
data = cars,
subset = ~ hp > 150)
I know I have to replace the value of hp.c.off before passing it onto a formula. And one could do this by creating a string expression first and then using as.formula() however, I was wondering if there is a better way to do programmatically build the expression that didn't require creating a string expression first and then converting it into a formula.
Thanks!
1) Create the formula as a character string and then convert it to a formula.
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = as.formula(paste("~ hp >", hp.c.off)))
}
2) or just don't use the subset= argument and instead use the data argument with subset.
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = subset(data, hp > hp.c.off))
}
3) or use the fact that subset= can be a logical vector
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = data$hp > hp.c.off)
}
You can use rlang::new_formula(), with rlang::expr() to quote the rhs and !!rlang::enexpr() to capture and inject the hp.c.off argument.
I don’t have fixest installed, but this demonstrates building the formula inside a function:
library(rlang)
cars <- mtcars
my_fun <- function(data, hp.c.off) {
new_formula(lhs = NULL, rhs = expr(hp > !!enexpr(hp.c.off)))
}
my_fun(data = cars, 150)
# ~hp > 150
# <environment: 0x1405e38>
Simple option is to pass an expression as argument to the function
my_fun <- function(data,expr = ~ hp > 150){
feols(mpg ~ disp + drat,
data = data,
subset = expr)
}
-testing
> my_fun(data = cars)
OLS estimation, Dep. Var.: mpg
Observations: 13
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.414923 8.019808 2.919636 0.015310 *
disp -0.021349 0.008284 -2.577276 0.027545 *
drat -0.201284 2.014207 -0.099932 0.922373
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 2.16851 Adj. R2: 0.300667
Let's say I have a function like this:
data("mtcars")
ncol(mtcars)
test <- function(string){
fit <- lm(mpg ~ cyl,
data = string)
return(fit)
}
I'd like to be able to have the "string" variable evaluated as the dataset for a linear regression like so:
test("mtcars")
However, I get an error:
Error in eval(predvars, data, env) : invalid 'envir' argument of
type 'character'
I've tried using combinations of eval and parse, but to no avail. Any ideas?
You can use get() to search by name for an object.
test <- function(string){
fit <- lm(mpg ~ cyl, data = get(string))
return(fit)
}
test("mtcars")
# Call:
# lm(formula = mpg ~ cyl, data = get(string))
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
You can add one more line to make the output look better. Notice the change of the Call part in the output. It turns from data = get(string) to data = mtcars.
test <- function(string){
fit <- lm(mpg ~ cyl, data = get(string))
fit$call$data <- as.name(string)
return(fit)
}
test("mtcars")
# Call:
# lm(formula = mpg ~ cyl, data = mtcars)
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
Try this slight change to your code:
#Code
test <- function(string){
fit <- lm(mpg ~ cyl,
data = eval(parse(text=string)))
return(fit)
}
#Apply
test("mtcars")
Output:
Call:
lm(formula = mpg ~ cyl, data = eval(parse(text = string)))
Coefficients:
(Intercept) cyl
37.885 -2.876
I have many regression objects created by lm(). Each one has been built from a different data
frame, and these different data frames have different dimensions. But each data frame contains the logical variables x, y, and z. In some cases, I want to update each regression object so that the subset argument is x. In other cases, I want to update each regression object so that the subset argument is y. And in still other cases, I want to update each regression object so that the subset argument is z. What is an efficient way to do this?
This is the inefficient way:
# Set only one of these three variables to be TRUE
subsetX <- TRUE
subsetY <- FALSE
subsetZ <- FALSE
# Now update the regressions.
if (subsetX) {
update(lm1, subset = x)
update(lm2, subset = x)
[...]
} else if (subsetY) {
update(lm1, subset = y)
update(lm2, subset = y)
[...]
} else if (subsetZ) {
update(lm1, subset = z)
update(lm2, subset = z)
[...]
}
This approach is inefficient because there is a lot of duplication across the three code blocks that update the regressions. I would rather do something like
subsetVar <- dplyr::case_when(
subsetX ~ expression(x),
subsetY ~ expression(y),
subsetZ ~ expression(z))
update(lm1, subset = substitute(subsetVar))
update(lm2, subset = substitute(subsetVar))
[...]
That is, I would like to write at most one update() command for each
regression object, while still varying the subset argument on the basis of
logical (boolean) variables like subsetX and subsetY. Is this possible?
The code above doesn't work; when I try it, I get an Error in xj[i] : invalid subscript type 'symbol' error message.
I've searched other Stack Overflow questions, but I haven't found anything that speaks directly to this problem.
Here's a way to make your workflow a little easier with a simple function and using purrr::map2 to feed it the list of models and subsets you want
library(purrr)
set.seed(2020)
mtcars$x <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$y <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$z <- sample(c(TRUE, FALSE), 32, replace = TRUE)
lm1 <- lm(mpg ~ hp, mtcars)
subset_lm_by <- function(model, subset = NULL) {
if (subset == "x") {
update(model, subset = x)
} else if (subset == "y") {
update(model, subset = y)
} else if (subset == "z") {
update(model, subset = z)
} else {
# cat('I only accept x, y or z!')
}
}
models <- list(lm1, lm1, lm1, lm1)
subsets <- list("x", "y", "z", "nonsense")
purrr::map2(.x = models,
.y = subsets,
~ subset_lm_by(model = .x, subset = .y))
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = x)
#>
#> Coefficients:
#> (Intercept) hp
#> 31.21178 -0.08098
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = y)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.83501 -0.07294
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = z)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.53554 -0.08688
#>
#>
#> [[4]]
#> NULL
I know there are several ways to compare regression models. One way it to create models (from linear to multiple) and compare R2, Adjusted R2, etc:
Mod1: y=b0+b1
Mod2: y=b0+b1+b2
Mod3: y=b0+b1+b2+b3 (etc)
I´m aware that some packages could perform a stepwise regression, but I'm trying to analyze that with purrr. I could create several simple linear models (Thanks for this post here), and now I want to Know how can create regression models adding a specific IV to equation:
reproducible code
data(mtcars)
library(tidyverse)
library(purrr)
library(broom)
iv_vars <- c("cyl", "disp", "hp")
make_model <- function(nm) lm(mtcars[c("mpg", nm)])
fits <- Map(make_model, iv_vars)
glance_tidy <- function(x) c(unlist(glance(x)), unlist(tidy(x)[, -1]))
t(iv_vars %>% Map(f = make_model) %>% sapply(glance_tidy))
Output
What I want:
Mod1: mpg ~cyl
Mod2: mpg ~cly + disp
Mod3: mpg ~ cly + disp + hp
Thanks much.
I would begin by creating a list tibble storing your formulae. Then map the model over the formula, and map glance over the models.
library(tidyverse)
library(broom)
mtcars %>% as_tibble()
formula <- c(mpg ~ cyl, mpg ~ cyl + disp)
output <-
tibble(formula) %>%
mutate(model = map(formula, ~lm(formula = .x, data = mtcars)),
glance = map(model, glance))
output$glance
output %>% unnest(glance)
You could cumulatively paste over your vector of id_vars to get the combinations you want. I used the code in this answer to do this.
I use the plus sign as the separator between variables to get ready for the formula notation in lm.
cumpaste = function(x, .sep = " ") {
Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)
}
( iv_vars_cum = cumpaste(iv_vars, " + ") )
[1] "cyl" "cyl + disp" "cyl + disp + hp"
Then switch the make_model function to use a formula and a dataset. The explanatory variables, separated by the plus sign, get passed to the function after the tilde in the formula. Everything is pasted together, which lm conveniently interprets as a formula.
make_model = function(nm) {
lm(paste0("mpg ~", nm), data = mtcars)
}
Which we can see works as desired, returning a model with both explanatory variables.
make_model("cyl + disp")
Call:
lm(formula = as.formula(paste0("mpg ~", nm)), data = mtcars)
Coefficients:
(Intercept) cyl disp
34.66099 -1.58728 -0.02058
You'll likely need to rethink how you want to combine the info together, as you will now how differing numbers of columns due to the increased number of coefficients.
A possible option is to add dplyr::bind_rows to your glance_tidy function and then use map_dfr from purrr for the final output.
glance_tidy = function(x) {
dplyr::bind_rows( c( unlist(glance(x)), unlist(tidy(x)[, -1]) ) )
}
iv_vars_cum %>%
Map(f = make_model) %>%
map_dfr(glance_tidy, .id = "model")
# A tibble: 3 x 28
model r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cyl 0.7261800 0.7170527 3.205902 79.56103 6.112687e-10 2 -81.65321 169.3064
2 cyl + disp 0.7595658 0.7429841 3.055466 45.80755 1.057904e-09 3 -79.57282 167.1456
3 cyl + disp + hp 0.7678877 0.7430186 3.055261 30.87710 5.053802e-09 4 -79.00921 168.0184 ...
I am having trouble with formulas, environments, and survfit().
Things work fine for lm() but they fail for survfit().
General problem statement:
I am fitting a series of formulas to some data. So, I call the
modeling function with the formula passed as a variable. Later,
I want to work with the formula from the fitted object.
(From my naive point of view, the trouble comes from survfit not
recording the environment.)
Detailed Example
Expected behaviour as seen in lm():
library("plyr")
preds <- c("wt", "qsec")
f <- function() {
lm(mpg ~ wt, data = mtcars)
}
fits <- alply(preds, 1, function(pred)
{
modform <- reformulate(pred, response = "mpg")
lm(modform, data = mtcars)
})
fits[[1]]$call$formula
##modform
formula(fits[[1]])
## mpg ~ wt
## <environment: 0x1419d1a0>
Even though fits[[1]]$call$formula resolves to modform I can
still get the original formula with formula(fits[[1]]).
But things fail for survfit():
library("plyr")
library("survival")
preds <- c("resid.ds", "rx", "ecog.ps")
fits <-
alply(preds, 1, function(pred)
{
modform <- paste("Surv(futime, fustat)", pred, sep = " ~ ")
modform <- as.formula(modform)
print(modform)
fit <- survfit(modform, data = ovarian)
})
fits[[1]]$call$formula
## modform
formula(fits[[1]])
## Error in eval(expr, envir, enclos) : object 'modform' not found
Here (and in contrast to lm-fits), formula(fits[[1]]) does not
work!
So, my specific question is: How can I get back the formula used
to fit fits[[1]]?
The issue is that when x$formula is NULL, for an lm object there is a backup plan to get the formula; this doesn't exist for survfit objects
library("plyr")
library("survival")
preds <- c("wt", "qsec")
f <- function() lm(mpg ~ wt, data = mtcars)
fits <- alply(preds, 1, function(pred) {
modform <- reformulate(pred, response = "mpg")
lm(modform, data = mtcars)
})
fits[[1]]$formula
# NULL
The formula can be extracted with formula(fits[[1]]) which uses the formula generic. The lm S3 method for formula is
stats:::formula.lm
# function (x, ...)
# {
# form <- x$formula
# if (!is.null(form)) {
# form <- formula(x$terms)
# environment(form) <- environment(x$formula)
# form
# }
# else formula(x$terms)
# }
So when fits[[1]]$formula returns NULL, forumla.lm looks for a terms attribute in the object and finds the formula that way
fits[[1]]$terms
The survfit objects don't have x$formula or x$terms, so formula(x) givens an error
preds <- c("resid.ds", "rx", "ecog.ps")
fits <- alply(preds, 1, function(pred) {
modform <- paste("Surv(futime, fustat)", pred, sep = " ~ ")
modform <- as.formula(modform)
fit <- survfit(modform, data = ovarian)
})
fits[[1]]$formula
# NULL
formula(fits[[1]]) ## error
formula(fits[[1]]$terms)
# list()
You can fix this by inserting the formula into the call and evaluating it
modform <- as.formula(paste("Surv(futime, fustat)", 'rx', sep = " ~ "))
substitute(survfit(modform, data = ovarian), list(modform = modform))
# survfit(Surv(futime, fustat) ~ rx, data = ovarian)
eval(substitute(survfit(modform, data = ovarian), list(modform = modform)))
# Surv(futime, fustat) ~ rx
# Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
#
# n events median 0.95LCL 0.95UCL
# rx=1 13 7 638 268 NA
# rx=2 13 5 NA 475 NA
Or by manually putting the formula into x$call$formula
fit <- survfit(modform, data = ovarian)
fit$call$formula
# modform
fit$call$formula <- modform
fit$call$formula
# Surv(futime, fustat) ~ rx
fit
# Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
#
# n events median 0.95LCL 0.95UCL
# rx=1 13 7 638 268 NA
# rx=2 13 5 NA 475 NA