Object not found error when passing model formula to another function - r

I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures, it's purely for comparison purposes) :)
Now the issue is that if I define the arguments (linmod, k, direction) in my workspace and run the contents of the function line by line, it works flawlessly. BUT, if I call it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop, while inside a function. (try commenting out the step command, and set the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod, k = 10, direction = "both") {
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model
  form <- formula(linmod$call)
  # generate indices for cross validation
  rar <- n/k
  xval.idx <- list()
  s <- sample(1:n, n) # permutation of 1:n
  for (i in 1:k) {
    xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
  }
  # error calculation
  errors <- R2 <- 0
  for (j in 1:k) {
    datas.test <- datas[xval.idx[[j]], ]
    datas.train <- datas[-xval.idx[[j]], ]
    test.idx <- xval.idx[[j]]
    # THE MODELS
    lm.1 <- lm(form, data = datas.train)
    lm.step <- step(lm.1, direction = direction, trace = 0)
    step.pred <- predict(lm.step, newdata = datas.test)
    step.error <- sum((step.pred - response[test.idx])^2)
    errors[j] <- step.error/length(response[test.idx])
    SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
    R2[j] <- 1 - step.error/SS.tot
  }
  CVerror <- sum(errors)/k
  CV.R2 <- sum(R2)/k
  res <- list()
  res$CV.error <- CVerror
  res$CV.R2 <- CV.R2
  return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?

When you created your model, lm.cars, the formula in its call was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model comes along with it.
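You can check this directly (a quick illustration of mine, using the lm.cars model from the question):

f <- formula(lm.cars)
environment(f)   # the environment where the formula was created
#> <environment: R_GlobalEnv>   # here: the workspace where lm.cars was fitted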
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod, k = 10, direction = "both") {
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model
  .env <- environment() ## identify the environment of cv.step
  ## extract the formula in the environment of cv.step
  form <- as.formula(linmod$call, env = .env)
  ## The rest of your function follows
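An equivalent way to express the same fix (my sketch, not part of the original answer) is to extract the formula first and then repoint its environment at the function's own frame:

form <- formula(linmod)
## give the formula the environment of cv.step so that objects created inside
## the function (e.g. datas.train) can be found when step() re-evaluates model calls
environment(form) <- environment()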

Another problem that can cause this error is passing a character string (rather than a formula) to lm. Character vectors have no environment, so when lm converts the character to a formula, the resulting formula apparently also has no environment instead of being automatically assigned the local environment. If one then supplies weights from an object that is not a column of the data argument but does exist in the local function frame, one gets an object-not-found error. This behavior is not easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
  # the formula to use
  f = "x ~ y"
  # residualize
  resid(lm(formula = f, data = data, weights = wtds))
}

residualizer2 = function(data, x, y, wtds) {
  # the formula to use
  f = as.formula("x ~ y")
  # residualize
  resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
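A quick way to see the difference (my illustration, not from the original post): a bare character string carries no environment, while as.formula() attaches the frame it is called from, which is why residualizer2 can find wtds.

f_env_demo <- function() {
  f_conv <- as.formula("x ~ y")  # converted inside the function,
  environment(f_conv)            # so it captures this frame (where wtds would live)
}
f_env_demo()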
The bug becomes even harder to debug if one uses the name weights for the weights variable. In that case, since lm can't find a weights object, it falls back to the weights() function from the stats package, throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.
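For completeness, a sketch of that collision (my example, following the description above): when the local argument is itself called weights and the string formula keeps lm from seeing it, the lookup falls through to the weights() function, producing the closure error quoted above.

residualizer_weights = function(data, x, y, weights) {
  f = "x ~ y"   # string formula, so the local frame is not searched
  resid(lm(formula = f, data = data, weights = weights))
}
# residualizer_weights(data = d_example, x = "x", y = "y", weights = weightsvar)
# Error: invalid type (closure) for variable '(weights)'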

Related

An R function cannot work in local environment of other functions

I use the MatchIt package for propensity score matching. It can generate a matched dataset after matching using the get_matches() function.
However, if I do not run get_matches() in the global environment but call it inside another function, the matched data cannot be found in the local environment. (This turned out to be misleading: there is nothing wrong with MatchIt's output. The answer by Noah explains my question better.)
For producing my data
dataGen <- function(b0, b1, n = 2000, cor = 0) {
  # covariate
  sigma <- matrix(rep(cor, 9), 3, 3)
  diag(sigma) <- rep(1, 3)
  cov <- MASS::mvrnorm(n, rep(0, 3), sigma)
  # error
  error <- rnorm(n, 0, sqrt(18))
  # treatment variable
  logit <- b0 + b1*cov[,1] + 0.3*cov[,2] + cov[,3]
  p <- 1/(1 + exp(-logit))
  treat <- rbinom(n, 1, p)
  # outcome variable
  y <- error + treat + cov[,1] + cov[,2]
  data <- as.data.frame(cbind(cov, treat, y))
  return(data)
}
set.seed(1)
data <- dataGen(b0=-0.92, b1=0.8, 900)
For example, the following works: est.m.WLS() can use m.data.
fm1 <- treat ~ V1+V2+V3
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out,data=data)
est.m.WLS <- function(m.data, fm2) {
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  est <- model.1$coefficients["treat"]
  ## regular robust standard error ignoring pair membership
  model.1.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovHC)
  CI.r <- confint(model.1.2, "treat", level = 0.95)
  ## cluster robust standard error accounting for pair membership
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
  CI.cr <- confint(model.2.2, "treat", level = 0.95)
  return(c(est = est, CI.r, CI.cr))
}
fm2 <- y ~ treat+V1+V2+V3
est.m.WLS(m.data,fm2)
But the following version does not work. It reports
"object 'm.data' not found"
rm(m.data)
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
est.m.WLS <- function(m.out, fm2) {
  m.data <- MatchIt::get_matches(m.out, data = data)
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  est <- model.1$coefficients["treat"]
  ## regular robust standard error ignoring pair membership
  model.1.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovHC)
  CI.r <- confint(model.1.2, "treat", level = 0.95)
  ## cluster robust standard error accounting for pair membership
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
  CI.cr <- confint(model.2.2, "treat", level = 0.95)
  return(c(est = est, CI.r, CI.cr))
}
est.m.WLS(m.out,fm2)
Since I want to run parallel loops using the groundhog library for simulation purposes, the get_matches() function also does not work inside the foreach() %dopar% {...} environment.
res <- foreach(s = 1:7, .combine = "rbind") %dopar% {
  m.out <- MatchIt::matchit(data = data, formula = fm.p, distance = data$logit, m.order = "random", caliper = 0.2)
  m.data <- MatchIt::get_matches(m.out, data = data)
  ...
}
How should I fix the problem?
Any help would be appreciated. Thank you!
Using a for() loop directly does not run into this problem, since it just works in the global environment, but it is too slow... I really hope to run the thousand simulations at once. Help!
This has nothing to do with MatchIt or get_matches(). Run debugonce(est.m.WLS) with your second implementation of est.m.WLS(). You will see that get_matches() works perfectly fine and returns m.data. The problem occurs when lmtest::coeftest() runs with a formula argument for cluster.
This is due to a bug in R, outside any package, that I have already requested to be fixed. The problem is that expand.model.frame(), the function that searches for the dataset containing the variables supplied to cluster, only searches the global environment for data, but m.data does not exist in the global environment. To get around this issue, don't supply a formula to cluster; use cluster = m.data["subclass"]. This should hopefully be resolved in an upcoming R release.
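Applied to the second est.m.WLS() above, that change touches only the cluster-robust line (a sketch of the answer's suggestion):

## pass the cluster variable as a one-column data frame instead of a formula,
## so nothing has to be looked up in the global environment
model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL,
                              cluster = m.data["subclass"])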

How can I fit a linear model inside a user defined function in R; error non-numeric argument when specifying response variable?

I am trying to de-clutter some scripts by creating functions to complete repetitive tasks in R. One task I complete repeatedly is fitting a linear model to a set of data and creating predictions from that fit. The data I am working with are concentration and flow measurements from streams; flow is always the explanatory variable, but the response variable changes, so I would like to include it as a function input. However, I receive a "non-numeric argument to mathematical function" error when I run the function. I have tried both with and without quotes, since the lm() call does not require quotes, but without quotes I instead get the classic "object 'myobject' not found". Here's a simple example.
Update
library(tibble)

flows <- seq(0, 7, 0.01)
dat <- tibble(flow = sample(flows, 30),
              parameter1_conc = rnorm(30, 15, 4),
              parameter2_conc = rnorm(30, 50, 8))

regr_func <- function(modeldata, parameter, pred_maxflow, pred_flowint) {
  mod <- lm(as.formula(paste0('log(', parameter, ') ~ log(flow)')), data = modeldata)
  newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
  preds <<- predict(mod, newdata = newflow,
                    interval = 'prediction')
}

regr_func(modeldata = dat,
          parameter = 'parameter1_conc',
          pred_maxflow = 20,
          pred_flowint = 0.001)
Original Example Error
flows <- seq(0, 7, 0.01)
dat <- tibble(flow = sample(flows, 30),
              parameter1_conc = rnorm(30, 15, 4),
              parameter2_conc = rnorm(30, 50, 8))

regr_func <- function(modeldata, parameter, pred_maxflow, pred_flowint) {
  mod <- lm(log(parameter) ~ log(flow), data = modeldata)
  newflow <- data.frame(flow = seq(0, maxflow, flowint))
  preds <<- predict(mod, newdata = newflow,
                    interval = 'prediction')
}

regr_func(modeldata = dat,
          parameter = 'parameter1_conc',
          pred_maxflow = 20,
          pred_flowint = 0.001)
There are 3 issues here. The main one is that log(parameter) in your lm formula does not get substituted with the variable passed in as parameter. That means lm is literally looking for a column called parameter in your data, which doesn't exist. You can fix this by creating a formula with the name substituted in. Although building the formula from strings is the most common way to do this, it is a bit more efficient and safer to use substitute. This also allows you to pass your column name without quotes.
The second issue is that the arguments maxflow and flowint should probably be pred_maxflow and pred_flowint to match your function parameters.
Thirdly, using the <<- operator to write to a variable in the calling frame is bad practice. R users expect functions not to have such side effects, and know to store the output of function calls to variables under their control. Only in very rare circumstances should this be done within the function.
Putting all this together, we have:
regr_func <- function(modeldata, parameter, pred_maxflow, pred_flowint) {
  # build the formula by replacing the left-hand side (element 2) of the
  # template x ~ log(flow) with log(<parameter>) captured by substitute()
  f <- `[[<-`(x ~ log(flow), 2, substitute(log(parameter)))
  mod <- lm(f, data = modeldata)
  newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
  predict(mod, newdata = newflow, interval = 'prediction')
}
And we would call the function like this:
preds <- regr_func(modeldata = dat,
                   parameter = parameter1_conc,
                   pred_maxflow = 20,
                   pred_flowint = 0.001)
resulting in:
head(preds)
#> fit lwr upr
#> 1 Inf NaN NaN
#> 2 3.365491 2.188942 4.542041
#> 3 3.312636 2.219223 4.406049
#> 4 3.281717 2.236294 4.327140
#> 5 3.259780 2.248073 4.271488
#> 6 3.242765 2.256998 4.228531
Created on 2022-06-03 by the reprex package (v2.0.1)
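If the `[[<-` line looks cryptic, here is what it builds (a quick check of mine, run outside the function with the parameter written out by hand):

f <- `[[<-`(x ~ log(flow), 2, quote(log(parameter1_conc)))
f
#> log(parameter1_conc) ~ log(flow)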

Scoping with formulae in coxph objects

I'm trying to write a set of functions where the first function fits a cox model (via coxph in the survival package in R), and the second function gets estimated survival for a new dataset, given the fitted model object from the first function. I'm running into some sort of scoping issue that I don't quite know how to solve without substantially re-factoring my code (the only way I could think to do it would be much less general and much harder to read).
I have a very similar set of functions that are based on the glm function that do not run into the same issue and give me the answers I would expect. I've included a short worked example below that demonstrates the issue. The glue.cox and glue.glm are functions that have the basic functionality I am trying to get. glue.glm works as expected (yielding the same values from a calculation in the global environment), but the glue.cox complains that it can't find the data that was used to fit the cox model and ends with an error. I don't understand how to do this with substitute but I suspect that is the way forward. I've hit a wall with experimenting.
library(survival)
data.global = data.frame(time=runif(20), x=runif(20))
newdata.global = data.frame(x=c(0,1))
f1 = Surv(time) ~ x # this is the part that messes it up!!!!! Surv gets evaluated
f2 = time ~ x       # plain formula for the glm comparison, which works
myfit.cox.global = coxph(f1, data=data.global)
myfit.glm.global = glm(f2, data=data.global)
myfit.glm.global2 = glm(time ~ x, data=data.global)
myfit.cox <- function(f, dat.local) {
  coxph(f, data = dat.local)
}
myfit.glm <- function(f, dat.local) {
  glm(f, data = dat.local)
}
mypredict.cox <- function(ft, dat.local) {
  newdata = data.frame(x = c(0, 1))
  tail(survfit(ft, newdata)$surv, 1)
}
mypredict.glm <- function(ft, dat.local) {
  newdata = data.frame(x = c(0, 1))
  predict(ft, newdata)
}
glue.cox <- function(f, dat.local) {
  fit = myfit.cox(f, dat.local)
  mypredict.cox(fit, dat.local)
}
glue.glm <- function(f, dat.local) {
  fit = myfit.glm(f, dat.local)
  mypredict.glm(fit, dat.local)
}
# these numbers are the goal for non-survival data
predict(myfit.glm.global, newdata = newdata.global)
0.5950440 0.4542248
glue.glm(f2, data.global)
0.5950440 0.4542248 # this works
# these numbers are the goal for survival data
tail(survfit(myfit.cox.global, newdata = newdata.global)$surv, 1)
[20,] 0.02300798 0.03106081
glue.cox(f1, data.global)
Error in eval(predvars, data, env) : object 'dat.local' not found
This appears to work, at least in the narrow sense of making glue.cox() work as desired:
myfit.cox <- function(f, dat.local) {
  environment(f) <- list2env(list(dat.local = dat.local))
  coxph(f, data = dat.local)
}
The trick here is that most R modeling/model-processing functions look for data in the environment associated with the formula.
I don't know why glue.glm works without doing more digging, except for the general statement that [g]lm objects store more of the information needed for downstream processing internally (e.g. in the $qr element) than other model types.
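A slightly different way to express the same fix (my sketch, not from the original answer) keeps the formula's original environment reachable by chaining a new environment in front of it:

myfit.cox2 <- function(f, dat.local) {
  e <- new.env(parent = environment(f))  # new frame whose parent is the formula's env
  e$dat.local <- dat.local               # make the fitting data visible by name
  environment(f) <- e
  coxph(f, data = dat.local)
}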

Can't give a subset when using randomForest inside a function

I want to create a function that uses the randomForest function from the randomForest package inside it. randomForest takes a "subset" argument, which is a vector of row numbers of the data frame to use for training. However, if I use this argument when calling randomForest inside another defined function, I get the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
Here is a reproducible example, where we attempt to train a random forest to classify a response "type" either "A" or "B", based on three numerical predictors:
library(randomForest)
# define a random data frame to train with
train.data = data.frame(
  type = rep(NA, times = 500),
  x = runif(500),
  y = runif(500),
  z = runif(500)
)
train.data$type[runif(500) >= 0.5] = "A"
train.data$type[is.na(train.data$type)] = "B"
train.data$type = as.factor(train.data$type)

# define the training range
training.range = sample(500)[1:300]

# formula to use
tr_form = formula(type ~ x + y + z)

# Function that includes the randomForest function
train_rf = function(form, all_data, tr_subset) {
  p = randomForest(
    formula = form,
    data = all_data,
    subset = tr_subset,
    na.action = na.omit
  )
  return(p)
}
# test the new defined function
test_tree = train_rf(form = tr_form, all_data = train.data, tr_subset = training.range)
Running this gives the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
If, however, subset = tr_subset is removed from the randomForest call and tr_subset is removed from the train_rf arguments, this code runs fine; however, the whole data set is then used for training!
It should be noted that using the subset argument in randomForest outside of another function works completely fine and is the intended usage, as described in the package vignette.
I know that in the meantime I could just define another training set containing only the required rows and train using all of that, but is there a reason why my original code doesn't work?
Thanks.
EDIT: I conjecture that, since subset() is a base R function, R is getting confused and thinks the base function is meant rather than the subset argument of randomForest. I'm not an expert, though, so I may be wrong.
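If the lookup really does happen in the formula's environment, as the other questions on this page suggest, one workaround sketch (mine, untested against randomForest's internals) is to repoint the formula at the calling function's frame, or to build the call with the subset already evaluated via do.call():

train_rf2 = function(form, all_data, tr_subset) {
  environment(form) = environment()  # so tr_subset can be found via the formula
  randomForest(formula = form, data = all_data, subset = tr_subset, na.action = na.omit)
}

# alternative: do.call() embeds the already-evaluated vector in the call itself
train_rf3 = function(form, all_data, tr_subset) {
  do.call(randomForest, list(formula = form, data = all_data,
                             subset = tr_subset, na.action = na.omit))
}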

R Programming: Evaluating an expression when objects exist in multiple environments

Short Version
An expression contains two variables, x and y, where x is contained in one environment and y is contained in a second environment. How does the programmer evaluate the expression?
Detailed Version
I have a function that takes a formula and a data.frame as arguments. On the right hand side of the formula is a call to splines::bs to generate a B-spline basis. The workhorse function does a few things, one of which requires extracting the bs call from the formula and evaluating it. The problem I am trying to solve involves evaluating the bs call when argument values are contained in different environments.
Here are the functions needed to recreate the issue I am working on
library(splines)
extract_bmat <- function(form) {
  B <- NULL
  rr <- function(x) {
    if (is.call(x) && grepl("bs", deparse(x[[1]]))) {
      B <<- x
    } else if (is.recursive(x)) {
      as.call(lapply(as.list(x), rr))
    } else {
      x
    }
  }
  z <- lapply(as.list(form), rr)
  B
}

some_workhorse <- function(formula, data) {
  # ... lots of cool stuff ...
  # fit <- lm(formula, data)
  bmat <- eval(extract_bmat(formula), data)
  bmat
}
# The following works when evaluated in the .GlobalEnv
# The eval(extract_bmat(formula), data) call within the some_workhorse
# function works without errors
xi <- c(3, 4.5)
eg_data <- data.frame(x = 1:10, y = sin(1:10))
some_workhorse(y ~ bs(x, knots = xi), data = eg_data)
Now, if the some_workhorse() call, the xi vector, and the eg_data data.frame are all generated within a function environment, it causes an error.
foo <- function() {
  xi_in_foo <- c(2, 3)
  eg_data_in_foo <- data.frame(x = 1:10, y = sin(1:10))
  some_workhorse(y ~ bs(x, knots = xi_in_foo), data = eg_data_in_foo)
}
foo()
# Error in sort(c(rep(Boundary.knots, ord), knots)) :
# object 'xi_in_foo' not found
The location of the error is within the splines::bs call, but that is not the
important part; xi_in_foo not found is the important issue to address.
I know the issue is related to my poor handling of environments in R. My
primary question is
How should the call eval(extract_bmat(formula), data) within the
some_workhorse function be written so that it works correctly when called in
the .GlobalEnv or when called within a function environment?
Secondary question:
Within the extract_bmat function, I would prefer to define an environment
for B and use assign instead of <<-. I suspect that <<- is the best
option because of the uncertainty in the levels of recursion taking place.
That said, I would like to see other solutions.
Thanks for the help.
You should define your function as
some_workhorse <- function(formula, data) {
  # ... lots of cool stuff ...
  # fit <- lm(formula, data)
  bmat <- eval(extract_bmat(formula), data, environment(formula))
  bmat
}
Note that formulas in R capture the environment in which they were created. As long as xi_in_foo exists in the environment where the formula was defined, this should work. Variables will first be looked up in the data list/data.frame, and then the formula's environment is used as the enclosing environment. If you weren't using formulas, people sometimes use parent.frame() as the enclos= argument so that variables are looked for in the environment in which the function was called, rather than where the function was defined, as is the default under R's lexical scoping.
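As a minimal illustration of the lookup order that the three-argument eval() gives you (my example, not from the original answer): symbols are first searched in the data, then in the enclosing environment.

e <- new.env()
e$xi_in_demo <- c(2, 3)                       # plays the role of xi_in_foo
d <- list(x = 1:10)                           # plays the role of the data argument
eval(quote(length(x) + length(xi_in_demo)), d, e)
#> [1] 12   # x found in d, xi_in_demo found in the enclosing environment e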

Resources