R: lapply leads to different result in step function

Using the swiss data, I am trying to do stepwise linear regression for different ranges of Agriculture, so I tried:
data <- swiss
splits <- split(data, cut(data$Agriculture, breaks=c(0, 50, Inf), right=FALSE))
select <- function(x) {
  null <- lm(Fertility ~ 1, data = splits[[x]])
  full <- lm(Fertility ~ ., data = splits[[x]])
  step(null, scope = list(lower = null, upper = full), direction = 'forward')
}
select(2)
This works, but the following doesn't:
null_list <- lapply(splits, function(x) lm(Fertility ~ 1, data = x))
full_list <- lapply(splits, function(x) lm(Fertility ~ ., data = x))
select <- function(x) {
  null <- null_list[[x]]
  full <- full_list[[x]]
  step(null, scope = list(lower = null, upper = full), direction = 'forward')
}
select(2)
The second version throws an error:
Error in eval(expr, envir, enclos) : object 'Fertility' not found
But when I check
lm(Fertility~1, data=splits[[2]])
null_list[[2]]
and
lm(Fertility~., data=splits[[2]])
full_list[[2]]
They both look the same. What makes the difference? Am I making some obvious mistake?

Well, if you look at the calls for the two versions, you'll see they are not exactly the same:
lm(Fertility~1, data=splits[[2]])$call
# lm(formula = Fertility ~ 1, data = splits[[2]])
null_list[[2]]$call
# lm(formula = Fertility ~ 1, data = x)
Notice how the data= argument differs between the two. The former still points to a valid global variable; the latter points to x, which no longer exists. The step() function tries to evaluate the formula in the context from which it was called, and in that context x is the index you passed to select(), not a data frame. If you changed the select() function to
select <- function(z) {
  null <- null_list[[z]]
  full <- full_list[[z]]
  step(null, scope = list(lower = null, upper = full), direction = 'forward')
}
select(2)
You'd get a different error
Error in is.data.frame(data) : object 'x' not found
which basically means that step() is having trouble getting back to the variable containing the data needed to re-fit the model when adding or subtracting a covariate.
One workaround is to embed the data in the lm() call itself. You can do that with
null_list <- lapply(splits, function(x) do.call("lm", list(Fertility ~ 1, data = x)))
full_list <- lapply(splits, function(x) do.call("lm", list(Fertility ~ ., data = x)))
This results in a "messy-looking" call, because the entire data frame gets spliced into it, but the results should be the same.
This is unfortunately a side effect of non-standard evaluation. It would be nice if step() looked for the data in the $model component of the fitted model, but I believe that doesn't line up with the original data when there are NA values, so R has no choice but to try to re-evaluate the data= argument in some context.
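Another workaround in the same spirit (a sketch, not from the original answer): fit the models as before, then overwrite the data argument stored in each model's call with the actual data frame, so that step() can re-evaluate the call anywhere:
null_list <- lapply(splits, function(x) {
  fit <- lm(Fertility ~ 1, data = x)
  fit$call$data <- x  # replace the symbol `x` in the stored call with the data itself
  fit
})
full_list <- lapply(splits, function(x) {
  fit <- lm(Fertility ~ ., data = x)
  fit$call$data <- x
  fit
})
This has the same "messy call" downside, since the whole data frame ends up embedded in each call.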

Related

Scoping with formulae in coxph objects

I'm trying to write a set of functions where the first function fits a cox model (via coxph in the survival package in R), and the second function gets estimated survival for a new dataset, given the fitted model object from the first function. I'm running into some sort of scoping issue that I don't quite know how to solve without substantially re-factoring my code (the only way I could think to do it would be much less general and much harder to read).
I have a very similar set of functions that are based on the glm function that do not run into the same issue and give me the answers I would expect. I've included a short worked example below that demonstrates the issue. The glue.cox and glue.glm are functions that have the basic functionality I am trying to get. glue.glm works as expected (yielding the same values from a calculation in the global environment), but the glue.cox complains that it can't find the data that was used to fit the cox model and ends with an error. I don't understand how to do this with substitute but I suspect that is the way forward. I've hit a wall with experimenting.
library(survival)
data.global = data.frame(time=runif(20), x=runif(20))
newdata.global = data.frame(x=c(0,1))
f1 = Surv(time) ~ x # this is the part that messes it up: Surv() gets evaluated
f2 = time ~ x       # plain formula for the glm comparison; this one works
myfit.cox.global = coxph(f1, data=data.global)
myfit.glm.global = glm(f2, data=data.global)
myfit.glm.global2 = glm(time ~ x, data=data.global)
myfit.cox <- function(f, dat.local){
  coxph(f, data = dat.local)
}
myfit.glm <- function(f, dat.local){
  glm(f, data = dat.local)
}
mypredict.cox <- function(ft, dat.local){
  newdata = data.frame(x = c(0, 1))
  tail(survfit(ft, newdata)$surv, 1)
}
mypredict.glm <- function(ft, dat.local){
  newdata = data.frame(x = c(0, 1))
  predict(ft, newdata)
}
glue.cox <- function(f, dat.local){
  fit = myfit.cox(f, dat.local)
  mypredict.cox(fit, dat.local)
}
glue.glm <- function(f, dat.local){
  fit = myfit.glm(f, dat.local)
  mypredict.glm(fit, dat.local)
}
# these numbers are the goal for non-survival data
predict(myfit.glm.global, newdata = newdata.global)
# 0.5950440 0.4542248
glue.glm(f2, data.global)
# 0.5950440 0.4542248   # this works
# these numbers are the goal for survival data
tail(survfit(myfit.cox.global, newdata = newdata.global)$surv, 1)
# [20,] 0.02300798 0.03106081
glue.cox(f1, data.global)
# Error in eval(predvars, data, env) : object 'dat.local' not found
This appears to work, at least in the narrow sense of making glue.cox() work as desired:
myfit.cox <- function(f, dat.local){
  environment(f) <- list2env(list(dat.local = dat.local))
  coxph(f, data = dat.local)
}
The trick here is that most R modeling/model-processing functions look for data in the environment associated with the formula.
Without doing more digging, I don't know why glue.glm works, except for the general observation that glm objects store more of the information needed for downstream processing internally (e.g. in the $qr element) than other model types do.
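As a minimal illustration of the mechanism (a sketch with made-up data, not part of the original answer):
f <- y ~ x
environment(f)   # the environment where the formula was created
e <- list2env(list(dat.local = data.frame(y = rnorm(5), x = rnorm(5))))
environment(f) <- e   # anything that consults environment(f) can now see dat.local
exists("dat.local", envir = environment(f))  # TRUE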

Invalid type (closure) for a variable that is not a function

loess.smooth <- function(dat) {
  dat <- dat[complete.cases(dat), ]
  ## response
  vars <- colnames(dat)
  ## covariate
  id <- 1:nrow(dat)
  ## define a loess filter function (fitting loess regression line)
  loess.filter <- function(x, span) loess(formula = paste(x, "id", sep = "~"),
                                          data = dat,
                                          degree = 1,
                                          span = span)$fitted
  ## apply filter column-by-column
  new.dat <- as.data.frame(lapply(vars, loess.filter, span = 0.75),
                           col.names = colnames(dat))
}
When I try to apply loess.smooth to a dataframe, I get the error:
Error in model.frame.default(formula = paste(x, "id", sep = "~"), data = dat) :
invalid type (closure) for variable 'id'
I don't understand why this is a problem, since id is not a function, as the error seems to imply.
When I run through these lines of code outside of the function, it works perfectly fine and does exactly what I want it to do.
It is a scoping issue caused by passing character strings to the loess function instead of formulas. The problem is that environment() returns NULL for a string, so loess doesn't know where to look the variables up. If you wrap the string in as.formula() it works, because the resulting formula is assigned the local environment of the function call by default.
As for the cryptic error: it happens when a variable shares its name with a function from a loaded package. If a function doesn't find a variable in the local environment, the search continues up through the loaded packages, where it can hit a function of the same name. In my case, the id function was loaded by the dplyr package.
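A minimal sketch of that fix, keeping the structure of the original function:
loess.filter <- function(x, span) loess(formula = as.formula(paste(x, "id", sep = "~")),
                                        data = dat,
                                        degree = 1,
                                        span = span)$fitted
Because as.formula() is called inside loess.smooth, the resulting formula carries that function's environment, where both dat and id can be found.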

R Passing linear model to another function inside a function

I am trying to find the optimal "lambda" parameter for the Box-Cox transformation.
I am using the implementation from the MASS package, so I only need to create the model and extract the lambda.
Here is the code for the function:
library(MASS)
find_lambda <- function(x) {
  # Function to find the best lambda for the Box-Cox transform
  my_tmp <- data.frame(x = x)  # Create a temporary data frame, to use it with the lm
  str(my_tmp)                  # Gives the expected output
  the_lm <- lm(x ~ 1, data = my_tmp)      # Creates the linear model, no error here
  print(summary(the_lm))       # Prints the summary, as expected
  out <- boxcox(the_lm, plotit = FALSE)   # Gives the error
  best_lambda <- out$x[which.max(out$y)]  # Extracting the best-fitting lambda
  return(best_lambda)
}
find_lambda(runif(100))
It gives the following error:
Error in is.data.frame(data) : object 'my_tmp' not found
The interesting thing is that the very same code is working outside the function. In other words, for some reason, the boxcox function from the MASS package is looking for the variable in the global environment.
I don't really understand, what exactly is going on... Do you have any ideas?
P.S. I do not provide a software/hardware specification, since this error was successfully replicated on a number of my friends' laptops.
P.P.S. I have found a way to solve the initial problem with the forecast package, but I would still like to know why this code is not working.
User-contributed packages don't always do a great job of tracking the environments where calls were executed when they manipulate function calls. The quickest fix for you would be to change the line from
the_lm <- lm(x ~ 1, data = my_tmp)
to
the_lm <- lm(x ~ 1, data = my_tmp, y = TRUE, qr = TRUE)
If y and qr are not requested in the lm() call, the boxcox function tries to re-run lm() with those parameters via an update() call, and things get mucked up inside a function scope.
Alternatively, why not let boxcox() do the fitting?
find_lambda <- function(x) {
  # Function to find the best lambda for the Box-Cox transform
  my_tmp <- data.frame(x = x)  # Create a temporary data frame, to use it with the model
  out <- boxcox(x ~ 1, data = my_tmp, plotit = FALSE)  # boxcox() fits the model itself
  best_lambda <- out$x[which.max(out$y)]  # Extracting the best-fitting lambda
  return(best_lambda)
}
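For example (the exact lambda depends on the random input; the seed is only there to make the example reproducible):
set.seed(1)
find_lambda(runif(100))
# returns a single numeric value: the lambda maximizing the profile log-likelihood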
I think your scoping issue is with update.default, which calls eval(call, parent.frame()), and my_tmp doesn't exist in the boxcox environment. Please correct me if I'm wrong on this.
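You can check this in the source (a sketch; the exact code may differ across base R versions):
getS3method("update", "default")
# ...
# if (evaluate) eval(call, parent.frame()) else call
# ...
The rebuilt call is thus evaluated in update()'s caller, not in the frame where my_tmp lives.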
boxcox cannot find your data. This may be because of a scoping issue.
You can feed the data into the boxcox function directly.
find_lambda <- function(x) {
  # Function to find the best lambda for the Box-Cox transform
  my_tmp <- data.frame(x = x)  # Create a temporary data frame, to use it with the lm
  str(my_tmp)                  # Gives the expected output
  the_lm <- lm(x ~ 1, data = my_tmp)      # Creates the linear model, no error here
  print(summary(the_lm))       # Prints the summary, as expected
  out <- boxcox(the_lm, plotit = FALSE, data = my_tmp)  # feed the data in here
  best_lambda <- out$x[which.max(out$y)]  # Extracting the best-fitting lambda
  return(best_lambda)
}
find_lambda(runif(100))

Error in eval(expr, envir, enclos) : could not find function - Nested Functions & Environments

The R code shown below is a minimal working example to reproduce an error that I can't say I understand. Running the script should yield the error, Error in eval(expr, envir, enclos) : could not find function "fitModel". After reading a thing or two on environments I think I understand why this is happening in this case, "fitModel" is not defined in the execution environment of "obscureFunction". This I fixed by making the following change to "myFormula":
myFormula <- "y ~ eval(fitModel(x, a), envir = environment(fitModel))"
I don't understand how "fitModel" can be evaluated in the environment of "fitModel" when the function can't be found in the calling environment of "obscureFunction"; in other words, I don't understand why this code change works. I also don't understand why the original code works fine if the body of "topFunction" is run without calling it, i.e. if we define "fitModel" and "obscureFunction" in R_GlobalEnv and call "obscureFunction" from the console.
## Minimum Working Example to reproduce error
rm(list = ls())
library(minpack.lm)
topFunction <- function(){
  fitModel <- function(x, a){
    exp(-a * x)
  }
  ## Create a function to use with lapply()
  obscureFunction <- function(){
    x <- seq(-1, 1, 0.01)
    y <- exp(-0.5 * x)
    Data <- data.frame(x, y)
    init <- c(a = 1)
    myFormula <- "y ~ fitModel(x, a)"
    myFormula <- as.formula(myFormula)
    nlsOutput <- nlsLM(formula = myFormula, start = init, data = Data)
    return(nlsOutput)
  }
  ## Function call
  obscureFunction()
  ## Other calculations done with fitModel()
}
topFunction()
Well, there are two issues here. The first is using a string for a formula. It's better to use
myFormula <- y ~ fitModel(x, a)
The reason is that formulas capture their environment; strings do not. (As noted by @BridieG, as.formula() will capture the environment; I skipped over that line when reading the code. I still think it's better to create the formula directly.) Having a reference environment makes it easier to find functions used in a formula. So if you were using lm() rather than nlsLM(), this would work with these two changes:
# myFormula <- "y ~ fitModel(x, a)" ... becomes
myFormula <- y ~ fitModel(x, 1)
#nlsOutput <- nlsLM(formula = myFormula, start = init, data = Data) ...becomes
nlsOutput <- lm(formula = myFormula, data = Data)
This works with the formula syntax (unquoted var names) and not the string because the formula can capture the environment.
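A quick way to see the difference for yourself (base R, not from the original answer):
f1 <- y ~ x
environment(f1)               # the environment where the formula was created
f2 <- "y ~ x"
environment(f2)               # NULL: character vectors carry no environment
environment(as.formula(f2))   # as.formula() attaches the calling environment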
At least that's how it should work. Package authors are free to evaluate formulas however they want, and the authors of the nlsLM() function decided to ignore the environment assigned to the formula. They do so in this function inside nlsLM():
FCT <- function(par) {
  mf[m] <- par
  rhs <- eval(formula[[3L]], envir = mf)
  res <- lhs - rhs
  res <- .swts * res
  res
}
So this is the second problem. Here they force the evaluation to happen in the mf object, which is a data.frame made up of the covariates of the data and the parameter estimates. Had it been written as
rhs <- eval(formula[[3L]], envir = mf, environment(formula))
it would have worked. This is basically what model.frame() does in lm() that allows this to work. We can make our own "corrected" version of the function with
# tested with minpack.lm_1.1-8
nlsLM2 <- nlsLM
body(nlsLM2)[[27]][[3]][[3]][[3]] <- quote(rhs <- eval(formula[[3L]], envir = mf, environment(formula)))
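You can print the patched expression to confirm the edit landed where intended (the [[27]] index is specific to minpack.lm 1.1-8 and may differ in other versions):
body(nlsLM2)[[27]][[3]][[3]][[3]]
# rhs <- eval(formula[[3L]], envir = mf, environment(formula))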
And then make these substitutions
# myFormula <- "y ~ fitModel(x, a)" ... becomes
myFormula <- y ~ fitModel(x, a)
#nlsOutput <- nlsLM(formula = myFormula, start = init, data = Data) ...becomes
nlsOutput <- nlsLM2(formula = myFormula, start = init, data = Data)
it will work and return
Nonlinear regression model
model: y ~ fitModel(x, a)
data: Data
a
0.5
residual sum-of-squares: 0
Number of iterations to convergence: 5
Achieved convergence tolerance: 1.49e-08
So there's not really much you can say in general about how R functions handle environments and scope; this behavior is specific to how the nlsLM() authors decided to evaluate their parameters.

Object not found error when passing model formula to another function

I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross-validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures; it's purely for comparison purposes.) :)
Now the issue is that if I define the function parameters (linmod, k, direction) and run the contents of the function, it works flawlessly. BUT if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop while inside a function. (Try commenting out the step command and setting the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg ~ ., data = mtcars, x = TRUE, y = TRUE)

#THE FUNCTION
cv.step <- function(linmod, k = 10, direction = "both"){
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model
  form <- formula(linmod$call)
  # generate indices for cross validation
  rar <- n/k
  xval.idx <- list()
  s <- sample(1:n, n) # permutation of 1:n
  for (i in 1:k) {
    xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
  }
  # error calculation
  errors <- R2 <- 0
  for (j in 1:k){
    datas.test <- datas[xval.idx[[j]], ]
    datas.train <- datas[-xval.idx[[j]], ]
    test.idx <- xval.idx[[j]]
    #THE MODELS
    lm.1 <- lm(form, data = datas.train)
    lm.step <- step(lm.1, direction = direction, trace = 0)
    step.pred <- predict(lm.step, newdata = datas.test)
    step.error <- sum((step.pred - response[test.idx])^2)
    errors[j] <- step.error/length(response[test.idx])
    SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
    R2[j] <- 1 - step.error/SS.tot
  }
  CVerror <- sum(errors)/k
  CV.R2 <- sum(R2)/k
  res <- list()
  res$CV.error <- CVerror
  res$CV.R2 <- CV.R2
  return(res)
}

#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created your model, lm.cars, its formula was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model is included.
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod, k = 10, direction = "both"){
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model
  .env <- environment() ## identify the environment of cv.step
  ## extract the formula in the environment of cv.step
  form <- as.formula(linmod$call, env = .env)
  ## The rest of your function follows
Another problem that can cause this is passing a character string to lm instead of a formula. Character vectors have no environment, so when lm converts the character to a formula, the result apparently also has no environment instead of being automatically assigned the local environment. If one then uses an object as weights that is not in the data argument's data.frame but is among the local function's arguments, one gets a not-found error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
  # the formula to use
  f = "x ~ y"
  # residualize
  resid(lm(formula = f, data = data, weights = wtds))
}
residualizer2 = function(data, x, y, wtds) {
  # the formula to use
  f = as.formula("x ~ y")
  # residualize
  resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one steps into the function environment with browser(), one can see the weights vector just fine, but it somehow is not found in the lm call!
The bug becomes even harder to debug if one uses the name weights for the weights variable. In this case, since lm can't find the weights object, it falls back to the weights() function from the stats package, thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.
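For the record, a minimal sketch reproducing that variant (reusing d_example and weightsvar from above; residualizer3 is a made-up name for this illustration):
residualizer3 = function(data, x, y, weights) {
  # string formula: carries no environment
  f = "x ~ y"
  resid(lm(formula = f, data = data, weights = weights))
}
residualizer3(data = d_example, x = "x", y = "y", weights = weightsvar)
# Error in model.frame.default(formula = f, data = data, weights = weights, :
#   invalid type (closure) for variable '(weights)'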
