Scoping with formulae in coxph objects - r

I'm trying to write a set of functions where the first function fits a cox model (via coxph in the survival package in R), and the second function gets estimated survival for a new dataset, given the fitted model object from the first function. I'm running into some sort of scoping issue that I don't quite know how to solve without substantially re-factoring my code (the only way I could think to do it would be much less general and much harder to read).
I have a very similar set of functions that are based on the glm function that do not run into the same issue and give me the answers I would expect. I've included a short worked example below that demonstrates the issue. The glue.cox and glue.glm are functions that have the basic functionality I am trying to get. glue.glm works as expected (yielding the same values from a calculation in the global environment), but the glue.cox complains that it can't find the data that was used to fit the cox model and ends with an error. I don't understand how to do this with substitute but I suspect that is the way forward. I've hit a wall with experimenting.
library(survival)
data.global = data.frame(time=runif(20), x=runif(20))
newdata.global = data.frame(x=c(0,1))
f1 = Surv(time) ~ x # this is the part that messes it up!!!!! Surv gets eval
f2 = time ~ x # plain response without Surv(); this one works
myfit.cox.global = coxph(f1, data=data.global)
myfit.glm.global = glm(f2, data=data.global)
myfit.glm.global2 = glm(time ~ x, data=data.global)
myfit.cox <- function(f, dat.local){
coxph(f, data=dat.local)
}
myfit.glm <- function(f, dat.local){
glm(f, data=dat.local)
}
mypredict.cox <- function(ft, dat.local){
newdata = data.frame(x=c(0,1))
tail(survfit(ft, newdata)$surv, 1)
}
mypredict.glm <- function(ft, dat.local){
newdata = data.frame(x=c(0,1))
predict(ft, newdata)
}
glue.cox <- function(f, dat.local){
fit = myfit.cox(f, dat.local)
mypredict.cox(fit, dat.local)
}
glue.glm <- function(f, dat.local){
fit = myfit.glm(f, dat.local)
mypredict.glm(fit, dat.local)
}
# these numbers are the goal for non-survival data
predict(myfit.glm.global, newdata = newdata.global)
0.5950440 0.4542248
glue.glm(f2, data.global)
0.5950440 0.4542248 # this works
# these numbers are the goal for survival data
tail(survfit(myfit.cox.global, newdata = newdata.global)$surv, 1)
[20,] 0.02300798 0.03106081
glue.cox(f1, data.global)
Error in eval(predvars, data, env) : object 'dat.local' not found

This appears to work, at least in the narrow sense of making glue.cox() work as desired:
myfit.cox <- function(f, dat.local){
environment(f) <- list2env(list(dat.local=dat.local))
coxph(f, data=dat.local)
}
The trick here is that most R modeling/model-processing functions look for data in the environment associated with the formula.
I don't know why glue.glm works without doing more digging, except for the general statement that [g]lm objects store more of the information needed for downstream processing internally (e.g. in the $qr element) than other model types.
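To illustrate the point about formula environments without needing survival, here is a minimal sketch using lm from base R. The names fit_plain, fit_fixed and case_wts are made up for this example; the only assumption is that no object called case_wts exists where the formula is created. It shows that an extra argument such as weights is looked up in the data and then in the formula's environment, not in the function that calls lm:

```r
# A formula created at top level carries the top-level environment with it.
f <- y ~ x
d <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Hypothetical wrapper: case_wts only exists inside this function,
# so lm() cannot find it through the formula's environment.
fit_plain <- function(f, d, case_wts) {
  lm(f, data = d, weights = case_wts)
}

# Same wrapper, but the formula is re-homed to an environment that
# contains case_wts (same idea as the list2env() fix above).
fit_fixed <- function(f, d, case_wts) {
  environment(f) <- list2env(list(case_wts = case_wts),
                             parent = environment(f))
  lm(f, data = d, weights = case_wts)
}

plain <- try(fit_plain(f, d, c(1, 1, 2, 2, 3)), silent = TRUE)
fixed <- fit_fixed(f, d, c(1, 1, 2, 2, 3))

inherits(plain, "try-error")  # the plain wrapper fails
fixed$weights                 # the fixed wrapper used the local weights
```

Keeping the original environment as the parent of the new one means anything the formula could see before is still visible afterwards.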

Related

Fitting Step functions

AIM: The aim here was to find a suitable fit, using step functions, which uses age to describe wage, in the Wage dataset in the library ISLR.
PLAN:
To find a suitable fit, I'll try multiple fits with different cut points. I'll use the glm() function for the fitting. To check which fit is best, I'll use the cv.glm() function (from the boot library) to perform cross-validation on the fitted model.
PROBLEM:
In order to do so, I did the following:
all.cvs = rep(NA, 10)
for (i in 2:10) {
lm.fit = glm(wage~cut(Wage$age,i), data=Wage)
all.cvs[i] = cv.glm(Wage, lm.fit, K=10)$delta[2]
}
But this gives an error:
Error in model.frame.default(formula = wage ~ cut(Wage$age, i), data =
list( : variable lengths differ (found for 'cut(Wage$age, i)')
Whereas the code given below runs:
all.cvs = rep(NA, 10)
for (i in 2:10) {
Wage$age.cut = cut(Wage$age, i)
lm.fit = glm(wage~age.cut, data=Wage)
all.cvs[i] = cv.glm(Wage, lm.fit, K=10)$delta[2]
}
Hypotheses and Results:
It might be that cut() and glm() do not work together. But this works:
glm(wage~cut(age,4),data=Wage)
Question:
So, basically we're using the cut() function, saving its results in a variable, then using that variable in the glm() function. But we can't put the cut() call inside the glm() formula, and even then only if the code is in a loop.
So, why is the first version of the code not working?
This is confusing. Any help appreciated.
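A likely explanation, sketched here on the built-in mtcars data rather than Wage (so the variable names are stand-ins): writing cut(Wage$age, i) in the formula bypasses the data= argument, so when the model is refitted on a subset of rows, as cross-validation does, the response comes from the subset while the cut() term still has the full length. Referring to the bare column name keeps both in step:

```r
# Refit on a subset of rows, as a cross-validation fold would.
fold <- mtcars[1:20, ]

# Hard-coded mtcars$wt ignores data=, so it still has 32 values
# while mpg (taken from fold) has 20.
bad <- try(glm(mpg ~ cut(mtcars$wt, 3), data = fold), silent = TRUE)

# The bare column name is looked up in data=, so lengths agree.
good <- glm(mpg ~ cut(wt, 3), data = fold)

inherits(bad, "try-error")  # the hard-coded version fails
nobs(good)                  # the bare-name version uses the fold's rows
```

This is only a sketch of the mechanism; the exact call that fails inside cv.glm() may differ.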

R Passing linear model to another function inside a function

I am trying to find the optimal "lambda" parameter for the Box-Cox transformation.
I am using the implementation from the MASS package, so I only need to create the model and extract the lambda.
Here is the code for the function:
library(MASS)
find_lambda <- function(x) {
# Function to find the best lambda for the Box-Cox transform
my_tmp <- data.frame(x = x) # Create a temporary data frame, to use it with the lm
str(my_tmp) # Gives the expected output
the_lm <- lm(x ~ 1, data = my_tmp) # Creates the linear model, no error here
print(summary(the_lm)) # Prints the summary, as expected
out <- boxcox(the_lm, plotit=FALSE) # Gives the error
best_lambda <- out$x[which.max(out$y)] # Extracting the best fitting lambda
return(best_lambda)
}
find_lambda(runif(100))
It gives the following error:
Error in is.data.frame(data) : object 'my_tmp' not found
The interesting thing is that the very same code is working outside the function. In other words, for some reason, the boxcox function from the MASS package is looking for the variable in the global environment.
I don't really understand, what exactly is going on... Do you have any ideas?
P.S. I do not provide a software/hardware specification, since this error was successfully replicated on a number of my friends' laptops.
P.P.S. I have found the way to solve the initial problem in the forecast package, but I still would like to know, why this code is not working.
User-contributed packages don't always do a great job of tracking the environments where calls were executed when they manipulate function calls. The quickest fix for you would be to change the line from
the_lm <- lm(x ~ 1, data = my_tmp)
to
the_lm <- lm(x ~ 1, data = my_tmp, y=TRUE, qr=TRUE)
If y and qr are not requested in the lm call, the boxcox function tries to re-run lm with those parameters via an update call, and things get mucked up inside a function scope.
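The update problem can be reproduced without MASS. The following is a minimal sketch (fit_local and local_df are hypothetical names, and it assumes no object called local_df exists at top level): the fitted model stores the call as it was written, and update() re-evaluates that call in the frame it is called from, where the local data frame no longer exists.

```r
# Fit a model on a data frame that only exists inside the function.
fit_local <- function() {
  local_df <- data.frame(x = c(1.2, 2.5, 3.1, 4.8))
  lm(x ~ 1, data = local_df)
}

fit <- fit_local()
fit$call  # the stored call still refers to the symbol local_df

# update() rebuilds the call and evaluates it in the caller's frame,
# where local_df cannot be found.
res <- try(update(fit, y = TRUE), silent = TRUE)
inherits(res, "try-error")
```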
Why not let boxcox do the fitting?
find_lambda <- function(x) {
# Function to find the best lambda for the Box-Cox transform
my_tmp <- data.frame(x = x) # Create a temporary data frame, to use it with the lm
out <- boxcox(x ~ 1, data = my_tmp, plotit=FALSE) # Let boxcox fit the model itself
best_lambda <- out$x[which.max(out$y)] # Extracting the best fitting lambda
return(best_lambda)
}
I think your scoping issue is with update.default which calls eval(call, parent.frame()) and my_tmp doesn't exist in the boxcox environment. Please correct me if I'm wrong on this.
boxcox cannot find your data. This may be because of a scoping issue.
You can pass the data to the boxcox function directly.
find_lambda <- function(x) {
# Function to find the best lambda for the Box-Cox transform
my_tmp <- data.frame(x = x) # Create a temporary data frame, to use it with the lm
str(my_tmp) # Gives the expected output
the_lm <- lm(x ~ 1, data = my_tmp) # Creates the linear model, no error here
print(summary(the_lm)) # Prints the summary, as expected
out <- boxcox(the_lm, plotit=FALSE, data = my_tmp) # feed data in here
best_lambda <- out$x[which.max(out$y)] # Extracting the best fitting lambda
return(best_lambda)
}
find_lambda(runif(100))

For loops regression in R

I'm fitting a GARCH model to the residuals of an ARIMA, and trying to apply ARCH(p) for p from 1 to 10 to compare the fits. Here is my code. Errors are returned in the for loop part but I cannot figure out why. Could anyone give some tips?
For the single value p=1 the code is as below and there is no problem.
fitone<- garchFit(~garch(1,0),data=logprice)
coef(fitone)
summary(fitone)
And for the for loop my code goes like
for (n in 1:10) {
fit [[n]]<- garchFit(~garch(n,0),data=logprice)
coef(fit[[n]])
summary(fit[[n]])
}
Error in .garchArgsParser(formula = formula, data = data, trace = FALSE) :
Formula and data units do not match.
I have never written a loop before. Can someone help me with the code?
The problem is that generally one tries to evaluate all the variables in a formula in the context of the data= parameter, but your n variable isn't coming from logprice, it's coming from the global environment. You will need to create the formula dynamically. Here's one way to run all the models, using lapply rather than a for loop:
library(fGarch)
#sample data
x.vec = as.vector(garchSim(garchSpec(rseed = 1985), n = 200)[,1])
fits <- lapply(1:10, function(n) {
garchFit(bquote(~garch(.(n),0)), data = x.vec, trace = FALSE)
})
and then we can get the coefs with
lapply(fits, coef)
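To see what bquote() is doing here without installing fGarch, a small base R sketch (garch is never called in it; it only appears as an unevaluated name inside the formula):

```r
# Build one formula-like call per order, splicing in the value of n.
forms <- lapply(1:3, function(n) bquote(~ garch(.(n), 0)))

# Each element is an unevaluated call to `~` with the number filled in,
# rather than a reference to a variable called n.
forms[[2]]
all.vars(forms[[2]])  # no free variables left in the formula

# Evaluating the call turns it into an ordinary formula object.
f2 <- eval(forms[[2]])
class(f2)
```

Because the value is substituted before the model function sees the formula, nothing needs to look up n later.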

Using predict in a function call with NLME objects and a formula

I have a problem with the package NLME using the following code:
library(nlme)
x <- rnorm(100)
z <- rep(c("a","b"),each=50)
y <- rnorm(100)
test.data <- data.frame(x,y,z)
test.fun <- function(test.dat)
{
form <- as.formula("y~x")
ran.form <- as.formula("~1|z")
modell <- lme(fixed = form, random=ran.form, data=test.dat)
pseudo.newdata <- test.dat[1,]
predict(modell, newdata= pseudo.newdata) ###THIS CAUSES THE ERROR!
}
test.fun(test.data)
The predict causes an error and I already found what basically causes it.
The modell object saves how it was called, and predict seems to use that to make predictions but is unable to find the formula objects form and ran.form because it does not look for them in the right environment. In fact, I can avoid the problem by doing this:
attach(environment(form), warn.conflicts = FALSE)
predict(modell, newdata= pseudo.newdata)
detach()
My long term goal however is to save the modell to disk and use them later. I suppose I could try saving the formula objects as well, but this strikes me as a very annoying and cumbersome way to deal with the problem.
I work with automatically generated formula objects instead of writing them down explicitly because I create many models with different definitions in a sort of batch process so I can not avoid them. So my ideal solution would be a way to create the lme object so that I can forget about the formula object afterwards and predict "just works". Thanks for any help.
Try replacing lme(arg1, arg2, arg3) with do.call(lme, list(arg1, arg2, arg3)).
library(nlme)
x <- rnorm(100)
z <- rep(c("a","b"),each=50)
y <- rnorm(100)
test.data <- data.frame(x,y,z)
test.fun <- function(test.dat)
{
form <- as.formula("y~x")
ran.form <- as.formula("~1|z")
## JUST NEED TO CHANGE THE FOLLOWING LINE
## modell <- lme(fixed = form, random=ran.form, data=test.dat)
modell <- do.call(lme, list(fixed=form, random=ran.form, data=test.dat))
pseudo.newdata <- test.dat[1,]
predict(modell, newdata= pseudo.newdata) ### no longer causes the error
}
test.fun(test.data)
# a
# 0.07547742
# attr(,"label")
# [1] "Predicted values"
This works because do.call() evaluates its argument list in the calling frame, before evaluating the call to lme() that it constructs. To see why that helps, type debug(predict), and then run your code and mine, comparing the debugging messages printed when you are popped into the browser.
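The effect on the stored call can also be seen with plain lm, without nlme. In this sketch (fit_sym and fit_val are hypothetical names), the direct call records the symbol form, which only exists inside the function, while the do.call() version records the formula itself:

```r
d <- data.frame(x = 1:5, y = c(1.1, 1.9, 3.2, 3.8, 5.1))

# Direct call: the model remembers the *name* of the formula variable.
fit_sym <- function(d) {
  form <- y ~ x
  lm(form, data = d)
}

# do.call(): arguments are evaluated first, so the model remembers
# the formula *value* and no longer depends on the name form.
fit_val <- function(d) {
  form <- y ~ x
  do.call(lm, list(formula = form, data = d))
}

m_sym <- fit_sym(d)
m_val <- fit_val(d)

is.symbol(m_sym$call$formula)           # just the name form
inherits(m_val$call$formula, "formula") # the actual formula
```

One side effect worth knowing: with do.call() the whole data frame is embedded in the stored call as well, so printing the call can be very long.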

Object not found error when passing model formula to another function

I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures, it's purely for comparison purposes) :)
Now the issue is, that if I define the function parameters (linmod,k,direction) and run the contents of the function, it works flawlessly. BUT, if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop, while inside a function. (try commenting out the step command, and set the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
form <- formula(linmod$call)
# generate indices for cross validation
rar <- n/k
xval.idx <- list()
s <- sample(1:n, n) # permutation of 1:n
for (i in 1:k) {
xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
}
#error calculation
errors <- R2 <- 0
for (j in 1:k){
datas.test <- datas[xval.idx[[j]],]
datas.train <- datas[-xval.idx[[j]],]
test.idx <- xval.idx[[j]]
#THE MODELS+
lm.1 <- lm(form,data= datas.train)
lm.step <- step(lm.1,direction=direction,trace=0)
step.pred <- predict(lm.step,newdata= datas.test)
step.error <- sum((step.pred-response[test.idx])^2)
errors[j] <- step.error/length(response[test.idx])
SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
R2[j] <- 1 - step.error/SS.tot
}
CVerror <- sum(errors)/k
CV.R2 <- sum(R2)/k
res <- list()
res$CV.error <- CVerror
res$CV.R2 <- CV.R2
return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created the formula for your model, lm.cars, it was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model is included.
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
.env <- environment() ## identify the environment of cv.step
## extract the formula in the environment of cv.step
form <- as.formula(linmod$call, env = .env)
## The rest of your function follows
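A smaller case of the same effect, as a rough sketch rather than a reproduction of what step does internally (rebuild_frame and train_rows are made-up names, and it assumes no object called train_rows exists where form is created). Rebuilding the model frame of an lm fit re-evaluates the stored data= argument in the formula's environment, so it only succeeds once the formula points at the function's own frame:

```r
form <- mpg ~ wt  # created outside the function

rebuild_frame <- function(form, fix_env = FALSE) {
  train_rows <- mtcars[1:20, ]   # exists only inside this function
  if (fix_env) {
    # re-home the formula to this function's frame
    environment(form) <- environment()
  }
  # model = FALSE forces model.frame() to rebuild the frame later
  fit <- lm(form, data = train_rows, model = FALSE)
  model.frame(fit)
}

broken <- try(rebuild_frame(form), silent = TRUE)
fixed  <- rebuild_frame(form, fix_env = TRUE)

inherits(broken, "try-error")  # train_rows not visible from form's env
nrow(fixed)                    # frame rebuilt from the local data
```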
Another problem that can cause this is passing a character string to lm instead of a formula. Character vectors have no environment, so when lm converts the string to a formula, the resulting formula apparently is not assigned the local function environment either. If one then uses an object as weights that is not in the data.frame given as the data argument, but is a local function argument, one gets a not found error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
#the formula to use
f = "x ~ y"
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
residualizer2 = function(data, x, y, wtds) {
#the formula to use
f = as.formula("x ~ y")
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
The bug becomes even harder to debug if one uses the name weights for the weights variable. In this case, since lm can't find the weights object, it falls back to the function weights() from the stats package, thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.
