I am trying to write my own modeling function in R, one which takes a formula, some data, and maybe some extra context, like weights; after calling model.frame to extract the necessary numeric data, it will perform a fit. My first pass looked like:
my_modfunc <- function(formula,data,weights=NULL) {
mf <- model.frame(formula,data=data,weights=weights)
wt <- model.weights(mf)
# do some fitting here...
}
# make fake data to test it
set.seed(1234)
data <- data.frame(x1=rnorm(50),x2=rnorm(50),y=rnorm(50),w=runif(50))
# call it:
my_modfunc(y ~ x1 + x2,data=data,weights=w)
This fails, I get the error:
Error in model.frame.default(formula, data = data, weights = weights) :
invalid type (closure) for variable '(weights)'
Similarly, if I call
my_modfunc(y ~ x1 + x2,data=data,weights='w')
I get the same error. I suspect there is some problem with environment, quoting, and so on.
Cutting and pasting the source for lm, I could rewrite my function as
# based on lm
weird_modfunc <- function(formula,data,weights=NULL ) {
cl <- match.call() # what?
mf <- match.call(expand.dots = FALSE) # what??
m <- match(c("formula", "data", "weights"), names(mf), 0L)
mf <- mf[c(1L, m)] # ??
mf$drop.unused.levels <- TRUE # ??
mf[[1L]] <- quote(stats::model.frame) ## ???
mf <- eval(mf, parent.frame())
wt <- as.vector(model.weights(mf))
# do some fitting here...
}
# this runs without error:
weird_modfunc(y ~ x1 + x2,data=data,weights=w)
# this fails with the same error as above about variable lengths.
weird_modfunc(y ~ x1 + x2,data=data,weights='w')
The problem is that this contains multiple somewhat mystical incantations that I do not know how to interpret, modify or maintain.
What is the right way to call model.frame? Bonus points for making my function accept both weights=w and weights='w'
Welcome to the joys of non-standard evaluation. I suggest you base your function on the lm approach. It constructs a call to model.frame and evaluates it. That's necessary, because model.frame does non-standard evaluation, i.e., it accepts/expects a symbol for the weights parameter. Furthermore, it also ensures correct scoping regarding the formula's environment.
weird_modfunc <- function(formula,data,weights=NULL ) {
#cl not needed, lm only adds this call to the return object
mf <- match.call(expand.dots = FALSE)
message("Call with ellipses not expanded: ")
#note that there are no ellipses in the function arguments for now,
#but you might want to change that later
print(mf)
#turn weights into symbol if character is passed
if (is.character(mf$weights)) mf$weights <- as.symbol(mf$weights)
m <- match(c("formula", "data", "weights"), names(mf), 0L)
message("Position of formula, data and weights in the call:")
print(m)
mf <- mf[c(1L, m)]
message("New call that only contains what is needed:")
print(mf)
mf$drop.unused.levels <- TRUE
message("Call with argument added:")
print(mf)
mf[[1L]] <- quote(stats::model.frame)
message("Change call to a call to model.frame:")
print(mf)
mf <- eval(mf, parent.frame()) #evaluate call
wt <- as.vector(model.weights(mf))
# do some fitting here...
message("Return value:")
wt
}
# this runs without error:
weird_modfunc(y ~ x1 + x2,data=data,weights=w)
#Call with ellipses not expanded:
#weird_modfunc(formula = y ~ x1 + x2, data = data, weights = w)
#Position of formula, data and weights in the call
#[1] 2 3 4
#New call that only contains what is needed:
#weird_modfunc(formula = y ~ x1 + x2, data = data, weights = w)
#Call with argument added:
#weird_modfunc(formula = y ~ x1 + x2, data = data, weights = w,
# drop.unused.levels = TRUE)
#Change call to a call to model.frame:
#stats::model.frame(formula = y ~ x1 + x2, data = data, weights = w,
# drop.unused.levels = TRUE)
#Return value:
# [1] 0.35299850 0.98095832 0.53888276 0.44403386 0.94936678 0.45248337 0.19062580 0.99160915 0.54845545 0.76881577 0.91342167 0.68211200 0.40725142
#[14] 0.40759230 0.14608279 0.19666771 0.19220934 0.40841440 0.34822131 0.83454285 0.19840001 0.86180531 0.39718531 0.15325377 0.33928338 0.36718044
#[27] 0.42737908 0.18633690 0.65801660 0.92041138 0.73389406 0.88231927 0.95334653 0.19490154 0.47261674 0.38605066 0.37416586 0.02785566 0.92935521
#[40] 0.41052928 0.95584022 0.27215284 0.51724649 0.97830984 0.36969649 0.31043044 0.03420963 0.66756585 0.92091638 0.04498960
#this runs without error too:
weird_modfunc(y ~ x1 + x2,data=data,weights='w')
Here is a simpler version but there might be problems (well, more than usual with non-standard evaluation):
my_modfunc <- function(formula,data,weights=NULL) {
weights <- substitute(weights)
if (!is.symbol(weights)) weights <- as.symbol(weights)
#substitute the symbol into the call:
mf <- eval(substitute(model.frame(formula,data=data,weights=weights)))
wt <- model.weights(mf)
# do some fitting here...
wt
}
my_modfunc(y ~ x1 + x2,data=data,weights=w)
#works
my_modfunc(y ~ x1 + x2,data=data,weights="w")
#works
Related
I have been stymied by an error that traces back to predict.lme, running inside a function, failing to interpret a formula based on a variable that has been passed from outside the function. I know the issue has to do with variable scope and different environments, but I've been unable to fully understand it or find a workaround. Your help would be much appreciated.
Here's a reproducible example:
# This will be the nested function.
train_test_perf <- function(train_data, test_data, model, termLabels) {
fixForm <- reformulate(termlabels=termLabels, response="Y")
fit <- nlme::lme(fixForm, data=train_data, random=~ 1|ID)
train_pred <- predict(fit, newdata=train_data, level=0, na.action=na.exclude)
rtrain <- cor.test(train_data$Y, train_pred)
test_pred <- predict(fit, newdata=test_data, level=0, na.action=na.exclude)
rtest <- cor.test(test_data$Y, test_pred)
tmp <- data.frame(Model=model,
R_train=rtrain$estimate,
R_test=rtest$estimate)
return(tmp)
}
# And here is the function that calls it.
myfunc <- function(df, newdf, varList) {
for (v in varList) {
perf <- train_test_perf(train_data=df, test_data=newdf, model=v, termLabels=v)
print(perf)
}
}
# The outer function call.
myfunc(df=dat, newdf=newdat, varList=list("W", "X"))
Running this gives the following error and traceback:
Error in eval(mCall$fixed) : object 'fixForm' not found
7.
eval(mCall$fixed)
6.
eval(mCall$fixed)
5.
eval(eval(mCall$fixed)[-2])
4.
predict.lme(fit, newdata = train_data, level = 0, na.action = na.exclude)
3.
predict(fit, newdata = train_data, level = 0, na.action = na.exclude)
2.
train_test_perf(train_data = df, test_data = newdf, model = v,
termLabels = v)
1.
myfunc(df = dat, newdf = newdat, varList = list("W", "X"))
It seems clear that predict.lme does not have access to the fixForm variable, but I haven't been able to work out a way to both define a formula based on a variable and have the value accessible to predict.lme. I'm not sure whether the nested function structure is part of the problem here--if it is, I would prefer to find a workaround that would maintain this structure, as my real-life code includes some other things inside myfunc that occur before and after the call to train_test_perf.
Thanks,
Jeff Phillips
Using a variable as formula doesn't stores the variable not the formula which might be the issue. We can use a do.call.
train_test_perf <- function(train_data, test_data, model, termLabels) {
fixForm <- reformulate(termlabels=termLabels, response="Y")
fit <- do.call(nlme::lme, list(fixForm, data=quote(train_data), random=~ 1|ID))
train_pred <- predict(fit, newdata=train_data, level=0, na.action=na.exclude)
rtrain <- cor.test(train_data$Y, train_pred)
test_pred <- predict(fit, newdata=test_data, level=0, na.action=na.exclude)
rtest <- cor.test(test_data$Y, test_pred)
tmp <- data.frame(Model=model, R_train=rtrain$estimate,
R_test=rtest$estimate)
return(tmp)
}
Finally put it in an sapply to avoid tedious for loops.
t(sapply(c("W", "X"), \(x) train_test_perf(train_data=dat, test_data=newdat, model=x, termLabels=x)))
# Model R_train R_test
# [1,] "W" 0.1686495 -0.001738604
# [2,] "X" 0.4138526 0.2992374
I'm trying to write some functions to ease refitting multiple models, but find it painful, as R is unable to locate proper data, when it plunges deeper into evaluation tree.
Despite an effort was made to store the formula environment inside the model, I guess there's really no way to unambiguously point to the raw data object.
This becomes even harder for fitting survival curves using survfit, where no terms object is being stored inside.
Do I really need to retype the data/formula as a parameter each time?
Example:
# model-fitting wrapper function
fn <- function(fn_formula, fn_data) {
lm(formula = fn_formula, data = fn_data)
}
# specify exemplary data and formula
data <- data.frame(
y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100))
formula <- y ~ x1
# try to create and update the fit with different parameters
fn_fit <- fn(formula, data)
update(fn_fit, ~ x2)
# Error in is.data.frame(data) : object 'fn_data' not found
terms(fn_fit) %>% attr('.Environment')
# <environment: R_GlobalEnv>
terms(fn_fit$model) %>% attr('.Environment')
# <environment: R_GlobalEnv>
getCall(fn_fit)
# lm(formula = fn_formula, data = fn_data)
The variable that stores the data should be in the same scope for both the lm() and update() with the same name. Not sure what you are really trying to accomplish, bit if you want a function that creates a signature you can use in the global environment, you can do something like this would work
fn <- function(fn_formula, fn_data) {
do.call("lm", list(fn_formula, data=substitute(fn_data)))
}
fn_fit <- fn(formula, data)
update(fn_fit, ~ x2)
Otherwise if you really wanted to capture that variable in the local function scope, you can create a helper to fun update in the correct environment.
fn <- function(fn_formula, fn_data) {
environment(fn_formula) <- environment()
lm(formula = fn_formula, data = fn_data)
}
fn_update <- function(object, ...) {
mc<-match.call(definition = update)
mc[[1]] <- quote(update)
eval(mc, envir=environment(terms(object)))
}
fn_fit <- fn(formula, data)
fn_update(fn_fit, ~x2)
When you passed formula, the only items stored in the ['model'] sublist were those that were needed.
> names(fn_fit$model)
[1] "y" "x1"
But there's nothing named either 'data' or 'fn_data' in that object. MrFlick second suggestion is more resilient to modifications in the calling tree of frames:
> fn <- function(fn_formula, fn_data) {
+ do.call("lm", list(fn_formula, data=substitute(fn_data)))
+ }
> fn_fit <- fn(formula, data); rm(data) # mess with the calling environment
> update(fn_fit, ~ x2)
Error in terms.formula(formula, data = data) :
'data' argument is of the wrong type
That error occurred because the R interpreter only found the function named data; if instead you deploy the second option you get:
> data <- data.frame(
+ y = rnorm(100),
+ x1 = rnorm(100),
+ x2 = rnorm(100))
> fn <- function(fn_formula, fn_data) {
+ environment(fn_formula) <- environment()
+ lm(formula = fn_formula, data = fn_data)
+ }
>
> fn_update <- function(object, ...) {
+ mc<-match.call(definition = update)
+ mc[[1]] <- quote(update)
+ eval(mc, envir=environment(terms(object)))
+ }
>
> fn_fit <- fn(formula, data) ; rm(data)
> fn_update(fn_fit, ~x2)
Call:
lm(formula = y ~ x2, data = fn_data)
Coefficients:
(Intercept) x2
0.01117 -0.13004
I have a multiply imputed dataset of class mids. I use the with() function to estimate the m different datasets with the coxph() function.
However, I'm having trouble using the with() function within my own function.
The code below is a simplified example that reproduces the error: Error in Surv(enter,exit,event) object 'enter' not found
list<-"X1+X2"
var.used<-formula(paste("Surv(enter,exit,event)~",list,sep=""))
with.coxph<-function(form,dataset){
with(dataset,coxph(form))
}
with.coxph(var.used,data)
When I simply run the function on its own:
with(dataset, coxph(Surv(enter,exit,event)~X1+X2))
It works fine.
I think the problem is related to the environment where with() is called. I found different posts in here, but I can't seem to make it work. I tried assigning the dataset and the formula to the global environment:
with.coxph2<-function(form,dataset){
assign(".dataset",dataset,envir=.GlobalEnv)
assign(".form",dataset,envir=.GlobalEnv)
with(dataset,coxph(form))
remove(".dataset",dataset,envir=.GlobalEnv)
remove(".form",dataset,envir=.GlobalEnv)
}
with.coxph2(var.used,data)
but this produced the same error.
EDIT
I have attempted to fix the problem as described below. When i simply run the function with out the with() statement it works perfectly.
makeModel<-function(resp, explan, mData) {
mF <- formula(paste(resp, paste(explan, collapse = "+"), sep = "~"))
mod <- coxph(mF, data = mData)
mod$call$formula <- mF
mod$call$data <- as.symbol(deparse(substitute(mData)))
mod
}
cp <- makeModel("Surv(start, stop, event)", "X1", complete(data))
# This works fine
However, I still get the same error when I include the with() statement in the equation:
with.coxph<-function(resp, explan, mData) {
mF <- formula(paste(resp, paste(explan, collapse = "+"), sep = "~"))
mod <- with(mData,coxph(mF))
mod$call$formula <- mF
mod$call$data <- as.symbol(deparse(substitute(mData)))
mod
}
cp <- with.coxph("Surv(start, stop, event)", "X1", data)
# Error in Surv(enter,exit,event): object 'enter' not found
I had similiar issues when using the lm function and I wanted to pass the formula and/or data argument to it. Tha's what I am doing now to circumvent that:
makeModel <- function(resp, explan, mData) {
mF <- formula(paste(resp, paste(explan, collapse = "+"), sep = "~"))
mod <- coxph(mF, data = mData)
mod$call$formula <- mF
mod$call$data <- as.symbol(deparse(substitute(mData)))
mod
}
makeModelBad <- function(resp, explan, mData) {
mF <- formula(paste(resp, paste(explan, collapse = "+"), sep = "~"))
coxph(mF, data = mData)
}
library(survival)
example(coxph) # to load the data
(cp <- makeModel("Surv(start, stop, event)", "x", test2))
# Call:
# coxph(formula = Surv(start, stop, event) ~ x, data = test2)
#
#
# coef exp(coef) se(coef) z p
# x -0.0211 0.979 0.795 -0.0265 0.98
#
# Likelihood ratio test=0 on 1 df, p=0.979 n= 10, number of events= 7
cp.bad <- makeModelBad("Surv(start, stop, event)", "x", test2)
Explanation
In order to use the models created inside a function, I had to explicitely change the respective slots, because mData is not known outside the function and a call to update for example will fail:
update(cp, . ~ 1) # works
update(cp.bad, . ~ 1) # does not work
# Error in terms.formula(formula, special, data = data) :
# object 'mData' not found
The change to the formula slot, is more eye candy to show the formula in the print of the object.
I have a model building function where the formula can contain a some functions, and I would like it work so that if user inputs the function several times, only first occasion is used with a warning. For example in lm if we use same variable twice, the second one is dropped:
y<-1:3
x<-1:3
lm(y~x+x)
Call:
lm(formula = y ~ x + x)
Coefficients:
(Intercept) x
0 1
This works because the function terms used in model.frame removes variables with identical name. But in my case I'm working with functions inside of formula which doesn't necessarily have identical argument list, and I would like this behaviour to extend so that arguments of these functions wouldn't matter:
model(y~x+fn("x"))
(Intercept) x temp
1 1 1 1
2 1 2 1
3 1 3 1
model(y~x+fn("x")+fn("x")) #identical function calls
(Intercept) x temp
1 1 1 1
2 1 2 1
3 1 3 1
model(y~x+fn("x")+fn("z")) #function with different argument value
Error in attr(all_terms, "variables")[[1 + ind_fn]] :
subscript out of bounds
Here is an example function (highly simplified) I used above:
model <- function(formula, data) {
#the beginning is pretty much copied from lm function
mf <- match.call(expand.dots = FALSE)
mf <- mf[c(1L, match(c("formula", "data"), names(mf), 0L))]
mf[[1L]] <- as.name("model.frame")
mf$na.action <- as.name("na.pass")
all_terms <- if (missing(data)){
terms(formula, "fn")
} else terms(formula, "fn", data = data)
#find the position of the function call in the formula
ind_fn <- attr(all_terms, "specials")$fn
#update the formula by removing the "fn" part
if(!is.null(ind_fn)){
fn_term <- attr(all_terms, "variables")[[1 + ind_fn]]
formula <- update( formula, paste(". ~ .-", deparse(fn_term,
width.cutoff = 500L, backtick = TRUE)))
mf$formula<-formula
}
# build y and X
mf <- eval(mf, parent.frame())
y <- model.response(mf, "numeric")
mt <- attr(mf, "terms")
X <- model.matrix(mt, mf)
#if fn was in formula do something with it
if (!is.null(ind_fn)){
foobar<-function(type=c("x","z")){
if(type=="x"){
rep(1,nrow(X))
} else rep(0,nrow(X))
}
fn_term[[1]]<-as.name("foobar")
temp<-eval(fn_term)
X<-cbind(X,temp)
}
X
}
I could check the name of the specials (the function calls) and rename them as identical with the first occurence, but I was wondering if there would be more clever way of dealing with this?
I wasn't able to get your code to work correctly, but assuming I've understood your task, perhaps something like this accomplishes what you're after.
f <- y ~ x + fn("x") + fn("z") + z + fn('a')
# get list of terms
vars <- as.list(attr(terms(f), 'variables'))
# get those terms that are duplicate calls
redundant <- vars[sapply(vars, is.call) & duplicated(sapply(vars, function(x) as.list(x)[[1]]))]
# remove the duplicate calls from the formula
update(f, paste(". ~ .", paste(sapply(redundant, deparse), collapse='-'), sep='-'))
# y ~ x + fn("x") + z
I want use survfit() and basehaz() inside a function, but they do not work. Could you take a look at this problem. Thanks for your help. The following code leads to the error:
library(survival)
n <- 50 # total sample size
nclust <- 5 # number of clusters
clusters <- rep(1:nclust,each=n/nclust)
beta0 <- c(1,2)
set.seed(13)
#generate phmm data set
Z <- cbind(Z1=sample(0:1,n,replace=TRUE),
Z2=sample(0:1,n,replace=TRUE),
Z3=sample(0:1,n,replace=TRUE))
b <- cbind(rep(rnorm(nclust),each=n/nclust),rep(rnorm(nclust),each=n/nclust))
Wb <- matrix(0,n,2)
for( j in 1:2) Wb[,j] <- Z[,j]*b[,j]
Wb <- apply(Wb,1,sum)
T <- -log(runif(n,0,1))*exp(-Z[,c('Z1','Z2')]%*%beta0-Wb)
C <- runif(n,0,1)
time <- ifelse(T<C,T,C)
event <- ifelse(T<=C,1,0)
mean(event)
phmmd <- data.frame(Z)
phmmd$cluster <- clusters
phmmd$time <- time
phmmd$event <- event
fmla <- as.formula("Surv(time, event) ~ Z1 + Z2")
BaseFun <- function(x){
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
}
BaseFun(fmla)
Error in formula.default(object, env = baseenv()) : invalid formula
But the following function works:
fit <- coxph(fmla, phmmd)
basehaz(fit)
It is a problem of scoping.
Notice that the environment of basehaz is:
environment(basehaz)
<environment: namespace:survival>
meanwhile:
environment(BaseFun)
<environment: R_GlobalEnv>
Therefore that is why the function basehaz cannot find the local variable inside the function.
A possible solution is to send x to the top using assign:
BaseFun <- function(x){
assign('x',x,pos=.GlobalEnv)
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
rm(x)
}
BaseFun(fmla)
Other solutions may involved dealing with the environments more directly.
I'm following up on #moli's comment to #aatrujillob's answer. They were helpful so I thought I would explain how it solved things for me and a similar problem with the rpart and partykit packages.
Some toy data:
N <- 200
data <- data.frame(X = rnorm(N),W = rbinom(N,1,0.5))
data <- within( data, expr = {
trtprob <- 0.4 + 0.08*X + 0.2*W -0.05*X*W
Trt <- rbinom(N, 1, trtprob)
outprob <- 0.55 + 0.03*X -0.1*W - 0.3*Trt
Outcome <- rbinom(N,1,outprob)
rm(outprob, trtprob)
})
I want to split the data to training (train_data) and testing sets, and train the classification tree on train_data.
Here's the formula I want to use, and the issue with the following example. When I define this formula, the train_data object does not yet exist.
my_formula <- Trt~W+X
exists("train_data")
# [1] FALSE
exists("train_data", envir = environment(my_formula))
# [1] FALSE
Here's my function, which is similar to the original function. Again,
badFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
Trying to run this function throws an error similar to OP's.
library(rpart)
library(partykit)
bad_out <- badFunc(data=data, my_formula = my_formula)
# Error in is.data.frame(data) : object 'train_data' not found
# 10. is.data.frame(data)
# 9. model.frame.default(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 8. stats::model.frame(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 7. eval(expr, envir, enclos)
# 6. eval(mf, env)
# 5. model.frame.rpart(obj)
# 4. model.frame(obj)
# 3. as.party.rpart(ct_train)
# 2. partykit::as.party(ct_train)
# 1. badFunc(data = data, my_formula = my_formula)
print(bad_out)
# Error in print(bad_out) : object 'bad_out' not found
Luckily, rpart() is like coxph() in that you can specify the argument model=TRUE to solve these issues. Here it is again, with that extra argument.
goodFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
## This solved it for me
model=TRUE,
##
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
good_out <- goodFunc(data=data, my_formula = my_formula)
print(good_out)
# Model formula:
# Trt ~ W + X
#
# Fitted party:
# [1] root
# | [2] X >= 1.59791: 0.143 (n = 7, err = 0.9)
##### etc
documentation for model argument in rpart():
model:
if logical: keep a copy of the model frame in the result? If
the input value for model is a model frame (likely from an earlier
call to the rpart function), then this frame is used rather than
constructing new data.
Formulas can be tricky as they use lexical scoping and environments in a way that is not always natural (to me). Thank goodness Terry Therneau has made our lives easier with model=TRUE in these two packages!