Apply weights in rpart model gives error - r

I'm using the rpart package to fit some models, like this:
fitmodel = function(formula, data, w) {
fit = rpart(formula, data, weights = w)
}
Call the custom function
fit = fitmodel(y ~ x1 + x2, data, w)
This causes the error:
Error in eval(expr, envir, enclos) : object 'w' not found
Then i decided to use
fitmodel = function(formula, data, w) {
data$w = w
fit = rpart(formula, data, weights = w)
}
This works, but there's another problem:
This will work
fit = fitmodel(y ~ x1 + x2, data, w)
This does not work
fit = fitmodel(y ~ ., data, w)
Error in eval(expr, envir, enclos) : object 'w' not found
What's the correct way to apply weights inside a custom function? Thanks!

Hopefully someone else gives a more complete answer. The reason why rpart can't find w is that rpart searches the environment that the formula is defined in for data, weights, etc. The formula is created in some environment most likely the GlobalEnv and the w is created within some other function. Changing the environment of the formula to the environment where w is created with parent.frame fixes that. rpart can still find the data since the search path will always continue to the GlobalEnv. I'm not sure why the sys.frame(sys.nframe()) works since the environments aren't the same but apparently w is still somewhere on the search path
edit: sys.frame(sys.nframe()) seems to be the same as setting the environment of the forumla to the environment of the function rpart is called in (foo3 in this example). In that case, rpart looks for w, data, etc. in foo3, then bar3 then the GlobalEnv.
library(rpart)
data(iris)
bar <- function(formula, data) {
w <- rpois(nrow(iris), 1)
print(environment())
foo(formula, data, w)
}
foo <- function(formula, data, w) {
print(environment(formula))
fit <- rpart(formula, data, weights = w)
return(fit)
}
bar(I(Species == "versicolor") ~ ., data = iris)
## <environment: 0x1045b1a78>
## <environment: R_GlobalEnv>
## Error in eval(expr, envir, enclos) (from #2) : object 'w' not found
bar2 <- function(formula, data) {
w <- rpois(nrow(iris), 1)
print(environment())
foo2(formula, data, w)
}
foo2 <- function(formula, data, w) {
print(environment(formula))
environment(formula) <- parent.frame()
print(environment(formula))
fit <- rpart(formula, data, weights = w)
return(fit)
}
bar2(I(Species == "versicolor") ~ ., data = iris)
## <environment: 0x100bf5910>
## <environment: R_GlobalEnv>
## <environment: 0x100bf5910>
bar3 <- function(formula, data) {
w <- rpois(nrow(iris), 1)
print(environment())
foo3(formula, data, w)
}
foo3 <- function(formula, data, w) {
print(environment(formula))
environment(formula) <- environment() ## seems to be the same as sys.frame(sys.nframe())
print(environment(formula))
print(environment())
fit <- rpart(formula, data, weights = w)
return(fit)
}
bar3(I(Species == "versicolor") ~ ., data = iris)
## <environment: 0x104e11bb8>
## <environment: R_GlobalEnv>
## <environment: 0x104b4ff78>
## <environment: 0x104b4ff78>

According to the rpart documentation (March 12, 2017, page 23, section 6.1), "Weights are not yet supported, and will be ignored if present."
https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf

I've managed to solve this using the code below, but i'm sure there's a better way:
The weak learner
fitmodel = function(formula, data, w) {
# just paste the weights into the data frame
data$w = w
rpart(formula, data, weights = w, control = rpart.control(maxdepth = 1))
}
The algorithm
ada.boost = function(formula, data, wl.FUN = fitmodel, test.data = NULL, M = 100) {
# Just rewrites the formula and get ride of any '.'
dep.var = all.vars(formula)[1]
vars = attr(terms(formula, data = data), "term.labels")
formula = as.formula(paste(dep.var, "~", paste(vars, collapse = "+")))
# ...more code
}
Now everything works!

Related

Cannot find object in function when object is defined with speedglm

I use speedglm to fit a GLM to data. When I call the function directly, the code works as expected, but when I create a function to fit the model, I get an error that an argument is not found.
The variable (w in the example below) clearly exists in the scope of the function but it seems that the variable is evaluated only later within the speedglm function where w is no longer available or so I think. This is where I start questioning my current understanding of R.
Did I make an error while creating the function, does speedglm use some weird trick to scope the variable (source code here) that breaks the normal (?) logic or do I have a wrong understanding of how R functions work?
I am trying to understand this behavior and also fix my train_glm function to make it work with speedglm and weights.
MWE
library(speedglm)
# works as expected
m1 <- speedglm(wt ~ cyl, data = mtcars, weights = mtcars$wt)
# define a small helper function that just forwards its arguments
train_glm <- function(f, d, w) {
speedglm(formula = f, data = d, weights = w)
}
# does not work
m <- train_glm(wt ~ cyl, d = mtcars, w = mtcars$wt)
#> Error in eval(extras, data, env) : object 'w' not found
Even weirder, if I change the code I found the following
# removing the weights as a base case -> WORKS
train_glm3 <- function(f, d) {
speedglm(formula = f, data = d)
}
m3 <- train_glm3(wt ~ cyl, d = mtcars)
# works
# hardcoding the weights inside the function -> BREAKS
train_glm4 <- function(f, d) {
speedglm(formula = f, data = d, weights = d$wt)
}
m4 <- train_glm4(wt ~ cyl, d = mtcars)
# Error in eval(extras, data, env) : object 'd' not found
# creating a new dataset and hardcoding the weights inside the function
# but using the name of the dataset at the highest environment -> WORKS
train_glm5 <- function(f, d) {
speedglm(formula = f, data = d, weights = mtcars2$wt)
}
mtcars2 <- mtcars
m5 <- train_glm5(wt ~ cyl, d = mtcars2)
# works
The solution (thanks to #Mike for the hint) is to evaluate the code either by using the solution given by this answer or by using do.call like so:
library(speedglm)
train_glm_docall <- function(f, d, w) {
do.call(
speedglm,
list(
formula = f,
data = d,
weights = w
)
)
}
m2 <- train_glm_docall(f = wt ~ cyl, d = mtcars, w = mtcars$wt)
class(m2)
#> [1] "speedglm" "speedlm"

How can I pass weights in a variable to rpart?

For some reason, the rpart from the rpart package can't see a variable defined in the context from which it is called. You can see in the reprex below that wts is defined just before the call to rpart, but when I call rpart, I get the error "object 'wts' not found".
If I omit the weights argument, there is no problem.
library(rpart)
data(mpg, package = "ggplot2")
scale <- function(x) {
x/sum(x)
}
fit_rpart <- function(formula, data, iters = 10) {
data <- as.data.frame(data)
models <- list()
for (i in 1:iters) {
wts <- scale(runif(nrow(data)))
print(head(wts))
models[[i]] <- rpart(formula = formula,
data = data, weights = wts,
method = "class")
}
return(models)
}
results <- fit_rpart(cyl == 4 ~ drv + cty + fl + class, mpg)
#> [1] 0.0072092177 0.0059019498 0.0007893446 0.0038617957 0.0067420603
#> [6] 0.0076892493
#> Error in eval(extras, data, env): object 'wts' not found
Created on 2022-09-28 by the reprex package (v2.0.1)
It seems the answer is similar to that provided here. (Similar, but not the same, as this issue relates to rpart, not lm. Not sure how to generalize the question to cover both.)
If I put these two lines at the top of fit_rpart, the code runs fine.
formula <- as.formula(formula)
environment(formula) <- environment()
It seems to be a name scoping problem as you said. You can also use <<- to pass an affectation to the parent environment.
(Also, data and formula are already existing objects in R so to be clear one could prefer to rename them).
library(rpart)
data(mpg, package = "ggplot2")
scale <- function(x) {
x/sum(x)
}
fit_rpart <- function(f, d, iters = 10) {
d <- as.data.frame(d)
models <- list()
for (i in 1:10) {
wts <<- scale(runif(nrow(d)))
print(head(wts))
models[[i]] <- rpart(formula = f,
data = d,
weights = wts,
method = "class")
}
return(models)
}
results <- fit_rpart(cyl == 4 ~ drv + cty + fl + class, mpg)
# [1] 0.0007921094 0.0039999229 0.0026862458 0.0018832820 0.0006866826 0.0076998391
# [1] 0.005962060 0.006942240 0.001572535 0.009360314 0.005485438 0.008135806
# [1] 0.007224990 0.004209857 0.007706282 0.007071345 0.003784652 0.006335056
# [1] 0.004098536 0.003289626 0.006710783 0.007364727 0.003099702 0.007693150
# [1] 0.005154814 0.001212012 0.001169259 0.005829825 0.004401704 0.004269959
# [1] 0.001262500 0.003705485 0.007466314 0.005450551 0.001292365 0.007920012
# [1] 0.0007083308 0.0033698483 0.0073883706 0.0013445097 0.0068669108 0.0048488413
# [1] 0.005139870 0.002053938 0.002125759 0.006488419 0.007129400 0.006937384
# [1] 0.006957664 0.006296365 0.000640707 0.008121049 0.008014404 0.007706194

How should I call model.frame in R?

I am trying to write my own modeling function in R, one which takes a formula, some data, and maybe some extra context, like weights; after calling model.frame to extract the necessary numeric data, it will perform a fit. My first pass looked like:
my_modfunc <- function(formula,data,weights=NULL) {
mf <- model.frame(formula,data=data,weights=weights)
wt <- model.weights(mf)
# do some fitting here...
}
# make fake data to test it
set.seed(1234)
data <- data.frame(x1=rnorm(50),x2=rnorm(50),y=rnorm(50),w=runif(50))
# call it:
my_modfunc(y ~ x1 + x2,data=data,weights=w)
This fails, I get the error:
Error in model.frame.default(formula, data = data, weights = weights) :
invalid type (closure) for variable '(weights)'
Similarly, if I call
my_modfunc(y ~ x1 + x2,data=data,weights='w')
I get the same error. I suspect there is some problem with environment, quoting, and so on.
Cutting and pasting the source for lm, I could rewrite my function as
# based on lm
weird_modfunc <- function(formula,data,weights=NULL ) {
cl <- match.call() # what?
mf <- match.call(expand.dots = FALSE) # what??
m <- match(c("formula", "data", "weights"), names(mf), 0L)
mf <- mf[c(1L, m)] # ??
mf$drop.unused.levels <- TRUE # ??
mf[[1L]] <- quote(stats::model.frame) ## ???
mf <- eval(mf, parent.frame())
wt <- as.vector(model.weights(mf))
# do some fitting here...
}
# this runs without error:
weird_modfunc(y ~ x1 + x2,data=data,weights=w)
# this fails with the same error as above about variable lengths.
weird_modfunc(y ~ x1 + x2,data=data,weights='w')
The problem is that this contains multiple somewhat mystical incantations that I do not know how to interpret, modify or maintain.
What is the right way to call model.frame? Bonus points for making my function accept both weights=w and weights='w'
Welcome to the joys of non-standard evaluation. I suggest you base your function on the lm approach. It constructs a call to model.frame and evaluates it. That's necessary, because model.frame does non-standard evaluation, i.e., it accepts/expects a symbol for the weights parameter. Furthermore, it also ensures correct scoping regarding the formula's environment.
weird_modfunc <- function(formula,data,weights=NULL ) {
#cl not needed, lm only adds this call to the return object
mf <- match.call(expand.dots = FALSE)
message("Call with ellipses not expanded: ")
#note that there are no ellipses in the function arguments for now,
#but you might want to change that later
print(mf)
#turn weights into symbol if character is passed
if (is.character(mf$weights)) mf$weights <- as.symbol(mf$weights)
m <- match(c("formula", "data", "weights"), names(mf), 0L)
message("Position of formula, data and weights in the call:")
print(m)
mf <- mf[c(1L, m)]
message("New call that only contains what is needed:")
print(mf)
mf$drop.unused.levels <- TRUE
message("Call with argument added:")
print(mf)
mf[[1L]] <- quote(stats::model.frame)
message("Change call to a call to model.frame:")
print(mf)
mf <- eval(mf, parent.frame()) #evaluate call
wt <- as.vector(model.weights(mf))
# do some fitting here...
message("Return value:")
wt
}
# this runs without error:
weird_modfunc(y ~ x1 + x2,data=data,weights=w)
#Call with ellipses not expanded:
#weird_modfunc(formula = y ~ x1 + x2, data = data, weights = w)
#Position of formula, data and weights in the call
#[1] 2 3 4
#New call that only contains what is needed:
#weird_modfunc(formula = y ~ x1 + x2, data = data, weights = w)
#Call with argument added:
#weird_modfunc(formula = y ~ x1 + x2, data = data, weights = w,
# drop.unused.levels = TRUE)
#Change call to a call to model.frame:
#stats::model.frame(formula = y ~ x1 + x2, data = data, weights = w,
# drop.unused.levels = TRUE)
#Return value:
# [1] 0.35299850 0.98095832 0.53888276 0.44403386 0.94936678 0.45248337 0.19062580 0.99160915 0.54845545 0.76881577 0.91342167 0.68211200 0.40725142
#[14] 0.40759230 0.14608279 0.19666771 0.19220934 0.40841440 0.34822131 0.83454285 0.19840001 0.86180531 0.39718531 0.15325377 0.33928338 0.36718044
#[27] 0.42737908 0.18633690 0.65801660 0.92041138 0.73389406 0.88231927 0.95334653 0.19490154 0.47261674 0.38605066 0.37416586 0.02785566 0.92935521
#[40] 0.41052928 0.95584022 0.27215284 0.51724649 0.97830984 0.36969649 0.31043044 0.03420963 0.66756585 0.92091638 0.04498960
#this runs without error too:
weird_modfunc(y ~ x1 + x2,data=data,weights='w')
Here is a simpler version but there might be problems (well, more than usual with non-standard evaluation):
my_modfunc <- function(formula,data,weights=NULL) {
weights <- substitute(weights)
if (!is.symbol(weights)) weights <- as.symbol(weights)
#substitute the symbol into the call:
mf <- eval(substitute(model.frame(formula,data=data,weights=weights)))
wt <- model.weights(mf)
# do some fitting here...
wt
}
my_modfunc(y ~ x1 + x2,data=data,weights=w)
#works
my_modfunc(y ~ x1 + x2,data=data,weights="w")
#works

How to access the source data when manipulating/updating fitted models

I'm trying to write some functions to ease refitting multiple models, but find it painful, as R is unable to locate proper data, when it plunges deeper into evaluation tree.
Despite an effort was made to store the formula environment inside the model, I guess there's really no way to unambiguously point to the raw data object.
This becomes even harder for fitting survival curves using survfit, where no terms object is being stored inside.
Do I really need to retype the data/formula as a parameter each time?
Example:
# model-fitting wrapper function
fn <- function(fn_formula, fn_data) {
lm(formula = fn_formula, data = fn_data)
}
# specify exemplary data and formula
data <- data.frame(
y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100))
formula <- y ~ x1
# try to create and update the fit with different parameters
fn_fit <- fn(formula, data)
update(fn_fit, ~ x2)
# Error in is.data.frame(data) : object 'fn_data' not found
terms(fn_fit) %>% attr('.Environment')
# <environment: R_GlobalEnv>
terms(fn_fit$model) %>% attr('.Environment')
# <environment: R_GlobalEnv>
getCall(fn_fit)
# lm(formula = fn_formula, data = fn_data)
The variable that stores the data should be in the same scope for both the lm() and update() with the same name. Not sure what you are really trying to accomplish, bit if you want a function that creates a signature you can use in the global environment, you can do something like this would work
fn <- function(fn_formula, fn_data) {
do.call("lm", list(fn_formula, data=substitute(fn_data)))
}
fn_fit <- fn(formula, data)
update(fn_fit, ~ x2)
Otherwise if you really wanted to capture that variable in the local function scope, you can create a helper to fun update in the correct environment.
fn <- function(fn_formula, fn_data) {
environment(fn_formula) <- environment()
lm(formula = fn_formula, data = fn_data)
}
fn_update <- function(object, ...) {
mc<-match.call(definition = update)
mc[[1]] <- quote(update)
eval(mc, envir=environment(terms(object)))
}
fn_fit <- fn(formula, data)
fn_update(fn_fit, ~x2)
When you passed formula, the only items stored in the ['model'] sublist were those that were needed.
> names(fn_fit$model)
[1] "y" "x1"
But there's nothing named either 'data' or 'fn_data' in that object. MrFlick second suggestion is more resilient to modifications in the calling tree of frames:
> fn <- function(fn_formula, fn_data) {
+ do.call("lm", list(fn_formula, data=substitute(fn_data)))
+ }
> fn_fit <- fn(formula, data); rm(data) # mess with the calling environment
> update(fn_fit, ~ x2)
Error in terms.formula(formula, data = data) :
'data' argument is of the wrong type
That error occurred because the R interpreter only found the function named data; if instead you deploy the second option you get:
> data <- data.frame(
+ y = rnorm(100),
+ x1 = rnorm(100),
+ x2 = rnorm(100))
> fn <- function(fn_formula, fn_data) {
+ environment(fn_formula) <- environment()
+ lm(formula = fn_formula, data = fn_data)
+ }
>
> fn_update <- function(object, ...) {
+ mc<-match.call(definition = update)
+ mc[[1]] <- quote(update)
+ eval(mc, envir=environment(terms(object)))
+ }
>
> fn_fit <- fn(formula, data) ; rm(data)
> fn_update(fn_fit, ~x2)
Call:
lm(formula = y ~ x2, data = fn_data)
Coefficients:
(Intercept) x2
0.01117 -0.13004

formula error inside function

I want use survfit() and basehaz() inside a function, but they do not work. Could you take a look at this problem. Thanks for your help. The following code leads to the error:
library(survival)
n <- 50 # total sample size
nclust <- 5 # number of clusters
clusters <- rep(1:nclust,each=n/nclust)
beta0 <- c(1,2)
set.seed(13)
#generate phmm data set
Z <- cbind(Z1=sample(0:1,n,replace=TRUE),
Z2=sample(0:1,n,replace=TRUE),
Z3=sample(0:1,n,replace=TRUE))
b <- cbind(rep(rnorm(nclust),each=n/nclust),rep(rnorm(nclust),each=n/nclust))
Wb <- matrix(0,n,2)
for( j in 1:2) Wb[,j] <- Z[,j]*b[,j]
Wb <- apply(Wb,1,sum)
T <- -log(runif(n,0,1))*exp(-Z[,c('Z1','Z2')]%*%beta0-Wb)
C <- runif(n,0,1)
time <- ifelse(T<C,T,C)
event <- ifelse(T<=C,1,0)
mean(event)
phmmd <- data.frame(Z)
phmmd$cluster <- clusters
phmmd$time <- time
phmmd$event <- event
fmla <- as.formula("Surv(time, event) ~ Z1 + Z2")
BaseFun <- function(x){
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
}
BaseFun(fmla)
Error in formula.default(object, env = baseenv()) : invalid formula
But the following function works:
fit <- coxph(fmla, phmmd)
basehaz(fit)
It is a problem of scoping.
Notice that the environment of basehaz is:
environment(basehaz)
<environment: namespace:survival>
meanwhile:
environment(BaseFun)
<environment: R_GlobalEnv>
Therefore that is why the function basehaz cannot find the local variable inside the function.
A possible solution is to send x to the top using assign:
BaseFun <- function(x){
assign('x',x,pos=.GlobalEnv)
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
rm(x)
}
BaseFun(fmla)
Other solutions may involved dealing with the environments more directly.
I'm following up on #moli's comment to #aatrujillob's answer. They were helpful so I thought I would explain how it solved things for me and a similar problem with the rpart and partykit packages.
Some toy data:
N <- 200
data <- data.frame(X = rnorm(N),W = rbinom(N,1,0.5))
data <- within( data, expr = {
trtprob <- 0.4 + 0.08*X + 0.2*W -0.05*X*W
Trt <- rbinom(N, 1, trtprob)
outprob <- 0.55 + 0.03*X -0.1*W - 0.3*Trt
Outcome <- rbinom(N,1,outprob)
rm(outprob, trtprob)
})
I want to split the data to training (train_data) and testing sets, and train the classification tree on train_data.
Here's the formula I want to use, and the issue with the following example. When I define this formula, the train_data object does not yet exist.
my_formula <- Trt~W+X
exists("train_data")
# [1] FALSE
exists("train_data", envir = environment(my_formula))
# [1] FALSE
Here's my function, which is similar to the original function. Again,
badFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
Trying to run this function throws an error similar to OP's.
library(rpart)
library(partykit)
bad_out <- badFunc(data=data, my_formula = my_formula)
# Error in is.data.frame(data) : object 'train_data' not found
# 10. is.data.frame(data)
# 9. model.frame.default(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 8. stats::model.frame(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 7. eval(expr, envir, enclos)
# 6. eval(mf, env)
# 5. model.frame.rpart(obj)
# 4. model.frame(obj)
# 3. as.party.rpart(ct_train)
# 2. partykit::as.party(ct_train)
# 1. badFunc(data = data, my_formula = my_formula)
print(bad_out)
# Error in print(bad_out) : object 'bad_out' not found
Luckily, rpart() is like coxph() in that you can specify the argument model=TRUE to solve these issues. Here it is again, with that extra argument.
goodFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
## This solved it for me
model=TRUE,
##
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
good_out <- goodFunc(data=data, my_formula = my_formula)
print(good_out)
# Model formula:
# Trt ~ W + X
#
# Fitted party:
# [1] root
# | [2] X >= 1.59791: 0.143 (n = 7, err = 0.9)
##### etc
documentation for model argument in rpart():
model:
if logical: keep a copy of the model frame in the result? If
the input value for model is a model frame (likely from an earlier
call to the rpart function), then this frame is used rather than
constructing new data.
Formulas can be tricky as they use lexical scoping and environments in a way that is not always natural (to me). Thank goodness Terry Therneau has made our lives easier with model=TRUE in these two packages!

Resources