predict.lme is unable to interpret a formula defined from a variable - r

I have been stymied by an error that traces back to predict.lme, running inside a function, failing to interpret a formula based on a variable that has been passed from outside the function. I know the issue has to do with variable scope and different environments, but I've been unable to fully understand it or find a workaround. Your help would be much appreciated.
Here's a reproducible example:
# This will be the nested function.
train_test_perf <- function(train_data, test_data, model, termLabels) {
  fixForm <- reformulate(termlabels=termLabels, response="Y")
  fit <- nlme::lme(fixForm, data=train_data, random=~ 1|ID)
  train_pred <- predict(fit, newdata=train_data, level=0, na.action=na.exclude)
  rtrain <- cor.test(train_data$Y, train_pred)
  test_pred <- predict(fit, newdata=test_data, level=0, na.action=na.exclude)
  rtest <- cor.test(test_data$Y, test_pred)
  tmp <- data.frame(Model=model, R_train=rtrain$estimate, R_test=rtest$estimate)
  return(tmp)
}
# And here is the function that calls it.
myfunc <- function(df, newdf, varList) {
  for (v in varList) {
    perf <- train_test_perf(train_data=df, test_data=newdf, model=v, termLabels=v)
  }
}
# The outer function call.
myfunc(df=dat, newdf=newdat, varList=list("W", "X"))
Running this gives the following error and traceback:
Error in eval(mCall$fixed) : object 'fixForm' not found
predict.lme(fit, newdata = train_data, level = 0, na.action = na.exclude)
predict(fit, newdata = train_data, level = 0, na.action = na.exclude)
train_test_perf(train_data = df, test_data = newdf, model = v,
termLabels = v)
myfunc(df = dat, newdf = newdat, varList = list("W", "X"))
It seems clear that predict.lme does not have access to the fixForm variable, but I haven't been able to work out a way to both define a formula based on a variable and have the value accessible to predict.lme. I'm not sure whether the nested function structure is part of the problem here--if it is, I would prefer to find a workaround that would maintain this structure, as my real-life code includes some other things inside myfunc that occur before and after the call to train_test_perf.
Passing a variable as the formula stores the variable name in the call, not the formula itself, which is probably the issue. We can use do.call, which evaluates fixForm before the call is stored:
train_test_perf <- function(train_data, test_data, model, termLabels) {
  fixForm <- reformulate(termlabels=termLabels, response="Y")
  fit <- do.call(nlme::lme, list(fixForm, data=quote(train_data), random=~ 1|ID))
  train_pred <- predict(fit, newdata=train_data, level=0, na.action=na.exclude)
  rtrain <- cor.test(train_data$Y, train_pred)
  test_pred <- predict(fit, newdata=test_data, level=0, na.action=na.exclude)
  rtest <- cor.test(test_data$Y, test_pred)
  tmp <- data.frame(Model=model, R_train=rtrain$estimate, R_test=rtest$estimate)
  return(tmp)
}
Finally put it in an sapply to avoid tedious for loops.
t(sapply(c("W", "X"), \(x) train_test_perf(train_data=dat, test_data=newdat, model=x, termLabels=x)))
# Model R_train R_test
# [1,] "W" 0.1686495 -0.001738604
# [2,] "X" 0.4138526 0.2992374
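To see what do.call changes, here is a minimal sketch with base R's lm standing in for nlme::lme (an assumption made only so the snippet runs without extra packages; the function name fit_models and the toy data are illustrative). It compares what each fit stores in its call:

```r
# Toy data; all names here are illustrative.
set.seed(1)
d <- data.frame(y = rnorm(20), x = rnorm(20))

fit_models <- function(data, term) {
  form <- reformulate(termlabels = term, response = "y")
  # Plain call: the stored call keeps the symbol `form`.
  plain <- lm(form, data = data)
  # do.call evaluates `form` first, so the stored call holds the formula.
  # quote(data) keeps the data argument as a symbol instead of
  # embedding the whole data frame in the call.
  viaDoCall <- do.call("lm", list(form, data = quote(data)))
  list(plain = plain, viaDoCall = viaDoCall)
}

fits <- fit_models(d, "x")
stored_plain <- fits$plain$call$formula       # the symbol `form`
stored_docall <- fits$viaDoCall$call$formula  # the formula itself
print(stored_plain)
print(stored_docall)
```

Any method that later re-evaluates the stored call outside the function can resolve the second form, but has to go looking for `form` in the first.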


Apply logistic regression in a function in R

I want to run logistic regression for multiple parameters and store the different metrics, e.g. AUC.
I wrote the function below but I get an error when I call it: Error in eval(predvars, data, env) : object 'X0' not found even if the variable exists in both my training and testing dataset. Any idea?
new.function <- function(a) {
  model = glm(extry~a,family=binomial("logit"),data = train_df)
  pred.prob <- predict(model,test_df, type='response')
  predictFull <- prediction(pred.prob, test_df$extry)
  auc_ROCR <- performance(predictFull, measure = "auc")
  my_list <- list("AUC" = auc_ROCR)
  return(my_list)
}
# Call the function new.function supplying X0 as an argument.
les <- new.function(X0)
The main reason your function didn't work is that you are passing an object into a formula, where it is taken literally as the variable name a rather than replaced by X0. You can fix it by building the formula with paste and as.formula, but that is ultimately quite limiting.
I suggest instead that you consider using update. This allows you more flexibility to try different variable combinations, or to change the training dataset, without breaking the function.
# `a` here stands for whichever predictor the initial model uses
model = glm(extry~a,family=binomial("logit"),data = train_df)
new.model = update(model, .~X0)
new.function <- function(model){
  pred.prob <- predict(model, test_df, type='response')
  predictFull <- prediction(pred.prob, test_df$extry)
  auc_ROCR <- performance(predictFull, measure = "auc")
  my_list <- list("AUC" = auc_ROCR)
  return(my_list)
}
les <- new.function(new.model)
The function can be further improved by calling the test_df as a separate argument, so that you can fit it with an alternative testing data.
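As a rough sketch of that improvement, the function below takes the predictor name as a string plus separate training and testing data frames. Everything here is an assumption for illustration: the names auc_for, rank_auc and make_df are made up, the dummy data is generated on the spot, and the AUC is computed with the rank-based (Mann-Whitney) identity instead of ROCR so that the snippet runs in base R.

```r
# Dummy data; the column names follow the question, the values are made up.
set.seed(123)
make_df <- function(n) {
  df <- data.frame(X0 = sample(n), X1 = sample(n))
  df$extry <- rbinom(n, 1, (df$X0 + df$X1) / (2 * n))
  df
}
train_df <- make_df(100)
test_df <- make_df(100)

# Rank-based AUC (Mann-Whitney identity), used here in place of ROCR.
rank_auc <- function(prob, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(rank(prob)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Fit on `train`, score on `test`; the predictor is passed as a string.
auc_for <- function(predictor, train, test, response = "extry") {
  form <- reformulate(predictor, response = response)
  model <- glm(form, family = binomial("logit"), data = train)
  prob <- predict(model, newdata = test, type = "response")
  rank_auc(prob, test[[response]])
}

aucs <- sapply(c("X0", "X1"), auc_for, train = train_df, test = test_df)
print(aucs)
```

Passing the name as a string avoids non-standard evaluation altogether, at the cost of having to quote the variable when calling the function.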
To run the function in the way you intended, you would need to use non-standard evaluation to capture the symbol and insert it in a formula. This can be done using match.call and as.formula. Here's a fully reproducible example using dummy data:
new.function <- function(a) {
  # Convert symbol to character
  a <- as.character(match.call()$a)
  # Build formula from character strings
  form <- as.formula(paste("extry", a, sep = "~"))
  model <- glm(form, family = binomial("logit"), data = train_df)
  pred.prob <- predict(model, test_df, type = 'response')
  predictFull <- ROCR::prediction(pred.prob, test_df$extry)
  auc_ROCR <- ROCR::performance(predictFull, "auc")
  list("AUC" = auc_ROCR)
}
Now we can call the function in the way you intended:
new.function(X0)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
new.function(X1)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
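If you want to see the actual area under the curve, you need to extract the first element of the y.values slot of the performance object, i.e. apply @y.values[[1]] to the AUC element of the result: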
#> [1] 0.6599759
So you may wish to modify your function so that the list contains auc_ROCR@y.values[[1]] rather than auc_ROCR.
Data used
train_df <- data.frame(X0 = sample(100), X1 = sample(100))
train_df$extry <- rbinom(100, 1, (train_df$X0 + train_df$X1)/200)
test_df <- data.frame(X0 = sample(100), X1 = sample(100))
test_df$extry <- rbinom(100, 1, (test_df$X0 + test_df$X1)/200)

Getting confidence intervals on prediction from caret::train

I'm trying to figure out how to get confidence intervals from a caret::train linear model.
My first try was just to run predict with the usual lm confidence intervals arguments:
m <- caret::train(mpg ~ poly(hp,2), data=mtcars, method="lm")
predict(m, newdata=mtcars, interval="confidence", level=0.95)
But it looks like the object returned from caret::train doesn't have this implemented.
My second attempt was to extract the finalModel and predict on that:
m <- caret::train(mpg ~ poly(hp,2), data=mtcars, method="lm")
fm <- m$finalModel
predict(fm, newdata=mtcars, interval="confidence", level=0.95)
But I get the error
Error in eval(predvars, data, env) : object 'poly(hp, 2)1' not found
Digging deeper it seems that the final model has some weird representation for the formula and is searching for a 'poly(hp, 2)1' column in my newdata rather than evaluating the formula. The m$finalModel looks like this:
lm(formula = .outcome ~ ., data = dat)
(Intercept) `poly(hp, 2)1` `poly(hp, 2)2`
20.09 -26.05 13.15
I should add that I'm not just using lm because I'm using caret to fit the model through cross validation.
How can I get the confidence intervals from the linear model fit through caret::train?
This is a horrible answer, or maybe the caret package just has a horrible implementation of this specific issue. In either case it seems fitting to open an issue or feature request on their GitHub if one does not already exist (either a wish for more diversified predict functions or a fix for the naming used in object$finalModel).
The problem (which occurred on your second attempt) stems from how the caret package internally handles the diverse fitting procedures, basically restricting the predict function for what seems to be cleaning and standardization purposes.
The problem is two-fold:
1. predict.train does not allow for prediction/confidence intervals.
2. The finalModel contained in the output of train(...) contains a formula that is unusually formatted.
Both problems seem to stem from the formatting done in train and its usage in predict.train. Focusing first on the latter problem, this is apparent from the output of
formula(m$finalModel)
# .outcome ~ `poly(hp, 2)1` + `poly(hp, 2)2`
Obviously some formatting is performed while running train, as the expected output would be mpg ~ poly(hp, 2), while the output has expanded the RHS (and added backticks) and changed the LHS. As such it would be nice to either fix up the formula, or be able to use the formula as it is.
Looking into how the caret package uses this in the predict.train function reveals the code piece below for newdata input
# --more code
if (!is.null(newdata)) {
  if (inherits(object, "train.formula")) {
    newdata <- as.data.frame(newdata)
    rn <- row.names(newdata)
    Terms <- delete.response(object$terms)
    m <- model.frame(Terms, newdata, na.action = na.action,
                     xlev = object$xlevels)
    if (!is.null(cl <- attr(Terms, "dataClasses")))
      .checkMFClasses(cl, m)
    keep <- match(row.names(m), rn)
    newdata <- model.matrix(Terms, m, contrasts = object$contrasts)
    xint <- match("(Intercept)", colnames(newdata),
                  nomatch = 0)
    if (xint > 0)
      newdata <- newdata[, -xint, drop = FALSE]
  }
}
# --more code
out <- predictionFunction(method = object$modelInfo,
                          modelFit = object$finalModel, newdata = newdata,
                          preProc = object$preProcess)
For the less experienced R users, we basically see, that a model.matrix is constructed from scratch without using the output of formula(m$finalModel) (we can use this!), and later some function is called to predict based on the m$finalModel. Looking into predictionFunction from the same package reveals that this function simply calls m$modelInfo$predict(m$finalModel, newdata) (for our example)
Lastly, looking at m$modelInfo$predict reveals the code snippet below
function(modelFit, newdata, submodels = NULL) {
  if (!is.data.frame(newdata))
    newdata <- as.data.frame(newdata)
  predict(modelFit, newdata)
}
Note that modelFit = m$finalModel and newdata is made with the output above. Also note that the call to predict does not allow one to specify interval = "confidence", which is the reason for the first problem.
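The mismatch described above can be reproduced without caret. The sketch below is only an emulation in base R, under the assumption that train's formula interface behaves as the snippets suggest (expand the predictors into a model matrix, then fit .outcome ~ . on those columns); none of it is caret's actual code. It shows that predict with interval = "confidence" works once the new data goes through the same expansion.

```r
# Fit roughly the way train() appears to: expand the predictors into
# model-matrix columns, then regress .outcome on all of them.
mf <- model.frame(mpg ~ poly(hp, 2), data = mtcars)
tt <- terms(mf)  # keeps the information poly() needs for new data
x <- model.matrix(tt, mf)[, -1, drop = FALSE]  # drop the intercept column
dat <- as.data.frame(x)
dat$.outcome <- mtcars$mpg
fm <- lm(.outcome ~ ., data = dat)
print(formula(fm))

# New data must go through the same expansion before predict() can use it.
new_raw <- mtcars[1:3, ]
tt_x <- delete.response(tt)
x_new <- model.matrix(tt_x, model.frame(tt_x, new_raw))[, -1, drop = FALSE]
preds <- predict(fm, newdata = as.data.frame(x_new), interval = "confidence")

# Reference: the same model fitted directly with lm().
ref <- predict(lm(mpg ~ poly(hp, 2), data = mtcars),
               newdata = new_raw, interval = "confidence")
print(all.equal(unname(preds), unname(ref)))
```

Passing mtcars itself as newdata to fm fails for the reason given in the question: fm looks for columns literally named `poly(hp, 2)1` and `poly(hp, 2)2`.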
Fixing the problem (sorta):
A myriad of ways exist for fixing this problem. One is to use lm(...) instead of train(...). Another is to utilize the innards of the function to create a data object that fits the weird model specification, so we can use predict(m$finalModel, newdata = newdata, interval = "confidence") in a way that works as expected.
I chose to do the latter.
caretNewdata <- caretTrainNewdata(m, mtcars)
preds <- predict(m$finalModel, caretNewdata, interval = "confidence")
head(preds, 3)
fit lwr upr
Mazda RX4 22.03708 20.74297 23.33119
Mazda RX4 Wag 22.03708 20.74297 23.33119
Datsun 710 24.21108 22.77257 25.64960
The function is provided below. For the nerdy: I basically extracted the model.matrix building process from predict.train, predictionFunction and m$modelInfo$predict. I cannot promise that this function works for the general case of every caret model, but it is a place to start.
caretTrainNewdata function:
caretTrainNewdata <- function(object, newdata, na.action = na.omit){
  if (!is.null(object$modelInfo$library))
    for (i in object$modelInfo$library)
      do.call("requireNamespaceQuietStop", list(package = i))
  if (!is.null(newdata)) {
    if (inherits(object, "train.formula")) {
      newdata <- as.data.frame(newdata)
      rn <- row.names(newdata)
      Terms <- delete.response(object$terms)
      m <- model.frame(Terms, newdata, na.action = na.action,
                       xlev = object$xlevels)
      if (!is.null(cl <- attr(Terms, "dataClasses")))
        .checkMFClasses(cl, m)
      keep <- match(row.names(m), rn)
      newdata <- model.matrix(Terms, m, contrasts = object$contrasts)
      xint <- match("(Intercept)", colnames(newdata),
                    nomatch = 0)
      if (xint > 0)
        newdata <- newdata[, -xint, drop = FALSE]
    }
  }
  else if (object$control$method != "oob") {
    if (!is.null(object$trainingData)) {
      if (object$method == "pam") {
        newdata <- object$finalModel$xData
      }
      else {
        newdata <- object$trainingData
        newdata$.outcome <- NULL
        if ("train.formula" %in% class(object) &&
            any(unlist(lapply(newdata, is.factor)))) {
          newdata <- model.matrix(~., data = newdata)[, -1]
          newdata <- as.data.frame(newdata)
        }
      }
    }
    else stop("please specify data via newdata")
  } else
    stop("please specify data via newdata")
  if ("xNames" %in% names(object$finalModel) & is.null(object$preProcess$method$pca) &
      is.null(object$preProcess$method$ica))
    newdata <- newdata[, colnames(newdata) %in% object$finalModel$xNames,
                       drop = FALSE]
  # apply any pre-processing stored in the train object
  if (!is.null(object$preProcess))
    newdata <- predict(object$preProcess, newdata)
  if (!is.data.frame(newdata) &&
      !is.null(object$modelInfo$predict) &&
      any(grepl("as.data.frame", as.character(body(object$modelInfo$predict)))))
    newdata <- as.data.frame(newdata)
  newdata
}

How to calculate the cross-validated R2 on a LASSO regression?

I am using this code to fit a model using LASSO regression.
library(glmnet)
IV1 <- data.frame(IV1 = rnorm(100))
IV2 <- data.frame(IV2 = rnorm(100))
IV3 <- data.frame(IV3 = rnorm(100))
IV4 <- data.frame(IV4 = rnorm(100))
IV5 <- data.frame(IV5 = rnorm(100))
DV <- data.frame(DV = rnorm(100))
# Combine the columns into the data frame used below
data <- data.frame(IV1, IV2, IV3, IV4, IV5, DV)
x <- model.matrix(DV~.-IV5 , data)[,-1]
y <- data$DV
AB <- glmnet(x=x, y=y, alpha=1)
lambdas = NULL
for (i in 1:100) {
  fit <- cv.glmnet(x,y)
  errors = data.frame(fit$lambda,fit$cvm)
  lambdas <- rbind(lambdas,errors)
}
lambdas <- aggregate(lambdas[, 2], list(lambdas$fit.lambda), mean)
bestindex = which(lambdas[2]==min(lambdas[2]))
bestlambda = lambdas[bestindex,1]
fit <- glmnet(x,y,lambda=bestlambda)
I would like to calculate some sort of R2 using the training data. I assume that one way to do this is using the cross-validation that I performed in choosing lambda. Based on this post, it seems like this can be done using
max(1 - fit$cvm/var(y))
However, when I run this, I get this error:
Warning message:
In max(1 - fit$cvm/var(y)) :
no non-missing arguments to max; returning -Inf
Can anyone point me in the right direction? Is this the best way to compute R2 based off of the training data?
The glmnet function does not return cvm in its result, so fit$cvm is NULL and max() has nothing to work with.
What you want to do is use cv.glmnet.
The following works (note that you must specify more than one lambda, or let it choose them itself):
fit <- cv.glmnet(x,y,lambda=lambdas[,1])
I'm not sure I understand what you are trying to do. Maybe do this?
for (i in 1:100) {
  fit <- cv.glmnet(x,y)
  errors = data.frame(fit$lambda,fit$cvm)
  lambdas <- rbind(lambdas,errors)
}
lambdas <- aggregate(lambdas[, 2], list(lambdas$fit.lambda), mean)
bestindex = which(lambdas[2]==min(lambdas[2]))
bestlambda = lambdas[bestindex,1]

R: Clustered robust standard errors using miceadds lm.cluster - error with subset and weights

I am trying to use the lm.cluster function in the package miceadds to get robust clustered standard errors for a multiply imputed dataset.
I am able to get the standard version of it to run but I get the following error when I try to add a subset or weights:
Error in eval(substitute(subset), data, env) :
..1 used in an incorrect context, no ... to look in
Example that works without subset or weights:
# imputation of the dataset: use six imputations
dat <- data.ma01[ , - c(1:2) ]
imp <- mice::mice( dat , maxit=3 , m=6 )
datlist <- miceadds::mids2datlist( imp )
# linear regression with cluster robust standard errors
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool )} )
# extract parameters and covariance matrix
betas <- lapply( mod , FUN = function(rr){ coef(rr) } )
vars <- lapply( mod , FUN = function(rr){ vcov(rr) } )
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
Example that breaks with subset:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool, subset=
(data.ma01$urban==1))} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
Example that breaks with weights:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool,
weights=data.ma01$studwgt)} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
From searching, I think I am encountering similar issues as others when passing these commands through an lm or glm wrapper (such as: Passing Argument to lm in R within Function or R : Pass argument to glm inside an R function or Passing the weights argument to a regression function inside an R function)
However, I am not sure how to address the issue with the imputed datasets & existing lm.cluster command.
This works fine with the estimatr package which is on CRAN and the estimatr::lm_robust() function. Two notes: (1) you can change the type of standard errors using se_type = and (2) I keep idschool in the data because we like the clusters to be in the same data.frame as we fit the model on.
# imputation of the dataset: use six imputations
dat <- data.ma01[, -c(1)] # note I keep idschool in data
imp <- mice::mice( dat , maxit = 3, m = 6)
datlist <- miceadds::mids2datlist(imp)
# linear regression with cluster robust standard errors
mod <- lapply(
  datlist,
  function (dat) {
    estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool)
  }
)
# subset
mod <- lapply(
  datlist,
  function (dat) {
    estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, subset = urban == 1)
  }
)
# weights
mod <- lapply(
  datlist,
  function (dat) {
    estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, weights = studwgt)
  }
)
# note that you can use the `se_type` argument of lm_robust()
# to change the vcov estimation
# extract parameters and covariance matrix
betas <- lapply(mod, coef)
vars <- lapply(mod, vcov)
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
I'm no expert, but there is an issue with the passing of the weights to lm(). I know this is not an ideal situation, but I managed to get it to work by modifying the lm.cluster() function to hard code the weights pass and then just used my own.
lm.cluster <- function (data, formula, cluster, wgts=NULL, ...)
{
  if(is.null(wgts)) {
    mod <- stats::lm(data = data, formula = formula)
  } else {
    data$.weights <- wgts
    # refer to the column by name so that lm() looks it up in `data`
    mod <- stats::lm(data = data, formula = formula, weights = .weights)
  }
  if (length(cluster) > 1) {
    v1 <- cluster
  }
  else {
    v1 <- data[, cluster]
  }
  dfr <- data.frame(cluster = v1)
  vcov2 <- multiwayvcov::cluster.vcov(model = mod, cluster = dfr)
  res <- list(lm_res = mod, vcov = vcov2)
  class(res) <- "lm.cluster"
  return(res)
}
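The weights-as-a-column idea can be checked in isolation. This is a minimal sketch in base R only, with made-up data and a hypothetical wrapper name (fit_weighted); it leaves out the clustering step entirely and just confirms that a weights vector handed to a wrapper reaches lm() when it is first stored as a column of the data.

```r
# Toy data and weights; all names are illustrative.
set.seed(1)
d <- data.frame(y = rnorm(30), x = rnorm(30))
w <- runif(30)

fit_weighted <- function(data, formula, wgts) {
  # Store the weights in the data so lm() finds them by column name,
  # whatever environment the formula was created in.
  data$.weights <- wgts
  stats::lm(formula = formula, data = data, weights = .weights)
}

fit <- fit_weighted(d, y ~ x, w)
ref <- lm(y ~ x, data = d, weights = w)  # direct call for comparison
print(all.equal(coef(fit), coef(ref)))
```

The same trick should carry over to subset: build the logical vector as a column and refer to it by name.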

formula error inside function

I want to use survfit() and basehaz() inside a function, but they do not work. Could you take a look at this problem? Thanks for your help. The following code leads to the error:
n <- 50 # total sample size
nclust <- 5 # number of clusters
clusters <- rep(1:nclust,each=n/nclust)
beta0 <- c(1,2)
#generate phmm data set
Z <- cbind(Z1=sample(0:1,n,replace=TRUE),
           Z2=sample(0:1,n,replace=TRUE),
           Z3=sample(0:1,n,replace=TRUE))
b <- cbind(rep(rnorm(nclust),each=n/nclust),rep(rnorm(nclust),each=n/nclust))
Wb <- matrix(0,n,2)
for( j in 1:2) Wb[,j] <- Z[,j]*b[,j]
Wb <- apply(Wb,1,sum)
T <- -log(runif(n,0,1))*exp(-Z[,c('Z1','Z2')]%*%beta0-Wb)
C <- runif(n,0,1)
time <- ifelse(T<C,T,C)
event <- ifelse(T<=C,1,0)
phmmd <- data.frame(Z)
phmmd$cluster <- clusters
phmmd$time <- time
phmmd$event <- event
fmla <- as.formula("Surv(time, event) ~ Z1 + Z2")
BaseFun <- function(x){
  start.coxph <- coxph(x, phmmd)
  betahat <- start.coxph$coefficient
  m <- basehaz(start.coxph)
}
BaseFun(fmla)
Error in formula.default(object, env = baseenv()) : invalid formula
But the following function works:
fit <- coxph(fmla, phmmd)
It is a problem of scoping.
Notice that the environment of basehaz is:
environment(basehaz)
<environment: namespace:survival>
whereas the environment of BaseFun is:
environment(BaseFun)
<environment: R_GlobalEnv>
That is why basehaz cannot find the variable x, which is local to BaseFun.
A possible solution is to send x to the top level using assign:
BaseFun <- function(x){
  assign("x", x, envir = .GlobalEnv)
  start.coxph <- coxph(x, phmmd)
  betahat <- start.coxph$coefficient
  m <- basehaz(start.coxph)
}
Other solutions may involve dealing with the environments more directly.
I'm following up on @moli's comment to @aatrujillob's answer. They were helpful, so I thought I would explain how it solved things for me and a similar problem with the rpart and partykit packages.
Some toy data:
N <- 200
data <- data.frame(X = rnorm(N),W = rbinom(N,1,0.5))
data <- within( data, expr = {
  trtprob <- 0.4 + 0.08*X + 0.2*W -0.05*X*W
  Trt <- rbinom(N, 1, trtprob)
  outprob <- 0.55 + 0.03*X -0.1*W - 0.3*Trt
  Outcome <- rbinom(N,1,outprob)
  rm(outprob, trtprob)
})
I want to split the data to training (train_data) and testing sets, and train the classification tree on train_data.
Here's the formula I want to use, and the issue with the following example. When I define this formula, the train_data object does not yet exist.
my_formula <- Trt~W+X
exists("train_data")
# [1] FALSE
exists("train_data", envir = environment(my_formula))
# [1] FALSE
Here's my function, which is similar to the original function.
badFunc <- function(data, my_formula){
  train_data <- data[1:100,]
  ct_train <- rpart::rpart(
    data= train_data,
    formula = my_formula,
    method = "class")
  ct_party <- partykit::as.party(ct_train)
  ct_party
}
Trying to run this function throws an error similar to OP's.
bad_out <- badFunc(data=data, my_formula = my_formula)
# Error in : object 'train_data' not found
# 10.
# 9. model.frame.default(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 8. stats::model.frame(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 7. eval(expr, envir, enclos)
# 6. eval(mf, env)
# 5. model.frame.rpart(obj)
# 4. model.frame(obj)
# 3.
# 2.
# 1. badFunc(data = data, my_formula = my_formula)
# Error in print(bad_out) : object 'bad_out' not found
Luckily, rpart() is like coxph() in that you can specify the argument model=TRUE to solve these issues. Here it is again, with that extra argument.
goodFunc <- function(data, my_formula){
  train_data <- data[1:100,]
  ct_train <- rpart::rpart(
    data= train_data,
    ## This solved it for me
    model = TRUE,
    formula = my_formula,
    method = "class")
  ct_party <- partykit::as.party(ct_train)
  ct_party
}
good_out <- goodFunc(data=data, my_formula = my_formula)
# Model formula:
# Trt ~ W + X
# Fitted party:
# [1] root
# | [2] X >= 1.59791: 0.143 (n = 7, err = 0.9)
##### etc
documentation for model argument in rpart():
if logical: keep a copy of the model frame in the result? If
the input value for model is a model frame (likely from an earlier
call to the rpart function), then this frame is used rather than
constructing new data.
Formulas can be tricky as they use lexical scoping and environments in a way that is not always natural (to me). Thank goodness Terry Therneau has made our lives easier with model=TRUE in these two packages!
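The same mechanism can be seen with base R's lm, which also takes a model argument. This is only a sketch with made-up data and a hypothetical helper (fit_in_function), not the rpart or coxph code path: when the model frame is not stored, model.frame() has to re-evaluate the data argument in the formula's environment, where the local train_data does not exist.

```r
# Toy data; the formula is created at top level, as in the example above.
set.seed(1)
toy <- data.frame(y = rnorm(40), x = rnorm(40))
my_formula <- y ~ x

fit_in_function <- function(data, form, keep_model) {
  train_data <- data[1:20, ]  # exists only inside this function
  lm(form, data = train_data, model = keep_model)
}

without_frame <- fit_in_function(toy, my_formula, keep_model = FALSE)
with_frame <- fit_in_function(toy, my_formula, keep_model = TRUE)

# Without the stored frame, model.frame() re-evaluates `train_data`
# in the formula's environment and cannot find it.
res_without <- tryCatch(model.frame(without_frame),
                        error = function(e) conditionMessage(e))
# With model = TRUE the stored copy is returned directly.
res_with <- model.frame(with_frame)

print(res_without)
print(dim(res_with))
```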
