Getting confidence intervals on prediction from caret::train - r

I'm trying to figure out how to get confidence intervals from a caret::train linear model.
My first try was just to run predict with the usual lm confidence intervals arguments:
m <- caret::train(mpg ~ poly(hp,2), data=mtcars, method="lm")
predict(m, newdata=mtcars, interval="confidence", level=0.95)
But it looks like the object returned from caret::train doesn't have this implemented.
My second attempt was to extract the finalModel and predict on that:
m <- caret::train(mpg ~ poly(hp,2), data=mtcars, method="lm")
fm <- m$finalModel
predict(fm, newdata=mtcars, interval="confidence", level=0.95)
But I get the error
Error in eval(predvars, data, env) : object 'poly(hp, 2)1' not found
Digging deeper it seems that the final model has some weird representation for the formula and is searching for a 'poly(hp, 2)1' column in my newdata rather than evaluating the formula. The m$finalModel looks like this:
Call:
lm(formula = .outcome ~ ., data = dat)
Coefficients:
(Intercept) `poly(hp, 2)1` `poly(hp, 2)2`
20.09 -26.05 13.15
I should add that I'm not just using lm because I'm using caret to fit the model through cross validation.
How can I get the confidence intervals from the linear model fit through caret::train?

Disclaimer:
This is a horrible answer, or maybe the caret package just has a horrible implementation of this specific issue. In either case it seems fitting for opening an issue or wish on their github if not already existing (either a wish for more diversified predict functions or fixing the naming used in object$finalModel)
The problem (which occurred on the second attempt) stems from how the caret package internally handles the diverse fitting procedures: it restricts the predict function, apparently for cleaning and standardization purposes.
Problem:
The problem is two-fold.
The predict.train does not allow for prediction/confidence intervals
The finalModel contained in the output of train(...) contains a formula that is unusually formatted.
Both problems seem to stem from the formatting done in train and how it is used in predict.train. Focusing first on the latter problem, this is apparent from the output of
formula(m$finalModel)
# .outcome ~ `poly(hp, 2)1` + `poly(hp, 2)2`
Obviously some formatting is performed while running train, as the expected output would be mpg ~ poly(hp, 2), while the actual output has an expanded RHS (with added backticks) and a changed LHS. As such it would be nice to either fix up the formula, or be able to use the formula as it is.
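The expansion can be mimicked in base R, which may make the odd formula easier to recognise. This is only a sketch of what train appears to do (the names X, dat and fit are mine, and caret's internals are more involved): build the design matrix, drop the intercept, and refit on the expanded columns.

```r
# Design matrix for the original formula; poly() is expanded into two columns.
X <- model.matrix(mpg ~ poly(hp, 2), data = mtcars)
print(colnames(X))

# Refit on the expanded columns, roughly mimicking what finalModel looks like.
dat <- as.data.frame(X[, -1])
dat$.outcome <- mtcars$mpg
fit <- lm(.outcome ~ ., data = dat)
print(formula(fit))

# Predicting from the raw data fails because mtcars has no such columns.
res <- try(predict(fit, newdata = mtcars, interval = "confidence"), silent = TRUE)
print(inherits(res, "try-error"))
```

The refit model looks for columns literally named poly(hp, 2)1 and poly(hp, 2)2, which is the same lookup that fails in the question.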
Looking into how the caret package uses this in the predict.train function reveals the code piece below for newdata input
caret:::predict.train
#output
# ... more code ...
if (!is.null(newdata)) {
if (inherits(object, "train.formula")) {
newdata <- as.data.frame(newdata)
rn <- row.names(newdata)
Terms <- delete.response(object$terms)
m <- model.frame(Terms, newdata, na.action = na.action,
xlev = object$xlevels)
if (!is.null(cl <- attr(Terms, "dataClasses")))
.checkMFClasses(cl, m)
keep <- match(row.names(m), rn)
newdata <- model.matrix(Terms, m, contrasts = object$contrasts)
xint <- match("(Intercept)", colnames(newdata),
nomatch = 0)
if (xint > 0)
newdata <- newdata[, -xint, drop = FALSE]
}
}
# ... more code ...
out <- predictionFunction(method = object$modelInfo,
modelFit = object$finalModel, newdata = newdata,
preProc = object$preProcess)
For the less experienced R users: a model.matrix is constructed from scratch without using the output of formula(m$finalModel) (we can use this!), and later a function is called to predict based on m$finalModel. Looking into predictionFunction from the same package reveals that it simply calls m$modelInfo$predict(m$finalModel, newdata) (for our example).
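The model.matrix step quoted above can be tried in isolation with a plain lm fit. A minimal sketch, assuming only base R (fit, new_rows and X_new are illustrative names, not caret objects):

```r
fit <- lm(mpg ~ poly(hp, 2), data = mtcars)
new_rows <- mtcars[1:5, ]

# Same sequence as in predict.train: drop the response, build the model
# frame for the new rows, then the design matrix without the intercept.
Terms <- delete.response(terms(fit))
mf <- model.frame(Terms, new_rows)
X_new <- model.matrix(Terms, mf)
X_new <- X_new[, colnames(X_new) != "(Intercept)", drop = FALSE]

# The columns carry the expanded names that finalModel expects.
print(colnames(X_new))
```

Because the terms object remembers how poly() was set up on the training data, the new rows are projected onto the same basis rather than a freshly computed one.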
Lastly looking at m$modelInfo$predict reveals the below code snippet
m$modelInfo$predict
#output
function(modelFit, newdata, submodels = NULL) {
if(!is.data.frame(newdata))
newdata <- as.data.frame(newdata)
predict(modelFit, newdata)
}
Note that modelFit = m$finalModel and newdata is built by the code above. Also note that the call to predict does not let one specify interval = "confidence", which is the reason for the first problem.
Fixing the problem (sorta):
There are many ways to fix this problem. One is to use lm(...) instead of train(...). Another is to reuse the innards of predict.train to create a data object that fits the weird model specification, so that predict(m$finalModel, newdata = newdata, interval = "confidence") works as expected.
I chose to do the latter.
caretNewdata <- caretTrainNewdata(m, mtcars)
preds <- predict(m$finalModel, caretNewdata, interval = "confidence")
head(preds, 3)
#output
fit lwr upr
Mazda RX4 22.03708 20.74297 23.33119
Mazda RX4 Wag 22.03708 20.74297 23.33119
Datsun 710 24.21108 22.77257 25.64960
The function is provided below. For the nerdy, I basically extracted the model.matrix building process from predict.train, predictionFunction and m$modelInfo$predict. I cannot promise that this function works for every caret model, but it is a place to start.
caretTrainNewdata function:
caretTrainNewdata <- function(object, newdata, na.action = na.omit){
if (!is.null(object$modelInfo$library))
for (i in object$modelInfo$library) do.call("requireNamespaceQuietStop",
list(package = i))
if (!is.null(newdata)) {
if (inherits(object, "train.formula")) {
newdata <- as.data.frame(newdata)
rn <- row.names(newdata)
Terms <- delete.response(object$terms)
m <- model.frame(Terms, newdata, na.action = na.action,
xlev = object$xlevels)
if (!is.null(cl <- attr(Terms, "dataClasses")))
.checkMFClasses(cl, m)
keep <- match(row.names(m), rn)
newdata <- model.matrix(Terms, m, contrasts = object$contrasts)
xint <- match("(Intercept)", colnames(newdata),
nomatch = 0)
if (xint > 0)
newdata <- newdata[, -xint, drop = FALSE]
}
}
else if (object$control$method != "oob") {
if (!is.null(object$trainingData)) {
if (object$method == "pam") {
newdata <- object$finalModel$xData
}
else {
newdata <- object$trainingData
newdata$.outcome <- NULL
if ("train.formula" %in% class(object) &&
any(unlist(lapply(newdata, is.factor)))) {
newdata <- model.matrix(~., data = newdata)[,
-1]
newdata <- as.data.frame(newdata)
}
}
}
else stop("please specify data via newdata")
} else
stop("please specify data via newdata")
if ("xNames" %in% names(object$finalModel) & is.null(object$preProcess$method$pca) &
is.null(object$preProcess$method$ica))
newdata <- newdata[, colnames(newdata) %in% object$finalModel$xNames,
drop = FALSE]
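# use the preprocessing stored on the train object (preProc is not defined in this function)
if(!is.null(object$preProcess))
newdata <- predict(object$preProcess, newdata)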
if(!is.data.frame(newdata) &&
!is.null(object$modelInfo$predict) &&
any(grepl("as.data.frame", as.character(body(object$modelInfo$predict)))))
newdata <- as.data.frame(newdata)
newdata
}
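One way to sanity-check intervals obtained this way is to compare them against a plain lm fit, which supports interval = "confidence" directly. The sketch below does this in base R only, mimicking the expanded-column model by hand rather than calling caret (X, dat and fit_exp are illustrative names); with a train object one would compare against predict(m$finalModel, caretNewdata, interval = "confidence") instead.

```r
# Reference: ordinary lm on the original formula.
fit_lm <- lm(mpg ~ poly(hp, 2), data = mtcars)
ci_lm <- predict(fit_lm, newdata = mtcars, interval = "confidence", level = 0.95)

# Hand-built stand-in for finalModel: lm on the expanded design columns.
X <- model.matrix(mpg ~ poly(hp, 2), data = mtcars)
dat <- as.data.frame(X[, -1])
dat$.outcome <- mtcars$mpg
fit_exp <- lm(.outcome ~ ., data = dat)
ci_exp <- predict(fit_exp, newdata = dat, interval = "confidence", level = 0.95)

# Both routes should give the same fit, lwr and upr columns.
print(all.equal(ci_lm, ci_exp))
```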

Related

predict.lme is unable to interpret a formula defined from a variable

I have been stymied by an error that traces back to predict.lme, running inside a function, failing to interpret a formula based on a variable that has been passed from outside the function. I know the issue has to do with variable scope and different environments, but I've been unable to fully understand it or find a workaround. Your help would be much appreciated.
Here's a reproducible example:
# This will be the nested function.
train_test_perf <- function(train_data, test_data, model, termLabels) {
fixForm <- reformulate(termlabels=termLabels, response="Y")
fit <- nlme::lme(fixForm, data=train_data, random=~ 1|ID)
train_pred <- predict(fit, newdata=train_data, level=0, na.action=na.exclude)
rtrain <- cor.test(train_data$Y, train_pred)
test_pred <- predict(fit, newdata=test_data, level=0, na.action=na.exclude)
rtest <- cor.test(test_data$Y, test_pred)
tmp <- data.frame(Model=model,
R_train=rtrain$estimate,
R_test=rtest$estimate)
return(tmp)
}
# And here is the function that calls it.
myfunc <- function(df, newdf, varList) {
for (v in varList) {
perf <- train_test_perf(train_data=df, test_data=newdf, model=v, termLabels=v)
print(perf)
}
}
# The outer function call.
myfunc(df=dat, newdf=newdat, varList=list("W", "X"))
Running this gives the following error and traceback:
Error in eval(mCall$fixed) : object 'fixForm' not found
7. eval(mCall$fixed)
6. eval(mCall$fixed)
5. eval(eval(mCall$fixed)[-2])
4. predict.lme(fit, newdata = train_data, level = 0, na.action = na.exclude)
3. predict(fit, newdata = train_data, level = 0, na.action = na.exclude)
2. train_test_perf(train_data = df, test_data = newdf, model = v, termLabels = v)
1. myfunc(df = dat, newdf = newdat, varList = list("W", "X"))
It seems clear that predict.lme does not have access to the fixForm variable, but I haven't been able to work out a way to both define a formula based on a variable and have the value accessible to predict.lme. I'm not sure whether the nested function structure is part of the problem here--if it is, I would prefer to find a workaround that would maintain this structure, as my real-life code includes some other things inside myfunc that occur before and after the call to train_test_perf.
Thanks,
Jeff Phillips
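Passing a variable as the formula stores the variable name in the model call, not the formula itself, which might be the issue: predict.lme later evaluates that name and cannot find it. We can use do.call so that the formula itself is stored.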
train_test_perf <- function(train_data, test_data, model, termLabels) {
fixForm <- reformulate(termlabels=termLabels, response="Y")
fit <- do.call(nlme::lme, list(fixForm, data=quote(train_data), random=~ 1|ID))
train_pred <- predict(fit, newdata=train_data, level=0, na.action=na.exclude)
rtrain <- cor.test(train_data$Y, train_pred)
test_pred <- predict(fit, newdata=test_data, level=0, na.action=na.exclude)
rtest <- cor.test(test_data$Y, test_pred)
tmp <- data.frame(Model=model, R_train=rtrain$estimate,
R_test=rtest$estimate)
return(tmp)
}
Finally put it in an sapply to avoid tedious for loops.
t(sapply(c("W", "X"), \(x) train_test_perf(train_data=dat, test_data=newdat, model=x, termLabels=x)))
# Model R_train R_test
# [1,] "W" 0.1686495 -0.001738604
# [2,] "X" 0.4138526 0.2992374
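To check what do.call changes, one can look at the formula stored in the fitted object's call. A minimal sketch using the Orthodont data shipped with nlme as a stand-in for dat (fit_in_function and orth are illustrative names):

```r
library(nlme)

# Plain data frame copy of an example dataset that ships with nlme.
orth <- as.data.frame(Orthodont)

fit_in_function <- function(train_data, termLabels) {
  fixForm <- reformulate(termlabels = termLabels, response = "distance")
  # do.call places the evaluated formula in the call, not the name fixForm.
  do.call(nlme::lme, list(fixForm, data = quote(train_data), random = ~ 1 | Subject))
}

fit <- fit_in_function(orth, "age")

# The stored fixed-effects argument is the formula itself.
print(inherits(fit$call$fixed, "formula"))

# Prediction outside the fitting function no longer needs fixForm.
pred <- predict(fit, newdata = orth, level = 0)
print(length(pred))
```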

R: Clustered robust standard errors using miceadds lm.cluster - error with subset and weights

I am trying to use the lm.cluster function in the package miceadds to get robust clustered standard errors for a multiply imputed dataset.
I am able to get the standard version of it to run but I get the following error when I try to add a subset or weights:
Error in eval(substitute(subset), data, env) :
..1 used in an incorrect context, no ... to look in
Example that works without subset or weights:
require("mice")
require("miceadds")
data(data.ma01)
# imputation of the dataset: use six imputations
dat <- data.ma01[ , - c(1:2) ]
imp <- mice::mice( dat , maxit=3 , m=6 )
datlist <- miceadds::mids2datlist( imp )
# linear regression with cluster robust standard errors
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool )} )
# extract parameters and covariance matrix
betas <- lapply( mod , FUN = function(rr){ coef(rr) } )
vars <- lapply( mod , FUN = function(rr){ vcov(rr) } )
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
Example that breaks with subset:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool, subset=
(data.ma01$urban==1))} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
Example that breaks with weights:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool,
weights=data.ma01$studwgt)} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
From searching, I think I am encountering similar issues as others when passing these commands through an lm or glm wrapper (such as: Passing Argument to lm in R within Function or R : Pass argument to glm inside an R function or Passing the weights argument to a regression function inside an R function)
However, I am not sure how to address the issue with the imputed datasets & existing lm.cluster command.
Thanks
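The error can be reproduced without mice or miceadds. The sketch below assumes that lm.cluster forwards extra arguments to lm() through ..., as the signature in the modified version further down suggests; fit_dots and fit_docall are hypothetical wrappers, not miceadds functions. Evaluating the extra arguments first and passing them with do.call is one way around it.

```r
# Hypothetical wrapper that forwards extra arguments to lm() through `...`.
fit_dots <- function(data, formula, ...) {
  stats::lm(formula = formula, data = data, ...)
}
res <- try(fit_dots(mtcars, mpg ~ hp, weights = mtcars$wt), silent = TRUE)
print(inherits(res, "try-error"))

# Alternative: evaluate the extra arguments, then hand the values to lm().
fit_docall <- function(data, formula, ...) {
  args <- c(list(formula = formula, data = data), list(...))
  do.call(stats::lm, args)
}
w_fit <- fit_docall(mtcars, mpg ~ hp, weights = mtcars$wt)
s_fit <- fit_docall(mtcars, mpg ~ hp, subset = mtcars$am == 1)

# Reference fits using lm() directly.
ref_w <- lm(mpg ~ hp, data = mtcars, weights = wt)
ref_s <- lm(mpg ~ hp, data = mtcars, subset = am == 1)
print(all.equal(coef(w_fit), coef(ref_w)))
print(all.equal(coef(s_fit), coef(ref_s)))
```

A side effect of the do.call route is that the stored call contains the full data and weight values, which makes printing the model noisy.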
This works fine with the estimatr package which is on CRAN and the estimatr::lm_robust() function. Two notes: (1) you can change the type of standard errors using se_type = and (2) I keep idschool in the data because we like the clusters to be in the same data.frame as we fit the model on.
library(mice)
library(miceadds)
library(estimatr)
# imputation of the dataset: use six imputations
data(data.ma01)
dat <- data.ma01[, -c(1)] # note I keep idschool in data
imp <- mice::mice( dat , maxit = 3, m = 6)
datlist <- miceadds::mids2datlist(imp)
# linear regression with cluster robust standard errors
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool)
}
)
# subset
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, subset = urban == 1)
}
)
# weights
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, weights = studwgt)
}
)
# note that you can use the `se_type` argument of lm_robust()
# to change the vcov estimation
# extract parameters and covariance matrix
betas <- lapply(mod, coef)
vars <- lapply(mod, vcov)
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
I'm no expert, but there is an issue with the passing of the weights to lm(). I know this is not an ideal situation, but I managed to get it to work by modifying the lm.cluster() function to hard code the weights pass and then just used my own.
lm.cluster <- function (data, formula, cluster, wgts=NULL, ...)
{
TAM::require_namespace_msg("multiwayvcov")
if(is.null(wgts)) {
mod <- stats::lm(data = data, formula = formula)
} else {
data$.weights <- wgts
# refer to the column by name so that it is looked up inside `data`
mod <- stats::lm(data = data, formula = formula, weights = .weights)
}
if (length(cluster) > 1) {
v1 <- cluster
}
else {
v1 <- data[, cluster]
}
dfr <- data.frame(cluster = v1)
vcov2 <- multiwayvcov::cluster.vcov(model = mod, cluster = dfr)
res <- list(lm_res = mod, vcov = vcov2)
class(res) <- "lm.cluster"
return(res)
}

How to apply filter based feature selection for logistic regression in R's caret package?

I am trying to apply filter based feature selection in caret package for logistic regression. I was successful at using sbf() function for random forest and LDA models (using rfSBF and ldaSBF respectively).
The way I modified lmSBF is as follows:
# custom lmSBF
logisticRegressionWithPvalues <- lmSBF
logisticRegressionWithPvalues$score <- pScore
logisticRegressionWithPvalues$summary <- fiveStats
logisticRegressionWithPvalues$filter <- pCorrection
logisticRegressionWithPvalues$fit <- glmFit
# my training control parameters for sbf (selection by filter)
myTrainControlSBF = sbfControl(method = "cv",
number = 10,
saveDetails = TRUE,
verbose = FALSE,
functions = logisticRegressionWithPvalues)
# fit the logistic regression model
logisticRegressionModelWithSBF <- sbf(x = input_predictors,
y = input_labels,
sbfControl = myTrainControlSBF)
Here, glmFit function (mentioned above) is as follows:
# fit function for logistic regression
glmFit <- function(x, y, ...) {
if (ncol(x) > 0) {
tmp <- as.data.frame(x)
tmp$y <- y
glm(y ~ ., data = tmp, family = binomial)
}
else nullModel(y = y)
}
But when fitting logisticRegressionModelWithSBF I get the following error:
Error in { : task 1 failed - "inputs must be factors"
What am I doing wrong?

Caret and rpart - defining method

I am trying to familiarize myself with the caret package. I would previously use rpart directly, e.g. with the following syntax:
fit_rpart <- rpart(y ~ ., data = dt1, method = "anova")
I have specified anova as I am aiming for regression (rather than classification).
With caret, I would use the following syntax:
rpart_fit <- train(y ~ ., data = dt1, method = "rpart", trControl = fitControl)
My question is: as the method slot is already used, where/how can I still specify method = "anova"?
Many thanks in advance!
You can make a custom method using the current rpart code. First, get the current code:
library(caret)
rpart_code <- getModelInfo("rpart", regex = FALSE)[[1]]
You then just add the extra option to the code. This method is somewhat convoluted since it handles a bunch of different cases, but here is the edit:
rpart_code$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
cpValue <- if(!last) param$cp else 0
theDots <- list(...)
if(any(names(theDots) == "control")) {
theDots$control$cp <- cpValue
theDots$control$xval <- 0
ctl <- theDots$control
theDots$control <- NULL
} else ctl <- rpart.control(cp = cpValue, xval = 0)
## check to see if weights were passed in (and available)
if(!is.null(wts)) theDots$weights <- wts
modelArgs <- c(list(formula = as.formula(".outcome ~ ."),
data = if(is.data.frame(x)) x else as.data.frame(x),
control = ctl,
method = "anova"),
theDots)
modelArgs$data$.outcome <- y
out <- do.call("rpart", modelArgs)
if(last) out <- prune.rpart(out, cp = param$cp)
out
}
then test:
library(rpart)
set.seed(445)
mod <- train(pgstat ~ age + eet + g2 + grade + gleason + ploidy,
data = stagec,
method = rpart_code,
tuneLength = 8)
Max
In caret 'method' refers to the type of model you would like to use, so for example rpart or lm (linear regression) or rf (random forest).
What you're referring to is defined as 'metric' in caret.
If your y-variable is continuous, the metric defaults to RMSE, which caret minimizes. So you don't have to do anything.
You could also explicitly specify this by:
rpart_fit <- train(y ~ ., data = dt1, method = "rpart",trControl=fitControl, metric="RMSE")
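For what it's worth, rpart's own documentation says that when method is missing it guesses from the response, using "anova" for a numeric y and "class" for a factor. A quick check with rpart alone (mtcars and iris are stand-in datasets, not the asker's dt1):

```r
library(rpart)

# Numeric response, no method given.
fit_num <- rpart(mpg ~ ., data = mtcars)
print(fit_num$method)

# Factor response, no method given.
fit_fac <- rpart(Species ~ ., data = iris)
print(fit_fac$method)
```

This only shows rpart's default guess; how caret's rpart wrapper handles an explicit method argument is a separate matter.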

predict with kernlab package error Error in .local(object, ...) : test vector does not match model R

I'm testing the kernlab package in a regression problem. It seems it's a common issue to get 'Error in .local(object, ...) : test vector does not match model !' when passing the ksvm object to the predict function. However, I only found answers for classification problems or custom kernels, which are not applicable to my problem (I'm using a built-in kernel for regression). I'm running out of ideas here; my sample code is:
data <- matrix(rnorm(200*10),200,10)
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(x = tr[,-1],
y = tr[,1],
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod,
ts
)
You forgot to remove the y variable from the test set, so it fails because the number of predictors doesn't match. This will work:
predict(mod,ts[,-1])
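The mismatch can be seen from the dimensions alone, without fitting anything. A small check using the same objects as in the question (set.seed is my addition, for reproducibility only):

```r
set.seed(1)
data <- matrix(rnorm(200 * 10), 200, 10)
tr <- data[1:150, ]
ts <- data[151:200, ]

# Columns used as predictors when fitting vs. columns handed to predict().
print(ncol(tr[, -1]))
print(ncol(ts))
print(ncol(ts[, -1]))
```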
Alternatively, you can use pred <- predict(mod, ts) unchanged if you fit the model through the formula interface on a data frame. It would be:
data <- setNames(data.frame(matrix(rnorm(200*10),200,10)),
c("Y",paste("X", 1:9, sep = "")))
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(as.formula("Y ~ ."), data = tr,
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod, ts)
