Format finalModel from caret - r

I have an annoying problem when pulling the finalModel from the caret package. I need the final logistic regression from a cross-validated fit, but the list object has had the ` symbol added around model terms that need to be evaluated in the formula (e.g., `I(x^2)`). This renders the object unusable by any other package.
I have tried:
1. stating the model formula in the function call,
2. using poly() instead of I() for the terms,
3. using lapply on the model object with a gsub() call to replace the character,
4. using grep with lapply on the model object just to find the character and then going through it manually with gsub.
But basically, lapply doesn't dig down through all the nested list elements, and it doesn't return a model object.
#Here's the problem
library(caret)
dat=data.frame(y=as.factor(rbinom(1000, 1, prob=0.5)),
x=rnorm(1000,10,1),
w=rnorm(1000,100,1),
z=rnorm(1000,1000,1))
levels(dat$y) <- c("A","P")
Train <- createDataPartition(dat$y, p=0.7, list=FALSE)
train<- dat[ Train, ]
test <- dat[ -Train, ]
lof=as.formula(y~ x+I(x^2)+w+I(w^2)+z+I(z^2))
m1<-glm(lof,family="binomial", data=train)
m1
pred1=predict(m1,newdata=test, type="response" )
ctrl <- trainControl(
method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
m2<-train(lof,family="binomial", data=train,
method="glm",
trControl = ctrl,
metric = "ROC")
m3=m2$finalModel
m3
pred2=predict(m3,newdata=test, type="response" )
res1=lapply(m3, function (x) grepl('\\`I',x))
m3$terms[[3]]=gsub('\\`I',"",m3$terms[[3]])
m3$terms[[3]]=gsub(')\\`',"",m3$terms[[3]])
m3$terms
res=lapply(m3, function (x) grepl('\\`I',names(x)))
names(m3$effects)[res$effects==TRUE]=gsub('\\`I',"",names(m3$effects)[res$effects==TRUE])
names(m3$effects)[res$effects==TRUE]=gsub(')\\`',"",names(m3$effects)[res$effects==TRUE])
pred3=predict(m3,newdata=test, type="response" )
And an attempted fix with lapply finds some instances, but not all, and cannot easily be used to modify the original object, so even a manual fix is stymied. Obviously the easiest solution is to figure out how to stop caret from doing this in the first place rather than trying to edit the object.
I should probably note that I'm not trying to extract the finalModel to use with predict; I am aware I can predict with caret on the entire m2 object. I want it for use with the dismo package.

No replies, but what I ended up doing as a workaround was to create new variables x2 = x^2 etc., which is fine as far as it goes. For my use, though, that means I ALSO had to create new raster layers with x^2 etc., which is a waste of time and space.
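For reference, a minimal sketch of that workaround on the toy data above (the new column names and the object names lof2 and m2b are my own illustrative choices, not from the original post):
# Precompute the squared terms so the formula contains no I() calls
# (the test set needs the same new columns before predicting)
train$x2 <- train$x^2; train$w2 <- train$w^2; train$z2 <- train$z^2
test$x2  <- test$x^2;  test$w2  <- test$w^2;  test$z2  <- test$z^2
lof2 <- y ~ x + x2 + w + w2 + z + z2
m2b <- train(lof2, family = "binomial", data = train,
             method = "glm", trControl = ctrl, metric = "ROC")
m2b$finalModel   # term labels are now plain column names, with no backticks added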

Related

Caret train function for multiple data frames as function

There has been a similar question to mine 6+ years ago and it hasn't been solved (R -- Can I apply the train function in caret to a list of data frames?).
This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment, and I'm wondering if there is a way to wrap the model training function train() of the caret package for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
model <- train(predictor ~., data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I have tried, an R formula is not able to take a variable as input.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function script clean and save work for sure.
By writing predictor_iris <- "Species", you are basically saving a string object in predictor_iris. Thus, when you run lda_ex, I guess you run into some error concerning the formula object in train(), since you are trying to predict a string using vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
model <- train(formula, data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Remember to specify the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
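If you would rather keep passing the outcome name as a string, a hedged alternative (not from the original answer; lda_ex2 and its arguments are my own) is to build the formula inside the function, e.g. with reformulate():
lda_ex2 <- function(data, predictor){
  # turn the outcome name (a string) into a formula such as Species ~ .
  f <- reformulate(".", response = predictor)
  train(f, data = data,
        method = "lda",
        trControl = trainControl(method = "none"),
        preProc = c("center", "scale"))
}
iris_res2 <- lda_ex2(data = iris, predictor = "Species")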

Using the caret::train function for calculating prediction error (MdAE) of GLMMs with beta-binomial errors

The question is more or less as the title indicates. I would like to use the caret::train function with beta-binomial models made with the glmmTMB package (although I am not opposed to other functions capable of fitting beta-binomial models) to calculate median absolute error (MdAE) estimates through jack-knife (leave-one-out) cross-validation. The glmmTMBControl function is already capable of estimating the optimal dispersion parameter, but I was hoping to retain this information somehow as well... or possibly have caret do the calculation?
The dataset I am working with looks like this:
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(seq(1:20), each = 5))
Ideally I would be able to pass the glmmTMB function to trainControl like so:
BB.glmm1 <- train(Time ~ Effect,
data = df, method = "glmmTMB",
method = "", metric = "MAD")
The output would be as per the examples contained in train, although possibly with estimates for the dispersion parameter.
Although I am in no way opposed to work arounds - Thank you in advance!
I am unsure how to perform the required operation with caret without creating a custom method, but I trust it is fairly easy to implement with a for loop (or lapply).
In the example I will use the sleepstudy data set since your example data throws a bunch of warnings.
library(glmmTMB)
To perform LOOCV, for every row create a model without that row and predict on that row:
data(sleepstudy,package="lme4")
LOOCV <- lapply(1:nrow(sleepstudy), function(x){
m1 <- glmmTMB(Reaction ~ Days + (Days|Subject),
data = sleepstudy[-x,])
return(predict(m1, sleepstudy[x,], type = "response"))
})
Get the median of the absolute residuals (I think this is MdAE? If not, post a comment on how it's calculated):
median(abs(unlist(LOOCV) - sleepstudy$Reaction))
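For reference, a sketch that wraps the same idea into a reusable helper (the function name, its arguments, and the allow.new.levels flag are my own choices, not from the original answer):
library(glmmTMB)
data(sleepstudy, package = "lme4")
# LOOCV median absolute error for a glmmTMB specification;
# 'response' is the name of the outcome column
loocv_mdae <- function(formula, data, response, ...) {
  preds <- vapply(seq_len(nrow(data)), function(i) {
    fit <- glmmTMB(formula, data = data[-i, ], ...)
    predict(fit, newdata = data[i, ], type = "response", allow.new.levels = TRUE)
  }, numeric(1))
  median(abs(preds - data[[response]]))
}
loocv_mdae(Reaction ~ Days + (Days | Subject), sleepstudy, "Reaction")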

How to retrieve elastic net coefficients?

I am using the caret package to train an elastic net model on my dataset modDat. I take a grid search approach paired with repeated cross validation to select the optimal values of the lambda and fraction parameters required by the elastic net function. My code is shown below.
library(caret)
library(elasticnet)
grid <- expand.grid(
lambda = seq(0.5, 0.7, by=0.1),
fraction = seq(0, 1, by=0.1)
)
ctrl <- trainControl(
method = 'repeatedcv',
number = 5, #folds
repeats = 10, #repeats
classProbs = FALSE
)
set.seed(1)
enetTune <- train(
y ~ .,
data = modDat,
method = 'enet',
metric = 'RMSE',
tuneGrid = grid,
verbose = FALSE,
trControl = ctrl
)
I can get predictions using y_hat <- predict(enetTune, modDat), but I cannot view the coefficients underlying the predictions.
I have tried coef(enetTune$finalModel), but the only thing returned is NULL. I suspect that I have to give the coef() function more information, but I'm not sure how to do this.
In addition, I would like to produce a box plot of the 50 sets of coefficients (10 repeats of 5 folds) associated with the optimal lambda and fraction parameters.
To see the coefficients, use predict:
predict(enetTune$finalModel, type = "coefficients")
See ?predict.enet for more information on how to get specific coefficients.
Following on from the answer by @Weihuang Wong, you can get the coefficients from the final model using the following code:
predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
To me what works best is stats::predict, as in @Weihuang Wong's answer. However, as the OP pointed out in a comment, that provides a list of coefficients for every value of lambda tested.
The important thing to understand here is that when you are using predict, your intention is precisely to predict the values of the parameters, and not really to retrieve them. You should then be aware of that and explore the options available.
In this case, you could use the same function with the argument s for the penalty parameter lambda. Remember that you are still predicting, but this time you will get the coefficients you are looking for.
stats::predict(enetTune$finalModel, type = "coefficients", s = enetTune$bestTune$lambda)
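As a usage note, a hedged sketch combining the two answers above to keep only the non-zero coefficients at the tuned fraction:
coefs <- predict(enetTune$finalModel,
                 s = enetTune$bestTune[1, "fraction"],
                 type = "coefficients", mode = "fraction")$coefficients
coefs[coefs != 0]   # drop predictors shrunk exactly to zero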

When to use index and seeds arguments of train() in caret package in R

Primary Question:
After reading the documentation and google searching, I am still stumped as to what the situations are where it is advisable to pre-define resampling indices such as:
resamples <- createResample(classVector_training, times = 500, list=TRUE)
or predefine seeds such as:
seeds <- vector(mode = "list", length = 501) #length is = (n_repeats*nresampling)+1
for(i in 1:501) seeds[[i]]<- sample.int(n=1000, 1)
My plan is to train a bunch of different reproducible models using parallel processing via the doParallel package. Is predefining resamples unnecessary due to the seeds already being set? Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing? Is there any reason to pre-define both index and seeds as I've seen at least once via searching google? And what is a reason to ever use indexOut?
Side Question:
So far, I've managed to run train fine for RF:
rfControl <- trainControl(method="oob", number = 500, p = 0.7, returnData=TRUE, returnResamp = "all", savePredictions=TRUE, classProbs = TRUE, summaryFunction = twoClassSummary, allowParallel=TRUE)
mtryGrid <- expand.grid(mtry = 9480^0.5) #set mtry parameter to the square root of the number of variables
rfTrain <- train(x = training, y = classVector_training, method = "rf", trControl = rfControl, tuneGrid = mtryGrid)
But when I try to run train() with method = "Boruta" as such:
borutaControl <- trainControl(method="bootstrap", number = 500, p = 0.7, returnData=TRUE, returnResamp = "all", savePredictions=TRUE, classProbs = TRUE, summaryFunction = twoClassSummary, allowParallel=TRUE)
borutaTrain <- train(x = training, y = classVector_training, method = "Boruta", trControl = borutaControl, tuneGrid = mtryGrid)
I end up getting the following error:
Error in names(trControl$indexOut) <- prettySeq(trControl$indexOut) : 'names' attribute [1] must be the same length as the vector [0]
Anyone know why?
There are a few different times random numbers are used here, so I'll try to be specific about which seeds.
Is predefining resamples unnecessary due to the seeds already being set?
If you do not provide your own resampling indices, the first thing that train, rfe, sbf, gafs, and safs do is create them. So, setting the overall seed prior to calling these controls the randomness of creating resamples. So, you can call these functions repeatedly and use the same resamples if you set the main seed beforehand:
set.seed(2346)
mod1 <- train(y ~ x, data = dat, method = "a", ...)
set.seed(2346)
mod2 <- train(y ~ x, data = dat, method = "b", ...)
set.seed(2346)
mod3 <- rfe(x, y, ...)
You can use createResample or createFolds if you like and give those to trainControl's index argument too.
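For example, a minimal sketch of pre-defining the indices yourself (the fold count and object names are my own illustrative choices):
set.seed(2346)
folds <- createFolds(classVector_training, k = 10, returnTrain = TRUE)  # training-row indices per fold
ctrl  <- trainControl(method = "cv", index = folds)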
One other note about this: if indexOut is missing, the holdouts are defined as whatever samples were not used to train the model. There are cases when this is bad (see the exception below) and that is why indexOut exists.
Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing?
That was the main intent. When the worker processes start up, there was no way to control the randomness inside the model fit prior to our addition of the seeds argument. You don't have to use it, but it will lead to reproducible models.
Note that, like resamples, train will create seeds for you if you do not supply them. They are found in the control$seeds element in the train object.
Note that trainControl(seeds) has nothing to do with creating the resamples.
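For concreteness, a sketch of the layout the seeds argument expects (the resample count and tuning-grid size here are my own illustrative choices): for B resamples the list has B + 1 elements, each of the first B being an integer vector with one seed per tuning candidate, and the last being a single seed for the final model fit.
B <- 30       # e.g. 10-fold CV repeated 3 times
n_tune <- 5   # number of candidate tuning combinations
seeds <- vector(mode = "list", length = B + 1)
for (i in seq_len(B)) seeds[[i]] <- sample.int(1000, n_tune)  # one seed per candidate model
seeds[[B + 1]] <- sample.int(1000, 1)                         # seed for the final model fit
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, seeds = seeds)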
Is there any reason to pre-define both index and seeds as I've seen at least once via searching google?
If you want to pre-define the resamples and control any potential randomness in the worker processes that build the models, then yes.
And what is a reason to ever use indexOut?
There are always specialized situations. The reason it is there is for time series data, where you might have data splits that do not involve all the samples passed to train (this is the exception mentioned above).
tl/dr
trainControl(seeds) only controls the randomness of the model fits
setting the seed prior to calling train is one way to control the randomness of data splitting
Max

Cannot extractPrediction using caret in R

I'm totally stuck on a random forest classification model since I cannot extract predictions. And I'm really out of clues, since:
predict(forest.model1, titanic.final.test)
works like a charm, while
extractPrediction(list(forest.model1), testX=titanic.final.test[,-2], testY=titanic.final.test[,2])
which should be equivalent, gives me this error:
Error in predict.randomForest(modelFit, newdata) :
variables in the training data missing in newdata
Here's my trainControl:
forest.fitControl <- trainControl( method = "repeatedcv", repeats = 5,
summaryFunction = twoClassSummary, classProbs=TRUE,
returnData=TRUE, seeds=NULL, savePredictions=TRUE, returnResamp="all")
any idea?
Test and Train need to have the same structure (i.e. all the same columns). So my only guess is that negating the second column results in a different structure than the data used to train the model. Hard to know without seeing the structure of the training vs. test data.frames.
Edit After Looking at Code:
I recreated this from your repo... Are you sure it shouldn't be the first column you pull out for testX and use for testY? Something like:
extractPrediction(list(forest.model1), testX=titanic.final.test[,-1], testY=titanic.final.test[,1])
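As a quick sanity check for the "variables in the training data missing in newdata" error, a hedged sketch (this assumes forest.model1 was fit with returnData = TRUE, as in the trainControl above, so the training data are stored on the object):
train_cols <- setdiff(names(forest.model1$trainingData), ".outcome")  # predictors caret trained on
setdiff(train_cols, names(titanic.final.test))                        # predictors missing from the test set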
