caretEnsemble: Component models do not have the same re-sampling strategies

I have several prediction models which are created using the same trainControl. These models have to be created beforehand (i.e. I can't use caretList to train multiple models simultaneously).
Below is my minimal example. When I manually combine multiple (already created) models and pass them to caretStack,
library("kernlab")
library("rpart")
library("caret")
library("caretEnsemble")
trainingControl <- trainControl(method='cv', number=10, savePredictions = "final", classProbs=TRUE)
data(spam)
ds <- spam
tr <- ds[sample(nrow(ds),3221),]
te <- ds[!(rownames(ds) %in% rownames(tr)),]
model <- train(tr[,-58], tr$type, 'svmRadial', trControl = trainingControl)
model2 <- train(tr[,-58], tr$type, 'rpart', trControl = trainingControl)
multimodel <- list(svm = model, nb = model2)
class(multimodel) <- "caretList"
stack <- caretStack(multimodel, method = "rf", metric = "ROC", trControl = trainingControl)
the library throws the error:
Component models do not have the same re-sampling strategies.
Why is that since I'm using the same strategy to generate the base models?
I found the "casting" to caretList class in the github discussion zachmayer/caretEnsemble/issues/104.

You are almost there. When you use caretEnsemble, you have to set the resampling indexes yourself via the 'index' option of trainControl. If you run caretList it tends to set this itself, but it is better to do it explicitly. This is especially important when you train the models outside of caretList, because you need to make sure every model is resampled on the same rows. You can also see this in the GitHub example you refer to.
trainingControl <- trainControl(method = 'cv',
                                number = 10,
                                savePredictions = "final",
                                classProbs = TRUE,
                                index = createResample(tr$type)) # this needs to be set.
This makes the resampling identical across the models, so the resampling check in caretStack passes.
Note that the example code as you have posted it will still return other errors.
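A minimal sketch of the same idea, assuming true 10-fold CV is wanted rather than the bootstrap resamples that createResample produces: build the folds once with createFolds(returnTrain = TRUE), pass them as index, and confirm that both fitted models carry identical resampling indexes. The 500-row subsample, the seed, and the rpart/knn pairing are illustrative choices, not part of the original post.

```r
library(caret)
library(kernlab)  # only needed for the spam data set

data(spam)
set.seed(42)                               # arbitrary seed, for illustration
tr <- spam[sample(nrow(spam), 500), ]      # small subsample to keep this quick

# Training-row indexes for each of 10 folds, created once and reused
folds <- createFolds(tr$type, k = 10, returnTrain = TRUE)

ctrl <- trainControl(method = "cv",
                     number = 10,
                     index = folds,        # the shared resampling definition
                     savePredictions = "final",
                     classProbs = TRUE)

# Two models trained separately, but with the same control object
m_rpart <- train(tr[, -58], tr$type, method = "rpart", trControl = ctrl)
m_knn   <- train(tr[, -58], tr$type, method = "knn",   trControl = ctrl)

# Both models should report exactly the same resampling indexes
same_index <- identical(m_rpart$control$index, m_knn$control$index)
print(same_index)
```

Comparing the stored control$index of each model is a quick sanity check before handing the models to caretStack.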

Related

Making caret train rf faster when ranger is not an option

The website where I am trying to run the code uses an old version of R and does not accept ranger as a library, so I have to use the caret package. I am trying to process about 800,000 rows in my training data frame, and here is the code I use:
control <- trainControl(method = 'repeatedcv',
                        number = 3,
                        repeats = 1,
                        search = 'grid')
tunegrid <- expand.grid(.mtry = c(sqrt(ncol(train_1))))
fit <- train(value ~ .,
             data = train_1,
             method = 'rf',
             ntree = 73,
             tuneGrid = tunegrid,
             trControl = control)
Following previous posts, I tried to tune my control parameters. Is there any way I can make the model run faster? Can I specify settings so that it just fits one model with the parameters I set, rather than trying multiple options?
This is my ranger code, which I optimized and which currently gives an accurate model:
fit <- ranger(value ~ .,
              data = train_1,
              num.trees = 73,
              max.depth = 35,
              mtry = 7,
              importance = 'impurity',
              splitrule = "extratrees")
Thank you so much for your time
When you specify method='rf', caret is using the randomForest package to build the model. If you don't want to do all the cross-validation that caret is useful for, just build your model using the randomForest package directly. e.g.
library(randomForest)
fit <- randomForest(value ~ ., data=train_1)
You can specify values for ntree, mtry etc.
Note that the randomForest package is slow (or just won't work) for large datasets. If ranger is unavailable, have you tried the Rborist package?
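If the model has to stay inside caret, one option that should apply here is trainControl(method = "none"), which skips resampling and fits a single model, provided tuneGrid has exactly one row. The sketch below uses iris as a stand-in data set; the mtry and ntree values are placeholders, not the asker's settings.

```r
library(caret)   # method = "rf" also needs the randomForest package installed

# No resampling: caret fits exactly one model on the full training set
ctrl <- trainControl(method = "none")

# method = "none" requires a tuning grid with a single row
grid <- data.frame(mtry = 2)

set.seed(1)
fit <- train(Species ~ .,
             data = iris,
             method = "rf",
             ntree = 50,          # passed through to randomForest()
             tuneGrid = grid,
             trControl = ctrl)

# The underlying randomForest object records the settings that were used
fit$finalModel$ntree
fit$finalModel$mtry
```

This removes the cross-validation overhead, but the single fit is still done by randomForest, so it will not be faster than calling randomForest directly.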

Does using the same trainControl object for cross-validation when training multiple models with caret allow for accurate model comparison?

I have been delving into the R package caret recently, and have a question about reproducibility and comparison of models during training that I haven't quite been able to pin down.
My intention is that each train call, and thus each resulting model, uses the same cross-validation splits, so that the out-of-sample estimates stored during model building are comparable across models.
One method I've seen is that you can specify the seed prior to each train call as such:
set.seed(1)
model <- train(..., trControl = trainControl(...))
set.seed(1)
model2 <- train(..., trControl = trainControl(...))
set.seed(1)
model3 <- train(..., trControl = trainControl(...))
However, does sharing a trainControl object between the train calls mean that they use the same resampling indexes, or do I have to pass the seeds argument explicitly? Does the trainControl object draw random numbers when it is used, or are they fixed when it is declared?
My current method has been:
set.seed(1)
train_control <- trainControl(method="cv", ...)
model1 <- train(..., trControl = train_control)
model2 <- train(..., trControl = train_control)
model3 <- train(..., trControl = train_control)
Are these train calls going to be using the same splits and be comparable, or do I have to take further steps to ensure that? i.e. specifying seeds when the trainControl object is made, or calling set.seed before each train? Or both?
Hopefully this has made some sense, and isn't a complete load of rubbish. Any help is appreciated.
My code project that I'm querying about can be found here. It might be easier to read it and you'll understand.
The CV folds are not created when trainControl is defined unless you supply them explicitly through the index argument, which I recommend. They can be created using one of caret's specialized functions:
createFolds
createMultiFolds
createTimeSlices
groupKFold
That being said, using a specific seed prior to trainControl definition will not result in the same CV folds.
Example:
library(caret)
library(tidyverse)
set.seed(1)
trControl = trainControl(method = "cv",
returnResamp = "final",
savePredictions = "final")
create two models:
knnFit1 <- train(iris[,1:4], iris[,5],
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trControl)
ldaFit2 <- train(iris[,1:4], iris[,5],
method = "lda",
tuneLength = 10,
trControl = trControl)
check if the same indexes are in the same folds:
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
#FALSE
If you set the same seed prior to each train call
set.seed(1)
knnFit1 <- train(iris[,1:4], iris[,5],
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trControl)
set.seed(1)
ldaFit2 <- train(iris[,1:4], iris[,5],
method = "lda",
tuneLength = 10,
trControl = trControl)
set.seed(1)
rangerFit3 <- train(iris[,1:4], iris[,5],
method = "ranger",
tuneLength = 10,
trControl = trControl)
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
knnFit1$pred %>%
left_join(rangerFit3$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
the same indexes will be used in the folds. However, I would not rely on this method when using parallel computation. Therefore, to ensure the same data splits are used, it is best to define them manually using the index/indexOut arguments to trainControl.
When you set the index argument manually the folds will be the same; however, this does not ensure that models fitted by the same method will be identical, since most methods include some sort of stochastic process. So to be fully reproducible it is advisable to also set the seed prior to each train call. When running in parallel, the seeds argument to trainControl needs to be set to get fully reproducible models.
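A sketch of that recommendation, assuming 10-fold CV on iris and a knn model with tuneLength = 10: the folds go into index, and seeds is a list with one integer vector per resample (at least as long as the number of tuning candidates) plus a single integer for the final fit. The variable names and the seed range are illustrative.

```r
library(caret)

n_folds <- 10   # number of resamples
n_tune  <- 10   # number of tuning candidates (matches tuneLength below)

set.seed(1)
folds <- createFolds(iris$Species, k = n_folds, returnTrain = TRUE)

# One vector of seeds per resample, plus one seed for the final model
seeds <- lapply(seq_len(n_folds), function(i) sample.int(10000, n_tune))
seeds[[n_folds + 1]] <- sample.int(10000, 1)

ctrl <- trainControl(method = "cv",
                     number = n_folds,
                     index = folds,
                     seeds = seeds,
                     savePredictions = "final")

# No set.seed() between the calls: reproducibility comes from index + seeds
fit_a <- train(iris[, 1:4], iris[, 5], method = "knn",
               tuneLength = n_tune, trControl = ctrl)
fit_b <- train(iris[, 1:4], iris[, 5], method = "knn",
               tuneLength = n_tune, trControl = ctrl)

# Same rows in the same folds, and the same resampled performance
by_row <- merge(fit_a$pred[, c("rowIndex", "Resample")],
                fit_b$pred[, c("rowIndex", "Resample")],
                by = "rowIndex")
same_folds   <- all(by_row$Resample.x == by_row$Resample.y)
same_results <- isTRUE(all.equal(fit_a$results, fit_b$results))
print(c(same_folds = same_folds, same_results = same_results))
```

The same ctrl object can be passed to other methods as well, as long as each per-resample seed vector is at least as long as the number of tuning candidates that method evaluates.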

Pooled Regression Results using mice, caret, and glmnet

Not sure if this is more of a statistics question, but the closest similar problem I could find is here, although I couldn't get it to work for my case.
I am trying to develop a pooled, penalized logistic regression model. I used mice to create a mids object and then fit a model to each dataset using caret repeated cross-validation with elastic net regression (glmnet) to tune parameters. The fitted object is not of class "mira" but I think I fixed that by changing the object class with the right list items. The major issue is that glmnet does not have an associated vcov method, which is required by pool().
I would like to use penalized regression because of the number of variables and the uncertainty over which ones are the best predictors. My data consists of 4 numeric variables and 9 categorical variables with varying numbers of levels, and I anticipate including interactions.
Does anyone know how I might be able to create my own vcov method or otherwise address this issue? I am not sure if this is possible.
Example data and code are enclosed, noting that I am not able to share the actual data.
library(mice)
library(caret)
dat <- as.data.frame(list(time=c(4,3,1,1,2,2,3,5,2,4,5,1,4,3,1,1,2,2,3,5,2,4,5,1),
status=c(1,1,1,0,2,2,0,0,NA,1,2,0,1,1,1,NA,2,2,0,0,1,NA,2,0),
x=c(0,2,1,1,NA,NA,0,1,1,2,0,1,0,2,1,1,NA,NA,0,1,1,2,0,1),
sex=c("M","M","M","M","F","F","F","F","M","F","F","M","F","M","M","M","F","F","M","F","M","F","M","F")))
imp <- mice(dat,m=5, seed=192)
control = trainControl(method = "repeatedcv",
number = 10,
repeats=3,
verboseIter = FALSE)
mod <- list(analyses=vector("list", imp$m))
for(i in 1:imp$m){
mod$analyses[[i]] <- train(sex ~ .,
data = complete(imp, i),
method = "glmnet",
family="binomial",
trControl = control,
tuneLength = 10,
metric="Kappa")
}
obj <- as.mira(mod)
obj <- list(call=mod$analyses[[1]]$call, call1=imp$call, nmis=imp$nmis, analyses=mod$analyses)
oldClass(obj) <- "mira"
pool(obj)
Produces:
Error in pool(obj) : Object has no vcov() method.

R: Feature Selection with Cross Validation using Caret on Logistic Regression

I am currently learning how to implement logistic regression in R.
I have taken a data set and split it into a training and test set and wish to implement forward selection, backward selection and best subset selection using cross validation to select the best features.
I am using caret to implement cross-validation on the training data set and then testing the predictions on the test data.
I have seen the rfe control in caret and have also looked at the documentation on the caret website, as well as following the links on the question "How to use wrapper feature selection with algorithms in R?". It isn't apparent to me how to change the type of feature selection, as it seems to default to backward selection. Can anyone help me with my workflow? Below is a reproducible example:
library("caret")
# Create an Example Dataset from German Credit Card Dataset
data(GermanCredit)
mydf <- GermanCredit
# Create Train and Test Sets 80/20 split
trainIndex <- createDataPartition(mydf$Class, p = .8,
list = FALSE,
times = 1)
train <- mydf[ trainIndex,]
test <- mydf[-trainIndex,]
ctrl <- trainControl(method = "repeatedcv",
number = 10,
savePredictions = TRUE)
mod_fit <- train(Class~., data=train,
method="glm",
family="binomial",
trControl = ctrl,
tuneLength = 5)
# Check out Variable Importance
varImp(mod_fit)
summary(mod_fit)
# Test the new model on new and unseen Data for reproducibility
pred = predict(mod_fit, newdata=test)
accuracy <- table(pred, test$Class)
sum(diag(accuracy))/sum(accuracy)
You can simply specify it in the train call that creates mod_fit. For backward stepwise selection, the code below is sufficient:
trControl <- trainControl(method = "cv",
                          number = 5,
                          savePredictions = TRUE,
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary)
caret_model <- train(Class ~ .,
                     data = train,
                     method = "glmStepAIC",   # This method fits the best model stepwise.
                     family = "binomial",
                     direction = "backward",  # Direction
                     metric = "ROC",          # Select on ROC, which twoClassSummary reports
                     trControl = trControl)
Note that in trControl:
method = "cv", # No need for repeated CV here; the number defined afterward sets k for the k-fold.
classProbs = TRUE,
summaryFunction = twoClassSummary # Gives back ROC, sensitivity and specificity of the chosen model.
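For the forward direction, stepwise selection has to start from a small model and be told how far it may grow. A minimal sketch of that, calling MASS::stepAIC directly rather than through caret, and using the birthwt data from MASS as a stand-in (the choice of candidate predictors is purely illustrative):

```r
library(MASS)   # provides stepAIC() and the birthwt data

# Start from the intercept-only logistic model
null_fit <- glm(low ~ 1, data = birthwt, family = binomial)

# Largest model the search is allowed to reach (illustrative predictors)
upper_scope <- ~ age + lwt + smoke + ht + ui

# Forward selection: add one term at a time while AIC keeps improving
fwd_fit <- stepAIC(null_fit,
                   scope = list(lower = ~ 1, upper = upper_scope),
                   direction = "forward",
                   trace = FALSE)

formula(fwd_fit)   # the terms that were selected
```

Whether the same scope can be passed through caret's glmStepAIC in a given caret version is something to verify before relying on it.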

Pass PCA preprocessing arguments to train()

I'm trying to build a predictive model in caret using PCA as pre-processing. The pre-processing would be as follows:
preProc <- preProcess(IL_train[,-1], method="pca", thresh = 0.8)
Is it possible to pass the thresh argument directly to caret's train() function? I've tried the following, but it doesn't work:
modelFit_pp <- train(IL_train$diagnosis ~ . , preProcess="pca",
thresh= 0.8, method="glm", data=IL_train)
If not, how can I pass the separate preProc results to the train() function?
As per the documentation, you specify additional preprocessing arguments with trainControl:
?trainControl
...
preProcOptions
A list of options to pass to preProcess. The type of pre-processing
(e.g. center, scaling etc) is passed in via the preProc option in train.
...
Since your dataset is not reproducible, let's look at an example. I will use the Sonar dataset from mlbench and use the pls algorithm just for fun.
library(caret)
library(mlbench)
data(Sonar)
ctrl <- trainControl(preProcOptions = list(thresh = 0.95))
mod <- train(Class ~ .,
             data = Sonar,
             method = "pls",
             preProcess = "pca",  # request PCA here; thresh is taken from preProcOptions
             trControl = ctrl)
Although documentation isn't the most exciting read, definitely make sure to try to go through it. Package authors work hard to create documentation and there are many wonders to be found within.
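To see what the threshold actually does, here is a small sketch on iris (an assumed stand-in, since the original IL_train data is not available). It compares the number of components kept at two thresholds, then checks that a threshold passed through preProcOptions reaches the preProcess object stored in the fitted model.

```r
library(caret)

# Standalone preProcess: a higher threshold keeps at least as many components
pp_80 <- preProcess(iris[, 1:4], method = "pca", thresh = 0.80)
pp_99 <- preProcess(iris[, 1:4], method = "pca", thresh = 0.99)
c(pp_80$numComp, pp_99$numComp)

# Same threshold routed through trainControl; PCA itself is requested in train()
ctrl <- trainControl(method = "cv",
                     number = 5,
                     preProcOptions = list(thresh = 0.80))

set.seed(1)
mod <- train(iris[, 1:4], iris$Species,
             method = "knn",
             preProcess = "pca",
             trControl = ctrl)

# The preProcess object kept with the model records the threshold it used
mod$preProcess$thresh
```

Inspecting mod$preProcess is a quick way to confirm that the option was picked up rather than silently ignored.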
