R caretEnsemble CV length incorrect

I am trying to ensemble models using the caretEnsemble package in R. Here is a minimal reproducible example. Please let me know if it needs extra information.
library(caret)
library(caretEnsemble)
library(xgboost)
library(plyr)
# Load the iris data and convert it to a binary classification problem.
# classProbs = TRUE with twoClassSummary needs a factor outcome whose levels
# are valid R variable names, so use labels rather than 0/1.
data(iris)
data = iris
data$target = factor(ifelse(data$Species == "setosa", "setosa", "other"))
data = subset(data, select = -c(Species))
# Train control for the base models: 5-fold CV
set.seed(123)
index = createFolds(data$target, k = 5, returnTrain = FALSE)
myControl = trainControl(method = 'cv', number = 5,
                         returnResamp = 'none', classProbs = TRUE,
                         returnData = FALSE, savePredictions = TRUE,
                         verboseIter = FALSE, allowParallel = TRUE,
                         summaryFunction = twoClassSummary,
                         index = index)
# Layer 1 (base) models
model1 = train(target ~ Sepal.Length, data = data, trControl = myControl,
               method = "glm", family = "binomial", metric = "ROC")
model2 = train(target ~ Sepal.Length, data = data, trControl = myControl,
               method = "xgbTree", metric = "ROC",
               tuneGrid = expand.grid(nrounds = 50, max_depth = 1, eta = .05,
                                      gamma = .5, colsample_bytree = 1,
                                      min_child_weight = 1, subsample = 1))
# Stack the models
all.models <- list(model1, model2)
names(all.models) <- c("glm", "xgb")
class(all.models) <- "caretList"
stacked <- caretStack(all.models, method = "glm", family = "binomial", metric = "ROC",
                      trControl = trainControl(method = 'cv', number = 5,
                                               returnResamp = 'none', classProbs = TRUE,
                                               returnData = FALSE, savePredictions = TRUE,
                                               verboseIter = FALSE, allowParallel = TRUE,
                                               summaryFunction = twoClassSummary))
stacked
This is the main output that concerns me.
A glm ensemble of 2 base models: glm, xgb
Ensemble results:
Generalized Linear Model
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 480, 480, 480, 480, 480
Resampling results:
  ROC        Sens  Spec
  0.9509688  0.92  0.835
My issue is that there are 150 rows in the base data set, so 30 rows in each fold of the 5-fold CV. If you look at "index" you'll see that this is working correctly. But if you look at the results of "stacked", the sample size for each fold of the meta/stacked model is 480. That is 480 * 5 = 2400 rows in total, 16 times the size of the original data set, and I have no idea why.
My main questions are:
1) Is this number of observations in each fold correct?
2) If so, why is this happening?

I figured out the issue, in case anyone else stumbles on this: the index I created holds the out-of-sample (held-out) rows, so the control should be:
myControl = trainControl(method = 'cv', number = 5,
                         returnResamp = 'none', classProbs = TRUE,
                         returnData = FALSE, savePredictions = TRUE,
                         verboseIter = FALSE, allowParallel = TRUE,
                         summaryFunction = twoClassSummary,
                         indexOut = index)
Instead of index= it should be indexOut=. Before, each fold was training on 20% of the data and predicting on the other 80%, which explains the inflated, overlapping sample sizes. With indexOut= set properly there is no overlap.
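To see where the 480 came from, you can look at the out-of-fold predictions that caretStack reuses when the original index= control is used; a quick sketch, assuming the model1, model2, index and data objects from the question:
# With index= holding the 30-row folds, each fold trains on ~30 rows and
# predicts the other ~120, so savePredictions stores 5 * 120 = 600 rows per
# base model; 5-fold CV of the stacking model on those 600 rows then trains
# on 480 rows per fold.
sapply(index, length)   # 30 held-out rows per fold
nrow(model1$pred)       # 600 out-of-fold predictions
nrow(model2$pred)       # 600

# createFolds(returnTrain = TRUE) returns training-row folds suited to index=;
# returnTrain = FALSE (as in the question) returns held-out folds, which
# belong in indexOut=.
set.seed(123)
train_folds <- createFolds(data$target, k = 5, returnTrain = TRUE)
sapply(train_folds, length)   # 120 training rows per fold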

Related

How to set a PPV in caret for random forest in R?

I'm interested in creating a model that optimizes PPV. I've created an RF model (below) that outputs a confusion matrix, from which I then manually calculate sensitivity, specificity, PPV, NPV, and F1. I know accuracy is being optimized right now, but I'm willing to forgo some sensitivity and specificity to get a much higher PPV.
data_ctrl_null <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                               summaryFunction = twoClassSummary,
                               savePredictions = TRUE, sampling = NULL)
set.seed(5368)
model_htn_df <- train(outcome ~ ., data = htn_df, ntree = 1000,
                      tuneGrid = data.frame(mtry = 38),
                      trControl = data_ctrl_null, method = "rf",
                      preProc = c("center", "scale"), metric = "ROC",
                      importance = TRUE)
model_htn_df$finalModel  # provides the confusion matrix
Results:

Call:
 randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 38

        OOB estimate of error rate: 16.2%
Confusion matrix:
     no yes class.error
no  274  19  0.06484642
yes  45  57  0.44117647
My manual calculation: sens = 55.9%, spec = 93.5%, PPV = 75.0%, NPV = 85.9%. (The confusion matrix switches my "no" and "yes" outcomes, so I also switch the numbers when I calculate the performance metrics.)
So what do I need to do to get a PPV = 90%?
This is a similar question, but I'm not really following it.
We define a function to calculate PPV and return the results with a name:
PPV <- function(data, lev = NULL, model = NULL) {
  value <- posPredValue(data$pred, data$obs, positive = lev[1])
  c(PPV = value)
}
Let's say we have the following data:
library(randomForest)
library(caret)

data = iris
# Two-class outcome as a factor so caret treats this as classification
data$Species = factor(ifelse(data$Species == "versicolor", "versi", "others"))
trn = sample(nrow(iris), 100)
Then we train by specifying PPV to be the metric:
mdl <- train(Species ~ ., data = data[trn,],
             method = "rf",
             metric = "PPV",
             trControl = trainControl(summaryFunction = PPV,
                                      classProbs = TRUE))
Random Forest

100 samples
  4 predictor
  2 classes: 'others', 'versi'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:

  mtry  PPV
  2     0.9682811
  3     0.9681759
  4     0.9648426

PPV was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Now you can see the model is tuned on PPV. However, you cannot force the training to achieve a PPV of 0.9; it really depends on the data. If your independent variables have no predictive power, the PPV will not improve no matter how much you train.
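If you want to see how stable that PPV estimate is, the resampled PPV values for the chosen mtry are kept on the fitted object (a small sketch, assuming the mdl object trained above; returnResamp defaults to "final"):
# PPV of the selected mtry on each bootstrap resample
head(mdl$resample)
summary(mdl$resample$PPV)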

Multiple evaluation metrics in classification using caret package [duplicate]

I used caret for logistic regression in R:
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     savePredictions = TRUE)
mod_fit <- train(Y ~ ., data = df, method = "glm", family = "binomial",
                 trControl = ctrl)
print(mod_fit)
The default metrics printed are accuracy and Cohen's kappa. I want to extract matching metrics such as sensitivity, specificity, positive predictive value, etc., but I cannot find an easy way to do it. The final model is provided, but it is trained on all the data (as far as I can tell from the documentation), so I cannot use it to predict afresh.
confusionMatrix calculates all the required parameters, but passing it as a summary function doesn't work:
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     savePredictions = TRUE, summaryFunction = confusionMatrix)
mod_fit <- train(Y ~ ., data = df, method = "glm", family = "binomial",
                 trControl = ctrl)

Error: `data` and `reference` should be factors with the same levels.
13. stop("`data` and `reference` should be factors with the same levels.",
         call. = FALSE)
12. confusionMatrix.default(testOutput, lev, method)
11. ctrl$summaryFunction(testOutput, lev, method)
Is there a way to extract this information in addition to accuracy and kappa, or to find it somewhere in the object returned by caret's train()?
Thanks in advance!
caret already has summary functions that output all the metrics you mention:
defaultSummary outputs Accuracy and Kappa
twoClassSummary outputs AUC (area under the ROC curve; see the last lines of this answer), sensitivity and specificity
prSummary outputs precision and recall
To get the combined metrics you can write your own summary function which combines the outputs of these three:
library(caret)

MySummary <- function(data, lev = NULL, model = NULL) {
  a1 <- defaultSummary(data, lev, model)
  b1 <- twoClassSummary(data, lev, model)
  c1 <- prSummary(data, lev, model)
  out <- c(a1, b1, c1)
  out
}
Let's try it on the Sonar data set:
library(mlbench)
data("Sonar")
When defining the train control it is important to set classProbs = TRUE, since some of these metrics (ROC and prAUC) cannot be calculated from the predicted classes alone and need the predicted probabilities.
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     savePredictions = TRUE,
                     summaryFunction = MySummary,
                     classProbs = TRUE)
Now fit the model of your choice:
mod_fit <- train(Class ~ .,
                 data = Sonar,
                 method = "rf",
                 trControl = ctrl)
mod_fit$results
#output
mtry Accuracy Kappa ROC Sens Spec AUC Precision Recall F AccuracySD KappaSD
1 2 0.8364069 0.6666364 0.9454798 0.9280303 0.7333333 0.8683726 0.8121087 0.9280303 0.8621526 0.10570484 0.2162077
2 31 0.8179870 0.6307880 0.9208081 0.8840909 0.7411111 0.8450612 0.8074942 0.8840909 0.8374326 0.06076222 0.1221844
3 60 0.8034632 0.6017979 0.9049242 0.8659091 0.7311111 0.8332068 0.7966889 0.8659091 0.8229330 0.06795824 0.1369086
ROCSD SensSD SpecSD AUCSD PrecisionSD RecallSD FSD
1 0.04393947 0.05727927 0.1948585 0.03410854 0.12717667 0.05727927 0.08482963
2 0.04995650 0.11053858 0.1398657 0.04694993 0.09075782 0.11053858 0.05772388
3 0.04965178 0.12047598 0.1387580 0.04820979 0.08951728 0.12047598 0.06715206
In this output, ROC is in fact the area under the ROC curve (usually called AUC), and AUC is the area under the precision-recall curve across all cutoffs.
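Since savePredictions = TRUE is set, another option (not part of the answer above) is to compute a full confusion matrix from the held-out predictions after training; a sketch, assuming the mod_fit object just fitted:
# Keep only the held-out predictions made with the selected tuning parameters,
# pooled across all resamples, then let confusionMatrix() report sensitivity,
# specificity, PPV, NPV and more in one go.
best_preds <- merge(mod_fit$pred, mod_fit$bestTune)
confusionMatrix(best_preds$pred, best_preds$obs)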

R caret: Combine rfe() and train()

I want to combine recursive feature elimination via rfe() with hyperparameter tuning and model selection via trainControl(), using method rf (random forest). Instead of the standard summary statistic I would like to use the MAPE (mean absolute percentage error). I therefore tried the following code on the ChickWeight data set:
library(caret)
library(randomForest)
library(MLmetrics)
# Compute MAPE instead of the default metrics
mape <- function(data, lev = NULL, model = NULL) {
  mape <- MAPE(y_pred = data$pred, y_true = data$obs)
  c(MAPE = mape)
}

# Specify trainControl
trc <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                    search = "grid", savePred = TRUE,
                    summaryFunction = mape)

# Set up the tuning grid
tunegrid <- expand.grid(.mtry = c(1:3))

# Specify rfeControl
rfec <- rfeControl(functions = rfFuncs, method = "cv", number = 10,
                   saveDetails = TRUE)

set.seed(42)
results <- rfe(weight ~ Time + Chick + Diet,
               sizes = c(1:3),  # predictor subset sizes to evaluate
               data = ChickWeight,
               method = "rf",
               ntree = 250,
               metric = "RMSE",
               tuneGrid = tunegrid,
               rfeControl = rfec,
               trControl = trc)
The code runs without errors. But where do I find the MAPE, which I defined as the summaryFunction in trainControl? Is trainControl executed or ignored?
How could I rewrite the code so that it performs recursive feature elimination with rfe, tunes the hyperparameter mtry via trainControl within rfe, and at the same time computes the additional error measure (MAPE)?
trainControl is ignored, as its description
Control the computational nuances of the train function
would suggest. To use MAPE, you want
rfec$functions$summary <- mape
Then
rfe(weight ~ Time + Chick + Diet,
    sizes = c(1:3),
    data = ChickWeight,
    method = "rf",
    ntree = 250,
    metric = "MAPE",    # Modified
    maximize = FALSE,   # Modified
    rfeControl = rfec)
#
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold)
#
# Resampling performance over subset size:
#
# Variables MAPE MAPESD Selected
# 1 0.1903 0.03190
# 2 0.1029 0.01727 *
# 3 0.1326 0.02136
# 53 0.1303 0.02041
#
# The top 2 variables (out of 2):
# Time, Chick.L
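To pull the selected variables and the resampled MAPE back out, assign the call (say res <- rfe(...)) and use the standard accessors; a small sketch:
predictors(res)   # variables in the chosen subset (here: Time, Chick.L)
res$results       # MAPE and MAPESD for each subset size
res$fit           # the random forest refit on the selected variables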

Sensitivity too low whereas AUC very high in caret train cross-validation resampling results

How should I interpret this: sensitivity is very low whereas AUC is very high in the caret train cross-validation resampling results on the data I have trained.
Is the model performance bad?
This usually occurs when there is a class imbalance and the default 50% probability cutoff produces poor predictions, but the class probabilities, while poorly calibrated, still separate the classes well.
Here is an example:
library(caret)
set.seed(1)
dat <- twoClassSim(500, intercept = 10)

set.seed(2)
mod <- train(Class ~ ., data = dat, method = "svmRadial",
             tuneLength = 10,
             preProc = c("center", "scale"),
             metric = "ROC",
             trControl = trainControl(search = "random",
                                      classProbs = TRUE,
                                      summaryFunction = twoClassSummary))
The results are
> mod
Support Vector Machines with Radial Basis Function Kernel
500 samples
15 predictor
2 classes: 'Class1', 'Class2'
Pre-processing: centered (15), scaled (15)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 500, 500, 500, 500, 500, 500, ...
Resampling results across tuning parameters:
sigma C ROC Sens Spec
0.01124608 21.27349102 0.9615725 0.33389177 0.9910125
0.01330079 419.19384543 0.9579240 0.34620779 0.9914320
0.01942163 85.16782989 0.9535367 0.33211255 0.9920583
0.02168484 632.31603140 0.9516538 0.33065224 0.9911863
0.02395674 89.03035078 0.9497636 0.32504906 0.9909382
0.03988581 3.58620979 0.9392330 0.25279365 0.9920611
0.04204420 699.55658836 0.9356568 0.23920635 0.9931667
0.05263619 0.06127242 0.9265497 0.28134921 0.9839818
0.05364313 34.57839446 0.9264506 0.19560317 0.9934489
0.08838604 47.84104078 0.9029791 0.06296825 0.9955034
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.01124608 and C = 21.27349.
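To see concretely that the probabilities separate the classes even though the default 50% cutoff does not, compare class counts at a lower cutoff; a rough sketch on the training data (resubstitution only, so purely illustrative), assuming the mod and dat objects above and a hypothetical cutoff of 0.2:
# Probability of Class1 from the fitted SVM, then cross-tabulate at two cutoffs
probs <- predict(mod, dat, type = "prob")$Class1
table(pred_at_0.50 = ifelse(probs > 0.5, "Class1", "Class2"), truth = dat$Class)
table(pred_at_0.20 = ifelse(probs > 0.2, "Class1", "Class2"), truth = dat$Class)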

Different results with randomForest() and caret's randomForest (method = "rf")

I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomForest() model using caret’s train() function for method="rf". Unfortunately, I haven’t been able to get matching results, and I’m wondering what I’m overlooking.
I'll also add that, given that randomForest uses bootstrapping to generate the samples used to fit each of the ntree trees and estimates error from the out-of-bag predictions, I'm a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl call. These options generate different results, but neither matches the randomForest() model.
I've read the caret package website (http://topepo.github.io/caret/index.html), as well as various Stack Overflow questions that seem potentially relevant, but I haven't been able to figure out why the caret method = "rf" model produces different results from randomForest(). Thank you very much for any insight you might be able to offer.
Here’s a replicable example, using the CO2 dataset from the MASS package.
library(MASS)
data(CO2)

library(randomForest)
set.seed(1)
rf.model <- randomForest(uptake ~ .,
                         data = CO2,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE,
                         metric = "RMSE")

library(caret)
set.seed(1)
caret.oob.model <- train(uptake ~ .,
                         data = CO2,
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE,
                         metric = "RMSE",
                         trControl = trainControl(method = "oob"),
                         allowParallel = FALSE)

set.seed(1)
caret.boot.model <- train(uptake ~ .,
                          data = CO2,
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE,
                          metric = "RMSE",
                          trControl = trainControl(method = "boot", number = 50),
                          allowParallel = FALSE)

print(rf.model)
print(caret.oob.model$finalModel)
print(caret.boot.model$finalModel)
Produces the following:
print(rf.model)
Mean of squared residuals: 9.380421
% Var explained: 91.88
print(caret.oob.model$finalModel)
Mean of squared residuals: 38.3598
% Var explained: 66.81
print(caret.boot.model$finalModel)
Mean of squared residuals: 42.56646
% Var explained: 63.16
And the code to look at variable importance:
importance(rf.model)
importance(caret.oob.model$finalModel)
importance(caret.boot.model$finalModel)
Using the formula interface in train converts factors to dummy variables. To compare results from caret with randomForest you should use the non-formula interface.
In your case, you should also provide a seed inside trainControl to get the same result as with randomForest.
In the training section of the caret webpage there are some notes on reproducibility that explain how to use seeds.
library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry = 2,
importance = TRUE,
metric = "RMSE")
library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "oob", seed = 1),
allowParallel = FALSE)
If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.
In the following example, the last seed is for the final model and I set it to 1.
seeds <- as.vector(c(1:26), mode = "list")
# For the final model
seeds[[26]] <- 1

caret.boot.model <- train(CO2[, -5], CO2$uptake,
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE,
                          metric = "RMSE",
                          trControl = trainControl(method = "boot", seeds = seeds),
                          allowParallel = FALSE)
With the non-formula interface in caret and the seeds defined correctly in trainControl, you will get the same results from all three models:
rf.model
caret.oob.model$final
caret.boot.model$final
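A quick way to check that the three fits now agree is to compare the stored out-of-bag error of each underlying randomForest object directly; a sketch, assuming the three seeded models above fit successfully:
# OOB mean squared residuals after all 50 trees; with matching seeds and the
# non-formula interface these should now line up
c(randomForest = tail(rf.model$mse, 1),
  caret_oob    = tail(caret.oob.model$finalModel$mse, 1),
  caret_boot   = tail(caret.boot.model$finalModel$mse, 1))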
