R caret inconsistent results in model tuning

Using the caret package for model tuning today, I ran into this strange behavior: given a specific combination of tuning parameters T*, the value of the metric (here Cohen's Kappa) associated with T* changes depending on whether T* is evaluated alone or as part of a grid of candidate combinations. In the practical example that follows, caret is used to interface with the gbm package.
# Load libraries and data
library(caret)
data <- read.csv("mydata.csv")
data$target <- as.factor(data$target)
# data are available at https://www.dropbox.com/s/1bglmqd14g840j1/mydata.csv?dl=0
Procedure 1: T* evaluated alone
# Define 5-fold cv as validation settings
fitControl <- trainControl(method = "cv", number = 5)
# Define the combination of tuning parameters for this example, T*
gbmGrid <- expand.grid(.interaction.depth = 1,
                       .n.trees = 1000,
                       .shrinkage = 0.1,
                       .n.minobsinnode = 1)
# Fit a gbm with T* as model parameters and Kappa as scoring metric.
set.seed(825)
gbmFit1 <- train(target ~ ., data = data,
                 method = "gbm",
                 distribution = "adaboost",
                 trControl = fitControl,
                 tuneGrid = gbmGrid,
                 verbose = FALSE,
                 metric = "Kappa")
# The results show that T* is associated with Kappa = 0.47. Remember this result and the confusion matrix.
testPred <- predict(gbmFit1, newdata = data)
confusionMatrix(testPred, data$target)
# output selection
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 832  34
         1   0  16

Kappa : 0.4703
Procedure 2: T* evaluated along with other tuning profiles
Here everything is the same as in Procedure 1, except that several combinations of tuning parameters {T} are considered:
# Notice that the original T* is included in {T}!!
gbmGrid2 <- expand.grid(.interaction.depth = 1,
                        .n.trees = seq(100, 1000, by = 100),
                        .shrinkage = 0.1,
                        .n.minobsinnode = 1)
# Fit the gbm
set.seed(825)
gbmFit2 <- train(target ~ ., data = data,
                 method = "gbm",
                 distribution = "adaboost",
                 trControl = fitControl,
                 tuneGrid = gbmGrid2,
                 verbose = FALSE,
                 metric = "Kappa")
# Caret should pick the model with the highest Kappa.
# Since T* is in {T} I would expect the best model to have K >= 0.47
testPred<-predict(gbmFit2, newdata = data)
confusionMatrix(testPred, data$target)
# output selection
          Reference
Prediction   0   1
         0 831  47
         1   1   3

Kappa : 0.1036
The results are inconsistent with my expectations: the best model in {T} scores Kappa = 0.10. How is that possible, given that T* has Kappa = 0.47 and is included in {T}? Additionally, according to the following plot, Kappa for T* as evaluated in Procedure 2 is now around 0.01. Any idea what is going on? Am I missing something?

I am getting consistent resampling results from your data and code.
The first model has Kappa = 0.00943
gbmFit1$results
  interaction.depth n.trees shrinkage n.minobsinnode  Accuracy       Kappa  AccuracySD   KappaSD
1                 1    1000       0.1              1 0.9331022 0.009430576 0.004819004 0.0589132
The second model has the same results for n.trees = 1000
gbmFit2$results
   shrinkage interaction.depth n.minobsinnode n.trees  Accuracy        Kappa  AccuracySD     KappaSD
1        0.1                 1              1     100 0.9421803 -0.002075765 0.002422952 0.004641553
2        0.1                 1              1     200 0.9387776 -0.008326896 0.002468351 0.004654972
3        0.1                 1              1     300 0.9365049 -0.012187900 0.002625886 0.003978702
4        0.1                 1              1     400 0.9353749 -0.013950906 0.003077431 0.004837097
5        0.1                 1              1     500 0.9353685 -0.013961221 0.003244201 0.004878259
6        0.1                 1              1     600 0.9342322 -0.015486214 0.005202656 0.007469843
7        0.1                 1              1     700 0.9319658 -0.018574633 0.007033402 0.009470466
8        0.1                 1              1     800 0.9319658 -0.018574633 0.007033402 0.009470466
9        0.1                 1              1     900 0.9342386  0.010955568 0.003144850 0.057825336
10       0.1                 1              1    1000 0.9331022  0.009430576 0.004819004 0.058913202
Note that the best model in your second run has n.trees = 900
gbmFit2$bestTune
  n.trees interaction.depth shrinkage n.minobsinnode
9     900                 1       0.1              1
Since train picks the "best" model based on your metric, your second prediction is using a different model (n.trees of 900 instead of 1000).
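If you want to look at T* inside the second run without re-running the whole grid, you can compare the resampled results directly, and caret's update() method for train objects can refit the final model at a chosen set of tuning parameters. A minimal sketch, assuming the gbmFit1 and gbmFit2 objects above:
# resampled Kappa for T* inside the grid vs. T* evaluated alone
subset(gbmFit2$results, n.trees == 1000)
gbmFit1$results
# refit the final model of the second run at the original T* (n.trees = 1000)
# instead of the selected n.trees = 900
gbmFit2_Tstar <- update(gbmFit2,
                        param = data.frame(n.trees = 1000,
                                           interaction.depth = 1,
                                           shrinkage = 0.1,
                                           n.minobsinnode = 1))
Note also that the apparent Kappa of 0.47 in the question comes from predicting on the same data used for training; the resampled Kappa for T* is about 0.009 in both runs.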

Related

How to plot RMSE vs number of trees tried in bagging when using train() and cross validation in R

I am studying this website about the bagging method: https://bradleyboehmke.github.io/HOML/bagging.html
I am going to use the train() function with cross validation for bagging, something like the code below.
As far as I understand, nbagg = 200 tells R to try 200 trees, calculate the RMSE for each, and return the number of trees (here 80) for which the best RMSE is achieved.
Now, how can I see what RMSE the other nbagg values produced in this model, like the RMSE vs number of trees plot on that website (before introducing the CV method and the train() function), like the plot below?
ames_bag2 <- train(
  Sale_Price ~ .,
  data = ames_train,
  method = "treebag",
  trControl = trainControl(method = "cv", number = 10),
  nbagg = 200,
  control = rpart.control(minsplit = 2, cp = 0)
)
ames_bag2
## Bagged CART
##
## 2054 samples
## 80 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1848, 1848, 1849, 1849, 1847, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 26957.06 0.8900689 16713.14
As the example you shared is not completely reproducible, I have taken a different example from the mtcars dataset to illustrate how you can do it. You can extend that for your data.
Note: the RMSE shown here is the average of 10 RMSEs, since the number of CV folds is 10, so we will store only that average. The relevant libraries are also added in the example, and the maximum number of trees is set to 15, just for the example.
library(ipred)
library(caret)
library(rpart)
library(dplyr)

data("mtcars")

n_trees <- 1
error_df <- data.frame()
while (n_trees <= 15) {
  ames_bag2 <- train(
    mpg ~ .,
    data = mtcars,
    method = "treebag",
    trControl = trainControl(method = "cv", number = 10),
    nbagg = n_trees,
    control = rpart.control(minsplit = 2, cp = 0)
  )
  error_df %>%
    bind_rows(data.frame(trees = n_trees,
                         rmse = mean(ames_bag2[["resample"]]$RMSE))) -> error_df
  n_trees <- n_trees + 1
}
error_df will show the output.
> error_df
   trees     rmse
1      1 2.493117
2      2 3.052958
3      3 2.052801
4      4 2.239841
5      5 2.500279
6      6 2.700347
7      7 2.642525
8      8 2.497162
9      9 2.263527
10    10 2.379366
11    11 2.447560
12    12 2.314433
13    13 2.423648
14    14 2.192112
15    15 2.256778
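To get a chart like the RMSE vs number of trees plot on the linked page, you can plot the stored values. A small sketch using the error_df built above (ggplot2 is assumed to be installed):
library(ggplot2)
# cross-validated RMSE against the number of bagged trees
ggplot(error_df, aes(x = trees, y = rmse)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of bagged trees (nbagg)", y = "Cross-validated RMSE")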

In the train method what's the relationship between tuneGrid and trControl?

The preferred way in R to train well-known ML models is to use the caret package and its generic train method. My question is: what is the relationship between the tuneGrid and trControl arguments? They are undoubtedly related, but I cannot figure out their relationship from the documentation. For example:
library(caret)
# train and choose best model using cross validation
df <- ...  # contains input data
control <- trainControl(method = "cv", number = 10, p = .9, allowParallel = TRUE)
fit <- train(y ~ ., method = "knn",
             data = df,
             tuneGrid = data.frame(k = seq(9, 71, 2)),
             trControl = control)
If I run the code above, what is happening? How are the 10 CV folds, each containing 90% of the data as per the trainControl definition, combined with the 32 levels of k?
More concretely:
I have 32 levels for the parameter k.
I also have 10 CV folds.
Is the k-nearest neighbors model trained 32*10 times? or otherwise?
Yes, you are correct. You partition your training data into 10 sets, say 1..10. Starting with set 1, you train your model on all of sets 2..10 (90% of the training data) and test it on set 1. This is repeated for set 2, set 3, and so on, for a total of 10 times; and since you have 32 values of k to test, that makes 32 * 10 = 320 model fits.
You can also pull out these CV results by setting returnResamp = "all" in trainControl. I simplify it to 3-fold CV and 4 values of k below:
df <- mtcars
set.seed(100)
control <- trainControl(method = "cv", number = 3, p = .9, returnResamp = "all")
fit <- train(mpg ~ ., method = "knn",
             data = mtcars,
             tuneGrid = data.frame(k = 2:5),
             trControl = control)
resample_results <- fit$resample
resample_results
       RMSE  Rsquared      MAE k Resample
1  3.502321 0.7772086 2.483333 2    Fold1
2  3.807011 0.7636239 2.861111 3    Fold1
3  3.592665 0.8035741 2.697917 4    Fold1
4  3.682105 0.8486331 2.741667 5    Fold1
5  2.473611 0.8665093 1.995000 2    Fold2
6  2.673429 0.8128622 2.210000 3    Fold2
7  2.983224 0.7120910 2.645000 4    Fold2
8  2.998199 0.7207914 2.608000 5    Fold2
9  2.094039 0.9620830 1.610000 2    Fold3
10 2.551035 0.8717981 2.113333 3    Fold3
11 2.893192 0.8324555 2.482500 4    Fold3
12 2.806870 0.8700533 2.368333 5    Fold3
# we manually calculate the mean RMSE for each parameter
tapply(resample_results$RMSE, resample_results$k, mean)
       2        3        4        5
2.689990 3.010492 3.156360 3.162392
# and we can see it corresponds to the final fit result
fit$results
  k     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1 2 2.689990 0.8686003 2.029444 0.7286489 0.09245494 0.4376844
2 3 3.010492 0.8160947 2.394815 0.6925154 0.05415954 0.4067066
3 4 3.156360 0.7827069 2.608472 0.3805227 0.06283697 0.1122577
4 5 3.162392 0.8131593 2.572667 0.4601396 0.08070670 0.1891581
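As a quick sanity check on the folds-times-parameters bookkeeping, the resample table stores one row per fold and per candidate value of k (a small check using the fit object above):
# 3 folds * 4 values of k = 12 resampled performance estimates
nrow(fit$resample)
# [1] 12
# in the original question this would be 10 folds * 32 values of k = 320 fits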

Does sbf() use metric argument to optimize model?

Passing ROC as metric argument value to the caretSBF function
Our objective is to use the ROC summary metric for model selection while running the Selection By Filtering function sbf() for feature selection.
The BreastCancer dataset from the mlbench package was used as a reproducible example to run train() and sbf() with metric = "Accuracy" and metric = "ROC".
We want to make sure sbf() takes the metric argument as applied by the train() and rfe() functions to optimize the model. To this end, we planned to make use of the train() function together with the sbf() function: the caretSBF$fit function makes a call to train(), and caretSBF is passed to sbfControl.
From the output, it seems the metric argument is used just for inner resampling and not for the sbf part, i.e. for the outer resampling of the output, the metric argument was not applied as used by train() and rfe().
As we have used caretSBF which uses train(), it appears that the metric argument is limited in scope to train() and hence is not passed to sbf.
We would appreciate clarification on whether sbf() uses metric argument for optimizing model, i.e. for outer resampling?
Here is our work on a reproducible example, showing that train() uses the metric argument (with Accuracy and ROC); for sbf() we are not sure.
I. DATA SECTION
## Loading required packages
library(mlbench)
library(caret)
## Loading `BreastCancer` Dataset from *mlbench* package
data("BreastCancer")
## Data cleaning for missing values
# Remove rows/observation with NA Values in any of the columns
BrC1 <- BreastCancer[complete.cases(BreastCancer),]
# Removing Class and Id Column and keeping just Numeric Predictors
Num_Pred <- BrC1[,2:10]
II. CUSTOMIZED SUMMARY FUNCTION
Defining fiveStats summary function
fiveStats <- function(...) c(twoClassSummary(...),
                             defaultSummary(...))
III. TRAIN SECTION
Defining trControl
trCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 1,
                       classProbs = TRUE, summaryFunction = fiveStats)
TRAIN + METRIC = "Accuracy"
set.seed(1)
TR_acc <- train(Num_Pred, BrC1$Class, method = "rf", metric = "Accuracy",
                trControl = trCtrl, tuneGrid = expand.grid(.mtry = c(2, 3, 4, 5)))
TR_acc
TR_acc
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 615, 614, 614, 614, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9936532 0.9729798 0.9833333 0.9765772 0.9490311
# 3 0.9936544 0.9729293 0.9791667 0.9750853 0.9457534
# 4 0.9929957 0.9684343 0.9750000 0.9706948 0.9361373
# 5 0.9922907 0.9684343 0.9666667 0.9677536 0.9295782
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 2.
TRAIN + METRIC = "ROC"
set.seed(1)
TR_roc <- train(Num_Pred, BrC1$Class, method = "rf", metric = "ROC",
                trControl = trCtrl, tuneGrid = expand.grid(.mtry = c(2, 3, 4, 5)))
TR_roc
TR_roc
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 615, 614, 614, 614, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9936532 0.9729798 0.9833333 0.9765772 0.9490311
# 3 0.9936544 0.9729293 0.9791667 0.9750853 0.9457534
# 4 0.9929957 0.9684343 0.9750000 0.9706948 0.9361373
# 5 0.9922907 0.9684343 0.9666667 0.9677536 0.9295782
#
# ROC was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 3.
IV. EDITING caretSBF
Editing caretSBF summary Function
caretSBF$summary <- fiveStats
V. SBF SECTION
Defining sbfControl
sbfCtrl <- sbfControl(functions = caretSBF,
                      method = "repeatedcv", number = 10, repeats = 1,
                      verbose = TRUE, saveDetails = TRUE)
SBF + METRIC = "Accuracy"
set.seed(1)
sbf_acc <- sbf(Num_Pred, BrC1$Class,
               sbfControl = sbfCtrl,
               trControl = trCtrl, method = "rf", metric = "Accuracy")
## sbf_acc
sbf_acc
# Selection By Filter
#
# Outer resampling method: Cross-Validated (10 fold, repeated 1 times)
#
# Resampling performance:
#
# ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
# 0.9931 0.973 0.9833 0.9766 0.949 0.006272 0.0231 0.02913 0.01226 0.02646
#
# Using the training set, 9 variables were selected:
# Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
#
# During resampling, the top 5 selected variables (out of a possible 9):
# Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
#
# On average, 9 variables were selected (min = 9, max = 9)
## Class of sbf_acc
class(sbf_acc)
# [1] "sbf"
## Names of elements of sbf_acc
names(sbf_acc)
# [1] "pred" "variables" "results" "fit" "optVariables"
# [6] "call" "control" "resample" "metrics" "times"
# [11] "resampledCM" "obsLevels" "dots"
## sbf_acc fit element
sbf_acc$fit
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 614, 614, 615, 615, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9933176 0.9706566 0.9833333 0.9751492 0.9460717
# 5 0.9920034 0.9662121 0.9791667 0.9707801 0.9363708
# 9 0.9914825 0.9684343 0.9708333 0.9693308 0.9327662
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 2.
## Elements of sbf_acc fit
names(sbf_acc$fit)
# [1] "method" "modelInfo" "modelType" "results" "pred"
# [6] "bestTune" "call" "dots" "metric" "control"
# [11] "finalModel" "preProcess" "trainingData" "resample" "resampledCM"
# [16] "perfNames" "maximize" "yLimits" "times" "levels"
## sbf_acc fit final Model
sbf_acc$fit$finalModel
# Call:
# randomForest(x = x, y = y, mtry = param$mtry)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 2.34%
# Confusion matrix:
# benign malignant class.error
# benign 431 13 0.02927928
# malignant 3 236 0.01255230
## sbf_acc metric
sbf_acc$fit$metric
# [1] "Accuracy"
## sbf_acc fit best Tune
sbf_acc$fit$bestTune
# mtry
# 1 2
SBF + METRIC = "ROC"
set.seed(1)
sbf_roc <- sbf(Num_Pred, BrC1$Class,
               sbfControl = sbfCtrl,
               trControl = trCtrl, method = "rf", metric = "ROC")
## sbf_roc
sbf_roc
# Selection By Filter
#
# Outer resampling method: Cross-Validated (10 fold, repeated 1 times)
#
# Resampling performance:
#
# ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
# 0.9931 0.973 0.9833 0.9766 0.949 0.006272 0.0231 0.02913 0.01226 0.02646
#
# Using the training set, 9 variables were selected:
# Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
#
# During resampling, the top 5 selected variables (out of a possible 9):
# Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
#
# On average, 9 variables were selected (min = 9, max = 9)
## Class of sbf_roc
class(sbf_roc)
# [1] "sbf"
## Names of elements of sbf_roc
names(sbf_roc)
# [1] "pred" "variables" "results" "fit" "optVariables"
# [6] "call" "control" "resample" "metrics" "times"
# [11] "resampledCM" "obsLevels" "dots"
## sbf_roc fit element
sbf_roc$fit
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 614, 614, 615, 615, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9933176 0.9706566 0.9833333 0.9751492 0.9460717
# 5 0.9920034 0.9662121 0.9791667 0.9707801 0.9363708
# 9 0.9914825 0.9684343 0.9708333 0.9693308 0.9327662
#
# ROC was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 2.
## Elements of sbf_roc fit
names(sbf_roc$fit)
# [1] "method" "modelInfo" "modelType" "results" "pred"
# [6] "bestTune" "call" "dots" "metric" "control"
# [11] "finalModel" "preProcess" "trainingData" "resample" "resampledCM"
# [16] "perfNames" "maximize" "yLimits" "times" "levels"
## sbf_roc fit final Model
sbf_roc$fit$finalModel
# Call:
# randomForest(x = x, y = y, mtry = param$mtry)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 2.34%
# Confusion matrix:
# benign malignant class.error
# benign 431 13 0.02927928
# malignant 3 236 0.01255230
## sbf_roc metric
sbf_roc$fit$metric
# [1] "ROC"
## sbf_roc fit best Tune
sbf_roc$fit$bestTune
# mtry
# 1 2
Does sbf() use the metric argument to optimize the model? If yes, what metric does sbf() use by default? If sbf() does use the metric argument, how can we set it to ROC?
Thanks.
sbf doesn't use the metric to optimize anything (unlike rfe); all sbf does is perform a feature selection step before fitting the model. Of course, you define the filters, but there is no way to tune the filter using sbf, so no metric is needed to guide that step.
Using sbf(x, y, metric = "ROC") will pass metric = "ROC" to whatever modeling function you are using (it is designed to work with train when caretSBF is used). This happens because there is no metric argument to sbf:
> names(formals(caret:::sbf.default))
[1] "x" "y" "sbfControl" "..."

How to compute ROC and AUC under ROC after training using caret in R?

I have used the caret package's train function with 10-fold cross validation. I have also obtained class probabilities for the predicted classes by setting classProbs = TRUE in trControl, as follows:
myTrainingControl <- trainControl(method = "cv",
                                  number = 10,
                                  savePredictions = TRUE,
                                  classProbs = TRUE,
                                  verboseIter = TRUE)
randomForestFit = train(x = input[3:154],
                        y = as.factor(input$Target),
                        method = "rf",
                        trControl = myTrainingControl,
                        preProcess = c("center", "scale"),
                        ntree = 50)
The output predictions I am getting are as follows:
  pred obs    0    1 rowIndex mtry Resample
1    0   1 0.52 0.48       28   12   Fold01
2    0   0 0.58 0.42       43   12   Fold01
3    0   1 0.58 0.42       51   12   Fold01
4    0   0 0.68 0.32       55   12   Fold01
5    0   0 0.62 0.38       59   12   Fold01
6    0   1 0.92 0.08       71   12   Fold01
Now I want to calculate ROC and AUC under ROC using this data. How would I achieve this?
A sample example for AUC:
# randomForest and ROCR are needed; predictor_data, target, and sampsizes
# are placeholders for your own objects
library(randomForest)
library(ROCR)
rf_output <- randomForest(x = predictor_data, y = target, importance = TRUE,
                          ntree = 10001, proximity = TRUE, sampsize = sampsizes)
predictions <- as.vector(rf_output$votes[, 2])
pred <- prediction(predictions, target)
perf_AUC <- performance(pred, "auc")          # calculate the AUC value
AUC <- perf_AUC@y.values[[1]]
perf_ROC <- performance(pred, "tpr", "fpr")   # plot the actual ROC curve
plot(perf_ROC, main = "ROC plot")
text(0.5, 0.5, paste("AUC = ", format(AUC, digits = 5, scientific = FALSE)))
or using pROC and caret
library(caret)
library(pROC)
data(iris)
iris <- iris[iris$Species == "virginica" | iris$Species == "versicolor", ]
iris$Species <- factor(iris$Species) # setosa should be removed from factor
samples <- sample(NROW(iris), NROW(iris) * .5)
data.train <- iris[samples, ]
data.test <- iris[-samples, ]
forest.model <- train(Species ~., data.train)
result.predicted.prob <- predict(forest.model, data.test, type="prob") # Prediction
result.roc <- roc(data.test$Species, result.predicted.prob$versicolor) # Draw ROC curve.
plot(result.roc, print.thres="best", print.thres.best.method="closest.topleft")
result.coords <- coords(result.roc, "best", best.method="closest.topleft", ret=c("threshold", "accuracy"))
print(result.coords)#to get threshold and accuracy
Update 2019: this is what MLeval was written for (https://cran.r-project.org/web/packages/MLeval/index.html). It works with the caret train output object to make ROC curves, precision-recall curves, and calibration curves, and to calculate metrics such as ROC-AUC, sensitivity, and specificity. It does all of this in one line, which is helpful for my analyses and may be of interest.
library(caret)
library(MLeval)
library(mlbench)   # for the Sonar data
data(Sonar)

myTrainingControl <- trainControl(method = "cv",
                                  number = 10,
                                  savePredictions = TRUE,
                                  classProbs = TRUE,
                                  verboseIter = TRUE)
randomForestFit = train(x = Sonar[, 1:60],
                        y = as.factor(Sonar$Class),
                        method = "rf",
                        trControl = myTrainingControl,
                        preProcess = c("center", "scale"),
                        ntree = 50)

## run MLeval
x <- evalm(randomForestFit)
## get roc curve plotted in ggplot2
x$roc
## get AUC and other metrics
x$stdres
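For completeness, here is a minimal sketch that works directly on the hold-out predictions saved by trainControl(savePredictions = TRUE), assuming the randomForestFit object from the question and a class-probability column named 1 as in the printed pred output:
library(pROC)
# keep only the rows for the selected tuning parameters, then pool the
# held-out predictions across the 10 folds
cv_preds <- merge(randomForestFit$pred, randomForestFit$bestTune)
# cross-validated ROC curve and AUC, treating "0" as the control level
roc_cv <- roc(response = cv_preds$obs, predictor = cv_preds$`1`,
              levels = c("0", "1"))
auc(roc_cv)
plot(roc_cv)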

Highly imbalanced data on C5.0 tree model

I have an imbalanced dataset with only 87 target events "F" out of 496,978 observations. Since I would like to see a rule/tree, I chose tree-based models. I have been following the code in the "Applied Predictive Modeling" book by Dr. Max Kuhn; chapter 16 addresses this imbalance issue well.
Here is the sample data structure:
str(training[,predictors])
'data.frame': 496978 obs. of 36 variables:
$ Point_Of_Sale_Code : Factor w/ 5 levels "c0","c2","c90",..: 3 3 5 5 3 3 5 5 5 5 ...
$ Delinquent_Amount : num 0 0 0 0 0 0 0 0 0 0 ...
$ Delinquent_Days_Count : num 0 0 0 0 0 0 0 0 0 0 ...
$ Overlimit_amt : num 0 0 0 0 0 0 0 0 0 0 ...
I tried down-sampling with random forest and it works well, with a good AUC = 0.9997 on the test data and this confusion matrix:
          Reference
Prediction      N     F
         N 140526     0
         F   1442    24
However, random forest does not give me a specific rule, so I tried the code from the book exactly as written:
library(rpart)
library(e1071)
initialRpart <- rpart(flag ~ ., data = training,
                      control = rpart.control(cp = 0.0001))
rpartGrid <- data.frame(.cp = initialRpart$cptable[, "CP"])

cmat <- list(loss = matrix(c(0, 1, 20, 0), ncol = 2))
set.seed(1401)
cartWMod1 <- train(x = training[, predictors],
                   y = training$flag,
                   method = "rpart",
                   trControl = ctrlNoProb,
                   tuneGrid = rpartGrid,
                   metric = "Kappa",
                   parms = cmat)
cartWMod1
I get the error message below every time, no matter what I try (such as converting all integer columns to numeric); I am not sure why I get this warning message:
Warning message:
In ni[1:m] * nj[1:m] : NAs produced by integer overflow

Aggregating results
Selecting tuning parameters
Error in train.default(x = training[, predictors], y = training$flag, :
  final tuning parameters could not be determined
I also tried the code for the C5.0 package:
library(C50)
c5Grid <- expand.grid(.model = c("tree", "rules"),
                      .trials = c(1, (1:10) * 10),
                      .winnow = FALSE)
finalCost <- matrix(c(0, 150, 1, 0), ncol = 2)
rownames(finalCost) <- colnames(finalCost) <- levels(training$flag)
set.seed(1401)
C5CostFit1 <- train(training[, predictors],
                    training$flag,
                    method = "C5.0",
                    metric = "Kappa",
                    tuneGrid = c5Grid,
                    cost = finalCost,
                    control = C5.0Control(earlyStopping = FALSE),
                    trControl = ctrlNoProb)
C5CostCM1 <- confusionMatrix(predict(C5CostFit1, training), training$flag)
I got this result, which classifies all target events F as the non-event N. Is it possible to increase the cost penalty from 150 to something larger to fix this issue? Thank you!
C5CostCM1
Confusion Matrix and Statistics

          Reference
Prediction      N     F
         N 141968    24
         F      0     0

               Accuracy : 0.9998
                 95% CI : (0.9997, 0.9999)
    No Information Rate : 0.9998
    P-Value [Acc > NIR] : 0.554
                  Kappa : NA
 Mcnemar's Test P-Value : 2.668e-06
            Sensitivity : 1.0000
            Specificity : 0.0000
         Pos Pred Value : 0.9998
         Neg Pred Value : NaN
             Prevalence : 0.9998
         Detection Rate : 0.9998
   Detection Prevalence : 1.0000
      Balanced Accuracy : 0.5000
       'Positive' Class : N
I have been googling this issue for the past week but did not find a solution. The code from the book works well, but gives an error for my data... Any suggestion will be appreciated!! Thank you so much!
I think that it's telling you that something in the output (i.e., the list) has NAs in it, namely the Kappa statistic.
Using something like this:
# data and reference are placeholders for your predicted and observed factors
results.matrix <- confusionMatrix(data, reference)
# element 3 of the confusionMatrix list is the "overall" statistics (Accuracy, Kappa, ...)
results.df <- as.data.frame(results.matrix[3])
summary(is.finite(results.df$overall))
Gives you this:
   Mode   FALSE    TRUE    NA's
logical       1       6       0
So I'm guessing that's what it's picking up.
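Not part of the answer above, but since down-sampling already worked for your random forest, one option to explore is letting caret down-sample the majority class inside each resample while fitting a rule-based C5.0 model, which still returns readable rules. A sketch, assuming the training, predictors, and flag objects from the question:
library(caret)
library(C50)
downCtrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary,
                         sampling = "down")   # down-sample within each resample
set.seed(1401)
c5Down <- train(x = training[, predictors],
                y = training$flag,
                method = "C5.0Rules",
                metric = "ROC",
                trControl = downCtrl)
summary(c5Down$finalModel)   # prints the induced rules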
