ROC in rfe() in the caret package for R

I am using the caret package in R to train a radial basis SVM for classification; in addition, a linear SVM is used for variable selection. With metric="Accuracy" this works fine, but ultimately I am more interested in optimizing metric="ROC". While the ROC is calculated for all models that are fit, there seems to be a problem with aggregating the ROC values.
The following is some example code:
library(caret)
library(mlbench)
library(kernlab)  # provides ksvm() and vanilladot used in the ranking function below
set.seed(0)
data(Sonar)
x<-scale(Sonar[,1:60])
y<-as.factor(Sonar[,61])
# Custom summary function to use both
# defaultSummary() and twoClassSummary
# Also input and output of summary function are printed
svm.summary<-function(data, lev = NULL, model = NULL){
print(head(data,n=3))
a<-defaultSummary(data, lev, model)
b<-twoClassSummary(data, lev, model)
out<-c(a,b)
print(out)
out}
fitControl <- trainControl(
method = "cv",
number = 2,
classProbs = TRUE,
summaryFunction=svm.summary,
verbose=T,
allowParallel = FALSE)
# Ranking function: Rank Variables using a linear
# SVM
rankSVM<-function(object,x,y) {
  print("ranking")
  obj<-ksvm(x=as.matrix(x), y=y,
            kernel=vanilladot,
            kpar=list(), C=10,
            scaled=F)
  # weight vector of the linear SVM taken from its S4 slots
  w<-t(obj@coef[[1]]%*%obj@xmatrix[[1]])
  z<-abs(w)/sqrt(sum(w^2))
  ord<-order(z,decreasing=T)
  data.frame(var=dimnames(z)[[1]][ord],Overall=z[ord])
}
svmFuncs<-getModelInfo("svmRadial",regex=F)
svmFit<-function(x,y,first,last,...) {
out<-train(x=x,y=as.factor(y),
method="svmRadial",
trControl=fitControl,
scaled=F,
metric="Accuracy",
maximize=T,
returnData=T)
out$finalModel}
selectionFunctions<-list(summary=svm.summary,
fit=svmFit,
pred=svmFuncs$svmRadial$predict,
prob=svmFuncs$svmRadial$prob,
rank=rankSVM,
selectSize=pickSizeBest,
selectVar=pickVars)
selectionControl<-rfeControl(functions=selectionFunctions,
rerank=F,
verbose=T,
method="cv",
number=2)
subsets<-c(1,30,60)
svmProfile<-rfe(x=x,y=y,
sizes=subsets,
metric="Accuracy",
maximize=TRUE,
rfeControl=selectionControl)
svmProfile
The final output is the following:
> svmProfile
Recursive feature selection
Outer resampling method: Cross-Validated (2 fold)
Resampling performance over subset size:
Variables Accuracy Kappa ROC Sens Spec AccuracySD KappaSD ROCSD SensSD SpecSD Selected
1 0.8075 0.6122 NaN 0.8292 0.7825 0.02981 0.06505 NA 0.06153 0.1344 *
30 0.8028 0.6033 NaN 0.8205 0.7825 0.00948 0.02533 NA 0.09964 0.1344
60 0.8028 0.6032 NaN 0.8206 0.7823 0.00948 0.02679 NA 0.12512 0.1635
The top 1 variables (out of 1):
V49
ROC is NaN. Inspecting the output (with verbose=T and the summary function patched to print both its output and part of its input) reveals that, while the ROC seems to be calculated correctly when tuning the SVMs in the inner loop:
+ Fold1: sigma=0.01172, C=0.25
pred obs M R
1 M R 0.6658878 0.3341122
2 M R 0.5679477 0.4320523
3 R R 0.2263576 0.7736424
Accuracy Kappa ROC Sens Spec
0.6730769 0.3480826 0.7961310 0.6428571 0.7083333
- Fold1: sigma=0.01172, C=0.25
+ Fold1: sigma=0.01172, C=0.50
pred obs M R
1 M R 0.7841249 0.2158751
2 M R 0.7231365 0.2768635
3 R R 0.3033492 0.6966508
Accuracy Kappa ROC Sens Spec
0.7692308 0.5214724 0.8407738 0.9642857 0.5416667
- Fold1: sigma=0.01172, C=0.50
[...]
there seems to be a problem in the outer iteration. "Between" two folds we get the following:
-(rfe) fit Fold1 size: 1
pred obs Variables
1 M R 1
2 M R 1
3 M R 1
Accuracy Kappa ROC Sens Spec
0.7864078 0.5662328 NA 0.8727273 0.6875000
pred obs Variables
1 R R 30
2 M R 30
3 M R 30
Accuracy Kappa ROC Sens Spec
0.7961165 0.5853939 NA 0.8909091 0.6875000
pred obs Variables
1 R R 60
2 M R 60
3 M R 60
Accuracy Kappa ROC Sens Spec
0.7961165 0.5842783 NA 0.9090909 0.6666667
+(rfe) fit Fold2 size: 60
So here it seems the input for the summary function does not contain the class probabilities but the number of variables instead, so the ROCs cannot be calculated or aggregated correctly. Does anybody know how to prevent this? Did I forget to tell caret to output class probabilities somewhere?
Help is greatly appreciated, as caret is really a cool package to use and would save me plenty of work if I can get this to run correctly.
Thoralf

getModelInfo is designed to retrieve code for train and doesn't automatically work with rfe (I'll make a note of that in the documentation). rfe doesn't look for a slot called prob, and without probability predictions there is no ROC summary.
You might want to base your code on caretFuncs, which is designed to work with rfe and should automate a lot of what I think you would like to do.
For example, in caretFuncs, the pred module will create both class and probability predictions:
function(object, x) {
tmp <- predict(object, x)
if(object$modelType == "Classification" &
!is.null(object$modelInfo$prob)) {
out <- cbind(data.frame(pred = tmp),
as.data.frame(predict(object, x, type = "prob")))
} else out <- tmp
out
}
You might be able to simply plug your rankSVM into caretFuncs$rank.
Take a look at the feature selection page on the caret website; it has details about the code modules you will need.
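Putting the pieces together, a minimal, untested sketch of that approach could look like the code below. It reuses the rankSVM defined in the question, lets caretFuncs supply the pred and prob machinery, and optimizes ROC in the outer loop; the fold counts and tuneLength are placeholders.
library(caret)
library(kernlab)
# Start from caretFuncs, whose pred module already returns class probabilities,
# and swap in the custom ranking function from the question.
mySVMFuncs <- caretFuncs
mySVMFuncs$summary <- twoClassSummary   # outer loop reports ROC, Sens, Spec
mySVMFuncs$rank    <- rankSVM           # linear-SVM ranking defined above
selectionControl <- rfeControl(functions = mySVMFuncs,
                               method = "cv",
                               number = 2,
                               verbose = TRUE)
# classProbs = TRUE is the crucial part: it makes the inner svmRadial model able
# to produce probabilities, which caretFuncs$pred then passes to the outer loop.
innerControl <- trainControl(method = "cv",
                             number = 2,
                             classProbs = TRUE)
svmProfile <- rfe(x = x, y = y,
                  sizes = c(1, 30, 60),
                  metric = "ROC",
                  rfeControl = selectionControl,
                  # remaining arguments are passed through to the inner train() call
                  method = "svmRadial",
                  trControl = innerControl,
                  tuneLength = 3)
svmProfile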

Related

How to use LOOCV to find a subset that classifies better than full set in R

I am working with the wbca data from the faraway package. The prior probability of sampling a malignant tumor is π0 = 1/3 and the prior probability for sampling a benign tumor is π1 = 2/3.
I am trying to use the naive Bayes classifier with multinomials to see if there is a good subset of the 9 features that classifies better than the full set using LOOCV.
I am unsure where to even begin with this, so any R code help would be great. Thanks!
You can try something like the code below; the kernel estimate of your predictors might not be the most accurate, but it's something you can start with:
library(faraway)
library(naivebayes)
library(caret)
x = wbca[,!grepl("Class",colnames(wbca))]
y = factor(wbca$Class)
ctrl <- rfeControl(functions = nbFuncs,
                   method = "LOOCV")
subsets <- 2:9  # candidate subset sizes; not defined in the original post, inferred from the output below
bayesProfile <- rfe(x, y,
                    sizes = subsets,
                    rfeControl = ctrl)
bayesProfile
Recursive feature selection
Outer resampling method: Leave-One-Out Cross-Validation
Resampling performance over subset size:
Variables Accuracy Kappa Selected
2 0.9501 0.8891
3 0.9648 0.9225
4 0.9648 0.9223
5 0.9677 0.9290
6 0.9750 0.9454 *
7 0.9692 0.9322
8 0.9750 0.9455
9 0.9662 0.9255
The top 5 variables (out of 6):
USize, UShap, BNucl, Chrom, Epith
You can get the optimal variables:
bayesProfile$optVariables
[1] "USize" "UShap" "BNucl" "Chrom" "Epith" "Thick"
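The question also specifies prior probabilities (π0 = 1/3 for a malignant tumor, π1 = 2/3 for a benign one). As a hedged extra step, not part of the answer above, you could refit a naive Bayes model on the selected variables with those priors; this assumes Class = 0 codes malignant in wbca:
library(naivebayes)
selected <- bayesProfile$optVariables
final_nb <- naive_bayes(x = wbca[, selected, drop = FALSE],
                        y = factor(wbca$Class),
                        prior = c(1/3, 2/3),  # P(malignant) = 1/3, P(benign) = 2/3
                        usekernel = TRUE)     # kernel density estimates for the numeric predictors
head(predict(final_nb, wbca[, selected, drop = FALSE]))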

R Caret Random Forest AUC too good to be true?

Relative newbie to predictive modeling--most of my training/experience is in inferential stats. I'm trying to predict student college graduation in 4 years.
The basic issue is that I've done data cleaning (imputing, centering, scaling); split the processed/transformed data into training (70%) and testing (30%) sets; balanced the data using two approaches, ROSE "both" and SMOTE (the data was 65% = 0, 35% = 1, and I've found inconsistent advice on what counts as unbalanced, but one source suggested anything outside a 40/60 range); and ran random forests.
For the ROSE "BOTH" models I got 0.9242 accuracy on the training set and AUC of 0.9268 for the test set.
For the SMOTE model I got 0.9943 accuracy on the training set and AUC of 0.9971 on the test set.
More details on model performance are embedded in the code copied below.
This just seems too good to be true. But from what I've been able to find, slightly better performance on the test set would not indicate overfitting (it would be the other way around). So, is this model's performance really that good, or is it too good to be true? I have not been able to find a direct answer to this question via SO searches.
Also, in a few weeks I'll have another cohort of data I can run this on. I suppose that could be another "test" set, correct? Then I can apply this to the newest cohort for which we are interested in knowing likelihood to graduate in 4 years.
Many thanks,
Brian
#Used for predictive modeling of 4-year graduation
#IMPORT DATA
library(haven)
grad4yr <- [file path]
#DETERMINE DATA BALANCE/UNBALANCE
prop.table(table(grad4yr$graduate_4_yrs))
# 0=0.6492, 1=0.3517
#convert to factor so next step doesn't impute outcome variable
grad4yr$graduate_4_yrs <- as.factor(grad4yr$graduate_4_yrs)
#Preprocess data; caret provides preProcess(), RANN package used for KNN imputation
library(caret)
library('RANN')
#Create preprocessed values object which includes centering, scaling, and imputing missing values using KNN
Processed_Values <- preProcess(grad4yr, method = c("knnImpute","center","scale"))
#Create new dataset with imputed values and centering/scaling
#Confirmed this results in 0 cases with missing values
grad4yr_data_processed <- predict(Processed_Values, grad4yr)
#Confirm last step results in 0 cases with missing values
sum(is.na(grad4yr_data_processed))
#[1] 0
#Convert outcome variable to a factor to ensure the dummify step (next) doesn't dummify the outcome variable.
grad4yr_data_processed$graduate_4_yrs <- as.factor(grad4yr_data_processed$graduate_4_yrs)
#Convert all factor variables to dummy variables; fullrank used to omit one of new dummy vars in each
#set.
dmy <- dummyVars("~ .", data = grad4yr_data_processed, fullRank = TRUE)
#Create new dataset that has the data imputed AND transformed to have dummy variables for all variables that
#will go in models.
grad4yr_processed_transformed <- data.frame(predict(dmy,newdata = grad4yr_data_processed))
#Convert outcome variable back to binary/factor for predictive models and re-create a variable with the same name
#not entirely sure why the last step created a new version of the outcome var with ".1" at the end
grad4yr_processed_transformed$graduate_4_yrs.1 <- as.factor(grad4yr_processed_transformed$graduate_4_yrs.1)
grad4yr_processed_transformed$graduate_4_yrs <- as.factor(grad4yr_processed_transformed$graduate_4_yrs)
grad4yr_processed_transformed$graduate_4_yrs.1 <- NULL
#Split data into training and testing/validation datasets based on outcome at 70%/30%
index <- createDataPartition(grad4yr_processed_transformed$graduate_4_yrs, p=0.70, list=FALSE)
trainSet <- grad4yr_processed_transformed[index,]
testSet <- grad4yr_processed_transformed[-index,]
#load caret
library(caret)
#Feature selection using rfe in R Caret, used with profile/comparison
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 10,#using k=10 per Kuhn & Johnson pp70; and per James et al pp
#https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
verbose = FALSE)
#create traincontrol using repeated cross-validation with 10 fold 5 times
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
search = "random")
#Set the outcome variable object
grad4yrs <- 'graduate_4_yrs'
#set predictor variables object
predictors <- names(trainSet[!names(trainSet) %in% grad4yrs])
#create predictor profile to see where prediction is best (by number of vars)
grad4yr_pred_profile <- rfe(trainSet[,predictors],trainSet[,grad4yrs],rfeControl = control)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#
# Resampling performance over subset size:
#
# Variables Accuracy Kappa AccuracySD KappaSD Selected
# 4 0.6877 0.2875 0.03605 0.08618
# 8 0.7057 0.3078 0.03461 0.08465 *
# 16 0.7006 0.2993 0.03286 0.08036
# 40 0.6949 0.2710 0.03330 0.08157
#
# The top 5 variables (out of 8):
# Transfer_Credits, HS_RANK, Admit_Term_Credits_Taken, first_enroll, Admit_ReasonUT10
#see data structure
str(trainSet)
#not copying output here, but confirms outcome var is factor and everything else is numeric
#given 65/35 split on outcome var and what can find about unbalanced data, considering unbalanced and doing steps to balance.
#using ROSE "BOTH and SMOTE to see how differently they perform. Also ran under/over with ROSE but they didn't perform nearly as
#well so removed from this script.
#SMOTE to balance data on the processed/dummified dataset
library(DMwR)#https://www3.nd.edu/~dial/publications/chawla2005data.pdf for justification
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, perc.over=600, perc.under=100)
#see how balanced SMOTE resulting dataset is
prop.table(table(train.SMOTE$graduate_4_yrs))
#0 1
#0.4615385 0.5384615
#open ROSE package/library
library("ROSE")
#ROSE to balance data (using BOTH) on the processed/dummified dataset
train.both <- ovun.sample(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, method = "both", p=.5,
N = 2346)$data
#see how balanced BOTH resulting dataset is
prop.table(table(train.both$graduate_4_yrs))
#0 1
#0.4987212 0.5012788
#ROSE to balance data (using BOTH) on the processed/dummified dataset
table(grad4yr_processed_transformed$graduate_4_yrs)
#0 1
#1144 618
library("caret")
#create random forests using balanced data from above
RF_model_both <- train(train.both[,predictors],train.both[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "BOTH" training model
# print(RF_model_both)
# Random Forest
#
# 2346 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 2112, 2111, 2111, 2112, 2111, 2112, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 8 0.9055406 0.8110631
# 11 0.9053719 0.8107246
# 12 0.9057981 0.8115770
# 13 0.9054584 0.8108965
# 14 0.9048602 0.8097018
# 20 0.9034992 0.8069796
# 26 0.9027307 0.8054427
# 30 0.9034152 0.8068113
# 38 0.9023899 0.8047622
# 40 0.9032428 0.8064672
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 12.
RF_model_SMOTE <- train(train.SMOTE[,predictors],train.SMOTE[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "SMOTE" training model
# print(RF_model_SMOTE)
# Random Forest
#
# 8034 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 7231, 7231, 7230, 7230, 7231, 7231, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 17 0.9449082 0.8899939
# 19 0.9458047 0.8917740
# 21 0.9458543 0.8918695
# 29 0.9470243 0.8941794
# 31 0.9468750 0.8938864
# 35 0.9468003 0.8937290
# 36 0.9463772 0.8928876
# 40 0.9463275 0.8927828
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 29.
#Given that both accuracy and kappa appear better in the "SMOTE" random forest it's looking like it's the better model.
#But, running ROC/AUC on both to see how they both perform on validation data.
#Create predictions based on random forests above
rf_both_predictions <- predict.train(object=RF_model_both,testSet[, predictors], type ="raw")
rf_SMOTE_predictions <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="raw")
#Create predicted class probabilities based on random forests above
rf_both_pred_prob <- predict.train(object=RF_model_both,testSet[, predictors], type ="prob")
rf_SMOTE_pred_prob <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="prob")
#create Random Forest confusion matrix to evaluate random forests
confusionMatrix(rf_both_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 315 12
# 1 28 173
#
# Accuracy : 0.9242
# 95% CI : (0.8983, 0.9453)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : < 2e-16
#
# Kappa : 0.8368
# Mcnemar's Test P-Value : 0.01771
#
# Sensitivity : 0.9351
# Specificity : 0.9184
# Pos Pred Value : 0.8607
# Neg Pred Value : 0.9633
# Prevalence : 0.3504
# Detection Rate : 0.3277
# Detection Prevalence : 0.3807
# Balanced Accuracy : 0.9268
#
# 'Positive' Class : 1
# confusionMatrix(rf_under_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
#Accuracy : 0.8258
#only copied accuracy as it was well below the other two versions
confusionMatrix(rf_SMOTE_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 340 0
# 1 3 185
#
# Accuracy : 0.9943
# 95% CI : (0.9835, 0.9988)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : <2e-16
#
# Kappa : 0.9876
# Mcnemar's Test P-Value : 0.2482
#
# Sensitivity : 1.0000
# Specificity : 0.9913
# Pos Pred Value : 0.9840
# Neg Pred Value : 1.0000
# Prevalence : 0.3504
# Detection Rate : 0.3504
# Detection Prevalence : 0.3561
# Balanced Accuracy : 0.9956
#
# 'Positive' Class : 1
#put predictions and probabilities in dataset
testSet$rf_both_pred <- rf_both_predictions# class predictions (BOTH)
testSet$rf_SMOTE_pred <- rf_SMOTE_predictions# class predictions (SMOTE)
testSet$rf_both_prob <- rf_both_pred_prob# class probabilities (BOTH)
testSet$rf_SMOTE_prob <- rf_SMOTE_pred_prob# class probabilities (SMOTE)
library(pROC)
#get AUC of the BOTH predictions
testSet$rf_both_pred <- as.numeric(testSet$rf_both_pred)
Both_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_both_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(Both_ROC_Curve)
# Area under the curve: 0.9268
#get AUC of the SMOTE predictions
testSet$rf_SMOTE_pred <- as.numeric(testSet$rf_SMOTE_pred)
SMOTE_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_SMOTE_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(SMOTE_ROC_Curve)
#Area under the curve: 0.9971
#So, the SMOTE balanced data performed very well on training data and near perfect on the validation/test data.
#But, it seems almost too good to be true.
#Is there anything I might have missed or performed incorrectly?
I'll post my comment as an answer, even though this question might be migrated.
I really think you're overfitting, because you balanced the whole dataset.
Instead, you should balance only the training set.
Here is your code:
library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed,
perc.over=600, perc.under=100)
By doing so, your train.SMOTE contains information from the test set too, so when you evaluate on testSet the model will have already seen part of the data; this is likely the cause of your "too good" results.
It should be:
library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=trainSet, # use only the train set
perc.over=600, perc.under=100)
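A hedged sketch of the corrected workflow, untested and reusing the object names from the question: balance only the training split, retrain, and evaluate on the untouched test split, using predicted probabilities rather than numeric class labels for the ROC curve.
library(DMwR)
library(caret)
library(pROC)
# balance the training split only; the test split stays untouched
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data = trainSet,
                     perc.over = 600, perc.under = 100)
RF_model_SMOTE <- train(train.SMOTE[, predictors], train.SMOTE[, grad4yrs],
                        method = 'rf', trControl = fitControl,
                        ntree = 1000, tuneLength = 10)
# class probabilities on the test split; column 2 is assumed to be class "1"
smote_prob <- predict(RF_model_SMOTE, testSet[, predictors], type = "prob")[, 2]
SMOTE_ROC <- roc(response = testSet$graduate_4_yrs,
                 predictor = smote_prob,
                 levels = rev(levels(testSet$graduate_4_yrs)))
auc(SMOTE_ROC)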

Model evaluation in R with confusion matrix

Hi, I have used the ROCR package to check the performance of a model, and I would like to do more evaluation, like a confusion matrix with kappa values or k-fold cross-validation.
Below are the model and the predictions; any help would be great.
library(glmnet)  # cv.glmnet()
library(ROCR)    # prediction(), performance()
model <- cv.glmnet(sparesemx[train.set,],
                   first.round[train.set],
                   alpha = 0.05,
                   family = 'binomial')
training$sparse.fr.hat <- predict(model, newx = sparesemx, type =
'response')[,1]
predictions <- prediction(training$sparse.fr.hat[test.set],
first.round[test.set])
perform <- performance(predictions, 'tpr', 'fpr')
plot(perform)
performance(predictions, 'auc')
I am trying to use the caret library with the confusionMatrix() function, but I am unable to generate the matrix. I have tried several inputs for the two arguments, but I am not sure what is needed.
Worked example, step by step in explicit detail.
library(OptimalCutpoints)
library(caret)
library(glmnet)
library(e1071)
data(elas) #predicting for variable "status"
Split the elas data into training (dev) and testing (val)
sample.ind <- sample(2,
nrow(elas),
replace = T,
prob = c(0.6,0.4))
elas.dev <- elas[sample.ind==1,]
elas.val <- elas[sample.ind==2,]
This example uses a logistic model, so this is how the formula is specified; the resulting model matrix is similar to your sparesemx matrix.
formula.glm<-glm(status ~ gender + elas, data = elas, family = binomial)
xfactors<-model.matrix(formula.glm)[,-1]
glmnet.x<-as.matrix(xfactors)
glmmod<-glmnet(x=glmnet.x[sample.ind==1,],y=elas.dev$status,alpha=1,
family='binomial')
#if you care; the lasso model includes both predictors
#cv.glmmod <- cv.glmnet(x=glmnet.x[sample.ind==1,], y=elas.dev$status, alpha=1, family='binomial')
#plot(cv.glmmod)
#cv.glmmod$lambda.min
#coef(cv.glmmod, s="lambda.min")
Now you have to get the predicted values for the status variable using the two selected predictors from glmnet, which you did.
bestglm<-glm(status ~ gender + elas, data = elas.dev, family = binomial)
You got about as far as this point. I'm using the fitted.values from my object and you're using prediction, but either way you should end up with a column of actual values and a column of fitted values. This doesn't tell you where the cutpoint is: where do you draw the line between what is "positive" and what is "negative"?
I suggest using OptimalCutpoints for this.
Set this up for optimal.cutpoints; the container thing that comes next is just a data.frame where both variables have the same length. It contains actual versus predicted from the glm.
container.for.OC<-data.frame(fit=bestglm$fitted.values, truth=elas.dev$status)
I am using the Youden criteria here but there are many choices for the criteria.
optimal.cutpoint.Youden<-optimal.cutpoints(X = fit ~ truth , tag.healthy = 0,
methods = "Youden", pop.prev = NULL, data=container.for.OC,
control = control.cutpoints(), ci.fit = FALSE, conf.level = 0.95, trace = FALSE)
summary(optimal.cutpoint.Youden)
Here is what I got:
Area under the ROC curve (AUC): 0.818 (0.731, 0.905)
CRITERION: Youden
Number of optimal cutoffs: 1
Estimate
cutoff 0.4863188
Se 0.9180328
Sp 0.5882353
PPV 0.8000000
NPV 0.8000000
DLR.Positive 2.2295082
DLR.Negative 0.1393443
FP 14.0000000
FN 5.0000000
Optimal criterion 0.5062681
#not run
#plot(optimal.cutpoint.Youden)
Now apply what you've learned from the Youden cutoff to your validation set, elas.val.
This should match the cutoff from the table above.
MaxYoudenCutoff <- optimal.cutpoint.Youden$Youden$Global$optimal.cutoff$cutoff
This will give you the predicted levels from the Youden cutpoint. They have to be a factor object for your confusionMatrix function.
val.predicted<-predict(object=bestglm, newdata=elas.val, type="response")
val.factor.level<-factor(ifelse(val.predicted >=MaxYoudenCutoff,"1","0"))
Like before, make a small container for the confusionMatrix function.
container.for.CM <- data.frame(truth=factor(elas.val$status), fit=val.factor.level)
confusionMatrix(data=container.for.CM$fit, reference=container.for.CM$truth)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 7 8
1 6 37
Accuracy : 0.7586
95% CI : (0.6283, 0.8613)
No Information Rate : 0.7759
P-Value [Acc > NIR] : 0.6895
Kappa : 0.342
Mcnemar's Test P-Value : 0.7893
Sensitivity : 0.5385
Specificity : 0.8222
Pos Pred Value : 0.4667
Neg Pred Value : 0.8605
Prevalence : 0.2241
Detection Rate : 0.1207
Detection Prevalence : 0.2586
Balanced Accuracy : 0.6803
'Positive' Class : 0
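To tie this back to the glmnet model in the question, here is a hedged sketch of the same recipe; sparesemx, first.round, train.set, and test.set are the (unshown) objects from the question and are assumed to exist, first.round is assumed to be coded 0/1, and the 0.5 cutoff is only a placeholder for whichever cutpoint you settle on.
library(glmnet)
library(caret)
prob.hat <- predict(model, newx = sparesemx[test.set, ], type = "response")[, 1]
truth <- factor(first.round[test.set])               # assumed 0/1 coding
pred.class <- factor(ifelse(prob.hat >= 0.5, 1, 0),  # placeholder cutoff
                     levels = levels(truth))
confusionMatrix(data = pred.class, reference = truth, positive = "1")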

How do I evaluate a multinomial classification model in R?

I am currently trying to build a multi-class prediction model to predict a letter out of the 26 letters of the English alphabet. I have built a few models using ANN, SVM, an ensemble, and naive Bayes (nB), but I am stuck at evaluating the accuracy of these models. Although the confusion matrix shows me the letter-wise true and false predictions, I am only able to get an overall accuracy for each model. Is there a way to evaluate a model's accuracy similar to the ROC and AUC values for binomial classification?
Note: I am currently running the models using the H2O package, as it saves me time.
Once you train a model in H2O, if you simply do: print(fit) it will show you all the available metrics for that model type. For multiclass, I'd recommend h2o.mean_per_class_error().
R code example on the iris dataset:
library(h2o)
h2o.init(nthreads = -1)
data(iris)
fit <- h2o.naiveBayes(x = 1:4,
y = 5,
training_frame = as.h2o(iris),
nfolds = 5)
Once you have the model, you can evaluate its performance using the h2o.performance() function to view all the metrics:
> h2o.performance(fit, xval = TRUE)
H2OMultinomialMetrics: naivebayes
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
Cross-Validation Set Metrics:
=====================
Extract cross-validation frame with `h2o.getFrame("iris")`
MSE: (Extract with `h2o.mse`) 0.03582724
RMSE: (Extract with `h2o.rmse`) 0.1892808
Logloss: (Extract with `h2o.logloss`) 0.1321609
Mean Per-Class Error: 0.04666667
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,xval = TRUE)`
=======================================================================
Top-3 Hit Ratios:
k hit_ratio
1 1 0.953333
2 2 1.000000
3 3 1.000000
Or you can look at a particular metric, like mean_per_class_error:
> h2o.mean_per_class_error(fit, xval = TRUE)
[1] 0.04666667
If you want to view performance on a test set, then you can do the following:
perf <- h2o.performance(fit, test)  # `test` is an H2OFrame holding your held-out data
h2o.mean_per_class_error(perf)
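If you also want the multi-class confusion matrix itself (a small, hedged addition, not part of the answer above), it is available from the same performance object:
h2o.confusionMatrix(perf)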

Create a binary outcome with random forest

I have a dataset that looks like this:
TEAM1 TEAM2 EXPG1 EXPG2 Gewonnen
ADO Den Haag Groningen 1.5950 1.2672 1
I now try to predict the column Gewonnen based on EXPG1 and EXPG2. Therefore I created a training and a test set and built the following model (all using R caret):
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
I can't make a confusion matrix now because my predictions have more levels than the reference. That becomes apparent when I do:
pred <- predict(modFit, testing)
head(pred)
It returns: 0.5324000 0.7237333 0.2811333 0.8231000 0.8299333 0.9792000
To make a confusion matrix I need to turn these into 0/1 values, and I have the feeling there should be an option to do this in the model as well.
Any thoughts on what I should change in this model to get 0/1 values? I couldn't find it in the documentation:
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
First of all, as Tim Biegeleisen says, you should convert your Gewonnen variable to a factor (in both training & test sets), if it is not already:
training$Gewonnen <- as.factor(training$Gewonnen)
testing$Gewonnen <- as.factor(testing$Gewonnen)
After that, the type option in the caret function predict determines what type of response you get for a binary classification problem, i.e. class labels or probabilities. Here is a reproducible example from the caret documentation using the Sonar dataset from the package mlbench:
library(caret)
library(mlbench)
data(Sonar)
str(Sonar$Class)
# Factor w/ 2 levels "M","R": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
modFit <- train(Class ~ ., data=training, method="rf", prox=TRUE)
pred <- predict(modFit, testing, type="prob") # for class probabilities
head(pred)
# M R
# 5 0.442 0.558
# 10 0.276 0.724
# 11 0.096 0.904
# 12 0.360 0.640
# 20 0.654 0.346
# 21 0.522 0.478
pred2 <- predict(modFit, testing, type="raw") # for class labels
head(pred2)
# [1] R R R R M M
# Levels: M R
For the confusion matrix, you will need class labels (i.e. pred2 above):
confusionMatrix(pred2, testing$Class)
# Confusion Matrix and Statistics
# Reference
# Prediction M R
# M 25 6
# R 2 18
This answer is a bit speculative as you omitted some critical details about your data set and I have not worked extensively with the caret package. That being said, it appears that you are running random forests in regression mode, which means that you will end up with a continuous function. This means that predictions can have a response value of 0, 1, or anything in between 0 and 1. If your Gewonnen column only has values of 0 or 1, and you want predicted values to also behave this way, then you can try turning Gewonnen into a categorical variable. As this article discusses, this might tell random forests to run in classification mode instead of regression mode.
training$Gewonnen <- as.factor(training$Gewonnen)
Then build the random forest as you did before, and you should get the responses you want.
