I am trying to use logistic regression with caret. The data frame is "Default" from the "ISLR2" package.
I am getting low specificity (27%) due to the default probability threshold of 0.5.
What is the way to change this default probability threshold, say to 0.2 or 0.7?
The code used is as below:
set.seed(7702)
# test & train partition
index <- sample(1:nrow(Default), 0.80*nrow(Default))
train_default <- Default[index, ]
test_default <- Default[-index, ]
# Creating controling parameters
controlValues <- trainControl(method = "cv",
number = 10,
savePredictions = "all",
classProbs = TRUE)
# building the model
model_default <- train(default ~ income + balance,
data = train_default,
method = "glm",
family = binomial,
trControl = controlValues)
# Model prediction & confusion Matrix
model_pred <- predict(model_default,
newdata = test_default)
confusionMatrix(model_pred, test_default$default)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1937 41
Yes 7 15
Accuracy : 0.976
95% CI : (0.9683, 0.9823)
No Information Rate : 0.972
P-Value [Acc > NIR] : 0.1543
Kappa : 0.3747
Mcnemar's Test P-Value : 1.906e-06
Sensitivity : 0.9964
Specificity : 0.2679
Inside the predict function you need to specify the type = "prob" parameter. This gives you all the class probabilities so you can choose the threshold of your preference.
model_pred <- predict(model_default, newdata = test_default, type = "prob")
Then, you can manually make a classification. For example:
model_pred_class <- ifelse(model_pred$Yes < 0.2, "No", "Yes") # model_pred$Yes is the predicted probability of default
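A hedged follow-up to check how the specificity changes at the new threshold (assuming, as in the ISLR2 Default data, that default is a factor with levels "No" and "Yes"):
# confusionMatrix() expects factors with matching levels
model_pred_class <- factor(model_pred_class, levels = levels(test_default$default))
confusionMatrix(model_pred_class, test_default$default)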
I have built a Random Forest model for predicting whether a customer is carrying out fraudulent operations or not. It is a large and quite unbalanced sample, with 3% cases of fraud, and I want to predict the minority class (fraud).
I balance the data (50% each) and build the RF. So far, I have a good model with an overall accuracy of ~80% and over 70% of fraud predicted correctly. But when I try the model on unseen data (test), although the overall accuracy is good, the negative predicted value (fraud) is really low compared to the training data (only 13% vs over 70%).
I have tried increasing the sample size, changing the balance of the categories, tuning RF parameters, and so on, but none of them have worked well, giving similar results. Am I overfitting somehow? What can I do to improve fraud detection (negative predicted value) on unseen data?
Here is the code and results:
set.seed(1234)
#train and test sets
model <- sample(nrow(dataset), 0.7 * nrow(dataset))
train <- dataset[model, ]
test <- dataset[-model, ]
#Balance the data
balanced <- ovun.sample(custom21_type ~ ., data = train, method = "over",p = 0.5, seed = 1)$data
table(balanced$custom21_type)
0 1
5813 5861
#build the RF
rf5 <- randomForest(custom21_type ~ ., ntree = 100, data = balanced, importance = TRUE, mtry = 3, keep.inbag = TRUE)
rf5
Call:
randomForest(formula = custom21_type ~ ., data = balanced, ntree = 100, importance = TRUE, mtry = 3, keep.inbag = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 3
OOB estimate of error rate: 21.47%
Confusion matrix:
0 1 class.error
0 4713 1100 0.1892310
1 1406 4455 0.2398908
#test on unseen data
predicted <- predict(rf5, newdata=test)
confusionMatrix(predicted,test$custom21_type)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 59722 559
1 13188 1938
Accuracy : 0.8177
95% CI : (0.8149, 0.8204)
No Information Rate : 0.9669
P-Value [Acc > NIR] : 1
Kappa : 0.1729
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8191
Specificity : 0.7761
Pos Pred Value : 0.9907
Neg Pred Value : 0.1281
Prevalence : 0.9669
Detection Rate : 0.7920
Detection Prevalence : 0.7994
Balanced Accuracy : 0.7976
'Positive' Class : 0
First, I notice that you are not using any cross-validation. Including this will add variation to the data used to train and will help reduce overfitting. Additionally, we are going to use C5.0 in place of randomForest because it is more robust and gives more penalties to type 1 errors.
One thing you may consider is not having a 50-50 balance in the training data, but making it more like 80-20, so that the minority class is not oversampled as heavily. I am fairly sure the 50-50 split is leading to overfitting and to your model's failure to classify novel examples as negative; a sketch of the re-balancing step follows below.
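A minimal sketch of that re-balancing, assuming you keep using ROSE's ovun.sample exactly as in your question, just with p = 0.2:
library(ROSE)
# oversample so the minority (fraud) class ends up at roughly 20% instead of 50%
balanced <- ovun.sample(custom21_type ~ ., data = train,
                        method = "over", p = 0.2, seed = 1)$data
table(balanced$custom21_type)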
RUN THIS AFTER YOU CREATE THE RE-BALANCED DATA (p=.2)
library(caret)
#set up your cross validation
Control <- trainControl(
  summaryFunction = twoClassSummary, #reports ROC/Sens/Spec instead of Accuracy
  classProbs = TRUE,                 #required for twoClassSummary
  verboseIter = TRUE,                #prints training progress
  savePredictions = TRUE,
  method = "repeatedcv",             #repeated cross validation, 10 folds, 3 times
  repeats = 3,
  number = 10,
  allowParallel = TRUE
)
Now, I read in the comments that all your variables are categorical. This is well suited to Naive Bayes algorithms. However, if you have any numerical data you will need to preprocess it (scale, normalize, and impute NAs) as is standard procedure; a hedged sketch of how to do this inside train is below. We are also going to implement a grid-searching process.
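A minimal sketch for that numeric case, assuming caret's preProcess argument handles the scaling and imputation (non-numeric columns are left untouched by these steps); model_nb_prep is just an illustrative name, and the call otherwise mirrors the Naive Bayes fit below:
model_nb_prep <- train(
  x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
  y = balanced$custom21_type,
  metric = "ROC",
  method = "nb",
  preProcess = c("center", "scale", "medianImpute"), #scaling and NA imputation inside train
  trControl = Control)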
IF YOUR DATA IS ALL CATEGORICAL
model_nb <- train(
  x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
  y = balanced$custom21_type,
  metric = "ROC",
  method = "nb",
  trControl = Control,
  tuneGrid = data.frame(fL = c(0, 0.5, 1.0),
                        usekernel = TRUE,
                        adjust = c(0, 0.5, 1.0)))
IF YOU WOULD LIKE A TREE-BASED (C5.0) APPROACH (make sure to preprocess if data is numeric)
model_C5 <- train(
  x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
  y = balanced$custom21_type,
  metric = "ROC",
  method = "C5.0",
  trControl = Control,
  tuneGrid = expand.grid(.model = "tree",
                         .trials = c(1, 5, 10),
                         .winnow = FALSE))
Now we predict
C5_predict <- predict(model_C5, test, type = "raw")
NB_predict <- predict(model_nb, test, type = "raw")
confusionMatrix(C5_predict, test$custom21_type)
confusionMatrix(NB_predict, test$custom21_type)
EDIT:
Try adjusting the cost matrix below. What this one does is penalize type two errors twice as heavily as type one errors.
cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2)
# the row/column names should correspond to the levels of your outcome
# (for your data that would be "0" and "1" rather than "bad"/"good")
rownames(cost_mat) <- colnames(cost_mat) <- c("bad", "good")
cost_mod <- C5.0(x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
                 y = balanced$custom21_type,
                 costs = cost_mat)
summary(cost_mod)
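A hedged sketch of how you might then use the cost-sensitive model on your test set (assuming test has the same predictors as balanced and custom21_type is a factor with levels "0" and "1"):
cost_predict <- predict(cost_mod, newdata = test) #class predictions from C5.0
confusionMatrix(cost_predict, test$custom21_type)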
EDIT 2:
predicted <- predict(rf5, newdata=test, type="prob")
will give you the actual probabilities for each prediction. The default cut-off is 0.5, i.e. every row whose probability of class 0 (the first column) is above 0.5 gets classified as 0 and everything below as 1. So you can adjust this cutoff to help with the unbalanced classes, e.g.:
predicted_class <- ifelse(predicted[, "0"] < 0.4, "1", "0")
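A hedged follow-up to evaluate that 0.4 cutoff (assuming test$custom21_type is a factor with levels "0" and "1"):
predicted_class <- factor(predicted_class, levels = levels(test$custom21_type))
confusionMatrix(predicted_class, test$custom21_type)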
Hi, I have used the ROCR package to check the performance of a model, and I would like to do more evaluation, like a confusion matrix with kappa values or k-fold cross-validation.
Below are the model and the predictions; any help would be great.
model <- cv.glmnet(sparesemx[train.set,],
first.round[train.set],
alpha = 0.05,
family = 'binomial')
training$sparse.fr.hat <- predict(model, newx = sparesemx, type = 'response')[,1]
predictions <- prediction(training$sparse.fr.hat[test.set],
first.round[test.set])
perform <- performance(predictions, 'tpr', 'fpr')
plot(perform)
performance(predictions, 'auc')
I am trying to use the caret library with the confusionMatrix() function, but I am unable to generate the matrix. I have tried several inputs for the two arguments but I am not sure what is needed.
Worked example, step by step in explicit detail.
library(OptimalCutpoints)
library(caret)
library(glmnet)
library(e1071)
data(elas) #predicting for variable "status"
Split the elas data into training (dev) and testing (val)
sample.ind <- sample(2,
nrow(elas),
replace = T,
prob = c(0.6,0.4))
elas.dev <- elas[sample.ind==1,]
elas.val <- elas[sample.ind==2,]
This example uses a logistic model so this is how the formula is specified, similar to your sparesemx matrix.
formula.glm<-glm(status ~ gender + elas, data = elas, family = binomial)
xfactors<-model.matrix(formula.glm)[,-1]
glmnet.x<-as.matrix(xfactors)
glmmod<-glmnet(x=glmnet.x[sample.ind==1,],y=elas.dev$status,alpha=1,
family='binomial')
#if you care; the lasso model includes both predictors
#cv.glmmod <- cv.glmnet(x=glmnet.x[sample.ind==1,], y=elas.dev$status, alpha=1, family='binomial')
#plot(cv.glmmod)
#cv.glmmod$lambda.min
#coef(cv.glmmod, s="lambda.min")
Now you have to get the predicted values for the status variable using the two predictors selected by glmnet, which you did.
bestglm<-glm(status ~ gender + elas, data = elas.dev, family = binomial)
You got about this far. I'm using the fitted.values from my object and you're using prediction(), but either way you should end up with a column of actual values and a column of fitted values. That alone doesn't tell you where the cutpoint is: where do you draw the line between what is "positive" and what is "negative"?
I suggest using OptimalCutpoints for this.
Set this up for optimal.cutpoints; the container that comes next is just a data.frame where both variables have the same length. It contains the actual versus the predicted values from the glm.
container.for.OC<-data.frame(fit=bestglm$fitted.values, truth=elas.dev$status)
I am using the Youden criterion here, but there are many other criteria to choose from.
optimal.cutpoint.Youden<-optimal.cutpoints(X = fit ~ truth , tag.healthy = 0,
methods = "Youden", pop.prev = NULL, data=container.for.OC,
control = control.cutpoints(), ci.fit = FALSE, conf.level = 0.95, trace = FALSE)
summary(optimal.cutpoint.Youden)
Here is what I got:
Area under the ROC curve (AUC): 0.818 (0.731, 0.905)
CRITERION: Youden
Number of optimal cutoffs: 1
Estimate
cutoff 0.4863188
Se 0.9180328
Sp 0.5882353
PPV 0.8000000
NPV 0.8000000
DLR.Positive 2.2295082
DLR.Negative 0.1393443
FP 14.0000000
FN 5.0000000
Optimal criterion 0.5062681
#not run
#plot(optimal.cutpoint.Youden)
Now apply what you've learned from the Youden cutoff to your validation set, elas.val.
This should match the cutoff from the table above.
MaxYoudenCutoff <- optimal.cutpoint.Youden$Youden$Global$optimal.cutoff$cutoff
This will give you the predicted levels from the Youden cutpoint. They have to be a factor object for your confusionMatrix function.
val.predicted<-predict(object=bestglm, newdata=elas.val, type="response")
val.factor.level<-factor(ifelse(val.predicted >=MaxYoudenCutoff,"1","0"))
Like before, make a small container for the confusionMatrix function.
container.for.CM <- data.frame(truth=factor(elas.val$status), fit=val.factor.level)
confusionMatrix(data=container.for.CM$fit, reference=container.for.CM$truth)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 7 8
1 6 37
Accuracy : 0.7586
95% CI : (0.6283, 0.8613)
No Information Rate : 0.7759
P-Value [Acc > NIR] : 0.6895
Kappa : 0.342
Mcnemar's Test P-Value : 0.7893
Sensitivity : 0.5385
Specificity : 0.8222
Pos Pred Value : 0.4667
Neg Pred Value : 0.8605
Prevalence : 0.2241
Detection Rate : 0.1207
Detection Prevalence : 0.2586
Balanced Accuracy : 0.6803
'Positive' Class : 0
I'm attempting to run a 5-fold XGBoost model on this dataset. When I run the following code:
train_control<- trainControl(method="cv",
search = "random",
number=5,
verboseIter=TRUE)
# Train Models
xgb.mod<- train(Vote_perc~.,
data=forkfold,
trControl=train_control,
method="xgbTree",
family=binomial())
I receive a warning of:
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Furthermore, the "predict" function runs, but all predictions were the same number. I suspect it's an intercept-only model, but I'm not sure. Also when I remove the
search="random"
argument, it runs properly. I want to run random searches so that I can isolate what hyperparameters might be most effective, but everytime I try, I get that warning. What am I missing? Thank you!
Here is one approach you could perform with your data:
load data:
forkfold <- read.csv("forkfold.csv", row.names = 1)
The problem here is that the outcome variable is 0 in 97% of the cases, while in the remaining 3% it is very close to zero.
length(forkfold$Vote_perc)
#output
7069
sum(forkfold$Vote_perc != 0)
#output
212
You described it as a classification problem, and I will treat it as such by converting it to a binary problem:
forkfold$Vote_perc <- ifelse(forkfold$Vote_perc != 0,
"one",
"zero")
Since the set is highly imbalanced, using Accuracy as the selection metric is out of the question. Here I will try to maximize Sensitivity + Specificity, as described here, by defining a custom evaluation function:
fourStats <- function(data, lev = levels(data$obs), model = NULL) {
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
  # distance of the current (Spec, Sens) point from the ideal (1, 1) point
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]),
                   ncol = 2,
                   byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}
I will specify this function in trainControl:
train_control <- trainControl(method = "cv",
search = "random",
number = 5,
verboseIter=TRUE,
classProbs = T,
savePredictions = "final",
summaryFunction = fourStats)
set.seed(1)
xgb.mod <- train(Vote_perc~.,
data = forkfold,
trControl = train_control,
method = "xgbTree",
tuneLength = 50,
metric = "Dist",
maximize = FALSE,
scale_pos_weight = sum(forkfold$Vote_perc == "zero")/sum(forkfold$Vote_perc == "one"))
I will use the Dist metric defined above in the fourStats summary function. This metric should be minimized, so maximize = FALSE. I will use a random search over the tuning space, and 50 random sets of hyperparameter values will be tested (tuneLength = 50).
I also set the scale_pos_weight parameter of the xgboost function. From the help page ?xgboost:
scale_pos_weight, [default=1] Control the balance of positive and
negative weights, useful for unbalanced classes. A typical value to
consider: sum(negative cases) / sum(positive cases) See Parameters
Tuning for more discussion. Also see Higgs Kaggle competition demo for
examples: R, py1, py2, py3
I defined it as recommended: sum(negative cases) / sum(positive cases).
After the model trains, it will pick the hyperparameter values that minimize Dist.
To evaluate the confusion matrix on the hold out predictions:
caret::confusionMatrix(xgb.mod$pred$pred, xgb.mod$pred$obs)
Confusion Matrix and Statistics
Reference
Prediction one zero
one 195 430
zero 17 6427
Accuracy : 0.9368
95% CI : (0.9308, 0.9423)
No Information Rate : 0.97
P-Value [Acc > NIR] : 1
Kappa : 0.4409
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.91981
Specificity : 0.93729
Pos Pred Value : 0.31200
Neg Pred Value : 0.99736
Prevalence : 0.02999
Detection Rate : 0.02759
Detection Prevalence : 0.08841
Balanced Accuracy : 0.92855
'Positive' Class : one
I'd say it's not that bad.
You can do better if you tune the cutoff threshold of the predictions; how to do this during the tuning process is described here. You can also use the out-of-fold predictions for tuning the cutoff threshold. Here I will show how to use the pROC library for it:
library(pROC)
plot(roc(xgb.mod$pred$obs, xgb.mod$pred$one),
print.thres = TRUE)
The threshold shown on the plot maximizes Sens + Spec.
To evaluate the out-of-fold performance using this threshold:
caret::confusionMatrix(factor(ifelse(xgb.mod$pred$one > 0.369, "one", "zero"),
                              levels = levels(xgb.mod$pred$obs)),
                       xgb.mod$pred$obs)
#output
Confusion Matrix and Statistics
Reference
Prediction one zero
one 200 596
zero 12 6261
Accuracy : 0.914
95% CI : (0.9072, 0.9204)
No Information Rate : 0.97
P-Value [Acc > NIR] : 1
Kappa : 0.3668
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.94340
Specificity : 0.91308
Pos Pred Value : 0.25126
Neg Pred Value : 0.99809
Prevalence : 0.02999
Detection Rate : 0.02829
Detection Prevalence : 0.11260
Balanced Accuracy : 0.92824
'Positive' Class : one
So out of 212 non-zero entities you detected 200.
To perform better you may try to pre-process the data, use a better hyperparameter search routine such as the mlrMBO package (intended for use with mlr), or perhaps change the learner (though I doubt you can top xgboost here).
One more note: if it is not paramount to get a high Sensitivity, using "Kappa" as the selection metric might provide a more satisfying model; a sketch follows below.
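A minimal sketch of that alternative, assuming you drop the custom summary function so that caret's default summary can report Kappa (everything else is as before):
train_control_kappa <- trainControl(method = "cv",
                                    search = "random",
                                    number = 5,
                                    verboseIter = TRUE,
                                    classProbs = TRUE,
                                    savePredictions = "final")
set.seed(1)
xgb.mod_kappa <- train(Vote_perc ~ .,
                       data = forkfold,
                       trControl = train_control_kappa,
                       method = "xgbTree",
                       tuneLength = 50,
                       metric = "Kappa") #Kappa is reported by caret's default summary function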
As a final note, let's check the performance of the model with the default scale_pos_weight = 1, using the already selected parameters:
set.seed(1)
xgb.mod2 <- train(Vote_perc~.,
data = forkfold,
trControl = train_control,
method = "xgbTree",
tuneGrid = data.frame(nrounds = 498,
max_depth = 3,
eta = 0.008833468,
gamma = 4.131242,
colsample_bytree = 0.4233169,
min_child_weight = 3,
subsample = 0.6212512),
metric = "Dist",
maximize = FALSE,
scale_pos_weight = 1)
caret::confusionMatrix(xgb.mod2$pred$pred, xgb.mod2$pred$obs)
#output
Confusion Matrix and Statistics
Reference
Prediction one zero
one 94 21
zero 118 6836
Accuracy : 0.9803
95% CI : (0.9768, 0.9834)
No Information Rate : 0.97
P-Value [Acc > NIR] : 3.870e-08
Kappa : 0.5658
Mcnemar's Test P-Value : 3.868e-16
Sensitivity : 0.44340
Specificity : 0.99694
Pos Pred Value : 0.81739
Neg Pred Value : 0.98303
Prevalence : 0.02999
Detection Rate : 0.01330
Detection Prevalence : 0.01627
Balanced Accuracy : 0.72017
'Positive' Class : one
So, much worse at the default threshold of 0.5.
And the optimal threshold value:
plot(roc(xgb.mod2$pred$obs, xgb.mod2$pred$one),
print.thres = TRUE)
The threshold is 0.037, compared to the 0.369 obtained when we set scale_pos_weight as recommended. However, with the optimal threshold both approaches yield identical predictions; you can check this on the out-of-fold predictions, as sketched below.
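A hedged sketch of that check, using the 0.037 cutoff read off the ROC plot above:
caret::confusionMatrix(factor(ifelse(xgb.mod2$pred$one > 0.037, "one", "zero"),
                              levels = levels(xgb.mod2$pred$obs)),
                       xgb.mod2$pred$obs)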
A very brief question on predictive analysis in R.
Why are the cross-validated results obtained with the MASS package's Linear Discriminant Analysis so different from the ones obtained with caret?
#simulate data
set.seed(4321)
training_data = as.data.frame(matrix(rnorm(10000, sd = 12), 100, 10))
training_data$V1 = as.factor(sample(c(1,0), size = 100, replace = T))
names(training_data)[1] = 'outcome'
#MASS LDA
fit.lda_cv_MASS = lda(outcome~.
, training_data
, CV=T)
pred = fit.lda_cv_MASS$class
caret::confusionMatrix(pred, training_data$outcome)
This gives an accuracy of ~0.53
#caret interface LDA
lg.fit_cv_CARET = train(outcome ~ .
, data=training_data
, method="lda"
, trControl = trainControl(method = "LOOCV")
)
pred = predict(lg.fit_cv_CARET, training_data)
caret::confusionMatrix(pred, training_data$outcome)
Now this results in an accuracy of ~0.63.
I would have assumed they are identical since both use leave-one-out cross-validation.
Why are they different?
There are two points here: the first is a mistake on your part and the other is a subtle difference.
Point 1.
When you call predict on the caret train object, you are in fact calling predict on a model fit to all the training data, hence the accuracy you get is not the LOOCV accuracy but the train accuracy. To get the resample accuracy you just need to call:
lg.fit_cv_CARET$results
#output:
parameter Accuracy Kappa
1 none 0.48 -0.04208417
and not the 0.63, which is just the train accuracy obtained when you call predict on the train data.
However, this still does not match the 0.53 obtained by LDA. To understand why:
Point 2. When fitting the model, lda also uses the prior argument:
the prior probabilities of class membership. If unspecified, the class
proportions for the training set are used. If present, the
probabilities should be specified in the order of the factor levels
So lda with CV = TRUE uses the same prior as for the full train set, while caret::train uses the prior determined by the resample. For LOOCV this should not matter much, since the prior changes just a little bit; however, your data has very low separation between the classes, so the prior influences the posterior probability a bit more than usual. To prove this point, use the same prior for both approaches:
fit.lda_cv_MASS <- lda(outcome~.,
training_data,
CV=T,
prior = c(0.5, 0.5))
pred = fit.lda_cv_MASS$class
lg.fit_cv_CARET <- train(outcome ~ .,
data=training_data,
method="lda",
trControl = trainControl(method = "LOOCV"),
prior = c(0.5, 0.5)
)
all.equal(lg.fit_cv_CARET$pred$pred, fit.lda_cv_MASS$class)
#output
TRUE
caret::confusionMatrix(pred, training_data$outcome)
#output
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 27 25
1 24 24
Accuracy : 0.51
95% CI : (0.408, 0.6114)
No Information Rate : 0.51
P-Value [Acc > NIR] : 0.5401
Kappa : 0.0192
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.5294
Specificity : 0.4898
Pos Pred Value : 0.5192
Neg Pred Value : 0.5000
Prevalence : 0.5100
Detection Rate : 0.2700
Detection Prevalence : 0.5200
Balanced Accuracy : 0.5096
'Positive' Class : 0
lg.fit_cv_CARET$results
#output
parameter Accuracy Kappa
1 none 0.51 0.01921537
I have data with a binary YES/NO Class response. I am using the following code to run an RF model, but I have a problem getting the confusion matrix result.
dataR <- read_excel("*:/*.xlsx")
Train <- createDataPartition(dataR$Class, p=0.7, list=FALSE)
training <- dataR[ Train, ]
testing <- dataR[ -Train, ]
model_rf <- train(Class ~ ., tuneLength = 3, data = training, method = "rf",
                  importance = TRUE,
                  trControl = trainControl(method = "cv", number = 5))
Results:
Random Forest
3006 samples
82 predictor
2 classes: 'NO', 'YES'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 2405, 2406, 2405, 2404, 2404
Addtional sampling using SMOTE
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7870921 0.2750655
44 0.7787721 0.2419762
87 0.7767760 0.2524898
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
So far fine, but when I run this code:
# Apply threshold of 0.50: p_class
class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO")
# Create confusion matrix
p <-confusionMatrix(class_log, testing[["Class"]])
##gives the accuracy
p$overall[1]
I get this error:
Error in model_rf[, 1] : incorrect number of dimensions
I would appreciate it if you could help me get the confusion matrix result.
As I understand it, you would like to obtain the confusion matrix for cross-validation in caret.
For this you need to specify savePredictions in trainControl. If it is set to "final", predictions for the best model are saved. By specifying classProbs = TRUE, probabilities for each class will also be saved.
data(iris)
iris_2 <- iris[iris$Species != "setosa",] #make a two class problem
iris_2$Species <- factor(iris_2$Species) #drop levels
library(caret)
model_rf <- train(Species ~ ., tuneLength = 3, data = iris_2, method = "rf",
                  importance = TRUE,
                  trControl = trainControl(method = "cv",
                                           number = 5,
                                           savePredictions = "final",
                                           classProbs = TRUE))
Predictions are in:
model_rf$pred
These are sorted per CV fold; to sort them as in the original data frame:
model_rf$pred[order(model_rf$pred$rowIndex),2]
to obtain a confusion matrix:
confusionMatrix(model_rf$pred[order(model_rf$pred$rowIndex),2], iris_2$Species)
#output
Confusion Matrix and Statistics
Reference
Prediction versicolor virginica
versicolor 46 6
virginica 4 44
Accuracy : 0.9
95% CI : (0.8238, 0.951)
No Information Rate : 0.5
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8
Mcnemar's Test P-Value : 0.7518
Sensitivity : 0.9200
Specificity : 0.8800
Pos Pred Value : 0.8846
Neg Pred Value : 0.9167
Prevalence : 0.5000
Detection Rate : 0.4600
Detection Prevalence : 0.5200
Balanced Accuracy : 0.9000
'Positive' Class : versicolor
In a two-class setting, specifying 0.5 as the threshold probability is often sub-optimal. The optimal threshold can be found after training by optimizing Kappa or Youden's J statistic (or any other preferred metric) as a function of the probability. Here is an example:
res <- do.call(rbind, lapply(1:40/40, function(x){
  versicolor <- model_rf$pred[order(model_rf$pred$rowIndex), "versicolor"]
  class <- factor(ifelse(versicolor >= x, "versicolor", "virginica"),
                  levels = levels(iris_2$Species))
  mat <- confusionMatrix(class, iris_2$Species)
  data.frame(prob = x, kappa = mat$overall["Kappa"])
}))
Here the highest kappa is not obtained at a threshold of 0.5 but at around 0.1. This should be used carefully because it can lead to over-fitting; the kappa-maximizing cutoff can be read off the result, as shown below.
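For example, with the per-threshold results stored in res as above:
res$prob[which.max(res$kappa)] #threshold with the highest cross-validated kappa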
You can try this to create a confusion matrix and check the accuracy:
m <- table(class_log, testing[["Class"]])
m #confusion table
#Accuracy
(sum(diag(m)))/nrow(testing)
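If you also want the kappa value that confusionMatrix reports elsewhere in this thread, it can be computed by hand from the same table; a small sketch of Cohen's kappa:
po <- sum(diag(m)) / sum(m)                    #observed agreement (accuracy)
pe <- sum(rowSums(m) * colSums(m)) / sum(m)^2  #agreement expected by chance
(po - pe) / (1 - pe)                           #Cohen's kappa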
The code piece class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO") is a vectorized if-else statement that performs the following test:
in the first column of model_rf, if the number is greater than 0.50, return "YES", otherwise return "NO", and save the results in the object class_log.
So the code essentially creates a character vector of class labels, "YES" and "NO", from a numeric vector. The problem is that model_rf is a caret train object, not a matrix of probabilities, which is why model_rf[, 1] fails with "incorrect number of dimensions".
You need to apply your model to the test set.
prediction.rf <- predict(model_rf, testing, type = "prob")
Then do class_log <- ifelse(prediction.rf$YES > 0.50, "YES", "NO"), using the "YES" probability column (prediction.rf has one column per class).
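Finally, a hedged sketch of getting the confusion matrix you were after (assuming Class takes the values "NO" and "YES"):
class_log <- factor(class_log, levels = c("NO", "YES"))
confusionMatrix(class_log, factor(testing[["Class"]], levels = c("NO", "YES")))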