R Caret Random Forest AUC too good to be true? - r

Relative newbie to predictive modeling--most of my training/experience is in inferential stats. I'm trying to predict student college graduation in 4 years.
The basic issue: I cleaned the data (imputing, centering, scaling); split the processed/transformed data into training (70%) and testing (30%) sets; balanced the data using two approaches, ROSE "BOTH" and SMOTE (the outcome was 65% = 0 and 35% = 1; I've found inconsistent advice on what counts as unbalanced, but one source suggested anything outside a 40/60 range); and ran random forests.
For the ROSE "BOTH" models I got 0.9242 accuracy on the training set and AUC of 0.9268 for the test set.
For the SMOTE model I got 0.9943 accuracy on the training set and AUC of 0.9971 on the test set.
More details on model performance are embedded in the code copied below.
This just seems too good to be true. But from what I've been able to find, slightly better performance on the test set would not indicate overfitting (it would be the other way around). So, is this model's performance really that good, or is it too good to be true? I have not been able to find a direct answer to this question via SO searches.
Also, in a few weeks I'll have another cohort of data I can run this on. I suppose that could be another "test" set, correct? Then I can apply this to the newest cohort for which we are interested in knowing likelihood to graduate in 4 years.
Many thanks,
Brian
#Used for predictive modeling of 4-year graduation
#IMPORT DATA
library(haven)
grad4yr <- [file path]
#DETERMINE DATA BALANCE/UNBALANCE
prop.table(table(grad4yr$graduate_4_yrs))
# 0=0.6492, 1=0.3517
#convert to factor so next step doesn't impute outcome variable
grad4yr$graduate_4_yrs <- as.factor(grad4yr$graduate_4_yrs)
#Preprocess data, RANN package used; caret is loaded here because preProcess() comes from caret
library(caret)
library('RANN')
#Create preprocessed values object which includes centering, scaling, and imputing missing values using KNN
Processed_Values <- preProcess(grad4yr, method = c("knnImpute","center","scale"))
#Create new dataset with imputed values and centering/scaling
#Confirmed this results in 0 cases with missing values
grad4yr_data_processed <- predict(Processed_Values, grad4yr)
#Confirm last step results in 0 cases with missing values
sum(is.na(grad4yr_data_processed))
#[1] 0
#Make sure outcome variable is a factor. Note that the dummify step (next) then dummifies the outcome too,
#which is where the graduate_4_yrs.1 column used further down comes from.
grad4yr_data_processed$graduate_4_yrs <- as.factor(grad4yr_data_processed$graduate_4_yrs)
#Convert all factor variables to dummy variables; fullrank used to omit one of new dummy vars in each
#set.
dmy <- dummyVars("~ .", data = grad4yr_data_processed, fullRank = TRUE)
#Create new dataset that has the data imputed AND transformed to have dummy variables for all variables that
#will go in models.
grad4yr_processed_transformed <- data.frame(predict(dmy,newdata = grad4yr_data_processed))
#Convert outcome variable back to binary/factor for predictive models and store it under its original name
#dummyVars turned the factor outcome into a single 0/1 dummy column named graduate_4_yrs.1
grad4yr_processed_transformed$graduate_4_yrs <- as.factor(grad4yr_processed_transformed$graduate_4_yrs.1)
grad4yr_processed_transformed$graduate_4_yrs.1 <- NULL
#Split data into training and testing/validation datasets based on outcome at 70%/30%
index <- createDataPartition(grad4yr_processed_transformed$graduate_4_yrs, p=0.70, list=FALSE)
trainSet <- grad4yr_processed_transformed[index,]
testSet <- grad4yr_processed_transformed[-index,]
#load caret
library(caret)
#Feature selection using rfe in R Caret, used with profile/comparison
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
number = 10,#using k=10 per Kuhn & Johnson pp70; and per James et al pp
#https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
repeats = 5,#5 repeats, matching the "10 fold, repeated 5 times" output below
verbose = FALSE)
#create traincontrol using repeated cross-validation with 10 fold 5 times
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
search = "random")
#Set the outcome variable object
grad4yrs <- 'graduate_4_yrs'
#set predictor variables object
predictors <- names(trainSet[!names(trainSet) %in% grad4yrs])
#create predictor profile to see what where prediction is best (by num vars)
grad4yr_pred_profile <- rfe(trainSet[,predictors],trainSet[,grad4yrs],rfeControl = control)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#
# Resampling performance over subset size:
#
# Variables Accuracy Kappa AccuracySD KappaSD Selected
# 4 0.6877 0.2875 0.03605 0.08618
# 8 0.7057 0.3078 0.03461 0.08465 *
# 16 0.7006 0.2993 0.03286 0.08036
# 40 0.6949 0.2710 0.03330 0.08157
#
# The top 5 variables (out of 8):
# Transfer_Credits, HS_RANK, Admit_Term_Credits_Taken, first_enroll, Admit_ReasonUT10
#see data structure
str(trainSet)
#not copying output here, but confirms outcome var is factor and everything else is numeric
#given 65/35 split on outcome var and what I can find about unbalanced data, treating it as unbalanced and taking steps to balance.
#using ROSE "BOTH" and SMOTE to see how differently they perform. Also ran under/over with ROSE but they didn't perform nearly as
#well so removed from this script.
#SMOTE to balance data on the processed/dummified dataset
library(DMwR)#https://www3.nd.edu/~dial/publications/chawla2005data.pdf for justification
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, perc.over=600, perc.under=100)
#see how balanced SMOTE resulting dataset is
prop.table(table(train.SMOTE$graduate_4_yrs))
#0 1
#0.4615385 0.5384615
#open ROSE package/library
library("ROSE")
#ROSE to balance data (using BOTH) on the processed/dummified dataset
train.both <- ovun.sample(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, method = "both", p=.5,
N = 2346)$data
#see how balanced BOTH resulting dataset is
prop.table(table(train.both$graduate_4_yrs))
#0 1
#0.4987212 0.5012788
#class counts in the original (unbalanced) processed/dummified dataset, for comparison
table(grad4yr_processed_transformed$graduate_4_yrs)
#0 1
#1144 618
library("caret")
#create random forests using balanced data from above
RF_model_both <- train(train.both[,predictors],train.both[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "BOTH" training model
# print(RF_model_both)
# Random Forest
#
# 2346 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 2112, 2111, 2111, 2112, 2111, 2112, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 8 0.9055406 0.8110631
# 11 0.9053719 0.8107246
# 12 0.9057981 0.8115770
# 13 0.9054584 0.8108965
# 14 0.9048602 0.8097018
# 20 0.9034992 0.8069796
# 26 0.9027307 0.8054427
# 30 0.9034152 0.8068113
# 38 0.9023899 0.8047622
# 40 0.9032428 0.8064672
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 12.
RF_model_SMOTE <- train(train.SMOTE[,predictors],train.SMOTE[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "SMOTE" training model
# print(RF_model_SMOTE)
# Random Forest
#
# 8034 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 7231, 7231, 7230, 7230, 7231, 7231, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 17 0.9449082 0.8899939
# 19 0.9458047 0.8917740
# 21 0.9458543 0.8918695
# 29 0.9470243 0.8941794
# 31 0.9468750 0.8938864
# 35 0.9468003 0.8937290
# 36 0.9463772 0.8928876
# 40 0.9463275 0.8927828
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 29.
#Given that both accuracy and kappa appear better in the "SMOTE" random forest it's looking like it's the better model.
#But, running ROC/AUC on both to see how they both perform on validation data.
#Create predictions based on random forests above
rf_both_predictions <- predict.train(object=RF_model_both,testSet[, predictors], type ="raw")
rf_SMOTE_predictions <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="raw")
#Create predicted class probabilities based on random forests above
rf_both_pred_prob <- predict.train(object=RF_model_both,testSet[, predictors], type ="prob")
rf_SMOTE_pred_prob <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="prob")
#create Random Forest confusion matrix to evaluate random forests
confusionMatrix(rf_both_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 315 12
# 1 28 173
#
# Accuracy : 0.9242
# 95% CI : (0.8983, 0.9453)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : < 2e-16
#
# Kappa : 0.8368
# Mcnemar's Test P-Value : 0.01771
#
# Sensitivity : 0.9351
# Specificity : 0.9184
# Pos Pred Value : 0.8607
# Neg Pred Value : 0.9633
# Prevalence : 0.3504
# Detection Rate : 0.3277
# Detection Prevalence : 0.3807
# Balanced Accuracy : 0.9268
#
# 'Positive' Class : 1
# confusionMatrix(rf_under_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
#Accuracy : 0.8258
#only copied accuracy as it was far below the two other versions
confusionMatrix(rf_SMOTE_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 340 0
# 1 3 185
#
# Accuracy : 0.9943
# 95% CI : (0.9835, 0.9988)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : <2e-16
#
# Kappa : 0.9876
# Mcnemar's Test P-Value : 0.2482
#
# Sensitivity : 1.0000
# Specificity : 0.9913
# Pos Pred Value : 0.9840
# Neg Pred Value : 1.0000
# Prevalence : 0.3504
# Detection Rate : 0.3504
# Detection Prevalence : 0.3561
# Balanced Accuracy : 0.9956
#
# 'Positive' Class : 1
#put predictions in dataset
testSet$rf_both_pred <- rf_both_predictions#predictions (BOTH)
testSet$rf_SMOTE_pred <- rf_SMOTE_predictions#predictions (SMOTE)
testSet$rf_both_prob <- rf_both_pred_prob#probabilities (BOTH)
testSet$rf_SMOTE_prob <- rf_SMOTE_pred_prob#probabilities (SMOTE)
library(pROC)
#get AUC of the BOTH predictions
testSet$rf_both_pred <- as.numeric(testSet$rf_both_pred)
Both_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_both_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(Both_ROC_Curve)
# Area under the curve: 0.9268
#get AUC of the SMOTE predictions
testSet$rf_SMOTE_pred <- as.numeric(testSet$rf_SMOTE_pred)
SMOTE_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_SMOTE_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(SMOTE_ROC_Curve)
#Area under the curve: 0.9971
#So, the SMOTE balanced data performed very well on training data and near perfect on the validation/test data.
#But, it seems almost too good to be true.
#Is there anything I might have missed or performed incorrectly?
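One thing worth checking in the script above: roc() is given the hard class predictions (converted with as.numeric), not the class probabilities. With a predictor that only takes two values, the AUC reduces to (sensitivity + specificity) / 2, which is the balanced accuracy. That would explain why the BOTH AUC (0.9268) is identical to the Balanced Accuracy printed by confusionMatrix. A minimal base-R sketch, rebuilding the predictions from the BOTH confusion matrix; auc_rank is a hypothetical helper written for this illustration, not a pROC function:

```r
# Rank-based (Mann-Whitney) AUC; tied scores receive average ranks.
# auc_rank is a hypothetical helper, not part of pROC or caret.
auc_rank <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Rebuild truth and hard predictions from the BOTH confusion matrix:
#            Reference
# Prediction   0   1
#          0 315  12
#          1  28 173
truth <- c(rep(0, 315), rep(0, 28), rep(1, 12), rep(1, 173))
pred  <- c(rep(0, 315), rep(1, 28), rep(0, 12), rep(1, 173))

sens <- 173 / (173 + 12)          # true positives / all actual 1s
spec <- 315 / (315 + 28)          # true negatives / all actual 0s
bal_acc  <- (sens + spec) / 2
auc_hard <- auc_rank(pred, truth == 1)

# Both values round to 0.9268
round(c(balanced_accuracy = bal_acc, auc_from_labels = auc_hard), 4)
```

For a threshold-free AUC, the predictor passed to roc() would presumably be the probability column for class 1 (for example rf_both_pred_prob[, "1"], assuming the columns are named after the factor levels) rather than the predicted labels.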

I'll post my comment as an answer, even if this might be migrated.
I really think that you're overfitting, because you balanced the whole dataset.
Instead, you should balance only the training set.
Here is your code:
library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed,
perc.over=600, perc.under=100)
By doing so, your train.SMOTE contains information from the test set too, so when you evaluate on testSet the model has already seen part of that data, and this is likely the cause of your "too good" results.
It should be:
library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=trainSet, # use only the train set
perc.over=600, perc.under=100)
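The same applies to train.both, which is also built from grad4yr_processed_transformed in the question. To see how much this kind of leakage can inflate a test score, here is a minimal base-R sketch under simplifying assumptions: the features are pure noise (so nothing can really be learned), plain resampling with replacement stands in for SMOTE/ROSE, and a hand-written 1-nearest-neighbour classifier (nn1, a hypothetical helper) stands in for the random forest.

```r
set.seed(1)
n <- 300
x <- matrix(rnorm(n * 5), ncol = 5)   # 5 noise predictors
y <- factor(rbinom(n, 1, 0.35))       # labels unrelated to x

# 1-nearest-neighbour prediction (hypothetical helper for this demo)
nn1 <- function(train_x, train_y, test_x) {
  idx <- apply(test_x, 1, function(r) {
    which.min(colSums((t(train_x) - r)^2))  # squared Euclidean distance
  })
  train_y[idx]
}

test_idx  <- sample(n, size = 0.3 * n)
train_idx <- setdiff(seq_len(n), test_idx)

# (a) leaky: resample from the FULL data, so test rows end up in training
leaky_idx <- sample(n, size = 2 * n, replace = TRUE)
acc_leaky <- mean(nn1(x[leaky_idx, ], y[leaky_idx], x[test_idx, ]) == y[test_idx])

# (b) clean: resample from the training rows only
clean_idx <- sample(train_idx, size = 2 * n, replace = TRUE)
acc_clean <- mean(nn1(x[clean_idx, ], y[clean_idx], x[test_idx, ]) == y[test_idx])

c(leaky = acc_leaky, clean = acc_clean)
```

In the leaky version most test rows have an exact copy in the training data, so the test accuracy looks excellent even though the labels are random; the clean version should land near chance.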

Related

kNN algorithm not working while using caret

I am trying to run LOOCV kNN on this dataset (104x182 where the first 62 samples are B and the following 42 are C). I first conducted a PCA on the standardized version of this dataset (giving me 104 PCs). I then try to perform LOOCV kNN for i = 3:98 where i refers to the number of PCs I will use for my kNN model. For each i I pull out the highest accuracy, which k it occurs at and store it within a data frame.
# required packages
library(MASS)
library(class)
library(tidyverse)
library(caret)
# reading in and cleaning data
data <- read.csv("chowdary.csv")
og_data <- data[, -1]
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
# on standardized data
pca_all <- prcomp(og_data[, -1], center=TRUE, scale=TRUE)
# creating data frame to store best k value for each number of PCs
kdf_pca_all_cc <- tibble(i=as.numeric(), # this is for storing number of PCs used,
pca_all_k=as.numeric(), # k value,
pca_all_acc=as.numeric(), # accuracy value,
pca_all_kapp=as.numeric()) # and kappa value
# kNN
k_kNN <- 3:97 # number of PCs to use in each iteration of the model
train_control <- trainControl(method="LOOCV")
kNN_data <- as.data.frame(cbind(as.factor(st_data[, 1]), pca_all$x)) # data used in kNN model below
for (i in k_kNN){
a111 <- train(V1~ .,
method="knn",
tuneGrid=expand.grid(k=1:25),
trControl=train_control,
metric="Accuracy",
data=kNN_data[, 1:i])
b111 <- a111$results[as.integer(a111$bestTune), ] # this is to store the best accuracy rate, along with its k and kappa value
kdf_pca_all_cc <- kdf_pca_all_cc %>%
add_row(i=i-1,
pca_all_k=b111[, 1],
pca_all_acc=b111[, 2],
pca_all_kapp=b111[, 3])
}
For example, for i = 5, the kNN model would be using the following data:
head(kNN_data[, 1:5])
V1 PC1 PC2 PC3 PC4
1 1 3.299844 0.2587487 -1.00501632 2.0273727
2 1 1.427856 -1.0455044 -1.79970790 2.5244021
3 1 3.087657 1.2563404 1.67591441 -1.4270431
4 1 3.107778 1.5893396 2.65871270 -2.8217264
5 1 3.244306 0.5982652 0.37011029 0.3642425
6 1 3.000098 0.5471276 -0.01178315 1.0857886
However, whenever I try to run the for-loop, I get the following error and warning:
Error: Metric Accuracy not applicable for regression models
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
I have no idea how to fix this. Any help would be much appreciated.
Also, as a side note, is there a faster way to run this for-loop? It takes quite a while but I have no idea how to make it more efficient. Thank you.
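A likely cause, sketched under the assumption that the outcome is meant to be a two-level factor: cbind() builds a numeric matrix, so a factor passed to it is silently reduced to its integer codes, and train() then sees a numeric outcome and assumes regression. Building the data frame with data.frame() keeps the factor. The objects below (f, m, bad, good) are hypothetical stand-ins for the question's data:

```r
# Toy stand-ins: a two-level outcome and a small "PC score" matrix
f <- factor(c("B", "B", "C", "C"))
m <- matrix(rnorm(8), ncol = 2, dimnames = list(NULL, c("PC1", "PC2")))

# cbind() coerces everything to one numeric matrix: the factor becomes 1/2 codes
bad <- as.data.frame(cbind(f, m))
class(bad$f)        # "numeric"

# data.frame() keeps each column's own type
good <- data.frame(tumour = f, m)
class(good$tumour)  # "factor"
levels(good$tumour) # "B" "C"
```

With a data frame built like good, a call such as train(tumour ~ ., ...) should be treated as classification, so metric = "Accuracy" becomes applicable.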

SVM performance not consistent with AUC score

I have a dataset that contains information about patients. It includes several variables and their clinical status (0 if they are healthy, 1 if they are sick).
I have tried to implement an SVM model to predict patient status based on these variables.
library(e1071)
library(ROCR) # provides prediction() and performance() used below
Index <-
order(Ytrain, decreasing = FALSE)
SVMfit_Var <-
svm(Xtrain[Index, ], Ytrain[Index],
type = "C-classification", gamma = 0.005, probability = TRUE, cost = 0.001, epsilon = 0.1)
preds1 <-
predict(SVMfit_Var, Xtest, probability = TRUE)
preds1 <-
attr(preds1, "probabilities")[,1]
samples <- !is.na(Ytest)
pred <- prediction(preds1[samples],Ytest[samples])
AUC<-performance(pred,"auc")@y.values[[1]]
prediction <- predict(SVMfit_Var, Xtest)
xtab <- table(Ytest, prediction)
To test the performance of the model, I have calculated the ROC AUC, and with the validation set I obtain an AUC = 0.997.
But when I view the predictions, all the patients have been assigned as healthy.
AUC = 0.997
> xtab
prediction
Ytest 0 1
0 72 0
1 52 0
Can anyone help me with this problem?
Did you look at the probabilities versus the fitted values? You can read about how probability works with SVM here.
If you want to look at the performance you can use the library DescTools and the function Conf or with the library caret and the function confusionMatrix. (They provide the same output.)
library(DescTools)
library(caret)
# for the training performance with DescTools
Conf(table(SVMfit_Var$fitted, Ytrain[Index]))
# svm.model$fitted, y-values for training
# training performance with caret
confusionMatrix(SVMfit_Var$fitted, as.factor(Ytrain[Index]))
# svm.model$fitted, y-values
# if y.values aren't factors, use as.factor()
# for testing performance with DescTools
# with `table()` in your question, you must flip the order:
# predicted first, then actual values
Conf(table(prediction, Ytest))
# and for caret
confusionMatrix(prediction, as.factor(Ytest))
Your question isn't reproducible, so I went through this with iris data. The probability was the same for every observation. I included this, so you can see this with another data set.
library(e1071)
library(ROCR)
library(caret)
library(dplyr) # for %>% and filter()
data("iris")
# make it binary
df1 <- iris %>% filter(Species != "setosa") %>% droplevels()
# check the subset
summary(df1)
set.seed(395) # keep the sample repeatable
tr <- sample(1:nrow(df1), size = 70, # 70%
replace = F)
# create the model
svm.fit <- svm(df1[tr, -5], df1[tr, ]$Species,
type = "C-classification",
gamma = .005, probability = T,
cost = .001, epsilon = .1)
# look at probabilities
pb.fit <- predict(svm.fit, df1[-tr, -5], probability = T)
# this shows EVERY row has the same outcome probability distro
pb.fit <- attr(pb.fit, "probabilities")[,1]
# look at performance
performance(prediction(pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
# [1] 0.03555556 that's abysmal!!
# test the model
p.fit = predict(svm.fit, df1[-tr, -5])
confusionMatrix(p.fit, df1[-tr, ]$Species)
# 93% accuracy with NIR at 50%... the AUC score was not useful
# check the trained model performance
confusionMatrix(svm.fit$fitted, df1[tr, ]$Species)
# 87%, with NIR at 50%... that's really good
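It may also help to remember that AUC only depends on how the scores rank the cases, while the confusion matrix depends on where the classifier draws its cutoff. So a near-perfect AUC and an "everything is healthy" confusion matrix can coexist. Likewise, an AUC close to 0 (such as 0.0356 above) usually suggests the scores belong to the other class, since flipping them would give 1 - 0.0356 = 0.9644. A minimal base-R sketch with made-up scores; auc_rank is a hypothetical helper, not a ROCR function:

```r
# Rank-based (Mann-Whitney) AUC; hypothetical helper for this illustration
auc_rank <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up data: 6 healthy (0), 4 sick (1); every score is below 0.5,
# but the sick patients always score higher than the healthy ones
y <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
p <- c(0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.30, 0.33, 0.35, 0.40)

auc_sick    <- auc_rank(p, y == 1)      # perfect ranking: 1
pred_class  <- as.integer(p > 0.5)      # yet every prediction is 0
auc_flipped <- auc_rank(1 - p, y == 1)  # scores for the wrong class: 0

table(truth = y, predicted = pred_class)
c(auc = auc_sick, auc_flipped = auc_flipped)
```

In practice this means checking which column of attr(preds1, "probabilities") corresponds to the positive class (by name rather than by position), and considering a cutoff other than the default when the classes are hard to separate at the chosen cost.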

Use logistic regression on data set with repeated K fold using R

I am trying to predict whether water is safe to drink. The data set is here:
https://www.kaggle.com/adityakadiwal/water-potability?select=water_potability.csv.
Assume I take the dataframe to be composed of Ph, Hardness, Solids, Chloramines and Potability.
I'd like to run logistic regression with 10-fold cross-validation (10 is just an example; I wish to try other choices too).
Disregarding the computational power needed, I'd then like to repeat this 5 more times with different randomized 10-fold splits and choose the best model.
I have come across the k-fold function and the glm function, but I don't know how to combine them to repeat this process with 5 randomized splits.
Later on, I'd also like to create something similar with KNN.
I'd appreciate any help on this matter.
some code:
library(readr) # for read_csv()
library(caret) # for trainControl() and train()
df <- read_csv("water_potability.csv")
train_model <- trainControl(method = "repeatedcv",
number = 10, repeats = 5)
model <- train(Potability~., data = df, method = "regLogistic",
trControl = train_model )
However, I'd prefer to use non regularized logistic.
You can do the following (based on some sample data from here)
library(caret)
# Sample data since your post doesn't include sample data
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# Make sure the response `admit` is a `factor`
df$admit <- factor(df$admit)
# Set up 10-fold CV, repeated 5 times
train_model <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# Train the model
model <- train(
admit ~ .,
data = df,
method = "glm",
family = "binomial",
trControl = train_model)
model
#Generalized Linear Model
#
#400 samples
# 3 predictor
# 2 classes: '0', '1'
#
#No pre-processing
#Resampling: Cross-Validated (10 fold, repeated 5 times)
#Summary of sample sizes: 359, 361, 360, 360, 359, 361, ...
#Resampling results:
#
# Accuracy Kappa
# 0.7020447 0.1772786
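We can look at the confusion matrix for good measure. Note that predict(model) without new data scores the same rows the model was trained on, so this is training-set performance rather than a held-out estimate: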
confusionMatrix(predict(model), df$admit)
#Confusion Matrix and Statistics
#
# Reference
#Prediction 0 1
# 0 253 98
# 1 20 29
#
# Accuracy : 0.705
# 95% CI : (0.6577, 0.7493)
# No Information Rate : 0.6825
# P-Value [Acc > NIR] : 0.1809
#
# Kappa : 0.1856
#
#Mcnemar's Test P-Value : 1.356e-12
#
# Sensitivity : 0.9267
# Specificity : 0.2283
# Pos Pred Value : 0.7208
# Neg Pred Value : 0.5918
# Prevalence : 0.6825
# Detection Rate : 0.6325
# Detection Prevalence : 0.8775
# Balanced Accuracy : 0.5775
#
# 'Positive' Class : 0
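To make explicit what "10 fold, repeated 5 times" means, here is a minimal base-R sketch of the same idea done by hand with glm(). It is only an illustration under stated assumptions: the data are simulated (x1, x2 and y are made up), the folds are plain random folds, and accuracy uses a fixed 0.5 cutoff. It is not meant to reproduce caret's internals.

```r
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8 * d$x1 - 0.5 * d$x2))  # simulated binary outcome

repeats <- 5
k <- 10
acc <- matrix(NA_real_, nrow = repeats, ncol = k)

for (r in seq_len(repeats)) {
  # new random assignment of rows to k folds for every repeat
  fold <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    fit <- glm(y ~ x1 + x2, data = d[fold != f, ], family = binomial)
    p   <- predict(fit, newdata = d[fold == f, ], type = "response")
    acc[r, f] <- mean((p > 0.5) == (d$y[fold == f] == 1))
  }
}

mean(acc)      # overall resampled accuracy (50 held-out folds)
rowMeans(acc)  # one estimate per repeat
```

Each of the 50 fits is evaluated only on rows it did not see, and the averaged accuracy plays the role of the single Accuracy figure that train() reports.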

How to test accuracy of a trained knn model in R Studio?

The objective is to train a model to predict the default variable. Train a KNN model with k = 13 using the knn3() function and calculate the test accuracy.
My code to solve this problem so far is:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
# check data
dft_trn
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
# make "predictions" with knn model
new_obs = data.frame(balance = 421, income = 28046)
predtrn = predict(mod_knn, new_obs, type = "prob")
confusionMatrix(predtrn,dft_trn)
At the last line of the code chunk, I get the error "Error: data and reference should be factors with the same levels." I am unsure how to fix this, or whether this is even the correct method to measure the test accuracy.
Any help would be great, thanks!
First of all, you are on the right track: splitting the data into training and test sets is a necessary step. The issue is that you are trying to evaluate a prediction for a new observation that belongs to neither set and has no true label. The principle is to train the model on the training set, make predictions on the test set, and then compare those predictions with the known labels to evaluate performance. You already have a dataset for that (dft_tst). As a reminder, a confusion matrix cannot be computed from a single predicted label with no real label to compare against. Here is the code to obtain the desired matrix:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
Now, we split into train and test sets:
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
We train the model:
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
Now, the key part is making predictions on the test set (or any labelled set) so that the confusion matrix can be obtained:
# make "predictions" with knn model
predtrn = predict(mod_knn, dft_tst, type = "class")
To compute the confusion matrix, the predictions and the original labels must be factors with the same levels and the same length:
#Confusion matrix
confusionMatrix(predtrn,dft_tst$default)
Output:
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1929 67
Yes 1 3
Accuracy : 0.966
95% CI : (0.9571, 0.9735)
No Information Rate : 0.965
P-Value [Acc > NIR] : 0.4348
Kappa : 0.0776
Mcnemar's Test P-Value : 3.211e-15
Sensitivity : 0.99948
Specificity : 0.04286
Pos Pred Value : 0.96643
Neg Pred Value : 0.75000
Prevalence : 0.96500
Detection Rate : 0.96450
Detection Prevalence : 0.99800
Balanced Accuracy : 0.52117
'Positive' Class : No
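The headline numbers in this output can be reproduced by hand from the four cell counts, which makes it easier to see that an accuracy of 0.966 is barely above the No Information Rate of 0.965: the model almost never predicts "Yes". A minimal base-R sketch (the matrix is typed in from the output above):

```r
# Confusion matrix from the output above: rows = prediction, columns = reference
cm <- matrix(c(1929, 1, 67, 3), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy <- sum(diag(cm)) / sum(cm)       # (1929 + 3) / 2000
nir      <- max(colSums(cm)) / sum(cm)    # share of the largest true class
sens     <- cm["No", "No"] / sum(cm[, "No"])     # positive class is "No"
spec     <- cm["Yes", "Yes"] / sum(cm[, "Yes"])
balanced <- (sens + spec) / 2

round(c(accuracy = accuracy, nir = nir, sensitivity = sens,
        specificity = spec, balanced_accuracy = balanced), 5)
```

The low specificity (3 of 70 actual defaulters caught) is what the Balanced Accuracy of about 0.52 reflects.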

Create a binary outcome with random forest

I have a dataset that looks like this:
TEAM1 TEAM2 EXPG1 EXPG2 Gewonnen
ADO Den Haag Groningen 1.5950 1.2672 1
I am now trying to predict the column Gewonnen based on EXPG1 and EXPG2. I created a training and a test set and fit the following model (all using the caret package in R):
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
I can't make a confusion matrix now because the predictions have more distinct values than the reference. That's true, because when I do:
pred <- predict(modFit, testing)
head(pred)
It says: 0.5324000 0.7237333 0.2811333 0.8231000 0.8299333 0.9792000
I want to make a confusion matrix, so I need these as 0/1 values. I could convert them afterwards, but I have the feeling that there should be an option to do this in the model as well.
Any thoughts on what I should change in this model to get 0/1 values? I couldn't find it in the documentation:
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
First of all, as Tim Biegeleisen says, you should convert your Gewonnen variable to a factor (in both training & test sets), if it is not already:
training$Gewonnen <- as.factor(training$Gewonnen)
testing$Gewonnen <- as.factor(testing$Gewonnen)
After that, the type option in the caret function predict determines what type of response you get for a binary classification problem, i.e. class labels or probabilities. Here is a reproducible example from the caret documentation using the Sonar dataset from the package mlbench:
library(caret)
library(mlbench)
data(Sonar)
str(Sonar$Class)
# Factor w/ 2 levels "M","R": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
modFit <- train(Class ~ ., data=training, method="rf", prox=TRUE)
pred <- predict(modFit, testing, type="prob") # for class probabilities
head(pred)
# M R
# 5 0.442 0.558
# 10 0.276 0.724
# 11 0.096 0.904
# 12 0.360 0.640
# 20 0.654 0.346
# 21 0.522 0.478
pred2 <- predict(modFit, testing, type="raw") # for class labels
head(pred2)
# [1] R R R R M M
# Levels: M R
For the confusion matrix, you will need class labels (i.e. pred2 above):
confusionMatrix(pred2, testing$Class)
# Confusion Matrix and Statistics
# Reference
# Prediction M R
# M 25 6
# R 2 18
This answer is a bit speculative as you omitted some critical details about your data set and I have not worked extensively with the caret package. That being said, it appears that you are running random forests in regression mode, which means that you will end up with a continuous function. This means that predictions can have a response value of 0, 1, or anything in between 0 and 1. If your Gewonnen column only has values of 0 or 1, and you want predicted values to also behave this way, then you can try turning Gewonnen into a categorical variable. As this article discusses, this might tell random forests to run in classification mode instead of regression mode.
Gewonnen <- as.factor(Gewonnen)
This builds the random forest as you did before, and you should have the responses you want.
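If the model has already been fitted in regression mode, the numeric predictions can also be cut into classes by hand. A minimal base-R sketch using the six values printed in the question; the 0.5 cutoff and the truth vector are assumptions made up for illustration, not values from the asker's data:

```r
# Numeric predictions as printed in the question
pred_num <- c(0.5324000, 0.7237333, 0.2811333, 0.8231000, 0.8299333, 0.9792000)

# Assumed cutoff of 0.5; fixing the levels keeps both classes in the factor
pred_class <- factor(as.integer(pred_num > 0.5), levels = c(0, 1))

# Made-up true outcomes, only so that a table can be shown
truth <- factor(c(1, 1, 0, 0, 1, 1), levels = c(0, 1))

table(Prediction = pred_class, Reference = truth)
```

Converting Gewonnen to a factor before training, as both answers suggest, is still the more direct route, because the forest is then grown for classification from the start.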
