Why aren't these models produced by caret identical?

I'm trying to understand why two pieces of code don't produce identical models. To create the first neural network (NN1), I used cross-validation in the train function of the caret package (code below) to find the best parameters. Page 2 of the package's vignette says that train will "Fit the final model to all the training data using the optimal parameter set".
So in the code below I expect NN1 to be fit on the full training set with the best parameters, which happen to be size=5 and decay=0.1.
My plan was to use the parameters from this step to create a model to put into production using the combined training and test data. Before I created this production model, I wanted to make sure I was using the output from the train function properly.
So I created a second model (NN2) with the train function but without tuning. Instead I specified the parameters size=5 and decay=0.1. With the same data, same parameters (and same seed), I expected identical models, but they are not. Why aren't these models identical?
# Create some data
library(caret)
set.seed(2)
xy <- data.frame(Response = factor(sample(c("Y","N"), 534, replace = TRUE, prob = c(0.5, 0.5))),
                 GradeGroup = factor(sample(c("G1","G2","G3"), 534, replace = TRUE, prob = c(0.4, 0.3, 0.3))),
                 Sibling = sample(c(TRUE, FALSE), 534, replace = TRUE, prob = c(0.3, 0.7)),
                 Dist = rnorm(534))
xyTrain <- xy[1:360,]
xyTest <- xy[361:534,]
# Create NN1 using cross-validation
tc <- trainControl(method="cv", number = 10, savePredictions = TRUE, classProbs = TRUE)
set.seed(2)
NN1 <- train(Response ~ ., data = xyTrain,
             method = "nnet",
             trControl = tc,
             verbose = FALSE,
             metric = "Accuracy")
# Create NN2 using parameters from NN1
fitControl <- trainControl(method="none", classProbs = TRUE)
set.seed(2)
NN2 <- train(Response ~ ., data = xyTrain,
             method = "nnet",
             trControl = fitControl,
             verbose = FALSE,
             tuneGrid = data.frame(size = NN1$bestTune[[1]], decay = NN1$bestTune[[2]]),
             metric = "Accuracy")
Here are the results:
> # Parameters of NN1
> NN1$bestTune
  size decay
1    1     0
>
> # Code to show results of NN1 and NN2 differ
> testFitted <- data.frame(fitNN1=NN1$finalModel$fitted.values,
+ fitNN2=NN2$finalModel$fitted.values)
>
> testPred<-data.frame(predNN1=predict(NN1,xyTest,type="prob")$Y,
+ predNN2=predict(NN2,xyTest,type="prob")$Y)
> # Fitted values are different
> head(testFitted)
fitNN1 fitNN2
X1 0.4824096 0.4834579
X2 0.4673498 0.4705441
X3 0.4509407 0.4498603
X4 0.4510129 0.4498710
X5 0.4690963 0.4753655
X6 0.4509160 0.4498539
> # Predictions on test set are different
> head(testPred)
predNN1 predNN2
1 0.4763952 0.4784981
2 0.4509160 0.4498539
3 0.5281298 0.5276355
4 0.4512930 0.4498993
5 0.4741959 0.4804776
6 0.4509335 0.4498589
>
> # Accuracy of predictions are different
> sum(predict(NN1,xyTest,type="raw")==xyTest$Response)/nrow(xyTest)
[1] 0.4655172
> sum(predict(NN2,xyTest,type="raw")==xyTest$Response)/nrow(xyTest)
[1] 0.4597701
>
> # Summary of models
> summary(NN1)
a 4-1-1 network with 7 weights
options were - entropy fitting
b->h1 i1->h1 i2->h1 i3->h1 i4->h1
-8.38 6.58 5.51 -9.50 1.06
b->o h1->o
-0.20 1.39
> summary(NN2)
a 4-1-1 network with 7 weights
options were - entropy fitting
b->h1 i1->h1 i2->h1 i3->h1 i4->h1
10.94 -8.27 -7.36 8.50 -0.76
b->o h1->o
3.15 -3.35

I believe this comes down to the random seed. During cross-validation you fit many models starting from that seed (set.seed(2)), so by the time caret fits the final model with the chosen parameters, the RNG is in a different state than it is when you set the seed and fit that model directly yourself. You can see the effect here because nnet generates its starting weights randomly on every call.
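A minimal sketch of the effect (on the standard iris data, not the data above; the point is only that nnet's starting weights depend on the RNG state at the moment it is called):
library(nnet)
set.seed(2)
m1 <- nnet(Species ~ ., data = iris, size = 2, trace = FALSE)
set.seed(2)
tmp <- runif(100)   # consume some random numbers, as the CV fits would
m2 <- nnet(Species ~ ., data = iris, size = 2, trace = FALSE)
isTRUE(all.equal(m1$wts, m2$wts))   # FALSE: different starting weights, different fit
One way around this is to pass explicit starting weights (the Wts argument of nnet), or simply to keep the final model from NN1 rather than refitting it.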

Setting C for Linear SVM

Here's my question:
I have a medium-sized data set about the condition of a hydraulic system.
The data set has 68 variables plus the condition of the system (green, yellow, red).
I have to use several classifiers to predict the behaviour of the system, so I have divided my data set into training and test sets as follows.
(About the conditions: red means warning, yellow means pay attention, green means good.)
This is what I wrote:
Tab$Condition=factor(Tab$Condition, labels=c("Yellow","Green","Red"))
set.seed(32343)
reg_Control = trainControl("repeatedcv", number = 5, repeats=5, verboseIter = T, classProbs =T)
inTrain = createDataPartition(y=Tab$Condition,p=0.75, list=FALSE)
training = Tab[inTrain,]
testing = Tab[-inTrain,]
I'm using an SVM linear classifier to predict the behaviour of the system.
I started by using a random value for C to see what kind of results I would get.
svmLinear = train(Condition ~.,data=training, method="svmLinear", trControl=reg_Control,tuneGrid=data.frame(C=seq(0.1,1,0.1)))
svmLPredictions = predict(svmLinear,newdata=training)
confusionMatrix(svmLPredictions,training$Condition)
#misclassification of 129/1655 accuracy of 92.21%
svmLPred = predict(svmLinear,newdata=testing)
confusionMatrix(svmLPred,testing$Condition)
#misclassification of 41/550 accuracy of 92.55%
As I said, I've used an SVM linear classifier and started with a more or less random value for C.
How do I then decide on the best value of C to use for the analysis?
Sorry if the question is basic, but I'm a beginner!
Any answers will be helpful!
Thanks
caret calls other packages to run the actual modelling process; caret itself is only a (very powerful) convenience package in this regard. It does this automatically, however, so a user might not realize it unless an error is thrown.
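For example, you can check which package a given method wraps (a quick check; getModelInfo() is caret's registry of model definitions):
caret::getModelInfo("svmLinear")$svmLinear$library
# [1] "kernlab"   ("svmLinear" is fit by kernlab's ksvm() under the hood)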
Anyway, I have cobbled together an example to explain the process.
library(caret)
data("iris")
set.seed(1024)
tr <- createDataPartition(iris$Species, list = FALSE)
training <- iris[ tr,]
testing <- iris[-tr,]
#head(training)
fitControl <- trainControl(## smaller values for a quick run
                           method = "repeatedcv",
                           number = 5,
                           repeats = 4)
set.seed(1024)
tunegrid=data.frame(C=c(0.25, 0.5, 1,5,8,12,100))
tunegrid
svmfit <- train(Species ~ ., data = training,
                method = "svmLinear",
                trControl = fitControl,
                tuneGrid = tunegrid)
#print this, it will give model's accuracy (on train data) given various
# parameter values
svmfit
#      C  Accuracy  Kappa
#   0.25 0.9533333  0.930
#   0.50 0.9666667  0.950
#   1.00 0.9766667  0.965
#   5.00 0.9800000  0.970
#   8.00 0.9833333  0.975
#  12.00 0.9833333  0.975
# 100.00 0.9400000  0.910
#The final value used for the model was C = 8.
# it has already chosen the best model (as per train Accuracy )
# how well does it work on test data?
preds <-predict(svmfit, testing)
cmSVM <-confusionMatrix(preds, testing$Species)
print(cmSVM)
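Once train() has finished, the chosen value and the full grid of resampling results can also be inspected directly (a brief sketch; bestTune, results, and plot() are standard parts of a caret train object):
svmfit$bestTune    # the C value caret selected (C = 8 above)
svmfit$results     # Accuracy/Kappa for every C in the tuning grid
plot(svmfit)       # accuracy as a function of C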

R Caret Random Forest AUC too good to be true?

Relative newbie to predictive modeling here; most of my training/experience is in inferential stats. I'm trying to predict student college graduation in 4 years.
The basic setup: I've done data cleaning (imputing, centering, scaling); split that processed/transformed data into training (70%) and testing (30%) sets; balanced the data using two approaches, ROSE "both" and SMOTE, because the outcome was 65% 0 / 35% 1 (I've found inconsistent advice on what counts as unbalanced, but one source suggested anything outside a 40/60 range); and ran random forests.
For the ROSE "BOTH" models I got 0.9242 accuracy on the training set and AUC of 0.9268 for the test set.
For the SMOTE model I got 0.9943 accuracy on the training set and AUC of 0.9971 on the test set.
More details on model performance are embedded in the code copied below.
This just seems too good to be true. From what I've been able to find, slightly better performance on the test set would not indicate overfitting (it would be the other way around). So is this model's performance really that good, or is it too good to be true? I have not been able to find a direct answer to this question via SO searches.
Also, in a few weeks I'll have another cohort of data I can run this on. I suppose that could serve as another "test" set, correct? Then I can apply the model to the newest cohort, for which we want to know the likelihood of graduating in 4 years.
Many thanks,
Brian
#Used for predictive modeling of 4-year graduation
#IMPORT DATA
library(haven)
grad4yr <- [file path]
#DETERMINE DATA BALANCE/UNBALANCE
prop.table(table(grad4yr$graduate_4_yrs))
# 0=0.6492, 1=0.3517
#convert to factor so next step doesn't impute outcome variable
grad4yr$graduate_4_yrs <- as.factor(grad4yr$graduate_4_yrs)
#Preprocess data, RANN package used
library('RANN')
#Create preprocessed-values object, which includes centering, scaling, and imputing missing values using KNN
Processed_Values <- preProcess(grad4yr, method = c("knnImpute","center","scale"))
#Create new dataset with imputed values and centering/scaling
#Confirmed this results in 0 cases with missing values
grad4yr_data_processed <- predict(Processed_Values, grad4yr)
#Confirm last step results in 0 cases with missing values
sum(is.na(grad4yr_data_processed))
#[1] 0
#Ensure the outcome variable is a factor before the dummify step (next)
grad4yr_data_processed$graduate_4_yrs <- as.factor(grad4yr_data_processed$graduate_4_yrs)
#Convert all factor variables to dummy variables; fullrank used to omit one of new dummy vars in each
#set.
dmy <- dummyVars("~ .", data = grad4yr_data_processed, fullRank = TRUE)
#Create new dataset that has the data imputed AND transformed to have dummy variables for all variables that
#will go in models.
grad4yr_processed_transformed <- data.frame(predict(dmy,newdata = grad4yr_data_processed))
#Convert outcome variable back to a binary factor for the predictive models, and re-create the variable under its original name
#(not entirely sure why the last step created a new version of the outcome var with ".1" at the end)
grad4yr_processed_transformed$graduate_4_yrs.1 <- as.factor(grad4yr_processed_transformed$graduate_4_yrs.1)
grad4yr_processed_transformed$graduate_4_yrs <- as.factor(grad4yr_processed_transformed$graduate_4_yrs)
grad4yr_processed_transformed$graduate_4_yrs.1 <- NULL
#Split data into training and testing/validation datasets based on outcome at 70%/30%
index <- createDataPartition(grad4yr_processed_transformed$graduate_4_yrs, p=0.70, list=FALSE)
trainSet <- grad4yr_processed_transformed[index,]
testSet <- grad4yr_processed_transformed[-index,]
#load caret
library(caret)
#Feature selection using rfe in caret, used with profile/comparison
control <- rfeControl(functions = rfFuncs,
                      method = "repeatedcv",
                      repeats = 10, #using k=10 per Kuhn & Johnson pp70; and per James et al pp
                      #https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
                      verbose = FALSE)
#create trainControl using repeated cross-validation, 10 folds repeated 5 times
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 5,
                           search = "random")
#Set the outcome variable object
grad4yrs <- 'graduate_4_yrs'
#set predictor variables object
predictors <- names(trainSet[!names(trainSet) %in% grad4yrs])
#create predictor profile to see where prediction is best (by number of variables)
grad4yr_pred_profile <- rfe(trainSet[,predictors],trainSet[,grad4yrs],rfeControl = control)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#
# Resampling performance over subset size:
#
# Variables Accuracy Kappa AccuracySD KappaSD Selected
# 4 0.6877 0.2875 0.03605 0.08618
# 8 0.7057 0.3078 0.03461 0.08465 *
# 16 0.7006 0.2993 0.03286 0.08036
# 40 0.6949 0.2710 0.03330 0.08157
#
# The top 5 variables (out of 8):
# Transfer_Credits, HS_RANK, Admit_Term_Credits_Taken, first_enroll, Admit_ReasonUT10
#see data structure
str(trainSet)
#not copying output here, but confirms outcome var is factor and everything else is numeric
#given the 65/35 split on the outcome var and what I can find about unbalanced data, I'm treating it as unbalanced and taking steps to balance.
#using ROSE "both" and SMOTE to see how differently they perform. Also ran under/over-sampling with ROSE, but those didn't perform nearly as
#well, so they were removed from this script.
#SMOTE to balance data on the processed/dummified dataset
library(DMwR)#https://www3.nd.edu/~dial/publications/chawla2005data.pdf for justification
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, perc.over=600, perc.under=100)
#see how balanced SMOTE resulting dataset is
prop.table(table(train.SMOTE$graduate_4_yrs))
#0 1
#0.4615385 0.5384615
#open ROSE package/library
library("ROSE")
#ROSE to balance data (using BOTH) on the processed/dummified dataset
train.both <- ovun.sample(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, method = "both", p=.5,
N = 2346)$data
#see how balanced BOTH resulting dataset is
prop.table(table(train.both$graduate_4_yrs))
#0 1
#0.4987212 0.5012788
#original (unbalanced) outcome counts, for comparison
table(grad4yr_processed_transformed$graduate_4_yrs)
#0 1
#1144 618
library("caret")
#create random forests using balanced data from above
RF_model_both <- train(train.both[,predictors],train.both[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "BOTH" training model
# print(RF_model_both)
# Random Forest
#
# 2346 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 2112, 2111, 2111, 2112, 2111, 2112, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 8 0.9055406 0.8110631
# 11 0.9053719 0.8107246
# 12 0.9057981 0.8115770
# 13 0.9054584 0.8108965
# 14 0.9048602 0.8097018
# 20 0.9034992 0.8069796
# 26 0.9027307 0.8054427
# 30 0.9034152 0.8068113
# 38 0.9023899 0.8047622
# 40 0.9032428 0.8064672
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 12.
RF_model_SMOTE <- train(train.SMOTE[,predictors],train.SMOTE[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "SMOTE" training model
# print(RF_model_SMOTE)
# Random Forest
#
# 8034 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 7231, 7231, 7230, 7230, 7231, 7231, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 17 0.9449082 0.8899939
# 19 0.9458047 0.8917740
# 21 0.9458543 0.8918695
# 29 0.9470243 0.8941794
# 31 0.9468750 0.8938864
# 35 0.9468003 0.8937290
# 36 0.9463772 0.8928876
# 40 0.9463275 0.8927828
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 29.
#Given that both accuracy and kappa appear better in the "SMOTE" random forest it's looking like it's the better model.
#But, running ROC/AUC on both to see how they both perform on validation data.
#Create predictions based on random forests above
rf_both_predictions <- predict.train(object=RF_model_both,testSet[, predictors], type ="raw")
rf_SMOTE_predictions <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="raw")
#Create predicted class probabilities based on the random forests above
rf_both_pred_prob <- predict.train(object=RF_model_both,testSet[, predictors], type ="prob")
rf_SMOTE_pred_prob <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="prob")
#create Random Forest confusion matrix to evaluate random forests
confusionMatrix(rf_both_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 315 12
# 1 28 173
#
# Accuracy : 0.9242
# 95% CI : (0.8983, 0.9453)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : < 2e-16
#
# Kappa : 0.8368
# Mcnemar's Test P-Value : 0.01771
#
# Sensitivity : 0.9351
# Specificity : 0.9184
# Pos Pred Value : 0.8607
# Neg Pred Value : 0.9633
# Prevalence : 0.3504
# Detection Rate : 0.3277
# Detection Prevalence : 0.3807
# Balanced Accuracy : 0.9268
#
# 'Positive' Class : 1
# confusionMatrix(rf_under_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
#Accuracy : 0.8258
#only copied accuracy as it was fairly far below the two other versions
confusionMatrix(rf_SMOTE_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 340 0
# 1 3 185
#
# Accuracy : 0.9943
# 95% CI : (0.9835, 0.9988)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : <2e-16
#
# Kappa : 0.9876
# Mcnemar's Test P-Value : 0.2482
#
# Sensitivity : 1.0000
# Specificity : 0.9913
# Pos Pred Value : 0.9840
# Neg Pred Value : 1.0000
# Prevalence : 0.3504
# Detection Rate : 0.3504
# Detection Prevalence : 0.3561
# Balanced Accuracy : 0.9956
#
# 'Positive' Class : 1
#put predictions and probabilities in the test set
testSet$rf_both_pred <- rf_both_predictions   # class predictions (BOTH)
testSet$rf_SMOTE_pred <- rf_SMOTE_predictions # class predictions (SMOTE)
testSet$rf_both_prob <- rf_both_pred_prob     # class probabilities (BOTH)
testSet$rf_SMOTE_prob <- rf_SMOTE_pred_prob   # class probabilities (SMOTE)
library(pROC)
#get AUC of the BOTH predictions
testSet$rf_both_pred <- as.numeric(testSet$rf_both_pred)
Both_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_both_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(Both_ROC_Curve)
# Area under the curve: 0.9268
#get AUC of the SMOTE predictions
testSet$rf_SMOTE_pred <- as.numeric(testSet$rf_SMOTE_pred)
SMOTE_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_SMOTE_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(SMOTE_ROC_Curve)
#Area under the curve: 0.9971
#So, the SMOTE balanced data performed very well on training data and near perfect on the validation/test data.
#But, it seems almost too good to be true.
#Is there anything I might have missed or performed incorrectly?
I'll post my comment as an answer, even though this question might be migrated.
I really think you're getting over-optimistic results because you balanced the whole dataset.
Instead, you should balance only the training set.
Here is your code:
library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed,
perc.over=600, perc.under=100)
By doing so, your train.SMOTE now contains information from the test set too, so when you test on your testSet the model will already have seen part of that data; this is most likely the cause of your "too good" results.
It should be:
library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=trainSet, # use only the train set
perc.over=600, perc.under=100)
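A minimal sketch of the corrected sequence, reusing the object names from the question (split first, balance only the training rows, then evaluate on the untouched testSet):
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data = trainSet,
                     perc.over = 600, perc.under = 100)
RF_model_SMOTE <- train(train.SMOTE[, predictors], train.SMOTE[, grad4yrs],
                        method = "rf", trControl = fitControl,
                        ntree = 1000, tuneLength = 10)
smote_preds <- predict(RF_model_SMOTE, testSet[, predictors])
confusionMatrix(smote_preds, testSet[, grad4yrs], positive = "1")   # honest test-set estimate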

How to turn off k-fold cross-validation in rpart() in R

I have a Bitcoin time series, I use 11 technical indicators as features, and I want to fit a regression tree to the data. As far as I know, there are two functions in R which can create regression trees, i.e. rpart() and tree(), but neither seems appropriate. rpart() uses k-fold cross-validation to validate the optimal cost-complexity parameter cp, and in tree() it is not possible to specify the value of cp.
I am aware that cv.tree() looks for the optimal value of cp via cross-validation, but again, cv.tree() uses k-fold cross-validation. Since I have a time series, and therefore temporal dependencies, I do not want to use k-fold cross-validation: it randomly divides the data into k folds, fits the model on k-1 folds, and calculates the MSE on the left-out k-th fold, and then the sequence of my time series is obviously ruined.
I found an argument of the rpart() function, xval, which is supposed to let me specify the number of cross-validations, but when I look at the output of the rpart() call with xval=0, it doesn't seem like cross-validation is turned off. Below you can see my function call and the output:
tree.model <- rpart(Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI + DKD + R + FI + DVI + OBV,
                    data = train.subset, method = "anova",
                    control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))
> summary(tree.model)
Call:
rpart(formula = Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI +
DKD + R + FI + DVI + OBV, data = train.subset, method = "anova",
control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))
n= 590
CP nsplit rel error
1 0.35433076 0 1.0000000
2 0.10981049 1 0.6456692
3 0.06070669 2 0.5358587
4 0.04154720 3 0.4751521
5 0.02415633 5 0.3920576
6 0.02265346 6 0.3679013
7 0.02139752 8 0.3225944
8 0.02096500 9 0.3011969
9 0.02086543 10 0.2802319
10 0.01675277 11 0.2593665
11 0.01551861 13 0.2258609
12 0.01388126 14 0.2103423
13 0.01161287 15 0.1964610
14 0.01127722 16 0.1848482
15 0.01000000 18 0.1622937
It seems like rpart() cross-validated 15 different values of cp. If these values were tested with k-fold cross-validation, then again the sequence of my time series would be ruined and I basically could not use these results. Does anyone know how I can effectively turn off cross-validation in rpart(), or how to vary the value of cp in tree()?
UPDATE: I followed the suggestion of one of our colleagues and set xval=1, but that didn't seem to solve the problem. You can see the full function output with xval=1 here. Btw, parameters[j] is the j-th element of a parameter vector. When I called this function, parameters[j] = 0.0009765625.
Many thanks in advance
To demonstrate that rpart() is creating tree nodes by iterating over declining values of cp rather than by resampling, we'll use the Ozone data from the mlbench package to compare the results of rpart() and caret::train(), as discussed in the comments to the OP. We'll set up the Ozone data as illustrated in the CRAN documentation for Support Vector Machines, which support nonlinear regression and are comparable to rpart().
library(rpart)
library(caret)
data(Ozone, package = "mlbench")
# split into test and training
index <- 1:nrow(Ozone)
set.seed(01381708)
testIndex <- sample(index, trunc(length(index) / 3))
testset <- na.omit(Ozone[testIndex,-3])
trainset <- na.omit(Ozone[-testIndex,-3])
# rpart version
set.seed(95014) #reset seed to ensure sample is same as caret version
rpart.model <- rpart(V4 ~ .,data = trainset,xval=0)
# summary(rpart.model)
# calculate MSE (mean squared prediction error)
rpart.pred <- predict(rpart.model, testset[,-3])
crossprod(rpart.pred - testset[,3]) / length(testIndex)
...and the output of the MSE calculation:
> crossprod(rpart.pred - testset[,3]) / length(testIndex)
[,1]
[1,] 18.25507
Next, we'll run the same analysis with caret::train() as proposed in the comments to the OP.
# caret version
set.seed(95014)
rpart.model <- caret::train(x = trainset[,-3],
                            y = trainset[,3],
                            method = "rpart",
                            trControl = trainControl(method = "none"),
                            metric = "RMSE",
                            tuneGrid = data.frame(cp = 0.01),
                            preProcess = c("center", "scale"),
                            xval = 0, minbucket = 5)
# summary(rpart.model)
# demonstrate caret version did not do resampling
rpart.model
# calculate MSE, which matches the MSE from rpart()
rpart.pred <- predict(rpart.model, testset[,-3])
crossprod(rpart.pred - testset[,3]) / length(testIndex)
When we print the model output from caret::train() it clearly notes that there was no resampling.
> rpart.model
CART
135 samples
11 predictor
Pre-processing: centered (9), scaled (9), ignore (2)
Resampling: None
The test-set MSE for the caret::train() version matches the MSE from rpart().
> # calculate MSE, which matches the MSE from rpart()
> rpart.pred <- predict(rpart.model, testset[,-3])
> crossprod(rpart.pred - testset[,3]) / length(testIndex)
[,1]
[1,] 18.25507
>
Conclusions
First, as configured above, neither caret::train() nor rpart() is resampling. If one prints the model summary, however, one will see that multiple values of cp are used to generate the final tree by both techniques.
Output from caret summary(rpart.model)
CP nsplit rel error
1 0.58951537 0 1.0000000
2 0.08544094 1 0.4104846
3 0.05237152 2 0.3250437
4 0.04686890 3 0.2726722
5 0.03603843 4 0.2258033
6 0.02651451 5 0.1897648
7 0.02194866 6 0.1632503
8 0.01000000 7 0.1413017
Output from rpart summary(rpart.model)
CP nsplit rel error
1 0.58951537 0 1.0000000
2 0.08544094 1 0.4104846
3 0.05237152 2 0.3250437
4 0.04686890 3 0.2726722
5 0.03603843 4 0.2258033
6 0.02651451 5 0.1897648
7 0.02194866 6 0.1632503
8 0.01000000 7 0.1413017
Second, both models account for time values via the inclusion of month and day variables as independent variables. In the Ozone data set, V1 is the month variable, and V2 is the day variable. All data was collected during 1976, so there is no year variable included in the data set, and in the original analysis in the svm vignette, day of week was dropped prior to analysis.
Third, to account for other time-based effects using algorithms like rpart() or svm() when date attributes are not used as features in the model, one must include lag effects as features in the model because these algorithms do not directly account for a time component. One example of how to do this with an ensemble of regression trees using a range of lagged values is Ensemble Regression Trees for Time Series Predictions.
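For instance, lagged copies of the response can be added as ordinary columns before fitting (a minimal sketch; prices and Close are hypothetical names, not from the question):
prices$Close_lag1 <- c(NA, head(prices$Close, -1))      # value one step back
prices$Close_lag2 <- c(NA, NA, head(prices$Close, -2))  # value two steps back
prices <- na.omit(prices)   # drop rows whose lags are undefined before fitting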
In your model, xval=0 does turn off cross-validation.
In your output you only have CP, nsplit and rel error; with cross-validation you would also have xerror and xstd columns.
cp is just the complexity parameter (cp=0.01 by default), reported from 1 down to 0.01.
rel error is the prediction error on your training set, expressed relative to the expected loss at the root node.
nsplit is the number of splits, i.e. the size of the tree at each value of cp.
See: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
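A quick way to confirm this (a sketch on the cu.summary data that ships with rpart, not your data): with xval = 0 the cptable has no xerror/xstd columns, i.e. no cross-validation was run.
library(rpart)
fit0  <- rpart(Mileage ~ ., data = cu.summary, method = "anova",
               control = rpart.control(cp = 0.01, xval = 0))
fit10 <- rpart(Mileage ~ ., data = cu.summary, method = "anova",
               control = rpart.control(cp = 0.01, xval = 10))
colnames(fit0$cptable)    # "CP" "nsplit" "rel error"
colnames(fit10$cptable)   # adds "xerror" and "xstd"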

How do I evaluate a multinomial classification model in R?

I am currently trying to build a multi-class prediction model to predict a letter out of the 26 letters of the English alphabet. I have built a few models using ANN, SVM, an ensemble, and naive Bayes, but I am stuck at evaluating the accuracy of these models. Although the confusion matrix shows me the letter-wise true and false predictions, I am only able to get an overall accuracy for each model. Is there a way to evaluate a model's accuracy similar to the ROC and AUC values used for binomial classification?
Note: I am currently running the models using the H2O package as it saves me time.
Once you train a model in H2O, if you simply do print(fit), it will show you all the available metrics for that model type. For multiclass problems, I'd recommend h2o.mean_per_class_error().
R code example on the iris dataset:
library(h2o)
h2o.init(nthreads = -1)
data(iris)
fit <- h2o.naiveBayes(x = 1:4,
                      y = 5,
                      training_frame = as.h2o(iris),
                      nfolds = 5)
Once you have the model, you can evaluate its performance using the h2o.performance() function to view all the metrics:
> h2o.performance(fit, xval = TRUE)
H2OMultinomialMetrics: naivebayes
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
Cross-Validation Set Metrics:
=====================
Extract cross-validation frame with `h2o.getFrame("iris")`
MSE: (Extract with `h2o.mse`) 0.03582724
RMSE: (Extract with `h2o.rmse`) 0.1892808
Logloss: (Extract with `h2o.logloss`) 0.1321609
Mean Per-Class Error: 0.04666667
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,xval = TRUE)`
=======================================================================
Top-3 Hit Ratios:
k hit_ratio
1 1 0.953333
2 2 1.000000
3 3 1.000000
Or you can look at a particular metric, like mean_per_class_error:
> h2o.mean_per_class_error(fit, xval = TRUE)
[1] 0.04666667
If you want to view performance on a test set, then you can do the following:
perf <- h2o.performance(fit, test)
h2o.mean_per_class_error(perf)
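The per-class breakdown is also available from the same performance object (a brief sketch; perf is the test-set performance object from the two lines above):
h2o.confusionMatrix(perf)   # class-by-class confusion matrix on the test set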

Create a binary outcome with random forest

I have a dataset that looks like this:
TEAM1 TEAM2 EXPG1 EXPG2 Gewonnen
ADO Den Haag Groningen 1.5950 1.2672 1
I am now trying to predict the column Gewonnen based on EXPG1 and EXPG2. I created a training and test set and am building the following model (all using caret):
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
I can't make a confusion matrix now because my predictions are continuous rather than class labels. You can see that when I do:
pred <- predict(modFit, testing)
head(pred)
it says: 0.5324000 0.7237333 0.2811333 0.8231000 0.8299333 0.9792000
Because I want to make a confusion matrix, I need to turn these into 0/1, and I have the feeling that there should be an option to do this in the model as well.
Any thoughts on what I should change in this model to get 0/1 predictions? I couldn't find it in the documentation:
modFit <- train(Gewonnen~ EXPG1 + EXPG2, data=training, method="rf", prox=TRUE)
First of all, as Tim Biegeleisen says, you should convert your Gewonnen variable to a factor (in both training & test sets), if it is not already:
training$Gewonnen <- as.factor(training$Gewonnen)
testing$Gewonnen <- as.factor(testing$Gewonnen)
After that, the type option in the caret function predict determines what type of response you get for a binary classification problem, i.e. class labels or probabilities. Here is a reproducible example from the caret documentation using the Sonar dataset from the package mlbench:
library(caret)
library(mlbench)
data(Sonar)
str(Sonar$Class)
# Factor w/ 2 levels "M","R": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
modFit <- train(Class ~ ., data=training, method="rf", prox=TRUE)
pred <- predict(modFit, testing, type="prob") # for class probabilities
head(pred)
# M R
# 5 0.442 0.558
# 10 0.276 0.724
# 11 0.096 0.904
# 12 0.360 0.640
# 20 0.654 0.346
# 21 0.522 0.478
pred2 <- predict(modFit, testing, type="raw") # for class labels
head(pred2)
# [1] R R R R M M
# Levels: M R
For the confusion matrix, you will need class labels (i.e. pred2 above):
confusionMatrix(pred2, testing$Class)
# Confusion Matrix and Statistics
# Reference
# Prediction M R
# M 25 6
# R 2 18
This answer is a bit speculative as you omitted some critical details about your data set and I have not worked extensively with the caret package. That being said, it appears that you are running random forests in regression mode, which means that you will end up with a continuous function. This means that predictions can have a response value of 0, 1, or anything in between 0 and 1. If your Gewonnen column only has values of 0 or 1, and you want predicted values to also behave this way, then you can try turning Gewonnen into a categorical variable. As this article discusses, this might tell random forests to run in classification mode instead of regression mode.
training$Gewonnen <- as.factor(training$Gewonnen)
Then build the random forest as you did before, and you should get the responses you want.
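A short sketch putting this together with the question's own object names (assuming training and testing are already split as described):
training$Gewonnen <- as.factor(training$Gewonnen)
testing$Gewonnen  <- as.factor(testing$Gewonnen)
modFit <- train(Gewonnen ~ EXPG1 + EXPG2, data = training, method = "rf", prox = TRUE)
pred   <- predict(modFit, testing)          # class labels now, since Gewonnen is a factor
confusionMatrix(pred, testing$Gewonnen)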
