Obtaining training Error using Caret package in R - r

I am using caret package in order to train a K-Nearest Neigbors algorithm. For this, I am running this code:
Control <- trainControl(method="cv", summaryFunction=twoClassSummary, classProb=T)
tGrid=data.frame(k=1:100)
trainingInfo <- train(Formula, data=trainData, method = "knn",tuneGrid=tGrid,
trControl=Control, metric = "ROC")
As you can see, I am interested in obtain the AUC parameter of the ROC. This code works good but returns the testing error (which the algorithm uses for tuning the k parameter of the model) as the mean of the error of the CrossValidation folds. I am interested in return, in addition of the testing error, the training error (the mean across each fold of the error obtained with the training data). ¿How can I do it?
Thank you

What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: you will have perfect predictions every time.
To do this, you will need to run train again and modify the index and indexOut objects:
library(caret)
set.seed(1)
dat <- twoClassSim(200)
set.seed(2)
folds <- createFolds(dat$Class, returnTrain = TRUE)
Control <- trainControl(method="cv",
summaryFunction=twoClassSummary,
classProb=T,
index = folds,
indexOut = folds)
tGrid=data.frame(k=1:100)
set.seed(3)
a_bad_idea <- train(Class ~ ., data=dat,
method = "knn",
tuneGrid=tGrid,
trControl=Control, metric = "ROC")
Max

Related

R feature selection with LASSO

I have a small data set (37 observations x 23 features) and want to perform feature selection with LASSO regression in order to its reduce dimensionality. To achieve this, I designed the below code based on online tutorials
#Load the libraries
library(mlbench)
library(elasticnet)
library(caret)
#Initialize cross validation and train LASSO
cv_5 <- trainControl(method="cv", number=5)
lasso <- train( ColumnY ~., data=My_Data_Frame, method='lasso', trControl=cv_5)
#Filter out the variables whose coefficients have squeezed to 0
drop <-predict.enet(lasso$finalModel, type='coefficients', s=lasso$bestTune$fraction, mode='fraction')$coefficients
drop<-drop[drop==0]%>%names()
My_Data_Frame<- My_Data_Frame%>%select(-drop)
In most cases the code runs without errors but it occasionally throws the following:
Warning messages:
1: model fit failed for Fold2: fraction=0.9 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
I sense this happens because my data has few rows and some variables have low variance.
Is there a way I can bypass or fix this issue (e.g. setting a parameter in the flow)?
You have a low number of observations, so there's a good chance in some training set, that some of your columns will be all zero, or very low variance. For example:
library(caret)
set.seed(222)
df = data.frame(ColumnY = rnorm(37),matrix(rbinom(37*23,1,p=0.15),ncol=23))
cv_5 <- trainControl(method="cv", number=5)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_5)
Warning messages:
1: model fit failed for Fold4: fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = 0, ...) :
Some of the columns of x have zero variance
Before running below, check that for categorical columns, all of them don't have only 1 positive label..
One way is to increase the cv fold, if you set 5, you are using 80% of the data. Try 10 to use 90% of the data:
cv_10 <- trainControl(method="cv", number=10)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_10)
And as you might have seen.. since the dataset is so small, cross-validation might not offer you that much advantage, you can also do leave one out cross-validation:
tr <- trainControl(method="LOOCV")
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=tr)
You can use the FSinR package to perform feature selection. It is in R and accessible from CRAN. It has a wide variety of filter and wrapper methods that you can combine with search methods. The interface to generate the wrapper evaluator follows the caret interface. For example:
# Load the library
library(FSinR)
# Choose one of the search methods
searcher <- searchAlgorithm('sequentialForwardSelection')
# Choose one of the filter/wrapper evaluators (You can remove the fitting and resampling params if you want to make it simpler)(These are the parameters of the train and trainControl of caret)
resamplingParams <- list(method = "cv", number = 5)
fittingParams <- list(preProc = c("center", "scale"), metric="Accuracy", tuneGrid = expand.grid(k = c(1:20)))
evaluator <- wrapperEvaluator('knn', resamplingParams, fittingParams)
# You make the feature selection (returns the best features)
results <- featureSelection(My_Data_Frame, 'ColumnY', searcher, evaluator)

tuneRF vs caret tunning for random forest

I've trying to tune a random forest model using the tuneRF tool included in the randomForest Package and I'm also using the caret package to tune my model. The issue is that I'm tunning to get mtry and I'm getting different results for each approach. The question is how do I know which is the best approach and base on what? I'm not clear if I should expect similar or different results.
tuneRF: with this approach I'm getting the best mtry is 3
t <- tuneRF(train[,-12], train[,12],
stepFactor = 0.5,
plot = TRUE,
ntreeTry = 100,
trace = TRUE,
improve = 0.05)
caret: With this approach I'm always getting that the best mtry is all variables in this case 6
control <- trainControl(method="cv", number=5)
tunegrid <- expand.grid(.mtry=c(2:6))
set.seed(2)
custom <- train(CRTOT_03~., data=train, method="rf", metric="rmse",
tuneGrid=tunegrid, ntree = 100, trControl=control)
There are a few differences, for each mtry parameters, tuneRF fits one model on the whole dataset, and you get the OOB error from each of these fit. tuneRF then takes the lowest OOB error. For each value of mtry, you have one score (or RMSE value) and this will change with different runs.
In caret, you actually do cross-validation, so the test data from the fold was not used at all in the model. Though in principle it should be similar to OOB, you should be aware of the differences.
A evaluation with a better picture on the error might be to run tuneRF a few rounds, and we can use cv in caret:
library(randomForest)
library(mlbench)
data(BostonHousing)
train <- BostonHousing
tuneRF_res = lapply(1:10,function(i){
tr = tuneRF(train[,-14], train[,14],mtryStart=2,step=0.9,ntreeTry = 100,trace = TRUE,improve=1e-5)
tr = data.frame(tr)
tr$RMSE = sqrt(tr[,2])
tr
})
tuneRF_res = do.call(rbind,tuneRF_res)
control <- trainControl(method="cv", number=10,returnResamp="all")
tunegrid <- expand.grid(.mtry=c(2:7))
caret_res <- train(medv ~., data=train, method="rf", metric="RMSE",
tuneGrid=tunegrid, ntree = 100, trControl=control)
library(ggplot2)
df = rbind(
data.frame(tuneRF_res[,c("mtry","RMSE")],test="tuneRF"),
data.frame(caret_res$resample[,c("mtry","RMSE")],test="caret")
)
df = df[df$mtry!=1,]
ggplot(df,aes(x=mtry,y=RMSE,col=test))+
stat_summary(fun.data=mean_se,geom="errorbar",width=0.2) +
stat_summary(fun=mean,geom="line") + facet_wrap(~test)
You can see the trend is more or less similar. My suggestion would be to use tuneRF to quickly check the range of mtrys to train over, then use caret, cross-validation to properly evaluate this.

Difference between Model and $FinalModel for classification in R?

Currently got this Random Forest model, just seeing how well it predicts those with diabetes positive or diabetes negative
Model is calculated using the caret workflow
when looking at variable importance i was told to use the code
randomForest::importance(model$finalModel)
what is the purpose of $finalModel? what is $finalModel as compared to just the original model? should it not be just be the original model passed in as the argument instead to view variable importance?
example below:
library(tidyverse)
library(mlbench)
library(caret)
library(car)
library(glmnet)
library(rpart.plot)
library(rpart)
data("PimaIndiansDiabetes2")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples,]
test.data <- PimaIndiansDiabetes2[-training.samples,]
model_rf <- caret::train(
diabetes ~.,
data = train.data,
method = "rf",
trControl = trainControl("cv", number = 10),
importance = TRUE)
model_rf
model_rf$bestTune
model_rf$finalModel
# variable importance here
importance(model_rf$finalModel)
From the documentation:
finalModel A fit object using the best parameters
Most of the times with train you pass some different values for hyper-parameter estimation, to find the values that achieve the best performance (using trainControl).
Inside model_rf you'll find under finalModel the model built with the best parameters.
FYI caret also has a function for variable importance plotting: varImp.

trainControl in caret package

In caret package, there is a thing called trainControl that allow us to perform variety of cross validation. To perform 10-fold cross-validation, one would use
fitControl <- trainControl(method= "repeatedcv", number = 10, repeats = 10)
fitJ48_10_fold <- train(x = x, y =y, method = "J48", trControl= fitControl)
while for training set, it is
fitControl <- trainControl(method= "none")
fitJ48train <- train(x = x, y =y, method = "J48", trControl= fitControl)
However, confusion matrix of these model show the same for both 10-fold and training.
Activity <- predict(fitJ48_10_fold, newdata = Train)
confusionMatrix(Activity, Train$Activity)
Activity <- predict(fitJ48train, newdata = Train)
confusionMatrix(Activity, Train$Activity)
I used the weka classifier GUI and indeed the performance of J48 from 10-fold cross validation is lower than that of training set. Am I wrong to suspect that the trainControl from caret isn't working or I pass this in a wrong way?
Am I wrong to suspect that the trainControl from caret isn't working or I pass this in a wrong way?
A little. For J48, there is a tuning parameter but the default grid only fits a single value C = 0.25. The final model will be the same no matter what value of method that you use in trainControl so the confusion matrices will always be the same.
Max

caret: different RMSE on the same data

I think, that my problem is quite weird. When I use RMSE metric for the best model selection by train function, I obtain different RMSE value from computed by my own function on the same data. Where ist the problem? Does my function work wrong?
library(caret)
library(car)
library(nnet)
data(oil)
ztest=fattyAcids[c(81:96),]
fit<-list(r1=c(1:80))
pred<-list(r1=c(81:96))
ctrl <- trainControl(method = "LGOCV",index=fit,indexOut=pred)
model <- train(Palmitic~Stearic+Oleic+Linoleic+Linolenic+Eicosanoic,
fattyAcids,
method='nnet',
linout=TRUE,
trace=F,
maxit=10000,
skip=F,
metric="RMSE",
tuneGrid=expand.grid(.size=c(10,11,12,9),.decay=c(0.005,0.001,0.01)),
trControl = ctrl,
preProcess = c("range"))
model
forecast <- predict(model, ztest)
Blad<-function(zmienna,prognoza){
RMSE<-((sum((zmienna-prognoza)^2))/length(zmienna))^(1/2)
estymatory <- c(RMSE)
names(estymatory) <-c('RMSE')
estymatory
}
Blad(ztest$Palmitic,forecast)
The resampled estimates shown in the output of train are calculated using rows 81:96. Once train figures out the right tuning parameter settings, it refits using all the data (1:96). The model from that data is used to make the new predictions.
For this reason, the model performance
> getTrainPerf(model)
TrainRMSE TrainRsquared method
1 0.9230175 0.8364212 nnet
is worse than the other predictions:
> Blad(ztest$Palmitic,forecast)
RMSE
0.3355387
The predictions in forecast are created from a model that included those same data points, which is why it looks better.
Max

Resources