Issues with the caret predict function when using a caretStack object - R

I have been using the caretEnsemble and caret packages for stacking. My data is a document-term matrix with some additional features such as POS tags, and the goal is two-class sentiment analysis. "sentitr" denotes the vector of sentiment labels for the training observations, "sentitest" the vector for the test set.
I use a 60:40 train/test split.
control <- trainControl(method = "cv", number = 10, savePredictions = "final", classProbs = TRUE,
                        summaryFunction = twoClassSummary,
                        index = createResample(sentitr, 10))
algorithmList <- c('pda', 'nnet', 'gbm', 'svmLinear', 'rf', 'C5.0', 'glmnet')
models <- caretList(trainset, sentitr, trControl=control, methodList=algorithmList)
# some model info
summary(models)
res = resamples(models)
summary(res)
modelCor(res)
# lda and nnet extremely closely correlated
stackcontrol <- trainControl(method = "cv", number = 5, savePredictions = "final", classProbs = TRUE,
                             summaryFunction = twoClassSummary)
# stacks
stack.c5.0 <- caretStack(models, method="C5.0", metric="ROC", trControl=stackcontrol)
summary(stack.c5.0)
stack.c50.pred = predict(stack.c5.0, newdata = testset, type = "raw")
stackc50.conf = confusionMatrix(stack.c50.pred, sentitest)
I ran the whole pipeline 10 times, each time randomly partitioning the data into a 60/40 training/test split, and recorded the following test-set classification accuracies (extracted from the confusion matrices):
run  accuracy
1    0.3225
2    0.2550
3    0.7500
4    0.2675
5    0.2950
6    0.7825
7    0.2575
8    0.2875
9    0.2900
10   0.3275
These are the outputs. As you can see, an accuracy of around 75-80% is achieved in two of the iterations. This is what I expect, and it mirrors the results I get from fitting single models. In the remaining iterations, however, the accuracy is extremely poor; it almost seems as if the model randomly swaps accuracy and test error.
Any ideas what causes this behaviour?
In every iteration where the prediction comes out this badly, I get the following warnings when training the caretStack:
2: In predict.C5.0(modelFit, newdata, trial = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials
3: In predict.C5.0(modelFit, newdata, type = "prob", trials = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials
4: In predict.C5.0(modelFit, newdata, trial = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials
5: In predict.C5.0(modelFit, newdata, type = "prob", trials = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials

Related

Setting C for Linear SVM

Here's my question:
I have a medium-sized data set about the condition of a hydraulic system.
The data set consists of 68 variables plus the condition of the system (green, yellow, red); regarding the conditions, the colours mean red - warning, yellow - pay attention, green - good.
I have to use several classifiers to predict the behaviour of the system, so I have divided my data set into a training and a test set as follows.
This is what I wrote:
Tab$Condition=factor(Tab$Condition, labels=c("Yellow","Green","Red"))
set.seed(32343)
reg_Control = trainControl("repeatedcv", number = 5, repeats=5, verboseIter = T, classProbs =T)
inTrain = createDataPartition(y=Tab$Condition,p=0.75, list=FALSE)
training = Tab[inTrain,]
testing = Tab[-inTrain,]
I'm using a linear SVM classifier to predict the behaviour of the system.
I started by using a random value for C to see what kind of results I should get.
svmLinear = train(Condition ~.,data=training, method="svmLinear", trControl=reg_Control,tuneGrid=data.frame(C=seq(0.1,1,0.1)))
svmLPredictions = predict(svmLinear,newdata=training)
confusionMatrix(svmLPredictions,training$Condition)
#misclassification of 129/1655 accuracy of 92.21%
svmLPred = predict(svmLinear,newdata=testing)
confusionMatrix(svmLPred,testing$Condition)
#misclassification of 41/550 accuracy of 92.55%
As I said before, I started with a RANDOM VALUE FOR C.
How do I then decide on the best value of C to use for the analysis?
Sorry if the question is banal, but I'm a beginner. Any answers will be helpful, thanks!
caret calls other packages to run the actual modelling process; caret itself is only a (very powerful) convenience package in this regard. However, it does this automatically, so a user might not easily realize it unless an error is thrown.
Anyway, I have cobbled together an example to explain the process.
library(caret)
data("iris")
set.seed(1024)
tr <- createDataPartition(iris$Species, list = FALSE)
training <- iris[ tr,]
testing <- iris[-tr,]
#head(training)
fitControl <- trainControl(##smaller values for quick run
method = "repeatedcv",
number = 5,
repeats = 4)
set.seed(1024)
tunegrid <- data.frame(C = c(0.25, 0.5, 1, 5, 8, 12, 100))
tunegrid
svmfit <- train(Species ~ ., data = training,
method = "svmLinear",
trControl = fitControl,
tuneGrid= tunegrid)
# print this; it shows the model's cross-validated accuracy (resampled on the
# training data) for the various parameter values
svmfit
#C Accuracy Kappa
#0.25 0.9533333 0.930
#0.50 0.9666667 0.950
#1.00 0.9766667 0.965
#5.00 0.9800000 0.970
#8.00 0.9833333 0.975
#12.00 0.9833333 0.975
#100.00 0.9400000 0.910
#The final value used for the model was C = 8.
# caret has already chosen the best model (by resampled Accuracy)
# how well does it work on test data?
preds <-predict(svmfit, testing)
cmSVM <-confusionMatrix(preds, testing$Species)
print(cmSVM)
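To see afterwards which value of C was chosen, the fitted train object stores it; a quick sketch using the svmfit object from the example above:
svmfit$bestTune   # the value of C selected by resampled accuracy
svmfit$results    # resampled accuracy for every C in the grid
plot(svmfit)      # optional: plot accuracy as a function of C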

R RF unbalanced classes low negative predicted value on unseen data compared to train

I have built a random forest model for predicting whether a customer's operations are fraudulent or not. It is a large and quite unbalanced sample, with 3% fraud cases, and I want to predict the minority class (fraud).
I balance the data (50% each class) and build the RF. So far I have a good model with an overall accuracy of ~80% and more than 70% of the fraud cases predicted correctly. But when I try the model on unseen data (the test set), although the overall accuracy is still good, the negative predictive value (fraud) is really low compared to training (only 13% vs more than 70%).
I have tried increasing the sample size, increasing the balanced categories, tuning RF parameters, ..., but none of it has worked well; the results stay similar. Am I overfitting somehow? What can I do to improve fraud detection (negative predictive value) on unseen data?
Here is the code and results:
library(randomForest)
library(ROSE)    # for ovun.sample()
library(caret)   # for confusionMatrix()
set.seed(1234)
#train and test sets
model <- sample(nrow(dataset), 0.7 * nrow(dataset))
train <- dataset[model, ]
test <- dataset[-model, ]
#Balance the data
balanced <- ovun.sample(custom21_type ~ ., data = train, method = "over",p = 0.5, seed = 1)$data
table(balanced$custom21_type)
0 1
5813 5861
#build the RF
rf5 = randomForest(custom21_type~.,ntree = 100,data = balanced,importance = TRUE,mtry=3,keep.inbag=TRUE)
rf5
Call:
randomForest(formula = custom21_type ~ ., data = balanced, ntree = 100, importance = TRUE, mtry = 3, keep.inbag = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 3
OOB estimate of error rate: 21.47%
Confusion matrix:
0 1 class.error
0 4713 1100 0.1892310
1 1406 4455 0.2398908
#test on unseen data
predicted <- predict(rf5, newdata=test)
confusionMatrix(predicted,test$custom21_type)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 59722 559
1 13188 1938
Accuracy : 0.8177
95% CI : (0.8149, 0.8204)
No Information Rate : 0.9669
P-Value [Acc > NIR] : 1
Kappa : 0.1729
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8191
Specificity : 0.7761
Pos Pred Value : 0.9907
Neg Pred Value : 0.1281
Prevalence : 0.9669
Detection Rate : 0.7920
Detection Prevalence : 0.7994
Balanced Accuracy : 0.7976
'Positive' Class : 0
First, I notice that you are not using any cross-validation. Including it will add variation to the data used for training and will help reduce overfitting. Additionally, we are going to use C5.0 in place of randomForest because it is more robust and penalizes type 1 errors more heavily.
One thing you may consider is not using a 50-50 balance in the training data, but making it more like 80-20, so that the minority class is not over-sampled quite so heavily. I am sure this is what is leading to the overfitting and to your model's failure to classify novel examples as negative; a sketch of that rebalancing step follows.
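As a sketch of that rebalancing step, reusing ovun.sample() from the ROSE package exactly as in your own code (only the value of p changes):
library(ROSE)
# over-sample fraud cases only until they make up roughly 20% of the training data
balanced <- ovun.sample(custom21_type ~ ., data = train,
                        method = "over", p = 0.2, seed = 1)$data
table(balanced$custom21_type)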
RUN THIS AFTER YOU CREATE THE RE-BALANCED DATA (p=.2)
library(caret)
#set up you cross validation
Control <- trainControl(
summaryFunction = twoClassSummary, #displays model score not confusion matrix
classProbs = TRUE, #important for the summaryFunction
verboseIter = TRUE, #prints the training log
savePredictions = TRUE,
method = "repeatedcv", #repeated cross validation, 10 folds, 3 times
repeats = 3,
number = 10,
allowParallel = TRUE
)
Now, I read in the comments that all your variables are categorical. That suits Naive Bayes well. However, if you have any numeric data you will need to preprocess it (scale, normalize, and impute NAs) as is standard procedure. We are also going to implement a grid-searching process.
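For the numeric case, a minimal preprocessing sketch with caret's preProcess() (the method names are standard options; adjust or skip this step as your data requires):
# centre/scale numeric predictors and impute missing values with the median
num_cols <- sapply(balanced, is.numeric)
pp <- preProcess(balanced[, num_cols, drop = FALSE],
                 method = c("center", "scale", "medianImpute"))
balanced[, num_cols] <- predict(pp, balanced[, num_cols, drop = FALSE])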
IF YOUR DATA IS ALL CATEGORICAL
model_nb <- train(
x = balanced[,-which(colnames(balanced) %in% "custom21_type")],
y= balanced$custom21_type,
metric = "ROC",
method = "nb",
trControl = Control,
tuneGrid = data.frame(fL=c(0,0.5,1.0), usekernel = TRUE,
adjust=c(0,0.5,1.0)))
IF YOU WOULD LIKE A RF APPROACH (make sure to preprocess if data is numeric)
model_C5 <- train(
x = balanced[,-which(colnames(balanced) %in% "custom21_type")],
y = balanced$custom21_type,
metric = "ROC",
method = "C5.0",
trControl = Control,
tuneGrid = expand.grid(.model = "tree", .trials = c(1,5,10), .winnow = FALSE))
Now we predict
C5_predict<-predict(model_C5, test, type = "raw")
NB_predict<-predict(model_nb, test, type = "raw")
confusionMatrix(C5_predict,test$custom21_type)
confusionMatrix(NB_predict,test$custom21_type)
EDIT:
Try adjusting the cost matrix below. What this one does is penalize type 2 errors twice as heavily as type 1 errors.
cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2)
rownames(cost_mat) <- colnames(cost_mat) <- c("bad", "good")  # use your actual class levels here
# C5.0() comes from the C50 package
cost_mod <- C5.0(x = balanced[,-which(colnames(balanced) %in% "custom21_type")],
                 y = balanced$custom21_type,
                 costs = cost_mat)
summary(cost_mod)
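Scoring that cost-sensitive model on the test set then goes through the usual predict() method (a sketch, assuming test carries the same columns as balanced):
cost_pred <- predict(cost_mod,
                     newdata = test[,-which(colnames(test) %in% "custom21_type")])
confusionMatrix(cost_pred, test$custom21_type)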
EDIT 2:
predicted <- predict(rf5, newdata=test, type="prob")
will give you the actual class probabilities for each prediction. The default cut-off is 0.5, i.e. everything with a class-0 probability above 0.5 gets classified as 0 and everything below as 1, so you can adjust this cutoff to help with the unbalanced classes:
ifelse(predicted[,1] < .4, 1, predicted[,1])
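A minimal end-to-end sketch of applying such a cutoff and scoring it (the 0.4 threshold and the direction of the rule are illustrative; shift them towards whichever class you want to favour):
probs <- predict(rf5, newdata = test, type = "prob")
# call class "1" (fraud) whenever the probability of class "0" falls below the cutoff
pred_cut <- factor(ifelse(probs[, "0"] < 0.4, "1", "0"),
                   levels = levels(test$custom21_type))
confusionMatrix(pred_cut, test$custom21_type)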

Feature selection with caret rfe and training with another method

Right now I'm trying to use caret's rfe function to perform feature selection, because I'm in a situation with p >> n and most regression techniques that don't involve some sort of regularisation can't be used well. I have already used a few techniques with regularisation (lasso), but what I want to try now is to reduce the number of features so that I can run, at least decently, any kind of regression algorithm on the data.
control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, rfeControl=control)
predict(model, testX)
Right now, if I do it like this, a feature selection algorithm using random forest will be run, and then the model with the best set of features, according to the 5-fold cross-validation, will be used for the prediction, right?
I'm curious about two things here:
1) Is there an easy way to take the selected set of features and train a different method on it from the one used for the feature selection? For example, reducing the number of features from 500 to the 20 or so that seem most important and then applying k-nearest neighbours.
I'm imagining an easy way to do it that would look like this:
control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, method = "knn", rfeControl=control)
predict(model, testX)
2) Is there a way to tune the parameters of the feature selection algorithm? I would like to have some control over the values of mtry, the same way you can pass a grid of values when using caret's train function. Is there a way to do such a thing with rfe?
Here is a short example of how to perform rfe with a built-in model:
library(caret)
library(mlbench) #for the data
data(Sonar)
rctrl1 <- rfeControl(method = "cv",
number = 3,
returnResamp = "all",
functions = caretFuncs,
saveDetails = TRUE)
model <- rfe(Class ~ ., data = Sonar,
sizes = c(1, 5, 10, 15),
method = "knn",
trControl = trainControl(method = "cv",
classProbs = TRUE),
tuneGrid = data.frame(k = 1:10),
rfeControl = rctrl1)
model
#output
Recursive feature selection
Outer resampling method: Cross-Validated (3 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
1 0.6006 0.1984 0.06783 0.14047
5 0.7113 0.4160 0.04034 0.08261
10 0.7357 0.4638 0.01989 0.03967
15 0.7741 0.5417 0.05981 0.12001 *
60 0.7696 0.5318 0.06405 0.13031
The top 5 variables (out of 15):
V11, V12, V10, V49, V9
model$fit$results
#output
k Accuracy Kappa AccuracySD KappaSD
1 1 0.8082684 0.6121666 0.07402575 0.1483508
2 2 0.8089610 0.6141450 0.10222599 0.2051025
3 3 0.8173377 0.6315411 0.07004865 0.1401424
4 4 0.7842208 0.5651094 0.08956707 0.1761045
5 5 0.7941775 0.5845479 0.07367886 0.1482536
6 6 0.7841775 0.5640338 0.06729946 0.1361090
7 7 0.7932468 0.5821317 0.07545889 0.1536220
8 8 0.7687229 0.5333385 0.05164023 0.1051902
9 9 0.7982468 0.5918922 0.07461116 0.1526814
10 10 0.8030087 0.6024680 0.06117471 0.1229467
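On the first question specifically, one option (a sketch reusing the model object fitted above) is to pull the selected variables out with predictors() and hand just those columns to a separate train() call:
sel_vars <- predictors(model)   # variables retained by the winning subset size
sel_vars
# fit a different learner (here knn) on those columns only
knn_fit <- train(x = Sonar[, sel_vars], y = Sonar$Class,
                 method = "knn",
                 tuneGrid = data.frame(k = 1:10),
                 trControl = trainControl(method = "cv", number = 5))
knn_fit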
for more customization see:
https://topepo.github.io/caret/recursive-feature-elimination.html

Resampling based performance measure in caret

I perform a penalized logistic regression and I train a model with caret (glmnet).
model_fit <- train(Data[,-1], Data[,1],
method = "glmnet",
family="binomial",
metric = "ROC",
maximize="TRUE",
trControl = ctrl,
preProc = c("center", "scale"),
tuneGrid=expand.grid(.alpha=0.5,.lambda=lambdaSeq)
)
According to the caret documentation, the function train "[...] calculates a resampling based performance measure" and "Across each data set, the performance of held-out samples is calculated and the mean and standard deviation is summarized for each combination."
results is "A data frame" (containing) "the training error rate and values of the tuning parameters."
Is model_fit$results$ROC a vector (with size equal to the size of my tuning parameter lambda) of the mean of the performance measure across resampling? (And NOT the performance measure computed over the whole sample after re-estimating the model over the whole sample for each value of lambda?)
Is model_fit$results$ROC a vector (with size equal to the size of my tuning parameter lambda) of the mean of the performance measure across resampling?
It is; to be precise, the length will be equal to the number of rows of your tuneGrid, which here it happens to coincide with the length of your lambdaSeq (since the only other parameter, alpha, is being held constant).
Here is a quick example, adapted from the caret docs (it is with gbm and Accuracy metric, but the idea is the same):
library(caret)
library(mlbench)
data(Sonar)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
fitControl <- trainControl(method = "cv",
number = 5)
set.seed(825)
gbmGrid <- expand.grid(interaction.depth = 3,
n.trees = (1:3)*50,
shrinkage = 0.1,
n.minobsinnode = 20)
gbmFit1 <- train(Class ~ ., data = training,
method = "gbm",
trControl = fitControl,
tuneGrid = gbmGrid,
## This last option is actually one
## for gbm() that passes through
verbose = FALSE)
Here, gbmGrid has 3 rows, i.e. it consists of only three (3) different values of n.trees with the other parameters held constant; hence, the corresponding gbmFit1$results$Accuracy will be a vector of length 3:
gbmGrid
# interaction.depth n.trees shrinkage n.minobsinnode
# 1 3 50 0.1 20
# 2 3 100 0.1 20
# 3 3 150 0.1 20
gbmFit1$results
# shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa AccuracySD KappaSD
# 1 0.1 3 20 50 0.7450672 0.4862194 0.05960941 0.1160537
# 2 0.1 3 20 100 0.7829704 0.5623801 0.05364031 0.1085451
# 3 0.1 3 20 150 0.7765188 0.5498957 0.05263735 0.1061387
gbmFit1$results$Accuracy
# [1] 0.7450672 0.7829704 0.7765188
Each of the 3 Accuracy values returned is the result of the metric in the validation folds of the 5-fold cross validation we have used as a resampling technique; more precisely, it is the mean of the validation accuracies computed in these 5 folds (and you can see that there is an AccuracySD column, containing also its standard deviation).
And NOT the performance measure computed over the whole sample after re-estimating the model over the whole sample for each value of lambda?
Correct, it is not that.
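If you want to see the per-fold numbers behind that mean, they are kept in the fitted object (a quick check using gbmFit1 from above; by default only the resamples for the best tuning parameters are retained):
gbmFit1$bestTune
gbmFit1$resample                  # Accuracy/Kappa for each of the 5 folds (best tune only)
mean(gbmFit1$resample$Accuracy)   # matches the corresponding row of gbmFit1$results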

How to do recursive feature elimination with logistic regression?

Can someone provide me with a detailed example of using caret's rfe function with a glm or glmnet model? I tried something like this:
rfe_records <- Example_data_frame
rfe_ctrl <- rfeControl(functions = caretFuncs, method = "repeatedcv", repeats = 5, verbose = TRUE, classProbs = TRUE, summaryFunction = twoClassSummary)
number_predictors <- dim(rfe_records)[2]-1
x <- dplyr::select(rfe_records, -outcomeVariable)
y <- as.numeric(rfe_records$outcomeVariable)
glmProfile <- rfe(x, y, rfeControl = rfe_ctrl, sizes = c(1:number_predictors), method="glmnet", preProc = c("center", "scale"), metric = "Accuracy")
print(glmProfile)
But the results I'm getting are not what I need. I specified Accuracy as the metric, but I got:
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables RMSE Rsquared RMSESD RsquaredSD Selected
1 0.5047 0.10830 0.04056 0.11869 *
2 0.5058 0.09386 0.04728 0.11332
3 0.5117 0.08565 0.04999 0.10211
4 0.5139 0.07490 0.05042 0.10048
5 0.5166 0.07678 0.05456 0.09966
6 0.5202 0.08203 0.06174 0.10822
7 0.5187 0.08471 0.06207 0.10893
8 0.5168 0.07850 0.05939 0.09697
9 0.5175 0.08228 0.05966 0.10068
10 0.5176 0.08180 0.05980 0.10042
11 0.5179 0.08015 0.05950 0.09905
The top 1 variables (out of 1):
varName
According to this page, caret uses the class of the outcome variable to decide whether to run regression or classification with a function like glmnet that can do either. In your code you converted the outcome variable to numeric with as.numeric(), so glmnet did regression (hence the RMSE/Rsquared columns in the output), not classification as you intended. Specify your outcome variable as a two-level factor to get classification instead.
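A minimal sketch of that fix, reusing the objects from the question (the "No"/"Yes" labels are placeholders; since rfe_ctrl uses twoClassSummary you would also switch the metric to "ROC", and point the summary function of caretFuncs at twoClassSummary so the outer resampling reports ROC as well):
# two-level factor with syntactically valid level names (needed when classProbs = TRUE)
y <- factor(rfe_records$outcomeVariable, labels = c("No", "Yes"))
# have the outer rfe loop summarise with ROC/Sens/Spec
rfe_funcs <- caretFuncs
rfe_funcs$summary <- twoClassSummary
rfe_ctrl$functions <- rfe_funcs
glmProfile <- rfe(x, y,
                  sizes = 1:number_predictors,
                  rfeControl = rfe_ctrl,
                  method = "glmnet",
                  preProc = c("center", "scale"),
                  metric = "ROC")
print(glmProfile)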

Resources