Alternative performance measures for multiclass classification in caret - r

I want to tune a classification algorithm that predicts probabilities using caret.
Since my data set is highly unbalanced, the default Accuracy metric of caret does not seem very helpful, according to this post: https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification.
In my specific case, I want to determine the optimal mtry parameter of a random forest that predicts probabilities. I have 3 classes with a class ratio of 98.7% - 0.45% - 0.85%. A reproducible example - which sadly does not have an unbalanced data set - is given by:
library(caret)
data(iris)
control = trainControl(method = "cv", number = 5, verboseIter = TRUE, classProbs = TRUE)
# recent caret versions expect splitrule and min.node.size in a ranger tuning grid as well
grid = expand.grid(mtry = 1:3, splitrule = "gini", min.node.size = 1)
rf_gridsearch = train(y = iris[, 5], x = iris[-5], method = "ranger",
                      num.trees = 2000, tuneGrid = grid, trControl = control)
rf_gridsearch
So my two questions basically are:
What alternative summary metrics besides Accuracy do I have?
(Using multiROC is not my favourite, because of https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification; I am thinking of something like a Brier score.)
How do I implement them?
Many thanks!
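One way to do this is to pass a custom summaryFunction to trainControl() and then select on that metric via train()'s metric and maximize arguments; caret also ships mnLogLoss (multiclass log loss) and multiClassSummary as ready-made alternatives. Below is a minimal sketch of a multiclass Brier score summary; the names brierSummary, control_brier and rf_brier are purely illustrative, while the obs/pred/per-class-probability columns follow caret's summaryFunction contract:
# custom summary: multiclass Brier score (lower is better)
brierSummary <- function(data, lev = NULL, model = NULL) {
  probs <- as.matrix(data[, lev])                 # predicted class probabilities
  onehot <- model.matrix(~ obs - 1, data = data)  # one-hot encode the observed classes
  colnames(onehot) <- levels(data$obs)
  onehot <- onehot[, lev, drop = FALSE]           # align column order with probs
  c(Brier = mean(rowSums((probs - onehot)^2)))
}
control_brier <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                              summaryFunction = brierSummary)
rf_brier <- train(y = iris[, 5], x = iris[-5], method = "ranger",
                  num.trees = 2000, tuneGrid = grid, trControl = control_brier,
                  metric = "Brier", maximize = FALSE)  # minimize the Brier score
rf_brier
Swapping in summaryFunction = mnLogLoss with metric = "logLoss" works the same way if you prefer the multiclass log loss.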

Related

Multiclass classification measures in mlr

I have a multiclass classification problem with 4 classes and am training various learners on the data in mlr. I am using the multiclass wrapper with the default "onevsrest". As I understand it, this method creates 4 binary classifiers, one for each class, with that class as the positive case and all other classes combined as the negative case. I am using resample to train the multiclass classifier using 5-fold cross-validation.
If that is correct, is there some way I can access the individual binary classifiers and calculate the various ROC measures (sensitivity, specificity, F1, AUC) for each binary classifier? The function calculateROCMeasures insists on binary results, but the methods it uses (e.g. getPredictionTruth, measureTPR, etc.) could be called directly.
I am aware of the measures multiclass.au1p etc., but these give results across the 4 classes and don't provide values for TP, TN, F1, etc.
This is how I have set up the problem (I am still using mlr 2 - sorry - but I imagine the same general question also applies to mlr 3):
library(mlr)
cv.resamp = makeResampleDesc("CV", iters=5)
lrn = makeLearner(cl = "classif.cvglmnet", predict.type = "response")
mclrn = makeMulticlassWrapper(lrn, mcw.method = "onevsrest")
res = resample(learner = mclrn, task = iris.task, resampling = cv.resamp, models = TRUE, show.info = FALSE, extract = getFilteredFeatures)
but then I am not sure how to proceed (possibly there is no way to proceed?). I have examined the structure of res, returned from resample, and of res$pred. In res$pred I can see no evidence of the 4 binary models.
Any advice on how to obtain these scores for a multiclass problem in mlr would be appreciated.
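One workaround sketch (not necessarily the intended mlr API, and only for the threshold-based measures; AUC would additionally require predict.type = "prob"): binarise the pooled cross-validated predictions in res$pred one class at a time and compute the binary measures directly. Only the mlr 2 accessors getPredictionTruth() and getPredictionResponse() are assumed here:
# one-vs-rest measures per class from the pooled CV predictions (mlr 2)
truth    <- getPredictionTruth(res$pred)
response <- getPredictionResponse(res$pred)
per_class <- sapply(levels(truth), function(cl) {
  tp <- sum(response == cl & truth == cl)
  fp <- sum(response == cl & truth != cl)
  fn <- sum(response != cl & truth == cl)
  tn <- sum(response != cl & truth != cl)
  c(TPR = tp / (tp + fn),              # sensitivity for this class
    TNR = tn / (tn + fp),              # specificity for this class
    F1  = 2 * tp / (2 * tp + fp + fn))
})
per_class   # one column per class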

Random forest: OOB for k-fold cross-validation?

I am rather new to machine learning and I am currently trying to implement a random forest classification using the caret and randomForest packages in R. I am using the trainControl function with repeated cross-validation. Maybe it is a stupid question, but as far as I understand, a random forest usually uses bagging to split the training data into different subsets with replacement, with roughly 1/3 of the observations left out of each bootstrap sample serving as the validation set on which the OOB error is calculated. But what happens if you specify that you want to use k-fold cross-validation? From the caret documentation, I assumed that it uses only cross-validation for the resampling. But if it only used cross-validation, why do you still get an OOB error? Or is bagging still used for the creation of the model and cross-validation for the performance evaluation?
TrainingControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                                savePredictions = TRUE, classProbs = TRUE, search = "grid")
train(x ~ ., data = training_set,
      method = "rf",
      metric = "Accuracy",
      trControl = TrainingControl,
      ntree = 1000,
      importance = TRUE)
Trying to address your questions:
a random forest usually uses bagging to split the training data into
different subsets with replacement, with roughly 1/3 of the observations left out
of each bootstrap sample serving as the validation set on which the OOB error is calculated
Yes, caret is using randomForest() from the randomForest package. More specifically, it bootstraps the training data and generates multiple decision trees whose predictions are bagged (averaged) to reduce overfitting. From Wikipedia:
This bootstrapping procedure leads to better model performance because
it decreases the variance of the model, without increasing the bias.
This means that while the predictions of a single tree are highly
sensitive to noise in its training set, the average of many trees is
not, as long as the trees are not correlated.
So if you call k-fold cross-validation from caret, it simply runs randomForest() on each training fold, and therefore the answer to this:
But what happens if you specify that you want to use k-fold
cross-validation? From the caret documentation, I assumed that it uses
only cross-validation for the resampling. But if it only used
cross-validation, why do you still get an OOB error?
would be: the bootstrap sampling and bagging are still performed, because they are part of the random forest algorithm itself. caret simply repeats this on different training sets and estimates the error on their respective test sets. The OOB error generated by randomForest() is reported regardless. The difference is that you also get truly "unseen" data that can be used to evaluate your model.
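To make the distinction concrete, here is a small sketch (the object name rf_fit and the use of iris are purely illustrative) showing where the two error estimates live in a fitted caret object:
library(caret)
library(randomForest)
# illustration only: random forest with 5-fold CV on iris
rf_fit <- train(Species ~ ., data = iris, method = "rf",
                trControl = trainControl(method = "cv", number = 5),
                ntree = 500)
# cross-validated accuracy, estimated by caret on the held-out folds
rf_fit$results
# OOB error of the final model, reported by randomForest itself
rf_fit$finalModel$err.rate[rf_fit$finalModel$ntree, "OOB"]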

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: a "training-testing set" (80%) and a "hold-out set" (20%), the former used to build the models and the latter used to test them, in order to compare and pick the final model based on metrics such as their R-squared and RMSE values. One of the many guides I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I run into a block in that I do not know how to test and compare the different models based on how well they predict the scores in the hold-out set. In the guide I linked to above, the author uses confusionMatrix to calculate the specificity and accuracy for each model, after building a predict.train object that applies the recently built models to the hold-out set (referred to as test in the link). However, confusionMatrix can only be applied to classification models, where the outcome (or response) is a categorical value (as far as my research has indicated; please correct me if this is wrong).
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it does not take into account how the models fit the data that I excluded from the training-testing sessions.
I tried to create predict objects using the recently built models and the hold-out data, and then calculate R-squared and RMSE values using caret's R2 and RMSE functions. But I'm not sure whether such an approach is the best possible way to compare the models and pick the best one.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold on my outcome, wherein any genetic sequence with a fluorescence value over 100 is considered useful, while sequences scoring under 100 are not. This would allow me to utilize confusionMatrix. But I'm not sure how to implement this in my R code to create these two classes in my outcome variable, and I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and their values are not normally distributed.
Here is the code I have been using thus far:
library(caret)
data <- read.table("mydata.csv")   # adjust sep/header as appropriate for your file
sorted_Data <- data[order(data$fluorescence, decreasing = TRUE), ]
splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p = splitprob, list = FALSE)
holdoutset <- sorted_Data[-traintestindex, ]
trainingset <- sorted_Data[traintestindex, ]
traindata <- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]
cvCtrl <- trainControl(method = "repeatedcv", number = 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence ~ ., traindata, method = "glmStepAIC", preProc = c("center", "scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence ~ ., traindata, method = "rlm", preProc = c("center", "scale"), trControl = cvCtrl)
# predict on the train objects (not finalModel) so the center/scale preprocessing is applied
pred_glmStepAIC <- predict(modelglmStepAIC, holdoutset)
pred_rlm <- predict(model_rlm, holdoutset)
glmStepAIC_r2 <- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse <- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2 <- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse <- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by caret are RMSE, MAE and the squared correlation between fitted and observed values (called R2). See more info here: https://topepo.github.io/caret/measuring-performance.html
At least in the time series regression context, RMSE is the standard measure of out-of-sample performance for regression models.
I would advise against discretising a continuous outcome variable, because you are essentially throwing away information by doing so.
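As a concrete sketch (reusing the model and hold-out objects from the question), caret's postResample() returns RMSE, Rsquared and MAE in one call, which makes the side-by-side comparison straightforward:
# hold-out performance, one call per model
postResample(pred = predict(modelglmStepAIC, holdoutset), obs = holdoutset$fluorescence)
postResample(pred = predict(model_rlm, holdoutset), obs = holdoutset$fluorescence)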

Feature Selection for Regression Models in R

I'm trying to find a feature selection package in R that can be used for regression; most of the packages implement their methods for classification, using a factor or class as the response variable. In particular, I'm interested in whether there is a method using random forest for that purpose. A good paper in this field would also be helpful.
IIRC the randomForest package also does regression trees. You could start with the Breiman paper and go from there.
There are many ways you can use random forests to calculate variable importance.
I. Mean Decrease Impurity (MDI) / Gini Importance:
This makes use of a random forest model or a decision tree. When training a tree, importance is measured by how much each feature decreases the weighted impurity in that tree. For a forest, the impurity decrease from each feature can be averaged over all trees and the features ranked according to this measure. Here is an example in R.
library(randomForest)
# `Target` and `training_data` are placeholders for your own response and data
fit <- randomForest(Target ~ ., importance = TRUE, ntree = 500, data = training_data)
var.imp1 <- data.frame(importance(fit, type = 2))   # type = 2: MeanDecreaseGini
var.imp1$Variables <- row.names(var.imp1)
varimp1 <- var.imp1[order(var.imp1$MeanDecreaseGini, decreasing = TRUE), ]
par(mar = c(10, 5, 1, 1))
giniplot <- barplot(t(varimp1[-2] / sum(varimp1[-2])), las = 2,
                    cex.names = 1,
                    main = "Gini Impurity Index Plot")
And the output will look like this: [Gini Importance Plot]
II. Permutation Importance or Mean Decrease in Accuracy (MDA): this is assessed for each feature by removing the association between that feature and the target, which is achieved by randomly permuting the values of the feature and measuring the resulting increase in error. The influence of correlated features is also removed. Example in R:
fit <- randomForest(Target ~ ., importance = TRUE, ntree = 500, data = training_data)
var.imp2 <- data.frame(importance(fit, type = 1))   # type = 1: MeanDecreaseAccuracy
var.imp2$Variables <- row.names(var.imp2)
varimp2 <- var.imp2[order(var.imp2$MeanDecreaseAccuracy, decreasing = TRUE), ]
par(mar = c(10, 5, 1, 1))
permplot <- barplot(t(varimp2[-2] / sum(varimp2[-2])), las = 2,
                    cex.names = 1,
                    main = "Permutation Importance Plot")
These two are the ones which use random forest directly. There are also some easier-to-use approaches for calculating variable importance; the 'Boruta' method, as well as Weight of Evidence (WOE) and Information Value (IV), might also be helpful.
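Since the original question is about regression, here is a minimal sketch for a numeric response (the names y and training_data are placeholders): randomForest() then reports %IncMSE (permutation importance) and IncNodePurity instead of the classification measures above.
library(randomForest)
# regression forest: importance() returns %IncMSE and IncNodePurity columns
fit_reg <- randomForest(y ~ ., data = training_data, importance = TRUE, ntree = 500)
importance(fit_reg)
varImpPlot(fit_reg)   # built-in plot of both importance measures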

Issues when using randomForest in caret with ROC as optimization metric

I'm having an issue when constructing random forest models using caret. I have a dataset of about 46k rows and 10 columns (one of which is the optimization target). From this dataset, I'm trying to compare different classifiers. I did the following:
ctrl = trainControl(method="boot"
                    ,classProbs=TRUE
                    ,summaryFunction=twoClassSummary )
#GLM Model:
model.glm = train(x=d[,2:10]
                  ,y=d$CONV_BT, method='glm'
                  ,trControl=ctrl, metric="ROC"
                  ,family="binomial")
#Random Forest Model:
model.rf = train(x=d[,2:10]
                 ,y=d$CONV_BT, method='rf'
                 ,trControl=ctrl, metric="ROC")
#Naive Bayes Model:
model.nb = train(x=d[,2:10]
                 ,y=d$CONV_BT, method='nb'
                 ,trControl=ctrl, metric="ROC" )
Then, model.glm and model.nb both look pretty decent. I can look at the 25 bootstrap replications, and each case has an ROC of around .7. However, something appears to be wrong with model.rf, because the reported ROC scores are all around .3. That suggests to me that something is being specified incorrectly, because I could just switch my predictions from the rf model from p to 1-p and my ROC would then be .7, right?
I'm sorry that I can't provide the data (because it's pretty big to upload and it's proprietary). The other bizarre thing is that when I simulate data, I no longer have this issue. Any idea what this could be??? Thanks for your help!
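A hedged diagnostic sketch, not a confirmed cause: twoClassSummary treats the first factor level of the outcome as the event of interest, so it is worth checking the level order of d$CONV_BT and the predicted probability columns of each model; if the probabilities end up associated with the wrong level, the reported ROC can come out as roughly one minus the true value.
# which class does twoClassSummary treat as the "event"? (the first level)
levels(d$CONV_BT)
# inspect the predicted probability columns of each fitted model
head(predict(model.glm, d[, 2:10], type = "prob"))
head(predict(model.rf,  d[, 2:10], type = "prob"))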
