I have imbalanced data, and I want to do stratified cross-validation and use the precision-recall AUC as my evaluation metric.
I use prSummary from the R package caret with a stratified index, and I encounter an error when the performance is computed.
The following is a reproducible example. I found that only ten samples are used to compute the PR AUC, and because of the imbalance they contain only one class, so the PR AUC cannot be computed. (I discovered that only ten samples are used because I modified prSummary to print out the data it receives.)
library(randomForest)
library(mlbench)
library(caret)
# Load Dataset
data(Sonar)
dataset <- Sonar
x <- dataset[,1:60]
y <- dataset[,61]
# make this data very imbalanced
y[4:length(y)] <- "M"
y <- as.factor(y)
dataset$Class <- y
# create index and indexOut
seed <- 1
set.seed(seed)
folds <- 2
idxAll <- 1:nrow(x)
cvIndex <- createFolds(factor(y), folds, returnTrain = T)
cvIndexOut <- lapply(1:length(cvIndex), function(i){
idxAll[-cvIndex[[i]]]
})
names(cvIndexOut) <- names(cvIndex)
# set the index, indexOut and summary function
control <- trainControl(index = cvIndex, indexOut = cvIndexOut,
method="cv", summaryFunction = prSummary, classProbs = T)
metric <- "AUC"
set.seed(seed)
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
Here is the error message:
Error in ROCR::prediction(y_pred, y_true) :
Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.
I think I found the odd behaviour:
Even though I specified the cross-validation indices, the summary function (whether prSummary or any other summary function) is still called on what appears to be a randomly selected set of ten samples when computing performance. (As far as I can tell, caret evaluates the summary function once on a small dummy sample of roughly ten rows just to check that it runs, which would explain the ten rows I kept seeing.)
My workaround is to define a summary function that wraps the PR-AUC computation in tryCatch so that this error does not abort training.
prSummaryCorrect <- function (data, lev = NULL, model = NULL) {
  # print what caret actually passes to the summary function (for debugging)
  print(data)
  print(dim(data))
  library(MLmetrics)
  library(PRROC)
  if (length(levels(data$obs)) > 2)
    stop(paste("Your outcome has", length(levels(data$obs)),
               "levels. The prSummary() function isn't appropriate."))
  if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
    stop("levels of observed and predicted data do not match")
  # wrap the PR-AUC computation so a resample containing a single class
  # yields NA instead of stopping train()
  res <- tryCatch({
    MLmetrics::PRAUC(y_pred = data[, lev[2]],
                     y_true = ifelse(data$obs == lev[2], 1, 0))
  }, warning = function(war) {
    print(war)
    NA
  }, error = function(e) {
    print(dim(data))
    NA
  })
  c(AUC = res,
    Precision = caret:::precision.default(data = data$pred, reference = data$obs, relevant = lev[2]),
    Recall = caret:::recall.default(data = data$pred, reference = data$obs, relevant = lev[2]),
    F = caret:::F_meas.default(data = data$pred, reference = data$obs, relevant = lev[2]))
}
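With that in place, the custom summary function can be passed to trainControl in place of prSummary. A minimal sketch, reusing the cvIndex, cvIndexOut, tunegrid and seed objects created above (the rf_corrected name is just illustrative):
control <- trainControl(index = cvIndex, indexOut = cvIndexOut,
                        method = "cv", summaryFunction = prSummaryCorrect, classProbs = TRUE)
set.seed(seed)
# same call as before, but resamples with a single class now return NA instead of an error
rf_corrected <- train(Class ~ ., data = dataset, method = "rf",
                      metric = "AUC", tuneGrid = tunegrid, trControl = control)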
Related
I am using the R programming language. I am trying to use the "caret" library to train a multiclass classification algorithm.
I found the following website, which contains a function that can supposedly be used for multiclass classification problems: https://www.r-bloggers.com/2012/07/error-metrics-for-multi-class-problems-in-r-beyond-accuracy-and-kappa/
require(compiler)
multiClassSummary <- cmpfun(function (data, lev = NULL, model = NULL){
  # Load Libraries
  require(Metrics)
  require(caret)
  # Check data
  if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
    stop("levels of observed and predicted data do not match")
  # Calculate custom one-vs-all stats for each class
  prob_stats <- lapply(levels(data[, "pred"]), function(class){
    # Grab one-vs-all data for the class
    pred <- ifelse(data[, "pred"] == class, 1, 0)
    obs <- ifelse(data[, "obs"] == class, 1, 0)
    prob <- data[, class]
    # Calculate one-vs-all AUC and logLoss and return
    cap_prob <- pmin(pmax(prob, .000001), .999999)
    prob_stats <- c(auc(obs, prob), logLoss(obs, cap_prob))
    names(prob_stats) <- c('ROC', 'logLoss')
    return(prob_stats)
  })
  prob_stats <- do.call(rbind, prob_stats)
  rownames(prob_stats) <- paste('Class:', levels(data[, "pred"]))
  # Calculate confusion matrix-based statistics
  CM <- confusionMatrix(data[, "pred"], data[, "obs"])
  # Aggregate and average class-wise stats
  # Todo: add weights
  class_stats <- cbind(CM$byClass, prob_stats)
  class_stats <- colMeans(class_stats)
  # Aggregate overall stats
  overall_stats <- c(CM$overall)
  # Combine overall with class-wise stats and remove some stats we don't want
  stats <- c(overall_stats, class_stats)
  stats <- stats[! names(stats) %in% c('AccuracyNull',
                                       'Prevalence', 'Detection Prevalence')]
  # Clean names and return
  names(stats) <- gsub('[[:blank:]]+', '_', names(stats))
  return(stats)
})
I would like to use this function to train a "Decision Tree" model using the "F1 Score" (or any metric suitable for a multiclass problem). For example:
library(caret)
library(plyr)
library(C50)
library(dplyr)
library(compiler)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
summaryFunction = multiClassSummary, classProbs = TRUE)
train_model <- train(my_data$response ~., data = my_data, method = "C5.0",
trControl=train.control ,
preProcess = c("center", "scale"),
tuneLength = 15,
metric = "F1")
But this produces the following error:
Error in ctrl$summaryFunction(testOutput, lev, method):
Your outcome has 4 levels. The prSummary() function isn't appropriate.
I want to extract the predictions for new, unseen data using the function caret::extractPrediction with a random forest model, but I cannot figure out why my code throws the error Error: $ operator is invalid for atomic vectors. How should the input parameters be structured to use this function?
Here is my reproducible code:
library(caret)
dat <- as.data.frame(ChickWeight)
# create column set
dat$set <- rep("train", nrow(dat))
# split into train and validation set
set.seed(1)
dat[sample(nrow(dat), 50), which(colnames(dat) == "set")] <- "validation"
# predictors and response
all_preds <- dat[which(dat$set == "train"), which(names(dat) %in% c("Time", "Diet"))]
response <- dat[which(dat$set == "train"), which(names(dat) == "weight")]
# set train control parameters
contr <- caret::trainControl(method="repeatedcv", number=3, repeats=5)
# train a random forest with caret
set.seed(1)
model <- caret::train(x = all_preds,
y = response,
method ="rf",
ntree = 250,
metric = "RMSE",
trControl = contr)
# validation set
vali <- dat[which(dat$set == "validation"), ]
# not working
caret::extractPrediction(models = model, testX = vali[,-c(3,5,1)], testY = vali[,1])
caret::extractPrediction(models = model, testX = vali, testY = vali)
# works without problems
caret::predict.train(model, newdata = vali)
I found a solution by looking at the documentation of extractPrediction. Basically, the models argument doesn't expect a single model object but a list of models. So I just passed list(my_rf = model) instead of just model.
caret::extractPrediction(models = list(my_rf = model), testX = vali[,-c(3,5,1)], testY = vali[,1])
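For reference, extractPrediction returns a long-format data frame of predictions (with columns such as obs, pred, model and dataType, if I recall the layout correctly), so the predictions for the supplied test data can be pulled out afterwards. A small sketch:
preds <- caret::extractPrediction(models = list(my_rf = model),
                                  testX = vali[, -c(3, 5, 1)], testY = vali[, 1])
# keep only the rows that belong to the supplied test set
head(subset(preds, dataType == "Test"))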
I want to perform a logistic regression with the train() function from the caret package. My model looks something like this:
model <- train(Y ~.,
data = train_data,
family = "binomial",
method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now, I want to evaluate how good the model predictions are in comparison with the actual test data. I know how to obtain the ROC curve and the AUC for this. However, I am also interested in obtaining the BRIER SCORE. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of a prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need one probability estimate per prediction that contains the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any solution for obtaining the Brier score within the caret package. I am aware that with cv.glmnet the predict function allows type = "response", which would solve my problem. However, for personal preferences I would like to stay with the caret package.
Thanks for the help!
If we go by the wiki definition of the Brier score, the most common formulation is
BS = (1/N) * sum_{t=1}^{N} (f_t - o_t)^2
where f_t is the probability that was forecast, o_t the actual outcome of the event (0 or 1), and N is the number of forecasting instances.
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you just calculate the probability and 0/1 with respect to that. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
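(As an aside: if the positive class had been the first level instead, you could reorder the factor so that it becomes the second one. The line below is purely illustrative and a no-op here, since "o" already comes first alphabetically.)
# make sure "o" is the reference (first) level, so "v" is the positive (second) level
data$Species <- relevel(data$Species, ref = "o")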
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~ ., data = train_data, family = "binomial", method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the probability of each class:
head(pred)
o v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]
o_t = as.numeric(test_data$Species)-1
mean((f_t - o_t)^2)
[1] 0.32
I use the Brier score to tune my models in caret for binary classification. I ensure that the "positive" class is the second factor level, which is the default ordering when the response is labelled 0/1. Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
ifelse(data$obs == lev[2], 1, 0)),
silent = TRUE)
brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
silent = TRUE)
rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
direction = "<", quiet = TRUE), silent = TRUE)
if (inherits(pr_auc, "try-error")) pr_auc <- NA
if (inherits(brscore, "try-error")) brscore <- NA
rocAUC <- if (inherits(rocObject, "try-error")) {
NA
} else {
rocObject$auc
}
tmp <- unlist(e1071::classAgreement(table(data$obs,
data$pred)))[c("diag", "kappa")]
out <- c(Acc = tmp[[1]],
Kappa = tmp[[2]],
AUCROC = rocAUC,
AUCPR = pr_auc,
Brier = brscore,
Precision = caret:::precision.default(data = data$pred,
reference = data$obs,
relevant = lev[2]),
Recall = caret:::recall.default(data = data$pred,
reference = data$obs,
relevant = lev[2]),
F = caret:::F_meas.default(data = data$pred, reference = data$obs,
relevant = lev[2]))
out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.
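A minimal sketch of how that might look (the data, model and tuning settings below are placeholders, not from my actual workflow):
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     classProbs = TRUE, summaryFunction = BigSummary)
set.seed(42)
fit <- train(Class ~ ., data = dataset, method = "glmnet",
             trControl = ctrl, tuneLength = 5,
             metric = "Brier", maximize = FALSE)  # pick the tuning with the lowest Brier score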
Implementation of an ROCR curve, kNN, and 10-fold cross-validation.
I am using the Ionosphere dataset.
Here is the attribute information for your reference:
-- All 34 are continuous, as described above
-- The 35th attribute is either "good" or "bad" according to the definition
summarized above. This is a binary classification task.
data1<-read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data',header = FALSE)
knn on its own works, and kNN with k-fold cross-validation also works. But when I add the ROCR code, it doesn't like it.
I get the error: "The format of predictions is incorrect".
I checked the data frames pred and Class1; their dimensions are the same. I also tried data.test$V35 instead of Class1 and got the same error.
install.packages("class")
library(class)
nrFolds <- 10
data1[,35]<-as.numeric(data1[,35])
# generate array containing fold-number for each sample (row)
folds <- rep_len(1:nrFolds, nrow(data1))
# actual cross validation
for(k in 1:nrFolds) {
  # actual split of the data
  fold <- which(folds == k)
  data.train <- data1[-fold, ]
  data.test <- data1[fold, ]
  Class <- data.train[, 35]
  Class1 <- data.test[, 35]
  # train and test your model with data.train and data.test
  pred <- knn(data.train, data.test, Class, k = 5, l = 0, prob = FALSE, use.all = TRUE)
  data <- data.frame('predict' = pred, 'actual' = Class1)
  count <- nrow(data[data$predict == data$actual, ])
  total <- nrow(data.test)
  avg <- (count * 100) / total
  avg <- format(round(avg, 2), nsmall = 2)
  method <- "KNN"
  accuracy <- avg
  cat("Method = ", method, ", accuracy= ", accuracy, "\n")
}
install.packages("ROCR")
library(ROCR)
rocrPred=prediction(pred, Class1, NULL)
rocrPerf=performance(rocrPred, 'tpr', 'fpr')
plot(rocrPerf, colorize=TRUE, text.adj=c(-.2,1.7))
Any help is appreciated.
This worked for me:
install.packages("class")
library(class)
library(ROCR)
nrFolds <- 10
data1[,35]<-as.numeric(data1[,35])
# generate array containing fold-number for each sample (row)
folds <- rep_len(1:nrFolds, nrow(data1))
# actual cross validation
for(k in 1:nrFolds) {
  # actual split of the data
  fold <- which(folds == k)
  data.train <- data1[-fold, ]
  data.test <- data1[fold, ]
  Class <- data.train[, 35]
  Class1 <- data.test[, 35]
  # train and test your model with data.train and data.test
  pred <- knn(data.train, data.test, Class, k = 5, l = 0, prob = FALSE, use.all = TRUE)
  data <- data.frame('predict' = pred, 'actual' = Class1)
  count <- nrow(data[data$predict == data$actual, ])
  total <- nrow(data.test)
  avg <- (count * 100) / total
  avg <- format(round(avg, 2), nsmall = 2)
  method <- "KNN"
  accuracy <- avg
  cat("Method = ", method, ", accuracy= ", accuracy, "\n")
  pred <- prediction(Class1, pred)
  perf <- performance(pred, "tpr", "fpr")
  plot(perf, colorize = TRUE, add = TRUE)
  abline(0, 1)
}
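A note on why the original prediction() call complains: ROCR::prediction expects numeric prediction scores, while class::knn returns a factor of predicted labels, so at best you get a single point rather than a curve. One possible refinement (a sketch, not part of the code above) is to call knn with prob = TRUE and turn the vote proportion into a score for the positive class:
pred <- knn(data.train, data.test, Class, k = 5, prob = TRUE, use.all = TRUE)
votes <- attr(pred, "prob")  # proportion of votes for the winning class
# score for the class coded 2 (assuming the earlier as.numeric() step mapped the positive class to 2)
score <- ifelse(pred == "2", votes, 1 - votes)
rocrPred <- prediction(score, Class1)
rocrPerf <- performance(rocrPred, "tpr", "fpr")
plot(rocrPerf, colorize = TRUE)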
I am trying to apply filter-based feature selection in the caret package for logistic regression. I was successful using the sbf() function for random forest and LDA models (with rfSBF and ldaSBF, respectively).
The way I modified lmSBF is as follows:
# custom lmSBF
logisticRegressionWithPvalues <- lmSBF
logisticRegressionWithPvalues$score <- pScore
logisticRegressionWithPvalues$summary <- fiveStats
logisticRegressionWithPvalues$filter <- pCorrection
logisticRegressionWithPvalues$fit <- glmFit
# my training control parameters for sbf (selection by filter)
myTrainControlSBF = sbfControl(method = "cv",
number = 10,
saveDetails = TRUE,
verbose = FALSE,
functions = logisticRegressionWithPvalues)
# fit the logistic regression model
logisticRegressionModelWithSBF <- sbf(x = input_predictors,
y = input_labels,
sbfControl = myTrainControlSBF)
Here, the glmFit function (mentioned above) is as follows:
# fit function for logistic regression
glmFit <- function(x, y, ...) {
  if (ncol(x) > 0) {
    tmp <- as.data.frame(x)
    tmp$y <- y
    glm(y ~ ., data = tmp, family = binomial)
  } else {
    nullModel(y = y)
  }
}
But when calling sbf() to fit logisticRegressionModelWithSBF, I get the following error:
Error in { : task 1 failed - "inputs must be factors"
What am I doing wrong?