I'm using the R mlr package because it lets me apply multiple classification methods and tune their parameters through a single, consistent interface.
However, it seems to have changed my positive class.
In my dataset I need to predict "HasWriteOff", which takes the value "1" or "2". "1" is the majority class and far outnumbers "2", so the classes are imbalanced.
I set the positive class to "2" in the makeClassifTask function, but after prediction, when I checked the confusion matrix, it showed the positive class as "1".
Here is my code:
I set the positive class here:
train_task <- makeClassifTask(data=data.frame(train_data), target = "HasWriteOff", positive = "2")
test_task <- makeClassifTask(data=data.frame(test_data), target = "HasWriteOff", positive = "2")
Train and predict with XGBoost:
set.seed(410)
getParamSet("classif.xgboost")
xg_learner <- makeLearner("classif.xgboost", predict.type = "response")
xg_learner$par.vals <- list(
objective = "binary:logistic",
eval_metric = "error",
nrounds = 250
)
xg_param <- makeParamSet(
makeIntegerParam("nrounds",lower=200,upper=600),
makeIntegerParam("max_depth",lower=3,upper=20),
makeNumericParam("lambda",lower=0.55,upper=0.60),
makeNumericParam("eta", lower = 0.001, upper = 0.5),
makeNumericParam("subsample", lower = 0.10, upper = 0.80),
makeNumericParam("min_child_weight",lower=1,upper=5),
makeNumericParam("colsample_bytree",lower = 0.2,upper = 0.8)
)
# random search over the parameter set, 100 iterations
rancontrol <- makeTuneControlRandom(maxit = 100L)
# 3-fold cross-validation for tuning
cv_xg <- makeResampleDesc("CV", iters = 3L)
xg_tune <- tuneParams(learner = xg_learner, task = train_task, resampling = cv_xg,
                      measures = acc, par.set = xg_param, control = rancontrol)
xg_final <- setHyperPars(learner = xg_learner, par.vals = xg_tune$x)
xgmodel <- mlr::train(xg_final, train_task)
xgpredict <- predict(xgmodel, test_task)
Check the confusion matrix here:
nb_prediction <- xgpredict$data$response
dCM <- confusionMatrix(test_data$HasWriteOff, nb_prediction)
dCM
Output
Accuracy : 0.9954
95% CI : (0.9916, 0.9978)
No Information Rate : 0.9784
P-Value [Acc > NIR] : 5.136e-11
Kappa : 0.8913
Mcnemar's Test P-Value : 1
Sensitivity : 0.9977
Specificity : 0.8936
Pos Pred Value : 0.9977
Neg Pred Value : 0.8936
Prevalence : 0.9784
Detection Rate : 0.9761
Detection Prevalence : 0.9784
Balanced Accuracy : 0.9456
'Positive' Class : 1
As you can see, the 'Positive' Class here is 1.
I have checked the other methods I'm using here; they don't have a 'positive' parameter to set.
Do you know how I can really set the positive class to the minority class "2"?
I'm trying to see whether, by setting the minority class as the positive class, the specificity can be made higher.
Oh, I just found that this call should also change the positive class:
dCM <- confusionMatrix(test_data$HasWriteOff, nb_prediction, positive = "2")
Yesterday I didn't check the confusionMatrix function because I thought the positive class should be defined by the methods used before predict.
However, I just checked the R documentation for confusionMatrix's positive parameter, and it says:
If there are only two factor levels, the first level will be used as the "positive" result
So yesterday it simply chose the majority class, regardless of whether I had defined the positive class before.
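For reference, here is a minimal sketch of the corrected call, using the objects from the code above. Note that caret's confusionMatrix expects the predictions as its first argument (data) and the true labels as its second (reference), so the arguments are also swapped relative to the call above:
# predictions first, reference labels second, and the positive class set explicitly
dCM <- confusionMatrix(data = nb_prediction,
                       reference = test_data$HasWriteOff,
                       positive = "2")
dCM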
I tried to use mle to estimate the parameters for the negative binomial distribution. Here is my code.
library(stats4)
library(bbmle)
library(MASS)
b=rnbinom(n=1000, size=3, prob=0.1)
LL2 <- function(size, prob) {
R = dnbinom(b, size, prob, log = TRUE)
-sum(R)
}
When I called mle with lower and upper bounds, I got:
stats4::mle(LL2, start = list(size = 3, prob = 0.1),lower = c(-Inf,-Inf),upper = c(Inf,Inf))
Error in optim(start, f, method = method, hessian = TRUE, lower = lower, :
L-BFGS-B needs finite values of 'fn'
When I removed the bounds, it worked:
stats4::mle(LL2, start = list(size = 3, prob = 0.1))
Call:
stats4::mle(minuslogl = LL2, start = list(size = 3, prob = 0.1))
Coefficients:
size prob
3.0467857 0.1037522
However, if I change the bounds to a finite value, the error is still there.
I was wondering why this happens. Is it because the L-BFGS-B method cannot handle bound constraints?
Any comments would be appreciated.
I ran your setup code with set.seed(101).
Create an instrumented version of the score function so we can see where the optimizer is going:
LL2 <- function(size, prob) {
R = dnbinom(b, size, prob, log = TRUE)
res <- -sum(R)
cat(size,prob,res,"\n")
res
}
stats4::mle(LL2, start = list(size = 3, prob = 0.1),lower = c(-Inf,-Inf),upper = c(Inf,Inf))
## 3 0.1 4085.146
## 3.001 0.1 4085.166
## 2.999 0.1 4085.127
## 3 0.101 4084.767
## 3 0.099 4085.858
## 2.964666 1.099376 NaN
Error in optim(start, f, method = method, hessian = TRUE, lower = lower, :
L-BFGS-B needs finite values of 'fn'
In addition: Warning message:
In dnbinom(b, size, prob, log = TRUE) : NaNs produced
The first 5 steps are the evaluation of the initial value and of the finite-difference approximation of the derivatives. The very next optimization step takes us to prob = 1.099, which gives a NaN result (we need 0 < prob < 1). L-BFGS-B is much more finicky than the other optimizers about non-finite values; most of the others treat non-finite results as "bad" and try something sensible.
You could set the lower bound to 0 for size and bounds of (0, 1) for prob ... (I tried it and it seems to work). You do have to be a little careful with L-BFGS-B: it doesn't always respect the bounds when it is calculating the finite-difference approximation, so e.g. if values <= 0 give non-finite results you may need to set the lower bound slightly above 0 (e.g. 0.002, since the default finite-difference epsilon is 0.001).
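A minimal sketch of the bounded fit suggested above (the exact bound values 0.002 and 0.999 are illustrative, chosen to sit slightly inside the valid region):
# size must be positive, prob must lie in (0, 1); keep the bounds a little
# inside those limits so the finite-difference steps don't leave the domain
stats4::mle(LL2, start = list(size = 3, prob = 0.1),
            lower = c(0.002, 0.002), upper = c(Inf, 0.999))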
I want to perform a logistic regression with the train() function from the caret package. My model looks something like this:
model <- train(Y ~.,
data = train_data,
family = "binomial",
method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now I want to evaluate how good the model predictions are in comparison with the actual test data. I know how to obtain the ROC and AUC for this. However, I am also interested in obtaining the Brier score. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of a prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need one probability estimate per prediction that contains the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any solution for obtaining the Brier score in the caret package. I am aware that with glmnet's cv.glmnet the predict function allows the argument "response", which would solve my problem. However, for personal preferences I would like to stay with the caret package.
Thanks for the help!
If we go by the Wikipedia definition of the Brier score:
The most common formulation of the Brier score is
BS = (1/N) * sum_{t=1}^{N} (f_t - o_t)^2
where f_t is the probability that was forecast, o_t the actual outcome of the event at instance t (0 or 1), and N is the number of forecasting instances.
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you calculate the probability and the 0/1 outcome with respect to that level. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~.,data = train_data,family = "binomial",method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the probabilities for each class:
head(pred)
o v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]                           # forecast probability of the second level ("v")
o_t = as.numeric(test_data$Species) - 1  # actual outcome coded 0/1 ("o" = 0, "v" = 1)
mean((f_t - o_t)^2)                      # Brier score
[1] 0.32
I use the Brier score to tune my models in caret for binary classification. I ensure that the "positive" class is the second class, which is the default when you label your response "0"/"1". Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
  # area under the precision-recall curve for the positive (second) class
  pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
                                 ifelse(data$obs == lev[2], 1, 0)),
                silent = TRUE)
  # Brier score: mean squared difference between predicted probability and 0/1 outcome
  brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
                 silent = TRUE)
  # ROC object for the AUC
  rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
                             direction = "<", quiet = TRUE), silent = TRUE)
  if (inherits(pr_auc, "try-error")) pr_auc <- NA
  if (inherits(brscore, "try-error")) brscore <- NA
  rocAUC <- if (inherits(rocObject, "try-error")) {
    NA
  } else {
    rocObject$auc
  }
  # overall accuracy and Cohen's kappa from the confusion table
  tmp <- unlist(e1071::classAgreement(table(data$obs,
                                            data$pred)))[c("diag", "kappa")]
  out <- c(Acc = tmp[[1]],
           Kappa = tmp[[2]],
           AUCROC = rocAUC,
           AUCPR = pr_auc,
           Brier = brscore,
           Precision = caret:::precision.default(data = data$pred,
                                                 reference = data$obs,
                                                 relevant = lev[2]),
           Recall = caret:::recall.default(data = data$pred,
                                           reference = data$obs,
                                           relevant = lev[2]),
           F = caret:::F_meas.default(data = data$pred, reference = data$obs,
                                      relevant = lev[2]))
  out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.
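A sketch of how that wiring might look, reusing the formula and data from the question above; the cross-validation setup is illustrative, and classProbs = TRUE is needed so the summary function receives class probabilities:
# hypothetical example: tune a glmnet model on the Brier score
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,            # summary function needs probabilities
                     summaryFunction = BigSummary)
fit <- train(Y ~ ., data = train_data,
             family = "binomial", method = "glmnet",
             trControl = ctrl,
             metric = "Brier", maximize = FALSE)   # smaller Brier score is better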
I'm trying to write a for loop that creates bootstrap samples from the weather data in the rattle.data package (with RainTomorrow as the target column). For each bootstrap sample I want to choose the class with the maximum probability, and then predict the class with the maximum votes across samples.
With this code I keep getting an error:
if(!require(rpart)) install.packages("rpart")
if(!require(rpart.plot)) install.packages("rpart.plot")
if(!require(caret)) install.packages("caret")
if(!require(rattle.data)) install.packages("rattle.data")
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(ipred)) install.packages("ipred")
if(!require(Metrics)) install.packages("Metrics")
library(rpart)
library(rpart.plot)
library(rattle.data)
library(tidyverse)
library(caret)
library(ipred)
library(Metrics)
set.seed(500)
data <- weather
# creating train and test data
index <- createDataPartition(data$RainTomorrow, p = .6, list = FALSE)
train_data <- data[ index, ]
test_data <- data[-index, ]
## task b -> error in the for loop
nBoot = 10 # number of bootstrap samples
# create an empty matrix [nr test rows x nr bootstrap samples] to store bootstrap predictions
pred = matrix(data = NA, nrow = nrow(test_data), ncol = nBoot)
train_controls = rpart.control(minsplit = 6, maxdepth = 3)
for(b in 1:nBoot){
#create bootstrap sample
index.boot = sample(x=nrow(train_data), replace = T, size = nrow(train_data))
data_boot = train_data[index.boot,]
#fit data for the bootstrap sample
boot.model = rpart(RainTomorrow ~ .,
data =data_boot,
method = "anova",
control = train_controls)
#rpart.plot(boot.model)
#save prediction for bootstrap
pred[,b] = predict(boot.model, newdata= test_data )
}
#calculate prediction as mean of bootstrap predictions
pred.bagged = rowMeans(pred)
print(rmse(actual = test_data$RainTomorrow, predicted = pred.bagged))
but running this code gives me a warning message:
In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
and I cannot for the life of me figure out the reason (I'm a newbie to machine learning).
EDIT: still looking for a valid answer
The warning occurs because you are trying to calculate RMSE on a factor:
pred.bagged = rowMeans(pred)
class(pred.bagged)
[1] "numeric"
class(test_data$RainTomorrow)
[1] "factor"
You can convert the factor to numeric (which is what rpart did internally when you specified method = "anova") and calculate the RMSE:
rmse(actual = as.numeric(test_data$RainTomorrow), predicted = pred.bagged)
RMSE is normally used for regression and doesn't make much sense for a classification model. For classification you would use method = "class", and for evaluation use accuracy, F1, or Cohen's kappa; you can see an example below with confusionMatrix from caret:
for(b in 1:nBoot){
#create bootstrap sample
index.boot = sample(x=nrow(train_data), replace = T)
data_boot = train_data[index.boot,]
#fit data for the bootstrap sample
boot.model = rpart(RainTomorrow ~ .,
data =data_boot,
method = "class",
control = train_controls)
#rpart.plot(boot.model)
#save prediction for bootstrap
pred[,b] = as.character(predict(boot.model, newdata= test_data ,type="class"))
}
# very crude way to get majority vote
pred.bagged = apply(pred,1,function(i){
names(sort(table(factor(i,levels=c("No","Yes")))))[2]
})
# convert to a factor, same levels as test_data$RainTomorrow
pred.bagged = factor(pred.bagged,levels=c("No","Yes"))
confusionMatrix(pred.bagged, test_data$RainTomorrow)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 120 0
Yes 0 26
Accuracy : 1
95% CI : (0.9751, 1)
No Information Rate : 0.8219
P-Value [Acc > NIR] : 3.672e-13
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.8219
Detection Rate : 0.8219
Detection Prevalence : 0.8219
Balanced Accuracy : 1.0000
'Positive' Class : No
I am using the class package in order to use the KNN algorithm. I am also using the ROCR package to calculate the AUC value.
knn_one<-knn(train, test, train$Digit, k=1)
To calculate the AUC value for another method, e.g. classification trees, I used this series of commands:
treeTrain_Pred<-predict(Tree_Train, test , type = "prob")[,2]
Pred<-prediction(treeTrain_Pred, test$Digit)
Perf<-performance(Pred, "auc")
Perf@y.values[[1]]
However, when I try
knn_one = predict(knn_one, test, type="prob")[,2]
I get the following error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "factor"
How can I fix this and obtain an AUC value for my KNN function?
There is no predict method for knn models; instead, you train and receive the predictions in a single call. Example on the Sonar data:
library(mlbench)
data(Sonar)
create data partition:
set.seed(1)
tr_ind <- sample(1:nrow(Sonar), 150)
train <- Sonar[tr_ind,]
test <- Sonar[-tr_ind,]
mod <- class::knn(cl = train$Class,
test = test[,1:60],
train = train[,1:60],
k = 5,
prob = TRUE)
Now the probabilities of the predictions are in:
attributes(mod)$prob
library(pROC)
roc(test$Class, attributes(mod)$prob)
#output
Call:
roc.default(response = test$Class, predictor = attributes(mod)$prob)
Data: attributes(mod)$prob in 30 controls (test$Class M) < 28 cases (test$Class R).
Area under the curve: 0.4667
plot(roc(test$Class, attributes(mod)$prob),
print.thres = T,
print.auc=T)
Let's try with k = 4:
mod <- class::knn(cl = train$Class,
test = test[,1:60],
train = train[,1:60],
k = 4,
prob = TRUE)
plot(roc(test$Class, attributes(mod)$prob),
print.thres = T,
print.auc = T,
print.auc.y = 0.2)
I've trained a glm model in R. Since I'm trying to predict a very rare case, I get a lot of false positives. I wanted to change the threshold for a positive ("Good") case from the default to 0.7. Here is my code so far.
modFit.glm.ml <- train(as.factor(ml.training$one_lease)~., data=ml.training, method = "glm")
confusionMatrix(ml.testing$one_lease, predict(modFit.glm.ml, ml.testing), positive = "Good")
This code works but it uses the default cutoff.
Someone mentioned this is possible with the predict function, but I don't know how to do it.
You haven't provided a reproducible example, so here's one using the iris dataset to predict if an iris is of type setosa:
dat <- iris
dat$positive <- as.factor(ifelse(dat$Species == "setosa", "s", "ns"))
library(caret)
mod <- train(positive~Sepal.Length, data=dat, method="glm")
To generate a confusion matrix using a cutoff for predicted probabilities other than 0.5, you can threshold the probabilities returned by the predict function using whatever cutoff you want:
confusionMatrix(table(predict(mod, type="prob")[,"s"] >= 0.25,
dat$positive == "s"))
# Confusion Matrix and Statistics
#
#
# FALSE TRUE
# FALSE 88 3
# TRUE 12 47
#
# Accuracy : 0.9
# 95% CI : (0.8404, 0.9429)
# No Information Rate : 0.6667
# P-Value [Acc > NIR] : 2.439e-11
#
# Kappa : 0.7847
# Mcnemar's Test P-Value : 0.03887
#
# Sensitivity : 0.8800
# Specificity : 0.9400
# Pos Pred Value : 0.9670
# Neg Pred Value : 0.7966
# Prevalence : 0.6667
# Detection Rate : 0.5867
# Detection Prevalence : 0.6067
# Balanced Accuracy : 0.9100
#
# 'Positive' Class : FALSE
You haven't specified which package you want to use, so here's another solution using mlr:
library(mlr)
dat = iris
training.set = seq(1, nrow(iris), by = 2)
test.set = seq(2, nrow(iris), by = 2)
dat$positive = as.factor(ifelse(dat$Species == "setosa", "s", "ns"))
task = makeClassifTask(data = dat, target = "positive")
lrn = makeLearner("classif.glmnet", predict.type = "prob")
mod = train(lrn, task, subset = training.set)
pred = predict(mod, task, subset = test.set)
print(getConfMatrix(pred))
pred = setThreshold(pred, c(s = 1))
print(getConfMatrix(pred))
mlr allows you to set the threshold explicitly with setThreshold -- the advantage is that you can use the resulting predictions with any function that measures performance without having to make sure that the threshold is set correctly.
The mlr tutorial has a whole section on classifier calibration, which can help you figure out the best value for this threshold.
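For example, here is a rough sketch of searching for a threshold on an existing prediction with mlr's tuneThreshold, here optimizing mean misclassification error; treat the exact call as an assumption to check against your mlr version:
# search for the cutoff that minimizes mmce on the existing prediction,
# then apply it with setThreshold
tuned = tuneThreshold(pred = pred, measure = mmce)
pred = setThreshold(pred, tuned$th)
print(getConfMatrix(pred))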