Custom Precision-Recall AUC measure in mlr3

I would like to create a custom Precision-Recall AUC measure in mlr3.
I am following the mlr3 book chapter on creating custom measures.
I feel I'm almost there, but R throws an annoying error that I don't know how to interpret.
Let's define the measure:
PRAUC = R6::R6Class("PRAUC",
  inherit = mlr3::MeasureClassif,
  public = list(
    initialize = function() {
      super$initialize(
        # custom id for the measure
        id = "classif.prauc",
        # additional packages required to calculate this measure
        packages = c('PRROC'),
        # properties, see below
        properties = character(),
        # required predict type of the learner
        predict_type = "prob",
        # feasible range of values
        range = c(0, 1),
        # minimize during tuning?
        minimize = FALSE
      )
    }
  ),
  private = list(
    # custom scoring function operating on the prediction object
    .score = function(prediction, ...) {
      # PRROC::pr.curve assumes the binary response is numeric: positive class is 1, negative class is 0
      truth1 <- ifelse(prediction$truth == levels(prediction$truth)[1], 1, 0)
      PRROC::pr.curve(scores.class0 = prediction$prob, weights.class0 = truth1)
    }
  )
)
mlr3::mlr_measures$add("classif.prauc", PRAUC)
Let's see if it works:
task_sonar <- tsk('sonar')
learner <- lrn('classif.rpart', predict_type = 'prob')
learner$train(task_sonar)
pred <- learner$predict(task_sonar)
pred$score(msr('classif.prauc'))
# Error in if (sum(weights < 0) != 0) { :
# missing value where TRUE/FALSE needed
Here's the traceback:
11.
check(length(sorted.scores.class0), weights.class0)
10.
compute.pr(scores.class0, scores.class1, weights.class0, weights.class1,
curve, minStepSize, max.compute, min.compute, rand.compute,
dg.compute)
9.
PRROC::pr.curve(scores.class0 = prediction$prob, weights.class0 = truth1)
8.
measure$.__enclos_env__$private$.score(prediction = prediction,
task = task, learner = learner, train_set = train_set)
7.
measure_score(self, prediction, task, learner, train_set)
6.
m$score(prediction = self, task = task, learner = learner, train_set = train_set)
5.
FUN(X[[i]], ...)
4.
vapply(.x, .f, FUN.VALUE = .value, USE.NAMES = FALSE, ...)
3.
map_mold(.x, .f, NA_real_, ...)
2.
map_dbl(measures, function(m) m$score(prediction = self, task = task,
learner = learner, train_set = train_set))
1.
pred$score(msr("classif.prauc"))
It seems like the glitch is coming from PRROC::pr.curve. However, when trying this function on the actual prediction object pred, it works just fine:
PRROC::pr.curve(
scores.class0 = pred$prob[, 1],
weights.class0 = ifelse(pred$truth == levels(pred$truth)[1], 1, 0)
)
# Precision-recall curve
#
# Area under curve (Integral):
# 0.9081261
#
# Area under curve (Davis & Goadrich):
# 0.9081837
#
# Curve not computed ( can be done by using curve=TRUE )
One likely scenario is that, inside PRAUC, PRROC::pr.curve's argument weights.class0 ends up being NA. I haven't been able to confirm this, but I suspect weights.class0 is receiving NA instead of a numeric vector, causing PRROC::pr.curve to malfunction inside PRAUC. If that's the case, I don't know why it's happening.
There may be other scenarios that I haven't thought of. Any help will be much appreciated.
EDIT
missuse's answer helped me realize why my measure isn't working. First,
PRROC::pr.curve(scores.class0 = prediction$prob, weights.class0 = truth1)
should be
PRROC::pr.curve(scores.class0 = prediction$prob[, 1], weights.class0 = truth1).
Second, pr.curve returns an object of class PRROC, while the mlr3 measure I've defined expects a single numeric value. So it should be
PRROC::pr.curve(scores.class0 = prediction$prob[, 1], weights.class0 = truth1)[[2]]
or
PRROC::pr.curve(scores.class0 = prediction$prob[, 1], weights.class0 = truth1)[[3]],
depending on the method used to compute the AUC (see ?PRROC::pr.curve).
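A quick check outside mlr3 makes the first point concrete (a hedged sketch using the pred object from above): pred$prob is a matrix with one column per class, so passing the whole matrix reproduces the original error, while passing a single column and extracting element [[2]] yields the single numeric value the measure needs.
truth1 <- ifelse(pred$truth == levels(pred$truth)[1], 1, 0)
str(pred$prob) # numeric matrix with one column per class, not a vector
try(PRROC::pr.curve(scores.class0 = pred$prob, weights.class0 = truth1))      # reproduces the error
PRROC::pr.curve(scores.class0 = pred$prob[, 1], weights.class0 = truth1)[[2]] # single numeric AUC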
Note that although MLmetrics::PRAUC is far less confusing than PRROC::pr.curve, it seems like the former is poorly implemented.
Here's an implementation of the measure with PRROC::pr.curve that actually works:
PRAUC = R6::R6Class("PRAUC",
  inherit = mlr3::MeasureClassif,
  public = list(
    initialize = function() {
      super$initialize(
        # custom id for the measure
        id = "classif.prauc",
        # additional packages required to calculate this measure
        packages = c('PRROC'),
        # properties, see below
        properties = character(),
        # required predict type of the learner
        predict_type = "prob",
        # feasible range of values
        range = c(0, 1),
        # minimize during tuning?
        minimize = FALSE
      )
    }
  ),
  private = list(
    # custom scoring function operating on the prediction object
    .score = function(prediction, ...) {
      # Looks like in mlr3 the positive class in binary classification is always the first factor level
      truth1 <- ifelse(prediction$truth == levels(prediction$truth)[1], 1, 0)
      PRROC::pr.curve(
        # Looks like in mlr3 the positive class in binary classification is always the first of two columns
        scores.class0 = prediction$prob[, 1],
        weights.class0 = truth1
      )[[2]]
    }
  )
)
mlr3::mlr_measures$add("classif.prauc", PRAUC)
Example:
task_sonar <- tsk('sonar')
learner <- lrn('classif.rpart', predict_type = 'prob')
learner$train(task_sonar)
pred <- learner$predict(task_sonar)
pred$score(msr('classif.prauc'))
#classif.prauc
# 0.923816
However, the issue now is that changing the positive class results in a different score:
task_sonar <- tsk('sonar')
task_sonar$positive <- 'R' # Now R is the positive class
learner <- lrn('classif.rpart', predict_type = 'prob')
learner$train(task_sonar)
pred <- learner$predict(task_sonar)
pred$score(msr('classif.prauc'))
#classif.prauc
# 0.9081261
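This difference is expected rather than a bug: precision and recall are defined with respect to the positive class, so PR-AUC legitimately changes when the positive class changes. A minimal sketch of a drop-in private list for the measure above that makes the positive-class handling explicit by indexing the probability matrix by class name (assuming the mlr3 convention that the positive class is the first factor level of prediction$truth):
private = list(
  .score = function(prediction, ...) {
    pos    <- levels(prediction$truth)[1]             # positive class label
    truth1 <- as.integer(prediction$truth == pos)
    PRROC::pr.curve(scores.class0 = prediction$prob[, pos],  # select the probability column by name
                    weights.class0 = truth1)[[2]]
  }
)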

?PRROC::pr.curve is rather confusing, so I will use MLmetrics::PRAUC to calculate PRAUC:
library(mlr3measures)
library(mlr3)
PRAUC = R6::R6Class("PRAUC",
  inherit = mlr3::MeasureClassif,
  public = list(
    initialize = function() {
      super$initialize(
        # custom id for the measure
        id = "classif.prauc",
        # additional packages required to calculate this measure
        packages = c('MLmetrics'),
        # properties, see below
        properties = character(),
        # required predict type of the learner
        predict_type = "prob",
        # feasible range of values
        range = c(0, 1),
        # minimize during tuning?
        minimize = FALSE
      )
    }
  ),
  private = list(
    # custom scoring function operating on the prediction object
    .score = function(prediction, ...) {
      MLmetrics::PRAUC(prediction$prob[, 1], # probs for 1st class (positive class is in the first column)
                       as.integer(prediction$truth == levels(prediction$truth)[1])) # truth for 1st class
    }
  )
)
To verify it works:
mlr3::mlr_measures$add("classif.prauc", PRAUC)
task_sonar <- tsk('sonar')
learner <- lrn('classif.rpart', predict_type = 'prob')
learner$train(task_sonar)
pred <- learner$predict(task_sonar)
pred$score(msr('classif.prauc'))
classif.prauc
0.8489383
MLmetrics::PRAUC(pred$data$prob[,1],
as.integer(pred$truth == "M"))
0.8489383
EDIT: the measure implementation using PRROC::pr.curve is given as an edit to the question above. It is advisable to use that implementation, since PRROC::pr.curve is more precise than MLmetrics::PRAUC.
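For a quick sanity check, both estimates can be computed side by side on the same prediction object (a hedged sketch; output not shown):
probs  <- pred$prob[, 1]
labels <- as.integer(pred$truth == levels(pred$truth)[1])
c(PRROC     = PRROC::pr.curve(scores.class0 = probs, weights.class0 = labels)[[2]],
  MLmetrics = MLmetrics::PRAUC(y_pred = probs, y_true = labels))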

Related

R LightGBM ignores init_score when continuing training with init_model

General description of my problem
I am performing a Poisson regression using LightGBM in R.
I am using an "offset" for the training, similar to using log(time) as the offset in a GLM when modelling insurance claims, because we want to ensure that the expected value of the response is proportional to time. I do this using the init_score parameter within lgb.train().
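For reference, the GLM analogue of this offset would look like the following (a hedged sketch using the simulated data d defined in the reproducible example below):
# Poisson GLM with log(t) as the offset, for comparison only
glm(z ~ x + y, family = poisson(), offset = log(t), data = d)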
I am using the "continue training" option in lgb.train (where you specify a value for init_model). This is because I want to build a "stumps" model first, and then continue training with a more complex model. This is to help me identify potential interaction terms in the data. This is just for background why I am doing this - not relevant to the specific issue described below.
However, when I continue training, the offset originally specified in the first model I build is no longer used by the fitting process. I think init_model overrides any value of init_score, but init_model does NOT itself contain or allow for init_score. So, as far as I can see, the init_score is totally lost from the fitting process once you continue training using init_model.
This means that the "starting point" when continuing to train a model is not the "finishing point" from the original model build. e.g. in my example below, I want the poisson log-likelihood error metric for models 2 and 3 to "start" from where model 1 finished. This isn't the case - but surely that is what "continue training" should deliver?
I have entered comments into the code below to explain the issue more clearly.
Reproducible example
library(lightgbm)
library(data.table)
# simulate some data
# z follows a Poisson distribution
# the mean of z is given by t * exp(x+y), where t is the "time exposed to risk"
# t is uniform(0,10)
# x and y are uniform(0,1)
# I want to specify log(t) using init_score in the lightGBM
# i.e. just like Poisson regression in insurance where log(t) is the offset in a GLM or GBM
n <- 10000 # number of rows
set.seed(42)
d <- data.table(t = runif(n,0,10), x = runif(n,0,1), y = runif(n,0,1))
d[, z := rpois(n, t * exp(x+y))]
# check weighted mean looks about right
# should get actual = 2.957188 and
# underlying = 2.939975
d[, list(actual = sum(z)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# build a lightGBM using 100 rounds and specify log(t) as init_score
feature_cols <- c('x','y')
dm <- as.matrix(d[, ..feature_cols])
l_train <- lgb.Dataset(dm, label=d[,z], free_raw_data = FALSE)
setinfo(l_train, "init_score", log(d$t))
params <- list(objective='poisson', metric = 'poisson')
lgbm_1 <- lgb.train(params = params,
valids = list(train = l_train),
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_1 <- lgb.get.eval.result(lgbm_1, "train", 'poisson')
# get the model predictions and check that they are close to expected
# remember that we need to manually apply the init_score to get the prediction
# i.e. we need to add log(t) onto the raw score, or multiply the scaled prediction by t
# the predictions are all very close
d[, lgbm_predicted_1 := t*predict(lgbm_1, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# save the model
lgb.save(lgbm_1, 'lgbm_1.txt')
# ATTEMPT A - CONTINUE TRAINING FROM MODEL 1
# don't change the init_score
# note iterations in console start at 101 because we are continuing training
# however, the error metric (poisson log likelihood)
# start from a totally different value to where the first model ended
lgbm_2 <- lgb.train(params = params,
init_model = 'lgbm_1.txt',
valids = list(train = l_train),
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_2 <- lgb.get.eval.result(lgbm_2, "train", 'poisson')
# check predictions - predicted_2 are WAY TOO HIGH now!
# I think this is because lightGBM uses the predictions from the first model
# as the starting point for training
# but the predictions from model 1 DO NOT ALLOW FOR THE log(t) being the offset to the original model!
d[, lgbm_predicted_2 := t*predict(lgbm_2, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
predicted_2 = sum(lgbm_predicted_2)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# ATTEMPT B - try init_score = 0?
# doesn't seem to make any difference
# so my hypothesis is that init_score is being ignored
# and over-written by the init_model
# but... how does the original init_score ever get back into the fitting process?
# init_score + init_model is a good starting point
# init_model on its own is not
setinfo(l_train, "init_score", rep(0, nrow(d)))
lgbm_3 <- lgb.train(params = params,
valids = list(train = l_train),
init_model = 'lgbm_1.txt',
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_3 <- lgb.get.eval.result(lgbm_3, "train", 'poisson')
# check predictions - models 2 and 3 are identical, the init_score made no difference
d[, lgbm_predicted_3 := t*predict(lgbm_3, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
predicted_2 = sum(lgbm_predicted_2)/sum(t),
predicted_3 = sum(lgbm_predicted_3)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# compare training logs
# question - why do V2 and V3 not start from the "finishing" point of V1?
# it's because the init_model is wrong, because it doesn't allow for the init_score
logs <- data.table(v1 = train_log_1, v2 = train_log_2, v3 = train_log_3)
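A possible workaround, shown below as a hedged sketch rather than a confirmed fix: instead of relying on init_model to carry the offset, fold model 1's raw score plus log(t) into a fresh init_score and fit a new booster from scratch. The raw_score argument mirrors the predict() usage above; its exact name may differ between lightgbm versions.
# ATTEMPT C (sketch) - bake model 1's raw score plus the offset into init_score,
# then train a brand new booster instead of continuing from init_model
raw_1 <- predict(lgbm_1, dm, raw_score = TRUE)    # raw (log-scale) score of model 1, excluding init_score
setinfo(l_train, "init_score", raw_1 + log(d$t))  # new offset = original offset + model 1's contribution
lgbm_4 <- lgb.train(params = params,
                    valids = list(train = l_train),
                    data = l_train,
                    nrounds = 100,
                    num_leaves = 2,
                    learning_rate = 0.2)
train_log_4 <- lgb.get.eval.result(lgbm_4, "train", 'poisson')
# train_log_4 should now start close to where train_log_1 finished
# as before, apply the full init_score manually to get predictions on the response scale
d[, lgbm_predicted_4 := t * exp(raw_1) * predict(lgbm_4, dm, raw_score = FALSE)]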

How to use `boot` while ignoring the internally generated indices

The problem
In a standard use of the boot::boot function, one would define the function statistic to use the indices created internally by boot:
boot( data = d,
R = boot.reps,
parallel = "multicore",
statistic = function(original, indices){
# ***HERE IS THE KEY PART:
b = original[indices,]
mean(b$y)
} )
With the default sim = “ordinary” option, the indices should just be resampled with replacement from the original dataset’s rows. Easy enough.
However, in place of b = original[indices,], I want to use a custom resampling method that doesn't use boot's generated indices. Specifically, I want to do cluster resampling using cfdecomp::cluster.resample, and I want to do it using the boot package because it seems to have a monopoly on constructing bias-corrected and accelerated (BCa) confidence intervals.
For MWE below, we don't need to get into the weeds with clustering because I have reproduced the problem more simply by just resampling with replacement using my own indices:
myInd = sample(1:nrow(original),
size = nrow(original),
replace = TRUE)
# ***HERE IS THE KEY PART:
b = original[myInd,]
The simulation below indicates that these two methods perform very differently. The standard use, “method a”, has 100% coverage, while the second method, “method b”, has 66% coverage. So obviously they are not resampling in an equivalent way.
My attempt at self-rescue
I thought the boot function might for some reason need to use the indices again after calling statistic, and that's why I can't make my own indices, so I tried reassigning the indices as indices <<- myInd (commented out in the simulation code below). But this makes everything go completely crazy, with the bootstrap “statistics” boot.res$t now equal to indices themselves! I tried looking through the innards of the boot function, but it's a sophisticated package and I couldn't see the problem easily.
Simulation
library(boot)
library(foreach)
library(doParallel)
library(dplyr)
sim.reps = 500
boot.reps = 1000
# helper fn to check for CI coverage of truth
covers = function( truth, lo, hi ) {
return( (lo <= truth) & (hi >= truth) )
}
# generate standard normal data with n=200
# we are going to estimate the mean
d = data.frame( ID = 1:200,
y = rnorm(200) )
# run the simulation
# takes a couple minutes on my 8-core machine
rs = foreach( i = 1:sim.reps, .combine=rbind ) %dopar% {
# Method A: standard use of internally generated indices
tryCatch({
boot.res1 = boot( data = d,
R = boot.reps,
parallel = "multicore",
statistic = function(original, indices){
# ***HERE IS THE KEY PART:
b = original[indices,]
mean(b$y)
} )
CI1 = boot.ci(boot.res1, type = "bca")
lo1 = CI1[[4]][4]
hi1 = CI1[[4]][5]
btSD1 = as.numeric( sd(boot.res1$t) )
}, error = function(err){
lo1 <<- NA
hi1 <<- NA
btSD1 <<- NA
} )
# Method B: using my own indices
tryCatch({
boot.res2 = boot( data = d,
R = boot.reps,
parallel = "multicore",
statistic = function(original, indices){
myInd = sample(1:nrow(original),
size = nrow(original),
replace = TRUE)
# ***HERE IS THE KEY PART:
b = original[myInd,]
mean(b$y)
# this makes things even worse:
# causes boot.res$t to be integers?? wtf?
# indices <<- myInd
} )
CI2 = boot.ci(boot.res2, type = "bca")
lo2 = CI2[[4]][4]
hi2 = CI2[[4]][5]
btSD2 = as.numeric( sd(boot.res2$t) )
}, error = function(err){
lo2 <<- NA
hi2 <<- NA
btSD2 <<- NA
} )
return( data.frame( method = c("a", "b"),
covers = c( covers( truth = 0, lo = lo1, hi = hi1 ),
covers( truth = 0, lo = lo2, hi = hi2 ) ),
btSD = c( btSD1, btSD2 ) ) )
}
rs %>% group_by(method) %>%
summarise( btFail = mean(is.na(covers)),
covers = mean(covers, na.rm = TRUE),
btSD = mean(btSD, na.rm = TRUE) )
# btSD should be close to the true SE, i.e., 1/sqrt(200) = 0.07071068
# results:
# method btFail covers btSD
# <chr> <dbl> <dbl> <dbl>
# 1 a 0 1 0.0718
# 2 b 0.002 0.655 0.0718
You have an XY problem. Instead of trying to override the indices, just use tidyr's new nest/unnest functionality to cluster-bootstrap as follows:
dNest = d %>% group_nest(cluster)
boot( data = dNest,
statistic = function(original, indices) {
bNest = original[indices,]
# "data" is an automatically generated column name in dNest
b = bNest %>% unnest(data)
return( my_stat(b) ) } )
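For completeness, here is a self-contained version of that idea (a hedged sketch; the cluster structure and the statistic are illustrative, not taken from the original post):
library(boot)
library(dplyr)
library(tidyr)
set.seed(1)
# toy clustered data: 50 clusters of 4 observations each
d_clust <- data.frame(cluster = rep(1:50, each = 4), y = rnorm(200))
dNest <- d_clust %>% group_nest(cluster)  # one row per cluster; the nested column is called "data"
boot.res <- boot(data = dNest,
                 R = 999,
                 statistic = function(original, indices) {
                   b <- original[indices, ] %>% unnest(data)  # resample whole clusters
                   mean(b$y)
                 })
boot.ci(boot.res, type = "bca")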
Alternatively, for more flexibility beyond cluster bootstrapping, you could create a fake boot object from your own resamples.

R: Caret package: Brier Score

I want to perform a logistic regression with the train() function from the caret package. My model looks something like that:
model <- train(Y ~.,
data = train_data,
family = "binomial",
method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now, I want to evaluate how good the model predictions are in comparison with the actual test data. I know how to obtain the ROC and AUC, but I am also interested in the Brier score. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of one prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need a single probability estimate for each prediction that contains the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any solution for obtaining the Brier score in the caret package. I am aware that with glmnet's cv.glmnet the predict function allows type = "response", which would solve my problem. However, as a matter of personal preference I would like to stay with the caret package.
Thanks for the help!
If we go by the Wikipedia definition of the Brier score, the most common formulation is
BS = (1/N) * sum_{t=1}^{N} (f_t - o_t)^2
where f_t is the probability that was forecast, o_t the actual outcome of the event (0 or 1) and N is the number of forecasting instances.
In R, if your label is a factor, the logistic regression will always model the probability of the 2nd level, so you calculate both the predicted probability and the 0/1 outcome with respect to that level. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~.,data = train_data,family = "binomial",method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the probability of the class:
head(pred)
o v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]
o_t = as.numeric(test_data$Species)-1
mean((f_t - o_t)^2)
[1] 0.32
I use the Brier score to tune my models in caret for binary classification. I ensure that the "positive" class is the second class, which is the default when you label your response "0:1". Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
ifelse(data$obs == lev[2], 1, 0)),
silent = TRUE)
brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
silent = TRUE)
rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
direction = "<", quiet = TRUE), silent = TRUE)
if (inherits(pr_auc, "try-error")) pr_auc <- NA
if (inherits(brscore, "try-error")) brscore <- NA
rocAUC <- if (inherits(rocObject, "try-error")) {
NA
} else {
rocObject$auc
}
tmp <- unlist(e1071::classAgreement(table(data$obs,
data$pred)))[c("diag", "kappa")]
out <- c(Acc = tmp[[1]],
Kappa = tmp[[2]],
AUCROC = rocAUC,
AUCPR = pr_auc,
Brier = brscore,
Precision = caret:::precision.default(data = data$pred,
reference = data$obs,
relevant = lev[2]),
Recall = caret:::recall.default(data = data$pred,
reference = data$obs,
relevant = lev[2]),
F = caret:::F_meas.default(data = data$pred, reference = data$obs,
relevant = lev[2]))
out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.
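A usage sketch of how that fits together (the resampling settings here are illustrative, and train_data is the iris-based example from above):
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,            # required so the class-probability columns exist in `data`
                     summaryFunction = BigSummary)
model_brier <- train(Species ~ ., data = train_data,
                     method = "glmnet", family = "binomial",
                     trControl = ctrl,
                     metric = "Brier", maximize = FALSE)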

mlr3 PipeOps: Create branches with different data transformations and benchmark different learners within and between branches

I'd like use PipeOps to train a learner on three alternative transformations of a dataset:
No transformation.
Class balancing- down.
Class balancing- up.
Then, I'd like to benchmark the three learned models.
My idea was to set up the pipeline as follows:
Make pipeline: Input -> Impute dataset (optional) -> Branch -> Split into the three branches described above -> Add the learner within each branch -> Unbranch.
Train the pipeline and hope (that's where I'm getting it wrong) that there will be a result saved for each learner within each branch.
Unfortunately, following these steps results in a single learner that seems to have 'merged' everything from the different branches. I was hoping to get a list of length 3, but I get a list of length one instead.
R code:
library(data.table)
library(paradox)
library(mlr3)
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(mlr3viz)
learner <- lrn("classif.rpart", predict_type = "prob")
learner$param_set$values <- list(
cp = 0,
maxdepth = 21,
minbucket = 12,
minsplit = 24
)
graph =
  po("imputehist") %>>%
  po("branch", c("nop", "classbalancing_up", "classbalancing_down")) %>>%
  gunion(list(
    po("nop", id = "null"),
    po("classbalancing", id = "classbalancing_down", ratio = 2, reference = 'minor'),
    po("classbalancing", id = "classbalancing_up", ratio = 2, reference = 'major')
  )) %>>%
  gunion(list(
    po("learner", learner, id = "learner_null"),
    po("learner", learner, id = "learner_classbalancing_down"),
    po("learner", learner, id = "learner_classbalancing_up")
  )) %>>%
  po("unbranch")
plot(graph)
tr <- mlr3::resample(tsk("iris"), graph, rsmp("holdout"))
tr$learners
Question 1
How can I get three different results instead?
Question 2
How can I benchmark these three results within the pipeline following unbranching?
Question 3
What if I want to add multiple learners within each branch? I'd like some of the learners to be inserted with fixed hyperparameters, while for others I'd like to have their hyperparameters tuned with AutoTuner within each branch. Then, I'd like to benchmark them within each branch and select the 'best' from each branch. Finally, I'd like to benchmark the three best learners to end up with the single best.
Many thanks.
I think that I've found the answer to what I'm looking for. In brief, what I'd like to do is:
Create a graph pipeline with multiple learners. I'd like some of the learners to be inserted with fixed hyperparameters, while for others I'd like to have their hyperparameters tuned. Then, I'd like to benchmark them and select the 'best' one. I'd also like the benchmarking of learners to happen under different class balancing strategies, namely, do nothing, up-sample and down-sample. The optimal parameter settings for the up/down-sampling (e.g. ratio) would also be determined during tuning.
Two examples below, one that almost does what I want, the other doing exactly what I want.
Example 1: Build a pipe that includes all learners, that is, learners with fixed hyperparameters, as well as learners whose hyperparameters require tuning
As will be shown, it seems like a bad idea to have both kinds of learners (i.e. with fixed and tunable hyperparameters), because tuning the pipe disregards the learners with tunable hyperparameters.
####################################################################################
# Build Machine Learning pipeline that:
# 1. Imputes missing values (optional).
# 2. Tunes and benchmarks a range of learners.
# 3. Handles imbalanced data in different ways.
# 4. Identifies optimal learner for the task at hand.
# Abbreviations
# 1. td: Tuned. Learner already tuned with optimal hyperparameters, as found empirically by Probst et al. (2019). See http://jmlr.csail.mit.edu/papers/volume20/18-444/18-444.pdf
# 2. tn: Tuner. Optimal hyperparameters for the learner to be determined within the Tuner.
# 3. raw: Raw dataset in that class imbalances were not treated in any way.
# 4. up: Data upsampling to balance class imbalances.
# 5. down: Data downsampling to balance class imbalances.
# References
# Probst et al. (2019). http://jmlr.csail.mit.edu/papers/volume20/18-444/18-444.pdf
####################################################################################
library(dplyr)  # for the %>% pipeline below
library(tibble) # for rownames_to_column() and deframe()
task <- tsk('sonar')
# Indices for splitting data into training and test sets
train.idx <- task$data() %>%
select(Class) %>%
rownames_to_column %>%
group_by(Class) %>%
sample_frac(2 / 3) %>% # Stratified sample to maintain proportions between classes.
ungroup %>%
select(rowname) %>%
deframe %>%
as.numeric
test.idx <- setdiff(seq_len(task$nrow), train.idx)
# Define training and test sets in task format
task_train <- task$clone()$filter(train.idx)
task_test <- task$clone()$filter(test.idx)
# Define class balancing strategies
class_counts <- table(task_train$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] /
class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio
# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor",
reference = "minor", shuffle = FALSE, ratio = upsample_ratio)
# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major",
reference = "major", shuffle = FALSE, ratio = downsample_ratio)
# 3. No class balancing
po_raw <- po("nop", id = "raw") # Pipe operator for 'do nothing' ('nop'), i.e. don't up/down-balance the classes.
# We will be using an XGBoost learner throughout with different hyperparameter settings.
# Define XGBoost learner with the optimal hyperparameters of Probst et al.
# Learner will be added to the pipeline later on, in conjuction with and without class balancing.
xgb_td <- lrn("classif.xgboost", predict_type = 'prob')
xgb_td$param_set$values <- list(
booster = "gbtree",
nrounds = 2563,
max_depth = 11,
min_child_weight = 1.75,
subsample = 0.873,
eta = 0.052,
colsample_bytree = 0.713,
colsample_bylevel = 0.638,
lambda = 0.101,
alpha = 0.894
)
xgb_td_raw <- GraphLearner$new(
po_raw %>>%
po('learner', xgb_td, id = 'xgb_td'),
predict_type = 'prob'
)
xgb_tn_raw <- GraphLearner$new(
po_raw %>>%
po('learner', lrn("classif.xgboost",
predict_type = 'prob'), id = 'xgb_tn'),
predict_type = 'prob'
)
xgb_td_up <- GraphLearner$new(
po_over %>>%
po('learner', xgb_td, id = 'xgb_td'),
predict_type = 'prob'
)
xgb_tn_up <- GraphLearner$new(
po_over %>>%
po('learner', lrn("classif.xgboost",
predict_type = 'prob'), id = 'xgb_tn'),
predict_type = 'prob'
)
xgb_td_down <- GraphLearner$new(
po_under %>>%
po('learner', xgb_td, id = 'xgb_td'),
predict_type = 'prob'
)
xgb_tn_down <- GraphLearner$new(
po_under %>>%
po('learner', lrn("classif.xgboost",
predict_type = 'prob'), id = 'xgb_tn'),
predict_type = 'prob'
)
learners_all <- list(
xgb_td_raw,
xgb_tn_raw,
xgb_td_up,
xgb_tn_up,
xgb_td_down,
xgb_tn_down
)
names(learners_all) <- sapply(learners_all, function(x) x$id)
# Create pipeline as a graph. This way, pipeline can be plotted. Pipeline can then be converted into a learner with GraphLearner$new(pipeline).
# Pipeline is a collection of Graph Learners (type ?GraphLearner in the command line for info).
# Each GraphLearner is a td or tn model (see abbreviations above) with or without class balancing.
# Up/down or no sampling happens within each GraphLearner, otherwise an error during tuning indicates that there are >= 2 data sources.
# Up/down or no sampling within each GraphLearner can be specified by chaining the relevant pipe operators (function po(); type ?PipeOp in command line) with the PipeOp of each learner.
graph <-
#po("imputehist") %>>% # Optional. Impute missing values only when using classifiers that can't handle them (e.g. Random Forest).
po("branch", names(learners_all)) %>>%
gunion(unname(learners_all)) %>>%
po("unbranch")
graph$plot() # Plot pipeline
pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # Don't forget to specify we want to predict probabilities and not classes.
ps_table <- as.data.table(pipe$param_set)
View(ps_table[, 1:4])
# Set hyperparameter ranges for the tunable learners
ps_xgboost <- ps_table$id %>%
lapply(
function(x) {
if (grepl('_tn', x)) {
if (grepl('.booster', x)) {
ParamFct$new(x, levels = "gbtree")
} else if (grepl('.nrounds', x)) {
ParamInt$new(x, lower = 100, upper = 110)
} else if (grepl('.max_depth', x)) {
ParamInt$new(x, lower = 3, upper = 10)
} else if (grepl('.min_child_weight', x)) {
ParamDbl$new(x, lower = 0, upper = 10)
} else if (grepl('.subsample', x)) {
ParamDbl$new(x, lower = 0, upper = 1)
} else if (grepl('.eta', x)) {
ParamDbl$new(x, lower = 0.1, upper = 0.6)
} else if (grepl('.colsample_bytree', x)) {
ParamDbl$new(x, lower = 0.5, upper = 1)
} else if (grepl('.gamma', x)) {
ParamDbl$new(x, lower = 0, upper = 5)
}
}
}
)
ps_xgboost <- Filter(Negate(is.null), ps_xgboost)
ps_xgboost <- ParamSet$new(ps_xgboost)
# Set parameter ranges for the class balancing strategies
ps_class_balancing <- ps_table$id %>%
lapply(
function(x) {
if (all(grepl('up.', x), grepl('.ratio', x))) {
ParamDbl$new(x, lower = 1, upper = upsample_ratio)
} else if (all(grepl('down.', x), grepl('.ratio', x))) {
ParamDbl$new(x, lower = downsample_ratio, upper = 1)
}
}
)
ps_class_balancing <- Filter(Negate(is.null), ps_class_balancing)
ps_class_balancing <- ParamSet$new(ps_class_balancing)
# Define parameter set
param_set <- ParamSetCollection$new(list(
ParamSet$new(list(pipe$param_set$params$branch.selection$clone())), # ParamFct can be copied.
ps_xgboost,
ps_class_balancing
))
# Add dependencies. For instance, we can only set the mtry value if the pipe is configured to use the Random Forest (ranger).
# In a similar manner, we want to add a dependency between, e.g., hyperparameter "raw.xgb_td.xgb_tn.booster" and branch "raw.xgb_td"
# See https://mlr3gallery.mlr-org.com/tuning-over-multiple-learners/
param_set$ids()[-1] %>%
lapply(
function(x) {
aux <- names(learners_all) %>%
sapply(
function(y) {
grepl(y, x)
}
)
aux <- names(aux[aux])
param_set$add_dep(x, "branch.selection",
CondEqual$new(aux))
}
)
# Set up tuning instance
instance <- TuningInstance$new(
task = task_train,
learner = pipe,
resampling = rsmp('cv', folds = 2),
measures = msr("classif.bbrier"),
#measures = prc_micro,
param_set,
terminator = term("evals", n_evals = 3))
tuner <- TunerRandomSearch$new()
# Tune pipe learner to find best-performing branch
tuner$tune(instance)
instance$result
instance$archive()
instance$archive(unnest = "tune_x") # Unnest the tuner search space values
pipe$param_set$values <- instance$result$params
pipe$train(task_train)
pred <- pipe$predict(task_test)
pred$confusion
Note that the tuner chooses to disregard the tuning of the tunable learners and focuses on the tuned learners only. This can be confirmed by inspecting instance$result: the only things that have been tuned for the tunable learners are the class-balancing parameters, which are actually not learner hyperparameters.
Example 2: Build a pipe that includes tunable learners only, find the 'best' one, and then benchmark it against the learners with fixed hyperparameters at a second stage.
Step 1: Build pipe for tunable learners
learners_all <- list(
#xgb_td_raw,
xgb_tn_raw,
#xgb_td_up,
xgb_tn_up,
#xgb_td_down,
xgb_tn_down
)
names(learners_all) <- sapply(learners_all, function(x) x$id)
# Create pipeline as a graph. This way, pipeline can be plotted. Pipeline can then be converted into a learner with GraphLearner$new(pipeline).
# Pipeline is a collection of Graph Learners (type ?GraphLearner in the command line for info).
# Each GraphLearner is a td or tn model (see abbreviations above) with or without class balancing.
# Up/down or no sampling happens within each GraphLearner, otherwise an error during tuning indicates that there are >= 2 data sources.
# Up/down or no sampling within each GraphLearner can be specified by chaining the relevant pipe operators (function po(); type ?PipeOp in command line) with the PipeOp of each learner.
graph <-
#po("imputehist") %>>% # Optional. Impute missing values only when using classifiers that can't handle them (e.g. Random Forest).
po("branch", names(learners_all)) %>>%
gunion(unname(learners_all)) %>>%
po("unbranch")
graph$plot() # Plot pipeline
pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # Don't forget to specify we want to predict probabilities and not classes.
ps_table <- as.data.table(pipe$param_set)
View(ps_table[, 1:4])
ps_xgboost <- ps_table$id %>%
lapply(
function(x) {
if (grepl('_tn', x)) {
if (grepl('.booster', x)) {
ParamFct$new(x, levels = "gbtree")
} else if (grepl('.nrounds', x)) {
ParamInt$new(x, lower = 100, upper = 110)
} else if (grepl('.max_depth', x)) {
ParamInt$new(x, lower = 3, upper = 10)
} else if (grepl('.min_child_weight', x)) {
ParamDbl$new(x, lower = 0, upper = 10)
} else if (grepl('.subsample', x)) {
ParamDbl$new(x, lower = 0, upper = 1)
} else if (grepl('.eta', x)) {
ParamDbl$new(x, lower = 0.1, upper = 0.6)
} else if (grepl('.colsample_bytree', x)) {
ParamDbl$new(x, lower = 0.5, upper = 1)
} else if (grepl('.gamma', x)) {
ParamDbl$new(x, lower = 0, upper = 5)
}
}
}
)
ps_xgboost <- Filter(Negate(is.null), ps_xgboost)
ps_xgboost <- ParamSet$new(ps_xgboost)
ps_class_balancing <- ps_table$id %>%
lapply(
function(x) {
if (all(grepl('up.', x), grepl('.ratio', x))) {
ParamDbl$new(x, lower = 1, upper = upsample_ratio)
} else if (all(grepl('down.', x), grepl('.ratio', x))) {
ParamDbl$new(x, lower = downsample_ratio, upper = 1)
}
}
)
ps_class_balancing <- Filter(Negate(is.null), ps_class_balancing)
ps_class_balancing <- ParamSet$new(ps_class_balancing)
param_set <- ParamSetCollection$new(list(
ParamSet$new(list(pipe$param_set$params$branch.selection$clone())), # ParamFct can be copied.
ps_xgboost,
ps_class_balancing
))
# Add dependencies. For instance, we can only set the mtry value if the pipe is configured to use the Random Forest (ranger).
# In a similar manner, we want to add a dependency between, e.g., hyperparameter "raw.xgb_td.xgb_tn.booster" and branch "raw.xgb_td"
# See https://mlr3gallery.mlr-org.com/tuning-over-multiple-learners/
param_set$ids()[-1] %>%
lapply(
function(x) {
aux <- names(learners_all) %>%
sapply(
function(y) {
grepl(y, x)
}
)
aux <- names(aux[aux])
param_set$add_dep(x, "branch.selection",
CondEqual$new(aux))
}
)
# Set up tuning instance
instance <- TuningInstance$new(
task = task_train,
learner = pipe,
resampling = rsmp('cv', folds = 2),
measures = msr("classif.bbrier"),
#measures = prc_micro,
param_set,
terminator = term("evals", n_evals = 3))
tuner <- TunerRandomSearch$new()
# Tune pipe learner to find best-performing branch
tuner$tune(instance)
instance$result
instance$archive()
instance$archive(unnest = "tune_x") # Unnest the tuner search space values
pipe$param_set$values <- instance$result$params
pipe$train(task_train)
pred <- pipe$predict(task_test)
pred$confusion
Note that now instance$result returns optimal results for the learners' hyperparameters too, and not just for the class-balancing parameters.
Step 2: Benchmark 'best' tunable learner (now tuned) and the learners that have fixed hyperparameters
# Define re-sampling and instantiate it so always the same split will be used
resampling <- rsmp("cv", folds = 2)
set.seed(123)
resampling$instantiate(task_train)
bmr <- benchmark(
design = benchmark_grid(
task_train,
learner = list(pipe, xgb_td_raw, xgb_td_up, xgb_tn_down),
resampling
),
store_models = TRUE # Only needed if you want to inspect the models
)
bmr$aggregate(msr("classif.bbrier"))
A few issues to consider
I should probably have created a second, separate pipe for the learners that have fixed hyperparameters, in order to at least have the class-balancing parameters tuned. Then, the two pipes (tunable and fixed hyperparameters) would be benchmarked with benchmark().
I should probably have used the same resampling strategy from beginning to end, i.e., instantiate the resampling strategy right before tuning the first pipe, so that the same strategy is also used in the second pipe and in the final benchmark.
Comments/validation more than welcome.
(special thanks to missuse for the constructive comments)
The simplest way to benchmark several pipelines is to define the appropriate graphs and use the benchmark function:
library(paradox)
library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)
learner <- lrn("classif.rpart", predict_type = "prob")
learner$param_set$values <- list(
cp = 0,
maxdepth = 21,
minbucket = 12,
minsplit = 24
)
Create the three graphs:
Graph 1: just imputehist
graph_nop <- po("imputehist") %>>%
learner
Graph 2: imputehist and undersample majority class (ratio relative to majority class)
graph_down <- po("imputehist") %>>%
po("classbalancing", id = "undersample", adjust = "major",
reference = "major", shuffle = FALSE, ratio = 1/2) %>>%
learner
Graph 3: imputehist and oversample minority class (ratio relative to minority class)
graph_up <- po("imputehist") %>>%
po("classbalancing", id = "oversample", adjust = "minor",
reference = "minor", shuffle = FALSE, ratio = 2) %>>%
learner
Convert graphs to learners and set predict_type
graph_nop <- GraphLearner$new(graph_nop)
graph_nop$predict_type <- "prob"
graph_down <- GraphLearner$new(graph_down)
graph_down$predict_type <- "prob"
graph_up <- GraphLearner$new(graph_up)
graph_up$predict_type <- "prob"
Define resampling and instantiate it so that the same split will always be used:
hld <- rsmp("holdout")
set.seed(123)
hld$instantiate(tsk("sonar"))
Benchmark
bmr <- benchmark(design = benchmark_grid(task = tsk("sonar"),
learner = list(graph_nop,
graph_up,
graph_down),
hld),
store_models = TRUE) #only needed if you want to inspect the models
check result using different measures:
bmr$aggregate(msr("classif.auc"))
nr resample_result task_id learner_id resampling_id iters classif.auc
1: 1 <ResampleResult> sonar imputehist.classif.rpart holdout 1 0.7694257
2: 2 <ResampleResult> sonar imputehist.oversample.classif.rpart holdout 1 0.7360642
3: 3 <ResampleResult> sonar imputehist.undersample.classif.rpart holdout 1 0.7668919
bmr$aggregate(msr("classif.ce"))
nr resample_result task_id learner_id resampling_id iters classif.ce
1: 1 <ResampleResult> sonar imputehist.classif.rpart holdout 1 0.3043478
2: 2 <ResampleResult> sonar imputehist.oversample.classif.rpart holdout 1 0.3188406
3: 3 <ResampleResult> sonar imputehist.undersample.classif.rpart holdout 1 0.2898551
This can be also performed within one pipeline with branching but one would need to define the paramset and use a tuner:
graph2 <-
po("imputehist") %>>%
po("branch", c("nop", "classbalancing_up", "classbalancing_down")) %>>%
gunion(list(
po("nop", id = "nop"),
po("classbalancing", id = "classbalancing_up", ratio = 2, reference = 'major'),
po("classbalancing", id = "classbalancing_down", ratio = 2, reference = 'minor')
)) %>>%
po("unbranch") %>>%
learner
graph2$plot()
Note that the unbranch happens before the learner since one (always the same) learner is being used.
Convert graph to learner and set predict_type
graph2 <- GraphLearner$new(graph2)
graph2$predict_type <- "prob"
Define the param set. In this case just the different branch options.
ps <- ParamSet$new(
list(
ParamFct$new("branch.selection", levels = c("nop", "classbalancing_up", "classbalancing_down"))
))
In general you would also want to add learner hyperparameters such as cp and minsplit for rpart, as well as the ratio of over/undersampling; a sketch follows below.
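A sketch of what such an extended param set could look like, following the pipe operator ids used in graph2 (the exact parameter ids and ranges are illustrative and can be checked with as.data.table(graph2$param_set)):
ps_ext <- ParamSet$new(list(
  ParamFct$new("branch.selection",
               levels = c("nop", "classbalancing_up", "classbalancing_down")),
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 0.05),
  ParamInt$new("classif.rpart.minsplit", lower = 10, upper = 40),
  ParamDbl$new("classbalancing_up.ratio", lower = 1, upper = 4),
  ParamDbl$new("classbalancing_down.ratio", lower = 0.25, upper = 1)
))
# dependencies between the sampling ratios and the chosen branch can be added
# with ps_ext$add_dep(), as in the earlier answer
The example below sticks to the branch-only param set.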
Create a tuning instance and grid search with resolution 1 since no other parameters are tuned. The tuner will iterate through different pipeline branches as defined in the paramset.
instance <- TuningInstance$new(
task = tsk("sonar"),
learner = graph2,
resampling = hld,
measures = msr("classif.auc"),
param_set = ps,
terminator = term("none")
)
tuner <- tnr("grid_search", resolution = 1)
set.seed(321)
tuner$tune(instance)
Check the result:
instance$archive(unnest = "tune_x")
nr batch_nr resample_result task_id
1: 1 1 <ResampleResult> sonar
2: 2 2 <ResampleResult> sonar
3: 3 3 <ResampleResult> sonar
learner_id resampling_id iters params
1: imputehist.branch.null.classbalancing_up.classbalancing_down.unbranch.classif.rpart holdout 1 <list>
2: imputehist.branch.null.classbalancing_up.classbalancing_down.unbranch.classif.rpart holdout 1 <list>
3: imputehist.branch.null.classbalancing_up.classbalancing_down.unbranch.classif.rpart holdout 1 <list>
warnings errors classif.auc branch.selection
1: 0 0 0.7842061 classbalancing_down
2: 0 0 0.7673142 classbalancing_up
3: 0 0 0.7694257 nop
Even though the above example is possible, I think mlr3pipelines is designed so that you tune learner hyperparameters jointly with the preprocessing steps while also selecting the best preprocessing steps (via branching).
Question 3 has multiple sub-questions, some of which would take quite a lot of code and explanation to answer. I suggest checking the mlr3book as well as the mlr3gallery.
EDIT: an mlr3 gallery post, https://mlr3gallery.mlr-org.com/posts/2020-03-30-imbalanced-data/, is relevant to the question.

MXNET softmax output: label shape confusion

I have not got a clear idea about how labels for the softmax classifier should be shaped.
What I could understand from my experiments is that a scalar label indicating the index of the class probability output is one option, while another is a 2D label where the rows are class probabilities, i.e. a one-hot encoded variable like c(1, 0, 0).
What puzzles me though is that:
I can use scalar label values that go beyond indexing, like 4 in my example below, without warning or error. Why is that?
When my label is a negative scalar or an array with a negative value, the model converges to a uniform probability distribution over classes. For example, is it expected that actor_train.y = matrix(c(0, -1, 0), ncol = 1) results in equal probabilities in the softmax output?
I am trying to use the MXNET softmax classifier for policy gradient reinforcement learning, and my negative rewards lead to the issue above: uniform probabilities. Is that expected?
require(mxnet)
actor_initializer <- mx.init.Xavier(rnd_type = "gaussian",
factor_type = "avg",
magnitude = 0.0001)
actor_nn_data <- mx.symbol.Variable('data')
actor_nn_label <- mx.symbol.Variable('label')
device.cpu <- mx.cpu()
NN architecture
actor_fc3 <- mx.symbol.FullyConnected(
data = actor_nn_data
, num_hidden = 3 )
actor_output <- mx.symbol.SoftmaxOutput(
data = actor_fc3
, label = actor_nn_label
, name = 'actor' )
crossentfunc <- function(label, pred)
{
  - sum(label * log(pred))
}
actor_loss <- mx.metric.custom(
feval = crossentfunc
, name = "log-loss"
)
initialize NN
actor_train.x <- matrix(rnorm(11), nrow = 1)
actor_train.y = 0 #1 #2 #3 #-3 # matrix(c(0, 0, -1), ncol = 1)
rm(actor_model)
actor_model <- mx.model.FeedForward.create(
symbol = actor_output,
X = actor_train.x,
y = actor_train.y,
ctx = device.cpu,
num.round = 100,
array.batch.size = 1,
optimizer = 'adam',
eval.metric = actor_loss,
clip_gradient = 1,
wd = 0.01,
initializer = actor_initializer,
array.layout = "rowmajor" )
predict(actor_model, actor_train.x, array.layout = "rowmajor")
It seems quite strange to me, but I found a solution.
I changed the optimizer from optimizer = 'adam' to optimizer = 'rmsprop', and the NN started to converge as expected in the case of negative targets. I reproduced the same behaviour in R using a simple NN and the optim function.
It looks like adam (or SGD) may be buggy in the case of multinomial classification... I also used to get stuck because those optimizers did not converge to a perfect solution on just one example, while rmsprop does. Be aware!
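For reference, the change amounts to re-running the training call from the question with only the optimizer swapped (everything else unchanged):
actor_model <- mx.model.FeedForward.create(
  symbol = actor_output,
  X = actor_train.x,
  y = actor_train.y,
  ctx = device.cpu,
  num.round = 100,
  array.batch.size = 1,
  optimizer = 'rmsprop',     # instead of 'adam'
  eval.metric = actor_loss,
  clip_gradient = 1,
  wd = 0.01,
  initializer = actor_initializer,
  array.layout = "rowmajor"
)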
