mlr3: using data transforms with bootstrap resampling hits an error (R)

I'm trying to use bootstrap resampling as my cross-validation strategy in mlr3 and have been tracking down the cause of an error:
Error in as_data_backend.data.frame(backend, primary_key = row_ids) :
Assertion on 'primary_key' failed: Contains duplicated values, position 2.
The position changes from run to run (it is likely the first repeated row). Based on the error message I first thought it was an issue with the row names being included, so I set those as the col_type$name, and I also tried removing the row names from the data before creating the task (no luck!).
While trying to create a reprex, I narrowed the cause down to transform pipe operators such as 'scale' and 'pca':
library("mlr3verse")
task <- tsk('sonar')
pipe = po('scale') %>>%
  po(lrn('classif.rpart'))
ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 0.05)
))
glrn <- GraphLearner$new(pipe)
glrn$predict_type <- "prob"
bootstrap <- rsmp("bootstrap", ratio = 1, repeats = 5)
instance <- TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn,
  resampling = bootstrap,
  measure = msr("classif.auc"),
  search_space = ps,
  terminator = trm("evals", n_evals = 100)
)
tuner <- tnr("random_search")
tuner$optimize(instance)
I've also tried grid search instead of random search, different learners, and setting the flag "duplicated_ids = TRUE" in rsmp(), all with no luck. Switching to plain CV resampling, however, does make the problem go away.
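For what it's worth, the tuner does not seem to be required to trigger this. Resampling the GraphLearner directly (reusing the objects from the reprex above; I have not verified this exact snippet) should show the same behaviour:
# bootstrap resampling of the graph learner alone; if the duplicated
# primary_key assertion is the culprit, this should fail the same way
rr_boot <- resample(task, glrn, bootstrap)
# plain k-fold CV runs through cleanly, as described above
rr_cv <- resample(task, glrn, rsmp("cv", folds = 5))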
For reference, in the full pipe/graph I am trying different feature filters and learners to identify candidate pipelines.

Related

extract_inner_fselect_results is NULL with mlr3 Nested Resampling

This question is an extension of the following question: No Model Stored with Mlr3.
I have been performing nested resampling to get an unbiased metric of model performance. If I don't specify store_models=TRUE then I get Error: No model stored at the end of the run. However, if I specify store_models=TRUE in both the at and resample calls then RStudio crashes due to RAM consumption.
I have now tried the following code in which I specified store_models=TRUE for just the at call:
library(mlr3)
library(mlr3learners)   # classif.ranger
library(mlr3fselect)    # AutoFSelector, fs()
MSvCon <- read.csv("MS v Control Proteomics Final.csv", row.names = 1)
MSvCon$Status<-as.factor(MSvCon$Status)
MSvCon[,2:4399]<-scale(MSvCon[,2:4399], center=TRUE, scale=TRUE)
set.seed(123, "L'Ecuyer")
task = as_task_classif(MSvCon, target = "Status")
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
set_threads(learner, n = 8)
measure = msr("classif.fbeta", beta=1, average="micro")
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
After finishing, I am able to extract the performance measures successfully. However, when I tried to use extract_inner_fselect_results and extract_inner_fselect_archives to check which features were selected and their importance measures, I received a NULL result.
Do you have any suggestions on what I would need to adjust in my code to see this information? I anticipate that adding store_models=TRUE to the resample call would fix it, but the RAM consumption issue (even with 128 GB on RStudio Workbench) prevents that. Is there a way around this?
The archives of the inner resampling are stored in the model slot of the AutoFSelectors, i.e. without store_models = TRUE in resample() you cannot access the inner results and archives. I will write a workaround for you and post it as an answer to the other question.
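With enough memory, the calls would look roughly like this (a sketch using the objects from your code; it will of course run into the same RAM problem you describe):
# store the models of the outer resampling as well; only then are the
# inner results and archives extractable
rr = resample(task, at, resampling_outer, store_models = TRUE)
extract_inner_fselect_results(rr)
extract_inner_fselect_archives(rr)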

How can I account for "failed" learners in benchmark()

I have a large list of task/learner/resampling combinations. I execute the resampling via
design = data.table(
  task = list_of_tasks,
  learner = list_of_learners,
  resampling = list_of_resamplings
)
bmr = benchmark(design)
tab = bmr$aggregate(c(msr("classif.acc")))
The last command fails and I get the following error message:
Error in assert_classif(truth, response = response) : Assertion on 'response' failed: Contains missing values (element 1).
How can I check what went wrong? The learners have worked for slightly different tasks, and they are all combinations of "standard" learners (SVM, naive Bayes) with a preceding po("scale"). There are no missing data in the predictors or the targets of the tasks.
At least one learner predicted NAs. Search for NAs in the predictions to identify the failing learner.
library(mlr3)
library(mlr3misc)
# experiment
learner_rpart = lrn("classif.rpart")
learner_debug = lrn("classif.debug", predict_missing = 0.5)
task = tsk("pima")
resampling = rsmp("cv", folds = 3)
design = benchmark_grid(task, list(learner_rpart, learner_debug), resampling)
bmr = benchmark(design)
# search for predictions with NAs
tab = as.data.table(bmr)
tab[map_lgl(tab$prediction, function(pred) any(is.na(pred$response)))]
You should post a new question with a reprex including the failing learner, task and resampling.
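If the goal is also to let benchmark() finish and aggregate despite such failures, one common approach is encapsulation plus a fallback learner. A rough sketch (using the field syntax of mlr3 releases around the time of writing; newer versions provide an $encapsulate() method instead):
# run train/predict in an "evaluate" sandbox and impute failed fits or
# missing predictions with a featureless fallback learner
for (learner in list_of_learners) {
  learner$encapsulate = c(train = "evaluate", predict = "evaluate")
  learner$fallback = lrn("classif.featureless")  # match predict_type if needed
}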

R mlr3 TaskClassif 'termlabels' must be a character vector of length at least one

I am using mlr3 for a simple classification model, but I encounter errors with several of the different learners mlr3 gives access to. Here I provide one reprex to illustrate the problem:
library(data.table)
library(mlr3extralearners)
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(mlr3pipelines)
library(mlr3filters)
#Make example data
DT = data.table(target = c(0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),pred = c(0.05767878,0.05761652,0.06508700,0.06531820,0.07050699,0.07098812,0.07150984,0.07845767,0.07891081,0.07873572,0.08035471,0.08039300,0.08040480,0.08040480,0.08472619,0.08489135,0.08517742,0.08612768,0.08728675,0.08790671,0.08913434,0.08911522,0.09036788,0.09147726,0.09154964,0.09236259,0.09299088,0.09499589,0.09748171,0.09756818,0.09756818,0.09861013,0.10193147,0.10211796,0.10277547,0.10379659,0.10393602,0.10397469,0.10364373,0.10368016,0.10362235,0.10387504,0.10385431,0.10387288,0.10423139,0.10483475,0.10570517,0.10573617,0.10569312,0.10572714,0.10597040,0.10573924,0.10551367,0.10573499,0.10602269,0.10765947,0.10721005,0.10703524,0.10824609,0.10933141,0.10936178,0.10957693,0.10874663,0.10875077))
DT[, target := as.factor(target)] #Target Variable as factor is required
task <- TaskClassif$new(id='pizza', backend = DT, target = "target", positive = '1')
#Select an algo and a filter
randF = lrn("classif.randomForest", predict_type = "prob")
filter1 = mlr_pipeops$get("filter", filter = mlr3filters::FilterVariance$new(),param_vals = list(filter.cutoff = 0.05))
#Construct a simple graph
graph = filter1 %>>%
  PipeOpLearner$new(lrn("classif.randomForest"), id = "randF")
#graph$plot()
#Construct a learner and train it
learner = GraphLearner$new(graph)
learner$train(task)
This gives the error:
'Error in reformulate(attributes(Terms)$term.labels) :
'termlabels' must be a character vector of length at least one'
I have the impression that the mlr3 task object somehow doesn't interact well with the graph. The error comes from the randomForest classifier, but it seems to me that the data was not properly handed over to it. That's just a theory of mine, though; I may revise the question if it's not clear enough.
Your filter is removing the only feature: the task has a single predictor ('pred'), and the variance filter with cutoff 0.05 drops it, so the downstream randomForest receives no features and reformulate() fails. Feature filtering is not necessary when there is only a single feature.
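One way to see this, reusing the objects from your reprex (a sketch, not run here):
# apply the variance-filter step on its own and inspect what is left over
filtered <- filter1$train(list(task))[[1]]
filtered$feature_names   # empty: var(DT$pred) is far below the 0.05 cutoff
# with only one feature, either drop the filter from the graph entirely
# or relax the cutoff so 'pred' survives
filter1$param_set$values$filter.cutoff <- 0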

Importance based variable reduction

I am having difficulty filtering out the least important variables in my model. I received a data set with more than 4,000 variables, and I have been asked to reduce the number of variables going into the model.
I have already tried two approaches, but failed both times.
The first thing I tried was to manually check variable importance after modelling and, based on that, remove the non-significant variables.
# reproducible example
library(mlr3verse)   # attaches mlr3 and the mlr3 extension packages used below
library(dplyr)       # for %>% and mutate()
data <- iris
# artificial class imbalancing
data <- iris %>%
  mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0")))
Everything works fine when using a simple Learner:
# creating Task
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")
# creating Learner
lrn <- lrn("classif.xgboost")
# setting scoring as prediction type
lrn$predict_type = "prob"
lrn$train(task)
lrn$importance()
Petal.Width Petal.Length
0.90606304 0.09393696
The issue is that the data is highly imbalanced, so I decided to use a GraphLearner with a PipeOp to undersample the majority group; this is then passed to an AutoTuner:
(I skipped the parts of the code which I believe are not important here, such as the search space, terminator, tuner, etc.)
# undersampling
po_under <- po("classbalancing",
  id = "undersample", adjust = "major",
  reference = "major", shuffle = FALSE, ratio = 1 / 2)
# combine learner with pipeline graph
lrn_under <- GraphLearner$new(po_under %>>% lrn)
# setting the autoTuner
at <- AutoTuner$new(
  learner = lrn_under,
  resampling = resample,
  measure = measure,
  search_space = ps_under,
  terminator = terminator,
  tuner = tuner
)
at$train(task)
The problem right now is that, although the importance property is still visible within at, the $importance() method is unavailable.
> at
<AutoTuner:undersample.classif.xgboost.tuned>
* Model: list
* Parameters: list()
* Packages: -
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
* Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights
So I decided to change my approach and try to add filtering to the Learner, and that's where I failed even harder. I started by looking into the mlr3book chapter https://mlr3book.mlr-org.com/fs.html and tried to add importance = "impurity" to the Learner just like in the book, but it yielded an error.
> lrn <- lrn("classif.xgboost", importance = "impurity")
Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
nie można zmienić wartości zablokowanego połączenia dla 'importance'
which corresponds to the English R message:
Error in 'instance[[nn]] <- dots[[i]]': cannot change value of locked binding for 'importance'
I also tried to work around this with PipeOp filtering, but that failed miserably too. I believe I won't be able to do it without importance = "impurity".
So my question is: is there a way to achieve what I am aiming for?
In addition, I would be very thankful for an explanation of why filtering by importance is possible before modelling at all. Shouldn't it be based on the model result?
The reason why you can't access $importance of the at variable is that it is an AutoTuner, which does not directly offer variable importance and only "wraps" around the actual Learner being tuned.
The trained GraphLearner is saved inside your AutoTuner under $learner:
# get the trained GraphLearner, with tuned hyperparameters
graphlearner <- at$learner
This object also does not have $importance(). (Theoretically, a GraphLearner could contain more than one Learner and then it wouldn't even know which importance to give!).
Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings in the "R6" object system used by mlr3:
1. Get the untrained Learner object.
2. Get the trained state of the Learner and put it into that object.
# get the untrained Learner
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner
# put the trained model into the Learner
xgboostlearner$state <- graphlearner$model$classif.xgboost
Now the importance can be queried
xgboostlearner$importance()
The example from the book that you link to does not work in your case because the book uses the ranger Learner, while you are using xgboost. importance = "impurity" is specific to ranger.
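As for the original goal of reducing the 4,000+ variables before modelling: one option, sketched here under the assumption that mlr3filters is available, is to put an importance-based filter into the graph itself. FilterImportance can be handed the xgboost learner directly, so importance = "impurity" is not needed; the filter.nfeat value below is only a placeholder:
library(mlr3filters)
# rank features by xgboost's own importance and keep the top ones
po_flt <- po("filter",
  filter = flt("importance", learner = mlr3::lrn("classif.xgboost")),
  param_vals = list(filter.nfeat = 50))   # placeholder for "how many to keep"
# reuse po_under and the xgboost learner (your variable lrn) from above
lrn_under_flt <- GraphLearner$new(po_under %>>% po_flt %>>% lrn)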

Custom performance measure when building models with mlr-package

I have just made the switch from caret to mlr for a specific problem I am working on at the moment.
I am wondering if anyone here is familiar with specifying custom performance measures within the resample() function.
Here's a reproducible code example:
library(mlr)
library(mlbench)
data(BostonHousing, package = "mlbench")
task_reg1 <- makeRegrTask(id = "bh", data = BostonHousing, target = "medv")
lrn_reg1 <- makeLearner(cl = "regr.randomForest",
                        predict.type = "response",
                        mtry = 3)
cv_reg1 <- makeResampleDesc("RepCV", folds = 5, reps = 5)
regr_1 <- resample(learner = lrn_reg1,
                   task = task_reg1,
                   resampling = cv_reg1,
                   measures = mlr::rmse)
Instead of computing RMSE, I want to compute the Mean Absolute Scaled Error, MASE. A function for this may, for instance, be found in the Metrics package: Metrics::mase().
I tried to include measures = Metrics::mase directly in the resample() call, but that was, as expected, a bit optimistic and I received the following error:
Error in checkMeasures(measures, task) :
Assertion on 'measures' failed: Must be of type 'list', not 'closure'.
I found out there's a function in the mlr package for creating custom performance metrics, called makeMeasure() (https://rdrr.io/cran/mlr/man/makeMeasure.html). I tried experimenting a bit with it, but did not manage to make anything work. I do not have much experience in tinkering with custom made functions, so I was hoping that someone here could help me out, or provide some resources for stuff like this.
Cheers!
You need to construct a function that can be applied within makeMeasure() and that has the form function(task, model, pred, feats, extra.args). We can just write a wrapper around Metrics::mase() so you can use this function in resample(), and you can do the same for any other metric you find.
mase_fun <- function(task, model, pred, feats, extra.args) {
  Metrics::mase(pred$data$truth, pred$data$response, step_size = extra.args$step_size)
}
mase_measure <- makeMeasure(id = "mase",
                            minimize = TRUE,
                            properties = c("regr", "req.pred", "req.truth"),
                            fun = mase_fun,
                            extra.args = list(step_size = 1))
resample(learner = lrn_reg1,
         task = task_reg1,
         resampling = cv_reg1,
         measures = mase_measure)
