mlr: Tune model parameters with validation set

Just switched to mlr for my machine learning workflow. I am wondering if it is possible to tune hyperparameters using a separate validation set. From my limited understanding, makeResampleDesc and makeResampleInstance accept only resampling from the training data.
My goal is to tune parameters with a validation set and test the final model with the test set. This is to prevent overfitting and knowledge leakage.
Here is what I did code-wise:
## Create training, validation and test tasks
train_task <- makeClassifTask(data = train_data, target = "y", positive = 1)
validation_task <- makeClassifTask(data = validation_data, target = "y")
test_task <- makeClassifTask(data = test_data, target = "y")
## Attempt to tune parameters with separate validation data
tuned_params <- tuneParams(
  task = train_task,
  resampling = makeResampleInstance("Holdout", task = validation_task),
  ...
)
From the error message, it looks like evaluation is still trying to resample from the training set:
00001: Error in resample.fun(learner2, task, resampling, measures =
measures, : Size of data set: 19454 and resampling instance:
1666333 differ!
Does anyone know what I should do? Am I setting up everything the right way?

[Update as of 2019/03/27]
Following @jakob-r's comment, and finally understanding @LarsKotthoff's suggestion, here is what I did:
## Create combined training data
train_task_data <- rbind(train_data, validation_data)
## Create learner, training task, etc.
xgb_learner <- makeLearner("classif.xgboost", predict.type = "prob")
train_task <- makeClassifTask(data = train_task_data, target = "y", positive = 1)
## Tune hyperparameters
tune_wrapper <- makeTuneWrapper(
  learner = xgb_learner,
  resampling = makeResampleDesc("Holdout"),
  measures = ...,
  par.set = ...,
  control = ...
)
model_xgb <- train(tune_wrapper, train_task)
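(Not part of the original post, just a hedged sketch of how the wrapped model could then be checked against the held-out test data; test_data is assumed to be the test set mentioned above.)
## Sketch only: evaluate the wrapped model on the test data and inspect the tuning result
test_task <- makeClassifTask(data = test_data, target = "y", positive = 1)
test_pred <- predict(model_xgb, task = test_task)
performance(test_pred, measures = auc)   # works because predict.type = "prob"
getTuneResult(model_xgb)                 # hyperparameters selected by the inner holdout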
Here is what I did following @LarsKotthoff's comment. Assume you have two separate datasets for training (train_data) and validation (validation_data):
## Create combined training data
train_task_data <- rbind(train_data, validation_data)
size <- nrow(train_task_data)
train_ind <- seq_len(nrow(train_data))
validation_ind <- seq.int(max(train_ind) + 1, size)
## Create training task
train_task <- makeClassifTask(data = train_task_data, target = "y", positive = 1)
## Tune hyperparameters
tuned_params <- tuneParams(
  task = train_task,
  resampling = makeFixedHoldoutInstance(train_ind, validation_ind, size),
  ...
)
After optimizing the hyperparameter set, you can build a final model and test against your test dataset.
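A hedged sketch of that final step, assuming xgb_learner from above and a test_data data frame (names as in the earlier snippets):
## Sketch only: apply the tuned hyperparameters, refit on the combined
## training data, and evaluate once on the untouched test set
final_learner <- setHyperPars(xgb_learner, par.vals = tuned_params$x)
final_model <- train(final_learner, train_task)
test_task <- makeClassifTask(data = test_data, target = "y", positive = 1)
test_pred <- predict(final_model, task = test_task)
performance(test_pred, measures = auc)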
Note: I had to install the latest development version (as of 2018/08/06) from GitHub. The current CRAN version (2.12.1) throws an error when I call makeFixedHoldoutInstance(), i.e.,
Assertion on 'discrete.names' failed: Must be of type 'logical flag',
not 'NULL'.

Related

extract_inner_fselect_results is NULL with mlr3 Nested Resampling

This question is an extension of the following question: No Model Stored with Mlr3.
I have been performing nested resampling to get an unbiased metric of model performance. If I don't specify store_models=TRUE, then I get "Error: No model stored" at the end of the run. However, if I specify store_models=TRUE in both the at and resample calls, then RStudio crashes due to RAM consumption.
I have now tried the following code in which I specified store_models=TRUE for just the at call:
library(mlr3)
library(mlr3learners)  # provides classif.ranger
library(mlr3fselect)   # provides AutoFSelector, fs() and trm()
MSvCon <- read.csv("MS v Control Proteomics Final.csv", row.names = 1)
MSvCon$Status<-as.factor(MSvCon$Status)
MSvCon[,2:4399]<-scale(MSvCon[,2:4399], center=TRUE, scale=TRUE)
set.seed(123, "L'Ecuyer")
task = as_task_classif(MSvCon, target = "Status")
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
set_threads(learner, n = 8)
measure = msr("classif.fbeta", beta=1, average="micro")
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
After finishing, I am able to extract performance measures successfully. However, when I tried to use extract_inner_fselect_results and extract_inner_fselect_archives to check which features were selected and their importance measures, I received a NULL result.
Do you have any suggestions on what I would need to adjust in my code to see this information? I anticipate that adding store_models=TRUE to the resample call would do it, but the RAM consumption issue (even with 128GB on RStudio Workbench) prevents that. Is there a way around this?
The archives of the inner resampling are stored in the model slot of the AutoFSelectors, i.e. without store_models = TRUE in resample() you cannot access the inner results and archives. I will write a workaround for you and answer in the other question.
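For reference, a minimal sketch (memory permitting) of what the access would look like once the models are stored in the outer resample() call; both extractor functions are the ones already tried above:
rr = resample(task, at, resampling_outer, store_models = TRUE)
extract_inner_fselect_results(rr)   # best feature subsets found in the inner loops
extract_inner_fselect_archives(rr)  # full archives of the inner feature selection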

How can I account for "failed" learners in benchmark()

I have a large list of task/learner/resampling combinations. I execute the resampling via
design = data.table(
  task = list_of_tasks,
  learner = list_of_learners,
  resampling = list_of_resamplings
)
bmr = benchmark(design)
tab = bmr$aggregate(c(msr("classif.acc")))
The last command fails and I get the following error message:
Error in assert_classif(truth, response = response) : Assertion on 'response' failed: Contains missing values (element 1).
How can I check what went wrong? The learners have worked for slightly different tasks, and they are all combinations of "standard" learners (svm, naive bayes) with a preceding po("scale"). There are no missing data in the predictors or the targets of the tasks.
At least one learner predicted NAs. Search for NAs in the predictions to identify the failing learner.
library(mlr3)
library(mlr3misc)
# experiment
learner_rpart = lrn("classif.rpart")
learner_debug = lrn("classif.debug", predict_missing = 0.5)
task = tsk("pima")
resampling = rsmp("cv", folds = 3)
design = benchmark_grid(task, list(learner_rpart, learner_debug), resampling)
bmr = benchmark(design)
# search for predictions with NAs
tab = as.data.table(bmr)
tab[map_lgl(tab$prediction, function(pred) any(is.na(pred$response)))]
You should post a new question with a reprex including the failing learner, task and resampling.
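Separately, if the goal is to let benchmark() and bmr$aggregate() run to completion despite individual failures, encapsulation together with a fallback learner is the usual mlr3 mechanism. A minimal sketch (the exact syntax for setting encapsulation has changed across mlr3 versions, and the "evaluate" method needs the evaluate package installed):
library(mlr3)
# encapsulate train/predict so errors and missing predictions are caught,
# and let a featureless fallback learner fill in those predictions
learner_unreliable = lrn("classif.debug", predict_missing = 0.5)
learner_unreliable$encapsulate = c(train = "evaluate", predict = "evaluate")
learner_unreliable$fallback = lrn("classif.featureless")
design = benchmark_grid(tsk("pima"), learner_unreliable, rsmp("cv", folds = 3))
bmr = benchmark(design)
bmr$aggregate(msr("classif.acc"))  # aggregates without the NA-response assertion error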

How does the mlrMBO package optimize hyperparameters when no objective function is specified?

I am still very new to the mlrMBO package and to hyperparameter tuning in general, so I apologize for the ignorance here. Previously I was using the makeTuneControlGrid() function for grid-search hyperparameter tuning of random forest and decision tree classification models. Then I was introduced to the mlrMBO package and used nearly the same code as for grid search, only with the makeTuneControlMBO() function instead. The performance metrics greatly improved over grid search, but I do not understand how the functions in this package search for optimal hyperparameter combinations differently from grid search, when I have not specified an objective function. My understanding from what I have read is that I need to create an objective function to optimize, so if this objective function is never created, how is the package searching hyperparameters if it is not using grid search? Is it optimizing a default objective function that is built into the package?
Here is my code:
task <- makeClassifTask(data = training_data, target = 'DEATH_EVENT', id = 'Death', positive = 1)
View(task)
# Configure learners with probability type
learner2 <- makeLearner('classif.randomForest', predict.type = 'prob') # Random Forest learner
View(learner2)
learner3 <- makeLearner('classif.kknn', predict.type = 'prob') # kNN learner
library(mlrMBO)
getParamSet("classif.randomForest")
ps2 <- makeParamSet(
  makeDiscreteParam('mtry', values = seq(1, 5, by = 1)),
  makeDiscreteParam('ntree', values = seq(450, 600, by = 50)),
  makeDiscreteParam('nodesize', values = seq(7, 14, by = 1))
)
#creates a control object for MBO optimization
ctrl = makeTuneControlMBO(mbo.control=mlrMBO::makeMBOControl())
rdesc = makeResampleDesc("CV", iters = 5L)
tuned_params <- tuneParams(learner = learner2,
                           task = task,
                           control = ctrl,
                           par.set = ps2,
                           resampling = rdesc,
                           measures = list(tpr, auc, fnr, mmce, tnr, setAggregation(tpr, test.sd)),
                           show.info = TRUE)
tuned_params$x
tuned_params$y
tuned_params$mbo.result
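For intuition only (this is not mlr's literal internal code): when makeTuneControlMBO() is used, tuneParams() itself builds the objective that mlrMBO optimizes, namely the resampled value of the first measure for a candidate hyperparameter configuration; mlrMBO then fits a surrogate model to those evaluations and proposes the next configuration to try. A hand-rolled version of one such evaluation, with the helper name eval_config made up for illustration:
# hypothetical helper mirroring what tuneParams() evaluates for each
# configuration that MBO proposes: the resampled performance of the learner
eval_config <- function(par_vals) {
  lrn <- setHyperPars(learner2, par.vals = par_vals)
  resample(lrn, task, rdesc, measures = list(tpr, auc), show.info = FALSE)$aggr[[1]]
}
eval_config(list(mtry = 3, ntree = 500, nodesize = 10))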

rpart giving same results for cross-validation and no CV

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each, and the CV model looks like it takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also redone the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem; the printcp outputs show the same results, and the predictions from both models on the training set and on a hold-out set are the same.
library(rpart)
library(caret)
library(dplyr)  # needed for slice() below
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- slice(abalone, train_set)
abalone_test <- slice(abalone, -train_set)
abalone_fit_noCV <- rpart(sex ~ .,
                          data = abalone_train,
                          method = "class",
                          parms = list(split = 'information'),
                          control = rpart.control(xval = 0, cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
                        data = abalone_train,
                        method = "class",
                        parms = list(split = 'information'),
                        control = rpart.control(xval = 10, cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated:
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
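A short sketch of that pruning step, using the cptable that printcp() displays (picking the subtree with the lowest cross-validated error; the 1-SE rule is a common alternative):
# choose the complexity parameter with the lowest cross-validated error
# and prune the fitted tree back to that subtree
best_cp <- abalone_fit_CV$cptable[which.min(abalone_fit_CV$cptable[, "xerror"]), "CP"]
abalone_fit_pruned <- prune(abalone_fit_CV, cp = best_cp)
printcp(abalone_fit_pruned)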

R - mlr: Is there an easy way to get the variable importance of tuned support vector machine models in nested resampling (spatial)?

I am trying to get the variable importance for all predictors (or variables, or features) of a tuned support vector machine (svm) model using e1071::svm through the mlr package in R. But I am not sure if I am doing the assessment right. First, the idea:
To get an honestly tuned svm model, I am following the nested-resampling tutorial, using repeated spatial cross-validation (SpRepCV) in the outer loop and spatial cross-validation (SpCV) in the inner loop. The tuning parameters gamma and cost are tuned via random search.
As the variable importance assessment for all predictors, I would like to use permutation.importance, which, according to its description, is basically the aggregated difference between predictions with permuted and unpermuted features.
In mlr, there are some filter functions to get variable importance, but at the same time a subset is created before model fitting based on a user-specific selection input (threshold or number of variables). However, I would like to retrieve the variable importance of all variables of every fitted model. (I know that learners such as random forest have an importance assessment built in.)
Right now, I am using mlr::generateFeatureImportanceData in the extract argument of the resampling, which looks really awkward. So I am asking myself whether there is an easier way.
Here is an example using the mlr development version:
## initialize libraries
# devtools::install_github("mlr-org/mlr") # using the developer version of mlr
if (!require("pacman")) install.packages("pacman")
pacman::p_load("mlr", "ParamHelpers", "e1071", "parallelMap")
## create tuning setting
svm.ps <- ParamHelpers::makeParamSet(
  ParamHelpers::makeNumericParam("cost", lower = -12, upper = 15,
                                 trafo = function(x) 2^x),
  ParamHelpers::makeNumericParam("gamma", lower = -15, upper = 6,
                                 trafo = function(x) 2^x)
)
## create random search control, small iteration number for this example
ctrl.tune <- mlr::makeTuneControlRandom(maxit = 8)
# inner resampling loop
inner <- mlr::makeResampleDesc("SpCV", iters = 3, predict = "both")
# outer resampling loop
outer <- mlr::makeResampleDesc("SpRepCV", folds = 5, reps = 2, predict = "both")
## create learner - Support Vector Machine of the e1071-package
lrn.svm <- mlr::makeLearner("classif.svm", predict.type = "prob")
# ... tuning in inner resampling
lrn.svm.tune <- mlr::makeTuneWrapper(learner = lrn.svm, resampling = inner,
                                     measures = list(auc),
                                     par.set = svm.ps, control = ctrl.tune,
                                     show.info = FALSE)
## create function that calculates variable importance based on permutation
extractVarImpFunction <- function(x) {
  list(mlr::generateFeatureImportanceData(
    task = mlr::makeClassifTask(
      id = x$task.desc$id,
      data = mlr::getTaskData(mlr::spatial.task, subset = x$subset),
      target = x$task.desc$target,
      positive = x$task.desc$positive,
      coordinates = mlr::spatial.task$coordinates[x$subset, ]),
    method = "permutation.importance",
    learner = mlr::makeLearner(cl = "classif.svm",
                               predict.type = "prob",
                               cost = x$learner.model$opt.result$x$cost,
                               gamma = x$learner.model$opt.result$x$gamma),
    measure = list(mlr::auc), nmc = 10
  )$res)
}
## start resampling for getting variable importance of tuned models (outer)
# parallelize tuning
parallelMap::parallelStart(mode = "multicore", level = "mlr.tuneParams", cpus = 8)
res.VarImpTuned <- mlr::resample(learner = lrn.svm.tune, task = mlr::spatial.task,
                                 extract = extractVarImpFunction,
                                 resampling = outer, measures = list(auc),
                                 models = TRUE, show.info = TRUE)
parallelMap::parallelStop() # stop parallelization
## get mean auroc decrease
var.imp <- do.call(rbind, lapply(res.VarImpTuned$extract, FUN = function(x){x[[1]]}))
var.imp <- data.frame(AUC_DECR = colMeans(var.imp), Variable = names(colMeans(var.imp)))
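If useful, the aggregated result can then simply be ordered to see which predictors shift the AUC the most when permuted (sketch):
## order predictors by the magnitude of the mean AUC change under permutation
var.imp[order(abs(var.imp$AUC_DECR), decreasing = TRUE), ]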
