extract_inner_fselect_results is NULL with mlr3 Nested Resampling - rstudio-server

This question is an extension of the following question: No Model Stored with Mlr3.
I have been performing nested resampling to get an unbiased metric of model performance. If I don't specify store_models=TRUE then I get Error: No model stored at the end of the run. However, if I specify store_models=TRUE in both the at and resample calls then RStudio crashes due to RAM consumption.
I have now tried the following code in which I specified store_models=TRUE for just the at call:
library(mlr3verse)  # loads mlr3, mlr3fselect, mlr3learners, etc.

MSvCon <- read.csv("MS v Control Proteomics Final.csv", row.names = 1)
MSvCon$Status <- as.factor(MSvCon$Status)
MSvCon[, 2:4399] <- scale(MSvCon[, 2:4399], center = TRUE, scale = TRUE)
set.seed(123, "L'Ecuyer")

task = as_task_classif(MSvCon, target = "Status")
learner = lrn("classif.ranger", importance = "impurity", num.trees = 10000)
set_threads(learner, n = 8)
measure = msr("classif.fbeta", beta = 1, average = "micro")
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)

# RFE wrapped around the learner; store_models = TRUE only here
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)

# outer resampling for the unbiased performance estimate
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
After finishing, I am able to extract performance measures successfully. However, when I tried to use extract_inner_fselect_results and extract_inner_fselect_archives to check which features were selected and their importance measures, I received a NULL result.
Do you have any suggestions on what I would need to adjust in my code to see this information? I anticipate that adding store_models=TRUE to the resample call would fix it, but the RAM consumption issue (even with 128 GB on RStudio Workbench) prevents that. Is there a way around this?

The archives of the inner resampling are stored in the model slot of the AutoFSelector objects, i.e. without store_models = TRUE in resample() you cannot access the inner results and archives. I will write a workaround for you and post it as an answer to the other question.
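For reference, a minimal sketch (not the promised workaround) of how the inner results would be pulled out once store_models = TRUE is also passed to resample(), assuming the RAM budget allows it:
# Sketch only: requires store_models = TRUE in the outer call, which is
# exactly what the RAM budget in the question struggles with
rr = resample(task, at, resampling_outer, store_models = TRUE)

# best feature subset found by RFE in each outer fold
extract_inner_fselect_results(rr)

# full log of all inner evaluations
extract_inner_fselect_archives(rr)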

Related

MLR3 using data transforms in bootstrapping hit an error

I'm trying to use bootstrapping resampling as my cross-validation in mlr3, and have been tracking down the cause of an error:
Error in as_data_backend.data.frame(backend, primary_key = row_ids) :
Assertion on 'primary_key' failed: Contains duplicated values, position 2.
The position changes (likely the first repeated row). Based on the error message, I first thought it was an issue with having rownames included, so I set those as the col_type$name, and also tried removing rownames from the data before creating the task (no luck!).
In trying to create a reprex, I narrowed it down to transform pipe operators like 'scale' and 'pca' as the cause:
library("mlr3verse")
task <- tsk('sonar')
pipe = po('scale') %>>%
po(lrn('classif.rpart'))
ps <- ParamSet$new(list(
ParamDbl$new("classif.rpart.cp", lower = 0, upper = 0.05)
))
glrn <- GraphLearner$new(pipe)
glrn$predict_type <- "prob"
bootstrap <- rsmp("bootstrap", ratio = 1, repeats = 5)
instance <- TuningInstanceSingleCrit$new(
task = task,
learner = glrn,
resampling = bootstrap,
measure = msr("classif.auc"),
search_space = ps,
terminator = trm("evals", n_evals = 100)
)
tuner <- tnr("random_search")
tuner$optimize(instance)
I've also tried grid search instead of random search, different learners, and the flag duplicated_ids = TRUE in rsmp(), with no luck. Changing to CV cross-validation, however, does fix the problem.
For reference, in the full pipe/graph I am trying different feature filters and learners to identify candidate pipelines.
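Since the question reports that plain CV avoids the error, here is a minimal sketch of that substitution, reusing the objects defined above and leaving everything else unchanged:
# swap the bootstrap resampling for k-fold CV and rebuild the tuning instance
cv5 <- rsmp("cv", folds = 5)
instance_cv <- TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn,
  resampling = cv5,
  measure = msr("classif.auc"),
  search_space = ps,
  terminator = trm("evals", n_evals = 100)
)
tuner$optimize(instance_cv)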

A question about the parallelism in h2o.grid() function

I am trying to use the h2o.grid() function from the h2o package to do some tuning in R. When I set the parameter parallelism larger than 1, it always shows the warning
Some models were not built due to a failure, for more details run `summary(grid_object, show_stack_traces = TRUE)
The model_ids in the final grid object include many models ending with _cv_1, _cv_2, etc., and the number of models does not match the max_models setting in my search_criteria list. I think these are just the models from the CV process, not the final models.
When I set parallelism larger than 1, the grid lists these _cv_ models (screenshot omitted). When I leave parallelism at the default or set it to 1, the result is normal: all models end with _model_1, _model_2, etc. (screenshot omitted).
Here is my code:
# set the grid
rf_h2o_grid <- list(mtries = seq(3, ncol(train_h2o), 4),
                    max_depth = c(5, 10, 15, 20))

# set the search_criteria
sc <- list(strategy = "RandomDiscrete",
           seed = 100,
           max_models = 5)

# random grid tuning
rf_h2o_grid_tune_random <- h2o.grid(
  algorithm = "randomForest",
  x = x,
  y = y,
  training_frame = train_h2o,
  nfolds = 5,                      # use cv to validate the parameters
  fold_assignment = "Stratified",
  ntrees = 100,
  seed = 100,
  hyper_params = rf_h2o_grid,
  search_criteria = sc
  # parallelism = 6                # when set larger than 1, the result always includes some "cv_" models
)
So how can I use parallelism correctly in h2o.grid()? Thanks for helping!
This is an actual issue with parallelism in grid search; it had been noticed before but not reported correctly. Thanks for raising it, we'll try to fix it soon: see https://h2oai.atlassian.net/browse/PUBDEV-7886 if you want to track progress.
Until a proper fix is released, you must avoid using both CV and parallelism in your grids.
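To illustrate that interim advice, a sketch reusing the question's own objects: either keep the CV folds and build sequentially, or build in parallel without CV folds (parallelism = 6 is taken from the question, not a recommendation):
# Option 1: keep cross-validation, leave parallelism at its default of 1
grid_cv <- h2o.grid(algorithm = "randomForest", x = x, y = y,
                    training_frame = train_h2o,
                    nfolds = 5, fold_assignment = "Stratified",
                    ntrees = 100, seed = 100,
                    hyper_params = rf_h2o_grid,
                    search_criteria = sc)

# Option 2: build models in parallel, but drop the CV folds
grid_par <- h2o.grid(algorithm = "randomForest", x = x, y = y,
                     training_frame = train_h2o,
                     ntrees = 100, seed = 100,
                     hyper_params = rf_h2o_grid,
                     search_criteria = sc,
                     parallelism = 6)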
Regarding the following error:
Some models were not built due to a failure, for more details run `summary(grid_object, show_stack_traces = TRUE)
If the error is reproducible, you should get more details by running the grid with verbose = TRUE.
Adding the entire error message to the ticket above might also help.
This is because you set max_models = 5: your grid will only build 5 models and then stop.
There are three ways to set up early stopping criteria (sketched below):
"max_models": max number of models created
"max_runtime_secs": max running time in seconds
metric-based early stopping, by setting "stopping_rounds", "stopping_metric", and "stopping_tolerance"
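A hedged sketch of what those options look like inside search_criteria (the values are placeholders, not recommendations):
# stop the random grid after a time budget instead of a model count
sc_time <- list(strategy = "RandomDiscrete",
                seed = 100,
                max_runtime_secs = 600)

# metric-based early stopping: stop when AUC stops improving
sc_metric <- list(strategy = "RandomDiscrete",
                  seed = 100,
                  stopping_rounds = 5,
                  stopping_metric = "AUC",
                  stopping_tolerance = 1e-3)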

Custom performance measure when building models with mlr-package

I have just made the switch from caret to mlr for a specific problem I am working on at the moment.
I am wondering if anyone here is familiar with specifying custom performance measures within the resample() function.
Here's a reproducible code example:
library(mlr)
library(mlbench)
data(BostonHousing, package = "mlbench")
task_reg1 <- makeRegrTask(id = "bh", data = BostonHousing, target = "medv")
lrn_reg1 <- makeLearner(cl = "regr.randomForest",
                        predict.type = "response",
                        mtry = 3)
cv_reg1 <- makeResampleDesc("RepCV", folds = 5, reps = 5)
regr_1 <- resample(learner = lrn_reg1,
                   task = task_reg1,
                   resampling = cv_reg1,
                   measures = mlr::rmse)
Instead of computing RMSE, I want to compute the Mean Absolute Scaled Error, MASE. A function for this may, for instance, be found in the Metrics package: Metrics::mase().
I tried to include measures = Metrics::mase directly in the resample() call, but that was, as expected, a bit optimistic and I received the following error:
Error in checkMeasures(measures, task) :
Assertion on 'measures' failed: Must be of type 'list', not 'closure'.
I found out there's a function in the mlr package for creating custom performance metrics, called makeMeasure() (https://rdrr.io/cran/mlr/man/makeMeasure.html). I tried experimenting a bit with it, but did not manage to make anything work. I do not have much experience in tinkering with custom made functions, so I was hoping that someone here could help me out, or provide some resources for stuff like this.
Cheers!
You need to construct a function that can be applied within makeMeasure() and that has the form function(task, model, pred, feats, extra.args). We can just write a wrapper around Metrics::mase() so you can use this function in resample(), and you can do the same for any other metric you find.
mase_fun <- function(task, model, pred, feats, extra.args) {
  Metrics::mase(pred$data$truth, pred$data$response, step_size = extra.args$step_size)
}

mase_measure <- makeMeasure(id = "mase",
                            minimize = TRUE,
                            properties = c("regr", "req.pred", "req.truth"),
                            fun = mase_fun,
                            extra.args = list(step_size = 1))

resample(learner = lrn_reg1,
         task = task_reg1,
         resampling = cv_reg1,
         measures = mase_measure)

Is there a way to know how far R has gotten on a random forest model?

I am currently learning about random forests and how to create them in R. However, as I have discovered, creating these trees can be quite time-consuming, and sometimes I do not know how far R has gotten or whether it has crashed, so I close R in a panic. I use the randomForest package, and my code is as follows:
model <- randomForest(def ~ .,
                      data = mydataset,
                      mtry = 4,
                      ntree = 200,
                      importance = TRUE)
Is there a way to make R show me how far it has gotten at any time, or when it is finished with one tree and is continuing to the next?
In situations such as these, you are typically looking for an argument that makes the function more verbose. This is often something like verbose = TRUE, but it varies, and some functions do not offer any kind of verbosity setting.
In your case, you just have to look up the help of randomForest (with ?randomForest::randomForest) to find the argument do.trace.
do.trace
If set to TRUE, give a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees.
In other words, you can enable verbosity with:
model <- randomForest(def ~ ., data = mydataset, mtry = 4,
                      ntree = 200, importance = TRUE, do.trace = TRUE)
or, to print some feedback every 100 trees:
model <- randomForest(def ~ ., data = mydataset, mtry = 4,
                      ntree = 200, importance = TRUE, do.trace = 100)
It is always a good reflex to check the manual of the function as a first step. If you use RStudio, you can use the Help pane instead of ? or ??.

using caret package to find optimal parameters of GBM

I'm using the R GBM package for boosting to do regression on some biological data of dimensions 10,000 x 932, and I want to know the best parameter settings for the GBM package, especially n.trees, shrinkage, interaction.depth, and n.minobsinnode. When I searched online, I found that the caret package in R can find such parameter settings. However, I'm having difficulty using the caret package with the GBM package, so I just want to know how to use caret to find the optimal combination of the previously mentioned parameters. I know this might seem like a very typical question, but I've read the caret manual and still have difficulty integrating caret with gbm, especially since I'm very new to both of these packages.
Not sure if you found what you were looking for, but I find some of these sheets less than helpful.
If you are using the caret package, the following describes the required parameters: > getModelInfo()$gbm$parameters
Here are some rules of thumb for running GBM:
interaction.depth: the default is 1, and on most data sets that seems adequate, but on a few I have found that testing odd multiples up to the max has given better results. The max value I have seen for this parameter is floor(sqrt(NCOL(training))).
shrinkage: the smaller the number, the better the predictive value, the more trees required, and the higher the computational cost. Testing values on a small subset of data with something like shrinkage = seq(.0005, .05, .0005) can be helpful in defining the ideal value.
n.minobsinnode: the default is 10, and generally I don't mess with that. I have tried c(5, 10, 15, 20) on small sets of data and didn't really see an adequate return for the computational cost.
n.trees: the smaller the shrinkage, the more trees you should have. Start with n.trees = (0:50)*50 and adjust accordingly.
Example setup using the caret package:
getModelInfo()$gbm$parameters

library(parallel)
library(doMC)
registerDoMC(cores = 20)

# Max shrinkage for gbm
nl = nrow(training)
max(0.01, 0.1 * min(1, nl/10000))

# Max value for interaction.depth
floor(sqrt(NCOL(training)))

gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
                       n.trees = (0:50)*50,
                       shrinkage = seq(.0005, .05, .0005),
                       n.minobsinnode = 10) # you can also try something like c(5, 10, 15, 20)

fitControl <- trainControl(method = "repeatedcv",
                           repeats = 5,
                           preProcOptions = list(thresh = 0.95),
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)

# Method + Date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
                                distribution = "adaboost",
                                method = "gbm", bag.fraction = 0.5,
                                nTrain = round(nrow(training) * .75),
                                trControl = fitControl,
                                verbose = TRUE,
                                tuneGrid = gbmGrid,
                                ## Specify which metric to optimize
                                metric = "ROC"))
Things can change depending on your data (like the distribution), but I have found the key is to play with gbmGrid until you get the outcome you are looking for. The settings as they are now would take a long time to run, so modify them as your machine and time allow.
To give you a ballpark of the computation involved, I run this on a Mac Pro with 12 cores and 64 GB of RAM.
This link has a concrete example (page 10): http://www.jstatsoft.org/v28/i05/paper
Basically, one should first create a grid of candidate values for hyperparameters (like n.trees, interaction.depth, and shrinkage), then call the generic train() function as usual.
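A minimal sketch of that approach, assuming a regression task (the BostonHousing data from mlbench is borrowed from the earlier question purely as a placeholder):
library(caret)
library(mlbench)
data(BostonHousing, package = "mlbench")

# candidate grid over the four gbm tuning parameters caret expects
grid <- expand.grid(n.trees = c(100, 500, 1000),
                    interaction.depth = c(1, 3, 5),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = 10)

fit <- train(medv ~ ., data = BostonHousing,
             method = "gbm",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = grid,
             verbose = FALSE)
fit$bestTune  # best combination found by cross-validation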
