How can I make a list of learners that includes an AutoFSelector and an AutoTuner, benchmark them, and compare their performance?
I also wonder how to rank the learners within each task when there are multiple tasks.
library(mlr3verse)

mod1 = AutoTuner$new(
  learner = lrn("surv.svm", type = "hybrid", diff.meth = "makediff3",
                gamma.mu = c(0.1, 0.1)),
  resampling = rsmp("holdout"),
  measure = msr("surv.cindex"),
  terminator = trm("evals", n_evals = 10),
  tuner = tnr("random_search"))

mod2 = AutoFSelector$new(
  learner = as_learner(
    po("imputemedian", affect_columns = selector_type("numeric")) %>>%
    po("imputemode", affect_columns = selector_type("factor")) %>>%
    po("scale") %>>%
    po("encode", method = "one-hot") %>>%
    lrn("surv.coxph")),
  resampling = rsmp("holdout"),
  measure = msr("surv.cindex"),
  terminator = trm("evals", n_evals = 100),
  fselector = fs("sequential", strategy = "sbs"))

lrns = list(mod1, mod2)  # was c(mod1, mod1), which benchmarked mod1 twice and dropped mod2

design = benchmark_grid(tasks = tsks(c("actg", "rats")),
                        learners = lrns,
                        resamplings = rsmp("holdout"))
bmr = benchmark(design, store_models = TRUE, store_backends = TRUE)
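After the benchmark has run, a rough way to rank the learners within each task is to aggregate the score per task and learner and rank it. This is only a sketch; the rank column name and the choice of the C-index (where higher is better) are my assumptions:

agg = bmr$aggregate(msr("surv.cindex"))
agg[, rank := rank(-surv.cindex), by = task_id]               # rank 1 = best learner on that task
agg[order(task_id, rank), .(task_id, learner_id, surv.cindex, rank)]

The mlr3benchmark package also provides aggregation, ranking and statistical tests across tasks, if you need more than a simple per-task rank.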
I am using the mlr3 package for autotuning ML models (an mlr3pipelines graph, to be more precise).
It is very hard to reproduce the problem because the error occurs only occasionally: the same code sometimes returns an error and sometimes doesn't.
Here is the code snippet:
learners_l = list(
  ranger = lrn("classif.ranger", predict_type = "prob", id = "ranger"),
  log_reg = lrn("classif.log_reg", predict_type = "prob", id = "log_reg")
)

# create the complete graph
graph = po("removeconstants", ratio = 0.05) %>>%
  po("branch", options = c("nop_prep", "yeojohnson", "pca", "ica"), id = "prep_branch") %>>%
  gunion(list(po("nop", id = "nop_prep"), po("yeojohnson"), po("pca", scale. = TRUE), po("ica"))) %>>%
  po("unbranch", id = "prep_unbranch") %>>%
  learners_l %>>%
  po("classifavg", innum = length(learners_l))  # was length(learners): that object is not defined
graph_learner = as_learner(graph)
search_space = ps(
  prep_branch.selection = p_fct(levels = c("nop_prep", "yeojohnson", "pca", "ica")),
  pca.rank. = p_int(2, 6, depends = prep_branch.selection == "pca"),
  ica.n.comp = p_int(2, 6, depends = prep_branch.selection == "ica"),
  yeojohnson.standardize = p_lgl(depends = prep_branch.selection == "yeojohnson"),
  ranger.ranger.mtry.ratio = p_dbl(0.2, 1),
  ranger.ranger.max.depth = p_int(2, 6)
)
at_classif = auto_tuner(
  method = "random_search",
  learner = graph_learner,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.acc"),
  search_space = search_space,
  term_evals = 20
)
at_classif$train(task_classif)
You can use any task you want.
The error I get is:
INFO [15:05:33.610] [bbotk] Starting to optimize 6 parameter(s) with '<OptimizerRandomSearch>' and '<TerminatorEvals> [n_evals=20, k=0]'
INFO [15:05:33.653] [bbotk] Evaluating 1 configuration(s)
Error in UUIDgenerate() : Too many DLL modules.
uuid has a fixed buffer for loading its RNG functions, which fails if too many DLLs are already loaded. A simple workaround is to run
library(uuid)
UUIDgenerate()
before other packages, which forces the RNG functions to be loaded early.
(#12 now tracks the underlying issue and should be fixed in uuid 1.0-3).
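In practice (a sketch of applying the workaround to the script above), that means putting these two lines at the very top, before mlr3verse and friends are attached:

library(uuid)    # load uuid first so its RNG functions are registered early,
UUIDgenerate()   # before the many mlr3/bbotk DLLs fill up the buffer
library(mlr3verse)
# ... rest of the tuning script as above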
This section of the mlr tutorial, https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html#filter-methods-with-tuning, explains how to use a TuneWrapper together with a FilterWrapper to tune the threshold for the filter. But what if my filter has hyperparameters that need tuning as well, such as a random forest variable importance filter? I don't seem to be able to tune any parameters except the threshold.
For example:
library(survival)
library(mlr)
data(veteran)
set.seed(24601)
task_id = "MAS"
mas.task <- makeSurvTask(id = task_id, data = veteran, target = c("time", "status"))
mas.task <- createDummyFeatures(mas.task)
tuning = makeResampleDesc("CV", iters = 5, stratify = TRUE)  # tuning: 5-fold CV, no repeats
cox.filt.rsfrc.lrn = makeTuneWrapper(
  makeFilterWrapper(
    makeLearner(cl = "surv.coxph", id = "cox.filt.rfsrc", predict.type = "response"),
    fw.method = "randomForestSRC_importance",
    cache = TRUE,
    ntree = 2000
  ),
  resampling = tuning,
  par.set = makeParamSet(
    makeIntegerParam("fw.abs", lower = 2, upper = 10),
    makeIntegerParam("mtry", lower = 5, upper = 15),
    makeIntegerParam("nodesize", lower = 3, upper = 25)
  ),
  control = makeTuneControlRandom(maxit = 20),
  show.info = TRUE)
produces the error message:
Error in checkTunerParset(learner, par.set, measures, control) :
Can only tune parameters for which learner parameters exist: mtry,nodesize
Is there any way to tune the hyperparameters of the random forest?
EDIT: Other attempts, following suggestions in the comments:
Wrapping the tuner around the base learner before feeding it to the filter (filter not shown) - fails:
cox.lrn = makeLearner(cl = "surv.coxph", id = "cox.filt.rfsrc", predict.type = "response")
cox.tune = makeTuneWrapper(cox.lrn,
  resampling = tuning,
  measures = list(cindex),
  par.set = makeParamSet(
    makeIntegerParam("mtry", lower = 5, upper = 15),
    makeIntegerParam("nodesize", lower = 3, upper = 25),
    makeIntegerParam("fw.abs", lower = 2, upper = 10)
  ),
  control = makeTuneControlRandom(maxit = 20),
  show.info = TRUE)
Error in checkTunerParset(learner, par.set, measures, control) :
Can only tune parameters for which learner parameters exist: mtry,nodesize,fw.abs
Two levels of tuning - fails
cox.lrn = makeLearner(cl = "surv.coxph", id = "cox.filt.rfsrc", predict.type = "response")
cox.filt = makeFilterWrapper(cox.lrn,
  fw.method = "randomForestSRC_importance",
  cache = TRUE,
  ntree = 2000)
cox.tune = makeTuneWrapper(cox.filt,
  resampling = tuning,
  measures = list(cindex),
  par.set = makeParamSet(
    makeIntegerParam("fw.abs", lower = 2, upper = 10)
  ),
  control = makeTuneControlRandom(maxit = 20),
  show.info = TRUE)
cox.tune2 = makeTuneWrapper(cox.tune,
  resampling = tuning,
  measures = list(cindex),
  par.set = makeParamSet(
    makeIntegerParam("mtry", lower = 5, upper = 15),
    makeIntegerParam("nodesize", lower = 3, upper = 25)
  ),
  control = makeTuneControlRandom(maxit = 20),
  show.info = TRUE)
Error in makeBaseWrapper(id, learner$type, learner, learner.subclass = c(learner.subclass, :
Cannot wrap a tuning wrapper around another optimization wrapper!
It looks like you currently cannot tune the hyperparameters of filters. You can set certain parameters manually by passing them to makeFilterWrapper(), but you cannot tune them.
When it comes to filtering, you can only tune one of fw.abs, fw.perc or fw.threshold.
I do not know how big the effect of different hyperparameters for the random forest filter will be on the feature ranking. One way to check its robustness would be to compare the rankings of single RF model fits using different settings for mtry and friends, with the help of getFeatureImportance(). If the rank correlation between these is very high, you can safely skip tuning the RF filter. (Maybe you want to use a different filter which does not have this issue at all?)
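For example, here is a rough sketch of such a robustness check. It fits randomForestSRC directly on the veteran data (outside mlr, so factor handling differs slightly from the dummy-encoded task) and compares the variable importance rankings under two arbitrary hyperparameter settings:

library(randomForestSRC)
library(survival)
data(veteran)

# two RF fits with deliberately different mtry/nodesize settings (arbitrary values)
fit_a = rfsrc(Surv(time, status) ~ ., data = veteran,
              ntree = 2000, mtry = 2, nodesize = 3, importance = TRUE)
fit_b = rfsrc(Surv(time, status) ~ ., data = veteran,
              ntree = 2000, mtry = 5, nodesize = 15, importance = TRUE)

# Spearman correlation of the variable importances; a value close to 1 suggests
# the filter ranking is insensitive to these hyperparameters
cor(fit_a$importance, fit_b$importance, method = "spearman")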
If you insist on having this feature, you might need to open a PR for the package :)
# For reference, tuning only fw.abs (with the RF filter settings fixed) works:
lrn = makeLearner(cl = "surv.coxph", id = "cox.filt.rfsrc", predict.type = "response")
filter_wrapper = makeFilterWrapper(
  lrn,
  fw.method = "randomForestSRC_importance",
  cache = TRUE,
  ntree = 2000  # note: the argument is ntree, not ntrees
)
cox.filt.rsfrc.lrn = makeTuneWrapper(
  filter_wrapper,
  resampling = tuning,
  par.set = makeParamSet(
    makeIntegerParam("fw.abs", lower = 2, upper = 10)
  ),
  control = makeTuneControlRandom(maxit = 20),
  show.info = TRUE)
I am tuning more than two hyperparameters. When generating the hyperparameter effect data with generateHyperParsEffectData() I set partial.dep = TRUE, but when plotting with plotHyperParsEffect() I get an error for my classification learner: it requires a regression learner.
This is my task and learner for classification:
classif.task <- makeClassifTask(id = "rfh2o.task", data = Train_clean, target = "Action")
rfh20.lrn.base = makeLearner("classif.h2o.randomForest", predict.type = "prob", fix.factors.prediction = TRUE)
rfh20.lrn <- makeFilterWrapper(rfh20.lrn.base, fw.method = "chi.squared", fw.perc = 0.5)
This is my tuning setup:
rdesc <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)
ps <- makeParamSet(
  makeDiscreteParam("fw.perc", values = seq(0.2, 0.8, 0.1)),
  makeIntegerParam("mtries", lower = 2, upper = 10),
  makeIntegerParam("ntrees", lower = 20, upper = 50)
)
# object names fixed to match the definitions above
Tuned_rf <- tuneParams(rfh20.lrn, task = classif.task, resampling = rdesc,
                       par.set = ps, control = makeTuneControlGrid())
When plotting the tuning result:
h2orf_data = generateHyperParsEffectData(Tuned_rf, partial.dep = TRUE)
plotHyperParsEffect(h2orf_data, x = "iteration", y = "mmce.test.mean",
                    plot.type = "line", partial.dep.learn = rfh20.lrn)
I get the error:
Error in checkLearner(partial.dep.learn, "regr") :
Learner 'classif.h2o.randomForest.filtered' must be of type 'regr', not: 'classif'
I would expect to see the plot so I can decide whether more hyperparameter tuning is needed. Am I missing something?
The partial.dep.learn parameter needs a regression learner; see the documentation.
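For example, passing a regression learner as the surrogate resolves the error in the call above (a sketch; regr.randomForest is assumed to be installed, and any other regression learner would do):

plotHyperParsEffect(h2orf_data, x = "iteration", y = "mmce.test.mean",
                    plot.type = "line",
                    partial.dep.learn = makeLearner("regr.randomForest"))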
I'm trying to resolve an error I'm facing with mlr::mergeBenchmarkResults, which is:
Error in mergeBenchmarkResults(bmrs = list(bmr, bmr_no_mos)): The following task - learner combination(s) occur either multiple times or are missing:
* wnv_no_mos - rpart
* wnv_no_mos - rf
* wnv - rf_no_mos
* wnv - xgb_no_mos
* wnv - extraTrees_no_mos
Traceback:
1. mergeBenchmarkResults(bmrs = list(bmr, bmr_no_mos))
2. stopf("The following task - learner combination(s) occur either multiple times or are missing: \n* %s\n",
. msg)
In a nutshell, my situation is this: I have two tasks, two learner sets, and two benchmark objects, and I wish to combine the two benchmark objects using mlr::mergeBenchmarkResults.
tsk = makeClassifTask(data = wnv, target = "y", positive = "Infected")
lrns = list(makeLearner(id = "rpart", cl = "classif.rpart", predict.type = "prob"),
            makeLearner(id = "rf", cl = "classif.randomForest", predict.type = "prob"))
bmr = benchmark(learners = lrns, tasks = tsk, resamplings = rdesc, measures = meas, show.info = TRUE)

tsk_no_mos = makeClassifTask(data = wnv_no_mos, target = "y", positive = "Infected")
lrns_2 = list(makeLearner(id = "rf_no_mos", cl = "classif.randomForest", predict.type = "prob"),
              makeLearner(id = "xgb_no_mos", cl = "classif.xgboost", predict.type = "prob", nthread = 25),
              makeLearner(id = "extraTrees_no_mos", cl = "classif.extraTrees", predict.type = "prob", numThreads = 25))
bmr_no_mos = benchmark(learners = lrns_2, tasks = tsk_no_mos, resamplings = rdesc, measures = meas, show.info = TRUE)

mergeBenchmarkResults(bmrs = list(bmr, bmr_no_mos))
What am I doing wrong?
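Judging purely from the error message, mergeBenchmarkResults expects every task-learner combination to appear exactly once across the merged results. A sketch of a setup that satisfies this (both tasks benchmarked with the same learner list; rdesc and meas as defined elsewhere in the script):

# same learners on both tasks, so the merged task x learner grid is complete
lrns_shared = list(makeLearner(id = "rpart", cl = "classif.rpart", predict.type = "prob"),
                   makeLearner(id = "rf", cl = "classif.randomForest", predict.type = "prob"))
bmr_a = benchmark(learners = lrns_shared, tasks = tsk, resamplings = rdesc, measures = meas)
bmr_b = benchmark(learners = lrns_shared, tasks = tsk_no_mos, resamplings = rdesc, measures = meas)
merged = mergeBenchmarkResults(bmrs = list(bmr_a, bmr_b))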
I would like to compare simple logistic regression models where each model considers a specified set of features only, and to perform these comparisons on resamples of the data.
The R package mlr allows me to select columns at the task level using dropFeatures. The code would be something like:
full_task = makeClassifTask(id = "full task", data = my_data, target = "target")
reduced_task = dropFeatures(full_task, setdiff( getTaskFeatureNames(full_task), list_feat_keep))
Then I can do benchmark experiments where I have a list of tasks.
lrn = makeLearner("classif.logreg", predict.type = "prob")
rdesc = makeResampleDesc(method = "Bootstrap", iters = 50, stratify = TRUE)
bmr = benchmark(lrn, list(full_task, reduced_task), rdesc, measures = auc, show.info = FALSE)
How can I create a learner that only considers a specified set of features? As far as I know, the filter or selection methods always apply some statistical procedure and do not allow selecting the features directly. Thank you!
The first solution is lazy and also not optimal because the filter calculation is still carried out:
library(mlr)
task = sonar.task
sel.feats = c("V1", "V10")
lrn = makeLearner("classif.logreg", predict.type = "prob")
lrn.reduced = makeFilterWrapper(learner = lrn, fw.method = "variance",
                                fw.abs = 2, fw.mandatory.feat = sel.feats)
bmr = benchmark(list(lrn, lrn.reduced), task, cv3, measures = auc, show.info = FALSE)
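As a quick sanity check (an illustration, not part of the original answer), you can train the filter-wrapped learner once and inspect which features it kept:

mod = train(lrn.reduced, task)
getFilteredFeatures(mod)  # should return "V1" "V10", since fw.abs = 2 and both are mandatory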
The second one uses a preprocessing wrapper to subset the data; it should be the fastest solution and is also more flexible:
lrn.reduced.2 = makePreprocWrapper(
  learner = lrn,
  train = function(data, target, args) list(data = data[, c(sel.feats, target)], control = list()),
  predict = function(data, target, args, control) data[, sel.feats]
)
bmr = benchmark(list(lrn, lrn.reduced.2), task, cv3, measures = auc, show.info = FALSE)
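The wrapped learner behaves like any other mlr learner, so you can also verify directly that only the selected features reach the underlying model (a small illustration):

mod = train(lrn.reduced.2, task)
coef(getLearnerModel(mod, more.unwrap = TRUE))  # intercept plus V1 and V10 only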