Benchmarking multiple AutoTuning instances in R

I have been trying to use mlr3 to do some hyperparameter tuning for xgboost. I want to compare three different models:
1. xgboost tuned over just the alpha hyperparameter
2. xgboost tuned over the alpha and lambda hyperparameters
3. xgboost tuned over the alpha, lambda, and max_depth hyperparameters
After reading the mlr3 book, I thought that using AutoTuner for the nested resampling and benchmarking would be the best way to go about doing this. Here is what I have tried:
task_mpcr <- TaskRegr$new(id = "mpcr", backend = data.numeric, target = "n_reads")
measure <- msr("poisson_loss")
xgb_learn <- lrn("regr.xgboost")

set.seed(103)
fivefold.cv = rsmp("cv", folds = 5)

param.list <- list(
  alpha     = p_dbl(lower = 0.001, upper = 100, logscale = TRUE),
  lambda    = p_dbl(lower = 0.001, upper = 100, logscale = TRUE),
  max_depth = p_int(lower = 2, upper = 10)
)

model.list <- list()
for (model.i in seq_along(param.list)) {
  param.list.subset <- param.list[1:model.i]
  search_space <- do.call(ps, param.list.subset)

  model.list[[model.i]] <- AutoTuner$new(
    learner = xgb_learn,
    resampling = fivefold.cv,
    measure = measure,
    search_space = search_space,
    terminator = trm("none"),
    tuner = tnr("grid_search", resolution = 10),
    store_tuning_instance = TRUE
  )
}

grid <- benchmark_grid(
  task = task_mpcr,
  learner = model.list,
  resampling = rsmp("cv", folds = 3)
)

bmr <- benchmark(grid, store_models = TRUE)
Note that I added Poisson loss as a measure for the count data I am working with.
For some reason after running the benchmark function, the Poisson loss of all my models is nearly identical per fold, making me think that no tuning was done.
I also cannot find a way to access the hyperparameters used to get the lowest loss per train/test iteration.
Am I misusing the benchmark function entirely?
Also, this is my first question on SO, so any formatting advice would be appreciated!

To see whether tuning has an effect, you can just add an untuned learner to the benchmark. Otherwise, the conclusion could be that tuning alpha is sufficient for your example.
I adapted the code so that it runs with an example task.
library(mlr3verse)

task <- tsk("mtcars")
measure <- msr("regr.rmse")
xgb_learn <- lrn("regr.xgboost")

param.list <- list(
  alpha  = p_dbl(lower = 0.001, upper = 100, logscale = TRUE),
  lambda = p_dbl(lower = 0.001, upper = 100, logscale = TRUE)
)

model.list <- list()
for (model.i in seq_along(param.list)) {
  param.list.subset <- param.list[1:model.i]
  search_space <- do.call(ps, param.list.subset)

  at <- AutoTuner$new(
    learner = xgb_learn,
    resampling = rsmp("cv", folds = 5),
    measure = measure,
    search_space = search_space,
    terminator = trm("none"),
    tuner = tnr("grid_search", resolution = 5),
    store_tuning_instance = TRUE
  )
  at$id = paste0(at$id, model.i)
  model.list[[model.i]] <- at
}

model.list <- c(model.list, list(xgb_learn)) # add an untuned baseline learner

grid <- benchmark_grid(
  task = task,
  learner = model.list,
  resampling = rsmp("cv", folds = 3)
)

bmr <- benchmark(grid, store_models = TRUE)
autoplot(bmr)

bmr_data = bmr$data$as_data_table() # convert the benchmark result to a handy data.table
bmr_data$learner[[1]]$learner$param_set$values # the final learner used by the AutoTuner is nested in $learner

# best configuration found during grid search
bmr_data$learner[[1]]$archive$best()

# transformed values (the ones actually passed to the learner)
bmr_data$learner[[1]]$archive$best()$x_domain
In the last lines you see how one can access the individual runs of the benchmark. In my example there are nine runs, resulting from three learners and three outer resampling folds.
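If you want the best hyperparameters of every outer fold at once, a convenience helper from mlr3tuning can sketch that (assuming `bmr` was created with store_models = TRUE, as above):

```r
library(mlr3tuning)

# one row per learner and outer resampling iteration, with the best
# hyperparameter values found by the corresponding inner tuning run
inner_results = extract_inner_tuning_results(bmr)
print(inner_results)
```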

Related

mlr3, benchmarking and nested resampling: how to extract a tuned model from a benchmark object to calculate feature importance

I am using the benchmark() function in mlr3 to compare several ML algorithms. One of them is XGB with hyperparameter tuning. Thus, I have an outer resampling to evaluate the overall performance (hold-out sample) and an inner resampling for the hyperparameter tuning (5-fold cross-validation). Besides having an estimate of the accuracy for all ML algorithms, I would like to see the feature importance of the tuned XGB. For that, I would have to access the tuned model (within the benchmark object). I do not know how to do that. The object returned by benchmark() is a deeply nested list and I do not understand its structure.
This answer on stackoverflow did not help me, because it uses a different setup (a learner in a pipeline rather than a benchmark object).
This answer on github did not help me, because it shows how to extract all the information about the benchmarking at once but not how to extract one (tuned) model of one of the learners in the benchmark.
Below is the code I am using to carry out the nested resampling. Following the benchmarking, I would like to estimate the feature importance as described here, which requires accessing the tuned XGB model.
require(mlr3verse)
### Parameters
## Tuning
n_folds = 5
grid_search_resolution = 2
measure = msr("classif.acc")
task = tsk("iris")
# Messages mlr3
# https://stackoverflow.com/a/69336802/7219311
options("mlr3.debug" = TRUE)
### Set up hyperparameter tuning
# AutoTuner for the inner resampling
## inner-resampling design
inner_resampling = rsmp("cv", folds = n_folds)
terminator = trm("none")
## XGB: no Hyperparameter Tuning
xgb_no_tuning = lrn("classif.xgboost", eval_metric = "mlogloss")
set_threads(xgb_no_tuning, n = 6)
## XGB: AutoTuner
# Setting up Hyperparameter Tuning
xgb_learner_tuning = lrn("classif.xgboost", eval_metric = "mlogloss")
xgb_search_space = ps(
  nrounds          = p_int(lower = 100, upper = 500),
  max_depth        = p_int(lower = 3, upper = 10),
  colsample_bytree = p_dbl(lower = 0.6, upper = 1)
)
xgb_tuner = tnr("grid_search", resolution = grid_search_resolution)
# implicit parallelisation
set_threads(xgb_learner_tuning, n = 6)
xgb_tuned = AutoTuner$new(xgb_learner_tuning, inner_resampling, measure, terminator, xgb_tuner, xgb_search_space, store_tuning_instance = TRUE)
## Outer re-sampling: hold-out
outer_resampling = rsmp("holdout")
outer_resampling$instantiate(task)
bm_design = benchmark_grid(
  tasks = task,
  learners = c(
    lrn("classif.featureless"),
    xgb_no_tuning,
    xgb_tuned
  ),
  resamplings = outer_resampling
)
begin_time = Sys.time()
bmr = benchmark(bm_design, store_models = TRUE)
duration = Sys.time() - begin_time
print(duration)
## Results of benchmarking
benchmark_results = bmr$aggregate(measure)
print(benchmark_results)
## Overview
mlr3misc::map(as.data.table(bmr)$learner, "model")
## Detailed results
# Specification of learners
print(bmr$learners$learner)
Solution
Based on the comments by be-marc
require(mlr3verse)
require(mlr3tuning)
require(mlr3misc)
### Parameters
## Tuning
n_folds = 5
grid_search_resolution = 2
measure = msr("classif.acc")
task = tsk("iris")
# Messages mlr3
# https://stackoverflow.com/a/69336802/7219311
options("mlr3.debug" = TRUE)
### Set up hyperparameter tuning
# AutoTuner for the inner resampling
## inner-resampling design
inner_resampling = rsmp("cv", folds = n_folds)
terminator = trm("none")
## XGB: no Hyperparameter Tuning
xgb_no_tuning = lrn("classif.xgboost", eval_metric = "mlogloss")
set_threads(xgb_no_tuning, n = 6)
## XGB: AutoTuner
# Setting up Hyperparameter Tuning
xgb_learner_tuning = lrn("classif.xgboost", eval_metric = "mlogloss")
xgb_search_space = ps(
  nrounds          = p_int(lower = 100, upper = 500),
  max_depth        = p_int(lower = 3, upper = 10),
  colsample_bytree = p_dbl(lower = 0.6, upper = 1)
)
xgb_tuner = tnr("grid_search", resolution = grid_search_resolution)
# implicit parallelisation
set_threads(xgb_learner_tuning, n = 6)
xgb_tuned = AutoTuner$new(xgb_learner_tuning, inner_resampling, measure, terminator, xgb_tuner, xgb_search_space, store_tuning_instance = TRUE)
## Outer re-sampling: hold-out
outer_resampling = rsmp("holdout")
outer_resampling$instantiate(task)
bm_design = benchmark_grid(
  tasks = task,
  learners = c(
    lrn("classif.featureless"),
    xgb_no_tuning,
    xgb_tuned
  ),
  resamplings = outer_resampling
)
begin_time = Sys.time()
bmr = benchmark(bm_design, store_models = TRUE)
duration = Sys.time() - begin_time
print(duration)
## Results of benchmarking
benchmark_results = bmr$aggregate(measure)
print(benchmark_results)
## Overview
mlr3misc::map(as.data.table(bmr)$learner, "model")
## Detailed results
# Specification of learners
print(bmr$learners$learner)
## Feature Importance
# extract models from outer sampling
# https://stackoverflow.com/a/69828801
data = as.data.table(bmr)
outer_learners = map(data$learner, "learner")
xgb_tuned_model = outer_learners[[3]]
print(xgb_tuned_model)
# print feature importance
# (presumably gain - mlr3 documentation not clear)
print(xgb_tuned_model$importance())
library(mlr3tuning)
library(mlr3learners)
library(mlr3misc)
learner = lrn("classif.xgboost", nrounds = to_tune(100, 500), eval_metric = "logloss")
at = AutoTuner$new(
learner = learner,
resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
terminator = trm("evals", n_evals = 5),
tuner = tnr("random_search"),
store_models = TRUE
)
design = benchmark_grid(task = tsk("pima"), learner = at, resampling = rsmp("cv", folds = 5))
bmr = benchmark(design, store_models = TRUE)
To extract learners fitted in the outer loop
data = as.data.table(bmr)
outer_learners = map(data$learner, "learner")
To extract learners fitted in the inner loop
archives = extract_inner_tuning_archives(bmr)
inner_learners = map(archives$resample_result, "learners")

How to use mlrMBO with mlr for hyperparameter optimisation and tuning

I'm trying to train ML algorithms (rf, adaboost, xgboost) in R on a dataset where the target is multiclass classification. For hyperparameter tuning I use the mlr package.
My goal in the code below is to tune the parameters mtry and nodesize while keeping ntree constant at 128 (with mlrMBO). However, I get the error message below. How can I define this correctly?
rdesc <- makeResampleDesc("CV",stratify = T,iters=10L)
traintask <- makeClassifTask(data = df_train,
target = "more_than_X_perc_damage")
testtask <- makeClassifTask(data = df_test,
target = "more_than_X_perc_damage")
lrn <- makeLearner("classif.randomForest",
predict.type = "prob")
# parameter space
params_to_tune <- makeParamSet(
  makeIntegerParam("ntree", lower = 128, upper = 128),
  makeNumericParam("mtry", lower = 0, upper = 1, trafo = function(x) ceiling(x * ncol(train_x))),
  makeNumericParam("nodesize", lower = 0, upper = 1, trafo = function(x) ceiling(nrow(train_x)^x))
)
ctrl = makeTuneControlMBO(mbo.control=mlrMBO::makeMBOControl())
tuned_params <- tuneParams(learner = lrn,
                           task = traintask,
                           control = ctrl,
                           par.set = params_to_tune,
                           resampling = rdesc,
                           measures = acc)
rf_tuned_learner <- setHyperPars(learner = lrn,
                                 par.vals = tuned_params$x)
rf_tuned_model <- mlr::train(rf_tuned_learner, traintask)
# prediction performance
pred <- predict(rf_tuned_model, testtask)
performance(pred)
calculateConfusionMatrix(pred)
stats <- confusionMatrix(pred$data$response,pred$data$truth)
acc_rf_tune <- stats$overall[1] # accuracy
print(acc_rf_tune)
Error in (function (fn, nvars, max = FALSE, pop.size = 1000, max.generations = 100, :
Domains[,1] must be less than or equal to Domains[,2]
Thanks in advance!
You can do this by not including the hyperparameter you want to keep constant in the ParamSet and instead setting it to the value you want when creating the learner.
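As a sketch (reusing the learner from the question; the ranges and trafos are just carried over for illustration): fix ntree via par.vals when constructing the learner, and drop it from the ParamSet:

```r
library(mlr)

# ntree is fixed at learner creation and no longer appears in the ParamSet
lrn <- makeLearner("classif.randomForest",
                   predict.type = "prob",
                   par.vals = list(ntree = 128L))

# tune only the remaining hyperparameters
params_to_tune <- makeParamSet(
  makeNumericParam("mtry", lower = 0, upper = 1,
                   trafo = function(x) ceiling(x * ncol(train_x))),
  makeNumericParam("nodesize", lower = 0, upper = 1,
                   trafo = function(x) ceiling(nrow(train_x)^x))
)
```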

Combining getOOBPreds with nested resampling and parameter tuning

In the R package mlr, I read in the tutorial that the getOOBPreds function lets me access the out-of-bag predictions from, say, a random forest model, but I cannot figure out how to use this in a nested resampling procedure designed to tune hyperparameters.
I understand that the inner loop should somehow use these out-of-bag predictions instead of a resampling strategy.
Thanks for sharing insights / hints!
I tried as inner loop:
makeTuneWrapper(lrnr,
                resampling = "oob",
                par.set = params,
                control = ctrl,
                show.info = TRUE,
                measures = list(logloss, multiclass.brier, timetrain))
... but the value "oob" for parameter resampling is not valid.
tentative MRE:
library(mlr)

# Task
tsk = iris.task

# Learner
lrnr <- makeLearner("classif.randomForestSRC", predict.type = "prob")

# Hyperparameters
params <- makeParamSet(makeIntegerParam("mtry", lower = 2, upper = 10),
                       makeIntegerParam("nodesize", lower = 1, upper = 100),
                       makeIntegerParam("nsplit", lower = 1, upper = 20))

# Validation strategy
rdesc_inner_oob <- makeResampleDesc("oob") # FAILS
ctrl <- makeTuneControlRandom(maxit = 10L)

tuning_lrnr = makeTuneWrapper(lrnr,
                              # resampling = oob, # ALSO WRONG
                              resampling = rdesc_inner_oob,
                              par.set = params,
                              control = ctrl,
                              measures = list(logloss, multiclass.brier, timetrain))

outer = makeResampleDesc("CV", iters = 3)
r = resample(learner = tuning_lrnr,
             task = tsk,
             resampling = outer,
             extract = getOOBPreds,
             show.info = TRUE,
             measures = list(multiclass.brier))

Combining train + test data and running cross validation in R

I have the following R code that runs a simple xgboost model on a set of training and test data with the intention of predicting a binary outcome.
We start by
1) Reading in the relevant libraries.
library(xgboost)
library(readr)
library(caret)
2) Cleaning up the training and test data
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])
test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])
train.c1 = subset(train.df , outcome == 1)
train.c0 = subset(train.df , outcome == 0)
3) Running XGBoost on the properly formatted data.
train_xgb = xgb.DMatrix(data.matrix(train.df[, 1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[, 1:124]))
4) Running the model
model_xgb = xgboost(data = train_xgb, nrounds = 8, max_depth = 5, eta = .1, eval_metric = "logloss", objective = "binary:logistic", verbose = 5)
5) Making predictions
pred_xgb <- predict(model_xgb, newdata = test_xgb)
My question is: How can I adjust this process so that I'm just pulling in / adjusting a single 'training' data set, and getting predictions on the hold-out sets of the cross-validated file?
To specify k-fold CV in the xgboost call, one needs to call xgb.cv with an integer nfold argument; to save the predictions for each resample, use the prediction = TRUE argument. For instance:
xgboostModelCV <- xgb.cv(data = dtrain,
                         nrounds = 1688,
                         nfold = 5,
                         objective = "binary:logistic",
                         eval_metric = "auc",
                         metrics = "auc",
                         verbose = 1,
                         print_every_n = 50,
                         stratified = TRUE,
                         scale_pos_weight = 2,
                         max_depth = 6,
                         eta = 0.01,
                         gamma = 0,
                         colsample_bytree = 1,
                         min_child_weight = 1,
                         subsample = 0.5,
                         prediction = TRUE)
xgboostModelCV$pred #contains predictions in the same order as in dtrain.
xgboostModelCV$folds #contains k-fold samples
Here's a decent function to pick hyperparams
pick_hyperparams <- function(train, seed) {
  require(xgboost)
  ntrees <- 2000
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma = c(0, 1, 2),
                                  eta = c(0.01, 0.03),
                                  max_depth = c(4, 6, 8, 10))

  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList) {
    # extract the parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta <- parameterList[["eta"]]
    currentMaxDepth <- parameterList[["max_depth"]]
    set.seed(seed)
    xgboostModelCV <- xgb.cv(data = train,
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric = "auc",
                             metrics = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = TRUE,
                             # all_data_nobad / index_no_bad must exist in the calling environment
                             scale_pos_weight = sum(all_data_nobad[index_no_bad, 1] == 0) / sum(all_data_nobad[index_no_bad, 1] == 1),
                             max_depth = currentMaxDepth,
                             eta = currentEta,
                             gamma = currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample = currentSubsampleRate)
    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    # save the AUC of the best iteration
    auc <- xvalidationScores[xvalidationScores$iter == xgboostModelCV$best_iteration, c(1, 4, 5)]
    auc <- cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
    names(auc) <- c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}
You can change the grid values and the params in the grid, as well as the loss/evaluation metric. It is similar to caret's grid search, but caret does not let you define hyperparameters such as alpha, lambda, colsample_bylevel, or num_parallel_tree in the grid search without writing a custom function, which I found cumbersome. Caret has the advantage of automatic preprocessing, automatic up/down-sampling within CV, etc.
Setting the seed outside the xgb.cv call will pick the same folds for CV but not the same trees at each round, so you will end up with a different model. Even if you set the seed inside the xgb.cv call, there is no guarantee you will end up with the same model, but there is a much higher chance (it depends on threads, type of model, etc. - I for one like the uncertainty and found it to have little impact on the result).
You can use xgb.cv and set prediction = TRUE.
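A minimal sketch of that, assuming `train_xgb` is the xgb.DMatrix built from the combined training data above (the hyperparameter values are just carried over from the original xgboost() call); the cross-validated hold-out predictions then sit in `$pred`:

```r
library(xgboost)

cv <- xgb.cv(data = train_xgb,
             nrounds = 8,
             nfold = 5,
             max_depth = 5,
             eta = 0.1,
             objective = "binary:logistic",
             eval_metric = "logloss",
             prediction = TRUE,
             verbose = 0)

# hold-out (out-of-fold) predictions, in the same row order as train_xgb
head(cv$pred)
```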

xgboost in R: how does xgb.cv pass the optimal parameters into xgb.train

I've been exploring the xgboost package in R and went through several demos as well as tutorials, but this still confuses me: after using xgb.cv to do cross validation, how do the optimal parameters get passed to xgb.train? Or should I calculate the ideal parameters (such as nround, max.depth) based on the output of xgb.cv?
param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = 12)
cv.nround <- 11
cv.nfold <- 5
mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6,
               nfold = cv.nfold, nrounds = cv.nround, verbose = T)
md <- xgb.train(data = dtrain, params = param, nround = 80,
                watchlist = list(train = dtrain, test = dtest), nthread = 6)
It looks like you misunderstood xgb.cv; it is not a parameter-searching function. It does k-fold cross validation, nothing more.
In your code, it does not change the value of param.
To find the best parameters in R's XGBoost, there are several methods. Here are two of them:
(1) Use the mlr package, http://mlr-org.github.io/mlr-tutorial/release/html/
There is an XGBoost + mlr example code in Kaggle's Prudential challenge, but that code is for regression, not classification. As far as I know, there is no mlogloss metric yet in the mlr package, so you must code the mlogloss measurement from scratch yourself. CMIIW.
(2) Second method: manually set the parameters, then repeat. For example:
param <- list(objective = "multi:softprob",
              eval_metric = "mlogloss",
              num_class = 12,
              max_depth = 8,
              eta = 0.05,
              gamma = 0.01,
              subsample = 0.9,
              colsample_bytree = 0.8,
              min_child_weight = 4,
              max_delta_step = 1)
cv.nround = 1000
cv.nfold = 5
mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6,
               nfold = cv.nfold, nrounds = cv.nround,
               verbose = T)
Then, you find the best (minimum) mlogloss,
min_logloss = min(mdcv[, test.mlogloss.mean])
min_logloss_index = which.min(mdcv[, test.mlogloss.mean])
min_logloss is the minimum value of mlogloss, while min_logloss_index is the index (round).
You must repeat the process above several times, each time changing the parameters manually (mlr does the repetition for you), until you finally get the best global minimum min_logloss.
Note: you can do it in a loop of 100 or 200 iterations, in which for each iteration you set the parameter values randomly. This way, you must save the best [parameters_list, min_logloss, min_logloss_index] in variables or in a file.
Note: it is better to set a random seed with set.seed() for reproducible results. Different random seeds yield different results, so you must save [parameters_list, min_logloss, min_logloss_index, seednumber] in variables or a file.
Say that finally you get 3 results in 3 iterations/repeats:
min_logloss = 2.1457, min_logloss_index = 840
min_logloss = 2.2293, min_logloss_index = 920
min_logloss = 1.9745, min_logloss_index = 780
Then you must use the third parameter set (it has the global minimum min_logloss of 1.9745). Your best index (nrounds) is 780.
Once you get best parameters, use it in the training,
# best_param is global best param with minimum min_logloss
# best_min_logloss_index is the global minimum logloss index
nround = 780
md <- xgb.train(data=dtrain, params=best_param, nrounds=nround, nthread=6)
I don't think you need a watchlist in the training, because you have done the cross validation. But if you still want to use a watchlist, that is just fine.
Even better, you can use early stopping in xgb.cv:
mdcv <- xgb.cv(data=dtrain, params=param, nthread=6,
nfold=cv.nfold, nrounds=cv.nround,
verbose = T, early.stop.round=8, maximize=FALSE)
With this code, when the mlogloss value has not decreased for 8 steps, xgb.cv will stop, saving you time. You must set maximize to FALSE, because you expect the minimum mlogloss.
Here is an example code, with 100 iterations loop, and random chosen parameters.
best_param = list()
best_seednumber = 1234
best_logloss = Inf
best_logloss_index = 0

for (iter in 1:100) {
  param <- list(objective = "multi:softprob",
                eval_metric = "mlogloss",
                num_class = 12,
                max_depth = sample(6:10, 1),
                eta = runif(1, .01, .3),
                gamma = runif(1, 0.0, 0.2),
                subsample = runif(1, .6, .9),
                colsample_bytree = runif(1, .5, .8),
                min_child_weight = sample(1:40, 1),
                max_delta_step = sample(1:10, 1))
  cv.nround = 1000
  cv.nfold = 5
  seed.number = sample.int(10000, 1)[[1]]
  set.seed(seed.number)
  mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6,
                 nfold = cv.nfold, nrounds = cv.nround,
                 verbose = T, early.stop.round = 8, maximize = FALSE)

  min_logloss = min(mdcv[, test.mlogloss.mean])
  min_logloss_index = which.min(mdcv[, test.mlogloss.mean])

  if (min_logloss < best_logloss) {
    best_logloss = min_logloss
    best_logloss_index = min_logloss_index
    best_seednumber = seed.number
    best_param = param
  }
}

nround = best_logloss_index
set.seed(best_seednumber)
md <- xgb.train(data = dtrain, params = best_param, nrounds = nround, nthread = 6)
With this code, you run cross validation 100 times, each time with random parameters; the best parameter set is the one from the iteration with the minimum min_logloss.
Increase the value of early.stop.round if you find it is too small (stopping too early). You also need to adjust the limits of the random parameter values based on your data characteristics.
And for 100 or 200 iterations, I think you want to set verbose to FALSE.
Side note: this is an example of the random method; you can improve it, e.g. with Bayesian optimization. If you have the Python version of XGBoost, there is a good hyperparameter script for XGBoost, https://github.com/mpearmain/BayesBoost, which searches for the best parameter set using Bayesian optimization.
Edit: I want to add a 3rd, manual method, posted by "Davut Polat", a Kaggle master, in the Kaggle forum.
Edit: If you know Python and sklearn, you can also use GridSearchCV along with xgboost.XGBClassifier or xgboost.XGBRegressor.
This is a good question and a great, detailed reply from silo! I found it very helpful for someone new to xgboost like me. Thank you. The method of randomizing and comparing to the boundary is very inspiring. Now, in 2018, some slight revisions are needed; for example, early.stop.round should be early_stopping_rounds. The output mdcv is organized slightly differently:
min_rmse_index <- mdcv$best_iteration
min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
And depending on the application (linear, logistic, etc.), the objective, eval_metric, and parameters should be adjusted accordingly.
For the convenience of anyone running a regression, here is a slightly adjusted version of the code (most of it is the same as above).
library(xgboost)

# Matrices for xgb: dtrain and dtest; "label" is the dependent variable
dtrain <- xgb.DMatrix(X_train, label = Y_train)
dtest <- xgb.DMatrix(X_test, label = Y_test)

best_param <- list()
best_seednumber <- 1234
best_rmse <- Inf
best_rmse_index <- 0

set.seed(123)
for (iter in 1:100) {
  param <- list(objective = "reg:linear",
                eval_metric = "rmse",
                max_depth = sample(6:10, 1),
                eta = runif(1, .01, .3), # learning rate, default: 0.3
                subsample = runif(1, .6, .9),
                colsample_bytree = runif(1, .5, .8),
                min_child_weight = sample(1:40, 1),
                max_delta_step = sample(1:10, 1))
  cv.nround <- 1000
  cv.nfold <- 5 # 5-fold cross-validation
  seed.number <- sample.int(10000, 1) # set seed for the cv
  set.seed(seed.number)
  mdcv <- xgb.cv(data = dtrain, params = param,
                 nfold = cv.nfold, nrounds = cv.nround,
                 verbose = F, early_stopping_rounds = 8, maximize = FALSE)

  min_rmse_index <- mdcv$best_iteration
  min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean

  if (min_rmse < best_rmse) {
    best_rmse <- min_rmse
    best_rmse_index <- min_rmse_index
    best_seednumber <- seed.number
    best_param <- param
  }
}

# The best index (best_rmse_index) is the best "nrounds" for the model
nround <- best_rmse_index
set.seed(best_seednumber)
xg_mod <- xgboost(data = dtrain, params = best_param, nrounds = nround, verbose = F)

# Check error on the testing data
yhat_xg <- predict(xg_mod, dtest)
(MSE_xgb <- mean((yhat_xg - Y_test)^2))
I found silo's answer very helpful.
In addition to his approach of random search, you may want to use Bayesian optimization to facilitate the process of hyperparameter search, e.g. the rBayesianOptimization library.
The following is my code with the rBayesianOptimization library.
cv_folds <- KFold(dataFTR$isPreIctalTrain, nfolds = 5, stratified = FALSE, seed = seedNum)

xgb_cv_bayes <- function(nround, max.depth, min_child_weight, subsample, eta, gamma,
                         colsample_bytree, max_delta_step) {
  param <- list(booster = "gbtree",
                max_depth = max.depth,
                min_child_weight = min_child_weight,
                eta = eta, gamma = gamma,
                subsample = subsample, colsample_bytree = colsample_bytree,
                max_delta_step = max_delta_step,
                lambda = 1, alpha = 0,
                objective = "binary:logistic",
                eval_metric = "auc")
  cv <- xgb.cv(params = param, data = dtrain, folds = cv_folds, nrounds = 1000,
               early_stopping_rounds = 10, maximize = TRUE, verbose = verbose)
  # we don't need the cross-validation predictions, but we do need the number of rounds;
  # a workaround is to pass the number of rounds (best_iteration) as Pred, which is a
  # default parameter in the rBayesianOptimization library
  list(Score = cv$evaluation_log$test_auc_mean[cv$best_iteration],
       Pred = cv$best_iteration)
}

OPT_Res <- BayesianOptimization(xgb_cv_bayes,
                                bounds = list(max.depth = c(3L, 10L),
                                              min_child_weight = c(1L, 40L),
                                              subsample = c(0.6, 0.9),
                                              eta = c(0.01, 0.3),
                                              gamma = c(0.0, 0.2),
                                              colsample_bytree = c(0.5, 0.8),
                                              max_delta_step = c(1L, 10L)),
                                init_grid_dt = NULL, init_points = 10, n_iter = 10,
                                acq = "ucb", kappa = 2.576, eps = 0.0,
                                verbose = verbose)

best_param <- list(
  booster = "gbtree",
  eval_metric = "auc",
  objective = "binary:logistic",
  max_depth = OPT_Res$Best_Par["max.depth"],
  eta = OPT_Res$Best_Par["eta"],
  gamma = OPT_Res$Best_Par["gamma"],
  subsample = OPT_Res$Best_Par["subsample"],
  colsample_bytree = OPT_Res$Best_Par["colsample_bytree"],
  min_child_weight = OPT_Res$Best_Par["min_child_weight"],
  max_delta_step = OPT_Res$Best_Par["max_delta_step"])

# The number of rounds should be tuned using CV:
# https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/
# However, nrounds cannot be derived directly from the BayesianOptimization function.
# Here, OPT_Res$Pred, which was supposed to hold the cross-validation predictions,
# is used to record the number of rounds.
nrounds = OPT_Res$Pred[[which.max(OPT_Res$History$Value)]]
xgb_model <- xgb.train(params = best_param, data = dtrain, nrounds = nrounds)
