R XGBoost early stopping by min_delta

Below is code in which I am trying to train an XGBoost model in R that stops early after a given number of rounds (early_stopping_rounds) without improvement.
watchlist <- list(train = dtrain, test = dtest)
param <- list(
  objective = "binary:logistic",
  eta = 0.3,
  max_depth = 8,
  eval_metric = "logloss"
)
xgb.train(params = param, data = dtrain, nrounds = 1000,
          watchlist = watchlist, early_stopping_rounds = 3)
However, instead of fixing the number of rounds, I would like to pass a min_delta value, so the early stopping kicks in when the difference between rounds is below a given tolerance.
Others (here and here) have asked this for Python.
However, relatively recent changes have implemented this option for Python.
But how do I work this out in R? Is there something like it?
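As far as I know there is no built-in min_delta argument in the R interface, but it can be emulated manually. Below is a minimal sketch, assuming the dtrain, dtest and param objects from the code above; min_delta and patience are names made up here for the tolerance and the number of rounds allowed without sufficient improvement. It trains one round at a time and stops once the test logloss no longer improves by more than the tolerance:
# Emulate min_delta-style early stopping with a manual training loop.
min_delta <- 1e-4   # required improvement in test logloss per round
patience  <- 3      # rounds allowed without sufficient improvement
best <- Inf
wait <- 0
bst  <- NULL
y <- getinfo(dtest, "label")
for (i in 1:1000) {
  # continue training the existing booster by one round
  bst <- xgb.train(params = param, data = dtrain, nrounds = 1,
                   xgb_model = bst, verbose = 0)
  pred <- predict(bst, dtest)
  eps <- 1e-15
  logloss <- -mean(y * log(pmax(pred, eps)) + (1 - y) * log(pmax(1 - pred, eps)))
  if (best - logloss > min_delta) {
    best <- logloss
    wait <- 0
  } else {
    wait <- wait + 1
    if (wait >= patience) break
  }
}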

Related

extract_inner_fselect_results is NULL with mlr3 Nested Resampling

This question is an extension of the following question: No Model Stored with Mlr3.
I have been performing nested resampling to get an unbiased metric of model performance. If I don't specify store_models=TRUE then I get Error: No model stored at the end of the run. However, if I specify store_models=TRUE in both the at and resample calls then RStudio crashes due to RAM consumption.
I have now tried the following code in which I specified store_models=TRUE for just the at call:
MSvCon<-read.csv("MS v Control Proteomics Final.csv", row.names=1)
MSvCon$Status<-as.factor(MSvCon$Status)
MSvCon[,2:4399]<-scale(MSvCon[,2:4399], center=TRUE, scale=TRUE)
set.seed(123, "L'Ecuyer")
task = as_task_classif(MSvCon, target = "Status")
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
set_threads(learner, n = 8)
measure = msr("classif.fbeta", beta=1, average="micro")
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
After finishing, I am able to extract the performance measures successfully. However, when I tried to use extract_inner_fselect_results and extract_inner_fselect_archives to check which features were selected and their importance measures, I received a NULL result.
Do you have any suggestions on what I would need to adjust in my code to see this information? I anticipate that adding store_models=TRUE to the resample call would do it, but the RAM consumption issue (even using 128GB on RStudio Workbench) prevents that. Is there a way around this?
The archives of the inner resampling are stored in the model slot of the AutoFSelectors, i.e. without store_models = TRUE in resample() you cannot access the inner results and archives. I will write a workaround for you and answer it in the other question.
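For context, a minimal sketch (assuming the task, at and resampling_outer objects from the question) of how the inner results are accessed once models are stored in resample(), which is exactly the call that runs into the RAM limit here:
# With store_models = TRUE in resample(), the inner results become available.
rr <- resample(task, at, resampling_outer, store_models = TRUE)
extract_inner_fselect_results(rr)    # selected features per outer fold
extract_inner_fselect_archives(rr)   # full inner feature-selection archives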

A question about parallelism in the h2o.grid() function

I am trying to use the h2o.grid() function from the h2o package to do some tuning in R. When I set the parallelism parameter larger than 1, it always shows the warning:
Some models were not built due to a failure, for more details run `summary(grid_object, show_stack_traces = TRUE)`
The model_ids in the final grid object also include many models ending with _cv_1, _cv_2, etc., and the number of models does not match the max_models setting in my search_criteria list. I think these are just the models from the CV process, not the final models. When I leave parallelism at its default or set it to 1, the result is normal and all models end with _model_1, _model_2, etc.
Here is my code:
# set the grid
rf_h2o_grid <- list(mtries = seq(3, ncol(train_h2o), 4),
                    max_depth = c(5, 10, 15, 20))
# set the search_criteria
sc <- list(strategy = "RandomDiscrete",
           seed = 100,
           max_models = 5)
# random grid tuning
rf_h2o_grid_tune_random <- h2o.grid(
  algorithm = "randomForest",
  x = x,
  y = y,
  training_frame = train_h2o,
  nfolds = 5,                    # use cv to validate the parameters
  fold_assignment = "Stratified",
  ntrees = 100,
  seed = 100,
  hyper_params = rf_h2o_grid,
  search_criteria = sc
  # parallelism = 6              # when I set it larger than 1, the result always includes some "cv_" models
)
So how can I use parallelism correctly in h2o.grid()? Thanks for helping!
This is an actual issue with parallelism in grid search, previously noticed but not reported correctly.
Thanks for raising this, we'll try to fix it soon: see https://h2oai.atlassian.net/browse/PUBDEV-7886 if you want to track progress.
Until a proper fix lands, you should avoid using both CV and parallelism in your grids (see the sketch below).
Regarding the following error:
Some models were not built due to a failure, for more details run `summary(grid_object, show_stack_traces = TRUE)`
if the error is reproducible, you should get more details by running the grid with verbose=True.
Adding the entire error message to the ticket above might also help.
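Until then, here is a minimal sketch of that workaround using the objects from the question: keep the cross-validation and run the grid sequentially (alternatively, drop nfolds and keep parallelism > 1):
# Keep CV but run the grid sequentially until the issue is fixed.
rf_h2o_grid_tune_random <- h2o.grid(
  algorithm = "randomForest",
  x = x,
  y = y,
  training_frame = train_h2o,
  nfolds = 5,
  fold_assignment = "Stratified",
  ntrees = 100,
  seed = 100,
  hyper_params = rf_h2o_grid,
  search_criteria = sc,
  parallelism = 1
)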
This is because you set max_models = 5; your grid will only build 5 models and then stop.
There are three ways to set up early stopping criteria (see the sketch after this list):
"max_models": the maximum number of models created
"max_runtime_secs": the maximum runtime in seconds
metric-based early stopping, by setting "stopping_rounds", "stopping_metric", and "stopping_tolerance"
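For instance, a minimal sketch of a search_criteria list that uses metric-based stopping instead of a fixed model cap (the stopping values below are placeholders, not recommendations):
sc <- list(strategy = "RandomDiscrete",
           seed = 100,
           stopping_rounds = 3,        # stop after 3 rounds without improvement
           stopping_metric = "AUC",
           stopping_tolerance = 1e-3)  # minimum improvement that counts as progress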

Multiclass classification with LightGBM

I am using the latest release of LightGBM to solve a multiclass classification problem. When I switch the objective to "multiclass", this error occurs:
Error in data$update_params(params) :
[LightGBM] [Fatal] Number of classes should be specified and greater than 1 for multiclass training
Here is a reproducible example that shows my approach:
catnames <- names(purrr::keep(train_x, is.factor))
dtrain <- lgb.Dataset(as.matrix(train_x), label = train_y, categorical_feature = catnames)
data_file <- tempfile(fileext = ".data")
lgb.Dataset.save(dtrain, data_file)
dtrain <- lgb.Dataset(data_file)
lgb.Dataset.construct(dtrain)
model <- lgb.train(data = dtrain,
                   objective = "multiclass",
                   alpha = 0.1,
                   nrounds = 1000,
                   learning_rate = 0.1)
I tried saving my target (train_y) as a factor, but nothing changed.
When using the multi-class objective in LightGBM, you need to pass another parameter that tells the learner the number of classes to predict.
So, it should probably look more like this:
model <- lgb.train(data = dtrain,
                   objective = "multiclass",
                   num_classes = INSERT NUMBER OF TARGET CLASSES HERE,
                   alpha = 0.1,
                   nrounds = 1000,
                   learning_rate = 0.1)
My experience is more with the Python API, so it might be that (if this does not work) you need to pass the num_class parameter inside a params list argument to lgb.train.
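A minimal sketch of that alternative, assuming the dtrain object from the question and a hypothetical class count of 3:
# Pass the parameters through a params list instead of individual arguments.
params <- list(objective = "multiclass",
               num_class = 3,          # replace 3 with the actual number of target classes
               learning_rate = 0.1)
model <- lgb.train(params = params, data = dtrain, nrounds = 1000)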

How to find the optimal number of epochs in general when using h2o.deeplearning()?

I have just started studying neural network modelling, and I wanted to know: suppose I have a dataset for a classification problem and I am using
churn_model <- h2o.deeplearning(x = setdiff(names(churn), names(churn)[10]),
                                y = names(churn)[10],
                                training_frame = churnTrain,
                                validation_frame = churnValidation,
                                distribution = "multinomial",
                                activation = "RectifierWithDropout",
                                hidden = c(200, 200, 200),
                                hidden_dropout_ratio = c(0.1, 0.1, 0.1),
                                epochs = 50,
                                stopping_rounds = 0,
                                l1 = 1e-5)
How can I determine, through some function or otherwise, the number of epochs I should use?
Usually, you estimate it empirically. Start training with an initially large number of iterations/epochs and stop whenever you observe overfitting, i.e. the validation error goes up while the training error keeps decreasing. This is called early stopping.
If you do not observe overfitting, increase the number of iterations until the validation error becomes stable.
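In h2o.deeplearning() this can be automated with the metric-based stopping parameters. A minimal sketch, reusing the churn objects from the question (the stopping values are placeholders):
# Set epochs high and let early stopping pick the effective number of epochs.
churn_model <- h2o.deeplearning(x = setdiff(names(churn), names(churn)[10]),
                                y = names(churn)[10],
                                training_frame = churnTrain,
                                validation_frame = churnValidation,
                                distribution = "multinomial",
                                activation = "RectifierWithDropout",
                                hidden = c(200, 200, 200),
                                hidden_dropout_ratio = c(0.1, 0.1, 0.1),
                                epochs = 1000,
                                stopping_rounds = 5,          # scoring rounds without improvement
                                stopping_metric = "logloss",
                                stopping_tolerance = 1e-3,
                                l1 = 1e-5)
# h2o.scoreHistory(churn_model) then shows how many epochs were actually run.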

Why doesn't the early.stop.round argument in xgboost work?

I tried to use the early.stop.round argument in the xgb.cv function of the xgboost library; however, I got an error. When I leave early.stop.round unspecified, the function runs without any problem. What did I do wrong?
Here is my example code:
library(xgboost)
train = matrix(as.numeric(1:100), 20, 5)
Y = rep(c(0, 1), 10)
dtrain = xgb.DMatrix(train, label = Y)
# cross-validation with early.stop.round = 5 gives an error
CV = xgb.cv(data = dtrain, nround = 200, nfold = 2, metrics = list("auc"),
            objective = "binary:logistic", early.stop.round = 5)
# cross-validation without early.stop.round works
CV = xgb.cv(data = dtrain, nround = 200, nfold = 2, metrics = list("auc"),
            objective = "binary:logistic")
I am using xgboost_0.4-2
It looks like something goes wrong when using the metrics parameter and early.stop.round simultaneously. Remove metrics and use early.stop.round with eval_metric = "auc" instead.
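A minimal sketch of that suggestion, using the same toy data as in the question:
# Drop `metrics` and pass eval_metric instead when combining with early stopping.
CV = xgb.cv(data = dtrain, nround = 200, nfold = 2,
            objective = "binary:logistic", eval_metric = "auc",
            early.stop.round = 5)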
