R Xgboost validation error as stopping metric

I am using a train and a validation dataset with an xgboost binary classification model.
params5 <- list(booster = "gbtree", objective = "binary:logistic",
                eta = 0.0001, gamma = 0.5, max_depth = 15, min_child_weight = 1,
                subsample = 0.6, colsample_bytree = 0.4, seed = 2222)
xgb_MOD5 <- xgb.train(params = params5, data = dtrain, nrounds = 4000,
                      watchlist = list(validation = dvalid, train = dtrain),
                      print_every_n = 30, early_stopping_rounds = 100,
                      maximize = F, serialize = TRUE)
It automatically picks the train error as the stopping metric. This results in the model continuing to train while overfitting.
Multiple eval metrics are present. Will use train_error for early stopping.
Will train until train_error hasn't improved in 100 rounds.
How do I assign the validation error as stopping metric?

I do not use the R binding of xgboost, and the R package documentation is not specific about this. However, the Python API documentation (see the early_stopping_rounds argument) has a relevant clarification on this issue:
Requires at least one item in evals. If there’s more than one, will use the last.
Here, evals is the list of samples on which the metrics will be evaluated, i.e. it is analogous to your watchlist argument. So my guess is that you just need to swap the order of the items in the list you pass as that argument.
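If that is the case, a minimal sketch of the reordered call would be the following (same parameters as in the question; untested, since I do not use the R binding):
xgb_MOD5 <- xgb.train(params = params5, data = dtrain, nrounds = 4000,
                      # validation last, so it becomes the early-stopping metric
                      watchlist = list(train = dtrain, validation = dvalid),
                      print_every_n = 30, early_stopping_rounds = 100,
                      maximize = FALSE)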

Thanks @abhiieor for the solution. Adding to that, from what I observed: when we use only the validation set in the watchlist:
xgb_MOD5 <- xgb.train(params = params5, data = dtrain, nrounds = 400,
                      watchlist = list(validation = dvalid),
                      print_every_n = 30, early_stopping_rounds = 100,
                      maximize = F, serialize = TRUE)
log results while it runs:
[1] validation-error:0.222037
Will train until validation_error hasn't improved in 100 rounds.
[31] validation-error:0.201712
[61] validation-error:0.201635
And if we want to see both the train error and the validation error while it runs, adding the validation set as the second (last) element of the watchlist shows both metrics while still using the validation error as the stopping metric:
xgb_MOD5 <- xgb.train(params = params5, data = dtrain, nrounds = 400,
                      watchlist = list(train = dtrain, validation = dvalid),
                      print_every_n = 30, early_stopping_rounds = 100,
                      maximize = F, serialize = TRUE)
[1] train-error:0.202131 validation-error:0.232341
Multiple eval metrics are present. Will use validation_error for early stopping.
Will train until validation_error hasn't improved in 100 rounds.
[31] train-error:0.174278 validation-error:0.202871
[61] train-error:0.173909 validation-error:0.202288
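Once early stopping is driven by the validation error, the chosen round can also be read back off the fitted booster; in the classic xgboost R API the fields below should exist, but treat the exact names as an assumption:
xgb_MOD5$best_iteration          # round with the lowest validation error
xgb_MOD5$best_score              # the validation error at that round
head(xgb_MOD5$evaluation_log)    # per-round train/validation errors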

Related

Why does training Xgboost model with pseudo-Huber loss return a constant test metric?

I am trying to fit an xgboost model using the native pseudo-Huber loss reg:pseudohubererror. However, it doesn't seem to be working, since neither the training nor the test error is improving. It works just fine with reg:squarederror. What am I missing?
Code:
library(xgboost)
n = 1000
X = cbind(runif(n, 10, 20), runif(n, 0, 10))
y = X %*% c(2, 3) + rnorm(n, 0, 1)
train = xgb.DMatrix(data = X[-n, ], label = y[-n])
test = xgb.DMatrix(data = t(as.matrix(X[n, ])), label = y[n])
watchlist = list(train = train, test = test)
xbg_test = xgb.train(data = train, objective = "reg:pseudohubererror",
                     eval_metric = "mae", watchlist = watchlist,
                     gamma = 1, eta = 0.01, nrounds = 10000,
                     early_stopping_rounds = 100)
Result:
[1] train-mae:44.372692 test-mae:33.085709
Multiple eval metrics are present. Will use test_mae for early stopping.
Will train until test_mae hasn't improved in 100 rounds.
[2] train-mae:44.372692 test-mae:33.085709
[3] train-mae:44.372688 test-mae:33.085709
[4] train-mae:44.372688 test-mae:33.085709
[5] train-mae:44.372688 test-mae:33.085709
[6] train-mae:44.372688 test-mae:33.085709
[7] train-mae:44.372688 test-mae:33.085709
[8] train-mae:44.372688 test-mae:33.085709
[9] train-mae:44.372688 test-mae:33.085709
[10] train-mae:44.372692 test-mae:33.085709
It seems like that is the expected behavior of the pseudo-Huber loss. Below I hard-coded the first and second derivatives of the objective loss function found here and fed them in via the obj = obje parameter. If you run it and compare with the objective = "reg:pseudohubererror" version, you'll see they are the same. As for why it is so much worse than squared loss, I am not sure.
set.seed(20)
obje = function(pred, dData) {
  labels = getinfo(dData, "label")
  a = pred
  d = labels
  # first derivative (gradient) of the loss
  fir = a^2 / sqrt(a^2/d^2 + 1) / d - 2*d*(sqrt(a^2/d^2 + 1) - 1)
  # second derivative (hessian) of the loss
  sec = ((2*(a^2/d^2 + 1)^(3/2) - 2)*d^2 - 3*a^2) / ((a^2/d^2 + 1)^(3/2) * d^2)
  return(list(grad = fir, hess = sec))
}
xbg_test = xgb.train(data = train, obj = obje, eval_metric = "mae",
                     watchlist = watchlist, gamma = 1, eta = 0.01,
                     nrounds = 10000, early_stopping_rounds = 100)
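As a quick check of that claim, a short side-by-side run such as this sketch (nrounds cut down so it finishes quickly; the evaluation_log column names are an assumption about the classic R API) should produce identical test-mae traces:
xbg_custom  <- xgb.train(data = train, obj = obje, eval_metric = "mae",
                         watchlist = watchlist, gamma = 1, eta = 0.01, nrounds = 10)
xbg_builtin <- xgb.train(data = train, objective = "reg:pseudohubererror",
                         eval_metric = "mae", watchlist = watchlist,
                         gamma = 1, eta = 0.01, nrounds = 10)
# compare the per-round test MAE of the two runs
all.equal(xbg_custom$evaluation_log$test_mae, xbg_builtin$evaluation_log$test_mae)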
I don't have an answer for the "why" part of the question, but I had the same problem and found a solution that worked for me.
In my problem, the algorithm started converging only after I applied standardization on the label:
Label_standard = (Label - mean(Label)) / sd(Label)
Make sure to calculate the mean and sd from the training data only, not including the test dataset!
After training the model and generating predictions, you need to transform the standardized predictions back to the original range with the mean and sd that you calculated from the training dataset.
I got this idea because I found that the algorithm didn't converge when the label values were "big". I would be interested to understand the "why" as well.
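A minimal sketch of that standardization, applied to the example from the question (the helper variables here are mine):
# compute standardization constants from the training rows only
mu  <- mean(y[-n])
sdv <- sd(y[-n])
train_std <- xgb.DMatrix(data = X[-n, ], label = (y[-n] - mu) / sdv)
test_std  <- xgb.DMatrix(data = t(as.matrix(X[n, ])), label = (y[n] - mu) / sdv)
fit <- xgb.train(data = train_std, objective = "reg:pseudohubererror",
                 eval_metric = "mae",
                 watchlist = list(train = train_std, test = test_std),
                 gamma = 1, eta = 0.01, nrounds = 10000,
                 early_stopping_rounds = 100)
# transform the standardized predictions back to the original scale
pred_original <- predict(fit, test_std) * sdv + mu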

Why does XGBoost respond to parameters for the linear booster when I'm telling it to only use the tree booster?

Run the following modification to the example code from ?xgb.train:
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)
param <- list(eta = 0.1, objective = "reg:squarederror", eval_metric = "rmse", booster = "gbtree")
bst <- xgb.train(param, dtrain, nrounds = 200, watchlist)
and pay attention to the outputs. Now run the same code, but replace param with param <- list(eta = 0.1, objective = "reg:squarederror", eval_metric = "rmse", booster = "gbtree", lambda = 5000). After doing this you will get wildly different results.
Why is this? As far as I can tell from ?xgb.train, lambda is a parameter that is only for the linear booster (gblinear) and booster="gbtree" is telling xgb.train to use only the tree booster (gbtree). In other words, it appears that xgb.train is responding to the lambda parameter despite being explicitly told to only use a model that doesn't use lambda.
This answer appears to blame randomness, but the results in the lambda=5000 case are very consistent, suggesting otherwise.
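For reference, a small wrapper of my own that makes the comparison explicit (it assumes the evaluation_log field of the classic R API):
run_with <- function(param) {
  bst <- xgb.train(param, dtrain, nrounds = 200, watchlist, verbose = 0)
  tail(bst$evaluation_log, 1)   # final train/eval RMSE for this parameter list
}
run_with(list(eta = 0.1, objective = "reg:squarederror", eval_metric = "rmse",
              booster = "gbtree"))
run_with(list(eta = 0.1, objective = "reg:squarederror", eval_metric = "rmse",
              booster = "gbtree", lambda = 5000))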

How to limit the execution time but save the output in R?

I'm trying to limit the execution time of an analysis; however, I want to keep what the analysis has already done.
In my case I'm running xgb.cv (from the xgboost R package) and I want to keep all iterations until the analysis reaches 10 seconds (or "n" seconds/minutes/hours).
I've tried the approach mentioned in this thread, but it stops after it reaches 10 seconds without keeping the iterations previously done.
Here is my code:
require(xgboost)
require(R.utils)
data(iris)
train.model <- model.matrix(Sepal.Length ~ ., iris)
dtrain <- xgb.DMatrix(data = train.model, label = iris$Sepal.Length)
# custom evaluation metric: RMSE on the log scale
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(sum((log(preds) - log(labels))^2) / length(labels))
  return(list(metric = "error", value = err))
}
xgb_grid = list(eta = 0.05, max_depth = 5, subsample = 0.7, gamma = 0.3,
                min_child_weight = 1)
fit_boost <- tryCatch(
  expr = {
    evalWithTimeout({
      xgb.cv(data = dtrain,
             nrounds = 10000,
             objective = "reg:linear",
             eval_metric = evalerror,
             early_stopping_rounds = 300,
             print_every_n = 100,
             params = xgb_grid,
             colsample_bytree = 0.7,
             nfold = 5,
             prediction = TRUE,
             maximize = FALSE)
    },
    timeout = 10)
  },
  TimeoutException = function(ex) cat("Timeout. Skipping.\n"))
and the output is
#Error in dim.xgb.DMatrix(x) : reached CPU time limit
Thank you!
Edit - slightly closer to what you want:
Wrap the whole thing with R's capture.output() function. This will store all the evaluation output as an R object. Again, I think you're looking for something more, but this is at least local and malleable. Syntax:
fit_boost <- capture.output(tryCatch(expr = {evalWithTimeout({...}) ) )
> fit_boost
[1] "[1]\ttrain-error:2.033160+0.006109\ttest-error:2.034180+0.017467 " ...
Original answer:
You could also use a sink. Simply add this line before you start doing the cross validation:
sink("evaluationLog.txt")
fit_boost <- tryCatch(
expr = {evalWithTimeout({xgb.cv(data = dtrain,
nrounds = 10000,
objective = "reg:linear",
eval_metric = evalerror,
early_stopping_rounds = 300,
print_every_n = 100,
params = xgb_grid,
colsample_bytree = 0.7,
nfold = 5,
prediction = TRUE,
maximize = FALSE
)},
timeout = 10)
},
TimeoutException = function(ex) cat("Timeout. Skipping.\n"))
sink()
The sink() at the end would normally return output to the console, but in this case it won't, because an error is thrown. Once you run this, you can open up evaluationLog.txt and voilà:
[1] train-error:2.033217+0.003705 test-error:2.032427+0.012808
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 300 rounds.
[101] train-error:0.045297+0.000396 test-error:0.060047+0.001849
[201] train-error:0.042085+0.000852 test-error:0.059798+0.002382
[301] train-error:0.041117+0.001032 test-error:0.059733+0.002701
[401] train-error:0.040340+0.001170 test-error:0.059481+0.002973
[501] train-error:0.039988+0.001145 test-error:0.059469+0.002929
[601] train-error:0.039698+0.001028 test-error:0.059416+0.003018
This isn't perfect, of course. I imagine you want to perform some operations on these and this isn't exactly the best format. However, it's not a tall order to convert this into something more manageable. I haven't yet found a way to save the actual xgb.cv$evaluation_log object before the timeout. That is a very good question.
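As an illustration of that conversion, here is a sketch that parses the sink'ed file into a data frame (assuming the log lines look exactly like the ones above):
# read the sink'ed log and keep only the per-iteration lines
lines <- readLines("evaluationLog.txt")
lines <- grep("^\\[\\d+\\]", lines, value = TRUE)
# pull out iteration, train error and test error with a regex
m <- regmatches(lines, regexec(
  "^\\[(\\d+)\\]\\s+train-error:([0-9.]+)\\+[0-9.]+\\s+test-error:([0-9.]+)", lines))
log_df <- do.call(rbind, lapply(m, function(x)
  data.frame(iter = as.integer(x[2]),
             train_error = as.numeric(x[3]),
             test_error = as.numeric(x[4]))))
head(log_df)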

XGBoost - predict not exported in namespace

I am trying to tune an xgboost model with a multiclass dependent variable in R. I am using mlr to do this; however, I run into an error because predict is not exported from the xgboost namespace, which I assume mlr wants to use. I have had a look online and see that other people have encountered similar issues. However, I can't entirely understand the answers that have been provided (e.g. https://github.com/mlr-org/mlr/issues/935), and when I try to implement them the issue persists. My code is as follows:
# Tune parameters
# Create tasks
train$result <- as.factor(train$result)   # needs to be a factor variable for makeClassifTask to work
test$result <- as.factor(test$result)
traintask <- makeClassifTask(data = train, target = "result")
testtask <- makeClassifTask(data = test, target = "result")
lrn <- makeLearner("classif.xgboost", predict.type = "response")
# Set learner values, number of rounds etc.
lrn$par.vals <- list(
  objective = "multi:softprob",   # return class with maximum probability
  num_class = 3,                  # there are three outcome categories
  eval_metric = "merror",
  nrounds = 100L,
  eta = 0.1
)
# Set parameters to be tuned
params <- makeParamSet(
  makeDiscreteParam("booster", values = c("gbtree", "gblinear")),
  makeIntegerParam("max_depth", lower = 3L, upper = 10L),
  makeNumericParam("min_child_weight", lower = 1L, upper = 10L),
  makeNumericParam("subsample", lower = 0.5, upper = 1),
  makeNumericParam("colsample_bytree", lower = 0.5, upper = 1)
)
# Set resampling strategy
rdesc <- makeResampleDesc("CV", stratify = T, iters = 5L)
# Search strategy
ctrl <- makeTuneControlRandom(maxit = 10L)
# parallelStartSocket(cpus = detectCores()) # Enable parallel processing
mytune <- tuneParams(learner = lrn,
                     task = traintask,
                     resampling = rdesc,
                     measures = acc,
                     par.set = params,
                     control = ctrl,
                     show.info = T)
The specific error I get is:
Error: 'predict' is not an exported object from 'namespace:xgboost'
My package versions are:
packageVersion("xgboost")
[1] ‘0.6.4’
packageVersion("mlr")
[1] ‘2.8’
Would anyone know what I should do here?
Thanks in advance.

Using parallelMap Package with Custom Filter in mlr

I am working with mlr to do a text classification task. I have written a custom filter as described here:
Create Custom Filters
The filter works as intended; however, when I try to utilise parallelization I receive the following error:
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; cpus = 4; elements = 2.
Error in stopWithJobErrorMessages(inds, vcapply(result.list[inds], as.character)) :
Errors occurred in 2 slave jobs, displaying at most 10 of them:
00001: Error in parallel:::.slaveRSOCK() :
Assertion on 'method' failed: Must be element of set {'anova.test','carscore','cforest.importance','chi.squared','gain.ratio','information.gain','kruskal.test','linear.correlation','mrmr','oneR','permutation.importance','randomForest.importance','randomForestSRC.rfsrc','randomForestSRC.var.select','rank.correlation','relief','rf.importance','rf.min.depth','symmetrical.uncertainty','univariate','univariate.model.score','variance'}.
I'm assuming from the error that my custom filter needs to be an element of that set to stand a chance of working in parallel, but I haven't managed to work out (a) whether this is possible, and (b) if it is, how to go about it.
Thanks in advance for any help,
Azam
Added: Test Script
I can't let you see the actual script/data I'm working with due to sensitivity, but this example reproduces the error I see. Apart from the custom feature selection and data-set, the steps to set up the learner and evaluate it are as I have in my 'real' script. As in my real case, if you remove the parallelStartSocket() command then the script runs as expected.
I should also add that I have successfully used parallel processing (or at least I received no errors) when tuning the hyper-parameters of an SVM with an RBF kernel: the script was identical apart from the makeParamSet() definition.
library(parallelMap)
library(mlr)
library(kernlab)
makeFilter(
  name = "nonsense.filter",
  desc = "Calculates scores according to alphabetical order of features",
  pkg = "mlr",
  supported.tasks = c("classif", "regr", "surv"),
  supported.features = c("numerics", "factors", "ordered"),
  fun = function(task, nselect, decreasing = TRUE, ...) {
    feats = getTaskFeatureNames(task)
    imp = order(feats, decreasing = decreasing)
    names(imp) = feats
    imp
  }
)
# set up svm with rbf kernel
svm.lrn <- makeLearner("classif.ksvm", predict.type = "response")
# wrap learner with filter
svm.lrn <- makeFilterWrapper(svm.lrn, fw.method = "nonsense.filter")
# define feature selection parameters
ps.svm = makeParamSet(
  makeDiscreteParam("fw.abs", values = seq(2, 3, 1))
)
# define inner search and evaluation strategy
ctrl.svm = makeTuneControlGrid()
inner.svm = makeResampleDesc("CV", iters = 5, stratify = TRUE)
svm.lrn <- makeTuneWrapper(svm.lrn, resampling = inner.svm, par.set = ps.svm,
                           control = ctrl.svm)
# set up outer resampling
outer.svm <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
# run it...
parallelStartSocket(2)
run.svm <- resample(svm.lrn, iris.task,
                    resampling = outer.svm, extract = getTuneResult)
parallelStop()
The problem is that makeFilter registers S3 methods, which are not available in separate R processes. You have two options to make this work: either simply use parallelStartMulticore(2) so that everything runs in the same R process, or tell parallelMap about the pieces that need to be present in the other R processes.
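A minimal sketch of the first option (multicore mode relies on forking, so assume a non-Windows system; everything else is unchanged from the question's script):
# start parallelization in multicore mode instead of socket mode, so the
# filter registered in this session stays visible in the forked workers
parallelStartMulticore(2)
run.svm <- resample(svm.lrn, iris.task,
                    resampling = outer.svm, extract = getTuneResult)
parallelStop()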
There are two parts to the latter. First, use parallelLibrary("mlr") to load mlr everywhere; second, pull the definition of the filter out into a separate file that can be loaded using parallelSource(). For example:
filter.R:
makeFilter(
  name = "nonsense.filter",
  desc = "Calculates scores according to alphabetical order of features",
  pkg = "mlr",
  supported.tasks = c("classif", "regr", "surv"),
  supported.features = c("numerics", "factors", "ordered"),
  fun = function(task, nselect, decreasing = TRUE, ...) {
    feats = getTaskFeatureNames(task)
    imp = order(feats, decreasing = decreasing)
    names(imp) = feats
    imp
  }
)
main.R:
library(parallelMap)
library(mlr)
library(kernlab)
parallelStartSocket(2)
parallelLibrary("mlr")
parallelSource("filter.R")
# set up svm with rbf kernel
svm.lrn = makeLearner("classif.ksvm", predict.type = "response")
# wrap learner with filter
svm.lrn = makeFilterWrapper(svm.lrn, fw.method = "nonsense.filter")
# define feature selection parameters
ps.svm = makeParamSet(
  makeDiscreteParam("fw.abs", values = seq(2, 3, 1))
)
# define inner search and evaluation strategy
ctrl.svm = makeTuneControlGrid()
inner.svm = makeResampleDesc("CV", iters = 5, stratify = TRUE)
svm.lrn = makeTuneWrapper(svm.lrn, resampling = inner.svm, par.set = ps.svm,
                          control = ctrl.svm)
# set up outer resampling
outer.svm = makeResampleDesc("CV", iters = 10, stratify = TRUE)
# run it...
run.svm = resample(svm.lrn, iris.task, resampling = outer.svm, extract = getTuneResult)
parallelStop()
