I'm trying to limit the execution time of an analysis, but I want to keep whatever the analysis has already completed.
In my case I'm running xgb.cv (from the xgboost R package) and I want to keep all iterations completed before the analysis reaches 10 seconds (or "n" seconds/minutes/hours).
I've tried the approach mentioned in this thread, but it stops once it reaches 10 seconds without keeping the iterations already done.
Here is my code:
require(xgboost)
require(R.utils)
data(iris)
train.model <- model.matrix(Sepal.Length~., iris)
dtrain <- xgb.DMatrix(data=train.model, label=iris$Sepal.Length)
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(sum((log(preds) - log(labels))^2) / length(labels))
  return(list(metric = "error", value = err))
}
xgb_grid = list(eta = 0.05, max_depth = 5, subsample = 0.7, gamma = 0.3,
                min_child_weight = 1)
fit_boost <- tryCatch(
  expr = {
    evalWithTimeout({
      xgb.cv(data = dtrain,
             nrounds = 10000,
             objective = "reg:linear",
             eval_metric = evalerror,
             early_stopping_rounds = 300,
             print_every_n = 100,
             params = xgb_grid,
             colsample_bytree = 0.7,
             nfold = 5,
             prediction = TRUE,
             maximize = FALSE
      )
    },
    timeout = 10)
  },
  TimeoutException = function(ex) cat("Timeout. Skipping.\n"))
and the output is
#Error in dim.xgb.DMatrix(x) : reached CPU time limit
Thank you!
Edit - slightly closer to what you want:
Wrap the whole thing with R's capture.output() function. This will store all the evaluation output as an R object. Again, I think you're looking for something more, but this is at least local and malleable. Syntax:
fit_boost <- capture.output(tryCatch(expr = {evalWithTimeout({...})}))
> fit_boost
[1] "[1]\ttrain-error:2.033160+0.006109\ttest-error:2.034180+0.017467 " ...
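To illustrate why this keeps the partial output, here is a minimal sketch in base R. The loop and the stop() call are stand-ins (assumptions, not xgboost code) for the cross-validation printing its progress and then being interrupted; the point is only that capture.output() returns whatever was printed before the error was handled.

```r
# Stand-in for an interrupted run: print a few progress lines, then fail.
# The loop and the error message are illustrative, not produced by xgboost.
log_lines <- capture.output(
  result <- tryCatch({
    for (i in 1:3) cat(sprintf("[%d]\ttrain-error:%f\n", i, 1 / i))
    stop("simulated timeout")
  }, error = function(e) NULL)
)

length(log_lines)  # lines printed before the error are kept
is.null(result)    # the handler's return value, since the run did not finish
```

The captured character vector can then be post-processed like any other R object.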
Original answer:
You could also use a sink. Simply add this line before you start doing the cross validation:
sink("evaluationLog.txt")
fit_boost <- tryCatch(
  expr = {
    evalWithTimeout({
      xgb.cv(data = dtrain,
             nrounds = 10000,
             objective = "reg:linear",
             eval_metric = evalerror,
             early_stopping_rounds = 300,
             print_every_n = 100,
             params = xgb_grid,
             colsample_bytree = 0.7,
             nfold = 5,
             prediction = TRUE,
             maximize = FALSE
      )
    },
    timeout = 10)
  },
  TimeoutException = function(ex) cat("Timeout. Skipping.\n"))
sink()
The sink() at the end would normally return output to the console, but in this case it is never reached because an error is thrown first (passing finally = sink() to tryCatch should restore the console regardless). Once you run this, you can open up evaluationLog.txt and voilà:
[1] train-error:2.033217+0.003705 test-error:2.032427+0.012808
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 300 rounds.
[101] train-error:0.045297+0.000396 test-error:0.060047+0.001849
[201] train-error:0.042085+0.000852 test-error:0.059798+0.002382
[301] train-error:0.041117+0.001032 test-error:0.059733+0.002701
[401] train-error:0.040340+0.001170 test-error:0.059481+0.002973
[501] train-error:0.039988+0.001145 test-error:0.059469+0.002929
[601] train-error:0.039698+0.001028 test-error:0.059416+0.003018
This isn't perfect, of course. I imagine you want to perform some operations on these and this isn't exactly the best format. However, it's not a tall order to convert this into something more manageable. I haven't yet found a way to save the actual xgb.cv$evaluation_log object before the timeout. That is a very good question.
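As a rough sketch of that conversion, the log lines can be parsed into a data frame with base R. The function name parse_cv_log and the column names are my own choices (hypothetical, not part of xgboost), and the pattern assumes the metric is called "error" and the fields are separated by whitespace, as in the output above.

```r
# Parse lines such as
# "[101]\ttrain-error:0.045297+0.000396\ttest-error:0.060047+0.001849"
# into one row per logged iteration. Informational lines are skipped.
parse_cv_log <- function(lines) {
  pattern <- paste0(
    "^\\[([0-9]+)\\][[:space:]]+",
    "train-error:([0-9.]+)\\+([0-9.]+)[[:space:]]+",
    "test-error:([0-9.]+)\\+([0-9.]+)[[:space:]]*$"
  )
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) > 0]  # keep only lines that matched
  if (length(hits) == 0) return(data.frame())
  # Drop the full match (first element) and keep the five captured numbers.
  fields <- do.call(rbind, lapply(hits, function(h) as.numeric(h[-1])))
  out <- as.data.frame(fields)
  names(out) <- c("iter", "train_error_mean", "train_error_std",
                  "test_error_mean", "test_error_std")
  out$iter <- as.integer(out$iter)
  out
}

# Example input copied from the log above.
log_lines <- c(
  "[1]\ttrain-error:2.033217+0.003705\ttest-error:2.032427+0.012808",
  "Multiple eval metrics are present. Will use test_error for early stopping.",
  "[101]\ttrain-error:0.045297+0.000396\ttest-error:0.060047+0.001849"
)
eval_log <- parse_cv_log(log_lines)
```

With the sink approach, the input would be readLines("evaluationLog.txt"); with capture.output(), it is the captured character vector itself.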
I am having an issue where xgboost is not producing reproducible results after saving it to binary file.
R version: 4.1.2
XGBoost version 1.5.2.1
The methodology is as follows (logistic-regression, gbtree):
bst <- xgboost(
params = best.params
, data = dtrain
, nrounds = nrounds
, early_stopping_rounds = early_stopping_rounds
, nthread = nthread
, num_parallel_tree = num_parallel_tree
, eval_metric = eval_metric
, verbose = 2
, print_every_n = 1
)
min(predict(bst, dtest))
max(predict(bst, dtest))
xgb.save(bst, savefilemodelloc)
this produces:
min = 0.17932555079
max = 0.78802382946
now I read the bin back in
remove(bst)
bst <- xgb.load(savefilemodelloc)
min(predict(bst, dtest))
max(predict(bst, dtest))
this produces:
min = 0.49377295375
max = 0.50564271212
This is being run on the exact same data set, yet the results are nowhere near the same. I have tried rebuilding the model several times, with nearly identical results.
The model size is about 17GB.
My OS is RHEL 7
Does anyone know what is going on here?
Update 3.8.2022
I have discovered that if I load my parameters back into the model manually it works.
for example
remove(bst)
bst <- xgb.load(savefilemodelloc)
xgb.parameters(bst) <- best.params
min(predict(bst, dtest))
max(predict(bst, dtest))
this now produces:
min = 0.17932555079
max = 0.78802382946
I am not sure if this is expected behavior.
I've faced this issue before and I 'solved it' by saving the model in .rds format. My predictions weren't as far off (<1% difference before/after save/import), but I think it had to do with 'losing' specific parameters from the xgboost model object (something to do with 'early stopping'?) when the model was saved using xgb.save().
I saved the model as:
Text
xgboost::xgb.dump(model = xgb,
fname = "xgb_model_text.txt",
with_stats = TRUE, dump_format = c("text"))
Binary
xgboost::xgb.save(model = xgb, fname = "xgb.model")
R object (.rds)
saveRDS(object = xgb, file = "xgb.model.rds")
And I also saved the feature names to a file (super useful down the track, highly recommend):
# Write the model feature names to file
dt <- xgb.model.dt.tree(feature_names = NULL, model = xgb)
write.table(x = dt, file = "model_dt_tree.txt",
quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)
To load in the model to 'repeat' the predictions I used:
xgb2 <- readRDS("xgb.model.rds")
xgb2 <- xgb.Booster.complete(xgb2)
There was more to it than that, but hopefully this will solve your immediate problem.
Based on Jared's hint above, I was able to resolve my issue. The problem seems to be that a saved xgboost bin file does not keep the parameters that were used. The solution is to load the parameters back in. I tried saving the model to a JSON file, but it crashed my R session on each attempt, so it would appear that bin is my only option.
The methodology is as follows (logistic-regression, gbtree):
bst <- xgboost(
params = best.params
, data = dtrain
, nrounds = nrounds
, early_stopping_rounds = early_stopping_rounds
, nthread = nthread
, num_parallel_tree = num_parallel_tree
, eval_metric = eval_metric
, verbose = 2
, print_every_n = 1
)
min(predict(bst, dtest))
max(predict(bst, dtest))
xgb.save(bst, savefilemodelloc)
this produces:
min = 0.17932555079
max = 0.78802382946
now read the bin back in
remove(bst)
bst <- xgb.load(savefilemodelloc)
xgb.parameters(bst) <- best.params # this loads the special parameters back in
min(predict(bst, dtest))
max(predict(bst, dtest))
this now produces:
min = 0.17932555079
max = 0.78802382946
I wrote the R code below to mine with the FP-Growth algorithm:
fpgabdata <- read.csv('../Agen Biasa.csv', header = FALSE)
train <- sapply(fpgabdata, as.factor)
train <- data.frame(train, check.names = TRUE)
txns <- as(train,"transactions")
abrulesfpg = rCBA::fpgrowth(txns, support = 0.25, confidence = 0.5, maxLength = 10, consequent = NULL, verbose = TRUE, parallel = TRUE)
But I get the following error:
Error in .jcall(jPruning, "[[Ljava/lang/String;", "fpgrowth", support, :
method fpgrowth with signature (DDI)[[Ljava/lang/String; not found
These are my data:
The reason you are seeing this error is that the current implementation of the FP-growth algorithm in rCBA requires that you specify a value for the consequent (right hand side).
For example, the following should work, assuming you have sensible thresholds for support and confidence:
abrulesfpg = rCBA::fpgrowth(
txns,
support = 0.25,
confidence = 0.5,
maxLength = 10,
consequent = "SPIRULINA",
verbose = TRUE,
parallel = TRUE
)
I know the OP is likely to have discovered this by now, but I've answered this just in case anyone else encounters the same error.
I am using a train and validation dataset on an xgboost binary classification model.
params5 <- list(booster = "gbtree", objective = "binary:logistic",
                eta = 0.0001, gamma = 0.5, max_depth = 15, min_child_weight = 1,
                subsample = 0.6, colsample_bytree = 0.4, seed = 2222)
xgb_MOD5 <- xgb.train(params = params5, data = dtrain, nrounds = 4000,
                      watchlist = list(validation = dvalid, train = dtrain),
                      print_every_n = 30, early_stopping_rounds = 100,
                      maximize = F, serialize = TRUE)
It automatically picks the train error as stopping metric. This results in the model continuing to train while overfitting.
Multiple eval metrics are present. Will use train_error for early stopping.
Will train until train_error hasn't improved in 100 rounds.
How do I assign the validation error as stopping metric?
I do not use the R binding of xgboost, and the R package documentation is not specific about this. However, the Python API documentation (see the early_stopping_rounds argument) has a relevant clarification:
Requires at least one item in evals. If there’s more than one, will use the last.
Here, evals is the list of samples on which metrics are evaluated, i.e. it is analogous to your watchlist argument. So my guess is that you just need to swap the order of the items in the list you pass as that argument, so that the validation set comes last.
Thanks #abhiieor for the solution. Adding to that, here is what I observed when we use only the validation set in the watchlist:
xgb_MOD5 <- xgb.train (params = params5, data = dtrain, nrounds = 400,watchlist = list(validation = dvalid),
print_every_n =30,early_stopping_rounds = 100, maximize = F ,serialize = TRUE)
log results while it runs:
[1] validation-error:0.222037
Will train until validation_error hasn't improved in 100 rounds.
[31] validation-error:0.201712
[61] validation-error:0.201635
And if we want to see both the train error and the validation error while it runs, listing the validation set as the second item in the watchlist did it, while still using the validation error as the stopping metric.
xgb_MOD5 <- xgb.train (params = params5, data = dtrain, nrounds = 400,watchlist = list(train =dtrain,validation = dvalid),
print_every_n =30,early_stopping_rounds = 100, maximize = F ,serialize = TRUE)
[1] train-error:0.202131 validation-error:0.232341
Multiple eval metrics are present. Will use validation_error for early stopping.
Will train until validation_error hasn't improved in 100 rounds.
[31] train-error:0.174278 validation-error:0.202871
[61] train-error:0.173909 validation-error:0.202288
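Because the order of the list decides which entry drives early stopping, a small helper can make that choice explicit. This is only a sketch in base R: put_last is a hypothetical helper name, and the strings stand in for the real dtrain and dvalid objects.

```r
# Move the named entry to the end of a watchlist so that it is the one
# used for early stopping. 'put_last' is a hypothetical helper.
put_last <- function(watchlist, name) {
  stopifnot(name %in% names(watchlist))
  c(watchlist[names(watchlist) != name], watchlist[name])
}

# Placeholder strings stand in for the real xgb.DMatrix objects.
watchlist <- list(validation = "dvalid", train = "dtrain")
watchlist <- put_last(watchlist, "validation")
names(watchlist)
```

The reordered list can then be passed as the watchlist argument of xgb.train.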
Below is the code I am running with XGBoost:
data(Glass, package = "mlbench")
levels(Glass$Type) <- c(0:5) #Proper Sequence. Should start with 0
Glass$Type <- as.integer(as.character(Glass$Type))
set.seed(100)
options(scipen = 999)
library(caret)
R_index <- createDataPartition(Glass$Type, p=.7, list = FALSE)
gl_train <- Glass[R_index,]
gl_test <- Glass[-R_index,]
'%ni%' <- Negate('%in%')
library(xgboost)
library(Matrix)
#Creating the matrix for training the model
train_gl <- xgb.DMatrix(data.matrix(gl_train[ ,colnames(gl_train) %ni% 'Type']),
label = as.numeric(gl_train$Type))
test_gl <- xgb.DMatrix(data.matrix(gl_test[ ,colnames(gl_test) %ni% 'Type']))
watchlist <- list(train = gl_train, test = gl_test)
#Define the parameters and cross validate
param <- list("objective" = "multi:softmax",
"eval_metric" = "mlogloss",
"num_class" = length(unique(gl_train$Type)))
cv.nround <- 5
cv.nfold <- 3
cvMod <- xgb.cv(param = param, data = train_gl,
nfold = cv.nfold,
nrounds = cv.nround,
watchlist=watchlist)
#Build the Model
nrounds = 50
xgMod = xgboost(param = param, data = train_gl, nrounds = nrounds, watchlist = watchlist)
After executing xgMod I get the following error:
Error in check.custom.obj() :
Setting objectives in 'params' and 'obj' at the same time is not allowed
Let me know what's wrong in my code.
Any help is appreciated.
Regards,
Mohan
The problem is caused by the watchlist argument passed to xgboost.
watchlist is a parameter of xgb.train but not of xgboost, so xgboost treats it as one of the "other parameters" (...).
The following code
xgMod <- xgboost(param = param, data = train_gl, nrounds = nrounds)
works correctly
[1] train-mlogloss:1.259886
[2] train-mlogloss:0.963367
[3] train-mlogloss:0.755535
[4] train-mlogloss:0.601647
[5] train-mlogloss:0.478923
...
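One plausible reason the error mentions 'obj' rather than watchlist is R's argument matching. The toy functions below are not xgboost code: inner_train and outer_fit are hypothetical stand-ins, and the sketch assumes that xgboost() passes its own internal watchlist positionally to xgb.train(), whose next formal argument after watchlist is obj. Under that assumption, a user-supplied watchlist = ... in "..." is matched by name first, and the internal watchlist slides into obj.

```r
# Toy stand-in for xgb.train(): 'obj' follows 'watchlist' in the signature.
inner_train <- function(params, data, nrounds, watchlist = list(),
                        obj = NULL, ...) {
  if (is.null(obj)) "obj is NULL" else "obj was set"
}

# Toy stand-in for xgboost(): builds its own watchlist and passes it
# positionally, forwarding any extra arguments through '...'.
outer_fit <- function(params, data, nrounds, ...) {
  own_watchlist <- list(train = data)
  inner_train(params, data, nrounds, own_watchlist, ...)
}

# Without an extra watchlist, the internal one lands in 'watchlist'.
ok <- outer_fit(params = list(), data = 1, nrounds = 5)

# With watchlist = ... in '...', the named argument claims 'watchlist',
# so the positional internal watchlist is matched to 'obj' instead.
bad <- outer_fit(params = list(), data = 1, nrounds = 5,
                 watchlist = list(test = 2))
```

In the toy, ok is "obj is NULL" and bad is "obj was set", which mirrors how an objective could appear to be set twice even though the user never passed obj.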
I am trying to tune an xgboost model with a multiclass dependent variable in R. I am using MLR to do this, but I run into an error saying that xgboost doesn't have predict within its namespace, which I assume MLR wants to use. I have looked online and see that other people have encountered similar issues. However, I can't entirely understand the answers that have been provided (e.g. https://github.com/mlr-org/mlr/issues/935), and when I try to implement them the issue persists. My code is as follows:
# Tune parameters
#create tasks
train$result <- as.factor(train$result) # Needs to be a factor variable for makeClass to work
test$result <- as.factor(test$result)
traintask <- makeClassifTask(data = train,target = "result")
testtask <- makeClassifTask(data = test,target = "result")
lrn <- makeLearner("classif.xgboost",predict.type = "response")
# Set learner value and number of rounds etc.
lrn$par.vals <- list(
objective = "multi:softprob", # return class with maximum probability,
num_class = 3, # There are three outcome categories
eval_metric="merror",
nrounds=100L,
eta=0.1
)
# Set parameters to be tuned
params <- makeParamSet(
makeDiscreteParam("booster",values = c("gbtree","gblinear")),
makeIntegerParam("max_depth",lower = 3L,upper = 10L),
makeNumericParam("min_child_weight",lower = 1L,upper = 10L),
makeNumericParam("subsample",lower = 0.5,upper = 1),
makeNumericParam("colsample_bytree",lower = 0.5,upper = 1)
)
# Set resampling strategy
rdesc <- makeResampleDesc("CV",stratify = T,iters=5L)
# search strategy
ctrl <- makeTuneControlRandom(maxit = 10L)
#parallelStartSocket(cpus = detectCores()) # Enable parallel processing
mytune <- tuneParams(learner = lrn
,task = traintask
,resampling = rdesc
,measures = acc
,par.set = params
,control = ctrl
,show.info = T)
The specific error I get is:
Error: 'predict' is not an exported object from 'namespace:xgboost'
My package versions are:
packageVersion("xgboost")
[1] ‘0.6.4’
packageVersion("mlr")
[1] ‘2.8’
Would anyone know what I should do here?
Thanks in advance.