Why doesn't the early.stop.round argument in xgboost work?

I am trying to use the early.stop.round argument of the xgb.cv function in the xgboost library, but I get an error. When I leave early.stop.round unspecified, the function runs without any problem. What am I doing wrong?
Here is my example code:
library(xgboost)
train = matrix(as.numeric(1:100), 20, 5)
Y = rep(c(0, 1), 10)
dtrain = xgb.DMatrix(train, label = Y)
# cross-validation with early.stop.round = 5 gives an error
CV = xgb.cv(data = dtrain, nround = 200, nfold = 2, metrics = list("auc"),
            objective = "binary:logistic", early.stop.round = 5)
# cross-validation without early.stop.round works
CV = xgb.cv(data = dtrain, nround = 200, nfold = 2, metrics = list("auc"),
            objective = "binary:logistic")
I am using xgboost 0.4-2.

It looks like something goes wrong when the metrics parameter and early stopping are used simultaneously. Remove metrics and pass eval_metric = "auc" together with early.stop.round instead.
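For example, the suggested fix might look like this (a sketch based on the original call; only the metric arguments change):
CV = xgb.cv(data = dtrain, nround = 200, nfold = 2,
            eval_metric = "auc",  # replaces metrics = list("auc")
            objective = "binary:logistic", early.stop.round = 5)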

Related

R XGBoost early stopping by min_delta

Below is my code, in which I am trying to train an XGBoost model in R that stops early after a given number of rounds (early_stopping_rounds) without improvement.
watchlist <- list(train = dtrain, test = dtest)
param <- list(
  objective = "binary:logistic",
  eta = 0.3,
  max_depth = 8,
  eval_metric = "logloss"
)
xgb.train(params = param, data = dtrain, nrounds = 1000, watchlist = watchlist,
          early_stopping_rounds = 3)
However, instead of fixing the number of rounds, I would like to pass a min_delta value, so that early stopping kicks in when the improvement between rounds falls below a given tolerance.
Others (here and here) have asked this for Python, and relatively recent changes have implemented the option there.
But how do I work this out in R? Is there something like it?
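I am not aware of a min_delta argument in the R package, but one workaround is to apply the tolerance yourself after training, using the evaluation log that xgb.train keeps when a watchlist is supplied. A minimal sketch, assuming the result of the xgb.train call above is stored in bst and that your xgboost version exposes bst$evaluation_log:
# evaluation_log is a data.table with one row per boosting round
log <- bst$evaluation_log
min_delta <- 1e-4  # assumed tolerance
# improvement in test logloss over the previous round (positive = better)
improvement <- -diff(log$test_logloss)
# first round whose improvement falls below the tolerance (NA if none)
best_iter <- which(improvement < min_delta)[1]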

Multi-class classification with LightGBM

I am using the latest release of LightGBM to solve a multi-class classification problem. When I switch the objective to "multiclass", this error occurs:
Error in data$update_params(params) :
[LightGBM] [Fatal] Number of classes should be specified and greater than 1 for multiclass training
Here is a reproducible example that shows my approach:
catnames <- names(purrr::keep(train_x, is.factor))
dtrain <- lgb.Dataset(as.matrix(train_x), label = train_y,
                      categorical_feature = catnames)
data_file <- tempfile(fileext = ".data")
lgb.Dataset.save(dtrain, data_file)
dtrain <- lgb.Dataset(data_file)
lgb.Dataset.construct(dtrain)
model <- lgb.train(data = dtrain,
                   objective = "multiclass",
                   alpha = 0.1,
                   nrounds = 1000,
                   learning_rate = 0.1)
I tried saving my target (train_y) as a factor, but nothing changed.
When using the multi-class objective in LightGBM, you need to pass an additional parameter that tells the learner the number of classes to predict.
So it should probably look more like this:
model <- lgb.train(data = dtrain,
                   objective = "multiclass",
                   num_classes = INSERT NUMBER OF TARGET CLASSES HERE,
                   alpha = 0.1,
                   nrounds = 1000,
                   learning_rate = 0.1)
My experience is more with the Python API, so if this does not work, it may be that you need to pass the num_class parameter inside the params list argument of lgb.train.
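A sketch of that alternative, assuming for illustration that there are three target classes:
params <- list(objective = "multiclass",
               num_class = 3,       # assumed number of target classes
               learning_rate = 0.1)
model <- lgb.train(params = params, data = dtrain, nrounds = 1000)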

How can I choose the number of nodes in rpart?

In the tree package, we can use the following code to choose the number of terminal nodes:
tree.model = tree(...)
tree.prune = prune.tree(tree.model, best = 20)
This code returns a new tree with 20 terminal nodes.
In the rpart package, the following code can be used for this:
rpart.model = rpart(...)
rpart.prune = prune.rpart(rpart.model, cp = ?)
Here cp is the cost-complexity parameter, but I want something like the best argument of prune.tree.
The rpart package doesn't have an argument similar to best in the tree package. The tree package was developed to cover functionality that rpart was missing.
To choose an appropriate number of nodes, you can tune other parameters in rpart. For example:
minsplit <- 20  # define first; referring to it inside the call would otherwise fail
prune.control <- rpart.control(minsplit = minsplit, minbucket = round(minsplit / 3), xval = 10)
rpart(formula, data, method, control = prune.control)
Then evaluate the cross-validated error against cp to choose a cp value. You can also tune cp automatically using the caret package. For example:
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
model <- train(x = train_data,
               y = labels,
               method = "rpart",
               trControl = ctrl)
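To evaluate the cross-validated error against cp directly, rpart's built-in helpers should be enough. A minimal sketch, assuming rpart.model was fitted with xval > 1:
printcp(rpart.model)   # table of CP, number of splits, and cross-validated error (xerror)
plotcp(rpart.model)    # plot xerror against cp
# prune at the cp with the lowest cross-validated error
best.cp <- rpart.model$cptable[which.min(rpart.model$cptable[, "xerror"]), "CP"]
rpart.prune <- prune.rpart(rpart.model, cp = best.cp)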

xgboost in R: xgb.importance throws an error

I am using the xgboost package from CRAN for the first time.
I create a model as follows:
bst <- xgb.train(data = dtrain, booster = "gblinear",
                 objective = "reg:linear", max.depth = 5, nround = 2,
                 watchlist = watchlist)
importance_matrix <- xgb.importance(model = bst)
When I call xgb.importance I get an error:
Error in readLines(filename_dump) : 'con' is not a connection
Any ideas why?
xgb.importance works fine for booster = "gbtree".
I did not find this in the documentation, but it looks like xgb.importance is valid for the tree booster only.
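If you need something comparable for the linear booster, one workaround is to inspect its fitted coefficients via the text dump. A sketch, using the bst model from above:
# for gblinear the dump contains the bias and one weight per feature,
# rather than split-gain statistics
dump <- xgb.dump(bst)
print(dump)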

Fatal error with train() in caret on Windows 7, R 3.0.2, caret 6.0-21

I am trying to use train() in caret to fit a classification model, but I am hitting some kind of unhandled exception and my R session crashes before any error information appears in the R console.
Windows error:
R for Windows terminal front-end has stopped working
I am running Windows 7, R 3.0.2, and caret 6.0-21. I have tried this on both the 32- and 64-bit versions of R, in RStudio and directly in the R console, and get the same result each time.
Here is my call to train:
library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")
data <- data.frame(predictors, diagnosis)
tuneGrid <- expand.grid(interaction.depth = 1:2, n.trees = 100, shrinkage = 0.1)
trainControl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
gbmFit <- train(diagnosis ~ ., data = data, method = "gbm", trControl = trainControl, tuneGrid = tuneGrid)
There are no more errors using this parameter grid instead:
tuneGrid <- expand.grid(interaction.depth = 1, n.trees = 100:101, shrinkage = 0.1)
However, I am still getting all NaNs in the ValidDeviance column. Is this normal?
Note: my original problem is resolved, and this is a continuation from the comments section. Formatting blocks of code in comments is unreadable, so I'm posting it up here. This is no longer a question about caret, but about gbm.
I am still having issues, however, with direct calls to gbm using a single predictor when cv.folds is specified. Here is the code:
library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")
diagnosis <- as.numeric(diagnosis)
diagnosis[diagnosis == 1] <- 0
diagnosis[diagnosis == 2] <- 1
data <- data.frame(diagnosis, predictors[, 1])
gbmFit <- gbm(diagnosis ~ ., data = data, cv.folds = 5)
Again, this works without specifying cv.folds, but with it, it returns an error:
Error in checkForRemoteErrors(val) : 5 nodes produced errors; first error: incorrect number of dimensions
It is a bug that occurs when method = 'gbm' is used with a single model (i.e. nrow(tuneGrid) == 1). I'm about to release a new version, so I will fix this in that version.
One side note... it looks like you want to do classification. In that case, y should be a factor (and you shouldn't use only integers as the classes); otherwise it will be doing regression. These changes will work for now:
y <- factor(paste("Class", y, sep = ""))
and
tuneGrid <- expand.grid(interaction.depth = 1,
                        n.trees = 100:101,
                        shrinkage = 0.1)
Thanks,
Max
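For reference, applying both of these suggestions to the earlier repro might look like this (a sketch; it reuses the diagnosis vector and the trainControl object defined above):
diagnosis <- factor(paste("Class", diagnosis, sep = ""))
data <- data.frame(diagnosis, predictors)
gbmFit <- train(diagnosis ~ ., data = data, method = "gbm",
                trControl = trainControl, tuneGrid = tuneGrid)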
