For a few days now, the tuning parameter has stopped working when I try to train a model with caret and xgbTree. Before that, everything always worked fine. Here is the warning I am receiving:
[16:45:38] WARNING: amalgamation/../src/learner.cc:516:
Parameters: { tune } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
My model looks like this:
set.seed(79647)
xgb <- train(dv ~ .,
data = model_train,
method = "xgbTree",
trControl = ctrl,
tune = expand.grid(max_depth = 3,
nrounds = 50,
eta = 0.4,
min_child_weight = 1,
subsample = 0.8,
gamma = 0,
colsample_bytree = 0.8,
subsample = 1),
metric = "ROC")
I can't work out what is causing this, and a Google search for the message did not lead me to anything. Does anybody have any insight into this?
You specified subsample twice in your call, and the tuning parameters should be fed into the function using tuneGrid =, not tune =. Because tune is not an argument of train, it is passed down to xgboost, which is why the warning mentions { tune }.
If you try something like the code below, it should work. I don't have your trainControl, so I use a basic one:
Grid = expand.grid(nrounds = 50,
max_depth = 2:3,
eta = 0.4,
min_child_weight = 1,
subsample = 0.8,
gamma = 0,
colsample_bytree = 0.8)
xgb <- train(Species ~ .,
data = iris,
method = "xgbTree",
trControl = trainControl(method="cv"),
tuneGrid =Grid)
Related
So far I have built many classification models using the caret package. It lets me find the best parameters for XGBoost by using expand.grid and trying all possible combinations of some parameters, as shown in the example below.
trControl = trainControl(
method = 'cv',
number = 3,
returnData=F,
classProbs = TRUE,
verboseIter = TRUE,
allowParallel = TRUE)
tuneGridXGB <- expand.grid(
nrounds=c(10, 50, 100, 200, 350, 500),
max_depth = c(2,4),
eta = c(0.005, 0.01, 0.05, 0.1),
gamma = c(0,2,4),
colsample_bytree = c(0.75),
subsample = c(0.50),
min_child_weight = c(0,2,4))
xgbmod_classif_bin <- train(
x=eg_Train_mat,
y= y_train_target,
method = "xgbTree",
metric = "auc",
reg_lambda=0.7,
scale_pos_weight=1.6,
nthread = 4,
trControl = trControl,
tuneGrid = tuneGridXGB,
verbose=T)
For the first time I have a multiclass classification problem (with 9 classes) to deal with, but I don't seem to be able to use anything like "multi:softprob" (as I would do with the xgboost package - see below).
param=list(objective="multi:softprob",
num_class=9,
eta=0.005,
max.depth=4,
min_child_weight=2,
gamma=6,
eval_metric ="merror",
nthread=4,
booster = "gbtree",
lambda=1.8,
subsample=0.8,
alpha=6,
colsample_bytree=0.5,
scale_pos_weight=1.6,
verbosity=3
)
bst=xgboost(params = param,
data = eg_Train_mat,
nrounds = 15)
Any idea of how to try many parameters using a grid, possibly using the caret package, for a multiclass classification problem?
Thanks
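A minimal sketch of one way to approach this, assuming caret and xgboost are installed. As far as I can tell from caret's xgbTree model code, when the outcome is a factor with more than two levels caret sets a multiclass objective itself, so no objective or num_class is passed here. The iris data (3 classes) and the grid values are only placeholders for the 9-class problem.

```r
library(caret)

# Small illustrative grid; all seven xgbTree tuning columns must be present.
grid <- expand.grid(nrounds = c(25, 50),
                    max_depth = c(2, 3),
                    eta = 0.3,
                    gamma = 0,
                    colsample_bytree = 0.8,
                    min_child_weight = 1,
                    subsample = 0.8)

set.seed(1)
# The outcome is a factor with 3 levels, so this is a multiclass fit.
fit <- train(Species ~ ., data = iris,
             method = "xgbTree",
             trControl = trainControl(method = "cv", number = 3,
                                      classProbs = TRUE),
             tuneGrid = grid)

fit$bestTune                                   # best row of the grid
probs <- predict(fit, iris, type = "prob")     # one probability column per class
head(probs)
```

For multiclass problems caret's default metric is Accuracy; Kappa is also reported by the default summary function.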
I'm working on tuning parameters for a neural network exercise on the Boston dataset. I have been getting a persistent error:
Error: The tuning parameter grid should have columns size, decay
The following is the setup of my caret tuning:
caret_control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
caret_grid <- expand.grid(batch_size=seq(60,120,20),
dropout=0.5,
size=100,
decay = 0,
lr=2e-6,
activation = "relu")
caret_t <- train(medv ~ ., data = chasRad,
method = "nnet",
metric="RMSE",
trControl = caret_control,
tuneGrid = caret_grid,
verbose = FALSE)
Here chasRad is a 12x506 matrix. Could anyone help me fix the error, which seems to be triggered by the expanded grid?
The error you're getting should be interpreted as:
"The tuning parameter grid should ONLY have columns size, decay".
You're passing in four additional parameters that nnet can't tune in caret. For a full list of parameters that are tunable, run modelLookup(model = 'nnet').
To tune only size and decay, replace your caret_grid with:
caret_grid <- expand.grid(size=seq(from = 1, to = 10, by = 1),
decay = seq(from = 0.1, to = 0.5, by = 0.1))
and your code will run.
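To check a grid before calling train, something like the following sketch can help. check_grid is a hypothetical helper written for this example, not part of caret; it only compares the grid's column names with the parameter names that modelLookup reports.

```r
library(caret)

# Hypothetical helper: which tuning columns are missing from the grid,
# and which columns does caret not know how to tune for this method?
check_grid <- function(grid, method) {
  wanted <- as.character(modelLookup(method)$parameter)
  list(missing = setdiff(wanted, names(grid)),
       extra   = setdiff(names(grid), wanted))
}

# The grid from the question.
bad_grid <- expand.grid(batch_size = seq(60, 120, 20),
                        dropout = 0.5,
                        size = 100,
                        decay = 0,
                        lr = 2e-6,
                        activation = "relu")

res <- check_grid(bad_grid, "nnet")
res$missing  # nothing missing: size and decay are both present
res$extra    # the four columns nnet cannot tune in caret
```

The same helper works for method = "xgbTree", where a grid must contain all seven tuning columns.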
I have the following R code that runs a simple xgboost model on a set of training and test data with the intention of predicting a binary outcome.
We start by
1) Reading in the relevant libraries.
library(xgboost)
library(readr)
library(caret)
2) Cleaning up the training and test data
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])
test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])
train.c1 = subset(train.df , outcome == 1)
train.c0 = subset(train.df , outcome == 0)
3) Running XGBoost on the properly formatted data.
train_xgb = xgb.DMatrix(data.matrix(train.df [,1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[,1:124]))
4) Running the model
model_xgb = xgboost(data = train_xgb, nrounds = 8, max_depth = 5, eta = .1, eval_metric = "logloss", objective = "binary:logistic", verbose = 5)
5) Making predictions
pred_xgb <- predict(model_xgb, newdata = test_xgb)
My question is: How can I adjust this process so that I'm just pulling in / adjusting a single 'training' data set, and getting predictions on the hold-out sets of the cross-validated file?
To run k-fold CV with xgboost, call xgb.cv and set nfold to the number of folds. To save the hold-out predictions for each resample, add the argument prediction = TRUE. For instance:
xgboostModelCV <- xgb.cv(data = dtrain,
nrounds = 1688,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
stratified = T,
scale_pos_weight = 2,
max_depth = 6,
eta = 0.01,
gamma=0,
colsample_bytree = 1 ,
min_child_weight = 1,
subsample= 0.5 ,
prediction = T)
xgboostModelCV$pred #contains predictions in the same order as in dtrain.
xgboostModelCV$folds #contains k-fold samples
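A small self-contained sketch of the same idea, using the agaricus example data that ships with xgboost so that it can be run as is. The parameter values are arbitrary. Where the out-of-fold predictions are stored seems to depend on the xgboost version: the answer above uses $pred, and the fallback to $cv_predict$pred is my assumption about newer releases, so check str(cv) on your installation.

```r
library(xgboost)

# Example data shipped with xgboost (binary outcome).
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

set.seed(1)
cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",
                           max_depth = 3,
                           eta = 0.1),
             data = dtrain,
             nrounds = 10,
             nfold = 5,
             prediction = TRUE,   # keep the hold-out predictions
             verbose = 0)

# One out-of-fold prediction per training row, in the row order of dtrain.
oof <- cv$pred
if (is.null(oof)) oof <- cv$cv_predict$pred  # assumption: newer releases nest it here

oof_df <- data.frame(label = agaricus.train$label, pred = oof)
head(oof_df)
```

Each row's prediction comes from the fold model that did not see that row, so oof_df can be scored like a hold-out set.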
Here's a decent function to pick hyperparams
function(train, seed){
require(xgboost)
ntrees=2000
searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
colsample_bytree = c(0.6, 0.8, 1),
gamma=c(0, 1, 2),
eta=c(0.01, 0.03),
max_depth=c(4,6,8,10))
aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentGamma <- parameterList[["gamma"]]
currentEta =parameterList[["eta"]]
currentMaxDepth =parameterList[["max_depth"]]
set.seed(seed)
xgboostModelCV <- xgb.cv(data = train,
nrounds = ntrees,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
early_stopping_rounds = 200,
stratified = T,
scale_pos_weight=sum(all_data_nobad[index_no_bad,1]==0)/sum(all_data_nobad[index_no_bad,1]==1),
max_depth = currentMaxDepth,
eta = currentEta,
gamma=currentGamma,
colsample_bytree = currentColsampleRate,
min_child_weight = 1,
subsample= currentSubsampleRate)
xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
#Save the test AUC (mean and std) at the best iteration
auc=xvalidationScores[xvalidationScores$iter==xgboostModelCV$best_iteration,c(1,4,5)]
auc=cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
names(auc)=c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
print(auc)
return(auc)
})
return(aucErrorsHyperparameters)
}
You can change the grid values and the parameters in the grid, as well as the loss/evaluation metric. This is similar to what caret's grid search provides, but caret does not let you put alpha, lambda, colsample_bylevel, num_parallel_tree and similar hyperparameters in the grid unless you define a custom model function, which I found cumbersome. Caret does have the advantage of automatic preprocessing, automatic up/down sampling within CV, and so on.
Setting the seed outside the xgb.cv call will pick the same folds for CV but not the same trees at each round, so you will end up with a different model. Even if you set the seed right before the xgb.cv call, as in the function above, there is no guarantee that you will end up with the same model, but there is a much higher chance (it depends on threads, type of model and so on). I for one like the uncertainty and found it to have little impact on the result.
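The function above returns a list with one small data frame per grid row. A sketch of how such a list can be combined and the best row picked, using only base R. toy_score is a made-up stand-in for the xgb.cv call so that the example runs without data; its values are not real AUCs. It also loops over row indices with lapply instead of apply, which avoids the data frame being coerced to a matrix.

```r
# Reduced grid for illustration.
grid <- expand.grid(eta = c(0.01, 0.03),
                    max_depth = c(4, 6, 8))

# Made-up stand-in for the cross-validated score of one parameter set.
toy_score <- function(eta, max_depth) {
  -(eta - 0.03)^2 - (max_depth - 6)^2
}

# One one-row data frame per parameter combination.
results <- lapply(seq_len(nrow(grid)), function(i) {
  p <- grid[i, ]
  data.frame(eta = p$eta,
             max_depth = p$max_depth,
             score = toy_score(p$eta, p$max_depth))
})

# Stack the rows and pick the combination with the highest score.
results_df <- do.call(rbind, results)
best <- results_df[which.max(results_df$score), ]
best
```

With real results, replace score by the test.auc.mean column returned by the function above.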
You can use xgb.cv and set prediction = TRUE.
I have the code below. Let's assume that optimization stopped after 600 rounds and the best round was 450. Which model will be used for prediction: the one after the 450th round or the one after the 600th?
watchlist <- list(val=dval,train=dtrain)
param <- list( objective = "binary:logistic",
booster = "gbtree",
eval_metric = "auc",
eta = 0.02,
max_depth = 7,
subsample = 0.6,
colsample_bytree = 0.7
)
clf <- xgb.train( params = param,
data = dtrain,
nrounds = 2000,
verbose = 0,
early.stop.round = 150,
watchlist = watchlist,
maximize = TRUE
)
preds <- predict(clf, test)
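After some research I found the answer myself. predict will use the model after the 600th round. If you want to use the model with the best result, use preds <- predict(clf, test, ntreelimit=clf$bestInd). Note that these names belong to the xgboost version I was using; more recent releases spell the argument early_stopping_rounds and store the best round as clf$best_iteration.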
I am trying to use XGBoost for prediction. As my dependent variable is continuous, I am doing regression with XGBoost, but most of the references available on various portals are for classification. I know that by using
objective = "reg:linear"
we can do regression, but I still need some clarity on the other parameters as well. It would be a great help if somebody could provide an R snippet for it.
xgboost(data = X,
booster = "gbtree",
objective = "reg:linear",
max.depth = 5,
eta = 0.5,
nthread = 2,
nround = 2,
min_child_weight = 1,
subsample = 0.5,
colsample_bytree = 1,
num_parallel_tree = 1)
These are all the parameters you can play around with while using tree boosters. For linear booster you can use the following parameters to play with...
xgboost(data = X,
booster = "gblinear",
objective = "reg:linear",
max.depth = 5,
nround = 2,
lambda = 0,
lambda_bias = 0,
alpha = 0)
You can refer to the description of xgb.train() in the xgboost CRAN documentation for the detailed meaning of these parameters.
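For a complete regression run, here is a minimal sketch on the built-in mtcars data, with mpg as the continuous outcome. It assumes a reasonably recent xgboost, where the squared-error objective is called "reg:squarederror" (the newer name for "reg:linear"). The parameter values are arbitrary starting points, not recommendations.

```r
library(xgboost)

# Predict mpg from the other columns of the built-in mtcars data.
X <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
dtrain <- xgb.DMatrix(X, label = y)

params <- list(booster = "gbtree",
               objective = "reg:squarederror",  # squared-error regression
               eval_metric = "rmse",
               max_depth = 3,
               eta = 0.1,
               min_child_weight = 1,
               subsample = 0.8,
               colsample_bytree = 1)

set.seed(1)
bst <- xgb.train(params = params, data = dtrain, nrounds = 50, verbose = 0)

# Predictions are on the scale of the outcome (here: miles per gallon).
pred <- predict(bst, dtrain)
sqrt(mean((pred - y)^2))  # training RMSE; use held-out data for an honest estimate
```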
The best description of the parameters that I have found is at
https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
There are many examples of using XGBoost in R available in the Kaggle scripts repository. For example:
https://www.kaggle.com/michaelpawlus/springleaf-marketing-response/xgboost-example-0-76178/code