I'm creating a simple ensemble of two models, one xgboost and one mxnet. The data frame is A4n.df, with the classification variable at A4n.df[,1]. Both models run fine on their own and reach believable accuracy. All data is normalized to 0-1 and shuffled, and the class variable is converted to a factor (for caret). I have already run a grid search for the best hyperparameters, but need to include a grid for caretEnsemble.
# training grid for xgboost
xgb_grid_A4 <- expand.grid(
  nrounds = 1200,
  eta = 0.01,
  max_depth = 20,
  gamma = 1,
  colsample_bytree = 0.6,
  min_child_weight = 2,
  subsample = 0.8)
# training grid for mxnet
mxnet_grid_A4 <- expand.grid(
  layer1 = 12,
  layer2 = 2,
  layer3 = 0,
  learningrate = 0.001,
  dropout = 0,
  beta1 = 0.9,
  beta2 = 0.999,
  activation = 'relu')
yE <- A4n.df[, 1]
xE <- data.matrix(A4n.df[, -1])
yEf <- ifelse(yE == 0, "no", "yes")
yEf <- factor(yEf)

# yEf must be defined before it is used in createResample() below
Ensemble_control_A4 <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE,
  returnData = TRUE,
  returnResamp = "all",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  allowParallel = TRUE,
  sampling = "up",
  index = createResample(yEf, 20))
Ensemble_list_A4 <- caretList(
  x = xE,
  y = yEf,
  trControl = Ensemble_control_A4,
  metric = "ROC",
  methodList = c("glm", "rpart"),
  tuneList = list(
    xgbA4 = caretModelSpec(method = "xgbTree", tuneGrid = xgb_grid_A4),
    mxA4 = caretModelSpec(method = "mxnetAdam", tuneGrid = mxnet_grid_A4)))
XGBoost seems to train fine:
+ Resample01: eta=0.01, max_depth=20, gamma=1, colsample_bytree=0.6, min_child_weight=2, subsample=0.8, nrounds=1200
....
+ Resample20: eta=0.01, max_depth=20, gamma=1, colsample_bytree=0.6, min_child_weight=2, subsample=0.8, nrounds=1200
- Resample20: eta=0.01, max_depth=20, gamma=1, colsample_bytree=0.6, min_child_weight=2, subsample=0.8, nrounds=1200
Aggregating results
Selecting tuning parameters
Fitting nrounds = 1200, max_depth = 20, eta = 0.01, gamma = 1, colsample_bytree = 0.6, min_child_weight = 2, subsample = 0.8 on full training set
However, mxnet seems to run for only 10 rounds, when one or two thousand would make more sense, and some parameters appear to be missing:
+ Resample01: layer1=12, layer2=2, layer3=0, learningrate=0.001, dropout=0, beta1=0.9, beta2=0.999, activation=relu
Start training with 1 devices
[1] Train-accuracy=0.487651209677419
[2] Train-accuracy=0.624751984126984
[3] Train-accuracy=0.599082341269841
[4] Train-accuracy=0.651909722222222
[5] Train-accuracy=0.662202380952381
[6] Train-accuracy=0.671006944444444
[7] Train-accuracy=0.676463293650794
[8] Train-accuracy=0.683407738095238
[9] Train-accuracy=0.691964285714286
[10] Train-accuracy=0.698660714285714
- Resample01: layer1=12, layer2=2, layer3=0, learningrate=0.001, dropout=0, beta1=0.9, beta2=0.999, activation=relu
+ Resample01: parameter=none
- Resample01: parameter=none
+ Resample02: parameter=none
Aggregating results
Selecting tuning parameters
Fitting cp = 0.0243 on full training set
There were 40 warnings (use warnings() to see them)
Warnings (1-40):
1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == ... :
prediction from a rank-deficient fit may be misleading
I expect mxnet to train for thousands of rounds, and the training accuracy to end up like the pre-ensemble model's, 60-70%.
*On second thought, some of the 20 mxnet runs reach 60-70%, but it seems inconsistent. Perhaps it is functioning normally?
There's a note in the caret documentation that num.round needs to be set by the user outside the tuneGrid: http://topepo.github.io/caret/train-models-by-tag.html
Ensemble_list_A2 <- caretList(
  x = xE,
  y = yEf,
  trControl = Ensemble_control_A2,
  metric = "ROC",
  methodList = c("glm", "rpart", "bayesglm"),
  tuneList = list(
    xgbA2 = caretModelSpec(method = "xgbTree", tuneGrid = xgb_grid_A2),
    mxA2 = caretModelSpec(method = "mxnetAdam", tuneGrid = mxnet_grid_A2,
                          num.round = 1500, ctx = mx.gpu()),
    svmA2 = caretModelSpec(method = "svmLinear2", tuneGrid = svm_grid_A2),
    rfA2 = caretModelSpec(method = "rf", tuneGrid = rf_grid_A2)))
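For reference, the ensembling step itself isn't shown above; a minimal sketch of combining the caretList with caretEnsemble, following the caretEnsemble vignette (the trControl values here are illustrative, not my actual settings):

greedy_ensemble <- caretEnsemble(
  Ensemble_list_A2,
  metric = "ROC",
  trControl = trainControl(
    number = 5,
    summaryFunction = twoClassSummary,
    classProbs = TRUE))
summary(greedy_ensemble)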
Related
So far I have built many classification models using the caret package. This library allows me to find the best parameters for XGBoost by using expand.grid and trying all the possible combinations of some parameters, as shown in the example below.
trControl <- trainControl(
  method = 'cv',
  number = 3,
  returnData = FALSE,
  classProbs = TRUE,
  verboseIter = TRUE,
  allowParallel = TRUE)
tuneGridXGB <- expand.grid(
  nrounds = c(10, 50, 100, 200, 350, 500),
  max_depth = c(2, 4),
  eta = c(0.005, 0.01, 0.05, 0.1),
  gamma = c(0, 2, 4),
  colsample_bytree = c(0.75),
  subsample = c(0.50),
  min_child_weight = c(0, 2, 4))
xgbmod_classif_bin <- train(
  x = eg_Train_mat,
  y = y_train_target,
  method = "xgbTree",
  metric = "auc",
  reg_lambda = 0.7,
  scale_pos_weight = 1.6,
  nthread = 4,
  trControl = trControl,
  tuneGrid = tuneGridXGB,
  verbose = TRUE)
For the first time I have a multiclass classification problem (with 9 classes) to deal with, but I don't seem to be able to use anything like "multi:softprob" (as I would do with the xgboost package - see below).
param <- list(objective = "multi:softprob",
              num_class = 9,
              eta = 0.005,
              max.depth = 4,
              min_child_weight = 2,
              gamma = 6,
              eval_metric = "merror",
              nthread = 4,
              booster = "gbtree",
              lambda = 1.8,
              subsample = 0.8,
              alpha = 6,
              colsample_bytree = 0.5,
              scale_pos_weight = 1.6,
              verbosity = 3)

bst <- xgboost(params = param,
               data = eg_Train_mat,
               nrounds = 15)
Any idea of how to try many parameters using a grid, possibly using the caret package, for a multiclass classification problem?
Thanks
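Not a full answer, but a sketch of one approach: as far as I can tell from caret's source, "xgbTree" switches to multi:softprob internally (with num_class set from the number of factor levels) whenever the outcome factor has more than two levels, so the same kind of tuneGrid works unchanged. The main adjustments are the summary function and the metric. Assuming y_train_target is a factor with 9 validly-named levels:

trControl <- trainControl(
  method = 'cv',
  number = 3,
  classProbs = TRUE,                   # needed for per-class probabilities
  summaryFunction = multiClassSummary, # multiclass analogue of twoClassSummary
  verboseIter = TRUE,
  allowParallel = TRUE)

xgbmod_classif_multi <- train(
  x = eg_Train_mat,
  y = y_train_target,   # factor with 9 levels
  method = "xgbTree",
  metric = "logLoss",   # one of the metrics multiClassSummary reports
  trControl = trControl,
  tuneGrid = tuneGridXGB)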
I'm working on tuning parameters for a neural network exercise on the Boston dataset. I have been getting a persistent error:
Error: The tuning parameter grid should have columns size, decay
The following is the set up of my Caret tuning:
caret_control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3)

caret_grid <- expand.grid(batch_size = seq(60, 120, 20),
                          dropout = 0.5,
                          size = 100,
                          decay = 0,
                          lr = 2e-6,
                          activation = "relu")
caret_t <- train(medv ~ ., data = chasRad,
                 method = "nnet",
                 metric = "RMSE",
                 trControl = caret_control,
                 tuneGrid = caret_grid,
                 verbose = FALSE)
Here chasRad is a 12x506 matrix. Could anyone help fix the error that seems to be triggered by the expanded grid?
The error you're getting should be interpreted as:
"The tuning parameter grid should ONLY have columns size, decay".
You're passing in four additional parameters that nnet can't tune in caret. For a full list of parameters that are tunable, run modelLookup(model = 'nnet').
To tune only size and decay, replace your caret_grid with:
caret_grid <- expand.grid(size = seq(from = 1, to = 10, by = 1),
                          decay = seq(from = 0.1, to = 0.5, by = 0.1))
and your code will run.
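If in doubt, modelLookup() is also the quickest way to check what a method can tune before building a grid; for nnet it returns just the two parameters (output reproduced from memory, so treat the exact labels as approximate):

modelLookup(model = 'nnet')
#   model parameter         label forReg forClass probModel
# 1  nnet      size #Hidden Units   TRUE     TRUE      TRUE
# 2  nnet     decay  Weight Decay   TRUE     TRUE      TRUE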
I am using caret for modeling with xgboost.
1- However, I get the following error:
"Error: The tuning parameter grid should have columns nrounds,
max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample"
The code:
library(caret)
library(doParallel)
library(dplyr)
library(pROC)
library(xgboost)
## Create train/test indexes
## preserve class indices
set.seed(42)
my_folds <- createFolds(train_churn$churn, k = 10)
# Compare class distribution
i <- my_folds$Fold1
table(train_churn$churn[i]) / length(i)
my_control <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = my_folds
)
my_grid <- expand.grid(nrounds = 500,
                       max_depth = 7,
                       eta = 0.1,
                       gammma = 1,
                       colsample_bytree = 1,
                       min_child_weight = 100,
                       subsample = 1)
set.seed(42)
model_xgb <- train(
  class ~ ., data = train_churn,
  metric = "ROC",
  method = "xgbTree",
  trControl = my_control,
  tuneGrid = my_grid)
2- I also want to get a prediction made by averaging the predictions from the model fitted on each fold.
I know it's a tad late, but check your spelling of gamma in the grid of tuning parameters. You misspelled it as gammma (with triple m's).
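With the spelling fixed the grid is accepted. And since savePredictions = TRUE is already set in my_control, the held-out predictions for every fold end up in model_xgb$pred after training, which covers the second part of the question: average them per observation. A sketch (the "yes" probability column is an assumption - it is named after whatever your positive class level is):

my_grid <- expand.grid(nrounds = 500,
                       max_depth = 7,
                       eta = 0.1,
                       gamma = 1,
                       colsample_bytree = 1,
                       min_child_weight = 100,
                       subsample = 1)

# After train(): model_xgb$pred has one row per held-out prediction,
# with rowIndex pointing back to the row of train_churn it belongs to.
library(dplyr)
avg_preds <- model_xgb$pred %>%
  group_by(rowIndex) %>%
  summarise(mean_prob = mean(yes))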
I have the following R code that runs a simple xgboost model on a set of training and test data with the intention of predicting a binary outcome.
We start by
1) Reading in the relevant libraries.
library(xgboost)
library(readr)
library(caret)
2) Cleaning up the training and test data
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])
test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])
train.c1 = subset(train.df , outcome == 1)
train.c0 = subset(train.df , outcome == 0)
3) Running XGBoost on the properly formatted data.
train_xgb = xgb.DMatrix(data.matrix(train.df[, 1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[, 1:124]))
4) Running the model
model_xgb = xgboost(data = train_xgb, nrounds = 8, max_depth = 5, eta = .1, eval_metric = "logloss", objective = "binary:logistic", verbose = 5)
5) Making predictions
pred_xgb <- predict(model_xgb, newdata = test_xgb)
My question is: How can I adjust this process so that I'm just pulling in / adjusting a single 'training' data set, and getting predictions on the hold-out sets of the cross-validated file?
To specify k-fold CV in the xgboost call, one needs to call xgb.cv with an integer nfold argument; to save the predictions for each resample, use the prediction = TRUE argument. For instance:
xgboostModelCV <- xgb.cv(data = dtrain,
                         nrounds = 1688,
                         nfold = 5,
                         objective = "binary:logistic",
                         eval_metric = "auc",
                         metrics = "auc",
                         verbose = 1,
                         print_every_n = 50,
                         stratified = TRUE,
                         scale_pos_weight = 2,
                         max_depth = 6,
                         eta = 0.01,
                         gamma = 0,
                         colsample_bytree = 1,
                         min_child_weight = 1,
                         subsample = 0.5,
                         prediction = TRUE)
xgboostModelCV$pred  # contains predictions in the same order as in dtrain
xgboostModelCV$folds # contains the k-fold samples
Here's a decent function to pick hyperparameters:
pick_hyperparams <- function(train, seed) {
  require(xgboost)
  ntrees <- 2000
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma = c(0, 1, 2),
                                  eta = c(0.01, 0.03),
                                  max_depth = c(4, 6, 8, 10))
  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList) {
    # Extract the parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta <- parameterList[["eta"]]
    currentMaxDepth <- parameterList[["max_depth"]]
    set.seed(seed)
    xgboostModelCV <- xgb.cv(data = train,
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric = "auc",
                             metrics = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = TRUE,
                             # ratio of negatives to positives in my data
                             scale_pos_weight = sum(all_data_nobad[index_no_bad, 1] == 0) /
                                                sum(all_data_nobad[index_no_bad, 1] == 1),
                             max_depth = currentMaxDepth,
                             eta = currentEta,
                             gamma = currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample = currentSubsampleRate)
    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    # Keep the test AUC at the best (early-stopped) iteration
    auc <- xvalidationScores[xvalidationScores$iter == xgboostModelCV$best_iteration, c(1, 4, 5)]
    auc <- cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma,
                 currentEta, currentMaxDepth)
    names(auc) <- c("iter", "test.auc.mean", "test.auc.std", "subsample",
                    "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}
You can change the grid values and the parameters in the grid, as well as the loss/evaluation metric. It is similar to the grid search caret provides, but caret does not offer a way to include alpha, lambda, colsample_bylevel, num_parallel_tree, and other hyperparameters in the grid search, short of defining a custom modeling function, which I found cumbersome. Caret has the advantage of automatic preprocessing, automatic up/down sampling within CV, etc.
Setting the seed outside the xgb.cv call will pick the same folds for CV, but not the same trees at each round, so you will end up with a different model. Even if you set the seed inside the xgb.cv call, there is no guarantee you will end up with the same model, but there's a much higher chance (it depends on threads, type of model, etc. - I for one like the uncertainty, and found it to have little impact on the result).
You can use xgb.cv and set prediction = TRUE.
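For instance, applied to the matrices from the question, the out-of-fold predictions line up row-for-row with the training labels, so a hold-out AUC can be computed directly (pROC is an extra dependency here, not loaded in the question):

library(pROC)
cv <- xgb.cv(data = train_xgb, nrounds = 8, nfold = 5,
             max_depth = 5, eta = 0.1,
             objective = "binary:logistic", eval_metric = "logloss",
             prediction = TRUE, verbose = 0)
auc(roc(train.raw[, "outcome"], cv$pred))  # AUC on the CV hold-out predictions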
I'm running a gbm model for a classification problem. Below is my code & output.
library(gbm)
library(caret)
set.seed(123)
train <- read.csv("train.csv")
gbm_model <- gbm(DV ~ .,
                 data = train,
                 distribution = "bernoulli",
                 n.trees = 9,
                 interaction.depth = 9,
                 n.minobsinnode = 1,
                 shrinkage = 0.2,
                 bag.fraction = 0.9)
Output of print(gbm_model):
gbm(formula = DV ~ ., distribution = "bernoulli",
    data = train, n.trees = 9, interaction.depth = 9, n.minobsinnode = 1,
    shrinkage = 0.2, bag.fraction = 0.9)
A gradient boosted model with bernoulli loss function.
9 iterations were performed.
There were 100 predictors of which 67 had non-zero influence.
When I try to print the top variables, it throws an error:
varImp(gbm_model)
Error in 1:n.trees : argument of length 0
Any suggestion on how to rectify this error?
I got the error rectified after researching the caret package a bit more. First I needed to train the model through caret and then use varImp().
gbm1 <- train(as.factor(DV) ~ ., data = train, method = "gbm",
              distribution = "bernoulli",
              trControl = trainControl(number = 200),
              tuneGrid = expand.grid(.interaction.depth = 9,
                                     .n.trees = 9,
                                     .shrinkage = 0.1),
              n.minobsinnode = 1,
              bag.fraction = 0.9)
Then run
plot(varImp(gbm1), top = 20)
to get the top 20 variables.