XGBoost hyperparameter tuning in R for binary classification

I am new to R and trying to do hyperparameter tuning for XGBoost (binary classification). However, I am getting an error and would appreciate it if someone could help me:
Error in as.matrix(cv.res)[, 3] : subscript out of bounds In addition: Warning message: 'early.stop.round' is deprecated. Use 'early_stopping_rounds' instead. See help("Deprecated") and help("xgboost-deprecated").
Please find the code snippet below.
I would also appreciate it if someone could suggest an alternative to this approach in R.
X_Train <- as(X_train, "dgCMatrix")

GS_LogLoss = data.frame("Rounds" = numeric(),
                        "Depth" = numeric(),
                        "r_sample" = numeric(),
                        "c_sample" = numeric(),
                        "minLogLoss" = numeric(),
                        "best_round" = numeric())

for (rounds in seq(50, 100, 25)) {
  for (depth in c(4, 6, 8, 10)) {
    for (r_sample in c(0.5, 0.75, 1)) {
      for (c_sample in c(0.4, 0.6, 0.8, 1)) {
        for (imb_scale_pos_weight in c(5, 10, 15, 20, 25)) {
          for (wt_gamma in c(5, 7, 10)) {
            for (wt_max_delta_step in c(5, 7, 10)) {
              for (wt_min_child_weight in c(5, 7, 10, 15)) {
                set.seed(1024)
                eta_val = 2 / rounds
                cv.res = xgb.cv(data = X_Train, nfold = 2, label = y_train,
                                nrounds = rounds,
                                eta = eta_val,
                                max_depth = depth,
                                subsample = r_sample,
                                colsample_bytree = c_sample,
                                early.stop.round = 0.5 * rounds,
                                scale_pos_weight = imb_scale_pos_weight,
                                max_delta_step = wt_max_delta_step,
                                gamma = wt_gamma,
                                objective = 'binary:logistic',
                                eval_metric = 'auc',
                                verbose = FALSE)
                print(paste(rounds, depth, r_sample, c_sample, min(as.matrix(cv.res)[, 3])))
                GS_LogLoss[nrow(GS_LogLoss) + 1, ] = c(rounds,
                                                       depth,
                                                       r_sample,
                                                       c_sample,
                                                       min(as.matrix(cv.res)[, 3]),
                                                       which.min(as.matrix(cv.res)[, 3]))
              }
            }
          }
        }
      }
    }
  }
}
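A note on the error itself: in recent xgboost versions, xgb.cv returns a list rather than a matrix, and the per-round metrics live in cv.res$evaluation_log, so as.matrix(cv.res)[, 3] goes out of bounds. Below is a minimal sketch of the inner call using the current argument names; it assumes eval_metric = "auc", so the log columns follow the test_auc_mean naming.

# Sketch of the inner xgb.cv call with the current argument names
# (assumption: a recent xgboost where xgb.cv returns a list).
cv.res <- xgb.cv(data = X_Train, label = y_train, nfold = 2,
                 nrounds = rounds,
                 eta = eta_val,
                 max_depth = depth,
                 subsample = r_sample,
                 colsample_bytree = c_sample,
                 early_stopping_rounds = round(0.5 * rounds),  # replaces early.stop.round
                 scale_pos_weight = imb_scale_pos_weight,
                 max_delta_step = wt_max_delta_step,
                 gamma = wt_gamma,
                 objective = "binary:logistic",
                 eval_metric = "auc",
                 verbose = FALSE)

# Per-round metrics live in the evaluation log (a data.table), not in cv.res itself,
# so index the named column instead of as.matrix(cv.res)[, 3].
log <- cv.res$evaluation_log
best_auc   <- max(log$test_auc_mean)        # AUC is maximized, not minimized
best_round <- which.max(log$test_auc_mean)  # or use cv.res$best_iteration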

To do your hyperparameter selection, you could use the metapackage tidymodels, especially the packages parsnip, rsample, yardstick and tune.
A workflow like this would work:
library(tidyverse)
library(tidymodels)

# Specify the model and the parameters to tune (parsnip)
model <-
  boost_tree(tree_depth = tune(), mtry = tune()) %>%
  set_mode("classification") %>%
  set_engine("xgboost")

# Specify the resampling method (rsample)
splits <- vfold_cv(X_train, v = 2)

# Specify the metrics to optimize (yardstick)
metrics <- metric_set(roc_auc)

# Specify the parameters grid (or you can use dials to automate your grid search)
grid <- expand_grid(tree_depth = c(4, 6, 8, 10),
                    mtry = c(2, 10, 50)) # You can add others

# Run each model (tune)
tuned <- tune_grid(formula = Y ~ .,
                   model = model,
                   resamples = splits,
                   grid = grid,
                   metrics = metrics,
                   control = control_grid(verbose = TRUE))

# Check results
show_best(tuned)
autoplot(tuned)
select_best(tuned)

# Update model
tuned_model <-
  model %>%
  finalize_model(select_best(tuned)) %>%
  fit(Y ~ ., data = X_train)

# Make predictions
predict(tuned_model, X_train)
predict(tuned_model, X_test)
Please note that the argument names used in the model specification differ from the original names in xgboost, because parsnip is a unified interface with consistent names across several models. See here.
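If you want to check how the unified parsnip names map back to the native xgboost parameters on your installed version, translate() prints the underlying call; the mapping sketched in the comments below is an assumption based on the parsnip xgboost engine documentation, so confirm it on your own setup.

# Rough correspondence between boost_tree() arguments and xgboost parameters
# (assumption; confirm with translate() on your parsnip version):
#   trees          -> nrounds
#   tree_depth     -> max_depth
#   learn_rate     -> eta
#   min_n          -> min_child_weight
#   loss_reduction -> gamma
#   sample_size    -> subsample
#   mtry           -> colsample_bytree / colsample_bynode (version-dependent)
boost_tree(tree_depth = tune(), mtry = tune()) %>%
  set_mode("classification") %>%
  set_engine("xgboost") %>%
  translate()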

Related

I fail to run caret's nnet regression

I tried to run a regression using caret's nnet, but I got an error.
library(tidyverse)
library(caret)
feature = rnorm(100, 0, 1) %>% as.matrix()
colnames(feature) = "x1"
outcome = rnorm(100, 0, 1) %>% as.matrix()
colnames(feature) = "y1"
model = caret::train(
  x = feature, y = outcome, method = "nnet",
  tuneGrid = expand.grid(size = c(1:3), decay = seq(0.1, 1, 0.1)),
  weights = NULL, linout = TRUE
)
Error: Metric RMSE not applicable for classification models
Of course, I want to regress, not classify. In order to show this, I set the option linout = TRUE. What went wrong?
Also, following this question, I tried removing as.matrix, but that shows a different error.
library(tidyverse)
library(caret)
feature = rnorm(100, 0, 1) %>% as.double()
outcome = rnorm(100, 0, 1) %>% as.double()
CATE_model = caret::train(
  x = feature, y = outcome, method = "nnet",
  tuneGrid = expand.grid(size = c(1:3), decay = seq(0.1, 1, 0.1)),
  weights = NULL, linout = TRUE
)
Error: Please use column names for 'x'
Thanks a lot.
The input of the neural network seems to require column names.
This makes sense, as the trained neural net might later need to identify the correct columns in its input for prediction.
If you force your features into a data frame with column names (in my case x), it works.
library(tidyverse)
library(caret)
feature = data.frame(x = rnorm(100, 0, 1) %>% as.double()) # changed line
outcome = rnorm(100, 0, 1) %>% as.double()
CATE_model = caret::train(
  x = feature, y = outcome, method = "nnet",
  tuneGrid = expand.grid(size = c(1:3), decay = seq(0.1, 1, 0.1)),
  weights = NULL, linout = TRUE
)
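A quick follow-up sketch: prediction then works as long as the new data uses the same column name ("x") the model was trained with. The new_obs object here is hypothetical.

# Hypothetical new observations with the same column name as the training feature
new_obs <- data.frame(x = rnorm(5, 0, 1))
predict(CATE_model, newdata = new_obs)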

Train/validation/test split model in caret in R

I would like to ask for help please. I use this code to run an XGBoost model in the caret package. However, I want to use a validation split based on time: 60% training, 20% validation, 20% testing. I already split the data, but I do not know how to deal with the validation data if it is not cross-validation.
Thank you,
xgb_trainControl = trainControl(
  method = "cv",
  number = 5,
  returnData = FALSE
)
xgb_grid <- expand.grid(nrounds = 1000,
                        eta = 0.01,
                        max_depth = 8,
                        gamma = 1,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = 1)
set.seed(123)
xgb1 = train(sale ~ ., data = trans_train,
             trControl = xgb_trainControl,
             tuneGrid = xgb_grid,
             method = "xgbTree")
xgb1
pred = predict(lm1, trans_test)
The validation partition should not be used when you are creating the model; it should be 'set aside' until the model is trained and tuned using the 'training' and 'tuning' partitions. Then you can apply the model to predict the outcome of the validation dataset and summarise how accurate the predictions were.
For example, in my own work I create three partitions: training (75%), tuning (10%) and testing/validation (15%) using:
# Define the partition (e.g. 75% of the data for training)
trainIndex <- createDataPartition(data$response, p = .75,
                                  list = FALSE,
                                  times = 1)

# Split the dataset using the defined partition
train_data <- data[trainIndex, , drop = FALSE]
tune_plus_val_data <- data[-trainIndex, , drop = FALSE]

# Define a new partition to split the remaining 25%
tune_plus_val_index <- createDataPartition(tune_plus_val_data$response,
                                           p = .6,
                                           list = FALSE,
                                           times = 1)

# Split the remaining ~25% of the data: 40% (tune) and 60% (val)
tune_data <- tune_plus_val_data[-tune_plus_val_index, , drop = FALSE]
val_data <- tune_plus_val_data[tune_plus_val_index, , drop = FALSE]

# Outcome of this section is that the data (100%) is split into:
#   training   (~75%)
#   tuning     (~10%)
#   validation (~15%)
These data partitions are converted to xgb.DMatrix matrices ("dtrain", "dtune", "dval"). I then use the 'training' partition to train models and the 'tuning' partition to tune hyperparameters (e.g. random grid search) and evaluate model training (e.g. cross validation). This is ~equivalent to the code in your question.
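The conversion step itself is not shown above; here is a minimal sketch, assuming the response column is named "response", is coded 0/1, and the remaining columns are numeric predictors. The to_dmatrix helper is hypothetical.

library(xgboost)

# Hypothetical helper that turns a partition into an xgb.DMatrix
to_dmatrix <- function(df) {
  xgb.DMatrix(data  = data.matrix(df[, setdiff(names(df), "response")]),
              label = df$response)
}

dtrain <- to_dmatrix(train_data)
dtune  <- to_dmatrix(tune_data)
dval   <- to_dmatrix(val_data)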
lrn_tune <- setHyperPars(lrn, par.vals = mytune$x)
params2 <- list(booster = "gbtree",
                objective = lrn_tune$par.vals$objective,
                eta = lrn_tune$par.vals$eta, gamma = 0,
                max_depth = lrn_tune$par.vals$max_depth,
                min_child_weight = lrn_tune$par.vals$min_child_weight,
                subsample = 0.8,
                colsample_bytree = lrn_tune$par.vals$colsample_bytree)

xgb2 <- xgb.train(params = params2,
                  data = dtrain, nrounds = 50,
                  watchlist = list(val = dtune, train = dtrain),
                  print_every_n = 10, early_stopping_rounds = 50,
                  maximize = FALSE, eval_metric = "error")
Once the model is trained I apply the model to the validation data with predict():
xgbpred2_keep <- predict(xgb2, dval)

xg2_val <- data.frame("Prediction" = xgbpred2_keep,
                      "Patient" = rownames(val),
                      "Response" = val_data$response)

# Reorder Patients according to Response
xg2_val$Patient <- factor(xg2_val$Patient,
                          levels = xg2_val$Patient[order(xg2_val$Response)])

ggplot(xg2_val, aes(x = Patient, y = Prediction,
                    fill = Response)) +
  geom_bar(stat = "identity") +
  theme_bw(base_size = 16) +
  labs(title = paste("Patient predictions (xgb2) for the validation dataset (n = ",
                     length(rownames(val)), ")", sep = ""),
       subtitle = "Above 0.5 = Non-Responder, Below 0.5 = Responder",
       caption = paste("JM", Sys.Date(), sep = " "),
       x = "") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5,
                                   hjust = 1, size = 8)) +
  # Distance from red line = confidence of prediction
  geom_hline(yintercept = 0.5, colour = "red")

# Convert predictions to binary outcome (responder / non-responder)
xgbpred2_binary <- ifelse(predict(xgb2, dval) > 0.5, 1, 0)

# Results matrix (i.e. true positives/negatives & false positives/negatives)
confusionMatrix(as.factor(xgbpred2_binary), as.factor(labels_tv))

# Summary of results
Summary_of_results <- data.frame(Patient_ID = rownames(val),
                                 label = labels_tv,
                                 pred = xgbpred2_binary)
Summary_of_results$eval <- ifelse(
  Summary_of_results$label != Summary_of_results$pred,
  "wrong",
  "correct")
Summary_of_results$conf <- round(predict(xgb2, dval), 2)
Summary_of_results$CDS <- val_data$`variants`
Summary_of_results
This provides you with a summary of how well the model 'works' on your validation data.

Save Gradient Boosting Machine values obtained with Bootstrap

I am fitting a gradient boosting model to identify the importance of the variables in the model, and I am performing resampling to see how the importance of each variable behaves.
But I can't correctly save the variable names together with their importance calculated in each bootstrap sample.
I'm doing this using a function which is called by the boot() command from the boot package.
Below is a minimal reproducible example adapted to the AmesHousing data:
library(gbm)
library(boot)
library(AmesHousing)

df <- make_ames()

imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  gbm.fit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  return(summary(gbm.fit)[, 2])
}

results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)
results_GBM$t0
I expected to save the bootstrap results with their variable names, but I can only save the variable importances without their names.
With summary.gbm, the default is to order the variables according to importance; you need to set order = FALSE, and also suppress the plot with plotit = FALSE. The returned variable importances are then in the same order as the variables in the fit.
imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  # use gbmfit because gbm.fit is a function
  gbmfit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  o <- summary(gbmfit, plotit = FALSE, order = FALSE)[, 2]
  names(o) <- gbmfit$var.names
  return(o)
}
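A short usage sketch of collecting the named results (an assumption about the workflow: boot() stores the replicates in results_GBM$t as an unnamed matrix, so the names are re-attached afterwards; if $t0 does not keep the names returned by imp_gbm, take them from a single gbm fit's var.names instead).

set.seed(123)
results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)

# Attach variable names to the bootstrap replicate matrix
colnames(results_GBM$t) <- names(results_GBM$t0)

# Mean and spread of each variable's importance across bootstrap samples
apply(results_GBM$t, 2, mean)
apply(results_GBM$t, 2, sd)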

How to fine tune this xgboost model

How could I fine-tune this so I can get better predictions? I don't know how to make it a better model. Any insight will be greatly appreciated. Thanks a ton.
Basically I meant to predict best corrected visual acuity (BCVA 0/1, with 0 = 20/20 vision, 1 = worse than 20/20).
Liyan
#preparing data
library(xgboost)
library(haven)  # read_sas() comes from the haven package

train <- read_sas("Rtrain2.sas7bdat", NULL)
test <- read_sas("Rtest2.sas7bdat", NULL)
labels <- train$bcva01
test_label <- test$bcva01

#outcome variable
drops <- c("bcva01")
x <- train[, !(names(train) %in% drops)]
x_test <- test[, !(names(test) %in% drops)]
new_tr <- model.matrix(~ . + 0, data = x)
new_ts <- model.matrix(~ . + 0, data = x_test)

#preparing matrix
dtrain <- xgb.DMatrix(data = new_tr, label = labels)
dtest <- xgb.DMatrix(data = new_ts, label = test_label)

#parameters
params <- list(booster = "gbtree", objective = "binary:logistic", eta = 0.03,
               gamma = 0, max_depth = 6,
               min_child_weight = 1, subsample = 1, colsample_bytree = 1)

#Using the inbuilt xgb.cv function
xgbcv <- xgb.cv(params = params, data = dtrain, nrounds = 21, nfold = 5,
                showsd = T, stratified = T, print.every.n = 10, early.stop.round = 21,
                maximize = F)
min(xgbcv$test.error.mean) #inf

#first default - model training
xgb1 <- xgb.train(params = params, data = dtrain, nrounds = 21,
                  watchlist = list(val = dtest, train = dtrain),
                  print.every.n = 10, early.stop.round = 21, maximize = F,
                  eval_metric = "error")

#model prediction
xgbpred <- predict(xgb1, dtest)
cvAUC::AUC(predictions = xgbpred, labels = test[, "bcva01"]) #0.69 2018-10-25
There are a few ways to automatically calibrate your hyperparameters:
scikit-learn GridSearch here and here
Hyperopt which I use, here with a nice example here and a short example on how to do it with xgboost
Bayesian Optimization with xgboost example here
All are techniques for finding some kind of "minimum" in a defined "space", where that "space" is the search space you define for your hyperparameters and the "minimum" is the model error you'd like to reduce.
The subject is quite broad and there is a lot to read, or you can just follow some examples and implement it in your code.
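As a purely illustrative sketch of the same idea in R (a hand-rolled random search over a defined search space, scored with xgb.cv): the dtrain object comes from the question above, and the test_error_mean column name is an assumption that holds when eval_metric = "error".

library(xgboost)

# Define the "search space" for the hyperparameters
n_trials <- 20
space <- data.frame(
  eta              = runif(n_trials, 0.01, 0.3),
  max_depth        = sample(3:10, n_trials, replace = TRUE),
  subsample        = runif(n_trials, 0.5, 1),
  colsample_bytree = runif(n_trials, 0.5, 1)
)

# Score each candidate with cross-validation and keep the "minimum" error
scores <- sapply(seq_len(n_trials), function(i) {
  cv <- xgb.cv(params = list(objective = "binary:logistic",
                             eta = space$eta[i],
                             max_depth = space$max_depth[i],
                             subsample = space$subsample[i],
                             colsample_bytree = space$colsample_bytree[i]),
               data = dtrain, nrounds = 200, nfold = 5,
               early_stopping_rounds = 20,
               eval_metric = "error", verbose = FALSE)
  min(cv$evaluation_log$test_error_mean)
})

space[which.min(scores), ]  # best hyperparameter combination found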

Combining train + test data and running cross validation in R

I have the following R code that runs a simple xgboost model on a set of training and test data with the intention of predicting a binary outcome.
We start by
1) Reading in the relevant libraries.
library(xgboost)
library(readr)
library(caret)
2) Cleaning up the training and test data
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])
test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])
train.c1 = subset(train.df , outcome == 1)
train.c0 = subset(train.df , outcome == 0)
3) Running XGBoost on the properly formatted data.
train_xgb = xgb.DMatrix(data.matrix(train.df [,1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[,1:124]))
4) Running the model
model_xgb = xgboost(data = train_xgb, nrounds = 8, max_depth = 5, eta = .1,
                    eval_metric = "logloss", objective = "binary:logistic",
                    verbose = 5)
5) Making predictions
pred_xgb <- predict(model_xgb, newdata = test_xgb)
My question is: How can I adjust this process so that I'm just pulling in / adjusting a single 'training' data set, and getting predictions on the hold-out sets of the cross-validated file?
To specify k-fold CV in the xgboost call, one needs to call xgb.cv with an integer nfold argument; to save the predictions for each resample, use the prediction = TRUE argument. For instance:
xgboostModelCV <- xgb.cv(data = dtrain,
                         nrounds = 1688,
                         nfold = 5,
                         objective = "binary:logistic",
                         eval_metric = "auc",
                         metrics = "auc",
                         verbose = 1,
                         print_every_n = 50,
                         stratified = T,
                         scale_pos_weight = 2,
                         max_depth = 6,
                         eta = 0.01,
                         gamma = 0,
                         colsample_bytree = 1,
                         min_child_weight = 1,
                         subsample = 0.5,
                         prediction = T)

xgboostModelCV$pred   # contains predictions in the same order as in dtrain
xgboostModelCV$folds  # contains the k-fold samples
Here's a decent function to pick hyperparams
# Grid search over xgboost hyperparameters, scored by cross-validated AUC
pick_hyperparams <- function(train, seed) {
  require(xgboost)
  ntrees <- 2000
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma = c(0, 1, 2),
                                  eta = c(0.01, 0.03),
                                  max_depth = c(4, 6, 8, 10))

  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList) {
    # Extract parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta <- parameterList[["eta"]]
    currentMaxDepth <- parameterList[["max_depth"]]

    set.seed(seed)
    xgboostModelCV <- xgb.cv(data = train,
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric = "auc",
                             metrics = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = T,
                             scale_pos_weight = sum(all_data_nobad[index_no_bad, 1] == 0) /
                                                sum(all_data_nobad[index_no_bad, 1] == 1),
                             max_depth = currentMaxDepth,
                             eta = currentEta,
                             gamma = currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample = currentSubsampleRate)

    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    # Save the AUC of the best iteration
    auc <- xvalidationScores[xvalidationScores$iter == xgboostModelCV$best_iteration, c(1, 4, 5)]
    auc <- cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma,
                 currentEta, currentMaxDepth)
    names(auc) <- c("iter", "test.auc.mean", "test.auc.std", "subsample",
                    "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}
You can change the grid values and the params in the grid, as well as the loss/evaluation metric. It is similar to the grid search provided by caret, but caret does not let you define hyperparameters such as alpha, lambda, colsample_bylevel or num_parallel_tree in the grid search, apart from defining a custom function, which I found cumbersome. Caret has the advantage of automatic preprocessing, automatic up/down sampling within CV, etc.
Setting the seed outside the xgb.cv call will pick the same folds for CV, but not the same trees at each round, so you will end up with a different model. Even if you set the seed inside the xgb.cv function call, there is no guarantee you will end up with the same model, but there's a much higher chance (it depends on threads, type of model, etc.; I for one like the uncertainty and found it to have little impact on the result).
You can use xgb.cv and set prediction = TRUE.
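In other words, the out-of-fold predictions from xgb.cv already are hold-out predictions for every row of the training data. A small sketch of using them (the cvAUC package for the metric is an assumption, carried over from the earlier question in this thread):

# Each row of dtrain is predicted by the model that did not see it during training
oof_pred    <- xgboostModelCV$pred
true_labels <- getinfo(dtrain, "label")

# Out-of-fold AUC (assumes the cvAUC package is installed)
cvAUC::AUC(predictions = oof_pred, labels = true_labels)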
