How to fine tune this xgboost model - r

How can I fine-tune this model to get better predictions? I don't know how
to improve it. Any insight will be greatly appreciated. Thanks a ton.
I am trying to predict best corrected visual acuity (BCVA), coded 0/1 with
0 = 20/20 vision and 1 = worse than 20/20.
Liyan
# preparing data
library(haven)    # provides read_sas()
library(xgboost)
train <- read_sas("Rtrain2.sas7bdat", NULL)
test <- read_sas("Rtest2.sas7bdat", NULL)
labels <- train$bcva01
test_label <- test$bcva01
# drop the outcome variable from the predictors
drops <- c("bcva01")
x <- train[, !(names(train) %in% drops)]
x_test <- test[, !(names(test) %in% drops)]
new_tr <- model.matrix(~ . + 0, data = x)
new_ts <- model.matrix(~ . + 0, data = x_test)
# preparing matrix
dtrain <- xgb.DMatrix(data = new_tr, label = labels)
dtest <- xgb.DMatrix(data = new_ts, label = test_label)
# parameters (eval_metric is set here so xgb.cv and xgb.train report the same metric)
params <- list(booster = "gbtree", objective = "binary:logistic", eval_metric = "error",
               eta = 0.03, gamma = 0, max_depth = 6,
               min_child_weight = 1, subsample = 1, colsample_bytree = 1)
# using the built-in xgb.cv function
xgbcv <- xgb.cv(params = params, data = dtrain, nrounds = 21, nfold = 5,
                showsd = TRUE, stratified = TRUE, print_every_n = 10,
                early_stopping_rounds = 21, maximize = FALSE)
# per-round results are stored in evaluation_log; xgbcv$test.error.mean is NULL, so min() of it gives Inf
min(xgbcv$evaluation_log$test_error_mean)
# first default - model training (early stopping monitors the last watchlist entry)
xgb1 <- xgb.train(params = params, data = dtrain, nrounds = 21,
                  watchlist = list(train = dtrain, val = dtest),
                  print_every_n = 10, early_stopping_rounds = 21, maximize = FALSE)
# model prediction
xgbpred <- predict(xgb1, dtest)
cvAUC::AUC(predictions = xgbpred, labels = test_label) # 0.69 2018-10-25

There are a few ways to tune your hyperparameters automatically:
scikit-learn GridSearch here and here
Hyperopt, which I use, here, with a nice example here and a short example of how to use it with xgboost
Bayesian Optimization with an xgboost example here
All of these are techniques for finding some kind of "minimum" in a defined "space": the "space" is the search space you define for your hyperparameters, and the "minimum" is the model error you want to reduce.
The subject is broad and there is a lot to read, or you can just follow some of the examples and adapt them to your code.
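To make the "minimum in a search space" idea concrete, here is a minimal sketch in base R. It is not taken from any of the linked examples: the data are simulated, the model is a ridge-penalised polynomial standing in for xgboost, and the names grid, cv_rmse and best are made up for illustration. The search space is a grid of two hyperparameters and the quantity being minimised is 5-fold cross-validated RMSE.

```r
set.seed(42)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)

# Search space: every combination of the two hyperparameters.
grid <- expand.grid(degree = 1:6, lambda = c(0, 0.01, 0.1, 1))

# Fixed fold assignment so every candidate is scored on the same splits.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Objective: k-fold cross-validated RMSE of a ridge-penalised polynomial fit.
cv_rmse <- function(degree, lambda) {
  X <- cbind(1, poly(x, degree, raw = TRUE))
  penalty <- diag(c(0, rep(lambda, degree)))  # the intercept is not penalised
  sq_err <- numeric(n)
  for (f in seq_len(k)) {
    tr <- fold != f
    beta <- solve(crossprod(X[tr, , drop = FALSE]) + penalty,
                  crossprod(X[tr, , drop = FALSE], y[tr]))
    sq_err[!tr] <- (y[!tr] - X[!tr, , drop = FALSE] %*% beta)^2
  }
  sqrt(mean(sq_err))
}

# Evaluate the objective at every point of the search space and take the minimum.
grid$rmse <- mapply(cv_rmse, grid$degree, grid$lambda)
best <- grid[which.min(grid$rmse), ]
print(best)
```

Random search or Bayesian optimisation replace the exhaustive expand.grid step with a smarter way of choosing which points to evaluate; the objective function stays the same.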

Related

Training, validation and testing without using caret

I have doubts about the hyperparameter tuning step. I think I might be confusing something.
I split my dataset into training (70%), validation (15%) and testing (15%). Below is the code used for regression with Random Forest.
1. Training
I perform the initial training with the dataset, as follows:
library(ranger)
rf_model <- ranger(y ~ .,
                   data = train,
                   num.trees = 500,
                   mtry = 5,
                   min.node.size = 100,
                   importance = "impurity")
I get the R squared and the RMSE using the actual and predicted data from the training set.
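library(caret)  # provides RMSE()
# predict() on a ranger model returns an object; the numeric predictions are in $predictions
pred_rf <- predict(rf_model, train)$predictions
pred_rf <- data.frame(pred = pred_rf, obs = train$y)
RMSE_rf <- RMSE(pred_rf$pred, pred_rf$obs)
R2_rf <- (cor(pred_rf$pred, pred_rf$obs))^2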
2. Parameter optimization
Using a parameter grid, the best model is chosen based on performance.
hyper_grid <- expand.grid(mtry = seq(3, 12, by = 4),
sample_size = c(0.5,1),
min.node.size = seq(20, 500, by = 100),
MSE = as.numeric(NA),
R2 = as.numeric(NA),
OOB_RMSE = as.numeric(NA)
)
And I perform the search for the best model according to the smallest OOB error, for example.
for (i in 1:nrow(hyper_grid)) {
  model <- ranger(formula = y ~ .,
                  data = train,
                  num.trees = 500,
                  mtry = hyper_grid$mtry[i],
                  sample.fraction = hyper_grid$sample_size[i],
                  min.node.size = hyper_grid$min.node.size[i],
                  importance = "impurity",
                  replace = TRUE,
                  oob.error = TRUE,
                  verbose = TRUE
  )
  # for regression, prediction.error is the out-of-bag mean squared error
  hyper_grid[i, "MSE"] <- model$prediction.error
  hyper_grid[i, "R2"] <- model$r.squared
  hyper_grid[i, "OOB_RMSE"] <- sqrt(model$prediction.error)
}
Choose the best performing model
x <- hyper_grid[which.min(hyper_grid$OOB_RMSE), ]
The final model:
rf_fit_model <- ranger(formula = y ~ .,
                       data = train,
                       num.trees = 100,
                       mtry = x$mtry,
                       sample.fraction = x$sample_size,
                       min.node.size = x$min.node.size,
                       oob.error = TRUE,
                       verbose = TRUE,
                       importance = "impurity"
)
Perform model prediction with validation data
# the numeric predictions are in the $predictions element of the ranger prediction object
rf_predict_val <- predict(rf_fit_model, validation)$predictions
rf_predict_val <- data.frame(pred = rf_predict_val, obs = validation$y)
RMSE_rf_fit <- RMSE(rf_predict_val$pred, rf_predict_val$obs)
R2_rf_fit <- (cor(rf_predict_val$pred, rf_predict_val$obs))^2
Now I wonder whether I should repeat the model evaluation with the test data.
As it stands, the validation data is only being used as a "test" set and is not actually helping to validate the model.
I've used cross-validation with other methods, but I'd like to do it more manually. One of the reasons is that CV via caret is very slow.
Am I on the right track?
Code using Caret, but very slow:
ctrl <- trainControl(method = "repeatedcv",
repeats = 10)
grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = 1000,
shrinkage = c(0.01,0.1),
n.minobsinnode = 50)
gbmTune <- train(y ~ ., data = train,
method = "gbm",
tuneGrid = grid,
verbose = TRUE,
trControl = ctrl)
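For comparison, here is a minimal base R sketch of one common way to make the validation set take part in tuning: each candidate is fitted on the training rows, scored on the validation rows, and the test rows are touched only once at the end. This is an illustration under assumptions, not the asker's pipeline: the data are simulated, lm with a polynomial degree stands in for ranger and its hyperparameters, and names such as candidates and best_degree are invented here.

```r
set.seed(1)
n <- 600
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- dat$x^3 - dat$x + rnorm(n, sd = 0.5)

# 70 / 15 / 15 split into training, validation and test rows.
part <- sample(rep(c("train", "validation", "test"), times = c(420, 90, 90)))
train      <- dat[part == "train", ]
validation <- dat[part == "validation", ]
test       <- dat[part == "test", ]

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Candidate hyperparameter values (polynomial degree stands in for mtry, min.node.size, ...).
candidates <- 1:6

# Score every candidate on the validation set, never on the test set.
val_rmse <- sapply(candidates, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  rmse(predict(fit, validation), validation$y)
})
best_degree <- candidates[which.min(val_rmse)]

# Refit with the chosen value on training + validation, then evaluate once on the test set.
final_fit <- lm(y ~ poly(x, best_degree), data = rbind(train, validation))
test_rmse <- rmse(predict(final_fit, test), test$y)
print(c(best_degree = best_degree, test_rmse = test_rmse))
```

In the ranger code above, the equivalent change would be to fill the grid with the validation RMSE of each candidate instead of (or alongside) the OOB error, and to keep the test set for the final number only.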

Error in data$update_params(params = params) : [LightGBM] [Fatal] Cannot change max_bin after constructed Dataset handle

I installed the lightgbm package in RStudio and am trying to run a model with it.
The script is based on Retip.
The function is this:
> fit.lightgbm
function (training, testing)
{
train <- as.matrix(training)
test <- as.matrix(testing)
coltrain <- ncol(train)
coltest <- ncol(test)
dtrain <- lightgbm::lgb.Dataset(train[, 2:coltrain], label = train[,
1])
lightgbm::lgb.Dataset.construct(dtrain)
dtest <- lightgbm::lgb.Dataset.create.valid(dtrain, test[,2:coltest], label = test[, 1])
valids <- list(test = dtest)
params <- list(objective = "regression", metric = "rmse")
modelcv <- lightgbm::lgb.cv(params, dtrain, nrounds = 5000,
nfold = 10, valids, verbose = 1, early_stopping_rounds = 1000,
record = TRUE, eval_freq = 1L, stratified = TRUE, max_depth = 4,
max_leaf = 20, max_bin = 50)
best.iter <- modelcv$best_iter
params <- list(objective = "regression_l2", metric = "rmse")
model <- lightgbm::lgb.train(params, dtrain, nrounds = best.iter,
valids, verbose = 0, early_stopping_rounds = 1000, record = TRUE,
eval_freq = 1L, max_depth = 4, max_leaf = 20, max_bin = 50)
print(paste0("End training"))
return(model)
}
However, when I try to run the function as in Retip
lightgbm <- fit.lightgbm(training,testing)
I get this fatal error:
Error in data$update_params(params = params) :
[LightGBM] [Fatal] Cannot change max_bin after constructed Dataset handle.
The error goes away only when I change max_bin to max_bin=255.
I went through the following:
What is the right way for hyper parameter tuning for LightGBM classification? #1339
[Python] max_bin weird behaviour #1053
Any ideas or suggestions on what should be done?
This was cross-posted to https://github.com/microsoft/LightGBM/issues/4019 and has been answered there.
Construction of the Dataset object in LightGBM handles some important pre-processing steps (see this prior answer) that happen before training, and none of the Dataset parameters can be changed after construction.
Passing max_bin=50 into lgb.Dataset() instead of lgb.cv() / lgb.train() in the original post's code will result in successful training without this error.
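A minimal sketch of that fix, on simulated data rather than the Retip data, is shown below. It assumes a reasonably recent lightgbm R package in which Dataset parameters are passed through the params argument of lgb.Dataset(); older releases may accept max_bin directly as an extra argument instead. The whole block is guarded so that it is skipped when the package is not installed.

```r
# Sketch only: simulated data; assumes the lightgbm R package is available.
if (requireNamespace("lightgbm", quietly = TRUE)) {
  set.seed(1)
  X <- matrix(rnorm(600 * 5), ncol = 5)
  y <- X[, 1] + rnorm(600, sd = 0.1)
  train_rows <- 1:500
  valid_rows <- 501:600

  # Dataset-level parameters such as max_bin are fixed when the Dataset is built.
  dtrain <- lightgbm::lgb.Dataset(X[train_rows, ], label = y[train_rows],
                                  params = list(max_bin = 50))
  # The validation set reuses the binning of the training Dataset.
  dvalid <- lightgbm::lgb.Dataset.create.valid(dtrain, X[valid_rows, ],
                                               label = y[valid_rows])

  # Training parameters no longer mention max_bin.
  params <- list(objective = "regression", metric = "rmse",
                 max_depth = 4, num_leaves = 20)
  model <- lightgbm::lgb.train(params, dtrain, nrounds = 20,
                               valids = list(test = dvalid), verbose = -1)
}
```

The same split applies to lgb.cv(): keep max_bin with the Dataset and pass only the tree and learning parameters to the training call.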

Save Gradient Boosting Machine values obtained with Bootstrap

I am fitting a gradient boosting machine to identify the importance of the variables in the model, and I am resampling to see how the importance of each variable behaves.
But I can't correctly save each variable's name together with its importance calculated in each bootstrap replicate.
I'm doing this with a function that is called by the boot command from the boot package.
Below is a minimal reproducible example adapted to the AmesHousing data:
library(gbm)
library(boot)
library(AmesHousing)
df <- make_ames()
imp_gbm <- function(data, indices) {
d <- data[indices,]
gbm.fit <- gbm(
formula = Sale_Price ~ .,
distribution = "gaussian",
data = d,
n.trees = 100,
interaction.depth = 5,
shrinkage = 0.1,
cv.folds = 5,
n.cores = NULL,
verbose = FALSE
)
return(summary(gbm.fit)[,2])
}
results_GBM <- boot(data = df,statistic = imp_gbm, R=100)
results_GBM$t0
I expect to save the bootstrap results together with their variable names, but I can only save the importance values without the names.
With summary.gbm, the default is to order the variables by importance. Set order = FALSE and plotit = FALSE; the returned importances then follow the same order as the variables in the fit.
imp_gbm <- function(data, indices) {
d <- data[indices,]
# use gbmfit because gbm.fit is a function
gbmfit <- gbm(
formula = Sale_Price ~ .,
distribution = "gaussian",
data = d,
n.trees = 100,
interaction.depth = 5,
shrinkage = 0.1,
cv.folds = 5,
n.cores = NULL,
verbose = FALSE
)
o= summary(gbmfit,plotit=FALSE,order=FALSE)[,2]
names(o) = gbmfit$var.names
return(o)
}
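Once the statistic returns a named vector, the names end up on the t0 element of the boot object, while the replicate matrix t has no column names. A minimal sketch of copying them across is below. To keep it fast and free of extra dependencies it uses lm coefficients on mtcars as a stand-in for the gbm importances; coef_stat and boot_out are illustrative names, not part of the original code.

```r
library(boot)  # a recommended package that ships with standard R installations

# Statistic returning a named vector (regression coefficients as a cheap stand-in
# for the named importance vector returned by imp_gbm).
coef_stat <- function(data, indices) {
  d <- data[indices, ]
  coef(lm(mpg ~ wt + hp, data = d))
}

set.seed(123)
boot_out <- boot(data = mtcars, statistic = coef_stat, R = 200)

# t0 keeps the names; the replicate matrix t does not, so copy them across.
colnames(boot_out$t) <- names(boot_out$t0)
head(boot_out$t)

# Per-variable spread across the bootstrap replicates.
apply(boot_out$t, 2, sd)
```

With the gbm version, the same colnames() line applied to results_GBM$t should label each column with the corresponding variable name.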

Combining train + test data and running cross validation in R

I have the following R code that runs a simple xgboost model on a set of training and test data with the intention of predicting a binary outcome.
We start by
1) Reading in the relevant libraries.
library(xgboost)
library(readr)
library(caret)
2) Cleaning up the training and test data
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])
test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])
train.c1 = subset(train.df , outcome == 1)
train.c0 = subset(train.df , outcome == 0)
3) Running XGBoost on the properly formatted data.
train_xgb = xgb.DMatrix(data.matrix(train.df [,1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[,1:124]))
4) Running the model
model_xgb = xgboost(data = train_xgb, nrounds = 8, max_depth = 5, eta = .1, eval_metric = "logloss", objective = "binary:logistic", verbose = 5)
5) Making predictions
pred_xgb <- predict(model_xgb, newdata = test_xgb)
My question is: How can I adjust this process so that I'm just pulling in / adjusting a single 'training' data set, and getting predictions on the hold-out sets of the cross-validated file?
To run k-fold CV with xgboost, call xgb.cv with nfold set to an integer. To save the predictions for each resample, add the argument prediction = TRUE. For instance:
xgboostModelCV <- xgb.cv(data = dtrain,
                         nrounds = 1688,
                         nfold = 5,
                         objective = "binary:logistic",
                         eval_metric = "auc",
                         metrics = "auc",
                         verbose = 1,
                         print_every_n = 50,
                         stratified = TRUE,
                         scale_pos_weight = 2,
                         max_depth = 6,
                         eta = 0.01,
                         gamma = 0,
                         colsample_bytree = 1,
                         min_child_weight = 1,
                         subsample = 0.5,
                         prediction = TRUE)
xgboostModelCV$pred #contains predictions in the same order as in dtrain.
xgboostModelCV$folds #contains k-fold samples
Here's a decent function to pick hyperparams
function(train, seed){
require(xgboost)
ntrees=2000
searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
colsample_bytree = c(0.6, 0.8, 1),
gamma=c(0, 1, 2),
eta=c(0.01, 0.03),
max_depth=c(4,6,8,10))
aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentGamma <- parameterList[["gamma"]]
currentEta =parameterList[["eta"]]
currentMaxDepth =parameterList[["max_depth"]]
set.seed(seed)
xgboostModelCV <- xgb.cv(data = train,
nrounds = ntrees,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
early_stopping_rounds = 200,
stratified = T,
scale_pos_weight=sum(all_data_nobad[index_no_bad,1]==0)/sum(all_data_nobad[index_no_bad,1]==1),
max_depth = currentMaxDepth,
eta = currentEta,
gamma=currentGamma,
colsample_bytree = currentColsampleRate,
min_child_weight = 1,
subsample= currentSubsampleRate)
xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
#Save the AUC at the best iteration
auc=xvalidationScores[xvalidationScores$iter==xgboostModelCV$best_iteration,c(1,4,5)]
auc=cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
names(auc)=c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
print(auc)
return(auc)
})
return(aucErrorsHyperparameters)
}
You can change the grid values and the parameters in the grid, as well as the loss/evaluation metric. This is similar to what caret's grid search provides, but caret does not let you put hyperparameters such as alpha, lambda, colsample_bylevel or num_parallel_tree in the grid unless you define a custom function, which I found cumbersome. Caret has the advantage of automatic preprocessing, automatic up/down sampling within CV, and so on.
Setting the seed outside the xgb.cv call will pick the same folds for CV but not the same trees at each round, so you will end up with a different model. Even if you set the seed inside the xgb.cv function call, there is no guarantee that you will end up with the same model, but the chance is much higher (it depends on the number of threads, the type of model, and so on). I for one like the uncertainty and have found it to have little impact on the result.
You can use xgb.cv and set prediction = TRUE.
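To illustrate what out-of-fold predictions are, and one way to take fold selection out of the hands of the random seed, here is a minimal base R sketch. It is an illustration under assumptions: the data are simulated, glm stands in for xgboost, and fold_id, folds and oof_pred are invented names. A list of held-out row indices in this shape is what the folds argument of xgb.cv is documented to accept, so the same folds can be reused across runs.

```r
set.seed(2024)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$outcome <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))

# Stratified fold labels: shuffle 1..k separately within each outcome class.
k <- 5
fold_id <- integer(n)
for (cls in unique(dat$outcome)) {
  idx <- which(dat$outcome == cls)
  fold_id[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}
# List of held-out row indices, one element per fold.
folds <- unname(split(seq_len(n), fold_id))

# Out-of-fold predictions: each row is predicted by the model that did not see it.
oof_pred <- numeric(n)
for (test_idx in folds) {
  fit <- glm(outcome ~ x1 + x2, family = binomial, data = dat[-test_idx, ])
  oof_pred[test_idx] <- predict(fit, newdata = dat[test_idx, ], type = "response")
}

# AUC of the out-of-fold predictions via the rank-sum (Mann-Whitney) identity.
pos <- dat$outcome == 1
auc <- (sum(rank(oof_pred)[pos]) - sum(pos) * (sum(pos) + 1) / 2) /
  (sum(pos) * sum(!pos))
print(auc)
```

Because oof_pred is stored in the original row order, it can be compared directly with the labels, which is the same property the comment on xgboostModelCV$pred describes.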

L2 regularized MLR using caret, and how to make sure I am using best model while predicting

I am trying to do L2-regularized MLR on a data set using caret. Following is what I have done so far to achieve this:
r_squared <- function ( pred, actual){
mean_actual = mean (actual)
ss_e = sum ((pred - actual )^2)
ss_total = sum ((actual-mean_actual)^2 )
r_squared = 1 - (ss_e/ss_total)
}
df = as.data.frame(matrix(rnorm(10000, 10, 3), 1000))
colnames(df)[1] = "response"
set.seed(753)
inTraining <- createDataPartition(df[["response"]], p = .75, list = FALSE)
training <- df[inTraining,]
testing <- df[-inTraining,]
testing_response <- base::subset(testing,
select = c(paste ("response")))
gridsearch_for_lambda = data.frame (alpha = 0,
lambda = c (2^c(-15:15), 3^c(-15:15)))
regression_formula = as.formula (paste ("response", "~ ", " .", sep = " "))
train_control = trainControl (method="cv", number =10,
savePredictions =TRUE , allowParallel = FALSE )
model = train (regression_formula,
data = training,
trControl = train_control,
method = "glmnet",
tuneGrid =gridsearch_for_lambda,
preProcess = NULL
)
prediction = predict (model, newdata = testing)
testing_response[["predicted"]] = prediction
r_sq = round (r_squared(testing_response[["predicted"]],
testing_response[["response"]] ),3)
My concern is how to be sure that the model used for prediction is the best one (the one with the optimal tuned lambda value).
P.S.: The data are sampled from a random normal distribution, so the R^2 value is not good, but I want to make sure I have the idea right.
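One way to check this, sketched under assumptions: a caret train object stores the winning grid row in bestTune and the resampled performance of every candidate in results, and predict() on the train object uses the final model refitted on all training rows at bestTune. The example below uses a smaller simulated data set and a shorter lambda grid than the code above so that it runs quickly, and it is skipped when caret or glmnet is not installed. The name fit is illustrative.

```r
# Sketch only: simulated data; requires the caret and glmnet packages.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(753)
  df <- as.data.frame(matrix(rnorm(2000, 10, 3), 200))
  colnames(df)[1] <- "response"

  fit <- caret::train(response ~ ., data = df,
                      method = "glmnet",
                      trControl = caret::trainControl(method = "cv", number = 5),
                      tuneGrid = data.frame(alpha = 0, lambda = 2^(-5:5)))

  # The alpha/lambda pair selected by resampling (lowest RMSE by default).
  print(fit$bestTune)

  # Resampled performance of every candidate; the minimum should match bestTune.
  print(fit$results[which.min(fit$results$RMSE), c("alpha", "lambda", "RMSE")])

  # predict() on the train object uses the final model at the selected lambda.
  print(head(predict(fit, newdata = df)))
}
```

Comparing model$bestTune with the minimum-RMSE row of model$results in the original code would give the same kind of reassurance.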

Resources