Using ParBayesianOptimization for a regression problem in R (minimizing RMSE)

I am trying to use the ParBayesianOptimization package to tune the parameters of my model. The original GitHub repository demonstrates how to use the package for parameter tuning in a classification problem (maximizing AUC). In my case, however, I want to apply it to a regression problem and minimize RMSE.
The main problem I have is understanding why the final parameters returned by getBestPars(optObj) are chosen according to the highest value in the Score column of optObj$scoreSummary. As I understand it, the Score column holds the RMSE for a given iteration, so the function should return the parameters with the lowest score.
Example of code to reproduce:
# install.packages("mlbench")
library('mlbench')
library('ParBayesianOptimization')
library("xgboost")
library("data.table")
library('doParallel')
#------------------------------------------------------------------------------#
#### Get data
#------------------------------------------------------------------------------#
set.seed(123)
data(BostonHousing)
BostonHousing <- data.frame(apply(BostonHousing, 2, as.numeric))
setDT(BostonHousing)
train_x <- BostonHousing[ , .SD,.SDcols = setdiff(names(BostonHousing), "medv")]
train_y <- BostonHousing[ ,.SD,.SDcols = "medv"]
#------------------------------------------------------------------------------#
#### Create Folds
#------------------------------------------------------------------------------#
Folds <- list(
Fold1 = as.integer(seq(1,nrow(BostonHousing),by = 3))
, Fold2 = as.integer(seq(2,nrow(BostonHousing),by = 3))
, Fold3 = as.integer(seq(3,nrow(BostonHousing),by = 3))
)
#------------------------------------------------------------------------------#
#### define the scoring function
#------------------------------------------------------------------------------#
scoringFunction <- function(max_depth, min_child_weight, subsample, eta, gamma,
colsample_bytree) {
dtrain <- xgboost::xgb.DMatrix(as.matrix(train_x), label = as.matrix(train_y))
Pars <- list(
booster = "gbtree"
, gamma = gamma
, colsample_bytree = colsample_bytree
, eta = eta
, max_depth = max_depth
, min_child_weight = min_child_weight
, subsample = subsample
, objective = 'reg:linear'
, eval_metric = "rmse"
)
xgbcv <- xgb.cv(
params = Pars
, data = dtrain
, nround = 100
, folds = Folds
, early_stopping_rounds = 100
, maximize = TRUE
, verbose = 1
)
return(
list(Score = min(xgbcv$evaluation_log$test_rmse_mean)
, nrounds = xgbcv$best_iteration
)
)
}
#------------------------------------------------------------------------------#
#### Bounds
#------------------------------------------------------------------------------#
bounds <- list(
gamma = c(0.1,50L)
, colsample_bytree = c(0.5,1L)
, eta = c(0.01,0.1)
, max_depth = c(1L, 5L)
, min_child_weight = c(0, 25)
, subsample = c(0.1, 1)
)
#------------------------------------------------------------------------------#
#### To run in parallel
#------------------------------------------------------------------------------#
cl <- makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)
clusterExport(cl,c('Folds','train_x', "train_y"))
clusterEvalQ(cl,expr= {
library(xgboost)
})
tWithPar <- system.time(
optObj <- bayesOpt(
FUN = scoringFunction
, bounds = bounds
, initPoints = 7
, iters.n = (parallel::detectCores() - 1)*2
, iters.k = (parallel::detectCores() - 1)*2
, parallel = TRUE
, verbose = 1
)
)
stopCluster(cl)
registerDoSEQ()
#------------------------------------------------------------------------------#
#### Printing results
#------------------------------------------------------------------------------#
optObj$scoreSummary
getBestPars(optObj)
I would appreciate any help in better understanding the function and how to correctly implement it in a regression problem.

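bayesOpt maximizes Score, which is why getBestPars(optObj) returns the parameters with the highest value. Minimizing the RMSE is equivalent to maximizing -1*RMSE, so redefine your Score as: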
Score = -1*min(xgbcv$evaluation_log$test_rmse_mean)
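As a minimal sketch of that change, assuming the cross-validation log is a data frame with a test_rmse_mean column (as in the question), the return value can be built by a small helper. The helper name make_score and the toy log values are hypothetical and only for illustration.

```r
# Hypothetical helper: turn an xgb.cv-style evaluation log into the list
# returned to bayesOpt. bayesOpt maximizes Score, so the RMSE is negated.
make_score <- function(evaluation_log) {
  best_iter <- which.min(evaluation_log$test_rmse_mean)
  list(
    Score   = -evaluation_log$test_rmse_mean[best_iter],
    nrounds = best_iter
  )
}

# Toy log standing in for xgbcv$evaluation_log (values are made up).
toy_log <- data.frame(
  iter           = 1:5,
  test_rmse_mean = c(9.1, 6.4, 5.2, 5.6, 5.9)
)

res <- make_score(toy_log)
print(res)
```

With this sign convention, the row with the highest Score in optObj$scoreSummary is the one with the lowest RMSE; multiply Score by -1 to read it as an RMSE again.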

Related

How to get accuracy for xgboost in R

How do I get the accuracy of an xgboost model in R?
I have the same problem: I want to compute the accuracy of the xgboost classification below.
library(xgboost)
library(RStoolbox)
library("caret", lib.loc="~/R/win-library/3.5")
setwd("D:/NEW DATA/kurt/tugas")
shp <- shapefile("jajal/samplepoint(2).shp")
ras <- stack("cigudegc21.tif")
vals <- extract(ras,shp)
train<-data.matrix(vals)
classes <- as.numeric(as.factor(shp@data$id)) - 1
xgb <- xgboost(data = train,
label = classes,
eta = 0.1,
max_depth = 4,
nround=100,
objective = "multi:softmax",
num_class = length(unique(classes)),
nthread = 3)
result <- predict(xgb, ras[1:(nrow(ras)*ncol(ras))],reshape=TRUE)
res <- raster(ras)
res <- setValues(res,result+1)
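One common way to measure accuracy is to compare predicted and true class labels on samples that were not used for training. Here is a minimal base R sketch with made-up label vectors; truth and pred are hypothetical names, and in the code above they would come from held-out sample points and from predict(xgb, ...) on those points.

```r
# Made-up labels for illustration; 0-based classes as used by multi:softmax.
truth <- c(0, 1, 2, 1, 0, 2, 2, 1)
pred  <- c(0, 1, 2, 0, 0, 2, 1, 1)

# Overall accuracy: share of samples whose predicted class equals the truth.
accuracy <- mean(pred == truth)

# Confusion matrix: rows are predictions, columns are true classes.
conf_mat <- table(Predicted = pred, Actual = truth)

print(accuracy)
print(conf_mat)
```

The diagonal of the confusion matrix counts the correct predictions, so sum(diag(conf_mat)) / sum(conf_mat) gives the same accuracy.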

What are the parameters of Bayesian optimization for parameter tuning?

I am using Bayesian optimization to tune the parameters of an SVM for a regression problem. In the following code, what should the value of init_grid_dt = initial_grid be? I have the upper and lower bounds of the sigma and C parameters of the SVM, but I don't know what the initial grid should be.
In one of the examples on the web, the results of a random search were used as input for the initial grid. The code is as follows:
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
svm_fit_bayes <- function(logC, logSigma) {
## Use the same model code but for a single (C, sigma) pair.
txt <- capture.output(
mod <- train(y ~ ., data = train_dat,
method = "svmRadial",
preProc = c("center", "scale"),
metric = "RMSE",
trControl = ctrl,
tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
)
list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
}
lower_bounds <- c(logC = -5, logSigma = -9)
upper_bounds <- c(logC = 20, logSigma = -0.75)
bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
logSigma = c(lower_bounds[2], upper_bounds[2]))
## Create a grid of values as the input into the BO code
initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
initial_grid$C <- log(initial_grid$C)
initial_grid$sigma <- log(initial_grid$sigma)
initial_grid$RMSE <- -initial_grid$RMSE
names(initial_grid) <- c("logC", "logSigma", "Value")
library(rBayesianOptimization)
ba_search <- BayesianOptimization(svm_fit_bayes,
bounds = bounds,
init_grid_dt = initial_grid,
init_points = 0,
n_iter = 30,
acq = "ucb",
kappa = 1,
eps = 0.0,
verbose = TRUE)
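As far as I can tell from the package help, init_grid_dt is a data.frame (or data.table) of user-chosen starting points whose column names match those in bounds, optionally with a final "Value" column when the scores have already been computed, as in the example above. If you have no earlier results, it can be left as NULL with init_points set above 0. A minimal base R sketch of building such a grid without a prior random search follows; the number of points and the uniform sampling are assumptions made for illustration.

```r
# Bounds as in the question (log scale).
bounds <- list(logC = c(-5, 20), logSigma = c(-9, -0.75))

# Hypothetical starting points: a few random draws inside the bounds.
set.seed(1)
n_start <- 5
initial_grid <- data.frame(
  logC     = runif(n_start, bounds$logC[1], bounds$logC[2]),
  logSigma = runif(n_start, bounds$logSigma[1], bounds$logSigma[2])
)

# Column names must match the names used in `bounds`.
stopifnot(identical(names(initial_grid), names(bounds)))
print(initial_grid)
```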

Combining train + test data and running cross validation in R

I have the following R code that runs a simple xgboost model on a set of training and test data with the intention of predicting a binary outcome.
We start by
1) Reading in the relevant libraries.
library(xgboost)
library(readr)
library(caret)
2) Cleaning up the training and test data
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])
test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])
train.c1 = subset(train.df , outcome == 1)
train.c0 = subset(train.df , outcome == 0)
3) Running XGBoost on the properly formatted data.
train_xgb = xgb.DMatrix(data.matrix(train.df [,1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[,1:124]))
4) Running the model
model_xgb = xgboost(data = train_xgb, nrounds = 8, max_depth = 5, eta = .1, eval_metric = "logloss", objective = "binary:logistic", verbose = 5)
5) Making predictions
pred_xgb <- predict(model_xgb, newdata = test_xgb)
My question is: How can I adjust this process so that I'm just pulling in / adjusting a single 'training' data set, and getting predictions on the hold-out sets of the cross-validated file?
To run k-fold CV with xgboost, call xgb.cv with nfold set to an integer. To save the predictions for each resample, pass prediction = TRUE. For instance:
xgboostModelCV <- xgb.cv(data = dtrain,
nrounds = 1688,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
stratified = T,
scale_pos_weight = 2,
max_depth = 6,
eta = 0.01,
gamma=0,
colsample_bytree = 1 ,
min_child_weight = 1,
subsample= 0.5 ,
prediction = T)
xgboostModelCV$pred #contains predictions in the same order as in dtrain.
xgboostModelCV$folds #contains k-fold samples
Here's a decent function to pick hyperparameters:
function(train, seed){
require(xgboost)
ntrees=2000
searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
colsample_bytree = c(0.6, 0.8, 1),
gamma=c(0, 1, 2),
eta=c(0.01, 0.03),
max_depth=c(4,6,8,10))
aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentGamma <- parameterList[["gamma"]]
currentEta =parameterList[["eta"]]
currentMaxDepth =parameterList[["max_depth"]]
set.seed(seed)
xgboostModelCV <- xgb.cv(data = train,
nrounds = ntrees,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
early_stopping_rounds = 200,
stratified = T,
scale_pos_weight=sum(all_data_nobad[index_no_bad,1]==0)/sum(all_data_nobad[index_no_bad,1]==1),
max_depth = currentMaxDepth,
eta = currentEta,
gamma=currentGamma,
colsample_bytree = currentColsampleRate,
min_child_weight = 1,
subsample= currentSubsampleRate)
xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
#Save the test AUC at the best iteration
auc=xvalidationScores[xvalidationScores$iter==xgboostModelCV$best_iteration,c(1,4,5)]
auc=cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
names(auc)=c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
print(auc)
return(auc)
})
return(aucErrorsHyperparameters)
}
You can change the grid values and the parameters in the grid, as well as the loss/evaluation metric. This is similar to what caret's grid search provides, but caret does not let you include alpha, lambda, colsample_bylevel, num_parallel_tree... hyperparameters in the grid search unless you define a custom function, which I found cumbersome. Caret has the advantage of automatic preprocessing, automatic up/down sampling within CV, etc.
Setting the seed outside the xgb.cv call will pick the same folds for CV but not the same trees at each round, so you will end up with a different model. Even if you set the seed inside the xgb.cv function call, there is no guarantee you will end up with the same model, but there's a much higher chance (it depends on threads, type of model...; I for one like the uncertainty and found it to have little impact on the result).
You can use xgb.cv and set prediction = TRUE.
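To make explicit what prediction = TRUE provides, here is a sketch of out-of-fold prediction written by hand in base R. It swaps xgboost for glm on simulated data purely so that it runs without extra packages; the data and all variable names are hypothetical.

```r
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$outcome <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))

k <- 5
fold_id  <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
oof_pred <- rep(NA_real_, n)                         # one prediction per row

for (f in seq_len(k)) {
  held_out <- fold_id == f
  # Fit on the other k - 1 folds, then predict only the held-out fold.
  fit <- glm(outcome ~ x1 + x2, data = dat[!held_out, ], family = binomial())
  oof_pred[held_out] <- predict(fit, newdata = dat[held_out, ],
                                type = "response")
}

print(head(oof_pred))
```

Each row is predicted by a model that never saw it, and the predictions stay in the original row order, which is the same layout as xgboostModelCV$pred.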

Bayesian Optimization in r - results

I have some Bayesian optimization code that prints results with a Value and the selected parameters. My question is: how is the best combination chosen? In my case the Value is the minimum RMSE, and it was lower in a different round than the one reported as best.
Code:
library(xgboost)
library(rBayesianOptimization)
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv_folds <- KFold(
agaricus.train$label
, nfolds = 5
, stratified = TRUE
, seed = 5000)
xgb_cv_bayes <- function(eta, max.depth, min_child_weight, subsample,colsample_bytree ) {
cv <- xgb.cv(params = list(booster = "gbtree"
# , eta = 0.01
, eta = eta
, max_depth = max.depth
, min_child_weight = min_child_weight
, colsample_bytree = colsample_bytree
, subsample = subsample
#, colsample_bytree = 0.3
, lambda = 1
, alpha = 0
, objective = "reg:linear"
, eval_metric = "rmse")
, data = dtrain
, nround = 1000
, folds = cv_folds
, prediction = TRUE
, showsd = TRUE
, early_stopping_rounds = 10
, maximize = TRUE
, verbose = 0
, finalize = TRUE)
list(Score = cv$evaluation_log[,min(test_rmse_mean)]
,Pred = cv$pred
, cb.print.evaluation(period = 1))
}
cat("Calculating Bayesian Optimum Parameters\n")
OPT_Res <- BayesianOptimization(xgb_cv_bayes
, bounds = list(
eta = c(0.001, 0.03)
, max.depth = c(3L, 10L)
, min_child_weight = c(3L, 10L)
, subsample = c(0.8, 1)
, colsample_bytree = c(0.5, 1))
, init_grid_dt = NULL
, init_points = 10
, n_iter = 200
, acq = "ucb"
, kappa = 3
, eps = 1.5
, verbose = TRUE)
From help(BayesianOptimization), the parameter FUN:
The function to be maximized. This Function should return a named list
with 2 components. The first component "Score" should be the metrics
to be maximized, and the second component "Pred" should be the
validation/cross-validation prediction for ensembling/stacking.
Your function returns Score = cv$evaluation_log[,min(test_rmse_mean)]. You want to minimize this value, not maximize it. Try returning the negative, so that maximizing the returned value minimizes the RMSE:
Score = -cv$evaluation_log[,min(test_rmse_mean)]
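The same sign trick can be checked with any maximizer. As a toy illustration in base R, the made-up loss function below stands in for a cross-validated RMSE, and optimize plays the role of the maximizing optimizer.

```r
# Made-up loss with its minimum at x = 2, standing in for a CV RMSE curve.
loss <- function(x) (x - 2)^2 + 1

# A maximizer applied to the negated loss finds the minimizer of the loss.
opt <- optimize(function(x) -loss(x), interval = c(0, 5), maximum = TRUE)

best_x    <- opt$maximum     # argument that maximizes -loss
best_loss <- -opt$objective  # flip the sign back to read the loss

print(c(best_x = best_x, best_loss = best_loss))
```

After negating the score, the best round reported by the optimizer is the one with the largest (least negative) value, which is the round with the smallest RMSE.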

R caretEnsemble building models on different feature subsets

I wonder if there is a way to combine the predictions of two different models that are built on two different input feature sets, for example the first on features 1:10 and the second on features 11:20, and combine them with the caretEnsemble or caretStack function.
I am trying:
data("mtcars")
head(mtcars)
library(caret)
library(caretEnsemble)
library(glmnet)
library(gbm)
ma_control <- trainControl(method = "cv",
number = 2,
summaryFunction = RMSE,
verboseIter = TRUE,
savePredictions = TRUE)
subset1 <- mtcars[,c(2:3,1)]
subset2 <- mtcars[,c(4:5,1)]
classification_formula1 <- as.formula(paste("mpg" ,"~",
paste(names(subset1)[!names(subset1)=='mpg'],collapse="+")))
classification_formula2 <- as.formula(paste("mpg" ,"~",
paste(names(subset2)[!names(subset2)=='mpg'],collapse="+")))
emf_tuneGrid_list <- NULL;
emf_tuneGrid_list$glmnet1_tuneGrid <- expand.grid(alpha = 1.0 ,lambda = 1)
emf_tuneGrid_list$gbm2_tuneGrid <- expand.grid(interaction.depth = 1, n.trees = 101 ,
shrinkage = 0.5 , n.minobsinnode = 5)
emf_model_list <- caretList (
trControl=ma_control, metric = "RMSE",
tuneList=list(
glmnet1= caretModelSpec(method='glmnet', classification_formula = classification_formula1 , data = subset1 , tuneGrid=emf_tuneGrid_list$glmnet1_tuneGrid),
gbm2 = caretModelSpec(method='gbm', classification_formula = classification_formula2, data = subset2 , tuneGrid=emf_tuneGrid_list$gbm2_tuneGrid , verbose = FALSE)
)
)
But I get this error:
Error in extractCaretTarget.default(...) :
argument "y" is missing, with no default
