I need to perform parameter optimization on a GBM model with H2O in R. I am relatively new to H2O, and I think I need to convert ntrees and learn_rate (below) into an H2O vector before running the loop below.
How do I perform this operation?
Thanks!
ntrees <- c(100,200,300,400)
learn_rate <- c(1,0.5,0.1)
for (i in ntrees){
for j in learn_rate{
n = ntrees[i]
l= learn_rate[j]
gbm_model <- h2o.gbm(features, label, training_frame = train, validation_frame = valid, ntrees=ntrees[[i]],max_depth = 5,learn_rate=learn_rate[j])
print(c(ntrees[i],learn_rate[j],h2o.mse(h2o.performance(gbm_model, valid = TRUE))))
}
}
You can use h2o.grid() to do your grid search:
# specify your hyper parameters
hyper_params = list( ntrees = c(100,200,300,400), learn_rate = c(1,0.5,0.1) )
# then build your grid
grid <- h2o.grid(
  ## hyper parameters
  hyper_params = hyper_params,
  ## which algorithm to run
  algorithm = "gbm",
  ## identifier for the grid, to later retrieve it
  grid_id = "my_grid",
  ## standard model parameters
  x = features,
  y = label,
  training_frame = train,
  validation_frame = valid,
  ## set a seed for reproducibility
  seed = 1234)
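Once the grid finishes, you can retrieve it and sort the models by a metric; for example, here is a sketch that ranks them by validation MSE:
# get the grid results, sorted so the lowest MSE comes first
sorted_grid <- h2o.getGrid(grid_id = "my_grid", sort_by = "mse", decreasing = FALSE)
print(sorted_grid)
# pull out the best model for further inspection
best_model <- h2o.getModel(sorted_grid@model_ids[[1]])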
You can read more about how h2o.grid() works in the R documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-r/h2o_package.pdf
Lauren's answer, to use grids, is the best one here. I'll just quickly point out that what you have written is a usable approach, and one you can fall back on when grids don't do something you need.
Your example didn't include any data (see https://stackoverflow.com/help/mcve), so I couldn't run it, but I corrected the couple of syntax issues I noticed (R's for-in loop gives you the value directly, not the index, and the second for loop needs parentheses):
ntrees <- c(100,200,300,400)
learn_rate <- c(1,0.5,0.1)
for (n in ntrees) {
  for (l in learn_rate) {
    gbm_model <- h2o.gbm(
      features, label, training_frame = train, validation_frame = valid,
      ntrees = n, max_depth = 5, learn_rate = l
    )
    print(c(n, l, h2o.mse(h2o.performance(gbm_model, valid = TRUE))))
  }
}
An example of when you'd use nested loops like this is when you want to skip certain combinations. For example, you might decide to test a learn rate of 0.1 only with ntrees of 100, which would then look like this:
ntrees <- c(100,200,300,400)
learn_rate <- c(1,0.5,0.1)
for (n in ntrees) {
  for (l in learn_rate) {
    if (l == 0.1 && n > 100) next  # skip when n is 200, 300, 400
    gbm_model <- h2o.gbm(
      features, label, training_frame = train, validation_frame = valid,
      ntrees = n, max_depth = 5, learn_rate = l
    )
    print(c(n, l, h2o.mse(h2o.performance(gbm_model, valid = TRUE))))
  }
}
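Whichever loop you use, instead of just printing each result you could collect them in a data frame for easier comparison afterwards; this is just a small variation on the code above:
results <- data.frame()
for (n in ntrees) {
  for (l in learn_rate) {
    gbm_model <- h2o.gbm(
      features, label, training_frame = train, validation_frame = valid,
      ntrees = n, max_depth = 5, learn_rate = l
    )
    mse <- h2o.mse(h2o.performance(gbm_model, valid = TRUE))
    results <- rbind(results, data.frame(ntrees = n, learn_rate = l, mse = mse))
  }
}
results[order(results$mse), ]  # best combinations first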
I do not have a clear idea of how labels for the softmax classifier should be shaped.
What I could gather from my experiments is that one option is a scalar label giving the index of the class in the probability output, while another is a 2D label whose rows are class probabilities, i.e. a one-hot encoded variable like c(1, 0, 0).
What puzzles me, though, is that:
I can use scalar label values that go beyond indexing, like 4 in my example below, without any warning or error. Why is that?
When my label is a negative scalar or an array with a negative value, the model converges to a uniform probability distribution over classes. For example, is it expected that actor_train.y = matrix(c(0, -1, 0), ncol = 1) results in equal probabilities in the softmax output?
I am trying to use the MXNet softmax classifier for policy gradient reinforcement learning, and my negative rewards lead to the issue above: uniform probabilities. Is that expected?
require(mxnet)
actor_initializer <- mx.init.Xavier(rnd_type = "gaussian",
factor_type = "avg",
magnitude = 0.0001)
actor_nn_data <- mx.symbol.Variable('data')
actor_nn_label <- mx.symbol.Variable('label')
device.cpu <- mx.cpu()
# NN architecture
actor_fc3 <- mx.symbol.FullyConnected(
data = actor_nn_data
, num_hidden = 3 )
actor_output <- mx.symbol.SoftmaxOutput(
data = actor_fc3
, label = actor_nn_label
, name = 'actor' )
crossentfunc <- function(label, pred)
{
  -sum(label * log(pred))
}
actor_loss <- mx.metric.custom(
feval = crossentfunc
, name = "log-loss"
)
# initialize NN
actor_train.x <- matrix(rnorm(11), nrow = 1)
actor_train.y = 0 #1 #2 #3 #-3 # matrix(c(0, 0, -1), ncol = 1)
rm(actor_model)
actor_model <- mx.model.FeedForward.create(
symbol = actor_output,
X = actor_train.x,
y = actor_train.y,
ctx = device.cpu,
num.round = 100,
array.batch.size = 1,
optimizer = 'adam',
eval.metric = actor_loss,
clip_gradient = 1,
wd = 0.01,
initializer = actor_initializer,
array.layout = "rowmajor" )
predict(actor_model, actor_train.x, array.layout = "rowmajor")
It is quite strange to me, but I found a solution.
I changed the optimizer from optimizer = 'adam' to optimizer = 'rmsprop', and the NN started to converge as expected with negative targets. I ran simulations in R using a simple NN and the optim function and got the same result.
It looks like adam or SGD can behave badly in this kind of multinomial classification. I also used to get stuck on the fact that those optimizers did not converge to a perfect solution on just one example, while rmsprop does. Be aware!
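Concretely, the only change to the training call from the question is the optimizer argument; everything else stays as posted:
actor_model <- mx.model.FeedForward.create(
  symbol = actor_output,
  X = actor_train.x,
  y = actor_train.y,
  ctx = device.cpu,
  num.round = 100,
  array.batch.size = 1,
  optimizer = 'rmsprop',        # was 'adam'
  eval.metric = actor_loss,
  clip_gradient = 1,
  wd = 0.01,
  initializer = actor_initializer,
  array.layout = "rowmajor")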
I am trying to run repeated 10-fold CV (alpha and lambda) using glmnet / glmnetUtils. My proposed workflow is to:
a) fit a proposed model at 11 values of alpha,
b) run the process X (in this case, 10) times,
c) average the results, and
d) fit a final model with the best combination of alpha and lambda (s = "lambda.1se").
To address a-c, I used the code below; however, the results from the 10 iterations are exactly the same.
library(glmnet)
library(glmnetUtils)
library(doParallel)
library(plyr)   # for ldply()
library(dplyr)  # for bind_rows()
data(BinomialExample)
# Create alpha sequence; fix folds
alpha <- seq(.5, 1, .05)
set.seed(1)
folds <- sample(1:10, size = length(y), replace = TRUE)
# Determine optimal combination of alpha and lambda; extract lowest CV error and associated lambda at each alpha
extractGlmnetInfo <- function(object)
{
# Find lambdas
lambda1se <- object$lambda.1se
# Determine where lambdas fall in path
which1se <- which(object$lambda == lambda1se)
# Create data frame with selected lambdas and corresponding error
data.frame(lambda.1se = lambda1se, cv.1se = object$cvm[which1se])
}
#Run glmnet
cl <- makeCluster(detectCores())
registerDoParallel(cl)
enet <- foreach(i = 1:10,
                .inorder = FALSE,
                .multicombine = TRUE,
                .packages = "glmnetUtils") %dopar%
  {
    cv <- cva.glmnet(x, y,
                     foldid = folds,
                     alpha = alpha,
                     family = "binomial",
                     parallel = TRUE)
  }
stopCluster(cl)
# Extract smallest CV error and lambda at each alpha for each iteration of 10-fold CV
# Calculate means (across iterations) of lowest CV error and associated lambdas for each alpha
cv.rep1 <- ldply(enet[[1]]$modlist, extractGlmnetInfo)
cv.rep2 <- ldply(enet[[2]]$modlist, extractGlmnetInfo)
cv.rep3 <- ldply(enet[[3]]$modlist, extractGlmnetInfo)
cv.rep4 <- ldply(enet[[4]]$modlist, extractGlmnetInfo)
cv.rep5 <- ldply(enet[[5]]$modlist, extractGlmnetInfo)
cv.rep6 <- ldply(enet[[6]]$modlist, extractGlmnetInfo)
cv.rep7 <- ldply(enet[[7]]$modlist, extractGlmnetInfo)
cv.rep8 <- ldply(enet[[8]]$modlist, extractGlmnetInfo)
cv.rep9 <- ldply(enet[[9]]$modlist, extractGlmnetInfo)
cv.rep10 <- ldply(enet[[10]]$modlist, extractGlmnetInfo)
cv.rep <- bind_rows(cv.rep1, cv.rep2, cv.rep3, cv.rep4, cv.rep5, cv.rep6, cv.rep7, cv.rep8, cv.rep9, cv.rep10)
cv.rep <- data.frame(cbind(alpha, cv.rep))
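For step (c), this is a sketch of how I would average the lowest CV error and associated lambda across the 10 iterations, separately for each alpha (it assumes dplyr is loaded):
cv.mean <- cv.rep %>%
  group_by(alpha) %>%
  summarise(mean.lambda.1se = mean(lambda.1se),
            mean.cv.1se = mean(cv.1se))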
Questions
My understanding is that the folds should be fixed when cross-validating over alpha. Therefore, should I set.seed() multiple times to generate different folds for each iteration and run each iteration separately, rather than looping over them? For example:
# Set folds for first iteration
set.seed(1)
folds1 <- sample(1:10, size = length(y), replace = TRUE)
# Run first iteration
enet1 <- cva.glmnet(x, y,
foldid = folds1,
alpha = alpha,
family = "binomial")
# Set folds for second iteration
set.seed(2)
folds2 <- sample(1:10, size = length(y), replace = TRUE)
# Run second iteration
enet2 <- cva.glmnet(x, y,
foldid = folds2,
alpha = alpha,
family = "binomial")
Or is there a way to fix the folds and still loop over the iterations, thereby making use of parallel processing? (See the sketch after these questions for what I have in mind.)
Regarding the first option above (setting the seed manually for each iteration), how do I determine which configuration of folds I should use to fit the final model with the optimal combination of alpha and lambda? Is the decision arbitrary?
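To make the "fix the folds and loop over the iterations" idea concrete, here is a rough sketch of what I have in mind: generate a reproducible fold assignment inside each iteration of the parallel loop (whether this is statistically appropriate is exactly what I am asking):
cl <- makeCluster(detectCores())
registerDoParallel(cl)
enet <- foreach(i = 1:10,
                .inorder = FALSE,
                .packages = "glmnetUtils") %dopar%
  {
    set.seed(i)  # different, but reproducible, folds in each iteration
    folds_i <- sample(1:10, size = length(y), replace = TRUE)
    cva.glmnet(x, y,
               foldid = folds_i,
               alpha = alpha,
               family = "binomial")
  }
stopCluster(cl)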
NB. I am not using caret for this specific task.
This is with reference to this answer on implementation of Bayesian Optimization. I am unable to understand the following R-code that defines a function xgb.cv.bayes(). The code is as follows:
xgb.cv.bayes <- function(max.depth, min_child_weight, subsample, colsample_bytree, gamma){
cv <- xgb.cv(params = list(booster = 'gbtree', eta = 0.05,
max_depth = max.depth,
min_child_weight = min_child_weight,
subsample = subsample,
colsample_bytree = colsample_bytree,
gamma = gamma,
lambda = 1, alpha = 0,
objective = 'binary:logistic',
eval_metric = 'auc'),
data = data.matrix(df.train[,-target.var]),
label = as.matrix(df.train[, target.var]),
nround = 500, folds = cv_folds, prediction = TRUE,
showsd = TRUE, early.stop.round = 5, maximize = TRUE,
verbose = 0
)
list(Score = cv$dt[, max(test.auc.mean)],
Pred = cv$pred)
}
I am unable to understand the following part of the code, which comes after the closing parenthesis of xgb.cv():
list(Score = cv$dt[, max(test.auc.mean)],
Pred = cv$pred)
Or, very briefly, I do not understand the following syntax:
xgb.cv.bayes <- function(max.depth, min_child_weight, subsample, colsample_bytree, gamma){
  cv <- xgb.cv(...)
  list(...)
}
I would be grateful for help understanding this R syntax, and for pointers to where I can find more examples of it.
In R, the value of the last expression in a function is automatically its return value. So the function you presented has exactly two steps:
1. compute the result of xgb.cv(...) and store it in the variable cv,
2. create a list with two entries (Score and Pred) whose values are extracted from cv.
Since the expression that creates the list is the last expression in the function, the list is automatically the return value. So if you execute test <- xgb.cv.bayes(...), you can then access test$Score and test$Pred.
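A tiny self-contained example (nothing to do with xgboost) shows the same pattern:
f <- function(x) {
  m <- mean(x)                 # intermediate result, stored but not returned
  list(Score = max(x) - m,     # this list is the last expression in the body,
       Pred  = x - m)          # so it becomes the return value of f()
}
res <- f(c(1, 5, 3))
res$Score   # 2
res$Pred    # -2  2  0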
Does this answer your question?
I wrote a function within lapply to fit a GAM (with splines) for each element in a vector of response variables within a data frame. I opted to use caret to fit the models instead of directly using mgcv or the gam package because I would like to eventually split my data into a train/test set for validation and use various resampling techniques. For now, I simply have the trainControl method set to 'none' like so:
# Set resampling method
# tc <- trainControl(method = "boot", number = 100)
# tc <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
tc <- trainControl(method = "none")
fm <- lapply(group, function(x) {
printFormula <- paste(x, "~", inf.factors)
inputFormula <- as.formula(printFormula)
# Partition input data for model training and testing
# dpart <- createDataPartition(mdata[,x], times = 1, p = 0.7, list = FALSE)
# train <- mdata[ data.partition, ]
# test <- mdata[ -data.partition, ]
cat("Fitting:", printFormula, "\n")
# gam(inputFormula, family = binomial(link = "logit"), data = mdata)
train(inputFormula, family = binomial(link = "logit"), data = mdata, method = "gam",
trControl = tc)
})
When I execute this code, I receive the following error:
Error in train.default(x, y, weights = w, ...) :
Only one model should be specified in tuneGrid with no resampling
If I re-run the code in debugging mode, I can find where caret stops the training process:
if (trControl$method == "none" && nrow(tuneGrid) != 1)
stop("Only one model should be specified in tuneGrid with no resampling")
Clearly the train function fails because of the second condition, but when I look up the tuning parameters for a GAM (with splines), there are only options for feature selection (which I am not interested in, since I want to keep all the predictors in the model) and the method. Consequently, I do not pass a tuneGrid data frame when I call train. Is this why the model is failing in this way? Which parameter would I provide, and what would the tuneGrid look like?
I should add that the model trains successfully when I use bootstrapping or k-fold CV; however, these resampling methods take much longer to run, and I do not need them yet.
Any help on this issue would be appreciated!
For that model, the tuning grid loops over two values of the select parameter:
> getModelInfo("gam", regex = FALSE)[[1]]$grid
function(x, y, len = NULL, search = "grid") {
if(search == "grid") {
out <- expand.grid(select = c(TRUE, FALSE), method = "GCV.Cp")
} else {
out <- data.frame(select = sample(c(TRUE, FALSE), size = len, replace = TRUE),
method = sample(c("GCV.Cp", "ML"), size = len, replace = TRUE))
}
out[!duplicated(out),]
}
You should use something like tuneGrid = data.frame(select = FALSE, method = "GCV.Cp") to evaluate only a single model (as the error message says).
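Applied to your train call, that would look something like this sketch (keeping your formula, data, and trControl exactly as in the question):
# fit a single GAM with no resampling and a fixed tuning grid
train(inputFormula, family = binomial(link = "logit"), data = mdata,
      method = "gam",
      trControl = tc,  # trainControl(method = "none")
      tuneGrid = data.frame(select = FALSE, method = "GCV.Cp"))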
I want to perform leave-subject-out cross-validation with R caret (cf. this example), but only use a subset of the data in training for creating the CV models. Still, the left-out CV partition should be used as a whole, as I need to test on all the data of a left-out subject (even if it is millions of samples that cannot be used in training due to computational restrictions).
I've created a minimal 2-class classification example using the subset and index parameters of caret::train and caret::trainControl to achieve this. From my observation this should solve the problem, but I have a hard time actually ensuring that the evaluation is still done in a leave-subject-out way. Maybe someone with experience in this task could shed some light on it:
library(plyr)
library(caret)
library(pROC)
library(ggplot2)
# with diamonds we want to predict cut and look at results for different colors = subjects
d <- diamonds
d <- d[d$cut %in% c('Premium', 'Ideal'),] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1,5,6,8:10)
indexes_labels <- 2
# population independent CV indexes for trainControl
index <- llply(unique(d[,3]), function(cls) c(which(d[,3]!=cls)))
names(index) <- paste0('sub_', unique(d[,3]))
str(index) # indexes used for training models with CV = OK
m3 <- train(x = d[,indexes_data],
y = d[,indexes_labels],
method = 'glm',
metric = 'ROC',
subset = sample(nrow(d), 5000), # does this subset the data used for training and obtaining models, but not the left out partition used for estimating CV performance?
trControl = trainControl(returnResamp = 'final',
savePredictions = T,
classProbs = T,
summaryFunction = twoClassSummary,
index = index))
str(m3$resample) # all samples used once = OK
# performance over all subjects
myRoc <- roc(predictor = m3$pred[,3], response = m3$pred$obs)
plot(myRoc, main = 'all')
# performance for individual subjects
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
pred_sub <- m3$pred[m3$pred$Resample==cls,]
myRoc <- roc(predictor = pred_sub[,3], response = pred_sub$obs)
plot(myRoc, main = cls)
} )
Thanks for your time!
Using both the index and indexOut parameters in caret::trainControl at the same time seems to do the trick (thanks to Max for the hint in this question). Here is the updated code:
library(plyr)
library(caret)
library(pROC)
library(ggplot2)
str(diamonds)
# with diamonds we want to predict cut and look at results for different colors = subjects
d <- diamonds
d <- d[d$cut %in% c('Premium', 'Ideal'),] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1,5,6,8:10)
indexes_labels <- 2
# population independent CV partitions for training and left out partitions for evaluation
indexes_populationIndependence_subjects <- 3
index <- llply(unique(d[,indexes_populationIndependence_subjects]), function(cls) c(which(d[,indexes_populationIndependence_subjects]!=cls)))
names(index) <- paste0('sub_', unique(d[,indexes_populationIndependence_subjects]))
indexOut <- llply(index, function(part) (1:nrow(d))[-part])
names(indexOut) <- paste0('sub_', unique(d[,indexes_populationIndependence_subjects]))
# subsample partitions for training
index <- llply(index, function(i) sample(i, 1000))
m3 <- train(x = d[,indexes_data],
y = d[,indexes_labels],
method = 'glm',
metric = 'ROC',
trControl = trainControl(returnResamp = 'final',
savePredictions = T,
classProbs = T,
summaryFunction = twoClassSummary,
index = index,
indexOut = indexOut))
m3$resample # seems OK
str(m3$pred) # seems OK
myRoc <- roc(predictor = m3$pred[,3], response = m3$pred$obs)
plot(myRoc, main = 'all')
# analyze results per subject
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
pred_sub <- m3$pred[m3$pred$Resample==cls,]
myRoc <- roc(predictor = pred_sub[,3], response = pred_sub$obs)
plot(myRoc, main = cls)
} )
Still, I'm not absolutely sure whether this actually does the estimation in a population-independent way, so if anybody knows the details, please share your thoughts!
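One quick sanity check on the partitioning itself (a sketch; it only verifies the index sets, not what caret does with them internally) is to confirm that the training and held-out indexes never overlap and that every held-out partition contains exactly one subject:
# training and held-out indexes should be disjoint
stopifnot(all(mapply(function(tr, te) length(intersect(tr, te)) == 0,
                     index, indexOut)))
# each held-out partition should contain exactly one subject (color)
sapply(indexOut, function(te)
  length(unique(d[[indexes_populationIndependence_subjects]][te])))
# expected: 1 for every sub_* entry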