Run H2O algorithms inside a foreach loop? - R

I naively thought it would be straightforward to make multiple calls to h2o.gbm in parallel inside a foreach loop, but I got a strange error.
Error in { :
task 3 failed - "java.lang.AssertionError: Can't unlock: Not locked!"
Code below:
library(foreach)
library(doParallel)
library(doSNOW)

Xtr.hf  = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar%
  {
    h2o.init(ip = "localhost", nthreads = 2, max_mem_size = "5G")
    for (j in 1:3) {
      bm2 <- h2o.gbm(
        training_frame = Xtr.hf,
        validation_frame = Xval.hf,
        x = 2:ncol(Xtr.hf),
        y = 1,
        distribution = "gaussian",
        ntrees = 100,
        max_depth = 3,
        learn_rate = 0.1,
        nfolds = 1)
    }
    h2o.shutdown(prompt = FALSE)
    return(iname)
  }

stopCluster(cl)

NOTE: This is unlikely to be a good use of R's parallel foreach, but I'll answer your question first, then explain why. (BTW, when I use "cluster" in this answer I'm referring to an H2O cluster (even if it is just on your local machine), and not an R "cluster".)
I've re-written your code, assuming the intention was to have a single H2O cluster, where all the models are to be made:
library(foreach)
library(doParallel)
library(doSNOW)
library(h2o)

h2o.init(ip = "localhost", nthreads = -1, max_mem_size = "5G")
Xtr.hf  = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar%
  {
    for (j in 1:3) {
      bm2 <- h2o.gbm(
        training_frame = Xtr.hf,
        validation_frame = Xval.hf,
        x = 2:ncol(Xtr.hf),
        y = 1,
        distribution = "gaussian",
        ntrees = 100,
        max_depth = 3,
        learn_rate = 0.1,
        nfolds = 1)
      #TODO: do something with bm2 here?
    }
    return(iname)  #???
  }

stopCluster(cl)
I.e. in outline form:
Start H2O, and load Xtr and Xval into it
Start 6 threads in your R client
In each thread, make 3 GBM models (one after the other)
I dropped the h2o.shutdown() command, guessing that you didn't intend that (when you shut down the H2O cluster, the models you just made get deleted). And I've highlighted where you might want to be doing something with your model. And I've given H2O all the threads on your machine (that is the nthreads=-1 in h2o.init()), not just 2.
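As a sketch of one option (my addition, not from the original answer; the "gbm_models" directory and the results list are placeholders), you could save each model to disk and keep its validation RMSE. h2o.saveModel() and h2o.rmse() are standard H2O R functions:
results <- list()
for (j in 1:3) {
  bm2 <- h2o.gbm(training_frame = Xtr.hf, validation_frame = Xval.hf,
                 x = 2:ncol(Xtr.hf), y = 1, distribution = "gaussian",
                 ntrees = 100, max_depth = 3, learn_rate = 0.1)
  h2o.saveModel(bm2, path = "gbm_models")      # persist the model ("gbm_models" is a placeholder)
  results[[j]] <- h2o.rmse(bm2, valid = TRUE)  # keep the validation RMSE
}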
You can make H2O models in parallel, but it is generally a bad idea, as they end up fighting for resources. Better to do them one at a time, and rely on H2O's own parallel code to spread the computation over the cluster. (When the cluster is a single machine this tends to be very efficient.)
The fact that you've gone to the trouble of making a parallel loop in R makes me think you've missed the way H2O works: it is a server written in Java, and R is just a lightweight client that sends it API calls. The GBM calculations are not done in R; they are all done in Java code.
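To make that concrete (my own sketch, not part of the original answer; the depths vector is illustrative), the simple alternative is a plain serial loop in the R client, with each h2o.gbm call using all of the H2O cluster's threads:
depths <- c(3, 5, 7)   # illustrative hyperparameter values
models <- list()
for (d in depths) {
  models[[as.character(d)]] <- h2o.gbm(
    training_frame = Xtr.hf, validation_frame = Xval.hf,
    x = 2:ncol(Xtr.hf), y = 1, distribution = "gaussian",
    ntrees = 100, max_depth = d, learn_rate = 0.1)
}
(h2o.grid() offers a built-in way to run the same kind of sweep over a hyperparameter grid.)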
The other way to interpret your code is to run multiple instances of H2O, i.e. multiple H2O clusters. This might be a good idea if you have a set of machines, and you know the H2O algorithm is not scaling very well across a multi-node cluster. Doing it on a single machine is almost certainly a bad idea. But, for the sake of argument, this is how you do it (untested):
library(foreach)
library(doParallel)
library(doSNOW)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar%
  {
    library(h2o)
    h2o.init(ip = "localhost", port = 54321 + (i*2), nthreads = 2, max_mem_size = "5G")
    Xtr.hf  = as.h2o(Xtr)
    Xval.hf = as.h2o(Xval)
    for (j in 1:3) {
      bm2 <- h2o.gbm(
        training_frame = Xtr.hf,
        validation_frame = Xval.hf,
        x = 2:ncol(Xtr.hf),
        y = 1,
        distribution = "gaussian",
        ntrees = 100,
        max_depth = 3,
        learn_rate = 0.1,
        nfolds = 1)
      #TODO: save bm2 here
    }
    h2o.shutdown(prompt = FALSE)
    return(iname)  #???
  }

stopCluster(cl)
Now the outline is:
Create 6 R threads
In each thread, start an H2O cluster that is running on localhost but on a port unique to that cluster. (The i*2 is because each H2O cluster is actually using two ports.)
Upload your data to the H2O cluster (i.e. this will be repeated 6 times, once for each cluster).
Make 3 GBM models, one after the other.
Do something with those models.
Kill the H2O cluster for the current thread.
If you have 12+ threads on your machine, and 30+ GB memory, and the data is relatively small, this will be roughly as efficient as using one H2O cluster and making 12 GBM models in serial. If not, I believe it will be worse. (But, if you have pre-started 6 H2O clusters on 6 remote machines, this might be a useful approach - I must admit I'd been wondering how to do this, and using the parallel library for it had never occurred to me until I saw your question!)
NOTE: as of the current version (3.10.0.6), I know the above code won't work, as there is a bug in h2o.init() that effectively means it is ignoring the port. (Workarounds: either pre-start all 6 H2O clusters on the commandline, or set the port in an environment variable.)
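As a rough sketch of the pre-starting workaround (my addition; the ports, thread counts and cloud names below are illustrative), you launch each instance outside R and then have each foreach worker merely attach to it, since startH2O = FALSE tells h2o.init() to connect rather than launch:
# started beforehand, outside R, one instance per worker, e.g.:
#   java -jar h2o.jar -port 54323 -nthreads 2 -name h2o_1
#   java -jar h2o.jar -port 54325 -nthreads 2 -name h2o_2
#   ...
# inside each foreach worker, only connect to the pre-started instance:
h2o.init(ip = "localhost", port = 54321 + (i*2), startH2O = FALSE)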

Related

Using "snow" parallel operations in bootstrap_parameters/model on merMod object (R)

I've been using bootstrap_parameters (from the parameters package in R) on generalised linear mixed models produced using glmmTMB. These work fine without parallel processing (parallel = "no") and also work fine on my old and slow Mac using parallel = "multicore". I'm working on a new PC (Windows OS), so I need to use parallel = "snow"; however, I get the following error:
system.time(b <- bootstrap_parameters(m1, iterations = 10, parallel = "snow", n_cpus = 6))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 0, 1
In addition: Warning message:
In lme4::bootMer(model, boot_function, nsim = iterations, verbose = FALSE, :
some bootstrap runs failed (10/10)
Timing stopped at: 0.89 0.3 7.11
If I select n_cpus = 1, the function works, and if I feed bootstrap_parameters or bootstrap_model an lm object (where the underlying code uses boot::boot) it also works fine. I have narrowed the problem down to bootMer (lme4). I suspect the dataset exported using clusterExport is landing in an environment different from the one where the clustered bootMer function is looking. The following is a reproducible example:
library(glmmTMB)
library(parameters)
library(parallel)
library(lme4)
m1 <- glmmTMB(count ~ mined + (1|site), zi = ~mined,
              family = poisson, data = Salamanders)
summary(m1)
cl <- makeCluster(6)
clusterEvalQ(cl, library("lme4"))
clusterExport(cl, varlist = c("Salamanders"))
system.time(b <- bootstrap_parameters(m1, iterations = 10, parallel = "snow", n_cpus = 6))
stopCluster(cl)
Any ideas on solving this problem?
You need to clusterEvalQ(cl, library("glmmTMB")). From https://github.com/glmmTMB/glmmTMB/issues/843:
This issue is more or less resolved by a documentation patch (we need to explicitly clusterEvalQ(cl, library("glmmTMB"))). The only question is whether we can make this any easier for users. There are two problems here: (1) when the user sets up their own cluster rather than leaving it to be done in bootMer, more explicit clusterEvalQ/clusterExport stuff is necessary in any case; (2) bootMer internally does parallel::clusterExport(cl, varlist=getNamespaceExports("lme4")) if it is setting up the cluster (not if the cluster is set up and passed to bootMer by the user), but we wouldn't expect it to extend the same courtesy to glmmTMB ...
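Applied to the reproducible example above (my own sketch of the fix; everything except the extra clusterEvalQ line comes straight from the question):
cl <- makeCluster(6)
clusterEvalQ(cl, library("lme4"))
clusterEvalQ(cl, library("glmmTMB"))   # the missing piece
clusterExport(cl, varlist = c("Salamanders"))
b <- bootstrap_parameters(m1, iterations = 10, parallel = "snow", n_cpus = 6)
stopCluster(cl)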

Combining mclapply and registerDoMC in a function

I am running a function that utilizes biganalytics::bigkmeans and xgboost (through caret). Both of these support parallel processing if it is registered first by calling registerDoMC(cores = 4). However, to utilize the power of the 64-core machine I have access to without adding too much parallel overhead, I want to run the following function in 16 instances (for a total of 64 processes).
example = function(x) {
  biganalytics::bigkmeans(matrix(rnorm(10*5, 1000, 1), ncol = 500))
  mod <- train(Class ~ ., data = df,
               method = "xgbTree", tuneLength = 50,
               trControl = trainControl(search = "random"))
}
set.seed(1)
dat1 <- twoClassSim(1000)
dat2 <- twoClassSim(1001)
dat3 <- twoClassSim(1002)
dat4 <- twoClassSim(1003)
list <- list(dat1, dat2, dat3, dat4)
mclapply(list, example, mc.cores = 16)
It is important that I stick to mclapply because I need a shared-memory parallel backend so that I don't run out of RAM in my actual use with data sets over 50 GB.
My question is, where would I do registerDoMC in this case?
Thanks!
Using nested parallelism isn't often a good idea, but if the outer loop has many fewer iterations than cores, it might be.
You can load doMC and call registerDoMC inside the function that mclapply runs, to prepare the workers to call train. But note that it doesn't make sense to call mclapply with more workers than tasks, since otherwise some of the workers won't have any work to do.
You could do something like this:
example <- function(dat, nw) {
  library(doMC)
  registerDoMC(nw)
  # call train function on dat...
}

# This assumes that length(datlist) is much less than ncores
ncores <- 64
m <- length(datlist)
nw <- ncores %/% m
mclapply(datlist, example, nw, mc.cores = m)
If length(datlist) is 4, then each "train" task will use 16 workers. You can certainly use fewer workers per "train" task, but you probably shouldn't use more.
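For illustration only (my own filling-in of the "call train" comment, reusing the train call from the question and substituting dat for the undefined df; allowParallel is caret's switch for using the registered backend):
library(caret)
library(doMC)
library(parallel)

example <- function(dat, nw) {
  registerDoMC(nw)   # nw foreach workers for this task's train() call
  train(Class ~ ., data = dat,
        method = "xgbTree", tuneLength = 50,
        trControl = trainControl(search = "random", allowParallel = TRUE))
}

datlist <- list(dat1, dat2, dat3, dat4)   # from the question
models  <- mclapply(datlist, example, nw = 16, mc.cores = length(datlist))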

R, h2o and foreach: java.lang.IllegalStateException

In a different post here I asked for help on parallel processing a call to h2o.gbm inside a foreach loop.
Following the answers provided, I ran a script similar to this example:
library(h2o)
h2o.init()   # connect to (or start) a local H2O instance
data(iris)
data <- as.h2o(iris)
ss <- h2o.splitFrame(data)
gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = ss[[1]])
h2o.saveModel(gbm, path = "some path")
h2o.shutdown(prompt = FALSE)

library(foreach)
library(doParallel)

# set up parallel backend to use 12 processors
cl <- makeCluster(12)
registerDoParallel(cl)

# loop
df4 <- foreach(i = seq(20), .combine = rbind) %dopar% {
  library(h2o)
  port <- 54321 + 3*i
  print(paste0("http://localhost:", port))
  h2o.init(nthreads = 1, max_mem_size = "10G", port = port)  # my local machine runs 128GB
  df4 <- data.frame()
  gbm <- h2o.loadModel(path = "some path")
  df4 <- as.data.frame(h2o.predict(gbm, ss[[2]]))[, 1]
}
It runs really well on a small sample of my real data (at least 50% faster than sequential).
But when I run this on all of my data, I get the following error code after 45 minutes:
Error in { : task 2 failed - "
ERROR MESSAGE:
DistributedException from localhost/127.0.0.1:60984, caused by
java.lang.IllegalStateException: Unable to clean up RollupStats after an
exception (see cause). This could cause a key leakage, key=$05ff14000000feffffff$_b66dbd609dc068f0137cc88cb42a
"
I am not sure what causes this error code. I guess it has to do with a memory issue because this code will take up 85-95% of my RAM (128GB) and 100% of my CPU (12 threads).
Does anyone have any ideas to work around this?
For those who are interested, I found the reason for this error. It is actually really simple.
Using makeCluster(12) I request 12 threads on my CPU.
Later on in the foreach call, I make an h2o.init call in which I request yet another thread.
Since my machine only has 12 threads, this last call for an additional thread (12 + 1) can't be correctly processed.
I fixed this by assigning 6 threads to the cluster. This leaves 6 threads for the six individual calls to h2o.init (one in each foreach iteration).
This works great.
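In code (my own sketch of that fix; everything except the two worker/thread counts is taken from the script in the question), only the allocation changes:
# 6 R workers + 6 single-threaded H2O instances = 12 threads in total
cl <- makeCluster(6)
registerDoParallel(cl)

df4 <- foreach(i = seq(20), .combine = rbind) %dopar% {
  library(h2o)
  port <- 54321 + 3*i
  h2o.init(nthreads = 1, max_mem_size = "10G", port = port)
  gbm <- h2o.loadModel(path = "some path")
  as.data.frame(h2o.predict(gbm, ss[[2]]))[, 1]
}
stopCluster(cl)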

Building parallel GBM models using cross-validation in R

The gbm package in R has a handy feature of parallelizing cross-validation by sending each fold to its own node. I would like to build multiple cross-validated GBM models running over a range of hyperparameters. Ideally, because I have multiple cores, I could also parallelize the building of these multiple models. With 12 cores, I could, in theory, have 4 models building simultaneously, each using 3-fold validation.
Something like this:
tuneGrid <- expand.grid(
  n_trees = 5000,
  shrink  = c(.0001),
  i.depth = seq(10, 25, 5),
  minobs  = 100,
  distro  = c(0, 1)   # 0 = bernoulli, 1 = adaboost
)

cl <- makeCluster(4, outfile = "GBMlistening.txt")
registerDoParallel(cl)   # 4 parent cores to run in parallel
err.vect <- NA           # initialize
system.time(
  err.vect <- foreach(j = 1:nrow(tuneGrid), .packages = c('gbm'), .combine = rbind) %dopar% {
    fit <- gbm(Label ~ ., data = training,
               n.trees = tuneGrid[j, 'n_trees'],
               shrinkage = tuneGrid[j, 'shrink'],
               interaction.depth = tuneGrid[j, 'i.depth'],
               n.minobsinnode = tuneGrid[j, 'minobs'],
               distribution = ifelse(tuneGrid[j, 'distro'] == 0, "bernoulli", "adaboost"),
               w = weights$Weight,
               bag.fraction = 0.5,
               cv.folds = 3,
               n.cores = 3)   # will this make 4x3 = 12 workers?
    cv.test <- data.frame(scores = 1/(1 + exp(-fit$cv.fitted)),
                          Weight = training$Weight,
                          Label  = training$Label)
    print(j)   # write out to the listener
    cbind(gbm.roc.area(cv.test$Label, cv.test$scores), getAMS(cv.test),
          tuneGrid[j, 'n_trees'], tuneGrid[j, 'shrink'], tuneGrid[j, 'i.depth'],
          tuneGrid[j, 'minobs'], tuneGrid[j, 'distro'], j)
  }
)
stopCluster(cl)   # clean up after ourselves
I would use the caret package; however, I have some hyperparameters beyond those defaulted in caret, and I would prefer not to build my own custom model in caret at this time. I am on a Windows machine, as I know that affects which parallel back-end gets used.
If I do this, will each of the 4 clusters I start up spawn off 3 workers each, for a total of 12 workers chugging away? Or will I only have 4 cores working at once?
I believe this will do what you want. The foreach loop will run four instances of gbm, and each of them will create a three-node cluster using makeCluster. So you'll actually have 16 workers, but only 12 will perform serious computation at any one time. You have to be careful with nested parallelism, but I think this will work.

Nesting parallel functions in R

I'm familiar with foreach, %dopar% and the like. I am also familiar with the parallel option for cv.glmnet. But how do you set up the nested parallelisation as below?
library(glmnet)
library(foreach)
library(parallel)
library(doSNOW)
Npar <- 1000
Nobs <- 200
Xdat <- matrix(rnorm(Nobs * Npar), ncol = Npar)
Xclass <- rep(1:2, each = Nobs/2)
Ydat <- rnorm(Nobs)
Parallel cross-validation:
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
system.time(mods <- foreach(x = 1:2, .packages = "glmnet") %dopar% {
  idx <- Xclass == x
  cv.glmnet(Xdat[idx, ], Ydat[idx], nfolds = 4, parallel = TRUE)
})
stopCluster(cl)
Not parallel cross-validation:
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
system.time(mods <- foreach(x = 1:2, .packages = "glmnet") %dopar% {
  idx <- Xclass == x
  cv.glmnet(Xdat[idx, ], Ydat[idx], nfolds = 4, parallel = FALSE)
})
stopCluster(cl)
For the two system times I am only getting a very marginal difference.
Is parallelisation taken care of? Or do I need to use the nested operator explicitly?
Side-question: If 8 cores are available in a cluster object and the foreach loop contains two tasks, will each task be given 1 core (and the other 6 cores left idle) or will each task be given four cores (using up all 8 cores in total)? What's the way to query how many cores are being used at a given time?
In your parallel cross-validation example, cv.glmnet itself will not run in parallel because there is no foreach parallel backend registered in the cluster workers. The outer foreach loop will run in parallel, but not the foreach loop in the cv.glmnet function.
To use doSNOW for the outer and inner foreach loops, you could initialize the snow cluster workers using clusterCall:
cl <- makeCluster(2, type = "SOCK")
clusterCall(cl, function() {
  library(doSNOW)
  registerDoSNOW(makeCluster(2, type = "SOCK"))
  NULL
})
registerDoSNOW(cl)
This registers doSNOW for both the master and the workers so that each call to cv.glmnet will execute on a two-worker cluster when parallel=TRUE is specified.
The trick with nested parallelism is to avoid creating too many processes and oversubscribing the CPU (or CPUs), so you need to be careful when registering the parallel backends. My example makes sense for a CPU with four cores even though a total of six workers are created, since the "outer" workers don't do much while the inner foreach loops execute. It is common when running on a cluster to use doSNOW to start one worker per node, and then use doMC to start one worker per core on each of those nodes.
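A rough sketch of that last cluster pattern (my own illustration; the host names and per-node core count are placeholders, and doMC requires a Unix-like OS on each node):
library(doSNOW)
library(parallel)

# one snow worker per node (host names are placeholders)
cl <- makeCluster(c("node1", "node2", "node3"), type = "SOCK")

# on each node, register doMC so the inner foreach loops fork local workers
clusterCall(cl, function() {
  library(doMC)
  registerDoMC(cores = 4)   # assume 4 cores per node
  NULL
})

registerDoSNOW(cl)   # the outer foreach loop is distributed across the nodes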
Note that your example doesn't use much compute time, so it's not really worthwhile to use two levels of parallelism. I would use a much bigger problem in order to determine the benefits of the different approaches.
