R, h2o and foreach: java.lang.IllegalStateException

In a different post here I asked for help on parallel processing calls to h2o.gbm inside a foreach loop.
Following the answers provided, I ran a script similar to this example:
library(h2o)
h2o.init() # start a local H2O instance first; as.h2o needs a running cluster
data(iris)
data <- as.h2o(iris)
ss <- h2o.splitFrame(data)
gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = ss[[1]])
h2o.saveModel(gbm, path = "some path") # h2o.saveModel takes the model object as its first argument
h2o.shutdown(prompt = FALSE)
library(foreach)
library(doParallel)

# set up a parallel backend with 12 workers
cl <- makeCluster(12)
registerDoParallel(cl)

# loop
df4 <- foreach(i = seq(20), .combine = rbind) %dopar% {
  library(h2o)
  port <- 54321 + 3 * i
  print(paste0("http://localhost:", port))
  h2o.init(nthreads = 1, max_mem_size = "10G", port = port) # my local machine runs 128GB
  gbm <- h2o.loadModel(path = "some path")
  as.data.frame(h2o.predict(gbm, ss[[2]]))[, 1]
}
It runs really well on a small sample of my real data (at least 50% faster than sequential).
But when I run this on all of my data, I get the following error after 45 minutes:
Error in { : task 2 failed - "
ERROR MESSAGE:
DistributedException from localhost/127.0.0.1:60984, caused by
java.lang.IllegalStateException: Unable to clean up RollupStats after an
exception (see cause). This could cause a key leakage, key=$05ff14000000feffffff$_b66dbd609dc068f0137cc88cb42a
"
I am not sure what causes this error. I guess it has to do with a memory issue, because this code takes up 85-95% of my RAM (128GB) and 100% of my CPU (12 threads).
Does anyone have any ideas for a workaround?

For those who are interested, I found the reason for this error. It is actually really simple.
With makeCluster(12) I request 12 threads on my CPU.
Later, in the foreach call, I make an h2o.init call in which I request yet another thread.
Since my machine only has 12 threads, this request for a 13th thread can't be correctly processed.
I fixed this by assigning 6 threads to the cluster. This leaves 6 threads for the six individual calls to h2o.init (one in each foreach worker).
This works great.
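For reference, a minimal sketch of that fix (assumptions: the same "some path" model as above, and iris standing in for the real test data; re-uploading the test frame inside each worker is my addition, since frame handles from the shut-down cluster are not valid in the new instances):

library(foreach)
library(doParallel)

cl <- makeCluster(6) # 6 R workers...
registerDoParallel(cl)

df4 <- foreach(i = seq(20), .combine = rbind) %dopar% {
  library(h2o)
  # ...each running a single-threaded H2O instance on its own port: 6 + 6 = 12 threads
  h2o.init(nthreads = 1, max_mem_size = "10G", port = 54321 + 3 * i)
  gbm <- h2o.loadModel(path = "some path")
  test.hf <- as.h2o(iris) # hypothetical stand-in for the real test frame
  as.data.frame(h2o.predict(gbm, test.hf))[, 1]
}

stopCluster(cl)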

Related

Can I make this R foreach loop faster?

Thanks in advance for your help.
The short of it is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious whether I can speed them up -- it's taking hours (maybe even days).
So, I've been given two large pieces of data (by a friend who needs help). The first is a very large matrix (728,396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a dataset (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model utilizing both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7 GHz and 36 GB of RAM.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)

load("M1.RDATA")
load("DF1.RDATA")

clust = makeCluster(detectCores() - 3, outfile = "")
# I have 4 physical cores, 8 virtual. I've been using 5 because my CPU sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() # 5 cores

n = 728396
res_function = function(i) {
  x = as.vector(M1[i, ])
  # taking one row of genetic data to be used in the regression
  fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
  # running the model, then collecting estimates, p-values, and whether there are any convergence error messages
  c(coef(summary(fit1))[2, 1:4],
    coef(summary(fit1))[3:6, 1],
    coef(summary(fit1))[3:6, 4],
    length(fit1@optinfo[["conv"]][["lme4"]][["messages"]]))
}
start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
  if (!exists("pb")) pb <- tkProgressBar("Parallel task", min = 1, max = n)
  setTkProgressBar(pb, i)
  # this is some code I found here to keep track of my progress
  res_function(i)
}
end_time = Sys.time()
end_time - start_time

stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took only about 13 minutes. However, I suspect that this model takes up more memory than usual on each core (likely due to the second level), slowing things down. I've read that BiocParallel, future, or even Microsoft R Open might work better, but I haven't had much success using any of them (likely due to my own lack of know-how). I've also read a bit about the bigmemory package for using the large matrix more efficiently across cores, but I ran into several errors when I tried it (failed workers and such). I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional oomph, if anyone knows more about this.
Any advice would be very appreciated!
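For what it's worth, a minimal sketch of the bigmemory pattern mentioned above (untested; the backing-file names M1.bin/M1.desc are hypothetical): back M1 with a file once, export only the small descriptor, and let each worker attach the shared matrix instead of receiving its own copy.

library(bigmemory)
library(doParallel)
library(foreach)

# write M1 to a shared backing file once, up front
M1.big <- as.big.matrix(M1, backingfile = "M1.bin", descriptorfile = "M1.desc")
desc <- describe(M1.big) # a small object that is cheap to export to workers

clust <- makeCluster(5)
registerDoParallel(clust)

model1 <- foreach(i = 1:n, .packages = c("bigmemory", "lme4"),
                  .combine = rbind) %dopar% {
  M1.w <- attach.big.matrix(desc) # maps the backing file; no per-worker copy of M1
  x <- as.vector(M1.w[i, ])
  # ... fit glmer on x and DF1 as in res_function above ...
}

stopCluster(clust)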

Why is this parallel computing code only using 1 CPU?

I am using the foreach and parallel libraries to perform parallel computation, but for some reason, while running, it only uses 1 CPU at a time (I checked with 'top' in a Bash terminal on Linux).
The server has 48 cores, and I've tried:
Using 24, 12 or 5 cores
Example codes (such as the one below)
Running on Windows, where the worker tasks appear as such, but they do not use any CPU
list.of.packages <- c("foreach", "doParallel")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)

library(foreach)
library(doParallel)

no_cores <- detectCores() / 2 # 24 cores
cl <- makeCluster(no_cores)
registerDoParallel(cl)

df.a = data.frame(str = cbind(paste('name', seq(1:60000))), int = rnorm(60000))
df.b = data.frame(str = sample(df.a[, 1]))
df.b$int = NA

foreach(row.a = 1:length(df.a$str),
        .combine = rbind,
        .verbose = T) %dopar% {
  row.b = grep(pattern = df.a$str[row.a], x = df.b$str)
  df.b$int[row.b] = df.a$int[row.a]
  df.b
}
stopCluster(cl)
stopCluster(cl)
I expect this code to use several CPUs (as many as defined), but it actually uses only 1.

Run foreach(..., .verbose = TRUE) to understand what is going on. I have slightly changed the code being run, to better identify when the parallel part is executing.
library(foreach)
library(doParallel)

no_cores <- detectCores() / 2 # 24 cores
cl <- makeCluster(no_cores)
registerDoParallel(cl)

base = 2
out <- foreach(exponent = 2:400,
               .combine = sum, .verbose = TRUE) %dopar%
  runif(1000000)
First segment:
# discovered package(s):
# no variables are automatically exported
# explicitly exporting package(s):
# numValues: 3999, numResults: 0, stopped: TRUE
This setup stage is not parallel - this is your master setting up its children. It takes a very long time with something like 2:40000000, which may be where you are stopping, and during it you would only see one core in use.
# numValues: 79, numResults: 0, stopped: TRUE
# got results for task n
The computation happening while you wait for this to be printed should be the parallel part. On Windows I see 4 cores working to calculate each runif.
# calling combine function
# evaluating call object to combine results:
# fun(accum, result.n)
This runs once for each result, with a different value of n. It is your combine function, and it is not parallel either.
I think your code is getting hung up on the setup piece, and you are only observing the serial part of the operation. If not, I would watch what happens with .verbose = TRUE and look for more clues.
I don't know how your main problem is set up, but your example is not a good model for parallelization - you are dispatching a huge number of tiny tasks, so the serial overhead cost per task is very high. You will see improved performance if you can send larger pieces of work to each worker, as sketched below.
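As an illustration, a hedged sketch of chunking applied to the question's grep example: split the 60,000 rows into one block per worker, so each task carries substantial work instead of a single grep call.

library(foreach)
library(doParallel)

cl <- makeCluster(no_cores)
registerDoParallel(cl)

# one index block per worker instead of one task per row
chunks <- split(seq_len(nrow(df.a)),
                cut(seq_len(nrow(df.a)), no_cores, labels = FALSE))

res <- foreach(idx = chunks, .combine = rbind) %dopar% {
  # match this whole block of rows inside a single task
  data.frame(row.a = idx,
             row.b = vapply(idx,
                            function(k) grep(df.a$str[k], df.b$str, fixed = TRUE)[1],
                            integer(1)))
}

stopCluster(cl)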

Combining mclapply and registerDoMC in a function

I am running a function that uses biganalytics::bigkmeans and xgboost (through caret). Both of these support parallel processing if it is registered first by doing registerDoMC(cores = 4). However, to utilize the power of the 64-core machine I have access to without adding too much parallel overhead, I want to run the following function in 16 instances (a total of 64 processes).
example = function(x) {
  biganalytics::bigkmeans(matrix(rnorm(10 * 5, 1000, 1), ncol = 500), centers = 2) # bigkmeans requires a centers argument
  mod <- train(Class ~ ., data = x,
               method = "xgbTree", tuneLength = 50,
               trControl = trainControl(search = "random"))
}

set.seed(1)
dat1 <- twoClassSim(1000)
dat2 <- twoClassSim(1001)
dat3 <- twoClassSim(1002)
dat4 <- twoClassSim(1003)
datlist <- list(dat1, dat2, dat3, dat4)

mclapply(datlist, example, mc.cores = 16)
It is important that I stick with mclapply, because I need a shared-memory (forking) parallel backend so that I don't run out of RAM in my actual use on datasets over 50 GB.
My question is: where would I call registerDoMC in this case?
Thanks!
Using nested parallelism isn't often a good idea, but if the outer loop has many fewer iterations than cores, it might be.
You can load doMC and call registerDoMC inside the function run by mclapply, to prepare the workers to call train. But note that it doesn't make sense to call mclapply with more workers than tasks, since some of the workers would have no work to do.
You could do something like this:
example <- function(dat, nw) {
  library(doMC)
  registerDoMC(nw)
  # call train function on dat...
}

# This assumes that length(datlist) is much less than ncores
ncores <- 64
m <- length(datlist)
nw <- ncores %/% m
mclapply(datlist, example, nw, mc.cores = m)
If length(datlist) is 4, then each "train" task will use 16 workers. You can certainly use fewer workers per "train" task, but you probably shouldn't use more.
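For concreteness, a sketch of what the filled-in function might look like, reusing the train call from the question (caret, xgbTree, and the Class ~ . formula all come from the question's own code):

library(parallel)
library(caret)

example <- function(dat, nw) {
  library(doMC)
  registerDoMC(nw) # train's resampling loop will use these nw workers
  train(Class ~ ., data = dat,
        method = "xgbTree", tuneLength = 50,
        trControl = trainControl(search = "random"))
}

ncores <- 64
m <- length(datlist)
nw <- ncores %/% m # 16 workers per task when m == 4
models <- mclapply(datlist, example, nw, mc.cores = m)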

Run h2o algorithms inside a foreach loop?

I naively thought it would be straightforward to make multiple calls to h2o.gbm in parallel inside a foreach loop, but I got a strange error.
Error in { :
task 3 failed - "java.lang.AssertionError: Can't unlock: Not locked!"
Code below:
library(foreach)
library(doParallel)
library(doSNOW)

Xtr.hf = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar% {
  h2o.init(ip = "localhost", nthreads = 2, max_mem_size = "5G")
  for (j in 1:3) {
    bm2 <- h2o.gbm(
      training_frame = Xtr.hf,
      validation_frame = Xval.hf,
      x = 2:ncol(Xtr.hf),
      y = 1,
      distribution = "gaussian",
      ntrees = 100,
      max_depth = 3,
      learn_rate = 0.1,
      nfolds = 1)
  }
  h2o.shutdown(prompt = FALSE)
  return(iname)
}
stopCluster(cl)
NOTE: This is unlikely to be a good use of R's parallel foreach, but I'll answer your question first, then explain why. (BTW, when I use "cluster" in this answer I'm referring to an H2O cluster (even if it is just on your local machine), and not an R "cluster".)
I've re-written your code, assuming the intention was to have a single H2O cluster, where all the models are to be made:
library(foreach)
library(doParallel)
library(doSNOW)
library(h2o)

h2o.init(ip = "localhost", nthreads = -1, max_mem_size = "5G")
Xtr.hf = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar% {
  for (j in 1:3) {
    bm2 <- h2o.gbm(
      training_frame = Xtr.hf,
      validation_frame = Xval.hf,
      x = 2:ncol(Xtr.hf),
      y = 1,
      distribution = "gaussian",
      ntrees = 100,
      max_depth = 3,
      learn_rate = 0.1,
      nfolds = 1)
    #TODO: do something with bm2 here?
  }
  return(iname) #???
}
stopCluster(cl)
I.e. in outline form:
Start H2O, and load Xtr and Xval into it
Start 6 threads in your R client
In each thread make 3 GBM models (one after another)
I dropped the h2o.shutdown() command, guessing that you didn't intend it (when you shut down the H2O cluster, the models you just made get deleted). And I've highlighted where you might want to be doing something with your model. I've also given H2O all the threads on your machine (that is the nthreads=-1 in h2o.init()), not just 2.
You can make H2O models in parallel, but it is generally a bad idea, as they end up fighting for resources. Better to make them one at a time, and rely on H2O's own parallel code to spread the computation over the cluster. (When the cluster is a single machine, this tends to be very efficient.)
The fact that you've gone to the trouble of making a parallel loop in R makes me think you've missed the way H2O works: it is a server written in Java, and R is just a light client that sends it API calls. The GBM calculations are not done in R; they are all done in Java code.
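A minimal sketch of that recommended pattern (reusing the question's Xtr/Xval and GBM settings; the 18 iterations match the 6 x 3 models of the loop above):

library(h2o)
h2o.init(ip = "localhost", nthreads = -1, max_mem_size = "5G")

Xtr.hf = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

# 18 models, one after another; H2O parallelizes each GBM internally
models <- lapply(1:18, function(i)
  h2o.gbm(x = 2:ncol(Xtr.hf), y = 1,
          training_frame = Xtr.hf, validation_frame = Xval.hf,
          distribution = "gaussian", ntrees = 100,
          max_depth = 3, learn_rate = 0.1))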
The other way to interpret your code is to run multiple instances of H2O, i.e. multiple H2O clusters. This might be a good idea if you have a set of machines, and you know the H2O algorithm is not scaling very well across a multi-node cluster. Doing it on a single machine is almost certainly a bad idea. But, for the sake of argument, this is how you do it (untested):
library(foreach)
library(doParallel)
library(doSNOW)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar% {
  library(h2o)
  h2o.init(ip = "localhost", port = 54321 + (i * 2), nthreads = 2, max_mem_size = "5G")
  Xtr.hf = as.h2o(Xtr)
  Xval.hf = as.h2o(Xval)
  for (j in 1:3) {
    bm2 <- h2o.gbm(
      training_frame = Xtr.hf,
      validation_frame = Xval.hf,
      x = 2:ncol(Xtr.hf),
      y = 1,
      distribution = "gaussian",
      ntrees = 100,
      max_depth = 3,
      learn_rate = 0.1,
      nfolds = 1)
    #TODO: save bm2 here
  }
  h2o.shutdown(prompt = FALSE)
  return(iname) #???
}
stopCluster(cl)
Now the outline is:
Create 6 R threads
In each thread, start an H2O cluster that is running on localhost but on a port unique to that cluster. (The i*2 is because each H2O cluster is actually using two ports.)
Upload your data to the H2O cluster (i.e. this will be repeated 6 times, once for each cluster).
Make 3 GBM models, one after another.
Do something with those models
Kill the cluster for the current thread.
If you have 12+ threads on your machine, and 30+ GB of memory, and the data is relatively small, this will be roughly as efficient as using one H2O cluster and making 12 GBM models in serial. If not, I believe it will be worse. (But if you have pre-started 6 H2O clusters on 6 remote machines, this might be a useful approach - I must admit I'd been wondering how to do that, and using the parallel library for it had never occurred to me until I saw your question!)
NOTE: as of the current version (3.10.0.6), I know the above code won't work, as there is a bug in h2o.init() that effectively means it is ignoring the port. (Workarounds: either pre-start all 6 H2O clusters on the commandline, or set the port in an environment variable.)

makeCluster with parallelSVM in R takes up all Memory and swap

I'm trying to train an SVM model on a large dataset (~110k training points). This is a sample of the code, where I am using the parallelSVM package to parallelize the training step on a subset of the training data on my 4-core Linux machine.
numcore = 4
train.time = c()
for (i in 1:5) {
  cl = makeCluster(4)
  registerDoParallel(cores = numCore)
  getDoParWorkers()
  dummy = train_train[1:(10000 * i), ] # first 10000*i rows; note that 1:10000*i would instead multiply the whole sequence by i
  begin = Sys.time()
  model.svm = parallelSVM(as.factor(target) ~ ., data = dummy,
                          numberCores = detectCores(), probability = T)
  end = Sys.time() - begin
  train.time = c(train.time, end)
  stopCluster(cl)
  registerDoSEQ()
}
The idea of this snippet is to estimate how long it will take to train the model on the entire dataset, by gradually increasing the size of the dummy training set. After running the code above for 10,000 and 20,000 training samples, I looked at the memory and swap usage history in the System Monitor. After 4 runs of the for loop, both memory and swap usage are at about 95%, and I get the following error:
Error in summary.connection(connection) : invalid connection
Any ideas on how to manage this problem? Is there a way to deallocate the memory used by a cluster after calling stopCluster()?
Please take into consideration that I am an absolute beginner in this field. A short explanation of the proposed solutions would be greatly appreciated. Thank you.
Your line
registerDoParallel(cores=numCore)
creates a new, implicit cluster with numCore nodes (note that you assigned numcore, lowercase, so numCore is never actually defined). This cluster is never destroyed, so with each iteration of the loop you start more new R processes. Since you're already creating a cluster with cl = makeCluster(4), you should use
registerDoParallel(cl)
instead.
(And move the makeCluster, registerDoParallel, stopCluster and registerDoSEQ calls outside the loop.)
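Putting that advice together, a sketch of the corrected loop (untested; train_train and target come from the question):

library(doParallel)
library(parallelSVM)

numCore <- 4
cl <- makeCluster(numCore)
registerDoParallel(cl) # register the one cluster we made, once

train.time <- c()
for (i in 1:5) {
  dummy <- train_train[1:(10000 * i), ]
  begin <- Sys.time()
  model.svm <- parallelSVM(as.factor(target) ~ ., data = dummy,
                           numberCores = numCore, probability = TRUE)
  train.time <- c(train.time, Sys.time() - begin)
}

stopCluster(cl)
registerDoSEQ()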
