Read a large (1.5 GB) file in h2o R

I am using the h2o package for modelling in R. For this I want to read a dataset of about 1.5 GB using h2o.importFile(). I start the H2O server using the lines
library(h2oEnsemble)
h2o.init(max_mem_size = '1499m',nthreads=-1)
This produces a log
H2O is not running yet, starting it now...
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode)
Starting H2O JVM and connecting: . Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 seconds 665 milliseconds
H2O cluster version: 3.10.4.8
H2O cluster version age: 28 days, 14 hours and 36 minutes
H2O cluster name: H2O_started_from_R_Lucifer_jvn970
H2O cluster total nodes: 1
H2O cluster total memory: 1.41 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.2 (2016-10-31)
The following line gives me an error
train=h2o.importFile(path=normalizePath("C:\\Users\\All data\\traindt.rds"))
DistributedException from localhost/127.0.0.1:54321, caused by java.lang.AssertionError
DistributedException from localhost/127.0.0.1:54321, caused by java.lang.AssertionError
at water.MRTask.getResult(MRTask.java:478)
at water.MRTask.getResult(MRTask.java:486)
at water.MRTask.doAll(MRTask.java:402)
at water.parser.ParseDataset.parseAllKeys(ParseDataset.java:246)
at water.parser.ParseDataset.access$000(ParseDataset.java:27)
at water.parser.ParseDataset$ParserFJTask.compute2(ParseDataset.java:195)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1315)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.AssertionError
at water.parser.Categorical.addKey(Categorical.java:41)
at water.parser.FVecParseWriter.addStrCol(FVecParseWriter.java:127)
at water.parser.CsvParser.parseChunk(CsvParser.java:133)
at water.parser.Parser.readOneFile(Parser.java:187)
at water.parser.Parser.streamParseZip(Parser.java:217)
at water.parser.ParseDataset$MultiFileParseTask.streamParse(ParseDataset.java:907)
at water.parser.ParseDataset$MultiFileParseTask.map(ParseDataset.java:856)
at water.MRTask.compute2(MRTask.java:601)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1318)
at water.parser.ParseDataset$MultiFileParseTask$Icer.compute1(ParseDataset$MultiFileParseTask$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1314)
... 5 more
Error: DistributedException from localhost/127.0.0.1:54321, caused by java.lang.AssertionError
Any help on how to fix this problem?
Note: Assigning more than 1499 MB of memory also gives me an error (cannot allocate memory). I am using a 16 GB RAM environment.
Edit: I downloaded the 64-bit version of Java and changed my file to a CSV file. I was then able to set max_mem_size to '5g' and the problem was solved.
For others who face the problem:
1. Download the latest 64-bit JDK
2. Execute the following line:
h2o.init(max_mem_size = '5g',nthreads=-1)

You are running with 32-bit Java, which is limiting the memory that you are able to start H2O with. One clue is that it won't start with a higher max_mem_size. Another clue is that it says "Client VM".
You want 64-bit Java instead. The 64-bit version will say "Server VM". You can download the Java 8 SE JDK from here:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Based on what you've described, I recommend setting max_mem_size = '6g' or more, which will work fine on your system once you have the right version of Java installed.
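As a quick sanity check after installing 64-bit Java, a minimal sketch (the '6g' value just follows the recommendation above):
library(h2o)
h2o.init(max_mem_size = '6g', nthreads = -1)
h2o.clusterInfo()  # "H2O cluster total memory" should now report several GB instead of 1.41 GB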

train=h2o.importFile(path=normalizePath("C:\\Users\\All data\\traindt.rds"))
Are you trying to load an .rds file? That's an R binary format which is not readable by h2o.importFile(), so that won't work. You will need to store your training data in a cross-platform storage format (e.g. CSV, SVMLight, etc.) if you want to read it into H2O directly. If you don't have a copy in another format, then just save one from R:
# read the `train` data.frame from the .rds file (use readRDS(), not load(), for .rds files)
train <- readRDS("C:\\Users\\All data\\traindt.rds")
# save as CSV
write.csv(train, "C:\\Users\\All data\\traindt.csv", row.names = FALSE)
# import from CSV into the H2O cluster directly
train <- h2o.importFile(path = normalizePath("C:\\Users\\All data\\traindt.csv"))
Another option is to load it into R from the .rds file and use the as.h2o() function:
# read the `train` data.frame from the .rds file
train <- readRDS("C:\\Users\\All data\\traindt.rds")
# send to H2O cluster
hf <- as.h2o(train)

Related

'all connections are in use' with parallel processing on AWS

I have been able to run 20 models simultaneously using an r6a.48xlarge Amazon Web Services instance (192 vCPUs, 1,536 GiB memory) and this R code:
setwd('/home/ubuntu/')
library(doParallel)
detectCores()
my.AWS.n.cores <- detectCores()
my.AWS.n.cores <- my.AWS.n.cores - 92
my.AWS.n.cores
registerDoParallel(my.cluster <- makeCluster(my.AWS.n.cores))
folderName <- 'model000222'
files <- list.files(folderName, full.names=TRUE)
start.time <- Sys.time()
foreach(file = files, .errorhandling = "remove") %dopar% {
source(file)
}
stopCluster(my.cluster)
end.time <- Sys.time()
total.time.c <- end.time-start.time
total.time.c
However, the above R code did not run until I reduced the number of cores to 100 from 192 with this line:
my.AWS.n.cores <- my.AWS.n.cores - 92
If I tried running the code with all 192 vCPUs or 187 vCPUs I got this error message:
> my.AWS.n.cores <- detectCores()
> my.AWS.n.cores <- my.AWS.n.cores - 5
> my.AWS.n.cores
[1] 187
>
> registerDoParallel(my.cluster <- makeCluster(my.AWS.n.cores))
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
all connections are in use
Calls: registerDoParallel ... makePSOCKcluster -> newPSOCKnode -> socketConnection
I had never seen that error message and could not locate it with an internet search. Could someone explain this error message? I do not know why my solution worked or whether a better solution exists. Can I easily determine the maximum number of connections I can use without getting this error? I suppose I could run the code incrementing the number of cores from 100 to 187.
I installed R on this instance with the lines below in PuTTY. R could not be located on the instance until I used the last line below: apt install r-base-core.
sudo su
echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/"
sudo apt-get update
sudo apt-get install r-base
sudo apt install dos2unix
apt install r-base-core
I used this AMI:
Ubuntu Server 18.04 LTS (HVM), SSD Volume Type
EDIT
Apparently, R has a hard-wired limit of 128 connections. It seems you can increase the number of PSOCK workers if you are willing to rebuild R from source, but I have not found an answer showing how to do that. Ideally I can find an answer showing how to do that on Ubuntu and AWS. See also these previous related questions.
Errors in makeCluster(multicore): cannot open the connection
Is there a limit on the number of slaves that R snow can create?
Explanation
Each parallel PSOCK worker consumes one R connection. As of R 4.2.1, R is hard-coded to support only 128 open connections at any time. Three of those connections are always in use (stdin, stdout, and stderr), leaving you with 125 to play with.
To increase this limit, you have to update the constant
#define NCONNECTIONS 128
in src/main/connections.c, and then re-build R from source. FWIW, I've verified that it works with at least 16,384 on Ubuntu 16.04 (https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28#issuecomment-231603035).
People have reported on this before, and the problem has been raised on R-devel several times over the years. The last time the limit was increased was in R 2.4.0 (October 2006), when it went from 50 to 128.
See https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28 for more details and discussions. I think it's worth bumping this topic again on R-devel. As people get access to more cores, more people will run into this problem.
The parallelly package provides two functions, availableConnections() and freeConnections(), for querying the current R installation for the number of connections available and free. See https://parallelly.futureverse.org/reference/availableConnections.html for details and examples.
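For example (a small sketch assuming the parallelly package is installed; the printed values are what a stock R build typically reports):
library(parallelly)
availableConnections()  # 128 on a standard R build
freeConnections()       # about 125 in a fresh session, since stdin/stdout/stderr are always open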
FYI, if you use parallelly::makeClusterPSOCK(n) instead of parallel::makeCluster(n), you'll get a more informative error message, and much sooner, e.g.
> cl <- parallelly::makeClusterPSOCK(192)
Error: Cannot create 192 parallel PSOCK nodes. Each node
needs one connection but there are only 124 connections left
out of the maximum 128 available on this R installation
Workaround
You can avoid relying on R connections for local parallel processing by using the callr package under the hood. The easiest way to achieve this is to use doFuture in combination with future.callr. In your example, that would be:
library(doFuture)
library(future.callr)
registerDoFuture()
plan(callr, workers = parallelly::availableCores(omit = 5))
...
With this setup, the parallel workers are launched via callr (which operates without R connections). Each parallel task is launched in a separate callr process and when the task completes, the parallel worker is terminated. Because the parallel workers are not reused, there is an extra overhead paid for using the callr backend, but if your parallel tasks are long enough, that should still be a minor part of the processing time.
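Putting it together with the loop from the question, a minimal sketch (re-using the model000222 folder name from the original post) would look like this:
library(doFuture)
library(future.callr)
registerDoFuture()
plan(callr, workers = parallelly::availableCores(omit = 5))
files <- list.files("model000222", full.names = TRUE)
foreach(file = files, .errorhandling = "remove") %dopar% {
  source(file)
}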

H2O using too much RAM with Sparse Matrices

I'm using H2O with an SVMLight sparse matrix of dimensions ~700,000 x ~800,000. The file size is approximately 800 MB on disk, but importing it into H2O takes up over 300 GB of RAM. The process also takes quite a while (~15 minutes) to finish.
In comparison, I can create and store the sparse matrix in RAM using the Matrix package rather quickly; the sparse matrix in that case takes ~1.2 GB of RAM.
Below is my code:
library(h2o)
h2o.init(nthreads=-1,max_mem_size = "512g")
x <- h2o.importFile('test2.svmlight', parse = TRUE)
Here is my system:
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
Starting H2O JVM and connecting: .. Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 seconds 76 milliseconds
H2O cluster version: 3.14.0.3
H2O cluster version age: 1 month and 8 days
H2O cluster name: H2O_started_from_R_ra2816_fhv677
H2O cluster total nodes: 1
H2O cluster total memory: 455.11 GB
H2O cluster total cores: 24
H2O cluster allowed cores: 24
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.1 (2017-06-30)
I would appreciate any advice because I really enjoy H2O and would like to use it for this project.
H2O stores data in a columnar compressed store, and is optimized to work well with datasets that have a huge number (billions+) of rows and a large number (thousands+) of columns.
Each column is stored in a bunch of what H2O calls chunks. A chunk is a group of contiguous rows. A chunk may be sparse, so if a chunk contains 10,000 rows and they are all missing, the amount of memory needed by that chunk can be really small. But the chunk still needs to be there.
In practice, what that means is that H2O stores rows sparsely but does not store columns sparsely. So it won't store things as efficiently as a pure sparse matrix package for wide data.
In your specific case, 800,000 columns is pushing H2O's limits.
One thing some people don't know about H2O is that it handles categorical columns efficiently. So if you are getting column explosion by manually one-hot encoding your data, you don't need to do that with H2O; keeping the original categorical column is a much more efficient representation.
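For instance, instead of expanding a categorical column into thousands of 0/1 indicator columns yourself, you can keep it as one column and let H2O treat it as an enum. A hedged sketch (the frame and column names here are placeholders, not from the original post):
library(h2o)
h2o.init()
df <- data.frame(category = sample(letters, 1000, replace = TRUE), y = rnorm(1000))
hf <- as.h2o(df)
hf[["category"]] <- h2o.asfactor(hf[["category"]])  # one enum column instead of 26 indicator columns
h2o.describe(hf)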

How to stop h2o from saving massive .ERR, .OUT and other log files to the local drive

I am currently running a script in which several h2o glm and deeplearning models are generated for several iterations of a Monte Carlo cross-validation. When it finishes running (which takes about half a day), h2o saves immense files to the local drive (with sizes up to 8.5 GB). These files are not erased when RStudio or my computer is restarted (as I originally thought). Is there a way to stop h2o from saving these files?
When you start H2O with h2o.init() from R, the stdout and stderr files should be saved to a temporary directory (see R's tempdir() for the path). This temporary directory should be removed when the R session exits. It seems as though this is not working with RStudio; however, it works if you are using R from the command line. I'm not sure if this is a setting that can be changed in RStudio or if this is an RStudio bug.
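You can check where those files end up in your own session, for example:
tempdir()                                 # the R session's temporary directory
list.files(tempdir(), full.names = TRUE)  # any H2O .out/.err log files should appear here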
But you can take more control yourself. You can start H2O by hand using java on the command line and then connect from R using h2o.init().
java -Xmx5g -jar h2o.jar
In this example, I started H2O with 5 GB of Java heap memory, but you should increase that if your data is larger. Then connecting in R will look like this:
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 16 hours 34 minutes
H2O cluster version: 3.15.0.99999
H2O cluster version age: 17 hours and 25 minutes
H2O cluster name: H2O_started_from_R_me_exn817
H2O cluster total nodes: 1
H2O cluster total memory: 4.43 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.2 (2016-10-31)
So if you want to discard both stdout and stderr, you simply append the redirect > /dev/null 2>&1 to the end of the java command that starts the H2O cluster, and then connect to H2O from R as before:
java -Xmx5g -jar h2o.jar > /dev/null 2>&1 &
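When connecting from R to a cluster started this way, you can also pass startH2O = FALSE so that h2o.init() attaches to the existing JVM rather than trying to launch a new one, e.g.:
library(h2o)
h2o.init(ip = "localhost", port = 54321, startH2O = FALSE)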
I encountered this in a Spark shell running H2O. The shell had 50 executors connected, and the /tmp directories on those nodes eventually caused storage issues.
When h2o.init() is called it creates JVMs, and the logging from h2o is handled by those JVMs. But when the shell is shut down, those JVMs persist and just log heartbeat errors to /tmp in perpetuity. You will need to find the JVMs associated with h2o and shut them down. I believe in my case the specific process name was water.H2OApp.
I found it easier to take care of the problem by removing those files after running every model.
unlink(list.files(tempdir(), full.names = TRUE), recursive = TRUE)
This helps remove the temporary files when I run multiple models in a loop.

How to allow h2o to access all available memory?

I am running h2o through RStudio Server on a Linux server with 64 GB of RAM. When I initialize the cluster it says that the total cluster memory is only 9.78 GB. I have tried using the max_mem_size parameter, but the cluster still reports only 9.78 GB.
localH2O <<- h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "25g")
H2O is not running yet, starting it now...
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 5 hours 10 minutes
H2O cluster version: 3.10.4.6
H2O cluster version age: 19 days
H2O cluster name: H2O_started_from_R_miweis_mxv543
H2O cluster total nodes: 1
H2O cluster total memory: 9.78 GB
H2O cluster total cores: 16
H2O cluster allowed cores: 16
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.3 (2017-03-06)
I ran the following on the server to confirm the amount of memory available:
cat /proc/meminfo
MemTotal: 65806476 kB
EDIT:
I was looking more into this issue and it seems like it is a default within the JVM. When I started h2o directly in Java I was able to pass in the option -Xmx32g and it did increase the memory. I could then connect to that h2o instance from RStudio and have access to the increased memory. I was wondering if there is a way to change this default value in the JVM and allow more memory, so I don't have to start the h2o instance from the command line first and then connect to it from RStudio Server.
The max_mem_size argument in the h2o R package is functional, so you can use it to start an H2O cluster of whatever size you want -- you don't need to start it from the command line using -Xmx.
What seems to be happening in your case is that you are connecting to an existing H2O cluster located at localhost:54321 that was limited to "10G" (in reality, 9.78 GB). So when you run h2o.init() from R, it will just connect to the existing cluster (with a fixed memory), rather than starting a new H2O cluster with the memory that you specified in max_mem_size, and so the memory request gets ignored.
To fix, you should do one of the following (both options are sketched below):
1. Kill the existing H2O cluster at localhost:54321 and restart it from R with the desired memory requirement, or
2. Start a new cluster from R at a different IP/port than the one that's already running.
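A minimal sketch of both options (run one or the other; the port in option 2 is arbitrary):
# Option 1: shut down the cluster at localhost:54321 and restart with more memory
h2o.shutdown(prompt = FALSE)
h2o.init(nthreads = -1, max_mem_size = "25g")
# Option 2: leave it running and start a second cluster on another port
h2o.init(port = 54323, nthreads = -1, max_mem_size = "25g")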
When starting up with h2o.init(), you may also want to specify the argument min_mem_size=.
This forces H2O to use at least that amount of memory, while max_mem_size= prevents H2O from using more than that amount.
If you have 6 GB (for example) of available memory you can do this:
library(h2o)
h2o.init(max_mem_size = "6g")
The same call also accepts min_mem_size= if you want to guarantee a minimum as well.
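A hedged sketch combining the two arguments (the values are illustrative):
library(h2o)
h2o.init(min_mem_size = "6g", max_mem_size = "8g")  # guarantees at least 6 GB, caps the heap at 8 GB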

h2o initialization parameters not optimal

My laptop has 8GB RAM with 4 cores.
My h2o version is as follows:
Package: h2o
Type: Package
Version: 3.10.0.8
Branch: rel-turing
Date: Mon Oct 10 13:47:51 PDT 2016
License: Apache License (== 2.0)
Depends: R (>= 2.13.0), RCurl, jsonlite, statmod, tools, methods, utils
I initialized it as follows:
h2o.init(nthreads = -1,max_mem_size = "8g")
But the output I get is as follows:
R is connected to the H2O cluster:
H2O cluster uptime: 13 hours 21 minutes
H2O cluster version: 3.10.0.8
H2O cluster version age: 21 days, 13 hours and 33 minutes
H2O cluster name: H2O_started_from_R_hp_ubq027
H2O cluster total nodes: 1
H2O cluster total memory: 1.33 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
R Version: R version 3.3.1 (2016-06-21)
Why are the allowed cores only 2 and the total memory only 1.33 GB when almost 8 GB is available?
It says it has been running for 13 hours. So what you are seeing is a cluster that is already running, and was (probably) started with default settings.
So, before doing your h2o.init() command you need to do h2o.shutdown():
h2o.shutdown()
h2o.init(nthreads = -1,max_mem_size = "8g")
(Remember when you shut down H2O that all models and data are lost, so use h2o.exportFile() and/or h2o.saveModel() if any of it cannot easily be re-created.)
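A small sketch of persisting work before a shutdown (the object names and paths are placeholders, not from the original post):
h2o.exportFile(train_hf, path = "/tmp/train.csv")            # train_hf: an H2OFrame
model_path <- h2o.saveModel(my_model, path = "/tmp/models")  # my_model: a fitted H2O model
# after restarting: h2o.loadModel(model_path); h2o.importFile("/tmp/train.csv")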
UPDATE: I just noticed you said you have an 8 GB laptop. I'd recommend not allocating more than 90% to H2O even if the machine is dedicated, to be sure there is some left for the OS, the Flow web server, etc. (The EC2 scripts use 90%.) And if you intend to do other things on your notebook (run RStudio, check email, use StackOverflow in a browser window, etc.), subtract the memory for all of that first. (My notebook is 8 GB and is my general-purpose machine, so I usually give H2O "4g" if I think I'll be making a lot of models, and "2g" or "3g" otherwise.)
Regarding nthreads defaulting to 2 -- to the best of my knowledge, that is a CRAN policy restriction, which is why it's set to 2 instead of -1 (recommended).
