Start multiple h2o clusters from within R

My intention is to start two or more h2o clusters / instances (not two or more nodes!) from within R on the same computer/server so that multiple users can connect to h2o at the same time. In addition, I want to be able to shut down and restart the clusters separately, also from within R.
I already know that I cannot control multiple h2o clusters simply from within R, so I tried to start two clusters from the command line in Windows 10:
java -Xmx1g -jar h2o.jar -name testCluster1 -nthreads 1 -port 54321
java -Xmx1g -jar h2o.jar -name testCluster2 -nthreads 1 -port 54323
This works fine for me:
library(h2o)
h2o.init(startH2O = FALSE, ip = "localhost", port = 54321)
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 4 minutes 8 seconds
H2O cluster version: 3.8.3.2
H2O cluster name: testCluster
H2O cluster total nodes: 1
H2O cluster total memory: 0.87 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 1
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
R Version: R version 3.2.5 (2016-04-14)
h2o.init(startH2O = FALSE, ip = "localhost", port = 54323)
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 minutes 32 seconds
H2O cluster version: 3.8.3.2
H2O cluster name: testCluster2
H2O cluster total nodes: 1
H2O cluster total memory: 0.87 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 1
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54323
H2O Connection proxy: NA
R Version: R version 3.2.5 (2016-04-14)
Now, I want to do the same from within R via the system() command.
launchH2O <- as.character("java -Xmx1g -jar h2o.jar -name testCluster -nthreads 1 -port 54321")
system(command = launchH2O, intern =TRUE)
But I get an error message:
[1] "Error: Unable to access jarfile h2o.jar"
attr(,"status")
[1] 1
Warning message:
running command 'java -Xmx1g -jar h2o.jar -name testCluster -nthreads 1 -port 54321' had status 1
Trying
system2(command = launchH2O)
I get a warning message and I am not able to connect with the cluster:
system2(command = launchH2O)
Warning message:
running command '"java -Xmx1g -jar h2o.jar -name testCluster -nthreads 1 -port 54321"' had status 127
h2o.init(startH2O = FALSE, ip = "localhost", port = 54321)
Error in h2o.init(startH2O = FALSE, ip = "localhost", port = 54321) :
Cannot connect to H2O server. Please check that H2O is running at http://localhost:54321/
Any ideas how to start / shutdown two or more h2o clusters from within R?
Thank you in advance!
Note 1: I am only using my local Windows device for testing, I actually want to create multiple h2o clusters on a Linux server.
Note 2: I tried it with both R GUI (3.2.5) and R Studio (Version 0.99.892) and I ran them as admin. The h2o.jar file is in my working directory and my Java version is (Build 1.8.0_91-b14).
Note 3: System information:
- h2o & h2o R package version: 3.8.3.2
- Windows 10 Home, Version 1511
- 16 GB RAM, Intel Core i5-6200U CPU with 2,30 GHz

EDIT: Based on the comments, I've changed the examples below to use intern=FALSE.
You should just need to change directory; it is either that or not setting wait=FALSE (to run the command in the background).
launchH2O <- "java -Xmx1g -jar h2o.jar -name testCluster -nthreads 1 -port 54321"
savewd <- setwd("/path/to/h2ojar/")
system(command = launchH2O, intern = FALSE, wait = FALSE)
setwd(savewd)
The last line and the assignment to savewd are there only to preserve the working directory. Alternatively, this should also work:
launchH2O <- "java -Xmx1g -jar /path/to/h2ojar/h2o.jar -name testCluster -nthreads 1 -port 54321"
system(command = launchH2O, intern =FALSE, wait=FALSE)
When on Linux, there is another way:
launchH2O <- "bash -c 'nohup java -Xmx1g -jar /path/to/h2ojar/h2o.jar -name testCluster -nthreads 1 -port 54321 &'"
system(command = launchH2O, intern =FALSE)
(Because the last command explicitly puts it in the background, I don't think you need to set wait=FALSE.)
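As a sketch of how the two-cluster setup from the question could be driven from R: system2() expects the program and its arguments separately, which is likely why passing the whole command string as `command` ended with status 127. The helper name `h2o_launch_args` is hypothetical (it is not part of the h2o package), and the jar path is the same placeholder used above.

```r
# Build the argument vector for one standalone H2O JVM.
# `h2o_launch_args` is a hypothetical helper, not part of the h2o package.
h2o_launch_args <- function(jar, name, port, mem = "1g", nthreads = 1) {
  c(paste0("-Xmx", mem),
    "-jar", jar,
    "-name", name,
    "-nthreads", as.character(nthreads),
    "-port", as.character(port))
}

# Two independent single-node clusters on different ports,
# using the names and ports from the question.
clusters <- list(
  list(name = "testCluster1", port = 54321),
  list(name = "testCluster2", port = 54323)
)
launch_args <- lapply(clusters, function(cl) {
  h2o_launch_args("/path/to/h2ojar/h2o.jar", cl$name, cl$port)
})

# To actually start them (requires java on the PATH and a real jar path):
# for (args in launch_args) system2("java", args = args, wait = FALSE)
```

To stop one cluster without touching the other, it should be enough to connect to its port with h2o.init(startH2O = FALSE, port = ...) and then call h2o.shutdown(prompt = FALSE). A jar path containing spaces would need quoting, for example with shQuote().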

Related

Number of slaves is 0 when I mpirun my R code that tests Rmpi

After some trials, I was able to install Rmpi package on my computer using the following code:
R CMD INSTALL -l /storage/home/***/.R Rmpi_0.6-7.tar.gz --configure-args="--with-Rmpi-type=OPENMPI --disable-dlopen --with-Rmpi-include=/gpfs/group/RISE/sw7/openmpi_4.1.4_gcc-9.3.1/include --with-Rmpi-libpath=/gpfs/group/RISE/sw7/openmpi_4.1.4_gcc-9.3.1/lib"
I tried to run the following test code:
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
library("Rmpi")
}
ns <- mpi.universe.size() - 1
mpi.spawn.Rslaves(nslaves=ns)
#
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
if (is.loaded("mpi_initialize")){
if (mpi.comm.size(1) > 0){
print("Please use mpi.close.Rslaves() to close slaves.")
mpi.close.Rslaves()
}
print("Please use mpi.quit() to quit R")
.Call("mpi_finalize")
}
}
# Tell all slaves to return a message identifying themselves
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( ns <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
# Test computations
x <- 5
x <- mpi.remote.exec(rnorm, x)
length(x)
x
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves(dellog = FALSE)
mpi.quit()
On my HPC I run the following:
qsub -A open -l walltime=6:00:00 -l nodes=4:ppn=4:stmem -I
module use /gpfs/group/RISE/sw7/modules
module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
mpirun -np 4 Rscript "codes/test/test4.R"
But then I get the following error, which says the number of slaves must be positive (so the computed number of slaves is apparently 0):
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: comp-sc-0222
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: comp-sc-0222
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: comp-sc-0222
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: comp-sc-0222
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: comp-sc-0222
Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: comp-sc-0222
Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: comp-sc-0222
Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: comp-sc-0222
Local device: mlx4_0
--------------------------------------------------------------------------
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
Choose a positive number of slaves.
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
Choose a positive number of slaves.
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
Choose a positive number of slaves.
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
Choose a positive number of slaves.
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
I have tried specifying different values for -np but still get the same error. What could be the cause here?
============================================================
(EDIT)
It seems that my original command to load the modules also loads intel/19.1.2 and mkl/2020.3. If I unload them, I do see that OMPI_UNIVERSE_SIZE=4.
[****#comp-sc-0220 work]$ module purge
[****#comp-sc-0220 work]$ module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
[****#comp-sc-0220 work]$ module list
Currently Loaded Modules:
1) openmpi/4.1.4-gcc.9.3.1 2) intel/19.1.2 3) mkl/2020.3 4) r/4.0.3
[****#comp-sc-0220 work]$ mpirun -np 4 env | grep OMPI_UNIVERSE_SIZE
[****#comp-sc-0220 work]$ type mpirun; mpirun --version; mpirun -np 1 env | grep OMPI
mpirun is /opt/aci/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
Intel(R) MPI Library for Linux* OS, Version 2019 Update 8 Build 20200624 (id: 4f16ad915)
Copyright 2003-2020, Intel Corporation.
LMOD_FAMILY_COMPILER_VERSION=19.1.2
LMOD_FAMILY_COMPILER=intel
[****#comp-sc-0220 work]$ module purge
[****#comp-sc-0220 work]$ module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
[****#comp-sc-0220 work]$ module unload intel mkl
[****#comp-sc-0220 work]$ module list
Currently Loaded Modules:
1) openmpi/4.1.4-gcc.9.3.1 2) r/4.0.3
[****#comp-sc-0220 work]$ mpirun -np 4 env | grep OMPI_UNIVERSE_SIZE
OMPI_UNIVERSE_SIZE=4
OMPI_UNIVERSE_SIZE=4
OMPI_UNIVERSE_SIZE=4
OMPI_UNIVERSE_SIZE=4
[****#comp-sc-0220 work]$ type mpirun; mpirun --version; mpirun -np 1 env | grep OMPI
mpirun is /gpfs/group/RISE/sw7/openmpi_4.1.4_gcc-9.3.1/bin/mpirun
mpirun (Open MPI) 4.1.4
Report bugs to http://www.open-mpi.org/community/help/
OMPI_MCA_pmix=^s1,s2,cray,isolated
OMPI_COMMAND=env
OMPI_MCA_orte_precondition_transports=954e2ae0a9569e46-2223294369d728a3
OMPI_MCA_orte_local_daemon_uri=4134338560.0;tcp://10.102.201.220:58039
OMPI_MCA_orte_hnp_uri=4134338560.0;tcp://10.102.201.220:58039
OMPI_MCA_mpi_oversubscribe=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_orte_num_nodes=1
OMPI_MCA_shmem_RUNTIME_QUERY_hint=mmap
OMPI_MCA_orte_bound_at_launch=1
OMPI_MCA_ess=^singleton
OMPI_MCA_orte_ess_num_procs=1
OMPI_COMM_WORLD_SIZE=1
OMPI_COMM_WORLD_LOCAL_SIZE=1
OMPI_MCA_orte_tmpdir_base=/tmp
OMPI_MCA_orte_top_session_dir=/tmp/ompi.comp-sc-0220.26954
OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.comp-sc-0220.26954/pid.8212
OMPI_NUM_APP_CTX=1
OMPI_FIRST_RANKS=0
OMPI_APP_CTX_NUM_PROCS=1
OMPI_MCA_initial_wdir=/storage/work/k/****
OMPI_MCA_orte_launch=1
OMPI_MCA_ess_base_jobid=4134338561
OMPI_MCA_ess_base_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
OMPI_COMM_WORLD_NODE_RANK=0
OMPI_MCA_orte_ess_node_rank=0
OMPI_FILE_LOCATION=/tmp/ompi.comp-sc-0220.26954/pid.8212/0/0
But if I run the same test4.R again, I get the following error:
/gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
/gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
/gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[63743,1],0]
Exit code: 127
--------------------------------------------------------------------------
============================================================
(EDIT 2)
I changed my module load command again to module load openmpi/4.1.4-gcc.9.3.1 r/4.0.5-gcc-9.3.1. With this newer version of R I ran my test4.R script again with mpirun -np 4 Rscript "codes/test/test4.R". It is now returning a new error message as follows:
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] 4
[1] 4
[1] 4
[1] 4
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
MPI_ERR_SPAWN: could not spawn processes
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
MPI_ERR_SPAWN: could not spawn processes
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
MPI_ERR_SPAWN: could not spawn processes
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
MPI_ERR_SPAWN: could not spawn processes
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[62996,1],1]
Exit code: 1
--------------------------------------------------------------------------
Install the package pbdMPI in an R session on the login node and run the following translation of the Rmpi test code into the use of pbdMPI:
library(pbdMPI)
# Tell all R sessions to return a message identifying themselves
id <- comm.rank()
ns <- comm.size()
host <- system("hostname", intern = TRUE)
comm.cat("I am", id, "on", host, "of", ns, "\n", all.rank = TRUE)
# Test computations
x <- 5
x <- rnorm(x)
comm.print(length(x))
comm.print(x, all.rank = TRUE)
finalize()
You run it the same way you used for the Rmpi version: mpirun -np 4 Rscript your_new_script_file.
Spawning MPI (as in the Rmpi example) was appropriate when running on clusters of workstations but on an HPC cluster the prevalent way to program with MPI is SPMD - single program multiple data. SPMD means that your code is a generalization of a serial code that is able to have several copies of itself cooperate with each other.
In the above example, cooperation happens only with printing (the comm... functions). There is no manager/master, just several R sessions running the same code (usually computing something different based on comm.rank()) and cooperating/communicating via MPI. This is the prevalent way of large scale parallel computing on HPC clusters.
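To make "computing something different based on comm.rank()" concrete, here is a minimal sketch of the work-splitting idea in base R. The helper `my_chunk` is hypothetical, and the ranks are simulated serially so the snippet runs without MPI; under mpirun each R session would call it once with its own comm.rank() and comm.size().

```r
# Indices of 1..n that a given rank is responsible for (ranks are 0-based,
# matching what comm.rank() returns). `my_chunk` is a hypothetical helper.
my_chunk <- function(n, rank, size) {
  idx <- seq_len(n)
  idx[(idx - 1) %% size == rank]
}

# Simulate 4 ranks serially; under mpirun each R session would instead call
# my_chunk(n, comm.rank(), comm.size()) and see only its own share.
n <- 10
size <- 4
partial <- sapply(0:(size - 1), function(rank) sum(my_chunk(n, rank, size)^2))

# In pbdMPI the partial sums would be combined with allreduce(); here we
# just add them up so the result can be compared with the serial answer.
total <- sum(partial)
```

The round-robin split is only one possible choice; contiguous blocks per rank would work equally well for this kind of reduction.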

How to stop h2o from saving massive .ERR, .OUT and other log files to the local drive

I am currently running a script in which several h2o glm and deeplearning models are being generated for several iterations of a Monte-Carlo Cross-Validation. When finished running (which takes about half a day), h2o is saving immense files to the local drive (with sizes up to 8.5 GB). These files are not erased when RStudio or my computer is restarted (as I originally thought). Is there a way to stop h2o from saving these files?
When you start H2O with h2o.init() from R, the stdout and stderr files should be saved to a temporary directory (see R's tempdir() to see the path). This temporary directory should be removed when the R session exits. It seems as though this is not working with RStudio, however it works if you are using R from the command line. I'm not sure if this is a setting that can be changed in RStudio or if this is an RStudio bug.
But you can take more control yourself. You can start H2O by hand using java on the command line and then connect from R using h2o.init().
java -Xmx5g -jar h2o.jar
In this example, I started H2O with 5 GB of Java heap memory, but you should increase that if your data is larger. Then connecting in R will look like this:
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 16 hours 34 minutes
H2O cluster version: 3.15.0.99999
H2O cluster version age: 17 hours and 25 minutes
H2O cluster name: H2O_started_from_R_me_exn817
H2O cluster total nodes: 1
H2O cluster total memory: 4.43 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.2 (2016-10-31)
So if you want to redirect both stdout and stderr to /dev/null, add the redirect to the end of the java command that starts the H2O cluster, then connect to H2O from R as before. To redirect both, append > /dev/null 2>&1 like this:
java -Xmx5g -jar h2o.jar > /dev/null 2>&1 &
I encountered this in a Spark shell running H2O. The shell had 50 executors connected, and the logs in the /tmp directories on those nodes eventually caused storage issues.
When h2o.init() is called it creates JVMs, and the logging from H2O is handled by these JVMs. But when the shell is shut down those JVMs persist and just log heartbeat errors in /tmp in perpetuity. You will need to find the JVMs associated with H2O and shut them down. I believe in my case the specific process names were water.H2OApp.
I found it easier to take care of the problem by removing those files after running every model.
unlink(list.files(tempdir(), full.names = TRUE), recursive = TRUE)
This helps remove the temporary files when I run the multiple models in a loop.
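A slightly more cautious variant, sketched under assumptions: rather than wiping everything in tempdir(), delete only files whose names match a pattern. The helper `clear_matching` and the pattern below are illustrative; check which file names H2O actually writes on your system before relying on a pattern.

```r
# Delete files in `dir` whose names match `pattern`; returns the paths removed.
# `clear_matching` and the pattern used below are illustrative assumptions.
clear_matching <- function(dir, pattern) {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  unlink(files)
  files[!file.exists(files)]
}

# Demonstration in a scratch directory rather than the real tempdir().
scratch <- file.path(tempdir(), "h2o_log_demo")
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c("run1.out", "run1.err", "keep.csv")))
removed <- clear_matching(scratch, "\\.(out|err)$")
remaining <- list.files(scratch)
```

Inside a modelling loop the call would be something like clear_matching(tempdir(), "\\.(out|err)$") after each model, which leaves other temporary files of the R session alone.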

Read a large (1.5 GB) file in h2o R

I am using the h2o package for modelling in R. For this I want to read a dataset of about 1.5 GB using h2o.importFile(). I start the h2o server using the lines
library(h2oEnsemble)
h2o.init(max_mem_size = '1499m',nthreads=-1)
This produces a log
H2O is not running yet, starting it now...
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode)
Starting H2O JVM and connecting: . Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 seconds 665 milliseconds
H2O cluster version: 3.10.4.8
H2O cluster version age: 28 days, 14 hours and 36 minutes
H2O cluster name: H2O_started_from_R_Lucifer_jvn970
H2O cluster total nodes: 1
H2O cluster total memory: 1.41 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.2 (2016-10-31)
The following line gives me an error
train=h2o.importFile(path=normalizePath("C:\\Users\\All data\\traindt.rds"))
DistributedException from localhost/127.0.0.1:54321, caused by java.lang.AssertionError
DistributedException from localhost/127.0.0.1:54321, caused by java.lang.AssertionError
at water.MRTask.getResult(MRTask.java:478)
at water.MRTask.getResult(MRTask.java:486)
at water.MRTask.doAll(MRTask.java:402)
at water.parser.ParseDataset.parseAllKeys(ParseDataset.java:246)
at water.parser.ParseDataset.access$000(ParseDataset.java:27)
at water.parser.ParseDataset$ParserFJTask.compute2(ParseDataset.java:195)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1315)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.AssertionError
at water.parser.Categorical.addKey(Categorical.java:41)
at water.parser.FVecParseWriter.addStrCol(FVecParseWriter.java:127)
at water.parser.CsvParser.parseChunk(CsvParser.java:133)
at water.parser.Parser.readOneFile(Parser.java:187)
at water.parser.Parser.streamParseZip(Parser.java:217)
at water.parser.ParseDataset$MultiFileParseTask.streamParse(ParseDataset.java:907)
at water.parser.ParseDataset$MultiFileParseTask.map(ParseDataset.java:856)
at water.MRTask.compute2(MRTask.java:601)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1318)
at water.parser.ParseDataset$MultiFileParseTask$Icer.compute1(ParseDataset$MultiFileParseTask$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1314)
... 5 more
Error: DistributedException from localhost/127.0.0.1:54321, caused by java.lang.AssertionError
Any help on how to fix this problem?
Note: Assigning memory larger than 1499mb also gives me an error (cannot allocate memory). I am using a 16GB RAM environment.
Edit: I downloaded the 64-bit version of Java and changed my file to a csv file. I was then able to set max_mem_size to 5G and the problem was solved.
For others who face the problem:
1. Download the latest version of the 64-bit JDK
2. Execute the following line
h2o.init(max_mem_size = '5g',nthreads=-1)
You are running with 32-bit java, which is limiting the memory that you are able to start H2O with. One clue is that it won't start with a higher max_mem_size. Another clue is that it says "Client VM".
You want 64-bit java instead. The 64-bit version will say "Server VM". You can download the Java 8 SE JDK from here:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Based on what you've described, I recommend setting max_mem_size = '6g' or more, which will work fine on your system once you have the right version of Java installed.
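The "Client VM" versus "64-Bit Server VM" clue can also be checked programmatically. This is a heuristic sketch: `is_64bit_java` is a hypothetical helper that simply looks for the "64-Bit" marker in the version banner, and the 64-bit banner below is an example of the form a 64-bit HotSpot JVM prints.

```r
# TRUE if the `java -version` banner mentions a 64-bit VM.
# `is_64bit_java` is a hypothetical helper for illustration.
is_64bit_java <- function(version_lines) {
  any(grepl("64-Bit", version_lines, fixed = TRUE))
}

# Banner from the question (32-bit Client VM).
banner_32 <- c('java version "1.8.0_121"',
               "Java(TM) SE Runtime Environment (build 1.8.0_121-b13)",
               "Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode)")
# Example banner of a 64-bit JVM.
banner_64 <- c('java version "1.8.0_131"',
               "Java(TM) SE Runtime Environment (build 1.8.0_131-b11)",
               "Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)")
is_32_ok <- is_64bit_java(banner_32)
is_64_ok <- is_64bit_java(banner_64)

# On a live system the banner can be captured with (java prints it to stderr):
# is_64bit_java(system2("java", "-version", stdout = TRUE, stderr = TRUE))
```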
train=h2o.importFile(path=normalizePath("C:\\Users\\All data\\traindt.rds"))
Are you trying to load an .rds file? That's an R binary format which is not readable by h2o.importFile(), so that won't work. You will need to store your training data in a cross-platform storage format (e.g. CSV, SVMLight, etc.) if you want to read it into H2O directly. If you don't have a copy in another format, then just save one from R:
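# read the data.frame stored in the .rds file (readRDS(), not load(), reads .rds files)
train <- readRDS("C:\\Users\\All data\\traindt.rds")
# save as CSV, without the row-name column
write.csv(train, "C:\\Users\\All data\\traindt.csv", row.names = FALSE)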
# import from CSV into H2O cluster directly
train = h2o.importFile(path = normalizePath("C:\\Users\\All data\\traindt.csv"))
Another option is to load it into R from the .rds file and use the as.h2o() function:
# read the data.frame stored in the .rds file (readRDS(), not load(), reads .rds files)
train <- readRDS("C:\\Users\\All data\\traindt.rds")
# send to H2O cluster
hf <- as.h2o(train)

How to allow h2o to access all available memory?

I am running h2o through RStudio Server on a Linux server with 64 GB of RAM. When I initialize the cluster it says that the total cluster memory is only 9.78 GB. I have tried using the max_mem_size parameter, but it still only uses 9.78 GB.
localH2O <<- h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "25g")
H2O is not running yet, starting it now...
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 5 hours 10 minutes
H2O cluster version: 3.10.4.6
H2O cluster version age: 19 days
H2O cluster name: H2O_started_from_R_miweis_mxv543
H2O cluster total nodes: 1
H2O cluster total memory: 9.78 GB
H2O cluster total cores: 16
H2O cluster allowed cores: 16
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.3 (2017-03-06)
I ran the following on the server to check the amount of memory available:
cat /proc/meminfo
MemTotal: 65806476 kB
EDIT:
I looked into this issue further and it seems to be a default within the JVM. When I started h2o directly with Java I was able to pass the option -Xmx32g and it did increase the memory. I could then connect to that h2o instance from RStudio and have access to the increased memory. I was wondering if there is a way to change this default value in the JVM and allow more memory, so I don't have to first start the h2o instance from the command line and then connect to it from RStudio Server.
The max_mem_size argument in the h2o R package is functional, so you can use it to start an H2O cluster of whatever size you want -- you don't need to start it from the command line using -Xmx.
What seems to be happening in your case is that you are connecting to an existing H2O cluster located at localhost:54321 that was limited to "10G" (in reality, 9.78 GB). So when you run h2o.init() from R, it will just connect to the existing cluster (with a fixed memory), rather than starting a new H2O cluster with the memory that you specified in max_mem_size, and so the memory request gets ignored.
To fix, you should do one of the following:
1. Kill the existing H2O cluster at localhost:54321 and restart from R with the desired memory requirement, or
2. Start a cluster from R at a different IP/port than the one that's already running.
When starting H2O with h2o.init() you may want to specify the argument min_mem_size=
This forces H2O to use at least that amount of memory. max_mem_size= prevents H2O from using more than that amount of memory.
If you have 6GB (for example) of available memory you can do this:
library(h2o)
h2o.init(max_mem_size = "6g")

h2o initialization parameters not optimal

My laptop has 8GB RAM with 4 cores.
My h2o version is as follows:
Package: h2o
Type: Package
Version: 3.10.0.8
Branch: rel-turing
Date: Mon Oct 10 13:47:51 PDT 2016
License: Apache License (== 2.0)
Depends: R (>= 2.13.0), RCurl, jsonlite, statmod, tools, methods, utils
I initialized it as follows,
h2o.init(nthreads = -1,max_mem_size = "8g")
But the output I get is as follows,
R is connected to the H2O cluster:
H2O cluster uptime: 13 hours 21 minutes
H2O cluster version: 3.10.0.8
H2O cluster version age: 21 days, 13 hours and 33 minutes
H2O cluster name: H2O_started_from_R_hp_ubq027
H2O cluster total nodes: 1
H2O cluster total memory: 1.33 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
R Version: R version 3.3.1 (2016-06-21)
Why are only 2 cores allowed and only 1.33 GB of memory available to the cluster, while almost 8GB is available?
It says it has been running 13hrs. So what you are seeing is a cluster that is already running, and was (probably) started with default settings.
So, before doing your h2o.init() command you need to do h2o.shutdown():
h2o.shutdown()
h2o.init(nthreads = -1,max_mem_size = "8g")
(Remember when you shut down H2O that all models and data are lost, so use h2o.exportFile() and/or h2o.saveModel() if any of it cannot easily be re-created.)
UPDATE: I just noticed you said you had an 8GB laptop? I'd recommend not allocating more than 90% to H2O if the machine is dedicated, to be sure there is some left for the OS, Flow web server, etc. (The EC2 scripts use 90%.) And if you intend to do other stuff on your notebook (run RStudio, check email, use StackOverflow in a browser window, etc.) subtract the memory for all that first. (My notebook is 8GB, and my general-purpose machine, so I usually give H2O "4g" if I think I'll be making a lot of models, "2g" or "3g" otherwise.)
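The 90% rule of thumb can be written down as a small calculation. This is only a sketch: `suggest_max_mem` is a hypothetical helper, and the amount to reserve for other programs is something you have to estimate yourself.

```r
# Suggest a max_mem_size string: a fraction of what is left after reserving
# memory for other programs. `suggest_max_mem` is a hypothetical helper.
suggest_max_mem <- function(total_gb, reserve_gb = 0, fraction = 0.9) {
  usable <- floor((total_gb - reserve_gb) * fraction)
  stopifnot(usable >= 1)
  paste0(usable, "g")
}

dedicated <- suggest_max_mem(8)                  # machine dedicated to H2O
shared    <- suggest_max_mem(8, reserve_gb = 4)  # laptop also used for other work

# The result would then be passed on, e.g.:
# h2o.init(nthreads = -1, max_mem_size = dedicated)
```

Rounding down to whole gigabytes is a deliberate simplification; a string in megabytes such as "7300m" would allow finer control.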
Regarding nthreads defaulting to 2 -- to the best of my knowledge, that is a CRAN policy restriction, which is why it's set to 2 instead of -1 (recommended).
