How to prevent h2o cluster shutdown without notice using R

I am an h2o R user and I have a question regarding the h2o local cluster. I set up the cluster by executing the following command in R:
h2o.init()
However, the cluster shuts down automatically when I do not use it for a few hours. For example, I run my model during the night, but when I come back to my office in the morning to check on it, I get:
Error in h2o.getConnection() : No active connection to an H2O cluster. Did you run h2o.init()?
Is there a way to fix or work around this?

If the H2O cluster is still running, then your models are all still there (assuming they finished training successfully). There are a number of ways that you can check if the H2O Java cluster is still running. In R, you can check the output of these functions:
h2o.clusterStatus()
h2o.clusterInfo()
At the command line (look for a Java process):
ps aux | grep java
If you started H2O from R, then you should see a line that looks something like this:
yourusername 26215 0.0 2.7 8353760 454128 ?? S 9:41PM 21:25.33 /usr/bin/java -ea -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/h2o/java/h2o.jar water.H2OApp -name H2O_started_from_R_me_iqv833 -ip localhost -port 54321 -ice_root /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T//Rtmp6XG99X
H2O models do not live in the R environment, they live in the H2O cluster (a Java process). It sounds like what's happening is that the R object representing your model (which is actually just a pointer to the model in the H2O cluster) is having issues finding the model since your cluster disconnected. I don't know exactly what's going on because you haven't posted the errors you're receiving when you try to use h2o.predict() or h2o.performance().
To get the model back, you can use the h2o.getModel() function. You will need to know the ID of your model. If your model object (the one that's not working properly) is still accessible, you can read the model ID from it via model@model_id. You can also head over to H2O Flow in the browser (at http://127.0.0.1:54321 if you started H2O with the defaults) and view all the models by ID there.
Once you know the model ID, then refresh the model by doing:
model <- h2o.getModel("model_id")
This should re-establish the connection to your model and the h2o.predict() and h2o.performance() functions should work again.
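For instance, a minimal sketch of re-attaching to a finished model; the ID "my_model_id" is only a placeholder for whatever Flow or model@model_id reports:
library(h2o)
h2o.init()                              # reconnects to the running cluster on localhost:54321 by default
model <- h2o.getModel("my_model_id")    # placeholder ID, replace with your own
perf  <- h2o.performance(model)         # works again once the pointer is refreshed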

Related

Security concerns with the H2O R package

I am using the H2O R package.
My understanding is that this package requires an internet connection and connects to the h2o servers. Is that right? If you use the h2o package to run machine learning models on your data, does h2o "see" your data? I turned off my wifi and tried running some machine learning models using h2o:
data(iris)
library(h2o)
h2o.init()
iris_hf <- as.h2o(iris)
iris_dl <- h2o.deeplearning(x = 1:4, y = 5, training_frame = iris_hf, seed=123456)
predictions <- h2o.predict(iris_dl, iris_hf)
This seems to work, but could someone please confirm? If you do not want anyone to see your data, is it still a good idea to use the "h2o" library? Since the code above runs without an internet connection, I am not sure about this.
From the documentation of h2o.init() (emphasis mine):
This method first checks if H2O is connectible. If it cannot connect and startH2O = TRUE with IP of localhost, it will attempt to start an instance of H2O with IP = localhost, port = 54321. Otherwise, it stops immediately with an error. When initializing H2O locally, this method searches for h2o.jar in the R library resources [...], and if the file does not exist, it will automatically attempt to download the correct version from Amazon S3. The user must have Internet access for this process to be successful. Once connected, the method checks to see if the local H2O R package version matches the version of H2O running on the server. If there is a mismatch and the user indicates she wishes to upgrade, it will remove the local H2O R package and download/install the H2O R package from the server.
So, h2o.init() with the default setting ip = "127.0.0.1", as here, connects the R session to the H2O instance (sometimes referred to as a "server") running on your local machine. If all the necessary package files are in place and up to date, no internet connection is necessary; the package only attempts to connect to the internet to download something that is missing or out of date. No data is uploaded anywhere.
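If you want to double-check this yourself, a small sketch (assuming the default local settings) is:
library(h2o)
h2o.init()          # starts/attaches to H2O on 127.0.0.1:54321 by default
h2o.clusterInfo()   # the reported cluster IP should be 127.0.0.1, i.e. your own machine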

Is clustering algorithm running although Jupyter Notebook Gateway timed out?

I am running the sklearn DBSCAN algorithm on a dataset with dimensionality 300000x50 in a Jupyter Notebook on AWS Sagemaker ("ml.t2.medium" compute instance). The dataset contains feature vectors of 1s and 0s.
Once I run the cell, an orange "Gateway Timeout" prompt appears in the upper right corner after a while. The icon disappears when you click on it, providing no further information. The notebook is unresponsive until you restart the notebook instance.
I have tried different values for the parameters eps and min_samples to no avail.
db = DBSCAN(eps = 0.1, min_samples = 100).fit(transformed_vectors)
Does "Gateway Timeout" mean that the notebook kernel has crashed or can I expect any results by waiting?
So far the calculation has been running for about 2 hours.
You could always pick a larger size for your notebook instance (ml.t2.medium is pretty small), but I think the better way would be to run your training code on a managed SageMaker training instance. Scikit-learn is built in on SageMaker, so all you have to do is bring your script, e.g.:
from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point="my_code.py",
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session)
Here's a complete example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_iris/Scikit-learn%20Estimator%20Example%20With%20Batch%20Transform.ipynb

Model training fails with h2o deepwater

While trying to train a LeNet model for multiclass classification with h2o deepwater using the mxnet backend, I get the following errors:
Loading H2O mxnet bindings.
Found CUDA_HOME or CUDA_PATH environment variable, trying to connect to GPU devices.
Loading CUDA library.
Loading mxnet library.
Loading H2O mxnet bindings.
Done loading H2O mxnet bindings.
Constructing model.
Done constructing model.
Building network.
mxnet data input shape: (32,100)
[10:40:16] /home/jenkins/slave_dir_from_mr-0xb1/workspace/deepwater-master/thirdparty/mxnet/dmlc-core/include/dmlc/logging.h:235: [10:40:16] src/operator/./convolution-inl.h:349: Check failed: (dshape.ndim()) == (4) Input data should be 4D in batch-num_filter-y-x
[10:40:16] src/symbol.cxx:189: Check failed: (MXSymbolInferShape(GetHandle(), keys.size(), keys.data(), arg_ind_ptr.data(), arg_shape_data.data(), &in_shape_size, &in_shape_ndim, &in_shape_data, &out_shape_size, &out_shape_ndim, &out_shape_data, &aux_shape_size, &aux_shape_ndim, &aux_shape_data, &complete)) == (0)
The details of my setup :
* Ubuntu : 16.04
* Ram : 12gb
* Graphics card : Nvidia 920mx driver version : 384.90
* Cuda : 8.0.61
* cudnn : 6.0
* R version : 3.4.3
* H2o version : 3.15.0.393 & h2o-R package : 3.16.0.2
* mxnet : 0.11.0
* Train data size : 400mb (when converting to the h2o frame object it comes around 822mb)
Things I have done :
1.) Gave enough memory to the Java heap while running the h2o cluster (java -Xmx9g -jar h2o.jar)
2.) Built mxnet from source with GPU support
3.) Monitored the GPU and system via nvidia-smi and the system monitor. At no point do they use up all the RAM, so this does not look like an "out of memory" issue; around 2-3gb are still free when the error shows up
4.) Tried tensorflow-gpu (built from source). Checking pip list confirmed that it is installed, but during model creation in R it gives the error:
Error: java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: null
5.) The only way I got h2o deepwater to work, with all the backends and with/without GPU, is through the Docker setup provided in the installation tutorials.
I want the same functionality on my laptop without using Docker. Also, is there any way to run deepwater using just the CPU? The thread "Is it possible to build Deep Water/TensorFlow model in H2O without CUDA" doesn't provide any helpful answers. Any help or advice will be greatly appreciated!
As is evident from the error logs and from the documentation of mxnet.sym.Convolution, your data needs to be in [batch, channels, height, width] format. However, it looks like your data contains only two dimensions (based on this log line: mxnet data input shape: (32,100)). Reformatting the data, even by adding two dimensions of size 1 so that your input shape is (1,1,32,100), should resolve this issue.
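For illustration, a minimal base-R sketch of padding a 2-D matrix with singleton dimensions (the data here is random and purely illustrative):
x <- matrix(runif(32 * 100), nrow = 32, ncol = 100)  # original shape (32, 100)
dim(x) <- c(1, 1, 32, 100)                           # padded shape (1, 1, 32, 100)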

Error while using h2o.init in R

This is the error message:
> h2o.init()
Error in dirname(path) : path too long
In addition: There were 12 warnings (use warnings() to see them)
This is one of the warning messages (the others are similar):
> warnings()
Warning messages:
1: In normalizePath(path.expand(path), winslash, mustWork) :
path[1]="\\FILE-EM1-06/USERDATA2$/john134/My Documents/./../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..": The filename or extension is too long
Any idea how to work around this error?
Thanks
It seems that the Windows path string is limited to (roughly) 256 characters. Usually, setting a shorter working directory with setwd(shorterExistingWorkDir) works and should address your issue.
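For example (the directory name is only illustrative; use any short path that exists on your machine):
setwd("C:/h2o_work")   # short, existing working directory
library(h2o)
h2o.init()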
I struggled with this issue quite a bit, including upgrading.
Most folks are assuming that you've literally just set an incredibly long path. I don't think this is the case (it wasn't for me, at least). It's that the PATH may be set on a network drive or other device where the underlying mapped paths are more complicated.
A related thread is here on the H2O forum:
The main issue is that the user had a Windows drive that did not conform to the norm, i.e., "C://", etc. Instead, the user had a network drive (DTCHYB-AZPX015/). This caused issues in the search for a config file, as there was no "root" (in this case, "root" means reaching your Windows drive). Since there was no "root", the path to search kept expanding until it caused R to error out with the above exception.
The fix is to NOT search for a config when h2o.init() is called. Rather, only search for a config if a user asks to do so. My proposal is to add a new field to h2o.init() called ignore_config. This field will be set to TRUE by default.
Calling h2o.init() from R launches the h2o application (actually a web server) in the backend, which was installed when you installed the H2O package into R. The local runtime environment uses the full path of the location where the H2O jar file is located. Because the package is installed deep inside nested folders in your file system, the path crosses the OS limit of 256 characters, the backend H2O server fails to launch, and you see this error. In your case you are using a network path, which adds even more characters and makes the problem worse.
For example the h2o.jar is located in my OSX machine as below:
/Library/Frameworks/R.framework/Resources/library/h2o <-- H2O package Path
/Library/Frameworks/R.framework/Resources/library/h2o/java/h2o.jar <-- Jar Path
As you are using Windows, what you need is to find a way to shorten this path to within the OS limit, and it will work.
The other solution is to run h2o.jar separately and then just use R to connect to the H2O cluster. The steps are as below:
Download H2O 3.10.4.2 and unzip it to a folder close to the root so you do not hit the 256-character limit again. Also install the 3.10.4.2 R package (try to keep the versions the same).
Run H2O > java -jar h2o.jar
From RStudio console try > h2o.init()
If there is already an H2O cluster running, h2o.init() will connect to it instead of starting a new one, and you will bypass the problem above.
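A minimal sketch of the last step, assuming the jar was launched on the default port:
library(h2o)
h2o.init(ip = "localhost", port = 54321, startH2O = FALSE)  # connect only, never try to start a new instance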
If you hit any problems, write here and we will help you.

MPI cluster based parallel calculation in R on WestGrid (pbs file)

I am now dealing with a large dataset and I want to use parallel calculation to accelerate the process. WestGrid is a Canadian computing system with interconnected clusters.
I use two packages doSNOW and parallel to do parallel jobs. My question is how I should write the pbs file. When I submit the job using qsub, an error occurs: mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
Here is the R script code:
install.packages("fume_1.0.tar.gz")
library(fume)
library(foreach)
library(doSNOW)
load("spei03_df.rdata",.GlobalEnv)
cl <- makeCluster(mpi.universe.size(), type='MPI' )
registerDoSNOW(cl)
MK_grid <-
  foreach(i = 1:6000, .packages = "fume", .combine = 'rbind') %dopar% {
    abc <- mkTrend(as.matrix(spei03_data)[i, ])
    data.frame(P_value = abc$`Corrected p.value`, Slope = abc$`Sen's Slope` * 10, Zc = abc$Zc)
  }
stopCluster(cl)
save(MK_grid,file="MK_grid.rdata")
mpi.exit()
The "fume" package is download from https://cran.r-project.org/src/contrib/Archive/fume/ .
Here is the pbs file:
#!/bin/bash
#PBS -l nodes=2:ppn=12
#PBS -l walltime=2:00:00
module load application/R/3.3.1
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=1
mpirun -np 1 -hostfile $PBS_NODEFILE R CMD BATCH Trend.R
Can anyone help? Thanks a lot.
It's difficult to give advice on how to use a compute cluster that I've never used, since each cluster is set up somewhat differently, but I can give you some general advice that may help.
Your job script looks reasonable to me. It's very similar to what I use on one of our Torque/Moab clusters. It's a good idea to verify that you're able to load all of the necessary R packages interactively because sometimes additional module files may need to be loaded. If you need to install packages yourself, make sure you install them in the standard "personal library" which is called something like "~/R/x86_64-pc-linux-gnu-library/3.3". That often avoids errors loading packages in the R script when executing in parallel.
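For example, install interactively on the login node before submitting the job (the package name is only an example):
# run once in an interactive R session, not inside the batch script
install.packages("doSNOW")   # installs into the personal library, e.g. ~/R/x86_64-pc-linux-gnu-library/3.3
.libPaths()                  # verify the personal library is on the search path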
I have more to say about your R script:
You need to load the Rmpi package in your R script using library(Rmpi). It isn't automatically loaded when loading doSNOW, so you will get an error when calling mpi.universe.size().
I don't recommend installing R packages in the R script itself. That will fail if install.packages needs to prompt you for the CRAN repository, for example, since you can't execute interactive functions from an R script executed via mpirun.
I suggest starting mpi.universe.size() - 1 cluster workers when calling makeCluster. Since mpirun itself starts one process, it may not be safe for makeCluster to spawn mpi.universe.size() additional workers, since that would result in a total of mpi.universe.size() + 1 MPI processes. That works on some clusters, but it fails on at least one of our clusters.
While debugging, try using the makeCluster outfile='' option. Depending on your MPI installation, that may let you see error messages that would otherwise be hidden.
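Putting those suggestions together, a minimal sketch of the revised cluster setup (the rest of the script stays as before):
library(Rmpi)     # needed for mpi.universe.size()
library(doSNOW)
cl <- makeCluster(mpi.universe.size() - 1, type = "MPI", outfile = "")  # leave one slot for the master, echo worker output
registerDoSNOW(cl)
# ... foreach loop as before ...
stopCluster(cl)
mpi.exit()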
