I'm trying to connect to h2o for the first time in RStudio with the following commands:
library(h2o)
h2o.init()
I'm on R version 2.3.2 and h2o version 3.10.2.2. Any idea how to fix this so I can connect? Here is the output:
H2O is not running yet, starting it now...
running command ''/usr/bin/java' -version 2>&1' had status 1
Note: In case of errors look at the following log files:
/var/folders/w7/vpssy9010lg4xxlwkg5f2zwm0000gn/T//RtmpZhyGXB/h2o_chrisstroud_started_from_r.out
/var/folders/w7/vpssy9010lg4xxlwkg5f2zwm0000gn/T//RtmpZhyGXB/h2o_chrisstroud_started_from_r.err
No Java runtime present, requesting install.
Starting H2O JVM and connecting: ............................................................
[1] "localhost"
[1] 54321
[1] TRUE
[1] -1
[1] "Failed to connect to localhost port 54321: Connection refused"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 54321: Connection refused
[1] 7
Error in h2o.init() : H2O failed to start, stopping execution.
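The output above also shows that running '/usr/bin/java' failed with status 1 and that no Java runtime is present, so before anything else it is worth confirming from a terminal that Java is actually installed; a quick check:
# Verify that a Java runtime is installed and on the PATH
java -version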
If you have a proxy enabled, it won't work. Unsetting the proxy environment variables before calling h2o.init() fixed it for me:
Sys.setenv(https_proxy="")
Sys.setenv(http_proxy="")
Sys.setenv(http_proxy_user="")
Sys.setenv(https_proxy_user="")
I submitted a job via Slurm. The job ran for 12 hours and was working as expected, and then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". It is usual for me to get errors like "ORTE has lost communication with a remote daemon", but I usually get those at the beginning of a job. That is annoying, but it still does not cause as much time loss as getting an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1. The full output is below.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
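For reference, the MCA parameters mentioned in the output above can be set when launching the job; this is only a sketch (the application name is a placeholder), and whether it addresses the crash itself depends on the cluster's UCX/InfiniBand setup:
# Allow the openib BTL to use InfiniBand ports, as the help message suggests
export OMPI_MCA_btl_openib_allow_ib=true
# Show every help/error message instead of aggregating them
export OMPI_MCA_orte_base_help_aggregate=0
mpirun ./my_mpi_app
# Equivalent, passing the parameters directly to mpirun:
# mpirun --mca btl_openib_allow_ib true --mca orte_base_help_aggregate 0 ./my_mpi_app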
I have the common issue with the R package 'parallel' as seen here. The command
cl <- future::makeClusterPSOCK(1, outfile = NULL, verbose = TRUE)
hangs on the machine whenever I am connected to a wifi network. However, it works fine when I am connected to a mobile hotspot from my phone.
I have read all posts asking for solutions, but so far the best idea was to reinstall my operating system, which I would really prefer to avoid...
Any ideas?
I use R version 3.5.1, Platform: x86_64-apple-darwin15.6.0 (64-bit).
Update 1:
When connections are turned off or I am on a wifi network, the output from the command above is something like:
Workers: [n = 1] ‘localhost’
Base port: 11349
Creating node 1 of 1 ...
- setting up node
Starting worker #1 on ‘localhost’: '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11349 OUT= TIMEOUT=2592000 XDR=TRUE
Waiting for worker #1 on ‘localhost’ to connect back
starting worker pid=4841 on localhost:11349 at 08:37:36.219
On a mobile hotspot the output looks very similar, but the connection succeeds:
Workers: [n = 1] ‘localhost’
Base port: 11501
Creating node 1 of 1 ...
- setting up node
Starting worker #1 on ‘localhost’: '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11501 OUT= TIMEOUT=2592000 XDR=TRUE
Waiting for worker #1 on ‘localhost’ to connect back
starting worker pid=4892 on localhost:11501 at 08:39:47.070
Connection with worker #1 on ‘localhost’ established
- assigning connection UUID
- collecting session information
Creating node 1 of 1 ... done
I am installing ICP 2.1.0.1 and received an error at the task [master: Waiting for MariaDB service to start]:
msg: The MariaDB component failed to start.
After this message, the installation completed with a failed status.
We are installing ICP with 3 masters, 3 proxies and 2 workers. We have 1 IP for the master VIP and 1 for the proxy VIP.
I tried to install multiple times and every installation hit the same error.
In prior cases of that error, the correct DB admin password was not being used, so check the DB user and password to resolve the issue.
Could you also validate whether each master host is able to access port 3306 on the other hosts?
If you run with .. install -vv | tee -a install-log.txt, do you get additional details as well?
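A quick way to check that, assuming netcat is available on the hosts and using placeholder hostnames, would be:
# From each master node, verify that MariaDB's port is reachable on the other masters
nc -zv master2.example.com 3306
nc -zv master3.example.com 3306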
The error was solved by following the steps below.
Check whether kubelet is running:
Log in to your master node.
Run the following command to check kubelet status:
systemctl status kubelet
If kubelet is not running, run the following command to get the logs:
journalctl -u kubelet &> kubelet.log
We found this error in kubelet.log:
Error: failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false.
We found this troubleshooting guide at the link below, and the solution in ICP issue 4651:
https://www.ibm.com/support/knowledgecenter/en/SSBS6K_2.1.0/troubleshoot/etcd_fails.html
https://github.ibm.com/IBMPrivateCloud/roadmap/issues/4651
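For reference, disabling swap so that kubelet can start (which is what the error above complains about) is typically done along these lines; the fstab edit is only a sketch and the exact swap entry depends on the node:
# Turn swap off immediately
sudo swapoff -a
# Comment out swap entries in /etc/fstab so swap stays off after a reboot
sudo sed -i '/ swap / s/^/#/' /etc/fstab
# Restart kubelet and check its status again
sudo systemctl restart kubelet
systemctl status kubelet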
I am currently running a script in which several h2o glm and deeplearning models are generated over several iterations of a Monte Carlo cross-validation. When it finishes running (which takes about half a day), h2o has saved immense files to the local drive (with sizes up to 8.5 GB). These files are not erased when RStudio or my computer is restarted (as I originally thought). Is there a way to stop h2o from saving these files?
When you start H2O with h2o.init() from R, the stdout and stderr files should be saved to a temporary directory (see R's tempdir() for the path). That temporary directory should be removed when the R session exits. It seems as though this is not working with RStudio; however, it works if you are using R from the command line. I'm not sure whether this is a setting that can be changed in RStudio or an RStudio bug.
But you can take more control yourself: start H2O by hand with java on the command line and then connect to it from R using h2o.init().
java -Xmx5g -jar h2o.jar
In this example, I started H2O with 5 GB of Java heap memory, but you should increase that if your data is larger. Then connecting in R will look like this:
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 16 hours 34 minutes
H2O cluster version: 3.15.0.99999
H2O cluster version age: 17 hours and 25 minutes
H2O cluster name: H2O_started_from_R_me_exn817
H2O cluster total nodes: 1
H2O cluster total memory: 4.43 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.2 (2016-10-31)
So if you want to silence the logs, redirect both stdout and stderr to /dev/null by appending > /dev/null 2>&1 to the java command that starts the H2O cluster, and then connect to H2O from R again:
java -Xmx5g -jar h2o.jar > /dev/null 2>&1 &
I encountered this in a Spark shell running H2O. The shell had 50 executors connected, and the /tmp directories on those nodes eventually ran into storage issues.
When h2o.init() is called it creates JVMs, and H2O's logging is handled by those JVMs. But when the shell is shut down, those JVMs persist and just log heartbeat errors to /tmp in perpetuity. You will need to find the JVMs associated with H2O and shut them down. I believe in my case the specific process name was water.H2OApp.
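A sketch of how you might find and stop those leftover JVMs on a node (the process name is the one I saw; yours may differ):
# List any leftover H2O JVMs
ps aux | grep -i water.H2OApp
# Stop them; -f matches against the full command line
pkill -f water.H2OApp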
I found it easier to take care of the problem by removing those files after running every model.
unlink(list.files(tempdir(), full.names = TRUE), recursive = TRUE)
This removes the temporary files when I run multiple models in a loop.
Nexus version 3.1.0-04
During a build, I receive the following error downloading an artifact from Nexus.
Download http://10.148.254.17:8081/nexus/content/repositories/central/org/assertj/assertj-core/2.4.1/assertj-core-2.4.1.jar
:collection:extractIncludeTestProto FAILED
FAILURE: Build failed with an exception.
What went wrong:
Could not resolve all dependencies for configuration ':collection:testCompile'.
Could not download assertj-core.jar (org.assertj:assertj-core:2.4.1)
Could not get resource 'http://xxx.xxx.xxx.xxx:8081/nexus/content/repositories/central/org/assertj/assertj-core/2.4.1/assertj-core-2.4.1.jar'.
Premature end of Content-Length delimited message body (expected: 900718; received: 6862)
This appears to be a problem with large files stored in Nexus.
If I try to download the file via wget or curl, it also fails.
c:>wget http://xxx.xxx.xxx.xxx:8081/nexus/content/repositories/central/org/assertj/assertj-core/2.5.0/assertj-core-2.5.0.jar
--13:57:06-- http://xxx.xxx.xxx.xxx:8081/nexus/content/repositories/central/org/assertj/assertj-core/2.5.0/assertj-core-2.5.0.jar
=> `assertj-core-2.5.0.jar'
Resolving proxy.xxxx.com... done.
Connecting to proxy.xxxx.com[xxx.xxx.xxx.xxx]:xxx... connected.
Proxy request sent, awaiting response... 200 OK
Length: 934,446 [application/java-archive]
0% [ ] 6,856 1.44K/s ETA 10:27
13:57:21 (1.44 KB/s) - Connection closed at byte 6856. Retrying.
c:>curl -O http://xxx.xxx.xxx.xxx:8081/nexus/content/repositories/central/org/assertj/assertj-core/2.5.0/assertj-core-2.5.0.jar
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 912k 0 6862 0 0 613 0 0:25:24 0:00:11 0:25:13 613
curl: (18) transfer closed with 927584 bytes remaining to read
Any ideas why?
In my case it was a Docker layer that was being blocked. I solved this problem by changing the timeout under System > HTTP > Connection/Socket timeout.