Multi-node H2O cluster in R not detecting other EC2 instances

I have been struggling to get a multi-node H2O cluster up and running on AWS EC2 instances.
I have followed the advice from this thread, but still struggle with the nodes not seeing each other. The EC2 instances all use the same AMI that I have pre-built, so the same h2o.jar file is on all of them.
I have also tried the following troubleshooting advice:
Name the cluster with -name
Use the -network flag instead
Open port 54321 in the security group to 0.0.0.0/0
Here are my steps:
1) Start the AWS EC2 instances in the same availability zone and note their private IPs and the network CIDR (172.31.0.0/20). Put the IP addresses into flatfile.txt
172.31.8.210:54321
172.31.9.207:54321
172.31.13.136:54321
2) Copy the flatfile.txt to all servers that I want to join as nodes, then start H2O
# cluster_run
library(h2oEnsemble)
library(ssh)

# Strip the port from each flatfile entry to get the private IPs
ips <- gsub("(.*):.*", "\\1", readLines("flatfile.txt"))

start_cluster <- function(ip){
  # Copy flatfile across
  session <- ssh_connect(paste0("ubuntu@", ip), keyfile = "mykey.pem")
  scp_upload(session, "flatfile.txt")
  # Ensure no h2o instance is already running
  out <- ssh_exec_wait(session, "sudo pkill java")
  # Start the H2O node (the extra whitespace in the multi-line string is collapsed by gsub)
  cmd <- gsub("\\s+", " ", paste0("ssh -i mykey.pem -o 'StrictHostKeyChecking no' ubuntu@", ip,
                                  " 'java -Xmx20g
                                  -jar /home/rstudio/R/x86_64-pc-linux-gnu-library/3.5/h2o/java/h2o.jar
                                  -name mycluster
                                  -network 172.31.0.0/20
                                  -flatfile flatfile.txt
                                  -port 54321 &'"))
  system(cmd, wait = FALSE)
}
start_cluster(ips[3])
start_cluster(ips[2])
start_cluster(ips[1])
3) Once this is done, I want to connect R to my new multi-node cluster
h2o.init(startH2O = F)
h2o.shutdown(prompt = FALSE)
This is where I see that the nodes aren't being picked up.
I have also seen that when I start the H2O cluster on the different nodes, it isn't picking up the other machines within the network.

You need to add port 54321+1 (so 54322) to the security group as well.
The internal communication goes through 54322.
(I would also specify /16 for -network because it's easier for other people to understand. For example, even if you are sure /20 is technically correct for your network setup, I can't easily be sure. :-)
Depending on the actual network setup, you probably don't need the -network flag at all. Your instances probably only have one interface.
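Once 54321 and 54322 are both reachable, connecting from R should show all nodes. A minimal check, using the first private IP from the flatfile (a sketch, not part of the original answer):

library(h2o)
# Connect to the already-running cluster instead of starting a local one
h2o.init(ip = "172.31.8.210", port = 54321, startH2O = FALSE)
h2o.clusterStatus()   # should list all three nodes once they have formed a cloud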

Related

Ubuntu (Oracle VM) - Mounted Samba shares hang indefinitely

I have a VM instance on Oracle Cloud (Ubuntu 22.04) set up with ZeroTier to act as a web server for some services that should work with my local Synology NAS.
For some of those services I also need to mount three SMB shares from my NAS over the ZeroTier tunnel, but I can't make it work.
I have used mount and mount.cifs plenty of times, with automounting too, but this time it acts very strangely:
running the mount command seems to succeed from the console, but /var/log/syslog reads
CIFS: VFS: \\XXX.XXX.XXX.XXX has not responded in 180 seconds.
Reconnecting...
if trying to access one of the shares (ls or lsof or cd or any other command), it succeeds for only one of the shares (always the same one), but only for the first time any command is given:
$ ls /temp
folder1 folder2 folder3
any other following command just "hangs" as if the system were working on something, but it stays like that indefinitely most of the time:
$ ls /temp
█
Just a few times it spits out this error
lsof: WARNING: can't stat() cifs file system /temp
Output information may be incomplete.
ls 1475 ubuntu 3r DIR 0,44 0 123207681 /temp
findmnt reads:
└─/temp //XXX.XXX.XXX.XXX/Downloads cifs rw,relatime,vers=2.0,cache=strict, username=[redacted],uid=1005,noforceuid,gid=0,noforcegid,addr=XXX.XXX.XXX.XXX,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=65536,wsize=65536,bsize=1048576,echo_interval=60,actimeo=1
for the remaining two "mounted" shares, neither responds to any command, not even the very first one; they just hang, unlike the one share that at least lets me browse once;
umount and umount -l take at least 2-3 minutes to successfully unmount the shares.
Same behavior when using smbclient and also with NFS shares from the same NAS.
What I have already tried:
update kernel and all packages;
remove, purge and reinstall cifs-utils, smbclient and so on...
tried mounting the same shares in another client / node within the ZeroTier network and it works just fine; also browsing from Windows and Android file manager apps with and without ZeroTier works flawlessly;
tried all SMB versions including SMBv3 and SMBv1 (CIFS);
tried different browsing or mounting methods / commands including mount, mount.cifs, autofs, smbclient;
tried to debug what happens behind the console, but didn't find anything that seems related to this in logs, htop or anything else. During the "hanging" sessions there is no spike in CPU, RAM or network usage on either the Oracle VM or the Synology NAS;
checked, reset and reconfigured all permissions on my NAS for shares, folders and files recursively and reconfigured users groups permissions.
What I haven't tried yet (I'll try as soon as possible):
reproduce this on another Oracle VM configured the same as the faulty one and another with a different base image (maybe Oracle Linux?);
It seems to me that the mount.cifs process doesn't really succeed in mounting the share correctly, as it doesn't show as such anywhere. It also seems to be an issue unrelated to folder/file permissions, but rather something related to networking.
A note on something that may or may not be related: ZeroTier on my Synology NAS does not seem to work with IPv4 only - it remains OFFLINE. The node goes ONLINE only when IPv6 is enabled, but I must say that this is the only node in my ZT network that shows an IPv6 address as its public IP in the ZT web GUI - the other nodes show IPv4 public addresses.
If anyone has any clue on this, I'll be happy to support and reproduce any advice. Thank you!
I'm using Tailscale, but I presume it will work the same with ZeroTier.
You need to add port 445 to /etc/iptables/rules.v4, just under the SSH rule, like below:
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 445 -j ACCEPT (like this)
Then you need to edit the interfaces in /etc/samba/smb.conf to:
interfaces = lo tailscale0 100.0.0.0/24
Obviously, my interface is tailscale0, but yours will be different; use ip link show to find yours. You may also need to change the IP range to suit ZeroTier's, since 100.0.0.0/24 is what Tailscale uses.
Then reboot!
I couldn't get it working without doing this.

How to run R code with the Parallel package on a cluster consisting of a local machine as master and several remote VPS machines as workers?

I am trying to run parallel R code, with the parallel package, in a cluster formed by my local machine as the master and a remote VPS as the worker. On the VPS I have RStudio installed, and I have the RStudio link, the username and the password of the VPS. Likewise, for the VPS I have the IP, user and password to log into Ubuntu.
I have done tests with parallel code on my local machine and it works without problems. However, I don't know how to form the cluster where my local machine is the master and the VPS is the worker, because I don't know how to make my local machine communicate with the VPS.
I found these instructions:
workers <- c("n1.remote.org", "n2.remote.org", "n3.remote.org")
cl <- makeClusterPSOCK(workers, dryrun = TRUE)
But I don't know what I should put in workers <- c("n1.remote.org", "n2.remote.org", "n3.remote.org"). Specifically, what do I have to put in n1, n2 and n3? The thing is, I don't know how to establish communication between my local machine and the VPS. Where do I have to indicate that my machine is the local one and that the VPS is the remote one?
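From what I can gather, the idea seems to be something like the following sketch, where the worker entry is just the VPS's public IP or hostname and the master is simply the machine on which the code is run (the IP, user and key file below are placeholders, and makeClusterPSOCK() comes from the parallelly / future packages):

library(parallelly)
workers <- c("203.0.113.10")                 # public IP or hostname of the VPS worker
cl <- makeClusterPSOCK(
  workers,
  user    = "ubuntu",                        # SSH user on the VPS
  rshopts = c("-i", "~/.ssh/my_vps_key"),    # key-based SSH login
  dryrun  = FALSE
)
parallel::clusterCall(cl, function() Sys.info()[["nodename"]])
parallel::stopCluster(cl)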
Likewise, I do the following test on my local machine:
library(parallel)
find_workers <- function(nodes) {
  nodes <- unique(nodes)
  cl <- makeCluster(nodes)
  on.exit(stopCluster(cl))
  ns <- clusterCall(cl, fun = detectCores)
  rep(nodes, times = ns)
}
workers <- find_workers(c("n1", "n2", "n3"))
cl <- makeCluster(workers)
But I think my machine goes into an infinite loop and does not produce any output.
I would be very grateful if someone could help me and give an example of how to do this.
Greetings.
Luciano Maldonado

How to setup workers for parallel processing in R using snowfall and multiple Windows nodes?

I’ve successfully used snowfall to setup a cluster on a single server with 16 processors.
require(snowfall)
if (sfIsRunning() == TRUE) sfStop()
number.of.cpus <- 15
sfInit(parallel = TRUE, cpus = number.of.cpus)
stopifnot( sfCpus() == number.of.cpus )
stopifnot( sfParallel() == TRUE )
# Print the hostname for each cluster member
sayhello <- function() {
  info <- Sys.info()[c("nodename", "machine")]
  paste("Hello from", info[1], "with CPU type", info[2])
}
names <- sfClusterCall(sayhello)
print(unlist(names))
Now, I am looking for complete instructions on how to move to a distributed model. I have 4 different Windows machines with a total of 16 cores that I would like to use as a 16-node cluster. So far, I understand that I could manually set up a SOCK connection or leverage MPI. While it appears possible, I haven't found clear and complete directions on how to do so.
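For reference, if the workers could simply be reached over SOCK connections, I imagine the snowfall call for a 4-machine, 16-worker layout would look roughly like the sketch below (the hostnames are placeholders; socketHosts is passed through to snow):

library(snowfall)
# Four Windows machines, four workers each = 16 workers in total
hosts <- rep(c("winbox01", "winbox02", "winbox03", "winbox04"), each = 4)
sfInit(parallel = TRUE, cpus = length(hosts), type = "SOCK", socketHosts = hosts)
sfClusterCall(function() Sys.info()[["nodename"]])
sfStop()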
The SOCK route appears to depend on code in a snowlib script. I can generate a stub from the master side with the following code:
winOptions <-
  list(host = "172.01.01.03",
       rscript = "C:/Program Files/R/R-2.7.1/bin/Rscript.exe",
       snowlib = "C:/Rlibs")
cl <- makeCluster(c(rep(list(winOptions), 2)), type = "SOCK", manual = T)
It yields the following:
Manually start worker on 172.01.01.03 with
"C:/Program Files/R/R-2.7.1/bin/Rscript.exe"
C:/Rlibs/snow/RSOCKnode.R
MASTER=Worker02 PORT=11204 OUT=/dev/null SNOWLIB=C:/Rlibs
It feels like a reasonable start. I found code for RSOCKnode.R on GitHub under the snow package:
local({
    master <- "localhost"
    port <- ""
    snowlib <- Sys.getenv("R_SNOW_LIB")
    outfile <- Sys.getenv("R_SNOW_OUTFILE") ##**** defaults to ""; document

    args <- commandArgs()
    pos <- match("--args", args)
    args <- args[-(1 : pos)]
    for (a in args) {
        pos <- regexpr("=", a)
        name <- substr(a, 1, pos - 1)
        value <- substr(a, pos + 1, nchar(a))
        switch(name,
               MASTER = master <- value,
               PORT = port <- value,
               SNOWLIB = snowlib <- value,
               OUT = outfile <- value)
    }

    if (! (snowlib %in% .libPaths()))
        .libPaths(c(snowlib, .libPaths()))

    library(methods) ## because Rscript as of R 2.7.0 doesn't load methods
    library(snow)

    if (port == "") port <- getClusterOption("port")

    sinkWorkerOutput(outfile)
    cat("starting worker for", paste(master, port, sep = ":"), "\n")
    slaveLoop(makeSOCKmaster(master, port))
})
It’s not clear how to actually start a SOCK listener on the workers, unless it is buried in snow::recvData.
Looking into the MPI route, as far as I can tell, Microsoft MPI version 7 is a starting point. However, I could not find a Windows alternative for sfCluster. I was able to start the MPI service, but it does not appear to listen on port 22 and no amount of bashing against it with snowfall::makeCluster has yielded a result. I’ve disabled the firewall and tried testing with makeCluster and directly connecting to the worker from the master with PuTTY.
Is there a comprehensive, step-by-step guide to setting up a snowfall cluster on Windows workers that I’ve missed? I am fond of snowfall::sfClusterApplyLB and would like to continue using that, but if there is an easier solution, I’d be willing to change course. Looking into Rmpi and parallel, I found alternative solutions for the master side of the work, but still little to no specific detail on how to setup workers running Windows.
Due to the nature of the work environment, neither moving to AWS, nor Linux is an option.
Related questions without definitive answers for Windows worker nodes:
How to set up cluster slave nodes (on Windows)
Parallel R on a Windows cluster
Create a cluster of co-workers' Windows 7 PCs for parallel processing in R?
Several options for the HPC infrastructure were considered: MPICH, Open MPI, and MS MPI. I initially tried MPICH2 but gave up, because the latest stable Windows release (1.4.1) dates back to 2013 and has had no support since then. Open MPI is not supported on Windows, which leaves only the MS MPI option.
Unfortunately snowfall does not support MS MPI, so I decided to go with the pbdMPI package, which supports MS MPI by default. pbdMPI implements the SPMD paradigm, in contrast with Rmpi, which uses manager/worker parallelism.
MS MPI installation, configuration, and execution
Install MS MPI v10.1.2 on all machines in the to-be Windows HPC cluster.
Create a directory accessible to all nodes, where the R scripts / resources will reside, for example \\HeadMachine\SharedDir.
Check that the MS MPI Launch Service (MsMpiLaunchSvc) is running on all nodes.
Check that MS MPI has the rights to run the R application on all nodes on behalf of the same user, e.g. SharedUser. The user name and the password must be the same on all machines.
Check that R is launched on behalf of the SharedUser user.
Finally, execute mpiexec with the options described below:
mpiexec.exe -n %1 -machinefile "C:\MachineFileDir\hosts.txt" -pwd SharedUserPassword -wdir "\\HeadMachine\SharedDir" Rscript hello.R
where
-wdir is the network path to the directory with the shared resources.
-pwd is the password of the SharedUser user, for example SharedUserPassword.
-machinefile is the path to the hosts.txt text file, for example C:\MachineFileDir\hosts.txt. The hosts.txt file must be readable from the head node at the specified path, and it contains the list of IP addresses of the nodes on which the R script is to be run.
As a result, MPI will log in as SharedUser with the password SharedUserPassword and execute copies of the R processes on each computer listed in the hosts.txt file.
Details
hello.R:
library(pbdMPI, quiet = TRUE)
init()
cat("Hello World from process", comm.rank(), "of", comm.size(), "!\n")
finalize()
hosts.txt
The hosts.txt file - the MPI machines file - is a text file whose lines contain the network names of the computers on which the R scripts will be launched. On each line, the computer name is followed, separated by a space (for MS MPI), by the number of MPI processes to launch. Usually, this equals the number of processors on each node.
Sample of hosts.txt with three nodes having 2 processors each:
192.168.0.1 2
192.168.0.2 2
192.168.0.3 2
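Beyond hello.R, an actual workload is split by rank in the same SPMD style. The following is only a sketch (the chunking scheme and the work.R name are illustrative, not part of the original answer); it is run with the same mpiexec command:

# work.R -- each rank processes its own slice, then the results are combined
library(pbdMPI, quiet = TRUE)
init()
x <- 1:1000
chunk <- split(x, cut(seq_along(x), comm.size(), labels = FALSE))[[comm.rank() + 1]]
partial <- sum(sqrt(chunk))
total <- allreduce(partial, op = "sum")   # sum the partial results across all ranks
comm.print(total, rank.print = 0)         # print from rank 0 only
finalize()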

openMPI/mpich2 doesn't run on multiple nodes

I am trying to install Open MPI and MPICH2 on a multi-node cluster, and in both cases I am having trouble running on multiple machines. With MPICH2 I am able to run on a specific host from the head node, but if I try to run something from a compute node to a different node I get:
HYDU_sock_connect (utils/sock/sock.c:172): unable to connect from "destination_node" to "parent_node" (No route to host)
[proxy:0:0@destination_node] main (pm/pmiserv/pmip.c:189): unable to connect to server parent_node at port 56411 (check for firewalls!)
If I try to use sge to set up a job I get similar errors.
On the other hand, if I try to use Open MPI to run jobs, I am not able to run on any remote machine, even from the head node. I get:
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
The machines are connected to each other; I can ping, ssh passwordlessly, etc. from any of them to any other, and MPI_LIB and PATH are set correctly on all machines.
Usually this is caused because you didn't set up a hostfile or pass the list of hosts on the command line.
For MPICH, you do this by passing the flag -host on the command line, followed by a list of hosts (host1,host2,host3,etc.).
mpiexec -host host1,host2,host3 -n 3 <executable>
You can also put these in a file:
host1
host2
host3
Then you pass that file on the command line like so:
mpiexec -f <hostfile> -n 3 <executable>
Similarly, with Open MPI, you would use:
mpiexec --host host1,host2,host3 -n 3 <executable>
and
mpiexec --hostfile hostfile -n 3 <executable>
You can get more information at these links:
MPICH - https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Hydra_with_Non-Ethernet_Networks
Open MPI - http://www.open-mpi.org/faq/?category=running#mpirun-hostfile

R and snow on amazon EC2 using starcluster

I'm trying to run an analysis in parallel in R on an AWS EC2 cluster. I am using starcluster to set up and manage the EC2 cluster, and am trying to use snow and foreach in R. To start off, I have 2 nodes in the cluster: 1 master and 1 worker.
starcluster start mycluster
starcluster listinstances
-----------------------------------------
mycluster (security group: @sc-mycluster)
-----------------------------------------
....
Cluster nodes:
master running i-xxxxxxxxx masterIP.compute-1.amazonaws.com
node001 running i-xxxxxxxxx node001IP.compute-1.amazonaws.com
Total nodes: 2
starcluster sshmaster mycluster
I then start R, load the snow package, and try to create a cluster object.
R
library("snow")
cl = makeCluster(c("masterIP.compute-1.amazonaws.com", "node001IP.compute-1.amazonaws.com"), type = "SOCK")
This, however, gives me the following error message:
The authenticity of host 'masterIP.compute-1.amazonaws.com (xx.xxx.xx.xx)' can't be established.
ECDSA key fingerprint is xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'masterIP.compute-1.amazonaws.com,xx.xxx.xx.xx' (ECDSA) to the list of known hosts.
Permission denied (publickey).
So I tried copying my ssh key (keyname.rsa, to be specific) to the .ssh directory on EC2 and trying again. That still didn't work; I received the same Permission denied (publickey) error. It was my thought that starcluster handled the setup of ssh and communication between nodes, so I'm a little confused as to why I'm not able to set this up. I also tried to just add node001, so cl = makeCluster(c("node001IP.compute-1.amazonaws.com"), type = "SOCK"), but the same error occurs.
It turns out, after much tinkering, that all that was needed was an update to R version 2.15. The command cl = makeCluster(c("masterIP.compute-1.amazonaws.com", "node001IP.compute-1.amazonaws.com"), type = "SOCK") worked perfectly after that.
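For completeness, a minimal usage sketch with foreach on top of that cluster (this assumes the doSNOW package and was not part of the original fix):

library(doSNOW)
library(foreach)
registerDoSNOW(cl)                          # cl is the snow cluster created above
res <- foreach(i = 1:10, .combine = c) %dopar% sqrt(i)
stopCluster(cl)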
