R and snow on Amazon EC2 using StarCluster

I'm trying to run an analysis in parallel in R on an AWS EC2 cluster. I am using
StarCluster to set up and manage the EC2 cluster, and am trying to use snow and
foreach in R. To start off, I have 2 nodes in the cluster, 1 master and 1
worker.
starcluster start mycluster
starcluster listinstances
-----------------------------------------
mycluster (security group: #sc-mycluster)
-----------------------------------------
....
Cluster nodes:
master running i-xxxxxxxxx masterIP.compute-1.amazonaws.com
node001 running i-xxxxxxxxx node001IP.compute-1.amazonaws.com
Total nodes: 2
starcluster sshmaster mycluster
I then start R and load the snow package and try to create a cluster
object.
R
library("snow")
cl = makeCluster(c("masterIP.compute-1.amazonaws.com", "node001IP.compute-1.amazonaws.com"), type = "SOCK")
This, however, gives me the following error message:
The authenticity of host 'masterIP.compute-1.amazonaws.com (xx.xxx.xx.xx)' can't be established.
ECDSA key fingerprint is xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'masterIP.compute-1.amazonaws.com,xx.xxx.xx.xx' (ECDSA) to the list of known hosts.
Permission denied (publickey).
So I tried copying my ssh key (keyname.rsa, to be specific) to the .ssh directory
on EC2 and trying again. That still didn't work; I received the same
Permission denied (publickey) error. It was my understanding that StarCluster
handled the setup of ssh and communication between nodes, so I'm a little
confused as to why I'm not able to set this up. I also tried adding just node001, i.e. cl = makeCluster(c("node001IP.compute-1.amazonaws.com"), type = "SOCK"), but the same error occurs.
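One hedged diagnostic (not from the original post): snow's SOCK backend launches each worker over ssh, so key-based, prompt-free ssh from the master to every host passed to makeCluster() must already work. From R on the master this can be checked with something like:
# BatchMode=yes makes ssh fail immediately instead of prompting if the key is not accepted.
system("ssh -o BatchMode=yes node001IP.compute-1.amazonaws.com hostname")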

It turns out, after much tinkering, that all that was needed was an update to R version 2.15. The command cl = makeCluster(c("masterIP.compute-1.amazonaws.com", "node001IP.compute-1.amazonaws.com"), type = "SOCK") worked perfectly after that.
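For completeness, a hedged sketch (not part of the original post or answer) of the next step the question was aiming at: registering the snow cluster with foreach via doSNOW and running a trivial loop across both nodes. It assumes the doSNOW and foreach packages are installed on the master.
library(snow)
library(doSNOW)
library(foreach)

cl <- makeCluster(c("masterIP.compute-1.amazonaws.com",
                    "node001IP.compute-1.amazonaws.com"), type = "SOCK")
registerDoSNOW(cl)   # make the snow cluster the foreach backend

res <- foreach(i = 1:10, .combine = c) %dopar% {
  sqrt(i)            # placeholder for the real per-task analysis
}
stopCluster(cl)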

Related

How to run R code with the parallel package on a cluster consisting of a local machine as master and several remote VPS machines as workers?

I am trying to run parallel R code, with the parallel package, on a cluster formed by my local machine as the master and a remote VPS as the worker. On the VPS I have RStudio installed; I have the RStudio link, the username, and the password for the VPS. Likewise, I have the VPS's IP, user, and password to log in to Ubuntu.
I have done tests with parallel code on my local machine and it works without problems. However, I don't know how to form the cluster where my local machine is the master and the VPS is the worker, nor how to make my local machine communicate with the VPS.
I found these instructions:
workers <- c("n1.remote.org", "n2.remote.org", "n3.remote.org")
cl <- makeClusterPSOCK(workers, dryrun = TRUE)
But I don't know what I should put in workers <- c("n1.remote.org", "n2.remote.org", "n3.remote.org"). Specifically, what do I have to put in n1, n2, and n3? The thing is, I don't know how to establish communication between my local machine and the VPS. Where do I indicate which machine is the local one and which one, the VPS, is the remote one?
Likewise, I do the following test on my local machine:
library ("parallel")
find_workers <- function (nodes) {
nodes <- unique (nodes)
cl <- makeCluster (nodes)
on.exit (stopCluster (cl))
ns <- clusterCall (cl, fun = detectCores)
rep (nodes, times = ns)
}
workers <- find_workers (c ("n1", "n2", "n3"))
cl <- makeCluster (workers)
But I think my machine goes into an infinite loop and does not produce any output.
I would be very grateful if someone could help me and give an example of how to do this.
Greetings,
Luciano Maldonado
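For illustration only, a hedged sketch of how the workers vector might be filled in for this setup: the machine on which makeClusterPSOCK() is called is the master, and workers lists only the remote machines, here just the VPS. The IP address, SSH user, and key path below are hypothetical placeholders; R must be installed on the VPS and the VPS must accept key-based SSH logins.
library(parallelly)   # provides makeClusterPSOCK(); it is also re-exported by the future package

# This machine (the one running the script) is the master; list only the remotes.
workers <- "203.0.113.10"                    # hypothetical public IP of the VPS

cl <- makeClusterPSOCK(
  workers,
  user    = "ubuntu",                        # hypothetical SSH user on the VPS
  rshopts = c("-i", "~/.ssh/vps_key.pem"),   # hypothetical private key used to log in
  verbose = TRUE
)

# Quick check: ask the worker for its hostname, then shut the cluster down.
parallel::clusterCall(cl, function() Sys.info()[["nodename"]])
parallel::stopCluster(cl)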

Multi Node H2O cluster in R not detecting other EC2 instances

I have been struggling to get a multi-node H2O cluster up and running using AWS EC2 instances.
I have followed the advice from this thread, but still struggle with the nodes not seeing each other. The EC2 instances all use the same AMI that I have pre-built, so the same h2o.jar file is on all of them.
I have also tried the following troubleshooting advice:
Name the cluster with -name
Rather use the -network flag
Open port 54321 on the security group to 0.0.0.0/0
Here are my steps:
1) Start the AWS EC2 instances in the same availability zone and get the private IPs and network CIDR (172.31.0.0/20). Put the IP addresses into flatfile.txt:
172.31.8.210:54321
172.31.9.207:54321
172.31.13.136:54321
2) Copy flatfile.txt to all servers that I want to connect as nodes, and start H2O:
# cluster_run
library(h2oEnsemble)
library(ssh)
ips <- gsub("(.*):.*", "\\1", readLines("flatfile.txt"))

start_cluster <- function(ip){
  # Copy flatfile across
  session <- ssh_connect(paste0("ubuntu@", ip), keyfile = "mykey.pem")
  scp_upload(session, "flatfile.txt")
  # Ensure no h2o instance is already running
  out <- ssh_exec_wait(session, "sudo pkill java")
  # Start H2O cluster
  cmd <- gsub("\\s+", " ", paste0("ssh -i mykey.pem -o 'StrictHostKeyChecking no' ubuntu@", ip,
                                  " 'java -Xmx20g
                                  -jar /home/rstudio/R/x86_64-pc-linux-gnu-library/3.5/h2o/java/h2o.jar
                                  -name mycluster
                                  -network 172.31.0.0/20
                                  -flatfile flatfile.txt
                                  -port 54321 &'"))
  system(cmd, wait = FALSE)
}
start_cluster(ips[3])
start_cluster(ips[2])
start_cluster(ips[1])
3) Once this has been done, I now want to connect R to my new multi-node cluster:
h2o.init(startH2O = F)
h2o.shutdown(prompt = FALSE)
This is where I see that the nodes aren't being picked up. I have also seen that when I start the H2O cluster on the different nodes, it isn't picking up the other machines within the network.
You need to add port 54321+1 (so 54322) to the security group, as well.
The internal communication goes through 54322.
(I would also specify /16 for -network because it’s easier for other people to understand. For example, even if you are sure /20 is technically correct for your network setup, I can’t easily be sure. :-)
Depending on the actual network setup, you probably don't need the -network flag at all. Your instances probably only have one interface.
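As a quick follow-up check, a hedged sketch (not part of the answer above) of connecting R to the running cluster and confirming that all three nodes have joined; the IP is the first entry of flatfile.txt, and this assumes the machine running R can reach the private subnet:
library(h2o)

# Attach to the already-running cluster instead of starting a local instance.
h2o.init(ip = "172.31.8.210", port = 54321, startH2O = FALSE)

# Should report three nodes once the flatfile peers have found each other.
h2o.clusterStatus()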

How to make a client/server connection using Rserve and Windows Server 2008

I am searching for a robust solution to perform extensive computations on a remote server dedicated to computational tasks. The server runs Windows 2008 R2 and has R x64 3.4.1 installed on it. I've searched for free solutions and am now focusing on the Rserve/RSclient packages.
However, I can't connect any client (using RSclient) to the running server.
This is how I'm proceeding at the moment from the server side:
library(Rserve)
run.Rserve(config.file = "Rserv.conf")
using the following Rserv.conf file:
port 6311
remote enable
plaintext enable
control enable
r-control enable
The server is now instantiated from the R session (it's a bit ugly, but I will change that later on):
running Rserve in this R session (pid=...), 1 server(s)
Now, I'm trying to connect from a remote computer (client side) using:
library(RSclient)
c = RS.connect(host = "...")
The connection then seems to succeed; checking c gives:
> c
Rserve QAP1 connection 0x000000000fbe9f50 (socket 764, queue length 0)
The error occurs when I try to eval anything, for example:
> RS.server.eval(c,"0<1")
Error in RS.server.eval(c, "0<1") : command failed with status code 0x4e: no control line present (control commands disabled or server shutdown)
I've read the available guides but still fail to connect. What is wrong? It seems to be related to control commands, but I enabled them in the config file.
For me the problem was solved by starting the Rserve instance with the command:
R CMD Rserve --RS-port 9000 --RS-enable-remote --RS-enable-control
instead of starting it from within the R session (library(Rserve); run.Rserve(config.file = "Rserv.conf")). You may try this on Windows as well.
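For completeness, a hedged sketch of the client side against a server started that way; the hostname is a placeholder, and port 9000 must be reachable through the server's firewall:
library(RSclient)

con <- RS.connect(host = "server.example.org", port = 9000)   # hypothetical hostname
RS.eval(con, R.version.string)   # plain remote evaluation; no control commands required
RS.close(con)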
Refer to https://github.com/s-u/Rserve/wiki/rserve.conf.
port 6311
remote enable -> this should be remote true
plaintext enable
control enable
r-control enable
Likewise, refer to the link and try the actual values it documents for the other directives.

How to setup workers for parallel processing in R using snowfall and multiple Windows nodes?

I’ve successfully used snowfall to set up a cluster on a single server with 16 processors.
require(snowfall)
if (sfIsRunning() == TRUE) sfStop()
number.of.cpus <- 15
sfInit(parallel = TRUE, cpus = number.of.cpus)
stopifnot( sfCpus() == number.of.cpus )
stopifnot( sfParallel() == TRUE )
# Print the hostname for each cluster member
sayhello <- function() {
  info <- Sys.info()[c("nodename", "machine")]
  paste("Hello from", info[1], "with CPU type", info[2])
}
names <- sfClusterCall(sayhello)
print(unlist(names))
Now, I am looking for complete instructions on how to move to a distributed model. I have 4 different Windows machines with a total of 16 cores that I would like to use for a 16 node cluster. So far, I understand that I could manually setup a SOCK connection or leverage MPI. While it appears possible, I haven’t found clear and complete directions as to how.
The SOCK route appears to depend on code in a snowlib script. I can generate a stub from the master side with the following code:
winOptions <-
  list(host = "172.01.01.03",
       rscript = "C:/Program Files/R/R-2.7.1/bin/Rscript.exe",
       snowlib = "C:/Rlibs")
cl <- makeCluster(c(rep(list(winOptions), 2)), type = "SOCK", manual = T)
It yields the following:
Manually start worker on 172.01.01.03 with
"C:/Program Files/R/R-2.7.1/bin/Rscript.exe"
C:/Rlibs/snow/RSOCKnode.R
MASTER=Worker02 PORT=11204 OUT=/dev/null SNOWLIB=C:/Rlibs
It feels like a reasonable start. I found code for RSOCKnode.R on GitHub under the snow package:
local({
    master <- "localhost"
    port <- ""
    snowlib <- Sys.getenv("R_SNOW_LIB")
    outfile <- Sys.getenv("R_SNOW_OUTFILE") ##**** defaults to ""; document

    args <- commandArgs()
    pos <- match("--args", args)
    args <- args[-(1 : pos)]
    for (a in args) {
        pos <- regexpr("=", a)
        name <- substr(a, 1, pos - 1)
        value <- substr(a, pos + 1, nchar(a))
        switch(name,
               MASTER = master <- value,
               PORT = port <- value,
               SNOWLIB = snowlib <- value,
               OUT = outfile <- value)
    }

    if (! (snowlib %in% .libPaths()))
        .libPaths(c(snowlib, .libPaths()))

    library(methods) ## because Rscript as of R 2.7.0 doesn't load methods
    library(snow)

    if (port == "") port <- getClusterOption("port")

    sinkWorkerOutput(outfile)
    cat("starting worker for", paste(master, port, sep = ":"), "\n")
    slaveLoop(makeSOCKmaster(master, port))
})
It’s not clear how to actually start a SOCK listener on the workers, unless it is buried in snow::recvData.
Looking into the MPI route, as far as I can tell, Microsoft MPI version 7 is a starting point. However, I could not find a Windows alternative for sfCluster. I was able to start the MPI service, but it does not appear to listen on port 22 and no amount of bashing against it with snowfall::makeCluster has yielded a result. I’ve disabled the firewall and tried testing with makeCluster and directly connecting to the worker from the master with PuTTY.
Is there a comprehensive, step-by-step guide to setting up a snowfall cluster on Windows workers that I’ve missed? I am fond of snowfall::sfClusterApplyLB and would like to continue using that, but if there is an easier solution, I’d be willing to change course. Looking into Rmpi and parallel, I found alternative solutions for the master side of the work, but still little to no specific detail on how to setup workers running Windows.
Due to the nature of the work environment, neither moving to AWS, nor Linux is an option.
Related questions without definitive answers for Windows worker nodes:
How to set up cluster slave nodes (on Windows)
Parallel R on a Windows cluster
Create a cluster of co-workers' Windows 7 PCs for parallel processing in R?
There were several options for the HPC infrastructure considered: MPICH, Open MPI, and MS MPI. I initially tried to use MPICH2 but gave up, as the latest stable release for Windows (1.4.1) dates back to 2013 and has had no support since then. Open MPI is not supported on Windows, which leaves only the MS MPI option.
Unfortunately, snowfall does not support MS MPI, so I decided to go with the pbdMPI package, which supports MS MPI by default. pbdMPI implements the SPMD paradigm, in contrast with Rmpi, which uses manager/worker parallelism.
MS MPI installation, configuration, and execution
Install MS MPI v10.1.2 on all machines that will form the Windows HPC cluster.
Create a directory accessible to all nodes, where the R scripts and resources will reside, for example \\HeadMachine\SharedDir.
Check that the MS MPI Launch Service (MsMpiLaunchSvc) is running on all nodes.
Check that MS MPI has the rights to run the R application on all nodes on behalf of the same user, e.g. SharedUser. The user name and the password must be the same on all machines.
Check that R is launched on behalf of the SharedUser user.
Finally, execute mpiexec with the options described below:
mpiexec.exe -n %1 -machinefile "C:\MachineFileDir\hosts.txt" -pwd SharedUserPassword -wdir "\\HeadMachine\SharedDir" Rscript hello.R
where
-wdir is a network path to the directory with shared resources.
-pwd is the password of the SharedUser user, for example SharedUserPassword.
-machinefile is the path to the hosts.txt text file, for example C:\MachineFileDir\hosts.txt. The hosts.txt file must be readable from the head node at the specified path, and it contains a list of IP addresses of the nodes on which the R script is to be run.
As a result of this step, MPI will log in as SharedUser with the password SharedUserPassword and execute copies of the R processes on each computer listed in the hosts.txt file.
Details
hello.R:
library(pbdMPI, quiet = TRUE)
init()
cat("Hello World from process", comm.rank(), "of", comm.size(), "!\n")
finalize()
hosts.txt
The hosts.txt file (the MPI machine file) is a text file whose lines contain the network names or IP addresses of the computers on which the R scripts will be launched. On each line, the computer name is followed, separated by a space (for MS MPI), by the number of MPI processes to launch. Usually this equals the number of processors in the node.
Sample of hosts.txt with three nodes having 2 processors each:
192.168.0.1 2
192.168.0.2 2
192.168.0.3 2
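Since the original goal was to replace snowfall::sfClusterApplyLB, here is a hedged sketch (my own addition, not part of the answer above) of how a list of tasks might be divided across ranks in pbdMPI's SPMD style. tasks.R is a hypothetical script name, launched with the same mpiexec command as hello.R:
# tasks.R -- every rank runs this same script (SPMD style)
library(pbdMPI, quiet = TRUE)
init()

do_one <- function(i) {
  mean(rnorm(1e5)) + i        # placeholder for the real per-task computation
}

n_tasks <- 100
my_ids <- get.jid(n_tasks)    # the subset of task indices assigned to this rank
my_res <- lapply(my_ids, do_one)

all_res <- gather(my_res)     # list of per-rank result lists, available on rank 0
if (comm.rank() == 0) {
  str(unlist(all_res, recursive = FALSE))
}
finalize()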

Impala 1.2.1 ERROR: Couldn't open transport for localhost:26000(connect() failed: Connection refused)

Using impala-shell, I can see the Hive metastore, use any database created by Hive, and query any table created by Hive. When I try to create a table in impala-shell or run "invalidate metadata", I get
"ERROR: Couldn't open transport for localhost:26000(connect() failed: Connection refused)"
I have the following configuration. This is a multi-node cluster built by hand, i.e. without using Cloudera Manager:
CentOS 6
CDH4.5
Impala 1.2.1
Hive MySQL Metastore
impalad is running on multiple nodes alongside the data nodes
statestored and catalogd are running on a single node that is NOT an impalad node
In /etc/default/impala I have changed IMPALA_STATE_STORE_HOST to point to the IP of the statestored machine.
From /var/log/impala/catalogd.INFO, it seems port 26000 is used by the catalog service, as there is a line in this file: "--catalog_service_port=26000"
Just as /etc/default/impala has to tell impalad where the statestore is (using IMPALA_STATE_STORE_HOST), I am wondering whether for 1.2.1 (where catalogd was introduced) there has to be an additional entry for the catalogd location as well - just a guess ....
Any help is appreciated.
Thanks,
You have to start impalad with the option -catalog_service_host=fqdn_to_your_catalog_host.
Unfortunately this is not yet in the default configuration, so you have to add it yourself.
In /etc/default/impala, add
CATALOG_SERVICE_HOST=fqdn_to_your_catalog_host
and append -catalog_service_host=${CATALOG_SERVICE_HOST} to IMPALA_SERVER_ARGS.
Restart impalad and it should work now :-)
