'all connections are in use' with parallel processing on AWS - r

I have been able to run 20 models simultaneously on an r6a.48xlarge Amazon Web Services instance (192 vCPUs, 1536 GiB memory) with this R code:
setwd('/home/ubuntu/')
library(doParallel)
detectCores()
my.AWS.n.cores <- detectCores()
my.AWS.n.cores <- my.AWS.n.cores - 92   # drop from 192 to 100 cores
my.AWS.n.cores
registerDoParallel(my.cluster <- makeCluster(my.AWS.n.cores))
folderName <- 'model000222'
files <- list.files(folderName, full.names = TRUE)
start.time <- Sys.time()
foreach(file = files, .errorhandling = "remove") %dopar% {
  source(file)
}
stopCluster(my.cluster)
end.time <- Sys.time()
total.time.c <- end.time - start.time
total.time.c
However, the above R code did not run until I reduced the number of cores to 100 from 192 with this line:
my.AWS.n.cores <- my.AWS.n.cores - 92
If I tried running the code with all 192 vCPUs or 187 vCPUs I got this error message:
> my.AWS.n.cores <- detectCores()
> my.AWS.n.cores <- my.AWS.n.cores - 5
> my.AWS.n.cores
[1] 187
>
> registerDoParallel(my.cluster <- makeCluster(my.AWS.n.cores))
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
all connections are in use
Calls: registerDoParallel ... makePSOCKcluster -> newPSOCKnode -> socketConnection
I had never seen that error message and could not locate it with an internet search. Could someone explain this error message? I do not know why my solution worked or whether a better solution exists. Can I easily determine the maximum number of connections I can use without getting this error? I suppose I could run the code incrementing the number of cores from 100 to 187.
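Something like the sketch below is what I have in mind for that incremental check (just a sketch; I am assuming tryCatch() around makeCluster() is a reasonable way to detect the failure, and the range 100 to 187 is only the gap mentioned above):
library(parallel)
# Sketch: try progressively larger PSOCK clusters until makeCluster() fails.
for (n in seq(100, 187, by = 1)) {
  cl <- tryCatch(makeCluster(n), error = function(e) NULL)
  if (is.null(cl)) {
    message("makeCluster() failed at ", n, " workers")
    closeAllConnections()   # clean up any partially created workers
    break
  }
  stopCluster(cl)
}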
I installed R on this instance with the lines below in PuTTY. R could not be located on the instance until I used the last line below: apt install r-base-core.
sudo su
echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/"
sudo apt-get update
sudo apt-get install r-base
sudo apt install dos2unix
apt install r-base-core
I used this AMI:
Ubuntu Server 18.04 LTS (HVM), SSD Volume Type
EDIT
Apparently, R has a hard-wired limit of 128 connections. It seems the number of PSOCK workers can be increased if you are willing to rebuild R from source, but I have not found an answer showing how to do that; ideally the answer would cover Ubuntu on AWS. See also these previous related questions.
Errors in makeCluster(multicore): cannot open the connection
Is there a limit on the number of slaves that R snow can create?

Explanation
Each parallel PSOCK worker consumes one R connection. As of R 4.2.1, R is hard-coded to support only 128 open connections at any time. Three of those connections are always in use (stdin, stdout, and stderr), leaving you with 125 to play with.
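You can see those three reserved connections, and count how many connection slots are currently occupied, straight from base R; a quick sketch:
# The first three rows are the always-open stdin, stdout, and stderr.
showConnections(all = TRUE)
# Number of connection slots currently in use:
nrow(showConnections(all = TRUE))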
To increase this limit, you have to update the constant
#define NCONNECTIONS 128
in src/main/connections.c, and then re-build R from source. FWIW, I've verified that it works with at least 16,384 on Ubuntu 16.04 (https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28#issuecomment-231603035).
People have reported on this before, and the problem has been raised on R-devel several times over the years. The last time the limit was increased was in R 2.4.0 (October 2006), when it went from 50 to 128.
See https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28 for more details and discussions. I think it's worth bumping this topic again on R-devel. As people get access to more cores, more people will run into this problem.
The parallelly package provides two functions, availableConnections() and freeConnections(), for querying the current R installation for the number of connections available and free. See https://parallelly.futureverse.org/reference/availableConnections.html for details and examples.
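For example (a sketch; the exact numbers depend on what is already open in your session):
library(parallelly)
availableConnections()   # e.g. 128 on a standard R build
freeConnections()        # e.g. 125 when only stdin, stdout, and stderr are open
# A defensive way to size a PSOCK cluster:
n.workers <- min(availableCores(), freeConnections())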
FYI, if you use parallelly::makeClusterPSOCK(n) instead of parallel::makeCluster(n), you'll get a more informative error message, and much sooner, e.g.
> cl <- parallelly::makeClusterPSOCK(192)
Error: Cannot create 192 parallel PSOCK nodes. Each node
needs one connection but there are only 124 connections left
out of the maximum 128 available on this R installation
Workaround
You can avoid relying on R connections for local parallel processing by using the callr package under the hood. The easiest way to achieve this is to use doFuture in combination with future.callr. In your example, that would be:
library(doFuture)
library(future.callr)
registerDoFuture()
plan(callr, workers = parallelly::availableCores(omit = 5))
...
With this setup, the parallel workers are launched via callr (which operates without R connections). Each parallel task is launched in a separate callr process and when the task completes, the parallel worker is terminated. Because the parallel workers are not reused, there is an extra overhead paid for using the callr backend, but if your parallel tasks are long enough, that should still be a minor part of the processing time.
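Put together with the original example, the whole thing might look like the sketch below (it keeps the .errorhandling = "remove" setting from the question and the omit = 5 headroom from above):
library(doFuture)
library(future.callr)
registerDoFuture()
plan(callr, workers = parallelly::availableCores(omit = 5))

folderName <- 'model000222'
files <- list.files(folderName, full.names = TRUE)
foreach(file = files, .errorhandling = "remove") %dopar% {
  source(file)
}
# No stopCluster() needed: each callr worker is a one-shot background process.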

Related

Checking available cores in R on SLURM

I ran the script below for a SLURM RStudio setup (currently running):
#!/bin/bash
#SBATCH --job-name=nodes
#SBATCH --output=a.log
#SBATCH --ntasks=18
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=7gb
date;hostname;pwd
module load R/4.2
rserver    # starts the RStudio server
This requests 18 tasks with 8 cores each (144 cores in total).
However, when I check the number of cores available for parallel processing in the R console, it says 32 instead.
Here's the code for checking.
library(doParallel)
detectCores() # 32
Even worse, another package, parallelly (or future), which takes the scheduler settings into account, reports yet a different number.
From the parallelly package documentation:
For instance, if compute cluster schedulers are used (e.g. TORQUE/PBS and Slurm), they set specific environment variables specifying the number of cores that were allotted to any given job; availableCores() acknowledges these as well.
library(parallelly)
availableCores() # 8
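For reference, the Slurm allocation itself can also be inspected from the R console through the environment variables Slurm sets (a sketch; these are standard Slurm variables, and as far as I understand availableCores() reads SLURM_CPUS_PER_TASK, which would explain the 8):
Sys.getenv("SLURM_CPUS_PER_TASK")   # cores per task (8 in this job)
Sys.getenv("SLURM_NTASKS")          # number of tasks (18 in this job)
Sys.getenv("SLURM_MEM_PER_CPU")     # memory per CPU, set because of --mem-per-cpu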
I am wondering whether the current R session is actually running with the scheduler specification above (144 cores), or whether I am missing something important.
Also, could you recommend how to check the resources (cores / memory) allocated and usable from R under a Slurm setup?
Thank you very much in advance.

How to stop h2o from saving massive .ERR, .OUT and other log files to the local drive

I am currently running a script in which several h2o glm and deeplearning models are being generated for several iterations of a Monte-Carlo Cross-Validation. When finished running (which takes about half a day), h2o is saving immense files to the local drive (with sizes up to 8.5 GB). These files are not erased when RStudio or my computer is restarted (as I originally thought). Is there a way to stop h2o from saving these files?
When you start H2O with h2o.init() from R, the stdout and stderr files should be saved to a temporary directory (see R's tempdir() for the path). This temporary directory should be removed when the R session exits. It seems as though this is not working with RStudio; however, it does work if you are using R from the command line. I'm not sure if this is a setting that can be changed in RStudio or if this is an RStudio bug.
But you can take more control yourself. You can start H2O by hand using java on the command line and then connect from R using h2o.init().
java -Xmx5g -jar h2o.jar
In this example, I started H2O with 5 GB of Java heap memory, but you should increase that if your data is larger. Then connecting in R will look like this:
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 16 hours 34 minutes
H2O cluster version: 3.15.0.99999
H2O cluster version age: 17 hours and 25 minutes
H2O cluster name: H2O_started_from_R_me_exn817
H2O cluster total nodes: 1
H2O cluster total memory: 4.43 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.2 (2016-10-31)
So if you want to discard both stdout and stderr, you simply add a redirect to the end of the java command used to start the H2O cluster and then connect to H2O from R again. To send both streams to /dev/null, append > /dev/null 2>&1 like this:
java -Xmx5g -jar h2o.jar > /dev/null 2>&1 &
I encountered this in a Spark shell running H2O. The shell had 50 executors connected, and the /tmp directories on those nodes eventually filled up and caused storage issues.
When h2o.init() is called it creates JVMs, and H2O's logging is handled by those JVMs. But when the shell is shut down, those JVMs persist and keep logging heartbeat errors to /tmp in perpetuity. You will need to find the JVMs associated with H2O and shut them down; in my case the process name was water.H2OApp.
I found it easier to take care of the problem by removing those files after running every model.
unlink(list.files(tempdir(), full.names = TRUE), recursive = TRUE)
This removes the temporary files each time I run multiple models in a loop.
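For example, inside a Monte-Carlo loop the cleanup can sit right after each iteration (a sketch; n.iterations is just a placeholder for however many iterations you run):
for (i in seq_len(n.iterations)) {
  # ... fit the h2o glm / deeplearning models for iteration i here ...
  # then drop the temporary log files written so far:
  unlink(list.files(tempdir(), full.names = TRUE), recursive = TRUE)
}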

How to setup workers for parallel processing in R using snowfall and multiple Windows nodes?

I've successfully used snowfall to set up a cluster on a single server with 16 processors.
require(snowfall)
if (sfIsRunning() == TRUE) sfStop()
number.of.cpus <- 15
sfInit(parallel = TRUE, cpus = number.of.cpus)
stopifnot( sfCpus() == number.of.cpus )
stopifnot( sfParallel() == TRUE )
# Print the hostname for each cluster member
sayhello <- function() {
  info <- Sys.info()[c("nodename", "machine")]
  paste("Hello from", info[1], "with CPU type", info[2])
}
names <- sfClusterCall(sayhello)
print(unlist(names))
Now, I am looking for complete instructions on how to move to a distributed model. I have 4 different Windows machines with a total of 16 cores that I would like to use for a 16-node cluster. So far, I understand that I could manually set up a SOCK connection or leverage MPI. While it appears possible, I haven't found clear and complete directions as to how.
The SOCK route appears to depend on code in a snowlib script. I can generate a stub from the master side with the following code:
winOptions <- list(host = "172.01.01.03",
                   rscript = "C:/Program Files/R/R-2.7.1/bin/Rscript.exe",
                   snowlib = "C:/Rlibs")
cl <- makeCluster(c(rep(list(winOptions), 2)), type = "SOCK", manual = TRUE)
It yields the following:
Manually start worker on 172.01.01.03 with
"C:/Program Files/R/R-2.7.1/bin/Rscript.exe"
C:/Rlibs/snow/RSOCKnode.R
MASTER=Worker02 PORT=11204 OUT=/dev/null SNOWLIB=C:/Rlibs
It feels like a reasonable start. I found code for RSOCKnode.R on GitHub under the snow package:
local({
    master <- "localhost"
    port <- ""
    snowlib <- Sys.getenv("R_SNOW_LIB")
    outfile <- Sys.getenv("R_SNOW_OUTFILE") ##**** defaults to ""; document

    args <- commandArgs()
    pos <- match("--args", args)
    args <- args[-(1 : pos)]
    for (a in args) {
        pos <- regexpr("=", a)
        name <- substr(a, 1, pos - 1)
        value <- substr(a, pos + 1, nchar(a))
        switch(name,
               MASTER = master <- value,
               PORT = port <- value,
               SNOWLIB = snowlib <- value,
               OUT = outfile <- value)
    }

    if (! (snowlib %in% .libPaths()))
        .libPaths(c(snowlib, .libPaths()))

    library(methods) ## because Rscript as of R 2.7.0 doesn't load methods
    library(snow)

    if (port == "") port <- getClusterOption("port")

    sinkWorkerOutput(outfile)
    cat("starting worker for", paste(master, port, sep = ":"), "\n")
    slaveLoop(makeSOCKmaster(master, port))
})
It’s not clear how to actually start a SOCK listener on the workers, unless it is buried in snow::recvData.
Looking into the MPI route, as far as I can tell, Microsoft MPI version 7 is a starting point. However, I could not find a Windows alternative for sfCluster. I was able to start the MPI service, but it does not appear to listen on port 22 and no amount of bashing against it with snowfall::makeCluster has yielded a result. I’ve disabled the firewall and tried testing with makeCluster and directly connecting to the worker from the master with PuTTY.
Is there a comprehensive, step-by-step guide to setting up a snowfall cluster on Windows workers that I’ve missed? I am fond of snowfall::sfClusterApplyLB and would like to continue using that, but if there is an easier solution, I’d be willing to change course. Looking into Rmpi and parallel, I found alternative solutions for the master side of the work, but still little to no specific detail on how to setup workers running Windows.
Due to the nature of the work environment, neither moving to AWS nor moving to Linux is an option.
Related questions without definitive answers for Windows worker nodes:
How to set up cluster slave nodes (on Windows)
Parallel R on a Windows cluster
Create a cluster of co-workers' Windows 7 PCs for parallel processing in R?
Several options for the HPC infrastructure were considered: MPICH, Open MPI, and MS MPI. I initially tried to use MPICH2 but gave up because the latest stable Windows release, 1.4.1, dates back to 2013 and has had no support since then. Open MPI is not supported on Windows, which leaves only the MS MPI option.
Unfortunately, snowfall does not support MS MPI, so I decided to go with the pbdMPI package, which supports MS MPI by default. pbdMPI implements the SPMD paradigm, in contrast with Rmpi, which uses manager/worker parallelism.
MS MPI installation, configuration, and execution
1. Install MS MPI v10.1.2 on all machines in the to-be Windows HPC cluster.
2. Create a directory accessible to all nodes where the R scripts and resources will reside, for example \\HeadMachine\SharedDir.
3. Check that the MS MPI Launch Service (MsMpiLaunchSvc) is running on all nodes.
4. Check that MS MPI has the rights to run the R application on all nodes on behalf of the same user, e.g. SharedUser. The user name and the password must be the same on all machines.
5. Check that R is launched on behalf of the SharedUser user.
6. Finally, execute mpiexec with the options described below:
mpiexec.exe -n %1 -machinefile "C:\MachineFileDir\hosts.txt" -pwd SharedUserPassword -wdir "\\HeadMachine\SharedDir" Rscript hello.R
where
-wdir is the network path to the directory with the shared resources.
-pwd is the password of the SharedUser user, for example SharedUserPassword.
-machinefile is the path to the hosts.txt text file, for example C:\MachineFileDir\hosts.txt. The hosts.txt file must be readable from the head node at the specified path, and it contains a list of IP addresses of the nodes on which the R script is to be run.
As a result, MPI will log in as SharedUser with the password SharedUserPassword and execute copies of the R process on each computer listed in the hosts.txt file.
Details
hello.R:
library(pbdMPI, quiet = TRUE)
init()
cat("Hello World from
process",comm.rank(),"of",comm.size(),"!\n")
finalize()
hosts.txt
The hosts.txt file (the MPI machines file) is a text file whose lines contain the network names of the computers on which the R scripts will be launched. On each line, the computer name is followed by a space (for MS MPI) and the number of MPI processes to launch, which usually equals the number of processors on that node.
Sample of hosts.txt with three nodes having 2 processors each:
192.168.0.1 2
192.168.0.2 2
192.168.0.3 2
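Since pbdMPI is SPMD, every rank runs the same script and picks out its own share of the work. A minimal sketch of distributing independent model scripts across ranks follows (assumptions: the shared-directory path reuses the \\HeadMachine\SharedDir example from step 2, and the scripts are plain R files sourced independently):
library(pbdMPI, quiet = TRUE)
init()

files <- list.files("//HeadMachine/SharedDir", full.names = TRUE)
mine  <- files[get.jid(length(files))]   # indices assigned to this rank

results <- lapply(mine, function(f) tryCatch(source(f), error = function(e) NULL))

# Collect the per-rank results on rank 0:
all.results <- gather(results)
finalize()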

R Running foreach dopar loop on HPC MPIcluster

I got access to an HPC cluster with an MPI partition.
My problem is that, no matter what I try, my code (which works fine on my PC) doesn't run on the HPC cluster. The code looks like this:
library(tm)
library(qdap)
library(snow)
library(doSNOW)
library(foreach)
cl <- makeCluster(30, type="MPI")
registerDoSNOW(cl)
np <- getDoParWorkers()
np

Base = "./Files1a/"
files = list.files(path=Base, pattern="\\.txt");

for(i in 1:length(files)){
  # ...some definitions and variable generation...
  text <- foreach(k = 1:10, .combine='c') %do% {
    text = if (file.exists(paste("./Files", k, "a/", files[i], sep=""))) paste(tolower(readLines(paste("./Files", k, "a/", files[i], sep=""))), collapse=" ") else ""
  }

  docs <- Corpus(VectorSource(text))

  for (k in 1:10){
    ID[k] <- paste(files[i], k, sep="_")
  }
  data <- as.data.frame(docs)
  data[["docs"]] = ID
  rm(docs)
  data <- sentSplit(data, "text")

  frequency = NULL
  cs <- ceiling(length(POLKEY$x) / getDoParWorkers())
  opt <- list(chunkSize=cs)
  frequency <- foreach(j = 2:length(POLKEY$x), .options.mpi=opt, .combine='cbind') %dopar% ...
  write.csv(frequency, file=paste("./Result/output", i, ".csv", sep=""))
  rm(data, frequency)
}
When I run the batch job, the session gets killed at the time limit. After the MPI cluster initialization I receive the following message:
Loading required namespace: Rmpi
--------------------------------------------------------------------------
PMI2 initialized but returned bad values for size and rank.
This is symptomatic of either a failure to use the
"--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
If running under SLURM, try adding "-mpi=pmi2" to your
srun command line. If that doesn't work, or if you are
not running under SLURM, try removing or renaming the
pmi2.h header file so PMI2 support will not automatically
be built, reconfigure and build OMPI, and then try again
with only PMI1 support enabled.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: ...
MPI_COMM_WORLD rank: 0
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
30 slaves are spawned successfully. 0 failed.
Unfortunately, it seems that the loop does not complete even once, as no output is returned.
For the sake of completeness, my batch file:
#!/bin/bash -l
#SBATCH --job-name MyR
#SBATCH --output MyR-%j.out
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=6
#SBATCH --mem=24gb
#SBATCH --time=00:30:00
MyRProgram="$HOME/R/hpc_test2.R"
cd $HOME/R
export R_LIBS_USER=$HOME/R/Libs2
# start R with my R program
module load R
time R --vanilla -f $MyRProgram
Does anybody have a suggestion how to solve the problem? What am I doing wrong?
Thanks in advance for your help!
Your script is an MPI application, so you need to execute it appropriately via Slurm. The Open MPI FAQ has a special section on how to do that:
https://www.open-mpi.org/faq/?category=slurm
The most important point is that your script shouldn't execute R directly, but should execute it via the mpirun command, using something like:
mpirun -np 1 R --vanilla -f $MyRProgram
My guess is that the "PMI2" error is caused by not executing R via mpirun. I don't think the "fork" message indicates a real problem and it happens to me at times. I think it happens because R calls "fork" when initializing, but this has never caused a problem for me. I'm not sure why I only get this message occasionally.
Note that it is very important to tell mpirun to only launch one process since the other processes will be spawned, so you should use the mpirun -np 1 option. If Open MPI was properly built with Slurm support, then Open MPI should know where to launch those processes when they are spawned, but if you don't use -np 1, then all 30 processes launched via mpirun will spawn 30 processes each, causing a huge mess.
Finally, I think you should tell makeCluster to spawn only 29 processes to avoid running a total of 31 MPI processes. Depending on your network configuration, even that much oversubscription can cause problems.
I would create the cluster object as follows:
library(snow)
library(Rmpi)
cl<- makeCluster(mpi.universe.size() - 1, type="MPI")
That's safer and makes it easier to keep your R script and job script in sync with each other.
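Putting those pieces together, the top and bottom of the R script might look like this sketch (the for/foreach loop itself stays as in the question):
library(snow)
library(doSNOW)
library(Rmpi)

# mpirun -np 1 starts this single R process; the workers are spawned from here:
cl <- makeCluster(mpi.universe.size() - 1, type = "MPI")
registerDoSNOW(cl)

# ... the for/foreach %dopar% loop from the question ...

stopCluster(cl)
mpi.quit()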

R and snow on amazon EC2 using starcluster

I'm trying to run an analysis in parallel in R on an AWS EC2 cluster. I am using starcluster to set up and manage the EC2 cluster, and am trying to use snow and foreach in R. To start off, I have 2 nodes in the cluster, 1 master and 1 worker.
starcluster start mycluster
starcluster listinstances
-----------------------------------------
mycluster (security group: #sc-mycluster)
-----------------------------------------
....
Cluster nodes:
master running i-xxxxxxxxx masterIP.compute-1.amazonaws.com
node001 running i-xxxxxxxxx node001IP.compute-1.amazonaws.com
Total nodes: 2
starcluster sshmaster mycluster
I then start R, load the snow package, and try to create a cluster object.
R
library("snow")
cl = makeCluster(c("masterIP.compute-1.amazonaws.com", "node001IP.compute-1.amazonaws.com"), type = "SOCK")
This, however, gives me the following error message:
The authenticity of host 'masterIP.compute-1.amazonaws.com (xx.xxx.xx.xx)' can't be established.
ECDSA key fingerprint is xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'masterIP.compute-1.amazonaws.com,xx.xxx.xx.xx' (ECDSA) to the list of known hosts.
Permission denied (publickey).
So I tried copying my ssh key (keyname.rsa, to be specific) to the .ssh directory on EC2 and trying again. That still didn't work; I received the same Permission denied (publickey) error. It was my thought that starcluster handled the setup of ssh and communication between nodes, so I'm a little confused as to why I'm not able to set this up. I also tried adding just node001, i.e. cl = makeCluster(c("node001IP.compute-1.amazonaws.com"), type = "SOCK"), but the same error occurs.
It turns out, after much tinkering, that all that was needed was an update to R version 2.15. The command cl = makeCluster(c("masterIP.compute-1.amazonaws.com", "node001IP.compute-1.amazonaws.com"), type = "SOCK") worked perfectly after that.
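Once the cluster comes up, a quick sanity check along these lines (a sketch reusing the placeholder hostnames above) confirms that both nodes respond:
library(snow)
cl = makeCluster(c("masterIP.compute-1.amazonaws.com",
                   "node001IP.compute-1.amazonaws.com"), type = "SOCK")
# Ask each worker for its hostname to confirm both nodes joined:
clusterCall(cl, function() Sys.info()[["nodename"]])
stopCluster(cl)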

Resources