parallel processing in R hangs when not on mobile hotspot - r

I have the common issue with the R-package 'parallel' as seen here. The command
cl <- future::makeClusterPSOCK(1, outfile = NULL, verbose = TRUE)
hangs on the machine whenever I am logged onto a wifi connection. However, it works fine when I am logged onto a mobile hotspot from my phone.
I have read all posts asking for solutions, but so far the best idea was to reinstall my operating system, which I would really prefer to avoid...
Any ideas?
I use R version 3.5.1, Platform: x86_64-apple-darwin15.6.0 (64-bit).
Update 1:
When connections are turned off or I am on a wifi network, the output from the command above is something like:
Workers: [n = 1] ‘localhost’
Base port: 11349
Creating node 1 of 1 ...
- setting up node
Starting worker #1 on ‘localhost’: '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11349 OUT= TIMEOUT=2592000 XDR=TRUE
Waiting for worker #1 on ‘localhost’ to connect back
starting worker pid=4841 on localhost:11349 at 08:37:36.219
On a mobile hotspot it looks very similar but with success:
Workers: [n = 1] ‘localhost’
Base port: 11501
Creating node 1 of 1 ...
- setting up node
Starting worker #1 on ‘localhost’: '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11501 OUT= TIMEOUT=2592000 XDR=TRUE
Waiting for worker #1 on ‘localhost’ to connect back
starting worker pid=4892 on localhost:11501 at 08:39:47.070
Connection with worker #1 on ‘localhost’ established
- assigning connection UUID
- collecting session information
Creating node 1 of 1 ... done
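Both logs show the worker being told to connect back to MASTER=localhost. One thing that may be worth ruling out, purely as an assumption that nothing in the post confirms, is that localhost stops resolving to the loopback address on the wifi network. Below is a minimal shell sketch that checks a hosts file for a 127.0.0.1 localhost entry; the function name has_loopback_localhost is made up for illustration.

```sh
# Return 0 if the given hosts file maps 127.0.0.1 to "localhost".
# Assumption: the standard hosts-file layout (address, then names).
has_loopback_localhost() {
  awk '
    /^[[:space:]]*#/ { next }            # skip comment lines
    $1 == "127.0.0.1" {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break             # stop at a trailing comment
        if ($i == "localhost") found = 1
      }
    }
    END { exit found ? 0 : 1 }
  ' "$1"
}

# Usage: has_loopback_localhost /etc/hosts && echo "entry present"
```

If the entry is present on both networks, the cause lies elsewhere and this check only rules out one possibility.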

Related

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. It ran for 12 hours as expected, and then failed with Data unpack would read past end of buffer in file util/show_help.c at line 501. I often get errors such as ORTE has lost communication with a remote daemon, but usually at the beginning of a job. That is annoying, but it costs far less time than an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
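The log above suggests setting the MCA parameter orte_base_help_aggregate to 0 to see every help and error message instead of the aggregated summary. A minimal sketch of one way to do that from the job script, using Open MPI's OMPI_MCA_ environment-variable prefix; the application name in the comment is a placeholder.

```sh
# Disable help-message aggregation so each process's message is shown.
export OMPI_MCA_orte_base_help_aggregate=0

# Equivalent per-run form (./my_app is a placeholder):
#   mpirun --mca orte_base_help_aggregate 0 ./my_app
```

This does not fix the failure itself; it only makes the messages from all processes visible, which may help identify which peer failed first.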

Error in MPI program execution - no active ports found

I am trying to run a simple MPI job across multiple hosts of a cluster.
[capc@gpu6 mpi_tests]$ /opt/openmpi4.0.3/build/bin/mpirun --host gpu7,gpu6 ./a.out
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: gpu7
We have 2 processes.
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: gpu6
PID: 29209
[gpu6:29203] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[gpu6:29203] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I compiled the MPI program with mpicc; when I run it with mpirun, it hangs.
Can anyone guide me regarding this?

How can I shut down Rserve gracefully?

I have tried many options on both Mac and Ubuntu.
I read the Rserve documentation
http://rforge.net/Rserve/doc.html
and that for the Rserve and RSclient packages:
http://cran.r-project.org/web/packages/RSclient/RSclient.pdf
http://cran.r-project.org/web/packages/Rserve/Rserve.pdf
I cannot figure out what the correct workflow is for opening/closing a connection within Rserve and for shutting down Rserve 'gracefully'.
For example, in Ubuntu, I installed R from source with ./configure --enable-R-shlib (following the Rserve documentation) and also added the 'control enable' line in /etc/Rserve.conf.
In an Ubuntu terminal:
library(Rserve)
library(RSclient)
Rserve()
c<-RS.connect()
c ## this is an Rserve QAP1 connection
## Trying to shutdown the server
RSshutdown(c)
Error in writeBin(as.integer....): invalid connection
RS.server.shutdown(c)
Error in RS.server.shutdown(c): command failed with status code 0x4e: no control line present (control commands disabled or server shutdown)
I can, however, CLOSE the connection:
RS.close(c)
>NULL
c ## Closed Rserve connection
After closing the connection, I also tried the options (also tried with argument 'c', even though the connection is closed):
RS.server.shutdown()
RSshutdown()
So, my questions are:
1- How can I close Rserve gracefully?
2- Can Rserve be used without RSclient?
I also looked at
How to Shutdown Rserve(), running in DEBUG
but the question refers to the debug mode and is also unresolved. (I don't have enough reputation to comment/ask whether the shutdown works in the non-debug mode).
Also looked at:
how to connect to Rserve with an R client
Thanks so much!
Load Rserve and RSclient packages, then connect to the instances.
> library(Rserve)
> library(RSclient)
> Rserve(port = 6311, debug = FALSE)
> Rserve(port = 6312, debug = TRUE)
Starting Rserve...
"C:\..\Rserve.exe" --RS-port 6311
Starting Rserve...
"C:\..\Rserve_d.exe" --RS-port 6312
> rsc <- RSconnect(port = 6311)
> rscd <- RSconnect(port = 6312)
Looks like they're running...
> system('tasklist /FI "IMAGENAME eq Rserve.exe"')
> system('tasklist /FI "IMAGENAME eq Rserve_d.exe"')
Image Name PID Session Name Session# Mem Usage
========================= ======== ================ =========== ============
Rserve.exe 8600 Console 1 39,312 K
Rserve_d.exe 12652 Console 1 39,324 K
Let's shut 'em down.
> RSshutdown(rsc)
> RSshutdown(rscd)
And they're gone...
> system('tasklist /FI "IMAGENAME eq Rserve.exe"')
> system('tasklist /FI "IMAGENAME eq Rserve_d.exe"')
INFO: No tasks are running which match the specified criteria.
Rserve can be used w/o RSclient by starting it with args and/or a config script. Then you can connect to it from some other program (like Tableau) or with your own code. RSclient provides a way to pass commands/data to Rserve from an instance of R.
Hope this helps :)
On a Windows system, if you want to close an RServe instance, you can use the system function in R to close it down.
For example in R:
library(Rserve)
Rserve() # run without any arguments or ports specified
system('tasklist /FI "IMAGENAME eq Rserve.exe"') # run this to see RServe instances and their PIDs
system('TASKKILL /PID {yourPID} /F') # run this to kill off the RServe instance with your selected PID
If you have closed your RServe instance with that PID correctly, the following message will appear:
SUCCESS: The process with PID xxxx has been terminated.
You can check the RServe instance has been closed down by entering
system('tasklist /FI "IMAGENAME eq Rserve.exe"')
again. If there are no RServe instances running any more, you will get the message
INFO: No tasks are running which match the specified criteria.
More help and info on this topic can be seen in this related question.
Note that the 'RSclient' approach mentioned in an earlier answer is tidier and easier than this one, but I put it forward anyway for those who start RServe without knowing how to stop it.
If you cannot shut it down from within R, run the commands below to kill it from a terminal. These commands work on Mac.
$ ps ax | grep Rserve # get active Rserve sessions
You will see output like the following; 29155 is the process ID (PID) of the active Rserve session.
29155 /Users/userid/Library/R/3.5/library/Rserve/libs/Rserve
38562 0:00.00 grep Rserve
Then run
$ kill 29155
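The listing above also contains the grep process itself, so the PID has to be picked out by eye. A small sketch that extracts only the Rserve PIDs from ps ax output; the function name rserve_pids is made up for illustration.

```sh
# Print the PIDs of Rserve processes from `ps ax` output on stdin,
# skipping the grep process that such a pipeline adds to the listing.
rserve_pids() {
  awk '/Rserve/ && !/grep/ { print $1 }'
}

# Usage: plain `kill` sends SIGTERM, so each process can exit cleanly.
#   for pid in $(ps ax | rserve_pids); do kill "$pid"; done
```

The filter matches any command line containing "Rserve", so it is worth checking the printed PIDs before killing anything.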

R and snow on amazon EC2 using starcluster

I'm trying to run analysis in parallel in R on an AWS EC2 cluster. I am using
starcluster to setup and manage the EC2 cluster, and am trying to use snow and
foreach in R. To start off, I have 2 nodes in the cluster, 1 master and 1
worker.
starcluster start mycluster
starcluster listinstances
-----------------------------------------
mycluster (security group: #sc-mycluster)
-----------------------------------------
....
Cluster nodes:
master running i-xxxxxxxxx masterIP.compute-1.amazonaws.com
node001 running i-xxxxxxxxx node001IP.compute-1.amazonaws.com
Total nodes: 2
starcluster sshmaster mycluster
I then start R and load the snow package and try to create a cluster
object.
R
library("snow")
cl = makeCluster(c("masterIP.compute-1.amazonaws.com", "node001IP.compute-1.amazonaws.com"), type = "SOCK")
This, however, gives me the following error message:
The authenticity of host 'masterIP.compute-1.amazonaws.com (xx.xxx.xx.xx)' can't be established.
ECDSA key fingerprint is xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'masterIP.compute-1.amazonaws.com,xx.xxx.xx.xx' (ECDSA) to the list of known hosts.
Permission denied (publickey).
So I tried copying my ssh key (keyname.rsa to be specific) to the .ssh directory
on EC2 and trying again. That still didn't work; I received the same
Permission denied (publickey). error. It was my thought that starcluster
handled the setup of ssh and communication between nodes, so I'm a little
confused as to why I'm not able to set this up. I also tried to just add node001, so cl = makeCluster(c("node001IP.compute-1.amazonaws.com"), type = "SOCK"), but the same error occurs.
It turns out, after much tinkering, that all that was needed was an update to R version 2.15. The command cl = makeCluster(c("masterIP.compute-1.amazonaws.com", "node001IP.compute-1.amazonaws.com"), type = "SOCK") worked perfectly after that.

bind failure: Address already in use even though recycle and reuse flags are set to 1

Environment:
Unix client and unix server.
Tool used: curl.
The client and server should ignore the TIME_WAIT period (2 * MSL) when establishing a connection.
This is done by executing the following commands:
sysctl net.ipv4.tcp_tw_reuse=1
sysctl net.ipv4.tcp_tw_recycle=1
The local port must be specified so that it can be reused.
Start the connection.
Example: while [ 1 ]; do curl --local-port 9056 192.168.40.2; sleep 30; done
I still see the error even though the TIME_WAIT period should have been ignored.
Any idea why this is happening?
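Before changing more settings, it may help to confirm that local port 9056 really is still held in TIME-WAIT when the next curl starts. A minimal sketch, assuming a Linux host where ss from iproute2 is available; the function name count_time_wait is made up for illustration.

```sh
# Count sockets in TIME-WAIT whose local port matches the argument.
# Reads `ss -tan` style output on stdin, where column 1 is the state
# and column 4 is the local address:port.
count_time_wait() {
  awk -v port="$1" '
    $1 == "TIME-WAIT" && $4 ~ (":" port "$") { n++ }
    END { print n + 0 }
  '
}

# Usage, run between two iterations of the curl loop:
#   ss -tan | count_time_wait 9056
```

A non-zero count just before the failing curl would suggest that the explicit bind to port 9056 is what hits the lingering socket.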
