After some trial and error, I was able to install the Rmpi package on my machine using the following command:
R CMD INSTALL -l /storage/home/***/.R Rmpi_0.6-7.tar.gz --configure-args="--with-Rmpi-type=OPENMPI --disable-dlopen --with-Rmpi-include=/gpfs/group/RISE/sw7/openmpi_4.1.4_gcc-9.3.1/include --with-Rmpi-libpath=/gpfs/group/RISE/sw7/openmpi_4.1.4_gcc-9.3.1/lib"
I tried to run the following test code:
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
    library("Rmpi")
}
ns <- mpi.universe.size() - 1
mpi.spawn.Rslaves(nslaves=ns)
#
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function() {
    if (is.loaded("mpi_initialize")) {
        if (mpi.comm.size(1) > 0) {
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
        }
        print("Please use mpi.quit() to quit R")
        .Call("mpi_finalize")
    }
}
# Tell all slaves to return a message identifying themselves
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( ns <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
# Test computations
x <- 5
x <- mpi.remote.exec(rnorm, x)
length(x)
x
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves(dellog = FALSE)
mpi.quit()
On my HPC I run the following:
qsub -A open -l walltime=6:00:00 -l nodes=4:ppn=4:stmem -I
module use /gpfs/group/RISE/sw7/modules
module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
mpirun -np 4 Rscript "codes/test/test4.R"
But then I get the following error, which indicates that no slaves are available to spawn:
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: comp-sc-0222
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: comp-sc-0222
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: comp-sc-0222
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: comp-sc-0222
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: comp-sc-0222
Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: comp-sc-0222
Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: comp-sc-0222
Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: comp-sc-0222
Local device: mlx4_0
--------------------------------------------------------------------------
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
Choose a positive number of slaves.
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
Choose a positive number of slaves.
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
Choose a positive number of slaves.
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
Choose a positive number of slaves.
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
I have tried specifying different values of -np, but I still get the same error. What could be the cause here?
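For reference, the help text above suggests that the InfiniBand warnings (though apparently not the spawn failure itself) can be silenced by setting the named MCA parameter, or by excluding the openib BTL so that UCX handles the fabric; a sketch, using standard Open MPI MCA syntax:
mpirun --mca btl_openib_allow_ib true -np 4 Rscript "codes/test/test4.R"
# or leave InfiniBand to UCX and skip the openib BTL entirely:
mpirun --mca btl ^openib -np 4 Rscript "codes/test/test4.R"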
============================================================
(EDIT)
It seems that my original module load command also loads intel/19.1.2 and mkl/2020.3. If I unload them, I do see that OMPI_UNIVERSE_SIZE=4.
[****#comp-sc-0220 work]$ module purge
[****#comp-sc-0220 work]$ module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
[****#comp-sc-0220 work]$ module list
Currently Loaded Modules:
1) openmpi/4.1.4-gcc.9.3.1 2) intel/19.1.2 3) mkl/2020.3 4) r/4.0.3
[****#comp-sc-0220 work]$ mpirun -np 4 env | grep OMPI_UNIVERSE_SIZE
[****#comp-sc-0220 work]$ type mpirun; mpirun --version; mpirun -np 1 env | grep OMPI
mpirun is /opt/aci/intel/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
Intel(R) MPI Library for Linux* OS, Version 2019 Update 8 Build 20200624 (id: 4f16ad915)
Copyright 2003-2020, Intel Corporation.
LMOD_FAMILY_COMPILER_VERSION=19.1.2
LMOD_FAMILY_COMPILER=intel
[****#comp-sc-0220 work]$ module purge
[****#comp-sc-0220 work]$ module load openmpi/4.1.4-gcc.9.3.1 r/4.0.3
[****#comp-sc-0220 work]$ module unload intel mkl
[****#comp-sc-0220 work]$ module list
Currently Loaded Modules:
1) openmpi/4.1.4-gcc.9.3.1 2) r/4.0.3
[****#comp-sc-0220 work]$ mpirun -np 4 env | grep OMPI_UNIVERSE_SIZE
OMPI_UNIVERSE_SIZE=4
OMPI_UNIVERSE_SIZE=4
OMPI_UNIVERSE_SIZE=4
OMPI_UNIVERSE_SIZE=4
[****#comp-sc-0220 work]$ type mpirun; mpirun --version; mpirun -np 1 env | grep OMPI
mpirun is /gpfs/group/RISE/sw7/openmpi_4.1.4_gcc-9.3.1/bin/mpirun
mpirun (Open MPI) 4.1.4
Report bugs to http://www.open-mpi.org/community/help/
OMPI_MCA_pmix=^s1,s2,cray,isolated
OMPI_COMMAND=env
OMPI_MCA_orte_precondition_transports=954e2ae0a9569e46-2223294369d728a3
OMPI_MCA_orte_local_daemon_uri=4134338560.0;tcp://10.102.201.220:58039
OMPI_MCA_orte_hnp_uri=4134338560.0;tcp://10.102.201.220:58039
OMPI_MCA_mpi_oversubscribe=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_orte_num_nodes=1
OMPI_MCA_shmem_RUNTIME_QUERY_hint=mmap
OMPI_MCA_orte_bound_at_launch=1
OMPI_MCA_ess=^singleton
OMPI_MCA_orte_ess_num_procs=1
OMPI_COMM_WORLD_SIZE=1
OMPI_COMM_WORLD_LOCAL_SIZE=1
OMPI_MCA_orte_tmpdir_base=/tmp
OMPI_MCA_orte_top_session_dir=/tmp/ompi.comp-sc-0220.26954
OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.comp-sc-0220.26954/pid.8212
OMPI_NUM_APP_CTX=1
OMPI_FIRST_RANKS=0
OMPI_APP_CTX_NUM_PROCS=1
OMPI_MCA_initial_wdir=/storage/work/k/****
OMPI_MCA_orte_launch=1
OMPI_MCA_ess_base_jobid=4134338561
OMPI_MCA_ess_base_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
OMPI_COMM_WORLD_NODE_RANK=0
OMPI_MCA_orte_ess_node_rank=0
OMPI_FILE_LOCATION=/tmp/ompi.comp-sc-0220.26954/pid.8212/0/0
But if I run the same test4.R again, I get the following error:
/gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
/gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
/gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[63743,1],0]
Exit code: 127
--------------------------------------------------------------------------
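The path in that message shows this R 4.0.3 build lives under an intel-19.1.2/mkl-2020.3 tree, so unloading intel and mkl also removes libiomp5.so from the library path. A quick check (a sketch; run it with the same modules loaded as above) is to ask the loader which of R's shared libraries can no longer be resolved:
ldd /gpfs/group/RISE/sw7/R-4.0.3-intel-19.1.2-mkl-2020.3/R-4.0.3/../install/lib64/R/bin/exec/R | grep "not found"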
============================================================
(EDIT 2)
I changed my module load command again to module load openmpi/4.1.4-gcc.9.3.1 r/4.0.5-gcc-9.3.1. With this newer build of R, I ran my test4.R script again with mpirun -np 4 Rscript "codes/test/test4.R". It now returns a new error message:
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] "/storage/home/k/kxk5678/.R"
[2] "/gpfs/group/RISE/sw7/R-4.0.5-gcc-9.3.1/install/lib64/R/library"
[1] 4
[1] 4
[1] 4
[1] 4
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
MPI_ERR_SPAWN: could not spawn processes
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
MPI_ERR_SPAWN: could not spawn processes
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
MPI_ERR_SPAWN: could not spawn processes
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package = "Rmpi"), :
MPI_ERR_SPAWN: could not spawn processes
Calls: mpi.spawn.Rslaves -> mpi.comm.spawn
Execution halted
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[62996,1],1]
Exit code: 1
--------------------------------------------------------------------------
Install the package pbdMPI in an R session on the login node, and run the following translation of the Rmpi test code to pbdMPI:
library(pbdMPI)
ns <- comm.size()
# Each R session identifies itself
id <- comm.rank()
host <- system("hostname", intern = TRUE)
comm.cat("I am", id, "on", host, "of", ns, "\n", all.rank = TRUE)
# Test computations
x <- 5
x <- rnorm(x)
comm.print(length(x))
comm.print(x, all.rank = TRUE)
finalize()
You run it the same way as the Rmpi version: mpirun -np 4 Rscript your_new_script_file.
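If pbdMPI is not installed yet, installing it on the login node with the same Open MPI module loaded is usually enough for its configure step to find mpicc; a sketch, reusing the module names from the question:
module load openmpi/4.1.4-gcc.9.3.1 r/4.0.5-gcc-9.3.1
Rscript -e 'install.packages("pbdMPI", repos = "https://cloud.r-project.org")'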
Spawning MPI processes (as in the Rmpi example) was appropriate on clusters of workstations, but on an HPC cluster the prevalent way to program with MPI is SPMD (single program, multiple data). SPMD means that your code is a generalization of a serial code, able to run as several cooperating copies of itself.
In the above example, cooperation happens only in the printing (the comm.* functions). There is no manager/master, just several R sessions running the same code (usually computing something different based on comm.rank()) and cooperating/communicating via MPI. This is the prevalent way of doing large-scale parallel computing on HPC clusters.
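For example, here is a minimal sketch of rank-based work splitting in the same SPMD style (get.jid() and allreduce() are standard pbdMPI helpers; the computation itself is only illustrative):
library(pbdMPI)
idx <- get.jid(100)                        # this rank's share of tasks 1..100
local_sum <- sum(sqrt(idx))                # each rank works only on its own slice
total <- allreduce(local_sum, op = "sum")  # combine the partial results across ranks
comm.print(total)
finalize()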
I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". It is usual for me to get errors like "ORTE has lost communication with a remote daemon", but I usually get those at the beginning of the job. That is annoying, but it does not cost as much time as getting an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
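As the aggregated-help lines above suggest, the full, un-aggregated messages from every rank can be obtained on the next run by turning off help aggregation, which at least narrows down which rank failed first; a sketch (substitute your usual mpirun arguments):
mpirun --mca orte_base_help_aggregate 0 <your usual mpirun arguments>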
I'm using rsync on Solaris and couldn't find an exit code that indicates whether any file or folder was modified, added, or deleted on the destination. How can I get that status if rsync doesn't provide one?
0 Success
1 Syntax or usage error
2 Protocol incompatibility
3 Errors selecting input/output files, dirs
4 Requested action not supported: an attempt was made to manipulate 64-bit
files on a platform that cannot support them; or an option was specified
that is supported by the client and not by the server.
5 Error starting client-server protocol
6 Daemon unable to append to log-file
10 Error in socket I/O
11 Error in file I/O
12 Error in rsync protocol data stream
13 Errors with program diagnostics
14 Error in IPC code
20 Received SIGUSR1 or SIGINT
21 Some error returned by waitpid()
22 Error allocating core memory buffers
23 Partial transfer due to error
24 Partial transfer due to vanished source files
25 The --max-delete limit stopped deletions
30 Timeout in data send/receive
35 Timeout waiting for daemon connection
Thank you
There is a workaround:
rsync --log-format=%f ...
Note that rsync lists a file whenever any of its attributes change, not only when the file's content is updated.
There is also a -i option (or --log-format=%i) that itemizes all of the changes. See the rsync man page for details of the output format.
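So one way to get a simple "did anything change?" status is to capture the itemized output and test whether it is empty; a rough sketch (paths are placeholders, and depending on options and rsync version, directory attribute updates may also show up in the list):
changes=$(rsync -ai /src/dir/ user@dest:/dest/dir/)
if [ -z "$changes" ]; then
    echo "no changes made on the destination"
else
    printf '%s\n' "$changes"
fi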
My requirement is to create a heap dump of a remote server using jmap.
I did it this way:
jmap -dump:file=remoteDump.txt,format=b 3104
This worked fine, as 3104 is the PID of a process on my local machine.
How do I do the same with a remote server?
I tried
jmap -dump:file=remoteDump.txt,format=b 3104 54.197.228.33:8080
but it failed.
I tried creating a debug server using jsadebugd, as below.
1. Started rmiregistry:
rmiregistry -J-Xbootclasspath/p:$JAVA_HOME/lib/sa-jdi.jar
2. Ran jsadebugd:
jsadebugd 11594 54.197.228.33:9009
But step 2 throws the following error:
Error attaching to process or starting server: sun.jvm.hotspot.debugger.DebuggerException: Windbg Error: WaitForEvent failed!
at sun.jvm.hotspot.debugger.windbg.WindbgDebuggerLocal.attach0(Native Method)
at sun.jvm.hotspot.debugger.windbg.WindbgDebuggerLocal.attach(WindbgDebuggerLocal.java:152)
at sun.jvm.hotspot.HotSpotAgent.attachDebugger(HotSpotAgent.java:
at sun.jvm.hotspot.HotSpotAgent.setupDebuggerWin32(HotSpotAgent.java:
at sun.jvm.hotspot.HotSpotAgent.setupDebugger(HotSpotAgent.java:3
at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:313)
at sun.jvm.hotspot.HotSpotAgent.startServer(HotSpotAgent.java:220)
at sun.jvm.hotspot.DebugServer.run(DebugServer.java:106)
at sun.jvm.hotspot.DebugServer.main(DebugServer.java:45)
at sun.jvm.hotspot.jdi.SADebugServer.main(SADebugServer.java:55)
Please help me get past this.
The reason you cannot attach to the process could be that it is already attached to some other debugger, or that it is running on a different virtual machine than the one your jmap is running against.
Try to make sure the process is not attached to any debugger and that you attach to the same VM.
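If the SA-based route keeps failing, an often simpler alternative (not what the jsadebugd setup above does; it assumes you have SSH access and a JDK on the server, and the PID is a placeholder) is to take the dump on the remote machine itself and copy it back:
ssh user@54.197.228.33 "jmap -dump:format=b,file=/tmp/remoteDump.hprof <remote-pid>"
scp user@54.197.228.33:/tmp/remoteDump.hprof .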
I tried the following command to rsync from a server and got the following error message:
rsync -e ssh -avz name@home.com:/home/name/. .
receiving file list ... done
Desktop/Python_Nick/Python-2.4.1/
Desktop/Python_Nick/Python-2.4.1/Python/
Write failed: Broken pipe
rsync: writefd_unbuffered failed to write 4092 bytes [generator]: Broken pipe (32)
rsync error: unexplained error (code 255) at /SourceCache/rsync/rsync-42/rsync/io.c(1121) [generator=2.6.9]
rsync error: received SIGUSR1 (code 19) at /SourceCache/rsync/rsync-42/rsync/main.c(1197) [receiver=2.6.9]
Sometimes it gives the following error:
Read from socket failed: Operation timed out
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
rsync error: unexplained error (code 255) at /SourceCache/rsync/rsync-42/rsync/io.c(452) [receiver=2.6.9]
Then sometimes the following error:
Write failed: Broken pipe
rsync: connection unexpectedly closed (314764 bytes received so far) [receiver]rsync: writefd_unbuffered failed to write 4092 bytes [generator]: Broken pipe (32)
rsync error: unexplained error (code 255) at /SourceCache/rsync/rsync-42/rsync/io.c(1121) [generator=2.6.9]
rsync error: error in rsync protocol data stream (code 12) at /SourceCache/rsync/rsync-42/rsync/io.c(452) [receiver=2.6.9]
I even tried using Fetch (the Mac application).
Any suggestions? The server is running Linux and my local machine is a Mac.
I had a similar problem, and my solution was to update rsync on my Mac.
As is clear from point 3 in the rsync "current issues and debugging" page, the "connection unexpectedly closed" error tells you that the local rsync was trying to talk to the remote rsync, but the connection to that rsync is now gone. The thing you must figure out is why, and that can involve some investigative work.
One of the first things suggested on that page is to update rsync. It turns out that Apple does not ship the latest rsync on its machines, so I updated the rsync on my Mac via MacPorts to 3.1.1. After that, I have had no problems syncing.
This is not a general solution, of course, but then again there seems to be no general solution for this vague error message: if this had not worked, I would have had to dig in and debug it. For details on how, see the above-mentioned rsync issues page.
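For reference, checking and replacing the stock rsync looks roughly like this (MacPorts shown since that is what I used; /opt/local is MacPorts' default prefix, and Homebrew's brew install rsync is the analogous route):
rsync --version                  # Apple's bundled rsync reports 2.6.9 here
sudo port install rsync          # install a current rsync from MacPorts
/opt/local/bin/rsync --version   # the updated binary, e.g. 3.1.1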