Submitting MPI jobs on SGE using Rmpi - mpi

I'm trying to submit a job on SGE using MPI. The R script uses Rmpi and snow.
In the SGE script I have:
#!/bin/sh
#Run with current set of modules and in the current directory
#$ -V -cwd
#Request some time- min 15 mins - max 48 hours
#$ -l h_rt=1:00:00
# Request 4 CPU cores (processes)
#$ -pe ib 4
#Get email at start and end of the job
#$ -m abe
#$ -M bnldd#leeds.ac.uk
#Now run the job
mpirun -n 4 /apps/applications/R/3.5.0/1/default/bin/R --no-save -q < snowe.R > snowe.Rout
And the following R script snowe.R:
library(snow)
library(Rmpi)
p <- rnorm(123, m=33)
cl <- makeCluster(3, type="MPI")
### sends function to each system
clusterCall( cl, function() Sys.info()[c("nodename","machine")])
clusterCall( cl, function() rnorm(1, 33,1 ) )
myNorms <- matrix( rnorm(1000), ncol=10 )
## goes column by column
mypapply <- parApply(cl, myNorms, 2, print )
attributes(mypapply)
mypapply <- parApply(cl, myNorms, 2, mean )
mypapply
stopCluster(cl)
mpi.quit()
The system produces the following error inside the snowe.Rout file:
library(snow)
library(Rmpi)
>
mpirun has exited due to process rank 2 with PID 0 on
node dc1s0b1a exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
You can avoid this message by specifying -quiet on the mpirun command line.
What could be the problem?
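One thing worth checking (a sketch, not a confirmed fix for this setup): when snow creates its workers with makeCluster(..., type="MPI"), the usual pattern is to start a single R master under mpirun and let it spawn the workers, for example:

mpirun -np 1 /apps/applications/R/3.5.0/1/default/bin/R --no-save -q < snowe.R > snowe.Rout

With -n 4, four independent copies of R start, but mpirun typically forwards the redirected stdin only to rank 0, so the other ranks may exit without doing any MPI work, which would line up with the init/finalize complaint above.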

Related

Can Snakemake work if a rule's shell command is a cluster job?

In the example below, if the shell script shell_script.sh sends a job to the cluster, is it possible to make snakemake aware of that cluster job's completion? That is, first, file a should be created by shell_script.sh, which sends its own job to the cluster, and then, once this cluster job is completed, file b should be created.
For simplicity, let's assume that snakemake is run locally, meaning that the only cluster job originates from shell_script.sh and not from snakemake.
localrules: that_job

rule all:
    input:
        "output_from_shell_script.txt",
        "file_after_cluster_job.txt"

rule that_job:
    output:
        a = "output_from_shell_script.txt",
        b = "file_after_cluster_job.txt"
    shell:
        """
        shell_script.sh {output.a}
        touch {output.b}
        """
PS - At the moment, I am using a sleep command to give it a waiting time before the job is "completed". But this is an awful workaround, as it could give rise to several problems.
Snakemake can manage this for you with the --cluster argument on the command line.
You can supply a template for the jobs to be executed on the cluster.
As an example, here is how I use snakemake on an SGE-managed cluster:
The template which will encapsulate the jobs, which I called sge.sh:
#$ -S /bin/bash
#$ -cwd
#$ -V
{exec_job}
Then I run the following directly on the login node:
snakemake -rp --cluster "qsub -e ./logs/ -o ./logs/" -j 20 --jobscript sge.sh --latency-wait 30
--cluster gives the command used to submit jobs to the queuing system
--jobscript is the template in which jobs will be encapsulated
--latency-wait is important if the file system takes a bit of time to write the files. Your job might end and return before the output of the rules is actually visible to the filesystem, which will cause an error
Note that you can mark rules in the Snakefile that should not be executed on the nodes with the keyword localrules:
Otherwise, depending on your queuing system, there are options to wait for a job sent to the cluster to finish (see the SGE sketch after this list):
SGE:
Wait for set of qsub jobs to complete
SLURM:
How to hold up a script until a slurm job (start with srun) is completely finished?
LSF:
https://superuser.com/questions/46312/wait-for-one-or-all-lsf-jobs-to-complete
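For the SGE case, a minimal sketch of that idea (assuming shell_script.sh submits its job with qsub; the job script name and argument below are placeholders) is to make the inner submission block until the job finishes:

# inside shell_script.sh: -sync y makes qsub wait for the job to complete
qsub -sync y -cwd -V my_cluster_job.sh "$1"

With the inner qsub blocking, the rule's shell command only returns once the cluster job is done, so touch {output.b} runs at the right time.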

How to specify the number of MPI ranks by means of environment variables?

Let's assume I run my Open MPI application with the following command:
mpirun a.out
and I specify the number of MPI ranks by means of an LSF job scheduler script:
#BSUB -n 20
How to specify the number of MPI ranks for mpirun through some Open MPI environment variable?
The reason for my need is the following. First, I need to allocate 20 cores on a node and run 5 independent parallel jobs (1, 2, 3, 4, and 10 MPI ranks). Second, I do not have the option of submitting these jobs as non-exclusive jobs to the same host. Third, I do not directly invoke the mpirun a.out command, as it is hidden deep inside a complex third-party script run.sh, and it is only the run.sh script that I can explicitly execute in the job scheduler command file. That is why I would like to do something like this:
OMPI_NUM_RANKS=1 run.sh &
OMPI_NUM_RANKS=2 run.sh &
...
OMPI_NUM_RANKS=10 run.sh &

doMPI not recognizing other nodes in cluster for R script

Using RHEL7.3
Using R 3.3.2
Installed Rmpi_0.6-6.tar.gz and doMPI_0.2.1.tar.gz
Installed mpich-3.0-3.0.4-10.el7 RPM for x86_64
I created a cluster of three machines (aml1, aml2, aml3). I can run the /examples/cpi example from the mpich installation, and the processes run without issue on all three machines.
I can also run an R script that needs to be run multiple times, which is discussed in the doMPI documentation -- so the script runs on all nodes.
My problem arises when my R script has code prior to the %dopar% that needs to be run once on the master (aml1), with the %dopar% running on the cluster (aml2, aml3). Then it only runs on the master, doMPI says Size of MPI universe: 0, and it doesn't recognize aml2 or aml3.
For example:
Run: mpirun -np 1 --hostfile ~/projects/hosts R --no-save -q < example6.R
(and my ~/projects/hosts file is defined to use 8 cores)
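For reference, a hostfile for three machines might look roughly like this (a hypothetical sketch, since the questioner's actual file is not shown; the slots= syntax is Open MPI's, while MPICH uses host:count):

$ cat ~/projects/hosts
aml1 slots=2
aml2 slots=3
aml3 slots=3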
example6.R:
library(doMPI) #load doMPI library
cl <- startMPIcluster(verbose=TRUE)
#load data
#clean data
#perform some functions
#let's say I want to have this done in the script and only parallelize this
x <- foreach(seed=c(7, 11, 13), .combine="cbind") %dopar% {
  set.seed(seed)
  rnorm(3)
}
x
closeCluster(cl)
Output of example6.R:
Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
Spawning 2 workers using the command:
/usr/lib64/R/bin/Rscript /usr/lib64/R/library/doMPI/RMPIworker.R WORKDIR=/home/spark LOGDIR=/home/spark MAXCORES=1 COMM=3 INTERCOMM=4 MTAG=10 WTAG=11 INCLUDEMASTER=TRUE BCAST=TRUE VERBOSE=TRUE
2 slaves are spawned successfully. 0 failed.
If I define cl <- startMPIcluster(count=34, verbose=TRUE) I still get the following but at least I can run 34 slaves:
Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
34 slaves are spawned successfully. 0 failed.
How can I troubleshoot this? I would like to run the R script so it runs the first portion once on the master, and then do %dopar% on the cluster.
Thanks!!
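One quick check along these lines (a sketch; it assumes Rscript and Rmpi are available on the master) is to print the MPI universe size that Rmpi itself sees when launched the same way:

mpirun -np 1 --hostfile ~/projects/hosts Rscript -e 'library(Rmpi); print(mpi.universe.size())'

If this also prints 0 or 1, the hostfile/slot information is not reaching the MPI runtime, which would explain why startMPIcluster() only sees the master.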
Update 1
Since the last update, I tried running an older version of OpenMPI:
[spark#aml1 ~]$ which mpirun
/opt/openmpi-1.8.8/bin/mpirun
Per @SteveWeston, I created the following script and ran it:
[spark#aml1 ~]$ cat sanity_check.R
library(Rmpi)
print(mpi.comm.rank(0))
mpi.quit()
With the following output:
[spark#aml1 ~]$ mpirun -np 3 --hostfile ~/projects/hosts R --slave -f sanity_check.R
FIPS mode initialized
master (rank 0, comm 1) of size 3 is running on: aml1
slave1 (rank 1, comm 1) of size 3 is running on: aml1
slave2 (rank 2, comm 1) of size 3 is running on: aml1
[1] 0
Here it just hangs -- and nothing happens.
I've already accepted @SteveWeston's answer, as it helped me better understand my original question.
I commented on his answer that I was still having issues with my R script hanging; the script would run, but it would never finish on its own or close its own clusters, and I would have to kill it with ctrl-C.
I ultimately set up an NFS environment, built and installed openmpi-1.10.5 there, and installed my R libraries there as well. R is installed separately on both machines, but they share the same library in my NFS directory. Previously I had installed and managed everything under root, including the R libraries (I know). I'm not sure if this is what caused complications, but my issues seem to be resolved.
[master#aml1 nfsshare]$ cat sanity_check.R
library(Rmpi)
print(mpi.comm.rank(0))
mpi.quit(save= "no")
[master#aml1 nfsshare]$ mpirun -np 3 --hostfile hosts R --slave -f sanity_check.R
FIPS mode initialized
[1] 1
[1] 0
[1] 2
# no need to ctrl-C here. It no longer hangs
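If reproducing this kind of shared-library setup, one way to point both machines at the NFS-hosted package directory (a sketch; the path below is hypothetical) is to export R_LIBS to the remote ranks:

export R_LIBS=/nfsshare/Rlibs
mpirun -x R_LIBS -np 3 --hostfile hosts R --slave -f sanity_check.R

The -x flag asks Open MPI to forward the environment variable to the processes started on the other nodes.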

Kill all R processes that hang for longer than a minute

I use a cron task to regularly run Rscript. Unfortunately, I need to do this on a small AWS instance and the process may hang, building more and more processes on top of each other until the whole system is lagging.
I would like to write a cron task to kill all R processes lasting longer than one minute. I found another answer on Stack Overflow that I've adapted and that I think would solve the problem. I came up with:
if [[ "$(uname)" = "Linux" ]];then killall --older-than 1m "/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";fi
I copied the task directly from htop, but it does not work as I expect. I get a "No such file or directory" error, but I've checked it a few times.
I need to kill all R processes that have lasted longer than a minute. How can I do this?
You may want to avoid killing processes from another user and try SIGKILL (kill -9) after SIGTERM (kill -15). Here is a script you could execute every minute with a CRON job:
#!/bin/bash

PROCESS="R"
MAXTIME=`date -d '00:01:00' +'%s'`

function killpids()
{
    PIDS=`pgrep -u "${USER}" -x "${PROCESS}"`

    # Loop over all matching PIDs
    for pid in ${PIDS}; do
        # Retrieve duration of the process
        TIME=`ps -o time:1= -p "${pid}" |
              egrep -o "[0-9]{0,2}:?[0-9]{0,2}:[0-9]{2}$"`
        # Convert TIME to timestamp
        TTIME=`date -d "${TIME}" +'%s'`
        # Check if the process should be killed
        if [ "${TTIME}" -gt "${MAXTIME}" ]; then
            kill ${1} "${pid}"
        fi
    done
}

# Leave a chance to kill processes properly (SIGTERM)
killpids "-15"
sleep 5
# Now kill remaining processes (SIGKILL)
killpids "-9"
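To run this script every minute, a crontab entry along these lines could be used (the script path is a placeholder):

* * * * * /home/ubuntu/kill_stale_R.sh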
Why add an additional process every minute with cron?
Would it not be easier to start R with timeout from coreutils? The processes will then be killed automatically after the time you chose.
timeout [option] duration command [arg]…
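Applied to the Rscript call from the question, that could look something like this (a sketch; -k adds a follow-up SIGKILL if the process ignores the initial SIGTERM):

timeout -k 10 1m Rscript /home/ubuntu/script.R

The existing cron task stays as it is; only the command it runs is wrapped in timeout.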
I think the best option is to do this with R itself. I am no expert, but it seems the future package will allow executing a function in a separate thread. You could run the actual task in a separate thread, and in the main thread sleep for 60 seconds and then stop().
Previous Update
user1747036's answer, which recommends timeout, is a better alternative.
My original answer
This question is more appropriate for superuser, but here are a few things wrong with
if [[ "$(uname)" = "Linux" ]];then
killall --older-than 1m \
"/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";
fi
The name argument is either the name of the image or the path to it. You have included its parameters as well.
If -s signal is not specified, killall sends SIGTERM, which your process may ignore. Are you able to kill a long-running script with this on the command line? You may need SIGKILL / -9. A corrected form is sketched below the link.
More at http://linux.die.net/man/1/killall
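A corrected form along those lines might look like this (a sketch; the image path is given without its arguments and SIGKILL is sent explicitly):

if [[ "$(uname)" = "Linux" ]]; then
    killall -9 --older-than 1m /usr/lib/R/bin/exec/R
fi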

Submitting Open MPI jobs to SGE

I've installed openmpi, not in /usr/... but in /commun/data/packages/openmpi/; it was compiled with --with-sge.
I've added a new PE in SGE as described in http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html
# /commun/data/packages/openmpi/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)
# qconf -sq all.q | grep pe_
pe_list make orte
Without SGE, the program runs without any problem, using several processors.
/commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args
Now I want to submit my program to SGE.
In the Open MPI FAQ, I read:
# Allocate a SGE interactive job with 4 slots
# from a parallel environment (PE) named 'orte'
shell$ qsh -pe orte 4
but my output is:
qsh -pe orte 4
Your job 84550 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Could not start interactive job.
I've also tried the mpirun command embedded in a script:
$ cat ompi.sh
#!/bin/sh
/commun/data/packages/openmpi/bin/mpirun \
/path/to/a.out args
but it fails:
$ cat ompi.sh.e84552
error: executing task of job 84552 failed: execution daemon on host "node02" didn't accept task
--------------------------------------------------------------------------
A daemon (pid 18327) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
error: executing task of job 84552 failed: execution daemon on host "node01" didn't accept task
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
How can I fix this?
Answer on the openmpi mailing list: http://www.open-mpi.org/community/lists/users/2013/02/21360.php
In my case, setting "job_is_first_task FALSE" and "control_slaves TRUE" solved the problem.
# qconf -mp mpi1
pe_name mpi1
slots 9
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
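With a PE configured like this, a submission script along the following lines should work (a sketch; a.out and its arguments are taken from the question, and $NSLOTS is filled in by SGE from the -pe request):

$ cat ompi.sh
#!/bin/sh
#$ -cwd -V
#$ -pe mpi1 9
/commun/data/packages/openmpi/bin/mpirun -np $NSLOTS /path/to/a.out args

$ qsub ompi.sh

With an Open MPI built --with-sge, mpirun can also pick up the slot allocation from SGE automatically, so -np $NSLOTS is mostly a safeguard.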
