How to list hanging mpi jobs

How to list hanging mpi jobs - mpi

I'm running some jobs using mpiexec (mpich2).
mpiexec process exits with nonzero status leaving some worker processes
I can print a list of running child jobs:
$ps aux | grep mpi
Is there another way to list running/hanging jobs?

If MPI leaves around a zombie process (which is odd, this really shouldn't be happening), it will be named whatever the executable that you originally executed was called. So if you started your program with:
mpiexec -n 4 ./a.out
then you'll need to search for
ps aux | grep a.out
which will give you the list of all of those processes that are still hanging around. The reason that what you suggested won't usually work is that if the mpirun or mpiexec process has gone away (due to a crash or completion), you obviously can't keep searching for it. However, it's possible that the children will still be around for one reason or another.

this may help you : ps aux | grep MPICH

Related

why does mpirun behave as it does when used with slurm?

I am using Intel MPI and have encountered some confusing behavior when using mpirun in conjunction with slurm.
If I run (in a login node)
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"
then I get as output the expected 0 and 1 printed out.
If however I salloc --time=30 --nodes=1 and run the same mpirun from the interactive compute node, I get two 0s printed out instead of the expected 0 and 1.
Then, if I change -n 2 to -n 3 (still in compute node), I get a large error from slurm saying srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1) (plus a load of other stuff), but I am not sure how to explain this either...
Now, based on this OpenMPI page, it seems these kind of operations should be supported at least for OpenMPI:
Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.
Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a slurm environment (?), but I am still wondering: what is the underlying nature of mpirun and slurm (salloc) that this is the behavior produced? Why would it print two 0s in the first "case," and what are the inconsistent task counts it talks about in the second "case"?

Kill all R processes that hang for longer than a minute

I use crontask to regularly run Rscript. Unfortunately, I need to do this on a small instance of aws and the process may hang, building more and more processes on top of each other until the whole system is lagging.
I would like to write a crontask to kill all R processes lasting longer than one minute. I found another answer on Stack Overflow that I've adapted that I think would solve the problem. I came up with;
if [[ "$(uname)" = "Linux" ]];then killall --older-than 1m "/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";fi
I copied the task directly from htop, but it does not work as I expect. I get the No such file or directory error but I've checked it a few times.
I need to kill all R processes that have lasted longer than a minute. How can I do this?

You may want to avoid killing processes from another user and try SIGKILL (kill -9) after SIGTERM (kill -15). Here is a script you could execute every minute with a CRON job:
#!/bin/bash
PROCESS="R"
MAXTIME=`date -d '00:01:00' +'%s'`
function killpids()
{
PIDS=`pgrep -u "${USER}" -x "${PROCESS}"`
# Loop over all matching PIDs
for pid in ${PIDS}; do
# Retrieve duration of the process
TIME=`ps -o time:1= -p "${pid}" |
egrep -o "[0-9]{0,2}:?[0-9]{0,2}:[0-9]{2}$"`
# Convert TIME to timestamp
TTIME=`date -d "${TIME}" +'%s'`
# Check if the process should be killed
if [ "${TTIME}" -gt "${MAXTIME}" ]; then
kill ${1} "${pid}"
fi
done
}
# Leave a chance to kill processes properly (SIGTERM)
killpids "-15"
sleep 5
# Now kill remaining processes (SIGKILL)
killpids "-9"

Why imply an additional process every minute with cron?
Would it not be easier to start R with timeout from coreutils, the processes will then be killed automatically after the time you chose.
timeout [option] duration command [arg]…

I think the best option is to do this with R itself. I am no expert, but it seems the future package will allow executing a function in a separate thread. You could run the actual task in a separate thread, and in the main thread sleep for 60 seconds and then stop().
Previous Update
user1747036's answer which recommends timeout is a better alternative.
My original answer
This question is more appropriate for superuser, but here are a few things wrong with
if [[ "$(uname)" = "Linux" ]];then
killall --older-than 1m \
"/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";
fi
The name argument is either the name of image or path to it. You have included parameters to it as well
If -s signal is not specified killall sends SIGTERM which your process may ignore. Are you able to kill a long running script with this on the command line? You may need SIGKILL / -9
More at http://linux.die.net/man/1/killall

What is difference between a job and a process in Unix?

What is the difference between a job and a process in Unix ? Can you please give an example ?

Jobs are processes which are started by a shell. The shell keeps track of these in a job table. The jobs command shows a list of active background processes. They get a jobspec number which is not the pid of the process. Commands like fg use the jobspec id.
In the spirit of Jürgen Hötzel's example:
find $HOME | sort &
[1] 15317
$ jobs
[1]+ Running find $HOME | sort &
$ fg
find $HOME | sort
C-c C-z
[1]+ Stopped find $HOME | sort
$ bg 1
[1]+ find $HOME | sort &
Try the examples yourself and look at the man pages.

A Process Group can be considered as a Job. For example you create a background process group in shell:
$ find $HOME|sort &
[1] 2668
And you can see two processes as members of the new process group:
$ ps -p 2668 -o cmd,pgrp
CMD PGRP
sort 2667
$ ps -p "$(pgrep -d , -g 2667)" -o cmd,pgrp
CMD PGRP
find /home/juergen 2667
sort 2667
You can can also kill the whole process group/job:
$ pkill -g 2667

http://en.wikipedia.org/wiki/Job_control_%28Unix%29:
Processes under the influence of a job control facility are referred to as jobs.

http://en.wikipedia.org/wiki/Job_control_%28Unix%29
Jobs are one or more processes that are grouped together as a 'job', where job is a UNIX shell concept.

Jobs are one or more processes that are grouped together as a 'job', where job is a UNIX shell concept. A job consists of multiple processes running in series or parallel. while
A process is a program under execution. job is when you want to know about processes started from the current shell.

A job consists of multiple processes running in series or parallel. A process is a program under execution.

job is when you want to know about processes started from the current shell.
process is when you want to know about a process running from any shell or computer.

I think a job is a scheduled process or set of processes, a job always has the notion of schedule, otherwise we could call it a process.

List and kill at jobs on UNIX

I have created a job with the at command on Solaris 10.
It's working now but I want to kill it but I don't know how I can find the job number and how to kill that job or process.

You should be able to find your command with a ps variant like:
ps -ef
ps -fubob # if your job's user ID is bob.
Then, once located, it should be a simple matter to use kill to kill the process (permissions permitting).
If you're talking about getting rid of jobs in the at queue (that aren't running yet), you can use atq to list them and atrm to get rid of them.

To delete a job which has not yet run, you need the atrm command. You can use atq command to get its number in the at list.
To kill a job which has already started to run, you'll need to grep for it using:
ps -eaf | grep <command name>
and then use kill to stop it.
A quicker way to do this on most systems is:
pkill <command name>

at -l to list jobs, which gives return like this:
age2%> at -l
11 2014-10-21 10:11 a hoppent
10 2014-10-19 13:28 a hoppent
atrm 10 kills job 10
Or so my sysadmin told me, and it

First
ps -ef
to list all processes. Note the the process number of the one you want to kill. Then
kill 1234
were you replace 1234 with the process number that you want.
Alternatively, if you are absolutely certain that there is only one process with a particular name, or you want to kill multiple processes which share the same name
killall processname

killall on process of same name

I would like to use killall on a process of the same name from which killall will be executed without killing the process spawning the killall.
So in more detail, say I have process foo, and process foo is running. I want to be able to run "foo -k", and have the new foo kill the old foo, without killing itself.

pgrep foo | grep -v $$ | xargs kill
If you don't have pgrep, you'll have to come up with some other way of generating the list of PIDs of interest. Some options are:
Use ps with appropriate options, followed by some combination of grep, sed and/or awk to match the processes and extract the PIDs.
killall can send a signal 0 instead of SIGTERM; the standard semantics of this is that it doesn't send a signal, but just determines if the process is alive or not. Perhaps you can use killall to select the process list and get it to print the PIDs of the matching ones that are alive. This would also probably require a bit of post-processing with sed.
There may be something along the lines of Linux's /proc filesystem with pseudo-files holding system data that you could grovel through. Again, grep/awk/sed are your friends here.
If you truly need particular details on how to do this, comment or send me mail, and I'll try expanding some of these options in more detail.
[Edits: added further options for those without pgrep.]

This seems to work on OS X:
killall -s foo | perl -ne 'system $_ unless /\b'$PPID'\b/'
killall -s lists what it would do, one PID at a time. Do what it would do except for killing yourself.

The usual way to solve this is to have foo write its process ID to a file, say something like /var/run/foo.pid when it is run in daemon mode. Then you can have the non-daemon version read the PID from the PID file and call kill(2) on it directly. This is usually how apache and the like handle it. Of course the newer OSX daemons go through launchd(8) instead, but there are still a few that use good old fashioned signals.