Poll system call - is this infiniband communication? - mpi

There is a point in my Open MPI code (built with InfiniBand support) where no progress output appears, and strace on one of the processes shows this:
poll([{fd=5, events=POLLIN}, {fd=14, events=POLLIN}, {fd=23, events=POLLIN}], 3, 0) = 0 (Timeout)
over and over again. As per this question I ran ls -l /proc/<pid>/fd and I see a couple of sockets and /dev/infiniband/ links. Is this system call indicative of interprocess communication over infiniband? How can I verify this or further debug what is happening in the code at this time?

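The third argument to poll is the timeout in milliseconds. Here it is 0, so each call returns immediately, and the return value 0 means that none of the three descriptors had anything ready to read. In other words, the process is repeatedly checking for a communication event that has not occurred yet. This alone does not tell you whether the traffic goes over InfiniBand. This openmpi faq page lists some ways to debug.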
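To see what the strace line means in isolation, here is a minimal sketch (not Open MPI code) that polls one descriptor with a zero timeout. The names poll_once and demo_poll are made up for this illustration, and a pipe stands in for whatever descriptors the MPI library actually watches.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

/* Poll one descriptor with a zero timeout, like the calls in the strace
 * output. Returns the number of ready descriptors (0 means nothing was
 * ready, which strace labels "Timeout"), or -1 on error. */
int poll_once(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return poll(&pfd, 1, 0);
}

/* Hypothetical demo: record poll's result on a pipe before and after a
 * byte is written to it. Returns 0 on success, -1 on failure. */
int demo_poll(int *before, int *after)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    *before = poll_once(fds[0]);        /* pipe is empty */

    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    *after = poll_once(fds[0]);         /* one byte is waiting */

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```

A process that calls something like poll_once in a tight loop produces exactly the repeated "= 0 (Timeout)" lines until one of its peers sends something.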

Related

SIGKILL init process (PID 1)

I'm facing a weird issue regarding sending signal 9 (SIGKILL) to the init process (PID 1).
As you may know, SIGKILL can't be ignored via signal handlers. When I tried sending SIGKILL to init, I noticed that nothing happened; init would not get terminated. Trying to figure out this behaviour, I decided to attach to the init process with strace to see more clearly what was happening. Now comes the weird part. If I'm "looking" at the init process with strace and send it SIGKILL, the system crashes.
My question is why is this happening? Why does the system crash when I look at the process and why does it not crash when I'm not? As I said, in both cases I send SIGKILL to init. Tested on CentOS 6.5, Debian 7 and Arch.
Thanks!
The Linux kernel deliberately forces a system crash if init terminates (see http://lxr.free-electrons.com/source/kernel/exit.c?v=3.12#L501 and particularly the call to panic therein). Therefore, as a safeguard, the kernel will not deliver any fatal signal to init, and SIGKILL is not excepted (see http://lxr.free-electrons.com/ident?v=3.12&i=SIGNAL_UNKILLABLE) (however, the code flow is convoluted enough that I'm not sure, but I suspect a kernel-generated SIGSEGV or similar would go through).
Applying ptrace(2) (the system call that strace uses) to process 1 apparently disables this protection. This could be said to be a bug in the kernel. I am insufficiently skilled at digging around in the code to find this bug.
I do not know if other Unix variants apply the same crash-on-exit semantics or signal protection to init. It would be reasonable to have the OS perform a clean shutdown or reboot, rather than a panic, if init terminates (at least, if it does so by calling _exit) but as far as I know, all modern Unix variants have a dedicated system call to request this, instead (reboot(2)).
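As background to the statement that SIGKILL cannot be handled: for an ordinary process, the kernel rejects any attempt to install a handler for it. The sketch below shows only that user-space side; it does not reproduce the init or ptrace behaviour discussed above. The helper name try_install_handler is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>

/* Handler that does nothing; it only needs to exist. */
static void on_signal(int signo)
{
    (void)signo;
}

/* Try to install a handler for signo.
 * Returns 0 on success, otherwise the errno value set by sigaction. */
int try_install_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);

    if (sigaction(signo, &sa, NULL) == -1)
        return errno;
    return 0;
}
```

Calling try_install_handler(SIGKILL) fails with EINVAL, while the same call for SIGUSR1 succeeds.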

Process stops getting network data

We have a process (written in C++/managed) that receives network data via TCP/IP.
After the process has been running for a while (we track the network load), the network seems to freeze and the process stops receiving data. Other processes on the system that use the network (same NIC) keep operating normally.
The process gets out of this frozen state by itself after several minutes.
Any idea what is happening?
Is there any counter I can track to see whether my process is hitting some limit?
It is going to be very difficult to answer specifically
-- without knowing what exactly your process/application does,
-- whether it is a network chat application, a file server/client, or something else,
-- without other details about how your process is implemented and which libraries it uses, if relevant to the problem.
You also haven't mentioned which OS and environment the process runs under, so there is very little anyone can do to help. It could be anything: a busy-wait loop in your code, locking problems if the code is multi-threaded, and so on.
Nonetheless, here are some options to check.
If it's Linux, try the commands below to debug and monitor the behaviour of the process and see what the problem could be:
top
Check top to see how much CPU and memory your process is using and whether its CPU usage is abnormally high.
pstack
This should show the stack frames the process is executing at the time of the problem.
netstat
Run this with the necessary options (tcp/udp) to check the state of the network sockets opened by your process.
gcore -s -c
This dumps a core file of your process while the problem is happening; you can then analyze that core file using gdb.
gdb
At the gdb prompt, use the where command to get a full backtrace of the process (which function it was executing last, and the calls that led to it).

Communication between two programs signals or shared mem?

I need to implement (in Qt) a way for two programs running on a Linux machine to communicate. One program is Worker, and the second is Watchdog. Basically I need Watchdog to periodically check on Worker and, if something is wrong (no process, or a hang with no answer from Worker), kill Worker (if present) and start it again.
Worker runs as a daemon, so I think starting it from unix /etc/init.d/worker would be appropriate.
I can see two solutions
Unix signals - both of them can send and receive Unix SIGUSR1
Shared memory
Which one to choose?
With signals, both programs have to know each other's pid, probably by reading it from the filesystem (/var/run), which looks like a drawback.
With shared memory, all I need is a "key" that both programs have hardcoded, so there is no need to read pids from the filesystem. Since Watchdog should start first, it can create the shared memory segment, and Worker will only attach to it and maybe update its timestamp value. However, to stop a hung Worker, Watchdog will still need Worker's pid to send it SIGKILL; maybe it can read that from shared memory? Both concepts are new to me.
So what is the proper way to build reliable Watchdog, or am I missing something?
best regards
Marek
I think this is the best solution available through Qt:
http://qt-project.org/doc/qt-4.8/qlocalsocket.html
http://qt-project.org/doc/qt-4.8/qlocalserver.html
The QLocalSocket class provides a local socket. On Windows this is a
named pipe and on Unix this is a local domain socket.
http://qt-project.org/doc/qt-4.8/ipc-localfortuneserver.html
http://qt-project.org/doc/qt-4.8/ipc-localfortuneclient.html
Hope that helps.
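On Unix, QLocalSocket is a local domain socket, so the heartbeat idea can be sketched with plain POSIX calls. This is not Qt code and not a complete watchdog: the function names wait_for_heartbeat and send_heartbeat, and the one-byte heartbeat format, are assumptions made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Watchdog side: wait up to timeout_ms for one heartbeat byte on fd.
 * Returns 1 if a heartbeat arrived, 0 on timeout (Worker may be hung),
 * -1 on error or if the peer closed the socket (Worker is gone). */
int wait_for_heartbeat(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0)
        return ready;               /* 0 = timeout, -1 = poll failed */

    char byte;
    ssize_t n = read(fd, &byte, 1);
    return n == 1 ? 1 : -1;         /* n == 0 means end of file */
}

/* Worker side: send one heartbeat byte. Returns 0 on success, -1 on error. */
int send_heartbeat(int fd)
{
    return write(fd, "h", 1) == 1 ? 0 : -1;
}
```

With a connected pair of local sockets (for example from socketpair(AF_UNIX, SOCK_STREAM, 0, sv)), the watchdog can treat a 0 result as a hang and a -1 result as a dead Worker. Restarting Worker, and killing a hung one, is left out of this sketch.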

Kill an mpi process

I would like to know if there is a way for an MPI process to send a kill signal to another MPI process?
Or, put differently, is there a way to exit from an MPI environment gracefully when one of the processes is still active? (i.e. mpi_abort() prints an error message).
Thanks
No, this is not possible within an MPI application using the MPI library.
Individual processes would not be aware of the location of the other processes, nor of their process IDs, and there is nothing in the MPI spec that performs the kill you want.
If you were to do this manually, you'd need to use MPI_Alltoall to exchange process IDs and hostnames across the system, and then spawn ssh/rsh to visit the required node whenever you wanted to kill something. All in all, it's neither portable nor clean.
MPI_Abort is the right way to do what you are trying to achieve. From the Open MPI manual:
"This routine makes a "best attempt" to abort all tasks in the group of comm." (i.e. MPI_Abort(MPI_COMM_WORLD, -1) is what you need.)
Any output during MPI_Abort would be machine specific - so you may, or may not, receive the error message you mention.

What is a signal in Unix?

This comment confuses me: "kill -l generally lists all signals". I thought that a signal means a quantized amount of energy.
[Added] Please, clarify the (computational) signal in Unix and the physical signal. Are they totally different concepts?
[Added] Are there major differences between paradigms? Is the meaning the same in languages such as C, Python and Haskell? The signal seems to be a general term.
I cannot believe that people are not comparing things such as hardware and software or stressing OS at some points.
Comparison between a signal and an interrupt:
The difference is that while interrupts are sent to the operating system by the hardware, signals are sent to the process by the operating system, or by other processes. Note that signals have nothing to do with software interrupts, which are still sent by the hardware (the CPU itself, in this case). (source)
Definitions
process = a program in execution, according to the book below
Further reading
compare the signal to Interrupts and Exceptions
Tanenbaum's book Modern Operating Systems
The manual refers to a very basic mechanism that allows processes or the operating system to notify other processes by sending a signal. The operating system can use it to notify a program that it is being aborted (SIGABRT) or that it caused a segmentation fault (SIGSEGV, often the result of dereferencing a null pointer), to name two examples.
Some Unix servers use signals so that the administrator can use kill to send them a signal that makes them re-read their configuration file without a restart.
Some signals have a default action and others are ignored by default. For example, on receipt of SIGSEGV the program terminates, whereas SIGCHLD, which means that a child process died, is ignored by default.
There is an ANSI C standard function called signal (see man signal) that installs a signal handler, a function that runs some code when a signal is received. That function behaves differently across Unix variants, so its use is discouraged. Its man page refers to the sigaction function (see man sigaction), which behaves consistently and is also more powerful.
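A minimal sketch of installing a handler with sigaction might look like this. The names on_usr1, install_usr1_handler and usr1_received are made up for the example, and the choice of SIGUSR1 and SA_RESTART is just one reasonable setup.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Flag set by the handler. sig_atomic_t is the type that is safe to
 * write from a signal handler. */
static volatile sig_atomic_t got_usr1 = 0;

/* The handler only records that the signal arrived. */
static void on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;
}

/* Install on_usr1 as the handler for SIGUSR1.
 * Returns 0 on success, -1 on failure (as sigaction does). */
int install_usr1_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);       /* block no extra signals in the handler */
    sa.sa_flags = SA_RESTART;       /* restart interrupted system calls */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Returns 1 once SIGUSR1 has been received, 0 before that. */
int usr1_received(void)
{
    return got_usr1;
}
```

After install_usr1_handler() succeeds, sending the signal with raise(SIGUSR1) from the same program, or with kill -USR1 <pid> from a shell, makes usr1_received() return 1.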
A physical signal and a Unix signal are indeed different concepts. When a Unix signal is sent from one process to another, there is no specific corresponding physical signal. Unix signals are merely an abstraction so programmers can talk about processes communicating with one another.
Unix signals could have been called messages, events, notifications, or even a made-up term like "frobs". The designers just chose the name "signal", and it stuck.
A signal is a message, either to the target process, or to the OS about the target process. It is part of the unix API (and is defined in various POSIX standards).
Read man kill, man signal, and man sigaction.
Other SO questions that might be helpful:
What is the difference between sigaction and signal?
Some points from my notes:
Allows asynchronous communication
  Between processes belonging to the same user
  From the system to any process
  From the system manager to any process
All associated information is in the signal itself
Many different signals
  SIGINT
    From the system to all processes associated to a terminal
    Trigger: ^C pressed
    Usual way to stop a running process
  SIGFPE
    From the kernel to a single process
    Trigger: error in floating point operation
  SIGKILL
    To a single process
    Terminates the destination process (cannot be caught or ignored)
  SIGALRM
    From the kernel to a single process
    Trigger: timer expiration
  SIGTERM
    To a single process
    Recommends the process to terminate gracefully
  SIGUSR1, SIGUSR2
    From any process to any other
    Without a predefined semantic
    Freely usable by programmers
Sending a signal to another process
  int kill(pid_t pid, int sig)
The programmer can decide what to do when a signal is received
  Use the default behavior
  Ignore it
  Execute a user function
Detecting an interrupted write
if (write(fd, buff, SIZE) < 0) {
    switch (errno) {    /* errno is declared in <errno.h> */
    case EINTR:
        /* a signal handler ran before any data was written */
        warning("Interrupted write\n");
        break;
    }
}…
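Detecting EINTR is usually followed by retrying the call. One common pattern, sketched here with a made-up helper name write_all, is to loop until every byte has been written and to treat EINTR as "try again" rather than as a failure.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Write all len bytes of buf to fd, retrying when a signal interrupts
 * the call. Returns 0 on success, -1 on any other error (errno is left
 * as set by write). */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            return -1;
        }
        buf += n;                   /* skip what was already written */
        len -= (size_t)n;
    }
    return 0;
}
```

If the handler is installed through sigaction with SA_RESTART, many system calls are restarted automatically, but a loop like this still handles short writes.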
A signal is a message which can be sent to a running process.
For example, to tell the Internet Daemon (inetd) to re-read its configuration file, it should be sent a SIGHUP signal.
For example, if the current process ID (PID) of inetd is 1234, you would type:
kill -SIGHUP 1234
A signal is "an event, message, or data structure transmitted between computational processes" (from Wikipedia).
In this case signal means 'message'. So it's sending a message to a process which can tell the process to do various things.
A unix signal is a kind of message that can be sent to and from unix processes. They can do things like tell a process to quit (SIGKILL) or that a process had an invalid memory reference (SIGSEGV) or that the process was killed by the user hitting control-c (SIGINT).
from a *nix command line type in:
man signal
that will show you all the signals available.
A signal is basically an interrupt that tells the process that a particular event has happened.
Signals are generally sent by the kernel, but a process can also send a signal to another process (subject to permissions) by using the kill and killall commands, and a process can send a signal to itself by using raise.
Major uses of signals:
To handle the interrupt.
Process synchronization.
A signal is an interrupt used to inform a process that a particular event has happened.
A signal can be sent by the kernel to a running process, or by one process to another.
In bash, the kill and killall commands are used to send signals.

Resources