I use MPICH2. When I launch processes with mpiexec, the failure of one process will crash all of the other processes. How can I avoid this?
In MPICH, there is a flag called -disable-auto-cleanup that prevents the process manager from automatically cleaning up all processes when a single process fails.
However, MPI itself does not have much support for fault tolerance and this is something that the Fault Tolerance Working Group is working on adding in a future version of the MPI Standard.
For now, the best you can do is change the default MPI error handler away from MPI_ERRORS_ARE_FATAL, which causes all processes to abort, to something like MPI_ERRORS_RETURN, which returns the error code to the application and lets it react. Even then, you're unlikely to be able to communicate after a failure has occurred, especially if you are using collective communication.
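For illustration, a minimal sketch of that combination: the flag goes on the mpiexec command line, and the error handler is set in code. The program name and the recovery step are placeholders; what you actually do after a failed call is application-specific.

```c++
// Launch with: mpiexec -disable-auto-cleanup -n 4 ./app
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Return error codes to the caller instead of aborting every rank.
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = rank;
    // With MPI_ERRORS_RETURN, a failed peer surfaces here as a nonzero
    // return code instead of killing this process outright.
    int rc = MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        std::fprintf(stderr, "rank %d: broadcast failed (rc=%d)\n", rank, rc);
        // application-specific recovery or graceful shutdown goes here
    }

    MPI_Finalize();
    return 0;
}
```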
I want my program to monitor some processes that it has started. These are the most important requirements I know of:
Record the exit status of the child processes (unless they exit after my program has exited).
Record the stderr and stdout output. Ideally within a few seconds of its being written, but that is not a hard requirement: it would probably be enough to read it only when a user requests it.
Sometimes, the child processes will outlive my program. Other times, they will not. It's important that my program doesn't make the child processes more likely to exit in a way that might inconvenience my users -- for example, Unix signals sent to my program shouldn't kill the child processes as a side-effect. If the parent exits, the children should continue running unaffected.
Ideally, the parent would track forks of the children so they can be monitored and perhaps signalled. This is not a hard requirement, though.
The scheme needs to work on both Linux and OS X.
My resolution is to do all of the standard operations required to daemonize a process, except for the second fork. Before doing that, I redirect the child's output to a temporary log file, then monitor that file using inotify (on Linux) or kqueue (on OS X).
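A minimal sketch of that spawning scheme, assuming plain POSIX calls (error handling is elided, and the function name is just for illustration):

```c++
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

// Spawn `path` in its own session, with stdout/stderr redirected to a
// log file that the parent can then watch with inotify or kqueue.
pid_t spawn_monitored(const char *path, char *const argv[],
                      const char *logfile) {
    pid_t pid = fork();
    if (pid != 0)
        return pid;              // parent: keep the child's pid for waitpid()

    setsid();                    // new session: detach from the controlling tty
    // Deliberately no second fork, so the child remains a session leader
    // and could reacquire a controlling tty if it opens a terminal device.

    int fd = open(logfile, O_WRONLY | O_CREAT | O_APPEND, 0644);
    dup2(fd, STDOUT_FILENO);     // stdout -> log file
    dup2(fd, STDERR_FILENO);     // stderr -> log file
    close(fd);

    execv(path, argv);
    _exit(127);                  // exec failed
}
```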
As far as I know, the only cost of omitting the second fork, with respect to my stability requirement (3), is that the child process could acquire a controlling tty.
Is this solution a good one for these requirements? What bad things could happen to the child processes that I haven't considered?
I need to implement (in Qt) some solution to communicate between two programs running on a Linux machine. One program is Worker, and the second is Watchdog. Basically I need Watchdog to periodically check on Worker and, in case something is wrong (no process, or a hang where Worker gives no answer), kill Worker (if present) and start it again.
Worker runs as a daemon, so I think starting it from a Unix init script (/etc/init.d/worker) would be appropriate.
I can see two solutions:
Unix signals - both programs can send and receive, e.g., SIGUSR1
Shared memory
Which one to choose?
With signals, both programs will have to know each other's PID, probably by reading it from the filesystem (/var/run), which looks like a drawback.
With shared memory, all I need is a "key" that both programs have hardcoded, so there is no need to read PIDs from the filesystem. Since Watchdog starts first, it can create the shared memory segment, and Worker will only attach to it and perhaps update a timestamp value in it. However, to stop Worker (in case of a hang), Watchdog will still need Worker's PID to send it SIGKILL; maybe it can read that from shared memory too? Both concepts are new to me.
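To make the idea concrete, here is roughly the layout I have in mind, using Qt's QSharedMemory (the key string and the struct are just my guesses):

```c++
#include <QSharedMemory>
#include <QDateTime>
#include <unistd.h>

// Hypothetical record: Worker writes it, Watchdog reads it.
struct Heartbeat {
    qint64 pid;        // so the Watchdog knows whom to signal
    qint64 lastBeatMs; // time of the last update, msecs since epoch
};

static const char *kKey = "worker-heartbeat"; // hardcoded key, as suggested
// Watchdog, at startup: QSharedMemory shm(kKey); shm.create(sizeof(Heartbeat));

// Worker side: attach and refresh the timestamp, e.g. from a QTimer.
void workerBeat(QSharedMemory &shm) {
    if (!shm.isAttached() && !shm.attach())
        return;                    // Watchdog has not created the segment yet
    shm.lock();
    auto *hb = static_cast<Heartbeat *>(shm.data());
    hb->pid = getpid();
    hb->lastBeatMs = QDateTime::currentMSecsSinceEpoch();
    shm.unlock();
}

// Watchdog side: poll the segment and decide whether Worker is stuck.
bool workerLooksDead(QSharedMemory &shm, qint64 timeoutMs, qint64 *pidOut) {
    shm.lock();
    auto *hb = static_cast<Heartbeat *>(shm.data());
    *pidOut = hb->pid;             // PID read back from shared memory
    qint64 age = QDateTime::currentMSecsSinceEpoch() - hb->lastBeatMs;
    shm.unlock();
    return age > timeoutMs;        // stale heartbeat => kill and restart
}
```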
So what is the proper way to build a reliable Watchdog, or am I missing something?
I think this is the best solution available through Qt:
http://qt-project.org/doc/qt-4.8/qlocalsocket.html
http://qt-project.org/doc/qt-4.8/qlocalserver.html
The QLocalSocket class provides a local socket. On Windows this is a named pipe and on Unix this is a local domain socket.
http://qt-project.org/doc/qt-4.8/ipc-localfortuneserver.html
http://qt-project.org/doc/qt-4.8/ipc-localfortuneclient.html
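A minimal heartbeat sketch over QLocalServer/QLocalSocket (this uses Qt 5 connect syntax; the server name and the one-byte message format are made up):

```c++
#include <QLocalServer>
#include <QLocalSocket>
#include <QObject>

static const QString kName = QStringLiteral("worker-watchdog"); // example name

// Watchdog side: listen, and treat every incoming byte as a heartbeat.
void startWatchdogServer(QLocalServer &server) {
    QLocalServer::removeServer(kName);  // clear a stale socket file, if any
    server.listen(kName);
    QObject::connect(&server, &QLocalServer::newConnection, [&server]() {
        QLocalSocket *sock = server.nextPendingConnection();
        QObject::connect(sock, &QLocalSocket::readyRead, [sock]() {
            sock->readAll();  // heartbeat arrived; reset a deadline timer here
        });
    });
}

// Worker side: connect once, then write a byte periodically from a QTimer.
void sendHeartbeat(QLocalSocket &sock) {
    if (sock.state() != QLocalSocket::ConnectedState)
        sock.connectToServer(kName);
    sock.write("\x01", 1);
}
```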
Hope that helps.
We're preparing an application using Qt that has a main process that controls the GUI and spawns processes that do the actual data processing. Messages are exchanged between the main process and the data-processing processes using the Qt mechanisms and the stdin/stdout pipes.
Now, in the event that the GUI crashes, the other processes keep running. What we'd like to be able to do is to, when a new GUI starts, reconnect to these processes as before. Anyone know if this is possible, and if so, how to achieve it?
This is possible if you are using a named pipe for communicating with the process. The anonymous stdin/stdout pipes are closed when the process they belong to is terminated, so they cannot be reconnected.
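A sketch of the named-pipe variant using POSIX mkfifo (the path is made up): because the FIFO has a name in the filesystem, a restarted GUI can simply reopen it.

```c++
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Create (if needed) and open a named pipe. Unlike an anonymous
// stdin/stdout pipe, it survives either endpoint crashing, so a new
// GUI process can reopen the same path and resume communication.
int openControlPipe(bool forReading) {
    const char *path = "/tmp/myapp-control"; // example path
    mkfifo(path, 0600);                      // no-op if it already exists
    // Note: open() on a FIFO blocks until the other end is opened too.
    return open(path, forReading ? O_RDONLY : O_WRONLY);
}
```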
You might want to investigate shared memory for the communication between processes. I seem to recall it recovering from a very similar situation at a previous job.
Another possibility, if your platform supports it, is to use D-Bus for the communication between processes. In that case, neither process has to be running at the same time as the other, but each will get the appropriate messages when it is running.
I found a bunch of scripts in the project I have been newly assigned to that serve as the "shutdown" scripts. They just do some basic searches and run the Unix kill command. Is there any reason they shouldn't shut down the process this way? Does this ensure that dynamically allocated memory is returned properly? Are there any other negative effects? I've operated under the intuition that this is a last-resort way of terminating a process.
The kill command sends a signal to a Unix process. That signal defaults to SIGTERM, which is a polite request for the program to exit.
When a process exits for any reason, the Unix OS does clean up its memory allocations, file handles and other resources. The only resources that do not get cleaned up are those that are supposed to be shared, like the contents of files and of shared memory (like System V IPC).
Many programs do not need to do any special cleanup on exit and use the default SIGTERM behavior, which is to let the OS stop the process.
If a program does need special behavior, it can install a signal handler, which runs a cleanup function when the signal arrives.
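For example, a minimal SIGTERM handler; the cleanup itself is application-specific:

```c++
#include <csignal>

// Keep the handler tiny: set a flag, let the main loop exit cleanly.
volatile std::sig_atomic_t g_shutdown = 0;

extern "C" void onTerm(int) { g_shutdown = 1; }

int main() {
    std::signal(SIGTERM, onTerm);
    while (!g_shutdown) {
        // ... do the program's normal work ...
    }
    // flush buffers, remove pid files, etc., then return normally
    return 0;
}
```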
Now the SIGKILL signal, which is number 9, is evil, but also necessary. This signal never gets to the process itself; the OS simply stops the process. It should only be used when really, really necessary. It often becomes necessary for multithreaded programs that are stuck in a deadlock, or for programs that installed a TERM signal handler but screwed up their own exit process.
kill is a polite request for the program to end: it sends a SIGTERM. The program can then clean up its memory, close its handles and observe other such niceties.
kill -9 tells the operating system to grab the process by the balls and throw it the hell out of the bar. Obviously it is not concerned with niceties -- although it does reclaim all the memory, as it's the operating system's responsibility to keep track of that. But because it's a forceful shutdown, you may have problems when trying to run the program again (it doesn't clean up .pid files, for example).
See also [kill (Unix) on Wikipedia](http://en.wikipedia.org/wiki/Kill_(Unix)).
Each process runs in its own protected address space, and when the process ends (whether it exits voluntarily or is killed by an external signal) that address space is fully reclaimed. So yes, all of its memory is released properly.
Depending on the process, it may or may not cause other problems next time you try to run it. For example, it may have some files open and leave them in an inconsistent state if it's killed unexpectedly. (The files will be closed automatically, but it could be in the middle of writing some application data, for example, and the files may contain incomplete/inconsistent data if interrupted.)
Typically when the system is shutting down, all processes will be sent signal 15 (SIGTERM), at which point they can perform whatever cleanup/shutdown actions they need. Then a short time later, they'll get signal 9 (SIGKILL), which kills them immediately, without giving them any chance to react. This gives all processes a chance to clean up after themselves, and then forcefully kills any that aren't responding promptly.
kill -9 is the last resort, not kill.
Yes, memory is reclaimed (this is the OS's responsibility).
The programs can respond to the signal however they want; it's up to the particular program to do "the right thing".
kill by default sends a terminate signal (SIGTERM), which allows the process to exit gracefully. If the process does not seem to exit in a timely fashion, some scripts will then fall back on kill -9, which forces an exit, 'ready or not'.
In all cases OS-managed things such as dynamic memory will be returned, files closed, etc. But application-level things may not be tidied up on a -9 kill.
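That escalation pattern, sketched in C++ (the grace period and function name are arbitrary):

```c++
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Ask a child process politely with SIGTERM, then force it with SIGKILL
// if it hasn't exited within the grace period.
void stopChild(pid_t pid, int graceSeconds) {
    kill(pid, SIGTERM);                        // polite request to exit
    for (int i = 0; i < graceSeconds * 10; ++i) {
        if (waitpid(pid, nullptr, WNOHANG) == pid)
            return;                            // exited gracefully
        usleep(100 * 1000);                    // re-check every 100 ms
    }
    kill(pid, SIGKILL);                        // 'ready or not'
    waitpid(pid, nullptr, 0);                  // reap the zombie
}
```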
kill merely sends a signal to the process. The process can trap signals (except for signal 9) and run code to perform shutdown. An app's shutdown is supposed to be brief, but it may not be instantaneous.
In any case, once the process exits, the operating system will reclaim dynamically allocated memory, close open file descriptors, and other resources.
There could be some resources that survive, for example if the app held shared memory or sockets that are also held by other (still living) processes.