I tried the quantum chemistry program GAMESS with mpirun and it ran well yesterday. However, when I tried another calculation by the same procedure, it failed with the following messages. How can I fix it? I confirmed that there was no MPI process still running, and I also cleared the caches.
```
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with
errorcode 911.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
   the job did. This can cause a job to hang indefinitely while it
   waits for all processes to call "init". By rule, if one process
   calls "init", then ALL processes must call "init" prior to
   termination.

2. this process called "init", but exited without calling "finalize".
   By rule, all processes that call "init" MUST call "finalize" prior
   to exiting or it will be considered an "abnormal termination".

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
```
Something in the application called MPI_ABORT on rank 0. You'll have to look at the code to figure out why it was called, but my guess is that there was some bad input. I don't know much about GAMESS though. You might try asking the GAMESS people directly. They have a website (http://www.msg.ameslab.gov/gamess/) which includes a way to contact them.
Related
I use MPICH2. When I launch processes with mpiexec, the failure of one process will crash all other processes. How to avoid this?
In MPICH, there is a flag called -disable-auto-cleanup which will prevent the process manager from automatically cleaning up all processes when a single process fails.
However, MPI itself does not have much support for fault tolerance and this is something that the Fault Tolerance Working Group is working on adding in a future version of the MPI Standard.
For now, the best you can do is change the default MPI Error Handler away from MPI_ERRORS_ARE_FATAL, which causes all processes to abort, to something else like MPI_ERRORS_RETURN which would return the error code to the application and allow it to do something else. However, you're not likely to be able to communicate anymore after a failure has occurred, especially if you are trying to use collective communication.
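For illustration, here is a minimal C sketch of that last suggestion. The deliberately invalid send is just a made-up way to trigger an error when the program is run on a single process:

```c
/* Minimal sketch: make MPI errors return to the caller instead of
 * aborting the whole job. Build with an MPI wrapper, e.g. mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Swap the default MPI_ERRORS_ARE_FATAL handler for
     * MPI_ERRORS_RETURN on MPI_COMM_WORLD. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Deliberately bad call: rank 1 does not exist in a 1-process run,
     * so this returns an error code instead of killing the job. */
    int payload = 42;
    int err = MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "rank %d: MPI_Send failed: %s\n", rank, msg);
        /* The application can now decide how to recover. */
    }

    MPI_Finalize();
    return 0;
}
```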
I can't seem to find this specific implementation detail anywhere, or even a pointer to where in an OS book to look.
Basically, the main thread schedules an async task (to be run later) on itself. So... when does it run?
Does it wait for the run loop to finish? Or does it just randomly interrupt the run loop in the middle of any function?
I understand the registers will be the same (unless it's a separate thread), but I don't really understand what happens to the instruction pointer and the stack, if anything happens to them at all.
Thank you
In C# the task is scheduled to run on the current SynchronizationContext. The context has a queue of tasks which it schedules to run on the threads it is associated with; in a GUI app there is a single UI thread, so the task is scheduled to run there.
The GUI thread is not interrupted; it executes the task once it has finished all the other tasks that precede it in the queue.
The threads of a process all share the same address space, but not the same CPU registers. How thread scheduling is done depends on the programming language and the O/S. Usually there are explicit scheduling points, such as returning from a system call, blocking while awaiting I/O completion, or between p-code instructions for interpreted languages. Some O/S implementations also reschedule based on how long a thread has run, for time-based scheduling. Many languages include a function that explicitly offers the CPU to any other thread or process by transferring control to the scheduler component of the O/S.
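As a tiny illustration of such an explicit scheduling point, POSIX exposes sched_yield(), which hands the CPU to the scheduler so any other ready thread or process can run. This is only a sketch of the concept, not how any particular language runtime implements it:

```c
/* Sketch: a voluntary scheduling point. sched_yield() asks the O/S
 * scheduler to run some other ready thread or process, if any. */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 3; i++) {
        printf("doing a chunk of work (%d)\n", i);
        sched_yield();   /* explicitly offer the CPU to others */
    }
    return 0;
}
```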
The act of switching from one thread or process to another is known as a context switch and is carefully tuned code because this is often done thousands of times per second. This can make the code difficult to follow.
The best explanation of this I've ever seen is Bach's classic The Design of the UNIX Operating System (http://www.amazon.com/The-Design-UNIX-Operating-System/dp/0132017997).
```
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 19175 on
node mosura15 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
```
I am running a simulation, and the MPI command produced the above error. What is the reason behind this, and how can I resolve it?
It looks like the 3rd instance of your program (rank 2) crashed and didn't call MPI_Finalize() to shut down, so mpirun closed all the other copies of the program as well. Is something causing that particular node to crash, or is it a different node each time?
The message is pretty clear; rank 2 called MPI_Abort(), which stops the whole program. You should be able to look in your code and find out under what error conditions the program calls MPI_Abort().
I would like to know if there is a way for an MPI process to send a kill signal to another MPI process.
Put differently, is there a way to exit from an MPI environment gracefully when one of the processes is still active (i.e., without the error message that mpi_abort() prints)?
Thanks
No, this is not possible within an MPI application using the MPI library.
Individual processes are not aware of the location of the other processes, nor of their process IDs, and there is nothing in the MPI spec that provides the kill you are looking for.
If you were to do this manually, you'd need an MPI_Alltoall to exchange process IDs and hostnames across the system, and then you would need to spawn ssh/rsh to reach the required node when you wanted to kill something (as sketched below). All in all, it's neither portable nor clean.
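A hedged sketch of what that manual approach would look like. Note I've used MPI_Allgather rather than MPI_Alltoall, since every rank contributes one record and needs everyone else's, and the ssh step is left as a comment because it is exactly the non-portable part:

```c
/* Sketch (not recommended): gather every rank's PID and hostname so
 * that one rank could later shell out to "ssh <host> kill <pid>". */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int pid = (int)getpid();
    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    int *pids = malloc(size * sizeof(int));
    char *hosts = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Allgather(&pid, 1, MPI_INT, pids, 1, MPI_INT, MPI_COMM_WORLD);
    MPI_Allgather(host, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  hosts, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("rank %d: pid %d on host %s\n", i, pids[i],
                   hosts + (size_t)i * MPI_MAX_PROCESSOR_NAME);
        /* Killing rank i would mean spawning something like
         * "ssh <host> kill <pid>" -- the non-portable, unclean part. */
    }

    free(pids);
    free(hosts);
    MPI_Finalize();
    return 0;
}
```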
MPI_Abort is the right way to do what you are trying to achieve. From the Open MPI manual:
"This routine makes a "best attempt" to abort all tasks in the group of comm." (ie. MPI_Abort(MPI_COMM_WORLD, -1) is what you need.
Any output during MPI_Abort would be machine specific - so you may, or may not, receive the error message you mention.
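For concreteness, a minimal sketch of the usual pattern; the failing-to-open-an-input-file condition is hypothetical:

```c
/* Minimal sketch: one rank hits a fatal error and takes the whole
 * job down with MPI_Abort, as recommended above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical fatal condition on rank 2 only. */
    if (rank == 2) {
        FILE *f = fopen("missing-input.dat", "r");
        if (f == NULL) {
            fprintf(stderr, "rank %d: cannot open input, aborting job\n", rank);
            MPI_Abort(MPI_COMM_WORLD, -1);  /* best-effort kill of all ranks */
        }
        fclose(f);
    }

    /* ... normal work would go here ... */

    MPI_Finalize();
    return 0;
}
```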
I found a bunch of scripts in the project I have been newly assigned to that are the "shutdown" scripts. They just do some basic searches and run the Unix kill command. Is there any reason they shouldn't shut down the process this way? Does this ensure that dynamically allocated memory will be returned properly? Are there any other negative effects? I've operated under the intuition that this is a last-resort way of terminating a process.
The kill command sends a signal to a Unix process. That signal defaults to SIGTERM, which is a polite request for the program to exit.
When a process exits for any reason, the Unix OS does clean up its memory allocations, file handles and other resources. The only resources that do not get cleaned up are those that are supposed to be shared, like the contents of files and of shared memory (like System V IPC).
Many programs do not need to do any special cleanup on exit and use the default SIGTERM behavior, which is to let the OS stop the process.
If a program does need special behavior, it can install a signal handler, and it can then run a function to handle the signal.
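A minimal sketch of installing such a handler with POSIX sigaction(); the pid-file path is made up for illustration:

```c
/* Sketch: catch SIGTERM, set a flag, and do the real cleanup from
 * the main loop rather than inside the handler. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_term = 0;

static void on_term(int signo)
{
    (void)signo;
    got_term = 1;   /* only set a flag; handlers should stay minimal */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);

    while (!got_term) {
        /* ... normal work ... */
        sleep(1);
    }

    /* Cleanup the default SIGTERM action would have skipped,
     * e.g. removing a (hypothetical) pid file. */
    remove("/tmp/example.pid");
    return 0;
}
```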
Now the SIGKILL signal, which is number 9, is evil, but also necessary. This signal never gets to the process itself; the OS simply stops the process. It should only be used when really, really necessary. It often becomes necessary for multithreaded programs that get into deadlocks, or for programs that have installed a TERM signal handler but screwed up during their exit process.
kill sends a SIGTERM, a polite request for the program to end: the program can clean up its memory, close its handles, and observe other such niceties.
kill -9 tells the operating system to grab the process by the balls and throw it the hell out of the bar. Obviously it is not concerned with niceties, although it does reclaim all the memory, as it's the operating system's responsibility to keep track of that. But because it's a forceful shutdown you may have problems when trying to run the program again (it won't clean up .pid files, for example).
See also [kill (Unix) on Wikipedia](http://en.wikipedia.org/wiki/Kill_%28Unix%29).
Each process runs in its own protected address space, and when the process ends (whether it exits voluntarily or is killed by an external signal) that address space is fully reclaimed. So yes, all of its memory is released properly.
Depending on the process, it may or may not cause other problems next time you try to run it. For example, it may have some files open and leave them in an inconsistent state if it's killed unexpectedly. (The files will be closed automatically, but it could be in the middle of writing some application data, for example, and the files may contain incomplete/inconsistent data if interrupted.)
Typically when the system is shutting down, all processes will be sent signal 15 (SIGTERM), at which they can perform whatever cleanup/shutdown actions they need to do. Then a short time later, they'll get signal 9 (SIGKILL), which immediately kills them, without giving them any chance to react in any way. This gives all processes a chance to clean up for themselves, and then forcefully kills any processes that aren't responding promptly.
kill -9 is the last resort, not kill.
Yes, memory is reclaimed (this is the OS's responsibility).
Programs can respond to the signal however they want; it's up to the particular program to do "the right thing".
kill by default sends a terminate signal, which allows the process to exit gracefully. If the process does not seem to exit in a timely fashion, some scripts will then fall back on kill -9, which forces an exit, 'ready or not' (a sketch of this pattern follows below).
In all cases, OS-managed things such as dynamic memory will be returned, files closed, etc. But application-level things may not be tidied up on a -9 kill.
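A sketch of that fallback pattern in C, using the kill(2) system call that the kill command wraps; the target PID and grace period are made up:

```c
/* Sketch: polite SIGTERM first, wait a grace period, then fall back
 * to SIGKILL, mirroring what such shutdown scripts do. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int terminate_gracefully(pid_t pid, int grace_seconds)
{
    if (kill(pid, SIGTERM) == -1)              /* polite request to exit */
        return (errno == ESRCH) ? 0 : -1;      /* already gone, or real error */

    for (int i = 0; i < grace_seconds; i++) {
        sleep(1);
        if (kill(pid, 0) == -1 && errno == ESRCH)
            return 0;                          /* it exited on its own */
    }

    return kill(pid, SIGKILL);                 /* forceful fallback */
}

int main(void)
{
    pid_t pid = 12345;                         /* hypothetical target PID */
    if (terminate_gracefully(pid, 10) != 0)
        perror("terminate_gracefully");
    return 0;
}
```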
kill merely sends a signal to the process. The process can trap signals (except for signal 9) and run code to perform shutdown. An app's shutdown is supposed to be brief, but it may not be instantaneous.
In any case, once the process exits, the operating system will reclaim dynamically allocated memory, close open file descriptors, and other resources.
There could be some resources that survive, for example if the app held shared memory or sockets that are also held by other (still living) processes.