I have a multithreaded process running on RHEL6.x (64bit).
Most of the time, when I bring the process up, it hangs and some of its threads crash. Some threads wait for shared memory to be created by other threads, and I can see that not all of it gets created.
But when I run the process under strace, it does not hang and works just fine (all of the memory that is supposed to be created gets created). Even interrupting strace after the memory has been created leaves the process running fine for good.
I have read this:
strace fixes hung process
which did give me an idea, but I am still unclear on this, as the RHEL version used there is not mentioned.
Another data point: changing the kernel to a (compatible) Fedora kernel did not reproduce the issue.
So I would just like to know: how exactly does strace affect a process? (Or is it just the stack moving back into the kernel, as pointed out in the link?)
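For concreteness, the two usual ways to bring strace into the picture look like this (myproc is a placeholder for the actual binary):

strace -f -tt -o /tmp/trace.out ./myproc          # launch under strace, following threads
strace -f -p "$(pgrep myproc)" -o /tmp/trace.out  # or attach to a running (hung) instance
# detach with Ctrl-C; the traced process keeps running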
Related
I'm facing a weird issue regarding sending signal 9 (SIGKILL) to the init process (PID 1).
As you may know, SIGKILL can't be ignored via signal handlers. When I tried sending SIGKILL to init, I noticed that nothing was happening; init would not get terminated. Trying to figure out this behaviour, I decided to attach to the init process with strace to see more clearly what was happening. Now comes the weird part. If I'm "looking" at the init process with strace and send it SIGKILL, the system crashes.
My question is: why is this happening? Why does the system crash when I am looking at the process with strace, and why does it not crash when I'm not? As I said, in both cases I send SIGKILL to init. Tested on CentOS 6.5, Debian 7 and Arch.
Thanks!
The Linux kernel deliberately forces a system crash if init terminates (see http://lxr.free-electrons.com/source/kernel/exit.c?v=3.12#L501 and particularly the call to panic therein). Therefore, as a safeguard, the kernel will not deliver any fatal signal to init, and SIGKILL is not excepted (see http://lxr.free-electrons.com/ident?v=3.12&i=SIGNAL_UNKILLABLE) (however, the code flow is convoluted enough that I'm not sure, but I suspect a kernel-generated SIGSEGV or similar would go through).
Applying ptrace(2) (the system call that strace uses) to process 1 apparently disables this protection. This could be said to be a bug in the kernel. I am insufficiently skilled at digging around in the code to find this bug.
I do not know if other Unix variants apply the same crash-on-exit semantics or signal protection to init. It would be reasonable to have the OS perform a clean shutdown or reboot, rather than a panic, if init terminates (at least, if it does so by calling _exit) but as far as I know, all modern Unix variants have a dedicated system call to request this, instead (reboot(2)).
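For completeness, the experiment in the question boils down to these two commands (run as root; do not try this on a machine you care about, since on the affected kernels it panics the system):

strace -p 1 &   # attach a tracer to init, which strace does via ptrace(2)
kill -9 1       # with the tracer attached, the SIGKILL gets through, init dies, and the kernel panics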
We have a process (written in C++/managed) which receives network data via TCP/IP.
After the process has been running for a while under network load, the network seems to enter a frozen state and the process stops receiving data, while other processes on the system that use the same NIC continue to operate normally.
The process comes out of this frozen state by itself after several minutes.
Any idea what is happening?
Is there any counter I can track to see whether my process is hitting some limit?
It is going to be very difficult to answer specifically
-- without knowing what exactly your process/application is about, whether it is a network chat application, a file server/client, or something else,
-- without other details about how your process is implemented and what libraries it uses, where relevant to the problem.
Also, you haven't mentioned what OS and environment you are running this process under, so there is very little anyone can do to help. It could be anything: a busy-wait loop in your code, locking problems if it is multi-threaded code, and so on.
Nonetheless, here are some options to check.
If it is Linux, try the commands below to debug and monitor the behaviour of the process and see what the problem could be (a worked example follows the list):
top
Check top to see how much CPU and memory your process is using, and whether any of the values (CPU usage in particular) are abnormally high.
pstack
This prints the stack frames that the process is executing at the time of the problem.
netstat
Run this with the necessary options (TCP/UDP) to check the state of the network sockets opened by your process.
gcore -s -c
This forces your process to dump core while the problem is happening; you can then analyze that core file using gdb.
gdb
Then use the where command at the gdb prompt to get a full backtrace of the process (the function it was executing last, and the chain of calls that led there).
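A rough worked example of those last steps (the process name, paths, and core-file prefix are placeholders; this assumes the gdb-provided gcore on Linux, whose options may differ from the gcore variant quoted above):

pid=$(pgrep myproc)                 # assumes a single instance of your process
pstack "$pid"                       # quick snapshot of all thread stacks
gcore -o /tmp/myproc.core "$pid"    # dump a core without killing the process
gdb /path/to/myproc "/tmp/myproc.core.$pid"
# at the (gdb) prompt:
#   where                 -- backtrace of the current thread
#   thread apply all bt   -- backtraces of every thread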
I executed a Perl script in the background using the following command:
nohup perl myPerlSCript.pl >debug_log &
After a few minutes I got the status:
[1]+ Stopped
I wasn't expecting it to stop, nor do I know what stopped it. How can I debug this and find out why it stopped? I am mainly interested in the Unix commands to use for debugging.
There are several ways a process running in the background can be stopped. All of them involve one of these signals:
SIGSTOP
SIGTSTP
SIGTTOU
SIGTTIN
SIGSTOP is severe. It's unblockable, unignorable, unhandlable. It stops the process as surely as SIGKILL would kill it. The others can be handled by the background process to prevent stopping.
Some of the ways these signals get generated:
A signal was sent explicitly by another process using kill(2), or by the process to itself using raise(3) or kill(2).
The process attempted to write to the terminal while the terminal option tostop is enabled (see the output of stty -a). This generates SIGTTOU (there is a quick demo after this list).
The process attempted to change the terminal modes with tcsetattr(3) or an equivalent ioctl. (These are the same modes shown by stty.) This generates SIGTTOU regardless of the current state of the tostop flag.
The process attempted to read from the terminal. This generates SIGTTIN.
This list is probably very incomplete.
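The SIGTTOU case is easy to demonstrate in an interactive shell (bash assumed; the job output shown is illustrative):

stty tostop                # stop background jobs that write to the terminal
(sleep 2; echo hello) &    # a background job that writes to the tty
# about two seconds later the shell reports something like:
#   [1]+  Stopped   ( sleep 2; echo hello )
stty -tostop               # restore the default behaviour
fg                         # resume the job in the foreground so it can print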
Are you using tcsh by any chance? tcsh comes with a built-in nohup command that I've had lots of problems with before, showing the exact behavior you're seeing.
If that is the case, try using /usr/bin/nohup directly.
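Something like this; redirecting stdin from /dev/null is my own addition, and it also rules out the SIGTTIN stop described in the other answer:

/usr/bin/nohup perl myPerlSCript.pl </dev/null >debug_log 2>&1 &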
We have an application which is deployed to a WebSphere server running on UNIX, and we are experiencing two issues:
a system hang which recovers after a few minutes - to investigate, we will need the thread dump (javacore).
a system hang which does not recover and requires WebSphere to be restarted - to investigate, we will need the thread dump and heap dump.
The problem is: when a system hang occurs, we do not know whether it is issue 1 or 2.
Ideally we would like to manually generate the thread dump first, and wait to see if the system recovers. If it does not, then we generate the thread dump and the heap dump, before restarting WebSphere.
I know about the kill -3 (or kill -QUIT) command. The command would generate thread dump only (if the parameter IBM_HEAPDUMP=false), or thread dump and heap dump (if IBM_HEAPDUMP=true). However, IBM_HEAPDUMP has to be set before WebSphere is started and cannot be changed while WebSphere is running.
Is my understanding correct, regarding the IBM_HEAPDUMP parameter and the kill -3 command?
Also, is it possible to get the dumps in the way I described? (i.e., when generating JVM diagnostics, choose on the fly whether or not to generate a heap dump)
Your understanding is consistent with everything I've read.
However, I believe you can accomplish what you want by using wsadmin scripting. This article describes how to force javacores and heapdumps on a Windows platform where kill -3 is not available, but the same commands can be run on any WebSphere system.
From within wsadmin or a wsadmin script, execute:
set jvm [$AdminControl completeObjectName type=JVM,process=server1,*]
$AdminControl invoke $jvm generateHeapDump
$AdminControl invoke $jvm dumpThreads
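If you save those three lines in a script file (the name forceDumps.jacl here is made up, and process=server1 should match your server), you can run them against a live server like this:

cd /opt/IBM/WebSphere/AppServer/bin   # placeholder path for your install
./wsadmin.sh -lang jacl -f forceDumps.jacl

Since dumpThreads and generateHeapDump are separate invocations, you can trigger the javacore first, wait to see whether the hang recovers, and request the heap dump afterwards only if it does not, which matches the workflow you described.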
I found a bunch of scripts in the project I have been newly assigned to that are the "shutdown" scripts. They just do some basic searches and run the Unix kill command. Is there any reason they shouldn't shut down the process this way? Does this ensure that dynamically allocated memory is returned properly? Are there any other negative effects? I've operated under the intuition that this is a last-resort way of terminating a process.
The kill command sends a signal to a Unix process. That signal defaults to SIGTERM, which is a polite request for the program to exit.
When a process exits for any reason, the Unix OS does clean up its memory allocations, file handles and other resources. The only resources that do not get cleaned up are those that are supposed to be shared, like the contents of files and of shared memory (like System V IPC).
Many programs do not need to do any special cleanup on exit and use the default SIGTERM behavior, which is to let the OS stop the process.
If a program does need special cleanup behavior, it can install a signal handler for SIGTERM and run its shutdown code from there (a minimal sketch follows).
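A minimal sketch of that idea as a shell script (the pid-file path is made up):

#!/bin/sh
# On SIGTERM, remove our (hypothetical) pid file and exit cleanly
cleanup() {
    rm -f /var/run/myapp.pid
    exit 0
}
trap cleanup TERM

echo $$ > /var/run/myapp.pid
while :; do
    sleep 1 &      # stand-in for real work; backgrounding it and
    wait $!        # waiting lets the trap run as soon as the signal arrives
done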
Now the SIGKILL signal, which is number 9, is evil, but also necessary. This signal never gets to the process itself; the OS simply stops the process. It should only be used when really, really necessary. It often becomes necessary with multithreaded programs that get into deadlocks, or programs that have installed a TERM signal handler but screwed up during their exit process.
kill sends SIGTERM, a polite request for the program to end; the program can then clean up its memory, close its handles, and take care of other such niceties.
kill -9 tells the operating system to grab the process by the balls and throw it the hell out of the bar. Obviously it is not concerned with niceties, although it does reclaim all the memory, as it's the operating system's responsibility to keep track of that. But because it's a forceful shutdown, you may have problems when trying to run the program again (it does not clean up .pid files, for example).
See also: http://en.wikipedia.org/wiki/Kill_(Unix)
Each process runs in its own protected address space, and when the process ends (whether it exits voluntarily or is killed by an external signal) that address space is fully reclaimed. So yes, all of its memory is released properly.
Depending on the process, it may or may not cause other problems next time you try to run it. For example, it may have some files open and leave them in an inconsistent state if it's killed unexpectedly. (The files will be closed automatically, but it could be in the middle of writing some application data, for example, and the files may contain incomplete/inconsistent data if interrupted.)
Typically when the system is shutting down, all processes will be sent signal 15 (SIGTERM), at which point they can perform whatever cleanup/shutdown actions they need to do. Then a short time later, they'll get signal 9 (SIGKILL), which immediately kills them, without giving them any chance to react. This gives all processes a chance to clean up for themselves, and then forcefully kills any processes that aren't responding promptly.
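A shutdown script can use the same pattern (the five-second grace period and the $pid variable are placeholders):

kill -TERM "$pid"                    # polite request: let the process clean up
sleep 5                              # arbitrary grace period
if kill -0 "$pid" 2>/dev/null; then  # is it still alive?
    kill -KILL "$pid"                # force it out, ready or not
fi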
kill -9 is the last resort, not kill.
Yes, memory is reclaimed; that is the OS's responsibility.
Programs can respond to the signal however they want; it's up to the particular program to do "the right thing".
kill by default sends a terminate signal, which allows the process to exit gracefully. If the process does not exit in a timely fashion, some scripts then fall back on kill -9, which forces an exit, 'ready or not'.
In all cases, OS-managed things such as dynamic memory will be returned, files closed, and so on. But application-level things may not be tidied up by a -9 kill.
kill merely sends a signal to the process. The process can trap signals (except SIGKILL and SIGSTOP) and run code to perform its shutdown. An app's shutdown is supposed to be brief, but it may not be instantaneous.
In any case, once the process exits, the operating system will reclaim dynamically allocated memory, close open file descriptors, and other resources.
There could be some resources that survive, for example if the app held shared memory or sockets that are also held by other (still living) processes.
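For the System V shared-memory case specifically, you can inspect, and if a segment is truly orphaned, remove, such leftovers by hand:

ipcs -m            # list System V shared memory segments and their owners
ipcrm -m <shmid>   # remove a segment by id (only if nothing still needs it)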