I have the following question: can I use a signal handler for SIGCHLD and at specific places use waitpid(3) instead?
Here is my scenario: I start a daemon process that listens on a socket (at this point it's irrelevant if it's a TCP or a UNIX socket). Each time a client connects, the daemon forks a child to handle the request and the parent process keeps on accepting incoming connections. The child handling the request needs at some point to execute a command on the server; let's assume in our example that it needs to perform a copy like this:
cp -a /src/folder /dst/folder
In order to do so, the child forks a new process that uses execl(3) (or execve(2), etc.) to execute the copy command.
In order to control my code better, I would ideally wish to catch the exit status of the child executing the copy with waitpid(3). Moreover, since my daemon process is forking children to handle requests, I need to have a signal handler for SIGCHLD so as to prevent zombie processes from being created.
In my code, I set up a signal handler for SIGCHLD using signal(3) and daemonize my program by forking twice. I then listen on my socket for incoming connections and fork a process to handle each incoming request. That child-process forks a grand-child-process to perform the copy and tries to catch its exit status via waitpid(3).
What happens is that SIGCHLD is caught by my handler when a grand-child-process dies, before waitpid(3) takes action and waitpid(3) returns -1 even though the grand-child-process exits with success.
My first thought was to add:
signal(SIGCHLD, SIG_DFL);
just before forking the child process to handle my connecting clients, without any success. Using SIG_IGN didn't work either.
Is there a suggestion on how to make my scenario work?
Thank you all for your help in advance!
PS. If you need code, I'll post it, but due to its size I decided to do so only if necessary.
PS2. My intention is to use my code in FreeBSD, but my checks are performed in Linux.
EDIT [SOLVED]:
The problem I was facing is solved. The "unexpected" behaviour was caused by my waitpid(3) handling code which was buggy at some point.
Hence, the above method can indeed be used to allow for signal(3) and waitpid(3) coexistence in daemon-like programs.
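For anyone who wants to see the shape of it, here is a minimal sketch. It is in Python rather than C, but os.fork, os.waitpid and signal.signal map directly onto the same system calls. The names reap_children, run_command and handle_request are made up for illustration and are not from my real code. The assumption is that the daemon installs reap_children for SIGCHLD, and each per-connection child calls handle_request, which restores the default disposition before forking the grand-child. That way the explicit waitpid is the only thing that reaps it. (A handler that calls waitpid(-1) in the same process would steal the status, and SIG_IGN makes the kernel reap children automatically; in both cases waitpid on the specific pid fails with ECHILD.)

```python
import os
import signal

def reap_children(signum, frame):
    """SIGCHLD handler for the daemon: reap every finished child without blocking."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none has exited yet

def run_command(argv):
    """Fork, exec argv in the new process, and return its exit code."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    _pid, status = os.waitpid(pid, 0)  # wait for exactly this process
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # killed by a signal

def handle_request(argv):
    """Runs in the per-connection child: default SIGCHLD, then wait explicitly."""
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    return run_command(argv)
```

In the daemon itself you would call signal.signal(signal.SIGCHLD, reap_children) once before the accept loop, and something like handle_request(["cp", "-a", "/src/folder", "/dst/folder"]) inside each forked child.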
Thanks for your help, and I hope that this method helps someone wishing to accomplish such a thing!
Scenario: the server is in the middle of processing an HTTP request when it shuts down. The code may have reached any of several points. How are such cases typically handled? A typical example: some downstream HTTP calls had to be made as part of the incoming HTTP request. How can I find out whether those calls were made when the shutdown occurred? I assume that it's not possible to persist every action in the code flow. Suggestions and views are welcome.
There are two kinds of shutdowns to consider here.
There are graceful shutdowns: when the execution environment politely asks your process to stop (e.g. systemd sends a SIGTERM) and expects it to exit on its own. If your process doesn’t exit within a few seconds, the environment proceeds to kill the process in a more forceful way.
A typical way to handle a graceful shutdown is:
listen for the signal from the environment
when you receive the signal, stop accepting new requests...
...and then wait for all current requests to finish
Exactly how you do this depends on your platform/framework. For instance, Go’s standard net/http library provides a Server.Shutdown method.
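To make the three steps concrete without tying them to a framework, here is a rough Python sketch. RequestTracker and install_sigterm_handler are hypothetical names, not part of any library: the tracker counts in-flight requests, refuses new ones once draining starts, and blocks until the count reaches zero or a timeout expires.

```python
import signal
import threading

class RequestTracker:
    """Counts in-flight requests so a shutdown can wait for them to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0
        self._accepting = True

    def try_begin(self):
        """Call before handling a request; returns False once draining has started."""
        with self._cond:
            if not self._accepting:
                return False
            self._active += 1
            return True

    def finish(self):
        """Call when a request is done (success or failure)."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def drain(self, timeout=None):
        """Stop accepting, then wait for in-flight requests. True if all finished."""
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: self._active == 0, timeout)

def install_sigterm_handler(stop_event):
    """Step 1: turn SIGTERM into an event the main loop can check."""
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_event.set())
```

The main loop would check the event, stop calling accept, and then call drain with a timeout a bit shorter than the grace period the environment allows.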
In a typical system, most shutdowns will be graceful. For example, when you need to restart your process to deploy a new version of code, you do a graceful shutdown.
There can also be unexpected shutdowns: e.g. when you suddenly lose power or network connectivity (a disconnected server is usually as good as a dead one). Such faults are harder to deal with. There’s an entire body of research dedicated to making distributed systems robust to arbitrary faults. In the simple case, when your server only writes to a single database, you can open a transaction at the beginning of a request and commit it before returning the response. This will guarantee that either all the changes are saved to the database or none of them are. But if you call multiple downstream services as part of one upstream HTTP request, you need to coordinate them, for example, with a saga.
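As a small illustration of the single-database case, here is a sketch using Python's built-in sqlite3 module. The orders and audit tables and the save_order function are invented for the example, and the simulated failure is an exception rather than a real power loss. The point is only that writes inside one transaction are either all committed or all discarded.

```python
import sqlite3

def save_order(conn, order_id, fail_midway=False):
    """Apply all writes for one request inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
        if fail_midway:
            raise RuntimeError("simulated failure between two writes")
        conn.execute("INSERT INTO audit (order_id) VALUES (?)", (order_id,))

def open_database():
    """In-memory database with the two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE audit (order_id INTEGER)")
    return conn
```

If save_order fails midway, neither table contains the new row, so a retry after restart starts from a consistent state.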
For some applications, it may be OK to ignore unexpected shutdowns and simply deal with any inconsistencies manually if/when they arise. This depends on your application.
All port operations in Rebol 3 are asynchronous. The only way I can find to do synchronous communication is calling wait.
But the problem with calling wait in this case is that it will check events for all open ports (even if they are not in the port block passed to wait). Then they call their responding event handlers, but a read/write could be done in one of those event handlers. That could result in recursive calls to "wait".
How do I get around this?
Why don't you create a kind of "buffer" function to receive all messages from the asynchronous entries and process them as FIFO (first-in, first-out)?
This way you keep the asynchronous characteristics of your ports and still process the messages synchronously.
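I don't have a Rebol 3 example at hand, so here is the idea sketched in Python; on_port_event and drain are hypothetical names. The event handler only appends to a FIFO queue and returns, and a separate step processes the queued items in arrival order outside any handler, so no handler ever needs to wait on another port.

```python
import queue

events = queue.Queue()  # thread-safe FIFO buffer

def on_port_event(port_name, payload):
    """Event handler: record the event and return immediately."""
    events.put((port_name, payload))

def drain(handle):
    """Process everything queued so far, oldest first, outside any handler."""
    results = []
    while True:
        try:
            port_name, payload = events.get_nowait()
        except queue.Empty:
            return results
        results.append(handle(port_name, payload))
```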
Where only asynchronous events are available and you need a synchronous reply, start a timer (or sleep with a timeout). If the handler fires or the required objective is met before the timeout, report true; otherwise report false and, if it matters, make sure the pending event is cancelled or reset.
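A sketch of that timeout pattern in Python, not Rebol. It assumes a hypothetical start_request function that takes a reply callback and returns a callable that cancels the pending request.

```python
import threading

def wait_for_reply(start_request, timeout):
    """Turn a callback-style request into a blocking call with a timeout.

    Returns (True, value) if the reply arrived in time, else (False, None).
    """
    done = threading.Event()
    result = {}

    def on_reply(value):
        result["value"] = value
        done.set()

    cancel = start_request(on_reply)  # assumption: returns a cancel callable
    if done.wait(timeout):
        return True, result["value"]
    cancel()  # timed out: make sure the pending request is cancelled
    return False, None
```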
I think that there are 2 design problems (maybe intrinsic to the tools / solutions at hand).
Wait is doing too much - it will check events for all open ports. In a sound environment, waiting should be implemented only where it is needed: per device, per port, per socket... Creating unnecessary inter-dependencies between shared resources cannot end well - especially knowing that shared resources (even without inter-dependencies) can create a lot of problems.
The event handlers may do too much. An event handler should be as short as possible, and it should only handle the event. If it does more, then the handler is doing too much, especially if it involves other shared resources. In many situations, the handler just saves the data that would otherwise be lost, and an asynchronous job does the more complex things.
You can just use a lock. Communication1 can set some global lock state, e.g. with a variable (be sure that it's thread-safe): locked = true. Then Communication2 can wait until it's unlocked.
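loop do
  sleep 0.01            # poll every 10 ms
  break unless locked   # leave the loop once the other side has unlocked
end
locked = true           # take the lock
handle_communication()
locked = false          # release it so the next waiter can proceed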
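One caveat with a plain flag: checking it and setting it are two separate steps, so two waiters can both see it unlocked and proceed together. If your environment has a real lock primitive, use that instead. Here is a sketch of the same idea in Python with threading.Lock; exclusive_communication is a made-up name, and handle_communication stands for whatever your code does with the port.

```python
import threading

port_lock = threading.Lock()

def exclusive_communication(handle_communication, timeout=1.0):
    """Run handle_communication while holding the lock.

    Returns False if the lock could not be taken within `timeout` seconds.
    """
    if not port_lock.acquire(timeout=timeout):  # atomic test-and-set, no polling
        return False
    try:
        handle_communication()
        return True
    finally:
        port_lock.release()  # always release, even if the handler raises
```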
I'm using the OpenSSL library in a multi-threaded application.
For various reasons I'm using a blocking SSL connection, and there is a situation where the client hangs in the SSL_connect function.
I moved the connection procedure to another thread and created a timer. On timeout, the connection thread is terminated using:
QThread::terminate()
The thread is terminable, but on the next attempt to start the thread I get:
QThread::start: Thread termination error:
I checked the "max thread issue" and that's not the case.
I'm working on CentOS 6.0 with Qt 4.5 and OpenSSL 1.0.
The question is how to completely terminate a thread.
The Qt Documentation about terminate() tells:
The thread may or may not be terminated immediately, depending on the operating systems scheduling policies. Use QThread::wait() after terminate() for synchronous termination.
but also:
Warning: This function is dangerous and its use is discouraged. The thread can be terminated at any point in its code path. Threads can be terminated while modifying data. There is no chance for the thread to clean up after itself, unlock any held mutexes, etc. In short, use this function only if absolutely necessary.
Assuming you didn't reimplement QThread::run() (which is usually not necessary) - or if you actually reimplemented run and called exec() yourself, the usual way to stop a thread would be:
_thread->quit();
_thread->wait();
The first line asynchronously tells the thread to stop execution, which usually means the thread will finish whatever it is currently doing and then return from its event loop. However, quit() always returns immediately, which is why you need to call wait() so the main thread is blocked until _thread has actually ended. After that, you can safely start() the thread again.
If you really want to get rid of the thread as quickly as possible, you can also call wait() after terminate(), or at least before you call start() again.
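The "ask it to stop, then wait until it has really stopped before starting again" sequence is not specific to Qt. Below is a rough analogue using Python's threading module. It is not the QThread API, and the Worker class is invented for illustration. Note the limitation: a thread stuck inside a blocking call will not see the stop request until that call returns.

```python
import threading

class Worker:
    """Restartable worker: stop() requests a stop and waits for the thread to end."""

    def __init__(self):
        self._stop_requested = threading.Event()
        self._thread = None

    def start(self):
        if self._thread is not None and self._thread.is_alive():
            raise RuntimeError("previous run has not finished yet")
        self._stop_requested.clear()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        # Do one small unit of work per iteration, checking for a stop request.
        while not self._stop_requested.wait(0.01):
            pass

    def stop(self, timeout=None):
        """Analogue of quit() followed by wait(): returns True once the thread ended."""
        self._stop_requested.set()
        self._thread.join(timeout)
        return not self._thread.is_alive()
```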
According to the Nginx documentation:
If you need to replace nginx binary with a new one (when upgrading to a new version or adding/removing server modules), you can do it without any service downtime - no incoming requests will be lost.
My coworker and I were trying to figure out: how does that work? We know (we think) that:
Only one process can be listening on port 80 at a time
Nginx creates a socket and connects it to port 80
A parent process and any of its children can all bind to the same socket, which is how Nginx can have multiple worker children responding to requests
We also did some experiments with Nginx, like this:
Send a kill -USR2 to the current master process
Repeatedly run ps -ef | grep unicorn to see any unicorn processes, with their own pids and their parent pids
Observe that the new master process is, at first, a child of the old master process, but when the old master process is gone, the new master process has a ppid of 1.
So apparently the new master process can listen to the same socket as the old one while they're both running, because at that time, the new master is a child of the old master. But somehow the new master process can then become... um... nobody's child?
I assume this is standard Unix stuff, but my understanding of processes and ports and sockets is pretty darn fuzzy. Can anybody explain this in better detail? Are any of our assumptions wrong? And is there a book I can read to really grok this stuff?
For specifics: http://www.csc.villanova.edu/~mdamian/Sockets/TcpSockets.htm describes the C library for TCP sockets.
I think the key is that after a process forks while holding a socket file descriptor, the parent and child are both able to call accept() on it.
So here's the flow. Nginx, started normally:
Calls socket() and bind() and listen() to set up a socket, referenced by a file descriptor (integer).
Starts a thread that calls accept() on the file descriptor in a loop to handle incoming connections.
Then Nginx forks. The parent keeps running as usual, but the child immediately execs the new binary. exec() wipes out the old program, memory, and running threads, but inherits open file descriptors: see http://linux.die.net/man/2/execve. I suspect the exec() call passes the number of the open file descriptor as a command line parameter.
The child, started as part of an upgrade:
Reads the open file descriptor's number from the command line.
Starts a thread that calls accept() on the file descriptor in a loop to handle incoming connections.
Tells the parent to drain (stop accept()ing, and finish existing connections), and to die.
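To check that the mechanism itself works, here is a small Python sketch. It is not how Nginx is implemented; SUCCESSOR_CODE, open_listener and spawn_successor are made-up names, and passing the descriptor number on the command line is just the guess from above. It only demonstrates that a listening socket survives fork() plus exec(), and that the new program can accept() on it after the old process has closed its own copy.

```python
import os
import socket
import sys

# Stand-in for the "new binary": adopts the inherited descriptor, serves one client.
SUCCESSOR_CODE = """
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))  # wrap the inherited fd
conn, _ = listener.accept()
conn.sendall(b"served by successor")
conn.close()
"""

def open_listener():
    """socket() + bind() + listen(), then allow the fd to survive exec()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(16)
    listener.set_inheritable(True)   # Python marks fds close-on-exec by default
    return listener

def spawn_successor(listener):
    """Fork, then exec the successor with the fd number on its command line."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(sys.executable,
                     [sys.executable, "-c", SUCCESSOR_CODE, str(listener.fileno())])
        finally:
            os._exit(127)  # only reached if exec failed
    return pid
```

After spawn_successor returns, the old process can close its listener and exit. Connections that arrive in the meantime sit in the socket's backlog until the successor calls accept().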
I have no idea how nginx does it, but basically, it could just exec the new binary, carrying the listening socket with it into the new process (actually, it remains the same process, it just replaces the program executing in it). The listening socket has a backlog of incoming connections, and as long as the new program is fast enough to boot up, it should be able to start processing them before the backlog overflows. If not, it could probably fork first, exec, and wait for the new program to boot up to the point where it's ready to process incoming requests, then hand over command of the listening socket (file descriptors are inherited when forking, so both have access to it) via some internal mechanism, before exiting. Noting your observations, this looks like what it's doing (if your parent process dies, your ppid is reassigned to init, i.e. pid 1).
If it has multiple processes competing to accept on the same listening socket (again, I have no idea how nginx does it, perhaps it has a dispatching process?), then you could replace them one by one, by ordering them to exec the new program, as above, but one at a time, as to never drop the ball. Note that during such a process there would never be any new pids or parent/child relationship changes.
At least, I think that's probably how I would do it, off the top of my head.
What I am trying to solve: have an Erlang TCP server that listens on a specific port (the code should reside in some kind of external-facing interface/API), where each incoming connection is handled by a gen_server (that is, even the gen_tcp:accept should be coded inside the gen_server), but I don't want to initially spawn a predefined number of processes that accept incoming connections. Is that somehow possible?
Basic Procedure
You should have one static process (implemented as a gen_server or a custom process) that performs the following procedure:
Listens for incoming connections using gen_tcp:accept/1
Every time it returns a connection, tell a supervisor to spawn a worker process (e.g. another gen_server process)
Get the pid for this process
Call gen_tcp:controlling_process/2 with the newly returned socket and that pid
Send the socket to that process
Note: You must do it in that order, otherwise the new process might use the socket before ownership has been handed over. If this is not done, the old process might get messages related to the socket when the new process has already taken over, resulting in dropped or mishandled packets.
The listening process should only have one responsibility, and that is spawning workers for new connections. This process will block when calling gen_tcp:accept/1, which is fine because the started workers will handle ongoing connections concurrently. Blocking on accept ensures the quickest response time when new connections are initiated. If the process needs to do other things in between, gen_tcp:accept/2 could be used with other actions interleaved between timeouts.
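The overall shape (one acceptor that does nothing but accept and hand each connection to a fresh worker) is language-neutral, so here is a rough sketch of it in Python with threads. It is not Erlang: the hand-over via gen_tcp:controlling_process/2 has no counterpart here because Python sockets have no owning process, and accept_loop and echo_worker are names made up for the example.

```python
import socket
import threading

def echo_worker(conn):
    """Example worker: owns the connection from here on and closes it when done."""
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

def accept_loop(listener, worker=echo_worker):
    """Acceptor: block in accept(), hand each new connection to its own worker."""
    while True:
        try:
            conn, _addr = listener.accept()
        except OSError:
            return  # listener was closed or failed; stop accepting
        # The acceptor never touches conn again after starting the worker.
        threading.Thread(target=worker, args=(conn,), daemon=True).start()
```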
Scaling
You can have multiple processes waiting with gen_tcp:accept/1 on a single listening socket, further increasing concurrency and minimizing accept latency.
Another optimization would be to pre-start some socket workers to further minimize latency after accepting the new socket.
Third and finally, you could make your processes more lightweight by implementing the OTP design principles in your own custom processes using proc_lib. However, you should only do this if you benchmark and conclude that it is the gen_server behavior that slows you down.
The issue with gen_tcp:accept is that it blocks, so if you call it within a gen_server, you block the server from receiving other messages. You can try to avoid this by passing a timeout, but that ultimately amounts to a form of polling, which is best avoided. You might try Kevin Smith's gen_nb_server instead; it uses the internal, undocumented function prim_inet:async_accept and other prim_inet functions to avoid blocking.
You might want to check out http://github.com/oscarh/gen_tcpd and use the handle_connection function to convert the process you get to a gen_server.
You should use "prim_inet:async_accept(Listen_socket, -1)" as said by Steve. Because the accept call is asynchronous, the incoming connection will then be delivered to your handle_info callback (assuming your interface is also a gen_server).
On accepting the connection you can spawn another gen_server (I would recommend gen_fsm) and make it the "controlling process" by calling "gen_tcp:controlling_process(CliSocket, Pid of spawned process)".
After this, all data from the socket will be received by that process rather than by your interface code. A new controlling process is spawned in the same way for each further connection.