Erlang socket doesn't receive until the second setopts {active,once} - tcp

First, I apologize for giving so much information; I want to make the problem as clear as possible. Please let me know if anything still needs clarifying.
(Running erlang R13B04, kernel 2.6.18-194, centos 5.5)
I have a very strange problem. I have the following code to listen and process sockets:
%Opts used to make listen socket
-define(TCP_OPTS, [binary, {packet, raw}, {nodelay, true}, {reuseaddr, true}, {active, false},{keepalive,true}]).
%Acceptor loop which spawns off sock processors when connections
%come in
accept_loop(Listen) ->
    case gen_tcp:accept(Listen) of
        {ok, Socket} ->
            Pid = spawn(fun()->?MODULE:process_sock(Socket) end),
            gen_tcp:controlling_process(Socket,Pid);
        {error,_} -> do_nothing
    end,
    ?MODULE:accept_loop(Listen).
%Probably not relevant
process_sock(Sock) ->
    case inet:peername(Sock) of
        {ok,{Ip,_Port}} ->
            case Ip of
                {172,16,_,_} -> Auth = true;
                _ -> Auth = lists:member(Ip,?PUB_IPS)
            end,
            ?MODULE:process_sock_loop(Sock,Auth);
        _ -> gen_tcp:close(Sock)
    end.
process_sock_loop(Sock,Auth) ->
    try inet:setopts(Sock,[{active,once}]) of
        ok ->
            receive
                {tcp_closed,_} ->
                    ?MODULE:prepare_for_death(Sock,[]);
                {tcp_error,_,etimedout} ->
                    ?MODULE:prepare_for_death(Sock,[]);
                %Not getting here
                {tcp,Sock,Data} ->
                    ?MODULE:do_stuff(Sock,Data);
                _ ->
                    ?MODULE:process_sock_loop(Sock,Auth)
            after 60000 ->
                ?MODULE:process_sock_loop(Sock,Auth)
            end;
        {error,_} ->
            ?MODULE:prepare_for_death(Sock,[])
    catch _:_ ->
        ?MODULE:prepare_for_death(Sock,[])
    end.
This whole setup normally works wonderfully, and has been working for the past few months. The server operates as a message passing server with long-held tcp connections, and it holds on average about 100k connections. However, now we're trying to use the server more heavily. We're making two long-held connections (in the future probably more) to the erlang server and issuing a few hundred commands every second on each of those connections. Each of those commands, in the common case, spawns off a new thread which will probably make some kind of read from mnesia, and send some messages based on that.
The strangeness comes when we try to test those two command connections. When we turn on the stream of commands, any new connection has about 50% chance of hanging. For instance, using netcat if I were to connect and send along the string "blahblahblah" the server should immediately return back an error. In doing this it won't make any calls outside the thread (since all it's doing is trying to parse the command, which will fail because blahblahblah isn't a command). But about 50% of the time (when the two command connections are running) typing in blahblahblah results in the server just sitting there for 60 seconds before returning that error.
In trying to debug this I pulled up wireshark. The tcp handshake always happens immediately, and when the first packet from the client (netcat) is sent it acks immediately, telling me that the tcp stack of the kernel isn't the bottleneck. My only guess is that the problem lies in the process_sock_loop function. It has a receive which will go back to the top of the function after 60 seconds and try again to get more from the socket. My best guess is that the following is happening:
1. Connection is made, thread moves on to process_sock_loop
2. {active,once} is set
3. Thread receives, but doesn't get data even though it's there
4. After 60 seconds thread goes back to the top of process_sock_loop
5. {active, once} is set again
6. This time the data comes through, things proceed as normal
Why this would be I have no idea, and when we turn those two command connections off everything goes back to normal and the problem goes away.
Any ideas?

It's likely that your first call to set {active,once} is failing because of a race condition between your call to spawn and your call to controlling_process: the spawned process can reach inet:setopts before the acceptor has handed the socket over.
It will be intermittent, and likely depends on host load.
When doing this, I'd normally spawn a function that blocks on something like:
{take,Sock}
and then calls your loop on the socket, setting {active,once}.
So you'd change the acceptor to spawn, set controlling_process, and then do Pid ! {take,Sock}.
Something to that effect.
Note: I don't know whether the {active,once} call actually throws when you aren't the controlling process. If it doesn't, then what I just said makes sense.
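The handoff described in this answer could be sketched roughly as follows. This is only an illustration under assumptions: the module name handoff_sketch, the helpers wait_for_socket/0 and socket_loop/1, the 5 second handoff timeout, and the echo behaviour are all hypothetical and do not come from the question's code. The point is that the worker blocks until the acceptor has finished gen_tcp:controlling_process/2, so {active,once} is only ever set by the process that already owns the socket.

```erlang
-module(handoff_sketch).
-export([accept_loop/1]).

%% Acceptor: spawn the worker first, transfer ownership, and only then
%% tell the worker that the socket is ready for it.
accept_loop(Listen) ->
    case gen_tcp:accept(Listen) of
        {ok, Socket} ->
            Pid = spawn(fun wait_for_socket/0),
            case gen_tcp:controlling_process(Socket, Pid) of
                ok ->
                    Pid ! {take, Socket};
                {error, _} ->
                    exit(Pid, kill),
                    gen_tcp:close(Socket)
            end,
            accept_loop(Listen);
        {error, closed} ->
            ok;                          % listen socket is gone, stop
        {error, _} ->
            accept_loop(Listen)
    end.

%% Worker: do nothing with the socket until ownership has been handed over.
%% The 5000 ms timeout is an arbitrary value chosen for this sketch.
wait_for_socket() ->
    receive
        {take, Socket} -> socket_loop(Socket)
    after 5000 ->
        ok
    end.

%% Hypothetical per-connection loop: echoes data back to the client.
socket_loop(Socket) ->
    case inet:setopts(Socket, [{active, once}]) of
        ok ->
            receive
                {tcp, Socket, Data} ->
                    gen_tcp:send(Socket, Data),
                    socket_loop(Socket);
                {tcp_closed, Socket} ->
                    ok;
                {tcp_error, Socket, _Reason} ->
                    gen_tcp:close(Socket)
            end;
        {error, _} ->
            gen_tcp:close(Socket)
    end.
```

In the question's code, the equivalent change would be to have process_sock wait for the {take,Sock} message before calling inet:peername or inet:setopts.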

Related

Erlang: exit/shutdown synchronous or asynchronous?

You have a process tree you want to kill, so you send an exit(PID, shutdown) to the supervisor. There's other stuff you need to do, but it can't be done until this process tree is shutdown. For instance, let's say this process tree writes to a database. You want to shut everything down cleanly. You want to shut down the database, but obviously you need to shut down the process tree first, else the tree could be in the middle of a write to the database.
My question is, when I send the exit signal, is it synchronous or asynchronous? If it is synchronous, it seems I have no worries, but if it is asynchronous, I will need to do something like establish a process monitor and check whether the tree shut down before I proceed with database shutdown, correct?
Thanks.
Short answer: OTP shutdown is synchronous. exit/2 is a single asynchronous signal.
Long answer: All messages in Erlang are asynchronous. The shutdown message is no different. However, there is more to shutdown than just sending a message. The supervisor waits for a {'DOWN', ...} message after sending the exit signal. Only after it receives a 'DOWN' message or times out does it proceed, so in effect it is synchronous. Check out the supervisor source code. Line 894 is where the function that actually makes the exit call is defined:
shutdown(Pid, Time) ->
    case monitor_child(Pid) of
        ok ->
            exit(Pid, shutdown), %% Try to shutdown gracefully
            receive
                {'DOWN', _MRef, process, Pid, shutdown} ->
                    ok;
                {'DOWN', _MRef, process, Pid, OtherReason} ->
                    {error, OtherReason}
            after Time ->
                exit(Pid, kill), %% Force termination.
                receive
                    {'DOWN', _MRef, process, Pid, OtherReason} ->
                        {error, OtherReason}
                end
            end;
        {error, Reason} ->
            {error, Reason}
    end.
The source code can be viewed on GitHub here: https://github.com/erlang/otp/blob/maint/lib/stdlib/src/supervisor.erl#L894
A call to erlang:exit/2, on the other hand, simply sends an asynchronous exit signal.
If you need to manage this yourself, do your own monitoring:
sneak_attack(BankGuard) ->
    monitor(process, BankGuard),
    exit(BankGuard, kill),
    Cash = receive {'DOWN', _, process, BankGuard, _} -> rob_bank() end,
    send_to_bahamas(Cash).
In this example rob_bank() and anything after is blocked waiting on the 'DOWN' message from BankGuard.
Also, note that this is a much more general concept than just shutting something down. All messages in Erlang are asynchronous but, unlike UDP, ordering (between two processes) and delivery (so long as the destination is alive) are guaranteed. So synchronous messaging is simply monitoring the target, sending a tagged message, and blocking on receipt of the return message.
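The "monitor, send a tagged message, block on the reply" recipe might look like the sketch below. The module name sync_call_sketch, the function call/3, and the {request, ...} / {reply, ...} message shapes are assumptions made up for this illustration; they are not an OTP API (gen_server:call does something similar internally, but with its own message format).

```erlang
-module(sync_call_sketch).
-export([call/3]).

%% Send Request to Pid and block until it replies, dies, or Timeout expires.
%% The monitor reference doubles as the unique tag for the reply, so a stale
%% reply from an earlier call cannot be mistaken for this one.
call(Pid, Request, Timeout) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {request, {self(), MRef}, Request},
    receive
        {reply, MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            {ok, Reply};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The target died (or never existed) before answering.
            {error, Reason}
    after Timeout ->
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.
```

A process on the receiving end would answer with From ! {reply, Tag, Result} after matching {request, {From, Tag}, Request}. Because the caller monitors the target, a crash shows up as an {error, Reason} result instead of a silent hang.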

How do I detect tcp-client disconnect with gen_tcp?

I'm trying to use the gen_tcp module.
Here is an example of the server-side code that I'm having trouble with.
%% First, I bind server port and wait for peer connection
{ok, Sock} = gen_tcp:listen(7890, [{active, false}]),
{ok, Peer} = gen_tcp:accept(Sock),
%% Here client calls `gen_tcp:close/1` on socket and goes away.
%% After that I try to send some message to the client
SendResult = gen_tcp:send(Peer, <<"HELLO">>),
%% Now I guess that send is failed with {error, closed}, but...
ok = SendResult.
When I call gen_tcp:send/2 again, the second call will return {error, closed} as expected. But I want to understand why the first call succeeded. Am I missing some tcp-specific details?
This strange (for me) behavior is only for {active, false} connection.
In short, the reason for this is that there's no activity on the socket that can determine that the other end has closed. The first send appears to work because it operates on a socket that, for all intents and purposes, appears to be connected and operational. But that write activity determines that the other end is closed, which is why the second send fails as expected.
If you were to first read or recv from the socket, you'd quickly learn the other end was closed. Alternatively, if the socket were in an Erlang active mode, then you'd also learn of the other end closing because active modes poll the socket.
Aside from whether the socket is in an active mode or not, this has nothing to do with Erlang. If you were to write C code directly to the sockets API, for example, you'd see the same behavior.
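As a rough illustration of "read first to learn that the other end closed", a passive socket can be probed with gen_tcp:recv/3 before sending. The module name peer_close_sketch and the function peer_status/2 are hypothetical names invented for this sketch. Note that the probe consumes any data that happens to be waiting, so that data is returned to the caller rather than dropped.

```erlang
-module(peer_close_sketch).
-export([peer_status/2]).

%% Probe a passive ({active, false}) socket for up to Timeout milliseconds.
%% Returns:
%%   closed          - the peer has closed its end
%%   open            - nothing arrived within Timeout, connection looks alive
%%   {data, Bin}     - the peer sent data, which the caller must now handle
%%   {error, Reason} - some other socket error
peer_status(Sock, Timeout) ->
    case gen_tcp:recv(Sock, 0, Timeout) of
        {error, closed}  -> closed;
        {error, timeout} -> open;
        {ok, Data}       -> {data, Data};
        {error, Reason}  -> {error, Reason}
    end.
```

An "open" result is only a snapshot: the peer can still close right after the probe, so the result of the following send has to be checked as well.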

Sending message between hosts, without probe or ask every host for new messages

My problem is sending messages about the status of the calculation and of the program. Every host gets one chunk of work. When a host finishes its work, it should send the result to the receiver. The receiver could change while the calculation runs. For debugging purposes, the status of every host should also be transferred to the host with rank 0.
From that point on I get a lot of messages, but it is not clear to me how I should send the messages between the hosts.
One possibility is a ring-like message transport, where every host passes the message on to its next neighbor.
Non-blocking communication functions like MPI_Isend and MPI_Irecv could be the solution, but every host would have to be both sender and receiver.
The easy way is for every host to broadcast its messages, but that is a lot of traffic.
I need a function like broadcast, where every host can be both receiver and sender, and which communicates only when there actually is a message!
regards
Based on "I need a function like broadcast, where every host could be receiver and sender.", MPI_Alltoall fits the bill. Please refer to this link for an actual example.
If you don't know that there is going to be a message, one way to handle this would probably be to have the master process act as a message queue, basically in an endless receive loop until all tasks have sent an exit signal.
This way, you don't have to worry about mixed send/receive counts between neighbor tasks, and each task can independently poll the master task occasionally to see if there is work, and or a message, for it.
This can get a bit complicated to handle, especially the message queue and making sure that, when a message is waiting, the two tasks then initiate a short send/receive session. But it would prevent a lot of broadcast / all-to-all traffic, which is notoriously expensive time-wise.
If the master task is largely only handling message status, it should be very low friction interconnect wise.
Basically for Task 0:
while (1){
    MPI_Recv(args);
    if (any task has a message to send){ add it to the master's queue; }
    if (any task is waiting to contact the task that is now polling for messages){ tell current task to initiate a Recv wait and signal the master that it is waiting; }
    if (any other task is waiting for the current task to send to it){ initiate a send to that task; }
}
Each task:
if (work needs to be sent to a neighbor){ contact master; master adds to queue; }
if (a neighbor wants to send work){ enter receive loop ; }
if (a neighbor is ready to receive work according to the master){ send work to it; }
Or:
Task 3 has work for Task 8.
Task 3 contacts Master, and says so.
Task 3 continues its business.
Task 8 contacts Master, and sees it has work from Task 3 pending.
Task 8 enters a receive.
Task 3 (at some polling interval) again contacts master to check on work, and sees there is a task awaiting work.
Task 3 initiates a send.
Task 3 initiates a send for any remaining tasks waiting on work.
Program continues.
Now, this has plenty of its own caveats. Whether or not it is efficient depends on how frequently messages are passed between tasks, and how long a given task sits in wait for its neighbor to send.
And in the end, this may not be better than a simple Alltoall(). And it's entirely possible that this solution is downright terrible.

unix network process

I was wondering how TCP/IP communication is implemented in Unix. When you do a send over the socket, does the TCP-level work (assembling packets, CRC, etc.) get executed in the same execution context as the calling code?
Or, what seems more likely, a message is sent to some other daemon process responsible for tcp communication? This process then takes the message and performs the requested work of copying memory buffers and assembling packets etc.? So, the calling code resumes execution right away and tcp work is done in parallel? Is this correct?
Details would be appreciated. Thanks!
The TCP/IP stack is part of your kernel. What happens is that you call a helper method which prepares a "kernel trap". This is a special kind of exception which puts the CPU into a mode with more privileges ("kernel mode"). Inside of the trap, the kernel examines the parameters of the exception. One of them is the number of the function to call.
When the function is called, it copies the data into a kernel buffer and prepares everything for the data to be processed. Then it returns from the trap, the CPU restores registers and its original mode and execution of your code resumes.
Some kernel thread will pick up the copy of the data and use the network driver to send it out, do all the error handling, etc.
So, yes, after copying the necessary data, your code resumes and the actual data transfer happens in parallel.
Note that this is for TCP packets. The TCP protocol does all the error handling and handshaking for you, so you can give it all the data and it will know what to do. If there is a problem with the connection, you'll notice only after a while since the TCP protocol can handle short network outages by itself. That means you'll have "sent" some data already before you'll get an error. That means you will get the error code for the first packet only after the Nth call to send() or when you try to close the connection (the close() will hang until the receiver has acknowledged all packets).
The UDP protocol doesn't buffer. When the call returns, the packet is on its way. But it's "fire and forget", so you only know that the driver has put it on the wire. If you want to know whether it has arrived somewhere, you must figure out a way to achieve that yourself. The usual approach is to have the receiver send an ack UDP packet back (which also might get lost).
No - there is no parallel execution. It is true that the execution context when you're making a system call is not the same as your usual execution context. When you make a system call, such as for sending a packet over the network, you must switch into the kernel's context - the kernel's own memory map and stack, instead of the virtual memory you get inside your process.
But there are no daemon processes magically dispatching your call. The rest of the execution of your program has to wait for the system call to finish and return whatever values it will return. This is why you can count on return values being available right away when you return from the system call - values like the number of bytes actually read from the socket or written to a file.
I tried to find a nice explanation for how the context switch to kernel space works. Here's a nice in-depth one that even focuses on architecture-specific implementation:
http://www.ibm.com/developerworks/linux/library/l-system-calls/
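Whichever description of the internals is closer to the truth, one thing is observable from the caller's side: a successful send only means the data was accepted locally for transmission, not that the peer has read it. The following sketch shows this with Erlang's gen_tcp over the loopback interface. The module name send_buffer_sketch and the function send_unread/1 are made up for this example, and it assumes a payload small enough to fit in the local buffers.

```erlang
-module(send_buffer_sketch).
-export([send_unread/1]).

%% Send Bytes zero bytes to a peer that has not yet called recv, and
%% return {SendResult, RecvResult}. The send completes before the
%% accepting side reads anything.
send_unread(Bytes) ->
    {ok, L} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(L),
    {ok, C} = gen_tcp:connect({127,0,0,1}, Port, [binary, {active, false}]),
    {ok, S} = gen_tcp:accept(L, 2000),
    %% At this point nobody is reading from S.
    SendResult = gen_tcp:send(C, binary:copy(<<0>>, Bytes)),
    %% Only now does the receiving side pick the data up.
    RecvResult = gen_tcp:recv(S, Bytes, 2000),
    ok = gen_tcp:close(C),
    ok = gen_tcp:close(S),
    ok = gen_tcp:close(L),
    {SendResult, RecvResult}.
```

For example, send_buffer_sketch:send_unread(1024) returns ok for the send even though the receiver had not read a single byte when the send returned.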

Behavior of shutdown(sock, SHUT_RD) with TCP

When using a TCP socket, what does
shutdown(sock, SHUT_RD);
actually do? Does it just make all recv() calls return an error code? If so, which error code?
Does it cause any packets to be sent by the underlying TCP connection? What happens to any data that the other side sends at this point - is it kept, and the window size of the connection keeps shrinking until it gets to 0, or is it just discarded, and the window size doesn't shrink?
Shutting down the read side of a socket will cause any blocked recv (or similar) calls to return 0 (indicating graceful shutdown). I don't know what will happen to data currently traveling up the IP stack. It will most certainly ignore data that is in-flight from the other side. It will not affect writes to that socket at all.
In fact, judicious use of shutdown is a good way to ensure that you clean up as soon as you're done. An HTTP client that doesn't use keepalive can shut down the write side as soon as it is done sending the request, and a server that sees Connection: close can likewise shut down the read side as soon as it is done receiving the request. This will cause any further erroneous activity to be immediately obvious, which is very useful when writing protocol-level code.
Looking at the Linux source code, shutdown(sock, SHUT_RD) doesn't seem to cause any state changes to the socket. (Obviously, shutdown(sock, SHUT_WR) causes FIN to be set.)
I can't comment on the window size changes (or lack thereof). But you can write a test program to see. Just make your inetd run a chargen service, and connect to it. :-)
shutdown(,SHUT_RD) does not have any counterpart in the TCP protocol, so it is pretty much up to the implementation how to behave when someone writes to a connection whose other side has indicated that it will not read, or when you try to read after you declared that you won't.
At a slightly lower level, it helps to remember that a TCP connection is a pair of flows over which the peers send data until they declare that they are done (with SHUT_WR, which sends a FIN). These two flows are quite independent.
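The independence of the two flows can be seen with a write-side shutdown, which is the well-defined half of the story. The sketch below uses Erlang's gen_tcp:shutdown/2 rather than the C API, purely as an illustration; the module name half_close_sketch and the function run/0 are hypothetical. It does not attempt to demonstrate SHUT_RD, whose behaviour is implementation-dependent, as this answer says.

```erlang
-module(half_close_sketch).
-export([run/0]).

%% The client sends a request and shuts down its write side (FIN).
%% The server still reads the request, answers it, and then sees end of
%% stream. The client still receives the response on its read side.
run() ->
    {ok, L} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(L),
    {ok, C} = gen_tcp:connect({127,0,0,1}, Port, [binary, {active, false}]),
    {ok, S} = gen_tcp:accept(L, 2000),
    %% Keep the server socket usable after it has seen the peer's FIN.
    ok = inet:setopts(S, [{exit_on_close, false}]),
    ok = gen_tcp:send(C, <<"request">>),
    ok = gen_tcp:shutdown(C, write),           % client is done sending
    {ok, Req} = gen_tcp:recv(S, 7, 2000),      % data sent before the FIN
    ok = gen_tcp:send(S, <<"response">>),      % other direction still open
    Eof = gen_tcp:recv(S, 0, 2000),            % end of stream from client
    {ok, Resp} = gen_tcp:recv(C, 8, 2000),     % client can still read
    ok = gen_tcp:close(C),
    ok = gen_tcp:close(S),
    ok = gen_tcp:close(L),
    {Req, Eof, Resp}.
```

Here Eof is the server's view of the client's FIN, while Resp shows that data kept flowing in the opposite direction after the client shut down its write side.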
I tested shutdown(sock,SHUT_RD) on Ubuntu 12.04. I found that when you call shutdown(sock,SHUT_RD) and there is no data of any kind (including a FIN) in the TCP buffer, the subsequent read call returns 0 (indicating end of stream). But if some data arrived before or after the shutdown call, the read call proceeds normally, as if shutdown had not been called. It seems that shutdown(sock,SHUT_RD) doesn't cause any TCP state change on the socket.
It has two effects, one of them platform-dependent.
1. recv() will return zero, indicating end of stream.
2. Any further writes to the connection by the peer will either (a) be silently thrown away by the receiver (BSD), (b) be buffered by the receiver and eventually cause send() to block or return -1/EAGAIN/EWOULDBLOCK (Linux), or (c) cause the receiver to send an RST (Windows).
shutdown(sock, SHUT_RD) causes any writer to the socket to receive a SIGPIPE signal.
Any further reads using the read system call will return a -1 and set errno to EINVAL.
The use of recv will return a -1 and set errno to indicate the error (probably ENOTCONN or ENOTSOCK).

Resources