Understanding TCP memory management in Linux kernel

I'm implementing a custom transport layer protocol on top of UDP to provide robust delivery services, and I need to ensure proper memory management. I'm trying to use TCP as a reference and see how the function tcp_sendmsg() handles memory constraints.
In the kernel code for tcp_sendmsg():
    if (!sk_stream_memory_free(sk))
        goto wait_for_sndbuf;
    ...
    wait_for_sndbuf:
        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
So the SOCK_NOSPACE flag is set for the socket. But how and where is the bit cleared later? And how does the tcp_sendmsg() function know that the bit has been cleared and it can resume sending the data?
Edit 1: As suggested by Maxim in his answer, the function sk_stream_wait_memory() handles the waiting for TCP. Can my protocol, which is built on top of UDP, use this "stream" function as well?

So the SOCK_NOSPACE flag is set for the socket. But how and where is the bit cleared later?
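This bit is cleared in sk_stream_write_space() (net/core/stream.c), which runs when send-buffer memory has been released and the socket has become writeable again. For TCP, that memory is released when the peer acknowledges the data and the acknowledged buffers are freed.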
And how does the tcp_sendmsg() function know that the bit has been cleared and it can resume sending the data?
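There is a while (msg_data_left(msg)) loop in tcp_sendmsg() that calls sk_stream_wait_memory(). sk_stream_wait_memory() does the waiting: it puts the caller to sleep until send-buffer memory is available again or the send timeout expires.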
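For a protocol built on UDP in userspace, the same pattern could be modelled roughly as below. This is only a sketch of the idea, not kernel code: every name in it (tx_buf, tx_memory_free, tx_try_queue, tx_on_ack) is hypothetical, and it assumes that memory is released by an acknowledgement handler.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of send-buffer accounting. */
struct tx_buf {
    size_t limit;    /* plays the role of sk_sndbuf */
    size_t queued;   /* plays the role of sk_wmem_queued */
    bool   nospace;  /* plays the role of SOCK_NOSPACE */
};

/* True while more data may be queued (cf. sk_stream_memory_free()). */
static bool tx_memory_free(const struct tx_buf *b)
{
    return b->queued < b->limit;
}

/* Try to queue len bytes; on failure set the flag so the caller waits. */
static bool tx_try_queue(struct tx_buf *b, size_t len)
{
    if (!tx_memory_free(b)) {
        b->nospace = true;
        return false;
    }
    b->queued += len;
    return true;
}

/*
 * Called when the peer acknowledges len bytes: release the memory and,
 * if a sender was blocked and there is room again, clear the flag.
 * Returns true when a blocked sender should be woken up.
 */
static bool tx_on_ack(struct tx_buf *b, size_t len)
{
    b->queued -= (len < b->queued) ? len : b->queued;
    if (b->nospace && tx_memory_free(b)) {
        b->nospace = false;
        return true;
    }
    return false;
}
```

In a real sender the "wake up" would be a condition variable signal or an event on the socket's poll loop; the sketch only shows where the flag is set and cleared.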

Related

TCP write error but not really

I have been testing a program which has simple communication between two machines over a 1Gbps line. While running TCP communications over the line I occasionally receive write errors on the client side (due to a timeout) when the network is totally flooded (running at or close to 100% usage). This generally happens when I am running multiple instances of the same program going to different ports.
My question is: is it possible to get a write error but still receive the message on the server side? It appears that is what is happening, and I am not quite sure why. Could it be that the ACK coming back to the client is what is timing out?

Yes, that is possible. TCP does not guarantee that data you sent successfully was received, nor that data whose send failed was not received. This problem is unsolvable; it is known as the Two Generals' Problem. There is always a way to lose messages/packets such that the sender comes to the wrong conclusion. TCP guarantees that the receiver receives the same stream of bytes that the sender sent, but possibly cut off at an arbitrary point.
This unreliability has performance reasons, too. TCP data is buffered on both hosts as well as on the network. Acknowledgement is delayed.
You have to live with this. If you make your scenario more concrete I can suggest some strategies of dealing with this.

send puts data into the TCP send buffer.
If the send buffer does not have enough space, send will block until the data is completely or partly copied into the send buffer, or until the configured timeout expires.
Read and write timeouts are normal. You should check for them and handle them by restarting the read/write operation after the timeout. You should also handle read/write errors other than timeouts.
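A minimal sketch of that restart logic, assuming a blocking socket with a send timeout set through SO_SNDTIMEO. The helper names set_send_timeout and send_all and the max_timeouts parameter are made up for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Set a send timeout on fd; returns 0 on success, -1 on error. */
static int set_send_timeout(int fd, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);
}

/*
 * Send exactly len bytes, resuming after partial writes and after up to
 * max_timeouts send timeouts. Returns 0 on success, -1 on a hard error
 * or when the timeout budget is used up.
 */
static int send_all(int fd, const void *buf, size_t len, int max_timeouts)
{
    const char *p = buf;
    int timeouts = 0;

    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n > 0) {                       /* partial or full write */
            p += n;
            len -= (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)       /* interrupted: just retry */
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK) &&
            ++timeouts <= max_timeouts)    /* timed out: restart the send */
            continue;
        return -1;                         /* any other error is fatal here */
    }
    return 0;
}
```

Note that a -1 from send_all does not tell you how much the peer actually received, which is the point made in the answer above.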

TCP as connection protocol questions

I'm not sure if this is the correct place to ask, so forgive me if it isn't.
I'm writing computer monitoring software that needs to connect to a server. The server may send out relatively urgent messages, such as sound or cancel an alarm, and the client may send out data about the computer, such as screenshots. The data that the client sends isn't too critical on timing, but shouldn't be more than two minutes late.
It is essential to the software that port forwarding need not be set up, and it is assumed that the internet connection will be done through a wireless router that has NAT almost all the time.
My idea is to have a TCP connection initiated from the client, and use that to transfer data. Ideally, I would have no data being sent when it is not needed, but I believe this to be impossible. Would sending the equivalent of a ping every now and again keep the connection alive, and what sort of bandwidth would it use if this program was running all the time on the computer? In addition, would it be possible to reduce the header size for these keep-alives?
Before I start designing the communication and programming, is this plan for connection flawed? Are there better alternatives?
Thanks!

1) As far as the two endpoints are concerned, an idle TCP connection stays open without any traffic, so TCP itself does not need 'ping' data. The stack's own keep-alive probes (SO_KEEPALIVE) are optional and disabled by default. The reasons for sending 'ping' data are to stop a NAT router from expiring its idle mapping and to detect a closed connection on the client side; typically you only find out something has gone wrong when you try to read from or write to the socket. The keep-alive time-outs can be changed so you detect this condition faster.
2) In general, while TCP provides a stream-oriented, error-free channel, it makes no guarantees about timeliness; over the internet it is even more unpredictable.
3) For applications such as this (I hope you are making it for ethical purposes) I would tend to use TCP, since you don't want a situation where the client receives a packet to raise an alarm but misses the one that turns it off again.
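As a sketch of how the time-outs mentioned in point 1 can be adjusted: on Linux the stack's keep-alive probes are controlled per socket with SO_KEEPALIVE plus the Linux-specific TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options. The helper name enable_keepalive and the values used with it are illustrative assumptions, not recommendations.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Enable TCP keep-alive probes on a TCP socket.
 * idle_s:     seconds of idleness before the first probe
 * interval_s: seconds between unanswered probes
 * count:      unanswered probes before the connection is dropped
 * TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific.
 * Returns 0 on success, -1 on error.
 */
static int enable_keepalive(int fd, int idle_s, int interval_s, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s,
                   sizeof interval_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
        return -1;
    return 0;
}
```

For example, enable_keepalive(fd, 60, 10, 3) would start probing after 60 idle seconds and give up after three unanswered probes sent 10 seconds apart.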

Unix TCP servers and UDP Servers

Why are TCP servers mostly designed so that whenever a connection is accepted, a new process is started to handle it, while UDP servers mostly have only a single process that handles all client requests?

The main difference between TCP and UDP is, as stated before, that UDP is connectionless.
A program using UDP has only one socket where it receives messages. So there's no problem if you just block and wait for a message.
If using TCP you get one socket for every client which connects. Then you can't just block and wait for ONE socket to receive something, because there are other sockets which must be processed at the same time.
So you have two options: either use non-blocking methods or use threads. Code is usually much simpler when you don't have one while loop which has to handle every client, so threading is often preferred. You can also save some CPU time if using blocking methods.

When you talk to a client over a TCP connection, you maintain a TCP session. So when a new connection is established, you need a separate process (or thread, depending on the implementation and the OS) to maintain that conversation. With UDP you simply receive a datagram (and are told the sender's IP address and port), but in the common case there is no session on which to carry on a conversation.

First of all, the classic Unix server paradigm is filter based. For example, various network services can be configured in /etc/inetd.conf (using the service names listed in /etc/services), and a program like inetd listens on all of the TCP and UDP sockets for incoming connections and datagrams. When a connection or datagram arrives it forks, redirects stdin, stdout and stderr to the socket using the dup2 system call, and then execs the server process. You can take any program which reads from stdin and writes to stdout, such as grep, and turn it into a network service.
According to Stevens in "Unix Network Programming", there are five kinds of server I/O models (pg. 154):
1) blocking
2) non-blocking
3) multiplexing (select and poll)
4) signal-driven
5) asynchronous (POSIX aio_ functions)
In addition the servers can be either Iterative or Concurrent.
You ask why TCP servers are typically concurrent, while UDP servers are typically iterative.
The UDP side is easier to answer. Typically UDP apps follow a simple request-response model where a client sends a short request followed by a reply, with each pair constituting a stand-alone transaction. UDP servers are the only ones which use signal-driven I/O, and then only very rarely.
TCP is a bit more complicated. Iterative servers can use any of the I/O models above, except #4. The fastest servers on a single processor are actually iterative servers using non-blocking I/O. However, these are considered relatively complex to implement, and that plus the Unix filter idiom were traditionally the primary reasons for use of the concurrent model with blocking I/O, whether multiprocess or multithreaded. Now, with the advent of common multicore systems, the concurrent model also has the performance advantage.

Your generalization is too general. This is a pattern you might see with a Unix-based server, where process creation is inexpensive. A .NET-based service will use a new thread from the thread pool instead of creating a new process.

Programs that can continue to do useful work while they are waiting for I/O
will often be multithreaded. Programs that do lots of computation which
can be neatly divided into separate sections can benefit from
multithreading, if there are multiple processors. Programs that service
lots of network requests can sometimes benefit by having a pool of
available threads to service requests. GUI programs that also need to
perform computation can benefit from multithreading, because it allows the
main thread to continue to service GUI events.
That's why we use TCP as an internet protocol.

unix network process

I was wondering how tcp/ip communication is implemented in unix. When you do a send over the socket, does the tcp/level work (assembling packets, crc, etc) get executed in the same execution context as the calling code?
Or, what seems more likely, a message is sent to some other daemon process responsible for tcp communication? This process then takes the message and performs the requested work of copying memory buffers and assembling packets etc.? So, the calling code resumes execution right away and tcp work is done in parallel? Is this correct?
Details would be appreciated. Thanks!

The TCP/IP stack is part of your kernel. What happens is that you call a helper method which prepares a "kernel trap". This is a special kind of exception which puts the CPU into a mode with more privileges ("kernel mode"). Inside of the trap, the kernel examines the parameters of the exception. One of them is the number of the function to call.
When the function is called, it copies the data into a kernel buffer and prepares everything for the data to be processed. Then it returns from the trap, the CPU restores registers and its original mode and execution of your code resumes.
Some kernel thread will pick up the copy of the data and use the network driver to send it out, do all the error handling, etc.
So, yes, after copying the necessary data, your code resumes and the actual data transfer happens in parallel.
Note that this is for TCP packets. The TCP protocol does all the error handling and handshaking for you, so you can give it all the data and it will know what to do. If there is a problem with the connection, you'll notice only after a while, since the TCP protocol can handle short network outages by itself. That means you'll have "sent" some data already before you get an error: you will get the error code for the first packet only after the Nth call to send() or when you try to close the connection (close() can block until the receiver has acknowledged all packets if SO_LINGER is enabled).
The UDP protocol doesn't buffer. When the call returns, the packet is on its way. But it's "fire and forget", so you only know that it has been handed off for transmission. If you want to know whether it has arrived somewhere, you must figure out a way to achieve that yourself. The usual approach is to have the receiver send an ack UDP packet back (which also might get lost).

No - there is no parallel execution. It is true that the execution context when you're making a system call is not the same as your usual execution context. When you make a system call, such as for sending a packet over the network, you must switch into the kernel's context - the kernel's own memory map and stack, instead of the virtual memory you get inside your process.
But there are no daemon processes magically dispatching your call. The rest of the execution of your program has to wait for the system call to finish and return whatever values it will return. This is why you can count on return values being available right away when you return from the system call - values like the number of bytes actually read from the socket or written to a file.
I tried to find a nice explanation for how the context switch to kernel space works. Here's a nice in-depth one that even focuses on architecture-specific implementation:
http://www.ibm.com/developerworks/linux/library/l-system-calls/

what happens when tcp/udp server is publishing faster than client is consuming?

I am trying to get a handle on what happens when a server publishes (over tcp, udp, etc.) faster than a client can consume the data.
Within a program I understand that if a queue sits between the producer and the consumer, it will start to get larger. If there is no queue, then the producer simply won't be able to produce anything new, until the consumer can consume (I know there may be many more variations).
I am not clear on what happens when data leaves the server (which may be a different process, machine or data center) and is sent to the client. If the client simply can't respond to the incoming data fast enough, assuming the server and the consumer are very loosely coupled, what happens to the in-flight data?
Where can I read to get details on this topic? Do I just have to read the low level details of TCP/UDP?
Thanks

With TCP there's a TCP Window which is used for flow control. TCP only allows a certain amount of data to remain unacknowledged at a time. If a server is producing data faster than a client is consuming data, then the amount of unacknowledged data will increase until the TCP window is 'full'. At this point the sending TCP stack will wait and will not send any more data until the client acknowledges some of the data that is pending.
With UDP there's no such flow control system; it's unreliable after all. The UDP stacks on both client and server are allowed to drop datagrams if they feel like it, as are all routers between them. If you send more datagrams than the link can deliver to the client or if the link delivers more datagrams than your client code can receive then some of them will get thrown away. The server and client code will likely never know unless you have built some form of reliable protocol over basic UDP. Though actually you may find that datagrams are NOT thrown away by the network stack and that the NIC drivers simply chew up all available non-paged pool and eventually crash the system (see this blog posting for more details).
Back with TCP, how your server code deals with the TCP Window becoming full depends on whether you are using blocking I/O, non-blocking I/O or async I/O.
If you are using blocking I/O then your send calls will block and your server will slow down; effectively your server is now in lock step with your client. It can't send more data until the client has received the pending data.
If the server is using non-blocking I/O then you'll likely get an error return that tells you that the call would have blocked; you can do other things, but your server will need to retry sending the data later.
If you're using async I/O then things may be more complex. With async I/O using I/O Completion Ports on Windows, for example, you won't notice anything different at all. Your overlapped sends will still be accepted just fine but you might notice that they are taking longer to complete. The overlapped sends are being queued on your server machine and are using memory for your overlapped buffers and probably using up 'non-paged pool' as well. If you keep issuing overlapped sends then you run the risk of exhausting non-paged pool memory or using a potentially unbounded amount of memory as I/O buffers. Therefore with async I/O and servers that COULD generate data faster than their clients can consume it you should write your own flow control code that you drive using the completions from your writes. I have written about this problem on my blog here and here and my server framework provides code which deals with it automatically for you.
As far as the data 'in flight' is concerned, the TCP stacks in both peers will ensure that the data arrives as expected (i.e. in order and with nothing missing); they'll do this by resending data as and when required.
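To make the non-blocking case concrete, here is a small POSIX sketch that keeps sending until the kernel reports that the call would block. The helper names set_nonblocking and fill_until_would_block are made up; the peer is assumed not to be reading.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Keep sending until the kernel refuses more data.
 * Returns the number of bytes accepted before send() reported
 * EAGAIN/EWOULDBLOCK, or -1 on any other error.
 */
static long fill_until_would_block(int fd)
{
    static const char chunk[4096];  /* zero-filled payload */
    long total = 0;

    for (;;) {
        ssize_t n = send(fd, chunk, sizeof chunk, 0);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;  /* buffers are full: back off and retry later */
        return -1;
    }
}
```

The returned total is roughly how much the local and remote buffers absorbed before the sender was pushed back; a real server would now wait for the socket to become writable (for example with poll() and POLLOUT) before sending the rest.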

TCP has a feature called flow control.
As part of the TCP protocol, the client tells the server how much more data can be sent without filling up the buffer. If the buffer fills up, the client tells the server that it can't send more data yet. Once the buffer is emptied out a bit, the client tells the server it can start sending data again. (This also applies to when the client is sending data to the server).
UDP on the other hand is completely different. UDP itself does not do anything like this and will start dropping data if it is coming in faster than the process can handle. It would be up to the application to add logic to the application protocol if it can't lose data (i.e. if it requires a 'reliable' data stream).

If you really want to understand TCP, you pretty much need to read an implementation in conjunction with the RFC; real TCP implementations are not exactly as specified. For example, Linux has a 'memory pressure' concept which protects against running out of the kernel's (rather small) pool of DMA memory, and also prevents one socket running any others out of buffer space.

The server can't be faster than the client for a long time. After it has been faster than the client for a while, the system where it is hosted will block it when it writes on the socket (writes can block on a full buffer just as reads can block on an empty buffer).

With TCP, this cannot happen.
In case of UDP, packets will be lost.

The TCP Wikipedia article shows the TCP header format, which is where the window size and acknowledgment sequence number are kept. The rest of the fields and the description there should give a good overview of how transmission throttling works. RFC 793 specifies the basic operations; pages 41 and 42 detail the flow control.
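As a small illustration of where those fields sit, the sketch below pulls the sequence number, acknowledgment number and window out of a raw TCP header using the RFC 793 layout. The struct and function names are hypothetical, and the window value is returned as it appears on the wire, without applying any window-scale option negotiated during the handshake.

```c
#include <stddef.h>
#include <stdint.h>

/* Flow-control related fields of a TCP header (RFC 793 layout). */
struct tcp_flow_fields {
    uint32_t seq;     /* sequence number, header bytes 4..7 */
    uint32_t ack;     /* acknowledgment number, header bytes 8..11 */
    uint16_t window;  /* advertised receive window, header bytes 14..15 */
};

/* Read a big-endian (network byte order) 32-bit value. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read a big-endian (network byte order) 16-bit value. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/*
 * Extract the fields from a raw TCP header.
 * Returns 0 on success, -1 if the buffer is shorter than the
 * 20-byte minimum TCP header.
 */
static int parse_tcp_flow_fields(const uint8_t *hdr, size_t len,
                                 struct tcp_flow_fields *out)
{
    if (len < 20)
        return -1;
    out->seq = read_be32(hdr + 4);
    out->ack = read_be32(hdr + 8);
    out->window = read_be16(hdr + 14);
    return 0;
}
```

The sender may have at most `window` bytes outstanding beyond `ack`; when the receiver advertises a window of 0 the sender has to stop until a later segment reopens it.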
