I've written a small program with the boost asio library to transfer files via TCP from a server to one or more clients.
During testing I found that the transfer is extremely slow, about 10 KiB/s. Nagle's algorithm is already disabled. If I transfer the same file via FileZilla from the same server to the same client, I get about 280 KiB/s, so obviously something is very wrong.
My approach so far was to fragment each file into smaller packets of 1024 bytes, send one fragment to the client (each fragment = one async_write call) and wait for the client's response before sending the next. I need to fragment the data to allow the client to keep track of the download progress and speed. In retrospect I suppose this was rather naïve, because the server has to wait for the client's response after each fragment. To check whether this was the bottleneck, I increased the fragment size twice, giving me the following results:
a) Fragment Size: 1024 bytes
Transfer Speed: ~10 KiB/s
b) Fragment Size: 8192 bytes
Transfer Speed: ~80 KiB/s
c) Fragment Size: 20000 bytes
Transfer Speed: ~195 KiB/s
The results speak for themselves, but I'm unsure what to do now.
I'm not too familiar with how the data transfer is actually handled internally, but if I'm not mistaken all of my data is basically added onto a stream? If that's the case, do I need to worry about how much data I write to that stream at once? Does it make a difference at all whether I use multiple write-calls with small fragments as opposed to one write-call with a large fragment? Are there any guidelines for this?
Simply stream the data to the client without artificial packetization. Re-enable Nagle's algorithm; this is not a scenario that calls for disabling it, and having it disabled will cause small inefficiencies.
Typical write buffer sizes would be 4KB and above.
The client can issue read calls to the network one after the other. After each successful read the client will have a new estimate of the current progress that is quite accurate. Typically there will be one successful read call for each network packet received. If the incoming rate is very high, multiple packets tend to be coalesced into one read. That's not a problem.
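For illustration, here is a minimal sketch of such a client read loop with boost::asio (not the asker's actual code: the class, the 64 KiB buffer and report_progress are made up, the total file size is assumed to have been announced beforehand, and error handling is elided):

    #include <boost/asio.hpp>
    #include <array>
    #include <cstdint>
    #include <fstream>
    #include <memory>
    #include <string>

    using boost::asio::ip::tcp;

    // Keep one read outstanding and update the progress estimate after every
    // completed read.
    class FileReceiver : public std::enable_shared_from_this<FileReceiver>
    {
    public:
        FileReceiver(tcp::socket socket, const std::string& path, std::uint64_t total)
            : socket_(std::move(socket)), file_(path, std::ios::binary), total_(total) {}

        void start() { read_next(); }

    private:
        void read_next()
        {
            auto self = shared_from_this();
            socket_.async_read_some(boost::asio::buffer(buffer_),
                [self](const boost::system::error_code& ec, std::size_t n)
                {
                    if (ec) return;                                   // error handling elided
                    self->file_.write(self->buffer_.data(), static_cast<std::streamsize>(n));
                    self->received_ += n;
                    self->report_progress(self->received_, self->total_);
                    if (self->received_ < self->total_)
                        self->read_next();                            // issue the next read
                });
        }

        void report_progress(std::uint64_t done, std::uint64_t total)
        {
            // Placeholder: update the UI, compute the transfer speed, etc.
            (void)done; (void)total;
        }

        tcp::socket socket_;
        std::ofstream file_;
        std::array<char, 64 * 1024> buffer_{};
        std::uint64_t total_ = 0;
        std::uint64_t received_ = 0;
    };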
If that's the case, do I need to worry about how much data I write to that stream at once?
No. Just keep a write call outstanding at all times.
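A minimal sketch of the corresponding sending side (again illustrative, not the original program; the 64 KiB block size is an arbitrary choice): read the next block from the file only after the previous async_write has completed, so there is always exactly one write outstanding until the end of the file.

    #include <boost/asio.hpp>
    #include <array>
    #include <fstream>
    #include <memory>
    #include <string>

    using boost::asio::ip::tcp;

    // Stream the whole file: one async_write in flight at all times, no
    // per-fragment acknowledgements from the client.
    class FileSender : public std::enable_shared_from_this<FileSender>
    {
    public:
        FileSender(tcp::socket socket, const std::string& path)
            : socket_(std::move(socket)), file_(path, std::ios::binary) {}

        void start() { write_next(); }

    private:
        void write_next()
        {
            file_.read(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
            const std::size_t n = static_cast<std::size_t>(file_.gcount());
            if (n == 0) return;                                       // whole file sent

            auto self = shared_from_this();
            boost::asio::async_write(socket_, boost::asio::buffer(buffer_.data(), n),
                [self](const boost::system::error_code& ec, std::size_t /*sent*/)
                {
                    if (!ec)
                        self->write_next();                           // immediately queue the next block
                });
        }

        tcp::socket socket_;
        std::ifstream file_;
        std::array<char, 64 * 1024> buffer_{};
    };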
Section 6.9 of RFC 7540 describes the mechanism for HTTP/2 flow control. There is a flow control window for each connection, and another flow control window for all streams on that connection. It provides a way for the receiver to set the initial flow control window for a stream:
Both endpoints can adjust the initial window size for new streams by including a value for SETTINGS_INITIAL_WINDOW_SIZE in the SETTINGS frame that forms part of the connection preface.
And a way for the receiver to increase the connection and stream flow control windows:
The payload of a WINDOW_UPDATE frame is one reserved bit plus an unsigned 31-bit integer indicating the number of octets that the sender can transmit in addition to the existing flow-control window. The legal range for the increment to the flow-control window is 1 to 2^31-1 (2,147,483,647) octets.
[...]
A sender that receives a WINDOW_UPDATE frame updates the corresponding window by the amount specified in the frame.
And a way for the receiver to increment or decrement the flow control windows for all streams (but not the connection) at once:
When the value of SETTINGS_INITIAL_WINDOW_SIZE changes, a receiver MUST adjust the size of all stream flow-control windows that it maintains by the difference between the new value and the old value.
But as far as I can tell, there is no way for the receiver to decrement a single stream's flow control window without changing the initial window size.
Is that correct? If so, why not? This seems like a reasonable thing to want to do if you are multiplexing many long-lived streams over a single connection. You may have some BDP-controlled memory budget for the overall connection, carved up across the streams, and are tuning the proportion that each stream gets according to its recent bandwidth demand. If one of them temporarily goes idle you'd like to be able to reset its window to be small so that it doesn't strand the memory budget, without affecting the other streams, and without making it impossible to receive new streams.
(Of course I understand that there is a race, and the sender may have sent data before receiving the decrement. But the window is already allowed to go negative due to the SETTINGS_INITIAL_WINDOW_SIZE mechanism above, so it seems like it would be reasonable to allow for a negative window here too.)
Is it really not possible to do this without depending on forward progress from the sender to eat up the stranded bytes in the flow control window?
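To make the window arithmetic concrete, here is a small bookkeeping sketch (plain C++, not tied to any HTTP/2 implementation) showing how the existing rules already drive a stream window negative when SETTINGS_INITIAL_WINDOW_SIZE is lowered:

    #include <cstdint>
    #include <iostream>

    // Per-stream flow-control bookkeeping as described in RFC 7540, section 6.9.
    // The window is kept signed because lowering SETTINGS_INITIAL_WINDOW_SIZE can
    // legally make it negative.
    int main()
    {
        std::int64_t initial_window = 65535;    // default SETTINGS_INITIAL_WINDOW_SIZE
        std::int64_t stream_window  = initial_window;

        stream_window -= 60000;                 // sender transmits 60000 octets

        // The receiver lowers the initial window size; every stream window is
        // adjusted by the difference between the new and the old value.
        const std::int64_t new_initial = 16384;
        stream_window += new_initial - initial_window;   // 5535 + (16384 - 65535) = -43616
        initial_window = new_initial;

        std::cout << "stream window: " << stream_window << "\n";   // negative: sender must wait

        // Only WINDOW_UPDATE (a positive increment) can raise this particular
        // stream's window again; there is no per-stream decrement frame.
        return 0;
    }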
Here's more detail on why I'm interested in the question, because I'm conscious of the XY problem.
I'm thinking about how to solve an RPC flow control issue. I have a server with a limited memory budget, and incoming streams with different priorities for how much of that memory they should be allowed to consume. I want to implement something like weighted max-min fairness across them, adjusting their flow control windows so that they sum to no more than my memory budget, but when we're not memory constrained we get maximum throughput.
For efficiency reasons, it would be desirable to multiplex streams of different priorities on a single connection. But then as demands change, or as other connections show up, we need to be able to adjust stream flow control windows downward so they still sum to no more than the budget. When stream B shows up or receives a higher priority but stream A is sitting on a bunch of flow control budget, we need to reduce A's window and increase B's.
Even without the multiplexing, the same problem applies at the connection level: as far as I can tell, there is no way to adjust the connection flow control window downward without changing the initial window size. Of course it will be adjusted downward as the client sends data, but I don't want to need to depend on forward progress from the client for this, since that may take arbitrarily long.
It's possible there is a better way to achieve this!
A server that has N streams, some of which are idle and some actively downloading data to the client, will typically re-allocate the connection window to the active streams.
For example, say you are watching a movie and downloading a big file from the same server at the same time.
Say the connection window is 100, and each stream has a window of 100 too (obviously, with many streams the sum of all stream windows will be capped by the connection window, but with only one stream it can be at the maximum).
Now, while you watch and download at the same time, each stream gets 50.
If you pause the movie, and the server knows about that (i.e. it does not exhaust the movie stream's window), then the server only has to serve one stream: the connection window is 100 and the single remaining stream (the download one) also has a window of 100, so the whole window is reallocated to the active stream.
You only get into problems if the client doesn't tell the server that the movie has been paused.
In this case, the server will continue to send movie data until the movie stream's window is exhausted (or nearly exhausted), and the client does not acknowledge that data because playback is paused.
At that point, the server notices that data is not acknowledged by one stream and stops sending data to it, but of course part of the connection window is taken, reducing the window of the active download stream.
From the server's point of view, it has a perfectly good connection where one stream (the download one) works wonderfully at max speed, but another stream hiccups, exhausts its window and causes the other stream to slow down (possibly to a halt), even though it's the same connection!
Obviously it cannot be a connection/communication issue, because one stream (the download one) works perfectly fine at max speed.
Therefore it is an application issue.
The HTTP/2 implementation on the server does not know that one of the streams is a movie that can be paused -- it's the application that must communicate this to the server and keep the connection window as large as possible.
Introducing a new HTTP/2 frame to "pause" downloads (or changing the semantics of the existing frames to accommodate a "pause" command) would have complicated the protocol quite substantially, for a feature that is 100% application driven: it is the application that must trigger the sending of the "pause" command, and at that point it can just send its own "pause" message to the server without complicating the HTTP/2 specification.
It is an interesting case where HTTP/1.1 and HTTP/2 behave very differently and require different code to work in a similar way.
With HTTP/1.1 you would have one connection for the movie and one for the download, they would be independent, and the client application would not need to communicate to the server that the movie was paused -- it could just stop reading from the movie connection until it became TCP congested without affecting the download connection -- assuming that the server is non-blocking to avoid scalability issues.
I'm reading about WebSocket and I see that the protocol has data fragmentation (frames): a WebSocket message is composed of one or more frames. But isn't that what TCP already does (fragmentation of data)? I'm confused.
Fragmentation in the context of data transfer just means splitting the original data into smaller parts for transfer and combining these fragments later (for example at the recipient's side) to recreate the original data.
Fragmentation is often done if the underlying layer cannot handle larger messages or if larger messages would result in performance problems. Such problems might arise because it is more expensive if one large message is lost and needs to be repeated instead of only a small fragment. Or it can be a performance problem if the transfer of one large message would block the delivery of smaller messages. In this case it is useful to split the large message into fragments and deliver these fragments interleaved with the other messages, so that the other messages don't have to wait for delivery until the large message is done.
Fragmentation of messages in WebSockets is just one of the many types of fragmentation which exist at various layers at the data transport, like:
IP messages can be fragmented at the sender or some middlebox and get reassembled at the end.
TCP is a data stream. The various parts of the stream are transferred in different IP packets and get reassembled in the correct order at the recipient.
Application layer protocols like HTTP can have fragments too, for example the chunked Transfer-Encoding mode within HTTP (a small sketch follows this list) or the fragments in WebSockets.
And at even higher layers there can be more fragments, like the spreading of a single large ZIP file into multiple parts onto floppy disks in former times or the accelerating of downloads by requesting different parts of the same file in parallel connections and combining these at the recipient.
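To make the chunked Transfer-Encoding example from the list concrete, here is a small sketch that emits a chunked HTTP/1.1 body; the fragment contents are arbitrary example data:

    #include <cstdio>
    #include <iostream>
    #include <string>
    #include <vector>

    // Chunked Transfer-Encoding: every fragment is prefixed with its size in
    // hexadecimal, and a zero-length chunk terminates the message.
    int main()
    {
        const std::vector<std::string> fragments = {"Hello, ", "chunked ", "world!"};
        std::string body;
        for (const auto& fragment : fragments)
        {
            char size_line[16];
            std::snprintf(size_line, sizeof(size_line), "%zx\r\n", fragment.size());
            body += size_line;
            body += fragment;
            body += "\r\n";
        }
        body += "0\r\n\r\n";   // terminating zero-length chunk
        std::cout << body;
        return 0;
    }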
I love the detailed answer by Steffen Ullrich, but I wish to add a few specific details regarding the differences between raw TCP/IP and the added Websockets layer.
TCP/IP is a stream protocol, meaning the application receives the data as fragmented pieces as data become available, with no clear indication of the fragmented "packet boundaries" or the original (non-fragmented) data structure.
The Websocket protocol is a message based protocol, meaning that the application will only receive the full Websocket message once all the fragmented pieces have arrived and put back together.
As a very simplified example:
TCP/IP: if a 50 MB file is sent using TCP, the application will probably receive a piece of the file at a time and it will need to piece the file back together (possibly saving each piece to temporary disk storage).
Websocket: if a 50 MB file is sent using the Websocket protocol, the application will receive the whole 50 MB in one message (and the storage of all of the data, in memory or on disk, will be dictated by the Websocket layer, not the application layer).
Note that the Websocket Protocol is an additional layer over the TCP/IP protocol, so data is streamed over TCP/IP and the Websocket layer puts the pieces back together before forwarding the original (whole) message.
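A sketch of that difference (using boost::asio for the raw TCP side; the WebSocket callback in the trailing comment is a generic shape, not any specific library's API):

    #include <boost/asio.hpp>
    #include <string>
    #include <vector>

    using boost::asio::ip::tcp;

    // Raw TCP: the application sees whatever the stream delivers, in pieces of
    // arbitrary size, and must reassemble the original data itself.
    std::string read_exactly(tcp::socket& socket, std::size_t total)
    {
        std::string data;
        std::vector<char> piece(64 * 1024);
        while (data.size() < total)
        {
            // read_some may return any number of bytes, unrelated to how the
            // sender called write().
            const std::size_t n = socket.read_some(boost::asio::buffer(piece));
            data.append(piece.data(), n);
        }
        return data;
    }

    // A WebSocket library, by contrast, typically exposes something like
    //     ws.on_message([](const std::string& whole_message) { /* ... */ });
    // and only invokes the callback once all frames of the message have arrived.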
From RFC 6455, Section 5.4 (Fragmentation):
A secondary use-case for fragmentation is for multiplexing, where it is not desirable for a large message on one logical channel to monopolize the output channel, so the multiplexing needs to be free to split the message into smaller fragments to better share the output channel. (Note that the multiplexing extension is not described in this document.)
Even though it's listed as a secondary reason, I'd say that's the primary reason for that fragmentation feature. Imagine you start sending a first message of 1 GB in size, and right away you also send a second message of 1 KB. Framing allows a multiplexing extension to inject the second message between individual frames of the first message; this way the receiver will not need to wait for the whole 1 GB to be transferred and will receive and handle the 1 KB second message right away.
How does MPI_Send() communicate the data to the receiving process if the size of the sending data is greater than the MPI buffer size? For example, let's say that I want to send 10 bytes of data (i.e., the size of my application buffer is 10B) in a single send message, but the MPI buffer has a fixed size of 6B. In this case, how does MPI_Send() send the data? Does it transfer first 6B and then transfer the remaining 4B? Or does it transfer only 6B?
There's a few different kinds of buffers involved in MPI messages so I want to make it clear what each of them do.
Application Buffers - These buffers are allocated and managed by your application. Your data is stored in these, you do calculations with these, you pass these into MPI to tell it where to send or receive data. These are sized as large or larger than your data.
Internal Buffers - These buffers are internal to MPI and may or may not even exist. There's nothing in the MPI Standard about these buffers or how they are supposed to act, how big they are supposed to be, etc. However, there are some reasonable assumptions that you can make.
Usually there will be some internal buffers that are used to speed up data transfers, especially for small messages. If your message is small enough, it could be copied into this buffer to be queued for transfer at a later time. This is what usually happens if you do a very small MPI_SEND. The call will return immediately, but the data may or may not have actually been sent to the receiving process. There are similar buffers on the receiving side, so if a small message arrives before the application provides an application buffer where the data can be stored, it can be dropped into one of these smaller internal buffers until its eventual destination is specified. This is usually called the eager protocol.
Sometimes, the internal buffers are either all used up or are too small for your message to be copied into them. In this case, MPI falls back to the rendezvous protocol. In this instance, MPI usually doesn't use an internal buffer at all, but retains control of the application buffer and sends data directly from there. If this is happening, your call to MPI_SEND won't return until the MPI library is done using the buffer and it's safe for your application to modify the data again (a sketch of how this difference can show up follows below).
Special Buffers - There are other kinds of buffers that might provide special services, such as buffers on the network card that can speed up data transfers. The way these behave is usually specific to the type of network you're using.
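A small sketch of how the eager/rendezvous difference can become visible to the application (the eager threshold and any internal buffer sizes are implementation specific; MPI_Ssend is the standard call that never completes before the matching receive has started):

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        std::vector<char> small_msg(10, 'x');        // likely below the eager threshold
        std::vector<char> large_msg(1 << 24, 'x');   // 16 MiB, likely above it

        if (rank == 0)
        {
            // Probably copied into an internal buffer; returns almost immediately.
            MPI_Send(small_msg.data(), static_cast<int>(small_msg.size()), MPI_CHAR, 1, 0, MPI_COMM_WORLD);

            // Probably sent with the rendezvous protocol; MPI_Send may not return
            // until the receiver has posted a matching receive.
            MPI_Send(large_msg.data(), static_cast<int>(large_msg.size()), MPI_CHAR, 1, 1, MPI_COMM_WORLD);

            // MPI_Ssend waits for the matching receive regardless of message size.
            MPI_Ssend(small_msg.data(), static_cast<int>(small_msg.size()), MPI_CHAR, 1, 2, MPI_COMM_WORLD);
        }
        else if (rank == 1)
        {
            MPI_Recv(small_msg.data(), static_cast<int>(small_msg.size()), MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(large_msg.data(), static_cast<int>(large_msg.size()), MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(small_msg.data(), static_cast<int>(small_msg.size()), MPI_CHAR, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }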
I'm dumping data from my server app to my client app in chunks (TCP/IP). At some point the client may wish to abort the transfer and make a new request to the server. The rapid approach to get this done is to kill the TCP connection so that any data already sent by the server and live on the network is dumped. The new connection will handle the new request/transfer so there is no delay in receiving old redundant data.
Is this an acceptable solution?
NB: I did consider breaking the chunks into smaller sizes separated by client ack messages, but then you have the problem of fixing a chunk size: too small and there are too many acks (slower transfer); too big and there is still a residual delay in dumping redundant data.
Any ideas or standard design approaches that I should be aware of?
TIA
You can use two TCP connections, similar to FTP: one to send control requests to the server and the other to transfer the actual data. If you wish to abort a transfer, just send a request to abort it over the control channel.
Send the data in chunks but don't acknowledge. When the client wants to abort the transfer make it send a cancellation request to the server. The client now just throws away chunks (which are still arriving). Eventually, the server gets the cancellation request and stops sending data. If you want to you can make the server send an acknowledgement of cancellation.
This way you can have small chunks with minimal overhead. You could have 1KB chunks with a 4 or 8 byte chunk header containing the size of the chunk. That is an extremely small bandwidth and latency overhead.
Note that a small chunk does not generally result in a small IP packet. TCP streams data. It does not care about your chunk size.
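A sketch of that framing with boost::asio's blocking calls (the 4-byte big-endian length prefix is an illustrative choice):

    #include <boost/asio.hpp>
    #include <array>
    #include <cstdint>
    #include <vector>

    using boost::asio::ip::tcp;

    // Each chunk is prefixed by a 4-byte big-endian length.
    void send_chunk(tcp::socket& socket, const std::vector<char>& payload)
    {
        const auto len = static_cast<std::uint32_t>(payload.size());
        const std::array<unsigned char, 4> header = {
            static_cast<unsigned char>(len >> 24), static_cast<unsigned char>(len >> 16),
            static_cast<unsigned char>(len >> 8),  static_cast<unsigned char>(len)
        };
        const std::array<boost::asio::const_buffer, 2> buffers = {
            boost::asio::buffer(header), boost::asio::buffer(payload)
        };
        boost::asio::write(socket, buffers);        // header + payload in one call
    }

    std::vector<char> read_chunk(tcp::socket& socket)
    {
        std::array<unsigned char, 4> header{};
        boost::asio::read(socket, boost::asio::buffer(header));
        const std::uint32_t len = (std::uint32_t(header[0]) << 24) | (std::uint32_t(header[1]) << 16)
                                | (std::uint32_t(header[2]) << 8)  |  std::uint32_t(header[3]);
        std::vector<char> payload(len);
        boost::asio::read(socket, boost::asio::buffer(payload));
        return payload;                             // a cancelling client simply discards this
    }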
I am trying to get a handle on what happens when a server publishes (over tcp, udp, etc.) faster than a client can consume the data.
Within a program I understand that if a queue sits between the producer and the consumer, it will start to get larger. If there is no queue, then the producer simply won't be able to produce anything new, until the consumer can consume (I know there may be many more variations).
I am not clear on what happens when data leaves the server (which may be a different process, machine or data center) and is sent to the client. If the client simply can't respond to the incoming data fast enough, assuming the server and the consumer are very loosely coupled, what happens to the in-flight data?
Where can I read to get details on this topic? Do I just have to read the low level details of TCP/UDP?
Thanks
With TCP there's a TCP window which is used for flow control. TCP only allows a certain amount of data to remain unacknowledged at a time. If a server is producing data faster than a client is consuming it, then the amount of unacknowledged data will increase until the TCP window is 'full'; at this point the sending TCP stack will wait and will not send any more data until the client acknowledges some of the data that is pending.
With UDP there's no such flow control system; it's unreliable after all. The UDP stacks on both client and server are allowed to drop datagrams if they feel like it, as are all routers between them. If you send more datagrams than the link can deliver to the client or if the link delivers more datagrams than your client code can receive then some of them will get thrown away. The server and client code will likely never know unless you have built some form of reliable protocol over basic UDP. Though actually you may find that datagrams are NOT thrown away by the network stack and that the NIC drivers simply chew up all available non-paged pool and eventually crash the system (see this blog posting for more details).
Back with TCP, how your server code deals with the TCP Window becoming full depends on whether you are using blocking I/O, non-blocking I/O or async I/O.
If you are using blocking I/O then your send calls will block and your server will slow down; effectively your server is now in lock step with your client. It can't send more data until the client has received the pending data.
If the server is using non blocking I/O then you'll likely get an error return that tells you that the call would have blocked; you can do other things but your server will need to resend the data at a later date...
If you're using async I/O then things may be more complex. With async I/O using I/O Completion Ports on Windows, for example, you won't notice anything different at all. Your overlapped sends will still be accepted just fine, but you might notice that they are taking longer to complete. The overlapped sends are being queued on your server machine and are using memory for your overlapped buffers, and probably using up 'non-paged pool' as well. If you keep issuing overlapped sends then you run the risk of exhausting non-paged pool memory or using a potentially unbounded amount of memory as I/O buffers. Therefore with async I/O and servers that COULD generate data faster than their clients can consume it, you should write your own flow control code that you drive using the completions from your writes. I have written about this problem on my blog here and here and my server framework provides code which deals with it automatically for you.
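Here is a minimal sketch of that kind of application-level flow control with boost::asio (the class, the 4 MiB cap and the single-threaded assumption are illustrative; error handling and object lifetime management are elided): track how many bytes are queued but not yet written, and tell the producer to pause once the cap is reached.

    #include <boost/asio.hpp>
    #include <cstddef>
    #include <deque>
    #include <vector>

    using boost::asio::ip::tcp;

    // Cap the amount of data queued behind the socket so a slow client cannot
    // make the server buffer an unbounded amount of memory.
    class BoundedSender
    {
    public:
        explicit BoundedSender(tcp::socket socket) : socket_(std::move(socket)) {}

        // Returns false when the producer should pause until writes drain.
        bool send(std::vector<char> data)
        {
            outstanding_ += data.size();
            queue_.push_back(std::move(data));
            if (queue_.size() == 1)
                write_next();
            return outstanding_ < kMaxOutstanding;
        }

    private:
        void write_next()
        {
            boost::asio::async_write(socket_, boost::asio::buffer(queue_.front()),
                [this](const boost::system::error_code& ec, std::size_t sent)
                {
                    if (ec) return;                    // error handling elided
                    outstanding_ -= sent;
                    queue_.pop_front();
                    if (!queue_.empty())
                        write_next();
                    // else: notify the producer that it may resume (not shown)
                });
        }

        static constexpr std::size_t kMaxOutstanding = 4 * 1024 * 1024;   // 4 MiB cap
        tcp::socket socket_;
        std::deque<std::vector<char>> queue_;
        std::size_t outstanding_ = 0;
    };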
As far as the data 'in flight' is concerned the TCP stacks in both peers will ensure that the data arrives as expected (i.e. in order and with nothing missing), they'll do this by resending data as and when required.
TCP has a feature called flow control.
As part of the TCP protocol, the client tells the server how much more data can be sent without filling up the buffer. If the buffer fills up, the client tells the server that it can't send more data yet. Once the buffer is emptied out a bit, the client tells the server it can start sending data again. (This also applies to when the client is sending data to the server).
UDP on the other hand is completely different. UDP itself does not do anything like this and will start dropping data if it is coming in faster than the process can handle. It would be up to the application to add logic to the application protocol if it can't lose data (i.e. if it requires a 'reliable' data stream).
If you really want to understand TCP, you pretty much need to read an implementation in conjunction with the RFC; real TCP implementations are not exactly as specified. For example, Linux has a 'memory pressure' concept which protects against running out of the kernel's (rather small) pool of DMA memory, and also prevents one socket running any others out of buffer space.
The server can't be faster than the client for a long time. After it has been faster than the client for a while, the system where it is hosted will block it when it writes on the socket (writes can block on a full buffer just as reads can block on an empty buffer).
With TCP, this cannot happen.
In case of UDP, packets will be lost.
The TCP Wikipedia article shows the TCP header format, which is where the window size and acknowledgment sequence number are kept. The rest of the fields and the description there should give a good overview of how transmission throttling works. RFC 793 specifies the basic operations; pages 41 and 42 detail the flow control.