I read somewhere (but cannot find the source anymore) that there is a certain maximum number of bytes that can be sent in the first TCP window. Sending more data requires ACK from the receiver, hence another round-trip. To reduce website latency, all above-the-fold content, including HTTP reply headers, should be less than this number of bytes.
Can anybody remember what the maximum number of bytes in the first TCP window is and how it is calculated?
This is governed by the initial TCP congestion window (initcwnd). This parameter determines how many segments (each up to MSS bytes) can be sent during the first phase of slow start without waiting for an ACK. The currently recommended value for most workloads is 10 (RFC 6928), but some older systems still use 4. Also note that the window actually used depends on the client's receive window as well: if a client advertises a receive window smaller than your initial congestion window, that receive window becomes the limit.
For more info, refer to this page.
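As a rough back-of-the-envelope check (a minimal sketch; the 1460-byte MSS assumes a typical 1500-byte Ethernet MTU with no extra TCP options):

```go
package main

import "fmt"

func main() {
	// Rough first-round-trip budget: initcwnd × MSS.
	// MSS depends on the path MTU; 1460 assumes a 1500-byte Ethernet MTU
	// minus 40 bytes of IPv4 + TCP headers (no options).
	const mss = 1460
	for _, initcwnd := range []int{4, 10} {
		fmt.Printf("initcwnd=%2d -> ~%d bytes before the first ACK is needed\n",
			initcwnd, initcwnd*mss)
	}
	// With initcwnd=10 that is roughly 14.6 kB, which is where the common
	// "keep above-the-fold content under ~14 kB" advice comes from.
}
```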
Section 6.9 of RFC 7540 describes the mechanism for HTTP/2 flow control. There is a flow control window for each connection, and another flow control window for all streams on that connection. It provides a way for the receiver to set the initial flow control window for a stream:
Both endpoints can adjust the initial window size for new streams by including a value for SETTINGS_INITIAL_WINDOW_SIZE in the SETTINGS frame that forms part of the connection preface.
And a way for the receiver to increase the connection and stream flow control windows:
The payload of a WINDOW_UPDATE frame is one reserved bit plus an unsigned 31-bit integer indicating the number of octets that the sender can transmit in addition to the existing flow-control window. The legal range for the increment to the flow-control window is 1 to 2^31-1 (2,147,483,647) octets.
[...]
A sender that receives a WINDOW_UPDATE frame updates the corresponding window by the amount specified in the frame.
And a way for the receiver to increment or decrement the flow control windows for all streams (but not the connection) at once:
When the value of SETTINGS_INITIAL_WINDOW_SIZE changes, a receiver MUST adjust the size of all stream flow-control windows that it maintains by the difference between the new value and the old value.
But as far as I can tell, there is no way for the receiver to decrement a single stream's flow control window without changing the initial window size.
Is that correct? If so, why not? This seems like a reasonable thing to want to do if you are multiplexing many long-lived streams over a single connection. You may have some BDP-controlled memory budget for the overall connection, carved up across the streams, and are tuning the proportion that each stream gets according to its recent bandwidth demand. If one of them temporarily goes idle you'd like to be able to reset its window to be small so that it doesn't strand the memory budget, without affecting the other streams, and without making it impossible to receive new streams.
(Of course I understand that there is a race, and the sender may have sent data before receiving the decrement. But the window is already allowed to go negative due to the SETTINGS_INITIAL_WINDOW_SIZE mechanism above, so it seems like it would be reasonable to allow for a negative window here too.)
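To make that concrete, here is a minimal arithmetic sketch (made-up numbers) of the SETTINGS_INITIAL_WINDOW_SIZE adjustment that already allows a stream window to go negative:

```go
package main

import "fmt"

func main() {
	// A stream was opened while SETTINGS_INITIAL_WINDOW_SIZE was 65535, and the
	// peer has already sent 50000 unacknowledged octets on it.
	oldInitial, sent := int64(65535), int64(50000)
	streamWindow := oldInitial - sent // 15535 octets remaining

	// The receiver now lowers SETTINGS_INITIAL_WINDOW_SIZE to 16384. Per RFC 7540
	// §6.9.2, every stream window is adjusted by the delta, which may push it
	// below zero; the sender must then wait for WINDOW_UPDATEs.
	newInitial := int64(16384)
	streamWindow += newInitial - oldInitial

	fmt.Println(streamWindow) // -33616
}
```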
Is it really not possible to do this without depending on forward progress from the sender to eat up the stranded bytes in the flow control window?
Here's more detail on why I'm interested in the question, because I'm conscious of the XY problem.
I'm thinking about how to solve an RPC flow control issue. I have a server with a limited memory budget, and incoming streams with different priorities for how much of that memory they should be allowed to consume. I want to implement something like weighted max-min fairness across them, adjusting their flow control windows so that they sum to no more than my memory budget, but when we're not memory constrained we get maximum throughput.
For efficiency reasons, it would be desirable to multiplex streams of different priorities on a single connection. But then as demands change, or as other connections show up, we need to be able to adjust stream flow control windows downward so they still sum to no more than the budget. When stream B shows up or receives a higher priority but stream A is sitting on a bunch of flow control budget, we need to reduce A's window and increase B's.
Even without the multiplexing, the same problem applies at the connection level: as far as I can tell, there is no way to adjust the connection flow control window downward without changing the initial window size. Of course it will be adjusted downward as the client sends data, but I don't want to need to depend on forward progress from the client for this, since that may take arbitrarily long.
It's possible there is a better way to achieve this!
A server that has N streams, of which some idle and some actively downloading data to the client, will typically re-allocate the connection window to active streams.
For example, say you are watching a movie and downloading a big file from the same server at the same time.
The connection window is 100, and each stream has a window of 100 too (obviously, with many streams the total data in flight across all streams is capped by the connection window, but a single stream's window can be as large as the connection window).
Now, when you watch and download at the same time, each stream effectively gets 50.
If you pause the movie, and the server knows about that (i.e. it does not exhaust the movie stream's window), then the server only has to serve one stream: with a connection window of 100 and a single active stream (the download) that also has a window of 100, the whole connection window is effectively reallocated to the active stream.
You only get into problems if the client doesn't tell the server that the movie has been paused.
In this case, the server will continue to send movie data until the movie stream window is exhausted (or nearly exhausted), and the client does not acknowledge that data because playback is paused.
At that point, the server notices that data on one stream is not being acknowledged and stops sending to it, but of course part of the connection window is now taken up, reducing the window of the active download stream.
From the server's point of view, it has a perfectly good connection where one stream (the download) works wonderfully at max speed, but another stream hiccups, exhausts its window, and causes the other stream to slow down (possibly to a halt), even though it's the same connection!
Obviously it cannot be a connection/communication issue, because one stream (the download one) works perfectly fine at max speed.
Therefore it is an application issue.
The HTTP/2 implementation on the server does not know that one of the streams is a movie that can be paused -- it's the application that must communicate this to the server and keep the connection window as large as possible.
Introducing a new HTTP/2 frame to "pause" downloads (or changing the semantics of existing frames to accommodate a "pause" command) would have complicated the protocol quite substantially, for a feature that is 100% application driven: it is the application that must trigger the "pause" command anyway, and at that point it can just as well send its own "pause" message to the server without complicating the HTTP/2 specification.
It is an interesting case where HTTP/1.1 and HTTP/2 behave very differently and require different code to work in a similar way.
With HTTP/1.1 you would have one connection for the movie and one for the download, they would be independent, and the client application would not need to communicate to the server that the movie was paused -- it could just stop reading from the movie connection until it became TCP congested without affecting the download connection -- assuming that the server is non-blocking to avoid scalability issues.
I came across the concept of window size when browsing gRPC's dial options. Because gRPC uses HTTP/2 underneath, I dug this article up, which describes:
Flow control window is just nothing more than an integer value indicating the buffering capacity of the receiver. Each sender maintains a separate flow control window for each stream and for the overall connection.
If this is the window size gRPC is talking about, then as I understand it, this is how HTTP/2 maintains multiple concurrent streams within the same connection: basically a number advertised to the sender about how much data the receiver wants the sender to send next, with each stream's data counted against its own window (in addition to the connection window) for flow control purposes.
My questions are: is the window all or nothing? Meaning if my window size is n bytes, the stream won't send any data until it's accumulated at least n bytes? More generally, how do I maximize the performance of my stream if I maintain only one stream? I assume a bigger window size would help avoid overhead but increase the risk of data loss?
Meaning if my window size is n bytes, the stream won't send any data until it's accumulated at least n bytes?
No.
The sender can send any number of bytes less than or equal to n.
More generally, how do I maximize the performance of my stream if I maintain only one stream?
For just one stream, just use the max possible value, 2^31-1.
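For example, if you came to this from gRPC-Go, these are the dial options that set the initial stream and connection windows (a sketch; the address is a placeholder and the exact bounds and defaults depend on your gRPC version):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	const window = 16 * 1024 * 1024 // 16 MiB, similar to what browsers advertise

	conn, err := grpc.Dial("localhost:50051", // placeholder address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Initial per-stream flow control window (HTTP/2 SETTINGS_INITIAL_WINDOW_SIZE).
		grpc.WithInitialWindowSize(window),
		// Initial connection-level flow control window.
		grpc.WithInitialConnWindowSize(window),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```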
Furthermore, you want to configure the receiver to send WINDOW_UPDATE frames soon enough, so that the sender always has a large enough flow control window that allows it to never stop sending.
One important thing to note is that the configuration of the max flow control window is related to the memory capacity of the receiver.
Since HTTP/2 is multiplexed, the implementation must continue to read data until the flow control window is exhausted.
Using the max flow control window, 2 GiB, means that the receiver needs to be prepared to buffer at least up to 2 GiB of data, until the application decides to consume that data.
In other words: reading the data from the network by the implementation, and consuming that data by the application may happen at different speeds; if reading is faster than consuming, the implementation must read the data and accumulate it aside until the application can consume it.
When the application consumes the data, it tells the implementation how many bytes were consumed, and the implementation may send a WINDOW_UPDATE frame to the sender, to enlarge the flow control window again, so the sender can continue to send.
Note that implementations really want to apply backpressure, i.e. wait for applications to consume the data before sending WINDOW_UPDATEs back to the sender.
If the implementation (wrongly) acknowledges consumption of data before passing it to the application, then it is open to memory blow-up, as the sender will continue to send, but the receiver is forced to accumulate it aside until the host memory of the receiver is exhausted (assuming the application is slower to consume data than the implementation to read data from the network).
Given the above, a single connection, for the max flow control window, may require up to 2 GiB of memory.
Imagine 1024 connections (not that many for a server), and you need 2 TiB of memory.
Also consider that for such large flow control windows, you may hit TCP congestion (head of line blocking) before the flow control window is exhausted.
If this happens, you are basically back to the TCP connection capacity, meaning that HTTP/2 flow control limits never trigger because the TCP limits trigger before (or you are otherwise limited by bandwidth, etc.).
Another consideration is that you want to avoid the sender exhausting the flow control window and therefore being forced to stall and stop sending.
For a flow control window of 1 MiB, you don't want to receive 1 MiB of data, consume it and then send back a WINDOW_UPDATE of 1 MiB, because otherwise the client will send 1 MiB, stall, receive the WINDOW_UPDATE, send another 1 MiB, stall again, etc. (see also how to use Multiplexing http2 feature when uploading).
Historically, small flow control windows (such as the 64 KiB default suggested in the specification) caused very slow downloads in browsers, which quickly realized that they needed to tell servers their flow control window was large enough that the server would not stall the downloads.
Currently, Firefox and Chrome set it at 16 MiB.
You want to feed the sender with WINDOW_UPDATEs so it never stalls.
This is a combination of how fast the application consumes the received data, how much you want to "accumulate" the number of consumed bytes before sending the WINDOW_UPDATE (to avoid sending WINDOW_UPDATE too frequently), and how long it takes for the WINDOW_UPDATE to go from receiver to sender.
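Here is a rough sketch of that receiver-side strategy using golang.org/x/net/http2's framer; the half-window threshold, the locking, and the type itself are my own assumptions, not something the spec or a particular library mandates:

```go
package http2flow

import (
	"sync"

	"golang.org/x/net/http2"
)

// windowFeeder returns consumed bytes to the sender in batches: it sends a
// WINDOW_UPDATE once the application has consumed at least half of the
// advertised window, rather than after every read, so the sender rarely
// stalls and is not flooded with tiny updates.
type windowFeeder struct {
	mu       sync.Mutex
	framer   *http2.Framer
	streamID uint32 // 0 would target the connection window instead of a stream
	window   uint32 // window size we advertised to the sender
	unacked  uint32 // bytes consumed by the app but not yet returned via WINDOW_UPDATE
}

// consumed is called by the application after it has processed n bytes.
func (w *windowFeeder) consumed(n uint32) error {
	w.mu.Lock()
	defer w.mu.Unlock()

	w.unacked += n
	if w.unacked < w.window/2 {
		return nil // accumulate; too early to bother the sender
	}
	if err := w.framer.WriteWindowUpdate(w.streamID, w.unacked); err != nil {
		return err
	}
	w.unacked = 0
	return nil
}
```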
The GATT architecture of BLE lends itself to small, fixed pieces of data (20 bytes max per characteristic). But in some cases, you end up wanting to “stream” some arbitrary length of data that is greater than 20 bytes. For example, a firmware upgrade, even if you know it's slow.
I’m curious what scheme others have used if any, to “stream” data (even if small and slow) over BLE characteristics.
I’ve used two different schemes to date:
One was to use a control characteristic, where the receiving device notified the sending device how much data it had received, and the sending device then used that to trigger the next write (I did this both with_response and without_response) on a different characteristic.
Another scheme I used recently was to chunk the data into 19-byte segments, where the first byte indicates the number of packets to follow; when it hits 0, that tells the receiver that all of the recent updates can be concatenated and processed as a single packet.
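A minimal sketch of that second scheme (the 20-byte packet size and the countdown byte follow the description above; the function name and Go types are mine):

```go
package blechunk

// chunk splits data into BLE write payloads of at most 20 bytes:
// 1 countdown byte (packets still to follow) + up to 19 data bytes.
// The receiver reassembles by concatenating payloads until the countdown
// reaches 0. Note the single countdown byte limits a message to 256 packets
// (~4.8 kB); larger transfers need a bigger header or a different scheme.
func chunk(data []byte) [][]byte {
	const payload = 19
	total := (len(data) + payload - 1) / payload

	packets := make([][]byte, 0, total)
	for i := 0; i < total; i++ {
		start := i * payload
		end := start + payload
		if end > len(data) {
			end = len(data)
		}
		pkt := make([]byte, 0, 1+end-start)
		pkt = append(pkt, byte(total-1-i)) // packets remaining after this one
		pkt = append(pkt, data[start:end]...)
		packets = append(packets, pkt)
	}
	return packets
}
```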
The kind of answer I'm looking for, is an overview of how someone with experience has implemented a decent schema for doing this. And can justify why what they did is the best (or at least better) solution.
After some review of existing protocols, I ended up designing a protocol for over-the-air update of my BLE peripherals.
Design assumptions
1. we cannot predict stack behavior (the protocol will be used with all our products, whatever the chip and vendor stack, on either the peripheral or the central side, potentially unknown yet),
2. use a standard GATT service,
3. avoid L2CAP fragmentation,
4. assume packets get queued before TX,
5. assume there may be some dropped packets (even if stacks should not drop any),
6. avoid unnecessary packet round-trips,
7. put code complexity on the central side,
8. assume 4.2 enhancements are unavailable.
Assumption 1 implies 2-5; 6 is a performance requirement, 7 is an optimization, and 8 is for portability.
Overall design
After discovery of service and reading a few read-only characteristics to check compatibility of device with image to be uploaded, all upload takes place between two characteristics:
payload (write only, without response),
status (notifiable).
The whole firmware image is sent in chunks through the payload characteristic.
Payload is a 20-byte characteristic: 4-byte chunk offset, plus 16-byte data chunk.
Status notifications tell whether there is an error condition or not, and the next expected payload chunk offset. This way, the uploader can tell whether it may go on speculatively, sending chunks from its own offset, or whether it should resume from the offset found in the status notification.
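A sketch of how one payload write could be packed, following the 4-byte offset plus 16-byte chunk layout described above (the big-endian byte order and the function name are assumptions, not part of the original protocol description):

```go
package otaproto

import "encoding/binary"

// packPayload builds one 20-byte payload characteristic write:
// a 4-byte chunk offset followed by a 16-byte data chunk.
// chunk is expected to be at most 16 bytes; a shorter final chunk leaves
// the remaining bytes zero (an assumption, not part of the spec above).
func packPayload(offset uint32, chunk []byte) [20]byte {
	var pdu [20]byte
	binary.BigEndian.PutUint32(pdu[:4], offset) // byte order is an assumption
	copy(pdu[4:], chunk)
	return pdu
}
```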
Status updates are sent for two main reasons:
when all goes well (payloads flying in, in order), at a given rate (like 4Hz, not on every packet),
on error (out of order, after some time without payload received, etc.), with the same given rate (not on every erroneous packet either).
The receiver expects all chunks in order; it does no reordering. If a chunk is out of order, it gets dropped, and an error status notification is pushed.
When a status comes in, it acknowledges all chunks with smaller offsets implicitly.
Lastly, there is a transmit window on the sender side: a run of successful acknowledgements allows the sender to enlarge its window (send more chunks ahead of the matching acknowledgement). The window is reduced if errors happen, since dropped chunks are probably due to a queue overflow somewhere.
Discussion
Using "one way" PDUs (write without response and notification) is to avoid 6. above, as ATT protocol explicitly tells acknowledged PDUs (write, indications) must not be pipelined (i.e. you may not send next PDU until you received response).
Status, containing the last received chunk, palliates 5.
To abide 2. and 3., payload is a 20-byte characteristic write. 4+16 has numerous advantages, one being the offset validation with a 16-byte chunk only involves shifts, another is that chunks are always page-aligned in target flash (better for 7.).
To cope with 4., more than one chunk is sent before receiving status update, speculating it will be correctly received.
This protocol has the following features:
it adapts to radio conditions,
it adapts to queues on sender side,
there is no status flooding from target,
queues are kept filled, which allows the whole central stack to use every possible TX opportunity.
Some parameters are out of this protocol:
central should enforce short connection interval (try to enforce it in the updater app);
slave PHY should be well-behaved with slave latency (YMMV, test your vendor's stack);
you should probably compress your payload to reduce transfer time.
Numbers
With:
15% compression,
a device connected with connectionInterval = 10ms,
a master PHY limiting every connection event to 4-5 TX packets,
average radio conditions.
I get 3.8 packets per connection event on average, i.e. ~6 kB/s of useful payload after packet loss, protocol overhead, etc.
This way, upload of a 60 kB image is done in less than 10 seconds, the whole process (connection, discovery, transfer, image verification, decompression, flashing, reboot) under 20 seconds.
It depends a bit on what kind of central device you have.
Generally, Write Without Response is the way to stream data over BLE.
Packets being received out of order should not happen, since BLE's link layer never sends the next packet before the previous one has been acknowledged.
For Android it's very easy: just use Write Without Response to send all packets, one after another. Once you get the onCharacteristicWrite you send the next packet. That way Android automatically queues up the packets and it also has its own mechanism for flow control. When all its buffers are filled up, the onCharacteristicWrite will be called when there is space again.
iOS is not that smart, however. If you send a lot of Write Without Response packets and the internal buffers are full, iOS will silently drop new packets. There are two ways around this: either implement some (maybe complex) protocol where the peripheral notifies the status of the transmission, like Nipo's answer, or, more simply, send every 10th packet or so as a Write With Response and the rest as Write Without Response. That way iOS will queue up all packets for you and not drop the Write Without Response packets. The only downside is that each Write With Response packet requires one round-trip. This scheme should nevertheless give you high throughput.
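A rough sketch of that pacing scheme; writeChunk stands in for whatever characteristic-write call your BLE library exposes (it is not a real API), and the every-10th-packet cadence is the one suggested above:

```go
package blesend

// writeChunk is a placeholder for your BLE stack's characteristic write call;
// withResponse selects Write With Response vs Write Without Response.
type writeChunk func(data []byte, withResponse bool) error

// sendAll streams packets, turning every 10th write into a Write With Response
// so an iOS central flushes its internal queue instead of silently dropping
// packets when its buffers fill up.
func sendAll(packets [][]byte, write writeChunk) error {
	for i, pkt := range packets {
		withResponse := (i+1)%10 == 0 || i == len(packets)-1 // also confirm the last packet
		if err := write(pkt, withResponse); err != nil {
			return err
		}
	}
	return nil
}
```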
My intent is to write an application-layer process on top of libnids. The reason for using the libnids API is that it can emulate Linux kernel TCP functionality. Libnids returns hlf->count_new, which is the number of bytes received since the last invocation of the TCP callback function. However, the tcp_callback is called every time a new packet comes in, therefore hlf->count_new contains a single TCP segment.
However, the application layer is supposed to receive the TCP window buffer, not separate TCP segments.
Is there any way to get the data of the TCP window (and not the TCP segment)? In other words, to make libnids deliver the TCP window buffer data.
thanks in advance!
You have a misunderstanding. The TCP window is designed to control the amount of data in flight. Application reads do not always trigger TCP window changes. So the information you seek is not available in the place you are looking.
Consider, for example, if the window is 128KB and eight bytes have been sent. The receiving TCP stack must acknowledge those eight bytes regardless of whether the application reads them or not, otherwise the TCP connection will time out. Now imagine the application reads a single byte. It would be pointless for the TCP stack to enlarge the window by one byte -- and if window scaling is in use, it can't do that even if it wants to.
And then what? If four seconds later the application reads another single byte, adjust the window again? What would be the point?
The purpose of the window is to control data flow between the two TCP stacks, prevent the buffers from growing infinitely, and control the amount of data 'in flight'. It only indirectly reflects what the application has read from the TCP stack.
It is also strange that you would even want this. Even if you could tell what had been read by the application, of what possible use would that be to you?
I'm trying to lower the send buffer size on my non-blocking TCP socket so that I can properly display an upload progress bar but I'm seeing some strange behavior.
I am creating a non-blocking TCP socket, setting SO_SNDBUF to 1024, verifying that it is set properly, then connecting (I tried this before and after the call to connect with no difference).
The problem is, when my app actually comes around and calls send (sending about 2 MB), rather than returning that around 1024 bytes were sent, the send call apparently accepts all the data and returns a sent value of 2 MB (exactly what I passed in). Everything operates properly (this is an HTTP PUT and I get a response, etc.), but what I end up displaying in my progress bar is the upload sitting at 100% for about 30 seconds and then the response coming in.
I have verified that if I stop before getting the response, the upload does not complete, so it's not like it just uploaded really fast and then the server stalled... Any ideas? Does Windows even look at this setting?
Windows does look at this setting, but the setting does not work the way you expect it to.
When you're setting the size of those buffers, you're actually setting the size of the buffers on the actual NIC you're communicating with, thus determining the size of the packets that are going out.
What you need to know about Windows is that there is a buffer between your calling code and the actual NIC, and I'm not sure you can control its size. What happens is that when you call the Send operation on your socket, you're dumping the data into that buffer, and the Windows kernel will perform small, step-by-step sends on the NIC using the data in the buffer.
This means that the code will report 2 MB as 'sent', but this just means that your 2 MB of data has been successfully written into the internal buffer; it does not mean/guarantee that the data has already been sent.
I've been working on similar projects with video streaming and TCP communications, and this information is available somewhere on the MSDN forums and TechNet, but it requires some really detailed searching on how it all actually works.
I observed the same thing on Windows, using Java non-blocking channel.
According to http://support.microsoft.com/kb/214397
If necessary, Winsock can buffer significantly more than the SO_SNDBUF buffer size.
This makes sense; the send is initiated by a program on the local machine, which is presumed to be cooperative and not hostile. If the kernel has enough memory, there's no point in rejecting the send data; someone must buffer it anyway. (The receive buffer is for the remote program, which may be hostile.)
The kernel does have limits on this buffering of send data. I'm working with a server socket, and the kernel accepts at most 128 KB per send; not the 2 MB in your example, which is for a client socket.
Also according to the same article, the kernel only buffers 2 sends; the next non-blocking send should return immediately reporting 0 bytes written. So if we only send a small amount of data each time, the program will be throttled by the receiving end, and your progress indicator would work nicely.
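A minimal sketch of that small-chunk approach (shown here with Go's net.Conn for brevity rather than raw Winsock; the 16 KiB chunk size is arbitrary, and a successful Write still only means the data reached the local send buffer):

```go
package upload

import "net"

// sendWithProgress writes data in small chunks and reports the cumulative
// bytes accepted by the kernel after each write. With small chunks the sender
// is throttled by the receiver sooner, so the reported progress tracks the
// actual transfer more closely than one huge send would.
func sendWithProgress(conn net.Conn, data []byte, progress func(sent, total int)) error {
	const chunk = 16 * 1024
	total := len(data)
	for sent := 0; sent < total; {
		end := sent + chunk
		if end > total {
			end = total
		}
		n, err := conn.Write(data[sent:end])
		sent += n
		progress(sent, total)
		if err != nil {
			return err
		}
	}
	return nil
}
```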
The setting does not affect anything on the NIC; it is the Kernel buffer that is affected. It defaults to 8k for both Send and Receive.
The reason for the behavior you are seeing is this: the send buffer size is NOT the limit of the amount you can sent at one time, it is the "nominal" buffer size. It really only affects subsequent sends when there is still data in the buffer waiting to be sent.
For example:
Set the send buffer to 101 bytes
Send 10 bytes, it will be buffered
Send 10 more bytes, it will be buffered
...continue until the buffer has 100 bytes in it
Send 10 more bytes
At this point WinSock uses some logic to determine whether to accept the new 10 bytes (and make the buffer 110 bytes) or block. I don't recall the behavior exactly but it is on MSDN.
Send 10 more bytes
This last one will definitely block until some buffer space is available.
So, in essence, the send buffer size is a nominal value, and:
WinSock will always accept a send of almost any size if the buffer is empty
If the buffer has data and a write will overflow, there is some logic to determine whether to accept/reject
If the buffer is full or overflowed, it will not accept the new send
Sorry for the vagueness and lack of links; I'm in a bit of a hurry but happened to remember these details from a network product I wrote a while back.