My intent is to write an application-layer process on top of libnids. The reason for using the libnids API is that it emulates the Linux kernel's TCP functionality. Libnids returns hlf->count_new, which is the number of bytes received since the last invocation of the TCP callback function. However, the tcp_callback is called every time a new packet comes in, so hlf->count_new typically contains a single TCP segment.
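For reference, here is a minimal sketch of the callback shape I mean, based on the standard libnids example (handle_app_data() is just a hypothetical placeholder for my application-layer processing):

#include <stdio.h>
#include "nids.h"

/* hypothetical placeholder for my application-layer processing */
static void handle_app_data(const char *data, int len) {
    printf("got %d new bytes (roughly one TCP segment per callback)\n", len);
}

static void tcp_callback(struct tcp_stream *a_tcp, void **param) {
    if (a_tcp->nids_state == NIDS_JUST_EST) {
        a_tcp->client.collect++;   /* ask libnids to collect data on both halves */
        a_tcp->server.collect++;
        return;
    }
    if (a_tcp->nids_state == NIDS_DATA) {
        struct half_stream *hlf =
            a_tcp->client.count_new ? &a_tcp->client : &a_tcp->server;
        /* hlf->data holds the bytes reassembled since the last callback,
           hlf->count_new is how many of them are new */
        handle_app_data(hlf->data, hlf->count_new);
    }
}

int main(void) {
    if (!nids_init()) {
        fprintf(stderr, "nids_init: %s\n", nids_errbuf);
        return 1;
    }
    nids_register_tcp(tcp_callback);
    nids_run();
    return 0;
}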
However, the application layer is supposed to receive the TCP window buffer, not separate TCP segments.
Is there any way to get the data of the TCP window (and not of individual TCP segments)? In other words, can libnids be made to deliver the TCP window buffer data?
thanks in advance!
You have a misunderstanding. The TCP window is designed to control the amount of data in flight. Application reads do not always trigger TCP window changes. So the information you seek is not available in the place you are looking.
Consider, for example, if the window is 128KB and eight bytes have been sent. The receiving TCP stack must acknowledge those eight bytes regardless of whether the application reads them or not, otherwise the TCP connection will time out. Now imagine the application reads a single byte. It would be pointless for the TCP stack to enlarge the window by one byte -- and if window scaling is in use, it can't do that even if it wants to.
And then what? If four seconds later the application reads another single byte, adjust the window again? What would be the point?
The purpose of the window is to control data flow between the two TCP stacks, prevent the buffers from growing infinitely, and control the amount of data 'in flight'. It only indirectly reflects what the application has read from the TCP stack.
It is also strange that you would even want this. Even if you could tell what had been read by the application, of what possible use would that be to you?
Section 6.9 of RFC 7540 describes the mechanism for HTTP/2 flow control. There is a flow control window for each connection, and another flow control window for all streams on that connection. It provides a way for the receiver to set the initial flow control window for a stream:
Both endpoints can adjust the initial window size for new streams by including a value for SETTINGS_INITIAL_WINDOW_SIZE in the SETTINGS frame that forms part of the connection preface.
And a way for the receiver to increase the connection and stream flow control windows:
The payload of a WINDOW_UPDATE frame is one reserved bit plus an unsigned 31-bit integer indicating the number of octets that the sender can transmit in addition to the existing flow-control window. The legal range for the increment to the flow-control window is 1 to 2^31-1 (2,147,483,647) octets.
[...]
A sender that receives a WINDOW_UPDATE frame updates the corresponding window by the amount specified in the frame.
And a way for the receiver to increment or decrement the flow control windows for all streams (but not the connection) at once:
When the value of SETTINGS_INITIAL_WINDOW_SIZE changes, a receiver MUST adjust the size of all stream flow-control windows that it maintains by the difference between the new value and the old value.
But as far as I can tell, there is no way for the receiver to decrement a single stream's flow control window without changing the initial window size.
Is that correct? If so, why not? This seems like a reasonable thing to want to do if you are multiplexing many long-lived streams over a single connection. You may have some BDP-controlled memory budget for the overall connection, carved up across the streams, and are tuning the proportion that each stream gets according to its recent bandwidth demand. If one of them temporarily goes idle you'd like to be able to reset its window to be small so that it doesn't strand the memory budget, without affecting the other streams, and without making it impossible to receive new streams.
(Of course I understand that there is a race, and the sender may have sent data before receiving the decrement. But the window is already allowed to go negative due to the SETTINGS_INITIAL_WINDOW_SIZE mechanism above, so it seems like it would be reasonable to allow for a negative window here too.)
Is it really not possible to do this without depending on forward progress from the sender to eat up the stranded bytes in the flow control window?
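For reference, the wire format itself makes the limitation explicit. Here is a rough illustrative encoder of my own (not taken from any particular library): the WINDOW_UPDATE increment is an unsigned 31-bit field whose legal range is 1 to 2^31-1, so there is simply no way to encode a zero or negative adjustment for a single stream.

#include <stdint.h>

/* Illustrative only: serialize a WINDOW_UPDATE frame per RFC 7540 section 6.9.
   out must have room for 13 bytes (9-byte frame header + 4-byte payload). */
static void encode_window_update(uint8_t *out, uint32_t stream_id, uint32_t increment) {
    out[0] = 0; out[1] = 0; out[2] = 4;        /* 24-bit length = 4 */
    out[3] = 0x08;                             /* type = WINDOW_UPDATE */
    out[4] = 0;                                /* no flags defined for this frame */
    out[5] = (stream_id >> 24) & 0x7F;         /* reserved bit cleared */
    out[6] = (stream_id >> 16) & 0xFF;
    out[7] = (stream_id >> 8) & 0xFF;
    out[8] = stream_id & 0xFF;
    /* payload: 1 reserved bit + 31-bit window size increment.
       The field is unsigned and its legal range is 1 to 2^31-1, so a
       "please shrink this stream's window" message cannot be expressed. */
    out[9]  = (increment >> 24) & 0x7F;
    out[10] = (increment >> 16) & 0xFF;
    out[11] = (increment >> 8) & 0xFF;
    out[12] = increment & 0xFF;
}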
Here's more detail on why I'm interested in the question, because I'm conscious of the XY problem.
I'm thinking about how to solve an RPC flow control issue. I have a server with a limited memory budget, and incoming streams with different priorities for how much of that memory they should be allowed to consume. I want to implement something like weighted max-min fairness across them, adjusting their flow control windows so that they sum to no more than my memory budget, but when we're not memory constrained we get maximum throughput.
For efficiency reasons, it would be desirable to multiplex streams of different priorities on a single connection. But then as demands change, or as other connections show up, we need to be able to adjust stream flow control windows downward so they still sum to no more than the budget. When stream B shows up or receives a higher priority but stream A is sitting on a bunch of flow control budget, we need to reduce A's window and increase B's.
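To sketch what I mean (purely illustrative; the names and the crude proportional split are mine, standing in for real weighted max-min fairness): I would periodically recompute a target window per stream from the weights and the overall budget, and the missing piece is a way to tell the sender when a target has gone down.

#include <stdint.h>

/* Illustrative names; a crude proportional split standing in for weighted
   max-min fairness. */
struct stream_state {
    uint32_t weight;         /* priority weight for this stream */
    uint32_t target_window;  /* how much of the memory budget it should get */
};

static void allocate_windows(struct stream_state *s, int n, uint32_t budget) {
    uint64_t total_weight = 0;
    for (int i = 0; i < n; i++)
        total_weight += s[i].weight;
    for (int i = 0; i < n; i++)
        s[i].target_window = total_weight
            ? (uint32_t)((uint64_t)budget * s[i].weight / total_weight)
            : 0;
    /* Growing a stream's window toward its target is easy: send a WINDOW_UPDATE
       for the difference.  The problem in this question is the other direction:
       when target_window drops, there is no frame that tells the sender so. */
}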
Even without the multiplexing, the same problem applies at the connection level: as far as I can tell, there is no way to adjust the connection flow control window downward without changing the initial window size. Of course it will be adjusted downward as the client sends data, but I don't want to need to depend on forward progress from the client for this, since that may take arbitrarily long.
It's possible there is a better way to achieve this!
A server that has N streams, some idle and some actively downloading data to the client, will typically re-allocate the connection window to the active streams.
For example, say you are watching a movie and downloading a big file from the same server at the same time.
Say the connection window is 100, and each stream has a window of 100 too (with many streams, the amount of data actually in flight across all of them is capped by the connection window, but a single stream can use the whole thing).
Now, while you are both watching and downloading, each stream effectively gets 50.
If you pause the movie, and the server knows about it (i.e. it stops writing to the movie stream rather than exhausting that stream's window), then the server only has to serve one stream: a connection window of 100 and a single active stream (the download) whose window is also 100, so the whole connection window is effectively reallocated to the active stream.
You only get into problems if the client doesn't tell the server that the movie has been paused.
In this case, the server will continue to send movie data until the movie stream's window is exhausted (or nearly exhausted), and the client does not acknowledge that data because the movie is paused.
At that point, the server notices that one stream is not acknowledging data and stops sending to it, but part of the connection window is now tied up, reducing what is available to the active download stream.
From the server's point of view, it has a perfectly good connection where one stream (the download) works wonderfully at max speed, while another stream hiccups, exhausts its window, and drags the other stream down (possibly to a halt), even though it is the same connection!
Obviously it cannot be a connection/communication issue, because one stream (the download one) works perfectly fine at max speed.
Therefore it is an application issue.
The HTTP/2 implementation on the server does not know that one of the streams is a movie that can be paused -- it's the application that must communicate this to the server and keep the connection window as large as possible.
Introducing a new HTTP/2 frame to "pause" downloads (or changing the semantics of existing frames to accommodate a "pause" command) would have complicated the protocol quite substantially, for a feature that is 100% application driven: it is the application that decides to pause, and at that point it can just as well send its own "pause" message to the server without complicating the HTTP/2 specification.
It is an interesting case where HTTP/1.1 and HTTP/2 behave very differently and require different code to work in a similar way.
With HTTP/1.1 you would have one connection for the movie and one for the download, and they would be independent. The client application would not need to tell the server that the movie was paused: it could simply stop reading from the movie connection, letting TCP flow control back-pressure the server, without affecting the download connection (assuming the server writes in a non-blocking way to avoid scalability issues).
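As a sketch of that HTTP/1.1 "pause" (illustrative only; this assumes an epoll-based client, and epfd / movie_fd are hypothetical names): the client simply stops asking for read events on the movie connection, its kernel receive buffer fills, the advertised TCP window drops to zero, and the server's writes on that connection stall without the application ever sending a message.

#include <sys/epoll.h>

/* Client side, epoll-based (epfd and movie_fd are hypothetical). */
static void pause_movie(int epfd, int movie_fd) {
    /* Stop polling the movie connection for readability.  Its kernel receive
       buffer fills, the advertised TCP window shrinks to zero, and the server's
       writes on that connection stall, while the download connection is untouched. */
    epoll_ctl(epfd, EPOLL_CTL_DEL, movie_fd, NULL);
}

static void resume_movie(int epfd, int movie_fd) {
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = movie_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, movie_fd, &ev);
}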
I'm testing large buffers right now: I put a large buffer into a WSABUF and then call WSASend().
The thing is, though, that WSARecv() just handed back that large buffer in one go.
Does that make sense?
Where is the limit of WSASend() and WSARecv() with respect to large buffers?
It seems that batching is happening in the background and all of that is hidden behind the abstraction.
If that is the case, I would like that to always happen for my application.
The limit is the socket send and receive buffer size respectively.
WSASend() blocks while the socket send buffer is full and returns when everything has been transferred to the socket send buffer. Meanwhile, asynchronously, TCP is removing data from the socket send buffer, turning it into TCP segments in a way which you cannot control, and passing the segments to the IP layer, which in turn turns them into IP packets, again in a way which you cannot control.
WSARecv() blocks while there is no data in the socket receive buffer, and returns when all the data currently in the socket receive buffer has been transferred to the application, up to the limit supplied by the application. That could be as little as one byte, or the entire application buffer, or anything in between, depending entirely on the granularity of what has been received.
All of this refers to blocking mode. Non-blocking mode is similar, except that the calls return an error instead of blocking in the situations above.
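As an illustration of the blocking receive case: if the application needs a fixed number of bytes, it has to loop, because a single WSARecv() may hand back anything from one byte up to the full buffer. A rough sketch (error handling kept minimal; requires linking against ws2_32):

#include <winsock2.h>

/* Read exactly `want` bytes from a blocking socket, looping over WSARecv().
   Returns 1 on success, 0 if the peer closed the connection, -1 on error. */
static int recv_exact(SOCKET s, char *buf, DWORD want) {
    DWORD got = 0;
    while (got < want) {
        WSABUF wsabuf;
        DWORD n = 0, flags = 0;
        wsabuf.buf = buf + got;
        wsabuf.len = want - got;
        if (WSARecv(s, &wsabuf, 1, &n, &flags, NULL, NULL) == SOCKET_ERROR)
            return -1;             /* see WSAGetLastError() */
        if (n == 0)
            return 0;              /* graceful close by the peer */
        got += n;                  /* may be anything from 1 to want - got */
    }
    return 1;
}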
Can someone quickly explain how Netty/NIO consumes TCP buffers from OS?
I reckon the TCP sliding window ACKs are managed by the OS TCP stack (recvspace) and are sent back after each packet (MTU size) until the recvspace is full.
Then after NIO selector triggers a receive event, NIO (in direct buf mode) creates a direct buffer pointing to the same memory area and marks it as read? Or does it copy from recvspace into another buffer?
If this is the case, then what's each application's SO_RCVBUF? Is it relevant at all?
My goal is to read from the next buffer (and hence send new ACKs to allow more data) only after fully consuming the current buffer.
I reckon the TCP sliding window ACKs are managed by the OS TCP stack (recvspace) and are sent back after each packet (MTU size) until the recvspace is full.
Correct. This happens from the socket receive buffer, which is in the kernel.
Then after NIO selector triggers a receive event, NIO (in direct buf mode) creates a direct buffer
Not necessarily. I don't see a reason for it to be a direct buffer.
pointing to the same memory area
No. It is in the application space.
and marks it as read?
No.
Or does it copy from recvspace into another buffer?
Correct. It reads, by calling ReadableByteChannel.read(), which ultimately calls recv(), which copies data out of the socket receive buffer into application memory.
If this is the case, then what's each application's SO_RCVBUF? Is it relevant at all?
It's the size of the socket receive buffer mentioned above, so yes, it's relevant.
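Stripped of the Java layers, the operation described above boils down to the classic socket pattern below (a sketch in C rather than Netty code, with illustrative names): SO_RCVBUF governs the size of the kernel's socket receive buffer, and a read copies bytes out of that buffer into application memory, freeing window space.

#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

static void read_once(int fd) {
    int rcvbuf = 0;
    socklen_t optlen = sizeof(rcvbuf);
    /* SO_RCVBUF is the size of the kernel's socket receive buffer for this socket */
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &optlen);
    printf("kernel receive buffer: %d bytes\n", rcvbuf);

    char buf[4096];
    /* recv() copies out of the kernel receive buffer into application memory;
       it does not alias kernel memory.  Freeing that kernel space is what lets
       the TCP stack advertise a larger window to the sender. */
    ssize_t n = recv(fd, buf, sizeof buf, 0);
    if (n > 0) {
        /* process the n bytes that are now in application space */
    }
}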
In most descriptions of the TCP PUSH function, it is mentioned that the PUSH feature not only requires the sender to send the data immediately (without waiting for its buffer to fill), but also requires that the data be pushed to the receiving application on the receiver side, without being buffered.
What I don't understand is why TCP would buffer data on the receiving side at all. After all, TCP segments travel in IP datagrams, which are processed in their entirety (i.e. the IP layer delivers only an entire segment to the TCP layer, after doing any necessary reassembly of the fragments of the IP datagram that carried that segment). Why, then, would the receiving TCP layer wait to deliver this data to its application? One case could be if the application were not reading the data at that point in time. But if that is the case, then forcibly pushing the data to the application is not possible anyway. Thus, my question is: why does the PUSH feature need to dictate anything about receiver-side behavior? Given that an application is reading data at the time a segment arrives, that segment should be delivered to the application straightaway anyway.
Can anyone please help resolve my doubt?
TCP must buffer received data because it doesn't know when the application is actually going to read it, and it has already told the sender how much it is willing to accept (the available "window"). All of this data is stored in the receive buffer until it gets read out by the application.
Once the application reads the data, TCP drops it from the receive buffer and increases the window size it reports back to the sender with the next ACK. If this buffering did not exist, the sender would have to hold off until the receiver told it to go ahead, which it could not do until the application issued a read. That would add a full round-trip delay worth of latency to every read call, if not more.
Most modern implementations also make use of this buffer to keep out-of-order packets received so that the sender can retransmit only the lost ones rather than everything after it as well.
The PSH bit is not generally acted upon. Yes, implementations send it, but it typically doesn't change the behavior of the receiving end.
Note that, although the other comments are correct (the PSH bit doesn't impact application behaviour much at all in most implementations), it's still used by TCP to determine ACK behaviour. Specifically, when the PSH bit is set, the receiving TCP will ACK immediately instead of using delayed ACKs. Minor detail ;)
This is more of a theoretical question. Let us say that there is an infinite data source which keeps pushing data every second: some device that monitors "Solar events" and sends events to a back-end system continuously, every nanosecond (i.e. it is effectively a continuous stream). The back-end system wants to transmit the live data to another remote system over TCP. Can TCP handle an infinite data stream in a single TCP connection?
I'm aware of the sequence number limitation, but with TCP timestamps the sequence numbers will wrap around properly, so that should not pose a problem. Also, assume that the system has several terabytes of memory (which can be considered close to an infinite memory model). If I just give the base address of where the stream starts, will TCP be able to proceed (segmenting, transmitting, re-transmitting, etc.) continuously in a single TCP connection, without caring whether the data ever ends?
My guess is that since TCP never expects any stream length parameter, it should be possible. Am I right ?
Basically, yes. As long as the data is byte ('octet') aligned, data on TCP streams can be piped anywhere (see any router). TCP is a byte stream: it doesn't care about message boundaries. The windowed protocol has built-in flow control, so it should all work.
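As a sketch of why no length is needed (illustrative only; next_sample() is a made-up stand-in for the event source): the sending side just keeps writing bytes, and TCP's flow control blocks the writes whenever the receiver or the network can't keep up.

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical stand-in for the solar-event source: fills buf, returns a byte count. */
static ssize_t next_sample(char *buf, size_t cap) {
    memset(buf, 0, cap);
    return (ssize_t)cap;
}

/* TCP has no notion of total stream length: the sender can keep writing for as
   long as the connection is up.  send() blocks whenever the peer's window or
   the local send buffer is full, which is the flow control mentioned above. */
static void stream_forever(int fd) {
    char buf[65536];
    for (;;) {
        ssize_t n = next_sample(buf, sizeof buf);
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = send(fd, buf + off, (size_t)(n - off), 0);
            if (w <= 0)
                return;            /* error or connection closed */
            off += w;
        }
    }
}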