Does WSASend() hide batching of large buffers?

I'm testing large buffers right now: I set one in a WSABUF and then call WSASend().
The thing is, WSARecv() on the other side handed back that large buffer in one go.
Does that make sense?
What is the limit of WSASend() and WSARecv() with respect to large buffers?
It seems that batching is happening in the background and all of it is hidden behind the abstraction.
If that is the case, I would like it to happen for my application every time.

The limit is the socket send and receive buffer size respectively.
WSASend() blocks while the socket send buffer is full and returns when everything has been transferred to the socket send buffer. Meanwhile, asynchronously, TCP is removing data from the socket send buffer, turning it into TCP segments in a way which you cannot control, and passing the segments to the IP layer, which in turn turns them into IP packets, again in a way which you cannot control.
WSARecv() blocks while there is no data in the socket receive buffer, and returns when all the data in the socket receive buffer has been transferred to the application, up to the limit supplied by the application. That could be as little as one byte, or the entire application buffer, or anything in between, depending entirely on the granularity of what is received.
All this refers to blocking mode. Non-blocking mode is similar, except that the calls return an error (WSAEWOULDBLOCK) instead of blocking as described above.
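To make the stream behaviour concrete, here is a minimal sketch (error handling trimmed, helper names chosen for illustration) of a blocking WSASend() of one large buffer and a WSARecv() loop that tolerates arbitrary read granularity:

```c
/* Sketch only: assumes an already-connected blocking SOCKET `s`.
   WSASend() returns once the whole buffer has been copied into the
   kernel's socket send buffer; TCP then segments it however it likes. */
#include <winsock2.h>

int send_all(SOCKET s, char *data, DWORD len)
{
    WSABUF buf;
    DWORD sent = 0;
    buf.buf = data;
    buf.len = len;
    /* Blocking socket: this returns only when `len` bytes have been
       accepted into the socket send buffer, not when they hit the wire. */
    return WSASend(s, &buf, 1, &sent, 0, NULL, NULL);
}

int recv_all(SOCKET s, char *data, DWORD len)
{
    DWORD total = 0;
    while (total < len) {
        WSABUF buf;
        DWORD got = 0, flags = 0;
        buf.buf = data + total;
        buf.len = len - total;
        /* Each WSARecv() may return anything from 1 byte up to the
           remaining buffer size; loop until the message is complete. */
        if (WSARecv(s, &buf, 1, &got, &flags, NULL, NULL) != 0 || got == 0)
            return -1;
        total += got;
    }
    return 0;
}
```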

Related

How does MPI_Send work when the application buffer size is greater than MPI buffer size?

How does MPI_Send() communicate the data to the receiving process if the size of the sending data is greater than the MPI buffer size? For example, let's say that I want to send 10 bytes of data (i.e., the size of my application buffer is 10B) in a single send message, but the MPI buffer has a fixed size of 6B. In this case, how does MPI_Send() send the data? Does it transfer first 6B and then transfer the remaining 4B? Or does it transfer only 6B?
There are a few different kinds of buffers involved in MPI messages, so I want to make it clear what each of them does.
Application Buffers - These buffers are allocated and managed by your application. Your data is stored in these, you do calculations with these, you pass these into MPI to tell it where to send or receive data. These are sized as large or larger than your data.
Internal Buffers - These buffers are internal to MPI and may or may not even exist. There's nothing in the MPI Standard about these buffers or how they are supposed to act, how big they are supposed to be, etc. However, there are some reasonable assumptions that you can make.
Usually there will be some internal buffers that are used to speed up data transfers, especially for small messages. If your message is small enough, it could be copied into this buffer to be queued for transfer at a later time. This is what usually happens if you do a very small MPI_SEND. The call will return immediately, but the data may or may not have actually been sent to the receiving process. There are similar buffers on the receiving side, so if a small message arrives before the application provides an application buffer where the data can be stored, it can be dropped into one of these smaller internal buffers until its eventual destination is specified. This is usually called the eager protocol.
Sometimes, the internal buffers are either all used up or are too small for your message to be copied into them. In this case, MPI falls back to the rendezvous protocol. In this instance, MPI usually doesn't use an internal buffer at all, but retains control of the application buffer and sends data directly from there. If this is happening, your call to MPI_SEND won't return until the MPI library is done using the buffer and it's safe for your application to modify the data again.
Special Buffers - There are other kinds of buffers that might provide special services, such as buffers on the network card that can speed up data transfers. The way these behave is usually specific to the type of network you're using.
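As a concrete illustration of the eager/rendezvous distinction, here is a sketch (not a statement about any particular MPI implementation; the crossover size and the exact behaviour are implementation- and transport-specific):

```c
/* Sketch: the same MPI_Send() call can behave very differently
   depending on message size. Sizes below are illustrative only. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    static char small[10], big[1 << 20];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        memset(small, 'a', sizeof small);
        memset(big, 'b', sizeof big);
        /* Likely the eager protocol: the 10 bytes are copied into an
           internal buffer and the call returns immediately, possibly
           before rank 1 has even posted its receive. */
        MPI_Send(small, sizeof small, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        /* Likely the rendezvous protocol: MPI sends directly from
           `big`, so the call does not return until it is safe for
           the application to modify `big` again. */
        MPI_Send(big, sizeof big, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(small, sizeof small, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Recv(big, sizeof big, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```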

Relation between net.inet.tcp.recvspace, SO_RCVBUF, Direct ByteBuf and ByteBufAllocator in Netty

Can someone quickly explain how Netty/NIO consumes TCP buffers from OS?
I reckon the TCP sliding window ACKs are managed by OS TCP stack (recvspace) and are sent back after each packet (MTU size) till the recvspace is full.
Then after NIO selector triggers a receive event, NIO (in direct buf mode) creates a direct buffer pointing to the same memory area and marks it as read? Or does it copy from recvspace into another buffer?
If this is the case, then what's each application's SO_RCVBUF? Is it relevant at all?
My goal is to read from the next buffer (and hence send new ACKs to read more) only after fully consuming the current one.
I reckon the TCP sliding window ACKs are managed by OS TCP stack (recvspace) and are sent back after each packet (MTU size) till the recvspace is full.
Correct. This happens from the socket receive buffer, which is in the kernel.
Then after NIO selector triggers a receive event, NIO (in direct buf mode) creates a direct buffer
Not necessarily. I don't see a reason for it to be a direct buffer.
pointing to the same memory area
No. It is in the application space.
and marks it as read?
No.
Or does it copy from recvspace into another buffer?
Correct. It reads, by calling ReadableByteChannel.read(), which ultimately calls recv(), which copies data out of the socket receive buffer into application memory.
If this is the case, then what's each application's SO_RCVBUF? Is it relevant at all?
It's the size of the socket receive buffer: the first thing mentioned above.
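To summarize the data path in code, here is a hedged sketch using the plain sockets API that Java's channels ultimately sit on (POSIX names; the Winsock equivalents are near-identical, and the buffer size is illustrative):

```c
/* Sketch: SO_RCVBUF sizes the kernel-side socket receive buffer that
   TCP ACKs/windowing operate against; recv() then *copies* data out
   of that kernel buffer into application memory. */
#include <sys/socket.h>
#include <stdio.h>

void show_receive_path(int fd)
{
    int rcvbuf = 0;
    socklen_t len = sizeof rcvbuf;
    char appbuf[64 * 1024];          /* application-space buffer */
    ssize_t n;

    /* The kernel receive buffer ("recvspace"): this is what SO_RCVBUF
       reports/configures, per socket. */
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    printf("kernel socket receive buffer: %d bytes\n", rcvbuf);

    /* recv() copies from the kernel buffer into appbuf; it does NOT
       hand the application a view of kernel memory. As data is copied
       out, buffer space frees up and the TCP window can reopen. */
    n = recv(fd, appbuf, sizeof appbuf, 0);
    if (n > 0)
        printf("copied %zd bytes into application memory\n", n);
}
```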

TCP Window size libnids

My intent is to write an application-layer process on top of libnids. The reason for using the libnids API is that it can emulate Linux kernel TCP functionality. Libnids returns hlf->count_new, which is the number of bytes received since the last invocation of the TCP callback function. However, tcp_callback is called every time a new packet comes in, so hlf->count_new contains a single TCP segment.
However, the app. layer is supposed to receive the TCP window buffer, not separate TCP segments.
Is there any way to get the data of the TCP window (and not the TCP segment)? In other words, to make libnids deliver the TCP window buffer data.
thanks in advance!
You have a misunderstanding. The TCP window is designed to control the amount of data in flight. Application reads do not always trigger TCP window changes. So the information you seek is not available in the place you are looking.
Consider, for example, if the window is 128KB and eight bytes have been sent. The receiving TCP stack must acknowledge those eight bytes regardless of whether the application reads them or not, otherwise the TCP connection will time out. Now imagine the application reads a single byte. It would be pointless for the TCP stack to enlarge the window by one byte -- and if window scaling is in use, it can't do that even if it wants to.
And then what? If four seconds later the application reads another single byte, adjust the window again? What would be the point?
The purpose of the window is to control data flow between the two TCP stacks, prevent the buffers from growing infinitely, and control the amount of data 'in flight'. It only indirectly reflects what the application has read from the TCP stack.
It is also strange that you would even want this. Even if you could tell what had been read by the application, of what possible use would that be to you?
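If the underlying goal is to hand the application layer coherent chunks rather than per-segment callbacks, the usual approach with libnids is to accumulate data in the callback and use nids_discard() to control how much libnids retains. A sketch, assuming a hypothetical fixed-length message format (MSG_LEN and process_messages() are illustrative, not part of libnids):

```c
/* Sketch: accumulate TCP stream data across callbacks instead of
   consuming one segment at a time. Adapt the framing to your protocol. */
#include <nids.h>

#define MSG_LEN 512   /* hypothetical fixed message size */

void process_messages(char *data, int len);   /* your function */

void tcp_callback(struct tcp_stream *a_tcp, void **param)
{
    if (a_tcp->nids_state == NIDS_JUST_EST) {
        a_tcp->server.collect++;     /* collect client->server data */
        return;
    }
    if (a_tcp->nids_state == NIDS_DATA) {
        struct half_stream *hlf = &a_tcp->server;
        /* hlf->count_new is just the newest segment; hlf->data holds
           everything libnids has retained for us so far. */
        int avail = hlf->count - hlf->offset;
        if (avail < MSG_LEN) {
            /* Not a full message yet: ask libnids to keep everything. */
            nids_discard(a_tcp, 0);
            return;
        }
        /* Full message(s) available: process and discard only the
           consumed bytes; libnids keeps the remainder for next time. */
        int consumed = (avail / MSG_LEN) * MSG_LEN;
        process_messages(hlf->data, consumed);
        nids_discard(a_tcp, consumed);
    }
}
```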

Strange behavior using SO_SNDBUF on non-blocking TCP socket under windows

I'm trying to lower the send buffer size on my non-blocking TCP socket so that I can properly display an upload progress bar but I'm seeing some strange behavior.
I am creating a non-blocking TCP socket, setting SO_SNDBUF to 1024, verifying that it is set properly, then connecting (I tried this before and after the call to connect with no difference).
The problem is, when my app actually comes around and calls send (sending about 2MB), rather than returning that around 1024 bytes were sent, the send call apparently accepts all the data and returns a sent value of 2MB (exactly what I passed in). Everything operates properly (this is an HTTP PUT and I get a response, etc.), but what I end up displaying in my progress bar is the upload sitting at 100% for about 30 seconds and then the response coming in.
I have verified that if I stop before getting the response, the upload does not complete, so it's not as though it just uploaded really fast and then the server stalled... Any ideas? Does Windows even look at this setting?
Windows does look at this setting, but it does not work the way you expect.
When you set the size of those buffers, you're actually setting the size of the buffers on the actual NIC you're communicating with, thus determining the size of the packets that go out.
What you need to know about Windows is that there is a buffer between your calling code and the actual NIC, and I'm not sure that you can control its size. What happens is, when you call the Send operation on your socket, you're dumping the data into that buffer, and the Windows kernel will perform small, step-by-step sends to the NIC using the data in the buffer.
This means the code will report 2MB as 'sent', but that just means your 2MB of data has been successfully written into the internal buffer; it does not mean or guarantee that the data has actually been sent.
I've been working on similar projects with video streaming and TCP communications; this information is available somewhere on the MSDN forums and TechNet, but it takes some really detailed searching to find out how it all actually works.
I observed the same thing on Windows, using Java non-blocking channel.
According to http://support.microsoft.com/kb/214397
If necessary, Winsock can buffer significantly more than the SO_SNDBUF buffer size.
This makes sense: the send is initiated by a program on the local machine, which is presumed to be cooperative, not hostile. If the kernel has enough memory, there's no point in rejecting the send data; someone must buffer it anyway. (The receive buffer is for the remote program, which may be hostile.)
The kernel does have limits on this buffering of send data. I'm writing a server socket, and the kernel accepts at most 128K per send; not the 2MB of your example, which is for a client socket.
Also according to the same article, the kernel only buffers two sends; the next non-blocking send should return immediately reporting 0 bytes written. So if we only send a small amount of data each time, the program will be throttled by the receiving end, and your progress indicator will work nicely.
The setting does not affect anything on the NIC; it is the Kernel buffer that is affected. It defaults to 8k for both Send and Receive.
The reason for the behavior you are seeing is this: the send buffer size is NOT the limit of the amount you can send at one time, it is the "nominal" buffer size. It really only affects subsequent sends when there is still data in the buffer waiting to be sent.
For example:
Set the send buffer to 101 bytes
Send 10 bytes, it will be buffered
Send 10 more bytes, it will be buffered
...continue until the buffer has 100 bytes in it
Send 10 more bytes
At this point WinSock uses some logic to determine whether to accept the new 10 bytes (and make the buffer 110 bytes) or block. I don't recall the behavior exactly but it is on MSDN.
Send 10 more bytes
This last one will definitely block until some buffer space is available.
So, in essence, the send buffer is sizeable and:
WinSock will always accept a send of almost any size if the buffer is empty
If the buffer has data and a write will overflow, there is some logic to determine whether to accept/reject
If the buffer is full or overflowed, it will not accept the new send
Sorry for the vagueness and lack of links; I'm in a bit of a hurry but happened to remember these details from a network product I wrote a while back.
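Putting the practical advice together: for a progress bar, drip-feed the data with many small non-blocking sends and count only what send() actually accepts, rather than handing Winsock the whole 2MB at once. A sketch (Winsock, error handling trimmed; the 4KB chunk size is illustrative):

```c
/* Sketch: non-blocking upload loop whose progress reflects what the
   kernel has actually accepted, not one giant buffered send. */
#include <winsock2.h>
#include <stdio.h>

void upload_with_progress(SOCKET s, const char *data, int total)
{
    int sndbuf = 1024;
    int sent_total = 0;

    /* Request a small kernel send buffer; per KB214397, Winsock may
       still buffer more than this, so don't rely on it alone. */
    setsockopt(s, SOL_SOCKET, SO_SNDBUF,
               (const char *)&sndbuf, sizeof sndbuf);

    while (sent_total < total) {
        int chunk = total - sent_total;
        if (chunk > 4096)
            chunk = 4096;            /* small sends throttle nicely */
        int n = send(s, data + sent_total, chunk, 0);
        if (n == SOCKET_ERROR) {
            if (WSAGetLastError() == WSAEWOULDBLOCK)
                continue;            /* real code: wait via select() */
            break;                   /* real error */
        }
        sent_total += n;
        printf("\rprogress: %d%%", sent_total * 100 / total);
    }
}
```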

Non-blocking socket with TCP

I'm writing a program using Java non-blocking sockets and TCP. I understand that TCP is a stream protocol, but the underlying IP protocol uses packets. When I call SocketChannel.read(ByteBuffer dst), will I always get the whole content of the IP packets, or may the read end at any position in the middle of a packet?
This matters because I'm trying to send individual messages through the channel; each message is small enough to be sent within a single IP packet without being fragmented. It would be cool if I could always get a whole message by calling read() on the receiver side; otherwise I have to implement some way to reassemble the messages.
Edit: assume that, on the sender side, messages are sent with a long interval(like 1 second), so they aren't going to group together in one IP packet. On the receiver side, the buffer used to call read(ByteBuffer dst) is big enough to hold any message.
TCP is a stream of bytes. Each read will receive between 1 byte and the smaller of the buffer size that you supplied and the number of bytes that are available to read at that time.
TCP knows nothing of your concept of messages. Each send by client can result in 0 or more reads being required at the other end. Zero or more because you might get a single read that returns more than one of your 'messages'.
You should ALWAYS write your read code such that it can deal with your message framing and either reassemble partial messages or split multiple ones.
You may find that if you don't bother with this complexity then your code will seem to 'work' most of the time, don't rely on that. As soon as you are running on a busy network or across the internet, or as soon as you increase the size of your messages you WILL be bitten by your broken code.
I talk about TCP message framing some more here: http://www.serverframework.com/asynchronousevents/2010/10/message-framing-a-length-prefixed-packet-echo-server.html and here: http://www.serverframework.com/asynchronousevents/2010/10/more-complex-message-framing.html though it's in terms of a C++ implementation so it may or may not be of interest to you.
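For reference, a minimal length-prefixed framing loop looks something like this (a sketch in C of the general technique; the 4-byte big-endian length prefix and handle_message() are assumptions, and the same structure ports directly to a SocketChannel read loop):

```c
/* Sketch: accumulate stream bytes and extract complete length-prefixed
   messages, handling both partial messages and several per read. */
#include <arpa/inet.h>   /* ntohl */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define ACC_SIZE (64 * 1024)

void handle_message(const char *msg, uint32_t len);   /* your function */

void read_messages(int fd)
{
    char acc[ACC_SIZE];          /* accumulation buffer */
    size_t used = 0;

    for (;;) {
        ssize_t n = recv(fd, acc + used, ACC_SIZE - used, 0);
        if (n <= 0)
            break;               /* connection closed or error */
        used += (size_t)n;

        /* Drain every complete message currently in the buffer. */
        while (used >= 4) {
            uint32_t len;
            memcpy(&len, acc, 4);
            len = ntohl(len);
            if (len > ACC_SIZE - 4)
                return;          /* oversized frame: real code should reject */
            if (used < 4 + len)
                break;           /* partial message: wait for more data */
            handle_message(acc + 4, len);
            memmove(acc, acc + 4 + len, used - 4 - len);
            used -= 4 + len;
        }
    }
}
```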
The socket API makes no guarantee that send() and recv() calls correlate to datagrams for TCP sockets. On the sending side, things may get regrouped: e.g. the system may defer sending one datagram to see whether the application has more data. On the receiving side, a read call may retrieve data from multiple datagrams, or a partial datagram if the size specified by the caller requires splitting a packet.
In other words, the TCP socket API assumes you have a stream of bytes, not a sequence of packets. You need to make sure you keep calling read() until you have enough bytes for a request.
From the SocketChannel documentation:
A socket channel in non-blocking mode, for example, cannot read any more bytes than are immediately available from the socket's input buffer;
So if your destination buffer is large enough, you are supposed to be able to consume the whole data in the socket's input buffer.