I'm sending multiple byte arrays with size <500 over TCP. However, on the client side, I sometimes receive arrays with size >2000. Why is this happening?
Below is my code for TCPClient:
byte[] receiveData = new byte[64000];
DataInputStream input = new DataInputStream(socket.getInputStream());
int size = input.read(receiveData);
byte[] receiveDataNew = new byte[size];
System.arraycopy(receiveData, 0, receiveDataNew, 0, size);
System.out.println("length of receiveData is " + size);
GenericRecord result = AvroByteReader.readAvroBytes(receiveDataNew);
return result;
Any help is appreciated! Thank you!
TCP is not a record-oriented protocol; it is stream-oriented. That means separate sends can be coalesced in transit before they are received on the other side. Most likely you have received 3 or 4 of your data arrays at once and are processing them all as one.
If you wish to use it as a record-oriented or framed protocol, you'll have to add that framing yourself. You could prepend a size to the data you send, and read only that much data at a time.
If your records are always the same length, you can do a fixed-length read instead. I don't know Java's libraries offhand, but you should be able to provide a length to read().
TCP does not have any concept of message boundaries, since it is a stream protocol. It can (and will) merge multiple sends into a single receive, and even the opposite - split a send into multiple smaller receives.
Your application must be prepared for this.
Basically, what you are seeing is multiple arrays, combined back to back, arriving in a single receive. Send the array's length before the array itself, then read exactly that many bytes.
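A minimal Java sketch of the length-prefix framing both answers describe, using DataOutputStream/DataInputStream (the class and method names are illustrative, not from the asker's code):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class LengthPrefixFraming {
    // Sender: prefix each record with its length as a 4-byte int.
    static void sendRecord(DataOutputStream out, byte[] record) throws IOException {
        out.writeInt(record.length);
        out.write(record);
        out.flush();
    }

    // Receiver: read the length, then block until exactly that many bytes arrive.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int length = in.readInt();    // 4-byte big-endian length prefix
        byte[] record = new byte[length];
        in.readFully(record);         // loops internally until the buffer is full
        return record;
    }
}

readFully() is exactly the "provide a length to read()" the first answer guesses at: it keeps reading until the requested number of bytes has arrived, no matter how TCP split or merged the sends.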
Related
I'm experimenting with streams in UWP to send data from one machine to another over the network.
On the sending machine, I created a DatagramSocket and serialize the data I want to send into bytes and write that to the output stream.
On the receiving machine, I create another DatagramSocket and handle the MessageReceived event to collect the sent data.
This appears to be working in the sense that when I send data from one machine, I do receive it on the other.
However, the data I'm serializing on the sender is, say, 8150 bytes in size, which I write to the stream.
On the receiving end, I'm only getting about 13 bytes of data instead of the full load I expected...
So it appears that I'm responsible on the receiving end for reconstructing the full data object by waiting for all the data to come in over what might be multiple streams...
However, it appears that I'm getting packets 1:1 -- that is, if I set a breakpoint right before the send and right after the receive, then when I write and flush the data to the output stream and send it, the receiving end triggers and I get what seems to be partial data, but I never get anything else.
So while I send 8150 bytes from the sending machine, the receiving end only gets a single packet about 13 bytes in length...
Am I losing packets? It seems to be a consistent 13 bytes, so perhaps it's a buffer setting, but the problem is that the 8150 bytes is arbitrary; sometimes it's larger or smaller...
I'm obviously doing this wrong, but I'm so new to network programming I'm not really sure where to start fixing this; on a high level what's the proper way to send a complete memory object from one machine to another so that I can reconstruct an exact copy of it on the receiving end?
Okay so it turns out that the problem was that when I was writing to the output stream on the sender machine, I was using a regular StreamWriter and sending it the array as an object:
using (StreamWriter writer = new StreamWriter(stream))
{
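// output is a byte[]; there is no byte[] overload, so this binds to Write(object), which writes output.ToString()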
writer.Write(output);
writer.Flush();
}
I believe this ends up writing the object itself using ToString(), so what I was actually writing to the stream was the string "System.Byte[]" or whatever the type was -- which, at 13 characters, would explain the consistent 13 bytes on the receiving end...
Instead, I replaced this with a BinaryWriter and wrote the full output array, and the complete contents are now received on the other end!
using (BinaryWriter writer = new BinaryWriter(stream))
{
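// BinaryWriter.Write(byte[]) writes the raw bytes of the array, with no length prefix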
writer.Write(output);
writer.Flush();
}
I know this wasn't very well put together, but I barely know what I'm doing here :) Still, I hope this might be helpful to others.
I'm trying to send and receive messages over TCP, with the size of each message prefixed before the message itself.
Say the first three bytes will be the length and the rest will be the message:
As a small example:
005Hello003Hey002Hi
I'll be using this method for large messages too, but the buffer size will be a constant integer, say 200 bytes. So there is a chance that a complete message may not be received, e.g. instead of 005Hello I get 005He, or that the length itself arrives incomplete, e.g. I get only 2 bytes of the length.
So, to get over this problem, I'll need to wait for the next message and append it to the incomplete message, etc.
My question is: am I the only one having these difficulties of appending partial messages to each other, accumulating lengths, etc., to make them complete, or is this really how individual messages usually need to be handled on TCP? Or is there a better way?
What you're seeing is 100% normal TCP behavior. It is completely expected that you'll loop receiving bytes until you get a "message" (whatever that means in your context). It's part of the work of going from a low-level TCP byte stream to a higher-level concept like "message".
And "usr" is right above. There are higher level abstractions that you may have available. If they're appropriate, use them to avoid reinventing the wheel.
So there is a chance that a complete message may not be received, e.g.
instead of 005Hello I get 005He, or that the length itself arrives
incomplete, e.g. I get only 2 bytes of the length.
Yes. TCP gives you at least one byte per read; that's all it guarantees.
Or is this really how individual messages usually need to be handled on TCP? Or is there a better way?
Try using higher-level primitives. For example, BinaryReader allows you to read exactly N bytes (it will internally loop). StreamReader lets you forget this peculiarity of TCP as well.
Even better is using still higher-level abstractions, such as HTTP (the request/response pattern - very common), protobuf as a serialization format, or web services, which automate pretty much all transport-layer concerns.
Don't do TCP if you can avoid it.
So, to get over this problem, I'll need to wait for the next message and append it to the incomplete message, etc.
Yep, this is how things are done in socket-level code. For each socket you would allocate a buffer at least as large as the kernel socket receive buffer, so that you can read the entire kernel buffer in one read/recv/recvmsg call. Reading from the socket in a loop may starve other sockets in your application (this is why they changed epoll to be level-triggered by default: the edge-triggered default forced application writers to read in a loop).
The first incomplete message is always kept at the beginning of the buffer, and reading from the socket continues at the next free byte, so new data automatically appends to the incomplete message.
Once reading is done, normally a higher-level callback is called with pointers to all the read data in the buffer. That callback should consume all complete messages in the buffer and return how many bytes it has consumed (which may be 0 if there is only an incomplete message). The buffer-management code should then memmove the remaining unconsumed bytes (if any) to the beginning of the buffer. Alternatively, a ring buffer can be used to avoid moving those unconsumed bytes, but then the higher-level code has to cope with ring-buffer iterators, which it may not be ready for. Hence keeping the buffer linear may be the most convenient option.
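A minimal Java sketch of that buffer-management scheme, using ByteBuffer.compact() in place of the memmove; consumeMessages and handle are hypothetical stand-ins for the higher-level callback described above, here assuming a 4-byte length prefix:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

class ReceiveLoop {
    // Reads whatever arrived, lets the callback consume complete messages,
    // then compacts any leftover partial message back to the buffer's start.
    static void pump(SocketChannel channel, ByteBuffer buffer) throws IOException {
        while (channel.read(buffer) > 0) {
            buffer.flip();              // switch to reading what we buffered
            consumeMessages(buffer);    // consumes zero or more complete messages
            buffer.compact();           // moves unconsumed bytes to the start
        }
    }

    // Hypothetical callback: advances past every complete message it finds
    // and leaves any trailing partial message unread.
    static void consumeMessages(ByteBuffer buffer) {
        while (buffer.remaining() >= 4) {
            buffer.mark();
            int length = buffer.getInt();       // 4-byte length prefix
            if (buffer.remaining() < length) {  // incomplete: wait for more bytes
                buffer.reset();
                return;
            }
            byte[] message = new byte[length];
            buffer.get(message);
            handle(message);
        }
    }

    static void handle(byte[] message) { /* application logic (hypothetical) */ }
}

Called as, say, pump(channel, ByteBuffer.allocate(64 * 1024)). compact() here plays the role of the memmove described above; a ring buffer would avoid the copy at the cost of more complex iteration.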
I was trying to read some messages from a TCP connection with a Redis client (a terminal just running redis-cli). However, the Read command from the net package requires me to pass in a slice as an argument. Whenever I give a slice with no length, the connection crashes and the Go program halts. I am not sure what length my byte messages are going to be beforehand. So unless I specify a slice that is ridiculously large, the connection will always close, which seems wasteful. I was wondering: is it possible to keep a connection without having to know the length of the message beforehand? I would love a solution to my specific problem, but I feel that this question is more general. Why do I need to know the length beforehand? Can't the library just give me a slice of the correct size?
Or what other solution do people suggest?
Not knowing the message size is precisely the reason you must specify the Read size (this goes for any networking library, not just Go). TCP is a stream protocol. As far as the TCP protocol is concerned, the message continues until the connection is closed.
If you know you're going to read until EOF, use ioutil.ReadAll
Calling Read isn't guaranteed to get you everything you're expecting. It may return less, it may return more, depending on how much data you've received. Libraries that do IO typically read and write through a "buffer"; you would have your "read buffer", which is a pre-allocated slice of bytes (up to 32k is common), and you re-use that slice each time you want to read from the network. This is why IO functions return the number of bytes read, so you know how much of the buffer was filled by the last operation. If the buffer was filled, or you're still expecting more data, you just call Read again.
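A sketch of that pattern (in Java, like the earlier examples on this page; Go's conn.Read behaves the same way): reuse one buffer, and trust only the byte count each read returns.

import java.io.IOException;
import java.io.InputStream;

class ReadBufferLoop {
    static void readLoop(InputStream in) throws IOException {
        byte[] buf = new byte[32 * 1024];   // pre-allocated read buffer, reused on every call
        int n;
        while ((n = in.read(buf)) != -1) {  // n = how much of the buffer this read filled
            // Only buf[0..n) is valid; bytes past n are leftovers from earlier reads.
            process(buf, n);
        }
    }

    static void process(byte[] data, int length) { /* hypothetical handler */ }
}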
A bit late but...
One of the questions was how to determine the message size. The answer given by JimB was that TCP is a streaming protocol, so there is no real end.
I believe this answer is incorrect. TCP divides a bitstream up into sequential packets. Each packet has an IP header and a TCP header (see Wikipedia). The IP header of each packet contains a field for the length of that packet. You would have to do some math to subtract out the TCP header length to arrive at the actual data length.
In addition, the maximum length of a message can be specified in the TCP header.
Thus you can provide a buffer of sufficient length for your read operation. However, you have to read the packet header information first. You probably should not accept a TCP connection if the max message size is longer than you are willing to accept.
Normally the sender would terminate the connection with a FIN packet (see the Wikipedia article above), not an EOF character.
An EOF in the read operation will most likely indicate that a packet was not fully transmitted within the allotted time.
I've got a question regarding TCP/IP socket networking. Basically it is this: are there any parts of TCP/IP that I can leverage to help manage multi-packet sends? For example, I want to send a 100 mb binary file, which would take something like 70-80 TCP packets. Meanwhile I have a relatively fast polling receive on the other side. Would my receive have to take each packet individually and "stitch" the data together packet by packet, looking for some size to be reached (it can look at the opcode and determine the size), or is there some way to tell TCP, "hey, I'm sending 100 mb here, let them know when it is finished"?
I am using glib's low level socket library (gsocket).
When using a binary encoding like, say, protocol buffers, you would wrap the actual payload by inserting a header that includes the information necessary to decode the payload on the other end.
Say, prepending 8 bytes, where the first 4 signify the type of the encoded message and the second 4 indicate the length of the entire message.
On the receiving side you then read this header, which is part of the payload, to determine the message type and length. This lets you combine multiple messages in one payload, or split messages across packets and reliably recombine them.
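A minimal Java sketch of such an 8-byte header (the Message record and method names are made up for illustration; here the length field counts only the payload, though counting the entire message works just as well if both sides agree):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

record Message(int type, byte[] payload) {}

class TypedFraming {
    static void send(DataOutputStream out, Message msg) throws IOException {
        out.writeInt(msg.type());             // first 4 bytes: message type
        out.writeInt(msg.payload().length);   // next 4 bytes: payload length
        out.write(msg.payload());
        out.flush();
    }

    static Message receive(DataInputStream in) throws IOException {
        int type = in.readInt();
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload);                // reliably recombines however TCP split it
        return new Message(type, payload);
    }
}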
Is it within the TCP standard that multiple messages, sent from server to client in a row, will be received by the client in the same order (and that the bytes of one message will not be scattered among other messages)?
TCP provides an in-order byte stream delivery service. The bytes won't arrive in another order but the number of writes need not be equal to the number of reads.
You will never read bytes in another order than that in which they were sent.
You can make no assumptions about "messages". TCP doesn't know about messages, only bytes (see above). Both the sender and the receiver can coalesce and split such "messages".
TCP uses a sequence number to identify each byte of data. The sequence number identifies the order of the bytes sent from each computer so that the data can be reconstructed in order, regardless of any fragmentation, disordering, or packet loss that may occur during transmission.
I agree with @cnicutar.
How are you deserializing the objects? I suspect the problem lies there.
For example, if your messages are ABCD followed 200 ms later by PQR, they may appear as:
ABC followed by DPQR
or ABCDPQR
or even AB followed by CD followed by PQ followed by R.
Basically, you cannot make assumptions based on the time at which the data is received.
The deserialization logic should know the object boundaries within a stream of bytes. This information should be encoded into the stream by the serialization logic.
If you are using Java, you can use ObjectInputStream & ObjectOutputStream and not be bothered about serialization issues.
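A minimal sketch of that approach, assuming the objects being sent implement Serializable; note the streams should be created once per connection and reused:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ObjectTransfer {
    // ObjectOutputStream frames each object for you; no manual length prefix needed.
    static void send(ObjectOutputStream out, Serializable obj) throws IOException {
        out.writeObject(obj);
        out.flush();
    }

    // readObject() blocks until one complete object has arrived,
    // regardless of how TCP split or merged the underlying bytes.
    static Object receive(ObjectInputStream in) throws IOException, ClassNotFoundException {
        return in.readObject();
    }
}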
J2ME Polish has a good serialization utility that can be very easily ported to other platforms. I have used it myself in a live environment.