Delimiter occurring inside byte stream as actual data - TCP

Suppose you have a specification for sending information over a TCP or UDP stream, and the sequence of bytes you receive is delimited with STX and EOT bytes. How do you handle, for example, the EOT byte occurring in the actual data? This is possible, I think: most bytes in the message represent numbers in a defined order (i.e. it's not just ASCII text in byte form), so EOT is byte 0x04, and that is a value that could occur in the data. The specification is unclear on this: should I always look at the last occurrence of EOT and ignore those in between? Other similar specifications I've seen can even carry multiple messages inside the same TCP/UDP message, for example STX some_data EOT STX more_data EOT. In that case you can't just look at the last EOT, because there are actually two separate messages. Do you do some form of escaping then?
How is this sort of thing handled usually? I couldn't find anything on Google, but perhaps I'm not using the best search terms.

"Usually" the protocol should be well designed, so that messages either don't contain the delimiter, use an escape mechanism to include the delimiter, or have a known length so that you know where the message ends without having to depend on the delimiter.
If the messages are fixed size integers for example you'll know that EOT encountered within an integer is not a delimiter.
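For the escaping case, a common approach is byte stuffing: reserve an escape byte (DLE, 0x10, is the traditional choice) and prefix any STX, EOT or escape byte that occurs inside the payload. A minimal sketch in Python, assuming the STX/EOT/ESC values below actually match your specification:

```python
STX, EOT, ESC = 0x02, 0x04, 0x10   # assumed values; check your spec

def frame(payload: bytes) -> bytes:
    """Wrap payload in STX ... EOT, escaping any special bytes inside it."""
    out = bytearray([STX])
    for b in payload:
        if b in (STX, EOT, ESC):
            out.append(ESC)          # escape byte announces "the next byte is data"
        out.append(b)
    out.append(EOT)
    return bytes(out)

def unframe(message: bytes) -> bytes:
    """Inverse of frame(), for a single already-isolated STX ... EOT message."""
    assert message[0] == STX and message[-1] == EOT
    out = bytearray()
    i = 1
    while i < len(message) - 1:
        if message[i] == ESC:
            i += 1                   # skip the escape, keep the following byte as data
        out.append(message[i])
        i += 1
    return bytes(out)

# A payload containing the EOT value (0x04) survives the round trip:
msg = bytes([0x01, 0x04, 0xFF])
assert unframe(frame(msg)) == msg
```

When scanning an incoming stream, the end of a message is then the first EOT that is not preceded by an unescaped ESC.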

Related

How is encoded data sent over a network?

I'm a bit confused about how people represent binary data and how it is sent over networks. I'll explain through Wikipedia's example (shown here: https://imgur.com/a/POELH). I have my binary data encoded as Base64, and I am sending the text TWFu; so I send T, then W, then F, and finally u. But to send T, a char, I need one byte, like I've always been told: one character sent over a network is one byte.
Because now I've come to think that if I encode 24 bits (3 bytes), I will be sending 4 characters, but to send 4 characters I need the same number of bytes as characters??
So when sending "Man" (unencoded, normally requiring 3 bytes) versus "TWFu" (encoded, normally requiring 4 bytes) from the example above, is the same sequence of bits sent over the network? The last time I used a socket to send data, it just asked for a string input, never a text-plus-encoding input.
Synopsis: "How" is an agreement. "Raw" is common.
Data is sent in whichever way the sender and receiver agree. There are many protocols that are standard agreements. Protocols operate at many levels. A very common pair that covers two levels is TCP/IP. Many higher-level protocols are layered on top of them. (A higher-level protocol may or may not depend on specific underlying protocols.) HTTP and SMTP are very common higher-level protocols, often with SSL sandwiched in between.
Sometimes the layers, or the software that implements them, are called a stack. There is also the reference (or conceptual) OSI model. The key point about it is that it provides a language for talking about different layers. The layers it defines may or may not map to any specific stack.
Your question is too vague to answer directly. With HTTP, "raw" binary data is transferred all the time. The HTTP headers can give the length of the body in octets, and the body follows the header. As part of the agreement between the sender and receiver, the header might give metadata about the binary data using MIME headers. For example, your Gravatar is sent with headers including:
content-length:871
content-type:image/png
That's enough for the receiver to know that the sender claims that it is a PNG graphic of 871 bytes. The receiver will read the header and then read 871 bytes for the body and then assume that what follows is another HTTP header.
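The same length-based idea works for any binary protocol: send the size first, then exactly that many bytes. A rough sketch over a plain TCP socket; the 4-byte big-endian length prefix here is an assumption for illustration, not part of any particular standard:

```python
import socket
import struct

def send_message(sock: socket.socket, payload: bytes) -> None:
    # 4-byte big-endian length, then the raw payload; no delimiter or escaping needed
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf.extend(chunk)
    return bytes(buf)

def recv_message(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```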
Some protocols use synchronization methods other than bodies with pre-declared sizes. They might be entirely text-based and use a syntax that allows only certain characters. They can be extended by a nesting agreement to use something like Base64 to represent binary data as text.
Some layers might provide data compression of sufficient density that expansion by higher layers, such as Base64, is not a great concern. See HTTP Compression, for example.
If you want to see HTTP in action, hit F12 and go to the Network tab. If you want to see other protocols active on your computer, try Wireshark, Microsoft Message Analyzer, Fiddler or similar.
Base64 is a method for encoding arbitrary 8-bit data in a purely 7-bit channel. Although the internet is built on the principle of 8-bit bytes, text mode is presumed to be 7-bit ASCII unless otherwise specified.
If you're sending that data Base64 encoded then you'll literally send TWFu. Many text-based protocols use Base64 out of convenience: it's an established standard and it's efficient enough for most applications.
The foundation of the internet, IP, is a protocol based on 8-bit bytes. When sending binary data you can make full use of all 8 bits, but if you're working with a text-mode protocol, of which there are many, you're generally stuck using 7-bit ASCII unless the protocol has a way of specifying which character set or encoding you're using.
If you have the option to switch to a "binary" transfer then you can side-step the need for Base64. If you're working with a 7-bit ASCII protocol then you're probably going to need Base64.
Note this isn't the only method for encoding arbitrary binary data. There's also quoted-printable, as used in email, and percent (URI) encoding for URLs. These are more efficient in cases where escaping is the exception, but far less efficient if nearly every byte needs it.
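For comparison, Python's standard library happens to include both of those escape styles; a quick illustration (the sample bytes are arbitrary):

```python
import quopri
from urllib.parse import quote

data = b"caf\xc3\xa9"                 # 'café' in UTF-8: two bytes outside 7-bit ASCII
print(quopri.encodestring(data))      # quoted-printable escaping: b'caf=C3=A9'
print(quote(data))                    # URI (percent) encoding:    'caf%C3%A9'
```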
If you know you're dealing with 7-bit text only there's no need for base-64 encoding.
However, if you needed to send
Man
Boy
over a purely 7-bit text channel, you couldn't send it literally with the embedded line breaks. Instead, you'd send it encoded in Base64:
TWFuDQpCb3kNCg==
which has encoded line breaks but doesn't use incompatible characters. Of course, the receiver needs to know that you're sending encoded text - either implied by the protocol or explicitly marked in some way.
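You can check both examples with a couple of lines of Python; base64.b64encode takes bytes and returns the ASCII-safe form:

```python
import base64

print(base64.b64encode(b"Man"))               # b'TWFu' - 3 bytes become 4 characters
print(base64.b64encode(b"Man\r\nBoy\r\n"))    # b'TWFuDQpCb3kNCg=='
print(base64.b64decode("TWFuDQpCb3kNCg=="))   # b'Man\r\nBoy\r\n' - the round trip
```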

What to choose: application/x-www-form-urlencoded or multipart/form-data for file sizes in GB?

I am sending some video files (the size could even be in GB) as application/x-www-form-urlencoded over HTTP POST.
The following link suggests that it would be better to transmit them as multipart/form-data when the content is non-alphanumeric.
Which encoding would be better to transmit data of this kind?
Also how can I find the length of encoded data (data encoded with application/x-www-form-urlencoded)?
Will encoding the binary data consume much time?
In general, encoding replaces the non-alphanumeric characters with something else. So, can we skip encoding for binary data (like video)? How can we skip it?
x-www-form-urlencoded treats the value of an entry in the form data set as a sequence of bytes (octets).
Of the possible 256 values, only 66 are left as is, i.e. still encoded as a single byte; the others are replaced by a percent sign followed by the two hexadecimal digits of their value, i.e. three bytes each.
So on average (256-66)/256, or about 74%, of the file is expanded to three times its original size, making the encoded body roughly 2.5 times as large as the original.
This encoding, however, adds no header and no other fixed overhead.
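You can see that expansion directly with Python's urllib, used here as a rough stand-in for what a browser does when it percent-encodes form data (a sketch, not an exact reproduction of form submission):

```python
import os
from urllib.parse import quote_from_bytes

raw = os.urandom(100_000)                 # stand-in for binary video data
encoded = quote_from_bytes(raw, safe="")  # every unsafe octet becomes %XX
print(len(encoded) / len(raw))            # typically around 2.5
```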
multipart/form-data instead works by dividing the data into parts and then finding a string that doesn't occur in each part.
Such a string is called the boundary; it is used to delimit the end of the part, which is transmitted as a stream of octets.
So the file is mostly sent as is, with negligible size overhead for big enough data.
The drawback is that the user-agent needs to find a suitable boundary; however, given a string of k random bytes, there is only a probability of 2^-8k that it occurs at any given position in a uniformly random binary file.
So the user-agent can simply generate a random string, do a quick search, and hide the latency of that search behind the network transmission time.
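In practice that comes down to something like the sketch below: pick a random token and, in the astronomically unlikely case that it appears in the data, pick another (the prefix and token length are arbitrary choices):

```python
import secrets

def choose_boundary(body: bytes) -> str:
    while True:
        candidate = "----boundary-" + secrets.token_hex(16)  # 16 random bytes: ~2^-128 odds per position
        if candidate.encode("ascii") not in body:
            return candidate
```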
You should use multipart/form-data.
This depends on the platform you are using; in general, if you cannot access the request body directly, you have to re-perform the encoding yourself to measure its length.
For multipart/form-data there is a little overhead, usually negligible compared to the transmission time.

Sending raw bytes over network. Bad?

This post to the question "What is base 64 encoding used for?" says:
When you have some binary data that you want to ship across a network, you generally don't do it by just streaming the bits and bytes over the wire in a raw format. Why? because some media are made for streaming text. You never know -- some protocols may interpret your binary data as control characters (like a modem), or your binary data could be screwed up because the underlying protocol might think that you've entered a special character combination (like how FTP translates line endings).
I've used sockets in Java a hundred times to send binary data over networks. And as far as I know it is very common to send binary data over networks, especially for large data. I don't see why some devices would interpret the binary data wrongly, since it is wrapped in TCP headers etc.
SOAP MTOM also sends binary data over networks.
Am I misunderstanding something? I'm confused, because this post has many upvotes and is accepted.
The answer you link to isn't incorrect, it just fails to explicitly mention some examples. The answer is in the quote as well:
because some media are made for streaming text
Sockets deal in bytes, they don't care what they transport. It is the higher-level protocols, or the message formats they transport, that do.
It's when this binary data is wrapped in envelopes of such protocols or formats that they can wreak havoc. A less than (<) character in image bytes is perfectly valid, but when used in an XML message, it will break the XML. Other characters, like control characters, can have an influence on how further data is to be interpreted by a protocol handler.
So Base64 is used to wrap binary data in a form that is safe for transport where the raw bytes would not be.
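For example, if you had to embed image bytes inside an XML message, you would Base64 them first rather than pasting the raw bytes into the document (a sketch; the file name and element name are made up):

```python
import base64

image_bytes = open("photo.png", "rb").read()   # any binary payload
safe_text = base64.b64encode(image_bytes).decode("ascii")
xml = f'<attachment encoding="base64">{safe_text}</attachment>'
# The receiver reverses it with base64.b64decode(safe_text).
```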

Is it possible to send ASCII control codes via RS232?

I would like to receive and send bytes that have special meaning in ASCII, like End of Text, End of Transmission, etc., but I am not sure whether that is allowed. Could it break my communication? Sending and receiving look like reading from and writing to a file, which is why I doubt I can use these specific values directly. I use Windows.
EDIT: I have tested it and there is no problem with any value. All of the ASCII control characters can be sent via RS-232. Neither reading nor writing causes any unexpected behaviour.
RS-232 is a purely binary protocol. It does not even assume 8-bit bytes, let alone ASCII. The fact that on Windows you use file functions does not matter either; those do not assume text data, although they do assume 8-bit bytes.
RS-232 nodes do not interpret the data except in software flow control mode (XOn/XOff). You use this mode only if both parties agree to it, and agree on the values of XOn and XOff.
The values are historically based on the ASCII DC1 and DC3 characters, but the only thing that matters is their values, 0x11 and 0x13.
If you haven't set up software flow control, all values are passed through as-is.
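If you are testing this from Python, a quick check with pyserial (assuming the library is installed and "COM3" is your port) shows the control bytes passing through untouched as long as software flow control is off:

```python
import serial  # pyserial

# xonxoff=False (the default) means 0x11/0x13 are not treated as flow control
with serial.Serial("COM3", 9600, timeout=1, xonxoff=False) as port:
    port.write(bytes([0x02, 0x03, 0x04]))  # STX, ETX, EOT go out as plain bytes
    echoed = port.read(3)                  # whatever the other end sends back, if anything
    print(echoed)
```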

What is the simplest compression technique for a single network packet?

I need a simple compression method for a single network packet.
Simple in the sense of a technique that uses the least computation.
Thanks!
lz4 compresses and decompresses very fast. zlib can compress better, but not quite as fast. The "least computation" would be to not compress at all.
The "PPP Predictor Compression Protocol" is one of the lowest-computation algorithms available for single-packet compression.
Source code is available in RFC1978.
The decompressor guesses what the next byte is in the current context.
If it guesses correctly, the next bit from the compressed text is "1";
If it guesses incorrectly, the next bit from the compressed text is "0" and the next byte from the compressed text is passed through literally (and the guess table for this context is updated so next time it guesses this literal byte).
The compressor attempts to compress the data field of the packet.
If enough guesses are right (even if half of them are wrong), the compressed data field ends up smaller than the plaintext data; the compressed flag for that packet is set to 1, and the compressed data is sent in the packet.
If too many guesses are wrong, however, the "compressed" data ends up the same size as or even longer than the plaintext, so the compressor instead sets the compressed flag for that packet to 0 and simply sends the raw plaintext in the packet.
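A rough Python rendering of the compressor side, adapted from the C reference code in RFC 1978 (the decompressor mirrors it, reading a flag byte and then filling in guessed or literal bytes):

```python
def predictor_compress(data: bytes) -> bytes:
    """RFC 1978 Predictor, compressor side (sketch)."""
    guess = bytearray(65536)   # guess table indexed by a 16-bit hash of recent context
    h = 0
    out = bytearray()
    i = 0
    while i < len(data):
        flags = 0
        literals = bytearray()
        for bit in range(8):               # process up to 8 input bytes per flag byte
            if i >= len(data):
                break
            b = data[i]
            if guess[h] == b:
                flags |= 1 << bit          # guessed right: only the flag bit is emitted
            else:
                guess[h] = b               # learn this byte for next time
                literals.append(b)         # guessed wrong: the literal byte is emitted
            h = ((h << 4) ^ b) & 0xFFFF    # update the context hash
            i += 1
        out.append(flags)
        out.extend(literals)
    return bytes(out)
```

If the result is not smaller than the original packet, the real protocol simply sends the packet uncompressed, as described above.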
There are two basic types of compression: lossless and lossy. Lossless means that if you have two algorithms, c(msg) (compression) and d(msg) (decompression), then
msg == d(c(msg))
Of course then, this implies that a lossy compression would be:
msg != d(c(msg))
With some information, lossy is ok. This is typically how sound is handled. You can lose some bits without any noticeable loss. MP3 compression works this way, for example. Lossy algorithms are usually specific to the type of information that you are compressing.
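For instance, zlib (DEFLATE) is lossless: the round trip returns the exact original bytes. A one-line check in Python, nothing networking-specific about it:

```python
import zlib

msg = b"some payload that must survive exactly"
assert zlib.decompress(zlib.compress(msg)) == msg   # msg == d(c(msg))
```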
So, it really depends upon the data that you are transmitting. I assume that you are speaking strictly about the payload and not any of the addressing fields, and that you are interested in lossless compression. The simplest would be run-length encoding (RLE). In RLE, you basically find runs of duplicate successive values and replace them with a flag, followed by a count, followed by the value to repeat. You should only do this if the length of the run is greater than or equal to the length of the tuple
(sizeof(flag) + sizeof(count) + sizeof(value)).
This can work really well if the data values fit in 7 bits, in which case you can use the 8th bit as the flag. So for instance, if you had three 'A's, "AAA", in hex that would be 41 41 41; you can encode that as C1 03. In this case the maximum run is 255 (FF), after which you would have to start a new compression sequence if there were more than 255 bytes of the same value. So here 3 bytes become 2 bytes, and in the best case 255 identical 7-bit values become 2 bytes instead of 255.
A simple state machine could be used to handle the RLE.
See http://en.wikipedia.org/wiki/Run-length_encoding
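A sketch of that 8th-bit-as-flag variant, valid only when the input bytes are 7-bit values exactly as assumed above:

```python
def rle_encode(data: bytes) -> bytes:
    """Runs of 3+ identical 7-bit bytes become (value | 0x80, count)."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run >= 3:
            out += bytes([data[i] | 0x80, run])   # e.g. 'AAA' (41 41 41) -> C1 03
        else:
            out += data[i:i + run]                # short runs stay literal
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] & 0x80:                        # high bit set: a (value, count) pair
            out += bytes([data[i] & 0x7F]) * data[i + 1]
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

assert rle_encode(b"AAA") == bytes([0xC1, 0x03])
assert rle_decode(rle_encode(b"AAAAABCC")) == b"AAAAABCC"
```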
