Sending raw bytes over network. Bad? - networking

This answer to the question "What is base 64 encoding used for?" says:
When you have some binary data that you want to ship across a network, you generally don't do it by just streaming the bits and bytes over the wire in a raw format. Why? because some media are made for streaming text. You never know -- some protocols may interpret your binary data as control characters (like a modem), or your binary data could be screwed up because the underlying protocol might think that you've entered a special character combination (like how FTP translates line endings).
I've used sockets in Java a hundred times to send binary data over networks, and as far as I know it is very common to send binary data over networks, especially for large payloads. I don't see why some devices would interpret binary data incorrectly, since it travels inside TCP segments with their own headers and so on.
SOAP MTOM also sends binary data over networks.
Am I misunderstanding something? I'm confused, because this answer has many upvotes and is accepted.

The answer you link to isn't incorrect; it just doesn't give explicit examples. The key point is already in the part you quoted:
because some media are made for streaming text
Sockets deal in bytes, they don't care what they transport. It is the higher-level protocols, or the message formats they transport, that do.
It's when binary data is wrapped in the envelopes of such protocols or formats that it can wreak havoc. A less-than (<) byte is perfectly valid inside image data, but embedded in an XML message it will break the XML. Other characters, such as control characters, can influence how a protocol handler interprets the data that follows.
So Base64 is used to wrap binary data in a safe-for-transport form where it would otherwise not be safe.
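For concreteness, here is a minimal sketch in Java (the class name and byte values are made up; it only assumes the standard java.util.Base64 API available since Java 8) of wrapping binary data so it can sit safely inside a text envelope such as XML:

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        // Made-up "binary" payload: note 0x3C is '<', which would break XML if embedded raw.
        byte[] payload = { 0x00, 0x3C, (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A };

        // Encode to text that uses only A-Z, a-z, 0-9, '+', '/' and '=' padding.
        String safe = Base64.getEncoder().encodeToString(payload);
        System.out.println("<image>" + safe + "</image>");

        // The receiver reverses the encoding and gets the identical bytes back.
        byte[] decoded = Base64.getDecoder().decode(safe);
        System.out.println(Arrays.equals(payload, decoded)); // true
    }
}
```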

Related

File Delimiters on AES 256 Encrypted fields

I have a requirement for one of my projects in which I expect a few of the incoming fields to be AES-256 encrypted when sent to us by an upstream system. The incoming file is comma-delimited. Is there a possibility that the AES-encrypted fields may contain "," and throw the values off into different fields? What about if it is pipe-delimited or uses some other delimiter?
Also, what should the datatype of these encrypted fields be in order to read them with an ETL tool?
Thanks in Advance
AES, as a block cipher, is a family of permutations selected by the key. The output is expected to look random (more precisely, we believe that AES is a pseudo-random permutation).
AES (like any block cipher) outputs binary data, usually as a byte array, and each byte can take any value between 0 and 255 with roughly equal probability.
You are not alone;
Transmitting binary data can create problems, especially in protocols that are designed to deal with textual data. To avoid those problems altogether, we don't transmit raw binary data. Many of the programming errors related to encryption on Stack Overflow are due to sending binary data over text-based protocols. Most of the time this works, but occasionally it fails and the coders wonder about the problem: the binary data corrupts the network protocol.
Therefore hex, Base64, or similar encodings are used to mitigate this. Standard Base64 is not entirely URL-safe, but it can be made URL-safe with a little work (the URL-safe variant replaces '+' and '/' with '-' and '_').
And note that this has nothing to do with security; it is about visibility and interoperability.
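As a rough sketch (assuming Java's standard javax.crypto and java.util.Base64 APIs; the key handling and field values are purely illustrative), the raw AES output can contain the byte for ',' or '|' or any other delimiter, so the field is Base64-encoded before being written into the delimited file and decoded again before decryption:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;

public class EncryptedFieldDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative key and IV; in practice these come from your key-management setup.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal("sensitive field".getBytes(StandardCharsets.UTF_8));

        // Raw ciphertext bytes look uniform, so any value, including ',' (0x2C) or
        // '|' (0x7C), can appear. Base64 output cannot contain either character,
        // so the encoded field is safe to place in a comma- or pipe-delimited record.
        String field = Base64.getEncoder().encodeToString(ciphertext);
        System.out.println("id123," + field + ",other-column");
    }
}
```

On the ETL side, the field can then simply be treated as a plain text (string) column.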

Difference between binary and text protocols in the context of http

What I read everywhere is that HTTP is a text-based protocol and HTTP/2 is a binary protocol. Lots of articles online suggest that HTTP/2's binary protocol is more compact and more efficient to process.
Where exactly in the HTTP workflow does a text-based protocol add overhead? At the application layer, we would always need to serialize the data (text) into binary anyway to transfer it on the wire. So, essentially, aren't we transferring the data in binary with both HTTP and HTTP/2?
Where exactly in the HTTP/2 workflow does the binary protocol bring in that compactness and processing efficiency?
At the application layer, we would always need to serialize the data (text) into binary anyway to transfer on the wire.
True, but up to HTTP/1.1 the data was written to the underlying layer, such as the TCP layer, as text. In HTTP/2 the data is encoded into binary, packed into frames, and sent to the underlying layer.
A text format does not map cleanly onto fixed-size frames; a character sequence may get split across multiple frames. Binary-encoding the data and splitting it into frames is preferable (a sketch of the HTTP/2 frame header follows).
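To make "binary framing" concrete, here is a minimal sketch of the fixed 9-octet HTTP/2 frame header described in RFC 7540 (the class, payload and stream ID are made up for illustration): every frame starts with a length, a type, flags and a stream identifier, so the receiver knows exactly how many bytes to read without scanning for delimiters.

```java
import java.nio.ByteBuffer;

public class Http2FrameHeaderSketch {
    // Builds the fixed 9-octet HTTP/2 frame header:
    // Length (24 bits) | Type (8 bits) | Flags (8 bits) | R (1 bit) + Stream ID (31 bits)
    static byte[] frameHeader(int payloadLength, int type, int flags, int streamId) {
        ByteBuffer buf = ByteBuffer.allocate(9);
        buf.put((byte) (payloadLength >>> 16));
        buf.put((byte) (payloadLength >>> 8));
        buf.put((byte) payloadLength);
        buf.put((byte) type);
        buf.put((byte) flags);
        buf.putInt(streamId & 0x7FFFFFFF); // reserved bit is always 0
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        // Type 0x0 = DATA, flags 0x1 = END_STREAM, stream 1 (values from RFC 7540).
        byte[] header = frameHeader(payload.length, 0x0, 0x1, 1);
        // A receiver reads these 9 bytes, learns the payload length, then reads exactly
        // that many bytes -- no searching for "\r\n" delimiters as in HTTP/1.x.
        System.out.println(header.length + "-byte header + " + payload.length + "-byte payload");
    }
}
```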

How is encoded data sent over a network?

I'm a bit confused about how people represent binary data and how it is sent over networks. I will explain through Wikipedia's example, shown here: https://imgur.com/a/POELH. So I have my binary data encoded as Base64, and I am sending the text TWFu: first T, then W, then F, and finally u. But to send T, a char, I will need one byte, like I've always been told. One character sent over a network is one byte.
Because now I've come to think that if I encode 24 bits (3 bytes), I will be sending 4 characters, but to send 4 characters I need the same number of bytes as characters??
So in the example above, when sending "Man" (unencoded, normally requiring 3 bytes) vs "TWFu" (encoded, normally requiring 4 bytes) over the network, is the same sequence of bits sent over the wire? Because the last time I used a socket to send data, it just asked for a string input, never a text-plus-encoding input.
Synopsis: "How" is an agreement. "Raw" is common.
Data is sent in whichever way the sender and receiver agree. There are many protocols that are standard agreements. Protocols operate at many levels. A very common pair that covers two levels is TCP/IP. Many higher-level protocols are layered on top of them. (A higher-level protocol may or may not depend on specific underlying protocols.) HTTP and SMTP are very common higher-level protocols, often with SSL sandwiched in between.
Sometimes the layers or the software that implements them is called a stack. There is also the reference (or conceptual) OSI Model. The key point about it is that it provides a language to talk about different layers. The layers it defines may or may not map to any specific stack.
Your question is too vague to answer directly. With HTTP, "raw" binary data is transferred all the time. The HTTP headers can give the length of the body in octets, and the body follows the header. As part of the agreement between the sender and receiver, the header might give metadata about the binary data using MIME headers. For example, your Gravatar is sent with headers including:
content-length:871
content-type:image/png
That's enough for the receiver to know that the sender claims it is a PNG graphic of 871 bytes. The receiver will read the header, then read 871 bytes for the body, and then assume that what follows is another HTTP header.
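A minimal sketch of that receiver logic (hypothetical class and method names, assuming only Java's standard InputStream API): read the text headers line by line until the blank line, pick out Content-Length, then read exactly that many raw bytes as the binary body.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class HttpBodyReader {
    // Reads one header line terminated by CRLF (simplified; no folding, no validation).
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') line.write(b);
        }
        return line.toString("ISO-8859-1");
    }

    // Returns the raw body bytes of a single response on the given stream.
    static byte[] readBody(InputStream in) throws IOException {
        int contentLength = 0;
        String line;
        while (!(line = readLine(in)).isEmpty()) {           // headers end at the blank line
            if (line.toLowerCase().startsWith("content-length:")) {
                contentLength = Integer.parseInt(line.substring(15).trim());
            }
        }
        // The body is just contentLength raw octets -- a PNG, a ZIP, anything.
        return in.readNBytes(contentLength);
    }
}
```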
Some protocols use synchronization methods other than bodies with pre-declared sizes. They might be entirely text-based and use a syntax that allows only certain characters. They can be extended by a nesting agreement to use something like Base64 to represent binary data as text.
Some layers might provide data compression of sufficient density that expansion by higher layers, such as Base64, is not a great concern. See HTTP Compression, for example.
If you want to see HTTP in action, hit F12 and go to the Network tab. If you want to see other protocols active on your computer, try Wireshark, Microsoft Message Analyzer, Fiddler or similar.
Base64 is a method for encoding arbitrary 8-bit data in a purely 7-bit channel. As much as the internet is based on the principle of 8-bit bytes, for text mode it's presumed to be 7-bit ASCII unless otherwise specified.
If you're sending that data Base64-encoded then you'll literally send TWFu. Many text-based protocols use Base64 out of convenience: it's an established standard and it's efficient enough for most applications.
The foundation of the internet, IP, is a protocol based on 8-bit bytes. When sending binary data you can make full use of all 8 bits, but if you're working with a text-mode protocol, of which there are many, you're generally stuck using 7-bit ASCII unless the protocol has a way of specifying which character set or encoding you're using.
If you have the option to switch to a "binary" transfer then you can side-step the need for Base64. If you're working with a 7-bit ASCII protocol then you're probably going to need Base64.
Note this isn't the only method for encoding arbitrary binary data. There's also quoted-printable, as used in email, and URI (percent) encoding for URLs. These are more efficient when escaping is the exception, but far less efficient if it's needed for nearly every character, as in the sketch below.
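To make the efficiency trade-off concrete, here is an illustrative (deliberately simplified, not standards-complete) percent-encoder in the spirit of URI encoding; the class name and sample bytes are made up:

```java
import java.nio.charset.StandardCharsets;

public class PercentEncodingSketch {
    // Keeps unreserved ASCII as-is, escapes everything else as %XX
    // (one byte becomes three characters), similar in spirit to URI encoding.
    static String percentEncode(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (byte b : data) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9') || v == '-' || v == '_' || v == '.' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Mostly-text data: only the space is escaped, so the overhead is tiny.
        System.out.println(percentEncode("Hello World".getBytes(StandardCharsets.US_ASCII)));
        // Arbitrary binary data: every byte triples in size, where Base64
        // would grow it by only about a third.
        System.out.println(percentEncode(new byte[] { 0x00, (byte) 0xC3, 0x07, (byte) 0x9F }));
    }
}
```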
If you know you're dealing with 7-bit text only there's no need for base-64 encoding.
However, if you needed to send
Man
Boy
over a purely 7-bit, line-oriented text channel, you couldn't send it literally, because the line breaks themselves would be interpreted by the protocol. Instead, you'd send it encoded in Base64:
TWFuDQpCb3kNCg==
which contains the line breaks in encoded form but doesn't use any characters the channel could misinterpret. Of course, the receiver needs to know that you're sending encoded text, either implied by the protocol or explicitly marked in some way.
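For the curious, a one-off check (plain Java, nothing assumed beyond the standard java.util.Base64 class) confirms that this string really is "Man" and "Boy" separated by CRLF line breaks:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CheckEncoding {
    public static void main(String[] args) {
        String encoded = Base64.getEncoder()
                .encodeToString("Man\r\nBoy\r\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(encoded); // prints TWFuDQpCb3kNCg==
    }
}
```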

Does IMAP protocol support binary inside multi-part body?

IMAP RFC:
8-bit textual and binary mail is supported through the use of a
[MIME-IMB] content transfer encoding. IMAP4rev1 implementations MAY
transmit 8-bit or multi-octet characters in literals, but SHOULD do
so only when the [CHARSET] is identified.
Although a BINARY body encoding is defined, unencoded binary
strings are not permitted. A "binary string" is any string with
NUL characters. Implementations MUST encode binary data into a
textual form, such as BASE64, before transmitting the data. A
string with an excessive amount of CTL characters MAY also be
considered to be binary.
If an implementation has to convert to Base64, why does the RFC say "a BINARY body encoding is defined"? Since the data has to be sent as Base64 (or in some other encoding) every time, binary is effectively not supported. Or am I reading something wrong?
IMAP supports MIME multipart; can the parts inside it carry binary data, that is, Content-Transfer-Encoding: binary?
I am new to IMAP/HTTP. The reason for asking is that I have to develop a server which supports both HTTP and IMAP: over HTTP the server receives the data in binary (huge multipart data, with Content-Transfer-Encoding set to binary), and the same data can be fetched over IMAP. The problem is that I would need to parse the data and convert each part inside the multipart to Base64 if IMAP doesn't support binary, which I think is a severe performance issue.
The answer is unfortunately "maybe".
The MIME RFC supports binary, but the IMAP RFC specifically disallows sending NUL characters. This is likely because they can be confusing for text-based parsers, especially those written in C, where NUL marks the end of a string.
Some IMAP servers just consider the body to be a "bag of bytes", and I suspect that few, if any, actually re-encode it. So if you ask for the entire message, you will probably get its literal content.
If your clients can handle MIME-Binary, you will probably be fine.
There is RFC 3516 for an IMAP extension to support BINARY properly, but this is not widely deployed.
As a side note: why are you using Multipart MIME? That is an odd implementation choice for HTTP.

Is it possible to send ASCII control codes via RS232?

I would like to send and receive bytes that have special meanings in ASCII, like End of Text and End of Transmission, but I am not sure whether that is allowed. Can it break my communication? Sending and receiving look like reading from and writing to a file, which is why I doubt I can use these specific values directly. I use Windows.
EDIT: I have tested it and there is no problem with any value. All ASCII control characters can be sent via RS-232; neither reading nor writing causes any unexpected behaviour.
RS-232 is a thoroughly binary protocol. It does not even assume 8-bit bytes, let alone ASCII. The fact that you use file functions on Windows does not matter either; those, too, do not assume text data, although they do assume 8-bit bytes.
RS-232 nodes do not interpret the data except in software flow control mode (XON/XOFF). You use this mode only if both parties agree to use it, and agree on the values of XON and XOFF.
The values are historically based on the ASCII DC1 and DC3 characters, but the only thing that matters is their values, 0x11 and 0x13.
If you haven't set up software flow control, all values are passed through as-is.
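A tiny illustration (the serialOut stream is hypothetical and stands in for whatever OutputStream your serial-port library on Windows exposes; the control byte values themselves are standard ASCII): control characters are just byte values and pass through untouched unless XON/XOFF flow control is enabled.

```java
import java.io.IOException;
import java.io.OutputStream;

public class ControlBytesDemo {
    // Standard ASCII control values.
    static final int ETX  = 0x03; // End of Text
    static final int EOT  = 0x04; // End of Transmission
    static final int XON  = 0x11; // DC1 -- only special if software flow control is on
    static final int XOFF = 0x13; // DC3 -- only special if software flow control is on

    // 'serialOut' is whatever OutputStream your serial-port library exposes (hypothetical here).
    static void sendFrame(OutputStream serialOut, byte[] payload) throws IOException {
        serialOut.write(payload);
        serialOut.write(ETX);  // transmitted as the plain byte 0x03, nothing more
        serialOut.write(EOT);  // transmitted as the plain byte 0x04
        serialOut.flush();
        // If XON/XOFF flow control were enabled on the port, 0x11 and 0x13 in the payload
        // could be swallowed or acted on by the driver; with it disabled, every value
        // 0-255 goes through as-is.
    }
}
```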
