I am building a web application that will send a set of flag states to its server by converting the flags into binary, then converting the binary into ASCII characters. The ASCII characters will be sent using a POST command, and a response (encoded the same way) will be sent back. I would like to know if there are ASCII characters that can cause the HTTP requests and data transfer to break down or get misdirected. Are there standard ASCII characters (0-127) that need to be avoided?
Despite its name, HTTP is agnostic to the format and semantics of the entity-body content. It doesn't need to be text. Describing it as text and giving a character encoding is metadata for the sending and receiving applications. Your actual entity-data isn't text, but you've added a layer of re-interpretation so you can provide that metadata.
HTTP bodies are of two types: counted or chunked. For counted, the message-body is the same as the entity-body. Counted is used unless you want to start streaming data before knowing its entire length. Just send the Content-Length header with the number of octets in the entity-body and copy the entity-body into the output stream.
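For instance, here's a minimal Python sketch of a counted POST (the https://example.com/flags endpoint is just a placeholder); the flag bits travel as raw octets and Content-Length tells the receiver how many to read:

import urllib.request

# Pack the flag states into raw bytes (the entity-body); there is no need
# to re-encode them as ASCII text first.
flags = 0b10110010                         # hypothetical set of 8 boolean flags
body = flags.to_bytes(1, "big")

# urllib sets Content-Length to len(body) automatically for a bytes body,
# so the receiver knows exactly how many octets to read.
req = urllib.request.Request(
    "https://example.com/flags",           # placeholder endpoint
    data=body,
    method="POST",
    headers={"Content-Type": "application/octet-stream"},
)
with urllib.request.urlopen(req) as resp:
    reply = resp.read()                    # raw bytes of the response body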
So according to the Mozilla docs, HTTP messages are composed of textual information encoded in ASCII:
HTTP messages are composed of textual information encoded in ASCII,
and span over multiple lines.
So, when we request a binary file such as an image, how is it represented in the response message?
(Assuming the response is a single text file containing both headers and data)
Is it also ASCII encoded?
If not, then are headers and data transferred separately? If yes, please share some resources where I can learn how this works.
In HTTP/1.1 (not HTTP/2 or HTTP/3), message header and control information is transferred as plain text. The actual payloads (request and response bodies) are binary, and thus no encoding is needed.
See IETF RFCs 9110..9114.
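As a quick illustration (the URL is only a placeholder), Python's standard library shows the split clearly: the status line and headers come back as text, the body as raw bytes:

import urllib.request

# Any URL will do; this one is just a placeholder.
with urllib.request.urlopen("https://www.example.com/") as resp:
    # The status line and headers arrive as plain text...
    print(resp.status, resp.reason)
    print(resp.headers["Content-Type"], resp.headers["Content-Length"])
    # ...while the body is handed over as raw, unencoded bytes.
    body = resp.read()
    print(type(body), len(body), body[:8])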
I'm a bit confused about how people represent binary data, and how it is sent over networks. I will explain through Wikipedia's example, shown here: https://imgur.com/a/POELH. So I have my binary data encoded as Base64, and I am sending the text TWFu. So I am sending T, then W, then F, and finally u. But to send T, a char, I will need one byte, like I've always been told: one character sent over a network is one byte.
Because now I've come to think that if I encode 24 bits (3 bytes), I will be sending 4 characters, but to send 4 characters I need the same number of bytes as there are characters??
So when sending "Man" (unencoded, requiring 3 bytes normally) vs "TWFu" (encoded, requiring 4 bytes normally) over the network, as in the example above, is the same sequence of bits sent? Because the last time I used a socket to send data, it just asked for a string input, never a text + encoding input.
Synopsis: "How" is an agreement. "Raw" is common.
Data is sent in whichever way the sender and receiver agree. There are many protocols that are standard agreements. Protocols operate at many levels. A very common pair that covers two levels is TCP/IP. Many higher-level protocols are layered on top of them. (A higher-level protocol may or may not depend on specific underlying protocols.) HTTP and SMTP are very common higher-level protocols, often with SSL sandwiched in between.
Sometimes the layers or the software that implements them is called a stack. There is also the reference (or conceptual) OSI Model. The key point about it is that it provides a language to talk about different layers. The layers it defines may or may not map to any specific stack.
Your question is too vague to answer directly. With HTTP, "raw" binary data is transferred all the time. The HTTP headers can give the length of the body in octets and the body follows the header. As part of the agreement between the sender and receiver, the header might give metadata about the binary data using MIME headers. For example, your gravatar is sent with headers including:
content-length:871
content-type:image/png
That's enough for the receiver to know that the sender claims that it is a PNG graphic of 871 bytes. The receiver will read the header and then read 871 bytes for the body and then assume that what follows is another HTTP header.
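Here's a rough sketch of that read loop at the socket level, assuming the server answers with a counted (Content-Length) body rather than a chunked one (the host is a placeholder):

import socket

HOST = "www.example.com"                    # placeholder host

# Speak HTTP/1.1 by hand to watch the counted-body mechanism.
sock = socket.create_connection((HOST, 80))
sock.sendall(b"GET / HTTP/1.1\r\nHost: " + HOST.encode("ascii") +
             b"\r\nConnection: close\r\n\r\n")

# Read until the blank line that terminates the textual header block.
buf = b""
while b"\r\n\r\n" not in buf:
    chunk = sock.recv(4096)
    if not chunk:
        break
    buf += chunk
header_block, _, body = buf.partition(b"\r\n\r\n")

# Parse the headers and pull out the declared body length.
headers = {}
for line in header_block.split(b"\r\n")[1:]:
    name, _, value = line.partition(b":")
    headers[name.strip().lower()] = value.strip()
length = int(headers[b"content-length"])

# Read exactly that many octets; whatever follows would be the next message.
while len(body) < length:
    chunk = sock.recv(4096)
    if not chunk:
        break
    body += chunk
sock.close()

print(f"read {len(body)} of {length} declared octets")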
Some protocols use synchronization methods other than bodies with pre-declared sizes. They might be entirely text-based and use a syntax that allows only certain characters. They can be extended by a nesting agreement to use something like Base64 to represent binary data as text.
Some layers might provide data compression of sufficient density that expansion by higher layers, such as Base64, is not a great concern. See HTTP Compression, for example.
If you want to see HTTP in action, hit F12 and go to the Network tab. If you want to see other protocols active on your computer, try Wireshark, Microsoft Message Analyzer, Fiddler or similar.
Base64 is a method for encoding arbitrary 8-bit data in a purely 7-bit channel. As much as the internet is based on the principle of 8-bit bytes, for text mode it's presumed to be 7-bit ASCII unless otherwise specified.
If you're sending that data Base64 encoded then you'll literally send TWFu. Many text-based protocols use Base64 out of convenience: it's an established standard and it's efficient enough for most applications.
The foundation of the internet, IP, is a protocol based on 8-bit bytes. When sending binary data you can make full use of all 8 bits, but if you're working with a text-mode protocol, of which there are many, you're generally stuck using 7-bit ASCII unless the protocol has a way of specifying which character set or encoding you're using.
If you have the option to switch to a "binary" transfer then you can side-step the need for Base64. If you're working with a 7-bit ASCII protocol then you're probably going to need Base64.
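For example, here's a small Python sketch showing that the raw bytes of "Man" and the Base64 text "TWFu" are different octet sequences on the wire:

import base64

raw = b"Man"                        # 3 octets if sent as-is
encoded = base64.b64encode(raw)     # b"TWFu": 4 ASCII octets

print(list(raw))                    # [77, 97, 110]
print(list(encoded))                # [84, 87, 70, 117]

# The two byte sequences are not the same: Base64 trades a 4/3 size
# increase for staying inside the 7-bit-safe ASCII range.
assert base64.b64decode(encoded) == raw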
Note this isn't the only method for encoding arbitrary binary data. There's also quoted-printable as used in email, and URI encoding for URLs. These are more efficient in cases where escaping is exceptional, but far less efficient if it's required for each character.
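A quick, unscientific comparison using Python's quopri and base64 modules illustrates that trade-off (the sample inputs are made up):

import base64
import os
import quopri

mostly_text = b"Hello, world!\xe9"          # escaping is the exception
random_blob = os.urandom(30)                # escaping needed almost everywhere

for label, data in (("mostly text", mostly_text), ("random bytes", random_blob)):
    qp = quopri.encodestring(data)
    b64 = base64.b64encode(data)
    print(f"{label}: raw={len(data)}  quoted-printable={len(qp)}  base64={len(b64)}")

Quoted-printable barely grows the mostly-text input, but can roughly triple the size of random binary data, while Base64 grows both by a steady 4/3.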
If you know you're dealing with 7-bit text only there's no need for base-64 encoding.
However, if you'd need to send
Man
Boy
over a purely line-oriented 7-bit text channel, where a line break ends a protocol line, you couldn't send it literally with the line breaks included. Instead, you'd send it encoded in Base64:
TWFuDQpCb3kNCg==
which has encoded line breaks but doesn't use incompatible characters. Of course, the receiver needs to know that you're sending encoded text - either implied by the protocol or explicitly marked in some way.
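You can verify that agreement with a couple of lines of Python:

import base64

text = "Man\r\nBoy\r\n"                          # two lines with CRLF breaks
encoded = base64.b64encode(text.encode("ascii"))
print(encoded)                                   # b'TWFuDQpCb3kNCg=='

# The receiver reverses the agreement to get the original text back.
assert base64.b64decode(encoded).decode("ascii") == text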
IMAP RFC:
8-bit textual and binary mail is supported through the use of a
[MIME-IMB] content transfer encoding. IMAP4rev1 implementations MAY
transmit 8-bit or multi-octet characters in literals, but SHOULD do
so only when the [CHARSET] is identified.
Although a BINARY body encoding is defined, unencoded binary
strings are not permitted. A "binary string" is any string with
NUL characters. Implementations MUST encode binary data into a
textual form, such as BASE64, before transmitting the data. A
string with an excessive amount of CTL characters MAY also be
considered to be binary.
If the implementation has to convert to Base64, why does the RFC say "a BINARY body encoding is defined"? Since we need to send the data as Base64 (or some other format) every time, binary effectively isn't supported. Or am I reading something wrong?
IMAP supports MIME multipart; can the parts inside it have binary data, that is, Content-Transfer-Encoding: binary?
I am new to IMAP/HTTP. The reason for asking this question is that I have to develop a server which supports both HTTP and IMAP: over HTTP the server receives the data in binary (huge multipart data, with Content-Transfer-Encoding: binary), and a FETCH can be done over IMAP. The problem is that I need to parse the data and convert each part inside the multipart to Base64 if IMAP doesn't support binary, which I think is a severe performance issue.
The answer is unfortunately "maybe".
The MIME RFC supports binary, but the IMAP RFC specifically disallows sending NUL characters. This is likely because they can be confusing for text-based parsers, especially those written in C, where NUL marks the end of a string.
Some IMAP servers just consider the body to be a "bag of bytes", and I suspect few, if any, actually do re-encoding. So if you ask for the entire message, you will probably get its literal content.
If your clients can handle MIME-Binary, you will probably be fine.
There is RFC 3516 for an IMAP extension to support BINARY properly, but this is not widely deployed.
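As a sketch of what the re-encoding amounts to, here is Python's email library (nothing IMAP-specific; the payload and filename are made up) turning a binary part into a base64 Content-Transfer-Encoding so the resulting message contains no raw NUL octets:

from email.message import EmailMessage

payload = bytes(range(256))            # hypothetical binary payload with NULs

msg = EmailMessage()
msg["Subject"] = "binary attachment"
msg.set_content("See attachment.")
# For a non-text part the library picks Content-Transfer-Encoding: base64,
# so the serialized message contains no raw NUL octets.
msg.add_attachment(payload,
                   maintype="application",
                   subtype="octet-stream",
                   filename="flags.bin")

wire_form = msg.as_bytes()             # what you could APPEND to a mailbox
print(b"base64" in wire_form)          # True
print(b"\x00" in wire_form)            # False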
As a side note: why are you using Multipart MIME? That is an odd implementation choice for HTTP.
I am sending some video files (the size could even be in GB) as application/x-www-form-urlencoded over HTTP POST.
The following link suggests that it would be better to transmit it as multipart/form-data when we have non-alphanumeric content.
Which encoding would be better to transmit data of this kind?
Also how can I find the length of encoded data (data encoded with application/x-www-form-urlencoded)?
Will encoding the binary data consume much time?
In general, encoding replaces the non-alphanumeric characters with escape sequences. So, can we skip encoding for binary data (like video)? How can we skip it?
x-www-form-urlencoded treats the value of an entry in the form data set as a sequence of bytes (octets).
Of the possible 256 values, only 66 are left as-is or still encoded as a single byte; the others are replaced by the hexadecimal representation of their code point.
This usually takes three to five bytes depending on the encoding.
So on average (256-66)/256, or about 74%, of the file will be encoded to take three to five times as much space as it did originally.
This encoding, however, has no header and no other significant overhead.
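If you want to measure that expansion yourself, here's a rough sketch using Python's quote_plus as a stand-in for the form encoder (the random payload stands in for your video bytes):

import os
from urllib.parse import quote_plus

payload = os.urandom(100_000)        # stand-in for arbitrary video bytes

# Percent-encode the value the way application/x-www-form-urlencoded does.
encoded = quote_plus(payload)
print(f"{len(payload)} raw octets -> {len(encoded)} encoded octets "
      f"(about x{len(encoded) / len(payload):.2f})")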
multipart/form-data instead works by dividing the data into parts and then finding a string of any length that doesn't occur in said part.
Such a string is called the boundary, and it is used to delimit the end of the part, which is transmitted as a stream of octets.
So the file is mostly sent as-is, with negligible size overhead for big enough data.
The drawback is that the user agent needs to find a suitable boundary; however, given a string of length k, there is only a probability of 2^(-8k) of it matching at any given position in a uniformly random binary file.
So the user-agent can simply generate a random string and do a quick search and exploit the network transmission time to hide the latency of the search.
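A sketch of that boundary search, using a hypothetical pick_boundary helper:

import secrets

def pick_boundary(payload: bytes) -> bytes:
    """Return a random boundary string that does not occur in the payload."""
    while True:
        candidate = b"----boundary-" + secrets.token_hex(16).encode("ascii")
        if candidate not in payload:     # virtually always true on the first try
            return candidate

part = bytes(range(256)) * 1000          # some arbitrary binary part
print(pick_boundary(part))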
You should use multipart/form-data.
This depends on the platform you are using; in general, if you cannot access the request body, you have to re-perform the encoding yourself.
For multipart/form-data encoding there is a little overhead, usually negligible compared to the transmission time.
I thought I knew this already, but now I'm not sure: is all content sent over HTTP always encoded to character data? I.e., if my content type is a binary file type, is it always converted to something like BinHex, or is it possible to send "actual" binary data across the wire?
In HTTP there is no content transfer encoding (e.g. Base64) applied, so binary data is sent as-is, byte by byte.
Character data is just binary data with special meaning to humans :p
The actual body of the HTTP request may be encoded and/or compressed, and this is specified in the headers.
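For example, here's a sketch of a compressed binary POST (the endpoint is a placeholder); the body on the wire is the gzip stream itself, declared via Content-Encoding:

import gzip
import urllib.request

raw_body = b"\x00\x01\x02 binary payload \xff" * 100

# The bytes on the wire are the gzip stream, declared in the headers;
# no textual re-encoding of the body takes place.
compressed = gzip.compress(raw_body)
req = urllib.request.Request(
    "https://example.com/upload",        # placeholder endpoint
    data=compressed,
    method="POST",
    headers={"Content-Type": "application/octet-stream",
             "Content-Encoding": "gzip"},
)
# urllib.request.urlopen(req)            # send it against a real endpoint
print(f"{len(raw_body)} raw -> {len(compressed)} compressed octets")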