When chunked HTTP transfer encoding is used, why does the server need to write out both the chunk size in bytes and have the subsequent chunk data end with CRLF?
Doesn't this make sending binary data "CRLF-unclean" and the method a bit redundant?
What if the data has a 0x0D followed by 0x0A in it somewhere (i.e. an embedded CRLF that is actually part of the data)? Is the client then expected to adhere to the chunk size explicitly provided at the head of the chunk, or to choke on the first CRLF it encounters in the data?
My understanding so far of expected client behaviour is to simply take the chunk size provided by the server, proceed to the next line, read exactly that many bytes of the following data (CRLF or no CRLF therein), then skip the CRLF following the data and repeat the procedure until there are no more chunks. Is this compliant behaviour? If so, what is the point of the CRLF after each data chunk? Readability?
I have done some Web searching on this and also did some reading of the HTTP 1.1 specification, but a definitive answer seems to be eluding me.
A chunked consumer does not scan the message body for a CRLF pair. It first reads the specified number of bytes, and then reads two more bytes to confirm that they are CR and LF. If they're not, the message body is ill-formed, and either the size was specified improperly or the data was otherwise corrupted.
The trailing CRLF is a belt-and-suspenders assurance (per RFC 2616 section 3.6.1, Chunked Transfer Coding), but it also serves to maintain the consistent rule that fields start at the beginning of the line.
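To make that concrete, here is a minimal sketch of such a consumer in Python (the stream argument stands for any file-like object over the connection, e.g. an io.BufferedReader; the function name and error handling are illustrative, not from any particular library):

def read_chunked_body(stream):
    # stream: a file-like object positioned at the start of the chunked body.
    body = bytearray()
    while True:
        # Chunk header: hexadecimal size, optionally followed by ";extensions".
        header = stream.readline()
        size = int(header.split(b";")[0], 16)
        if size == 0:
            break                          # last-chunk (size zero) reached
        body += stream.read(size)          # exactly `size` bytes; never scanned for CRLF
        if stream.read(2) != b"\r\n":      # the CRLF that must follow chunk-data
            raise ValueError("malformed chunk: bad size or corrupted data")
    # (trailer headers after the last chunk are not handled in this sketch)
    return bytes(body)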
The CRLF after each chunk is probably just for better readability, as it is not necessary given the chunk size at the beginning of each chunk. But the CRLF after the chunk header is necessary, as there may be additional information after the chunk size (see RFC 2616 section 3.6.1, Chunked Transfer Coding):
chunk = chunk-size [ chunk-extension ] CRLF
chunk-data CRLF
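For example, a chunk header carrying a (hypothetical) extension could look like this on the wire, followed by exactly 0x1A = 26 bytes of data:

1A;name=value CRLF
<26 bytes of chunk-data> CRLF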
I've read about chunked Transfer-Encoding and basically got the point. However, there's something I don't quite understand that hasn't been addressed in any of the sources I've read.
Chunked-encoded data is structured as a series of chunks, each laid out as follows:
<chunk size> (the size in bytes, written in hexadecimal as ASCII characters)
\r\n
<data>
\r\n
What I don't understand is: what if the payload itself contains a \r\n ? Doesn't it interfere with the way we track when a chunk starts and ends?
You could argue that even if it does, we still have the chunk size before the chunk, so that CRLF shouldn't bother us, but then I would ask: if so, why have these CRLFs in the first place?
Please clarify.
Yes, it can include \r\n.
As to why this format was chosen: I don't know. Maybe to make it more readable when used with textual data.
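A worked example may help. To send the 13-byte payload Hello\r\nWorld! (where the \r\n is part of the data), the sender writes, with every CRLF shown explicitly:

D\r\n                    (chunk size: 0xD = 13 bytes)
Hello\r\nWorld!\r\n      (exactly 13 bytes of data, then the chunk-terminating CRLF)
0\r\n                    (last chunk, size zero)
\r\n                     (end of the chunked body)

The receiver reads the size D, then exactly 13 bytes, so the \r\n inside the data is consumed as ordinary bytes; only the CRLF after byte 13 is treated as chunk framing.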
May a URL contain raw binary data in a GET-request?
Is it possible to create a URL, www.example.com/**binary-data**, where www.example.com/ are ordinary ASCII characters, and **binary-data** are arbitrary raw byte-values, e.g., 0x10.
I don't want to encode the binary data, but just create a string, e.g., a char* in C, that contains both the ASCII characters and the binary data.
Or is a POST request the only way to send raw binary data as part of the body?
No, but you could percent-encode the non-URI characters.
No. A URL transmitted in an HTTP GET request is percent-encoded, UTF-8 encoded Unicode text (resulting in an "ASCII" string).
Again, it's Unicode text.
Unicode encodings do not produce arbitrary binary data. There is no equivalent text for some arbitrary binary data.
Moving away from "raw", the server and request can, of course, agree on the use of a scheme such as Base64 to turn arbitrary binary data into Unicode text. By that point, though, you might as well use an HTTP request with a body, and although as far as HTTP is concerned the bodies are raw binary data, HTTP headers can indicate a standard format. Such requests include POST and PUT.
There are also practical limits to the length of the URL.
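If you do need to get binary data into a URL, here is a minimal sketch in Python of the two options mentioned above (the example URL and byte values are arbitrary):

import base64
from urllib.parse import quote

raw = bytes([0x10, 0xFF, 0x00, 0x41])      # arbitrary binary data, including 0x10

# Option 1: percent-encode every byte so the result is legal URL text.
url = "http://www.example.com/" + quote(raw, safe="")    # .../%10%FF%00A

# Option 2: a scheme agreed on with the server, e.g. URL-safe Base64.
token = base64.urlsafe_b64encode(raw).decode("ascii")    # 'EP8AQQ=='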
When I read CSV files into R, the resulting dataframe has very different dimensions than what I see when I open the file in Excel or Notepad, and the column heading is labeled "ÿþA". What does this mean?
The file you are reading uses a UTF-16 or UTF-32 encoding (with a BOM), and R's read.csv function has not been told which encoding to use.
As Karsten suggests, you should use the fileEncoding parameter to specify the correct encoding, which I suspect should be "UTF-16LE", e.g. read.csv(file, fileEncoding = "UTF-16LE"). The "ÿþ" in the column heading is the UTF-16LE byte order mark (the bytes 0xFF 0xFE) being misread as two single-byte characters.
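If you are unsure which encoding a file uses, peeking at its first bytes shows which BOM, if any, it starts with. A small sketch (Python used for illustration; "data.csv" is a placeholder file name), ordered so the longer UTF-32 BOMs are tested before the UTF-16 ones they begin with:

with open("data.csv", "rb") as f:
    head = f.read(4)

if head.startswith(b"\xff\xfe\x00\x00"):
    encoding = "UTF-32LE"
elif head.startswith(b"\x00\x00\xfe\xff"):
    encoding = "UTF-32BE"
elif head.startswith(b"\xff\xfe"):
    encoding = "UTF-16LE"       # the bytes that appear as "ÿþ"
elif head.startswith(b"\xfe\xff"):
    encoding = "UTF-16BE"
elif head.startswith(b"\xef\xbb\xbf"):
    encoding = "UTF-8-BOM"      # R's name for UTF-8 with a BOM
else:
    encoding = None             # no BOM; assume the platform default

The resulting name can then be passed to read.csv via fileEncoding.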
Here is what the R documentation on connections states about encoding:
Encoding
The encoding of the input/output stream of a connection can be specified by name in the same way as it would be given to iconv: see that help page for how to find out what encoding names are recognized on your platform. Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is the internal encoding of the current locale and hence no translation is done.
Re-encoding only works for connections in text mode: reading from a connection with re-encoding specified in binary mode will read the stream of bytes, but mixing text and binary mode reads (e.g. mixing calls to readLines and readChar) is likely to lead to incorrect results.
The encodings "UCS-2LE" and "UTF-16LE" are treated specially, as they are appropriate values for Windows ‘Unicode’ text files. If the first two bytes are the Byte Order Mark 0xFFFE then these are removed as some implementations of iconv do not accept BOMs. Note that whereas most implementations will handle BOMs using encoding "UCS-2" and choose the appropriate byte order, some (including earlier versions of glibc) will not. There is a subtle distinction between "UTF-16" and "UCS-2" (see http://en.wikipedia.org/wiki/UTF-16/UCS-2: the use of surrogate pairs is very rare so "UCS-2LE" is an appropriate first choice.
As from R 3.0.0 the encoding "UTF-8-BOM" is accepted for reading and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications). If it is required (it is not recommended) when writing, it should be written explicitly, e.g. by writeChar("\ufeff", con, eos = NULL) or writeBin(as.raw(c(0xef, 0xbb, 0xbf)), binary_con).
Requesting a conversion that is not supported is an error, reported when the connection is opened. Exactly what happens when the requested translation cannot be done for invalid input is in general undocumented. On output the result is likely to be the output up to the error, with a warning. On input, it will most likely be all or some of the input up to the error.
It may be possible to deduce the current native encoding from Sys.getlocale("LC_CTYPE"), but not all OSes record it.
And here is what Wikipedia says about the BOM:
Byte order mark
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional and, if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
The HTTP/1.1 specification (RFC 2616) seems not to define a maximum size for chunks. But without a maximum chunk-size, there is no way to know where the chunk-size ends and the chunk-extension begins. There must be a maximum chunk-size, or else I can't ignore a chunk-extension as I'm required to do when I don't understand it (quote: "MUST ignore chunk-extension extensions they do not understand").
Each chunk extension must begin with a semi-colon and the list of chunk extensions must end with a CRLF. When parsing the chunk-size, stop at either a semi-colon or a CRLF. If you stopped at a semi-colon, ignore everything up to the next CRLF. There is no need for a maximum chunk-size.
chunk = chunk-size [ chunk-extension ] CRLF
chunk-data CRLF
chunk-size = 1*HEX
chunk-extension= *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
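In code, parsing the chunk header reduces to a few lines; a sketch in Python (the function name is illustrative):

def parse_chunk_header(line: bytes) -> int:
    # line is one chunk header, e.g. b"1a2b;name=value".
    size_field, _, extensions = line.partition(b";")   # extensions, if any, are ignored
    return int(size_field.strip(), 16)                 # chunk-size is hexadecimal

Both parse_chunk_header(b"1a2b;foo=bar") and parse_chunk_header(b"1a2b") return 0x1A2B.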
The HTTP specification is pretty clear about the syntax of the HTTP messages.
The chunk size is always given as a hexadecimal number. If that number is not directly followed by a CRLF but by a ; instead, you know that there is an extension. This extension is identified by its name (chunk-ext-name). If you have never heard of that particular name, you MUST ignore it.
So what exactly is your problem?
Read a hexadecimal number
Ignore everything up to the next CRLF
Be happy
Say the body I'm trying to send via chunked encoding includes "\r\n"; how do I avoid that being interpreted as the chunk delimiter?
e.g. "All your base are\r\n belong to us"
http://en.wikipedia.org/wiki/Chunked_transfer_encoding
"\r\n" isn't really a chunk delimiter. The chunk size specifies the number of bytes made up by that chunk's data. The client should then read the "\r\n" embedded within your message just fine.
By design, that is not a problem at all. Each chunk specifies the byte size of its data block. The contents of each data block are arbitrary, and must be received as such, so it can include line breaks in it. If the client is reading each chunk correctly (read a line and parse the byte size from it, then read the specified number of bytes, then read a line break), it won't matter if there are line breaks in the data, since the client is reading the data based on byte size, not on line breaks.
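For example, sending your message as a single chunk (a sketch in Python; a real server would write these bytes to the socket):

msg = b"All your base are\r\n belong to us"                  # 32 bytes, CRLF included
chunked = b"%X\r\n" % len(msg) + msg + b"\r\n" + b"0\r\n\r\n"
# b'20\r\nAll your base are\r\n belong to us\r\n0\r\n\r\n'

The client reads the size 0x20, then exactly 32 bytes (consuming the embedded \r\n as data), and only then expects the chunk-terminating CRLF.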