Can chunked Transfer-Encoded data include CRLF? - http

I've read about chunked Transfer-Encoding and basically got the point. However, there's something I don't quite understand and that hasn't been referred to in any of the sources I've read.
Chunked-encoded data is structured as a series of chunks, each formatted as follows:
<chunk size> (ASCII characters expressing the size as a hexadecimal value)
\r\n
<data>
\r\n
What I don't understand is: what if the payload itself contains a \r\n? Doesn't it interfere with the way we track where a chunk starts and ends?
You could argue that even if it does, we still have the chunk size before the chunk, so that CRLF shouldn't bother us, but then I would ask: if so, why have these CRLFs in the first place?
Please clarify.

Yes, it can include \r\n.
As to why this format was chosen: I don't know. Maybe to make it more readable when used with textual data.
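To illustrate, here is a minimal sketch in Python (the payload is made up): a chunk whose data contains \r\n looks like this on the wire, and the receiver finds the end of the chunk from the size line, not from the embedded CRLF.

payload = b"first line\r\nsecond line"          # the data itself contains \r\n

chunked_body = (
    b"17\r\n"                                   # chunk size in hex (0x17 = 23 bytes), then CRLF
    b"first line\r\nsecond line"                # chunk data; the embedded CRLF is just data
    b"\r\n"                                     # CRLF that closes the chunk
    b"0\r\n\r\n"                                # zero-size chunk marks the end of the body
)

assert len(payload) == 0x17                     # the size line, not the CRLF, delimits the data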

Related

May a URL contain arbitrary binary data in a GET-request?

May a URL contain raw binary data in a GET-request?
Is it possible to create a URL, www.example.com/**binary-data**, where www.example.com/ are ordinary ASCII characters, and **binary-data** are arbitrary raw byte values, e.g., 0x10?
I don't want to encode the binary data, but just create a string, e.g., a char* in C, that contains both the ASCII characters and the binary data.
Or is a POST request the only way to send raw binary data, as part of the body?
No, but you could percent-escape the non-URI characters.
No. A URL transmitted in an HTTP GET request is percent-encoded, UTF-8 encoded Unicode text (resulting in an "ASCII" string).
Again, it's Unicode text.
Unicode encodings do not produce arbitrary binary data; for some arbitrary binary data there is simply no equivalent text.
Moving away from "raw", the client and server can, of course, agree on the use of a scheme such as Base64 to turn arbitrary binary data into Unicode text. By that point, though, you might as well use an HTTP request with a body; as far as HTTP is concerned the body is raw binary data, and HTTP headers can indicate a standard format. Such requests include POST and PUT.
There are also practical limits to the length of the URL.
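To illustrate the two options in practice (a sketch; www.example.com and the byte values are placeholders), the snippet below percent-encodes a few raw bytes into the URL path and, alternatively, Base64-encodes them; anything larger is better sent as the body of a POST or PUT request.

import base64
from urllib.parse import quote

raw = bytes([0x10, 0xFF, 0x00, 0x41])                 # arbitrary raw byte values, including 0x00

# Option 1: percent-encode the bytes so the URL remains ASCII text
url = "http://www.example.com/" + quote(raw, safe="")
print(url)                                            # http://www.example.com/%10%FF%00A

# Option 2: agree on Base64 (URL-safe alphabet) to turn the bytes into text
token = base64.urlsafe_b64encode(raw).decode("ascii")
print("http://www.example.com/" + token)              # http://www.example.com/EP8AQQ==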

What is the encoding of this data?

Can anyone tell me the encoding of this data? It doesn't seem to be base64.
/zZ/u00GIaP9HW010G000G01003/sm1302WS7YCU6IWZ8ICjAoWmF6H1F3StF7jONKbaaO2Pbe+
0Z8gWjER3eAhQhOgCoF/Bskxr////cy7////w/+Rz//Z/sm130IijBJmrF7P1GNRufOob+FZu+F
Zu+FZu+FZu+FZu+FZu+FZu+FZu+FZu+FZu+FZu+FZu+FZu+FZu+FZu+FZu+FZ/m00H200X07430
I800X410n41/yG07m000GK10G410G400000000000420mG51WS82GeB/yG0jH000W430m840mK5
10G0005z0G8300GH1H8XCK464r5X1o9n53A1aQ488qAnmHLIqV0aCs9oWWaA5XSO6Heb9YSeAIe
qDJOtE3awGqH5HaT8IKfJL5LMLrXPMcDaPMPdQ6bgStHrTdTuUNg3X8M6XuY9YfAJb9MMbvYPcg
AZfAMcfwYfghApjBMsjxYvkiB3nCN6nyZ9ojBJrDNMrzZPsk7Yu+JbvkVewUhnylFqzVRt+Fdw/
yG07m400m410G410G410G00000000420mG51WS82GeB/yG0jH400W4210G310S510G00G9t0042
0n441I4n1X91KGTXSHCYCe4854AHeR712ICpKl0LOdBH2XOaDE4byHSO6Hec9oWfAZKsDpWvEaD
4HKP7I4bAKrHLLbTOMLfZP6LcPsXfQdDqTNPtU7bwWeE4XOQ7Y8cAafEKbPQNc9cQegEafQQdgA
cgihEqjRQtkBcwmiF4nSR7oCdAqjFKrTRNsDdQukFavURdwEdgylFqzVRt+Fdw/ze030C1008H0
n40Fm2vQMjkzf0pGHVSKab1ad5I/PBPkbl5Z/S7D5dyrb1dfvQ/ZnIotCAIB6x7B38LLB5lm7Qc
G9zajcwMyMFzmSr18sd82pnn1586H5a4aP7EFIfblRUMRoVCsj/TO5IVph7kBAUH1TnkMJPl18s
bU/JFuvxydwWpLYZi9s8YItR7OAkV/m1LFIsjPLLrjujX08FbWPhdxUEIvlnvESbzsuZE1dgSd+
jT7DF77jyni1ZXG0IMFq52JUmCRzajcwMyMFy0S7D7sIsRfRnO/m1mSqXlRSo17VPaP0TIkVpxK
iy+C95jQHssgfFV6Sds0fyhMuYbBFPBUHsyTh4+M2imK31pzAjImsKKPbaWYMDUfyiS/fM8xgcg
ix7v1EIJrutLSdqoUxccdSX2I0e7Ex0ndhmE/Vy0ngQIjOQBqCTZSblAXYPKE2H6C4/N5IVPBPk
bl5Z/071pMHRwQ3V97vmEq1sKZ3O/0d+VlMpBSHHZzv8gZhZkVmzAWFGRzajcwMyMFzmSqVPBPk
bl5Z/S7DBzfYPrGahk+xkKZT+TJVV/0Dt+T0Qd6qKKKYZgxFvhA3FJor/7Yi+rbaRKBif5vpbzf
F2XL1nr/BZlYj2p+QoWpqyjVnugl5ljRYLNHtXcSoAw8JrwWu/3wqo1VEZakaIxjbZaF+hSuODW
yOFzAfHtKgj1O+KZgGkL2bICW7a+eF9EEsQkp1hwvX2HbOeM3cHq8px3FwrFBPmN4WaTCaSWXYC
dru/3ds50p1byrBpVRAoC+S8WzEm0wZZkFK7a6j4oksgkVACiYe0g361m2VcFrFrpLo6pWZVUY3
e02UJZ6C0+d+UcAYn91TFAoj97C536DUX0nqzEl+UkbDsk9wYpJ8P45vRg49mid3BdwaSVvxKv5
bCiapnAt92yu9K7W3sxzidZfKTtklWi4KR1IGnaT201xPxrU+//0Blyw9Q90SvgCPMyas/SPYm9
mESPFFy08VJ7NdRUQIApCio1olB2FE2Czi+t+SK2pWPjs7FkP6EUdlx3yXKWXZC8Y0Fb3W0a/m0
wKfNIGRb3Hcy/xHCxPTt+PSzktuSdyglI97l+q6CiLN0mCa/GVv/AajxQA5I8a037B7+yVyAYbb
dIut6DfBSZ0239pKeQrUX3VJ6TOeoZHHkGJ98E1MZz/m3tVvrHkNUyZycE1nd1tEk0Ajn9jYHCv
LL0p/UflOgMoHo5555G0KKKK0555501HHHG0KKKK0555501HHHG0KKKK0555507/za
(Line endings inserted for readability)
It appears to be base64. If you add a single equals sign ('=') for padding to the end, your decoder should be happy (see https://en.wikipedia.org/wiki/Base64#Padding).
Decoded, it's 1568 bytes, which mod 16 is zero. The histogram of byte-value occurrences is flat. So I'd guess something encrypted with a 128-bit block cipher like AES.
It does look like base64 to me. Most variants of base64 include the following characters:
A-Z a-z 0-9 + / (and = for padding)
However, if it were proper base64 it would end with a single = as padding, since the 2091 characters don't map to a whole number of bytes.
Your data doesn't seem to decode to anything readable, so it might be binary data, or encrypted (or both). Only with thorough knowledge of cryptography systems, and a lot of hints and luck, might some expert be able to figure out the encryption used (if any), but that's beyond the scope of this site.
Without more information as to the source of the data, we can only guess.
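You can reproduce the analysis above with a short sketch like this (data.txt is a hypothetical file holding the text above):

import base64
from collections import Counter

# Hypothetical file containing the 2091-character text, with line breaks removed
blob = open("data.txt").read().replace("\n", "").replace("\r", "")

decoded = base64.b64decode(blob + "=")   # one '=' of padding makes the length a multiple of 4
print(len(decoded), len(decoded) % 16)   # 1568 bytes; 1568 % 16 == 0, a whole number of 16-byte blocks

# A roughly flat histogram of byte values is typical of encrypted or compressed data
counts = Counter(decoded)
print(min(counts.values()), max(counts.values()))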

Please help identify multi-byte character encoding scheme on ASP Classic page

I'm working with a 3rd party (Commidea.com) payment processing system and one of the parameters being sent along with the processing result is a "signature" field. This is used to provide a SHA1 hash of the result message wrapped in an RSA encrypted envelope to provide both integrity and authenticity control. I have the API from Commidea but it doesn't give details of encoding and uses artificially created signatures derived from Base64 strings to illustrate the examples.
I'm struggling to work out what encoding is being used on this parameter and hoped someone might recognise the quite distinctive pattern. I initially thought it was UTF8 but having looked at the individual characters I am less sure.
Here is a short sample of the content which was created by the following code where I am looping through each "byte" in the string:
sig = Request.Form("signature")
For x = 1 To LenB(sig)
    s = s & AscB(MidB(sig, x, 1)) & ","
Next
' Print s to a debug log file
When I look in the log I get something like this:
129,0,144,0,187,0,67,0,234,0,71,0,197,0,208,0,191,0,9,0,43,0,230,0,19,32,195,0,248,0,102,0,183,0,73,0,192,0,73,0,175,0,34,0,163,0,174,0,218,0,230,0,157,0,229,0,234,0,182,0,26,32,42,0,123,0,217,0,143,0,65,0,42,0,239,0,90,0,92,0,57,0,111,0,218,0,31,0,216,0,57,32,117,0,160,0,244,0,29,0,58,32,56,0,36,0,48,0,160,0,233,0,173,0,2,0,34,32,204,0,221,0,246,0,68,0,238,0,28,0,4,0,92,0,29,32,5,0,102,0,98,0,33,0,5,0,53,0,192,0,64,0,212,0,111,0,31,0,219,0,48,32,29,32,89,0,187,0,48,0,28,0,57,32,213,0,206,0,45,0,46,0,88,0,96,0,34,0,235,0,184,0,16,0,187,0,122,0,33,32,50,0,69,0,160,0,11,0,39,0,172,0,176,0,113,0,39,0,218,0,13,0,239,0,30,32,96,0,41,0,233,0,214,0,34,0,191,0,173,0,235,0,126,0,62,0,249,0,87,0,24,0,119,0,82,0
Note that every other value is a zero, except occasionally where it is 32 (0x20). I'm familiar with UTF-8, where characters above 127 are represented using two bytes, but if this were UTF-8 encoding then I would expect the "32" value to be more like 194 (0xC2) or 195 (0xC3), and the other value to be greater than 0x80.
Ultimately what I'm trying to do is convert this signature parameter into a hex-encoded string (e.g. "12ab0528...") which is then used by the RSA/SHA1 function to verify the message is intact. This part is already working, but I can't for the life of me figure out how to get the signature parameter decoded.
For historical reasons we are having to use classic ASP and the SHA1/RSA functions are javascript based.
Any help would be much appreciated.
Regards,
Craig.
Update: Tried looking into UTF-16 encoding on Wikipedia and other sites. Can't find anything to explain why I am seeing only 0x20 or 0x00 in the (assumed) high order byte positions. I don't think this is relevant any more as the example below shows other values in this high order position.
Tried adding some code to log the values using Asc instead of AscB (Len,Mid instead of LenB,MidB too). Got some surprising results. Here is a new stream of byte-wise characters followed by the equivalent stream of word-wise (if you know what I mean) characters.
21,0,83,1,214,0,201,0,88,0,172,0,98,0,182,0,43,0,103,0,88,0,103,0,34,33,88,0,254,0,173,0,188,0,44,0,66,0,120,1,246,0,64,0,47,0,110,0,160,0,84,0,4,0,201,0,176,0,251,0,166,0,211,0,67,0,115,0,209,0,53,0,12,0,243,0,6,0,78,0,106,0,250,0,19,0,204,0,235,0,28,0,243,0,165,0,94,0,60,0,82,0,82,0,172,32,248,0,220,2,176,0,141,0,239,0,34,33,47,0,61,0,72,0,248,0,230,0,191,0,219,0,61,0,105,0,246,0,3,0,57,32,54,0,34,33,127,0,224,0,17,0,224,0,76,0,51,0,91,0,210,0,35,0,89,0,178,0,235,0,161,0,114,0,195,0,119,0,69,0,32,32,188,0,82,0,237,0,183,0,220,0,83,1,10,0,94,0,239,0,187,0,178,0,19,0,168,0,211,0,110,0,101,0,233,0,83,0,75,0,218,0,4,0,241,0,58,0,170,0,168,0,82,0,61,0,35,0,184,0,240,0,117,0,76,0,32,0,247,0,74,0,64,0,163,0
And now the word-wise data stream:
21,156,214,201,88,172,98,182,43,103,88,103,153,88,254,173,188,44,66,159,246,64,47,110,160,84,4,201,176,251,166,211,67,115,209,53,12,243,6,78,106,250,19,204,235,28,243,165,94,60,82,82,128,248,152,176,141,239,153,47,61,72,248,230,191,219,61,105,246,3,139,54,153,127,224,17,224,76,51,91,210,35,89,178,235,161,114,195,119,69,134,188,82,237,183,220,156,10,94,239,187,178,19,168,211,110,101,233,83,75,218,4,241,58,170,168,82,61,35,184,240,117,76,32,247,74,64,163
Note that the second pair of byte-wise characters (83,1) seems to be interpreted as 156 in the word-wise stream. We also see (34,33) as 153, (120,1) as 159 and (220,2) as 152. Does this give any clues as to the encoding? Why are these 15x values (152, 153, 156, 159) apparently being treated differently from other values?
What I'm trying to figure out is whether I should use the byte-wise data and carry out some post-processing to get back to the intended values, or if I should trust the word-wise data with whatever implicit decoding it is apparently performing. At the moment, neither seems to give me a match between data content and signature, so I need to change something.
Thanks.
Quick observation tells me that you are likely dealing with UTF-16. Start from there.
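One reading that fits the numbers above (a hedged sketch, not something confirmed by the answer): the word-wise values 152, 153, 156 and 159 are exactly the Windows-1252 code points whose Unicode equivalents (U+02DC, U+2122, U+0153, U+0178) match the byte pairs (220,2), (34,33), (83,1) and (120,1) read as little-endian UTF-16, and the occasional 0x20 high bytes fit the same pattern (e.g. (19,32) is U+2013, which is 0x96 in Windows-1252). That would mean the raw signature bytes were interpreted as Windows-1252 text and stored by ASP as UTF-16, which can be reversed; signature_text below is a placeholder for the form value.

# Byte pairs from the byte-wise dump, read as little-endian UTF-16 code units
pairs = [(83, 1), (34, 33), (120, 1), (220, 2), (19, 32)]

for lo, hi in pairs:
    codepoint = lo + (hi << 8)                       # e.g. (83,1) -> U+0153
    original = chr(codepoint).encode("cp1252")[0]    # map back through Windows-1252
    print(hex(codepoint), original)                  # 0x153 -> 156, 0x2122 -> 153, 0x178 -> 159, 0x2dc -> 152, 0x2013 -> 150

# If that holds for the whole string, the raw signature bytes could be recovered with:
# raw_bytes = signature_text.encode("cp1252")        # then hex-encode raw_bytes for the RSA/SHA1 check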

HTTP Chunked transfer encoding: How do you send "\r\n"?

Say the body I'm trying to send via chunked encoding includes "\r\n"; how do I avoid that being interpreted as the chunk delimiter?
e.g. "All your base are\r\n belong to us"
http://en.wikipedia.org/wiki/Chunked_transfer_encoding
"\r\n" isn't really a chunk delimiter. The chunk size specifies the number of bytes made up by that chunk's data. The client should then read the "\r\n" embedded within your message just fine.
By design, that is not a problem at all. Each chunk specifies the byte size of its data block. The contents of each data block are arbitrary and must be received as such, so they can include line breaks. If the client reads each chunk correctly (read a line and parse the byte size from it, then read the specified number of bytes, then read a line break), it doesn't matter whether there are line breaks in the data, since the client reads the data based on byte size, not on line breaks.
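For example (a small sketch; the size line is just the hexadecimal byte count of the data):

body = b"All your base are\r\n belong to us"

size_line = format(len(body), "x").encode("ascii")   # b"20" (0x20 = 32 bytes)
chunk = size_line + b"\r\n" + body + b"\r\n"         # size CRLF data CRLF
message = chunk + b"0\r\n\r\n"                       # zero-size chunk terminates the body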

How should an HTTP client properly parse *chunked* HTTP response body?

When chunked HTTP transfer encoding is used, why does the server need to write out both the chunk size in bytes and have the subsequent chunk data end with CRLF?
Doesn't this make sending binary data "CRLF-unclean" and the method a bit redundant?
What if the data has a 0x0D followed by a 0x0A in it somewhere (i.e. these bytes are actually part of the data)? Is the client then expected to adhere to the chunk size explicitly provided at the head of the chunk, or to choke on the first CRLF it encounters in the data?
My understanding so far of the expected client behaviour is to simply take the chunk size provided by the server, proceed to the next line, read exactly that number of bytes from the following data (CRLF or no CRLF therein), then skip the CRLF following the data and repeat the procedure until there are no more chunks. Is this compliant behaviour? If so, what is the point of the CRLF after each data chunk? Readability?
I have done some Web searching on this and also did some reading of the HTTP 1.1 specification, but a definitive answer seems to be eluding me.
A chunked consumer does not scan the message body for a CRLF pair. It first reads the specified number of bytes, and then reads two more bytes to confirm that they are CR and LF. If they're not, the message body is ill-formed, and either the size was specified improperly or the data was otherwise corrupted.
The trailing CRLF is a belt-and-suspenders assurance (per RFC 2616 section 3.6.1, Chunked Transfer Coding), but it also serves to maintain the consistent rule that fields start at the beginning of the line.
The CRLF after each chunk is probably just for better readability, as it's not necessary given the chunk size at the beginning of each chunk. But the CRLF after the "chunk header" is necessary, as there may be additional information after the chunk size (see Chunked Transfer Coding):
chunk = chunk-size [ chunk-extension ] CRLF
        chunk-data CRLF
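To make that concrete, here is a minimal Python sketch of the compliant client behaviour described above (it ignores trailer headers and does no error handling beyond the CRLF check):

import io

def read_chunked_body(stream):
    """Decode a chunked body from a binary file-like object (a minimal sketch)."""
    body = bytearray()
    while True:
        size_line = stream.readline()                     # e.g. b"17\r\n" or b"17;ext=1\r\n"
        size = int(size_line.split(b";")[0].strip(), 16)  # hex size; ignore any chunk-extension
        if size == 0:                                     # last chunk
            stream.readline()                             # consume the final CRLF (no trailers here)
            return bytes(body)
        body += stream.read(size)                         # read exactly `size` bytes, CRLFs included
        if stream.read(2) != b"\r\n":                     # the CRLF that closes the chunk
            raise ValueError("malformed chunk: missing CRLF after chunk data")

# Example: a body whose data contains CRLF decodes correctly
wire = io.BytesIO(b"17\r\nfirst line\r\nsecond line\r\n0\r\n\r\n")
assert read_chunked_body(wire) == b"first line\r\nsecond line"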
