I found this string several times on the Internet, and I wonder what it means, and where it comes from:
3i2ndDfv2rTHiSisAbouNdArYfORhtTPEefj3q2f
It's often used after a boundery-definition in the HTTP-Content-Type-Header:
Content-Type: multipart/form-data; boundary=--3i2ndDfv2rTHiSisAbouNdArYfORhtTPEefj3q2f
http://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
rfc1341 7.2 The Multipart Content-Type
The body must then contain one or more "body parts," each preceded by an encapsulation boundary, and the last one followed by a closing boundary.
Related
I have found some text in this form:
=B0=A1=C1=CB,=C4=E3=D2=B2=C3=BB=C1=AA=CF=B5=CE=D2,=D7=EE=BD=FC=CA=C7=B2=BB=CA=C7=
=BA=DC=C3=A6=B0=A1
containing mostly sequences consisting of an equal sign followed by two hexadecimal digits.
I am told it could be converted into this Chinese sentence:
啊了你也没联系我最近是不是很忙啊
What is the =B0=A1=C1 and how to decode/convert it?
The Chinese sentence has been encoded into an 8-bit Guobiao encoding (GB2312, GBK or GB18030; most likely the latter, though it apparently decodes correctly as the former too), and then further encoded into the 7-bit MIME quoted-printable encoding.
To decode it into a Unicode string, first undo the quoted-printable encoding, then decode the Guobiao encoding. Here’s an example using Python:
import quopri
print(quopri.decodestring("""\
=B0=A1=C1=CB,=C4=E3=D2=B2=C3=BB=C1=AA=CF=B5=CE=D2,=D7=EE=BD=FC=CA=C7=B2=BB=CA=C7=
=BA=DC=C3=A6=B0=A1\
""").decode('gb18030'))
This outputs 啊了,你也没联系我,最近是不是很忙啊 on my terminal.
The quoted-printable encoding is usually found in e-mail messages; whether it is actually in use should be determined from message headers. A message encoded in this manner should carry the header Content-Transfer-Encoding: quoted-printable. The text encoding (gb18030 in this case) should be specified in the charset parameter of the Content-Type header, but sometimes can be determined by other means.
According to the HTTP specs a header can look like this:
Header-Name=value1,value2,value3,...
I try to parse the header values and store them as an array:
array('value1', 'value2', 'value3')
so far so good. I can just tokenize the string if a comma appears.
BUT how should I handle headers like this one:
Expires=Thu, 01 Dec 1994 16:00:00 GMT
there's a comma but in the one value the header has. Oh that's easy I thought and figuered out the rule: Only separate by commas when there's no space before and after the comma. This way both examples get parsed correct.
BUT then I came across a header like this:
Accept-Encding=gzip, deflate
and now? Is this one value array('gzip, deflate') or two values array('gzip', 'deflate')? For me they are two separate values but then my rule from the above isn't true anymore.
Is there a list which headers are allowed more than once? So I can check against a blacklist to determine if the comma means a value delimiter or not?
Comma concatenation can occur for any header field, even those that aren't designed for it; it's how libraries and intermediaries happen to work.
It is designed to be used for header fields that use list syntax (RFC 7230 has all the details).
Finally, you can't use generic code to tokenize, because the way the comma can occur inside values varies from field to field.
RFC7233 is nice and clear, except for line endings.
I am specifically interested the HTTP response body of a multipart/byteranges response. I assume each line is terminated by a CRLF as HTTP headers are, but this document isn't explicit about it. What I'm totally befuddled about is the last line: --THIS_SEPARATOR_SEPARATES--. Is it followed by a CRLF?
Full block:
HTTP/1.1 206 Partial Content
Date: Wed, 15 Nov 1995 06:25:24 GMT
Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT
Content-Length: 1741
Content-Type: multipart/byteranges; boundary=THIS_STRING_SEPARATES
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 500-999/8000
...the first range...
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 7000-7999/8000
...the second range
--THIS_STRING_SEPARATES--
Sorry I really can't find it, so help would be greatly appreciated.
NOTE: please no gut feelings, only RFC references.
If you read RFC 7233 more carefully, Appendix A refers to RFC 2046 Section 5.1 for the actual format of the MIME data within the HTTP body:
When a 206 (Partial Content) response message includes the content of
multiple ranges, they are transmitted as body parts in a multipart
message body ([RFC2046], Section 5.1) with the media type of
"multipart/byteranges".
RFC 2046 Section 5.1 defines the formal definition of the "multipart" media type and how its boundaries are formatted and parsed.
To answer your question, here is the formal syntax from RFC 2046:
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
NOTE: The CRLF preceding the boundary delimiter line is conceptually
attached to the boundary so that it is possible to have a part that
does not end with a CRLF (line break). Body parts that must be
considered to end with line breaks, therefore, must have two CRLFs
preceding the boundary delimiter line, the first of which is part of
the preceding body part, and the second of which is part of the
encapsulation boundary.
...
The boundary delimiter line following the last body part is a
distinguished delimiter that indicates that no further body parts
will follow. Such a delimiter line is identical to the previous
delimiter lines, with the addition of two more hyphens after the
boundary parameter value.
--gc0pJq0M:08jU534c0p--
NOTE TO IMPLEMENTORS: Boundary string comparisons must compare the
boundary value with the beginning of each candidate line. An exact
match of the entire candidate line is not required; it is sufficient
that the boundary appear in its entirety following the CRLF.
...
The only mandatory global parameter for the "multipart" media type is
the boundary parameter, which consists of 1 to 70 characters from a
set of characters known to be very robust through mail gateways, and
NOT ending with white space. (If a boundary delimiter line appears to
end with white space, the white space must be presumed to have been
added by a gateway, and must be deleted.) It is formally specified
by the following BNF:
boundary := 0*69 bcharsnospace
bchars := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
"+" / "_" / "," / "-" / "." /
"/" / ":" / "=" / "?"
Overall, the body of a "multipart" entity may be specified as
follows:
dash-boundary := "--" boundary
; boundary taken from the value of
; boundary parameter of the
; Content-Type field.
multipart-body := [preamble CRLF]
dash-boundary transport-padding CRLF
body-part *encapsulation
close-delimiter transport-padding
[CRLF epilogue]
transport-padding := *LWSP-char
; Composers MUST NOT generate
; non-zero length transport
; padding, but receivers MUST
; be able to handle padding
; added by message transports.
encapsulation := delimiter transport-padding
CRLF body-part
delimiter := CRLF dash-boundary
close-delimiter := delimiter "--"
preamble := discard-text
epilogue := discard-text
discard-text := *(*text CRLF) *text
; May be ignored or discarded.
body-part := MIME-part-headers [CRLF *OCTET]
; Lines in a body-part must not start
; with the specified dash-boundary and
; the delimiter must not appear anywhere
; in the body part. Note that the
; semantics of a body-part differ from
; the semantics of a message, as
; described in the text.
OCTET := <any 0-255 octet value>
Each delimiter at the beginning of a new part is terminated by a CRLF, and any CRLF that immediately precedes a delimiter is parsed as part of the boundary and not the data of the preceding part. However, there is no CRLF on the end of the final closing boundary, unless there is an epilogue present (which is very rarely used in email, and I have never seen it used in HTTP as there is no way to determine when then epilogue ends unless there is a valid Content-Length header present, which is not supposed to be used with self-terminating content types like MIME).
That spec references:
https://www.rfc-editor.org/rfc/rfc2046#section-5.1.1
Which explicitly states:
--gc0pJq0M:08jU534c0p
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
I'm trying to understand http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2
HTTP/1.1 header field values can be folded onto multiple lines if the
continuation line begins with a space or horizontal tab. All linear
white space, including folding, has the same semantics as SP. A
recipient MAY replace any linear white space with a single SP before
interpreting the field value or forwarding the message downstream.
LWS = [CRLF] 1*( SP | HT )
Can i put any number of <CR><LF><SP>, without putting any header value on the line ?
i.e. is this valid : Header:<CR><LF><SP><CR><LF><SP>Value
Yes, but see http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-22.html#rfc.section.3.2.4.p.3 - it's deprecated in the upcoming revision of the HTTP spec.
When chunked HTTP transfer encoding is used, why does the server need to write out both the chunk size in bytes and have the subsequent chunk data end with CRLF?
Doesn't this make sending binary data "CRLF-unclean" and the method a bit redundant?
What if the data has a 0x0A followed by 0x0D in it somewhere (i.e. these are actually part of the data)? Is the client then expected to adhere to the chunk size explicitly provided at the head of the chunk or choke on the first CRLF it encounters in the data?
My understanding so far of expected client behaviour is to simply take the chunk size provided by the server, proceed to the next line, then read exactly this amount of bytes from within the following data (CRLF or no CRLF therein), then skip the CRLF following the data and repeat the procedure until no more chunks. Is this compliant behaviour? If so, what is the point of the CRLF after each datachunk then? Readability?
I have done some Web searching on this and also did some reading of the HTTP 1.1 specification, but a definitive answer seems to be eluding me.
A chunked consumer does not scan the message body for a CRLF pair. It first reads the specified number of bytes, and then reads two more bytes to confirm that they are CR and LF. If they're not, the message body is ill-formed, and either the size was specified improperly or the data was otherwise corrupted.
The trailing CRLF is a belt-and-suspenders assurance (per RFC 2616 section 3.6.1, Chunked Transfer Coding), but it also serves to maintain the consistent rule that fields start at the beginning of the line.
The CRLF after each chunk is probably just for better readability as it’s not necessary due to the chunk size at the begin of each chunk. But the CRLF after the “chunk header” is necessary as there may be additional information after the chunk size (see Chunk Transfer Encoding):
chunk = chunk-size [ chunk-extension ] CRLF
chunk-data CRLF