Correct Way to Manually Parse HTTP Response

Correct Way to Manually Parse HTTP Response - http

I am working in a language that has extremely low-level TCP support (if you must know, it's UnrealScript). The response received after making a POST request includes the entire HTTP header, status code, body, etc. as a string.
So, I need to parse the response to extract the body text manually. The HTTP 1.1 specification says:
Response = Status-Line
*(( general-header
| response-header
| entity-header ) CRLF)
CRLF
[message-body]
Am I correct in assuming that the best way to do this is to split the string along a double CRLF (carriage return/line feed) and return the second part of this split?
Or are there weird HTTP edge cases I should be aware of?

Am I correct in assuming that the best way to do this is to split the string along a double CRLF
Yes - but what appears in the body may be compressed using three different compressions methods even if you told the server you don't accept compressed responses.
Further the body may be split into chunks, in between each chunk is an indicator of the size of the next chunk.
Do you really have no scope for using an off the shelf component for parsing? (I would recommend lib curl).

Related

Attachment in html formatted mail in unix

1. (cat mytest.html;uuencode "myfile.xls" "myfile.xls")|mail -s "$("This is Subject\nContent-Type: text/html")" test#yahoo.com
2. (uuencode "myfile.xls" "myfile.xls")|mail -s "$("This is Subject\nContent-Type: text/html")" test#yahoo.com < mytest.html
When I am using above 2 methods, output is coming with html formatted. But I am not getting any attachment?(Where mytest.html contains the html part)
Note: I am getting some scattered character in place of attachment.
Please get me out of here

uuencode was an old standard for encoding binary data as ASCII text for inclusion in mail and news articles but it has been obsolete and not in common use for more than a decade. There are probably no remaining MUAs that still know how to process it, especially in HTML mail.
Also, your trick of specifying the Content-Type header to the -s argument of the mail command is a very ugly hack. I'm surprised it works at all! In any case, it fails to include at least one other required header: MIME-Version: 1.0.
You need to build a MIME multipart message with one part being your HTML document, and the other part being your attachment (probably base64 encoded if it's binary data).
Because MIME requires you to choose a multipart boundary, format the body of the mail to delimit the multiple parts using that boundary, generate headers for each of the multipart subparts (including each part's own Content-Type and possibly Content-Transfer-Encoding and Content-Disposition or others), and encode each part appropriately, you're much better off using a toolkit that constructs MIME messages for you rather than trying to do it manually through the mail command. If you are working in the shell, you might try makemime but that's almost as ugly as doing it manually so I'd suggest using something like Perl's MIME-Tools.

Special characters in HTTP request fields

This isn't really related to programming, but I'm using this in a program, so I thought it would be best to ask here. Essentially this is a question about handling anomalies in HTTP requests.
A standard request might look like:
GET / HTTP/1.1
Host: example.com
User-Agent: Firefox
My question is, how should HTTP handle "special characters" in parts of the HTTP request that aren't usually tampered with. For instance, what if the method was "POST ME" instead of "GET" (i.e. inclusion of a space); would this be encoded to %20?
Another example, suppose I want one of my headers to be "Class:Test: example", with the extra ":" in the header name (the header value being "example"). Would this be encoded to %3A?
Note: this isn't about whether any web servers out there would accept such encoding; this is about how it should be done. My program is a fuzz tester, so it is supposed to be testing this sort of thing!

The two question must be answered as "no" and "yes, BUT..."
The "percent encoding" you suggest is defined for content, values, not for the http language syntax. You mix protocol and payload.
You may want to take a look at the RFC that defines HTTP. It clearly defines a syntax. If you stick to that syntax you can create valid extensions (which is what you are trying to do). If you break that syntax you create invalid http requests. That would be a thing you can do inhouse, but most likely such requests won't work in the open internet, where for example proxies come into play. These have to understand your requests on y syntactical level.
For question 2 the answer is "yes, BUT", I wrote. So a few words to the BUT:
You can specify such headers and they are valid, if you encode the second ':' as you suggested. However you should understand what you are doing there: you are NOT introducing a hierarchy into header names. Instead you specify a headers content to contain a ':'. That is perfectly fine. It is up to your server component to understand, interpret and react as intended to that content.

The HTTP specification says that the method is a token, so it can't contain any delimiter characters. So "POST ME" would not be a valid method.
Similarly, header names are also tokens, so they can't contain ":". The colon is always taken to be the delimiter between the header name and its contents.
As arkascha says, you should read RFC 2616, which specifies the HTTP protocol.

For your method containing a space, this is not possible, since a request-line is defined as this:
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
Method is defined as one of the HTTP/1.1 verbs or an extension-method, being a token (which cannot contain spaces). So the first space the server encounters marks the end of the method. Therefore, a method cannot contain spaces. You can percent-encode it, but the server won't know what to do with a verb like GET%20ME.
For your Class:Test: example, the http header is defined as:
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
And TEXT is defined as:
TEXT = <any OCTET except CTLs,
but including LWS>
And CTL is defined as:
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
So no, you don't have to escape further colons (58), the first one in a header-line is always accounted as being a separator, since a colon is not allowed in a token.
So in your example the field-name is Class, while the field-value is Test: example.

Set more than one HTTP header with the same name?

As far as I know it is allowed by the HTTP spec to set more than one HTTP header with the same name. Is there any use case to do so (from client to server and vice versa)?
HTTP 1.1 Section 4.2:
Multiple message-header fields with
the same field-name MAY be present in
a message if and only if the entire
field-value for that header field is
defined as a comma-separated list
[i.e., #(values)]. It MUST be possible
to combine the multiple header fields
into one "field-name: field-value"
pair, without changing the semantics
of the message, by appending each
subsequent field-value to the first,
each separated by a comma. The order
in which header fields with the same
field-name are received is therefore
significant to the interpretation of
the combined field value, and thus a
proxy MUST NOT change the order of
these field values when a message is
forwarded.
If I'm not wrong there is no case where multiple headers with the same name are needed.

It's commonly used for Set-Cookie:. Many servers set more than one cookie.
Of course, you can always set them all in a single header.
Actually, I think you cannot set multiple cookies in one header. So that's a necessary use-case.
The Cookie spec (RFC 2109) does claim that you can combine multiple cookies in one header the same way other headers can be combined (comma-separated), but it also points out that non-conforming syntaxes (like the Expires parameter, which has ,s in its value) are still common and must be dealt with by implementations.
So, if you use Expires params in your Set-Cookie headers and you don't want all your cookies to expire at the same time, you probably need to use multiple headers.
Update: Evolution of the Cookie spec
RFC 2109 has been obsoleted by RFC 2965 that in turn got obsoleted by RFC 6265, which is stricter on the issue:
Origin servers SHOULD NOT fold multiple Set-Cookie header fields into a single header field. The usual mechanism for folding HTTP headers fields (i.e., as defined in [RFC2616]) might change the semantics of the Set-Cookie header field because the %x2C (",") character is used by Set-Cookie in a way that conflicts with such folding.
Side note
RFC 6265 uses the verb "folding" when it refers to combining multiple header fields into one, which is ambiguous in the context of the HTTP/1 specs (both by RFC2616, and its successor, RFC 7230) where:
"folding" consistently refers to line folding, and
the verb "combine" is used to describe merging same headers.
Combining header fields:
See RFC 2616, Section 4.2, Message Headers (quoted in the question), but searching for the for the word "combine" will bring up special cases.
The above item obsoleted by RFC 7230, Section 3.2.2, Field Order:
A recipient MAY combine multiple header fields with the same field name into one field-name: field-value pair, without changing the semantics of the message, by appending each subsequent field value to the combined field value in order, separated by a comma. The order in which header fields with the same field name are received is therefore significant to the interpretation of the combined field value; a proxy MUST NOT change the order of these field values when forwarding a message.
Note: In practice, the "Set-Cookie" header field (RFC6265) often appears multiple times in a response message and does not use the list syntax, violating the above requirements on multiple header fields with the same name. Since it cannot be combined into a single field-value, recipients ought to handle Set-Cookie as a special case while processing header fields. (See Appendix A.2.3 of [Kri2001] for details.)
Line folding:
From RFC 2616, Section 2.2, Basic Rules:
HTTP/1.1 header field values can be folded onto multiple lines if the continuation line begins with a space or horizontal tab. All linear white space, including folding, has the same semantics as SP. A recipient MAY replace any linear white space with a single SP before interpreting the field value or forwarding the message downstream.
The above section obsoleted by RFC 7230, Section 3.2.4, Field Parsing:
Historically, HTTP header field values could be extended over multiple lines by preceding each extra line with at least one space or horizontal tab (obs-fold). This specification deprecates such line folding except within the message/http media type (Section 8.3.1). A sender MUST NOT generate a message that includes line folding (i.e., that has any field-value that contains a match to the obs-fold rule) unless the message is intended for packaging within the message/http media type.
A server that receives an obs-fold in a request message that is not within a message/http container MUST either reject the message by sending a 400 (Bad Request), preferably with a representation explaining that obsolete line folding is unacceptable, or replace each received obs-fold with one or more SP octets prior to interpreting the field value or forwarding the message downstream.
A proxy or gateway that receives an obs-fold in a response message that is not within a message/http container MUST either discard the message and replace it with a 502 (Bad Gateway) response, preferably with a representation explaining that unacceptable line folding was received, or replace each received obs-fold with one or more SP octets prior to interpreting the field value or forwarding the message downstream.
A user agent that receives an obs-fold in a response message that is not within a message/http container MUST replace each received obs-fold with one or more SP octets prior to interpreting the field value.

Since duplicate headers can cause issues with various web-servers and APIs (regardless of what the spec says), I doubt there is any general purpose use case where this is best practice. That's not to say someone somewhere isn't doing it, of course.

As you're looking for use-cases, maybe Accept would be a valid one.
Accept: application/json
Accept: application/xml

It's only allowed for headers using a very specific format, see RFC 2616, Section 4.2.

Old thread, but I was looking into this same issue. Anyway, the Accept and Accept-Encoding headers are typical examples that uses multiple values, comma separated. Even if these are request specific header, the specs do not differentiate between request and response at this level. Check the one from this page.
What the spec says is that if you have commas as character in the value of the header, you cannot use multiple headers of the same name, unless you disambiguate the use of the comma.

What is the boundary parameter in an HTTP multi-part (POST) Request?

I am trying to develop a sidebar gadget that automates the process of checking a web page for the evolution of my transfer quota. I am almost at it but there is one last step I need to get it working: Sending an HttpRequest with the correct POST data to a php page. Using a firefox plugin, here is what the "Content-Type" of the header looks like:
Content-Type=multipart/form-data; boundary=---------------------------99614912995
with the parameter "boundary" seeming to be random, and the POSTDATA is this:
POSTDATA =-----------------------------99614912995
Content-Disposition: form-data; name="SOMENAME"
Formulaire de Quota
-----------------------------99614912995
Content-Disposition: form-data; name="OTHERNAME"
SOMEDATA
-----------------------------99614912995--
I do not understand how to correctly emulate the POSTDATA with the mystery "boundary" parameter coming back.
Would someone know how I can solve this?

To quote from the RFC 1341, section 7.2.1, what I consider to be the relevant bits on the boundary parameter of the Content-Type header (for MIME):
All subtypes of "multipart" share a common syntax ...
The Content-Type field for multipart entities requires one parameter, "boundary", which is used to specify the encapsulation boundary. The encapsulation boundary is defined as a line consisting entirely of two hyphen characters ("-", decimal code 45) followed by the boundary parameter value from the Content-Type header field.
and then clarifies:
Thus, a typical multipart Content-Type header field might look like this:
Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08jU534c0p
This indicates that the entity consists of several parts, each itself with a structure that is syntactically identical to an RFC 822 message, except that the header area might be completely empty, and that the parts are each preceded by the line
--gc0p4Jq0M2Yt08jU534c0p
Things to Note:
The encapsulation boundary must occur at the beginning of a line, i.e., following a CRLF (Carriage Return-Line Feed)
The boundary must be followed immediately either by another CRLF and the header fields for the next part, or by two CRLFs, in which case there are no header fields for the next part (and it is therefore assumed to be of Content-Type text/plain).
Encapsulation boundaries must not appear within the encapsulations, and must be no longer than 70 characters, not counting the two leading hyphens.
Last but not least:
The encapsulation boundary following the last body part is a distinguished delimiter that indicates that no further body parts will follow. Such a delimiter is identical to the previous delimiters, with the addition of two more hyphens at the end of the line:
--gc0p4Jq0M2Yt08jU534c0p--
I hope this helps someone else in the future, as I had to roam for a while before getting the full picture (please ensure to read the necessary RFCs to get the deepest understanding).

The boundary parameter is set to a number of hyphens plus a random string at the end, but you can set it to anything at all. The problem is, if the boundary string shows up in the request data, it will be treated as a boundary.
For some tips, and an example function for sending multipart/form-data see my answer to this question. It wouldn't be too difficult to modify that function to use a loop for each part you would like to send.

The actual specification for multipart/form-data is in RFC 7578. Boundary is defined in Section 4.1.

Is an HTTP PUT request required to include a body?

I'm having trouble finding a definite specification of this in the standard. I have an HTTP client that's not including a Content-Length: 0 header when doing a PUT request where I don't specify a body, and a server that gets confused by such requests, and I'm wondering which program I should be blaming.

HTTP requests have a body if they have a Content-Length or Transfer-Encoding header (RFC 2616 4.3). If the request has neither, it has no body, and your server should treat it as such.
That said it is unusual for a PUT request to have no body, and so if I were designing a client that really wanted to send an empty body, I'd pass Content-Length: 0. Indeed, depending on one's reading of the POST and PUT method definitions (RFC 2616 9.5, 9.6) one might argue that the body is implied to be required - but a reasonable way to handle no body would be to assume a zero-length body.

Not answering the question, but asserting how jaxrs allows me to frequent use of bodyless PUTs:
Example of bodyless put:
Give user an additional permission.
PUT /admin/users/{username}/permission/{permission}

A body is not required by the IETF standard, though the content-length should be 0 if there's no body. Use the method that's appropriate for what you're doing. If you were to put it into code, given
int x;
int f(){ return x; }
and a remote variable called r.
A post is equivalent to
r=f();
A put is equivalent to
r=x;
and a get is equivalent to
x=r;

What is being PUT (in the verb sense) onto the server if there's no content? The spec refers to the content as "the enclosed entity", but a request with no content would have no enclosed entity, and therefore nothing to put on the server.
Unless, of course, you wanted to PUT nothing onto the server, in which case you'd probably want a DELETE instead.

The content length field is required as per the following section in the HTTP/1.1 standard http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.13

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex