Reading vs parsing an HTTP request

I am reading A Philosophy of Software Design by John Ousterhout.
In chapter 5 he mentions the following exercise:
“Implement one or more classes to make it easy for Web servers to receive incoming HTTP requests and send responses.”
He then discusses a common mistake made when solving the exercise:
“Use two different classes for receiving HTTP requests; the first class read the request from the network connection into a string, and the second class parsed the string.”
“Information leakage occurred because a HTTP request can’t be read without parsing much of the message; for example, the Content-Length header specifies the length of the request body, so the headers must be parsed in order to compute the total request length. As a result, both classes needed to understand most of the structure of HTTP requests, and parsing code was duplicated in both classes. ”
I can't understand the example because I don't know much about HTTP requests. More precisely, I don't understand what reading and parsing mean in the sentence:
"HTTP request can’t be read without parsing much of the message"
Any help?

Reading means taking a bunch of bytes from some external source (like a network socket) and storing them in memory.
Parsing means breaking up that string of bytes into meaningful, domain-specific chunks so you can understand the message.
I haven't read that book, but the author's point is that you can't simply read the bytes first and then parse them, in two separate non-overlapping operations. HTTP requests can be of arbitrary size, so before you know how many bytes to read (that is, how many bytes represent a single HTTP request) you have to figure out how long the request is. You do that by reading the Content-Length header, and that requires parsing and understanding the message.
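To make that concrete, here is a minimal sketch in C#. It is illustrative only: the class and method names are made up, chunked transfer coding is ignored, and re-decoding the whole buffer on every read is wasteful but keeps the code short. The point is simply that step 3, knowing when to stop reading, depends on step 2, parsing the headers for Content-Length:

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

static class HttpFraming
{
    // Reads one whole HTTP request from a connected socket.
    // The read loop cannot know where the request ends without parsing the
    // headers: only Content-Length says how many body bytes follow.
    public static byte[] ReadOneRequest(NetworkStream stream)
    {
        var received = new MemoryStream();
        var buffer = new byte[4096];
        int headerEnd;

        // 1. Read until the blank line (\r\n\r\n) that terminates the headers.
        while (true)
        {
            string text = Encoding.ASCII.GetString(received.ToArray());
            headerEnd = text.IndexOf("\r\n\r\n", StringComparison.Ordinal);
            if (headerEnd >= 0) break;

            int n = stream.Read(buffer, 0, buffer.Length);
            if (n == 0) throw new IOException("Connection closed mid-request");
            received.Write(buffer, 0, n);
        }

        // 2. Parse just enough of the headers to find Content-Length (0 if absent).
        string headerBlock = Encoding.ASCII.GetString(received.ToArray(), 0, headerEnd);
        int contentLength = 0;
        foreach (var line in headerBlock.Split(new[] { "\r\n" }, StringSplitOptions.None))
            if (line.StartsWith("Content-Length:", StringComparison.OrdinalIgnoreCase))
                contentLength = int.Parse(line.Substring("Content-Length:".Length).Trim());

        // 3. Keep reading until the headers *and* the announced body have arrived.
        long needed = headerEnd + 4 + contentLength;
        while (received.Length < needed)
        {
            int n = stream.Read(buffer, 0, buffer.Length);
            if (n == 0) throw new IOException("Connection closed mid-request");
            received.Write(buffer, 0, n);
        }
        return received.ToArray();
    }
}
```

A real server would also have to handle chunked transfer coding and keep any bytes read past the end of this request around for the next one, which is exactly why the reading and parsing logic are so hard to separate cleanly.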

Related

How to know the full request has been received with HTTP 1.0 and HTTP 1.1?

I'm implementing an ultra-simple dummy HTTP server that responds with a Hello world message to any request. It is just for benchmarking asynchronous event handling with wrk or an equivalent web server benchmarking tool.
After some searching on the Web I can't find a clear EndOfMessage (EOM) marker. It seems that with HTTP 1.0 we know we have received the full request when the connection is closed. Is that right?
For HTTP 1.1, how do we know if pipelining is used? What is the EOM in this case?
After some searching on the Web I can't find a clear EndOfMessage (EOM) marker.
You can't find one because such a thing doesn't exist. The only marker you may find is the empty line (CRLF CRLF) indicating the end of the header fields. In general, the length of the enclosed entity (and that goes for requests and responses alike!) is either communicated beforehand via the Content-Length header or through the transfer coding.
with HTTP 1.0 we know we have received the full request when the connection is closed. Is that right?
That is one of two ways mandated by RFC 1945. So generally speaking: no. From RFC 1945, section 7.2.2:
When an Entity-Body is included with a message, the length of that body may be determined in one of two ways. If a Content-Length header field is present, its value in bytes represents the length of the Entity-Body. Otherwise, the body length is determined by the closing of the connection by the server.
This may read as though your assertion were generally right. BUT:
Closing the connection cannot be used to indicate the end of a request body, since it leaves no possibility for the server to send back a response.
Since you are on the receiving side, your assumption is simply wrong on every conceivable level: if the request contains a body, announcing the size of said body through the Content-Length header is an absolute requirement.
HTTP/1.1 is a bit more relaxed in this regard, as it allows for more options. As Julian pointed out, please consult RFC 7230, section 3.3.3. That section is straightforward to read; to fully answer your question, I would have to copy and paste it in its entirety.
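As a rough paraphrase of what that section boils down to for request bodies (this is not the RFC text itself; the enum and method are made up for illustration, and error cases such as conflicting or malformed headers are ignored):

```csharp
using System;
using System.Collections.Generic;

enum BodyFraming { NoBody, ContentLength, Chunked }

static class MessageLength
{
    // Simplified reading of RFC 7230, section 3.3.3, as it applies to *requests*.
    // (Responses have extra cases, e.g. reading until the connection closes.)
    public static (BodyFraming, long) ForRequest(IDictionary<string, string> headers)
    {
        // Transfer-Encoding takes precedence over Content-Length; a request body
        // sent this way must end with the chunked coding, whose end is marked in-band.
        if (headers.TryGetValue("Transfer-Encoding", out var te) &&
            te.TrimEnd().EndsWith("chunked", StringComparison.OrdinalIgnoreCase))
            return (BodyFraming.Chunked, -1);

        // Otherwise a Content-Length header gives the exact body size in bytes.
        if (headers.TryGetValue("Content-Length", out var cl))
            return (BodyFraming.ContentLength, long.Parse(cl.Trim()));

        // Neither header present: the request has no body.
        return (BodyFraming.NoBody, 0);
    }
}
```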
For HTTP 1.1, how do we know if pipelining is used?
You do if you receive multiple requests through one connection. The strongest indicator that the client is not engaging in pipelining is the presence of Connection: close in the first received request. See RFC 7230, section 6.3 and section 6.3.2. If you are worried about having to support this, you are always free to just read the first request and send back a response with Connection: close in it. The client will then know it has to establish a new connection.
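As a sketch of that last suggestion, a dummy Hello world server could send back something along these lines after reading the first request (the header values are just an example; the Content-Length of 12 counts the bytes of "Hello world\n"):

```csharp
using System.Text;

static class DummyResponse
{
    // Raw bytes of a minimal response that also tells the client not to reuse
    // the connection, sidestepping pipelining entirely.
    public static readonly byte[] Bytes = Encoding.ASCII.GetBytes(
        "HTTP/1.1 200 OK\r\n" +
        "Content-Type: text/plain\r\n" +
        "Content-Length: 12\r\n" +   // 12 = number of bytes in "Hello world\n"
        "Connection: close\r\n" +
        "\r\n" +
        "Hello world\n");
}
```

After writing these bytes the server closes the socket, and the client knows it has to establish a new connection for any further requests.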
What is the EOM in this case?
Again, there is no marker, as there is no special treatment of requests during pipelining. All pipelining really enables is having multiple requests issued in one go. See section 3.3.3 from above on how to determine the message length.

Why is the total number of characters in a GET limited?

I want to ask some questions about the following quote, taken from Head First Servlets and JSP, Second Edition:
The total amount of characters in a GET is really limited (depending on the server). If the user types, say, a long passage into a “search” input box, the GET might not work.
Why is the total number of characters in a GET limited?
How can I find out what the limit on a GET is?
When I type long text into an input box and the GET stops working, what solutions do I have to fix the problem?
Why is the GET method limited?
HTTP itself places no specific limit on the length of a GET request, but different servers have different limits. If you need to send more data to the server, use POST instead of GET. A recommended minimum to be supported by servers and browsers is 8,000 bytes, but this is not required.
RFC 7230's Section 3.1.1 "Request Line" says
HTTP does not place a predefined limit on the length of a request-line, as described in Section 2.5. A server that receives a method longer than any that it implements SHOULD respond with a 501 (Not Implemented) status code. A server that receives a request-target longer than any URI it wishes to parse MUST respond with a 414 (URI Too Long) status code (see Section 6.5.12 of RFC7231).
Various ad hoc limitations on request-line length are found in practice. It is RECOMMENDED that all HTTP senders and recipients support, at a minimum, request-line lengths of 8000 octets.
Section 2.5 "Conformance and Error Handling" says
HTTP does not have specific length limitations for many of its protocol elements because the lengths that might be appropriate will vary widely, depending on the deployment context and purpose of the implementation. Hence, interoperability between senders and recipients depends on shared expectations regarding what is a reasonable length for each protocol element. Furthermore, what is commonly understood to be a reasonable length for some protocol elements has changed over the course of the past two decades of HTTP use and is expected to continue changing in the future.
and RFC 7231's Section 6.5.12 "414 URI Too Long" says
The 414 (URI Too Long) status code indicates that the server is refusing to service the request because the request-target (Section 5.3 of [RFC7230]) is longer than the server is willing to interpret. This rare condition is only likely to occur when a client has improperly converted a POST request to a GET request with long query information, when the client has descended into a "black hole" of redirection (e.g., a redirected URI prefix that points to a suffix of itself) or when the server is under attack by a client attempting to exploit potential security holes.
The GET data is sent in the query string, which also has a maximum length.
You can do all kinds of things with a query string, e.g. bookmark it. Would you really like to bookmark a really huge text?
It is possible to configure most servers to accept larger lengths; some clients will accept them, some will throw errors.
"Note: Servers ought to be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations might not properly support these lengths." (HTTP/1.1 specification, RFC 2616, section 3.2.1)
There is also a status code, 414 Request-URI Too Long; if you get it, you know you have put too many characters in the GET. (That only happens if you hit the server limit; if the client limit is lower than the server limit, each browser will react in its own way.)
Generally it is wise to set a limit on any data sent to a server, in case someone tries to create a huge workload or slow the server down (e.g. send a huge file so that one server connection is tied up, slow the transmission down, and open additional connections; at some point the server will have a lot of connections open. Use multiple clients and you have an attack scenario against the server).
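If you want to find out empirically where a particular server draws the line, one option is to send increasingly long query strings and watch for a 414. A minimal sketch, assuming a test server of your own at localhost:8080 and a made-up /search endpoint:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class QueryLengthProbe
{
    static async Task Main()
    {
        // Hypothetical endpoint; point this at a server you control.
        const string baseUrl = "http://localhost:8080/search?q=";
        using var client = new HttpClient();

        // Double the query length (up to 32 KiB) until the server refuses the request.
        for (int length = 1024; length <= 32 * 1024; length *= 2)
        {
            var url = baseUrl + new string('a', length);
            var response = await client.GetAsync(url);
            Console.WriteLine($"{length,8} chars -> {(int)response.StatusCode}");

            if (response.StatusCode == HttpStatusCode.RequestUriTooLong)   // 414
                break;
        }
    }
}
```

Be aware that some servers answer with a different status code or simply reset the connection instead of sending a 414, so treat this as a rough probe rather than a reliable measurement.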

How can I download a single file from multiple locations via HTTP?

I need to download a big file quickly, but all the sources I can find have throttled bandwidth. Each of them seems to support HTTP 1.1 Byte Serving (Range requests), since I can pause and resume the downloads. How can I download the file from multiple sources in parallel?
Assuming this is a programming question (given that this is Stack Overflow), I am going to explain how to do this yourself instead of just linking to a download accelerator that takes advantage of it.
What is needed in terms of the server to do this?
A server that supports the Range HTTP header.
A server that allows concurrent connections. It is possible to support Range while not allowing multiple simultaneous connections by using either endpoint- or IP-based restrictions on the server side. For this reason, I recommend you set up a simple test server instead of downloading from a file-sharing site while testing this.
What is the Range Header?
If the Range header is not set, data transmission over HTTP starts from the beginning of the file and is sent in order: the first byte of the file on the server will be the first byte of the HTTP response, and the last byte of the file will be the last byte of the response. The Range header lets you specify which bytes the server should send, allowing you to "skip" part of the response.
Actual Answer Example
Our Situation
The response is plain text. The response body is the single string "StackOverflow!!" encoded in ASCII, meaning each character is one byte. Therefore, the Content-Length header's value is 15 octets (another term for bytes).
We are going to download this file using 3 requests. For the sake of this example, we are going to say it will be 3 times faster, but you should realize that this method will make downloads slower for very small files, because HTTP headers must be sent with each request, as must the TCP 3-way handshake for each new connection. We will also assume that the server supports HEAD requests and that the Content-Length header is sent with the download response. Finally, the requests will be performed using GET, because that is what HEAD corresponds to; there are workarounds for POST, however.
Juicy Details
First, perform an HTTP HEAD request. Take the Content-Length header and divide its value by the number of parallel connections you wish to make. For this example, the Content-Length is 15 and we wish to make 3 connections, so the divided value will be 5.
Now perform that number of requests in parallel. For each request, set the Range header to "Range: bytes=" followed by the number of requests already made times the divided value found above. Then append "-" followed by that starting value plus the divided value minus one, since byte ranges are inclusive at both ends.
For this example, the three requests should have the header set as follows.
Range: bytes=0-4
Range: bytes=5-9
Range: bytes=10-14
The response of each of these requests should be
Stack
Overf
low!!
In essence, we are just conforming to the range-units specification (section 3.12 of RFC 2616) and the Range header specification (section 14.35 of RFC 2616).
Finally, append the bytes of each request to form the final response data.
Disclaimer: I've never actually tried this, but it should work in theory.
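For what it's worth, here is a minimal sketch of those steps in C#. The URL is a placeholder, error handling is omitted, and it assumes the server answers HEAD with a Content-Length and honours Range (i.e. replies with 206 Partial Content):

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class RangedDownload
{
    // Placeholder URL; the server must answer HEAD requests and honour Range.
    const string Url = "http://example.com/big-file.bin";

    static async Task Main()
    {
        using var client = new HttpClient();

        // Step 1: HEAD request to learn the total size.
        var head = await client.SendAsync(new HttpRequestMessage(HttpMethod.Head, Url));
        long total = head.Content.Headers.ContentLength
                     ?? throw new InvalidOperationException("Server sent no Content-Length");

        // Step 2: split the file into equal parts and fetch them in parallel.
        const int parts = 3;
        long chunk = (total + parts - 1) / parts;
        var tasks = Enumerable.Range(0, parts).Select(async i =>
        {
            long start = i * chunk;
            long end = Math.Min(start + chunk - 1, total - 1);   // ranges are inclusive
            var req = new HttpRequestMessage(HttpMethod.Get, Url);
            req.Headers.Range = new RangeHeaderValue(start, end);
            var resp = await client.SendAsync(req);              // expect 206 Partial Content
            return (start, data: await resp.Content.ReadAsByteArrayAsync());
        }).ToArray();

        // Step 3: stitch the pieces back together in order.
        var buffer = new byte[total];
        foreach (var (start, data) in await Task.WhenAll(tasks))
            Array.Copy(data, 0, buffer, start, data.Length);

        Console.WriteLine($"Downloaded {buffer.Length} bytes in {parts} ranges");
    }
}
```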
I can't say whether wget is able to put a file back together again if it is fetched from multiple sources.
The following example shows how to do it with aria2c.
You would build a download description file and then pass it to aria2c, like so:
aria2c -i uri.txt --split=5 --min-split-size=1M --max-connection-per-server=5
where uri.txt might contain (note that in an aria2c input file, the per-download option lines must be indented with at least one space):
http://a.com/file1.iso http://mirror-1.com/file1.iso http://mirror-2.com/file1.iso
  dir=/downloads
  out=file1.iso
This would fetch the same file from 3 different locations and place it into the downloads directory (dir) with the name file1.iso (out).

Counting HTTP packets

What is relation between number of HTTP packets and number of objects in a web page?
What is relation between number of HTTP packets and number of objects in a web page?
The short answer is there is obviously some relation, but there is no way you can accurately predict one from the other.
For a longer answer, we first need to correct some misconceptions in the question:
There is no such thing as an "HTTP packet". HTTP is a message-oriented application protocol, with one request message and one response message per "resource" fetched. It sits on top of a reliable byte-stream protocol (with flow control, etc.) called TCP, which in turn sits on top of a packet-switching protocol called IP. An HTTP request/response exchange takes an unpredictable number of IP packets, depending on message sizes AND network conditions. Other HTTP features such as compression, keeping connections alive, caching and so on make things even more complicated.
The idea of an "object" is ill-defined. If an "object" has a one-to-one correspondence with an HTTP request/response pair (i.e. a "resource" in the above), then that part is simple. OTOH, a single "resource" could be a rendering of multiple "objects" in the application domain of the webserver.
On top of that, you've also got to account for the fact that a typical HTML resource has references to other resources (Scripts, CSS, images, etc) and may even involve Ajax callbacks. Each of these is a "resource", that may or may not need to be fetched ... depending on caching, etc.
Finally, there is an implicit assumption that all "objects" are the same size. This might be true in some application domains, but it is not true in general.
So to summarize, there are far too many variables and unknowns for it to be feasible to predict the number of network packets required to fetch a certain number of "objects".
A more practical approach is to attach a packet-level network analyser to your network and get it to count the number of packets sent and received.
If you make the following assumptions:
"HTTP packets" are HTTP messages,
"objects" are resources,
a resource doesn't require other resources (Scripts, CSS, images, etc) to render,
there is no caching,
the server is not doing redirects.
then one "object" requires two "HTTP packets".
But frankly, you've simplified the problem to a point where the answer is next to useless for predicting actual performance of real web-servers. (For instance, any one of those "objects" could be tiny ... or huge. And if you allow for arbitrary javascript, or content such as links to video streams, then the number of "packets" of one kind or another is potentially unbounded.)
A GET request is issued for every file referenced in an HTML page, and each request usually fits in a single TCP segment. With HTTP/1.1 persistent connections and pipelining, many request/response exchanges can also be carried over one connection.
The number of packets sent in response varies with the size of the objects and with caching parameters. For example, if a file is already in the browser cache, the browser will make a conditional GET and receive an HTTP/1.1 304 Not Modified response, which does not contain any data. Moreover, many 304 responses can fit in one segment, as such a response is tiny compared to the maximum segment size. As another example, if a file is bigger than the maximum segment size, it may (and probably will) be divided across many segments.
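To make the conditional-GET part concrete, here is a small sketch (the URL is a placeholder, and it assumes the server sends a Last-Modified header; browsers also use ETag and If-None-Match for the same purpose):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class ConditionalGet
{
    static async Task Main()
    {
        // Placeholder URL of some static resource.
        const string url = "http://example.com/logo.png";
        using var client = new HttpClient();

        // First fetch: the full response body comes back.
        var first = await client.GetAsync(url);
        var lastModified = first.Content.Headers.LastModified;

        // Second fetch: a conditional GET. If the resource is unchanged, the
        // server replies 304 Not Modified with no body, so far fewer packets.
        var req = new HttpRequestMessage(HttpMethod.Get, url);
        if (lastModified.HasValue)
            req.Headers.IfModifiedSince = lastModified;
        var second = await client.SendAsync(req);

        Console.WriteLine(second.StatusCode == HttpStatusCode.NotModified
            ? "304 Not Modified: cached copy is still valid"
            : $"Full response again: {(int)second.StatusCode}");
    }
}
```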
Is this what you wish to know?

Web server - how to parse requests? Asynchronous Stream Tokenizer?

I'm attempting to create a simple web server in C# in an asynchronous socket programming style. The purpose is very narrow: a Comet server (HTTP long-polling).
I've got the Windows service running, accepting connections, dumping request info to the console, and returning simple fixed content to the client.
Now, I can't figure out a manageable strategy for parsing the request data asynchronously and safely. I've written synchronous LL(1) parsers before, but I'm not sure whether an LL(1) parser is appropriate or necessary for HTTP, and I don't know how to tokenize the input stream asynchronously. All I can think of is having an input buffer per client, reading into that, then copying it to a StringBuilder and periodically checking whether I have a complete request. But that seems inefficient and might lead to code that is difficult to debug and maintain.
Also, there are two phases to the connection: receiving the request in full, and then sending a response - in this case, after some delay. Once the request is validated and actionable, only then am I planning to enroll the connection in the long-polling manager. However, a misbehaving client could continue to send data and fill up a buffer, so I think I need to continue to monitor and empty the input stream during the response phase, right?
Any guidance on this is appreciated.
I guess the first step is knowing whether it is possible to efficiently tokenize a network stream asynchronously and without a large intermediate buffer. Even without a proper parser, the same tokenizer challenges apply to reading "lines" of input at a time, or even reading everything up to the blank line that ends the headers (one big token). I don't want to read one byte at a time from the network, but neither do I want to read too many bytes and have to store them in some intermediate buffer, right?
For HTTP, the simplest approach is to read the headers completely into memory (until you receive \r\n\r\n), then split on \r\n to get the individual header lines, and split each header line on ":" to separate name and value.
There's no need to use a complex parser for that.
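A minimal, self-contained sketch of that splitting step (the header block is hard-coded here to stand in for whatever was read from the socket up to the blank line; the request line and header values are just examples):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class HeaderSplitExample
{
    static void Main()
    {
        // Pretend this is what was read from the socket, up to the blank line.
        string headerBlock =
            "GET /poll HTTP/1.1\r\n" +
            "Host: example.com\r\n" +
            "Content-Length: 0\r\n" +
            "Connection: keep-alive";

        var lines = headerBlock.Split(new[] { "\r\n" }, StringSplitOptions.None);
        string requestLine = lines[0];   // "GET /poll HTTP/1.1"

        // Split each remaining line on the first ':' to get name/value pairs.
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var line in lines.Skip(1))
        {
            int colon = line.IndexOf(':');
            if (colon > 0)
                headers[line.Substring(0, colon).Trim()] = line.Substring(colon + 1).Trim();
        }

        Console.WriteLine(requestLine);
        foreach (var kv in headers)
            Console.WriteLine($"{kv.Key} = {kv.Value}");
    }
}
```

The splitting only happens once the \r\n\r\n terminator has been seen, so it does not need to be incremental; only the read loop itself has to be asynchronous.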

Resources