I'm implementing a minimal HTTPS layer for my embedded project, using mbedTLS for TLS and hard-coding HTTP headers to talk with HTTPS servers.
It works fine with normal websites. But so far my implementation detects the end of an HTTPS response by checking whether the last byte read is \n.
if( ret > 0 && output[len-1] == '\n' )
{
    ret = 0;
    output[len] = 0;
    break;
}
This, however, does not always work, for obvious reasons. I tried openssl s_client, and it behaves the same way: if an HTTP response terminates with \n, then s_client returns immediately after fetching all the data. Otherwise it blocks forever, waiting for more data.
A real browser seems to handle this properly. Is there anything I can do beyond setting a timeout?
How to tell if an HTTP response terminates in C...
But so far my implementation detects the end of an HTTPS response by checking whether the last byte read is \n...
This, however, does not always work, for obvious reasons...
HTTP calls out \r\n, not \n. See RFC 2616, Hypertext Transfer Protocol - HTTP/1.1, page 15:
HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
protocol elements except the entity-body (see appendix 19.3 for
tolerant applications). The end-of-line marker within an entity-body
is defined by its associated media type, as described in section 3.7.
CRLF = CR LF
Now, what various servers actually send is a whole different ballgame. There will be duplicate end-of-line markers, missing end-of-line markers, and incorrect end-of-line markers. It's the wild, wild west.
You might want to look at a reference implementation of an HTTP parser. If so, check out libevent's or cURL's parsers and how they maintain their state machines.
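In the meantime, a minimal sketch of the usual approach for a hand-rolled client: read until the blank line (\r\n\r\n) that ends the headers, pull out Content-Length, then read exactly that many more bytes. This handles only the Content-Length case (no chunked encoding), and tls_read() is a stand-in for whatever read call your stack provides (for mbedTLS that would be mbedtls_ssl_read, but the sketch is not tied to it):

#include <stdlib.h>
#include <string.h>

/* Stand-in for your transport's read call, e.g. mbedtls_ssl_read(). */
extern int tls_read(void *ctx, unsigned char *buf, size_t len);

/* Returns 0 when the full response is in buf (NUL-terminated), -1 on error.
 * Content-Length responses only; chunked responses need a real parser. */
int read_response(void *ctx, char *buf, size_t bufsize)
{
    size_t used = 0;
    char *body = NULL;
    long content_length = -1;

    while (used + 1 < bufsize) {
        int n = tls_read(ctx, (unsigned char *)buf + used, bufsize - 1 - used);
        if (n <= 0)
            return -1;                            /* error or premature close */
        used += (size_t)n;
        buf[used] = '\0';

        if (body == NULL) {
            char *end = strstr(buf, "\r\n\r\n");  /* blank line ends the headers */
            if (end == NULL)
                continue;                         /* headers not complete yet */
            body = end + 4;

            /* Crude header lookup; a real parser handles case and folding properly. */
            char *cl = strstr(buf, "Content-Length:");
            if (cl == NULL)
                cl = strstr(buf, "content-length:");
            if (cl == NULL)
                return -1;                        /* would need chunked/close-delimited handling */
            content_length = strtol(cl + strlen("Content-Length:"), NULL, 10);
        }

        if ((long)(used - (size_t)(body - buf)) >= content_length)
            return 0;                             /* body is complete */
    }
    return -1;                                    /* caller's buffer too small */
}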
Both the HTTP Request-Line and the Status-Line have 3 components:
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
The Status-Line (the server response) is fine:
it begins with the HTTP-Version (like any protocol), so the decoder can adapt its parsing according to this first field
followed by some protocol-defined values (the Status-Code), which is a single word and doesn't need any SP/CR/LF characters
and ends with any TEXT characters (except CR/LF) as the Reason-Phrase.
What I'm failing to understand is why the Request-Line is so different:
The HTTP-Version is at the end
the Request-URI must be escaped to avoid containing an SP/CR/LF character (hence the famous %20)
Why does it not follow the same (clean) pattern as the Status-Line?
Request-Line = HTTP-Version SP Method SP Request-URI CRLF
This way the Request-URI could contain any TEXT characters (except CR/LF)
So it would look like this:
HTTP/1.1 GET /user/with space
...
HTTP/1.1 404 NOT FOUND
...
See:
https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html
https://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html
It may come from HTTP/0.9, the early protocol version.
The request part was:
GET http://www.example.com/foo.html\r\n
And the response was just the response body (no headers), so directly your HTML, starting with <html> for example.
The Request Line is:
METHOD OSP Absolute-Request-URL CRLF
with OSP being a lot of optional whitespace, like tabs or form feeds,
and with the URL also containing the Host part (which is still supported by the protocol today).
The important point is that there is no protocol version and no protocol part, in either the response or the request.
When HTTP/1.0 was created there was an implicit need to keep supporting HTTP/0.9 requests and responses, something some servers still do today.
On the response side, the response headers were added (like stating the MIME type of the response!), and the first line was built with the nice idea of starting with the protocol version of the response.
On the request side, the protocol version was added as an optional suffix, so you could still decide to make an HTTP/0.9 request or use the new version, and most importantly an HTTP/0.9 server could perhaps still understand your query, ignoring the SP PROTOCOL addition (and even the optional headers added to the request).
Today, if you omit the protocol part of your request, HTTP/0.9-compatible servers will only parse the first line of the request and ignore any extra headers.
These are equivalent queries (but the first one is HTTP/0.9 and would get no headers in the response):
# HTTP 0.9:
GET http://www.example.com/foo.html\r\n
# HTTP/1.0 version:
GET http://www.example.com/foo.html HTTP/1.0\r\n
\r\n
# or
GET /foo.html HTTP/1.0\r\n
Host: www.example.com\r\n
\r\n
#or
GET http://www.example.com/foo.html HTTP/1.0\r\n
Host: www.foo.com\r\n
\r\n
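As an illustration of that compatibility rule, here is a hedged sketch (the function name and layout are mine, not from any particular server) of how a parser can tell the two apart: if the request line has no third token starting with "HTTP/", treat it as an HTTP/0.9 request and answer with a bare body.

#include <stdio.h>
#include <string.h>

/* Returns 1 if the request line carries an "HTTP/x.y" version token
 * (HTTP/1.0 or later), 0 if it looks like HTTP/0.9, and -1 if it is not a
 * request line at all. `line` is the first line of the request with the
 * trailing CRLF already stripped. */
static int classify_request_line(const char *line)
{
    char method[16], target[256], version[16];
    int n = sscanf(line, "%15s %255s %15s", method, target, version);

    if (n < 2)
        return -1;                          /* not even METHOD SP TARGET */
    if (n == 2)
        return 0;                           /* no version token: HTTP/0.9 style */
    return strncmp(version, "HTTP/", 5) == 0 ? 1 : -1;
}

int main(void)
{
    printf("%d\n", classify_request_line("GET http://www.example.com/foo.html")); /* 0 */
    printf("%d\n", classify_request_line("GET /foo.html HTTP/1.0"));              /* 1 */
    return 0;
}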
I think they were thinking about the code updates needed in existing parsers, and adding the protocol at the end of the first line was easier to implement. Maybe an old parser could still send a 0.9 response to an HTTP/1.0 query (which is bad, but easy to write).
Maybe just adding something to the end of an existing line seemed more like an improvement than prefixing the existing protocol's line.
Maybe you just needed to be old enough to comment on the RFC at the time and tell them it would have been more elegant your way (which is true) :-)
I'm implementing an ultra-simple dummy HTTP server that responds with a Hello world message to any request. It is just for benchmarking asynchronous event handling with wrk or an equivalent web server benchmarking tool.
After some searching on the Web I can't find a clear EndOfMessage (EOM) marker. It seems that with HTTP 1.0 we know we have received the full request when the connection is closed. Is that right?
For HTTP 1.1, how do we know if pipelining is used? What is the EOM in this case?
After some searching on the Web I can't find a clear EndOfMessage (EOM) marker.
You can't find one because such a thing doesn't exist. The only marker you may find is the CRLF pair indicating the end of the header fields. In general, the length of the enclosed entity (and that goes for requests and responses!) is either communicated beforehand via the Content-Length header or through the transfer coding.
with HTTP 1.0 we know we have received the full request when the connection is closed. Is that right?
That is one of two ways mandated by RFC 1945. So generally speaking: no. From RFC 1945, section 7.2.2:
When an Entity-Body is included with a message, the length of that body may be determined in one of two ways. If a Content-Length header field is present, its value in bytes represents the length of the Entity-Body. Otherwise, the body length is determined by the closing of the connection by the server.
This may read like you were generally right in your assertion. BUT:
Closing the connection cannot be used to indicate the end of a request body, since it leaves no possibility for the server to send back a response.
With you being on the receiving side, your assumption is simply wrong on every conceivable level: If the request contains a body, announcing the size of said body through the Content-Length header is an absolute requirement.
HTTP/1.1 is a bit more relaxed in this regard, as it allows for more options. As Julian pointed out, please consult RFC 7230, section 3.3.3. That section is straightforward to read, and to answer your question here I'd have to copy and paste it as a whole.
For HTTP 1.1, how do we know if pipelining is used?
You do if you receive multiple requests through one connection. The strongest indicator of the client not engaging in pipelining is the presence of Connection: close in the first received request. See RFC 7230, section 6.3 and section 6.3.2. If you are worried about having to support this, you are always free to just read the first request and send back a response with Connection: close in it. The client will then know it has to establish a new connection.
What is the EOM in this case?
Again, there is no marker, as there is no special treatment of requests during pipelining. All pipelining really enables is having multiple requests issued in one go. See section 3.3.3 from above on how to determine the message length.
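For a benchmark-only server, the practical reading of "there is no EOM marker" can be reduced to something like the following sketch (request side only, Content-Length case; chunked request bodies and trailers are out of scope here, and the helper name is mine):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Given a complete header block (everything up to and including the blank
 * line), decide how many body bytes still have to be read for this request.
 * Returns the body length, or -1 if the framing is unsupported by this
 * sketch (e.g. Transfer-Encoding: chunked, which needs a chunk parser). */
static long request_body_length(const char *headers)
{
    if (strstr(headers, "Transfer-Encoding:") || strstr(headers, "transfer-encoding:"))
        return -1;                        /* chunked etc.: not handled here */

    const char *cl = strstr(headers, "Content-Length:");
    if (cl == NULL)
        cl = strstr(headers, "content-length:");
    if (cl == NULL)
        return 0;                         /* no body announced (typical GET) */

    return strtol(cl + strlen("Content-Length:"), NULL, 10);
}

int main(void)
{
    const char *req =
        "POST /echo HTTP/1.1\r\n"
        "Host: example.test\r\n"
        "Content-Length: 11\r\n"
        "\r\n";
    printf("body bytes still to read: %ld\n", request_body_length(req));  /* 11 */
    return 0;
}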
I'm working with an HTTP request tool (similar to cURL) and having an issue with the server response. Either that, or there's an issue with my understanding of the RFC for HTTP 1.1 and chunked data.
According to the RFC, chunked data should be in this format:
4\r\n
Wiki\r\n
5\r\n
pedia\r\n
e\r\n
in\r\n\r\nchunks.\r\n
0\r\n
\r\n
What I'm actually seeing is the following:
4\r\n
Wiki\r\n
5\r\n
pedia\r\n
e\r\n
in\r\n\r\nchunks.\r\n
0
In other words, the few servers I've tested with send no more data after the 0: no CRLF, much less CRLF CRLF.
How are we supposed to know it's the end of the chunked data without the proper chunk terminator? Timeouts happen while looking for the CRLFs after the 0, and that's not sufficient.
Yes, it violates the standard. But we want to be compatible with all possible HTTP servers and clients, so we have to understand how it can be violated.
Chunked encoding is often used for streaming content over the HTTP 1.1 protocol. The standard asks for the content to be terminated with an additional CRLF. So we may see the following pseudo code:
def stream(endpoint)
Socket.open(endpoint) do |socket|
sleep 10
more_data do |data|
print data.length.to_s(16)
print data
print "CRLF"
end
end
print "CRLF"
end
But the right code is the following:
def stream(endpoint)
Socket.open(endpoint) do |socket|
sleep 10
more_data do |data|
print data.length.to_s(16)
print data
print "CRLF"
end
end
ensure
print "CRLF"   # final CRLF: always sent, even after an exception
end
It means that after an input socket interruption or any other exception, the wrong version of the method won't be able to print the additional "CRLF" to the output socket.
How are we supposed to know it's the end of the chunked data without the proper chunk terminator? Timeouts happen while looking for the CRLFs after the 0, and that's not sufficient.
Many implementations ignore this violation because they don't need to know the size of the content. They just try to receive as much data as possible before the socket is closed.
Use Content-Length whenever you know it; for a file download, checking the file size is insignificant in terms of resources. For chunked transfer, do not scan the message body for a CRLF pair. Instead, read the specified number of bytes, then read two more bytes to confirm that they are CR and LF. If they're not, the message body is ill-formed, and either the size was specified improperly or the data was otherwise corrupted.
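To make the "read N bytes, then expect CR LF" rule concrete, here is a hedged sketch of a chunk reader. The read_line()/read_fully() helpers are assumptions, not part of any library: read_line() pulls one CRLF-terminated line off the connection and strips the CRLF, and read_fully() reads exactly the requested number of bytes. The sketch also marks the one place a tolerant client can relax the rule: if the connection ends right after the final "0", some implementations accept that even though the standard requires a trailing CRLF.

#include <stdlib.h>

/* Assumed helpers (not part of any standard library):
 * read_line()  reads up to CRLF into buf, strips the CRLF, returns length or -1.
 * read_fully() reads exactly len bytes, returns 0 on success, -1 on EOF/error. */
extern int read_line(void *conn, char *buf, size_t bufsize);
extern int read_fully(void *conn, char *buf, size_t len);

/* Decodes a chunked body into out (capacity outsize).
 * Returns the decoded length, or -1 on a framing error. */
long read_chunked_body(void *conn, char *out, size_t outsize)
{
    char line[64];
    char crlf[2];
    size_t total = 0;

    for (;;) {
        if (read_line(conn, line, sizeof line) < 0)
            return -1;
        long size = strtol(line, NULL, 16);        /* chunk size is hexadecimal */
        if (size < 0)
            return -1;

        if (size == 0) {
            /* Last chunk. Strictly, a CRLF (and optional trailers) must follow,
             * but a tolerant client may accept the connection just closing here. */
            read_line(conn, line, sizeof line);    /* best effort; result ignored */
            return (long)total;
        }

        if ((size_t)size > outsize - total)
            return -1;                             /* would overflow caller's buffer */
        if (read_fully(conn, out + total, (size_t)size) < 0)
            return -1;
        total += (size_t)size;

        /* Each chunk's data must be followed by CR LF. */
        if (read_fully(conn, crlf, 2) < 0 || crlf[0] != '\r' || crlf[1] != '\n')
            return -1;
    }
}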
For more information read the RFC, which says:
A server using chunked transfer-coding in a response MUST NOT use the
trailer for any header fields unless at least one of the following is
true:
a) the request included a TE header field that indicates "trailers" is
acceptable in the transfer-coding of the response, as described in
section 14.39; or,
b) the server is the origin server for the response, the trailer fields
consist entirely of optional metadata, and the recipient could use the
message (in a manner acceptable to the origin server) without
receiving this metadata. In other words, the origin server is willing
to accept the possibility that the trailer fields might be silently
discarded along the path to the client.
Ways to Determine Message Body Length:
If the message has a Transfer-Encoding header and chunked is the final transfer coding, then the message body length is determined by reading and decoding the chunked data until the transfer coding indicates the data is complete.
If a response has a Transfer-Encoding header and chunked is not the final transfer coding, then the message body length is determined by reading the connection until it is closed by the server.
If a request has a Transfer-Encoding header and chunked is not the final transfer coding, then the message body length cannot be determined reliably; the server MUST respond with the 400 (Bad Request) status code and then close the connection.
If a message is received with both a Transfer-Encoding and a Content-Length header field, the Transfer-Encoding overrides the Content-Length. Such a message might indicate an attempt to perform request smuggling or response splitting and ought to be handled as an error. A sender MUST remove the received Content-Length field prior to forwarding such a message downstream.
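That precedence can be captured in a tiny classifier. A sketch under the assumption that you already have the raw header values as strings (the names and the enum are mine, and for brevity it does not check that chunked is the final coding in the Transfer-Encoding list):

#include <stddef.h>
#include <string.h>

typedef enum {
    FRAME_CHUNKED,          /* decode chunks until the 0-size chunk */
    FRAME_CONTENT_LENGTH,   /* read exactly Content-Length bytes */
    FRAME_READ_TO_CLOSE,    /* responses only: body ends when the server closes */
    FRAME_NO_BODY,          /* nothing to read */
    FRAME_ERROR             /* requests: answer 400 and close */
} framing_t;

/* te / cl are the raw values of Transfer-Encoding and Content-Length, or NULL
 * if the header is absent; is_request says whether we are parsing a request
 * (where "read until close" is not an option). */
framing_t pick_framing(const char *te, const char *cl, int is_request)
{
    if (te != NULL) {
        /* Transfer-Encoding overrides Content-Length. */
        if (strstr(te, "chunked") != NULL)
            return FRAME_CHUNKED;
        return is_request ? FRAME_ERROR : FRAME_READ_TO_CLOSE;
    }
    if (cl != NULL)
        return FRAME_CONTENT_LENGTH;
    return is_request ? FRAME_NO_BODY : FRAME_READ_TO_CLOSE;
}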
I need to download a big file quickly, but all the sources I can find have throttled bandwidth. Each of them seems to support HTTP 1.1 Byte Serving (Range Requests), since I can pause and resume the downloads. How can I download it from multiple sources in parallel?
Assuming this is a programming question (given that this is StackOverflow), I am going to explain how to do this, instead of just linking to a download accelerator that takes advantage of it.
What is needed in terms of the server to do this?
A server that supports Range HTTP header.
A server that allows concurrent connections. It is possible to support Range while not allowing multiple simultaneous connections, by using either endpoint- or IP-based restrictions server side. For this reason, I recommend you set up a simple test server instead of downloading from a file sharing site while testing this.
What is the Range Header?
If the Range header is not set, data transmission over HTTP starts from the beginning of the file and is sent in order: the first byte of the file on the server will be the first byte of the HTTP response, and the last byte of the file on the server will be the last byte of the HTTP response. The Range header allows you to specify which byte the transfer should start from, letting you "skip" the beginning of the response.
Actual Answer Example
Our Situation
The response is plain text. The response content is just one word, "StackOverflow!!", encoded in ASCII, meaning each character is one byte. Therefore, the Content-Length header's value is 15 octets (another term for bytes).
We are going to download this file using 3 requests. For the sake of this example, we are going to say it will be 3 times faster, but you should realize that this method will make downloads slower for very small files, because the HTTP headers and the 3-way handshake must be repeated for each request. We will also assume that the server supports HEAD requests and that the Content-Length header is sent with the download response. Finally, the downloads themselves will be performed using GET, to pair with the HEAD request; there are workarounds for POST.
Juicy Details
First, perform an HTTP HEAD request. Take the Content-Length header and divide that value by the number of concurrent parallel connections you wish to make. For this example, the Content-Length is 15 and we wish to make 3 connections, so each part will be 5 bytes.
Now perform that number of requests in parallel. For each request, set the Range header to "Range: bytes=" followed by (the number of requests already made) times (the part size found above), then "-", then that start value plus the part size minus one. Byte ranges are inclusive, hence the minus one.
For this example, each request should have the header set as follows.
Range: bytes=0-4
Range: bytes=5-9
Range: bytes=10-14
The response of each of these requests should be
Stack
Overf
low!!
In essence, we are just conforming to the Range specification (section 3.12 of RFC 2616) as well as the byte-range specification (section 14.35 of RFC 2616).
Finally, append the bytes of each request to form the final response data.
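A hedged sketch of just the range arithmetic (plain C, nothing network-specific; the function name is mine), matching the headers above:

#include <stdio.h>

/* Prints the Range header for each of `parts` parallel requests covering a
 * resource of `content_length` bytes (taken from a prior HEAD request). */
void print_range_headers(long content_length, int parts)
{
    long slice = content_length / parts;

    for (int i = 0; i < parts; i++) {
        long first = (long)i * slice;
        /* Byte ranges are inclusive; the last part also absorbs any remainder. */
        long last = (i == parts - 1) ? content_length - 1 : first + slice - 1;
        printf("Range: bytes=%ld-%ld\n", first, last);
    }
}

int main(void)
{
    print_range_headers(15, 3);   /* bytes=0-4, bytes=5-9, bytes=10-14 */
    return 0;
}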
Disclaimer: I've never actually tried this, but it should work in theory.
I can't say whether wget is able to put a file back together again when it is fetched from multiple sources.
The following example shows how to do it with aria2c.
You would build a download description file and then pass that to aria, like so:
aria2c -i uri.txt --split=5 --min-split-size=1M --max-connection-per-server=5
where uri.txt might contain
http://a.com/file1.iso http://mirror-1.com/file1.iso http://mirror-2.com/file1.iso
dir=/downloads
out=file1.iso
This would fetch the same file, from 3 different locations and place it into the downloads folder (dir) with the name file1.iso (out).
I have a program already written in gawk that downloads a lot of small bits of info from the internet. (A media scanner and indexer)
At present it launches wget to get the information. This is fine, but I'd like to simply reuse the connection between invocations. It's possible that a run of the program might make between 200 and 2000 calls to the same API service.
I've just discovered that gawk can do networking, and found geturl.
However, the advice at the bottom of that page is well heeded: I can't find an easy way to read the last line and keep the connection open.
As I'm mostly reading JSON data, I can set RS="}" and exit when the body length reaches the expected Content-Length. This might break with any trailing white space, though. I'd like a more robust approach. Does anyone have a nicer way to implement sporadic HTTP requests in awk that keep the connection open? Currently I have the following structure...
con="/inet/tcp/0/host/80";
send_http_request(con);
RS="\r\n";
read_headers();
# now read the body - but do not close the connection...
RS="}"; # for JSON
while ( con |& getline bytes ) {
body = body bytes RS;
if (length(body) >= content_length) break;
print length(body);
}
# Do not close con here - keep open
It's a shame this one little thing seems to be spoiling all the potential here. Also, in case anyone asks :) ..
awk was originally chosen for historical reasons - there were not many other language options on this embedded platform at the time.
Gathering up all of the URLs in advance and passing to wget will not be easy.
re-implementing in perl/python etc is not a quick solution.
I've looked at trying to pipe URLs to a named pipe and into wget -i -, but that doesn't work. Data gets buffered, and unbuffer is not available; also, I think wget gathers up all the URLs until EOF before processing.
The data is small so lack of compression is not an issue.
The problem with connection reuse comes from the HTTP 1.0 standard, not from gawk. To reuse the connection you must either use HTTP 1.1 or try some other non-standard solution for HTTP 1.0. Don't forget to add the Host: header to your HTTP/1.1 request, as it is mandatory.
You're right about the lack of robustness when reading the response body. For line-oriented protocols this is not an issue. Moreover, even when using HTTP 1.1, if your script blocks waiting for more data when it shouldn't, the server will, again, close the connection due to inactivity.
As a last resort, you could write your own HTTP retriever in whatever language you like which reuses connections (all to the same remote host, I presume) and also inserts a special record separator for you. Then you could control it from the awk script.
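If you go that route, a hedged sketch of the request side of such a retriever in C (socket setup omitted; sock_write() is a placeholder for plain write() on a connected TCP socket): send HTTP/1.1 requests with a Host header and keep-alive semantics, and on the reading side stop after exactly Content-Length body bytes instead of waiting for the server to close, otherwise the connection cannot be reused.

#include <stdio.h>

/* Placeholder for your transport, e.g. write() on a connected TCP socket. */
extern int sock_write(void *conn, const char *buf, size_t len);

/* Sends one request on an already-open connection. The connection stays
 * usable for the next request as long as the whole response is then read
 * (headers plus exactly Content-Length body bytes). */
int send_keepalive_request(void *conn, const char *host, const char *path)
{
    char req[512];
    int n = snprintf(req, sizeof req,
                     "GET %s HTTP/1.1\r\n"
                     "Host: %s\r\n"                 /* mandatory in HTTP/1.1 */
                     "Connection: keep-alive\r\n"   /* the default in 1.1, made explicit */
                     "\r\n",
                     path, host);
    if (n < 0 || (size_t)n >= sizeof req)
        return -1;                                  /* formatting error or path too long */
    return sock_write(conn, req, (size_t)n);
}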