Proxy / gateway behaviour if HTTP response data exceeds content length - http

How should proxies / gateways behave when an HTTP server sends a response whose data size exceeds the Content-Length?
Dropping the response as an RFC violation is one option, but quite a few implementations/deployments behave this way today, and such a change would break those URLs.
I would really appreciate any insights/pointers.
Thanks,
Dev

If the data size exceeds content-length, the remaining bytes on the wire are considered part of the response to the next (pipelined) request.
If there is no outstanding request to match with those bytes, see https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-p1-messaging-26#section-3.3.3, which says:
If the final response to the last request on a connection has been
completely received and there remains additional data to read, a user
agent MAY discard the remaining data or attempt to determine if that
data belongs as part of the prior response body, which might be the
case if the prior message's Content-Length value is incorrect. A
client MUST NOT process, cache, or forward such extra data as a
separate response, since such behavior would be vulnerable to cache
poisoning.
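To make that concrete, here is a minimal Python sketch of how a proxy might split a response framed by Content-Length and decide what to do with any trailing bytes. The function names and the returned action strings are made up for illustration, and closing the connection after discarding is just one conservative choice, not something the draft requires.

```python
def split_response(buffer: bytes):
    """Split one Content-Length-framed response off the front of buffer.

    Returns (head, body, leftover). Assumes the header block is complete,
    a valid Content-Length is present, and the whole body has arrived.
    """
    head, sep, rest = buffer.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("incomplete header block")
    length = None
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("no Content-Length header")
    return head, rest[:length], rest[length:]


def handle_leftover(leftover: bytes, outstanding_requests: int) -> str:
    """Decide what to do with bytes that follow the declared body."""
    if not leftover:
        return "nothing-extra"
    if outstanding_requests > 0:
        # Pipelined case: the bytes are read as the next response in line.
        return "parse-as-next-response"
    # No request is waiting: never cache or forward this as a response.
    return "discard-and-close"


raw = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhelloEXTRA"
head, body, leftover = split_response(raw)
print(body, leftover, handle_leftover(leftover, outstanding_requests=0))
```

The important part is the last branch: whatever the proxy does with the extra data, it must not treat it as a separate response.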

Related

http get response to same uri means same format in response

I have a question regarding the response to a HTTP request.
My question is: should the representations present in responses to GET requests on the same URI always have the same format, and why?
I thought that the server might change the content associated to that URI, or that the client making the request might change the accept header in the request, but I'm not sure.
The format may change based on Content-Encoding, Content-Disposition, and a lot of other things. Compression, for example, may be used (but that's not the final format, just the transport format). The page may contain dynamic content based on your current user session (so based on your cookies, for example).
The server response would usually contain a Vary header which clearly states, for your browser, the request headers that may influence the content of the page.
For example you may have a Vary: cookie which means that if the browser requested this page without a cookie, and that later you have a cookie for this website, then the page content should not be loaded from the browser cache and a new request should be made.
So your first sentence is wrong, or too simple. Request headers and response headers can contain information on the validity of the message, how to store it, when to ask for changes, or which headers may alter the message content.
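As a rough illustration of what Vary means for a cache, here is a small Python sketch that builds a cache key from the URL plus the request headers named in Vary. The function name and the key layout are assumptions for the example, not how any particular browser or proxy stores its entries.

```python
def cache_key(method, url, request_headers, vary):
    """Build a cache key from the URL plus the request headers listed in Vary.

    Returns None for "Vary: *", which means a stored response can never be
    reused without asking the server again.
    """
    lowered = {name.lower(): value for name, value in request_headers.items()}
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    if "*" in names:
        return None
    # A header that is absent from the request counts as an empty value.
    selected = tuple((name, lowered.get(name, "")) for name in names)
    return (method, url, selected)


anonymous = cache_key("GET", "/page", {"User-Agent": "x"}, "Cookie")
logged_in = cache_key("GET", "/page", {"Cookie": "session=abc"}, "Cookie")
print(anonymous == logged_in)  # different keys, so a new request is made
```

With Vary: cookie, the request without a cookie and the request with one map to different keys, which is why the cached page is not reused.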

Caching reverse proxy for dynamic content

I was thinking about asking on Software Recommendations, but then I realized that it may be too strange a request and needs some clarification first.
My points are:
Each response contains an etag
which is a hash of the content
and which is globally unique (with sufficient probability)
The content is (mostly) dynamic and may change anytime (expires and max-age headers are useless here).
The content is partly user-dependent, as given by the permissions (which itself change sometimes).
Basically, the proxy should contain a cache mapping the etag to the response content. The etag gets obtained from the server and in the most common case, the server does not deal with the response content at all.
It should go as follows: The proxy always sends a request to the server and then either
1 the server returns only the etag and the proxy makes a lookup based on it and
1.1 on cache hit,
it reads the response data from cache
and sends a response to the client
1.2 on cache miss,
it asks the server again and then
the server returns the response with content and etag,
the proxy stores it in its cache
and sends a response to the client
2 or the server returns the response with content and etag,
the proxy stores the data in its cache
and sends a response to the client
For simplicity, I left out the handling of the if-none-match header, which is rather obvious.
My reason for this is that the most common case 1.1 can be implemented very efficiently in the server (using its cache mapping requests to etags; the content isn't cached in the server), so that most requests can be handled without the server dealing with the response content. This should be better than first getting the content from a side cache and then serving it.
In case 1.2, there are two requests to the server, which sounds bad, but is no worse than the server asking a side cache and getting a miss.
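To make the flow concrete, here is a minimal in-memory sketch in Python. Everything in it is hypothetical: EtagProxy, FakeOrigin and the origin(path, want_body) calling convention are stand-ins I made up for illustration, and a real setup would speak HTTP instead of calling a function.

```python
import hashlib


class FakeOrigin:
    """Stand-in server: remembers request -> etag, not the content itself."""

    def __init__(self):
        self.etags = {}   # request path -> etag (the server-side map)
        self.calls = []   # record of (path, want_body) for inspection

    def __call__(self, path, want_body):
        self.calls.append((path, want_body))
        if not want_body and path in self.etags:
            return self.etags[path], None            # case 1: etag only
        content = ("content of " + path).encode()    # the expensive step
        etag = hashlib.sha256(content).hexdigest()
        self.etags[path] = etag
        return etag, content                         # case 2: full response


class EtagProxy:
    def __init__(self, origin):
        self.origin = origin
        self.cache = {}   # etag -> response content

    def handle(self, path):
        etag, body = self.origin(path, want_body=False)
        if body is None:
            if etag in self.cache:                   # case 1.1: cache hit
                return self.cache[etag]
            # case 1.2: cache miss, ask again and insist on the content
            etag, body = self.origin(path, want_body=True)
        self.cache[etag] = body                      # cases 1.2 and 2
        return body


origin = FakeOrigin()
proxy = EtagProxy(origin)
proxy.handle("/catalog/123/item/456")   # case 2, content computed once
proxy.handle("/catalog/123/item/456")   # case 1.1, served from the proxy
print(origin.calls)
```

The want_body flag is only there to model the second request in case 1.2; how to express it in HTTP is exactly what Q1 below asks.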
Q1: I wonder, how to map the first request to HTTP. In case 1, it's like a HEAD request. In case 2, it's like GET. The decision between the two is up to the server: If it can serve the etag without computing the content, then it's case 1, otherwise, it's case 2.
Q2: Is there a reverse proxy doing something like this? I've read about nginx, HAProxy and Varnish and it doesn't seem to be the case. This leads me to Q3: Is this a bad idea? Why?
Q4: If not, then which existing proxy is easiest to adapt?
An Example
A GET request like /catalog/123/item/456 from user U1 was served with some content C1 and etag: 777777. The proxy stored C1 under the key 777777.
Now the same request comes from user U2. The proxy forwards it, the server returns just etag: 777777 and the proxy is lucky, finds C1 in its cache (case 1.1) and sends it to U2. In this example, neither the clients nor the proxy knew the expected result.
The interesting part is how could the server know the etag without computing the answer. For example, it can have a rule stating that requests of this form return the same result for all users, assuming that the given user is allowed to see it. So when the request from U1 came, it computed C1 and stored the etag under the key /catalog/123/item/456. When the same request came from U2, it just verified that U2 is permitted to see the result.
Q1: It is a GET request. The server can answer with a "304 Not Modified" without a body.
Q2: openresty (nginx with some additional modules) can do it, but you will need to implement some logic yourself (see more detailed description below).
Q3: This sounds like a reasonable idea given the information in your question. Just some food for thought:
You could also split the page in user-specific and generic parts which can be cached independently.
You shouldn't expect the cache to keep the calculated responses forever. So, if the server returns a 304 not modified with etag: 777777 (as per your example), but the cache doesn't know about it, you should have an option to force re-building the answer, e.g. with another request with a custom header X-Force-Recalculate: true.
Not exactly part of your question, but: Make sure to set a proper Vary header to prevent caching issues.
If this is only about permissions, you could maybe also work with permission infos in a signed cookie. The cache could derive the permission from the cookie without asking the server, and the cookie is tamper proof due to the signature.
Q4: I would use openresty for this, specifically the lua-resty-redis module. Put the cached content into a redis key-value-store with the etag as key. You'd need to code the lookup logic in Lua, but it shouldn't be more than a couple of lines.
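For the signed cookie idea mentioned under Q3, here is a small Python sketch of the signing and checking step using HMAC-SHA256. The secret, the value.signature cookie layout and the permission string are all assumptions for illustration; in the openresty setup the same check would be written in Lua.

```python
import hashlib
import hmac

SECRET = b"demo-secret-change-me"  # assumption: key shared by server and cache


def sign(value: str) -> str:
    """Return the cookie payload followed by its HMAC-SHA256 signature."""
    mac = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return value + "." + mac


def verify(cookie: str):
    """Return the payload if the signature matches, otherwise None."""
    value, _, mac = cookie.rpartition(".")
    expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing
    if hmac.compare_digest(mac.encode(), expected.encode()):
        return value
    return None


cookie = sign("catalog:read")
print(verify(cookie))                            # payload is accepted
print(verify(cookie.replace("read", "write")))   # tampered payload is rejected
```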

HTTP request and response flow for get

I am having difficulties understanding the HTTP request and response flow. I am working with a system where I can "hijack" an incoming HTTP request and give my own response. The problem I am having is that some types of GET requests seem to assume that all data is sent back in the first response.
For instance, JPEG image requests, no matter the size (my tests include 0-20 MB JPEG files), seem to assume that the entire data is sent in the first response. Even if I don't send any data and explicitly set the range header to 0, I never get a follow-up request from the client asking for the data.
For other data types, such as mp4 video, the client seems perfectly fine with getting a response that has only header information and no data, and then sends a new request explicitly asking for byte range 0-.
Is there some kind of agreement between the client and server that some types should be sent back in one response while others can be split across a number of requests?

What does client send after receiving a "100 Continue" status code?

Client sends a POST or PUT request that includes the header:
Expect: 100-continue
The server responds with the status code:
100 Continue
What does the client send now? Does it send an entire request (the previously sent request line and headers along with the previously NOT sent content)? Or does it only send the content?
I think it's the latter, but I'm struggling to find concrete examples online. Thanks.
This should be all the information you need regarding the usage of a 100 Continue response.
In my experience this is really used when you have a large request body. It could be considered roughly complementary to the HEAD method in relation to GET requests: fetch just the header information and not the body (usually) to reduce network load. 100 responses are used to determine whether the server will accept the request based purely on the headers, so that, for example, if you try to send a large POST/PUT request to a non-existent server resource, it will result in a 404 before the entire request body has been sent.
So the short answer to your question is: yes, it's the latter. You should still read the RFC for a complete picture, though. RFC2616 contains 99% of the information you would ever need to know about HTTP. There are some more recent RFCs that settle some ambiguities and offer a few small extensions to the protocol, but off the top of my head I can't remember what they are.
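Since concrete examples are hard to find, here is a small Python sketch that plays both sides over a local socket pair. The path, host name and helper names are made up for illustration, and it assumes a small body so that everything fits in the socket buffers of a single thread.

```python
import socket


def read_until(sock, marker: bytes) -> bytes:
    """Read from sock until marker has been seen (or the peer closes)."""
    data = b""
    while marker not in data:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    return data


def demo_expect_continue(body: bytes):
    """Return (request head, interim response, body) as seen on the wire."""
    client, server = socket.socketpair()
    try:
        head = (
            b"PUT /upload HTTP/1.1\r\n"
            b"Host: example.test\r\n"
            b"Content-Length: " + str(len(body)).encode() + b"\r\n"
            b"Expect: 100-continue\r\n"
            b"\r\n"
        )
        client.sendall(head)                      # 1. headers only
        seen_head = read_until(server, b"\r\n\r\n")
        server.sendall(b"HTTP/1.1 100 Continue\r\n\r\n")
        interim = read_until(client, b"\r\n\r\n")
        client.sendall(body)                      # 2. body only, no headers
        seen_body = b""
        while len(seen_body) < len(body):
            chunk = server.recv(4096)
            if not chunk:
                break
            seen_body += chunk
        return seen_head, interim, seen_body
    finally:
        client.close()
        server.close()


head, interim, received = demo_expect_continue(b"hello world")
print(received)
```

After the 100 Continue the server side only ever sees the body bytes; the request line and headers are not sent a second time.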

Matching HTTP responses with their corresponding HTTP pipelined requests

I'm trying to write a program to match HTTP requests with their corresponding responses. Everything seems to work well for most scenarios (when the transfer is perfectly ordered, and even when it's not, by using TCP sequence numbers).
The only problem I found is for when I have pipelined requests. After that, I get several responses but I don't know which packets are the answer to a specific request and which are not. I read in another post that the responses will come sequentially and combining this property with information on the Content-Length field seems to be a solution. The problem is that Content-length is not a mandatory field, so I'm not sure if I can always rely on that.
Does anyone know how the web browsers that support this feature (btw, most of them do not) actually do it?
The information about the body's length has to be present in the headers. It's just not always in 'content-length'. To work it all out you will have to study RFC 2616, most notably section 4.4, which deals with the different headers.
Some more relevant rules from the RFC 2616:
When pipelining:
A server MUST send its responses to those requests in the same order that the requests were received.
From 9.2
If no response body is included, the response MUST include a Content-Length field with a field-value of "0".
From 10.2.7 206 Partial Content
The response MUST include .... Either a Content-Range header field ... or a multipart/byteranges
Content-Type including Content-Range fields for each part.
From 14.13 Content-Length
Applications SHOULD use this field to indicate the transfer-length of the message-body, unless this is prohibited by the rules in section 4.4.
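As a rough sketch of how the ordering rule and the length information combine, here is a Python example that cuts a captured response stream into messages and pairs them with the requests in FIFO order. It is deliberately simplified: the function name is made up, and it assumes every response carries a Content-Length, which section 4.4 does not guarantee.

```python
from collections import deque


def match_pipelined(requests, stream: bytes):
    """Pair requests with responses in order, framing by Content-Length only."""
    pending = deque(requests)
    pairs = []
    while stream and pending:
        head, sep, rest = stream.partition(b"\r\n\r\n")
        if not sep:
            break                       # header block not fully captured yet
        length = None
        for line in head.split(b"\r\n")[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("this sketch needs a Content-Length header")
        if len(rest) < length:
            break                       # body not fully captured yet
        status_line = head.split(b"\r\n")[0]
        # Responses arrive in request order, so the oldest request matches.
        pairs.append((pending.popleft(), status_line, rest[:length]))
        stream = rest[length:]
    return pairs


stream = (
    b"HTTP/1.1 200 OK\r\nContent-Length: 1\r\n\r\nA"
    b"HTTP/1.1 404 Not Found\r\nContent-Length: 2\r\n\r\nno"
)
for pair in match_pipelined(["GET /a", "GET /b"], stream):
    print(pair)
```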
The current answers are a bit old and need a refresh.
The newer HTTP/1.1 RFC is RFC 7230, which contains more precise information on determining the message size.
Message Body Length
Associating a response to a request
Security Considerations
Detecting the size of a message is quite complex. You can have a Content-Length, or Transfer-Encoding: chunked, or both, or neither. And some special status codes, like 100 Continue, may alter all this.
The first link contains 7 entries that should be checked in the right order to guess the right size.
And as stated in the last link, failing to detect the right message length may lead to HTTP Smuggling (splitting, cache poisoning) issues.
Pipelining support is the source of most smuggling issues. You should really take care of the whole RFC7230 document if you want to implement it.
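To give an idea of what checking those entries in order looks like for a response, here is a simplified Python sketch. The function name and the returned labels are my own, and it is stricter than the RFC in one place: a Content-Length with several values is rejected here even when the values are identical.

```python
def response_body_framing(request_method, status, headers):
    """Return (kind, length) describing how the response body is delimited."""
    h = {name.lower(): value for name, value in headers.items()}
    # Responses to HEAD, and 1xx, 204 and 304 responses, never have a body.
    if request_method == "HEAD" or 100 <= status < 200 or status in (204, 304):
        return ("none", 0)
    # A 2xx response to CONNECT switches the connection to a tunnel.
    if request_method == "CONNECT" and 200 <= status < 300:
        return ("tunnel", None)
    # Transfer-Encoding overrides Content-Length when both are present.
    if "transfer-encoding" in h:
        codings = [c.strip().lower() for c in h["transfer-encoding"].split(",")]
        if codings[-1] == "chunked":
            return ("chunked", None)
        return ("until-close", None)
    if "content-length" in h:
        value = h["content-length"].strip()
        if not (value.isascii() and value.isdigit()):
            raise ValueError("invalid Content-Length: " + value)
        return ("length", int(value))
    # Neither header: the body runs until the server closes the connection.
    return ("until-close", None)


print(response_body_framing("GET", 200, {"Content-Length": "5"}))
print(response_body_framing(
    "GET", 200, {"Content-Length": "5", "Transfer-Encoding": "chunked"}))
```

The second call is the classic smuggling setup: if two parsers on the same path disagree about which header wins, they disagree about where the next message starts.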

Resources