Why do HTTP proxies require an absolute URI in their GET requests?

I noticed in the HTTP spec (RFC 2616, section 5.1.2) that an HTTP request to an HTTP proxy uses an absolute URI:
GET http://stackoverflow.com/questions/1968887/uribuilder-and-in-uri HTTP/1.1
while a non-proxied request uses a relative URI:
GET /relative_path.html HTTP/1.1
In either case, a "Host:" header is also specified. Since the "Host:" header already specifies the target of the request, why is the absolute URI required for the HTTP proxy? The spec says something about avoiding request loops, but I'm not at all certain that has anything to do with my question.
I've checked in a network monitor and verified that at least on my system, requests do behave as described above.

I suspect it's because the "Host" header only appeared in HTTP/1.1. Before that, the path was all there was, and a path alone wasn't enough for a proxy to know where to forward the request, so the host had to be included in the request line for proxying to work.
It's somewhat redundant in HTTP/1.1, I suppose, but it remains for backward compatibility.
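As an illustration, here is a minimal Python sketch (the proxy host and port are made up) that sends the absoluteURI form to a forward proxy; note that the Host header is still sent alongside it:

import http.client

# Connect to the proxy itself (hypothetical address), then put the
# absolute URI of the real target in the request line; this is the
# absoluteURI form as it appears on the wire.
conn = http.client.HTTPConnection("proxy.example.net", 3128)
conn.request("GET", "http://stackoverflow.com/questions/1968887/uribuilder-and-in-uri",
             headers={"Host": "stackoverflow.com"})
resp = conn.getresponse()
print(resp.status, resp.reason)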

URLs and HTTP protocol

I am currently learning how to transfer messages via URL to a host server. What I have learned so far is how a URL is composed: http://example.com:80/latest/example.jpg?d=400x400 gives me the image example.jpg in the requested dimensions from the host via port 80 (which can be left out, as HTTP defaults to port 80). The request message for this would look like this:
GET /latest/example.jpg?d=400x400 HTTP/1.1. The response message would look like this: HTTP/1.1 200 OK.
So it is clear to me how to GET some resource from a Host. But what's with the other HTTP methods like PUT, POST or DELETE? I don't understand where in the URL the HTTP method is carried for the host to read. How do I tell the host to PUT instead of GET?
There seems to be a small misconception about URLs and the corresponding requests.
The URL http://example.com:80/latest/example.jpg?d=400x400 is composed of 5 pieces (a short parsing sketch follows the list):
The protocol used (in your case http)
The FQDN - fully qualified domain name - (in your case example.com)
The port on the FQDN - in your case 80 - which is unnecessary here because your browser defaults to 80 for http
The requested resource, in your case /latest/example.jpg
The query string parameters, indicated by ?, in your case the parameter d with the value 400x400
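Here is that decomposition done with Python's standard library:

from urllib.parse import urlsplit

# Split the example URL into the five pieces listed above.
parts = urlsplit("http://example.com:80/latest/example.jpg?d=400x400")
print(parts.scheme)    # 'http'                 -> protocol
print(parts.hostname)  # 'example.com'          -> FQDN
print(parts.port)      # 80                     -> port
print(parts.path)      # '/latest/example.jpg'  -> resource
print(parts.query)     # 'd=400x400'            -> query string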
Note that the request message only looks the way you outlined it because your browser defaults to HTTP's GET method. As you correctly stated, there are various HTTP methods, such as PUT, POST, PATCH, DELETE, etc.
The HTTP method is stated in the request line (the first line of the request message), not in the URL, so it's up to the request which HTTP method is invoked.
For ordinary web surfing, a URL typed into the browser will always result in a GET request. For the other HTTP methods, it's up to the application (e.g. your website, or any other software that makes HTTP requests) to enable their use. As an example, HTML provides <form> tags where you can specify the HTTP method, e.g. you can say to use POST.
To sum it up: your URL does not specify the HTTP method.
Browsers default to GET, but in the end it's up to your application (and thus the logic behind it) which HTTP method is used.
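As a concrete sketch (the endpoint and payload here are made up), the same host can be sent a POST simply by naming a different method in the request line:

import http.client
import json

# Same host as before; only the method in the request line changes.
conn = http.client.HTTPConnection("example.com", 80)
body = json.dumps({"d": "400x400"})
conn.request("POST", "/latest/items", body=body,
             headers={"Content-Type": "application/json"})
print(conn.getresponse().status)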

What HTTP client headers should I use to instruct proxies to refetch from origin, and cache the response?

I'm currently working on a system where a client makes HTTP/1.1 requests to an origin server. I control both the client and the server software, so I have free rein over the HTTP headers set. Between the client and the origin are multiple, hierarchical layers of web proxy / cache devices (think Squid or similar).
The data served up by the origin is usually highly cacheable, and I intend to set HTTP response headers to indicate this. Specifically, I plan to use Cache-Control: public, max-age=<value>. I understand that this means intermediate proxies will cache the response up to the specified max-age, at which point they will revalidate against the origin (presumably with a conditional If-Modified-Since request based on the Last-Modified header, looking for a 304 response).
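For concreteness, a minimal origin sketch using Python's standard library (the one-hour max-age is purely illustrative, standing in for the <value> above):

from http.server import BaseHTTPRequestHandler, HTTPServer
import email.utils
import time

class OriginHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"payload"
        self.send_response(200)
        # Illustrative freshness lifetime of one hour.
        self.send_header("Cache-Control", "public, max-age=3600")
        # Validator that caches can use for If-Modified-Since revalidation.
        self.send_header("Last-Modified",
                         email.utils.formatdate(time.time() - 86400, usegmt=True))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8000), OriginHandler).serve_forever()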
The problem I have is that the client might become aware that the data held by caches might now be invalid. In this case, I need the client to make a request which instructs the caches to either fetch or revalidate their response with the origin. If the origin response is now different, the cache should store this new response. In my mind, this would involve the client making the request, and each cache in the chain should revalidate its response with the next upstream device, all the way back to the origin. The new response can then be served from the closest cache which actually has it.
What are the correct HTTP headers that need to be set on the client request to achieve this? At first I thought that setting Cache-Control: no-cache on the request would make this happen, but reading the RFC, it seems that this instructs the intermediate caches both to go back to the origin (desired) and not to cache the new response (not desired). I then saw an article suggesting that a request header of Cache-Control: max-age=0 would perhaps do this, but I'm not sure.
Will max-age=0 do what I need here, or do I need some other combination of HTTP headers?
I asked a similar question here: How to make proxy revalidate resource from origin. I have since learned that proxy revalidation wasn't supported by nginx at the time of writing; it was scheduled for the 1.5 release.
Sending max-age=0 from the client should trigger this revalidation mechanism in the proxy, provided the original response from the origin contained the right cache control headers.
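A minimal client-side sketch (the URL is hypothetical):

import urllib.request

# A request max-age=0 marks any stored response as too old to serve
# without revalidation, yet still allows each cache on the path to
# store the freshly revalidated response.
req = urllib.request.Request(
    "http://origin.example.com/data.json",
    headers={"Cache-Control": "max-age=0"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Age"))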
But whether your upstream server(s) will respect these headers and revalidate with their origin is clearly not something you can just assume. If you have control over your upstream servers, I think it could work.
Also, as far as I know, ETag validators are preferred over Last-Modified / If-Modified-Since headers.
I found these to be helpful articles on the subject:
caching tutorial
cache control directives
http specs on validation
section 14.9.4 on this spec
[UPDATE]
Nginx version 1.5.8 has since been released, and I can confirm that this mechanism is now working!

Returning 400 in virtual host environments where Host header has no match

Consider a web server from which three virtual hosts are served:
mysite.com
myothersite.com
imnotcreative.com
Now assume that the server receives the following raw request message (code formatting removes the terminating \r\n sequences):
GET / HTTP/1.1
Host: nothostedhere.com
I haven't seen any guidance in RFC 2616 (perhaps I missed it?) on how to respond to a request for a host name that does not exist on the current server. Apache, for example, will simply use the first virtual host defined in its configuration as the "primary host" and pretend the client requested that host. Obviously this is more robust than returning a 400 Bad Request response, and it guarantees the client always sees some representation.
So my question is ...
Can anyone provide reasons aside from the "robustness vs. correctness" argument to dissuade me from responding with a 400 (or other error code) should the client request a non-existent host when employing the HTTP/1.1 protocol?
Note that all HTTP/1.1 requests MUST specify a Host: header as per RFC 2616. For HTTP/1.0 requests the only real option is to serve the "primary" host result. This question specifically addresses HTTP/1.1 protocol requests.
400 is not really the semantically correct response code in this scenario.
10.4.1 400 Bad Request
The request could not be understood by the server due to malformed syntax.
This is not what has happened. The request is syntactically valid, and by the time your server has reached the routing phase (when you are inspecting the value of the header), this will already have been determined.
I would say the correct response code here is 403:
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it.
This describes what has happened more accurately. The server is refusing to fulfill the request because it is unable to, and a more verbose error message can be provided in the message entity.
There is also an argument that 404 would be acceptable/correct, since a suitable document with which to satisfy the request could not be found, but personally I think this is not the correct option, because 404 states:
10.4.5 404 Not Found
The server has not found anything matching the Request-URI
This explicitly mentions a problem with the Request-URI, and at this early stage of the routing phase you are probably not interested in the URI, since you first need to allocate the request to a host before you can determine whether that host has a suitable document to handle the URI path.
In HTTP/1.1, Host: headers are mandatory. If a client states that it is using version 1.1 and does not supply a Host: header, then 400 is definitely the correct response code. If the client states that it is using version 1.0, it is not required to supply a Host header, and this should be handled gracefully; that scenario amounts to the same situation as an unrecognised domain.
Really you have two options in this event: route the request to a default virtual host container, or respond with an error. As outlined above, if you are going to respond with an error, I believe the error should be 403.
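A minimal sketch of that routing decision, using Python's standard library (it implements the 403 choice argued above; swap in 400, or route to a default vhost, if you prefer the alternatives):

from http.server import BaseHTTPRequestHandler, HTTPServer

VHOSTS = {"mysite.com", "myothersite.com", "imnotcreative.com"}

class VHostHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        host = self.headers.get("Host")
        if host is None:
            # An HTTP/1.1 request without a Host header is malformed: 400.
            self.send_error(400, "Missing Host header")
        elif host.split(":")[0] not in VHOSTS:
            # Syntactically valid, but for a host we don't serve: 403.
            self.send_error(403, "Unknown virtual host")
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

HTTPServer(("", 8080), VHostHandler).serve_forever()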
I'd say this largely depends on what type(s) of clients you expect to consume your service and the type of service you offer.
For a general website:
It's pretty safe to assume that requests are triggered from a user's browser, in which case I'd be more forgiving about a missing or incorrect Host: header. I'd even go so far as to say that the way Apache handles this case (i.e. falling back to the first appropriate vhost) is perfectly fine. After all, you don't want to scare your customers away.
For an API/RPC type of service:
That's a totally different case. You SHOULD expect whatever/whoever consumes your service to adhere to your specifications. So, if those require a consumer to pass a valid Host: header and the consumer fails to do so, you SHOULD return a reasonable error response (400 Bad Request seems fine to me).
It seems the previously accepted answer is no longer correct.
Per RFC 7230:
"A server MUST respond with a 400 (Bad Request) status code to any HTTP/1.1 request message that lacks a Host header field and to any request message that contains more than one Host header field or a Host header field with an invalid field-value."

What are the consequences of not setting "cache-control" in http response header?

Say my web application responds to an HTTP request with a response that has no "cache-control" header. If the client end submits the same request within a relatively short time, what would happen? Does a cached copy of the response get used, so the request never needs to reach the server? Or does the request get sent to the server just like the first time?
If the answer is "it depends", please indicate what the dependencies are. Thanks.
The HTTP/1.1 protocol defines no mandatory caching behavior for a resource served without cache-related headers; caches are permitted to assign a heuristic freshness lifetime (often derived from Last-Modified), so it's really up to the HTTP client's (or intermediary's) implementation.
Here is the link to the RFC.
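A quick way to see what a cache would have to work with is to print the cache-relevant response headers (the URL is illustrative):

import urllib.request

# With no explicit freshness information in the response, reuse of a
# cached copy falls back to each cache's own heuristics.
with urllib.request.urlopen("http://example.com/") as resp:
    for name in ("Cache-Control", "Expires", "Last-Modified", "ETag"):
        print(name, ":", resp.headers.get(name))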

About the http request standard

GET http://stackoverflow.com/questions HTTP/1.1
Host: stackoverflow.com
Does the HTTP standard require GET requests to use an absolute or a relative address? What about when the request goes through a proxy?
I ask because the absolute form seems to duplicate the Host header information.
GET / HTTP/1.1
is a valid request line. The full absolute URI is not necessary.
5.1.2 Request-URI
The Request-URI is a Uniform Resource Identifier (section 3.2) and identifies the resource upon which to apply the request.
Request-URI = "*" | absoluteURI | abs_path | authority
The four options for Request-URI are dependent on the nature of the request. The asterisk "*" means that the request does not apply to a particular resource, but to the server itself, and is only allowed when the method used does not necessarily apply to a resource. One example would be
OPTIONS * HTTP/1.1
The absoluteURI form is REQUIRED when the request is being made to a proxy. The proxy is requested to forward the request or service it from a valid cache, and return the response. Note that the proxy MAY forward the request on to another proxy or directly to the server specified by the absoluteURI. In order to avoid request loops, a proxy MUST be able to recognize all of its server names, including any aliases, local variations, and the numeric IP address. An example Request-Line would be:
GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.1
To allow for transition to absoluteURIs in all requests in future versions of HTTP, all HTTP/1.1 servers MUST accept the absoluteURI form in requests, even though HTTP/1.1 clients will only generate them in requests to proxies.
You can consult the HTTP RFC for this.
3.2.1 General Syntax
URIs in HTTP can be represented in absolute form or relative to some known base URI [11], depending upon the context of their use.
Host details are not required; a relative path is sufficient.
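As a quick check of the rule quoted above that HTTP/1.1 servers MUST accept the absoluteURI form, here is a minimal raw-socket sketch sent straight to an origin server (no proxy involved):

import socket

# The request line uses the absoluteURI form even though no proxy is
# in the path; HTTP/1.1 servers are required to accept it anyway.
request = (b"GET http://example.com/ HTTP/1.1\r\n"
           b"Host: example.com\r\n"
           b"Connection: close\r\n\r\n")
with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(request)
    print(sock.recv(200).decode("latin-1"))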
