About the HTTP request standard

GET http://stackoverflow.com/questions HTTP/1.1
Host: stackoverflow.com
Does the HTTP standard require GET requests to use an absolute or a relative address? What about when the request is sent to a proxy?
I ask because the absolute address seems to duplicate the information in the Host header.

GET / HTTP/1.1
is a valid request line. The absolute URL is not necessary.
5.1.2 Request-URI
The Request-URI is a Uniform Resource
Identifier (section 3.2) and
identifies the resource upon which to
apply the request.
Request-URI = "*" | absoluteURI | abs_path | authority
The four options for Request-URI are
dependent on the nature of the
request. The asterisk "*" means that
the request does not apply to a
particular resource, but to the server
itself, and is only allowed when the
method used does not necessarily apply
to a resource. One example would be
OPTIONS * HTTP/1.1
The absoluteURI form is REQUIRED when
the request is being made to a proxy.
The proxy is requested to forward the
request or service it from a valid
cache, and return the response. Note
that the proxy MAY forward the request
on to another proxy or directly to the
server specified by the absoluteURI. In order
to avoid request loops, a proxy MUST
be able to recognize all of its server
names, including any aliases, local
variations, and the numeric IP
address. An example Request-Line would
be:
GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.1
To allow for transition to
absoluteURIs in all requests in future
versions of HTTP, all HTTP/1.1 servers
MUST accept the absoluteURI form in
requests, even though HTTP/1.1 clients
will only generate them in requests to
proxies.
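The four Request-URI forms in the quoted grammar can be told apart with a few string checks. This is only a rough sketch: classify_request_target is a made-up helper name, and it assumes the input is already a well-formed request-target rather than validating it against the full grammar.

```python
def classify_request_target(target: str) -> str:
    """Return which of the four Request-URI forms a request-target uses.

    Hypothetical helper for illustration; it does not validate the input.
    """
    if target == "*":
        # Applies to the server itself, e.g. "OPTIONS * HTTP/1.1".
        return "asterisk"
    if target.startswith("/"):
        # The usual form when talking directly to the origin server.
        return "abs_path"
    if "://" in target:
        # Scheme and host included: the form required for requests to a proxy.
        return "absoluteURI"
    # Anything else is treated as an authority (host:port), as used by CONNECT.
    return "authority"
```

The abs_path check comes before the "://" check so that a path whose query string happens to contain a URL is still classified as abs_path.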

You can consult the HTTP RFC for this.
3.2.1 General Syntax
URIs in HTTP can be represented in absolute form or relative to some
known base URI [11], depending upon the context of their use.

Host details are not required in the request line. A relative path is sufficient.

Related

URLs and HTTP protocol

I am currently learning how to send messages to a host server via a URL. What I have learned so far is how a URL is composed: http://example.com:80/latest/example.jpg?d=400x400 gives me the image example.jpg in the requested dimensions from the host via port 80 (which can be left out, as HTTP defaults to port 80). The request message for this would look like this:
GET /latest/example.jpg?d=400x400 HTTP/1.1. The response message would look like this: HTTP/1.1 200 OK.
So it is clear to me how to GET some resource from a host. But what about the other HTTP methods like PUT, POST or DELETE? I don't understand where in the URL the HTTP method is carried for the host to read. How do I tell the host to PUT instead of GET?
There seems to be a small misconception about URLs and the corresponding requests.
The URL http://example.com:80/latest/example.jpg?d=400x400 is composed of 5 pieces:
The protocol used (in your case http)
The FQDN (fully qualified domain name) used (in your case example.com)
The port on the FQDN (in your case 80), which is unnecessary here because your browser defaults to 80 for http
Your requested resource, in your case /latest/example.jpg
Your requested query string parameters, indicated by ?, in your case the parameter d with the value 400x400
Note that the request message looks the way you outlined only because your browser defaults to the GET method of HTTP. As you correctly stated, there are various HTTP methods, such as PUT, POST, PATCH, DELETE, etc.
The HTTP method is stated in the request line (the first line of the HTTP request message), not in the URL, so it's up to the request which HTTP method is invoked.
For ordinary web surfing, a URL typed into the address bar will always result in a GET request. For the other HTTP methods, it's up to the application (e.g. your website or any other software that makes HTTP requests) to use them. As an example, HTML provides the <form> tag, where you can specify the HTTP method, e.g. you can tell it to use POST.
To sum it up: your URL does not specify the HTTP method.
Browsers default to GET, but in the end it's up to your application (and thus the logic behind it) which HTTP method is used.
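To make this concrete, here is a small sketch that builds the raw request text for a URL. build_request is a made-up helper, not a library function, and it assumes a plain http:// URL. The method is passed in separately from the URL and ends up as the first word of the request line.

```python
from urllib.parse import urlsplit


def build_request(method: str, url: str, body: str = "") -> str:
    """Build a raw HTTP/1.1 request message for an http:// URL (illustrative only)."""
    parts = urlsplit(url)
    # The request-target is the path plus the query string; an empty path becomes "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # Port 80 is the default for http, so it can be left out of the Host header.
    if parts.port in (None, 80):
        host = parts.hostname
    else:
        host = f"{parts.hostname}:{parts.port}"
    lines = [f"{method} {target} HTTP/1.1", f"Host: {host}"]
    if body:
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the header section from the (possibly empty) body.
    return "\r\n".join(lines) + "\r\n\r\n" + body
```

Calling build_request("GET", url) and build_request("DELETE", url) with the same URL gives two messages that differ only in the first word of the request line.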

Which changes does a browser make when using an HTTP proxy?

Imagine a web browser that makes an HTTP request to a remote server, such as site.example.com.
If the browser is then configured to use a proxy server, let's call it proxy.example.com using port 8080, in which ways are the request now different?
Obviously the request is now sent to proxy.example.com:8080, but there must surely be other changes to enable the proxy to make a request to the original url?
RFC 7230 - Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing, Section 5.3.2. absolute-form:
When making a request to a proxy, other than a CONNECT or server-wide
OPTIONS request (as detailed below), a client MUST send the target
URI in absolute-form as the request-target.
absolute-form = absolute-URI
The proxy is requested to either service that request from a valid
cache, if possible, or make the same request on the client's behalf
to either the next inbound proxy server or directly to the origin
server indicated by the request-target. Requirements on such
"forwarding" of messages are defined in Section 5.7.
An example absolute-form of request-line would be:
GET http://www.example.org/pub/WWW/TheProject.html HTTP/1.1
So, without proxy, the connection is made to www.example.org:80:
GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.example.org
With proxy it is made to proxy.example.com:8080:
GET http://www.example.org/pub/WWW/TheProject.html HTTP/1.1
Host: www.example.org
In the latter case the Host header is optional for HTTP/1.0 clients, and the proxy must recalculate it from the request-target anyway.
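The difference between the two cases can be sketched in a few lines. plan_request and its (host, port) proxy tuple are assumptions made for this illustration, and only GET on http:// URLs is handled.

```python
from urllib.parse import urlsplit


def plan_request(url, proxy=None):
    """Return (connect_to, request_line) for a GET of an http:// URL.

    proxy is a hypothetical (host, port) tuple, or None for a direct request.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    if proxy is None:
        # Direct: connect to the origin server and send only the path.
        connect_to = (parts.hostname, parts.port or 80)
        target = path
    else:
        # Proxied: connect to the proxy and send the absolute-form target.
        connect_to = proxy
        target = f"{parts.scheme}://{parts.netloc}{path}"
    return connect_to, f"GET {target} HTTP/1.1"
```

With proxy=("proxy.example.com", 8080) the TCP connection goes to the proxy, and the origin server's name survives only inside the request-target (and in the Host header, if one is sent).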
The proxy simply makes the request on behalf of the original client. Hence the name "proxy", the same meaning as in legalese. The browser sends its request to the proxy, the proxy makes a request to the requested server (or not, depending on whether the proxy wants to forward this request or deny it), the server returns a response to the proxy, and the proxy returns the response to the original client. There's no fundamental difference in what the server will see, except that the originating client will appear to be the proxy server. The proxy may or may not alter the request, and it may or may not cache the response, meaning the server may not receive a request at all if the proxy decides to deliver a cached version instead.

Returning 400 in virtual host environments where Host header has no match

Consider a web server from which three virtual hosts are served:
mysite.com
myothersite.com
imnotcreative.com
Now assume that the server receives the following raw request message (code formatting removes the terminating \r\n sequences):
GET / HTTP/1.1
Host: nothostedhere.com
I haven't seen any guidance in RFC 2616 (perhaps I missed it?) on how to respond to a request for a host name that does not exist on the current server. Apache, for example, will simply use the first virtual host defined in its configuration as the "primary host" and pretend the client requested that host. Obviously this is more robust than returning a 400 Bad Request response and guarantees the client always sees some representation.
So my question is ...
Can anyone provide reasons aside from the "robustness vs. correctness" argument to dissuade me from responding with a 400 (or other error code) should the client request a non-existent host when employing the HTTP/1.1 protocol?
Note that all HTTP/1.1 requests MUST specify a Host: header as per RFC 2616. For HTTP/1.0 requests the only real option is to serve the "primary" host result. This question specifically addresses HTTP/1.1 protocol requests.
400 is not really the semantically correct response code in this scenario.
10.4.1 400 Bad Request
The request could not be understood by the server due to malformed syntax.
This is not what has happened. The request is syntactically valid, and by the time your server has reached the routing phase (when you are inspecting the value of the header) this will already have been determined.
I would say the correct response code here is 403:
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it.
This describes what has happened more accurately. The server is refusing to fulfill the request because it is unable to, and a more verbose error message can be provided in the message entity.
There is also an argument that 404 would be acceptable/correct, since a suitable document with which to satisfy the request could not be found, but personally I think that this is not the correct option, because 404 states:
10.4.5 404 Not Found
The server has not found anything matching the Request-URI
This explicitly mentions a problem with the Request-URI, and at this early stage of the routing phase you are probably not interested in the URI, since the server first needs to allocate the request to a host before it can determine whether it has a suitable document to handle the URI path.
In HTTP/1.1 Host: headers are mandatory. If a client states that it is using version 1.1 and does not supply a Host: header then 400 is definitely the correct response code. If the client states that it is using version 1.0 then it is not required to supply a host header and this should be handled gracefully - and this scenario amounts to the same situation as an unrecognised domain.
Really you have two options in this event: route the request to a default virtual host container, or respond with an error. As outlined above, if you are going to respond with an error, I believe the error should be 403.
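As a rough sketch of that decision logic: resolve_vhost, its parameters and its return convention are all made up for illustration and do not reflect how any particular server does it. The 403 for an unknown host is this answer's recommendation rather than something the RFC mandates. The 400 for more than one Host header follows RFC 7230.

```python
def resolve_vhost(http_version, host_values, known_hosts, default_host=None):
    """Return (error_status, vhost); exactly one of the two is None.

    host_values is the list of Host header values found in the request.
    Hypothetical helper: does not validate the field-value syntax and
    does not handle IPv6 literals.
    """
    if len(host_values) > 1:
        # More than one Host header is always a bad request (RFC 7230).
        return 400, None
    if not host_values:
        if http_version == "HTTP/1.1":
            # Host is mandatory in HTTP/1.1.
            return 400, None
        # HTTP/1.0 without Host: same situation as an unrecognised domain.
        requested = None
    else:
        # Drop an optional ":port" suffix and compare case-insensitively.
        requested = host_values[0].split(":")[0].lower()
    if requested in known_hosts:
        return None, requested
    if default_host is not None:
        # Option 1: route to a default virtual host (what Apache does).
        return None, default_host
    # Option 2: respond with an error; 403 is the code argued for above.
    return 403, None
```

Whether to pass a default_host is exactly the "robustness vs. correctness" choice from the question.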
I'd say this largely depends on what type(s) of clients you expect to consume your service and the type of service you offer.
For a general website:
It is pretty safe to assume that requests are triggered from a user's browser, in which case I'd be more forgiving regarding a missing or incorrect Host: header. I'd even go so far as to say that the way Apache handles the case (i.e. falling back to the first appropriate VHost) is perfectly fine. After all, you don't want to scare your customers away.
For an API/RPC type of service:
That's a totally different case. You SHOULD expect whatever/whoever consumes your service to adhere to your specifications. So, if these require a consumer to pass a valid Host: header and the consumer fails to do so, you SHOULD return with a reasonable response (400 Bad Request seems fine to me).
It seems the previously accepted answer is no longer correct.
Per RFC 7230:
"A server MUST respond with a 400 (Bad Request) status code to any HTTP/1.1 request message that lacks a Host header field and to any request message that contains more than one Host header field or a Host header field with an invalid field-value."

Why do HTTP proxies require an absolute URI in their GET requests?

I noticed in the HTTP spec (section 5.1.2) that an HTTP request to an HTTP proxy uses an absolute URI:
GET http://stackoverflow.com/questions/1968887/uribuilder-and-in-uri HTTP/1.1
while a non-proxied request uses a relative URI:
GET /relative_path.html HTTP/1.1
In either case, a "Host:" header is also specified. Since the "Host:" header already specifies the target of the request, why is the absolute URI required for the HTTP proxy? The spec says something about avoiding request loops, but I'm not at all certain that has anything to do with my question.
I've checked in a network monitor and verified that at least on my system, requests do behave as described above.
I suspect it is because the "Host" header only appeared in HTTP 1.1 (I think). Prior to that, the path was all there was. This wasn't enough for the proxy, so the host had to be included in the request line for it to work.
It's sort of redundant with HTTP 1.1, I suppose, but it's there for backward compatibility now.

Do web browsers always send a trailing slash after a domain name?

Is there consistency and/or a standard on how browsers send a url to a host related to trailing slashes?
Meaning, if I type http://example.com into the address bar of a web browser, is the browser supposed to add a trailing slash (http://example.com/) or not?
The HTTP request sent from the browser to the server does not include the domain name, only the "path" portion (starting from the first slash after the domain name). Since the path cannot be empty, a / is sent in that case.
A sample GET request for the root of a web site might be:
GET / HTTP/1.0
The / above cannot be omitted.
As RFC 2616 states:
3.2.2 http URL
The "http" scheme is used to locate
network resources via the HTTP
protocol. This section defines the
scheme-specific syntax and
semantics for http URLs.
http_URL = "http:" "//" host [ ":"
port ] [ abs_path [ "?" query ]]
If the port is empty or not given,
port 80 is assumed. The semantics
are that the identified resource is
located at the server listening for
TCP connections on that port of that
host, and the Request-URI for the
resource is abs_path (section 5.1.2).
The use of IP addresses in URLs
SHOULD be avoided whenever possible
(see RFC 1900 [24]). If the
abs_path is not present in the URL, it
MUST be given as "/" when used as a
Request-URI for a resource (section
5.1.2). If a proxy receives a host name which is not a fully qualified
domain name, it MAY add its domain
to the host name it received. If a
proxy receives a fully qualified
domain name, the proxy MUST NOT change
the host name.
5.1.2 Request-URI
...
For example, a client wishing to retrieve the
resource above directly from the
origin server would create a TCP
connection to port 80 of the host
"www.w3.org" and send the lines:
GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.w3.org
followed by the remainder of the
Request. Note that the absolute path
cannot be empty; if none is present in
the original URI, it MUST be given
as "/" (the server root).
Note that it's a very different matter when the URL has a path element:
http://example.com/dir
is a different URL than
http://example.com/dir/
and could in fact contain different content, and have a different search engine ranking.
As far as the protocol is concerned, http://example.com/something and http://example.com/something/ are quite different. Some servers might redirect you from one to the other if it is implemented in such a way.
For bare domain names, the browser always sends a request whose path is a single slash.
(The domain name itself is not included in the path section of an HTTP request, just as Greg Hewgill and the others wrote. It is, however, included in the headers.)
You can check it with a tool like Fiddler or Wireshark.
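If you just want to see which path ends up in the request line, a tiny sketch with Python's standard urllib.parse is enough. request_path is a made-up helper name, and it ignores the query string for brevity.

```python
from urllib.parse import urlsplit


def request_path(url: str) -> str:
    """Return the path sent in the request line for a URL (query string ignored)."""
    # An empty path must be sent as "/"; any other path is sent unchanged,
    # so "/dir" and "/dir/" stay distinct.
    return urlsplit(url).path or "/"
```

With this helper, http://example.com and http://example.com/ map to the same path, while http://example.com/dir and http://example.com/dir/ do not.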

Resources