How does HTTP caching works in a proxy server? - http

It's my understanding that caching is one of the main utilities of a proxy server. I'm currently trying to develop a simple one and I would like to know exactly how caching works.
Intuitively I think that it's basically an association between a request and a response. For example: for the following request: "GET google.com" you have the following response: "HTTP/1.0 200 OK..."
That way, whenever the proxy server receives a request for that URL he can reply with the cached response (I'm not really worried right now about when to serve the cached response and when to actually send the request to the real destination).
What I don't understand is how to establish the association between a request and a response since the HTTP response doesn't have any field saying "hey this is the response you get when you request the X URL" (or does it?).
Should I get this information by analyzing the underlying protocols? If so, how?

Your cache proxy server is already putted into play when a request arrives. Therefore you have the requested resource URL. Then you look in your cache and try to find the cached resource for the requested resource URL, if you cannot find the resource in your cache (or the cache is outdated), you fetch the data from the source. Keep in Mind, that you have to invalidate the cached resource if you receive a PUT, POST or DELETE request.

Related

What is the actual difference between the different HTTP request methods besides semantics?

I have read many discussions on this, such as the fact the PUT is idempotent and POST is not, etc. However, doesn't this ultimately depend on how the server is implemented? A developer can always build the backend server such that the PUT request is not idempotent and creates multiple records for multiple requests. A developer can also build an endpoint for a PUT request such that it acts like a DELETE request and deletes a record in the database.
So my question is, considering that we don't take into account any server side code, is there any real difference between the HTTP methods? For example, GET and POST have real differences in that you can't send a body using a GET request, but you can send a body using a POST request. Also, from my understanding, GET requests are usually cached by default in most browsers.
Are HTTP request methods anything more than just a logical structure (semantics) so that as developers we can "expect" a certain behavior based on the type of HTTP request we send?
You are right that most of the differences are on the semantic level, and if your components decide to assign other semantics, this will work as well. Unless there are components involved that you do not control (libraries, proxies, load balancers, etc).
For instance, some component might take advantage of the fact that PUT it idempotent and thus can re retried, while POST is not.
The Hypertext Transfer Protocol (HTTP) is designed to enable communications between clients and servers.
HTTP works as a request-response protocol between a client and server.
A web browser may be the client, and an application on a computer that hosts a web site may be the server.
Example: A client (browser) submits an HTTP request to the server; then the server returns a response to the client. The response contains status information about the request and may also contain the requested content.
HTTP Methods
GET
POST
PUT
HEAD
DELETE
PATCH
OPTIONS
The GET Method
GET is used to request data from a specified resource.
GET is one of the most common HTTP methods.
Note that the query string (name/value pairs) is sent in the URL of a GET request.
The POST Method
POST is used to send data to a server to create/update a resource.
The data sent to the server with POST is stored in the request body of the HTTP request.
POST is one of the most common HTTP methods.
The PUT Method
PUT is used to send data to a server to create/update a resource.
The difference between POST and PUT is that PUT requests are idempotent. That is, calling the same PUT request multiple times will always produce the same result. In contrast, calling a POST request repeatedly have side effects of creating the same resource multiple times.
The HEAD Method
HEAD is almost identical to GET, but without the response body.
In other words, if GET /users returns a list of users, then HEAD /users will make the same request but will not return the list of users.
HEAD requests are useful for checking what a GET request will return before actually making a GET request - like before downloading a large file or response body.
The DELETE Method
The DELETE method deletes the specified resource.
The OPTIONS Method
The OPTIONS method describes the communication options for the target resource.
src. w3schools

What to do with headers on following HTTP 303

I'm trying to determine what a client should do with headers on receiving a 303 (See Other) from the server. Specifically, what should be done with the Authorization header that was sent on the initial request?
Here's the problem: the client makes a request to myserver.com (HTTP request method is not relevant here) and the server at myserver.com responds with a 303 and the Location header contains otherserver.com/some_resource/. Tools like Postman and curl will follow the redirect by passing all the same headers in the subsequent request to otherserver.com. I haven't found a way to make these tools drop the headers.
In the case I've described, sending the Authorization header to otherserver.com seems like a security risk: otherserver.com now knows my token and possibly what host it can be used on so now the token is compromised. This can also cause errors, depending on how the destination host is configured. In the case where the redirect is to another resources on the same host (ie, myserver.com) then the Authorization header will (probably) need to be sent, and because it's the same host nothing is compromised.
Effectively, in different situations it seems that the correct behaviour is different. The relevant section in the RFC does not address this issue. In developing my own API, I've written documentation telling API clients to drop the Authorization header on redirect to otherserver.com. However, based on mucking around with curl and Postman, it's not clear to me either (a) what the default behaviour is for a typical HTTP client library or (b) whether HTTP client libraries permit easy modification of the HTTP headers before following a 303 redirect. As a result, it's possible my suggestion isn't practical. I also know of no way for the server to instruct the client as to what it should do with headers on following the 303 redirect.
What should a HTTP client do with the headers when it follows a 303 redirect? Who is responsible for deciding whether to use the same headers on the redirect, the HTTP client or server?
You can argue that when sending the 303 with otherserver.com's Location, myserver.com trusted otherserver.com to handle your token. It could have sent the token in the background as well. From the client's perspective, the client trusts myserver.com to handle the token, store and verify it securely, etc. If myserver.com decides to send it on to otherserver.com, should the client override? In this case it can of course, but in general I don't think it should.
As an attacker does not control the response headers from myserver.com which is a legit resource, I think in general it is secure to send the token by default to the other server it specifies, maybe unless you have some good reason not to (say an explicit policy on the client).

Custom response headers not sent by server (Rails Devise)

I'm trying to retrieve 3 response headers (Rails Devise Auth Headers: uid, client, access-token) in every request to a Rails Server.
Using Postman (http client) it's working.
With OkHttp (java http client) the headers just don't show up in the client (i've checked using Wireshark).
When i'm in debug mode it just work...
The additional headers with postman are due to postman sending an Origin header and the server is replying with CORS headers, i.e. Access-Control-.... These headers are send within the normal HTTP header, i.e. not after the response.
But these access control headers are only relevant when the access is done from a browser because they control the cross origin behavior of XHR. Since you are not inside a browser they should be irrelevant for what you are doing. What is relevant are the body of the response and some of the other headers and here you'll find no differences. Also irrelevant should be if multiple requests are send within the same TCP connection (HTTP keep-alive done by postman) or with multiple connections (OkHttp) because each request is independent from the other and using the same TCP connection is only a performance optimization.
If you really want to get these special headers you should add an Origin header within you OkHttp request. See the OkHttp examples on how to add your own headers. But like I said: these access control headers should be irrelevant for the real task and there should be no need to get to these headers.
There is a property "config.batch_request_buffer_throttle" in the file "config/initializers/devise_token_auth.rb" of the Rails Project. We changed it from 5 seconds to 0 seconds.
It is a property to keep the current token available for that amount of time to the following requests.
As the original documentation: "Sometimes it's necessary to make several requests to the API at the same time. In this case, each request in the batch will need to share the same auth token. This setting determines how far apart the requests can be while still using the same auth token."
So when we did the request using Postman or in Java Debug the 5 seconds was running allowing Devise to generate new tokens then retrieve them to the client.

304 Not Modified with 200 (from cache)

I'm trying to understand what exactly is the difference between "status 304 not modified" and "200 (from cache)"
I'm getting 304 on javascript files that I changed last. Why does this happen?
Thanks for the assistance.
https://sookocheff.com/post/api/effective-caching/ is an excellent source to form the required understanding around these 2 HTTP status codes.
After reading this thoroughly, I had this understanding
In typical usage, when a URL is retrieved, the web server will return the resource's current representation along with its corresponding ETag value, which is placed in an HTTP response header "ETag" field. The client may then decide to cache the representation, along with its ETag. Later, if the client wants to retrieve the same URL resource again, it will first determine whether the local cached version of the URL has expired (through the Cache-Control and the Expire headers). If the URL has not expired, it will retrieve the local cached resource. If it determined that the URL has expired (is stale), then the client will contact the server and send its previously saved copy of the ETag along with the request in a "If-None-Match" field. (Source: https://en.wikipedia.org/wiki/HTTP_ETag)
But even when expires time for an asset is set in future, browser can still reach the server for a conditional GET using ETag as per the 'Vary' header. Details on 'vary' header: https://www.fastly.com/blog/best-practices-using-vary-header/
"200 (from cache)" means the browser found a cached response for the request made. So instead of making a network call to fetch that resource from the remote server, it simply made use of the cached response.
Now these cached responses have a life time associated with them. With time, the responses can become stale. And if these are not allowed to be served stale (see section 4.2.4 - RFC7234), the browser needs to contact the remote server and verify whether these cached responses are still valid or not. The 304 response status code is the servers' way of letting the browser know that the response hasn't changed and is still valid.
If the browser receives a 304 response, it can "freshen" the relevant stale response(s). And subsequent requests to the resource can again be served from the browser cache without checking with the remote server (until the response becomes stale again).
304 Modified
A 304 Not Modified means that the files have not been modified since the specified version by the "If-Modified-Since", or "If-None-Match".
200 OK
This is the response you'll get if an HTTP request works. GET requests will have something relating to the files. A POST request will have something holding the result of the action.
Happy coding!Lyfe

Does sending POST data to a server that doesn't accept post data recieve the data?

I am setting up a back end API in a script of mine that contacts one of my sites by sending XML to my web server in the form of POST data. This script will be used by many and I want to limit the bandwidth waste for people that accidentally turn the feature on without a proper access key.
I will be denying requests that do not have the correct access key by maybe generating a 403 access code.
Lets say the POST data is ~500kb of data. Does the server receive all 500kb of data when this attempt is made regardless of the status code?
How about if I made the url contain the key mydomain/api/123456789 and generate 403 status on all bad access keys.
Does the POST data still get sent/received regardless or is it negotiated before the data is finally sent.
Thanks in advance!
Generally speaking, the entire request will be sent, including post data. There is often no way for the application layer to return a response like a 403 until it has received the entire request.
In reality, it will depend on the language/framework used and how closely it is linked to the HTTP server. Section 8.2.2 of RFC2616 HTTP/1.1 specification has this to say
An HTTP/1.1 (or later) client sending
a message-body SHOULD monitor the
network connection for an error status
while it is transmitting the request.
If the client sees an error status, it
SHOULD immediately cease transmitting
the body. If the body is being sent
using a "chunked" encoding (section
3.6), a zero length chunk and empty trailer MAY be used to prematurely
mark the end of the message. If the
body was preceded by a Content-Length
header, the client MUST close the
connection.
So, if you can find a language environemnt closely linked with the HTTP server (for example, mod_perl), you could do this in a way which does comply with standards.
An alternative approach you could take is to make an initial, smaller request to obtain a URL to use for the larger POST. The application can then deny providing the URL to clients without an appropriate key.
Here is great book about RESTful Web Services, where it's explained how HTTP works: http://oreilly.com/catalog/9780596529260
You can consider any request as envelope, where on top of it it's written address (URL), some properties (HTTP Headers) and inside it there's some data (if request is initiated by post method). So as you might guess you can't receive envelope partially.
Oh I forgot, it's when you are using HTTP Post with standard HTTP header "application/x-www-form-urlencoded" but if you are uploading files (correspondingly using ""multipart/form-data") Django gives you control over streamed chunks of files using Middleware classes: http://docs.djangoproject.com/en/dev/topics/http/middleware/

Resources