Using If-Modified-Since header for dynamically generated remote files - http

Our web server regularly downloads images from other web servers. To prevent our server having to download the same image every day even if it has not changed, I plan to store the Last-Modified header when the image downloads and then put that date in the If-Modified-Since header of subsequent requests for the same file.
I have this working fine except when the remote file is generated on-the-fly when requested (e.g. if it generates a certain sized version for the web when requested from separate original file). In this case, the Last-Modified header is the date that the remote server responds to the request so the stored Last-Modified header from the previous download will always be earlier that ones for subsequent requests so the image will always get downloaded and I'll never get the 304 Not Modified status code.
So, is there a way to reduce the download frequency when the remote server is serving up images that are generated on the fly?
It sounds to me like this is not possible, but I thought I'd ask anyway.

If you can create some form of hash for the the images use ETags. Your server will have to check the If-None-Match request header against the hash and if they match you can return a 304 response.
Clients will still send Last-Modified but if your hashing method does not generate many collision you should be able to ignore it and just match the ETags.

Related

Caching strategy using ETag and Expires/Cache-control with no assets version/ID

After reading a lot about caching validators (more intensively after reading this answer on SO), I had a doubt that didn't find the answer anywhere.
My use-case is to serve a static asset (a javascript file, ie: https://example.com/myasset.js) to be used in other websites, so messing with their Gpagespeed/gmetrix score matters the most.
I also need their users to receive updated versions of my static asset every time I deploy new changes.
For this, I have the following response headers:
Cache-Control: max-age=10800
etag: W/"4efa5de1947fe4ce90cf10992fa"
In short, we can see the following flow in terms of how browser behaves using etag
For the first request, the browser has no value for the If-None-Match Request Header, so the Server will send back the status code 200 (Ok), the content itself, and a Response header with ETag value.
For the subsequent requests, the browser will add the previously received ETag value in a form of the If-None-Match Request Header. This way, the server can compare this value with the current value from ETag and, if both match, the server can return 304 (Not Modified) telling the browser to use the latest version of the file, or just 200 followed by the new content and the related ETag value instead.
However, I couldn't find any information in regards to using the Cache-Control: max-age header and how will this affect the above behavior, like:
Will the browser request for new updates before max-age has met? Meaning that I can define a higher max-age value (pagespeed/gmetrix will be happy about it) and force this refresh using only etag fingerprint.
If not, then what are the advantages of using etag and adding extra bits to the network?
No, the browser will not send any requests until max-age has passed.
The advantage of using ETag is that, if the file hasn't changed, you don't need to resend the entire file to the client. The response will be a small 304.
Note that you can achieve the best of both worlds by using the stale-while-revalidate directive, which allows stale responses to be served while the cache silently revalidates the resource in the background.

Checking if HTTP resource has changed after maximum cache time has expired

I'm trying to work out a new caching policy for the static resources on a website. A common problem is whenever javascript, CSS etc. is updated, many users hold onto stale versions because currently there are no caching specific HTTP headers included in the file responses.
This becomes a serious problem when, for example, the javascript updates are linked to server-side updates, and the stale javascript chokes on the new server responses.
Eliminating browser caching completely with a cache-control: max-age=0, no-cache seems like overkill, since I'd still like to take some pressure off the server by letting browsers cache temporarily. So, setting the cache policy to a maximum of one hour seems alright, like cache-control: max-age=3600, no-cache.
My understanding is that this will always fetch a new copy of the resource if the cached copy is older than one hour. I'd specifically like to know if it's possible to set a HTTP header or combination of headers that will instruct browsers to only fetch a new copy if the resource was last checked more than one hour ago AND if the resource has changed.
I'm just trying to avoid browsers blindly fetching new copies just because the cached resource is older than one hour, so I'd also like to add the condition that the resource has been changed.
Just to illustrate further what I'm asking:
New user arrives at site and gets fresh copy of script.js
User stays on site for 45 mins, browser uses cached copy of script.js all the time
User comes back to site 2 hours later, and browser asks the server if script.js has changed
If it has, then it gets a fresh copy and the process repeats
If it has not changed, then it uses the cached copy for the next hour, after which it will check again
Have I misunderstood things? Is what I'm asking how it actually works, or do I have to do something different?
Have I misunderstood things? Is what I'm asking how it actually works,
or do I have to do something different?
You have some serious misconceptions about what the various cache control directives do and why cache behaves as it does.
Eliminating browser caching completely with a cache-control:
max-age=0, no-cache seems like overkill, since I'd still like to take
some pressure off the server by letting browsers cache temporarily ...
The no-cache option is wrong too. Including it means the browser will always
check with the server for modifications to the file every time.
That isn't what the no-cache means or what it is intended for - it means that a client MUST NOT used a cached copy to satisfy a subsequent request without successful revalidation - it does not and has never meant "do not cache" - that is what the no-store directive is for
Also the max-age directive is just the primary means for caches to calculate the freshness lifetime and expiration time of cached entries. The Expires header (minus the value of the Date header can also be used) - as can a heuristic based on the current UTC time and any Last-Modified header value.
Really if your goal is to retain the cached copy of a resource for as long as it is meaningful - whilst minimising requests and responses you have a number of options.
The Etag (Entity Tag) header - this is supplied by the server in response to a request in either a "strong" or "weak" form. It is usually a hash based on the resource in question. When a client re-requests a resource it can pass the stored value of the Etag with the If-None-Match request header. If the resource has not changed then the server will respond with 304 Not Modified.
You can think Etags as fingerprints for resources. They can be used to massively reduce the amount of information sent over the wire - as only fresh data is served - but they do not have any bearing on the number of times or frequency of requests.
The last-modified header - this is supplied by the server in response to a request in HTTPdate format - it tells the client the last time the resource was modified.
When a client re-requests a resource it can pass the stored value of the last-modified header with the If-Modified-Since request header. If the resource has not changed since the time it was last modified then the server will respond with 304 Not Modified.
You can think of last modified as a weaker form of entity checking than Etags. It addresses the same problem (bandwidth/redundancy) it in a less robust way and again it has no bearing at all on the actual number of requests made.
Revving - a technique that use a combination of the Expires header and the name (URN) of a resource. (see stevesouders blog post)
Here one basically sets a far forward Expires header - say 5 years from now - to ensure the static resource is cached for a long time.
You then have have two options for updating - either by appending a versioning query string to the requests URL - e.g. "/mystyles.css?v=1.1" - and updating the version number as and when the resource changes. Or better - versioning the file name itself e.g. "/mystyles.v1.1.css" so that each version is cached for as long as possible.
This way not only do you reduce the amount of bandwidth - you will as eliminate all checks to see if the resource has changed until you rename it.
I suppose the main point here is none of the catch control directives you mention max-age, public, etc have any bearing at all on if a 304 response is generated or not. For that use either Etag / If-None-Match or last-modified / If-Modified-Since or a combination of them (with If-Modified-Since as a fallback mechanism to If-None-Match).
It seems that I have misunderstood how it works, because some testing in Chrome has revealed exactly the behavior that I was looking for in the 5 steps I mentioned.
It doesn't blindly grab a fresh copy from the server when the max-age has expired. It does a GET, and if the response is 304 (Not Modified), it continues using the cached copy until the next hour has expired, at which point it checks for changes again etc.
The no-cache option is wrong too. Including it means the browser will always check with the server for modifications to the file every time. So what I was really looking for is:
Cache-Control: public, max-age=3600

304 Not Modified with 200 (from cache)

I'm trying to understand what exactly is the difference between "status 304 not modified" and "200 (from cache)"
I'm getting 304 on javascript files that I changed last. Why does this happen?
Thanks for the assistance.
https://sookocheff.com/post/api/effective-caching/ is an excellent source to form the required understanding around these 2 HTTP status codes.
After reading this thoroughly, I had this understanding
In typical usage, when a URL is retrieved, the web server will return the resource's current representation along with its corresponding ETag value, which is placed in an HTTP response header "ETag" field. The client may then decide to cache the representation, along with its ETag. Later, if the client wants to retrieve the same URL resource again, it will first determine whether the local cached version of the URL has expired (through the Cache-Control and the Expire headers). If the URL has not expired, it will retrieve the local cached resource. If it determined that the URL has expired (is stale), then the client will contact the server and send its previously saved copy of the ETag along with the request in a "If-None-Match" field. (Source: https://en.wikipedia.org/wiki/HTTP_ETag)
But even when expires time for an asset is set in future, browser can still reach the server for a conditional GET using ETag as per the 'Vary' header. Details on 'vary' header: https://www.fastly.com/blog/best-practices-using-vary-header/
"200 (from cache)" means the browser found a cached response for the request made. So instead of making a network call to fetch that resource from the remote server, it simply made use of the cached response.
Now these cached responses have a life time associated with them. With time, the responses can become stale. And if these are not allowed to be served stale (see section 4.2.4 - RFC7234), the browser needs to contact the remote server and verify whether these cached responses are still valid or not. The 304 response status code is the servers' way of letting the browser know that the response hasn't changed and is still valid.
If the browser receives a 304 response, it can "freshen" the relevant stale response(s). And subsequent requests to the resource can again be served from the browser cache without checking with the remote server (until the response becomes stale again).
304 Modified
A 304 Not Modified means that the files have not been modified since the specified version by the "If-Modified-Since", or "If-None-Match".
200 OK
This is the response you'll get if an HTTP request works. GET requests will have something relating to the files. A POST request will have something holding the result of the action.
Happy coding!Lyfe

Renewing a HTTP ETag

I'm using WebDav to put metadata on the files and folders of a server, along with a cache to avoid unnecessary requests to the server, based on the ETag property of the files.
Basically, I send a HEAD request and check whether the ETag matches the one I have in local. If it doesn't, then I send a larger, slower PROPFIND method to retrieve the other properties.
I built this cache on the idea that the ETag was changed every time the file was modified, including when metadata were modified, added or deleted.
However, I've recently discovered that it is not the case :
Because clients may be forced to prompt users or throw away changed
content if the ETag changes, a WebDAV server should not change the
ETag (or the Last-Modified time) for a resource that has an unchanged
body and location. The ETag represents the state of the body or
contents of the resource. There is no similar way to tell if
properties have changed.
(RFC 4918, http://www.webdav.org/specs/rfc4918.html#etag, emphasis mine)
Since invalidating the cache whenever properties are changed is important to me, I was wondering: is there a way to manually instruct the web server to renew the ETag ?
There are a couple different options. If the etag is generated based on the content (a bad idea), then it's more difficult. In our solution, we generated a different tag (a ptag) that we updated when properties changed, and you could query it with a PROPFIND, and we returned it as an X-PTag header on the response. If the etag is generated randomly on a PUT, then you could PUT the same data again, and it would give you a new etag.

ETag vs Header Expires

I've looked around but haven't been able to figure out if I should use both an ETag and an Expires Header or one or the other.
What I'm trying to do is make sure that my flash files (and other images and what not only get updated when there is a change to those files.
I don't want to do anything special like changing the filename or putting some weird chars on the end of the url to make it not get cached.
Also, is there anything I need to do programatically on my end in my PHP scripts to support this or is it all Apache?
They are slightly different - the ETag does not have any information that the client can use to determine whether or not to make a request for that file again in the future. If ETag is all it has, it will always have to make a request. However, when the server reads the ETag from the client request, the server can then determine whether to send the file (HTTP 200) or tell the client to just use their local copy (HTTP 304). An ETag is basically just a checksum for a file that semantically changes when the content of the file changes.
The Expires header is used by the client (and proxies/caches) to determine whether or not it even needs to make a request to the server at all. The closer you are to the Expires date, the more likely it is the client (or proxy) will make an HTTP request for that file from the server.
So really what you want to do is use BOTH headers - set the Expires header to a reasonable value based on how often the content changes. Then configure ETags to be sent so that when clients DO send a request to the server, it can more easily determine whether or not to send the file back.
One last note about ETag - if you are using a load-balanced server setup with multiple machines running Apache you will probably want to turn off ETag generation. This is because inodes are used as part of the ETag hash algorithm which will be different between the servers. You can configure Apache to not use inodes as part of the calculation but then you'd want to make sure the timestamps on the files are exactly the same, to ensure the same ETag gets generated for all servers.
Etag and Last-modified headers are validators.
They help the browser and/or the cache (reverse proxy) to understand if a file/page, has changed, even if it preserves the same name.
Expires and Cache-control are giving refresh information.
This means that they inform, the browser and the reverse in-between proxies, up to what time or for how long, they may keep the page/file at their cache.
So the question usually is which one validator to use, etag or last-modified, and which refresh infomation header to use, expires or cache-control.
Expires and Cache-Control are "strong caching headers"
Last-Modified and ETag are "weak caching headers"
First the browser check Expires/Cache-Control to determine whether or not to make a request to the server
If have to make a request, it will send Last-Modified/ETag in the HTTP request. If the Etag value of the document matches that, the server will send a 304 code instead of 200, and no content. The browser will load the contents from its cache.
Another summary:
You need to use both. ETags are a "server side" information. Expires are a "Client side" caching.
Use ETags except if you have a load-balanced server. They are safe and will let clients know they should get new versions of your server files every time you change something on your side.
Expires must be used with caution, as if you set a expiration date far in the future but want to change one of the files immediatelly (a JS file for instance), some users may not get the modified version until a long time!
One additional thing I would like to mention that some of the answers may have missed is the downside to having both ETags and Expires/Cache-control in your headers.
Depending on your needs it may just add extra bytes in your headers which may increase packets which means more TCP overhead. Again, you should see if the overhead of having both things in your headers is necessary or will it just add extra weight in your requests which reduces performance.
You can read more about it on this excellent blog post by Kyle Simpson: http://calendar.perfplanet.com/2010/bloated-request-response-headers/
In my view, With Expire Header, server can tell the client when my data would be stale, while with Etag, server would check the etag value for client' each request.
ETag is used to determine whether a resource should use the copy one. and Expires Header like Cache-Control is told the client that before the cache decades, client should fetch the local resource.
In modern sites, There are often offer a file named hash, like app.98a3cf23.js, so that it's a good practice to use Expires Header. Besides this, it also reduce the cost of network.
Hope it helps ;)
Etag is a hash for indicating the version of the resource. When the server returns data, it hashes the data and set this hash value under ETAG. When you send a "PUT" request to the server to update a record, maybe simultaneously another user made the same "PUT" request and its request has been processed. The server will check your "PUT" data and will see that it is the same update so it wont make another update, it will send you the updated data (by another user) and you will update your cache.
when the time for caching expires, the browser automatically makes a new request to get the fresh data. That is why "Expires" header is used
If a response includes both an Expires header and a max-age directive,
the max-age directive overrides the Expires header, even if the
Expires header is more restrictive. This rule allows an origin server
to provide, for a given response, a longer expiration time to an
HTTP/1.1 (or later) cache than to an HTTP/1.0 cache. This might be
useful if certain HTTP/1.0 caches improperly calculate ages or
expiration times, perhaps due to desynchronized clocks.

Resources