How can I generate Modified HTTP headers with Compojure?

I'm trying to improve performance for clients fetching pages from my Compojure webserver. We serve up a bunch of static files (JS, CSS) using (compojure.route/resources "/"), which looks for files on the filesystem, converts them to URLs, and then serves them to Ring as streams. By converting to streams, it seems to lose all file metadata, such as the mod time.
I can wrap the static-resource handler and add an Expires or Cache-Control: max-age header, but that prevents the client from sending any request at all. Useful, but these files do change on occasion (when we put out a release).
Ideally I'd like the client to trust its own cached version for, say, an hour, and make a request with an If-Modified-Since header after that hour has passed. Then we can just return 304 Not Modified and the client avoids downloading a couple hundred kilobytes of JavaScript.
It looks like I can set a Last-Modified header when serving a response, and that causes the client to qualify subsequent requests with If-Modified-Since headers. Great, except I'd have to rewrite most of the code in compojure.route/resources in order to add Last-Modified - not difficult, but tedious - and invent some more code to recognize and respond to the If-Modified-Since header. Not a monumental task, but not a simple one either.
Does this already exist somewhere? I couldn't find it, but it seems like a common enough, and large enough, task that someone would have written a library for it by now.

FWIW, I got this to work by using Ring's wrap-file-info middleware; I'm sorta embarrassed that I looked for this in Compojure instead of Ring. However, compojure.route's files and resources handlers both serve up streams instead of Files or URLs, and of course Ring can't figure out metadata from that.
I had to write basically a copy of resources that returns a File instead; when wrapped in wrap-file-info that met my needs. Still wouldn't mind a slightly better solution that doesn't involve copying a chunk of code from Compojure.

Have you considered using ring-etag-middleware? It uses the last-modified date of a file to generate the entity tag, and then returns a 304 when that tag matches the If-None-Match header in the request.
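The idea behind that kind of middleware can be sketched in a few lines. The Python below only illustrates the technique; it is not the ring-etag-middleware code. The names `etag_for_file` and `respond` are made up for this example, deriving the tag from the file's modification time plus its size is an assumption, and the one-hour `max-age` simply mirrors the "trust the cache for an hour" wish in the question.

```python
import os


def etag_for_file(path):
    """Build a quoted ETag from the file's mtime (nanoseconds) and size."""
    st = os.stat(path)
    return '"{:x}-{:x}"'.format(st.st_mtime_ns, st.st_size)


def respond(path, request_headers):
    """Return (status, headers, body) for a GET of `path`.

    Answers 304 with an empty body when the client's If-None-Match
    header equals the file's current ETag, and 200 with the file
    contents otherwise. `request_headers` is a plain dict (assumed).
    """
    etag = etag_for_file(path)
    headers = {"ETag": etag, "Cache-Control": "max-age=3600"}
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""
    with open(path, "rb") as f:
        return 200, headers, f.read()
```

With headers like these, a client would reuse its copy for an hour and then revalidate with If-None-Match, so an unchanged file costs one small 304 round trip instead of a full download.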

Related

Golang HTTP and file caching

I have an application, written in Go, which runs an HTTP server and uses http.ServeFile() to serve a file which is being updated 3 times per second; this is an audio streaming index file for an HTTP Live Stream which I need to operate in near zero latency, hence the frequent updates. I can see from the logging in my Go server application that this file really is being updated 3 times per second and I call Sync() on the file each time it is updated to make sure that it is written to disk.
My problem is that, on the browser side (Chrome), while this file is being requested several times per second, it is only actually being served once a second; on all the other occasions the server is returning 304, indicating that the file is unchanged.
What might be causing this behaviour and how could I make the file be served on each request?
As stated in the comments, it turns out that the modification time checking in HTTP has a minimum resolution of 1 second and so, where a file needs to be changed and served more frequently than that, it's best to serve it oneself from RAM. For instance, store it in a slice called content and serve that slice with something like:
http.ServeContent(w, r, filepath.Base(r.URL.Path), time.Time{}, bytes.NewReader(content))
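To see why the one-second resolution causes the 304s, here is a small Python sketch (not part of the Go server). The timestamps are invented for illustration, and `http_date` is a hypothetical helper name. It formats two writes 333 ms apart the way a Last-Modified header would carry them.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def http_date(dt):
    """Format an aware UTC datetime as an HTTP-date (as used by Last-Modified)."""
    return format_datetime(dt, usegmt=True)


# Two hypothetical writes to the index file within the same second.
first_write = datetime(2024, 1, 1, 12, 0, 0, 100000, tzinfo=timezone.utc)
second_write = first_write + timedelta(milliseconds=333)

# The HTTP-date format has no sub-second field, so both writes produce the
# same header value and a date-based validator sees "not modified".
same_header = http_date(first_write) == http_date(second_write)
print(http_date(first_write))   # Mon, 01 Jan 2024 12:00:00 GMT
print(same_header)              # True
```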
Modification time checking in HTTP only has resolution to the second. However, the alternative is to use entity-tags ('etags'), which can be updated as often as the server needs to change the content.
Therefore your use-case would work better via etags instead of modification times. An etag contains an opaque string that either does or doesn't match.
From https://www.rfc-editor.org/rfc/rfc7232#section-2.3,
An entity-tag is an opaque validator for
differentiating between multiple representations of the same
resource, regardless of whether those multiple representations are
due to resource state changes over time, content negotiation
resulting in multiple representations being valid at the same time,
or both.
Detection of change by the client is usually done using the If-None-Match header (https://www.rfc-editor.org/rfc/rfc7232#section-3.2):
If-None-Match is primarily used in conditional GET requests to enable
efficient updates of cached information with a minimum amount of
transaction overhead.
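A minimal sketch of that approach, in Python rather than Go and under a few assumptions: the ETag is a truncated SHA-256 of the in-memory content, `content_etag` and `serve` are hypothetical names, and `Cache-Control: no-cache` is added so that the client revalidates on every request.

```python
import hashlib


def content_etag(content):
    """Quoted ETag derived from the bytes themselves, so any change alters it."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'


def serve(content, if_none_match=None):
    """Return (status, headers, body) for in-memory `content`.

    Answers 304 only when the client's ETag equals the current one,
    however recently the content was last changed.
    """
    etag = content_etag(content)
    headers = {"ETag": etag, "Cache-Control": "no-cache"}
    if if_none_match == etag:
        return 304, headers, b""
    return 200, headers, content
```

Because the tag depends only on the bytes, three updates per second yield three different tags, which a timestamp with one-second resolution cannot express.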

Why do the http precondition header fields support multiple etags?

https://www.rfc-editor.org/rfc/rfc7232#section-3
When would one need to check against multiple etags instead of just one?
E.g, what would be the purpose of sending a HTTP request with If-None-Match: "etag1","etag2"? Why would a client ever need both "etag1" and "etag2" in there? Shouldn't the client be using just the last etag received from the server for this particular resource?
Why would a client ever need both etag1 and etag2 in there?
If the client has both etag1 and etag2 in its cache then it is to its advantage to send both, since that expands the range of situations in which it can skip downloading new data.
Shouldn't the client be using just the last etag received from the server for this particular resource?
You're thinking of a resource that, when it changes, always changes to a new value, in which case there wouldn't be any point in caching the old ones. However, that's not necessarily the case. Imagine a resource that oscillates between two states (representations), in which case caching both (and sending both ETags) means that you never have to download the data again.
Another thing to consider is that content negotiation can produce different representations of the same resource at the same point in time, for example with a different Content-Type or Content-Encoding. Imagine a client (an intermediate cache, say) that can handle both plain-text and gzipped versions of a resource, and already has one of each. Sending both ETags (in conjunction with Accept or Accept-Encoding) means that it is less likely to have to download new data.
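On the server side, checking a list of tags is not much harder than checking one. The following Python is a rough sketch of an If-None-Match evaluation, assuming the weak comparison that RFC 7232 specifies for this header. The names `opaque` and `none_match` are hypothetical, and the regular expression is a simplification of the full entity-tag grammar.

```python
import re

# Matches entity-tags such as "abc" or W/"abc" (simplified grammar).
_ETAG = re.compile(r'(?:W/)?"[^"]*"')


def opaque(tag):
    """Drop the weak prefix so that tags are compared weakly."""
    return tag[2:] if tag.startswith("W/") else tag


def none_match(if_none_match, current_etag):
    """True when no listed tag matches the current representation.

    True means the server should send the full response; False means
    it can answer 304 Not Modified. `current_etag` is None when the
    resource has no current representation.
    """
    if current_etag is None:
        return True
    if if_none_match.strip() == "*":
        return False
    listed = {opaque(t) for t in _ETAG.findall(if_none_match)}
    return opaque(current_etag) not in listed
```

For the oscillating resource described above, a client holding both states would send `If-None-Match: "etag1", "etag2"` and get a 304 whichever state is current.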

Proper http method(s) for serving static content

I apologize if there's an answer elsewhere.
I'm building a simple server and am now working on static file responses. Should I refuse all http methods except GET when serving static content? By static content I am referring to files stored on the file system on the server.
My immediate hunch is to only allow GET, but I want to make sure before locking it down.
What http method(s) should resolve static files of the form:
http://somedomain.com/foo/bar/baz.css?
Not necessarily requested through the browser, obviously.
All HTTP requests have a specific purpose. If you don't plan to implement that purpose or feature, block it with 405 Method Not Allowed.
For example, do you want to allow others to update the files? You'll need PUT then. I'd recommend simply reading what the methods mean so you know what makes sense and what doesn't.
Intuitively I think you probably only need GET and HEAD. I think it's good to respond to OPTIONS with a correct response as well.
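As a rough sketch of that policy (in Python, with a made-up `handle_static` function and a tuple-based response shape that are assumptions for illustration only):

```python
ALLOWED_METHODS = ("GET", "HEAD", "OPTIONS")


def handle_static(method, body):
    """Return (status, headers, body) for a request to a static file.

    `body` stands in for the file contents already read from disk.
    """
    allow = ", ".join(ALLOWED_METHODS)
    if method not in ALLOWED_METHODS:
        # A 405 response carries an Allow header listing what is supported.
        return 405, {"Allow": allow}, b""
    if method == "OPTIONS":
        return 204, {"Allow": allow}, b""
    headers = {"Content-Length": str(len(body))}
    if method == "HEAD":
        # Same headers as GET, but no payload.
        return 200, headers, b""
    return 200, headers, body
```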

How to display the cached version first and check the etag/modified-since later?

With caching headers I can either make the client not check online for updates for a certain period of time, and/or check for etags every time. What I do not know is whether I can do both: use the offline version first, but meanwhile in the background, check for an update. If there is a new version, it would be used next time the page is opened.
For a page that is completely static except for when the user changes it themselves, this would be much more efficient than having to block on an ETag check every time.
One workaround I thought of is using Javascript: set headers to cache the page indefinitely and have some Javascript make a request with an If-Modified-Since or something, which could then dynamically change the page. The big issue with this is that it cannot invalidate the existing cache, so it would have to keep dynamically updating the page theoretically forever. I'd also prefer to keep it pure HTTP (or HTML, if there is some tag that can do this), but I cannot find any relevant hits online.
A related question mentions "the two rules of caching": never cache HTML and cache everything else forever. Just to be clear, I mean to cache the HTML. The whole purpose of the thing I am building is for it to be very fast on very slow connections (high latency, low throughput, like EDGE). Every roundtrip saved is a second or two shaved off of loading time.
Update: reading more caching resources, it seems the Vary: Cookie header might do the trick in my case. I would like to know if there is a more general solution though, and I didn't really dig into the vary-header yet so I don't know yet if that works.
Solution 1 (HTTP)
There is a cache control extension stale-while-revalidate which describes exactly what you want.
When present in an HTTP response, the stale-while-revalidate Cache-
Control extension indicates that caches MAY serve the response in
which it appears after it becomes stale, up to the indicated number
of seconds.
If a cached response is served stale due to the presence of this
extension, the cache SHOULD attempt to revalidate it while still
serving stale responses (i.e., without blocking).
cache-control: max-age=60,stale-while-revalidate=86400
When the browser first requests the page, it caches the result for 60 seconds. During that period, requests are answered from the cache without contacting the origin server. For the next 86400 seconds, content is served from the cache while being fetched from the origin server in the background. Only once both periods (60 s + 86400 s) have expired does the cache stop serving the cached content and wait for fresh data from the origin server.
This solution has only one drawback. I was not able to find any browser or intermediate cache which currently supports this cache control extension.
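The three phases can be written down as a small decision function. This Python is only a sketch of how a cache might interpret the header, not code from any real cache; `parse_seconds`, `cache_action` and the returned labels are hypothetical.

```python
import re


def parse_seconds(cache_control, directive):
    """Return the integer value of `directive` in a Cache-Control value, or 0."""
    pattern = r"(?:^|[\s,])" + re.escape(directive) + r"=(\d+)"
    match = re.search(pattern, cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else 0


def cache_action(cache_control, age):
    """Decide what to do with a cached response that is `age` seconds old."""
    max_age = parse_seconds(cache_control, "max-age")
    swr = parse_seconds(cache_control, "stale-while-revalidate")
    if age < max_age:
        return "serve-from-cache"            # still fresh, no request sent
    if age < max_age + swr:
        return "serve-stale-and-revalidate"  # answer now, refresh in background
    return "fetch-from-origin"               # too old, wait for the server
```

With `max-age=60,stale-while-revalidate=86400`, an age of 30 seconds gives `serve-from-cache`, 600 seconds gives `serve-stale-and-revalidate`, and anything past 86460 seconds gives `fetch-from-origin`.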
Solution 2 (Javascript)
Another solution is to use Service Workers, which can construct custom responses to requests. Combined with the Cache API, that is enough to provide the requested feature.
The problem is that this solution works only in browsers (not in intermediate caches or other HTTP services), and not all browsers support Service Workers and the Cache API.

What is the point of If-Unmodified-Since/If-Modified-Since? Aren't they superseded by ETags?

There seem to be two distinct ways to implement conditional requests using HTTP headers, both of which can be used for caching, range requests, concurrency control, etc.:
If-Unmodified-Since and If-Modified-Since, where the client sends a timestamp of the resource.
If-Match and If-None-Match, where the client sends an ETag representation of the resource.
In both cases, the client sends a piece of information it has about the resource, which allows the server to determine whether the resource has changed since the client last saw it. The server then decides whether to execute the request depending on the conditional header supplied by the client.
I don't understand why two separate approaches are available. Surely, ETags supersede timestamps, since the server could quite easily choose to generate ETags from timestamps.
So, my questions are:
In which scenarios might you favour If-Unmodified-Since/If-Modified-Since over ETags?
In which scenarios might you need both?
I once pondered the same thing, and realized that there is one difference that is quite important: Dates can be ordered, ETags can not.
This means that if some resource was modified a year ago but never since, and we know it, then we can correctly answer an If-Unmodified-Since request for arbitrary dates in the last year and agree that, sure, it has been unmodified since that date.
An ETag is only comparable for identity: either it is the same or it is not. Suppose you have the same resource as above, and during the year the docroot has been moved to a new disk and filesystem, giving all files new inodes but preserving modification dates, and someone had based the ETags on the file's inode number. Then we can't say that the old ETag is still okay without having a log of past still-okay ETags.
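The difference shows up directly in code. In this Python sketch (the function names and the example values in the usage note are invented), the date check is an ordering comparison that works for any date the client supplies, while the ETag check can only test equality.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def unmodified_since(header_value, last_modified):
    """True when the resource has not changed after the client's date.

    `header_value` is an HTTP-date ending in GMT, which parses to an
    aware datetime; `last_modified` must be an aware datetime too.
    """
    return last_modified <= parsedate_to_datetime(header_value)


def etag_matches(client_etag, current_etag):
    """ETags carry no order, so the only possible test is equality."""
    return client_etag == current_etag
```

If the resource was last modified on 1 June 2023, `unmodified_since` is true for any later date the client picks, whereas `etag_matches('"inode-1"', '"inode-2"')` fails after the move to a new filesystem even though the content is the same.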
So I don't see them as one obsoleting the other. They are for different situations. Either you can easily get a Last-Modified date of all the data in the page you're about to serve, or you can easily get an ETag for what you will serve.
If you have a dynamic webpage with data from lots of db lookups it might be difficult to tell what the Last-Modified date is without making your database contain lots of modification dates. But you can always make an MD5 checksum of the rendered page.
When supporting these cache protocols I definitely go for only one of them, never both.
There is one rather big difference: I can only use ETags if I have already asked the server for one in the past. Timestamps, OTOH, I can make up as I go along.
Simple reason: backward-compatibility.

Resources