I have an application, written in Go, which runs an HTTP server and uses http.ServeFile() to serve a file that is being updated 3 times per second; this is the index file for an HTTP Live Stream of audio, which I need to operate at near-zero latency, hence the frequent updates. I can see from the logging in my Go server application that this file really is being updated 3 times per second, and I call Sync() on the file after each update to make sure it is written to disk.
My problem is that, on the browser side (Chrome), while this file is being requested several times per second, it is only actually being served once per second; on all the other occasions the server returns 304, indicating that the file is unchanged.
What might be causing this behaviour and how could I make the file be served on each request?
As stated in the comments, it turns out that modification-time checking in HTTP has a minimum resolution of 1 second, so where a file needs to be changed and served more frequently than that, it's best to serve it oneself from RAM. For instance, store it in a slice called content and serve that slice with something like:
http.ServeContent(w, r, filepath.Base(r.URL.Path), time.Time{}, bytes.NewReader(content))
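Fleshed out into a minimal sketch (the single-updater assumption and the names updateContent and serveIndex are mine, not part of the original suggestion):

    package main

    import (
        "bytes"
        "net/http"
        "path/filepath"
        "sync"
        "time"
    )

    var (
        mu      sync.RWMutex
        content []byte // the playlist bytes, replaced ~3 times per second
    )

    // updateContent is called by the updater each time the playlist changes.
    // It replaces the slice rather than mutating it in place, so in-flight
    // readers holding the old slice header stay consistent.
    func updateContent(b []byte) {
        mu.Lock()
        content = append([]byte(nil), b...)
        mu.Unlock()
    }

    func serveIndex(w http.ResponseWriter, r *http.Request) {
        mu.RLock()
        snapshot := content
        mu.RUnlock()
        // A zero modtime disables Last-Modified/If-Modified-Since handling,
        // so the one-second resolution problem never arises.
        http.ServeContent(w, r, filepath.Base(r.URL.Path), time.Time{}, bytes.NewReader(snapshot))
    }

    func main() {
        updateContent([]byte("#EXTM3U\n"))
        http.HandleFunc("/", serveIndex)
        http.ListenAndServe(":8080", nil)
    }

With a zero modtime every response carries the full body; if you still want 304s at sub-second granularity, pair this with an ETag as described in the next answer.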
Modification time checking in HTTP only has resolution to the second. However, the alternative is to use entity-tags ('etags'), which can be updated as often as the server needs to change the content.
Therefore your use-case would work better via etags instead of modification times. An etag contains an opaque string that either does or doesn't match.
From https://www.rfc-editor.org/rfc/rfc7232#section-2.3:

    An entity-tag is an opaque validator for differentiating between
    multiple representations of the same resource, regardless of whether
    those multiple representations are due to resource state changes over
    time, content negotiation resulting in multiple representations being
    valid at the same time, or both.
Detection of change by the client is usually done using the If-None-Match header (https://www.rfc-editor.org/rfc/rfc7232#section-3.2):
    If-None-Match is primarily used in conditional GET requests to enable
    efficient updates of cached information with a minimum amount of
    transaction overhead.
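A hedged sketch of what that can look like in Go (the revision-counter ETag scheme and all names are illustrative; a production handler would also parse comma-separated If-None-Match lists as the RFC allows):

    package main

    import (
        "fmt"
        "net/http"
        "sync/atomic"
    )

    type snapshot struct {
        etag string
        body []byte
    }

    // current holds the body together with its ETag so the two can never
    // be observed out of sync.
    var current atomic.Value

    // update installs new content under a fresh ETag; it can be called as
    // often as the content changes, with no one-second floor.
    func update(body []byte, rev uint64) {
        current.Store(snapshot{etag: fmt.Sprintf(`"rev-%d"`, rev), body: body})
    }

    func serve(w http.ResponseWriter, r *http.Request) {
        s := current.Load().(snapshot)
        if r.Header.Get("If-None-Match") == s.etag {
            w.WriteHeader(http.StatusNotModified) // client's copy is current
            return
        }
        w.Header().Set("ETag", s.etag)
        w.Write(s.body)
    }

    func main() {
        update([]byte("#EXTM3U\n"), 1)
        http.HandleFunc("/index.m3u8", serve)
        http.ListenAndServe(":8080", nil)
    }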
Related
There is probably an answer within reach, but most of the search results are for "handling large file uploads", where the asker does not know what they're doing, or "handling many uploads", where the answer is consistently just an explanation of how to work with multipart requests and/or Flash uploader widgets.
I haven't had time to sift through Go's HTTP implementation yet, but when does the application first get a chance to see the incoming body? Not until it has been completely received?
If I were to [poorly] decide to use HTTP to transfer a large amount of data and posted a single request with several 10-gigabyte parts, would I have to wait for the whole thing to be received before processing it, or can I process it incrementally through the body's io.Reader?
This is only tangentially related, but I also haven't been able to get a clear answer about whether I can forcibly close the connection in the middle, and whether, even if I do close it, the server will just keep receiving data on the port.
Thanks so much.
An application's handler is called after the headers are parsed and before the request body is read. The handler can read the request body as soon as the handler is called. The server does not buffer the entire request body.
An application can read file uploads without buffering the entire request by getting a multipart reader and iterating through the parts.
An application can replace the request body with a MaxBytesReader to force close the connection after a specified limit is breached.
The above comments are about the net/http server included in the standard library. The comments may not apply to other servers.
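A sketch putting those three points together, for the standard net/http server (the /upload route and the 10 GiB cap are made up for illustration):

    package main

    import (
        "io"
        "net/http"
    )

    const maxUpload = 10 << 30 // hypothetical 10 GiB limit

    func upload(w http.ResponseWriter, r *http.Request) {
        // Reads past the limit fail, and the server is told to close the
        // connection instead of reading the rest of the body.
        r.Body = http.MaxBytesReader(w, r.Body, maxUpload)

        mr, err := r.MultipartReader() // streams parts without buffering the body
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        for {
            part, err := mr.NextPart()
            if err == io.EOF {
                break // no more parts
            }
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            // Each part is an io.Reader; bytes are processed as they arrive.
            // io.Discard stands in for whatever sink you actually use.
            if _, err := io.Copy(io.Discard, part); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
        }
    }

    func main() {
        http.HandleFunc("/upload", upload)
        http.ListenAndServe(":8080", nil)
    }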
While I haven't done this with GB-size files, my strategy for file processing (mostly stuff I read from and write to S3) is to use https://golang.org/pkg/os/exec/ with a command-line utility that handles chunking the way you like, then to read and process by tailing the file, as explained here: Reading log files as they're updated in Go.
In my situations, network utilities can download the data far faster than my code can process it, so it makes sense to send it to disk and pick it up as fast as I can; that way I'm not holding a connection open while I process.
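Roughly this shape, as a sketch (curl, the paths, and the naive polling tail are all placeholders for whatever you actually use):

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "time"
    )

    // process stands in for whatever work you do on each chunk.
    func process(b []byte) {
        fmt.Printf("processing %d bytes\n", len(b))
    }

    func main() {
        // Let a battle-tested utility handle the download and chunking.
        cmd := exec.Command("curl", "-sS", "-o", "/tmp/data.bin", "https://example.com/big")
        if err := cmd.Start(); err != nil {
            panic(err)
        }
        done := make(chan struct{})
        go func() { cmd.Wait(); close(done) }()

        // Wait for the file to appear, then tail it as it grows.
        var f *os.File
        for {
            var err error
            if f, err = os.Open("/tmp/data.bin"); err == nil {
                break
            }
            time.Sleep(100 * time.Millisecond)
        }
        defer f.Close()

        buf := make([]byte, 1<<20)
        for {
            if n, _ := f.Read(buf); n > 0 {
                process(buf[:n])
                continue
            }
            select {
            case <-done:
                // Downloader exited: drain whatever is left, then stop.
                for {
                    n, _ := f.Read(buf)
                    if n == 0 {
                        return
                    }
                    process(buf[:n])
                }
            default:
                time.Sleep(100 * time.Millisecond) // at EOF; wait for more data
            }
        }
    }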
https://www.rfc-editor.org/rfc/rfc7232#section-3
When would one need to check against multiple etags instead of just one?
E.g., what would be the purpose of sending an HTTP request with If-None-Match: "etag1","etag2"? Why would a client ever need both "etag1" and "etag2" in there? Shouldn't the client just use the last etag it received from the server for this particular resource?
Why would a client ever need both etag1 and etag2 in there?
If the client has both etag1 and etag2 in its cache then it is to its advantage to send both, since that expands the range of situations in which it can skip downloading new data.
Shouldn't the client be using just the last etag received from the server for this particular resource?
You're thinking of a resource that, when it changes, always changes to a new value, in which case there wouldn't be any point in caching the old ones. However, that's not necessarily the case. Imagine a resource that oscillates between two states (representations), in which case caching both (and sending both ETags) means that you never have to download the data again.
Another thing to consider is that content negotiation can yield different representations of the same resource at the same point in time. Imagine a client (an intermediate cache, say) that can handle both plain and gzipped versions of a resource, and already has one of each. Sending both ETags (in conjunction with the Accept-* headers) means that it is less likely to have to download new data.
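For concreteness, a client-side sketch of such a request in Go (the URL and tag values are made up):

    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        req, err := http.NewRequest("GET", "https://example.com/resource", nil)
        if err != nil {
            panic(err)
        }
        // Offer every representation we already hold in cache.
        req.Header.Set("If-None-Match", `"etag1", "etag2"`)

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        if resp.StatusCode == http.StatusNotModified {
            fmt.Println("cache hit: reuse the stored representation matching the returned ETag")
            return
        }
        fmt.Println("cache miss: read and store the new representation")
    }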
I'm trying to improve performance for clients fetching pages from my Compojure webserver. We serve up a bunch of static files (JS, CSS) using (compojure.route/resources "/"), which looks for files on the filesystem, converts them to URLs, and then serves them to Ring as streams. By converting to streams, it seems to lose all file metadata, such as the mod time.
I can wrap the static-resource handler and add an Expires or Cache-Control: max-age header, but that prevents the client from sending any request at all. Useful, but these files do change on occasion (when we put out a release).
Ideally I'd like the client to trust its own cached version for, say, an hour, and make a request with an If-Modified-Since header after that hour has passed. Then we can just return 304 Not Modified and the client avoids downloading a couple hundred kilos of javascript.
It looks like I can set a Last-Modified header when serving a response, and that causes the client to qualify subsequent requests with If-Modified-Since headers. Great, except I'd have to rewrite most of the code in compojure.route/resources in order to add Last-Modified - not difficult, but tedious - and invent some more code to recognize and respond to the If-Modified-Since header. Not a monumental task, but not a simple one either.
Does this already exist somewhere? I couldn't find it, but it seems like a common enough, and large enough, task that someone would have written a library for it by now.
FWIW, I got this to work by using Ring's wrap-file-info middleware; I'm sorta embarrassed that I looked for this in Compojure instead of Ring. However, compojure.route's files and resources handlers both serve up streams instead of Files or URLs, and of course Ring can't figure out metadata from that.
I had to write basically a copy of resources that returns a File instead; when wrapped in wrap-file-info that met my needs. Still wouldn't mind a slightly better solution that doesn't involve copying a chunk of code from Compojure.
Have you considered using the ring-etag-middleware? It uses the last-modified date of a file to generate the entity tag, and then keys a 304 on a match to the If-None-Match header in the request.
There seem to be two distinct ways to implement conditional requests using HTTP headers, both of which can be used for caching, range requests, concurrency control, etc.:
If-Unmodified-Since and If-Modified-Since, where the client sends a timestamp of the resource.
If-Match and If-None-Match, where the client sends an ETag representation of the resource.
In both cases, the client sends a piece of information it has about the resource, which allows the server to determine whether the resource has changed since the client last saw it. The server then decides whether to execute the request depending on the conditional header supplied by the client.
I don't understand why two separate approaches are available. Surely, ETags supersede timestamps, since the server could quite easily choose to generate ETags from timestamps.
So, my questions are:
In which scenarios might you favour If-Unmodified-Since/If-Modified-Since over ETags?
In which scenarios might you need both?
I once pondered the same thing and realized that there is one difference that is quite important: dates can be ordered, ETags cannot.
This means that if some resource was modified a year ago, but never since, and we know it, then we can correctly answer an If-Unmodified-Since request for any date within the last year and agree that, sure, it has been unmodified since that date.
An ETag is only comparable for identity: either it is the same or it is not. Suppose you have the same resource as above, and during the year the docroot was moved to a new disk and filesystem, giving all files new inodes but preserving modification dates, and someone had based the ETags on the file's inode number. Then we can't say that the old ETag is still okay without keeping a log of past-still-okay ETags.
So I don't see them as one obsoleting the other. They are for different situations. Either you can easily get a Last-Modified date of all the data in the page you're about to serve, or you can easily get an ETag for what you will serve.
If you have a dynamic webpage built from lots of db lookups, it might be difficult to tell what the Last-Modified date is without making your database track lots of modification dates. But you can always make an MD5 checksum of the resulting rendered page.
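For instance, a sketch of that checksum-as-ETag approach in Go (MD5 is incidental; any digest of the rendered bytes works):

    package main

    import (
        "crypto/md5"
        "fmt"
    )

    // etagOf turns a rendered page into an opaque validator: identical
    // output yields an identical tag, any change yields a new one.
    func etagOf(renderedPage []byte) string {
        return fmt.Sprintf(`"%x"`, md5.Sum(renderedPage))
    }

    func main() {
        page := []byte("<html>...assembled from many db lookups...</html>")
        fmt.Println(etagOf(page)) // send this as the ETag response header
    }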
When supporting these cache protocols I definitely go for only one of them, never both.
There is one rather big difference: I can only use ETags if I have already asked the server for one in the past. Timestamps, OTOH, I can make up as I go along.
Simple reason: backward-compatibility.
I've been thinking about batch reads and writes in a RESTful environment, and I think I've come to the realization that I have broader questions about HTTP caching. (Below I use commas (",") to delimit multiple record IDs, but that detail is not particular to the discussion.)
I started with this problem:
1. Single GET invalidated by batch update
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farms/123,234,345 # update info on Old MacDonald's Farm and some others
GET /farms/123
How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?
Then I realized this was also a problem:
2. Batch GET invalidated by single (or batch) update
GET /farms/123,234,345 # get info about a few farms
PUT /farms/123 # update Old MacDonald's Farm
GET /farms/123,234,345
How does the cache know to invalidate the multiple-farm GET when it sees the PUT go by?
So I figured that the problem was really just with batch operations. Then I realized that any relationship could cause a similar problem. Let's say a farm has zero or one owners, and an owner can have zero or one farms.
3. Single GET invalidated by update to a related record
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farmers/987 # Old MacDonald sells his farm and buys another one
GET /farms/123
How does the cache know to invalidate the single GET when it sees the PUT go by?
Even if you change the models to be more RESTful, using relationship models, you get the same problem:
GET /farms/123 # get info about Old MacDonald's Farm
DELETE /farm_ownerships/456 # Old MacDonald sells his farm...
POST /farm_ownerships # and buys another one
GET /farms/123
In both versions of #3, the first GET should return something like (in JSON):
{
  "farm": {
    "id": 123,
    "name": "Shady Acres",
    "size": "60 acres",
    "farmer_id": 987
  }
}
And the second GET should return something like:
{
  "farm": {
    "id": 123,
    "name": "Shady Acres",
    "size": "60 acres",
    "farmer_id": null
  }
}
But it can't! Not even if you use ETags appropriately. You can't expect the caching server to inspect the contents for ETags -- the contents could be encrypted. And you can't expect the server to notify the caches that records should be invalidated -- caches don't register themselves with servers.
So are there headers I'm missing? Things that indicate a cache should do a HEAD before any GETs for certain resources? I suppose I could live with double-requests for every resource if I can tell the caches which resources are likely to be updated frequently.
And what about the problem of one cache receiving the PUT and knowing to invalidate its cache and another not seeing it?
Cache servers are supposed to invalidate the entity referred to by the URI on receipt of a PUT (but as you've noticed, this doesn't cover all cases).
Aside from this you could use cache control headers on your responses to limit or prevent caching, and try to process request headers that ask if the URI has been modified since last fetched.
This is still a really complicated issue, and in fact it is still being worked on (e.g. see http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p6-cache-05.txt).
Caching within proxies doesn't really apply if the content is encrypted (at least with SSL), so that shouldn't be an issue (still may be an issue on the client though).
The HTTP protocol supports a request header called If-Modified-Since, which basically allows the caching server to ask the web server whether the item has changed. The protocol also supports Cache-Control headers on HTTP server responses, which tell cache servers what to do with the content (such as never cache it, or assume it expires in 1 day, etc.).
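Sketched in Go for concreteness (the route and the one-day lifetime echo the examples above; real code would also handle HEAD requests, errors, and so on):

    package main

    import (
        "net/http"
        "time"
    )

    var lastModified = time.Now().UTC() // updated whenever the resource changes

    func farm(w http.ResponseWriter, r *http.Request) {
        // Tell caches they may reuse this response for one day.
        w.Header().Set("Cache-Control", "max-age=86400")
        w.Header().Set("Last-Modified", lastModified.Format(http.TimeFormat))

        // Honour conditional requests from caches and clients.
        if ims := r.Header.Get("If-Modified-Since"); ims != "" {
            // HTTP dates have one-second resolution, hence the Truncate.
            if t, err := http.ParseTime(ims); err == nil && !lastModified.Truncate(time.Second).After(t) {
                w.WriteHeader(http.StatusNotModified)
                return
            }
        }
        w.Write([]byte(`{"farm": {"id": 123}}`))
    }

    func main() {
        http.HandleFunc("/farms/123", farm)
        http.ListenAndServe(":8080", nil)
    }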
Also you mentioned encrypted responses. HTTP cache servers cannot cache SSL because to do so would require them to decrypt the pages as a "man in the middle." Doing so would be technically challenging (decrypt the page, store it, and re-encrypt it for the client) and would also violate the page security causing "invalid certificate" warnings on the client side. It is technically possible to have a cache server do it, but it causes more problems than it solves, and is a bad idea. I doubt any cache servers actually do this type of thing.
Unfortunately HTTP caching is based on exact URIs, and you can't achieve sensible behaviour in your case without forcing clients to do cache revalidation.
If you've had:
GET /farm/123
POST /farm_update/123
You could use the Content-Location header to specify that the second request modified the first one. AFAIK you can't do that with multiple URIs, and I haven't checked whether this works at all in popular clients.
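If a cache does honour it, the server side is a one-liner on the update handler; a hypothetical sketch:

    package main

    import "net/http"

    func updateFarm(w http.ResponseWriter, r *http.Request) {
        // ... apply the update ...

        // Point intermediaries at the resource this POST modified, so a
        // cache that understands Content-Location can invalidate /farm/123.
        w.Header().Set("Content-Location", "/farm/123")
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/farm_update/123", updateFarm)
        http.ListenAndServe(":8080", nil)
    }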
The solution is to make pages expire quickly and to handle If-Modified-Since or ETag with a 304 Not Modified status.
You can't cache dynamic content (without drawbacks), because... it's dynamic.
In re: SoapBox's answer:
I think If-Modified-Since is the two-stage GET I suggested at the end of my question. It seems like an OK solution where the content is large, i.e. where the cost of doubling the number of requests, and thus the overhead, is outweighed by the gains of not re-sending content. (That isn't true in my example of Farms, since each Farm's information is short.)
It is perfectly reasonable to build a system that sends encrypted content over an unencrypted (HTTP) channel. Imagine the scenario of a Service Oriented Architecture where updates are infrequent and GETs are (a) frequent, (b) need to be extremely fast, and (c) must be encrypted. You would build a server that requires a FROM header (or, equivalently, an API key in the request parameters), and sends back an asymmetrically-encrypted version of the content for the requester. Asymmetric encryption is slow, but if properly cached, beats the combined SSL handshake (asymmetric encryption) and symmetric content encryption. Adding a cache in front of this server would dramatically speed up GETs.
A caching server could reasonably cache HTTPS GETs for a short period of time. My bank might put a cache-control of about 5 minutes on my account home page and recent transactions. I'm not terribly likely to spend a long time on the site, so sessions won't be very long, and I'll probably end up hitting my account's main page several times while I'm looking for that check I recently sent off to SnorgTees.