Cache Busting - Query String vs URL Path

So to be sure a stale asset isn't served up, people often use something like:
example.com/css/styles.css?v=1
or
example.com/css/styles-v1.css
Similar tactics are used by libraries like jQuery to request JSONP resources (using the query-string approach). Likewise, analytics services use tracking pixels with cache-busting in the URL.
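For illustration, here is a minimal sketch (Python; the helper names are mine, not from any library) of deriving both styles of cache-busting URL from a hash of the file contents, so the URL changes exactly when the asset does:

    import hashlib
    from pathlib import Path

    def content_hash(path: str, length: int = 8) -> str:
        # Short hash of the file contents; changes whenever the file does.
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        return digest[:length]

    def query_url(path: str) -> str:
        # Query-string style: /css/styles.css?v=3ba1f2c0
        return f"/{path}?v={content_hash(path)}"

    def path_url(path: str) -> str:
        # URL-path style: /css/styles-3ba1f2c0.css
        p = Path(path)
        return f"/{p.parent}/{p.stem}-{content_hash(path)}{p.suffix}"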
My question is, does anyone have any real data on what percentage of caching proxies (or other mechanisms) might ignore the query-string, making the URL-path option preferable?
I've heard of mobile internet providers and corporate environments having severe caching rules, but I haven't seen any real data.

No data, but any proxy stripping the query string would by definition be non-compliant. Having done a ton of work with this stuff, I would definitely say:
A bug might exist in some implementation.
It's probably uncommon enough for you not to care.

Related

Are "forever" and "never" the only useful caching durations?

In the presentation "Cache is King" by Steve Souders (at around 14:30), it is implied that there are in practice only two caching durations that you should use for your resources: "forever" and "never" (my own terminology).
"Forever" means that you effectively make the resource permanently immutable by setting a very high max age, such as one year. If you want to modify the resource at some point, the presentation suggests, you simply publish the modified resource at a different URL. (It is suggested that this renaming is necessary, in part or entirely, because of the large number of misconfigured proxies on the Internet.)
"Never" means that you effectively disable all forms of caching and require browsers to download the resource every time it is requested.
On the one hand, any performance advice given by the head performance engineer at Google carries weight on its own. On the other hand, HTTP caching was presumably designed with variable cache durations for a reason (not just "forever" and "never"), and changing the URL to a resource only because the resource has been modified seems to go against the spirit of HTTP.
Are "forever" and "never" the only cache durations that you should use in practice? Is this in conflict with other best practices on the web?
In addition to the typical "user with a browser" use case, I would also like to know how these principles apply to REST/hypermedia APIs.
Many people would disagree with limiting yourself to "forever" or "never" as you describe it.
For one thing, it ignores the option of caching with revalidation. In this case, if the client (or proxy) has cached the resource, it sends a conditional HTTP request. If the cached copy is still the latest version of the resource, the server sends a short 304 Not Modified response rather than the entire resource. If the cached copy is out of date, the server sends the entire resource.
With this scheme, the client will always get an up-to-date version of the resource, and if the resource doesn't change often, bandwidth will be saved.
To save even more bandwidth, the client can be instructed to revalidate only when the resource is older than a certain period of time.
And if bad proxies are a problem, the server can specify that only clients and not proxies may cache the resource.
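As a concrete sketch of the revalidation option (Python standard library; the URL, ETag, and cached body are illustrative), a conditional GET from the client side looks like this:

    import urllib.request
    from urllib.error import HTTPError

    url = "https://example.com/resource"
    cached_etag = '"abc123"'             # ETag saved from a previous response
    cached_body = b"...previous copy..."

    req = urllib.request.Request(url, headers={"If-None-Match": cached_etag})
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()                             # 200: new version
            cached_etag = resp.headers.get("ETag", cached_etag)
    except HTTPError as err:
        if err.code == 304:
            body = cached_body       # 304 Not Modified: reuse the cached copy
        else:
            raise

The other refinements map to Cache-Control directives: max-age=N defers revalidation until the cached copy is stale, and private keeps shared proxies from caching at all.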
I found this document pretty concisely describes your options for caching. This page is longer but also gives some excellent information.
"It depends" really, on your use case, what you are trying to achieve, and your branding proposition.
If all you want to achieve is some bandwidth saving, you could do a total cost breakdown. Serving cost might not amount to much. Browsers are anyway pretty smart at optimizing image hits, for example, so understand your HTTP protocol. Forever, combined with versioned resource url, and url rewrite rules might be a good fit, like your Google engineer suggested.
Resource volatility is another. If you are only serving daily stock charts for example, it could safely be cached for some time but not forever.
Are your computation costs heavy? Are your users sensitive to timeliness? Is data live or fixed? For example, you might be serving airline routes, path of a hurricane, option greeks or a BI report to COO. You might want to have it cached, but the TTL will likely vary by user class, all the way down to never. Forever cannot work for live data but never might be a wrong answer too.
Degree of cooperation between the server and the client may be another factor. For example in a business operations environment where procedures can be distributed and expected to be followed, it might be worthwhile to again look at TTLs.
HTH. I doubt if there is a magical answer.
Ideally, you should cache until the content changes; if you cannot clear/refresh the cache when the content changes for any reason, you need a duration. But indeed, if you can, cache forever or do not cache at all. There is no need to refresh if you already know nothing has changed.
If you know that the underlying data will be static for any length of time, caching makes sense. We have a web service that exposes data from a database that is populated by a nightly ETL job from an external source. Our RESTful web service only goes back to the database when the data changes. In our case, we know exactly when the data changes, so we invalidate the cache right after the ETL process finishes.
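One way to express that in headers (a hedged sketch, not necessarily what this service does; the times are illustrative) is to set max-age so every cached copy expires just after the next nightly run:

    from datetime import datetime, time, timedelta

    ETL_FINISHES_AT = time(hour=3)   # assume the nightly job is done by 03:00

    def seconds_until_next_etl(now: datetime) -> int:
        next_run = datetime.combine(now.date(), ETL_FINISHES_AT)
        if now >= next_run:
            next_run += timedelta(days=1)
        return int((next_run - now).total_seconds())

    cache_control = f"public, max-age={seconds_until_next_etl(datetime.now())}"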

Doesn't including tokens like @me in a REST URI break HTTP caching?

OpenSocial and some of the newer Google APIs include tokens such as "@me" or "@self", whose values are replaced by the API server with values based on the currently authenticated user. For example, "/api/people/@me/@all" is an OpenSocial REST URL.
Doesn't this break with the goal of REST APIs to support native HTTP cache servers (like Squid)?
Even if you could get around the issue using the "Vary" header, it seems like a major drawback, and the only real benefit is to allow developers to hard-code some URIs into their apps. Does anyone know why it was designed this way?
Yes, it will make the use of public caches difficult. Personally I think it is a really bad idea, and it does seem to be driven by the desire to make it easier for clients to construct URIs. I sometimes wonder if the extensive use of caching servers like memcached is causing developers to forget about the benefits of HTTP caching.
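For what it's worth, the Vary workaround mentioned in the question amounts to something like this (a sketch; the values are illustrative): any response for a per-user URI such as /api/people/@me must be keyed on the credentials that resolved "@me", or kept out of shared caches entirely.

    # Response headers for a per-user "@me" resource (illustrative values)
    response_headers = {
        # Key cached copies on the credentials that resolved "@me"...
        "Vary": "Authorization",
        # ...or keep the response out of shared caches entirely:
        "Cache-Control": "private, max-age=60",
    }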

Why aren't the rest of the HTTP verbs used?

Most of the time, websites only use GET and POST for all operations, yet there are seven more verbs out there. Were they used in older times but not so much now?
Or maybe it's because some browsers don't recognize the other verbs? And if that's the case, why do browser vendors choose to implement only half of the protocol?
[Update]
I found this article, which gives a good summary of the situation: Why REST failed.
The HTML spec is a big culprit, since it only really allows GET, POST, and HEAD. The other verbs do get used quite a bit, though, just not as much directly in browsers.
The most common uses of the other crud-verbs such as PUT and DELETE are in REST services and WebDAV.
You'll see OPTIONS more in the future, as it's used by the CORS specification (cross domain xmlhttprequest).
TRACE is pretty much disabled everywhere, as it poses a pretty big security risk. CONNECT is definitely used quite a bit by proxies.
PATCH is brand new. While it's odd to me that they decided to add it to the list (but not PROPFIND, MKCOL, ACL, LOCK, and so on), I do think we'll see it appear more in the future in RESTful services.
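Outside of HTML forms, nothing special is needed to send these verbs; they are just method strings on the request line. A minimal sketch with Python's standard library (the host, path, and payload are illustrative):

    import http.client
    import json

    conn = http.client.HTTPSConnection("api.example.com")
    conn.request(
        "PATCH",                     # or OPTIONS, PUT, DELETE, HEAD, ...
        "/articles/42",
        body=json.dumps({"title": "New title"}),
        headers={"Content-Type": "application/json"},
    )
    resp = conn.getresponse()
    print(resp.status, resp.reason)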
Addendum: The original browser used both GET and PUT (the latter for updating web pages). Later browsers pretty much became read-only until forms and the POST request made their way into the specifications.
Most of them are still used, though not as widely as GET or POST. For example, RESTful web services use PUT and DELETE as well as GET and POST:
RESTful Web Service - Wiki Article
HEAD is very useful for debugging a server's HTTP headers, but as it doesn't return the response body, it's not much use to the browser / average web visitor...
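For example, a quick header check with HEAD might look like this (Python standard library; the host is illustrative):

    import http.client

    conn = http.client.HTTPSConnection("example.com")
    conn.request("HEAD", "/")        # headers only, no response body
    resp = conn.getresponse()
    for name, value in resp.getheaders():
        print(f"{name}: {value}")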
Other verbs like TRACE aren't as widespread, because of potential security concerns etc. Mentioned briefly in the Wiki article:
HTTP Protocol Methods - Wiki Article
A decade later, these other verbs are used very commonly in RESTful APIs, which back nearly all of today's ubiquitous SPA applications and many mobile applications.
That said, interest in REST as an API style is beginning to wane with the advent of GraphQL and growing interest in functional programming styles that benefit from RPC-style APIs.

Is there any reason not to use HTTP PUT and DELETE in a web application?

Looking around, I can't name a single web application (not web service) that uses anything besides GET and POST requests. Is there a specific reason for this? Do some browsers (or servers) not support any other types of requests? Or is this only for historical reasons? I'd like to make use of PUT and DELETE requests to make my life a little easier on the server-side, but I'm reluctant to because no one else does.
Actually a fair amount of people use PUT and DELETE, mostly for non-browser APIs. Some examples are the Atom Publishing Protocol and the Google Data APIs:
http://www.ietf.org/rfc/rfc5023.txt
http://code.google.com/apis/gdata/docs/2.0/basics.html
Beyond that, you don't see PUT/DELETE in common usage because most browsers don't support PUT and DELETE through Forms. HTML5 seems to be fixing this:
http://www.w3.org/TR/html5/forms.html#form-submission-0
The way it works for browser applications is: people design RESTful applications with PUT and DELETE in mind, then "tunnel" those requests through POSTs from the browser. For example, see this SO question on how Ruby on Rails accomplishes this using hidden fields:
How can I emulate PUT/DELETE for Rails and GWT?
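The same trick can be sketched framework-independently, e.g. as WSGI middleware that promotes a browser POST based on Rails' hidden _method field (the field name and allowed methods below are just the common convention, not any particular library's API):

    import io
    from urllib.parse import parse_qs

    def method_override(app):
        # Promote a POST to PUT/DELETE/PATCH based on a hidden "_method"
        # form field, before the wrapped app sees the request.
        def wrapper(environ, start_response):
            if environ.get("REQUEST_METHOD") == "POST":
                size = int(environ.get("CONTENT_LENGTH") or 0)
                body = environ["wsgi.input"].read(size)
                fields = parse_qs(body.decode("utf-8", errors="replace"))
                override = fields.get("_method", [""])[0].upper()
                if override in ("PUT", "DELETE", "PATCH"):
                    environ["REQUEST_METHOD"] = override
                environ["wsgi.input"] = io.BytesIO(body)   # replay the body
            return app(environ, start_response)
        return wrapper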
So, you wouldn't be on your own designing your application with the larger set of HTTP verbs in mind.
EDIT: By the way, if you're curious about why PUT/DELETE are missing from browser based form posts, it turns out there's no real good technical reason. Reading around this thread on the rest-discuss mailing list, especially Roy Fielding's comments, is interesting for some context:
http://tech.groups.yahoo.com/group/rest-discuss/message/9620?threaded=1&var=1&l=1&p=13
EDIT: There are some comments on whether AJAX libraries support all the methods. It comes down to the actual browser implementation of XMLHttpRequest. I thought someone might find this link handy, which tests your browser to see how compliant its XMLHttpRequest object is with various HTTP options.
http://www.mnot.net/javascript/xmlhttprequest/
Unfortunately, I don't know of a reference which collects these results.
Quite simply, the HTML 4.01 form element only allows the values "POST" and "GET" in its method attribute.
Some proxy servers with tough security policies might drop them. I'm using PUT and DELETE anyway.
I've read that some browsers do not support other HTTP methods properly, though I can't name any specifics.
Rails, in particular, will pack your forms with a method parameter to explicitly set this even if the browser doesn't support those methods. That seems like a reasonable precaution if you're going to do this.
I say use all the features of HTTP, browsers be damned, lol. Maybe it'll inspire more complete and proper use of the HTTP protocol moving forward. There's more happening on the net than just POSTs and GETs. About time browser implementations reflected this.
This depends on your browser and Ajax library. For example jQuery supports all HTTP methods even though the browser may not. See for example the jQuery "ajax" documentation on the "type" attribute.
The Restlet Java framework lets you tunnel PUT and DELETE requests through HTML POST operations. To do this, you just add method=put or method=delete to your URI's query string, e.g.:
http://www.example.com/user=xyz?method=delete ...
This is the same as Ruby on Rails' approach (as described by @ars above).
Personally, I really don't see any purpose for using PUT or DELETE in a web application. All operations that an application performs are reads or writes, i.e. input/output. Why do you need to distinguish the nature of the operation in the method of the HTTP request?
I could make Ajax calls to the same URL of the form /object/object_id
and perform multiple operations on it: delete, update, get the value, or create.
Just by looking at the URL, I would have no clue which one it is.
By using GET and POST only, my urls will be:
/object/id/delete
/object/id/create
/object/id/update
/object/id --> implied GET
etc.
Based on my limited experience, this is in many cases a lot cleaner than hiding the request type in the headers.
I am not saying one should never use PUT or DELETE; just use them only if absolutely needed.
Refer to "RESTful Web APIs" by Leonard Richardson to read more about different use cases and conventions regarding HTTP request methods in a RESTful web API.

HTTP Verbs and Content Negotiation or GET Strings for REST service?

I am designing a REST service and am trying to weigh the pros and cons of using the full array of HTTP verbs and content negotiation versus GET string variables. Does my choice affect cacheability? Neither solution may be right for every area.
Which is best for CRUD and queries (e.g. ?action=PUT)?
Which is best for API version picking (e.g. ?version=1.0)?
Which is best for the return data type (e.g. ?type=json)?
CRUD and queries are best represented with HTTP verbs. A create or update is usually a PUT or POST, a retrieve is a GET, and a delete is a DELETE. That's the general mapping. The main point is that a GET doesn't cause side effects, and that the verbs do what you'd expect them to do.
Putting the action in the URI is OK if that's the only way to pass it (e.g., the HTTP client library doesn't allow you to send non-GET/POST requests). Most libraries do, though, so it's strongly advised not to pass the verb via the URL.
The "best" way to version the API would be using HTTP headers on a per-request basis; this lets clients upgrade/downgrade specific requests instead of every single one. Of course, that granularity of versioning needs to be baked in at the start and could severely complicate the server-side code. Most people just use the URL used the access the servers. A longer explanation is in a blog post by Peter Williams, "Versioning Rest Web Services"
There is no best return data type; it depends on your app. JSON might be easier for Ajax websites, whereas XML might be easier for complicated structures you want to query with XPath. Protocol Buffers are a third option. It's also debated whether the return type is best specified in the URL or in the HTTP headers.
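A sketch of the header-based variant (Python standard library; the URL is illustrative): the same URI serves both representations, with the Accept header selecting one. The server would pair this with a Vary: Accept response header so caches keep the two apart.

    import urllib.request

    def fetch(url: str, media_type: str) -> bytes:
        req = urllib.request.Request(url, headers={"Accept": media_type})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    report_json = fetch("https://api.example.com/report/7", "application/json")
    report_xml = fetch("https://api.example.com/report/7", "application/xml")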
For the most part, headers will have the largest effect on caching, since proxies are supposed to respect them when told, as are user agents (obviously UAs behave differently, though). Caching based on the URL alone is very dependent on the layers involved. Some user agents don't cache anything with a query string (Safari, IIRC), and proxies are free to cache or not as they see appropriate.
