Are there any CDNs that give you full control over HTTP headers?

Are there any CDNs (Content Delivery Networks) that provide control and/or customization of all or most HTTP headers?
Specifically, I'm interested in controlling the Expires, ETag, and Cache-Control headers, although other headers interest me as well.
I understand that part of the value proposition of CDNs is that they "just work" and set these headers to somewhat optimal values (for most use cases), but I am definitely interested in controlling these headers myself.
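For concreteness, here is a minimal sketch (using Python's standard http.server, nothing CDN-specific) of the kind of header control being asked about: the origin sets Cache-Control, Expires, and ETag itself, on the assumption that the CDN is configured to respect and pass through origin headers rather than rewrite them.

```python
import hashlib
import time
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"body { color: #333; }"

class Origin(BaseHTTPRequestHandler):
    def do_GET(self):
        etag = '"%s"' % hashlib.sha1(BODY).hexdigest()
        if self.headers.get("If-None-Match") == etag:
            self.send_response(304)  # the cached copy (browser or CDN) is still valid
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Cache-Control", "public, max-age=86400")               # cache for 1 day
        self.send_header("Expires", formatdate(time.time() + 86400, usegmt=True))
        self.send_header("ETag", etag)
        self.send_header("Content-Type", "text/css")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Origin).serve_forever()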

Akamai has a full interface for allowing this type of control on a per-property, per-header basis. It is a standard XML-based config file. You can set each header to a specific value, respect the headers passed through, add a header if it is not present, have exceptions based on User-Agent, etc.
Essentially, within reason, it is completely configurable. I have found that setting defaults when a header is absent, while allowing applications/admins to set their own values, is generally the best approach, but it really does depend on the quality and understanding of the developer/admin.
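This is not Akamai's config format, just a sketch of that "default when absent, respect when present" policy expressed as a plain Python function that an edge or proxy layer might apply; the default values are hypothetical.

```python
DEFAULTS = {
    "Cache-Control": "public, max-age=3600",  # hypothetical default TTL
    "Expires": None,                          # None = do not add if the origin omitted it
}

def apply_header_defaults(origin_headers: dict) -> dict:
    """Add default headers only where the origin did not set a value."""
    headers = dict(origin_headers)
    for name, default in DEFAULTS.items():
        if default is not None and name not in headers:
            headers[name] = default  # origin-supplied values are always respected
    return headers

# e.g. apply_header_defaults({"Content-Type": "text/html"})
# adds a default Cache-Control but leaves any origin-supplied value untouched.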
Like most CDN providers, Akamai has some default behaviors baked in, but the values are completely configurable. It has been a couple of years since I actively managed a CDN, but at the time Limelight was working on being feature-compatible with Akamai and was most of the way there, so I would expect that they have similar functionality now.
In general, most CDN vendors will strive for feature compatibility with the big player in the market, and for CDNs that is definitely Akamai.

Related

Are "forever" and "never" the only useful caching durations?

In the presentation "Cache is King" by Steve Souders (at around 14:30), it is implied that there are in practice only two caching durations that you should use for your resources: "forever" and "never" (my own terminology).
"Forever" means that you effectively make the resource permanently immutable by setting a very high max age, such as one year. If you want to modify the resource at some point, the presentation suggests, you simply publish the modified resource at a different URL. (It is suggested that this renaming is necessary, in part or entirely, because of the large number of misconfigured proxies on the Internet.)
"Never" means that you effectively disable all forms of caching and require browsers to download the resource every time it is requested.
On the one hand, any performance advice given by the head performance engineer at Google carries weight on its own. On the other hand, HTTP caching was presumably designed with variable cache durations for a reason (not just "forever" and "never"), and changing the URL to a resource only because the resource has been modified seems to go against the spirit of HTTP.
Are "forever" and "never" the only cache durations that you should use in practice? Is this in conflict with other best practices on the web?
In addition to the typical "user with a browser" use case, I would also like to know how these principles apply to REST/hypermedia APIs.
Many people would disagree with limiting yourself to "forever" or "never" as you describe it.
For one thing, it ignores the option of allowing caching while always revalidating. In this case, if the client (or proxy) has cached the resource, it sends a conditional HTTP request. If the client/proxy has cached the latest version of the resource, then the server sends a short 304 response rather than the entire resource. If the client's (or proxy's) copy is out of date, then the server sends the entire resource.
With this scheme, the client will always get an up-to-date version of the resource, and if the resource doesn't change much, bandwidth will be saved.
To save even more bandwidth, the client can be instructed to revalidate only when the resource is older than a certain period of time.
And if bad proxies are a problem, the server can specify that only clients and not proxies may cache the resource.
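Here is a sketch of the "cache but always revalidate" flow described above, using the third-party requests library (assumed available) against a hypothetical URL.

```python
import requests

url = "https://example.com/api/resource"  # hypothetical endpoint

first = requests.get(url)
etag = first.headers.get("ETag")

# Later: revalidate the cached copy with a conditional request.
second = requests.get(url, headers={"If-None-Match": etag} if etag else {})
if second.status_code == 304:
    body = first.content   # cached copy is still current; reuse it
else:
    body = second.content  # resource changed; use the fresh copy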
I found that this document describes your options for caching pretty concisely. This page is longer but also gives some excellent information.
"It depends" really, on your use case, what you are trying to achieve, and your branding proposition.
If all you want to achieve is some bandwidth saving, you could do a total cost breakdown. Serving cost might not amount to much. Browsers are anyway pretty smart at optimizing image hits, for example, so understand your HTTP protocol. Forever, combined with versioned resource url, and url rewrite rules might be a good fit, like your Google engineer suggested.
Resource volatility is another. If you are only serving daily stock charts for example, it could safely be cached for some time but not forever.
Are your computation costs heavy? Are your users sensitive to timeliness? Is data live or fixed? For example, you might be serving airline routes, path of a hurricane, option greeks or a BI report to COO. You might want to have it cached, but the TTL will likely vary by user class, all the way down to never. Forever cannot work for live data but never might be a wrong answer too.
Degree of cooperation between the server and the client may be another factor. For example in a business operations environment where procedures can be distributed and expected to be followed, it might be worthwhile to again look at TTLs.
HTH. I doubt if there is a magical answer.
Ideally, you should cache until the content changes; if you cannot clear/refresh the cache when the content changes, for whatever reason, then you need a duration. But indeed, if you can, cache forever or do not cache at all. There is no need to refresh if you already know nothing has changed.
If you know that the underlying data will be static for any length of time, caching makes sense. We have a web service that exposes data from a database that is populated by a nightly ETL job from an external source. Our RESTful web service only goes back to the database when the data changes. In our case, we know exactly when the data changes, and we invalidate the cache right after the ETL process finishes.
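One way to express that kind of ETL-driven lifetime, sketched below under assumed names (etl_finished_at and ETL_HOUR_UTC are hypothetical): derive the ETag from the last ETL batch and cap max-age at the next expected run.

```python
from datetime import datetime, timedelta, timezone

ETL_HOUR_UTC = 2  # assume the nightly ETL finishes around 02:00 UTC

def cache_headers(etl_finished_at: datetime) -> dict:
    """Headers that stay fresh until the next expected ETL run."""
    now = datetime.now(timezone.utc)
    next_run = now.replace(hour=ETL_HOUR_UTC, minute=0, second=0, microsecond=0)
    if next_run <= now:
        next_run += timedelta(days=1)
    max_age = int((next_run - now).total_seconds())
    return {
        "ETag": '"%s"' % etl_finished_at.strftime("%Y%m%d%H%M%S"),  # changes per batch
        "Cache-Control": f"public, max-age={max_age}",
    }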

Cache Busting - Query String vs URL Path

So to be sure a stale asset isn't served up, people often use something like:
example.com/css/styles.css?v=1
or
example.com/css/styles-v1.css
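(For illustration, here is a sketch of generating both forms from a content hash, so the "version" changes automatically whenever the file changes; the file path is hypothetical.)

```python
import hashlib
from pathlib import Path

def busted_urls(path: str) -> tuple[str, str]:
    """Return (query-string form, URL-path form) for a static asset."""
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()[:8]
    stem, ext = path.rsplit(".", 1)
    query_form = f"/{path}?v={digest}"     # e.g. /css/styles.css?v=1a2b3c4d
    path_form = f"/{stem}-{digest}.{ext}"  # e.g. /css/styles-1a2b3c4d.css
    return query_form, path_form

# e.g. busted_urls("css/styles.css")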
Similar tactics are used by libraries like jQuery to request JSONP resources (using the query-string approach). Likewise, analytics services use tracking pixels with cache-busting in the URL.
My question is, does anyone have any real data on what percentage of caching proxies (or other mechanisms) might ignore the query-string, making the URL-path option preferable?
I've heard of mobile internet providers and corporate environments having severe caching rules but I haven't seen any real data.
No data, but any proxy stripping the query string would by definition be non-compliant. Having done a ton of work with this stuff, I would definitely say:
A bug might exist in some implementation.
It's probably uncommon enough for you not to care.

Is it possible using ASP.NET to globally block all cookies (including third-party ones) that are dropped when someone is on my site?

The context of this is the much-hyped EU privacy law, which makes it illegal for a site to drop any "non-essential" cookies unless the user has "opted in".
My specific challenge is due to the complexity of the site and the variety of different ways cookies are being dropped, particularly where governed by a CMS that has allowed marketers to run riot and embed all sorts of content in different places. Mostly this is around third-party cookies, where there is embedded JavaScript, img pixels, iframes, etc. (I'm speculating that these all allow the dropping of third-party cookies, having briefly browsed key areas of the site using a Firefox plugin; I haven't checked the mechanisms of each yet.)
So, I've been trying to think whether in ASP.NET there would be a way to globally intercept and block all cookies that get dropped by my site should I need to, and also to extend this to check whether they are essential or not, and, if not, whether the user has already agreed to having cookies dropped (which would probably consist of a master YES cookie).
There are several things I am unclear about. First, would it be possible to use Response.Filter or Response.Cookies as a pipeline step to strip out any cookies that have already been dropped? Secondly, would it be possible for this to intercept any kind of cookie whatsoever, or is it going to be impossible to catch some of the third-party ones if they are executing browser requests from the client to the third-party server directly?
The closest thing I could find that resembles my question is this, but that looks like a site-wide solution, not a user-specific one.
A reverse proxy with URL rewriting could probably do this for you, if you spend the time tracking down the resources, implement the heavy hammer of allow/disallow cookies, and rewrite third-party URLs to go through your reverse proxy. Then you can hijack and modify their Set-Cookie responses. In addition, if they set cookies on the client through JavaScript, those cookies would be set through your server/domain, so you would have control over whether they are forwarded or not.
This is not a simple solution but it should be possible and could be implemented without changing the application or the user experience.
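This is not ASP.NET, but as a language-agnostic sketch of the server-side part of the idea, here is Python WSGI middleware that strips Set-Cookie from responses unless the user has already opted in via a hypothetical "cookie_consent=yes" cookie; it does not address third-party requests made directly from the browser, which is why the reverse-proxy rewriting above is still needed.

```python
from http.cookies import SimpleCookie

class StripCookiesMiddleware:
    """Drop Set-Cookie headers from responses until the user has consented."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
        consented = (
            "cookie_consent" in cookies
            and cookies["cookie_consent"].value == "yes"
        )

        def filtered_start_response(status, headers, exc_info=None):
            if not consented:
                headers = [(k, v) for k, v in headers if k.lower() != "set-cookie"]
            return start_response(status, headers, exc_info)

        return self.app(environ, filtered_start_response)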

HTTP Caching Testing

I have a proxying system that needs to understand the HTTP Cache-Control headers. The intent is not to perform any actual caching, but to derive tuning information based on the caching characteristics of sets of HTTP requests. I'm looking for a way to test the system.
I can do spot checking by pushing content from well-known websites or authored sites to make sure that the system is acting correctly. However, I'd like to expand the pool of test data.
Is there a test suite that enumerates either a set of common or complete caching headers that I can integrate with my software to make sure I'm covering all the bases I need to cover?
I know Apache HttpClient (HttpCore) has a pretty extensive test suite, though I don't know how deeply it goes into the caching stuff. This is Java; I don't know whether you care about the language or not.
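This is not an existing test suite, but one way to expand the pool of test data is to enumerate Cache-Control combinations as table-driven cases and run the proxy's parser over them; the expected values below reflect a reading of RFC 7234 for a shared cache and should be double-checked.

```python
CACHE_CONTROL_CASES = [
    ("max-age=3600",                     {"cacheable": True,  "ttl": 3600}),
    ("no-store",                         {"cacheable": False, "ttl": 0}),
    ("no-cache",                         {"cacheable": True,  "ttl": 0}),     # store, but always revalidate
    ("private, max-age=600",             {"cacheable": False, "ttl": 600}),   # not storable by a shared cache
    ("public, s-maxage=60, max-age=600", {"cacheable": True,  "ttl": 60}),    # s-maxage wins for proxies
    ("",                                 {"cacheable": True,  "ttl": None}),  # heuristic freshness
]

def check(parse):
    """parse: your function mapping a Cache-Control value to a dict like those above."""
    for header, expected in CACHE_CONTROL_CASES:
        assert parse(header) == expected, (header, parse(header), expected)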

HTTP Verbs and Content Negotiation or GET Strings for REST service?

I am designing a REST service and am trying to weigh the pros and cons of using the full array of HTTP verbs and content negotiation versus GET string variables. Does my choice affect cacheability? Neither solution may be right for every area.
Which is best for CRUD and queries (e.g. ?action=PUT)?
Which is best for API version picking (e.g. ?version=1.0)?
Which is best for the return data type (e.g. ?type=json)?
CRUD operations and queries are best represented with HTTP verbs. A create or update is usually a PUT or POST. A retrieve would be a GET. Deletes would be a DELETE. That's the general mapping. The main point is that a GET doesn't cause side effects, and that the verbs do what you'd expect them to do.
Putting the action in the URI is OK if that's the only way to pass it (e.g., the HTTP client library doesn't allow you to send non-GET/POST requests). Most libraries do, though, so it's strongly advised not to pass the verb via the URL.
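As a sketch of that mapping, using the Python requests library (assumed installed) against a hypothetical /widgets collection:

```python
import requests

BASE = "https://api.example.com"  # hypothetical API

requests.post(f"{BASE}/widgets", json={"name": "foo"})    # create
requests.get(f"{BASE}/widgets/42")                         # retrieve (safe, cacheable)
requests.put(f"{BASE}/widgets/42", json={"name": "bar"})   # full update (idempotent)
requests.delete(f"{BASE}/widgets/42")                      # delete
requests.get(f"{BASE}/widgets", params={"color": "red"})   # query via GET string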
The "best" way to version the API would be using HTTP headers on a per-request basis; this lets clients upgrade/downgrade specific requests instead of every single one. Of course, that granularity of versioning needs to be baked in at the start and could severely complicate the server-side code. Most people just use the URL used the access the servers. A longer explanation is in a blog post by Peter Williams, "Versioning Rest Web Services"
There is no best return data type; it depends on your app. JSON might be easier for Ajax websites, whereas XML might be easier for complicated structures you want to query with XPath. Protocol Buffers are a third option. It's also debated whether the return format is best specified in the URL or in the HTTP headers.
For the most part, headers will have the largest effect on caching, since proxies are supposed to respect them when told, as are user agents (though obviously UAs behave differently). Caching based on the URL alone is very dependent on the layers involved. Some user agents don't cache anything with a query string (Safari, IIRC), and proxies are free to cache or not cache as they see fit.
