Update file in Windows Azure CDN

I have a blob storage account and a CDN endpoint that serve my static content.
Now I want to update the app.js file because it was modified, but when I write the new file to the blob, the CDN still serves the old app.js. How can I update app.js? Or do I have to wait until the cached copy expires?

Simply put, you can't update a cached object before it expires.
From https://msdn.microsoft.com/en-us/library/azure/gg680303.aspx:
If you no longer wish to cache an object in the Azure Content Delivery Network (CDN), you can take one of the following steps:
For an Azure blob, you can delete the blob from the public container.
You can make the container private instead of public. See Restrict Access to Containers and Blobs for more information.
You can disable or delete the CDN endpoint using the Azure Management Portal.
You can modify your hosted service to no longer respond to requests for the object.
An object already cached in the CDN will remain cached until the time-to-live period for the object expires. When the time-to-live period expires, the CDN will check to see whether the CDN endpoint is still valid and the object still anonymously accessible. If it is not, then the object will no longer be cached.
No explicit "purge" tool is currently available for the Azure CDN.
Other workarounds include using either fake query strings or new file names, if possible. See here: https://stackoverflow.com/a/8773202/908336

This question was asked quite a long time ago, but I just wanted to add a method that proved useful for me; it is recommended by Microsoft. Essentially, you need to set cache-control headers on your blobs in Blob Storage. For example, setting the Cache-Control header to "public, max-age=3600" caches the file for one hour.
https://azure.microsoft.com/en-us/documentation/articles/cdn-manage-expiration-of-blob-content/
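For illustration, a minimal sketch of setting that header at upload time with the current Azure.Storage.Blobs SDK; the container name, blob name, and connection string are placeholders, not anything from the question:

    using System.IO;
    using System.Threading.Tasks;
    using Azure.Storage.Blobs;
    using Azure.Storage.Blobs.Models;

    public static class BlobCacheHeaders
    {
        // Uploads app.js and sets Cache-Control so the CDN caches it for one hour.
        public static async Task UploadWithCacheControlAsync(string connectionString)
        {
            var container = new BlobContainerClient(connectionString, "static"); // container name is a placeholder
            var blob = container.GetBlobClient("app.js");

            using (var stream = File.OpenRead("app.js"))
            {
                await blob.UploadAsync(stream, new BlobUploadOptions
                {
                    HttpHeaders = new BlobHttpHeaders
                    {
                        ContentType = "application/javascript",
                        CacheControl = "public, max-age=3600" // one hour
                    }
                });
            }
        }
    }

The CDN will keep serving the old copy until the current TTL runs out, but from then on the shorter max-age applies.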

The CDN is simple. When a request comes in, it fetches the content from the origin (in this case, blob storage), and then caches it for some time based on the Cache-Control header. It will keep delivering the same content until the cache expires.
There's no way to tell the CDN to expire something early.
Others may jump in with more helpful advice about how to deal with this (like query string parameters), but I just wanted to give a straightforward explanation of how the CDN's caching works.

The only way to do this right now is to contact Azure Support; they will in turn open a support ticket with Verizon EdgeCast to remove the file from the CDN, and it will update at that point. The whole process takes about 8 hours on the basic Azure support plan. This isn't a good solution, and I really hope they update this so we can programmatically purge something from the CDN; it seems like a basic feature they are lacking. Your best bet right now, I think, is to enable query string caching on the endpoint and then update the query string when you update the file. We do this for JS files like so: /js/custommix.js?version=1. Then we append a new version number from our config when we need to update those files.
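A minimal sketch of that query-string trick, assuming a classic ASP.NET site and a hypothetical "ScriptVersion" app setting (neither is from the original answer):

    using System.Configuration; // requires a reference to System.Configuration

    public static class ScriptUrl
    {
        // Builds a versioned URL such as /js/custommix.js?version=3
        public static string Versioned(string path)
        {
            var version = ConfigurationManager.AppSettings["ScriptVersion"] ?? "1";
            return $"{path}?version={version}";
        }
    }

    // Usage in a view or layout:
    // <script src="@ScriptUrl.Versioned("/js/custommix.js")"></script>

Bumping the "ScriptVersion" setting in config forces the CDN to treat the file as a brand-new URL.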
http://azure.microsoft.com/en-us/blog/best-practices-for-the-windows-azure-content-delivery-network/
How can I purge or invalidate content in the Windows Azure CDN?
As of 1.4, no purge function is available. This feature is under development. The best freshness control is to set good cache expiration headers as described in this document and the Windows Azure CDN documentation on MSDN.

You can now purge content from the new Azure Management portal.

It appears the default expiration time is 7 days.
From: http://msdn.microsoft.com/en-us/library/azure/gg680306.aspx
Blobs that benefit the most from Azure CDN caching are those that are accessed frequently during their time-to-live (TTL) period. A blob stays in the cache for the TTL period and then is refreshed by the blob service after that time has elapsed. Then the process repeats.
You have two options for controlling the TTL:
Do not set cache values, thus using the default TTL of 7 days.
Explicitly set the x-ms-blob-cache-control property on a Put Blob, Put Block List, or Set Blob Properties request, or use the Azure Managed Library to set the BlobProperties.CacheControl property. Setting this property sets the value of the Cache-Control header for the blob. The value of the header or property should specify the appropriate value in seconds. For example, to set the maximum caching period to one year, you can specify the request header as x-ms-blob-cache-control: public, max-age=31556926. For details on setting caching headers, see the HTTP/1.1 specification.
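As a sketch of the second option with the classic Windows Azure Storage client library (the container and blob names are placeholders):

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    public static class BlobTtl
    {
        // Sets a one-year Cache-Control header on an existing blob via Set Blob Properties.
        public static void SetOneYearTtl(string connectionString)
        {
            var account = CloudStorageAccount.Parse(connectionString);
            var blob = account.CreateCloudBlobClient()
                .GetContainerReference("static")          // container name is a placeholder
                .GetBlockBlobReference("css/styles.css"); // blob name is a placeholder

            blob.FetchAttributes();                                     // load the current properties
            blob.Properties.CacheControl = "public, max-age=31556926"; // ~1 year, as in the quote above
            blob.SetProperties();                                       // issues a Set Blob Properties request
        }
    }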

Related

Difference between public, max-age and s-maxage

TL;DR:
What is the difference between the public, max-age=<VALUE> and max-age=<VALUE>, s-maxage=<VALUE> Cache-Control syntaxes?
Question:
For one of my projects, I am looking to reduce the server load via the Cache-control http-header. This project is hosted as a serverless function on Vercel and calls GitHub's GraphQL API on the backend to retrieve GitHub user information. Both the Vercel API and GitHub API are rate limited, so I am looking for the best header to prevent these limits from being hit. To achieve this, I am currently using the following Cache-control header:
public, max-age=14400, stale-while-revalidate=86400
According to the Mozilla documentation, this header should keep both the private browser cache and the server cache fresh for 4 hours, while the stale cache can be reused for 1 day while it revalidates on the server. Furthermore, since I am not using an authentication header, I should even be able to remove the public keyword.
The Vercel cache documentation, however, recommends the following header:
max-age=0, s-maxage=14400, stale-while-revalidate=86400
Based on the Vercel documentation and this Stack Overflow question, I therefore think the following header is best suited for reducing both the Vercel and GitHub load:
max-age=14400, s-maxage=7200, stale-while-revalidate=86400
To my understanding, with this header the cache will be fresh for 4 hours for individual users, the Vercel server will refresh its cache every 2 hours, and a stale cache can be reused for 1 day while it revalidates on the server.
As I am uncertain about the difference between the public, max-age=<VALUE> and max-age=<VALUE>, s-maxage=<VALUE> syntaxes, I quickly wanted to double-check my understanding.
According to the Mozilla documentation, these two syntaxes should result in the same behaviour if the <VALUE> is equal between the max-age and s-maxage directives. However, the Symfony documentation states that the s-maxage directive prohibits a cache from using a stale response in stale-if-error scenarios. My question is therefore: what is the exact difference between these two syntaxes, and which one would you recommend for reducing the load on both the Vercel and GitHub APIs?
The Vercel-recommended cache headers are well suited to minimizing the number of API calls:
Cache-Control: max-age=0, s-maxage=N
s-maxage controls caching by the Vercel Edge Network, allowing it to serve cached responses rather than invoke the serverless function. As far as I know, the Edge Network does not have rate limits of its own.
Of course, you probably have other caching goals besides just reducing API calls, so you might want to use max-age as well to allow browser caching and reduce latency for your users.
It's also a good idea to use stale-while-revalidate, but note that this is just a mechanism to reduce latency. The revalidation still happens, so it won't have an effect on the number of API calls.
As for public, it means "that any cache MAY store the response, even if the response would normally be non-cacheable or cacheable only within a private cache". If your response is already cacheable by a public cache then this directive won't have any effect.
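The question is about a Vercel serverless function, but purely for illustration, here is the same set of directives applied in an ASP.NET Core app via middleware; the route and payload are made up:

    var builder = WebApplication.CreateBuilder(args);
    var app = builder.Build();

    app.Use(async (context, next) =>
    {
        // Shared caches (the CDN / edge) may hold the response for 4 hours and may
        // serve a stale copy for up to a day while revalidating in the background.
        context.Response.Headers["Cache-Control"] =
            "public, max-age=0, s-maxage=14400, stale-while-revalidate=86400";
        await next();
    });

    app.MapGet("/api/github-user", () => new { login = "example" }); // placeholder payload

    app.Run();

With max-age=0 the browser always revalidates, so the freshness window is controlled entirely by the shared cache via s-maxage.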

How to display the cached version first and check the etag/modified-since later?

With caching headers I can either make the client not check online for updates for a certain period of time, and/or check ETags every time. What I do not know is whether I can do both: use the offline version first, but meanwhile, in the background, check for an update. If there is a new version, it would be used the next time the page is opened.
For a page that is completely static except when the user changes it themselves, this would be much more efficient than having to block on an ETag check every time.
One workaround I thought of is using Javascript: set headers to cache the page indefinitely and have some Javascript make a request with an If-Modified-Since or something, which could then dynamically change the page. The big issue with this is that it cannot invalidate the existing cache, so it would have to keep dynamically updating the page theoretically forever. I'd also prefer to keep it pure HTTP (or HTML, if there is some tag that can do this), but I cannot find any relevant hits online.
A related question mentions "the two rules of caching": never cache HTML and cache everything else forever. Just to be clear, I mean to cache the HTML. The whole purpose of the thing I am building is for it to be very fast on very slow connections (high latency, low throughput, like EDGE). Every roundtrip saved is a second or two shaved off of loading time.
Update: reading more caching resources, it seems the Vary: Cookie header might do the trick in my case. I would like to know if there is a more general solution though, and I didn't really dig into the vary-header yet so I don't know yet if that works.
Solution 1 (HTTP)
There is a cache control extension stale-while-revalidate which describes exactly what you want.
When present in an HTTP response, the stale-while-revalidate Cache-Control extension indicates that caches MAY serve the response in which it appears after it becomes stale, up to the indicated number of seconds.
If a cached response is served stale due to the presence of this extension, the cache SHOULD attempt to revalidate it while still serving stale responses (i.e., without blocking).
cache-control: max-age=60,stale-while-revalidate=86400
When the browser first requests the page, it caches the result for 60s. During that 60s period, requests are answered from the cache without contacting the origin server. During the next 86400s, content is served from the cache while being fetched from the origin server in the background. Only after both periods (60s + 86400s) have expired will the cache stop serving cached content and instead wait for fresh data from the origin server.
This solution has one drawback: I was not able to find any browser or intermediate cache that currently supports this cache-control extension.
Solution 2 (Javascript)
Another solution is to use Service Workers, which can construct custom responses to requests. Combined with the Cache API, this is enough to provide the requested behaviour.
The problem is that this solution only works in browsers (not in intermediate caches or other HTTP services), and not all browsers support Service Workers and the Cache API.

Synchronizing local cache with external application

I have two separate web applications:
The "admin" application where data is created and updated
The "public" application where data is displayed.
The information displayed on the "public" changes infrequently, so I want to cache it.
What I'm looking for is the "simplest possible thing" to update the cache on the public site when a change is made in the admin site.
To throw in some complexity, the application is running on Windows Azure. This rules out file and sql cache dependencies (at least the built in ones).
I am running both applications on a single web role instance.
I've considered using Memcached for this purpose, but since I'm not really after a distributed cache, and its performance is not as good as an in-memory cache (System.Runtime.Caching), I want to avoid it.
I've also considered using NServiceBus (or the Azure equivalent) but again, this seems overkill just to send a notification to clear the cache.
What I'm thinking (maybe a little hacky, but simple):
1. Have a controller action on the public site that clears the in-memory cache (a rough sketch follows below the question). I'm not bothered about clearing specific cached items; the data doesn't change enough for me to worry about that. When the "admin" application makes a change, we make an HttpWebRequest to the clear-cache action on the public site.
2. Since the database is the only shared resource between the two applications, just add a table with the datetime of the last update. The public site queries it on every request and compares the last-update datetime in the database to the one we hold in memory. If they don't match, we clear the cache.
Any other recommendations or problems with the above options? The key thing here is simple and high performance.
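For reference, a minimal sketch of option 1, assuming ASP.NET MVC and System.Runtime.Caching; the action name, URL, and lack of authentication are simplifications:

    using System.Net;
    using System.Runtime.Caching;
    using System.Web.Mvc;

    // Public site: an action the admin site can call to drop the in-memory cache.
    public class CacheController : Controller
    {
        [HttpPost]
        public ActionResult Clear()
        {
            MemoryCache.Default.Trim(100);    // Trim(100) evicts everything from the default MemoryCache
            return new HttpStatusCodeResult(200);
        }
    }

    // Admin site: ping the public site after saving a change.
    public static class CacheNotifier
    {
        public static void NotifyPublicSite()
        {
            var request = WebRequest.Create("https://public.example.com/cache/clear"); // hypothetical URL
            request.Method = "POST";
            request.ContentLength = 0;        // empty body
            using (request.GetResponse()) { }
        }
    }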
1., where you have a controller action to clear the cache, won't work if you have more than one instance; otherwise, if you know you have one and only one instance, it should work just fine.
2., where you have a table that stores the last update time, would work fine for multiple instances but incurs the cost of a SQL database query per request -- and for a heavily loaded site this can be an issue.
Probably fastest and simplest is to use option 2 but store the last update time in table storage rather than a SQL database. Reads to table storage are very fast -- under the covers it's a simple HTTP GET.
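A sketch of that table-storage variant with the classic storage client library; the table entity, partition/row keys, and the "clear everything" policy are assumptions:

    using System;
    using System.Runtime.Caching;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Hypothetical entity: one well-known row that records the last admin update.
    public class LastUpdateEntity : TableEntity
    {
        public LastUpdateEntity() { PartitionKey = "settings"; RowKey = "lastUpdate"; }
        public DateTime UpdatedUtc { get; set; }
    }

    public static class CacheGuard
    {
        private static DateTime _lastSeenUtc;

        // Call this at the start of each public-site request.
        public static void InvalidateIfStale(CloudTable table)
        {
            var result = table.Execute(TableOperation.Retrieve<LastUpdateEntity>("settings", "lastUpdate"));
            var entity = result.Result as LastUpdateEntity;
            if (entity != null && entity.UpdatedUtc > _lastSeenUtc)
            {
                MemoryCache.Default.Trim(100);   // drop everything we cached locally
                _lastSeenUtc = entity.UpdatedUtc;
            }
        }
    }

The admin site simply upserts the same entity with the current UTC time whenever it saves a change.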
Having a public controller that you can call to tell the site to clear its cache will work as long as you only have one instance of the main site. As soon as you add a second instance, as calls go through the load balancer, your one call will only go to one instance.
If you're not concerned about how soon the update makes it from the admin site to the main site, the best-performing and easiest (but not the cheapest) solution is to use the Azure AppFabric Cache, configured to use a local (in-memory) cache with a short-ish timeout (say 10 minutes).
The first time your client tries to access an item, this is what happens:
Look for the item in local cache
It's not there, so look for the item in the distributed cache
It's not there either so load the item from persistent storage
Add the item to the cache with a long-ish time to live (48 hours is the default I think)
Return the item
Steps 1 and 2 are taken care of for you by the library, the other bits you need to write. Any subsequent calls in the next X minutes will return the item from the in memory cache. After X minutes it falls out of the local cache. The next call loads it from the distributed cache back into the local cache and you can carry on.
All your admin app needs to do is update the database and then remove the item from the distributed cache. The next time the item falls out of the local cache on the client, it will simply reload the data from the database.
If you like this idea but don't want the expense of using the caching service, you could do something very similar with your database idea. Keep the cached data in a static variable and just check for updates every x minutes rather than with every request.
In the end I used Azure Blobs as cache dependencies. I created a file change monitor to poll for changes to the files (full details at http://ben.onfabrik.com/posts/monitoring-files-in-azure-blob-storage).
When a change is made in the admin application I update the blob. When the file change monitor detects the change we clear the local cache.
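For reference, a much-simplified polling sketch along those lines (not the full change monitor from the linked post); the container name, marker blob, and poll interval are assumptions:

    using System;
    using System.Runtime.Caching;
    using System.Threading;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    public static class BlobChangeMonitor
    {
        private static Timer _timer; // keep a reference so the timer isn't collected

        // Polls a marker blob's ETag and clears the local cache when it changes.
        public static void Start(string connectionString)
        {
            var blob = CloudStorageAccount.Parse(connectionString)
                .CreateCloudBlobClient()
                .GetContainerReference("cache-dependencies")   // hypothetical container
                .GetBlockBlobReference("public-site.marker");  // hypothetical marker blob

            string lastETag = null;
            _timer = new Timer(_ =>
            {
                blob.FetchAttributes();                        // refreshes blob.Properties, including the ETag
                if (lastETag != null && blob.Properties.ETag != lastETag)
                {
                    MemoryCache.Default.Trim(100);             // a change was detected: drop the local cache
                }
                lastETag = blob.Properties.ETag;
            }, null, TimeSpan.Zero, TimeSpan.FromSeconds(30)); // poll every 30 seconds (assumption)
        }
    }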

Advantages of ETags verses updating URL

ETags allow browsers to perform conditional GETs. Only if the resource in question has been altered will the resource have to be re-downloaded. However, the browser still has to wait for the server to respond to its request.
An alternative to ETags is to introduce a token into the URL pointing to the resource:
http://example.com/css/styles.css?token=134124134
or
http://example.com/css/134124134/styles.css
Both approaches avoid having to re-download an unchanged resource.
However, using URLs with tokens allows the server to set a far-future expiry header on the resource. This saves the round trip taken up by a conditional GET - if the resource is unchanged then the URL pointing to it will be unchanged.
Are there any advantages to using ETags over URLs with tokens?
The major downside for read-only resources that I see is that if we all took this approach for all static resources then client caches would start to fill with all sorts of out-dated resources.
Also, think of all the intermediary caches that would start holding loads of useless files.
You are fighting against the web with this approach and if it became popular then something would have to change because it is not a scalable solution.
Could there be some kind of hybrid approach where you use a limited set of tokens and set the expiry small enough that an old cached resource would expire before the token was reused?
ETags are also used for read-write resources, and in that case I suspect the token solution just does not work.
I think the biggest difference/potential advantage would be configuration; the URL token must be configured/set up inside the application (e.g., the HTML actually has to include the value). ETags are configured for the entire web server, and the HTML doesn't have to be modified to take advantage of them.
Also, ETags will (assuming they are configured correctly) change when the file they point at changes; adding a token to the URL requires some additional "thing" that tells it to change (either a person editing the HTML, some configuration setting, etc.).
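One common way to automate that "thing" is to derive the token from the file's content, so the URL changes exactly when the file does. A rough sketch assuming classic ASP.NET; the helper is hypothetical, and in production you would cache the computed hash rather than hashing the file on every request:

    using System;
    using System.IO;
    using System.Security.Cryptography;
    using System.Web;

    public static class StaticUrl
    {
        // Appends a token derived from the file's content, so the URL changes
        // only when the file itself changes; safe to pair with a far-future expiry.
        public static string WithToken(string virtualPath)
        {
            var physicalPath = HttpContext.Current.Server.MapPath(virtualPath);
            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(physicalPath))
            {
                var hash = md5.ComputeHash(stream);
                var token = BitConverter.ToString(hash).Replace("-", "").Substring(0, 12).ToLowerInvariant();
                return $"{virtualPath}?token={token}";
            }
        }
    }

    // Usage in a view: <link rel="stylesheet" href="@StaticUrl.WithToken("/css/styles.css")" />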
Have a constant URI?

Single web server and ETags

Does anyone know if it is worth disabling ETags on a web application that is hosted on a single web server? Currently we don't make use of ETags in our application.
If it is worth disabling them - why?
Many thanks.
I don't know if this helps, but you can read about etags here:
http://developer.yahoo.net/blog/archives/2007/07/high_performanc_11.html
and here is what Jeff Atwood thinks about ETags:
ETags are a checksum field served up with each server file so the client can tell if the server resource is different from the cached version the client holds locally. Yahoo recommends turning ETags off because they cause problems on server farms due to the way they are generated with machine-specific markers. So unless you run a server farm, you should ignore this guidance. It'll only make your site perform worse because the client will have a more difficult time determining if its cache is stale or fresh. It is possible for the client to use the existing last-modified date fields to determine whether the cache is stale, but last-modified is a weak validator, whereas Entity Tag (ETag) is a strong validator. Why trade strength for weakness?
An interview with Steve Souders on .NET Rocks may also help:
Steve Souders: ... in the default implementation of IIS and Apache, both of those servers put something in the ETag that makes it very likely that, if the user ever has to check the validity of that resource, the browser is going to be incorrectly told that the resource is no longer valid. In Apache's case, what they put in the ETag is the inode number of the file on that web server, so if you have more than one web server hosting your site, which most large websites do, that inode number is never going to match across two servers. So if yesterday the user went to server one and today they try to validate that resource and they go to server two, the ETag is not going to match. The ETag overrides the last-modified date, so instead of just returning a 200-byte 304 response, the server has to return a 50k response with the entire image.
"If you’re hosting your website on one server, it isn’t necessary to remove ETags. The same ETag will be used every time and the validation check will take place efficiently and correctly."
Source: Dean Hume
