I am trying to replicate Akamai CDN rules on a different CDN. Not able to find what are the default components of cache key on Akamai, are all the HTTP headers and Cookies included by default? or are cookies and other HTTP headers included in the cache key only if they are explicitly specified?
No, cache-keys do not contain header or cookie information by default. They will have to be explicitly added using Cache-ID Modification behavior. The default cache-key has the following details on them
Akamai proprietary information (such as CP Codes)
Cache TTL
Hostname (Origin or Incoming Host header)
URL Path
The cache key of your content can be learned by using the Akamai Pragma Headers.
You want to send (at a minimum) the Pragma: akamai-x-get-true-cache-key header to see the response from Akamai showing what is used to create your cache key.
Related
Definition of ETag header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag):
The ETag HTTP response header is an identifier for a specific version
of a resource. It allows caches to be more efficient, and saves
bandwidth, as a web server does not need to send a full response if
the content has not changed. On the other side, if the content has
changed, etags are useful to help prevent simultaneous updates of a
resource from overwriting each other ("mid-air collisions").
Definition of Cache-Control header (https://developer.mozilla.org/de/docs/Web/HTTP/Headers/Cache-Control):
The Cache-Control general-header field is used to specify directives
for caching mechanisms in both requests and responses.
So the ETag header tells the browser for a resource to send a single HTTP request to the server and ask if the file hash has changed. If yes, download a new one. Great. So if the ETag header is set why should I need Cache-Control any more (beside of the Expires header which may help to avoid this single request)?
So if I have to set the Cache-Control header anyway it can only be harmful right? I think the most appropriate value would be:
Cache-Control: must-revalidate
But I am not sure if this triggers unecessary additional actions.
After some research, I found a great tutorial on Medium by Alex Barashkov: "Best practices for cache control settings for your website".
Alex writes:
I recommend you apply Cache-Control: no-cache to html files. Applying
“no-cache” does not mean that there is no cache at all, it simply
tells the browser to validate resources on the server before use it
from the cache. That’s why we need to use it with Etag, so browsers
will send a simple request and load the extra 80 bytes to verify the
state of the file.
Presence of ETag header does not tell the browser to do anything. Browser decides what to do based on the Cache-Control header it receives in the request and cached response. If it decides that resource is stale or needs to be re-validated, then it can use the ETag value to create a conditional request to the server and either get a new resource (status code 200), or a notification that things have not changed (status code 304)
Both headers are necessary for your cache to work optimally.
I'm using playframework and nginx. playframework may add following cookies to http response: PLAY_SESSION, PLAY_FLASH, PLAY_LANG.
I want to make sure that only above cookies (PLAY_*) are allowed in nginx level. If there are other cookies (let's say they're added accidentally) they should be removed by nginx.
How can I allow only predefined cookies in http response in nginx?
PS: If it's not possible to solve this issue in nginx, I need to fix by using playframework.
How cookies work?
First, let's establish what's cookies — they're little pieces of "sticky" hidden information that lets you keep state on your web-site for a given User-Agent. These cookies are often used for tracking users, keeping session and storing minor preference information for the site.
Set-Cookie HTTP response header (from server to client)
Cookies can be set by the server through the Set-Cookie response header (with a separate header for each cookie), or, after the page has already been transferred from the server to the client, through JavaScript.
Note that setting cookies is a pretty complex job — they have expiration dates, http/https settings, path etc — hence the apparent necessity to use a separate Set-Cookie header for each cookie.
This requirement to have a separate header is usually not an issue, since cookies aren't supposed to be modified all that often, as they usually store very minimal information, like a session identifier, with the heavy-duty information being stored in an associated database on the server.
Cookie HTTP request header (from client to server)
Regardless how they were first set, cookies would then included in eligible subsequent requests to the server by the client, using the Cookie request header, with a whole list of eligible cookies in one header.
Note that, as such, these cookies that are sent by the client back to the server is a simple list of name and attribute pairs, without any extra information about the underlying cookies that store these attributes on the client side (e.g., the expiration dates, http/https setting and paths are saved by the client internally, but without being revealed in subsequent requests to the server).
This conciseness of the Cookie request header field is important, because, once set, eligible cookies will be subsequently included in all forthcoming requests for all resources with the eligible scheme / domain / path combination.
Caching issues with cookies.
The normal issue of using cookies, especially in the context of acceleration and nginx, is that:
cookies invalidate the cache by default (e.g., unless you use proxy_ignore_headers Set-Cookie;),
or, if you do sloppy configuration, cookies could possibly spoil your cache
e.g., through the client being able to pass cookies to the upstream in the absence of proxy_set_header Cookie "";,
or, through the server insisting on setting a cookie through the absence of proxy_hide_header Set-Cookie;.
How nginx handles cookies?
Cookie from the client
Note that nginx does support looking through the cookies that the client sends to it (in the Cookie request header) through the $cookie_name scheme.
If you want to limit the client to only be sending certain cookies, you could easily re-construct the Cookie header based on these variables, and send only whichever ones you want to the upstream (using proxy_set_header as above).
Or, you could even make decisions based on the cookie to decide which upstream to send the request to, or to have a per-user/per-session proxy_cache_key, or make access control decisions based on the cookies.
Set-Cookie from the backend
As for the upstream sending back the cookies, you can, of course, decide to block it all to considerably improve the caching characteristics (if applicable to your application, or parts thereof), or fix up the domain and/or path with proxy_cookie_domain and/or proxy_cookie_path, respectively.
Otherwise, it's generally too late to make any other routing decision — the request has already been processed by the selected upstream server, and the response is ready to be served — so, naturally, there doesn't seem to be a way to look into these individual Set-Cookie cookies through normal means in nginx (unless you want to go third-party modules, or lua, or perl), since it'd already be too late to make any important routing decisions for a completed request.
Basically, these Set-Cookie cookies have more to do with the content than with the way it is served or routed, so, it doesn't seem appropriate to have integrated functionality to look into them through nginx.
(If you do need to make routing decisions after the completion of the request, then nginx does support X-Accel-Redirect, as well as some other special headers.)
If your issue is security, then, as I've pointed out above, the upstream developer can already use JavaScript to set ANY extra cookies however they want, so, effectively, trying to use nginx to limit some, but not all, Set-Cookie responses from the server is kind of a pointless endeavour in the real world (as there is hardly any difference between the cookies set through JavaScript compared to Set-Cookie).
In summary:
you can easily examine and reconstruct the Cookie header sent by the client to the server before passing it over to the backend, and only include the sanctioned cookies in the request to upstream backend,
but, unless you want to use lua/perl, or have your own nginx module (as well as possibly quarantine the JavaScript from the pages you serve), then you cannot pass only certain Set-Cookie headers back from the upstream backend to the client with a stock nginx.conf — with the Set-Cookie headers, it's an all-or-nothing situation, and there doesn't seem to be a good-enough use-case for a distinct approach.
For an Nginx solution it might be worth asking over at serverfault. Here is a potential solution via Play Framework.
package filters
import javax.inject._
import play.api.mvc._
import scala.concurrent.ExecutionContext
#Singleton
class ExampleFilter #Inject()(implicit ec: ExecutionContext) extends EssentialFilter {
override def apply(next: EssentialAction) = EssentialAction { request =>
next(request).map { result =>
val cookieWhitelist = List("PLAY_SESSION", "PLAY_FLASH", "PLAY_LANG")
val allCookies = result.newCookies.map(c => DiscardingCookie(c.name))
val onlyWhitelistedCookies = result.newCookies.filter(c => cookieWhitelist.contains(c.name))
result.discardingCookies(allCookies: _*).withCookies(onlyWhitelistedCookies: _*)
}
}
}
This solution utilizes Filters and Result manipulation. Do test for adverse effects on performance.
MDN says, when the credentials like cookies, authorisation header or TLS client certificates has to be exchanged between sites Access-Control-Allow-Crendentials has to be set to true.
Consider two sites A - https://example1.xyz.com and another one is B- https://example2.xyz.com. Now I have to make a http Get request from A to B. When I request B from A I am getting,
"No 'Access-Control-Allow-Origin' header is present on the requested
resource. Origin 'http://example1.xyz.com' is therefore not allowed
access."
So, I'm adding the following response headers in B
response.setHeader("Access-Control-Allow-Origin", request.getHeader("origin"));
This resolves the same origin error and I'm able to request to B. When and why should I set
response.setHeader("Access-Control-Allow-Credentials", "true");
When I googled to resolve this same-origin error, most of them recommended using both headers. I'm not clear about using the second one Access-Control-Allow-Credentials.
When should I use both?
Why should I set Access-Control-Allow-Origin to origin obtained from request header rather than wildcard *?
Please quote me an example to understand it better.
Allow-Credentials would be needed if you want the request to also be able to send cookies. If you needed to authorize the incoming request, based off a session ID cookie would be a common reason.
Setting a wildcard allows any site to make requests to your endpoint. Setting allow to origin is common if the request matches a whitelist you've defined. Some browsers will cache the allow response, and if you requested the same content from another domain as well, this could cause the request to be denied.
Setting Access-Control-Allow-Credentials: true actually has two effects:
Causes the browser to actually allow your frontend JavaScript code to access the response if credentials are included
Causes any Set-Cookie response header to actually have the effect of setting a cookie (the Set-Cookie response header is otherwise ignored)
Those effects combine with the effect that setting XMLHttpRequest.withCredentials or credentials: 'include' (Fetch API) have of causing credentials (HTTP cookies, TLS client certificates, and authentication entries) to actually be included as part of the request.
https://fetch.spec.whatwg.org/#example-cors-with-credentials has a good example.
Why should I set Access-Control-Allow-Origin to origin obtained from request header rather than wildcard *?
You shouldn’t unless you’re very certain what you’re doing.
It’s actually safe to do if:
The resource for which you’re setting the response headers that way is a public site or API endpoint intended to be accessible by everyone, and
You’re just not setting cookies that could enable an attacker to get access to sensitive information or confidential data.
For example, if your server code is just setting cookies just for the purpose of saving application state or session state as a convenience to your users, then there’s no risk in taking the value of the Origin request header and reflecting/echoing it back in the Access-Control-Allow-Origin value while also sending the Access-Control-Allow-Credentials: true response header.
On the other hand, if the cookies you’re setting expose sensitive information or confidential data, then unless you’re really certain you have things otherwise locked down (somehow…) you really want to avoid reflecting the Origin back in the Access-Control-Allow-Origin value (without checking it on the server side) while also sending Access-Control-Allow-Credentials: true.
If you do that, you’re potentially exposing sensitive information or confidential data in way that could allow malicious attackers to get to it. For an explanation of the risks, read the following:
https://web-in-security.blogspot.jp/2017/07/cors-misconfigurations-on-large-scale.html
http://blog.portswigger.net/2016/10/exploiting-cors-misconfigurations-for.html
And if the resource you’re sending the CORS headers for is not a public site or API endpoint intended to be accessible by everyone but is instead inside an intranet or otherwise behind some IP-address-restricted firewall, then you definitely really want to avoid combining Access-Control-Allow-Origin-reflects-Origin and Access-Control-Allow-Credentials: true. (In the intranet case you almost always want to only be allowing specific hardcoded/whitelisted origins.)
I'm serving a set of resources through content negotiation.
Concretely, any URL can be represented in different formats,
depending on the client's Accept header.
An example of this can be seen at Facebook:
curl -H "Accept: application/json" http://graph.facebook.com/daft-punk
results in JSON
curl -H "Accept: text/turtle" http://graph.facebook.com/daft-punk
results in Turtle
I'm looking for a CDN that caches content based on URL and the client's Accept header.
Example of what goes wrong
CloudFlare doesn't support this: if one client asks for HTML, then all subsequent requests to that URL receive the HTML representation, regardless of their preferences. Others have similar issues.
For example, if I would place CloudFlare over graph.facebook.com(and configure it to cache “extensionless” resources, which it does not by default), then it would behave incorrectly:
I ask for http://graph.facebook.com/daft-punk in JSON through curl;
in response, CloudFlare asks the JSON original from the server, caches it, and serves it.
I ask for http://graph.facebook.com/daft-punk through my browser (thus in HTML);
in response CloudFlare sends the cached JSON (!) representation, even though the original server would have sent the HTML version.
What would be needed instead
The correct behavior would be that CloudFlare asks the server again, since the second client had a different Accept header.
After this, requests with similar Accept headers can be served from cache.
Which CDN solutions support content-negotiation, and also cache negotiated content?
So note that only respecting Accept is not enough; negotiated responses should be cached too.
PS1: It's easy to make your own caching servers support it. For instance, for nginx:
proxy_cache_key "$scheme$host$request_uri$http_accept";
Note how the client's Accept header is part of the key that indexes the cache. I want that on CDN.
PS2: It is not an option to use different URLs for different representations. My application is in the Linked Data domain, where URLs play an important role for identification.
Seems maxcdn still can set up custom nginx rules for content negotiation (despite what their faq says) - http://blog.maxcdn.com/how-to-reduce-image-size-with-webp-automagically/#comment-1048561182
I can't think of any way we would impact this at all at this time. We don't, for example, cache HTML by default. Have you actually seen an issue with this? Have you opened a support ticket?
I've looked around but haven't been able to figure out if I should use both an ETag and an Expires Header or one or the other.
What I'm trying to do is make sure that my flash files (and other images and what not only get updated when there is a change to those files.
I don't want to do anything special like changing the filename or putting some weird chars on the end of the url to make it not get cached.
Also, is there anything I need to do programatically on my end in my PHP scripts to support this or is it all Apache?
They are slightly different - the ETag does not have any information that the client can use to determine whether or not to make a request for that file again in the future. If ETag is all it has, it will always have to make a request. However, when the server reads the ETag from the client request, the server can then determine whether to send the file (HTTP 200) or tell the client to just use their local copy (HTTP 304). An ETag is basically just a checksum for a file that semantically changes when the content of the file changes.
The Expires header is used by the client (and proxies/caches) to determine whether or not it even needs to make a request to the server at all. The closer you are to the Expires date, the more likely it is the client (or proxy) will make an HTTP request for that file from the server.
So really what you want to do is use BOTH headers - set the Expires header to a reasonable value based on how often the content changes. Then configure ETags to be sent so that when clients DO send a request to the server, it can more easily determine whether or not to send the file back.
One last note about ETag - if you are using a load-balanced server setup with multiple machines running Apache you will probably want to turn off ETag generation. This is because inodes are used as part of the ETag hash algorithm which will be different between the servers. You can configure Apache to not use inodes as part of the calculation but then you'd want to make sure the timestamps on the files are exactly the same, to ensure the same ETag gets generated for all servers.
Etag and Last-modified headers are validators.
They help the browser and/or the cache (reverse proxy) to understand if a file/page, has changed, even if it preserves the same name.
Expires and Cache-control are giving refresh information.
This means that they inform, the browser and the reverse in-between proxies, up to what time or for how long, they may keep the page/file at their cache.
So the question usually is which one validator to use, etag or last-modified, and which refresh infomation header to use, expires or cache-control.
Expires and Cache-Control are "strong caching headers"
Last-Modified and ETag are "weak caching headers"
First the browser check Expires/Cache-Control to determine whether or not to make a request to the server
If have to make a request, it will send Last-Modified/ETag in the HTTP request. If the Etag value of the document matches that, the server will send a 304 code instead of 200, and no content. The browser will load the contents from its cache.
Another summary:
You need to use both. ETags are a "server side" information. Expires are a "Client side" caching.
Use ETags except if you have a load-balanced server. They are safe and will let clients know they should get new versions of your server files every time you change something on your side.
Expires must be used with caution, as if you set a expiration date far in the future but want to change one of the files immediatelly (a JS file for instance), some users may not get the modified version until a long time!
One additional thing I would like to mention that some of the answers may have missed is the downside to having both ETags and Expires/Cache-control in your headers.
Depending on your needs it may just add extra bytes in your headers which may increase packets which means more TCP overhead. Again, you should see if the overhead of having both things in your headers is necessary or will it just add extra weight in your requests which reduces performance.
You can read more about it on this excellent blog post by Kyle Simpson: http://calendar.perfplanet.com/2010/bloated-request-response-headers/
In my view, With Expire Header, server can tell the client when my data would be stale, while with Etag, server would check the etag value for client' each request.
ETag is used to determine whether a resource should use the copy one. and Expires Header like Cache-Control is told the client that before the cache decades, client should fetch the local resource.
In modern sites, There are often offer a file named hash, like app.98a3cf23.js, so that it's a good practice to use Expires Header. Besides this, it also reduce the cost of network.
Hope it helps ;)
Etag is a hash for indicating the version of the resource. When the server returns data, it hashes the data and set this hash value under ETAG. When you send a "PUT" request to the server to update a record, maybe simultaneously another user made the same "PUT" request and its request has been processed. The server will check your "PUT" data and will see that it is the same update so it wont make another update, it will send you the updated data (by another user) and you will update your cache.
when the time for caching expires, the browser automatically makes a new request to get the fresh data. That is why "Expires" header is used
If a response includes both an Expires header and a max-age directive,
the max-age directive overrides the Expires header, even if the
Expires header is more restrictive. This rule allows an origin server
to provide, for a given response, a longer expiration time to an
HTTP/1.1 (or later) cache than to an HTTP/1.0 cache. This might be
useful if certain HTTP/1.0 caches improperly calculate ages or
expiration times, perhaps due to desynchronized clocks.