affordable CDN alternative with mobile detection - cdn

I have been looking for an affordable CDN alternative for quite some time (researching / contacting all big CDN companies). But I have not found a good affordable alternative yet.
CDN criteria
(Near) instant purge
HTTPS for up to 30 domains
Support for Vary: Switch cache based on User-Agent mobile/desktop without redirect
Affordable (near) instant invalidation (± 1.000.000 URLs monthly)
Ideally: cache tagging for specific invalidation
Under 500 USD monthly for 500.000 requests and 2 TB traffic.
CDNs not meeting criteria
I've contacted these CDNs already; they do not meet the criteria:
Cloudfront
Invalidating URL's is very expensive
No instant purge
Stackpath
Claims to support dynamic serving; does not actually support it
KeyCDN
Claims to support dynamic serving; does not actually support it
CDNs meeting technical criteria
These CDNs do meet the criteria, but are too expensive with the criteria combined:
Fastly
Over our budget of 500 USD monthly
So in short: What is a good affordable CDN alternative?

PageCDN might work for you.
Less than 15 seconds purge time. Mostly less than 5 seconds.
No CNAME hosts. Primary host is pagecdn.io that serves all of your content.
PageCDN does not require you to create expensive zones. You can create zones easily with API and mix it with device detection performed at the origin server. This will give you workable alternative of Vary. Allowing full support for Vary header is very expensive at CDN level.
If you need instant invalidation, you should try pagecdn's builtin path based cache busters. Contrary to the cache busting usually deployed at the origin servers, this is resolved at file level at pagecdn.
Same as point 4. You can use cache busters for tags.
500USD plan will give you 15 TB bandwidth and you can work with 500 origins (still more than 30 origins * 15 useragent variations).
In addition, you also get:
Github pull.
Free Public CDN.
Configurable compression. Even Brotli-11 compression is support.
Configurable Edge and HTTP Cache Age. Immutable Caching is also supported.
Configurable HTTP/2 Server Push.

Related

SEO crawler DDOSing sites

I have a customer that runs 36 websites (many thousands of pages) on a round robin with sticky affinity load-balanced set of IIS servers - the infra is entirely AWS based (r3.2xl - 8 VCPU, 60.5 GiB RAM)
To get straight to the point, the site is configured to 'cache on access' using standard in-memory caching with ASP.NET 4.6 and static assets through Cloudfront. The site on a 'cold start' makes both SQL Server queries for content, and separate elasticsearch queries at runtime to determine hreflang alternate language tags - this basically queries which versions of the URL are available in different languages for SEO reasons. This query has been optimised to be a lookup on a single index from a cross-index wildcard query. As mentioned, the entire result is cached for 24h once all this has executed.
Under normal use conditions the site works perfectly. As there are 36 sites running on a single box, the private set space gets allocated up to the max (99%) of physical RAM over time, as more and more content gets cached in memory. I can end up with App Pools in excess of 1.5GiB which isn't ideal. After this point, presumably the .NET LRU cache eviction algorithm is working overtime.
The problem I have, after some post-mortem review of the IIS logd, the customer is using an SEO bot tool, SEMrush, which essentially triggers a denial of service attack against the sites (thundering herd?) because of simultaneous requests for the 'long tail' of pages which are never viewed by a user and hence aren't stored in the cache.
The net result is a server brought to its knees, App Pool CPU usage all over the place, and an Elasticsearch queue length > 1000, huge ES heap growth, rejection rate - and eventually a crash.
The solutions I've thought about but haven't implemented:
Cloudfront all the sites - use a warm up script (although I don't think this will actually help as it's a cold start problem when all the pages expire, unless I could have a MOST recently used cache invalidation mechanism which invalidated pages on number of requests - say > 100, and left everything else persistent)
AWS Shield/WAF to provide some sort of rate limiting
Remove the runtime ES lookup all together and move to an eventually-consistent model which computes the hreflang lookup table elsewhere on a separate process. Hpwever, the ES instances, whilst on a v1.3.1 version which is old, is a 3-node cluster which has a lot of CPU power and each node set to a 16GiB min/max heap so should be able to take that level of throughput?
Or all 3!
Has anyone come across this problem before and what was your solution? it must be fairly common especially for large sites which are hammered by SEO / DQM web crawlers?

Are "forever" and "never" the only useful caching durations?

In the presentation "Cache is King" by Steve Souders (at around 14:30), it is implied that there are in practice only two caching durations that you should use for your resources: "forever" and "never" (my own terminology).
"Forever" means that you effectively make the resource permanently immutable by setting a very high max age, such as one year. If you want to modify the resource at some point, the presentation suggests, you simply publish the modified resource at a different URL. (It is suggested that this renaming is necessary, in part or entirely, because of the large number of misconfigured proxies on the Internet.)
"Never" means that you effectively disable all forms of caching and require browsers to download the resource every time it is requested.
On the one hand, any performance advice given by the head performance engineer at Google carries weight on its own. On the other hand, HTTP caching was presumably designed with variable cache durations for a reason (not just "forever" and "never"), and changing the URL to a resource only because the resource has been modified seems to go against the spirit of HTTP.
Are "forever" and "never" the only cache durations that you should use in practice? Is this in conflict with other best practices on the web?
In addition to the typical "user with a browser" use case, I would also like to know how these principles apply to REST/hypermedia APIs.
Many people would disagree with limiting yourself to "forever" or "never" as you describe it.
For one thing, it ignores the option of allowing caching with always revalidating. In this case, if the client (or proxy) has cached the resource, it sends a conditional HTTP request. If the client/proxy has cached the latest version of the resource, then the server sends a short 304 response rather than the entire resource. If the client's (proxy) copy is out of date, then the server sends the entire resource.
With this scheme, the client will always get an up-to-date version of the resource, and if the resource doesn't change much bandwidth will be saved.
To save even more bandwidth, the client can be instructed to revalidate only when the resource is older than a certain period of time.
And if bad proxies are a problem, the server can specify that only clients and not proxies may cache the resource.
I found this document pretty concisely describes your options for caching. This page is longer but also gives some excellent information.
"It depends" really, on your use case, what you are trying to achieve, and your branding proposition.
If all you want to achieve is some bandwidth saving, you could do a total cost breakdown. Serving cost might not amount to much. Browsers are anyway pretty smart at optimizing image hits, for example, so understand your HTTP protocol. Forever, combined with versioned resource url, and url rewrite rules might be a good fit, like your Google engineer suggested.
Resource volatility is another. If you are only serving daily stock charts for example, it could safely be cached for some time but not forever.
Are your computation costs heavy? Are your users sensitive to timeliness? Is data live or fixed? For example, you might be serving airline routes, path of a hurricane, option greeks or a BI report to COO. You might want to have it cached, but the TTL will likely vary by user class, all the way down to never. Forever cannot work for live data but never might be a wrong answer too.
Degree of cooperation between the server and the client may be another factor. For example in a business operations environment where procedures can be distributed and expected to be followed, it might be worthwhile to again look at TTLs.
HTH. I doubt if there is a magical answer.
Idealy, you muste cache until the content changes, if you cannot clear/refresh the cache when content changes for any reason, you need a duration. But indeed, if you can, cache forever or do not cache. No need to refresh if you already know nothing changed.
If you know that the underlying data will be static for any length of time, caching makes sense. We have a web service that exposes data from a database that is populated by a nightly ETL job from an external source. Our RESTful web service only goes to the database when it changes. In our case, we know exactly when the data changes and we invalidate the cache right after the ETL process finishes.

Optimizing Google Search Appliance on a remote server

I'm planning to deploy a Google Search Appliance to remotely index an intranet site (transcontinentally). So I will be using the company's network and potentially consuming too much bandwidth.
Regarding the configurations that I can use to mitigate the effect of the initial crawl (which is the only one that is perceived as dangerous for the network) we have:
Crawl and Index > Host Load Schedule
Web Server Host Load: basically number of concurrent connections to the crawled servers within 1 minute, so minimizing this setting should
Exceptions to Web Server Host Load: this is a schedule used for either increasing or decreasing the number of concurrent connections to the crawled server.
Crawl and Index > Crawl Schedule
Instead of a continous crawl I should choose a Scheduled crawl.
Am I on the right track and can other settings be configured in order not to generate excessive network traffic between the GSA and the Web servers?
The best way to minimize the crawling of a remote site is to not crawl it. Failing that, there are a couple of settings will help it it as noted out above:
1) Host Load Schedule
This sets the number of current threads set to the crawler for the host. Note that this can be a number below 1. (i.e. 2.5) (also noted by BigMikeW)
2) Freshness Tuning
Crawl infrequently actually means "Crawl never again". This works well in conjunction with a meta-url feed which will tell the GSA to recrawl the page or a recrawl request from the administrative console. Crawl frequently actually means: "Crawl Once Per Day". This setting doesn't really mean much now that the crawler has been retuned and the hardware is faster. The GSA will submit requests intra daily to the pages it finds.
3) Crawl schedule
I find that it's not better to turn off the crawler but rather keep it on continuous mode and set the threshold at zero. This allows the natural GSA algorithms to play out. Anything you wish to achieve by scheduling can be achieved by tuning it to zero for the periods you want the crawler quiet.
My recommendation for minimizing WAN traffic:
1) Review DNS and add an override if necessary to ensure you are routing to nearest content source
2) Set the content sources pattern to crawl infrequently
3) Create a meta url feed to push content updates.
The last one would take a bit of coding. There is an example sitemap feeder here:
https://code.google.com/p/gsafeedmanager/
With this configuration, the GSA will never recrawl the content and will rely on the feed to inform it of updates.
Alternate:
1) Ensure the content source responds to HEAD requests with LAST Modified Dates. Do not configure crawl infrequently. The GSA will detect deltas and slow the crawl down over time.
Yes, I would also look at the Freshness Tuning and Duplicate Hosts.
Host Load Schedule
Web Server Host Load
Exceptions to Web Server Host Load
Crawl Schedule
Crawl Mode
Freshness Tuning
Crawl Frequently
Crawl Infrequently
As Tan Hong Tat says, look at Freshness Tuning and Duplicate Hosts.
I would set it to crawl infrequently at least until the initial crawl has completed.
Also do some content analysis. Using the Crawl patterns you can direct the GSA to ignore certain content types (based on file extension) or areas of the intranet that don't contain content of value to the search experience.
When you're setting the host load remember that you can use decimal values between 0-1, e.g.: 0.1.
If they have a decent WAN optimizer in place you may find this is less of an issue than you think.

Are there any CDNs that give you full control over HTTP headers?

Are any CDNs (Content Delivery Networks) that provide control and/or customization of all or most HTTP headers?
Specifically, I'm interested in controlling the Expires, ETag, and Cache-Control headers, although other headers interest me as well.
I understand that part of the value proposition of CDNs is that they "just work" and set these headers to somewhat optimal values (for most use cases), but I am definitely interested in controlling these headers myself.
Akamai has a full interface for allowing this type of control on a per-property, per-header basis. It is a standard XML based config file. You can set each header to be a specific value, respect the headers passed through, add if not present, have exceptions based on User Agent etc.
Essentially, within reason, it is completely configurable. I have found setting defaults when absent but allowing applications/admins to set their own values is generally the best approach but it really does depend on the quality and understanding of the developer/admin.
Like most CDN providers Akamai have some default behaviors baked in, but the values are completely configurable. It has been a couple of years since I actively managed a CDN, but at the time Limelight was working on being feature compatible with Akamai and was most of the way there, so I would expect that they have similar functionality now.
In general, most CDN vendors will strive for feature compatibility with the big player in the market and Akamai is definitely it for CDN.

akamai refresh cache before deployment and do cutover at specified time

My objective is to achieve zero downtime during deployment. My site uses akamai as CDN. Lets say I do have primary and secondary cluster of IIS servers. During deployment, the updates are made to secondary cluster. Before switchover from primary to secondary, can I request akamai to cache the content and do a cutover at a specified time?
The problem you are going to have is to guarantee that your content is cached on ALL akamai servers. Is the issue that you want to force content to be refreshed as soon as you cutover?
There are a few options here.
1 - Use a version in the requests "?v=1". This version would ALWAYS be requested from origin and would be appended to every request. As soon as you update your site, update the version on origin, so that the next request will append "?v=2" thus "busting" the cache and forcing an origin hit for all requests
2 - Change your akamai config to "honor webserver TTLs". You can then set very low or almost 0 TTLs right before you cut over and then increase gradually after you cutover
3 - Configure akamai to use If-MOdified-Since. This will force akamai to "validate" if any requests have changed.
4 - Use ECCU which can purge a whole directory, but this can take up to 40 minutes, but should be manageable during a maint window.
I don't think this would be possible based on my experience with Akamai (but things change faster than I can keep up with) - you can flush the content manually (at a cost) so you could flush /* we used to do this for particular files during deployments (never /* because we had over 1.2M URLs) but I can't see how Akamai could cache a non-visible version of your site for instant cut-over without having some secondary domain and origin.
However I have also found that Akamai are pretty good to deal with and it would definitely be worth contacting them in relation to a solution.

Resources