How do CDNs work? - web-deployment

I am new to webdev and would like to understand how CDNs work.
Specifically, how do CDNs achieve performance when retrieving content? Is the content stored on disk, in a database in binary format, or on disk with only its location stored in a database?
How is the data kept in sync? Does the end user push new or updated content to just one location, with the CDN taking care of synchronizing it?
When is it wise to use a CDN, and are there alternatives aside from storing the data on disk?

A content delivery network or content distribution network (CDN) is a globally distributed network of proxy servers deployed in multiple data centers.
CDNs are useful for a multitude of reasons. For website owners with visitors in multiple geographic locations, content is delivered faster because it has less distance to travel. CDN users also benefit from being able to scale up and down more easily during traffic spikes. On average, about 80% of a website consists of static resources, so serving those through a CDN puts much less load on the origin server.
Source
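As supplementary context: a CDN can only cache what the origin allows it to cache, which is usually controlled with Cache-Control headers on static resources. Below is a minimal sketch of an origin doing that, using Python's built-in http.server; the one-year max-age and the file-extension list are illustrative choices, not recommendations for any particular site.

    # Minimal origin server that marks static files as cacheable, so a CDN edge
    # (or any shared cache) is allowed to keep and re-serve copies of them.
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    class CacheableStaticHandler(SimpleHTTPRequestHandler):
        def end_headers(self):
            # A long max-age tells downstream caches (CDN edges) they may reuse
            # the response without revalidating; everything else stays dynamic.
            if self.path.endswith((".jpg", ".png", ".css", ".js")):
                self.send_header("Cache-Control", "public, max-age=31536000")
            else:
                self.send_header("Cache-Control", "no-cache")
            super().end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8000), CacheableStaticHandler).serve_forever()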

Related

Do I need a CDN or can I just go with an nginx load balancer (cache)?

I have a system that will handle image optimization and resizing for a client who runs a news portal with lots of pageviews. We will provide only the images to this portal, and the users are all in the same country as our server. The question is: what is the best strategy in terms of cost-benefit?
Route all (or most) image traffic via a paid CDN
Set up an internal image server using nginx and a load balancer
We estimate a monthly bandwidth of 11 TB, with millions of requests (images only).
It is not a question of whether it is possible, or even of which option is more cost efficient in general.
You need to calculate the costs based on many factors: the actual sizing of your servers, the number of servers, bandwidth, where the servers are located, and much more.
It will probably be a lot of work to set up and maintain/monitor your own CDN, but you can certainly do it.
I don't think anybody can create this calculation for you. See the comment from Rob. It is not really a question for SO.
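To make that calculation concrete, a back-of-the-envelope comparison for the 11 TB/month figure might look like the sketch below. Every price in it is a hypothetical placeholder, not a quote from any provider, so substitute your own numbers for server sizing, bandwidth, and operations time.

    # Rough monthly cost comparison for ~11 TB of image traffic.
    # All prices are hypothetical placeholders -- plug in real quotes.
    MONTHLY_TB = 11
    GB_PER_TB = 1000

    cdn_price_per_gb = 0.05          # assumed CDN egress price (USD/GB)
    cdn_cost = MONTHLY_TB * GB_PER_TB * cdn_price_per_gb

    server_count = 2                 # assumed nginx cache servers for redundancy
    server_monthly_cost = 80         # assumed per-server hosting cost (USD)
    bandwidth_overage_cost = 200     # assumed cost beyond included transfer
    ops_hours, hourly_rate = 10, 50  # assumed setup/monitoring time per month

    self_hosted_cost = (server_count * server_monthly_cost
                        + bandwidth_overage_cost
                        + ops_hours * hourly_rate)

    print(f"CDN:         ~${cdn_cost:,.0f}/month")
    print(f"Self-hosted: ~${self_hosted_cost:,.0f}/month")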

What kind of caching model does Content distribution network use?

What kind of caching model do content distribution networks use?
Specifically, what do providers such as Akamai, EdgeCast, BitGravity, Cotendo, etc. do on a cache miss: do they go back to the source and then distribute the content internally?
I would assume that each CDN uses a slightly different architecture. Akamai uses two tiers of its own servers: the edge nodes, which make up most of its servers, and a second, smaller internal ring of replicated web servers.
If an item cannot be found on the edge node, it requests the information from an inner web server; if that fails, it eventually falls back to the origin, your server.
So yes, requests do fall back to the source if the content cannot be found in the CDN.
The servers do some replication amongst each other, but you can't guarantee how many servers the information is replicated to, and you have no idea how long each one will cache it for.
On an Akamai server, the more an item is requested, the longer it stays in the cache. But this is not per customer; it applies to all requests hitting that machine. So if your information is on a server that is also being used by a site more popular than yours, it may not be cached very long. When I spoke to them, they couldn't give me that level of detail.
discovery.com Akamai CDN Article
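The edge node → inner ring → origin fallback described above can be sketched roughly as follows. This illustrates only the lookup order, with made-up class names and plain in-memory dictionaries; it is not how Akamai or any other CDN is actually implemented, and real CDNs add TTLs, popularity-based eviction, and replication on top.

    # Illustrative tiered lookup: edge cache -> inner ring -> origin server.
    import urllib.request

    class CacheTier:
        def __init__(self, name, parent=None):
            self.name, self.parent, self.store = name, parent, {}

        def get(self, url):
            if url in self.store:                 # cache hit at this tier
                return self.store[url]
            if self.parent is not None:           # miss: ask the next tier up
                body = self.parent.get(url)
            else:                                 # last tier: go to the origin
                body = urllib.request.urlopen(url).read()
            self.store[url] = body                # fill the cache on the way back
            return body

    inner_ring = CacheTier("inner-ring")
    edge = CacheTier("edge-node", parent=inner_ring)
    # The first request falls through to the origin; repeats are served
    # from the edge without touching the origin again.
    # html = edge.get("https://example.com/")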

Traffic Performance Testing Webpages Under Specified Conditions

As the title implies, I would like to be able to simulate traffic to a collection of webpages that I have created for loadbalancing and bottleneck issues. I would like to mimic typical HTTP requests relative to the upload/download speed of the user. Furthermore, I would like to be able to perform extreme tests assuming a certain amount of storage and bandwidth on a server(s).
How should I go about doing this?
Look at Apache Flood: http://httpd.apache.org/test/flood/
Good description: http://www.clove.org/flood-presentation/flood.pdf
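If you just want a quick sanity check before reaching for a dedicated tool, a throwaway concurrency test can be sketched in a few lines of Python. The URL and load shape below are placeholders, and this only measures response times under concurrent requests; it does not simulate the per-user upload/download speeds asked about, which dedicated load-testing tools handle better.

    # Minimal concurrent load sketch: N workers each fetch the page repeatedly
    # and the script reports the average latency.
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost:8080/page-under-test"   # placeholder target
    WORKERS, REQUESTS_EACH = 20, 50                  # placeholder load shape

    def hammer(_):
        timings = []
        for _ in range(REQUESTS_EACH):
            start = time.perf_counter()
            urllib.request.urlopen(URL).read()
            timings.append(time.perf_counter() - start)
        return timings

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        all_timings = [t for batch in pool.map(hammer, range(WORKERS)) for t in batch]

    print(f"{len(all_timings)} requests, "
          f"avg {sum(all_timings) / len(all_timings) * 1000:.1f} ms")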

Which is the fastest way to load images on a webpage?

I'm building a new site, and during the foundation stage I'm trying to assess the best way to load images. Browsers have a limit of 2-6 items they can load concurrently (images/CSS/JS). Through the grapevine I've heard of various methods, but no definitive answer on which is actually fastest.
Relative URLs:
background-image: url(images/image.jpg);
Absolute URLs:
background-image: url(http://site.com/images/image.jpg);
Absolute URLs (with sub-domains):
background-image: url(http://fakecdn.site.com/images/image.jpg);
Will a browser recognize my "fakecdn" subdomain as a different domain and load images from it concurrently in a separate thread?
Do images referenced in a #import CSS file load in a separate thread?
The HTTP 1.1 spec suggests that browsers should not open more than two connections to a given domain:
Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.
So, if you are loading many medium-sized images, it may make sense to put them on separate FQDNs so that the 2-connection limit is not the bottleneck. For small images, the cost of a new socket connection to each FQDN may outweigh the benefits. Similarly, for large images, the client's network bandwidth may be the limiting factor.
If the images are always displayed, then using a data URI may be fastest, since no separate connection is required and the images can be included in the stream in the order they are needed.
However, as always with optimizing for performance, profile first!
See
Wikipedia - data uri
For lots of small images, social media icons being a good example, you'll also want to look into combining them into a single sprite map. That way they'll all load in the same request, and you just have to do some background-positioning when using them.
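For the data URI option mentioned above, generating the inline value is straightforward. A minimal sketch, assuming a small icon file (the file name and MIME type are placeholders):

    # Turn a small image into a data URI so it ships inside the CSS/HTML
    # and needs no extra connection. Only worthwhile for small, always-shown images.
    import base64

    def to_data_uri(path, mime="image/png"):
        with open(path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")
        return f"data:{mime};base64,{encoded}"

    # Example: emit a CSS rule with the image inlined (icon.png is a placeholder).
    print(f"background-image: url({to_data_uri('icon.png')});")

The trade-off is that the inlined bytes are base64-encoded (roughly a third larger) and cannot be cached separately from the stylesheet or page that contains them.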

Harvesting Dynamic HTTP Content to produce Replicating HTTP Static Content

I have a slowly evolving dynamic website served from J2EE. The response time and load capacity of the server are inadequate for client needs. Moreover, ad hoc requests can unexpectedly affect other services running on the same application server/database. I know the reasons and can't address them in the short term. I understand HTTP caching hints (expiry, etags....) and for the purpose of this question, please assume that I have maxed out the opportunities to reduce load.
I am thinking of doing a brute force traversal of all URLs in the system to prime a cache and then copying the cache contents to geodispersed cache servers near the clients. I'm thinking of Squid or Apache HTTPD mod_disk_cache. I want to prime one copy and (manually) replicate the cache contents. I don't need a federation or intelligence amongst the slaves. When the data changes, invalidating the cache, I will refresh my master cache and update the slave versions, probably once a night.
Has anyone done this? Is it a good idea? Are there other technologies that I should investigate? I can program this, but I would prefer a solution built by configuring open source technologies.
Thanks
I've used Squid before to reduce load on dynamically-created RSS feeds, and it worked quite well. It just takes some careful configuration and tuning to get it working the way you want.
Using a primed cache server is an excellent idea (I've done the same thing using wget and Squid). However, it is probably unnecessary in this scenario.
It sounds like your data is fairly static and the problem is server load, not network bandwidth. Generally, the problem exists in one of two areas:
Database query load on your DB server.
Business logic load on your web/application server.
Here is a JSP-specific overview of caching options.
I have seen huge performance increases by simply caching query results. Even adding a cache with a duration of 60 seconds can dramatically reduce load on a database server. JSP has several options for in-memory cache.
Another area available to you is output caching. This means that the content of a page is created once, but the output is used multiple times. This reduces the CPU load of a web server dramatically.
My experience is with ASP, but the exact same mechanisms are available on JSP pages. In my experience, with even a small amount of caching you can expect a 5-10x increase in max requests per sec.
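The 60-second query cache mentioned above is language-agnostic; the answers here are about ASP/JSP, but the mechanism can be sketched in a few lines of Python (the run_query callable and the db object in the usage comment are stand-ins for whatever actually hits your database):

    # Tiny time-based cache for expensive query results. A 60-second window
    # means the database sees at most one of these queries per minute per key.
    import time

    _cache = {}          # key -> (expires_at, value)
    TTL_SECONDS = 60

    def cached_query(key, run_query):
        now = time.time()
        hit = _cache.get(key)
        if hit and hit[0] > now:
            return hit[1]                        # still fresh: skip the database
        value = run_query()                      # stand-in for the real DB call
        _cache[key] = (now + TTL_SECONDS, value)
        return value

    # usage: headlines = cached_query("front-page", lambda: db.fetch_headlines())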
I would use tiered caching here; deploy Squid as a reverse proxy server in front of your app server as you suggest, but then deploy a Squid at each client site that points to your origin cache.
If geographic latency isn't a big deal, then you can probably get away with just priming the origin cache like you were planning to do and then letting the remote caches prime themselves off that one based on client requests. In other words, just deploying caches out at the clients might be all you need to do beyond priming the origin cache.
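Priming the origin cache amounts to walking every known URL through the reverse proxy once. A minimal sketch, assuming you already have the URL list in a file and that the proxy (e.g. Squid) listens on port 3128; the host name, port, and file name are placeholders:

    # Warm a reverse-proxy cache by requesting every known URL through it once.
    import urllib.request

    PROXY = {"http": "http://cache-host:3128"}   # placeholder proxy address
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))

    with open("all_urls.txt") as f:              # one URL per line (placeholder file)
        for url in (line.strip() for line in f if line.strip()):
            try:
                opener.open(url).read()          # pulls the page into the cache
            except Exception as exc:
                print(f"failed to prime {url}: {exc}")

A cron job running this after the nightly master cache refresh would match the once-a-night update cycle described in the question.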

Resources