OutputCache - how to determine optimal value for duration? - asp.net

I read somewhere that for a high-traffic site (I guess that is a murky term as well), 30-60 seconds is a good value. Obviously I could do a load test and vary the values, but I couldn't find any kind of documentation on this. Most samples use a minute or a couple of minutes, and there's no recommended range. Is there something on MSDN or anywhere else that talks about this?

This all depends on how frequently the content changes. For slowly changing or non-mutating content, a longer value works perfectly. However, you may need to shorten the value for constantly changing data or risk serving stale output.
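For reference, the duration in question is the value you set on the page or action; here is a minimal sketch in ASP.NET MVC terms (the controller and the LoadArticle helper are made up for illustration; in WebForms the equivalent is the OutputCache page directive):

using System.Web.Mvc;

public class ArticlesController : Controller
{
    // Cache the rendered output of this action for 60 seconds.
    // Each distinct "id" value gets its own cached copy.
    [OutputCache(Duration = 60, VaryByParam = "id")]
    public ActionResult Details(int id)
    {
        return View(LoadArticle(id));   // hypothetical data access
    }

    private object LoadArticle(int id) { return new { Id = id }; }
}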

It all depends on how often a user requests your resource, and how big the resource is.
First, it is important to understand that when you cache something, that resource will remain the same until the cache duration runs out. A short cache duration will tax the web server more than a longer one, but it will provide more up-to-date data should the requested resource change.
Obviously you want to cache database queries as much as possible, prioritizing those that are called often. But everything you cache takes memory on the server, and as resources run low, cache entries will be evicted. Take this into consideration when caching large things for long durations.
If you want data on how often users requests a resource you can use Google Analytics, which is extremely easy to set up.
For very exhaustive analytics you can use Piwik. It requires a local server though.
For frequently changing resources, don't cache at all, unless they are really resource-heavy and it isn't vital for them to be updated in real time.
Giving you an exact number or recommendation would be doing you a disservice; there are too many variables involved.
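To make the "cache the expensive queries, but accept eviction under memory pressure" point concrete, here is a minimal sketch using System.Runtime.Caching; the cache key, the 60-second expiration and the LoadTopProducts helper are illustrative assumptions only:

using System;
using System.Collections.Generic;
using System.Runtime.Caching;

static class ProductCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static IList<string> GetTopProducts()
    {
        // Reuse the cached copy if it is still there; it may have been
        // evicted early if the server came under memory pressure.
        if (Cache.Get("topProducts") is IList<string> cached)
            return cached;

        IList<string> products = LoadTopProducts();   // hypothetical expensive query
        Cache.Set("topProducts", products,
                  new CacheItemPolicy { AbsoluteExpiration = DateTimeOffset.Now.AddSeconds(60) });
        return products;
    }

    private static IList<string> LoadTopProducts() { return new List<string> { "..." }; }
}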

Related

Is it bad practice to store information in the cache for long periods of time?

I have a webpage, which takes a while to load because it has to pull information from lots of local databases. For example, if a user searches for person 1 then it will query 20 databases. It can sometimes take 5 minutes to pull all the information needed and apply the business logic. The best solution is to design a data warehouse, which is a long term aim.
If I use data caching it reduces the page load time (of the big records) from five minutes to four seconds. Is it bad practice to store information in the cache for a long period of time i.e. 24 hours? The cache will be refreshed every 24 hours. Alternatively I could store the cached information in a database table.
Every example I find online caches information for seconds e.g. 20 seconds.
Pros:
Faster load times
Less bandwidth usage
Less stress on the server
Cons:
May require high technical expertise to configure it just right
Will not work for content that is constantly being updated
For many system administrators, especially those with the skills to implement a caching system, the pros greatly outweigh the cons. Caching can make your websites run more smoothly for visitors and lessen the burden on your dedicated server.
For more, check this link.
Cache is used for global resources; if the data in your application is per user, then use Session, which is like a cache per user.
You can also cache results and tie them to database tables, so that when a table is updated the cached entry is invalidated as well; this is called a cache dependency.
The main concern you need to have is what happens if the cached information is not up to date; in that case, use a cache dependency.
Don't worry too much about memory issues on your server; ASP.NET is already optimized and will trim the cache when memory runs low.
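As a rough sketch of what a cache dependency looks like in code (the "MyDb" connection-string entry, the People table and the loader are assumptions, and SqlCacheDependency needs the usual aspnet_regsql / web.config polling setup first):

using System.Web;
using System.Web.Caching;

public static class PeopleCache
{
    public static object GetPeople()
    {
        var cache = HttpRuntime.Cache;
        var people = cache["people"];
        if (people == null)
        {
            people = LoadPeopleFromDatabases();   // hypothetical 20-database lookup
            // The entry is evicted automatically when the People table changes.
            cache.Insert("people", people, new SqlCacheDependency("MyDb", "People"));
        }
        return people;
    }

    private static object LoadPeopleFromDatabases() { return new object(); }
}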
I hope this helps you.

Are "forever" and "never" the only useful caching durations?

In the presentation "Cache is King" by Steve Souders (at around 14:30), it is implied that there are in practice only two caching durations that you should use for your resources: "forever" and "never" (my own terminology).
"Forever" means that you effectively make the resource permanently immutable by setting a very high max age, such as one year. If you want to modify the resource at some point, the presentation suggests, you simply publish the modified resource at a different URL. (It is suggested that this renaming is necessary, in part or entirely, because of the large number of misconfigured proxies on the Internet.)
"Never" means that you effectively disable all forms of caching and require browsers to download the resource every time it is requested.
On the one hand, any performance advice given by the head performance engineer at Google carries weight on its own. On the other hand, HTTP caching was presumably designed with variable cache durations for a reason (not just "forever" and "never"), and changing the URL to a resource only because the resource has been modified seems to go against the spirit of HTTP.
Are "forever" and "never" the only cache durations that you should use in practice? Is this in conflict with other best practices on the web?
In addition to the typical "user with a browser" use case, I would also like to know how these principles apply to REST/hypermedia APIs.
Many people would disagree with limiting yourself to "forever" or "never" as you describe it.
For one thing, it ignores the option of allowing caching while always revalidating. In this case, if the client (or proxy) has cached the resource, it sends a conditional HTTP request. If the client or proxy has the latest version of the resource, the server sends a short 304 response rather than the entire resource. If the cached copy is out of date, the server sends the entire resource.
With this scheme, the client will always get an up-to-date version of the resource, and if the resource doesn't change much bandwidth will be saved.
To save even more bandwidth, the client can be instructed to revalidate only when the resource is older than a certain period of time.
And if bad proxies are a problem, the server can specify that only clients and not proxies may cache the resource.
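A sketch of that revalidation scheme in ASP.NET MVC terms; the ETag computation, the ten-minute max-age and the report loader are placeholders:

using System;
using System.Web;
using System.Web.Mvc;

public class ReportsController : Controller
{
    public ActionResult Daily()
    {
        string etag = ComputeReportEtag();                          // hypothetical version stamp
        Response.Cache.SetCacheability(HttpCacheability.Private);   // clients may cache, shared proxies may not
        Response.Cache.SetMaxAge(TimeSpan.FromMinutes(10));         // only revalidate once older than 10 minutes
        Response.Cache.SetETag(etag);

        // Conditional request: the client already has this version, so send no body.
        if (Request.Headers["If-None-Match"] == etag)
            return new HttpStatusCodeResult(304);

        return View(LoadReport());                                  // hypothetical data load
    }

    private string ComputeReportEtag() { return "\"v1\""; }
    private object LoadReport() { return new object(); }
}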
I found that this document describes your options for caching pretty concisely. This page is longer but also gives some excellent information.
"It depends" really, on your use case, what you are trying to achieve, and your branding proposition.
If all you want to achieve is some bandwidth saving, you could do a total cost breakdown. Serving cost might not amount to much. Browsers are already pretty smart at optimizing image requests, for example, so understand your HTTP protocol. "Forever", combined with versioned resource URLs and URL rewrite rules, might be a good fit, as the Google engineer suggested.
Resource volatility is another factor. If you are only serving daily stock charts, for example, they could safely be cached for some time but not forever.
Are your computation costs heavy? Are your users sensitive to timeliness? Is the data live or fixed? For example, you might be serving airline routes, the path of a hurricane, option Greeks or a BI report for the COO. You might want it cached, but the TTL will likely vary by user class, all the way down to never. Forever cannot work for live data, but never might be the wrong answer too.
Degree of cooperation between the server and the client may be another factor. For example in a business operations environment where procedures can be distributed and expected to be followed, it might be worthwhile to again look at TTLs.
HTH. I doubt there is a magical answer.
Ideally, you should cache until the content changes; if you cannot clear or refresh the cache when the content changes, for whatever reason, you need a duration. But indeed, if you can, cache forever or do not cache at all. There is no need to refresh if you already know nothing has changed.
If you know that the underlying data will be static for any length of time, caching makes sense. We have a web service that exposes data from a database that is populated by a nightly ETL job from an external source. Our RESTful web service only goes back to the database when the data changes. In our case, we know exactly when the data changes, so we invalidate the cache right after the ETL process finishes.
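A minimal sketch of that last step, assuming made-up cache keys and URL paths: the ETL completion hook simply drops the stale entries so the next request rebuilds them.

using System.Runtime.Caching;
using System.Web;

static class EtlCacheInvalidation
{
    // Call this right after the nightly ETL job finishes.
    public static void InvalidateCaches()
    {
        // Drop the cached data so the next request reloads it from the database.
        MemoryCache.Default.Remove("nightlyReportData");

        // Drop the cached page output for the affected URL.
        HttpResponse.RemoveOutputCacheItem("/reports/daily");
    }
}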

How to detect a reasonable number of concurrent requests I can safely perform on someone's server?

I crawl some data from the web, because there is no API. Unfortunately, it's quite a lot of data from several different sites, and I quickly learned that I can't just make thousands of requests to the same site in a short time... I want to fetch the data as fast as possible, but I don't want to cause a DoS attack :)
The problem is, every server has different capabilities and I don't know them in advance. The sites belong to my clients, so my intention is to prevent any possible downtime caused by my script. So no policy like "I'll try a million requests first and if that fails, I'll try half a million, and if that fails..." :)
Is there any best practice for this? How does Google's crawler know how many requests it can make to the same site at the same time? Maybe they "shuffle their playlist", so there are not as many concurrent requests to a single site. Could I detect this stuff somehow via HTTP? Send a single request, measure the response time, roughly guess how well the server scales, and then somehow work out a maximum number of concurrent requests?
I use a Python script, but this doesn't matter much for the answer - just to let you know in which language I'd prefer your potential code snippets.
The Google spider is pretty damn smart. On my small site it hits me one page per minute, to the second. They obviously have a page queue that is filled with timing and sites in mind. I also wonder if they are smart enough to avoid hitting multiple domains on the same server -- so some recognition of IP ranges as well as URLs.
Separating the job of queueing up the URLs to be spidered at a specific time from the actual spidering job would be a good architecture for any spider. All of your spiders could use a urlToSpiderService.getNextUrl() method which would block (if necessary) until the next URL is due to be spidered.
I believe that Google looks at the number of pages on a site to determine the spider speed. The more pages you have to refresh in a given time, the faster they need to hit that particular server. You certainly should be able to use that as a metric, although before you've done an initial crawl it would be hard to determine.
You could start out at one page per minute and then, as the number of pages to be spidered for a particular site increases, decrease the delay. Some sort of function like the following would be needed:
// spiderQueue, refreshPeriod and minimumDelay are assumed fields (refreshPeriod and minimumDelay are java.time.Duration).
public Duration delayBetweenPages(String domain) {
    int pagesQueued = Math.max(spiderQueue.countPagesFor(domain), 1); // pages in the to-do queue for this domain
    Duration delay = refreshPeriod.dividedBy(pagesQueued);            // spread the refresh window across them
    if (delay.compareTo(Duration.ofMinutes(1)) > 0) return Duration.ofMinutes(1); // cap: at least one page a minute
    if (delay.compareTo(minimumDelay) < 0) return minimumDelay;       // floor: never hammer the server
    return delay;
}
Could I detect this stuff somehow via HTTP?
With the modern internet, I don't see how you can. Certainly, if the server is taking a couple of seconds to respond or is returning 500 errors, then you should be throttling way back, but a typical connection and download is sub-second these days for a large percentage of servers, and I'm not sure there is much to be learned from any stats in that area.
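If you do want to react to those two signals anyway, here is a rough sketch of an adaptive delay (the starting delay, thresholds and class name are all arbitrary assumptions, not a recommendation):

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical throttle: slow responses or server errors widen the delay,
// fast healthy responses tighten it again gradually.
class PoliteFetcher
{
    private static readonly HttpClient Client = new HttpClient();
    private TimeSpan _delay = TimeSpan.FromSeconds(5);   // assumed starting delay

    public async Task<string> FetchAsync(string url)
    {
        await Task.Delay(_delay);
        var watch = Stopwatch.StartNew();
        var response = await Client.GetAsync(url);
        watch.Stop();

        if (!response.IsSuccessStatusCode || watch.Elapsed > TimeSpan.FromSeconds(2))
            _delay = TimeSpan.FromTicks(_delay.Ticks * 2);             // back off hard
        else if (_delay > TimeSpan.FromSeconds(1))
            _delay = TimeSpan.FromTicks((long)(_delay.Ticks * 0.9));   // recover gently

        return await response.Content.ReadAsStringAsync();
    }
}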

Is caching a good idea? If so, where?

I have an asp.net web site with 10-25k visitors a day (peaks of over 60k before holidays). Pages/visit is also high, since it's a content site.
I have a few specific pages which generate about 60% of the traffic. These pages are a bit complex and are DB heavy (sql server 2008 r2 backend).
I was wondering if it's worth "caching" a static version of these pages (I hear this is possible) and only re-rendering them when something changes (about once every 48 hours).
Does this sound like a good idea? Where would be the best place to implement this?
(asp.net, iis, db)
Update: It looks like a good option for me is OutputCache with SqlDependency. I see references to some kind of SQL Server notification for invalidating the cache, but I only see it discussed for SQL Server 2005. Has this option been deprecated by Microsoft? Is there a newer way to handle this?
Caching is a broad term that can happen at a number of different points. The optimum solution may be a combination of some or all.
For example, you can add page or output caching, as described here, which caches output on the web server; I think this is what you were referring to.
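To tie this to the update about SqlDependency: the OutputCache attribute can be pointed at a table so the cached output is dropped when the table changes. A sketch, where the "ContentDb" connection entry and the Articles table are assumptions and the usual aspnet_regsql / web.config polling setup is still required:

using System.Web.Mvc;

public class ContentController : Controller
{
    // Output is cached for up to an hour, but is invalidated as soon as the
    // Articles table in the "ContentDb" connection entry changes.
    [OutputCache(Duration = 3600, VaryByParam = "id", SqlDependency = "ContentDb:Articles")]
    public ActionResult Article(int id)
    {
        return View(LoadArticle(id));   // hypothetical data access
    }

    private object LoadArticle(int id) { return new { Id = id }; }
}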
In addition, you can cache the data in memory using something like memcached, so that your data is more available to the web server as it builds the page, but you need to look at cache hit rate to know for sure that you are caching the right data.
Also, although slightly off the topic of improving db heavy pages, you can cache static resources that change infrequently like images, css and include files using a content delivery network. Any CDN will almost certainly have a higher bandwidth and a cheaper data plan than your own connection because of the economies of scale, so the more of your content you can serve from there the better, in general.
Your first question was "I was wondering if it's worth "caching" a static version of these pages". I guess the answer to that depends on whether there is a performance problem at the moment, and where the cause of that problem is. If the pages are being served quickly and reliably, then quite possibly it's not worth implementing caching. If there is a performance problem, then where is it? Is it in db read time, or is it in the time spent building the page once the data has been returned?
I don't have much experience in caching, but this is what I would try to do:
I would look at your stats and run some profiles to see which of the most heavily visited pages run the most expensive SQL queries. Pick one or two of the most expensive pages.
If the page is pseudo-static, that is, it has no per-user data on it such as your logged-in username, no comments, etc., you can cache the entire page. You can set a relatively long cache duration as well, anything from one minute to a few hours.
If the page has some dynamic, real-time content on it, such as comments, you can identify the static controls and cache those individually. Don't put a page-wide cache on it.
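A minimal sketch of caching just a fragment, in MVC terms (the widget names are made up; in WebForms the equivalent is an OutputCache directive inside the user control itself):

using System.Web.Mvc;

public class WidgetsController : Controller
{
    // Rendered from the parent view with @Html.Action("PopularArticles", "Widgets").
    // Only this fragment is cached; the rest of the page is built on every request.
    [ChildActionOnly]
    [OutputCache(Duration = 600)]
    public ActionResult PopularArticles()
    {
        return PartialView(LoadPopularArticles());   // hypothetical expensive query
    }

    private object LoadPopularArticles() { return new object(); }
}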
Good luck, sounds like a cache could improve performance.
Caching may or may not help. For example, if a site has low traffic and caching is enabled, the server has to do the extra work of building the cache entry before serving the request. And because the traffic is low, there can be enough of a gap between successive requests that the cached version expires and the server has to build a new one all over again. This can make the response even slower than normal.
Read more: Caching - the good, the bad.
I have experienced this issue myself.
If the traffic is good, caching may help you get better load times.
Cheers
Aditya

What is normal memory usage of an asp.net mvc website?

I have really tried to Google it, but only articles about how to troubleshoot memory issues come up. Before I start troubleshooting, I would like to know whether my web site's memory usage is really abnormal or not.
It is an asp.net mvc 2 website that runs on IIS 7.5 in production. I guess normal memory usage depends on traffic, so here are the numbers for an average day:
300 unique visitors
400 visits
3000 page views
I would be really happy to get some idea of how much memory usage is normal for this traffic. I would also be curious to know how memory usage normally grows as traffic grows.
Thanks a lot
It's pretty much impossible to define "normal memory usage" for anything without a more complete specification.
For example, if you cache large quantities of data in memory, that will affect the "normal memory usage" of your application. One thing that can particularly skew this is data cached in response to a user action. There could be a scenario that users trigger on one in a thousand visits to your site that causes 75 MB of additional data to be cached, which might (depending on the usual data set) look like a significant difference.
