SEO crawler DDoSing sites - asp.net

I have a customer that runs 36 websites (many thousands of pages) on a round-robin, sticky-affinity load-balanced set of IIS servers. The infrastructure is entirely AWS based (r3.2xl: 8 vCPU, 60.5 GiB RAM).
To get straight to the point, the sites are configured to 'cache on access' using standard in-memory caching with ASP.NET 4.6, with static assets served through CloudFront. On a 'cold start' a site makes both SQL Server queries for content and separate Elasticsearch queries at runtime to determine hreflang alternate-language tags - essentially asking which versions of the URL are available in other languages, for SEO reasons. This query has already been optimised from a cross-index wildcard query down to a lookup on a single index. As mentioned, the entire result is cached for 24h once all of this has executed.
Under normal use conditions the site works perfectly. As there are 36 sites running on a single box, the private working set gets allocated up to the max (99%) of physical RAM over time as more and more content is cached in memory. I can end up with app pools in excess of 1.5 GiB, which isn't ideal. Beyond this point, presumably, the .NET cache's LRU eviction algorithm is working overtime.
The problem I have: after some post-mortem review of the IIS logs, it turns out the customer is using an SEO bot tool, SEMrush, which essentially triggers a denial-of-service attack against the sites (a thundering herd?) through simultaneous requests for the 'long tail' of pages which are never viewed by a real user and hence aren't in the cache.
The net result is a server brought to its knees: app pool CPU usage all over the place, an Elasticsearch queue length > 1000, huge ES heap growth, a rising rejection rate - and eventually a crash.
The solutions I've thought about but haven't implemented:
Put CloudFront in front of all the sites and use a warm-up script (although I don't think this will actually help, as it's a cold-start problem when all the pages expire - unless I could have a cache invalidation mechanism that only invalidated the most-requested pages, say > 100 requests, and left everything else persistent). A per-key lock on cache population might also help here; see the sketch after this list.
AWS Shield/WAF to provide some sort of rate limiting
Remove the runtime ES lookup altogether and move to an eventually-consistent model that computes the hreflang lookup table in a separate process. However, the ES cluster, whilst on an old version (v1.3.1), is a 3-node cluster with plenty of CPU and a 16 GiB min/max heap per node, so it should be able to take that level of throughput?
Or all 3!
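For what it's worth, the change I have in mind for the caching side is a per-key lock around cache population, so that concurrent cold requests for the same URL only do the SQL + ES work once. A minimal sketch, assuming System.Runtime.Caching.MemoryCache and a hypothetical renderPage delegate that does the actual SQL/Elasticsearch work:

using System;
using System.Collections.Concurrent;
using System.Runtime.Caching;

public static class PageCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    // One lock object per cache key, so concurrent misses for the same
    // page don't all run the expensive SQL + Elasticsearch work.
    // (Sketch only: lock objects are never removed from this dictionary.)
    private static readonly ConcurrentDictionary<string, object> Locks =
        new ConcurrentDictionary<string, object>();

    public static string GetOrAdd(string url, Func<string> renderPage)
    {
        var cached = Cache.Get(url) as string;
        if (cached != null)
            return cached;

        var keyLock = Locks.GetOrAdd(url, _ => new object());
        lock (keyLock)
        {
            // Re-check inside the lock: another request may have populated
            // the entry while we were waiting.
            cached = Cache.Get(url) as string;
            if (cached != null)
                return cached;

            var html = renderPage();   // SQL + ES lookups happen in here
            Cache.Set(url, html, DateTimeOffset.UtcNow.AddHours(24));
            return html;
        }
    }
}

The crawler would still gradually fill the cache with long-tail pages, but each URL would then cost one SQL/ES round trip per 24 hours rather than one per concurrent request.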
Has anyone come across this problem before, and what was your solution? It must be fairly common, especially for large sites which are hammered by SEO / DQM web crawlers.

Related

Multiple WordPress site installation

Guys, I have a question.
Let's say I want to upload hundreds of thousands of posts / products to WordPress, which will slow down the website's performance, and the database will also keep getting bigger.
What if I split the WordPress site into several separate installations in different subdirectories, based on the post or product category? One site would then only contain 25-30k posts / products, but there would be around 10 of these installations, so each database would be a lot smaller.
Do you think this would perform better than putting everything in a single website?
My server has around 16 GB RAM and 8 CPU cores.
I don't think it will make any difference, given that you will run it on the same hardware. With multiple machines and one ingress node / load balancer you could route requests to different backend servers based on the product requested, but if you have only one server hosting everything - web server, database, etc. - you will hit the limits of CPU/RAM/etc. long before the size of a database table becomes the problem (given it's properly designed, has indices and so on).
However, you can measure the performance of both setups using a load testing tool and see what response times, resource usage and the database slow query log look like in each deployment scenario.
Data size doesn't have to slow the site. It becomes a matter of how fast you can get the data from the DB. A few things to consider:
Place the database on a dedicated host. If locally hosted, dedicate a crossover cable from the web tier to the DB tier, with a second IP for admin on the database host. You might also consider a managed instance of your database with a cloud provider.
Indexes are your friend. Larger datasets result in longer indexes, but you can make deliberately shortened indexes. Choose a database that supports partitioned indexes. Combine partitioned indexes with the higher I/O operations per second of SSDs for your index partitions, and ensure that all lookups go via an index, and performance won't suffer even for large data sets. How does a partitioned index increase access speed? Instead of having to traverse an index from A to S for an index-supported query with an S-based WHERE clause, with a partitioned index you might have 26 indexes - one for A, then B, then C, then ... - and you jump straight to the S partition for the lookup.
Shape your pool size on the PHP/web tier. You have already increased the effective pool size by pulling the database onto its own host. The next thing to do is to effectively manage your cache of fixed assets - the items that do not change across user sessions. Commonly these are style sheets, images, fonts, JavaScript files, etc. At a minimum, look at a cache node in front of your WordPress site; take a look at Varnish or Nginx for this. I am partial to Varnish, but either should do the trick. If you pair this with a CDN for a multi-generational cache, all the better. If you are in the cloud then you have built-in CDN options with each cloud provider. You can also widen your bandwidth by placing these fixed assets on a dedicated host and caching that one host, but this would require a lot of base modification of your WordPress image.
There is no reason why you cannot have multiple web front ends with a common database back end. You would need a load balancer to distribute the load, and your first-generation cache would sit in front of the load balancer. Realistically, if all of your queries are index-supported and your cache is effectively managed, you can easily scale to hundreds of concurrent users on moderate hardware. Your most taxing item is going to be the PHP execution that pulls dynamic data for user sessions. Make the queries respond as fast as possible and you are left with only a small lock window in PHP for each session.
Watch your locks per session! You may be at the mercy of a template and how it manages your finite resource pool, but in general: (a) unless 50%+1 of users use something, do not allocate it early; (b) be merciless in cutting sessions to release the session-based locks on memory; (c) pinch your assets until they bleed - no 45 MB images on the front page when a colour-optimized 120 KB compressed image will do the job; (d) watch the repeat-access problem - this applies to subqueries in the database as well as to building a web page with hundreds of assets to resolve.
Have you considered other options, such as Drupal? The setup is a bit more complex, but I can vouch for running a dozen distinct websites out of a single Drupal instance with no degradation in performance, using the dedicated database and cache nodes described above, with hundreds of concurrent users on fairly moderate hardware (mini-ITX Atom-based PCs).

Optimizing Google Search Appliance on a remote server

I'm planning to deploy a Google Search Appliance to remotely index an intranet site (transcontinentally). So I will be using the company's network and potentially consuming too much bandwidth.
Regarding the configurations that I can use to mitigate the effect of the initial crawl (which is the only one that is perceived as dangerous for the network) we have:
Crawl and Index > Host Load Schedule
Web Server Host Load: basically the number of concurrent connections to the crawled servers within 1 minute, so minimizing this setting should reduce the load on the crawled web servers.
Exceptions to Web Server Host Load: this is a schedule used for either increasing or decreasing the number of concurrent connections to the crawled server.
Crawl and Index > Crawl Schedule
Instead of a continuous crawl I should choose a scheduled crawl.
Am I on the right track and can other settings be configured in order not to generate excessive network traffic between the GSA and the Web servers?
The best way to minimize the crawling of a remote site is not to crawl it at all. Failing that, there are a couple of settings that will help, as noted above:
1) Host Load Schedule
This sets the number of concurrent threads the crawler uses for the host. Note that this can be a fractional number below 1 (e.g. 0.25) (also noted by BigMikeW).
2) Freshness Tuning
Crawl infrequently actually means "crawl never again". This works well in conjunction with a meta-URL feed, which tells the GSA to recrawl a page, or with a recrawl request from the administrative console. Crawl frequently actually means "crawl once per day". This setting doesn't mean much now that the crawler has been retuned and the hardware is faster: the GSA will submit requests several times a day to the pages it finds.
3) Crawl schedule
I find it better not to turn off the crawler, but rather to keep it in continuous mode and set the threshold to zero. This allows the natural GSA algorithms to play out. Anything you would achieve by scheduling can be achieved by tuning the threshold to zero for the periods when you want the crawler quiet.
My recommendation for minimizing WAN traffic:
1) Review DNS and add an override if necessary to ensure you are routing to nearest content source
2) Set the content source's crawl patterns to crawl infrequently
3) Create a meta url feed to push content updates.
The last one would take a bit of coding. There is an example sitemap feeder here:
https://code.google.com/p/gsafeedmanager/
With this configuration, the GSA will never recrawl the content and will rely on the feed to inform it of updates.
Alternate:
1) Ensure the content source responds to HEAD requests with a Last-Modified date, and do not configure crawl infrequently. The GSA will detect deltas and slow the crawl down over time.
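To illustrate that last point, here is a minimal sketch of what the content-source side could look like in ASP.NET: a handler that sends a Last-Modified header and answers conditional requests with 304 Not Modified, so unchanged pages cost the crawler (and the WAN) almost nothing. The handler name and the GetLastUpdated/RenderPage helpers are made up for the example.

using System;
using System.Web;

// Hypothetical handler for intranet pages. It exposes Last-Modified on
// HEAD/GET and returns 304 Not Modified when the crawler sends
// If-Modified-Since, so unchanged pages generate almost no traffic.
public class IntranetPageHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        // Placeholder: look up when this page's content last changed.
        DateTime lastUpdated = GetLastUpdated(context.Request.Path);

        context.Response.Cache.SetCacheability(HttpCacheability.Public);
        context.Response.Cache.SetLastModified(lastUpdated);

        string ifModifiedSince = context.Request.Headers["If-Modified-Since"];
        DateTime since;
        if (ifModifiedSince != null &&
            DateTime.TryParse(ifModifiedSince, out since) &&
            lastUpdated <= since.ToUniversalTime().AddSeconds(1))
        {
            context.Response.StatusCode = 304; // Not Modified: no body sent
            return;
        }

        if (context.Request.HttpMethod == "HEAD")
            return; // headers only

        context.Response.ContentType = "text/html";
        context.Response.Write(RenderPage(context.Request.Path));
    }

    private static DateTime GetLastUpdated(string path)
    {
        // Hypothetical: read this from your CMS or database.
        return new DateTime(2015, 1, 1, 0, 0, 0, DateTimeKind.Utc);
    }

    private static string RenderPage(string path)
    {
        return "<html><body>" + HttpUtility.HtmlEncode(path) + "</body></html>";
    }
}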
Yes, I would also look at the Freshness Tuning and Duplicate Hosts.
Host Load Schedule
    Web Server Host Load
    Exceptions to Web Server Host Load
Crawl Schedule
    Crawl Mode
Freshness Tuning
    Crawl Frequently
    Crawl Infrequently
As Tan Hong Tat says, look at Freshness Tuning and Duplicate Hosts.
I would set it to crawl infrequently at least until the initial crawl has completed.
Also do some content analysis. Using the Crawl patterns you can direct the GSA to ignore certain content types (based on file extension) or areas of the intranet that don't contain content of value to the search experience.
When you're setting the host load remember that you can use decimal values between 0-1, e.g.: 0.1.
If they have a decent WAN optimizer in place you may find this is less of an issue than you think.

CPU cores per IIS website (process)

I've heard that only 1 CPU core can be used per ASP.NET 4.0 website in IIS 7. Using more CPU cores supposedly means entering web-farm territory (and, as a result, session management has to be handled accordingly). But I could not find any references confirming this.
So are there any limitations on the number of CPU cores that can be used per website while session state remains in-proc? Any references?
Absolutely wrong.
IIS 7 and ASP.NET use all the CPUs and the full power of the server. The same goes for SQL Server, which runs in parallel with ASP.NET/IIS and also uses all the CPUs and (all) the memory. And if you create any threads yourself, they can also potentially run on a different CPU.
What you have probably "heard" is that the session blocks the asynchronous processing of pages. That is not entirely bad, you know - it helps in many cases. It also does not mean that only one CPU is used, only that calls to the same site within the same session are synchronous (one waits for the other).
A few words about the session. Personally, I have completely replaced it with my custom-made session handler, but for any beginner site and for small sites the standard session is perfect, because it helps you synchronise the calls.
Without the standard session module you need to handle that synchronisation manually, case by case - and that is not so easy. If you don't, the usual result is double and triple submissions of the same data (all that from experience).
Now, if you design your site for a web garden, and design it well, you take care of synchronising the calls yourself and can make it work both fast and correctly.
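To illustrate the session locking point above, here is a minimal sketch of a handler that only reads session state and therefore opts out of the exclusive per-session lock by implementing IReadOnlySessionState (the handler name and session key are made up):

using System.Web;
using System.Web.SessionState;

// IReadOnlySessionState gives read access to session state WITHOUT taking
// the exclusive per-session lock, so two requests from the same session
// can execute at the same time (and on different CPUs).
public class ReportHandler : IHttpHandler, IReadOnlySessionState
{
    public bool IsReusable
    {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context)
    {
        // Read-only use of session data (hypothetical key).
        var userName = context.Session["UserName"] as string ?? "guest";

        context.Response.ContentType = "text/plain";
        context.Response.Write("Report for " + userName);
    }
}

For a WebForms page the equivalent is setting EnableSessionState="ReadOnly" in the @ Page directive.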
Read about the session blocking on:
Web app blocked while processing another web app on sharing same session
Replacing ASP.Net's session entirely
What perfmon counters are useful for identifying ASP.NET bottlenecks?
Trying to make Web Method Asynchronous

Limit database usage of a website

To start off - I have 2 separate websites and a database (IIS 7.5, ASP.NET and SQL Server 2008, using LINQ to SQL for database access).
I have a separate administrative website that sometimes needs to trigger long-running operations (more than 10 seconds) on the database. The problem is that those operations push the SQL Server process to 100% CPU, and then the main customer website can't access the database promptly - there are noticeable delays.
I am OK with those administrative operations taking 2x or 4x or n times longer, since they are lower priority.
I've tried using the CPU Limit setting on the app pool in IIS, but that doesn't help, as the w3wp.exe process never uses much CPU - it's sqlservr.exe that does. Thanks in advance for your suggestions!
If your admin queries are consuming all the CPU on the box, there are almost certainly some tuning opportunities there - likely some indexing optimizations.
Until you have time to invest in those, and until you get your Resource Governor configuration settled, you can simply reduce their impact to a single CPU - which may provide short-term symptom relief - by adding the MAXDOP hint to your admin queries:
OPTION (MAXDOP 1);
Yes, it might make you feel a little dirty, but the rest of your CPUs will be freed up to work on your more important queries.
The real answer is to tune your admin queries. Just because it's ok that they run long does not mean it's good for your server or the experience of your users. You'll never be able to completely isolate them from the effects of other queries going on on the box, especially if you are experiencing high CPU that is compensating for slow I/O. I/O does not have any knobs in resource governor - you can only control CPU and memory, and not even 100%.
Sounds like you want to look into Resource Governor, which is built into SQL Server as of SQL Server 2008. The BOL link below should get you started:
http://msdn.microsoft.com/en-us/library/bb933866(v=SQL.100).aspx
Essentially, you can throttle CPU and memory usage for the resource pools and workload groups you define. The throttling only kicks in when the server is under load. Be aware that you cannot control disk I/O utilization: if the process in your admin database is I/O-bound and your other DBs share drives, you will inevitably still see performance issues, and moving databases to separate spindles or query tuning will be necessary.
Example of a classifier function that ensures the user you define is throttled by the desired resource pool, based on its workload group:
/* Classifier function */
CREATE FUNCTION dbo.rgov_classifier_db ()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @rgWorkloadGrp sysname;

    -- Send the admin website's login to the throttled workload group;
    -- everything else falls through to the default group.
    IF SUSER_SNAME() = 'adminWebsiteDB'
        SET @rgWorkloadGrp = 'workloadGroupName';
    ELSE
        SET @rgWorkloadGrp = 'defaultWorkloadGroupName';

    RETURN @rgWorkloadGrp;
END;
GO

/* Register the function with Resource Governor and then start Resource Governor. */
ALTER RESOURCE GOVERNOR
WITH (CLASSIFIER_FUNCTION = dbo.rgov_classifier_db);
GO

ALTER RESOURCE GOVERNOR RECONFIGURE;
GO

Running an ASP.NET website with MS SQL Server - when should I worry about scalability?

I run a medium-sized website on the ASP.NET platform and use MS SQL Server to store the data.
My current site stats are:
~ 6000 Page Views a day
~ 10 tables in the SQL server with around 1000 rows per table
~ 4 queries per page served
The hosting machine has 1GB RAM
I expect by the end of 2009 to hit around:
~ 20,000 page views
~ 10 tables and around 4000 rows per table
~ 5 queries per page served
My question is: should I plan for scalability right now? Will the machine hold up till the end of the year with the expected stats?
I know my description is very high level and does not provide insight into the kind of queries involved, etc., but I just wanted to know what your gut instinct tells you.
Thanks!
You should always plan for scalability. When to put resources into doing the actual scaling is usually the tough guess.
Will the machine hold up until the end of the year
Way too little information to answer this. If a page request takes 30 CPU-seconds to process due to massive interaction with a legacy enterprise application through the four queries per page, then there's no way. If it takes minuscule fractions of a second to serve some static content stored in the cache and your queries only run every half hour to refresh the content, then you're good until 2020 at the traffic growth rate you describe.
My guess is that you're somewhere closer to the latter scenario. 20,000 page hits a day is not really a ton of traffic, but you'll need to benchmark your page and server performance at some point so that you can make the calculations you need.
Things to look at for scaling your site when it is time (a short output-caching sketch follows this list):
Output Caching
Optimizing Viewstate
Using Ajax where appropriate
Session optimization
Request, script, css and html minification
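To give a flavour of the first item, output caching can be enabled declaratively or in code. A minimal programmatic sketch, assuming a WebForms code-behind - the page name and the 10-minute duration are placeholders:

using System;
using System.Web;
using System.Web.UI;

public partial class ProductList : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Let the ASP.NET output cache (and downstream caches) keep this
        // response for 10 minutes instead of rebuilding it per request.
        Response.Cache.SetCacheability(HttpCacheability.Public);
        Response.Cache.SetExpires(DateTime.UtcNow.AddMinutes(10));
        Response.Cache.SetValidUntilExpires(true);
    }
}

The declarative equivalent is the @ OutputCache page directive with Duration and VaryByParam attributes.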
Two years ago I saw a relatively new (for two years ago) laptop running IIS and serving up 1100 to 1200 simple dynamic page requests per second. It had been set up by a consulting firm whose business was optimizing ASP.Net websites, but it goes to show you how much you can do.
Essentially, by the end of 2009, you expect to do 100,000 SQL queries per day. This is about 1.157 queries per second.
I am making the assumption that your configuration is "normal" (i.e. you're not doing something funky and these are pretty straightforward SELECT, UPDATE, INSERT, etc), and that your server is running RAID disks.
At 4,000 rows per table this is nothing to SQL server. You should be just fine. If you wanted to be proactive about it, put another stick of RAM in the server and bring it up to at least 2GB, that way IIS and SQL have plenty of memory (SQL will certainly take advantage of it).
The hosting machine? Does this mean you have IIS and SQL installed on the same box, or IIS on your host machine with a dedicated SQL Server provided by your hosting company? Either way, I would suggest starting to look at how you might implement a caching layer to minimize the hits to the database (where possible). Once this is PLANNED (not necessarily implemented), I would then start to look at how you might build a caching layer around your output (things built in ASP.NET).
If you see a clear and easy path to building caching layers, then this is a quick way to start minimizing requests to the database and work on your web server. I suggest that this cache layer be flexible - read: don't tie it to anything provided by .NET! Currently I still suggest using Memcached Win32. You can install it on your one hosted box easily and configure your cache layer to use local resources (add memory... 1 GB is not enough).
Then, if you find that you really need to squeeze every little bit of performance out of your system, splurge for a second box. Split your cache between your current box and the new box (allowing you to keep more in cache). This gives you some room (and time) to grow. Offloading more to cache should help absorb any future spikes, and with the second box you can also start making your site work in a farmed environment. If you are using local session state, push that into your cache layer too, so that it won't matter whether a request lands on one box or the other (standard session is local to the box it is managed on).
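To make "flexible" a bit more concrete, here is a minimal sketch of such a cache abstraction - the interface and class names are made up, and .NET's MemoryCache is used only as a stand-in backing store that a Memcached-based implementation could later replace behind the same interface:

using System;
using System.Runtime.Caching;

// Hypothetical cache abstraction: application code talks to ICacheProvider,
// so the backing store can move from in-process memory to Memcached (or a
// second cache box) later without touching the call sites.
public interface ICacheProvider
{
    T Get<T>(string key) where T : class;
    void Set<T>(string key, T value, TimeSpan ttl) where T : class;
}

// Simple in-process implementation to start with.
public class InMemoryCacheProvider : ICacheProvider
{
    private readonly MemoryCache _cache = MemoryCache.Default;

    public T Get<T>(string key) where T : class
    {
        return _cache.Get(key) as T;
    }

    public void Set<T>(string key, T value, TimeSpan ttl) where T : class
    {
        _cache.Set(key, value, DateTimeOffset.UtcNow.Add(ttl));
    }
}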
This is a huge subject...so without real details this is all speculation of course! You might be just right for adding better and more hardware to the existing installation.
Have you tried setting up a quick performance test using sample data? 20,000 page views is less than one/sec (assuming even distribution over 8 hours), which is pretty minimal given your small tables. Assuming you're not sending a ton of data with each page view (i.e. a data table with all 1000 rows from one of your tables), you are likely OK.
You may need to increase RAM, but other than running a performance test I wouldn't worry too much about performance right now.
I don't think the load you are describing would be too much of a problem for most machines. Of course it doesn't just depend on the few metrics you outlined but also on query complexity, page size, and a heap of other things.
If you worry about scalability, do some load testing and see how your site handles, say, 10,000 page views per hour (about 3 views per second). It's almost always good to plan ahead, as long as you plan for probable scenarios.
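If it helps, a rough first load test doesn't need a special tool. A minimal sketch along these lines - the URL, request count and concurrency are placeholders, and it assumes a modern C# console app - gives a first feel for latency under concurrent load; a proper load-testing tool will give better numbers:

using System;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

// Rough-and-ready load test: fire batches of concurrent GET requests at one
// URL and report average/maximum latency.
class QuickLoadTest
{
    static async Task Main()
    {
        const string url = "http://localhost/mypage.aspx"; // placeholder
        const int totalRequests = 300;
        const int concurrency = 10;

        using (var client = new HttpClient())
        {
            var latencies = new double[totalRequests];

            for (int batch = 0; batch < totalRequests; batch += concurrency)
            {
                var tasks = Enumerable
                    .Range(batch, Math.Min(concurrency, totalRequests - batch))
                    .Select(async i =>
                    {
                        var sw = Stopwatch.StartNew();
                        var response = await client.GetAsync(url);
                        response.EnsureSuccessStatusCode();
                        latencies[i] = sw.Elapsed.TotalMilliseconds;
                    });
                await Task.WhenAll(tasks);
            }

            Console.WriteLine("Avg: {0:F1} ms  Max: {1:F1} ms",
                latencies.Average(), latencies.Max());
        }
    }
}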
Guts say: given 10 tables with 4,000 rows each and assuming about 2 KB of data per row, that is only about 80 MB for the entire database - easily cached within the memory available. Assuming everything else about the application is equally simple, you should be able to easily serve hundreds of pages per second.
Engineers say: If you want to know, stress test your application.
