Optimizing Google Search Appliance on a remote server - networking

I'm planning to deploy a Google Search Appliance to remotely index an intranet site (transcontinentally). So I will be using the company's network and potentially consuming too much bandwidth.
Regarding the configurations I can use to mitigate the effect of the initial crawl (the only crawl perceived as dangerous for the network), we have:
Crawl and Index > Host Load Schedule
Web Server Host Load: basically the number of concurrent connections to the crawled servers within 1 minute, so minimizing this setting should reduce the load placed on those servers.
Exceptions to Web Server Host Load: this is a schedule used for either increasing or decreasing the number of concurrent connections to the crawled server.
Crawl and Index > Crawl Schedule
Instead of a continuous crawl I should choose a scheduled crawl.
Am I on the right track and can other settings be configured in order not to generate excessive network traffic between the GSA and the Web servers?

The best way to minimize the crawling of a remote site is to not crawl it. Failing that, there are a couple of settings that will help, as noted above:
1) Host Load Schedule
This sets the number of concurrent threads the crawler uses for the host. Note that this can be a fractional value, even below 1 (e.g. 0.5), as also noted by BigMikeW.
2) Freshness Tuning
Crawl Infrequently actually means "crawl never again". This works well in conjunction with a metadata-and-URL feed, which tells the GSA to recrawl a page, or with a recrawl request from the admin console. Crawl Frequently actually means "crawl once per day". That setting doesn't mean much now that the crawler has been retuned and the hardware is faster: the GSA will submit requests to the pages it finds several times a day anyway.
3) Crawl schedule
I find it better not to turn off the crawler, but rather to keep it in continuous mode and set the host load to zero for the periods you want the crawler quiet. This allows the GSA's natural crawl algorithms to play out; anything you might achieve by scheduling can be achieved by dropping the host load to zero instead.
My recommendation for minimizing WAN traffic:
1) Review DNS and add an override if necessary to ensure you are routing to the nearest content source
2) Set the content source's URL pattern to Crawl Infrequently
3) Create a metadata-and-URL feed to push content updates.
The last one would take a bit of coding. There is an example sitemap feeder here:
https://code.google.com/p/gsafeedmanager/
With this configuration, the GSA will never recrawl the content and will rely on the feed to inform it of updates.
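If you end up rolling your own feeder instead, the push itself is small. A minimal sketch in Python, assuming the standard GSA feeds interface (an HTTP POST to port 19900 on the appliance; check the Feeds Protocol guide for your software version); the hostname, datasource name and URLs are placeholders:

    # Push a metadata-and-URL feed listing changed URLs; the GSA then fetches
    # those URLs itself, so only deltas cross the WAN, on a schedule you control.
    # Hostname, datasource name and URLs below are placeholders.
    import requests

    GSA_HOST = "gsa.example.internal"
    DATASOURCE = "intranet_updates"
    changed_urls = ["http://intranet.example.internal/news/latest.html"]

    records = "\n".join(
        '<record url="{0}" action="add" mimetype="text/html"/>'.format(u)
        for u in changed_urls
    )
    feed_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd">\n'
        "<gsafeed><header>"
        "<datasource>{0}</datasource>"
        "<feedtype>metadata-and-url</feedtype>"
        "</header><group>{1}</group></gsafeed>"
    ).format(DATASOURCE, records)

    resp = requests.post(
        "http://{0}:19900/xmlfeed".format(GSA_HOST),
        files={
            "feedtype": (None, "metadata-and-url"),
            "datasource": (None, DATASOURCE),
            "data": ("feed.xml", feed_xml, "text/xml"),
        },
    )
    print(resp.status_code, resp.text)  # the appliance replies with a success message when the feed is accepted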
Alternate:
1) Ensure the content source responds to HEAD requests with Last-Modified dates, and do not configure Crawl Infrequently. The GSA will detect the deltas and slow the crawl down over time.
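A quick way to sanity-check that a content source behaves this way, before pointing the GSA at it, is a conditional request by hand. A rough Python sketch (the URL is a placeholder):

    # Check that the source returns Last-Modified on HEAD and honours
    # If-Modified-Since with a 304, which is what lets the crawler back off.
    # The URL is a placeholder.
    import requests

    url = "http://intranet.example.internal/somepage.html"

    head = requests.head(url)
    last_modified = head.headers.get("Last-Modified")
    print("HEAD status:", head.status_code, "Last-Modified:", last_modified)

    if last_modified:
        revalidate = requests.get(url, headers={"If-Modified-Since": last_modified})
        print("Conditional GET status:", revalidate.status_code)  # expect 304 Not Modified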

Yes, I would also look at the Freshness Tuning and Duplicate Hosts.
Host Load Schedule
  - Web Server Host Load
  - Exceptions to Web Server Host Load
Crawl Schedule
  - Crawl Mode
Freshness Tuning
  - Crawl Frequently
  - Crawl Infrequently

As Tan Hong Tat says, look at Freshness Tuning and Duplicate Hosts.
I would set it to crawl infrequently at least until the initial crawl has completed.
Also do some content analysis. Using the Crawl patterns you can direct the GSA to ignore certain content types (based on file extension) or areas of the intranet that don't contain content of value to the search experience.
When you're setting the host load, remember that you can use decimal values between 0 and 1, e.g. 0.1.
If they have a decent WAN optimizer in place you may find this is less of an issue than you think.

Related

SEO crawler DDOSing sites

I have a customer that runs 36 websites (many thousands of pages) on a load-balanced set of IIS servers (round robin with sticky affinity). The infrastructure is entirely AWS-based (r3.2xlarge: 8 vCPU, 60.5 GiB RAM).
To get straight to the point, the site is configured to 'cache on access' using standard in-memory caching with ASP.NET 4.6, with static assets served through CloudFront. On a 'cold start' the site makes both SQL Server queries for content and separate Elasticsearch queries at runtime to determine hreflang alternate-language tags; this basically queries which versions of a URL are available in different languages, for SEO reasons. That query has been optimised from a cross-index wildcard query down to a lookup on a single index. As mentioned, the entire result is cached for 24h once all this has executed.
Under normal use conditions the site works perfectly. As there are 36 sites running on a single box, the private working set gets allocated up to the max (99%) of physical RAM over time, as more and more content gets cached in memory. I can end up with app pools in excess of 1.5 GiB, which isn't ideal. After this point, presumably the .NET LRU cache-eviction algorithm is working overtime.
The problem I have: after some post-mortem review of the IIS logs, it turns out the customer is using an SEO bot tool, SEMrush, which essentially triggers a denial-of-service attack against the sites (a thundering herd?) through simultaneous requests for the 'long tail' of pages that are never viewed by a user and hence aren't stored in the cache.
The net result is a server brought to its knees: App Pool CPU usage all over the place, an Elasticsearch queue length > 1000, huge ES heap growth, a rising rejection rate, and eventually a crash.
The solutions I've thought about but haven't implemented:
Put CloudFront in front of all the sites and use a warm-up script (although I don't think this will actually help, as it's a cold-start problem when all the pages expire, unless I could have a most-recently-used cache invalidation mechanism that invalidated pages based on request count, say > 100, and left everything else persistent)
AWS Shield/WAF to provide some sort of rate limiting
Remove the runtime ES lookup altogether and move to an eventually-consistent model that computes the hreflang lookup table in a separate process. However, the ES instances, whilst on an old version (1.3.1), form a 3-node cluster with plenty of CPU power and a 16 GiB min/max heap per node, so they should be able to take that level of throughput?
Or all 3!
Has anyone come across this problem before, and what was your solution? It must be fairly common, especially for large sites that are hammered by SEO / DQM web crawlers.
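As an aside, the cold-start behaviour described above is the classic cache-stampede problem: when a popular entry expires, every concurrent request recomputes it at once. One common mitigation is per-key request coalescing, sketched below in Python purely to illustrate the pattern (the real stack is ASP.NET, and all names here are hypothetical):

    # Per-key request coalescing: the first caller recomputes an expired entry,
    # concurrent callers for the same key wait briefly and reuse the result.
    # Illustrative only; names are hypothetical.
    import threading
    import time

    _cache = {}                    # key -> (value, expires_at)
    _locks = {}                    # key -> lock guarding recomputation of that key
    _locks_guard = threading.Lock()
    TTL_SECONDS = 24 * 3600

    def get_or_compute(key, compute):
        entry = _cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                      # fresh hit, no work

        with _locks_guard:
            lock = _locks.setdefault(key, threading.Lock())

        with lock:                               # only one thread per key gets past here
            entry = _cache.get(key)              # re-check: another thread may have filled it
            if entry and entry[1] > time.time():
                return entry[0]
            value = compute()                    # the expensive SQL + Elasticsearch work
            _cache[key] = (value, time.time() + TTL_SECONDS)
            return value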

WP-Engine 502 timeout- what options do I have to get around this limitation?

We have a plugin for WordPress that we've been using successfully for many customers. The plugin syncs stock numbers with our warehouse and exports orders to our warehouse.
We have recently had a client move to WP Engine, who seem to impose a hard 30-second limit on the length of a running request. Because we sometimes have many orders to export, the script simply hits a 502 Bad Gateway error.
According to WP-Engine documentation, this cannot be turned off on a client by client basis.
https://wpengine.com/support/troubleshooting-502-error/
My question is: what options do I have to get around a host's 30-second timeout limit? Setting set_time_limit() has no effect (as expected, since it is the web server killing the request, not PHP). The only thing I can think of is to make heavy modifications to the plugin so that it acts as an API and we simply pull the data from the client's system, but this is a last resort.
The long-process timeout is 60 seconds.
This cannot be turned off on shared plans, only on plans with dedicated servers. You will not be able to get around this by attempting to modify it, as it runs directly on Apache outside of your particular install.
Your options are:
1. 'Chunk' the upload to be smaller.
2. Upload the SQL file to your SFTP _wpeprivate folder and have their support import it for you.
3. Optimize the import so the content is imported more efficiently.
I can see three options here.
1. Change the web host (the easy option).
2. Modify the plugin to process the sync in batches (see the sketch below). However, even this won't give you a 100% guarantee under a hard script execution time limit: something may get lost in one or more batches and you won't even know.
3. Contact WP Engine and ask them to raise the limit for this particular client.
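On option 2, the batching concern is mostly about making each batch small, idempotent and resumable, so a killed request loses nothing. A rough sketch of the pattern in Python (the real plugin is PHP, and the helper names here are hypothetical):

    # Chunked, resumable export: process a small batch per invocation, persist
    # progress per order, and stop well before the host's hard timeout.
    # Helper names are hypothetical.
    import time

    BATCH_SIZE = 25          # small enough that one batch finishes in a few seconds
    TIME_BUDGET = 20         # stop well before the 30-second gateway limit

    def run_export_batch(fetch_unexported_orders, export_order, mark_exported):
        started = time.time()
        orders = fetch_unexported_orders(limit=BATCH_SIZE)
        for order in orders:
            if time.time() - started > TIME_BUDGET:
                break                    # the next invocation picks up where this one stopped
            export_order(order)          # push one order to the warehouse
            mark_exported(order)         # persist progress immediately so nothing is lost
        return len(orders) == 0          # True once the backlog is drained

Each invocation (for example, triggered by a frequent cron hook) drains part of the backlog; because progress is persisted per order, a request killed at the 30-second mark only loses the order in flight, and that order is retried on the next run.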

I'm being scraped, how can I prevent this?

I'm running IIS 7. A couple of times a week I see a huge number of hits in Google Analytics from one geographical location. The sequence of URLs they are viewing is clearly being generated by some algorithm, so I know I'm being scraped for content. Is there any way to prevent this? It's so frustrating that Google doesn't just give me an IP.
There are plenty of techniques in the anti-scraping world; I'll just categorize them. If you find something missing in my answer, please comment.
A. Server side filtering based on web requests
1. Blocking suspicious IP or IPs.
Blocking suspicious IPs works well, but today most scraping is done through proxies, so in the long run it won't be effective. In your case you get requests from the same geographic location; if you ban that IP, the scrapers will simply switch to proxies, staying IP-independent and undetected.
2. Using DNS level filtering
A DNS-level firewall is another anti-scraping measure. In short, you point your web service at a private domain name server (DNS) network that filters and blocks bad requests before they reach your server. This more sophisticated measure is offered by some companies as part of full website protection services; it is worth looking at an example of such a service.
3. Use a custom script to track user statistics and drop troublesome requests
As you mentioned, you've detected the algorithm the scraper uses to crawl URLs. Write a custom script that tracks request URLs and turns protection measures on based on that pattern. For this you have to run a [shell] script in IIS. A side effect is that response times may increase, slowing down your services. Bear in mind that the algorithm you've detected may simply change, defeating this measure.
4. Limit requests frequency
You might limit the frequency of requests or the amount of downloadable data. The restrictions must be set with a normal user's experience in mind; compared against a scraper's insistent requests, you can then configure your web service to drop or delay the unwanted activity. However, if the scraper is reconfigured to imitate common user behaviour (through well-known tools such as Selenium, Mechanize or iMacros), this measure will fail.
5. Setting maximum session length
This measure is a good one, but modern scrapers usually perform session authentication, so cutting session time short is not that effective.
B. Browser based identification and preventing
1. Set CAPTCHAs for target pages
This is an old technique that for the most part does solve the scraping issue. Yet if your scraping opponent uses one of the anti-CAPTCHA services, this protection will most likely be defeated.
2. Injecting JavaScript logic into web service response
JavaScript code arrives at the client (the user's browser or the scraping server) before or along with the requested HTML content. The code computes and returns a certain value to the target server; based on this test, the HTML may be malformed or even withheld from the requester, locking malicious scrapers out. The logic may be placed in one or more JavaScript-loadable files, and it can be applied not just to the whole content but only to certain parts of the site (e.g. prices). To bypass this measure, scrapers have to turn to even more complex (usually JavaScript-based) scraping logic that is highly customized and therefore costly.
C. Content based protection
1. Disguising important data as images
This method of content protection is widely used today, and it does prevent scrapers from collecting the data (see the sketch after this list). Its side effect is that data obfuscated as images is hidden from search engine indexing, downgrading the site's SEO, and if the scraper uses an OCR system this protection can again be bypassed.
2. Frequent page structure change
This is a far more effective way of protecting against scraping. It works best when you change not just element ids and classes but the entire hierarchy, although that also involves restructuring your styling and therefore imposes additional cost. The scraper side must adapt to the new structure if it wants to keep scraping your content. There are few side effects if your service can afford the effort.
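To make C.1 concrete, the idea is simply to render the sensitive values server-side and ship pixels instead of text. A minimal sketch using the Pillow imaging library (field values and file names are illustrative; real deployments add noise, varied fonts and caching):

    # Render a sensitive value (e.g. a price) as an image so it never appears
    # as text in the HTML. Requires the Pillow package; names are illustrative.
    from PIL import Image, ImageDraw

    def render_value_as_image(value, path):
        img = Image.new("RGB", (120, 32), "white")
        draw = ImageDraw.Draw(img)
        draw.text((8, 8), value, fill="black")   # default bitmap font; use a TTF in practice
        img.save(path, "PNG")

    render_value_as_image("$1,299.00", "price_42.png")
    # The page then references <img src="price_42.png"> where the text used to be.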

Is this Anti-Scraping technique viable with Robots.txt Crawl-Delay?

I want to prevent web scrapers from aggressively scraping 1,000,000 pages on my website. I'd like to do this by returning a "503 Service Unavailable" HTTP error code to bots that access an abnormal number of pages per minute. I'm not having trouble with form spammers, just with scrapers.
I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt Crawl-delay that will keep spiders' page requests per minute under my 503 threshold.
Is this an acceptable solution? Do all major search engines support the crawl-delay directive? Could it negatively affect SEO? Are there any other solutions or recommendations?
I have built a few scrapers, and the part that takes the longest is always figuring out the site layout: what to scrape and what not to. What I can tell you is that changing divs and internal layout will be devastating for all scrapers, as ConfusedMind already pointed out.
So here's a little text for you:
Rate limiting
To rate limit an IP means that you only allow the IP a certain number of searches in a fixed timeframe before blocking it. This may seem like a sure way to stop the worst offenders, but in reality it's not. The problem is that a large proportion of your users are likely to come through proxy servers or large corporate gateways, which they often share with thousands of other users. If you rate limit a proxy's IP, that limit will easily trigger when different users behind the proxy use your site. Benevolent bots may also run at higher rates than normal, triggering your limits.
One solution is of course to use a whitelist, but the problem is that you continually need to compile and maintain these lists by hand, since IP addresses change over time. Needless to say, the scrapers will simply lower their rates or distribute their searches over more IPs once they realise you are rate limiting certain addresses.
For rate limiting to be effective and not prohibitive for big users of the site, we usually recommend investigating everyone who exceeds the rate limit before blocking them (a minimal sketch of the measuring side follows).
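To support that "investigate before blocking" approach you still need a cheap way to measure per-IP request rates. A minimal sketch of a sliding-window counter in Python (thresholds and names are illustrative; what you do when the limit trips, whether logging for review or answering 503 with a Retry-After as proposed in the question above, is up to you):

    # Sliding-window request counter per IP. The caller decides what to do when
    # the limit is exceeded (log for review, serve 503 with Retry-After, etc.).
    # Thresholds and names are illustrative only.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 120                 # tune against real traffic before enforcing anything

    _hits = defaultdict(deque)         # ip -> timestamps of recent requests

    def over_limit(ip):
        now = time.time()
        window = _hits[ip]
        window.append(now)
        while window and window[0] < now - WINDOW_SECONDS:
            window.popleft()           # drop hits that fell outside the window
        return len(window) > MAX_REQUESTS

    # Wiring in the request handler (pseudo-code):
    #   if over_limit(client_ip) and client_ip not in whitelist:
    #       respond(503, headers={"Retry-After": "120"})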
Captcha tests
Captcha tests are a common way of trying to block scraping at websites. The idea is to display a picture of text and numbers that a machine can't read but humans can. This method has two obvious drawbacks. Firstly, the captcha tests may be annoying for users if they have to fill out more than one. Secondly, web scrapers can easily do the test manually once and then let their script run. Apart from this, a couple of big users of captcha tests have had their implementations compromised.
Obfuscating source code
Some solutions try to obfuscate the HTML source code to make it harder for machines to read it. The problem with this method is that if a web browser can understand the obfuscated code, so can any other program. Obfuscating source code may also interfere with how search engines see and treat your website; if you decide to implement this, do it with great care.
Blacklists
Blacklists of IPs known to scrape the site are not really a method in themselves, since you still need to detect a scraper first in order to blacklist them. Even so, it is still a blunt weapon, as IPs tend to change over time, and you will end up blocking legitimate users. If you still decide to implement blacklists, you should have a procedure to review them on at least a monthly basis.

Multiple requests to server question

I have a DB with user accounts information.
I've scheduled a cron job that updates the DB with any new user data it fetches from their accounts.
I was thinking that this may cause a problem since all requests are coming from the same IP address and the server may block requests from that IP address.
Is this the case?
If so, how do I avoid being banned? Should I be using a proxy?
Thanks
You get banned for suspicious (or malicious) activity.
If you are running a normal business application inside a normal company intranet you are unlikely to get banned.
Since you have access to user accounts information, you already have a lot of access to the system. The best thing to do is to ask your systems administrator, since he/she defines what constitutes suspicious/malicious activity. The systems administrator might also want to help you ensure that your database is at least as secure as the original information.
should I be using a proxy?
A proxy might disguise what you are doing - but you are still doing it. So this isn't the most ethical way of solving the problem.
Is the cron job that fetches data from this "database" on the same server? Are you fetching data for a user from a remote server using screen scraping or something?
If this is the case, you may want to set up a few different cron jobs and do it in batches. That way you reduce the amount of load on the remote server and lower the chance of wherever you are getting this data from, blocking your access.
Edit
Okay, so if you haven't got permission to scrape, you are obviously going to want to do it responsibly (no matter the site). Try to gather as much data as you can from as few requests as possible, spread them out over the course of the whole day, or even schedule them for times that are likely to be low-load. I wouldn't try to use a proxy; that wouldn't really help the remote server, and it would be a pain in the ass for you.
I'm no iPhone programmer, and this might not be possible, but you could try having the individual iPhones grab the data so all the source traffic isn't coming from the same IP. Just an idea; otherwise, just try to be a bit discreet.
Here are some tips from Jeff regarding the scraping of Stack Overflow, but I'd imagine that the rules are similar for any site.
Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits, which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.
Identify yourself. Add something useful to the user-agent (ideally, a link to a URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."
Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??
Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.
Yes, you want an API. We get it. Don't rage against the machine by doing naughty things until we build it. It's in the queue.
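Putting those tips together, a well-behaved client ends up looking something like the sketch below (Python; the feed URL, bot name and contact details are placeholders, and the handler is a stub):

    # A "polite" fetcher following the tips above: gzip, an identifiable
    # User-Agent, a structured feed instead of scraped HTML, and a generous delay.
    # URL, bot name and contact details are placeholders.
    import time
    import requests

    FEED_URL = "https://example.com/feeds/recent.json"   # prefer JSON/RSS over scraping HTML
    POLL_INTERVAL = 15 * 60                              # no more often than every 15 minutes

    session = requests.Session()
    session.headers.update({
        "User-Agent": "example-sync-bot/1.0 (+https://example.com/bot; admin@example.com)",
        "Accept-Encoding": "gzip",    # requests asks for gzip by default; stated here for emphasis
    })

    def handle(items):
        pass                          # placeholder: update local data from the feed payload

    while True:
        resp = session.get(FEED_URL, timeout=30)
        if resp.status_code == 200:
            handle(resp.json())
        time.sleep(POLL_INTERVAL)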
