Running IIS 7, a couple of times a week I see a huge number of hits on Google Analytics from one geographical location. The sequence of urls they are viewing are clearly being generated by some algorithm so I know I'm being scraped for content. Is there any way to prevent this? So frustrated that Google doesn't just give me an IP.
There are plenty of techniques in the anti-scraping world. I'd just categorize them. If you find something missing in my answer please comment.
A. Server side filtering based on web requests
1. Blocking suspicious IP or IPs.
The blocking suspicious IPs works well but today most of scraping is done using IP proxying so for a long run it wouldn't be effective. In your case you get requests from the same IP geo location, so if you ban this IP, the scrapers will surely leverage IP proxying thus staying IP independent and undetected.
2. Using DNS level filtering
Using DNS firewall pertains to the anti-scrape measure. Shortly saying this is to set up you web service to a private domain name servers (DNS) network that will filter and prevent bad requests before they reach your server. This sophisticated measure is provided by some companies for complex website protection and you might get deeper in viewing an example of such a service.
3. Have custom script to track users' statistic and drop troublesome requests
As you've mentioned you've detected an algorithm a scraper crawls urls. Have a custom script that tracks the request urls and based on this turns on protection measures. For this you have to activate a [shell] script in IIS. Side effect might be that the system response timing will increase, slowing down your services. By the way the algorithm that you've detected might be changed thus leaving this measure off.
4. Limit requests frequency
You might set a limitation of the frequency of requests or downloadable data amount. The restrictions must be applied considering the usability for a normal user. When compared to the scraper insistent requests you might set your web service rules to drop or delay unwanted activity. Yet if scraper gets reconfigured to imitate common user behaviour (thru some nowdays well-known tools: Selenuim, Mechanize, iMacros) this measure will fail off.
5. Setting maximum session length
This measure is a good one but usually modern scrapers do perform session authentication thus cutting off session time is not that effective.
B. Browser based identification and preventing
1. Set CAPTCHAs for target pages
This is the old times technique that for most part does solve scraping issue. Yet, if your scraping opponent leverages any of anti-captcha services this protection will most likely be off.
2. Injecting JavaScript logic into web service response
JavaScript code should arrive to client (user's browser or scraping server) prior to or along with requested html content. This code functions to count and return a certain value to the target server. Based on this test the html code might be malformed or might even be not sent to the requester, thus leaving malicious scrapers off. The logic might be placed in one or more JavaScript-loadable files. This JavaScript logic might be applied not just to the whole content but also to only certain parts of site's content (ex. prices). To bypass this measure scrapers might need to turn to even more complex scraping logic (usually of JavaScript) that is highly customizable and thus costly.
C. Content based protection
1. Disguising important data as images
This method of content protection is widely used today. It does prevent scrapers to collect data. Its side effect is that the data obfuscated as images are hidden for search engine indexing, thus downgrading site's SEO. If scrapers leverage a OCR system this kind of protection is again might be bypassed.
2. Frequent page structure change
This is far effective way for scrape protection. It works not just to change elements ids and classes but the entire hierarchy. The latter involving styling restructuring thus imposing additional costs. Sure, the scraper side must adapt to a new structure if it wants to keep content scraping. Not much side effects if your service might afford it.
Related
I have a website hosted in firebase that totally went viral for a day. Since I wasn't expecting that, I didn't install any analytics tool. However, I would like to know the number of visits or downloads. The only metric I have available is the GB Downloaded: 686,8GB. But I am confused because if I open the website with the console of Chrome, I get two different metrics about the size of the page: 319KB transferred and 1.2MB resources. Furthermore, not all of those things are transferred from firebase but from other CDN as you can see in the screenshots. What is the proper way of calculating the visits I had?
Transferred metric is how much bandwidth was used after compression was applied.
Resources metric is how much disk space those resources use before they are compressed (for transfer).
True analytics requires an understanding how what is on the web. There are three classifications:
Humans, composed of flesh and blood and overwhelmingly (though not absolutely) use web browsers.
Spiders (or search engines) that request pages with the notion that they obey robots.txt and will list your website in their websites for relevant search queries.
Rejects (basically spammers and the unknowns) which include (though are far from limited to) content/email scrapers, brute-force password guessers, vulnerability scanners and POST spammers.
With this clarification in place what you're asking in effect is, "How many human visitors am I receiving?" The easiest way to obtain that information is to:
Determine what user agent requests are human (not easy, behavior based).
Determine the length of time a single visit from a human should count as.
Assign human visitors a session.
I presume you understand what a cookie is and how it differs from a session cookie. Obviously when you sign in to a website you are assigned a session. If that session cookie is not sent to the server on a page request you will in effect be signed out. You can make session cookies last for a long time and it will come down to factors such as convenience for the visitor and if you directly count those sessions or use it in conjunction with something else.
Now your next thought likely is, "But how do I count downloads?" Thankfully you mention PHP in your website so I can thankfully give you some code that should make sense to you. If you just link directly to the file you'd be stuck with (at best) counting clicks via a click event on the anchor element though if the download gets canceled because it was a mistake or something else makes it more subjective than my suggestion. Granted my suggestion can still be subjective (e.g. they decide they actually don't want to download and cancel before the completion) and of course if they use the download is another aspect to consider. That being said if you want the server to give you a download count you'd want to do the following:
You'll may want to use Apache rewrite (or whatever the other HTTP server equivalents are) so that PHP handles the download.
You'll may need to ensure Apache has the proper handling for PHP (e.g. AddType application/x-httpd-php5 .exe .msi .dmg) so your server knows to let PHP run on the request file.
You'll want to use PHP's file_exists() with an absolute file path on the server for the sake of security.
You'll want to ensure that you set the correct mime for the file via PHP's header() as you should expect browsers to be horrible at guessing.
You absolutely need to use die() or exit() to avoid Gecko (Firefox) bugs if your software leaks even whitespace as the browser would interpret it as part of the file likely causing corruption.
Here is the code for PHP itself:
$p = explode('/',strrev($_SERVER['REQUEST_URI']));
$file = strrev($p[0]);
header('HTTP/1.1 200');
header('Content-Type: '.$mime);
echo file_get_contents($path_absolute.$file);
die();
For counting downloads if you want to get a little fancy you could create a couple of database tables. One for the files (download_files) and the second table for requests (download_requests). Throw in basic SQL queries and you're collecting data. Record IPv6 (Storing IPv6 Addresses in MySQL) and you'll be able to discern from a query how many unique downloads you have.
Back to human visitors: it takes a very thorough study to understand the differences between humans and bots. Things like Captcha are garbage and are utterly annoying. You can get a rough start by requiring a cookie to be sent back on requests though not all bots are ludicrously stupid. I hope this at least gets you on the right path.
I have multiple AJAX requests going out of my browser.
My UI is comprised of multiple views and the AJAX requests are trying to populate those views simultaneously. In some cases I require more than 10 simultaneous requests to be sent from client and processed concurrently at the server.
But due to browser limitations on max concurrent requests to a single domain and because of HTTP's "A server MUST send its responses to requests in the same order that the requests were received" constraint, I am not deriving as much concurrency in request processing as I would want.
From my application's standpoint, I dont need responses to come in the order in which I sent the request. I am ok if view8 gets populated before view1, for example.
Async processing using Servlet 3.0 constructs seems to address only one-side of the problem (the Server-side) and hence cannot be fully exploited for maximizing application concurrency.
My question is:
Am I missing out on some proper constructs ? ('proper' in contrast to workarounds like "host your images from a different sub domain") that can yield me more concurrency ?
This seems like something many web UIs would need ! If not, then I am designing my UI the wrong way. In either case, I would appreciate your inputs.
Edit1: To my advantage, I dont have to support a huge number of concurrent clients. The maximum number of concurrent clients accessing the app would be < 100. Given that fact, basically am trying to enhance the experience of these clients when I have the processing power available aplenty on my server-side.
Edit2: Our application/API is not for 'public' consumption. For ex: It is like my company's webmail app. It is hosted on the internet but it is not meant for everyone's consumption. Only meant for consumption by the relevant few.
The reason why am giving that info, is to differentiate my app from SO/Twitter, which seem to differentiate their (REST) API users from their normal website users. In our case, we think we should not differentiate that way and want to provide single-set of REST endpoints for both.
The reason behind the limitation in the spec (RFC2616) seems to be : "These guidelines are intended to improve HTTP response and avoid congestion.". However, intranet web apps have more luxuries and should not have to be so constrained !?
The server is exposing REST API and hence the UI makes specific GETs
for various resource catogories (ex: blogs, videos, news, articles).
Since each resource catogory has its exclusive view it all fits in
nicely. It feels wrong to collate requests to get blogs and videos
together in one request. Isnt it ?
Well, IMHO being pragmatic is more important. Sure, it makes sense for a service to expose RESTful API but it's not always necessary to expose the entire API to the browser. Your API can be separate from your server side web app. You can always make those multiple API requests on the server side, collate the results and send them back to the client. For e.g. look at the SO home page. The StackOverflow API does expose a RESTful API but when loading the home page the browser doesn't send across multiple requests just to populate the tags, thread listing etc.
Thanks Sanjay for the suggestion. But we wanted to have a single-API
for both REST clients and Browser clients. Interestingly, the root URI
"stackoverflow.com" is not mentioned in SO's REST API, but the browser
client uses it. I suppose if they had exposed the root URI, their
response would be difficult to process (as it would be a mixture of
data). Their REST API is granular (as is in my application), but their
javascript code uses some other doors(APIs) to decrease no. of
round-trips to the server! Somehow that doesnt feel right (Am a novice
in this field though). Feel free to correct me
SO doesn't use any "other doors". It's just that they simply don't send across 10 concurrent requests for populating something on the page. They make XHR request when you vote, mark thread as favorite, comment etc. For loading the page itself, there are no multiple requests. If you want to directly hit your RESTful API from the browser, you'll have to honor the limitations. Either that or go the desktop way which allows you virtually unlimited connections to your server but I guess you don't want to go that route...
I have a web site that reports about each non-expected server side error on my email.
Quite often (once each 1-2 weeks) somebody launches automated tools that bombard the web site with a ton of different URLs:
sometimes they (hackers?) think my site has inside phpmyadmin hosted and they try to access vulnerable (i believe) php-pages...
sometimes they are trying to access pages that are really absent but belongs to popular CMSs
last time they tried to inject wrong ViewState...
It is clearly not search engine spiders as 100% of requests that generated errors are requests to invalid pages.
Right now they didn't do too much harm, the only one is that I need to delete a ton of server error emails (200-300)... But at some point they could probably find something.
I'm really tired of that and looking for the solution that will block such 'spiders'.
Is there anything ready to use? Any tool, dlls, etc... Or I should implement something myself?
In the 2nd case: could you please recommend the approach to implement? Should I limit amount of requests from IP per second (let's say not more than 5 requests per second and not more then 20 per minute)?
P.S. Right now my web site is written using ASP.NET 4.0.
Such bots are not likely to find any vulnerabilities in your system, if you just keep the server and software updated. They are generally just looking for low hanging fruit, i.e. systems that are not updated to fix known vulnerabilities.
You could make a bot trap to minimise such traffic. As soon as someone tries to access one of those non-existant pages that you know of, you could stop all requests from that IP address with the same browser string, for a while.
There are a couple of things what you can consider...
You can use one of the available Web Application Firewalls. It usually has set of rules and analytic engine that determine suspicious activities and react accordingly. For example in you case it can automatically block attempts to scan you site as it recognize it as a attack pattern.
More simple (but not 100% solution) approach is check referer url (referer url description in wiki) and if request was originating not from one of you page you rejected it (you probably should create httpmodule for that purpose).
And of cause you want to be sure that you site address all known security issues from OWASP TOP 10 list (OWASP TOP 10). You can find very comprehensive description how to do it for asp.net here (owasp top 10 for .net book in pdf), i also recommend to read the blog of the author of the aforementioned book: http://www.troyhunt.com/
Theres nothing you can do (reliabily) to prevent vulernability scanning, the only thing to do really is to make sure you are on top of any vulnerabilities and prevent vulernability exploitation.
If youre site is only used by a select few and in constant locations you could maybe use an IP restriction
I want to prevent web scrapers from agressively scraping 1,000,000 pages on my website. I'd like to do this by returning a "503 Service Unavailable" HTTP error code to bots that access an abnormal number of pages per minute. I'm not having trouble with form-spammers, just with scrapers.
I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay which will ensure spiders access a number of pages per minute under my 503 threshold.
Is this an acceptable solution? Do all major search engines support the crawl-delay directive? Could it negatively affect SEO? Are there any other solutions or recommendations?
I have built a few scrapers, and the part that takes the longest time is allways trying to figure out the site layout what to scrape and not. What I can tell you is that changing divs and internal layout will be devastating for all scrapers. Like ConfusedMind already pointed out.
So here's a little text for you:
Rate limiting
To rate limit an IP means that you only allow the IP a certain amount of searches in a fixed timeframe before blocking it. This may seem sure way prevent the worst offenders but in reality it's not. The problem is that a large proportion of your users are likely to come through proxy servers or large corporate gateways which they often share with thousands of other users. If you rate limit a proxy's IP that limit will easily trigger when different users from the proxy uses your site. Benevolent bots may also run at higher rates than normal, triggering your limits.
One solution is of course to use white list but the problem with that is that you continually need to manually compile and maintain these lists since IP-addresses change over time. Needless to say the data scrapers will only lower their rates or distribute the searches over more IP:s once they realise that you are rate limiting certain addresses.
In order for rate limiting to be effective and not prohibitive for big users of the site we usually recommend to investigate everyone exceeding the rate limit before blocking them.
Captcha tests
Captcha tests are a common way of trying to block scraping at web sites. The idea is to have a picture displaying some text and numbers on that a machine can't read but humans can (see picture). This method has two obvious drawbacks. Firstly the captcha tests may be annoying for the users if they have to fill out more than one. Secondly, web scrapers can easily manually do the test and then let their script run. Apart from this a couple of big users of captcha tests have had their implementations compromised.
Obfuscating source code
Some solutions try to obfuscate the http source code to make it harder for machines to read it. The problem here with this method is that if a web browser can understand the obfuscated code, so can any other program. Obfuscating source code may also interfere with how search engines see and treat your website. If you decide to implement this you should do it with great care.
Blacklists
Blacklists consisting of IP:s known to scrape the site is not really a method in itself since you still need to detect a scraper first in order to blacklist him. Even so it is still a blunt weapon since IP:s tend to change over time. In the end you will end up blocking legitimate users with this method. If you still decide to implement black lists you should have a procedure to review them on at least a monthly basis.
I have a DB with user accounts information.
I've scheduled a CRON job which updates the DB with every new user data it fetches from their accounts.
I was thinking that this may cause a problem since all requests are coming from the same IP address and the server may block requests from that IP address.
Is this the case?
If so, how do I avoid being banned? should I be using a proxy?
Thanks
You get banned for suspicious (or malicious) activity.
If you are running a normal business application inside a normal company intranet you are unlikely to get banned.
Since you have access to user accounts information, you already have a lot of access to the system. The best thing to do is to ask your systems administrator, since he/she defines what constitutes suspicious/malicious activity. The systems administrator might also want to help you ensure that your database is at least as secure as the original information.
should I be using a proxy?
A proxy might disguise what you are doing - but you are still doing it. So this isn't the most ethical way of solving the problem.
Is the cron job that fetches data from this "database" on the same server? Are you fetching data for a user from a remote server using screen scraping or something?
If this is the case, you may want to set up a few different cron jobs and do it in batches. That way you reduce the amount of load on the remote server and lower the chance of wherever you are getting this data from, blocking your access.
Edit
Okay, so if you have not got permission to do scraping, obviously you are going to want to do it responsibly (no matter the site). Try gather as much data as you can from as little requests as possible, and spread them out over the course of the whole day, or even during times that a likely to be low load. I wouldn't try and use a proxy, that wouldn't really help the remote server, but it would be a pain in the ass to you.
I'm no iPhone programmer, and this might not be possible, but you could try have the individual iPhones grab the data so all the source traffic isn't from the same IP. Just an idea, otherwise just try to be a bit discrete.
Here are some tips from Jeff regarding the scraping of Stack Overflow, but I'd imagine that the rules are similar for any site.
Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.
Identify yourself. Add something useful to the user-agent (ideally, a link to an URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."
Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??
Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.
Yes, you want an API. We get it. Don't rage against the machine by doing naughty things until we build it. It's in the queue.