With all of the HTTP data available, what 'signs' can you look for to recognize Google's search engine robots?
How to verify Googlebot - the official method.
As far as I know, Google's crawlers send a user-agent string containing "Googlebot".
Other search engine providers typically stick to a recognisable name in the user-agent; there are various lists of well-known agents, such as that on http://www.jafsoft.com/searchengines/webbots.html.
The User-Agent header should be enough to detect the Google bot.
Check out the user-agents.org website to get a list of known search engine bots.
By the way, if you want to be sure it's a true Googlebot from Google, then you can check the IP/host, which always resolves to
c[nn].googlebot.com
Where [nn] is a number.
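If you want to automate that check, here is a minimal sketch of the idea in C# (the method name LooksLikeGooglebot is just illustrative): reverse-resolve the caller's IP, make sure the host name ends in googlebot.com, then forward-resolve that host name and confirm it maps back to the same IP, so a spoofed reverse record on its own isn't enough.

using System;
using System.Linq;
using System.Net;

public static bool LooksLikeGooglebot(string ipAddress)
{
    try
    {
        // Reverse lookup: IP -> host name
        string host = Dns.GetHostEntry(ipAddress).HostName;
        if (!host.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase))
            return false;

        // Forward lookup: host name -> IPs, must include the original address
        return Dns.GetHostAddresses(host).Any(a => a.ToString() == ipAddress);
    }
    catch (System.Net.Sockets.SocketException)
    {
        return false; // lookup failed, treat as unverified
    }
}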
Well, I'm not so sure how maintainable it is to be doing DNS reverse lookups for IP addresses. I would only do this if you were concerned about someone spoofing Google's user agent strings, which is highly unlikely. It can also be spoofed itself, as the article points out.
You're best off just matching their known user agents:
Regex.IsMatch(ua, @"googlebot|mediapartners-google|adsbot-google", RegexOptions.IgnoreCase);
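For context, a hedged example of how that check might be wired into an ASP.NET request (the helper name IsKnownGoogleCrawler is made up for illustration):

using System.Text.RegularExpressions;
using System.Web;

public static bool IsKnownGoogleCrawler(HttpRequest request)
{
    // Request.UserAgent can be null for some clients
    string ua = request.UserAgent ?? string.Empty;
    return Regex.IsMatch(ua, @"googlebot|mediapartners-google|adsbot-google",
                         RegexOptions.IgnoreCase);
}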
Is there a way to determine if an HTTP request to an ASP.NET application is made from a browser or from a robot/crawler? I need to differentiate between these two kinds of requests.
Thanks!
No, there isn't. There is no foolproof way to determine what originated a request - all HTTP headers can be spoofed.
Some crawlers (GoogleBot and such) do advertise themselves, but that doesn't mean a person browsing can't pretend to be GoogleBot.
The best strategy is to look for the well-known bots (by User-Agent header and possibly by the known IP address) and assume those are crawlers.
Well... only if the robot wants to be recognized as a robot, because it can easily pretend to be a web browser.
Personally, I would use this list to start: http://www.robotstxt.org/db.html
Have a look at Request.Browser.Crawler, but that only works for some crawlers.
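For example (a minimal sketch; Request.Browser.Crawler just reflects the browser definition files shipped with ASP.NET, so a false value means "not recognized", not "definitely human"):

protected void Page_Load(object sender, EventArgs e)
{
    if (Request.Browser.Crawler)
    {
        // Recognized crawler: e.g. skip analytics or serve a lighter page
    }
}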
On an ASP website, is there a way to tell whether a visitor is a robot?
I'm thinking there might be a parameter in the ServerVariables collection that could be used, in a similar way to HTTP_X_FORWARDED_FOR and REMOTE_ADDR can be used to get the visitor's IP addresses.
Searches on Google have so far yielded few leads.
Thanks for your help.
There is no bullet-proof method because headers and origins can be spoofed.
My suggestion would be to try
HTTP_USER_AGENT
If a visitor accesses robots.txt, it's most likely a spider.
If there is nothing in the host or user-agent information, there is no referring URL, the IP address changes within a visit, or the log lines appear together in an uninterrupted block in the log file, then it's most likely robot traffic.
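As a rough sketch of the user-agent and robots.txt signals in ASP.NET terms (the other signals are really log analysis; classic ASP reads the same server variables, and the likelyBot flag and the body of the if are placeholders) - something like this could sit in Global.asax:

void Application_BeginRequest(object sender, EventArgs e)
{
    var request = HttpContext.Current.Request;
    string userAgent = request.ServerVariables["HTTP_USER_AGENT"];

    bool likelyBot =
        string.IsNullOrEmpty(userAgent) ||                                         // no user-agent at all
        request.Path.EndsWith("/robots.txt", StringComparison.OrdinalIgnoreCase);  // fetching robots.txt

    if (likelyBot)
    {
        // e.g. log the IP and user-agent so the access pattern can be reviewed later
    }
}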
Thanks.
I've come across a rather unusual issue. If you deal with scaling large sites and work with a company like Akamai, you have origin servers that Akamai talks to. Whatever you serve to Akamai, they will propagate on their CDN.
But how do you handle robots.txt? You don't want Google to crawl your origin. That can be a HUGE security issue. Think denial of service attacks.
But if you serve a robots.txt on your origin with "disallow", then your entire site will be uncrawlable!
The only solution I can think of is to serve a different robots.txt to Akamai and to the world. Disallow to the world, but allow to Akamai. But this is very hacky and prone to so many issues that I cringe thinking about it.
(Of course, origin servers shouldn't be viewable to the public, but I'd venture to say most are for practical reasons...)
It seems an issue the protocol should be handling better. Or perhaps allow a site-specific, hidden robots.txt in the Search Engine's webmaster tools...
Thoughts?
If you really want your origins not to be public, use a firewall / access control to restrict access for any host other than Akamai - it's the best way to avoid mistakes and it's the only way to stop the bots & attackers who simply scan public IP ranges looking for webservers.
That said, if all you want is to avoid non-malicious spiders, consider using a redirect on your origin server which redirects any requests which don't have a Host header specifying your public hostname to the official name. You generally want something like that anyway to avoid issues with confusion or search rank dilution if you have variations of the canonical hostname. With Apache this could use mod_rewrite or even a simple virtualhost setup where the default server has RedirectPermanent / http://canonicalname.example.com/.
If you do use this approach, you could either simply add the production name to your test systems' hosts file when necessary or also create and whitelist an internal-only hostname (e.g. cdn-bypass.mycorp.com) so you can access the origin directly when you need to.
What is the point of doing this?
I want a reason why it's a good idea to send a person back to where they came from if the referrer is outside of the domain. I want to know why a handful of websites out there insist that this is good practice. It's easily exploitable, easily bypassed by anyone who's logging in with malicious intent, and just glares in my face as a useless "security" measure. I don't like to have my biased opinions on things without other input, so explain this one to me.
The request headers are only as trustworthy as your client, so why would you use them as a means of validation?
There are three reasons why someone might want to do this. Checking the referer is a method of CSRF Prevention. A site may not want people to link to sensitive content and thus use this to bounce the browser back. It may also be to prevent spiders from accessing content that the publisher wishes to restrict.
I agree it is easy to bypass this referer restriction on your own browser using something like TamperData. It should also be noted that the browser's HTTP request will not contain a referer if you're coming from an https:// page to an http:// page.
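For what it's worth, here is a hedged sketch of that kind of referer check in ASP.NET (www.example.com stands in for the real host name); it only bounces requests whose Referer is present and points at a different host, since the header may be absent or forged:

void CheckReferer(HttpRequest request, HttpResponse response)
{
    Uri referrer = request.UrlReferrer; // null when no Referer header was sent
    if (referrer != null &&
        !referrer.Host.Equals("www.example.com", StringComparison.OrdinalIgnoreCase))
    {
        response.Redirect("/"); // send them back to the front page instead
    }
}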
What would be the easiest way to ban a specific IP (or a range of addresses) from being able to access my publicly available web site?
Is it possible to do so using the ASP.NET only, without resorting to modifying any IIS settings?
It is easy and fast in ASP.NET using an HttpModule; just take a look at Hanselman's post:
http://www.hanselman.com/blog/AnIPAddressBlockingHttpModuleForASPNETIn9Minutes.aspx
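The gist of that approach, as a stripped-down sketch rather than Hanselman's exact code (the module still has to be registered in web.config, and the hard-coded address is only an example - a real deny list would come from config or a database):

using System.Collections.Generic;
using System.Web;

public class IpBlockModule : IHttpModule
{
    private static readonly HashSet<string> Banned =
        new HashSet<string> { "203.0.113.7" }; // example address only

    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            var ctx = ((HttpApplication)sender).Context;
            if (Banned.Contains(ctx.Request.UserHostAddress))
            {
                ctx.Response.StatusCode = 403;   // Forbidden
                ctx.Response.End();              // stop the request here
            }
        };
    }

    public void Dispose() { }
}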
You can check the Request.ServerVariables["REMOTE_ADDR"] value and, if they're banned, redirect them to Yahoo or something.
Indeed, Spencer Ruport's suggestion is the right way to go about it. (Not sure I would redirect to Yahoo, however - a page informing the user they have been banned would be better, with some option for contacting the web admin if the client feels they were inadvertently banned.)
I would add that it would be wise to check the HTTP_X_FORWARDED_FOR server variable (representing the IP forwarded by a proxy, or null if none) first, in order to avoid banning the IP address of the proxy (and thus potentially many other users).
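A small sketch of that ordering (the header can contain a comma-separated chain of addresses, in which case the first entry is the original client, and like any header it can be forged):

static string GetClientIp(HttpRequest request)
{
    string forwarded = request.ServerVariables["HTTP_X_FORWARDED_FOR"];
    if (!string.IsNullOrEmpty(forwarded))
        return forwarded.Split(',')[0].Trim(); // first hop is the original client

    return request.ServerVariables["REMOTE_ADDR"];
}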