How to tell if web visitor is a robot - asp-classic

On an ASP website, is there a way to tell whether a visitor is a robot?
I'm thinking there might be a parameter in the ServerVariables collection that could be used, similar to the way HTTP_X_FORWARDED_FOR and REMOTE_ADDR can be used to get the visitor's IP address.
Searches on Google have so far yielded few leads.
Thanks for your help.

There is no bullet-proof method because headers and origins can be spoofed.

My suggestion would be to try
HTTP_USER_AGENT
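For illustration, a minimal sketch of that check (the bot-name list is illustrative, not exhaustive, and any client can lie about its User-Agent; in classic ASP the same value is read with Request.ServerVariables("HTTP_USER_AGENT") - the sketch below uses C# syntax):

// Sketch: flag requests whose User-Agent is empty or contains a
// well-known bot substring. Example names only, easily spoofed.
using System.Text.RegularExpressions;

public static class BotSniffer
{
    public static bool LooksLikeBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
            return true; // real browsers virtually always send a User-Agent
        return Regex.IsMatch(userAgent,
            "bot|crawler|spider|slurp|googlebot",
            RegexOptions.IgnoreCase);
    }
}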

If a visitor accesses robots.txt, it's most likely a spider.
If there is nothing in the host or user-agent information, if there is no referring URL, or if the IP address changes within a visit, it's most likely robot traffic.
The same goes if the log lines appear together in an uninterrupted block in the log file.

Related

Scraping websites via Google Cached Pages has been blocked

I'm trying to create a service that scrapes websites by using Google Cached Pages.
Example:
https://webcache.googleusercontent.com/search?q=cache:nike.com
The response I get is the HTML from Google's cache, which is an older version of the Nike site.
It works fine as long as I run it locally on my computer, but when I deploy it to Google Cloud Platform, where I use a proxy server, I get a 403 error saying that I cannot access the information through a proxy server.
Example of the response from the proxy server:
403. That's an error. Your client does not have permission to get URL /search?q=cache:http://nike.com from this server. (Client IP address: XX.XXX.XX.XXX)
Please see Google's Terms of Service posted at https://policies.google.com/terms. If you believe that you have received this response in error, please report your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the entire code displayed below. Please also send us any information you may know about how you are performing your Google searches -- for example, "I'm using the Opera browser on Linux to do searches from home. My Internet access is through a dial-up account I have with the FooCorp ISP." or "I'm using the Konqueror browser on Linux to search from my job at myFoo.com. My machine's IP address is 10.20.30.40, but all of myFoo's web traffic goes through some kind of proxy server whose IP address is 10.11.12.13." (If you don't know any information like this, that's OK. But this kind of information can help us track down problems, so please tell us what you can.) We will use all this information to diagnose the problem, and we'll hopefully have you back up and searching with Google again quickly! Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email. So don't despair if you don't hear back from us! Also note that if you do not send us the entire code below, we will not be able to help you. Best wishes, The Google Team
An article that discusses the problem: https://proxyserver.com/web-scraping-crawling/scraping-websites-via-google-cached-pages/
How can I solve this problem and run requests from the cloud without being blocked? Do I need to add parameters?
Thanks :)
I guess that you should add a property to the header of your HTTP request, for example:
// Set a browser-like User-Agent on a java.net.URLConnection
URL u = new URL("https://www.google.com/search?q=c");
URLConnection c = u.openConnection();
c.setRequestProperty("User-Agent", "MSIE 7.0");
or
HttpRequest request = HttpRequest.newBuilder(new URI("https://www.google.com/search?q=c"))
        .header("User-Agent", "MSIE 7.0").GET().build(); // java.net.http client (Java 11+)
// note: change the URI to your target
These two examples are in Java, but I guess the same concept applies in all environments.
Hope that was helpful.

Robot request to an ASP.Net app

Is there a way to determine whether an HTTP request to an ASP.NET application is made from a browser or from a robot/crawler? I need to differentiate these two kinds of requests.
Thanks!
No, there isn't. There is no foolproof way to determine what originated a request - all HTTP headers can be spoofed.
Some crawlers (GoogleBot and such) do advertise themselves, but that doesn't mean a person browsing can't pretend to be GoogleBot.
The best strategy is to look for the well-known bots (by User-Agent header and possibly by known IP address) and assume those are crawlers.
Well... only if the robot wants to be recognized as a robot, because it can easily pretend to be a web browser.
Personally, I would use this list as a starting point: http://www.robotstxt.org/db.html
Have a look at Request.Browser.Crawler, but that only works for some crawlers.
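As a rough sketch of how the two suggestions above can be combined (the bot list is illustrative, and all of this is spoofable):

// Sketch (System.Web): trust Request.Browser.Crawler where it works, then
// fall back to matching a few well-known crawler user agents.
using System.Text.RegularExpressions;
using System.Web;

public static class CrawlerCheck
{
    public static bool IsCrawler(HttpRequest request)
    {
        // Recognised via the .browser capability files (covers only some bots).
        if (request.Browser != null && request.Browser.Crawler)
            return true;

        string ua = request.UserAgent ?? "";
        // Fallback: well-known crawler names; spoofable, like any header.
        return Regex.IsMatch(ua, "googlebot|bingbot|slurp|baiduspider",
                             RegexOptions.IgnoreCase);
    }
}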

Creating subdomain in URL alaising

I am creating a social networking site, and one of the requirements is to have a subdomain-like URL for each user. For example, user1's profile page will be user1.mysitename.com and user2's profile page will be user2.mysitename.com.
Can it be done using URL aliasing? Basically, user1.mysitename.com should serve www.mysitename.com/profile.aspx?username=user1.
I will be hosting this on Windows 2003 (IIS6); any help is highly appreciated.
You can either respond to each GET request for user1.mysitename.com with the same contents as www.mysitename.com/profile.aspx?username=user1, or you can answer with a redirection (an HTTP 302 response) from the first URL to the second.
However, you first have to make sure the DNS server that is authoritative for mysitename.com is aware of all these subdomains and responds with the answer you need (either the IP of the server, or a CNAME to a domain that resolves to an IP).
EDIT:
When someone tries to surf to user1.mysitename.com, he will first try to resolve user1.mysitename.com to get its IP - here you need the DNS server to tell him the IP for the domain user1.mysitename.com.
After the user has the IP of the domain, he will request the page using an HTTP GET request, and you need to respond to it somehow. One way is to redirect him to a different URL (www.mysitename.com/profile.aspx?username=user1). Another way is to simply respond to the GET request and serve the page he's looking for, as in the sketch below.
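A rough sketch of that second option, assuming a wildcard DNS record (*.mysitename.com) already points at the server and IIS6 is configured to accept those host headers; the rewrite goes in Global.asax:

// Sketch (Global.asax): map user1.mysitename.com to /profile.aspx?username=user1
// with an internal rewrite, so the browser keeps the subdomain URL.
void Application_BeginRequest(object sender, EventArgs e)
{
    string host = Request.Url.Host;        // e.g. "user1.mysitename.com"
    string[] parts = host.Split('.');
    if (parts.Length == 3 && parts[0] != "www")
    {
        Context.RewritePath("/profile.aspx?username=" + parts[0]);
        // Alternatively, a visible HTTP 302 redirect to the canonical URL:
        // Response.Redirect("http://www.mysitename.com/profile.aspx?username=" + parts[0]);
    }
}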

What HTTP signatures are encountered from Google web crawling robots?

With all HTTP data available, what 'signs' can you look for to recognize Google's search engine robots?
How to verify googlebot - the official method.
As far as I know, Google's crawlers have the user-agent set to "Googlebot".
Other search engine providers typically stick to a recognisable name in the user-agent; there are various lists of well-known agents, such as that on http://www.jafsoft.com/searchengines/webbots.html.
The User-Agent header should be enough to detect the Google bot.
Check out the user-agents.org website for a list of known search engine bots.
By the way, if you want to be sure it's a true Googlebot from Google, you can check the IP's host name, which is always
c[nn].googlebot.com
where [nn] is a number.
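A minimal sketch of that verification in C#: reverse-resolve the client IP, check the resulting domain, then forward-resolve the name and confirm it maps back to the same IP (the forward-confirmed reverse DNS check that Google's documentation describes). The googlebot.com/google.com suffixes are the ones named in that documentation:

// Sketch: forward-confirmed reverse DNS check for Googlebot.
using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;

public static class GooglebotVerifier
{
    public static bool IsRealGooglebot(string clientIp)
    {
        try
        {
            // Reverse lookup: IP -> host name.
            string host = Dns.GetHostEntry(clientIp).HostName;
            if (!host.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase) &&
                !host.EndsWith(".google.com", StringComparison.OrdinalIgnoreCase))
                return false;

            // Forward confirmation: the name must resolve back to the same IP.
            return Dns.GetHostEntry(host).AddressList
                      .Any(a => a.ToString() == clientIp);
        }
        catch (SocketException)
        {
            return false; // lookup failed; treat as unverified
        }
    }
}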
Well, I'm not so sure how maintainable it is to be doing reverse DNS lookups for IP addresses. I would only do this if you were concerned about someone spoofing Google's user-agent strings, which is highly unlikely. The reverse lookup can also be spoofed itself, as the article points out.
You're best off just matching their known user agents:
Regex.IsMatch(ua, @"googlebot|mediapartners-google|adsbot-google", RegexOptions.IgnoreCase); // ua is the request's User-Agent string

Can I ban an IP address (or a range of addresses) in an ASP.NET application?

What would be the easiest way to ban a specific IP (or a range of addresses) from being able to access my publicly available web site?
Is it possible to do so using the ASP.NET only, without resorting to modifying any IIS settings?
It is easy and fast in ASP.NET using an HttpModule; just take a look at Hanselman's post:
http://www.hanselman.com/blog/AnIPAddressBlockingHttpModuleForASPNETIn9Minutes.aspx
You can check the Request.ServerVariables["REMOTE_ADDR"] value and, if it's banned, redirect the visitor to Yahoo or something.
Indeed, Spencer Ruport's suggestion is the right way to go about it. (Not sure I would redirect to Yahoo, however - a page informing users they have been banned would be better, with some option for contacting the web admin if they feel they were inadvertently banned.)
I would add that it would be wise to check the HTTP_X_FORWARDED_FOR server variable (representing the IP forwarded by a proxy, or null if none) first, in order to avoid banning the IP address of the proxy itself (and thus potentially many other users).
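Putting those pieces together, a bare-bones version of the HttpModule idea might look like this (the banned addresses are hard-coded for illustration; a real application would load them from configuration, and the module still has to be registered in web.config):

// Sketch: deny requests from banned IPs before any page runs.
using System;
using System.Collections.Generic;
using System.Web;

public class BanIpModule : IHttpModule
{
    // Example addresses only; load from config or a database in practice.
    private static readonly HashSet<string> Banned =
        new HashSet<string> { "203.0.113.7", "198.51.100.23" };

    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            HttpContext ctx = ((HttpApplication)sender).Context;
            // Prefer the proxy-forwarded address, per the note above; it can
            // hold a comma-separated list, so take the first entry.
            string ip = ctx.Request.ServerVariables["HTTP_X_FORWARDED_FOR"]
                        ?? ctx.Request.ServerVariables["REMOTE_ADDR"];
            if (ip != null && Banned.Contains(ip.Split(',')[0].Trim()))
            {
                ctx.Response.StatusCode = 403; // banned: refuse the request
                ctx.Response.End();
            }
        };
    }

    public void Dispose() { }
}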
