How do I detect bots programmatically - asp.net

We have a situation where we log visits and visitors on page hits, and bots are clogging up our database. We can't use CAPTCHA or similar techniques because this happens before we ever ask for human input; basically, we are logging page hits and would like to log only hits made by humans.
Is there a list of known bot IPs out there? Does checking against known bot user agents work?

There is no sure-fire way to catch all bots. A bot could act just like a real browser if someone wanted that.
Most serious bots identify themselves clearly in the agent string, so with a list of known bots you can filter out most of them. To that list you can also add the agent strings that some HTTP libraries use by default, to catch bots from people who don't even know how to change the agent string. If you simply log the agent strings of your visitors, you should be able to pick out the ones to store in the list.
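For illustration, here is a minimal C# sketch of that kind of agent-string filter; the class name and the fragment list are assumptions you would maintain yourself, not an authoritative set:
// A minimal sketch of agent-string filtering, assuming a hand-maintained list.
// The fragments below are examples of common bot markers and default
// HTTP-library agent strings, not a complete or authoritative list.
using System;
using System.Linq;

public static class BotUserAgents
{
    private static readonly string[] KnownBotFragments =
    {
        "bot", "crawler", "spider",          // generic crawler markers
        "curl", "wget", "python-requests",   // default agent strings of HTTP libraries
        "libwww-perl"
    };

    public static bool LooksLikeBot(string userAgent)
    {
        // Empty agent strings are almost always non-browser clients.
        if (string.IsNullOrEmpty(userAgent))
            return true;

        return KnownBotFragments.Any(fragment =>
            userAgent.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
Calling something like BotUserAgents.LooksLikeBot(Request.UserAgent) before writing the hit to the database would skip most self-identifying crawlers.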
You can also make a "bad bot trap" by putting a hidden link on your page that leads to a page which is disallowed in your robots.txt file. Serious bots will not follow the link, and humans can't click on it, so only bots that don't follow the rules will request the page.
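As a rough sketch of that trap (the handler path, the hidden-link markup, and the helper that records offenders are all assumptions, not anything prescribed above):
// A rough sketch of the "bad bot trap". The handler path (/bot-trap.ashx),
// the hidden-link markup, and the BlockIpAddress(...) helper are hypothetical.
//
// robots.txt:
//   User-agent: *
//   Disallow: /bot-trap.ashx
//
// Hidden link in the page (invisible to humans):
//   <a href="/bot-trap.ashx" style="display:none" rel="nofollow">&nbsp;</a>
using System.Web;

public class BotTrapHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // Anything that reaches this handler ignored robots.txt; flag its IP.
        BlockIpAddress(context.Request.UserHostAddress);
        context.Response.StatusCode = 404;
    }

    public bool IsReusable
    {
        get { return true; }
    }

    private static void BlockIpAddress(string ipAddress)
    {
        // hypothetical: record the IP in a "bad bots" table or cache
    }
}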

Depending on the type of bot you want to detect:
Detecting Honest Web Crawlers
Detecting Stealth Web Crawlers

You can use Request.Browser.Crawler to detect crawlers programmatically; preferably, keep the list of recognized crawlers up to date, as described here:
http://www.primaryobjects.com/cms/article102.aspx
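As a minimal sketch, assuming the hit counting happens in Global.asax and that LogVisit is your own (hypothetical) logging helper:
// In Global.asax.cs: skip the counter when ASP.NET flags the request as a crawler.
// Accuracy depends on the *.browser definition files being kept up to date.
using System;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        HttpRequest request = Request;

        if (request.Browser != null && request.Browser.Crawler)
            return; // recognized crawler: don't count this hit

        LogVisit(request.UserHostAddress, request.UserAgent);
    }

    private static void LogVisit(string ipAddress, string userAgent)
    {
        // hypothetical: write the hit to your visits table
    }
}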

I think many bots would be identifiable by user agent, but surely not all of them. As for a list of known IPs, I wouldn't count on that either.
A heuristic approach might work. Bots are usually much quicker at following links than people. Maybe you can track each client's IP and measure the average speed with which it follows links; a crawler probably follows every link almost immediately (or at least much faster than a human would).
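A rough sketch of that heuristic, assuming an in-memory last-hit table and a made-up one-second threshold (a real implementation would also expire old entries):
using System;
using System.Collections.Concurrent;

public static class RequestSpeedHeuristic
{
    private static readonly ConcurrentDictionary<string, DateTime> LastHit =
        new ConcurrentDictionary<string, DateTime>();

    // Returns true when the same IP requests pages faster than a human plausibly could.
    public static bool LooksLikeBot(string ipAddress)
    {
        DateTime now = DateTime.UtcNow;
        DateTime previous;
        bool seenBefore = LastHit.TryGetValue(ipAddress, out previous);
        LastHit[ipAddress] = now;

        // Less than one second between page loads is suspicious for a human.
        return seenBefore && (now - previous) < TimeSpan.FromSeconds(1);
    }
}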

Have you already added a robots.txt? While this won't stop malicious bots, you might be surprised at the amount of legitimate crawling activity already occurring on your site.

I don't think there is a list of botnet IP addresses; botnet IPs are not static, and nobody knows exactly which clients are bots, including users whose behavior looks bot-like.
Your question is arguably a hot research area right now, and I'm curious whether someone can offer a solution to the problem.
You can use whatever technique tells you whether the visitor is human or not, and then log only those visits.

I think the best way to do this is to use a link meant only for non-human users (bots, crawlers, etc.), gather their user agents, and then filter by user agent.
You have to make the link invisible to humans for this to work.
You can also add a robots.txt file to the root of your site that disallows the page the link points to.

Related

Avoid website grab programs

I found several programs on the internet which can grab your website and download the whole site to your PC. How can one secure a website against these programs?
Link: http://www.makeuseof.com/tag/save-and-backup-websites-with-httrack/
You have to tell whether the visitor is human or bot in the first place. This is no easy task; see e.g.: Tell bots apart from human visitors for stats?
Then, once you have detected which bot it is, you can decide whether or not you want to give it your website content. Legitimate bots (like Googlebot) conveniently provide their own user-agent IDs; malicious bots / web crawlers may disguise themselves as common browser programs.
There is no 100% solution, anyway.
If your content is really sensitive, you may want to add a CAPTCHA or user authentication.

ASP.NET: less-known ways of tracking unregistered users

I am building an application that needs to interact with users who have no accounts and keep track of them. I know OpenID is great and easy and I've used it in almost all my apps, but accounts are not an option here, not even the ones a user is likely to already have, like a Facebook, Google, or Yahoo account.
Any coding language is acceptable (but ASP.NET, JavaScript, or Flash would be best, or a combination).
So my plan is to use cookies... but cookies are so easily removed that I really don't count them as a reliable identifier.
IP address... well, this works even through proxies, but if someone uses a dynamic IP, as my whole country does, this also becomes unreliable.
Flash cookies are fine, but I recently read an article describing how Mozilla Firefox's history-cleaning system gets rid of them too; I would like confirmation of this.
Browser fingerprinting: I don't know how reliable it is, since anyone who knows even a little of any language that can send HTTP requests can spoof it (the client string, at least); see the sketch after this list.
If anyone knows of methods other than the ones I listed, or wants to correct something in my list, feel free to reply.
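To make the fingerprinting item concrete, here is a minimal server-side sketch that hashes a few request values into a weak identifier; the choice of headers is only an example, and, as noted above, every input can be spoofed:
// A minimal sketch of server-side fingerprinting based on request headers.
// Treat the result as a weak hint, not a reliable identifier.
using System;
using System.Security.Cryptography;
using System.Text;
using System.Web;

public static class VisitorFingerprint
{
    public static string Compute(HttpRequest request)
    {
        string raw = string.Join("|",
            request.UserHostAddress ?? string.Empty,    // IP (unreliable behind NAT / dynamic IPs)
            request.UserAgent ?? string.Empty,          // browser identification string
            request.Headers["Accept-Language"] ?? "",   // language preferences
            request.Headers["Accept-Encoding"] ?? "");  // supported encodings

        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(raw));
            return Convert.ToBase64String(hash);
        }
    }
}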

How can I prevent users from voting more than once?

I want to prevent users from voting more than once on my website. I have used two methods, but neither works well:
Using a cookie.
The problem: users can delete the cookie and come back to vote again and again.
Using a database table.
The problem: users shouldn't be forced to register on my website!
So, how can I solve this problem?
You have your two answers, you need to decide which is best. No option is going to be bulletproof. It's all about slowing them down, and what level of effectiveness is acceptable for you.
A cookie is generally the acceptable way to do this. Yes, cookies can be cleared, but if preventing duplicate voting is that important, then registration is the only effective way to prevent it. Any other mechanism can probably be beaten by those who want to beat it. You could use something like Evercookie, but I don't generally think that's a good practice. If you make your registration process simple but effective, some users will go through it.
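A minimal sketch of the cookie approach, assuming an ASP.NET MVC action and a hypothetical RecordVote helper (a determined user can still clear the cookie; this only raises the effort required):
using System;
using System.Web;
using System.Web.Mvc;

public class VoteController : Controller
{
    public ActionResult Vote(int pollId, int optionId)
    {
        string cookieName = "voted_" + pollId;

        // Reject the vote if the poll's cookie is already present.
        if (Request.Cookies[cookieName] != null)
            return new HttpStatusCodeResult(409, "You have already voted.");

        RecordVote(pollId, optionId); // hypothetical persistence helper

        var cookie = new HttpCookie(cookieName, "1")
        {
            Expires = DateTime.UtcNow.AddYears(1),
            HttpOnly = true
        };
        Response.Cookies.Add(cookie);

        return new HttpStatusCodeResult(200);
    }

    private void RecordVote(int pollId, int optionId)
    {
        // hypothetical: persist the vote to the database
    }
}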
An IP address is just as flawed, since most residential IPs are not statically assigned. Someone could reset their modem and get a new IP address. Or worse, someone could reset their modem, get an IP address that has already visited the site, and be unable to vote. Another scenario is users behind NAT: if 200 people share an IP through NAT, only one of them will be able to vote.
You could get creative with the IP address, though. Keep using the cookie, because that will be effective. If you start detecting multiple votes from the same IP address (because users cleared their cookies), display a CAPTCHA. If it isn't someone trying to abuse the system, they still get the opportunity to vote. This will help defeat automated voting and slow abusers down enough that gaming your voting system isn't worth their time. This, too, can be defeated; it comes down to what level of effectiveness is acceptable to you. Even registration isn't 100% effective, though it is probably the most effective option: what would stop someone from registering many times with different email addresses?
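A rough sketch of that escalation, with a made-up threshold and a hypothetical VoteCountForIp helper:
public static class VoteAbuseCheck
{
    private const int SuspiciousVoteCount = 3; // made-up threshold

    // True when an IP has already cast several votes (cookies presumably cleared).
    public static bool RequiresCaptcha(int pollId, string ipAddress)
    {
        return VoteCountForIp(pollId, ipAddress) >= SuspiciousVoteCount;
    }

    private static int VoteCountForIp(int pollId, string ipAddress)
    {
        // hypothetical: count existing votes for this poll from this IP
        return 0;
    }
}
In the vote action, when RequiresCaptcha returns true, you would show a CAPTCHA challenge before accepting the vote.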
I don't think you have many options, since you are not forcing users to register. You need to use sessions or cookies. As pointed out in the comments, you can also check the IP address, but if the intended audience uses dynamic IP addresses assigned by their ISPs, that solution fails as well.
If possible, you can ask users to register with their Facebook/Google ID, as Stack Overflow does.
There isn't an infallible way to accomplish what you want from a web application, especially without requiring users to register.
This site (Stack Overflow) does it right, by registration. IP is a really bad idea, because all the folks behind a proxy/server can't vote. Most folks have multiple browsers; you don't even need to delete the voting cookie to vote again, just use another browser. As mentioned, OpenID is the lowest-impact + highest-security route, though people can still get around it via multiple accounts.

Considerations for anonymous users

So, the Web application I'm working on allows input from anonymous users (and their participation in the flagging system).
As for the spamming issue, would it be enough to use the honeypot method or is an image CAPTCHA (e.g. reCAPTCHA) necessary in this case?
For the flagging system, if I want to let anonymous users "flag" posts, it's not enough to allow one flag per post per cookie, because they have control over the cookies (and could bypass this prevention). I should allow only one flag per IP then, right? I know that this method would prevent users who share the same IP (yeah, corporate networks, etc.) from flagging the same post, but there is no other way around it, is there?
How can I ensure anonymous users' anonymity? By this I mean, how do I prevent their posts from being "tracked" (if that is even possible)? I know that every server keeps a log of every connection, so is it possible to hide theirs?
Any help would be greatly appreciated!
Honeypots are useless if your site is popular, because then people will write custom bots for it. For the flagging, you can limit it to one per cookie, and rate-limit it by IP. That way, people on corporate networks, etc. will be a little inconvenienced but not completely out of luck.
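A minimal sketch of the per-IP rate limit, assuming an in-memory store and a made-up hourly budget; a real site would persist this and expire entries:
using System;
using System.Collections.Generic;

public static class FlagRateLimiter
{
    private const int MaxFlagsPerHour = 10; // made-up limit
    private static readonly object Sync = new object();
    private static readonly Dictionary<string, List<DateTime>> FlagTimes =
        new Dictionary<string, List<DateTime>>();

    // Returns true if this IP is still under its hourly flag budget.
    public static bool AllowFlag(string ipAddress)
    {
        DateTime cutoff = DateTime.UtcNow.AddHours(-1);

        lock (Sync)
        {
            List<DateTime> times;
            if (!FlagTimes.TryGetValue(ipAddress, out times))
            {
                times = new List<DateTime>();
                FlagTimes[ipAddress] = times;
            }

            times.RemoveAll(t => t < cutoff); // drop entries outside the 1-hour window

            if (times.Count >= MaxFlagsPerHour)
                return false;

            times.Add(DateTime.UtcNow);
            return true;
        }
    }
}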
It's completely up to you what you log and how long you keep them. By default, the request IP may be logged, but you don't have to log it. Most sites do, but the real difference is how long they keep it.

ASP.NET counting visitors, not bots

I have an ASP.NET 4 web site. I'm counting visitors in the background, but my code counts search engine bots too. How can I tell whether a client is a bot or a human? I don't want to count bots.
You can use the Crawler property of Request.Browser to filter search engine bots.
You could check the user agent and look for entries of type "R", which indicates a robot or crawler.
See http://www.user-agents.org for more info.
I am sure there are cases where bots don't follow the standards, and you may have to handle those as one-offs.
Your best bet is probably checking the client's user agent:
http://support.microsoft.com/kb/306576
There may even be a quick little library out there for .NET with a lot of well known user agents or good regexps to use. Note that some bots will send fake user agents to make it look like they're people, some people's browsers may send empty or unknown user agents, etc. But those cases should be few and far between. For the most part this should get you pretty good statistics.
You can try and inspect the User Agent in the message header, for starters. A malicious bot will fake that, though. A more labor intensive approach is to log/inspect your IP visits programmatically (look in the web log files, or collect them yourself) and try to deduce which of them are bots based on frequency of visits, etc. Quite a cat and mouse game.
If you want to block crawlers from accessing certain pages, create a robots.txt file in your root directory with something like:
User-agent: *
Disallow: /            # note: "/" on its own disallows the entire site
Disallow: /MyPage.aspx
Check http://en.wikipedia.org/wiki/Robots_exclusion_standard and http://www.google.com/#hl=en&q=robots.txt for more details.
