I've got an ASP.NET 4 web site. I'm counting visitors in the background, but my code counts search engine bots too. How can I tell whether a client is a bot or a human? I don't want to count bots.
Regards
You can use the Crawler property of Request.Browser to filter search engine bots.
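For example, a rough sketch of that check inside a page (the CountVisitor method is hypothetical; plug in whatever your existing counting code does):

protected void Page_Load(object sender, EventArgs e)
{
    // Request.Browser.Crawler is true when ASP.NET's browser definition
    // files recognize the user agent as a known crawler.
    if (!Request.Browser.Crawler)
    {
        CountVisitor();   // hypothetical: record the visit in your own store
    }
}

private void CountVisitor()
{
    // placeholder for your existing visitor-counting logic
}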
You could check the User Agent and look it up for entries of Type R, which indicates a robot or crawler.
See http://www.user-agents.org for more info.
I am sure there are cases where bots don't follow the standards, and you might have to handle those as one-offs.
Your best bet is probably checking the client's user agent:
http://support.microsoft.com/kb/306576
There may even be a quick little library out there for .NET with a lot of well-known user agents, or good regexps to use. Note that some bots will send fake user agents to make it look like they're people, some people's browsers may send empty or unknown user agents, etc. But those cases should be few and far between; for the most part this should get you pretty good statistics.
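As a rough sketch of that idea, without assuming any particular library (the keyword list below is illustrative, not exhaustive):

// Very rough user-agent check; the keyword list is illustrative only.
private static readonly string[] BotKeywords =
    { "bot", "crawler", "spider", "slurp", "googlebot", "bingbot" };

public static bool LooksLikeBot(string userAgent)
{
    if (string.IsNullOrEmpty(userAgent))
        return false;                       // empty/unknown UA: count as human, or log separately

    string ua = userAgent.ToLowerInvariant();
    foreach (string keyword in BotKeywords)
    {
        if (ua.Contains(keyword))
            return true;
    }
    return false;
}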
You can try inspecting the User Agent in the request header, for starters. A malicious bot will fake that, though. A more labor-intensive approach is to log and inspect your IP visits programmatically (look in the web log files, or collect them yourself) and try to deduce which of them are bots based on frequency of visits, etc. It's quite a cat-and-mouse game.
If you want to block crawlers from accessing certain links, create a robots.txt file in your root directory, with something like:
User-agent: *
Disallow: /              # note: this blocks the entire site
Disallow: /MyPage.aspx   # or disallow just specific pages like this
Check:
http://en.wikipedia.org/wiki/Robots_exclusion_standard
http://www.google.com/#hl=en&q=robots.txt
I am creating my own short URL website, 9o9.in.
When a visitor hits a short URL generated by my site, they will essentially hit my server first. But I know there might be several links to potentially harmful or inappropriate sites that get shortened using my site's service.
To make sure I am not giving my site a negative reputation in terms of SEO, by linking to or HTTP-referring sites that a search engine considers unacceptable, should I go for a server-side redirect (for example using PHP's header() function), or should I do a JavaScript-based client-side redirect?
Well, I know the wiser solution is to prevent users from generating short links to unacceptable sites in the first place. But right now I can't afford to implement that, as it would require an extensive amount of data analysis or expensive word-filtering APIs...
Any help is highly appreciated.
Thanks.
A server-side redirect will have lower latency: the browser can immediately begin fetching the new page, whereas with a client-side redirect in JavaScript the browser must first finish downloading and then execute your JavaScript code. Therefore, it is in your users' best interest to use a server-side redirect wherever possible rather than a client-side one. And because it is in the users' best interest, it is also in a search engine's best interest to reward such behavior (indeed, Google has publicly stated that end-user latency is one of many ranking signals it uses).
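For what it's worth, a minimal sketch of such a server-side redirect; it uses ASP.NET to match the rest of this page, but PHP's header('Location: ...') achieves the same thing. The handler name, query string, and ResolveShortCode lookup are all hypothetical.

// go.ashx.cs - hypothetical handler that resolves a short code and issues a 301
public class ShortUrlHandler : System.Web.IHttpHandler
{
    public void ProcessRequest(System.Web.HttpContext context)
    {
        string code = context.Request.QueryString["c"];   // e.g. /go.ashx?c=abc123
        string target = ResolveShortCode(code);           // look the code up in your own storage

        if (target == null)
        {
            context.Response.StatusCode = 404;            // unknown short code
            return;
        }

        // 301 tells browsers and crawlers the redirect is permanent; the client
        // starts fetching the target immediately, with no JavaScript involved.
        context.Response.RedirectPermanent(target);
    }

    public bool IsReusable { get { return true; } }

    private string ResolveShortCode(string code)
    {
        return null;   // placeholder: query your short-link table here
    }
}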
On the subject, though, you may want to take advantage of the Safe Browsing API to help you validate the URLs you redirect to for malware, so that you don't serve malware from these links.
I found several programs on the internet that can grab your website and download the whole site to your PC. How can one secure a website against these programs?
Link: http://www.makeuseof.com/tag/save-and-backup-websites-with-httrack/
You have to tell whether the visitor is human or a bot in the first place. This is no easy task; see e.g. Tell bots apart from human visitors for stats?
Then, once you have detected which bot it is, you can decide whether you want to give it your website content or not. Legitimate bots (like Googlebot) will conveniently identify themselves in their user agent; malicious bots and web crawlers may disguise themselves as common browsers.
There is no 100% solution, anyway.
If your content is really sensitive, you may want to add a CAPTCHA or user authentication.
I build ASP.NET websites (hosted under IIS 6 usually, often with SQL Server backends and forms authentication).
Clients sometimes ask if I can check whether there are people currently browsing (and/or users currently logged in to) their website at a given moment, usually so they can safely do a deployment (they want to apply a hotfix, for example).
I know the web is basically stateless, so I can't be sure whether someone has closed the browser window, but I imagine there'd be some count of not-yet-timed-out sessions or something, and surely a count of logged-in users...
Is there a standard and/or easy way to check this?
Jakob's answer is correct but does rely on installing and configuring the Membership features.
A crude but simple way of tracking users online would be to store a counter in the Application object. This counter could be incremented/decremented upon their sessions starting and ending. There's an example of this on the MSDN website:
Session-State Events (MSDN Library)
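A rough sketch of that counter in Global.asax, assuming in-process session state (Session_End is only raised for the InProc mode):

// Global.asax - crude online-user counter stored in Application state
void Application_Start(object sender, EventArgs e)
{
    Application["UsersOnline"] = 0;
}

void Session_Start(object sender, EventArgs e)
{
    Application.Lock();                    // Application state is shared, so lock before updating
    Application["UsersOnline"] = (int)Application["UsersOnline"] + 1;
    Application.UnLock();
}

void Session_End(object sender, EventArgs e)   // only fires with InProc session state
{
    Application.Lock();
    Application["UsersOnline"] = (int)Application["UsersOnline"] - 1;
    Application.UnLock();
}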
Because the default session timeout is 20 minutes, the accuracy of this method isn't guaranteed (but then that applies to any web application, due to the stateless and disconnected nature of HTTP).
I know this is a pretty old question, but I figured I'd chime in. Why not use Google Analytics and view their real time dashboard? It will require minor code modifications (i.e. a single script import) and will do everything you're looking for...
You may be looking for the Membership.GetNumberOfUsersOnline method, although I'm not sure how reliable it is.
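For reference, it is a one-line call, assuming the Membership provider is already configured; a user counts as online if they were active within Membership.UserIsOnlineTimeWindow minutes:

// Requires a configured ASP.NET Membership provider.
int usersOnline = System.Web.Security.Membership.GetNumberOfUsersOnline();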
Sessions, suggested by other users, are a basic way of doing this, but they are not too reliable: they work well in some circumstances, but not in others.
For example, if users are downloading large files, watching videos, or listening to podcasts, they may stay on the same page for hours (unless the requests for the binary data are also tracked by ASP.NET), but they are still using your website.
Thus, my suggestion is to use the server logs to detect whether the website is currently being used by many people. This gives you the ability to:
See what sort of requests are done. It's quite easy to detect humans and crawlers, and with some experience, it's also possible to see if the human is currently doing something critical (such as writing a comment on a website, editing a document, or typing her credit card number and ordering something) or not (such as browsing).
See who is making those requests. For example, if Google is crawling your website, it is a very bad idea to go offline, unless search ranking doesn't matter to you. On the other hand, if a bot has been trying for two hours to crack your website by making requests to different pages, you can go offline for sure.
Note: if a website has some critical areas (for example, while writing this long answer, I would be angry if Stack Overflow went offline a few seconds before I submit it), you can also send regular AJAX requests to the server while the user stays on the page. Of course, you must be careful when implementing such a feature, and take into account that it will increase the bandwidth used and will not work if the user has JavaScript disabled.
You can run the netstat command and see how many active connections exist to your website's ports. The default port for HTTP is 80; for HTTPS it is 443.
We have a situation where we log visits and visitors on page hits, and bots are clogging up our database. We can't use CAPTCHA or other techniques like that, because this happens before we even ask for human input; basically we are logging page hits, and we would like to log only page hits made by humans.
Is there a list of known bot IPs out there? Does checking against known bot user agents work?
There is no sure-fire way to catch all bots. A bot could act just like a real browser if someone wanted that.
Most serious bots identify themselves clearly in the agent string, so with a list of known bots you can filter out most of them. You can also add to that list the agent strings that some HTTP libraries use by default, to catch bots from people who don't even know how to change the agent string. If you just log the agent strings of your visitors, you should be able to pick out the ones to store in the list.
You can also make a "bad bot trap" by putting a hidden link on your page that leads to a page that is disallowed in your robots.txt file. Serious bots will not follow the link, and humans can't click on it, so only bots that don't follow the rules will request the file.
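A rough sketch of such a trap, assuming your pages contain a link to a hypothetical /trap.ashx that is hidden from humans with CSS and disallowed in robots.txt:

// trap.ashx.cs - anything requesting this URL ignored robots.txt and followed
// a link humans never see, so treat it as a suspected bad bot.
public class TrapHandler : System.Web.IHttpHandler
{
    public void ProcessRequest(System.Web.HttpContext context)
    {
        string ip = context.Request.UserHostAddress;
        string userAgent = context.Request.UserAgent ?? "(none)";
        LogSuspectedBot(ip, userAgent);      // hypothetical: write to your own store

        context.Response.StatusCode = 404;   // give the bot nothing useful back
    }

    public bool IsReusable { get { return true; } }

    private void LogSuspectedBot(string ip, string userAgent)
    {
        // placeholder: e.g. insert into a "bad bots" table keyed by IP / user agent
    }
}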
Depending on the type of bot you want to detect:
Detecting Honest Web Crawlers
Detecting Stealth Web Crawlers
You can use Request.Browser.Crawler to detect crawlers programmatically; preferably, keep your list of recognized crawlers up to date as described here:
http://www.primaryobjects.com/cms/article102.aspx
I think many bots would be identifiable by user agent, but certainly not all of them. As for a list of known IPs, I wouldn't count on that either.
A heuristic approach might work. Bots are usually much quicker at following links than people. Maybe you can track each client's IP and measure the average speed with which it follows links; if it's a crawler, it probably follows every link almost immediately (or at least much faster than a human would).
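A rough sketch of that kind of heuristic; the window size and the 2-second threshold are made-up numbers that would need tuning, and a real version would also need to expire idle entries:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Flags an IP as bot-like when its average time between page hits is very short.
public static class RequestRateTracker
{
    private static readonly ConcurrentDictionary<string, Queue<DateTime>> Hits =
        new ConcurrentDictionary<string, Queue<DateTime>>();

    public static bool LooksLikeBot(string ip)
    {
        Queue<DateTime> times = Hits.GetOrAdd(ip, _ => new Queue<DateTime>());
        lock (times)
        {
            times.Enqueue(DateTime.UtcNow);
            while (times.Count > 20)           // keep only the last 20 hits per IP
                times.Dequeue();

            if (times.Count < 10)
                return false;                  // not enough data yet

            double avgSecondsBetweenHits =
                (DateTime.UtcNow - times.Peek()).TotalSeconds / (times.Count - 1);
            return avgSecondsBetweenHits < 2;  // faster than a page every ~2s looks automated
        }
    }
}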
Have you already added a robots.txt? While this won't stop malicious bots, you might be surprised at the legitimate crawling activity already occurring on your site.
I don't think there will be a list of botnet IP addresses; botnet IP addresses are not static, and nobody knows which visitors are the bots, including users that behave like bots.
Your question is arguably a hot research area right now; I'm curious whether someone can give a solution to that problem.
You can use any technique that tells you whether the visitor is human, and then filter your logs accordingly.
I think a good way to do this is to publish a link that only non-human visitors (bots, crawlers, etc.) will follow, gather their user agents, and then filter your logs by user agent.
For this to work you have to make the link unobservable to humans, and you can also add a robots.txt at the root of your site that disallows it, so that well-behaved crawlers stay away while rule-breaking bots reveal themselves by requesting it.
I want to create an in-house RSS feed (I work for 3 Mobile, Australia) for consumption on an INQ1 mobile phone, or any other RSS reader for that matter. However, testing it out on the phone's built-in RSS reader, I realize that without the ability to password protect the feed, or otherwise restrict access to it, I stand little chance of being able to develop this idea further.
One thing I thought of was to periodically change the Uri for the feed, so managers who had left the company couldn't continue to subscribe and see sensitive information, but the idea of making users do that would make it a harder sell, and furthermore is terribly inelegant.
Does anybody know how to make it so that prior to downloading a feed, a reader would have to authenticate the user? Is it part of the metadata within the feed, or something you would set in the reader software?
Update: I should have explained that I have already placed folder-level permissions on the parent folder, which brings up the normal authentication dialog when the feed is viewed in a browser, but which just results in a failed update, with no explanation or warning, in the phone's RSS reader, and is indistinguishable from the file being missing when I next try to refresh the feed.
If the reader on the phone doesn't support HTTP Basic or Digest authentication, your best bet is to create a unique URL to the feed for each consumer. Have the user log in and generate a link with a token in it that is unique to that user. If the user ever leaves, you can then deny that token, shutting down access.
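A rough sketch of that token idea; the class, the in-memory storage (a database in practice), and the feed URL are all hypothetical:

using System;
using System.Collections.Concurrent;

// Issue a per-user feed token on login, check it on every feed request,
// and revoke it when the manager leaves the company.
public static class FeedTokens
{
    private static readonly ConcurrentDictionary<Guid, string> TokenToUser =
        new ConcurrentDictionary<Guid, string>();

    public static string IssueFeedUrl(string userName)
    {
        Guid token = Guid.NewGuid();
        TokenToUser[token] = userName;
        return "https://intranet.example.com/feed.ashx?token=" + token;   // made-up URL
    }

    public static bool IsValid(Guid token)
    {
        return TokenToUser.ContainsKey(token);
    }

    public static void Revoke(Guid token)
    {
        string ignored;
        TokenToUser.TryRemove(token, out ignored);
    }
}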
If you go this route, you probably want to investigate including the Feed Access Control bits in your feed. It's not perfect, but it is respected by the bigger aggregators, so if one of your clients decides to subscribe to the feed with Reader or Bloglines, things shouldn't show up in search results.
I believe you would set the permissions on the feed itself, forcing authentication, much like the Twitter feeds. The problem with this is that many readers (including Google Reader) don't yet support authenticated feeds.
The idea is to have authentication over a secure channel. These posts explain it pretty well:
RSS Security
Private RSS Feeds
Authentication by the web server is probably the best solution. However, to get around the issue of readers not supporting it (Google has been mentioned, and I have issues with Safari), you could implement a simple key-value pair appended to the URL.
For example:
http://www.mydomain/rss.php?key=value
Your system could then "authenticate" the key-value pair and output the RSS; an invalid key could get a standard "invalid authentication" message as a single-item RSS feed, or return a 40x error.
It's not very secure, as the key-value pair is visible in the URL, but it's a trade-off. Serving the feed over HTTPS, even without authentication, would be slightly more secure.
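A rough sketch of that key check, written as an ASP.NET handler to match the rest of this page (the key lookup and feed builder are hypothetical placeholders):

// rss.ashx.cs - check a ?key= value before emitting the feed
public class RssHandler : System.Web.IHttpHandler
{
    public void ProcessRequest(System.Web.HttpContext context)
    {
        string key = context.Request.QueryString["key"];

        if (!IsKnownKey(key))                        // hypothetical lookup of issued keys
        {
            context.Response.StatusCode = 403;       // or return a single-item "invalid key" feed instead
            return;
        }

        context.Response.ContentType = "application/rss+xml";
        context.Response.Write(BuildFeedXml());      // hypothetical feed builder
    }

    public bool IsReusable { get { return true; } }

    private bool IsKnownKey(string key)
    {
        return false;   // placeholder: check the key against the ones you have issued
    }

    private string BuildFeedXml()
    {
        return "<rss version=\"2.0\"><channel></channel></rss>";   // placeholder feed
    }
}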
Assuming your RSS feed is over HTTP then basic HTTP authentication would probably do the trick. This would either be done at the web server level (in IIS for example) or via whatever framework you're using to produce the feed (in ASP.NET for example).
The authentication scheme (basic username/password, NTLM, Kerberos etc) is up to you. If you're using WCF to produce the feed then these decisions are things you can make later and apply via config if needed.
Are you simply looking to authenticate consumers of the feed, or also to encrypt it to prevent the information from being read by a man in the middle? If you require encryption, then SSL is probably the easiest to implement.
You should avoid simply "hiding" the RSS feed by changing its name.
Update:
Your question (with its update) sounds like you're actually having issues with the RSS client on the device. You need to determine whether the phone's RSS client understands how to deal with basic/digest authentication, etc.
Assuming it doesn't, is there anything in the HTTP request that could allow you to associate a device with a user? Is there an HTTP header that gives you a unique device ID? If so, you might be able to perform a lookup against this data to do your own weak authentication, but you should remember that this sort of authentication could easily be spoofed.
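A tiny sketch of that kind of weak lookup; the X-Device-Id header name and the lookup method are purely hypothetical, and anything a client sends can be spoofed:

// Inside whatever page or handler serves the feed.
private void RejectUnknownDevices()
{
    string deviceId = Request.Headers["X-Device-Id"];     // hypothetical header name
    string userName = LookupUserByDeviceId(deviceId);     // hypothetical lookup against your records
    if (userName == null)
    {
        Response.StatusCode = 403;                        // unknown or missing device id: refuse the feed
        Response.End();
    }
}

private string LookupUserByDeviceId(string deviceId)
{
    return null;   // placeholder: match the id against devices issued to managers
}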
Does the device have a client certificate that could be used for mutual SSL? If so, then that would be ideal.