Spam filter for ASP.NET?

I'm looking for a spam filter that I can integrate into my ASP.NET application. I don't want to delegate to third-party services (e.g. Akismet), as this is for a high-traffic website. Any suggestions?
Edit:
I mean a post spam filter; it's a forum-based website.
Edit:
Thanks for your answer, but I'm not looking for a CAPTCHA. A CAPTCHA is not a spam filter: it is used to prevent automated submissions, whereas a spam filter is a piece of software that scans posts and marks them as spam or not. I already have a CAPTCHA in my application to prevent automated spam.

You could check out nBayes, a C# implementation of Paul Graham's "A Plan for Spam".
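I don't know nBayes's exact API offhand, but the underlying idea from "A Plan for Spam" is simple: score each token by how often it shows up in spam versus ham, then combine the most telling tokens. A minimal, self-contained sketch (the class, thresholds, and token rules here are made up, not nBayes's types):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    // Minimal Bayesian text classifier in the spirit of "A Plan for Spam".
    // Train it on posts you have already marked as spam/ham, then score new posts.
    public class NaiveBayesSpamFilter
    {
        private readonly Dictionary<string, int> _spamCounts = new Dictionary<string, int>();
        private readonly Dictionary<string, int> _hamCounts = new Dictionary<string, int>();
        private int _spamPosts, _hamPosts;

        private static IEnumerable<string> Tokenize(string text)
        {
            return Regex.Matches(text.ToLowerInvariant(), @"[a-z0-9$'\-]+")
                        .Cast<Match>().Select(m => m.Value);
        }

        public void Train(string text, bool isSpam)
        {
            var counts = isSpam ? _spamCounts : _hamCounts;
            if (isSpam) _spamPosts++; else _hamPosts++;
            foreach (var token in Tokenize(text))
            {
                int n;
                counts.TryGetValue(token, out n);
                counts[token] = n + 1;
            }
        }

        // Probability that a single token indicates spam (with Graham-style clamping).
        private double TokenSpamProbability(string token)
        {
            int s, h;
            _spamCounts.TryGetValue(token, out s);
            _hamCounts.TryGetValue(token, out h);
            if (s + h == 0) return 0.4;                                  // unseen tokens lean slightly hammy
            double spamFreq = (double)s / Math.Max(1, _spamPosts);
            double hamFreq = (double)(2 * h) / Math.Max(1, _hamPosts);   // weight ham double, as Graham does
            double p = spamFreq / (spamFreq + hamFreq);
            return Math.Min(0.99, Math.Max(0.01, p));
        }

        // Combine the 15 most "interesting" tokens into one spam probability.
        public double Score(string text)
        {
            var probs = Tokenize(text).Distinct()
                .Select(TokenSpamProbability)
                .OrderByDescending(p => Math.Abs(p - 0.5))
                .Take(15).ToList();
            if (probs.Count == 0) return 0.5;
            double spam = probs.Aggregate(1.0, (acc, p) => acc * p);
            double ham = probs.Aggregate(1.0, (acc, p) => acc * (1.0 - p));
            return spam / (spam + ham);
        }

        public bool IsSpam(string text) { return Score(text) > 0.9; }
    }

Train it on a few hundred posts you have already moderated, persist the two token tables, and keep feeding it new decisions so it tracks your forum's vocabulary.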

If you don't want to use CAPTCHAs because they annoy people, and you already have the site up and running, you could write your own parser to filter out spam. Most spam you see is extremely blatant: hundreds of links in a post; subject, body, and poster name all exactly the same; other stuff along those lines. You could write some simple filters, as I did for my blog, to cut out 99% of the spam without your users even realizing that filtering is in place (a rough sketch follows below).
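Something along these lines, where the thresholds and field names are only guesses to tune against your own spam:

    using System;
    using System.Text.RegularExpressions;

    public static class SimplePostFilter
    {
        // Flags the blatant stuff: link-stuffed posts and posts where the
        // subject, body, and poster name are all identical.
        public static bool LooksLikeSpam(string posterName, string subject, string body)
        {
            int linkCount = Regex.Matches(body ?? "", @"https?://", RegexOptions.IgnoreCase).Count;
            if (linkCount >= 5) return true;

            if (!string.IsNullOrEmpty(subject) &&
                string.Equals(subject.Trim(), (body ?? "").Trim(), StringComparison.OrdinalIgnoreCase) &&
                string.Equals(subject.Trim(), (posterName ?? "").Trim(), StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }

            return false;
        }
    }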

I use Akismet for spam filtering. There is a .NET interface for it available on CodePlex.
It works very well and the API is pretty simple. Akismet is free for personal use (sites making less than $500/month), but I'm not sure about the pricing if you are making serious money on the website.
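I don't remember the CodePlex wrapper's exact surface, but the underlying Akismet comment-check call is just an HTTP POST; roughly like this (the field list is from memory, so verify it against Akismet's API docs before relying on it):

    using System.Collections.Specialized;
    using System.Net;
    using System.Text;

    public class AkismetClient
    {
        private readonly string _apiKey;
        private readonly string _blogUrl;

        public AkismetClient(string apiKey, string blogUrl)
        {
            _apiKey = apiKey;
            _blogUrl = blogUrl;
        }

        // Returns true if Akismet flags the comment as spam.
        public bool IsSpam(string userIp, string userAgent, string author, string content)
        {
            var fields = new NameValueCollection
            {
                { "blog", _blogUrl },
                { "user_ip", userIp },
                { "user_agent", userAgent },
                { "comment_type", "forum-post" },
                { "comment_author", author },
                { "comment_content", content }
            };

            using (var client = new WebClient())
            {
                string url = string.Format("https://{0}.rest.akismet.com/1.1/comment-check", _apiKey);
                byte[] responseBytes = client.UploadValues(url, "POST", fields);
                string response = Encoding.UTF8.GetString(responseBytes).Trim();
                return response == "true";   // "false" means ham; anything else is an error message
            }
        }
    }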

Although not widely used, since comments/forum posts are typically stored in a database, an insert trigger that looks for certain words in the comment and automatically deletes the row works remarkably well. Again, this isn't an ideal solution, but it works for me. There is the possibility of deleting a legitimate post, but then again, it's sometimes nearly impossible to correctly decipher a CAPTCHA...

I can't personally vouch for it because I've never used it, but I know a small company that had decent luck with "A Naive Bayesian Spam Filter for C#" by Jason Kester.
I would personally recommend using a third party like Akismet, though. Spam filtering is a tough business, and it is always better to delegate it to someone who can and will keep up with spammers' techniques over time.

Related

WordPress - Contact Form 7 - spam messages

I installed Contact Form 7 on my WordPress blog and several spam messages are coming through it. I then added Really Simple CAPTCHA / reCAPTCHA to the form, but the spam messages are still getting submitted.
How can I block this? Please help me.
Thanks in advance.
There are a lot of papers and other work on spam blocking. For example, you can ask easy questions, like "what is 2+7?", instead of a CAPTCHA, but I don't know how effective that is now, because spammers are also improving.
You can also block based on behaviour: for example, a spambot lands on your website and sends its spam within a second or two. That is not human behaviour, so don't allow that post (a rough sketch of the idea follows below).
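A small illustration of that timing check, sketched in C#/ASP.NET MVC like the other examples on this page (the field name and threshold are arbitrary; in a WordPress form the same idea would live in the plugin's PHP):

    using System;
    using System.Web.Mvc;

    public class CommentController : Controller
    {
        // GET: render the form and stamp it with the server time (emitted into a hidden field).
        public ActionResult New()
        {
            ViewData["RenderedAtUtcTicks"] = DateTime.UtcNow.Ticks;
            return View();
        }

        // POST: reject submissions that come back faster than a human could type.
        [HttpPost]
        public ActionResult Create(string body, long renderedAtUtcTicks)
        {
            var elapsed = TimeSpan.FromTicks(DateTime.UtcNow.Ticks - renderedAtUtcTicks);

            if (elapsed < TimeSpan.FromSeconds(3))
                return new HttpStatusCodeResult(403);   // or silently discard the comment

            // In a real app, sign or encrypt the timestamp so a bot cannot forge the field.
            // ... save the comment ...
            return RedirectToAction("Index");
        }
    }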
You can search Google for more on this; I'm sure you can find a lot of material related to it.
As I said, there is a lot of research going on in this area, and you can use Google Scholar too.
Also this question looks similar to your question.
You can try the honeypot plugin: http://wordpress.org/extend/plugins/contact-form-7-honeypot/ (note: I haven't tested it, but I use similar functionality in Gravity Forms, and it works great!).
Make sure to activate the Akismet plugin. It will help capture a good deal of the spam that gets through your forms.
I would also advise using at least a combination of CAPTCHA and Akismet, as Bill already mentioned. You can find a very good tutorial on this topic at http://cool-tricks.net/contact-form-7-configuration/

URL Rewrite: Adding keywords

I was looking at amazon.com and noticed for a product like: "Really Really Really Long Book Title," they will have a URL like: "amazon.com/Really-Long-Book-Title/ref?id=1&anotherId=2,"
and for a short title like: "Success," they will add other words, like the author name: "amazon.com/Success-John-Smith/ref?id=1&anotherId=2." If I remove these words, like so: "amazon.com/ref?id=1&anotherId=2," the URL still resolves.
Does it hurt SEO to have multiple URLs that resolve to the same page?
How are these words even added to the URL? Is it done programmatically, or do they have someone hand-pick words and store them in a database for each product?
I've been trying to expand my knowledge of SEO, so I'd really like to learn how this is done as thoroughly as possible. I'd greatly appreciate recommendations for any resources, and also advice based on personal experience, so that if I implement URLs like this, I can do it correctly. I know I can Google this stuff, but there always seem to be 1,000 ways to do something and I'd just like to hear some personal recommendations.
For what it's worth, I use asp.net 4.0 (c#) and the IIS7 URL rewrite toolkit.
Thanks a lot!
The IIS7 URL Rewrite toolkit is the best tool to use in your case. Here are my answers to your questions.
Does it hurt SEO to have multiple URLs that resolve to the same page?
It does not, as long as you show search engines which URL is the primary one. You can do this by adding a rel="canonical" link that points to the primary URL. The best example of this is Stack Overflow, which is doing very well in terms of SEO. If you use http://stackoverflow.com/questions/5392137/ you will be pointed to this page, and if you use http://stackoverflow.com/questions/5392137/url-rewrite-adding-keywords you will land on this page as well. Obviously the second URL has more keywords, which is great for SEO, and it is more user-friendly too, since users know what the URL is all about.
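For reference, that hint is just a link tag in the page head; one minimal way to emit it from an ASP.NET 4.0 WebForms page (the URL here is invented, and the head element must have runat="server"):

    using System.Web.UI;
    using System.Web.UI.HtmlControls;

    public partial class ProductPage : Page
    {
        protected void Page_Load(object sender, System.EventArgs e)
        {
            // Whatever URL variant the visitor arrived on, declare the keyword-rich one as primary:
            // <link rel="canonical" href="http://example.com/Success-John-Smith/ref?id=1" />
            var canonical = new HtmlLink { Href = "http://example.com/Success-John-Smith/ref?id=1" };
            canonical.Attributes["rel"] = "canonical";
            Page.Header.Controls.Add(canonical);
        }
    }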
How are these words even added to the URL? Is it done programmatically, or do they have someone hand-pick words and store them in a database for each product?
If you are a developer, then this is not really your responsibility any more. SEO is 20% technical and 80% marketing (a rough estimate, but you get the point :-)). The marketing folks should handle it once you give them access to write or rewrite URLs; they may find keywords and add some of them to the URL based on their tactics. Elad Lachmi gave a good answer to this question. Stack Overflow uses question titles as URLs, which is reasonable: hiring lots of SEO people to find keywords for different questions and then manually add them to URLs is not a good option for SO, but for commercial web sites it is worthwhile to have someone do it by hand. The answer depends on what kind of web site yours is.
Hope this helps
I love the rewrite toolkit because you can do ANYTHING!
From my experience, letting the content editor set whatever URL they like is the best option. Computers are not big on semantics. You can create set rules, and they might be OK (it's not that hard to tell a computer "if the title is not long enough, add the author name"; a sketch of such a rule follows below), but since a human adds the products anyway, a little SEO tutorial for the content editors can go a long way. You would be surprised what people who know their products can come up with. I have seen great titles and URLs from our content editors that I would never have thought of in a million years from my position as a developer.
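For completeness, here is roughly what such a set rule looks like in C# (the length threshold and helper names are arbitrary):

    using System.Text.RegularExpressions;

    public static class SlugBuilder
    {
        // Turns "Really Really Really Long Book Title" into "really-really-really-long-book-title".
        public static string Slugify(string text)
        {
            string slug = (text ?? "").ToLowerInvariant();
            slug = Regex.Replace(slug, @"[^a-z0-9\s-]", "");    // drop punctuation
            slug = Regex.Replace(slug, @"[\s-]+", "-").Trim('-');
            return slug;
        }

        // Pads very short titles with the author name, e.g. "Success" -> "success-john-smith".
        public static string ProductSlug(string title, string author)
        {
            string slug = Slugify(title);
            if (slug.Length < 10)
                slug = Slugify(title + " " + author);
            return slug;
        }
    }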

Prevent automated tools from accessing the website

The data on our website can easily be scraped. How can we detect whether a human is viewing the site or a tool?
One way would be to measure how long a user stays on a page, but I do not know how to implement that. Can anyone help me detect and prevent automated tools from scraping data from my website?
I use a security image in the login section, but even then a human may log in and then use an automated tool. When the reCAPTCHA image appears after a period of time, the user may type the security code and then, again, use an automated tool to continue scraping data.
I developed a tool to scrape another site. So I only want to prevent this from happening to my site!
DON'T do it.
It's the web; you will not be able to stop someone from scraping data if they really want it. I've done it many, many times before and gotten around every restriction they put in place. In fact, having a restriction in place motivates me further to try and get the data.
The more you restrict your system, the worse you'll make user experience for legitimate users. Just a bad idea.
It's the web. You need to assume that anything you put out there can be read by human or machine. Even if you can prevent it today, someone will figure out how to bypass it tomorrow. Captchas have been broken for some time now, and sooner or later, so will the alternatives.
However, here are some ideas for the time being.
And here are a few more.
And my favorite: one clever site I've run across asks a question like "On our 'About Us' page, what is the street name of our support office?" It takes a human to find the "About Us" page (the link doesn't say "about us"; it says something similar that a person would figure out), and then to find the support office address (different from the main corporate office and several others listed on the page) you have to look through several matches. Current computer technology can't figure that out any more than it can manage true speech recognition or cognition.
A Google search for "CAPTCHA alternatives" turns up quite a bit.
This can't be done without risking false positives (and annoying users).
How can we detect whether a human is viewing the site or a tool?
You can't. How would you handle tools that parse the page on behalf of a human, like screen readers and accessibility tools?
For example, one way is by measuring the time a user stays on a page, from which we can detect whether human intervention is involved. I do not know how to implement that, but I am just thinking about this method. Can anyone help with detecting and preventing automated tools from scraping data from my website?
You won't detect automated tools, only unusual behavior. And before you can define unusual behavior, you need to know what's usual. People view pages in different orders, browser tabs allow them to do parallel tasks, etc.
I should make a note that if there's a will, then there is a way.
That being said, I thought about what you've asked previously and here are some simple things I came up with:
Simple naive checks might be user-agent filtering and checking (see the sketch after this list). You can find a list of common crawler user agents here: http://www.useragentstring.com/pages/Crawlerlist/
You can always display your data in Flash, though I do not recommend it.
Use a CAPTCHA.
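A sketch of the first item, assuming you only want to turn away crawlers that are honest about their user agent (the substring list is illustrative; build the real one from the page linked above):

    using System;
    using System.Linq;
    using System.Web;

    public static class CrawlerCheck
    {
        // Substrings commonly seen in crawler user agents; extend from the list linked above.
        private static readonly string[] KnownBots =
            { "bot", "crawler", "spider", "curl", "wget" };

        public static bool LooksLikeCrawler(HttpRequest request)
        {
            string ua = request.UserAgent ?? "";
            return KnownBots.Any(b => ua.IndexOf(b, StringComparison.OrdinalIgnoreCase) >= 0);
        }
    }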
Other than that, I'm not really sure if there's anything else you can do but I would be interested in seeing the answers as well.
EDIT:
Google does something interesting: if you're searching for SSNs, after the 50th page or so they will show a CAPTCHA. That raises the question of whether you can intelligently time how long a user spends on your page or, if you introduce pagination into the equation, the time a user spends on a single page.
Using the information we previously assumed, it is possible to impose a time limit before another HTTP request is allowed. At that point, it might be beneficial to "randomly" present a CAPTCHA: maybe one HTTP request goes through fine, but the next one requires a CAPTCHA. You can mix those up as you please.
Scrapers steal data from your website by parsing URLs and reading the source code of your pages. The following steps can make scraping at least a bit more difficult, if not impossible.
Ajax requests make it harder to parse the data and require extra effort to discover the URLs to be parsed.
Use cookies even for normal pages which do not require any authentication: create a cookie once the user visits the home page and then require it for all the inner pages (see the sketch after this list). This makes scraping a bit more difficult.
Display encrypted content on the website and then decrypt it at load time using JavaScript. I have seen this on a couple of websites.
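A sketch of the cookie idea as an ASP.NET IHttpModule (the cookie name is arbitrary; a determined scraper can simply store cookies too, and you would want to exclude static resources from the check):

    using System;
    using System.Web;

    // Inner pages require a cookie that is only handed out by the home page.
    public class CookieGateModule : IHttpModule
    {
        private const string GateCookie = "visited_home";

        public void Init(HttpApplication app)
        {
            app.BeginRequest += (sender, e) =>
            {
                var ctx = ((HttpApplication)sender).Context;
                bool isHomePage = ctx.Request.Path == "/";

                if (isHomePage)
                {
                    ctx.Response.Cookies.Add(new HttpCookie(GateCookie, "1"));
                }
                else if (ctx.Request.Cookies[GateCookie] == null)
                {
                    ctx.Response.Redirect("/");   // no cookie yet: send them to the home page first
                }
            };
        }

        public void Dispose() { }
    }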
I guess the only good solution is to limit the rate at which the data can be accessed. It may not completely prevent scraping, but at least you can limit the speed at which automated scraping tools work, hopefully below a level that discourages scraping the data (a rough sketch follows below).
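One naive per-IP throttle along those lines (in-memory only, so it resets on app-pool recycles and penalizes users behind shared proxies; the window and limit are arbitrary):

    using System;
    using System.Collections.Concurrent;
    using System.Web;

    public static class RateLimiter
    {
        private class Counter { public DateTime WindowStart; public int Hits; }

        private static readonly ConcurrentDictionary<string, Counter> Counters =
            new ConcurrentDictionary<string, Counter>();

        private static readonly TimeSpan Window = TimeSpan.FromMinutes(1);
        private const int MaxRequestsPerWindow = 60;

        // Call at the top of each data page; returns false when the caller should be throttled.
        public static bool Allow(HttpRequest request)
        {
            string ip = request.UserHostAddress ?? "unknown";
            var counter = Counters.GetOrAdd(ip, _ => new Counter { WindowStart = DateTime.UtcNow });

            lock (counter)
            {
                if (DateTime.UtcNow - counter.WindowStart > Window)
                {
                    counter.WindowStart = DateTime.UtcNow;
                    counter.Hits = 0;
                }
                counter.Hits++;
                return counter.Hits <= MaxRequestsPerWindow;
            }
        }
    }

When Allow returns false you can return an error page or, as suggested earlier, challenge the caller with a CAPTCHA.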

How to prevent someone from hacking API feed?

I have started developing a webpage and recently hired someone to write code to display a customized feed (powered by API) in the middle panel on http://farmball.com/. Note that this is not the RSS feed tied to the site blog. The feed ties to my account on another site. There is no RSS link for an average user to subscribe to the feed. I've taken the site out of maintenance mode to ask anyone here with scraping/hacking experience how someone would most easily go about 'taking' the feed and displaying it on their own site. More importantly, what can I do to prevent it?
^Updated for re-wording
You can't.
If you are going to expose an RSS feed which you don't want others to be able to display on their site then you are completely missing the point of RSS. The entire reason for Really Simple Syndication (RSS) is to make your content externally consumable- whether that's in an RSS Reader or through someone simply printing its content on their own website.
Why are you including an RSS feed if you do not want someone to be able to consume it?
what can I do to prevent...'taking' the feed and displaying it on their own site?
Nothing. Preventing reuse goes against the basic concept of RSS, which is to make it as easy as possible for anyone to do anything they want with it. It was designed from the ground up to be Really Simple to Syndicate, not Really Hard to Retransmit Without Permission.
You could restrict access to the feed itself to trusted users only by making them provide credentials or pass a key to the feed (e.g. yoursite.rss?mykey=abc123), as sketched below. But you cannot control use, only access.
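A minimal version of that key check for an ASP.NET-served feed (the handler, key store, and feed builder here are invented placeholders):

    using System.Collections.Generic;
    using System.Web;

    // Serves the feed only to requests carrying a known key, e.g. /feed.ashx?mykey=abc123
    public class FeedHandler : IHttpHandler
    {
        // In practice these would live in a database, one key per trusted user.
        private static readonly HashSet<string> ValidKeys = new HashSet<string> { "abc123", "def456" };

        public void ProcessRequest(HttpContext context)
        {
            string key = context.Request.QueryString["mykey"];
            if (key == null || !ValidKeys.Contains(key))
            {
                context.Response.StatusCode = 403;
                return;
            }

            context.Response.ContentType = "application/rss+xml";
            context.Response.Write(BuildFeedXml());   // however you already generate the feed
        }

        private static string BuildFeedXml() { return "<rss version=\"2.0\"><channel></channel></rss>"; }

        public bool IsReusable { get { return true; } }
    }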
Be explicit about your license. It isn't a technology solution; as others have mentioned, this is an open technology, not DRM. But if you ask in each post that people who use the feed not repost it, that they give credit, etc., then some people will respond to the request.
Otherwise, you're better off putting your content behind a password and using a paid subscription model for distributing your content.
This is a DRM problem essentially. If you had some technique that you could put content on the web without having it redistributable, the music industry would love you.
It is possible to try to prevent redistribution, though. One technique you could try is embedding a signature of some sort into the feed for each user you require to sign up. If the content is then found elsewhere on the web, you can identify and ban the user who redistributed it.
This is avoidable too, by getting multiple accounts and normalizing the content to remove fingerprints. For the would-be pirate, this requires more effort than they may be willing to put in. Your signature could be a unique whitespace pattern, tiny variances in the timestamps on posts, misplaced pixels in videos, or any other thing you can vary slightly without end users noticing.
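A toy version of that fingerprinting idea, encoding the subscriber's id into a trailing-whitespace pattern on each item (and, as noted above, trivially stripped by anyone who knows to normalize the content):

    using System.Text;

    public static class FeedWatermark
    {
        // Appends a per-subscriber pattern of spaces and tabs to an item's text.
        // If the item later shows up on another site with the pattern intact,
        // you know which subscriber's copy it came from.
        public static string Stamp(string itemText, int subscriberId)
        {
            var sb = new StringBuilder(itemText);
            int bits = subscriberId & 0xFF;           // 8 bits is enough for a small subscriber list
            for (int i = 7; i >= 0; i--)
                sb.Append(((bits >> i) & 1) == 1 ? '\t' : ' ');
            return sb.ToString();
        }

        // Recovers the subscriber id from a leaked copy, or -1 if the pattern is gone.
        public static int Identify(string leakedText)
        {
            if (leakedText == null || leakedText.Length < 8) return -1;
            int bits = 0;
            string tail = leakedText.Substring(leakedText.Length - 8);
            foreach (char c in tail)
            {
                if (c != ' ' && c != '\t') return -1;
                bits = (bits << 1) | (c == '\t' ? 1 : 0);
            }
            return bits;
        }
    }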
Use .htpasswd.
Better yet, don't put something private in a public place where it's likely to get picked up by software automatically. As others have said, it's a pretty odd question; if you're trying to figure something else out, you're better off being explicit about what you want to know.

What is the best method to keep bots from spamming your blog?

I have a problem on my blog: I get visits from bots that kindly leave "nice" comments on my blog posts :(
I'm wondering if there is a smarter way to keep them out, besides using CAPTCHA modules.
My problem with CAPTCHA modules is that I think they are annoying to the user :(
I don't know if it helps anyone, but my site is built on ASP.NET MVC beta.
Have you thought about using this?
http://akismet.com/
From their FAQ
When a new comment, trackback, or pingback comes to your blog it is submitted to the Akismet web service which runs hundreds of tests on the comment and returns a thumbs up or thumbs down.
It's a really easy to use system, which I highly recommend.
I've had good luck with honeypots and hashes.
By making it difficult for robots to post successfully, you can let users post without registration, CAPTCHAs, or false positives from Akismet (a honeypot sketch follows below).
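A minimal honeypot for an ASP.NET MVC form: add a field that is hidden with CSS, so humans leave it empty while naive bots fill in every field (the field and action names are arbitrary):

    using System.Web.Mvc;

    public class CommentsController : Controller
    {
        // The form contains <input name="website" /> hidden via CSS (e.g. display:none).
        // Real users never see it, so it arrives empty; most bots stuff every field.
        [HttpPost]
        public ActionResult Create(string author, string body, string website)
        {
            if (!string.IsNullOrEmpty(website))
                return RedirectToAction("Index");   // pretend it worked, but drop the comment

            // ... save the legitimate comment ...
            return RedirectToAction("Index");
        }
    }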
Have a CAPTCHA that is really simple. Perhaps make it always "orange"? I don't think anyone's done that before.
Akismet is definitely the #1 method I know of for limiting spam comments. It is also nice to offload that to a third party (at a reasonable price); that way, if the client complains, you can just 'shift the blame'.
Another option is to incorporate something like mod_security's spammer signature file. They have a list of keywords you can scan a comment for, placing the message in moderation if you get a match. Though if your message board actually discusses topics containing those keywords, you'll need a lot of moderators. :-)
You may also want to consider scanning IPs and matching them against SpamHaus or DCShield's block lists (a rough sketch follows below). We recently started this approach and it has done wonders.
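A rough sketch of a DNS blocklist lookup using the standard reversed-IP query convention (check the list operator's usage terms and rate limits before relying on this in production):

    using System;
    using System.Net;
    using System.Net.Sockets;

    public static class DnsBlocklist
    {
        // Returns true if the IPv4 address is listed on the given DNSBL zone,
        // e.g. IsListed("203.0.113.7", "zen.spamhaus.org").
        public static bool IsListed(string ipv4, string zone)
        {
            string[] octets = ipv4.Split('.');
            if (octets.Length != 4) return false;

            // DNSBLs are queried with the octets reversed: 7.113.0.203.zen.spamhaus.org
            Array.Reverse(octets);
            string query = string.Join(".", octets) + "." + zone;

            try
            {
                IPAddress[] addresses = Dns.GetHostAddresses(query);
                return addresses.Length > 0;   // any A record (127.0.0.x) means "listed"
            }
            catch (SocketException)
            {
                return false;                  // NXDOMAIN: not listed
            }
        }
    }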
Things that don't work: requiring registration, simple CAPTCHAs, user-agent checks... these can all be automated or defeated with cheap labor.
I think you have several options...
Require registration to post comments - but that's more annoying than a CAPTCHA, so probably not the best idea.
Examine the user-agent of the poster (see here) for something that looks genuine, or exclude those which look suspect.
Use a nice CAPTCHA. As annoying as they are, used properly they aren't that bad. It took me 7 attempts to sign up for Gmail the other day because I just couldn't read what it said. A nice CAPTCHA isn't that bad really, if it's kept short and READABLE.
If the spam you are receiving is link-heavy, you could assume any comment that contains >= 2 links is spam and hold it back from the blog unless the blog author approves it. This is what most comment-spam plugins do. I'm currently working on blog software and have adopted this solution in the interim until I can integrate Akismet fully.
I made spam into someone else's problem by using Disqus to run my blog's comments. There has been no spam since switching; Disqus keeps on top of it.
A few answers advised Akismet, but I disagree and consider a dynamic CAPTCHA approach the best one.
