I have started developing a webpage and recently hired someone to write code to display a customized feed (powered by API) in the middle panel on http://farmball.com/. Note that this is not the RSS feed tied to the site blog. The feed ties to my account on another site. There is no RSS link for an average user to subscribe to the feed. I've taken the site out of maintenance mode to ask anyone here with scraping/hacking experience how someone would most easily go about 'taking' the feed and displaying it on their own site. More importantly, what can I do to prevent it?
^Updated for re-wording
You can't.
If you are going to expose an RSS feed which you don't want others to be able to display on their site then you are completely missing the point of RSS. The entire reason for Really Simple Syndication (RSS) is to make your content externally consumable- whether that's in an RSS Reader or through someone simply printing its content on their own website.
Why are you including an RSS feed if you do not want someone to be able to consume it?
what can I do to prevent...'taking' the feed and displaying it on their own site?
Nothing. Preventing reuse goes against the basic concept of RSS, which is to make it as easy as possible for anyone to do anything they want with it. It was designed from the ground up to be Really Simple to Syndicate, not Really Hard to Retransmit Without Permission.
You could restrict access to the feed itself to trusted users only by making them provide some credentials or pass in a key to the feed (e.g. yoursite.rss?mykey=abc123). But you cannot control use. Only access.
Be explicit about your license. It isn't a technology solution, as others have mentioned, the technology is an open technology-- this isn't DRM! But if you ask in each post that people who use this feed to not repost/fail to give credit/etc then some people will respond to the request.
Otherwise, you're better off putting your content behind a password and using a paid subscription model for distributing your content.
This is a DRM problem essentially. If you had some technique that you could put content on the web without having it redistributable, the music industry would love you.
It is possible to try to prevent redistribution. One technique you could try is embedding a signature of some sort into the feed for each user who you require to sign up. If the content is found on the web, you can identify and ban the user who redistributed your content.
This is avoidable too, by getting multiple accounts and normalizing the content to remove fingerprints. For the would-be pirate, this requires more effort than they may be willing to put in. Your signature could be a unique whitespace pattern, tiny variances in the timestamps on posts, misplaced pixels in videos, or any other thing you can vary slightly without end users noticing.
use .htpassword
better yet, don't put something private in a public place where it's likely to get picked up by software automatically. Like others have said, it's a pretty odd question, if you're trying to figure something else out, you're better off being explicit with what you want to know.
Related
I'm wondering if I should add a recaptch to my website which gets 30k hits a day. Simply, it rewrites text with a custom made api I created.
I'm not sure how many robots are hitting my site, inputting text into the text box, and using the api (like a web scraper) and am considering using a recaptcha. But personally, I hate running into recaptchas, so I don't want it pop up if a real person is rewriting quite a few articles for a real reason.
Are there recaptcha settings that are ideal for this situation? Should I use a recaptcha and show it after 3 consecutive use cases for example?
Is the google recaptcha smart enough to know it's not a robot?
Are there better alternatives, maybe popular solutions which blocks scrapers?
Any advice is appreciated.
I am writing a website that will be publishing content that has high IP and we would like people to pay for it. To prevent screen capture consistently, I know that there are limitations in using Javascript + flash + html.
I have discovered artistscope which seems to make it impossible to do anything of that nature. I am happy to inconvenience the user as they view my webpage but lock it down.
Does anyone have any experience with this framework?? I understand all users will have to install a plugin that some antivirus software has flagged and i'll just need to add some mark-up to the article page.
Does anyone know anything about artistscope solution and what is involved in implementing it or how well it works??
If its only a few users, I'm guessing you'll require registration? if so you could use legal copyright to protect intellectual property. Use Creative common's, TradeMark your sitename, use registered post to send content to yourself before you post it online that way you can prove in a court that its plagiarized and you were the first to copyright it. This sort of reminds me of this article: http://thedailywtf.com/Articles/Lock-and-Key-.aspx and maybe your site should be in stone, a safe or as a MagicEye. As Greg mentioned there is no bullet proof solution, guys like us will come along and write auto-OCR readers to scan your site and get foreigners to run the app. If you had a legal notice, I'd at least think twice.
Edit: maybe you could even get creative with Captcha's too to deter people (when you detect copyright infringement), here's an idea to two: Is there an efficient algorithm for segmentation of handwritten text?
I also have used Artist Scope solutions, but when it comes to screenshot, it is not enough.
I've just written this post about protection against screenshot and other content grabbing methods like snipping tool. I'll update it soon for other protection methods that I followed in my blog.
Here's a general description of my approach:
It only works for restricted content; content that needs registration to view it.
It requires continuous monitoring by the administrator because...
It detects screen print key and sends an email to you with the username and other details of the person that has already captured your content (If you are aware of any method that bans the user automatically, I'd be glad to hear it).
It covers your content with an overlay if the user tries to capture it while outside the browser window.
I never needed to play a lot with RSS, but now I have a project to do, and I wonder if it is possible to pull an RSS feed of all the posts for a certain blog...
I'm not talking about creating a feed generator. I'm just curious why most of the blogspot.com etc websites have available only the last 5 or last 20 posts, but never the complete list... is it a performance/security reason? I guess it is the user's right to decide how many posts go in the feed, right?
How many entries you want to have in your RSS-Feed depends on the goals you want to achieve with the RSS-Feed. Usually you want to provide information about the current articles on your site. That's why you usually have only the most recent articles in the feed.
Performance is of course an issue. A popular RSS-Feed with many followers does not want to send a huge XML file all the time. That can be addressed with enough ressources, but as long as it does not really help your goal, why do it?
I do not see a real security issue. If someone wants to steal your content, he can simple iteratate over the articles on your website directly. RSS would make it a little easier, but if someone wants to steal the content, he will steal it anyway, with or without a full RSS feed. If you take Denial of Service into consideration - we are back at performance issues - there might be a threat to availability. But that's already quite speculative.
The data on our website can easily be scraped. How can we detect whether a human is viewing the site or a tool?
One way is by calculating time which a user stays on a page. I do not know how to implement that. Can anyone help to detect and prevent automated tools from scraping data from my website?
I used a security image in login section, but even then a human may log in and then use an automated tool. When the recaptcha image appears after a period of time the user may type the security image and again, use an automated tool to continue scraping data.
I developed a tool to scrape another site. So I only want to prevent this from happening to my site!
DON'T do it.
It's the web, you will not be able to stop someone from scraping data if they really want it. I've done it many, many times before and got around every restriction they put in place. In fact having a restriction in place motivates me further to try and get the data.
The more you restrict your system, the worse you'll make user experience for legitimate users. Just a bad idea.
It's the web. You need to assume that anything you put out there can be read by human or machine. Even if you can prevent it today, someone will figure out how to bypass it tomorrow. Captchas have been broken for some time now, and sooner or later, so will the alternatives.
However, here are some ideas for the time being.
And here are a few more.
and for my favorite. One clever site I've run across has a good one. It has a question like "On our "about us" page, what is the street name of our support office?" or something like that. It takes a human to find the "About Us" page (the link doesn't say "about us" it says something similar that a person would figure out, though) And then to find the support office address,(different than main corporate office and several others listed on the page) you have to look through several matches. Current computer technology wouldn't be able to figure it out any more than it can figure out true speech recognition or cognition.
a Google search for "Captcha alternatives" turns up quite a bit.
This cant be done without risking false positives (and annoying users).
How can we detect whether a human is viewing the site or a tool?
You cant. How would you handle tools parsing the page for a human, like screen readers and accessibility tools?
For example one way is by calculating the time up to which a user stays in page from which we can detect whether human intervention is involved. I do not know how to implement that but just thinking about this method. Can anyone help how to detect and prevent automated tools from scraping data from my website?
You wont detect automatic tools, only unusual behavior. And before you can define unusual behavior, you need to find what's usual. People view pages in different order, browser tabs allow them to do parallel tasks, etc.
I should make a note that if there's a will, then there is a way.
That being said, I thought about what you've asked previously and here are some simple things I came up with:
simple naive checks might be user-agent filtering and checking. You can find a list of common crawler user agents here: http://www.useragentstring.com/pages/Crawlerlist/
you can always display your data in flash, though I do not recommend it.
use a captcha
Other than that, I'm not really sure if there's anything else you can do but I would be interested in seeing the answers as well.
EDIT:
Google does something interesting where if you're looking for SSNs, after the 50th page or so, they will captcha. It begs the question to see whether or not you can intelligently time the amount a user spends on your page or if you want to introduce pagination into the equation, the time a user spends on one page.
Using the information that we previously assumed, it is possible to put a time limit before another HTTP request is sent. At that point, it might be beneficial to "randomly" generate a captcha. What I mean by this, is that maybe one HTTP request will go through fine, but the next one will require a captcha. You can switch those up as you please.
The scrappers steal the data from your website by parsing URLs and reading the source code of your page. Following steps can be taken to atleast making scraping a bit difficult if not impossible.
Ajax requests make it difficult to parse the data and require extra efforts in getting the URLs to be parsed.
Use cookie even for the normal pages which do not require any authentication, create cookies once the user visits the home page and then its required for all the inner pages.This makes scraping a bit difficult.
Display the encrypted code on the website and then decrypt it on the loadtime using javascript code. I have seen it on a couple of websites.
I guess the only good solution is to limit the rate that data can be accessed. It may not completely prevent scraping but at least you can limit the speed at which automated scraping tools will work, hopefully below a level that will discourage scraping the data.
Has anyone ever tried to integrate AspDotNetStorefront and Sitecore? I've been trying for the past couple of days to come up with a way to get the two systems to play nicely together, but it doesn't seem feasible from what I can tell. A couple issues I've run across so far:
Authentication between the two (AspDotNetStorefront has its own implementation, Sitecore just uses/extends .NET Membership)
The main DLL for AspDotNetStorefront is what pops up in the stack trace when I get yellow-screened, but that DLL is obfuscated so I can't figure out what the problem is.
The biggest issue is that we need to keep our existing AspDotNetStorefront application as an e-commerce backend and use Sitecore to do everything else. AspDotNetStorefront has a CMS as part of it, but it's really not an acceptable solution for anything but really basic content pages.
Any thoughts on how I might go about this?
EDIT:
I've decided to break this whole thing down into the different problems that I am facing at the moment and solve each one as efficiently as I know how. I'll detail the ones I have here and then update when I run into new ones.
Problem 1: Authentication between the two systems.
This one isn't too bad actually if you're knowledgeable about forms authentication tickets, which I wasn't at the time but am learning quickly enough. As long as the two systems share the same encryption info, it's easy enough to pass information back and forth between them using cookies as stated below in the accepted answer. The other kicker is that I needed to set the CustomerGUID in the AspDotNetStorefront Customer table to be the user ID from the Sitecore user tables (standard ASP.NET membership). So far this approach seems to work pretty well (I'm still in the proof of concept stage at the moment.
Another thing to keep in mind if you ever need to attempt this is that AspDotNetStorefront comes with a web service that you can use to basically do anything you need. Since they use the same encryption keys, I am able to log in on the storefront side using this service more securely than just passing over clear text passwords (I had to write the method myself, I don't believe it comes standard, if I am mistaken please let me know). Although I doubt it's a huge deal since it all happens server side anyways.
Problem 2: Getting at the product data
This one was a little more troublesome. The aforementioned web service has a few issues I've had difficulty working around. However, since the databases are going to be on the same server, I simply decided that since all I really need is the price and ID I would go ahead and set the ProductGUID column of each product in the Storefront database to match the Sitecore item ID of the corresponding item in the Sitecore database. This way I just need a quick query to grab the ProductID and price information which is only used in a few places. Everything else is going to be housed in Sitecore.
If anyone has anything to add feel free, as far as I can tell from Google, no one has actually done this before, so I'm having a lot of trouble finding resources on this particular topic.
UPDATE:
The integration is in fact possible and our site has been up for a week and a half now with very few integration related problems. This isn't something I recommend doing really on a personal level, but it is in fact possible to pull off.
I know ASPDotNetStorefront and other CMS systems (but not Sitecore). If I was approaching this, I would probably start simple and create a custom URL structure for sitecore 'content' pages that ASPDNSF would direct to Sitecore to handle. [possibly replacing the existing topics system in ASPDNSF]. So, for example: a URL such as www.domain.com/p-1234-aproductpage.aspx would be handled by ASPDNSF whereas www.domain.com/content/123/a-content-page would get sent to Sitecore to render. This is a straightforward web.config edit.
Security sharing across the systems should be possible across the same domain as the cookie information will be available (you should be able to create some code in Sitecore using the ASPDNSFCommon.dll and a cast of HttpContext.Current.User into a AspDotNetStorefrontPrincipal class to detect if a customer is logged in)
Another way to approach the problem would be to write a function that retrieved Sitecore content from the database based on a URL id and then write an ASPDNSF XML template to use the function to retrieve this content based on the URL. For example, you could create a custom URL structure in ASPDNSF such as www.domain.com/sc-1234-sitecore-content-item.aspx which is sent to your custom code; 1234 is used as the sitecore content id and the XML template retrieves the content and renders it on screen.
This second approach has the advantage of using Sitecore for all non-product content management while keeping the live application in ASPDNSF. Also one set of design templates and all your security issues go.