Preventing RSS feed scraping?

On a WordPress site, I have both a normal blog that I want Google to detect and an RSS feed for outgoing links to other sites. I don't need/want bots to get at this other RSS feed, nor do I want people to be able to get the link for their own use.
I've disabled RSS for the main blog successfully, but am not sure how to encrypt/protect/hide the RSS link for this additional feed.
I'm not sure how Facebook runs a news feed without RSS, but however they do it is probably beyond my means/experience to replicate.
Since these are just outgoing links, I don't think copyright notices in the feed will do much. Maybe there is a way to output the links automatically through a means other than RSS?

Use robots.txt (www.robotstxt.org) to prevent Google from following the link. All self-respecting robots should follow the directives in the robots.txt file. This file needs to go in the root of your site.
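For example, if the extra feed lived at a hypothetical path like /links-feed/ (your actual feed URL will differ), the robots.txt in the site root might look like this:
User-agent: *
Disallow: /links-feed/
Keep in mind this is advisory only: well-behaved crawlers honour it, but a deliberate scraper can simply ignore it.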

The basic answer to this is to get the feed entries out in a manner other than actual RSS: outputting JSON, going through an API, etc.
This will help deter scraping, though not completely.
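As a rough sketch of that idea, here is a minimal PHP endpoint that serves the outgoing links as JSON instead of RSS; the connection details, table, and column names are assumptions for illustration:
<?php
// Minimal sketch: expose the outgoing links as JSON rather than RSS.
// The database credentials, table, and columns below are placeholders.
$db = new PDO('mysql:host=localhost;dbname=blog', 'user', 'password');
$stmt = $db->query('SELECT title, url FROM outgoing_links ORDER BY created DESC LIMIT 20');
header('Content-Type: application/json');
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));
Since the output is no longer a standard feed, generic RSS scrapers won't pick it up automatically, though anyone who finds the endpoint can still fetch the JSON.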

Related

Can I build an OPML file from a list of website URLs?

This happens a lot, so I wonder if there's a tool for working around it. Often I find a website with a blogroll or a links page with a long list of 20 or more websites. I sure would like to keep up with those sites via the feed reader of my choice, but it sure is tedious to click on each and every link, look for an RSS link, subscribe to that, wash, rinse, repeat.
My favorite feed reader will accept an OPML to batch import a list of feeds, so that's a start, but here's my question:
If all I have is a list of the website URLs, is there a way to generate an OPML of the RSS feeds?
I was able to create an OPML file. All I had to do was create a text file with a URL on each line. Then I used a PHP script to look at each URL, hunt for the RSS feed's address, and add each feed address to the OPML file.
Incidentally, I've shared the project that this is part of on GitHub. I wanted to be able to subscribe to lots of litblogs at once.
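Roughly, the script works like the sketch below; the file names are placeholders, and the naive regex assumes the type attribute appears before href in the autodiscovery tag:
<?php
// Sketch: read site URLs from a text file, discover each site's feed via the
// <link rel="alternate"> autodiscovery tag, and write the results as OPML.
$urls = file('sites.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$outlines = '';
foreach ($urls as $url) {
    $html = @file_get_contents($url);
    if ($html === false) {
        continue; // site unreachable, skip it
    }
    if (preg_match('#<link[^>]*type="application/(?:rss|atom)\+xml"[^>]*href="([^"]+)"#i', $html, $m)) {
        $feed = htmlspecialchars($m[1]);
        $site = htmlspecialchars($url);
        $outlines .= "    <outline type=\"rss\" xmlUrl=\"$feed\" htmlUrl=\"$site\" />\n";
    }
}
file_put_contents('feeds.opml',
    "<?xml version=\"1.0\"?>\n<opml version=\"1.0\">\n" .
    "  <head><title>My Feeds</title></head>\n" .
    "  <body>\n" . $outlines . "  </body>\n</opml>\n");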

Can I track who is linking or manipulating my site's data?

Is it possible to track if someone links to data on my site? Specifically, if my data is used in a site dynamically generated by a developer program? I would like to know if someone is blatantly passing off my site's data as their own. There are obviously ways around directly linking to content, such as content manipulation or even manual manipulation. But if someone were to link to (or directly add word for word, or manipulate) my content on their website, is there a way to track it?
Can I avoid someone being able to scrape my website at all, or is everything just up for grabs?
The best and easiest answer is Google Webmaster Tools.
Doing this yourself is actually very hard: you would need to crawl the web to discover the links that point to your pages. Dynamically generated content gets linked as well, so Google will find that too.
This tool will let you see the external links that point to your site, and you can check them.
As an extra measure, you can monitor the requests and traffic to your site and look for IPs that request the same page over and over again. That can tell you that an outside page is dynamically loading content from your web page; see the sketch below.
EDIT:
Here is a good article on this subject: link - scroll down and you can see the use of Google Webmaster Tools with some other programs and methods.
Here is a good starter guide to Google Webmaster Tools: link
Enjoy!
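To illustrate the traffic-monitoring idea above, here is a minimal PHP sketch that counts repeated (IP, page) request pairs in an Apache-style access log; the log path and format are assumptions about your setup:
<?php
// Rough sketch: count how often each IP requests the same page. Assumes a
// common Apache "combined" log format at an assumed path.
$counts = array();
foreach (file('/var/log/apache2/access.log') as $line) {
    if (preg_match('/^(\S+).*"(?:GET|POST) (\S+)/', $line, $m)) {
        $key = $m[1] . ' ' . $m[2]; // "ip path"
        $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
    }
}
arsort($counts);
// Show the heaviest repeaters; a very high count on one page can indicate
// another site dynamically pulling your content.
foreach (array_slice($counts, 0, 20, true) as $key => $n) {
    echo "$n  $key\n";
}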

How to retrieve google pages

Dear all, I am now using a webtool
http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=
to parse a webpage.
For example, to parse the New York Times homepage, we enter:
http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http://www.nytimes.com/pages/world/index.html
in the address bar of our browser, and it parses things nicely for us.
However, it just fails for Google pages.
For example, if I want to parse the Google News front page, like:
http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http://news.google.com/nwshp?hl=en&tab=wn
I always get a 500 Internal Server Error.
I am sure it is something to do with the Google website. I think we probably need some API for Google; does anyone have any idea how to sort this out for Google pages?
Many thanks.
Per the google.com robots.txt file, you are explicitly requested not to scrape their content. Google does not provide an API for machine-readable search results; they want to control the presentation of their content via widgets and embedding strategies.

Collecting RSS Feeds Online?

I'd like to be able to collect RSS feeds online as an alternative to collecting them on a desktop machine using a regularly running process.
Ideally, it would either collect all feeds and simply email each entry to a single address as soon as it finds a new one (or even without checking whether it's new), or aggregate all the smaller feeds and send them out as one larger bulk feed less frequently.
It would have to run on a web server continually, but it would be nice to be able to collect all feeds, not just the ones I happen to pick up while a feed reader is running on my machine. Is something like this available?
Just use Google Reader. :)
Maybe Yahoo! Pipes could help you. It is an interesting way of combining and manipulating feeds.
I'm not sure if you have ever used it, but iGoogle allows you to customise the Google homepage to display information from around the web. You can add tabs to the page to allow you to split the information up. It's extremely useful, and as you can log into it from any computer/browser, you can access your feeds anywhere.
If you have a lot of feeds of one type, or feeds that update infrequently, then iGoogle can also be combined with Google Reader.
It's also great for adding other plugins like Gmail, games, Dilbert :) and more.
To create an iGoogle page, go to the Google home page and click the iGoogle link in the top right corner. iGoogle will then provide you with a starter page and some suggested content which you can add or ignore. If you click the "Add Stuff" link, then "Add feed or gadget", you can manually add all your RSS feeds. However, you can also configure Firefox to automatically select Google as your RSS reader whenever you click on an RSS feed icon in the navigation bar. You can select/change this under Tools -> Options -> Applications -> Web Feed.
In order to use iGoogle on multiple browsers/computers you will need a Gmail/Google account; however, it's free and easy to create.
SimplePie is great if you have PHP installed.
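A minimal SimplePie sketch, assuming the library file is on your include path (the feed URLs are placeholders):
<?php
// Fetch and merge several feeds with SimplePie, then print each item.
require_once 'simplepie.inc';
$feed = new SimplePie();
$feed->set_feed_url(array(
    'http://example.com/one.rss',
    'http://example.com/two.rss',
));
$feed->init();
foreach ($feed->get_items() as $item) {
    echo $item->get_title() . ' - ' . $item->get_permalink() . "\n";
}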
Universal Feed Parser might be of help if you're programming in Python.

Add RSS to any website?

Is there any website/service which will enable me to add RSS subscription to any website?
This is for the company I work for. We have a website which displays company-related news. The news items are supplied by an external agency and get added to our database automatically. Our website picks up random/new items and displays them. We are looking at adding a "Subscribe via RSS" button to our website.
If you have the data in your database, creating one yourself is fairly straightforward - there's a simple tutorial here.
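As a hedged sketch of what such a feed script might look like in PHP (the connection details, table, and column names are assumptions about your schema, not your actual setup):
<?php
// Emit the latest database rows as a minimal RSS 2.0 feed.
header('Content-Type: application/rss+xml; charset=utf-8');
$db = new PDO('mysql:host=localhost;dbname=company', 'user', 'password');
$items = '';
foreach ($db->query('SELECT title, url, published FROM news ORDER BY published DESC LIMIT 20') as $row) {
    $items .= '<item>'
            . '<title>' . htmlspecialchars($row['title']) . '</title>'
            . '<link>' . htmlspecialchars($row['url']) . '</link>'
            . '<pubDate>' . date(DATE_RSS, strtotime($row['published'])) . '</pubDate>'
            . '</item>';
}
echo '<?xml version="1.0" encoding="UTF-8"?>'
   . '<rss version="2.0"><channel>'
   . '<title>Company News</title>'
   . '<link>http://www.example.com/</link>'
   . '<description>Latest company news</description>'
   . $items . '</channel></rss>';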
Once you've set up a feed, in the <head> of your page, you put text like:
<link rel="alternate" title="RSS Feed"
href="http://www.example.com/rss-feed/latest/" type="application/rss+xml" />
This allows the feed to be "auto-discovered" by your users' browsers (e.g. the RSS icon appears in the address bar in Firefox).
Here's an article that discusses various webscrapers that will generate feeds: http://www.masternewmedia.org/news/2006/03/09/how_to_create_a_rss.htm
If you don't care to click through, here are the services the author discusses:
http://www.feedyes.com/
http://www.feed43.com/
http://www.feedfire.com/site/index.html
Other webscrapers suggested in the other answers:
http://page2rss.com/
http://www.dapper.net/
However, you're probably better off generating the feeds yourself from the info in the DB.
Your question is a little difficult to understand. Are you trying to generate the RSS for others to consume, or are you trying to consume someone else's RSS?
If you are trying to generate your RSS feed for others to consume you will need to read the spec:
http://cyber.law.harvard.edu/rss/rss.html
If you are trying to consume it, that link will also help. Then you'll need to look into an XML / RSS parser.
If you can provide more details I can update my answer.
If you are not in a position to add an RSS feed to the existing site, see Page2Rss as an intermediate solution.
Might Dapper be of some use? You just need to set up which bits of your news feed to scour and voilà, instant RSS without having to touch any code...
Actually this is very doable with Yahoo! Pipes. Assuming that 1) your page is under 200k, 2) your robots.txt file does not disallow Pipes, and 3) your news feed has a unique ID, like so:
<ul id="newsfeed">
... you could use the Fetch Page module, trim it to just the items inside the news feed, loop through each list item, and use an Item Builder module to assemble the relevant bits into a proper RSS feed. Then, in the head of your document, you'd put in an RSS link, like so:
<link rel="alternate" type="application/atom+xml" title="News Feed" href="http://pipes.yahoo.com/your_pipe_id" />
This is of course completely ass-backwards, but would work for a quick fix, or in situations where you had no control over the body of the page.
Write a webhandler that exposes the content of the database as an RSS feed.
You either need to roll your own, or get a service that is a screen scraper.
After you have created your feed, you can use something like Feedburner to disseminate it.
If you happen to be using ASP.NET, you might want to check out the ASP.NET RSS Toolkit. It's useful for both generating and consuming feeds.
