What's the universal standard to get data from any blog? - rss

I want to extract data from various kinds of blogs and was going through various ways to do it:
An API, which needs user authentication
XML-RPC (I don't know which blogs support it)
RSS (again, not sure which blogs support it, and even if they do, how much one can get from the feeds)
Atom
I know this isn't a strictly programming-related question, but I went ahead and asked it because there is a lot of confusion about which of these to use and which is best supported.
It would be nice to avoid an API with authentication, since you not only have to deal with varied implementations of authentication, you also have to deal with varied API limits.

RSS is the oldest of these and came into use first, but it has limitations. Atom was designed as its replacement and overcomes most of them. (Atom itself is an XML-based syndication format; its companion Atom Publishing Protocol covers the ground that XML-RPC blogging APIs used to, so of the options you list it is the variation you want.) All of the above are a type of API, so ideally what you want to do is support both RSS and Atom. Sadly, Atom and RSS are not backwards compatible. To quote Wikipedia on "Atom":
In particular, many blog and wiki sites offer their web feeds in the Atom format.
porneL's solution (crawling and parsing HTML, below) is not recommended at the moment. However, in the future HTML markup is set to change to give blocks better semantic meaning, for example with the new <article> tag. That will be yet another way to parse documents, and the most versatile one, but in my opinion it will be a very long time before it becomes reliable, since many if not most sites suffer from 'tag soup' syndrome.
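Since the practical advice above boils down to "support both RSS and Atom", here is a minimal sketch, assuming the third-party feedparser library (mentioned in a later answer below); it normalises RSS 0.9x/1.0/2.0 and Atom 0.3/1.0 into one entry structure, and the example feed URLs are hypothetical.

import feedparser

def fetch_entries(feed_url):
    # feedparser accepts URLs, file paths or raw strings and reports the detected format.
    d = feedparser.parse(feed_url)
    print(feed_url, "->", d.get("version"))  # e.g. 'rss20' or 'atom10'
    for entry in d.entries:
        yield {
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", entry.get("updated", "")),
            "summary": entry.get("summary", ""),
        }

if __name__ == "__main__":
    # Both hypothetical blogs parse the same way, whichever format they serve.
    for url in ["https://example.com/feed.rss", "https://example.com/atom.xml"]:
        for item in fetch_entries(url):
            print(item["published"], item["title"], item["link"])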

The most universal "standard" is crawling and parsing HTML.
wget -m http://example.com/
How exactly you do it depends on what you are trying to accomplish and how universal you want to be.
You could use heuristics, similar to what Readability uses, to find articles on a site. You could detect and special-case popular blogging platforms.
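To make the crawling approach concrete, here is a rough sketch of feed autodiscovery plus an <article> fallback, assuming the requests and BeautifulSoup (bs4) libraries; the page URL is a placeholder, and real sites will need site-specific tuning.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

def discover_feeds(page_url):
    # Cheap first step: most blogs advertise their feeds via <link rel="alternate"> in <head>.
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    feeds = []
    for link in soup.find_all("link"):
        rel = link.get("rel") or []
        if "alternate" in rel and link.get("type") in FEED_TYPES and link.get("href"):
            feeds.append(urljoin(page_url, link["href"]))
    return feeds

def extract_articles(page_url):
    # Heuristic fallback: many (but far from all) blogs wrap posts in <article> tags.
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    return [article.get_text(" ", strip=True) for article in soup.find_all("article")]

if __name__ == "__main__":
    print(discover_feeds("https://example.com/"))  # hypothetical blog URL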

Related

Is there a scraper application like KimonoLabs?

I have used Scrapy and Beautiful Soup many times, but I find the KimonoLabs solution much easier and faster. The only problem is that sometimes jobs need a bit of tweaking that isn't possible there (e.g., crawling using a unique pattern).
Is there any other solution that combines that ease of use with optional complexity? Mainly I want to define a page-scraping template using a WYSIWYG interface, and then write the crawler programmatically.
Use an Import.io extractor.
Download the Import.io browser
Create an extractor (what you call a "scraping template")
From your code use the extractor's REST API
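A minimal sketch of the last step, assuming the requests library; the endpoint, API key and parameter names below are placeholders rather than Import.io's documented API, so copy the actual query URL for your extractor from the Import.io dashboard.

import requests

API_KEY = "YOUR_API_KEY"                      # hypothetical credential
EXTRACTOR_URL = "https://example.import.io/"  # placeholder: use your extractor's own REST URL

def run_extractor(target_url):
    resp = requests.get(
        EXTRACTOR_URL,
        params={"url": target_url, "_apikey": API_KEY},  # parameter names assumed
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # typically a JSON payload of rows matching your template

if __name__ == "__main__":
    print(run_extractor("https://example.com/blog/post-1"))  # hypothetical page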
Full disclosure: I'm one of the founders of ParseHub.
ParseHub tries to solve exactly this problem. It gives you a GUI and powerful tools for defining templates visually, and falls back to a subset of JavaScript if you need more fine-grained control. All of the programming primitives you're familiar with (if, for, break, recursion, etc.) are available.
You can find it at www.parsehub.com
Try Agenty
Agenty has the same features for scraping websites, plus a Chrome extension to set up the scraping agents. You can just install the extension and create agents to scrape any site.
FYI: we also plan to launch a hosted solution and REST API by April 2016 (update: the API is available now).
You can see more details on the website (www.datascraping.co, now Agenty.com).
Disclosure: I'm one of the founding members.

Captchas on RSS Reader?

This question is coming from a non-technical person. I have asked a team to build a sort of RSS reader; in essence, it's a news aggregator. What we had in mind at first was to source news directly from specific sources: ft.com, reuters.com, and bloomberg.com.
Now, the development team has proposed a certain way of doing it (because it'll be easier), which is to use news.google.com and return whatever the result is. I know this has questionable legality and we are not really comfortable with that, but while the legal department is checking on it we have proceeded with a prototype.
Now comes the technical problem: because the method actually simulates a search via news.google.com, after a period of time it returns a CAPTCHA. I suspect that's because the method is searching with the results shown as RSS, as opposed to reading an outright RSS feed; however, the dev team says RSS is exactly the same thing and will trigger the CAPTCHA as well.
I have my doubts. If that's the case, how have other news aggregator sites compiled their feeds from different sources?
For your reference, here is a sample of the URL that eventually gives the CAPTCHA:
https://news.google.com/news/feeds?hl=en&gl=sg&as_qdr=a&authuser=0&q=dbs+bank+singapore&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&biw=1280&bih=963&um=1&ie=UTF-8&output=rss
"Searching" is usually behind a CAPTCHA because it is very resource intensive, so sites do everything they can to prevent bots from searching. A normal RSS feed is the opposite of resource intensive. To summarize: normal RSS feeds will probably not trigger CAPTCHAs.
Since Google declared their News API deprecated as of May 26, 2011, maybe using NewsCred as suggested in this group post http://productforums.google.com/forum/#!topic/news/RBRH8pihQJI could be an option for your commercial use.
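As the first answer suggests, polling the publishers' own feeds avoids the search CAPTCHA entirely. A minimal sketch, assuming the feedparser library; the feed URLs below are placeholders, so substitute whatever feed URLs each site actually advertises.

import feedparser

FEEDS = [
    "https://www.ft.com/rss/home",           # placeholder feed URL
    "https://www.reuters.com/rss/topNews",   # placeholder feed URL
    "https://www.bloomberg.com/feeds/news",  # placeholder feed URL
]

def aggregate(feeds):
    # Pull every advertised feed once and merge the entries into a single list.
    items = []
    for url in feeds:
        for entry in feedparser.parse(url).entries:
            items.append((entry.get("published", ""), entry.get("title", ""), entry.get("link", "")))
    return items

if __name__ == "__main__":
    for published, title, link in aggregate(FEEDS):
        print(published, "|", title, "|", link)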

Which is most used? RSS or Atom? and which one is better to build from?

I am thinking of using either RSS or Atom in my project, but also "enhancing" the feed with some of my own special attributes specifically used by my project.
So I have two questions:
1) Which of RSS and Atom is most used on the web and by the big sites?
2) Which is more suitable to build on by adding my own tags?
Update:
So RSS is most used, but I should pick Atom since I need to make my own tweaks to the feed? If RSS is more popular, why not pick that? Why didn't Google pick it?
There was a day when I was really interested in syndication and publishing formats. I knew all the quirks of RSS 0.91/1.0/2.0 and Atom 1.0 (and the 0.3 version). Atom was basically born to create something more complete out of the RSS experience, which consisted of little more than Dave Winer's and Netscape's specifications (today only RSS 2.0 makes practical sense; its specification is here: http://cyber.law.harvard.edu/rss/rss.html). Atom was started by Sam Ruby, then adopted and developed by a committee of savvy people, and it resulted in two things: an XML-based syndication format and a publishing protocol. Since 2005 Atom has been an IETF standard, and in my opinion it is more complete and better specified than RSS.
As for adoption, I think that in raw numbers RSS still has the advantage. A lot of sites decided to stick with the version they already had in place (RSS), and podcasting is usually done over RSS too. There are a ton of websites offering both, by the way.
As for extending the format (your second question), Atom was created with this in mind, so you should go down that route. Google's GData format is basically an extension of the Atom format: https://developers.google.com/gdata/docs/1.0/elements
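To illustrate what "adding my own tags" looks like in Atom, here is a small sketch using Python's standard xml.etree.ElementTree: extension elements simply live in their own XML namespace alongside the standard Atom elements. The namespace URI and the <my:score> element are made up for illustration.

import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
MY_NS = "http://example.com/myproject/1.0"   # hypothetical extension namespace

ET.register_namespace("", ATOM_NS)
ET.register_namespace("my", MY_NS)

feed = ET.Element(f"{{{ATOM_NS}}}feed")
ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = "My project feed"
ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = "2012-01-01T00:00:00Z"

entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = "First item"
ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = "urn:uuid:00000000-0000-0000-0000-000000000001"
ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = "2012-01-01T00:00:00Z"
# Project-specific extension element; ordinary feed readers will simply ignore it.
ET.SubElement(entry, f"{{{MY_NS}}}score").text = "42"

print(ET.tostring(feed, encoding="unicode"))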
Atom is absolutely the standard to go for.
I presume you're using the standard to share (or move) information, so it's like a pipe that your information is passing along. By adopting Atom you can be confident that both ends of the pipe agree about what's in there. It's more hit and miss with RSS.

How to scrape websites such as Hype Machine?

I'm curious about website scraping (i.e. how it's done, etc.); specifically, I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a fourth-year Software Engineering undergraduate; however, we don't really cover any web programming, so my understanding of JavaScript/RESTful APIs/all things web is pretty limited, as we're mainly focused on theory and client-side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. Here are some docs on how you might download a URL in Python and parse XML in Python.
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
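A sketch of polite polling, assuming the feedparser library: it supports HTTP conditional GET through its etag and modified arguments, so repeated runs cost the server very little when the feed has not changed. The feed URL and interval are placeholders.

import time

import feedparser

FEED_URL = "https://hypem.com/feed"   # placeholder; use whatever feed the site advertises
POLL_SECONDS = 30 * 60                # don't poll more often than you actually need

def poll_forever():
    etag = modified = None
    while True:
        d = feedparser.parse(FEED_URL, etag=etag, modified=modified)
        if getattr(d, "status", None) == 304:
            print("Not modified; nothing to do this round")
        else:
            etag = getattr(d, "etag", None)          # remember validators for the next request
            modified = getattr(d, "modified", None)
            for entry in d.entries:
                print(entry.get("title", ""), entry.get("link", ""))
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    poll_forever()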
You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669
I believe the most important thing you must analyze is which kind of information you want to extract. If you want to extract entire websites, like Google does, your best option is probably to look at tools like Nutch from Apache.org (nutch.apache.org) or the Flaptor solution at http://ww.hounder.org. If you need to extract particular areas from unstructured documents (websites, docs, PDFs), you can probably extend Nutch plugins to fit your particular needs.
On the other hand, if you need to extract particular text or clip areas of a website by setting rules against the DOM of the page, then what you need is closer to tools like mozenda.com. With those tools you can set up extraction rules to scrape particular information from a website. Keep in mind that any change to the webpage may break your robot.
Finally, if you are planning to build a website on top of external information sources, you could purchase data from companies such as spinn3r.com, which sell particular niches of information ready to be consumed. You can save a lot of money on infrastructure that way.
Hope it helps!
sebastian.
Python has the feedparser module, located at feedparser.org, which handles RSS in its various flavours and Atom in its various flavours. No reason to reinvent the wheel.

Access to old, no longer available, feed entries

I am working on a project that requires reliable access to historic feed entries which are not necessarily available in the current feed of the website. I have found several ways to access such data, but none of them give me all the characteristics I need.
Look at this as a brainstorm. I will tell you how much I have found and you can contribute if you have any other ideas.
Google AJAX Feed API - will limit you to 250 items
Unofficial Google Reader API - Perfect but unofficial and therefore unreliable (and perhaps quasi-illegal?). Also, the authentication seems to be tricky.
Spinn3r - Costs a lot of money
Spidering the internet archive at the site of the feed - Lots of complexity, spotty coverage, only useful as a last resort
Yahoo! Feed API or Yahoo! Search BOSS - The first looks more like an aggregator, meaning I'd need a different registration for each feed and the second should give more access to Yahoo's data but I can find no mention of feeds.
(thanks to Lou Franco) Bloglines Sync API - Besides the problem of needing an account and being designed more as an aggregator, it does not have a way to add feeds to the account. So no retrieval of arbitrary feeds. You need to manually add them through the reader first.
Other search engines/blog search/whatever?
This is a really irritating problem as we are talking about semantic information that was once out there, is still (usually) valid, yet is difficult to access reliably, freely and without limits. Anybody know any alternative sources for feed entry goodness?
Bloglines has an API to sync accounts
http://www.bloglines.com/services/api/sync
You have to make an account and subscribe to the feed you want to download, but then you can download based on date, which can be way in the past. Not sure of the terms.
The best answer I've found so far is this: Google Reader's unofficial API turns out to have a public access point for its feeds, which means no authentication is needed. Usage is as follows:
http://www.google.com/reader/public/atom/feed/{your feed uri here}?n=1000
Replace the text in the curly braces (including the braces themselves) with the feed URI you're interested in. More information about the precise arguments can be found here:
http://blog.martindoms.com/2009/10/16/using-the-google-reader-api-part-2/
but remember to use the /public/ URL if you don't want to mess with authentication.
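A minimal sketch of the pattern described above, assuming the feedparser library and the Python standard library; the endpoint is the unofficial one from this answer, so it may change or disappear without notice, and the feed URI is URL-encoded here to be safe.

import urllib.parse

import feedparser

def reader_history(feed_uri, count=1000):
    # Build the public (unauthenticated) Reader URL for the given feed URI.
    encoded = urllib.parse.quote(feed_uri, safe="")
    url = f"http://www.google.com/reader/public/atom/feed/{encoded}?n={count}"
    return feedparser.parse(url).entries

if __name__ == "__main__":
    for entry in reader_history("http://example.com/rss.xml"):  # hypothetical feed URI
        print(entry.get("published", ""), entry.get("title", ""))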
