If you are building an RSS parser, how important is it to build support for RDF?
Are any new feeds being published in only RDF?
My thinking was that RSS 2.0 (and Atom) have replaced RDF.
I actually had not heard of RDF until a client pointed out some feeds that are RDF-only.
RSS 1.0, which is built around RDF, does still crop up from time to time, so you should build support for it (along with the other 8 versions of RSS, and the ability to recover from errors, since many RSS feeds are invalid or not well-formed). Better yet, use an existing RSS parser instead of reinventing the wheel.
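If you do end up writing the dispatch yourself, telling the dialects apart is mostly a matter of looking at the root element. Here is a rough sketch using only Python's standard library; the function name and the returned labels are made up for illustration, and a real parser would need far more error recovery than this.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ATOM_NS = "http://www.w3.org/2005/Atom"


def detect_feed_format(xml_text):
    """Return a rough label for the feed dialect, based on the root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Many feeds in the wild are not well-formed; a real parser would
        # fall back to a more forgiving strategy here.
        return "not well-formed"
    if root.tag == "{%s}RDF" % RDF_NS:
        return "rss-rdf"  # RDF-based RSS, e.g. RSS 1.0
    if root.tag == "{%s}feed" % ATOM_NS:
        return "atom-1.0"
    if root.tag == "rss":
        return "rss-" + root.get("version", "unknown")
    return "unknown"
```

An RDF-only feed like the ones your client found would come back as "rss-rdf", so you can at least log how often they show up before deciding how much effort to spend on them.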
I want to generate an RSS feed that can be displayed threaded in RSS clients.
So if the feed describes something like comments or changes to one and the same entity, I want these to be visually grouped.
Is this possible in Java?
There are Atom Threading Extensions (RFC 4685) that will do the job. Atom feeds that use them will be displayed threaded in Thunderbird and Outlook.
They should be pretty simple to implement for any RSS library. For ROME, I published a ROME module that can be used.
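The extension itself is just one extra element per entry, so it looks the same whichever library or language writes it. Below is a rough sketch in Python with the standard library (not the ROME module mentioned above); the helper name and the example IDs are hypothetical, while the namespace and the ref/href attributes come from RFC 4685.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
THR_NS = "http://purl.org/syndication/thread/1.0"  # namespace from RFC 4685

# Serialize Atom as the default namespace and the threading extension as "thr".
ET.register_namespace("", ATOM_NS)
ET.register_namespace("thr", THR_NS)


def make_reply_entry(entry_id, title, updated, parent_id, parent_href=None):
    """Build an Atom entry marked as a reply to the entry whose id is parent_id."""
    entry = ET.Element("{%s}entry" % ATOM_NS)
    ET.SubElement(entry, "{%s}id" % ATOM_NS).text = entry_id
    ET.SubElement(entry, "{%s}title" % ATOM_NS).text = title
    ET.SubElement(entry, "{%s}updated" % ATOM_NS).text = updated
    # "ref" carries the parent's atom:id; "href" is an optional link to it.
    reply = ET.SubElement(entry, "{%s}in-reply-to" % THR_NS, {"ref": parent_id})
    if parent_href:
        reply.set("href", parent_href)
    return entry


comment = make_reply_entry(
    "tag:example.org,2010:comment-1",  # hypothetical ids
    "Re: First post",
    "2010-01-01T00:00:00Z",
    "tag:example.org,2010:post-1",
)
print(ET.tostring(comment, encoding="unicode"))
```

Every comment or change that belongs to the same entity points at that entity's id through ref, which is what a threading-aware client can use to group them.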
I want to extract data from various kinds of blogs and was going through various ways to do it:
API which needs user authentication
XML-RPC (don't know which blogs support it)
RSS (again, not sure which blogs support it and, even if they do, how much one can get from RSS feeds)
Atom
I know this isn't strictly a programming question, but I am asking because there is a heck of a lot of confusion about what to use and which works best.
It would be nice not to use an API with authentication, since you not only have to handle varied authentication implementations but also have to deal with varied API limits.
RSS is the oldest of these formats. It has limitations, and Atom was designed to replace it and overcome them. Note that Atom is not a form of XML-RPC: XML-RPC is a general remote-procedure-call protocol with many other uses, while Atom is an XML feed format (with a companion HTTP publishing protocol). All of the above are, loosely, types of API. So ideally what you want to do is support both RSS and Atom. Sadly, Atom and RSS are not compatible with each other. To quote Wikipedia on "Atom":
In particular, many blog and wiki sites offer their web feeds in the Atom format.
@porneL's solution is not recommended (at the moment). However, in the future HTML markup is set to change to improve the semantic meaning given to blocks, such as the new <article> tag. This will be yet another way to parse documents. It will be the most versatile, but in my opinion it will be a very long time before it becomes reliable, since many if not most sites suffer from 'tag soup' syndrome.
The most universal "standard" is crawling and parsing HTML.
wget -m http://example.com/
How exactly you do it depends on what you are trying to accomplish and how universal you want to be.
You could use heuristics, similar to what Readability uses, to find articles on a site. You could detect and special-case popular blogging platforms.
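As one possible way to special-case platforms, many blog engines (WordPress, for example) announce themselves in a generator meta tag in the page head. This is only a sketch with Python's standard library; the class and function names are made up here, and plenty of sites strip or fake this tag, so treat the result as a hint rather than a fact.

```python
from html.parser import HTMLParser


class GeneratorSniffer(HTMLParser):
    """Remember the content of the first <meta name="generator"> tag seen."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.generator is None:
            attr = dict(attrs)
            # Attribute names arrive lowercased; the value may be any case.
            if (attr.get("name") or "").lower() == "generator":
                self.generator = attr.get("content")


def detect_platform(html_text):
    """Return the generator string advertised by the page, or None."""
    sniffer = GeneratorSniffer()
    sniffer.feed(html_text)
    return sniffer.generator
```

You would run this over pages fetched by the crawl and branch on the returned string, falling back to generic article-finding heuristics when it is None.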
I am thinking of using either RSS or Atom in my project, but also "enhancing" the feed with some of my own special attributes specifically used by my project.
So I have two questions:
1) Which of RSS and Atom is most used on the web and by the big sites?
2) Which is most suitable to build on by adding my own tags?
Update:
So RSS is most used, but I should pick Atom since I need to make my own tweaks on a feed? If RSS is more popular, why not pick that? Why didn't Google pick that?
There was a day when I was really interested in syndication and publishing formats. I knew all the quirks of RSS 0.91/1.0/2.0 and Atom 1.0 (and the 0.3 version). Atom was basically born to create something more complete out of the RSS experience, which rested almost entirely on Dave Winer's and Netscape's specifications (now only RSS 2.0 makes practical sense; its specification is here: http://cyber.law.harvard.edu/rss/rss.html). Atom was started by Sam Ruby, then adopted and developed by a committee of savvy people, and it resulted in two things: an XML-based syndication format and a publishing protocol. Since 2005 Atom has been an IETF standard, and in my opinion it is more complete and better specified than RSS.
As for adoption, I think that in raw numbers RSS still has the advantage. A lot of sites decided to stick with the version they already had in place (RSS), and podcasting is usually done on RSS too. There are a ton of websites offering both, by the way.
As for extending the format, your second question, Atom was created with this in mind, so you should go down that route. Google's GData format is basically an extension of the Atom format: https://developers.google.com/gdata/docs/1.0/elements
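To make the extension route concrete: the usual approach is to put your own elements in your own XML namespace, so that consumers which do not know them can skip them. A minimal sketch in Python's standard library follows; the namespace URI, the "priority" element, and both helper names are hypothetical placeholders for whatever your project needs.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
MY_NS = "http://example.com/ns/myproject"  # hypothetical project namespace


def add_extension(entry, name, value):
    """Append a project-specific child element to an Atom entry."""
    child = ET.SubElement(entry, "{%s}%s" % (MY_NS, name))
    child.text = value
    return child


def read_extension(entry, name):
    """Return the text of a project-specific element, or None if absent."""
    return entry.findtext("{%s}%s" % (MY_NS, name))


entry = ET.Element("{%s}entry" % ATOM_NS)
ET.SubElement(entry, "{%s}title" % ATOM_NS).text = "Hello"
add_extension(entry, "priority", "high")

# Round-trip through serialization to show the extension survives.
parsed = ET.fromstring(ET.tostring(entry))
print(read_extension(parsed, "priority"))
```

Your own consumer reads the extra element through the namespace, and everything else in the entry stays plain Atom.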
Atom is absolutely the standard to go for.
I presume you're using the standard to share (or move) information, so it's like a pipe that your information is passing along. By adopting Atom you can be confident that both ends of the pipe are in agreement about what's in there. It's more hit-and-miss with RSS.
Is RDF still used widely for content syndication? Specifically, I know only of Slashdot as a large scale website syndicating content in that format (say versus RSS).
Understandably this might seem vague to answer so more specifically:
Can anyone list any larger sites similar in scale to Amazon or CNN using it?
Any web-based publishing platforms (WordPress, Joomla, etc.) that generate syndication feeds with this XML vocabulary?
Any other more quantifiable evidence that it is used for syndication online.
I understand that RDF may be a parent specification but in this case I'm talking about sites that syndicate content using <rdf> as a root element and heavily leveraging elements from the RDF namespace:
http://www.w3.org/1999/02/22-rdf-syntax-ns#
Initial versions of RSS were RDF based, but newer ones are XML languages without RDF syntax elements.
Here is a link on the different RSS versions: http://diveintomark.org/archives/2004/02/04/incompatible-rss
I believe RSS 2.0 and Atom are currently more common for syndication than RDF based RSS formats.
I found some modules for feed parsing (Aggregator, Feeds, FeedAPI). I am confused about which one to choose. I need to filter and classify the feeds. Can anyone guide me?
Feeds is an attempt to replace FeedAPI, done by the same developers. It should be better, but as FeedAPI has gathered some extensions from other modules, Feeds might not yet offer some features that were previously available via extension modules (note that this is just speculation).
Both offer more functionality than Drupal's built-in Aggregator module, which is geared towards a 'lightweight' aggregation approach.
So I would start by checking the built-in Aggregator module. It offers 'categorization' of feeds and items, which might be enough for your need to 'filter' and 'classify'. If it is not enough, I would check the new Feeds module next, and only 'fall back' to FeedAPI if you need some extension or functionality not available for Feeds yet.
Feeds is the way to go. FeedAPI is not going to be further developed.
Also, the Managing News install profile might be a good starting point depending on your needs. Both are built by Development Seed, who are forging ahead in doing interesting stuff with feeds.
Feeds and/or FeedAPI work well. FeedAPI has been discontinued in favor of Feeds though.