RDF Usage Rates for Syndication

Is RDF still used widely for content syndication? Specifically, I know only of Slashdot as a large-scale website syndicating content in that format (versus, say, RSS).
Understandably this might seem too vague to answer, so more specifically:
Can anyone list any larger sites similar in scale to Amazon or CNN using it?
Any web-based publishing platforms (WordPress, Joomla, etc.) that generate syndication feeds with this XML vocabulary?
Any other, more quantifiable, evidence that it is used for syndication online?
I understand that RDF may be a parent specification, but in this case I'm talking about sites that syndicate content using <rdf:RDF> as the root element and heavily leveraging elements from the RDF namespace:
http://www.w3.org/1999/02/22-rdf-syntax-ns#

Initial versions of RSS were RDF-based, but newer ones are XML languages without RDF syntax elements.
Here is a link on the different RSS versions: http://diveintomark.org/archives/2004/02/04/incompatible-rss
I believe RSS 2.0 and Atom are currently more common for syndication than RDF-based RSS formats.
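If you need to check feeds programmatically, here is a minimal sketch (Python, standard library only; the feed URL is a hypothetical placeholder) that distinguishes an RDF-based RSS 1.0 feed from RSS 2.0 or Atom by inspecting the root element:

import urllib.request
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def is_rdf_feed(url):
    """Return True if the feed's root element is rdf:RDF, i.e. an
    RDF-based RSS 1.0 feed rather than RSS 2.0 ('rss') or Atom ('feed')."""
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    # ElementTree spells namespaced tags as "{namespace}localname"
    return root.tag == "{%s}RDF" % RDF_NS

# Hypothetical usage; any RSS 1.0 feed URL would do:
# print(is_rdf_feed("https://example.com/feed.rdf"))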

Related

How to create a data:application link for a web font?

Web fonts have some ins and outs around cross-domain hosting. As a developer who provides code for a multitude of clients that want to use web fonts for their aesthetic quality, this can be challenging, especially when trying to detail the technical steps of hosting a file and making sure the URL path points to it properly.
Recently, I have come across a webfont that uses a
data:application/font-woff;charset=utf-8;base64, "longHash"
nomenclature and I am not familiar with this.
One great benefit is that this doesn't seem to have the cross-domain pitfalls of using a URL for a font; example here:
http://jsfiddle.net/9336yqkL/1/
If you look at the link you can see that it's a long series of alphanumeric characters where the URL path typically is.
I wonder, how does one create a path like this?
Help is always appreciated!
Base64 is an encoding scheme for binary data. You can use a decoder to recover the binary contents from the alphanumeric text that comes after base64,. The whole mechanism is called the data URI scheme, and it has a list of pros and cons: http://en.wikipedia.org/wiki/Data_URI_scheme#Advantages_and_disadvantages
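For the "how does one create a path like this?" part: a minimal sketch in Python, assuming a local .woff file (the file name is hypothetical), that base64-encodes the font and builds the data: URI:

import base64

def font_to_data_uri(path, mime="application/font-woff"):
    """Read a font file and return a data: URI like the one in the
    question, ready for an @font-face src declaration."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return "data:%s;charset=utf-8;base64,%s" % (mime, encoded)

# Hypothetical file name:
# uri = font_to_data_uri("MyFont.woff")
# print('@font-face { font-family: "MyFont"; src: url(%s) format("woff"); }' % uri)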

What's the universal standard to get data from any blog?

I want to extract data from various kinds of blogs and was going through various ways to do it:
API which needs user authentication
XML-RPC (don't know which ones support it)
RSS (again, not sure which blogs support it, and even if they do, how much one can get from RSS feeds)
Atom
I know that this isn't strictly a programming question, but I went ahead and asked it because there is a heck of a lot of confusion about what to use and which is better supported.
It would be nice not to use an API with authentication, since you not only have to tackle varied implementations of authentication, you also have to deal with varied API limits.
RSS is the oldest format in use, and it has its limitations. Atom was designed as its replacement, overcoming those limitations. (Note that the Atom syndication format is a plain XML vocabulary, not a form of XML-RPC; the related Atom Publishing Protocol handles publishing over plain HTTP.) All of the above are a type of API. So ideally what you want to do is support both RSS and Atom; sadly, Atom and RSS are not compatible with each other. To quote Wikipedia on "Atom":
In particular, many blog and wiki sites offer their web feeds in the
Atom format.
#porneL's solution is not recommended (at the moment). However in the future, HTML markup is set to change to improve the semantic meaning given to blocks, such as the new <article> tag. This will be yet another way to parse documents. It will be the most versatile, but in my opinion it will be a very long time before it becomes reliable, since many if not most sites suffer from 'tag soup' syndrome.
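Since the practical advice above is to support both RSS and Atom, a minimal sketch using the third-party feedparser library, which normalises both formats into the same entry shape (the feed URL is a placeholder):

import feedparser  # third-party: pip install feedparser

def fetch_entries(url):
    """Parse a feed without caring whether it is RSS or Atom; feedparser
    normalises both into the same entry shape."""
    d = feedparser.parse(url)
    # d.version reports the detected format, e.g. 'rss20' or 'atom10'
    return d.version, [(e.get("title"), e.get("link")) for e in d.entries]

# Hypothetical feed URL:
# version, entries = fetch_entries("https://example.com/feed")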
The most universal "standard" is crawling and parsing HTML.
wget -m http://example.com/
How exactly you do it depends on what you are trying to accomplish and how universal you want to be.
You could use heuristics, similar to what Readability uses, to find articles on a site. You could detect and special-case popular blogging platforms.
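As a sketch of the detection step, most blogs advertise their feeds via <link rel="alternate"> autodiscovery tags in the HTML head; this standard-library Python example (URL hypothetical) collects them:

import urllib.request
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags whose type is an
    RSS or Atom MIME type - the usual feed autodiscovery convention."""
    def __init__(self):
        super().__init__()
        self.feeds = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" \
                and a.get("type") in FEED_TYPES and a.get("href"):
            self.feeds.append(a["href"])

def discover_feeds(url):
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds

# print(discover_feeds("https://example.com/"))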

Which is most used, RSS or Atom? And which one is better to build on?

I am thinking of using either RSS or Atom in my project, but also "enhancing" the feed with some of my own special attributes specifically used by my project.
So I have two questions:
1) Which is most used of RSS and Atom on the web and by the big sites?
2) Which is more suitable to build on by adding my own tags?
Update:
So RSS is most used, but I should pick Atom since I need to make my own tweaks to a feed? If RSS is more popular, why not pick that? Why didn't Google pick that?
There was a day when I was really interested in syndication and publishing formats. I knew all the quirks of RSS 0.91/1.0/2.0 and Atom 1.0 (and the 0.3 version). Atom was basically born to make something more complete out of the RSS experience, which consisted of little more than Dave Winer's and Netscape's specifications (now only RSS 2.0 makes practical sense; its specification is here: http://cyber.law.harvard.edu/rss/rss.html). Atom was started by Sam Ruby, then adopted and developed by a committee of savvy people, and it resulted in two things: an XML-based syndication format and a publishing protocol. Since 2005 Atom has been an IETF standard, and in my opinion it is more complete and better specified than RSS.
As for adoption, I think that in raw numbers RSS still has the advantage. A lot of sites decided to stick with the version they already had in place (RSS), and podcasting is usually done over RSS too. There are a ton of websites offering both, by the way.
As for extending the format, your second question: Atom was created with this in mind, so you should go down that route. Google's GData format is basically an extension of the Atom format: https://developers.google.com/gdata/docs/1.0/elements
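Since extensibility is the point here, a minimal sketch of adding project-specific elements to an Atom feed under your own XML namespace (the namespace URI and the <my:rating> element are hypothetical):

import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Hypothetical project-specific namespace for the custom attributes
MY_NS = "http://example.com/myproject/1.0"

ET.register_namespace("", ATOM_NS)
ET.register_namespace("my", MY_NS)

feed = ET.Element("{%s}feed" % ATOM_NS)
ET.SubElement(feed, "{%s}title" % ATOM_NS).text = "My Project Feed"
entry = ET.SubElement(feed, "{%s}entry" % ATOM_NS)
ET.SubElement(entry, "{%s}title" % ATOM_NS).text = "First item"
# The extension element lives in its own namespace, so conforming
# Atom readers simply ignore it
ET.SubElement(entry, "{%s}rating" % MY_NS).text = "5"

print(ET.tostring(feed, encoding="unicode"))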
Atom is absolutely the standard to go for.
I presume you're using the standard to share (or move) information, so it's like a pipe that your information is passing along. By adopting Atom you can be confident that both ends of the pipe agree about what's in there. It's more hit-and-miss with RSS.

Popularity of RDF format vs RSS

If you are building an RSS parser, how important is it to build support for RDF?
Are any new feeds being published in only RDF?
My thinking was that RSS 2.0 (and Atom) have replaced RDF.
I actually had not heard of RDF until a client pointed out some feeds that are RDF-only.
RSS 1.0, which is built around RDF, does still crop up from time to time, so you should build support for it (along with the other 8 versions of RSS, and the ability to recover from errors, since many RSS feeds are invalid and not well-formed). Better yet, use an existing RSS parser instead of reinventing the wheel.
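As a sketch of the "use an existing parser" advice, the third-party feedparser library handles all the RSS variants, reports which one it found, and parses leniently (the feed URL is a placeholder):

import feedparser  # pip install feedparser

d = feedparser.parse("https://example.com/feed.rdf")  # hypothetical URL
print(d.version)   # 'rss10' for RDF-based RSS 1.0; 'rss20', 'atom10', ...
if d.bozo:         # set when the feed is not well-formed XML
    print("Malformed feed, parsed leniently:", d.bozo_exception)
for e in d.entries:
    print(e.title, e.link)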

How Do I Fetch All Old Items on an RSS Feed?

I've been experimenting with writing my own RSS reader. I can handle the "parse XML" bit. The thing I'm getting stuck on is "How do I fetch older posts?"
Most RSS feeds only list the 10-25 most recent items in their XML file. How do I get ALL the items in a feed, and not just the most recent ones?
The only solution I could find was using the "unofficial" Google Reader API, which would be something like
http://www.google.com/reader/atom/feed/http://fskrealityguide.blogspot.com/feeds/posts/default?n=1000
I don't want to make my application dependent on Google Reader.
Is there any better way? I noticed that on Blogger, I can do "?start-index=1&max-results=1000", and on WordPress I can do "?paged=5". Is there any general way to fetch an RSS feed so that it gives me everything, and not just the most recent items?
RSS/Atom feeds do not allow historic information to be retrieved. It is up to the publisher of the feed to provide it if they want to, as in the Blogger or WordPress examples you gave above.
The only reason that Google Reader has more information is that it remembered it from when it first came up.
There is an extension to the Atom format for something like this, RFC 5005 (Feed Paging and Archiving), but I don't know if it is actually implemented anywhere.
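For what it's worth, a minimal sketch of consuming that extension, assuming a publisher that actually emits RFC 5005 rel="prev-archive" links:

import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def fetch_archived_entries(url):
    """Walk backwards through an archived Atom feed by following
    RFC 5005 rel="prev-archive" links, yielding every entry seen."""
    while url:
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        for entry in root.findall(ATOM + "entry"):
            yield entry
        url = next((l.get("href") for l in root.findall(ATOM + "link")
                    if l.get("rel") == "prev-archive"), None)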
As the other replies here mentioned, a feed may not provide archival data, but historical items may be available from another source.
Archive.org’s Wayback Machine has an API to access historical content, including RSS feeds (if their bots have downloaded it). I’ve created the web tool Backfeed, which uses this API to regenerate a feed containing concatenated historical items. If you'd like to discuss the implementation in detail, please get in touch.
In my experience with RSS, the feed is composed of the last X items, where X is a variable. Certain feeds may have the full list, but for bandwidth's sake most places likely limit it to just the last few items.
The likely reason Google Reader has the old info is that it stores it on its side for users to retrieve later.
Further to what David Dean said, the RSS/Atom feeds will only contain what the publisher of the feed has up at that moment, and someone would need to be actively collecting this information in order to have any historical record. Basically Google Reader was doing this for free, and when you interacted with it you could retrieve this stored information from the Google database servers.
Now that they have retired the service, to my knowledge you have two choices. You either have to start collecting this information from your feeds of interest and store the data in XML or some such, or you could pay for this data from one of the companies that sell this type of archived feed information.
I hope this information helps somebody.
Seán
Another potential solution, which might not have been available when the question was originally asked, and which shouldn't require any specific service:
Find the URL of the RSS feed you want and use waybackpack to get the archived URLs for that feed (the sketch after this list shows the equivalent lookup against the Wayback CDX API).
Use FeedReader or a similar library to pull down the archived RSS feed.
Take the URLs from each feed and scrape them as you wish. If you're going way back in time, it's possible there might be some dead links.
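A minimal sketch of the first step, querying the Wayback Machine CDX API directly for archived copies of a feed (treat the endpoint details as an assumption to verify):

import json
import urllib.parse
import urllib.request

def wayback_snapshots(feed_url):
    """Query the Wayback Machine CDX API for archived copies of a feed.
    Returns a list of archive.org replay URLs, oldest first."""
    query = urllib.parse.urlencode({
        "url": feed_url,
        "output": "json",
        "fl": "timestamp,original",
        "filter": "statuscode:200",
    })
    with urllib.request.urlopen("http://web.archive.org/cdx/search/cdx?" + query) as resp:
        rows = json.load(resp)
    # First row is the header; build replay URLs from timestamp + original
    return ["http://web.archive.org/web/%s/%s" % (ts, orig) for ts, orig in rows[1:]]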
All previous answers more or less relied on existing services still having a copy of that feed, or on the feed engine being able to provide older items dynamically.
There is, though, another, admittedly proactive and rather theoretical, way to do it: let your feed reader use a caching proxy which semantically understands RSS and/or Atom feeds and caches them on a per-item basis, up to as many items as you configure.
If the feed reader doesn't poll feeds regularly, the proxy could fetch known feeds on its own, time-based schedule so as not to miss an item in highly volatile feeds like the one from User Friendly, which has only one item and changes every day (or at least used to). Hence if the feed reader e.g. crashed or lost network connection while you were away for a few days, you might lose items in your feed reader's cache. Having the proxy fetch those feeds regularly (e.g. from a data center instead of from home, or on a server instead of a laptop) allows you to run the feed reader only now and then, without losing items which were posted after your feed reader last fetched feeds but rotated out again before you fetched them the next time.
I call that concept a Semantic Feed Proxy, and I've implemented a proof of concept called sfp. It's not much more than a proof of concept, though, and I haven't developed it further. (So I'd be happy about hints to projects with similar ideas or purposes.)
Why does this problem exist?
Most RSS readers need to import feeds through a live URL, which makes things harder for sites that are not indexed by the Wayback Machine.
The reason Wayback Machine feeds can be imported is that the reader can regularly poll the server for updates according to its defined TTL configuration. The reader compares the current datetime with the pubDate or lastBuildDate keys of the feed's posts in the XML response. We can't hack the machine's datetime to work around this, because the current datetime is fetched live.
I've outlined an alternative solution without Wayback below. Unfortunately, I have not been able to find a universal solution for all feed sources.
Alternative Solution(s)
In my experience, NOT ALL feeds are partial, though. The XML doesn't have to specify the datetime of each post, which means the RSS reader doesn't have a datetime to filter the feed by. An example of this feed type can be found here.
This kind of reading experience is useful when chronological order is irrelevant and the content doesn't need to be sorted. It is useful for sites where ALL the content is valuable, and the linked Essays of Paul Graham are a good example.
If the site has a generic, non-chronological feed option, subscribe to that RSS instead (the preferred option).
Download the linked timestamped .rss file, strip the datetimes, and host the file on your own server. Note, we can implement this via an AWS Lambda:
Set up a server that fetches the RSS feed live.
Strip the pubDate tags from the XML on fetch (see the sketch after this list).
Host the modified RSS on your own server.
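A minimal sketch of that fetch-and-strip step for a plain RSS 2.0 feed (feed_url is a placeholder):

import urllib.request
import xml.etree.ElementTree as ET

def strip_dates(feed_url):
    """Fetch an RSS feed and drop pubDate/lastBuildDate elements, so a
    reader has no datetime to filter items by. Returns the modified XML."""
    with urllib.request.urlopen(feed_url) as resp:
        root = ET.fromstring(resp.read())
    # ElementTree removals need the parent, so walk parent/child pairs
    for parent in root.iter():
        for child in list(parent):
            if child.tag in ("pubDate", "lastBuildDate"):
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

# Host the returned XML on your own server (e.g. from an AWS Lambda)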
Note
These are suboptimal solutions due to the loss of ordering; however, I wanted to provide a potential alternative to the Wayback Machine.
In addition, some existing answers require advanced system-design workarounds, more prework, and are in some cases outdated (Google Reader is shut down). I hope it's helpful for those who really need a solution for a complete feed list. Constructing new RSS feeds from the original RSS file is not too hard.
