One of the feeds I follow has started using enclosures of the following form:
<enclosure>http://www.historyofphilosophy.net/sites/default/files/MP3/HoP%20140%20-%20By%20All%20Means%20Necessary%20-%20Avicenna%20on%20God.mp3</enclosure>
This, if I understand correctly, doesn't conform to the RSS 2.0 spec, which instead specifies enclosures of the form
<enclosure url="http://www.scripting.com/mp3s/weatherReportSuite.mp3" length="12216320" type="audio/mpeg" />
My question is whether such non-conforming enclosures are frequent, and whether they are something a feed aggregator should be expected to parse.
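For what it's worth, a lenient aggregator can accept both forms with a small fallback. Here is a minimal sketch in Python (my own illustration, not anything the spec endorses):

import xml.etree.ElementTree as ET

# A snippet mixing a conforming and a non-conforming enclosure.
rss_text = """<rss><channel>
<item><enclosure url="http://www.scripting.com/mp3s/weatherReportSuite.mp3"
                 length="12216320" type="audio/mpeg" /></item>
<item><enclosure>http://www.historyofphilosophy.net/sites/default/files/MP3/HoP%20140%20-%20By%20All%20Means%20Necessary%20-%20Avicenna%20on%20God.mp3</enclosure></item>
</channel></rss>"""

root = ET.fromstring(rss_text)
for enclosure in root.iter("enclosure"):
    # Conforming feeds carry the address in the url attribute; fall back
    # to the element's text content for feeds like the one above.
    url = enclosure.get("url") or (enclosure.text or "").strip()
    print(url)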
What RFC defines passing arrays over HTTP? Most web application platforms allow you to supply an array of arguments via GET or POST. The following URL is an example:
http://localhost/?var[1]=one&var[2]=two&var[3]=three
RFC 1738 defines URLs; however, brackets are missing from the Backus–Naur Form (BNF) definition of a URL. This RFC also doesn't cover POST. Ideally I would like to find the BNF for this feature as defined in an RFC.
According to Wikipedia, there is no single spec:
While there is no definitive standard, most web frameworks allow multiple values to be associated with a single field (e.g. field1=value1&field1=value2&field2=value3)
That Wikipedia article links to the following Stack Overflow post, which covers a similar question: Authoritative position of duplicate HTTP GET query keys
The issue here is that form parameters can be whatever you want them to be. Some web frameworks have settled on key[number]=value for arrays; others haven't. Interestingly, RFC 1866 section 8.2.4, page 48 (note: this RFC is historical, not current) shows an example with the same key used twice in a form POST:
name=John+Doe
&gender=male
&family=5
&city=kent
&city=miami
&other=abc%0D%0Adef
&nickname=J%26D
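As an illustration, Python's standard urllib.parse.parse_qs (one framework's behavior, not a standard) collects those repeated keys into a list:

from urllib.parse import parse_qs

# An excerpt of the RFC 1866 example; the duplicated "city" key
# comes out as a two-element list.
body = "name=John+Doe&gender=male&family=5&city=kent&city=miami"
print(parse_qs(body))
# {'name': ['John Doe'], 'gender': ['male'], 'family': ['5'], 'city': ['kent', 'miami']}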
On the W3C side of things, HTML 4.01 has some information about how to encode form parameters. Sadly this doesn't cover arrays.
At the time of writing, I don't think there is a correct answer to your question: no IETF RFC or W3C spec defines the behavior that you're interested in.
(As a side note, the W3C HTML JSON form submission draft spec covers posting arrays, thank goodness.)
URIs are defined by RFC 3986.
However, what you're asking about is encoding of form parameters. You need to look up the HTML spec for that.
I want to extract data from various kinds of blogs and have been going through the various ways to do it:
APIs, which need user authentication
XML-RPC (don't know which blogs support it)
RSS (again, not sure which blogs support it, and even if they do, how much one can get out of RSS feeds)
Atom
I know this isn't a strictly programming-related question, but I'm asking anyway because there is a great deal of confusion as to what to use and which option each blog serves best.
It would be nice to avoid APIs that require authentication, since you would not only have to deal with varied implementations of authentication but also with varied API limits.
RSS is the oldest of these and came into use first, but it has limitations. Atom was designed to be its replacement, overcoming the limitations of RSS. Note that, despite occasional confusion, Atom is a feed format, not a form of XML-RPC (which is a remote procedure call protocol); both RSS and Atom are XML documents you fetch over HTTP. All of the above are a type of API. So ideally what you want to do is support both RSS and Atom. Sadly, Atom and RSS are not backwards compatible. To quote Wikipedia on "Atom":
In particular, many blog and wiki sites offer their web feeds in the
Atom format.
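As a minimal sketch of supporting both at once, the third-party feedparser library (my suggestion, not part of the formats themselves) normalizes RSS and Atom into one structure; the feed URL below is hypothetical:

import feedparser  # pip install feedparser

# feedparser accepts RSS 0.9x/1.0/2.0 as well as Atom and exposes
# entries uniformly regardless of the underlying format.
d = feedparser.parse("http://example.com/feed")  # hypothetical URL
for entry in d.entries:
    print(entry.title, entry.link)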
#porneL's solution (below) is not recommended at the moment. However, HTML markup is set to change to improve the semantic meaning given to blocks, such as with the new HTML5 <article> tag, and this will be yet another way to parse documents. It will be the most versatile approach, but in my opinion it will be a very long time before it becomes reliable, since many if not most sites suffer from 'tag soup' syndrome.
The most universal "standard" is crawling and parsing HTML.
wget -m http://example.com/
How exactly you do it depends on what you are trying to accomplish and how universal you want to be.
You could use heuristics, similar to what Readability uses, to find articles on a site. You could detect and special-case popular blogging platforms.
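As a rough sketch of that heuristic approach (assuming the third-party requests and beautifulsoup4 packages; the URL is hypothetical, and real extractors are far more elaborate):

import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

html = requests.get("http://example.com/blog", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Prefer semantic <article> elements when the site provides them...
articles = soup.find_all("article")
if not articles:
    # ...otherwise fall back to a crude density heuristic: the <div>
    # containing the most text is probably the main content.
    divs = soup.find_all("div")
    if divs:
        articles = [max(divs, key=lambda d: len(d.get_text()))]

for a in articles:
    print(a.get_text(strip=True)[:200])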
This question is coming from a non-technical person. I have asked a team to build a sort of RSS reader; in essence, it's a news aggregator. What we had in mind at first was to source news directly from specific sources: ft.com, reuters.com, and bloomberg.com.
Now, the development team has proposed a certain way of doing it (because it'll be easier): using news.google.com and returning whatever the result is. I know this has questionable legality, and we are not really comfortable with that fact, but while the legal department is checking on it, we have proceeded with a prototype.
Now comes the technical problem. Because the method actually simulates a search via news.google.com, after a period of time it starts returning a CAPTCHA. I suspect this is because the method is searching with the results shown as RSS, as opposed to requesting an outright RSS feed. However, the dev team says RSS is exactly the same thing and will trigger a CAPTCHA as well.
I have my doubts. If that's the case, how have other news aggregator sites compiled their feeds from different sources?
For your reference, here is a sample of the URL that eventually triggers the CAPTCHA:
https://news.google.com/news/feeds?hl=en&gl=sg&as_qdr=a&authuser=0&q=dbs+bank+singapore&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&biw=1280&bih=963&um=1&ie=UTF-8&output=rss
"Searching" is usually behind a captcha because it is very resource intensive, thus they do everything they can to prevent bots from searching. A normal RSS feed is the opposite of resource intensive. To summarize: normal RSS feeds will probably not trigger CAPTCHA's.
Since Google declared their News API deprecated as of May 26, 2011, using NewsCred, as suggested in this group post http://productforums.google.com/forum/#!topic/news/RBRH8pihQJI, could be an option for your commercial use.
Could someone please help me understand what the “link” tags are used for within an Atom feed?
Do they point to a physical resource, or do they act only as identifiers?
What is the difference between the link URL at the beginning of the feed and the one in each “entry” block?
Is it compulsory to have this link URL?
Any information regarding this would be much appreciated!
I have provided an example snippet of code below.
<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://publisher.example.com/happycats.xml" />
  <updated>2008-08-11T02:15:01Z</updated>
  <!-- Example of a full entry. -->
  <entry>
    <title>Heathcliff</title>
    <link href="http://publisher.example.com/happycat25.xml" />
    <id>http://publisher.example.com/happycat25.xml</id>
    <updated>2008-08-11T02:15:01Z</updated>
    <content>
      What a happy cat. Full content goes here.
    </content>
  </entry>
</feed>
Atom is a syndication format that can be used by applications employing RESTful communication through hypermedia. It's very good for publishing feeds, which are useful not only for blogs but also in distributed applications (for example, for publishing events to other parts of a system), taking advantage of the benefits of HTTP (caching, scalability, etc.) and the decoupling involved in using REST.
<link> elements in Atom are called link relations and can indicate a number of things to the consumer of the feed:
rel="self" normally indicates that the current element (in your case, the feed itself) represents an actual resource, and this is the URI for that resource
rel="via" can identify the original source of the information in the feed or the entry within the feed
rel="alternate" specifies a link to an alternative representation of the same resource (feed or entry)
rel="enclosure" can mean that the linked to resource is intended to be downloaded and cached, as it may be large
rel="related" indicates the link is related to the current feed or entry in some way
A provider of Atom could also specify their own reasons for a link to appear, and provide a custom rel value.
By providing links to related resources in this way you can decouple systems: the only URI a consumer needs to know up front is a single entry point, and from then on, other actions are discovered via these link relations. The links effectively tell the consumer that they can use them either to take actions on, or to retrieve data for, the entry they relate to.
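As a small sketch of consuming these relations (assuming the feed from the question is saved as happycats.xml, and using only the standard library):

import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
root = ET.parse("happycats.xml").getroot()

# Per RFC 4287, a link with no rel attribute defaults to "alternate".
for link in root.iter(ATOM + "link"):
    print(link.get("rel", "alternate"), link.get("href"))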
A great book I can recommend for REST which goes into depth about Atom is REST in Practice by Jim Webber, Savas Parastatidis and Ian Robinson.
Is RDF still widely used for content syndication? Specifically, Slashdot is the only large-scale website I know of that syndicates content in that format (as opposed to, say, RSS).
Understandably this might seem too vague to answer, so more specifically:
Can anyone list any larger sites similar in scale to Amazon or CNN using it?
Any web-based publishing platforms (WordPress, Joomla, etc.) that generate syndication feeds with this XML vocabulary.
Any other more quantifiable evidence that it is used for syndication online.
I understand that RDF may be a parent specification but in this case I'm talking about sites that syndicate content using <rdf> as a root element and heavily leveraging elements from the RDF namespace:
http://www.w3.org/1999/02/22-rdf-syntax-ns#
The initial versions of RSS (RSS 0.90 and the RSS 1.0 branch) were RDF-based, but newer ones are plain XML languages without RDF syntax elements.
Here is a link on the different RSS versions: http://diveintomark.org/archives/2004/02/04/incompatible-rss
I believe RSS 2.0 and Atom are currently more common for syndication than RDF based RSS formats.
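One way to gather quantifiable evidence yourself is to check the root element of a feed, since the flavor is visible there. A minimal sketch (the feed URL is hypothetical):

import urllib.request
import xml.etree.ElementTree as ET

RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"  # RDF-based RSS (1.0)
ATOM_ROOT = "{http://www.w3.org/2005/Atom}feed"                # Atom

with urllib.request.urlopen("http://example.com/feed") as resp:  # hypothetical URL
    root = ET.parse(resp).getroot()

if root.tag == RDF_ROOT:
    print("RDF-based RSS")
elif root.tag == ATOM_ROOT:
    print("Atom")
elif root.tag == "rss":
    print("plain-XML RSS (0.91/2.0)")
else:
    print("unknown:", root.tag)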