How does RSS reader know that a feed is updated? - rss

I'm just learning about RSS via YouTube, but I could not find an answer to my question: how does a reader know there is an update?
Is it like push on a BlackBerry?

RSS is a file format and doesn't actually know anything about where it gets the entries from. The real question is "how can an HTTP request get only the newest results from a server", and the answer is Conditional GET. HTTP also supports Conditional PUT.
This is an article about using this feature of HTTP specifically to support RSS hackers.

RSS is a pull technology. The reader re-fetches the RSS feed now and then (for example two times per hour, or more often if the reader learns that it's an often updated feed).
The feed is served through regular HTTP and consists of a simple XML file. It is always fetched from the same URL.

It just checks the feed for updates regularly.
Recently a new protocol called PubSubHubbub has appeared that pushes feed updates to the listener, but it requires the publisher to support it.
Here is a list of web services that support real-time RSS pushing, including Google Reader, Blogger, FeedBurner, FriendFeed, MySpace, etc.
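To give a rough idea of what "subscribing" means on the wire: in PubSubHubbub the subscriber sends a form POST to the hub, naming the feed (hub.topic) and the URL the hub should call back (hub.callback). The sketch below only builds that request and does not send it. All three URLs are placeholders, and the function name is made up for illustration.

```python
import urllib.parse
import urllib.request


def build_subscribe_request(hub_url, topic_url, callback_url):
    """Build (but do not send) a PubSubHubbub subscription request."""
    form = urllib.parse.urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed you want updates for
        "hub.callback": callback_url,  # where the hub should push updates
    }).encode("ascii")
    return urllib.request.Request(
        hub_url,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder URLs, for illustration only.
request = build_subscribe_request(
    "https://hub.example.com/",
    "https://blog.example.com/feed.xml",
    "https://reader.example.com/push-callback",
)
# urllib.request.urlopen(request) would actually send it.
```

Before pushing anything, the hub verifies the subscription with a GET to the callback URL carrying a hub.challenge parameter, which the callback has to echo back in its response body.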

Let's summarize:
Usually, a client knows that an RSS feed has been updated through polling, that is, a regular pull (an HTTP GET request on the feed URL).
Push doesn't exist on the web, at least not with HTTP, until HTML5 WebSocket is finalized.
However, some blog platforms, such as WordPress, Google's and others, now support the PubSubHubbub convention. In this mode, you "subscribe" to the updates of an RSS feed. The "hub" will call a URL on YOUR site (the callback URL) to send you updates: that is a push.
Push or pull, in both cases you still need to write some code to update the RSS list on your site, in your database, or wherever you store or display it.
And, as a side note, it is not necessary to request the whole XML on every pull to see whether the content has changed: using a mechanism that is not specific to RSS but part of HTTP itself (the ETag and Last-Modified headers), you can find out whether the RSS page was modified after a given date, and fetch the whole XML only if it was.
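Here is a minimal sketch of that conditional GET using only Python's standard library. It assumes the server actually sends ETag and/or Last-Modified headers (not every feed does), and the function names are my own. The client stores the validators from the last successful fetch and sends them back; a 304 Not Modified answer means the cached copy is still current.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the validator headers for a conditional GET."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if nothing changed."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep using the cached copy
            return None, etag, last_modified
        raise
```

On the first call you pass no validators and get the full feed. On later calls you pass the values returned last time, and only pay for the full XML when the feed has actually changed.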

It's a pull. That's why you have to configure your reader how often it should refresh the feed.

Related

TL;DR: Do RSS feeds transport everything in bulk, or only updates?

Reading about RSS turns up a lot of misinformation, and I am not quite sure how RSS works. So I have some questions, and I hope you don't answer with links only. There is always another link that claims your link is wrong.
Questions:
1) If I subscribe to an RSS feed for the first time, are the entries from the last 30 years downloaded as one bulk response that may contain gigabytes of data?
2) Are subsequent requests to an already subscribed RSS feed updates relative to the previous request? If yes, how does the server know which messages have already been delivered to the client?
3) How often are RSS feeds downloaded?
Kind regards
1) You get whatever is currently in the feed. How many entries and how far back that goes is up to the publisher.
2) No. Each request gets whatever is in the feed at the time.
3) As often as the client wants to download them. (The format includes options to recommend a frequency, but clients may ignore them.)
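To make this concrete, here is a small sketch that reads a made-up RSS 2.0 feed. The feed simply contains whatever items the publisher chose to include, and the optional ttl element is the publisher's hint (in minutes) for how long a client may cache the feed before refreshing. The sample feed, the function name and the 30-minute default are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up feed: the publisher decided to expose only two items.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <description>Made-up sample</description>
    <ttl>60</ttl>
    <item><title>Newest post</title><guid>https://example.com/2</guid></item>
    <item><title>Older post</title><guid>https://example.com/1</guid></item>
  </channel>
</rss>"""


def summarize(feed_xml, default_ttl_minutes=30):
    """Return (suggested refresh interval in minutes, list of item titles)."""
    channel = ET.fromstring(feed_xml).find("channel")
    ttl_text = (channel.findtext("ttl") or "").strip()
    # Fall back to our own default when the feed gives no usable hint.
    ttl = int(ttl_text) if ttl_text.isdigit() else default_ttl_minutes
    titles = [item.findtext("title") for item in channel.findall("item")]
    return ttl, titles


ttl_minutes, titles = summarize(SAMPLE_FEED)
```

Nothing in the request tells the server what the client has already seen, so working out which items are new is entirely the client's job.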

HTTP PUT and POST alternatives for uploading content

Other than HTTP PUT and POST, what other methods can a web application designer use to allow users to upload content (either files or listbox text) from a page of his web app to a remote server?
On the same topic, I was wondering what technology/APIs does a service like Google Docs or Google Drive use? The reason I ask this is: Our Sys Admin has disabled file uploading (via Squid proxy), yet I was able to create and share a document using Google Docs / Google Drive.
Many thanks in advance,
/HS
EDIT Please see the strikeout above.
This depends on the server in question, as the standard set of HTTP commands can be extended, and some commands may not be configured or allowed. One of the common commands is "OPTIONS", which asks "what can I do".
But to answer more helpfully, you generally have two main options:
POST (the one you probably want to use, as it's nearly always available).
GET. You could use GET (but I'm NOT advocating it, just saying you could use it; you should not use a GET to make changes to the server). There are problems with this approach (including the size of files, manually handling the encoding, etc.) but it's possible if you have to go this route.
PUT is often not enabled on servers for security reasons.
More reading: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
Edit: if "file uploading" is prevented by the proxy, have you tried encoding the POST? That is, as opposed to sending a multipart POST, try encoding the files yourself into a POST string and sending that instead. Or encode the file, split it into multiple small POSTs, and piece them together at the other end.
Google Docs uses a mixture of POST and GET. POST for the updates. Google Drive I don't know.

How to archive an RSS feed?

I need to take a few RSS feeds, and archive all the items that get added to them. I've never consumed or created RSS before, but I know xml, so the format seems pretty intuitive.
I know how to parse the feed: How can I get started making a C# RSS Reader?
I know I can't rely on the feed server to provide a complete history: Is it possible to get RSS archive
I know I'll have to have some custom logic around duplicates: how to check uniqueness (non duplication) of a post in an rss feed
My question is, how can I ensure I don't miss any items? My initial plan is to write a parser, where for each item in the feed:
1) Check to see if it's already in the archive database
2) If not, add it to the database
If I schedule this to run once a day, can I be confident I won't be missing any items?
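The two-step plan above could be sketched roughly like this in Python with SQLite. The table layout and function names are invented for illustration, and falling back to the link when an item has no guid is an assumption about what makes a good unique key, not a rule from the RSS spec.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_archive(path=":memory:"):
    """Open (or create) the archive database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "guid TEXT PRIMARY KEY, title TEXT, link TEXT, pub_date TEXT)"
    )
    return db


def archive_feed(db, feed_xml):
    """Store items not yet archived; return how many were new."""
    new_items = 0
    for item in ET.fromstring(feed_xml).iter("item"):
        # Assumption: use <guid> as the key, falling back to <link>.
        key = item.findtext("guid") or item.findtext("link")
        if not key:
            continue  # nothing reliable to deduplicate on
        # 1) Check whether the item is already in the archive.
        seen = db.execute("SELECT 1 FROM items WHERE guid = ?", (key,)).fetchone()
        if seen:
            continue
        # 2) If not, add it.
        db.execute(
            "INSERT INTO items VALUES (?, ?, ?, ?)",
            (key, item.findtext("title"), item.findtext("link"),
             item.findtext("pubDate")),
        )
        new_items += 1
    db.commit()
    return new_items
```

The return value is handy for the scheduling question: if every item in a fetch turns out to be new, the job is probably not running often enough for that feed.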
It depends on the feed. Some sites publish articles very frequently and may have their RSS feed configured to show only the 10 most recent articles. Other sites do the opposite.
Ideally your app should 'learn' the publishing frequency of each site and tune how often it polls that site accordingly. (For example, if you see new unique articles every time you poll, you'll need to poll more often; on the other hand, if you see the same set of articles on multiple attempts, you can back off the next time.)
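One possible way to express that tuning, as a sketch only: the halving and 1.5x factors and the minimum and maximum bounds below are arbitrary choices of mine, not recommended values.

```python
def next_interval(current_minutes, new_item_count, fetched_item_count,
                  min_minutes=15, max_minutes=24 * 60):
    """Suggest the next polling interval from the last poll's result."""
    if fetched_item_count and new_item_count == fetched_item_count:
        # Every item was new: entries may already have scrolled off the
        # feed between polls, so come back much sooner.
        proposed = current_minutes / 2
    elif new_item_count == 0:
        # Nothing new: back off gradually.
        proposed = current_minutes * 1.5
    else:
        # Some new, some already seen: the current pace is keeping up.
        proposed = current_minutes
    return max(min_minutes, min(max_minutes, proposed))
```

The important signal is the first branch: as long as each fetch overlaps with items you have already archived, you know nothing was missed in between.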
If you're open to relying on a service for this... I built my own RSS archival service (https://app.pub.center). You can access an RSS feed's data via our API.
Page 1 of The Atlantic
https://pub.center/feed/02702624d8a4c825dde21af94e9169773454e0c3/articles?limit=10&page=1
Page 2 of The Atlantic
https://pub.center/feed/02702624d8a4c825dde21af94e9169773454e0c3/articles?limit=10&page=2
The REST API is free. We have pricing plans for push notifications (email, SMS, your custom API endpoint)
Use a series of decisions based on the feed and the storage limitations. For example:
Connect to the Web site, and download the XML source of the feed. The Feed Download Engine downloads feeds and enclosures via HTTP or Secure Hypertext Transfer Protocol (HTTPS) protocols only.
Transform the feed source into the Windows RSS Platform native format, which is based on RSS 2.0 with additional namespace extensions. (The native format is essentially a superset of all supported formats.) To do this, the Windows RSS Platform requires Microsoft XML (MSXML) 3.0 SP5 or later.
Merge new feed items with existing feed items in the feed store.
Purge older items from the feed store when the predetermined maximum number of items have been received.
Optionally, schedule downloads of enclosures with Background Intelligent Transfer Service (BITS).
Use HTTP to its fullest to minimize wasted bandwidth:
To limit its impact on servers, the Feed Download Engine implements HTTP conditional GET combined with Delta encoding in HTTP (RFC3229). This implementation allows the server to transfer a minimal description of changes instead of transferring an entirely new instance of a resource cached on the client. The engine also supports compression using the HTTP gzip support of Microsoft Win32 Internet (WinInet).
A successful synchronization means that the feed was successfully downloaded, verified, transformed into the native format, and merged into the store. A server response of HTTP 304 Not Modified in response to a HTTP conditional GET (If-Modified-Since, If-None-Match, ETag, and so on) also constitutes success.
And define criteria for removal:
The following properties directly affect the number of items that remain after a synchronization operation.
PubDate—used to determine the "age" of items. If PubDate is not set, LastDownloadTime is used. If the feed is a list, the order of items is predetermined and PubDate (if present) is ignored.
MaxItemCount—a per-feed setting that limits the number of archived items. The feed's ItemCount will never exceed the maximum, even if there are more items that could be downloaded from the feed.
ItemCountLimit—the upper limit of items for any one feed, normally defined as 2500. The value of MaxItemCount may not exceed this limit. Set MaxItemCount to ItemCountLimit to retain the highest possible number of items.
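If you are writing your own archiver rather than using the Windows RSS Platform, a comparable retention rule might look like the sketch below. This is not the platform's API: the dictionary keys pub_date and last_download_time are invented stand-ins for the properties quoted above, and the special case for list-type feeds is left out.

```python
from datetime import datetime


def purge_oldest(items, max_item_count):
    """Keep only the newest max_item_count items."""
    def age_key(item):
        # Mirror the quoted rule: fall back to the download time
        # when the item has no publication date.
        return item.get("pub_date") or item["last_download_time"]

    newest_first = sorted(items, key=age_key, reverse=True)
    return newest_first[:max_item_count]
```

For a true archive you would set the maximum as high as your storage allows, or skip purging altogether.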
References
Understanding the Feed Download Engine

Creating an aggregate RSS feed from RSS-less search results

So, say I'm a journalist, who wants some way of easily posting links to stories I've written that are published to my newspaper's website. Alas, my newspaper's website doesn't offer user-level RSS feeds (user-level anything for journalists, really).
Running a search (e.g., http://www.calgaryherald.com/search/search.html?q=Rininsland) brings up everything I've done in reverse chronological order (albeit with some duplicates; ignore those for now, I will deal with them later). Is there any way I can parse this into an RSS feed?
It seems like Yahoo! Pipes might be an easy way to do this, but I'm open to whatever.
Thanks!
Normally this would be a great use of Yahoo Pipes, but it appears that the search page you cited has a robots.txt file, which Pipes respects. This means that Pipes will not pull data from the page.
For more info: "How do I keep Pipes from accessing my web pages?"
http://pipes.yahoo.com/pipes/docs?doc=troubleshooting#q14
You would have to write a scraper yourself that makes an HTTP request to that URL, parses the response, and writes RSS as output. This could be done in many server-side environments such as PHP, Python, etc.
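The "writes RSS as output" half of such a scraper is the easy part. Here is a rough Python sketch of it. The HTML parsing is left out because it depends entirely on the markup of the search page, so the article list below is made-up data standing in for whatever your scraper extracts, and the function name is my own.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(channel_title, channel_link, articles):
    """Turn a list of scraped articles into an RSS 2.0 document (a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Generated from search results"
    for article in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "link").text = article["link"]
        # Using the link as the guid gives readers a key for spotting duplicates.
        ET.SubElement(item, "guid").text = article["link"]
        # RSS expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(article["published"])
    return ET.tostring(rss, encoding="unicode")


# Made-up scraped data, for illustration only.
feed_xml = build_rss(
    "My stories",
    "https://news.example.com/search?q=me",
    [{
        "title": "Example story",
        "link": "https://news.example.com/story/1",
        "published": datetime(2011, 5, 1, 12, 0, tzinfo=timezone.utc),
    }],
)
```

Serve the resulting string at a fixed URL (or write it to a file your web server exposes) and any reader can subscribe to it.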
EDIT: Feedity provides a service to scrape web pages into feeds. Here is a Feedity feed of your search url:
http://feedity.com/rss.aspx/calgaryherald-com/UFJWUVZQ
However, unless you sign up for a subscription ($3.25/mo), this feed will be subject to the following constraints:
Free feeds created without an account are limited to 5 items and 10 hours update interval.
Free feeds created without an account are automatically purged from our system after 30 days of inactivity.
Provided it's just links and a timestamp you want for each article, the Yahoo Pipes Search module will return the latest 10 in its search index of the Herald site.

RSS feed basics - just repeatedly overwriting the same file?

Really simple question here:
For a PHP-driven RSS feed, am I just overwriting the same XML file every time I "publish" a new feed item? And will the syndicators it's registered with pop in from time to time to check whether it's new?
Yes. An RSS reader has the URL of the feed and regularly requests the same URL to check for new content.
That's how it works: a simple, single XML RSS file that gets polled for changes by RSS readers.
For scalability there is FeedTree (collaborative RSS and Atom delivery), but unlike another well-known peer-to-peer program (BitTorrent) it hasn't had much support in readers by default.
Essentially, yes. It isn't necessarily a "file" actually stored on disk, but your RSS (or Atom) is just changed to contain the latest items/entries and resides at a particular fixed URL. Clients will fetch it periodically. There are also technologies like PubSubHubbub and pinging for causing updates to get syndicated closer to real-time.
Yes... BUT! There are ways to make the subscribers' lives better and also reduce your bandwidth :) Implement the PubSubHubbub protocol. It lets any application that wants the content of the feed be notified as soon as it's available. It's relatively simple to implement on the publisher side, as it only involves a ping.
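For what it's worth, that ping is commonly just a form POST to your hub with hub.mode=publish and the URL of the feed that changed (hub.url). Check your hub's documentation, since the exact publish call can vary between hubs. The question is about PHP, but the request looks the same from any language. A Python sketch that builds it without sending it, using placeholder URLs and a made-up function name:

```python
import urllib.parse
import urllib.request


def build_publish_ping(hub_url, feed_url):
    """Build (but do not send) a 'my feed has changed' ping for a hub."""
    form = urllib.parse.urlencode({
        "hub.mode": "publish",
        "hub.url": feed_url,  # the feed that has new content
    }).encode("ascii")
    return urllib.request.Request(
        hub_url,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder URLs, for illustration only.
ping = build_publish_ping("https://hub.example.com/",
                          "https://blog.example.com/feed.xml")
# urllib.request.urlopen(ping) would actually send it.
```

You would send this right after overwriting the feed. The feed itself also has to advertise the hub (a link element with rel="hub") so that subscribers know where to subscribe.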

Resources