How to archive an RSS feed? - rss

I need to take a few RSS feeds, and archive all the items that get added to them. I've never consumed or created RSS before, but I know xml, so the format seems pretty intuitive.
I know how to parse the feed: How can I get started making a C# RSS Reader?
I know I can't rely on the feed server to provide a complete history: Is it possible to get RSS archive
I know I'll have to have some custom logic around duplicates: how to check uniqueness (non duplication) of a post in an rss feed
My question is, how can I ensure I don't miss any items? My initial plan is to write a parser, where for each item in the feed:
1) Check to see if it's already in the archive database
2) If not, add it to the database
If I schedule this to run once a day, can I be confident I won't be missing any items?
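Concretely, the plan I have in mind looks something like this (a sketch using SQLite for the archive and the item's guid, falling back to its link, as the dedup key; the inline sample feed just stands in for the real download):

```python
import sqlite3
import xml.etree.ElementTree as ET

# A tiny inline feed so the sketch is self-contained; in practice you
# would download the XML from the feed URL on each run.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Demo</title>
<item><guid>a1</guid><title>First post</title><link>http://example.com/1</link></item>
<item><guid>a2</guid><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

def archive_items(db, feed_xml):
    """Archive every item not seen before; return how many were new."""
    added = 0
    for item in ET.fromstring(feed_xml).iter("item"):
        # guid is the natural dedup key; fall back to link if absent
        guid = item.findtext("guid") or item.findtext("link")
        # 1) check whether it's already in the archive database
        if db.execute("SELECT 1 FROM items WHERE guid = ?", (guid,)).fetchone() is None:
            # 2) if not, add it
            db.execute("INSERT INTO items (guid, title, link) VALUES (?, ?, ?)",
                       (guid, item.findtext("title"), item.findtext("link")))
            added += 1
    db.commit()
    return added

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (guid TEXT PRIMARY KEY, title TEXT, link TEXT)")
first_run = archive_items(db, SAMPLE_FEED)   # 2: both items are new
second_run = archive_items(db, SAMPLE_FEED)  # 0: duplicates skipped
```

The dedup logic itself is safe to run as often as needed; the once-a-day question is really about whether the feed rotates items out faster than that.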

It depends on the feed. Some sites publish articles very frequently and may have their RSS feed configured to show only the 10 most recent articles; other sites do the opposite.
Ideally your app should 'learn' each site's posting frequency and tune itself to poll based on what it has learnt. (For example: if you see new unique articles every time you poll, you need to poll more often; on the other hand, if you see the same set of articles on several attempts, you can back off next time.)
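For example, a crude self-tuning interval could look like this (the bounds and factors are arbitrary illustrations, not recommendations):

```python
def next_interval(current, new_items, lo=900, hi=86400):
    """Return the next polling interval in seconds.

    Halve the interval when the last poll found new items (we might be
    missing some), back off by 50% when it found nothing. The bounds
    (15 minutes to 1 day) and the factors are arbitrary examples.
    """
    if new_items > 0:
        current /= 2      # new unique articles every poll: poll more often
    else:
        current *= 1.5    # same set of articles again: back off
    return max(lo, min(hi, current))

interval = 3600.0
interval = next_interval(interval, new_items=5)  # busy feed: poll sooner
interval = next_interval(interval, new_items=0)  # quiet again: back off
```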

If you're open to relying on a service for this... I built my own RSS archival service (https://app.pub.center). You can access an RSS feed's data via our API.
Page 1 of The Atlantic
https://pub.center/feed/02702624d8a4c825dde21af94e9169773454e0c3/articles?limit=10&page=1
Page 2 of The Atlantic
https://pub.center/feed/02702624d8a4c825dde21af94e9169773454e0c3/articles?limit=10&page=2
The REST API is free. We have pricing plans for push notifications (email, SMS, or your custom API endpoint).

Use a series of decisions based on the feed and the storage limitations. For example:
Connect to the Web site, and download the XML source of the feed. The Feed Download Engine downloads feeds and enclosures via HTTP or Secure Hypertext Transfer Protocol (HTTPS) protocols only.
Transform the feed source into the Windows RSS Platform native format, which is based on RSS 2.0 with additional namespace extensions. (The native format is essentially a superset of all supported formats.) To do this, the Windows RSS Platform requires Microsoft XML (MSXML) 3.0 SP5 or later.
Merge new feed items with existing feed items in the feed store.
Purge older items from the feed store when the predetermined maximum number of items have been received.
Optionally, schedule downloads of enclosures with Background Intelligent Transfer Service (BITS).
Use HTTP to its fullest to minimize wasted bandwidth:
To limit its impact on servers, the Feed Download Engine implements HTTP conditional GET combined with delta encoding in HTTP (RFC 3229). This implementation allows the server to transfer a minimal description of changes instead of transferring an entirely new instance of a resource cached on the client. The engine also supports compression using the HTTP gzip support of Microsoft Win32 Internet (WinInet).
A successful synchronization means that the feed was successfully downloaded, verified, transformed into the native format, and merged into the store. A server response of HTTP 304 Not Modified to an HTTP conditional GET (If-Modified-Since, If-None-Match, ETag, and so on) also constitutes success.
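On the client side, that conditional GET is just a matter of replaying the validators the server sent last time. A sketch using Python's urllib (the feed URL and header values here are made up):

```python
import urllib.request

def conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that lets the server answer 304 Not Modified.

    etag and last_modified should be the ETag and Last-Modified values
    saved from the previous successful response.
    """
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

req = conditional_request(
    "http://example.com/feed.xml",              # hypothetical feed URL
    etag='"abc123"',                            # value saved from last fetch
    last_modified="Sat, 29 Oct 1994 19:43:31 GMT",
)
```

When you actually send this with `urllib.request.urlopen`, a 304 surfaces as an `HTTPError` with code 304, which simply means your cached copy is still current and nothing needs to be downloaded.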
And define criteria for removal:
The following properties directly affect the number of items that remain after a synchronization operation.
PubDate—used to determine the "age" of items. If PubDate is not set, LastDownloadTime is used. If the feed is a list, the order of items is predetermined and PubDate (if present) is ignored.
MaxItemCount—a per-feed setting that limits the number of archived items. The feed's ItemCount will never exceed the maximum, even if there are more items that could be downloaded from the feed.
ItemCountLimit—the upper limit of items for any one feed, normally defined as 2500. The value of MaxItemCount may not exceed this limit. Set MaxItemCount to ItemCountLimit to retain the highest possible number of items.
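A rough Python rendering of those purge rules (the field names here are illustrative, not the Windows RSS Platform's actual property names):

```python
from datetime import datetime

def purge(items, max_item_count, item_count_limit=2500):
    """Keep only the newest items, per the MaxItemCount/ItemCountLimit
    rules quoted above. Each item is a dict with an optional 'pub_date'
    and a 'last_download_time' fallback."""
    # MaxItemCount may not exceed ItemCountLimit
    limit = min(max_item_count, item_count_limit)

    def age_key(item):
        # PubDate determines "age"; fall back to LastDownloadTime if unset
        return item.get("pub_date") or item["last_download_time"]

    return sorted(items, key=age_key, reverse=True)[:limit]

items = [
    {"title": "old",  "pub_date": datetime(2023, 1, 1), "last_download_time": datetime(2023, 1, 2)},
    {"title": "new",  "pub_date": datetime(2024, 6, 1), "last_download_time": datetime(2024, 6, 1)},
    {"title": "nopd", "pub_date": None,                 "last_download_time": datetime(2024, 1, 1)},
]
kept = purge(items, max_item_count=2)  # the two most recent survive
```

(Note the caveat from the question: for an archive you'd want no purge at all, or a far higher limit than a consumer-oriented store uses.)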
References
Understanding the Feed Download Engine

Related

How can I RSS subscribe to Ethereum address?

Since everything in Ethereum is on-chain, each event is in theory publicly visible and can be read from the block data. Is it possible to subscribe to events of non-contract addresses and create a feed page, like an RSS subscription?
It is possible, but honestly, I have never seen an integration with RSS.
However, writing your own script is quite easy. The procedure is well documented, for example in the Go Ethereum Book: https://goethereumbook.org/event-subscribe/
I would use this as a reference for other languages.
There is one big problem for you: to subscribe to Ethereum events, you need access to an Ethereum node.
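Whatever language the script ends up in, the subscription itself is a JSON-RPC message sent to the node over a WebSocket connection. A sketch of just that message (`eth_subscribe` and `newHeads` are standard geth RPC names; everything else here is illustrative):

```python
import json

def subscribe_payload(request_id=1, topic="newHeads"):
    """Build the JSON-RPC message that asks a node for a push subscription.

    "newHeads" streams every new block header; "logs" (with a filter
    object as an extra param) streams contract events. The message is
    sent over a WebSocket connection to your node, e.g. a local geth
    with its WebSocket API enabled.
    """
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_subscribe",
        "params": [topic],
    })

payload = subscribe_payload()
```

From each new block header you would then fetch the block's transactions and keep the ones touching the addresses you care about; that filtered list is what you'd render as feed items.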
There are three options, from best to worst (in my opinion):
You can use an API from providers that offer access to the Ethereum network, such as Infura, Alchemy, or QuickNode. The huge disadvantage is that requests are rate-limited, and you will exhaust the quota very fast, probably within minutes or hours.
You can run your own node connected to Ethereum, but you need a reasonably fast computer, a stable internet connection, and a ~1 TB SSD to keep it in sync.
You can find a publicly available node. Usually those nodes are not very stable and you will get banned soon. To discover Ethereum nodes you can use Shodan. I have tried hundreds of times to use public nodes for my apps, and they are not stable; every stable node is protected and does not accept arbitrary requests...
If you only need to read data from specific addresses you can use the Etherscan API - I love it, as it is much easier than using the raw ETH API :)
There is an open-source protocol named RSS3 dedicated to RSS services on the blockchain.
Its third-party API accesses the Ethereum network and creates a feed for any ENS address. The protocol not only displays the address's transactions, but also identifies and filters different types of on-chain transactions.
(See the RSS3 Docs and its GitHub for more.)
The feed can be exported as a standard RSS XML file, or you can add the RSS URL for that address directly to any RSS reader.
Take Ethereum founder Vitalik Buterin's ENS address (vitalik.eth) as an example.
Access RSS3.io and type in the ENS name
enter vitalik.eth
Generate RSS file
Click the RSS icon on the right and get the RSS file:
https://rss3.io/rss/0xd8da6bf26964af9d7eed9e03e53415d37aa96045/
Generate RSS URL
Go to https://rss3.io/manage and generate an RSS feed for an address/ENS.
Type in vitalik.eth and get different types of RSS feed subscriptions:
All feeds: https://rss3.io/rss/vitalik.eth
These URLs should work in any RSS reader.

Creating an aggregate RSS feed from RSS-less search results

So, say I'm a journalist, who wants some way of easily posting links to stories I've written that are published to my newspaper's website. Alas, my newspaper's website doesn't offer user-level RSS feeds (user-level anything for journalists, really).
Running a search (e.g., http://www.calgaryherald.com/search/search.html?q=Rininsland) brings up everything I've done in reverse chronological order (albeit with some duplicates; ignore those for now, I'll deal with them later). Is there any way I can parse this into an RSS feed?
It seems like Yahoo! Pipes might be an easy way to do this, but I'm open to whatever.
Thanks!
Normally this would be a great use of Yahoo Pipes, but it appears that the search page you cited has a robots.txt file, which Pipes respects. This means that Pipes will not pull data from the page.
For more info: "How do I keep Pipes from accessing my web pages?"
http://pipes.yahoo.com/pipes/docs?doc=troubleshooting#q14
You would have to write a scraper yourself that makes an HTTP request to that URL, parses the response, and writes RSS as output. This could be done in many server-side environments such as PHP, Python, etc.
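A minimal sketch of such a scraper using only the Python standard library (the inline HTML and the `class="result"` selector are invented; a real scraper would fetch the search URL and match the Herald page's actual markup):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical search-results page standing in for the real download.
SAMPLE_HTML = """<html><body>
<a class="result" href="http://example.com/story-1">First story</a>
<a class="result" href="http://example.com/story-2">Second story</a>
</body></html>"""

class ResultLinks(HTMLParser):
    """Collect (text, href) pairs from anchors marked as results."""
    def __init__(self):
        super().__init__()
        self.links, self._href = [], None
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("class") == "result":
            self._href = a.get("href")
    def handle_data(self, data):
        if self._href:  # the text inside a matched anchor
            self.links.append((data.strip(), self._href))
            self._href = None

def to_rss(title, link, items):
    """Write the scraped links out as an RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    return ET.tostring(rss, encoding="unicode")

parser = ResultLinks()
parser.feed(SAMPLE_HTML)
feed_xml = to_rss("Search results", "http://example.com/search", parser.links)
```

Serve `feed_xml` from a fixed URL with content type `application/rss+xml` and any reader can subscribe to it; deduplication can then be layered on top by keying on the link.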
EDIT: Feedity provides a service to scrape web pages into feeds. Here is a Feedity feed of your search url:
http://feedity.com/rss.aspx/calgaryherald-com/UFJWUVZQ
However, unless you sign up for a subscription ($3.25/mo), this feed will be subject to the following constraints:
Free feeds created without an account are limited to 5 items and a 10-hour update interval.
Free feeds created without an account are automatically purged from our system after 30 days of inactivity.
Provided it's just links and a timestamp you want for each article, the Yahoo Pipes Search module will return the latest 10 from its search index of the Herald site.

RSS feed basics - just repeatedly overwriting the same file?

Really simple question here:
For a PHP-driven RSS feed, am I just overwriting the same XML file every time I "publish" a new item? And the aggregators it's registered with will pop in from time to time to check whether there's anything new?
Yes. An RSS reader has the URL of the feed and regularly requests the same URL to check for new content.
That's how it works: a single XML RSS file that gets polled for changes by RSS readers.
For scalability there is FeedTree (collaborative RSS and Atom delivery), but unlike another well-known network program (BitTorrent) it hasn't had much support in readers by default.
Essentially, yes. It isn't necessarily a "file" actually stored on disk, but your RSS (or Atom) is just changed to contain the latest items/entries and resides at a particular fixed URL. Clients will fetch it periodically. There are also technologies like PubSubHubbub and pinging for causing updates to get syndicated closer to real-time.
Yes... BUT! There are ways to make your subscribers' lives better and also save bandwidth :) Implement the PubSubHubbub protocol. It will help any application that wants the content of the feed to be notified as soon as it's available. It's relatively simple to implement on the publisher side, as it only involves a ping.
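The publisher's ping really is just a small form-encoded POST telling the hub your feed changed. A sketch (the hub shown is the old public Google hub; the feed URL is a placeholder):

```python
import urllib.request
from urllib.parse import urlencode

def hub_ping(hub_url, feed_url):
    """Build the PubSubHubbub publish ping: a POST with hub.mode=publish
    and hub.url. Subscribers registered with the hub are then notified
    instead of having to poll your feed."""
    body = urlencode({"hub.mode": "publish", "hub.url": feed_url}).encode()
    return urllib.request.Request(hub_url, data=body, method="POST")

req = hub_ping("https://pubsubhubbub.appspot.com/",  # a public hub
               "http://example.com/feed.xml")        # hypothetical feed
```

You would send this (e.g. with `urllib.request.urlopen`) right after overwriting the XML file; polling readers still work exactly as before.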

creating updatable rss feed

Hi, I have a page in my Java/JSP-based web application which shows a list of new products.
I want to provide an RSS feed for this. What is the way to create an RSS feed which others can use to subscribe?
I could find some Java-based feed creators, but the question is: how will the feed update itself as new products are added to the system?
I'm not familiar with Java, so here's a general thought.
Your feed should be accessible via some URL, like http://mydomain.com/products/feeds/rss. When a feed aggregator fetches this URL, the servlet (I believe that is what they are called in the Java world) fetches a list of recent products from the DB or wherever, builds the RSS feed, and sends it back to the requester, which happens to be the feed aggregator.
For performance reasons this particular servlet need not hit the database on every request. Rather, it can cache either the resulting feed (recommended; HTTP allows for very flexible caching) or the result of the database query somewhere in memory or on disk.
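In outline (Python standing in for the servlet here; all names are illustrative), the handler regenerates the feed on demand and caches it, so the feed is "self-updating" simply because each request rebuilds it from current data:

```python
import time
import xml.etree.ElementTree as ET

# Stand-in for the real product query against your database.
def fetch_recent_products():
    return [{"name": "Widget", "url": "http://example.com/widget"}]

_cache = {"xml": None, "at": 0.0}
CACHE_TTL = 300  # rebuild at most every 5 minutes

def feed_handler():
    """What the servlet mapped to /products/feeds/rss would do: rebuild
    the feed from the database only when the cached copy is stale."""
    if _cache["xml"] is None or time.time() - _cache["at"] > CACHE_TTL:
        rss = ET.Element("rss", version="2.0")
        channel = ET.SubElement(rss, "channel")
        ET.SubElement(channel, "title").text = "New products"
        for p in fetch_recent_products():
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = p["name"]
            ET.SubElement(item, "link").text = p["url"]
        _cache["xml"] = ET.tostring(rss, encoding="unicode")
        _cache["at"] = time.time()
    return _cache["xml"]

first = feed_handler()
second = feed_handler()  # served from the cache within the TTL
```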

How does RSS reader know that a feed is updated?

I'm just learning about this via YouTube, but I could not find an answer to my question of how the reader knows there is an update.
Is it like a Push in blackberry?
RSS is just a file format and doesn't actually know anything about where the entries come from. The real question is "how can an HTTP request get only the newest results from a server?", and the answer is the conditional GET. HTTP also supports conditional PUT.
This is an article about using this feature of HTTP specifically to support RSS hackers.
RSS is a pull technology. The reader re-fetches the RSS feed now and then (for example twice per hour, or more often if the reader learns that the feed is updated frequently).
The feed is served through regular HTTP and consists of a simple XML file. It is always fetched from the same URL.
It just checks the feed for updates regularly.
Recently there is a new protocol called PubSubHubbub that lets feeds be pushed to listeners, but it requires the publisher to support it.
Here is a list of web services that support real-time RSS pushing, including Google Reader, Blogger, FeedBurner, FriendFeed, MySpace, etc.
Let's summarize:
Usually, a client knows that an RSS feed has been updated through polling, that is, regular pulls (HTTP GET requests on the feed URL).
Push doesn't exist on the web, at least not with HTTP, until HTML5 WebSockets are finalized.
However, some blog platforms like WordPress, Google's, and others now support the PubSubHubbub convention. In this mode, you "subscribe" to the updates of an RSS feed: the "hub" will call a URL on YOUR site (a callback URL) to send you updates. That is a push.
Push or pull, in both cases you still need to write some code to update the RSS list on your site, database, or wherever you store/display it.
And, as a side note, it is not necessary to request the whole XML on every pull to see whether the content has changed: using a mechanism that is not specific to RSS but global to the whole HTTP protocol (the ETag and Last-Modified headers), you can find out whether the feed was modified after a given date, and grab the whole XML only if it was.
It's a pull. That's why you have to configure your reader how often it should refresh the feed.
