Access to old, no longer available, feed entries - rss

I am working on a project that requires reliable access to historic feed entries which are not necessarily available in the current feed of the website. I have found several ways to access such data, but none of them give me all the characteristics I need.
Look at this as a brainstorm. I will tell you what I have found so far, and you can contribute if you have any other ideas.
Google AJAX Feed API - will limit you to 250 items
Unofficial Google Reader API - Perfect but unofficial and therefore unreliable (and perhaps quasi-illegal?). Also, the authentication seems to be tricky.
Spinn3r - Costs a lot of money
Spidering the internet archive at the site of the feed - Lots of complexity, spotty coverage, only useful as a last resort
Yahoo! Feed API or Yahoo! Search BOSS - The first looks more like an aggregator, meaning I'd need a different registration for each feed and the second should give more access to Yahoo's data but I can find no mention of feeds.
(thanks to Lou Franco) Bloglines Sync API - Besides the problem of needing an account and being designed more as an aggregator, it does not have a way to add feeds to the account. So no retrieval of arbitrary feeds. You need to manually add them through the reader first.
Other search engines/blog search/whatever?
This is a really irritating problem as we are talking about semantic information that was once out there, is still (usually) valid, yet is difficult to access reliably, freely and without limits. Anybody know any alternative sources for feed entry goodness?

Bloglines has an API to sync accounts
http://www.bloglines.com/services/api/sync
You have to make an account and subscribe to the feed you want to download, but then you can download based on date, which can be way in the past. Not sure of the terms.

The best answer I've found so far is this: Google Reader's unofficial API turns out to have a public access point for its feeds, which means no authentication is needed. Usage is as follows:
http://www.google.com/reader/public/atom/feed/{your feed uri here}?n=1000
replace the text in the squigglies (including the squigglies themselves) with the feed URI you're interested in. More information about the precise arguments can be found here:
http://blog.martindoms.com/2009/10/16/using-the-google-reader-api-part-2/
but remember to use the /public/ url if you don't want to mess with authentication
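For what it's worth, here is a minimal sketch of issuing that request from Python with the requests library; the feed URI is just a placeholder, and since Google Reader has since been retired the endpoint no longer responds, so treat this purely as an illustration of the URL pattern described above:

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder feed URI; substitute the feed you are interested in.
feed_uri = "http://example.com/feeds/posts/default"

# Public (unauthenticated) Google Reader endpoint; n sets how many items to return.
url = "http://www.google.com/reader/public/atom/feed/" + feed_uri
response = requests.get(url, params={"n": 1000}, timeout=30)
response.raise_for_status()

# The response was a plain Atom document, so any XML or feed parser works.
root = ET.fromstring(response.content)
ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in root.findall("atom:entry", ns):
    print(entry.findtext("atom:title", namespaces=ns))
```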

Related

Valuable? Fields support in REST API

I've read some topics about GraphQL, and one of the great features I like is that you can specify the fields you want on the client end.
I'm thinking maybe I can also add this to a REST API. I looked around and found there is already such a specification: fetching-sparse-fieldsets
So I'm trying to add such a feature in Symfony (specifically, in FOSRestBundle + JMSSerializer).
But I'm not quite sure whether it is valuable or not. Can someone give me advice?
That is a question you should ask your API users and your customer. It might be useful for the API user, but it has a lot of downsides:
By building it yourself, you spend a lot of time developing, testing and maintaining this 'optional' feature. Is the customer willing to pay for that? (YAGNI)
As mentioned above, you must maintain the code for this feature; you cannot remove it unless you decide to release a new API version. As external packages change, your code might require an update as well.
It can make troubleshooting difficult. API users might be trying to retrieve a field that isn't specified in the API URL (of course these issues arise after the project is transferred to other developers). Questions can come up about why some data isn't available even though it is present in the API documentation.
The first point in the list is especially important: don't build features that the customer does not need. Personally, I always return all accessible data, even null values. The API users decide what to do with that data. Bandwidth isn't such a problem these days, I guess.
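If you do decide it is worth building, the core mechanism is small. Below is a hedged, framework-agnostic Python sketch of what sparse fieldset filtering boils down to; it is not FOSRestBundle/JMSSerializer code, and the function and field names are invented for illustration:

```python
def apply_sparse_fields(resource, fields_param):
    """Return only the requested fields of a serialized resource dict.

    fields_param is the raw value of a ?fields=... query parameter,
    e.g. "id,title,author"; None or "" means "return everything".
    """
    if not fields_param:
        return resource
    requested = {name.strip() for name in fields_param.split(",") if name.strip()}
    return {key: value for key, value in resource.items() if key in requested}


article = {"id": 42, "title": "Sparse fieldsets", "body": "Full text here", "author": "jane"}
print(apply_sparse_fields(article, "id,title"))  # -> {'id': 42, 'title': 'Sparse fieldsets'}
```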

Which API allows access to Google's Dictionary information?

I know that Google Dictionary was discontinued in 2011, but the dictionary information and definitions are still available through Google search results.
Does anyone know whether this information can be accessed through the Custom Search API or the Translate API?
I found this related question (but sadly without a satisfying answer).
I also needed a Google Dictionary API for my project; since there wasn't one, I decided to create one.
I scraped the web page at the URL https://www.google.com/#q=define+term, where term is any word you want to get the meaning of, and created the API; you can find it here: Google Dictionary API.
How to use
The basic syntax of a URL request to the API is shown below:
https://api.dictionaryapi.dev/api/v2/entries/<--language_code-->/<--word-->
As an example, to get the definition of the English word hello, you can send a request to:
https://api.dictionaryapi.dev/api/v2/entries/en/hello
The API also provides other meanings of the word, example sentences, and synonyms, if any.
If you want me to include any other details, please comment and I will happily extend the API to cover your needs.
In case you wish to see the code, it is on github.
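For anyone who wants to try it, here is a small sketch of calling the endpoint above from Python with requests; the JSON field names (meanings, partOfSpeech, definitions) reflect the response shape as I understand it, so check the project's documentation if they have changed:

```python
import requests

# Look up an English word against the endpoint described above.
word = "hello"
url = f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}"
response = requests.get(url, timeout=30)
response.raise_for_status()

# The response is a JSON list of entries, each with one or more meanings.
for entry in response.json():
    for meaning in entry.get("meanings", []):
        part_of_speech = meaning.get("partOfSpeech")
        for definition in meaning.get("definitions", []):
            print(f"{part_of_speech}: {definition.get('definition')}")
```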
Google Dictionary's content is licensed from Oxford Dictionaries' Lexico. Their API can be accessed from here.
Note their free access platform ("prototype") has a number of limitations:
1000 requests per month
Limited data access
Limited request rate
It doesn't look promising from the API Explorer
https://developers.google.com/apis-explorer/#search/dictionary/

Captchas on RSS Reader?

This question is coming from a non-technical person. I have asked a team to build a sort of RSS reader. In essence, it's a news aggregator. What we had in mind at first was to source news directly from specific sources: ft.com, reuters.com, and bloomberg.com.
Now, the development team has proposed a certain way of doing it (because it'll be easier)... which is to use news.google.com and return whatever the result is. I know this has questionable legality and we are not really comfortable with that fact, but while the legal department is checking it... we have proceeded with a prototype.
Now comes the technical problem... because the method was actually simulating search via news.google.com, after a period of time it returns a CAPTCHA. I suspect it's because the method was SEARCHING WITH RESULTS SHOWN AS RSS as opposed to an outright RSS feed... however, the dev team says RSS is exactly the same thing... and that it will give a CAPTCHA as well.
I have my doubts. If that's the case, how have other news aggregator sites compiled their feeds from different sources?
For your reference, here is a sample of the URL that eventually gives the CAPTCHA:
https://news.google.com/news/feeds?hl=en&gl=sg&as_qdr=a&authuser=0&q=dbs+bank+singapore&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&biw=1280&bih=963&um=1&ie=UTF-8&output=rss
"Searching" is usually behind a captcha because it is very resource intensive, thus they do everything they can to prevent bots from searching. A normal RSS feed is the opposite of resource intensive. To summarize: normal RSS feeds will probably not trigger CAPTCHA's.
Since Google declared their News API deprecated as of May 26, 2011, maybe using NewsCred as suggested in this group post http://productforums.google.com/forum/#!topic/news/RBRH8pihQJI could be an option for your commercial use.

How Do I Fetch All Old Items on an RSS Feed?

I've been experimenting with writing my own RSS reader. I can handle the "parse XML" bit. The thing I'm getting stuck on is "How do I fetch older posts?"
Most RSS feeds only list the 10-25 most recent items in their XML file. How do I get ALL the items in a feed, and not just the most recent ones?
The only solution I could find was using the "unofficial" Google Reader API, which would be something like
http://www.google.com/reader/atom/feed/http://fskrealityguide.blogspot.com/feeds/posts/default?n=1000
I don't want to make my application dependent on Google Reader.
Is there any better way? I noticed that on Blogger, I can do "?start-index=1&max-results=1000", and on WordPress I can do "?paged=5". Is there any general way to fetch an RSS feed so that it gives me everything, and not just the most recent items?
RSS/Atom feeds do not allow historic information to be retrieved. It is up to the publisher of the feed to provide it if they want to, such as in the Blogger or WordPress examples you gave above.
The only reason Google Reader has more information is that it remembered the items from when the feed first came up.
There is some talk of something like this as an extension to the Atom protocol, but I don't know if it is actually implemented anywhere.
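To illustrate the publisher-provided paging the question mentions, here is a rough Python sketch that walks a WordPress feed with ?paged=N using feedparser; the feed URL is a placeholder, and not every installation keeps paging enabled, so treat it as best-effort:

```python
import feedparser

# Placeholder WordPress feed URL.
feed_url = "https://example.wordpress.com/feed/"
all_entries = []
seen_links = set()

# The page cap guards against installs that keep returning the last page
# instead of an empty one.
for page in range(1, 51):
    parsed = feedparser.parse(f"{feed_url}?paged={page}")
    new_links = {e.link for e in parsed.entries if "link" in e} - seen_links
    if not new_links:
        break  # empty page or nothing new: we've reached the end
    seen_links.update(new_links)
    all_entries.extend(e for e in parsed.entries if e.get("link") in new_links)

print(f"Collected {len(all_entries)} entries")
```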
As the other replies here mentioned, a feed may not provide archival data but historical items may be available from another source.
Archive.org’s Wayback Machine has an API to access historical content, including RSS feeds (if their bots have downloaded it). I’ve created the web tool Backfeed that uses this API to regenerate a feed containing concatenated historical items. If you'd like to discuss the implementation in detail please get in touch.
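As a hedged illustration of that approach, the sketch below queries the Wayback Machine's CDX API from Python for snapshots of a feed URL; the endpoint and parameters reflect my understanding of the public API, so verify them against archive.org's documentation before relying on them:

```python
import requests

# Placeholder: the feed whose history we want.
feed_url = "http://example.com/feeds/posts/default"

cdx = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": feed_url, "output": "json",
            "filter": "statuscode:200", "collapse": "digest"},
    timeout=60,
)
cdx.raise_for_status()

rows = cdx.json()
if not rows:
    raise SystemExit("No snapshots found for this URL")

# The first row is the column header; the rest are snapshots.
header, snapshots = rows[0], rows[1:]
timestamp_idx = header.index("timestamp")
original_idx = header.index("original")

# Each snapshot can be replayed through web.archive.org and parsed like a normal feed.
for row in snapshots[:5]:
    print(f"https://web.archive.org/web/{row[timestamp_idx]}/{row[original_idx]}")
```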
In my experience with RSS, the feed is made up of the last X items, where X is variable. Certain feeds may have the full list, but for bandwidth's sake most places are likely limiting it to just the last few items.
The likely reason Google Reader has the old info is that it stores it on its side for users later.
Further to what David Dean said, the RSS/Atom feeds will only contain what the publisher of the feed has up at that moment, and someone would need to be actively collecting this information in order to have any historical record. Basically, Google Reader was doing this for free, and when you interacted with it you could retrieve this stored information from the Google database servers.
Now that they have retired the service, to my knowledge you have two choices. You either have to start collecting this information from your feeds of interest and store the data as XML or some such, or you can pay for this data from one of the companies who sell this type of archived feed information.
I hope this information helps somebody.
Seán
Here is another potential solution that might not have been available when the question was originally asked and shouldn't require any specific service; a rough code sketch of the steps follows below.
Find the URL of the RSS feed you want and use waybackpack to get the archived urls for that feed.
Use FeedReader or a similar library to pull down the archived RSS feed.
Take the URLs from each feed and scrape them as you wish. If you're going way back in time it's possible there might be some dead links.
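Here is a rough Python sketch of those three steps, assuming waybackpack is installed as a command-line tool and feedparser is available; the feed URL and directory name are placeholders:

```python
import glob
import os
import subprocess

import feedparser

# Step 1: download every archived copy of the feed into ./archives.
feed_url = "http://example.com/feed.xml"
subprocess.run(["waybackpack", feed_url, "-d", "archives"], check=True)

# Step 2: parse each archived copy and collect the item links, deduplicated.
links = set()
for path in glob.glob("archives/**/*", recursive=True):
    if not os.path.isfile(path):
        continue
    parsed = feedparser.parse(path)
    for entry in parsed.entries:
        if "link" in entry:
            links.add(entry.link)

# Step 3: these are the historical item URLs to scrape (some may be dead).
for link in sorted(links):
    print(link)
```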
All previous answers more or less relied on existing services to still have a copy of that feed or the feed engine to be able to provide older items dynamically.
There is, though, another, admittedly proactive and rather theoretical way to do so: let your feed reader use a caching proxy which semantically understands RSS and/or Atom feeds and caches them on a per-item basis, up to as many items as you configure.
If the feed reader doesn't poll feeds regularly, the proxy could fetch known feeds on its own schedule so as not to miss an item in highly volatile feeds, like the one from User Friendly, which has only one item and changes every day (or at least used to). If the feed reader e.g. crashed or lost its network connection while you were away for a few days, you might otherwise lose items from its cache. Having the proxy fetch those feeds regularly (e.g. from a data center instead of from home, or on a server instead of a laptop) lets you run the feed reader only now and then without losing items which were posted after the last fetch but rotated out again before the next one.
I call that concept a Semantic Feed Proxy, and I've implemented a proof of concept called sfp. It's not much more than a proof of concept, though, and I haven't developed it further. (So I'd be happy about hints to projects with similar ideas or purposes. :-)
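To make the per-item caching idea concrete, here is a toy Python sketch (not sfp itself) that keeps every item it has ever seen, keyed by guid/id; a real proxy would persist this cache, enforce a retention limit and poll on a schedule:

```python
import feedparser

# Items survive here even after they rotate out of the live feed.
item_cache = {}

def poll_feed(feed_url):
    """Fetch the feed and merge any new items into the cache."""
    parsed = feedparser.parse(feed_url)
    for entry in parsed.entries:
        key = entry.get("id") or entry.get("link")
        if key and key not in item_cache:
            item_cache[key] = {
                "title": entry.get("title"),
                "link": entry.get("link"),
                "published": entry.get("published"),
            }

# Call this regularly (e.g. from cron); the cache only ever grows.
poll_feed("http://example.com/feed.xml")
print(f"{len(item_cache)} items cached")
```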
Why does this problem exist?
Most RSS readers need to import feeds through a live URL, which makes things harder for sites that are not indexed by the Wayback Machine.
The reason Wayback Machine feeds can be imported is that the reader can regularly poll the server for updates according to its defined TTL configuration. The reader compares the current datetime with the posts' pubDate or lastBuildDate keys in the XML response. We can't hack the machine's datetime to work around this, because the current datetime is fetched live.
I've outlined an alternative solution without Wayback below. Unfortunately, I have not been able to find a universal solution for all feed sources.
Alternative Solution(s)
In my experience, NOT ALL feeds are partial, though. The XML doesn't have to specify the datetime of each post, which means the RSS reader doesn't have a datetime to filter the feed with. An example of this feed type can be found here.
This kind of reading experience is useful when chronological order is irrelevant, and the content doesn't need to be sorted. This approach is useful for sites where ALL the content is valuable, and the linked Essays of Paul Graham is a good example.
If the site has a generic, non-chronological feed option, subscribe to that RSS instead (the preferred option).
Download the linked timestamped .rss file, strip datetimes and host the file on your own server. Note, we can implement this via an AWS Lambda.
Set up a server that fetches the RSS feed live.
Strip the pubDate tags from the XML file on fetch (a sketch of this step follows below).
Host the modified RSS on your own server.
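Here is a hedged sketch of the fetch-and-strip step in Python, using requests plus the standard library's ElementTree; the feed URL is a placeholder:

```python
import xml.etree.ElementTree as ET

import requests

# Fetch the live RSS feed (placeholder URL).
response = requests.get("http://example.com/feed.xml", timeout=30)
response.raise_for_status()

# Drop the channel-level date tags and every item's pubDate.
root = ET.fromstring(response.content)
for channel in root.iter("channel"):
    for tag in ("pubDate", "lastBuildDate"):
        for element in channel.findall(tag):
            channel.remove(element)
    for item in channel.findall("item"):
        for element in item.findall("pubDate"):
            item.remove(element)

# Write the stripped feed so it can be re-hosted on your own server.
ET.ElementTree(root).write("stripped-feed.xml", encoding="utf-8", xml_declaration=True)
```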
Note
These are suboptimal solutions due to the loss of ordering; however, I wanted to provide a potential alternative to the Wayback Machine.
In addition, some existing answers require advanced system-design workarounds or more prework, and in some cases are outdated (Google Reader is shut down). I hope this is helpful for those who really need a solution for a complete feed list. Constructing new RSS feeds from the original RSS file is not too hard.

RSS/Atom for professional use

I wonder whether anyone can give an example of a professional use of RSS/Atom feeds in a company product. Does anyone use feeds for things other than news updates?
For example, did you create a product that gives results as RSS/Atom feeds? Like price listings or current inventory, or maybe dates of training lessons?
Or am I thinking in a wrong way of use cases for RSS/Atom feeds anyway?
Edit: #abyx has a really good example of a somewhat unexpected use of RSS as a way to get debug information from program transactions. I like the idea of this process. This is the type of use I was thinking of - besides publishing search results or recent changes (like MediaWiki).
Some of my team's new systems generate RSS feeds that the developers syndicate.
These feeds push out events that interest the developers at certain times and the information is controlled using different loggers. Thus when debugging you can get the debugging feed, when you want to see completed transactions you go to the transactions feeds etc.
This allows all the developers to get the information they want in a comfortable way and without any need to mess a lot with configuration. If you don't want to get it there's no need to remove yourself from a mailing list or edit a configuration file - simply remove the feed and be done with it.
Very cool, and the idea was stolen from Pragmatic Project Automation.
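For the curious, here is a minimal sketch of what publishing such events as RSS can look like in Python; the event data and URL are invented, and a real system would generate items from its loggers rather than a hard-coded list:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

# Invented sample events standing in for logger output.
events = [
    {"title": "Transaction 1041 completed", "detail": "Settled in 120 ms"},
    {"title": "Retry on payment gateway", "detail": "Second attempt succeeded"},
]

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Build/transaction events"
ET.SubElement(channel, "link").text = "http://intranet.example/events"
ET.SubElement(channel, "description").text = "Events developers can subscribe to"

for event in events:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = event["title"]
    ET.SubElement(item, "description").text = event["detail"]
    # RSS expects RFC 2822 dates in pubDate.
    ET.SubElement(item, "pubDate").text = format_datetime(datetime.now(timezone.utc))

ET.ElementTree(rss).write("events.xml", encoding="utf-8", xml_declaration=True)
```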
Most digital libraries use RSS/Atom to expose their search results and data updates, in line with the OAI-PMH protocol.
With our internal TRAC server, I'm subscribed to the timeline view for each project that I work on. It's great for keeping track of checkins and bug tickets. This is pretty exclusive to a developer position though.
I also am subscribed to the recent changes for our installation of MediaWiki that we use for our intranet. That way it's easy to see if documents that I need have been changed, or if there's new policies etc.
Our website has a news page that I wrote an RSS feed for as well. While you mentioned that you weren't really interested in recent news, it is nice to keep up with our press releases.
I have seen RSS used to syndicate gas prices from a service for a specific zip code.
There are many examples. Here are a couple.
SharePoint provides RSS feeds from its lists.
Many faceted navigation products allow you to get an RSS feed based on a selected filter. For example, you can navigate to view 24" LCD Monitors on newegg.com and then get an RSS feed of that view.
Mantis bug tracker includes RSS feeds, although I wish they were more configurable. Also, we use MediaWiki for documentation, which has all sorts of RSS feeds, including a per-page watch and recent changes.
I just added RSS feeds to the ticketing system I use at work (TicketDesk) and that feature should be in the next release of the product.
It's nice because it basically provides me a custom search view of outstanding trouble tickets or work requests that comes to me, rather than me having to go to the application. It also allows users to get feeds of issues they may be interested in, without requiring them to get emails on each update.
I'm looking at implementing an RSS feed for calls for service that our agency takes, to provide the administrators a quick and easy way to see what has been going on.
Atom feed documents and Atom entry documents are used as the representation format for RESTful web services that follow the Atom Publishing Protocol (AtomPub).
I personally have used syndication feeds to expose a sub-set of the Windows Event Log information so that I could subscribe and be notified of critical events on a server.
ImmobilienScout24
They use RSS feeds for updates on your search.
