I want to merge multiple RSS feeds into a single feed, removing any duplicates. Specifically, I want to merge the feeds for the tags I'm interested in.
[A quick search turned up some promising links, which I don't have time to visit at the moment]
Broadly speaking, the ideal would be a reader that lists all the available tags on the site and lets me toggle them on and off, allowing me to explore what's available, keep track of questions I've visited, new answers on interesting feeds, etc, etc . . . though I don't suppose such a thing exists right now.
As I randomly explore the site and see questions I think are interesting, I inevitably find "oh yes, that one looked interesting a couple of days ago when I read it the first time, and it hasn't been updated since". It would be much nicer if my machine would keep track of such details for me :)
Update: You can now use "and", "or", and "not" to combine multiple tags into a single feed: Tags AND Tags OR Tags
Update: You can now use Filters to watch tags across one or multiple sites: Improved Tag Sets
Have you heard of Yahoo's Pipes?
It's an interactive feed aggregator and manipulator, with a list of 'hot pipes' to subscribe to and the ability to create your own (Yahoo account required).
I played with it during beta back in the day, and I had a blast. It's really fun and easy to aggregate different feeds, and you can add logic or filters to the "pipes". You can even do more than just RSS, like importing images from Flickr.
I created the Stack Overflow tag feeds pipe. You can list your tags of choice in the text box and it will combine them into a single feed with all the unique posts. It escapes '#' and '+' characters for you.
Alternatively, you can use the pipe's RSS feed by appending your URL-encoded tags separated by '+'s:
http://pipes.yahoo.com/pipes/pipe.run?_id=uP22vN923RG_c71O1ZzWFw&_render=rss&tags=.net+c%23+powershell
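If you would rather build that URL in code than by hand, here is a minimal sketch in Python. The pipe id and base URL are copied from the link above; the function name `build_pipe_feed_url` is made up for illustration, and how the pipe itself treats a literal '+' inside a tag (encoded here as %2B) is an assumption.

```python
from urllib.parse import quote

PIPE_URL = "http://pipes.yahoo.com/pipes/pipe.run"
PIPE_ID = "uP22vN923RG_c71O1ZzWFw"  # id taken from the URL above


def build_pipe_feed_url(tags):
    """Return the pipe's RSS URL for a list of tag names (hypothetical helper)."""
    # Percent-encode each tag on its own so '#' becomes %23 and a literal
    # '+' becomes %2B, then join the encoded tags with '+' as the separator.
    encoded = "+".join(quote(tag, safe="") for tag in tags)
    return f"{PIPE_URL}?_id={PIPE_ID}&_render=rss&tags={encoded}"


print(build_pipe_feed_url([".net", "c#", "powershell"]))
```

Encoding each tag separately matters: encoding the whole joined string would also escape the '+' separators.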
Unfortunately, though, this seems to strip out the content of the posts. The content is visible in the debug view, but the output only contains the post title.
[Thanks to everyone for suggesting Yahoo Pipes! Had heard of it before, but never tried it until now :-]
SimplePie is a PHP library that supports merging RSS feeds into one combined feed. I don't believe it does dupe checking out-of-the-box, but I found it trivial to write a little function to eliminate duplicate content via their GUIDs.
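For anyone not on PHP, the same idea (merge, then drop items whose GUID has already been seen) can be sketched with Python's standard library. This is not SimplePie's API, just an illustration; it assumes plain RSS 2.0 input, and the sample feeds and example.com URLs are invented.

```python
import xml.etree.ElementTree as ET

FEED_A = """<rss version="2.0"><channel><title>tag: python</title>
<item><title>Q1</title><link>http://example.com/q/1</link><guid>http://example.com/q/1</guid></item>
<item><title>Q2</title><link>http://example.com/q/2</link><guid>http://example.com/q/2</guid></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>tag: rss</title>
<item><title>Q2</title><link>http://example.com/q/2</link><guid>http://example.com/q/2</guid></item>
<item><title>Q3</title><link>http://example.com/q/3</link><guid>http://example.com/q/3</guid></item>
</channel></rss>"""


def merge_feeds(feed_xml_strings):
    """Merge RSS 2.0 documents into one item list, skipping repeated GUIDs."""
    seen = set()
    merged = []
    for xml_text in feed_xml_strings:
        for item in ET.fromstring(xml_text).iter("item"):
            # Fall back to the link when an item carries no <guid>.
            key = item.findtext("guid") or item.findtext("link")
            if key is not None:
                if key in seen:
                    continue  # already merged from an earlier feed
                seen.add(key)
            merged.append({
                "guid": key,
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            })
    return merged


for entry in merge_feeds([FEED_A, FEED_B]):
    print(entry["title"], entry["link"])
```

Items that have neither a GUID nor a link are kept as-is, since there is nothing reliable to compare them on.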
Here is an article on Merge Multiple RSS Feeds Into One with Yahoo! Pipes + FeedBurner.
Another option is Feed Rinse, but they have a paid version as well as the free version.
Additionally:
I have heard good things about AideRss
Yahoo Pipes?
23 minutes later:
Aww, I got answer-sniped by @Bernie Perez. Oh well :)
In the latest Podcast, Jeff and Joel talked about the RSS feeds for tags, and Joel noted that there is only the current ability to do AND on tags, not OR.
Jeff suggested that this would be included at some stage in the future.
I think that you should request this on uservoice, or vote for it if it is already there.
Related
How's it going?
I've found a lot of more detailed answers relating to specific problems relating to RSS feeds, but I can't really figure out how you USE one, basically.
Could someone explain?
I see the RSS feed icon at the top of a lot of Wordpress sites, including my own, but when I click it, it just seems to be a long XML file. I don't know what to do with it, or even why it would be there.
How do you use this? Are you meant to hit it with an API request, or is there a particular kind of software that you use?
Cheers
Before telling you what RSS is, let me describe a common problem that many people have.
Say there is a bunch of sites that you really like and it's sort of a
daily routine for you to go through them. They may be a news site, your
friend's blog, but also craigslist because you're currently looking for
a new house and maybe a weather site to know how late you should stay
at work :)
The first thing you do when you get to work is open your web browser
and these sites in new tabs. It's not particularly cumbersome because
there are just 4 sites. But think about it: maybe there is a new blog
that you start to like and oh, these cartoons are really funny. Maybe
there is also a bit of financial info that you're interested in and
the pictures that your brother is posting to Flickr every couple of days:
they just had a new baby! Also, as you're trying to buy a house, you'd
love a little raise and you've figured that your boss really likes it
when you tell her that you've read about your company in the news or
when you tell her about a new competing product... There is also
StackOverflow. You're desperately trying to get this "expert" badge
and boost up your reputation: this may help with your boss too or even
when you're looking for a new job.
Opening all these tabs is starting to take a toll and you keep
forgetting an important one. You're also slowly getting tired of the
different reading experience that all these sites have: small fonts,
large fonts, ads all over...etc. Now you have a problem.
Imagine there is a tool that does the following: you can tell it what sites you care about, and then, this tool will look up the new stuff for you. It will show everything in a nice looking format. It should also help you identify what's really worth seeing ASAP or maybe have some kind of "serendipity" mode that you can go into and find interesting stuff that you would have missed otherwise. The tool will obviously send you to the original sites should you need more info about any particular story or classified...
This tool exists. It's usually called a Reader, mostly because it lets you read more things online. Oftentimes you'll see them called "RSS readers", because RSS is what they use to get the information from all these sites. RSS is the pipe. You as a user should probably not need to know about it, but that's what the readers depend on. In an ideal world, when you're on a site you like, you should just hit "follow" on a button like this one and then you'd be redirected to your reader of choice. Later, when new content is added, you'll get it straight in your reader.
To get a bit into more technical details, RSS (like Atom) is an XML flavor. It's a collection (mostly reverse chronological) of entries. Entries have at least a title and a link to the actual story. They should also include a unique identifier and could have other elements like a description, an image, tags, author information... etc.
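To make that concrete, here is a minimal sketch of what a reader does with the XML once it has fetched it, using only Python's standard library. The sample feed, its example.com URLs and the function name `read_entries` are invented for illustration; real feeds vary a lot and a real reader would also have to handle Atom.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <item>
      <title>Older post</title>
      <link>http://example.com/posts/1</link>
      <guid>http://example.com/posts/1</guid>
      <pubDate>Mon, 06 Sep 2010 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Newer post</title>
      <link>http://example.com/posts/2</link>
      <guid>http://example.com/posts/2</guid>
      <pubDate>Tue, 07 Sep 2010 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def read_entries(rss_text):
    """Return the feed's entries as dicts, newest first."""
    dated, undated = [], []
    for item in ET.fromstring(rss_text).iter("item"):
        published = item.findtext("pubDate")
        entry = {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "guid": item.findtext("guid"),
            # RSS 2.0 dates use the RFC 822 style that email.utils understands.
            "published": parsedate_to_datetime(published) if published else None,
        }
        (dated if published else undated).append(entry)
    dated.sort(key=lambda e: e["published"], reverse=True)
    return dated + undated  # entries without a date go last


for entry in read_entries(SAMPLE_RSS):
    print(entry["published"], entry["title"], entry["link"])
```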
RSS is great because it's content agnostic. It can be used to represent a lot of different things (as described in the little story) and it decouples the publishing platform from the subscribing platform: they don't even know the other one exists. RSS is their lingua franca.
I wrote a blog post about this very question not long ago. Here's the link if you're interested in reading my personal interpretation. https://www.rss.com/whatisrss
An XML file is all the content of a page, with no presentational markup. The XML represents the data in its rawest, most descriptive form. Many readers can interpret XML sources from a variety of places, and each formats the data in its own way.
Has anyone come across a clean way to post tweets in WordPress with the use of a plugin? I would like to have the flexibility to pull at least the top 5 and only display tweets with the assigned hashtag.
Many thanks
I'm not sure that this would give you the functionality to pull the most recent tweets from a hashtag, but it's totally doable to pull the most recent X updates from any user. Here are a couple of examples I found for you, written in PHP:
Example One
Example Two (Scroll down to #6)
Removing the "from:" in the second example's search string should allow you to search for any word or hashtag instead of by username, but I'm not 100% sure.
Fair warning though, that makes a prominent area of your site easily spammable. Anyone can post to a hashtag, right?
It could be a project well beyond my skills right now but I've got around one full month to spend on it so I think I can do it. What I want to build is this: Gather news about a specific subject from various sources. Easy, right? Just get the rss feeds and display them on a page. Well, I want something more advanced: Duplicates removed and customized presentation (that is, be able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide rss feeds. How do I create one?
What's the best method to find and remove duplicates? I thought about comparing the headlines and checking if there is a match greater than, say, 50%. Is that a good practice though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
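For reference, the headline-comparison idea could be sketched like this in Python with the standard library's difflib. The 50% threshold is the one floated in the question, not a recommendation; whether it is a good cut-off depends entirely on the data, and the sample headlines are invented.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score between two headlines, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def drop_near_duplicates(headlines, threshold=0.5):
    """Keep a headline only if it is not too similar to one already kept."""
    kept = []
    for headline in headlines:
        if all(similarity(headline, other) <= threshold for other in kept):
            kept.append(headline)
    return kept


headlines = [
    "Fish oil cuts heart risk",
    "Fish oil cuts heart risk, study says",
    "Stock markets rally",
]
print(drop_near_duplicates(headlines))
```

This compares every new headline with every kept one, so it gets slow for large batches, and a low threshold will happily merge unrelated stories that share common words.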
Duplication is a nasty issue. What I eventually ended up doing:
1. Strip out all HTML tags except for links (Although I started using regex, I was burned. I eventually moved to custom parsing to remove tags)
2. Strip out all whitespace
3. Case-desensitize
4. Hash all that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
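A rough sketch of those four steps in Python, using the standard library's HTMLParser rather than regex. This is not the code I actually ran, only an illustration; the class and function names are made up, and note that lowercasing everything also lowercases the link target, which could conflate URLs that differ only by case.

```python
import hashlib
from html.parser import HTMLParser


class _KeepLinks(HTMLParser):
    """Collect text and <a href> targets, discarding every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.parts.append(f"<a href={href}>")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        self.parts.append(data)


def fingerprint(html_text):
    """Return an MD5 hex digest of the normalized content."""
    parser = _KeepLinks()
    parser.feed(html_text)
    parser.close()
    stripped = "".join(parser.parts)       # 1. tags removed, links kept
    squeezed = "".join(stripped.split())   # 2. all whitespace removed
    lowered = squeezed.lower()             # 3. case folded
    return hashlib.md5(lowered.encode("utf-8")).hexdigest()  # 4. hashed


a = '<p>Yes, <a href="http://a.example/1">this sucks</a></p>'
b = 'Yes,   <A HREF="http://a.example/1">THIS sucks</A>'
c = '<p>Yes, <a href="http://b.example/2">this sucks</a></p>'
print(fingerprint(a) == fingerprint(b))  # same text, same link
print(fingerprint(a) == fingerprint(c))  # same text, different link
```

MD5 is used here only as a cheap fingerprint for comparing items, not for anything security related.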
Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded (I think): &amp;lt;
But it is not. It is encoded as &lt;
But so too are HTML tags: &lt;p&gt;
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
Creating an RSS feed from HTML is quite nasty and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are beyond me. I know of one provider that got away with regexing pages (they had to know that a certain page was MySpace-based or Blogger-based), but they did not perform admirably.
You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized you'll have to manipulate the pipe results yourself, e.g. get it as JSON and manipulate it with JavaScript, or process it server-side.
I'd like an RSS feed from this Google Scholar search: Scholar Fish Oil Search
I've looked a little bit at Yahoo Pipes, and I thought I had found a solution when I found this pipe: Old Pipe. But it doesn't work (it's a couple of years old now). If someone can either tell me what's wrong with that pipe, or tell me how to retrieve a feed from that search through other means, I'd be very appreciative.
Thanks for your time,
-Landon
You could try a 3rd party website that creates feeds from other websites. See 7 Tools To Make An RSS Feed Of Any Website. (Disclaimer: I have no idea if they work or are any good, but they may be worth investigating).
[Edit: Google disallows indexing of this content via their robots.txt file, apparently. Check out http://scholar.google.com/robots.txt. Yahoo Pipes respects the robots.txt file—perhaps one of the other tools doesn't suffer from this snag?]
It appears that the markup may have been altered slightly since the publication of this Pipe.
When I use the URL builder module in Pipes and populate the sample query with "fish oil", I get the following search string:
http://scholar.google.com/scholar?scoring=r&q=%22fish+oil%22&lr=&hl=en&as_ylo=2007
(Which, when entered into a browser window, does generate results.)
I am currently parsing through their regular expressions to make sure the proper elements are captured.
Did you have any luck with the tools Dan mentioned? Would also be quite interested if any were simple, effective, and (ideally) non-proprietary or self-hostable.
I've been experimenting with writing my own RSS reader. I can handle the "parse XML" bit. The thing I'm getting stuck on is "How do I fetch older posts?"
Most RSS feeds only list the 10-25 most recent items in their XML file. How do I get ALL the items in a feed, and not just the most recent ones?
The only solution I could find was using the "unofficial" Google Reader API, which would be something like
http://www.google.com/reader/atom/feed/http://fskrealityguide.blogspot.com/feeds/posts/default?n=1000
I don't want to make my application dependent on Google Reader.
Is there any better way? I noticed that on Blogger, I can do "?start-index=1&max-results=1000", and on WordPress I can do "?paged=5". Is there any general way to fetch an RSS feed so that it gives me everything, and not just the most recent items?
RSS/Atom feeds do not allow for historic information to be retrieved. It is up to the publisher of the feed to provide it if they want to, as in the Blogger or WordPress examples you gave above.
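Where a publisher does expose paging like that, walking the pages is only a matter of generating the URLs. A minimal sketch in Python, using just the query parameters quoted in the question (start-index/max-results for Blogger, paged for WordPress); the function names, the example feed URLs and the assumption that you know how many items or pages to ask for are mine. In practice you would keep fetching until a page comes back empty.

```python
from urllib.parse import urlencode


def blogger_page_urls(feed_url, total_items, page_size=100):
    """Yield Blogger-style page URLs; start-index is 1-based."""
    for start in range(1, total_items + 1, page_size):
        query = urlencode({"start-index": start, "max-results": page_size})
        yield f"{feed_url}?{query}"


def wordpress_page_urls(feed_url, pages):
    """Yield WordPress-style page URLs using the 'paged' parameter."""
    for page in range(1, pages + 1):
        yield f"{feed_url}?{urlencode({'paged': page})}"


for url in blogger_page_urls("http://example.blogspot.com/feeds/posts/default", 250):
    print(url)
for url in wordpress_page_urls("http://example.com/feed/", 3):
    print(url)
```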
The only reason that Google Reader has more information is that it remembered it from when it came up the first time.
There has been some discussion of something like this as an extension to the Atom protocol, but I don't know if it is actually implemented anywhere.
As the other replies here mentioned, a feed may not provide archival data but historical items may be available from another source.
Archive.org’s Wayback Machine has an API to access historical content, including RSS feeds (if their bots have downloaded it). I’ve created the web tool Backfeed that uses this API to regenerate a feed containing concatenated historical items. If you'd like to discuss the implementation in detail please get in touch.
In my experience with RSS, the feed is compiled from the last X items, where X is a variable. Certain feeds may have the full list, but for bandwidth's sake most places are likely limiting it to just the last few items.
The likely reason Google Reader has the old info is that it is storing it on its side for later use.
Further to what David Dean said, RSS/Atom feeds will only contain what the publisher of the feed has up at that moment, and someone would need to be actively collecting this information in order to have any historical information. Basically, Google Reader was doing this for free, and when you interacted with it you could retrieve this stored information from the Google database servers.
Now that they have retired the service, to my knowledge you have two choices. You either have to start collection of this information from your feeds of interest and store the data using XML or some such, or you could pay for this data from one of the companies who sell this type of archived feed information.
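If you go the collect-it-yourself route, the core of it is small. Below is a minimal sketch in Python that stores each fetched snapshot's items keyed by GUID, so that repeated polls only add what is new. I have used SQLite here instead of XML files purely for convenience; the table layout, function names and sample feeds are invented, and fetching the feed on a schedule is left out.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_archive(path=":memory:"):
    """Open (or create) the archive database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "guid TEXT PRIMARY KEY, title TEXT, link TEXT, pub_date TEXT)"
    )
    return db


def store_snapshot(db, feed_xml):
    """Insert the items of one fetched feed; return how many were new."""
    new_items = 0
    for item in ET.fromstring(feed_xml).iter("item"):
        guid = item.findtext("guid") or item.findtext("link")
        if guid is None:
            continue  # nothing stable to key on
        cursor = db.execute(
            "INSERT OR IGNORE INTO items VALUES (?, ?, ?, ?)",
            (guid, item.findtext("title"), item.findtext("link"),
             item.findtext("pubDate")),
        )
        new_items += cursor.rowcount  # 0 when the guid was already stored
    db.commit()
    return new_items


MONDAY = """<rss version="2.0"><channel>
<item><title>Post 1</title><guid>urn:example:1</guid></item>
<item><title>Post 2</title><guid>urn:example:2</guid></item>
</channel></rss>"""

TUESDAY = """<rss version="2.0"><channel>
<item><title>Post 2</title><guid>urn:example:2</guid></item>
<item><title>Post 3</title><guid>urn:example:3</guid></item>
</channel></rss>"""

db = open_archive()
print(store_snapshot(db, MONDAY), store_snapshot(db, TUESDAY))
print(db.execute("SELECT COUNT(*) FROM items").fetchone()[0])
```

Post 1 has already rotated out of Tuesday's feed, but it stays in the archive, which is the whole point.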
I hope this information helps somebody.
Seán
Here is another potential solution that might not have been available when the question was originally asked and that shouldn't require any specific service.
Find the URL of the RSS feed you want and use waybackpack to get the archived urls for that feed.
Use FeedReader or a similar library to pull down the archived RSS feed.
Take the URLs from each feed and scrape them as you wish. If you're going way back in time it's possible there might be some dead links.
All previous answers more or less relied on existing services to still have a copy of that feed or the feed engine to be able to provide older items dynamically.
There is, though, another, admittedly proactive and rather theoretical way to do so: let your feedreader use a caching proxy which semantically understands RSS and/or Atom feeds and caches them on a per-item basis, up to as many items as you configure.
If the feedreader doesn't poll feeds regularly, the proxy could fetch known feeds on its own schedule so as not to miss an item in highly volatile feeds like the one from User Friendly, which has only one item and changes every day (or at least used to). Hence, if the feedreader e.g. crashed or lost its network connection while you were away for a few days, you might lose items in your feedreader's cache. Having the proxy fetch those feeds regularly (e.g. from a data center instead of from home, or on a server instead of a laptop) allows you to run the feedreader only now and then without losing items which were posted after your feedreader last fetched the feeds but rotated out again before it fetches them the next time.
I call that concept a Semantic Feed Proxy and I've implemented a proof of concept called sfp. It is, though, not much more than a proof of concept and I haven't developed it further. (So I'd be happy about hints to projects with similar ideas or purposes. :-)
Why does this problem exist?
Most RSS readers need to import feeds through a live URL, which makes things harder for sites that are unindexed on Wayback Machine.
The reason why Wayback Machine feeds can be imported is that the reader can regularly poll the server for updates according to its defined TTL configuration. The reader compares the current datetime with the pubDate or lastBuildDate values of the feed's posts in the XML response. We can't hack the machine datetime to work around the datetime comparison because the current datetime is fetched live.
I've outlined an alternative solution without Wayback below. Unfortunately, I have not been able to find a universal solution for all feed sources.
Alternative Solution(s)
In my experience, NOT ALL feeds are partial though. The XML doesn't have to specify the datetime of each post. This means the RSS Reader doesn't have a datetime to filter the feed with. An example of this feed type can be found here.
This kind of reading experience is useful when chronological order is irrelevant and the content doesn't need to be sorted. This approach is useful for sites where ALL the content is valuable, and the linked Essays of Paul Graham are a good example.
If the site has a generic, non-chronological feed option, subscribe to that RSS instead (the preferred option).
Download the linked timestamped .rss file, strip datetimes and host the file on your own server. Note, we can implement this via an AWS Lambda.
Set up a server that fetches the RSS from the live site.
Strip the pubDate tags from the XML file on fetch.
Host the modified RSS on your own server.
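The stripping step itself is short. A minimal sketch in Python with the standard library, assuming a plain RSS 2.0 feed without namespaced extensions (ElementTree rewrites namespace prefixes on output); the function name and sample feed are invented, and fetching and hosting are left out.

```python
import xml.etree.ElementTree as ET


def strip_pub_dates(feed_xml):
    """Return the feed XML with every item's <pubDate> removed."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        for date in item.findall("pubDate"):
            item.remove(date)
    return ET.tostring(root, encoding="unicode")


SAMPLE = """<rss version="2.0"><channel><title>Essays</title>
<item><title>Essay one</title><link>http://example.com/1.html</link>
<pubDate>Mon, 06 Sep 2010 10:00:00 GMT</pubDate></item>
<item><title>Essay two</title><link>http://example.com/2.html</link></item>
</channel></rss>"""

print(strip_pub_dates(SAMPLE))
```

The same function could run inside a scheduled job or a Lambda handler, with the output written wherever you host the modified feed.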
Note
These are suboptimal solutions due to the loss of ordering. However, I wanted to provide a potential alternative to the Wayback Machine.
In addition, some existing answers require advanced SysDesign workarounds, more prework and in some cases are outdated (Google Reader is shut down). I hope it's helpful for those who really need a solution for a complete feed list. Constructing new RSS feeds is not too hard from the original RSS file.