I am trying to subscribe to an RSS feed for a fighter's history via MixedMartialArts.com, but this particular website updates the pubDate for each of the fights, causing duplicates every time the pubDate is updated.
http://www1.mixedmartialarts.com/?go=rss.fighterRecord&pid=8878384A5C892D13
However, the other attributes of each item remain the same, particularly <title>.
What can I do (maybe via Yahoo Pipes, or other normalizer) to fix this issue temporarily until they correct the problem on their end?
I use Google Reader, and find that it is very good at dealing with malformed feeds.
I have a large set of podcast feed URLs which I'm periodically polling to check for updates. I'm really struggling to find a robust way to detect if a feed has changed that doesn't have any false positives. I'd like to be able to detect not just if there is a new episode, but also if an existing episode was updated.
RSS and Atom feeds provide pubDate, lastBuildDate or updated elements. However, I'm finding these frequently misused so that the feed is actually inserting the current date time into these fields each request. This makes them difficult to rely on to detect changes.
My next thought was to strip all date information from the podcasts, then MD5 hash the feed contents. I can then compare the feed hashes to detect changes to the feeds.
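The strip-then-hash idea described above might look roughly like the sketch below. The set of date element names in `DATE_TAGS` and the function names are assumptions made for illustration, not part of any feed library, and the sketch does nothing about dynamic data outside those elements.

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumed list of date-bearing element names (namespace prefixes are ignored).
DATE_TAGS = {"pubDate", "lastBuildDate", "updated", "published", "date"}


def local_name(tag):
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def feed_fingerprint(xml_text):
    """Return the MD5 hex digest of the feed with all date elements removed."""
    root = ET.fromstring(xml_text)
    # Snapshot the tree first so that removing children is safe while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) in DATE_TAGS:
                parent.remove(child)
    return hashlib.md5(ET.tostring(root)).hexdigest()
```

Two fetches of the same feed that differ only in those date elements produce the same fingerprint; any other difference changes it.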
This seems to work for about 90% of the cases. However, there are still hundreds of podcasts that insert dynamic data into their feeds.
One podcast has the following as their podcast cover art:
http://erikglassman.hipcast.com/albumart/1000.1439649026.jpg
Where 1439649026 is what I assume is a timestamp. This second number changes with each request of their feed.
This is starting to seem like a losing battle. If I can't reliably trust the date fields of a podcast feed, and if some percentage of podcasts insert dynamic data into their feed text, how can I reliably detect changes to a feed in a robust way?
Everything you say is true, so it's not a good idea to try to detect changes at the feed level. Instead, look for them at the item level.
That generally works. If it doesn't, the feed can't be used by anyone, so the source of the feed is likely to have fixed the problem. That's why I think it works so well.
I've been writing feed readers for as long as they have existed. My current product is called River4. It's available as open source under the MIT License, so you can use it as example code for this and other issues.
This is where it checks if an item is new:
https://github.com/scripting/river4/blob/master/river4.js#L1411
That might move around as the code changes, so look for a routine called getItemGuid. It shows you how to get a value that uniquely identifies the item. I use this code for my podcatcher, http://podcatch.com/, and it seems to catch the new items, and doesn't get false positives.
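For readers who want the general idea without reading the JavaScript, here is a minimal Python sketch of item-level detection. It is not a port of getItemGuid; the fallback order (guid, then enclosure URL, then link, then a hash of the title) and the names `item_key` and `new_items` are assumptions made for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET


def item_key(item):
    """Pick a stable identifier for an RSS <item>, never using its dates."""
    guid = (item.findtext("guid") or "").strip()
    if guid:
        return "guid:" + guid
    enclosure = item.find("enclosure")
    if enclosure is not None and enclosure.get("url"):
        return "enclosure:" + enclosure.get("url")
    link = (item.findtext("link") or "").strip()
    if link:
        return "link:" + link
    # Last resort (assumption): hash the title.
    title = (item.findtext("title") or "").strip()
    return "title:" + hashlib.md5(title.encode("utf-8")).hexdigest()


def new_items(xml_text, seen):
    """Return items whose key is not in `seen`, and record their keys."""
    fresh = []
    for item in ET.fromstring(xml_text).iter("item"):
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh
```

The `seen` set would be persisted between polls. Because the key ignores pubDate, a feed that rewrites its dates on every request does not produce new items. Detecting edits to an existing item would need an additional per-item content hash, which this sketch leaves out.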
Hope this helps! :-)
I am building an RSS feed for the first time and I have some simple, direct questions that I was unable to find answers to on the web, or at least not in a form that was clear to me. Can you help me understand the following?
Which items should I include in RSS generation? Should I always put in all the articles, or what is the criterion when I query my articles for the feed?
What value should I set for pubDate? The specification says "The publication date for the content in the channel. For example, the New York Times publishes on a daily basis, the publication date flips once every 24 hours. That's when the pubDate of the channel changes.". I do not quite understand how to apply this to my feed. I have new articles daily; should I set the pubDate to, let's say, 06:00 AM today and update it every day?
lastBuildDate: if I understand this right is the date of the latest updated item?
Which items should I include in RSS generation?
You should have one generic feed with all the new articles you post (for example: news). Additionally, if your website is split into categories, or you have some specific feeds (e.g. a calendar of events), then it's good to create an additional separate RSS feed for each one of them.
What value should I set for pubDate? I do not quite understand how to apply this to my feed. I have new articles daily, should I set the pubDate to let say 06:00 AM today and update it every day?
Always set pubDate to the time when your news/articles went online. So if you have new articles daily, pubDate should be the date when they were released to the public. Not a random hour in the morning. Not the moment when you started writing them.
lastBuildDate: if I understand this right is the date of the latest updated item?
lastBuildDate is the most recent date when any of the items was posted or modified. Usually you should skip it, especially if your lastBuildDate would simply be the most recent pubDate. It's an optional element.
I use lastBuildDate only for calendar RSS feeds, to show when the calendar was updated (in calendars you not only add new entries but also often edit existing ones).
You should put in every article, but the best approach is to provide different feeds for different categories, even for search keywords. You can build it like any dynamic page, with a query string.
The channel pubDate is not super important; you can put whatever you like. I don't think many feed readers use it.
lastBuildDate is theoretically the date the content changed, so the date of the latest updated item should work.
Something super important, since people are going to be polling this page (meaning a lot of requests to it):
- Cache it on your server
- Serve an ETag header and/or a Last-Modified header. That way your server can respond with just a "304 Not Modified" if the client already has the current version in its cache.
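As a rough illustration of the ETag half of that advice, the sketch below decides between a full response and a 304 from the feed bytes and the client's If-None-Match header. The function name `feed_response` and the way the ETag is derived (a truncated SHA-256 of the body) are assumptions; a real server or framework may already do this for you.

```python
import hashlib


def feed_response(body, if_none_match=None):
    """Return (status, headers, payload) for a feed request.

    `body` is the cached feed as bytes; `if_none_match` is the raw value of
    the client's If-None-Match header, or None if it was not sent.
    """
    # Assumption: a content hash is a good enough validator for this feed.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {"ETag": etag, "Content-Type": "application/rss+xml"}
    if if_none_match is not None:
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if etag in client_tags:
            # The client already has this exact version: send no body.
            return 304, headers, b""
    return 200, headers, body
```

This simplified comparison ignores weak validators (`W/"..."`) and the `*` wildcard, so treat it as a starting point only.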
This may be a simple question, but for some reason I don't know this answer. Is it possible to create an RSS feed file that contains contents for an entire year but only publishes the current date and previous date information?
I have a client that wants to do a "this day in history" post. Currently, I am using IFTTT, and created around sixty dated posts for the next two months. Of course, this works -- but it is very labor intensive.
Is it possible to create an RSS feed that you could put all 365 days of data into, but if someone pulls up the feed it only shows today's item and prior days' items?
Or is RSS not the proper technology to do this? The reason I am using RSS is ease of use: IFTTT will take those RSS feeds and pump them into Facebook and Twitter as automatic status updates for my client.
There are various tools that let you define Facebook and Twitter posts in advance, to be published at a specified date and time in the future. Why not use one of those instead of writing your own?
A quick search for "scheduled twitter post" uncovered Later Bro, Twuffer and twAitter but there must be dozens to choose from.
If you're just looking to post on Facebook and Twitter, and don't need an RSS feed as well, I'd follow Matthew's suggestion. If you want an RSS feed, there is a feed for each Twitter feed. But if you want actual RSS, you need to add something in between. An RSS feed is just an XML file; it's not a process. I suggest keeping a file of some type (maybe RSS or other XML, a database table, or even a CSV file) with all the posts and relevant information, including the date. Then write a small script that runs as a cron job (or via IFTTT, if it supports a date as the trigger and running a script as the "then" part) that pulls the day's entry and updates the actual RSS feed. Pretty simple.
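A minimal sketch of such a script, assuming the year's posts live in a CSV with hypothetical `date`, `title` and `text` columns (ISO dates), could look like this. The channel title, the link, and the `tdih-` guid prefix are placeholders, not anything the feed consumers require.

```python
import csv
import io
from datetime import date, datetime, time, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def build_feed(csv_text, today, channel_title="This Day in History",
               channel_link="http://example.com/"):
    """Build an RSS 2.0 document holding only posts dated `today` or earlier."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_title

    rows = [row for row in csv.DictReader(io.StringIO(csv_text))
            if date.fromisoformat(row["date"]) <= today]
    rows.sort(key=lambda row: row["date"], reverse=True)  # newest first

    for row in rows:
        day = date.fromisoformat(row["date"])
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = row["title"]
        ET.SubElement(item, "description").text = row["text"]
        # A date-based guid keeps each post's identity stable across rebuilds.
        ET.SubElement(item, "guid", isPermaLink="false").text = "tdih-" + row["date"]
        published = datetime.combine(day, time(0, 0), tzinfo=timezone.utc)
        ET.SubElement(item, "pubDate").text = format_datetime(published, usegmt=True)
    return ET.tostring(rss, encoding="unicode")
```

A daily cron job could call `build_feed(open("posts.csv").read(), date.today())` and write the result to the file that the feed URL serves; `posts.csv` is again a made-up name.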
Here is what I ended up doing:
Using the Drupal backend of my website, I created a content type specifically for these posts.
I created individual articles for each day, and used the schedule module to schedule the publish date to the date I wanted.
I created an RSS feed of these posts through Drupal.
I linked the newly created RSS feed to IFTTT.
I created an IFTTT recipe to post the text from the RSS feed to Facebook/Twitter/etc.
It wasn't the best solution, but it worked. I was really trying to do this without having to rely on a third-party such as IFTTT, but never really figured out a good way to do it.
What is the correct response an RSS client should have when it encounters a feed that has multiple items with the same guid/identifier?
Currently in my application, any items that use an existing guid won't be cached or displayed because it believes it already has that item.
In this example feed a lot of items share this id:
tag:blizzard.com,2010-10-22:diablo3:feed:en-us:1
According to w3, when there are duplicate entries in a feed:
Atom Processors MAY choose to display all of them or some subset of them. One typical behavior would be to display only the entry with the latest atom:updated timestamp.
I would go with the spec and display only the entry with the latest updated timestamp. Don't forget to send an email to Blizzard support and have them get their RSS validated - just don't threaten to keep them out of the next raid.
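The "latest updated wins" behavior could be sketched as follows for an Atom document. The function names are made up for the example, and it assumes every entry carries an id and an updated timestamp with a timezone, as Atom requires.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_updated(text):
    """Parse an RFC 3339 timestamp such as 2010-10-22T12:00:00Z."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def latest_entries(atom_xml):
    """Keep one entry per atom:id, choosing the latest atom:updated."""
    latest = {}
    for entry in ET.fromstring(atom_xml).iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = parse_updated(entry.findtext(ATOM + "updated"))
        if entry_id not in latest or updated > latest[entry_id][0]:
            latest[entry_id] = (updated, entry)
    return [entry for _, entry in latest.values()]
```

Entries that share an id but are really different posts, as in the Blizzard feed, would still collapse into one here, which is why the feed itself needs fixing.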
Take care.
I think your app is doing it right. Don't get fancy. If you've already seen an item with that guid you don't present it a second time. You should contact the webmaster for the feed if possible and alert them to the problem.
Does each item have a unique URL? If so, fall back to using the URL.
How would I get the next page or more results for a feed?
For example, when I go to the Security Now feed page, there is no "next" link of any kind, and the URL parameter "page=100" does nothing:
http://leoville.tv/podcasts/sn.xml
I get only 1 page of results of about 20 episodes. However my Google Reader can successfully retrieve episodes that are earlier than that.
Indeed, it is true that Google Reader caches the items, and it is NOT possible to paginate on RSS2, RSS or Atom feeds (unless they have a rel=next link, which none of them seem to have).
However, we can leverage the existing Google Reader infrastructure, with some work, to retrieve a list of, say 200 items!
Given the above podcast url we retrieve the latest 200 episodes by:
Using the ...google.ca/reader/atom/feed prefix instead of the usual view/feed as can be seen in your google reader.
Appending n=200 as the query parameter.
So we have:
http://www.google.ca/reader/atom/feed/http://leoville.tv/podcasts/sn.xml?hl=en&n=200
There is a very insightful reverse-engineered google-reader API project located at http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI
Google reader caches RSS entries. You can't get any more from the actual feed if they don't allow for it.
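For the feeds that do allow it through a rel=next link, as mentioned above, finding the next page could be sketched like this. The function name is hypothetical, and the sketch only looks at feed-level Atom links, either directly under an Atom feed or inside an RSS 2.0 channel.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def next_page_url(feed_xml):
    """Return the href of the feed-level rel="next" link, or None."""
    root = ET.fromstring(feed_xml)
    # Atom: links are direct children of <feed>.
    # RSS 2.0: an atom:link may appear inside <channel>.
    candidates = root.findall(ATOM + "link") + root.findall("channel/" + ATOM + "link")
    for link in candidates:
        if link.get("rel") == "next" and link.get("href"):
            return link.get("href")
    return None
```

A client would fetch the returned URL and repeat until the function returns None. For a feed like the one in the question, which has no such link, it returns None straight away.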