I have a large set of podcast feed URLs which I'm periodically polling to check for updates. I'm really struggling to find a robust way to detect if a feed has changed that doesn't have any false positives. I'd like to be able to detect not just if there is a new episode, but also if an existing episode was updated.
RSS and Atom feeds provide pubDate, lastBuildDate or updated elements. However, I'm finding that these are frequently misused: the feed inserts the current date and time into these fields on each request. This makes them difficult to rely on to detect changes.
My next thought was to strip all date information from the podcasts, then MD5 hash the feed contents. I can then compare the feed hashes to detect changes to the feeds.
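A minimal sketch of that idea, assuming the feed is well-formed XML and that the only volatile fields are the three date elements named above. The names `DATE_TAGS`, `local_name` and `feed_hash` are made up for illustration:

```python
import hashlib
import xml.etree.ElementTree as ET

# Local tag names treated as volatile date fields (an assumption; extend as needed).
DATE_TAGS = {"pubDate", "lastBuildDate", "updated"}

def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.w3.org/2005/Atom}'."""
    return tag.rsplit("}", 1)[-1]

def feed_hash(xml_text):
    """Return an MD5 hex digest of the feed with its date elements removed."""
    root = ET.fromstring(xml_text)
    # Snapshot the tree first so removing children does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) in DATE_TAGS:
                parent.remove(child)
    return hashlib.md5(ET.tostring(root)).hexdigest()
```

Two fetches that differ only in those date elements hash to the same value. Any other per-request data in the feed still changes the hash.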
This seems to work for about 90% of the cases. However, there are still hundreds of podcasts that insert dynamic data into their feeds.
One podcast has the following as their podcast cover art:
http://erikglassman.hipcast.com/albumart/1000.1439649026.jpg
Where 1439649026 is what I assume is a timestamp. This second number changes with each request of their feed.
This is starting to seem like a losing battle. If I can't reliably trust the date fields of a podcast feed, and if some percentage of podcasts insert dynamic data into their feed text, how can I reliably detect changes to a feed in a robust way?
Everything you say is true, so it's not a good idea to try to detect changes at the feed level. Instead, look for them at the item level.
That generally works. If it doesn't, the feed can't be used by anyone, so the source of the feed is likely to have fixed the problem. That's why I think it works so well.
I've been writing feed readers for as long as they have existed. My current product is called River4. It's available as open source under the MIT License, so you can use it as example code for this and other issues.
This is where it checks if an item is new:
https://github.com/scripting/river4/blob/master/river4.js#L1411
That might move around as the code changes, so look for a routine called getItemGuid. It shows you how to get a value that uniquely identifies the item. I use this code for my podcatcher, http://podcatch.com/, and it seems to catch the new items, and doesn't get false positives.
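To illustrate the item-level idea, here is a rough sketch. It is not River4's actual code. It assumes each item has already been parsed into a plain dict, and the field names (`guid`, `link`, `title`, `description`, `enclosure_url`) and function names are hypothetical:

```python
import hashlib

def item_key(item):
    """Identify an item: prefer its guid, else fall back to link, then title."""
    return item.get("guid") or item.get("link") or item.get("title", "")

def item_fingerprint(item, fields=("title", "description", "enclosure_url")):
    """Hash only the fields whose change should count as an update."""
    joined = "\x1f".join(item.get(f, "") for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def diff_items(seen, items):
    """Compare parsed items against `seen` (key -> fingerprint) and update it.

    Returns (new_keys, updated_keys).
    """
    new_keys, updated_keys = [], []
    for item in items:
        key, fp = item_key(item), item_fingerprint(item)
        if key not in seen:
            new_keys.append(key)        # never seen this item before
        elif seen[key] != fp:
            updated_keys.append(key)    # known item whose content changed
        seen[key] = fp
    return new_keys, updated_keys
```

Because channel-level fields such as dates and cover art never enter the comparison, per-request noise there cannot cause a false positive. Only the fields you choose to fingerprint matter.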
Hope this helps! :-)
Is there a way to add a feed or something to a website to show the upcoming football games (who's playing and at what time)?
I was thinking something like this: http://www.bbc.com/sport/football/fixtures
I think they have an RSS feed but I can't find how to utilise it. Is this even the right thing? I've never used any sort of feeds before.
I have found this: market.mashape.com/heisenbug/champions-league-live-scores. I'm not sure if it displays the upcoming matches or not, but it's the closest thing I've found. Most of the sports APIs I've found seem to charge quite a lot per month. This one has a free version, but I don't fully understand it. It says 50 free per month, but 50 free what? Requests? If so, is it one 'request' per update (which is every 10 minutes with this plan)? Then it would only last just over 8 hours?
Thanks
I found two websites that supply the information you need. Since you didn't add source code, I think this answer will suit you better.
First, go to ScoresPro and pick any of the available sports RSS feeds.
For this example I selected Soccer.
Then go to Feedwind, paste the feed URL and press ENTER.
This is the result.
I think this question has been answered here before, but I could not find the relevant topic. I am a newbie in web scraping. I have to develop a script that will take all the Google search results for a specific name. Then it will grab the related data for that name, and if more than one match is found, the data will be grouped by name.
All I know is that Google has some kind of restriction on scraping. They provide a Custom Search API. I have not used that API yet, but I am hoping to get all the result links for a query from it. However, I could not work out what the ideal process would be for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have described a bit more of what you have been doing; it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API; you can get around 2k results a day.
You can up it to around 3k a day for 2000 USD/year. You can up it more by getting in contact with them directly.
You will not be able to get accurate ranking positions with this method. If you only need a small number of requests and are mainly interested in getting some websites for a keyword, that's the choice.
Starting point would be here: https://code.google.com/apis/console/
b) You can scrape the real search results
That's the only way to get the true ranking positions, for SEO purposes or to track website positions. It also lets you get a large number of results, if done right.
You can Google for code; the most advanced free (PHP) code I know of is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day, and this can be multiplied by using multiple IPs. Look at the linked article if you want to go that route; it explains it in more detail and is quite accurate.
That said, if you choose route b) you break Google's terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be blocked by IP ban or captcha. Not getting detected should be a priority.
I am building an RSS feed for the first time and I have some simple, direct questions whose answers I was unable to find on the web, at least in a form that was clear to me. Can you help me understand the following?
Which items should I include in RSS generation? Should I always put in all the articles, or what are the criteria when I query my articles for the feed?
What value should I set for pubDate? The specification says "The publication date for the content in the channel. For example, the New York Times publishes on a daily basis, the publication date flips once every 24 hours. That's when the pubDate of the channel changes.". I do not quite understand how to apply this to my feed. I have new articles daily; should I set the pubDate to, let's say, 06:00 AM today and update it every day?
lastBuildDate: if I understand this right, is it the date of the latest updated item?
Which items should I include in RSS generation?
You should have one generic feed with all the new articles you post (for example: news). Additionally, if your website is split into categories, or you have some specific feeds (e.g. a calendar of events), then it's good to create an additional separate RSS feed for each of them.
What value should I set for pubDate? I do not quite understand how to apply this to my feed. I have new articles daily; should I set the pubDate to, let's say, 06:00 AM today and update it every day?
Always set pubDate to the time when your news/articles went online. So if you have new articles daily, pubDate should be the date when they were released to the public, not a random hour in the morning and not the moment when you started writing them.
lastBuildDate: if I understand this right is the date of the latest updated item?
lastBuildDate is the most recent date when any of the items was posted or modified. Usually you can skip it, especially if your lastBuildDate would simply be the most recent pubDate. It's an optional element.
I use lastBuildDate only for calendar RSS feeds to show when the calendar was updated (as in calendars you not only add new entries but also often edit existing ones).
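A small sketch of how those two values could be produced when generating the feed. It assumes each article is a dict with a timezone-aware `published` datetime and an optional `modified` one; those key names and both function names are made up for the example:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def rss_date(dt):
    """Format an aware datetime as an RFC 822 style date, as RSS 2.0 expects."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

def channel_dates(articles):
    """Return (item pubDates, lastBuildDate) for a list of article dicts."""
    # Each item's pubDate is the moment it actually went online.
    pub_dates = [rss_date(a["published"]) for a in articles]
    # lastBuildDate is the latest time any item was posted or modified.
    latest = max(a.get("modified", a["published"]) for a in articles)
    return pub_dates, rss_date(latest)
```

If no article was edited after publication, the second value is just the most recent pubDate, which is the case where the element can be skipped.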
You should include every article, but it's best to provide different feeds for different categories, or even search keywords. You can build it like any dynamic page, with a query string.
pubDate is not super important; you can put whatever. I don't think many feed readers use it.
lastBuildDate is theoretically the date the content changed, so the date of the latest updated item should work.
Something super important, since people are going to poll this page (meaning a lot of requests on the page):
- Cache it on your server
- Serve an ETag header and/or a Last-Modified header. That way your server can respond with just a "304 Not Modified" if the client already has it in cache.
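As a rough, framework-independent sketch of the ETag part: the functions below are hypothetical, take the request headers as a plain dict, and only handle a single exact `If-None-Match` value. A real server would also need to handle header case, lists of tags and weak validators:

```python
import hashlib

def make_etag(body):
    """Derive a quoted ETag from the cached feed bytes."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'

def respond(request_headers, body):
    """Return (status, headers, payload) for a GET of the cached feed."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Content-Type": "application/rss+xml"}
    # The client sends back the ETag it last saw; if it still matches,
    # the feed is unchanged and no body needs to be sent.
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""
    return 200, headers, body
```

A client that stores the ETag from its last 200 response and sends it back on the next poll gets an empty 304 until the cached feed bytes change.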
I'm using feedparser for working with RSS.
I regularly (e.g. every 15 minutes) fetch an RSS channel with its items and store it. There are often no new items in the channel, so this is inefficient.
Is there a way to detect quickly if there are some new items in the channel and if not, do nothing with this channel?
thank you
For RSS 2.0, the channel element has an optional lastBuildDate element. For Atom, there's a similar "atom:updated" element, but the standard does state that this is when "an entry or feed was modified in a way the publisher considers significant. Therefore, not all modifications necessarily result in a changed atom:updated value".
There's also a pubDate element in RSS 2.0, also optional, but lastBuildDate should be the one to use, assuming it's there and the publisher is using it correctly.
You can store the previous one and compare the newly retrieved value with the old one.
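A minimal sketch of that comparison for RSS 2.0. It uses only the standard library rather than feedparser, and the function names are made up for illustration. A feed without lastBuildDate is treated as changed, since nothing can be concluded from a missing value:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def last_build_date(rss_text):
    """Return the channel's lastBuildDate as a datetime, or None if absent."""
    node = ET.fromstring(rss_text).find("./channel/lastBuildDate")
    if node is None or not node.text:
        return None
    return parsedate_to_datetime(node.text.strip())

def channel_changed(rss_text, previous):
    """Return (changed, current) where `previous` is the stored datetime."""
    current = last_build_date(rss_text)
    changed = (current is None) or (previous is None) or (current != previous)
    return changed, current
```

Store `current` after each poll and pass it back as `previous` next time. This only helps when the publisher maintains the element correctly.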
Added material on feedparser:
For feedparser, see feed.updated_parsed and feed.updated.
I'm writing an RSS feed. Let's say it's for a list of entries as in a blog.
How do I handle updating the feed? I mean, let's assume that the feed always displays the last 10 entries.
If someone subscribes now, he'll get the last 10 entries (1..10). What if there are, for example, 2 new articles? What will his feed reader do then? Because I will return the articles (3..12).
Do I have to do any special handling to start from a certain article in the feed, or do I just always put the last 10 and this will be fine?
Returning the last n articles will be fine. Because you assign a unique identifier to each article (you do, right?) the feed reader can easily keep track of what it has already seen or not.
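To show why the sliding window is harmless, here is a sketch of what a reader might do on its side. The entry dicts and the `id` key are assumptions for the example, not any particular reader's code:

```python
def unseen_entries(seen_ids, window):
    """Return entries from `window` whose id is new, and remember them."""
    fresh = [e for e in window if e["id"] not in seen_ids]
    seen_ids.update(e["id"] for e in fresh)
    return fresh

# First poll returns articles 1..10, the next one returns 3..12.
seen = set()
first = unseen_entries(seen, [{"id": n} for n in range(1, 11)])
second = unseen_entries(seen, [{"id": n} for n in range(3, 13)])
```

Here `first` holds all ten entries and `second` holds only entries 11 and 12. The overlap (3..10) is recognized by id and ignored.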
The feed reader will probably watch to see how often new articles appear, to help determine how often it checks for new articles.