How to crawl a feed - rss

My application needs to keep track of RSS/Atom feeds and save new entries in a database. What is the most reliable way to determine whether an entry in a feed has already been crawled?
I use the Universal Feed Parser module to parse the feeds. My current implementation records the latest value of feed.entries[i].updated_parsed; when crawling, an entry is saved to the database if its updated_parsed value is greater than the recorded value. The problem is that many feeds don't have a published date or an updated date.

You should be determining whether you've already crawled an entry by reference to its <guid> primarily (falling back to <link> in the absence of a <guid>), and anything to do with dates only as a secondary analysis.
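A minimal sketch of that lookup order, assuming entries are dict-like the way feedparser's are (feedparser exposes an RSS <guid> or Atom <id> under the `id` key). The helper names `entry_key` and `new_entries`, and the last-resort content hash for entries that have neither field, are my own additions and not part of the answer above:

```python
import hashlib


def entry_key(entry):
    """Return a stable identity for an entry: guid/id first, then link."""
    key = entry.get("id") or entry.get("link")
    if key:
        return key
    # Assumption: with no guid and no link, hash the title and summary instead.
    raw = (entry.get("title", "") + "\n" + entry.get("summary", "")).encode("utf-8")
    return "sha1:" + hashlib.sha1(raw).hexdigest()


def new_entries(entries, seen_keys):
    """Return entries whose key is not in seen_keys, and record their keys."""
    fresh = []
    for entry in entries:
        key = entry_key(entry)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(entry)
    return fresh
```

In a real crawler `seen_keys` would be backed by the database (for example a unique index on the key column) rather than an in-memory set.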

Related

What is the best Clockify API endpoint to get the time entries of (grouped by) saved reports?

Asking here after asking Clockify support.
I'm trying to extend some of Clockify's capabilities to create extra reporting for our clients.
I've been playing with your API, specifically the endpoint /reports/{reportsId}
• My goal:
Get all the time entries of a specific "saved report" (usually saved by our Project Managers)
• What I EXPECT from "/reports/{reportsId}":
To get all the info and entities (users, time entries, projects, etc.) only regarding that particular reportId
• What I GET from "/reports/{reportsId}":
Lots of info regarding the whole workspace, and I only see summaryReport as more "specific to the saved report itself"...
• Questions:
1- Is this the correct behavior?
2- How do you filter down time entries of specific reports in URLs like https://clockify.me/bookmarks/BOOKMARK_HASH_HERE ?
Do you only call "/reports/{reportsId}" and filter down on the client side? (It seems that way to me from exploring the Network tab.)
If that's the way, what's the point of calling the report endpoint? Only for the summaryReport object?
3- Is "/reports/{reportsId}" the best endpoint I can use to reach my goal? Or which way would you recommend?
summaryReport.timeEntries will contain all the individual time entries from that particular report. Each entry has a user, project, client, time etc. Grouping by project is done on the client.
I'm not sure I fully understand your specific problem though. Are you suggesting the entries you get from the report endpoint do not belong to the given report?
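Since the grouping happens on the client, here is a rough sketch of what that could look like once you have the list from summaryReport.timeEntries decoded from JSON. The keys `project` and `duration_seconds` are placeholders I made up for illustration; check the actual response for the real field names:

```python
from collections import defaultdict


def total_seconds_by_project(time_entries):
    """Sum durations per project for a list of decoded time-entry dicts."""
    totals = defaultdict(int)
    for entry in time_entries:
        # "project" and "duration_seconds" are hypothetical field names.
        totals[entry["project"]] += entry["duration_seconds"]
    return dict(totals)
```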

Questions on building RSS feed

I am building an RSS feed for the first time and I have some simple, direct questions whose answers I was unable to find on the web, at least not in a form that was clear to me. Can you help me understand the following:
1. Which items should I include in RSS generation? Should I always include all the articles, or what are the criteria when I query my articles for the feed?
2. What value should I set for pubDate? The specification says "The publication date for the content in the channel. For example, the New York Times publishes on a daily basis, the publication date flips once every 24 hours. That's when the pubDate of the channel changes.". I do not quite understand how to apply this to my feed. I have new articles daily; should I set the pubDate to, let's say, 06:00 AM today and update it every day?
3. lastBuildDate: if I understand this right, is it the date of the most recently updated item?
Which items should I include in RSS generation?
You should have one generic feed with all the new articles you post (for example: news). Additionally, if your site is split into categories, or you have some specific feeds (e.g. a calendar of events), it's good to create a separate RSS feed for each of them.
What value should I set for pubDate? I do not quite understand how to apply this to my feed. I have new articles daily; should I set the pubDate to, let's say, 06:00 AM today and update it every day?
Always set pubDate to the time when your news/articles went online. So if you have new articles daily, pubDate should be the date when they were released to the public, not an arbitrary hour in the morning and not the moment when you started writing them.
lastBuildDate: if I understand this right is the date of the latest updated item?
lastBuildDate is the most recent date when any of the items was posted or modified. Usually you can skip it, especially if your lastBuildDate would simply be the most recent pubDate. It's an optional element.
I use lastBuildDate only for calendar RSS feeds, to show when the calendar was updated (in calendars you not only add new entries but also often edit existing ones).
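To make the two dates concrete, here is a small sketch that builds a feed with the standard library. Each item's pubDate is the moment the article went online, and the channel's lastBuildDate is the most recent publish or modify time across all items. RSS 2.0 dates use the RFC 822 format, which `email.utils.format_datetime` produces. The function name `build_rss` and the article dict keys (`title`, `link`, `published`, `modified`) are assumptions for illustration, as is using the article link as the guid:

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def build_rss(title, link, description, articles):
    """Build an RSS 2.0 document from article dicts with timezone-aware datetimes."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    # lastBuildDate: latest time any item was published or modified.
    changes = [a.get("modified") or a["published"] for a in articles]
    if changes:
        ET.SubElement(channel, "lastBuildDate").text = format_datetime(max(changes))

    # Newest articles first.
    for a in sorted(articles, key=lambda a: a["published"], reverse=True):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = a["title"]
        ET.SubElement(item, "link").text = a["link"]
        ET.SubElement(item, "guid").text = a["link"]  # assumption: link doubles as guid
        ET.SubElement(item, "pubDate").text = format_datetime(a["published"])
    return ET.tostring(rss, encoding="unicode")
```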
1. You should include every article, but it's best to provide different feeds for different categories, or even for search keywords. You can build it like any dynamic page, with a query string.
2. That's not very important; you can put whatever you like. I don't think many feed readers use it.
3. Theoretically it's the date the content changed, so the date of the most recently updated item should work.
Something very important, since people are going to poll this page (meaning a lot of requests):
- Cache it on your server.
- Serve an ETag header and/or a Last-Modified header. That way your server can respond with just a "304 Not Modified" if the client already has the current version cached.
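The conditional-request handling could be sketched like this, independent of any web framework. `feed_response` is a hypothetical helper: it takes the cached feed bytes, the time the feed last changed, and the request headers as a plain dict, and returns a status, response headers and body. Deriving the ETag from a hash of the body is just one common choice, and weak validators and `*` are not handled here:

```python
import hashlib
from email.utils import format_datetime, parsedate_to_datetime


def feed_response(body, last_modified, request_headers):
    """Return (status, headers, body); last_modified must be an aware UTC datetime."""
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified, usegmt=True),
    }

    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")

    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        not_modified = etag in [t.strip() for t in if_none_match.split(",")]
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            # HTTP dates have one-second resolution.
            not_modified = last_modified.replace(microsecond=0) <= since
        except (TypeError, ValueError):
            not_modified = False  # unparseable date: send the full feed
    else:
        not_modified = False

    if not_modified:
        return 304, headers, b""
    return 200, headers, body
```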

Detecting new RSS feed entries

I'm using feedparser for working with RSS.
I regularly (e.g. every 15 minutes) fetch an RSS channel with its items and store it. The channels often don't contain any new items, so this is inefficient.
Is there a way to detect quickly if there are some new items in the channel and if not, do nothing with this channel?
thank you
For RSS 2.0, the channel element has an optional lastBuildDate element. For Atom, there's a similar "atom:updated" element, but the standard does state that this is when "an entry or feed was modified in a way the publisher considers significant. Therefore, not all modifications necessarily result in a changed atom:updated value".
There's also a pubDate element in RSS 2.0, also optional, but lastBuildDate should be the one to use, assuming it's there and the publisher is using it correctly.
You can store the previous one and compare the newly retrieved value with the old one.
Added material on feedparser:
For feedparser, see feed.updated_parsed and feed.updated.
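A rough sketch of the store-and-compare idea using only the standard library, so the element lookups are visible. It reads the channel's lastBuildDate (falling back to pubDate) for RSS 2.0, or the feed-level updated element for Atom, and compares it as a plain string with the value saved from the previous fetch. The function names are my own. When the feed has no timestamp at all, it reports "changed" so that nothing is skipped by mistake:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def feed_timestamp(xml_text):
    """Return the feed-level timestamp string, or None if the feed has none."""
    root = ET.fromstring(xml_text)
    if root.tag == ATOM_NS + "feed":
        return root.findtext(ATOM_NS + "updated")
    channel = root.find("channel")
    if channel is None:
        return None
    return channel.findtext("lastBuildDate") or channel.findtext("pubDate")


def has_changed(xml_text, previous_timestamp):
    """Return (changed, current_timestamp) for a freshly fetched feed."""
    current = feed_timestamp(xml_text)
    # No timestamp now, or nothing stored yet: we cannot tell, so process the feed.
    if current is None or previous_timestamp is None:
        return True, current
    return current != previous_timestamp, current
```

Note that this only saves the parsing and database work; the feed body is still downloaded on every poll.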

How to handle non unique item GUIDs/IDs in an RSS feed?

What is the correct response an RSS client should have when it encounters a feed that has multiple items with the same guid/identifier?
Currently in my application, any items that use an existing guid won't be cached or displayed because it believes it already has that item.
In this example feed a lot of items share this id:
tag:blizzard.com,2010-10-22:diablo3:feed:en-us:1
According to the W3C, when there are duplicate entries in an RSS feed:
Atom Processors MAY choose to display all of them or some subset of them. One typical behavior would be to display only the entry with the latest atom:updated timestamp.
I would go with the spec and display only the entry with the latest updated timestamp. Don't forget to send an email to Blizzard support and have them get their RSS validated - just don't threaten to keep them out of the next raid.
Take care.
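A minimal sketch of that "keep only the latest" behavior, assuming dict-like entries where `id` holds the guid and `updated_parsed` holds something comparable (feedparser gives a time.struct_time there, which compares like a tuple). The function name and the `updated_key` parameter are hypothetical:

```python
def latest_per_id(entries, updated_key="updated_parsed"):
    """Collapse entries sharing an id, keeping the one with the latest timestamp."""
    latest = {}
    for entry in entries:
        current = latest.get(entry["id"])
        if current is None or entry[updated_key] > current[updated_key]:
            latest[entry["id"]] = entry
    return list(latest.values())
```

Entries without a timestamp would need separate handling; this sketch assumes every entry has one.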
I think your app is doing it right. Don't get fancy. If you've already seen an item with that guid, you don't present it a second time. You should contact the webmaster for the feed if possible and alert them to the problem.
Does each item have a unique URL? If so, fall back to using the URL.

Reading RSS by publication date

I want to build an RSS reader for Twitter RSS feeds (C#, .NET 3.5).
Getting a response from RSS web address and parsing it is very simple. (I did that with XmlDocument.Load("<RSS Feed>")).
The problem is that I need to get RSS items by publication date range.
When loading the application, I want to get all the items since the last time the feeds have been downloaded.
How can I do this?
Does every RSS feed allow that? (Google Reader shows items even from last year.)
It comes down to two sources of data: what the feed currently provides, and what you have stored.
If the feed is only showing the 10 most recent, for example, there is nothing you can do to get the older data. The feed must provide it.
Google Reader runs a cronjob that checks feeds about every 3 hours. It then stores the items in a database for Google Reader to reference any time it needs.
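The "what you have stored" half can be sketched as a small local archive that you top up on every poll and then query by date range yourself. The example below is in Python with sqlite3 instead of C#, purely for illustration; the table layout and function names are assumptions. Items are keyed by guid so that re-saving the same feed does not create duplicates, and dates are stored as UTC ISO 8601 strings so that string comparison matches chronological order:

```python
import sqlite3
from datetime import datetime, timezone


def _utc_iso(dt):
    """Normalize an aware datetime to a sortable UTC ISO 8601 string."""
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " guid TEXT PRIMARY KEY, title TEXT, published TEXT)"
    )
    return conn


def save_items(conn, items):
    """Store (guid, title, published) tuples; items already stored are skipped."""
    rows = [(guid, title, _utc_iso(published)) for guid, title, published in items]
    with conn:
        conn.executemany("INSERT OR IGNORE INTO items VALUES (?, ?, ?)", rows)


def items_between(conn, start, end):
    """Return stored items with start <= published < end, oldest first."""
    cur = conn.execute(
        "SELECT guid, title, published FROM items"
        " WHERE published >= ? AND published < ? ORDER BY published",
        (_utc_iso(start), _utc_iso(end)),
    )
    return cur.fetchall()
```

On startup you would fetch the feed, call `save_items` with whatever it currently contains, and then call `items_between` with the time of the last download as `start`. Anything that scrolled out of the feed while the application was not polling is still lost, as noted above.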
