Are RSS guids actually expected to be _globally_ unique?

Just trying to clear up what level of uniqueness the <guid> element in an RSS feed is actually supposed to have. I understand that one of its main purposes is to be something that software can use to identify the item for such purposes as read/unread tracking. But am I right that:
once a guid has been used, it should never be used again, even if the last instance was removed from the feed ages ago?
it should be unique not only within a feed, but also across multiple feeds and even (to the extent it can be achieved) unrelated websites?
one of the reasons it's common to use URLs as guids is to help achieve the above?
Moreover, if a program does encounter the same guid twice in different feeds, what should happen?
it treats them as distinct RSS items, since they are in different feeds?
they are considered to be one and the same item, just published in multiple places (similar to Usenet crossposts)?
it depends on whether they're on the same site/domain?
the behaviour is undefined?

It's a good question and the answer is No.
It was a bad choice of terminology.
The guids only have to be unique to the feed.
The goal in adding them was to have a way for an aggregator to know for sure whether or not it's seen the item before. A locally-unique id suffices for that purpose.
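In practice that just means an aggregator can key its "have I seen this?" check on the feed plus the guid. A minimal sketch (Python purely for illustration; the names and the fallback to the link are my own, not anything the spec mandates):

    # Track seen items per feed, so a guid only needs to be unique within its own feed.
    seen = set()  # holds (feed_url, guid) pairs

    def is_new_item(feed_url, item):
        # Falling back to the link when <guid> is missing is a common aggregator
        # heuristic, not something required by the RSS spec.
        guid = item.get("guid") or item.get("link")
        key = (feed_url, guid)
        if key in seen:
            return False
        seen.add(key)
        return True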

Related

Are the IDs reliable?

I'm starting to work with the Clockify APIs and I'd like to know whether the different IDs are reliable or not. As in, is it a really bad idea to keep their IDs in my database to know what's what, or is that something that would work long term? Thank you
IDs in Clockify represent the identities of their respective entities. They don't change, and are unique across the board, so you can use them in your database if you choose to.
That being said, it's always good practice when dealing with outside data to assign it your own IDs; that way you're not reliant on contracts that you cannot enforce. Give every entity an id (your own) and an externalId or clockifyId, and you won't ever be in a position where an outside change affects your domain logic.
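For example, something along these lines (just a sketch in Python; the entity and field names are made up, not part of the Clockify API):

    from dataclasses import dataclass

    @dataclass
    class Project:
        id: int            # your own primary key, fully under your control
        clockify_id: str   # the id Clockify returned; only used when syncing with Clockify
        name: str

    # Your own id drives relationships in your database; clockify_id is just a lookup key.
    p = Project(id=1, clockify_id="example-clockify-id", name="Internal tooling")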

Using auto-number database fields theory

I was on "another" programming forum, and we were talking about getting the next number from an auto-increment field BEFORE an insert takes place (there is a way using ADOX). This was in an MS-Access database btw.
Anyway, the discussion veered off into the area of SHOULD you use auto-increment fields for things like invoice numbers, PO numbers, bill of lading numbers, or anything else that needs a unique, incrementing number.
My thoughts were "why not"? Other people are arguing that an Invoice number (for instance) should be managed as a separate table and incremented with code, not using an auto-number field.
Can someone give me a good reason why that would be true?
I've used auto-number fields for years for just this type of thing and have never had problem one.
Your thoughts?
I have always avoided auto_increment fields. As it turns out, for good reason. But originally my reasons were because that was what the professor told us.
Facebook had a major breach a few years ago, simply because they were using AUTO_INCREMENT fields for user IDs. It doesn't take a calculator to figure out that if my ID is 10320 there is likely someone with ID 10319, etc.
When debugging (or proofing a design), having a key that is implicit of the data it represents is a heck of a lot easier.
Having keys that are implicit of the data reduces the potential for corrupted data (typos and user guessing).
Implicit keys require the developer to think about their data. I have never come across a table using implicit keys that was not normalized.
Other than the fact that deadlines often run tight, there is no great reason for auto-increment.
Normally I use an autonumbering field for the ID so I don't need to think about how it's generated.
Recordset operations like insert and delete alter the sequence, skipping blocks of numbers.
When you manage CustomerID, invoice numbers and so on, it's better to have full control over them instead of leaving them under the system's control.
You can create a function that generates the desired numbers for you using a rule (e.g. the invoice number can be a function that includes the invoicing date).
With autonumbering you can't manage this.
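A sketch of such a rule-based generator (Python just to illustrate the idea; the format and how you store the counter are up to you):

    from datetime import date
    from typing import Optional

    def next_invoice_number(last_sequence: int, today: Optional[date] = None) -> str:
        # Made-up rule: year and month of the invoicing date plus a zero-padded
        # per-period counter that you store and increment yourself.
        today = today or date.today()
        return f"INV-{today:%Y%m}-{last_sequence + 1:05d}"

    # next_invoice_number(41, date(2012, 5, 1)) -> "INV-201205-00042"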
After that, there are NO FIXED RULES about what to do and what not to do.
It's just your practice and experience and the degree of freedom you want to have.
Bye:-)

Using Lucene.Net as a primary lookup for lists before heading to the database, is this a good idea?

First of all, we are building an ASP.NET web application. I do not want to use Lucene as a database, per se, but rather as the primary look-up for displaying lists to the user. This would be a canned search against Lucene where we would pull, say, all user information to be displayed in a grid list. Is it a good idea to pull, from Lucene initially, a list of items (that can be paged) to display to the user in some sort of grid format? The only time we would call the database is when a user selects a specific record to view or update.
My concern is stale data coming from Lucene. I have been looking for information about adds and updates to an index, but it is unclear to me whether my scenario is better suited to a database rather than Lucene. My other developers and I have been going back and forth about this, but unfortunately, we don't know enough about how Lucene handles writes and reads.
I'm not sure if it's a good or bad fit for your use case. Hopefully I can give you some insight on how Lucene stores its data, and you can make a decision from that.
Lucene is extremely quick if you want to search for an item in its index. The time it takes to index items isn't so quick. It's by no means slow if you look at everything it's doing, but it adds complexity to know what you need to do about it.
Lucene is essentially a document store. So each item in Lucene is a Document, which can hold a certain number of fields. Those fields are essentially key-value pairs, though right now Lucene only supports string and byte[] as value types, and strings only as keys. Each field can be indexed and/or analyzed (or neither). Indexing simply means you can search against that field's data, generally only via exact matches and wildcards. Analyzing gives you better searching capabilities, since it will take the string and tokenize it. Depending on the analyzer, it will tokenize it differently. The most common is whitespace and stopwords; essentially marking each word as a term unless it's something like (a, an, the, as, etc...).
The real killer for many use cases: you can't update a document in an index in place. When you pull out a document to update it and change a field, the call to UpdateDocument() actually marks the old document as deleted and inserts a new document.
Notice I said it marks it as deleted. That introduces another thing related to Lucene indexes: optimization of the index. When you write to an index, every so often a segment of the index is written to disk. (It's temporarily stored in RAM for fast indexing.) When you run a search on an index, Lucene needs to open all those different segments to find the terms to search against (it has to order them in a way, too). This means if you have many segments, searching can be slow. A call to Optimize() will not only merge the segments together, it will also remove any documents marked for deletion, thus lowering your index size as well.
However, optimizing your index requires around 1.5x more space while the optimization is being done, sometimes more. Fortunately, Lucene.net is transactional during an optimization, which means not only will your index not be corrupt if an optimization fails, any existing IndexReader you have open will still be able to search and read from the index when you're optimizing it.
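To make that concrete, here is a toy model of the behaviour described above (this is not Lucene's API, just an illustration of update-as-delete-plus-insert and of what optimizing reclaims):

    # Toy model only; real Lucene uses segments, terms and tombstone bitsets.
    class ToyIndex:
        def __init__(self):
            self.docs = []        # every document ever added, in write order
            self.deleted = set()  # positions marked as deleted, not yet reclaimed

        def add(self, doc):
            self.docs.append(doc)

        def update(self, key, doc):
            # An "update" marks the old document(s) deleted and appends a new one.
            for i, old in enumerate(self.docs):
                if old.get("id") == key and i not in self.deleted:
                    self.deleted.add(i)
            self.docs.append(doc)

        def optimize(self):
            # Merging drops the tombstoned documents, which is why the index shrinks.
            self.docs = [d for i, d in enumerate(self.docs) if i not in self.deleted]
            self.deleted.clear()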
In short, if it were me: if you're expecting to get only one result from each search, I may not recommend Lucene. Lucene especially shines when you're searching through many documents and expecting many documents back. It's an inverted index and it's good at that. For a single lookup, you may be better off with a database. Unfortunately, the only way you'll really find out is to benchmark it. Fortunately, at least Lucene.Net is very easy to set up for something like that.
Also, if you do use Lucene.Net, consider our 2.9.4g branch. You may not be able to use it, since it is technically not release code, but it is a bit faster than normal lucene, as we've added generics and removed a bit of the costly boxing done in previous versions.
Lucene is not a good fit for the scenario you're describing. You're looking at caching data.
Why not use the ASP.NET cache? If you need a more robust caching solution, there's memcached and a whole host of other ones ... even NoSQL stores like Mongo, Redis, etc.
Obviously, you'll need to manually remove items from the cache on updates to stop serving stale data.
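Whatever store you pick, the pattern is the same cache-aside idea; a rough sketch (Python, with a dict standing in for both the cache and the database):

    database = {"users": ["alice", "bob"]}  # stand-in for the real data store
    cache = {}

    def get_users():
        if "users" not in cache:                      # cache miss
            cache["users"] = list(database["users"])  # load from the database and cache it
        return cache["users"]

    def add_user(name):
        database["users"].append(name)  # the write goes to the database...
        cache.pop("users", None)        # ...and the cached list is evicted, so no stale reads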
I think this is a viable solution, and I say this because there is a major open source content management system that is using a technique very similar to what you've described. It's called Umbraco, and its version 5 is going to be using a customized version of Lucene.NET as a sort of cache.
You can look at the project and source here: http://umbraco.codeplex.com/SourceControl/changeset/view/5a7c9af9bbf9

RESTful collections & controlling member details

I have come across this issue a few times now, and each time I make a fruitless search to come up with a satisfying answer.
We have a collection resource which returns a representation of the member URIs, as well as a Link header field with the same URIs (and a custom relation type). Often we find that we need specific data from each member in the collection.
At one extreme, we can have the collection return nothing but the member URIs; the client must then query each URI in turn to determine the required data from each member.
At the other extreme, we return all of the details we might want on the collection. Neither of these is perfect; the first can result in a large number of API calls, and the second may return a lot of potentially unneeded information.
Of the two extremes I favour the second in our case, since we rarely use this for more than one situation. However, for a more general approach, I wondered if anyone had a nice way of dynamically specifying which details should be included for each member of the collection? I guess a query string parameter would be most appropriate, but I don't want to break the self-descriptiveness of the resource.
I prefer your first option:
"At one extreme, we can have the collection return nothing but the member URIs; the client must then query each URI in turn to determine the required data from each member."
If you want to reduce the number of HTTP calls over the wire, for example when calling a service from a handset app (iOS/Android), you can include an additional header to request the child resources:
X-Aggregate-Resources-Depth: 2
Your server side code will have to aggregate the resources to the desired depth.
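On the client side that might look something like this (a sketch; the URL is a placeholder, and the header is the custom one suggested above, not a standard):

    import urllib.request

    # Ask the server to embed child resources two levels deep.
    req = urllib.request.Request(
        "https://api.example.com/orders",            # placeholder URL
        headers={"X-Aggregate-Resources-Depth": "2"},
    )
    # response = urllib.request.urlopen(req)         # uncomment against a real endpoint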
Sounds like you're trying to reinvent PROPFIND (RFC 4918, Section 9.1).
I regularly include a subset of elements in each item within a collection resource. How you define the different subsets is really up to you. Whether you do,
/mycollectionwithjustlinks
/mycollectionwithsubsetA
/mycollectionwithsubsetB
or you use query strings
/mycollection?itemfields=foo,bar,baz
Either way, they are all different resources. I'm not sure why you believe this affects the self-descriptiveness constraint.
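Handling the query-string version on the server can be as simple as projecting each item down to the requested fields; a sketch (field names are placeholders):

    def project(items, itemfields):
        # itemfields is the raw value of ?itemfields=..., e.g. "foo,bar,baz"; empty means "everything".
        wanted = [f for f in itemfields.split(",") if f] if itemfields else None
        if not wanted:
            return items
        return [{k: v for k, v in item.items() if k in wanted} for item in items]

    items = [{"foo": 1, "bar": 2, "baz": 3, "qux": 4}]
    print(project(items, "foo,bar,baz"))  # [{'foo': 1, 'bar': 2, 'baz': 3}]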

Is pubDate/guid essential to RSS? How do I create a good RSS feed in Yahoo! Pipes if the source doesn't provide different dates for the items?

I am creating a Yahoo! Pipe for a news site, but the feedless source doesn't have a date/time for each item. My RSS doesn't work very well: each update makes the RSS reader, Google Reader for instance, mark all read items as unread again. Perhaps that's because of the lack of a pubDate tag or an incorrect guid tag.
How do I create a "pubDate" in Yahoo! Pipes when the source doesn't provide the data?
How do I avoid the "guid" tag being overwritten? (You can set the guid in Pipes, but then Pipes ignores your guid.)
Solution: pubDate isn't necessary; guid is essential. Even if Yahoo! Pipes rewrites the guid, it will work, because Yahoo! Pipes converts your guid text into a hash value that is not modified until the text is modified.
I think the GUID is generated from the link parameter, so it is important to have a unique URL for each feed item. If all the feed items have the same link, they will have the same GUID.
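In other words, something roughly like this is probably happening behind the scenes (a guess at the behaviour, not Pipes' actual code):

    import hashlib

    def guid_for(link):
        # Derive the guid from the item's link; identical links give identical guids.
        return hashlib.md5(link.encode("utf-8")).hexdigest()

    print(guid_for("http://example.com/story-1"))  # stable for the same link
    print(guid_for("http://example.com/story-2"))  # different link, different guid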
I hope that helps.
I am struggling myself to create unique URLs. Have you found any way to achieve it?
Have you looked at Feedity - http://feedity.com - for creating custom RSS feeds? It's like Pipes, but much easier to use, and in fact works well within Pipes as well. I've been using it for a while to create RSS feeds for those "feedless" webpages.
Well, for future reference, the solution can be found in this link. It also serves well for putting in a date. Basically, what it does is create a node, copying all the needed fields as its subnodes, and then at the end it replaces the parent with this "cloned" child.
I don't have a definitive answer for you, but anecdotally I have been maintaining a private feed reader for the last 4 years or so. I've been exposed to a lot of the vagaries of RSS/Atom and I can tell you that a lot of feeds don't have dates associated with the items. It might be an RSS version issue.
Last time I rebuilt my site, I had a bunch of trouble with the feed, in the ways you describe: read items becoming unread on the next update, duplicate entries. Turns out the problem was more to do with the guid element than the pubDate. As far as I recall, it didn't matter too much what I did with the date (I had the format wrong for a while) as long as the guid was unique.
With Yahoo Pipes, using the 'Create RSS' module, it appears to use (a hashed version of) each entry's link to generate a GUID, which as you point out, is necessary for most feed readers to detect new entries.
I've attempted to set the 'Create RSS' module's GUID field to a value that's unique for each entry; however, the GUID in the resultant feed remains identical for each entry. When I then set the link to this value, the GUIDs generated were unique for each entry.
I have verified this by making a copy of your pipe and removing (well, renaming) the link attribute and no GUID is generated (although you have specified one). This has been confirmed by others as a bug, see tinyurl.com/mxard2.
The problem could be with the source of your feed. If you are using multiple feeds, then after the union operation in Pipes, do a sort operation on pubDate and then redirect it to the output.
I've just been doing this myself, and have resorted to appending a random number to the URL that I'm using to get the data from (I'm scraping using YQL). I'm generating that random number by using a Date Builder and populating it with "today" to get the current date/time. I'm then using a URL Builder to build up the URL that I'm requesting, passing in an extra parameter of "randomnumber" to which I'm assigning the DateTime.utime value.
Having looked at the generated RSS feed via view source, the articleId now does appear to be unique, but I haven't left it long enough to know if Google Reader etc. sees it as different.
