Can an RSS guid be considered globally unique? - rss

I need to store new items from multiple RSS feeds in a database. I would like to use the GUID tag of each item to determine whether it already exists in the database.
See the W3C specification:
guid stands for globally unique identifier. It's a string that uniquely identifies the item. When present, an aggregator may choose to use this string to determine if an item is new.
...
There are no rules for the syntax of a guid. Aggregators must view them as a string. It's up to the source of the feed to establish the uniqueness of the string.
So my question is, is it safe to consider a GUID unique among different feeds? Or will I need to combine the GUID with the feed it comes from, to make sure there are no duplicate GUIDs?

The GUID is not even mandatory, so in my opinion it is not safe to consider it unique. I'd suggest you read this blog post about rss feed duplicate detection.

Unfortunately, they should not be considered unique. However, even if the RSS 2.0 spec says they're optional, they should be very strongly recommended as the most efficient mechanism to distinguish new entries from old ones.
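A minimal sketch of the approach both answers point towards: scope each item's identity to the feed it came from, and fall back to other fields when the guid is missing. The function names, the fallback order (guid, then link, then title) and the in-memory `seen` set are assumptions for illustration; a real application would use a unique database column instead.

```python
import hashlib
from typing import Optional


def item_key(feed_url: str, guid: Optional[str],
             link: Optional[str] = None, title: Optional[str] = None) -> str:
    """Build a storage key that is unique per feed, not globally."""
    # guid is optional in RSS 2.0, so fall back to link, then title.
    identity = guid or link or title or ""
    raw = f"{feed_url}\n{identity}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


# Stand-in for a UNIQUE column in the database.
seen = set()


def is_new(feed_url: str, guid: Optional[str] = None,
           link: Optional[str] = None, title: Optional[str] = None) -> bool:
    """Return True the first time an item is seen for a given feed."""
    key = item_key(feed_url, guid, link, title)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this key, the same guid arriving from two different feeds is stored as two separate items, while a repeat from the same feed is recognised as a duplicate.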

Related

How to choose a good value set for CosmosDB id field?

According to the docs, the property id is special in Azure CosmosDB documents as it must always be set and have a unique value per partition. It also has additional restrictions on its content:
The following characters are restricted and cannot be used in the Id
property: '/', '\', '?', '#'
Obviously, this field is one of the document "keys" (in addition to _rid) and is used somehow in the internal plumbing. Other than the restrictions above, it is unclear how exactly this key is used internally and, more importantly for practitioners, which values make technically better ids than others.
Wild guess 1: In some DB worlds, one would prefer short primary key values, since the PK is included in index entries and shorter keys allow a more compact index for storage and lookup. Would id field length matter at all besides the one-time storage cost?
Wild guess 2: In some systems better throughput is achieved if common prefixes are avoided in names (e.g. Azure storage container/blob names), and it is even suggested to add a small random hash as a prefix. Does CosmosDB care about id prefix similarities?
Anything else one should consider?
EDIT: Clarification, I'm interested in what's good for the cosmosDB server storage/execution side, provided my data model is still in design and/or has multiple keys available the data designer can choose from.
First and foremost, let's clear something up. The id property is NOT unique. Your collection can have multiple documents that have the exact same id. The id is ONLY unique within its own logical partition.
That said, based on all the compiled info that we know from documentation and talks it doesn't really matter what value you choose to go with. It is a string and Cosmos DB will treat it as such but it is also considered as a "Primary key" internally so restrictions apply, such as ordering by it.
Where it does matter is in your consuming application's business logic. The id plays a double role of being both a CosmosDB property but also your property. You get to set it. This is the value you are going to use to make direct reads to the database. If you use any other value, then it's no longer a read. It's a query. That makes it more expensive and slower.
A good value to set is the id of the entity that is hosted in this collection. That way you can use the entity's id to read quickly and efficiently.
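To make the two points above concrete, here is a toy in-memory model. It is not the Cosmos DB SDK and says nothing about how the service works internally; the class and method names are made up for illustration. It only shows that an id is unique per partition and that a direct read needs both the partition key and the id, while anything else is a scan-like query.

```python
class Collection:
    """Toy stand-in for a container; NOT the Cosmos DB SDK."""

    def __init__(self):
        # Documents are addressed by (partition key, id), not by id alone.
        self._docs = {}

    def create(self, partition_key, doc):
        key = (partition_key, doc["id"])
        if key in self._docs:
            raise KeyError(
                f"id {doc['id']!r} already exists in partition {partition_key!r}"
            )
        self._docs[key] = doc

    def point_read(self, partition_key, doc_id):
        # Direct lookup: possible only when both values are known.
        return self._docs[(partition_key, doc_id)]

    def query(self, predicate):
        # Anything else has to examine documents one by one.
        return [d for d in self._docs.values() if predicate(d)]
```

In this model the same id can be created once in each partition, and a second create with the same id in the same partition fails.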

How to retrieve resources based on different conditions using GET in RESTful api?

As per REST, we can access resources using the GET method, which is fine if I know the key of my resource. For example, to get a transaction, if I pass transaction_id then I can get the resource for that transaction. But when I want to access all transactions between two dates, how should I write my REST method using GET?
For getting the transaction with a given transaction_id: GET /transaction/id
For getting transactions between two dates: ???
Also, if there are other conditions I need to apply, like the latest 10 transactions or the oldest 10 transactions, how should I write my URL, which is the main key in REST?
I tried searching Google but was not able to find a way that is completely RESTful and answers my questions, so I am posting my question here. I have a clear understanding of POST and DELETE, but if I want to do the same kind of conditional update using PUT for some resource, how do I do it?
There are collection and item resources in REST.
If you want to get a representation of an item, you usually use a unique identifier:
/books/123
/books/isbn:32t4gf3e45e67 (not a valid isbn)
or with a template:
/books/{id}
/books/isbn:{isbn}
If you want to get a representation of a collection, or a reduced collection you use the unique identifier of the collection and add some filters to it:
/books/since:{fromDate}/to:{toDate}/
/books/?since="{fromDate}"&to="{toDate}"
the filters can go into the path or into the queryString part of the url.
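As a rough sketch of the query-string variant applied to the transactions from the question, the helper below builds collection URLs with optional filters. The parameter names (`since`, `to`, `sort`, `limit`) and the `-date` sort convention are assumptions, not something REST prescribes; the server would have to define and document them.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode


def collection_url(base: str,
                   since: Optional[date] = None,
                   to: Optional[date] = None,
                   sort: Optional[str] = None,
                   limit: Optional[int] = None) -> str:
    """Append only the filters that were actually supplied."""
    params = {}
    if since is not None:
        params["since"] = since.isoformat()
    if to is not None:
        params["to"] = to.isoformat()
    if sort is not None:
        params["sort"] = sort  # e.g. "-date" for newest first (assumed convention)
    if limit is not None:
        params["limit"] = limit
    query = urlencode(params)
    return f"{base}?{query}" if query else base
```

For example, `collection_url("/transactions/", since=date(2024, 1, 1), to=date(2024, 1, 31))` covers the date range, and `collection_url("/transactions/", sort="-date", limit=10)` would express "latest 10" under the assumed sort convention.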
In the response you should add links with these URLs (aka HATEOAS), which the REST clients can follow. You should use link relations, for example IANA link relations to describe those links, and linked data, for example schema.org or to describe the data in your representation. There are other vocabs as well, for example GoodRelations, and ofc. you can write your own vocab as well for your application.

What is the best way to implement multilingual domain objects using NHibernate?

What is the best way to design domain objects that can have multilingual fields? An example would be a Product class whose Description is multilingual.
I have found a few links but could not decide which one is the best way.
http://fabiomaulo.blogspot.com/2009/06/localized-property-with-nhibernate.html
(This stores all localised language data in one field. Can be a problem if we query from Sql)
http://ayende.com/Blog/archive/2006/12/26/LocalizingNHibernateContextualParameters.aspx
(This one has a warning at the beginning that it is a hack and no longer supported)
http://www.webdevbros.net/2009/06/24/create-a-multi-languaged-domain-model-with-nhibernate-and-c/
(This does not describe how multilingual data will be structured in the database.)
Does anyone have experience using NHibernate with multilingual data? Is there a better way?
The third option looks great. The hibernate mapping is given, but not the database schema - if that's what you are missing, then I'll sketch it out here:
dictionary
----------
ID: int - identity
name: nvarchar(255)
phrase
------
dictionary_id:int (fkey dictionary.ID)
culture_id:int (LCID)
phrase:nvarchar(255) - this is the default size - seems too small
According to this blog entry, 255 is the default string length for String values. To overcome the short string length on the phrase text, you can change the <element> tag to
<element column="phrase" type="String" length="4001"></element>
To use this in your domain model, you add a PhraseDictionary property to your entity where you want translatable text, e.g. the title property or description property.
I think the article describes a great approach, and it is the one that I would go for.
EDIT: In response to the comments, make the length less than 4001 if you know the absolute maximum size is less than that, as this will typically be faster. Also, NHibernate will lazily fetch the collection, but it may fetch all the items at once. You can profile to determine if this has any performance implications. (If you have only a handful of languages then I doubt you will see a difference.) If you have many languages (Say 50+) then it may be worthwhile creating custom properties to fetch the localized text. These will issue queries to fetch specifically the text required. More importantly, you may be able to fetch all the text for a given entity in one query, rather than each localized text property as a separate query.
Note that this extra effort is only needed if profiling gives you reason to be concerned about the performance. Chances are that the implementation in the article as is will function more than adequately.
I only have experience with Hibernate, but since NHibernate is so similar:
One option is to define a component type MultiLingualString with members for each language (this assumes the set of languages is known at coding time). This type is also a convenient location to place a getter for the string by language id.
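class MultiLingualString {
    String english;
    String chinese;
    String klingon;

    // Returns the text for the requested language.
    // Language is assumed to be an enum with one constant per supported language.
    String forLanguage(Language lang) {
        switch (lang) {
            case ENGLISH: return english;
            case CHINESE: return chinese;
            case KLINGON: return klingon;
            default: throw new IllegalArgumentException("Unsupported language: " + lang);
        }
    }
}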
This results in the strings for all languages being stored in separate columns in the database while the representation in the object world retains fine granularity.
The advantage is that no join is required to fetch the strings. On the other hand, the only way not to fetch a string with this approach is to use a projection, which is a severe limitation if the strings are large, numerous and rarely needed.
If you do this a lot, writing a UserType might be worth it.
From a strictly database oriented standpoint with SQL Server, you should have one table with all of the base data (record key, dates, numbers, etc.) and one table with all of the translatable string data. Let's call the two tables Base and Base_Description.
Base ensures that there is a single key for each record, the key might be a string or auto-generated id depending on your particular use case.
The Base_Description table is related to the Base table, but also contains a value to select the language that the data is in. In my projects we use the langid column from sys.languages because we can set the language of the connection with SET LANGUAGE and then grab it with @@LANGID for most operations.
In our testing we found this to be significantly faster than having multiple fields for each language. It also allows you to add other languages more easily. We are also using SQL Server Full-Text indexing and it fully works with this method. You should index in the neutral language and then you can pick the language to search against at run time (also filtering against the LangID column in Base_Description).
Do your requirements include the domain objects actually having multiple-language properties in the same object? And, if so, is it unlimited translations stored in the object (in a collection, say - in which case I would say that it would need to be just like any master/detail or parent/child collection) or fixed translations, in which case the languages (and thus the mapping to results of a stored proc or whatever) have to be determined statically anyway?
In many internationalized applications I worked on, the data was in only one language - customer names, the product names (there was no point in mapping even identical products used in one country to products in another, they all had different distributors and different SKUs, and of course localized pricing). The interface was also only in one language (at a time). So all the domain objects only required one language at a time. Thus the language of the translation would be determined when the object was instantiated.
We had translation user interfaces which allowed users to update the translated texts, but these only required two languages at a time (local and the default). I can see this being closest to what you are talking about. I guess that you would have child collections for each translatable property with all the possible translations in the collection. This would probably be closest to the second solution in the third article you linked. Of course, at this point you would also need to see if you want eager/lazy loading etc.

Shorter GUID using CRC

I am making a website in ASP.NET and want to be able to have a user profile which can be accessed via a URL with the user's id at the end. A unique identifier (GUID) is obviously a bad choice as it is long and (correct me if I am wrong) not really URL friendly.
I was wondering: if I produced a unique identifier on the ASP page and then hashed it using CRC (or something similar), would it still be as unique (or even unique at all) as just a GUID?
For example:
The GUID 6f1a7841-190b-4c7a-9f23-98709b6f8848 equals CRC E6DC2D44.
Thanks
A CRC of a GUID would not be unique, no. That would be some awesome compression algorithm otherwise, to be able to put everything into just 4 bytes.
Also, if your users are stored in the database with a GUID key, you'd have trouble finding the user that matches up to this particular CRC.
You'd be better off using a plain old integer to uniquely identify a user. If you want to have the URL unguessable, you can combine it with a second ticket (or token) parameter that's randomly generated. It doesn't have to be unique, because you use the integer ID for identifying the user. You can think of it more or less as a password.
Any calculated hash contains less information (bits) than the original data and can never be as unique. There are always collisions.
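To put a number on "there are always collisions", the sketch below uses the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2N), for n random values drawn from a space of N = 2^bits. It assumes the hash output behaves like a uniformly random value, which is only an approximation for CRC32, and the function name is made up for illustration.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Approximate chance that at least two of n_items random values collide."""
    space = 2 ** bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))
```

Under that assumption, a 32-bit value such as a CRC32 reaches roughly a 50% chance of at least one collision at about 77,000 users, whereas the 122 random bits of a version 4 GUID keep the probability negligible at the same scale.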
If the users have a username then why not use that? It should be unique (I would hope!) and would probably be short and URL friendly. It would also be easy for users to remember, and fits in with the ASP.NET membership scheme (since usernames are the "primary key" in membership providers). I don't see any security issue as (presumably) only authenticated users would be able to access it anyway.
No, it won't be as unique, because you're losing information from it. If you take a 32 character hex string and convert it to an 8 character hex string then, by definition, you're losing 75% of the data.
What you can do is use more characters to represent the data. A GUID's text form uses only 16 distinct characters (base 16), so you could use a higher base (e.g. base 64), which lets you encode the same amount of information in fewer characters.
I don't see any problem with a normal GUID in an HTTP URL. If you want a shorter form of the Guid, use the below.
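A small sketch of the higher-base idea, using URL-safe base 64 from the Python standard library. The function names are made up for illustration. Note that Python's `uuid.UUID.bytes` is big-endian, whereas .NET's `Guid.ToByteArray()` stores the first three fields little-endian, so the exact characters will differ between the two platforms even though the length is the same.

```python
import base64
import uuid


def shorten(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes as URL-safe base 64 without padding."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def expand(s: str) -> uuid.UUID:
    """Restore the padding and decode back to the original UUID."""
    padded = s + "=" * (-len(s) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))
```

The result is 22 characters instead of 36, uses only letters, digits, `-` and `_`, and loses no information, so it round-trips back to the same GUID.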
var gid = Guid.NewGuid().ToString("N");
This will give a GUID without any hyphen or special characters.
A GUID is globally unique, meaning that you won't run into clashes, hopefully ever. These are usually based on some sort of time-based calculation with randomness injected. If you want to shorten something using a hash, such as CRC, then uniqueness is not automatic, but as long as you manage your uniqueness yourself (checking whether the hash is already assigned to another user and, if so, regenerating until you get a unique one) then you could use almost anything.
This is the way a lot of url-shorteners work.
If you use a CRC of a UUID/GUID as ID you could also use a shorter ID in the first place.
The idea of a UUID/GUID as ID is IMO that you can create IDs on disconnected systems and should have no problem with duplicate IDs.
Who is going to enter the URL for the profile page by hand anyway?
Also I see no problems with URL friendliness of a UUID/GUID - there are no chars which are not allowed by HTTP.
How are users identified in the database (or any other place you use to store your data)?
If they are identified using this GUID, I hope you have a really good reason for it, because this makes searching for a specific ID really complicated (even when using a binary tree); more space is also needed to store these values.
If they are identified by a unique integer value, why not use this to call the user profile?
You can shorten a GUID to 20 printable ASCII characters, with it still being unique and without losing any information.
Take a look at this blog post by Jeff Atwood:
Equipping our ASCII Armor

Are pubDate/guid essential to RSS? How do I create a good RSS feed in Yahoo! Pipes if the source doesn't provide different dates for the items?

I am creating a Yahoo! Pipe for a news site, but the feedless source doesn't have a date/time for each item. My RSS doesn't work very well: each update makes the RSS reader, Google Reader for instance, mark all read items as unread again. Perhaps that's because of the missing pubDate tag or an incorrect guid tag.
How do I create a "pubDate" in Yahoo! Pipes when the source doesn't provide the data?
How do I avoid the "guid" tag being overwritten? (You can set the guid in YPipes, but then YPipes ignores your guid.)
Solution: pubDate isn't necessary. guid is essential. Even if Yahoo! Pipes rewrites the guid, it will work, because Yahoo! Pipes converts your guid text into a hash value that does not change until the text is modified.
I think the GUID is generated from the link parameter. So it is important to have a unique url for each feed item. If all the feed urls have same link, they will have same GUID.
I hope that helps.
I am struggling myself to create a unique url. Have you found any way to achieve it?
Have you looked at Feedity - http://feedity.com - for creating custom RSS feeds. It's like Pipes, but much easier to use, and in fact works well within Pipes as well. I've been using it for a while to create RSS feeds for those "feedless" webpages.
Well, for future reference, the solution can be found in this link. It also serves well for putting a date. Basically what it does is to create a node copying as its subnodes all the needed fields, and then at the end it replaces the parent with this "cloned" child.
I don't have a definitive answer for you, but anecdotally I have been maintaining a private feed reader for the last 4 years or so. I've been exposed to a lot of vagaries of RSS/ATOM and I can tell you that a lot of feeds don't have dates associated with the items. It might be an RSS version issue.
Last time I rebuilt my site, I had a bunch of trouble with the feed, in the way you describe: read items becoming unread on the next update, and duplicate entries. It turns out the problem had more to do with the guid element than the pubDate. As far as I recall, it didn't matter too much what I did with the date (I had the format wrong for a while) as long as the guid was unique.
With Yahoo Pipes, using the 'Create RSS' module, it appears to use (a hashed version of) each entry's link to generate a GUID, which as you point out, is necessary for most feed readers to detect new entries.
I've attempted to set the 'Create RSS' module's GUID field to a value that's unique for each entry, however the GUID in the resultant feed remains identical for each entry. When I then set the link to this value, the GUIDs generated were unique for each entry.
I have verified this by making a copy of your pipe and removing (well, renaming) the link attribute and no GUID is generated (although you have specified one). This has been confirmed by others as a bug, see tinyurl.com/mxard2.
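For anyone generating a feed outside Pipes, the same idea can be sketched as deriving the guid from a hash of each entry's link, so that it stays stable across updates as long as the link does not change. The actual hash Pipes uses is not known here; SHA-1 and the function name are assumptions chosen for illustration.

```python
import hashlib


def guid_from_link(link: str) -> str:
    """Derive a stable guid from an entry's link (SHA-1 is an assumed choice)."""
    return hashlib.sha1(link.encode("utf-8")).hexdigest()
```

Two entries that share the same link get the same guid, which matches the behaviour described above: the links have to differ per entry for readers to tell the entries apart.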
The problem could be with the source of your feed. If you are using multiple feeds, then after the union operation in Pipes, do a sort operation on pubDate and then redirect it to the output.
Just been doing this myself, and have resorted to appending a random number to the url that I'm using to get the data from (I'm scraping using YQL). I'm generating that random number by using a Date Builder and populating it with "today" to get the current date/time. I'm then using a URL Builder to build up the url that I'm requesting, passing in an extra parameter of "randomnumber" to which I'm assigning the DateTime.utime value.
Having looked at the generated RSS feed via view source, the articleId now does appear to be unique, but I haven't left it long enough to know if google reader etc sees it as different.

Resources