I am making a website in ASP.NET and want to have a user profile that can be accessed via a URL with the user's ID at the end. A unique identifier (GUID) is obviously a bad choice, as it is long and (correct me if I am wrong) not really URL friendly.
I was wondering: if I produced a unique identifier on the ASP page and then hashed it using CRC (or something similar), would it still be as unique (or even unique at all) as just a GUID?
For example:
The GUID 6f1a7841-190b-4c7a-9f23-98709b6f8848 equals CRC E6DC2D44.
Thanks
A CRC of a GUID would not be unique, no. That would be some awesome compression algorithm otherwise, to be able to put everything into just 4 bytes.
Also, if your users are stored in the database with a GUID key, you'd have trouble finding the user that matches up to this particular CRC.
You'd be better off using a plain old integer to uniquely identify a user. If you want to have the URL unguessable, you can combine it with a second ticket (or token) parameter that's randomly generated. It doesn't have to be unique, because you use the integer ID for identifying the user. You can think of it more or less as a password.
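A minimal sketch of the integer-plus-token idea, assuming a cryptographically random 16-byte token stored next to the user row. The class name ProfileToken and the URL shape /profile/{id}?t={token} are made up for illustration and are not part of ASP.NET.

```csharp
using System;
using System.Security.Cryptography;

public static class ProfileToken
{
    // Returns a URL-safe token built from `byteCount` cryptographically random bytes.
    public static string NewToken(int byteCount = 16)
    {
        var bytes = new byte[byteCount];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        // Base64 uses '+', '/' and '=' which are awkward in URLs: swap or drop them.
        return Convert.ToBase64String(bytes)
            .TrimEnd('=')
            .Replace('+', '-')
            .Replace('/', '_');
    }

    // Hypothetical URL shape: the integer identifies the user, the token makes the URL unguessable.
    public static string ProfileUrl(int userId, string token)
    {
        return "/profile/" + userId + "?t=" + token;
    }
}
```

On a request, the user would be looked up by the integer ID and the supplied token compared with the stored one; the token never has to be unique.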
Any hash that is shorter than the original data contains less information (fewer bits) and can never be as unique. There are always collisions.
If the users have a username then why not use that? It should be unique (I would hope!) and would probably be short and URL friendly. It would also be easy for users to remember, and it fits in with the ASP.NET membership scheme (since usernames are the "primary key" in membership providers). I don't see any security issue as (presumably) only authenticated users would be able to access it anyway?
No, it won't be as unique, because you're losing information from it. If you take a 32 character hex string and convert it to an 8 character hex string then, by definition, you're losing 75% of the data.
What you can do is use more characters to represent the data. A GUID's text form uses only 16 distinct characters (base 16), so you could use a higher base (e.g. base 64), which lets you encode the same amount of information in fewer characters.
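As a rough sketch of the higher-base idea, the 16 bytes of a GUID can be written as 22 URL-safe Base64 characters and turned back into the same GUID. The class name GuidCodec is an assumption for illustration; the swap of '+' and '/' for '-' and '_' is one common URL-safe convention, not the only one.

```csharp
using System;

public static class GuidCodec
{
    // Encodes the 16 bytes of a GUID as 22 URL-safe Base64 characters.
    public static string Encode(Guid guid)
    {
        return Convert.ToBase64String(guid.ToByteArray()) // 24 chars, always ending in "=="
            .Substring(0, 22)
            .Replace('+', '-')
            .Replace('/', '_');
    }

    // Reverses Encode; no information is lost.
    public static Guid Decode(string text)
    {
        string base64 = text.Replace('-', '+').Replace('_', '/') + "==";
        return new Guid(Convert.FromBase64String(base64));
    }
}
```

Unlike a CRC, this is an encoding rather than a hash, so the result is exactly as unique as the GUID it came from. The output is case sensitive.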
I don't see any problem with using a normal GUID in an HTTP URL. If you want a shorter form of the GUID, use the following.
var gid = Guid.NewGuid().ToString("N");
This will give a GUID without any hyphens or special characters.
A GUID is globally unique, meaning that you should never run into clashes. GUIDs are usually based on some sort of time-based calculation with randomness added. If you want to shorten something using a hash, such as a CRC, then uniqueness is not automatic, but as long as you manage uniqueness yourself (checking whether the hash is already assigned to another user and, if so, regenerating until you get a unique one), you could use almost anything.
This is the way a lot of url-shorteners work.
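The check-and-regenerate loop could look roughly like the sketch below. It is only an illustration: the class ShortIdAllocator is hypothetical, and the in-memory HashSet stands in for what would really be a unique column in the users table. It is not thread safe as written.

```csharp
using System;
using System.Collections.Generic;

public class ShortIdAllocator
{
    private const string Alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
    private readonly HashSet<string> _used = new HashSet<string>(); // stand-in for a UNIQUE column
    private readonly Random _random = new Random();
    private readonly int _length;

    public ShortIdAllocator(int length)
    {
        _length = length;
    }

    // Generates random candidates until one is found that has not been handed out yet.
    public string Next()
    {
        while (true)
        {
            var chars = new char[_length];
            for (int i = 0; i < _length; i++)
            {
                chars[i] = Alphabet[_random.Next(Alphabet.Length)];
            }
            string candidate = new string(chars);
            // HashSet.Add returns false when the value is already present (a collision).
            if (_used.Add(candidate))
            {
                return candidate;
            }
        }
    }
}
```

The shorter the ID, the more often the loop has to retry, and it never terminates once every possible ID is taken, so the length has to be chosen with the expected number of users in mind.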
If you use a CRC of a UUID/GUID as the ID, you could just as well use a shorter ID in the first place.
The idea of a UUID/GUID as an ID is, IMO, that you can create IDs on disconnected systems and should have no problem with duplicate IDs.
Who is going to enter the URL for the profile page by hand anyway?
Also, I see no problem with the URL friendliness of a UUID/GUID: it contains no characters that are not allowed in an HTTP URL.
How are users identified in the database (or any other place you use to store your data)?
If they are identified using this GUID, I'd assume you have a really good reason for it, because it makes searching for a specific ID more expensive (even when using a binary tree), and more space is needed to store the values.
If they are identified by a unique integer value, why not use that to call up the user profile?
You can shorten a GUID to 20 printable ASCII characters, with it still being unique and without losing any information.
Take a look at this blog post by Jeff Atwood:
Equipping our ASCII Armor
Related
I've decided to use GUIDs as the primary key for many of my project's DB tables. I think it is good practice, especially with scalability, backup and restore in mind. The problem is that I don't want to use the regular GUID format and am looking for an alternative approach. I am interested to know what Pinterest is using as its primary key. When you look at a URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even if it is stored as a string. Is there any way to achieve this?
Furthermore, YouTube uses a different kind of hashing technique, which I can't figure out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me of a URL-shortener scheme.
I prefer the shortest one, but I know that it isn't guaranteed to be unique. I first thought about doing something like this:
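// Use UTC so the value does not shift with the server's time zone or DST.
DateTime dt1970 = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
DateTime current = DateTime.UtcNow;
TimeSpan span = current - dt1970;
Console.WriteLine(span.TotalMilliseconds);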
Result Example:
1350433430523.66
This prints the total milliseconds since 1970. But what happens if I have hundreds of thousands of writes per second?
I mainly prefer a solution other than a BIGINT auto-increment because it causes far fewer headaches when scaling the DB with third-party tools, and backup/restore is less problematic because I can transfer data between servers if I want.
Another, more sophisticated approach is to tailor the solution to my application. In the database, the primary key will also contain the username (unique and not changeable by the user), so I can combine the numerical value of the name with the millisecond number, which will give me a unique numerical string. Because a single user doesn't insert data at such a high rate, the numerical ID is guaranteed to be unique. I could also remove the last 5 digits and still get a unique ID, because I assume that a user won't insert data more than once per second at most, but I probably won't do that (what do you think about this idea?).
So I ask for your help. My data is expected to grow very big, 2TB a year, with tens of thousands of new rows each second. I want URLs to look as "friendly" as possible, and I prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube like GUID's you can see this answer. They are basically keeping a database table of all random video ID's they are generating. When they request a new one, they check the table for any collisions. If they find a collision, they try to generate a new one.
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key, however if you have multiple servers generating unique identifiers you'll have to partition them somehow or introduce a global lock, so each server doesn't generate the same unique identifier.
Twitter Snowflake ID's
One solution to the partitioning problem with long ID's is to use snowflake ID's. This is what Twitter uses to generate its ID's. All generated ID's are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the ID's use timestamp as the first component, they are time sortable (which is very important for query performance).
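The bit layout above can be sketched as follows. This is an illustration of the scheme as described here, not Twitter's actual code: the class name SnowflakeGenerator and the custom epoch of 2020-01-01 are assumptions, and the machine ID would have to come from configuration.

```csharp
using System;

public class SnowflakeGenerator
{
    private const int MachineBits = 10;
    private const int SequenceBits = 12;
    private const long MaxSequence = (1L << SequenceBits) - 1; // 4095

    // Hypothetical custom epoch; any fixed instant in the past works.
    private static readonly DateTime Epoch = new DateTime(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    private readonly long _machineId;
    private readonly object _lock = new object();
    private long _lastTimestamp = -1;
    private long _sequence;

    public SnowflakeGenerator(long machineId)
    {
        if (machineId < 0 || machineId >= (1L << MachineBits))
        {
            throw new ArgumentOutOfRangeException("machineId");
        }
        _machineId = machineId;
    }

    public long NextId()
    {
        lock (_lock)
        {
            long now = CurrentMillis();
            if (now < _lastTimestamp)
            {
                throw new InvalidOperationException("Clock moved backwards.");
            }

            if (now == _lastTimestamp)
            {
                // Same millisecond: bump the per-machine counter.
                _sequence = (_sequence + 1) & MaxSequence;
                if (_sequence == 0)
                {
                    // 4096 IDs already issued in this millisecond: wait for the next one.
                    while ((now = CurrentMillis()) <= _lastTimestamp) { }
                }
            }
            else
            {
                _sequence = 0;
            }

            _lastTimestamp = now;
            // timestamp (41 bits) | machine id (10 bits) | sequence (12 bits)
            return (now << (MachineBits + SequenceBits)) | (_machineId << SequenceBits) | _sequence;
        }
    }

    private static long CurrentMillis()
    {
        return (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
    }
}
```

Each server creates one generator with its own machine ID, for example new SnowflakeGenerator(5), and calls NextId() per row. The result fits in a MySQL BIGINT and sorts by creation time.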
Base64 Encoded GUID's
You can use ShortGuid which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case sensitive which may not be good for URL's if you are lower-casing them.
Base32 Encoded GUID's
There is also base32 encoding of GUID's, which you can see this answer for. These are slightly longer than ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4) but the advantage is that they can be all lower case.
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors, e.g. if Pinterest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made i.e. their next post would be 2. You could also use YouTube's approach with their video ID but specific to a user, this could lead to some ridiculously short URL's.
The first, simplest and most practical scheme for unique keys
is a sequence number that increases in write order.
This represents the record number inside one database and provides unique numbering on a local scale, which is the most commonly encountered application-level requirement.
Next, a numerical approach based on concatenating time and counters is commonly used to ensure that concurrent transactions in the same batch have unique IDs before writing.
When the system becomes highly threaded and distributed, as in highly concurrent situations, some constraints need to be relaxed before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This provides one more level of integrity for the database when the scenarios you mention occur: backup, restore, scale, migrate and perhaps prove some authenticity.
The article Generating Globally Unique Identifiers for Use with MongoDB
by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight into databases and computer science.
UUIDs are 128 bits long. They introduce an amount of entropy
high enough to ensure practical uniqueness of labels.
They can be represented by a string of 32 hex characters.
That is enough for about 3.4 × 10^38 distinct values,
i.e. decimal numbers of up to 39 digits.
Here are a few more questions that can arise when considering the overall principle and the analysis:
Should the primary keys of the database
and the Uniform Resource Locator be kept as two different entities?
Does this numbering destroy the sequentiality in the system?
Does providing a machine host number (h),
followed by a user number (u) and time (t) along with a write index (i),
guarantee that the PK huti stays unique?
Now considering the DB system:
primary keys should be kept numerical (even if hexadecimal),
since the database system relies on them and this has performance implications;
their size should be fixed;
the system must be able to tell quickly whether it is potentially dealing with a PK or not.
Hashids
The hashing technique used by YouTube resembles hashids.
It's a good choice:
the hashes are short and their length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as a short reference to the primary keys),
it can use a salt,
it is designed to hash positive numbers.
However, a true hash (as opposed to a reversible encoding) can collide. Collisions can be detected: the unique constraint is violated when the row is stored, in which case the generation should be run again.
Consider the comment to this answer to figure out how much entropy it's possible to get from a shortened SHA-1 + Base64 recipe.
Anticipating collisions
calls for an estimate of the future size of the database, that is, the potential number of records. Recommended reading: Z. Bloom, How Long Does An ID Need To Be?
Milliseconds since epoch
Quoted from the previous article, which answers most of the problem at hand in a nicely concise style:
It may not be necessary for you to encode every time since 1970
however. If you are only interested in keeping recent records close to
each other, you only need enough values to ensure that you don’t have
more values with the same prefix than your database can cache at once
What you could do is convert a GUID into a purely numeric string by replacing each letter in the GUID with its numeric character code. Here is an example of what that would look like. It's a bit long, but if that is not a problem this could be one way of generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code I used to generate the string above. I would probably recommend using a long primary key instead, though; although it can be a bit of a pain, it's probably a safer way to do it than the function below.
string generateKey()
{
    Guid guid = Guid.NewGuid();
    string newKey = "";
    foreach (char c in guid.ToString().Replace("-", "").ToCharArray())
    {
        if (char.IsLetter(c))
        {
            newKey += (int)c;
        }
        else
        {
            newKey += c;
        }
    }
    return newKey;
}
Edit:
I did some testing taking only the first 20 digits, and out of 5,000,000 generated keys 4,999,978 were unique. When using the first 25 digits, all 5,000,000 were unique. I would recommend doing some more testing if you go with this method.
I am debating using user-names as a means to salt passwords, instead of storing a random string along with the names. My justification is that the purpose of the salt is to prevent rainbow tables, so what makes this realistically less secure than another set of data in there?
For example,
hash( md5(johnny_381#example.com), p4ss\/\/0rD)
vs
hash( md5(some_UUID_value), p4ss\/\/0rD)
Is there a real reason I couldn't just stick with the user name and simplify things? The only thing my web searching turned up was debates about how a salt should be like a password, which ended without any reasoning behind them. I'm under the impression that the salt is just there to stop something like a Cain & Abel cracker from running against the hashes in anything less than a million years. Considering real-world processing limits, I don't believe it is a big deal if people know the hash: they still don't know the password, and they would need something in the supercomputer range to brute-force each individual hash.
Could someone please enlighten me here?
You'll run into problems when the username changes (if it can be changed). You cannot recompute the stored hash with the new salt, because you don't store the unsalted, unhashed password.
I don't see a problem with utilizing the username as the salt value.
A more secure way of storing passwords involves using a different salt value for each record anyway.
If you look at the aspnet_Membership table of the asp.net membership provider you'll see that they have stored the password, passwordsalt, and username fields in pretty much the same record. So, from that perspective, there's no security difference in just using the username for the salt value.
Note that some systems use a single salt value for all of the passwords, and store that in a config file. The only difference in security here is that if they gained access to a single salt value, then they can more easily build a rainbow table to crack all of the passwords at once...
But then again, if they have access to the encrypted form of the passwords, then they probably would have access to the salt value stored in the user table right along with it... Which might mean that they would have a slightly harder time of figuring out the password values.
However, at the end of the day I believe nearly all applications fail on the encryption front because they only encrypt what is ostensibly one of the least important pieces of data: the password. What should really be encrypted is nearly everything else.
After all, if I have access to your database, why would I care if the password is encrypted? I already have access to the important things...
There are obviously other considerations at play, but at the end of the day I wouldn't sweat this one too much as it's a minor issue compared others.
If you use the username as the salt and there are many installations of your application, people may create rainbow tables for specific users like "admin" or "system", as is the case with Oracle databases, or for a whole list of common names, as was done for WPA (coWPAtty).
You had better use a truly random salt; it is not that difficult and it will not come back to haunt you.
This method was deemed secure enough for the working group that created HTTP digest authentication which operates with a hash of the string "username:realm:password".
I think you would be fine seeing as this decision is secret. If someone steals your database and source code to see how you actually implemented your hashing, well what are they logging in to access at that point? The website that displays the data in the database that they've already stolen?
In this case a salt buys your user a couple of security benefits. First, if the thief has precomputed values (rainbow tables) they would have to recompute them for every single user in order to do their attack; if the thief is after a single user's password this isn't a big win.
Second, the hashes for all users will always be different even if they share the same password, so the thief wouldn't get any hash collisions for free (crack one user get 300 passwords).
These two benefits help protect your users that may use the same password at multiple sites even if the thief happens to acquire the databases of other sites.
So while a salt for password hashing is best kept secret (which in your case the exact data used for the salt would be) it does still provide benefits even if it is compromised.
Random salting prevents comparison of two independently-computed password hashes for the same username. Without it, it would be possible to test whether a person's password on one machine matched the one on another, or whether a password matched one that was used in the past, etc., without having to have the actual password. It would also greatly facilitate searching for criteria like the above even when the password is available (since one could search for the computed hash, rather than computing the hash separately for each old password hash value).
As to whether such prevention is a good thing or a bad thing, who knows.
I know this is an old question, but this is for anyone searching for a solution based on it.
If you use a derived salt (as opposed to a random salt), the salt source should be strengthened by using a key derivation function like PBKDF2.
Thus, if your username is "theunhandledexception", pass that through PBKDF2 for x iterations to generate a 32 bit (or whatever length salt you need) value.
Make x pseudo random (as opposed to a round number like 1,000) and pass in a static, site-specific salt to PBKDF2, and you make it highly improbable that your username salt will match any other site's username salt.
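A sketch of such a derived salt, using Rfc2898DeriveBytes (the PBKDF2 implementation in .NET). The class name DerivedSalt, the site salt string, the iteration count 10007 and the 16-byte output length are placeholders chosen for illustration. The four-argument constructor with a HashAlgorithmName is assumed to be available (.NET Framework 4.7.2 or later, .NET Core 2.0 or later).

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class DerivedSalt
{
    // Hypothetical site-wide values; choose your own and keep them fixed.
    private static readonly byte[] SiteSalt = Encoding.UTF8.GetBytes("example-site-salt-value");
    private const int Iterations = 10007; // deliberately not a round number

    // Derives a per-user salt from the username; the same input always gives the same salt.
    public static byte[] FromUsername(string username, int saltBytes = 16)
    {
        using (var kdf = new Rfc2898DeriveBytes(username, SiteSalt, Iterations, HashAlgorithmName.SHA256))
        {
            return kdf.GetBytes(saltBytes);
        }
    }
}
```

Because the salt is recomputed from the username, it still changes when the username changes, so the concern raised earlier about renamed users applies to this variant too.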
One requirement is that when persisting my C# objects to the database I must decide the database ID (surrogate primary key) in code.
Second requirement is that the database type for the key must be int or char(x)... so no uniqueidentifier or binary(16) or the like.
These are unchangeable requirements.
What would be the best way to go about handling this?
One idea is the base64 encoded GUIDs looking like "XSiZtdXcKU68QWe7N96Dig". These are easily created in code and are to me acceptable in URLs if necessary. But will it be too expensive regarding performance (indexing, size) having all primary and foreign keys be char(22)? Off hand I really like this idea.
Another idea would be to create a code version of a database sequence that produces incrementing integers for me. But I don't know if this is feasible, and I would need some guidance to make it reliable. The sequencer must know how far it has got, and what about threads that I don't control, etc.?
I imagine that no table involved will ever exceed 1.000.000 rows... will probably be far less.
You could have a table called "sequences". For each table there would be a row with a counter. Then, when you need another number, fetch it from the counter table and increment it. Put it in a transaction and you will have uniqueness.
However this will suffer in terms of performance, of course.
A simple incrementing int would be the easiest way to ensure uniqueness. This is what the database will do if you let it. If you set the key column to auto_increment, the database will do this for you automatically.
There are no security issues with this, but since you will be handling it yourself instead of letting the database engine take care of it, you will need to ensure that you don't generate the same id twice. This should be simple if you are on a single threaded system, but if your program is distributed you will need to put some effort into ensuring the uniqueness.
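For the multi-threaded, single-process case, a minimal sketch of an in-code sequence might look like this. The class name IdSequence is made up, and it assumes the highest ID already in use is read from the database once at startup. It does not cover several processes or servers sharing one table.

```csharp
using System;
using System.Threading;

public class IdSequence
{
    private int _current;

    // `lastUsed` is the highest ID already stored, e.g. loaded from the database at startup.
    public IdSequence(int lastUsed)
    {
        _current = lastUsed;
    }

    // Interlocked.Increment is atomic, so concurrent callers never receive the same value.
    public int Next()
    {
        return Interlocked.Increment(ref _current);
    }
}
```

One instance per table would be shared by all threads, for example as a static field.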
Seeing that you have an ASP.NET app, you could do the following (hoping and assuming all users must authenticate themselves before using your app!):
Assign each user a unique "UserID" in your database (can be INT, or CHAR)
Assign each user a "HighestSequentialID" (INT) in your database
When the user logs on, read those values from the database and store them in e.g. a custom principal, or in a cookie, or something else
Whenever the user is about to insert a row, create a segmented ID: (UserID).(User's sequential number) and store it as "VARCHAR(20)" - e.g. your UserID is 15 and thus this user's entries would have unique IDs of "15.00001", "15.00002" and so on.
When the user logs off (or at any other time), store the user's new highest used sequential ID in the database so that next time around, you'll know what this user has used last
Again - you'll have to do a lot more housekeeping work yourself, and it's always prone to a mishap (assigning a duplicate user ID, or misinterpreting the highest sequential number for that user).
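Building the segmented ID itself is a one-liner; the sketch below only shows the formatting step, with a hypothetical helper name SegmentedId. Tracking and persisting the highest sequential number is left out.

```csharp
using System;
using System.Globalization;

public static class SegmentedId
{
    // Formats "<userId>.<sequence>" with the sequence zero-padded to at least five digits.
    public static string Format(int userId, int sequence)
    {
        return string.Format(CultureInfo.InvariantCulture, "{0}.{1:D5}", userId, sequence);
    }
}
```

SegmentedId.Format(15, 1) gives "15.00001" and SegmentedId.Format(15, 2) gives "15.00002".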
I would strongly recommend trying to get these requirements changed - with these in place, all solutions will be sub-optimal at best, while using the database to handle this would be totally painless.
Marc
For a table below 1.000.000 rows, I would not be too terribly concerned about a char(22) Primary key. Of course the ideal solution for a situation like this would be for each object to have something unique about it that you could leverage for the key, even if it is a multi-part key. The next ideal solution would be to have the requirements changed :)
I would like to get a few ideas on generating unique id's without using the GUID. Preferably i would like the unique value to be of type int32.
I'm looking for something that can be used for database primary key as well as being url friendly.
Can these be considered unique?
(int)DateTime.Now.Ticks
(int)DateTime.Now * RandomNumber
Any other ideas?
Thanks
EDIT: Well, I am trying to practice Domain Driven Design and all my entities need to have an ID upon creation to be valid. I could in theory call into the DB to get an auto-incremented number, but I would rather steer clear of this, as it lets DB-related concerns leak into the Domain.
It depends on how unique you need it to be and how many items you need to give IDs to. Your best bet may be assigning them sequentially; if you try to get fancy you'll likely run into the Birthday Paradox (collisions are more likely than you might expect) or (as in your case 1) above) be forced to limit the rate at which you can issue them.
Your 1) above is a little better than 2) for most cases; it's rate limited (you can't issue more than 1 ID per tick) but not susceptible to the Birthday Paradox. Your 2) is just throwing bits away. It might be slightly better to XOR with the random number, but in any case I don't think the random number is buying you anything; it just hides the problem and makes it harder to fix.
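To put a number on the Birthday Paradox, the usual approximation p ≈ 1 - exp(-n(n-1)/(2N)) gives the chance of at least one collision among n random values drawn from N possibilities. The helper below is a sketch with a made-up name; for random int32 values N is 2^32.

```csharp
using System;

public static class Birthday
{
    // Approximate probability that at least two of `count` uniformly random
    // values drawn from `space` possibilities are equal.
    public static double CollisionProbability(double count, double space)
    {
        return 1.0 - Math.Exp(-count * (count - 1) / (2.0 * space));
    }
}
```

With N = 2^32 the probability reaches roughly one half at about 77,000 IDs, far fewer than the four billion values an int32 can hold.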
Are these considered Globally Unique?
1) (int)DateTime.Now.Ticks
2) (int)DateTime.Now * RandomNumber
Neither option is globally unique.
Option 1 - This is only unique if you can guarantee no more than one ID is generated per tick. From your description, it does not sound like this would work.
Option 2 - Random numbers are pseudo random, but not guaranteed to be unique. With that already in mind, we can reduce the DateTime portion of this option to a similar problem to option 1.
If you want a globally unique ID that is an int32, one good way would be a synchronous service of some sort that returns sequential IDs. I guess it depends on what your definition of global means. If you had larger than an int32 to work with, and you mean global on a given network, then maybe you could use IP address with a sequence number appended, where the sequence number is generated synchronously across processes.
If you have other unique identifiers besides IP address, then that would obviously be a better choice for displaying as part of a URL.
You can use the RNGCryptoServiceProvider class, if you are using .NET
RNGCryptoServiceProvider Class
So in my simple learning website, I use the built in ASP.NET authentication system.
I am now adding a user table to save things like the user's ZIP code, DOB etc. My question is:
In the new table, should the key be the user name (the string) or the user ID, which is that GUID-looking number they use in the asp_ tables?
If the best practice is to use that ugly GUID, does anyone know how to get it? It seems not to be accessible as easily as the name (System.Web.HttpContext.Current.User.Identity.Name)
If you suggest I use neither (not the GUID nor the userName fields provided by ASP.NET authentication) then how do I do it with ASP.NET authentication? One option I like is to use the email address of the user as the login, but how do I make the ASP.NET authentication system use an email address instead of a user name? (Or is there nothing to do there, and it is just me deciding I "know" userName is actually an email address?)
Please note:
I am not asking on how get a GUID in .NET, I am just referring to the userID column in the asp_ tables as guid.
The user name is unique in ASP.NET authentication.
You should use some unique ID, either the GUID you mention or some other auto generated key. However, this number should never be visible to the user.
A huge benefit of this is that all your code can work on the user ID, but the user's name is not really tied to it. Then, the user can change their name (which I've found useful on sites). This is especially useful if you use email address as the user's login... which is very convenient for users (then they don't have to remember 20 IDs in case their common user ID is a popular one).
You should use the UserID.
It's the ProviderUserKey property of MembershipUser.
Guid UserID = new Guid(Membership.GetUser(User.Identity.Name).ProviderUserKey.ToString());
I would suggest using the username as the primary key in the table if the username is going to be unique, there are a few good reasons to do this:
The primary key will be a clustered index and thus search for a users details via their username will be very quick.
It will stop duplicate usernames from appearing
You don't have to worry about using two different pieces of information (username or guid)
It will make writing code much easier because of not having to lookup two bits of information.
I would use a userid. If you want to use a user name, you are going to make the "change the username" feature very expensive.
I would say use the UserID so Usernames can still be changed without affecting the primary key. I would also set the username column to be unique to stop duplicate usernames.
If you'll mainly be searching on username rather than UserID then make Username a clustered index and set the Primary key to be non clustered. This will give you the fastest access when searching for usernames, if however you will be mainly searching for UserIds then leave this as the clustered index.
Edit : This will also fit better with the current ASP.Net membership tables as they also use the UserID as the primary key.
I agree with Palmsey,
Though there seems to be a little error in his code:
Guid UserID = new Guid(Membership.GetUser(User.Identity.Name)).ProviderUserKey.ToString());
should be
Guid UserID = new Guid(Membership.GetUser(User.Identity.Name).ProviderUserKey.ToString());
This is old but I just want people who find this to note a few things:
The aspnet membership database IS optimized when it comes to accessing user records. The clustered index seek (optimal) in sql server is used when a record is searched for using loweredusername and applicationid. This makes a lot of sense as we only have the supplied username to go on when the user first sends their credentials.
The guid userid will give a larger index size than an int, but this is not really significant because we often only retrieve 1 record (user) at a time, and in terms of fragmentation the number of reads usually greatly outweighs the number of writes and edits to a users table - people simply don't update that info all that often.
the regsql script that creates the aspnet membership tables can be edited so that instead of using NEWID as the default for userid, it can use NEWSEQUENTIALID() which delivers better performance (I have profiled this).
Profile. Someone creating a "new learning website" should not try to reinvent the wheel. One of the websites I have worked for used an out-of-the-box version of the aspnet membership tables (excluding the horrible profile system) and the users table contained nearly 2 million user records. Even with such a high number of records, selects were still fast because, as I said to begin with, the database indexes focus on loweredusername+applicationid to perform a clustered index seek for these records. Generally speaking, if SQL is doing a clustered index seek to find 1 record, you don't have any problems, even with huge numbers of records, provided that you don't add columns to the tables and start pulling back too much data.
Worrying about a guid in this system, to me, based on actual performance and experience of the system, is premature optimization. If you have an int for your userid but the system performs sub-optimal queries because of your custom index design etc. the system won't scale well. The Microsoft guys did a generally good job with the aspnet membership db and there are many more productive things to focus on than changing userId to int.
I would use an auto incrementing number usually an int.
You want to keep the size of the key as small as possible. This keeps your index small and benefits any foreign keys as well. Additionally you are not tightly coupling the data design to external user data (this holds true for the aspnet GUID as well).
Generally GUIDs don't make good primary keys as they are large and inserts can happen at potentially any data page within the table rather than at the last data page. The main exception to this is if you are running multiple replicated databases. GUIDs are very useful for keys in this scenario, but I am guessing you only have one database so this is not a problem.
If you're going to be using LinqToSql for development, I would recommend using an Int as a primary key. I've had many issues when I had relationships built off of non-Int fields, even when the nvarchar(x) field had constraints to make it a unique field.
I'm not sure if this is a known bug in LinqToSql or what, but I've had issues with it on a current project and I had to swap out PKs and FKs on several tables.
I agree with Mike Stone. I would also suggest only using a GUID in the event you are going to be tracking an enormous amount of data. Otherwise, a simple auto incrementing integer (Id) column will suffice.
If you do need the GUID, .NET is lovely enough that you can get one by a simple...
Dim guidProduct As Guid = Guid.NewGuid()
or
Guid guidProduct = Guid.NewGuid();
I'm agreeing with Mike Stone also. My company recently implemented a new user table for outside clients (as opposed to internal users who authenticate through LDAP). For the external users, we chose to store the GUID as the primary key, and store the username as varchar with unique constraints on the username field.
Also, if you are going to store the password field, I highly recommend storing the password as a salted, hashed binary in the database. This way, if someone were to hack your database, they would not have access to your customer's passwords.
I would use the guid in my code and, as already mentioned, an email address as username. It is, after all, already unique and memorable for the user. Maybe even ditch the guid (v. debatable).
Someone mentioned using a clustered index on the GUID if this was being used in your code. I would avoid this, especially if INSERTs are frequent: random GUID values land at arbitrary positions in the index, causing page splits and fragmentation. Clustered indexes work well on auto-increment IDs, though, because new records are only ever appended.