How can I create my own GUID algorithm with a smaller "global"?

I have an application whose "global" is far smaller than the real globe, and I wanted a shorter version of a GUID. Now suppose I have a concrete number of IDs that I estimate will never be exceeded (for example, 100 million IDs). How can I determine the number of random bits required to have the same properties as a GUID (globally unique, requiring no central authority to generate one)? Using a normal GUID would be overkill.
My "overkill" refers to this: I need the ID to be as easy to type, say, and write down as possible, while at the same time having an astronomically low collision chance, like a GUID. I heard a GUID could be assigned to every grain of sand on Earth. My application is a game; each player gets one generated ID, and obviously my player count is nowhere near the amount of sand on Earth.
It would be best if a player could say something like "My ID is XXXX-XXXX". In that case, I am not sure whether 8 characters of randomized hex would be too few or too many for 100 million players. (In reality I encode the ID as A-Z 0-9 instead of hex, though.) My game is not online-only, so I would like each player to be able to obtain a unique ID even when offline (there is no server to check for ID collisions).
GUIDs were designed to be globally unique, but I don't know why that results in a 128-bit sequence. Maybe they just chose a "very large" number that happens to be a power of 2? I don't know what they were thinking when designing GUIDs to ensure that they will not clash. (Did they use the world population times something? If that is the case, I too can use 10 million times something.)

A 128-bit GUID will generally perform well, because most compilers are smart enough to reduce operations on it to a pair of 64-bit operations (and, on some CPUs, a single 128-bit extended operation). Java and C#/VB.NET would likely have quite a bit more overhead than C++, but if you are using Java or C#/VB.NET, you have already accepted quite a bit more overhead, and a GUID won't add much to it.
However, if you really need smaller values, you could manually reduce GUIDs by XOR-ing the upper 64 bits with the lower 64 bits (thereby preserving some of the uniqueness of the original) to create a compact, 64-bit, mostly-unique number.
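For instance, a minimal sketch in Java (the class and method names are mine):

    import java.util.UUID;

    public class GuidFold {
        // Fold a 128-bit GUID down to 64 bits by XOR-ing its two halves.
        static long fold64(UUID guid) {
            return guid.getMostSignificantBits() ^ guid.getLeastSignificantBits();
        }

        // Fold again to 32 bits by XOR-ing the halves of the 64-bit value.
        static int fold32(UUID guid) {
            long folded = fold64(guid);
            return (int) (folded ^ (folded >>> 32));
        }

        public static void main(String[] args) {
            UUID guid = UUID.randomUUID();
            System.out.printf("128-bit: %s%n64-bit: %016x%n32-bit: %08x%n",
                    guid, fold64(guid), fold32(guid));
        }
    }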
You could reduce to 32 bits the same way, by folding the halves again; the folded size should always divide the original 128 bits evenly (a 48-bit reduction takes a little more care, since 48 does not divide 128). This has the advantage that you are starting out with a number that is intended to be unique across a very large set. However, keep in mind that 100 million items require a fairly high number of bits to preserve a non-overlapping guarantee, so you may just be setting yourself up for a very difficult-to-find problem later on if you aren't careful.
A crude but probably equally effective approach is to use a cryptographically secure random number generator and construct a number as large as you need (probably 48 bits minimum). It is important not to do modulo operations on the results, or you could significantly reduce the uniqueness (modulo introduces bias toward part of the range).
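A sketch of that approach with Java's SecureRandom; drawing whole bytes avoids any modulo step (names are mine):

    import java.security.SecureRandom;

    public class RandomId {
        private static final SecureRandom RNG = new SecureRandom();

        // Draw 48 random bits directly; no modulo, so no bias is introduced.
        static long random48() {
            byte[] bytes = new byte[6];          // 6 bytes = 48 bits
            RNG.nextBytes(bytes);
            long id = 0;
            for (byte b : bytes) {
                id = (id << 8) | (b & 0xFF);
            }
            return id;
        }

        public static void main(String[] args) {
            System.out.printf("%012x%n", random48());
        }
    }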
I am assuming you cannot use a sequential ID, although you may want to revisit that idea and see whether there is a way to make a sequential ID work. For example, you could pair a sequential ID with a random seed number, guaranteeing uniqueness without requiring a large number, and allowing internal indexing operations and similar optimizations that are common with large data sets.
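A rough sketch of that idea, assuming each installation draws a random seed once and keeps it (names are mine; IDs from two installations can only collide if those installations happen to draw the same seed):

    import java.security.SecureRandom;
    import java.util.concurrent.atomic.AtomicLong;

    public class SeededSequence {
        // Chosen once per installation and persisted thereafter.
        private final int seed = new SecureRandom().nextInt();
        private final AtomicLong counter = new AtomicLong();

        // High 32 bits: the per-installation seed; low 32 bits: a local sequence.
        long nextId() {
            return ((long) seed << 32) | (counter.incrementAndGet() & 0xFFFFFFFFL);
        }

        public static void main(String[] args) {
            SeededSequence seq = new SeededSequence();
            System.out.printf("%016x%n%016x%n", seq.nextId(), seq.nextId());
        }
    }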

OK, I discussed this with a friend and we came up with a solution. This is how we decided the number of "characters" for my game ID.
A character consists of 0-9 and A-Z instead of hex; that's 36 kinds of characters. We took out 0, O, 1, and I so the ID prints legibly in a variety of fonts without confusion, which leaves 32 kinds of characters.
Then, if every character is pseudo-randomized, how many players can we safely have?
We used the birthday paradox's square approximation. The formula on that page gives the number of people necessary for a 50% chance of two people colliding: n ≈ 1/2 + sqrt(1/4 + 2 · d · ln 2), where d is the number of possible choices. It gives 22.99 people for the birthday problem (d = 365).
Now we substitute d = 32^(number of characters) into the equation instead of 365. This is how many players it takes before there is a 50% chance of two players having the same ID:
8 characters (d = 32^8): about 1.2 million players
9 characters (d = 32^9): about 6.9 million players
10 characters (d = 32^10): about 39.5 million players
Finally, we agreed to choose a 9-character ID, so the game can register up to 6.9 million players before there is a 50% chance that just two out of all 6.9 million players have the same ID.
The game isn't even online-only! A collision only matters if those two players are both still actively playing at the same time and both decide to send a score to the scoreboard in the same week, because of the weekly score reset. So the actual number of players the game can hold is somewhat higher than that. (The game will probably never have that many players... it is just the small happy dream of every game startup. Well, at least the computation was fun.)
It will probably look like this for easier reading: 5XT-339-A67
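For reference, a minimal generator along these lines in Java (the class and method names are mine):

    import java.security.SecureRandom;

    public class PlayerId {
        // 32-character alphabet: 0-9 and A-Z with 0, O, 1, and I removed.
        private static final String ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ";
        private static final SecureRandom RNG = new SecureRandom();

        static String newId() {
            StringBuilder sb = new StringBuilder(11);
            for (int i = 0; i < 9; i++) {
                if (i > 0 && i % 3 == 0) sb.append('-');   // group as XXX-XXX-XXX
                sb.append(ALPHABET.charAt(RNG.nextInt(ALPHABET.length())));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(newId());   // e.g. 5XT-339-A67
        }
    }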

Related

Encrypting small messages

I need to implement a coupon-code feature. Because of the number of codes required and some other constraints, I can't store them in a database. In addition, the displayed codes need to be short (around 10 characters).
My original idea was to use a cryptographic function to create codes by encrypting an ongoing counter, but I'm at a loss as to what method to use.
Because of the counter, I would be encoding only a couple of bytes, and I am aware that many algorithms are not secure when used with very short messages.
Is my approach a good idea?
What algorithm could I use?
I'm not sure if this is what you're after, and as per my comment, you have no real guarantee of security, but one possible answer could be to seed a PRNG with some number and give out the first x numbers as codes. As long as x is much smaller than the total possible number of outcomes, the chance of repetition is small, and codes could be validated by re-generating the sequence (you may want to hash parts of it for speed).
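A minimal sketch of that idea in Java; as noted above, java.util.Random gives determinism but no real security (names are mine):

    import java.util.Random;

    public class CouponCodes {
        private static final String ALPHABET =
                "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

        // Deterministic: the same seed always yields the same sequence of
        // codes, so a code can be validated by regenerating the first x codes.
        static String[] generate(long seed, int count) {
            Random prng = new Random(seed);
            String[] codes = new String[count];
            for (int i = 0; i < count; i++) {
                StringBuilder sb = new StringBuilder(10);
                for (int j = 0; j < 10; j++) {
                    sb.append(ALPHABET.charAt(prng.nextInt(ALPHABET.length())));
                }
                codes[i] = sb.toString();
            }
            return codes;
        }

        public static void main(String[] args) {
            for (String code : generate(42L, 3)) {
                System.out.println(code);
            }
        }
    }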
If you use base 62 ([a-z A-Z 0-9]) with 10 characters, there are over 839 quadrillion (62^10 ≈ 8.4 × 10^17) possible outcomes. If you were to give everyone on the planet a unique code, you would have used roughly 0.0000009% of your addressable space.

How to generate a unique GUID from two unique GUIDs, which are order-insignificant

I have an application whereby users have their own IDs.
The IDs are unique.
The IDs are GUIDs, so they include letters and numbers.
I want a formula whereby, if I have both IDs, I can find their combined GUID, regardless of which order I use them in.
These GUIDs are 16 digits long; for the example below I will pretend they are 4.
user A: x43y
user B: f29a
If I use formula X, which takes two arguments, X(a,b), I want the produced code to give the same result regardless of whether a is User A's or User B's GUID.
I do not require a method to find either user's ID from this formula, i.e. it is a one-way method.
Thank you for any answers or direction
So I'll turn my comment into an answer. Then this question can get answered, the answer accepted (if it is good enough) and we can all move on.
Sort the GUIDs lexicographically and append the second to the first. The result is unique and has all the other characteristics you've asked for.
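A minimal sketch in Java (the class and method names are mine):

    import java.util.UUID;

    public class CombinedGuid {
        // Order-insensitive: combine(a, b) always equals combine(b, a),
        // because the two GUIDs are sorted before concatenation.
        static String combine(UUID a, UUID b) {
            String s1 = a.toString(), s2 = b.toString();
            return s1.compareTo(s2) <= 0 ? s1 + s2 : s2 + s1;
        }

        public static void main(String[] args) {
            UUID a = UUID.randomUUID(), b = UUID.randomUUID();
            System.out.println(combine(a, b).equals(combine(b, a)));   // true
        }
    }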
Can you compress it (I know you wrote "shorten", but bear with me) down to 16 characters? No, you can't; not, that is, if you want to be able to decompress it again and recover the original bits. (You've written that you don't need to be able to recover the original GUIDs; skip the next paragraph if you want.)
A GUID is, essentially, a random sequence of 128 bits, and random sequences can't, by definition, be compressed. If a sequence of 128 bits were compressible, it couldn't be random: there would have to be some algorithm for inflating the compressed version back to 128 bits. I know that, since GUIDs are generated algorithmically, they're not truly random. In practice, however, there is almost no point in regarding them as anything other than truly random; I certainly don't think you should waste your time trying to compress them.
Given that the total population of possible GUIDs is large, you might be satisfied by a method which takes the first half of each individual GUID and assembles a pseudo-GUID from them. Depending on how many GUIDs your system is likely to be working with, and your appetite for risk, this might satisfy your practical needs.
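One way to read that suggestion, sketched in Java (names are mine; sorting first keeps the result order-insensitive, and the collision risk grows with the number of GUIDs you combine):

    import java.util.UUID;

    public class PseudoGuid {
        // Build one 128-bit pseudo-GUID from the top 64 bits of each input.
        static UUID halfAndHalf(UUID a, UUID b) {
            UUID lo = a.compareTo(b) <= 0 ? a : b;
            UUID hi = a.compareTo(b) <= 0 ? b : a;
            return new UUID(lo.getMostSignificantBits(),
                            hi.getMostSignificantBits());
        }
    }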

Allocating datastore long IDs - but segmented so different Kinds have different ranges

My program has 3 Kinds that are closely related, and I want to be able to store and manipulate their long IDs interchangeably; e.g., I might have an array of long IDs that can be for any of the 3 Kinds.
Using the allocateIds API I can allocate the IDs for the 3 Kinds in the same namespace, but I also sometimes need to be able to tell which Kind one of these IDs refers to (e.g. in order to do a datastore operation on the right Kind).
I understand that the 'normal' way to do this is to store the whole Key, rather than just the long ID, but there will be a huge number of these; it will be more efficient if I can just use long values rather than Key values.
So I'd like to be able to segment the ID ranges, so I can call a simple function with an ID and it will tell me which of the 3 Kinds the ID is for.
(I'm using Java, but I don't think that matters.)
Allocate my own IDs
I guess the most straightforward way to do this is to simply allocate my own IDs. I believe that, in order to allocate sequential IDs, I would need to do an extra datastore write for every allocation (to track the allocations), or get into some complicated system of pre-allocating ranges of IDs to each live instance. This sounds like a bad idea.
So I could generate random 54-bit IDs, reserving 2 bits to use as flags to indicate the type. But it is my understanding that random or hash allocation dramatically reduces the number of allocations that can be made safely. The Internet tells me that the chance of a collision is approximately k^2 / (2N), where k is the number of allocations and N is the size of the allocation space. So, if I'm willing to accept a 0.1% chance of collision, then k = sqrt(2 × 2^54 / 1000) ≈ 6 million. Since I really have no idea how many entities I will need to store, this is unacceptable.
Reserve some bits in the Long ID to indicate the Kind
Another solution would be to use 2 bits of the long value as flags to indicate the type. The easiest way to do this would be to take advantage of the fact that the allocator currently only uses the low 56 bits of a long, so I could use the high bits as flags to indicate the Kind. The problem with that solution is that I would lose the ability to manipulate these numbers in JavaScript, which is the reason for the 56-bit limit in the first place.
An alternative that keeps the option of manipulating these numbers in JS is to use allocateIdRange and pre-allocate (and throw away) the ID ranges corresponding to bits 54 and 55. Actually, I could use any bits, but specifying the ID ranges is much easier if I use the high bits.
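Under that scheme, reading the Kind back out of a long is a simple bit test; a hypothetical sketch:

    public class KindDecoder {
        // Bits 54 and 55 of the 56-bit ID space identify the Kind (0..2).
        static int kindOf(long id) {
            return (int) ((id >>> 54) & 0x3);
        }
    }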
But I know little of how the datastore and the allocator actually work, so I don't know if this 'pre-allocate and discard' technique is a good idea.

Entire range - Reverse MD5 lookup

I am learning about encryption methods and I have a question about MD5.
I have seen that there are several websites with 'rainbow tables' that offer reverse MD5 lookup, but they can't look up all the possible combinations.
For knowledge's sake, my question is this:
Hypothetically, if a group of people were to agree on an upper limit (e.g. 5 or 6 characters) and decide to map out the MD5 hashes of all the values inside that range, storing the results in a database to use for reverse lookup:
1. Do you think such a thing is probable?
2. If you can speculate, what kind of scale of resources would this require?
3. To your knowledge have there been any public or private attempts to do this?
I am not referring to tables that have select entries based on a dictionary, but to mapping the entire range up to a certain number of characters.
(I have referred to This question already.)
It is possible. For a small number of characters, it has already been done. In the near future, it will be easy for larger numbers of characters. MD5 isn't getting any stronger.
That's a function of time. To reverse the entire 6-or-fewer-character alphanumeric space would require computing roughly 62^6 entries. That's about 57 billion MD5s. That's doable by a determined small group, and easy for a government, right now. In the future, it will be doable on a home computer. Remember, though, that as the number of allowable characters or the maximum length increases, the difficulty increases exponentially.
People already have done it. But, honestly, it doesn't matter - because anyone with half an ounce of sense uses a random salt. If you precompute the entire MD5 space and reverse it, that doesn't mean jack dandy if someone is using key strengthening or a good salt! Read up on salting.
5 or 6 characters is easy. 6 bytes is doable (that's 2^48 combinations), even with limited hardware.
Namely, a simple Core2 CPU from Intel will be able to hash one password in about 150 clock cycles (assuming you use an SSE2 implementation, which will hash four passwords in parallel in 600 clock cycles). With a 2.4 GHz quad-core CPU (that's my PC, not exactly the newest machine available), I can then try about 2^26 passwords per second. For that kind of job, a massively parallel architecture is fine, hence it makes sense to use a GPU. For maybe $200, you can buy an NVidia video card which will be about four times faster (i.e. 2^28 passwords per second). 6 alphanumeric characters (uppercase, lowercase and digits) are close to 2^36 combinations; trying them all is then a matter of 2^(36-28) = 256 seconds, which is less than five minutes. With 6 random bytes, it will need 2^20 seconds, i.e. a bit less than a fortnight.
That's for the CPU cost. If you want to speed up the actual attack, you store the hash results: then you will not need to recompute all those hashed passwords every time you attack a password (but you still have to do it once). 2^36 hash results (16 bytes each) mean 1 terabyte. You can buy a hard disk that big for $100. 2^48 hash results imply 4096 times that storage space; in plain hard disks this will cost as much as a house: a bit expensive for the average bored student, but affordable for most kinds of governmental or criminal organizations.
Rainbow tables are an optimization trick for the storage. In rough terms, you store only one out of every t hash results, in exchange for having to do t lookups and t^2 hash computations for every attack. E.g., if you choose t = 1000, you only have to buy four hard disks instead of four thousand, but you will need to make 1000 lookups and a million hashes every time you want to crack a password (this will take a dozen seconds at most, if you do it right).
Hence you have two costs:
The CPU cost is about computing hashes for the complete password space; with a table (rainbow or not) you have to do it once, and can then reuse that computational effort for every attacked password.
The storage cost is about storing the hash results in order to easily attack several passwords. Hard disks are not very expensive, as shown above. Rainbow tables help you lower storage costs.
Salting defeats cost sharing through precomputed tables (whether they are rainbow tables or just plain tables has no effect here: tables are about reusing precomputed values for several attacked passwords, and salts prevent such recycling).
The CPU cost can be increased by defining that the hash procedure is not just a single hash computation; for instance, you can define the "password hash" as applying MD5 over the concatenation of 1000 copies of the password. This will make each attacker guess one thousand times more expensive. It also makes legitimate password validation one thousand times more expensive, but most users will not mind (the user has just typed his password; he cannot really tell whether the password verification took 10 ms or 10 µs).
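A minimal illustration of such an iterated, salted hash in Java (the scheme and names here are mine, for illustration only; a real system should use a vetted construction such as bcrypt or PBKDF2):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class SlowHash {
        // Hash salt + 1000 copies of the password, making each attacker
        // guess (and each legitimate check) ~1000 times more expensive.
        static byte[] hash(String password, byte[] salt)
                throws NoSuchAlgorithmException {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(salt);
            byte[] pw = password.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < 1000; i++) {
                md5.update(pw);
            }
            return md5.digest();
        }
    }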
Modern Unix-like systems (e.g. Linux) use "MD5" passwords which actually combine salting and iterated hashing, as described above. (Actually, a modern Linux system may use another hash function, such as SHA-256, but that does not change things much here.) So precomputed tables will not help, and on-the-fly password cracking is expensive. A password with 6 alphanumeric characters can still be cracked within a few days, because 6 characters are kind of weak anyway. Also, many longer passwords are crackable, because it turns out that human beings are bad at remembering passwords; they will not choose just any random sequence of characters, they will select passwords which have some "meaning". This reduces the space of possible passwords.
It's called a rainbow table, and it's easily defeated with salting.
Yes, it is not only probable, but it's probably been done before.
It depends on whether they are mapping the entire possible range or just a range of ASCII characters. Let's say you need 128 bits + 6 bytes to store each match; that's 22 bytes. You'd need:
6.32 GB to store all lowercase alphabetic combinations [a-z]
405 GB for all alphabetic combinations [a-zA-Z]
1.13 TB for all alphanumeric combinations [a-zA-Z0-9]
5.24 TB for all combinations consisting of letters, numbers, and 18 symbols.
As you see, it increases exponentially, but even at 5.24 TB that's nothing to agencies like, say, the NSA or the CIA. They probably have done it.
As everyone else said, salting can easily defeat rainbow tables and that's almost as important as hashing. Read this: Just hashing is far from enough - How to position against dictionary and rainbow attacks

Is it safe to assume a GUID will always be unique?

I know there is a minute possibility of a clash, but if I generated a batch of 1000 GUIDs (for example), would it be safe to assume they're all unique, to save testing each one?
Bonus question
What's an optimal way to test a GUID for uniqueness? A Bloom filter, maybe?
Yes, you can. Since GUIDs are 128 bits long, there is admittedly a minute possibility of a clash—but the word "minute" is nowhere near strong enough. There are so many GUIDs that if you generate several trillion of them randomly, you're still more likely to get hit by a meteorite than to have even one collision (from Wikipedia). And if you aren't generating them randomly, but are e.g. using the MAC-address-and-time-stamp algorithm, then they're also going to be unique, as MAC addresses are unique among computers and time stamps are unique on your computer.
Edit 1: To answer your bonus question, the optimal way to test a set of GUIDs for uniqueness is to just assume that they are all unique. Why? Because, given the number of GUIDs you're generating, the odds of a GUID collision are smaller than the odds of a cosmic ray flipping a bit in your computer's memory and screwing up the answer given by any "accurate" algorithm you'd care to run. (See this StackOverflow answer for the math.)
There are an enormous number of GUIDs out there. To quote Douglas Adams's Hitchhiker's Guide to the Galaxy:
"Space," it says, "is big. Really big. You just won't believe how vastly hugely mindbogglingly big it is. I mean you may think it's a long way down the road to the chemist, but that's just peanuts to space, listen…"
And since there are about 7×10^22 stars in the universe, and just under 2^128 GUIDs, there are approximately 4.86×10^15 (almost five quadrillion) GUIDs for every single star. If every one of those stars had a world with a thriving population like ours, then around each and every star, every human or alien who had ever lived would be entitled to over forty-five thousand GUIDs. For every person in history at every star in the universe. The GUID space is at the same level of hugeness as the size of the entire universe. You do not need to worry.
(Edit 2: Reflecting on this: wow. I hadn't realized myself what this meant. The GUID space is incomprehensibly massive. I'm sort of in awe of it.)
Short answer: for practical purposes, yes.
However, you have to consider the birthday paradox!
I have calculated a few representative collision probabilities. With 122-bit UUIDs as specified in the Wikipedia article, the probability of collision is 1/2 if you generate at least 2.71492e18 UUIDs. With 10^19 UUIDs, the probability is 0.999918. With 10^17 UUIDs, 0.000939953.
Some numbers for comparison can be found on Wikipedia. So you can safely assign a UUID for each human that has lived, each galaxy in the observable universe, each fish in the ocean, and each individual ant on Earth. However, collisions are almost certain if you generate a UUID for each transistor humanity produces in a year, each insect on Earth, each grain of sand on Earth, each star in the observable universe, or anything larger.
If you generate 1 billion UUIDs per second, it would take about 36 years to get a collision probability of 10%.
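Those numbers come from the standard birthday approximation p ≈ 1 − e^(−n²/(2N)) with N = 2^122; a quick sketch to reproduce them (names are mine):

    public class BirthdayBound {
        // p(n) ≈ 1 - exp(-n^2 / (2N)) for N = 2^122 (random bits in a v4 UUID).
        static double collisionProbability(double n) {
            double space = Math.pow(2, 122);
            return -Math.expm1(-n * n / (2 * space));
        }

        public static void main(String[] args) {
            System.out.println(collisionProbability(2.71492e18)); // ~0.5
            System.out.println(collisionProbability(1e19));       // ~0.999918
            System.out.println(collisionProbability(1e17));       // ~0.000940
        }
    }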
Eventually, there will probably be a collision among the set of UUIDs generated over the course of human history. Still, the probability that two colliding UUIDs would be used for the same purpose is vanishingly small, so there is no problem in practice.
An analysis of the possibility of collision is available on Wikipedia: http://en.wikipedia.org/wiki/Uuid#Random_UUID_probability_of_duplicates
As mentioned in the link, this will be affected by the properties of the random number generator.
There is also the possibility of a bug in GUID generator code; while the chances are low, they are probably higher than the chances of a collision based on the mathematics.
A Bloom filter might be appropriate; it can quickly tell you if a GUID is unique, but there is a chance of a false indication of a collision. An alternative method, if you're testing a batch at a time, is to sort the batch and compare each successive element.
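For a single batch, an exact check is cheap anyway; a sketch using a hash set instead of sorting (names are mine):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.UUID;

    public class BatchCheck {
        // Exact duplicate detection in one pass; unlike a Bloom filter,
        // this gives no false positives.
        static boolean allUnique(UUID[] batch) {
            Set<UUID> seen = new HashSet<>();
            for (UUID guid : batch) {
                if (!seen.add(guid)) {
                    return false;   // add() returns false for a duplicate
                }
            }
            return true;
        }
    }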
In general, yes it is safe to assume.
If your GUID generator is truly random, the possibility of a clash within a batch of 1000 GUIDs is extraordinarily small.
Of course, that assumes a good GUID generator. So the question is really about how much you trust the tool you're using to generate GUIDs, and whether it has its own tests.
This topic reminds me of the deck-of-cards scenario: there are so many ways a deck of 52 cards can be arranged that it is pretty much certain that no two properly shuffled decks of cards that have ever existed have been in the same order.
If you take a deck now and shuffle it, that sequence will be unique and will probably never be seen again in all of humanity. Indeed, the potential number of ways to arrange 52 of anything is so unimaginably vast that the chances of any two decks happening to be in the same order are close to zero.
In this example of having 40 shuffled decks and wanting to know for sure that they are all unique, it is not impossible for two of them to be the same, but it is something that would most likely not occur even if you had been able to shuffle all the decks once every tenth of a second since the birth of the universe.
While a collision is possible, it is HIGHLY unlikely. (Math here.) It is safe to assume they are in fact distinct.
Usually it is a pretty safe assumption.
http://en.wikipedia.org/wiki/Globally_Unique_Identifier
Is a GUID unique 100% of the time?
