Is it safe to assume a GUID will always be unique? - math

I know there is a minute possibility of a clash but if I generated a batch of 1000 GUIDs (for example), would it be safe to assume they're all unique to save testing each one?
Bonus question
An optimal way to test a GUID for uniqueness? Bloom filter maybe?

Yes, you can. Since GUIDs are 128 bits long, there is admittedly a minute possibility of a clash—but the word "minute" is nowhere near strong enough. There are so many GUIDs that if you generate several trillion of them randomly, you're still more likely to get hit by a meteorite than to have even one collision (from Wikipedia). And if you aren't generating them randomly, but are e.g. using the MAC-address-and-time-stamp algorithm, then they're also going to be unique, as MAC addresses are unique among computers and time stamps are unique on your computer.
Edit 1: To answer your bonus question, the optimal way to test a set of GUIDs for uniqueness is to just assume that they are all unique. Why? Because, given the number of GUIDs you're generating, the odds of a GUID collision are smaller than the odds of a cosmic ray flipping a bit in your computer's memory and screwing up the answer given by any "accurate" algorithm you'd care to run. (See this Stack Overflow answer for the math.)
There are an enormous number of GUIDs out there. To quote Douglas Adams's Hitchhiker's Guide to the Galaxy:
"Space," it says, "is big. Really big. You just won't believe how vastly hugely mindbogglingly big it is. I mean you may think it's a long way down the road to the chemist, but that's just peanuts to space, listen…"
And since there are about 7×10^22 stars in the universe, and just under 2^128 GUIDs, then there are approximately 4.86×10^15 (almost five quadrillion) GUIDs for every single star. If every one of those stars had a world with a thriving population like ours, then around each and every star, every human or alien who had ever lived would be entitled to over forty-five thousand GUIDs. For every person in history at every star in the universe. The GUID space is at the same level of hugeness as the size of the entire universe. You do not need to worry.
(Edit 2: Reflecting on this: wow. I hadn't realized myself what this meant. The GUID space is incomprehensibly massive. I'm sort of in awe of it.)

Short answer: for practical purposes, yes.
However, you have to consider the birthday paradox!
I have calculated a few representative collision probabilities. With 122-bit UUIDs as specified in the Wikipedia article, the probability of collision is 1/2 if you generate at least 2.71492e18 UUIDs. With 10^19 UUIDs, the probability is 0.999918. With 10^17 UUIDs, 0.000939953.
Some numbers for comparison can be found on Wikipedia. So you can safely assign a UUID for each human that has lived, each galaxy in the observable universe, each fish in the ocean, and each individual ant on Earth. However, collisions are almost certain if you generate a UUID for each transistor humanity produces in a year, each insect on Earth, each grain of sand on Earth, each star in the observable universe, or anything larger.
If you generate 1 billion UUIDs per second, it would take about 34 years to reach a collision probability of 10%.
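These probabilities follow from the standard birthday-problem approximation p ≈ 1 - exp(-n(n-1)/(2N)), with N = 2^122 random UUIDs. A minimal Python sketch for checking them yourself; the function names are made up for illustration and do not come from any library:

```python
import math

UUID_SPACE = 2.0 ** 122  # number of distinct random (version 4) UUIDs


def collision_probability(n, space=UUID_SPACE):
    """Approximate probability of at least one collision among n random IDs."""
    # Birthday approximation: p ≈ 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n * (n - 1) / (2.0 * space))


def ids_for_probability(p, space=UUID_SPACE):
    """Approximate number of random IDs needed to reach collision probability p."""
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


print(collision_probability(1000))  # a batch of 1000: about 9.4e-32
print(ids_for_probability(0.5))     # roughly 2.7e18
```

Dividing `ids_for_probability(p)` by a generation rate gives the time needed to reach probability `p` at that rate.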
Eventually, there will probably be a collision among the set of UUIDs generated over the course of human history. Still, the probability that collided UUIDs will be used for the same purpose is vanishingly small, so there's no problem in practice.

An analysis of the possibility of collision is available on Wikipedia: http://en.wikipedia.org/wiki/Uuid#Random_UUID_probability_of_duplicates
As mentioned in the link, this will be affected by the properties of the random number generator.
There is also the possibility of a bug in GUID generator code; while the chances are low, they are probably higher than the chances of a collision based on the mathematics.
A Bloom filter might be appropriate; it can quickly tell you if a GUID is unique, but there's a chance for a false indication of a collision. An alternate method if you're testing a batch at a time is to sort the batch and compare each successive element.
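For a single batch, the sort-and-compare approach takes only a few lines. The Python sketch below also shows an exact hash-set check for comparison; both helper names are illustrative:

```python
import uuid


def has_duplicates_sorted(ids):
    """Sort the batch, then compare each element with its successor."""
    ordered = sorted(ids)
    return any(a == b for a, b in zip(ordered, ordered[1:]))


def has_duplicates_set(ids):
    """Equivalent exact check using a hash set."""
    ids = list(ids)
    return len(set(ids)) != len(ids)


batch = [uuid.uuid4() for _ in range(1000)]
print(has_duplicates_sorted(batch), has_duplicates_set(batch))
```

Unlike a Bloom filter, both checks are exact, at the cost of keeping the whole batch in memory.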

In general, yes it is safe to assume.
If your GUID generator is truly random, the probability of a clash within a batch of 1000 GUIDs is extraordinarily small.
Of course, that assumes a good GUID generator. So the question is really about how much you trust the tool you're using to generate GUIDs, and whether it has its own tests.

This topic reminds me of the deck of cards scenario. That is to say, there are so many ways a deck of 52 cards can be arranged that it's pretty much certain that no 2 properly shuffled decks of cards that have ever existed have been in the same order.
If you take a deck now and shuffle it, that sequence will be unique, and will probably never be seen again in all of humanity. Indeed, the potential number of ways to arrange 52 of anything is so unimaginably vast that the chances of any 2 decks happening to be in the same order are close to zero.
In this example of having 40 shuffled decks and wanting to know for sure they are all unique, it's not impossible that 2 of them are the same, but it's something that most likely would not occur even if you were able to shuffle all the decks once every tenth of a second and had started at the birth of the universe.

While a collision is possible, it is HIGHLY unlikely. (Math here.) It is safe to assume they are in fact distinct.

Usually it is a pretty safe assumption.
http://en.wikipedia.org/wiki/Globally_Unique_Identifier
Is a GUID unique 100% of the time?

Related

Encrypting small messages

I need to implement a coupon-code feature. Because of the number of codes required and some other constraints, I can't store them in a database. In addition, the displayed codes need to be short (around 10 characters).
My original idea was to use a cryptographic function to create codes by encrypting an ongoing counter, but I'm at a loss as to what method to use.
Because of the counter I would be encoding only a couple of bytes, and I am aware that many algorithms are not secure when used with very short messages.
Is my approach a good idea?
What algorithm could I use?
I'm not sure if this is what you're after, and as per my comment, you have no real guarantee of security, but one possible answer could be to seed a prng with some number and give out the first x numbers as codes. As long as x is much smaller than the total possible number of outcomes, the chance for repetition is small, and codes could be validated by re-generating the sequence (you may want to hash parts of it for speed purposes)
If you use base 62 ([a-z A-Z 0-9]) with 10 characters, there are over 839 quadrillion possible outcomes. If you were to give everyone on the planet a unique code, you would have used roughly 0.0000009% of your addressable space.
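A rough Python sketch of this idea: seed a PRNG, hand out the first `count` codes, and validate a code by regenerating the sequence. The names `generate_codes` and `is_valid` are hypothetical. Note the assumption that predictability does not matter: Python's `random` module is not cryptographically secure, so anyone who learns the seed can reproduce every code.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols: a-z, A-Z, 0-9
CODE_LENGTH = 10


def generate_codes(seed, count):
    """Deterministically produce the first `count` codes for a given seed."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(ALPHABET) for _ in range(CODE_LENGTH))
        for _ in range(count)
    ]


def is_valid(code, seed, count):
    """Validate a code by regenerating the sequence (no database needed)."""
    return code in set(generate_codes(seed, count))


keyspace = len(ALPHABET) ** CODE_LENGTH
print(keyspace)  # a little over 8.39e17 possible codes
```

Regenerating the whole sequence on every check is slow for large `count`; caching the set (or hashes of the codes) in memory is one way around that.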

How can I create my own GUID algorithm with smaller "global"?

My application has a much smaller "globe" than the real one, and I want a shorter version of a GUID. Suppose I have a concrete number of IDs that I estimate will never be exceeded (for example, 100 million IDs). How can I determine the number of random bits required to have the same properties as a GUID (globally unique, requiring no central authority to generate one)? Using a normal GUID would be overkill.
By "overkill" I mean this: I need the ID to be as easy to type, say, and write down as possible, and at the same time to have an astronomically low collision chance like a GUID. I have heard that a GUID could be assigned to every grain of sand on Earth. My application is a game in which each player gets one generated ID; obviously I will not have as many players as there are grains of sand on Earth.
Ideally a player could say something like "My ID is XXXX-XXXX". In that case, I am not sure whether 8 characters of randomized hex is too little or too much for 100 million players. (In reality I encode it with A-Z and 0-9 instead of hex.) My game is not restricted to online play, so I would like each player to be able to obtain a unique ID even when offline, with no server to check for ID collisions.
GUIDs have been designed to be globally unique, but I don't know why that results in a 128-bit sequence. Maybe they just chose a "very large" size that is a power of 2? I don't know what the designers were thinking when ensuring that GUIDs would not clash. (Did they use the world population times something? If so, I could use 10 million times something too.)
A 128-bit guid will generally perform well, because most compilers are smart enough to reduce operations on it to a pair of 64-bit operations (and on some CPUs, a single 128-bit extended operation). Java and C#/VB.NET would likely have quite a bit more overhead than C++, but if you are using Java or C#/VB.NET, you've already accepted quite a bit more overhead, and a GUID won't add much to it.
However, if you really need smaller values, you could manually reduce GUIDs, by XOR-ing the upper 64 bits with the lower 64 bits (thereby preserving some of the uniqueness of the original) to create a compact 64-bit mostly-unique number.
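In Python, XOR-folding the two halves might look like the sketch below (`fold_to_64_bits` is an illustrative name). Keep in mind that the result has only 64 bits of space, so its collision odds are those of a 64-bit random number, not of the original GUID.

```python
import uuid

MASK_64 = (1 << 64) - 1  # selects the lower 64 bits


def fold_to_64_bits(guid):
    """XOR the upper 64 bits of a 128-bit GUID with the lower 64 bits."""
    value = guid.int  # the GUID as a 128-bit integer
    return (value >> 64) ^ (value & MASK_64)


print(hex(fold_to_64_bits(uuid.uuid4())))
```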
You could reduce to 32-bit or 48-bit in a similar way, always a multiple of the size of the original GUID. This has the advantage that you are starting out with a number that is intended to be unique across a very large set. However, keep in mind that 100 million items require a fairly high number of bits to preserve a non-overlapping guarantee, so you may just be setting yourself up for a very difficult-to-find problem later on if you aren't careful.
A crude but probably equally effective approach is to use a cryptographically-secure random number generator and construct a number as large as you need (probably minimum 48-bit). It is important not to do modulo operations on the results, or you could significantly reduce the uniqueness (due to the period of the random number generator).
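To pick the size, one option is to invert the birthday approximation p ≈ n²/(2·2^b), which gives b ≈ log2(n²/(2p)) bits for n IDs and a target collision probability p. The sketch below assumes that approximation (it is only good for small p) and uses Python's `secrets` module as the cryptographically secure source; the function names and the example target of one in a million are illustrative choices, not part of the question.

```python
import math
import secrets


def bits_needed(num_ids, max_collision_probability):
    """Bits of randomness so that num_ids random IDs collide with at most
    the given probability (birthday approximation, valid for small p)."""
    ratio = num_ids ** 2 / (2 * max_collision_probability)
    return math.ceil(math.log2(ratio))


def random_id(bits):
    """A random non-negative integer below 2**bits from the OS CSPRNG."""
    return secrets.randbits(bits)


bits = bits_needed(100_000_000, 1e-6)  # 100 million IDs, 1-in-a-million risk
print(bits, hex(random_id(bits)))
```

Under these assumptions, 100 million IDs with a one-in-a-million collision risk need far more than 48 bits, which shows how quickly the requirement grows with the number of IDs.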
I am assuming you cannot use a sequential id, although you may want to revisit that idea and see if there is a way to make a sequential id work. For example, you could use a sequential id paired with a random seed number, guaranteeing uniqueness without requiring a large number, and allowing internal indexing operations and similar optimizations that are common with large data sets.
OK, I have discussed this with a friend and we came up with a solution. This is how we decided the number of "characters" in my game ID.
A character is one of 0-9 and A-Z instead of hex, which is 36 kinds of characters. We took out 0, O, 1 and I so the ID can be printed in a variety of fonts without confusion, which leaves 32 kinds of characters.
Then, if every character is pseudo-randomized, how many players can we safely have?
We used the birthday paradox's square approximation. The formula on that page gives the number of people necessary for a 50% chance that 2 of them collide. It is 22.99 people for the birthday problem (365 possible choices).
Now we substitute 32^(number of characters) for 365 in the equation. This gives the number of players at which there is a 50% chance of 2 players having the same ID:
n ≈ 1/2 + sqrt(1/4 + 2 · ln 2 · 32^k), where k is the number of characters
Finally, we agreed on a 9-character ID, so the game can register up to about 6.9 million players before there is a 50% chance that 2 of those players have the same ID.
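A short Python sketch of that calculation, assuming the approximation used is the one that gives 22.99 for 365 choices (stated in the code comment); the function name is illustrative:

```python
import math


def ids_for_half_collision(choices):
    """Number of random picks at which a collision becomes about 50% likely.

    Uses n ≈ 1/2 + sqrt(1/4 + 2 * ln(2) * choices).
    """
    return 0.5 + math.sqrt(0.25 + 2.0 * math.log(2.0) * choices)


print(ids_for_half_collision(365))  # the classic birthday problem, about 23

# How the 50% threshold grows with ID length for a 32-symbol alphabet.
for length in range(6, 11):
    print(length, round(ids_for_half_collision(32 ** length)))
```

Each extra character multiplies the ID space by 32, but the 50% threshold only by about 5.7 (the square root of 32).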
The game isn't even online-only! IDs only collide if those 2 players are both actively playing at the same time and both send a score to the scoreboard in the same week, because of the weekly score reset. So the actual number the game can hold would be somewhat higher than that. (The game will probably not have that many players; it is just the small happy dream of every game startup. Well, at least the computation was fun.)
It will probably look like this for easier reading: 5XT-339-A67
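Generating such an ID could look like the following Python sketch. The use of the `secrets` module and the name `new_player_id` are assumptions for illustration; the alphabet is 0-9 and A-Z with 0, O, 1 and I removed, as described above.

```python
import secrets

# 0-9 and A-Z without the easily confused 0, O, 1 and I: 32 symbols.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"


def new_player_id(length=9, group=3):
    """Random ID of `length` symbols, dash-separated in groups of `group`."""
    chars = [secrets.choice(ALPHABET) for _ in range(length)]
    groups = ["".join(chars[i:i + group]) for i in range(0, length, group)]
    return "-".join(groups)


print(new_player_id())  # same shape as 5XT-339-A67
```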

How to generate a unique GUID from two unique GUIDs, which are order-insignificant

I have an application whereby users have their own IDs.
The IDs are unique.
The IDs are GUIDs, so they include letters and numbers.
I want a formula whereby if I have both IDs I can find their combined GUID, regardless of which order I use them in.
These GUIDs are 16 digits long; for the example below I will pretend they are 4.
user A: x43y
user B: f29a
If I use formula X, which takes two arguments, X(a,b), I want the produced code to be the same regardless of whether a is User A's or User B's GUID.
I do not require a method to find either user's ID, given one, from this formula; i.e. it is a one-way method.
Thank you for any answers or direction
So I'll turn my comment into an answer. Then this question can get answered, the answer accepted (if it is good enough) and we can all move on.
Sort the GUIDs lexicographically and append the second to the first. The result is unique, and has all the other characteristics you've asked for.
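A minimal Python sketch of the sort-and-append approach. The function name is illustrative, and lowercasing before comparison is an added assumption so that the same GUID written in different letter case still sorts consistently:

```python
import uuid


def combine_guids(a, b):
    """Order-independent combination: sort lexicographically, then concatenate."""
    first, second = sorted((str(a).lower(), str(b).lower()))
    return first + second


user_a = uuid.uuid4()
user_b = uuid.uuid4()
print(combine_guids(user_a, user_b) == combine_guids(user_b, user_a))  # True
```

The result is twice as long as a single GUID, which is the price of keeping it guaranteed unique.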
Can you compress it (I know you wrote shorten, but bear with me) down to 16 characters? No, you can't; not, that is, if you want to be able to decompress it again and recover the original bits. (You've written that you don't need to be able to recover the original GUIDs, so skip the next paragraph if you want to.)
A GUID is, essentially, a random sequence of 128 bits. Random sequences can't, by definition, be compressed. If a sequence of 128 bits is compressible it can't be random, there would have to be some algorithm for inflating the compressed version back to 128 bits. I know that since GUIDs are generated algorithmically they're not truly random. However, in practice there is almost no point in regarding them as anything other than truly random; I certainly don't think you should waste your time trying to compress them.
Given that the total population of possible GUIDs is large, you might be satisfied by a method which takes the first half of each individual GUID and assembles a pseudo-GUID from them. Depending on how many GUIDs your system is likely to be working with, and your appetite for risk, this might satisfy your practical needs.

Sudden drop in performance of hash table

I recently implemented an algorithm in Java that used a hash table. I compared it to a few other algorithms with rather large data input sizes such as 100000.
The thing that has struck me is that once my data input size exceeds 10000 the performance of the hash table drops dramatically. To emphasise this drop, what took 4000 ms with input size 1000 suddenly goes up to 172000 ms for input size 5000.
Can anyone please explain to me what the reason for this is? I'd really like to know.
Thanks!
This question is way too ambiguous for anyone to give a definitive answer, but if I had to guess I would say that you are encountering collisions. The stock implementation of Java's HashMap uses linked lists to hold the entries whose keys' hashes collide, which will certainly happen if the hashCode method has been incorrectly defined, for example by returning a constant value.
Having said that, if you're just measuring elapsed time, that doesn't tell you too much. Perhaps you crossed a threshold that caused a major garbage collection to occur. You should try to measure performance after your JVM and hash table are sufficiently warmed up, and take lots of measurements and consider their average, before coming to any conclusions.
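The answer is about Java's HashMap, but the effect of a constant hash code is easy to see in any hash table. The sketch below is only an analogous illustration in Python, with a hypothetical `Key` class that counts equality comparisons instead of measuring time, which avoids the warm-up and garbage-collection noise mentioned above:

```python
class Key:
    eq_calls = 0  # class-wide counter of equality comparisons

    def __init__(self, value, constant_hash):
        self.value = value
        self.constant_hash = constant_hash

    def __hash__(self):
        # A constant hash forces every key into the same collision chain.
        return 42 if self.constant_hash else hash(self.value)

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.value == other.value


def count_comparisons(n, constant_hash):
    """Insert n distinct keys into a dict and count equality comparisons."""
    Key.eq_calls = 0
    table = {}
    for i in range(n):
        table[Key(i, constant_hash)] = i
    return Key.eq_calls


print(count_comparisons(200, constant_hash=False))  # few or no comparisons
print(count_comparisons(200, constant_hash=True))   # grows roughly as n squared
```

With a constant hash, each insert has to be compared against the keys already stored, so total work grows quadratically with input size, which is the kind of sudden slowdown described in the question.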

Partially re-create Risk-like game based on incomplete log files

I'm trying to re-create this conquerclub (Risk-like) game:
http://conquerclub.barrycarter.info/ONEOFF/7460216.html
In other words, I want to know who owned each territory at each point
in time, and how many troops they had on that territory. My primary
source of information is the Game Log. Notes:
% It's not in the Game Log, but all territories start w/ 3 troops.
% Since we know the territory owners at the end of the game, and the
Game Log mentions all owner changes, determining territory owners at
any point in time is easy.
% The challenge is to find the number of troops on a territory at a
given time.
% The Game Log gives information on troop deployment, reinforcement,
and conquest.
% However, the Game Log is incomplete. Suppose territory X attacks
territory Y unsuccessfully, but both territories lose troops in the
process. The Game Log will not mention this.
% It's probably not possible (in general) to find the exact number
of troops on a territory at a given time, so I'm looking for a range.
% I tried feeding the data to Mathematica as a series of
inequalities, but as the manual warns, the computation time increases
exponentially with the number of inequalities. Even with a fairly
small number of inequalities, it hangs. Plus, I'm not convinced
Mathematica is the right tool here.
% Any thoughts? Another example is:
http://conquerclub.barrycarter.info/ONEOFF/7562013.html
% I know about http://userscripts.org/scripts/show/83035 but that only tracks
owners, not number of troops.
You could make use of Prolog's constraint programming (specifically, CLP/FD). It would require you to encode all rules in Prolog, which might be a non-trivial task. However Prolog would be able then to show you all possible valid (legal in terms of encoded rules) ways of playing such game, or just show ranges of possible values.
Also, while CLP/FD in Prolog is sometimes quite fast, it might be difficult to get it to solve your problem quickly. Most free solvers have many quirks.
Again, I think this is a nontrivial task, and even more so if you haven't programmed in Prolog before. But I am pretty sure this would give you the answers you seek.
