Probability of hash collision

Probability of hash collision - math

I am looking for some precise math on the likelihood of collisions for MD5, SHA1, and SHA256 based on the birthday paradox.
I am looking for something like a graph that says "If you have 10^8 keys, this is the probability. If you have 10^13 keys, this is the probability and so on"
I have looked at tons of articles but I am having a tough time finding something that gives me this data. (Ideal option for me would be a formula or code that calculates this for any provided hash size)

Let's imagine we have a truly random hash function that hashes from strings to n-bit numbers. This means that there are 2^n possible hash codes, and each string's hash code is chosen uniformly at random from all of those possibilities.
The birthday paradox specifically says that once you've seen roughly √(2k) items, there's a 50% chance of a collision, where k is the number of distinct possible outputs. In the case where the hash function hashes to an n-bit output, this means that you'll need roughly 2^(n/2) hashes before you get a collision. This is why we typically pick hashes that output 256 bits; it means that we'd need a staggering 2^128 ≈ 10^38 items hashed before there's a "reasonable" chance of a collision. With a 512-bit hash, you'd need about 2^256 to get a 50% chance of a collision, and 2^256 is approximately the number of protons in the known universe.
The exact formula for the probability of getting a collision with an n-bit hash function and k strings hashed is
1 - (2^n)! / (2^(kn) (2^n - k)!)
This is a fairly tricky quantity to work with directly, but we can get a decent approximation of this quantity using the expression
1 - e^(-k^2 / 2^(n+1))
So, to get (roughly) a probability p chance of a collision, we can solve to get
p ≈ 1 - e^(-k^2 / 2^(n+1))
1 - p ≈ e^(-k^2 / 2^(n+1))
ln(1 - p) ≈ -k^2 / 2^(n+1)
-ln(1 - p) ≈ k^2 / 2^(n+1)
-2^(n+1) ln(1 - p) ≈ k^2
2^((n+1)/2) √(-ln(1 - p)) ≈ k
As one last approximation, assume we're dealing with very small choices of p. Then ln(1 - p) ≈ -p, so we can rewrite this as
k ≈ 2^((n+1)/2) √p
Notice that there's still a monster 2^((n+1)/2) term here, so for a 256-bit hash that leading term is 2^128.5, which is just enormous. For example, how many items must we see to get a 2^-50 chance of a collision with a 256-bit hash? That would be approximately
2^((256+1)/2) √(2^-50)
= 2^(257/2) · 2^(-50/2)
= 2^(207/2)
= 2^103.5.
So you'd need a staggeringly huge number of hashes to have a vanishingly small chance of getting a collision. Figure that 2^103.5 is about 10^31, which at one nanosecond per hash computed would take you longer than the age of the universe to compute. And after all that, you'd get a success probability of 2^-50, which is about 10^-15.
In fact, this is precisely why we pick such large numbers of bits for our hashes! It makes it extremely unlikely for a collision to occur by chance.
(Note that the hash functions we have today aren't actually truly random functions, which is why people advise against using MD5, SHA1, and others that have had security weaknesses exposed.)
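Since the question asked for a formula or code that works for any hash size, here is a minimal Python sketch of the birthday approximation. The function names are made up for illustration, and it assumes an ideal, uniformly random hash, so it says nothing about deliberate attacks on MD5 or SHA1.

```python
import math

def collision_probability(n_bits, k):
    """Approximate chance of at least one collision among k random n-bit hashes."""
    # p ≈ 1 - e^(-k^2 / 2^(n+1)); expm1 keeps precision when p is tiny.
    return -math.expm1(-(k * k) / 2.0 ** (n_bits + 1))

def keys_for_probability(n_bits, p):
    """Approximate number of keys needed to reach a collision chance of p."""
    # k ≈ 2^((n+1)/2) * sqrt(-ln(1 - p)); log1p keeps precision for small p.
    return 2.0 ** ((n_bits + 1) / 2) * math.sqrt(-math.log1p(-p))

# Print a small table for the hash sizes and key counts from the question.
for name, bits in [("MD5", 128), ("SHA1", 160), ("SHA256", 256)]:
    for k in (10**8, 10**13):
        p = collision_probability(bits, k)
        print(f"{name:7s} k={k:.0e}  p≈{p:.3e}")
```

For instance, keys_for_probability(256, 2**-50) comes out at about 2^103.5, which lines up with the hand calculation.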
Hope this helps!

Related

Is it always necessary to make hash table number of buckets a prime number for performance reason?

https://www.quora.com/Why-should-the-size-of-a-hash-table-be-a-prime-number?share=1
I see that people mention that the number of buckets of a hash table is better to be prime numbers.
Is it always the case? When the hash values are already evenly distributed, there is no need to use prime numbers then?
https://github.com/rui314/chibicc/blob/main/hashmap.c
For example, the above hash table code does not use prime numbers as the number of buckets.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L37
But the hash values are generated from strings using fnv_hash.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L17
So there is a reason why it makes sense to use bucket sizes that are not necessarily prime numbers?

The answer is "usually you don't need a table whose size is a prime number, but there are some implementation reasons why you might want to do this."
Fundamentally, hash tables work best when hash codes are spread out as close to uniformly at random as possible. That prevents items from clustering in any one location within the table. At some level, provided that you have a good enough hash function to make this happen, the size of the table doesn't matter.
So why do folks say to pick tables whose size is a prime? There are two main reasons for this, and they're due to specific cases that don't arise in all hash tables.
One reason why you sometimes see prime-sized tables is due to a specific way of building hash functions. You can build reasonable hash functions by picking functions of the form h(x) = (ax + b) mod p, where a is a number in {1, 2, ..., p-1} and b is a number in {0, 1, 2, ..., p-1}, assuming that p is a prime. If p isn't prime, hash functions of this form don't spread items out uniformly. As a result, if you're using a hash function like this one, then it makes sense to pick a table whose size is a prime number.
The second reason you see advice about prime-sized tables is if you're using an open-addressing strategy like quadratic probing or double hashing. These hashing strategies work by hashing items to some initial location k. If that slot is full, we look at slot (k + r) mod T, where T is the table size and r is some offset. If that slot is full, we then check (k + 2r) mod T, then (k + 3r) mod T, etc. If the table size is a prime number and r isn't zero, this has the nice, desirable property that these indices will cycle through all the different positions in the table without ever repeating, ensuring that items are nicely distributed over the table. With non-prime table sizes, it's possible that this strategy gets stuck cycling through a small number of slots, which gives less flexibility in positions and can cause insertions to fail well before the table fills up.
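To make the probing argument concrete, here is a small Python sketch that counts how many distinct slots the sequence (k + i*r) mod T reaches. The helper names are hypothetical and the numbers are arbitrary toy values, not taken from any real hash table.

```python
def probe_sequence(k, r, table_size):
    """Slots visited by (k + i*r) mod T for i = 0 .. T-1."""
    return [(k + i * r) % table_size for i in range(table_size)]

def slots_reached(k, r, table_size):
    """Number of distinct slots the probe sequence can ever land on."""
    return len(set(probe_sequence(k, r, table_size)))

print(slots_reached(3, 4, 11))  # prime table size
print(slots_reached(3, 4, 12))  # composite size that shares a factor with r
```

With T = 11 all 11 slots are reached. With T = 12 and r = 4 only 12 / gcd(4, 12) = 3 slots are, so an insertion could fail even though most of the table is empty.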
So assuming you aren't using double hashing or quadratic probing, and assuming you have a strong enough hash function, feel free to size your table however you'd like.

templatetypedef has some excellent points as always - just adding a couple more and some examples...
Is it always necessary to make hash table number of buckets a prime number for performance reason?
No. Firstly, using prime numbers for bucket count tends to mean you need to spend more CPU cycles to fold/mod a hash value returned by the hash function into the current bucket count. A popular alternative is to use powers of two for the bucket count (e.g. 8, 16, 32, 64... as you resize), because then you can do a bitwise AND operation to map from a hash value to a bucket in 1 CPU cycle. That answers your "So there is a reason why it makes sense to use bucket sizes that are not necessarily prime numbers?"
Tuning a hash table for performance often means weighing the cost of a stronger hash function and modding by prime numbers against the cost of higher collisions.
Prime bucket counts often help reduce collisions when the hash function is unable to produce a very good distribution for the keys it's fed.
For example, if you hashed a bunch of pointers to 64-bit doubles using an identity hash (basically, casting the pointer address to a size_t), then the hash values would all be multiples of 8 (due to alignment), and if you had a hash table size like say 1024 or 2048 (powers of 2), then all your pointers would hash onto 1/8th of the bucket indices (specifically, buckets 0, 8, 16, 24, 32 etc.). With a prime number of buckets, at least the pointer values - which if the load factor is high are inevitably spread out over a much larger range than the range of bucket indices - tend to wrap around the hash table hitting different indices.
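The pointer example can be simulated in a few lines of Python. This is only an illustration: the base address 0x7F0000 and the helper name are invented, and real allocators will not hand out perfectly consecutive addresses.

```python
def buckets_used(hashes, bucket_count):
    """Number of distinct bucket indices hit when folding hashes by modulo."""
    return len({h % bucket_count for h in hashes})

# 800 hypothetical 8-byte-aligned addresses, hashed with an identity hash.
addresses = [0x7F0000 + 8 * i for i in range(800)]

print(buckets_used(addresses, 1024))  # power-of-two bucket count
print(buckets_used(addresses, 1021))  # nearby prime bucket count
```

With 1024 buckets the 800 keys share just 128 indices (every eighth bucket), while with 1021 buckets each key lands in its own bucket. For a power-of-two count, h % bucket_count gives the same index as h & (bucket_count - 1), which is where the cheap bitwise AND comes from.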
When you use a very strong hash function - where the low order bits are effectively random but repeatable, you'll already get a good distribution across buckets regardless of the bucket count. There are also times when even with a terribly weak hash function - like an identity hash - h(x) == x - all the bits in the keys are so random that they produce as good a distribution as a cryptographic hash could produce, so there's no point spending extra time on a stronger hash - that may even increase collisions.
There are also times when the distribution isn't inherently great, but you can afford to use extra memory to keep the load factor low, so it's not worth using primes or a better hash function. Still, extra buckets put more strain on the CPU caches too - so things can end up slower than hoped for.
Other times, keys with an identity hash have an inherent tendency to fall into distinct buckets (e.g. because they might have been generated by an incrementing counter, even if some of the values are no longer in use). In that case, a strong hash function increases collisions and worsens CPU cache access patterns. Whether you use powers of two or prime bucket counts makes little difference here.
When the hash values are already evenly distributed, there is no need to use prime numbers then?
That statement is trivially true but kind of pointless if you're talking about hash values after the mod-to-current-hash-table-size operation: even distribution there directly relates to few collisions.
If you're talking about the more interesting case of hash values evenly distributed in the hash function return type value space (e.g. a 64-bit integer), before those values are modded into whatever the current hash table bucket count is, then there's still room for prime numbers to help, but only when the hashed key space spans a larger range than the hash bucket indices. The pointer example above illustrated that: if you had say 800 distinct 8-byte-aligned pointers going into ~1000 buckets, then the difference between the numerically lowest pointer and the highest address would be at least 799*8 = 6392... you're wrapping around the table more than 6 times at a minimum (for close-as-possible pointers), and a prime number of buckets would increase the odds of each "wrap" modding onto previously unused buckets.
Note that some of the above benefits to prime bucket counts apply to any kind of collision handling - separate chaining, linear probing, quadratic probing, double hashing, cuckoo hashing, robin hood hashing etc.

RSA - bitlength of p and q

I'm just trying to understand the key generation part of RSA, and more specifically, selecting the p and q primes. Given a target bit length for the modulus, n, what range I should be generating p and q in?
The modulus, n, is the product of p and q, where p and q are both prime numbers. I've read that p and q should be relatively close to each other, and somewhere around sqrt(n). If the target bit length is, for example, 32 bits (very small I realise), then does that follow that p and q should be a random prime of a maximum 16 bits?
Thanks for any clarification
Rob

For a 32-bit modulus the question is a bit academic: your primary aim in choosing p and q is to make the product hard to factorize, but finding the prime factorisation of a number smaller than 2^32 is so easy that there's little point worrying about the sizes of p and q in this case. Note that the mathematics will work just fine so long as p and q are distinct primes.
For something more realistic, like a 1024-bit modulus, then yes, you're pretty safe choosing two 512-bit primes p and q at random: that is, choose p and q uniformly from the set of all primes in the range [2^511, 2^512]. There's a notion of 'strong primes', which are primes designed to avoid particular possible known attacks---for example, you'll see recommendations that p and q should be chosen so that p-1 and q-1 have large factors, to guard against easy factorizations using Pollard's 'p-1' algorithm. However, these recommendations don't really apply to large moduli and state-of-the-art factorization algorithms (GNFS, ECM). There are other possible cases that in theory could give an easy factorization, but they're so unlikely to turn up in practice from random choices of p and q that they're not worth worrying about.
Summary: just choose two random primes with equal bitlength, and you're done.
A couple of additional comments and things to think about:
Of course, if you do choose two 512-bit primes, you'll end up with either a 1023-bit or a 1024-bit modulus; that's probably not worth worrying about, but if you really cared about getting exactly a 1024-bit modulus you could either restrict the range of p and q further, say to [1.5 * 2^511, 2^512], or just throw out any 1023-bit modulus and try again.
Don't deliberately choose p and q so that they're near each other: if p and q are truly close to each other (e.g., less than 10^10 apart, say), then their product pq is easily factorized by Fermat's method. But if you're choosing random primes p and q in the range [2^511, 2^512], this isn't going to happen with any sort of realistic probability.
When choosing a prime at random, a tempting strategy is to pick a random (odd) integer in the range [2^511, 2^512] and then increment it until you find the first prime. But note that that does not give a uniform choice amongst all primes: primes occurring after a large gap would be more likely to come up than other primes. A better strategy is just to keep picking random odd numbers and keep the first one that's a prime (or more likely, a strong probable prime to so many randomly-chosen bases that you can be sure in practice that it's prime).
Make sure you've got a really good cryptographic source of random numbers on hand for your prime number generation.
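The advice in these comments can be sketched in Python roughly as follows. Treat it as an illustration rather than production key generation: the function names are invented, the number of Miller-Rabin rounds (40) is an assumed choice, and a real system should rely on a vetted cryptographic library.

```python
import secrets

def is_probable_prime(n, rounds=40):
    """Miller-Rabin strong probable-prime test with random bases."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2^s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2  # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

def random_prime(bits):
    """Keep drawing fresh random odd numbers of the given bit length until one is prime."""
    while True:
        candidate = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate

def rsa_primes(modulus_bits):
    """Two distinct primes of equal bit length whose product has exactly modulus_bits bits."""
    half = modulus_bits // 2
    while True:
        p, q = random_prime(half), random_prime(half)
        if p != q and (p * q).bit_length() == modulus_bits:
            return p, q  # otherwise throw the pair away and try again
```

Each candidate is drawn independently instead of incrementing from a starting point, which avoids the bias toward primes that follow large gaps, and the secrets module draws from the operating system's cryptographic random source.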

Why are "large prime numbers" used in RSA/encryption?

I've learned the theory of public key encryption but I'm missing the connection to the physical world. e.g.
I've been told that good RSA encryption should rely on prime numbers with 300 decimal digits but why? who came up with this number? How long it will take to break such encryption (statistics about different machines).
I've tried Google, but couldn't find what I wanted. anyone?
thanks

The key idea of asymmetric cryptography is to have a function that lets one key decrypt messages encrypted with the other key, without making it possible to derive one key from the other. In RSA, this function is based on the difficulty of factoring the product of two large prime numbers; however, it is not the only option (elliptic curves are another one, for example).
So, basically you need two prime numbers for generating an RSA key pair. If you are able to factor the public modulus and find these prime numbers, you will then be able to find the private key. The whole security of RSA is based on the fact that it is not easy to factor large composite numbers, which is why the length of the key so strongly affects the robustness of the RSA algorithm.
There are competitions to factor large composite numbers, with cash prizes. The most recent milestone at the time of writing was reached in 2009, when a 768-bit RSA key was factored. That's why keys of at least 2048 bits should be used now.
As usual, Wikipedia is a good reference on RSA.

All public key algorithms are based on trapdoor functions, that is, mathematical constructs that are "easy" to compute in one direction, but "hard" to reverse unless you also have some additional information (used as the private key), at which point the reverse becomes "easy" too.
"Easy" and "hard" are just qualitative adjectives that are always more formally defined in terms of computational complexity. "Hard" very often refers to computations that cannot be solved in polynomial time O(n^x) for some fixed x, where n is the size of the input data.
In the case of RSA, the "easy" function is the modular exponentiation C = M^e mod N where the factors of N are kept secret. The "hard" problem is to find the e-th root of C (that is, M). Of course, "hard" does not mean that it is always hard, but (intuitively) that increasing the size of N by a certain factor increases the complexity by a much larger factor.
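A toy Python run shows the "easy" direction and the trapdoor. The primes 61 and 53, the exponent 17 and the message 65 are illustrative values chosen here; they are far too small to be secure, and real RSA also applies padding, which this sketch leaves out.

```python
# Toy parameters: the factors p and q are the secret trapdoor information.
p, q = 61, 53
n = p * q                    # public modulus, 3233
phi = (p - 1) * (q - 1)      # 3120, computable only if you know p and q
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

m = 65                       # message, must be smaller than n
c = pow(m, e, n)             # "easy" direction: C = M^e mod N
recovered = pow(c, d, n)     # easy only with d, i.e. with the trapdoor

print(c, recovered)
```

Anyone can compute c from m, e and n, but going back from c to m without d amounts to taking an e-th root modulo n.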
The sizes of the modulus which are recommended (2048 bits, or 617 decimal digits) relate to the availability of computation power at present time, so that if you stick to them you are assured that it will be extremely expensive for the attacker to break it. For more details, I should refer you to a brilliant answer on cryptography.SE (go and upvote :-)).
Finally, in order to have a trapdoor, N is built so as to be a composite number. In theory, for improved performance, N may have more than 2 factors, but the general security rule is that all factors must be balanced and have roughly the same size. That means that if you have K factors, and N is B bits long, each factor is roughly B/K bits long.
The problem to solve is not the same as the integer factorization problem though. The two are related in that if you manage to factor N you can compute the private key by re-doing what the party that generated the key did. Typically, the exponent e being used is very small (3); it cannot be excluded that someday somebody devises an algorithm to compute the e-th root without factoring N.
EDIT: Corrected the number of decimal digits for the modulus of a 2048 bits RSA key.

RSA uses the idea of one-way math functions, so that it's easy to encrypt and decrypt if you have the key, but hard (as in it takes lots and lots of CPU cycles) to decrypt if you don't have the key. Even before they thought of using prime numbers, mathematicians identified the need for a one-way function.
The first method they hit upon was the idea that if your "key" is a prime number, and your message is another number, then you can encrypt by multiplying the two together. Someone with the key can easily divide out the prime number and get the message, but for someone without the prime number, figuring out the prime number key is hard.

how to find the number of possibilities of a hash

if i have a hash say like this: 0d47aeda9d97686ab3da96bae2c93d078a5ab253
how do i do the math to find out the number of possibilities to try if i start with 0000000000000000000000000000000000000000 to 9999999999999999999999999999999999999999 which is the general length of a sha1.

The number of possibilities would be 2^(X) where X is the number of bits in the hash.
In the normal hexadecimal string representation of the hash value like the one you gave, each character is 4 bits, so it would be 2^(4*len) where len is the string length of the hash value. In your example, you have a 40 character SHA1 digest, which corresponds to 160 bits, or 2^160 == 1.4615016373309029182036848327163e+48 values.
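As a quick check of that arithmetic, here is a short Python sketch; the helper name is made up for illustration, and it assumes the digest is given as a plain hexadecimal string.

```python
def hash_space_size(hex_digest):
    """Number of possible values for a hash printed as this many hex characters."""
    return 2 ** (4 * len(hex_digest))  # each hex character carries 4 bits

digest = "0d47aeda9d97686ab3da96bae2c93d078a5ab253"
size = hash_space_size(digest)

print(len(digest) * 4)  # number of bits
print(size)             # exact integer value of 2^160
print(f"{size:.4e}")    # same value in scientific notation
```

Python integers are arbitrary precision, so the exact 49-digit value is printed rather than a rounded float.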

An SHA-1 hash is 160 bits, so there are 2^160 possible hashes.

Your hexadecimal digit range is 0 through f.
Then it's simply 16^40 or however many characters it contains

Recall that a hash function accepts inputs of arbitrary length. A good cryptographic hash function will seem to assign a "random" hash result to any input. So if the digest is N bits long (for SHA-1, N=160), then every input will be hashed to one of 2^N possible results, in a manner we'll treat as random.
That means that the expectation for finding a preimage for your hash result is running through 2^N inputs. They don't have to be specifically the range that you suggested - any 2^N distinct inputs are fine.
This also means that 2^N inputs don't guarantee that you'll find a preimage - each try is random, so you might miss your 1-in-2^N chance in every single one of those 2^N inputs (just like flipping a coin twice doesn't guarantee you'll get heads at least once). But you can figure out how many inputs are required to find a preimage for the hash with probability p or greater - with p being as close to one as you desire (just not actually 1).
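A hedged sketch of that last calculation in Python: if every try independently hits the target with probability 2^-N, then k tries all miss with probability (1 - 2^-N)^k, and solving for k gives the function below. The function name is invented, and the independence of tries is the same idealisation used in the text.

```python
import math

def tries_for_success(n_bits, p):
    """Smallest k with 1 - (1 - 2^-N)^k >= p, for an ideal N-bit hash."""
    # k >= ln(1 - p) / ln(1 - 2^-N); log1p stays accurate for tiny 2^-N.
    return math.ceil(math.log1p(-p) / math.log1p(-(2.0 ** -n_bits)))

for p in (0.5, 0.9, 0.99):
    k = tries_for_success(160, p)
    print(p, f"{k / 2**160:.3f} * 2^160 tries")
```

Under this model, exactly 2^N tries succeed with probability of about 1 - 1/e, roughly 63%, which is why 2^N inputs are no guarantee.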

Maximum variations, with repetition and with attention to order, are defined as n^k. In your case this would mean 10^40, which can't be correct for SHA1. According to Wikipedia, SHA1 has a maximum complexity of 2^80 for a collision-based attack, and using different techniques researchers have already succeeded with a complexity of about 2^51, so 10^40 seems a bit much.

How do I compute the approximate entropy of a bit string?

Is there a standard way to do this?
Googling -- "approximate entropy" bits -- uncovers multiple academic papers but I'd like to just find a chunk of pseudocode defining the approximate entropy for a given bit string of arbitrary length.
(In case this is easier said than done and it depends on the application, my application involves 16,320 bits of encrypted data (cyphertext). But encrypted as a puzzle and not meant to be impossible to crack. I thought I'd first check the entropy but couldn't easily find a good definition of such. So it seemed like a question that ought to be on StackOverflow! Ideas for where to begin with de-cyphering 16k random-seeming bits are also welcome...)
See also this related question:
What is the computer science definition of entropy?

Entropy is not a property of the string you got, but of the strings you could have obtained instead. In other words, it qualifies the process by which the string was generated.
In the simple case, you get one string among a set of N possible strings, where each string has the same probability of being chosen as every other, i.e. 1/N. In that situation, the string is said to have an entropy of N. The entropy is often expressed in bits, which is a logarithmic scale: an entropy of "n bits" is an entropy equal to 2^n.
For instance: I like to generate my passwords as two lowercase letters, then two digits, then two lowercase letters, and finally two digits (e.g. va85mw24). Letters and digits are chosen randomly, uniformly, and independently of each other. This process may produce 26*26*10*10*26*26*10*10 = 4569760000 distinct passwords, and all these passwords have equal chances to be selected. The entropy of such a password is then 4569760000, which means about 32.1 bits.
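The password example works out as follows in Python. The variable names are illustrative, and the calculation assumes, as the text does, that every position is chosen uniformly and independently.

```python
import math

# Alphabet size for each of the eight positions: letter, letter, digit, digit, ...
alphabet_sizes = [26, 26, 10, 10, 26, 26, 10, 10]

count = math.prod(alphabet_sizes)  # number of equally likely passwords
bits = math.log2(count)            # the same entropy on the logarithmic scale

print(count)
print(round(bits, 1))
```

The entropy here comes entirely from the generation process: the same eight characters produced by a different process would have a different entropy.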

Shannon's entropy equation is the standard method of calculation. Here is a simple implementation in Python, shamelessly copied from the Revelation codebase, and thus GPL licensed:
import math
def entropy(string):
    "Calculates the Shannon entropy of a string"
    # get probability of chars in string
    prob = [ float(string.count(c)) / len(string) for c in dict.fromkeys(list(string)) ]
    # calculate the entropy
    entropy = - sum([ p * math.log(p) / math.log(2.0) for p in prob ])
    return entropy
def entropy_ideal(length):
    "Calculates the ideal Shannon entropy of a string with given length"
    prob = 1.0 / length
    return -1.0 * length * prob * math.log(prob) / math.log(2.0)
Note that this implementation assumes that your input bit-stream is best represented as bytes. This may or may not be the case for your problem domain. What you really want is your bitstream converted into a string of numbers. Just how you decide on what those numbers are is domain specific. If your numbers really are just ones and zeros, then convert your bitstream into an array of ones and zeros. The conversion method you choose will affect the results you get, however.

I believe the answer is the Kolmogorov Complexity of the string.
Not only is this not answerable with a chunk of pseudocode, Kolmogorov complexity is not a computable function!
One thing you can do in practice is compress the bit string with the best available data compression algorithm.
The more it compresses the lower the entropy.
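As a rough illustration of the compression idea, Python's zlib can stand in for "the best available" compressor (it is not, so the ratio is only a loose upper bound on the information content). The function name is made up, and the 2040-byte size simply matches the 16,320 bits mentioned in the question.

```python
import os
import zlib

def compression_ratio(data):
    """Compressed size divided by original size; lower suggests more structure."""
    return len(zlib.compress(data, 9)) / len(data)

structured = b"\x00" * 2040       # 16,320 bits of obviously low entropy
random_like = os.urandom(2040)    # 16,320 bits from the OS random source

print(compression_ratio(structured))
print(compression_ratio(random_like))
```

A ratio near or above 1 means zlib found no structure to exploit; it does not prove the data is random, only that this particular model could not predict it.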

The NIST Random Number Generator evaluation toolkit has a way of calculating "Approximate Entropy." Here's the short description:
Approximate Entropy Test Description: The focus of this test is the
frequency of each and every overlapping m-bit pattern. The purpose of
the test is to compare the frequency of overlapping blocks of two
consecutive/adjacent lengths (m and m+1) against the expected result
for a random sequence.
And a more thorough explanation is available from the PDF on this page:
http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html
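Going by the short description above, the core statistic can be sketched in Python like this. It is an interpretation, not the toolkit's code: the function names are invented, the wrap-around of the first m-1 bits is an assumption about how the overlapping blocks are formed, and the final step that turns the statistic into a p-value is left out.

```python
import math
from collections import Counter

def phi(bits, m):
    """Sum of pi * ln(pi) over the observed overlapping m-bit patterns."""
    n = len(bits)
    extended = bits + bits[:m - 1]  # assumed wrap-around so there are n blocks
    counts = Counter(extended[i:i + m] for i in range(n))
    return sum((c / n) * math.log(c / n) for c in counts.values())

def approximate_entropy(bits, m):
    """ApEn(m) = phi(m) - phi(m + 1) for a string of '0' and '1' characters."""
    return phi(bits, m) - phi(bits, m + 1)

print(approximate_entropy("0100110101", 3))
print(approximate_entropy("0" * 64, 3))
```

Values close to ln 2 (about 0.693) are what an unpredictable bit sequence would be expected to give, while strongly patterned input gives values near 0.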

There is no single answer. Entropy is always relative to some model. When someone talks about a password having limited entropy, they mean "relative to the ability of an intelligent attacker to predict", and it's always an upper bound.
Your problem is, you're trying to measure entropy in order to help you find a model, and that's impossible; what an entropy measurement can tell you is how good a model is.
Having said that, there are some fairly generic models that you can try; they're called compression algorithms. If gzip can compress your data well, you have found at least one model that can predict it well. And gzip is, for example, mostly insensitive to simple substitution. It can handle "wkh" frequently in the text as easily as it can handle "the".

Using Shannon entropy of a word with this formula : http://imgur.com/a/DpcIH
Here's an O(n) algorithm that calculates it:
import math
from collections import Counter
def entropy(s):
    l = float(len(s))
    return -sum(map(lambda a: (a/l)*math.log2(a/l), Counter(s).values()))

Here's an implementation in Python (I also added it to the Wiki page):
import numpy as np
def ApEn(U, m, r):
    def _maxdist(x_i, x_j):
        # Chebyshev distance between two length-m templates
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])
    def _phi(m):
        # all overlapping templates of length m
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        # fraction of templates within distance r of each template
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        # Phi(m): average of log(C_i) over all templates
        return (N - m + 1.0)**(-1) * sum(np.log(C))
    N = len(U)
    return _phi(m) - _phi(m + 1)
Example:
>>> U = np.array([85, 80, 89] * 17)
>>> ApEn(U, 2, 3)
1.0996541105257052e-05
The above example is consistent with the example given on Wikipedia.