Probability of Collisions in Hash Table - math

When inserting n items into a hash table of size m, assuming that the destination of each item is independently uniformly random, what is the probability that no collision occurs?
My working thus far:
We have n items and m locations.
Each item has a 1/m chance of being in any particular location.
There are nC2 possible pairs of items.
The probability of there being no collisions is the probability that for every location, every pair of items does not hash to that location.
For any given location, for any given pair, the probability that the two items do not hash to that location is (m-1)/m.
Then, for any given location, the probability that the above is true for ALL pairs is ((m-1)/m)^(nC2).
Then, the probability that this is true for every location is
[((m-1)/m)^(nC2)]^(m).

You made a few mistakes in that reasoning. The main one is that you assume that the probabilities for pairs not hashing together are independent, so you can multiply them together. You have not shown that is the case, and in fact it is not the case. Consider three elements a, b, and c. If you know that both a and b do not collide with c, then they are limited to m-1 places rather than the initial m places, and they are more likely to collide with each other than if you just ignore c.
Here is a straightforward way to find your desired probability. Looking at the total possibilities ignoring collisions, each of the n items has m places to go. Those placements are independent, so the total possibilities are m^n (or m**n in Python) if we take order into account.
If we know there are no collisions, those n items are a way of choosing out of the m locations without replacement. So if we take order into account, that makes mPn possibilities -- the ways to choose n items out of m choices without replacement and with order (permutations). Therefore your desired probability is
mPn / m^n = (m!) / ((m-n)! * m^n) = m/m * (m-1)/m * (m-2)/m * ... * (m-n+1)/m
There are n factors in that last expression. (This would be so much better in MathJax!) You can choose which of those three equivalent expressions is best for your purpose.
There are other ways to come up with those expressions, of course. That last one can be thought of as the probability of no collision placing 1 item in m slots times the conditional probability of placing a second item given no prior collision times the conditional probability of placing a third item given no prior collision times ....
Those expressions are fairly easy to test. Just choose specific, small values of m and n, generate all possible choices of n items out of those m, and find the empirical probability of no collisions. This should agree with the formula(s) above. I'll leave the choice of programming language and the coding to you. After all, this is a programming site. I just did this in Python, for multiple choices of n and m, and it works out.
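One way to run the check suggested above is a brute-force enumeration. The following is only a sketch in Python: the function names `exact_no_collision` and `enumerated_no_collision` are made up for this illustration, and it assumes Python 3.8 or later for `math.perm`.

```python
from fractions import Fraction
from itertools import product
from math import perm


def exact_no_collision(n, m):
    """Closed form mPn / m^n, kept as an exact fraction."""
    return Fraction(perm(m, n), m ** n)


def enumerated_no_collision(n, m):
    """Enumerate all m^n placements and count those with no repeated slot."""
    good = sum(
        1
        for slots in product(range(m), repeat=n)
        if len(set(slots)) == n  # all n items landed in distinct slots
    )
    return Fraction(good, m ** n)


# Compare the two for small table sizes (including n > m, where the
# probability of no collision must be 0).
for m in range(1, 6):
    for n in range(0, m + 2):
        assert exact_no_collision(n, m) == enumerated_no_collision(n, m)
```

Exact fractions are used instead of floats so that the comparison does not depend on rounding.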

Related

Is it possible to represent 'average value' in programming?

Had a tough time thinking of an appropriate title, but I'm just trying to code something that can auto compute the following simple math problem:
The average value of a,b,c is 25. The average value of b,c is 23. What is the value of 'a'?
For us humans we can easily compute that the value of 'a' is 29, without the need to know b and c. But I'm not sure if this is possible in programming, where we code a function that takes in the average values of 'a,b,c' and 'b,c' and outputs 'a' automatically.
Yes, it is possible to do this. The reason for this is that you can model the sort of problem being described here as a system of linear equations. For example, when you say that the average of a, b, and c is 25, then you're saying that
a / 3 + b / 3 + c / 3 = 25.
Adding in the constraint that the average of b and c is 23 gives the equation
b / 2 + c / 2 = 23.
More generally, any constraint of the form "the average of the variables x1, x2, ..., xn is M" can be written as
x1 / n + x2 / n + ... + xn / n = M.
Once you have all of these constraints written out, solving for the value of a particular variable, or determining that many solutions exist, reduces to solving a system of linear equations. There are a number of techniques for this, with Gaussian elimination followed by back-substitution being a particularly common one (though often you'd just hand this to MATLAB or a linear algebra package and have it do the work for you).
There's no guarantee in general that, given a collection of equations, the computer can determine whether they have a solution or deduce the value of a variable, but this happens to be one of the nice cases where the shape of the constraints makes the problem amenable to exact solutions.
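As a rough illustration of the linear-equations view, here is a minimal Python sketch. It uses Gauss-Jordan elimination (a variant of Gaussian elimination that reduces all the way to reduced row echelon form) with exact fractions. The names `rref` and `solve_variable` are hypothetical, and the sketch assumes the constraints are consistent; it does not try to detect contradictory input.

```python
from fractions import Fraction


def rref(rows):
    """Reduce an augmented matrix to reduced row echelon form."""
    rows = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(rows[0]) - 1  # last column holds the right-hand side
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == len(rows):
            break
        # Find a row at or below pivot_row with a nonzero entry in this column.
        sel = next((r for r in range(pivot_row, len(rows))
                    if rows[r][col] != 0), None)
        if sel is None:
            continue
        rows[pivot_row], rows[sel] = rows[sel], rows[pivot_row]
        pivot = rows[pivot_row][col]
        rows[pivot_row] = [x / pivot for x in rows[pivot_row]]
        # Clear this column in every other row.
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y
                           for x, y in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return rows


def solve_variable(constraints, names, target):
    """Return the value of `target` if the averages pin it down, else None.

    constraints: list of (tuple of variable names, average) pairs.
    """
    rows = []
    for members, avg in constraints:
        weight = Fraction(1, len(members))
        rows.append([weight if v in members else 0 for v in names] + [avg])
    idx = names.index(target)
    for row in rref(rows):
        coeffs = row[:-1]
        # The target is determined if some row mentions only the target.
        if coeffs[idx] == 1 and all(c == 0 for j, c in enumerate(coeffs)
                                    if j != idx):
            return row[-1]
    return None


constraints = [(("a", "b", "c"), 25), (("b", "c"), 23)]
print(solve_variable(constraints, ["a", "b", "c"], "a"))  # 29
print(solve_variable(constraints, ["a", "b", "c"], "b"))  # None
```

With the two averages from the question, `a` is determined even though `b` and `c` individually are not.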
Alright, I have figured some things out. To answer the question as per the title directly: it's possible to represent an average value in programming. One possible way is to create a list of map data structures which store the set of variables as the key (e.g. "a,b,c") and the average value of the set as the value (e.g. 25).
Extract the key and split its string by comma, store into list, then multiply the average value by the size of list to get the total (eg. 25x3 and 23x2). With this, no semantic information will be lost.
As for the context to which I asked this question, the more proper description to the problem is "Given a set of average values of different combinations of variables, is it possible to find the value of each variable?" The answer to this is open. I can't figure it out, but below is an attempt in describing the logic flow if one were to code it out:
Match the lists (from Paragraph 2) against one another in all possible combinations to check if a list contains all elements in another list. If so, subtract the lists (e.g. abc - bc) as well as the values (e.g. 75 - 46). If after subtracting we have only one variable left in the collection, then we have found the value of this variable.
If more than one variable is still left, such as abcd - bc = ad, then store the result in the map data structure and repeat the process until a full iteration over all possible combinations performs no subtraction at all (e.g. ac can't subtract bc). This is unfortunately not where it ends.
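The subtraction steps described so far can be sketched roughly as follows, in Python. The helper names `totals_from_averages` and `subtract_subsets` are invented for this illustration, and the sketch covers only the repeated subset subtraction, not the combining of lists.

```python
def totals_from_averages(averages):
    """Turn {"a,b,c": 25} into {frozenset({"a", "b", "c"}): 75}."""
    totals = {}
    for key, avg in averages.items():
        names = frozenset(key.split(","))
        totals[names] = avg * len(names)
    return totals


def subtract_subsets(totals):
    """Repeatedly subtract known subsets until a full pass adds nothing new."""
    known = dict(totals)
    changed = True
    while changed:
        changed = False
        for big, big_total in list(known.items()):
            for small, small_total in list(known.items()):
                if small < big:  # strict subset, e.g. {b, c} inside {a, b, c}
                    rest = big - small
                    if rest not in known:
                        known[rest] = big_total - small_total
                        changed = True
    return known


known = subtract_subsets(totals_from_averages({"a,b,c": 25, "b,c": 23}))
print(known[frozenset({"a"})])  # 29
```

The loop stops because there are only finitely many subsets of the variables, although in the worst case that number grows exponentially.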
Further solutions may be found by combining the lists (e.g. ac + bd = abcd) to get more possible ways to subtract and arrive at the answer. When this is the case, you just don't know when to stop trying, and the list of combinations grows exponentially. Maybe someone with a strong background in the related mathematical theory can prove that after a certain number of iterations further additions are useless, so the process should stop. Heck, it may even be possible that negative values are also helpful, which would contradict what I said earlier about 'ac' not being able to subtract 'bd' (to get a, c, -b, -d). This will give even more combinations to compute.
People with stronger computing science foundations may try what templatetypedef has suggested.

Turning a random number generating loop into a one equation?

Here's some pseudocode:
count = 0
for every item in a list:
    1/20 chance to add one to count
This is more or less my current code, but there could be hundreds of thousands of items in that list; therefore, it gets inefficient fast. (Isn't this called, like, O(n) or something?)
Is there a way to compress this into one equation?
Let's look at the properties of the random variable you've described. Quoting Wikipedia:
The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
Let N be the number of items in the list, and C be a random variable that represents the count you're obtaining from your pseudocode. C will follow a binomial probability distribution with parameters N and p = 1/20:
P(C = k) = (N choose k) * p^k * (1 - p)^(N - k), for k = 0, 1, ..., N
The remaining problem is how to efficiently sample a random variable with said probability distribution. There are a number of libraries that allow you to draw samples from random variables with a specified distribution. I've never had to implement it myself, so I don't exactly know the details, but many are open source and you can refer to the implementation for yourself.
Here's how you would calculate count with the numpy library in Python:
import numpy as np
n, p = 10, 0.05  # 10 trials, probability of success is 0.05
count = np.random.binomial(n, p)  # draw a single sample
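If a library is not available, one known way to draw the same count without visiting every item is to jump from success to success, sampling the number of failures in between from a geometric distribution. The sketch below is an illustration only, using the Python standard library; the name `binomial_by_skipping` is made up, and this is not a claim about how numpy implements its sampler.

```python
import math
import random


def binomial_by_skipping(n, p):
    """Count successes in n Bernoulli(p) trials by skipping over failures."""
    if p <= 0:
        return 0
    if p >= 1:
        return n
    log_q = math.log(1.0 - p)
    count = 0
    position = 0  # number of trials consumed so far
    while True:
        u = 1.0 - random.random()  # uniform on (0, 1], avoids log(0)
        failures = int(math.log(u) / log_q)  # failures before next success
        position += failures + 1
        if position > n:
            return count
        count += 1


random.seed(1)
print(binomial_by_skipping(200_000, 1 / 20))
```

The loop runs about n * p + 1 times on average, so for p = 1/20 it makes roughly a twentieth of the random draws of the original loop. The work still grows with n for a fixed p.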
Apparently the OP was asking for a more efficient way to generate random numbers with the same distribution this will give. I thought the question was how to do the exact same operation as the loop, but as a one-liner (and preferably with no temporary list that exists just to be iterated over).
If you sample a random number generator n times, it's going to have at best O(n) run time, regardless of how the code looks.
In some interpreted languages, using more compact syntax might make a noticeable difference in the constant factors of run time. Other things can affect the run time, like whether you store all the random values and then process them, or process them on the fly with no temporary storage.
None of this will allow you to avoid having your run time scale up linearly with n.

What is the difference between permutations and derangements?

I have been asked to write a program that produces different combinations of a set of numbers entered by the user, and when I researched this I found examples using the terms permutations and derangements.
I can't find a clear explanation of the difference between them. There is also a third term, combinations. Could anyone please provide a simple one-liner that clarifies the difference?
Thanks in advance.
http://en.wikipedia.org/wiki/Permutation
The notion of permutation relates to the act of rearranging, or permuting, all the members of a set into some sequence or order (unlike combinations, which are selections of some members of the set where order is disregarded). For example, written as tuples, there are six permutations of the set {1,2,3}, namely: (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), and (3,2,1). As another example, an anagram of a word, all of whose letters are different, is a permutation of its letters.
http://en.wikipedia.org/wiki/Derangement
In combinatorial mathematics, a derangement is a permutation of the elements of a set such that none of the elements appear in their original position.
The number of derangements of a set of size n, usually written D_n, d_n, or !n, is called the "derangement number" or "de Montmort number". (These numbers are generalized to rencontres numbers.) The subfactorial function (not to be confused with the factorial n!) maps n to !n. No standard notation for subfactorials is agreed upon; n¡ is sometimes used instead of !n.
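To make the three terms concrete, here is a small Python sketch using `itertools`. The helper `derangements` is written just for this illustration and assumes the input items are all distinct.

```python
from itertools import combinations, permutations


def derangements(items):
    """Permutations in which no element stays in its original position."""
    items = tuple(items)
    return [p for p in permutations(items)
            if all(x != y for x, y in zip(p, items))]


numbers = [1, 2, 3]

# Permutations: every ordering of all the elements.
print(list(permutations(numbers)))

# Derangements: only the orderings where nothing stays in place.
print(derangements(numbers))  # [(2, 3, 1), (3, 1, 2)]

# Combinations: selections of some elements, order disregarded.
print(list(combinations(numbers, 2)))  # [(1, 2), (1, 3), (2, 3)]
```

Filtering all permutations is fine for small sets but grows as n!, so it is meant only as a demonstration of the definitions.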

Generate random small numbers with a target average

I need to write a function that returns one of the numbers (-2,-1,0,1,2) randomly, but I need the average of the output to be a specific number (say, 1.2).
I saw similar questions, but all the answers seem to rely on the target range being wide enough.
Is there a way to do this (without saving state) with this small selection of possible outputs?
UPDATE: I want to use this function for (randomized) testing, as a stub for an expensive function which I don't want to run. The consumer of this function runs it a couple of hundred times and takes an average. I've been using a simple randint function, but the average is always very close to 0, which is not realistic.
Point is, I just need something simple that won't always average to 0. I don't really care what the actual average is. I may have asked the question wrong.
Do you really mean to require that specific value to be the average, or rather the expected value? In other words, if the generated sequence happened to contain an unusually large number of small values in its initial part, should the rest of the sequence compensate for that to get the overall average right? I assume not; I assume you want all your samples to be computed independently (after all, you said you don't want any state), in which case you can only control the expected value.
If you assign a probability p_i to each of your possible choices i, then the expected value will be the sum of these values, weighted by their probabilities:
EV = -2*p_{-2} - p_{-1} + p_1 + 2*p_2 = 1.2
As additional constraints, each of these probabilities must be non-negative, and the above four must add up to a value of at most 1, with the remainder taken by the fifth probability p_0.
There are many possible assignments which satisfy these requirements, and any one will do what you asked for. Which of them are reasonable for your application depends on what that application does.
You can use a PRNG which generates variables uniformly distributed in the range [0,1), and then map these to the cases you described by taking the cumulative sums of the probabilities as cut points.
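A minimal Python sketch of the cut-point idea follows. The probabilities in `PROBS` are one arbitrary assignment chosen for this example (they sum to 1 and give an expected value of 1.2); any other valid assignment would work the same way, and the name `draw` is hypothetical.

```python
import bisect
import itertools
import random

VALUES = [-2, -1, 0, 1, 2]
# Hypothetical assignment: EV = -1*0.1 + 0*0.1 + 1*0.3 + 2*0.5 = 1.2
PROBS = [0.0, 0.1, 0.1, 0.3, 0.5]
# Cumulative sums of the probabilities serve as cut points in [0, 1].
CUTS = list(itertools.accumulate(PROBS))


def draw(uniform=random.random):
    """Map one uniform sample in [0, 1) to a value via the cut points."""
    u = uniform()
    index = bisect.bisect_right(CUTS, u)
    # Guard against rounding leaving the last cut point slightly below 1.
    return VALUES[min(index, len(VALUES) - 1)]


random.seed(42)
samples = [draw() for _ in range(300)]
print(sum(samples) / len(samples))
```

Each call is independent and keeps no state, so the average of a few hundred calls will fluctuate around 1.2 rather than hit it exactly.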

How many different partitions with exactly n parts can be made of a set with k-elements?

How many different partitions with exactly two parts can be made of the set {1,2,3,4}?
There are 4 elements in this list that need to be partitioned into 2 parts. I wrote these out and got a total of 7 different possibilities:
{{1},{2,3,4}}
{{2},{1,3,4}}
{{3},{1,2,4}}
{{4},{1,2,3}}
{{1,2},{3,4}}
{{1,3},{2,4}}
{{1,4},{2,3}}
Now I must answer the same question for the set {1,2,3,...,100}.
There are 100 elements in this list that need to be partitioned into 2 parts. I know the largest size a part of the partition can be is 50 (that's 100/2) and the smallest is 1 (so one part has 1 number and the other part has 99). How can I determine how many different possibilities there are for partitions of two parts without writing out extraneous lists of every possible combination?
Can the answer be simplified into a factorial (such as 12!)?
Is there a general formula one can use to find how many different partitions with exactly n parts can be made of a set with k-elements?
1) stackoverflow is about programming. Your question belongs to https://math.stackexchange.com/ realm.
2) There are 2^n subsets of a set of n elements (because each of the n elements may either be or not be contained in a specific subset). This gives us 2^(n-1) different partitions of an n-element set into two subsets. One of these partitions is the trivial one (with one part being the empty subset and the other part being the entire original set), and from your example it seems you don't want to count the trivial partition. So the answer is 2^(n-1) - 1 (which gives 2^3 - 1 = 7 for n = 4).
The general answer for n parts and k elements would be the Stirling number of the second kind S(k,n).
Please beware that the usual convention is to use n for the total number of elements, thus S(n,k).
Computing the general formula is quite ugly, but doable for k=2 (with the common notation):
S(n,k) = (1/k!) * sum over j = 0..k of (-1)^(k-j) * C(k,j) * j^n
Thus S(n,2) = 1/2 ( (+1) * 1 * 0^n + (-1) * 2 * 1^n + (+1) * 1 * 2^n ) = (0 - 2 + 2^n)/2 = 2^(n-1) - 1
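For checking values by computer, a short Python sketch is given here. It does not use the explicit sum; instead it uses the standard recurrence S(n,k) = k*S(n-1,k) + S(n-1,k-1), with n the number of elements and k the number of parts. The function name `stirling2` is chosen for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of partitions of an n-element set into exactly k parts."""
    if n == k:
        return 1  # includes S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # The n-th element either joins one of the k existing parts
    # or forms a new part on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


print(stirling2(4, 2))  # 7
print(stirling2(100, 2) == 2 ** 99 - 1)  # True
```

So for the set {1,2,3,...,100} the count of two-part partitions is 2^99 - 1, which does not reduce to a single factorial.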

Resources