What is the running time (big O) for linear probing on insertion, deletion, and searching?
Thanks
Theoretical worst case is O(n), since if you happen to insert all the elements such that they consecutively collide, then the last element inserted will have to be put into the table n steps from its original hash position. The same bound applies to searching and deleting, since both have to walk the same probe sequence.
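A tiny sketch of that worst case (just to make the probe counts visible; the table size and the "everything hashes to slot 0" hash function are picked purely for illustration):

def insert_linear_probe(table, key, hash_fn):
    # Linear probing: start at the home slot and walk forward until a free slot is found.
    n = len(table)
    i = hash_fn(key) % n
    steps = 0
    while table[i] is not None:
        i = (i + 1) % n
        steps += 1
    table[i] = key
    return steps

# Worst case: every key hashes to slot 0, so the k-th insert probes k occupied slots.
table = [None] * 8
print([insert_linear_probe(table, k, lambda x: 0) for k in range(8)])
# -> [0, 1, 2, 3, 4, 5, 6, 7]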
You could however calculate the average expected number of steps if you know the distribution of your input elements (possibly assuming random, but of course assumption is the mother...), know the distribution of your hashing function (hopefully uniform, but it depends on the quality of your algorithm), the length of your hash table and the number of elements inserted.
If your hashing function is sufficiently uniform you can calculate the probability of collisions using the birthday problem and from those probabilities calculate expected probing lengths.
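For instance, a rough sketch (assuming a uniform hash over m buckets) of the birthday-style probability that at least two of n inserted keys share a bucket:

def p_collision(n, m):
    # Probability that at least two of n uniformly hashed keys land in the
    # same bucket of an m-bucket table (classic birthday-problem argument).
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m
    return 1.0 - p_no_collision

# e.g. 23 keys in a 365-bucket table already collide with ~50% probability
print(p_collision(23, 365))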
Related
https://www.quora.com/Why-should-the-size-of-a-hash-table-be-a-prime-number?share=1
I see that people mention that the number of buckets of a hash table is better to be prime numbers.
Is it always the case? When the hash values are already evenly distributed, there is no need to use prime numbers then?
https://github.com/rui314/chibicc/blob/main/hashmap.c
For example, the above hash table code does not use prime numbers as the number of buckets.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L37
But the hash values are generated from strings using fnv_hash.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L17
So is there a reason why it makes sense to use bucket sizes that are not necessarily prime numbers?
The answer is "usually you don't need a table whose size is a prime number, but there are some implementation reasons why you might want to do this."
Fundamentally, hash tables work best when hash codes are spread out as close to uniformly at random as possible. That prevents items from clustering in any one location within the table. At some level, provided that you have a good enough hash function to make this happen, the size of the table doesn't matter.
So why do folks say to pick tables whose size is a prime? There are two main reasons for this, and they're due to specific cases that don't arise in all hash tables.
One reason why you sometimes see prime-sized tables is due to a specific way of building hash functions. You can build reasonable hash functions by picking functions of the form h(x) = (ax + b) mod p, where a is a number in {1, 2, ..., p-1} and b is a number in {0, 1, 2, ..., p-1}, assuming that p is a prime. If p isn't prime, hash functions of this form don't spread items out uniformly. As a result, if you're using a hash function like this one, then it makes sense to pick a table whose size is a prime number.
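A tiny sketch of what that family looks like in code (the specific prime and the random draws of a and b here are just for illustration):

import random

p = 1009                          # a prime table size, chosen just for this example
a = random.randint(1, p - 1)      # a drawn from {1, ..., p-1}
b = random.randint(0, p - 1)      # b drawn from {0, ..., p-1}

def h(x):
    # One randomly chosen member of the family h(x) = (ax + b) mod p;
    # the table is assumed to have p buckets.
    return (a * x + b) % p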
The second reason you see advice about prime-sized tables is if you're using an open-addressing strategy like quadratic probing or double hashing. These hashing strategies work by hashing items to some initial location k. If that slot is full, we look at slot (k + r) mod T, where T is the table size and r is some offset. If that slot is full, we then check (k + 2r) mod T, then (k + 3r) mod T, etc. If the table size is a prime number and r isn't zero, this has the nice, desirable property that these indices will cycle through all the different positions in the table without ever repeating, ensuring that items are nicely distributed over the table. With non-prime table sizes, it's possible that this strategy gets stuck cycling through a small number of slots, which gives less flexibility in positions and can cause insertions to fail well before the table fills up.
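A quick way to see this (a small demonstration, not taken from any particular implementation) is to list which slots the probe sequence can ever reach for a prime and a non-prime table size:

def probe_slots(k, r, T):
    # Slots visited by the sequence k, k + r, k + 2r, ... (mod T).
    return sorted({(k + i * r) % T for i in range(T)})

print(probe_slots(3, 4, 11))   # prime T = 11: all 11 slots are reachable
print(probe_slots(3, 4, 12))   # T = 12, r = 4: only slots 3, 7 and 11 are ever probed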
So assuming you aren't using double hashing or quadratic probing, and assuming you have a strong enough hash function, feel free to size your table however you'd like.
templatetypedef has some excellent points as always - just adding a couple more and some examples...
Is it always necessary to make hash table number of buckets a prime number for performance reason?
No. Firstly, using prime numbers for the bucket count tends to mean you need to spend more CPU cycles to fold/mod a hash value returned by the hash function into the current bucket count. A popular alternative is to use powers of two for the bucket count (e.g. 8, 16, 32, 64... as you resize), because then you can do a bitwise AND operation to map from a hash value to a bucket in 1 CPU cycle. That answers your "So is there a reason why it makes sense to use bucket sizes that are not necessarily prime numbers?"
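For example (generic Python, not the chibicc code; the hash value below is an arbitrary 64-bit number):

hash_value = 0x9E3779B97F4A7C15            # some 64-bit hash output, made up for the example

bucket_prime = hash_value % 1009            # prime bucket count: needs an integer modulo
bucket_pow2  = hash_value & (64 - 1)        # 64 buckets: a single AND, equivalent to % 64
# The AND trick only works because 64 is a power of two (64 - 1 is a mask of low bits).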
Tuning a hash table for performance often means weighing the cost of a stronger hash function and modding by prime numbers against the cost of higher collisions.
Prime bucket counts often help reduce collisions when the hash function is unable to produce a very good distribution for the keys it's fed.
For example, if you hashed a bunch of pointers to 64-bit doubles using an identity hash (basically, casting the pointer address to a size_t), then the hash values would all be multiples of 8 (due to alignment), and if you had a hash table size like say 1024 or 2048 (powers of 2), then all your pointers would hash onto 1/8th of the bucket indices (specifically, buckets 0, 8, 16, 24, 32 etc.). With a prime number of buckets, at least the pointer values - which, if the load factor is high, are inevitably spread out over a much larger range than the range of bucket indices - tend to wrap around the hash table hitting different indices.
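A quick simulation of that effect (the base address and the counts below are made up; 1021 is just a convenient prime near 1024):

# Fake 8-byte-aligned "pointer" values, as for heap-allocated doubles.
pointers = [0x7f3a00000000 + 8 * i for i in range(800)]

print(len({ptr % 1024 for ptr in pointers}))   # 128: only 1/8th of the 1024 buckets ever get used
print(len({ptr % 1021 for ptr in pointers}))   # 800: with a prime count, every pointer lands in its own bucket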
When you use a very strong hash function - one where the low-order bits are effectively random but repeatable - you'll already get a good distribution across buckets regardless of the bucket count. There are also times when even a terribly weak hash function - like an identity hash, h(x) == x - works fine: all the bits in the keys are so random that they produce as good a distribution as a cryptographic hash could produce, so there's no point spending extra time on a stronger hash - that may even increase collisions.
There are also times when the distribution isn't inherently great, but you can afford to use extra memory to keep the load factor low, so it's not worth using primes or a better hash function. Still, extra buckets put more strain on the CPU caches too - so things can end up slower than hoped for.
Other times, keys with an identity hash have an inherent tendency to fall into distinct buckets (e.g. because they might have been generated by an incrementing counter, even if some of the values are no longer in use). In that case, a strong hash function increases collisions and worsens CPU cache access patterns. Whether you use powers of two or prime bucket counts makes little difference here.
When the hash values are already evenly distributed, there is no need to use prime numbers then?
That statement is trivially true but kind of pointless if you're talking about hash values after the mod-to-current-hash-table-size operation: even distribution there directly relates to few collisions.
If you're talking about the more interesting case of hash values evenly distributed in the hash function's return-type value space (e.g. a 64-bit integer), before those values are modded into whatever the current hash table bucket count is, then there's still room for prime numbers to help, but only when the hashed key space spans a larger range than the hash bucket indices. The pointer example above illustrated that: if you had, say, 800 distinct 8-byte-aligned pointers going into ~1000 buckets, then the difference between the numerically lowest pointer and the highest address would be at least 799*8 = 6392... you're wrapping around the table more than 6 times at a minimum (for close-as-possible pointers), and a prime number of buckets increases the odds of each "wrap" modding onto previously unused buckets.
Note that some of the above benefits to prime bucket counts apply to any kind of collision handling - separate chaining, linear probing, quadratic probing, double hashing, cuckoo hashing, robin hood hashing etc.
I want to generate some numbers which should share as few common bit patterns as possible, so that collisions are kept to a minimum. Until now it's been "simple" hashing with a given number of output bits. However, there is another 'constraint'. I want to minimize the risk that, if you take one number and change it by toggling a small number of bits, you end up with another number you've just generated. Note: I don't want it to be impossible or something, I want to minimize the risk!
How to calculate the probability for a list with n numbers, where each number has m bits? And, of course, what would be a suitable method to generate those numbers? Any good articles about this?
To answer this question precisely, you need to say what exactly you mean by "collision", and what you mean by "generate". If you just want the strings to be far apart from each other in Hamming distance, you could hope to make an optimal, deterministic set of such strings. It is true that random strings will have this property with high probability, so you could use random strings instead.
When you say
Note: I don't want it to be impossible or something, I want to minimize the risk!
this sounds like an XY problem. If some outcome is the "bad thing" then why do you want it to be possible, but just low probability? Shouldn't you want it not to happen at all?
In short I think you should look up the term "error correcting code". The codewords of any good error correcting code, with any parameters that you feel like, will have the minimal risk of collision in the presence of random noise, for that number of code words of that length, and they can typically be generated very easily using matrix multiplication.
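As a small, concrete instance of "codewords from matrix multiplication" (the Hamming(7,4) code, chosen here purely for illustration; in practice you'd pick a code whose parameters match your m bits and n numbers):

import numpy as np

# Generator matrix of the Hamming(7,4) code in systematic form, over GF(2).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def codeword(message_bits):
    # Multiply the 4-bit message by G and reduce mod 2.
    return np.dot(message_bits, G) % 2

# All 16 codewords; any two of them differ in at least 3 bit positions.
codewords = [codeword(np.array([(i >> j) & 1 for j in range(4)])) for i in range(16)]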
Here's some pseudocode:
import random

count = 0
for item in items:                 # items is the list in question
    if random.random() < 1/20:     # 1/20 chance to add one to count
        count += 1
This is more or less my current code, but there could be hundreds of thousands of items in that list; therefore, it gets inefficient fast. (isn't this called O(n) or something?)
Is there a way to compress this into one equation?
Let's look at the properties of the random variable you've described. Quoting Wikipedia:
The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
Let N be the number of items in the list, and C be a random variable that represents the count you're obtaining from your pseudocode. C will follow a binomial distribution with parameters n = N and p = 1/20.
The remaining problem is how to efficiently sample a random variable with said probability distribution. There are a number of libraries that allow you to draw samples from a specified distribution. I've never had to implement it myself, so I don't know the details exactly, but many are open source and you can refer to the implementation for yourself.
Here's how you would calculate count with the numpy library in Python:
import numpy as np

n, p = 10, 0.05  # 10 trials, probability of success is 0.05 (for the question: n = len(items), p = 1/20)
count = np.random.binomial(n, p)  # draw a single sample
Apparently the OP was asking for a more efficient way to generate random numbers with the same distribution this loop gives. I thought the question was how to do the exact same operation as the loop, but as a one-liner (and preferably with no temporary list that exists just to be iterated over).
If you sample a random number generator n times, it's going to have at best O(n) run time, regardless of how the code looks.
In some interpreted languages, using more compact syntax might make a noticeable difference in the constant factors of run time. Other things can affect the run time, like whether you store all the random values and then process them, or process them on the fly with no temporary storage.
None of this will allow you to avoid having your run time scale up linearly with n.
I need to write a function that returns one of the numbers (-2, -1, 0, 1, 2) randomly, but I need the average of the output to be a specific number (say, 1.2).
I saw similar questions, but all the answers seem to rely on the target range being wide enough.
Is there a way to do this (without saving state) with this small selection of possible outputs?
UPDATE: I want to use this function for (randomized) testing, as a stub for an expensive function which I don't want to run. The consumer of this function runs it a couple of hundred times and takes an average. I've been using a simple randint function, but the average is always very close to 0, which is not realistic.
Point is, I just need something simple that won't always average to 0. I don't really care what the actual average is. I may have asked the question wrong.
Do you really mean to require that specific value to be the average, or rather the expected value? In other words, if the generated sequence were to contain an extraordinary number of small values in its initial part, should the rest of the sequence compensate for that to get the overall average right? I assume not; I assume you want all your samples to be computed independently (after all, you said you don't want any state), in which case you can only control the expected value.
If you assign a probability pᵢ to each of your possible choices i, then the expected value will be the sum of these values, weighted by their probabilities:
EV = -2·p₋₂ - p₋₁ + p₁ + 2·p₂ = 1.2
As additional constraints you have to require that each of these probabilities is non-negative, and that the above four add up to a value less than 1, with the remainder taken by the fifth probability p₀.
There are many possible assignments which satisfy these requirements, and any one of them will do what you asked for. Which of them are reasonable for your application depends on what that application does.
You can use a PRNG which generates variables uniformly distributed in the range [0,1), and then map these to the cases you described by taking the cumulative sums of the probabilities as cut points.
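A minimal sketch of that mapping (the particular probabilities below are just one of many valid choices that give an expected value of 1.2):

import random

values = (-2, -1, 0, 1, 2)
probs  = (0.02, 0.03, 0.10, 0.43, 0.42)   # sums to 1; expected value = 1.2

def sample():
    u = random.random()              # uniform in [0, 1)
    cumulative = 0.0
    for v, p in zip(values, probs):
        cumulative += p              # the cumulative sums are the cut points
        if u < cumulative:
            return v
    return values[-1]                # guard against floating-point round-off

In Python specifically, random.choices(values, weights=probs)[0] does the equivalent cut-point lookup for you.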
Isn't it easily possible to construct a PRNG in such a fashion? Why is it not done?
That is, as far as I know we could simply have a PRNG that takes a seed n. When you ask for a random bit, it takes the nth digit of the binary expansion of some computable normal number, and increments n.
My first thought was that perhaps we hadn't found a computable normal number, but we have. The remaining thought is that there is a good reason not to: either there's some property of PRNGs that I'm not familiar with that such a method would not have, or it would be impractical somehow, or it is otherwise outstripped by other methods.
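For concreteness, here is a rough sketch of the kind of generator I mean, using the binary Champernowne constant (the binary expansions of 1, 2, 3, ... concatenated) as one example of a computable number known to be normal in base 2:

import itertools

def champernowne_bits():
    # Bits of the binary Champernowne constant 0.(1)(10)(11)(100)(101)... -
    # the binary expansions of 1, 2, 3, ... concatenated.
    for k in itertools.count(1):
        for bit in bin(k)[2:]:
            yield int(bit)

def make_prng(seed):
    # Skip the first `seed` bits, then hand out one bit per call.
    bits = champernowne_bits()
    for _ in range(seed):
        next(bits)
    return lambda: next(bits)

rand_bit = make_prng(12345)
print(rand_bit(), rand_bit(), rand_bit())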
That would make predicting the output really simple.
Say, for example, you generate the integer 0x54a30b7f. If you have 4GiB of pi (or random noise or an actual normal number), chances are there's only going to be one (or maybe a handful) occurrence of that particular integer, and I can predict with reasonably high probability all future numbers. This is a serious problem in the case of cryptographically strong PRNGs. If instead of a simple sequential scan you use some function, I just have to follow the function, which, if it is difficult enough to follow, turns into a PRNG in its own right.
If you are not concerned about the cryptographic strength of your generator, then there are much more compact ways of generating random numbers. Mersenne Twister, for example, has a much larger period without requiring a 4GiB lookup table.