What is the cost of deleting a value from a hashtable? - hashtable

I was asked about the cost of deleting a value from a hash table when linear probing was used during insertion.
What I could figure out from reading various material on the internet is that it has something to do with the load factor. I am not sure, but I read about a relation between the load factor and the number of probes required: number of probes = 1 / (1 - LF).
So I believe the cost has to be dependent on the probe sequence. But then another thought ruins everything.
What if the element was inserted after p probes and I am now trying to delete it, but before this I had already deleted a few elements that have the same hash code and were inserted with fewer than p probes?
In this case I reach a stage where I see an empty slot in the hash table, but I am not sure whether the element I am trying to delete is already deleted or is at some other location as a result of probing.
I also found that once I delete an element I must mark its slot with some special indicator to show that it is available, but this doesn't solve my problem of being uncertain about the element that I want to delete.
Could anyone please suggest how to find the cost in such cases?
Is the approach going to vary if it is non-linear probing?

The standard approach is "lookup the element, mark as deleted". Marking obviously has O(1) cost, so the total operation cost is the same as just lookup: O(1) expected. It can be as high as O(n) in degenerate cases (e.g. all elements have the same hash). O(1) expected is all we can say theoretically.
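To make the "look up, then mark as deleted" idea concrete, here is a minimal sketch of a linear-probing set with a deleted marker. The class name `LinearProbingSet`, the `EMPTY`/`DELETED` sentinels, and the fixed capacity (no resizing) are assumptions made for illustration, not part of any particular library. The point is that lookup stops only at a slot that was never used; a deleted marker means "keep probing", which resolves the uncertainty described in the question.

```python
EMPTY = object()    # slot has never held a key
DELETED = object()  # marker: slot held a key that was later deleted


class LinearProbingSet:
    """Fixed-capacity open-addressing set (illustrative only, no resizing)."""

    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        # Yield slot indices in linear-probing order, starting at the home slot.
        n = len(self.slots)
        start = hash(key) % n
        for i in range(n):
            yield (start + i) % n

    def find(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                return None  # never-used slot: the key cannot be further along
            if slot is not DELETED and slot == key:
                return idx
            # DELETED or a different key: keep probing
        return None

    def insert(self, key):
        first_free = None
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                if first_free is None:
                    first_free = idx
                break
            if slot is DELETED:
                if first_free is None:
                    first_free = idx  # remember it, but keep looking for a duplicate
            elif slot == key:
                return  # already present
        if first_free is None:
            raise RuntimeError("table full")
        self.slots[first_free] = key

    def delete(self, key):
        idx = self.find(key)       # cost of delete == cost of lookup ...
        if idx is None:
            return False
        self.slots[idx] = DELETED  # ... plus an O(1) marking step
        return True
```

With a capacity of 8, the integer keys 0, 8 and 16 all share home slot 0, so deleting 8 leaves a marker in slot 1 and a later lookup of 16 still probes past it to slot 2.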
About the load factor: the higher the load factor (the ratio of occupied buckets to the total number of buckets), the larger the expected constant factor (but this doesn't change the theoretical O cost). Note that in this case the load factor counts both the elements currently present in the table and the buckets that were previously marked as deleted.
Other probing kinds (e.g. quadratic) don't change the theoretical cost, but may alter the expected constant factor or its variance. If you look at "fallback" sequences, in linear ordering the sequences of different buckets overlap. This means that if for some bucket the sequence is long, for adjacent buckets it will also be long. E.g.: if buckets 4 to 10 are occupied, the sequence for bucket #4 is 7 buckets long (4, 5, 6, ..., 10), for #5 it's 6 and so on. For quadratic probing this is not the case.
However, linear probing has the benefit of better memory-cache behavior, since you check memory cells close to each other. In practice, though, for quadratic probing fallback sequences are rarely long enough for this to matter.
Finally, in the linear probing case, it is possible to work without a deleted mark, but for this you'd have to complicate the deletion procedure considerably (still O(1) expected, but with a much higher constant factor). Whether it is worth it has to be decided with actual profiling; for example, this simplifies inserting somewhat and lookup a bit. For a C++ implementation this would have the downside that erase() would invalidate iterators, though.
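One well-known way to delete without a deleted mark is "backward shift" deletion: after emptying a slot, later entries of the same cluster are moved back so that no lookup is cut short by the hole. The sketch below is only an illustration under some assumptions: the table is a plain Python list, `None` stands for an empty slot (so `None` cannot be a key), and the helper names `home`, `insert`, `find` and `delete` are made up for this example.

```python
def home(key, n):
    # Home slot of a key in a table of n slots.
    return hash(key) % n


def insert(slots, key):
    n = len(slots)
    i = home(key, n)
    for _ in range(n):
        if slots[i] is None or slots[i] == key:
            slots[i] = key
            return
        i = (i + 1) % n
    raise RuntimeError("table full")


def find(slots, key):
    n = len(slots)
    i = home(key, n)
    for _ in range(n):
        if slots[i] is None:
            return None  # an empty slot ends the probe sequence
        if slots[i] == key:
            return i
        i = (i + 1) % n
    return None


def delete(slots, key):
    n = len(slots)
    i = find(slots, key)
    if i is None:
        return False
    j = i
    while True:
        slots[i] = None  # open a hole at i
        while True:
            j = (j + 1) % n
            if slots[j] is None:
                return True  # end of the cluster: nothing left to repair
            k = home(slots[j], n)
            # The entry at j may fill the hole unless its home slot lies
            # cyclically in (i, j]; in that case moving it would put it
            # before its own home slot and lookups would miss it.
            if i <= j:
                stays = i < k <= j
            else:
                stays = k > i or k <= j
            if not stays:
                break
        slots[i] = slots[j]  # move the entry back into the hole
        i = j                # the hole is now at j; repeat from there
```

Lookup and insert no longer need to know about deleted slots at all, which is the simplification mentioned above; the price is the extra scanning and moving inside `delete`.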

Related

Is it always necessary to make hash table number of buckets a prime number for performance reason?

https://www.quora.com/Why-should-the-size-of-a-hash-table-be-a-prime-number?share=1
I see that people mention that the number of buckets of a hash table is better to be prime numbers.
Is it always the case? When the hash values are already evenly distributed, there is no need to use prime numbers then?
https://github.com/rui314/chibicc/blob/main/hashmap.c
For example, the above hash table code does not use prime numbers as the number of buckets.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L37
But the hash values are generated from strings using fnv_hash.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L17
So there is a reason why it makes sense to use bucket sizes that are not necessarily prime numbers?
The answer is "usually you don't need a table whose size is a prime number, but there are some implementation reasons why you might want to do this."
Fundamentally, hash tables work best when hash codes are spread out as close to uniformly at random as possible. That prevents items from clustering in any one location within the table. At some level, provided that you have a good enough hash function to make this happen, the size of the table doesn't matter.
So why do folks say to pick tables whose size is a prime? There are two main reasons for this, and they're due to specific cases that don't arise in all hash tables.
One reason why you sometimes see prime-sized tables is due to a specific way of building hash functions. You can build reasonable hash functions by picking functions of the form h(x) = (ax + b) mod p, where a is a number in {1, 2, ..., p-1} and b is a number in the {0, 1, 2, ..., p-1}, assuming that p is a prime. If p isn't prime, hash functions of this form don't spread items out uniformly. As a result, if you're using a hash function like this one, then it makes sense to pick a table whose size is a prime number.
The second reason you see advice about prime-sized tables is if you're using an open-addressing strategy like quadratic probing or double hashing. These hashing strategies work by hashing items to some initial location k. If that slot is full, we look at slot (k + r) mod T, where T is the table size and r is some offset. If that slot is full, we then check (k + 2r) mod T, then (k + 3r) mod T, etc. If the table size is a prime number and r isn't zero, this has the nice, desirable property that these indices will cycle through all the different positions in the table without ever repeating, ensuring that items are nicely distributed over the table. With non-prime table sizes, it's possible that this strategy gets stuck cycling through a small number of slots, which gives less flexibility in positions and can cause insertions to fail well before the table fills up.
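A quick way to see the cycling behaviour is to enumerate the slots visited by the sequence k, (k + r) mod T, (k + 2r) mod T, and so on. The helper name `visited_slots` is hypothetical; it only illustrates the arithmetic described above.

```python
def visited_slots(k, r, table_size):
    """Return the distinct slots reached from k using a fixed step r."""
    seen = []
    idx = k % table_size
    while idx not in seen:
        seen.append(idx)
        idx = (idx + r) % table_size
    return seen


# Non-prime size 16 with step 4: only 4 of the 16 slots are ever probed.
print(len(visited_slots(3, 4, 16)))   # 4
# Prime size 17 with the same step: every slot is probed exactly once.
print(len(visited_slots(3, 4, 17)))   # 17
```

In general the number of slots reached is T / gcd(r, T), so a prime T gives full coverage for every nonzero step smaller than T.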
So assuming you aren't using double hashing or quadratic probing, and assuming you have a strong enough hash function, feel free to size your table however you'd like.
templatetypedef has some excellent points as always - just adding a couple more and some examples...
Is it always necessary to make hash table number of buckets a prime number for performance reason?
No. Firstly, using prime numbers for bucket count tends to mean you need to spend more CPU cycles to fold/mod a hash value returned by the hash function into the current bucket count. A popular alternative is to use powers of two for the bucket count (e.g. 8, 16, 32, 64... as you resize), because then you can do a bitwise AND operation to map from a hash value to a bucket in 1 CPU cycle. That answers your "So there is a reason why it makes sense to use bucket sizes that are not necessarily prime numbers?"
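As a small illustration of the power-of-two trick: when the bucket count n is a power of two, masking with n - 1 gives the same bucket as the modulo operation. The function name `bucket_pow2` is made up for this sketch.

```python
def bucket_pow2(hash_value, bucket_count):
    """Map a hash value to a bucket when bucket_count is a power of two."""
    if bucket_count <= 0 or bucket_count & (bucket_count - 1) != 0:
        raise ValueError("bucket_count must be a power of two")
    # Keep only the low-order bits; equivalent to hash_value % bucket_count.
    return hash_value & (bucket_count - 1)
```

Note that this uses only the low-order bits of the hash, which is exactly why a weak hash function hurts more with power-of-two bucket counts.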
Tuning a hash table for performance often means weighing the cost of a stronger hash function and modding by prime numbers against the cost of higher collisions.
Prime bucket counts often help reduce collisions when the hash function is unable to produce a very good distribution for the keys it's fed.
For example, if you hashed a bunch of pointers to 64-bit doubles using an identity hash (basically, casting the pointer address to a size_t), then the hash values would all be multiples of 8 (due to alignment), and if you had a hash table size like say 1024 or 2048 (powers of 2), then all your pointers would hash onto 1/8th of the bucket indices (specifically, buckets 0, 8, 16, 24, 32 etc.). With a prime number of buckets, at least the pointer values - which if the load factor is high are inevitably spread out over a much larger range than the range of bucket indices - tend to wrap around the hash table hitting different indices.
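The pointer example can be simulated with made-up addresses. The base address 0x10000 and the count of 800 contiguous 8-byte-aligned values are assumptions chosen for illustration; 1021 is simply a prime close to 1024.

```python
def distinct_buckets(hash_values, bucket_count):
    """Count how many different buckets the hash values land in."""
    return len({h % bucket_count for h in hash_values})


# 800 hypothetical 8-byte-aligned addresses, hashed with an identity hash.
addresses = [0x10000 + 8 * i for i in range(800)]

print(distinct_buckets(addresses, 1024))  # 128: only every 8th bucket is used
print(distinct_buckets(addresses, 1021))  # 800: every address gets its own bucket
```

With 1024 buckets the 800 keys are squeezed into 128 buckets, while the prime bucket count spreads them over 800 distinct buckets.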
When you use a very strong hash function - where the low order bits are effectively random but repeatable, you'll already get a good distribution across buckets regardless of the bucket count. There are also times when even with a terribly weak hash function - like an identity hash - h(x) == x - all the bits in the keys are so random that they produce as good a distribution as a cryptographic hash could produce, so there's no point spending extra time on a stronger hash - that may even increase collisions.
There are also times when the distribution isn't inherently great, but you can afford to use extra memory to keep the load factor low, so it's not worth using primes or a better hash function. Still, extra buckets put more strain on the CPU caches too - so things can end up slower than hoped for.
Other times, keys with an identity hash have an inherent tendency to fall into distinct buckets (e.g. because they might have been generated by an incrementing counter, even if some of the values are no longer in use). In that case, a strong hash function increases collisions and worsens CPU cache access patterns. Whether you use powers of two or prime bucket counts makes little difference here.
When the hash values are already evenly distributed, there is no need to use prime numbers then?
That statement is trivially true but kind of pointless if you're talking about hash values after the mod-to-current-hash-table-size operation: even distribution there directly relates to few collisions.
If you're talking about the more interesting case of hash values evenly distributed in the hash function return type value space (e.g. a 64-bit integer), before those values are modded into whatever the current hash table bucket count is, then there's still room for prime numbers to help, but only when the hashed key space spans a larger range than the hash bucket indices. The pointer example above illustrated that: if you had say 800 distinct 8-byte-aligned pointers going into ~1000 buckets, then the difference between the numerically lowest pointer and the highest address would be at least 799*8 = 6392... you're wrapping around the table more than 6 times at a minimum (for close-as-possible pointers), and a prime number of buckets would increase the odds of each "wrap" modding onto previously unused buckets.
Note that some of the above benefits to prime bucket counts apply to any kind of collision handling - separate chaining, linear probing, quadratic probing, double hashing, cuckoo hashing, robin hood hashing etc.

Find the first root and local maximum/minimum of a function

Problem
I want to find
The first root
The first local minimum/maximum
of a black-box function in a given range.
The function has following properties:
It's continuous and differentiable.
It's combination of constant and periodic functions. All periods are known.
(It's better if it can be done with weaker assumptions)
What is the fastest way to get the root and the extremum?
Do I need more assumptions or bounds of the function?
What I've tried
I know I can use root-finding algorithm. What I don't know is how to find the first root efficiently.
It needs to be fast enough to run within a few milliseconds with a precision of 1.0 and a range of 1.0e+8, which is the problem.
Since the range could be quite large and it should be precise enough, I can't brute-force it by checking all the possible subranges.
I considered bisection method, but it's too slow to find the first root if the function has only one big root in the range, as every subrange should be checked.
It's preferable if the solution is in java, but any similar language is fine.
Background
I want to calculate when arbitrary celestial object reaches certain height.
It's a configuration-defined virtual object, so I can't assume anything about the object.
It's not easy to get either analytical solution or simple approximation because various coordinates are involved.
I decided to find a numerical solution for this.
For a general black box function, this can't really be done. Any root finding algorithm on a black box function can't guarantee that it has found all the roots or any particular root, even if the function is continuous and differentiable.
The property of being periodic gives a bit more hope, but you can still have periodic functions with infinitely many roots in a bounded domain. Given that your function relates to celestial objects, this isn't likely to happen. Assuming your periodic functions are sinusoidal, I believe you can get away with checking subranges on the order of one-quarter of the shortest period (out of all the periodic components).
Maybe try Brent's Method on the shortest quarter period subranges?
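A rough sketch of that scan, assuming the quarter-period step really is fine enough to catch every sign change (which is a belief, not a guarantee; a root where the function only touches zero without changing sign would be missed). Plain bisection stands in for Brent's method here to keep the example dependency-free, and the names `first_root`, `shortest_period` and `tol` are made up for illustration.

```python
import math


def first_root(f, a, b, shortest_period, tol=1e-6):
    """Scan [a, b] left to right and return the first bracketed root, or None."""
    step = shortest_period / 4.0
    lo, f_lo = a, f(a)
    if f_lo == 0:
        return lo
    while lo < b:
        hi = min(lo + step, b)
        f_hi = f(hi)
        if f_hi == 0:
            return hi
        if f_lo * f_hi < 0:
            # Sign change: refine inside [lo, hi] by bisection.
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if mid <= lo or mid >= hi:
                    break  # interval cannot shrink further in floating point
                f_mid = f(mid)
                if f_mid == 0:
                    return mid
                if f_lo * f_mid < 0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            return 0.5 * (lo + hi)
        lo, f_lo = hi, f_hi
    return None


# Example: sin(x) - 0.5 has its first positive root at pi/6.
root = first_root(lambda x: math.sin(x) - 0.5, 0.0, 100.0, 2 * math.pi)
```

The cost is one function evaluation per quarter period until the first sign change, so over a range of 1.0e+8 this is only fast when the shortest period is not tiny compared to the range.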
Another approach would be to apply your root finding algorithm iteratively. If your range is (a, b), then apply your algorithm to that range to find a root at say c < b. Then apply your algorithm to the range (a, c) to find a root in that range. Continue until no more roots are found. The last root you found is a good candidate for your minimum root.
Black box function for any range? You cannot even be sure it has a continuous domain over that range. What kind of solutions are you looking for? Natural numbers, integers, real numbers, complex? These are all questions that greatly impact the answer.
So 1st thing should be determining what kind of number you accept as the result.
Second is having some kind of protection against limits of the function that will try to explode your calculations as it goes to plus or minus infinity.
Since we are touching on the topic of limits, your function could edge towards zero and look like a solution but never touch 0 and become a solution. This depends on your margin of error: how close something has to be to be considered good enough.
I think for this your SIMPLEST TO IMPLEMENT bet for real number solutions (I assume those) is to take an interval and this divide and conquer algorithm:
Take lower and upper border and middle value (or approx middle value for infinity decimals border/borders)
Try to calculate solution with all 3 and have some kind of protection against infinities
remember all 3 values in an array with results from them (3 pair of values)
remember the current best value (the one closest to the solution) in a separate variable (a pair of value and result for that value)
STEP FORWARD - repeat above with 1st -2nd value range and 2nd -3rd value range
have a new pair of value and result to be closest to solution.
clear the old value-result pairs, replace them with new ones gotten from this iteration while remembering the best value solution pair (total)
Repeat the above for as long as needed to reach the precision you want, and watch the memory explode with each iteration: keep in mind you are going to have exponential growth of values there. It can be further improved if you, let's say, take one interval and go as deep as you want, remember the best value-result pair, then delete all other memory and go for the next interval and dig deep.

Possibilities of dividing a class in groups with several criteria

I have to divide a class of 50 students writing a dissertation into 10 different discussion groups of 5 members each. In theory, there are 1.35363x10^37 possible ways of doing this, which is just the result of 50! / ((5!)^10 * 10!), if it is already decided that the groups will consist of 5.
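The figure above can be checked with exact integer arithmetic. The function name `count_groupings` is just a label for this sketch; it counts the ways to split n students into unlabeled groups of equal size.

```python
from math import factorial


def count_groupings(students, group_size):
    """Number of ways to split `students` people into unlabeled groups of `group_size`."""
    groups, remainder = divmod(students, group_size)
    if remainder:
        raise ValueError("students must be a multiple of group_size")
    # n! / ((s!)^g * g!): order inside a group and order of the groups don't matter.
    return factorial(students) // (factorial(group_size) ** groups * factorial(groups))


print(f"{count_groupings(50, 5):.5e}")  # about 1.35363e+37
```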
However, each group is to be led by a facilitator. This reduces the number of possible combinations considerably, because each facilitator has one field of expertise among 5 possible ones, which should be matched to the topics the students are writing about as much as possible. If there are three facilitators with competence A, three with competence B, two with competence C, one with competence D and one with competence E, and 15 students are assigned to A, 15 to B, 10 to C, 5 to D and 5 to E, the number of possible combinations comes down to 252 505.
But both students and facilitators keep advocating for the use of more criteria, instead of just focusing on field of expertise. For example, wanting to be in a group of students that know each other, or being in a group with a facilitator that has particular knowledge of a specific research method.
I am trying to illustrate my intuitive reasoning, which tells me that each new criterion increases the complexity/impossibility of the task, if the objective is a completely efficient solution. But I can't get my head around expressing this analytically in a satisfactory manner.
Is my reasoning correct, that adding criteria would reduce the amount of possibilities that can be discarded following the inclusion-exclusion principle, thus making the task more complex, adding possible combinations? I also think that if the criteria are not compatible (for example if students that know each other are writing about different topics, and there aren't enough competent facilitators), certain constraints become inviable.
You need to distinguish between computational complexity and human complexity. Adding constraints almost automatically increases the human complexity of the problem in the sense that it means that there is more to wrap your mind around. But -- it isn't true that the computational complexity increases. At least sometimes it decreases.
For example, say you have a set of 200 items and you want to determine if there is a subset of them which satisfies some constraint. Depending on the constraint, there might be no feasible way to do it. After all, 2^200 is much too large to brute-force. Now add the constraint that the subset needs to have exactly 3 elements. Now all of a sudden it is possible to brute force (just run through all 1,313,400 3-element subsets until you either find a solution or determine that none exist). This is enough to show that it isn't true that adding a constraint always makes a problem intrinsically more difficult. In the discrete case a new constraint can cut down on the size of the search space in a way that can be exploited. In the continuous case it can reduce degrees of freedom and thus lower the dimension of the problem. This isn't to say that it always makes it easier. Probably as a rule of thumb, additional constraints tend to make a problem more difficult.
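The brute force over 3-element subsets is short enough to write down. The helper `find_subset` and the sample predicate are hypothetical; any constraint that can be checked on a candidate subset would do.

```python
from itertools import combinations
from math import comb


def find_subset(items, size, predicate):
    """Return the first subset of the given size satisfying predicate, or None."""
    for subset in combinations(items, size):
        if predicate(subset):
            return subset
    return None


print(comb(200, 3))  # 1313400 candidate subsets, small enough to enumerate

# Example constraint: three of the numbers 0..199 that sum to 594.
print(find_subset(range(200), 3, lambda s: sum(s) == 594))  # (197, 198, 199)
```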
Your actual problem isn't spelled out enough to give concrete advice. One possibility (and one way to handle a proliferation of somewhat extraneous constraints) is to divide the constraints into hard constraints which need to be satisfied and soft constraints which are merely desired but not strictly needed. Turn it into an optimization problem: find the solution which maximizes the number of soft-constraints that are satisfied, subject to the condition that it satisfies the hard constraints. Perhaps you can formulate it as an integer programming problem and hopefully find an exact solution. Or, if it is easy to generate solutions that satisfy the hard constraints and it is easy to mutate one such solution to obtain another (e.g. swap two students who are in different groups), then an evolutionary algorithm would be a reasonable heuristic.
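As a rough illustration of the swap-based heuristic, here is a simple hill-climbing local search, which is a stripped-down relative of an evolutionary algorithm rather than a full one. Everything in it is an assumption made for the example: the only hard constraint is the fixed group size (which swaps preserve automatically), the only soft constraint is "friends end up in the same group", and the names `soft_score` and `improve_by_swaps` are made up.

```python
import random


def soft_score(groups, friend_pairs):
    """Number of friend pairs that ended up in the same group."""
    group_of = {s: g for g, members in enumerate(groups) for s in members}
    return sum(1 for a, b in friend_pairs if group_of[a] == group_of[b])


def improve_by_swaps(groups, friend_pairs, iterations=2000, seed=0):
    """Randomly swap students between groups, keeping swaps that don't hurt."""
    rng = random.Random(seed)
    groups = [list(g) for g in groups]  # work on a copy
    best = soft_score(groups, friend_pairs)
    for _ in range(iterations):
        g1, g2 = rng.sample(range(len(groups)), 2)
        i = rng.randrange(len(groups[g1]))
        j = rng.randrange(len(groups[g2]))
        groups[g1][i], groups[g2][j] = groups[g2][j], groups[g1][i]
        score = soft_score(groups, friend_pairs)
        if score >= best:
            best = score  # keep the swap
        else:
            # Undo the swap: it made the soft score worse.
            groups[g1][i], groups[g2][j] = groups[g2][j], groups[g1][i]
    return groups, best
```

A real version would also have to restrict swaps so that topic-to-facilitator matching (or whatever else is declared a hard constraint) stays satisfied, for example by only swapping students who share a topic.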

Creating an efficient function to fit a dataset

Basically I have a large (could get as large as 100,000-150,000 values) data set of 4-byte inputs and their corresponding 4-byte outputs. The inputs aren't guaranteed to be unique (which isn't really a problem because I figure I can generate pseudo-random numbers to add or xor the inputs with so that they do become unique), but the outputs aren't guaranteed to be unique either (so two different sets of inputs might have the same output).
I'm trying to create a function that effectively models the values in my data-set. I don't need it to interpolate efficiently, or even at all (by this I mean that I'm never going to feed it an input that isn't contained in this static data-set). However it does need to be as efficient as possible. I've looked into interpolation and found that it doesn't really fit what I'm looking for. For example, the large number of values means that spline interpolation won't do since it creates a polynomial per interval.
Also, from my understanding polynomial interpolation would be way too computationally expensive (n values means that the polynomial could include terms as high as pow(x,n-1). For x= a 4-byte number and n=100,000 it's just not feasible). I've tried looking online for a while now, but I'm not very strong with math and must not know the right terms to search with because I haven't come across anything similar so far.
I can see that this is not completely (to put it mildly) a programming question and I apologize in advance. I'm not looking for the exact solution or even a complete answer. I just need pointers on the topics that I would need to read up on so I can solve this problem on my own. Thanks!
TL;DR - I need a variant of interpolation that only needs to fit the initially given data-points, but which is computationally efficient.
Edit:
Some clarification - I do need the output to be exact and not an approximation. This is sort of an optimization of some research work I'm currently doing and I need to have this look-up implemented without the actual bytes of the outputs being present in my program. I can't really say a whole lot about it at the moment, but I will say that for the purposes of my work, encryption (or compression or any other form of obfuscation) is not an option to hide the table. I need a mathematical function that can recreate the output so long as it has access to the input. I hope that clears things up a bit.
Here is one idea. Make your function be the sum (mod 2^32) of a linear function over all 4-byte integers, a piecewise linear function whose pieces depend on the value of the first bit, another piecewise linear function whose pieces depend on the value of the first two bits, and so on.
The actual output values appear nowhere, you have to add together linear terms to get them. There is also no direct record of which input values you have. (Someone could conclude something about those input values, but not their actual values.)
The various coefficients you need can be stored in a hash. Any lookups you do which are not found in the hash are assumed to be 0.
If you add a certain amount of random "noise" to your dataset before starting to encode it fairly efficiently, it would be hard to tell what your input values are, and very hard to tell what the outputs are even approximately without knowing the inputs.
Since you didn't impose any restriction on the function (continuous, smooth, etc), you could simply do a piece-wise constant interpolation:
or a linear interpolation:
I assume you can figure out how to construct such a function without too much trouble.
EDIT: In light of your additional requirement that such a function should "hide" the data points...
For a piece-wise constant interpolation, the constant intervals should be randomized so as to not reveal where the data point is. So for example in the picture, the intervals are centered about the data point it's interpolating. Instead, you might want to do something like:
[0 , 0.3) -> 0
[0.3 , 1.9) -> 0.8
[1.9 , 2.1) -> 0.9
[2.1 , 3.5) -> 0.2
etc
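The interval table above can be evaluated with a binary search over the breakpoints. This is only a sketch using the four intervals listed; the names `BREAKPOINTS`, `VALUES` and `piecewise_constant` are made up for the example.

```python
from bisect import bisect_right

# Interval i is [BREAKPOINTS[i], BREAKPOINTS[i + 1]) and maps to VALUES[i].
BREAKPOINTS = [0.0, 0.3, 1.9, 2.1, 3.5]
VALUES = [0.0, 0.8, 0.9, 0.2]


def piecewise_constant(x):
    """Return the value of the interval that contains x."""
    if not BREAKPOINTS[0] <= x < BREAKPOINTS[-1]:
        raise ValueError("x is outside the tabulated range")
    # bisect_right finds the first breakpoint greater than x.
    return VALUES[bisect_right(BREAKPOINTS, x) - 1]


print(piecewise_constant(1.0))  # 0.8
print(piecewise_constant(2.0))  # 0.9
```

Lookup is O(log n) in the number of intervals, so even 150,000 data points need fewer than 20 comparisons per query.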
Of course, this only hides the x-coordinate. To hide the y-coordinate as well, you can use a linear interpolation.
Simply make it so that the "pointy" part isn't where the data point is. Pick random x-values such that every adjacent data point has one of these x-values in between. Then interpolate such that the "pointy" part is at these x-values.
I suggest a huge Lookup Table full of unused entries. It's the brute-force approach, having an ordered table of outputs, ordered by every possible value of the input (not just the data set, but also all other possible 4-byte value).
Though all of your data would be there, you could fill the non-used inputs with random, arbitrary, or stochastic (random within potentially complex constraints) data. If you make it convincing, no one could pick your real data out of it. If a "real" function interpolated all your data, it would also "contain" all the information of your real data, and anyone with access to it could use it to generate an LUT as described above.
LUTs are lightning-fast, but very memory hungry. Your case is on the edge of feasibility, requiring 2^32 entries * 32 bits (4 bytes) each = 16 gigabytes of RAM, which requires a 64-bit machine to run. That is just for the data, not the program, the operating system, or other data. It's better to have 24, just to be sure. If you can afford it, they are the way to go.

Generate very very large random numbers

How would you generate a very very large random number? I am thinking on the order of 2^10^9 (one billion bits). Any programming language -- I assume the solution would translate to other languages.
I would like a uniform distribution on [1,N].
My initial thoughts:
You could randomly generate each digit and concatenate. Problem: even very good pseudorandom generators are likely to develop patterns with millions of digits, right?
You could perhaps help create large random numbers by raising random numbers to random exponents. Problem: you must make the math work so that the resulting number is still random, and you should be able to compute it in a reasonable amount of time (say, an hour).
If it helps, you could try to generate a possibly non-uniform distribution on a possibly smaller range (using the real numbers, for instance) and transform. Problem: this might be equally difficult.
Any ideas?
Generate log2(N) random bits to get a number M,
where M may be up to twice as large as N.
Repeat until M is in the range [1;N].
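In Python the rejection loop described above might look like this. It is a sketch, not a tuned implementation: `uniform_1_to_n` is a made-up name, and `secrets.randbits` is used as the bit source, which draws from the operating system's cryptographically secure generator.

```python
import secrets


def uniform_1_to_n(n):
    """Return an integer drawn uniformly from [1, n] by rejection sampling."""
    if n < 1:
        raise ValueError("n must be at least 1")
    k = n.bit_length()            # just enough bits to represent n
    while True:
        m = secrets.randbits(k)   # uniform on [0, 2**k - 1]
        if 1 <= m <= n:           # reject 0 and anything above n
            return m


big = uniform_1_to_n(2 ** 10_000)  # works the same way for much larger n
```

Because 2**k is less than twice n, each attempt is accepted with probability of roughly one half or better, so the expected number of attempts is about two at most.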
Now to generate the random bits you could either use a source of true randomness, which is expensive.
Or you might use some cryptographically secure random number generator, for example AES with a random key encrypting a counter for subsequent blocks of bits. The cryptographically secure implies that there can be no noticeable patterns.
It depends on what you need the data for. For most purposes, a PRNG is fast and simple. But they are not perfect. For instance I remember hearing that Monte Carlo simulations of chaotic systems are really good at revealing the underlying pattern in a PRNG.
If that is the sort of thing that you are doing, though, there is a simple trick I learned in grad school for generating lots of random data. Take a large (preferably rapidly changing) file. (Some big data structures from the running kernel are good.) Compress it to increase the entropy. Throw away the headers. Then for good measure, encrypt the result. If you're planning to use this for cryptographic purposes (and you didn't have a perfect entropy data set to work with), then reverse it and encrypt again.
The underlying theory is simple. Information theory tells us that there is no difference between a signal with no redundancy and pure random data. So if we pick a big file (ie lots of signal), remove redundancy with compression, and strip the headers, we have a pretty good random signal. Encryption does a really good job at removing artifacts. However encryption algorithms tend to work forward in blocks. So if someone could, despite everything, guess what was happening at the start of the file, that data is more easily guessable. But then reversing the file and encrypting again means that they would need to know the whole file, and our encryption, to find any pattern in the data.
The reason to pick a rapidly changing piece of data is that if you run out of data and want to generate more, you can go back to the same source again. Even small changes will, after that process, turn into an essentially uncorrelated random data set.
NTL: A Library for doing Number Theory
This was recommended by my Coding Theory and Cryptography teacher... so I guess it does the work right, and it's pretty easy to use.
RandomBnd, RandomBits, RandomLen -- routines for generating pseudo-random numbers
ZZ RandomLen_ZZ(long l);
// ZZ = pseudo-random number with precisely l bits,
// or 0 if l <= 0.
If you have a random number generator that generates random numbers of X bits. And concatenated bits of [X1, X2, ... Xn ] create the number you want of N bits, as long as each X is random, I don't see why your large number wouldn't be random as well for all intents and purposes. And if standard C rand() method is not secure enough, I'm sure there's plenty of other libraries (like the ones mentioned in this thread) whose pseudo-random numbers are "more random".
even very good pseudorandom generators are likely to develop patterns with millions of digits, right?
From the wikipedia on pseudo-random number generation:
A PRNG can be started from an arbitrary starting state using a seed state. It will always produce the same sequence thereafter when initialized with that state. The maximum length of the sequence before it begins to repeat is determined by the size of the state, measured in bits. However, since the length of the maximum period potentially doubles with each bit of 'state' added, it is easy to build PRNGs with periods long enough for many practical applications.
You could perhaps help create large random numbers by raising random numbers to random exponents
I assume you're suggesting something like populating the values of a scientific notation with random values?
E.g.: 1.58901231 x 10^5819203489
The problem with this is that your distribution is going to be logarithmic (or is that exponential? :) - same difference, it isn't even). You will never get a value that has the millionth digit set, yet contains a digit in the one's column.
you could try to generate a possibly non-uniform distribution on a possibly smaller range (using the real numbers, for instance) and transform
Not sure I understand this. Sounds like the same thing as the exponential solution, with the same problems. If you're talking about multiplying by a constant, then you'll get a lumpy distribution instead of a logarithmic (exponential?) one.
Suggested Solution
If you just need really big pseudo-random values, with a good distribution, use a PRNG algorithm with a larger state. The period of a PRNG can be as large as about 2 raised to the number of state bits, so it doesn't take that many bits to fill even a really large number.
From there, you can use your first solution:
You could randomly generate each digit and concatenate
Although I'd suggest that you use the full range of values returned by your PRNG (possibly 2^31 or 2^32), and populate a byte array with those values, splitting it up as necessary. Otherwise you might be throwing away a lot of bits of randomness. Also, scaling your values to a range (or using modulo) can easily screw up your distribution, so there's another reason to try to keep the max number of bits your PRNG can return. Be careful to pack your byte array full of the bits returned, though, or you'll again introduce lumpiness to your distribution.
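Here is a sketch of the packing idea using Python's built-in generator (a Mersenne Twister, so not suitable for cryptographic use). It takes full 32-bit outputs, packs all of them into a byte array, and only trims the excess bits once at the end. The function name `random_bits_from_words` is made up for this example.

```python
import random


def random_bits_from_words(rng, nbits):
    """Build an nbits-wide integer from full 32-bit PRNG outputs."""
    words = (nbits + 31) // 32              # number of 32-bit outputs needed
    buf = bytearray()
    for _ in range(words):
        # Use every bit of each output; no scaling or modulo.
        buf += rng.getrandbits(32).to_bytes(4, "big")
    value = int.from_bytes(buf, "big")
    return value >> (words * 32 - nbits)    # drop the surplus low bits


rng = random.Random(12345)
x = random_bits_from_words(rng, 1000)       # uniform on [0, 2**1000 - 1]
```

To get a uniform value on [1, N] rather than a power-of-two range, a rejection step on top of this avoids the distortion that scaling or modulo would introduce.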
The problem with those solutions, though, is how to fill that (larger than normal) seed state with random-enough values. You might be able to use standard-size seeds (populated via time or GUID-style population), and populate your big-PRNG state with values from the smaller-PRNG. This might work if it isn't mission critical how well distributed your numbers are.
If you need truly cryptographically secure random values, the only real way to do it is use a natural form of randomness, such as that at http://www.random.org/. The disadvantages of natural randomness are availability, and the fact that many natural-random devices take a while to generate new entropy, so generating large amounts of data might be really slow.
You can also use a hybrid and be safe - natural-random seeds only (to avoid the slowness of generation), and PRNG for the rest of it. Re-seed periodically.
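A minimal sketch of the hybrid approach, with `os.urandom` standing in for the "natural" entropy source and Python's built-in generator as the PRNG. The class name `ReseedingPRNG` and the `reseed_every` and `seed_bytes` parameters are assumptions for illustration. Note that the built-in generator is a Mersenne Twister, so this shows the reseeding pattern only and does not make the output cryptographically secure.

```python
import os
import random


class ReseedingPRNG:
    """PRNG that is re-seeded from OS entropy after a fixed number of draws."""

    def __init__(self, reseed_every=1_000_000, seed_bytes=32):
        self.reseed_every = reseed_every
        self.seed_bytes = seed_bytes
        self.reseed_count = 0
        self._rng = random.Random()
        self._reseed()

    def _reseed(self):
        # Only the seed comes from the slow entropy source.
        seed = int.from_bytes(os.urandom(self.seed_bytes), "big")
        self._rng.seed(seed)
        self._draws = 0
        self.reseed_count += 1

    def getrandbits(self, k):
        if self._draws >= self.reseed_every:
            self._reseed()
        self._draws += 1
        return self._rng.getrandbits(k)  # bulk output comes from the fast PRNG
```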

Resources