Assume we have applied a closed hashing algorithm to (4, 2, 12, 3, 9, 11, 7, 8, 13, and 18), and assume the length of the hash table is 7 initially.
How can a search on such a hash table be achieved in O(1) time in the worst case?
It really doesn't matter what you do. Because the data set is predetermined, there is a constant upper bound to the worst case lookup for any hash function (as long as the hash function is guaranteed to terminate). (If one element takes longer to find than the others, that is the upper bound.) The constant upper bound implies O(1) complexity. QED.
Related
https://www.quora.com/Why-should-the-size-of-a-hash-table-be-a-prime-number?share=1
I see that people mention that the number of buckets of a hash table should preferably be a prime number.
Is that always the case? When the hash values are already evenly distributed, is there no need to use prime numbers then?
https://github.com/rui314/chibicc/blob/main/hashmap.c
For example, the above hash table code does not use prime numbers as the number of buckets.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L37
But the hash values are generated from strings using fnv_hash.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L17
So is there a reason why it makes sense to use bucket counts that are not necessarily prime numbers?
The answer is "usually you don't need a table whose size is a prime number, but there are some implementation reasons why you might want to do this."
Fundamentally, hash tables work best when hash codes are spread out as close to uniformly at random as possible. That prevents items from clustering in any one location within the table. At some level, provided that you have a good enough hash function to make this happen, the size of the table doesn't matter.
So why do folks say to pick tables whose size is a prime? There are two main reasons for this, and they're due to specific cases that don't arise in all hash tables.
One reason why you sometimes see prime-sized tables is due to a specific way of building hash functions. You can build reasonable hash functions by picking functions of the form h(x) = (ax + b) mod p, where a is a number in {1, 2, ..., p-1} and b is a number in {0, 1, 2, ..., p-1}, assuming that p is a prime. If p isn't prime, hash functions of this form don't spread items out uniformly. As a result, if you're using a hash function like this one, it makes sense to pick a table whose size is a prime number.
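A minimal sketch of that construction (the prime, the key range, and the random choice of a and b here are arbitrary illustrative values):

```python
import random

p = 101  # prime table size, also the modulus of the hash family
a = random.randrange(1, p)   # a in {1, ..., p-1}
b = random.randrange(0, p)   # b in {0, ..., p-1}

def h(x):
    # One randomly chosen member of the family h(x) = (a*x + b) mod p.
    return (a * x + b) % p

# Every key maps to a valid bucket index 0..p-1.
assert all(0 <= h(x) < p for x in range(1000))
```

Picking a and b at random selects one member of the family; the table size p doubles as the modulus.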
The second reason you see advice about prime-sized tables is if you're using an open-addressing strategy like quadratic probing or double hashing. These hashing strategies work by hashing items to some initial location k. If that slot is full, we look at slot (k + r) mod T, where T is the table size and r is some offset. If that slot is full, we then check (k + 2r) mod T, then (k + 3r) mod T, etc. If the table size is a prime number and r isn't zero, this has the nice, desirable property that these indices will cycle through all the different positions in the table without ever repeating, ensuring that items are nicely distributed over the table. With non-prime table sizes, it's possible that this strategy gets stuck cycling through a small number of slots, which gives less flexibility in positions and can cause insertions to fail well before the table fills up.
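A quick way to see that cycling behavior (the table sizes and offset below are arbitrary examples):

```python
def probe_cycle(start, r, table_size):
    """Slots visited by probing start, start+r, start+2r, ... mod table_size,
    stopping when a slot repeats."""
    seen, i = [], start % table_size
    while i not in seen:
        seen.append(i)
        i = (i + r) % table_size
    return seen

# Prime table size 7: offset r = 4 visits every slot before repeating.
print(len(probe_cycle(0, 4, 7)))   # 7
# Non-prime size 8 with r = 4: the probe sequence is stuck on just 2 slots.
print(len(probe_cycle(0, 4, 8)))   # 2
```

With a prime table size, any nonzero offset r is coprime to the size, so the probe sequence covers the whole table; with size 8 and r = 4, it only ever sees slots 0 and 4.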
So assuming you aren't using double hashing or quadratic probing, and assuming you have a strong enough hash function, feel free to size your table however you'd like.
templatetypedef has some excellent points as always - just adding a couple more and some examples...
Is it always necessary to make hash table number of buckets a prime number for performance reason?
No. Firstly, using prime numbers for bucket count tends to mean you need to spend more CPU cycles to fold/mod a hash value returned by the hash function into the current bucket count. A popular alternative is to use powers of two for the bucket count (e.g. 8, 16, 32, 64... as you resize), because then you can do a bitwise AND operation to map from a hash value to a bucket in 1 CPU cycle. That answers your "So there is a reason why it makes sense to use bucket sizes that are not necessarily prime numbers?"
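The power-of-two trick relies on the identity h % n == h & (n - 1) whenever n is a power of two; a couple of sample hash values demonstrate it:

```python
n = 64  # power-of-two bucket count
for h in (5, 37, 1023, 0xDEADBEEF):
    # Mapping a hash value to a bucket with a single AND instead of a divide.
    assert h % n == h & (n - 1)
print("all equal")
```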
Tuning a hash table for performance often means weighing the cost of a stronger hash function and modding by prime numbers against the cost of higher collisions.
Prime bucket counts often help reduce collisions when the hash function is unable to produce a very good distribution for the keys it's fed.
For example, if you hashed a bunch of pointers to 64-bit doubles using an identity hash (basically, casting the pointer address to a size_t), then the hash values would all be multiples of 8 (due to alignment), and if you had a hash table size like say 1024 or 2048 (powers of 2), then all your pointers would hash onto 1/8th of the bucket indices (specifically, buckets 0, 8, 16, 24, 32 etc.). With a prime number of buckets, the pointer values - which, if the load factor is high, are inevitably spread out over a much larger range than the range of bucket indices - at least tend to wrap around the hash table hitting different indices.
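A rough simulation of that effect (the base address and the counts are made-up values; 1031 is simply the first prime above 1024):

```python
# 800 8-byte-aligned "pointers" hashed by identity.
ptrs = [0x7F0000000000 + 8 * i for i in range(800)]

power_of_two = len({p % 1024 for p in ptrs})  # 2^10 buckets
prime = len({p % 1031 for p in ptrs})         # 1031 is prime

print(power_of_two)  # 128: only every 8th bucket index is ever used
print(prime)         # 800: every pointer lands in its own bucket
```

With 1024 buckets, the multiples-of-8 pattern survives the mod and 7/8ths of the buckets stay empty; with the prime count, the stride 8 is coprime to 1031, so the wraps spread the pointers over distinct buckets.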
When you use a very strong hash function - where the low order bits are effectively random but repeatable - you'll already get a good distribution across buckets regardless of the bucket count. There are also times when even with a terribly weak hash function - like an identity hash, h(x) == x - all the bits in the keys are so random that they produce as good a distribution as a cryptographic hash could produce, so there's no point spending extra time on a stronger hash - that may even increase collisions.
There are also times when the distribution isn't inherently great, but you can afford to use extra memory to keep the load factor low, so it's not worth using primes or a better hash function. Still, extra buckets put more strain on the CPU caches too - so things can end up slower than hoped for.
Other times, keys with an identity hash have an inherent tendency to fall into distinct buckets (e.g. because they might have been generated by an incrementing counter, even if some of the values are no longer in use). In that case, a strong hash function increases collisions and worsens CPU cache access patterns. Whether you use powers of two or prime bucket counts makes little difference here.
When the hash values are already evenly distributed, there is no need to use prime numbers then?
That statement is trivially true but kind of pointless if you're talking about hash values after the mod-to-current-hash-table-size operation: even distribution there directly relates to few collisions.
If you're talking about the more interesting case of hash values evenly distributed in the hash function's return-type value space (e.g. a 64-bit integer), before those values are modded into whatever the current hash table bucket count is, then there's still room for prime numbers to help, but only when the hashed key space spans a larger range than the hash bucket indices. The pointer example above illustrated that: if you had say 800 distinct 8-byte-aligned pointers going into ~1000 buckets, then the difference between the numerically lowest pointer and the highest would be at least 799*8 = 6392... you're wrapping around the table more than 6 times at a minimum (for close-as-possible pointers), and a prime number of buckets would increase the odds of each "wrap" modding onto previously unused buckets.
Note that some of the above benefits to prime bucket counts apply to any kind of collision handling - separate chaining, linear probing, quadratic probing, double hashing, cuckoo hashing, robin hood hashing etc.
In the context of a convolutional neural network model, I once heard a statement that:
One desirable property of convolutions is that they are translationally equivariant; and the introduction of spatial pooling can corrupt the property of translational equivariance.
What does this statement mean, and why?
Most probably you heard it from Bengio's book. I will try to give you my explanation.
In a rough sense, two transformations f and g are equivariant if f(g(x)) = g(f(x)). In your case of convolutions and translations, this means that convolve(translate(x)) gives the same result as translate(convolve(x)). This is desirable because if your convolution finds an eye of a cat in an image, it will still find that eye if you shift the image.
You can see this for yourself (I use a 1-d convolution only because it is easy to calculate). Let's convolve v = [4, 1, 3, 2, 3, 2, 9, 1] with k = [5, 1, 2]. The result will be [27, 12, 23, 17, 35, 21].
Now let's shift our v by prepending something: v' = [8] + v. Convolving with k, you will get [46, 27, 12, 23, 17, 35, 21]. As you can see, the result is just the previous result prepended with some new stuff.
Now the part about spatial pooling. Let's do a max-pooling of size 3 on the first result and on the second one. In the first case you will get [27, 35], in the second [46, 35, 21]. As you see, 27 somehow disappeared (the result was corrupted). It will be corrupted even more if you use average pooling.
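The arithmetic above can be reproduced with a short sketch (a sliding-window dot product and non-overlapping max pooling):

```python
def conv1d(v, k):
    """Valid 1-d cross-correlation: dot product of k with each window of v."""
    return [sum(a * b for a, b in zip(v[i:i + len(k)], k))
            for i in range(len(v) - len(k) + 1)]

def max_pool(v, size):
    """Non-overlapping max pooling over windows of the given size."""
    return [max(v[i:i + size]) for i in range(0, len(v), size)]

v = [4, 1, 3, 2, 3, 2, 9, 1]
k = [5, 1, 2]
print(conv1d(v, k))                      # [27, 12, 23, 17, 35, 21]
print(conv1d([8] + v, k))                # [46, 27, 12, 23, 17, 35, 21]
print(max_pool(conv1d(v, k), 3))         # [27, 35]
print(max_pool(conv1d([8] + v, k), 3))   # [46, 35, 21]
```

The convolution outputs differ only by a shift (equivariance), while the pooled outputs differ in content: 27 survives pooling in the first case but not in the second.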
P.S. max/min pooling is the most translation-invariant of all poolings (if one can say so, comparing the number of non-corrupted elements).
A note on the terms translation equivariance and translation invariance: they are different.
Translation equivariance means that a translation of the input features results in an equivalent translation of the outputs. This is desirable when we need to locate the pattern.
Translation invariance means that a translation of the input does not change the outputs at all.
Translation invariance is important to achieve. It effectively means that after learning a certain pattern in the lower-left corner of a picture, our convnet can recognize the pattern anywhere (including the upper-right corner).
As we know, a plain densely connected network without convolution layers in between cannot achieve translation invariance.
We need to introduce convolution layers to bring generalization power to deep networks and to learn representations with fewer training samples.
What is the max # of tuples you could insert at a time in Impala?
INSERT INTO sample_table values ('john', 'high',....value 6, value 7, value 8 ......value 25), ('Kim', 'low',... value 6, value 7, value 8 ......value 25),
given that a tuple is
('john', 'high',....value 6, value 7, value 8 ......value 25)
Well, the limit on n should depend on how much stack space the Impala frontend's JVM has, since this style of INSERT statement causes JFlex (which Impala uses as its SQL parser) to recurse at least n times, and all the tuples are stored in one deep parse tree. Supposing you've successfully constructed this nasty tree, the next step would be serializing it as a Thrift message and passing it around. I can only imagine how slow that could be.
I'd suggest using LOAD for large amounts of insertions, which translates to raw file moves, or using INSERT INTO ... SELECT FROM, which internally applies distributed reads and writes over HDFS.
I am planning out a C++ program that takes 3 strings that represent a cryptarithmetic puzzle. For example, given TWO, TWO, and FOUR, the program would find digit substitutions for each letter such that the mathematical expression
  TWO
+ TWO
-----
 FOUR
is true, with the inputs assumed to be right justified. One way to go about this would of course be to just brute force it, assigning every possible substitution for each letter with nested loops, trying the sum repeatedly, etc., until the answer is finally found.
My thought is that though this is terribly inefficient, the underlying loop-check thing may be a feasible (or even necessary) way to go--after a series of deductions are performed to limit the domains of each variable. I'm finding it kind of hard to visualize, but would it be reasonable to first assume a general/padded structure like this (each X represents a not-necessarily distinct digit, and each C is a carry digit, which in this case, will either be 0 or 1)? :
CCC.....CCC
XXX.....XXXX
+ XXX.....XXXX
----------------
CXXX.....XXXX
With that in mind, some more planning thoughts:
-Though leading zeros will not be given in the problem, I probably ought to add enough of them where appropriate to even things out/match operands up.
-I'm thinking I should start with a set of possible values 0-9 for each letter, perhaps stored as vectors in a 'domains' table, and eliminate values from this as deductions are made. For example, if I see some letters lined up like this
A
C
--
A
I can tell that C is zero and can thus eliminate all other values from its domain. I can think of quite a few deductions, but generalizing them to all kinds of little situations and putting them into code seems kind of tricky at first glance.
-Assuming I have a good series of deductions that run through things and boot out lots of values from the domains table, I suppose I'd still just loop over everything and hope that the state space is small enough to generate a solution in a reasonable amount of time. But it feels like there has to be more to it than that! -- maybe some clever equations to set up or something along those lines.
Tips are appreciated!
You could iterate over this problem from right to left, i.e. the way you'd perform the actual operation. Start with the rightmost column. For every digit you encounter, you check whether there already is an assignment for that digit. If there is, you use its value and go on. If there isn't, then you enter a loop over all possible digits (perhaps omitting already used ones if you want a bijective map) and recursively continue with each possible assignment. When you reach the sum row, you again check whether the variable for the digit given there is already assigned. If it is not, you assign the last digit of your current sum, and then continue to the next higher valued column, taking the carry with you. If there already is an assignment, and it agrees with the last digit of your result, you proceed in the same way. If there is an assignment and it disagrees, then you abort the current branch, and return to the closest loop where you had other digits to choose from.
The benefit of this approach should be that many variables are determined by a sum, instead of guessed up front. Particularly for letters which only occur in the sum row, this might be a huge win. Furthermore, you might be able to spot errors early on, thus avoiding choices for letters in some cases where the choices you made so far are already inconsistent. A drawback might be the slightly more complicated recursive structure of your program. But once you got that right, you'll also have learned a good deal about turning thoughts into code.
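A sketch of this right-to-left recursive approach (the helper names are my own; it assumes distinct letters get distinct digits and forbids leading zeros):

```python
def solve(a, b, total):
    """Column-by-column cryptarithm solver: guess addend letters, derive sum letters."""
    a, b, total = a[::-1], b[::-1], total[::-1]  # index 0 = rightmost column
    leading = {a[-1], b[-1], total[-1]}          # letters that can't be zero
    assign, used = {}, set()

    def rec(col, carry):
        if col == len(total):
            return dict(assign) if carry == 0 else None
        addends = [w[col] for w in (a, b) if col < len(w)]
        free = [c for c in dict.fromkeys(addends) if c not in assign]

        def choose(i):
            if i < len(free):  # guess a digit for the next unassigned addend letter
                c = free[i]
                for d in range(10):
                    if d in used or (d == 0 and c in leading):
                        continue
                    assign[c] = d; used.add(d)
                    r = choose(i + 1)
                    del assign[c]; used.discard(d)
                    if r:
                        return r
                return None
            # All addend letters known: the column sum determines the result digit.
            s = sum(assign[c] for c in addends) + carry
            digit, t = s % 10, total[col]
            if t in assign:  # must agree with an earlier assignment
                return rec(col + 1, s // 10) if assign[t] == digit else None
            if digit in used or (digit == 0 and t in leading):
                return None
            assign[t] = digit; used.add(digit)
            r = rec(col + 1, s // 10)
            del assign[t]; used.discard(digit)
            return r

        return choose(0)

    return rec(0, 0)
```

For example, `solve("TWO", "TWO", "FOUR")` returns a digit assignment under which the addition holds; letters appearing only in the sum row are never guessed, only derived.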
I solved this problem at my blog using a randomized hill-climbing algorithm. The basic idea is to choose a random assignment of digits to letters, "score" the assignment by computing the difference between the two sides of the equation, then altering the assignment (swap two digits) and recompute the score, keeping those changes that improve the score and discarding those changes that don't. That's hill-climbing, because you only accept changes in one direction. The problem with hill-climbing is that it sometimes gets stuck in a local maximum, so every so often you throw out the current attempt and start over; that's the randomization part of the algorithm. The algorithm is very fast: it solves every cryptarithm I have given it in fractions of a second.
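A rough sketch of that hill-climbing idea (the step limit and restart policy are arbitrary choices of mine; it assumes the puzzle has a solution, otherwise it loops forever):

```python
import random

def solve_hill(words, total):
    """Randomized hill climbing: swap digits, keep improving swaps, restart when stuck."""
    letters = sorted(set(''.join(words) + total))
    leading = {w[0] for w in words} | {total[0]}

    def score(digits):
        # Difference between the two sides of the equation for this assignment.
        m = dict(zip(letters, digits))
        if any(m[c] == 0 for c in leading):
            return float('inf')  # forbid leading zeros
        num = lambda w: int(''.join(str(m[c]) for c in w))
        return abs(sum(num(w) for w in words) - num(total))

    while True:  # random restart to escape local minima
        digits = random.sample(range(10), 10)
        best = score(digits)
        for _ in range(10000):
            if best == 0:
                return dict(zip(letters, digits))
            i, j = random.sample(range(10), 2)
            digits[i], digits[j] = digits[j], digits[i]  # try a swap
            s = score(digits)
            if s < best:
                best = s  # keep the improving swap
            else:
                digits[i], digits[j] = digits[j], digits[i]  # undo it
```

Calling `solve_hill(["TWO", "TWO"], "FOUR")` typically converges in well under a second, though being randomized it may need several restarts.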
Cryptarithmetic problems are classic constraint satisfaction problems. Basically, what you need to do is have your program generate constraints based on the inputs such that you end up with something like the following, using your given example:
O + O = 2O = R + 10*Carry1
W + W + Carry1 = 2W + Carry1 = U + 10*Carry2
T + T + Carry2 = 2T + Carry2 = O + 10*Carry3 = O + 10*F
Generalized pseudocode:
for i in range of the shorter input (or either input, if they're the same length):
    shorterInput[i] + longerInput[i] + Carry[i] = result[i] + 10*Carry[i+1]   // Carry[0] == 0
for the rest of the longer input, if one is longer:
    longerInput[i] + Carry[i] = result[i] + 10*Carry[i+1]
Additional constraints based on the definition of the problem:
Range(digits) == {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Range(auxiliary_carries) == {0, 1}
So for your example:
Range(O, W, T) == {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Range(Carry1, Carry2, F) == {0, 1}
Once you've generated the constraints to limit your search space, you can use CSP resolution techniques as described in the linked article to walk the search space and determine your solution (if one exists, of course). The concept of (local) consistency is very important here and taking advantage of it allows you to possibly greatly reduce the search space for CSPs.
As a simple example, note that cryptarithmetic generally does not use leading zeroes, meaning if the result is longer than both inputs the final digit, i.e. the last carry digit, must be 1 (so in your example, it means F == 1). This constraint can then be propagated backwards, as it means that 2T + Carry2 == O + 10; in other words, the minimum value for T must be 5, as Carry2 can be at most 1 and 2(4)+1==9. There are other methods of enhancing the search (min-conflicts algorithm, etc.), but I'd rather not turn this answer into a full-fledged CSP class so I'll leave further investigation up to you.
(Note that you can't make assumptions like A+C=A -> C == 0 except for in least significant column due to the possibility of C being 9 and the carry digit into the column being 1. That does mean that C in general will be limited to the domain {0, 9}, however, so you weren't completely off with that.)
Specifically, around the log-log counting approach.
I'll try and clarify the use of probabilistic counters although note that I'm no expert on this matter.
The aim is to count to very very large numbers using only a little space to store the counter (e.g. using a 32 bits integer).
Morris came up with the idea to maintain a "log count", so instead of counting n, the counter holds log₂(n). In other words, given a value c of the counter, the real count represented by the counter is 2ᶜ.
As logs are generally not integer-valued, the problem becomes deciding when the counter c should be incremented, as we can only do so in steps of 1.
The idea here is to use a "probabilistic counter": for each call to a method Increment on our counter, we update the actual counter value with a probability p. This is useful because it can be shown that the expected value represented by the counter c under probabilistic updates is in fact n. In other words, on average the value represented by our counter after n calls to Increment is in fact n (but at any one point in time our counter probably has some error)! We are trading accuracy for the ability to count up to very large numbers with little storage space (e.g. a single register).
One scheme to achieve this, as described by Morris, is to have a counter value c represent the actual count 2ᶜ (i.e. the counter holds the log₂ of the actual count). We update this counter with probability 1/2ᶜ where c is the current value of the counter.
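A small simulation of that scheme (the trial counts are arbitrary; note I use 2^c − 1 as the estimate, which is the form whose expectation is exactly n):

```python
import random

def morris(n):
    """Feed n increments through a Morris counter; return its estimate of n."""
    c = 0
    for _ in range(n):
        if random.random() < 2.0 ** -c:  # increment with probability 1/2^c
            c += 1
    return 2 ** c - 1  # estimated count

# Averaging many independent counters, the mean estimate approaches n.
n, trials = 100, 10000
mean = sum(morris(n) for _ in range(trials)) / trials
print(round(mean))  # close to 100
```

Any single counter can be far off (the estimates are powers of two minus one), but the average over many runs is close to the true count, illustrating the unbiasedness claim above.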
Note that choosing this "base" of 2 means that our represented counts are always powers of 2 (hence the term "order of magnitude estimate"). It is also possible to choose another base b > 1 (typically such that b < 2) so that the error is smaller, at the cost of a smaller maximum count.
The log log comes into play because a number x needs about log₂(x) bits to be represented in base 2. Since the counter holds c ≈ log₂(n) rather than n itself, storing the counter needs only about log₂(log₂(n)) bits.
There are in fact many other schemes to approximate counting, and if you are in need of such a scheme you should probably research which one makes sense for your application.
References:
See Philippe Flajolet for a proof of the average value represented by the counter, or a much simpler treatment in the solutions to problem 5-1 in the book "Introduction to Algorithms". The paper by Morris is usually behind paywalls; I could not find a free version to post here.
It's not exactly about the log-log counting approach, but I think it can help you.
Using Morris' algorithm, the counter represents an "order of magnitude estimate" of the actual count. The approximation is mathematically unbiased.
To increment the counter, a pseudo-random event is used, such that the incrementing is a probabilistic event. To save space, only the exponent is kept. For example, in base 2, the counter can estimate the count to be 1, 2, 4, 8, 16, 32, and all of the powers of two. The memory requirement is simply to hold the exponent.
As an example, to increment from 4 to 8, a pseudo-random number would be generated such that a probability of 0.25 generates a positive change in the counter. Otherwise, the counter remains at 4. (From Wikipedia.)