Related
I've been looking into hash tables, where some data is hashed and that is used for a bucket index.
Some libraries use the modulo of the hash with the bucket size, and others use a bit-mask.
Where only the bits used by the bucket mask are used (ensuring the range is not exceeded).
bitmask:
index = h->hash_func(key) & h->hash_mask;
modulo:
index = h->hash_func(key) % h->bucket_tot;
While there are obvious differences between the two, such as bucket size constraints with bit-masks, ensuring hashing gives good distribution on lower bits, speed of modulo... etc.
Are there strong reasons to choose one over another?
(I'll probably try & benchmark for my own use-case, but curious whats already known on the matter).
Note, this is simply for key:value store, (dictionary/hash/associative-array) and not security related.
Example of a dynamic resizing, chaining hash table implementation using bit-mask:
https://github.com/amadvance/tommyds/blob/master/tommyds/tommyhashdyn.c
https://github.com/GNOME/glib/blob/master/glib/ghash.c
Example using modulo:
https://www.daniweb.com/software-development/c/threads/104887/sucinct-example-of-hash-table-w-chaining
You mentioned "bucket" index so I assume you mean hash tables with separate chaining as collision resolution, in this case there is no reasons for using modulo or bit mask "stronger" that you mentioned (which BTW not so obvious, as you said).
In some languages, most notably Java/JVM-based, array index is positive signed 32-bit integer, thus maximum array size for bit mask is 2^30, that could be insufficient and a strong reason to use no-power-of-two table size and modulo, with which you can approach 2^31-1(max possible signed 32-bit integer) very closely. But since you used C++ syntax this shouldn't be a concern for you.
Also, if you meant not only separate chaining, some open addressing collision resolution algorithms require table size to meet certain conditions, for example, if you implement double hashing, table size should be prime. In this case you obviously should use only modulo to obtain the initial index in the table.
It isn't always just about performance either, sometimes it's about the domain of your problem. You may, for example, have a mask that wishes to hash negative numbers as well. With modulo you have to write special cases to handle them, not so with a bitmask.
When we are insert/lookup an key in a hash table, textbook said it is O(1) time. Yet, how is possible to have an O(1) lookup time? If the hash table store the key in a vector, it will cost O(N), if in a binary tree, it will be O(logN). I just can't image some data structure with O(1) accessing time.
Thanks!
The hashtable hashes your key and put it in array.
For example, hash(x) = 3, where x is your key. The table then puts it into array[3]. Accessing from array is O(1).
At a minimum, hash tables consist of an array and hash function. When an object is added to the table, the hash function is computed on the object and it is stored in the array at the index of the corresponding value that was computed. e.g., if hash(obj) = 2 then arr[2] = obj.
The average insert/lookup on a hash table is O(1).
However it is possible to have collisions when objects compute the same hash value.
In the general case there are "buckets" at each index of the array to handle these collisions. Meaning, all three objects are stored in some other data structure (maybe a linked list or another array) at the index of the hash table.
Therefore, the worst case for lookup on a hash table is O(n) because it is possible that all objects stored in the hash table have collided and are stored in the same bucket.
Technically speaking, hash table lookup, if there is no collision, is O(logn). This is because hashing time is linear with respect to the size (in bytes) of the identifier, and the smallest that a new identifier added to the hash table can be, for that identifier to be unique, is O(logn).
However, the log of all the computer memory in the world is just such a small number, which means that we have very good upper bounds on hash table identifier size. Case in point, log10 of the number of particles in the observable universe is estimated at slightly over 80; in log2 it's about 3.3 times as much. Logarithms grow very slowly.
As a result, most log terms can be treated as constant terms. It's just that traditionally we only apply this fact to hash tables, but not to search trees, in order to teach recurrence relations to students.
This is a homework question, but I think there's something missing from it. It asks:
Provide a sequence of m keys to fill a hash table implemented with linear probing, such that the time to fill it is minimum.
And then
Provide another sequence of m keys, but such that the time fill it is maximum. Repeat these two questions if the hash table implements quadratic probing
I can only assume that the hash table has size m, both because it's the only number given and because we have been using that letter to address a hash table size before when describing the load factor. But I can't think of any sequence to do the first without knowing the hash function that hashes the sequence into the table.
If it is a bad hash function, such that, for instance, it hashes every entry to the same index, then both the minimum and maximum time to fill it will take O(n) time, regardless of what the sequence looks like. And in the average case, where I assume the hash function is OK, how am I supposed to know how long it will take for that hash function to fill the table?
Aren't these questions linked to the hash function stronger than they are to the sequence that is hashed?
As for the second question, I can assume that, regardless of the hash function, a sequence of size m with the same key repeated m-times will provide the maximum time, because it will cause linear probing from the second entry on. I think that will take O(n) time. Is that correct?
Well, the idea behind these questions is to test your understanding of probing styles. For linear probing, if a collision occurs, you simply test the next cell. And it goes on like this until you find an available cell to store your data.
Your hash table doesn't need to be size m but it needs to be at least size m.
First question is asking that if you have a perfect hash function, what is the complexity of populating the table. Perfect hashing function addresses each element without collision. So for each element in m, you need O(1) time. Total complexity is O(m).
Second question is asking for the case that hash(X)=cell(0), which all of the elements will search till the first empty cell(just rear of the currently populated table).
For the first element, you probe once -> O(1)
For the second element, you probe twice -> O(2)
for the nth element, you probe n times -> O(n)
overall you have m elements, so -> O(n*(n+1)/2)
For quadratic probing, you have the same strategy. The minimum case is the same, but the maximum case will have O(nlogn). ( I didn't solve it, just it's my educated guess.)
This questions doesn't sound terribly concerned with the hash function, but it would be nice to have. You seem to pretty much get it, though. It sounds to me like the question is more concerned with "do you know what a worst-case list of keys would be?" than "do you know how to exploit bad hash functions?"
Obviously, if you come up with a sequence where all the entries hash to different locations, then you have O(1) insertions for O(m) time in total.
For what you are saying about hashing all the keys to the same location, each insertion should take O(n) if that's what you are suggesting. However, that's not the total time for inserting all the elements. Also, you might want to consider not literally using the same key over and over but rather using keys that would produce the same location in the table. I think, by convention, inserting the same key should cause a replacement, though I'm not 100% sure.
I'll apologize in advance if I gave too much information or left anything unclear. This question seems pretty cut-and-dried save the part about not actually knowing the hash function, and it was kind of hard to really say much without answering the whole question.
Say I have a scheme that derives a key from N different inputs. Each of the inputs may not be completely secure (f.x. bad passwords) but in combination they are secure. The simple way to do this is to concatenate all of the inputs in order and use a hash as a result.
Now I want to allow key-derivation (or rather key-decryption) given only N-1 out of the N inputs. A simple way to do this is to generate a random key K, generate N temporary keys out of different N subsets of the input, each with one input missing (i.e. Hash(input_{1}, ..., input_{N-1}), Hash(input_{0}, input_{2}, ..., input_{N-1}), Hash(input_{0}, input_{1}, input_{3},..., input_{N-1}), ..., Hash(input_{0}, ..., input_{N-2})) then encrypt K with each of the N keys and store all of the results.
Now I want to a generalized solution, where I can decrypt the key using K out of N inputs. The naive way to expand the scheme above requires storing (N choose N-K) values, which quickly becomes infeasible.
Is there a good algorithm for this, that does not entail this much storage?
I have thought about a way to use something like Shamir's Secret Sharing Scheme, but cannot think of a good way, since the inputs are fixed.
Error Correcting Codes are the most direct way to deal with the problem. They are not, however, particularly easy to implement.
The best approach would be using a Reed Solomon Code. When you derive the password for the first time you also calculate the redundancy required by the code and store it. When you want to recalculate the key you use the redundancy to correct the wrong or missing inputs.
To encrypt / create:
Take the N inputs. Turn each into a block in a good /secure way. Use Reed Solomon to generate M redundancy blocks from the N block combination. You now have N+M blocks, of which you need only a total of N to generate the original N blocks.
Use the N blocks to encrypt or create a secure key.
If the first, store the encrypted key and the M redundancy blocks. If the second, store only the M redundancy blocks.
To decrypt / retrieve:
Take N - R correct input blocks, where R =< M. Combine them with the redundancy blocks you stored to create the original N blocks. Use the original N blocks to decrypt or create the secure key.
(Thanks to https://stackoverflow.com/users/492020/giacomo-verticale : This is essentially what he/she said, but I think a little more explicit / clearer.)
Shamir's share secret is a techinique that is used when you want to split a secret in multiple shares such that only a combination of minimum k parts would reveal the intial secret. If you are not sure about the correctness of the initiator and you want to verify this you use verifiable secret sharing .both are based to polynomial interpolation
One approach would be to generate a purely random key (or by hashing all of the inputs, if you want to avoid an RNG for some reason), split it using a k-of-n threshold scheme, and encrypt each share using the individual password inputs (eg send them through PBKDF2 with 100000 iterations and then encrypt/MAC with AES-CTR/HMAC). This would require less storage than storing hash subsets; roughly N * (share size + salt size + MAC size)
Rather than simply allowing a few errors out of a large number of inputs, you should divide the inputs up into groups and allow some number of errors in each group. If you were to allow 4 errors out of 64 inputs then you would have to have 15,249,024 encrypted keys, but if you break that up into two groups of 32, allowing two errors per group then you would only need to have 1984 encrypted keys.
Once you have decrypted the key information from each group then use that as input into decrypting key that you ultimately want.
Also, the keys acquired from each group must not be trivial in comparison to the key that you ultimately want. Do not simply break up a 256 bit key into 8 32bit key pieces. Doing this would allow someone that could decrypt 7 of those key pieces to attempt a bruteforce attack on the last piece. if you want access to a 256 bit key, then you must work with 256 bit keys for the whole procedure.
I don't have experience with hash tables outside of arrays/dictionaries in dynamic languages, so I recently found out that internally they're implemented by making a hash of the key and using that to store the value. What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key, instead of making a hash of it and storing that.
This is a near duplicate: Why do we use a hashcode in a hashtable instead of an index?
Long story short, you can check if a key is already stored VERY quickly, and equally rapidly store a new mapping. Otherwise you'd have to keep a sorted list of keys, which is much slower to store and retrieve mappings from.
what is hash table?
It is also known as hash map is a data structure used to implement an associative array.It is a structure that can map keys to values.
How it works?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
See the below diagram it clearly explains.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
The hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key
And how do you implement that?
Computers know only numbers. A hash table is a table, i.e. an array and when we get right down to it, an array can only addressed via an integral nonnegative index. Everything else is trickery. Dynamic languages that let you use string keys – they use trickery.
And one such trickery, and often the most elegant, is just computing a numerical, reproducible “hash” number of the key and using that as the index.
(There are other considerations such as compaction of the key range but that’s the foremost issue.)
In a nutshell: Hashing allows O(1) queries/inserts/deletes to the table. OTOH, a sorted structure (usually implemented as a balanced BST) makes the same operations take O(logn) time.
Why take a hash, you ask? How do you propose to store the key "as the key"? Ask yourself this, if you plan to store simply (key,value) pairs, how fast will your lookups/insertions/deletions be? Will you be running a O(n) loop over the entire array/list?
The whole point of having a hash value is that it allows all keys to be transformed into a finite set of hash values. This allows us to store keys in slots of a finite array (enabling fast operations - instead of searching the whole list you only search those keys that have the same hash value) even though the set of possible keys may be extremely large or infinite (e.g. keys can be strings, very large numbers, etc.) With a good hash function, very few keys will ever have the same hash values, and all operations are effectively O(1).
This will probably not make much sense if you are not familiar with hashing and how hashtables work. The best thing to do in that case is to consult the relevant chapter of a good algorithms/data structures book (I recommend CLRS).
The idea of a hash table is to provide a direct access to its items. So that is why the it calculates the "hash code" of the key and uses it to store the item, insted of the key itself.
The idea is to have only one hash code per key. Many times the hash function that generates the hash code is to divide a prime number and uses its remainer as the hash code.
For example, suppose you have a table with 13 positions, and an integer as the key, so you can use the following hash function
f(x) = x % 13
What I don't understand is why aren't
the values stored with the key
(string, number, whatever) as the,
well, key, instead of making a hash of
it and storing that.
Well, how do you propose to do that, with O(1) lookup?
The point of hashtables is basically to provide O(1) lookup by turning the key into an array index and then returning the content of the array at that index. To make that possible for arbitrary keys you need
A way to turn the key into an array index (this is the hash's purpose)
A way to deal with collisions (keys that have the same hash code)
A way to adjust the array size when it's too small (causing too many collisions) or too big (wasting space)
Generally the point of a hash table is to store some sparse value -- i.e. there is a large space of keys and a small number of things to store. Think about strings. There are an uncountable number of possible strings. If you are storing the variable names used in a program then there is a relatively small number of those possible strings that you are actually using, even though you don't know in advance what they are.
In some cases, it's possible that the key is very long or large, making it impractical to keep copies of these keys. Hashing them first allows for less memory usage as well as quicker lookup times.
A hashtable is used to store a set of values and their keys in a (for some amount of time) constant number of spots. In a simple case, let's say you wanted to save every integer from 0 to 10000 using the hash function of i % 10.
This would make a hashtable of 1000 blocks (often an array), each having a list 10 elements deep. So if you were to search for 1234, it would immediately know to search in the table entry for 123, then start comparing to find the exact match. Granted, this isn't much better than just using an array of 10000 elements, but it's just to demonstrate.
Hashtables are very useful for when you don't know exactly how many elements you'll have, but there will be a good number fewer collisions on the hash function than your total number of elements. (Which makes the hash function "hash(x) = 0" very, very bad.) You may have empty spots in your table, but ideally a majority of them will have some data.
The main advantage of using a hash for the purpose of finding items in the table, as opposed to using the original key of the key-value pair (which BTW, it typically stored in the table as well, since the hash is not reversible), is that..
...it allows mapping the whole namespace of the [original] keys to the relatively small namespace of the hash values, allowing the hash-table to provide O(1) performance for retrieving items.
This O(1) performance gets a bit eroded when considering the extra time to dealing with collisions and such, but on the whole the hash table is very fast for storing and retrieving items, as opposed to a system based solely on the [original] key value, which would then typically be O(log N), with for example a binary tree (although such tree is more efficient, space-wise)
Also consider speed. If your key is a string and your values are stored in an array, your hash can access any element in 'near' constant time. Compare that to searching for the string and its value.