Hash Table Implementation - alternatives to collision detection - hashtable

Other than collision detection and throwing a LinkedList in a hashtable, what are some other ways that a Hash Table can be implemented? Is collision detection the only way to achieve an efficient hash table?

Ultimately a finite sized hash table is going to have collisions, at least any generally programmed one. If your key is type string then the hash table has an infinite number of possible keys, but with a hash table, you have just a finite number of buckets. So fundamentally there has to be collisions. If you were to implement a hash table where it ignores collisions, then you would have a very strange, indeterministic data structure that would appear to remove elements at random.
Now, the data structure used on the backend doesn't have to be a linked list. You could implement it as a red-black tree and get log(n) performance out of a collision. You should checkout the article 5 Myths About Hash Tables and also this Stack Overflow question about HashMaps vs Maps.
Now, if you know something about you key type, say the key is a 2 character long string, then there are only a finite number of possible keys, you can then proceed to create a "hash" function that converts the key to a relatively small integer, you could create a look-up table that is guaranteed to not have collisions.
It is important to note that a well-implemented hash table will not suffer very much from collisions. There are bigger problems in the world like world hunger (or even how to implement an efficient hash function) than the computer having to traverse three nodes in a linked list once every 5 days.

Other than collision detection and throwing a LinkedList in a hashtable, what are some other ways that a Hash Table can be implemented?
Other ways include:
having another container type linked from the nodes where elements have collided, such as a balanced binary tree or vector/array
GCC's hash table underpinning std::unordered_X uses a single singly-linked list of values, and a contiguous array of buckets container iterators into the list; that's got some great characteristics including optimal iteration speed regardless of the current load_factor()
using open addressing / closed hashing, which - when an insert/find/erase finds another key in the bucket it has hashed to, uses some algorithm to find another bucket to look in instead (and so on until it finds the key, a deleted element it can insert over, or an unused bucket); there are a number of options for this kind of "probing", the simplest being a try-the-next-bucket approach, another being quadratic 1, 4, 9, 16..., another the use of alternative hash functions.
perfect hash functions (below)
Is collision detection the only way to achieve an efficient hash table?
sometimes it's possible to find a perfect hash function that won't have collisions, but that's generally only true for very limited input sets, whether due to the nature of the inputs (e.g. month and year of birth of living people only has order-of a thousand possible values), or because a small number are known at compile time (e.g. a set of 200 keywords for a compiler).

Related

Using hash as a bucket index, modulo vs bit-mask

I've been looking into hash tables, where some data is hashed and that is used for a bucket index.
Some libraries use the modulo of the hash with the bucket size, and others use a bit-mask.
Where only the bits used by the bucket mask are used (ensuring the range is not exceeded).
bitmask:
index = h->hash_func(key) & h->hash_mask;
modulo:
index = h->hash_func(key) % h->bucket_tot;
While there are obvious differences between the two, such as bucket size constraints with bit-masks, ensuring hashing gives good distribution on lower bits, speed of modulo... etc.
Are there strong reasons to choose one over another?
(I'll probably try & benchmark for my own use-case, but curious whats already known on the matter).
Note, this is simply for key:value store, (dictionary/hash/associative-array) and not security related.
Example of a dynamic resizing, chaining hash table implementation using bit-mask:
https://github.com/amadvance/tommyds/blob/master/tommyds/tommyhashdyn.c
https://github.com/GNOME/glib/blob/master/glib/ghash.c
Example using modulo:
https://www.daniweb.com/software-development/c/threads/104887/sucinct-example-of-hash-table-w-chaining
You mentioned "bucket" index so I assume you mean hash tables with separate chaining as collision resolution, in this case there is no reasons for using modulo or bit mask "stronger" that you mentioned (which BTW not so obvious, as you said).
In some languages, most notably Java/JVM-based, array index is positive signed 32-bit integer, thus maximum array size for bit mask is 2^30, that could be insufficient and a strong reason to use no-power-of-two table size and modulo, with which you can approach 2^31-1(max possible signed 32-bit integer) very closely. But since you used C++ syntax this shouldn't be a concern for you.
Also, if you meant not only separate chaining, some open addressing collision resolution algorithms require table size to meet certain conditions, for example, if you implement double hashing, table size should be prime. In this case you obviously should use only modulo to obtain the initial index in the table.
It isn't always just about performance either, sometimes it's about the domain of your problem. You may, for example, have a mask that wishes to hash negative numbers as well. With modulo you have to write special cases to handle them, not so with a bitmask.

Analyzing goals and choosing a good hash function

This isn’t a specific question with a specific solution; but it’s rather a response to the fact that I can’t find any good Stack Overflow qestions about how to choose a good a hashing function for hash tables and similar tasks.
So! Let’s talk hash functions, and how to choose one. How should a programming noob, who needs to choose a good hash function for their specific task, go about choosing one? When is the simple and quick Fowler-Noll-Vo appropriate? When should they vendor in MurmurHash3 instead? Do you have any links to good resources on comparing the various options?
The hash function for hash tables should have these two properties
Uniformity all outputs of H() should be evenly distributed as much as possible. In other words the for 32-bit hash function the probability for every output should be equal to 1/2^32. (for n-bit it should be 1/2^n). With uniform hash function the chance of collision is minimized to lowest possible for any possible input.
Low computational cost Hash functions for tables are expected to be FAST, compared to cryptographic hash functions where speed is traded for preimage resistance (eg it is hard to find the message from given hash value) and collision resistance.
For purposes of hash tables all cryptographic functions are BAD choice, since the computational cost is enormous. Because hashing here is used not for security but for fast access. MurmurHash is considered one of the fastest and uniform functions suitable for big hash tables or hash indexes. For small tables a trivial hash function should be OK. A trivial hash is where we mix values of object (by multiplication, addition and subtraction with some prime).
If your hash keys are strings (or other variable-length data) you might look at this paper by Ramakrishna and Zobel. They benchmark a few classes of hashing functions (for speed and low collisions) and exhibit a class that is better than the usual Bernstein hashes.

Hash tables v self-balancing search trees

I am curious to know what is the reasoning that could overweighs towards using a self-balancing tree technique to store items than using a hash table.
I see that hash tables cannot maintain the insertion-order, but I could always use a linked list on top to store the insertion-order sequence.
I see that for small number of values, there is an added cost of of the hash-function, but I could always save the hash-function together with the key for faster lookups.
I understand that hash tables are difficult to implement than the straight-forward implementation of a red-black tree, but in a practical implementation wouldn't one be willing to go an extra mile for the trouble?
I see that with hash tables it is normal for collisions to occur, but with open-addressing techniques like double hashing that allow to save the keys in the hash table itself, hasn't the problem been reduced to the effect of not tipping the favor towards red black trees for such implementations?
I am curious if I am strictly missing a disadvantage of hash table that still makes red black trees quite viable data structure in practical applications (like filesystems, etc.).
Here is what I can think of:
There are kinds of data which cannot be hashed (or is too expensive to hash), therefore cannot be stored in hash tables.
Trees keep data in the order you need (sorted), not insertion order. You can't (effectively) do that with hash table, even if you run a linked list through it.
Trees have better worst-case performace
Storage allocation is another consideration. Every time you fill all of the buckets in a hash-table, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. On the other hand, balanced trees don't suffer from this issue at all.
Just wanted to add :
Balanced binary trees have a predictable time of fetching a data [log n] independent of the type of data. Many times that may be important for your application to estimate the response times for your application. [hash tables may have unpredictable response times]. Remember for smaller n's as in most common use cases the difference in performance in an in-memory look up is hardly going to matter and the bottle neck of the system is going to be elsewhere and sometimes you just want to make the system much simpler to debug and analyze.
Trees are generally more memory efficient compared to hash tables and much simpler to implement without any analysis on the distribution of input keys and possible collisions etc.
In my humble opinion, self-balancing trees work pretty well as Academic topics. And I
do not know anything that can be qualified as a "straight-forward implementation of a
red-black tree".
In the real world, the memory wall makes them far less efficient than they are on paper.
With this in mind, hash tables are decent alternatives, especially if you don't practice
them the Academic style (forget about the table size constraint and you magically resolve
the table resize issue and almost all collision issues).
In a word: keep it simple. If that's simple for you then that's simple for your computer.
I think if you want to query for a range of keys instead of one key, self balanced tree structure will perform better than a hash table structure.
A few reasons I can think of:
Trees are dynamic (the space complexity is N), whereas hash tables are often implemented as arrays which are fixed size, which means they will often be initialized with K size, where K > N, so even if you only have 1 element in a hashmap, you might still have 100 empty slots that take up memory. Another effect of this is:
Increasing the size of an array-based hash table is costly (O(N) average time, O(N log N) worst case), whereas trees can grow in constant time (O(1)) + (time to locate insertion point (O(log N))
Elements in a tree can be gathered in sorted order (using ex: in-order-traversal). Thereby you often get a sorted list as a free perk with trees.
Trees can have a better worst-case performance vs a hashmap depending on how the hashmap is implemented (ex: hashmap with chaining will have O(N) worst case, whereas self-balanced trees can guarantee O(log N) worst case for all operations).
Both self-balanced trees and hashmaps have a worst-case efficiency of O(log N) in the best worst-case (assuming that the hashmap does handle colissions), but Hashmaps can have a better average-case performance (often close to O(1)), whereas Trees will have a constant O(log N). This is because even thou a hashmap can locate the insertion index in O(1), it has to account for hash colissions (more than one element hashing to the same array index), and thus in the best case degrades to a self-balanced tree (such as the Java implementation of hashmap), that is, each element in the hashmap can be implemented as a self-balanced tree, storing all elements which has hashed to the given array cell.

A couple of questions about Hash Tables

I've been reading a lot about Hash Tables and how to implement on in C and I think I have almost all the concepts in my head so I can start to code my own, I just have a couple of questions that I have yet to properly understand.
As a reference, I've been reading this:
http://eternallyconfuzzled.com/jsw_home.aspx
1) As I've read on the site above, a power of two or a prime number is recommended for the Hash Table size. This is basically an array and an array has a fixed size so I can quickly look up for the value I'm looking for. I can't declare a small array if I have a large input as it won't fit and I can't declare a very large array if my input data is not that large cause it's wasted memory.
What is the optimum size for the Hash Table? What should I base my decision on?
2) Also, on that site, there's a couple of hashing functions which I have yet to read them all. It also states that it's always best to use a good known algorithm and to roll my own. And I might do just that, I'll pick one from that site and test it out on my code and see if it minimizes collisions based on my input data.
What's bugging me is how I control the hash range? The hash can't return and integer larger than the Hash Table size or we'll have a serious problem. How do I deal with this?
1) What you are referring to is the load factor of the hash table - the percentage of buckets that are expected to be filled. Wikipedia has this to say:
With a good hash function, the average
lookup cost is nearly constant as the
load factor increases from 0 up to 0.7
or so. Beyond that point, the
probability of collisions and the cost
of handling them increases.
I believe the Java implementation (and probably others) resizes periodically to keep the load factor within an acceptable range.
2) Just use the modulo operator (%) to keep the bucket index legal. The second operator should be the size of your bucket array.
Pick a small size for your hash table. As you add stuff to your table, check to see what percentage of the table is being used; when it is greater than 70% full, make the table bigger. This also holds true as you remove elements-- make the table smaller when it is less than 60% full, for instance. Wikipedia has a good description of some strategies for dynamic resizing, but that's the general idea.
I only say this because you seem to have known input data:
If you know the rough order of magnitude of the amount of data you will be storing in the hash table, it's generally good enough to just create a table about that big. (You shouldn't worry about whether everything will fit. Instead, the right thing to think about is how many collisions you will have and how you will handle them.)
As for the right hash function, it's possible that the structure of your input will suggest which one will be correct. For instance, what aspects of your input are likely to be evenly distributed?

What's the point of a hash table?

I don't have experience with hash tables outside of arrays/dictionaries in dynamic languages, so I recently found out that internally they're implemented by making a hash of the key and using that to store the value. What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key, instead of making a hash of it and storing that.
This is a near duplicate: Why do we use a hashcode in a hashtable instead of an index?
Long story short, you can check if a key is already stored VERY quickly, and equally rapidly store a new mapping. Otherwise you'd have to keep a sorted list of keys, which is much slower to store and retrieve mappings from.
what is hash table?
It is also known as hash map is a data structure used to implement an associative array.It is a structure that can map keys to values.
How it works?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
See the below diagram it clearly explains.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
The hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key
And how do you implement that?
Computers know only numbers. A hash table is a table, i.e. an array and when we get right down to it, an array can only addressed via an integral nonnegative index. Everything else is trickery. Dynamic languages that let you use string keys – they use trickery.
And one such trickery, and often the most elegant, is just computing a numerical, reproducible “hash” number of the key and using that as the index.
(There are other considerations such as compaction of the key range but that’s the foremost issue.)
In a nutshell: Hashing allows O(1) queries/inserts/deletes to the table. OTOH, a sorted structure (usually implemented as a balanced BST) makes the same operations take O(logn) time.
Why take a hash, you ask? How do you propose to store the key "as the key"? Ask yourself this, if you plan to store simply (key,value) pairs, how fast will your lookups/insertions/deletions be? Will you be running a O(n) loop over the entire array/list?
The whole point of having a hash value is that it allows all keys to be transformed into a finite set of hash values. This allows us to store keys in slots of a finite array (enabling fast operations - instead of searching the whole list you only search those keys that have the same hash value) even though the set of possible keys may be extremely large or infinite (e.g. keys can be strings, very large numbers, etc.) With a good hash function, very few keys will ever have the same hash values, and all operations are effectively O(1).
This will probably not make much sense if you are not familiar with hashing and how hashtables work. The best thing to do in that case is to consult the relevant chapter of a good algorithms/data structures book (I recommend CLRS).
The idea of a hash table is to provide a direct access to its items. So that is why the it calculates the "hash code" of the key and uses it to store the item, insted of the key itself.
The idea is to have only one hash code per key. Many times the hash function that generates the hash code is to divide a prime number and uses its remainer as the hash code.
For example, suppose you have a table with 13 positions, and an integer as the key, so you can use the following hash function
f(x) = x % 13
What I don't understand is why aren't
the values stored with the key
(string, number, whatever) as the,
well, key, instead of making a hash of
it and storing that.
Well, how do you propose to do that, with O(1) lookup?
The point of hashtables is basically to provide O(1) lookup by turning the key into an array index and then returning the content of the array at that index. To make that possible for arbitrary keys you need
A way to turn the key into an array index (this is the hash's purpose)
A way to deal with collisions (keys that have the same hash code)
A way to adjust the array size when it's too small (causing too many collisions) or too big (wasting space)
Generally the point of a hash table is to store some sparse value -- i.e. there is a large space of keys and a small number of things to store. Think about strings. There are an uncountable number of possible strings. If you are storing the variable names used in a program then there is a relatively small number of those possible strings that you are actually using, even though you don't know in advance what they are.
In some cases, it's possible that the key is very long or large, making it impractical to keep copies of these keys. Hashing them first allows for less memory usage as well as quicker lookup times.
A hashtable is used to store a set of values and their keys in a (for some amount of time) constant number of spots. In a simple case, let's say you wanted to save every integer from 0 to 10000 using the hash function of i % 10.
This would make a hashtable of 1000 blocks (often an array), each having a list 10 elements deep. So if you were to search for 1234, it would immediately know to search in the table entry for 123, then start comparing to find the exact match. Granted, this isn't much better than just using an array of 10000 elements, but it's just to demonstrate.
Hashtables are very useful for when you don't know exactly how many elements you'll have, but there will be a good number fewer collisions on the hash function than your total number of elements. (Which makes the hash function "hash(x) = 0" very, very bad.) You may have empty spots in your table, but ideally a majority of them will have some data.
The main advantage of using a hash for the purpose of finding items in the table, as opposed to using the original key of the key-value pair (which BTW, it typically stored in the table as well, since the hash is not reversible), is that..
...it allows mapping the whole namespace of the [original] keys to the relatively small namespace of the hash values, allowing the hash-table to provide O(1) performance for retrieving items.
This O(1) performance gets a bit eroded when considering the extra time to dealing with collisions and such, but on the whole the hash table is very fast for storing and retrieving items, as opposed to a system based solely on the [original] key value, which would then typically be O(log N), with for example a binary tree (although such tree is more efficient, space-wise)
Also consider speed. If your key is a string and your values are stored in an array, your hash can access any element in 'near' constant time. Compare that to searching for the string and its value.

Resources