What's the point of a hash table? - hashtable

I don't have experience with hash tables outside of arrays/dictionaries in dynamic languages, so I recently found out that internally they're implemented by making a hash of the key and using that to store the value. What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key, instead of making a hash of it and storing that.

This is a near duplicate: Why do we use a hashcode in a hashtable instead of an index?
Long story short, you can check if a key is already stored VERY quickly, and equally rapidly store a new mapping. Otherwise you'd have to keep a sorted list of keys, which is much slower to store and retrieve mappings from.

what is hash table?
It is also known as hash map is a data structure used to implement an associative array.It is a structure that can map keys to values.
How it works?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
See the below diagram it clearly explains.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
The hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.

What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key
And how do you implement that?
Computers know only numbers. A hash table is a table, i.e. an array and when we get right down to it, an array can only addressed via an integral nonnegative index. Everything else is trickery. Dynamic languages that let you use string keys – they use trickery.
And one such trickery, and often the most elegant, is just computing a numerical, reproducible “hash” number of the key and using that as the index.
(There are other considerations such as compaction of the key range but that’s the foremost issue.)

In a nutshell: Hashing allows O(1) queries/inserts/deletes to the table. OTOH, a sorted structure (usually implemented as a balanced BST) makes the same operations take O(logn) time.
Why take a hash, you ask? How do you propose to store the key "as the key"? Ask yourself this, if you plan to store simply (key,value) pairs, how fast will your lookups/insertions/deletions be? Will you be running a O(n) loop over the entire array/list?
The whole point of having a hash value is that it allows all keys to be transformed into a finite set of hash values. This allows us to store keys in slots of a finite array (enabling fast operations - instead of searching the whole list you only search those keys that have the same hash value) even though the set of possible keys may be extremely large or infinite (e.g. keys can be strings, very large numbers, etc.) With a good hash function, very few keys will ever have the same hash values, and all operations are effectively O(1).
This will probably not make much sense if you are not familiar with hashing and how hashtables work. The best thing to do in that case is to consult the relevant chapter of a good algorithms/data structures book (I recommend CLRS).

The idea of a hash table is to provide a direct access to its items. So that is why the it calculates the "hash code" of the key and uses it to store the item, insted of the key itself.
The idea is to have only one hash code per key. Many times the hash function that generates the hash code is to divide a prime number and uses its remainer as the hash code.
For example, suppose you have a table with 13 positions, and an integer as the key, so you can use the following hash function
f(x) = x % 13

What I don't understand is why aren't
the values stored with the key
(string, number, whatever) as the,
well, key, instead of making a hash of
it and storing that.
Well, how do you propose to do that, with O(1) lookup?
The point of hashtables is basically to provide O(1) lookup by turning the key into an array index and then returning the content of the array at that index. To make that possible for arbitrary keys you need
A way to turn the key into an array index (this is the hash's purpose)
A way to deal with collisions (keys that have the same hash code)
A way to adjust the array size when it's too small (causing too many collisions) or too big (wasting space)

Generally the point of a hash table is to store some sparse value -- i.e. there is a large space of keys and a small number of things to store. Think about strings. There are an uncountable number of possible strings. If you are storing the variable names used in a program then there is a relatively small number of those possible strings that you are actually using, even though you don't know in advance what they are.

In some cases, it's possible that the key is very long or large, making it impractical to keep copies of these keys. Hashing them first allows for less memory usage as well as quicker lookup times.

A hashtable is used to store a set of values and their keys in a (for some amount of time) constant number of spots. In a simple case, let's say you wanted to save every integer from 0 to 10000 using the hash function of i % 10.
This would make a hashtable of 1000 blocks (often an array), each having a list 10 elements deep. So if you were to search for 1234, it would immediately know to search in the table entry for 123, then start comparing to find the exact match. Granted, this isn't much better than just using an array of 10000 elements, but it's just to demonstrate.
Hashtables are very useful for when you don't know exactly how many elements you'll have, but there will be a good number fewer collisions on the hash function than your total number of elements. (Which makes the hash function "hash(x) = 0" very, very bad.) You may have empty spots in your table, but ideally a majority of them will have some data.

The main advantage of using a hash for the purpose of finding items in the table, as opposed to using the original key of the key-value pair (which BTW, it typically stored in the table as well, since the hash is not reversible), is that..
...it allows mapping the whole namespace of the [original] keys to the relatively small namespace of the hash values, allowing the hash-table to provide O(1) performance for retrieving items.
This O(1) performance gets a bit eroded when considering the extra time to dealing with collisions and such, but on the whole the hash table is very fast for storing and retrieving items, as opposed to a system based solely on the [original] key value, which would then typically be O(log N), with for example a binary tree (although such tree is more efficient, space-wise)

Also consider speed. If your key is a string and your values are stored in an array, your hash can access any element in 'near' constant time. Compare that to searching for the string and its value.

Related

Is a fixed partition key bad practice?

I have a DynamoDB table with:
Timestamp (HASH)
Text (String)
I want to be able to get the latest item via a query, but doing so requires that I sort by Timestamp rather than partition by it. I was considering doing this instead:
Partition (HASH, hard-coded as whatever)
Timestamp (RANGE)
Text (String)
That way I can query and pass a hard-coded partition in.
But is this bad practice?
It depends.
The main thing to consider is that partitions have a finite throughput for both reads and writes. This is independent from the provisioned throughput for the table. Partition throughput is constrained by the hard disk's read and write speeds. Remember that all items with the same hash value will live on the same partition and therefore will be written to the same disk (discounting replication).
So, it depends on your scale. It will work for a small scale, low throughput use case but it won't be able to scale beyond a single disk.
It is usually bad practice to use a single, hard coded value for your hash key. Rather than a hard coded hash key value, you should consider using year_month_day (or some variation) as your hash key for this use case. It's still not great, but it's much better than a single value.
If you do want to use hard coded hash key values, consider using multiple hard coded values to shard your data across partitions.

How can the insertion complexity for a hashtable be O(1)

I am creating a hashtable and inserting n elements in it from an unsorted array. As I am calling the hashfunction n times. Wouldn't the time complexity to create/insert the hashtable be O(n) ?
I tried searching everywhere, but they mention complexity in case of collisions, but don't cover how can I create a hashtable in O(1) in a perfect case as I will have to traverse the array in order to pick element one by one and put it in the hashtable?
When inserting a new record into a hash table, using a perfect hash function, an unused index entry (used to point to the record) will be found immediately giving O(1). Later, the record will be found immediately when searched for.
Of course, hash functions are seldom perfect. As the hash index starts to become populated the function will at times require two or more attempts to find an unused index entry to insert a new record and every later attempt to search for that record will also require two or more attempts before the correct index entry is found. So the actual search complexity of a hash table may end up as O(1.5) or more but that value is made up of searches where the record is most often found in the first attempt while others may require two or more.
I guess the trick is to find a hashing algorithm that is "good enough" which means a compromise between an index that isn't too big, an average complexity that is reasonably low and a worst case that is acceptable.
I posted on another search question here and showed how hashing could be used and a good hashing function be determined. The question required looking up a 64-bit value in a table containing 16000 records and replacing it with another value from the record. Computing the second value from the first was not possible. My algorithm sported an average search comnplexity of <1.5 with worst case of 14. The index size was a bit large for my taste but I think a more exhaustive search could have found a better one. In comparison, binary searches required, at best, about twice as many clock cycles to perform a lookup as the hash function did, probably because their time complexity was greater.

Hash Table Implementation - alternatives to collision detection

Other than collision detection and throwing a LinkedList in a hashtable, what are some other ways that a Hash Table can be implemented? Is collision detection the only way to achieve an efficient hash table?
Ultimately a finite sized hash table is going to have collisions, at least any generally programmed one. If your key is type string then the hash table has an infinite number of possible keys, but with a hash table, you have just a finite number of buckets. So fundamentally there has to be collisions. If you were to implement a hash table where it ignores collisions, then you would have a very strange, indeterministic data structure that would appear to remove elements at random.
Now, the data structure used on the backend doesn't have to be a linked list. You could implement it as a red-black tree and get log(n) performance out of a collision. You should checkout the article 5 Myths About Hash Tables and also this Stack Overflow question about HashMaps vs Maps.
Now, if you know something about you key type, say the key is a 2 character long string, then there are only a finite number of possible keys, you can then proceed to create a "hash" function that converts the key to a relatively small integer, you could create a look-up table that is guaranteed to not have collisions.
It is important to note that a well-implemented hash table will not suffer very much from collisions. There are bigger problems in the world like world hunger (or even how to implement an efficient hash function) than the computer having to traverse three nodes in a linked list once every 5 days.
Other than collision detection and throwing a LinkedList in a hashtable, what are some other ways that a Hash Table can be implemented?
Other ways include:
having another container type linked from the nodes where elements have collided, such as a balanced binary tree or vector/array
GCC's hash table underpinning std::unordered_X uses a single singly-linked list of values, and a contiguous array of buckets container iterators into the list; that's got some great characteristics including optimal iteration speed regardless of the current load_factor()
using open addressing / closed hashing, which - when an insert/find/erase finds another key in the bucket it has hashed to, uses some algorithm to find another bucket to look in instead (and so on until it finds the key, a deleted element it can insert over, or an unused bucket); there are a number of options for this kind of "probing", the simplest being a try-the-next-bucket approach, another being quadratic 1, 4, 9, 16..., another the use of alternative hash functions.
perfect hash functions (below)
Is collision detection the only way to achieve an efficient hash table?
sometimes it's possible to find a perfect hash function that won't have collisions, but that's generally only true for very limited input sets, whether due to the nature of the inputs (e.g. month and year of birth of living people only has order-of a thousand possible values), or because a small number are known at compile time (e.g. a set of 200 keywords for a compiler).

hash table lookup time

When we are insert/lookup an key in a hash table, textbook said it is O(1) time. Yet, how is possible to have an O(1) lookup time? If the hash table store the key in a vector, it will cost O(N), if in a binary tree, it will be O(logN). I just can't image some data structure with O(1) accessing time.
Thanks!
The hashtable hashes your key and put it in array.
For example, hash(x) = 3, where x is your key. The table then puts it into array[3]. Accessing from array is O(1).
At a minimum, hash tables consist of an array and hash function. When an object is added to the table, the hash function is computed on the object and it is stored in the array at the index of the corresponding value that was computed. e.g., if hash(obj) = 2 then arr[2] = obj.
The average insert/lookup on a hash table is O(1).
However it is possible to have collisions when objects compute the same hash value.
In the general case there are "buckets" at each index of the array to handle these collisions. Meaning, all three objects are stored in some other data structure (maybe a linked list or another array) at the index of the hash table.
Therefore, the worst case for lookup on a hash table is O(n) because it is possible that all objects stored in the hash table have collided and are stored in the same bucket.
Technically speaking, hash table lookup, if there is no collision, is O(logn). This is because hashing time is linear with respect to the size (in bytes) of the identifier, and the smallest that a new identifier added to the hash table can be, for that identifier to be unique, is O(logn).
However, the log of all the computer memory in the world is just such a small number, which means that we have very good upper bounds on hash table identifier size. Case in point, log10 of the number of particles in the observable universe is estimated at slightly over 80; in log2 it's about 3.3 times as much. Logarithms grow very slowly.
As a result, most log terms can be treated as constant terms. It's just that traditionally we only apply this fact to hash tables, but not to search trees, in order to teach recurrence relations to students.

How to create an efficient static hash table?

I need to create small-mid sized static hash tables from it. Typically, those will have 5-100 entries. When the hash table is created, all keys hashes are known up-front (i.e. the keys are already hashes.) Currently, I create a HashMap, which is I sort the keys so I get O(log n) lookup which 3-5 lookups on average for the sizes I care. Wikipedia claims that a simple hash table with chaining will result in 3 lookups on average for a full table, so that's not yet worth the trouble for me (i.e. taking hash%n as the first entry and doing the chaining.) Given that I know all hashes up-front, it seems to be that there should be an easy way to get a fast, static perfect hash -- but I couldn't find a good pointer how. I.e. amortized O(1) access with no (little?) additional overhead. How should I implement such a static table?
Memory usage is important, so the less I need to store, the better.
Edit: Notice that it's fine if I have have to resolve one collision or so manually. I.e. if I could do some chaining which on average has direct access and worst-case 3 indirections for instance, that's fine. It's not that I need a perfect hash.
For c or c++ you can use gperf
GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
GNU gperf is highly customizable. There are options for generating C or C++ code, for emitting switch statements or nested ifs instead of a hash table, and for tuning the algorithm employed by gperf.
Small hashes are also possible in C without an external lib using the pre-processor, for example:
swich (hash_string(*p))
{
case HASH_S16("test"):
...
break;
case HASH_S256("An example with a long text!!!!!!!!!!!!!!!!!"):
...
break;
}
Have a look for the code # http://www.heeden.nl/statichashc.htm
You can use Sux4j to generate a minimal perfect hash in Java or C++. (I'm not sure you are using Java, but you mentioned HashMap, so I'm assuming.) For C, you can use the cmph library.

Resources