Analyzing goals and choosing a good hash function - hashtable

This isn’t a specific question with a specific solution; rather, it’s a response to the fact that I can’t find any good Stack Overflow questions about how to choose a good hashing function for hash tables and similar tasks.
So! Let’s talk hash functions, and how to choose one. How should a programming noob, who needs to choose a good hash function for their specific task, go about choosing one? When is the simple and quick Fowler-Noll-Vo appropriate? When should they vendor in MurmurHash3 instead? Do you have any links to good resources on comparing the various options?

A hash function for hash tables should have these two properties:
Uniformity: all outputs of H() should be distributed as evenly as possible. In other words, for a 32-bit hash function the probability of every output should equal 1/2^32 (for an n-bit function, 1/2^n). With a uniform hash function, the chance of collision is the lowest possible for any input.
Low computational cost: hash functions for tables are expected to be FAST, unlike cryptographic hash functions, where speed is traded for preimage resistance (e.g. it is hard to recover the message from a given hash value) and collision resistance.
For hash-table purposes, all cryptographic functions are a BAD choice, since their computational cost is enormous; hashing here is used for fast access, not for security. MurmurHash is considered one of the fastest and most uniform functions, suitable for big hash tables or hash indexes. For small tables a trivial hash function should be OK. A trivial hash mixes the values of the object (by multiplication, addition and subtraction with some prime).
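To make both styles concrete, here is a minimal Java sketch of the FNV-1a hash mentioned in the question and a trivial prime-mixing hash; the field values passed to the trivial hash are hypothetical:

```java
public class HashSketch {
    // FNV-1a, 32-bit: XOR each byte into the hash, then multiply by the FNV prime.
    static int fnv1a32(byte[] data) {
        int hash = 0x811C9DC5;            // FNV offset basis
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;           // FNV prime (16777619)
        }
        return hash;
    }

    // A "trivial" hash for small tables: mix the object's values with a prime.
    static int trivialHash(int a, int b, int c) {
        int hash = 17;
        hash = 31 * hash + a;
        hash = 31 * hash + b;
        hash = 31 * hash + c;
        return hash;
    }

    public static void main(String[] args) {
        System.out.printf("%08x%n", fnv1a32("hello".getBytes()));
        System.out.println(trivialHash(1, 2, 3));
    }
}
```

Both run in a handful of integer operations per field or byte, which is the whole point for table lookups.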

If your hash keys are strings (or other variable-length data) you might look at this paper by Ramakrishna and Zobel. They benchmark a few classes of hashing functions (for speed and low collisions) and exhibit a class that is better than the usual Bernstein hashes.

Related

What are some ways to prevent deliberate malicious attacks against hash function implementations?

Say you have some software server that uses hash functions and some external source wants to exploit that and it keeps attacking the server using keys that they know (or with high probability) will result in collisions. How would you prevent this in practice?
I think one way is to choose the hash function randomly at the beginning of the problem, but this method seems slow in the sense that every time you change hash functions you have to rehash everything.
As you obviously realise, the best defence is to make sure attackers don't know what your hash function will produce, and ideally not your bucket count either. (If the hash function is strong, hard to reverse and produces a large range of outputs, such as 64-bit unsigned integers, then finding two keys that produce the same hash may be time-consuming; but finding a key that hashes to a specific bucket after modding by N only takes on average N attempts with random, distinct keys.)
choose the hash function randomly at the beginning of the problem, but this method seems slow in the sense that every time you change hash functions you have to rehash everything.
There's not necessarily a need to repeatedly change the hash function... you just need to make it unguessable based on exposed data/code and observable behaviours. For example, you might generate a random seed value on your server, write that to a secure file somewhere, and use it as a seed for your hash function (or if your hash function doesn't support a seed value, just XOR the hash output with the random value). Even if someone knows your hash function, if they don't know the seed then they can't engineer collisions.
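A minimal Java sketch of that idea, assuming the underlying hash function has no built-in seed parameter (so the XOR fallback described above is used):

```java
import java.security.SecureRandom;

public class SeededHash {
    // Generated once at start-up; in practice you would persist this to a
    // secure file, as described above, so it survives restarts.
    private final int seed;

    SeededHash() {
        this.seed = new SecureRandom().nextInt();
    }

    // The hash function takes no seed, so XOR its output with the secret seed.
    int hash(Object key) {
        return key.hashCode() ^ seed;
    }
}
```

Two deployments with different seeds map the same key to different buckets, so collision patterns observed against one server tell an attacker nothing portable about another.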
You could also count the collisions a particular client has had, and if it's obviously malicious - disconnect them and remove their keys.

What are the default hash functions used by programming languages for dictionaries/associative arrays?

So I was curious when I got to know that dictionaries or associative arrays are usually implemented with hash tables. Reading about hash tables, I stumbled upon hash functions and learned there are various ones such as MD5, MD6, SHA-1, etc. What I was unable to find was which hash functions are used by programming languages such as Python, C++, or Java.
Those are.. not the same kind of 'hash function' D:
For hashtable hash functions, code must compute an appropriate hash based on object-data such that it conforms to equality requirements. It should also be "well distributed" and "fast". Most hashtable hashes are thus often 32-bit values using some form of a rolling/shifting computation. At the end of the day this hash is used to select from a much smaller pool of buckets.
Hashtable hashes are usually computed directly by (or with knowledge of) the objects being added to the hashtable; that is, cryptographic hash functions are generally not involved. A typical Java hashCode() method, defined on the object being added to the hashtable, might look like:
@Override
public int hashCode() {
    int hash = 7;
    hash = 31 * hash + (int) int_field;
    hash = 31 * hash + (str_field == null ? 0 : str_field.hashCode());
    // ...and so on for the remaining fields
    return hash;
}
There are discussions elsewhere on the choice of seed and multiplier values, but the takeaway should be that most hashtable hash functions 1) derive directly from object state, applying 'tweaks' as prudent, and 2) are not designed to be "secure".
(Modern hashtable implementations often apply a "mixing function" to the generated hash value to mitigate degenerate hash function results and/or data poisoning attacks.)
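One concrete instance of such a mixing function is the spread step in OpenJDK's HashMap, which folds the high 16 bits of a hashCode into the low bits before bucket selection (since bucket selection only uses the low bits). A small demonstration:

```java
public class MixDemo {
    // The mixing step used by OpenJDK's HashMap before bucket indexing.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        // Two hashes that differ only above bit 15 would land in the same
        // bucket of a 16-bucket table without mixing; with mixing they separate.
        int a = 0x10000;
        int b = 0x00000;
        int buckets = 16;
        System.out.println((spread(a) & (buckets - 1)) + " vs "
                         + (spread(b) & (buckets - 1)));
    }
}
```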
On the other hand, a cryptographic hash is designed to satisfy much stronger requirements and has a much larger output space. While such a strong hash can be used for hashtables (after being derived from an object and then distilled down to a hash bucket), it is also slower to generate and usually unnecessary in the context of a hash/dictionary.
Cryptographic hashes generally work on an arbitrary chunk of data or byte stream.
Hashtable hash desirable characteristics:
Deterministic
Uniform distribution / avoidance of clustering
Speed, speed, speed
Cryptographic hashes have additional characteristics, beyond that of a hashtable hash:
Infeasible to generate a message from its hash value
Infeasible to find two different messages with the same hash value
(While cryptographic hashes should also be fast, speed is largely secondary to the additional requirements.)
Programming languages support a wide range of cryptographic hash functions through their standard libraries and/or third-party libraries. A well-known hash (e.g. MD5/SHA-x) will generally have universal support, while something more specialized (e.g. MD6) may require additional effort to locate an implementation of.
On the other hand, as shown above, many hash table 'functions' are implemented directly on the object(s) involved in a hashtable, following a standard pattern, with some languages (and IDEs) providing assistance to reduce manual coding. As an example, C# provides a default reflection-based GetHashCode implementation for struct types.

Hash Table Implementation - alternatives to collision detection

Other than collision detection and throwing a LinkedList in a hashtable, what are some other ways that a Hash Table can be implemented? Is collision detection the only way to achieve an efficient hash table?
Ultimately, any finite-sized, general-purpose hash table is going to have collisions. If your key type is string, there are infinitely many possible keys, but the table has only a finite number of buckets, so fundamentally there must be collisions. If you implemented a hash table that ignored collisions, you would have a very strange, nondeterministic data structure that would appear to remove elements at random.
Now, the data structure used on the backend doesn't have to be a linked list. You could implement it as a red-black tree and get O(log n) performance out of a collision. You should check out the article 5 Myths About Hash Tables and also this Stack Overflow question about HashMaps vs Maps.
Now, if you know something about your key type, say the key is a two-character string, then there are only a finite number of possible keys. You can then create a "hash" function that converts the key to a relatively small integer and build a lookup table that is guaranteed to have no collisions.
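That two-character case can be made concrete; a sketch assuming keys of exactly two lowercase ASCII letters, so the "hash" is a collision-free direct index:

```java
public class TwoCharIndex {
    // Keys of exactly two lowercase letters: only 26 * 26 = 676 possibilities,
    // so each key maps to its own slot with no collisions.
    static int index(String key) {
        return (key.charAt(0) - 'a') * 26 + (key.charAt(1) - 'a');
    }

    public static void main(String[] args) {
        String[] table = new String[26 * 26]; // one slot per possible key
        table[index("hi")] = "some value";
        System.out.println(table[index("hi")]);
    }
}
```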
It is important to note that a well-implemented hash table will not suffer very much from collisions. There are bigger problems in the world like world hunger (or even how to implement an efficient hash function) than the computer having to traverse three nodes in a linked list once every 5 days.
Other than collision detection and throwing a LinkedList in a hashtable, what are some other ways that a Hash Table can be implemented?
Other ways include:
having another container type linked from the nodes where elements have collided, such as a balanced binary tree or vector/array
GCC's hash table underpinning std::unordered_X uses a single singly-linked list of values, and a contiguous array of buckets containing iterators into the list; that's got some great characteristics including optimal iteration speed regardless of the current load_factor()
using open addressing / closed hashing, which, when an insert/find/erase finds another key in the bucket it hashed to, uses some algorithm to find another bucket to look in instead (and so on until it finds the key, a deleted element it can insert over, or an unused bucket); there are a number of options for this kind of "probing", the simplest being a try-the-next-bucket approach, another being quadratic (1, 4, 9, 16...), another the use of alternative hash functions.
perfect hash functions (below)
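The open-addressing option above can be sketched with the simplest, try-the-next-bucket (linear) probing. This is a minimal Java illustration that ignores deletion and resizing, and assumes fewer entries than buckets:

```java
public class LinearProbe {
    static final int N = 8;                  // bucket count (kept small for illustration)
    String[] keys = new String[N];
    int[] vals = new int[N];

    void put(String key, int val) {
        int i = (key.hashCode() & 0x7FFFFFFF) % N;
        while (keys[i] != null && !keys[i].equals(key))
            i = (i + 1) % N;                 // occupied by another key: try the next bucket
        keys[i] = key;
        vals[i] = val;
    }

    Integer get(String key) {
        int i = (key.hashCode() & 0x7FFFFFFF) % N;
        while (keys[i] != null) {
            if (keys[i].equals(key)) return vals[i];
            i = (i + 1) % N;                 // keep probing until an empty bucket
        }
        return null;                         // hit an empty bucket: key is absent
    }
}
```

A real implementation would track the load factor and resize before the table fills, since `put` as written loops forever on a full table.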
Is collision detection the only way to achieve an efficient hash table?
Sometimes it's possible to find a perfect hash function that won't have collisions, but that's generally only true for very limited input sets, whether due to the nature of the inputs (e.g. month and year of birth of living people only has on the order of a thousand possible values), or because a small number are known at compile time (e.g. a set of 200 keywords for a compiler).
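The month-and-year example can be made concrete: since the input space is tiny and fully known, the perfect "hash" is plain arithmetic (the 1900-2030 range here is an illustrative assumption):

```java
public class BirthIndex {
    // Month/year of birth for living people spans roughly 1900-2030:
    // about 131 * 12 = 1572 slots, so a collision-free index is just arithmetic.
    static int index(int year, int month) {   // month is 1-12
        return (year - 1900) * 12 + (month - 1);
    }

    public static void main(String[] args) {
        System.out.println(index(1985, 6));   // each (year, month) gets a unique slot
    }
}
```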

"Scrambler" functions? Like random number generators

I am interested in a modification of the usual idea of random number generators. Typical generators produce a long stream of reasonably independent, uniformly distributed numbers from a single seed, used repeatedly.
However, for my purpose, I want a way of generating a "random number" from another number (actually from a grid of integers) in a way that is "independent", in the sense that knowing the outputs for nearby points doesn't help you predict the value at your point.
In practice, using traditional random number generators works reasonably well, but I'd be interested in any work that was actually done for this purpose.
It sounds like you are looking for a cryptographic hash function.
The ideal cryptographic hash function has four main properties:
it is easy to compute the hash value for any given message
it is infeasible to generate a message that has a given hash
it is infeasible to modify a message without changing the hash
it is infeasible to find two different messages with the same hash
Some commonly used hash functions are SHA-1 and SHA-512. One called MD5 is still being used even though it has been shown to be insecure.
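A minimal Java sketch of that suggestion: feed the grid coordinates (plus a fixed seed) into SHA-256 and take the first four digest bytes as the point's value. Because the hash is infeasible to predict, knowing the values at neighbouring points doesn't help at your point. The seed and the coordinate packing are illustrative choices:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class GridNoise {
    // Deterministic "random" 32-bit value for grid point (x, y).
    static int valueAt(long seed, int x, int y) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(
                ByteBuffer.allocate(16).putLong(seed).putInt(x).putInt(y).array());
            return ByteBuffer.wrap(digest).getInt();   // first 4 digest bytes
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);             // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(valueAt(42L, 10, 10));
        System.out.println(valueAt(42L, 10, 11));      // neighbouring point, unrelated value
    }
}
```

Each point's value is fixed for a given seed (so the grid is reproducible), yet nearby points give statistically unrelated outputs.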

Short (6 bit) cryptographic keyed hash

I have to implement a simple hashing algorithm.
Input data:
Value (16-bit integer).
Key (any length).
Output data:
6-bit hash (number 0-63).
Requirements:
It should be practically impossible to predict the hash value if you have only the input value but not the key. More specifically: if I know hash(x) for all x < M, it should be hard to predict hash(M) without knowing the key.
Possible solutions:
Keep the full mapping as the key. Then the key has length 2^16 * 6 bits. That's too long for my case.
A linear code, where the key is a generator matrix. Its length is 16 * 6 bits. But it's easy to recover the generator matrix from several known hash values.
Are there any other possibilities?
An HMAC seems to be what you want. So one possibility would be to use a SHA-based HMAC and just take a substring of the resulting hash. This should be relatively safe, since the bits of a cryptographic hash should be as independent and unpredictable as possible.
Depending on your environment, this could however take too much processing time, so you might have to choose a simpler hashing scheme to construct your HMAC.
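A sketch of the truncated-HMAC idea using the JDK's javax.crypto.Mac (HmacSHA256 chosen for illustration; the key handling is a placeholder):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ShortHash {
    // 6-bit keyed hash: HMAC the 16-bit value, keep the low 6 bits of byte 0.
    static int hash6(byte[] key, int value) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] msg = { (byte) (value >>> 8), (byte) value }; // 16-bit input
            byte[] tag = mac.doFinal(msg);
            return tag[0] & 0x3F;                                // range 0..63
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hash6("secret key bytes".getBytes(), 12345));
    }
}
```

Truncation is fine here: each bit of an HMAC tag is individually unpredictable without the key, so keeping only six of them loses nothing but (unavoidably) makes brute-force collisions trivial.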
Original answer that the discussion in the comments was based on:
Since you can forget cryptographic properties anyway (it is trivial to find collisions via brute-force attacks on a 6-bit hash), you might as well use something like a CRC or Hamming code and get error detection for free
Mensi's suggestion to use a truncated HMAC is a good one, but if you do happen to be on a highly constrained system and want something faster or simpler, you could take any block cipher, encrypt your 16-bit value (padded to a full block) with it and truncate the result to 6 bits.
Unlike HMAC, which computes a pseudorandom function, a block cipher is a pseudorandom permutation — every input maps to a different output. However, when you throw away all but six bits of the block cipher's output, what remains will look very much like a pseudorandom function. There will be a very tiny bias against repeated outputs, but (assuming that the block cipher's block size is much larger than 6 bits, which it should be) it'll be so small as to be all but undetectable.
A good block cipher choice for very low-end systems might be TEA or its successors XTEA and XXTEA. While there are some known attacks on these ciphers, they all require much more extensive access to the cipher than should be possible in your application.
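As an illustrative sketch (not vetted for production use), here is the public-domain XTEA reference routine translated to Java and used to build the truncated 6-bit hash described above; the 16-bit value is zero-padded into one 64-bit block:

```java
public class XteaHash6 {
    // XTEA block encryption: 64-bit block as two ints, 128-bit key as four ints.
    static void encrypt(int[] v, int[] key) {
        int v0 = v[0], v1 = v[1], sum = 0;
        final int delta = 0x9E3779B9;
        for (int i = 0; i < 32; i++) {       // 32 cycles, as in the reference code
            v0 += (((v1 << 4) ^ (v1 >>> 5)) + v1) ^ (sum + key[sum & 3]);
            sum += delta;
            v1 += (((v0 << 4) ^ (v0 >>> 5)) + v0) ^ (sum + key[(sum >>> 11) & 3]);
        }
        v[0] = v0;
        v[1] = v1;
    }

    // 6-bit keyed hash of a 16-bit value: pad, encrypt, truncate.
    static int hash6(int value, int[] key) {
        int[] block = { value & 0xFFFF, 0 }; // zero-pad to a full 64-bit block
        encrypt(block, key);
        return block[0] & 0x3F;              // keep 6 bits, range 0..63
    }
}
```

Because the cipher is a keyed permutation, predicting hash6(M) from other outputs is as hard as attacking the cipher itself, which matches the requirement in the question.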
