What are the default hash functions used by programming languages for dictionaries/associative arrays? - hashtable

So I was curious when I got to know that dictionaries or associative arrays are usually implemented by hash tables. Upon reading about hash tables, I stumbled upon hash functions and learned that there are various hash functions such as MD5, MD6, SHA-1, etc. What I was unable to find was which hash function is used by programming languages such as Python, C++, and Java.

Those are.. not the same kind of 'hash function' D:
For hashtable hash functions, code must compute an appropriate hash based on object-data such that it conforms to equality requirements. It should also be "well distributed" and "fast". Most hashtable hashes are thus often 32-bit values using some form of a rolling/shifting computation. At the end of the day this hash is used to select from a much smaller pool of buckets.
Hashtable hashes are usually computed directly by (or with knowledge of) the objects being added to the hashtable. That is, cryptographic hash functions are generally not involved in hashtables. A typical Java hashCode() function, defined on the object being added to the hashtable, might for example look like:
@Override
public int hashCode() {
    int hash = 7;  // arbitrary non-zero seed
    hash = 31 * hash + (int) int_field;
    hash = 31 * hash + (str_field == null ? 0 : str_field.hashCode());
    // etc. for the remaining fields
    return hash;
}
There are discussions on the choice of seed and multiplication values elsewhere, but the takeaway should be that most hashtable hash functions 1) directly derive from object state, applying 'tweaks' as prudent, and 2) are not designed to be "secure".
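As a rough Python sketch of the same pattern: the function names and the two fields are made up for illustration, and the 32-bit mask only imitates the wraparound that Java's int arithmetic gives for free. The second function shows the final step of selecting from a much smaller pool of buckets.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit wraparound, as Java's int arithmetic would


def combine_hash(int_field, str_field):
    """Combine two fields with the seed-7, multiplier-31 pattern shown above."""
    h = 7
    h = (31 * h + int_field) & MASK32
    # hash() is Python's built-in; for str it is randomized per process
    field_hash = 0 if str_field is None else hash(str_field)
    h = (31 * h + field_hash) & MASK32
    return h


def bucket_index(h, bucket_count):
    """Reduce the hash to one of a much smaller number of buckets."""
    return h % bucket_count
```

For example, combine_hash(5, None) is 31 * (31 * 7 + 5) = 6882, which lands in bucket 6882 % 16 = 2 of a 16-bucket table.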
(Modern hashtable implementations often apply a "mixing function" to the generated hash value to mitigate degenerate hash function results and/or data poisoning attacks.)
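One well-known mixing step of this kind is the one OpenJDK's HashMap applies: XOR the upper 16 bits of the hash into the lower 16 bits before masking. The Python sketch below imitates that idea; the function names are illustrative, and the table size is assumed to be a power of two.

```python
def spread(h):
    """Fold the high 16 bits of a 32-bit hash into the low 16 bits."""
    h &= 0xFFFFFFFF
    return h ^ (h >> 16)


def bucket_index(h, bucket_count):
    """Pick a bucket; bucket_count is assumed to be a power of two."""
    return spread(h) & (bucket_count - 1)


# Hashes that differ only in their high bits would all land in bucket 0
# of a 16-bucket table without the mixing step.
poor_hashes = [0x10000, 0x20000, 0x30000]
unmixed = [h & 15 for h in poor_hashes]
mixed = [bucket_index(h, 16) for h in poor_hashes]
```

Here unmixed is [0, 0, 0] while mixed is [1, 2, 3], so the degenerate hashes no longer pile into one bucket.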
On the other hand, a cryptographic hash is designed to meet much stronger requirements and has a much larger output space. While such a strong hash can be used for hashtables (after being derived from an object and then distilled down to a hash bucket), it is also slower to generate and usually unnecessary in the context of a hash/dictionary.
Cryptographic hashes generally work on an arbitrary chunk of data or byte stream.
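To make the "distilled down to a hash bucket" step concrete, here is a minimal sketch using Python's hashlib. The function name and the choice of taking the first 8 digest bytes are assumptions for illustration; it works, but it does far more computation than a hashtable needs.

```python
import hashlib


def sha256_bucket(data: bytes, bucket_count: int) -> int:
    """Map a byte string to a bucket via a cryptographic hash."""
    digest = hashlib.sha256(data).digest()  # always 32 bytes
    # Keep only the first 8 bytes, then reduce to the number of buckets
    return int.from_bytes(digest[:8], "big") % bucket_count
```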
Hashtable hash desirable characteristics:
Deterministic
Uniform distribution / avoidance of clustering
Speed, speed, speed
Cryptographic hashes have additional characteristics, beyond that of a hashtable hash:
Infeasible to generate a message from its hash value
Infeasible to find two different messages with the same hash value
(While cryptographic hashes should also be fast, speed is largely secondary to the additional requirements.)
Programming languages support a wide range of different cryptographic hash functions through their standard libraries and/or 3rd party libraries. A more well-known hash (eg. MD5/SHA-x) will generally have universal support while something more specialized (eg. MD6) may require additional effort to locate an implementation for.
On the other hand, as shown above, many hash table 'functions' are implemented directly on the object(s) involved in a hashtable, following a standard pattern, with some languages (and IDEs) providing assistance to reduce manual coding. As an example, C# provides a default reflection-based GetHashCode implementation for struct types.

Related

Arbitrary length keys to standard key length in AES

I was asked to implement the AES algorithm for a security class. While implementing it, I couldn't find an answer on how I can accept a key like a password, with arbitrary length, from the user and convert it to a 128-, 192-, or 256-bit key. What should I do?
As mentioned in the comments, this is typically done with a key derivation function (KDF). There are two main types of key derivation functions that are used.
The first kind is used when you have some type of cryptographic material already, oftentimes some variant of a key exchange (usually, Diffie-Hellman). In this case, the key material is assumed to be strong and you just want to distill it and generate potentially multiple keys from it. HKDF, which is used in TLS 1.3, and the TLS 1.2 PRF are good examples of this. They are generally wrappers around HMAC, and they're pretty fast.
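The extract-then-expand structure of HKDF can be sketched with nothing but HMAC. The following follows the construction in RFC 5869 with SHA-256; the function name and parameter names are illustrative, and real code would normally use a vetted library implementation instead.

```python
import hashlib
import hmac

HASH_LEN = 32  # output size of SHA-256 in bytes


def hkdf_sha256(key_material: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF with HMAC-SHA256: extract a pseudorandom key, then expand it."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long for HKDF-SHA256")
    # Extract: concentrate the (already strong) key material into one key.
    # An empty salt is replaced by a block of zeros.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, key_material, hashlib.sha256).digest()
    # Expand: chain HMAC calls until enough output bytes are available
    output, block, counter = b"", b"", 1
    while len(output) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        output += block
        counter += 1
    return output[:length]
```

Calling it twice with different info values (say b"encryption" and b"authentication") on the same shared secret yields independent keys, which is how multiple keys are distilled from one key exchange.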
The second kind is used when you have a password. Because, in general, people are bad at coming up with and remembering passwords with sufficient entropy, we use a KDF that is specifically iterated so as to be slow, such as the older PBKDF2 or the newer scrypt and Argon2. These options are designed to use a unique salt and be iterated many times so that users who pick poor passwords are afforded at least some level of protection against compromise, and the newer options are designed to be expensive in memory to prevent efficient attacks on GPUs.
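For the password case, Python's standard library exposes PBKDF2 as hashlib.pbkdf2_hmac. The sketch below derives an AES-sized key from a password; the function name and the default iteration count are assumptions, and the count should be tuned to your hardware and current guidance.

```python
import hashlib


def derive_aes_key(password: str, salt: bytes, key_bits: int = 256,
                   iterations: int = 600_000) -> bytes:
    """Derive a 128-, 192-, or 256-bit key from a password with PBKDF2-HMAC-SHA256."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192, or 256 bits long")
    return hashlib.pbkdf2_hmac(
        "sha256",
        password.encode("utf-8"),
        salt,              # unique per password; stored alongside the ciphertext
        iterations,        # deliberately slow; illustrative default
        dklen=key_bits // 8,
    )
```

A fresh salt can be generated with os.urandom(16) and does not need to be secret. hashlib.scrypt offers a memory-hard alternative where it is available.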

Short (6 bit) cryptographic keyed hash

I have to implement a simple hashing algorithm.
Input data:
Value (16-bit integer).
Key (any length).
Output data:
6-bit hash (number 0-63).
Requirements:
It should be practically impossible to predict the hash value if you only have the input value but not the key. More specifically: if I know hash(x) for x < M, it should be hard to predict hash(M) without knowing the key.
Possible solutions:
Keep full mapping as a key. So the key has length 2^16*6 bits. It's too long for my case.
Linear code. The key is a generator matrix. Its length is 16*6 bits. But it's easy to find the generator matrix using several known hash values.
Are there any other possibilities?
An HMAC seems to be what you want. So a possibility for you could be to use a SHA-based HMAC and just use a substring of the resulting hash. This should be relatively safe, since the bits of a cryptographic hash should be as independent and unpredictable as possible.
Depending on your environment, this could however take too much processing time, so you might have to choose a simpler hashing scheme to construct your HMAC.
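A minimal sketch of the truncated-HMAC idea in Python, assuming HMAC-SHA256 and keeping the low 6 bits of the first tag byte. The function name and the big-endian encoding of the value are arbitrary choices for illustration.

```python
import hashlib
import hmac


def keyed_hash6(value: int, key: bytes) -> int:
    """Return a 6-bit keyed hash (0-63) of a 16-bit value."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value must fit in 16 bits")
    tag = hmac.new(key, value.to_bytes(2, "big"), hashlib.sha256).digest()
    return tag[0] & 0x3F  # keep 6 of the 256 output bits
```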
Original answer, on which the discussion in the comments is based:
Since you can forget cryptographic properties anyway (it is trivial to find collisions via bruteforce attacks on a 5-bit hash) you might as well use something like CRC or Hamming Codes and get error-detection for free
Mensi's suggestion to use truncated HMAC is a good one, but if you do happen to be on a highly constrained system and want something faster or simpler, you could take any block cipher, encrypt your 16-bit value (padded to a full block) with it and truncate the result to 6 bits.
Unlike HMAC, which computes a pseudorandom function, a block cipher is a pseudorandom permutation — every input maps to a different output. However, when you throw away all but six bits of the block cipher's output, what remains will look very much like a pseudorandom function. There will be a very tiny bias against repeated outputs, but (assuming that the block cipher's block size is much larger than 6 bits, which it should be) it'll be so small as to be all but undetectable.
A good block cipher choice for very low-end systems might be TEA or its successors XTEA and XXTEA. While there are some known attacks on these ciphers, they all require much more extensive access to the cipher than should be possible in your application.
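The block-cipher variant can be sketched the same way. The code below is a plain Python transcription of the usual XTEA encryption routine (64-bit block, 128-bit key, 32 cycles), written for illustration rather than speed; the zero padding of the value and the function names are assumptions.

```python
import struct

MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9


def xtea_encrypt_block(v0, v1, key_words, rounds=32):
    """Encrypt one 64-bit block, given as two 32-bit words, with XTEA."""
    total = 0
    for _ in range(rounds):
        v0 = (v0 + ((((v1 << 4) ^ (v1 >> 5)) + v1)
                    ^ (total + key_words[total & 3]))) & MASK32
        total = (total + DELTA) & MASK32
        v1 = (v1 + ((((v0 << 4) ^ (v0 >> 5)) + v0)
                    ^ (total + key_words[(total >> 11) & 3]))) & MASK32
    return v0, v1


def keyed_hash6_xtea(value: int, key: bytes) -> int:
    """Encrypt the zero-padded 16-bit value and keep 6 bits of the result."""
    if len(key) != 16:
        raise ValueError("XTEA uses a 128-bit (16-byte) key")
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value must fit in 16 bits")
    key_words = struct.unpack(">4I", key)
    v0, _ = xtea_encrypt_block(0, value, key_words)  # block = 48 zero bits + value
    return v0 & 0x3F
```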

Analyzing goals and choosing a good hash function

This isn’t a specific question with a specific solution; rather, it’s a response to the fact that I can’t find any good Stack Overflow questions about how to choose a good hashing function for hash tables and similar tasks.
So! Let’s talk hash functions, and how to choose one. How should a programming noob, who needs to choose a good hash function for their specific task, go about choosing one? When is the simple and quick Fowler-Noll-Vo appropriate? When should they vendor in MurmurHash3 instead? Do you have any links to good resources on comparing the various options?
The hash function for hash tables should have these two properties:
Uniformity: all outputs of H() should be distributed as evenly as possible. In other words, for a 32-bit hash function the probability of every output should be 1/2^32 (for an n-bit function, 1/2^n). With a uniform hash function, the chance of collision is as low as it can be for any input.
Low computational cost: hash functions for tables are expected to be FAST, compared to cryptographic hash functions, where speed is traded for preimage resistance (i.e. it is hard to find the message from a given hash value) and collision resistance.
For the purposes of hash tables, all cryptographic functions are a BAD choice, since the computational cost is enormous and hashing here is used not for security but for fast access. MurmurHash is considered one of the fastest and most uniform functions, suitable for big hash tables or hash indexes. For small tables a trivial hash function should be OK. A trivial hash is one where we mix the values of an object (by multiplication, addition and subtraction with some prime).
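As an example of how small a simple, quick hash can be, here is a sketch of 32-bit FNV-1a (the Fowler-Noll-Vo variant that XORs each byte in before multiplying) in Python. The function name is illustrative; the offset basis and prime are the standard 32-bit FNV parameters.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """Hash a byte string with 32-bit FNV-1a."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte                           # mix in one input byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # multiply, keeping 32 bits
    return h
```

The whole function is one XOR and one multiplication per input byte, which is why it is attractive for small tables and short keys.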
If your hash keys are strings (or other variable-length data) you might look at this paper by Ramakrishna and Zobel. They benchmark a few classes of hashing functions (for speed and low collisions) and exhibit a class that is better than the usual Bernstein hashes.

How to create an efficient static hash table?

I need to create small to mid-sized static hash tables. Typically, those will have 5-100 entries. When the hash table is created, all key hashes are known up-front (i.e. the keys are already hashes.) Currently, I create a HashMap, which is to say I sort the keys so I get O(log n) lookup, which is 3-5 lookups on average for the sizes I care about. Wikipedia claims that a simple hash table with chaining will result in 3 lookups on average for a full table, so that's not yet worth the trouble for me (i.e. taking hash%n as the first entry and doing the chaining.) Given that I know all hashes up-front, it seems to me that there should be an easy way to get a fast, static perfect hash, but I couldn't find a good pointer on how to do it. I.e. amortized O(1) access with no (little?) additional overhead. How should I implement such a static table?
Memory usage is important, so the less I need to store, the better.
Edit: Notice that it's fine if I have to resolve one collision or so manually. I.e. if I could do some chaining which on average has direct access and worst-case 3 indirections for instance, that's fine. It's not that I need a perfect hash.
For C or C++ you can use gperf:
GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
GNU gperf is highly customizable. There are options for generating C or C++ code, for emitting switch statements or nested ifs instead of a hash table, and for tuning the algorithm employed by gperf.
Small hashes are also possible in C without an external lib using the pre-processor, for example:
switch (hash_string(*p))
{
case HASH_S16("test"):
    ...
    break;
case HASH_S256("An example with a long text!!!!!!!!!!!!!!!!!"):
    ...
    break;
}
Have a look at the code at http://www.heeden.nl/statichashc.htm
You can use Sux4j to generate a minimal perfect hash in Java or C++. (I'm not sure you are using Java, but you mentioned HashMap, so I'm assuming.) For C, you can use the cmph library.
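For very small tables where all hashes are known up-front, a brute-force seed search can also produce a collision-free table. The Python sketch below is only an illustration of that idea, not what gperf, Sux4j or cmph do internally; the function names, the mixing constant and the default of twice as many slots as entries are all assumptions.

```python
MASK32 = 0xFFFFFFFF


def slot(key_hash, seed, table_size):
    """Mix a precomputed hash with a seed and map it to a table slot."""
    mixed = ((key_hash ^ seed) * 0x9E3779B1) & MASK32
    return (mixed >> 16) % table_size


def build_static_table(entries, table_size=None, max_seed=1_000_000):
    """Search for a seed that places every known hash in its own slot.

    entries maps precomputed integer hashes to values.
    """
    if table_size is None:
        table_size = max(1, 2 * len(entries))  # spare slots make the search fast
    for seed in range(max_seed):
        table = [None] * table_size
        for key_hash, value in entries.items():
            i = slot(key_hash, seed, table_size)
            if table[i] is not None:
                break  # collision: try the next seed
            table[i] = (key_hash, value)
        else:
            return seed, table
    raise RuntimeError("no collision-free seed found; try a larger table_size")


def lookup(table, seed, key_hash):
    """Single probe: return the value stored for key_hash, or None."""
    entry = table[slot(key_hash, seed, len(table))]
    if entry is not None and entry[0] == key_hash:
        return entry[1]
    return None
```

The chance that a random seed is collision-free drops quickly as the table fills up or grows, so this is realistic only toward the small end of the 5-100 range; for larger or tightly packed tables the generators mentioned above use much smarter constructions.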

What is the difference between Obfuscation, Hashing, and Encryption?

What is the difference between Obfuscation, Hashing, and Encryption?
Here is my understanding:
Hashing is a one-way algorithm; cannot be reversed
Obfuscation is similar to encryption but doesn't require any "secret" to understand (ROT13 is one example)
Encryption is reversible but a "secret" is required to do so
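The first two points can be seen directly with Python's standard library. This sketch contrasts a hash (fixed-size output, no way back) with ROT13 (reversible by anyone who knows the method, no secret involved); encryption is left out because the standard library ships no cipher, and the sample message is made up.

```python
import codecs
import hashlib

message = "Attack at dawn"

# Hashing: fixed-size digest, and there is no function to get the message back
digest = hashlib.sha256(message.encode("utf-8")).hexdigest()

# Obfuscation: ROT13 needs no key; applying it a second time undoes it
obfuscated = codecs.encode(message, "rot13")
restored = codecs.encode(obfuscated, "rot13")
```

Here obfuscated is "Nggnpx ng qnja" and restored equals the original message, while digest is always 64 hex characters regardless of the input length.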
Hashing is a technique of creating semi-unique keys based on larger pieces of data. In a given hash you will eventually have "collisions" (e.g. two different pieces of data calculating to the same hash value) and when you do, you typically create a larger hash key size.
Obfuscation generally involves trying to remove helpful clues (e.g. meaningful variable/function names), removing whitespace to make things hard to read, and generally doing things in convoluted ways to make following what's going on difficult. It provides no serious level of security like "true" encryption would.
Encryption can follow several models, one of which is the "secret" method, called private key encryption where both parties have a secret key. Public key encryption uses a shared one-way key to encrypt and a private recipient key to decrypt. With public key, only the recipient needs to have the secret.
That's a high level explanation. I'll try to refine them:
Hashing - in a perfect world, it's a random oracle. For the same input X, you always receive the same output Y, which is in NO WAY related to X. This is mathematically impossible (or at least unproven to be possible). The closest we get is trapdoor functions: H(X) = Y, where computing H^-1(Y) = X is so difficult that you're better off trying to brute force a Z such that H(Z) = Y.
Obfuscation (my opinion) - Any function f, such that f(a) = b where you rely on f being secret. F may be a hash function, but the "obfuscation" part implies security through obscurity. If you never saw ROT13 before, it'd be obfuscation
Encryption - Ek(X) = Y, Dl(Y) = X where E is known to everyone. k and l are keys, they may be the same (in symmetric, they are the same). Y is the ciphertext, X is the plaintext.
A hash is a one way algorithm used to compare an input with a reference without compromising the reference.
It is commonly used in logins to compare passwords, and you can also find it on your receipt if you shop using a credit card. There you will find your credit card number with some digits hidden; this way you can prove with high probability that your card was used to buy the stuff, while someone searching through your garbage won't be able to find the number of your card.
A very naive and simple hash is "The first 3 letters of a string".
That means the hash of "abcdefg" will be "abc". This function obviously cannot be reversed, which is the entire purpose of a hash. However, note that "abcxyz" will have exactly the same hash; this is called a collision. So again: a hash only proves with a certain probability that the two compared values are the same.
Another very naive and simple hash is the 5-modulus of a number, here you will see that 6,11,16 etc.. will all have the same hash: 1.
Modern hash algorithms are designed to keep the number of collisions as low as possible, but they can never be completely avoided. A rule of thumb is: the longer your hash is, the fewer collisions it has.
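The two naive hashes described above are short enough to write out in Python (the function names are made up for illustration):

```python
def first_three_letters(text: str) -> str:
    """Naive hash: keep only the first 3 characters."""
    return text[:3]


def mod_five(number: int) -> int:
    """Naive hash: remainder after division by 5."""
    return number % 5
```

first_three_letters("abcdefg") and first_three_letters("abcxyz") both give "abc", and mod_five maps 6, 11 and 16 all to 1, so each pair collides and neither input can be recovered from the hash alone.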
Obfuscation in cryptography is encoding the input data before it is hashed or encrypted.
This makes brute force attacks less feasible, as it gets harder to determine the correct cleartext.
That's not a bad high-level description. Here are some additional considerations:
Hashing typically reduces a large amount of data to a much smaller size. This is useful for verifying the contents of a file without having to have two copies to compare, for example.
Encryption involves storing some secret data, and the security of the secret data depends on keeping a separate "key" safe from the bad guys.
Obfuscation is hiding some information without a separate key (or with a fixed key). In this case, keeping the method a secret is how you keep the data safe.
From this, you can see how a hash algorithm might be useful for digital signatures and content validation, how encryption is used to secure your files and network connections, and why obfuscation is used for Digital Rights Management.
This is how I've always looked at it.
Hashing is deriving a value from another, using a set algorithm. Depending on the algo used, this may be one way, may not be.
Obfuscating is making something harder to read by symbol replacement.
Encryption is like hashing, except the value is dependent on another value you provide the algorithm.
A brief answer:
Hashing - creating a check field on some data (to detect when data is modified). This is a one way function and the original data cannot be derived from the hash. Typical standards for this are SHA-1, SHA256 etc.
Obfuscation - modify your data/code to confuse anyone else (no real protection). This may or may not lose some of the original data. There are no real standards for this.
Encryption - using a key to transform data so that only those with the correct key can understand it. The encrypted data can be decrypted to obtain the original data. Typical standards are DES, TDES, AES, RSA etc.
All fine, except obfuscation is not really similar to encryption - sometimes it doesn't even involve ciphers as simple as ROT13.
Hashing is a one-way task of creating one value from another. The algorithm should try to create a value that is as short and as unique as possible.
Obfuscation is making something unreadable without changing semantics. It involves value transformation, removing whitespace, etc. Some forms of obfuscation can also be one-way, so it's impossible to get the starting value.
Encryption is two-way, and there's always some decryption working the other way around.
So, yes, you are mostly correct.
Obfuscation is hiding or making something harder to understand.
Hashing takes an input, runs it through a function, and generates an output that can be a reference to the input. It is not necessarily unique, a function can generate the same output for different inputs.
Encryption transforms the input into an output in a unique manner. There is a one-to-one correlation so there is no potential loss of data or confusion - the output can always be transformed back to the input with no ambiguity.
Obfuscation is merely making something harder to understand by introducing techniques to confuse someone. Code obfuscators usually do this by renaming things to remove anything meaningful from variable or method names. It's not similar to encryption in that nothing has to be decrypted to be used.
Typically, the difference between hashing and encryption is that hashing generally just employs a formula to translate the data into another form where encryption uses a formula requiring key(s) to encrypt/decrypt. Examples would be base 64 encoding being a hash algorithm where md5 being an encryption algorithm. Anyone can unhash base64 encoded data, but you can't unencrypt md5 encrypted data without a key.

Resources