What are some ways to prevent deliberate malicious attacks against hash function implementations?

Say you have a software server that uses hash functions, and some external source wants to exploit that by repeatedly attacking the server with keys that they know (or suspect with high probability) will result in collisions. How would you prevent this in practice?
I think one way is to choose the hash function randomly at startup, but this method seems slow in the sense that every time you change hash functions you have to rehash everything.

As you obviously realise, the best defence is to make sure attackers don't know what your hash function will produce, and ideally not your bucket count either. Even if the hash function is strong, hard to reverse, and produces a large range of outputs (say 64-bit unsigned integers), so that finding two keys producing the same hash is time-consuming, finding a key that hashes to a specific bucket after modding by N needs only about N attempts with random, distinct keys.
"choose the hash function randomly at startup, but this method seems slow in the sense that every time you change hash functions you have to rehash everything"
There's not necessarily a need to repeatedly change the hash function... you just need to make it unguessable based on exposed data/code and observable behaviours. For example, you might generate a random seed value on your server, write that to a secure file somewhere, and use it as a seed for your hash function (or if your hash function doesn't support a seed value, just XOR the hash output with the random value). Even if someone knows your hash function, if they don't know the seed then they can't engineer collisions.
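As a minimal sketch of the seeding idea (FNV-1a is just a stand-in hash here, and the names are made up; the answer doesn't prescribe a particular function):

```python
import secrets

# Hypothetical setup: one secret seed, generated at process start and
# kept out of anything an attacker can observe.
_SEED = secrets.randbits(64)

def seeded_hash(key: bytes) -> int:
    """64-bit FNV-1a with the secret seed folded into the initial state,
    so bucket placement is unpredictable without knowing the seed."""
    h = 0xcbf29ce484222325 ^ _SEED
    for b in key:
        h ^= b
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

bucket = seeded_hash(b"some key") % 1024   # bucket count of 1024 is arbitrary
```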
You could also count the collisions a particular client has had, and if it's obviously malicious, disconnect them and remove their keys.

Related

Is there a universal function F such that F(sha(a), sha(b)) = sha(ab)?

I am faced with a need to send my data in parts, and at the same time I am expected to provide sha256 for my WHOLE data.
Something like this: cat large_file | chunker | receiver
where receiver is an application that is expected to receive the data, possibly in chunks, each chunk carrying the sha256 of its payload in a header, followed by the payload itself. After collecting all chunks, it is supposed to store the whole transmitted data along with the sha256 of all the data (that sha256 will be used only to rehash and confirm the integrity of the data).
Of course, the simplest thing would be for the receiver to generate the sha256 from the whole streamed data, but I was wondering if there is a simpler way: collect the hashes of all the chunks and combine them to generate one final hash, which would be the same as the hash calculated over all the data.
In other words, and I copy this from the title, I wonder if there is a function F that receives a list of hashes of chunks of data and generates a final hash equal to the hash generated over all the data.
And again, in other words, having this formula:
F(sha256(data[0]), sha256(data[1]), ..., sha256(data[N])) = sha256(data[0..N])
What would be the function F?
Would it be a universal function, or is there no such thing given the way hashing is calculated?
I suspect there is no such function, or that this is too complicated a question to answer.
AFAIK there are still no known collisions for SHA-256, but I bet that once one is found, i.e. someone finds two messages m1 and m2 such that SHA-256(m1) = SHA-256(m2), then for almost any prefix a the hashes SHA-256(a || m1) and SHA-256(a || m2) will be different. In other words, the function you ask for is actually not a function (it would have different outputs for the same inputs). To put it another way, SHA-2 is susceptible to length-extension attacks but AFAIK not to prefixing attacks. And even if this actually were a function, existence is not enough for you; you also want it to be fast. I believe there is no such fast-to-compute function.
On the other hand, SHA-256 works by splitting the original message into 512-bit chunks and processing them using a well-defined process (where each step depends on the state from all the previous chunks), so theoretically you could modify an implementation of SHA-256 to compute two hashes at the same time (by applying the same logic to different initial states):
Hash of your application-defined chunk (using standard initial state)
Hash of all chunks up to this point (using the state passed from the previous output of the same step as the initial state).
This will probably be slightly faster than doing those things independently, but I don't know whether it will be enough faster to justify such a custom implementation.
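With a stock library the two hashes can't share compression work, but the streaming structure the answer describes might look like this sketch, which assumes a (hex digest, payload) chunk format that the question leaves unspecified:

```python
import hashlib

# Sketch of the receiver: verify each chunk's own sha256 as it arrives,
# while feeding the same bytes into a single running hash for the whole
# stream. (Here the two hashes do not share work, unlike the
# custom-implementation idea above.)
def receive(chunks):
    running = hashlib.sha256()
    for header_hex, payload in chunks:   # assumed format: (hex digest, bytes)
        if hashlib.sha256(payload).hexdigest() != header_hex:
            raise ValueError("chunk failed its integrity check")
        running.update(payload)
    return running.hexdigest()           # sha256 of all the data
```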

Analyzing goals and choosing a good hash function

This isn’t a specific question with a specific solution; rather, it’s a response to the fact that I can’t find any good Stack Overflow questions about how to choose a good hashing function for hash tables and similar tasks.
So! Let’s talk hash functions, and how to choose one. How should a programming noob, who needs to choose a good hash function for their specific task, go about choosing one? When is the simple and quick Fowler-Noll-Vo appropriate? When should they vendor in MurmurHash3 instead? Do you have any links to good resources on comparing the various options?
A hash function for hash tables should have these two properties:
Uniformity: all outputs of H() should be distributed as evenly as possible. In other words, for a 32-bit hash function the probability of every output should be 1/2^32 (for an n-bit function, 1/2^n). With a uniform hash function the chance of collision is minimized to the lowest possible for any input.
Low computational cost: hash functions for tables are expected to be FAST, in contrast to cryptographic hash functions, where speed is traded for preimage resistance (i.e. it is hard to find a message for a given hash value) and collision resistance.
For hash-table purposes all cryptographic functions are a BAD choice, since the computational cost is enormous; hashing here is used not for security but for fast access. MurmurHash is considered one of the fastest and most uniform functions, suitable for big hash tables or hash indexes. For small tables a trivial hash function should be OK. A trivial hash mixes the values of an object's fields (by multiplication, addition and subtraction with some prime), as in the sketch below.
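A minimal sketch of such a trivial hash; the field names and the constants 17 and 31 are illustrative choices, not a standard:

```python
# Illustrative only: mix an object's fields with a small prime.
def point_hash(x: int, y: int, z: int) -> int:
    h = 17                                   # arbitrary non-zero start value
    for field in (x, y, z):
        h = (h * 31 + field) & 0xFFFFFFFF    # mix each field in with a prime
    return h
```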
If your hash keys are strings (or other variable-length data) you might look at this paper by Ramakrishna and Zobel. They benchmark a few classes of hashing functions (for speed and low collisions) and exhibit a class that is better than the usual Bernstein hashes.

How does using a salt make a password more secure if it is stored in the database?

I am learning Rails, at the moment, but the answer doesn't have to be Rails specific.
So, as I understand it, a secure password system works like this:
User creates password
System hashes the password with a hashing algorithm (say SHA-1).
Store the hash of the password in the database.
Upon login attempt:
User tries to login
System creates a hash of the attempt with the same hashing algorithm
System compares hash of attempt with hash of password in the database.
If match, they get let in. If not, they have to try again.
As I understand it, this approach is subject to a rainbow attack, wherein the following can happen.
An attacker can write a script that essentially tries every permutation of characters, numbers and symbols, creates a hash with the same algorithm, and compares it against the hash in the database.
So the way around it is to combine the hash with a unique salt: in many cases, the current date and time (down to milliseconds) at which the user registered.
However, this salt is stored in the database column 'salt'.
So my question is: if the attacker got access to the database in the first place, and so has both the hash created for the 'real' password and the salt, how is this not just as subject to a rainbow attack? In theory he would try every permutation plus the salt and compare the outcome with the password hash. It might take a bit longer, but I don't see how it is foolproof.
Forgive my ignorance, I am just learning this stuff and this just never made much sense to me.
The primary advantage of a salt (chosen at random) is that even if two people use the same password, the hash will be different because the salts will be different. This means that the attacker can't precompute the hashes of common passwords because there are too many different salt values.
Note that the salt does not have to be kept secret; it just has to be big enough (64-bits, say) and random enough that two people using the same password have a vanishingly small chance of also using the same salt. (You could, if you wanted to, check that the salt was unique.)
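A minimal sketch of the store/verify flow, using plain salted SHA-256 to mirror the discussion (a production system should prefer a deliberately slow KDF such as bcrypt or PBKDF2):

```python
import os, hashlib, hmac

# Assumed schema: we store (salt, digest) per user; the salt is public.
def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(8)                     # 64-bit random salt; not secret
    digest = hashlib.sha256(salt + password.encode()).digest()
    return salt, digest                      # store both columns in the DB

def verify(password: str, salt: bytes, stored: bytes) -> bool:
    candidate = hashlib.sha256(salt + password.encode()).digest()
    return hmac.compare_digest(candidate, stored)   # constant-time compare
```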
First of all, what you've described isn't a rainbow attack, it's a dictionary attack.
Second, the primary point of using salt is that it just makes life more difficult for the attacker. For example, if you add a 32-bit salt to each pass-phrase, the attacker has to hash and re-hash each input in the dictionary ~4 billion times, and store the results from all of those to have a successful attack.
To have any hope of being at all effective, a dictionary needs to include something like a million inputs (and a million matching results). You mentioned SHA-1, so let's use that for our example. It produces a 20-byte (160-bit) result. Let's guess that an average input is something like 8 characters long. That means a dictionary needs to be something like 28 megabytes. With a 32-bit salt, however, both the size and the time to produce the dictionary get multiplied by 2^32-1.
Just as an extremely rough approximation, let's say producing an (unsalted) dictionary took an hour. Doing the same with a 32-bit salt would take 2^32-1 hours, which works out to around 490,000 years. There aren't very many people willing to spend that amount of time on an attack.
Since you mention rainbow tables, I'll add that they're typically even larger and slower to start with. A typical rainbow table will easily fill a DVD, and multiplying that by 2^32-1 gives a large enough number that storage becomes a serious problem as well (as in, more than all the storage built in the entire history of computers, at least on planet Earth).
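For what it's worth, the back-of-the-envelope arithmetic behind those figures, with the entry sizes guessed above:

```python
# Rough numbers from the paragraphs above (sizes are guesses).
entries = 1_000_000
bytes_per_entry = 20 + 8                 # 20-byte SHA-1 digest + ~8-char input
print(entries * bytes_per_entry)         # ~28 MB for the unsalted dictionary

hours = 2**32 - 1                        # one hour of work per salt value
print(hours / (24 * 365))                # ~490,000 years
```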
The attacker cannot do a rainbow-table attack and has to brute-force instead, which is a lot less efficient.

Error correcting key encryption

Say I have a scheme that derives a key from N different inputs. Each of the inputs may not be completely secure on its own (e.g. bad passwords), but in combination they are secure. The simple way to do this is to concatenate all of the inputs in order and hash the result.
Now I want to allow key derivation (or rather key decryption) given only N-1 of the N inputs. A simple way to do this is to generate a random key K, derive N temporary keys from the N subsets of the inputs that each omit one input (i.e. Hash(input_{1}, ..., input_{N-1}), Hash(input_{0}, input_{2}, ..., input_{N-1}), Hash(input_{0}, input_{1}, input_{3}, ..., input_{N-1}), ..., Hash(input_{0}, ..., input_{N-2})), then encrypt K with each of the N temporary keys and store all of the results, as in the sketch below.
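A minimal sketch of that N-1-of-N construction, with made-up inputs and XOR standing in for whatever cipher you'd actually encrypt K with:

```python
import hashlib, secrets

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# One temporary key per leave-one-out subset of the inputs.
def leave_one_out_keys(inputs):
    return [hashlib.sha256(b"|".join(inputs[:i] + inputs[i + 1:])).digest()
            for i in range(len(inputs))]

inputs = [b"pw-alpha", b"pw-beta", b"pw-gamma"]       # N = 3
K = secrets.token_bytes(32)                           # the real key
stored = [xor(K, tk) for tk in leave_one_out_keys(inputs)]

# Later, with input 1 missing, entry 1 recovers K from the other two inputs:
tk = hashlib.sha256(b"|".join([inputs[0], inputs[2]])).digest()
assert xor(stored[1], tk) == K
```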
Now I want a generalized solution, where I can decrypt the key using any K of the N inputs. The naive way to expand the scheme above requires storing (N choose N-K) values, which quickly becomes infeasible.
Is there a good algorithm for this, that does not entail this much storage?
I have thought about a way to use something like Shamir's Secret Sharing Scheme, but cannot think of a good way, since the inputs are fixed.
Error Correcting Codes are the most direct way to deal with the problem. They are not, however, particularly easy to implement.
The best approach would be to use a Reed-Solomon code. When you derive the key for the first time, you also calculate the redundancy required by the code and store it. When you want to recalculate the key, you use the redundancy to correct the wrong or missing inputs.
To encrypt / create:
Take the N inputs. Turn each into a block in a good/secure way. Use Reed-Solomon to generate M redundancy blocks from the N-block combination. You now have N+M blocks, of which any N suffice to regenerate the original N blocks.
Use the N blocks to encrypt or create a secure key.
If the first, store the encrypted key and the M redundancy blocks. If the second, store only the M redundancy blocks.
To decrypt / retrieve:
Take N - R correct input blocks, where R <= M. Combine them with the redundancy blocks you stored to recreate the original N blocks. Use the original N blocks to decrypt or create the secure key.
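Reed-Solomon itself is too long to sketch here, but the degenerate M = 1 case is a single XOR parity block, which already shows the rebuild-a-missing-block principle this answer relies on (block contents are made up):

```python
from functools import reduce

# Simplest erasure code (M = 1): one XOR parity block over N equal-length
# blocks; any single missing block can be rebuilt from the rest.
# Reed-Solomon generalizes this to M > 1.
def parity(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

blocks = [b"pass1---", b"pass2---", b"pass3---"]
p = parity(blocks)

rebuilt = parity([blocks[0], blocks[2], p])   # block 1 went missing
assert rebuilt == blocks[1]
```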
(Thanks to https://stackoverflow.com/users/492020/giacomo-verticale : This is essentially what he/she said, but I think a little more explicit / clearer.)
Shamir's Secret Sharing is a technique used when you want to split a secret into multiple shares such that only a combination of at least k shares reveals the initial secret. If you are not sure about the correctness of the initiator and you want to verify it, you use verifiable secret sharing. Both are based on polynomial interpolation.
One approach would be to generate a purely random key (or generate it by hashing all of the inputs, if you want to avoid an RNG for some reason), split it using a k-of-n threshold scheme, and encrypt each share using the individual password inputs (e.g. send them through PBKDF2 with 100,000 iterations and then encrypt/MAC with AES-CTR/HMAC). This would require less storage than storing hash subsets: roughly N * (share size + salt size + MAC size).
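A minimal sketch of the k-of-n splitting step (Shamir over a prime field, assuming an integer secret below the modulus; the per-share PBKDF2/AES layer described above is omitted):

```python
import secrets

P = 2**127 - 1   # prime modulus; the secret must be an integer below P

def split(secret: int, k: int, n: int):
    """Shamir k-of-n: the secret is the constant term of a random
    degree-(k-1) polynomial; each share is a point on that polynomial."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(k - 1)]
    evaluate = lambda x: sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, evaluate(x)) for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    total = 0
    for xj, yj in shares:
        num = den = 1
        for xm, _ in shares:
            if xm != xj:
                num = num * -xm % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, -1, P)) % P
    return total

shares = split(123456789, k=3, n=5)
assert recover(shares[:3]) == 123456789   # any 3 of the 5 shares suffice
```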
Rather than simply allowing a few errors out of a large number of inputs, you should divide the inputs up into groups and allow some number of errors in each group. If you were to allow 4 errors out of 64 inputs, you would have to store C(64, 4) = 635,376 encrypted keys, but if you break that up into two groups of 32, allowing two errors per group, you would only need 2 * C(32, 2) = 992 encrypted keys.
Once you have decrypted the key information from each group, use those together as the input for decrypting the key that you ultimately want.
Also, the keys recovered from each group must not be trivial in comparison to the key that you ultimately want. Do not simply break a 256-bit key into 8 32-bit pieces; doing so would allow someone who could decrypt 7 of those pieces to brute-force the last one. If you want access to a 256-bit key, then you must work with 256-bit keys for the whole procedure.

What is the difference between Obfuscation, Hashing, and Encryption?

Here is my understanding:
Hashing is a one-way algorithm; cannot be reversed
Obfuscation is similar to encryption but doesn't require any "secret" to understand (ROT13 is one example)
Encryption is reversible but a "secret" is required to do so
Hashing is a technique for creating semi-unique keys based on larger pieces of data. With any given hash you will eventually have "collisions" (i.e. two different pieces of data producing the same hash value), and when you do, you typically move to a larger hash size.
Obfuscation generally involves removing helpful clues (i.e. meaningful variable/function names), removing whitespace to make things hard to read, and generally doing things in convoluted ways to make following what's going on difficult. It provides no serious level of security the way "true" encryption would.
Encryption can follow several models, one of which is the "secret" method, called private-key encryption, where both parties have a secret key. Public-key encryption instead uses a published key to encrypt and a private recipient key to decrypt; with public key, only the recipient needs to have the secret.
That's a high-level explanation. I'll try to refine the definitions:
Hashing - in a perfect world, it's a random oracle: for the same input X you always receive the same output Y, which is in NO WAY related to X. This is mathematically impossible (or at least unproven to be possible). The closest we get is trapdoor functions: H(X) = Y where computing H^-1(Y) = X is so difficult that you're better off trying to brute-force a Z such that H(Z) = Y.
Obfuscation (my opinion) - any function f such that f(a) = b, where you rely on f being secret. f may be a hash function, but the "obfuscation" part implies security through obscurity. If you had never seen ROT13 before, it would be obfuscation.
Encryption - E_k(X) = Y, D_l(Y) = X, where E is known to everyone; k and l are keys (in symmetric encryption they are the same). Y is the ciphertext, X is the plaintext.
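To make the notation concrete, here is a toy symmetric example (a one-time-pad XOR, the simplest case where k = l; an illustration, not something to deploy):

```python
import secrets

# Toy symmetric cipher: XOR with a random one-time key.
def E(k: bytes, x: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(k, x))

D = E   # XOR undoes itself, so D_k is the same operation as E_k

x = b"attack at dawn"                 # plaintext X
k = secrets.token_bytes(len(x))       # key as long as the message
y = E(k, x)                           # ciphertext Y
assert D(k, y) == x
```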
A hash is a one-way algorithm used to compare an input with a reference without compromising the reference.
It is commonly used in logins to compare passwords, and you can also find it on your receipt if you shop using a credit card. There you will find your credit card number with some digits hidden; this way you can prove with high probability that your card was used to buy the stuff, while someone searching through your garbage won't be able to find the number of your card.
A very naive and simple hash is "the first 3 letters of a string".
That means the hash of "abcdefg" will be "abc". This function can obviously not be reversed, which is the entire purpose of a hash. However, note that "abcxyz" will have exactly the same hash; this is called a collision. So again: a hash only proves with a certain probability that the two compared values are the same.
Another very naive and simple hash is a number modulo 5; here you will see that 6, 11, 16, etc. will all have the same hash: 1.
Modern hash algorithms are designed to keep the number of collisions as low as possible, but they can never be completely avoided. A rule of thumb is: the longer your hash is, the fewer collisions it has.
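The two naive hashes above, as runnable code, just to make the collisions concrete:

```python
# The two naive hashes described above, showing collisions concretely.
def first_three(s: str) -> str:
    return s[:3]

def mod5(n: int) -> int:
    return n % 5

assert first_three("abcdefg") == first_three("abcxyz") == "abc"   # collision
assert mod5(6) == mod5(11) == mod5(16) == 1                       # collision
```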
Obfuscation in cryptography is encoding the input data before it is hashed or encrypted.
This makes brute force attacks less feasible, as it gets harder to determine the correct cleartext.
That's not a bad high-level description. Here are some additional considerations:
Hashing typically reduces a large amount of data to a much smaller size. This is useful for verifying the contents of a file without having to have two copies to compare, for example.
Encryption involves storing some secret data, and the security of the secret data depends on keeping a separate "key" safe from the bad guys.
Obfuscation is hiding some information without a separate key (or with a fixed key). In this case, keeping the method a secret is how you keep the data safe.
From this, you can see how a hash algorithm might be useful for digital signatures and content validation, how encryption is used to secure your files and network connections, and why obfuscation is used for Digital Rights Management.
This is how I've always looked at it.
Hashing is deriving a value from another, using a set algorithm. Depending on the algorithm used, this may be one way, or may not be.
Obfuscating is making something harder to read by symbol replacement.
Encryption is like hashing, except the value is dependent on another value you provide to the algorithm.
A brief answer:
Hashing - creating a check field on some data (to detect when the data is modified). This is a one-way function, and the original data cannot be derived from the hash. Typical standards for this are SHA-1, SHA-256, etc.
Obfuscation - modifying your data/code to confuse anyone else (no real protection). This may or may not lose some of the original data. There are no real standards for this.
Encryption - using a key to transform data so that only those with the correct key can understand it. The encrypted data can be decrypted to obtain the original data. Typical standards are DES, TDES, AES, RSA, etc.
All fine, except obfuscation is not really similar to encryption; sometimes it doesn't even involve ciphers, even ones as simple as ROT13.
Hashing is the one-way task of creating one value from another. The algorithm should try to create a value that is as short and as unique as possible.
Obfuscation is making something unreadable without changing semantics. It involves value transformation, removing whitespace, etc. Some forms of obfuscation can also be one-way, so it's impossible to get back the starting value.
Encryption is two-way, and there's always some decryption working the other way around.
So, yes, you are mostly correct.
Obfuscation is hiding or making something harder to understand.
Hashing takes an input, runs it through a function, and generates an output that can be a reference to the input. It is not necessarily unique, a function can generate the same output for different inputs.
Encryption transforms the input into an output in a unique manner. There is a one-to-one correspondence, so there is no potential loss of data or confusion; the output can always be transformed back to the input with no ambiguity.
Obfuscation is merely making something harder to understand by introducing techniques to confuse someone. Code obfuscators usually do this by renaming things to remove anything meaningful from variable or method names. It's not similar to encryption in that nothing has to be decrypted to be used.
Typically, the difference between hashing and encryption is that hashing is a one-way, keyless transformation, while encryption uses a formula requiring key(s) to encrypt/decrypt. Note that base64 is neither: it is a reversible encoding, so anyone can decode base64 data. And MD5 is a hash, not encryption; an MD5 digest cannot be "decrypted" at all, with or without a key.
