Probability of collision with truncated SHA-256 hash - math

I have a database-driven web application where the primary keys of all data rows are obfuscated as follows: SHA256(content type + primary key + secret), truncated to the first 8 characters. The content type is a simple word, e.g. "post" or "message" and the secret is a 20-30 char ASCII constant. The result is stored in a separate indexed column for fast DB lookup.
How do I calculate the probability of a hash collision in this scenario? I am not a mathematician at all, but a friend claimed that due to the Birthday Paradox the collision probability would be ~1% for 10,000 rows with an 8-char truncation. Is there any truth to this claim?

Yes, there is a collision probability & it's probably somewhat too high. The exact probability depends on what "8 characters" means.
Does "8 characters" mean:
A) You store 8 hex characters of the hash? That would store 32 bits.
B) You store 8 characters of BASE-64? That would store 48 bits.
C) You store 8 bytes, encoded in some single-byte charset or hacked in some broken way into a character encoding? That would store 56-64 bits, but if you don't do the encoding right you'll encounter character conversion problems.
D) You store 8 bytes, as bytes? That genuinely stores 64 bits of the hash.
Storing binary data as either A) hex or D) binary bytes, would be my preferred options. But I'd definitely recommend either reconsidering your "key obfuscation" scheme or significantly expanding the stored key-size to reduce the (currently excessive) probability of key collision.
From Wikipedia:
https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
The birthday problem in this more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2).
In the most conservative reading of your design above (A, 8 chars of hex == 32 bits), your scheme would be expected to suffer collisions once it stored on the order of 2^16 (~65,000) rows. I would consider such an outcome unacceptable for all serious, or even toy, systems.
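To put a number on the friend's claim, here is a minimal sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(N+1)). It assumes the truncated hash behaves like a uniformly random N-bit value; the function names are made up for illustration.

```python
import math

def collision_probability(rows: int, bits: int) -> float:
    """Approximate chance that at least two of `rows` random `bits`-bit values collide."""
    space = 2 ** bits
    # Birthday approximation: p ~ 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-rows * (rows - 1) / (2 * space))

def rows_for_probability(p: float, bits: int) -> float:
    """Approximate row count at which the collision chance reaches `p`."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))

# Interpretations A (32 bits), B (48 bits) and D (64 bits) from the list above.
for bits in (32, 48, 64):
    p = collision_probability(10_000, bits)
    n_half = rows_for_probability(0.5, bits)
    print(f"{bits} bits: p(10,000 rows) = {p:.3g}, 50% collision chance near {n_half:,.0f} rows")
```

For 32 bits and 10,000 rows this comes out at a little over 1%, so the claim holds for interpretation A.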
Allowing for business growth, transaction tables may see volumes from 1,000 to 100,000 transactions/day (or more). Systems should be designed to function for 100 years (36,500 days), with a 10x growth factor built in: 100,000 x 36,500 x 10 is about 36.5 billion rows.
For your keying mechanism to be genuinely robust & professionally useful, you would need to be able to scale it up to potentially handle ~36 billion (2^35) rows without collision. That would imply 70+ bits of hash.
The source-control system Git, for example, stores 160 bits of SHA-1 hash (40 chars of hex == 20 bytes or 160 bits). Collisions would not be expected until around 2^80 different file revisions were stored.
A possibly better design might be, rather than hashing & pseudo-randomizing the key entirely & hoping (against hope) to avoid collisions, to prepend/append/fold in 8-10 bits of a hash into the key.
This would generate a larger key, containing all the uniqueness of the original key plus 8-10 bits of verification. Attempts to access keys would then be verified, and more than 3 invalid requests would be treated as an attempt to violate security by "probing" the keyspace & would trigger semi-permanent lockout.
The only major cost here would be a modest reduction in the size of the available keyspace for a given int size. A 32-bit int sent to/from the browser would have 8-10 bits dedicated to security, leaving 22-24 for the actual key. So you'd use 64-bit ints where that was not sufficient.
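A minimal sketch of the "key plus check bits" idea, under some assumptions that are not in the original: the check field is 10 bits wide, it is appended as the low bits, and it is derived with HMAC-SHA256 rather than a plain SHA256(type + key + secret). The secret and all names are placeholders.

```python
import hashlib
import hmac
from typing import Optional

CHECK_BITS = 10                       # assumed width of the verification field
SECRET = b"example-secret-constant"   # placeholder secret, for illustration only

def _check(content_type: str, pk: int) -> int:
    """Derive CHECK_BITS bits of verification from the content type and key."""
    msg = f"{content_type}:{pk}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big") >> (16 - CHECK_BITS)

def encode_key(content_type: str, pk: int) -> int:
    """Public id = original key in the high bits, check bits in the low bits."""
    return (pk << CHECK_BITS) | _check(content_type, pk)

def decode_key(content_type: str, public_id: int) -> Optional[int]:
    """Return the original key, or None if the check bits do not match."""
    if public_id < 0:
        return None
    pk = public_id >> CHECK_BITS
    check = public_id & ((1 << CHECK_BITS) - 1)
    expected = _check(content_type, pk)
    if hmac.compare_digest(check.to_bytes(2, "big"), expected.to_bytes(2, "big")):
        return pk
    return None

public_id = encode_key("post", 12345)
print(public_id, decode_key("post", public_id), decode_key("post", public_id ^ 1))
```

Because the original key is embedded unchanged, two different rows can never share a public id. The check bits only make guessing valid ids harder, which is why the lockout on repeated invalid requests matters.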

Related

Building a unique ID without collisions

I'm playing around with system design and have been reading up on URL shorteners. I realize there are many questions around this topic, but have some specific questions with respect to hashing and the order in which I hash + encode.
Input: https://example.com/owjpojwepofjwpoejfpwjepfojpwejfp/wefoijhwioejfiowef/weoifhwoiehjfiowef
Output: https://example.com/abr4fna
If I run this input through md5 I get the following 9e91e9c2a7ce0f0d11b475d2abfb8593. Clearly, this exceeds the length that I want, so I could truncate the substring from (0,7]. The problem is, to some degree, I can still have a collision since the prefix of the md5 is not guaranteed to be unique as the amount of urls generated increases within the service.
I do not want to have to check the database if I've already used this ID before as that would increase the amount of reads I'm doing proportional to the number of writes I'm doing. In addition, there could be concurrency issues as I grow the number of application servers doing the hash generation and storage.
I see people mentioning the use of base64 encoding the output hash, but what value does this add after the hash? Is it because I grow the amount of unique combinations by 64^n where n is the length of my hash versus md5 being only 36^n?
Thanks. Just interested in having this discussion.
edit:
As I understand it, we are doing the encoding piece purely to ensure we do not have transmission failures if the receiving system has issues interpreting binary data from the output hash - so it's used for the pure sake of display.
By definition, you cannot hash a large domain and expect to get a smaller domain without collisions. A hash is useful because it is one-way and would require a computationally infeasible amount of tries to find those collisions. However, with a 7 character output and a large input domain, it will be exceptionally easy to generate collisions even by chance.
You're currently using 7 hexadecimal digits. Each hexadecimal digit represents 4 bits. So you have 28 bits or 2^28 possible values. That's around 268 million possible values. So if you guess long enough you'll get a collision soon enough. With base64 you'd have 6 bits per character instead (2^6 = 64, hence the name). That means that you increase the size by 7 * 2 = 14 bits, or around 16 thousand times as many values, but you'd still be pretty far from collision free.
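The difference between the two encodings can be sketched as follows. This is an illustration only: the URL is the one from the question, and the URL-safe base64 alphabet is an assumption, chosen so the id can sit in a path.

```python
import base64
import hashlib

url = ("https://example.com/owjpojwepofjwpoejfpwjepfojpwejfp/"
       "wefoijhwioejfiowef/weoifhwoiehjfiowef")
digest = hashlib.md5(url.encode()).digest()   # 16 raw bytes

# Same hash, two textual encodings, both cut to 7 characters.
hex_id = digest.hex()[:7]                                # 7 chars * 4 bits = 28 bits
b64_id = base64.urlsafe_b64encode(digest).decode()[:7]   # 7 chars * 6 bits = 42 bits

print(hex_id, "out of", 16 ** 7, "possible ids")
print(b64_id, "out of", 64 ** 7, "possible ids")
print("ratio:", 64 ** 7 // 16 ** 7)                      # 2^14
```

Base64 does not add entropy to the hash; it just packs more of the hash's bits into each visible character before truncation.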
Actually, for any cryptographic reassurance when taking the birthday bound into account, the 16 byte output of MD5 is about the absolute minimum size of hash you want to avoid collisions. Of course, MD5 hasn't been deprecated for nothing; you'd really want to use SHA-256.

How many bits of integer data can be stored in a DynamoDB attribute of type Number?

DynamoDB's Number type supports 38 digits of decimal precision. This is not big enough to store a 128-bit integer which would require 39 digits. The max value is 340,282,366,920,938,463,463,374,607,431,768,211,455 for unsigned 128-bit ints or 170,141,183,460,469,231,731,687,303,715,884,105,727 for signed 128-bit ints. These are both 39-digit numbers.
If I can't store 128 bits, then how many bits of integer data can I store in a Number?
DynamoDB attribute of type Number can store 126-bit integers (or 127-bit unsigned integers, with serious caveats).
According to Amazon's documentation:
Numbers can have up to 38 digits precision. Exceeding this results in an exception.
This means (verified by testing in the AWS console) that the largest positive integer and smallest negative integers, respectively, that DynamoDB can store in a Number attribute are:
99,999,999,999,999,999,999,999,999,999,999,999,999 (aka 10^38-1)
-99,999,999,999,999,999,999,999,999,999,999,999,999 (aka -10^38+1)
These numbers require 126 bits of storage, using this formula:
bits = floor(ln(number) / ln(2))
     = floor(87.498 / 0.693147)
     = floor(126.233)
     = 126
So you can safely store a 126-bit signed int in DynamoDB.
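The same bound can be checked with exact integer arithmetic. This is a small sketch, not DynamoDB code: it only compares magnitudes against the 38-digit limit quoted above, and the name MAX_38_DIGITS is made up.

```python
MAX_38_DIGITS = 10 ** 38 - 1   # largest magnitude with 38 decimal digits

largest_126 = 2 ** 126 - 1     # largest 126-bit magnitude
largest_127 = 2 ** 127 - 1     # largest 127-bit magnitude

# Every 126-bit magnitude fits in 38 digits...
print(largest_126 <= MAX_38_DIGITS, len(str(largest_126)))
# ...but the upper part of the 127-bit range needs 39 digits.
print(largest_127 <= MAX_38_DIGITS, len(str(largest_127)))
# The limit itself falls between 2^126 and 2^127.
print(MAX_38_DIGITS.bit_length())
```

The floor in the formula therefore gives the widest bit length for which every value fits, not the number of bits needed to write 10^38-1 itself.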
If you want to live dangerously, you can store a 127-bit unsigned int too, but there are some caveats:
You'd need to avoid (or at least be very careful) using such a number as a sort key, because values with a most-significant-bit of 1 will sort as negative numbers.
Your app will need to convert unsigned ints to signed ints when storing them or querying for them in DynamoDB, and will also need to convert them back to unsigned after reading data from DynamoDB.
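The conversion in the second caveat could look like the sketch below. It assumes a two's-complement style mapping of the 127-bit unsigned range onto signed values; the function names are hypothetical and nothing here calls DynamoDB.

```python
BITS = 127
HALF = 1 << (BITS - 1)   # 2^126: first value whose top bit is set
FULL = 1 << BITS         # 2^127

def to_stored(u: int) -> int:
    """Map an unsigned 127-bit int to the signed value written to the Number attribute."""
    if not 0 <= u < FULL:
        raise ValueError("value does not fit in 127 unsigned bits")
    return u - FULL if u >= HALF else u

def from_stored(s: int) -> int:
    """Inverse mapping, applied after reading the attribute back."""
    return s + FULL if s < 0 else s

for u in (0, HALF - 1, HALF, FULL - 1):
    s = to_stored(u)
    print(u, "->", s, "->", from_stored(s))
```

Note how HALF, the smallest value with the top bit set, is stored as a negative number and so sorts before 0. That is the sort-key hazard from the first caveat.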
If it were me, I wouldn't take these risks for one extra bit without a very, very good reason.
One logical question is whether 126 (or 127 given the caveats above) is good enough to store a UUID. The answer is: it depends. If you are in control of the UUID generation, then you can always shave a bit or two from the UUID and store it. If you shave from the 4 "version" bits (see format here) then you may not be losing any entropy at all if you are always generating UUIDs with the same version.
However, if someone else is generating those UUIDs AND is expecting lossless storage, then you may not be able to use a Number to store the UUID. But you may be able to store it if you restrict clients to a whitelist of 4-8 UUID versions. The largest version now is 5 out of a 0-15 range, and some of the older versions are discouraged for privacy reasons, so this limitation may be reasonable depending on your clients and whether they adhere to the version bits as defined in RFC 4122.
BTW, I was surprised that this bit-limit question wasn't already online... at least not in an easily-Google-able place. So contributing this Q&A pair so future searchers can find it.

What is the key size of the DES-EDE-ECB cipher?

I know that DES has a key length of 56 bits, but what does the EDE mean and does it affect the key length?
In OpenSSL there is the des-ede-cbc option.
Triple DES, DES-EDE or TDEA (formally speaking) can be used with no less than 3 key sizes.
The most logical form uses 3 separate keys for each of the phases (Encrypt, Decrypt and then Encrypt again, which is the meaning of EDE). It has a key size of 3 times 56 bits or 168 bits, but those are usually encoded with parity bits (the least significant bit of each byte), making 192 bits in total. Due to a meet-in-the-middle attack (already known at the design phase) the security is only around 112 bits, so don't be fooled by the key size alone. Generally we aim for 128 bit or higher security. This is sometimes called DES-ABC - as in DES with distinct keys A, B and C.
The two key DES-EDE uses the same keys for the Encrypt phases. The key size is therefore 112 bits, encoded as 128 bits and a security of just around 80 bits, due to various attacks. For some attacks it might even be reduced to just over 63 bits. 80 bits is probably just a bit on the short side nowadays and it isn't recommended by NIST anymore. It is called the ABA key scheme, and technically you'd use BAB for decryption.
Finally single key DES-EDE is mainly used for backwards compatibility. The first encrypt and decrypt (or decrypt and second encrypt) cancel each other out so you're left with just one encrypt. You can guess the key size: 56 bits. Single DES can be easily brute forced, especially when hardware support is used. Single key TDES is never used in software and may not be supported (it just makes sense in hardware, where you don't want to supply a separate implementation of DES in addition to DES-EDE). I guess you'd call the key scheme AAA, but I haven't seen that name around at all.
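The three keying options can be told apart from the encoded key bytes alone. The sketch below is an illustration, not any library's API: it assumes 8-, 16- or 24-byte encoded keys, ignores the parity bit of each byte, and reports the raw key size, not the lower effective security discussed above.

```python
from typing import Tuple

def tdea_key_option(key: bytes) -> Tuple[str, int]:
    """Classify an encoded DES-EDE key as ABC, ABA or AAA and give its raw key bits."""
    if len(key) not in (8, 16, 24):
        raise ValueError("expected 8, 16 or 24 key bytes")
    # Drop the parity bit (least significant bit of each byte), 8 bytes per subkey.
    parts = [bytes(b & 0xFE for b in key[i:i + 8]) for i in range(0, len(key), 8)]
    if len(parts) == 1:
        k1 = k2 = k3 = parts[0]                   # single key: A, A, A
    elif len(parts) == 2:
        k1, k2, k3 = parts[0], parts[1], parts[0] # two keys: A, B, A
    else:
        k1, k2, k3 = parts
    if k1 == k2 or k2 == k3:
        # An Encrypt/Decrypt pair with equal keys cancels out: single DES remains.
        return ("AAA", 56)
    if k1 == k3:
        return ("ABA", 112)
    return ("ABC", 168)

print(tdea_key_option(bytes(range(24))))  # three distinct subkeys
print(tdea_key_option(bytes(range(16))))  # two subkeys
print(tdea_key_option(bytes(8)))          # one subkey
```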
DES-EDE is much slower than a good implementation of AES, and AES has a security of around 126 bits for a key size of 128 bits (using a very complicated attack). So if you have any chance, choose AES instead. AES has other advantages as well, such as the larger block size and lack of weak keys.

Is it possible to tell which hash algorithm generated these strings?

I have pairs of email addresses and hashes, can you tell what's being used to create them?
aaaaaaa#aaaaa.com
BeRs114JrR0sBpueyEmnOWZfnLuigYTA
and
aaaaaaaaaaaaa.bbbbbbbbbbbb#cccccccccccc.com
4KoujQHr3N2wHWBLQBy%2b26t8GgVRTqSEmKduST9BqPYV6wBZF4IfebJS%2fxYVvIvR
and
r.r#a.com
819kwGAcTsMw3DndEVzu%2fA%3d%3d
First, the obvious even if you know nothing about cryptography: the percent signs are URL encoding; decoding that gives
BeRs114JrR0sBpueyEmnOWZfnLuigYTA
4KoujQHr3N2wHWBLQBy+26t8GgVRTqSEmKduST9BqPYV6wBZF4IfebJS/xYVvIvR
819kwGAcTsMw3DndEVzu/A==
And that in turn is base64. The lengths of the encodings wrt the length of the original strings are
plaintext length (bytes)   encoded length (bytes)
17                         24
43                         48
10                         16
More samples would give more confidence, but it's fairly clear that the encoding pads the plaintext to a multiple of 8 bytes. That suggests a block cipher (it can't be a hash since a hash would be fixed-size). The de facto standard block algorithm is AES which uses 16-byte blocks; 24 is not a multiple of 16 so that's out. The most common block algorithm with a block size of 8 (which fits the data) is DES; 3DES or blowfish or something even rarer is also a possibility but DES is what I'd put my money on.
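The decoding steps and the length check can be reproduced with the standard library. A minimal sketch, using the three sample strings from the question as-is:

```python
import base64
from urllib.parse import unquote

samples = [
    "BeRs114JrR0sBpueyEmnOWZfnLuigYTA",
    "4KoujQHr3N2wHWBLQBy%2b26t8GgVRTqSEmKduST9BqPYV6wBZF4IfebJS%2fxYVvIvR",
    "819kwGAcTsMw3DndEVzu%2fA%3d%3d",
]

def decoded_length(sample: str) -> int:
    """Undo the URL encoding, then the base64, and return the raw byte count."""
    return len(base64.b64decode(unquote(sample)))

lengths = [decoded_length(s) for s in samples]
print(lengths)
print("multiples of 8: ", all(n % 8 == 0 for n in lengths))
print("multiples of 16:", all(n % 16 == 0 for n in lengths))
```

All three lengths are multiples of 8, but the first is not a multiple of 16, which is what rules out a 16-byte block cipher such as AES.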
Since it's a cipher, there must be a key somewhere. It might be in a configuration file, or hard-coded in the source code. If all you have is the binary, you should be able to locate it with the help of a debugger. With DES, you could find the key by brute force (because a key is only 56 bits and that's doable by renting a bit of CPU time on Amazon) but finding it in the program would be easier.
If you want to reproduce the algorithm then you'll also need to figure out the mode of operation. Here one clue is that the encoding is never more than 7 bytes longer than the plaintext, so there's no room for an initialization vector. If the developers who made that software did a horrible job they might have used ECB. If they did a slightly less horrible job they might have used CBC or (much less likely) some other mode with a constant IV. If they did an even slightly less horrible job then the IV may be derived from some other characteristic of the account. You can refine the analysis by testing some patterns:
If the encoding of abcdefghabcdefgh#example.com (starting with two identical 8-byte blocks) starts with two identical 8-byte blocks, it's ECB.
If the encoding of abcdefgh1#example.com and abcdefgh2#example.com (differing at the 9th character) have identical first blocks, it's CBC (probably) with a constant IV.
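Once you have the raw (URL-decoded, base64-decoded) bytes for such chosen inputs, the two pattern tests reduce to block comparisons. A sketch with hypothetical helper names, assuming the 8-byte block size inferred above:

```python
from typing import List

BLOCK = 8  # block size inferred from the sample lengths

def blocks(data: bytes, size: int = BLOCK) -> List[bytes]:
    """Split ciphertext into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def looks_like_ecb(ciphertext: bytes) -> bool:
    """True if the first two blocks are identical (for a plaintext whose first two blocks are identical)."""
    b = blocks(ciphertext)
    return len(b) >= 2 and b[0] == b[1]

def shares_first_block(c1: bytes, c2: bytes) -> bool:
    """True if two ciphertexts start with the same block (for plaintexts that differ only after byte 8)."""
    return blocks(c1)[0] == blocks(c2)[0]

# Made-up byte strings standing in for real ciphertexts:
print(looks_like_ecb(b"ABCDEFGH" * 2 + b"12345678"))
print(shares_first_block(b"ABCDEFGH" + b"x" * 8, b"ABCDEFGH" + b"y" * 8))
```

A shared first block is also what ECB would produce, so the second test only points to CBC with a constant IV when the first test has come back negative.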
Another thing you'll need to figure out is the padding mode. There are a few common ones. That's a bit harder to figure out as a black box except with ECB.
There are some tools online, and also some open source projects. For example:
https://code.google.com/archive/p/hash-identifier/
http://www.insidepro.com/

Entire range - Reverse MD5 lookup

I am learning about encryption methods and I have a question about MD5.
I have seen there are several websites that have 'rainbow tables' that will give you reverse MD5 lookup, but, they can't lookup all the combinations possible.
For knowledge's sake, my question is this :
Hypothetically, if a group of people were to consider an upper limit (eg. 5 or 6 characters) and decide to map out the entire MD5 hash for all the values inside that range, storing the results in a database to use for reverse lookup.
1. Do you think such a thing is probable.
2. If you can speculate, what kind of scale of resources would this mean?
3. To your knowledge have there been any public or private attempts to do this?
I am not referring to tables that have select entries based on a dictionary, but mapping the entire range up to a certain number of characters.
(I have referred to This question already.)
1. It is possible. For a small number of characters, it has already been done. In the near future, it will be easy for larger numbers of characters. MD5 isn't getting any stronger.
2. That's a function of time. To reverse the entire 6-or-fewer-character alphanumeric space would require computing about 62^6 entries. That's roughly 57 billion MD5s. That's doable by a determined small group or easy for a government, right now. In the future, it will be doable on a home computer. Remember, though, that as the number of allowable characters or the maximum length increases, the difficulty increase is exponential.
3. People already have done it. But, honestly, it doesn't matter - because anyone with half an ounce of sense uses a random salt. If you precompute the entire MD5 space and reverse it, that doesn't mean jack dandy if someone is using key strengthening or a good salt! Read up on salting.
5 or 6 characters is easy. 6 bytes is doable (that's 2^48 combinations), even with limited hardware.
Namely, a simple Core2 CPU from Intel will be able to hash one password in about 150 clock cycles (assuming you use a SSE2 implementation, which will hash four passwords in parallel in 600 clock cycles). With a 2.4 GHz quad core CPU (that's my PC, not exactly the newest machine available), I can then try about 2^26 passwords per second. For that kind of job, a massively parallel architecture is fine, hence it makes sense to use a GPU. For maybe 200$, you can buy a NVidia video card which will be about four times faster (i.e. 2^28 passwords per second). 6 alphanumeric characters (uppercase, lowercase and digits) are close to 2^36 combinations; trying them all is then a matter of 2^(36-28) = 2^8 seconds, which is less than five minutes. With 6 random bytes, it will need 2^20 seconds, i.e. a bit less than a fortnight.
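The arithmetic behind those estimates can be re-run directly. A sketch using only the figures quoted in the paragraph above (clock speed, core count, cycles per hash, and the "four times faster" GPU); real hardware will differ.

```python
import math

clock_hz = 2.4e9         # 2.4 GHz
cores = 4                # quad core
cycles_per_hash = 150    # per password, with the SSE2 implementation

cpu_rate = clock_hz * cores / cycles_per_hash   # hashes per second
gpu_rate = 4 * cpu_rate                         # "about four times faster"

alnum6 = 62 ** 6         # 6 alphanumeric characters
bytes6 = 256 ** 6        # 6 arbitrary bytes = 2^48

print(f"CPU rate ~ 2^{math.log2(cpu_rate):.1f}/s, GPU rate ~ 2^{math.log2(gpu_rate):.1f}/s")
print(f"62^6 ~ 2^{math.log2(alnum6):.1f}")
print(f"6 alphanumeric chars on the GPU: {alnum6 / gpu_rate / 60:.1f} minutes")
print(f"6 random bytes on the GPU: {bytes6 / gpu_rate / 86400:.1f} days")
```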
That's for the CPU cost. If you want to speed up the actual attack, you store the hash results: thus you will not need to recompute all those hashed passwords every time you attack a password (but you still have to do it once). 2^36 hash results (16 bytes each) mean 1 terabyte. You can buy a harddisk that big for 100$. 2^48 hash results imply 4096 times that storage space; in plain harddisks this will cost as much as a house: a bit expensive for the average bored student, but affordable for most kinds of governmental or criminal organizations.
Rainbow tables are an optimization trick for the storage. In rough terms, you store only one in every t hash results, in exchange for having to do t lookups and t^2 hash computations for every attack. E.g., if you choose t=1000, you only have to buy four harddisks instead of four thousand, but you will need to make 1000 lookups and a million hashes every time you want to crack a password (this will need a dozen seconds at most, if you do it right).
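The storage figures and the rainbow-table trade-off follow from a few multiplications. A rough sketch with made-up helper names: it assumes 16 bytes per stored result and treats the 1/t reduction as exact, which a real rainbow table only approximates.

```python
HASH_BYTES = 16     # one MD5 output
TIB = 2 ** 40       # bytes in one terabyte (binary)

def table_bytes(entries: int, t: int = 1) -> int:
    """Storage needed when only one in every `t` hash results is kept."""
    return entries * HASH_BYTES // t

full_alnum = table_bytes(2 ** 36)   # ~6 alphanumeric characters
full_bytes = table_bytes(2 ** 48)   # 6 arbitrary bytes

t = 1000
print(full_alnum / TIB, "TiB for 2^36 results")
print(full_bytes / TIB, "TiB for 2^48 results")
print(round(table_bytes(2 ** 48, t) / TIB, 1), "TiB with t =", t)
print("per attack:", t, "lookups and", t ** 2, "hash computations")
```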
Hence you have two costs:
The CPU cost is about computing hashes for the complete password space; with a table (rainbow or not) you have to do it once, and then can reuse that computational effort for every attacked password.
The storage cost is about storing the hash results in order to easily attack several passwords. Harddisks are not very expensive, as shown above. Rainbow tables help you lower storage costs.
Salting defeats cost sharing through precomputed tables (whether they are rainbow tables or just plain tables has no effect here: tables are about reusing precomputed values for several attacked passwords, and salts prevent such recycling).
The CPU cost can be increased by defining that the hash procedure is not just a single hash computation; for instance, you can define the "password hash" as applying MD5 over the concatenation of 10000 copies of the password. This will make each attacker guess about one thousand times more expensive. It also makes legitimate password validation one thousand times more expensive, but most users will not mind (the user has just typed his password; he cannot really see whether the password verification took 10ms or 10µs).
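As an illustration of combining a salt with a deliberately slow hash, here is a minimal sketch. It swaps in PBKDF2-HMAC-SHA256 from Python's standard library instead of the "10000 copies through MD5" construction described above; the iteration count and salt length are arbitrary example values, not recommendations.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 10_000   # example work factor, chosen only for illustration
SALT_BYTES = 16

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key); a fresh random salt is drawn when none is given."""
    if salt is None:
        salt = os.urandom(SALT_BYTES)
    derived = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, derived

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute with the stored salt and compare in constant time."""
    _, derived = hash_password(password, salt)
    return hmac.compare_digest(derived, expected)

salt, stored = hash_password("abc123")
print(verify_password("abc123", salt, stored), verify_password("abc124", salt, stored))
```

Because every password gets its own salt, a table precomputed for one salt is useless for any other, and the iteration count multiplies the cost of every guess.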
Modern Unix-like systems (e.g. Linux) use "MD5" passwords which actually combine salting and iterated hashing, as described above. (Actually, a modern Linux system may use another hash function, such as SHA-256, but that does not change things much here.) So precomputed tables will not help, and the on-the-fly password cracking is expensive. A password with 6 alphanumeric characters can still be cracked within a few days, because 6 characters are kind of weak anyway. Also, many longer passwords are crackable because it turns out that human beings are bad at remembering passwords; hence they will not choose just any random sequence of characters, they will select passwords which have some "meaning". This reduces the space of possible passwords.
It's called a rainbow table, and it's easily defeated with salting.
Yes, it is not only probable, but it's probably been done before.
It depends on whether they are mapping the entire possible range or just a range of ASCII characters. Let's say you need 128 bits + 6 bytes to store each match. That's 22 bytes. You'd need:
6.32 GB to store all lowercase alphabetic combinations [a-z]
405 GB for all alphabetic combinations [a-zA-Z]
1.13 TB for all alphanumeric combinations [a-zA-Z0-9]
5.24 TB for all combinations that consists of letters, numbers and 18 symbols.
As you see, it increases exponentially, but even at 5.24 TB that's nothing to agencies like, say, the NSA or the CIA. They probably have done it.
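The sizes in that list can be reproduced as below. A sketch under two assumptions: only strings of exactly 6 characters are counted, and "GB"/"TB" are read as binary units (2^30 and 2^40 bytes), which is the reading that matches the quoted figures.

```python
ENTRY_BYTES = 16 + 6   # 128-bit MD5 plus the 6-byte plaintext
LENGTH = 6

alphabets = {
    "[a-z]": 26,
    "[a-zA-Z]": 52,
    "[a-zA-Z0-9]": 62,
    "letters, digits and 18 symbols": 80,
}

def table_size_bytes(alphabet_size: int) -> int:
    """Bytes needed to store every 6-character string with its hash."""
    return ENTRY_BYTES * alphabet_size ** LENGTH

for name, size in alphabets.items():
    total = table_size_bytes(size)
    print(f"{name}: {total / 2 ** 30:,.2f} GiB = {total / 2 ** 40:.2f} TiB")
```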
As everyone else said, salting can easily defeat rainbow tables and that's almost as important as hashing. Read this: Just hashing is far from enough - How to position against dictionary and rainbow attacks

Resources