Entire range - Reverse MD5 lookup - encryption

I am learning about encryption methods and I have a question about MD5.
I have seen that several websites offer 'rainbow tables' for reverse MD5 lookup, but they can't look up all possible combinations.
For knowledge's sake, my question is this:
Hypothetically, suppose a group of people were to pick an upper limit (e.g. 5 or 6 characters) and map out the MD5 hash of every value within that range, storing the results in a database to use for reverse lookup.
1. Do you think such a thing is probable?
2. If you can speculate, what kind of scale of resources would this mean?
3. To your knowledge have there been any public or private attempts to do this?
I am not referring to tables that have select entries based on a dictionary, but mapping the entire range up to a certain number of characters.
(I have referred to This question already.)

It is possible. For a small number of characters, it has already been done. In the near future, it will be easy for larger numbers of characters. MD5 isn't getting any stronger.
That's a function of time. To reverse the entire 6-or-fewer-character alphanumeric space would require computing roughly 62^6 entries. That's about 57 billion MD5 hashes. That's doable by a determined small group or easy for a government, right now. In the future, it will be doable on a home computer. Remember, though, that as the number of allowable characters or the maximum length increases, the difficulty grows exponentially.
People already have done it. But, honestly, it doesn't matter - because anyone with half an ounce of sense uses a random salt. If you precompute the entire MD5 space and reverse it, that doesn't mean jack dandy if someone is using key strengthening or a good salt! Read up on salting.
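For a quick sanity check on those numbers, here is a small back-of-the-envelope sketch; the 10^9 MD5/s hash rate and 16-byte table entries are assumptions of mine, not figures from the answer above.

    # All alphanumeric strings of length 1..6, and what enumerating/storing
    # them would cost under the assumed rate and entry size.
    total = sum(62 ** n for n in range(1, 7))
    print(f"{total:,} candidate passwords")                  # ~57.7 billion
    print(f"{total / 1e9 / 60:.1f} minutes at 1e9 MD5/s")    # about a minute of hashing
    print(f"{total * 16 / 1e12:.2f} TB to store every 16-byte digest")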

5 or 6 characters is easy. 6 bytes is doable (that's 2^48 combinations), even with limited hardware.
Namely, a simple Core2 CPU from Intel will be able to hash one password in about 150 clock cycles (assuming you use an SSE2 implementation, which will hash four passwords in parallel in 600 clock cycles). With a 2.4 GHz quad core CPU (that's my PC, not exactly the newest machine available), I can then try about 2^26 passwords per second. For that kind of job, a massively parallel architecture is fine, hence it makes sense to use a GPU. For maybe $200, you can buy an NVidia video card which will be about four times faster (i.e. 2^28 passwords per second). 6 alphanumeric characters (uppercase, lowercase and digits) are close to 2^36 combinations; trying them all is then a matter of 2^(36-28) seconds, which is less than five minutes. With 6 random bytes, it will need 2^20 seconds, i.e. a bit less than a fortnight.
That's for the CPU cost. If you want to speed up the actual attack, you store the hash results: thus you will not need to recompute all those hashed passwords every time you attack a password (but you still have to do it once). 2^36 hash results (16 bytes each) mean 1 terabyte. You can buy a hard disk that big for $100. 2^48 hash results imply 4096 times that storage space; in plain hard disks this will cost as much as a house: a bit expensive for the average bored student, but affordable for most kinds of governmental or criminal organizations.
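A quick way to sanity-check those figures, reusing the 2^28 hashes/second GPU rate and 16-byte entries from the text above:

    rate = 2 ** 28                                   # assumed GPU rate, hashes/second
    for name, count in [("6 alphanumeric (~2^36)", 2 ** 36),
                        ("6 random bytes (2^48)", 2 ** 48)]:
        seconds = count / rate
        storage_tb = count * 16 / 2 ** 40            # 16 bytes per stored hash
        print(f"{name}: {seconds:,.0f} s to enumerate, {storage_tb:,.0f} TB to store")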
Rainbow tables are an optimization trick for the storage. In rough terms, you store only one out of every t hash results, in exchange for having to do t lookups and t^2 hash computations for every attack. E.g., if you choose t=1000, you only have to buy four hard disks instead of four thousand, but you will need to make 1000 lookups and a million hashes every time you want to crack a password (this will need a dozen seconds at most, if you do it right).
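For intuition, here is a toy sketch of that time/memory trade-off. It uses simplified Hellman-style chains with a single fixed reduction function (real rainbow tables vary the reduction at each step to avoid chain merges), and the starting passwords, names and chain length are illustrative choices of mine.

    import hashlib

    T = 1000                                         # chain length (the t above)

    def reduce(digest, length=6):
        """Map a hash back into the password space (toy reduction function)."""
        return digest.hex()[:length]

    def chain_end(start):
        p = start
        for _ in range(T):                           # T hash+reduce steps
            p = reduce(hashlib.md5(p.encode()).digest())
        return p

    # Precompute: store only (endpoint -> start), i.e. 1/T of the hashed space.
    table = {chain_end(s): s for s in ("abc123", "qwerty", "zz9top")}

    def lookup(target):
        """Invert `target` using stored endpoints: up to T lookups, ~T^2 hashes."""
        for k in range(T):                           # guess the distance from the chain end
            p = reduce(target)
            for _ in range(k):
                p = reduce(hashlib.md5(p.encode()).digest())
            if p in table:                           # candidate chain found: replay it
                q = table[p]
                for _ in range(T):
                    h = hashlib.md5(q.encode()).digest()
                    if h == target:
                        return q
                    q = reduce(h)
        return None

    print(lookup(hashlib.md5(b"qwerty").digest()))   # -> 'qwerty'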
Hence you have two costs:
The CPU cost is about computing hashes for the complete password space; with a table (rainbow or not) you have to do it once, and then can reuse that computational effort for every attacked password.
The storage cost is about storing the hash results in order to easily attack several passwords. Harddisks are not very expensive, as shown above. Rainbow tables help you lower storage costs.
Salting defeats cost sharing through precomputed tables (whether they are rainbow tables or just plain tables has no effect here: tables are about reusing precomputed values for several attacked passwords, and salts prevent such recycling).
The CPU cost can be increased by defining that the hash procedure is not just a single hash computation; for instance, you can define the "password hash" as applying MD5 over the concatenation of 10000 copies of the password. This will make each attacker guess ten thousand times more expensive. It also makes legitimate password validation ten thousand times more expensive, but most users will not mind (the user has just typed his password; he cannot really tell whether the password verification took 10 ms or 10 µs).
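A minimal sketch of salting plus iterated hashing as described above (my own illustration; a real system should use a vetted construction such as bcrypt, scrypt or PBKDF2 rather than a hand-rolled MD5 loop):

    import hashlib, os

    def hash_password(password, salt=None, iterations=10000):
        if salt is None:
            salt = os.urandom(16)           # random per-user salt defeats precomputed tables
        digest = salt + password.encode()
        for _ in range(iterations):         # iteration multiplies the attacker's per-guess cost
            digest = hashlib.md5(digest).digest()
        return salt, digest

    def verify(password, salt, expected, iterations=10000):
        return hash_password(password, salt, iterations)[1] == expected

    salt, stored = hash_password("hunter2")
    print(verify("hunter2", salt, stored))  # True
    print(verify("hunter3", salt, stored))  # False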
Modern Unix-like systems (e.g. Linux) use "MD5" passwords which actually combine salting and iterated hashing, as described above. (Actually, a modern Linux system may use another hash function, such as SHA-256, but that does not change things much here.) So precomputed tables will not help, and on-the-fly password cracking is expensive. A password with 6 alphanumeric characters can still be cracked within a few days, because 6 characters are kind of weak anyway. Also, many longer passwords are crackable because it turns out that human beings are bad at remembering passwords; hence they will not choose just any random sequence of characters, they will select passwords which have some "meaning". This reduces the space of possible passwords.

It's called a rainbow table, and it's easily defeated with salting.

Yes, it is not only probable, but it's probably been done before.
It depends on whether they are mapping the entire possible range or just a range of ASCII characters. Let's say you need 128 bits (the MD5 hash) + 6 bytes (the plaintext) to store each match. That's 22 bytes. You'd need:
6.32 GB to store all lowercase alphabetic combinations [a-z]
405 GB for all alphabetic combinations [a-zA-Z]
1.13 TB for all alphanumeric combinations [a-zA-Z0-9]
5.24 TB for all combinations that consist of letters, numbers and 18 symbols.
As you see, it increases exponentially, but even at 5.24 TB that's nothing to agencies like, say, the NSA or the CIA. They probably have done it.
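These figures are easy to reproduce; the sketch below (mine, not the answerer's) assumes exactly-6-character passwords, 22 bytes per entry and binary GB/TB:

    ENTRY = 16 + 6                           # 128-bit MD5 digest + 6-byte plaintext
    charsets = {"[a-z]": 26, "[a-zA-Z]": 52, "[a-zA-Z0-9]": 62,
                "letters, digits and 18 symbols": 80}
    for name, size in charsets.items():
        total = size ** 6 * ENTRY
        print(f"{name}: {total / 2**30:.2f} GB ({total / 2**40:.2f} TB)")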
As everyone else said, salting can easily defeat rainbow tables and that's almost as important as hashing. Read this: Just hashing is far from enough - How to position against dictionary and rainbow attacks

Related

Building a unique ID without collisions

I'm playing around with system design and have been reading up on URL shorteners. I realize there are many questions around this topic, but I have some specific questions with respect to hashing and the order in which I hash + encode.
Input: https://example.com/owjpojwepofjwpoejfpwjepfojpwejfp/wefoijhwioejfiowef/weoifhwoiehjfiowef
Output: https://example.com/abr4fna
If I run this input through md5 I get the following: 9e91e9c2a7ce0f0d11b475d2abfb8593. Clearly, this exceeds the length that I want, so I could truncate the substring from (0,7]. The problem is, to some degree, I can still have a collision, since the prefix of the md5 is not guaranteed to be unique as the number of URLs generated within the service increases.
I do not want to have to check the database if I've already used this ID before as that would increase the amount of reads I'm doing proportional to the number of writes I'm doing. In addition, there could be concurrency issues as I grow the number of application servers doing the hash generation and storage.
I see people mentioning the use of base64 encoding the output hash, but what value does this add after the hash? Is it because I grow the amount of unique combinations by 64^n where n is the length of my hash versus md5 being only 36^n?
Thanks. Just interested in having this discussion.
edit:
As I understand it, we're doing the encoding purely to ensure we do not have transmission failures if the receiving system has issues interpreting binary data from the output hash - so it's used purely for the sake of display.
By definition, you cannot hash a large domain and expect to get a smaller domain without collisions. A hash is useful because it is one-way and would require a computationally infeasible amount of tries to find those collisions. However, with a 7 character output and a large input domain, it will be exceptionally easy to generate collisions even by chance.
You're currently using 7 hexadecimal digits. Each hexadecimal digit represents 4 bits. So you have 28 bits or 2^28 possible values. That's around 256 million possible values. So if you guess long enough you'll get a collision soon enough. With base64 you'd have 6 bits per character instead (2^6 = 64, hence the name). That means that you increase the bit size by 7 * 2 = 14 bits, or around 16 thousand times as many values, but you'd still be pretty far from collision free.
Actually, for any cryptographic reassurance once you take the birthday bound into account, the 16-byte output of MD5 is about the absolute minimum size of hash you want in order to avoid collisions. Of course, MD5 hasn't been deprecated for nothing; you'd really want to use SHA-256.
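The birthday bound mentioned above is easy to evaluate: a collision is expected after roughly sqrt(pi/2 * 2^bits) random hashes. A small sketch of mine, where the bit sizes correspond to 7 hex characters, 7 base64 characters and a full MD5:

    import math

    for bits in (28, 42, 128):
        expected = math.sqrt(math.pi / 2 * 2 ** bits)
        print(f"{bits} bits: collision expected after ~{expected:.3g} hashes")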

Encrypting small messages

I need to implement a coupon-code feature. Because of the number of codes required and some other constraints, I can't store them in a database. In addition, the displayed codes need to be short (around 10 characters).
My original idea was to use a cryptographic function to create codes by encrypting an ongoing counter, but I'm at a loss as to what method to use.
Because of the counter I would be encoding only a couple of bytes, and I am aware that many algorithms are not secure when used with very short messages.
Is my approach a good idea?
What algorithm could I use?
I'm not sure if this is what you're after, and as per my comment, you have no real guarantee of security, but one possible answer could be to seed a PRNG with some number and give out the first x numbers as codes. As long as x is much smaller than the total possible number of outcomes, the chance of repetition is small, and codes could be validated by re-generating the sequence (you may want to hash parts of it for speed purposes).
If you use base 62 ([a-zA-Z0-9]) with 10 characters, there are over 839 quadrillion possible outcomes. If you were to give everyone on the planet a unique code, you would have used roughly 0.0000009% of your addressable space.
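A minimal sketch of that idea: derive the codes from a secret seed so they can be regenerated for validation without a database. The seed value and function names are illustrative, and Python's random module is not cryptographically secure, which matches the caveat above about having no real guarantee of security.

    import random, string

    ALPHABET = string.ascii_letters + string.digits          # base 62
    SECRET_SEED = 0xC0FFEE                                    # keep this secret

    def generate_codes(count, length=10):
        rng = random.Random(SECRET_SEED)                      # deterministic sequence
        return ["".join(rng.choice(ALPHABET) for _ in range(length))
                for _ in range(count)]

    def is_valid(code, count=1000):
        """Validate by regenerating the sequence instead of hitting a database."""
        return code in set(generate_codes(count))

    codes = generate_codes(1000)
    print(codes[0], is_valid(codes[0]))                       # a real code validates
    print(is_valid("AAAAAAAAAA"))                             # an arbitrary guess does not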

256-bit encryption with small random alphanumeric password

We are considering using email to transmit PDFs containing personal health-related information ("PHI"). There is nothing of commercial value, no social security numbers or credit card numbers, or anything like that in these documents. Only recommendations for treatment of medical conditions.
The PDFs would be password-encrypted using Adobe Acrobat Pro's 256-bit password encryption.
Using very long passwords is not logistically desirable because the recipient of the emails with PDF attachment is the patient, not a technical person. We want to make the password easy-to-type, and yet not so short that any desktop PC has the CPU capacity to crack it in a few minutes.
If a password does not use any dictionary words but is simply a four-character random ASCII alphanumeric string, like DT4K (alphas all uppercase, not mixed), how long would it take a typical desktop business or home computer with no specialized hardware to crack the encryption? Does going to 5 characters significantly increase the cracking time?
Short answer: no, and no.
Longer answer: alphanumeric means A-Za-z0-9, right? That's 62 possible characters, or 5.95 bits of entropy. Since entropy is additive, 4 characters are roughly 24 bits, and 5 are about 30. To put that into perspective, 10 bits mean the attacker has to try about a thousand possible keys, 20 bits a million, 30 bits about a billion. That's almost nothing these days. 56-bit DES was cracked using brute force in 1998; today people worry that 128-bit AES might not be safe enough.
If I were you, I'd try to use something like diceware. That's a list of 7776 easily pronounced words. You can use a random number generator to pick a passphrase from these words, and each word will have about 12.9 bits of entropy. So 5 words are about 65 bits, which for the kind of data you have might be an acceptable level of security, while being easily remembered or communicated via phone.
Why 7776 words? Well, 7776 is 6*6*6*6*6, so you can roll a die five times and get a number, and just look up the corresponding word on the list.
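A minimal sketch of picking such a passphrase with a cryptographically secure generator; the file name "wordlist.txt" is an assumption of mine, standing in for the official 7776-word diceware list:

    import secrets

    with open("wordlist.txt") as f:                  # 7776 words, one per line
        words = [line.split()[-1] for line in f if line.strip()]

    passphrase = " ".join(secrets.choice(words) for _ in range(5))
    print(passphrase)                                # ~5 * 12.9 ≈ 65 bits of entropy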
My bank sends statements encrypted and uses a combination of my name and birth date. I'm not a huge fan of that idea, but provided you use information that's unlikely to be known to an attacker you'll get a greater level of security than from four or five character alphanumeric passwords.
This would take less than 25 seconds even with the most rudimentary tools. There are precompiled rainbow tables for passwords this short that can run in seconds on decent PCs. Password length, NOT complexity, is what makes a password difficult to crack. I would highly recommend giving them a longer password, but make it something easily recalled. Maybe your entire business name salted with your street address number at the end. Please take at least some precautions. Having a four-character password is barely better than not having one at all.
How Strong is your Password?

How can I create my own GUID algorithm with smaller "global"?

I have my own application with a far smaller "global" than the real globe, and I want a shorter version of a GUID. Now suppose I have a concrete number of IDs that I estimate I will never exceed (for example, 100 million IDs). How can I determine the number of random bits required to have the same properties as a GUID (globally unique, requiring no central authority to generate one)? Using a normal GUID would be overkill.
My "overkill" refers to this: I need the ID to be as easy to type, say, or write down as possible, while at the same time having an astronomically low collision chance, like a GUID. I heard a GUID could be assigned to every grain of sand on earth. My application is a game; each player gets one generated ID, and obviously my player count is nowhere near the amount of sand on earth.
It would be best if a player could say something like "My ID is XXXX-XXXX". In that case, I'm not sure whether 8 characters of randomized hex is enough, or too much, for 100 million players. (In reality I encode it as A-Z 0-9 instead of hex, though.) My game is not online-only, so I would like each player to be able to obtain a unique ID even when offline (no server to check for ID collisions).
GUIDs were designed to be globally unique, but I don't know why that results in a 128-bit sequence. Maybe they just chose a "very large" size that is a power of 2? I don't know what they were thinking when designing GUIDs to ensure they will not clash. (Did they use the world population times something? If that is the case, I can use 10 million times something too.)
A 128-bit guid will generally perform well, because most compilers are smart enough to reduce operations on it to a pair of 64-bit operations (and on some CPUs, a single 128-bit extended operation). Java and C#/VB.NET would likely have quite a bit more overhead than C++, but if you are using Java or C#/VB.NET, you've already accepted quite a bit more overhead, and a GUID won't add much to it.
However, if you really need smaller values, you could manually reduce GUIDs, by XOR-ing the upper 64 bits with the lower 64 bits (thereby preserving some of the uniqueness of the original) to create a compact 64-bit mostly-unique number.
You could reduce to 32-bit or 48-bit in a similar way, always deriving the result from the full original GUID. This has the advantage that you are starting out with a number that is intended to be unique across a very large set. However, keep in mind that 100 million items require a fairly high number of bits to preserve a non-overlapping guarantee, so you may just be setting yourself up for a very difficult-to-find problem later on if you aren't careful.
A crude but probably equally effective approach is to use a cryptographically-secure random number generator and construct a number as large as you need (probably minimum 48-bit). It is important not to do modulo operations on the results, or you could significantly reduce the uniqueness (due to the period of the random number generator).
I am assuming you cannot use a sequential id, although you may want to revisit that idea and see if there is a way to make a sequential id work. For example, you could use a sequential id paired with a random seed number, guaranteeing uniqueness without requiring a large number, and allowing internal indexing operations and similar optimizations that are common with large data sets.
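A minimal sketch of the XOR-folding idea above (my own illustration, using Python's uuid module for the underlying random 128-bit GUID):

    import uuid

    def compact_id():
        """XOR the upper and lower 64 bits of a random GUID into one 64-bit value."""
        g = uuid.uuid4().int                          # 128 random bits (version 4 UUID)
        return (g >> 64) ^ (g & ((1 << 64) - 1))      # fold the halves together

    print(f"{compact_id():016x}")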
OK, I discussed this with a friend and came up with a solution. This is how we decided the number of "characters" in my game ID.
A character would consist of 0-9 and A-Z instead of hex; that's 36 kinds of characters. We took out 0, O, 1 and I so the ID stays readable in a variety of fonts without confusion, which leaves 32 kinds of characters.
Then, if every character is pseudo-randomized, how many players can we safely have?
We used the birthday paradox's square approximation. The formula on that page indicates how many people are necessary to have a 50% chance of 2 people colliding; it is 22.99 people for the birthday problem (365 possible choices).
Now we substitute 32^(number of characters) into the equation instead of 365, which tells us how many players produce a 50% chance of 2 players having the same ID; a quick calculation is sketched below.
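For reference, here is that square approximation, n ≈ sqrt(2 · d · ln 2) with d = 32^length, evaluated for a few ID lengths (the code and the chosen length range are mine):

    import math

    def players_for_50pct(length, alphabet=32):
        d = alphabet ** length                        # number of possible IDs
        return math.sqrt(2 * d * math.log(2))         # birthday square approximation

    for length in range(6, 11):
        print(f"{length} chars: ~{players_for_50pct(length):,.0f} players")
    # 9 chars gives roughly 7 million players, matching the figure below.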
Finally, we agreed on a 9-character ID, so the game can register up to about 6.9 million players before there is a 50% chance that just 2 of them have the same ID.
The game isn't even online-only! A collision only matters if those 2 players are still actively playing at the same time and decide to send scores to the scoreboard in the same week, because of the weekly score reset. So the actual number of players the game can hold would be somewhat higher than that. (The game will probably never have that many players; it is just the small happy dream of every game startup. Well, at least the computation was fun.)
It will probably look like this for easier reading: 5XT-339-A67

sha1sum duplicate files

Can sha1sum return the same result for two files that are different?
I am asking this both from a theoretical and practical point of view.
Yes. The point of a hash is to avoid meaningful collisions: a small change to a file results in a large change to the hash value, so it is hard for an attacker to generate a collision deliberately.
Think about it: a SHA-1 hash is 160 bits. It is therefore unable to represent all the possible states of any file larger than 160 bits!
Mathematically, all hash functions have collisions -- that is, two inputs can return the same hash. This is true for any function that has an input of N bits and an output of M bits, where M<N.
A cryptographically sound hash algorithm, however, produces a hash that is so unpredictable, it would take millions of years to guess your way to an input that produces a particular hash. Some hash algorithms have known weaknesses that make this easier; when this happens the hash is considered 'broken' and everybody is supposed to switch to a new, better algorithm.
Though I don't follow crypto news that closely, SHA-1 is considered broken, so in theory if you can find the right tools, you can generate two files with the same SHA1 output.
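To see the pigeonhole/birthday effect concretely, here is a small sketch of my own that finds a collision on a hash truncated to a few bytes; the full 160-bit output is what pushes the same search far beyond practical reach, absent the analytic attacks mentioned below.

    import hashlib

    def find_truncated_collision(nbytes=3):
        """Hash counters until two inputs share the same first `nbytes` of SHA-1."""
        seen = {}
        i = 0
        while True:
            prefix = hashlib.sha1(str(i).encode()).digest()[:nbytes]
            if prefix in seen:
                return seen[prefix], i, prefix.hex()
            seen[prefix] = i
            i += 1

    print(find_truncated_collision())   # typically a few thousand tries for 24 bits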
SHA-1 produces a hash that's only 160 bits long. If the files are longer than 160 bits, the hash can no longer represent all possible values of the files, so if you were to try all possible values in the files, some collisions would be inevitable.
There is a known attack for creating collisions with SHA-1, though it is quite expensive to carry out. That attack, however, only lets the attacker find two values of his own construction that produce the same result (a collision). It does not let the attacker create a file that produces the same hash as a file I've already created (a second preimage).
As far as alternatives go, SHA-256 is not just SHA-1 "widened" to produce a 256-bit result instead of 160 bits -- it's considerably more complex internally (uses six separate functions per round compared to three for SHA-1) which has made cryptanalysis considerably slower and more difficult. Some progress has been made, but most of it is purely theoretical.
For example, the best attack of which I'm aware only works against an intentionally weakened version of SHA-256 (42 rounds instead of the standard 64 rounds). Even at that, it's purely theoretical -- it reduces the work to produce a preimage from the 2^256 that's theoretically expected to 2^251.7 -- which is still far too much to ever carry out.
Strangely, there appears to be considerably less progress in collision attacks against SHA-256. The best attacks I know of only work against around 20 rounds or so (working attacks against 19 rounds, some progress toward attacks up to 23 rounds or so).
Wikipedia on SHA-1 collisions: link. Some real numbers.

Resources