Is it possible to tell which hash algorithm generated these strings? - encryption

I have pairs of email addresses and hashes, can you tell what's being used to create them?
aaaaaaa#aaaaa.com
BeRs114JrR0sBpueyEmnOWZfnLuigYTA
and
aaaaaaaaaaaaa.bbbbbbbbbbbb#cccccccccccc.com
4KoujQHr3N2wHWBLQBy%2b26t8GgVRTqSEmKduST9BqPYV6wBZF4IfebJS%2fxYVvIvR
and
r.r#a.com
819kwGAcTsMw3DndEVzu%2fA%3d%3d

First, the obvious even if you know nothing about cryptography: the percent signs are URL encoding; decoding that gives
BeRs114JrR0sBpueyEmnOWZfnLuigYTA
4KoujQHr3N2wHWBLQBy+26t8GgVRTqSEmKduST9BqPYV6wBZF4IfebJS/xYVvIvR
819kwGAcTsMw3DndEVzu/A==
And that in turn is base64. The lengths of the encodings wrt the length of the original strings are
plaintext encoding
17 24
43 48
10 16
More samples would give more confidence, but it's fairly clear that the encoding pads the plaintext to a multiple of 8 bytes. That suggest a block cipher (it can't be a hash since a hash would be fixed-size). The de facto standard block algorithm is AES which uses 16-byte blocks; 24 is not a multiple of 16 so that's out. The most common block algorithm with a block size of 8 (which fits the data) is DES; 3DES or blowfish or something even rarer is also a possibility but DES is what I'd put my money on.
Since it's a cipher, there must be a key somewhere. It might be in a configuration file, or hard-coded in the source code. If all you have is the binary, you should be able to locate it with the help of a debugger. With DES, you could find the key by brute force (because a key is only 56 bits and that's doable by renting a bit of CPU time on Amazon) but finding it in the program would be easier.
If you want to reproduce the algorithm then you'll also need to figure out the mode of operation. Here one clue is that the encoding is never more than 7 bytes longer than the plaintext, so there's no room for an initialization vector. If the developers who made that software did a horrible job they might have used ECB. If they made a slightly less horrible job they might have used CBC or (much less likely) some other mode with a constant IV. If they did an again slightly less horrible job then the IV may be derived from some other characteristic of the account. You can refine the analysis by testing some patterns:
If the encoding of abcdefghabcdefgh#example.com (starting with two identical 8-byte blocks) starts with two identical 8-byte blocks, it's ECB.
If the encoding of abcdefgh1#example.com and abcdefgh2#example.com (differing at the 9th character) have identical first blocks, it's CBC (probably) with a constant IV.
Another thing you'll need to figure out is the padding mode. There are a few common ones. That's a bit harder to figure out as a black box except with ECB.

There are some tools online, and also some open source projects. For example:
https://code.google.com/archive/p/hash-identifier/
http://www.insidepro.com/

Related

Why do different implementations of AES produce different output?

I feel I have a pretty good understanding of hash functions and the contracts they entail.
SHA1 on Input X will ALWAYS produce the same output. You could use a Python library, a Java library, or pen and paper. It's a function, it is deterministic. My SHA1 does the same as yours and Alice's and Bob's.
As I understand it, AES is also a function. You put in some values, it spits out the ciphertext.
Why, then, could there ever be fears that Truecrypt (for instance) is "broken"? They're not saying AES is broken, they're saying the program that implements it may be. AES is, in theory, solid. So why can't you just run a file through Truecrypt, run it through a "reference AES" function, and verify that the results are the same? I know it absolutely does not work like that, but I don't know why.
What makes AES different from SHA1 in this way? Why might Truecrypt AES spit out a different file than Schneier-Ifier* AES, when they were both given all the same inputs?
In the end, my question boils down to:
My_SHA1(X) == Bobs_SHA1(X) == ...etc
But TrueCrypt_AES(X) != HyperCrypt_AES(X) != VeraCrypt_AES(X) etc. Why is that? Do all those programs wrap AES, but have different ways of determining stuff like an initialization vector or something?
*this would be the name of my file encryption program if I ever wrote one
In the SHA-1 example you give, there is only a single input to the function, and any correct SHA-1 implementation should produce the same output as any other when provided the same input data.
For AES however things are a bit tricker, and since you don't specify what you mean exactly by "AES", this itself seems likely to be the source of the perceived differences between implementations.
Firstly, "AES" isn't a single algorithm, but a family of algorithms that take different key sizes (128, 192 or 256 bits). AES is also a block cipher, it takes a single block of 128 bits/16 bytes of plaintext input, and encrypts this using the key to produce a single 16 byte block of output.
Of course in practice we often want to encrypt more than 16 bytes of data at once, so we must find a way to repeatedly apply the AES algorithm in order to encrypt all the data. Naively we could split it into 16 byte chunks and encrypt each one in turn, but this mode (described as Electronic Codebook or ECB) turns out to be horribly insecure. Instead, various other more secure modes are usually used, and most of these require an Initialization Vector (IV) which helps to ensure that encrypting the same data with the same key doesn't result in the same ciphertext (which would otherwise leak information).
Most of these modes still operate on fixed-sized blocks of data, but again we often want to encrypt data that isn't a multiple of the block size, so we have to use some form of padding, and again there are various different possibilities for how we pad a message to a length that is a multiple of the block size.
So to put all of this together, two different implementations of "AES" should produce the same output if all of the following are identical:
Plaintext input data
Key (and hence key size)
IV
Mode (including any mode-specific inputs)
Padding
Iridium covered many of the causes for a different output between TrueCrypt and other programs using nominally the same (AES) algorithm. If you are just checking actual initialization vectors, these tend to be done using ECB. It is the only good time to use ECB -- to make sure the algorithm itself is implemented correctly. This is because ECB, while insecure, does work without an IV and therefore makes it easier to check "apples to apples" though other stumbling blocks remain as Iridium pointed out.
With a test vector, the key is specified along with the plain text. And test vectors are specified as exact multiples of the block size. Or more specifically, they tend to be exactly 1 block in size for the plain text. This is done to remove padding and mode from the list of possible differences. So if you use standard test vectors between two AES encryption programs, you eliminate the issue with the plain text data differences, key differences, IV, mode, and padding.
But note you can still have differences. AES is just as deterministic as hashing, so you can get the same result every time with AES just as you can with hashing. It's just that there are more variables to control to get the same output result. One item Iridium did not mention but which can be an issue is endianness of the input (key and plain text). I ran into exactly this when checking a reference implementation of Serpent against TrueCrypt. They gave the same output to the text vectors only if I reversed the key and plain text between them.
To elaborate on that, if you have plain text that is all 16 bytes as 0s, and your key is 31 bytes of 0s and one byte of '33' (in the 256 bit version), if the '33' byte was on the left end of the byte string for the reference implementation, you had to feed TrueCrypt 31 '00' bytes and then the '33' byte on the right-hand side to get the same output. So as I mentioned, an endianness issue.
As for TrueCrypt maybe not being secure even if AES still is, that is absolutely true. I don't know the specifics on TrueCrypt's alleged weaknesses, but let me present a couple ways a program can have AES down right and still be insecure.
One way would be if, after the user keys in their password, the program stores it for the session in an insecure manner. If it is not encrypted in memory or if it encrypts your key using its own internal key but fails to protect that key well enough, you can have Windows write it out on the hard drive plain for all to read if it swaps memory to the hard drive. Or as such swaps are less common than they used to be, unless the TrueCrypt authors protect your key during a session, it is also possible for a malicious program to come and "debug" the key right out of the TrueCrypt software. All without AES being broken at all.
Another way it could be broken (theoretically) would be in a way that makes timing attacks possible. As a simple example, imagine a very basic crypto that takes your 32 bit key and splits it into 2 each chunks of 16 bytes. It then looks at the first chunk by byte. It bit-rotates the plain text right a number of bits corresponding to the value of byte 0 of your key. Then it XORs the plain text with the right-hand 16 bytes of your key. Then it bit-rotates again per byte 1 of your key. And so on, 16 shifts and 16 XORs. Well, if a "bad guy" were able to monitor your CPU's power consumption, they could use side channel attacks to time the CPU and / or measure its power consumption on a per-bit-of-the-key basis. The fact is it would take longer (usually, depending on the code that handles the bit-rotate) to bit-rotate 120 bits than it takes to bit-rotate 121 bits. That difference is tiny, but it is there and it has been proven to leak key information. The XOR steps would probably not leak key info, but half of your key would be known to an attacker with ease based on the above attack, even on an implementation of an unbroken algorithm, if the implementation itself is not done right -- a very difficult thing to do.
So I do not know if TrueCrypt is broken in one of these ways or in some other way altogether. But crypto is a lot harder than it looks. If the people on the inside say it is broken, it is very easy for me to believe them.

TripleDES key sizes - .NET vs Wikipedia

According to Wikipedia, TripleDES supports 56, 112, and 168-bit key lengths, but the System.Cryptography.TripleDESCryptoServiceProvider.LegalKeySizes says it only accepts 128 and 192-bit key lengths.
The system I'm developing needs to be interoperable (data encrypted by my code needs to be decryptable in PHP, Java, and Objective-C) and I don't who is correct in this case.
So who should I believe? And how can I be sure my encrypted data is portable?
Wikipedia does not say TripleDES supports 56 bit keys. The "keying options" talk about "triple-length" keys and "double-length" keys, the latter "reduces the key size to 112 bits".
The effective key size for the original DES is 56 bit. Such a key is constructed from 64 bit input though, where 8 bits remain unused. The "triple-length" key option thus works with a three times 56 bit (=168) constructed from three times 64 bit (=192 bit) and the "double-length" option works with two times 56 bit keys (=112) constructed from two times 64 bit (=128).
As your TripleDESCryptoServiceProvider needs to derive the actual keys from the 64 bit-based input first, it will only take either 128 bits (double-length) or 192 bits (triple-length) as input and then internally derive the 168 or 112 bit actual keys from that input.
That's standard procedure for TripleDES, so you should have no problems with portability across platforms.
Triple DES will only use 112/168 bits of your 128/192 bit key. .NET asks for more bits for the purpose of alignment (each 56 bit subkey is aligned on a 64 bit boundary).
56 bit DES is broken and I'd expect they've made it harder to use.
I believe some (all?) implementations of DES use only 7 bits per character of the key (ASCII encoding). I'm not sure if the definition of DES allows for 8-bit characters in keys or if it actually ignores the high bit of each byte. I think it's the latter.
However, in .NET key sizes are based on the number of bytes, times 8 bits per byte, even if the underlying algorithm ignores that top bit. That is probably the main discrepancy.
TripleDES runs DES three times with potentially three different 56-bit DES keys. In some implementations the middle run is reversed (encrypting-decrypting-encrypting or "EDE") so that using the same 56-bit DES key for all three duplicates the encryption of simple DES. This was done for compatibility with older systems where both are using hardware-based encryption. I'm not sure if the TripleDESCryptoServiceProvider uses this "EDE" approach or the "EEE" approach (or gives you a choice). Further, the same 56-bit DES key can be used for the first and third run, using a 112-bit key instead of the 168-bit key it could also use.
The certified TripleDESCryptoServiceProvider wouldn't accept 56-bit (64-bit) keys because it's not really 3DES security (you could use DESCryptoServiceProvider instead?). At one time it was determined that the 168-bit EEE (or EDE?) 3DES does not provide any greater security than using a 112-bit (128-bit) key. However, there may be some extreme (generally unavailable) attacks in which the shorter key is theoretically more vulnerable. That may also apply to the EDE vs EEE question.
On your compatibility vs other languages question, .NET's *CryptoServiceProvider classes are just a wrapper API around the underlying Windows CRYPTO library. If the other languages are also using the Windows CRYPTO library it should be compatible. Otherwise, you'd have to find out whether they are using EDE or EEE and make sure all are using the same one (you may or may not have flexibility on that), and obviously use the same key length. They are probably all using the same byte order, but if you find things still don't match up that might be another thing to check. Most likely on Windows they're all using CRYPTO and will probably match up as long as you can set the options the same way for all of them.
Des uses multiples of 64 bit keys, but throws away 8 bits leaving a useful keylength of 64 bits.
Triple des can use double or triple key length.
However because repeating des with the same key decrypts the message running des an even number of times can partially decrypt stuff if the keys share patterns.
For this reason des is always ran an odd number of times.
This is also why you should never choose a key where 64 bit parts repeat.
With triple des 192 bit you thus have a effective key length of 112 bits

sha1sum duplicate files

Can sha1sum return the same results for two files that are different ?
I am asking this both from a theoretical and practical point of view.
Yes. The point of a hash is to avoid meaningful collisions, so that a small change to a file results in a large change to the hash value, so it is hard for an attacker to generate a collision.
Think about it: a SHA-1 hash is 160 bits. It is therefore unable to represent all the possible states of any file larger that 160 bits!
Mathematically, all hash functions have collisions -- that is, two inputs can return the same hash. This is true for any function that has an input of N bits and an output of M bits, where M<N.
A cryptographically sound hash algorithm, however, produces a hash that is so unpredictable, it would take millions of years to guess your way to an input that produces a particular hash. Some hash algorithms have known weaknesses that make this easier; when this happens the hash is considered 'broken' and everybody is supposed to switch to a new, better algorithm.
Though I don't follow crypto news that closely, SHA-1 is considered broken, so in theory if you can find the right tools, you can generate two files with the same SHA1 output.
SHA-1 produces a hash that's only 160 bits long. If the files are longer than 160 bits, the hash can no longer represent all possible values of the files, so if you were to try all possible values in the files, some collisions would be inevitable.
There is a known attack for creating collisions with SHA-1, though it is quite expensive to carry it out. This attack, however, lets the attacker find two values that both produce the same result. It does not let the attacker create a file that produces the same hash as a file I've already created.
As far as alternatives go, SHA-256 is not just SHA-1 "widened" to produce a 256-bit result instead of 160 bits -- it's considerably more complex internally (uses six separate functions per round compared to three for SHA-1) which has made cryptanalysis considerably slower and more difficult. Some progress has been made, but most of it is purely theoretical.
For example, the best attack of which I'm aware only works against an intentionally weakened version of SHA-256 (42 rounds instead of the standard 64 rounds). Even at that, it's purely theoretical -- it produces the work to produce a preimage from the 2256 that's theoretically expected, to 2251.7 -- which is still far too much to ever carry out.
Strangely, there appears to be considerably less progress in collision attacks against SHA-256. The best attacks I know of only work against around 20 rounds or so (working attacks against 19 rounds, some progress toward attacks up to 23 rounds or so).
Wikipedia on SHA-1 collisions: link. Some real numbers.

encryption of a single character

What is the minimum number of bits needed to represent a single character of encrypted text.
eg, if I wanted to encrypt the letter 'a', how many bits would I require. (assume there are many singly encrypted characters using the same key.)
Am I right in thinking that it would be the size of the key. eg 256 bits?
Though the question is somewhat fuzzy, first of all it would depend on whether you use a stream cipher or a block cipher.
For the stream cipher, you would get the same number of bits out that you put in - so the binary logarithm of your input alphabet size would make sense. The block cipher requires input blocks of a fixed size, so you might pad your 'a' with zeroes and encrypt that, effectively having the block size as a minimum, like you already proposed.
I'm afraid all the answers you've had so far are quite wrong! It seems I can't reply to them, but do ask if you need more information on why they are wrong. Here is the correct answer:
About 80 bits.
You need a few bits for the "nonce" (sometimes called the IV). When you encrypt, you combine key, plaintext and nonce to produce the ciphertext, and you must never use the same nonce twice. So how big the nonce needs to be depends on how often you plan on using the same key; if you won't be using the key more than 256 times, you can use an 8 bit nonce. Note that it's only the encrypting side that needs to ensure it doesn't use a nonce twice; the decrypting side only needs to care if it cares about preventing replay attacks.
You need 8 bits for the payload, since that's how many bits of plaintext you have.
Finally, you need about 64 bits for the authentication tag. At this length, an attacker has to try on average 2^63 bogus messages minimum before they get one accepted by the remote end. Do not think that you can do without the authentication tag; this is essential for the security of the whole mode.
Put these together using AES in a chaining mode such as EAX or GCM, and you get 80 bits of ciphertext.
The key size isn't a consideration.
You can have the same number of bits as the plaintext if you use a one-time pad.
This is hard to answer. You should definitely first read up on some fundamentals. You can 'encrypt' an 'a' with a single bit (Huffman encoding-style), and of course you could use more bits too. A number like 256 bits without any context is meaningless.
Here's something to get you started:
Information Theory -- esp. check out Shannon's seminal paper
One Time Pad -- infamous secure, but impractical, encryption scheme
Huffman encoding -- not encryption, but demonstrates the above point

AES, Cipher Block Chaining Mode, Static Initialization Vector, and Changing Data

When using AES (or probably most any cipher), it is bad practice to reuse an initialization vector (IV) for a given key. For example, suppose I encrypt a chunk of data with a given IV using cipher block chaining (CBC) mode. For the next chunk of data, the IV should be changed (e.g., the nonce might be incremented or something). I'm wondering, though, (and mostly out of curiosity) how much of a security risk it is if the same IV is used if it can be guaranteed that the first four bytes of the chunks are incrementing. In other words, suppose two chunks of data to be encrypted are:
0x00000000someotherdatafollowsforsomenumberofblocks
0x00000001someotherdatathatmaydifferormaynotfollows
If the same IV is used for both chunks of data, how much information would be leaked?
In this particular case, it's probably OK (but don't do it, anyway). The "effective IV" is your first encrypted block, which is guaranteed to be different for each message (as long as the nonce truly never repeats under the same key), because the block cipher operation is a bijection. It's also not predictable, as long as you change the key at the same time as you change the "IV", since even with fully predictable input the attacker should not be able to predict the output of the block cipher (block cipher behaves as a pseudo-random function).
It is, however, very fragile. Someone who is maintaining this protocol long after you've moved on to greener pastures might well not understand that the security depends heavily on that non-repeating nonce, and could "optimise" it out. Is sending that single extra block each message for a real IV really an overhead you can't afford?
Mark,
what you describe is pretty much what is proposed in Appendix C of NIST SP800-38a.
In particular, there are two ways to generate an IV:
Generate a new IV randomly for
each message.
For each message use a new unique nonce (this may be a counter), encrypt the nonce, and use the result as IV.
The second option looks very similar to what you are proposing.
Well, that depends on the block size of the encryption algorithm. For the usual block size of 64 bytes i dont think that would make any difference. The first bits would be the same for many blocks, before entering the block cipher, but the result would not have any recognisable pattern. For block sizes < 4 bytes (i dont think that happens) it would make a difference, because the first block(s) would always be the same, leaking information about patterns. Just my opinion.
edit:
Found this
"For CBC and CFB, reusing an IV leaks some information about the first block
of plaintext, and about any common prefix shared by the two messages"
Source: lectures of my university :)

Resources