Is there a simple and quick way to detect encrypted files? I heard about enthropy calculation, but if I calculate it for every file on a drive, it will take days to detect encryption.
Is it possible to, say it, calculate some value for first 100 bytes or 1024 bytes and then decide? Anyone has a sources for that?
I would use a cross-entropy calculation. Calculate the cross-entropy value for X bytes for known encrypted data (it should be near 1, regardless of type of encryption, etc) - you may want to avoid file headers and footers as this may contain non-encrypted file meta data.
Calculate the entropy for a file; if it's close to 1, then it's either encrypted or /dev/random. If it's quite far away from 1, then it's likely not encrypted. I'm sure you could apply signifance tests to this to get a baseline.
It's about 10 lines of Perl; I can't remember what library is used (although, this may be useful: http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/addonecross.txt)
You could just make a system that recognizes particular common forms of encrypted files (ex: recognize encrypted zip, rar, vim, gpg, ssl, ecryptfs, and truecrypt). Any attempt to determine encryption based on the raw data will quickly run into a steganography discussion.
One of the advantages of good encryption is that you can design it so that it can't be detected - see the Wikipedia article on deniable encryption for example.
Every statistical approach to detect encryption will give you various "false alarms", like
compressed data or random looking data in general.
Imagine I'd write a program that outputs two files: file1 contains 1024 bit of π and file2 is an encrypted version of file1. If you don't know anything about the contents of file1 or file2, there's no way to distinguish them. In fact, it's quite likely that π contains the contents of file2 somewhere!
EDIT:
By the way, it's not even working the other way round (detecting unencrypted files). You could write a program that transforms encrypted data to readable english text by assigning words or whole sentences to bits/bytes of it.
Related
I was hit by a ransomware infection that encrypts the first 512 bytes at the top of the file and puts them at the bottom. Upon looking at the encrypted text it seems to be some type of XOR cipher. I know the whole plain text of one of the files that was encrypted, so i figured in theory i should be able to xor it to get the key to decrypt the rest of my files. Well i am having a very hard time with this because i don't understand how the creator xor'ed it really. Im thinking he would use a binaryreader to read the first 512 bytes into an array, XOR it, and replace it. But does that mean he XOR'ed it in HEX? or Decimal? Im quite confused at this point, but i believe i am simply missing something.
I have tried Xor Tool with python, and everything it attempts to crack looks like non sense. I also tried a python script called Unxor that you give the known plain text to, but the dump file it outputs is always blank.
Good Header file dump:
Good-Header.bin
Encrypted Header file dump:
Enc-Header.bin
This may not be the best file example to see the XOR pattern, but its the only file i have that also has the original header 100% before encryption. In other headers where there is more changes the encrypted header changes with it.
Any advice on a method i should try, or application i should use to try and take this further? Thanks so much for your help!
P.S Stackoverflow yelled at me when i tried to post 4 links because im so new, so if you would rather see the hex dumps on pastebin than download the header files, please let me no. The files are in no way malicious, and are only the extracted 512 bytes and not a whole file.
To recover the keystream XOR the plaintext bytes with the cyphertext bytes. Do this with two different files so you can see if the ransomware is using the same keystream or a different keystream for each file.
If it is using the same keystream (unlikely) then your problem is solved. If the keystreams are different, then your easiest solution is to restore the affected files from backups. You did keep backups, didn't you? Alternatively research the particular infection you have got and see if anyone else has broken that particular variant, so you can derive the key(s) they used and hence regenerate the required keystreams.
If you have a lot of money then a data recovery firm might be able to help you, but they will certainly charge.
A rule of thumb to tell a decent cipher from a toy cipher is to encrypt a highly compressible file and try to compress it in its encrypted form: a dumb cipher will produce a file with a level of entropy similar to that of the original one, so the encrypted file will compress as well as the original one; on the other side, a good cipher (even without an initialization vector) will produce a file that will look like a random garbage and thus will not compress at all.
When I compressed your Enc-Header.bin of 512 bytes with PKZIP, the output was also 512 bytes, so the cipher is not as dumb as you expected — bad luck. (But it does not mean that the malware has no weak spots at all.)
Goal (General)
My ultimate (long term) goal is to write an importer for a binary file into another application
Question Background
I am interested in two fields within a binary file format. One is
encrypted, and the other is compressed and possibly also encrypted
(See how I arrived at this conclusion here).
I have a viewer program (I'll call it viewer.exe) which can open these files for viewing. I'm hoping this can offer up some clues.
I will (soon) have a correlated deciphered output to compare and have values to search for.
This is the most relevant stackoverflow Q/A I have found
Question Specific
What is the best strategy given the resources I have to identify the algorithm being used?
Current Ideas
I realize that without the key, identifying the algo from just data is practically impossible
Having a file and a viewer.exe, I must have the key somewhere. Whether it's public, private, symmetric etc...that would be nice to figure out.
I would like to disassemble the viewer.exe using OllyDbg with the findcrypt plugin as a first step. I'm just not proficient enough in this kind of thing to accomplish it yet.
Resources
full example file
extracted binary from the field I am interested in
decrypted data In this zip archive there is a binary list of floats representing x,y,z (model2.vertices) and a binary list of integers (model2.faces). I have also included an "stl" file which you can view with many free programs but because of the weird way the data is stored in STL's, this is not what we expect to come out of the original file.
Progress
1. I disassembled the program with Olly, then did the only thing I know how to do at this poing and "searched for all referenced text" after pausing the porgram right before it imports of of the files. Then I searched for words stings like "crypt, hash, AES, encrypt, SHA, etc etc." I came up with a bunch of things, most notably "Blowfish64" which seems to go nicely with the fact that mydata occasionally is 4 bytes too long (and since it is guranteed to be mod 12 = 0) this to me looks like padding for 64 bit block size (odd amounts of vertices result in non mod 8 amounts of bytes). I also found error messages like...
“Invalid data size, (Size-4) mod 8 must be 0"
After reading Igor's response below, here is the output from signsrch. I've updated this image with green dot's which cause no problems when replaced by int3, red if the program can't start, and orange if it fails when loading a file of interest. No dot means I haven't tested it yet.
Accessory Info
Im using windows 7 64 bit
viewer.exe is win32 x86 application
The data is base64 encoded as well as encrypted
The deciphered data is groups of 12 bytes representing 3 floats (x,y,z coordinates)
I have OllyDb v1.1 with the findcrypt plugin but my useage is limited to following along with this guys youtube videos
Many encryption algorithms use very specific constants to initialize the encryption state. You can check if the binary has them with a program like signsrch. If you get any plausible hits, open the file in IDA and search for the constants (Alt-B (binary search) would help here), then follow cross-references to try and identify the key(s) used.
You can't differentiate good encryption (AES with XTS mode for example) from random data. It's not possible. Try using ent to compare /dev/urandom data and TrueCrypt volumes. There's no way to distinguish them from each other.
Edit: Re-reading your question. The best way to determine which symmetric algorithm, hash and mode is being used (when you have a decryption key) is to try them all. Brute-force the possible combinations and have some test to determine if you do successfully decrypt. This is how TrueCrypt mounts a volume. It does not know the algo beforehand so it tries all the possibilities and tests that the first few bytes decrypt to TRUE.
I'm trying to reverse engineer a file from an application to learn more about the data it is storing on me. Based on the name, it appears to be XML data, but it is obviously either saved in binary or encrypted. I thought it may have been some form of .Net (or other) serialization, and have tried decoding it that way. But, no love. Inspection in hex has not given any clues either.
Maybe someone with more 'skilz' can give me some insight into it. Here is the file
Voted down and answering: the file is exactly N * 16 bytes in size, does not contain any repetition as far as I can see, and it seems to be filled with random bytes. The first bytes seems completely random as well, hinting that this is not a plain protocol.
This would probably hint that the file is AES CBC encrypted. DESede (or any cipher with a 8/16 blocksize) could of couse also have been deployed. Without the key (if any) this all is not going to help you much (if it was, I would not be answering you).
The entropy of first file is high above 7.7 that might indicate encryption. The first 28h bytes (320-bit) of the files match. Is that possible that's the key and the encoded data starts at 28h?
Let's say I have some numbers, like
5,10,7,8,9,6,2,4,8,5,3,9,78,5,6
I need to send this to another computer, but as the least number of possible bytes. I know what there is a way to do that, I just forgot what it's called and how it works, but generally doing some math with those numbers, getting a big number that, from this number, I'll be able to export the data and get this numbers from this number. Thanks in advance.
EDIT
OK so I need to send this text in UDP but I need it as less bits as possible. I'm sending some options, like firstcolor-secondcolor, let's say I have 15 colors. Every color is just number, from 1 to 199, but maybe there is a better way to send this data? thanks.
No one can say which compression scheme is the best for you. We don't have any information about the numbers. But as a first try, you could just write them into a file and use gzip compression on it. Or bzip2, or 7zip.
And only if all these don't help, you should think about doing the compression yourself.
You also didn't tell us your operating systems (source computer, destination computer) and from where you get the data.
[Update, based on the edit in the question:] So basically you want to send some numbers in the range of 1 to 199. This is pretty close to what a single byte can hold.
If it is ok that you use 8 bits per number (meaning you waste 0.4 bits per number), this is trivial but highly depends on the programming language. Here is how it might look like in Java syntax:
ByteBuffer buf = new ByteBuffer();
buf.add(1);
buf.add(199);
buf.add(78);
buf.add(7);
udpSocket.send(buf.toArray());
Get a compression library (like zlib, for example) and feed your numbers in (as an array of integers, for example). This is compressing your data. That same library should allow you to reverse the process and decompress the data at the other end to get your values back out.
If you want to improve your algorithmic knowledge and your requirements are simple and non-critical I'd recommend having a go at writing your own compression/decompression code. If not, grab some code off the shelf - there are loads of good libraries around.
Would it be possible to transfer large files using only a system of checksums, and then reconstruct the original file by calculations?
Say that you transfer the MD5 checksum of a file and the size of the file. By making a "virtual file" and calculating it's checksum, trying every single bit combination, you should eventually "reach" the original file. But on the way you would also get a lot of "collisions" where the checksum also match.
So we change the first byte of the original file to some specified value, calculate the checksum again, and send this too. If we make the same substitution in the virtual file we can test each "collision" to see if it still matches. This should narrow it down a bit, and we can do this several times.
Of course, the computing power to do this would be enormous. But is it theoretically possible, and how many checksums would you need to transfer something (say 1mb)? Or would perhaps the amount of data needed to transfer the checksums almost as large as the file, making it pointless?
The amount of data you need to transfer would most certainly be the same size as the file. Consider: If you could communicate a n byte file with n-1 bytes of data, that means you've got 256^(n-1) possible patterns of data you may have sent, but are selecting from a space of size 256^n. This means that one out of every 256 files won't be expressible using this method - this is often referred to as the pidegonhole principle.
Now, even if that wasn't a problem, there's no guarentee that you won't have a collision after any given amount of checksumming. Checksum algorithms are designed to avoid collisions, but for most checksum/hash algorithms there's no strong proof that after X hashes you can guarantee no collisions in a N-byte space.
Finally, hash algorithms, at least, are designed to be hard to reverse, so even if it were possible it would take an impossible huge amount of CPU power to do so.
That said, for a similar approach, you might be interested in reading about Forward Error Correction codes - they're not at all hash algorithms, but I think you may find them interesting.
What you have here is a problem of information. A checksum is not necessarily unique to a particular set of data, in fact to be so it would effectively need to have a many bits of information as the source. What it can indicate is that the data received is not the exact data that the checksum was generated from but in most cases it can't prove it.
In short "no".
To take a hypothetical example, consider a 24 bpp photo with 6 pixels -- there are 2^(24 * 6) (2^144) possible combinations of intensities for each colour channel on those six pixels, so you can gaurantee that if you were to evaluate every possibility, you are guaranteed an MD5 collision (as MD5 is a 128 bit number).
Short answer: Not in any meaningfull form.
Long answer:
Let us assume an arbitrary file file.bin with a 1000-byte size. There are 2^(8*1000) different combinations that could be its actual contents. By sending e.g. a 1000-bit checksum,
you still have about 2^(7*1000) colliding alternatives.
By sending a single additional bit, you might be able cut those down by half... and you still have 2^6999 collisions. By the time you eliminate the colisions, you will have sent at least 8000 bits i.e. an amount equal or greater to the file size.
The only way for this to be theoretically possible (Note: I did not say "feasible", let alone "practical") would be if the file did not really contain random data and you could use that knowledge to prune alternatives. In that case you'd be better off using compression, ayway. Content-aware compression algorithms (e.g. FLAC for audio) use a-priori knowledge on the properties of the input data to improve the compression ratio.
I think what you are thinking of is in fact an interesting topic, but you haven't hit upon the right method. If I can try and rephrase your question, you are asking if there is a way to apply a function to some data, transmit the result of the function, and then reconstruct the original data from the terser function result. For a single MD5 checksum the answer is no, but with other functions, provided you are willingly to send several function results, it is possible. In general this area of research is called compressed sensing. Sometimes exact reconstruction is possible, but more often it is used as a lossy compression scheme for images and other visual or sound data.