Is there any efficiency analysis of how MD5 depends on file size? Is it actually dependent on the size of the file or on the content of the file? So if I have a 500 MB file of all blank spaces and a 500 MB file containing a movie, would MD5 take the same time to generate the hash?
Any hash is, by its nature, computed over every byte of what you're hashing. You have to read the file through a stream at the very least, and more bytes take longer to traverse. However, I'd say that (generally speaking) the bottleneck will indeed be reading the file, no matter what you're trying to do with it, not hashing it once you've read it.
Edit: I somewhat misread the question. It will take essentially the same amount of time to hash two files of equal size. 500 MB of spaces is 500 MB of bytes that happen to represent "space"; that's still 8 bits of data per byte, the same as any other file.
Because MD5 consists mostly of XOR, AND, OR and NOT operations, the speed is not dependent on a given bit containing a 1 or a 0.
From http://en.wikipedia.org/wiki/MD5:
There are four possible functions; a different one is used in each round:

F(B, C, D) = (B AND C) OR ((NOT B) AND D)
G(B, C, D) = (B AND D) OR (C AND (NOT D))
H(B, C, D) = B XOR C XOR D
I(B, C, D) = C XOR (B OR (NOT D))

where XOR, AND, OR and NOT denote the usual bitwise operations.
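To make the constant-time claim concrete, here is a minimal Python sketch of those four round functions on 32-bit words (the function names follow the quote above; the masking constant is my addition). Each is a fixed sequence of bitwise operations, so its cost does not depend on whether the input bits are 0s or 1s:

MASK = 0xFFFFFFFF  # MD5 operates on 32-bit words

def F(b, c, d): return ((b & c) | (~b & d)) & MASK   # used in round 1
def G(b, c, d): return ((b & d) | (c & ~d)) & MASK   # used in round 2
def H(b, c, d): return (b ^ c ^ d) & MASK            # used in round 3
def I(b, c, d): return (c ^ (b | ~d)) & MASK         # used in round 4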
Hashes in general, MD5 included, have performance that does not depend on the content of the input.
Here's a quick empirical test.
# dd if=/dev/urandom of=randomfile bs=1024 count=512000
# dd if=/dev/zero of=zerofile bs=1024 count=512000
# time md5 randomfile
MD5 (randomfile) = bb318fa1561b17e30d03b12e803262e4
real 0m2.753s
user 0m1.567s
sys 0m1.157s
# time md5 zerofile
MD5 (zerofile) = d8b61b2c0025919d5321461045c8226f
real 0m2.761s
user 0m1.567s
sys 0m1.168s
This is expected as per previous answers alluding to the bit manipulations used in the MD5 algorithm.
MD5, like most other hash algorithms, operates on blocks. For each 512-bit block of the input it performs the same operation and uses the output as part of the input for the next block.
The operation consists of the same basic bitwise operations (XOR, AND, NOT, etc.). On all processors that I know of, these operations take the same time no matter what their arguments are, so the time MD5 takes to process its input should be linear in the number of 512-bit blocks in the input.
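As a quick in-memory counterpart to the dd test above, here is a small Python sketch (standard library only) that hashes two equally sized buffers, one all zeros and one random. On a typical machine the two timings come out essentially identical:

import hashlib, os, time

SIZE = 100 * 1024 * 1024          # two 100 MB buffers: same size, different content
for name, data in [("zeros", bytes(SIZE)), ("random", os.urandom(SIZE))]:
    start = time.perf_counter()
    digest = hashlib.md5(data).hexdigest()
    print(name, digest, f"{time.perf_counter() - start:.3f}s")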
As far as I understand, one of the main functions of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates. textreuse is a peer-reviewed rOpenSci package, so I assume it does its job correctly, but my question persists.
Let's say I use 256 permutations and 64 bands for my minhash and LSH functions respectively -- realistic values that are often used to detect with relative certainty (~98%) similarities as low as 50%.
If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:
object.size(trtd$minhashes)
> 1072 bytes
Now let's create the LSH buckets for this object (64 bands) and assign them to l. I will have:
object.size(l$buckets)
> 6704 bytes
So, the retained hashes in the LSH buckets are six times larger than the original minhashes. I understand this happens because textreuse uses an MD5 digest to create the bucket hashes.
But isn't this too wasteful or overkill, and can't I improve it? Is it normal that our data reduction technique ends up bloating to this extent? And isn't it more effective to match the documents based on the original hashes (similar to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?
Note that I have reviewed the typical texts such as Mining of Massive Datasets, but this question remains about this particular implementation. Also note that the question is not only out of curiosity, but rather out of need. When you have millions or billions of hashes, these differences become significant.
Package author here. Yes, it would be wasteful to use more hashes/bands than you need. (Though keep in mind we are talking about kilobytes here, which could be much smaller than the original documents.)
The question is, what do you need? If you need to find only matches that are close to identical (i.e., with a Jaccard score close to 1.0), then you don't need a particularly sensitive search. If, however, you need to reliably detect potential matches that share only a partial overlap (i.e., with a Jaccard score closer to 0), then you need more hashes/bands.
Since you've read MMD, you can look up the equation there. But there are two functions in the package, documented here, which can help you calculate how many hashes/bands you need. lsh_threshold() will calculate the threshold Jaccard score that will be detected, while lsh_probability() will tell you how likely it is that a pair of documents with a given Jaccard score will be detected. Play around with those two functions until you get the number of hashes/bands that is optimal for your search problem.
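If it helps, here is a small Python sketch of the S-curve from MMD (assuming the standard 1 - (1 - s^r)^b form; the variable names are mine), plugged in with the 256 hashes and 64 bands from the question:

h, b = 256, 64                 # minhashes and bands from the question
r = h // b                     # rows per band = 4
threshold = (1 / b) ** (1 / r) # Jaccard score at the S-curve inflection, ~0.35

def p_detect(s):
    # probability that a pair with Jaccard similarity s shares at least one bucket
    return 1 - (1 - s ** r) ** b

print(round(threshold, 3), round(p_detect(0.5), 3))   # 0.354 0.984, i.e. ~98% at 50% similarity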
I am working on a system that makes heavy use of pseudonyms to make privacy-critical data available to researchers. These pseudonyms should have the following properties:
1. They should not contain any information (e.g. time of creation, relation to other pseudonyms, encoded data, …).
2. It should be easy to create unique pseudonyms.
3. They should be human readable. That means they should be easy for humans to compare, copy, and understand when read out aloud.
My first idea was to use UUID4. They are quite good on (1) and (2), but not so much on (3).
A variant is to encode UUIDs with a wider alphabet, resulting in shorter strings (see for example shortuuid). But I am not sure whether this actually improves readability.
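For what it's worth, a shorter representation can be produced with just the standard library; here is a rough sketch using a hand-rolled base-57 alphabet (illustrative only, not necessarily what shortuuid does internally):

import uuid

# Alphabet without easily confused characters (no 0/O, 1/l/I); purely illustrative.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def encode_uuid(u: uuid.UUID) -> str:
    n, out = u.int, []
    while n:
        n, rem = divmod(n, len(ALPHABET))
        out.append(ALPHABET[rem])
    return "".join(reversed(out)) or ALPHABET[0]

print(encode_uuid(uuid.uuid4()))   # roughly 22 characters instead of 36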
Another approach I am currently looking into is a paper from 2005 titled "An optimal code for patient identifiers" which aims to tackle exactly my problem. The algorithm described there creates 8-character pseudonyms with 30 bits of entropy. I would prefer to use a more widely reviewed standard though.
Then there is also the git approach: only display the first few characters of the actual pseudonym. But this would mean that a pseudonym could lose its uniqueness after some time.
So my question is: Is there any widely-used standard for human-readable unique ids?
Not aware of any widely-used standard for this. Here’s a non-widely-used one:
Proquints
https://arxiv.org/html/0901.4016
https://github.com/dsw/proquint
A UUID4 (128 bits) would be converted into 8 proquints. If that's too much, you can take the last 64 bits of the UUID4 (i.e. just take 64 random bits). This doesn't make it magically lose uniqueness; it only increases the likelihood of collisions, which was non-zero to begin with and which you can estimate mathematically to decide whether it's still OK for your purposes.
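For reference, here is a minimal Python sketch of the proquint encoding described in the paper: each 16-bit word becomes one consonant-vowel-consonant-vowel-consonant group (4+2+4+2+4 bits), and groups are joined with dashes:

CONSONANTS = "bdfghjklmnprstvz"   # 16 consonants, 4 bits each
VOWELS = "aiou"                   # 4 vowels, 2 bits each

def proquint16(word: int) -> str:
    # Encode one 16-bit word as a 5-letter proquint.
    return (CONSONANTS[(word >> 12) & 0xF] + VOWELS[(word >> 10) & 0x3] +
            CONSONANTS[(word >> 6) & 0xF] + VOWELS[(word >> 4) & 0x3] +
            CONSONANTS[word & 0xF])

def proquint(value: int, bits: int) -> str:
    words = [(value >> shift) & 0xFFFF for shift in range(bits - 16, -1, -16)]
    return "-".join(proquint16(w) for w in words)

print(proquint(0x7F000001, 32))   # "lusab-babad", the 127.0.0.1 example from the spec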
Here you go: UUID Readable
Generate Easy to Remember, Readable UUIDs, that are Shakespearean and Grammatically Correct Sentences
This article suggests using the first few characters from a SHA-256 hash, similarly to what git does. Name-based UUIDs are themselves built from a hash (version 5 uses SHA-1), so this is not all that different. The tradeoff between property (2) and (3) is in the number of characters.
With d being the number of hex digits, you get 2 ** (4 * d) identifiers in total, but the first collision is expected to happen after about 2 ** (2 * d).
The big question is really not about the kind of identifier you use, it is how you handle collisions.
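As an illustration of that tradeoff, here is a short Python sketch (the names are mine) that derives a d-character identifier from a SHA-256 digest and prints the rough point at which the first collision becomes likely:

import hashlib

def short_id(data: bytes, d: int = 8) -> str:
    # First d hex characters of the SHA-256 digest, git-style.
    return hashlib.sha256(data).hexdigest()[:d]

d = 8
total = 2 ** (4 * d)      # 16^d possible identifiers
birthday = 2 ** (2 * d)   # first collision expected around sqrt(total)
print(short_id(b"example document"))
print(f"{total:.2e} identifiers, first collision expected after ~{birthday:.2e}")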
Why is it that, no matter the size of the input to the SHA-256 algorithm (in Bitcoin mining), it always outputs a result of 256 bits?
Furthermore, how come that, no matter the size of the input, the computation time is always the same?
On your first question, the answer would be that it is by design - the SHA-256 algorithm is intended to take an arbitrary amount of input data and produce 256 bits of output, whilst also maintaining certain properties that make for an effective cryptographic hash. Other hash algorithms produce different output sizes (e.g. SHA-1 produces 160 bits of output, SHA-512 produces 512 bits of output, etc.).
Your second question is based on an incorrect assumption: the computation time does depend on the size of the input. It will naturally take longer even just to read, say, a 1 MB file than a 1 KB file, and since the hash depends on every bit of the input, a larger input will take longer to hash than a smaller one.
The output size is always the same because the algorithm was designed that way.
The input size does matter: just compute the hash of a 1 GB file and of a 1 KB file and compare; you'll see the speed difference.
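A quick way to see both points with Python's hashlib (the buffer sizes here are arbitrary, chosen just to keep the run short):

import hashlib, time

for size in (1_000, 1_000_000, 100_000_000):   # 1 KB, 1 MB, 100 MB of zero bytes
    data = bytes(size)
    start = time.perf_counter()
    digest = hashlib.sha256(data).hexdigest()
    elapsed = time.perf_counter() - start
    # The digest is always 64 hex characters (256 bits); the time grows with the input size.
    print(f"{size:>11} bytes -> {len(digest)} hex chars in {elapsed:.4f}s")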
So I understand there is a proof that MD5 cannot guarantee uniqueness, since there are more possible strings than there are MD5 hash values, but is there an inverse proof for a finite set of strings?
Basically, if I have strings of maximum length X, is there an X for which MD5 is guaranteed to be unique? If yes, what is that X? And if there is more than one such X, what is the maximum value of X?
Or is there such an X for any other hashing algorithm, SHA-1, etc.?
Summarizing the excellent answers here: What's the shortest pair of strings that causes an MD5 collision?
The shortest known attack on MD5 requires 2 input blocks, that is, 128 bytes or 1024 bits.
For any hash algorithm that outputs N bits, assuming it distributes inputs approximately randomly, you can expect a collision to be more than 50% likely after about sqrt(2^N) inputs. For example, MD5 hashes to 128 bits, so you can expect a collision after about 2^64 inputs, e.g. among the set of all 64-bit inputs. This assumes a uniformly random hash; any weaknesses would reduce the number of inputs before a collision can be expected to occur.
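The approximation behind that square-root figure, sketched in Python (this is just the generic birthday-bound formula, nothing specific to MD5):

import math

def collision_probability(k: float, n_bits: int) -> float:
    # Probability of at least one collision among k random n-bit hashes,
    # using the approximation 1 - exp(-k^2 / (2 * 2^n)).
    return 1 - math.exp(-(k * k) / (2 * 2 ** n_bits))

print(collision_probability(2 ** 64, 128))          # ~0.39 right at sqrt(2^128)
print(collision_probability(1.2 * 2 ** 64, 128))    # ~0.51, i.e. past the 50% mark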
The answer to your question is yes. For any hash function, there is a maximum length X up to which all strings hash to unique values. Finding X, though, could be very hard. The idea is to run this program:
import hashlib
from itertools import count, product

def max_unique_length():
    seen, x = set(), 0
    for i in count():                             # try longer and longer strings
        for s in product(range(256), repeat=i):   # all byte strings of length i
            h = hashlib.md5(bytes(s)).digest()
            if h in seen:                         # first collision: x is the last safe length
                return x
            seen.add(h)
        x = i                                     # all strings up to length i hashed uniquely
The idea is to just list longer and longer strings until you find a hash collision. Eventually you'll have to, since eventually you'll have generated more strings than there are possible hash outputs.
On expectation, assuming that the hash function is actually pretty random, you'll need to generate O(√U) different strings before you find a collision, where U is the size of the space the hash function maps to. For 256-bit hashes, U is 2^256, so you would expect to need about 2^128 strings. This means that in practice the above program would never actually terminate unless the hash function was pretty broken, but in theory it means that your number X exists.
Hope this helps!
Take a commonly used binary hash function - for example, SHA-256. As the name implies, it outputs a 256 bit value.
Let A be the set of all possible 256 bit binary values. A is extremely large, but finite.
Let B be the set of all possible binary values. B is infinite.
Let C be the set of values obtained by running SHA-256 on every member of B. Obviously this can't be done in practice, but I'm guessing we can still do mathematical analysis of it.
My Question: By necessity, C ⊆ A. But does C = A?
EDIT: As some answers have pointed out, this is wholly dependent on the hash function in question. So if you know the answer for any particular hash function, please say so!
First, let's point out that SHA-256 does not accept all possible binary strings as input. As defined by FIPS 180-3, SHA-256 accepts as input any sequence of bits of length less than 2^64 bits (i.e. no more than 18446744073709551615 bits). This is very common; all hash functions are somehow limited in formal input length. One reason is that the notion of security is defined with regard to computational cost; there is a threshold on the computational power that any attacker can muster. Inputs beyond a given length would require more than that maximum computational power simply to evaluate the function. In brief, cryptographers are very wary of infinities, because infinities tend to prevent security from being even defined, let alone quantified. So your input set B should be restricted to sequences of up to 2^64 - 1 bits.
That being said, let's see what is known about hash function surjectivity.
Hash functions try to emulate a random oracle, a conceptual object which selects outputs at random under the only constraint that it "remembers" previous inputs and outputs and, if given an already seen input, returns the same output as before. By definition, a random oracle can be proven surjective only by trying inputs and exhausting the output space. If the output has size n bits, then it is expected that about 2^(2n) distinct inputs will be needed to exhaust the output space of size 2^n. For n = 256, this means that hashing about 2^512 messages (e.g. all messages of 512 bits) ought to be enough (on average). SHA-256 accepts inputs very much longer than 512 bits (indeed, it accepts inputs of up to 18446744073709551615 bits), so it seems highly plausible that SHA-256 is surjective.
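That experiment is obviously out of reach for a 256-bit output, but a toy Python version on a heavily truncated hash illustrates the "exhaust the output space by trying inputs" idea (this says nothing about full SHA-256; it only illustrates the counting argument):

import hashlib

OUTPUT_BITS = 16                                  # toy output size: 2^16 possible values
seen = set()
i = 0
# Hash distinct inputs until every truncated output value has appeared at least once.
while len(seen) < 2 ** OUTPUT_BITS:
    digest = hashlib.sha256(str(i).encode()).digest()
    seen.add(int.from_bytes(digest[:2], "big"))   # keep only the first 16 bits
    i += 1
print(f"all {2 ** OUTPUT_BITS} truncated outputs were hit after {i} inputs")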
However, it has not been proven that SHA-256 is surjective, and that is expected. As shown above, a surjectivity proof for a random oracle requires an awful lot of computing power, substantially more than mere attacks such as preimages (2^n) and collisions (2^(n/2)). Consequently, a good hash function "should not" allow a property such as surjectivity to be actually proven. It would be very suspicious: the security of hash functions stems from the intractability of their internal structure, and such intractability should firmly resist any attempt at mathematical analysis.
As a consequence, surjectivity is not formally proven for any decent hash function, and not even for "broken" hash functions such as MD4. It is only "highly suspected" (a random oracle with inputs much longer than the output should be surjective).
Not necessarily. The pigeonhole principle states that once one more hash than the size of A has been generated, a collision is certain, but it does not state that every single element of A has been generated.
It really depends on the hash function. If you use this valid hash function:
Int256 Hash(string input) {
    return 0;
}
then it is obvious that C != A. So the "for example, SHA256" is a pretty important note to consider.
To answer your actual question: I believe so, but I'm just guessing. Wikipedia does not provide any meaningful info on this.
Not necessarily. That would depend on the hash function.
It would probably be ideal if the hash function were surjective, but there are things that are usually more important, such as a low likelihood of collisions.
It is not always the case. However, the qualities required of a hash algorithm are:
The cardinality of the output set
The distribution of hashes over the output set (every output value must have the same probability of being produced as a hash)