I am trying to come up with a checksum algorithm that produces a fixed-length hash of arbitrary strings whose order doesn't matter.
By that I mean to say, the hash of the strings ["a", "b"] should result in the same hash as ["b", "a"]. Also, ["this is a really long string", "a"] should result in the same as ["a", "this is a really long string"].
Ideally, I would like this hash to look something like a SHA-256 string, but that is less important.
The hash doesn't need to be unique, but it does need to act as a checksum, so across many inputs, collisions should be reasonably unlikely.
As I say, this is a checksum rather than a cryptographic hash, so some degree of duplication is ok.
The pseudocode, written in Go for no reason other than that it's syntactically light while still being typed, would be:
func hash(inputs []string) string
One option would be to simply sort the inputs and compute a SHA-256 over the result, but sorting is memory-intensive. Another option is to convert each string to a v5 UUID and XOR them together, but ideally the algorithm would require no other hashing function.
Any thoughts would be greatly appreciated.
Choose a hash algorithm that suits your needs, apply it to each element, and then XOR all the hashes together. This gives you the same result independent of the order of the elements, as in the sketch below.
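A minimal Go sketch of that approach, using SHA-256 as the per-element hash (any fixed-length hash would do; SHA-256 just matches the output format asked for above):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hash XORs the SHA-256 digests of the elements together; XOR is
// commutative and associative, so element order doesn't matter.
func hash(inputs []string) string {
	var acc [sha256.Size]byte
	for _, s := range inputs {
		d := sha256.Sum256([]byte(s))
		for i := range acc {
			acc[i] ^= d[i]
		}
	}
	return hex.EncodeToString(acc[:])
}

func main() {
	fmt.Println(hash([]string{"a", "b"}) == hash([]string{"b", "a"})) // true
}

One caveat: with XOR, pairs of identical elements cancel out, so ["a", "a", "b"] hashes the same as ["b"]. If duplicates matter, adding the digests as big integers modulo 2^256 avoids that.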
Related
I have a hash and I don't know the HASH TYPE.
I have two candidate inputs, one of which is surely correct (i.e., it hashes to the OUTPUT). How can I find the hash type using them?
INPUT1 = 123459999
OUTPUT1 = eb6ae08384753f42445b7418661924c1632d36c06d1f3695e2ec90c192e7f92a
INPUT2 = 123-45-9999
OUTPUT2 = eb6ae08384753f42445b7418661924c1632d36c06d1f3695e2ec90c192e7f92a
I would really appreciate it if someone could identify the HASH TYPE or explain how I should find it, please :)
There is no specific way to know what hash function is used here. A cryptographic hash function produces output indistinguishable from random, so we can't distinguish between the outputs of different secure hash functions (or we'd be able to also distinguish them from random).
However, there are some common hash functions in use. Since this is a 256-bit output, you could try SHA-256, SHA-512/256, SHA3-256, BLAKE2s, or BLAKE2b-256 and see if one of them produces the expected output. It is equally plausible, though, that somebody produced these hashes using a 256-bit MAC, like HMAC-SHA-256, which is essentially a hash function with a key, in which case there's no way to reproduce the hash without knowing the key.
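A quick way to test the unkeyed candidates, sketched in Go (this assumes the input was hashed as the ASCII string "123459999"; a different encoding would change every candidate's output, and the x/crypto functions are only noted so the sketch stays standard-library only):

package main

import (
	"crypto/sha256"
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

func main() {
	input := []byte("123459999") // assumed encoding of INPUT1
	want := "eb6ae08384753f42445b7418661924c1632d36c06d1f3695e2ec90c192e7f92a"

	candidates := map[string]func([]byte) []byte{
		"SHA-256":     func(b []byte) []byte { d := sha256.Sum256(b); return d[:] },
		"SHA-512/256": func(b []byte) []byte { d := sha512.Sum512_256(b); return d[:] },
		// SHA3-256, BLAKE2s and BLAKE2b-256 live in golang.org/x/crypto.
	}
	for name, f := range candidates {
		if hex.EncodeToString(f(input)) == want {
			fmt.Println("match:", name)
		}
	}
}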
Is there any tool or method to figure out what this hash/cipher function is?
I have only a 500-item list of inputs and outputs, plus I know all of the inputs are numeric and the output is always the hexadecimal representation of 2 bytes.
Here are some samples:
794352:6657
983447:efbf
479537:0796
793670:dee4
1063060:623c
1063059:bc1b
1063058:b8bc
1063057:b534
1063056:b0cc
1063055:181f
1063054:9f95
1063053:f73c
1063052:a365
1063051:1738
1063050:7489
I looked around and couldn't find any hash this short. Is this a hash folded on itself (with XOR, maybe), or perhaps a simple, trivial cipher?
Is there any tool or method for finding the output for other numbers?
(I want to figure this out; my next option would be training a neural network or a regression model, so I thought I'd ask before taking any drastic action.)
Edit: The numbers are directory names, and the hex parts are required to access them.
Actually, Wikipedia's page on hashes lists three CRCs and three checksum methods that it could be. It could also be only half the output of some more complex hashing mechanism. Cross your fingers and hope that it's the former: hashes are specifically meant to be difficult (if not impossible) to reverse engineer.
What it's being used for should be a very strong hint about whether or not it's more likely to be a checksum/CRC or a hash.
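If the CRC guess looks promising, it's cheap to test. Here's a Go sketch for one common variant (CRC-16/CCITT-FALSE is only one of many 16-bit CRCs, and treating the directory name as an ASCII string rather than a binary integer is also an assumption):

package main

import "fmt"

// crc16 computes CRC-16/CCITT-FALSE (polynomial 0x1021, initial
// value 0xFFFF, no reflection), bit by bit.
func crc16(data []byte) uint16 {
	crc := uint16(0xFFFF)
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = crc<<1 ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

func main() {
	// Compare against the first sample; if no variant matches, try
	// other polynomials/initial values or a binary input encoding.
	fmt.Printf("%04x (want 6657)\n", crc16([]byte("794352")))
}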
If I have a hash, say 0d47aeda9d97686ab3da96bae2c93d078a5ab253,
how do I do the math to find the number of possibilities to try if I start at 0000000000000000000000000000000000000000 and go to 9999999999999999999999999999999999999999, which is the general length of a SHA-1?
The number of possibilities would be 2^(X) where X is the number of bits in the hash.
In the normal hexadecimal string representation of the hash value, like the one you gave, each character is 4 bits, so it would be 2^(4*len) where len is the string length of the hash value. In your example, you have a 40-character SHA-1 digest, which corresponds to 160 bits, or 2^160 == 1.4615016373309029182036848327163e+48 values.
An SHA-1 hash is 160 bits, so there are 2^160 possible hashes.
Your hexadecimal digits range from 0 through f.
So it's simply 16^40 (which equals 2^160) for a 40-character digest, or 16^len for however many characters it contains.
Recall that a hash function accepts inputs of arbitrary length. A good cryptographic hash function will seem to assign a "random" hash result to any input. So if the digest is N bits long (for SHA-1, N=160), then every input will be hashed to one of 2^N possible results, in a manner we'll treat as random.
That means that the expected cost of finding a preimage for your hash result is running through about 2^N inputs. They don't have to be specifically the range that you suggested: any 2^N distinct inputs are fine.
This also means that 2^N inputs don't guarantee that you'll find a preimage - each try is random, so you might miss your 1-in-2^N chance in every single one of those 2^N inputs (just like flipping a coin twice doesn't guarantee you'll get heads at least once). But you can figure out how many inputs are required to find a preimage for the hash with probability p or greater - with p being as close to one as you desire (just not actually 1).
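To make that concrete, treat each try as an independent 1-in-2^N event: after k tries the success probability is p = 1 - (1 - 2^-N)^k, so k = ln(1-p) / ln(1-2^-N) ≈ -2^N * ln(1-p). For p = 0.5 that is about 0.69 * 2^N tries; for p = 0.99, about 4.6 * 2^N.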
The maximum number of variations, with repetition and with order taken into account, is n^k. In your case this would mean 10^40, which can't be correct for SHA-1. Reading Wikipedia, it says SHA-1 has a maximum complexity of 2^80 for a collision-based attack, and using different techniques researchers have already been successful with attacks of complexity 2^51, so 10^40 seems a bit much.
Take a commonly used binary hash function - for example, SHA-256. As the name implies, it outputs a 256 bit value.
Let A be the set of all possible 256 bit binary values. A is extremely large, but finite.
Let B be the set of all possible binary values. B is infinite.
Let C be the set of values obtained by running SHA-256 on every member of B. Obviously this can't be done in practice, but I'm guessing we can still do mathematical analysis of it.
My Question: By necessity, C ⊆ A. But does C = A?
EDIT: As was pointed out by some answers, this is wholly dependent on the hash function in question. So, if you know the answer for any particular hash function, please say so!
First, let's point out that SHA-256 does not accept all possible binary strings as input. As defined by FIPS 180-3, SHA-256 accepts as input any sequence of bits of length lower than 2^64 bits (i.e. no more than 18446744073709551615 bits). This is very common; all hash functions are somehow limited in formal input length. One reason is that the notion of security is defined with regard to computational cost; there is a threshold on the computational power that any attacker may muster. Inputs beyond a given length would require more than that maximum computational power to simply evaluate the function. In brief, cryptographers are very wary of infinities, because infinities tend to prevent security from being even defined, let alone quantified. So your input set B should be restricted to sequences of up to 2^64-1 bits.
That being said, let's see what is known about hash function surjectivity.
Hash functions try to emulate a random oracle, a conceptual object which selects outputs at random under the only constraint that it "remembers" previous inputs and outputs and, if given an already seen input, returns the same output as previously. By definition, a random oracle can be proven surjective only by trying inputs and exhausting the output space. If the output has size n bits, then about 2^(2n) distinct inputs are enough to make it overwhelmingly likely that the whole output space of size 2^n has been covered. For n = 256, this means that hashing about 2^512 messages (e.g. all messages of 512 bits) ought to be enough. SHA-256 accepts inputs very much longer than 512 bits (indeed, it accepts inputs up to 18446744073709551615 bits), so it seems highly plausible that SHA-256 is surjective.
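(For intuition: covering all 2^n outputs by random sampling is the coupon collector's problem, which takes about 2^n * ln(2^n) ≈ 0.7 * n * 2^n draws on average, while after 2^(2n) draws the expected number of still-unreached outputs is 2^n * (1 - 2^-n)^(2^(2n)) ≈ 2^n * e^(-2^n), i.e. essentially zero.)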
However, it has not been proven that SHA-256 is surjective, and that is expected. As shown above, a surjectivity proof for a random oracle requires an awful lot of computing power, substantially more than mere attacks such as preimages (2^n) and collisions (2^(n/2)). Consequently, a good hash function "should not" allow a property such as surjectivity to be actually proven. It would be very suspicious: the security of hash functions stems from the intractability of their internal structure, and such intractability should firmly resist any attempt at mathematical analysis.
As a consequence, surjectivity is not formally proven for any decent hash function, and not even for "broken" hash functions such as MD4. It is only "highly suspected" (a random oracle with inputs much longer than the output should be surjective).
Not necessarily. The pigeonhole principle states that once more hashes have been generated than there are elements of A, the probability of a collision is 1, but it does not state that every single element of A has been generated.
It really depends on the hash function. If you use this valid hash function:
func hash(input string) [32]byte {
	return [32]byte{} // every input maps to the all-zero 256-bit digest
}
then it is obvious that C != A. So the "for example, SHA256" is a pretty important note to consider.
To answer your actual question: I believe so, but I'm just guessing. Wikipedia does not provide any meaningful info on this.
Not necessarily. That would depend on the hash function.
It would probably be ideal if the hash function were surjective, but there are things that are usually more important, such as a low likelihood of collisions.
It is not always the case. However, the qualities required of a hash algorithm are:
The cardinality of the output set
The distribution of hashes over the output set (every possible value must have the same probability of being a hash)
Do you know of any other ciphers that perform like the ROT47 family?
My major requirement is that it be keyless.
Sounds like you might be looking for some "classical cryptography" solutions.
SUBSTITUTION CIPHERS are encodings where one character is substituted with another. E.g. A->Y, B->Q, C->P, and so on. The "Caesar Cipher" is a special case where the order is preserved, and the "key" is the offset. In the rot13/47 case, the "key" is 13 or 47, respectively, though it could be something like 3 (A->D, B->E, C->F, ...).
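A minimal Go sketch of that shift family (rot13 is the case n = 13, and applying rotN(13, ...) twice restores the input):

func rotN(n byte, s string) string {
	out := []byte(s)
	for i, c := range out {
		switch {
		case c >= 'a' && c <= 'z':
			out[i] = 'a' + (c-'a'+n)%26 // shift within the lowercase alphabet
		case c >= 'A' && c <= 'Z':
			out[i] = 'A' + (c-'A'+n)%26 // shift within the uppercase alphabet
		}
	}
	return string(out)
}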
TRANSPOSITION CIPHERS don't substitute letters; instead, they rearrange letters in a pre-defined way. For example:
CRYPTOGRAPHY
may be written as
C Y T G A H
R P O R P Y
So the ciphered output is created by reading the two lines left to right
CYTGAHRPORPY
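A quick Go sketch of that two-row transposition (even-indexed letters first, then odd-indexed ones, assuming plain ASCII input):

func railFence2(s string) string {
	var top, bottom []byte
	for i := 0; i < len(s); i++ {
		if i%2 == 0 {
			top = append(top, s[i])
		} else {
			bottom = append(bottom, s[i])
		}
	}
	return string(top) + string(bottom) // "CRYPTOGRAPHY" -> "CYTGAHRPORPY"
}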
Another property of rot13/47 is that it's REVERSIBLE:
encode(encode(plaintext)) == plaintext
If this is the property you want, you could simply XOR the message with a known (previously decided) XOR value. Then, XOR-ing the ciphertext with the same value will return the original plaintext. An example of this would be the memfrob function, which just XORs a buffer with the binary representation of the number 42.
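A minimal Go version of that XOR scheme (42 matches memfrob's constant; any fixed byte behaves the same way):

// frob XORs each byte with 42; applying it a second time
// undoes the first application, so encoding and decoding are identical.
func frob(data []byte) []byte {
	out := make([]byte, len(data))
	for i, b := range data {
		out[i] = b ^ 42
	}
	return out
}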
You might also check out other forms of ENCODING, such as Base64, if that's closer to what you're looking for.
!! Disclaimer - if you have data that you're actually trying to protect from anyone, don't use any of these methods. While entertaining, all of these methods are trivial to break.