I have 5 numeric codes. They vary in length (8-10 digits). For each numeric code I have a corresponding alpha-numeric code. The alpha-numeric codes are always 8 characters in length.
Now the problem: I know that by some process each numeric code is converted into its corresponding 8-character alpha-numeric code, but I do not know the process used. At first I thought the alpha-numeric codes might be randomly generated using a seed derived from the numeric code, but that did not seem to work. Now I am thinking that some sort of hashing algorithm is being used to convert the numerics to the alpha-numerics.
My questions are:
1) Can I brute-force solve this?
2) If yes, what algorithms should I look into that can convert a numeric code to an 8-character alpha-numeric code?
3) Is there some other way to solve this?
Notes: The alpha-numeric codes are not case-sensitive. I do not mind if a brute-force search returns a few false positives, because I will be able to narrow them down myself.
Clarification: I think the first answerer misunderstood something. I know the exact values of these numeric and alpha-numeric codes; I simply am not sharing them on the site. I'm not trying to randomly map codes to codes; I'm trying to find an algorithm that maps my specific codes to their outputs.
No, you cannot brute force this.
There are an unlimited number of functions that will map 5 inputs to 5 outputs. How would you know whether you had found the right one? For example, you can use these 5 pairs as constraints on a polynomial of degree n, and there are infinitely many polynomials that satisfy them.
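To make that concrete, here is a minimal sketch (with made-up pairs, since the real codes weren't shared) showing that Lagrange interpolation fits an exact polynomial through any 5 pairs, and that it is only one of infinitely many exact fits:

from fractions import Fraction

# Hypothetical pairs standing in for the real codes, which weren't shared.
pairs = [(12345678, 111), (23456789, 222), (34567890, 333),
         (45678901, 444), (56789012, 555)]

def lagrange_eval(pairs, x):
    # Evaluate the unique degree-(n-1) interpolating polynomial at x.
    total = Fraction(0)
    for i, (xi, yi) in enumerate(pairs):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(pairs):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# The polynomial matches all 5 constraints exactly, and adding any multiple
# of (x - x0)(x - x1)...(x - x4) yields another polynomial that also matches.
for xi, yi in pairs:
    assert lagrange_eval(pairs, xi) == yi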
If you can narrow the functions down, then there are additional constraints on the problem.
If you assume a hash function is used, you can try guessing that there is no salting, and the search space is over well known hash functions. If there is salting, you are stuck brute forcing all possible salts over all possible hash functions. With just the salts, you are probably looking at > 2^128 values. A brute force attack is not going to be useful.
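If you do assume an unsalted, well-known hash, the search is at least cheap to run. A minimal sketch, assuming the numeric code is hashed as an ASCII string and the hex digest is truncated to 8 characters (both assumptions you would need to vary; the pair below is hypothetical):

import hashlib

# Hypothetical pair; substitute your real codes. Note a hex digest only
# contains 0-9 and a-f, so if your alpha-numeric codes use other letters,
# the output is probably base32/base36-encoded and that step must be varied too.
numeric, alnum = "123456789", "a1b2c3d4"

for name in sorted(hashlib.algorithms_guaranteed):
    h = hashlib.new(name, numeric.encode())
    # SHAKE variants need an explicit output length.
    digest = h.hexdigest(8) if name.startswith("shake_") else h.hexdigest()
    if digest[:8].lower() == alnum.lower():
        print("candidate:", name)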
If a symmetric cipher is used, you have an instance of the known-plaintext problem. Modern ciphers are intentionally designed with this attack in mind and use 128 bits or more of key space. Brute-forcing all keys is not going to work.
You do not state anything about the function. Is it reversible? Is it randomized?
Related
Is there any tool or method to figure out what this hash/cipher function is?
I have only a 500-item list of inputs and outputs, plus I know all of the inputs are numeric, and the output is always a 2-byte-long hexadecimal representation.
Here are some samples:
794352:6657
983447:efbf
479537:0796
793670:dee4
1063060:623c
1063059:bc1b
1063058:b8bc
1063057:b534
1063056:b0cc
1063055:181f
1063054:9f95
1063053:f73c
1063052:a365
1063051:1738
1063050:7489
I looked around and couldn't find any hash this short. Is this a hash folded onto itself (with XOR, maybe)? Or maybe a simple, trivial cipher?
Is there any tool or method for finding the output for other numbers?
(I want to figure this out; my next option would be training a neural network or a regression, so I thought I would ask before taking any drastic action.)
Edit: The numbers are directory names, and the hex parts are required for accessing them.
Actually, Wikipedia's page on hash functions lists three CRCs and three checksum methods that it could be. It could also be only half the output of some more complex hashing mechanism. Cross your fingers and hope it's one of the former; hashes are specifically meant to be difficult (if not impossible) to reverse engineer.
What it's being used for should be a very strong hint about whether or not it's more likely to be a checksum/CRC or a hash.
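If you want to test the checksum/CRC theory, it is cheap to check the sample pairs against common 16-bit CRC variants. A hedged sketch: the parameter sets below are standard published ones, the input encoding (ASCII digits vs. raw integer bytes) is a guess to vary, and it is entirely possible none of them will match:

def crc16(data, poly, init, refin, refout, xorout):
    # Generic bit-by-bit CRC-16; reflected variants are handled by
    # reflecting input bytes and/or the final register.
    def reflect(v, bits):
        r = 0
        for _ in range(bits):
            r = (r << 1) | (v & 1)
            v >>= 1
        return r
    crc = init
    for byte in data:
        if refin:
            byte = reflect(byte, 8)
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    if refout:
        crc = reflect(crc, 16)
    return crc ^ xorout

# (poly, init, refin, refout, xorout) for a few standard variants.
variants = {
    "CRC-16/ARC":         (0x8005, 0x0000, True,  True,  0x0000),
    "CRC-16/CCITT-FALSE": (0x1021, 0xFFFF, False, False, 0x0000),
    "CRC-16/XMODEM":      (0x1021, 0x0000, False, False, 0x0000),
}

# A few of the sample pairs from the question, inputs encoded as ASCII digits.
samples = [("794352", 0x6657), ("983447", 0xefbf), ("479537", 0x0796)]

for name, params in variants.items():
    if all(crc16(s.encode(), *params) == out for s, out in samples):
        print("all samples match", name)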
Is SHA (-1, -2, -3) a one-to-one function for inputs the same length as the output?
To restate the question as a concrete example:
SHA-1 has a 160-bit output, so do all 160-bit inputs have unique 160-bit outputs? Is the answer the same for SHA-2 and SHA-3, and for all available output sizes?
Nobody knows, because nobody has proven it one way or the other, or tested every possible input in that range. That's the simple truth.
If the functions behaved truly randomly, then the answer would almost certainly be "no" due to the birthday paradox -- on average, you need to test 2^80 inputs to find a collision between any pair, for a 160-bit output.
Short answer: while there's no conclusive, definitive answer, I think the safer bet (by far, if you extend your question to cover all the SHA-family functions) is to say "no". Let's get a bit more mathematical.
Let's pick one of the SHA-family functions and examine it. Assume it returns an n-bit output and behaves like a "random oracle" (it doesn't, but assume it does), meaning it returns a random n-bit value for any input, with the restriction that it will always return the same output for the same input.
With those assumptions, the probability of a collision for any two input strings which are not the same ought to be 2^(-n). Because of the birthday paradox, you would expect to find a collision after about 2^(n/2) distinct inputs.
So because of the birthday paradox, the chances that our function is one-to-one when hashing n-bit inputs to n-bit outputs are not good.
Ultimately, the only way to conclusively answer your question would be to try all possible n-bit inputs with every possible n-bit SHA function. Don't count on getting a definitive answer in your lifetime...
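You can see the random-oracle behaviour at a toy scale by truncating a real hash to 16 bits and feeding it every 16-bit input once; a random function would hit only about 1 - 1/e ≈ 63% of the possible outputs, i.e. nowhere near one-to-one. This says nothing definitive about full-width SHA; it just illustrates the random-function model above:

import hashlib

# Truncate SHA-256 to 16 bits and hash every 16-bit input exactly once.
outputs = {
    hashlib.sha256(i.to_bytes(2, "big")).digest()[:2]
    for i in range(2**16)
}
print(f"distinct outputs: {len(outputs)} / {2**16}")
print(f"fraction hit: {len(outputs) / 2**16:.3f}  (random function: ~0.632)")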
If I have a hash like this: 0d47aeda9d97686ab3da96bae2c93d078a5ab253,
how do I do the math to find the number of possibilities to try if I start with 0000000000000000000000000000000000000000 and go to 9999999999999999999999999999999999999999, which is the general length of a SHA-1?
The number of possibilities would be 2^(X) where X is the number of bits in the hash.
In the normal hexadecimal string representation of the hash value, like the one you gave, each character is 4 bits, so it would be 2^(4*len), where len is the string length of the hash value. In your example you have a 40-character SHA-1 digest, which corresponds to 160 bits, or 2^160 == 1.4615016373309029182036848327163e+48 values.
An SHA-1 hash is 160 bits, so there are 2^160 possible hashes.
Your hexadecimal digit range is 0 through f.
Then it's simply 16^40, or 16 to the power of however many characters it contains.
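Both counts agree, since each hex character carries 4 bits. A quick check with Python's arbitrary-precision integers:

# 40 hex characters * 4 bits each = 160 bits, so the two counts coincide.
assert 2**160 == 16**40
print(2**160)            # 1461501637330902918203684832716283019655932542976
print(f"{2**160:.4e}")   # ~1.4615e+48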
Recall that a hash function accepts inputs of arbitrary length. A good cryptographic hash function will seem to assign a "random" hash result to any input. So if the digest is N bits long (for SHA-1, N=160), then every input will be hashed to one of 2^N possible results, in a manner we'll treat as random.
That means that the expectation for finding a preimage for your hash result is running through 2^N inputs. They don't have to be specifically the range that you suggested - any 2^N distinct inputs are fine.
This also means that 2^N inputs don't guarantee that you'll find a preimage - each try is random, so you might miss your 1-in-2^N chance in every single one of those 2^N inputs (just like flipping a coin twice doesn't guarantee you'll get heads at least once). But you can figure out how many inputs are required to find a preimage for the hash with probability p or greater - with p being as close to one as you desire (just not actually 1).
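To put numbers on that last point: the probability of at least one hit in n random tries is 1 - (1 - 2^-N)^n, and solving for n gives n ≈ -ln(1-p) * 2^N, which is easy to evaluate:

import math

N = 160                           # digest size in bits
for p in (0.5, 0.9, 0.99):
    # For tiny 2**-N, solving 1 - (1 - 2**-N)**n = p gives
    # n ~= -ln(1 - p) * 2**N.
    n = -math.log(1 - p) * 2**N
    print(f"p = {p}: about 2**{math.log2(n):.2f} tries")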
The maximum number of variations, with repetition and with attention to order, is n^k. In your case this would mean 10^40, which can't be correct for SHA-1. Reading Wikipedia, it says SHA-1 has a maximum complexity for a collision-based attack of 2^80, and using different techniques researchers have already succeeded with about 2^51 operations, so 10^40 seems a bit much.
I'm not great with statistical mathematics, etc. I've been wondering, if I use the following:
import uuid
unique_str = str(uuid.uuid4())                                # one random UUID
double_str = ''.join([str(uuid.uuid4()), str(uuid.uuid4())])  # two UUIDs concatenated
Is double_str "squared as unique" as unique_str, or just some amount more unique? Also, is there any negative implication in doing something like this (like some birthday-problem situation)? This may sound ignorant, but I simply would not know, as my math tops out at Algebra 2.
The uuid4 function returns a UUID created from 16 random bytes and it is extremely unlikely to produce a collision, to the point at which you probably shouldn't even worry about it.
If for some reason uuid4 does produce a duplicate, it is far more likely to be a programming error, such as a failure to correctly initialize the random number generator, than genuine bad luck. In that case the approach you are using will not make it any better - an incorrectly initialized random number generator can still produce duplicates even with your approach.
If you use the default implementation, random.seed(None), you can see in the source that only 16 bytes of randomness are used to initialize the random number generator, so this is an issue you would have to solve first. Also, if the OS doesn't provide a source of randomness, the system time will be used, which is not very random at all.
But ignoring these practical issues, you are basically along the right lines. To use a mathematical approach, we first have to define what you mean by "uniqueness". I think a reasonable definition is the number of ids, n, you need to generate before the probability of generating a duplicate exceeds some probability p. An approximate formula for this is:

n ≈ sqrt(2 * d * ln(1 / (1 - p)))

where d is 2**(16*8) for a single randomly generated uuid and 2**(16*2*8) with your suggested approach. The square root in the formula is indeed due to the birthday paradox. But if you work it out, you can see that if you square the range of values d while keeping p constant, then you also square n.
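Plugging the two values of d into that approximation shows the effect directly (using the d values from above; note a version-4 UUID actually carries 122 random bits rather than the full 128, which shifts the numbers slightly but not the conclusion):

import math

def ids_needed(d, p=0.5):
    # Birthday-bound approximation: ids generated before a duplicate
    # appears with probability p, given d equally likely values.
    return math.sqrt(2 * d * math.log(1 / (1 - p)))

d1 = 2**(16 * 8)       # value space for one uuid4, as in the answer above
d2 = 2**(16 * 2 * 8)   # value space for two concatenated uuid4s
print(f"single: about 2**{math.log2(ids_needed(d1)):.1f} ids")
print(f"double: about 2**{math.log2(ids_needed(d2)):.1f} ids")  # exponent roughly doubles, i.e. n is squared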
Since uuid4 is based on a pseudo-random number generator, calling it twice is not going to square the amount of "uniqueness" (and may not even add any uniqueness at all).
See also When should I use uuid.uuid1() vs. uuid.uuid4() in python?
It depends on the random number generator, but it's almost squared uniqueness.
Isn't it easily possible to construct a PRNG in such a fashion? Why is it not done?
That is, as far as I know, we could simply have a PRNG that takes a seed n. When you ask for a random bit, it takes the nth digit of the binary expansion of a computable normal number and increments n.
My first thought was that perhaps we hadn't found a computable normal number, but we have. So the remaining thought is that there is a good reason not to: either there's some property of PRNGs that I'm not familiar with that such a method would not have, or it would be impractical somehow, or it is otherwise outstripped by other methods.
That would make predicting the output really simple.
Say, for example, you generate the integer 0x54a30b7f. If you have 4 GiB of pi (or random noise, or an actual normal number), chances are there's only going to be one (or maybe a handful of) occurrences of that particular integer, and I can predict with reasonably high probability all future numbers. This is a serious problem in the case of cryptographically strong PRNGs. If instead of a simple sequential scan you use some function, I just have to follow the function, which, if it is difficult enough to follow, turns into a PRNG in its own right.
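A toy-scale sketch of that attack, with the public digit expansion faked by a fixed random byte string (everything here is illustrative; a real attacker would use the actual published digits of pi, or of whatever normal number the generator reads):

import random

# Stand-in for the publicly known digit expansion. In 1 MiB, a given
# 32-bit value almost surely occurs at most once, so one observed output
# pins down the generator's position in the stream.
rng = random.Random(0)
public_stream = rng.randbytes(1 << 20)

secret_n = 123456                                   # the generator's secret index
observed = public_stream[secret_n:secret_n + 4]     # one 32-bit output we see

# Attacker: locate the observed output in the public stream, then read
# off the "future" outputs directly.
idx = public_stream.find(observed)
predicted = public_stream[idx + 4:idx + 12]
actual = public_stream[secret_n + 4:secret_n + 12]
print(predicted == actual)                          # almost certainly True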
If you are not concerned about the cryptographic strength of your generator, then there are much more compact ways of generating random numbers. Mersenne Twister, for example, has a much larger period without requiring a 4GiB lookup table.