Pseudo-random numbers from a 32-bit auto-increment INTEGER - math

I have a table with an auto-increment 32-bit integer primary key in a database, which will produce numbers ranging 1-4294967295.
I would like to keep the convenience of an auto-generated primary key, while having the numbers shown on the front-end of the application look randomly generated.
Is there a mathematical function which would allow a two-way, one-to-one transformation between an integer and another?
For example a function would take a number, and translate it to another:
1 => 1538645623
2 => 2043145593
3 => 393439399
And another function the way back:
1538645623 => 1
2043145593 => 2
393439399 => 3
I'm not necessarily looking for an implementation here, but rather a hint on what I suppose, must be a well-known mathematical problem somewhere :)

Mathematically this is almost exactly the same problem as cryptography.
You: I want to go from an id(string of bits) to another number (string of bits) and back again in a non-obvious way.
Cryptography: I want to go from plaintext (string of bits) to another string of bits and back again (reversible) in a non-obvious way.
So for a simple solution, can I suggest just plugging in whatever cryptography algorithm is most convenient in your language, and encrypt and decrypt your id?
If you wanted to be a bit cleverer you can do what is called "salting" in addition to cryptography. Take your id as a 32 bit (or whatever) number. Concatenate it with a random 32 bit number. Encrypt the result. To reverse, just decrypt, and throw away the random part.
Of course, if someone was seriously attacking this, this might be vulnerable to known plaintext/differential cryptanalysis attacks as you have a very small known plaintext space, but it sounds like you aren't trying to defend against serious attacks.
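One practical wrinkle with the suggestion above: most off-the-shelf block ciphers work on 64- or 128-bit blocks, so the encrypted id no longer fits in 32 bits. A minimal sketch of one way around that is a small Feistel network over two 16-bit halves, which is a reversible permutation of the 32-bit values. This is a swapped-in technique, not something the answer names; KEY, ROUNDS and the function names are made up for illustration, and this toy construction has not been vetted as a cipher.

```python
import hashlib
import hmac

KEY = b"hypothetical-secret-key"  # assumption: any secret byte string
ROUNDS = 4                        # assumption: small round count for a toy

def _round(half, i):
    # Keyed round function: HMAC-SHA256 of (round index, 16-bit half),
    # truncated to 16 bits.
    msg = bytes([i]) + half.to_bytes(2, "big")
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big")

def encrypt_id(n):
    # Split the 32-bit value into two 16-bit halves and run the rounds.
    left, right = n >> 16, n & 0xFFFF
    for i in range(ROUNDS):
        left, right = right, left ^ _round(right, i)
    return (left << 16) | right

def decrypt_id(c):
    # Undo the rounds in reverse order.
    left, right = c >> 16, c & 0xFFFF
    for i in reversed(range(ROUNDS)):
        left, right = right ^ _round(left, i), left
    return (left << 16) | right
```

Here decrypt_id(encrypt_id(n)) gives back n for any n from 0 to 2^32-1. Note that the output range includes 0, even though the ids start at 1.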

First remove the offset of 1, so you get numbers in the range 0 to 2^32-2. Let m = 2^32-1.
Choose some a that is relatively prime to m. Since it is relatively prime, it has an inverse a' such that a * a' = 1 (mod m). Also choose some b. Choose big numbers to get a good mixing effect.
Then you can compute your desired pseudo-random number by y = (a * x + b) % m, and get back the original by x = ((y - b) * a') % m.
This is essentially one step of a linear congruential generator (LCG) for pseudo-random numbers.
Note that this is not secure, it is only obfuscation. For example, if a user can get two numbers in sequence then he can recover a and b easily.
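A minimal sketch of the scheme described above, in Python 3.8+ (for the three-argument pow with a negative exponent). The constants A and B are arbitrary picks for illustration and not part of the answer; any A with gcd(A, M) == 1 would do.

```python
import math

M = 2**32 - 1            # modulus from the answer: ids 1..2^32-1 become 0..M-1
A = 2654435761           # assumed multiplier; must be relatively prime to M
B = 1013904223           # assumed offset; any value in 0..M-1 works
assert math.gcd(A, M) == 1
A_INV = pow(A, -1, M)    # modular inverse a', so that A * A_INV % M == 1

def obfuscate(row_id):
    x = row_id - 1                   # remove the offset of 1
    return (A * x + B) % M + 1       # y = (a*x + b) % m, shifted back to 1..M

def deobfuscate(public_id):
    y = public_id - 1
    return ((y - B) * A_INV) % M + 1  # x = ((y - b) * a') % m
```

As the answer says, this is only obfuscation: the difference between the outputs for two consecutive ids is a (mod m).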

Many web apps use a hash of a randomly generated number as the public reference to a table row. This hash can be stored as a number and displayed as a string to the end user.
The hash is unique and serves as the external identifier; the id is used only inside the application itself and is never shown to the outside world.
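A rough sketch of that layout, using an in-memory SQLite table. The table and column names are hypothetical, and for brevity it stores a random token directly rather than a hash of a random number.

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # internal key, never exposed
    " public_id TEXT NOT NULL UNIQUE,"        # random reference shown to users
    " name TEXT NOT NULL)"
)

def create_item(name):
    # 8 random bytes rendered as 16 hex characters.
    public_id = secrets.token_hex(8)
    conn.execute(
        "INSERT INTO items (public_id, name) VALUES (?, ?)", (public_id, name)
    )
    return public_id

def find_item(public_id):
    # Look the row up by its public reference; returns (id, name) or None.
    return conn.execute(
        "SELECT id, name FROM items WHERE public_id = ?", (public_id,)
    ).fetchone()
```

The UNIQUE constraint makes the database reject the (very unlikely) duplicate token, so the caller could simply retry with a fresh one.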

Related

RSA decryption methodology

I'm not learning cryptography yet, and this exercise - in the form it was delivered as a homework, was more of an exercise on reading composite functions and the like. Either way, I took a look at some part of the source code and didn't understand this.
For RSA encryption, the source code manipulated the string in such a way:
Message is being hashed into an integer list. (int1, int2, int3...)
Encrypt int1
Subtract result from int2 ( int2 - e(int1))
Modulo with the modulo key (n)
RSA transform with a key.
However, the RSA decryption method is done by:
1) RSA_transform
2) Result is added
3) Modulo with n
The part that puzzles me about the RSA decryption is the need for the modulo after the RSA_transform and the addition. If it's needed, shouldn't it be applied in the reverse order of the chain of operations carried out in RSA encryption?
Also, an "invert_modulo" was provided in the source code. I originally believed this to be a key in decrypting the message, but it wasn't so. What could "invert_modulo" be used for?
I cannot understand the first part of your question, as the steps used to hash the string are not clear, and I also don't follow the third step of your encryption. As for the second question, invert_modulo is the modular multiplicative inverse.
When working with modular arithmetic we always want the answer to be in the integer range 0 to M-1 (where M is the modulus). Simple operations like addition, multiplication and subtraction are easy to perform: (a+b) MOD M, for example, is well defined under the constraints of modular arithmetic.
The problem arises when we try to divide: (a/b) MOD M
As you can see, a/b may not always give an integer, so (a/b) does not necessarily lie in the integer range 0 to M-1. To overcome this, we find an inverse of b and multiply a by it instead, i.e. (a*b_inverse) MOD M.
b_inverse can be defined as: (b*b_inverse) MOD M = 1.
i.e. b_inverse is a number in the range 0 to M-1 which, when multiplied by b, yields 1 modulo M.
Note: the modular inverse of some numbers might not exist. We can check this by taking the GCD of M and the number concerned (b in our example); if the GCD is not equal to 1, the modular inverse does not exist.
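For illustration, here is one common way such a function is written, using the extended Euclidean algorithm. The name invert_modulo is borrowed from the question, but the signature and the choice to return None when no inverse exists are assumptions; the homework's actual code may differ.

```python
def invert_modulo(b, m):
    """Return b_inverse with (b * b_inverse) % m == 1, or None if gcd(b, m) != 1."""
    # Invariant: old_s * b is congruent to old_r modulo m.
    old_r, r = b % m, m
    old_s, s = 1, 0
    while r != 0:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_s, s = s, old_s - quotient * s
    if old_r != 1:
        # old_r is gcd(b, m); no inverse exists unless it equals 1.
        return None
    return old_s % m
```

For example, invert_modulo(3, 11) gives 4, since 3 * 4 = 12 = 1 (mod 11), while invert_modulo(2, 4) gives None because GCD(2, 4) = 2.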

Why are "large prime numbers" used in RSA/encryption?

I've learned the theory of public key encryption but I'm missing the connection to the physical world. e.g.
I've been told that good RSA encryption should rely on prime numbers with 300 decimal digits but why? who came up with this number? How long it will take to break such encryption (statistics about different machines).
I've tried Google, but couldn't find what I wanted. anyone?
thanks
The key idea of asymmetric cryptography is to have an asymmetric function that allows a message encrypted with one key to be decrypted with the other, without allowing one key to be derived from the other. In RSA, the function is based on the difficulty of factoring the product of two large primes; however, it is not the only option (elliptic curves are another, for example).
So, basically, you need two prime numbers to generate an RSA key pair. If you are able to factor the modulus in the public key and find these prime numbers, you will then be able to find the private key. The whole security of RSA rests on the fact that it is not easy to factor large composite numbers; that's why the key length strongly affects the robustness of the RSA algorithm.
There are competitions, with prizes, to factor large composite numbers by computer. The latest milestone in factoring RSA keys was reached in 2009, when a 768-bit key was factored. That's why keys of at least 2048 bits should be used now.
As usual, Wikipedia is a good reference on RSA.
All public key algorithms are based on trapdoor functions, that is, mathematical constructs that are "easy" to compute in one way, but "hard" to reverse unless you have also some additional information (used as private key) at which point also the reverse becomes "easy".
"Easy" and "hard" are just qualitative adjectives that are always more formally defined in terms of computational complexity. "Hard" very often refers to computations that cannot be solved in polynomial time O(n^x) for some fixed x, where n is the size of the input data.
In the case of RSA, the "easy" function is the modular exponentiation C = M^e mod N where the factors of N are kept secret. The "hard" problem is to find the e-th root of C modulo N (that is, M). Of course, "hard" does not mean that it is always hard, but (intuitively) that increasing the size of N by a certain factor increases the complexity by a much larger factor.
The sizes of the modulus which are recommended (2048 bits, or 617 decimal digits) relate to the availability of computation power at present time, so that if you stick to them you are assured that it will be extremely expensive for the attacker to break it. For more details, I should refer you to a brilliant answer on cryptography.SE (go and upvote :-)).
Finally, in order to have a trapdoor, N is built so as to be a composite number. In theory, for improved performance, N may have more than 2 factors, but the general security rule is that all factors must be balanced and have roughly the same size. That means that if you have K factors, and N is B bits long, each factor is roughly B/K bits long.
The problem to solve is not the same as the integer factorization problem, though. The two are related in that if you manage to factor N you can compute the private key by redoing what the party that generated the key did. Typically, the exponent e being used is very small (3); it cannot be excluded that someday somebody devises an algorithm to compute the e-th root without factoring N.
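To make the trapdoor concrete, here is a toy sketch with textbook-sized primes (Python 3.8+). The values p = 61, q = 53 and e = 17 are chosen only for illustration and are far too small for real use; real RSA also applies padding, which is omitted here.

```python
# Toy parameters, for illustration only.
p, q = 61, 53
n = p * q                 # public modulus N (3233)
e = 17                    # public exponent

# Anyone who knows the factors p and q can redo the key generation:
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)       # private exponent, the inverse of e modulo phi

def encrypt(m):
    # The "easy" direction: C = M^e mod N.
    return pow(m, e, n)

def decrypt(c):
    # Easy only with d, which in turn is easy to get only from p and q.
    return pow(c, d, n)
```

With these numbers, decrypt(encrypt(65)) returns 65. An attacker who sees only n and e would need to factor n (trivial at this size, infeasible at 2048 bits) or find another way to take e-th roots modulo n.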
EDIT: Corrected the number of decimal digits for the modulus of a 2048 bits RSA key.
RSA uses the idea of one-way math functions, so that it's easy to encrypt and decrypt if you have the key, but hard (as in it takes lots and lots of CPU cycles) to decrypt if you don't have the key. Even before they thought of using prime numbers, mathematicians identified the need for a one-way function.
The first method they hit upon was the idea that if your "key" is a prime number, and your message is another number, then you can encrypt by multiplying the two together. Someone with the key can easily divide out the prime number and get the message, but for someone without the prime number, figuring out the prime number key is hard.

how to find the number of possibilities of a hash

if i have a hash say like this: 0d47aeda9d97686ab3da96bae2c93d078a5ab253
how do i do the math to find out the number of possibilities to try if i start with 0000000000000000000000000000000000000000 to 9999999999999999999999999999999999999999 which is the general length of a sha1.
The number of possibilities would be 2^(X) where X is the number of bits in the hash.
In the normal hexadecimal string representation of the hash value like the one you gave, each character is 4 bits, so it would be 2^(4*len) where len is the string length of the hash value. In your example, you have a 40 character SHA1 digest, which corresponds to 160 bits, or 2^160 == 1.4615016373309029182036848327163e+48 values.
An SHA-1 hash is 160 bits, so there are 2^160 possible hashes.
Your hexadecimal digit range is 0 through f.
Then it's simply 16^40 or however many characters it contains
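A quick check of the two ways of counting, using Python's arbitrary-precision integers and the digest from the question:

```python
hex_digest = "0d47aeda9d97686ab3da96bae2c93d078a5ab253"

bits = 4 * len(hex_digest)             # each hex character encodes 4 bits
by_bits = 2 ** bits                    # 2^160
by_chars = 16 ** len(hex_digest)       # 16^40

print(bits)                            # number of bits in the digest
print(by_bits == by_chars)             # both counts agree
print(by_bits)                         # the exact number of possible values
```

Both expressions give the same 49-digit number, because 16^40 = (2^4)^40 = 2^160.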
Recall that a hash function accepts inputs of arbitrary length. A good cryptographic hash function will seem to assign a "random" hash result to any input. So if the digest is N bits long (for SHA-1, N=160), then every input will be hashed to one of 2^N possible results, in a manner we'll treat as random.
That means that the expectation for finding a preimage for your hash result is running though 2^N inputs. They don't have to be specifically the range that you suggested - any 2^N distinct inputs are fine.
This also means that 2^N inputs don't guarantee that you'll find a preimage - each try is random, so you might miss your 1-in-2^N chance in every single one of those 2^N inputs (just like flipping a coin twice doesn't guarantee you'll get heads at least once). But you can figure out how many inputs are required to find a preimage for the hash with probability p or greater - with p being as close to one as you desire (just not actually 1).
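The numbers behind that argument can be sketched as follows, under the same modelling assumption that every try independently hits the target with probability 1/2^N. The function names are made up for illustration.

```python
import math

def hit_probability(tries, n_bits):
    # P(at least one preimage) = 1 - (1 - 2^-N)^tries,
    # computed with log1p/expm1 so it stays accurate for tiny 2^-N.
    log_miss = math.log1p(-2.0 ** -n_bits)
    return -math.expm1(tries * log_miss)

def tries_needed(p, n_bits):
    # Smallest number of tries giving success probability at least p (p < 1).
    return math.ceil(math.log1p(-p) / math.log1p(-2.0 ** -n_bits))
```

For SHA-1, hit_probability(2**160, 160) comes out at about 0.63 (that is, 1 - 1/e), so trying 2^N inputs succeeds a bit less than two times out of three.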
The maximum number of variations with repetition, where order matters, is n^k. In your case this would mean 10^40, which can't be correct for SHA1. According to Wikipedia, SHA1 has a maximum complexity of 2^80 for a collision-based attack, and using different techniques researchers have reportedly already succeeded with a complexity of about 2^51, so 10^40 seems a bit much.

calculate the average of three encrypted numbers

Is it possible to calculate the average of three encrypted integers? There is no constraint on the method of encryption. The point is just to hide the three numbers and still find their average.
What you seem to be looking for is called Homomorphic Encryption: an encryption scheme which allows you to perform operations on encrypted data, with the encrypted result as the outcome.
Such a scheme would allow you to give encrypted data to a 3rd party, which could then do computations on it for you without knowing what they were computing.
In your case, you need two operations: addition and division. Until recently, homomorphic encryption schemes typically supported only 1 operation. But in September 2009 IBM announced the first fully homomorphic cryptosystem. Other researchers published another system soon after that.
These cryptosystems might be able to do what you want, but it is all cutting-edge computer science research.
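For the averaging case specifically, a scheme that supports only addition may already be enough, since the division by 3 can be done after decrypting the sum. As a hedged illustration, here is a toy version of the Paillier cryptosystem, which is additively homomorphic. Paillier is not mentioned in the answer above; it is swapped in here as an example, and the tiny primes make this sketch completely insecure.

```python
import math
import random

# Toy key generation with tiny primes (illustration only).
p, q = 61, 53
n = p * q
n2 = n * n
g = n + 1                                            # standard simple choice of g
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1)
mu = pow(lam, -1, n)                                 # inverse of lam modulo n

def encrypt(m):
    # c = g^m * r^n mod n^2, with random r relatively prime to n.
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    # m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n.
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def add_encrypted(*ciphertexts):
    # Multiplying ciphertexts adds the underlying plaintexts (mod n).
    total = 1
    for c in ciphertexts:
        total = (total * c) % n2
    return total
```

A third party could call add_encrypted on the three ciphertexts without the private values lam and mu; the key holder then decrypts the sum and divides by 3, for example decrypt(add_encrypted(encrypt(3), encrypt(5), encrypt(6))) / 3.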
Decrypt the numbers, then calculate their average.
I don't see any simple ways to do what you ask, apart from decrypting the numbers first.
Taking the average (or the "arithmetic mean") requires adding the numbers. Now if you wanted to multiply the numbers, then you could do that neatly with RSA encryption. If p is the plaintext, c is the ciphertext, e is the encryption key and n is the modulus, then in RSA, c = p^e (mod n). If you have 3 separate integers, p1, p2, p3, and their product is pp, then
pp^e = (p1 * p2 * p3)^e = p1^e * p2^e * p3^e = c1 * c2 * c3 = cp (mod n)
That is, you can either multiply the three plaintext integers together and then encrypt, or you can just multiply the three ciphertexts together, and get the same answer. This would get you some way towards the "geometric mean", where you multiply all the numbers together, and then take the cube-root (or nth root for n numbers). Unfortunately, calculating a cube root in modular arithmetic is non-trivial.
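The identity can be checked with a toy, unpadded ("textbook") RSA key. The primes and exponent here are illustrative assumptions only, and real RSA padding deliberately destroys this property.

```python
# Toy textbook-RSA public key (illustration only, far too small for real use).
prime_p, prime_q = 61, 53
n = prime_p * prime_q
e = 17

def encrypt(m):
    return pow(m, e, n)

p1, p2, p3 = 3, 5, 6

# Multiply the ciphertexts...
product_of_ciphertexts = (encrypt(p1) * encrypt(p2) * encrypt(p3)) % n
# ...or multiply the plaintexts and encrypt the product.
ciphertext_of_product = encrypt((p1 * p2 * p3) % n)

print(product_of_ciphertexts == ciphertext_of_product)
```

The two values agree, which is exactly the relation pp^e = c1 * c2 * c3 (mod n).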
With ideal encryption methods: No.
With most real-world encryption methods: No.
With some stupidly simple to undo obfuscation method especially designed to allow averaging: Yes.
Calling the latter method "encryption" really would be using the wrong term.
If you could calculate the average of encrypted numbers without decrypting them, that would make decrypting the original numbers quite a lot easier, so I would be very surprised if this works with any serious encryption algorithm.
In general, three numbers shouldn't keep the same order once encrypted, so I'm pretty sure you have to decrypt them and then calculate the average.
If, and only if, the method of encryption is a one-to-one mathematical function, then it is possible to do so while the numbers are encrypted.
For example, if my very insecure method of encryption is to multiply every number by 2, then I would do the following:
function encrypt($number){
    return $number*2;
}
$a=encrypt(3); // a = 6
$b=encrypt(5); // b = 10
$c=encrypt(6); // c = 12
$average = ($a+$b+$c)/6; // We divide by 6 because first we divide by 3 to get the average, then by 2 to do the decryption. The method will vary based on the mathematical function.
The only other possibility is to decrypt the numbers first.

Do cryptographic hash functions reach each possible values, i.e., are they surjective?

Take a commonly used binary hash function - for example, SHA-256. As the name implies, it outputs a 256 bit value.
Let A be the set of all possible 256 bit binary values. A is extremely large, but finite.
Let B be the set of all possible binary values. B is infinite.
Let C be the set of values obtained by running SHA-256 on every member of B. Obviously this can't be done in practice, but I'm guessing we can still do mathematical analysis of it.
My Question: By necessity, C ⊆ A. But does C = A?
EDIT: As was pointed out by some answers, this is wholly dependent on the hash function in question. So, if you know the answer for any particular hash function, please say so!
First, let's point out that SHA-256 does not accept all possible binary strings as input. As defined by FIPS 180-3, SHA-256 accepts as input any sequence of bits of length lower than 2^64 bits (i.e. no more than 18446744073709551615 bits). This is very common; all hash functions are somehow limited in formal input length. One reason is that the notion of security is defined with regards to computational cost; there is a threshold on the computational power that any attacker may muster. Inputs beyond a given length would require more than that maximum computational power to simply evaluate the function. In brief, cryptographers are very wary of infinities, because infinities tend to prevent security from being even defined, let alone quantified. So your input set B should be restricted to sequences of up to 2^64-1 bits.
That being said, let's see what is known about hash function surjectivity.
Hash functions try to emulate a random oracle, a conceptual object which selects outputs at random under the only constraint that it "remembers" previous inputs and outputs, and, if given an already seen input, it returns the same output as before. By definition, a random oracle can be proven surjective only by trying inputs and exhausting the output space. If the output has size n bits, then roughly n*2^n distinct inputs are expected to be needed to exhaust the output space of size 2^n (the coupon collector's estimate), so 2^(2n) inputs are more than enough. For n = 256, this means that hashing about 2^512 messages (e.g. all messages of 512 bits) ought to be ample. SHA-256 accepts inputs very much longer than 512 bits (indeed, it accepts inputs up to 18446744073709551615 bits), so it seems highly plausible that SHA-256 is surjective.
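The exhaustion argument can be tried at toy scale by truncating SHA-256 to a few bits and counting how many inputs it takes until every truncated output has appeared. This is only an experiment on a truncated hash, under the assumption that the truncation behaves like a small random oracle; it says nothing rigorous about the full 256-bit function. The function name and the choice of inputs (decimal counters) are made up for illustration.

```python
import hashlib

def inputs_until_surjective(n_bits=8, limit=1_000_000):
    """Hash "1", "2", "3", ... and keep the top n_bits (at most 8) of each digest.

    Returns how many inputs were needed before all 2^n_bits truncated
    outputs had been seen, or None if the limit was reached first.
    """
    seen = set()
    for count in range(1, limit + 1):
        digest = hashlib.sha256(str(count).encode()).digest()
        seen.add(digest[0] >> (8 - n_bits))   # top n_bits of the first byte
        if len(seen) == 2 ** n_bits:
            return count
    return None

print(inputs_until_surjective(8))
```

For 8 bits, the coupon collector's estimate of N ln N with N = 256 is on the order of 1,400 inputs, so a count in the low thousands is what one would expect to see, rather than exactly 256.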
However, it has not been proven that SHA-256 is surjective, and that is expected. As shown above, a surjectivity proof for a random oracle requires an awful lot of computing power, substantially more than mere attacks such as preimages (2^n) and collisions (2^(n/2)). Consequently, a good hash function "should not" allow a property such as surjectivity to be actually proven. It would be very suspicious: the security of a hash function stems from the intractability of its internal structure, and such intractability should firmly resist any attempt at mathematical analysis.
As a consequence, surjectivity is not formally proven for any decent hash function, and not even for "broken" hash functions such as MD4. It is only "highly suspected" (a random oracle with inputs much longer than the output should be surjective).
Not necessarily. The pigeonhole principle states that once one more hash than the size of A has been generated, the probability of a collision is 1, but it does not state that every single element of A has been generated.
It really depends on the hash function. If you use this valid hash function:
Int256 Hash (string input) {
return 0;
}
then it is obvious that C != A. So the "for example, SHA256" is a pretty important note to consider.
To answer your actual question: I believe so, but I'm just guessing. Wikipedia does not provide any meaningful info on this.
Not necessarily. That would depend on the hash function.
It would probably be ideal if the hash function was surjective, but there are things that're usually more important, such as a low likelihood of collisions.
It is not always the case. However, the qualities required of a hash algorithm are:
Cardinality of B
Distribution of hashes in B (every value in B must have the same probability of being a hash)
