Series of numbers with minimized risk of collision - math

I want to generate some numbers that share as few common bit patterns as possible, so that collisions are as rare as possible. So far this is just "simple" hashing with a given number of output bits. However, there is another 'constraint': I want to minimize the risk that, if you take one number and toggle a small number of its bits, you end up with another number that has already been generated. Note: I don't want it to be impossible or something, I want to minimize the risk!
How do I calculate the probability for a list of n numbers, where each number has m bits? And, of course, what would be a suitable method to generate those numbers? Any good articles about this?

To answer this question precisely, you need to say what exactly you mean by "collision", and what you mean by "generate". If you just want the strings to be far apart from each other in Hamming distance, you could hope to make an optimal, deterministic set of such strings. It is true that random strings will have this property with high probability, so you could use random strings instead.
When you say
Note: I don't want it to be impossible or something, I want to minimize the risk!
this sounds like an XY problem. If some outcome is the "bad thing" then why do you want it to be possible, but just low probability? Shouldn't you want it not to happen at all?
In short, I think you should look up the term "error-correcting code". The codewords of any good error-correcting code, with whatever parameters you choose, will have a minimal risk of collision in the presence of random noise for that number of codewords of that length, and they can typically be generated very easily using matrix multiplication.
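As a minimal sketch of the "matrix multiplication" remark, the snippet below builds the 16 codewords of a Hamming(7,4) code from a generator matrix and measures the smallest Hamming distance between any two of them. The particular matrix `G` and the helper names `encode` and `hamming_distance` are illustrative assumptions, not something taken from the answer; any other linear code could be substituted.

```python
from itertools import combinations, product

# Generator matrix of a systematic Hamming(7,4) code (one possible choice).
# Each row is a basis codeword: 4 message bits followed by 3 parity bits.
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def encode(message):
    """Multiply a 4-bit message vector by G over GF(2)."""
    return tuple(
        sum(m * g for m, g in zip(message, column)) % 2
        for column in zip(*G)
    )

def hamming_distance(a, b):
    """Number of bit positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

# All 2^4 messages give the full set of codewords.
codewords = [encode(msg) for msg in product((0, 1), repeat=4)]

# Smallest distance between any two distinct codewords.
min_distance = min(
    hamming_distance(a, b) for a, b in combinations(codewords, 2)
)
```

With this code, at least `min_distance` bits have to be toggled to turn one generated number into another. That is a guarantee rather than a probability.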

Related

Find the first root and local maximum/minimum of a function

Problem
I want to find
The first root
The first local minimum/maximum
of a black-box function in a given range.
The function has the following properties:
It's continuous and differentiable.
It's a combination of constant and periodic functions. All periods are known.
(It's better if it can be done with weaker assumptions)
What is the fastest way to get the root and the extremum?
Do I need more assumptions or bounds of the function?
What I've tried
I know I can use a root-finding algorithm. What I don't know is how to find the first root efficiently.
It needs to be fast enough to run within a few milliseconds with a precision of 1.0 over a range of 1.0e+8, which is the problem.
Since the range could be quite large and the result should be precise enough, I can't brute-force it by checking all the possible subranges.
I considered the bisection method, but it's too slow for finding the first root if the function has only one root in a large range, as every subrange has to be checked.
It's preferable if the solution is in java, but any similar language is fine.
Background
I want to calculate when an arbitrary celestial object reaches a certain height.
It's a configuration-defined virtual object, so I can't assume anything about the object.
It's not easy to get either an analytical solution or a simple approximation because various coordinates are involved.
I decided to find a numerical solution for this.
For a general black box function, this can't really be done. Any root finding algorithm on a black box function can't guarantee that it has found all the roots or any particular root, even if the function is continuous and differentiable.
The property of being periodic gives a bit more hope, but you can still have periodic functions with infinitely many roots in a bounded domain. Given that your function relates to celestial objects, this isn't likely to happen. Assuming your periodic functions are sinusoidal, I believe you can get away with checking subranges on the order of one-quarter of the shortest period (out of all the periodic components).
Maybe try Brent's Method on the shortest quarter period subranges?
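A minimal sketch of the quarter-period scan, assuming the shortest period is known and that each window contains at most one sign change. It uses plain bisection in place of Brent's method to stay dependency-free, and the function name `first_root` and the example function `f` are hypothetical.

```python
import math

def first_root(f, a, b, step, tol=1e-6):
    """Scan [a, b] left to right in windows of width `step` and
    bisect the first window whose endpoints differ in sign.
    Returns None if no sign change is found."""
    lo, f_lo = a, f(a)
    while lo < b:
        if f_lo == 0.0:
            return lo
        hi = min(lo + step, b)
        f_hi = f(hi)
        if f_lo * f_hi < 0:
            # Sign change: refine by bisection inside [lo, hi].
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                f_mid = f(mid)
                if f_mid == 0.0:
                    return mid
                if f_lo * f_mid < 0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            return 0.5 * (lo + hi)
        lo, f_lo = hi, f_hi
    return lo if f_lo == 0.0 else None

# Example: a constant plus one sinusoid with period 2*pi,
# scanned in windows of one quarter of that period.
f = lambda t: math.sin(t) - 0.5
root = first_root(f, 0.0, 100.0, step=math.pi / 2)
```

The scan cannot see a root where the function only touches zero without changing sign. The same routine applied to a numerical derivative of the function would locate the first local extremum under the same assumptions.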
Another approach would be to apply your root finding algorithm iteratively. If your range is (a, b), then apply your algorithm to that range to find a root at say c < b. Then apply your algorithm to the range (a, c) to find a root in that range. Continue until no more roots are found. The last root you found is a good candidate for your minimum root.
Black box function for any range? You cannot even be sure it has a continuous domain over that range. What kind of solutions are you looking for? Natural numbers, integers, real numbers, complex? These are all questions that greatly impact the answer.
So the first thing should be determining what kind of number you accept as the result.
The second is having some kind of protection against limits of the function that will blow up your calculations as it goes to plus or minus infinity.
Since we are touching on limits: your function could edge towards zero and look like it has a root, but never actually reach 0. This depends on your margin of error, that is, how close a value has to be to count as good enough.
I think your SIMPLEST TO IMPLEMENT bet for real-number solutions (I assume those) is to take an interval and apply this divide-and-conquer algorithm:
Take the lower border, the upper border, and the middle value (or an approximate middle value for borders with infinitely many decimals).
Evaluate the function at all 3 and have some kind of protection against infinities.
Remember all 3 values in an array together with their results (3 value-result pairs).
Remember the current best value (the one closest to a solution) in a separate variable (a pair of the value and its result).
STEP FORWARD - repeat the above with the 1st-2nd value range and the 2nd-3rd value range.
Keep the new value-result pair that is closest to a solution.
Clear the old value-result pairs and replace them with the new ones from this iteration, while remembering the best value-result pair overall.
Repeat the above until you are as precise as you wish, and watch that memory explode with each iteration; keep in mind you are going to have exponential growth of values there. It can be improved further if you, say, take one interval and go as deep as you want, remember the best value-result pair, then delete everything else from memory and dig deep into the next interval.

Creating an efficient function to fit a dataset

Basically I have a large (could get as large as 100,000-150,000 values) data set of 4-byte inputs and their corresponding 4-byte outputs. The inputs aren't guaranteed to be unique (which isn't really a problem because I figure I can generate pseudo-random numbers to add or xor the inputs with so that they do become unique), but the outputs aren't guaranteed to be unique either (so two different sets of inputs might have the same output).
I'm trying to create a function that effectively models the values in my data-set. I don't need it to interpolate efficiently, or even at all (by this I mean that I'm never going to feed it an input that isn't contained in this static data-set). However it does need to be as efficient as possible. I've looked into interpolation and found that it doesn't really fit what I'm looking for. For example, the large number of values means that spline interpolation won't do since it creates a polynomial per interval.
Also, from my understanding polynomial interpolation would be way too computationally expensive (n values means that the polynomial could include terms as high as pow(x,n-1). For x= a 4-byte number and n=100,000 it's just not feasible). I've tried looking online for a while now, but I'm not very strong with math and must not know the right terms to search with because I haven't come across anything similar so far.
I can see that this is not completely (to put it mildly) a programming question and I apologize in advance. I'm not looking for the exact solution or even a complete answer. I just need pointers on the topics that I would need to read up on so I can solve this problem on my own. Thanks!
TL;DR - I need a variant of interpolation that only needs to fit the initially given data-points, but which is computationally efficient.
Edit:
Some clarification - I do need the output to be exact and not an approximation. This is sort of an optimization of some research work I'm currently doing and I need to have this look-up implemented without the actual bytes of the outputs being present in my program. I can't really say a whole lot about it at the moment, but I will say that for the purposes of my work, encryption (or compression or any other form of obfuscation) is not an option to hide the table. I need a mathematical function that can recreate the output so long as it has access to the input. I hope that clears things up a bit.
Here is one idea. Make your function be the sum (mod 2^32) of a linear function over all 4-byte integers, a piecewise linear function whose pieces depend on the value of the first bit, another piecewise linear function whose pieces depend on the value of the first two bits, and so on.
The actual output values appear nowhere, you have to add together linear terms to get them. There is also no direct record of which input values you have. (Someone could conclude something about those input values, but not their actual values.)
The various coefficients you need can be stored in a hash. Any lookups you do which are not found in the hash are assumed to be 0.
If you add a certain amount of random "noise" to your dataset before starting to encode it fairly efficiently, it would be hard to tell what your input values are, and very hard to tell what the outputs are even approximately without knowing the inputs.
Since you didn't impose any restriction on the function (continuous, smooth, etc), you could simply do a piece-wise constant interpolation:
or a linear interpolation:
I assume you can figure out how to construct such a function without too much trouble.
EDIT: In light of your additional requirement that such a function should "hide" the data points...
For a piece-wise constant interpolation, the constant intervals should be randomized so as to not reveal where the data point is. So for example in the picture, the intervals are centered about the data point it's interpolating. Instead, you might want to do something like:
[0 , 0.3) -> 0
[0.3 , 1.9) -> 0.8
[1.9 , 2.1) -> 0.9
[2.1 , 3.5) -> 0.2
etc
Of course, this only hides the x-coordinate. To hide the y-coordinate as well, you can use a linear interpolation.
Simply make it so that the "pointy" part isn't where the data point is. Pick random x-values such that every adjacent data point has one of these x-values in between. Then interpolate such that the "pointy" part is at these x-values.
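The randomized piecewise-constant table from the example above can be sketched with a sorted list of interval edges and a binary search. The names `left_edges`, `values`, `upper_bound` and `lookup` are hypothetical, and the numbers are just the ones from the example intervals.

```python
from bisect import bisect_right

# Left edge of each interval and the constant value on that interval:
# [0, 0.3) -> 0, [0.3, 1.9) -> 0.8, [1.9, 2.1) -> 0.9, [2.1, 3.5) -> 0.2
left_edges = [0.0, 0.3, 1.9, 2.1]
values = [0.0, 0.8, 0.9, 0.2]
upper_bound = 3.5  # right edge of the last interval (exclusive)

def lookup(x):
    """Return the constant value of the interval containing x."""
    if not (left_edges[0] <= x < upper_bound):
        raise ValueError("x is outside the tabulated range")
    # bisect_right finds the first edge greater than x; the interval
    # containing x is the one just before it.
    return values[bisect_right(left_edges, x) - 1]
```

Each lookup costs O(log n) comparisons. Note that the output values still appear literally in `values`, which is why the answer moves on to linear interpolation for hiding the y-coordinate as well.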
I suggest a huge Lookup Table full of unused entries. It's the brute-force approach, having an ordered table of outputs, ordered by every possible value of the input (not just the data set, but also all other possible 4-byte value).
Though all of your data would be there, you could fill the non-used inputs with random, arbitrary, or stochastic (random within potentially complex constraints) data. If you make it convincing, no one could pick your real data out of it. If a "real" function interpolated all your data, it would also "contain" all the information of your real data, and anyone with access to it could use it to generate an LUT as described above.
LUTs are lightning-fast, but very memory hungry. Your case is on the edge of feasibility: 2^32 entries of 32 bits (4 bytes) each come to 16 gigabytes of RAM, which requires a 64-bit machine to run. That is just for the data, not the program, the operating system, or other data. It's better to have 24 gigabytes, just to be sure. If you can afford it, LUTs are the way to go.

Generate very very large random numbers

How would you generate a very very large random number? I am thinking on the order of 2^10^9 (one billion bits). Any programming language -- I assume the solution would translate to other languages.
I would like a uniform distribution on [1,N].
My initial thoughts:
You could randomly generate each digit and concatenate. Problem: even very good pseudorandom generators are likely to develop patterns with millions of digits, right?
You could perhaps help create large random numbers by raising random numbers to random exponents. Problem: you must make the math work so that the resulting number is still random, and you should be able to compute it in a reasonable amount of time (say, an hour).
If it helps, you could try to generate a possibly non-uniform distribution on a possibly smaller range (using the real numbers, for instance) and transform. Problem: this might be equally difficult.
Any ideas?
Generate log2(N) random bits to get a number M,
where M may be up to twice as large as N.
Repeat until M is in the range [1;N].
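A minimal sketch of this rejection loop, assuming Python's `secrets` module as the source of random bits. The function name `uniform_1_to_n` is hypothetical, and `n.bit_length()` stands in for the "log2(N) bits" of the answer.

```python
import secrets

def uniform_1_to_n(n):
    """Draw a uniform integer from [1, n] by rejection sampling."""
    if n < 1:
        raise ValueError("n must be at least 1")
    bits = n.bit_length()  # enough bits to represent n itself
    while True:
        m = secrets.randbits(bits)  # uniform on [0, 2**bits - 1]
        if 1 <= m <= n:
            return m  # accepted; otherwise draw again

sample = uniform_1_to_n(10**30)
```

Because `2**bits` is less than twice `n + 1`, each draw is accepted with probability of roughly one half or better, so the expected number of iterations stays around two regardless of how large `n` is.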
Now, to generate the random bits you could either use a source of true randomness, which is expensive, or use some cryptographically secure random number generator, for example AES with a random key encrypting a counter for subsequent blocks of bits. "Cryptographically secure" implies that there can be no noticeable patterns.
It depends on what you need the data for. For most purposes, a PRNG is fast and simple. But they are not perfect. For instance, I remember hearing that Monte Carlo simulations of chaotic systems are really good at revealing the underlying pattern in a PRNG.
If that is the sort of thing that you are doing, though, there is a simple trick I learned in grad school for generating lots of random data. Take a large (preferably rapidly changing) file. (Some big data structures from the running kernel are good.) Compress it to increase the entropy. Throw away the headers. Then for good measure, encrypt the result. If you're planning to use this for cryptographic purposes (and you didn't have a perfect entropy data set to work with), then reverse it and encrypt again.
The underlying theory is simple. Information theory tells us that there is no difference between a signal with no redundancy and pure random data. So if we pick a big file (ie lots of signal), remove redundancy with compression, and strip the headers, we have a pretty good random signal. Encryption does a really good job at removing artifacts. However encryption algorithms tend to work forward in blocks. So if someone could, despite everything, guess what was happening at the start of the file, that data is more easily guessable. But then reversing the file and encrypting again means that they would need to know the whole file, and our encryption, to find any pattern in the data.
The reason to pick a rapidly changing piece of data is that if you run out of data and want to generate more, you can go back to the same source again. Even small changes will, after that process, turn into an essentially uncorrelated random data set.
NTL: A Library for doing Number Theory
This was recommended by my Coding Theory and Cryptography teacher... so I guess it does the work right, and it's pretty easy to use.
RandomBnd, RandomBits, RandomLen -- routines for generating pseudo-random numbers
ZZ RandomLen_ZZ(long l);
// ZZ = pseudo-random number with precisely l bits,
// or 0 if l <= 0.
Suppose you have a random number generator that produces random numbers of X bits, and you concatenate the outputs [X1, X2, ... Xn] to create the N-bit number you want. As long as each X is random, I don't see why your large number wouldn't be random as well for all intents and purposes. And if the standard C rand() function is not secure enough, I'm sure there are plenty of other libraries (like the ones mentioned in this thread) whose pseudo-random numbers are "more random".
even very good pseudorandom generators are likely to develop patterns with millions of digits, right?
From the wikipedia on pseudo-random number generation:
A PRNG can be started from an arbitrary starting state using a seed state. It will always produce the same sequence thereafter when initialized with that state. The maximum length of the sequence before it begins to repeat is determined by the size of the state, measured in bits. However, since the length of the maximum period potentially doubles with each bit of 'state' added, it is easy to build PRNGs with periods long enough for many practical applications.
You could perhaps help create large random numbers by raising random numbers to random exponents
I assume you're suggesting something like populating the values of a scientific notation with random values?
E.g.: 1.58901231 x 10^5819203489
The problem with this is that your distribution is going to be logarithmic (or is that exponential? :) - same difference, it isn't even). You will never get a value that has the millionth digit set, yet contains a digit in the one's column.
you could try to generate a possibly non-uniform distribution on a possibly smaller range (using the real numbers, for instance) and transform
Not sure I understand this. Sounds like the same thing as the exponential solution, with the same problems. If you're talking about multiplying by a constant, then you'll get a lumpy distribution instead of a logarithmic (exponential?) one.
Suggested Solution
If you just need really big pseudo-random values with a good distribution, use a PRNG algorithm with a larger state. The maximum period of a PRNG grows exponentially with the number of state bits (up to 2^k for k bits of state, as the quote above says), so it doesn't take that many bits to fill even a really large number.
From there, you can use your first solution:
You could randomly generate each digit and concatenate
Although I'd suggest that you use the full range of values returned by your PRNG (possibly 2^31 or 2^32), and populate a byte array with those values, splitting it up as necessary. Otherwise you might be throwing away a lot of bits of randomness. Also, scaling your values to a range (or using modulo) can easily screw up your distribution, so there's another reason to try to keep the max number of bits your PRNG can return. Be careful to pack your byte array full of the bits returned, though, or you'll again introduce lumpiness to your distribution.
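A minimal sketch of the byte-array packing, assuming Python's `random.Random` (a Mersenne Twister with period 2^19937 - 1) as the large-state PRNG. The function name `big_random_bits` is hypothetical. Every 32-bit output is stored whole, and only the surplus bits of the last word are dropped.

```python
import random

def big_random_bits(total_bits, rng=None):
    """Build an integer of `total_bits` pseudo-random bits by packing
    full 32-bit PRNG outputs into a byte array."""
    rng = rng or random.Random()
    n_words = (total_bits + 31) // 32  # number of 32-bit words needed
    buf = bytearray()
    for _ in range(n_words):
        # Use the full 32-bit range of each output; no scaling or modulo.
        buf += rng.getrandbits(32).to_bytes(4, "big")
    value = int.from_bytes(buf, "big")
    # Shift out the unused low bits of the final word.
    return value >> (32 * n_words - total_bits)

x = big_random_bits(1000, random.Random(12345))
```

In Python specifically, `rng.getrandbits(total_bits)` does the same job in one call; the loop is spelled out only to mirror the packing described above.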
The problem with this solution, though, is how to fill that (larger than normal) seed state with random-enough values. You might be able to use standard-size seeds (populated via time or GUID-style population), and populate your big-PRNG state with values from the smaller PRNG. This might work if it isn't mission critical how well distributed your numbers are.
If you need truly cryptographically secure random values, the only real way to do it is to use a natural form of randomness, such as that at http://www.random.org/. The disadvantages of natural randomness are availability, and the fact that many natural-random devices take a while to generate new entropy, so generating large amounts of data might be really slow.
You can also use a hybrid and be safe - natural-random seeds only (to avoid the slowness of generation), and PRNG for the rest of it. Re-seed periodically.

Math question regarding Python's uuid4

I'm not great with statistical mathematics, etc. I've been wondering, if I use the following:
import uuid
unique_str = str(uuid.uuid4())
double_str = ''.join([str(uuid.uuid4()), str(uuid.uuid4())])
Is double_str string squared as unique as unique_str or just some amount more unique? Also, is there any negative implication in doing something like this (like some birthday problem situation, etc)? This may sound ignorant, but I simply would not know as my math spans algebra 2 at best.
The uuid4 function returns a UUID created from 16 random bytes and it is extremely unlikely to produce a collision, to the point at which you probably shouldn't even worry about it.
If for some reason uuid4 does produce a duplicate, it is far more likely to be a programming error, such as a failure to correctly initialize the random number generator, than genuine bad luck. In that case your approach will not make it any better: an incorrectly initialized random number generator can still produce duplicates even with your approach.
If you use the default implementation random.seed(None) you can see in the source that only 16 bytes of randomness are used to initialize the random number generator, so this is an issue you would have to solve first. Also, if the OS doesn't provide a source of randomness, the system time will be used, which is not very random at all.
But ignoring these practical issues, you are basically along the right lines. To use a mathematical approach we first have to define what you mean by "uniqueness". I think a reasonable definition is the number of ids n you need to generate before the probability of generating a duplicate exceeds some probability p. An approximate formula for this is:
n ≈ sqrt(2 * d * ln(1 / (1 - p)))
where d is 2**(16*8) for a single randomly generated uuid and 2**(16*2*8) with your suggested approach. The square root in the formula is indeed due to the Birthday Paradox. But if you work it out you can see that if you square the range of values d while keeping p constant then you also square n.
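To make the scaling concrete, here is a small sketch that evaluates the standard birthday approximation n ≈ sqrt(2 * d * ln(1 / (1 - p))). Treating this as the formula the answer refers to is an assumption, as is the function name `ids_before_collision`. It uses the answer's values of d; a real uuid4 fixes 6 version and variant bits, so only 122 of its 128 bits are random.

```python
import math

def ids_before_collision(d, p):
    """Approximate number of ids drawn uniformly from d possibilities
    before the chance of at least one duplicate reaches p."""
    # -log1p(-p) equals ln(1 / (1 - p)) and stays accurate for tiny p.
    return math.sqrt(2 * d * -math.log1p(-p))

p = 0.5
single = ids_before_collision(2 ** (16 * 8), p)      # one uuid
double = ids_before_collision(2 ** (16 * 2 * 8), p)  # two uuids joined

# Squaring d squares n up to a constant factor that depends only on p.
ratio = double / single ** 2
```

Here `single` is on the order of 10^19 and `double` on the order of 10^38, so the count of ids is squared apart from the p-dependent constant `ratio`.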
Since uuid4 is based off a pseudo-random number generator, calling it twice is not going to square the amount of "uniqueness" (and may not even add any uniqueness at all).
See also When should I use uuid.uuid1() vs. uuid.uuid4() in python?
It depends on the random number generator, but it's almost squared uniqueness.

Pseudo-Random-Number-Generator from a computable normal number

Isn't it easily possible to construct a PRNG in such a fashion? Why is it not done?
That is, as far as I know we could simply have a PRNG that takes a seed n. When you ask for a random bit, it takes the nth digit of the binary expansion of the computable normal number, and increments n.
My first thought was that perhaps we hadn't found a computable normal number, but we have. The remaining thought is that there is a good reason not to-- either there's some property of PRNGs that I'm not familiar with that such a method would not have, or it would be impractical somehow, or is otherwise outstripped by other methods.
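For concreteness, the construction described in the question might look like the sketch below. It assumes the binary Champernowne sequence (the binary numerals 1, 10, 11, 100, ... written one after another), which is normal in base 2, as the computable normal number; the function name `normal_number_bits` is hypothetical.

```python
from itertools import count, islice

def normal_number_bits(seed):
    """Yield the bits of the binary Champernowne sequence,
    starting at bit index `seed` (0-based)."""
    def all_bits():
        for k in count(1):
            for ch in format(k, "b"):  # binary numeral of k, no prefix
                yield int(ch)
    # Skipping the first `seed` bits costs time proportional to seed.
    return islice(all_bits(), seed, None)

first_bits = list(islice(normal_number_bits(0), 11))
```

Two generators started from nearby seeds produce the same stream shifted by a few positions, which hints at how predictable the output is.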
That would make predicting the output really simple.
Say, for example, you generate the integer 0x54a30b7f. If you have 4GiB of pi (or random noise or an actual normal number), chances are there's only going to be one (or maybe a handful of) occurrences of that particular integer, so I can predict all future numbers with reasonably high probability. This is a serious problem in the case of cryptographically strong PRNGs. If instead of a simple sequential scan you use some function, I just have to follow the function, and if that function is difficult enough to follow, it turns into a PRNG in its own right.
If you are not concerned about the cryptographic strength of your generator, then there are much more compact ways of generating random numbers. Mersenne Twister, for example, has a much larger period without requiring a 4GiB lookup table.
