Creating an efficient function to fit a dataset - math

Basically I have a large (could get as large as 100,000-150,000 values) data set of 4-byte inputs and their corresponding 4-byte outputs. The inputs aren't guaranteed to be unique (which isn't really a problem because I figure I can generate pseudo-random numbers to add or xor the inputs with so that they do become unique), but the outputs aren't guaranteed to be unique either (so two different sets of inputs might have the same output).
I'm trying to create a function that effectively models the values in my data-set. I don't need it to interpolate efficiently, or even at all (by this I mean that I'm never going to feed it an input that isn't contained in this static data-set). However it does need to be as efficient as possible. I've looked into interpolation and found that it doesn't really fit what I'm looking for. For example, the large number of values means that spline interpolation won't do since it creates a polynomial per interval.
Also, from my understanding polynomial interpolation would be way too computationally expensive (n values means that the polynomial could include terms as high as pow(x,n-1). For x= a 4-byte number and n=100,000 it's just not feasible). I've tried looking online for a while now, but I'm not very strong with math and must not know the right terms to search with because I haven't come across anything similar so far.
I can see that this is not completely (to put it mildly) a programming question and I apologize in advance. I'm not looking for the exact solution or even a complete answer. I just need pointers on the topics that I would need to read up on so I can solve this problem on my own. Thanks!
TL;DR - I need a variant of interpolation that only needs to fit the initially given data-points, but which is computationally efficient.
Edit:
Some clarification - I do need the output to be exact and not an approximation. This is sort of an optimization of some research work I'm currently doing and I need to have this look-up implemented without the actual bytes of the outputs being present in my program. I can't really say a whole lot about it at the moment, but I will say that for the purposes of my work, encryption (or compression or any other other form of obfuscation) is not an option to hide the table. I need a mathematical function that can recreate the output so long as it has access to the input. I hope that clears things up a bit.

Here is one idea. Make your function be the sum (mod 232) of a linear function over all 4-byte integers, a piecewise linear function whose pieces depend on the value of the first bit, another piecewise linear function whose pieces depend on the value of the first two bits, and so on.
The actual output values appear nowhere, you have to add together linear terms to get them. There is also no direct record of which input values you have. (Someone could conclude something about those input values, but not their actual values.)
The various coefficients you need can be stored in a hash. Any lookups you do which are not found in the hash are assumed to be 0.
If you add a certain amount of random "noise" to your dataset before starting to encode it fairly efficiently, it would be hard to tell what your input values are, and very hard to tell what the outputs are even approximately without knowing the inputs.

Since you didn't impose any restriction on the function (continuous, smooth, etc), you could simply do a piece-wise constant interpolation:
or a linear interpolation:
I assume you can figure out how to construct such a function without too much trouble.
EDIT: In light of your additional requirement that such a function should "hide" the data points...
For a piece-wise constant interpolation, the constant intervals should be randomized so as to not reveal where the data point is. So for example in the picture, the intervals are centered about the data point it's interpolating. Instead, you might want to do something like:
[0 , 0.3) -> 0
[0.3 , 1.9) -> 0.8
[1.9 , 2.1) -> 0.9
[2.1 , 3.5) -> 0.2
etc
Of course, this only hides the x-coordinate. To hide the y-coordinate as well, you can use a linear interpolation.
Simply make it so that the "pointy" part isn't where the data point is. Pick random x-values such that every adjacent data point has one of these x-values in between. Then interpolate such that the "pointy" part is at these x-values.

I suggest a huge Lookup Table full of unused entries. It's the brute-force approach, having an ordered table of outputs, ordered by every possible value of the input (not just the data set, but also all other possible 4-byte value).
Though all of your data would be there, you could fill the non-used inputs with random, arbitrary, or stochastic (random whithin potentially complex constraints) data. If you make it convincing, no one could pick your real data out of it. If a "real" function interpolated all your data, it would also "contain" all the information of your real data, and anyone with access to it could use it to generate an LUT as described above.
LUTs are lightning-fast, but very memory hungry. Your case is on the edge of feasibility, requiring (2^32)*32= 16 Gigabytes of RAM, which requires a 64-bit machine to run. That is just for the data, not the program, the Operating System, or other data. It's better to have 24, just to be sure. If you can afford it, they are the way to go.

Related

compute the tableau's nonbasic term in SCIP separator

In traditional Simplex Algorithm notation, we have x at the current basis selection B as so:
xB = AB-1b - AB-1ANxN. How can I compute the AB-1AN term inside a separator in SCIP, or at least iterate over its columns?
I see three helpful methods: getLPColsData, getLPRowsData, getLPBasisInd. I'm just not sure exactly what data those methods represent, particularly the last one, with its negative row indexes. How do I use those to get the value I want?
Do those methods return the same data no matter what LP algorithm is used? Or do I need to account for dual vs primal? How does the use of the "revised" algorithm play into my calculation?
Update: I discovered the getLPBInvARow and getLPBInvRow. That seems to be much closer to what I'm after. I don't yet understand their results; they seem to include more/less dimensions than expected. I'm still looking for understanding at how to use them to get the rays away from the corner.
you are correct that getLPBInvRow or getLPBInvARow are the methods you want. getLPBInvARow directly returns you a of the simplex tableau, but it is not more efficient to use than getLPBInvRow and doing the multiplication yourself since the LP solver needs to also compute the actual tableau first.
I suggest you look into either sepa_gomory.c or sepa_gmi.c for examples of how to use these methods. How do they include less dimensions than expected? They both return sparse vectors.

Find the first root and local maximum/minimum of a function

Problem
I want to find
The first root
The first local minimum/maximum
of a black-box function in a given range.
The function has following properties:
It's continuous and differentiable.
It's combination of constant and periodic functions. All periods are known.
(It's better if it can be done with weaker assumptions)
What is the fastest way to get the root and the extremum?
Do I need more assumptions or bounds of the function?
What I've tried
I know I can use root-finding algorithm. What I don't know is how to find the first root efficiently.
It needs to be fast enough so that it can run within a few miliseconds with precision of 1.0 and range of 1.0e+8, which is the problem.
Since the range could be quite large and it should be precise enough, I can't brute-force it by checking all the possible subranges.
I considered bisection method, but it's too slow to find the first root if the function has only one big root in the range, as every subrange should be checked.
It's preferable if the solution is in java, but any similar language is fine.
Background
I want to calculate when arbitrary celestial object reaches certain height.
It's a configuration-defined virtual object, so I can't assume anything about the object.
It's not easy to get either analytical solution or simple approximation because various coordinates are involved.
I decided to find a numerical solution for this.
For a general black box function, this can't really be done. Any root finding algorithm on a black box function can't guarantee that it has found all the roots or any particular root, even if the function is continuous and differentiable.
The property of being periodic gives a bit more hope, but you can still have periodic functions with infinitely many roots in a bounded domain. Given that your function relates to celestial objects, this isn't likely to happen. Assuming your periodic functions are sinusoidal, I believe you can get away with checking subranges on the order of one-quarter of the shortest period (out of all the periodic components).
Maybe try Brent's Method on the shortest quarter period subranges?
Another approach would be to apply your root finding algorithm iteratively. If your range is (a, b), then apply your algorithm to that range to find a root at say c < b. Then apply your algorithm to the range (a, c) to find a root in that range. Continue until no more roots are found. The last root you found is a good candidate for your minimum root.
Black box function for any range? You cannot even be sure it has the continuous domain over that range. What kind of solutions are you looking for? Natural numbers, integers, real numbers, complex? These are all the question that greatly impact the answer.
So 1st thing should be determining what kind of number you accept as the result.
Second is having some kind of protection against limes of function that will try to explode your calculations as it goes for plus or minus infinity.
Since we are touching the limes topics you could have your solution edge towards zero and look like a solution but never touch 0 and become a solution. This depends on your margin of error, how close something has to be to be considered ok, it's good enough.
I think for this your SIMPLEST TO IMPLEMENT bet for real number solutions (I assume those) is to take an interval and this divide and conquer algorithm:
Take lower and upper border and middle value (or approx middle value for infinity decimals border/borders)
Try to calculate solution with all 3 and have some kind of protection against infinities
remember all 3 values in an array with results from them (3 pair of values)
remember the current best value (one its closest to solution) in seperate variable (a pair of value and result for that value)
STEP FORWARD - repeat above with 1st -2nd value range and 2nd -3rd value range
have a new pair of value and result to be closest to solution.
clear the old value-result pairs, replace them with new ones gotten from this iteration while remembering the best value solution pair (total)
Repeat above for how precise you wish to get and look at that memory explode with each iteration, keep in mind you are gonna to have exponential growth of values there. It can be further improved if you lets say take one interval and go as deep as you wanna, remember best value-result pair and then delete all other memory and go for next interval and dig deep.

Series of numbers with minimized risk of collision

I want to generate some numbers, which should attempt to share as few common bit patterns as possible, such that collisions happen at minimal amount. Until now its "simple" hashing with a given amount of output bits. However, there is another 'constraint'. I want to minimize the risk that, if you take one number and change it by toggling a small amount of bits, you end up with another number you've just generated. Note: I don't want it to be impossible or something, I want to minimize the risk!
How to calculate the probability for a list with n numbers, where each number has m bits? And, of course, what would be a suitable method to generate those numbers? Any good articles about this?
To answer this question precisely, you need to say what exactly you mean by "collision", and what you mean by "generate". If you just want the strings to be far apart from each other in hamming distance, you could hope to make an optimal, deterministic set of such strings. It is true that random strings will have this property with high probability, so you could use random strings instead.
When you say
Note: I don't want it to be impossible or something, I want to minimize the risk!
this sounds like an XY problem. If some outcome is the "bad thing" then why do you want it to be possible, but just low probability? Shouldn't you want it not to happen at all?
In short I think you should look up the term "error correcting code". The codewords of any good error correcting code, with any parameters that you feel like, will have the minimal risk of collision in the presence of random noise, for that number of code words of that length, and they can typically be generated very easily using matrix multiplication.

How to compute the exponent values for error correction in QR Code

When i started reading about Qr Codes every article browsed i can see one QR Code Exponents of αx Table which having values for the specific powers of αx . I am not sure how this table is getting created. Can anybody explain me the logic behind this table.
For reference the table can be found at http://www.matchadesign.com/_blog/Matcha_Design_Blog/post/QR_Code_Demystified_-_Part_4/#
(The zxing source code for this might help you.)
It would take a lot to explain all the math here. For the Reed-Solomon error correction, you need a Galois field of 256 elements (nothing fancy -- just a set of 256 things that have addition and exponentiation and such defined.)
This is defined not in terms of numbers, but in terms of polynomials whose coefficients are all 0 or 1. We work with polynomials with 8 coefficient -- conveniently these map to 8-bit values. While it's tempting to think of those values as numbers, they're really something different.
In fact for addition and such to make sense such that all the operations land you back in a value in the Galois field, all the results are computed modulo an irreducible polynomial in the field. (Skip what that means now.)
To make operations faster, it helps to pre-compute what the powers of the polynomial "x" are in the field. This is alpha. You can think of this as "2", since the polynomial "x" is 00000010, though that's not entirely accurate.
So then you just compute the powers of x in the field. Because it's a field you'll hit every element of the field this way. The sequence seems to be the powers of two, which it happens to map to for a short while, until the first "modulo" of the primitive polynomial takes effect. Multiplying by x is indeed still something like multiplying by 2 but it's a bit of coincidence in this field, really.

How to watermark some vector data in an invisible way?

I have a some vector data that has been manually created, it is just a list of x,y values. The coordinate of the points is not perfectly accurate - it can be off by a few pixels and it won't make any perceivable difference.
So now I am looking for some way to watermark this data, so that if someone steal the vector data, I can prove that it's indeed been stolen. I'm looking for some method reliable enough that even if someone take my data and shift all the points by a some small amount, I can still prove that it's been stolen.
Is there any way to do that? I know it exists for bitmap data but how about vector data?
PS: the vector graphic itself is rather random - it cannot be copyrighted.
Is the set of points all you can work with? If, for example, you were dealing with SVG, you could export the file with a certain type of XML formatting, a <!-- generated by thingummy --> comment at the top, IDs generated according to such-and-such a pattern, extra attributes specifically yours, a particular style of applying translations, etc. Just like you can work out from a JPEG what is likely to have been used to create it, you can tell a lot about what produced an SVG file by observation.
On the vectors themselves, you could do something like consider them as an ordered sequence and apply offsets given by the values of two pseudo-random sequences, each starting from a known seed, for X and Y translation, in a certain range (such as [-1, 1]). Even if some points are modified, you should be able to build up an argument from how things match the sequence. How to distinguish precisely what has been shifted could do with a bit more consideration, too; if you were simply doing int(x) + random(-1, 1), then if someone just rounded all values your evidence would be lost. A better way of dealing with this would be to, while still rendering at the same screen size, multiply everything by some constant like 953 (an arbitrary near-1000 prime) and then adjust your values by something in that range (viz, [0, 952]). This base-953 system would be superior to a base-10 system because it's much (much much) harder to see what's happening. If the person changes the scaling, it would require a bit more analysis of values, but it should still be quite possible. I've got a gut feeling that that's where picking a prime number could be a bit helpful, but I haven't thought about it terribly much. If in danger or in doubt in such matters, pick a prime number for the sake of it... you may find out later there are benefits to it!
Combine a number of different techniques for best results, of course.

Resources