How to watermark some vector data in an invisible way? - vector-graphics

I have some vector data that was created manually; it is just a list of x,y values. The coordinates of the points are not perfectly accurate - they can be off by a few pixels without making any perceivable difference.
So now I am looking for some way to watermark this data, so that if someone steals the vector data, I can prove that it's indeed been stolen. I'm looking for a method reliable enough that even if someone takes my data and shifts all the points by some small amount, I can still prove that it's been stolen.
Is there any way to do that? I know it exists for bitmap data, but what about vector data?
PS: the vector graphic itself is rather random - it cannot be copyrighted.

Is the set of points all you can work with? If, for example, you were dealing with SVG, you could export the file with a certain type of XML formatting, a <!-- generated by thingummy --> comment at the top, IDs generated according to such-and-such a pattern, extra attributes specifically yours, a particular style of applying translations, etc. Just like you can work out from a JPEG what is likely to have been used to create it, you can tell a lot about what produced an SVG file by observation.
On the vectors themselves, you could treat them as an ordered sequence and apply offsets given by two pseudo-random sequences, one for X and one for Y translation, each started from a known seed, within a certain range (such as [-1, 1]). Even if some points are modified, you should be able to build up an argument from how well the data matches the sequences. Exactly how to survive deliberate shifting needs a bit more thought, though: if you simply did int(x) + random(-1, 1), then someone who rounded all the values would destroy your evidence. A better approach is, while still rendering at the same screen size, to multiply everything by some constant like 953 (an arbitrary prime near 1000) and then adjust each value by an offset in that range (viz, [0, 952]). This base-953 system is superior to a base-10 one because it's much (much much) harder to see what's happening. If the person changes the scaling it would take a bit more analysis of the values, but it should still be quite possible. I've got a gut feeling that that's where picking a prime number could help, but I haven't thought about it terribly much. If in danger or in doubt in such matters, pick a prime number for the sake of it... you may find out later there are benefits to it!
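Here is a minimal Python sketch of that idea (the names, the seed value, and the verification note are my own illustration, not part of the answer); it assumes integer x,y coordinates:

import random

PRIME = 953          # the arbitrary near-1000 prime mentioned above
SEED = 20240917      # hypothetical secret seed; knowing it is your proof of ownership

def watermark(points):
    # Scale every coordinate by PRIME, then add a per-point offset in [0, PRIME - 1]
    # drawn from a PRNG seeded with your secret. Rendered at the same screen size the
    # drawing looks identical, but the residues mod PRIME follow your secret sequence.
    rng = random.Random(SEED)
    return [(x * PRIME + rng.randrange(PRIME),
             y * PRIME + rng.randrange(PRIME)) for x, y in points]

To argue that a suspect file is derived from yours, regenerate the sequence from the seed and show that the coordinates' residues mod 953 track it far more closely than chance (roughly 1 in 953 per point) would allow.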
Combine a number of different techniques for best results, of course.


Convert RGBA{U8}(0.384,0.0,0.0,1.0) to Integer

I am using Images.jl in Julia. I am trying to convert an image into a graph-like data structure (v,w,c) where
v is a node
w is a neighbor and
c is a cost function
I want to give an expensive cost to those neighbors which do not have the same color. However, when I load an image, each pixel has the type RGBA{U8}(1.0,1.0,1.0,1.0). Is there any way to convert this into a number like Int64 or Float?
If all you want to do is penalize adjacent pairs that have different color values (no matter how small the difference), I think img[i,j] != img[i+1,j] should be sufficient, and infinitely more performant than calling colordiff.
Images.jl also contains methods, raw and separate, that allow you to "convert" that image into a higher-dimensional array of UInt8. However, for your apparent application this will likely be more of a pain, because you'll have to choose between using a syntax like A[:, i, j] != A[:, i+1, j] (which will allocate memory and have much worse performance) or writing out loops and checking each color channel manually. Then there's always the slight annoyance of having to special-case your code for grayscale and color, wondering what a 3d array really means (is it 3d grayscale or 2d with a color channel?), and wondering whether the color channel is stored as the first or last dimension.
None of these annoyances arise if you just work with the data directly in RGBA format. For a little more background, these are examples of Julia's "immutable" objects, which have at least two advantages. First, they allow you to clearly specify the "meaning" of a certain collection of numbers (in this case, that these 4 numbers represent a color, in a particular colorspace, rather than, say, pressure readings from a sensor), which means you can write code that isn't forced to make assumptions it can't enforce. Second, once you learn how to use them, they make your code much prettier, all while providing fantastic performance.
The color types are documented here.
Might I recommend converting each pixel to greyscale, if all you want is a magnitude difference?
See this answer for a how-to:
Converting RGB to grayscale/intensity
This will give you a single value for intensity that you can then use to compare.
Following @daycaster's suggestion, colordiff from Colors.jl can be used.
colordiff takes two colors as arguments. To use it, you should extract the color part of the pixel with color, i.e. colordiff(color(v), color(w)), where v would be the RGBA{U8}(0.384,0.0,0.0,1.0) value.

Series of numbers with minimized risk of collision

I want to generate some numbers which share as few common bit patterns as possible, so that collisions are kept to a minimum. Until now it's been "simple" hashing with a given number of output bits. However, there is another 'constraint': I want to minimize the risk that, if you take one number and change it by toggling a small number of bits, you end up with another number you've just generated. Note: I don't want it to be impossible or something, I want to minimize the risk!
How do I calculate the probability of this for a list of n numbers, where each number has m bits? And, of course, what would be a suitable method to generate those numbers? Any good articles about this?
To answer this question precisely, you need to say what exactly you mean by "collision", and what you mean by "generate". If you just want the strings to be far apart from each other in Hamming distance, you could hope to construct an optimal, deterministic set of such strings. It is true that random strings will have this property with high probability, so you could use random strings instead.
When you say
Note: I don't want it to be impossible or something, I want to minimize the risk!
this sounds like an XY problem. If some outcome is the "bad thing" then why do you want it to be possible, but just low probability? Shouldn't you want it not to happen at all?
In short, I think you should look up the term "error correcting code". The codewords of any good error correcting code, with whatever parameters you feel like, will have the minimal risk of collision in the presence of random noise for that number of codewords of that length, and they can typically be generated very easily using matrix multiplication.
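As a concrete, minimal illustration of "codewords via matrix multiplication", here is the classic Hamming(7,4) code in Python: any two of its 16 codewords differ in at least 3 bit positions, so toggling a single bit of one generated number can never produce another one.

import numpy as np
from itertools import combinations

G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])   # generator matrix: 4 data bits -> 7-bit codeword

def encode(message_bits):
    return np.array(message_bits) @ G % 2

codewords = [encode([(m >> i) & 1 for i in range(4)]) for m in range(16)]
print(min(int(np.sum(a != b)) for a, b in combinations(codewords, 2)))   # 3

For more bits and more codewords you would pick a code with larger parameters (a BCH or Reed-Muller code, for example), but the construction is still just a matrix product.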

Creating an efficient function to fit a dataset

Basically I have a large (it could get as large as 100,000-150,000 values) data set of 4-byte inputs and their corresponding 4-byte outputs. The inputs aren't guaranteed to be unique (which isn't really a problem, because I figure I can generate pseudo-random numbers to add or xor the inputs with so that they do become unique), and the outputs aren't guaranteed to be unique either (so two different inputs might have the same output).
I'm trying to create a function that effectively models the values in my data-set. I don't need it to interpolate efficiently, or even at all (by this I mean that I'm never going to feed it an input that isn't contained in this static data-set). However it does need to be as efficient as possible. I've looked into interpolation and found that it doesn't really fit what I'm looking for. For example, the large number of values means that spline interpolation won't do since it creates a polynomial per interval.
Also, from my understanding, polynomial interpolation would be way too computationally expensive (n values means that the polynomial could include terms as high as pow(x, n-1); for x a 4-byte number and n = 100,000 it's just not feasible). I've tried looking online for a while now, but I'm not very strong with math and must not know the right terms to search with, because I haven't come across anything similar so far.
I can see that this is not completely (to put it mildly) a programming question and I apologize in advance. I'm not looking for the exact solution or even a complete answer. I just need pointers on the topics that I would need to read up on so I can solve this problem on my own. Thanks!
TL;DR - I need a variant of interpolation that only needs to fit the initially given data-points, but which is computationally efficient.
Edit:
Some clarification - I do need the output to be exact and not an approximation. This is sort of an optimization of some research work I'm currently doing, and I need to have this look-up implemented without the actual bytes of the outputs being present in my program. I can't really say a whole lot about it at the moment, but I will say that for the purposes of my work, encryption (or compression or any other form of obfuscation) is not an option to hide the table. I need a mathematical function that can recreate the output so long as it has access to the input. I hope that clears things up a bit.
Here is one idea. Make your function the sum (mod 2^32) of a linear function over all 4-byte integers, a piecewise linear function whose pieces depend on the value of the first bit, another piecewise linear function whose pieces depend on the value of the first two bits, and so on.
The actual output values appear nowhere; you have to add the linear terms together to get them. There is also no direct record of which input values you have. (Someone could infer something about which inputs are present, but not recover their actual values.)
The various coefficients you need can be stored in a hash. Any lookups you do which are not found in the hash are assumed to be 0.
If you add a certain amount of random "noise" to your dataset before you start encoding it (which can be done fairly efficiently), it would be hard to tell what your input values are, and very hard to tell, even approximately, what the outputs are without knowing the inputs.
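A rough, unfitted sketch of how such a function could be evaluated (the dict layout and the names here are my own illustration, not part of the answer above):

MASK = (1 << 32) - 1

coeffs = {}   # maps a bit-prefix such as "" or "10" to an (a, b) pair; filled in by your fitting step

def evaluate(x: int) -> int:
    # Sum one linear term a*x + b per bit-prefix of the input, mod 2^32.
    # Prefixes with no stored coefficients contribute nothing, and the output
    # values themselves never appear anywhere in the stored data.
    bits = format(x & MASK, "032b")
    total = 0
    for k in range(33):                       # "" plus the 32 proper prefixes
        a, b = coeffs.get(bits[:k], (0, 0))
        total = (total + a * x + b) & MASK
    return total

Working out the coefficients so that evaluate reproduces every (input, output) pair in the data set is the actual fitting step, which this sketch deliberately leaves out.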
Since you didn't impose any restriction on the function (continuous, smooth, etc.), you could simply do a piece-wise constant interpolation, or a linear interpolation. I assume you can figure out how to construct such a function without too much trouble.
EDIT: In light of your additional requirement that such a function should "hide" the data points...
For a piece-wise constant interpolation, the constant intervals should be randomized so as to not reveal where the data points are. The obvious construction centers each interval about the data point it interpolates; instead, you might want to do something like:
[0 , 0.3) -> 0
[0.3 , 1.9) -> 0.8
[1.9 , 2.1) -> 0.9
[2.1 , 3.5) -> 0.2
etc
Of course, this only hides the x-coordinate. To hide the y-coordinate as well, you can use a linear interpolation.
Simply make it so that the "pointy" parts aren't where the data points are. Pick random x-values such that every pair of adjacent data points has one of these x-values between them, then interpolate so that the "pointy" parts fall at these x-values.
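A tiny Python sketch of the piece-wise constant variant, using the toy table above (in practice the breakpoints would be chosen randomly between your real data points):

import bisect

breakpoints = [0.3, 1.9, 2.1, 3.5]        # randomized edges placed between neighbouring data x's
levels      = [0.0, 0.8, 0.9, 0.2, 0.2]   # one output level per interval, last one open-ended

def f(x: float) -> float:
    return levels[bisect.bisect_right(breakpoints, x)]

print(f(1.0))   # 0.8, matching the [0.3, 1.9) row above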
I suggest a huge lookup table full of unused entries. It's the brute-force approach: a table of outputs indexed by every possible value of the input (not just your data set, but every other possible 4-byte value too).
Though all of your data would be there, you could fill the non-used inputs with random, arbitrary, or stochastic (random within potentially complex constraints) data. If you make it convincing, no one could pick your real data out of it. Note that if a "real" function interpolated all your data, it would also "contain" all the information of your real data, and anyone with access to it could use it to generate an LUT as described above.
LUTs are lightning-fast, but very memory hungry. Your case is on the edge of feasibility: 2^32 entries of 4 bytes each is 16 gigabytes of RAM, which requires a 64-bit machine. That is just for the data, not the program, the operating system, or other data; it's better to have 24 GB, just to be sure. If you can afford it, LUTs are the way to go.
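A back-of-the-envelope sketch in Python (the array really does need ~16 GiB, so treat this as illustrative; pairs stands in for your (input, output) tuples):

import numpy as np

pairs = [(0x12345678, 0x9ABCDEF0)]    # your real data set goes here

# One uint32 slot per possible 32-bit input = 2^32 * 4 bytes = 16 GiB.
# Unused slots are pre-filled with random values so the real entries don't stand out.
table = np.random.randint(0, 2**32, size=2**32, dtype=np.uint32)
for x, y in pairs:
    table[x] = y

def lookup(x: int) -> int:
    return int(table[x])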

Generate very very large random numbers

How would you generate a very very large random number? I am thinking on the order of 2^10^9 (one billion bits). Any programming language -- I assume the solution would translate to other languages.
I would like a uniform distribution on [1,N].
My initial thoughts:
You could randomly generate each digit and concatenate. Problem: even very good pseudorandom generators are likely to develop patterns with millions of digits, right?
You could perhaps help create large random numbers by raising random numbers to random exponents. Problem: you must make the math work so that the resulting number is still random, and you should be able to compute it in a reasonable amount of time (say, an hour).
If it helps, you could try to generate a possibly non-uniform distribution on a possibly smaller range (using the real numbers, for instance) and transform. Problem: this might be equally difficult.
Any ideas?
Generate ceil(log2(N)) random bits to get a number M, where M may be up to twice as large as N. Repeat until M is in the range [1, N].
Now, to generate the random bits, you could either use a source of true randomness, which is expensive, or use some cryptographically secure random number generator, for example AES with a random key encrypting a counter for successive blocks of bits. Cryptographically secure implies that there can be no noticeable patterns.
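In Python this rejection loop is only a few lines; the sketch below uses the secrets module (the OS CSPRNG) as the bit source and needs, on average, fewer than two draws:

import secrets

def uniform_1_to_n(n: int) -> int:
    k = n.bit_length()                 # enough bits that a draw can reach n
    while True:
        m = secrets.randbits(k) + 1    # uniform on [1, 2^k], at most about twice n
        if m <= n:
            return m

big = uniform_1_to_n(2 ** (10 ** 6))   # a uniformly chosen number of about a million bits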
It depends on what you need the data for. For most purposes, a PRNG is fast and simple. But they are not perfect. For instance, I remember hearing that Monte Carlo simulations of chaotic systems are really good at revealing the underlying pattern in a PRNG.
If that is the sort of thing that you are doing, though, there is a simple trick I learned in grad school for generating lots of random data. Take a large (preferably rapidly changing) file. (Some big data structures from the running kernel are good.) Compress it to increase the entropy. Throw away the headers. Then for good measure, encrypt the result. If you're planning to use this for cryptographic purposes (and you didn't have a perfect entropy data set to work with), then reverse it and encrypt again.
The underlying theory is simple. Information theory tells us that there is no difference between a signal with no redundancy and pure random data. So if we pick a big file (i.e. lots of signal), remove the redundancy with compression, and strip the headers, we have a pretty good random signal. Encryption does a really good job of removing the remaining artifacts. However, encryption algorithms tend to work forward in blocks, so if someone could, despite everything, guess what was happening at the start of the file, that data would be more easily guessable. But then reversing the file and encrypting again means that they would need to know the whole file, and our encryption, to find any pattern in the data.
The reason to pick a rapidly changing piece of data is that if you run out of data and want to generate more, you can go back to the same source again. Even small changes will, after that process, turn into an essentially uncorrelated random data set.
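A sketch of the whole pipeline, assuming the third-party cryptography package (the helper names are mine, not part of the answer):

import os, zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def _aes_ctr(data: bytes) -> bytes:
    # Encrypt under a throwaway random key/nonce; we only care about the whitening effect.
    enc = Cipher(algorithms.AES(os.urandom(32)), modes.CTR(os.urandom(16))).encryptor()
    return enc.update(data) + enc.finalize()

def whitened_bytes(path: str) -> bytes:
    # Compress to squeeze out redundancy, drop the 2-byte zlib header,
    # encrypt, then reverse and encrypt again as described above.
    with open(path, "rb") as f:
        squeezed = zlib.compress(f.read(), 9)[2:]
    return _aes_ctr(_aes_ctr(squeezed)[::-1])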
NTL: A Library for doing Number Theory
This was recommended by my Coding Theory and Cryptography teacher... so I guess it does the work right, and it's pretty easy to use.
RandomBnd, RandomBits, RandomLen -- routines for generating pseudo-random numbers
ZZ RandomLen_ZZ(long l);
// ZZ = pseudo-random number with precisely l bits,
// or 0 if l <= 0.
If you have a random number generator that generates random numbers of X bits, and the concatenated bits [X1, X2, ... Xn] make up the N-bit number you want, then as long as each X is random I don't see why the large number wouldn't be random as well, for all intents and purposes. And if the standard C rand() method is not secure enough, I'm sure there are plenty of other libraries (like the ones mentioned in this thread) whose pseudo-random numbers are "more random".
even very good pseudorandom generators are likely to develop patterns with millions of digits, right?
From the wikipedia on pseudo-random number generation:
A PRNG can be started from an arbitrary starting state using a seed state. It will always produce the same sequence thereafter when initialized with that state. The maximum length of the sequence before it begins to repeat is determined by the size of the state, measured in bits. However, since the length of the maximum period potentially doubles with each bit of 'state' added, it is easy to build PRNGs with periods long enough for many practical applications.
You could perhaps help create large random numbers by raising random numbers to random exponents
I assume you're suggesting something like populating the values of a scientific notation with random values?
E.g.: 1.58901231 x 10^5819203489
The problem with this is that your distribution is going to be logarithmic (or is that exponential? :) - same difference, the point is that it isn't uniform). You will never get a value that has its millionth digit set yet also has a nonzero digit in the ones column.
you could try to generate a possibly non-uniform distribution on a possibly smaller range (using the real numbers, for instance) and transform
Not sure I understand this. Sounds like the same thing as the exponential solution, with the same problems. If you're talking about multiplying by a constant, then you'll get a lumpy distribution instead of a logarithmic (exponential?) one.
Suggested Solution
If you just need really big pseudo-random values with a good distribution, use a PRNG algorithm with a larger state. The period of a PRNG typically grows exponentially with the size of its state (the Mersenne Twister's 19937 bits of state give a period of 2^19937 - 1), so it doesn't take that much state to cover even a really large number.
From there, you can use your first solution:
You could randomly generate each digit and concatenate
Although I'd suggest that you use the full range of values returned by your PRNG (possibly 2^31 or 2^32), and populate a byte array with those values, splitting it up as necessary. Otherwise you might be throwing away a lot of bits of randomness. Also, scaling your values to a range (or using modulo) can easily screw up your distribution, so there's another reason to try to keep the max number of bits your PRNG can return. Be careful to pack your byte array full of the bits returned, though, or you'll again introduce lumpiness to your distribution.
The problem with that solution, though, is how to fill the (larger than normal) seed state with random-enough values. You might be able to use a standard-size seed (populated from the time or a GUID) and populate your big-PRNG state with values from the smaller PRNG. This might work if it isn't mission-critical how well distributed your numbers are.
If you need truly cryptographically secure random values, the only real way to do it is to use a natural source of randomness, such as http://www.random.org/. The disadvantages of natural randomness are availability, and the fact that many natural-random devices take a while to generate new entropy, so generating large amounts of data might be really slow.
You can also use a hybrid and be safe - natural-random seeds only (to avoid the slowness of generation), and PRNG for the rest of it. Re-seed periodically.
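A hybrid sketch along those lines (Python's built-in Mersenne Twister already has a 19937-bit state, which is plenty here; the function name is mine):

import os, random

def big_pseudo_random(n_bits: int) -> int:
    # Natural-random seed from the OS, PRNG for the bulk; pack whole 32-bit words
    # into the result so none of the returned bits are thrown away or rescaled.
    rng = random.Random(os.urandom(64))
    words = (n_bits + 31) // 32
    value = 0
    for _ in range(words):
        value = (value << 32) | rng.getrandbits(32)
    return value >> (32 * words - n_bits)      # trim the excess high bits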

Fourier transform to transpose the key of a wav file

I want to write an app to transpose the key a wav file plays in (for fun, I know there are apps that already do this)... my main understanding of how this might be accomplished is to
1) chop the audio file into very small blocks (say 1/10 a second)
2) run an FFT on each block
3) phase shift the frequency space up or down depending on what key I want
4) use an inverse FFT to return each block to the time domain
5) glue all the blocks together
But now I'm wondering if the transformed blocks would no longer be continuous when I try to glue them back together. Are there ideas how I should do this to guarantee continuity, or am I just worrying about nothing?
Overlap the time samples for each block by half so that each block after the first consists of the last N/2 samples from the previous block and N/2 new samples. Be sure to apply some window to the samples before the transform.
After shifting the frequency, perform an inverse FFT and use the middle N/2 samples from each block. You'll need to adjust the final gain after the IFFT.
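A crude NumPy sketch of this windowed overlap-add scheme (shift_bins and the 50% Hann overlap-add reconstruction are my simplifications; note that moving every bin by a constant is a linear frequency shift, and a real pitch shifter would also track phase, e.g. a phase vocoder):

import numpy as np

def shift_blocks(x, shift_bins, n=2048):
    # Hann-windowed blocks with 50% overlap; move every FFT bin by shift_bins,
    # inverse-transform, and overlap-add. Hann windows at a half-block hop sum
    # to roughly 1, so no extra gain correction is applied in this sketch.
    hop = n // 2
    window = np.hanning(n)
    out = np.zeros(len(x))
    for start in range(0, len(x) - n, hop):
        spectrum = np.fft.rfft(x[start:start + n] * window)
        shifted = np.zeros_like(spectrum)
        if shift_bins >= 0:
            shifted[shift_bins:] = spectrum[:len(spectrum) - shift_bins]
        else:
            shifted[:shift_bins] = spectrum[-shift_bins:]
        out[start:start + n] += np.fft.irfft(shifted, n)
    return out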
Of course, mixing the time samples with a sine wave and then low pass filtering will provide the same shift in the time domain as well. The frequency of the mixer would be the desired frequency difference.
For speech you might want to look at PSOLA - this is a popular algorithm for pitch-shifting and/or time stretching/compression which is a little more sophisticated than the basic overlap-add method, but not much more complex.
If you need to process non-speech samples, e.g. music, then there are several possibilities, however the overlap-add FFT/modify/IFFT approach mentioned in other answers is probably the best bet.
Found this great article on the subject, for anyone trying it in the future!
You may have to find a zero-crossing between the blocks to glue the individual wavs back together. Otherwise you may find that you are getting clicks or pops between the blocks.
