Truncation in Homomorphic Encryption - encryption

How do you implement truncation in homomorphic encryption libraries like HELib or SEAL when no division operation is allowed?
I have two floating point numbers a=2.3,b=1.5 which I scale to integers with 2-digit precision. Hence my encoder looks basically like this encode(x)=x*10^2. Assuming enc(x) is the encryption function, then enc(encode(a))=enc(230) and enc(encode(b))=enc(150).
Upon multiplication we obtain the huge value of a*b=enc(23*15)=enc(34500) because the scaling factors multiply too. This means that my inputs grow exponentially unless I can truncate the result, so that trunate(enc(34500))=truncate(enc(345)).
I assume such a truncation function is not possible because it cant be represented by a polynomial. Nonetheless, I wonder if there is any trick on how to perform this truncation mathematically or whether it is just an unsolved problem?

It is possible but difficult to perform such truncation in the BFV and BGV schemes, and is unlikely to result in acceptable performance in most use-cases. This problem is very much related to the complexity of bootstrapping said schemes; for more details, see https://eprint.iacr.org/2018/067 and https://eprint.iacr.org/2014/873.
On the other hand, truncation is much easier to achieve in the CKKS scheme (see https://eprint.iacr.org/2016/421) where it is a natural operation. However, the downside of the CKKS scheme is that all computations only yield approximately correct results which may not be what you want.

Related

Optimize dataset for floating point add/sub/mul/div

Suppose we have a data set of numbers, with which we want to do some calculations using addition/subtraction/multiplication/division using a computer.
The coverage of the real numbers by the floating point representation varies a lot, depending on the number being represented:
In terms of absolute precision in the real->FP mapping the "holes" grow towards the bigger numbers, with a weird hole around 0, depending on the architecture. Due to this, the add/sub precision towards the bigger numbers will drop.
If we divide 2 consecutive numbers which are represented in our floating point representation, the result of the division will be bigger both while going to the bigger numbers and when going to smaller and smaller fractions.
So, my question is:
Is there a "sweet interval" for floats on an ordinary PC today, where the results for the arithmetics with the said operators (add/sub/mul/div) are just more precise?
If I have a data set of many-significant-digit numbers like "123123123123123", "134534513412351151", etc., with which I want to do some arithmetics, which floating point interval should it be converted to, to have the best precision for the result?
Since floating points are something like 1.xxx*10^yyy, 2.xxx*10^yyy, ..., 9.xxx*10^yyy, I would assume, converting my numbers into the [1, 9] interval would give the best results for the memory consumed, but I may be terribly wrong...
Suppose I use C, can such conversion even be made? Is there a best-practice to do that? Before an operation, C will convert the operands to the same format, so I guess I would have to use a string representation, inject a "." somewhere and parse that as float.
Please note:
This is a theoretical question, I don't have an actual data set on my hand that would decide what is best. On the same note, the mentioning of C was random, I am also interested in responses like "forget C, I would use this and this, BECAUSE it supports this and this".
Please spare me from answers like "this cannot be answered, because it depends on the actual operations, since the results may be in another magnitude range than the original data, etc., etc.". Let's suppose that the results of the calculation is more or less in the same interval, as the operands. Sure, when dividing the "more-or-less the same magnitude" operands, the result will be somewhere between 1-10, maybe 0.1-100, ... , but that is probably exactly the best interval they can be in.
Of course, if the answer includes some explanation, other than a brush-off, I will be happy to read it!
The absolute precision of floating-point numbers changes with the magnitude of the numbers because the exponent changes. The relative precision does not change, except for numbers near the bottom of the exponent range, where underflow occurs. If you multiply binary floating-point numbers by a power of two, perform arithmetic (suitably adjusted for the scaling), and reverse the scaling, the results will be identical to doing the arithmetic without scaling, barring effects from overflow and underflow. If your arithmetic does involve underflow or overflow, then scaling could help avoid that. For example, if your precision is suffering because your numbers are so small that some intermediate results are below the normal range of the floating-point format, then scaling by a power of two can avoid the loss of precision from underflow.
If you scale by something other than a power of two, the results can be different, due to changes in the significands. The effects will generally be tiny, and whether the results are better or worse will effectively be random chance, except in carefully engineered special situations.

Generate very very large random numbers

How would you generate a very very large random number? I am thinking on the order of 2^10^9 (one billion bits). Any programming language -- I assume the solution would translate to other languages.
I would like a uniform distribution on [1,N].
My initial thoughts:
--You could randomly generate each digit and concatenate. Problem: even very good pseudorandom generators are likely to develop patterns with millions of digits, right?
You could perhaps help create large random numbers by raising random numbers to random exponents. Problem: you must make the math work so that the resulting number is still random, and you should be able to compute it in a reasonable amount of time (say, an hour).
If it helps, you could try to generate a possibly non-uniform distribution on a possibly smaller range (using the real numbers, for instance) and transform. Problem: this might be equally difficult.
Any ideas?
Generate log2(N) random bits to get a number M,
where M may be up to twice as large as N.
Repeat until M is in the range [1;N].
Now to generate the random bits you could either use a source of true randomness, which is expensive.
Or you might use some cryptographically secure random number generator, for example AES with a random key encrypting a counter for subsequent blocks of bits. The cryptographically secure implies that there can be no noticeable patterns.
It depends on what you need the data for. For most purposes, a PRNG is fast and simple. But they are not perfect. For instance I remember hearing that Monte Carlos simulations of chaotic systems are really good at revealing the underlying pattern in a PRNG.
If that is the sort of thing that you are doing, though, there is a simple trick I learned in grad school for generating lots of random data. Take a large (preferably rapidly changing) file. (Some big data structures from the running kernel are good.) Compress it to increase the entropy. Throw away the headers. Then for good measure, encrypt the result. If you're planning to use this for cryptographic purposes (and you didn't have a perfect entropy data set to work with), then reverse it and encrypt again.
The underlying theory is simple. Information theory tells us that there is no difference between a signal with no redundancy and pure random data. So if we pick a big file (ie lots of signal), remove redundancy with compression, and strip the headers, we have a pretty good random signal. Encryption does a really good job at removing artifacts. However encryption algorithms tend to work forward in blocks. So if someone could, despite everything, guess what was happening at the start of the file, that data is more easily guessable. But then reversing the file and encrypting again means that they would need to know the whole file, and our encryption, to find any pattern in the data.
The reason to pick a rapidly changing piece of data is that if you run out of data and want to generate more, you can go back to the same source again. Even small changes will, after that process, turn into an essentially uncorrelated random data set.
NTL: A Library for doing Number Theory
This was recommended by my Coding Theory and Cryptography teacher... so I guess it does the work right, and it's pretty easy to use.
RandomBnd, RandomBits, RandomLen -- routines for generating pseudo-random numbers
ZZ RandomLen_ZZ(long l);
// ZZ = psuedo-random number with precisely l bits,
// or 0 of l <= 0.
If you have a random number generator that generates random numbers of X bits. And concatenated bits of [X1, X2, ... Xn ] create the number you want of N bits, as long as each X is random, I don't see why your large number wouldn't be random as well for all intents and purposes. And if standard C rand() method is not secure enough, I'm sure there's plenty of other libraries (like the ones mentioned in this thread) whose pseudo-random numbers are "more random".
even very good pseudorandom generators are likely to develop patterns with millions of digits, right?
From the wikipedia on pseudo-random number generation:
A PRNG can be started from an arbitrary starting state using a seed state. It will always produce the same sequence thereafter when initialized with that state. The maximum length of the sequence before it begins to repeat is determined by the size of the state, measured in bits. However, since the length of the maximum period potentially doubles with each bit of 'state' added, it is easy to build PRNGs with periods long enough for many practical applications.
You could perhaps help create large random numbers by raising random numbers to random exponents
I assume you're suggesting something like populating the values of a scientific notation with random values?
E.g.: 1.58901231 x 10^5819203489
The problem with this is that your distribution is going to be logarithmic (or is that exponential? :) - same difference, it isn't even). You will never get a value that has the millionth digit set, yet contains a digit in the one's column.
you could try to generate a possibly non-uniform distribution on a possibly smaller range (using the real numbers, for instance) and transform
Not sure I understand this. Sounds like the same thing as the exponential solution, with the same problems. If you're talking about multiplying by a constant, then you'll get a lumpy distribution instead of a logarithmic (exponential?) one.
Suggested Solution
If you just need really big pseudo-random values, with a good distribution, use a PRNG algorithm with a larger state. The Periodicity of a PRNG is often the square of the number of bits, so it doesn't take that many bits to fill even a really large number.
From there, you can use your first solution:
You could randomly generate each digit and concatenate
Although I'd suggest that you use the full range of values returned by your PRNG (possibly 2^31 or 2^32), and populate a byte array with those values, splitting it up as necessary. Otherwise you might be throwing away a lot of bits of randomness. Also, scaling your values to a range (or using modulo) can easily screw up your distribution, so there's another reason to try to keep the max number of bits your PRNG can return. Be careful to pack your byte array full of the bits returned, though, or you'll again introduce lumpiness to your distribution.
The problem with those solution, though, is how to fill that (larger than normal) seed state with random-enough values. You might be able to use standard-size seeds (populated via time or GUID-style population), and populate your big-PRNG state with values from the smaller-PRNG. This might work if it isn't mission critical how well distributed your numbers are.
If you need truly cryptographically secure random values, the only real way to do it is use a natural form of randomness, such as that at http://www.random.org/. The disadvantages of natural randomness are availability, and the fact that many natural-random devices take a while to generate new entropy, so generating large amounts of data might be really slow.
You can also use a hybrid and be safe - natural-random seeds only (to avoid the slowness of generation), and PRNG for the rest of it. Re-seed periodically.

Math question regarding Python's uuid4

I'm not great with statistical mathematics, etc. I've been wondering, if I use the following:
import uuid
unique_str = str(uuid.uuid4())
double_str = ''.join([str(uuid.uuid4()), str(uuid.uuid4())])
Is double_str string squared as unique as unique_str or just some amount more unique? Also, is there any negative implication in doing something like this (like some birthday problem situation, etc)? This may sound ignorant, but I simply would not know as my math spans algebra 2 at best.
The uuid4 function returns a UUID created from 16 random bytes and it is extremely unlikely to produce a collision, to the point at which you probably shouldn't even worry about it.
If for some reason uuid4 does produce a duplicate it is far more likely to be a programming error such as a failure to correctly initialize the random number generator than genuine bad luck. In which case the approach you are using it will not make it any better - an incorrectly initialized random number generator can still produce duplicates even with your approach.
If you use the default implementation random.seed(None) you can see in the source that only 16 bytes of randomness are used to initialize the random number generator, so this is an a issue you would have to solve first. Also, if the OS doesn't provide a source of randomness the system time will be used which is not very random at all.
But ignoring these practical issues, you are basically along the right lines. To use a mathematical approach we first have to define what you mean by "uniqueness". I think a reasonable definition is the number of ids you need to generate before the probability of generating a duplicate exceeds some probability p. An approcimate formula for this is:
where d is 2**(16*8) for a single randomly generated uuid and 2**(16*2*8) with your suggested approach. The square root in the formula is indeed due to the Birthday Paradox. But if you work it out you can see that if you square the range of values d while keeping p constant then you also square n.
Since uuid4 is based off a pseudo-random number generator, calling it twice is not going to square the amount of "uniqueness" (and may not even add any uniqueness at all).
See also When should I use uuid.uuid1() vs. uuid.uuid4() in python?
It depends on the random number generator, but it's almost squared uniqueness.

When is it appropriate to use floating precision data types?

It's clear that one shouldn't use floating precision when working with, say, monetary amounts since the variation in precision leads to inaccuracies when doing calculations with that amount.
That said, what are use cases when that is acceptable? And, what are the general principles one should have in mind when deciding?
Floating point numbers should be used for what they were designed for: computations where what you want is a fixed precision, and you only care that your answer is accurate to within a certain tolerance. If you need an exact answer in all cases, you're best using something else.
Here are three domains where you might use floating point:
Scientific Simulations
Science apps require a lot of number crunching, and often use sophisticated numerical methods to solve systems of differential equations. You're typically talking double-precision floating point here.
Games
Think of games as a simulation where it's ok to cheat. If the physics is "good enough" to seem real then it's ok for games, and you can make up in user experience what you're missing in terms of accuracy. Games usually use single-precision floating point.
Stats
Like science apps, statistical methods need a lot of floating point. A lot of the numerical methods are the same; the application domain is just different. You find a lot of statistics and monte carlo simulations in financial applications and in any field where you're analyzing a lot of survey data.
Floating point isn't trivial, and for most business applications you really don't need to know all these subtleties. You're fine just knowing that you can't represent some decimal numbers exactly in floating point, and that you should be sure to use some decimal type for prices and things like that.
If you really want to get into the details and understand all the tradeoffs and pitfalls, check out the classic What Every Programmer Should Know About Floating Point, or pick up a book on Numerical Analysis or Applied Numerical Linear Algebra if you're really adventurous.
I'm guessing you mean "floating point" here. The answer is, basically, any time the quantities involved are approximate, measured, rather than precise; any time the quantities involved are larger than can be conveniently represented precisely on the underlying machine; any time the need for computational speed overwhelms exact precision; and any time the appropriate precision can be maintained without other complexities.
For more details of this, you really need to read a numerical analysis book.
Short story is that if you need exact calculations, DO NOT USE floating point.
Don't use floating point numbers as loop indices: Don't get caught doing:
for ( d = 0.1; d < 1.0; d+=0.1)
{ /* Some Code... */ }
You will be surprised.
Don't use floating point numbers as keys to any sort of map because you can never count on equality behaving like you may expect.
Most real-world quantities are inexact, and typically we know their numeric properties with a lot less precision than a typical floating-point value. In almost all cases, the C types float and double are good enough.
It is necessary to know some of the pitfalls. For example, testing two floating-point numbers for equality is usually not what you want, since all it takes is a single bit of inaccuracy to make the comparison non-equal. tgamblin has provided some good references.
The usual exception is money, which is calculated exactly according to certain conventions that don't translate well to binary representations. Part of this is the constants used: you'll never see a pi% interest rate, or a 22/7% interest rate, but you might well see a 3.14% interest rate. In other words, the numbers used are typically expressed in exact decimal fractions, not all of which are exact binary fractions. Further, the rounding in calculations is governed by conventions that also don't translate well into binary. This makes it extremely difficult to precisely duplicate financial calculations with standard floating point, and therefore people use other methods for them.
It's appropriate to use floating point types when dealing with scientific or statistical calculations. These will invariably only have, say, 3-8 significant digits of accuracy.
As to whether to use single or double precision floating point types, this depends on your need for accuracy and how many significant digits you need. Typically though people just end up using doubles unless they have a good reason not to.
For example if you measure distance or weight or any physical quantity like that the number you come up with isn't exact: it has a certain number of significant digits based on the accuracy of your instruments and your measurements.
For calculations involving anything like this, floating point numbers are appropriate.
Also, if you're dealing with irrational numbers floating point types are appropriate (and really your only choice) eg linear algebra where you deal with square roots a lot.
Money is different because you typically need to be exact and every digit is significant.
I think you should ask the other way around: when should you not use floating point. For most numerical tasks, floating point is the preferred data type, as you can (almost) forget about overflow and other kind of problems typically encountered with integer types.
One way to look at floating point data type is that the precision is independent of the dynamic, that is whether the number is very small of very big (within an acceptable range of course), the number of meaningful digits is approximately the same.
One drawback is that floating point numbers have some surprising properties, like x == x can be False (if x is nan), they do not follow most mathematical rules (distributivity, that is x( y + z) != xy + xz). Depending on the values for z, y, and z, this can matters.
From Wikipedia:
Floating-point arithmetic is at its
best when it is simply being used to
measure real-world quantities over a
wide range of scales (such as the
orbital period of Io or the mass of
the proton), and at its worst when it
is expected to model the interactions
of quantities expressed as decimal
strings that are expected to be exact.
Floating point is fast but inexact. If that is an acceptable trade off, use floating point.

Psuedo-Random-Number-Generator from a computable normal number

Isn't it easily possible to construct a PRNG in such a fashion? Why is it not done?
That is, as far as I know we could simply have a PRNG that takes a seed n. When you ask for a random bit, it takes the nth digit of the binary expansion of the computable normal number, and increments n.
My first thought was that perhaps we hadn't found a computable normal number, but we have. The remaining thought is that there is a good reason not to-- either there's some property of PRNGs that I'm not familiar with that such a method would not have, or it would be impractical somehow, or is otherwise outstripped by other methods.
That would make predicting the output really simple.
Say, for example, you generate the integer 0x54a30b7f. If you have 4GiB of pi (or random noise or an actual normal number), chances are there's only going to be one (or maybe a handful) occurrence of that particular integer and I can predict with reasonably high probability all future numbers. This is a serious problem in the case of cryptographically strong PRNGs. If instead of simple sequential scan you use some function, I just have to follow the function which if it is difficult enough to follow it turns into a PRNG in it's own right.
If you are not concerned about the cryptographic strength of your generator, then there are much more compact ways of generating random numbers. Mersenne Twister, for example, has a much larger period without requiring a 4GiB lookup table.

Resources