Remainder sequences - math

I would like to compute the remainder sequence of two polynomials, as used in computing the GCD. If I understood the Wikipedia article about pseudo-remainder sequences correctly, one way to compute it is to use Euclid's algorithm:
gcd(a, b) := if b = 0 then a else gcd(b, rem(a, b))
and collect the rem() parts. If the coefficients are integers, however, the intermediate fractions grow very quickly, so there are the so-called "pseudo-remainder sequences", which try to keep the coefficients as small integers.
My question is: if I understood correctly (did I?), the two sequences above should differ only by constant factors, but when I try to run the following example I get different results. Why? The first remainders differ by a factor of -2, OK, but why is the second pair so different? I presume subresultants() works correctly, but why does g % (f % g) not give a matching result?
from sympy import symbols, Poly, subresultants

x, y = symbols('x y')
f = Poly(x**2*y + x**2 - 5*x*y + 2*x + 1, x, y)
g = Poly(2*x**2 - 12*x + 1, x)

print(subresultants(f, g)[2])
print(subresultants(f, g)[3])
print(f % g)
print(g % (f % g))
which results in
Poly(-2*x*y - 16*x + y - 1, x, y, domain='ZZ')
Poly(-9*y**2 - 54*y + 225, x, y, domain='ZZ')
Poly(x*y + 8*x - 1/2*y + 1/2, x, y, domain='QQ')
Poly(2*x**2 - 12*x + 1, x, y, domain='QQ')

"the two above sequences differ only by constant factor"
For polynomials of one variable, they do. For multivariate polynomials, they don't.
The division of multivariate polynomials is a somewhat tricky business: the result depends on the chosen order of monomials (by default, sympy uses lexicographic order). When you ask it to divide 2*x**2 - 12*x + 1 by x*y + 8*x - 1/2*y + 1/2, it observes that the leading monomial of the denominator is x*y, and there is no monomial in the numerator that is divisible by x*y. So the quotient is zero, and everything is a remainder.
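For instance, a quick sympy check (my own sketch, reusing the polynomials from the question) shows exactly this: the quotient is zero and the whole numerator ends up in the remainder.
from sympy import symbols, Poly, div

x, y = symbols('x y')
f = Poly(x**2*y + x**2 - 5*x*y + 2*x + 1, x, y)
g = Poly(2*x**2 - 12*x + 1, x, y)
r1 = f % g        # Poly(x*y + 8*x - y/2 + 1/2, x, y, domain='QQ')
q, r = div(g, r1)
print(q)          # zero quotient: no monomial of g is divisible by the leading monomial x*y
print(r)          # the remainder is g itself, just as g % (f % g) returned above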
The computation of subresultants (as it's implemented in sympy) treats polynomials in x,y as single-variable polynomials in x whose coefficients happen to come from the ring of polynomials in y. It is certain to produce a sequence of subresultants whose degree with respect to x keeps decreasing until it reaches 0: the last polynomial of the sequence will not have x in it. The degree with respect to y may (and does) go up, since the algorithm has no problem multiplying the terms by any polynomials in y in order to get x to drop out.
The upshot is that both computations work correctly, they just do different things.
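For comparison, here is a univariate sketch (my own example, not from the original post) where the subresultant sequence and the Euclidean remainders really do agree up to a constant factor:
from sympy import symbols, Poly, subresultants, rem, cancel

x = symbols('x')
f = Poly(x**4 - 3*x**2 + 2*x - 1, x)
g = Poly(2*x**3 + x - 4, x)
s = subresultants(f, g)[2]                 # first entry after f and g themselves
r = rem(f, g)                              # first Euclidean remainder
print(s)
print(r)
print(cancel(s.as_expr() / r.as_expr()))   # a plain constant: the only difference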

Questions about AES irreducible polynomials

For the Galois field GF(2^8), polynomials have the form a7x^7 + a6x^6 + ... + a0.
For AES, the irreducible polynomial is x^8 + x^4 + x^3 + x + 1.
Apparently the max power in GF(2^8) is x^7, so why is the max power of the irreducible polynomial x^8?
How does the max power of the irreducible polynomial affect the inverse result in GF?
Can I set the max power of the irreducible polynomial to x^9?
To understand why the modulus of GF(2⁸) must be order 8 (that is, have 8 as its largest exponent), you must know how to perform polynomial division with coefficients in GF(2), which means you must know how to perform polynomial division in general. I will assume you know how to do those things. If you don't know how, there are many tutorials on the web from which you can learn.
Remember that if r = a mod m, it means that there is a q such that a = q m + r. To make a working GF(2⁸) arithmetic, we need to guarantee that r is an element of GF(2⁸) for any a and q (even though a and q do not need to be elements of GF(2⁸)). Furthermore, we need to ensure that r can be any element of GF(2⁸), if we pick the right a from GF(2⁸).
So we must pick a modulus (the m) that makes these guarantees. We do this by picking an m of exactly order 8.
If the numerator of the division (the a in a = q m + r) is order 8 or higher, we can find something to put in the quotient (the q) that, when multiplied by x⁸, cancels out that higher order. But there's nothing we can put in the quotient that can be multiplied by x⁸ to give a term with order less than 8, so the remainder (the r) can be any order up to and including 7.
Let's try a few examples of polynomial division with a modulus (or divisor) of x⁸+x⁴+x³+x+1 to see what I mean. First let's compute x⁸+1 mod x⁸+x⁴+x³+x+1:
                          1     <- quotient
              ┌───────────────
x⁸+x⁴+x³+x+1 │ x⁸          +1
             -(x⁸+x⁴+x³+x+1)
              ───────────────
                   x⁴+x³+x      <- remainder
So x⁸+1 mod x⁸+x⁴+x³+x+1 = x⁴+x³+x.
Next let's compute x¹²+x⁹+x⁷+x⁵+x² mod x⁸+x⁴+x³+x+1.
                   x⁴        +x +1                    <- quotient
              ┌────────────────────────────────
x⁸+x⁴+x³+x+1 │ x¹² +x⁹     +x⁷ +x⁵         +x²
             -(x¹²     +x⁸ +x⁷ +x⁵ +x⁴        )
              ─────────────────────────────────
                    x⁹ +x⁸         +x⁴     +x²
                  -(x⁹         +x⁵ +x⁴     +x² +x)
                   ───────────────────────────────
                        x⁸     +x⁵             +x
                      -(x⁸         +x⁴ +x³     +x +1)
                       ──────────────────────────────
                                x⁵ +x⁴ +x³         +1   <- remainder
So x¹²+x⁹+x⁷+x⁵+x² mod x⁸+x⁴+x³+x+1 = x⁵+x⁴+x³+1, which has order < 8.
Finally, let's try a substantially higher order: how about x¹⁰⁰+x⁹⁶+x⁹⁵+x⁹³+x⁸⁸+x⁸⁷+x⁸⁵+x⁸⁴+x mod x⁸+x⁴+x³+x+1?
                x⁹²                  +x⁸⁴                 <- quotient
              ┌──────────────────────────────────────
x⁸+x⁴+x³+x+1 │ x¹⁰⁰+x⁹⁶+x⁹⁵+x⁹³     +x⁸⁸+x⁸⁷+x⁸⁵+x⁸⁴ +x
             -(x¹⁰⁰+x⁹⁶+x⁹⁵+x⁹³+x⁹²                    )
              ──────────────────────────────────────────
                                x⁹² +x⁸⁸+x⁸⁷+x⁸⁵+x⁸⁴ +x
                              -(x⁹² +x⁸⁸+x⁸⁷+x⁸⁵+x⁸⁴   )
                               ─────────────────────────
                                                      x   <- remainder
So x¹⁰⁰+x⁹⁶+x⁹⁵+x⁹³+x⁸⁸+x⁸⁷+x⁸⁵+x⁸⁴+x mod x⁸+x⁴+x³+x+1 = x. Note that I carefully chose the numerator so that it wouldn't be a long computation. If you want some pain, try doing x¹⁰⁰ mod x⁸+x⁴+x³+x+1 by hand.
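If you want to check such reductions without doing the long division by hand, here is a small sketch (my own illustration, not part of the answer) that packs GF(2) polynomials into Python integers, one bit per coefficient, and reduces them modulo the AES polynomial:
AES_POLY = 0b1_0001_1011   # x^8 + x^4 + x^3 + x + 1

def gf2_mod(a, m):
    # Remainder of a modulo m, both GF(2) polynomials packed into integers
    # (bit i holds the coefficient of x^i); XOR plays the role of subtraction.
    while a.bit_length() >= m.bit_length():
        a ^= m << (a.bit_length() - m.bit_length())
    return a

print(bin(gf2_mod((1 << 8) | 1, AES_POLY)))        # x^8 + 1 -> 0b11010, i.e. x^4 + x^3 + x
print(bin(gf2_mod(0b1_0010_1010_0100, AES_POLY)))  # x^12 + x^9 + x^7 + x^5 + x^2 -> 0b111001, i.e. x^5 + x^4 + x^3 + 1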

How many Binary Digits needed for X Decimal Digits of Square Root of 2

I've been playing around with calculating the square root of 2 and the like. It's easy to come up with an algorithm that will produce n correct binary digits. What I'd like help with is determining how many binary digits I need to get m correct decimal digits. m binary digits will get me m decimal digits, but those m decimal digits may not all be correct yet.
EDIT:
I've determined that the lower bound on the binary precision is ceil(log2(10^m)).
Thinking about it, there might not be a strict upper bound, since a carry from any lower power of 2 (when converting to base 10) could potentially affect any higher base-10 digit.
This may thus be a dynamic problem that requires evaluating the fractional expansion at m binary digits and determining which additional binary digits could potentially cause a carry in base 10.
Edit 2: I was probably overthinking this. After the initial calculation I can keep adding 1x10^(-precision) and squaring the result until I exceed 2, then subtract 1x10^(-precision), and I'll have my answer. Nevertheless, I am still interested in finding/developing such an algorithm :)
Let x be a real and y be its approximation.
Let RE be the relative error of y with respect to x:
RE(x, y) = abs(x - y) / abs(x)
Let b be an integer greater than 1. The Log-Relative Error in base b is defined as:
LREb(x, y) = -logb(RE(x, y))
where logb is the base-b logarithm:
logb(z) = log(z) / log(b)
for any positive z.
The LRE in base b represents the number of common digits between x and y. Here, the "number of correct digits" is not an integer but a real number: this simplifies the next calculations by avoiding the need for ceil and floor functions, provided that we accept statements such as "y has 2.3 correct digits with respect to x". More precisely, if x and y have q common base-b digits, then:
LREb(x, y) >= q - 1
With this equation, if the relative error has an upper bound, then the LREb has a lower bound. More precisely, if:
RE(x, y) <= epsilon
then:
LREb(x, y) >= -logb(epsilon)
Also, if the number of correct digits in base 10 is LRE10 = p, then RE = 10^-p, which implies that the number of correct digits in base 2 is:
LRE2 = -log2(10^-p) = p * log2(10) ≈ 3.32 * p
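To make that concrete, here is a small Python sketch (my own, not part of the original answer) that allocates about log2(10) ≈ 3.32 binary digits per requested decimal digit, computes sqrt(2) to that many fractional bits, and prints the result for comparison against a high-precision reference. As the edit above notes, a borderline digit can still require extra bits, so this is the lower-bound estimate only.
from math import ceil, log2, isqrt
from decimal import Decimal, getcontext

def sqrt2_with_n_bits(n):
    # floor(sqrt(2) * 2^n) / 2^n, i.e. sqrt(2) truncated to n fractional bits
    return Decimal(isqrt(2 << (2 * n))) / (Decimal(2) ** n)

m = 20                          # desired correct decimal digits
n = ceil(m * log2(10))          # about 3.32 binary digits per decimal digit
getcontext().prec = m + 10
print(sqrt2_with_n_bits(n))     # compare the first m digits with a reference value of sqrt(2)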
What method are you using?
I am assuming binary search of x in y = x^2
The integer part is limited by the result sqrt(y) and cannot be cut, otherwise the result would be wrong. However, x is limited by half the bits of y, so:
ni2 = log2(|y|)
The fractional part is tricky; see:
the relation between binary and decimal digits
but after the nonlinear start of the first few digits the dependence stabilizes. Here is the reversed formula from the linked answer:
nf2 = (((nf10-7.810)/9.6366363636363636363636)+1.0)<<5;
ni2 is the number of binary digits (bits) of the integer part
nf2 is the number of binary digits (bits) of the fractional part
nf10 is the number of decimal digits of the fractional part
By the way, I used 32-bit-aligned values, as that is what I use for my arithmetic, so:
9.6366363636363636363636 = 32/0.30102999566398119521373889472449
0.30102999566398119521373889472449 = log10(2)

R: approximating `e = exp(1)` using `(1 + 1 / n) ^ n` gives absurd result when `n` is large

So, I was just playing around with manually calculating the value of e in R and I noticed something that was a bit disturbing to me.
The value of e using R's exp() command...
exp(1)
#[1] 2.718282
Now, I'll try to manually calculate it using x = 10000
x <- 10000
y <- (1 + (1 / x)) ^ x
y
#[1] 2.718146
Not quite but we'll try to get closer using x = 100000
x <- 100000
y <- (1 + (1 / x)) ^ x
y
#[1] 2.718268
Warmer but still a bit off...
x <- 1000000
y <- (1 + (1 / x)) ^ x
y
#[1] 2.71828
Now, let's try it with a huge one
x <- 5000000000000000
y <- (1 + (1 / x)) ^ x
y
#[1] 3.035035
Well, that's not right. What's going on here? Am I overflowing the data type and need to use a certain package instead? If so, are there no warnings when you overflow a data type?
You've got a problem with machine precision. As soon as (1 / x) < 2.22e-16, 1 + (1 / x) is just 1. The mathematical limit breaks down in finite-precision numerical computations. Your final x in the question is already 5e+15, very close to this brink. Try x <- x * 10, and your y would be 1.
This is neither "overflow" nor "underflow" as there is no difficulty in representing a number as small as 1e-308. It is the problem of the loss of significant digits during floating-point arithmetic. When you do 1 + (1 / x), the bigger x is, the fewer significant digits in the (1 / x) part can be preserved when you add it to 1, and eventually you lose that (1 / x) term altogether.
## valid 16 significant digits
1 + 1.23e-01 = 1.123000000000000|
1 + 1.23e-02 = 1.012300000000000|
... ...
1 + 1.23e-15 = 1.000000000000001|
1 + 1.23e-16 = 1.000000000000000|
Any numerical analysis book would tell you the following.
Avoid adding a large number and a small number. In floating-point addition, a + b = a * (1 + b / a); if b / a < 2.22e-16, then a + b = a. This implies that when adding up a number of positive numbers, it is more stable to accumulate them from the smallest to the largest.
Avoid subtracting one number from another of the same magnitude, or you may get cancellation error. A classic example is the quadratic formula.
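A quick illustration of the first point, in Python for brevity (R's doubles behave the same way); this is my own example, not from the original answer:
tiny = 1e-17
print(1.0 + tiny == 1.0)            # True: the tiny term is lost entirely

vals = [1.0] + [1e-16] * 10
print(sum(vals) == 1.0)             # True: each 1e-16 is absorbed by the leading 1.0
print(sum(sorted(vals)) == 1.0)     # False: summing smallest-first lets the small terms accumulate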
You are also advised to have a read on Approximation to constant "pi" does not get any better after 50 iterations, a question asked a few days after your question. Using a series to approximate an irrational number is numerically stable as you won't get the absurd behavior seen in your question. But the finite number of valid significant digits imposes a different problem: numerical convergence, that is, you can only approximate the target value up to a certain number of significant digits. MichaelChirico's answer using Taylor series would converge after 19 terms, since 1 / factorial(19) is already numerically 0 when added to 1.
Multiplication / division between floating-point numbers doesn't cause problems with significant digits; it may cause "overflow" or "underflow". However, given the wide range of representable floating-point values (1e-308 ~ 1e+307), "overflow" and "underflow" should be rare. The real difficulty is with addition / subtraction, where significant digits can easily be lost. See Can I stably invert a Vandermonde matrix with many small values in R? for an example on matrix computations. It is not impossible to get higher precision, but the work is probably more involved. For example, the OP of the matrix example eventually used the GMP (GNU Multiple Precision Arithmetic Library) and associated R packages to proceed: How to put Rmpfr values into a function in R?
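As a side note (my own addition, shown in Python; R has log1p() as well): the loss of the 1/x term can be sidestepped by computing log(1 + 1/x) with log1p, which never forms 1 + 1/x explicitly:
import math

x = 5e15
print((1 + 1 / x) ** x)                 # about 3.035, the absurd value from the question
print(math.exp(x * math.log1p(1 / x)))  # about 2.718281828459045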
You might also try the Taylor series approximation to exp(1), namely
e^x = \sum_{k = 0}^{\infty} x^k / k!
Thus we can approximate e = e^1 by truncating this sum; in R:
sprintf('%.20f', exp(1))
# [1] "2.71828182845904509080"
sprintf('%.20f', sum(1/factorial(0:10)))
# [1] "2.71828180114638451315"
sprintf('%.20f', sum(1/factorial(0:100)))
# [1] "2.71828182845904509080"

Yun's algorithm

I would like to try to implement Yun's algorithm for square-free factorization of polynomials. From Wikipedia (f is the polynomial):
a0 = gcd(f, f'); b1 = f/a0; c1 = f'/a0; d1 = c1 - b1'; i = 1
repeat
ai = gcd(bi, di); bi+1 = bi/ai; ci+1 = di/ai; i = i + 1; di = ci - bi'
until b = 1
However, I'm not sure about the second step. I would like to use it for polynomials with integer coefficients (not necessarily monic or primitive). Is it possible to realize the division b1 = f/a0 using just integers?
I found the code for synthetic division:
def extended_synthetic_division(dividend, divisor):
    '''Fast polynomial division by using Extended Synthetic Division. Also works with non-monic polynomials.'''
    # dividend and divisor are both polynomials, which are here simply lists of coefficients. Eg: x^2 + 3x + 5 will be represented as [1, 3, 5]
    out = list(dividend)  # Copy the dividend
    normalizer = divisor[0]
    for i in xrange(len(dividend) - (len(divisor) - 1)):
        out[i] /= normalizer  # for general polynomial division (when polynomials are non-monic),
        # we need to normalize by dividing the coefficient with the divisor's first coefficient
        coef = out[i]
        if coef != 0:  # useless to multiply if coef is 0
            for j in xrange(1, len(divisor)):  # in synthetic division, we always skip the first coefficient of the divisor,
                # because it is only used to normalize the dividend coefficients
                out[i + j] += -divisor[j] * coef
    # The resulting out contains both the quotient and the remainder, the remainder being the size of the divisor (the remainder
    # has necessarily the same degree as the divisor since it is what we couldn't divide from the dividend), so we compute the index
    # where this separation is, and return the quotient and remainder.
    separator = -(len(divisor) - 1)
    return out[:separator], out[separator:]  # return quotient, remainder.
The problem for me is the out[i] /= normalizer step. Would it always work with integer (floor) division for Yun's b1 = f/a0? That is, is it always possible to divide f by gcd(f, f') exactly? Is out[separator:] (the remainder) always going to be zero?
The fact that the division in p/GCD(p, p') will always work (i.e. be exact, with no remainder in Z) follows from the definition of the GCD. For any polynomials p and q, their GCD(p, q) divides both p and q exactly. That's why it is called the GCD, i.e. Greatest Common Divisor:
A greatest common divisor of p and q is a polynomial d that divides p and q and such that every common divisor of p and q also divides d.
P.S. It makes more sense to ask such purely mathematical questions on the more specialized https://math.stackexchange.com/
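To tie this back to the implementation question, here is a minimal sketch of Yun's algorithm on top of sympy's exact polynomial arithmetic (my own illustration, using a primitive example polynomial; exquo raises an exception if a division leaves a remainder, and, as explained above, these divisions never do):
from sympy import symbols, Poly, gcd

x = symbols('x')

def yun(f):
    # Square-free factors a1, a2, ... with f = a1 * a2**2 * a3**3 * ... (up to constants)
    d = f.diff(x)
    a0 = gcd(f, d)
    b, c = f.exquo(a0), d.exquo(a0)   # exquo = exact quotient, raises if inexact
    factors = []
    while b.degree() > 0:
        d_i = c - b.diff(x)
        a = gcd(b, d_i)
        factors.append(a)
        b, c = b.exquo(a), d_i.exquo(a)
    return factors

f = Poly((x - 1) * (x + 2)**2 * (x - 3)**3, x)
print(yun(f))   # expect factors equivalent to x - 1, x + 2 and x - 3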

Vectorizing code to calculate (squared) Mahalanobis Distance

EDIT 2: this post seems to have been moved from CrossValidated to StackOverflow due to it being mostly about programming, but that means my fancy MathJax doesn't work anymore. Hopefully this is still readable.
Say I want to calculate the squared Mahalanobis distance between two vectors x and y with covariance matrix S. This is a fairly simple function defined by
M2(x, y; S) = (x - y)^T * S^-1 * (x - y)
With python's numpy package I can do this as
# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 = diff.T.dot(s_inv).dot(diff)
or in R as
diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff
In my case, though, I am given
m by n matrix X
n-dimensional vector mu
n by n covariance matrix S
and want to find the m-dimensional vector d such that
d_i = M2(x_i, mu; S) ( i = 1 .. m )
where x_i is the ith row of X.
This is not difficult to accomplish using a simple loop in python:
d = numpy.zeros((m,))
for i in range(m):
    diff = x[i,:] - mu
    d[i] = diff.T.dot(s_inv).dot(diff)
Of course, the fact that the outer loop is happening in Python instead of in native code in the numpy library means it's not as fast as it could be. n and m are about 3-4 and several hundred thousand respectively, and I'm doing this somewhat often in an interactive program, so a speedup would be very useful.
Mathematically, the only way I've been able to formulate this using basic matrix operations is
d = diag( X' * S^-1 * X'^T )
where
x'_i = x_i - mu
which is simple to write a vectorized version of, but this is unfortunately outweighed by the inefficiency of calculating a 10-billion-plus element matrix and only taking the diagonal... I believe this operation should be easily expressible using Einstein notation, and thus could hopefully be evaluated quickly with numpy's einsum function, but I haven't even begun to figure out how that black magic works.
So, I would like to know: is there either a nicer way to formulate this operation mathematically (in terms of simple matrix operations), or could someone suggest some nice vectorized (python or R) code that does this efficiently?
BONUS QUESTION, for the brave
I don't actually want to do this once, I want to do it k ~ 100 times. Given:
m by n matrix X
k by n matrix U
Set of n by n covariance matrices each denoted S_j (j = 1..k)
Find the m by k matrix D such that
D_i,j = M(x_i, u_j; S_j)
Where i = 1..m, j = 1..k, x_i is the ith row of X and u_j is the jth row of U.
I.e., vectorize the following code:
# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
    for i in range(m):
        diff = x[i, :] - u[j, :]
        d[i, j] = diff.T.dot(s_inv[j, :, :]).dot(diff)
First off, it seems like maybe you're getting S and then inverting it. You shouldn't do that; it's slow and numerically inaccurate. Instead, you should get the Cholesky factor L of S so that S = L L^T; then
M^2(x, y; L L^T)
= (x - y)^T (L L^T)^-1 (x - y)
= (x - y)^T L^-T L^-1 (x - y)
= || L^-1 (x - y) ||^2,
and since L is triangular L^-1 (x - y) can be computed efficiently.
As it turns out, scipy.linalg.solve_triangular will happily do a bunch of these at once if you reshape it properly:
L = np.linalg.cholesky(S)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
d = np.einsum('ij,ij->j', y, y)
Breaking that down a bit, y[i, j] is the ith component of L^-1 (X_j - \mu). The einsum call then does
d_j = \sum_i y_{ij} y_{ij}
= \sum_i y_{ij}^2
= || y_j ||^2,
like we need.
Unfortunately, solve_triangular won't vectorize across its first argument, so you should probably just loop there. If k is only about 100, that's not going to be a significant issue.
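For concreteness, a minimal sketch of that loop (my own code, with shapes assumed to match the bonus question: X is (m, n), U is (k, n), S is (k, n, n)):
import numpy as np
import scipy.linalg

def mahalanobis2_all(X, U, S):
    m, k = X.shape[0], U.shape[0]
    D = np.empty((m, k))
    for j in range(k):
        L = np.linalg.cholesky(S[j])
        y = scipy.linalg.solve_triangular(L, (X - U[j]).T, lower=True)
        D[:, j] = np.einsum('ij,ij->j', y, y)
    return D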
If you are actually given S^-1 rather than S, then you can indeed do this with einsum more directly. Since S is quite small in your case, it's also possible that actually inverting the matrix and then doing this would be faster. As soon as n is a nontrivial size, though, you're throwing away a lot of numerical accuracy by doing this.
To figure out what to do with einsum, write everything in terms of components. I'll go straight to the bonus case, writing S_j^-1 = T_j for notational convenience:
D_{ij} = M^2(x_i, u_j; S_j)
= (x_i - u_j)^T T_j (x_i - u_j)
= \sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
= \sum_k (x_i - u_j)_k \sum_l (T_j)_{k l} (x_i - u_j)_l
= \sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})
So, if we make arrays X of shape (m, n), U of shape (k, n), and T of shape (k, n, n), then we can write this as
diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)
where diff[j, i, k] = X[i, k] - U[j, k].
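As a quick sanity check (my own sketch with small made-up shapes), the einsum expression matches the naive double loop from the question:
import numpy as np

m, k, n = 50, 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))
U = rng.normal(size=(k, n))
A = rng.normal(size=(k, n, n))
T = np.einsum('jab,jcb->jac', A, A)      # k symmetric positive definite matrices (the S_j^-1)

diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)

D_loop = np.empty((m, k))
for j in range(k):
    for i in range(m):
        d = X[i] - U[j]
        D_loop[i, j] = d @ T[j] @ d

print(np.allclose(D, D_loop))            # True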
Dougal nailed this one with an excellent and detailed answer, but I thought I'd share a small modification that I found increases efficiency, in case anyone else is trying to implement this. Straight to the point:
Dougal's method was as follows:
def mahalanobis2(X, mu, sigma):
    L = np.linalg.cholesky(sigma)
    y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis,:]).T, lower=True)
    return np.einsum('ij,ij->j', y, y)
A mathematically equivalent variant I tried is
def mahalanobis2_2(X, mu, sigma):
    # Cholesky decomposition of inverse of covariance matrix
    # (Doing this in either order should be equivalent)
    linv = np.linalg.cholesky(np.linalg.inv(sigma))
    # Just do regular matrix multiplication with this matrix
    y = (X - mu[np.newaxis,:]).dot(linv)
    # Same as above, but note different index at end because the matrix
    # y is transposed here compared to above
    return np.einsum('ij,ij->i', y, y)
I ran both versions head-to-head 20x using identical random inputs and recorded the times (in milliseconds). For X as a 1,000,000 x 3 matrix (with mu and sigma of size 3 and 3x3) I get:
Method 1 (min/max/avg): 30/62/49
Method 2 (min/max/avg): 30/47/37
That's about a 30% speedup for the 2nd version. I'm mostly going to be running this in 3 or 4 dimensions but to see how it scaled I tried X as 1,000,000 x 100 and got:
Method 1 (min/max/avg): 970/1134/1043
Method 2 (min/max/avg): 776/907/837
which is about the same improvement.
I mentioned this in a comment on Dougal's answer, but I am adding it here for additional visibility:
The first pair of methods above take a single center point mu and covariance matrix sigma and calculate the squared Mahalanobis distance to each row of X. My bonus question was to do this multiple times with many sets of mu and sigma and output a two-dimensional matrix. The set of methods above can be used to accomplish this with a simple for loop, but Dougal also posted a more clever example using einsum.
I decided to compare these methods with each other by using them to solve the following problem: Given k d-dimensional normal distributions (with centers stored in rows of k by d matrix U and covariance matrices in the last two dimensions of the k by d by d array S), find the density at the n points stored in rows of the n by d matrix X.
The density of a multivariate normal distribution is a function of the squared Mahalanobis distance of the point to the mean. Scipy has an implementation of this as scipy.stats.multivariate_normal.pdf to use as a reference. I ran all three methods against each other 10x using identical random parameters each time, with d=3, k=96, n=5e5. Here are the results, in points/sec:
[Method]: (min/max/avg)
Scipy: 1.18e5/1.29e5/1.22e5
Fancy 1: 1.41e5/1.53e5/1.48e5
Fancy 2: 8.69e4/9.73e4/9.03e4
Fancy 2 (cheating version): 8.61e4/9.88e4/9.04e4
where Fancy 1 is the better of the two methods above and Fancy 2 is Dougal's second solution. Since Fancy 2 needs to calculate the inverses of all the covariance matrices, I also tried a "cheating version" where it was passed these as a parameter, but it looks like that didn't make a difference. I had planned on including the non-vectorized implementation, but that was so slow it would have taken all day.
What we can take away from this is that Dougal's first method is about 20% faster than however Scipy does it. Unfortunately, despite its cleverness, the second method is only about 60% as fast as the first. There are probably some other optimizations that can be done, but this is already fast enough for me.
I also tested how this scaled with higher dimensionality. With d=100, k=96, n=1e4:
Scipy: 7.81e3/7.91e3/7.86e3
Fancy 1: 1.03e4/1.15e4/1.08e4
Fancy 2: 3.75e3/4.10e3/3.95e3
Fancy 2 (cheating version): 3.58e3/4.09e3/3.85e3
Fancy 1 seems to have an even bigger advantage this time. It is also worth noting that Scipy threw a LinAlgError in 8 of the 10 runs, probably because some of my randomly-generated 100x100 covariance matrices were close to singular (which may mean that the other two methods are not as numerically stable; I did not actually check the results).
