How do I optimize the speed of my Sage reducibility algorithm? - sage

Suppose I have the polynomial f(x) = x^n + x + a. I set a value for n, and want 0 <= a <= A, where A is some other value I set. This means I will have a total of A different polynomials, since a can be any value between 0 and A.
Using Sage, I'm finding the number of these A polynomials that are reducible. For example, suppose I set n=5 and A=10^7. That would tell me how many of these 10^7 polynomials of degree 5 are reducible. I've done this using a loop, which works for low values of A. But for the large values I need (ie. A=10^7), it's taking an extremely long & impractical amount of time. The code is below. Could someone please help me meaningfully optimize this?
x = polygen(QQ)
n = 5
A = 10^7
count = 0
for i in range(A):
p_pol = x^n + x + i
if not p_pol.is_irreducible():
count = count + 1
print(i)
print('Count:' + str(count))

One small, but in this case pretty meaningless optimization is to replace range(A) with xrange(A). The former will create an array of all integers from 0 to A - 1 which is a waste of time and space. xrange(A) will just produce integers one by one and discard them when you're done. Sage 9.0 will be base on Python 3 by default where range is equivalent to xrange.
Let's do a little experiment though. Another small optimization will be to pre-define the part of your polynomial that's constant in each loop:
x = polygen(QQ)
n = 5
A = 10^7
base = x^n + x
Now just as a general test, let's see how long it takes in a few cases to add an integer to the polynomial and then compute its irreducibility:
sage: (base + 1).is_irreducible()
False
sage: %timeit (base + 1).is_irreducible()
1000 loops, best of 3: 744 µs per loop
sage: (base + 3).is_irreducible()
True
sage: %timeit (base + 3).is_irreducible()
1000 loops, best of 3: 387 µs per loop
So it seems in cases where it is irreducible (which will be the majority) it's a little faster, so let's say on average it will take 387µs per. Then:
sage: 0.000387 * 10^7 / 60
64.5000000000000
So this will still take a little over an hour, on average (on my machine).
One thing you can do to speed things up is parallelize it, if you have many CPU cores. For example:
x = polygen(QQ)
A = 10^7
def is_irreducible(i, base=(x^5 + x)):
return (base + i).is_irreducible()
from multiprocessing import Pool
pool = Pool()
A - sum(pool.map(is_irreducible, xrange(A)))
That will in principle give you the same result. Though the speed up you'll get will only be on the order of the number of CPUs you have at best (typically a little less). Sage also comes with some parallelization helpers but I tend to find them a bit lacking for the case of speeding up small calculations over a large range of values (they can be used for this but it requires some care, such as manually batching your inputs; I'm not crazy about it...)
Beyond that, you may need to use some mathematical intuition to try to reduce the problem space.

Related

Mixing function for non power of 2 integer intervals

I'm looking for a mixing function that given an integer from an interval <0, n) returns a random-looking integer from the same interval. The interval size n will typically be a composite non power of 2 number. I need the function to be one to one. It can only use O(1) memory, O(1) time is strongly preferred. I'm not too concerned about randomness of the output, but visually it should look random enough (see next paragraph).
I want to use this function as a pixel shuffling step in a realtime-ish renderer to select the order in which pixels are rendered (The output will be displayed after a fixed time and if it's not done yet this gives me a noisy but fast partial preview). Interval size n will be the number of pixels in the render (n = 1920*1080 = 2073600 would be a typical value). The function must be one to one so that I can be sure that every pixel is rendered exactly once when finished.
I've looked at the reversible building blocks used by hash prospector, but these are mostly specific to power of 2 ranges.
The only other method I could think of is multiply by large prime, but it doesn't give particularly nice random looking outputs.
What are some other options here?
Here is one solution based on the idea of primitive roots modulo a prime:
If a is a primitive root mod p then the function g(i) = a^i % p is a permutation of the nonzero elements which are less than p. This corresponds to the Lehmer prng. If n < p, you can get a permutation of 0, ..., n-1 as follows: Given i in that range, first add 1, then repeatedly multiply by a, taking the result mod p, until you get an element which is <= n, at which point you return the result - 1.
To fill in the details, this paper contains a table which gives a series of primes (all of which are close to various powers of 2) and corresponding primitive roots which are chosen so that they yield a generator with good statistical properties. Here is a part of that table, encoded as a Python dictionary in which the keys are the primes and the primitive roots are the values:
d = {32749: 30805,
65521: 32236,
131071: 66284,
262139: 166972,
524287: 358899,
1048573: 444362,
2097143: 1372180,
4194301: 1406151,
8388593: 5169235,
16777213: 9726917,
33554393: 32544832,
67108859: 11526618,
134217689: 70391260,
268435399: 150873839,
536870909: 219118189,
1073741789: 599290962}
Given n (in a certain range -- see the paper if you need to expand that range), you can find the smallest p which works:
def find_p_a(n):
for p in sorted(d.keys()):
if n < p:
return p, d[p]
once you know n and the matching p,a the following function is a permutation of 0 ... n-1:
def f(i,n,p,a):
x = a*(i+1) % p
while x > n:
x = a*x % p
return x-1
For a quick test:
n = 2073600
p,a = find_p_a(n) # p = 2097143, a = 1372180
nums = [f(i,n,p,a) for i in range(n)]
print(len(set(nums)) == n) #prints True
The average number of multiplications in f() is p/n, which in this case is 1.011 and will never be more than 2 (or very slightly larger since the p are not exact powers of 2). In practice this method is not fundamentally different from your "multiply by a large prime" approach, but in this case the factor is chosen more carefully, and the fact that sometimes more than 1 multiplication is required adding to the apparent randomness.

Max size of set linear equations to solve? (X=AX+B)

This is a very general question regarding the maximum size of a set of linear equations to be solved by today's fastest hardware, in the form:
X = AX + B
A: NxN matrix of floats, it is sparse.
B: N-vector of floats.
solve for X.
This becomes X(I-A) = B which is best solved using factorisation (and not matrix inversion) as I read here:
http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/
Do you know yourselfs or have a reference to a benchmark or paper which gives some maximum value for N with today's fastest hardware? Most benchmarks I have seen use N < 10,000. I am thinking about N>10x10^6 or more to be processed within a month.
Please consider not only the computational dimension but also storage for A. It can be a problem:
e.g. assuming N = 1 x 10^6, storage would be 1x10^12 x 4 bytes / (1024x1024x1024) = 4 Terrabytes for totally dense matrix, which is about manageable I guess.
Lastly, can the method to solve the system be parallelised so that I can make the assumption that with parallelisation N can be pretty large?
thanks in advance,
bliako

How to do hypot2(x,y) calculation when numbers can overflow

I'd like to do a hypot2 calculation on a 16-bit processor.
The standard formula is c = sqrt((a * a) + (b * b)). The problem with this is with large inputs it overflows. E.g. 200 and 250, multiply 200 * 200 to get 90,000 which is higher than the max signed value of 32,767, so it overflows, as does b, the numbers are added and the result may as well be useless; it might even signal an error condition because of a negative sqrt.
In my case, I'm dealing with 32-bit numbers, but 32-bit multiply on my processor is very fast, about 4 cycles. I'm using a dsPIC microcontroller. I'd rather not have to multiply with 64-bit numbers because that's wasting precious memory and undoubtedly will be slower. Additionally I only have sqrt for 32-bit numbers, so 64-bit numbers would require another function. So how can I compute a hypot when the values may be large?
Please note I can only really use integer math for this. Using anything like floating point math incurs a speed hit which I'd rather avoid. My processor has a fast integer/fixed point atan2 routine, about 130 cycles; could I use this to compute the hypotenuse length?
Depending on how much accuracy you need you may be able to avoid the squares and the square root operation. There is a section on this topic in Understanding Digital Signal Processing by Rick Lyons (section 10.2, "High-Speed Vector-Magnitude Approximation", starting at page 400 in my edition).
The approximation is essentially:
magnitude = alpha * min + beta * max
where max and min are the maximum and minimum absolute values of the real and imaginary components, and alpha and beta are two constants which are chosen to give a reasonable error distribution over the range of interest. These constants can be represented as fractions with power of 2 divisors to keep the arithemtic simple/efficient. In the book he suggests alpha = 15/16, beta = 15/32, and you can then simplify the formula to:
magnitude = (15 / 16) * (max + min / 2)
which might be implemented as follows using integer operations:
magnitude = 15 * (max + min / 2) / 16
and of course we can use shifts for the divides:
magnitude = (15 * (max + (min >> 1))) >> 4
Error is +/- 5% over a quadrant.
More information on this technique here: http://www.dspguru.com/dsp/tricks/magnitude-estimator
This is taken verbatim from this #John D. Cook blog post, hence CW:
Here’s how to compute sqrt(x*x + y*y)
without risking overflow.
max = maximum(|x|, |y|)
min = minimum(|x|, |y|)
r = min / max
return max*sqrt(1 + r*r)
If #John D. Cook comes along and posts this you should give him the accept :)
Since you essentially can't do any multiplications without overflow you're likely going to lose some precision.
To get the numbers into an acceptable range, pull out some factor x and use
c = x*sqrt( (a/x)*(a/x) + (b/x)*(b/x) )
If x is a common factor, you won't lose precision, but if it's not, you will lose precision.
Update:
Even better, given that you can do some mild work with 64-bit numbers, with just one 64-bit addition, you could do the rest of this problem in 32-bits with only a tiny loss of accuracy. To do this: do the two 32-bit multiplications to give you two 64-bit numbers, add these, and then bit shift as needed to get the sum back down to 32-bits before taking the square root. If you always bit shift by 2 bits, then just multiply the final result by 2^(half the number of bit shifts), based on the rule above. The truncation should only cause a very small loss of accuracy, no more than 2^31, or 0.00000005% error.
Aniko and John, it seems to me that you haven't addressed the OP's problem. If a and b are integers, then a*a + b*b is likely to overflow, because integer operations are being performed. The obvious solution is to convert a and b to floating-point values before computing a*a + b*b. But the OP hasn't let us know what language we should use, so we're a bit stuck.
The standard formula is c = sqrt((a * a) + (b * b)). The problem with this is with large >inputs it overflows.
The solution for overflows (aside from throwing an error) is to saturate your intermediate calculations.
Calculate C = a*a + b*b. If a and b are signed 16-bit numbers, you will never have an overflow. If they are unsigned numbers, you'll need to right-shift the inputs first to get the sum to fit in a 32-bit number.
If C > (MAX_RADIUS)^2, return MAX_RADIUS, where MAX_RADIUS is the maximum value you can tolerate before encounting an overflow.
Otherwise, use either sqrt() or the CORDIC algorithm, which avoids the cost of square roots in favor of loop iteration + adds + shifts, to retrieve the amplitude of the (a,b) vector.
If you can constrain a and b to be at most 7 bits, you won't get any overflow. You can use a count-leading-zeros instruction to figure out how many bits to throw away.
Assume a>=b.
int bits = 16 - count_leading_zeros(a);
if (bits > 7) {
a >>= bits - 7;
b >>= bits - 7;
}
c = sqrt(a*a + b*b);
if (bits > 7) {
c <<= bits - 7;
}
Lots of processors have this instruction nowadays, and if not, you can use other fast techniques.
Although this won't give you the exact answer, it will be very close (at most ~1% low).
Do you need full precision? If you don't, you can increase your range a little bit by discarding a few least significant bits and multiplying them in afterwards.
Can a and b be anything? How about a lookup table if you only have a few a and b that you need to calculate?
A simple solution to avoid overflow is to divide both a and b by a+b before squaring, and then multiply the square root by a+b. Or do the same with max(a,b).
You can do a little simple algebra to bring the results back into range.
sqrt((a * a) + (b * b))
= 2 * sqrt(((a * a) + (b * b)) / 4)
= 2 * sqrt((a * a) / 4 + (b * b) / 4)
= 2 * sqrt((a/2 * a/2) + (b/2 * b/2))

What is O value for naive random selection from finite set?

This question on getting random values from a finite set got me thinking...
It's fairly common for people to want to retrieve X unique values from a set of Y values. For example, I may want to deal a hand from a deck of cards. I want 5 cards, and I want them to all be unique.
Now, I can do this naively, by picking a random card 5 times, and try again each time I get a duplicate, until I get 5 cards. This isn't so great, however, for large numbers of values from large sets. If I wanted 999,999 values from a set of 1,000,000, for instance, this method gets very bad.
The question is: how bad? I'm looking for someone to explain an O() value. Getting the xth number will take y attempts...but how many? I know how to figure this out for any given value, but is there a straightforward way to generalize this for the whole series and get an O() value?
(The question is not: "how can I improve this?" because it's relatively easy to fix, and I'm sure it's been covered many times elsewhere.)
Variables
n = the total amount of items in the set
m = the amount of unique values that are to be retrieved from the set of n items
d(i) = the expected amount of tries needed to achieve a value in step i
i = denotes one specific step. i ∈ [0, n-1]
T(m,n) = expected total amount of tries for selecting m unique items from a set of n items using the naive algorithm
Reasoning
The first step, i=0, is trivial. No matter which value we choose, we get a unique one at the first attempt. Hence:
d(0) = 1
In the second step, i=1, we at least need 1 try (the try where we pick a valid unique value). On top of this, there is a chance that we choose the wrong value. This chance is (amount of previously picked items)/(total amount of items). In this case 1/n. In the case where we picked the wrong item, there is a 1/n chance we may pick the wrong item again. Multiplying this by 1/n, since that is the combined probability that we pick wrong both times, gives (1/n)2. To understand this, it is helpful to draw a decision tree. Having picked a non-unique item twice, there is a probability that we will do it again. This results in the addition of (1/n)3 to the total expected amounts of tries in step i=1. Each time we pick the wrong number, there is a chance we might pick the wrong number again. This results in:
d(1) = 1 + 1/n + (1/n)2 + (1/n)3 + (1/n)4 + ...
Similarly, in the general i:th step, the chance to pick the wrong item in one choice is i/n, resulting in:
d(i) = 1 + i/n + (i/n)2 + (i/n)3 + (i/n)4 + ... = = sum( (i/n)k ), where k ∈ [0,∞]
This is a geometric sequence and hence it is easy to compute it's sum:
d(i) = (1 - i/n)-1
The overall complexity is then computed by summing the expected amount of tries in each step:
T(m,n) = sum ( d(i) ), where i ∈ [0,m-1] = = 1 + (1 - 1/n)-1 + (1 - 2/n)-1 + (1 - 3/n)-1 + ... + (1 - (m-1)/n)-1
Extending the fractions in the series above by n, we get:
T(m,n) = n/n + n/(n-1) + n/(n-2) + n/(n-3) + ... + n/(n-m+2) + n/(n-m+1)
We can use the fact that:
n/n ≤ n/(n-1) ≤ n/(n-2) ≤ n/(n-3) ≤ ... ≤ n/(n-m+2) ≤ n/(n-m+1)
Since the series has m terms, and each term satisfies the inequality above, we get:
T(m,n) ≤ n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + ... + n/(n-m+1) + n/(n-m+1) = = m*n/(n-m+1)
It might be(and probably is) possible to establish a slightly stricter upper bound by using some technique to evaluate the series instead of bounding by the rough method of (amount of terms) * (biggest term)
Conclusion
This would mean that the Big-O order is O(m*n/(n-m+1)). I see no possible way to simplify this expression from the way it is.
Looking back at the result to check if it makes sense, we see that, if n is constant, and m gets closer and closer to n, the results will quickly increase, since the denominator gets very small. This is what we'd expect, if we for example consider the example given in the question about selecting "999,999 values from a set of 1,000,000". If we instead let m be constant and n grow really, really large, the complexity will converge towards O(m) in the limit n → ∞. This is also what we'd expect, since while chosing a constant number of items from a "close to" infinitely sized set the probability of choosing a previously chosen value is basically 0. I.e. We need m tries independently of n since there are no collisions.
If you already have chosen i values then the probability that you pick a new one from a set of y values is
(y-i)/y.
Hence the expected number of trials to get (i+1)-th element is
y/(y-i).
Thus the expected number of trials to choose x unique element is the sum
y/y + y/(y-1) + ... + y/(y-x+1)
This can be expressed using harmonic numbers as
y (Hy - Hy-x).
From the wikipedia page you get the approximation
Hx = ln(x) + gamma + O(1/x)
Hence the number of necessary trials to pick x unique elements from a set of y elements
is
y (ln(y) - ln(y-x)) + O(y/(y-x)).
If you need then you can get a more precise approximation by using a more precise approximation for Hx. In particular, when x is small it is possible to
improve the result a lot.
If you're willing to make the assumption that your random number generator will always find a unique value before cycling back to a previously seen value for a given draw, this algorithm is O(m^2), where m is the number of unique values you are drawing.
So, if you are drawing m values from a set of n values, the 1st value will require you to draw at most 1 to get a unique value. The 2nd requires at most 2 (you see the 1st value, then a unique value), the 3rd 3, ... the mth m. Hence in total you require 1 + 2 + 3 + ... + m = [m*(m+1)]/2 = (m^2 + m)/2 draws. This is O(m^2).
Without this assumption, I'm not sure how you can even guarantee the algorithm will complete. It's quite possible (especially with a pseudo-random number generator which may have a cycle), that you will keep seeing the same values over and over and never get to another unique value.
==EDIT==
For the average case:
On your first draw, you will make exactly 1 draw.
On your 2nd draw, you expect to make 1 (the successful draw) + 1/n (the "partial" draw which represents your chance of drawing a repeat)
On your 3rd draw, you expect to make 1 (the successful draw) + 2/n (the "partial" draw...)
...
On your mth draw, you expect to make 1 + (m-1)/n draws.
Thus, you will make 1 + (1 + 1/n) + (1 + 2/n) + ... + (1 + (m-1)/n) draws altogether in the average case.
This equals the sum from i=0 to (m-1) of [1 + i/n]. Let's denote that sum(1 + i/n, i, 0, m-1).
Then:
sum(1 + i/n, i, 0, m-1) = sum(1, i, 0, m-1) + sum(i/n, i, 0, m-1)
= m + sum(i/n, i, 0, m-1)
= m + (1/n) * sum(i, i, 0, m-1)
= m + (1/n)*[(m-1)*m]/2
= (m^2)/(2n) - (m)/(2n) + m
We drop the low order terms and the constants, and we get that this is O(m^2/n), where m is the number to be drawn and n is the size of the list.
There's a beautiful O(n) algorithm for this. It goes as follows. Say you have n items, from which you want to pick m items. I assume the function rand() yields a random real number between 0 and 1. Here's the algorithm:
items_left=n
items_left_to_pick=m
for j=1,...,n
if rand()<=(items_left_to_pick/items_left)
Pick item j
items_left_to_pick=items_left_to_pick-1
end
items_left=items_left-1
end
It can be proved that this algorithm does indeed pick each subset of m items with equal probability, though the proof is non-obvious. Unfortunately, I don't have a reference handy at the moment.
Edit The advantage of this algorithm is that it takes only O(m) memory (assuming the items are simply integers or can be generated on-the-fly) compared to doing a shuffle, which takes O(n) memory.
Your actual question is actually a lot more interesting than what I answered (and harder). I've never been any good at statistitcs (and it's been a while since I did any), but intuitively, I'd say that the run-time complexity of that algorithm would probably something like an exponential. As long as the number of elements picked is small enough compared to the size of the array the collision-rate will be so small that it will be close to linear time, but at some point the number of collisions will probably grow fast and the run-time will go down the drain.
If you want to prove this, I think you'd have to do something moderately clever with the expected number of collisions in function of the wanted number of elements. It might be possible do to by induction as well, but I think going by that route would require more cleverness than the first alternative.
EDIT: After giving it some thought, here's my attempt:
Given an array of m elements, and looking for n random and different elements. It is then easy to see that when we want to pick the ith element, the odds of picking an element we've already visited are (i-1)/m. This is then the expected number of collisions for that particular pick. For picking n elements, the expected number of collisions will be the sum of the number of expected collisions for each pick. We plug this into Wolfram Alpha (sum (i-1)/m, i=1 to n) and we get the answer (n**2 - n)/2m. The average number of picks for our naive algorithm is then n + (n**2 - n)/2m.
Unless my memory fails me completely (which entirely possible, actually), this gives an average-case run-time O(n**2).
The worst case for this algorithm is clearly when you're choosing the full set of N items. This is equivalent to asking: On average, how many times must I roll an N-sided die before each side has come up at least once?
Answer: N * HN, where HN is the Nth harmonic number,
a value famously approximated by log(N).
This means the algorithm in question is N log N.
As a fun example, if you roll an ordinary 6-sided die until you see one of each number, it will take on average 6 H6 = 14.7 rolls.
Before being able to answer this question in details, lets define the framework. Suppose you have a collection {a1, a2, ..., an} of n distinct objects, and want to pick m distinct objects from this set, such that the probability of a given object aj appearing in the result is equal for all objects.
If you have already picked k items, and radomly pick an item from the full set {a1, a2, ..., an}, the probability that the item has not been picked before is (n-k)/n. This means that the number of samples you have to take before you get a new object is (assuming independence of random sampling) geometric with parameter (n-k)/n. Thus the expected number of samples to obtain one extra item is n/(n-k), which is close to 1 if k is small compared to n.
Concluding, if you need m unique objects, randomly selected, this algorithm gives you
n/n + n/(n-1) + n/(n-2) + n/(n-3) + .... + n/(n-(m-1))
which, as Alderath showed, can be estimated by
m*n / (n-m+1).
You can see a little bit more from this formula:
* The expected number of samples to obtain a new unique element increases as the number of already chosen objects increases (which sounds logical).
* You can expect really long computation times when m is close to n, especially if n is large.
In order to obtain m unique members from the set, use a variant of David Knuth's algorithm for obtaining a random permutation. Here, I'll assume that the n objects are stored in an array.
for i = 1..m
k = randInt(i, n)
exchange(i, k)
end
here, randInt samples an integer from {i, i+1, ... n}, and exchange flips two members of the array. You only need to shuffle m times, so the computation time is O(m), whereas the memory is O(n) (although you can adapt it to only save the entries such that a[i] <> i, which would give you O(m) on both time and memory, but with higher constants).
Most people forget that looking up, if the number has already run, also takes a while.
The number of tries nessesary can, as descriped earlier, be evaluated from:
T(n,m) = n(H(n)-H(n-m)) ⪅ n(ln(n)-ln(n-m))
which goes to n*ln(n) for interesting values of m
However, for each of these 'tries' you will have to do a lookup. This might be a simple O(n) runthrough, or something like a binary tree. This will give you a total performance of n^2*ln(n) or n*ln(n)^2.
For smaller values of m (m < n/2), you can do a very good approximation for T(n,m) using the HA-inequation, yielding the formula:
2*m*n/(2*n-m+1)
As m goes to n, this gives a lower bound of O(n) tries and performance O(n^2) or O(n*ln(n)).
All the results are however far better, that I would ever have expected, which shows that the algorithm might actually be just fine in many non critical cases, where you can accept occasional longer running times (when you are unlucky).

Random number generation without using bit operations

I'm writing a vertex shader at the moment, and I need some random numbers. Vertex shader hardware doesn't have logical/bit operations, so I cannot implement any of the standard random number generators.
Is it possible to make a random number generator using only standard arithmetic? the randomness doesn't have to be particularly good!
If you don't mind crappy randomness, a classic method is
x[n+1] = (x[n] * x[n] + C) mod N
where C and N are constants, C != 0 and C != -2, and N is prime. This is a typical pseudorandom generator for Pollard Rho factoring. Try C = 1 and N = 8051, those work ok.
Vertex shaders sometimes have built-in noise generators for you to use, such as cg's noise() function.
Use a linear congruential generator:
X_(n+1) = (a * X_n + c) mod m
Those aren't that strong, but at least they are well known and can have long periods. The Wikipedia page also has good recommendations:
The period of a general LCG is at most
m, and for some choices of a much less
than that. The LCG will have a full
period if and only if:
1. c and m are relatively prime,
2. a - 1 is divisible by all prime factors of m,
3. a - 1 is a multiple of 4 if m is a multiple of 4
Believe it or not, I used newx = oldx * 5 + 1 (or a slight variation of it) in several videogames. The randomness is horrible--it's more of a scrambled sequence than a random generator. But sometimes that's all you need. If I recall correctly, it goes through all numbers before it repeats.
It has some terrible characteristics. It doesn't ever give you the same number twice in a row. A few of us did a bunch of tests on variations of it and we used some variations in other games.
We used it when there was no good modulo available to us. It's just a shift by two and two adds (or a multiply by 5 and one add). I would never use it nowadays for random numbers--I'd use an LCG--but maybe it would work OK for a shader where speed is crucial and your instruction set may be limited.

Resources