Montgomery Multiplication - 32-bit register vs. 64-bit register

I need to calculate the speed difference between performing a Montgomery Multiplication page 602-603 with a word-size/register of size 32 vs. 64.
So far, this is what I understand:
x and y are represented by multi-word arrays of length n, where n = m/w and w is the register size (either 32 or 64).
The total number of single-digit multiplications in Montgomery multiplication is n*(2 + 2*n), where n is the length of the word arrays.
I will assume that the multiplication of two single digits takes 1 clock cycle on each of the computers.
How can I put all this together to represent the number of clock cycles needed in Montgomery multiplication on a computer with a 32-bit register or 64-bit register?

The number of cycles for a multiple-precision Montgomery multiplication would indeed be n(2+2*n) if all the intermediate single-precision multiplication operands and results were available in registers. For cryptographic operations this is hardly possible since m is usually 1024 bits or larger. Assuming 32-bit registers, computing xyR^-1 mod m would require 96 registers just to store the operands (3*(1024/32)). You therefore need to take memory accesses into account to answer this question.
A rewrite of the algorithm with memory accesses (assuming multiplications can be done in parallel with loads/stores):
For i from 0 to n: a_i <- 0
For i from 0 to (n − 1) do the following:
    Fetch a_0
    Fetch y_0
    Fetch x_i
    Compute u_i <- (a_0 + x_i*y_0)*m' mod b. Store u_i in a register
    c = 0 (Computing A <- (A + x_i*y + u_i*m)/b)
    For j from 0 to (n − 1):
        Fetch a_j
        Fetch y_j
        Compute (cv) = a_j + x_i*y_j + c, Fetch m_j
        Compute (cv) = (cv) + u_i*m_j, if j>0 Store a_{j-1} <- v
    Store a_n <- c and a_{n-1} <- v
If A >= m then A <- A − m.
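To get a feel for the multiplication count alone, here is a minimal Python sketch of the word-by-word loop. It only counts single-word multiplications (1 cycle each under the question's assumption) and ignores the loads and stores discussed above, so it gives a lower bound rather than a real cycle count. The function name `montgomery_multiply`, the way the final carry is folded into a_n, and the example modulus are my own choices for illustration.

```python
def montgomery_multiply(x, y, m, w):
    """Return (x*y*R^-1 mod m, number of single-word multiplications).

    R = 2^(w*n), where n is the number of w-bit words in m.
    Assumes m is odd and 0 <= x, y < m. Needs Python 3.8+ for pow(m, -1, b).
    """
    b = 1 << w
    mask = b - 1
    n = (m.bit_length() + w - 1) // w
    m_prime = (-pow(m, -1, b)) % b          # m' = -m^-1 mod b
    xs = [(x >> (w * i)) & mask for i in range(n)]
    ys = [(y >> (w * i)) & mask for i in range(n)]
    ms = [(m >> (w * i)) & mask for i in range(n)]
    a = [0] * (n + 1)                       # accumulator A, n+1 words
    mults = 0
    for i in range(n):
        u = ((a[0] + xs[i] * ys[0]) * m_prime) & mask
        mults += 2
        carry = 0
        for j in range(n):
            t = a[j] + xs[i] * ys[j] + u * ms[j] + carry
            mults += 2
            if j > 0:
                a[j - 1] = t & mask         # storing one word lower divides by b
            carry = t >> w
        t = a[n] + carry
        a[n - 1] = t & mask
        a[n] = t >> w
    result = sum(word << (w * k) for k, word in enumerate(a))
    if result >= m:
        result -= m
    return result, mults


if __name__ == "__main__":
    m = (1 << 1024) - 105                   # arbitrary odd 1024-bit modulus
    x, y = 3 ** 600, 5 ** 400               # both smaller than m
    for w in (32, 64):
        _, mults = montgomery_multiply(x, y, m, w)
        print(w, mults)                     # 32 -> 2112, 64 -> 544
```

For a 1024-bit modulus this gives n = 32 and 32*(2 + 64) = 2112 multiplications with 32-bit words, versus n = 16 and 16*(2 + 32) = 544 with 64-bit words, so a ratio of a little under 4 before memory traffic is considered.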
Minimum number of increments so that all elements have a common divisor

I got this problem in an interview recently:
Given a set of numbers X = [X_1, X_2, ...., X_n] where X_i <= 500 for 1 <= i <= n. Increment the numbers (only positive increments) in the set so that each element in the set has a common divisor >=2, and such that the sum of all increments is minimized.
For example, if X = [5, 7, 7, 7, 7], the new set would be X = [7, 7, 7, 7, 7], since you can add 2 to X_1. X = [6, 8, 8, 8, 8] has a common divisor of 2 but is not correct, since we're adding 5 (add 1 to the 5 and 1 to each of the four 7s).
I had a seemingly working solution (as in it passed all the test cases) that loops through the prime numbers < 500 and for each X_i in X finds the closest multiple of the prime number greater than or equal to X_i.
function closest_multiple(x, y)
    return ceil(x/y)*y
min_increment = inf
for each prime_number < 500:
    total_increment = 0
    for each element X_i in X:
        total_increment += closest_multiple(X_i, prime_number) - X_i
    min_increment = min(min_increment, total_increment)
return min_increment
It's technically O(n), but is there a better way to solve this? It was suggested that I use dynamic programming, but I am unsure how that would fit in here.
Constant-bounded entries case
When X_i is bounded by a constant, the best time you can achieve asymptotically is O(n), since it takes at least that long to read all of your inputs. There are some practical improvements:
Filter out duplicates, so you work with a list of (element, frequency) pairs.
Early stopping in your loop.
Faster computation of closest_multiple(x, p) - x. This is slightly hardware/language dependent, but a single integer modulus op is almost certainly faster than an int -> float cast, float division, ceiling() call, and multiplication on the same magnitude numbers.
freq_counts <- Initialize-Counter(X) // List of (element, freq) pairs
min_increment = inf
for each prime_number < 500:
    total_increment = 0
    for each pair X_i, freq in freq_counts:
        // the outer modulus makes the increment 0 when prime_number already divides X_i
        total_increment += ((prime_number - (X_i % prime_number)) % prime_number) * freq
        if total_increment >= min_increment: break
    min_increment = min(min_increment, total_increment)
return min_increment
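As a runnable illustration of the frequency-count version, here is a small Python sketch. The helper names `primes_up_to` and `min_total_increment` are made up for this example, and `bound=500` simply reflects the X_i <= 500 limit from the question. It uses `-x % p` for the distance to the next multiple of p, which is 0 when p already divides x.

```python
from collections import Counter


def primes_up_to(limit):
    """Sieve of Eratosthenes; returns all primes <= limit (limit >= 1)."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            for q in range(p * p, limit + 1, p):
                sieve[q] = False
    return [p for p, is_prime in enumerate(sieve) if is_prime]


def min_total_increment(xs, bound=500):
    """Minimum total increment so that all entries share a divisor >= 2."""
    freq = Counter(xs)                 # (element, frequency) pairs
    best = float("inf")
    for p in primes_up_to(bound):
        total = 0
        for x, count in freq.items():
            total += (-x % p) * count  # distance from x up to a multiple of p
            if total >= best:          # early stopping: p cannot win anymore
                break
        best = min(best, total)
    return best


if __name__ == "__main__":
    print(min_total_increment([5, 7, 7, 7, 7]))  # 2 (raise the 5 to 7)
```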
Large entries case
With uniformly chosen random data, the answer is almost always from using '2' as the divisor, and much larger prime divisors are vanishingly unlikely. However, let's solve for that worst case scenario.
Here, let max(X) = M, so that our input size is O(n (log M)) bits. We want a solution that's sub-exponential in that input size, so finding all primes below M (or even sqrt(M)) is out of the question. We're looking for any prime that gives us a min-total-increment; we'll call such a prime a min-prime. After finding such a prime, we can get the min-total-increment in linear time. We'll use a factoring approach along with two observations.
Observation 1: The answer is always at most n, since the increment needed for the prime 2 to divide X_i is at most 1.
Observation 2: We're trying to find primes that divide X_i or a number slightly larger than X_i for a large fraction of our entries X_i. Let Consecutive-Product-Divisors[i] be the set of all primes dividing either of X_i or X_i+1, which I'll abbreviate CPD[i]. This is exactly the set of all primes which divide X_i * (1 + X_i).
(Obs. 2 Continued) If U is a known upper bound on our answer (here, at most n), and p is a min-prime for X, then p must divide either X_i or X_i + 1 for at least n - U/2 of our CPD entries, since at most U/2 entries can be incremented by 2 or more without the total exceeding U. Use frequency counts on the CPD array to find all such primes.
Once you have a list of candidate primes (all min-primes are guaranteed to be in this list), you can test each one individually using your algorithm. Since a number k can have at most O(log k) distinct prime divisors, this gives O(n log M) possible distinct primes that divide at least half of the numbers [X_1*(1 + X_1), X_2*(1 + X_2), ... X_n*(1 + X_n)] that make up our candidate list. It's possible you can lower this bound with some more careful analysis, but it likely won't strongly affect the asymptotic runtime of the whole algorithm.
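The candidate-prime idea can be sketched in Python as follows. This is only an illustration under simplifying assumptions: factoring is done by plain trial division (so it is not the sub-exponential factoring one would want for truly large M), the upper bound U is taken to be the cost of using 2 (which is at most n), and the names `prime_factors`, `cost` and `min_increment_large` are hypothetical.

```python
from collections import Counter


def prime_factors(k):
    """Set of distinct prime divisors of k >= 1, by trial division."""
    factors = set()
    d = 2
    while d * d <= k:
        if k % d == 0:
            factors.add(d)
            while k % d == 0:
                k //= d
        d += 1
    if k > 1:
        factors.add(k)
    return factors


def cost(xs, p):
    """Total increment needed to make every entry a multiple of p."""
    return sum(-x % p for x in xs)


def min_increment_large(xs):
    n = len(xs)
    upper = cost(xs, 2)                       # Observation 1: U <= n
    counts = Counter()
    for x in xs:
        # CPD[i]: primes dividing X_i or X_i + 1
        for p in prime_factors(x) | prime_factors(x + 1):
            counts[p] += 1
    threshold = n - upper / 2                 # Observation 2
    candidates = [p for p, c in counts.items() if c >= threshold]
    return min([upper] + [cost(xs, p) for p in candidates])


if __name__ == "__main__":
    print(min_increment_large([5, 7, 7, 7, 7]))   # 2
```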
A more optimal complexity for large entries
The complexity of this solution is hard to write in short form, because the bottleneck is factoring n numbers of maximum size M, plus O(n^2 log M) arithmetic (i.e. addition, subtraction, multiply, modulo) operations on numbers of maximum size M. That doesn't mean the runtime is unknown: If you select any integer factoring algorithm and large-integer-arithmetic algorithms, you can derive the runtime exactly. Unfortunately, because of factoring, the best known runtime of the above algorithm is super-polynomial (but sub-exponential).
How can we do better? I did find a more complicated solution, based on Greatest Common Divisors (GCDs) and a dynamic-programming-like approach, that runs in polynomial time since it doesn't rely on factoring (although it is likely much slower on non-astronomical-size inputs).
The solution relies on the fact that at least one of the following two statements is true:
The number 2 is a min-prime for X, or
For at least one value of i, 1 <= i <= n there is an optimal solution where X_i remains unincremented, i.e. where one of the divisors of X_i produces a min-total-increment.
GCD-Based polynomial time algorithm
We can test 2 and all small primes quickly for their minimum costs. In fact, we'll test all primes p, p <= n, which we can do in polynomial time, and factor out these primes from X_i and its first n increments. This leads us to the following algorithm:
// Given: input list X = [X_1, X_2, ... X_n].
// Subroutine compute-min-cost(list A, int p) is
// just the inner loop of the above algorithm.
min_increment = inf;
for each prime p <= n:
    min_increment = min(min_increment, compute-min-cost(X, p));
// Initialize empty, 2-D, n x (n+1) list Y[n][n+1], of offset X-values
for all 1 <= i <= n:
    for all 0 <= j <= n:
        Y[i][j] <- X[i] + j;
for each prime p <= n: // Factor out all small prime divisors from Y
    for each Y[i][j]:
        while Y[i][j] % p == 0:
            Y[i][j] /= p;
for all 1 <= i <= n: // Loop 1
    // Y[i][0] is the test 'unincremented' entry
    // Initialize empty hash-tables 'costs' and 'new_costs'
    // Keys of hash-tables are GCDs,
    // Values are a running sum of increment-costs for that GCD
    costs[Y[i][0]] = 0;
    for all 1 <= k <= n: // Loop 2
        if i == k: continue;
        clear all entries from new_costs // or reinitialize to empty
        for all 0 <= j < n: // Loop 3
            for each Key in costs: // Loop 4
                g = GCD(Key, Y[k][j]);
                if g == 1: continue;
                if g is not a key in new_costs:
                    new_costs[g] = j + costs[Key];
                new_costs[g] = min(new_costs[g], j + costs[Key]);
        swap(costs, new_costs);
    if costs is not empty:
        min_increment = min(min_increment, smallest Value in costs);
return min_increment;
The correctness of this solution follows from the previous two observations, and the (unproven, but straightforward) fact that there is a list [X_1 + r_1, X_2 + r_2, ... , X_n + r_n] (with 0 <= r_i <= n for all i) whose GCD is a divisor with minimum increment cost.
The runtime of this solution is trickier: GCDs can easily be computed in O(log^2(M)) time, and the list of all primes up to n can be computed in low poly(n) time. From the loop structure of the algorithm, to prove a polynomial bound on the whole algorithm, it suffices to show that the maximum size of our 'costs' hash-table is polynomial in log M. This is where the 'factoring-out' of small primes comes into play. After iteration k of loop 2, the entries in costs are (Key, Value) pairs, where each Key is the GCD of k + 1 elements: our initial Y[i][0], and [Y[1][j_1], Y[2][j_2], ... Y[k][j_k]] for some 0 <= j_l < n. The Value for this Key is the minimum increment sum needed for this divisor (i.e. sum of the j_l) over all possible choices of j_l.
There are at most O(log M) unique prime divisors of Y[i][0]. Each such prime divides at most one key in our 'costs' table at any time: Since we've factored out all prime divisors below n, any remaining prime divisor p can divide at most one of the n consecutive numbers in any Y[j] = [X_j, 1 + X_j, ... n-1 + X_j]. This means the overall algorithm is polynomial, and has a runtime below O(n^4 log^3(M)).
questions about AES irreducible polynomials

For the Galois field GF(2^8), polynomials have the form a7x^7+a6x^6+...+a0.
For AES, the irreducible polynomial is x^8+x^4+x^3+x+1.
Apparently, the max power in GF(2^8) is x^7, so why is the max power of the irreducible polynomial x^8?
How does the max power of the irreducible polynomial affect the inverse result in GF?
Can I set the max power of the irreducible polynomial to x^9?
To understand why the modulus of GF(2⁸) must be order 8 (that is, have 8 as its largest exponent; this is usually called the degree of the polynomial), you must know how to perform polynomial division with coefficients in GF(2), which means you must know how to perform polynomial division in general. I will assume you know how to do those things. If you don't know how, there are many tutorials on the web from which you can learn.
Remember that if r = a mod m, it means that there is a q such that a = q m + r. To make a working GF(2⁸) arithmetic, we need to guarantee that r is an element of GF(2⁸) for any a and q (even though a and q do not need to be elements of GF(2⁸)). Furthermore, we need to ensure that r can be any element of GF(2⁸), if we pick the right a from GF(2⁸).
So we must pick a modulus (the m) that makes these guarantees. We do this by picking an m of exactly order 8.
If the numerator of the division (the a in a = q m + r) is order 8 or higher, we can find something to put in the quotient (the q) that, when multiplied by x⁸, cancels out that higher order. But there's nothing we can put in the quotient that can be multiplied by x⁸ to give a term with order less than 8, so the remainder (the r) can be any order up to and including 7.
Let's try a few examples of polynomial division with a modulus (or divisor) of x⁸+x⁴+x³+x+1 to see what I mean. First let's compute x⁸+1 mod x⁸+x⁴+x³+x+1:
1 <- quotient
x⁸+x⁴+x³+x+1 │ x⁸ +1
x⁴+x³+x <- remainder
So x⁸+1 mod x⁸+x⁴+x³+x+1 = x⁴+x³+x.
Next let's compute x¹²+x⁹+x⁷+x⁵+x² mod x⁸+x⁴+x³+x+1.
x⁴ +x +1 <- quotient
x⁸+x⁴+x³+x+1 │ x¹²+x⁹ +x⁷+x⁵ +x²
-(x¹² +x⁸+x⁷+x⁵+x⁴ )
x⁹+x⁸ +x⁴ +x²
-(x⁹ +x⁵+x⁴ +x²+x)
x⁸ +x⁵ +x
-(x⁸ +x⁴+x³ +x+1)
x⁵+x⁴+x³ +1 <- remainder
So x¹²+x⁹+x⁷+x⁵+x² mod x⁸+x⁴+x³+x+1 = x⁵+x⁴+x³+1, which has order < 8.
Finally, let's try a substantially higher order: how about x¹⁰⁰+x⁹⁶+x⁹⁵+x⁹³+x⁸⁸+x⁸⁷+x⁸⁵+x⁸⁴+x mod x⁸+x⁴+x³+x+1?
x⁹² +x⁸⁴ <- quotient
x⁸+x⁴+x³+x+1 │ x¹⁰⁰+x⁹⁶+x⁹⁵+x⁹³ +x⁸⁸+x⁸⁷+x⁸⁵+x⁸⁴+x
-(x¹⁰⁰+x⁹⁶+x⁹⁵+x⁹³+x⁹² )
x⁹²+x⁸⁸+x⁸⁷+x⁸⁵+x⁸⁴+x
-(x⁹²+x⁸⁸+x⁸⁷+x⁸⁵+x⁸⁴ )
x <- remainder
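If you want to check divisions like these mechanically, here is a short Python sketch that stores a GF(2) polynomial as an integer bit mask (bit k is the coefficient of x^k), so that subtraction becomes XOR. The helper names `poly` and `gf2_poly_mod` are made up for this illustration.

```python
def poly(*exponents):
    """Build a GF(2) polynomial from its (distinct) exponents, as a bit mask."""
    return sum(1 << e for e in exponents)


def gf2_poly_mod(a, m):
    """Remainder of a divided by m, coefficients in GF(2); m must be nonzero."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        shift = (a.bit_length() - 1) - deg_m
        a ^= m << shift          # cancel the leading term of a
    return a


AES_MODULUS = poly(8, 4, 3, 1, 0)    # x^8 + x^4 + x^3 + x + 1, i.e. 0x11B

if __name__ == "__main__":
    # The three worked examples from above:
    print(gf2_poly_mod(poly(8, 0), AES_MODULUS) == poly(4, 3, 1))
    print(gf2_poly_mod(poly(12, 9, 7, 5, 2), AES_MODULUS) == poly(5, 4, 3, 0))
    print(gf2_poly_mod(poly(100, 96, 95, 93, 88, 87, 85, 84, 1), AES_MODULUS)
          == poly(1))
```

Because the loop only stops once the leading exponent of `a` is below 8, every remainder fits in 8 bits, which is exactly the guarantee the order-8 modulus is chosen to provide.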
Is O(n) greater than O(2^log n)

I read in the complexity hierarchy diagram of a data structures book that n is greater than 2^(log n), but I cannot understand how or why. Trying simple examples with n a power of 2, I get values equal to n.
The base is not mentioned in the book, but I am assuming base 2 (as the context is data structure complexity).
a) Is O(n) > O(pow(2,logn))?
b) Is O(pow(2,log n)) better than O(n)?
Notice that 2^(log_b n) = 2^(log_2 n / log_2 b) = n^(1 / log_2 b). If log_2 b ≥ 1 (that is, b ≥ 2), then this expression is at most n and is therefore O(n). If 0 < log_2 b < 1 (that is, 1 < b < 2), then this expression is of the form n^(1 + ε) for some ε > 0 and is therefore not O(n). So it boils down to what the log base is: if b ≥ 2, the expression is O(n); if 1 < b < 2, it is ω(n).
Hope this helps!
There is a constant factor in there somewhere, but it's not in the right place to make O(n) equal to O(pow(2,log n)), assuming log means the natural logarithm.
n = 2 ** log2(n) // by definition of log2, the base-2 logarithm
= 2 ** (log(n)/log(2)) // standard conversion of logs from one base to another
n ** log(2) = 2 ** log(n) // raise both sides of that to the log(2) power
Since log(2) < 1, O(n ** log(2)) < O(n ** 1). Sure, there is only a constant ratio between the exponents, but the fact remains that they are different exponents. O(n ** 3) is greater than O(n ** 2) for the same reason: even though 3 is bigger than 2 by only a constant factor, it is bigger and the Orders are different.
We therefore have
O(n) = O(n ** 1) > O(n ** log(2)) = O(2 ** log(n))
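A quick numeric check of both readings, as a Python sketch (the function name `two_pow_log` is just for this illustration): with base 2 the expression equals n, while with the natural logarithm it equals n ** log(2), so n pulls ahead without bound.

```python
import math


def two_pow_log(n, base=math.e):
    """Compute 2 ** log_base(n) in floating point."""
    return 2.0 ** math.log(n, base)


if __name__ == "__main__":
    for n in (2 ** 10, 2 ** 20, 2 ** 30):
        print(n,
              two_pow_log(n, 2),        # about n when the base is 2
              two_pow_log(n),           # about n ** 0.693 for the natural log
              n / two_pow_log(n))       # this ratio keeps growing
```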
How to compute this modulus when there is an integer overflow

(10^{17}-1)*(10^{17}-1) mod 10^{18}
I am solving a programming problem and I hold my integers in 64 bit long long integers. Above is a particular case I am unable to solve. (ab)mod m = (a mod m)(b mod m) mod m, doesn't hold here as (a mod m)(b mod m) would still overflow a 64 bit integer. How do I solve this? I took 17th power only as an example. The problem holds even for all the integers in the range (10^{10}, 10^{18}-1).
Edit: I am using C++ for solving this problem. This problem can be solved without using a library for handling big integers.
You can use the identity you quoted; you just need a similar identity for addition: (a+b) mod m = ((a mod m) + (b mod m)) mod m.
The goal is to compute x*y mod m without any intermediate value exceeding the overflow limit (in this case 2^64), where x is less than m to begin with (if it isn't, reduce it mod m first), y may be larger than m, and x*y may overflow. We can do this if m is less than half of the overflow limit.
A solution is simple: Just perform basic multiplication bit-by-bit for x*y and do every step modulo m.
Start with x and y less than m (if either isn't, reduce it first). Write y in the form a_0 * 2^0 + a_1 * 2^1 + a_2 * 2^2 + ... , where a_n is either 0 or 1 (indicating the term is present or not). (Aka, write y in binary form.) Now we have:
x * (a_0 * 2^0 + a_1 * 2^1 + a_2 * 2^2 + ...) mod m
Distribute x over each of the terms of y:
(x * a_0 * 2^0) + (x * a_1 * 2^1) + (x * a_2 * 2^2) + ... mod m
Then use the original multiplication identity: For each term above, multiply x by 2 mod m until you reach the desired power of 2 for that term. (Since x < m and 2 * m < 2^64, then 2 * x < 2^64, so we can multiply by 2 without overflowing.) When you are done, add the result for each term mod m (you can keep a running sum as you go).
None of those operations will exceed 2^64 and thus will not overflow. This will work for any value of m less than 2^64 / 2 = 2^63 and any integers x and y less than m.
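Here is the bit-by-bit scheme as a Python sketch. Python integers never overflow, so the `assert` lines are there only to demonstrate that every intermediate value stays below 2^64; the same loop should carry over to C++ with unsigned 64-bit integers. The names `LIMIT` and `mulmod` are my own.

```python
LIMIT = 1 << 64   # overflow limit of an unsigned 64-bit integer


def mulmod(x, y, m):
    """Compute (x * y) mod m using only doublings and additions mod m."""
    assert 0 < m < LIMIT // 2
    x %= m
    y %= m
    result = 0
    while y:
        if y & 1:                      # this power of 2 is present in y
            assert result + x < LIMIT  # result < m and x < m, so the sum is < 2m
            result = (result + x) % m
        assert 2 * x < LIMIT           # x < m, so 2x < 2m
        x = (2 * x) % m                # x * 2^k mod m for the next bit
        y >>= 1
    return result


if __name__ == "__main__":
    a = 10 ** 17 - 1
    print(mulmod(a, a, 10 ** 18))      # 800000000000000001
```

The loop runs once per bit of y, so at most 64 iterations for 64-bit operands.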
dividing by 2 and ceiling until remains 1

I have the following algorithm, defined only for natural numbers:
rounds(n)={1, if n=1; 1+rounds(ceil(n/2)), else}
Written in a programming language, this is:
int rounds(int n){
    if(n==1)
        return 1;
    return 1+rounds((n+1)/2); // (n+1)/2 is ceil(n/2) in integer arithmetic
}
I think this has time complexity O(log n).
Is there a better complexity?
Start by listing the results from 1 upward,
rounds(1) = 1
rounds(2) = 1 + rounds(2/2) = 1 + 1 = 2
Next, when ceil(n/2) is 2, rounds(n) will be 3. That's for n = 3 and n = 4.
rounds(3) = rounds(4) = 3
then, when ceil(n/2) is 3 or 4, the result will be 4. 3 <= ceil(n/2) <= 4 happens if and only if 2*3-1 <= n <= 2*4, so
rounds(5) = ... = rounds(8) = 4
Continuing, you can see that
rounds(n) = k+2 if 2^k < n <= 2^(k+1)
by induction.
You can rewrite that to
rounds(n) = 2 + floor(log_2(n-1)) if n > 1 [and rounds(1) = 1]
and mathematically, you can also treat n = 1 uniformly by rewriting it to
rounds(n) = 1 + floor(log_2(2*n-1))
The last formula has the potential for overflow if you're using fixed-width types, though.
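As a sanity check of these closed forms, here is a small Python sketch comparing them with the recursive definition. It relies on the fact that floor(log_2(k)) equals k.bit_length() - 1 for k >= 1; the function names are chosen just for this example.

```python
def rounds_recursive(n):
    """Direct transcription of the definition, for n >= 1."""
    return 1 if n == 1 else 1 + rounds_recursive((n + 1) // 2)


def rounds_closed(n):
    """1 + floor(log_2(2*n - 1)), written with bit_length."""
    return (2 * n - 1).bit_length()


def rounds_closed_no_doubling(n):
    """2 + floor(log_2(n - 1)) for n > 1, and 1 for n = 1, without computing 2*n."""
    return 1 + (n - 1).bit_length()


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 8, 9):
        print(n, rounds_recursive(n), rounds_closed(n),
              rounds_closed_no_doubling(n))
```

The second variant avoids forming 2*n - 1, which is the step that could overflow with fixed-width types.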
So the question is
how fast can you compare a number to 1,
how fast can you subtract 1 from a number,
how fast can you compute the (floor of the) base-2 logarithm of a positive integer?
For a fixed-width type, thus a bounded range, all these are of course O(1) operations, but then you're probably still interested in making it as efficient as possible, even though computational complexity doesn't enter the game.
For native machine types - which int and long usually are - comparing and subtracting integers are very fast machine instructions, so the only possibly problematic one is the base-2 logarithm.
Many processors have a machine instruction to count the leading 0-bits in a value of the machine types, and if that is made accessible by the compiler, you will get a very fast implementation of the base-2 logarithm. If not, you can get a faster version than the recursion using one of the classic bit-hacks.
For example, sufficiently recent versions of gcc and clang have a __builtin_clz (resp. __builtin_clzl for 64-bit types) that maps to the bsr* instruction if that is present on the processor, and presumably a good implementation using some bit-twiddling if it isn't provided by the processor.
The version
#include <limits.h> /* for CHAR_BIT */
unsigned rounds(unsigned long n) {
    if (n <= 1) return n;
    /* bit length of n-1, plus 1 */
    return sizeof n * CHAR_BIT + 1 - __builtin_clzl(n-1);
}
using the bsrq instruction takes (on my box) 0.165 seconds to compute rounds for 1 to 100,000,000, the bit-hack
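unsigned rounds(unsigned n) {
    if (n <= 1) return n;
    --n;                       /* work on n-1, as in the formula above */
    /* smear the highest set bit into all lower positions */
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    /* population count = bit length of n-1 (assumes 32-bit unsigned) */
    n -= (n >> 1) & 0x55555555;
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
    n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F);
    return ((n * 0x01010101) >> 24)+1;
}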
takes 0.626 seconds, and the naive loop
unsigned rounds(unsigned n) {
    unsigned r = 1;
    while(n > 1) {
        n = (n+1)/2;
        ++r;
    }
    return r;
}
takes 1.865 seconds.
If you don't use a fixed-width type, but arbitrary precision integers, things change a bit. The naive loop (or recursion) still uses Θ(log n) steps, but the steps take Θ(log n) time (or worse) on average, so overall you have a Θ(log² n) algorithm (or worse). Then using the formula above can not only offer an implementation with lower constant factors, but one with lower algorithmic complexity.
Comparing to 1 can be done in constant time for suitable representations; O(log n) is the worst case for reasonable representations.
Subtracting 1 from a positive integer takes O(log n) for reasonable representations.
Computing the (floor of the) base-2 logarithm can be done in constant time for some representations, and in O(log n) for other reasonable representations [if they use a power-of-2 base, which all arbitrary precision libraries I'm semi-familiar with do; if they used a power-of-10 base, that would be different].
If you think of the algorithm as iterative and the numbers as binary, then this function shifts out the lowest bit and increases the number by 1 if it was a 1 that was shifted out. Thus, except for the increment, it counts the number of bits in the number (that is, the position of the highest 1). The increment will eventually increase the result by one, except when the number is of the form 1000.... Thus, you get the number of bits plus one, or the number of bits if the number is a power of two. Depending on your machine model, this might be faster to calculate than O(log n).
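That bit-level description can be written down directly. A minimal Python sketch, assuming n >= 1 (the names `rounds_loop` and `rounds_by_bits` are invented for this example):

```python
def rounds_loop(n):
    """Iterative version of the original function, for n >= 1."""
    r = 1
    while n > 1:
        n = (n + 1) // 2
        r += 1
    return r


def rounds_by_bits(n):
    """Number of bits of n, plus one unless n is a power of two."""
    is_power_of_two = (n & (n - 1)) == 0
    return n.bit_length() + (0 if is_power_of_two else 1)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 8, 9):
        print(n, rounds_loop(n), rounds_by_bits(n))
```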
