How to speed up a cyclotomic polynomial inversion in NTL? - inverse

I'm computing polynomial modular inverses in a ring built over the nth-cyclotomic polynomial, with n power of 2. These inverses are computed for polynomials with the coefficient of x^0 equal to (x+1) and others sampled from some random distribution but always equal 0 or x, for some integer x.
I'm able to use NTL's InvMod to compute the inverse of several polynomials, however for big instances of the described ones it took just forever to return. I did compiled NTL 9.10.0 with GMP 6.1.1. Is there any optimization I can use on NTL to solve this operations faster?
This is a minimum code:
#include <NTL/ZZ_p.h>
#include <NTL/ZZ_pEX.h>
#include <ctime>
NTL_CLIENT
int main(){
int degree = 4095;
int nphi = 8192;
int nq = 127;
// Sets q as the mersenne prime 2^nq - 1
ZZ q = NTL::power2_ZZ(nq) - 1;
ZZ_p::init(q);
// Sets phi as the n-th cyclotomic polynomial
ZZ_pX ring;
NTL::SetCoeff(ring,0,1);
NTL::SetCoeff(ring,nphi/2,1);
ZZ_pE::init(ring);
ZZ_pEX ntl_phi;
NTL::SetCoeff(ntl_phi,0,conv<ZZ_p>(1));
NTL::SetCoeff(ntl_phi,nphi/2,conv<ZZ_p>(1));
// Initializes f
std::srand(std::time(0)); // use current time as seed for random generator
ZZ_pEX f;
NTL::SetCoeff(f,0,conv<ZZ_p>(1025));
for(int i = 1; i <= degree; i++){
int random_variable = std::rand();
NTL::SetCoeff(f,i,conv<ZZ_p>(1024*(random_variable%2 == 1)));
}
// Computes the inverse of f
ZZ_pEX fInv = NTL::InvMod(f, ntl_phi);
return 0;
}

Related

Carmichael Number using Pari

Trying to write Pari code to solve the above question.
I've got no experience in using Pari, but here's some useful advice:
n is Carmichael if and only if it is composite and, for all a with 1 < a < n which are relatively prime to n, the congruence a^(n-1) = 1 (mod n) holds. To use this definition directly, you need:
1) An efficient way to test if a and n are relatively prime
2) An efficient way to compute a^(n-1) (mod n)
For the first -- use the Euclidean algorithm for greatest common divisors. It is most efficiently computed in a loop, but can also be defined via the simple recurrence gcd(a,b) = gcd(b,a%b) with basis gcd(a,0) = a. In C this is just:
unsigned int gcd(unsigned int a, unsigned int b){
return b == 0? a : gcd(b, a%b);
}
For the second point -- almost the worst possible thing you can do when computing a^k (mod n) is to first compute a^k via repeated multiplication and to then mod the result by n. Instead -- use exponentiation by squaring, taking the remainder (mod n) at intermediate stages. It is a divide-and-conquer algorithm based on the observation that e.g. a^10 = (a^5)^2 and a^11 = (a^5)^2 * a. A simple C implementation is:
unsigned int modexp(unsigned int a, unsigned int p, unsigned int n){
unsigned long long b;
switch(p){
case 0:
return 1;
case 1:
return a%n;
default:
b = modexp(a,p/2,n);
b = (b*b) % n;
if(p%2 == 1) b = (b*a) % n;
return b;
}
}
Note the use of unsigned long long to guard against overflow in the calculation of b*b.
To test if n is Carmichael, you might as well first test if n is even and return 0 in that case. Otherwise, step through numbers, a, in the range 2 to n-1. First check if gcd(a,n) == 1 Note that if n is composite then you must have at least one a before you reach the square root of n with gcd(a,n) > 1). Keep a Boolean flag which keeps track of whether or not such an a has been encountered and if you exceed the square root without finding such an a, return 0. For those a with gcd(a,n) == 1, compute the modular exponentiation a^(n-1) (mod n). If this is ever different from 1, return 0. If your loop finishes checking all a below n without returning 0, then the number is Carmichael, so return 1. An implementation is:
int is_carmichael(unsigned int n){
int a,s;
int factor_found = 0;
if (n%2 == 0) return 0;
//else:
s = sqrt(n);
a = 2;
while(a < n){
if(a > s && !factor_found){
return 0;
}
if(gcd(a,n) > 1){
factor_found = 1;
}
else{
if(modexp(a,n-1,n) != 1){
return 0;
}
}
a++;
}
return 1; //anything that survives to here is a carmichael
}
A simple driver program:
int main(void){
unsigned int n;
for(n = 2; n < 100000; n ++){
if(is_carmichael(n)) printf("%u\n",n);
}
return 0;
}
output:
C:\Programs>gcc carmichael.c
C:\Programs>a
561
1105
1729
2465
2821
6601
8911
10585
15841
29341
41041
46657
52633
62745
63973
75361
This only takes about 2 seconds to run and matches the initial part of this list.
This is probably a somewhat practical method for checking if numbers up to a million or so are Carmichael numbers. For larger numbers, you should probably get yourself a good factoring algorithm and use Korseldt's criterion as described in the Wikipedia entry on Carmichael numbers.

Finding (a ^ x) % m from a % m. This is about utilizing a % m to calculate (a ^ x) % m. % is the modulus operator [duplicate]

I want to calculate ab mod n for use in RSA decryption. My code (below) returns incorrect answers. What is wrong with it?
unsigned long int decrypt2(int a,int b,int n)
{
unsigned long int res = 1;
for (int i = 0; i < (b / 2); i++)
{
res *= ((a * a) % n);
res %= n;
}
if (b % n == 1)
res *=a;
res %=n;
return res;
}
You can try this C++ code. I've used it with 32 and 64-bit integers. I'm sure I got this from SO.
template <typename T>
T modpow(T base, T exp, T modulus) {
base %= modulus;
T result = 1;
while (exp > 0) {
if (exp & 1) result = (result * base) % modulus;
base = (base * base) % modulus;
exp >>= 1;
}
return result;
}
You can find this algorithm and related discussion in the literature on p. 244 of
Schneier, Bruce (1996). Applied Cryptography: Protocols, Algorithms, and Source Code in C, Second Edition (2nd ed.). Wiley. ISBN 978-0-471-11709-4.
Note that the multiplications result * base and base * base are subject to overflow in this simplified version. If the modulus is more than half the width of T (i.e. more than the square root of the maximum T value), then one should use a suitable modular multiplication algorithm instead - see the answers to Ways to do modulo multiplication with primitive types.
In order to calculate pow(a,b) % n to be used for RSA decryption, the best algorithm I came across is Primality Testing 1) which is as follows:
int modulo(int a, int b, int n){
long long x=1, y=a;
while (b > 0) {
if (b%2 == 1) {
x = (x*y) % n; // multiplying with base
}
y = (y*y) % n; // squaring the base
b /= 2;
}
return x % n;
}
See below reference for more details.
1) Primality Testing : Non-deterministic Algorithms – topcoder
Usually it's something like this:
while (b)
{
if (b % 2) { res = (res * a) % n; }
a = (a * a) % n;
b /= 2;
}
return res;
The only actual logic error that I see is this line:
if (b % n == 1)
which should be this:
if (b % 2 == 1)
But your overall design is problematic: your function performs O(b) multiplications and modulus operations, but your use of b / 2 and a * a implies that you were aiming to perform O(log b) operations (which is usually how modular exponentiation is done).
Doing the raw power operation is very costly, hence you can apply the following logic to simplify the decryption.
From here,
Now say we want to encrypt the message m = 7, c = m^e mod n = 7^3 mod 33
= 343 mod 33 = 13. Hence the ciphertext c = 13.
To check decryption we compute m' = c^d mod n = 13^7 mod 33 = 7. Note
that we don't have to calculate the full value of 13 to the power 7
here. We can make use of the fact that a = bc mod n = (b mod n).(c mod
n) mod n so we can break down a potentially large number into its
components and combine the results of easier, smaller calculations to
calculate the final value.
One way of calculating m' is as follows:- Note that any number can be
expressed as a sum of powers of 2. So first compute values of 13^2,
13^4, 13^8, ... by repeatedly squaring successive values modulo 33. 13^2
= 169 ≡ 4, 13^4 = 4.4 = 16, 13^8 = 16.16 = 256 ≡ 25. Then, since 7 = 4 + 2 + 1, we have m' = 13^7 = 13^(4+2+1) = 13^4.13^2.13^1 ≡ 16 x 4 x 13 = 832
≡ 7 mod 33
Are you trying to calculate (a^b)%n, or a^(b%n) ?
If you want the first one, then your code only works when b is an even number, because of that b/2. The "if b%n==1" is incorrect because you don't care about b%n here, but rather about b%2.
If you want the second one, then the loop is wrong because you're looping b/2 times instead of (b%n)/2 times.
Either way, your function is unnecessarily complex. Why do you loop until b/2 and try to multiply in 2 a's each time? Why not just loop until b and mulitply in one a each time. That would eliminate a lot of unnecessary complexity and thus eliminate potential errors. Are you thinking that you'll make the program faster by cutting the number of times through the loop in half? Frankly, that's a bad programming practice: micro-optimization. It doesn't really help much: You still multiply by a the same number of times, all you do is cut down on the number of times testing the loop. If b is typically small (like one or two digits), it's not worth the trouble. If b is large -- if it can be in the millions -- then this is insufficient, you need a much more radical optimization.
Also, why do the %n each time through the loop? Why not just do it once at the end?
Calculating pow(a,b) mod n
A key problem with OP's code is a * a. This is int overflow (undefined behavior) when a is large enough. The type of res is irrelevant in the multiplication of a * a.
The solution is to ensure either:
the multiplication is done with 2x wide math or
with modulus n, n*n <= type_MAX + 1
There is no reason to return a wider type than the type of the modulus as the result is always represent by that type.
// unsigned long int decrypt2(int a,int b,int n)
int decrypt2(int a,int b,int n)
Using unsigned math is certainly more suitable for OP's RSA goals.
Also see Modular exponentiation without range restriction
// (a^b)%n
// n != 0
// Test if unsigned long long at least 2x values bits as unsigned
#if ULLONG_MAX/UINT_MAX - 1 > UINT_MAX
unsigned decrypt2(unsigned a, unsigned b, unsigned n) {
unsigned long long result = 1u % n; // Insure result < n, even when n==1
while (b > 0) {
if (b & 1) result = (result * a) % n;
a = (1ULL * a * a) %n;
b >>= 1;
}
return (unsigned) result;
}
#else
unsigned decrypt2(unsigned a, unsigned b, unsigned n) {
// Detect if UINT_MAX + 1 < n*n
if (UINT_MAX/n < n-1) {
return TBD_code_with_wider_math(a,b,n);
}
a %= n;
unsigned result = 1u % n;
while (b > 0) {
if (b & 1) result = (result * a) % n;
a = (a * a) % n;
b >>= 1;
}
return result;
}
#endif
int's are generally not enough for RSA (unless you are dealing with small simplified examples)
you need a data type that can store integers up to 2256 (for 256-bit RSA keys) or 2512 for 512-bit keys, etc
Here is another way. Remember that when we find modulo multiplicative inverse of a under mod m.
Then
a and m must be coprime with each other.
We can use gcd extended for calculating modulo multiplicative inverse.
For computing ab mod m when a and b can have more than 105 digits then its tricky to compute the result.
Below code will do the computing part :
#include <iostream>
#include <string>
using namespace std;
/*
* May this code live long.
*/
long pow(string,string,long long);
long pow(long long ,long long ,long long);
int main() {
string _num,_pow;
long long _mod;
cin>>_num>>_pow>>_mod;
//cout<<_num<<" "<<_pow<<" "<<_mod<<endl;
cout<<pow(_num,_pow,_mod)<<endl;
return 0;
}
long pow(string n,string p,long long mod){
long long num=0,_pow=0;
for(char c: n){
num=(num*10+c-48)%mod;
}
for(char c: p){
_pow=(_pow*10+c-48)%(mod-1);
}
return pow(num,_pow,mod);
}
long pow(long long a,long long p,long long mod){
long res=1;
if(a==0)return 0;
while(p>0){
if((p&1)==0){
p/=2;
a=(a*a)%mod;
}
else{
p--;
res=(res*a)%mod;
}
}
return res;
}
This code works because ab mod m can be written as (a mod m)b mod m-1 mod m.
Hope it helped { :)
use fast exponentiation maybe..... gives same o(log n) as that template above
int power(int base, int exp,int mod)
{
if(exp == 0)
return 1;
int p=power(base, exp/2,mod);
p=(p*p)% mod;
return (exp%2 == 0)?p:(base * p)%mod;
}
This(encryption) is more of an algorithm design problem than a programming one. The important missing part is familiarity with modern algebra. I suggest that you look for a huge optimizatin in group theory and number theory.
If n is a prime number, pow(a,n-1)%n==1 (assuming infinite digit integers).So, basically you need to calculate pow(a,b%(n-1))%n; According to group theory, you can find e such that every other number is equivalent to a power of e modulo n. Therefore the range [1..n-1] can be represented as a permutation on powers of e. Given the algorithm to find e for n and logarithm of a base e, calculations can be significantly simplified. Cryptography needs a tone of math background; I'd rather be off that ground without enough background.
For my code a^k mod n in php:
function pmod(a, k, n)
{
if (n==1) return 0;
power = 1;
for(i=1; i<=k; $i++)
{
power = (power*a) % n;
}
return power;
}
#include <cmath>
...
static_cast<int>(std::pow(a,b))%n
but my best bet is you are overflowing int (IE: the number is two large for the int) on the power I had the same problem creating the exact same function.
I'm using this function:
int CalculateMod(int base, int exp ,int mod){
int result;
result = (int) pow(base,exp);
result = result % mod;
return result;
}
I parse the variable result because pow give you back a double, and for using mod you need two variables of type int, anyway, in a RSA decryption, you should just use integer numbers.

How to get the first x leading binary digits of 5**x without big integer multiplication

I want to efficiently and elegantly compute with perfect precision the first x leading binary digits of 5**x?
For example 5**20 is 10101101011110001110101111000101101011000110001. The first 8 leading binary digits is 10101101.
In my use case, x is only up to 1-60. I don't want to create a table. A solution using 64-bit integers would be fine. I just don't want to use big integers.
first x leading binary digits of 5**x without big integer multiplication
efficiently and elegantly compute with perfect precision the first x leading binary digits of 5x?
"compute with perfect precision" leaves out pow(). Too many implementations will return an imperfect result and FP math might not use 64 bit precision, even with long double.
Form an integer with a 64-bit whole number part .ms and a 64-bit fraction part .ls. Then loop 60 times, multiply by 5 and diving by 2 as needed, to keep the leading bits from growing too big.
Note there is some precision lost in the fraction, with N > 42, yet that is not significant enough to affect the whole number part OP is seeking.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
typedef struct {
uint64_t ms, ls;
} uint128;
// Simplifications possible here, leave for OP
uint128 times5(uint128 x) {
uint128 y = x;
for (int i=1; i<5; i++) {
// y += x
y.ms += x.ms;
y.ls += x.ls;
if (y.ls < x.ls) y.ms++;
}
return y;
}
uint128 div2(uint128 x) {
x.ls = (x.ls >> 1) | (x.ms << 63);
x.ms >>= 1;
return x;
}
int main(void) {
uint128 y = {.ms = 1};
uint64_t pow2 = 2;
for (unsigned x = 1; x <= 60; x++) {
y = times5(y);
while (y.ms >= pow2) {
y = div2(y);
}
printf("%2u %16" PRIX64 ".%016" PRIX64 "\n", x, y.ms, y.ls);
pow2 <<= 1;
}
}
Output
whole part.fraction
1 1.4000000000000000
2 3.2000000000000000
3 7.D000000000000000
4 9.C400000000000000
...
57 14643E5AE44D12B.8F5FEE5AA432560D
58 32FA9BE33AC0AEC.E66FD3E29A7DD720
59 7F7285B812E1B50.401791B6823A99D0
60 9F4F2726179A224.501D762422C94044
^-------------^ This is the part OP is seeking.
The key to solving this task is: divide and conquer. Form an algorithm, (which is simply *5 and /2 as needed), and code a type and functions to do each small step.
Is a loop of 60 efficient? Perhaps not. Another approach would use Exponentiation by squaring. Certainly would be worth it for large N, yet for N == 60, a loop was simple enough for a quick turn.
5n = 2(-n) • 10n
Using this identity, we can easily compute the leading N base-2 digits of (the nearest integer to) any given power of 5.
This code example is in C, but it's the same idea in any other language.
Example output: https://wandbox.org/permlink/Fs205DDzQR0gaLSo
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#define STATIC_ASSERT(CONDITION) ((void)sizeof(int[(CONDITION) ? 1 : -1]))
uint64_t pow5_leading_digits(double power, uint8_t ndigits)
{
STATIC_ASSERT(DBL_MANT_DIG <= 64);
double pow5 = exp2(-power) * pow(10, power);
const double binary_digits = ceil(log2(pow5));
assert(ndigits <= DBL_MANT_DIG);
if (!ndigits || binary_digits < 0)
return 0;
// If pow5 can fit in the number of digits requested, return it
if (binary_digits <= ndigits)
return pow5;
// If pow5 is too big to return, divide by 2 until it fits
if (binary_digits > DBL_MANT_DIG)
pow5 /= exp2(binary_digits - DBL_MANT_DIG + 1);
return (uint64_t)pow5 >> (DBL_MANT_DIG - ndigits);
}
Edit: Now limits the returned value to those exactly representable with double's.

How can I get Z*Z^T using GSL, where Z is column vector?

I am looking through GSL functions to calculate Z*Z^T, where Z is n*1 column vector, but I could not find any fit function, every help is much appreciated.
GSL supports BLAS (basic linear algebra subprograms),
see [http://www.gnu.org/software/gsl/manual/html_node/GSL-BLAS-Interface.html][1].
The functions are classified by the complexity of the operation:
level 1: vector-vector operations
level 2: matrix-vector operations
level 3: matrix-matrix operations
Most functions come in different versions for float, double and complex numbers. Your operation is basically an outer product of the vector Z with itself.
You can initialize the vector as a column vector (here double precision numbers):
gsl_matrix * Z = gsl_matrix_calloc (n,1);
and then use the BLAS function gsl_blas_dgemm to compute
Z * Z^T. The first arguments of this function determine, whether or not the input matrices should be transposed before the matrix multiplication:
gsl_blas_dgemm (CblasNoTrans, CblasTrans, 1.0, Z, Z, 0.0, C);
Here's a working test program (you may need to link it against gsl and blas):
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>
int main(int argc, char ** argv)
{
size_t n = 4;
gsl_matrix * Z = gsl_matrix_calloc (n,1);
gsl_matrix * C = gsl_matrix_calloc (n,n);
gsl_matrix_set(Z,0,0,1);
gsl_matrix_set(Z,1,0,2);
gsl_matrix_set(Z,2,0,0);
gsl_matrix_set(Z,3,0,1);
gsl_blas_dgemm (CblasNoTrans,
CblasTrans, 1.0, Z, Z, 0.0, C);
int i,j;
for (i = 0; i < n; i++)
{
for (j = 0; j < n; j++)
{
printf ("%g\t", gsl_matrix_get (C, i, j));
}
printf("\n");
}
gsl_matrix_free(Z);
gsl_matrix_free(C);
return 0;
}

Inverse sqrt for fixed point

I am looking for the best inverse square root algorithm for fixed point 16.16 numbers. The code below is what I have so far(but basically it takes the square root and divides by the original number, and I would like to get the inverse square root without a division). If it changes anything, the code will be compiled for armv5te.
uint32_t INVSQRT(uint32_t n)
{
uint64_t op, res, one;
op = ((uint64_t)n<<16);
res = 0;
one = (uint64_t)1 << 46;
while (one > op) one >>= 2;
while (one != 0)
{
if (op >= res + one)
{
op -= (res + one);
res += (one<<1);
}
res >>= 1;
one >>= 2;
}
res<<=16;
res /= n;
return(res);
}
The trick is to apply Newton's method to the problem x - 1/y^2 = 0. So, given x, solve for y using an iterative scheme.
Y_(n+1) = y_n * (3 - x*y_n^2)/2
The divide by 2 is just a bit shift, or at worst, a multiply by 0.5. This scheme converges to y=1/sqrt(x), exactly as requested, and without any true divides at all.
The only problem is that you need a decent starting value for y. As I recall there are limits on the estimate y for the iterations to converge.
ARMv5TE processors provide a fast integer multiplier, and a "count leading zeros" instruction. They also typically come with moderately sized caches. Based on this, the most suitable approach for a high-performance implementation appears to be a table lookup for an initial approximation, followed by two Newton-Raphson iterations to achieve fully accurate results. We can speed up the first of these iterations further with additional pre-computation that is incorporated into the table, a technique used by Cray computers forty years ago.
The function fxrsqrt() below implements this approach. It starts out with an 8-bit approximation r to the reciprocal square root of the argument a, but instead of storing r, each table element stores 3r (in the lower ten bits of the 32-bit entry) and r3 (in the upper 22 bits of the 32-bit entry). This allows the quick computation of the first iteration as
r1 = 0.5 * (3 * r - a * r3). The second iteration is then computed in the conventional way as r2 = 0.5 * r1 * (3 - r1 * (r1 * a)).
To be able to perform these computations accurately, regardless of the magnitude of the input, the argument a is normalized at the start of the computation, in essence representing it as a 2.32 fixed-point number multiplied with a scale factor of 2scal. At the end of the computation the result is denormalized according to formula 1/sqrt(22n) = 2-n. By rounding up results whose most significant discarded bit is 1, accuracy is improved, resulting in almost all results being correctly rounded. The exhaustive test reports: results too low: 639 too high: 1454 not correctly rounded: 2093
The code makes use of two helper functions: __clz() determines the number of leading zero bits in a non-zero 32-bit argument. __umulhi() computes the 32 most significant bits of a full 64-bit product of two unsigned 32-bit integers. Both functions should be implemented either via compiler intrinsics, or by using a bit of inline assembly. In the code below I am showing portable implementations well suited to ARM CPUs along with inline assembly versions for x86 platforms. On ARMv5TE platforms __clz() should be mapped map to the CLZ instruction, and __umulhi() should be mapped to UMULL.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#define USE_OWN_INTRINSICS 1
#if USE_OWN_INTRINSICS
__forceinline int __clz (uint32_t a)
{
int r;
__asm__ ("bsrl %1,%0\n\t" : "=r"(r): "r"(a));
return 31 - r;
}
uint32_t __umulhi (uint32_t a, uint32_t b)
{
uint32_t r;
__asm__ ("movl %1,%%eax\n\tmull %2\n\tmovl %%edx,%0\n\t"
: "=r"(r) : "r"(a), "r"(b) : "eax", "edx");
return r;
}
#else // USE_OWN_INTRINSICS
int __clz (uint32_t a)
{
uint32_t r = 32;
if (a >= 0x00010000) { a >>= 16; r -= 16; }
if (a >= 0x00000100) { a >>= 8; r -= 8; }
if (a >= 0x00000010) { a >>= 4; r -= 4; }
if (a >= 0x00000004) { a >>= 2; r -= 2; }
r -= a - (a & (a >> 1));
return r;
}
uint32_t __umulhi (uint32_t a, uint32_t b)
{
return (uint32_t)(((uint64_t)a * b) >> 32);
}
#endif // USE_OWN_INTRINSICS
/*
* For each sub-interval in [1, 4), use an 8-bit approximation r to reciprocal
* square root. To speed up subsequent Newton-Raphson iterations, each entry in
* the table combines two pieces of information: The least-significant 10 bits
* store 3*r, the most-significant 22 bits store r**3, rounded from 24 down to
* 22 bits such that accuracy is optimized.
*/
uint32_t rsqrt_tab [96] =
{
0xfa0bdefa, 0xee6af6ee, 0xe5effae5, 0xdaf27ad9,
0xd2eff6d0, 0xc890aec4, 0xc10366bb, 0xb9a71ab2,
0xb4da2eac, 0xadce7ea3, 0xa6f2b29a, 0xa279a694,
0x9beb568b, 0x97a5c685, 0x9163027c, 0x8d4fd276,
0x89501e70, 0x8563da6a, 0x818ac664, 0x7dc4fe5e,
0x7a122258, 0x7671be52, 0x72e44a4c, 0x6f68fa46,
0x6db22a43, 0x6a52623d, 0x67041a37, 0x65639634,
0x622ffe2e, 0x609cba2b, 0x5d837e25, 0x5bfcfe22,
0x58fd461c, 0x57838619, 0x560e1216, 0x53300a10,
0x51c72e0d, 0x50621a0a, 0x4da48204, 0x4c4c2e01,
0x4af789fe, 0x49a689fb, 0x485a11f8, 0x4710f9f5,
0x45cc2df2, 0x448b4def, 0x421505e9, 0x40df5de6,
0x3fadc5e3, 0x3e7fe1e0, 0x3d55c9dd, 0x3d55d9dd,
0x3c2f41da, 0x39edd9d4, 0x39edc1d4, 0x38d281d1,
0x37bae1ce, 0x36a6c1cb, 0x3595d5c8, 0x3488f1c5,
0x3488fdc5, 0x337fbdc2, 0x3279ddbf, 0x317749bc,
0x307831b9, 0x307879b9, 0x2f7d01b6, 0x2e84ddb3,
0x2d9005b0, 0x2d9015b0, 0x2c9ec1ad, 0x2bb0a1aa,
0x2bb0f5aa, 0x2ac615a7, 0x29ded1a4, 0x29dec9a4,
0x28fabda1, 0x2819e99e, 0x2819ed9e, 0x273c3d9b,
0x273c359b, 0x2661dd98, 0x258ad195, 0x258af195,
0x24b71192, 0x24b6b192, 0x23e6058f, 0x2318118c,
0x2318718c, 0x224da189, 0x224dd989, 0x21860d86,
0x21862586, 0x20c19183, 0x20c1b183, 0x20001580
};
/* This function computes the reciprocal square root of its 16.16 fixed-point
* argument. After normalization of the argument if uses the most significant
* bits of the argument for a table lookup to obtain an initial approximation
* accurate to 8 bits. This is followed by two Newton-Raphson iterations with
* quadratic convergence. Finally, the result is denormalized and some simple
* rounding is applied to maximize accuracy.
*
* To speed up the first NR iteration, for the initial 8-bit approximation r0
* the lookup table supplies 3*r0 along with r0**3. A first iteration computes
* a refined estimate r1 = 1.5 * r0 - x * r0**3. The second iteration computes
* the final result as r2 = 0.5 * r1 * (3 - r1 * (r1 * x)).
*
* The accuracy for all arguments in [0x00000001, 0xffffffff] is as follows:
* 639 results are too small by one ulp, 1454 results are too big by one ulp.
* A total of 2093 results deviate from the correctly rounded result.
*/
uint32_t fxrsqrt (uint32_t a)
{
uint32_t s, r, t, scal;
/* handle special case of zero input */
if (a == 0) return ~a;
/* normalize argument */
scal = __clz (a) & 0xfffffffe;
a = a << scal;
/* initial approximation */
t = rsqrt_tab [(a >> 25) - 32];
/* first NR iteration */
r = (t << 22) - __umulhi (t, a);
/* second NR iteration */
s = __umulhi (r, a);
s = 0x30000000 - __umulhi (r, s);
r = __umulhi (r, s);
/* denormalize and round result */
r = ((r >> (18 - (scal >> 1))) + 1) >> 1;
return r;
}
/* reference implementation, 16.16 reciprocal square root of non-zero argment */
uint32_t ref_fxrsqrt (uint32_t a)
{
double arg = a / 65536.0;
double rsq = sqrt (1.0 / arg);
uint32_t r = (uint32_t)(rsq * 65536.0 + 0.5);
return r;
}
int main (void)
{
uint32_t arg = 0x00000001;
uint32_t res, ref;
uint32_t err, lo = 0, hi = 0;
do {
res = fxrsqrt (arg);
ref = ref_fxrsqrt (arg);
err = 0;
if (res < ref) {
err = ref - res;
lo++;
}
if (res > ref) {
err = res - ref;
hi++;
}
if (err > 1) {
printf ("!!!! arg=%08x res=%08x ref=%08x\n", arg, res, ref);
return EXIT_FAILURE;
}
arg++;
} while (arg);
printf ("results too low: %u too high: %u not correctly rounded: %u\n",
lo, hi, lo + hi);
return EXIT_SUCCESS;
}
I have a solution that I characterize as "fast inverse sqrt, but for 32bit fixed points". No table, no reference, just straight to the point with a good guess.
If you want, jump to the source code below, but beware of a few things.
(x * y)>>16 can be replaced with any fixed-point multiplication scheme you want.
This does not require 64-bit [long-words], I just use that for the ease of demonstration. Long words are used to prevent overflow in multiplication. A fixed-point math library will have fixed-point multiplication functions that handle this better.
The initial guess is pretty good, so you get relatively precise results in the first incantation.
The code is more verbose than needed for demonstration.
Values less than 65536 (<1) and greater than 32767<<16 cannot be used.
This is generally not faster than using a square root table and division if your hardware has a division function. If it does not, this avoids divisions.
int fxisqrt(int input){
if(input <= 65536){
return 1;
}
long xSR = input>>1;
long pushRight = input;
long msb = 0;
long shoffset = 0;
long yIsqr = 0;
long ysqr = 0;
long fctrl = 0;
long subthreehalf = 0;
while(pushRight >= 65536){
pushRight >>=1;
msb++;
}
shoffset = (16 - ((msb)>>1));
yIsqr = 1<<shoffset;
//y = (y * (98304 - ( ( (x>>1) * ((y * y)>>16 ) )>>16 ) ) )>>16; x2
//Incantation 1
ysqr = (yIsqr * yIsqr)>>16;
fctrl = (xSR * ysqr)>>16;
subthreehalf = 98304 - fctrl;
yIsqr = (yIsqr * subthreehalf)>>16;
//Incantation 2 - Increases precision greatly, but may not be neccessary
ysqr = (yIsqr * yIsqr)>>16;
fctrl = (xSR * ysqr)>>16;
subthreehalf = 98304 - fctrl;
yIsqr = (yIsqr * subthreehalf)>>16;
return yIsqr;
}

Resources