Inverse sqrt for fixed point - math

I am looking for the best inverse square root algorithm for fixed point 16.16 numbers. The code below is what I have so far (it basically takes the square root and divides by the original number, and I would like to get the inverse square root without a division). If it changes anything, the code will be compiled for armv5te.
uint32_t INVSQRT(uint32_t n)
{
uint64_t op, res, one;
op = ((uint64_t)n<<16);
res = 0;
one = (uint64_t)1 << 46;
while (one > op) one >>= 2;
while (one != 0)
{
if (op >= res + one)
{
op -= (res + one);
res += (one<<1);
}
res >>= 1;
one >>= 2;
}
res<<=16;
res /= n;
return(res);
}

The trick is to apply Newton's method to the problem x - 1/y^2 = 0. So, given x, solve for y using an iterative scheme.
y_(n+1) = y_n * (3 - x*y_n^2)/2
The divide by 2 is just a bit shift, or at worst, a multiply by 0.5. This scheme converges to y=1/sqrt(x), exactly as requested, and without any true divides at all.
The only problem is that you need a decent starting value for y. As I recall there are limits on the estimate y for the iterations to converge.
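As a rough illustration of the iteration above, here is a minimal floating-point sketch in C. The function name rsqrt_newton and its parameters are made up for this example. It assumes x > 0 and a starting value with 0 < y0 < sqrt(3/x), which is enough for the iteration to converge to 1/sqrt(x).

```c
#include <assert.h>

/* Hypothetical helper: refine an estimate y0 of 1/sqrt(x) with Newton steps.
 * Assumes x > 0 and 0 < y0 < sqrt(3/x), i.e. x*y0*y0 < 3. */
static double rsqrt_newton(double x, double y0, int iterations)
{
    assert(x > 0.0 && y0 > 0.0 && x * y0 * y0 < 3.0);
    double y = y0;
    for (int i = 0; i < iterations; i++) {
        /* y_(n+1) = y_n * (3 - x*y_n^2)/2, using multiplications only */
        y = y * (3.0 - x * y * y) * 0.5;
    }
    return y;
}
```

For example, rsqrt_newton(4.0, 0.4, 6) should land very close to 0.5, since each step roughly doubles the number of correct digits once the estimate is near the root.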

ARMv5TE processors provide a fast integer multiplier, and a "count leading zeros" instruction. They also typically come with moderately sized caches. Based on this, the most suitable approach for a high-performance implementation appears to be a table lookup for an initial approximation, followed by two Newton-Raphson iterations to achieve fully accurate results. We can speed up the first of these iterations further with additional pre-computation that is incorporated into the table, a technique used by Cray computers forty years ago.
The function fxrsqrt() below implements this approach. It starts out with an 8-bit approximation r to the reciprocal square root of the argument a, but instead of storing r, each table element stores 3r (in the lower ten bits of the 32-bit entry) and r^3 (in the upper 22 bits of the 32-bit entry). This allows the quick computation of the first iteration as
r1 = 0.5 * (3 * r - a * r^3). The second iteration is then computed in the conventional way as r2 = 0.5 * r1 * (3 - r1 * (r1 * a)).
To be able to perform these computations accurately, regardless of the magnitude of the input, the argument a is normalized at the start of the computation, in essence representing it as a 2.32 fixed-point number multiplied with a scale factor of 2^scal. At the end of the computation the result is denormalized according to the formula 1/sqrt(2^(2n)) = 2^(-n). By rounding up results whose most significant discarded bit is 1, accuracy is improved, resulting in almost all results being correctly rounded. The exhaustive test reports: results too low: 639 too high: 1454 not correctly rounded: 2093
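The two identities used in this description can be spelled out as follows. The notation is introduced here for illustration only: n stands for half of the (even) normalization shift, and a' for the normalized argument. The first line shows that folding 3r and r^3 into the table is just a rearrangement of the Newton step; the second shows why the result is rescaled by a power of two at the end.

```latex
\begin{align*}
r_1 &= \tfrac{1}{2}\, r\,\bigl(3 - a r^2\bigr) = \tfrac{1}{2}\,\bigl(3r - a r^3\bigr), \\
a' &= a \cdot 2^{2n} \;\Longrightarrow\; \frac{1}{\sqrt{a}} = \frac{2^{n}}{\sqrt{a'}} .
\end{align*}
```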
The code makes use of two helper functions: __clz() determines the number of leading zero bits in a non-zero 32-bit argument. __umulhi() computes the 32 most significant bits of a full 64-bit product of two unsigned 32-bit integers. Both functions should be implemented either via compiler intrinsics, or by using a bit of inline assembly. In the code below I am showing portable implementations well suited to ARM CPUs along with inline assembly versions for x86 platforms. On ARMv5TE platforms __clz() should be mapped to the CLZ instruction, and __umulhi() should be mapped to UMULL.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#define USE_OWN_INTRINSICS 1
#if USE_OWN_INTRINSICS
__forceinline int __clz (uint32_t a)
{
int r;
__asm__ ("bsrl %1,%0\n\t" : "=r"(r): "r"(a));
return 31 - r;
}
uint32_t __umulhi (uint32_t a, uint32_t b)
{
uint32_t r;
__asm__ ("movl %1,%%eax\n\tmull %2\n\tmovl %%edx,%0\n\t"
: "=r"(r) : "r"(a), "r"(b) : "eax", "edx");
return r;
}
#else // USE_OWN_INTRINSICS
int __clz (uint32_t a)
{
uint32_t r = 32;
if (a >= 0x00010000) { a >>= 16; r -= 16; }
if (a >= 0x00000100) { a >>= 8; r -= 8; }
if (a >= 0x00000010) { a >>= 4; r -= 4; }
if (a >= 0x00000004) { a >>= 2; r -= 2; }
r -= a - (a & (a >> 1));
return r;
}
uint32_t __umulhi (uint32_t a, uint32_t b)
{
return (uint32_t)(((uint64_t)a * b) >> 32);
}
#endif // USE_OWN_INTRINSICS
/*
* For each sub-interval in [1, 4), use an 8-bit approximation r to reciprocal
* square root. To speed up subsequent Newton-Raphson iterations, each entry in
* the table combines two pieces of information: The least-significant 10 bits
* store 3*r, the most-significant 22 bits store r**3, rounded from 24 down to
* 22 bits such that accuracy is optimized.
*/
uint32_t rsqrt_tab [96] =
{
0xfa0bdefa, 0xee6af6ee, 0xe5effae5, 0xdaf27ad9,
0xd2eff6d0, 0xc890aec4, 0xc10366bb, 0xb9a71ab2,
0xb4da2eac, 0xadce7ea3, 0xa6f2b29a, 0xa279a694,
0x9beb568b, 0x97a5c685, 0x9163027c, 0x8d4fd276,
0x89501e70, 0x8563da6a, 0x818ac664, 0x7dc4fe5e,
0x7a122258, 0x7671be52, 0x72e44a4c, 0x6f68fa46,
0x6db22a43, 0x6a52623d, 0x67041a37, 0x65639634,
0x622ffe2e, 0x609cba2b, 0x5d837e25, 0x5bfcfe22,
0x58fd461c, 0x57838619, 0x560e1216, 0x53300a10,
0x51c72e0d, 0x50621a0a, 0x4da48204, 0x4c4c2e01,
0x4af789fe, 0x49a689fb, 0x485a11f8, 0x4710f9f5,
0x45cc2df2, 0x448b4def, 0x421505e9, 0x40df5de6,
0x3fadc5e3, 0x3e7fe1e0, 0x3d55c9dd, 0x3d55d9dd,
0x3c2f41da, 0x39edd9d4, 0x39edc1d4, 0x38d281d1,
0x37bae1ce, 0x36a6c1cb, 0x3595d5c8, 0x3488f1c5,
0x3488fdc5, 0x337fbdc2, 0x3279ddbf, 0x317749bc,
0x307831b9, 0x307879b9, 0x2f7d01b6, 0x2e84ddb3,
0x2d9005b0, 0x2d9015b0, 0x2c9ec1ad, 0x2bb0a1aa,
0x2bb0f5aa, 0x2ac615a7, 0x29ded1a4, 0x29dec9a4,
0x28fabda1, 0x2819e99e, 0x2819ed9e, 0x273c3d9b,
0x273c359b, 0x2661dd98, 0x258ad195, 0x258af195,
0x24b71192, 0x24b6b192, 0x23e6058f, 0x2318118c,
0x2318718c, 0x224da189, 0x224dd989, 0x21860d86,
0x21862586, 0x20c19183, 0x20c1b183, 0x20001580
};
/* This function computes the reciprocal square root of its 16.16 fixed-point
* argument. After normalization of the argument it uses the most significant
* bits of the argument for a table lookup to obtain an initial approximation
* accurate to 8 bits. This is followed by two Newton-Raphson iterations with
* quadratic convergence. Finally, the result is denormalized and some simple
* rounding is applied to maximize accuracy.
*
* To speed up the first NR iteration, for the initial 8-bit approximation r0
* the lookup table supplies 3*r0 along with r0**3. A first iteration computes
* a refined estimate r1 = 0.5 * (3 * r0 - x * r0**3). The second iteration computes
* the final result as r2 = 0.5 * r1 * (3 - r1 * (r1 * x)).
*
* The accuracy for all arguments in [0x00000001, 0xffffffff] is as follows:
* 639 results are too small by one ulp, 1454 results are too big by one ulp.
* A total of 2093 results deviate from the correctly rounded result.
*/
uint32_t fxrsqrt (uint32_t a)
{
uint32_t s, r, t, scal;
/* handle special case of zero input */
if (a == 0) return ~a;
/* normalize argument */
scal = __clz (a) & 0xfffffffe;
a = a << scal;
/* initial approximation */
t = rsqrt_tab [(a >> 25) - 32];
/* first NR iteration */
r = (t << 22) - __umulhi (t, a);
/* second NR iteration */
s = __umulhi (r, a);
s = 0x30000000 - __umulhi (r, s);
r = __umulhi (r, s);
/* denormalize and round result */
r = ((r >> (18 - (scal >> 1))) + 1) >> 1;
return r;
}
/* reference implementation, 16.16 reciprocal square root of non-zero argument */
uint32_t ref_fxrsqrt (uint32_t a)
{
double arg = a / 65536.0;
double rsq = sqrt (1.0 / arg);
uint32_t r = (uint32_t)(rsq * 65536.0 + 0.5);
return r;
}
int main (void)
{
uint32_t arg = 0x00000001;
uint32_t res, ref;
uint32_t err, lo = 0, hi = 0;
do {
res = fxrsqrt (arg);
ref = ref_fxrsqrt (arg);
err = 0;
if (res < ref) {
err = ref - res;
lo++;
}
if (res > ref) {
err = res - ref;
hi++;
}
if (err > 1) {
printf ("!!!! arg=%08x res=%08x ref=%08x\n", arg, res, ref);
return EXIT_FAILURE;
}
arg++;
} while (arg);
printf ("results too low: %u too high: %u not correctly rounded: %u\n",
lo, hi, lo + hi);
return EXIT_SUCCESS;
}
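On GCC- or Clang-style toolchains, the two helpers could plausibly be written with a compiler builtin instead of inline assembly. The sketch below is an assumption about the toolchain, not part of the code above: the names clz32 and umulhi32 are made up, __builtin_clz is a GCC/Clang builtin that takes an unsigned int (assumed here to be 32 bits wide) and is undefined for a zero argument, and whether the compiler actually emits CLZ and UMULL depends on the target flags.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zero bits of a non-zero 32-bit value (GCC/Clang builtin). */
static inline int clz32(uint32_t a)
{
    assert(a != 0);               /* __builtin_clz(0) is undefined */
    return __builtin_clz(a);
}

/* Upper 32 bits of the full 64-bit product of two unsigned 32-bit values. */
static inline uint32_t umulhi32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
```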

I have a solution that I characterize as "fast inverse sqrt, but for 32bit fixed points". No table, no reference, just straight to the point with a good guess.
If you want, jump to the source code below, but beware of a few things.
(x * y)>>16 can be replaced with any fixed-point multiplication scheme you want.
This does not require 64-bit long words; I just use them for ease of demonstration. Long words are used to prevent overflow in multiplication. A fixed-point math library will have fixed-point multiplication functions that handle this better.
The initial guess is pretty good, so you get relatively precise results in the first incantation.
The code is more verbose than needed for demonstration.
Values less than 65536 (<1) and greater than 32767<<16 cannot be used.
This is generally not faster than using a square root table and division if your hardware has a division function. If it does not, this avoids divisions.
int fxisqrt(int input){
if(input <= 65536){
return 1;
}
long xSR = input>>1;
long pushRight = input;
long msb = 0;
long shoffset = 0;
long yIsqr = 0;
long ysqr = 0;
long fctrl = 0;
long subthreehalf = 0;
while(pushRight >= 65536){
pushRight >>=1;
msb++;
}
shoffset = (16 - ((msb)>>1));
yIsqr = 1<<shoffset;
//y = (y * (98304 - ( ( (x>>1) * ((y * y)>>16 ) )>>16 ) ) )>>16; x2
//Incantation 1
ysqr = (yIsqr * yIsqr)>>16;
fctrl = (xSR * ysqr)>>16;
subthreehalf = 98304 - fctrl;
yIsqr = (yIsqr * subthreehalf)>>16;
//Incantation 2 - Increases precision greatly, but may not be necessary
ysqr = (yIsqr * yIsqr)>>16;
fctrl = (xSR * ysqr)>>16;
subthreehalf = 98304 - fctrl;
yIsqr = (yIsqr * subthreehalf)>>16;
return yIsqr;
}
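The notes above say that (x * y)>>16 can be swapped for any fixed-point multiplication scheme. One possible shape for such a helper, assuming a 64-bit intermediate type is available, is sketched below. The name fxmul is hypothetical and not part of the code above, and the caller is still responsible for keeping the final result within 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16.16 multiply: widen to 64 bits so the raw product cannot
 * overflow, then drop the 16 extra fraction bits. Restricted to
 * non-negative operands, as used by the routine above. */
static int32_t fxmul(int32_t x, int32_t y)
{
    assert(x >= 0 && y >= 0);
    return (int32_t)(((int64_t)x * (int64_t)y) >> 16);
}
```

With this, a line such as ysqr = (yIsqr * yIsqr)>>16 would read ysqr = fxmul(yIsqr, yIsqr).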

Related

Approximately reversible function from a pair of floats to a single float?

Some time ago, I came across a pair of functions in some CAD code to encode a set of coordinates (a pair of floats) as a single float (to use as a hash key), and then to unpack that single float back into the original pair.
The forward and backward functions only used standard mathematical operations -- no magic fiddling with bit-level representations of floats, no extracting and interleaving individual digits or anything like that. Obviously the reversal is not perfect in practice, because you lose considerable precision going from two floats to one, but according to the Wikipedia page for the function it should have been exactly invertible given infinite precision arithmetic.
Unfortunately, I don't work on that code anymore, and I've forgotten the name of the function so I can't look it up on Wikipedia again. Anybody know of a named mathematical function that meets that description?
In a comment, OP clarified that the desired function should map two IEEE-754 binary64 operands into a single such operand. One way to accomplish this is to map each binary64 (double-precision) number into a positive integer in [0, 2^26-2], and then use a well-known pairing function to map two such integers into a single positive integer in the interval [0, 2^52), which is exactly representable in a binary64 which has 52 stored significand ("mantissa") bits.
As the binary64 operands are unrestricted in range per a comment by OP, all binades should be representable with equal relative accuracy, and we need to handle zeroes, infinities, and NaNs as well. For this reason I chose log2() to compress the data. Zeros are treated as the smallest binary64 subnormal, 0x1.0p-1074, which has the consequence that both 0x1.0p-1074 and zero will decompress into zero. The result from log2 falls into the range [-1074, 1024). Since we need to store the sign of the operand, we bias the logarithm value by 1074, giving a result in [0, 2098), then scale that to almost [0, 2^25), round to the nearest integer, and finally affix the sign of the original binary64 operand. The motivation for almost utilizing the complete range is to leave a little bit of room at the top of the range for special encodings for infinity and NaN (so a single canonical NaN encoding is used).
Since pairing functions known from the literature operate on natural numbers only, we need a mapping from whole numbers to natural numbers. This is easily accomplished by mapping negative whole numbers to odd natural numbers, while positive whole numbers are mapped to even natural numbers. Thus our operands are mapped from (-2^25, +2^25) to [0, 2^26-2]. The pairing function then combines two integers in [0, 2^26-2] into a single integer in [0, 2^52).
Different pairing functions known from the literature differ in their scrambling behavior, which may impact the hashing functionality mentioned in the question. They may also differ in their performance. Therefore I am offering a selection of four different pairing functions for the pair() / unpair() implementations in the code below. Please see the comments in the code for corresponding literature references.
Unpacking of the packed operand involves applying the inverse of each packing step in reverse order. The unpairing function gives us two natural integers. These are mapped to two whole numbers, which are mapped to two logarithm values, which are then exponentiated with exp2() to recover the original numbers, with a bit of added work to get special values and the sign correct.
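The mapping and pairing steps described above can also be sketched with plain integer arithmetic, which makes the round trip easy to follow. This is an illustrative variant and not the double-based implementation of this answer: the names z_to_n, n_to_z, isqrt_u, pair_u and unpair_u are made up, and the pairing shown is Szudzik's formula.

```c
#include <assert.h>
#include <stdint.h>

/* Whole numbers to naturals: negatives -> odd, non-negatives -> even. */
static uint64_t z_to_n(int64_t z)
{
    return (z < 0) ? (uint64_t)(-2 * z - 1) : (uint64_t)(2 * z);
}

/* Inverse of z_to_n. */
static int64_t n_to_z(uint64_t n)
{
    return (n & 1) ? -(int64_t)((n + 1) / 2) : (int64_t)(n / 2);
}

/* Integer square root by bisection; invariant lo*lo <= z < hi*hi. */
static uint64_t isqrt_u(uint64_t z)
{
    uint64_t lo = 0, hi = (uint64_t)1 << 32;
    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (mid * mid <= z) lo = mid; else hi = mid;
    }
    return lo;
}

/* Szudzik's pairing of two naturals; operands are limited to below 2^31
 * here so that the result always fits in 64 bits. */
static uint64_t pair_u(uint64_t x, uint64_t y)
{
    assert(x < ((uint64_t)1 << 31) && y < ((uint64_t)1 << 31));
    return (x < y) ? (y * y + x) : (x * x + x + y);
}

/* Inverse of pair_u. */
static void unpair_u(uint64_t z, uint64_t *x, uint64_t *y)
{
    uint64_t s = isqrt_u(z);
    uint64_t d = z - s * s;
    if (d < s) { *x = d; *y = s; } else { *x = s; *y = d - s; }
}
```

For instance, the whole numbers -3 and 7 map to the naturals 5 and 14, which pair to 14*14 + 5 = 201; unpairing 201 and mapping back recovers -3 and 7.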
While logarithms are represented with a relative accuracy on the order of 10^-8, the expected maximum relative error in final results is on the order of 10^-5 due to the well-known error magnification property of exponentiation. Maximum relative error observed for a pack() / unpack() round-trip in extensive testing was 2.167e-5.
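The observed bound can be cross-checked with a short back-of-the-envelope calculation. It assumes that the biased logarithm is rounded to the nearest multiple of 1/S, with S = 15993.5193125 (the SCALE constant in the code that follows), and that this rounding dominates the error; the symbols S and δ are introduced here for illustration.

```latex
\[
|\delta| \le \frac{1}{2S}, \qquad
\left| 2^{\delta} - 1 \right| \approx |\delta| \ln 2 \le \frac{\ln 2}{2S}
= \frac{0.693147}{2 \times 15993.5193125} \approx 2.167 \times 10^{-5}.
\]
```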
Below is my ISO C99 implementation of the algorithm together with a portion of my test framework. This should be portable to other programming languages with a modicum of effort.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <float.h>
#define SZUDZIK_ELEGANT_PAIRING (1)
#define ROZSA_PETER_PAIRING (2)
#define ROSENBERG_STRONG_PAIRING (3)
#define CHRISTOPH_MICHEL_PAIRING (4)
#define PAIRING_FUNCTION (ROZSA_PETER_PAIRING)
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
double uint64_as_double (uint64_t a)
{
double r;
memcpy (&r, &a, sizeof r);
return r;
}
#define LOG2_BIAS (1074.0)
#define CLAMP_LOW (exp2(-LOG2_BIAS))
#define SCALE (15993.5193125)
#define NAN_ENCODING (33554430)
#define INF_ENCODING (33554420)
/* check whether argument is an odd integer */
int is_odd_int (double a)
{
return (-2.0 * floor (0.5 * a) + a) == 1.0;
}
/* compress double-precision number into an integer in (-2**25, +2**25) */
double compress (double x)
{
double t;
t = fabs (x);
t = (t < CLAMP_LOW) ? CLAMP_LOW : t;
t = rint ((log2 (t) + LOG2_BIAS) * SCALE);
if (isnan (x)) t = NAN_ENCODING;
if (isinf (x)) t = INF_ENCODING;
return copysign (t, x);
}
/* expand integer in (-2**25, +2**25) into double-precision number */
double decompress (double x)
{
double s, t;
s = fabs (x);
t = s / SCALE;
if (s == (NAN_ENCODING)) t = NAN;
if (s == (INF_ENCODING)) t = INFINITY;
return copysign ((t == 0) ? 0 : exp2 (t - LOG2_BIAS), x);
}
/* map whole numbers to natural numbers. Here: (-2^25, +2^25) to [0, 2^26-2] */
double map_Z_to_N (double x)
{
return (x < 0) ? (-2 * x - 1) : (2 * x);
}
/* Map natural numbers to whole numbers. Here: [0, 2^26-2] to (-2^25, +2^25) */
double map_N_to_Z (double x)
{
return is_odd_int (x) ? (-0.5 * (x + 1)) : (0.5 * x);
}
#if PAIRING_FUNCTION == SZUDZIK_ELEGANT_PAIRING
/* Matthew Szudzik, "An elegant pairing function." In Wolfram Research (ed.)
Special NKS 2006 Wolfram Science Conference, pp. 1-12.
Here: map two natural numbers in [0, 2^26-2] to natural number in [0, 2^52),
and vice versa
*/
double pair (double x, double y)
{
return (x != fmax (x, y)) ? (y * y + x) : (x * x + x + y);
}
void unpair (double z, double *x, double *y)
{
double sqrt_z = trunc (sqrt (z));
double sqrt_z_diff = z - sqrt_z * sqrt_z;
*x = (sqrt_z_diff < sqrt_z) ? sqrt_z_diff : sqrt_z;
*y = (sqrt_z_diff < sqrt_z) ? sqrt_z : (sqrt_z_diff - sqrt_z);
}
#elif PAIRING_FUNCTION == ROZSA_PETER_PAIRING
/*
Rozsa Peter, "Rekursive Funktionen" (1951), p. 44. Via David R. Hagen's blog,
https://drhagen.com/blog/superior-pairing-function/
Here: map two natural numbers in [0, 2^26-2] to natural number in [0, 2^52),
and vice versa
*/
double pair (double x, double y)
{
double mx = fmax (x, y);
double mn = fmin (x, y);
double sel = (mx == x) ? 0 : 1;
return mx * mx + mn * 2 + sel;
}
void unpair (double z, double *x, double *y)
{
double sqrt_z = trunc (sqrt (z));
double sqrt_z_diff = z - sqrt_z * sqrt_z;
double half_diff = trunc (sqrt_z_diff * 0.5);
*x = is_odd_int (sqrt_z_diff) ? half_diff : sqrt_z;
*y = is_odd_int (sqrt_z_diff) ? sqrt_z : half_diff;
}
#elif PAIRING_FUNCTION == ROSENBERG_STRONG_PAIRING
/*
A. L. Rosenberg and H. R. Strong, "Addressing arrays by shells",
IBM Technical Disclosure Bulletin, Vol. 14, No. 10, March 1972,
pp. 3026-3028.
Arnold L. Rosenberg, "Allocating storage for extendible arrays,"
Journal of the ACM, Vol. 21, No. 4, October 1974, pp. 652-670.
Corrigendum, Journal of the ACM, Vol. 22, No. 2, April 1975, p. 308.
Matthew P. Szudzik, "The Rosenberg-Strong Pairing Function", 2019
https://arxiv.org/abs/1706.04129
Here: map two natural numbers in [0, 2^26-2] to natural number in [0, 2^52),
and vice versa
*/
double pair (double x, double y)
{
double mx = fmax (x, y);
return mx * mx + mx + x - y;
}
void unpair (double z, double *x, double *y)
{
double sqrt_z = trunc (sqrt (z));
double sqrt_z_diff = z - sqrt_z * sqrt_z;
*x = (sqrt_z_diff < sqrt_z) ? sqrt_z_diff : sqrt_z;
*y = (sqrt_z_diff < sqrt_z) ? sqrt_z : (2 * sqrt_z - sqrt_z_diff);
}
#elif PAIRING_FUNCTION == CHRISTOPH_MICHEL_PAIRING
/*
Christoph Michel, "Enumerating a Grid in Spiral Order", September 7, 2016,
https://cmichel.io/enumerating-grid-in-spiral-order. Via German Wikipedia,
https://de.wikipedia.org/wiki/Cantorsche_Paarungsfunktion
Here: map two natural numbers in [0, 2^26-2] to natural number in [0, 2^52),
and vice versa
*/
double pair (double x, double y)
{
double mx = fmax (x, y);
return mx * mx + mx + (is_odd_int (mx) ? (x - y) : (y - x));
}
void unpair (double z, double *x, double *y)
{
double sqrt_z = trunc (sqrt (z));
double sqrt_z_diff = z - sqrt_z * (sqrt_z + 1);
double min_clamp = fmin (sqrt_z_diff, 0);
double max_clamp = fmax (sqrt_z_diff, 0);
*x = is_odd_int (sqrt_z) ? (sqrt_z + min_clamp) : (sqrt_z - max_clamp);
*y = is_odd_int (sqrt_z) ? (sqrt_z - max_clamp) : (sqrt_z + min_clamp);
}
#else
#error unknown PAIRING_FUNCTION
#endif
/* Lossy pairing function for double precision numbers. The maximum round-trip
relative error is about 2.167e-5
*/
double pack (double a, double b)
{
double c, p, q, s, t;
p = compress (a);
q = compress (b);
s = map_Z_to_N (p);
t = map_Z_to_N (q);
c = pair (s, t);
return c;
}
/* Unpairing function for double precision numbers. The maximum round-trip
relative error is about 2.167e-5 */
void unpack (double c, double *a, double *b)
{
double s, t, u, v;
unpair (c, &s, &t);
u = map_N_to_Z (s);
v = map_N_to_Z (t);
*a = decompress (u);
*b = decompress (v);
}
int main (void)
{
double a, b, c, ua, ub, relerr_a, relerr_b;
double max_relerr_a = 0, max_relerr_b = 0;
#if PAIRING_FUNCTION == SZUDZIK_ELEGANT_PAIRING
printf ("Testing with Szudzik's elegant pairing function\n");
#elif PAIRING_FUNCTION == ROZSA_PETER_PAIRING
printf ("Testing with Rozsa Peter's pairing function\n");
#elif PAIRING_FUNCTION == ROSENBERG_STRONG_PAIRING
printf ("Testing with Rosenberg-Strong pairing function\n");
#elif PAIRING_FUNCTION == CHRISTOPH_MICHEL_PAIRING
printf ("Testing with C. Michel's spiral pairing function\n");
#else
#error unknown PAIRING_FUNCTION
#endif
do {
a = uint64_as_double (KISS64);
b = uint64_as_double (KISS64);
c = pack (a, b);
unpack (c, &ua, &ub);
if (!isnan(ua) && !isinf(ua) && (ua != 0)) {
relerr_a = fabs ((ua - a) / a);
if (relerr_a > max_relerr_a) {
printf ("relerr_a= %15.8e a=% 23.16e ua=% 23.16e\n",
relerr_a, a, ua);
max_relerr_a = relerr_a;
}
}
if (!isnan(ub) && !isinf(ub) && (ub != 0)) {
relerr_b = fabs ((ub - b) / b);
if (relerr_b > max_relerr_b) {
printf ("relerr_b= %15.8e b=% 23.16e ub=% 23.16e\n",
relerr_b, b, ub);
max_relerr_b = relerr_b;
}
}
} while (1);
return EXIT_SUCCESS;
}
I don't know the name of the function, but you can normalize the 2 values x and y to the range [0, 1] using some methods like
X = arctan(x)/π + 0.5
Y = arctan(y)/π + 0.5
At this point X = 0.a_1 a_2 a_3... and Y = 0.b_1 b_2 b_3... Now just interleave the digits to get a single float with value 0.a_1 b_1 a_2 b_2 a_3 b_3...
At the receiving site just slice the bits and get back x and y from X and Y
x = tan((X - 0.5)π)
y = tan((Y - 0.5)π)
This works in decimal but also works in binary and of course it'll be easier to manipulate the binary digits directly. But probably you'll need to normalize the values to [0, ½] or [½, 1] to make the exponents the same. You can also avoid the use of bit manipulation by utilizing the fact that the significand part is always 24 bits long and we can just store x and y in the high and low parts of the significand. The result paired value is
r = ⌊X×2^12⌋/2^12 + Y/2^12
⌊x⌋ is the floor symbol. Now that's a pure math solution!
If you know the magnitudes of the values are always close to each other then you can improve the process by aligning the values' radix points to normalize and take the high 12 significant bits of the significand to merge together, no need to use atan
In case the range of the values is limited then you can normalize by this formula to avoid the loss of precision due to atan
X = (x - min)/(max - min)
Y = (y - min)/(max - min)
But in this case there's a way to combine the values just with pure mathematical functions. Suppose the values are in the range [0, max]; then the value is r = x*max + y. To reverse the operation:
x = ⌊r/max⌋;
y = r mod max
If min is not zero then just shift the range accordingly
Read more in Is there a mathematical function that converts two numbers into one so that the two numbers can always be extracted again?

Finding (a ^ x) % m from a % m. This is about utilizing a % m to calculate (a ^ x) % m. % is the modulus operator [duplicate]

I want to calculate a^b mod n for use in RSA decryption. My code (below) returns incorrect answers. What is wrong with it?
unsigned long int decrypt2(int a,int b,int n)
{
unsigned long int res = 1;
for (int i = 0; i < (b / 2); i++)
{
res *= ((a * a) % n);
res %= n;
}
if (b % n == 1)
res *=a;
res %=n;
return res;
}
You can try this C++ code. I've used it with 32 and 64-bit integers. I'm sure I got this from SO.
template <typename T>
T modpow(T base, T exp, T modulus) {
base %= modulus;
T result = 1;
while (exp > 0) {
if (exp & 1) result = (result * base) % modulus;
base = (base * base) % modulus;
exp >>= 1;
}
return result;
}
You can find this algorithm and related discussion in the literature on p. 244 of
Schneier, Bruce (1996). Applied Cryptography: Protocols, Algorithms, and Source Code in C, Second Edition (2nd ed.). Wiley. ISBN 978-0-471-11709-4.
Note that the multiplications result * base and base * base are subject to overflow in this simplified version. If the modulus is more than half the width of T (i.e. more than the square root of the maximum T value), then one should use a suitable modular multiplication algorithm instead - see the answers to Ways to do modulo multiplication with primitive types.
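As a sketch of what such an overflow-safe variant could look like in C, the block below replaces each multiplication with a double-and-add modular multiply that never forms a product wider than 64 bits. The names mulmod64 and modpow64 are hypothetical, and the approach trades speed for simplicity; it is one possible option, not the method from the book cited above.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod m without overflow, for any non-zero 64-bit modulus.
 * Uses double-and-add: every intermediate value stays below m. */
static uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t m)
{
    assert(m != 0);
    uint64_t r = 0;
    a %= m;
    b %= m;
    while (b != 0) {
        if (b & 1) {
            /* r = (r + a) mod m, written so that r + a never wraps */
            r = (r >= m - a) ? r - (m - a) : r + a;
        }
        /* a = (2 * a) mod m, again without wrapping */
        a = (a >= m - a) ? a - (m - a) : a + a;
        b >>= 1;
    }
    return r;
}

/* (base ^ exp) mod m by binary exponentiation on top of mulmod64. */
static uint64_t modpow64(uint64_t base, uint64_t exp, uint64_t m)
{
    assert(m != 0);
    uint64_t result = 1 % m;   /* yields 0 when m == 1 */
    base %= m;
    while (exp != 0) {
        if (exp & 1) result = mulmod64(result, base, m);
        base = mulmod64(base, base, m);
        exp >>= 1;
    }
    return result;
}
```

For the small textbook RSA numbers, modpow64(7, 3, 33) gives 13 and modpow64(13, 7, 33) gives 7 again.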
In order to calculate pow(a,b) % n for RSA decryption, the best algorithm I came across is the one from Primality Testing 1), which is as follows:
int modulo(int a, int b, int n){
long long x=1, y=a;
while (b > 0) {
if (b%2 == 1) {
x = (x*y) % n; // multiplying with base
}
y = (y*y) % n; // squaring the base
b /= 2;
}
return x % n;
}
See below reference for more details.
1) Primality Testing : Non-deterministic Algorithms – topcoder
Usually it's something like this:
while (b)
{
if (b % 2) { res = (res * a) % n; }
a = (a * a) % n;
b /= 2;
}
return res;
The only actual logic error that I see is this line:
if (b % n == 1)
which should be this:
if (b % 2 == 1)
But your overall design is problematic: your function performs O(b) multiplications and modulus operations, but your use of b / 2 and a * a implies that you were aiming to perform O(log b) operations (which is usually how modular exponentiation is done).
Doing the raw power operation is very costly, hence you can apply the following logic to simplify the decryption.
From here,
Now say we want to encrypt the message m = 7, c = m^e mod n = 7^3 mod 33
= 343 mod 33 = 13. Hence the ciphertext c = 13.
To check decryption we compute m' = c^d mod n = 13^7 mod 33 = 7. Note
that we don't have to calculate the full value of 13 to the power 7
here. We can make use of the fact that a = bc mod n = (b mod n).(c mod
n) mod n so we can break down a potentially large number into its
components and combine the results of easier, smaller calculations to
calculate the final value.
One way of calculating m' is as follows:- Note that any number can be
expressed as a sum of powers of 2. So first compute values of 13^2,
13^4, 13^8, ... by repeatedly squaring successive values modulo 33. 13^2
= 169 ≡ 4, 13^4 = 4.4 = 16, 13^8 = 16.16 = 256 ≡ 25. Then, since 7 = 4 + 2 + 1, we have m' = 13^7 = 13^(4+2+1) = 13^4.13^2.13^1 ≡ 16 x 4 x 13 = 832
≡ 7 mod 33
Are you trying to calculate (a^b)%n, or a^(b%n) ?
If you want the first one, then your code only works when b is an even number, because of that b/2. The "if b%n==1" is incorrect because you don't care about b%n here, but rather about b%2.
If you want the second one, then the loop is wrong because you're looping b/2 times instead of (b%n)/2 times.
Either way, your function is unnecessarily complex. Why do you loop until b/2 and try to multiply in 2 a's each time? Why not just loop until b and multiply in one a each time? That would eliminate a lot of unnecessary complexity and thus eliminate potential errors. Are you thinking that you'll make the program faster by cutting the number of times through the loop in half? Frankly, that's a bad programming practice: micro-optimization. It doesn't really help much: You still multiply by a the same number of times, all you do is cut down on the number of times testing the loop. If b is typically small (like one or two digits), it's not worth the trouble. If b is large -- if it can be in the millions -- then this is insufficient, you need a much more radical optimization.
Also, why do the %n each time through the loop? Why not just do it once at the end?
Calculating pow(a,b) mod n
A key problem with OP's code is a * a. This is int overflow (undefined behavior) when a is large enough. The type of res is irrelevant in the multiplication of a * a.
The solution is to ensure either:
the multiplication is done with 2x wide math or
with modulus n, n*n <= type_MAX + 1
There is no reason to return a wider type than the type of the modulus as the result is always representable in that type.
// unsigned long int decrypt2(int a,int b,int n)
int decrypt2(int a,int b,int n)
Using unsigned math is certainly more suitable for OP's RSA goals.
Also see Modular exponentiation without range restriction
#include <limits.h>
// (a^b)%n
// n != 0
// Test whether unsigned long long has at least 2x the value bits of unsigned
#if ULLONG_MAX/UINT_MAX - 1 > UINT_MAX
unsigned decrypt2(unsigned a, unsigned b, unsigned n) {
unsigned long long result = 1u % n; // Ensure result < n, even when n==1
while (b > 0) {
if (b & 1) result = (result * a) % n;
a = (1ULL * a * a) %n;
b >>= 1;
}
return (unsigned) result;
}
#else
unsigned decrypt2(unsigned a, unsigned b, unsigned n) {
// Detect if UINT_MAX + 1 < n*n
if (UINT_MAX/n < n-1) {
return TBD_code_with_wider_math(a,b,n);
}
a %= n;
unsigned result = 1u % n;
while (b > 0) {
if (b & 1) result = (result * a) % n;
a = (a * a) % n;
b >>= 1;
}
return result;
}
#endif
int's are generally not enough for RSA (unless you are dealing with small simplified examples)
you need a data type that can store integers up to 2^256 (for 256-bit RSA keys) or 2^512 for 512-bit keys, etc
Here is another way. Remember that the modular multiplicative inverse of a under mod m exists only when a and m are coprime.
We can use the extended gcd (extended Euclidean algorithm) to calculate the modular multiplicative inverse.
Computing a^b mod m is tricky when a and b can have more than 10^5 digits.
The code below does the computing part:
#include <iostream>
#include <string>
using namespace std;
/*
* May this code live long.
*/
long pow(string,string,long long);
long pow(long long ,long long ,long long);
int main() {
string _num,_pow;
long long _mod;
cin>>_num>>_pow>>_mod;
//cout<<_num<<" "<<_pow<<" "<<_mod<<endl;
cout<<pow(_num,_pow,_mod)<<endl;
return 0;
}
long pow(string n,string p,long long mod){
long long num=0,_pow=0;
for(char c: n){
num=(num*10+c-48)%mod;
}
for(char c: p){
_pow=(_pow*10+c-48)%(mod-1);
}
return pow(num,_pow,mod);
}
long pow(long long a,long long p,long long mod){
long res=1;
if(a==0)return 0;
while(p>0){
if((p&1)==0){
p/=2;
a=(a*a)%mod;
}
else{
p--;
res=(res*a)%mod;
}
}
return res;
}
This code works because, when m is prime and a is not a multiple of m, a^b mod m can be written as (a mod m)^(b mod (m-1)) mod m (Fermat's little theorem).
Hope it helped { :)
use fast exponentiation maybe..... gives same o(log n) as that template above
int power(int base, int exp,int mod)
{
if(exp == 0)
return 1;
int p=power(base, exp/2,mod);
p=(p*p)% mod;
return (exp%2 == 0)?p:(base * p)%mod;
}
This (encryption) is more of an algorithm design problem than a programming one. The important missing part is familiarity with modern algebra. I suggest that you look for a huge optimization in group theory and number theory.
If n is a prime number and a is not a multiple of n, pow(a,n-1)%n==1 (assuming infinite digit integers). So, basically you need to calculate pow(a,b%(n-1))%n. According to group theory, you can find e such that every other number is equivalent to a power of e modulo n. Therefore the range [1..n-1] can be represented as a permutation on powers of e. Given the algorithm to find e for n and logarithm of a base e, calculations can be significantly simplified. Cryptography needs a ton of math background; I'd rather stay off that ground without enough background.
For my code a^k mod n in php:
function pmod($a, $k, $n)
{
if ($n==1) return 0;
$power = 1;
for($i=1; $i<=$k; $i++)
{
$power = ($power*$a) % $n;
}
return $power;
}
#include <cmath>
...
static_cast<int>(std::pow(a,b))%n
But my best bet is that you are overflowing int (i.e. the number is too large for an int) in the power computation. I had the same problem when creating the exact same function.
I'm using this function:
int CalculateMod(int base, int exp ,int mod){
int result;
result = (int) pow(base,exp);
result = result % mod;
return result;
}
I cast the result because pow gives you back a double, and the % operator needs two operands of integer type. In any case, in RSA decryption you should only use integer numbers.

How to get the first x leading binary digits of 5**x without big integer multiplication

I want to efficiently and elegantly compute, with perfect precision, the first x leading binary digits of 5**x.
For example 5**20 is 10101101011110001110101111000101101011000110001. The first 8 leading binary digits are 10101101.
In my use case, x only ranges from 1 to 60. I don't want to create a table. A solution using 64-bit integers would be fine. I just don't want to use big integers.
first x leading binary digits of 5**x without big integer multiplication
efficiently and elegantly compute with perfect precision the first x leading binary digits of 5**x?
"compute with perfect precision" leaves out pow(). Too many implementations will return an imperfect result and FP math might not use 64 bit precision, even with long double.
Form a fixed-point number with a 64-bit whole-number part .ms and a 64-bit fraction part .ls. Then loop 60 times, multiplying by 5 and dividing by 2 as needed to keep the leading bits from growing too big.
Note there is some precision lost in the fraction, with N > 42, yet that is not significant enough to affect the whole number part OP is seeking.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
typedef struct {
uint64_t ms, ls;
} uint128;
// Simplifications possible here, leave for OP
uint128 times5(uint128 x) {
uint128 y = x;
for (int i=1; i<5; i++) {
// y += x
y.ms += x.ms;
y.ls += x.ls;
if (y.ls < x.ls) y.ms++;
}
return y;
}
uint128 div2(uint128 x) {
x.ls = (x.ls >> 1) | (x.ms << 63);
x.ms >>= 1;
return x;
}
int main(void) {
uint128 y = {.ms = 1};
uint64_t pow2 = 2;
for (unsigned x = 1; x <= 60; x++) {
y = times5(y);
while (y.ms >= pow2) {
y = div2(y);
}
printf("%2u %16" PRIX64 ".%016" PRIX64 "\n", x, y.ms, y.ls);
pow2 <<= 1;
}
}
Output
whole part.fraction
1 1.4000000000000000
2 3.2000000000000000
3 7.D000000000000000
4 9.C400000000000000
...
57 14643E5AE44D12B.8F5FEE5AA432560D
58 32FA9BE33AC0AEC.E66FD3E29A7DD720
59 7F7285B812E1B50.401791B6823A99D0
60 9F4F2726179A224.501D762422C94044
^-------------^ This is the part OP is seeking.
The key to solving this task is: divide and conquer. Form an algorithm, (which is simply *5 and /2 as needed), and code a type and functions to do each small step.
Is a loop of 60 efficient? Perhaps not. Another approach would use Exponentiation by squaring. Certainly would be worth it for large N, yet for N == 60, a loop was simple enough for a quick turn.
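One possible form of the simplification hinted at in the times5 comment is to compute 5*x as 4*x + x, which needs one 128-bit shift and one 128-bit add rather than four additions. The sketch below wraps the same multiply-by-5, divide-by-2 loop in a function; the names fix128 and pow5_leading_bits are hypothetical, and the range 1 <= n <= 60 is assumed as in the question.

```c
#include <stdint.h>

typedef struct {
    uint64_t ms;  /* whole-number part */
    uint64_t ls;  /* fraction part */
} fix128;

/* 5*x computed as 4*x + x: one 128-bit shift and one 128-bit add. */
static fix128 times5(fix128 x) {
    uint64_t hi4 = (x.ms << 2) | (x.ls >> 62);
    uint64_t lo4 = x.ls << 2;
    fix128 y;
    y.ls = lo4 + x.ls;
    y.ms = hi4 + x.ms + (y.ls < lo4);  /* carry out of the low word */
    return y;
}

static fix128 div2(fix128 x) {
    x.ls = (x.ls >> 1) | (x.ms << 63);
    x.ms >>= 1;
    return x;
}

/* Leading n binary digits of 5**n, assuming 1 <= n <= 60. */
uint64_t pow5_leading_bits(unsigned n) {
    fix128 y = { 1, 0 };
    for (unsigned i = 1; i <= n; i++) {
        y = times5(y);
        while (y.ms >> i) y = div2(y);  /* keep the whole part below 2**i */
    }
    return y.ms;
}
```

For example, pow5_leading_bits(8) yields 190, which is 10111110 in binary, the top eight bits of 5**8 = 390625.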
5^n = 2^(-n) * 10^n
Using this identity, we can easily compute the leading N base-2 digits of (the nearest integer to) any given power of 5.
This code example is in C, but it's the same idea in any other language.
Example output: https://wandbox.org/permlink/Fs205DDzQR0gaLSo
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#define STATIC_ASSERT(CONDITION) ((void)sizeof(int[(CONDITION) ? 1 : -1]))
uint64_t pow5_leading_digits(double power, uint8_t ndigits)
{
STATIC_ASSERT(DBL_MANT_DIG <= 64);
double pow5 = exp2(-power) * pow(10, power);
const double binary_digits = ceil(log2(pow5));
assert(ndigits <= DBL_MANT_DIG);
if (!ndigits || binary_digits < 0)
return 0;
// If pow5 can fit in the number of digits requested, return it
if (binary_digits <= ndigits)
return pow5;
// If pow5 is too big to return, divide by 2 until it fits
if (binary_digits > DBL_MANT_DIG)
pow5 /= exp2(binary_digits - DBL_MANT_DIG + 1);
return (uint64_t)pow5 >> (DBL_MANT_DIG - ndigits);
}
Edit: Now limits the returned value to those exactly representable in a double.

Implementing equality function with basic arithmetic operations

Given positive-integer inputs x and y, is there a mathematical formula that will return 1 if x==y and 0 otherwise? I am in the unfortunate position of having to use a tool that only allows me to use the following symbols: numerals 0-9; decimal point .; parentheses ( and ); and the four basic arithmetic operations +, -, /, and *.
Currently I am relying on the fact that the tool evaluates division by zero to zero. (I can't tell if this is a bug or a feature.) Because of this, I have been able to use ((x-y)/(y-x))+1. Obviously, this is ugly and not ideal, especially if it is a bug and they fix it in a future version.
Taking advantage of the fact that integer division in C truncates toward 0, the following works well. No multiplication overflow. Well defined for all "positive-integer inputs x and y".
(x/y) * (y/x)
#include <stdio.h>
#include <limits.h>
void etest(unsigned x, unsigned y) {
unsigned ref = x == y;
unsigned z = (x/y) * (y/x);
if (ref != z) {
printf("%u %u %u %u\n", x,y,z,ref);
}
}
void etests(void) {
unsigned list[] = { 1,2,3,4,5,6,7,8,9,10,100,1000, UINT_MAX/2 , UINT_MAX - 1, UINT_MAX };
for (unsigned x = 0; x < sizeof list/sizeof list[0]; x++) {
for (unsigned y = 0; y < sizeof list/sizeof list[0]; y++) {
etest(list[x], list[y]);
}
}
}
int main(void) {
etests();
printf("Done\n");
return 0;
}
Output (No difference from x == y)
Done
If division is truncating and the numbers are not too big, then:
((x - y) ^ 2 + 2) / ((x - y) ^ 2 + 1) - 1
The division has the value 2 if x = y and otherwise truncates to 1.
(Here x^2 is an abbreviation for x*x.)
This will fail if (x-y)^2 overflows. In that case, you need to independently check x/k = y/k and x%k = y%k where (k-1)*(k-1) doesn't overflow (which will work if k is ceil(sqrt(INT_MAX))). x%k can be computed as x-k*(x/k) and A&&B is simply A*B.
That will work for any x and y in the range [-k*k, k*k].
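The split check described above (compare x/k with y/k and x%k with y%k separately) might be sketched in C as follows. This is only an illustration under stated assumptions: int is 32 bits, both inputs are non-negative, and the helper names is_zero and equals_split are hypothetical.

```c
/* 1 if d == 0, otherwise 0, using only + - * / with truncating division.
   Requires d*d + 2 to fit in int, i.e. |d| <= 46340 for 32-bit int. */
static int is_zero(int d) {
    return (d * d + 2) / (d * d + 1) - 1;
}

/* Equality test for non-negative 32-bit int inputs.
   K = 46341 is ceil(sqrt(INT_MAX)), so quotients and remainders
   both lie in [0, 46340] and their differences square safely. */
int equals_split(int x, int y) {
    const int K = 46341;
    int hi = x / K - y / K;                          /* difference of quotients */
    int lo = (x - K * (x / K)) - (y - K * (y / K));  /* difference of remainders */
    return is_zero(hi) * is_zero(lo);                /* A && B written as A * B */
}
```

Splitting x and y before subtracting keeps every intermediate difference within 46340 in magnitude, which is what makes the squares safe.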
A slightly incorrect computation, using lots of intermediate values, which assumes that x - y won't overflow (or at least that the overflow won't produce a false 0).
int delta = x - y;
int delta_hi = delta / K;
int delta_lo = delta - K * delta_hi;
int equal_hi = (delta_hi * delta_hi + 2) / (delta_hi * delta_hi + 1) - 1;
int equal_lo = (delta_lo * delta_lo + 2) / (delta_lo * delta_lo + 1) - 1;
int equals = equal_hi * equal_lo;
or written out in full:
((((x-y)/K)*((x-y)/K)+2)/(((x-y)/K)*((x-y)/K)+1)-1)*
((((x-y)-K*((x-y)/K))*((x-y)-K*((x-y)/K))+2)/
(((x-y)-K*((x-y)/K))*((x-y)-K*((x-y)/K))+1)-1)
(For signed 31-bit integers, use K=46341; for unsigned 32-bit integers, 65536.)
Checked with @chux's test harness, adding the 0 case: live on coliru and with negative values also on coliru.
On a platform where integer subtraction might produce something other than the 2s-complement wraparound, a similar technique could be used, but dividing the numbers into three parts instead of two.
So the problem is that if they fix division by zero, it means that you cannot use any divisor that contains input variables anymore (you'd have to check that the divisor != 0, and implementing that check would solve the original x-y == 0 problem!); hence, division cannot be used at all.
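Ergo, only +, -, * and the grouping operator () can be used. With only these operators, the desired behaviour cannot be implemented: any such expression is a polynomial in x and y, and for a fixed y a nonzero polynomial in x has only finitely many roots, so it cannot be 0 for every x != y and still be 1 at x == y.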

Calculate Bessel function in MATLAB using the J(m+1) = 2m*J(m) - J(m-1) formula

I tried to implement bessel function using that formula, this is the code:
function result=Bessel(num);
if num==0
result=bessel(0,1);
elseif num==1
result=bessel(1,1);
else
result=2*(num-1)*Bessel(num-1)-Bessel(num-2);
end;
But if I use MATLAB's bessel function to compare it with this one, I get wildly different values.
For example if I type Bessel(20) it gives me 3.1689e+005 as result, if instead I type bessel(20,1) it gives me 3.8735e-025 , a totally different result.
Such recurrence relations are nice in mathematics but numerically unstable when implemented with limited-precision floating-point numbers.
Consider the following comparison:
x = 0:20;
y1 = arrayfun(@(n)besselj(n,1), x); %# builtin function
y2 = arrayfun(@Bessel, x); %# your function
semilogy(x,y1, x,y2), grid on
legend('besselj','Bessel')
title('J_\nu(z)'), xlabel('\nu'), ylabel('log scale')
So you can see how the computed values start to differ significantly after 9.
According to MATLAB:
BESSELJ uses a MEX interface to a Fortran library by D. E. Amos.
and gives the following as references for their implementation:
D. E. Amos, "A subroutine package for Bessel functions of a complex
argument and nonnegative order", Sandia National Laboratory Report,
SAND85-1018, May, 1985.
D. E. Amos, "A portable package for Bessel functions of a complex
argument and nonnegative order", Trans. Math. Software, 1986.
The forward recurrence relation you are using is not stable. To see why, consider that the values of BesselJ(n,x) become smaller and smaller, by about a factor of x/(2n) at each step (that is, 1/(2n) for x = 1). You can see this by looking at the first term of the Taylor series for J.
So, what you're doing is subtracting a large number from a multiple of a somewhat smaller number to get an even smaller number. Numerically, that's not going to work well.
Look at it this way. We know the result is of the order of 10^-25. You start out with numbers that are of the order of 1. So in order to get even one accurate digit out of this, we have to know the first two numbers with at least 25 digits precision. We clearly don't, and the recurrence actually diverges.
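To make the loss of accuracy concrete, here is a small C sketch of the forward recurrence at x = 1, written as a loop instead of a recursion. The function name is hypothetical, and the two seeds are simply J(0,1) and J(1,1) rounded to double precision.

```c
/* Forward recurrence J(m+1) = 2m*J(m) - J(m-1) at x = 1,
   seeded with J(0,1) and J(1,1) rounded to double precision. */
double besselj_forward_at_1(int n) {
    double j0 = 0.7651976865579666;  /* J(0,1) */
    double j1 = 0.4400505857449335;  /* J(1,1) */
    if (n == 0) return j0;
    for (int m = 1; m < n; m++) {
        double next = 2.0 * m * j1 - j0;  /* large terms nearly cancel */
        j0 = j1;
        j1 = next;
    }
    return j1;
}
```

For small orders the result is still reasonable, but by n = 20 the rounding error in the seeds has been amplified far beyond the true value of about 3.87e-25, in line with the 3.1689e+005 reported in the question.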
Using the same recurrence relation to go backwards, from high orders to low orders, is stable. When you start with correct values for J(20,1) and J(19,1), you can calculate all orders down to 0 with full accuracy as well. Why does this work? Because now the numbers are getting larger in each step. You're subtracting a very small number from an exact multiple of a larger number to get an even larger number.
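A minimal C sketch of the downward direction follows. Since standard C has no Bessel function, the two starting values are taken from the power series of J, which is an assumption of this sketch and not something the answer prescribes; the function names are hypothetical, and nmax >= 1 and x != 0 are assumed.

```c
/* Power series J(n,x) = sum over k of (-1)^k / (k! (n+k)!) * (x/2)^(2k+n).
   Thirty terms is ample for small |x|; used here only to seed the recurrence. */
static double besselj_series(int n, double x) {
    double half = x / 2.0;
    double term = 1.0;
    for (int i = 1; i <= n; i++) term *= half / i;  /* (x/2)^n / n! */
    double sum = term;
    for (int k = 1; k <= 30; k++) {
        term *= -(half * half) / ((double)k * (n + k));
        sum += term;
    }
    return sum;
}

/* Fills out[0..nmax] with J(0,x)..J(nmax,x) using the downward recurrence
   J(m-1) = (2m/x)*J(m) - J(m+1). Assumes nmax >= 1 and x != 0. */
void besselj_down(int nmax, double x, double *out) {
    out[nmax] = besselj_series(nmax, x);
    out[nmax - 1] = besselj_series(nmax - 1, x);
    for (int m = nmax - 1; m >= 1; m--)
        out[m - 1] = (2.0 * m / x) * out[m] - out[m + 1];
}
```

Calling besselj_down(20, 1.0, out) with a 21-element array should leave values close to the builtin results in every slot, because each step now produces a larger number from smaller ones.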
You can just modify the code below, which is for the spherical Bessel function. It is well tested and works for the full range of arguments and orders. I am sorry it is in C#.
public static Complex bessel(int n, Complex z)
{
if (n == 0) return sin(z) / z;
if (n == 1) return sin(z) / (z * z) - cos(z) / z;
if (n <= System.Math.Abs(z.real))
{
Complex h0 = bessel(0, z);
Complex h1 = bessel(1, z);
Complex ret = 0;
for (int i = 2; i <= n; i++)
{
ret = (2 * i - 1) / z * h1 - h0;
h0 = h1;
h1 = ret;
if (double.IsInfinity(ret.real) || double.IsInfinity(ret.imag)) return double.PositiveInfinity;
}
return ret;
}
else
{
double u = 2.0 * abs(z.real) / (2 * n + 1);
double a = 0.1;
double b = 0.175;
int v = n - (int)System.Math.Ceiling((System.Math.Log(0.5e-16 * (a + b * u * (2 - System.Math.Pow(u, 2)) / (1 - System.Math.Pow(u, 2))), 2)));
Complex ret = 0;
while (v > n - 1)
{
ret = z / (2 * v + 1.0 - z * ret);
v = v - 1;
}
Complex jnM1 = ret;
while (v > 0)
{
ret = z / (2 * v + 1.0 - z * ret);
jnM1 = jnM1 * ret;
v = v - 1;
}
return jnM1 * sin(z) / z;
}
}

Resources