Bias value and range of the exponent of floating point - math

I have realized that I didn't pay enough attention to the floating-point part of the IEEE 754 standard back when I was sitting at my university desk. Even though I'm not currently struggling with embedded work, I feel incompetent and not entitled to the engineer title while I can't do these calculations and fully grasp the standard.
What I know is:
0 and 255 are special exponent values used to express zero and infinity.
There is an implicit 1, so the 23-bit mantissa effectively provides 24 bits.
The exponent is treated as 1 when the field is all zeros (denormals); if the field is all ones and the mantissa is zero the value is infinity, and if the field is all ones and the mantissa is non-zero the value is NaN.
What I don't understand is
What I don't understand is:
How can we speak of -126 and 127 as inclusive bounds? How are the 254 possible values divided between those inclusive bounds?
Why is 127 selected as the bias value?
Some sources describe the range as [-126..127] but others as [-125...128]. It is really intricate and perplexing.
How can we say the minimum is 2^{-126} rather than 2^{-125}, as the second kind of source would suggest? (I have not been able to get my head around this yet, despite struggling :)
Isn't using the modulo (remainder) operator with the bias value more logical than subtraction, i.e. 2^{e%127}? (the correction thanks to chux)

Exponent range
For a 32-bit float the raw exponent rexp is 8 bits, <0,255>, and the bias is 127. Excluding the special cases { 0,255 } we get <1,254>; applying the bias:
expmin = 1-127 = -126
expmax = 254-127 = +127
Denormal values have no implicit 1, so for the minimal number the mantissa is 1, and because the exponent should point to the lsb of the mantissa we need to shift a few bits more:
expmin = 0-127-(23-1) = -149
The maximal normal value has the maximal mantissa, and with the exponent pointing to the mantissa lsb it is:
max = ((2^24)-1)*(2^(127-23)) = 2^128 - 2^104 ~ 3.40e+38
so the real range (denormals included) of float is:
<2^-149 ,2^+128 )
<1.40e-45,3.40e+38)
In most specs and docs only the exponent for normalized numbers is shown so:
<2^-126 ,2^+127 >
<1.175e-38,1.701e38>
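If you just want to check these numbers yourself without the VCL dependencies of the example below, here is a minimal portable C sketch (the names and output format are my own) that extracts the raw fields of a 32-bit float with memcpy and applies the bias exactly as described above:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void f32_fields(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);              // raw bit pattern of the float
    unsigned sign = u >> 31;               // 1 sign bit
    int rexp = (int)((u >> 23) & 0xFF);    // raw exponent <0,255>
    unsigned man = u & 0x7FFFFF;           // 23 stored mantissa bits
    if (rexp == 0)                         // zero / denormal: exponent acts as 1-127
        printf("sign=%u denormal/zero exp=%d man=0x%06X\n", sign, 1 - 127, man);
    else if (rexp == 255)                  // Inf / NaN
        printf("sign=%u %s\n", sign, man ? "NaN" : "Inf");
    else                                   // normalized: subtract bias, add implicit 1
        printf("sign=%u exp=%d man=0x%06X\n", sign, rexp - 127, man | 0x800000);
}

int main(void)
{
    f32_fields(123.456f);
    f32_fields(1e-40f);   // denormal
    f32_fields(-0.0f);
    return 0;
}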
Here is a small C++/VCL example of dissecting 32-bit and 64-bit floats:
//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
typedef unsigned __int32 U32;
typedef __int32 S32;
//---------------------------------------------------------------------------
// IEEE 754 double MSW masks
const U32 _f64_sig =0x80000000; // sign
const U32 _f64_exp =0x7FF00000; // exponent
const U32 _f64_exp_sig=0x40000000; // exponent sign
const U32 _f64_exp_bia=0x3FF00000; // exponent bias
const U32 _f64_exp_lsb=0x00100000; // exponent LSB
const U32 _f64_exp_pos= 20; // exponent LSB bit position
const U32 _f64_man =0x000FFFFF; // mantisa
const U32 _f64_man_msb=0x00080000; // mantisa MSB
const U32 _f64_man_bits= 52; // mantisa bits
const double _f64_lsb = 1.7e-308; // abs min number
// IEEE 754 single masks <2^-149,2^+128) <1.40e-45,3.40e+38).
const U32 _f32_sig =0x80000000; // sign
const U32 _f32_exp =0x7F800000; // exponent
const U32 _f32_exp_sig=0x40000000; // exponent sign
const U32 _f32_exp_bia=0x3F800000; // exponent bias
const U32 _f32_exp_lsb=0x00800000; // exponent LSB
const U32 _f32_exp_pos= 23; // exponent LSB bit position
const U32 _f32_man =0x007FFFFF; // mantisa
const U32 _f32_man_msb=0x00400000; // mantisa MSB
const U32 _f32_man_bits= 23; // mantisa bits
const float _f32_lsb = 3.4e-38;// abs min number
//---------------------------------------------------------------------------
void f64_disect(double x)
{
const int h=1; // may be platform dependent MSB/LSB order
const int l=0;
union _f64
{
double f; // 64bit floating point
U32 u[2]; // 2x32 bit uint
} f64;
AnsiString txt="";
U32 man[2];
S32 exp,bias;
char sign='+';
f64.f=x;
bias=_f64_exp_bia>>_f64_exp_pos;
if (f64.u[h]&_f64_sig) sign='-';
exp =(f64.u[h]&_f64_exp)>>_f64_exp_pos;
exp -=bias;
man[h]=f64.u[h]&_f64_man;
man[l]=f64.u[l];
if (exp==-bias ) // zero, denormalized
{
exp-=_f64_man_bits-1; // change exp pointing from msb to lsb (ignoring implicit bit)
txt=AnsiString().sprintf("%c%06X%08Xh>>%4i",sign,man[h],man[l],-exp);
}
else if (exp==+bias+1) // Inf,NaN
{
if ((man[h]|man[l])==0) txt=AnsiString().sprintf("%cInf ",sign);
else txt=AnsiString().sprintf("%cNaN ",sign);
man[h]=0; man[l]=0; exp=0;
}
else{
exp -=_f64_man_bits; // change exp pointing from msb to lsb
man[h]|=_f64_exp_lsb; // implicit msb mantisa bit for normalized numbers
txt=AnsiString().sprintf("%06X",man);
if (exp<0) txt=AnsiString().sprintf("%c%06X%08Xh>>%4i",sign,man[h],man[l],-exp);
else txt=AnsiString().sprintf("%c%06X%08Xh<<%4i",sign,man[h],man[l],+exp);
}
// reconstruct man,exp back to double
double y=double(man[l])*pow(2.0,exp);
y+=double(man[h])*pow(2.0,exp+32.0);
Form1->mm_log->Lines->Add(AnsiString().sprintf("%21.10lf = %s = %21.10lf",x,txt,y));
}
//---------------------------------------------------------------------------
void f32_disect(double x)
{
union _f32 // float bits access
{
float f; // 32bit floating point
U32 u; // 32 bit uint
} f32;
AnsiString txt="";
U32 man;
S32 exp,bias;
char sign='+';
f32.f=x;
bias=_f32_exp_bia>>_f32_exp_pos;
if (f32.u&_f32_sig) sign='-';
exp =(f32.u&_f32_exp)>>_f32_exp_pos;
exp-=bias;
man =f32.u&_f32_man;
if (exp==-bias ) // zero, denormalized
{
exp-=_f32_man_bits-1; // change exp pointing from msb to lsb (ignoring implicit bit)
txt=AnsiString().sprintf("%c%06Xh>>%3i",sign,man,-exp);
}
else if (exp==+bias+1) // Inf,NaN
{
if (man==0) txt=AnsiString().sprintf("%cInf ",sign);
else txt=AnsiString().sprintf("%cNaN ",sign);
man=0; exp=0;
}
else{
exp-=_f32_man_bits; // change exp pointing from msb to lsb
man|=_f32_exp_lsb; // implicit msb mantisa bit for normalized numbers
txt=AnsiString().sprintf("%06X",man);
if (exp<0) txt=AnsiString().sprintf("%c%06Xh>>%3i",sign,man,-exp);
else txt=AnsiString().sprintf("%c%06Xh<<%3i",sign,man,+exp);
}
// reconstruct man,exp back to float
float y=float(man)*pow(2.0,exp);
Form1->mm_log->Lines->Add(AnsiString().sprintf("%21.10f = %s = %21.10f",x,txt,y));
}
//---------------------------------------------------------------------------
//--- Builder: --------------------------------------------------------------
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
{
mm_log->Lines->Add("[Float]\r\n");
f32_disect(123*pow(2.0,-127-22)); // Denormalized
f32_disect(+0.0); // Zero
f32_disect(-0.0); // Zero
f32_disect(+0.0/0.0); // NaN
f32_disect(-0.0/0.0); // NaN
f32_disect(+1.0/0.0); // Inf
f32_disect(-1.0/0.0); // Inf
f32_disect(+123.456); // Normalized
f32_disect(-0.000123); // Normalized
mm_log->Lines->Add("\r\n[Double]\r\n");
f64_disect(123*pow(2.0,-127-22)); // Denormalized
f64_disect(+0.0); // Zero
f64_disect(-0.0); // Zero
f64_disect(+0.0/0.0); // NaN
f64_disect(-0.0/0.0); // NaN
f64_disect(+1.0/0.0); // Inf
f64_disect(-1.0/0.0); // Inf
f64_disect(+123.456); // Normalized
f64_disect(-0.000123); // Normalized
mm_log->Lines->Add("\r\n[Fixed]\r\n");
const int n=10;
float fx=12.345,fy=4.321,fm=1<<n;
int x=float(fx*fm);
int y=float(fy*fm);
mm_log->Lines->Add(AnsiString().sprintf("%7.3f + %7.3f = %8.3f = %8.3f",fx,fy,fx+fy,float(int((x+y) ))/fm));
mm_log->Lines->Add(AnsiString().sprintf("%7.3f - %7.3f = %8.3f = %8.3f",fx,fy,fx-fy,float(int((x-y) ))/fm));
mm_log->Lines->Add(AnsiString().sprintf("%7.3f * %7.3f = %8.3f = %8.3f",fx,fy,fx*fy,float(int((x*y)>>n))/fm));
mm_log->Lines->Add(AnsiString().sprintf("%7.3f / %7.3f = %8.3f = %8.3f",fx,fy,fx/fy,float(int((x/y)<<n))/fm
+float(int(((x%y)<<n)/y))/fm));
}
//---------------------------------------------------------------------------
Which might help you understand a bit more ... If you're interested then look also at this:
print 32bit float using only integer arithmetics
exponent bias
It was selected as the middle of the raw exponent range:
bias = (0+255)/2 = 127
so that the ranges for positive and negative exponents are as close to equal as possible.
modulo
Using exp=rexp%127 will not give you negative values from an unsigned rexp no matter what, not to mention that division is a slow operation (at least at the time the spec was created)... That is why exp=rexp-bias is used.

How can we speak of -126 and 127 as inclusive bounds? How are the 254 possible values divided between those inclusive bounds?
IEEE 754-2008 3.3 says emin, the minimum exponent, for any format shall be 1−emax, where emax is the maximum exponent. Table 3.2 in that clause says emax for the 32-bit format (named “binary32”) shall be 127. So emin is 1−127 = −126.
There is no mathematical constraint that forces this. The relationship is chosen as a matter of preference. I recall there being a desire to have slightly more positive exponents than negative but do not recall the justification for this.
Why is 127 selected as the bias value?
Once the bounds above are selected, 127 is necessarily the value needed to encode them in eight bits (as codes 1-254 while leaving 0 and 255 as special codes).
Some sources describe the range as [-126..127] but others as [-125...128]. It is really intricate and perplexing.
Given bits of a binary32 that are the sign bit S, the eight exponent bits E (which are a binary representation of a number e), and the 23 significand bits F (which are a binary representation of a number f), and given 0 < e < 255, then the following are equivalent to each other:
The number represented is (−1)^S • 2^(e−127) • (1 + f/2^23).
The number represented is (−1)^S • 2^(e−127) • 1.F₂.
The number represented is (−1)^S • 2^(e−126) • (½ + f/2^24).
The number represented is (−1)^S • 2^(e−126) • .1F₂.
The number represented is (−1)^S • 2^(e−150) • (2^23 + f).
The number represented is (−1)^S • 2^(e−150) • 1F.₂.
The difference between the first two is just that the first takes the significand bits F, treats them as a binary numeral to get a number f, then divides that number by 2^23 and adds 1, whereas the second uses the 23 significand bits F to write a 24-bit numeral "1.F", which it then interprets as a binary numeral. These two methods produce the same value.
The difference between the first pair and the second pair is that the first prepares a significand in the half-open interval [1, 2), whereas the second prepares a significand in the half-open interval [½, 1) and adjusts the exponent to compensate. The product is the same.
The difference between the first pair and the third pair is also one of scaling. The third pair scales the significand so that it is an integer. The first form is most commonly seen in discussions of floating-point numbers, but the third form is useful for mathematical proofs because number theory generally works with integers. This form is also mentioned in IEEE 754 in passing, also in clause 3.3.
How can we say the minimum is 2^{-126} rather than 2^{-125}, as the second kind of source would suggest?
The minimum positive normal value has S bit 0, E bits 00000001, and F bits 00000000000000000000000. In the first form, this represents +1 • 2^(1−127) • 1 = 2^−126. In the second form, it represents +1 • 2^(1−126) • ½ = 2^−126. In the third form, it represents +1 • 2^(1−150) • 2^23 = 2^−126. So the form is irrelevant; the values represented are the same.
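A quick numeric check of that equivalence (this snippet is only an illustration, not part of the standard): evaluate the three scalings for the minimum normal value S=0, e=1, f=0 and compare them with 2^-126.
#include <math.h>
#include <stdio.h>

int main(void)
{
    int S = 0, e = 1;
    double f = 0;
    double v1 = pow(-1, S) * pow(2, e - 127) * (1 + f / pow(2, 23));
    double v2 = pow(-1, S) * pow(2, e - 126) * (0.5 + f / pow(2, 24));
    double v3 = pow(-1, S) * pow(2, e - 150) * (pow(2, 23) + f);
    // all three print 1.17549e-38, i.e. 2^-126
    printf("%g %g %g %g\n", v1, v2, v3, pow(2, -126));
    return 0;
}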
Isn't using the remainder operator with the bias value more logical than subtraction, i.e. 2^{e%127}?
No. That would cause the exponent field values 1 and 128 to map to the same value, and that would waste some encodings. There is no benefit to that.
Additionally, the encoding format is such that all positive floating-point numbers are in the same order as their encodings: Increasing the encoding increases the value represented, and vice-versa. This relationship would not hold with any sort of wrapped interpretation of the exponent field. (Unfortunately, this is flipped for negative numbers, so comparing the encodings of floating-point numbers as pure integers does not give the same results as comparing the floating-point numbers.)
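A small illustration of that ordering property (my own example, not from the standard): reinterpret two positive floats as unsigned integers and compare; both comparisons agree.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t bits_of(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   // reinterpret the encoding as an integer
    return u;
}

int main(void)
{
    float a = 1.5f, b = 2.75f;
    // both comparisons print 1 for positive floats
    printf("%d %d\n", a < b, bits_of(a) < bits_of(b));
    return 0;
}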

Related

How to bruteforce a lossy AND routine?

I'm wondering whether there are any standard approaches to reversing AND routines by brute force.
For example I have the following transformation:
MOV(eax, 0x5b3e0be0) # Move 0x5b3e0be0 into EAX.
MOV(edx, eax) # Copy 0x5b3e0be0 into EDX as well.
SHL(edx, 0x7) # Bitshift 0x5b3e0be0 with 0x7 which results in 0x9f05f000
AND(edx, 0x9d2c5680) # AND 0x9f05f000 with 0x9d2c5680 which results in 0x9d045000
XOR(edx, eax) # XOR 0x9d045000 with original value 0x5b3e0be0 which results in 0xc63a5be0
My question is how to brute force and reverse this routine (i.e. transform 0xc63a5be0 back into 0x5b3e0be0)?
One idea I had (which didn't work) was this, using a PeachPy implementation:
#Input values
MOV(esi, 0xffffffff) # Initial value to AND with, which will be decreased by 1 in a loop.
MOV(cl, 0x1) # Initial value to SHR with, which will be increased by 1 until 0x1f.
MOV(eax, 0xc63a5be0) # Target result which I'm looking to get using the below loop.
MOV(edx, 0x5b3e0be0) # Input value which will be transformed.
sub_esi = peachpy.x86_64.Label()
with loop:
#End the loop if ESI = 0x0
TEST(esi, esi)
JZ(loop.end)
#Test the routine and check if it matches end result.
MOV(ebx, eax)
SHR(ebx, cl)
TEST(ebx, ebx)
JZ(sub_esi)
AND(ebx, esi)
XOR(ebx, eax)
CMP(ebx, edx)
JZ(loop.end)
#Add to the CL register which is used for SHR.
#Also check if we've reached the last potential value of CL which is 0x1f
ADD(cl, 0x1)
CMP(cl, 0x1f)
JNZ(loop.begin)
#Decrement ESI by 1, reset CL and restart routine.
peachpy.x86_64.LABEL(sub_esi)
SUB(esi, 0x1)
MOV(cl, 0x1)
JMP(loop.begin)
#The ESI result here will either be 0x0 or a valid value to AND with and get the necessary result.
RETURN(esi)
Maybe an article or a book you can recommend specific to this?
It's not lossy, the final operation is an XOR.
The whole routine can be modeled in C as
#define K 0x9d2c5680
uint32_t hash(uint32_t num)
{
return num ^ ( (num << 7) & K);
}
Now, if we have two bits x and y and the operation x XOR y, when y is zero the result is x.
So given two numbers n1 and n2 and considering their XOR, the bits of n1 that pair with a zero in n2 make it into the result unchanged (the others are flipped).
So in considering num ^ ( (num << 7) & K) we can identify num with n1 and (num << 7) & K with n2.
Since n2 is an AND, we can tell that it must have at least the same zero bits that K has.
This means that each bit of num that corresponds to a zero bit in the constant K will make it unchanged into the result.
Thus, by extracting those bits from the result we already have a partial inverse function:
/*hash & ~K extracts the bits of hash that pair with a zero bit in K*/
partial_num = hash & ~K
Technically, the factor num << 7 would also introduce other zeros in the result of the AND. We know for sure that the lowest 7 bits must be zero.
However K already has the lowest 7 bits zero, so we cannot exploit this information.
So we will just use K here, but if its value were different you'd need to consider the AND (which, in practice, means to zero the lower 7 bits of K).
This leaves us with 13 bits unknown (the ones corresponding to the bits that are set in K).
If we forget about the AND for a moment, we would have x ^ (x << 7) meaning that
h_i = num_i for i from 0 to 6 inclusive
h_i = num_i ^ num_(i-7) for i from 7 to 31 inclusive
(The first line is due to the fact that the lower 7 bits of the right-hand operand are zero.)
From this, starting from h_7 and going up, we can retrieve num_7 as h_7 ^ num_0 = h_7 ^ h_0.
From bit 14 onward, the equality num_(i-7) = h_(i-7) no longer holds and we need num_k itself (for the suitable k), but luckily we have already computed its value in a previous step (that's why we go from lower bits to higher ones).
What the AND does to this is just restricting the values the index i runs in, specifically only to the bits that are set in K.
So to fill in the thirteen remaining bits one has to do:
part_num_7 = h_7 ^ part_num_0
part_num_9 = h_9 ^ part_num_2
part_num_12 = h_12 ^ part_num_5
...
part_num_31 = h_31 ^ part_num_24
Note that we exploited the fact that part_num_0..6 = h_0..6.
Here's a C program that inverts the function:
#include <stdio.h>
#include <stdint.h>
#define BIT(i, hash, result) ( (((result >> i) ^ (hash >> (i+7))) & 0x1) << (i+7) )
#define K 0x9d2c5680
uint32_t base_candidate(uint32_t hash)
{
uint32_t result = hash & ~K;
result |= BIT(0, hash, result);
result |= BIT(2, hash, result);
result |= BIT(3, hash, result);
result |= BIT(5, hash, result);
result |= BIT(7, hash, result);
result |= BIT(11, hash, result);
result |= BIT(12, hash, result);
result |= BIT(14, hash, result);
result |= BIT(17, hash, result);
result |= BIT(19, hash, result);
result |= BIT(20, hash, result);
result |= BIT(21, hash, result);
result |= BIT(24, hash, result);
return result;
}
uint32_t hash(uint32_t num)
{
return num ^ ( (num << 7) & K);
}
int main()
{
uint32_t tester = 0x5b3e0be0;
uint32_t candidate = base_candidate(hash(tester));
printf("candidate: %x, tester %x\n", candidate, tester);
return 0;
}
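For what it's worth, the same idea can also be written as a generic loop that does not depend on the particular bit positions of K: since the low s bits of the hash already equal the low s bits of num, re-applying the forward transform a few times recovers s more correct bits per pass (the same trick used to undo Mersenne Twister tempering). This is my own generalization, not part of the answer above:
#include <stdint.h>
#include <stdio.h>

// invert y = x ^ ((x << s) & K) for any shift s >= 1 and mask K
uint32_t invert_xor_shl_and(uint32_t y, unsigned s, uint32_t K)
{
    uint32_t x = y;                       // low s bits are already correct
    for (unsigned i = s; i < 32; i += s)  // each pass fixes the next s bits
        x = y ^ ((x << s) & K);
    return x;
}

uint32_t fwd(uint32_t x) { return x ^ ((x << 7) & 0x9d2c5680u); }

int main(void)
{
    uint32_t h = fwd(0x5b3e0be0u);        // 0xc63a5be0, as in the question
    printf("%08x -> %08x\n", (unsigned)h, (unsigned)invert_xor_shl_and(h, 7, 0x9d2c5680u));
    return 0;
}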
Since the original question was how to "bruteforce" instead of solve, here's something that I eventually came up with which works just as well. Obviously it's prone to errors depending on input (there might be more than one result).
from peachpy import *
from peachpy.x86_64 import *
input = 0xc63a5be0
x = Argument(uint32_t)
with Function("DotProduct", (x,), uint32_t) as asm_function:
LOAD.ARGUMENT(edx, x) # EDX = the input value x
MOV(esi, 0xffffffff)
with Loop() as loop:
TEST(esi,esi)
JZ(loop.end)
MOV(eax, esi)
SHL(eax, 0x7)
AND(eax, 0x9d2c5680)
XOR(eax, esi)
CMP(eax, edx)
JZ(loop.end)
SUB(esi, 0x1)
JMP(loop.begin)
RETURN(esi)
#Read Assembler Return
abi = peachpy.x86_64.abi.detect()
encoded_function = asm_function.finalize(abi).encode()
python_function = encoded_function.load()
print(hex(python_function(input)))

Approximately reversible function from a pair of floats to a single float?

Some time ago, I came across a pair of functions in some CAD code to encode a set of coordinates (a pair of floats) as a single float (to use as a hash key), and then to unpack that single float back into the original pair.
The forward and backward functions only used standard mathematical operations -- no magic fiddling with bit-level representations of floats, no extracting and interleaving individual digits or anything like that. Obviously the reversal is not perfect in practice, because you lose considerable precision going from two floats to one, but according to the Wikipedia page for the function it should have been exactly invertible given infinite precision arithmetic.
Unfortunately, I don't work on that code anymore, and I've forgotten the name of the function so I can't look it up on Wikipedia again. Anybody know of a named mathematical function that meets that description?
In a comment, OP clarified that the desired function should map two IEEE-754 binary64 operands into a single such operand. One way to accomplish this is to map each binary64 (double-precision) number into a non-negative integer in [0, 2^26-2], and then use a well-known pairing function to map two such integers into a single non-negative integer in the interval [0, 2^52), which is exactly representable in a binary64, which has 52 stored significand ("mantissa") bits.
As the binary64 operands are unrestricted in range per a comment by OP, all binades should be representable with equal relative accuracy, and we need to handle zeroes, infinities, and NaNs as well. For this reason I chose log2() to compress the data. Zeros are treated as the smallest binary64 subnormal, 0x1.0p-1074, which has the consequence that both 0x1.0p-1074 and zero will decompress into zero. The result from log2 falls into the range [-1074, 1024). Since we need to store the sign of the operand, we bias the logarithm value by 1074, giving a result in [0, 2098), then scale that to almost [0, 2^25), round to the nearest integer, and finally affix the sign of the original binary64 operand. The motivation for almost utilizing the complete range is to leave a little bit of room at the top of the range for special encodings for infinity and NaN (so a single canonical NaN encoding is used).
Since pairing functions known from the literature operate on natural numbers only, we need a mapping from whole numbers to natural numbers. This is easily accomplished by mapping negative whole numbers to odd natural numbers, while positive whole numbers are mapped to even natural numbers. Thus our operands are mapped from (-2^25, +2^25) to [0, 2^26-2]. The pairing function then combines two integers in [0, 2^26-2] into a single integer in [0, 2^52).
Different pairing functions known from the literature differ in their scrambling behavior, which may impact the hashing functionality mentioned in the question. They may also differ in their performance. Therefore I am offering a selection of four different pairing functions for the pair() / unpair() implementations in the code below. Please see the comments in the code for corresponding literature references.
Unpacking of the packed operand involves applying the inverse of each packing step in reverse order. The unpairing function gives us two natural integers. These are mapped to two whole numbers, which are mapped to two logarithm values, which are then exponentiated with exp2() to recover the original numbers, with a bit of added work to get special values and the sign correct.
While logarithms are represented with a relative accuracy on the order of 10^-8, the expected maximum relative error in final results is on the order of 10^-5 due to the well-known error magnification property of exponentiation. Maximum relative error observed for a pack() / unpack() round-trip in extensive testing was 2.167e-5.
Below is my ISO C99 implementation of the algorithm together with a portion of my test framework. This should be portable to other programming languages with a modicum of effort.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <float.h>
#define SZUDZIK_ELEGANT_PAIRING (1)
#define ROZSA_PETER_PAIRING (2)
#define ROSENBERG_STRONG_PAIRING (3)
#define CHRISTOPH_MICHEL_PAIRING (4)
#define PAIRING_FUNCTION (ROZSA_PETER_PAIRING)
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
double uint64_as_double (uint64_t a)
{
double r;
memcpy (&r, &a, sizeof r);
return r;
}
#define LOG2_BIAS (1074.0)
#define CLAMP_LOW (exp2(-LOG2_BIAS))
#define SCALE (15993.5193125)
#define NAN_ENCODING (33554430)
#define INF_ENCODING (33554420)
/* check whether argument is an odd integer */
int is_odd_int (double a)
{
return (-2.0 * floor (0.5 * a) + a) == 1.0;
}
/* compress double-precision number into an integer in (-2**25, +2**25) */
double compress (double x)
{
double t;
t = fabs (x);
t = (t < CLAMP_LOW) ? CLAMP_LOW : t;
t = rint ((log2 (t) + LOG2_BIAS) * SCALE);
if (isnan (x)) t = NAN_ENCODING;
if (isinf (x)) t = INF_ENCODING;
return copysign (t, x);
}
/* expand integer in (-2**25, +2**25) into double-precision number */
double decompress (double x)
{
double s, t;
s = fabs (x);
t = s / SCALE;
if (s == (NAN_ENCODING)) t = NAN;
if (s == (INF_ENCODING)) t = INFINITY;
return copysign ((t == 0) ? 0 : exp2 (t - LOG2_BIAS), x);
}
/* map whole numbers to natural numbers. Here: (-2^25, +2^25) to [0, 2^26-2] */
double map_Z_to_N (double x)
{
return (x < 0) ? (-2 * x - 1) : (2 * x);
}
/* Map natural numbers to whole numbers. Here: [0, 2^26-2] to (-2^25, +2^25) */
double map_N_to_Z (double x)
{
return is_odd_int (x) ? (-0.5 * (x + 1)) : (0.5 * x);
}
#if PAIRING_FUNCTION == SZUDZIK_ELEGANT_PAIRING
/* Matthew Szudzik, "An elegant pairing function." In Wolfram Research (ed.)
Special NKS 2006 Wolfram Science Conference, pp. 1-12.
Here: map two natural numbers in [0, 2^26-2] to natural number in [0, 2^52),
and vice versa
*/
double pair (double x, double y)
{
return (x != fmax (x, y)) ? (y * y + x) : (x * x + x + y);
}
void unpair (double z, double *x, double *y)
{
double sqrt_z = trunc (sqrt (z));
double sqrt_z_diff = z - sqrt_z * sqrt_z;
*x = (sqrt_z_diff < sqrt_z) ? sqrt_z_diff : sqrt_z;
*y = (sqrt_z_diff < sqrt_z) ? sqrt_z : (sqrt_z_diff - sqrt_z);
}
#elif PAIRING_FUNCTION == ROZSA_PETER_PAIRING
/*
Rozsa Peter, "Rekursive Funktionen" (1951), p. 44. Via David R. Hagen's blog,
https://drhagen.com/blog/superior-pairing-function/
Here: map two natural numbers in [0, 2^26-2] to natural number in [0, 2^52),
and vice versa
*/
double pair (double x, double y)
{
double mx = fmax (x, y);
double mn = fmin (x, y);
double sel = (mx == x) ? 0 : 1;
return mx * mx + mn * 2 + sel;
}
void unpair (double z, double *x, double *y)
{
double sqrt_z = trunc (sqrt (z));
double sqrt_z_diff = z - sqrt_z * sqrt_z;
double half_diff = trunc (sqrt_z_diff * 0.5);
*x = is_odd_int (sqrt_z_diff) ? half_diff : sqrt_z;
*y = is_odd_int (sqrt_z_diff) ? sqrt_z : half_diff;
}
#elif PAIRING_FUNCTION == ROSENBERG_STRONG_PAIRING
/*
A. L. Rosenberg and H. R. Strong, "Addressing arrays by shells",
IBM Technical Disclosure Bulletin, Vol. 14, No. 10, March 1972,
pp. 3026-3028.
Arnold L. Rosenberg, "Allocating storage for extendible arrays,"
Journal of the ACM, Vol. 21, No. 4, October 1974, pp. 652-670.
Corrigendum, Journal of the ACM, Vol. 22, No. 2, April 1975, p. 308.
Matthew P. Szudzik, "The Rosenberg-Strong Pairing Function", 2019
https://arxiv.org/abs/1706.04129
Here: map two natural numbers in [0, 2^26-2] to natural number in [0, 2^52),
and vice versa
*/
double pair (double x, double y)
{
double mx = fmax (x, y);
return mx * mx + mx + x - y;
}
void unpair (double z, double *x, double *y)
{
double sqrt_z = trunc (sqrt (z));
double sqrt_z_diff = z - sqrt_z * sqrt_z;
*x = (sqrt_z_diff < sqrt_z) ? sqrt_z_diff : sqrt_z;
*y = (sqrt_z_diff < sqrt_z) ? sqrt_z : (2 * sqrt_z - sqrt_z_diff);
}
#elif PAIRING_FUNCTION == CHRISTOPH_MICHEL_PAIRING
/*
Christoph Michel, "Enumerating a Grid in Spiral Order", September 7, 2016,
https://cmichel.io/enumerating-grid-in-spiral-order. Via German Wikipedia,
https://de.wikipedia.org/wiki/Cantorsche_Paarungsfunktion
Here: map two natural numbers in [0, 2^26-2] to natural number in [0, 2^52),
and vice versa
*/
double pair (double x, double y)
{
double mx = fmax (x, y);
return mx * mx + mx + (is_odd_int (mx) ? (x - y) : (y - x));
}
void unpair (double z, double *x, double *y)
{
double sqrt_z = trunc (sqrt (z));
double sqrt_z_diff = z - sqrt_z * (sqrt_z + 1);
double min_clamp = fmin (sqrt_z_diff, 0);
double max_clamp = fmax (sqrt_z_diff, 0);
*x = is_odd_int (sqrt_z) ? (sqrt_z + min_clamp) : (sqrt_z - max_clamp);
*y = is_odd_int (sqrt_z) ? (sqrt_z - max_clamp) : (sqrt_z + min_clamp);
}
#else
#error unknown PAIRING_FUNCTION
#endif
/* Lossy pairing function for double precision numbers. The maximum round-trip
relative error is about 2.167e-5
*/
double pack (double a, double b)
{
double c, p, q, s, t;
p = compress (a);
q = compress (b);
s = map_Z_to_N (p);
t = map_Z_to_N (q);
c = pair (s, t);
return c;
}
/* Unpairing function for double precision numbers. The maximum round-trip
relative error is about 2.167e-5 */
void unpack (double c, double *a, double *b)
{
double s, t, u, v;
unpair (c, &s, &t);
u = map_N_to_Z (s);
v = map_N_to_Z (t);
*a = decompress (u);
*b = decompress (v);
}
int main (void)
{
double a, b, c, ua, ub, relerr_a, relerr_b;
double max_relerr_a = 0, max_relerr_b = 0;
#if PAIRING_FUNCTION == SZUDZIK_ELEGANT_PAIRING
printf ("Testing with Szudzik's elegant pairing function\n");
#elif PAIRING_FUNCTION == ROZSA_PETER_PAIRING
printf ("Testing with Rozsa Peter's pairing function\n");
#elif PAIRING_FUNCTION == ROSENBERG_STRONG_PAIRING
printf ("Testing with Rosenberg-Strong pairing function\n");
#elif PAIRING_FUNCTION == CHRISTOPH_MICHEL_PAIRING
printf ("Testing with C. Michel's spiral pairing function\n");
#else
#error unknown PAIRING_FUNCTION
#endif
do {
a = uint64_as_double (KISS64);
b = uint64_as_double (KISS64);
c = pack (a, b);
unpack (c, &ua, &ub);
if (!isnan(ua) && !isinf(ua) && (ua != 0)) {
relerr_a = fabs ((ua - a) / a);
if (relerr_a > max_relerr_a) {
printf ("relerr_a= %15.8e a=% 23.16e ua=% 23.16e\n",
relerr_a, a, ua);
max_relerr_a = relerr_a;
}
}
if (!isnan(ub) && !isinf(ub) && (ub != 0)) {
relerr_b = fabs ((ub - b) / b);
if (relerr_b > max_relerr_b) {
printf ("relerr_b= %15.8e b=% 23.16e ub=% 23.16e\n",
relerr_b, b, ub);
max_relerr_b = relerr_b;
}
}
} while (1);
return EXIT_SUCCESS;
}
I don't know the name of the function, but you can normalize the 2 values x and y to the range [0, 1] using some methods like
X = arctan(x)/π + 0.5
Y = arctan(y)/π + 0.5
At this point X = 0.a1a2a3... and Y = 0.b1b2b3... Now just interleave the digits and we get a single float with value 0.a1b1a2b2a3b3...
At the receiving side just slice the digits apart again and get back x and y from X and Y
x = tan((X - 0.5)π)
y = tan((Y - 0.5)π)
This works in decimal but also works in binary, and of course it'll be easier to manipulate the binary digits directly. But probably you'll need to normalize the values to [0, ½] or [½, 1] to make the exponents the same. You can also avoid the bit manipulation entirely by utilizing the fact that the significand part is always 24 bits long, so we can just store x and y in the high and low parts of the significand. The resulting paired value is
r = ⌊X×2^12⌋/2^12 + Y/2^12
⌊x⌋ is the floor symbol. Now that's a pure math solution!
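As a sketch of how that might look in code (my own variant: it also quantizes Y to 12 bits so the two halves of the fraction never overlap, and uses the atan/tan normalization suggested above), something like this round-trips each input with roughly 12 bits of accuracy:
#include <math.h>
#include <stdio.h>

static const float PI_F = 3.14159265f;

float pack_pair(float x, float y)
{
    float X = atanf(x) / PI_F + 0.5f;               // squash into (0,1)
    float Y = atanf(y) / PI_F + 0.5f;
    // X goes into fraction bits 1..12, Y into bits 13..24
    return floorf(X * 4096.0f) / 4096.0f + floorf(Y * 4096.0f) / 16777216.0f;
}

void unpack_pair(float r, float *x, float *y)
{
    float X = floorf(r * 4096.0f) / 4096.0f;        // high 12 bits
    float Y = (r - X) * 4096.0f;                    // low 12 bits
    *x = tanf((X - 0.5f) * PI_F);
    *y = tanf((Y - 0.5f) * PI_F);
}

int main(void)
{
    float x, y;
    unpack_pair(pack_pair(12.345f, -0.0071f), &x, &y);
    printf("%g %g\n", x, y);   // approximately the inputs; only ~12 bits of each survive
    return 0;
}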
If you know the magnitudes of the values are always close to each other, then you can improve the process by aligning the values' radix points to normalize them and taking the high 12 bits of each significand to merge together; no need to use atan.
In case the range of the values is limited, you can normalize with this formula instead to avoid the loss of precision due to atan:
X = (x - min)/(max - min)
Y = (y - min)/(max - min)
But in this case there's a way to combine the values with just pure mathematical functions. Suppose the values are in the range [0, max]; then the combined value is r = x*max + y. To reverse the operation:
x = ⌊r/max⌋;
y = r mod max
If min is not zero then just shift the range accordingly
Read more in Is there a mathematical function that converts two numbers into one so that the two numbers can always be extracted again?

How to get the first x leading binary digits of 5**x without big integer multiplication

I want to efficiently and elegantly compute with perfect precision the first x leading binary digits of 5**x.
For example 5**20 is 10101101011110001110101111000101101011000110001. The first 8 leading binary digits are 10101101.
In my use case, x is only in the range 1 to 60. I don't want to create a table. A solution using 64-bit integers would be fine. I just don't want to use big integers.
first x leading binary digits of 5**x without big integer multiplication
efficiently and elegantly compute with perfect precision the first x leading binary digits of 5**x
"compute with perfect precision" leaves out pow(). Too many implementations will return an imperfect result and FP math might not use 64 bit precision, even with long double.
Form an integer with a 64-bit whole number part .ms and a 64-bit fraction part .ls. Then loop 60 times, multiplying by 5 and dividing by 2 as needed, to keep the leading bits from growing too big.
Note there is some precision lost in the fraction for N > 42, yet that is not significant enough to affect the whole number part OP is seeking.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
typedef struct {
uint64_t ms, ls;
} uint128;
// Simplifications possible here, leave for OP
uint128 times5(uint128 x) {
uint128 y = x;
for (int i=1; i<5; i++) {
// y += x
y.ms += x.ms;
y.ls += x.ls;
if (y.ls < x.ls) y.ms++;
}
return y;
}
uint128 div2(uint128 x) {
x.ls = (x.ls >> 1) | (x.ms << 63);
x.ms >>= 1;
return x;
}
int main(void) {
uint128 y = {.ms = 1};
uint64_t pow2 = 2;
for (unsigned x = 1; x <= 60; x++) {
y = times5(y);
while (y.ms >= pow2) {
y = div2(y);
}
printf("%2u %16" PRIX64 ".%016" PRIX64 "\n", x, y.ms, y.ls);
pow2 <<= 1;
}
}
Output
whole part.fraction
1 1.4000000000000000
2 3.2000000000000000
3 7.D000000000000000
4 9.C400000000000000
...
57 14643E5AE44D12B.8F5FEE5AA432560D
58 32FA9BE33AC0AEC.E66FD3E29A7DD720
59 7F7285B812E1B50.401791B6823A99D0
60 9F4F2726179A224.501D762422C94044
^-------------^ This is the part OP is seeking.
The key to solving this task is: divide and conquer. Form an algorithm (which here is simply *5 and /2 as needed) and code a type and functions to do each small step.
Is a loop of 60 efficient? Perhaps not. Another approach would use exponentiation by squaring. That would certainly be worth it for large N, yet for N == 60 a loop was simple enough for a quick turn.
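For completeness, here is a rough sketch of that exponentiation-by-squaring variant. It keeps a normalized 64-bit mantissa plus a separate binary exponent and relies on the common compiler extension __uint128_t for the double-width products, so it is less portable than the loop above; the ~64 bits of working precision are plenty for the leading 60 bits requested here.
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t man; int exp; } fx;   // value = man * 2^(exp-63), top bit of man is set

static fx fx_mul(fx a, fx b)
{
    __uint128_t p = (__uint128_t)a.man * b.man; // 127- or 128-bit product
    int e = a.exp + b.exp;
    if (p >> 127) { p >>= 64; e += 1; }         // renormalize so the top bit stays set
    else          { p >>= 63; }
    return (fx){ (uint64_t)p, e };
}

static fx pow5(unsigned n)
{
    fx r = { 1ull << 63, 0 };                   // 1.0
    fx b = { 5ull << 61, 2 };                   // 5 = 1.25 * 2^2
    while (n) {                                  // square-and-multiply
        if (n & 1) r = fx_mul(r, b);
        b = fx_mul(b, b);
        n >>= 1;
    }
    return r;
}

int main(void)
{
    for (unsigned n = 1; n <= 60; n++) {
        fx p = pow5(n);
        uint64_t lead = p.man >> (64 - n);      // first n leading binary digits of 5^n
        printf("%2u %llX\n", n, (unsigned long long)lead);
    }
    return 0;
}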
5^n = 2^(-n) • 10^n
Using this identity, we can easily compute the leading N base-2 digits of (the nearest integer to) any given power of 5.
This code example is in C, but it's the same idea in any other language.
Example output: https://wandbox.org/permlink/Fs205DDzQR0gaLSo
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#define STATIC_ASSERT(CONDITION) ((void)sizeof(int[(CONDITION) ? 1 : -1]))
uint64_t pow5_leading_digits(double power, uint8_t ndigits)
{
STATIC_ASSERT(DBL_MANT_DIG <= 64);
double pow5 = exp2(-power) * pow(10, power);
const double binary_digits = ceil(log2(pow5));
assert(ndigits <= DBL_MANT_DIG);
if (!ndigits || binary_digits < 0)
return 0;
// If pow5 can fit in the number of digits requested, return it
if (binary_digits <= ndigits)
return pow5;
// If pow5 is too big to return, divide by 2 until it fits
if (binary_digits > DBL_MANT_DIG)
pow5 /= exp2(binary_digits - DBL_MANT_DIG + 1);
return (uint64_t)pow5 >> (DBL_MANT_DIG - ndigits);
}
Edit: Now limits the returned value to those exactly representable with doubles.

how to encode 27 vector3's into a 0-256 value?

I have 27 combinations of 3 values from -1 to 1 of type:
Vector3(0,0,0);
Vector3(-1,0,0);
Vector3(0,-1,0);
Vector3(0,0,-1);
Vector3(-1,-1,0);
... up to
Vector3(0,1,1);
Vector3(1,1,1);
I need to convert them to and from an 8-bit sbyte / byte array.
One solution is to use the digits of the 0-255 value: the first digit is X, the second digit is Y and the third is Z...
so
Vector3(-1,1,1) becomes 022,
Vector3(1,-1,-1) becomes 200,
Vector3(1,0,1) becomes 212...
I'd prefer to encode it in a more compact way, perhaps using bytes (which I am clueless about), because the above solution uses a lot of multiplications and round functions to decode. Do you have some suggestions, please? The other option is to write 27 if conditions to write the Vector3 combination to an array, which seems inefficient.
Thanks to Evil Tak for the guidance; I changed the code a bit to add 0-1 values to the first bits and to adapt it for Unity3D:
function Pack4(x:int,y:int,z:int,w:int):sbyte {
var b: sbyte = 0;
b |= (x + 1) << 6;
b |= (y + 1) << 4;
b |= (z + 1) << 2;
b |= (w + 1);
return b;
}
function unPack4(b:sbyte):Vector4 {
var v : Vector4;
v.x = ((b & 0xC0) >> 6) - 1; //0xC0 == 1100 0000
v.y = ((b & 0x30) >> 4) - 1; // 0x30 == 0011 0000
v.z = ((b & 0xC) >> 2) - 1; // 0xC == 0000 1100
v.w = (b & 0x3) - 1; // 0x3 == 0000 0011
return v;
}
I assume your values are float, not integer, so bit operations will not improve speed much compared to the cost of converting to an integer type. So my bet is that using the full range will be better. I would do this for the 3D case:
8 bit -> 256 values
3D -> pow(256,1/3) = ~ 6.349 values per dimension
6^3 = 216 < 256
So packing of (x,y,z) looks like this:
BYTE p;
p =floor((x+1.0)*3.0);
p+=floor((y+1.0)*3.0*6.0);
p+=floor((z+1.0)*3.0*6.0*6.0);
The idea is to convert <-1,+1> to the range <0,1>, hence the +1.0 and the *3.0 instead of *6.0 (the division by 2 is folded into the constant), and then just multiply the digit to the correct place in the final BYTE.
and unpacking of p looks like this:
x=p%6; x=(x/3.0)-1.0; p/=6;
y=p%6; y=(y/3.0)-1.0; p/=6;
z=p%6; z=(z/3.0)-1.0;
This way you use 216 of the 256 values, which is much better than just 2 bits (4 values) per dimension. Your 4D case would look similar: instead of the constants 3.0,6.0 use floor(pow(256,1/4))=4, i.e. 2.0,4.0, but beware the case when p=256, or use 2 bits per dimension and a bit approach like the accepted answer does.
If you need real speed, you can optimize this by forcing the float that holds the packed BYTE result to a specific exponent and extracting the mantissa bits as your packed BYTE directly. As the result will be in <0,216> you can add any bigger number to it. See IEEE 754-1985 for details, but you want the mantissa to align with your BYTE, so if you add a number like 2^23 to p then the lowest 8 bits of the float are your packed value directly (as the MSB 1 is not present in the mantissa) and no expensive conversion is needed.
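A quick stand-alone illustration of that mantissa-alignment trick (the constants here are just for the example): adding 2^23 to a small non-negative integer stored in a float forces the exponent to 150, so the integer ends up verbatim in the low mantissa bits.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 216.0f + 8388608.0f;   // p = 216 plus 2^23
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  // read the float's bit pattern
    printf("low 8 bits: %u\n", (unsigned)(bits & 0xFF));   // prints 216
    return 0;
}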
In case you got just {-1,0,+1} instead of <-1,+1>, then of course you should use an integer approach like bit packing with 2 bits per dimension, or use a LUT of all 3^3 = 27 possibilities and pack the entire vector in 5 bits.
The encoding would look like this:
int enc[3][3][3] = { 0,1,2, ... 24,25,26 };
p=enc[x+1][y+1][z+1];
And decoding:
int dec[27][3] = { {-1,-1,-1},.....,{+1,+1,+1} };
x=dec[p][0];
y=dec[p][1];
z=dec[p][2];
This should be fast enough, and if you have many vectors you can pack each p into 5 bits ... to save even more memory space.
One way is to store the component of each vector in every 2 bits of a byte.
Converting a vector component value to and from the 2 bit stored form is as simple as adding and subtracting one, respectively.
-1 (1111 1111 as a signed byte) <-> 00 (in binary)
0 (0000 0000 in binary) <-> 01 (in binary)
1 (0000 0001 in binary) <-> 10 (in binary)
The packed 2 bit values can be stored in a byte in any order of your preference. I will use the following format: 00XXYYZZ where XX is the converted (packed) value of the X component, and so on. The 0s at the start aren't going to be used.
A vector will then be packed in a byte as follows:
byte Pack(Vector3<int> vector) {
byte b = 0;
b |= (vector.x + 1) << 4;
b |= (vector.y + 1) << 2;
b |= (vector.z + 1);
return b;
}
Unpacking a vector from its byte form will be as follows:
Vector3<int> Unpack(byte b) {
Vector3<int> v = new Vector<int>();
v.x = ((b & 0x30) >> 4) - 1; // 0x30 == 0011 0000
v.y = ((b & 0xC) >> 2) - 1; // 0xC == 0000 1100
v.z = (b & 0x3) - 1; // 0x3 == 0000 0011
return v;
}
Both the above methods assume that the input is valid, i.e. All components of vector in Pack are either -1, 0 or 1 and that all two-bit sections of b in Unpack have a (binary) value of either 00, 01 or 10.
Since this method uses bitwise operators, it is fast and efficient. If you wish to compress the data further, you could try using the 2 unused bits too, and convert every 3 two-bit elements processed to a vector.
The most compact way is to write a 27-digit number in base 3 (using the shift -1 -> 0, 0 -> 1, 1 -> 2).
The value of this number will range from 0 to 3^27-1 = 7625597484986, which takes 43 bits to encode, i.e. 6 bytes (and 5 spare bits).
This is a little saving compared to a packed representation with 4 two-bit numbers packed in a byte (hence 7 bytes/56 bits in total).
An interesting variant is to group the base 3 digits five by five in bytes (hence numbers 0 to 242). You will still require 6 bytes (and no spare bits), but the decoding of the bytes can easily be hard-coded as a table of 243 entries.
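A small C sketch of that grouping (my own illustration; it assumes the 27 components are already collected in an array c[27] of values in {-1,0,+1}): five base-3 digits fit in one byte because 3^5 = 243 <= 255, so 27 digits need 6 bytes, with the last byte holding only 2 digits.
#include <stdint.h>
#include <stdio.h>

void pack27(const int c[27], uint8_t out[6])
{
    for (int b = 0, i = 0; b < 6; b++) {
        int v = 0;
        for (int d = 0; d < 5 && i < 27; d++, i++)
            v = v * 3 + (c[i] + 1);        // shift -1..+1 to digit 0..2
        out[b] = (uint8_t)v;
    }
}

void unpack27(const uint8_t in[6], int c[27])
{
    for (int b = 0, i = 0; b < 6; b++) {
        int digits = (b == 5) ? 2 : 5;     // the last byte holds only 2 digits
        int v = in[b];
        for (int d = digits - 1; d >= 0; d--) {
            c[i + d] = v % 3 - 1;          // digit 0..2 back to -1..+1
            v /= 3;
        }
        i += digits;
    }
}

int main(void)
{
    int c[27], d[27];
    uint8_t buf[6];
    for (int i = 0; i < 27; i++) c[i] = i % 3 - 1;   // some test pattern
    pack27(c, buf);
    unpack27(buf, d);
    for (int i = 0; i < 27; i++)
        if (c[i] != d[i]) { puts("mismatch"); return 1; }
    puts("round trip OK");
    return 0;
}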

Arduino: Formula to convert byte

I'm looking for a way to modify a binary byte value on an Arduino.
Because of the hardware, it's necessary to split a two-digit number into two 4-bit halves.
The code to set the output is wire.write(byte, 0xFF), which sets all outputs high.
0xFF = binary 1111 1111
The formula should convert a value like this:
e.g. the number 35 is binary 0010 0011,
but for my use it should be encoded as 0011 0101, which read as a plain binary number would be 53.
The first 4 bits are for a BCD-input IC which displays the 5 from 35, the second 4 bits are for a BCD-input IC which displays the 3 from 35.
Does anybody have an idea how to convert this in code, or a mathematical formula?
Possible numbers are from 00 to 59.
Thank you for your help
To convert a value n between 0 and 99 to BCD:
((n / 10) * 16) + (n % 10)
assuming n is an integer and thus / is doing integer division; also assumes this will be stored in an unsigned byte.
(If this is not producing the desired result, please either explain how it is incorrect for the example given, or provide a different example for which it is incorrect.)
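A short self-contained sketch of that formula and its inverse (plain C; the wire.write call from the question is left out, and the helper names are my own):
#include <stdint.h>
#include <stdio.h>

uint8_t to_bcd(uint8_t n)   { return (uint8_t)((n / 10) * 16 + (n % 10)); }  // high nibble = tens, low nibble = units
uint8_t from_bcd(uint8_t b) { return (uint8_t)((b / 16) * 10 + (b % 16)); }

int main(void)
{
    uint8_t b = to_bcd(35);
    // prints: 35 -> 0x35 (53 as a plain number), back to 35
    printf("35 -> 0x%02X (%u), back to %u\n", (unsigned)b, (unsigned)b, (unsigned)from_bcd(b));
    return 0;
}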
#include <string>
#include <cstdio>

int num = 35; // e.g. 35 from the question; any number from 0 to 59
int tens = num/10;
int units = num-(tens*10);
// Make string objects for the binary digits
std::string tensbinary, unitsbinary;
int quotient = tens;
char buffer[2];
// Convert the tens digit to 4 bits (least significant bit first)
for (int i = 0; i < 4; i++)
{
    snprintf(buffer, sizeof buffer, "%d", quotient % 2);
    quotient /= 2;
    tensbinary.append(buffer);
}
// Repeat the above loop for the units, filling unitsbinary
// Now join the two together
tensbinary.append(unitsbinary);
I don't know if this will work, but still, you might be able to extrapolate based on the available information in my code.
The last thing you need to do is convert the string to binary.
