Floating point data format sign+exponent - math

I am receiving data over UART from a heat meter, but I need some help understanding how I should deal with the data.
I have the documentation, but that is not enough for me; I have too little experience with this kind of calculation.
Maybe someone with the right skills could explain how it should be done, with a better example than the one in the documentation.
One value consists of the following bytes:
[number of bytes][sign+exponent] (integer)
(integer) is the register data value. The length of the integer value is
specified by [number of bytes]. [sign+exponent] is an 8-bit value that
specifies the sign of the data value and sign and value of the exponent. The
meaning of the individual bits in the [sign+exponent] byte is shown below:
Examples:
-123.45 = 04h, C2h, 0h, 0h, 30h, 39h
87654321*10^3 = 04h, 03h, 05h, 39h, 7Fh, B1h
255*10^3 = 01h, 03h, FFh
And now to one more example with actual data.
This is the information that I have from the documentation about this.
This is some data that I have received from my heat meter
10 00 56 25 04 42 00 00 1B E4
So in my example then 04 is the [number of bytes], 42 is the [sign+exponent] and 00 00 1B E4 is the (integer).
But I do not know how I should make the calculation to receive the actual value.
Any help?

Your data appears to be big-endian, according to your example. So here's how you break those bytes into the fields you need using bit shifting and masking.
n = b[0]
SI = (b[1] & 0x80) >> 7
SE = (b[1] & 0x40) >> 6
exponent = b[1] & 0x3f
integer = 0
for i = 0 to n-1:
    integer = (integer << 8) + b[2+i]

The sign of the mantissa is obtained from the MSb of the [sign+exponent] byte by masking: (byte & 80h) != 0 => SI = -1, otherwise SI = +1.
The sign of the exponent is obtained similarly: (byte & 40h) != 0 => SE = -1, otherwise SE = +1.
The exponent value is EXP = byte & 3Fh.
The mantissa INT is the binary number formed by the four other bytes, which can be read as a single integer (but mind the endianness).
Finally, compute SI * INT * pow(10, SE * EXP).
In your example, SI = 1, SE = -1, EXP = 2, INT = 7140, hence
1 * 7140 * pow(10, -1 * 2) = +71.4
It is not in the scope of this answer to explain how to implement this efficiently.
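The steps above can be sketched in Python as follows. This is only an illustration under the assumptions already made in the answers (big-endian integer field, bit 7 = value sign, bit 6 = exponent sign, bits 5..0 = exponent); the function name decode_register is made up for this example and is not taken from the meter documentation.

```python
from decimal import Decimal

def decode_register(frame: bytes) -> Decimal:
    """Decode [number of bytes][sign+exponent](integer) into a decimal value."""
    n = frame[0]                        # length of the integer field in bytes
    sign_exp = frame[1]
    si = -1 if sign_exp & 0x80 else 1   # bit 7: sign of the value
    se = -1 if sign_exp & 0x40 else 1   # bit 6: sign of the exponent
    exp = sign_exp & 0x3F               # bits 5..0: magnitude of the exponent
    # Assumption: the integer is big-endian, as in the documentation examples.
    integer = int.from_bytes(frame[2:2 + n], "big")
    # scaleb() shifts the decimal point, so no binary rounding error is introduced.
    return Decimal(si * integer).scaleb(se * exp)

# The bytes 04 42 00 00 1B E4 from the question
print(decode_register(bytes([0x04, 0x42, 0x00, 0x00, 0x1B, 0xE4])))
```

Decimal is used instead of float here so that a value such as 7140 * 10^-2 is represented exactly.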

Related

Why does XOR and subtraction with the same HEX values give the same result?

For example:
7A - 20 = 5A
7A XOR 20 = 5A
Of course, this will work the same using different values. Why does this occur exactly?
It's only the same if there are no borrows,
i.e. no 0 - 1 at any bit-positions, only 1-0 = 1 or 1-1 = 0.
That's the same as saying that the first operand (the minuend) has set bits everywhere the second operand (subtrahend) does.
i.e. if x & y == y, then x-y == x^y.
The simplest counter-example:
0 - 1 = 0xFF - borrow propagates all the way to the top of the register.
0 ^ 1 = 0x01 - XOR is add-without-carry, it just flips the bits in one operand where a bit is set in the other operand. (i.e. you could look at it as flipping no bits in 1, leaving 1. Or as flipping the low bit in 0, producing 1.)
XOR is commutative (x ^ y == y ^ x), subtraction is not (x-y is usually different from y-x, except for special-case results like 0 or 0x80)
Repeating XOR with the same value undoes it, e.g. 5A ^ 20 = 7A flips the bit back on.
But repeating subtraction doesn't: 5A - 20 = 3A.
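A small Python check of the rule above, assuming 8-bit registers; the helper name sub_equals_xor is hypothetical. It only demonstrates the direction stated in the answer: whenever x & y == y, subtraction and XOR agree.

```python
def sub_equals_xor(x: int, y: int, bits: int = 8) -> bool:
    """True if x - y (wrapped to `bits` bits) equals x ^ y."""
    mask = (1 << bits) - 1
    return ((x - y) & mask) == (x ^ y)

# 7A has bit 0x20 set, so no borrow occurs and both give 5A.
print(sub_equals_xor(0x7A, 0x20))
# 5A does not have bit 0x20 set, so the subtraction borrows: 3A versus 7A.
print(sub_equals_xor(0x5A, 0x20))

# Exhaustive check of the stated rule for all 8-bit operand pairs.
for x in range(256):
    for y in range(256):
        if x & y == y:
            assert sub_equals_xor(x, y)
```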

how to encode 27 vector3's into a 0-256 value?

I have 27 combinations of 3 values from -1 to 1 of type:
Vector3(0,0,0);
Vector3(-1,0,0);
Vector3(0,-1,0);
Vector3(0,0,-1);
Vector3(-1,-1,0);
... up to
Vector3(0,1,1);
Vector3(1,1,1);
I need to convert them to and from an 8-bit sbyte / byte array.
One solution is to treat the value as three digits: the first digit is X, the second digit is Y and the third is Z...
so
Vector3(-1,1,1) becomes 022,
Vector3(1,-1,-1) becomes 200,
Vector3(1,0,1) becomes 212...
I'd prefer to encode it in a more compact way, perhaps using bytes (which I am clueless about), because the above solution uses a lot of multiplications and round functions to decode. Do you have any suggestions? The other option is to write 27 if conditions to write the Vector3 combination to an array, which seems inefficient.
Thanks to Evil Tak for the guidance. I changed the code a bit to add 0-1 values to the first bit, and to adapt it for Unity3D:
function Pack4(x:int,y:int,z:int,w:int):sbyte {
var b: sbyte = 0;
b |= (x + 1) << 6;
b |= (y + 1) << 4;
b |= (z + 1) << 2;
b |= (w + 1);
return b;
}
function unPack4(b:sbyte):Vector4 {
var v : Vector4;
v.x = ((b & 0xC0) >> 6) - 1; //0xC0 == 1100 0000
v.y = ((b & 0x30) >> 4) - 1; // 0x30 == 0011 0000
v.z = ((b & 0xC) >> 2) - 1; // 0xC == 0000 1100
v.w = (b & 0x3) - 1; // 0x3 == 0000 0011
return v;
}
I assume your values are floats, not integers,
so bit operations will not improve speed much compared to the cost of converting to an integer type. My bet is that using the full range will be better. I would do this for the 3D case:
8 bit -> 256 values
3D -> pow(256,1/3) = ~ 6.349 values per dimension
6^3 = 216 < 256
So packing of (x,y,z) looks like this:
BYTE p;
p =floor((x+1.0)*3.0);
p+=floor((y+1.0)*3.0)*6;
p+=floor((z+1.0)*3.0)*6*6;
The idea is to map <-1,+1> to the range <0,1> (hence the +1.0, and *3.0 instead of *6.0) and then multiply each component into its place in the final BYTE. Note that a component of exactly +1.0 yields 6, which must be clamped to 5 before packing.
and unpacking of p looks like this:
x=p%6; x=(x/3.0)-1.0; p/=6;
y=p%6; y=(y/3.0)-1.0; p/=6;
z=p%6; z=(z/3.0)-1.0;
This way you use 216 of the 256 values, which is much better than just 2 bits (4 values) per dimension. Your 4D case would look similar; just use a different constant, floor(pow(256,1/4))=4, so use 2.0,4.0 instead of 3.0,6.0, but beware of the case when p=256, or use 2 bits per dimension and the bit approach as the accepted answer does.
If you need real speed, you can optimize this by forcing the float that holds the packed result to a specific exponent and extracting the mantissa bits as your packed BYTE directly. As the result is in the range 0 to 215, you can add any bigger number to it. See IEEE 754-1985 for details, but you want the mantissa to align with your BYTE, so if you add a number like 2^23 to p, then the lowest 8 bits of the float should be your packed value directly (as the MSB 1 is not stored in the mantissa), so no expensive conversion is needed.
In case you have just {-1,0,+1} instead of <-1,+1>,
then of course you should use an integer approach, like bit packing with 2 bits per dimension, or use a LUT of all 3^3 = 27 possibilities and pack the entire vector in 5 bits.
The encoding would look like this:
int enc[3][3][3] = { 0,1,2, ... 24,25,26 };
p=enc[x+1][y+1][z+1];
And decoding:
int dec[27][3] = { {-1,-1,-1},.....,{+1,+1,+1} };
x=dec[p][0];
y=dec[p][1];
z=dec[p][2];
Which should be fast enough and if you got many vectors you can pack the p into each 5 bits ... to save even more memory space
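For the {-1,0,+1} case, the lookup tables can also be replaced by plain base-3 arithmetic. The following Python sketch uses the same ordering as the enc table above (x is the most significant digit); the names encode and decode are chosen for this example only.

```python
def encode(x: int, y: int, z: int) -> int:
    """Map a vector with components in {-1, 0, 1} to a code in 0..26."""
    return 9 * (x + 1) + 3 * (y + 1) + (z + 1)

def decode(p: int) -> tuple:
    """Inverse of encode(): recover (x, y, z) from a code in 0..26."""
    return (p // 9 - 1, p // 3 % 3 - 1, p % 3 - 1)

# Every one of the 27 vectors survives a round trip.
for x in (-1, 0, 1):
    for y in (-1, 0, 1):
        for z in (-1, 0, 1):
            assert decode(encode(x, y, z)) == (x, y, z)
```

The code for Vector3(-1,1,1) is the base-3 number 022 from the question, read as an integer.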
One way is to store the component of each vector in every 2 bits of a byte.
Converting a vector component value to and from the 2 bit stored form is as simple as adding and subtracting one, respectively.
-1 (1111 1111 as a signed byte) <-> 00 (in binary)
0 (0000 0000 in binary) <-> 01 (in binary)
1 (0000 0001 in binary) <-> 10 (in binary)
The packed 2 bit values can be stored in a byte in any order of your preference. I will use the following format: 00XXYYZZ where XX is the converted (packed) value of the X component, and so on. The 0s at the start aren't going to be used.
A vector will then be packed in a byte as follows:
byte Pack(Vector3<int> vector) {
byte b = 0;
b |= (vector.x + 1) << 4;
b |= (vector.y + 1) << 2;
b |= (vector.z + 1);
return b;
}
Unpacking a vector from its byte form will be as follows:
Vector3<int> Unpack(byte b) {
Vector3<int> v = new Vector3<int>();
v.x = ((b & 0x30) >> 4) - 1; // 0x30 == 0011 0000
v.y = ((b & 0xC) >> 2) - 1; // 0xC == 0000 1100
v.z = (b & 0x3) - 1; // 0x3 == 0000 0011
return v;
}
Both the above methods assume that the input is valid, i.e. All components of vector in Pack are either -1, 0 or 1 and that all two-bit sections of b in Unpack have a (binary) value of either 00, 01 or 10.
Since this method uses bitwise operators, it is fast and efficient. If you wish to compress the data further, you could try using the 2 unused bits too, and convert every 3 two-bit elements processed to a vector.
The most compact way is to write a 27-digit number in base 3 (using the shift -1 -> 0, 0 -> 1, 1 -> 2).
The value of this number will range from 0 to 3^27-1 = 7625597484987, which takes 43 bits to be encoded, i.e. 6 bytes (and 5 spare bits).
This is a little saving compared to a packed representation with 4 two-bit numbers packed in a byte (hence 7 bytes/56 bits in total).
An interesting variant is to group the base 3 digits five by five in bytes (hence numbers 0 to 242). You will still require 6 bytes (and no spare bits), but the decoding of the bytes can easily be hard-coded as a table of 243 entries.
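The five-digits-per-byte variant could look like the Python sketch below. The function names pack_trits and unpack_trits are invented for this illustration, and the digit order inside a byte (first value in the least significant base-3 digit) is an arbitrary choice.

```python
def pack_trits(trits):
    """Pack values from {-1, 0, 1} five per byte; each byte is in 0..242."""
    out = bytearray()
    for start in range(0, len(trits), 5):
        value = 0
        # Process the group in reverse so the first value ends up least significant.
        for t in reversed(trits[start:start + 5]):
            value = value * 3 + (t + 1)
        out.append(value)
    return bytes(out)

def unpack_trits(data, count):
    """Recover `count` values in {-1, 0, 1} from bytes made by pack_trits()."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]  # drop the padding digits of the last byte

values = [-1, 0, 1] * 9          # 27 base-3 digits
packed = pack_trits(values)
assert len(packed) == 6
assert unpack_trits(packed, len(values)) == values
```

A 243-entry table indexed by the byte value would replace the inner loop of unpack_trits if decoding speed matters.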

Subtraction assembly with "base 10" EMU8086

Hello, I'm making a base-10 calculator in assembler that can take numbers with a maximum length of 5 digits. After the input is taken there are two numbers; one five-digit number is stored in AX and BL, for example:
AX - 23 45
BX - 00 01
So the value of the input is 12345. The other number, for example 23243, is stored in CX and DX in the same way as the first number (which is stored in AX and BX). I have written the addition code, but I can't figure out how to write the subtraction code, given the problem of negative results.
What I thought of doing is to use BH (which I'm not using, because the number can't be longer than 6 digits): if the number is negative I'll put 1 in it, and if it's positive I'll put 0. That solves the sign problem. The remaining problem is that I don't know how to make the subtraction itself work, with the borrow and everything (in the addition I used instructions like adc and daa).
Last example:
The value is 12345 and it's positive:
AX - 23 45
BX - 00 01
(If BH is 0 the number is positive; if it is 1 the number is negative.)
Now the second value is 23243 and it's positive:
CX - 32 43
DX - 00 02
Calculation
12345-23243(= -10898)
lets say the answer goes to CX AND DX
so it will look like that:
CX - 08 98
DX - 01 01
answer: (-10898)
Can someone please help me, or give me some example code so that I'll know how to do it?
Sorry if I'm a little bit confused...
Thx.
EDIT:
Here is the addition code that you asked for:
proc Add_two_numbers;2 values using stack...
pop [150]
pop dx
pop cx
pop bx
pop ax
add al,cl
daa
mov cl,al
mov al,ah
adc al,ch
daa
mov ch,al
mov al,bl
adc al,dl
daa
mov dl,al
push cx
push dx
push [150]
ret
endp Add_two_numbers
2nd edit:
I figured out how to make it negative, so I just need an algorithm that subtracts two numbers. It does not need to work for cases like 1000-2000; please make it work only for positive results, like 2000-1000.
Answering your comment, this is one way you can convert from decimal and back using C as an example. I leave you to code it in asm!
#include <conio.h>
#define MAX 100000000
// input a signed decimal number
int inp_num(void) {
int number=0, neg=0, key;
while (number < MAX) {
key = _getche();
if (key == '-') {
if (number==0)
neg = 1; // else ignore
}
else if (key >= '0' && key <= '9')
number = number * 10 + key - '0';
else
break;
}
if (neg)
number = -number;
_putch('\n');
return number;
}
// output a signed number as decimal
void out_num(int number) {
int digit, suppress0, d;
suppress0 = 1; // zero-suppression on
if (number < 0) {
_putch('-');
number =-number;
}
for (d=MAX; d>0; d/=10) {
digit = number / d;
if (digit || d == 1) // if non-0, or the final digit (so that 0 still prints as "0")
suppress0 = 0; // cancel zero-suppression
if (!suppress0)
_putch('0' + digit);
number -= digit * d;
}
}
int main(void) {
int number;
number = inp_num();
out_num(number);
return 0;
}
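If you would rather stay in packed BCD, as in the question, subtraction can be done digit by digit with a decimal borrow. The Python sketch below only illustrates that idea and is not a translation of any existing code; the name bcd_sub and the 6-digit width are assumptions, and it expects a >= b as asked for in the second edit. On the 8086 the same per-byte adjustment is what DAS performs after SUB or SBB.

```python
def bcd_sub(a: int, b: int, ndigits: int = 6) -> int:
    """Subtract packed-BCD b from packed-BCD a (assumes a >= b)."""
    result = 0
    borrow = 0
    for i in range(ndigits):
        da = (a >> (4 * i)) & 0xF    # i-th decimal digit of a
        db = (b >> (4 * i)) & 0xF    # i-th decimal digit of b
        d = da - db - borrow
        if d < 0:                    # digit underflow: borrow 10 from the next digit
            d += 10
            borrow = 1
        else:
            borrow = 0
        result |= d << (4 * i)
    return result

# 23243 - 12345, written as packed BCD
print(hex(bcd_sub(0x023243, 0x012345)))
```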

Arduino: Formula to convert byte

I'm looking for a way to modify a binary byte value on Arduino.
Because of the hardware, it's necessary to split a two-digit number into two 4-bit values.
The code to set the output is wire.write(byte, 0xFF), which sets all outputs high.
0xFF = binary 1111 1111
The formula should convert a value like this:
e.g. the number 35 is binary 0010 0011
but for my use it should be output as 0011 0101, which would correspond to 53 in reality.
The first 4 bits are for a BCD-input IC which displays the 5 from 35, the second 4 bits are for a BCD-input IC which displays the 3 from 35.
Does anybody have an idea how to convert this in code, or with a mathematical formula?
Possible numbers are from 00 to 59.
Thank you for your help
To convert a value n between 0 and 99 to BCD:
((n / 10) * 16) + (n % 10)
assuming n is an integer and thus / is doing integer division; also assumes this will be stored in an unsigned byte.
(If this is not producing the desired result, please either explain how it is incorrect for the example given, or provide a different example for which it is incorrect.)
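The formula can be checked quickly with a Python sketch (the names to_bcd and from_bcd are made up for this example; on the Arduino itself the same expression works with integer types):

```python
def to_bcd(n: int) -> int:
    """Pack a number 0..99 into one byte: tens in the high nibble, units in the low."""
    if not 0 <= n <= 99:
        raise ValueError("n must be between 0 and 99")
    return ((n // 10) << 4) | (n % 10)   # same as (n / 10) * 16 + (n % 10)

def from_bcd(b: int) -> int:
    """Inverse of to_bcd()."""
    return (b >> 4) * 10 + (b & 0x0F)

# 35 becomes the bit pattern 0011 0101
print(format(to_bcd(35), "08b"))
```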
#include <string>
int num = 35; // Any number from 0 to 59
int tens = num / 10;
int units = num - (tens * 10);
// Build a 4-character binary string for one decimal digit, most significant bit first
std::string toBinary4(int digit)
{
    std::string bits;
    for (int i = 3; i >= 0; i--)
    {
        bits += ((digit >> i) & 1) ? '1' : '0';
    }
    return bits;
}
// Convert the tens and the units, then join the two together
std::string binary = toBinary4(tens) + toBinary4(units);
I have not tested this, but you should be able to adapt it to your needs.
The last thing you need to do is convert the string to binary.

How to find x mod 15 without using any Arithmetic Operations?

We are given an unsigned integer x. Without using any arithmetic operators, i.e. + - / * or %, we are to find x mod 15. We may use bitwise operations.
As far as I could go, I got this based on 2 points.
a = a mod 15 = a mod 16 for a<15
Let a = x mod 15
then a = x - 15k (for some non-negative k).
ie a = x - 16k + k...
ie a mod 16 = ( x mod 16 + k mod 16 ) mod 16
ie a mod 15 = ( x mod 16 + k mod 16 ) mod 16
ie a = ( x mod 16 + k mod 16 ) mod 16
OK. Now to implement this. A mod 16 operation is basically & 0xF, and k is basically x>>4.
So a = ( (x & 0xF) + ((x>>4) & 0xF) ) & 0xF.
It boils down to adding 2 4-bit numbers. Which can be done by bit expressions.
sum[0] = a[0] ^ b[0]
sum[1] = a[1] ^ b[1] ^ (a[0] & b[0])
...
and so on
This seems like cheating to me. I'm hoping for a more elegant solution
This reminds me of an old trick from base 10 called "casting out the 9s". This was used for checking the result of large sums performed by hand.
In this case 123 mod 9 = 1 + 2 + 3 mod 9 = 6.
This happens because 9 is one less than the base of the digits (10). (Proof omitted ;) )
So considering the number in base 16 (Hex). you should be able to do:
0xABCDE123 mod 0xF = (0xA + 0xB + 0xC + 0xD + 0xE + 0x1 + 0x2 + 0x3) mod 0xF
= 0x42 mod 0xF
= 0x6
Now you'll still need to do some magic to make the additions disappear. But it gives the right answer.
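To see the hex digit-sum rule in action before removing the additions, here is a Python sketch that still uses +. It only illustrates "casting out the Fs" and does not satisfy the no-arithmetic constraint; the function name mod15_by_digit_sum is hypothetical.

```python
def mod15_by_digit_sum(x: int) -> int:
    """Compute x mod 15 by repeatedly summing the hex digits of x."""
    while x > 15:
        total = 0
        while x:
            total += x & 0xF   # add the lowest hex digit
            x >>= 4            # move on to the next hex digit
        x = total
    # A final value of 15 is congruent to 0 mod 15.
    return 0 if x == 15 else x

print(mod15_by_digit_sum(12313231), 12313231 % 15)
```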
UPDATE:
Here's a complete implementation in C++. The lookup table f maps each byte to its value mod 15, which is the same as the sum of its two hex digits mod 15. We then repack these results and reapply the table to half as much data each round.
#include <cstdint>
#include <iostream>
uint8_t f[256]={
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,0,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,0,1,
2,3,4,5,6,7,8,9,10,11,12,13,14,0,1,2,
3,4,5,6,7,8,9,10,11,12,13,14,0,1,2,3,
4,5,6,7,8,9,10,11,12,13,14,0,1,2,3,4,
5,6,7,8,9,10,11,12,13,14,0,1,2,3,4,5,
6,7,8,9,10,11,12,13,14,0,1,2,3,4,5,6,
7,8,9,10,11,12,13,14,0,1,2,3,4,5,6,7,
8,9,10,11,12,13,14,0,1,2,3,4,5,6,7,8,
9,10,11,12,13,14,0,1,2,3,4,5,6,7,8,9,
10,11,12,13,14,0,1,2,3,4,5,6,7,8,9,10,
11,12,13,14,0,1,2,3,4,5,6,7,8,9,10,11,
12,13,14,0,1,2,3,4,5,6,7,8,9,10,11,12,
13,14,0,1,2,3,4,5,6,7,8,9,10,11,12,13,
14,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,0};
uint64_t mod15( uint64_t in_v )
{
uint8_t * in = (uint8_t*)&in_v;
// 12 34 56 78 12 34 56 78 => aa bb cc dd
in[0] = f[in[0]] | (f[in[1]]<<4);
in[1] = f[in[2]] | (f[in[3]]<<4);
in[2] = f[in[4]] | (f[in[5]]<<4);
in[3] = f[in[6]] | (f[in[7]]<<4);
// aa bb cc dd => AA BB
in[0] = f[in[0]] | (f[in[1]]<<4);
in[1] = f[in[2]] | (f[in[3]]<<4);
// AA BB => DD
in[0] = f[in[0]] | (f[in[1]]<<4);
// DD => D
return f[in[0]];
}
int main()
{
uint64_t x = 12313231;
std::cout<< mod15(x)<<" "<< (x%15)<<std::endl;
}
Your logic is flawed somewhere, but I can't put a finger on it. Think about it yourself: your final formula operates on the lowest 8 bits and ignores the rest. That could only be valid if the part you throw away (the 9th bit and above) were always a multiple of 15. In reality (in binary numbers) the bits from the 9th upward always represent a multiple of 16, not of 15. For example, try putting 1 0000 0000 and 11 0000 0000 into your formula. It will give 0 as the result in both cases, while the correct answers are 1 and 3.
In essence, I'm almost sure that your task cannot be solved without loops. And if you are allowed to use loops, then nothing is easier than implementing a bitwiseAdd function and doing whatever you like with it.
Added:
Found your problem. Here it is:
... a = x - 15k (for some non-negative k).
... and k is basically x>>4
It equals x>>4 only by pure coincidence for some numbers. Take any big example, for instance x=11110000. By your calculation k = 15, while in reality it is k=16: 16*15 = 11110000.
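The counter-examples from this answer can be reproduced with a few lines of Python. The function asker_formula is just a transcription of the formula from the question (with the parentheses the asker presumably intended), not a recommended method.

```python
def asker_formula(x: int) -> int:
    """The formula from the question: ((x mod 16) + (k mod 16)) mod 16 with k = x >> 4."""
    return ((x & 0xF) + ((x >> 4) & 0xF)) & 0xF

# Each row shows: x, result of the formula, true x mod 15
for x in (0b1_0000_0000, 0b11_0000_0000, 0b1111_0000):
    print(x, asker_formula(x), x % 15)
```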
