2's complement representation of fractions?

I'm a little lost on this. I need to use two fractional bits:
0.a₋₁a₋₂
Like that. Now I can use .00, .01, .10, and .11.
But I also need negative numbers (in two's complement), so would .10 be -.5, or would it be -.25?
The same with .11: would that be -.75, or would it be -.5?
I'm pretty sure it would be the former in both cases, but I'm not entirely positive.

In two's complement notation, the most significant bits of a negative number are all set to 1. Let's assume you're storing these numbers as 8 bits, with 2 to the right of the "binary point."
By definition, x + -x = 0, so, discarding the carry out of the top bit, we can write:
0.5 + -0.5 = 000000.10 + 111111.10 = 0 // -0.5 = 111111.10
0.25 + -0.25 = 000000.01 + 111111.11 = 0 // -0.25 = 111111.11
0.75 + -0.75 = 000000.11 + 111111.01 = 0 // -0.75 = 111111.01
and so on.
Using 8 bits like this, the largest number you can store is
011111.11 = 31.75
the least-positive number is
000000.01 = 0.25
the least-negative number is
111111.11 = -0.25
and the smallest (that is, the most negative) is
100000.00 = -32
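
To make this concrete, here is a minimal C sketch (my own illustration, not from the answer above) that stores these 6.2 fixed-point values in a uint8_t and recovers their signed values; it relies on int8_t being two's complement, so dividing the reinterpreted byte by 4 yields the fractional value:

#include <stdint.h>
#include <stdio.h>

/* Interpret an 8-bit word as a signed fixed-point value with two
   fractional bits: reinterpret as a signed byte, then scale by 2^-2. */
static double q6_2_value(uint8_t bits) {
    return (int8_t)bits / 4.0;
}

int main(void) {
    uint8_t half = 0x02;        /* 000000.10 =  0.5 */
    uint8_t neg_half = 0xFE;    /* 111111.10 = -0.5 */

    /* Addition wraps mod 2^8, so x + -x == 0 as claimed above. */
    printf("0.5 + -0.5 = %g\n", q6_2_value((uint8_t)(half + neg_half)));
    printf("111111.11  = %g\n", q6_2_value(0xFF)); /* -0.25 */
    printf("111111.01  = %g\n", q6_2_value(0xFD)); /* -0.75 */
    printf("100000.00  = %g\n", q6_2_value(0x80)); /* -32   */
    return 0;
}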

see it this way:
you have the normal binary representation.
Let's assume 8-bit words ...
the first bit (MSB) has the value 128, the second 64, and so on ...
in other words, the first bit (MSB) is 2^7 ... the second bit is 2^6 ... and the last bit is 2^0.
Now assume our 8-bit word has 2 fractional (binary) places ...
we now start with the first bit (MSB) being 2^5 and end with the last bit being 2^-2.
No magic here ...
Now to turn that into two's complement: simply negate the value of the first bit,
so instead of 2^5 it would be -2^5.
So base-10 -0.75 in two's complement would be
111111.01 ...
(1*(-32) + 1*16 + 1*8 + 1*4 + 1*2 + 1*1 + 0*0.5 + 1*0.25)
(1*(-2^5) + 1*2^4 + 1*2^3 + 1*2^2 + 1*2^1 + 1*2^0 + 0*2^(-1) + 1*2^(-2))
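
As a quick check of that place-value view, a small C sketch (mine, for illustration) that evaluates an 8-bit word bit by bit, using -2^5 for the MSB and 2^-2 for the last bit:

#include <stdint.h>
#include <stdio.h>

/* Sum the place values of an 8-bit two's complement word with two
   fractional bits: the MSB weighs -2^5, the rest 2^4 down to 2^-2. */
static double eval_q6_2(uint8_t w) {
    double sum = 0.0;
    double weight = -32.0;                /* -2^5 for the sign bit */
    for (int i = 7; i >= 0; i--) {
        if (w & (1u << i))
            sum += weight;
        /* after the MSB the weights are +2^4, +2^3, ..., +2^-2 */
        weight = (i == 7) ? 16.0 : weight / 2.0;
    }
    return sum;
}

int main(void) {
    printf("%g\n", eval_q6_2(0xFD));      /* 111111.01 -> -0.75 */
    return 0;
}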

A number stored in two's complement inverts the sign of the uppermost bit's magnitude (so that, e.g., for a 16-bit number the upper bit is worth -32768 rather than +32768). All other bits behave as normal. Consequently, when performing math on multi-word numbers, the upper word of each number should be regarded as two's complement (since its uppermost bit will be the uppermost bit of the overall number), but all other words in each number should be regarded as unsigned quantities.
For example, a 16-bit two's complement number has place values (-32768, 16384, 8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, and 1). Split into two 8-bit parts, those parts will have place values (-32768, 16384, 8192, 4096, 2048, 1024, 512, and 256) and (128, 64, 32, 16, 8, 4, 2, and 1). The first set of values is a two's complement 8-bit number times 256; the latter set is an unsigned 8-bit number.
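
For instance, a hedged C sketch (my illustration of the paragraph above, not code from the answer) that reassembles a 16-bit two's complement value from its two bytes, treating the high byte as signed and the low byte as unsigned:

#include <stdint.h>
#include <stdio.h>

/* Reassemble a 16-bit two's complement value from two 8-bit words:
   the high word carries the sign (two's complement, scaled by 256),
   the low word is plain unsigned. */
static int16_t from_words(uint8_t hi, uint8_t lo) {
    return (int16_t)((int8_t)hi * 256 + lo);
}

int main(void) {
    /* 0xFF38: high byte 0xFF is -1 as a signed byte, low byte is 56. */
    printf("%d\n", from_words(0xFF, 0x38));   /* -1*256 + 56 = -200 */
    return 0;
}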

Related

Sum of 2 two's complement binary numbers

I am familiar with two's complement when performing addition in 4 bits, but I am confused when I face the question below:
**Find the sum of the 2 two's complement binary numbers 010111 and 110101 in 8-bit output**
Below is my attempt, but I am in a dilemma: should I
(1) discard the carry, then prepend two 0s, so the answer is 00001100, which is 12 in decimal, or
(2) just prepend 1s at the beginning, so the answer is 11001100, which is 204 in decimal?
Thank you!
For the two's complement in 8 bits you have to invert ALL 8 bits of the number.
As far as I know, the ones' complement and two's complement operate on the absolute value, so:
The binary number 010111 is represented in 8 bits as 00010111
C_1(00010111) = 00010111 xor 11111111 = 11101000
C_2(00010111) = C_1 + 1 = 11101000 + 1 = 11101001
The binary number 110101 is represented in 8 bits as 00110101
C_1(00110101) = 00110101 xor 11111111 = 11001010
C_2(00110101) = C_1 + 1 = 11001010 + 1 = 11001011
Now add the two two's complements, discarding the carry out of bit 8:
C_2(00010111) + C_2(00110101) = 11101001 + 11001011 = 10110100 (which is -76, i.e. the sum of -23 and -53)
Please correct me if I messed something up with the sign bits (I just took the binary numbers as-is in 8 bits)...
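
For what it's worth, if the two operands are read as already being two's complement values (010111 = +23 and 110101 = -11 in 6 bits), the standard way to widen them to 8 bits is sign extension, which supports option (1). A minimal C sketch of that reading (my own check, not part of the answer above):

#include <stdint.h>
#include <stdio.h>

/* Sign-extend a 6-bit two's complement value to 8 bits:
   copy bit 5 into bits 6 and 7. */
static uint8_t sext6to8(uint8_t x) {
    return (x & 0x20) ? (uint8_t)(x | 0xC0) : x;
}

int main(void) {
    uint8_t a = sext6to8(0x17);       /* 010111 -> 00010111 (+23) */
    uint8_t b = sext6to8(0x35);       /* 110101 -> 11110101 (-11) */
    uint8_t sum = (uint8_t)(a + b);   /* the carry out is discarded */
    printf("%02X = %d\n", sum, (int8_t)sum);   /* prints 0C = 12 */
    return 0;
}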

Is there a mathematical formula for two's complement multiplication when dealing with overflow?

For instance, given a word size of 4 bits:
0b1001 * 0b0111 = 0b1111 // -7 * 7 = -1
0b0111 * 0b0111 = 0b0001 // 7 * 7 = 1
0b0111 * 0b0110 = 0b1010 // 7 * 6 = -6
0b1001 * 0b0110 = 0b0110 // -7 * 6 = 6
There's undoubtedly some modular arithmetic going on here, but the way you take mod seems to be quite inconsistent. Is there a neat mathematical formulation of two's complement multiplication?
The nice thing about twos complement is that addition, subtraction, and multiplication of signed operands are exactly the same operations, bit-for-bit, as the ones for unsigned operands, so the computer doesn't need to care whether you think of them as signed or not.
In terms of modular arithmetic as well, the operations mean exactly the same thing. With 4 bit words, when you say:
r = a * b;
You get r = a * b mod 16.
The only difference between signed and unsigned is the value we assign in our heads to the residues mod 16. If we think of the words as unsigned then we have values 0-15. But 15 = -1 mod 16, 14 = -2 mod 16, etc, and if we think of the words as signed, then we just think of the values -8 to 7 instead of 0 to 15.
The remainder operator % that you get in C, Java, etc., is annoying in the way it handles negative numbers. If you wanted to express your 4-bit multiply using that operator in larger words, then you could say:
a * b = ( (a * b % 16) + 24 ) % 16 - 8
If the remainder operator worked "properly" so that -1 % 16 == 15, then you could write a * b = (a * b + 8) % 16 - 8
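
To see this numerically, here is a small C sketch (my illustration; the helper name wrap4 is made up) that reduces products to a 4-bit two's complement result both by the % formula above and by masking the low 4 bits and sign-extending, and shows the two agree:

#include <stdio.h>

/* Reduce x to its 4-bit two's complement value: the representative
   of x mod 16 lying in [-8, 8), via the % formula from the answer. */
static int wrap4(int x) {
    return ((x % 16) + 24) % 16 - 8;
}

int main(void) {
    int pairs[4][2] = { {-7, 7}, {7, 7}, {7, 6}, {-7, 6} };
    for (int i = 0; i < 4; i++) {
        int a = pairs[i][0], b = pairs[i][1];
        /* Alternative: keep the low 4 bits, then sign-extend by hand. */
        int masked = (a * b) & 0xF;
        int extended = (masked & 0x8) ? masked - 16 : masked;
        printf("%2d * %d -> %2d (formula), %2d (mask)\n",
               a, b, wrap4(a * b), extended);
    }
    return 0;   /* prints -1, 1, -6, 6 both ways */
}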

Converting a number to IEEE 754

Can someone help me with this question:
“Convert the decimal number 10/32 to the 32-bit IEEE 754 floating point and
express your answer in hexadecimal. (Reminder: the 32 bits are used as
follows: Bit 1: sign of mantissa, bits 2-9: 8-bits of exponent in excess 127, bits 10-32: 23 bits for magnitude of mantissa.)”
I understand how to convert a decimal number to IEEE 754. But I am confused about how to answer this: it only gives me a quotient? I am not allowed to use a calculator, so I am unsure how to work this out. Should I convert them both to binary first and divide them?
10/32 = 5/16 = 5·2⁻⁴ = 1.25·2⁻² = 1.01₂·2⁻².
The sign is +, the exponent is −2, and the significand is 1.01₂.
A positive sign is encoded as 0.
Exponent −2 is encoded as −2 + 127 = 125 = 01111101₂.
Significand 1.01₂ is 1.01000000000000000000000₂, and it is encoded using the last 23 bits, 01000000000000000000000₂.
Putting these together, the IEEE-754 encoding is 0 01111101 01000000000000000000000. To convert to hexadecimal, first organize into groups of four bits: 0011 1110 1010 0000 0000 0000 0000 0000. Then the hexadecimal can be easily read: 3EA00000₁₆.
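
This is easy to cross-check on any machine where float is IEEE-754 binary32 (virtually all mainstream platforms); a minimal C sketch (my own check):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 10.0f / 32.0f;            /* exactly representable: 0.3125 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* view the float's bit pattern */
    printf("0x%08X\n", bits);           /* prints 0x3EA00000 */
    return 0;
}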
I see it like this:
10/32 = // input
10/2^5 = // convert division by power of 2 to bitshift
1010b >> 5 =
.01010b // fractional result
--^-------------------------------------------------------------
|
first nonzero bit is the exponent position and start of mantissa
----------------------------------------------------------------
man = (1)010b // first one is implicit
exp = -2 + 127 = 125 // position from binary point + bias
sign = 0 // non negative
----------------------------------------------------------------
0 01111101 01000000000000000000000 b
^ ^ ^
| | mantissa + zero padding
| exp
sign
----------------------------------------------------------------
0011 1110 1010 0000 0000 0000 0000 0000 b
3 E A 0 0 0 0 0 h
----------------------------------------------------------------
3EA00000h
Yes, the answer of Eric Postpischil uses the same approach (+1 btw), but I didn't like the formatting, as it was not clear at first glance what to do without properly reading the text.
Giving the conversion of 10/32 without a calculator as an exercise is pure sadism.
There is a general method doable without tools, but it may be tedious.
Let n be the number to code. We assume 0 < n < 1.
exp = 0
mantissa = 0
repeat
    n *= 2
    exp++
    if n >= 1
        n = n - 1
        mantissa = mantissa << 1 | 1
    else
        mantissa = mantissa << 1
until mantissa is a 1 followed by 23 bits
Then you just have to code mantissa and (23 - exp) in IEEE format. (The test is n >= 1 so that a value reaching exactly 1, like 10/32, produces the terminating expansion rather than the equivalent 0111... one.)
Note that this kind of computation frequently leads to loops. Whenever you encounter the same n again, you know that the sequence of bits will repeat.
As an example, assume we have to code 3/14:
3/14 -> 6/14 e=1 m=0
6/14 -> 12/14 e=2 m=00
12/14 -> 24/14-14/14=10/14 e=3 m=001
10/14 -> 20/14-14/14=6/14 e=4 m=0011
6/14 -> 12/14 e=5 m=00110
Great, we found a loop!
6/14 -> 12/14 -> 10/14 -> 6/14.
So the mantissa will be 110 iterated as required: 110110110...
If we fill the mantissa with 24 bits, we need 26 iterations and the exponent is 23-26=-3 (another way to get it is to note that n became ≥ 1 for the first time at iteration 3, and the exponent is -3 since 1 ≤ 3/14·2³ < 2).
And we can do the IEEE-754 coding with exponent = 127-3 = 124 and mantissa = 1.1011011011011...
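
Here is a hedged C sketch of that doubling method for a fraction p/q with 0 < p/q < 1 (my own transcription of the pseudocode above; the function name encode_fraction is made up). It truncates rather than rounds the 24th bit:

#include <stdint.h>
#include <stdio.h>

/* Encode p/q (0 < p/q < 1, small enough that p*2 never overflows)
   as an IEEE-754 binary32 word by repeated doubling. */
static uint32_t encode_fraction(uint32_t p, uint32_t q) {
    int exp = 0;
    while (p < q) { p *= 2; exp--; }   /* now 1 <= p/q < 2 */
    p -= q;                            /* drop the implicit leading 1 */
    uint32_t mantissa = 0;
    for (int i = 0; i < 23; i++) {     /* next 23 fraction bits */
        p *= 2;
        mantissa <<= 1;
        if (p >= q) { mantissa |= 1; p -= q; }
    }
    return (uint32_t)(exp + 127) << 23 | mantissa;
}

int main(void) {
    printf("0x%08X\n", encode_fraction(10, 32)); /* 0x3EA00000 */
    printf("0x%08X\n", encode_fraction(3, 14));  /* 0x3E5B6DB6 (truncated;
                                                    round-to-nearest would
                                                    give 0x3E5B6DB7) */
    return 0;
}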

How to add fixed point with two's complement fixed point?

I want to multiply unsigned integer 5432 by 0.01 and then add/subtract 0.3. Instead of using floats I want to use fixed-point arithmetic. Here are my steps:
1) ((1 << 16) * 0.01) = 655 => Fixed point Q0.16
2) 655 * 5432 = 3557960 => Fixed point Q16.16
3) ((1 << 16) * 0.3) = 19660 => Fixed point Q0.16
4) Add 0.3: 3557960 + 19660 = 3577620 => convert to float = 54.59, which is pretty much the same as the floating-point calculation: 5432 * 0.01 + 0.3 = 54.62
5) Subtract 0.3: find the two's complement of 19660 => 45876; now 3577620 + 45876 = 3623496 => 55.29, which is not the expected 5432 * 0.01 - 0.3 = 54.02
Can anyone verify that I am correct in points 1-4, and tell me what I'm missing in point 5?
Your mistake is that you assume that the two's complement representation is independent of the word size. It is not. The 16-bit two's complement of 19660 is 2^16 - 19660 = 45876, but since you are working with 32-bit numbers you need the corresponding 32-bit two's complement, which is 2^32 - 19660 = 4294947636. In other words, when you extend a two's complement value from 16 bits to 32 bits, you should fill the top bytes with the sign bit, i.e., with 1 for negative values. You can see that in binary both values are actually the same under such extension:
45876 = 10110011_00110100 (16-bit binary)
4294947636 = 11111111_11111111_10110011_00110100 (32-bit binary)
If you add 3557960 + 4294947636 you'll get 4298505596, or, truncated back to a 32-bit value, 3538300, which is the fixed-point representation of 53.99.
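
A small C sketch of the corrected computation (my own illustration): using a signed 32-bit Q16.16 type makes the sign extension implicit, since the compiler forms the 32-bit complement for you:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Q16.16 constants: value * 2^16, truncated. */
    int32_t c001 = (int32_t)((1 << 16) * 0.01);   /* 655   ~ 0.01 */
    int32_t c03  = (int32_t)((1 << 16) * 0.3);    /* 19660 ~ 0.3  */

    int32_t prod = 5432 * c001;       /* 3557960 ~ 54.28 in Q16.16 */
    int32_t add  = prod + c03;        /* + 0.3 */
    int32_t sub  = prod - c03;        /* - 0.3, i.e. prod + 4294947636
                                         truncated to 32 bits */
    printf("add: %.2f\n", add / 65536.0);   /* ~54.59 */
    printf("sub: %.2f\n", sub / 65536.0);   /* ~53.99 */
    return 0;
}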

Convert signed to unsigned integer mathematically

I am in need of an unsigned 16-bit integer value in my OPC server but can only send it a signed 16-bit integer. I need to change this signed integer to unsigned mathematically but am unsure how. My internet research has not led me down the right path either. Could someone please give some advice? Thanks in advance.
Mathematically, the conversion from signed to unsigned works as follows: (1) do the integer division of the signed integer by 1 + max, (2) codify the signed integer as the non-negative remainder of the division. Here max is the maximum integer you can write with the available number of bits, 16 in your case.
Recall that the non-negative remainder is the only integer r that satisfies
1. s = q(1+max) + r
2. 0 <= r < 1+max.
Note that when s >= 0, the non-negative remainder is s itself (you cannot codify integers greater than max). Therefore, there is actually something to do only when s < 0:
if s >= 0 then return s else return 1 + max + s
because the value r = 1 + max + s satisfies conditions 1 and 2 above for the non-negative remainder.
For this convention to work as expected the signed s must satisfy
- (1 + max)/2 <= s < (1 + max)/2
In your case, given that you have 16 bits, we have max = 0xFFFF and 1 + max = 0x10000 = 65536.
Note also that if you codify a negative integer with this convention, the result will have its highest bit on, i.e., equal to 1. This way, the highest bit becomes a flag that tells whether the number is negative or positive.
Examples:
2 -> 2
1 -> 1
0 -> 0
-1 -> 0xFFFF
-2 -> 0xFFFE
-3 -> 0xFFFD
...
-15 -> 0xFFF1
...
-32768 -> 0x8000 = 32768 (!)
-32769 -> error: cannot codify using only 16 bits.
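
In code the conversion is a one-liner; a minimal C sketch (mine) matching the rule above:

#include <stdint.h>
#include <stdio.h>

/* Map a signed 16-bit value to its unsigned 16-bit representative,
   i.e. the non-negative remainder mod 65536. */
static uint16_t to_unsigned16(int16_t s) {
    return (s >= 0) ? (uint16_t)s : (uint16_t)(65536 + (int32_t)s);
}

int main(void) {
    printf("0x%04X\n", to_unsigned16(-1));       /* 0xFFFF */
    printf("0x%04X\n", to_unsigned16(-32768));   /* 0x8000 */
    return 0;
}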
