Binary 2's Complement - math

I'm facing a problem. when we want to subtract a number from another using 2's complement we can do that. I don't know how to subtract fractional number using 2's complement.
5 is in binary form 101 and 2 is 10. if we want to subtract 2 from 5 we need to find out 2's complement of 2
2's complement of 2-> 11111110
so if we now add with binary of 5 we can get the subtraction result. If I want to get the result of 5.5-2.125. what would be the procedure.

Fixed point numbers can be used and it is still common to find them in embedded code or hardware.
Their use is identical to integers, but you need to specify where your "point" is. For instance, assume that you want 3 bits after after the point and that your data is 8 bits, bits 7..3 are the integer part (left of "point") and bits 2..0 the fractional part. The interpretation of integer part is as usual the binary decomposition of this integer: bits 3 correspond to 20, bits 4 to 21, etc.
For the fractional part, the decomposition is in negative powers or two. bits 2 correspond to 2-1, bits 1 to 2-2 and bit 0 to 2-3.
So for you problem, 5.5=4+1+1/2=22+20+2-1 and its code is 00101(.)100. Similarly 2.125=2+1/8 and its code is 00010(.)001 (note (.) is just an help to understand the coding).
Indeed they are just integers, but you must take into account that all your numbers are multiplied by 2-3. This will have no impact for addition, but results of multiplication and division must be adjusted. Taking into account the position of point and managing over and underflows is the difficulty of arithmetic with fixed point, but it allows to do fractional computations even if your hardware does not provide floating point support (for instance with low end microcontrollers or FPGA systems).
Two complement is similar to integers and its computation is identical. If code of 2.125 is 00010(.)001, than -2.125==11101(.)111. Operations are as usual.
+5 00101(.)100
-2.125 11101(.)111
00011(.)011
and 00011(.)011=2+1+1/4+1/8=3,375
For the record, two complement first use was for fixed point fractional numbers and two complement name comes from that. If a fractional number if represented by, say 0(.)1100000 (0.75), its negative counter part will be 1(.)0100000 (-0.75 or 1.25 if interpreted as unsigned) and we always have x+(unsigned)-x=2. For this coding, the negative value of a fractional number x is the number y that must be added to x to get a 2, hence the name that y is 2's complement of x.

Related

Converting a decimal number to floating point notation and IEEE 754 format

I'm doing an assignment for one of my classes and I'm stuck on those two questions:
Express the decimal -412.8 using binary floating point notation using 11 fraction bits for
the significand and 3 digits for the exponent without bias
I think I managed to solve it, but my exponent has 4 bits not 3. I don't really understand how you can convert -412.8 to floating point notation using only 3 bit exponents. Here is how I tried to solve it:
First of all, the floating point notation has three parts. The sign part, 0 for positive numbers and 1 for negative numbers, the exponent part and finally the mantissa. The mantissa in this case includes the leading 1. Since the number is negative, the sign bit is going to be 1. For the mantissa, I first converted 412.8 to binary, which gave me 110011100.11 and then I shifted the decimal point to the left 8 times, which gives me 1.1001110011. The mantissa is therefore 1100 1110 011 (11 bits as the teacher asked). Finally, the exponent is going to be 2^8, since I shifted the decimal 8 times to the right. 8 is 1000 in binary. So am I correct to assume that my floating point notation should be 1 1000 11001110011?
Represent the decimal number 16.1875×2-134 in single-precision IEEE 754 format.
I'm completely stuck on this one. I don't know how to convert that number. When I enter it in wolfram, the decimal number is way beyond the limit of the single precision format. I do know that the sign bit is going to be 0 since the number is positive. I don't know what the mantissa is though, nor how to find it. I also don't know how to find the exponent. Can someone guide me through this problem? Thanks.
For 1, you appear to be correct -- there's no way to reperesent the exponent unbiased in 3 bits. Of course, the problem says "3 digits" and doesn't define a base for the digits...
2 is relatively straight-forward -- convert the value to binary gives you 10000.0011 then normalize, giving 1.00000011×2-130. Now -130 is too small for a single-precision exponent (minimum is -126), so we have to denormalize (continue shifting the point to get an exponent of -126), which gives us 0.000100000011×2-126. That's then our mantissa (with the 0 dropped) and an exponent field of 0: 0|00000000|00010000001100000000000 (vertical bars separating the sign/exponent/mantissa fields) or 0x00081800

Representing decimal numbers in binary

How do I represent integers numbers, for example, 23647 in two bytes, where one byte contains the last two digits (47) and the other contains the rest of the digits(236)?
There are several ways do to this.
One way is to try to use Binary Coded Decimal (BCD). This codes decimal digits, rather than the number as a whole into binary. The packed form puts two decimal digits into a byte. However, your example value 23647 has five decimal digits and will not fit into two bytes in BCD. This method will fit values up to 9999.
Another way is to put each of your two parts in binary and place each part into a byte. You can do integer division by 100 to get the upper part, so in Python you could use
upperbyte = 23647 // 100
Then the lower part can be gotten by the modulus operation:
lowerbyte = 23647 % 100
Python will directly convert the results into binary and store them that way. You can do all this in one step in Python and many other languages:
upperbyte, lowerbyte = divmod(23647, 100)
You are guaranteed that the lowerbyte value fits, but if the given value is too large the upperbyte value many not actually fit into a byte. All this assumes that the value is positive, since negative values would complicate things.
(This following answer was for a previous version of the question, which was to fit a floating-point number like 36.47 into two bytes, one byte for the integer part and another byte for the fractional part.)
One way to do that is to "shift" the number so you consider those two bytes to be a single integer.
Take your value (36.47), multiply it by 256 (the number of values that fit into one byte), round it to the nearest integer, convert that to binary. The bottom 8 bits of that value are the "decimal numbers" and the next 8 bits are the "integer value." If there are any other bits still remaining, your number was too large and there is an overflow condition.
This assumes you want to handle only non-negative values. Handling negatives complicates things somewhat. The final result is only an approximation to your starting value, but that is the best you can do.
Doing those calculations on 36.47 gives the binary integer
10010001111000
So the "decimal byte" is 01111000 and the "integer byte" is 100100 or 00100100 when filled out to 8 bits. This represents the float number 36.46875 exactly and your desired value 36.47 approximately.

Whats the highest and the lowest integer for representing signed numbers in two's complement in 5 bits?

I understand how binary works and I can calculate binary to decimal, but I'm lost around signed numbers.
I have found a calculator that does the conversion. But I'm not sure how to find the maximum and the minumum number or convert if a binary number is not given, and question in StackO seems to be about converting specific numbers or doesn't include signed numbers to a specific bit.
The specific question is:
We have only 5 bits for representing signed numbers in two's complement:
What is the highest signed integer?
Write its decimal value (including the sign only if negative).
What is the lowest signed integer?
Write its decimal value (including the sign only if negative).
Seems like I'll have to go heavier on binary concepts, I just have 2 months in programming and I thought i knew about binary conversion.
From a logical point of view:
Bounds in signed representation
You have 5 bits, so there are 32 different combinations. It means that you can make 32 different numbers with 5 bits. On unsigned integers, it makes sense to store integers from 0 to 31 (inclusive) on 5 bits.
However, this is about unsigned integers. Meaning: we have to find a way to represent negative numbers too. Meaning: we have to store the number's value, but also its sign (+ or -). The representation used is 2's complement, and it is the one that's learned everywhere (maybe other exist but I don't know them). In this representation, the sign is given by the first bit. That is, in 2's complement representation a positive number starts with a 0 and a negative number starts with an 1.
And here the problem rises: Is 0 a positive number or a negative number ? It can't be both, because it would mean that 0 can be represented in two manners for a given number a bits (for 5: 00000 and 10000), that is we lose the space to put one more number. I have no idea how they decided, but fact is 0 is a positive number. For any number of bits, signed or unsigned, a 0 is represented with only 0.
Great. This gives us the answer to the first question: what is the upper bound for a decimal number represented in 2's complement ? We know that the first bit is for the sign, so all of the numbers we can represent must be composed of 4 bits. We can have 16 different values of 4-bits strings, and 0 is one of them, so the upper bound is 15.
Now, for the negative numbers, this becomes easy. We have already filled 16 values out of the 32 we can make on 5 bits. 16 left. We also know that 0 has already been represented, so we don't need to include it. Then we start at the number right before 0: -1. As we have 16 numbers to represent, starting from -1, the lowest signed integer we can represent on 5 bits is -16.
More generally, with n bits we can represent 2^n numbers. With signed integers, half of them are positive, and half of them are negative. That is, we have 2^(n-1) positive numbers and 2^(n-1) negative numbers. As we know 0 is considered as positive, the greatest signed integer we can represent on n bits is 2^(n-1) - 1 and the lowest is -2^(n-1)
2's complement representation
Now that we know which numbers can be represented on 5 bits, the question is to know how we represent them.
We already saw the sign is represented on the first bit, and that 0 is considered as positive. For positive numbers, it works the same way as it does for unsigned integers: 00000 is 0, 00001 is 1, 00010 is 2, etc until 01111 which is 15. This is where we stop for positive signed integers because we have occupied all the 16 values we had.
For negative signed integers, this is different. If we keep the same representation (10001 is -1, 10010 is -2, ...) then we end up with 11111 being -15 and 10000 not being attributed. We could decide to say it's -16 but we would have to check for this particular case each time we work with negative integers. Plus, this messes up all of the binary operations. We could also decide that 10000 is -1, 10001 is -2, 10010 is -3 etc. But it also messes up all of the binary operations.
2's complement works the following way. Let's say you have the signed integer 10011, you want to know what decimal is is.
Flip all the bits: 10011 --> 01100
Add 1: 01100 --> 01101
Read it as an unsigned integer: 01101 = 0*2^4 + 1*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = 13.
10011 represents -13. This representation is very handy because it works both ways. How to represent -7 as a binary signed integer ? Start with the binary representation of 7 which is 00111.
Flip all the bits: 00111 --> 11000
Add 1: 11000 --> 11001
And that's it ! On 5 bits, -7 is represented by 11001.
I won't cover it, but another great advantage with 2's complement is that the addition works the same way. That is, When adding two binary numbers you do not have to care if they are signed or unsigned, this is the same algorithm behind.
With this, you should be able to answer the questions, but more importantly to understand the answers.
This topic is great for understanding 2's complement: Why is two's complement used to represent negative numbers?

How to do Division of two fixed point 64 bits variables in Synthesizable Verilog?

I'm implementing an Math equation in verilog, in a combinational scheme (assigns = ...) to the moment Synthesis tool (Quartus II) has been able to do add, sub and mul easly 32 bit unsigned absolute numbers by using the operators "+,- and *" respectively.
However, one of the final steps of the equation is to divide two 64 bits unsigned fixed point variables, the reason why is such of large 64 bit capacity is because I'm destinating 16 bits for integers and 48 bits for fractions (although, computer does everything in binary and doesn't care about fractions, I would be able to check the number to separate fraction from integer in the end).
Problem is that the operator "/" is useless since it auto-invokes a so-called "LPM_divide" library which output only gives me the integer, disregarding fractions, plus in a wrong position (the less significant bit).
For example:
b1000111010000001_000000000000000000000000000000000000000000000000 / b1000111010000001_000000000000000000000000000000000000000000000000
should be 1, it gives me
b0000000000000000_000000000000000000000000000000000000000000000001
So, how can I make this division for synthesizable verilog? What methods or algorithms should I follow, I'd like it to be faster, maybe a full combinational?
I'd like it to keep the 16 integers - 24 fractions user point of view. Thanks in advance.
First assume you multiply two fixed-point numbers.
Let's call them X and Y, first containing Xf fractional bits, and second Yf fractional bits accordingly.
If you multiply those numbers as integers, the LSB Xf+Yf bits of the integer result could be treated as fractional bits of resulting fixed-point number (and you still multiply them as integers).
Similarly, if you divide number of Sf fractional bits by number of Df fractional bits, the resulting integer could be treated as fixed-point number having Sf-Df fractional bits -- therefore your example with resulting integer 1.
Thus, if you need to get 48 fractional bits from your division of 16.48 number by another 16.48 number, append divident with another 48 zeroed fractional bits, then divide the resulting 64+48=112-bit number by another 64-bit number, treating both as integers (and using LPM_divide). The result's LSB 48 bits will then be what you need -- the resulting fixed-point number's 48 fractional bits.

fixed point multiplication for normal multiplication

I need to multiply X with a floating point number in floating point as i don't have floating point operations in my processor. I understand the method but don't know why that method exists?
Suppose we want to multiply 2*4.5 in decimal I do the below:
2 * 4.5 (100.1)
So i multiply 2*1001 = 2*9 = 18 and then right shift by 1.
so 18>>1 = 9
Is it that we represent 2 in fixed point and represent 4.5 in fixed point and as we multiply Q1.1 and Q1.1 format so we get Q2.2 format and we do right shifting causing Q1.1 format result.Is this right?
In decimal, your fixed-point example is actually:
2 * 4.5
2 * 45 (after multiplying by 10) = 90
90 / 10 = 9 (after dividing the 10 back out)
In binary, the same thing is being done, but just with powers of 2 instead of powers of 10 (as the factors / divisors). Fixed point operations occur in purely integral space after appropriate multiplications. And multiplying or dividing by a power of 2 is just a left shift or right shift respectively on the binary number (very fast for the CPU). In fixed-point the number of bits to the left (integer) and right (fractional) of the decimal point are fixed (predetermined), which means that some numbers cannot be represented on the scale without loss of precision.
Floating-point further extends the concept by allowing the number of bits assigned to the left and right of the decimal point to be flexible. In floating point, every number is represented as an integral "significand" (or mantissa) to a specified power (for example, a power of 2). This representation allows the same number of significant digits to be maintained over a greater dynamic range (for very small or very large magnitude numbers). For floating point, most of the bits will be assigned to the significant digits of the mantissa, and fewer of the bits assigned to the digits of the power. Floating-point calculations are more expensive (time-wise) than fixed-point, which is why fixed-point remains popular in microcontrollers and embedded systems.
If I didn't answer your question, please elaborate and I will edit this answer to include the information you desire.

Resources