Related
I understand how binary works and I can calculate binary to decimal, but I'm lost around signed numbers.
I have found a calculator that does the conversion. But I'm not sure how to find the maximum and the minumum number or convert if a binary number is not given, and question in StackO seems to be about converting specific numbers or doesn't include signed numbers to a specific bit.
The specific question is:
We have only 5 bits for representing signed numbers in two's complement:
What is the highest signed integer?
Write its decimal value (including the sign only if negative).
What is the lowest signed integer?
Write its decimal value (including the sign only if negative).
Seems like I'll have to go heavier on binary concepts, I just have 2 months in programming and I thought i knew about binary conversion.
From a logical point of view:
Bounds in signed representation
You have 5 bits, so there are 32 different combinations. It means that you can make 32 different numbers with 5 bits. On unsigned integers, it makes sense to store integers from 0 to 31 (inclusive) on 5 bits.
However, this is about unsigned integers. Meaning: we have to find a way to represent negative numbers too. Meaning: we have to store the number's value, but also its sign (+ or -). The representation used is 2's complement, and it is the one that's learned everywhere (maybe other exist but I don't know them). In this representation, the sign is given by the first bit. That is, in 2's complement representation a positive number starts with a 0 and a negative number starts with an 1.
And here the problem rises: Is 0 a positive number or a negative number ? It can't be both, because it would mean that 0 can be represented in two manners for a given number a bits (for 5: 00000 and 10000), that is we lose the space to put one more number. I have no idea how they decided, but fact is 0 is a positive number. For any number of bits, signed or unsigned, a 0 is represented with only 0.
Great. This gives us the answer to the first question: what is the upper bound for a decimal number represented in 2's complement ? We know that the first bit is for the sign, so all of the numbers we can represent must be composed of 4 bits. We can have 16 different values of 4-bits strings, and 0 is one of them, so the upper bound is 15.
Now, for the negative numbers, this becomes easy. We have already filled 16 values out of the 32 we can make on 5 bits. 16 left. We also know that 0 has already been represented, so we don't need to include it. Then we start at the number right before 0: -1. As we have 16 numbers to represent, starting from -1, the lowest signed integer we can represent on 5 bits is -16.
More generally, with n bits we can represent 2^n numbers. With signed integers, half of them are positive, and half of them are negative. That is, we have 2^(n-1) positive numbers and 2^(n-1) negative numbers. As we know 0 is considered as positive, the greatest signed integer we can represent on n bits is 2^(n-1) - 1 and the lowest is -2^(n-1)
2's complement representation
Now that we know which numbers can be represented on 5 bits, the question is to know how we represent them.
We already saw the sign is represented on the first bit, and that 0 is considered as positive. For positive numbers, it works the same way as it does for unsigned integers: 00000 is 0, 00001 is 1, 00010 is 2, etc until 01111 which is 15. This is where we stop for positive signed integers because we have occupied all the 16 values we had.
For negative signed integers, this is different. If we keep the same representation (10001 is -1, 10010 is -2, ...) then we end up with 11111 being -15 and 10000 not being attributed. We could decide to say it's -16 but we would have to check for this particular case each time we work with negative integers. Plus, this messes up all of the binary operations. We could also decide that 10000 is -1, 10001 is -2, 10010 is -3 etc. But it also messes up all of the binary operations.
2's complement works the following way. Let's say you have the signed integer 10011, you want to know what decimal is is.
Flip all the bits: 10011 --> 01100
Add 1: 01100 --> 01101
Read it as an unsigned integer: 01101 = 0*2^4 + 1*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = 13.
10011 represents -13. This representation is very handy because it works both ways. How to represent -7 as a binary signed integer ? Start with the binary representation of 7 which is 00111.
Flip all the bits: 00111 --> 11000
Add 1: 11000 --> 11001
And that's it ! On 5 bits, -7 is represented by 11001.
I won't cover it, but another great advantage with 2's complement is that the addition works the same way. That is, When adding two binary numbers you do not have to care if they are signed or unsigned, this is the same algorithm behind.
With this, you should be able to answer the questions, but more importantly to understand the answers.
This topic is great for understanding 2's complement: Why is two's complement used to represent negative numbers?
11111 111111 11111 > 16 bit
R G B
how can I get the R/G/B values?
R = x/2048 (2048 are the 6+5 G+B 11 bit)
G = x-R*2048/32 (32 are the B )
is there a simple explanation of this? why should the decimal value be divided by others bit to get the single group value?
If you want to get R you have to move 11 bits on the right so that G and B disappear. So
R=x>>11
but every shift of 1 bit is equivalent to a division by 2 so the expression above is equivalent to
R=X/(2^11)=x/2048
And so you get the R.
To get the G, according to the formula, you can first remove the R, which is the R you obtained previously but shift back of 11 bits (or multiplied by 2048). Now done this you have a number like:
111111 11111
You shift left of 5 bits (or divide for 32) and you get the result.
Note that in practice you can do this easier doing this:
B=(x and 0x1F)
G=((x>>5) and 0x3F)
R=((x>>11) and 0x1F)
The 2's complement of a number which is represented by N bits is 2^N-number.
For example: if number is 7 (0111) and i'm representing it using 4 bits then, 2's complement of it would be (2^N-number) i.e. (2^4 -7)=9(1001)
7==> 0111
1's compliment of 7==> 1000
1000
+ 1
-------------
1001 =====> (9)
While calculating 2's complement of a number, we do following steps:
1. do one's complement of the number
2. Add one to the result of step 1.
I understand that we need to do one's complement of the number because we are doing a negation operation. But why do we add the 1?
This might be a silly question but I'm having a hard time understanding the logic. To explain with above example (for number 7), we do one's complement and get -7 and then add +1, so -7+1=-6, but still we are getting the correct answer i.e. +9
Your error is in "we do one's compliment and get -7". To see why this is wrong, take the one's complement of 7 and add 7 to it. If it's -7, you should get zero because -7 + 7 = 0. You won't.
The one's complement of 7 was 1000. Add 7 to that, and you get 1111. Definitely not zero. You need to add one more to it to get zero!
The negative of a number is the number you need to add to it to get zero.
If you add 1 to ...11111, you get zero. Thus -1 is represented as all 1 bits.
If you add a number, say x, to its 1's complement ~x, you get all 1 bits.
Thus:
~x + x = -1
Add 1 to both sides:
~x + x + 1 = 0
Subtract x from both sides:
~x + 1 = -x
The +1 is added so that the carry over in the technique is taken care of.
Take the 7 and -7 example.
If you represent 7 as 00000111
In order to find -7:
Invert all bits and add one
11111000 -> 11111001
Now you can add following standard math rules:
00000111
+ 11111001
-----------
00000000
For the computer this operation is relatively easy, as it involves basically comparing bit by bit and carrying one.
If instead you represented -7 as 10000111, this won't make sense:
00000111
+ 10000111
-----------
10001110 (-14)
To add them, you will involve more complex rules like analyzing the first bit, and transforming the values.
A more detailed explanation can be found here.
Short answer: If you don't add 1 then you have two different representations of the number 0.
Longer answer: In one's complement
the values from 0000 to 0111 represent the numbers from 0 to 7
the values from 1111 to 1000 represent the numbers from 0 to -7
since their inverses are 0000 and 0111.
There is the problem, now you have 2 different ways of writing the same number, both 0000 and 1111 represent 0.
If you add 1 to these inverses they become 0001 and 1000 and represent the numbers from -1 to -8 therefore you avoid duplicates.
I'm going to directly answer what the title is asking (sorry the details aren't as general to everyone as understanding where flipping bits + adding one comes from).
First let motivate two's complement by recalling the fact that we can carry out standard (elementary school) arithmetic with them (i.e. adding the digits and doing the carrying over etc). Easy of computation is what motivates this representation (I assume it means we only 1 piece of hardware to do addition rather than 2 if we implemented subtraction differently than addition, and we do and subtract differently in elementary school addition btw).
Now recall the meaning of each of the digit's in two's complements and some binary numbers in this form as an example (slides borrowed from MIT's 6.004 course):
Now notice that arithmetic works as normal here and the sign is included inside the binary number in two's complement itself. In particular notice that:
1111....1111 + 0000....1 = 000....000
i.e.
-1 + 1 = 0
Using this fact let's try to derive what the two complement representation for -A should be. So the problem to solve is:
Q: Given the two's complement representation for A what is the two's complement's representation for -A?
To do this let's do some algebra using values we know:
A + (-A) = 0 = 1 + (-1) = 11...1 + 00000...1 = 000...0
now let's make -A the subject expressed in terms of numbers expressed in two's complement:
-A = 1 + (-1 - A) = 000.....1 + (111....1 - A)
where A is in two's complements. So what we need to compute is the subtraction of -1 and A in two's complement format. For that we notice how numbers are represented as a linear combination of it's bases (i.e. 2^i):
1*-2^N-1 + 1 * 2^N-1 + ... 1 = -1
a_N * -2^N-1 + a_N-1 * 2^N-1 + ... + a_0 = A
--------------------------------------------- (subtract them)
a_N-1 * -2^N-1 + a_N-1 -1 * 2^N-1 + ... + a_0 -1 = A
which essentially means we subtract each digit for it's corresponding value. This ends up simply flipping bits which results in the following:
-A = 1 + (-1 - A) = 1 + ~ A
where ~ is bit flip. This is why you need to bit flip and add 1.
Remark:
I think a comment that was helpful to me is that complement is similar to inverse but instead of giving 0 it gives 2^N (by definition) e.g. with 3 bits for the number A we want A+~A=2^N so 010 + 110 = 1000 = 8 which is 2^3. At least that clarifies what the word "complement" is suppose to mean here as it't not just the inverting of the meaning of 0 and 1.
If you've forgotten what two's complement is perhaps this will be helpful: What is “2's Complement”?
Cornell's answer that I hope to read at some point: https://www.cs.cornell.edu/~tomf/notes/cps104/twoscomp.html#whyworks
What is a canonical signed digit (CSD) and how does one convert a binary number to a CSD and a CSD back to a binary number? How do you know if a digit of a CSD should be canonically chosen to be +, -, or 0?
Signed-digit binary uses three symbols in each power-of-two position: -1, 0, 1. The value represented is the sum of the positional coefficients times the corresponding power of 2, just like binary, the difference being that some of the coefficients may be -1. A number can have multiple distinct representations in this system.
Canonical signed digit representation is the same, but subject to the constraint that no two consecutive digits are non-0. It works out that each number has a unique representation in CSD.
See slides 31 onwards in Parhi's Bit Level Arithmetic for more, including a binary to CSD conversion algorithm.
What is canonical signed digit format?
Canonical Signed Digit (CSD) is a type of number representation. The important characteristics of the CSD presentation are:
CSD presentation of a number consists of numbers 0, 1 and -1. [1, 2].
The CSD presentation of a number is unique [2].
The number of nonzero digits is minimal [2].
There cannot be two consecutive non-zero digits [2].
How to convert a number into its CSD presentation?
First, find the binary presentation of the number.
Example 1
Lets take for example a number 287, which is 1 0001 1111 in binary representation. (256 + 16 + 8 + 4 + 2 + 1 = 287)
1 0001 1111
Starting from the right (LSB), if you find more than non-zero elements (1 or -1) in a row, take all of them, plus the next zero. (if there is not zero at the left side of the MSB, create one there). We see that the first part of this number is
01 1111
Add 1 to the number (i.e. change the 0 to 1, and all the 1's to 0's), and force the rightmost digit to be -1.
01 1111 -> 10 000-1
You can check that the number is still the same: 16 + 8 + 4 + 2 + 1 = 31 = 32 + (-1).
Now the number looks like this
1 0010 000-1
Since there are no more consecutive non-zero digits, the conversion is complete. Thus, the CSD presentation for the number 287 is 1 0010 000-1, which is 256 + 31 - 1.
Example 2
How about a little more challenging example. Number 345. In binary, it is
1 0101 1001
Find the first place (starting from righ), where there are more than one non-zero numbers in a row. Take also the next zero. Add one to it, and force the rightmost digit to be -1.
1 0110 -1001
Now we just created another pair of ones, which has to be transformed. Take the 011, and add one to it (get 100), and force the last digit to be -1. (get 10-1). Now the number looks like this
1 10-10 -1001
Do the same thing again. This time, you will have to imagine a zero in the left side of the MSB.
10 -10-10 -1001
You can make sure that this is the right CSD presentation by observing that: 1) There are no consecutive non-zero digits. 2) The sum adds to 325 (512 - 128 - 32 - 8 + 1 = 345).
More formal definitions of this algorithm can be found in [2].
Motivation behind the CSD presentation
CSD might be used in some other applications, too, but this is the digital microelectronics perspective. It is often used in digital multiplication. [1, 2]. Digital multiplication consists of two phases: Computing partial products and summing up the partial product. Let's consider the multiplication of 1010 and 1011:
1010
x 1011
1010
1010
0000
+ 1010
= 1101110
As we can see, the number of non-zero partial products (the 1010's), which are has to be summed up depends on the number of non-zero digits in the multiplier. Thus, the computation time of the sum of the partial products depends on the number of non-zero digits in the multiplier. Therefore the digital multiplication using CSD converted numbers is faster than using conventional digital numbers. The CSD form contains 33% less non-zero digits than the binary presentation (on average). For example, a conventional double precision floating point multiplication might take 100.2 ns, but only 93.2 ns, when using the CSD presentation. [1]
And how about the negative ones, then. Are there actually three states (voltage levels) in the microcircuit? No, the partial products calculated with the negative sign are not summed right away. Instead, you add the 2's complement (i.e. the negative presentation) of these numbers to the final sum.
Sources:
[1] D. Harini Sharma, Addanki
Purna Ramesh: Floating point multiplier using Canonical Signed
Digit
[2] Gustavo A. Ruiz, Mercedes Grand: Efficient canonic signed digit recoding
I'm trying to learn C and have come across the inability to work with REALLY big numbers (i.e., 100 digits, 1000 digits, etc.). I am aware that there exist libraries to do this, but I want to attempt to implement it myself.
I just want to know if anyone has or can provide a very detailed, dumbed down explanation of arbitrary-precision arithmetic.
It's all a matter of adequate storage and algorithms to treat numbers as smaller parts. Let's assume you have a compiler in which an int can only be 0 through 99 and you want to handle numbers up to 999999 (we'll only worry about positive numbers here to keep it simple).
You do that by giving each number three ints and using the same rules you (should have) learned back in primary school for addition, subtraction and the other basic operations.
In an arbitrary precision library, there's no fixed limit on the number of base types used to represent our numbers, just whatever memory can hold.
Addition for example: 123456 + 78:
12 34 56
78
-- -- --
12 35 34
Working from the least significant end:
initial carry = 0.
56 + 78 + 0 carry = 134 = 34 with 1 carry
34 + 00 + 1 carry = 35 = 35 with 0 carry
12 + 00 + 0 carry = 12 = 12 with 0 carry
This is, in fact, how addition generally works at the bit level inside your CPU.
Subtraction is similar (using subtraction of the base type and borrow instead of carry), multiplication can be done with repeated additions (very slow) or cross-products (faster) and division is trickier but can be done by shifting and subtraction of the numbers involved (the long division you would have learned as a kid).
I've actually written libraries to do this sort of stuff using the maximum powers of ten that can be fit into an integer when squared (to prevent overflow when multiplying two ints together, such as a 16-bit int being limited to 0 through 99 to generate 9,801 (<32,768) when squared, or 32-bit int using 0 through 9,999 to generate 99,980,001 (<2,147,483,648)) which greatly eased the algorithms.
Some tricks to watch out for.
1/ When adding or multiplying numbers, pre-allocate the maximum space needed then reduce later if you find it's too much. For example, adding two 100-"digit" (where digit is an int) numbers will never give you more than 101 digits. Multiply a 12-digit number by a 3 digit number will never generate more than 15 digits (add the digit counts).
2/ For added speed, normalise (reduce the storage required for) the numbers only if absolutely necessary - my library had this as a separate call so the user can decide between speed and storage concerns.
3/ Addition of a positive and negative number is subtraction, and subtracting a negative number is the same as adding the equivalent positive. You can save quite a bit of code by having the add and subtract methods call each other after adjusting signs.
4/ Avoid subtracting big numbers from small ones since you invariably end up with numbers like:
10
11-
-- -- -- --
99 99 99 99 (and you still have a borrow).
Instead, subtract 10 from 11, then negate it:
11
10-
--
1 (then negate to get -1).
Here are the comments (turned into text) from one of the libraries I had to do this for. The code itself is, unfortunately, copyrighted, but you may be able to pick out enough information to handle the four basic operations. Assume in the following that -a and -b represent negative numbers and a and b are zero or positive numbers.
For addition, if signs are different, use subtraction of the negation:
-a + b becomes b - a
a + -b becomes a - b
For subtraction, if signs are different, use addition of the negation:
a - -b becomes a + b
-a - b becomes -(a + b)
Also special handling to ensure we're subtracting small numbers from large:
small - big becomes -(big - small)
Multiplication uses entry-level math as follows:
475(a) x 32(b) = 475 x (30 + 2)
= 475 x 30 + 475 x 2
= 4750 x 3 + 475 x 2
= 4750 + 4750 + 4750 + 475 + 475
The way in which this is achieved involves extracting each of the digits of 32 one at a time (backwards) then using add to calculate a value to be added to the result (initially zero).
ShiftLeft and ShiftRight operations are used to quickly multiply or divide a LongInt by the wrap value (10 for "real" math). In the example above, we add 475 to zero 2 times (the last digit of 32) to get 950 (result = 0 + 950 = 950).
Then we left shift 475 to get 4750 and right shift 32 to get 3. Add 4750 to zero 3 times to get 14250 then add to result of 950 to get 15200.
Left shift 4750 to get 47500, right shift 3 to get 0. Since the right shifted 32 is now zero, we're finished and, in fact 475 x 32 does equal 15200.
Division is also tricky but based on early arithmetic (the "gazinta" method for "goes into"). Consider the following long division for 12345 / 27:
457
+-------
27 | 12345 27 is larger than 1 or 12 so we first use 123.
108 27 goes into 123 4 times, 4 x 27 = 108, 123 - 108 = 15.
---
154 Bring down 4.
135 27 goes into 154 5 times, 5 x 27 = 135, 154 - 135 = 19.
---
195 Bring down 5.
189 27 goes into 195 7 times, 7 x 27 = 189, 195 - 189 = 6.
---
6 Nothing more to bring down, so stop.
Therefore 12345 / 27 is 457 with remainder 6. Verify:
457 x 27 + 6
= 12339 + 6
= 12345
This is implemented by using a draw-down variable (initially zero) to bring down the segments of 12345 one at a time until it's greater or equal to 27.
Then we simply subtract 27 from that until we get below 27 - the number of subtractions is the segment added to the top line.
When there are no more segments to bring down, we have our result.
Keep in mind these are pretty basic algorithms. There are far better ways to do complex arithmetic if your numbers are going to be particularly large. You can look into something like GNU Multiple Precision Arithmetic Library - it's substantially better and faster than my own libraries.
It does have the rather unfortunate misfeature in that it will simply exit if it runs out of memory (a rather fatal flaw for a general purpose library in my opinion) but, if you can look past that, it's pretty good at what it does.
If you cannot use it for licensing reasons (or because you don't want your application just exiting for no apparent reason), you could at least get the algorithms from there for integrating into your own code.
I've also found that the bods over at MPIR (a fork of GMP) are more amenable to discussions on potential changes - they seem a more developer-friendly bunch.
While re-inventing the wheel is extremely good for your personal edification and learning, its also an extremely large task. I don't want to dissuade you as its an important exercise and one that I've done myself, but you should be aware that there are subtle and complex issues at work that larger packages address.
For example, multiplication. Naively, you might think of the 'schoolboy' method, i.e. write one number above the other, then do long multiplication as you learned in school. example:
123
x 34
-----
492
+ 3690
---------
4182
but this method is extremely slow (O(n^2), n being the number of digits). Instead, modern bignum packages use either a discrete Fourier transform or a Numeric transform to turn this into an essentially O(n ln(n)) operation.
And this is just for integers. When you get into more complicated functions on some type of real representation of number (log, sqrt, exp, etc.) things get even more complicated.
If you'd like some theoretical background, I highly recommend reading the first chapter of Yap's book, "Fundamental Problems of Algorithmic Algebra". As already mentioned, the gmp bignum library is an excellent library. For real numbers, I've used MPFR and liked it.
Don't reinvent the wheel: it might turn out to be square!
Use a third party library, such as GNU MP, that is tried and tested.
You do it in basically the same way you do with pencil and paper...
The number is to be represented in a buffer (array) able to take on an arbitrary size (which means using malloc and realloc) as needed
you implement basic arithmetic as much as possible using language supported structures, and deal with carries and moving the radix-point manually
you scour numeric analysis texts to find efficient arguments for dealing by more complex function
you only implement as much as you need.
Typically you will use as you basic unit of computation
bytes containing with 0-99 or 0-255
16 bit words contaning wither 0-9999 or 0--65536
32 bit words containing...
...
as dictated by your architecture.
The choice of binary or decimal base depends on you desires for maximum space efficiency, human readability, and the presence of absence of Binary Coded Decimal (BCD) math support on your chip.
You can do it with high school level of mathematics. Though more advanced algorithms are used in reality. So for example to add two 1024-byte numbers :
unsigned char first[1024], second[1024], result[1025];
unsigned char carry = 0;
unsigned int sum = 0;
for(size_t i = 0; i < 1024; i++)
{
sum = first[i] + second[i] + carry;
carry = sum - 255;
}
result will have to be bigger by one place in case of addition to take care of maximum values. Look at this :
9
+
9
----
18
TTMath is a great library if you want to learn. It is built using C++. The above example was silly one, but this is how addition and subtraction is done in general!
A good reference about the subject is Computational complexity of mathematical operations. It tells you how much space is required for each operation you want to implement. For example, If you have two N-digit numbers, then you need 2N digits to store the result of multiplication.
As Mitch said, it is by far not an easy task to implement! I recommend you take a look at TTMath if you know C++.
One of the ultimate references (IMHO) is Knuth's TAOCP Volume II. It explains lots of algorithms for representing numbers and arithmetic operations on these representations.
#Book{Knuth:taocp:2,
author = {Knuth, Donald E.},
title = {The Art of Computer Programming},
volume = {2: Seminumerical Algorithms, second edition},
year = {1981},
publisher = {\Range{Addison}{Wesley}},
isbn = {0-201-03822-6},
}
Assuming that you wish to write a big integer code yourself, this can be surprisingly simple to do, spoken as someone who did it recently (though in MATLAB.) Here are a few of the tricks I used:
I stored each individual decimal digit as a double number. This makes many operations simple, especially output. While it does take up more storage than you might wish, memory is cheap here, and it makes multiplication very efficient if you can convolve a pair of vectors efficiently. Alternatively, you can store several decimal digits in a double, but beware then that convolution to do the multiplication can cause numerical problems on very large numbers.
Store a sign bit separately.
Addition of two numbers is mainly a matter of adding the digits, then check for a carry at each step.
Multiplication of a pair of numbers is best done as convolution followed by a carry step, at least if you have a fast convolution code on tap.
Even when you store the numbers as a string of individual decimal digits, division (also mod/rem ops) can be done to gain roughly 13 decimal digits at a time in the result. This is much more efficient than a divide that works on only 1 decimal digit at a time.
To compute an integer power of an integer, compute the binary representation of the exponent. Then use repeated squaring operations to compute the powers as needed.
Many operations (factoring, primality tests, etc.) will benefit from a powermod operation. That is, when you compute mod(a^p,N), reduce the result mod N at each step of the exponentiation where p has been expressed in a binary form. Do not compute a^p first, and then try to reduce it mod N.
Here's a simple ( naive ) example I did in PHP.
I implemented "Add" and "Multiply" and used that for an exponent example.
http://adevsoft.com/simple-php-arbitrary-precision-integer-big-num-example/
Code snip
// Add two big integers
function ba($a, $b)
{
if( $a === "0" ) return $b;
else if( $b === "0") return $a;
$aa = str_split(strrev(strlen($a)>1?ltrim($a,"0"):$a), 9);
$bb = str_split(strrev(strlen($b)>1?ltrim($b,"0"):$b), 9);
$rr = Array();
$maxC = max(Array(count($aa), count($bb)));
$aa = array_pad(array_map("strrev", $aa),$maxC+1,"0");
$bb = array_pad(array_map("strrev", $bb),$maxC+1,"0");
for( $i=0; $i<=$maxC; $i++ )
{
$t = str_pad((string) ($aa[$i] + $bb[$i]), 9, "0", STR_PAD_LEFT);
if( strlen($t) > 9 )
{
$aa[$i+1] = ba($aa[$i+1], substr($t,0,1));
$t = substr($t, 1);
}
array_unshift($rr, $t);
}
return implode($rr);
}