What does this logic equation of overflow mean? - logical-operators

here is a paragraph from the textbook:
When two's complement numbers are added or subtracted...Overflow is defined as the situation in which the result of an arithmetic operation lies outside of the number range that can be represented by the number of bits in the word...
The logic function that indicates that the result of
an operation is outside of the representable number range is:
OVR = Cs XOR Cs+1 where Cs is the carry-in to the sign bit and Cs+1 is the
carry-out of the sign bit.
I assume that by saying "sign bit" the author means the top bit. Now assume we have a 4-bit adder, 1100+1100, which leads to an overflow. The carry-in to the sign bit is 1 and the carry-out is also 1. This seems to contradict the formula. Where is the mistake?

(Please read the comments of the original question for more details)
In fact, as Raymond Chen mentioned, 1100+1100 does not cause an overflow, as the result (-8) still fits in a 4-bit signed value.
If we instead use 1000+1111, then the resulting -9 is indeed an overflow and we can observe the carry-in as 0 and carry-out as 1 for the sign-bit.
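The two examples above can be checked mechanically. The following is a minimal sketch, not the textbook's implementation; the helper names `carry_out_of_bit` and `add_overflows` are made up for illustration, and operands are assumed to be passed as raw bit patterns narrower than `unsigned`.

```c
#include <stdbool.h>

/* Carry out of bit position `pos` when adding a + b with no initial carry.
   Adding only bits 0..pos makes that carry show up at bit pos+1. */
static unsigned carry_out_of_bit(unsigned a, unsigned b, unsigned pos)
{
    unsigned mask = (1u << (pos + 1)) - 1u;   /* keep bits 0..pos */
    return (((a & mask) + (b & mask)) >> (pos + 1)) & 1u;
}

/* OVR = Cs XOR Cs+1 for a `bits`-wide two's complement addition
   (assumes 2 <= bits < width of unsigned). */
static bool add_overflows(unsigned a, unsigned b, unsigned bits)
{
    unsigned sign    = bits - 1;
    unsigned cs      = carry_out_of_bit(a, b, sign - 1); /* carry into the sign bit   */
    unsigned cs_next = carry_out_of_bit(a, b, sign);     /* carry out of the sign bit */
    return (cs ^ cs_next) != 0;
}
```

With 4-bit operands, `add_overflows(0xC, 0xC, 4)` is false (both carries are 1), while `add_overflows(0x8, 0xF, 4)` is true (carry in 0, carry out 1).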

Related

How does a processor without an overflow flag perform signed arithmetic?

I know that addition of two unsigned integers larger than the bus size of a given processor can be achieved via the carry flag. And normally, the same is true for signed integers using the overflow flag. However, the Intel 8085 only possesses a Sign flag, not an Overflow flag, so how does it deal with signed integer arithmetic?
As you know, the overflow flag is only relevant for signed integer arithmetic. On processors whose ALU has both overflow and carry flags (like x86), both of these flags get set according to the result of a binary arithmetic operation, but it's up to the programmer to decide how to interpret them. Signed arithmetic uses the overflow flag; unsigned arithmetic uses the carry flag. Looking at the wrong one gives you meaningless data.
There are two cases where the overflow flag would be turned on during a binary arithmetic operation:
The inputs both have sign bits that are off, while the result has a sign bit that is on.
The inputs both have sign bits that are on, while the result has a sign bit that is off.
Basically, then, the overflow flag gets set when the sign bit of the result does not match the sign bit of the input operands. In all other cases, the overflow flag is turned off.
Taking a couple of examples:
0100 + 0001 = 0101 (overflow flag off)
0100 + 0100 = 1000 (overflow flag on)
0110 + 1001 = 1111 (overflow flag off)
1000 + 1000 = 0000 (overflow flag on)
1000 + 0001 = 1001 (overflow flag off)
1100 + 1100 = 1000 (overflow flag off)
Notice that the state of the overflow flag depends only on the sign bits of the three numbers; thus, you need only look at these bits. This makes intuitive sense. If you add two positive numbers to get a negative, then the answer must be wrong because two positive numbers should give a positive result. Conversely, if you add two negative numbers and get a positive number, that also must be wrong. A positive number added to a negative number can never overflow because the sum lies between the two input values. Thus, arithmetic of mixed-signed values never turns on the overflow flag.
(Obviously this all assumes two's-complement arithmetic.)
Thus, you can easily calculate the state of the overflow flag even if the processor's ALU doesn't do it for you automatically. All you need to do is look at the sign bits of the three values or, equivalently, at the binary carry into the sign bit and the binary carry out of the sign bit. Overflow occurs when a bit is carried into the sign-bit place and no corresponding carry out occurs, or when a carry out occurs without a carry in.
These C functions implement the logic:
#include <stdbool.h>

// For the binary (two's complement) addition of two signed integers,
// an overflow occurs if the inputs have the same sign AND ALSO the
// sign of the result is different from the signs of the inputs.
// `result` is the wrapped sum as produced by the ALU.
bool GetOverflowFlagForAddition(int op1, int op2, int result)
{
    return (~(op1 ^ op2) & (op1 ^ result)) < 0;
}

// For the binary (two's complement) subtraction of two signed integers
// (op1 - op2), an overflow occurs if the inputs have different signs AND
// ALSO the sign of the result is different from the sign of op1.
bool GetOverflowFlagForSubtraction(int op1, int op2, int result)
{
    return ((op1 ^ op2) & (op1 ^ result)) < 0;
}
(There are many different ways that you could write this, of course.)
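One practical wrinkle: in C, signed overflow is undefined behavior, so the wrapped `result` cannot be obtained portably with a plain `int` addition. Here is one possible sketch that does the arithmetic in unsigned (where wraparound is well defined) and applies the same sign-bit tests. The names `add_wraps` and `sub_wraps` are hypothetical, and returning the result as raw `uint32_t` bits is a choice made here to avoid an implementation-defined conversion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrapping 32-bit add. Stores the wrapped bit pattern in *sum_bits and
   returns true if the signed interpretation overflowed. */
static bool add_wraps(int32_t op1, int32_t op2, uint32_t *sum_bits)
{
    uint32_t a = (uint32_t)op1, b = (uint32_t)op2;
    uint32_t sum = a + b;                      /* wraps modulo 2^32 */
    *sum_bits = sum;
    /* same signs in, different sign out */
    return ((uint32_t)(~(a ^ b) & (a ^ sum)) >> 31) != 0;
}

/* Wrapping 32-bit subtract (op1 - op2), same conventions as add_wraps. */
static bool sub_wraps(int32_t op1, int32_t op2, uint32_t *diff_bits)
{
    uint32_t a = (uint32_t)op1, b = (uint32_t)op2;
    uint32_t diff = a - b;                     /* wraps modulo 2^32 */
    *diff_bits = diff;
    /* different signs in, result sign differs from op1 */
    return ((uint32_t)((a ^ b) & (a ^ diff)) >> 31) != 0;
}
```

For example, `add_wraps(INT32_MAX, 1, &bits)` reports overflow and leaves `0x80000000` in `bits`.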
Or, to put it in the terms that Iwillnotexist Idonotexist did in a comment: "Overflow can be defined as the XOR of the carry into and out of the sign bit." Overflow has occurred if the carry in does not equal the carry out at that particular (leftmost) bit.
A more formal definition is that the overflow flag is the XOR of the carry-out of the high two bits of the result. Symbolically, for an 8-bit value: O = C6 ^ C7, where O means "overflow" and C means "carry". This is just a restatement of the definition I already gave: overflow happens if the carry out is different from the carry into the highest-order bit (in this case, bit 7).
See also Ken Shirriff's article on how the overflow flag works arithmetically (this is in the context of the 6502, another popular 8-bit processor). He also explains the implementation of the overflow flag at the silicon level in the 6502.
Okay, so what does the carry flag mean? The carry flag indicates an overflow condition in unsigned arithmetic. There are again two cases where it is set:
The carry flag is set during addition if there is a carry out of the most significant bit (the sign bit).
The carry flag is set during subtraction if there is a borrow into the most significant bit (the sign bit).
In all other cases, the carry flag is turned off. Again, examples:
1111 + 0001 = 0000 (carry flag on)
0111 + 0001 = 1000 (carry flag off)
0000 - 0001 = 1111 (carry flag on)
1000 - 0001 = 0111 (carry flag off)
Just in case it isn't obvious, it bears explicitly pointing out that subtraction is the same as addition of the two's-complement negation, so the latter two examples can be rewritten in terms of addition as:
0000 + 1111 = 1111 (carry flag on)
1000 + 1111 = 0111 (carry flag off)
…but note that the carry for subtraction is the inverse of the carry set by addition.
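A small sketch of the 4-bit carry examples, assuming the x86-style convention in which the carry flag after a subtraction means "borrow". The function names `carry_add4` and `borrow_sub4` are invented for this illustration.

```c
#include <stdbool.h>

/* Carry flag for a 4-bit unsigned addition: set when the true sum
   needs a fifth bit. */
static bool carry_add4(unsigned a, unsigned b)
{
    return (((a & 0xFu) + (b & 0xFu)) >> 4) != 0;
}

/* Carry (borrow) flag for the 4-bit subtraction a - b. The adder really
   computes a + ~b + 1; the borrow is the inverse of that adder's carry out. */
static bool borrow_sub4(unsigned a, unsigned b)
{
    unsigned raw = (a & 0xFu) + (~b & 0xFu) + 1u;
    bool adder_carry = (raw >> 4) != 0;
    return !adder_carry;
}
```

`borrow_sub4(0x0, 0x1)` is true and `borrow_sub4(0x8, 0x1)` is false, matching the two subtraction examples.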
Putting all of this together, then, you can implement the overflow flag in terms of the carry flag, the sign flag, and the sign bits of the operands. The carry flag gets set if you have a carry out of the most-significant bit (the sign bit). The sign flag gets set if the result has its sign bit set; since that bit is the XOR of the two operand sign bits and the carry into the most-significant bit, the carry in can be recovered as SF ^ sign(op1) ^ sign(op2). By our definition of the overflow flag above, OF == CF ^ SF ^ sign(op1) ^ sign(op2), because overflow is the carry coming out of the sign bit XORed with the carry coming into the sign bit. If the carry in does not equal the carry out, then signed overflow occurred. (CF ^ SF on its own is not enough: 1000 + 0001 = 1001 has CF = 0 and SF = 1, yet no overflow.)
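The relationship between these flags can be checked exhaustively for 8-bit addition. This is only a sketch: it assumes the sign flag is bit 7 of the result, which equals sign(a) ^ sign(b) ^ C6, so the operand sign bits are folded in alongside CF and SF. The function name `overflow_identity_holds` is hypothetical.

```c
#include <stdbool.h>

/* Returns true if, for every pair of 8-bit operands, the carry-based
   overflow (C6 ^ C7) equals CF ^ SF ^ sign(a) ^ sign(b). */
static bool overflow_identity_holds(void)
{
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            unsigned sum = a + b;
            unsigned cf = (sum >> 8) & 1u;                          /* carry out of bit 7 */
            unsigned c6 = (((a & 0x7Fu) + (b & 0x7Fu)) >> 7) & 1u;  /* carry into bit 7   */
            unsigned sf = (sum >> 7) & 1u;                          /* sign of the result */
            unsigned of = c6 ^ cf;
            if (of != (cf ^ sf ^ (a >> 7) ^ (b >> 7)))
                return false;
        }
    }
    return true;
}
```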
Interestingly, though, Ken Shirriff's reverse-engineering of the 8085 processor shows that it does, in fact, have an overflow flag—it's just undocumented. This is bit 1 of the 8-bit flag status register. It is known as "V" and, as Ken explains here, is implemented in exactly the way discussed above, by XORing the carry into the most-significant bit with the carry out of the most-significant bit—C6 ^ C7, with these values coming directly from the ALU. (He also describes in the same article how the other undocumented flag, the "K" flag, is implemented in terms of the "V" flag and the sign flag, yielding a flag that is useful in signed comparisons, but a bit beyond the scope of this answer.)

Adding Two's Complement binary

Ladies and gentlemen, I have successfully been able to understand adding, etc. for unsigned binary numbers. However, this Two's complement has me beat.. Let me explain with some examples.
Example practice problem, perform each arithmetic using 8-bit signed integer storage system, under two's complement:
1111111
01100001
+ 00111111
----------
10100000 <== My Answer (idk if right / wrong), but i think right.
So it doesn't exceed 8 bits, but we have changed (positive + positive) = negative. That has to be an overflow, because the sign is changing, right? (I never understood what the carry in MSB and carry out MSB's were).
The reallllly tricky part for me are the following equations: (negative + negative) which in reality is equal to (negative - positive).
111111
10111111
+ 10010101
----------
1 | 01010100
So I think this should be wrong because when we discard the overflow bit (the 1 that is way out in left field), it turns the 8-bit representation into a positive number, when it should be a negative. SO this would entail an overflow, no?
The following equation is similar:
1111
10001110
+ 10110101
----------
1 | 01000011
Understandably, if we were working with 16-bit, etc. Then these wouldn't be overflows, because the signs aren't changing, the math is correct. But since when we store the 8-bit representation of these numbers, we lose the MSB, that would flip the signs.
But one thing that I noticed about my theory is, whenever adding two negative numbers, the MSB's will obviously always be 1, therefore you will always have a carry, which means you would always have an overflow.
** I think the more logical conclusion is that I am forgetting to convert the second negative to a positive or something prior to adding them, or something along those lines. But I've tried youtube and various research online. And TBH, my professor is terrible with the whole "communication" thing.. I would appreciate any help the community can give, so I can push past these problems and onto harder material XD.
Yes, discarding a 1 carry bit signals overflow in the unsigned sense. Don't worry about that here: with two's complement it is correct to discard the carry bit.
But since when we store the 8-bit representation of these numbers, we lose the MSB, that would flip the signs.
It's important not to think of discarding a 1 bit as flipping the sign. First, discarding a carry bit is not the same as flipping the sign bit. You can discard the carry bit from a negative result and still end up with a negative answer. For instance:
1111111
11111111 (-1)
+ 11111111 (-1)
--------
1 | 11111110 (-2)
The final 1 carry bit is discarded, but the answer's sign doesn't flip.
Second, even if you're thinking of flipping (as opposed to discarding) the sign bit, it's not good to think of that as flipping the sign. In sign magnitude representation flipping the sign bit will flip the sign of the number. But in one's and two's complement, negation is more than just flipping the leftmost bit. If you just flip the bit you get a very different number. Yes, it has the opposite sign, but it's not the same number.
Sign Mag. | One's Compl. | Two's Compl.
01111111 = 127 | 127 | 127
11111111 = -127 | -0 | -1
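The table can be reproduced with three small decoders. This is an illustrative sketch with made-up function names; note that C's `int` has no negative zero, so the one's complement decoder returns plain 0 for the pattern 11111111.

```c
#include <stdint.h>

/* Value of an 8-bit pattern read as sign-magnitude. */
static int as_sign_magnitude(uint8_t bits)
{
    int magnitude = bits & 0x7F;
    return (bits & 0x80) ? -magnitude : magnitude;
}

/* Value read as one's complement: a negative number is the bitwise
   inverse of its magnitude (0xFF is negative zero, returned as 0). */
static int as_ones_complement(uint8_t bits)
{
    return (bits & 0x80) ? -(int)(uint8_t)~bits : (int)bits;
}

/* Value read as two's complement. */
static int as_twos_complement(uint8_t bits)
{
    return (bits & 0x80) ? (int)bits - 256 : (int)bits;
}
```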
Yes, your math is correct.
The elegance of two's complement is that addition "just works" without any special considerations for the sign bit. The reason both of those additions underflow is that the magnitudes of the numbers are already quite large.
Let's do the last two questions in decimal:
(-65)
+ (-107)
--------
(-172) which underflows to 84.
(-114)
+ (-75)
--------
(-189) which underflows to 67.
The lowest signed 8-bit value is -128, so both of them underflow.
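As a quick check of these wrapped values, here is a sketch that reduces an exact sum into the 8-bit two's complement range the same way discarding the ninth bit does. The name `wrap_to_int8` is invented for the example.

```c
/* Wrap a mathematically exact value into the 8-bit two's complement
   range [-128, 127]. */
static int wrap_to_int8(int exact)
{
    int low8 = ((exact % 256) + 256) % 256;   /* low 8 bits as 0..255 */
    return (low8 >= 128) ? low8 - 256 : low8;
}
```

`wrap_to_int8(-65 + -107)` gives 84 and `wrap_to_int8(-114 + -75)` gives 67, while `wrap_to_int8(97 + 63)` gives -96, the signed reading of 10100000 from the first problem.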

How is overflow detected at the binary level?

I'm reading the textbook Computer Organization And Design by Hennessy and Patterson (4th edition). On page 225 they describe how overflow is detected in signed, 2's complement arithmetic. I just can't even understand what they're talking about.
"How do we detect [overflow] when it does occur? Clearly, adding or
subtracting two 32-bit numbers can yield a result that needs 33 bits
to be fully expressed."
Sure. And it won't need 34 bits because even the smallest 34 bit number is twice the smallest 33 bit number, and we're adding 32 bit numbers.
"The lack of a 33rd bit means that when overflow occurs, the sign bit
is set with the value of the result instead of the proper sign of
the result."
What does this mean? The sign bit is set with the "value" of the result... meaning it's set as if the result were unsigned? And if so, how does that follow from the lack of a 33rd bit?
"Since we need just one extra bit, only the sign bit can be wrong."
And that's where they lost me completely.
What I'm getting from this is that, when adding signed numbers, there's an overflow if and only if the sign bit is wrong. So if you add two positives and get a negative, or if you add two negatives and get a positive. But I don't understand their explanation.
Also, this only applies to signed numbers, right? If you're adding unsigned numbers, surely detecting overflow is much simpler. If the last half-adder of the ALU sets its carry bit, there's an overflow.
note: I really don't know what tags are appropriate here, feel free to edit them.
Any time you want to deal with these kind of ALU items be it add, subtract, multiply, etc, start with 2 or 3 bit numbers, much easier to get a handle on than 32 or 64 bit numbers. After 2 or 3 bits it doesn't matter if it is 22 or 2200 bits it all works exactly the same from there on out. Basically you can by hand if you want make a table of all 3 bit operands and their results such that you can examine the whole table visually, but a table of all 32 bit operands against all 32 bit operands and their results, can't do that by hand in a reasonable time and cannot examine the whole table visually.
Now twos complement: that is just a scheme for representing positive and negative numbers, and it is not some arbitrary thing. The reason for the madness is that your adder logic (which is also what the subtractor uses, and the same kind of thing the multiplier uses) DOES NOT CARE ABOUT UNSIGNED OR SIGNED. It does not know the difference. YOU the programmer care. In my three bit world the bit pattern 0b111 could be a positive seven (+7) or it could be a negative one (-1). Same bit pattern: feed it to the add logic and the same thing comes out, and the answer that comes out I can choose to interpret as unsigned or twos complement (so long as I interpret the operands and the result all as unsigned or all as twos complement). Twos complement also has the feature that for negative numbers the most significant bit (msbit) is set, and for positive numbers it is zero. So it is not sign plus magnitude, but we still talk about the msbit being the sign bit because, except for two special numbers, that is what it is telling us: the sign of the number. The other bits are actually telling us the magnitude; they are just not an unsigned magnitude as you might have in sign+magnitude notation.
So, the key to this whole question is understanding your limits. For a 3 bit unsigned number our range is 0 to 7, 0b000 to 0b111. for a 3 bit signed (twos complement) interpretation our range is -4 to +3 (0b100 to 0b011). For now limiting ourselves to 3 bits if you add 7+1, 0b111 + 0b001 = 0b1000 but we only have a 3 bit system so that is 0b000, 7+1 = 8, we cannot represent 8 in our system so that is an overflow, because we happen to be interpreting the bits as unsigned we look at the "unsigned overflow" which is also known as the carry bit or flag. Now if we take those same bits but interpret them as signed, then 0b111 (-1) + 0b001 (+1) = 0b000 (0). Minus one plus one is zero. No overflow, the "signed overflow" is not set...What is the signed overflow?
First what is the "unsigned overflow".
The reason why "it all works the same" no matter how many bits we have in our registers is no different than elementary school math with base 10 (decimal) numbers. If you add 9 + 1 which are both in the ones column you say 9 + 1 = zero carry the 1. you carry a one over to the tens column then 1 plus 0 plus 0 (you filled in two zeros in the tens column) is 1 carry the zero. You have a 1 in the tens column and a zero in the ones column:
1
09
+01
====
10
What if we declared that we were limited to only numbers in the ones column, there isn't any room for a tens column. Well that carry bit being a non-zero means we have an overflow, to properly compute the result we need another column, same with binary:
111
111
+ 001
=======
1000
7 + 1 = 8, but we cant do 8 if we declare a 3 bit system, we can do 7 + 1 = 0 with the carry bit set. Here is where the beauty of twos complement comes in:
111
111
+ 001
=======
000
if you look at the above three bit addition, you cannot tell by looking if that is 7 + 1 = 0 with the carry bit set or if that is -1 + 1 = 0.
So for unsigned addition, as we have known since grade school that a carry over into the next column of something other than zero means we have overflowed that many placeholders and need one more placeholder, one more column, to hold the actual answer.
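To make the "same bits, two readings" point concrete, here is a minimal sketch of a 3-bit adder plus a decoder for the signed reading. The names `add3` and `as_signed3` are assumptions made for this example.

```c
/* Add two 3-bit patterns. Returns the 3-bit result and stores the
   carry out (the "unsigned overflow") in *carry_out. */
static unsigned add3(unsigned a, unsigned b, unsigned *carry_out)
{
    unsigned sum = (a & 7u) + (b & 7u);
    *carry_out = (sum >> 3) & 1u;
    return sum & 7u;
}

/* Read a 3-bit pattern as two's complement (-4..+3). */
static int as_signed3(unsigned bits)
{
    bits &= 7u;
    return (bits & 4u) ? (int)bits - 8 : (int)bits;
}
```

`add3(7, 1, &c)` returns 0 with `c` set to 1. Read as unsigned that is 7 + 1 overflowing; read through `as_signed3` the very same operation is -1 + 1 = 0.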
Signed overflow. The sort of academic answer is if the carry in of the msbit column does not match the carry out. Let's take some examples in our 3 bit world. So with twos complement we are limited to -4 to +3. So if we add -2 + -3 = -5 that wont work correct?
To figure out what minus two is we do an invert and add one 0b010, inverted 0b101, add one 0b110. Minus three is 0b011 -> 0b100 -> 0b101
So now we can do this:
abc
100
110
+ 101
======
011
If you look at the number under the b that is the "carry in" to the msbit column, the number under the a the 1, is the carry out, these two do not match so we know there is a "signed overflow".
Let's try 2 + 2 = 4:
abc
010
010
+ 010
======
100
You may say but that looks right, sure unsigned it does, but we are doing signed math here, so the result is actually a -4 not a positive 4. 2 + 2 != -4. The carry in which is under the b is a 1, the carry out of the msbit is a zero, the carry in and the carry out don't match. Signed overflow.
There is a shortcut to figuring out the signed overflow without having to look at the carry in (or carry out). if ( msbit(opa) == msbit(opb) ) && ( msbit(res) != msbit(opb) ) signed overflow, else no signed overflow. opa being one operand, opb being the other and res the result.
010
+ 010
======
100
Take this +2 + +2 = -4. msbit(opa) and msbit(opb) are equal, and the result msbit is not equal to opb msbit so this is a signed overflow. You could think about it using this table:
x ab cr
0 00 00
0 01 01
0 10 01
0 11 10 signed overflow
1 00 01 signed overflow
1 01 10
1 10 10
1 11 11
This table is all the possible combinations of carry in bit, operand a, operand b, carry out and result bit for a single column (turn your head sideways to the left to sort of see this). x is the carry in, and the a and b columns are the two operands. cr as a pair is the result: xab of 011 means 0+1+1 = 2 decimal, which is 0b10 binary. So take the rule that has been dictated to us, that if the carry in and carry out do not match that is a signed overflow. The two cases where the item in the x column does not match the item in the c column are marked, and those are exactly the cases where the a and b inputs match each other but the result bit is the opposite of a and b. So assuming the rule is correct, this quick shortcut, which does not require knowing what the carry bits are, will tell you if there was a signed overflow.
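The eight rows of the table are few enough to check in a loop. The sketch below models one adder column and confirms that "carry in differs from carry out" and the msbit shortcut pick out the same rows; `full_adder` and `shortcut_matches_carry_rule` are names chosen for the illustration.

```c
#include <stdbool.h>

/* One adder column: carry in x plus operand bits a and b.
   Stores the carry out in *c and the result bit in *r. */
static void full_adder(unsigned x, unsigned a, unsigned b,
                       unsigned *c, unsigned *r)
{
    unsigned total = x + a + b;   /* 0..3 */
    *c = total >> 1;
    *r = total & 1u;
}

/* True if, over all eight rows, "carry in != carry out" coincides with
   "a == b and result bit != a". */
static bool shortcut_matches_carry_rule(void)
{
    for (unsigned row = 0; row < 8; row++) {
        unsigned x = (row >> 2) & 1u, a = (row >> 1) & 1u, b = row & 1u;
        unsigned c, r;
        full_adder(x, a, b, &c, &r);
        bool carry_rule = (x != c);
        bool shortcut   = (a == b) && (r != a);
        if (carry_rule != shortcut)
            return false;
    }
    return true;
}
```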
Now you are reading an H&P book. Which probably means mips or dlx, neither mips or dlx deal with carry and signed flags in the way that most other processors do. mips is not the best first instruction set IMO primarily for that reason, their approach is not wrong in any way, but being the oddball, you will spend forever thinking differently and having to translate when going to most other processors. Where if you learned the typical znvc flags (zero flag, negative flag, v=signed overflow, c=carry or unsigned overflow) way then you only have to translate when going to mips. Normally these are computed on every alu operation (for the non-mips type processors) you will see signed and unsigned overflow being computed for add and subtract. (I am used to an older mips, maybe this gen of books and the current instruction set has something different). Calling it addu add unsigned right at the start of mips after learning all of the above about how an adder circuit does not care about signed vs unsigned, is a huge problem with mips it really puts you in the wrong mindset for understanding something this simple. Leads to the belief that there is a difference between signed addition and unsigned addition when there isn't. It is only the overflow flags that are computed differently. Now multiply, and divide there is definitely a twos complement vs unsigned difference and you ideally need a signed multiply and an unsigned multiply or you need to deal with the limitation.
I recommend a simple exercise (how simple depends on how strong your bit manipulation and twos complement are) that you can write in some high level language. Basically take all the combinations of unsigned numbers 0 to 7 added to 0 to 7 and save the result. Print out both as decimal and as binary (three bits for operands, four bits for result), and if the result is greater than 7 print overflow as well. Repeat this using signed variables using the numbers -4 to +3 added to -4 to +3. Print both the decimal with a +/- sign and the binary. If the result is less than -4 or greater than +3 print overflow. From those two tables you should be able to see that the rules above are true. Looking strictly at the operand and result bit patterns for the size allowed (three bits in this case), you will see that the addition operation gives the same result, the same bit pattern for a given pair of inputs, independent of whether those bit patterns are considered unsigned or twos complement. You can also verify that unsigned overflow is when the result needs to use that fourth column, that is, when there is a carry out of the msbit. Signed overflow is when the carry in doesn't match the carry out, which you can see using the shortcut of looking at the msbits of the operands and result. Even better is to have your program do those comparisons and print out something. So if you print a note in your table when the result is greater than 7 and another note when bit 3 is set in the result, you will see for the unsigned table that the two always coincide (limited to inputs of 0 to 7). And for the more complicated one, signed overflow, the result is less than -4 or greater than 3 exactly when the operand upper bits match and the result upper bit does not match the operands.
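One way the exercise might look in C is sketched below. The function names are invented here, and returning the overflow count from each table is an addition made so the result is easy to check; the signed table also cross-checks the range test against the msbit shortcut and returns -1 if they ever disagree.

```c
#include <stdio.h>

/* Print the low `width` bits of v, most significant first. */
static void print_bits(unsigned v, int width)
{
    for (int i = width - 1; i >= 0; i--)
        putchar(((v >> i) & 1u) ? '1' : '0');
}

/* Unsigned table: 0..7 plus 0..7. Returns the number of overflowing rows. */
static int unsigned_table(void)
{
    int overflows = 0;
    for (unsigned a = 0; a <= 7; a++) {
        for (unsigned b = 0; b <= 7; b++) {
            unsigned sum = a + b;            /* needs up to four bits */
            int over = sum > 7;
            printf("%u + %u = %2u  ", a, b, sum);
            print_bits(a, 3); printf(" + "); print_bits(b, 3);
            printf(" = "); print_bits(sum, 4);
            puts(over ? "  overflow" : "");
            overflows += over;
        }
    }
    return overflows;
}

/* Signed table: -4..+3 plus -4..+3. Returns the number of overflowing rows. */
static int signed_table(void)
{
    int overflows = 0;
    for (int a = -4; a <= 3; a++) {
        for (int b = -4; b <= 3; b++) {
            int sum = a + b;
            unsigned pa = (unsigned)a & 7u, pb = (unsigned)b & 7u;
            unsigned pr = (pa + pb) & 7u;    /* 3-bit result pattern */
            int over = (sum < -4 || sum > 3);
            int shortcut = ((pa >> 2) == (pb >> 2)) && ((pr >> 2) != (pb >> 2));
            printf("%+d + %+d = %+d  ", a, b, sum);
            print_bits(pa, 3); printf(" + "); print_bits(pb, 3);
            printf(" = "); print_bits(pr, 3);
            puts(over ? "  overflow" : "");
            if (over != shortcut)
                return -1;                   /* would mean the shortcut is wrong */
            overflows += over;
        }
    }
    return overflows;
}
```

Calling `unsigned_table()` and `signed_table()` from `main` prints both 64-row tables.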
I know this is super long and very elementary. If I totally missed the mark here, please comment and I will remove or re-write this answer.
The other half of the twos complement magic. Hardware does not have subtract logic. One way to "convert" to twos complement is to "invert and add one". If I wanted to subtract 3 - 2 using twos complement, what actually happens is the same as +3 + (-2), right? And to get from +2 to -2 we invert and add one. Looking at our elementary school addition, did you notice the hole in the carry in on the first column?
111H
111
+ 001
=======
1000
I put an H above where the hole is. Well that carry in bit is added to the operands right? Our addition logic is not a two input adder it is a three input adder yes? Most of the columns have to add three one bit numbers in order to compute two operands. If we use a three input adder on the first column now we have a place to ... add one. If I wanted to subtract 3 - 2 = 3 + (-2) = 3 + (~2) + 1 which is:
1
011
+ 101
=====
Before we start and filled in it is:
1111
011
+ 101
=====
001
3 - 2 = 1.
What the logic does is:
if add then carry in = 0; the b operand is not inverted, the carry out is not inverted.
if subtract then carry in = 1; the b operand is inverted, the carry out MIGHT BE inverted.
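Those two rules amount to a single adder with a selectable carry in and an optional inversion of b. A rough 3-bit sketch follows; the name `addsub3` is hypothetical, and the carry out is reported raw (not inverted), which is only one of the two conventions mentioned.

```c
#include <stdbool.h>

/* A 3-bit adder/subtractor built from one adder.
   Add:      carry in = 0, b used as is.
   Subtract: carry in = 1, b inverted (a + ~b + 1).
   The adder's raw carry out is stored in *carry_out. */
static unsigned addsub3(unsigned a, unsigned b, bool subtract,
                        unsigned *carry_out)
{
    unsigned carry_in  = subtract ? 1u : 0u;
    unsigned operand_b = (subtract ? ~b : b) & 7u;
    unsigned total     = (a & 7u) + operand_b + carry_in;
    *carry_out = (total >> 3) & 1u;
    return total & 7u;
}
```

`addsub3(3, 2, true, &c)` yields 1 with a raw carry out of 1, as in the worked 3 - 2 column addition.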
The addition above shows a carry out, I didn't mention that this was an unsigned operation 3 - 2 = 1. I used some twos complement tricks to perform an unsigned operation, because here again no matter whether I interpret the operands as signed or unsigned the same rules apply for if add or if subtract. Why I said that the carry out MIGHT BE inverted is that some processors invert the carry out and some don't. It has to do with cascading operations, taking say a 32 bit addition logic and using the carry flag and an add with carry or subtract with borrow instruction creating a 64 bit add or subtract, or any multiple of the base register size. Say you have two 64 bit numbers in a 32 bit system a:b + c:d where a:b is the 64 bit number but it is held in the two registers a and b where a is the upper half and b is the lower half. so a:b + c:d = e:f on a 32 bit system unsigned that has a carry bit and add with carry:
add f,b,d
addc e,a,c
The add leaves its carry out bit from the uppermost bit lane in the carry flag in the status register. The addc instruction (add with carry) takes the operands a+c and, if the carry bit is set, adds one more: a+c+1. It puts the result in e and the carry out in the carry flag, so:
add f,b,d
addc e,a,c
addc x,y,z
Is a 96 bit addition, and so on. Here again something very foreign to mips since it doesn't use flags like other processors. Where the invert or don't invert comes in for signed carry out is on the subtract with borrow for a particular processor. For subtract:
if subtract then carry in = 1; the b operand is inverted, the carry out MIGHT BE inverted.
For subtract with borrow you have to say if the carry flag from the status register indicates a borrow then the carry in is a 0 else the carry in is a 1, and you have to get the carry out into the status register to indicate the borrow.
Basically for the normal subtract some processors invert the b operand and carry in on the way in and the carry out on the way out; some processors invert the b operand and carry in on the way in but don't invert the carry out on the way out. Then when you want to do a conditional branch you need to know if the carry flag means greater than or less than (often the syntax will have a branch if greater or branch if less than, and sometimes tell you which one is the simplified branch if carry set or branch if carry clear). (If you don't "get" what I just said there, that is another equally long answer which won't mean anything so long as you are studying mips.)
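The add / addc cascade described earlier in this answer can be imitated in C, which has no carry flag, by recovering the carry from unsigned wraparound. This is a sketch under that assumption; the function name `add64_via_32` is made up, and the parameter names follow the a:b + c:d = e:f naming used above.

```c
#include <stdint.h>

/* a:b + c:d = e:f using only 32-bit adds and an explicit carry,
   mirroring "add f,b,d" followed by "addc e,a,c".
   Returns the carry out of the upper word. */
static unsigned add64_via_32(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                             uint32_t *e, uint32_t *f)
{
    uint32_t low = b + d;               /* add f,b,d */
    unsigned carry = low < b;           /* wrapped, so there was a carry out */
    uint32_t high = a + c;              /* addc e,a,c: first a + c ...       */
    unsigned carry_hi = high < a;
    high += carry;                      /* ... then one more if carry is set */
    carry_hi |= (high < carry);         /* adding the carry may wrap as well */
    *f = low;
    *e = high;
    return carry_hi;
}
```

Chaining a third call of the same shape with the returned carry would give the 96-bit case.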
As 32-bit signed integers are represented by 1 sign bit and 31 bits for the actual number, we are effectively adding two 31-bit numbers. Hence the 32nd bit (the sign bit) is where the overflow will be visible.
"The lack of a 33rd bit means that when overflow occurs, the sign bit is set with the value of the result instead of the proper sign of the result."
Imagine the following addition of two positive numbers (16 bit to simplify):
0100 1100 0011 1010 (19514)
+ 0110 0010 0001 0010 (25106)
= 1010 1110 0100 1100 (-20916 [or 44620])
For the summation of two large negative numbers however the extra bit would be required
1100 1100 0011 1010
+ 1110 0010 0001 0010
=11010 1110 0100 1100
Usually the CPU exposes this 33rd bit (or whatever bit size it operates on, plus 1) as a carry or overflow flag in the micro-architecture.
Their description relates to operations on values with a particular bit sequence: the first bit corresponds to the sign of the value, and the other bits relate to the magnitude of that value.
What does this mean? The sign bit is set with the "value" of the result...
They mean that the overflow bit - the one that is a consequence of adding two numbers that need to spill into the next digit over - is dumped into the same place that the sign bit should be.
"Since we need just one extra bit, only the sign bit can be wrong."
All this means is that, when you perform arithmetic that overflows, the only bit whose value may be incorrect is the sign bit. All of the other bits are still the value they should be.
This is a consequence of what was described above: on overflow, the sign bit holds part of the result's value rather than the proper sign.

Where did the leading 1 means negative number in signed int arise from?

I have read a number of articles that say 2's complement is what is mostly used to represent negative numbers in a signed integer, and that it is the best method.
However, for some reason I have this (below) stuck in my head and can't get rid of it without knowing its history:
"Use the leading bit as 1 to denote negative numbers when using signed int."
I have read many posts online and on Stack Overflow saying that 2's complement is the best way to represent negative numbers. But my question is not about the best way; it is about the history: where did the "leading bit" concept arise from, and why did it disappear?
P.S: Also, it is not just me; a bunch of other folks were also getting confused by this.
Edit - 1
The so called leading 1 method I mentioned is described with an example in this post:
Why is two's complement used to represent negative numbers?
Now I understand, the MSB of 1 signifies negative numbers. This is by nature of 2's complement and not any special scheme.
Eg. If not for the 1st bit, we can't say if 1011 represents -5 or +11.
Thanks to:
jamesdlin, Oli Charlesworth, Mr Lister for asking probing questions to make me realize the correct answer.
Rant:
I think there are a bunch of groups/folks who have been taught or been made to think (incorrectly) that 1011 evaluates to -3. 1 denoting - and 011 denoting 3.
The folks who ask "what my question was.. " were probably taught the correct 2's complement way from the first instance they learnt it and weren't exposed to these wrong answers.
There are several advantages to the two's-complement representation for signed integers.
Let's assume 16 bits for now.
Non-negative numbers in the range 0 to 32,767 have the same representation in both signed and unsigned types. (Two's-complement shares this feature with ones'-complement and sign-and-magnitude.)
Two's-complement is easy to implement in hardware. For many operations, you can use the same instructions for signed and unsigned arithmetic (if you don't mind ignoring overflow). For example, -1 is represented as 1111 1111 1111 1111, and +1 as 0000 0000 0000 0001. If you add them, ignoring the fact that the high-order bit is a sign bit, the mathematical result is 1 0000 0000 0000 0000; dropping all but the low-order 16 bits, gives you 0000 0000 0000 0000, which is the correct signed result. Interpreting the same operation as unsigned, you're adding 65535 + 1, and getting 0, which is the correct unsigned result (with wraparound modulo 65536).
You can think of the leading bit, not as a "sign bit", but as just another value bit. In an unsigned binary representation, each bit represents 0 or 1 multiplied by the place value, and the total value is the sum of those products. The lowest bit's place value is 1, the next bit's is 2, then 4, etc. In a 16-bit unsigned representation, the high-order bit's place value is 32768. In a 16-bit signed two's-complement representation, the high-order bit's place value is -32768. Try a few examples, and you'll see that everything adds up nicely.
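For trying a few examples, here is a small sketch that literally sums place values, switching only the weight of the high-order bit. The name `place_value_sum` and its `is_signed` parameter are assumptions made for this illustration.

```c
#include <stdint.h>

/* Sum the place values of a 16-bit pattern. When `is_signed` is nonzero,
   the high-order bit contributes -32768 instead of +32768. */
static long place_value_sum(uint16_t bits, int is_signed)
{
    long total = 0;
    for (int i = 0; i < 15; i++)          /* bits 0..14: weights 1..16384 */
        if ((bits >> i) & 1u)
            total += 1L << i;
    if ((bits >> 15) & 1u)                /* bit 15: the only weight that differs */
        total += is_signed ? -32768L : 32768L;
    return total;
}
```

`place_value_sum(0xFFFF, 1)` is -1 (32767 - 32768), while `place_value_sum(0xFFFF, 0)` is 65535.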
See Wikipedia for more information.
It's not just about the leading bit. It's about all the bits.
Starting with addition
First let's look at how addition is done in 4-bit binary for 2 + 7:
10 +
111
____
1001
It's the same as long addition in decimal: bit by bit, right to left.
In the rightmost place we add 0 and 1, it makes 1, no carry.
In the second place from the right, we add 1 and 1, that makes 2 in decimal or 10 in binary - we write the 0, carry the 1.
In the third place from the right, we add the 1 we carried to the 1 already there, it makes binary 10. We write the 0, carry the 1.
The 1 that just got carried gets written in the fourth place from the right.
Long subtraction
Now we know that binary 10 + 111 = 1001, we should be able to work backwards and prove that 1001 - 10 = 111. Again, this is exactly the same as in decimal long subtraction.
1001 -
10
____
111
Here's what we did, working right to left again:
In the rightmost place, 1 - 0 = 1, we write that down.
In the second place, we have 0 - 1, so we need to borrow an extra bit. We now do binary 10 - 1, which leaves 1. We write this down.
In the third place, remember we borrowed an extra bit - so again we have 0 - 1. We use the same trick to borrow an extra bit, giving us 10 - 1 = 1, which we put in the third place of the result.
In the fourth place, we again have a borrowed bit to deal with. Subtract the borrowed bit from the 1 already there: 1 - 1 = 0. We could write this down in front of the result, but since it's the end of the subtraction there's no need.
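The same borrowing procedure can be written as a loop. This is a sketch of the manual method with made-up names (`sub_bitwise`, `borrow_out`), assuming a width between 1 and 32.

```c
#include <stdint.h>

/* Subtract the low `width` bits of b from a, one place at a time,
   right to left. Whenever a place would go negative, borrow binary 10
   from the next place. The final borrow is reported via borrow_out. */
uint32_t sub_bitwise(uint32_t a, uint32_t b, int width, int *borrow_out)
{
    uint32_t result = 0;
    int borrow = 0;
    for (int i = 0; i < width; i++) {
        int diff = (int)((a >> i) & 1u) - (int)((b >> i) & 1u) - borrow;
        if (diff < 0) {
            diff += 2;   /* borrowed bit is worth binary 10 here */
            borrow = 1;
        } else {
            borrow = 0;
        }
        result |= (uint32_t)diff << i;
    }
    *borrow_out = borrow;
    return result;
}
```

With a width of 4, subtracting binary 10 from 1001 gives 111, and the borrow is 0 when the loop finishes.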
There's a number less than zero?!
Do you remember how you learnt about negative numbers? Part of the idea is that you can subtract any number from any other number and still get a number. So 7 - 5 is 2; 6 - 5 is 1; 5 - 5 is 0; What is 4 - 5? Well, one way to reason about such numbers is simply to apply the same method as above to do the subtraction.
As an example, let's try 2 - 7 in binary:
     10 -
    111
_______
...1011
I started in the same way as before:
In the rightmost place, subtract 1 from 0, which requires a borrowed bit. 10 - 1 = 1, so the last bit of the result is 1.
In the second-rightmost place, we have 1 - 1 with an extra borrow bit, so we have to subtract another 1. This means we need to borrow our own bit, giving 11 - 1 - 1 = 1. We write 1 in the second-rightmost spot.
In the third place, there are no more bits in the top number! But we know we can pretend there's a 0 in front, just like we would do if the bottom number ran out of bits. So we have 0 - 1 - 1 because of the borrow bit from second place. We have to borrow a bit again! Anyway we have 10 - 1 - 1 = 0, which we write down in the third place from the right.
Now something very interesting has happened - both the operands of the subtraction have no more digits, but we still have a borrow bit to take care of! Oh well, let's just carry on as we have been doing. We have 0 - 0, since neither the top nor the bottom operand has any bits here, but because of the borrow bit it's actually 0 - 1.
(We have to borrow again! If we keep borrowing like this we'll have to declare bankruptcy soon.)
Anyway, we borrow the bit, and we get 10 - 1 = 1, which we write in the fourth place from the right.
Now anyone with half a mind is about to see that we are going to keep borrowing bits until the cows come home, because there ain't no more bits to go around! We ran out of them two places ago if you forgot. But if you tried to keep going it'd look like this:
...00000010
...00000111
___________
...11111011
In the fifth place we get 0 - 0 - 1, and we borrow a bit to get 10 - 0 - 1 = 1.
In the sixth place we get 0 - 0 - 1, and we borrow a bit to get 10 - 0 - 1 = 1.
In the seventh place we get 0 - 0 - 1, and we borrow a bit to get 10 - 0 - 1 = 1.
...And so it goes on for as many places as you like. By the way, we just derived the two's complement binary form of -5.
You could try this for any pair of numbers you like, and generate the two's complement form of any negative number. If you try to do 0 - 1, you'll see why -1 is represented as ...11111111. You'll also realise why all two's complement negative numbers have a 1 as their most significant bit (the "leading bit" in the original question).
In practice, your computer doesn't have infinitely many bits to store negative numbers in, so it usually stops after some more reasonable number, like 32. What do we do with the extra borrow bit in position 33? Eh, we just quietly ignore it and hope no one notices. When someone does notice that our new number system doesn't work, we call it integer overflow.
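C's unsigned types behave exactly this way: arithmetic is defined modulo 2^N, so the borrow out of the top place is dropped. A minimal sketch, with `sub8` and `sub32` as illustrative names:

```c
#include <stdint.h>

/* 8-bit subtraction: the result is reduced modulo 256, which is the
   same as ignoring the borrow out of the eighth place. */
uint8_t sub8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a - b);
}

/* 32-bit subtraction: same idea, reduced modulo 2^32. */
uint32_t sub32(uint32_t a, uint32_t b)
{
    return a - b;
}
```

`sub8(2, 7)` yields the pattern 11111011 (0xFB) and `sub32(2, 7)` yields 0xFFFFFFFB. Both are the ...11111011 pattern from above, cut off at 8 and 32 places.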
Final notes
This isn't the only way to make our number system work, of course. After all, if I owe you $5, I wouldn't say that your current balance with me was $...999999995.
But there are some cool things about the system we just derived, like the fact that subtraction gives you the right result in this system, even if you ignore the fact that one of the numbers is negative. Normally, we have to think about subtractions with conditional steps: to calculate 2 - 7, we first have to figure out that 2 is less than 7, so instead we calculate 7 - 2 = 5, and then stick a minus sign in front to get 2 - 7 = -5. But with two's complement we just go ahead and do the subtraction and don't care about which number is bigger, and the right result comes out by itself. And others have mentioned that addition works nicely, and so does multiplication.
You don't use the leading bit, per se. For instance, in an 8-bit signed char,
11111111
represents -1. You can test the leading bit to determine if it is a negative number.
There are a number of reasons to use 2's complement, but the first and greatest is convenience. Take the above number and add 2. What do we end up with?
00000001
You can add and subtract 2's complement numbers basically for free. This was a big deal historically, because the logic is very simple; you don't need dedicated hardware to handle signed numbers. You use fewer transistors, you need a less complicated design, etc. It goes back to before 8-bit microprocessors, which didn't even have multiply instructions built in (even many 16-bit ones didn't have them, such as the 65C816 used in the Apple IIGS and Super NES).
With that said, multiplication is relatively trivial with 2's complement also, so that's no big deal.
Complements (including things like 9s complement in decimal, mechanical calculators / adding-machines / cash registers) have been around forever. In nines' complement with four decimal digits, for instance, values in the range 0000..4999 are positive while values in 5000..9999 are negative. See http://en.wikipedia.org/wiki/Method_of_complements for details.
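To make the four-digit example concrete, here is a rough sketch of nines' complement arithmetic in C. The function names are invented for this illustration, and values are assumed to stay in the range 0..9999. The end-around carry in `nines_add` is the standard rule for this kind of complement.

```c
/* Negate by subtracting every digit from 9, i.e. 9999 - v. */
int nines_negate(int v)
{
    return 9999 - v;
}

/* 0000..4999 read as positive; 5000..9999 read as negative. */
int nines_decode(int v)
{
    return v >= 5000 ? -(9999 - v) : v;
}

/* Add two four-digit values; a carry out of the top digit is added
   back in at the bottom (end-around carry). */
int nines_add(int a, int b)
{
    int sum = a + b;
    if (sum > 9999)
        sum = sum - 10000 + 1;
    return sum;
}
```

Here -2 is stored as 9997, so 7 + (-2) is 7 + 9997 = 10004, which becomes 0004 + 1 = 0005 after the end-around carry. Note that 5 + (-5) comes out as 9999, the "negative zero" of this system.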
This directly gives rise to 1s complement in binary, and in both 1s and 2s complement, the topmost bit acts as a "sign bit". This does not explain exactly how computers moved from ones' complement to two's complement (I use Knuth's apostrophe convention when spelling these out as words with apostrophes, by the way). I think it was a combination of luck, irritation at "negative zero", and the way ones' complement requires end-around carry (vs two's complement, not requiring it).
In a logical sense, it does not matter which bit you use to represent signs, but for practical purposes, using the top bit, and two's complement, simplifies the hardware. Back when transistors were expensive, this was pretty important. (Or even tubes, although I think most if not all vacuum-tube computers used ones' complement. In any case they predated the C language by rather a lot.)
In summary, the history goes back way before electronic computers and the C language, and there was no reason to change from a good way of implementing this mechanically, when converting from mechanical calculators to vacuum-tube ENIACs to transistorized computers and then on to "chips", MSI, LSI, VLSI, and onward.
Well, it had to work such that 2 plus -2 gives zero. Early CPUs had hardware addition and subtraction and someone noticed that by complementing all the bits (one's complement, the original system), to change the "sign" of the value, it allowed the existing addition hardware to work properly—except that sometimes the result was negative zero. (What is the difference between -0 and 0? On such machines, it was indeterminate.)
Someone soon realized that by using two's complement (convert a number between negative and positive by inverting the bits and adding one), the negative zero problem was avoided.
So really, it is not just the sign bit which is affected by negatives, but all of the bits except the LSB. However, by examining the MSB, one can immediately determine whether the signed value there is negative.
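A small sketch of the two negation rules on 8-bit patterns, plus the MSB test. The helper names are made up for this example.

```c
#include <stdint.h>

/* Ones' complement negation: invert every bit. */
uint8_t ones_negate(uint8_t x)
{
    return (uint8_t)~x;
}

/* Two's complement negation: invert every bit, then add one. */
uint8_t twos_negate(uint8_t x)
{
    return (uint8_t)(~x + 1u);
}

/* A signed 8-bit pattern is negative exactly when its top bit is set. */
int is_negative8(uint8_t bits)
{
    return (bits & 0x80u) != 0;
}
```

`ones_negate(0)` gives 11111111, a second pattern for zero, while `twos_negate(0)` gives 00000000 again. `twos_negate(2)` gives 11111110, and adding that to 00000010 in 8 bits gives zero, as required.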

Floating Point Algorithms in C

I have been thinking recently about how floating point math works on computers, and it is hard for me to understand all the technical details behind the formulas. I need to understand the basics of addition, subtraction, multiplication, division and remainder. With these I will be able to build trig functions and formulas.
I can guess something about it, but it's a bit unclear. I know that a fixed point value can be made by splitting a 4-byte integer into a sign flag, a radix and a mantissa. With this we have a 1-bit flag, a 5-bit radix and a 10-bit mantissa. A word of 32 bits is perfect for a floating point value :)
To add two floats, can I simply add the two mantissas and add the carry to the 5-bit radix? Is this a way to do floating point math (or fixed point math, to be honest), or am I completely wrong?
All the explanations I saw use formulas, multiplications, etc., and they look very complex for something that I guess should be a bit simpler. I need an explanation aimed more at beginning programmers and less at mathematicians.
See Anatomy of a floating point number
The radix depends on the representation. If you use radix r=2 you can never change it; the number doesn't even carry any data that tells you which radix it has. I think you are mistaken and mean the exponent.
To add two floating point numbers you must first make their exponents equal by shifting the mantissa. Shifting one bit right means exponent + 1, and shifting one bit left means exponent - 1. Once both numbers have the same exponent, you can add them.
Value(x) = mantissa * radix ^ exponent
adding these two numbers
101011 * 2 ^ 13
001011 * 2 ^ 12
would be the same as adding:
101011 * 2 ^ 13
000101 * 2 ^ 13
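After making the exponents equal, you can add the mantissas. Note that the bit shifted out on the right is lost, so the aligned form is only an approximation of the original number.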
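Here is a rough sketch of that alignment step in C. It uses a toy format invented for this example (`struct toy_float`, unsigned mantissa, no sign, no normalization, no rounding), so it is far from a real floating point implementation.

```c
#include <stdint.h>

/* Hypothetical toy format: value = mantissa * 2^exponent. */
struct toy_float {
    uint32_t mantissa;
    int exponent;
};

struct toy_float toy_add(struct toy_float a, struct toy_float b)
{
    /* Shift the operand with the smaller exponent right, one bit at a
       time, raising its exponent until both match. Bits shifted out
       are simply dropped. */
    while (a.exponent < b.exponent) { a.mantissa >>= 1; a.exponent++; }
    while (b.exponent < a.exponent) { b.mantissa >>= 1; b.exponent++; }

    struct toy_float sum = { a.mantissa + b.mantissa, a.exponent };
    return sum;
}
```

For the example above, 001011 * 2^12 is shifted to 000101 * 2^13, and the sum is 101011 + 000101 = 110000 with exponent 13. A real implementation would also handle signs, renormalize the result, and round instead of truncating.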
You also have to know whether the representation has an implicit bit. That is, the most significant bit of the mantissa must be a 1, so usually, as in the IEEE standard, it is known to be there and is not stored, although it is still used in the arithmetic.
I know this can be a bit confusing and I'm not the best teacher, so if anything is unclear, just ask.
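To see the implicit bit in practice, the sketch below pulls apart a C `float`. It assumes `float` is an IEEE 754 single-precision value (1 sign bit, 8 exponent bits, 23 stored fraction bits), which is true on most current platforms but not guaranteed by the C standard, and it only handles normal numbers. The helper names are invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw bits of a float into an integer (assumes 32-bit float). */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* The 23 fraction bits that are actually stored. */
uint32_t stored_fraction(float f)
{
    return float_bits(f) & 0x7FFFFFu;
}

/* For normal numbers, put the implied leading 1 back as bit 23. */
uint32_t full_significand(float f)
{
    return stored_fraction(f) | 0x800000u;
}

/* The exponent field is stored with a bias of 127. */
int unbiased_exponent(float f)
{
    return (int)((float_bits(f) >> 23) & 0xFFu) - 127;
}
```

For `1.0f` the stored fraction is all zeros, yet the significand used in arithmetic is 0x800000, which is the hidden 1 followed by 23 zero bits.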
Run, don't walk, to get Knuth's Seminumerical Algorithms which contains wonderful intuition and algorithms behind doing multiprecision and floating point arithmetic.

Resources