It is my understanding that numbers are negated using the two's compliment, which to my understanding is: !num + 1.
So my question is does this mean that, for variable 'foo'=1, a negated 'foo' will be the exactly the same as variable 'bar'=255.
f we were to check if -'foo' == 'bar' or if -'foo' == 255, would we get that they are equal?
I know that some languages, such as Java, keep a sign bit -- so the comparisons would yield false. What of languages that do not? And I'm assuming that assembler/native machine does not have a sign bit.
In addition to all of this, I read about a zero flag or a carry-over flag that is set when a 'negative' number is added to another (of any sign) number. This flag being set whenever it is added because of the way two's complement works, 0x01 + 0xff = 0x00 (with the leading 1 truncated). What exactly is this flag used for?
And my last question, for other math operations (such as multiplication), would I have to re-negate the number (so it is now positive), perform the operation, and negate the result? E.g., !((!neg + 1) * pos) + 1.
Edit
Finished the question, so feel free fire away.
Yes, in two’s complement, the number x is represented as ~x+1, where ~x is the bitwise complement of the binary numeral for x in some fixed number bits. E.g., for eight bits, the binary numeral for x is 000000001, so the bitwise complement is 11111110, and adding one produces 11111111.
There is no way to distinguish -1 in eight-bit two’s complement from 255 in eight-bit binary (with no sign). They both have the same representation in bits: 11111111. If you are using both of these numbers, you must either separately remember which one is eight-bit two’s complement and which one is plain eight-bit binary or you must use more than eight bits. In other words, at the raw bit level, 11111111 is just eight bits; it has no value until we decide how to interpret it.
Java and typical other languages do not maintain a sign bit separate from the value of a number; the sign is part of the encoding of the number. Also, typical languages do not allow you to compare different types. If you have a two’s complement x and an unsigned y, then either one must be converted to the type of the other before comparison or they must both be converted to a third type. Thus, if you compare x and y, and one is converted to the other, then the conversion will overflow or wrap, and you cannot expect to get the correct mathematical result. To compare these two numbers, we might convert each of them to a wider integer, such as 32-bits, then compare. Converting the eight-bit two’s complement 11111111 to a 32-bit integer produces -1, and converting the eight-bit plain binary 11111111 to a 32-bit integer produces 255, and then the comparison reports they are unequal.
The zero flag and the carry flag you read about are flags that are set when a comparison instruction is executed in a computer processor. Most high-level languages do not give you direct access to these flags. Many processors have an instruction with a form like this:
cmp a, b
That instruction subtracts b from a and discards the difference but remembers several flags that describe the subtraction: Was the result zero (zero flag)? Did a borrow occur (borrow flag)? Was the result negative (sign flag)? Did an overflow occur (overflow flag)?
The compare instruction requires that the two things being compared be the same type (two’s complement or unsigned), but it does not care which type. The results can be tested later by checking particular combinations of the flags depending on the type. That is, the information recorded in the flags can distinguish whether one two’s complement number was greater than another or whether one unsigned number was greater than another, depending on what tests are made. There are conditional branch instructions that test the desired flag properties.
There is generally no need to “un-negate” a number to perform arithmetic operations. Processors include arithmetic instructions that work on two’s complement numbers. Usually the add and subtract instructions are type-agnostic, the same way the compare instruction is, but the multiply and divide instructions are not (except for some forms of multiply that return partial results). The add and subtract instructions can be type-agnostic because the wrapping that occurs in the arithmetic works for both two’s complement and unsigned. However, that wrapping does not work for multiplication and division.
Related
When I need to subtract 2 numbers (X-Y), I can take 2's complement of Y and add it to X. Let's say our system represents integers using a byte (8 bits).
X = 7 = 00000111
Y = 5 = 00000101
2's complement of 5
11111010 + 1 = 11111011
Adding those 2 =
00000111
11111011
__________
100000010
There is a carryover. How does one deal with this carryover?
If I am using 8 bits, that means I have a range of -128 to 127. So 7 and -5 and their sum do not fall outside that range. So this is not overflow.
that depends on what you are trying to do
if you are just computing simple/single +/- operations
then the overflow is usually ignored
when you need to handle overflow/underflow
for example if you need to clamp the result for some reason (usually safety of the result range ...) then Carry flag of the ALU marks if the overflow underflow occur. After that you set the result as max positive or negative value depending on the inputs sign,magnitude and operation (+,-). Aome platforms have instructions that do this automatically (saturated add,sub).
Another reason is making bigint operations in that case carry is added as +/-1 to higher operation (sign depends on the operation)... but the result itself stays as is (add,adc,adc,adc,...)
on modern languages/platforms you do not have direct ALU flag register access anymore
sometimes you can tap into assembler but it can be slower in some cases then the computation itself. In that case use this approach 32bit ALU in C++ where cy is the carry flag
Should have read my textbook again :)
In 2's complement arithmetic the carryover is thrown away, versus 1's complement arithmetic, where carryover is carried back and added to the result.
This video helped me understand - https://www.youtube.com/watch?v=lKTsv6iVxV4
I have a question about working on very big numbers. I'm trying to run RSA algorithm and lets's pretend i have 512 bit number d and 1024 bit number n. decrypted_word = crypted_word^d mod n, isn't it? But those d and n are very large numbers! Non of standard variable types can handle my 512 bit numbers. Everywhere is written, that rsa needs 512 bit prime number at last, but how actually can i perform any mathematical operations on such a number?
And one more think. I can't use extra libraries. I generate my prime numbers with java, using BigInteger, but on my system, i have only basic variable types and STRING256 is the biggest.
Suppose your maximal integer size is 64 bit. Strings are not that useful for doing math in most languages, so disregard string types. Now choose an integer of half that size, i.e. 32 bit. An array of these can be interpreted as digits of a number in base 232. With these, you can do long addition and multiplication, just like you are used to with base 10 and pen and paper. In each elementary step, you combine two 32-bit quantities, to produce both a 32-bit result and possibly some carry. If you do the elementary operation in 64-bit arithmetic, you'll have both of these as part of a single 64-bit variable, which you'll then have to split into the 32-bit result digit (via bit mask or simple truncating cast) and the remaining carry (via bit shift).
Division is harder. But if the divisor is known, then you may get away with doing a division by constant using multiplication instead. Consider an example: division by 7. The inverse of 7 is 1/7=0.142857…. So you can multiply by that to obtain the same result. Obviously we don't want to do any floating point math here. But you can also simply multiply by 14286 then omit the last six digits of the result. This will be exactly the right result if your dividend is small enough. How small? Well, you compute x/7 as x*14286/100000, so the error will be x*(14286/100000 - 1/7)=x/350000 so you are on the safe side as long as x<350000. As long as the modulus in your RSA setup is known, i.e. as long as the key pair remains the same, you can use this approach to do integer division, and can also use that to compute the remainder. Remember to use base 232 instead of base 10, though, and check how many digits you need for the inverse constant.
There is an alternative you might want to consider, to do modulo reduction more easily, perhaps even if n is variable. Instead of expressing your remainders as numbers 0 through n-1, you could also use 21024-n through 21024-1. So if your initial number is smaller than 21024-n, you add n to convert to this new encoding. The benefit of this is that you can do the reduction step without performing any division at all. 21024 is equivalent to 21024-n in this setup, so an elementary modulo reduction would start by splitting some number into its lower 1024 bits and its higher rest. The higher rest will be right-shifted by 1024 bits (which is just a change in your array indexing), then multiplied by 21024-n and finally added to the lower part. You'll have to do this until you can be sure that the result has no more than 1024 bits. How often that is depends on n, so for fixed n you can precompute that (and for large n I'd expect it to be two reduction steps after addition but hree steps after multiplication, but please double-check that) whereas for variable n you'll have to check at runtime. At the very end, you can go back to the usual representation: if the result is not smaller than n, subtract n. All of this should work as described if n>2512. If not, i.e. if the top bit of your modulus is zero, then you might have to make further adjustments. Haven't thought this through, since I only used this approach for fixed moduli close to a power of two so far.
Now for that exponentiation. I very much suggest you do the binary approach for that. When computing xd, you start with x, x2=x*x, x4=x2*x2, x8=…, i.e. you compute all power-of-two exponents. You also maintain some intermediate result, which you initialize to one. In every step, if the corresponding bit is set in the exponent d, then you multiply the corresponding power into that intermediate result. So let's say you have d=11. Then you'd compute 1*x1*x2*x8 because d=11=1+2+8=10112. That way, you'll need only about 1024 multiplications max if your exponent has 512 bits. Half of them for the powers-of-two exponentiation, the other to combine the right powers of two. Every single multiplication in all of this should be immediately followed by a modulo reduction, to keep memory requirements low.
Note that the speed of the above exponentiation process will, in this simple form, depend on how many bits in d are actually set. So this might open up a side channel attack which might give an attacker access to information about d. But if you are worried about side channel attacks, then you really should have an expert develop your implementation, because I guess there might be more of those that I didn't think about.
You may write some macros you may execute under Microsoft for functions like +, -, x, /, modulo, x power y which work generally for any integer of less than ten or hundred thousand digits (the practical --not theoretical-- limit being the internal memory of your CPU). Please note the logic is exactly the same as the one you got at elementary school.
E.g.: p= 1819181918953471 divider of (2^8091) - 1, q = ((2^8091) - 1)/p, mod(2^8043 ; q ) = 23322504995859448929764248735216052746508873363163717902048355336760940697615990871589728765508813434665732804031928045448582775940475126837880519641309018668592622533434745187004918392715442874493425444385093718605461240482371261514886704075186619878194235490396202667733422641436251739877125473437191453772352527250063213916768204844936898278633350886662141141963562157184401647467451404036455043333801666890925659608198009284637923691723589801130623143981948238440635691182121543342187092677259674911744400973454032209502359935457437167937310250876002326101738107930637025183950650821770087660200075266862075383130669519130999029920527656234911392421991471757068187747362854148720728923205534341236146499449910896530359729077300366804846439225483086901484209333236595803263313219725469715699546041162923522784170350104589716544529751439438021914727772620391262534105599688603950923321008883179433474898034318285889129115556541479670761040388075352934137326883287245821888999474421001155721566547813970496809555996313854631137490774297564881901877687628176106771918206945434350873509679638109887831932279470631097604018939855788990542627072626049281784152807097659485238838560958316888238137237548590528450890328780080286844038796325101488977988549639523988002825055286469740227842388538751870971691617543141658142313059934326924867846151749777575279310394296562191530602817014549464614253886843832645946866466362950484629554258855714401785472987727841040805816224413657036499959117701249028435191327757276644272944743479296268749828927565559951441945143269656866355210310482235520220580213533425016298993903615753714343456014577479225435915031225863551911605117029393085632947373872635330181718820669836830147312948966028682960518225213960218867207825417830016281036121959384707391718333892849665248512802926601676251199711698978725399048954325887410317060400620412797240129787158839164969382498537742579233544463501470239575760940937130926062252501116458281610468726777710383038372260777522143500312913040987942762244940009811450966646527814576364565964518092955053720983465333258335601691477534154940549197873199633313223848155047098569827560014018412679602636286195283270106917742919383395056306107175539370483171915774381614222806960872813575048014729965930007408532959309197608469115633821869206793759322044599554551057140046156235152048507130125695763956991351137040435703946195318000567664233417843805257728.
The last step took about 0.1 sec.
wpjo (willibrord oomen on academia.edu)
I am developing a programming language, September, which uses a tagged variant type as its main value type. 3 bits are used for the type (integer, string, object, exception, etc.), and 61 bits are used for the actual value (the actual integer, pointer to the object, etc.).
Soon, it will be time to add a float type to the language. I almost have the space for a 64-bit double, so I wanted to make use of doubles for calculations internally. Since I'm actually 3 bits short for storage, I would have to round the doubles off after each calculation - essentially resulting in a 61-bit double with a mantissa or exponent shorter by 3 bits.
But! I know floating point is fraught with peril and doing things which sound sensible on paper can produce disastrous results with FP math, so I have an open-ended question to the experts out there:
Is this approach viable at all? Will I run into serious error-accumulation problems in long-running calculations by rounding at each step? Is there some specific way in which I could do the rounding in order to avoid that? Are there any special values that I won't be able to treat that way (subnormals come to mind)?
Ideally, I would like my floats to be as well-behaved as a native 61-bit double would be.
I would recommend borrowing bits from the exponent field of the double-precision format. This is the method described in this article (that you would modify to borrow 3 bits from the exponent instead of 1). With this approach, all computations that do not use very large or very small intermediate results behave exactly as the original double-precision computation would. Even computations that run into the subnormal region of the new format behave exactly as they would if a 1+8+52 61-bit format had been standardized by IEEE.
By contrast, naively borrowing any number of bits at all from the significand introduces many double-rounding problems, all the more frequent that you are rounding from a 52-bit significand to a significand with only a few bits removed. Borrowing one bit from the significand as you suggest in an edit to your question would be the worst, with half the operations statistically producing double-rounded results that are different from what the ideal “native 61-bit double” would have produced. This means that instead of being accurate to 0.5ULP, the basic operations would be accurate to 3/4ULP, a dramatic loss of accuracy that would derail many of the existing, finely-designed numerical algorithms that expect 0.5ULP.
Three is a significant number of bits to borrow from an exponent that only has 11, though, and you could also consider using the single-precision 32-bit format in your language (calling the single-precision operations from the host).
Lastly, I give visibility here to another solution found by Jakub: borrow the three bits from the significand, and simulate round-to-odd for the intermediate double-precision computation before converting to the nearest number in 49-explicit-significand-bit, 11-exponent-bit format. If this way is chosen, it may useful to remark that the rounding itself to 49 bits of significand can be achieved with the following operations:
if ((repr & 7) == 4)
repr += (repr & 8) >> 1); /* midpoint case */
else
repr += 4;
repr &= ~(uint64_t)7; /* round to the nearest */
Despite working on the integer having the same representation as the double being considered, the above snippet works even if the number goes from normal to subnormal, from subnormal to normal, or from normal to infinite. You will of course want to set a tag in the three bits that have been freed as above. To recover a standard double-precision number from its unboxed representation, simply clear the tag with repr &= ~(uint64_t)7;.
This is a summary of my own research and information from the excellent answer by #Pascal Cuoq.
There are two places where we can truncate the 3-bits we need: the exponent, and the mantissa (significand). Both approaches run into problems which have to be explicitly handled in order for the calculations to behave as if we used a hypothetical native 61-bit IEEE format.
Truncating the mantissa
We shorten the mantissa by 3 bits, resulting in a 1s+11e+49m format. When we do that, performing calculations in double-precision and then rounding after each computation exposes us to double rounding problems. Fortunately, double rounding can be avoided by using a special rounding mode (round-to-odd) for the intermediate computations. There is an academic paper describing the approach and proving its correctness for all doubles - as long as we truncate at least 2 bits.
Portable implementation in C99 is straightforward. Since round-to-odd is not one of the available rounding modes, we emulate it by using fesetround(FE_TOWARD_ZERO), and then setting the last bit if the FE_INEXACT exception occurs. After computing the final double this way, we simply round to nearest for storage.
The format of the resulting float loses about 1 significant (decimal) digit compared to a full 64-bit double (from 15-17 digits to 14-16).
Truncating the exponent
We take 3 bits from the exponent, resulting in a 1s+8e+52m format. This approach (applied to a hypothetical introduction of 63-bit floats in OCaml) is described in an article. Since we reduce the range, we have to handle out-of-range exponents on both the positive side (by simply 'rounding' them to infinity) and the negative side. Doing this correctly on the negative side requires biasing the inputs to any operation in order to ensure that we get subnormals in the 64-bit computation whenever the 61-bit result needs to be subnormal. This has to be done a bit differently for each operation, since what matters is not whether the operands are subnormal, but whether we expect the result to be (in 61-bit).
The resulting format has significantly reduced range since we borrow a whopping 3 out of 11 bits of the exponent. The range goes down from 10-308...10308 to about 10-38 to 1038. Seems OK for computation, but we still lose a lot.
Comparison
Both approaches yield a well-behaved 61-bit float. I'm personally leaning towards truncating the mantissa, for three reasons:
the "fix-up" operations for round-to-odd are simpler, do not differ from operation to operation, and can be done after the computation
there is a proof of mathematical correctness of this approach
giving up one significant digit seems less impactful than giving up a big chunk of the double's range
Still, for some uses, truncating the exponent might be more attractive (especially if we care more about precision than range).
If I run following line of code, I get DIVIDE BY ZERO error
1. System.out.println(5/0);
which is the expected behavior.
Now I run the below line of code
2. System.out.println(5/0F);
here there is no DIVIDE BY ZERO error, rather it shows INFINITY
In the first line I am dividing two integers and in the second two real numbers.
Why does dividing by zero for integers gives DIVIDE BY ZERO error while in the case of real numbers it gives INFINITY
I am sure it is not specific to any programming language.
(EDIT: The question has been changed a bit - it specifically referred to Java at one point.)
The integer types in Java don't have representations of infinity, "not a number" values etc - whereas IEEE-754 floating point types such as float and double do. It's as simple as that, really. It's not really a "real" vs "integer" difference - for example, BigDecimal represents real numbers too, but it doesn't have a representation of infinity either.
EDIT: Just to be clear, this is language/platform specific, in that you could create your own language/platform which worked differently. However, the underlying CPUs typically work the same way - so you'll find that many, many languages behave this way.
EDIT: In terms of motivation, bear in mind that for the infinity case in particular, there are ways of getting to infinity without dividing by zero - such as dividing by a very, very small floating point number. In the case of integers, there's obviously nothing between zero and one.
Also bear in mind that the cases in which integers (or decimal floating point types) are used typically don't need to concept of infinity, or "not a number" results - whereas in scientific applications (where float/double are more typically useful), "infinity" (or at least, "a number which is too large to sensibly represent") is still a potentially valid result.
This is specific to one programming language or a family of languages. Not all languages allow integers and floats to be used in the same expression. Not all languages have both types (for example, ECMAScript implementations like JavaScript have no notion of an integer type externally). Not all languages have syntax like this to convert values inline.
However, there is an intrinsic difference between integer arithmetic and floating-point arithmetic. In integer arithmetic, you must define that division by zero is an error, because there are no values to represent the result. In floating-point arithmetic, specifically that defined in IEEE-754, there are additional values (combinations of sign bit, exponent and mantissa) for the mathematical concept of infinity and meta-concepts like NaN (not a number).
So we can assume that the / operator in this programming language is generic, that it performs integer division if both operands are of the language's integer type; and that it performs floating-point division if at least one of the operands is of a float type of the language, whereas the other operands would be implicitly converted to that float type for the purpose of the operation.
In real-number math, division of a number by a number close to zero is equivalent to multiplying the first number by a number whose absolute is very large (x / (1 / y) = x * y). So it is reasonable that the result of dividing by zero should be (defined as) infinity as the precision of the floating-point value would be exceeded.
Implementation details were to be found in that programming language's specification.
I have a method that deals with some geographic coordinates in .NET, and I have a struct that stores a coordinate pair such that if 256 is passed in for one of the coordinates, it becomes 0. However, in one particular instance a value of approximately 255.99999998 is calculated, and thus stored in the struct. When it's printed in ToString(), it becomes 256, which should not happen - 256 should be 0. I wouldn't mind if it printed 255.9999998 but the fact that it prints 256 when the debugger shows 255.99999998 is a problem. Having it both store and display 0 would be even better.
Specifically there's an issue with comparison. 255.99999998 is sufficiently close to 256 such that it should equal it. What should I do when comparing doubles? use some sort of epsilon value?
EDIT: Specifically, my problem is that I take a value, perform some calculations, then perform the opposite calculations on that number, and I need to get back the original value exactly.
This sounds like a problem with how the number is printed, not how it is stored. A double has about 15 significant figures, so it can tell 255.99999998 from 256 with precision to spare.
You could use the epsilon approach, but the epsilon is typically a fudge to get around the fact that floating-point arithmetic is lossy.
You might consider avoiding binary floating-points altogether and use a nice Rational class.
The calculation above was probably destined to be 256 if you were doing lossless arithmetic as you would get with a Rational type.
Rational types can go by the name of Ratio or Fraction class, and are fairly simple to write
Here's one example.
Here's another
Edit....
To understand your problem consider that when the decimal value 0.01 is converted to a binary representation it cannot be stored exactly in finite memory. The Hexidecimal representation for this value is 0.028F5C28F5C where the "28F5C" repeats infinitely. So even before doing any calculations, you loose exactness just by storing 0.01 in binary format.
Rational and Decimal classes are used to overcome this problem, albeit with a performance cost. Rational types avoid this problem by storing a numerator and a denominator to represent your value. Decimal type use a binary encoded decimal format, which can be lossy in division, but can store common decimal values exactly.
For your purpose I still suggest a Rational type.
You can choose format strings which should let you display as much of the number as you like.
The usual way to compare doubles for equality is to subtract them and see if the absolute value is less than some predefined epsilon, maybe 0.000001.
You have to decide yourself on a threshold under which two values are equal. This amounts to using so-called fixed point numbers (as opposed to floating point). Then, you have to perform the round up manually.
I would go with some unsigned type with known size (eg. uint32 or uint64 if they're available, I don't know .NET) and treat it as a fixed point number type mod 256.
Eg.
typedef uint32 fixed;
inline fixed to_fixed(double d)
{
return (fixed)(fmod(d, 256.) * (double)(1 << 24))
}
inline double to_double(fixed f)
{
return (double)f / (double)(1 << 24);
}
or something more elaborated to suit a rounding convention (to nearest, to lower, to higher, to odd, to even). The highest 8 bits of fixed hold the integer part, the 24 lower bits hold the fractional part. Absolute precision is 2^{-24}.
Note that adding and substracting such numbers naturally wraps around at 256. For multiplication, you should beware.