Do we really need 1s complement? - math

Why do we actually need 1s compliment?
Can we simply make a number negative or positive by by just flipping one bit!
the MSB bit?
Why do we need to flip all bits

Depending on your needs, 1s complement, 2s complement or sign-magnitude (flipping the MS bit) can all be useful. There is no definitive criterion by which any of the representations is universally superior.
To answer your question, 1s complement can be better than sign-magnitude when you are performing arithmetic operations. You would need different hardware for addition and subtraction when using sign-magnitude; but you can use the same hardware for both operations when using 1s complement (i.e., addition is addition, subtraction is addition with complement).

Related

Is there a more efficient way to divide and conquer a uint 256 log on 64 bit hardware with rust or inline assembly than converging Taylor series?

I am looking to take the log base n (10 would be fine) of a 256 bit unsigned integer as a floating point in rust, with no loss of precision. It would seem to me that I need to implement an 8xf64 512 bit float 512 type and use a Taylor series to approximate ln and then the log. I know there are assembly methods to obtain the log of an f64. I am wondering if anyone on stack overflow can think of a divide and conquer or other method which would be more efficient. I would be amenable to inline assembly operating on the 8xf64 512 bit array.
This might be a useful starting point / outline of an algorithm. IDK if it will get you exact results, like error <= 0.5ulp (i.e. the last bit of the mantissa of your 512-bit float correctly rounded), or even error <= 1 ulp. Perhaps worth looking into what extended-precision calculators like bc / dc / calc do.
I think log converges quickly, so if you're going to do Newton iterations to refine, this bit-scan method might be a fast way to get a good starting point. Even if you only really need about 256 mantissa bits correct, I don't know how big a polynomial it would take to get that, and each multiply / add / fma would be on 512-bit (8x) or 320-bit (5x double precision).
Start by converting integer to binary float
For normal-sized floating-point numbers, the usual method takes advantage of the logarithmic nature of binary floating point. Without 256-bit HW float, you'll want to find the ilog2(int) yourself, i.e. position of the highest set bit (Efficiently find least significant set bit in a large array?).
Then treat your 256-bit integer as the mantissa of a number in the [1..2) or [0.5 .. 1) range, and yes use a polynomial approximation for log2() that's accurate over that limited range. (Before actual soft-float stuff, you might want to left-shift the number so it's normalized, i.e. the highest set bit is at the top. i.e. x <<= clz(x).
Then a polynomial approximation over the mantissa
And then add the integer exponent + log_approx(mantissa) => log2(x).
Efficient implementation of log2(__m256d) in AVX2 has more detail on implementing log2(double) (with SIMD doing 4 at a time, very different from doing one extended precision calculation).
It includes some links to implementations, e.g. Agner Fog's VCL using the ratio of two polynomials instead of one larger polynomial, and various tricks to maintain as much precision as possible: https://github.com/vectorclass/version2/blob/9874e4bfc7a0919fda16596144d393da5f8bf6c0/vectormath_exp.h#L942. Such as further range reduction: if x > SQRT2*0.5, then increment the exponent and double the mantissa. (If 512-bit FP division is really expensive, you might just use more terms in one polynomial.) VCL is currently Apache licensed, so feel free to copy as much as you want from it into anything.
IDK if there are more tricks that might become more valuable for big extended precision, or for soft-float, which that implementation doesn't use. VCL's math functions spend more effort to maintain high precision than some faster approximations, but they're not exact.
Do you really need 512-bit float? Maybe only 320-bit (5x double)?
If you don't need more exponent-range than a double, you might be able to extend the double-double-arithmetic technique to wider floats, taking advantage of hardware FP to get 52 or 53 mantissa bits per 64-bit chunk. (From comments, apparently you're already planning to do that.)
You might not need 512-bit float to have sufficient precision. 256/52 = 4.92, so only 5x double chunks have more precision (mantissa bits) than your input, and could exactly represent any 256-bit integer. (IEEE double does have a large enough exponent range; -1022 .. +1023). And have enough to spare that log2(int) should map each 256-bit input to a unique monotonic output, even with some rounding error.

Is the speed of multiplying/adding interger a constant, regardless of how big the number is?

Is 3*3 faster, or takes the same number CPU cycles as 1000*1000 (values are C int). Is this claim applied to all other arithmetic operators, including floating point ones?
CPUs typically implement multiplication for fixed sized numbers in hardware, and no matter which two numbers you pass in, the circuitry is going to run through all the bits even if most of them are zeros. For examples of how you can multiply two numbers in hardware see https://en.m.wikipedia.org/wiki/Binary_multiplier
This means the time it takes to multiply two "int"s in C is pretty much a constant.
Caveat: You may find that multiplication by a power of 2 is much faster than multiplication in general. The compiler is free to not use the multiply instruction and replace it with bit shifts and adds if it produces the correct result.

Truncating 64-bit IEEE doubles to 61-bits in a safe fashion

I am developing a programming language, September, which uses a tagged variant type as its main value type. 3 bits are used for the type (integer, string, object, exception, etc.), and 61 bits are used for the actual value (the actual integer, pointer to the object, etc.).
Soon, it will be time to add a float type to the language. I almost have the space for a 64-bit double, so I wanted to make use of doubles for calculations internally. Since I'm actually 3 bits short for storage, I would have to round the doubles off after each calculation - essentially resulting in a 61-bit double with a mantissa or exponent shorter by 3 bits.
But! I know floating point is fraught with peril and doing things which sound sensible on paper can produce disastrous results with FP math, so I have an open-ended question to the experts out there:
Is this approach viable at all? Will I run into serious error-accumulation problems in long-running calculations by rounding at each step? Is there some specific way in which I could do the rounding in order to avoid that? Are there any special values that I won't be able to treat that way (subnormals come to mind)?
Ideally, I would like my floats to be as well-behaved as a native 61-bit double would be.
I would recommend borrowing bits from the exponent field of the double-precision format. This is the method described in this article (that you would modify to borrow 3 bits from the exponent instead of 1). With this approach, all computations that do not use very large or very small intermediate results behave exactly as the original double-precision computation would. Even computations that run into the subnormal region of the new format behave exactly as they would if a 1+8+52 61-bit format had been standardized by IEEE.
By contrast, naively borrowing any number of bits at all from the significand introduces many double-rounding problems, all the more frequent that you are rounding from a 52-bit significand to a significand with only a few bits removed. Borrowing one bit from the significand as you suggest in an edit to your question would be the worst, with half the operations statistically producing double-rounded results that are different from what the ideal “native 61-bit double” would have produced. This means that instead of being accurate to 0.5ULP, the basic operations would be accurate to 3/4ULP, a dramatic loss of accuracy that would derail many of the existing, finely-designed numerical algorithms that expect 0.5ULP.
Three is a significant number of bits to borrow from an exponent that only has 11, though, and you could also consider using the single-precision 32-bit format in your language (calling the single-precision operations from the host).
Lastly, I give visibility here to another solution found by Jakub: borrow the three bits from the significand, and simulate round-to-odd for the intermediate double-precision computation before converting to the nearest number in 49-explicit-significand-bit, 11-exponent-bit format. If this way is chosen, it may useful to remark that the rounding itself to 49 bits of significand can be achieved with the following operations:
if ((repr & 7) == 4)
repr += (repr & 8) >> 1); /* midpoint case */
else
repr += 4;
repr &= ~(uint64_t)7; /* round to the nearest */
Despite working on the integer having the same representation as the double being considered, the above snippet works even if the number goes from normal to subnormal, from subnormal to normal, or from normal to infinite. You will of course want to set a tag in the three bits that have been freed as above. To recover a standard double-precision number from its unboxed representation, simply clear the tag with repr &= ~(uint64_t)7;.
This is a summary of my own research and information from the excellent answer by #Pascal Cuoq.
There are two places where we can truncate the 3-bits we need: the exponent, and the mantissa (significand). Both approaches run into problems which have to be explicitly handled in order for the calculations to behave as if we used a hypothetical native 61-bit IEEE format.
Truncating the mantissa
We shorten the mantissa by 3 bits, resulting in a 1s+11e+49m format. When we do that, performing calculations in double-precision and then rounding after each computation exposes us to double rounding problems. Fortunately, double rounding can be avoided by using a special rounding mode (round-to-odd) for the intermediate computations. There is an academic paper describing the approach and proving its correctness for all doubles - as long as we truncate at least 2 bits.
Portable implementation in C99 is straightforward. Since round-to-odd is not one of the available rounding modes, we emulate it by using fesetround(FE_TOWARD_ZERO), and then setting the last bit if the FE_INEXACT exception occurs. After computing the final double this way, we simply round to nearest for storage.
The format of the resulting float loses about 1 significant (decimal) digit compared to a full 64-bit double (from 15-17 digits to 14-16).
Truncating the exponent
We take 3 bits from the exponent, resulting in a 1s+8e+52m format. This approach (applied to a hypothetical introduction of 63-bit floats in OCaml) is described in an article. Since we reduce the range, we have to handle out-of-range exponents on both the positive side (by simply 'rounding' them to infinity) and the negative side. Doing this correctly on the negative side requires biasing the inputs to any operation in order to ensure that we get subnormals in the 64-bit computation whenever the 61-bit result needs to be subnormal. This has to be done a bit differently for each operation, since what matters is not whether the operands are subnormal, but whether we expect the result to be (in 61-bit).
The resulting format has significantly reduced range since we borrow a whopping 3 out of 11 bits of the exponent. The range goes down from 10-308...10308 to about 10-38 to 1038. Seems OK for computation, but we still lose a lot.
Comparison
Both approaches yield a well-behaved 61-bit float. I'm personally leaning towards truncating the mantissa, for three reasons:
the "fix-up" operations for round-to-odd are simpler, do not differ from operation to operation, and can be done after the computation
there is a proof of mathematical correctness of this approach
giving up one significant digit seems less impactful than giving up a big chunk of the double's range
Still, for some uses, truncating the exponent might be more attractive (especially if we care more about precision than range).

Hexadecimal Calculator Features. What do you want?

I plan on making a Hex Calculator one of these weekends for my android phone. I would like to put it up as a free application on the android market when I'm done. As a programmer, what do you think are some valuable features that I should consider?
Conversion between hex, binary and decimal would be nice
Showing current date and time in hex
Coloring of inputs like (FF, 00, 00) as RGB values
Usual arithmetic
Stack based calculation
Registers for saving of values for some future time
Defining variables for easier re-use
Too much? Doable?
Convert to/from decimal and binary
AND / OR / NOT / XOR / 2s Complement
Basic arithmetic ( plus,minus, multiply, divide)
Multiple memories
Besides the obvious adding and subtracting hex color values, the next hex operation I perform the most is averaging two (or even an array of) hex color values. Good luck with the project!
overflow flag for assembly programing.
when ever there is a arithmetic operation of two numbers (in two's compliment mode), this flag is raised when calculations fall out of range.
add octal to the conversions

When is it appropriate to use floating precision data types?

It's clear that one shouldn't use floating precision when working with, say, monetary amounts since the variation in precision leads to inaccuracies when doing calculations with that amount.
That said, what are use cases when that is acceptable? And, what are the general principles one should have in mind when deciding?
Floating point numbers should be used for what they were designed for: computations where what you want is a fixed precision, and you only care that your answer is accurate to within a certain tolerance. If you need an exact answer in all cases, you're best using something else.
Here are three domains where you might use floating point:
Scientific Simulations
Science apps require a lot of number crunching, and often use sophisticated numerical methods to solve systems of differential equations. You're typically talking double-precision floating point here.
Games
Think of games as a simulation where it's ok to cheat. If the physics is "good enough" to seem real then it's ok for games, and you can make up in user experience what you're missing in terms of accuracy. Games usually use single-precision floating point.
Stats
Like science apps, statistical methods need a lot of floating point. A lot of the numerical methods are the same; the application domain is just different. You find a lot of statistics and monte carlo simulations in financial applications and in any field where you're analyzing a lot of survey data.
Floating point isn't trivial, and for most business applications you really don't need to know all these subtleties. You're fine just knowing that you can't represent some decimal numbers exactly in floating point, and that you should be sure to use some decimal type for prices and things like that.
If you really want to get into the details and understand all the tradeoffs and pitfalls, check out the classic What Every Programmer Should Know About Floating Point, or pick up a book on Numerical Analysis or Applied Numerical Linear Algebra if you're really adventurous.
I'm guessing you mean "floating point" here. The answer is, basically, any time the quantities involved are approximate, measured, rather than precise; any time the quantities involved are larger than can be conveniently represented precisely on the underlying machine; any time the need for computational speed overwhelms exact precision; and any time the appropriate precision can be maintained without other complexities.
For more details of this, you really need to read a numerical analysis book.
Short story is that if you need exact calculations, DO NOT USE floating point.
Don't use floating point numbers as loop indices: Don't get caught doing:
for ( d = 0.1; d < 1.0; d+=0.1)
{ /* Some Code... */ }
You will be surprised.
Don't use floating point numbers as keys to any sort of map because you can never count on equality behaving like you may expect.
Most real-world quantities are inexact, and typically we know their numeric properties with a lot less precision than a typical floating-point value. In almost all cases, the C types float and double are good enough.
It is necessary to know some of the pitfalls. For example, testing two floating-point numbers for equality is usually not what you want, since all it takes is a single bit of inaccuracy to make the comparison non-equal. tgamblin has provided some good references.
The usual exception is money, which is calculated exactly according to certain conventions that don't translate well to binary representations. Part of this is the constants used: you'll never see a pi% interest rate, or a 22/7% interest rate, but you might well see a 3.14% interest rate. In other words, the numbers used are typically expressed in exact decimal fractions, not all of which are exact binary fractions. Further, the rounding in calculations is governed by conventions that also don't translate well into binary. This makes it extremely difficult to precisely duplicate financial calculations with standard floating point, and therefore people use other methods for them.
It's appropriate to use floating point types when dealing with scientific or statistical calculations. These will invariably only have, say, 3-8 significant digits of accuracy.
As to whether to use single or double precision floating point types, this depends on your need for accuracy and how many significant digits you need. Typically though people just end up using doubles unless they have a good reason not to.
For example if you measure distance or weight or any physical quantity like that the number you come up with isn't exact: it has a certain number of significant digits based on the accuracy of your instruments and your measurements.
For calculations involving anything like this, floating point numbers are appropriate.
Also, if you're dealing with irrational numbers floating point types are appropriate (and really your only choice) eg linear algebra where you deal with square roots a lot.
Money is different because you typically need to be exact and every digit is significant.
I think you should ask the other way around: when should you not use floating point. For most numerical tasks, floating point is the preferred data type, as you can (almost) forget about overflow and other kind of problems typically encountered with integer types.
One way to look at floating point data type is that the precision is independent of the dynamic, that is whether the number is very small of very big (within an acceptable range of course), the number of meaningful digits is approximately the same.
One drawback is that floating point numbers have some surprising properties, like x == x can be False (if x is nan), they do not follow most mathematical rules (distributivity, that is x( y + z) != xy + xz). Depending on the values for z, y, and z, this can matters.
From Wikipedia:
Floating-point arithmetic is at its
best when it is simply being used to
measure real-world quantities over a
wide range of scales (such as the
orbital period of Io or the mass of
the proton), and at its worst when it
is expected to model the interactions
of quantities expressed as decimal
strings that are expected to be exact.
Floating point is fast but inexact. If that is an acceptable trade off, use floating point.

Resources