Why was SIGFPE used for integer arithmetic exceptions? - unix

Why was SIGFPE used for integer arithmetic exceptions such as division by zero, instead of either creating a separate signal for integer arithmetic exceptions, or naming the signal for arithmetic exceptions in general in the first place?

The IEEE Std 1003.1 standard defines SIGFPE as:
Erroneous arithmetic operation.
and does not specifically mention floating-point operations. The reasoning behind the name is not clearly documented, but here's my take on it.
The x86 FPU can operate on integer and floating-point data in the same instruction (e.g. FIDIV, which divides a floating-point value by an integer operand), so it would be unclear whether dividing floating-point data by integer zero should raise a floating-point or an integer exception.
Additionally, up to the 80486 (which was released the same year as the ISO/ANSI C standard), x86 CPUs had no floating-point capability at all; the floating-point co-processor was a separate chip. Software floating-point emulation could be used in place of the chip, but it ran on the CPU's built-in ALU (the integer arithmetic logic unit), which raises integer exceptions.


What happens if you divide by Zero on a Computer?
In every programming language I have worked with, this raises an error.
But why? Is it built into the language that this is prohibited? Or will it compile, and the hardware will figure out that an error must be returned?
I guess handling this in the language can only be done if it is hard-coded, e.g. if there is a line like double z = 5.0/0.0; If it is a function call and the divisor is given from outside, the language could not even know that this is a division by zero (at least at compile time).
double divideByZero(double divisor) {
    return 5.0 / divisor;
}
where the function is called with divisor = 0.0.
Update:
According to the comments/answers it makes a difference whether you divide by int 0 or double 0.0.
I was not aware of that. This is interesting in itself and I'm interested in both cases.
Also, one answer is that the CPU throws an error. Now, how is this done? Is it done in software (which hardly makes sense inside a CPU), or are there circuits which recognize this? I guess this happens in the Arithmetic Logic Unit (ALU).
When an integer is divided by 0 in the CPU, this causes an interrupt.¹ A programming language implementation can then handle that interrupt by throwing an exception or employing whichever other error-handling mechanisms the language has.
When a floating point number is divided by 0, the result is infinity, NaN or negative infinity (which are special floating point values). That's mandated by the IEEE floating point standard, which any modern CPU will adhere to. Programming languages generally do as well. If a programming language wanted to handle it as an error instead, it could just check for NaN or infinite results after every floating point operation and cause an error in that case. But, as I said, that's generally not done.
¹ On x86 at least. But I imagine it's the same on most other architectures as well.

Truncating 64-bit IEEE doubles to 61-bits in a safe fashion

I am developing a programming language, September, which uses a tagged variant type as its main value type. 3 bits are used for the type (integer, string, object, exception, etc.), and 61 bits are used for the actual value (the actual integer, pointer to the object, etc.).
Soon, it will be time to add a float type to the language. I almost have the space for a 64-bit double, so I wanted to make use of doubles for calculations internally. Since I'm actually 3 bits short for storage, I would have to round the doubles off after each calculation - essentially resulting in a 61-bit double with a mantissa or exponent shorter by 3 bits.
But! I know floating point is fraught with peril and doing things which sound sensible on paper can produce disastrous results with FP math, so I have an open-ended question to the experts out there:
Is this approach viable at all? Will I run into serious error-accumulation problems in long-running calculations by rounding at each step? Is there some specific way in which I could do the rounding in order to avoid that? Are there any special values that I won't be able to treat that way (subnormals come to mind)?
Ideally, I would like my floats to be as well-behaved as a native 61-bit double would be.
I would recommend borrowing bits from the exponent field of the double-precision format. This is the method described in this article (that you would modify to borrow 3 bits from the exponent instead of 1). With this approach, all computations that do not use very large or very small intermediate results behave exactly as the original double-precision computation would. Even computations that run into the subnormal region of the new format behave exactly as they would if a 1+8+52 61-bit format had been standardized by IEEE.
By contrast, naively borrowing any number of bits from the significand introduces many double-rounding problems, which become more frequent the fewer bits you remove from the 52-bit significand. Borrowing one bit from the significand, as you suggest in an edit to your question, would be the worst case, with half the operations statistically producing double-rounded results that differ from what the ideal "native 61-bit double" would have produced. This means that instead of being accurate to 0.5 ULP, the basic operations would be accurate to 3/4 ULP, a dramatic loss of accuracy that would derail many of the existing, finely designed numerical algorithms that expect 0.5 ULP.
Three is a significant number of bits to borrow from an exponent that only has 11, though, and you could also consider using the single-precision 32-bit format in your language (calling the single-precision operations from the host).
Lastly, I give visibility here to another solution found by Jakub: borrow the three bits from the significand, and simulate round-to-odd for the intermediate double-precision computation before converting to the nearest number in the 49-explicit-significand-bit, 11-exponent-bit format. If this way is chosen, it may be useful to remark that the rounding itself to 49 bits of significand can be achieved with the following operations:
if ((repr & 7) == 4)
    repr += (repr & 8) >> 1; /* midpoint case: round to even */
else
    repr += 4;
repr &= ~(uint64_t)7; /* round to the nearest */
Although it operates on the integer that has the same representation as the double being considered, the snippet above works even if the number goes from normal to subnormal, from subnormal to normal, or from normal to infinite. You will of course want to set a tag in the three bits that have been freed as above. To recover a standard double-precision number from its unboxed representation, simply clear the tag with repr &= ~(uint64_t)7;.
This is a summary of my own research and information from the excellent answer by @Pascal Cuoq.
There are two places where we can truncate the 3-bits we need: the exponent, and the mantissa (significand). Both approaches run into problems which have to be explicitly handled in order for the calculations to behave as if we used a hypothetical native 61-bit IEEE format.
Truncating the mantissa
We shorten the mantissa by 3 bits, resulting in a 1s+11e+49m format. When we do that, performing calculations in double-precision and then rounding after each computation exposes us to double rounding problems. Fortunately, double rounding can be avoided by using a special rounding mode (round-to-odd) for the intermediate computations. There is an academic paper describing the approach and proving its correctness for all doubles - as long as we truncate at least 2 bits.
A portable implementation in C99 is straightforward. Since round-to-odd is not one of the available rounding modes, we emulate it by using fesetround(FE_TOWARDZERO) and then setting the last bit of the significand if the FE_INEXACT exception is raised. After computing the final double this way, we simply round to nearest for storage.
The format of the resulting float loses about 1 significant (decimal) digit compared to a full 64-bit double (from 15-17 digits to 14-16).
Truncating the exponent
We take 3 bits from the exponent, resulting in a 1s+8e+52m format. This approach (applied to a hypothetical introduction of 63-bit floats in OCaml) is described in an article. Since we reduce the range, we have to handle out-of-range exponents on both the positive side (by simply 'rounding' them to infinity) and the negative side. Doing this correctly on the negative side requires biasing the inputs to any operation in order to ensure that we get subnormals in the 64-bit computation whenever the 61-bit result needs to be subnormal. This has to be done a bit differently for each operation, since what matters is not whether the operands are subnormal, but whether we expect the result to be (in 61-bit).
The resulting format has significantly reduced range, since we borrow a whopping 3 of the exponent's 11 bits. The range goes down from roughly 10^-308...10^308 to about 10^-38...10^38. That seems OK for computation, but we still lose a lot.
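The positive-side handling is the simple part. A sketch; the function name and the use of the single-precision exponent cutoff (|x| >= 2^128) are my own assumptions, and the subnormal side is omitted, as discussed above:

```c
#include <math.h>

/* Clamp a double to the exponent range of a hypothetical 1s+8e+52m
   format: out-of-range magnitudes 'round' to infinity. */
double clamp_to_8bit_exponent(double x) {
    if (isfinite(x) && fabs(x) >= 0x1p128)
        return copysign(INFINITY, x);
    return x;
}
```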
Comparison
Both approaches yield a well-behaved 61-bit float. I'm personally leaning towards truncating the mantissa, for three reasons:
the "fix-up" operations for round-to-odd are simpler, do not differ from operation to operation, and can be done after the computation
there is a proof of mathematical correctness of this approach
giving up one significant digit seems less impactful than giving up a big chunk of the double's range
Still, for some uses, truncating the exponent might be more attractive (especially if we care more about precision than range).

Floating point math on a processor that does not support it?

How is floating-point math performed on a processor with no floating-point unit, e.g. on low-end 8-bit microcontrollers?
Have a look at this article: http://www.edwardrosten.com/code/fp_template.html
(from this article)
First you have to think about how to represent a floating point number in memory:
struct this_is_a_floating_point_number
{
    static const unsigned int mant = ???;
    static const int expo = ???;
    static const bool posi = ???;
};
Then you'd have to consider how to do basic calculations with this representation. Some might be easy to implement and rather fast at runtime (multiplying or dividing by 2 comes to mind).
Division might be harder; for instance, Newton's algorithm could be used to calculate the answer.
Finally, smart approximations and generated values in tables might speed up the calculations at run time.
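For instance, multiplying by 2 really is just an exponent bump on the stored representation. A sketch using the standard IEEE 754 single-precision layout rather than the struct above (overflow to infinity and subnormal inputs are deliberately ignored):

```c
#include <stdint.h>
#include <string.h>

/* Multiply a float by 2 using only integer operations, by
   incrementing the exponent field (bits 30..23). */
float times_two(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 1u << 23;  /* exponent += 1 */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```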
Many years ago, C++ templates helped me get floating-point calculations working on an Intel 386 SX.
In the end I learned a lot of math and C++, but decided at the same time to buy a co-processor.
The polynomial algorithms and the smart lookup tables in particular (who needs a cosine or tangent function when you have a sine function?) helped a lot in thinking about using integers for floating-point arithmetic. Taylor series were a revelation too.
In systems without any floating-point hardware, the CPU emulates it using a series of simpler fixed-point arithmetic operations that run on the integer arithmetic logic unit.
Take a look at the Wikipedia page Floating-point_unit#Floating-point_library, where you might find more info.
It is not actually the CPU that emulates the instructions. The floating-point operations for low-end CPUs are built out of integer arithmetic instructions, and the compiler is what generates those instructions. Basically, the compiler (toolchain) comes with a floating-point library containing floating-point functions.
The short answer is "slowly". Specialized hardware can do tasks like extracting groups of bits that are not necessarily byte-aligned very fast. Software can do everything that can be done by specialized hardware, but tends to take much longer to do it.
Read "The complete Spectrum ROM disassembly" at http://www.worldofspectrum.org/documentation.html to see examples of floating point computations on an 8 bit Z80 processor.
For things like sine functions, you precompute a few values then interpolate using Chebyshev polynomials.
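A sketch of the table-plus-interpolation idea, using linear interpolation in Q15 fixed point rather than Chebyshev polynomials (the table size, Q15 scaling, and function name are arbitrary choices of mine):

```c
#include <stdint.h>

/* Quarter-wave sine table: sin(k*pi/32) scaled to Q15, k = 0..16. */
static const int16_t sin_tab[17] = {
    0, 3212, 6393, 9512, 12539, 15446, 18204, 20787, 23170,
    25329, 27245, 28898, 30273, 31356, 32137, 32609, 32767
};

/* angle: 0..65535 maps to 0..2*pi; returns sine in Q15. */
int16_t fixed_sin(uint16_t angle) {
    uint16_t quad = angle >> 14;       /* quadrant 0..3 */
    uint16_t pos  = angle & 0x3FFF;    /* position within quadrant */
    if (quad & 1) pos = 0x3FFF - pos;  /* mirror in quadrants 1 and 3 */
    uint16_t idx  = pos >> 10;         /* table index 0..15 */
    uint16_t frac = pos & 0x3FF;       /* interpolation fraction 0..1023 */
    int32_t a = sin_tab[idx], b = sin_tab[idx + 1];
    int32_t v = a + ((b - a) * frac >> 10);
    return (int16_t)(quad & 2 ? -v : v);  /* negate in quadrants 2 and 3 */
}
```

A quarter-wave table suffices because the remaining quadrants follow by mirroring and negation.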

Integer divide by Zero and Float (Real no.) divide by Zero

If I run following line of code, I get DIVIDE BY ZERO error
1. System.out.println(5/0);
which is the expected behavior.
Now I run the below line of code
2. System.out.println(5/0F);
here there is no DIVIDE BY ZERO error, rather it shows INFINITY
In the first line I am dividing two integers and in the second two real numbers.
Why does dividing by zero give a DIVIDE BY ZERO error for integers, while for real numbers it gives INFINITY?
I am sure it is not specific to any programming language.
(EDIT: The question has been changed a bit - it specifically referred to Java at one point.)
The integer types in Java don't have representations of infinity, "not a number" values etc - whereas IEEE-754 floating point types such as float and double do. It's as simple as that, really. It's not really a "real" vs "integer" difference - for example, BigDecimal represents real numbers too, but it doesn't have a representation of infinity either.
EDIT: Just to be clear, this is language/platform specific, in that you could create your own language/platform which worked differently. However, the underlying CPUs typically work the same way - so you'll find that many, many languages behave this way.
EDIT: In terms of motivation, bear in mind that for the infinity case in particular, there are ways of getting to infinity without dividing by zero - such as dividing by a very, very small floating point number. In the case of integers, there's obviously nothing between zero and one.
Also bear in mind that the cases in which integers (or decimal floating-point types) are used typically don't need the concept of infinity or "not a number" results - whereas in scientific applications (where float/double are more typically useful), "infinity" (or at least "a number which is too large to sensibly represent") is still a potentially valid result.
This is specific to one programming language or a family of languages. Not all languages allow integers and floats to be used in the same expression. Not all languages have both types (for example, ECMAScript implementations like JavaScript have no notion of an integer type externally). Not all languages have syntax like this to convert values inline.
However, there is an intrinsic difference between integer arithmetic and floating-point arithmetic. In integer arithmetic, you must define that division by zero is an error, because there are no values to represent the result. In floating-point arithmetic, specifically that defined in IEEE-754, there are additional values (combinations of sign bit, exponent and mantissa) for the mathematical concept of infinity and meta-concepts like NaN (not a number).
So we can assume that the / operator in this programming language is generic: it performs integer division if both operands are of the language's integer type, and it performs floating-point division if at least one of the operands is of a float type of the language, with the other operand implicitly converted to that float type for the purpose of the operation.
In real-number math, dividing a number by a number close to zero is equivalent to multiplying the first number by a number whose absolute value is very large (x / (1 / y) = x * y). So it is reasonable that the result of dividing by zero should be (defined as) infinity, as the precision of the floating-point value would be exceeded.
Implementation details are to be found in that programming language's specification.

Bad floating-point magic

I have a strange floating-point problem.
Background:
I am implementing a double-precision (64-bit) IEEE 754 floating-point library for an 8-bit processor with a large integer arithmetic co-processor. To test this library, I am comparing the values returned by my code against the values returned by Intel's floating-point instructions. These don't always agree, because Intel's Floating-Point Unit stores values internally in an 80-bit format, with a 64-bit mantissa.
Example (all in hex):
X = 4C816EFD0D3EC47E:
biased exponent = 4C8 (true exponent = C9), mantissa = 116EFD0D3EC47E
Y = 449F20CDC8A5D665:
biased exponent = 449 (true exponent = 4A), mantissa = 1F20CDC8A5D665
Calculate X * Y
The product of the mantissas is 10F5643E3730A17FF62E39D6CDB0, which when rounded to 53 (decimal) bits is 10F5643E3730A1 (because the top bit of 7FF62E39D6CDB0 is zero). So the correct mantissa in the result is 10F5643E3730A1.
But if the computation is carried out with a 64-bit mantissa, 10F5643E3730A17FF62E39D6CDB0 is rounded up to 10F5643E3730A1800, which when rounded again to 53 bits becomes 10F5643E3730A2. The least significant digit has changed from 1 to 2.
To sum up: my library returns the correct mantissa 10F5643E3730A1, but the Intel hardware returns (correctly) 10F5643E3730A2, because of its internal 64-bit mantissa.
The problem:
Now, here's what I don't understand: sometimes the Intel hardware returns 10F5643E3730A1 in the mantissa! I have two programs, a Windows console program and a Windows GUI program, both built by Qt using g++ 4.5.2. The console program returns 10F5643E3730A2, as expected, but the GUI program returns 10F5643E3730A1. They are using the same library function, which has the three instructions:
fldl -0x18(%ebp)
fmull -0x10(%ebp)
fstpl 0x4(%esp)
And these three instructions compute a different result in the two programs. (I have stepped through them both in the debugger.) It seems to me that this might be something that Qt does to configure the FPU in its GUI startup code, but I can't find any documentation about this. Does anybody have any idea what's happening here?
The instruction stream of, and inputs to, a function do not uniquely determine its execution. You must also consider the environment that is already established in the processor at the time of its execution.
If you inspect the x87 control word, you will find that it is set in two different states, corresponding to your two observed behaviors. In one, the precision control [bits 9:8] has been set to 10b (53 bits). In the other, it is set to 11b (64 bits).
As to exactly what is establishing the non-default state, it could be anything that happens in that thread prior to the execution of your code. Any libraries that are pulled in are likely suspects. If you want to do some archaeology, the smoking gun is typically the fldcw instruction (though the control word can also be written by fldenv, frstor, and finit).
Normally it's a compiler setting. Check for example the following page for Visual C++:
http://msdn.microsoft.com/en-us/library/aa289157%28v=vs.71%29.aspx
or this document from Intel:
http://cache-www.intel.com/cd/00/00/34/76/347605_347605.pdf
The Intel document especially mentions some flags inside the processor that determine the behavior of the FPU instructions. This explains why the same code behaves differently in the two programs (one sets the flags differently from the other).
