I would like to ask whether the functions MPI_Send and MPI_Recv introduce any rounding error similar to MPI_Reduce. I thought they should not, since the rounding error of MPI_Reduce comes from differences in the order in which the processes execute, and MPI_Send and MPI_Recv involve no similar procedure.
Then I would like to ask whether it is reasonable to verify the calculations of a parallel code that uses only MPI_Send and MPI_Recv by comparing its results with those of a serial code.
Thank you for your time.
MPI_Send and MPI_Recv do not perform rounding per se. But there could still be differences between the results of the serial code and the parallel one on systems where higher internal precision is used. A typical example is x86 when the x87 FPU is used (mostly in 32-bit code). x87 operates on a small stack of 80-bit values and all operations, even those involving values of lesser precision, are performed with 80-bit internal precision. Whenever an intermediate value has to be transferred to another MPI rank, it first gets rounded to either float or double (unless the non-standard extended precision type is used). That rounding removes significant bits that would otherwise be present had the value remained on the x87 stack. This is not an MPI-specific problem, as it might also manifest itself in serial code as different results depending on the level of register optimisation performed by the compiler.
I noticed that the C++ standard library has separate functions for round and lround rather than just having you use long(round(x)) for the latter.
Looking into the implementation in glibc, I find that indeed, for platforms using IEEE754 floating point, the version that returns an integer will directly manipulate the bits from within the floating point representation, and not do the rounding using floating point operations (e.g. adding ±0.5).
What is the benefit of having a distinct implementation when you want the result as an integer type? Is this supposed to be faster, or more accurate? If it is better to use integer math on the underlying representation, why not just always do it that way even if returning the result as a double?
One reason is that adding .5 is insufficient. Let’s say you add .5 and then truncate to an integer. (How? Is there an instruction for that? Or are you doing more work?) If x is ½ − 2^−54 (the greatest representable value less than ½), adding .5 yields 1, because the mathematical sum, 1 − 2^−54, is exactly halfway between the nearest two representable values, 1 − 2^−53 and 1, and the common default rounding mode, round-to-nearest-ties-to-even, rounds that to 1. But the correct result for lround(x) is 0.
And, of course, lround is specified to round ties away from zero, regardless of the current rounding mode. You could set the rounding mode, do some arithmetic, and restore the rounding mode, but there are problems with this.
One is that changing the rounding mode is typically a time-consuming operation. The rounding mode is a global state that affects most floating-point instructions. So the processor has to ensure all pending instructions complete with the prior mode, change the global state, and ensure all later instructions start after that change.
If you are lucky, you might have a processor with per-instruction rounding modes or something similar, and then you can use any rounding mode you like without time penalty. Hewlett Packard has some processors like that. However, “round away from zero” is an uncommon mode. Most processors have round-to-nearest-ties-to-even, round toward zero, round down (toward −∞), and round up (toward +∞), and round-to-odd is becoming popular for its value in avoiding double-rounding errors. But round away from zero is rare.
Another reason is that doing floating-point instructions alters the floating-point status flags and may generate traps, but it is desired that library routines behave as single operations. For example, if we add .5 and rounding occurs, the inexact flag will be raised, since the floating-point addition with .5 produced a result different from the mathematical sum. But to the user of lround, no inexact condition ever occurs; lround is defined to return a value rounded to an integer, and it always does so—within the long range, it never returns a computed result different from its ideal mathematical definition. So if lround(x) raised the inexact flag, that would be incorrect behavior. To avoid it, an implementation that used floating-point instructions would have to save the current floating-point flags, do its work, and restore the flags before returning.
I use my GPU to perform lots of integer arithmetic. mul24() and mad24() are very helpful for getting significant integer performance boosts. Sadly, some of my kernels need more than 24-bit integers, forcing me to use compiler-generated code, which is not always optimal. If I could access a hardware instruction equivalent to mul_hi() but for 24-bit integers, call it mul24_hi(), I would get better performance from my GPUs.
Is there any equivalent to mul_hi() but for 24-bit integers or any pattern/idiom/workaround to reliably instruct the compiler to emit it?
What happens if you divide by zero on a computer?
In any given programming language (that I have worked with, at least) this raises an error.
But why? Is it built into the language that this is prohibited? Or will it compile, and the hardware will figure out that an error must be returned?
I guess handling this in the language can only be done if it is hard-coded, e.g. there is a line like double z = 5.0/0.0; If it is a function call and the divisor is supplied from outside, the language cannot even know that this is a division by zero (at least at compile time).
double divideByZero(double divisor) {
    return 5.0 / divisor;
}
where the function is called with divisor set to 0.0.
Update:
According to the comments/answers it makes a difference whether you divide by int 0 or double 0.0.
I was not aware of that. This is interesting in itself and I'm interested in both cases.
Also, one answer says that the CPU throws an error. How is this done? Also in software (which doesn't make sense on a CPU), or are there circuits which recognize this? I guess this happens in the Arithmetic Logic Unit (ALU).
When an integer is divided by 0 in the CPU, this causes an interrupt.¹ A programming language implementation can then handle that interrupt by throwing an exception or employing whichever other error-handling mechanisms the language has.
When a floating point number is divided by 0, the result is infinity, NaN or negative infinity (which are special floating point values). That's mandated by the IEEE floating point standard, which any modern CPU will adhere to. Programming languages generally do as well. If a programming language wanted to handle it as an error instead, it could just check for NaN or infinite results after every floating point operation and cause an error in that case. But, as I said, that's generally not done.
¹ On x86 at least. But I imagine it's the same on most other architectures as well.
So far I learned that a processor has registers, for 32 bit processor
they are 32 bits, for 64 bit they are 64 bits. So can someone explain
what happens if I give to the processor a larger value than its register
size? How is the calculation performed?
It depends.
Assuming x86 for the sake of discussion, 64-bit integers can still be handled "natively" on a 32-bit architecture. In this case, the program often uses a pair of 32-bit registers to hold the 64-bit value. For example, the value 0xDEADBEEF2B84F00D might be stored in the EDX:EAX register pair:
eax = 0x2B84F00D
edx = 0xDEADBEEF
The CPU actually expects 64-bit numbers in this format in some cases (IDIV, for example).
Math operations are done in multiple instructions. For example, a 64-bit add on a 32-bit x86 CPU is done with an add of the lower DWORDs, and then an adc of the upper DWORDs, which takes into account the carry flag from the first addition.
For even bigger integers, an arbitrary-precision arithmetic (or "big int") library is used. Here, a dynamically-sized array of bytes is used to represent the integer, with additional information (like the number of bits used). GMP is a popular choice.
Mathematical operations on big integers are done iteratively, probably in native word-size values at-a-time. For the gory details, I suggest you have a look through the source code of one of these open-source libraries.
The key to all of this, is that numeric operations are carried out in manageable pieces, and combined to produce the final result.
I have a strange floating-point problem.
Background:
I am implementing a double-precision (64-bit) IEEE 754 floating-point library for an 8-bit processor with a large integer arithmetic co-processor. To test this library, I am comparing the values returned by my code against the values returned by Intel's floating-point instructions. These don't always agree, because Intel's Floating-Point Unit stores values internally in an 80-bit format, with a 64-bit mantissa.
Example (all in hex):
X = 4C816EFD0D3EC47E:
biased exponent = 4C8 (true exponent = 1C9), mantissa = 116EFD0D3EC47E
Y = 449F20CDC8A5D665:
biased exponent = 449 (true exponent = 14A), mantissa = 1F20CDC8A5D665
Calculate X * Y
The product of the mantissas is 10F5643E3730A17FF62E39D6CDB0, which when rounded to 53 (decimal) bits is 10F5643E3730A1 (because the top bit of 7FF62E39D6CDB0 is zero). So the correct mantissa in the result is 10F5643E3730A1.
But if the computation is carried out with a 64-bit mantissa, 10F5643E3730A17FF62E39D6CDB0 is rounded up to 10F5643E3730A1800, which when rounded again to 53 bits becomes 10F5643E3730A2. The least significant digit has changed from 1 to 2.
To sum up: my library returns the correct mantissa 10F5643E3730A1, but the Intel hardware returns (correctly) 10F5643E3730A2, because of its internal 64-bit mantissa.
The problem:
Now, here's what I don't understand: sometimes the Intel hardware returns 10F5643E3730A1 in the mantissa! I have two programs, a Windows console program and a Windows GUI program, both built by Qt using g++ 4.5.2. The console program returns 10F5643E3730A2, as expected, but the GUI program returns 10F5643E3730A1. They are using the same library function, which has the three instructions:
fldl -0x18(%ebp)
fmull -0x10(%ebp)
fstpl 0x4(%esp)
And these three instructions compute a different result in the two programs. (I have stepped through them both in the debugger.) It seems to me that this might be something that Qt does to configure the FPU in its GUI startup code, but I can't find any documentation about this. Does anybody have any idea what's happening here?
The instruction stream of, and the inputs to, a function do not uniquely determine its execution. You must also consider the environment that is already established in the processor at the time of its execution.
If you inspect the x87 control word, you will find that it is set in two different states, corresponding to your two observed behaviors. In one, the precision control [bits 9:8] has been set to 10b (53 bits). In the other, it is set to 11b (64 bits).
As to exactly what is establishing the non-default state, it could be anything that happens in that thread prior to execution of your code. Any libraries that are pulled in are likely suspects. If you want to do some archaeology, the smoking gun is typically the fldcw instruction (though the control word can also be written to by fldenv, frstor, and finit).
Normally it's a compiler setting. Check, for example, the following page for Visual C++:
http://msdn.microsoft.com/en-us/library/aa289157%28v=vs.71%29.aspx
or this document for Intel:
http://cache-www.intel.com/cd/00/00/34/76/347605_347605.pdf
The Intel document in particular mentions some flags inside the processor that determine the behavior of the FPU instructions. This explains why the same code behaves differently in the two programs (one sets the flags differently from the other).