I saw somewhere that this is a special case and that +NaN goes from 0x7F800001 to 0x7FFFFFFF. Is the answer +NaN?
If you interpret 0x7FFFFFFF as an IEEE 754 32-bit float then yes, it is NaN: the exponent field is all ones and the fraction field is nonzero. The Wikipedia page for Single-precision floating-point format describes the layout. I wrote this little C program to illustrate the point:
#include <stdio.h>
#include <string.h>
// Copy the bits stored in u into a float. memcpy does this without the
// undefined behavior of the pointer cast *(float*)&u.
static float as_float(unsigned u){
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
int main(void){
    unsigned u0 = 0x7FFFFFFF;
    unsigned u1 = 0x7F800001;
    unsigned u2 = 0x7F800000;
    unsigned u3 = 0x7F7FFFFF;
    printf("%e\n", as_float(u0)); // This gives nan
    printf("%e\n", as_float(u1)); // This also gives nan
    printf("%e\n", as_float(u2)); // This gives inf
    printf("%e\n", as_float(u3)); // This gives 3.402823e+38, the largest possible IEEE754 32-bit float
    // The above code only works because sizeof(unsigned)==sizeof(float)
    printf("%zu\t%zu\n", sizeof(unsigned), sizeof(float));
    // Remember that nan is only for floats, u0 is a perfectly valid unsigned.
    printf("%u\n", u0); // This gives 2147483647
}
Again, it has to be mentioned that NaN only exists as a floating point number.
+NaN is a special floating-point value with no decimal equivalent; it stands for "Not a Number".
If you just want the decimal representation of the integer whose hexadecimal representation is 7FFFFFFF, there is no floating point involved, and no +NaN.
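The range quoted in the question can also be checked directly on the raw bits. The following is only a minimal sketch, assuming float is a 32-bit IEEE 754 type; the helper names bits_are_nan and bits_to_float are made up for this illustration.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Assumption: float is a 32-bit IEEE 754 single-precision type.
static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");

// Hypothetical helper: true if the bit pattern encodes a NaN, i.e. the
// 8-bit exponent field is all ones and the 23-bit fraction is nonzero.
bool bits_are_nan(std::uint32_t bits) {
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;
    const std::uint32_t fraction = bits & 0x7FFFFFu;
    return exponent == 0xFFu && fraction != 0u;
}

// Hypothetical helper: reinterpret the bits as a float via memcpy.
float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

With the sign bit clear, the patterns for which bits_are_nan returns true are exactly 0x7F800001 through 0x7FFFFFFF; 0x7F800000 has a zero fraction and is +infinity instead.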
I'm learning C++, and encountering these problems in a simple program, so please help me out.
This is the code
#include<iostream>
using std::cout;
int main()
{   float pie;
    pie = (22/7);
    cout<<"The Value of Pi(22/7) is "<< pie<<"\n";
    return 0;
}
and the output is
The Value of Pi(22/7) is 3
Why is the value of Pi not in decimal?
That's because you're doing integer division.
What you want is really float division:
#include<iostream>
using std::cout;
int main()
{
    float pie;
    pie = float(22)/7;// 22/(float(7)) is also equivalent
    cout<<"The Value of Pi(22/7) is "<< pie<<"\n";
    return 0;
}
However, the function-style conversion float(variable) or float(value) behaves like a C-style cast, so it isn't type safe.
You could have gotten the value you wanted by ensuring that the values you were computing were floating point to begin with as follows:
22.0/7
OR
22/7.0
OR
22.0/7.0
But that's generally a hassle, since you have to keep track of all the types you're working with. Thus, the final and best method is to use static_cast:
static_cast<float>(22)/7
OR
22/static_cast<float>(7)
As for why you should use static_cast - see this:
Why use static_cast<int>(x) instead of (int)x?
pie = (22/7);
Here the division is integer division, because both operands are int.
What you intend to do is floating-point division:
pie = (22.0/7);
Here 22.0 is double, so the division becomes floating-point division (even though 7 is still int).
The rule is: if both operands have integral type (such as int, long, or char), the division is integer division; otherwise it is floating-point division (that is, when at least one operand is float or double).
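The rule can be checked at compile time by asking the compiler for the type of each expression. This is just a small sketch; the function names integer_quotient and float_quotient are made up for the illustration.

```cpp
#include <type_traits>

// Both operands are int: the result type is int (integer division).
static_assert(std::is_same<decltype(22 / 7), int>::value, "int / int is int");
// One operand is double: the int operand is converted, the result is double.
static_assert(std::is_same<decltype(22.0 / 7), double>::value, "double / int is double");
// One operand is float: the result is float.
static_assert(std::is_same<decltype(22 / 7.0f), float>::value, "int / float is float");

// Hypothetical helpers showing the two kinds of division at run time.
int integer_quotient() { return 22 / 7; }      // fractional part is discarded
float float_quotient() { return 22.0f / 7; }   // fractional part is kept
```

Here integer_quotient() returns 3, while float_quotient() returns a value close to 3.142857.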
Use:
pie = 22/7.0;
If you give the / operator two integer operands, the division performed is integer division and the result will not be a float.
On some Juniper MX routers, floats are not handled correctly: the sticky bit is lost if it is shifted more than 8 bits to the right (underflow) during a calculation. Is there a workaround for this? Are there any known impacts? Has it been fixed? Is this an IEEE acceptable option? Does the issue exist in other systems?
Example with Math Details (best viewed with fixed width font, and wide screen):
1
shifts: 12345678901
4095.05615204245304994401521980762481689453125000000000 = 0x1.ffe1cbff5e3e1p+11 = 0x40affe1cbff5e3e1 = 111111111111.00001110010111111111101011110001111100001
+ 1.0000137123424794882708965815254487097263336181640625 = 0x1.0000e60e10001p+0 = 0x3ff0000e60e10001 = 1.0000000000000000111001100000111000010000000000000001
^
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000000000000.000011100110000011100001000000000000000010s
LGRS
0101
1 2 3 4 5
mantissa bit #: 1234567890123 4567890123456789012345678901234567890123
4096.0561657547959839575923979282379150390625 = 0x1.0000e60e10001p+12 = 0x40b0000e60e10001 = 1000000000000.0000111001100000111000010000000000000001 (on "all" systems/correct)
4096.056165754795074462890625 = 0x1.0000e60e10000p+12 = 0x40b0000e60e10000 = 1000000000000.0000111001100000111000010000000000000000 (on Juniper router)
^ ^ ^ ^
Internet sources inform me that a number of MX-series routers utilize Intel x86 CPUs. The observed behavior is entirely consistent with the use of an x87 FPU for floating-point computation (as opposed to SSE or AVX), when that x87 is configured to operate in extended-precision mode.
The x87 FPU stores all operands in 80-bit registers, where each register holds a floating-point operand using 64 significand (mantissa) bits, and the integer bit of the significand is explicit. Bits 8 and 9 of the FPU control word form the precision control field, which indicates at which bit position the FPU will round results. A setting of 2 is equivalent to double precision, while a setting of 3 means round to extended precision.
Most Unix-like 32-bit operating systems set the x87 precision control to 3, while Windows sets it to 2. I do not know whether modern Junos is a 32-bit or 64-bit OS. It may retain use of the x87 and an FPU precision control setting of 3 for reasons of backwards compatibility.
With x87 precision control set to 3 (extended precision) there is an issue with double rounding. Results of floating-point operations are first rounded to extended precision and stored in an internal FPU register. Later, the data is taken from the register and rounded again when the result is stored to a memory location corresponding to a double variable.
I programmed up the specific scenario from the question on Windows64 using the Intel compiler for easy access to x87 assembly language instructions. The program dumps the two source operands a and b and the sum r in three different formats (decimal floating point, hexadecimal floating point, and binary) and also dumps the internal 80-bit representation of these operands (with a t prefix).
By defining USE_X87_EXTENDED_PRECISION as either 0 or 1 the precision control of the FPU can be set to either double precision or extended precision prior to the computation, and the value of the relevant FPU control word is shown as compute cw. With USE_X87_EXTENDED_PRECISION set to 0, the output of the program is:
original cw=027f
compute cw=027f
a=4.0950561520424530e+003 0x1.ffe1cbff5e3e1p+11 40affe1cbff5e3e1 ta=400afff0e5ffaf1f0800
b=1.0000137123424795e+000 0x1.0000e60e10001p+0 3ff0000e60e10001 tb=3fff8000730708000800
r=4.0960561657547960e+003 0x1.0000e60e10001p+12 40b0000e60e10001 tr=400b8000730708000800
However, when USE_X87_EXTENDED_PRECISION is 1, the result is:
original cw=027f
compute cw=037f
a=4.0950561520424530e+003 0x1.ffe1cbff5e3e1p+11 40affe1cbff5e3e1 ta=400afff0e5ffaf1f0800
b=1.0000137123424795e+000 0x1.0000e60e10001p+0 3ff0000e60e10001 tb=3fff8000730708000800
r=4.0960561657547951e+003 0x1.0000e60e10000p+12 40b0000e60e10000 tr=400b8000730708000400
During the second rounding, from tr to r, the round bit is 1, but the sticky bit is 0 because all trailing significand bits past the round bit are zero. The result is therefore an exact tie, and the "ties to even" part of the default rounding mode "round to nearest, ties to even" kicks in.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#define USE_X87_EXTENDED_PRECISION (1)
typedef struct tbyte {
    uint64_t l;
    uint16_t h;
} tbyte;
uint64_t double_as_uint64 (double a)
{
    uint64_t r; memcpy (&r, &a, sizeof r); return r;
}
int main (void)
{
    double a = 0x1.ffe1cbff5e3e1p+11;
    double b = 0x1.0000e60e10001p+0;
    double r;
    uint16_t cw_orig, cw_comp, cw_temp;
    tbyte ta, tb, tr;
    __asm fstcw word ptr [cw_orig];
#if USE_X87_EXTENDED_PRECISION
    cw_temp = cw_orig | (3 << 8);
    __asm fldcw word ptr [cw_temp];
#endif // USE_X87_EXTENDED_PRECISION
    __asm fstcw word ptr [cw_comp];
    __asm fld qword ptr [a];
    __asm fld qword ptr [b];
    __asm fld st(1);
    __asm fadd st, st(1);
    __asm fst qword ptr [r];
    __asm fstp tbyte ptr [tr];
    __asm fstp tbyte ptr [tb];
    __asm fstp tbyte ptr [ta];
    __asm fldcw word ptr [cw_orig];
    printf ("original cw=%04x\n", cw_orig);
    printf ("compute cw=%04x\n", cw_comp);
    printf ("a=%23.16e %21.13a %016llx ta=%04x%016llx\n", a, a, double_as_uint64 (a), ta.h, ta.l);
    printf ("b=%23.16e %21.13a %016llx tb=%04x%016llx\n", b, b, double_as_uint64 (b), tb.h, tb.l);
    printf ("r=%23.16e %21.13a %016llx tr=%04x%016llx\n", r, r, double_as_uint64 (r), tr.h, tr.l);
    return EXIT_SUCCESS;
}
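The double rounding described in this answer can also be reproduced with plain integer arithmetic, without an x87 FPU. The following is only a rough sketch of that idea, not the FPU's actual implementation: the names round_ties_even, round_once and round_twice are made up for this illustration, and the significands are taken from the hexadecimal values of a and b in the question.

```cpp
#include <cstdint>

// Round to nearest, ties to even, while dropping the low `drop` bits of `sig`
// (drop >= 1). `sticky` tells whether nonzero bits exist below bit 0 of `sig`.
std::uint64_t round_ties_even(std::uint64_t sig, int drop, bool sticky) {
    const std::uint64_t half = std::uint64_t{1} << (drop - 1);
    std::uint64_t kept = sig >> drop;
    const bool round_bit = (sig & half) != 0;
    const bool sticky_bit = sticky || (sig & (half - 1)) != 0;
    if (round_bit && (sticky_bit || (kept & 1u))) ++kept;  // carry-out not handled
    return kept;
}

// 53-bit significands of a = 0x1.ffe1cbff5e3e1p+11 and b = 0x1.0000e60e10001p+0.
const std::uint64_t a_sig = 0x1ffe1cbff5e3e1ULL;
const std::uint64_t b_sig = 0x10000e60e10001ULL;

// Sum in units of 2^-51: a is shifted up 10 bits, b is shifted down 1 bit.
// The bit of b that falls off the end is remembered separately.
const std::uint64_t sum64 = (a_sig << 10) + (b_sig >> 1);
const bool lost_bit = (b_sig & 1u) != 0;

// Single rounding straight to 53 bits: the lost bit acts as the sticky bit.
std::uint64_t round_once() { return round_ties_even(sum64, 11, lost_bit); }

// Double rounding: first to 64 bits (the lost bit is an exact tie, resolved
// to even), then to 53 bits with no memory of the lost bit.
std::uint64_t round_twice() {
    std::uint64_t ext = sum64;
    if (lost_bit && (ext & 1u)) ++ext;
    return round_ties_even(ext, 11, false);
}
```

In this sketch sum64 equals 0x8000730708000400, the significand shown in tr for the extended-precision run. round_once() yields 0x10000e60e10001 and round_twice() yields 0x10000e60e10000, matching the two results of r above.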
There is an example of shuffle in the OpenCL documentation:
//Examples that are not valid are:
uint8 mask;
short16 a;
short8 b;
b = shuffle(a, mask); // invalid
But I cannot understand why. I tested this on Android with Android Studio, and the build failed with: build program failed:BC-src-code:9:9:{9:9-9:16}: error: no matching builtin function for call to 'shuffle'. Then I changed short to int, like this:
uint8 mask;
int16 a;
int8 b;
b = shuffle(a, mask);
and it works. I cannot find the reason in the documentation. Can anybody help me?
Thanks!
I think the critical part of the description in the spec is this:
The size of each element in the mask must match the size of each element in the result.
I take that to mean that if you want to shuffle a vector of shorts, your mask must be a vector of ushort; a mask of uint8 would only be valid for shuffling vectors with elements of 4 bytes - in other words, int, uint, and float.
So the following should be valid:
ushort8 mask; // <-- changed
short16 a;
short8 b;
b = shuffle(a, mask); // now valid
I need to implement a function, but I am not sure how, as I am completely new to this. The function is called get_values and has the prototype:
void get_values(unsigned int value, unsigned int *p_lsb, unsigned int *p_msb,
unsigned int *p_combined)
The function computes the least significant byte and the most significant byte of the value
parameter. In addition, both values are combined. For this problem:
a. You may not use any loop constructs.
b. You may not use the multiplication operator (* or *=).
c. Your code must work for unsigned integers of any size (4 bytes, 8 bytes, etc.).
d. To combine the values, append the least significant byte to the most significant one.
e. Your implementation should be efficient.
The following driver (and associated output) provides an example of using the function you are
expected to write. Notice that in this example an unsigned int is 4 bytes, but your function
needs to work with an unsigned int of any size.
Driver
int main() {
    unsigned int value = 0xabcdfaec, lsb, msb, combined;
    get_values(value, &lsb, &msb, &combined);
    printf("Value: %x, lsb: %x, msb: %x, combined: %x\n", value, lsb, msb, combined);
    return 0;
}
Output
Value: abcdfaec, lsb: ec, msb: ab, combined: abec
I think you want to look into the bitwise AND and bit-shift operators. The last piece of the puzzle might be the sizeof operator, if the question requires the code to work on platforms with differently sized int types.
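To make that hint concrete, here is one possible sketch using only masks, shifts and sizeof. It assumes 8-bit bytes so that "times 8" can be written as a shift by 3, which keeps the multiplication operator out of the code; treat it as an illustration rather than the expected solution.

```cpp
#include <climits>

// Assumption: bytes are 8 bits wide, so "times 8" can be written as "<< 3"
// without using the multiplication operator.
static_assert(CHAR_BIT == 8, "sketch assumes 8-bit bytes");

void get_values(unsigned int value, unsigned int *p_lsb, unsigned int *p_msb,
                unsigned int *p_combined) {
    // Number of bits to shift so the top byte lands in the low 8 bits.
    const unsigned int top_shift = (sizeof(unsigned int) - 1) << 3;

    *p_lsb = value & 0xFFu;                  // least significant byte
    *p_msb = (value >> top_shift) & 0xFFu;   // most significant byte
    // Append the least significant byte to the most significant one.
    *p_combined = (*p_msb << 8) | *p_lsb;
}
```

With a 4-byte unsigned int and the driver's value 0xabcdfaec, this gives lsb ec, msb ab and combined abec, as in the expected output.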
I have a string like "2.1648797E -05" and I need to convert it to the format "0.00021648797".
Is there any way to do this conversion?
try to use double or long long
cout << setiosflags(ios::fixed) << thefloat << endl;
An important characteristic of floating-point numbers is that, for large values, they do not carry precision in all the significant figures back to the decimal point. The "scientific" display reasonably reflects the realities of the internal storage.
In C++ you can use std::stringstream: first write the string into the stream, then read it back as a double, and then print it using format specifiers to set the precision of the number to 12 digits. Take a look at this question for how to print a decimal number with fixed precision.
If you are really just going from string representation to string representation and precision is very important or values may leave the valid range for doubles then I would avoid converting to a double.
Your value may get altered by that due to precision errors or range problems.
Try writing a simple text parser. Roughly like this:
Read the digits, omitting the decimal point up to the 'E' but store the decimal point position.
After the 'E' read the exponent as a number and add that to your stored decimal position.
Then output the digits again properly appending zeros at beginning or end and inserting the decimal point.
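The three steps above could be sketched as follows. This is only an illustration under simplifying assumptions: the function name sci_to_fixed is made up, embedded spaces are simply dropped, and the input is not validated (a malformed exponent makes std::stol throw).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: rewrite text such as "2.1648797E -05" in plain
// decimal notation without ever converting it to a double.
std::string sci_to_fixed(std::string s) {
    // Drop embedded spaces such as the one after the 'E'.
    s.erase(std::remove(s.begin(), s.end(), ' '), s.end());

    std::string sign;
    if (!s.empty() && (s[0] == '-' || s[0] == '+')) {
        if (s[0] == '-') sign = "-";
        s.erase(0, 1);
    }

    // Split at 'E' or 'e'; without an exponent part, assume exponent 0.
    const std::size_t e = s.find_first_of("eE");
    std::string digits = s.substr(0, e);
    const long exponent = (e == std::string::npos) ? 0 : std::stol(s.substr(e + 1));

    // Remove the decimal point but remember how many digits preceded it.
    const std::size_t dot = digits.find('.');
    long point = static_cast<long>(dot == std::string::npos ? digits.size() : dot);
    if (dot != std::string::npos) digits.erase(dot, 1);

    // Move the decimal point by the exponent.
    point += exponent;

    // Pad with zeros at the front or the back so the point lands inside the digits.
    if (point <= 0) {
        digits = std::string(static_cast<std::size_t>(1 - point), '0') + digits;
        point = 1;
    } else if (static_cast<std::size_t>(point) > digits.size()) {
        digits.append(static_cast<std::size_t>(point) - digits.size(), '0');
    }

    const std::size_t p = static_cast<std::size_t>(point);
    std::string out = sign + digits.substr(0, p);
    if (p < digits.size()) out += "." + digits.substr(p);
    return out;
}
```

For the string from the question, sci_to_fixed("2.1648797E -05") returns "0.000021648797", with every digit of the input preserved.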
Some points are unclear here:
1. Was the space in "2.1648797E -05" intended? Let's assume it is.
2. 2.1648797E-05 is 10 times smaller than 0.00021648797. Assume the OP meant "0.000021648797" (one more zero).
3. Windows is not tagged, but OP posted a Windows answer.
The major challenge here, and I think the OP's core question, is that the stream precision (set with std::ios_base::precision() or std::setprecision) has different meanings in fixed versus default notation, and the OP wants the default meaning in fixed notation.
The precision field is interpreted differently in fixed and default floating-point notation. In default notation, the precision field specifies the maximum number of meaningful digits to display in total, counting those both before and after the decimal point, possibly using scientific notation, while in fixed notation it specifies exactly how many digits to display after the decimal point.
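A small sketch makes the difference visible. The helper names format_default and format_fixed are made up for this illustration; they only wrap the standard std::setprecision and std::fixed manipulators.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: default notation, precision counts significant digits.
std::string format_default(double x, int precision) {
    std::ostringstream out;
    out << std::setprecision(precision) << x;
    return out.str();
}

// Hypothetical helper: fixed notation, precision counts digits after the point.
std::string format_fixed(double x, int precision) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(precision) << x;
    return out.str();
}
```

With a precision of 3, format_default(123.456, 3) gives "123" while format_fixed(123.456, 3) gives "123.456". For the OP's value, format_fixed(2.1648797e-05, 3) gives "0.000", which is why a small precision in fixed notation loses all the interesting digits.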
There are two approaches to solve this. The first is to convert the input string to a number and then output the number in the new fixed format; that is presented below. The second is to parse the input string and build the new format directly; that is not done here.
#include <algorithm> // std::remove
#include <iostream>
#include <iomanip>
#include <string>
#include <sstream>
#include <cmath>
#include <cfloat>
double ConvertStringWithSpaceToDouble(std::string s) {
    // Get rid of pesky space in "2.1648797E -05"
    s.erase (std::remove (s.begin(), s.end(), ' '), s.end());
    std::istringstream i(s);
    double x;
    if (!(i >> x)) {
        x = 0; // handle error;
    }
    std::cout << x << std::endl;
    return x;
}
std::string ConvertDoubleToString(double x) {
    std::ostringstream s;
    // Split x into its whole part (left in x) and its fractional part
    double fraction = std::fabs(std::modf(x, &x));
    s.precision(0);
    s.setf(std::ios::fixed);
    // stream whole number part
    s << x << '.';
    // Threshold becomes non-zero once a non-zero digit found.
    // Its level increases with each additional digit streamed to prevent excess trailing zeros.
    double threshold = 0.0;
    while (fraction > threshold) {
        double digit;
        fraction = std::modf(fraction*10, &digit);
        s << digit;
        if (threshold) {
            threshold *= 10.0;
        }
        else if (digit > 0) {
            // Use DBL_DIG to define number of interesting digits
            threshold = std::pow(10, -DBL_DIG);
        }
    }
    return s.str();
}
int main(int argc, char* argv[]){
    std::string s("2.1648797E -05");
    double x = ConvertStringWithSpaceToDouble(s);
    s = ConvertDoubleToString(x);
    std::cout << s << std::endl;
    return 0;
}
Thanks, guys. I fixed it using:
Decimal dec = Decimal.Parse(str, System.Globalization.NumberStyles.Any);