I am working on a project, and I need to divide a very large 64 bit long value. I absolutely do not care about the whole number result, and only care about the part after the decimal point. The problem is that when dividing a large long by a small 64 bit double floating point value, I lose accuracy in the floating point value due to it needing to store the whole numbers.
Essentially what I am trying to do is this:
double x = long_value / double_value % 1;
but without losing precision the larger long_value gets. Is there a way of writing this expression so that the whole numbers are discarded and floating point accuracy is not lost? Thanks.
EDIT: By the way, I'm out here trying to upvote all these helpful answers, but I just made this account for this question and you need 15 reputation to cast a vote.
If your language provides an exact fmod implementation you can do something like this:
double rem = fmod(long_value, double_value);
return rem / double_value;
If long_value does not convert exactly to a double value, you could split it into two halves, fmod them individually, add these values together and divide that sum or sum - double_value by double_value.
If long_value or double_value is negative you may also need to consider different cases depending on how your fmod behaves and what result you expect.
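Here is a rough sketch of that splitting idea in C++. The function name is just illustrative, and it assumes both values are positive; each 32-bit half is exactly representable as a double, so no precision is lost in the conversion.

#include <cmath>
#include <cstdint>

double fractional_quotient(uint64_t long_value, double double_value)
{
    double hi = double(long_value >> 32) * 4294967296.0;  // high 32 bits * 2^32, exact as a double
    double lo = double(long_value & 0xFFFFFFFFu);         // low 32 bits, exact as a double

    double rem = std::fmod(hi, double_value) + std::fmod(lo, double_value);
    rem = std::fmod(rem, double_value);                   // the sum can still exceed double_value once

    return rem / double_value;                            // fractional part of long_value / double_value
}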
long_value is congruent (modulo double_value) to:
double reduced = long_value - double_value * static_cast<long>(long_value / double_value);
Then you can do this:
double fractionalPart = std::fmod(reduced / double_value, 1.0);
Does the language you're using have a big integer/big rational library? To avoid loss of information, you'll have to "spread out" the information across more memory while you're transforming it, so you don't lose the part you're interested in preserving. This is essentially what a big integer library would do for you. You could employ this algorithm (I don't know what language you're using, so this is just pseudocode):
// e.g. 1.5 => (3, 2)
let (numerator, denominator) = double_value.ToBigRational().NumAndDenom();
// information-preserving version of long_value / double_value
let quotient = new BigRational(num: long_value * denominator, denom: numerator);
// information-preserving version of % 1
let remainder = quotient.FractionPart();
// some information could be lost here, but we saved it for the last step
return remainder.ToDouble();
Why do we use various data types in programming languages? Why not use float everywhere? I have heard some arguments like:
Arithmetic on int is faster (but why?)
It takes more memory to store float. (I get it.)
What are the additional benefits of using various numeric data types?
Arithmetic on integers has traditionally been faster because it's a simpler operation. It can be implemented in logic gates and, if properly designed, the whole thing can happen in a single clock cycle.
On most modern PCs floating-point support is actually quite fast, because loads of time has been invested into making it fast. It's only on lower-end processors (like Arduino, or some versions of the ARM platform) where floating point seriously suffers, or is absent from the CPU altogether.
A floating point number contains a few different pieces of data: there's a sign bit, and the mantissa, and the exponent. To put those three parts together to determine the value they represent, you do something like this:
value = sign * mantissa * 2^exponent
It's a little more complicated than that, because floating point numbers optimize how they store the mantissa a bit. For instance, the first bit of the mantissa is assumed to be 1, so it doesn't actually need to be stored; but this also means zero has to be stored a particular way, and there are various "special values" that can be stored in floats, like "not a number" and infinity, that have to be handled correctly when working with floats.
So to store the number "3" you'd have a mantissa of 0.75 and an exponent of 2. (0.75 * 2^2 = 3).
But then to add two floats together, you first have to align them. For instance, 3 + 10:
m3 = 0.75 (stored as binary (1)1000000... the first (1) implicit and not actually stored)
e3 = 2
m10 = .625 (stored as binary (1)010000...)
e10 = 4 (.625 * 2^4 = 10)
You can't just add m3 and m10 together, 'cause you'd get the wrong answer. You first have to shift m3 over by a couple bits to get e3 and e10 to match, then you can add the mantissas together and reassemble the result into a new floating point number. A CPU with good floating-point implementation will do all that for you, of course, and do it fast.
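If you want to see those pieces for yourself, std::frexp splits a value into exactly this mantissa/exponent form, which reproduces the (0.75, 2) and (0.625, 4) figures above. A small illustrative snippet:

#include <cmath>
#include <cstdio>

int main()
{
    int e3, e10;
    double m3  = std::frexp(3.0,  &e3);    // m3  = 0.75,  e3  = 2
    double m10 = std::frexp(10.0, &e10);   // m10 = 0.625, e10 = 4
    std::printf("3  = %g * 2^%d\n", m3,  e3);
    std::printf("10 = %g * 2^%d\n", m10, e10);
    return 0;
}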
So why else would you not want to use floating point values for everything? Well, for starters there's the problem of exactness. If you add or multiply two integers to get another integer, as long as you don't exceed the limits of your integer size, the answer you get will be exactly correct. This isn't the case with floating-point. For instance:
float x = 1000000000.0f;
float y = 0.0000000001f;
for (int cc = 0; cc < 1000000000; cc++) { x += y; }
Logically you'd expect the final value of (x) to be 1000000000.1, but that's almost certainly not what you're going to get. When you add (y) to (x), the change to (x)'s mantissa may be so small that it doesn't even fit into the float, and so (x) may not change at all. And even if that's not the case, (y)'s value is not exact. There are no two integers (a, b) such that (a * 2^b = 10^-10). That's true for many common decimal values, actually. Even something simple like 0.3 can't be stored as an exact value in a binary floating-point number.
So (y) isn't exactly 10^-10, it's actually off by some small amount. For a 32-bit floating point number it'll be off by around 10^-17:
y = 10^-10 + error, where the error is on the order of 10^-17
Then if you add (y) together a billion times, that representation error is magnified by about a billion as well, so your accumulated error is around 10^-8 (and that's before counting the rounding error of each addition itself).
A good floating-point implementation will try to minimize these errors, but it can't always get it right. The problem is fundamental to how the numbers are stored, and to some extent unavoidable. As a result, for instance, even though it seems natural to store a money value in a float, it might be preferable to store it as an integer instead, to get that assurance that the value is always exact.
The "exactness" issue also means that when you test the value of a floating point number, generally speaking, you can't use exact comparisons. For instance:
x = 11.0 / 500
if (x * 50 == 1.1) { ... }   // you'd expect this to pass. It doesn't!
for (float x = 0.0; x < 1.0; x += 0.01) { print x; }
// prints 101 values instead of 100, the last one being 0.9999999...
The test fails because (x) isn't exactly the value we specified, and 1.1, when encoded as a float, isn't exactly the value we specified either. They're both close but not exact. So you have to do inexact comparisons:
if (abs(x - expected_value) < small_value) {...
Choosing the correct "small_value" is a problem unto itself. It can depend on what you're doing with the values, what kind of behavior you're trying to achieve.
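One common way to package that inexact comparison is a small helper like the one below; the name nearly_equal and the default tolerance are just illustrative and should be tuned to your own data.

#include <cmath>

// returns true when a and b differ by less than the tolerance
bool nearly_equal(double a, double b, double tolerance = 1e-9)
{
    return std::fabs(a - b) < tolerance;
}

The comparison in the example above would then be written as nearly_equal(x * 50, 1.1) rather than x * 50 == 1.1.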
Finally, if you look at the "it takes more memory" issue, you can also turn that around and think of it in terms of what you get for the memory you use.
If you can work with integer math for your problem, a 32-bit unsigned integer lets you work with (exact) values between 0 and around 4 billion.
If you're using 32-bit floats instead of 32-bit integers, you can store larger values than 4 billion, but you're still limited by the representation: of those 32 bits, one is used for the sign bit and eight for the exponent, so you get 23 bits (24, effectively) of mantissa. Once (x >= 2^24), you're beyond the range where integers are stored "exactly" in that float, so (x+1 = x). So a loop like this:
float i;
for (i = 16000000; i < 17000000; i += 1);
would never terminate: (i) would reach (2^24 = 16777216), and the least-significant bit of its mantissa would be of a magnitude greater than 1, so adding 1 to (i) would cease to have any effect.
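A quick way to see that 2^24 limit in action, if you want to try it yourself:

#include <cstdio>

int main()
{
    float x = 16777216.0f;            // 2^24
    std::printf("%.1f\n", x + 1.0f);  // prints 16777216.0, not 16777217.0
    return 0;
}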
Does anyone have experience with replacing floating point operations on ATmega (2560) based systems? There are a couple of very common situations which happen every day.
For example:
Are comparisons faster than divisions/multiplications?
Is a float-to-int type cast followed by integer multiplication/division faster than the pure floating point operation without the type cast?
I hope I don't have to make a benchmark just for me.
Example one:
int iPartialRes = (int)fArg1 * (int)fArg2;
iPartialRes *= iFoo;
faster than this?:
float fPartialRes = fArg1 * fArg2;
fPartialRes *= iFoo;
And example two:
iSign = fVal < 0 ? -1 : 1;
faster than this?:
iSign = fVal / fabs(fVal);
The questions could be solved just by thinking about them for a moment:
AVRs do not have an FPU, so all floating point related stuff is done in software --> an fp multiplication involves much more than a simple int multiplication.
Since AVRs also do not have an integer division unit, a simple branch is also much faster than a software division. Dividing floating point values is the worst case of all :)
But please note that your first 2 examples produce very different results.
This is an old answer but I will submit this elaborated answer for the curious.
Just typecasting a float will truncate it, i.e. 3.7 will become 3; there is no rounding.
Fastest math on a 2560 will be (+, -, *), with divide being the slowest due to no hardware divide. Typecasting to an unsigned long int after multiplying all operands by a pseudo decimal point that suits the fractional(1) range your floats are expected to see, and tracking the sign as a bool, will give the best range/accuracy compromise.
If your loop needs to be as fast as possible, avoid even integer division; instead multiply by a pseudo fraction, and then do your typecast back into a float with myFloat (defined elsewhere) = float(myPseudoFloat) / myPseudoDecimalConstant;
Not sure if you came across the ShowInfo page in the Arduino playground. It's basically a sketch that runs a benchmark on your (insert Arduino model here) and shows the actual compute times for various operations and types. The Mega 2560 will be very close to an ATmega328 as far as FLOPS go, up to 12.5K/s (80 us per float divide). Typecasting would likely handicap the CPU more, as it introduces more overhead and might even give erroneous results due to rounding errors and lack of precision.
(1) e.g. 543.509291 * 1000000 = 543,509,291 will move the decimal 6 places, about the maximum precision of a float on an 8-bit AVR. If you first multiply all values by the same constant, like 1000 or 100000, then the decimal point is preserved, and you cast back to a float by dividing by your decimal constant when you are ready to print or store it.
float f = 3.1428;
int x;
x = f * 10000;
x now contains 31428
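As a rough sketch of the pseudo-decimal-point idea above; the names, the scale factor of 10000, and the use of long are purely illustrative, and the intermediate product must still fit in the integer type (use a smaller scale or a wider type if not):

#include <cstdio>

const long SCALE = 10000L;                                // 4 implied decimal places

long  to_scaled(float f) { return (long)(f * SCALE); }    // truncates, as noted above
float to_float(long s)   { return (float)s / SCALE; }

// multiply two scaled values using only integer math
long scaled_mul(long a, long b) { return a * b / SCALE; }

int main()
{
    long a = to_scaled(3.1428f);                          // 31428
    long b = to_scaled(2.5f);                             // 25000
    std::printf("%f\n", to_float(scaled_mul(a, b)));      // ~7.857
    return 0;
}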
I tried to get this to work by using Newton's method as described here: wiki, using the following code, but the problem is that it gives an accurate result only up to 16 decimal places. I tried to increase the number of iterations but the result stays the same. I started with an initial guess of 1. So how can I improve the accuracy of the answer (up to 100 or more decimal places)?
Thanks.
Code:
#include <stdio.h>

double x0,x1;
#define n 2
double f(double x0)
{
return ((x0*x0)-n);
}
double firstDerv(double x0)
{
return 2.0*x0;
}
int main()
{
x0 = n/2.0;
int i;
for(i=0;i<40000;i++)
{
x1=x0-(f(x0)/((firstDerv(x0))));
x0=x1;
}
printf("%.100lf\n",x1);
return 0;
}
To get around the problem of limited precision floating points, you can also use Newton's method to find in each iteration a rational (a/b, with a and b integers) that is a better approximation of sqrt(2).
If x = a/b is the value returned from your last iteration, then Newton's method states that the new estimate y = c/d is:
y = x/2 + 1/x = a/(2b) + b/a = (a^2 + 2b^2)/(2ab)
so:
c= a^2 + 2b^2
d= 2ab
Precision doubles each iteration. You are still limited in the precision you can reach, because the numerator and denominator quickly increase, but perhaps finding an implementation of large integers (or concocting one yourself) is easier than finding an implementation of arbitrary precision floating points. Also, if you are really interested in decimals, then this answer won't help you. It does give you a very precise estimate of sqrt(2).
Just some iterates of a/b for the algorithm:
1/1 , 3/2 , 17/12 , 577/408 , 665857/470832.
665857/470832 approximates sqrt(2) with an error of 1.59e-12. The error will remain of the order 1/a^2, so implementing a and b as longs will give you precision of 1e-37-ish.
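For the curious, a minimal sketch of that iteration with 64-bit integers; it reproduces the iterates listed above, and a sixth step would overflow 64-bit values, which is the size limitation mentioned:

#include <cstdio>

int main()
{
    unsigned long long a = 1, b = 1;                 // start from 1/1
    for (int i = 0; i < 5; ++i) {
        unsigned long long c = a * a + 2 * b * b;    // new numerator
        unsigned long long d = 2 * a * b;            // new denominator
        a = c;
        b = d;
        std::printf("%llu/%llu ~= %.17g\n", a, b, (double)a / (double)b);
    }
    return 0;
}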
The floating point numbers on current machines are IEEE754 and have a limited precision (of about 15 digits).
If you want much more precision, you'll need bignums which are (slowly) provided by software libraries like GMP
You could also code your program in languages and implementations using bignums.
You simply can't do it with that approach; doubles don't have enough bits to get 100 places of precision. Consider using a library for arbitrary-precision such as GMP.
Maybe this is because floating point numbers are approximated by computers through an m*2^e form. Since m and e consist of a finite number of bits, you cannot approximate all numbers with absolute accuracy.
Think of 1/3, which is 0.333333333333333...
I have a method that deals with some geographic coordinates in .NET, and I have a struct that stores a coordinate pair such that if 256 is passed in for one of the coordinates, it becomes 0. However, in one particular instance a value of approximately 255.99999998 is calculated, and thus stored in the struct. When it's printed in ToString(), it becomes 256, which should not happen - 256 should be 0. I wouldn't mind if it printed 255.9999998 but the fact that it prints 256 when the debugger shows 255.99999998 is a problem. Having it both store and display 0 would be even better.
Specifically there's an issue with comparison. 255.99999998 is sufficiently close to 256 such that it should equal it. What should I do when comparing doubles? Use some sort of epsilon value?
EDIT: Specifically, my problem is that I take a value, perform some calculations, then perform the opposite calculations on that number, and I need to get back the original value exactly.
This sounds like a problem with how the number is printed, not how it is stored. A double has about 15 significant figures, so it can tell 255.99999998 from 256 with precision to spare.
You could use the epsilon approach, but the epsilon is typically a fudge to get around the fact that floating-point arithmetic is lossy.
You might consider avoiding binary floating-points altogether and use a nice Rational class.
The calculation above was probably destined to be 256 if you were doing lossless arithmetic as you would get with a Rational type.
Rational types can go by the name of Ratio or Fraction class, and are fairly simple to write
Here's one example.
Here's another
Edit....
To understand your problem, consider that when the decimal value 0.01 is converted to a binary representation it cannot be stored exactly in finite memory. The hexadecimal representation for this value is 0.028F5C28F5C... where the "28F5C" repeats infinitely. So even before doing any calculations, you lose exactness just by storing 0.01 in binary format.
Rational and Decimal classes are used to overcome this problem, albeit with a performance cost. Rational types avoid it by storing a numerator and a denominator to represent your value. Decimal types use a base-10 representation, which can be lossy in division but can store common decimal values exactly.
For your purpose I still suggest a Rational type.
You can choose format strings which should let you display as much of the number as you like.
The usual way to compare doubles for equality is to subtract them and see if the absolute value is less than some predefined epsilon, maybe 0.000001.
You have to decide yourself on a threshold under which two values are equal. This amounts to using so-called fixed point numbers (as opposed to floating point). Then, you have to perform the round up manually.
I would go with some unsigned type with known size (eg. uint32 or uint64 if they're available, I don't know .NET) and treat it as a fixed point number type mod 256.
Eg.
#include <stdint.h>
#include <math.h>

typedef uint32_t fixed;   /* 8.24 fixed point: 8 integer bits (mod 256), 24 fractional bits */

inline fixed to_fixed(double d)
{
    return (fixed)(fmod(d, 256.) * (double)(1 << 24));
}

inline double to_double(fixed f)
{
    return (double)f / (double)(1 << 24);
}
or something more elaborated to suit a rounding convention (to nearest, to lower, to higher, to odd, to even). The highest 8 bits of fixed hold the integer part, the 24 lower bits hold the fractional part. Absolute precision is 2^{-24}.
Note that adding and subtracting such numbers naturally wraps around at 256. For multiplication you should beware: the raw product of two 8.24 values has to be shifted back down by 24 bits, and the intermediate result needs more than 32 bits.
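A quick check of the mod-256 wrap-around, assuming the to_fixed/to_double helpers above are pasted into the same file:

#include <cstdio>

int main()
{
    fixed a = to_fixed(200.0);
    fixed b = to_fixed(100.0);
    std::printf("%g\n", to_double(a + b));   // 200 + 100 wraps around to 44
    return 0;
}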
As a programmer I think it is my job to be good at math, but I am having trouble getting my head around imaginary numbers. I have tried Google and Wikipedia with no luck, so I am hoping a programmer can explain it to me: give me an example of a number squared that is <= 0, some example usage, etc...
I guess this blog entry is one good explanation:
The key word is rotation (as opposed to direction for negative numbers, which are just as strange as imaginary numbers when you think about them: less than nothing?).
Like negative numbers modeling flipping, imaginary numbers can model anything that rotates between two dimensions "X" and "Y", or anything with a cyclic, circular relationship.
Problem: not only am I a programmer, I am a mathematician.
Solution: plow ahead anyway.
There's nothing really magical to complex numbers. The idea behind their inception is that there's something wrong with real numbers. If you've got a polynomial like x^2 + 4, it is never zero, whereas x^2 - 2 is zero twice. So mathematicians got really angry and wanted there to always be zeroes for polynomials of degree at least one (they wanted an "algebraically closed" field), and created some arbitrary number j such that j = sqrt(-1). All the rules sort of fall into place from there (though they are more accurately organized differently; specifically, you formally can't actually say "hey, this number is the square root of negative one"). If there's that number j, you can get multiples of j. And you can add real numbers to j, so then you've got complex numbers. The operations with complex numbers are similar to operations with binomials (deliberately so).
The real problem with complexes isn't in all this, but in the fact that you can't define a system whereby you can get the ordinary rules for less-than and greater-than. So really, you get to where you don't define it at all. It doesn't make sense in a two-dimensional space. So in all honesty, I can't actually answer "give me an example of a number squared that is <= 0", though "j" makes sense if you treat its square as a real number instead of a complex number.
As for uses, well, I personally used them most when working with fractals. The idea behind the mandelbrot fractal is that it's a way of graphing z = z^2 + c and its divergence along the real-imaginary axes.
You might also ask why do negative numbers exist? They exist because you want to represent solutions to certain equations like: x + 5 = 0. The same thing applies for imaginary numbers, you want to compactly represent solutions to equations of the form: x^2 + 1 = 0.
Here's one way I've seen them being used in practice. In EE you are often dealing with functions that are sine waves, or that can be decomposed into sine waves. (See for example Fourier Series).
Therefore, you will often see solutions to equations of the form:
f(t) = A*cos(wt)
Furthermore, often you want to represent functions that are shifted by some phase from this function. A 90 degree phase shift will give you a sin function.
g(t) = B*sin(wt)
You can get any arbitrary phase shift by combining these two functions (called the in-phase and quadrature components).
h(t) = A*cos(wt) + i*B*sin(wt)
The key here is that in a linear system: if f(t) and g(t) solve an equation, h(t) will also solve the same equation. So, now we have a generic solution to the equation h(t).
The nice thing about h(t) is that it can be written compactly as
h(t) = C*exp(i*(wt + theta))
Using the fact that exp(iw) = cos(w)+i*sin(w).
There is really nothing extraordinarily deep about any of this. It is merely exploiting a mathematical identity to compactly represent a common solution to a wide variety of equations.
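If you want to convince yourself of that exp(iw) = cos(w) + i*sin(w) identity numerically, the standard <complex> header makes it a short check; the angle used here is arbitrary:

#include <cmath>
#include <complex>
#include <cstdio>

int main()
{
    const double w = 1.234;                                   // any angle
    std::complex<double> lhs = std::exp(std::complex<double>(0.0, w));
    std::complex<double> rhs(std::cos(w), std::sin(w));
    std::printf("exp(iw)           = %f + %fi\n", lhs.real(), lhs.imag());
    std::printf("cos(w) + i*sin(w) = %f + %fi\n", rhs.real(), rhs.imag());
    return 0;
}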
Well, for the programmer:
class complex {
public:
double real;
double imaginary;
complex(double a_real) : real(a_real), imaginary(0.0) { }
complex(double a_real, double a_imaginary) : real(a_real), imaginary(a_imaginary) { }
complex operator+(const complex &other) {
return complex(
real + other.real,
imaginary + other.imaginary);
}
complex operator*(const complex &other) {
return complex(
real*other.real - imaginary*other.imaginary,
real*other.imaginary + imaginary*other.real);
}
bool operator==(const complex &other) {
return (real == other.real) && (imaginary == other.imaginary);
}
};
That's basically all there is. Complex numbers are just pairs of real numbers, for which special overloads of +, * and == get defined. And these operations really just get defined like this. Then it turns out that these pairs of numbers with these operations fit in nicely with the rest of mathematics, so they get a special name.
They are not so much numbers like in "counting", but more like in "can be manipulated with +, -, *, ... and don't cause problems when mixed with 'conventional' numbers". They are important because they fill the holes left by real numbers, like that there's no number that has a square of -1. Now you have complex(0, 1) * complex(0, 1) == -1.0 which is a helpful notation, since you don't have to treat negative numbers specially anymore in these cases. (And, as it turns out, basically all other special cases are not needed anymore, when you use complex numbers)
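A quick usage sketch, assuming the class above is in the same file; it just demonstrates the complex(0, 1) * complex(0, 1) == -1.0 claim:

#include <cstdio>

int main()
{
    complex i(0.0, 1.0);
    complex sq = i * i;                                   // (-1, 0)
    std::printf("%g + %gi\n", sq.real, sq.imaginary);     // prints -1 + 0i
    std::printf("%d\n", sq == complex(-1.0));             // prints 1 (true)
    return 0;
}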
If the question is "Do imaginary numbers exist?" or "How do imaginary numbers exist?" then it is not a question for a programmer. It might not even be a question for a mathematician, but rather a metaphysician or philosopher of mathematics, although a mathematician may feel the need to justify their existence in the field. It's useful to begin with a discussion of how numbers exist at all (quite a few mathematicians who have approached this question are Platonists, fyi). Some insist that imaginary numbers (as the early Whitehead did) are a practical convenience. But then, if imaginary numbers are merely a practical convenience, what does that say about mathematics? You can't just explain away imaginary numbers as a mere practical tool or a pair of real numbers without having to account for both pairs and the general consequences of them being "practical". Others insist in the existence of imaginary numbers, arguing that their non-existence would undermine physical theories that make heavy use of them (QM is knee-deep in complex Hilbert spaces). The problem is beyond the scope of this website, I believe.
If your question is much more down to earth e.g. how does one express imaginary numbers in software, then the answer above (a pair of reals, along with defined operations of them) is it.
I don't want to turn this site into math overflow, but for those who are interested: Check out "An Imaginary Tale: The Story of sqrt(-1)" by Paul J. Nahin. It talks about all the history and various applications of imaginary numbers in a fun and exciting way. That book is what made me decide to pursue a degree in mathematics when I read it 7 years ago (and I was thinking art). Great read!!
The main point is that you add numbers which you define to be solutions to quadratic equations like x^2 = -1. Name one solution to that equation i; the computation rules for i then follow from that equation.
This is similar to defining negative numbers as the solution of equations like 2 + x = 1 when you only knew positive numbers, or fractions as solutions to equations like 2x = 1 when you only knew integers.
It might be easiest to stop trying to understand how a number can be a square root of a negative number, and just carry on with the assumption that it is.
So (using the i as the square root of -1):
(3+5i)*(2-i)
= (3+5i)*2 + (3+5i)*(-i)
= 6 + 10i -3i - 5i * i
= 6 + (10 -3)*i - 5 * (-1)
= 6 + 7i + 5
= 11 + 7i
works according to the standard rules of maths (remembering that i squared equals -1 on line four).
An imaginary number is a real number multiplied by the imaginary unit i. i is defined as:
i == sqrt(-1)
So:
i * i == -1
Using this definition you can obtain the square root of a negative number like this:
sqrt(-3)
== sqrt(3 * -1)
== sqrt(3 * i * i) // Replace '-1' with 'i squared'
== sqrt(3) * i // Square root of 'i squared' is 'i' so move it out of sqrt()
And your final answer is the real number sqrt(3) multiplied by the imaginary unit i.
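For comparison, the standard library reaches the same result if you route the calculation through std::complex, since sqrt of a negative real only yields the imaginary result in the complex domain:

#include <complex>
#include <cstdio>

int main()
{
    std::complex<double> z = std::sqrt(std::complex<double>(-3.0, 0.0));
    std::printf("%f + %fi\n", z.real(), z.imag());   // 0.000000 + 1.732051i
    return 0;
}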
A short answer: Real numbers are one-dimensional, imaginary numbers add a second dimension to the equation and some weird stuff happens if you multiply...
If you're interested in finding a simple application and if you're familiar with matrices,
it's sometimes useful to use complex numbers to transform a perfectly real matrix into a triangular one in the complex space, and it makes computation on it a bit easier.
The result is of course perfectly real.
Great answers so far (really like Devin's!)
One more point:
One of the first uses of complex numbers (although they were not called that way at the time) was as an intermediate step in solving equations of the 3rd degree.
link
Again, this is purely an instrument that is used to answer real problems with real numbers having physical meaning.
In electrical engineering, the impedance Z of an inductor is jwL, where w = 2*pi*f (f being the frequency) and j (= sqrt(-1)) indicates that the voltage leads the current by 90 degrees, while for a capacitor Z = 1/(jwC) = -j/(wC), a magnitude of 1/(wC) at -90 degrees, so it lags a simple resistor by 90 degrees.