Dealing with floating-point number precision in Go arithmetic?

I'm interested in a way to accurately subtract two floats in Go.
I've tried to use the math/big package but I can't get an accurate result.
I've used the big.js library in JavaScript, which solves this problem. Is there a similar library or method for Go arithmetic?
package main

import (
    "fmt"
    "math/big"
)

func main() {
    const prec = 200
    a := new(big.Float).SetPrec(prec).SetFloat64(5000.0)
    b := new(big.Float).SetPrec(prec).SetFloat64(4000.30)
    result := new(big.Float).Sub(a, b)
    fmt.Println(result)
}
Result: 999.6999999999998181010596454143524169921875
https://play.golang.org/p/vomAr87Xln

Package big
import "math/big"

func (x *Float) String() string

String formats x like x.Text('g', 10). (String must be called explicitly; Float.Format does not support the %s verb.)
Use string input and round the output, for example:

package main

import (
    "fmt"
    "math/big"
)

func main() {
    const prec = 200
    a, _ := new(big.Float).SetPrec(prec).SetString("5000")
    b, _ := new(big.Float).SetPrec(prec).SetString("4000.30")
    result := new(big.Float).Sub(a, b)
    fmt.Println(result)
    fmt.Println(result.String())
}
Output:
999.6999999999999999999999999999999999999999999999999999999995
999.7
Binary floating-point numbers are, by definition, an approximation for most decimal values. For example, the decimal number 0.1 cannot be represented exactly; it is approximately 1.10011001100110011001101 * 2^(-4).
You are already used to this sort of thing from repeating decimals, which are approximations for rational numbers: 1/3 = 0.333... and 3227/555 = 5.8144144144....
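You can see the approximation directly by printing more digits than float64 can faithfully hold; a quick sketch in Go:

package main

import "fmt"

func main() {
    // 0.1 has no exact binary representation; printing extra digits
    // reveals the nearest float64 value that is actually stored.
    fmt.Printf("%.25f\n", 0.1) // 0.1000000000000000055511151
}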
See What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Maybe try github.com/shopspring/decimal.
It describes itself as a package for "arbitrary-precision fixed-point decimal numbers in Go".
Note: it can "only" represent numbers with a maximum of 2^31 digits after the decimal point.
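A minimal sketch of the question's subtraction using that package (assuming it has been fetched with go get github.com/shopspring/decimal):

package main

import (
    "fmt"

    "github.com/shopspring/decimal"
)

func main() {
    // Values are parsed from strings, so 4000.30 is stored exactly
    // as a decimal rather than approximated in binary.
    a := decimal.NewFromInt(5000)
    b, err := decimal.NewFromString("4000.30")
    if err != nil {
        panic(err)
    }
    fmt.Println(a.Sub(b)) // 999.7
}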

The accepted answer is not accurate. I just ran into this issue, and after experimenting with multiple float values and operations I have concluded that you absolutely cannot rely on float precision like this, either in the integer digits or in the digits after the decimal point. I was able to break the consistency with addition, subtraction, and multiplication alike, regardless of whether I used float64 or big.Float. You simply will not get consistent accuracy out to a large number of integer or decimal places.
For this reason, APIs that deal with high-precision numbers, such as Etherscan's, always return integers in string form, and you must do the same. Rely exclusively on the big.Int type and only display the desired decimal value at the very end of your process (i.e. when printing to the console or rendering in a browser).
Again, when printing you cannot use mathematical operations to place the decimal point, as this reintroduces the float problem, whether in Go or in browser-side JavaScript. Instead, manipulate the string and insert the decimal point manually at the proper position within it. Avoid the float type at all costs in every part of your stack.
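A minimal sketch of that string-manipulation approach (the helper name and the two-decimal example are just for illustration; negative values are not handled):

package main

import (
    "fmt"
    "math/big"
    "strings"
)

// formatUnits renders an integer amount as a decimal string with the
// given number of fractional digits, using no float arithmetic at all.
func formatUnits(amount *big.Int, decimals int) string {
    s := amount.String()
    if len(s) <= decimals {
        s = strings.Repeat("0", decimals-len(s)+1) + s // pad to "0.xx" form
    }
    return s[:len(s)-decimals] + "." + s[len(s)-decimals:]
}

func main() {
    // 499970 smallest units with 2 decimal places -> "4999.70"
    fmt.Println(formatUnits(big.NewInt(499970), 2))
}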

Related

string to integer in scientific notation

What is the canonical way of converting a string storing a number in scientific notation into an integer?
from
"1e6"
to
1000000
As for the reverse process, converting an integer to a string in scientific notation, I understand I can use the @sprintf macro. For completeness, I would appreciate it if the exact format to achieve exactly the reverse process were included: a lowercase e, no extra trailing zeros (like 1.00e6), and no leading zeros in the exponent (like 1e08).
The conversion from string to integer can be achieved via floats like this:
julia> Int(parse(Float64, "1e6"))
1000000
if you know that the number will fit into Int64, or like this
julia> BigInt(parse(BigFloat, "1e6"))
1000000
for larger numbers.
For the reverse process, the default in @sprintf would be the following:
julia> @sprintf("%.0e", 1_000_000)
"1e+06"
However, you get + after e, and at least two digits are displayed in the exponent (both are standard behavior to expect across different languages when you do such a conversion). Also note that this process will lead to rounding, e.g.:
julia> @sprintf("%.0e", 1_000_001)
"1e+06"

Converting a Gray-Scale Array to a Floating-Point Array

I am trying to read a .tif file in Julia as a floating-point array. With the FileIO and ImageMagick packages I am able to do this, but the array that I get is of type Array{ColorTypes.Gray{FixedPointNumbers.Normed{UInt8,8}},2}.
I can convert this fixed-point array to a Float32 array by multiplying it by 255 (because of UInt8), but I am looking for a function that does this for any type of FixedPointNumber (i.e. something like reinterpret() or convert()).
using FileIO
# Load the tif
obj = load("test.tif");
typeof(obj)
# Convert to Float32-Array
objNew = real.(obj) .* 255
typeof(objNew)
The output is
julia> using FileIO
julia> obj = load("test.tif");
julia> typeof(obj)
Array{ColorTypes.Gray{FixedPointNumbers.Normed{UInt8,8}},2}
julia> objNew = real.(obj) .* 255;
julia> typeof(objNew)
Array{Float32,2}
I have been looking in the docs for quite a while and have not found a function that converts a given fixed-point array to a floating-point array without multiplying it by the maximum value of the integer type.
Thanks for any help.
Edit:
I made a small gist to see if the solution by Michael works, and it does. Thanks!
Note: I don't know why, but the real.(obj) .* 255 code does not work (see the gist).
Why not just Float32.(a)?
using ColorTypes, FixedPointNumbers
a = Gray.(convert.(Normed{UInt8,8}, rand(5,6)));
typeof(a)
# Array{ColorTypes.Gray{FixedPointNumbers.Normed{UInt8,8}},2}
Float32.(a)
The short answer is indeed the one given by Michael: just use Float32.(a) (for grayscale). Another alternative is channelview(a), which separates the channels and thus also strips the color information from the array. In the latter case you won't get a Float32 array: because your image is stored with 8 bits per pixel, you'll get an N0f8 array (N0f8 = FixedPointNumbers.Normed{UInt8,8}). You can read about those numbers in the FixedPointNumbers documentation.
Your instinct to multiply by 255 is natural, given how other image-processing frameworks work, but Julia has made some effort to be consistent about "meaning" in ways that are worth taking a moment to think about. For example, in another programming language just changing the numerical precision of an array:
img = uint8(255*rand(10, 10, 3)); % an 8-bit per color channel image
figure; image(img)
imgd = double(img); % convert to double-precision, but don't change the values
figure; image(imgd)
produces a surprising result: the second image renders as all white.
That "all white" image represents saturation. In this other language, "5" means two completely different things depending on whether it's stored in memory as a UInt8 or as a Float64. I think it's fair to say that under any normal circumstances a user of a numerical library would call this a bug, and a very serious one at that; yet somehow many of us have grown to accept it in the context of image processing.
These new types arise because in Julia we've gone to the effort to implement new numerical types (FixedPointNumbers) that act like fractional values (e.g., between 0 and 1) but are stored internally with the same bit pattern as the "corresponding" UInt8 (the one you get by multiplying by 255). This allows us to work with 8-bit data and yet allow values to always be interpreted on a consistent scale (0.0=black, 1.0=white).
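If it helps to see that idea outside Julia, here is a rough analogy in Go (the type and method names are hypothetical, purely for illustration): the raw byte is kept as-is, and its meaning is pinned to the 0.0 to 1.0 scale.

package main

import "fmt"

// n0f8 mimics FixedPointNumbers.Normed{UInt8,8}: the stored byte b
// represents the fractional value b/255, so 0x00 is 0.0 (black)
// and 0xFF is 1.0 (white).
type n0f8 uint8

// Value interprets the stored byte on the consistent 0.0 to 1.0 scale.
func (n n0f8) Value() float64 { return float64(n) / 255 }

func main() {
    px := n0f8(128)         // same bit pattern as the UInt8 128
    fmt.Println(px.Value()) // ~0.502: mid-gray
}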

AS3 adding 1 (+1) not working on string cast to Number?

Just learning AS3 for Flex. I am trying to do this:
var someNumber:String = "10150125903517628"; //this is the actual number i noticed the issue with
var result:String = String(Number(someNumber) + 1);
I've tried different ways of putting the expression together, and no matter what I do the result is always 10150125903517628 rather than 10150125903517629.
Anyone have any ideas? Thanks!
All numbers in JavaScript/ActionScript are effectively double-precision IEEE 754 floats. These use a 64-bit binary representation of your decimal number and have a precision of roughly 16 or 17 significant decimal digits.
You've run up against the limit of that format with your 17-digit number. Above 2^53 (about 9.007e15) not every integer is representable, so the internal binary representation of 10150125903517629 is no different from that of 10150125903517628, which is why you're not seeing any difference when you add 1.
If, however, you add 2 then you will (should?) see the result as 10150125903517630 because that's enough of a "step" that the internal binary representation will change.
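The effect is easy to reproduce with any IEEE 754 double; for instance, a quick check in Go:

package main

import "fmt"

func main() {
    // float64 is the same IEEE 754 double: above 2^53, consecutive
    // integers are no longer all exactly representable.
    n := float64(10150125903517628)
    fmt.Println(n+1 == n)     // true: the +1 is lost to rounding
    fmt.Printf("%.0f\n", n+2) // 10150125903517630: a big enough step
}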

Issues with checking the equality of two doubles in .NET -- what's wrong with this method?

So I'm just going to dive into this issue... I've got a heavily used web application that, for the first time in two years, failed an equality check on two doubles, using an equality function a colleague said he'd also been using for years.
The goal of the function I'm about to paste in here is to compare two double values to 4 digits of precision and return the comparison results. For the sake of illustration, my values are:
Dim double1 As Double = 0.14625000000000002 ' The result of a calculation
Dim double2 As Double = 0.14625 ' A value that was looked up in a DB
If I pass them into this function:
Public Shared Function AreEqual(ByVal double1 As Double, ByVal double2 As Double) As Boolean
Return (CType(double1 * 10000, Long) = CType(double2 * 10000, Long))
End Function
the comparison fails. After the multiplication and cast to Long, the comparison ends up being:
Return 1463 = 1462
I'm kind of answering my own question here, but I can see that double1 is within the precision of a double (17 digits) and the cast is working correctly.
My first real question is: If I change the line above to the following, why does it work correctly (returns True)?
Return (CType(CType(double1, Decimal) * 10000, Long) = _
CType(CType(double2, Decimal) * 10000, Long))
Doesn't Decimal have even more precision, thus the cast to Long should still be 1463, and the comparison return False? I think I'm having a brain fart on this stuff...
Secondly, if one were to change this function to make the comparison I'm looking for more accurate or less error prone, would you recommend changing it to something much simpler? For example:
Return (Math.Abs(double1 - double2) < 0.0001)
Would I be crazy to try something like:
Return (double1.ToString("N5").Equals(double2.ToString("N5")))
(I would never do the above, I'm just curious about your reactions. It would be horribly inefficient in my application.)
Anyway, if someone could shed some light on the difference I'm seeing between casting Doubles and Decimals to Long, that would be great.
Thanks!
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Relying on a cast in this situation is error-prone, as you have discovered: depending upon the rounding rules used when casting, you may not get the number you expect.
I would strongly advise you to write the comparison code without a cast. Your Math.Abs line is perfectly fine.
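For reference, the tolerance-based comparison looks the same in any language; a minimal Go sketch, with 1e-4 mirroring the 4-digits-of-precision requirement:

package main

import (
    "fmt"
    "math"
)

// almostEqual reports whether a and b agree to within eps.
func almostEqual(a, b, eps float64) bool {
    return math.Abs(a-b) < eps
}

func main() {
    double1 := 0.14625000000000002 // the result of a calculation
    double2 := 0.14625             // the value looked up in a DB
    fmt.Println(almostEqual(double1, double2, 1e-4)) // true
}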
Regarding your first question:
"My first real question is: If I change the line above to the following, why does it work correctly (returns True)?"
The reason is that the cast from Double to Decimal loses precision (the conversion keeps at most 15 significant digits), resulting in a comparison of 0.14625 to 0.14625.
When you use CType, you're telling your program "I don't care how you round the numbers; just make sure the result is this other type". That's not exactly what you want to say to your program when comparing numbers.
Comparing floating-point numbers is a pain, and I wouldn't ever trust a Round function in any language unless you know exactly how it behaves (e.g. sometimes it rounds .5 up and sometimes down, depending on the preceding digit; it's a mess).
In .NET, I might actually use Math.Truncate() after multiplying out my double value. So Math.Truncate(0.14625 * 10000), which is Math.Truncate(1462.5), is going to equal 1462 because it discards all the fractional digits. Using Truncate() with the data from your example, both values would end up being equal because 1) they remain doubles and 2) you made sure the fraction was removed from each; see the sketch below.
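A Go sketch of the truncate-then-compare idea (math.Trunc here plays the role of .NET's Math.Truncate):

package main

import (
    "fmt"
    "math"
)

// equalTo4Digits compares two doubles to 4 digits of precision by
// truncating (never rounding) after scaling, so a value sitting on a
// .5 boundary cannot be tipped up on one side and down on the other.
func equalTo4Digits(a, b float64) bool {
    return math.Trunc(a*10000) == math.Trunc(b*10000)
}

func main() {
    double1 := 0.14625000000000002
    double2 := 0.14625
    fmt.Println(equalTo4Digits(double1, double2)) // true: both truncate to 1462
}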
I actually don't think string comparison is that bad in this situation, since floating-point comparison is pretty nasty in itself. Granted, if you're comparing numbers it's probably better to stick with numeric types, but string comparison is another option.

How to get around some rounding errors?

I have a method that deals with some geographic coordinates in .NET, and I have a struct that stores a coordinate pair such that if 256 is passed in for one of the coordinates, it becomes 0. However, in one particular instance a value of approximately 255.99999998 is calculated and stored in the struct. When it's printed by ToString(), it becomes 256, which should not happen: 256 should be 0. I wouldn't mind if it printed 255.99999998, but the fact that it prints 256 when the debugger shows 255.99999998 is a problem. Having it both store and display 0 would be even better.
Specifically, there's an issue with comparison: 255.99999998 is sufficiently close to 256 that it should equal it. What should I do when comparing doubles? Use some sort of epsilon value?
EDIT: Specifically, my problem is that I take a value, perform some calculations, then perform the opposite calculations on that number, and I need to get back exactly the original value.
This sounds like a problem with how the number is printed, not how it is stored. A double has about 15 significant figures, so it can tell 255.99999998 from 256 with precision to spare.
You could use the epsilon approach, but the epsilon is typically a fudge to get around the fact that floating-point arithmetic is lossy.
You might consider avoiding binary floating point altogether and using a nice Rational class.
The calculation above was probably destined to be exactly 256 if you were doing lossless arithmetic, as you would with a Rational type.
Rational types may go by the name Ratio or Fraction, and are fairly simple to write.
Edit....
To understand your problem, consider that when the decimal value 0.01 is converted to a binary representation it cannot be stored exactly in finite memory. The hexadecimal representation of this value is 0.028F5C28F5C..., where the "28F5C" repeats infinitely. So even before doing any calculations, you lose exactness just by storing 0.01 in binary format.
Rational and Decimal classes are used to overcome this problem, albeit at a performance cost. Rational types avoid it by storing a numerator and a denominator to represent your value. Decimal types use a binary-coded decimal format, which can be lossy in division but can store common decimal values exactly.
For your purpose I still suggest a Rational type.
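Go's standard library happens to ship exactly such a type, math/big.Rat, which makes the lossless-arithmetic point easy to demonstrate:

package main

import (
    "fmt"
    "math/big"
)

func main() {
    // 0.01 is stored exactly as the fraction 1/100, so repeated
    // addition cannot drift the way binary floating point does.
    step, _ := new(big.Rat).SetString("0.01")
    sum := new(big.Rat)
    for i := 0; i < 100; i++ {
        sum.Add(sum, step)
    }
    fmt.Println(sum.RatString()) // 1, exactly
}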
You can choose format strings which should let you display as much of the number as you like.
The usual way to compare doubles for equality is to subtract them and see if the absolute value is less than some predefined epsilon, maybe 0.000001.
You have to decide for yourself on a threshold below which two values are equal. This amounts to using so-called fixed-point numbers (as opposed to floating point). Then you have to perform the rounding manually.
I would go with some unsigned type of known size (e.g. uint32 or uint64 if they're available; I don't know .NET) and treat it as a fixed-point number type mod 256.
E.g.:

#include <math.h>
#include <stdint.h>

typedef uint32_t fixed; /* 8.24 fixed point: 8 integer bits, 24 fractional */

/* Convert a double to fixed point, wrapping into [0, 256). */
static inline fixed to_fixed(double d)
{
    return (fixed)(fmod(d, 256.) * (double)(1 << 24));
}

/* Convert back to a double. */
static inline double to_double(fixed f)
{
    return (double)f / (double)(1 << 24);
}
or something more elaborate to suit a rounding convention (to nearest, to lower, to higher, to odd, to even). The highest 8 bits of fixed hold the integer part, the 24 lower bits hold the fractional part. Absolute precision is 2^(-24).
Note that adding and subtracting such numbers naturally wraps around at 256. For multiplication, you should beware; see the sketch below.
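To make the multiplication caveat concrete, here is the same 8.24 layout sketched in Go (a toy illustration, not part of the answer's code): the raw product of two 8.24 values carries 48 fractional bits, so it must be computed in a wider type and shifted back down.

package main

import "fmt"

type fixed uint32 // 8.24 fixed point, wrapping mod 256

// add wraps mod 256 for free, via ordinary uint32 overflow.
func add(a, b fixed) fixed { return a + b }

// mul widens to 64 bits, then shifts right by 24 to drop the
// extra fractional bits and return to the 8.24 layout.
func mul(a, b fixed) fixed {
    return fixed((uint64(a) * uint64(b)) >> 24)
}

func main() {
    one := fixed(1 << 24) // 1.0
    fmt.Println(float64(add(255*one, 2*one)) / (1 << 24)) // 1: 257 wraps mod 256
    fmt.Println(float64(mul(3*one, one/2)) / (1 << 24))   // 1.5: 3 * 0.5
}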
