This query
range x from -1 to 1 step 0.2
| project x
produces a result in which some of the values are not exact multiples of 0.2 (they come out with long trailing decimals). Can this be correct?
Yes. The type inferred from the values provided is double (real), which is a floating-point type. If you would like to see precise numbers, use the decimal type, for example:
range x from decimal(-1) to decimal(1) step decimal(0.2)
Working with floating-point numbers can't be expected to be 100% precise.
I wrote a quick C program that starts with float f = 0.0f and then executes f += 0.1 one hundred times, printing f every 10 iterations. The result is:
1.000000
2.000000
2.999999
3.999998
4.999998
5.999997
6.999996
7.999995
8.999998
10.000002
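For reference, here's a minimal reconstruction of that program (a sketch; the version at the link below may differ in details):

#include <stdio.h>

int main(void) {
    float f = 0.0f;
    for (int i = 1; i <= 100; i++) {
        f += 0.1f;               /* 0.1 has no exact binary representation */
        if (i % 10 == 0)
            printf("%f\n", f);   /* the drift accumulates with every addition */
    }
    return 0;
}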
You can see/modify/run the code I wrote here: https://onlinegdb.com/HyylownHd
The loss of precision can happen not only in the arithmetic calculations, but also in the code that converts the float to a string (to pass it between layers, or just before printing it).
As @avnera suggested, using decimal indeed gives you better precision, but you still shouldn't expect it to be 100% precise.
This is something that has always bugged me when I look at code around the web and in so much of the literature: why do we multiply by 255 and not 256?
Sometimes you'll see something like this:
float input = some_function(); // returns 0.0 to 1.0
byte output = input * 255.0;
(I'm assuming that there's an implicit floor going on during the type conversion.)
Am I not correct in thinking that this is clearly wrong?
Consider this:
What range of input gives an output of 0? [0, 1/255), right?
What range of input gives an output of 1? [1/255, 2/255), great!
What range of input gives an output of 255? Only 1.0 does; any smaller value of input, even slightly smaller, will return a lower output.
This means that input is not evenly mapped onto the output range.
OK, so you might think: use a better rounding function:
byte output = round(input * 255.0);
where round() is the usual mathematical rounding to zero decimal places. But this is still wrong. Ask the same questions:
What range of input gives an output of 0? [0, 0.5/255)
What range of input gives an output of 1? [0.5/255, 1.5/255), twice as much as for 0!
What range of input gives an output of 255? [254.5/255, 1.0], again half as much as for 1.
So in this case the input range isn't evenly mapped either!
IMHO, the right way to do this mapping is this:
byte output = min(255, input * 256.0);
Again:
What range of input gives an output of 0? [0, 1/256)
What range of input gives an output of 1? [1/256, 2/256)
What range of input gives an output of 255? [255/256, 1.0]
All those ranges are the same size, each constituting 1/256th of the input.
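To make this concrete, here's a small test sketch (hypothetical code, assuming C-style truncating casts) that samples [0,1] evenly and counts how many samples land in a few output buckets under each mapping:

#include <stdio.h>
#include <math.h>

int main(void) {
    const int SAMPLES = 1000000;
    static int floor255[256], round255[256], scale256[256];

    for (int i = 0; i <= SAMPLES; i++) {
        double input = (double)i / SAMPLES;      /* evenly spaced over [0, 1] */
        floor255[(int)(input * 255.0)]++;        /* truncating conversion */
        round255[(int)lround(input * 255.0)]++;  /* rounding conversion */
        int v = (int)(input * 256.0);            /* the "multiply by 256" mapping */
        scale256[v > 255 ? 255 : v]++;           /* clamp input == 1.0 to 255 */
    }

    int outputs[] = { 0, 1, 254, 255 };
    printf("output  floor*255  round*255  min(255,*256)\n");
    for (int k = 0; k < 4; k++) {
        int o = outputs[k];
        printf("%6d  %9d  %9d  %13d\n", o, floor255[o], round255[o], scale256[o]);
    }
    return 0;
}

With floor, the 255 bucket collects almost nothing; with round, the 0 and 255 buckets are half-sized; with min(255, input * 256), all buckets come out roughly equal.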
I guess my question is this: am I right in considering this a bug, and if so, why is it so prevalent in code?
Edit: it looks like I need to clarify. I'm not talking about random numbers here, or probability, and I'm not talking about colors or hardware at all. I'm talking about converting a float in the range [0,1] evenly to a byte in [0,255], so that the range of inputs that corresponds to each output value is the same size.
You are right. Assuming that valueBetween0and1 can take values 0.0 and 1.0, the "correct" way to do it is something like
byteValue = (byte)(min(255, valueBetween0and1 * 256))
Having said that, one could also argue that the desired quality of the software can vary: does it really matter whether you get 16777216 or 16581375 colors in some throw-away plot?
It is one of those "trivial" tasks that are very easy to get wrong by ±1. Is it worth spending five minutes trying to get the 255th intensity level exactly right, or can you apply your precious attention elsewhere? It depends on the situation.

(byte)(valueBetween0and1 * 255) is a pragmatic solution: it is simple, cheap, close enough to the truth, and also immediately, obviously "harmless" in the sense that it definitely won't produce 256 as output. It's not a good solution if you are working on an image-manipulation tool like Photoshop, or on a rendering pipeline for a computer game. But it is perfectly acceptable in almost all other contexts.

So, whether it is a "bug" or merely a minor improvement proposal depends on the context.
Here is a variant of your problem, which involves random number generators:
Generate random numbers in specified range - various cases (int, float, inclusive, exclusive)
Notice that e.g. Math.random() in Java or Random.NextDouble in C# return values greater than or equal to 0, but strictly smaller than 1.0.
You want the case "Integer-B: [min, max)" (inclusive-exclusive) with min = 0 and max = 256.
If you follow the "recipe" Int-B exactly, you obtain the code:
0 + floor(random() * (256 - 0))
If you remove all the zeros, you are left with just
floor(random() * 256)
and you don't need to & with 0xFF, because you never get 256 (as long as your random number generator guarantees never to return 1.0).
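In C, for instance, a sketch of that recipe might look like this (assuming rand() as the uniform source; rand() / (RAND_MAX + 1.0) lies in [0, 1)):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    srand((unsigned)time(NULL));
    for (int i = 0; i < 5; i++) {
        /* The quotient is in [0, 1), so scaling by 256 and truncating
           yields an integer in [0, 255]; 256 is unreachable because the
           quotient never equals 1.0. */
        int b = (int)(rand() / (RAND_MAX + 1.0) * 256);
        printf("%d\n", b);
    }
    return 0;
}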
I think your question is misguided. It looks like you start by assuming that there is some "fairness rule" that enforces the "right way" of translation. Unfortunately, in practice this is not the case. If you just want to generate a random color, then you may use whatever logic fits you. But if you do actual image processing, there is no rule that says each integer value has to be mapped onto a float interval of the same size.

On the contrary, what you really want is a mapping between two inclusive intervals, [0;1] and [0;255]. And often you don't know how many real discretization steps there will be in the [0;1] range down the line, when the color is actually shown. (On modern monitors there are probably all 256 different levels for each color, but on other output devices there might be significantly fewer choices, and the total number might not be a power of 2.)

The real mapping rule is that if the red component values of two colors are R1 and R2, then the ratio of the actual colors' red-component brightness should be as close to R1:R2 as possible. This rule automatically implies multiplying by 255 when you want to map onto [0;255], and that is what everybody does.
Note that what you suggest would most probably introduce a bug rather than fix one. For example, the proportion rule above means that you can calculate a mix of two colors R1 and R2 with mixing coefficients k1 and k2 as
Rmix = (k1*R1 + k2*R2)/(k1+k2)
Now let's try to calculate a 3:1 mix of 100% Red with 100% Black (i.e. 0% Red) in two ways:
using [0;255] integers: Rmix = (255*3 + 1*0)/(3+1) = 191.25 ≈ 191
using the [0;1] floating range and then converting to [0;255]: Rmix_float = (1.0*3 + 1*0.0)/(3+1) = 0.75, so Rmix_converted_256 = 256*0.75 = 192
It means your "multiply by 256" logic has actually introduced an inconsistency: you get different results depending on which scale you use for image processing. Obviously, if you used the "multiply by 255" logic as everyone else does, you'd get a consistent answer: Rmix_converted_255 = 255*0.75 = 191.25 ≈ 191.
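A minimal sketch of that calculation (demo code; compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* 3:1 mix of 100% red (R1 = max) with 0% red (R2 = 0). */
    double k1 = 3.0, k2 = 1.0;

    double mix255 = (k1 * 255.0 + k2 * 0.0) / (k1 + k2);  /* [0;255] scale: 191.25 */
    double mixf   = (k1 * 1.0   + k2 * 0.0) / (k1 + k2);  /* [0;1] scale: 0.75 */

    printf("integer scale, rounded: %.0f\n", round(mix255));                    /* 191 */
    printf("float * 255, rounded:   %.0f\n", round(mixf * 255.0));              /* 191 */
    printf("float * 256, clamped:   %.0f\n", fmin(255.0, floor(mixf * 256.0))); /* 192 */
    return 0;
}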
Why do we use various data types in programming languages? Why not use float everywhere? I have heard some arguments, like:
Arithmetic on int is faster (but why?)
It takes more memory to store a float. (I get it.)
What are the additional benefits of using various numeric data types?
Arithmetic on integers has traditionally been faster because it's a simpler operation. It can be implemented in logic gates and, if properly designed, the whole thing can happen in a single clock cycle.
On most modern PCs floating-point support is actually quite fast, because loads of time has been invested into making it fast. It's only on lower-end processors (like Arduino, or some versions of the ARM platform) where floating point seriously suffers, or is absent from the CPU altogether.
A floating point number contains a few different pieces of data: there's a sign bit, and the mantissa, and the exponent. To put those three parts together to determine the value they represent, you do something like this:
value = sign * mantissa * 2^exponent
It's a little more complicated than that, because floating-point formats optimize how they store the mantissa: for instance, the first bit of the mantissa is assumed to be 1, so it doesn't actually need to be stored. But this also means zero has to be stored in a particular way, and there are various "special values" that can be stored in floats, like "not a number" and infinity, that have to be handled correctly when working with floats.
So to store the number "3" you'd have a mantissa of 0.75 and an exponent of 2. (0.75 * 2^2 = 3).
But then to add two floats together, you first have to align them. For instance, 3 + 10:
m3 = 0.75 (stored as binary (1)1000000... the first (1) implicit and not actually stored)
e3 = 2
m10 = .625 (stored as binary (1)010000...)
e10 = 4 (.625 * 2^4 = 10)
You can't just add m3 and m10 together, because you'd get the wrong answer. You first have to shift m3 over by a couple of bits to get e3 and e10 to match; then you can add the mantissas together and reassemble the result into a new floating-point number. A CPU with a good floating-point implementation will do all that for you, of course, and do it fast.
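You can see this decomposition from C with frexpf, which splits a float into a mantissa in [0.5, 1) and a power-of-two exponent, matching the notation above:

#include <stdio.h>
#include <math.h>

int main(void) {
    int e3, e10;
    float m3  = frexpf(3.0f,  &e3);   /* 0.750 * 2^2 = 3  */
    float m10 = frexpf(10.0f, &e10);  /* 0.625 * 2^4 = 10 */
    printf("3  = %f * 2^%d\n", m3,  e3);
    printf("10 = %f * 2^%d\n", m10, e10);
    return 0;
}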
So why else would you not want to use floating point values for everything? Well, for starters there's the problem of exactness. If you add or multiply two integers to get another integer, as long as you don't exceed the limits of your integer size, the answer you get will be exactly correct. This isn't the case with floating-point. For instance:
x = 1000000000.0
y = .0000000001
for (cc = 0; cc < 1000000000; cc++) { x += y; }
Logically you'd expect the final value of (x) to be 1000000000.1, but that's almost certainly not what you're going to get. When you add (y) to (x), the change to (x)'s mantissa may be so small that it doesn't even fit into the float, and so (x) may not change at all. And even if that's not the case, (y)'s value is not exact. There are no two integers (a, b) such that (a * 2^b = 10^-10). That's true for many common decimal values, actually. Even something simple like 0.3 can't be stored as an exact value in a binary floating-point number.
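Here's a runnable version of that experiment (a sketch using double; the exact digits will vary by platform):

#include <stdio.h>

int main(void) {
    double x = 1000000000.0;
    double y = 1e-10;
    double sum_y = 0.0;

    for (long cc = 0; cc < 1000000000L; cc++) {
        x += y;      /* y is below half an ulp of x, so x never changes */
        sum_y += y;  /* accumulating y alone exposes its representation error */
    }

    printf("x     = %.10f\n", x);      /* still 1000000000.0000000000 */
    printf("sum_y = %.17f\n", sum_y);  /* close to 0.1, but not exact */
    return 0;
}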
So (y) isn't exactly 10^-10; it's actually off by some small amount. For a 64-bit double it'll be off by about 10^-26:
y = 10^-10 + error, where error is about 10^-26
Then if you add (y) together a billion times, the error is magnified by about a billion times as well, so your final error is around 10^-17.
A good floating-point implementation will try to minimize these errors, but it can't always get it right. The problem is fundamental to how the numbers are stored, and to some extent unavoidable. As a result, for instance, even though it seems natural to store a money value in a float, it might be preferable to store it as an integer instead, to get that assurance that the value is always exact.
The "exactness" issue also means that when you test the value of a floating point number, generally speaking, you can't use exact comparisons. For instance:
x = 11.0 / 500             // mathematically, x * 50 is exactly 1.1
if (x * 50 == 1.1) { ... } // but the test fails!

for (float x = 0.0; x < 1.0; x += 0.01) { print x; }
// prints 101 values instead of 100, the last one being 0.9999999...
The test fails because (x) isn't exactly the value we specified, and 1.1, when encoded as a float, isn't exactly the value we specified either. They're both close but not exact. So you have to do inexact comparisons:
if (abs(x - expected_value) < small_value) {...
Choosing the correct "small_value" is a problem unto itself. It can depend on what you're doing with the values, what kind of behavior you're trying to achieve.
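The classic 0.1 + 0.2 case shows the same pattern (a sketch; the right tolerance depends on the scale of your values):

#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 0.1 + 0.2;

    if (x == 0.3)
        printf("exact comparison: equal\n");
    else
        printf("exact comparison: NOT equal\n");        /* this branch runs */

    if (fabs(x - 0.3) < 1e-9)
        printf("tolerance comparison: close enough\n"); /* and this prints too */
    return 0;
}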
Finally, if you look at the "it takes more memory" issue, you can also turn that around and think of it in terms of what you get for the memory you use.
If you can work with integer math for your problem, a 32-bit unsigned integer lets you work with (exact) values between 0 and around 4 billion.
If you're using 32-bit floats instead of 32-bit integers, you can store larger values than 4 billion, but you're still limited by the representation: of those 32 bits, one is used for the sign bit and eight for the exponent, so you get 23 bits (24, effectively) of mantissa. Once (x >= 2^24), you're beyond the range where integers are stored exactly in that float, so (x + 1 == x). So a loop like this:
float i;
for (i = 16000000; i < 17000000; i += 1);
would never terminate: (i) would reach (2^24 = 16777216), and the least-significant bit of its mantissa would be of a magnitude greater than 1, so adding 1 to (i) would cease to have any effect.
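You can check the cutoff directly (a minimal sketch):

#include <stdio.h>

int main(void) {
    float i = 16777216.0f;          /* 2^24 */
    printf("%d\n", i + 1.0f == i);  /* 1: adding 1 rounds back to 2^24 */
    printf("%.1f\n", i + 2.0f);     /* 16777218.0: even steps still work */
    return 0;
}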
I just wonder how I can round toward zero bitwise. Previously, I performed the long division using a loop. However, since the number is always divided by a power of 2, I decided to use bit shifting, so I can get results like this:
12/4=3
13/4=3
14/4=3
15/4=3
16/4=4
Can I do this instead of performing the long division as usual?
12>>2
13>>2
If I use this kind of bit shifting, is the behavior different for different compilers? How about rounding up? I am using the Visual C++ 2010 compiler and GCC. Thanks!
Bitwise right shifts are equivalent to round-toward-negative-infinity division by powers of two, meaning that the answer is never bigger than the unrounded value (so, e.g., (-3) >> 1 is equal to -2). (Strictly speaking, right-shifting a negative signed value is implementation-defined in C and C++, but both Visual C++ and GCC implement it as an arithmetic shift, which behaves this way.)
For non-negative integers, this is equivalent to round-to-zero.
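For the rounding-up part of the question, the usual idiom is to add (divisor - 1) before shifting; a quick sketch:

#include <stdio.h>

int main(void) {
    for (unsigned n = 12; n <= 16; n++) {
        unsigned down = n >> 2;        /* n / 4 rounded down (toward zero for non-negative n) */
        unsigned up   = (n + 3) >> 2;  /* n / 4 rounded up: add divisor - 1 first */
        printf("%u/4: down=%u, up=%u\n", n, down, up);
    }
    return 0;
}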
I would be interested in increasing the floating-point precision limit when calculating qnorm/pnorm beyond its current level, for example:
x <- pnorm(10) # 1
qnorm(x) # Inf
qnorm(.9999999999999999444) # the highest value I've found that still returns a non-Inf number
Is that possible to do (in a reasonable amount of time)? If so, how?
If the argument is way in the upper tail, you should be able to get better precision by calculating 1-p. Like this:
> x = pnorm(10, lower.tail=F)
> qnorm(x, lower.tail=F)
10
I would expect (though I don't know for sure) that the pnorm() function calls into a C or Fortran routine that is stuck with whatever floating-point size the hardware supports. It's probably better to rearrange your problem so that the precision isn't needed.
Then, if you're dealing with really really big z-values, you can use log.p=T:
> qnorm(pnorm(100, low=F, log=T), low=F, log=T)
100
Sorry this isn't exactly what you're looking for, but I think it will be more scalable: pnorm approaches 1 so rapidly at high z-values (the tail falls off like e^(-x^2/2), after all) that even if you add more bits, they will run out fast.
Just learning AS3 for Flex. I am trying to do this:
var someNumber:String = "10150125903517628"; // this is the actual number I noticed the issue with
var result:String = String(Number(someNumber) + 1);
I've tried different ways of putting the expression together, and no matter what I do, the result is always equal to 10150125903517628 rather than 10150125903517629.
Anyone have any ideas? Thanks!
All numbers in JavaScript/ActionScript are effectively double-precision IEEE-754 floats. These use a 64-bit binary number to represent your decimal, and have a precision of roughly 16 or 17 decimal digits.
You've run up against the limit of that format with your 17-digit number. The internal binary representation of 10150125903517628 is no different to that of 10150125903517629 which is why you're not seeing any difference when you add 1.
If, however, you add 2 then you will (should?) see the result as 10150125903517630 because that's enough of a "step" that the internal binary representation will change.
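The same effect is easy to reproduce with IEEE-754 doubles in any language; a quick C sketch:

#include <stdio.h>

int main(void) {
    /* Doubles have a 53-bit significand; above 2^53 only even
       integers are representable, so adding 1 can be a no-op. */
    double n = 10150125903517628.0;
    printf("%d\n", n + 1.0 == n);  /* 1: the +1 rounds away */
    printf("%.0f\n", n + 2.0);     /* 10150125903517630 */
    return 0;
}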