How does R compare numbers?

Here's the code:
2.9999999999999948933990 > 2.999999999999994893399
[1] TRUE
"20" > 2.999999999999994893399
[1] TRUE
"20" > 2.9999999999999948933990
[1] FALSE
"200" > 2.999999999999994893399
[1] TRUE
"200" > 2.9999999999999948933990
[1] FALSE
"20" > "200"
[1] FALSE
"20" < "200"
[1] TRUE
My mind is blown. Can anyone explain why adding a 0 matters? Also, exactly which numbers do "20" and "200" equal?

According to help("Comparison"), numeric values are converted to character strings (for the comparison) if you compare them with a character string. Adding the 0 matters because of the limited accuracy of floating-point numbers.
In help("as.character") it is documented that
as.character represents real and complex numbers to 15 significant
digits
Now compare this:
sprintf("%.16f", 2.999999999999994893399)
#[1] "2.9999999999999947"
sprintf("%.16f", 2.9999999999999948933990)
#[1] "2.9999999999999951"
as.character(2.999999999999994893399)
#[1] "2.99999999999999"
as.character(2.9999999999999948933990)
#[1] "3"

Part of the problem is that 2.999999999999994893399 and 2.9999999999999948933990 are not parsed as the same number. The reason for this is likely the parsing algorithm that R uses. I believe it goes something like this:
When you see a number containing a decimal point, ignore the decimal and read the number as an integer, then divide by the appropriate power of 10.
So 2.999999999999994893399 is read as 2999999999999994893399 and divided by 10^21. But 2999999999999994893399 is too big to represent exactly, so it becomes 2999999999999994757120 after reading, and that becomes 2.9999999999999946709 after the division (since 10^21 can't be stored exactly either).
On the other hand, 2.9999999999999948933990 is read as 29999999999999948933990 and divided by 10^22. Rounding is different, because the number is 10 times bigger: the integer becomes 29999999999999949668352 and after division it is 2.999999999999995115.
Some of the numbers I show might be different on your system: most of this is handled at a very low level, and could be different depending on the system library and hardware.
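You can roughly mimic this account in R itself. This is only a sketch of the parsing story described above; R's real parser works at a lower level, so treat the digits as illustrative:
sprintf("%.17g", 2999999999999994893399 / 10^21)
#[1] "2.9999999999999947"
sprintf("%.17g", 29999999999999948933990 / 10^22)
#[1] "2.9999999999999951"
These match the sprintf values of the two parsed literals shown in the other answer.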

Related

Curious about how R compares a character with a number [duplicate]

So here is what I found in R:
> "20" < 3
[1] TRUE
> "20" < 2
[1] FALSE
and I tried to find out exactly what number "20" equals. Empirically, "20" behaves as if it lies between 2.99999999999999489 and 2.99999999999999490; why is that?
Also, "200" and "2000" (and so on...) have the same range as "20", but the wired thing is R gives me this:
> "20" == "200"
[1] FALSE
> "200" == "2000"
[1] FALSE
I believe they are not the same, so R must be storing them as different numbers within the range that "20" occupies. Beyond why this happens, I'm curious how R stores characters as numbers: if "200" and "2000" are different numbers between 2.99999999999999489 and 2.99999999999999490, how many numbers lie between them? (I know real numbers are infinite, but there must be a digit at which R decides to stop.)
The help page for < says:
If the two arguments are atomic vectors of different types,
one is coerced to the type of the other,
the (decreasing) order of precedence being
character, complex, numeric, integer, logical and raw.
So when you evaluate "20" < 3, R internally converts 3 to "3", and the comparison becomes "20" < "3".
The comparison between strings is lexicographic (same as the order of words in dictionaries).
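Making the coercion explicit gives the same answers (a hedged restatement of the rule; as with any string comparison, the exact ordering can depend on your locale):
> "20" < as.character(3)   # the coerced form of "20" < 3
[1] TRUE
> "20" < as.character(2)   # "2" is a prefix of "20", so "20" sorts after it
[1] FALSE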

Comparing integers with characters in R [duplicate]

It appears that as.character() of a number is still a number, which I find counterintuitive. Consider this example:
1 > "2"
[1] FALSE
2 > "1"
[1] TRUE
Even if I try to use as.character() or paste()
as.character(2)
[1] "2"
as.character(2) > 1
[1] TRUE
as.character(2) < 1
[1] FALSE
Why is that? Can't I have R return an error when I am comparing numbers with strings?
As explained in the comments the problem is that the numeric 1 is coerced to character.
The operation < still works for characters. A character is smaller than another if it comes first in alphabetical order.
> "a" < "b"
[1] TRUE
> "z" < "b"
[1] FALSE
So in your case as.character(2) > 1 is transformed to as.character(2) > as.character(1), and because of the "alphabetical" order of numbers, TRUE is returned.
To prevent this you would have to check for the class of an object manually.
The documentation of ?Comparison states that
If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.
So in your case the number is automatically coerced to string and the comparison is made based on the respective collation.
In order to prevent it, the only option I know of is to manually compare the class first.
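Both answers come down to checking the types yourself before comparing. A minimal sketch of such a guard (strict_gt is a hypothetical helper, not part of base R; note that it also rejects integer-versus-double comparisons, because their classes differ):
strict_gt <- function(x, y) {
  # refuse to compare objects of different classes instead of coercing silently
  if (!identical(class(x), class(y))) {
    stop("refusing to compare objects of different classes")
  }
  x > y
}
strict_gt(2, 1)
# [1] TRUE
# strict_gt("2", 1) now stops with an error instead of quietly returning TRUE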

R: How to convert long number to string to save precision

I have a problem converting a long number to a string in R. How can I easily convert a number to a string while preserving precision? I have a simple example below.
a = -8664354335142704128
toString(a)
[1] "-8664354335142704128"
b = -8664354335142703762
toString(b)
[1] "-8664354335142704128"
a == b
[1] TRUE
I expected toString(a) and toString(b) to differ, but I got the same value for both. I suppose toString() converts the number to a float or something like that before converting it to a string.
Thank you for your help.
Edit:
> -8664354335142704128 == -8664354335142703762
[1] TRUE
> along = bit64::as.integer64(-8664354335142704128)
> blong = bit64::as.integer64(-8664354335142703762)
> along == blong
[1] TRUE
> blong
integer64
[1] -8664354335142704128
I also tried:
> as.character(blong)
[1] "-8664354335142704128"
> sprintf("%f", -8664354335142703762)
[1] "-8664354335142704128.000000"
> sprintf("%f", blong)
[1] "-0.000000"
Edit 2:
My question at first was whether I can convert a long number to a string without loss. Then I realized that in R it is impossible to get the real value of a long number literal passed into a function, because R has already read the value with the loss.
For example, I have the function:
> my_function <- function(long_number){
+ string_number <- toString(long_number)
+ print(string_number)
+ }
If someone uses it and passes a long number, I am not able to tell exactly which number was passed.
> my_function(-8664354335142703762)
[1] "-8664354335142704128"
If I were reading the numbers from a file, this would be easy; but that is not my case. I have to work with whatever a user passes in.
I am not an R expert, so I was just curious why this works in other languages but not in R. For example, in Python:
>>> def my_function(long_number):
... string_number = str(long_number)
... print(string_number)
...
>>> my_function(-8664354335142703762)
-8664354335142703762
Now I know the problem is how R reads and stores numbers; every language does this differently. I have to change the way numbers are passed to my R function, and that solves my problem.
So the correct answer to my question is:
""I suppose toString() converts the number to float", nope, you did it yourself (even if unintentionally)." - Nope, R did it itself, that is the way how R reads numbers.
So I marked r2evans's answer as the best answer, because it helped me find the right solution. Thank you!
Bottom line up front: you must (in this case) read in your large numbers as strings before converting to 64-bit integers:
bit64::as.integer64("-8664354335142704128") == bit64::as.integer64("-8664354335142703762")
# [1] FALSE
Some points about what you've tried:
"I suppose toString() converts the number to float", nope, you did it yourself (even if unintentionally). In R, when creating a number, 5 is a float and 5L is an integer. Even if you had tried to create it as an integer, it would have complained and lost precision anyway:
class(5)
# [1] "numeric"
class(5L)
# [1] "integer"
class(-8664354335142703762)
# [1] "numeric"
class(-8664354335142703762L)
# Warning: non-integer value 8664354335142703762L qualified with L; using numeric value
# [1] "numeric"
More appropriately: when you type it in as a number and then try to convert it, R processes the inside of the parentheses first. That is, with
bit64::as.integer64(-8664354335142704128)
R first has to parse and "understand" everything inside the parentheses before it can be passed to the function. (This is typically a compiler/language-parsing thing, not just an R thing.) In this case, it sees that it appears to be a (large) negative float, so it creates a class numeric (float). Only then does it send this numeric to the function, but by this point the precision has already been lost. Ergo the otherwise-illogical
bit64::as.integer64(-8664354335142704128) == bit64::as.integer64(-8664354335142703762)
# [1] TRUE
In this case, it just happens that the 64-bit version of that number is equal to what you intended.
bit64::as.integer64(-8664254335142704128) # ends in 4128
# integer64
# [1] -8664254335142704128 # ends in 4128, yay! (coincidence?)
If you subtract one, it results in the same effective integer64:
bit64::as.integer64(-8664354335142704127) # ends in 4127
# integer64
# [1] -8664354335142704128 # ends in 4128 ?
This continues for quite a while, until it finally shifts to the next rounding point:
bit64::as.integer64(-8664254335142703617)
# integer64
# [1] -8664254335142704128
bit64::as.integer64(-8664254335142703616)
# integer64
# [1] -8664254335142703104
It is unlikely to be a coincidence that the difference is 1024, or 2^10. I haven't dug in deeply, but this is what floating-point precision predicts: a double carries a 53-bit significand, so for magnitudes between 2^62 and 2^63 adjacent representable values differ by 2^(62-52) = 2^10 = 1024.
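A quick hedged check of that spacing, using only base R (.Machine$double.eps is 2^-52):
x <- 8664254335142703616
2^floor(log2(x)) * .Machine$double.eps  # gap between adjacent doubles at this magnitude
# [1] 1024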
Fortunately, bit64::as.integer64 has several S3 methods, useful for converting different formats/classes to an integer64:
library(bit64)
methods(as.integer64)
# [1] as.integer64.character as.integer64.double as.integer64.factor
# [4] as.integer64.integer as.integer64.integer64 as.integer64.logical
# [7] as.integer64.NULL
So, bit64::as.integer64.character can be useful, since precision is not lost when you type it or read it in as a string:
bit64::as.integer64("-8664354335142704128")
# integer64
# [1] -8664354335142704128
bit64::as.integer64("-8664354335142704128") == bit64::as.integer64("-8664354335142703762")
# [1] FALSE
FYI, your number is already near the 64-bit boundary:
-.Machine$integer.max
# [1] -2147483647
-(2^31-1)
# [1] -2147483647
log(8664354335142704128, 2)
# [1] 62.9098
-2^63 # the approximate +/- range of 64-bit integers
# [1] -9.223372e+18
-8664354335142704128
# [1] -8.664354e+18

Why can't R handle inequalities between negative numbers in quotes

This is a weird problem, with an easy workaround, but I'm just so curious why R is behaving this way.
> "-1"<"-2"
[1] TRUE
> -1<"-2"
[1] TRUE
> "-1"< -2
[1] TRUE
> -1< -2
[1] FALSE
> as.numeric("-1")<"-2"
[1] TRUE
> "-1"<as.numeric("-2")
[1] TRUE
> as.numeric("-1")<as.numeric("-2")
[1] FALSE
What is happening? Please, for my own sanity...
A "number in quotes" is not a number at all, it is a string of characters. Those characters happen to be displayed with the same drawing on your screen as the corresponding number, but they are fundamentally not the same object.
The behavior you are seeing is consistent with the following:
A pair of numbers (numeric in R) is compared in the way that you should expect, numerically with the natural ordering. So, -1 < -2 is indeed FALSE.
A pair of strings (character in R) are compared in lexicographic order, meaning roughly that it is compared alphabetically, character by character, from left to right. Since "-1" and "-2" start with the same character, we move to the second, and "2" comes after "1", so "-2" comes after "-1" and therefore "-1" < "-2" is TRUE.
When comparing objects of mismatched types, you have two basic choices: either you give an error, or you convert one of the types to the other and then fall back on the two facts above. R takes the 2nd route, and chooses to convert numeric to character, which explains the result you got above (all your mismatched examples give TRUE).
Note that it makes more sense to convert numeric to character, rather than the other way around, because most character can't be automatically converted to numeric in a meaningful way.
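You can watch the conversion happen by doing the coercion yourself (a hedged illustration; the mixed-type comparison and the all-character one compare the same strings):
> as.character(-1) < "-2"   # what -1 < "-2" becomes after coercion
[1] TRUE
> as.character(-1) < as.character(-2)
[1] TRUE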
I've always thought this is because the default behavior is to treat the values in quotes as character, and the values without quotes as double. Without expressly declaring the data types, you get this:
> typeof(-1)
[1] "double"
> typeof("-1")
[1] "character"
> typeof(as.numeric("-1"))
[1] "double"
It's only when the negative numbers are put in quotes that it orders them alphabetically, because they are characters.

Is it okay to use floating-point numbers as indices or when creating factors in R?

I don't mean numbers with decimal parts (that would clearly be odd); I mean numbers that really are integers to the user, but are stored as floating-point numbers.
For example, I've often used constructs like (1:3)*3 or seq(3,9,by=3) as indices, but you'll notice that they're actually being represented as floating point numbers, not integers, even though to me, they're really integers.
Another time this could come up is when reading data from a file; if the file represents the integers as 1.0, 2.0, 3.0, etc, R will store them as floating-point numbers.
(I posted an answer below with an example of why one should be careful, but it doesn't really address if simple constructs like the above can cause trouble.)
(This question was inspired by this question, where the OP created integers to use as coding levels of a factor, but they were being stored as floating point numbers.)
It's always better to use integer representation when you can. For instance, with (1L:3L)*3L or seq(3L,9L,by=3L).
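A quick check (standard R behaviour, shown here as a hedged aside) that the L-suffixed forms really stay integer while the originals are promoted to double:
> typeof((1:3) * 3)            # multiplying by the double 3 promotes the result
[1] "double"
> typeof((1L:3L) * 3L)
[1] "integer"
> typeof(seq(3L, 9L, by = 3L))
[1] "integer"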
I can come up with an example where floating representation gives an unexpected answer, but it depends on actually doing floating point arithmetic (that is, on the decimal part of a number). I don't know if storing an integer directly in floating point and possibly then doing multiplication, as in the two examples in the original post, could ever cause a problem.
Here's my somewhat forced example to show that floating points can give funny answers. I make two 3's that are different in floating point representation; the first element isn't quite exactly equal to three (on my system with R 2.13.0, anyway).
> (a <- c((0.3*3+0.1)*3,3L))
[1] 3 3
> a[1] == a[2]
[1] FALSE
Creating a factor directly works as expected because factor calls as.character on them which has the same result for both.
> as.character(a)
[1] "3" "3"
> factor(a, levels=1:3, labels=LETTERS[1:3])
[1] C C
Levels: A B C
But using it as an index doesn't work as expected because when they're forced to an integer, they are truncated, so they become 2 and 3.
> trunc(a)
[1] 2 3
> LETTERS[a]
[1] "B" "C"
Constructs such as 1:3 are really integers:
> class(1:3)
[1] "integer"
Using a float as an index apparently entails some truncation:
> foo <- 1:3
> foo
[1] 1 2 3
> foo[1.0]
[1] 1
> foo[1.5]
[1] 1
