Why "<some string>" >= <a number> is TRUE? - r

Maybe it is a silly question, but while playing with subsetting I came across this behavior and I can't understand why it happens.
For example, consider a string, say "a", and a number, say 3. Why does this expression return TRUE?
"a" >= 3
[1] TRUE

When you try to compare a string to an integer, R will coerce the number into a string, so 3 becomes "3".
Using comparison operators on strings checks whether the condition is true or false given their alphabetical (collation) order. For example:
> "a" < "b"
[1] TRUE
> "b" > "c"
[1] FALSE
These results happen because, for R, the ascending order is a, b, c. Digits usually come before letters in alphabetical order (just check files ordered by name that start with a number). This is why you get
"a" >= 3
[1] TRUE
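As a quick sketch of what R is doing under the hood, the mixed comparison gives the same answer as coercing the number to a string yourself:

```r
# R coerces the number to character before comparing, so these two agree:
"a" >= 3                # TRUE in common locales (digits sort before letters)
"a" >= as.character(3)  # the same comparison, with the coercion made explicit
identical("a" >= 3, "a" >= as.character(3))  # TRUE regardless of locale
```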
Finally, note that your result can vary depending on your locale and how the alphabetical order is defined in it. The manual says:
Comparison of strings in character vectors is lexicographic within the
strings using the collating sequence of the locale in use: see
locales. The collating sequence of locales such as en_US is normally
different from C (which should use ASCII) and can be surprising.
Beware of making any assumptions about the collation order: e.g. in
Estonian Z comes between S and T, and collation is not necessarily
character-by-character – in Danish aa sorts as a single letter, after
z. In Welsh ng may or may not be a single sorting unit: if it is it
follows g. Some platforms may not respect the locale and always sort
in numerical order of the bytes in an 8-bit locale, or in Unicode
code-point order for a UTF-8 locale (and may not sort in the same
order for the same language in different character sets). Collation of
non-letters (spaces, punctuation signs, hyphens, fractions and so on)
is even more problematic.
This is important and should be considered if the logical operators are used to compare strings (regardless of comparing them to numbers or not).
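To see the locale dependence concretely, here is a small sketch that forces the C locale, where collation follows plain ASCII byte order (digits, then uppercase, then lowercase). Sys.setlocale may fail on some platforms, so treat this as illustrative:

```r
old <- Sys.getlocale("LC_COLLATE")  # remember the current collation locale
Sys.setlocale("LC_COLLATE", "C")    # switch to ASCII byte-order collation
"3" < "A"   # TRUE in the C locale: ASCII 51 < 65
"A" < "a"   # TRUE in the C locale: uppercase precedes lowercase in ASCII
Sys.setlocale("LC_COLLATE", old)    # restore the original locale
```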

Related

How does R compare version strings with the inequality operators?

Could someone explain this behavior in R?
> '3.0.1' < '3.0.2'
[1] TRUE
> '3.0.1' > '3.0.2'
[1] FALSE
What process is R doing to make the comparison?
It's making a lexicographic comparison in this case, as opposed to converting to numeric, as calling as.numeric('3.0.1') returns NA.
The logic here would be something like, "the strings '3.0.1' and '3.0.2' are equivalent until their final characters, and since 1 precedes 2 in an alphanumeric alphabet, '3.0.1' is less than '3.0.2'." You can test this with some toy examples:
'a' < 'b' # TRUE
'ab' < 'ac' # TRUE
'ab0' < 'ab1' # TRUE
Per the note in the manual in the post that #rawr linked in the comments, this will get hairy in different locales, where the alphanumeric alphabet may be sorted differently.
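Worth noting: if you actually want version semantics rather than string semantics, base R provides numeric_version and compareVersion (in utils), which compare each dot-separated component numerically. A sketch where the two orderings disagree:

```r
# Lexicographic comparison misorders multi-digit version components:
"3.0.10" > "3.0.2"                                    # FALSE: '1' sorts before '2'
# numeric_version compares component by component, as numbers:
numeric_version("3.0.10") > numeric_version("3.0.2")  # TRUE
compareVersion("3.0.10", "3.0.2")                     # 1: first argument is newer
```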

Why can't R handle inequalities between negative numbers in quotes

This is a weird problem, with an easy workaround, but I'm just so curious why R is behaving this way.
> "-1"<"-2"
[1] TRUE
> -1<"-2"
[1] TRUE
> "-1"< -2
[1] TRUE
> -1< -2
[1] FALSE
> as.numeric("-1")<"-2"
[1] TRUE
> "-1"<as.numeric("-2")
[1] TRUE
> as.numeric("-1")<as.numeric("-2")
[1] FALSE
What is happening? Please, for my own sanity...
A "number in quotes" is not a number at all, it is a string of characters. Those characters happen to be displayed with the same drawing on your screen as the corresponding number, but they are fundamentally not the same object.
The behavior you are seeing is consistent with the following:
A pair of numbers (numeric in R) is compared the way you would expect, numerically with the natural ordering. So -1 < -2 is indeed FALSE.
A pair of strings (character in R) is compared in lexicographic order, meaning roughly that they are compared alphabetically, character by character, from left to right. Since "-1" and "-2" start with the same character, we move to the second, and "2" comes after "1", so "-2" comes after "-1" and therefore "-1" < "-2" is TRUE.
When comparing objects of mismatched types, you have two basic choices: either you raise an error, or you convert one of the types to the other and then fall back on the two facts above. R takes the second route and chooses to convert numeric to character, which explains the results you got above (all your mismatched examples give TRUE).
Note that it makes more sense to convert numeric to character rather than the other way around, because most character strings can't be automatically converted to numeric in a meaningful way.
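A quick sketch of why the coercion goes in that direction, and how to force the numeric comparison you probably meant:

```r
# character -> numeric is lossy: most strings have no numeric meaning
suppressWarnings(as.numeric("hello"))  # NA (with a coercion warning otherwise)
# numeric -> character always succeeds
as.character(-1)                       # "-1"
# To compare values numerically, convert BOTH sides before comparing:
as.numeric("-1") < as.numeric("-2")    # FALSE, the expected numeric answer
```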
I've always thought this is because the default behavior is to treat the values in quotes as character, and the values without quotes as double. Without expressly declaring the data types, you get this:
> typeof(-1)
[1] "double"
> typeof("-1")
[1] "character"
> typeof(as.numeric("-1"))
[1] "double"
It's only when the negative numbers are put in quotes that it orders them alphabetically, because they are characters.

Why does "one" < 2 equal FALSE in R?

I'm reading Hadley Wickham's Advanced R section on coercion, and I can't understand the result of this comparison:
"one" < 2
# [1] FALSE
I'm assuming that R coerces 2 to a character, but I don't understand why R returns FALSE instead of returning an error. This is especially puzzling to me since
-1 < "one"
# TRUE
So my question is two-fold: first, why this answer, and second, is there a way of seeing how R converts the individual elements within a logical vector like these examples?
From help("<"):
If the two arguments are atomic vectors of different types, one is
coerced to the type of the other, the (decreasing) order of precedence
being character, complex, numeric, integer, logical and raw.
So in this case, the numeric is of lower precedence than the character. So 2 is coerced to the character "2". Comparison of strings in character vectors is lexicographic which, as I understand it, is alphabetic but locale-dependent.
It coerces 2 into a character, then does an alphabetical (lexicographic) comparison. Digit characters typically sort before alphabetic ones.
To get a general idea of the behavior, try:
'a'<'1'
'1'<'.'
'b'<'B'
'a'<'B'
'A'<'B'
'C'<'B'
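The same precedence order drives c(), so you can inspect the coercion target directly; a small sketch:

```r
# c() coerces everything to the highest-precedence type present:
typeof(c(TRUE, 1L))    # "integer":   logical is promoted to integer
typeof(c(1L, 2.5))     # "double":    integer is promoted to double
typeof(c(2.5, "one"))  # "character": anything mixed with character becomes character
# Hence "one" < 2 first turns 2 into "2":
identical("one" < 2, "one" < as.character(2))  # TRUE
```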

What's the difference between integer class and numeric class in R

I want to preface this by saying I'm an absolute programming beginner, so please excuse how basic this question is.
I'm trying to get a better understanding of "atomic" classes in R, and maybe this goes for classes in programming in general. I understand the difference between the character, logical, and complex classes, but I'm struggling to find the fundamental difference between the numeric and integer classes.
Let's say I have a simple vector of integers, x <- c(4, 5, 6, 6); it would make sense for this to have class integer. But when I do class(x) I get [1] "numeric". If I then convert this vector to an integer class with x <- as.integer(x), it returns the exact same vector of numbers, except the class is different.
My question is: why is this the case, why is the default class for a set of integers numeric, and what are the advantages and/or disadvantages of storing a set of integers as numeric instead of integer?
There are multiple classes that are grouped together as "numeric" classes, the two most common of which are double (for double precision floating point numbers) and integer. R will automatically convert between the numeric classes when needed, so for the most part it does not matter to the casual user whether the number 3 is currently stored as an integer or as a double. Most math is done using double precision, so that is often the default storage.
Sometimes you may want to specifically store a vector as integers if you know that they will never be converted to doubles (used as ID values or indexing) since integers require less storage space. But if they are going to be used in any math that will convert them to double, then it will probably be quickest to just store them as doubles to begin with.
Patrick Burns on Quora says:
First off, it is perfectly feasible to use R successfully for years
and not need to know the answer to this question. R handles the
differences between the (usual) numerics and integers for you in the
background.
> is.numeric(1)
[1] TRUE
> is.integer(1)
[1] FALSE
> is.numeric(1L)
[1] TRUE
> is.integer(1L)
[1] TRUE
(Putting capital 'L' after an integer forces it to be stored as an
integer.)
As you can see "integer" is a subset of "numeric".
> .Machine$integer.max
[1] 2147483647
> .Machine$double.xmax
[1] 1.797693e+308
Integers only go to a little more than 2 billion, while the other
numerics can be much bigger. They can be bigger because they are
stored as double precision floating point numbers. This means that
the number is stored in two pieces: the exponent (like 308 above,
except in base 2 rather than base 10), and the "significand" (like
1.797693 above).
Note that 'is.integer' is not a test of whether you have a whole
number, but a test of how the data are stored.
One thing to watch out for is that the colon operator, :, will return integers if the start and end points are whole numbers. For example, 1:5 creates an integer vector of numbers from 1 to 5. You don't need to append the letter L.
> class(1:5)
[1] "integer"
Reference: https://www.quora.com/What-is-the-difference-between-numeric-and-integer-in-R
To quote the help page (try ?integer), bolded portion mine:
Integer vectors exist so that data can be passed to C or Fortran code which expects them, and so that (small) integer data can be represented exactly and compactly.
Note that current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly.
Like the help page says, R's integers are signed 32-bit numbers so can hold between -2147483648 and +2147483647 and take up 4 bytes.
R's numeric is identical to a 64-bit double conforming to the IEEE 754 standard. R has no single precision data type. (Source: help pages of numeric and double.) A double can store all integers between -2^53 and 2^53 exactly without losing precision.
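A quick sketch of both limits (the 2^53 behavior follows from IEEE 754 doubles having a 53-bit significand):

```r
.Machine$integer.max               # 2147483647: the 32-bit integer ceiling
suppressWarnings(2147483647L + 1L) # NA: integer arithmetic overflows
2^53 - 1 == 2^53                   # FALSE: doubles are still exact here
2^53 == 2^53 + 1                   # TRUE: 2^53 + 1 is not representable as a double
```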
We can see the data type sizes, including the overhead of a vector (source):
> object.size(1:1000)
4040 bytes
> object.size(as.numeric(1:1000))
8040 bytes
To my understanding, we do not declare a variable with a data type, so by default R treats any number written without the L suffix as numeric.
If you wrote:
> x <- c(4L, 5L, 6L, 6L)
> class(x)
[1] "integer"
it would be correct.
Example of Integer:
> x <- 2L
> print(x)
[1] 2
Example of Numeric (kind of like double/float from other programming languages):
> x <- 3.4
> print(x)
[1] 3.4
Numeric is an umbrella term for several types of classes (e.g. double and integer). Integers are numbers which do not have decimal points and thus are stored with minimal space in memory. Use the integer class only when doing computations with such numbers, otherwise revert to numeric.

Is it okay to use floating-point numbers as indices or when creating factors in R?

Is it okay to use floating-point numbers as indices or when creating factors in R?
I don't mean numbers with decimal parts; that would clearly be odd, but instead numbers which really are integers (to the user, that is), but are being stored as floating point numbers.
For example, I've often used constructs like (1:3)*3 or seq(3,9,by=3) as indices, but you'll notice that they're actually being represented as floating point numbers, not integers, even though to me, they're really integers.
Another time this could come up is when reading data from a file; if the file represents the integers as 1.0, 2.0, 3.0, etc, R will store them as floating-point numbers.
(I posted an answer below with an example of why one should be careful, but it doesn't really address if simple constructs like the above can cause trouble.)
(This question was inspired by this question, where the OP created integers to use as coding levels of a factor, but they were being stored as floating point numbers.)
It's always better to use integer representation when you can. For instance, with (1L:3L)*3L or seq(3L,9L,by=3L).
I can come up with an example where floating representation gives an unexpected answer, but it depends on actually doing floating point arithmetic (that is, on the decimal part of a number). I don't know if storing an integer directly in floating point and possibly then doing multiplication, as in the two examples in the original post, could ever cause a problem.
Here's my somewhat forced example to show that floating points can give funny answers. I make two 3's that are different in floating point representation; the first element isn't quite exactly equal to three (on my system with R 2.13.0, anyway).
> (a <- c((0.3*3+0.1)*3,3L))
[1] 3 3
> a[1] == a[2]
[1] FALSE
Creating a factor directly works as expected because factor calls as.character on them which has the same result for both.
> as.character(a)
[1] "3" "3"
> factor(a, levels=1:3, labels=LETTERS[1:3])
[1] C C
Levels: A B C
But using it as an index doesn't work as expected because when they're forced to an integer, they are truncated, so they become 2 and 3.
> trunc(a)
[1] 2 3
> LETTERS[a]
[1] "B" "C"
Constructs such as 1:3 are really integers:
> class(1:3)
[1] "integer"
Using a float as an index apparently entails truncation:
> foo <- 1:3
> foo
[1] 1 2 3
> foo[1.0]
[1] 1
> foo[1.5]
[1] 1
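Putting the two answers together, truncation plus floating-point error is where indexing can silently go wrong. A sketch (the exact floating-point value reflects typical IEEE 754 behavior, as in the forced example above):

```r
foo <- letters[1:5]
foo[2.9]       # "b": fractional indices are truncated toward zero, not rounded
# A computed value that is "really" 3 can sit just below 3:
x <- (0.3 * 3 + 0.1) * 3
x == 3         # FALSE on typical IEEE 754 platforms
foo[x]         # "b": truncation turns 2.999... into 2
foo[round(x)]  # "c": rounding first recovers the intended index
```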
