How does R compare version strings with the inequality operators?

Could someone explain this behavior in R?
> '3.0.1' < '3.0.2'
[1] TRUE
> '3.0.1' > '3.0.2'
[1] FALSE
What process is R doing to make the comparison?

R is making a lexicographic comparison here rather than a numeric one. The strings could not be converted to numbers anyway, since as.numeric('3.0.1') returns NA.
The logic here would be something like, "the strings '3.0.1' and '3.0.2' are equivalent until their final characters, and since 1 precedes 2 in an alphanumeric alphabet, '3.0.1' is less than '3.0.2'." You can test this with some toy examples:
'a' < 'b' # TRUE
'ab' < 'ac' # TRUE
'ab0' < 'ab1' # TRUE
Per the note in the manual in the post that @rawr linked in the comments, this can get hairy in different locales, where characters may be collated in a different order.
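If the goal is to compare version numbers rather than arbitrary strings, lexicographic comparison stops matching the intended order as soon as a component reaches two digits. A minimal sketch using base R's numeric_version() and utils::compareVersion(); the version strings are made up for illustration:

```r
# Lexicographic: the fifth characters differ, and "1" sorts before "9".
lex <- "3.0.10" > "3.0.9"                                      # FALSE

# numeric_version() compares the components as integers instead.
ver <- numeric_version("3.0.10") > numeric_version("3.0.9")    # TRUE

# compareVersion() returns -1, 0 or 1 for less, equal or greater.
cmp <- utils::compareVersion("3.0.10", "3.0.9")                # 1
```

For package versions specifically, packageVersion() already returns an object that compares this way.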

Related

How come as.character(1) == as.numeric(1) is TRUE? [duplicate]

Just like the title says, why is "1" == 1 TRUE? What is the real reason behind this? Is R trying to be kind, or is it something else? I was thinking that since "1" (or any number, it really doesn't matter) is read by R as a character, the comparison would automatically return FALSE when compared with as.numeric(1) or as.integer(1).
> as.character(1) == as.numeric(1)
[1] TRUE
or
> "1" == 1
[1] TRUE
I guess it is a simple question but I'd like to get an answer. Thank you.
According to ?==
For numerical and complex values, remember == and != do not allow for the finite representation of fractions, nor for rounding error. Using all.equal with identical is almost always preferable.
The description of the arguments on the same help page also says:
x, y
atomic vectors, symbols, calls, or other objects for which methods have been written. If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.
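So the numeric 1 is coerced to the character "1", and the comparison is really "1" == "1". identical(), by contrast, does not coerce:
identical(as.character(1), as.numeric(1))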
#[1] FALSE
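The coercion rule quoted above can be checked step by step. This is only a sketch with illustrative variable names:

```r
x_chr <- as.character(1)   # "1"
x_num <- as.numeric(1)     # 1

# == coerces the numeric to character first, so this compares "1" with "1".
eq_coerced <- x_chr == x_num          # TRUE

# The coercion is textual: 1 becomes "1", not "1.0".
eq_padded <- "1.0" == 1               # FALSE

# identical() does not coerce, so the differing types give FALSE.
same <- identical(x_chr, x_num)       # FALSE
```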

How to test if an object is a vector in R

I want to test if an object is a vector in R. I'm confused as to why
is.vector(c(0.1))
returns TRUE and so does
is.vector(0.1)
I would like it to return false when it is just a number and true when it is a vector. Can anyone offer any help on this please?
Many thanks in advance.
In R, a single number or string does not exist on its own. Such values are vectors of length 1, or they are embedded in some more complex structure.
is.vector(c(0.1)) and is.vector(0.1) are absolutely identical in R.
That is also why length("this is a string/character") returns 1: length() here measures the number of elements in the vector.
You can see this if you type "this is a string/character" into the R console:
It returns [1] "this is a string/character", where the [1] marks the first element of the printed vector, which here has length 1.
So you have to use nchar("this is a string/character") to get the length of the first element, the character string, which returns 26.
nchar(c("this is a string/character", "and this another string"))
## [1] 26 23
## nchar is vectorized as you see ...
This is an important difference from Python, where strings and numbers can stand alone.
So len("this") returns 4 in Python, whereas len(["this"]) returns 1 (one element in the list, so the length of the list is 1).
As already mentioned by @RHertel, R considers c(0.1) a vector of length 1. You may want to test for length as well. E.g.
> x <- 1
> y <- 1:2
> is.vector(x) && length(x) > 1
[1] FALSE
> is.vector(y) && length(y) > 1
[1] TRUE
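The length test above can be wrapped in a small function. The name is_multi_element is hypothetical, not something base R provides:

```r
# Hypothetical helper: TRUE only for vectors with more than one element.
is_multi_element <- function(x) {
  # && works on single TRUE/FALSE values and short-circuits
  is.vector(x) && length(x) > 1
}

is_multi_element(0.1)           # FALSE
is_multi_element(c(0.1, 0.2))   # TRUE
is_multi_element(list(1, 2))    # TRUE, because is.vector() also accepts lists
```

If lists should be excluded, is.atomic(x) may be a better fit than is.vector(x), depending on what counts as a vector in your code.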

Why can't R handle inequalities between negative numbers in quotes

This is a weird problem, with an easy workaround, but I'm just so curious why R is behaving this way.
> "-1"<"-2"
[1] TRUE
> -1<"-2"
[1] TRUE
> "-1"< -2
[1] TRUE
> -1< -2
[1] FALSE
> as.numeric("-1")<"-2"
[1] TRUE
> "-1"<as.numeric("-2")
[1] TRUE
> as.numeric("-1")<as.numeric("-2")
[1] FALSE
What is happening? Please, for my own sanity...
A "number in quotes" is not a number at all, it is a string of characters. Those characters happen to be displayed with the same drawing on your screen as the corresponding number, but they are fundamentally not the same object.
The behavior you are seeing is consistent with the following:
A pair of numbers (numeric in R) is compared in the way that you should expect, numerically with the natural ordering. So, -1 < -2 is indeed FALSE.
A pair of strings (character in R) is compared in lexicographic order, meaning roughly that the strings are compared alphabetically, character by character, from left to right. Since "-1" and "-2" start with the same character, we move to the second, and "2" comes after "1", so "-2" comes after "-1" and therefore "-1" < "-2" is TRUE.
When comparing objects of mismatched types, you have two basic choices: either you give an error, or you convert one of the types to the other and then fall back on the two facts above. R takes the second route and converts numeric to character, which explains the results you got above (all your mismatched examples give TRUE).
Note that it makes more sense to convert numeric to character, rather than the other way around, because most character strings can't be automatically converted to numeric in a meaningful way.
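The conversion described above can be made visible, along with the explicit conversion that restores numeric ordering. A sketch only; the vector x and the string "minus one" are invented for illustration:

```r
# What R effectively compares in the mixed case: the number as text.
as_text <- as.character(-1)      # "-1"
mixed   <- -1 < "-2"             # TRUE, the same as "-1" < "-2"

# Converting the strings explicitly gives the numeric ordering.
x <- c("-1", "-2", "10")
sorted_num <- sort(as.numeric(x))   # -2 -1 10

# Most strings have no numeric meaning, which is why R does not go this way
# on its own: the conversion yields NA (with a warning, suppressed here).
is_bad <- is.na(suppressWarnings(as.numeric("minus one")))   # TRUE
```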
I've always thought this is because the default behavior is to treat the values in quotes as character, and the values without quotes as double. Without expressly declaring the data types, you get this:
> typeof(-1)
[1] "double"
> typeof("-1")
[1] "character"
> typeof(as.numeric("-1"))
[1] "double"
It's only when the negative numbers are put in quotes that it orders them alphabetically, because they are characters.

Why "<some string>" >= <a number> is TRUE?

Maybe it is a silly question but playing with subsetting I faced this thing and I can't understand why it happens.
For example, let's consider a string, say "a", and a number, say 3. Why does this expression return TRUE?
"a" >= 3
[1] TRUE
When you try to compare a string to a number, R will coerce the number into a string, so 3 becomes "3".
Comparison operators applied to strings check whether the condition holds given the strings' alphabetical (collation) order. For example:
> "a" < "b"
[1] TRUE
> "b" > "c"
[1] FALSE
These results occur because, for R, the ascending order is a, b, c. Digits usually come before letters in alphabetical order (just look at files ordered by name when some names start with a digit), so "3" sorts before "a". This is why you get
"a" >= 3
[1] TRUE
Finally, note that your result can vary depending on your locale and how the alphabetical order is defined on it. The manual says:
Comparison of strings in character vectors is lexicographic within the
strings using the collating sequence of the locale in use: see
locales. The collating sequence of locales such as en_US is normally
different from C (which should use ASCII) and can be surprising.
Beware of making any assumptions about the collation order: e.g. in
Estonian Z comes between S and T, and collation is not necessarily
character-by-character – in Danish aa sorts as a single letter, after
z. In Welsh ng may or may not be a single sorting unit: if it is it
follows g. Some platforms may not respect the locale and always sort
in numerical order of the bytes in an 8-bit locale, or in Unicode
code-point order for a UTF-8 locale (and may not sort in the same
order for the same language in different character sets). Collation of
non-letters (spaces, punctuation signs, hyphens, fractions and so on)
is even more problematic.
This is important and should be kept in mind whenever the comparison operators are used on strings, whether or not a number is involved.
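When a result must not depend on the locale, one option is to fall back on C-locale ordering, which follows the code points. A minimal sketch, relying on the documented behavior that sort() with method = "radix" always sorts character vectors in the C locale; the vector x is just an example:

```r
x <- c("a", "3", "B")

# Radix sort ignores the session locale for character vectors.
c_order <- sort(x, method = "radix")   # "3" "B" "a"

# The underlying code points explain that order: digits, then upper case,
# then lower case.
codes <- sapply(x, utf8ToInt)          # a = 97, 3 = 51, B = 66

# The collation currently in effect for <, >, and the default sort():
Sys.getlocale("LC_COLLATE")
```

In a locale such as en_US the default sort(x) would typically place "a" before "B", which is exactly the kind of difference the manual warns about.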

Why does "one" < 2 equal FALSE in R?

I'm reading Hadley Wickham's Advanced R section on coercion, and I can't understand the result of this comparison:
"one" < 2
# [1] FALSE
I'm assuming that R coerces 2 to a character, but I don't understand why R returns FALSE instead of returning an error. This is especially puzzling to me since
-1 < "one"
# TRUE
So my question is two-fold: first, why this answer, and second, is there a way of seeing how R converts the individual elements within a logical vector like these examples?
From help("<"):
If the two arguments are atomic vectors of different types, one is
coerced to the type of the other, the (decreasing) order of precedence
being character, complex, numeric, integer, logical and raw.
In this case the numeric has lower precedence than the character, so 2 is coerced to the character "2". Comparison of strings in character vectors is lexicographic which, as I understand it, is alphabetic but locale-dependent.
It coerces 2 into a character string and then does an alphabetical comparison, and digit characters are assumed to come before letters.
To get a general idea of the behavior, try:
'a'<'1'
'1'<'.'
'b'<'B'
'a'<'B'
'A'<'B'
'C'<'B'
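As for seeing how R converts the individual elements: c() applies a similar coercion hierarchy when it combines values, so it is a convenient way to inspect what mixed types turn into. A sketch with illustrative variable names:

```r
# Combining a character and a numeric shows the numeric as text.
mixed <- c("one", 2)          # "one" "2"
mixed_type <- typeof(mixed)   # "character"

# So "one" < 2 is really "one" < "2".
two_as_text <- as.character(2)   # "2"

# Logical ranks below numeric and character:
lgl_num <- c(TRUE, 2)            # 1 2
lgl_chr <- TRUE == "TRUE"        # TRUE, since TRUE is coerced to "TRUE"
```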

Resources