I had a question related to the sorting algorithm in R.
If I use order() to sort a particular column, the shorter string is not sorted first.
For example, I had to sort a column of character type, and it puts firearm_weight above fire_weigh, which is not how a dictionary would sort these strings.
How can I change this while using the order() command?
Thanks!
"_" < "a" is TRUE on my system and locale.
help("Comparison") is relevant here:
Comparison of strings in character vectors is lexicographic within the
strings using the collating sequence of the locale in use: see
locales. The collating sequence of locales such as en_US is normally
different from C (which should use ASCII) and can be surprising.
Beware of making any assumptions about the collation order: [...]
Collation of non-letters (spaces, punctuation signs, hyphens,
fractions and so on) is even more problematic.
You could substitute "_" with something that is ordered after "z" on your system. E.g., a "µ" on my system.
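Another option that avoids rewriting the data is to bypass the locale's collation for this one call. As a minimal sketch, assuming R 3.3.0 or later (where order() accepts method = "radix", which is documented to sort character vectors in the C locale), and using the two names from the question as sample data:

```r
# Sample data taken from the question.
x <- c("firearm_weight", "fire_weigh")

# method = "radix" compares bytes as in the C locale, where "_" (0x5F)
# sorts before "a" (0x61), regardless of the session locale.
sorted <- x[order(x, method = "radix")]
print(sorted)
# [1] "fire_weigh"     "firearm_weight"
```

Note that C-locale ordering also puts all uppercase ASCII letters before all lowercase ones, which may or may not be what you want.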
As far as I know, what most languages call a string, R calls a character vector. For example, "Alice" is not a string, it's a character vector of length 1. Similarly, c("Alice", "Bob") is a character vector of length 2. I cannot recall my IDE or any of my work with R's type system telling me that R has any internal concept of "strings".
Despite this, R's documentation frequently uses the word "string":
?paste and ?nchar frequently talk of "character strings".
Many "See Also" sections mention strings without any qualifier, e.g. ?paste, ?chartr, and ?agrep.
?strsplit mentions "substrings".
?agrep, ?toString, and ?adist talk about strings both in their titles and "Description" sections.
strsplit, strwidth, and toString have string or a shorthand for it in their names.
So does R actually have a concept of strings, or does it always mean exactly the same thing as "character vector"?
Converting my comment to an answer.
A description of character and string can be found in the R Language Definition:
R has six basic (‘atomic’) vector types: logical, integer, real, complex, string (or character) and raw. The modes and storage modes for the different vector types are listed in the following table.
typeof      mode        storage.mode
logical     logical     logical
integer     numeric     integer
double      numeric     double
complex     complex     complex
character   character   character
raw         raw         raw
[...]
String vectors have mode and storage mode "character". A single element of a character vector is often referred to as a character string.
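The distinction can be checked interactively. This is only an illustrative sketch using base R functions; the variable names x and y are made up for the example:

```r
x <- "Alice"
typeof(x)   # "character"
length(x)   # 1: a character vector holding a single string
nchar(x)    # 5: the number of characters in that one string

y <- c("Alice", "Bob")
length(y)   # 2: two strings in one character vector
nchar(y)    # 5 3: nchar() works per element, i.e. per string

# One row of the table above, checked for an integer vector:
c(typeof(1L), mode(1L), storage.mode(1L))   # "integer" "numeric" "integer"
```

So length() counts the elements of the vector, while nchar() looks inside each element, which is where the documentation's "character string" wording comes in.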
What logic does R use to end up with FALSE in the logical operation on characters below? Is it just comparing the letter S with the letter T instead of the entire string?
"Sachin" > "Tendulkar"
Output: FALSE
This is in the documentation. ?">" gives:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use
In other words, this is just a regular dictionary-style comparison. Things can get very complicated or weird depending on the locale (e.g. how non-alphabetic, accented, and upper- vs. lower-case characters are handled), but this case looks straightforward. "S" comes before "T" in every locale I can imagine, so "S" < "T"; in a lexicographic sort, this first difference determines the order (if the first letters were tied, the tie would be broken by later letters in the sequence).
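A few comparisons make the rule concrete. This is a small sketch; the strings other than the two from the question are invented for illustration, and the results shown assume an ordinary locale in which the plain letters a to z collate in alphabetical order:

```r
# The first differing position decides: "S" sorts before "T".
"Sachin" > "Tendulkar"   # FALSE

# With a shared prefix, the tie is broken by the first later difference:
# "n" sorts before "t" at position 6.
"Sachin" < "Sachit"      # TRUE

# A string that is a prefix of another sorts first.
"Sach" < "Sachin"        # TRUE
```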
I have spent hours looking for a proper solution but found nothing on the Internet. Here is my question. In R, I have a list of character strings containing my desired variable names ("2011_Q4", "2012_Q1", ...). When I try to assign a dataset to each of these names in a loop, it works, but the output is strange. Indeed, I have
> View(`2011_Q4`)
instead of
> View(2011_Q4)
And I don't know how to remove this apostrophe. It's very annoying since I have to type this ` in order to call the variable.
Can somebody help me? I would appreciate the help.
Thanks a lot and best regards
Firstly, it's a backtick (`), not an apostrophe ('). In R, backticks occasionally denote variable names; apostrophes work as single quotes for denoting strings.
The issue you're having is that your variables start with a number, which is not allowed in R. Since you somehow made it happen anyway, you need to use backticks to tell R not to interpret 2011_Q4 as a number, but as a variable.
From ?Quotes:
Names and Identifiers
Identifiers consist of a sequence of letters, digits, the period (.)
and the underscore. They must not start with a digit nor underscore,
nor with a period followed by a digit. Reserved words are not valid
identifiers.
The definition of a letter depends on the current locale, but only
ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used
directly in R code. Almost always, other names can be used provided
they are quoted. The preferred quote is the backtick (`), and deparse
will normally use it, but under many circumstances single or double
quotes can be used (as a character constant will often be converted to
a name). One place where backticks may be essential is to delimit
variable names in formulae: see formula.
The best solution to your issue is simply to change your variable names to something that starts with a letter, e.g. Y2011_Q4.
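As a rough sketch of the difference (the data frame contents here are placeholders, and the "Y" prefix is just the one suggested above), compare a name created as-is with one that has been prefixed:

```r
# A non-syntactic name can be created with assign(), but must then be
# quoted with backticks whenever it is used in code.
assign("2011_Q4", data.frame(x = 1:3))
nrow(`2011_Q4`)    # 3

# Prefixing a letter gives a syntactic name that needs no quoting.
assign(paste0("Y", "2011_Q4"), data.frame(x = 1:3))
nrow(Y2011_Q4)     # 3

# make.names() can do the repair automatically; it prepends "X".
make.names("2011_Q4")   # "X2011_Q4"
```

If the names come from a vector, applying paste0("Y", ...) or make.names() to the whole vector before the loop fixes all of them at once.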
I can’t find a spec of the language…
Note that I want a correct answer, e.g. like this, as I could easily come up with a simple but likely wrong approximation myself, such as [[:alpha:]._][\w._]*
The documentation for make.names() says
A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.
The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
@Roland points out this section of the R language definition:
10.3.2 Identifiers
Identifiers consist of a sequence of letters, digits, the period (‘.’) and the underscore. They must not start with a digit or an underscore, or with a period followed by a digit.
The definition of a letter depends on the current locale: the precise set of characters allowed is given by the C expression (isalnum(c) || c == ‘.’ || c == ‘_’) and will include accented letters in many Western European locales.
Notice that identifiers starting with a period are not by default listed by the ls function and that ‘...’ and ‘..1’, ‘..2’, etc. are special.
Notice also that objects can have names that are not identifiers. These are generally accessed via get and assign, although they can also be represented by text strings in some limited circumstances when there is no ambiguity (e.g. "x" <- 1). As get and assign are not restricted to names that are identifiers they do not recognise subscripting operators or replacement functions.
The rules seem to allow "Morse coding":
> .__ <- 1
> ._._. <- 2
> .__ + ._._.
[1] 3
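Since the definition of a letter is locale-dependent, a fixed regular expression is hard to get exactly right. One commonly used alternative, sketched here under the assumption that make.names() leaves a name unchanged exactly when it is already syntactically valid, is to let R apply its own rules. The helper name is_syntactic is hypothetical, not a base R function:

```r
# Hypothetical helper: TRUE where make.names() has nothing to repair.
is_syntactic <- function(x) make.names(x) == x

candidates <- c("Y2011_Q4", "2011_Q4", ".2way", ".__", "if", "_x")
result <- is_syntactic(candidates)
print(setNames(result, candidates))
# Y2011_Q4: TRUE   (starts with a letter)
# 2011_Q4:  FALSE  (starts with a digit)
# .2way:    FALSE  (period followed by a digit)
# .__:      TRUE   (the "Morse code" style name above)
# if:       FALSE  (reserved word)
# _x:       FALSE  (starts with an underscore)
```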
R sorts character vectors in a sequence which I describe as alphabetic, not ASCII.
For example:
sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" "Cat" "dog" "Dog"
Three questions:
What is the technically correct terminology to describe this sort order?
I cannot find any reference to this in the manuals on CRAN. Where can I find a description of the sorting rules in R?
Is this any different from the sorting behaviour in other languages like C, Java, Perl or PHP?
The Details section of help(sort) states:
The sort order for character vectors will depend on the collating
sequence of the locale in use: see ‘Comparison’. The sort order
for factors is the order of their levels (which is particularly
appropriate for ordered factors).
and help(Comparison) then shows:
Comparison of strings in character vectors is lexicographic within
the strings using the collating sequence of the locale in use: see
‘locales’. The collating sequence of locales such as ‘en_US’ is
normally different from ‘C’ (which should use ASCII) and can be
surprising. Beware of making _any_ assumptions about the
collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
and collation is not necessarily character-by-character - in
Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’
may or may not be a single sorting unit: if it is it follows ‘g’.
Some platforms may not respect the locale and always sort in
numerical order of the bytes in an 8-bit locale, or in Unicode
point order for a UTF-8 locale (and may not sort in the same order
for the same language in different character sets). Collation of
non-letters (spaces, punctuation signs, hyphens, fractions and so
on) is even more problematic.
so it depends on your locale setting.
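To see which collation the current session uses and how it differs from byte order, something like the following can be run. It is a sketch assuming R 3.3.0 or later for method = "radix", which the sort() documentation describes as using the C locale for character vectors:

```r
# The collation locale of the current session (value is system-dependent).
Sys.getlocale("LC_COLLATE")

x <- c("dog", "Cat", "Dog", "cat")

# Locale-dependent: in an en_US-style locale this typically gives
# "cat" "Cat" "dog" "Dog", as in the question.
sort(x)

# C-locale (byte) order: all uppercase letters sort before lowercase ones.
radix_sorted <- sort(x, method = "radix")
print(radix_sorted)
# [1] "Cat" "Dog" "cat" "dog"
```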
Sorting depends on locale.
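My solution for that is the following.
I create a ~/.Renviron file that sets LC_ALL=C (the leading # below presumably marks printed output, as in the sort() result further down; a line that really starts with # would be treated as a comment in .Renviron and have no effect):
cat ~/.Renviron
#LC_ALL=C
Then sorting in R uses the C locale: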
x <- c("A", "B", "d", "F", "g", "H")
sort(x)
#[1] "A" "B" "F" "H" "d" "g"
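If changing the startup file is not an option, the collation locale can also be switched inside a running session. A minimal sketch using Sys.setlocale(), restoring the previous setting afterwards so the rest of the session is unaffected (the variable names old_collate and sorted_c are just for illustration):

```r
x <- c("A", "B", "d", "F", "g", "H")

# Remember the current collation locale, then switch to C.
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))

sorted_c <- sort(x)
print(sorted_c)
# [1] "A" "B" "F" "H" "d" "g"

# Restore the previous collation locale.
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

This only changes collation (LC_COLLATE), whereas LC_ALL=C in .Renviron changes every locale category for the whole session.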