What is the underlying logic when comparing strings? - r

What logic does R use to return FALSE for the comparison of character strings below? Is it just comparing the letter S with the letter T instead of the entire strings?
"Sachin" > "Tendulkar"
Output: FALSE

This is in the documentation. ?">" gives:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use
In other words, this is just a regular dictionary-style comparison. Things can get complicated or surprising depending on the locale (e.g. how non-alphabetic, accented, or upper- vs. lower-case characters are handled), but this case looks straightforward. "S" comes before "T" in every locale I can imagine, so "S" < "T". In a lexicographic comparison the first differing character decides the order, so "Sachin" < "Tendulkar" and "Sachin" > "Tendulkar" is FALSE; later letters matter only when the earlier ones tie.

Related

How does the Between operator work in dynamodb with strings

I was not expecting to get back a value from the query below. 1574208000#W2 is not between 1574207999 and 1574208001. But the records are still returned. Can anyone shed light on how the between comparison is done?
DynamoDB's BETWEEN operator on strings uses the lexicographic order of the strings (i.e., the order in which they would appear in a dictionary). In this order, 1574208000#W2 does fall between 1574207999 and 1574208001.
Two strings are lexicographically equal if they are the same length and contain the same characters in the same positions.
Apart from that, to determine which string comes first, compare corresponding characters of the two strings from left to right. The first character where the two strings differ determines which string comes first. Characters are compared using the Unicode character set. All uppercase letters come before lower case letters. If two letters are the same case, then alphabetic order is used to compare them.
If two strings contain the same characters in the same positions, then the shortest string comes first. Ref
To try this out, you can run a simple example in Java:
String a = "1574207999", b = "1574208000#W2", c = "1574208001";
System.out.println(a.compareTo(b)); // prints negative number, indicating a < b
System.out.println(b.compareTo(c)); // prints negative number, indicating b < c
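To see where each comparison is decided, here is a minimal sketch that reports the first index at which two strings differ. The class and the helpers firstDifference and between are hypothetical names made up for illustration; they are not part of DynamoDB or its SDK. It relies on Java's String.compareTo, which compares UTF-16 code units; for the ASCII-only keys used here that gives the same ordering as the rule quoted above.

```java
class FirstDifferenceDemo {
    // Returns the first index at which a and b differ, or -1 if they are
    // equal or one is a prefix of the other.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return -1;
    }

    // Hypothetical helper: true if low <= value <= high in compareTo order.
    static boolean between(String value, String low, String high) {
        return low.compareTo(value) <= 0 && value.compareTo(high) <= 0;
    }

    public static void main(String[] args) {
        String a = "1574207999", b = "1574208000#W2", c = "1574208001";

        int i = firstDifference(a, b);
        // Index 6: '7' in a is less than '8' in b, so a < b.
        System.out.println(i + ": " + a.charAt(i) + " vs " + b.charAt(i));

        int j = firstDifference(b, c);
        // Index 9: '0' in b is less than '1' in c, so b < c; "#W2" is never reached.
        System.out.println(j + ": " + b.charAt(j) + " vs " + c.charAt(j));

        System.out.println(between(b, a, c)); // true
    }
}
```

The trailing "#W2" plays no part in either comparison, because both are settled at an earlier character.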

How to compare two hex values in drools

I need to compare two hex values in Drools.
For example: compare 0xbadf00d with 0xbadf00e.
This should result in false, as d doesn't match e.
So my question is: can hex values be treated as strings and compared as such, or is there some other way?
I tried googling, but to no avail.
In ASCII, the hex digits sort in ascending order of value: 0-9 come before A-F (and before a-f). This makes comparing these values as Strings trivial (assuming they are left-padded with 0s to the same length and use the same case).
As an example, if you have an Input class with a hex attribute of type String you can write something like this:
rule "Test"
when
    $i1: Input()
    $i2: Input(hex > $i1.hex)
then
    // Do whatever you need here
end
Hope it helps,
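The padding and case assumptions matter. The following plain-Java sketch shows why, under the assumption that the values fit in a long; the class and the helpers normalize, compareAsStrings and compareAsNumbers are hypothetical names for illustration, not part of Drools.

```java
class HexCompareDemo {
    // Strips an optional "0x" prefix and lower-cases the digits.
    static String digits(String hex) {
        String d = hex.toLowerCase();
        return d.startsWith("0x") ? d.substring(2) : d;
    }

    // Left-pads the digits with zeros to the given width.
    static String normalize(String hex, int width) {
        String d = digits(hex);
        StringBuilder sb = new StringBuilder();
        for (int i = d.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(d).toString();
    }

    // String comparison after normalising case and padding.
    static int compareAsStrings(String x, String y, int width) {
        return normalize(x, width).compareTo(normalize(y, width));
    }

    // Numeric comparison; assumes both values fit in a long.
    static int compareAsNumbers(String x, String y) {
        return Long.compare(Long.parseLong(digits(x), 16),
                            Long.parseLong(digits(y), 16));
    }

    public static void main(String[] args) {
        // Same length and case: string order matches numeric order.
        System.out.println(compareAsStrings("0xbadf00d", "0xbadf00e", 8) < 0); // true
        // Unpadded: "ff" sorts after "100" although 0xff < 0x100.
        System.out.println("ff".compareTo("100") > 0);                         // true
        // Padded to the same width, the order is correct again.
        System.out.println(compareAsStrings("ff", "100", 4) < 0);              // true
        System.out.println(compareAsNumbers("0xFF", "0x100") < 0);             // true
    }
}
```

If the inputs cannot be guaranteed to share a width and case, parsing them to numbers before comparing avoids the problem altogether.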

What constitutes a valid symbol (identifier) in R

I can’t find a spec of the language…
Note that I want a correct answer, e.g. like this, as I could easily come up with a simple but likely wrong approximation myself, such as [[:alpha:]._][\w._]*
The documentation for make.names() says
A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.
The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
@Roland points out this section of the R language definition:
10.3.2 Identifiers
Identifiers consist of a sequence of letters, digits, the period (‘.’) and the underscore. They must not start with a digit or an underscore, or with a period followed by a digit.
The definition of a letter depends on the current locale: the precise set of characters allowed is given by the C expression (isalnum(c) || c == '.' || c == '_') and will include accented letters in many Western European locales.
Notice that identifiers starting with a period are not by default listed by the ls function and that ‘...’ and ‘..1’, ‘..2’, etc. are special.
Notice also that objects can have names that are not identifiers. These are generally accessed via get and assign, although they can also be represented by text strings in some limited circumstances when there is no ambiguity (e.g. "x" <- 1). As get and assign are not restricted to names that are identifiers they do not recognise subscripting operators or replacement functions.
The rules seem to allow "Morse coding":
> .__ <- 1
> ._._. <- 2
> .__ + ._._.
[1] 3

Sorting Algorithm in R

I had a question related to the sorting algorithm in R.
If I use order() to sort a particular column, the shorter string is not sorted first.
For example: I had to sort a column of character type, and it puts firearm_weight above fire_weigh, which is not how a dictionary would sort these strings.
How can I change this while using order()?
Thanks!
"_" < "a" is TRUE on my system and locale.
help("Comparison") is relevant here:
Comparison of strings in character vectors is lexicographic within the
strings using the collating sequence of the locale in use: see
locales. The collating sequence of locales such as en_US is normally
different from C (which should use ASCII) and can be surprising.
Beware of making any assumptions about the collation order: [...]
Collation of non-letters (spaces, punctuation signs, hyphens,
fractions and so on) is even more problematic.
You could substitute "_" with something that is ordered after "z" on your system. E.g., a "µ" on my system.

Sorting character string with white-space - Incorrect results in Ubuntu version of R [duplicate]

R sorts character vectors in a sequence which I describe as alphabetic, not ASCII.
For example:
sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" "Cat" "dog" "Dog"
Three questions:
What is the technically correct terminology to describe this sort order?
I cannot find any reference to this in the manuals on CRAN. Where can I find a description of the sorting rules in R?
Is this any different from the sorting behaviour in other languages like C, Java, Perl or PHP?
The Details section of the help for sort() states:
The sort order for character vectors will depend on the collating
sequence of the locale in use: see ‘Comparison’. The sort order
for factors is the order of their levels (which is particularly
appropriate for ordered factors).
and help(Comparison) then shows:
Comparison of strings in character vectors is lexicographic within
the strings using the collating sequence of the locale in use: see
‘locales’. The collating sequence of locales such as ‘en_US’ is
normally different from ‘C’ (which should use ASCII) and can be
surprising. Beware of making _any_ assumptions about the
collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
and collation is not necessarily character-by-character - in
Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’
may or may not be a single sorting unit: if it is it follows ‘g’.
Some platforms may not respect the locale and always sort in
numerical order of the bytes in an 8-bit locale, or in Unicode
point order for a UTF-8 locale (and may not sort in the same order
for the same language in different character sets). Collation of
non-letters (spaces, punctuation signs, hyphens, fractions and so
on) is even more problematic.
so it depends on your locale setting.
Sorting depends on locale.
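My solution is the following.
I create a ~/.Renviron file containing the single line LC_ALL=C. (The leading # below marks command output, as in the R example further down; it is not part of the file, where it would turn the line into a comment.)
cat ~/.Renviron
#LC_ALL=C
Then sorting in R uses the C locale: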
x=c("A", "B", "d", "F", "g", "H")
sort(x)
#[1] "A" "B" "F" "H" "d" "g"
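On the question of how other languages behave, here is a rough Java sketch for comparison. Java's natural String order compares UTF-16 code units, which for ASCII text matches the C-locale result above, while java.text.Collator applies locale-aware rules. The class and method names are hypothetical, and the exact Collator ordering depends on the locale data shipped with the JVM.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class SortOrderDemo {
    // Sorts a copy by Java's natural String order (UTF-16 code units).
    static String[] sortByCodeUnit(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy with a locale-aware Collator.
    static String[] sortByCollator(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] x = {"A", "B", "d", "F", "g", "H"};
        // Code-unit order puts all uppercase ASCII letters first:
        // [A, B, F, H, d, g]
        System.out.println(Arrays.toString(sortByCodeUnit(x)));

        String[] words = {"dog", "Cat", "Dog", "cat"};
        // [Cat, Dog, cat, dog]
        System.out.println(Arrays.toString(sortByCodeUnit(words)));
        // With a US-English Collator, letters are compared before case,
        // so both spellings of "cat" come before both spellings of "dog".
        System.out.println(Arrays.toString(sortByCollator(words, Locale.US)));
    }
}
```

So the split seen in R between locale collation and C-locale byte order has a direct counterpart in Java: plain compareTo behaves like the C locale, and a Collator behaves like a language locale.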

Resources