Why does comparing two strings with `>` not throw an error? - r

Why does this work in R? I would think it would throw an error, because you cannot actually compare whether one string is greater than another.
"Test" > "Test"
[1] FALSE

You can compare strings in R. There is a complete section in the help page (?Comparison) explaining how the comparison is performed.
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode code-point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.
Character strings can be compared with different marked encodings (see Encoding): they are translated to UTF-8 before comparison.
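For contrast with the locale-aware behaviour quoted above, here is a minimal sketch of the simplest case: lexicographic comparison by code point with no collation rules. It is written in Python rather than R, on the assumption that Python's built-in string comparison (which compares code points) is a fair stand-in for what R does under the C locale with ASCII input. The sample words are made up for illustration.

```python
# Python compares str values lexicographically by Unicode code point,
# with no locale-specific collation applied.
same = "Test" > "Test"                  # equal strings: neither is greater
by_first_letter = "apple" < "banana"    # decided at the first character
upper_before_lower = "Zebra" < "apple"  # "Z" (U+005A) precedes "a" (U+0061)

# A string sorts before any longer string that it is a prefix of.
prefix_first = "Test" < "Testing"

print(same, by_first_letter, upper_before_lower, prefix_first)
```

A collating locale such as en_US would typically place "apple" before "Zebra", which is the kind of difference the help page warns about.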

Related

Single (accented) characters with `str.__len__(x) == 2`

To the best of my understanding, str.__len__(x) counts accented characters double in Python 2 because of their byte representation, but once in Python 3, although I couldn't find proper documentation on str.__len__ on python.org.
Python documentation on stdtypes
Python documentation on len
However, if I run the following on Google Colab, the length of the string is reported as 2:
import sys
test = u'ö'
print(type(test), len(test), sys.version)
Where is str.__len__ documented?
There are two ways to represent the symbol "ö" in Unicode. One is as U+00F6 LATIN SMALL LETTER O WITH DIAERESIS. The other is U+006F LATIN SMALL LETTER O followed by U+0308 COMBINING DIAERESIS. If you restrict your source files to ASCII these can be represented as "\u00f6" and "o\u0308" respectively.
In the first case, I get a length of 1. In the second case, I get a length of 2. (Tested with Python 3.7.2). I suspect your code is using the second representation.
This matches the documentation for the string type which notes that "Strings are immutable sequences of Unicode code points" (emphasis mine). A representation that consists of two code points would therefore have a length of 2.
You can use the unicodedata.normalize function to convert between the two forms. Using "NFC" for the form parameter will convert to the composed representation (length 1), using "NFD" will decompose it into a letter and a combining character (length 2).
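A short sketch of the two representations and the normalization round trip described above. The variable names are only for illustration, and the escapes are used so the source file stays ASCII.

```python
import unicodedata

composed = "\u00f6"     # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"  # U+006F followed by U+0308 COMBINING DIAERESIS

# len() counts code points, so the two forms report different lengths.
print(len(composed), len(decomposed))

# NFC composes to a single code point; NFD splits into letter + combining mark.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(len(nfc), len(nfd))
print(nfc == composed, nfd == decomposed)
```

Normalizing to one form before calling len() or comparing strings avoids depending on which representation the input happened to use.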

Remove repeated elements in a string with R

I want to remove repeated elements (each containing two or more characters) from strings. For example, from "aaa" I expect "aaa", from "aaaa" I expect "aa", from "abababcdcd" I expect "abcd", and from "cdababcdcd" I expect "cdabcd".
I tried gsub("(.{2,})\\1+","\\1",str). It works in cases 1-3, but fails in case 4. How can I solve this problem?
SOLUTION
The solution is to rely on the PCRE or ICU regex engines, rather than TRE.
Use either base R gsub with perl=TRUE (which uses the PCRE regex engine) and the "(?s)(.{2,})\\1+" pattern, or stringr::str_replace_all() (which uses the ICU regex engine) with the same pattern:
> x <- "cdababcdcd"
> gsub("(?s)(.{2,})\\1+", "\\1", x, perl=TRUE)
[1] "cdabcd"
> library(stringr)
> str_replace_all(x, "(?s)(.{2,})\\1+", "\\1")
[1] "cdabcd"
The (?s) flag is necessary for . to match any char including line break chars (in TRE regex, . matches all chars by default).
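The same pattern can be tried in another backtracking engine as a cross-check. This is a sketch in Python, assuming its re module treats the backreference the way PCRE does for these inputs; collapse_repeats is a name made up for this example.

```python
import re

# (?s) lets "." match line breaks; (.{2,}) captures a unit of two or more
# characters, and \1+ requires at least one immediate repetition of it.
PATTERN = re.compile(r"(?s)(.{2,})\1+")

def collapse_repeats(text):
    """Replace each run of a repeated unit with a single copy of the unit."""
    return PATTERN.sub(r"\1", text)

for sample in ["aaa", "aaaa", "abababcdcd", "cdababcdcd"]:
    print(sample, "->", collapse_repeats(sample))
```

All four cases from the question come out as expected here, including "cdababcdcd", which is the one that fails with the default TRE engine.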
DETAILS
TRE regex is not good at handling "pathological" cases that are mostly related to backtracking, which directly involves quantifiers (I bolded some parts):
The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M^2 N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.
Predictable matching speed
Because of the matching algorithm used in TRE, the maximum time consumed by any regexec() call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.
When TRE has trouble working through all the possible ways of matching a string, it does not return any match and the string is returned as is. Hence, the gsub call makes no changes.
As easy as:
gsub("(.{2,})\\1+","\\1",str, perl = TRUE)

Precise syntax of CSS3 colors

I couldn't find a precise definition of legal syntax for CSS3 colors, either as regular expression, BNF or whatever strict formal definition there might be. Some info can be derived from the verbal description in the CSS3 Color Module (for example that comma separated lists may contain whitespace), but I don't see whether e.g. leading zeros in something like
rgb(010,005,255)
rgba(050%,1%,01%,0)
are actually legal, or omitting leading zeros of decimal fractions, like
rgba(100,100,100,.5)
I'm not talking about what is tolerated by browsers, I'm asking whether this is officially legal CSS3 syntax as I'm interested in the use of these color definitions in non-browser applications as well.
As you found already, the CSS3 Color Module specification says
The format of an RGB value in the functional notation is 'rgb(' followed by a comma-separated list of three numerical values (either three integer values or three percentage values) followed by ')'
But you then need to look in the basic data types section of CSS 2.1 to find out what an integer or a percentage value is, and it says:
Some value types may have integer values (denoted by <integer>) or real number values (denoted by <number>). Real numbers and integers are specified in decimal notation only. An <integer> consists of one or more digits "0" to "9". A <number> can either be an <integer>, or it can be zero or more digits followed by a dot (.) followed by one or more digits. Both integers and real numbers may be preceded by a "-" or "+" to indicate the sign. -0 is equivalent to 0 and is not a negative number.
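So integers and numbers can have leading zeros. And because a number may be "zero or more digits followed by a dot", a fraction such as .5 is legal without a leading zero.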
Then later on basic data types says
The format of a percentage value (denoted by <percentage> in this specification) is a <number> immediately followed by '%'.
So percentages can have leading zeros too.
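The quoted definitions can be turned into a rough checker for non-browser use. This is only a sketch transcribed from the prose above, not a formal grammar from the specification: the names WS, INTEGER, NUMBER, PERCENTAGE and is_valid_color are made up, it assumes lowercase function names, and it assumes the rgba() alpha value is a plain number.

```python
import re

WS = r"[ \t\r\n\f]*"                        # optional whitespace around values
INTEGER = r"[+-]?[0-9]+"                    # optional sign, one or more digits
NUMBER = r"[+-]?(?:[0-9]*\.[0-9]+|[0-9]+)"  # integer, or digits "." digits
PERCENTAGE = NUMBER + "%"                   # number immediately followed by %

def _list(token, count):
    """Regex for `count` comma-separated tokens with optional whitespace."""
    return ",".join([WS + token + WS] * count)

# Either three integers or three percentages, never a mixture.
_CHANNELS = f"(?:{_list(INTEGER, 3)}|{_list(PERCENTAGE, 3)})"
RGB = re.compile(rf"rgb\({_CHANNELS}\)")
RGBA = re.compile(rf"rgba\({_CHANNELS},{WS}{NUMBER}{WS}\)")

def is_valid_color(text):
    """True if text matches the rgb()/rgba() functional notation sketched here."""
    return bool(RGB.fullmatch(text) or RGBA.fullmatch(text))

for example in ["rgb(010,005,255)", "rgba(050%,1%,01%,0)",
                "rgba(100,100,100,.5)", "rgb(10,5%,255)"]:
    print(example, is_valid_color(example))
```

The three examples from the question are accepted, while a mixed integer and percentage list is rejected. Range clipping of out-of-gamut values is not modelled.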

Optimal integer encoding that still sorts

One of the neat characteristics of UTF-8 is that if you compare two strings (with <) byte-by-byte, you get the same answer as if you had compared them codepoint-by-codepoint. I was wondering if there was a similar encoding that was optimal in size (e.g. UTF-8 "wastes" space by tagging bytes with 10xxxxxx if they are not the first byte representing a codepoint).
The assumption for optimality here is that a non-negative number n is more frequent than a number m if n < m.
I am most interested in knowing if there is a (byte-comparable) encoding that works for integers, with n more frequent than m if |n| < |m|.
Have you considered a variant of Huffman coding? Traditionally one recursively merges the two least frequent symbols, but to preserve order one could instead merge the two adjacent symbols having the least sum.
Looks like this problem has been well-studied (and the greedy algorithm is not optimal). The optimal algorithm was given by Hu and Tucker; it is described here and in more detail in this thesis.
This paper discussing order-preserving dictionary-based compression also looks interesting.
There are very few standard encodings and the answer is no. Any further optimization beyond UTF-8 should not be referred to as "encoding" but as "compression", and lexicographically comparable compression is a different department.
If you are solving a real-world (non-purely-academic) problem, I'd just stick with the most standard option, UTF-8. You can learn about its efficiency compared to other standard encodings on utf8everywhere.org.
To fully answer that question you need to know the frequency of the codepoints in the material.
UTF-8 is optimal for texts in English as multi-byte characters are very rare in typical English text.
To encode integers using UTF-8 as a base algorithm would entail mapping the first n integers to a 1-byte encoding, the next m to a 2-byte encoding and so on.
Whether that is an optimal encoding depends on the distribution. If the first n numbers are very frequent compared to higher numbers, then UTF-8 would be (near) optimal.
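The idea of using UTF-8 as the base algorithm can be tried directly by treating each integer as a code point. This is a sketch under that assumption: encode_int is a hypothetical helper, and it only covers integers that are valid Unicode scalar values (0 to 0x10FFFF, excluding the surrogate range).

```python
def encode_int(n):
    """Encode n as the UTF-8 bytes of code point n (hypothetical helper)."""
    if not 0 <= n <= 0x10FFFF or 0xD800 <= n <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    return chr(n).encode("utf-8")

# Values on either side of each length boundary, in ascending order.
samples = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
encoded = [encode_int(n) for n in samples]

# Python compares bytes objects byte by byte, so this checks that the
# byte-wise order of the encodings equals the numeric order of the integers.
print(sorted(encoded) == encoded)

# Smaller integers get shorter encodings: 1, 2, 3 or 4 bytes.
print([len(b) for b in encoded])
```

Only the first 128 integers fit in one byte, so how close this comes to optimal depends on how strongly the distribution favours small values.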

Should I use UTF-8, UTF-16 or UTF-32 for my multilingual CMS?

Besides the difference in how characters are stored, are there any special characters in any language utf-32 can display and utf-8 cannot?
All UTF encodings can represent the same range of code points (0 to 0x10FFFF). So, the same characters can be encoded by any of them.
Whether they can be "displayed" is an entirely different question. That's nothing to do with the encoding, and a function of the font family used. I am not sure that any font has glyphs for every single Unicode code point. But I assume you meant "represented".
They do vary in how many bytes they'll need to represent a given string. UTF-8 is almost always the shortest for non-Asian languages. For Asian languages, UTF-16 might win (I haven't actually benchmarked this). I can't imagine a realistic case where UTF-32 would be optimal.
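A quick way to check the size claim on your own content is to encode a sample and count bytes. This is a sketch with made-up sample strings; byte_sizes is a hypothetical helper, and the "-le" codec variants are used so that no byte order mark is added to the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def byte_sizes(text):
    """Return the encoded length in bytes of text for each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}

samples = {
    "English": "hello",
    "Japanese": "日本語",
    "outside the BMP": "\U0001F600",
}

for label, text in samples.items():
    print(label, byte_sizes(text))
```

For the ASCII sample UTF-8 is the smallest, for the Japanese sample UTF-16 is smaller, and for the character outside the Basic Multilingual Plane all three need four bytes.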
Is there any character one of them can't represent?
In theory: No.
All of those formats can represent all Unicode code points.
In practice: Depends.
The Windows API uses UCS-2 (which is pretty much the first UTF-16 chunk) and doesn't always handle surrogates correctly. So you might want to use UTF-16 to have your program act as "normal" as possible compared to other programs, instead of truncating high-ranging UTF-32 code points manually.
Anything else?
Yes: Use UTF-8!
It's endian-less, so it avoids byte-order issues, which are a pain in the rear.
Of course, if you're on Windows then you need to convert to UTF-16 before using them.
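The byte-order point can be seen by encoding a single character. This is a minimal sketch using Python's standard codecs; the sample character is arbitrary.

```python
text = "A"

little = text.encode("utf-16-le")  # low byte first
big = text.encode("utf-16-be")     # high byte first
utf8 = text.encode("utf-8")        # a single byte, so no order to choose

print(little, big, utf8)

# Decoding with the wrong byte order silently yields a different character.
misread = little.decode("utf-16-be")
print(hex(ord(misread)))
```

UTF-8 defines one byte sequence per code point, so this class of mismatch cannot occur with it.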
UTF-8, UTF-16 and UTF-32 can all represent every Unicode code point. So no, there are no special characters that can be represented in UTF-32 and not in UTF-8.
1) UTF-8 is backward compatible with ASCII for regular English characters, which is an advantage when your clients use only English characters.
2) UTF-8 saves network bandwidth if you have more ASCII characters than non-English characters.
3) UTF-16 would be good for saving storage space if you have more non-English characters.
I suggest using UTF-8 based on #1 above.

Resources