Split String by Upper Case Consisting of Both Latin and Unicode - r

Building on the answer to Splitting String based on letters case:
lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
strsplit(lang, "(?!^)(?=[[:upper:]])", perl = T)
results in
"Deutsch" "Esperanto" "Italiano" "Nederlands" "Nedersaksies" "NorskРусский"
The problem is that the last pair is not separated, since the Russian is stored as UTF-8 (and there will be more variation in the strings, e.g. more or less all the other languages on Wikipedia). I checked online regex testers and other SO answers, but they are not much help with R. I also tried iconv and Encoding workarounds in base R (I can't seem to convert to UTF-16, and converting to bytes doesn't help). Thoughts?

Use the Unicode property \p{Lu}, which matches an uppercase (u) letter (L) in any alphabet; see http://www.regular-expressions.info/unicode.html. Note that the backslash must be doubled inside an R string literal:
lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
strsplit(lang, "(?!^)(?=\\p{Lu})", perl = TRUE)
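On a UTF-8 locale this now splits before the Cyrillic block as well, giving
"Deutsch" "Esperanto" "Italiano" "Nederlands" "Nedersaksies" "Norsk" "Русский"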

Related

How to generate all possible unicode characters?

If we type letters we get all lowercase letters of the English alphabet. However, there are many more possible characters, like ä, é and so on, and there are symbols like $ or (, too. I found this table of unicode characters, which is exactly what I need. Of course I do not want to copy and paste hundreds of unicode characters into one vector.
What I've tried so far: The table gives the decimals for (some of) the unicode characters. For example, see the following small table:
Glyph Decimal Unicode Usage in R
! 33 U+0021 "\U0021"
So typing "\U0021" gives us a !. Further, paste0("U", format(as.hexmode(33), width = 4, flag = "0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid that even if I figure out how to transform numbers into characters using as.hexmode(), there is still the problem that decimals are not listed for all unicode characters (see the table; the decimals end at 591).
Any idea how to generate a vector with all the unicode characters listed in the table linked?
(The question started with a real world problem but now I am mostly simply eager to know how to do this.)
There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of unicode scripts and the block ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can convert the ranges into their sequences.
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all characters we can unlist it. This won't be easy to work with so really it would be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So that seems to be all of them, and the page needs updating. They are stored as integers, so to print them (assuming the glyph is supported) we can do, for example, the Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"

How to remove "\" from paste function output with quotation marks?

I'm working with the following code:
Y_Columns <- c("Y.1.1")
paste('{"ImportId":"', Y_Columns, '"}', sep = "")
The paste function produces the following output:
"{\"ImportId\":\"Y.1.1\"}"
How do I get the paste function to omit the \, so that the output is:
"{"ImportId":"Y.1.1"}"
Thank you for your help.
Note: I did search SO to see if there were any questions asking "what is an escape character in R", but I didn't review all 160 answers, only the first 20.
This is one way of demonstrating what I wrote in my comment:
out <- paste('{"ImportId":"', Y_Columns, '"}', sep = "")
out
#[1] "{\"ImportId\":\"Y.1.1\"}"
?print
print(out,quote=FALSE)
#[1] {"ImportId":"Y.1.1"}
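cat() is another way to see the bare string; it writes the characters with no quoting or escaping at all (for display only, since it returns nothing useful):
cat(out)
#{"ImportId":"Y.1.1"}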
Both R and regex patterns use escape characters to allow special characters to appear in input and printed output (and sometimes regex patterns need doubled escapes). R has a few characters that need to be "escaped" in certain situations. You illustrated one such situation: including a double-quote character inside a result that will be printed with surrounding double-quotes. If you intended to include any single quotes inside a character value that was delimited by single quotes at the time of creation, they would need to be escaped as well.
out2 <- '\'quoted\''
nchar(out2)
#[1] 8 ... note that neither the surrounding single-quote delimiters nor the backslashes get counted
> out2
[1] "'quoted'" ... and the default output quote-char is a double-quote.
Here's a good Q&A to review: How to replace '+' using gsub() function in R
It has two answers, both useful: one shows how to double-escape a special character and the other shows how to use the fixed argument to get around that requirement.
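A minimal sketch of that contrast, using '+' as the special character:
gsub("\\+", "-", "a+b")              # double-escape the regex metacharacter...
gsub("+", "-", "a+b", fixed = TRUE)  # ...or match it literally with fixed = TRUE
#[1] "a-b" either way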
And another potentially useful Q&A on the topic of handling Windows paths:
File path issues in R using Windows ("Hex digits in character string" error)
And some further useful reading suggestions: look at the series of help pages whose names start with capital letters. Since I can never remember which one has which nugget of essential information, I tried ?Syntax first; it has a "See Also" list of essential reading: Arithmetic, Comparison, Control, Extract, Logic, NumericConstants, Paren, Quotes, Reserved. That reminded me that what I wanted to refer you to was most likely ?Quotes, where all the R-specific escape sequence letters are listed.

URL / URI encoding in R

I have to call an API with the URL encoded according to RFC 3986, knowing that I have accented characters in my query.
For instance, this argument:
quel écrivain ?
should be encoded like this:
quel%20%C3%A9crivain%20%3F%0D%0A
Unfortunately, when I use URLencode, encoding, url_encode, or curlEscape, I get the following encoding:
URLencode("quel écrivain ?")
[1] "quel%20%E9crivain%20?"
The problem is with the accented letters: for instance, "é" is converted into "%E9" instead of "%C3%A9"...
I am struggling with this URL encoding without finding a solution... As I have no control over the API, I don't know how it handles the encoding.
A weird thing is that using POST instead of GET leads to a response in which words with accents are cut into two pieces:
"1\tquel\tquel\tDET\tDET\tGender=Masc|Number=Sing\t5\tdet\t0\t_\n4\t<U+FFFD>\t<U+FFFD>\tSYM\tSYM\t_\t5\tcompound\t0\t_\n5\tcrivain\tcrivain\
As you can see, "écrivain" is split into "<U+FFFD>" (the Unicode replacement character, standing in for the "é" that could not be decoded) and "crivain".
I am going mad over this encoding problem; if a brilliant mind could help me I would be very grateful!
Set reserved = TRUE
i.e.
your_string <- "quel écrivain ?"
URLencode(your_string, reserved = TRUE)
# [1] "quel%20%C3%A9crivain%20%3F"
I do not think I am a brilliant mind, but I still have a possible solution for you. After using URLencode(), it seems that your accented characters are converted into the trailing part of their Unicode representation, preceded by a %. To make them readable again, you might turn them into "real Unicode" and use the stringi package to unescape them. For your single string the solution worked, at least on my machine. I hope it also works for you.
Please note that I introduced a % character at the end of your string to demonstrate that the gsub command below should work in any case.
You might have to adapt the replacement pattern \\u00 if code points with more than two significant hex digits (i.e., anything above U+00FF) are relevant in your case.
library(stringi)
str <- "quel écrivain ?"
str <- URLencode(str)
#"quel%20%E9crivain%20?"
#replacing % by a single \ backslash to directly get correct unicode representation
#does not work since it is an escape character, therefore "\\"
str <- gsub("%", "\\u00", str, fixed = TRUE)
#[1] "quel\\u0020\\u00E9crivain\\u0020?"
#since we have double escapes, we need the unescape function from stringi
#which recognizes double backslash as single backslash for the conversion
str <- stri_unescape_unicode(str)
#[1] "quel écrivain ?"

Converting Unicode characters from a string column in R

Imported a bunch of CSVs, and one of the columns has what I think are Unicode chars.
something like:
PEÃ<U+0083>â<U+0080><U+0098>A
SOPEÃ<U+0083>â<U+0080><U+0098>A
Not in all rows, just some. I've tried to convert them to "human readable" chars, but to no avail.
Tested this solution from SO, but no success so far: unicode characters conversion in R
And this brute substitution didn't work either:
gsub('Ã<U+0083>â<U+0080><U+0098>', 'Ñ', 'PEÃ<U+0083>â<U+0080><U+0098>A')
[1] "Ã<U+0083>â<U+0080><U+0098>"

Using grep() with Unicode characters in R

(strap in!)
Hi, I'm running into issues involving Unicode encoding in R.
Basically, I'm importing data sets that contain Unicode (UTF-8) characters, and then running grep() searches to match values. For example, say I have:
bigData <- c("foo","αβγ","bar","αβγγ (abgg)", ...)
smallData <- c("αβγ","foo", ...)
What I'm trying to do is take the entries in smallData and match them to entries in bigData. (The actual sets are matrices with columns of values, so what I'm trying to do is find the indexes of the matches, so I can tell which row to add the values to.) I've been using
matches <- grepl(smallData[i], bigData, fixed=T)
which usually results in a vector of matches. For i = 2, it would flag element 1, since "foo" is element 1 of bigData. This is peachy and all is well. But RStudio seems to not be dealing with Unicode characters properly: when I import the sets and view them, they show the character IDs.
dataset <- read_csv("[file].csv", col_names = FALSE, locale = locale())
Using View(dataset) shows "aß<U+03B3>" instead of "αβγ." The same goes for
dataset[1]
A tibble: 1x1 <chr>
[1] aß<U+03B3>
print(dataset[1])
A tibble: 1x1 <chr>
[1] aß<U+03B3>
However, and this is why I'm stuck rather than just adjusting the encoding:
paste(dataset[1])
[1] "αβγ"
Encoding(toString(dataset[1]))
[1] "UTF-8"
So it appears that R recognizes in certain contexts that it should display Unicode characters, while in others it just sticks to ASCII, or at least some more limited set? I'm not entirely sure.
In any case, regardless of how it displays, what I want to do is be able to get
grep("αβγ", bigData)
[1] 2 4
However, none of the following work:
grep("αβ", bigData) #(Searching the two letters that do appear to convert)
grep("<U+03B3>",bigData,fixed=T) #(Searching the code ID itself)
grep("αβ", toString(bigData)) #(converts the whole thing to one string)
grep("\\β", bigData) #(only mentioning because it matches, bizarrely, to ß)
The only solution I've found is:
grep("\u03B3", bigData)
[1] 2 4
Which is not ideal for a couple of reasons, most jarringly that it doesn't look like it's possible to just take every <U+####> and replace it with \u####, since not every Unicode character is converted to the <U+####> format, yet none of them can be searched. (I.e., α and ß didn't turn into their Unicode keys, but they're also not searchable by themselves; so I'd have to turn them into their keys, then alter the keys into a form grep() can use, then search.)
That means I can't just regex the keys into a searchable format. And even if I could, I have a lot of entries containing characters that would need to be escaped (e.g., parentheses), so having to drop the fixed = T argument would be its own headache of nested escapes.
Anyway... I realize that a significant part of the problem is that my set apparently involves every sort of character under the sun, and it seems I have thoroughly entrapped myself in a net of regular expressions.
Is there any way of forcing a search with (arbitrary) Unicode characters? Or do I have to find a way of using regular expressions to escape every ( and α in my data set? (Related to that second question: is there a method to convert a Unicode character to its key? I can't seem to find anything that does that specific job.)
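This question, too, is left open above, but two hedged pointers address its specific asks: declaring both vectors as UTF-8 often makes plain fixed matching behave, and utf8ToInt() is the missing character-to-key converter:
# a sketch: make sure both sides are declared UTF-8 before matching
bigData   <- enc2utf8(bigData)
smallData <- enc2utf8(smallData)
grepl(smallData[1], bigData, fixed = TRUE)
# the reverse of typing "\u03B3": turn a character into its key
sprintf("\\u%04X", utf8ToInt("γ"))
#[1] "\\u03B3"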
