Turn non-ASCII characters into their Unicode form - R

I'm writing a package function where I want to check if some text contains any of the following characters:
äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ
Problem is that devtools::check() returns a warning:
W  checking R files for non-ASCII characters ... Found the
   following file with non-ASCII characters:
     gb_data_prepare.R
   Portable packages must use only ASCII characters in their R code,
   except perhaps in comments.
   Use \uxxxx escapes for other characters.
So I tried to convert these characters into Unicode escapes, but I don't really know how.
stringi::stri_encode("äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ", to = "Unicode")
Error in stringi::stri_encode(x, to = "Unicode") :
embedded nul in string: '\\xff\\xfe\\xe4'
doesn't work. Same with
iconv("äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ", from = "UTF-8", to = "Unicode")
Error in iconv(x, from = "UTF-8", to = "Unicode") :
unsupported conversion from 'UTF-8' to 'Unicode' in codepage 1252
Any ideas what I can do?
Note: another weird thing is that if I do:
x <- "äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ"
x now returns "äöüßâçèéêîôûaecslnózzžšueráúëïùÄÖÜSSÂÇÈÉÊÎÔÛAECSLNÓZZŽŠUERÁÚËÏÙ", which is wrong (the diacritics are stripped from some characters). So I guess it also has something to do with my general R encoding?
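One function that appears to do exactly this conversion is stringi::stri_escape_unicode(), which turns each non-ASCII character into its \uxxxx escape that can be pasted back into the R source. A minimal sketch, with a shortened example string:
library(stringi)
x <- "äöüß"
stri_escape_unicode(x)
# [1] "\\u00e4\\u00f6\\u00fc\\u00df"
# Pasting the escaped form into an R file yields the same characters again:
"\u00e4\u00f6\u00fc\u00df"
# [1] "äöüß"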

Related

URL / URI encoding in R

I have to query an API with a URL encoded according to RFC 3986, knowing that I have accented characters in my query.
For instance, this argument:
quel écrivain ?
should be encoded like this:
quel%20%C3%A9crivain%20%3F%0D%0A
Unfortunately, when I use URLencode, encoding, url_encode, or curlEscape, I have the resulting encoding:
URLencode("quel écrivain ?")
[1] "quel%20%E9crivain%20?"
The problem is on accented letters: for instance "é" is converted into "%E9" instead of "%C3%A9"...
I've struggled with this URL encoding without finding any solution. Since I don't control the API, I don't know how it handles the encoding.
A weird thing is that using POST instead of GET leads to a response in which words with accents are cut in two:
"1\tquel\tquel\tDET\tDET\tGender=Masc|Number=Sing\t5\tdet\t0\t_\n4\t<U+FFFD>\t<U+FFFD>\tSYM\tSYM\t_\t5\tcompound\t0\t_\n5\tcrivain\tcrivain\
As you can see, "écrivain" is split into "<U+FFFD>" (the Unicode replacement character, standing in where "é" should be) and "crivain".
This encoding problem is driving me mad; if a brilliant mind could help me, I would be very grateful!
Set reserved = TRUE
i.e.
your_string <- "quel écrivain ?"
URLencode(your_string, reserved = TRUE)
# [1] "quel%20%C3%A9crivain%20%3F"
I do not think I am a brilliant mind, but I still have a possible solution for you. After using URLencode(), your accented characters come out as a % followed by the trailing part of their Unicode representation: without reserved = TRUE, URLencode() percent-encodes the single latin-1 byte, and for characters below U+0100 that byte value equals the Unicode code point. To make the characters readable again, you can turn the escapes into "real Unicode" and unescape them with the stringi package. For your single string the solution worked, on my machine at least; I hope it also works for you.
Please note that I have introduced a % character at the end of your string to demonstrate that the gsub command below should work in any case.
You might have to adapt the replacement pattern \\u00 to also cover code points where more than the last two hex digits are non-zero, if that is relevant in your case.
library(stringi)
str <- "quel écrivain ?"
str <- URLencode(str)
#"quel%20%E9crivain%20?"
#replacing % by a single \ backslash to directly get correct unicode representation
#does not work since it is an escape character, therefore "\\"
str <- gsub("%", paste0("\\", "u00"), str , fixed = T)
#[1] "quel\\u0020\\u00E9crivain\\u0020?"
#since we have double escapes, we need the unescape function from stringi
#which recognizes double backslash as single backslash for the conversion
str <- stri_unescape_unicode(str)
#[1] "quel écrivain ?"

problems replacing €-symbol in strings

I want to replace every "€" in a string with "[euro]". Now this works perfectly fine with
file.col.name <- gsub("€","[euro]", file.col.name, fixed = TRUE)
Now I am looping over column names from a csv-file and suddenly I have trouble with the string "total€".
It works for other special characters (#, ?), but the € sign doesn't get recognized.
grep("€",file.column.name)
also returns 0 and if I extract the last letter it prints "€" but
print(lastletter(file.column.name) == "€")
returns FALSE. (lastletter is just a function to extract the last letter of a string.)
Does anyone have an idea why that happens, and maybe how to solve it? I checked the class of "file.column.name" and it returns "character"; I also tried converting it to character again and similar things, but nothing helped.
Thank you!
Your encodings are probably mixed. Check the encodings of the files, then add the appropriate encoding to, e.g., read.csv using fileEncoding="…" as an argument.
If you are working under Unix/Linux, the file utility will tell you the encoding of text files. Otherwise, any editor should show you the encoding of the files.
Common encodings are UTF-8, ISO-8859-15 and windows-1252. Try "UTF-8", "windows-1252" and "latin-9" as values for fileEncoding (the latter being a portable name for ISO-8859-15 according to R's documentation).
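For example, a sketch ("columns.csv" stands in for your file):
# Declare the file's encoding explicitly when reading it
dat <- read.csv("columns.csv", fileEncoding = "UTF-8")
# If the € sign still comes out garbled, try the Windows code page instead:
dat <- read.csv("columns.csv", fileEncoding = "windows-1252")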

What's the difference between hex code (\x) and unicode (\u) chars?

From ?Quotes:
\xnn character with given hex code (1 or 2 hex digits)
\unnnn Unicode character with given code (1--4 hex digits)
In the case where the Unicode character has only one or two digits, I would expect these characters to be the same. In fact, one of the examples on the ?Quotes help page shows:
"\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21"
## [1] "Hello World!"
"\u48\u65\u6c\u6c\u6f\u20\u57\u6f\u72\u6c\u64\u21"
## [1] "Hello World!"
However, under Linux, when trying to print a pound sign, I see
cat("\ua3")
## £
cat("\xa3")
## �
That is, the \x hex code fails to display correctly. (This behaviour persisted with any locale that I tried.) Under Windows 7 both versions show a pound sign.
If I convert to integer and back then the pound sign displays correctly under Linux.
cat(intToUtf8(utf8ToInt("\xa3")))
## £
Incidentally, this doesn't work under Windows, since utf8ToInt("\xa3") returns NA.
Some \x characters return NA under Windows but throw an error under Linux. For example:
utf8ToInt("\xf0")
## Error in utf8ToInt("\xf0") : invalid UTF-8 string
("\uf0" is a valid character.)
These examples show that there are some differences between \x and \u forms of characters, which seem to be OS-specific, but I can't see any logic in how they are defined.
What are the difference between these two character forms?
The escape sequence \xNN inserts the raw byte NN into a string, whereas \uNN inserts the UTF-8 bytes for the Unicode code point NN into a UTF-8 string:
> charToRaw('\xA3')
[1] a3
> charToRaw('\uA3')
[1] c2 a3
These two types of escape sequence cannot be mixed in the same string:
> '\ua3\xa3'
Error: mixing Unicode and octal/hex escapes in a string is not allowed
This is because the escape sequences also define the encoding of the string. A \uNN sequence explicitly sets the encoding of the entire string to "UTF-8", whereas \xNN leaves it in the default "unknown" (aka. native) encoding:
> Encoding('\xa3')
[1] "unknown"
> Encoding('\ua3')
[1] "UTF-8"
This becomes important when printing strings, as they need to be converted into the appropriate output encoding (e.g., that of your console). Strings with a defined encoding can be converted appropriately (see enc2native), but those with an "unknown" encoding are simply output as-is:
On Linux, your console is probably expecting UTF-8 text, and as 0xA3 is not a valid UTF-8 sequence, it gives you "�".
On Windows, your console is probably expecting Windows-1252 text, and as 0xA3 is the correct encoding for "£", that's what you see. (When the string is \uA3, a conversion from UTF-8 to Windows-1252 takes place.)
If the encoding is set explicitly, the appropriate conversion will take place on Linux:
> s <- '\xa3'
> Encoding(s) <- 'latin1'
> cat(s)
£
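To connect the two views (a supplementary illustration, not from the original answer): converting the latin1 byte with iconv() yields exactly the two-byte UTF-8 sequence that charToRaw('\uA3') showed above.
charToRaw(iconv('\xa3', from = 'latin1', to = 'UTF-8'))
# [1] c2 a3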

How to display a message/warning/error with Unicode characters under Windows?

I have a message (or warning or error) containing Unicode characters. (The string has UTF-8 encoding.)
x <- "\u20AC \ub124" # a euro symbol, and Hangul 'ne'
## [1] "€ 네"
Encoding(x)
## [1] "UTF-8"
Under Linux, this prints OK in a message if the locale is UTF-8 (l10n_info()$`UTF-8` returns TRUE).
I can force this, by doing, e.g.,
devtools::with_locale(
  c(LC_CTYPE = "en_US.utf8"),
  message(x)
)
## € 네
Under Windows there are no UTF-8 locales, so I can't find an equivalent way to enforce correct printing. For example, with a US locale, the Hangul character doesn't display properly.
devtools::with_locale(
  c(LC_CTYPE = "English_United States"),
  message(x)
)
## € <U+B124>
There's a related problem with Unicode characters not displaying properly when printing data frames under Windows. The advice there was to set the locale to Chinese/Japanese/Korean. This does not work here.
devtools::with_locale(
  c(LC_CTYPE = "Korean_Korea"),
  message(x)
)
## ¢æ ³× # equivalent to iconv(x, "UTF-8", "EUC-KR")
How can I get UTF-8 messages, warnings and errors to display correctly under Windows?
I noticed that the help for the function Sys.setlocale() in R says this: "LC_MESSAGES" will be "C" on systems that do not support message translation, and is not supported on Windows.
To me this sounds like modifying the character representation of R messages/errors can't be done on any Windows version... (Note, though, that "LC_MESSAGES" governs the translation language of R's own messages, not the encoding your strings are printed in.)
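Worth noting: since R 4.2, Windows builds of R use UTF-8 as the native encoding, so message(x) prints such strings correctly there. You can check a session with l10n_info(), already mentioned above:
l10n_info()$`UTF-8`
# [1] TRUE  # on a build whose native encoding is UTF-8 (e.g. R >= 4.2 on Windows)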

Accentuation in R

I was wondering if there is a short way to recover the accents in a character string in R. For instance, when the accented string is "Université d'Aix-Marseille", my script gets "Universit%C3A9 d%27Aix-Marseille". Is there any function or algorithm to get the former directly?
I should point out that the file I get all my character strings from is encoded in UTF-8.
Sincerely yours.
You can get and set the encoding of a character vector like this:
s <- "Université d'Aix-Marseille"
Encoding(s)
# set encoding to utf-8
Encoding(s) <- "UTF-8"
s
If that fixes it, you could change your default encoding to UTF-8.
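Alternatively, if the string is really percent-encoded, as "Universit%C3A9" suggests (presumably "Universit%C3%A9", the UTF-8 bytes of "é"; the corrected form is assumed here), URLdecode() recovers the accented text on a UTF-8 system:
URLdecode("Universit%C3%A9 d%27Aix-Marseille")
# [1] "Université d'Aix-Marseille"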
