Lots of European authors have Unicode characters in their names, such as Å, Æ, ø and Ä. How can they keep their actual names with these Unicode characters, rather than some transformed version in the English alphabet (Å -> A, Æ -> A, Ä -> A), when creating an R package? In short, how can I use Unicode characters in the author/creator/maintainer name when creating an R package?
There is some note here:
http://r-pkgs.had.co.nz/check.html
I quote:
If you use any non-ASCII characters in the DESCRIPTION, you must also specify an encoding. There are only three encodings that work on all platforms: latin1, latin2 and UTF-8. I strongly recommend UTF-8:
Encoding: UTF-8
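For illustration, a minimal DESCRIPTION sketch (the package name, author name and email are hypothetical) that declares the encoding so a non-ASCII author name survives the build and check:

```
Package: minimalpkg
Title: Example Package
Version: 0.1.0
Authors@R: person("Åse", "Sørensen", role = c("aut", "cre"),
                  email = "ase@example.org")
Encoding: UTF-8
```

With the Encoding field present, R CMD check should accept the non-ASCII characters in DESCRIPTION instead of flagging them.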
Related
I need to resolve the following task:
I have registration forms which do not allow non-Latin characters, so I need to transliterate from non-Latin characters to Latin characters.
Could you please suggest an approach to this task, or a library for transliterating Unicode symbols to Latin?
I need to transliterate Unicode symbols for the following languages:
Arabic, Armenian, Bengali, Tibetan, Myanmar, Khmer, Chinese (simplified), Chinese, Ethiopian, Devanagari, Georgian, Greek, Hebrew, Hiragana, Katakana, Kanji, Thaana, Hangul, Hanja, Tamil,
Sinhala, Thai to the Latin symbols.
Thank you.
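One possible approach (not mentioned in the question itself) is ICU transliteration via the stringi package: stri_trans_general() with the "Any-Latin; Latin-ASCII" transform chain first transliterates any script to Latin, then folds the remaining accented letters down to plain ASCII. A minimal sketch:

```r
# ICU transliteration with stringi: "Any-Latin" transliterates any script
# to Latin; "Latin-ASCII" then folds accented letters to plain ASCII.
library(stringi)

x <- c("Ελληνικά", "Русский", "नमस्ते")
stri_trans_general(x, "Any-Latin; Latin-ASCII")
```

Coverage and quality vary by script — ICU transliteration is a reversible romanization, not a phonetic transcription — so the results should be checked per language.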
I read from a html file in R which contains Chinese characters. But it shows something like
" <td class=\"forumCell\">\xbbָ\xb4</td>"
It is these \x strings that I need to extract. How can I convert them into readable Chinese characters?
By the way, simply copying and pasting the above \x strings somehow does not replicate the problem.
Are you sure they are all Chinese characters? What is the HTML page's encoding? The strings you pasted look to be a mix of hex \xc4\xe3 and Unicode chars \u0237.
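A sketch of the usual fix, assuming the page is GBK/GB2312-encoded (a common encoding for Chinese sites; the actual encoding should be read from the page's charset declaration): declare the bytes' source encoding and convert them to UTF-8 with iconv().

```r
# "\xc4\xe3\xba\xc3" are the GBK bytes for "你好" (U+4F60 U+597D).
raw_txt <- "\xc4\xe3\xba\xc3"

# Declare the source encoding and convert to UTF-8 so R can display it.
iconv(raw_txt, from = "GBK", to = "UTF-8")
```

If the result is NA or still garbled, the source encoding guess was wrong; try the charset named in the page's <meta> tag instead.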
Running R CMD check --as-cran gives
Portable packages must use only ASCII characters in their R code,
except perhaps in comments.
Use \uxxxx escapes for other characters.
What are \uxxxx escapes, and more importantly, how can I convert non-ASCII characters into them?
What I know so far
?iconv is very informative and looks powerful, but I see nothing of the form \u
this python documentation indicates \uxxxx are
Character with 16-bit hex value xxxx (Unicode only)
Question
How can I convert non-ASCII characters into character representations of the form \uxxxx?
Some examples: c("¤", "£", "€", "¢", "¥", "₧", "ƒ")
You have stri_escape_unicode() from stringi to escape Unicode characters:
stringi::stri_escape_unicode(c("¤", "£", "€", "¢", "¥", "₧", "ƒ"))
## [1] "\\u00a4" "\\u00a3" "\\u20ac" "\\u00a2" "\\u00a5" "\\u20a7" "\\u0192"
I have an addin based on this to escape non-ASCII characters between quotes in functions, here: https://github.com/dreamRs/prefixer
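For completeness, a base-R sketch (not from the original answer; the helper name to_u_escape is made up) that builds the same escapes with utf8ToInt() and sprintf():

```r
# Convert every character of each string to its Unicode code point and
# format the code points as \uxxxx escapes (works for BMP characters,
# i.e. code points up to U+FFFF).
to_u_escape <- function(x) {
  vapply(x, function(s) {
    paste(sprintf("\\u%04x", utf8ToInt(s)), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

to_u_escape(c("¤", "£", "€"))
# [1] "\\u00a4" "\\u00a3" "\\u20ac"
```

Unlike stri_escape_unicode(), this escapes every character, including plain ASCII, so you would want to filter to non-ASCII characters first for real use.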
Building on the Splitting String based on letters case answer:
lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
strsplit(lang, "(?!^)(?=[[:upper:]])", perl = T)
results in
"Deutsch" "Esperanto" "Italiano" "Nederlands" "Nedersaksies" "NorskРусский"
The problem is that the last pair is not separated, since the Russian is UTF-8 encoded (there will be more variation in the strings, e.g. more or less all the other languages on Wikipedia). I checked online regex testers and other SO answers, but they are not much help with R. I also tried iconv and Encoding workarounds in base R (I can't seem to convert to UTF-16, and conversion to bytes doesn't help). Thoughts?
Use the Unicode property \p{Lu}, which means an uppercase (u) letter (L) in any alphabet. See http://www.regular-expressions.info/unicode.html
lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
strsplit(lang, "(?!^)(?=\\p{Lu})", perl = TRUE)
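Note the doubled backslash: inside an R string literal, \\p is needed so the regex engine actually sees \p{Lu}. A runnable sketch (assuming a UTF-8 locale):

```r
lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"

# \p{Lu} matches an uppercase letter in any script, so the Cyrillic
# capital "Р" starting "Русский" is also a split point.
strsplit(lang, "(?!^)(?=\\p{Lu})", perl = TRUE)[[1]]
# splits into the seven language names, including "Русский"
```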
I am trying to use the Unicode UTF-16 character set, but I am unsure how to do this. By default, when I use the Unicode character set, it uses UTF-8, which changes foreign Spanish, Arabic, etc. characters into "?". I am currently using Teradata 14.