How to generate all possible unicode characters?

If we type letters in R we get all lowercase letters of the English alphabet. However, there are many more possible characters, like ä, é and so on, and there are symbols like $ or (, too. I found this table of Unicode characters, which is exactly what I need. Of course I do not want to copy and paste hundreds of Unicode characters into one vector.
What I've tried so far: The table gives the decimals for (some of) the unicode characters. For example, see the following small table:
Glyph  Decimal  Unicode  Usage in R
!      33       U+0021   "\U0021"
So if we type "\U0021" we get a !. Further, paste0("U", format(as.hexmode(33), width = 4, flag = "0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid that even if I figure out how to transform numbers to characters using as.hexmode(), there is still the problem that the table does not give decimals for all Unicode characters (its Decimal column ends at 591).
Any idea how to generate a vector with all the unicode characters listed in the table linked?
(The question started with a real world problem but now I am mostly simply eager to know how to do this.)
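As background on the error above: \U escapes are resolved when R parses a string literal, so they cannot be assembled at run time with paste0(). intToUtf8() converts integer code points to characters at run time instead; a minimal sketch:
# \U escapes are handled by the parser, not at run time, which is why
# paste0("\U", ...) fails; intToUtf8() maps code points to characters directly.
intToUtf8(33)                      # "!"
intToUtf8(33:126, multiple = TRUE) # each printable ASCII character separately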

There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of unicode scripts and the block ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can convert the ranges into their sequences.
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all characters we can unlist it, although this won't be as easy to work with, so it may really be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So that seems to be all of them, and the Wikipedia page needs updating. The code points are stored as integers, so to print them (assuming the glyph is supported) we can use intToUtf8(). For example, printing the Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"

Related

stringi::stri_unescape_unicode() is not able to render Unicode characters in some ranges

The context
In the context of R, I'm aware that stringi::stri_unescape_unicode() could be used for converting a Unicode code to its corresponding character.
For example, the Unicode code points for á (LATIN SMALL LETTER A WITH ACUTE) and 好 are U+00E1 and U+597D, respectively. This means that I can insert those characters by executing the following.
library(stringi)
stringi::stri_unescape_unicode("\\u00E1")
stringi::stri_unescape_unicode("\\u597D")
[1] "á"
[1] "好"
I'm also aware that characters in the following ranges are for private use. The following quote was retrieved from this glossary (archive) in https://unicode.org.
Private-Use Code Point. Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. (See definition D49 in Section 3.5, Properties.) These code points are designated in the Unicode Standard for private use.
As you can read in the quote, there are three ranges. The following lists those characters that are the limits of those ranges.
First range:  (U+E000)
First range:  (U+F8FF)
Second range: 󰀀 (U+F0000)
Second range: 󿿽 (U+FFFFD)
Third range: 􀀀 (U+100000)
Third range: 􏿽 (U+10FFFD)
The problem
When I try to print the characters in the list above that belong to the first range (i.e.  (U+E000) and  (U+F8FF)), there's no problem.
stringi::stri_unescape_unicode("\\ue000")
stringi::stri_unescape_unicode("\\uf8ff")
[1] ""
[1] ""
However, when I try to print the characters shown in the list above that belong to the second range (i.e. 󰀀 (U+F0000) and 󿿽 (U+FFFFD)), R doesn't return those characters.
stringi::stri_unescape_unicode("\\uf0000")
stringi::stri_unescape_unicode("\\uffffd")
[1] "0"
[1] "\uffffd"
Similarly, the following doesn't print the characters shown in the list above that belong to the third range (i.e. 􀀀 (U+100000) and 􏿽 (U+10FFFD)).
stringi::stri_unescape_unicode("\\u100000")
stringi::stri_unescape_unicode("\\u10fffd")
[1] "က00"
[1] "ჿfd"
The question
Why isn't stringi::stri_unescape_unicode() able to display characters that belong to the ranges U+F0000..U+FFFFD or U+100000..U+10FFFD?
Is there any function in R that is able to return those characters?
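A likely explanation, for what it's worth: the \u escape consumes at most four hex digits, so "\\uf0000" is read as \uf000 followed by a literal "0". The eight-digit \U form (or intToUtf8()) does reach the supplementary planes; a sketch:
# Zero-pad to eight digits and use uppercase U for planes above the BMP.
stringi::stri_unescape_unicode("\\U000F0000")
stringi::stri_unescape_unicode("\\U0010FFFD")
intToUtf8(0xF0000)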

Insert specific unicode symbols inside values of data.frame variable

In my data.frame I would like to add two variables, "A" and "B", whose values contain respectively an n with the i subscript and an n with the s subscript.
As I have understood so far, it's not possible to specify an expression for the values of a variable, and hence to add special characters it's necessary to use Unicode escape sequences. Some of these work in R, for example the Greek letter mu, identified with the escape \U00B5, or numeric subscripts, as you can see by running this reprex in your R console:
x <- data.frame("A" = c("\U00B5"),
                "B" = c("B\U2082"))
print(x)
These escapes also work if I put the variable in a ggplot() object, because the correct symbol (mu, for example) is displayed in the axis text or the facets.
The problem is that when I do the same for the subscripts of i (unicode: \U1D62) and s (unicode: \u209B), R doesn't recognise the unicode and prints the whole string inside the variable name.
Do you know how I can resolve this issue and if this unicode works on every operating system?
Thanks
Is there a reason you can't use the expression() function? It seems this would solve your problem (at least concerning Greek letters).
Here is the site I used to learn how to input Greek letters into my R/ggplot legends:
https://stats.idre.ucla.edu/r/codefragments/greek_letters/
Although it is not exactly the answer you are looking for, I still hope it helps!
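For instance, a minimal sketch (assuming ggplot2 and the n-with-subscript labels from the question), using expression() for the axis titles:
library(ggplot2)
df <- data.frame(x = 1:3, y = 1:3)
# expression() renders true subscripts, so no Unicode subscript glyphs are needed
ggplot(df, aes(x, y)) +
  geom_point() +
  labs(x = expression(n[i]), y = expression(n[s]))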
If you are on Windows 10 with the April 2018 Update (or later):
Use Windows key + '.' (i.e. hold the Windows key and press the period key) in your text editor. This brings up the Microsoft emoji keyboard.
Select the Greek letters you want for your script.
The R console will not accept the Greek letters as variables directly, only from the editor script. Some of the Greek letters don't translate to English (like "µ" or "ß"). You can copy and paste them from ls() output to access them. You may be able to use some math symbols as variable names as well. I can't, however, get this to work with source(). That must be a text encoding problem.

RemoveWords command not removing some weird words

The point is that I'm trying to remove some weird words (like <U+0001F399><U+FE0F>) from my text corpus to do some Twitter analysis.
There are many words like that which I just can't remove by using tm_map(X, removeWords).
I have plenty of tweets aggregated in a dataset. Then I use the following code:
corpus_tweets <- tm_map(corpus_tweets, removeWords, c("<U+0001F339>", "<U+0001F4CD>"))
If I try changing those weird words to regular ones (like "life" or "animal") that also appear in my dataset, the regular ones get removed easily.
Any idea of how to solve this?
As these are Unicode characters, you need to figure out how to properly enter them in R.
The escape code syntax for Unicode in R probably is not <U+xxxx>, but rather something like \Uxxxx. See the manual for details. (I don't use R; I am too annoyed by its inconsistencies. This is even an example of such an inconsistency, where apparently the string is printed differently from what R would accept as input.)
corpus_tweets <- tm_map(corpus_tweets, removeWords, c("\U0001F339", "\U0001F4CD", "\uFE0F", "\uFE0E"))
NOTE: you use a backslash and lowercase u plus 4 hex digits to specify a character from Unicode plane 0; you must use a backslash and uppercase U plus 8 hex digits for the other planes (which typically hold the emoji, given you are working with tweets).
BTW, see "Some emojis (e.g. ☁) have two unicode, u'\u2601' and u'\u2601\ufe0f'. What does u'\ufe0f' mean? Is it the same if I delete it?" for why you are getting the FE0F in there: it appears when the user wants to choose a variation of an emoji, e.g. to add colour. FE0E is its partner (saying you want the plain-text glyph).
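If listing each emoji gets tedious, a hedged alternative sketch (assuming a tm corpus named corpus_tweets, as above) is to strip every non-ASCII code point in one pass:
library(tm)
# Drop all non-ASCII characters (emoji, variation selectors, etc.) at once
corpus_tweets <- tm_map(corpus_tweets,
                        content_transformer(function(x) gsub("[^\x01-\x7F]", "", x)))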

Using grep() with Unicode characters in R

(strap in!)
Hi, I'm running into issues involving Unicode encoding in R.
Basically, I'm importing data sets that contain Unicode (UTF-8) characters, and then running grep() searches to match values. For example, say I have:
bigData <- c("foo","αβγ","bar","αβγγ (abgg)", ...)
smallData <- c("αβγ","foo", ...)
What I'm trying to do is take the entries in smallData and match them to entries in bigData. (The actual sets are matrixes with columns of values, so what I'm trying to do is find the indexes of the matches, so I can tell what row to add the values to.) I've been using
matches <- grepl(smallData[i], bigData, fixed=T)
which usually results in a vector of matches. For i = 2, it would match element 1, since "foo" is element 1 of bigData. This is peachy and all is well. But RStudio seems not to be dealing with Unicode characters properly. When I import the sets and view them, they use the character IDs.
dataset <- read_csv("[file].csv", col_names = FALSE, locale = locale())
Using View(dataset) shows "aß<U+03B3>" instead of "αβγ." The same goes for
dataset[1]
A tibble: 1x1 <chr>
[1] aß<U+03B3>
print(dataset[1])
A tibble: 1x1 <chr>
[1] aß<U+03B3>
However, and this is why I'm stuck rather than just adjusting the encoding:
paste(dataset[1])
[1] "αβγ"
Encoding(toString(dataset[1]))
[1] "UTF-8"
So it appears that R is recognizing in certain contexts that it should display Unicode characters, while in others it just sticks to--ASCII? I'm not entirely sure, but certainly a more limited set.
In any case, regardless of how it displays, what I want to do is be able to get
grep("αβγ", bigData)
[1] 2 4
However, none of the following work:
grep("αβ", bigData) #(Searching the two letters that do appear to convert)
grep("<U+03B3>",bigData,fixed=T) #(Searching the code ID itself)
grep("αβ", toString(bigData)) #(converts the whole thing to one string)
grep("\\β", bigData) #(only mentioning because it matches, bizarrely, to ß)
The only solution I've found is:
grep("\u03B3", bigData)
[1] 2 4
Which is not ideal for a couple of reasons, most jarringly that it doesn't look like it's possible to just take every <U+####> and replace it with \u####, since not every Unicode character is converted to the <U+####> format, yet none of them can be searched. (That is, α and ß didn't turn into their Unicode keys, but they're also not searchable by themselves. So I'd have to turn them into their keys, then alter the keys to a form that grep() can use, then search.)
That means I can't just regex the keys into a searchable format--and even if I could, I have a lot of entries including characters that'd need to be escaped (e.g., () or ), so having to remove the fixed=T term would be its own headache involving nested escapes.
Anyway...I realize that a significant part of the problem is that my set apparently involves every sort of character under the sun, and it seems I have thoroughly entrapped myself in a net of regular expressions.
Is there any way of forcing a search with (arbitrary) unicode characters? Or do I have to find a way of using regular expressions to escape every ( and α in my data set? (coordinate to that second question: is there a method to convert a unicode character to its key? I can't seem to find anything that does that specific function.)
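A hypothetical helper along the lines sketched above (unkey() is an invented name, not from the post) would convert <U+xxxx> keys back into real characters before searching:
unkey <- function(x) {
  # find each "<U+xxxx>" key and swap in the actual character
  m <- gregexpr("<U\\+[0-9A-Fa-f]+>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(keys) {
    vapply(keys, function(k) intToUtf8(strtoi(substr(k, 4, nchar(k) - 1), 16L)),
           character(1))
  })
  x
}
unkey("ab<U+03B3>")  # "abγ"
# then, e.g.: grep(unkey("<U+03B3>"), bigData, fixed = TRUE)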

Finding number of occurrences of a word in a file using R functions

I am using the following code for finding number of occurrences of a word memory in a file and I am getting the wrong result. Can you please help me to know what I am missing?
NOTE 1: The question is looking for exact occurrences of the word "memory"!
NOTE 2: What I have realized is that they are looking for exactly "memory"; even something like "memory," is not accepted! That was the part which brought up the confusion, I guess. I tried it for the word "action" and the correct answer is 7! You can try it as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
length(grep("memory", names))
[1] 9
Here's the file
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up until the next apostrophe into a single entry of your names array. One of these long entries happens to include two instances of the word "memory" and so reduces the total number of matches by one.
You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep: it does not behave in exactly the same way as the usual GNU/Linux program. In particular, the way you have used it here WILL find the number of matching words, not just the total number of matching lines as some people have suggested.
As pointed out by @andrew, my previous answer would give wrong results if a word repeated on the same line. Based on other answers/comments, this one seems OK:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10
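A hedged refinement, given the note above that "memory," must not count: anchor the pattern so only elements that are exactly the word match (this assumes the names vector scanned with quote=NULL, so punctuation stays attached to its word):
# Count only elements equal to the bare token, excluding "memory," etc.
length(grep("^memory$", names, ignore.case = TRUE))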
