Insert specific unicode symbols inside values of data.frame variable - r

In my data.frame I would like to add two variables, "A" and "B", whose values contain respectively an n with the i subscript and an n with the s subscript.
As I have understood so far, it's not possible to specify an expression for the values of a variable, and hence to add special characters it's necessary to use unicode symbols. Some of this unicodes work in R, as for example the greek letter "mu", identified with the unicode \U00B5, or numeric subscripts, as you can see in this reprex in your R console:
x <- data.frame("A" = c("\U00B5"),
"B" = c("B\U2082"))
print(x)
These unicodes work also if I decide to put this variable in a ggplot() object, because I will display the correct symbol ("mu" for example) on the axis text or the facets.
The problem is that when I do the same for the subscripts of i (unicode: \U1D62) and s (unicode: \u209B), R doesn't recognise the unicode and prints the whole string inside the variable name.
Do you know how I can resolve this issue and if this unicode works on every operating system?
Thanks

Is there a reason you can't use the expression() function? It seems this would solve your problem (at least concerning greek letters).
Here is the site i used to learn how to input greek letters into my R/ggplot-legends.
https://stats.idre.ucla.edu/r/codefragments/greek_letters/
Altough it is not exactly the answer you look for, i still hope it helps!

If you are on Windows 10 recently updated with as of April 2018 Update:
Use the Windows key + '.' (e.g. hold together Windows Key plus period) in your text editor. This brings up Microsoft Emoji keyboard.
Select the Greek letters variable for your script.
The R Console will not accept the Greek letters as variables directly but only the from the editor script. Some of the Greek letters don't translate to English (like "µ" or "ß".) You can paste and copy them from ls() output to access. You may be able to use some math symbols as well for variable names. I can't however, get this to work with source(). That must be a text encoding problem.

Related

How to generate all possible unicode characters?

If we type in letters we get all lowercase letters from english alphabet. However, there are many more possible characters like ä, é and so on. And there are symbols like $ or (, too. I found this table of unicode characters which is exactly what I need. Of course I do not want to copy and paste hundreds of possible unicode characters in one vector.
What I've tried so far: The table gives the decimals for (some of) the unicode characters. For example, see the following small table:
Glyph Decimal Unicode Usage in R
! 33 U+0021 "\U0021"
So if type "\U0021" we get a !. Further, paste0("U", format(as.hexmode(33), width= 4, flag="0")) returns "U0021" which is quite close to what I need but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid even if I figure out how to transform numbers to characters usings as.hexmode() there is still the problem that there are not Decimals for all unicode characters (see table, Decimals end with 591).
Any idea how to generate a vector with all the unicode characters listed in the table linked?
(The question started with a real world problem but now I am mostly simply eager to know how to do this.)
There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of unicode scripts and the block ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can convert the ranges into their sequences.
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all characters we can unlist it. This won't be easy to work with so really it would be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So seems to be all of them and the page needs updating. They are stored as integers so to print them (assuming the glyph is supported) we can do, for example, printing Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"

RemoveWords command not removing some weird words

The point is that im trying to remove some weird words (like <U+0001F399><U+FE0F>) from my text corpus to do some twitter analysis.
There are many words like that that i just can't remove by using <- tm_map(X, removeWords).
i have plenty of tweets agregated in a dataset. Then i use the following code:
corpus_tweets <- tm_map (corpus_tweets, removeWords, c("<U+0001F339>", "<U+0001F4CD>"))
if i try changing those weird words for regular ones (like "life" or "animal") that also appear on my dataset the regular ones get removed easily.
Any idea of how to solve this?
As these are Unicode characters, you need to figure out how to properly enter them in R.
The escape code syntax for Unicode in R probably is not <U+xxxx>, but rather something like \Uxxxx. See the manual for details (I don't use R - I am too annoyed by its inconsistencies. This is even an example for such an inconsistency, where apparently the string is printed differently than what R would accept as input.)
corpus_tweets <- tm_map (corpus_tweets, removeWords, c("\U0001F339", "\U0001F4CD","\uFE0F","\uFE0E"))
NOTE: You use a slash and lowercase u then 4 hex digits to specify a character from Unicode plane 0; you must use uppercase U then 8 hex digits for the other planes (which are typically emoji, given you are working with tweets).
BTW, see Some emojis (e.g. ☁) have two unicode, u'\u2601' and u'\u2601\ufe0f'. What does u'\ufe0f' mean? Is it the same if I delete it? for why you are getting the FE0F in there: they are when the user wants to choose a variation of an emoji, e.g. to add colour. FE0E is its partner (to say you want the plain text glyph).

Using grep() with Unicode characters in R

(strap in!)
Hi, I'm running into issues involving Unicode encoding in R.
Basically, I'm importing data sets that contain Unicode (UTF-8) characters, and then running grep() searches to match values. For example, say I have:
bigData <- c("foo","αβγ","bar","αβγγ (abgg)", ...)
smallData <- c("αβγ","foo", ...)
What I'm trying to do is take the entries in smallData and match them to entries in bigData. (The actual sets are matrixes with columns of values, so what I'm trying to do is find the indexes of the matches, so I can tell what row to add the values to.) I've been using
matches <- grepl(smallData[i], bigData, fixed=T)
which usually results in a vector of matches. For i=2, it would return 1, since "foo" is element 1 of bigData. This is peachy and all is well. But RStudio seems to not be dealing with unicode characters properly. When I import the sets and view them, they use the character IDs.
dataset <- read_csv("[file].csv", col_names = FALSE, locale = locale())
Using View(dataset) shows "aß<U+03B3>" instead of "αβγ." The same goes for
dataset[1]
A tibble: 1x1 <chr>
[1] aß<U+03B3>
print(dataset[1])
A tibble: 1x1 <chr>
[1] aß<U+03B3>
However, and this is why I'm stuck rather than just adjusting the encoding:
paste(dataset[1])
[1] "αβγ"
Encoding(toString(dataset[1]))
[1] "UTF-8"
So it appears that R is recognizing in certain contexts that it should display Unicode characters, while in others it just sticks to--ASCII? I'm not entirely sure, but certainly a more limited set.
In any case, regardless of how it displays, what I want to do is be able to get
grep("αβγ", bigData)
[1] 2 4
However, none of the following work:
grep("αβ", bigData) #(Searching the two letters that do appear to convert)
grep("<U+03B3>",bigData,fixed=T) #(Searching the code ID itself)
grep("αβ", toString(bigData)) #(converts the whole thing to one string)
grep("\\β", bigData) #(only mentioning because it matches, bizarrely, to ß)
The only solution I've found is:
grep("\u03B3", bigData)
[1] 2 4
Which is not ideal for a couple reasons, most jarringly that it doesn't look like it's possible to just take every <U+####> and replace it with \u####, since not every Unicode character is converted to the <U+####> format, but none of them can be searched. (i.e., α and ß didn't turn into their unicode keys, but they're also not searchable by themselves. So I'd have to turn them into their keys, then alter their keys to a form that grep() can use, then search.)
That means I can't just regex the keys into a searchable format--and even if I could, I have a lot of entries including characters that'd need to be escaped (e.g., () or ), so having to remove the fixed=T term would be its own headache involving nested escapes.
Anyway...I realize that a significant part of the problem is that my set apparently involves every sort of character under the sun, and it seems I have thoroughly entrapped myself in a net of regular expressions.
Is there any way of forcing a search with (arbitrary) unicode characters? Or do I have to find a way of using regular expressions to escape every ( and α in my data set? (coordinate to that second question: is there a method to convert a unicode character to its key? I can't seem to find anything that does that specific function.)

how to put y axis greek letters in Veusz plot?

I want to put Capitalomega with index DE and k label:
and then ı want to show on the y axis label? How to do them?
Generally you can use tex symbols in Veusz. Therefore, you can write \Omega_{DE} and \Omega_{k} for your request. See details here (Sec. 2.4 Text).
Veusz understands a limited set of LaTeX-like formatting for text. There are some differences (for example, "10^23" puts the 2 and 3 into superscript), but it is fairly similar. You should also leave out the dollar signs. Veusz supports superscripts ("^"), subscripts ("_"), brackets for grouping attributes are "{" and "}".
Supported LaTeX symbols include: \AA, \Alpha, \Beta, \Chi, \Delta, \Epsilon, \Eta, \Gamma, \Iota, \Kappa, \Lambda, \Mu, \Nu, \Omega, \Omicron, \Phi, \Pi, \Psi, \Rho, \Sigma, \Tau, \Theta, \Upsilon, \Xi, \Zeta, \alpha, \approx, \ast, \asymp, \beta, \bowtie, \bullet, \cap, \chi, \circ, \cup, \dagger, \dashv, \ddagger, \deg, \delta, \diamond, \divide, \doteq, \downarrow, \epsilon, \equiv, \eta, \gamma, \ge, \gg, \in, \infty, \int, \iota, \kappa, \lambda, \le, \leftarrow, \lhd, \ll, \models, \mp, \mu, \neq, \ni, \nu, \odot, \omega, \omicron, \ominus, \oplus, \oslash, \otimes, \parallel, \perp, \phi, \pi, \pm, \prec, \preceq, \propto, \psi, \rhd, \rho, \rightarrow, \sigma, \sim, \simeq, \sqrt, \sqsubset, \sqsubseteq, \sqsupset, \sqsupseteq, \star, \stigma, \subset, \subseteq, \succ, \succeq, \supset, \supseteq, \tau, \theta, \times, \umid, \unlhd, \unrhd, \uparrow, \uplus, \upsilon, \vdash, \vee, \wedge, \xi, \zeta. Please request additional characters if they are required (and exist in the unicode character set). Special symbols can be included directly from a character map.
Other LaTeX commands are supported. "\" breaks a line. This can be used for simple tables. For example "{a\b} {c\d}" shows "a c" over "b d". The command "\frac{a}{b}" shows a vertical fraction a/b.
Also supported are commands to change font. The command "\font{name}{text}" changes the font text is written in to name. This may be useful if a symbol is missing from the current font, e.g. "\font{symbol}{g}" should produce a gamma. You can increase, decrease, or set the size of the font with "\size{+2}{text}", "\size{-2}{text}", or "\size{20}{text}". Numbers are in points.
Various font attributes can be changed: for example, "\italic{some italic text}" (or use "\textit" or "\emph"), "\bold{some bold text}" (or use "\textbf") and "\underline{some underlined text}".
Example text could include "Area / \pi (10^{-23} cm^{-2})", or "\pi\bold{g}".
Veusz plots these symbols with Qt's unicode support. You can also include special characters directly, by copying and pasting from a character map application. If your current font does not contain these symbols then you may get a box character.
In addition to the answer OmG posted, you can also directly enter the character (via a character map application or copy and paste), as Veusz supports unicode characters.

Use greek letters to name list elements in R

Is there any way to name list elements in R using Greek letters? I'm asked to create a list that should look like list(α = 42).
The actual result of this expression is equivalent to the result of list(a = 42).
I know it is possible to use Greek letters in plots using expression(symbol("a")), but I couldn't find a solution to use Greek letters as names for list elements. Using as.character("\U03B1") results in an error, using just "\U03B1" leads to an "a". I doubt that it makes sense to use Greek letters anywhere else than plots, but this is homework, so I have to find a way (if there is one).
I have not tested this thoroughly, but R seems to take pretty much any symbol as a variable name without any problems and I could not find any specific rule about variable naming aside from this, taken from ?name
Names are limited to 10,000 bytes (and were to 256 bytes in versions of R before 2.13.0).
The code you posted works perfectly for me (R 3.0.1 running under Fedora Core 18):
> a <- list(α = 42)
> a
$α
[1] 42
That said, I would definitely suggest to just spell out the letter as "alpha", as it is more practical to write and maintain.
a <- list(alpha = 42)
Alternatively, name the list afterwards using your unicode symbols in the character vector:
ll <- as.list( 42 )
names(ll) <- "\u03B1"
ll
#$α
#[1] 42
Interesting that you need to do this. Also your list holds the meaning of life.

Resources