Cyrillic transliteration in R

Are there packages for transliterating Cyrillic text to Latin in R? I need to convert data frames to Latin before using them as factors, since Cyrillic factor levels are somewhat messy to work with in R.

I finally found a package that does this: stringi.
> library(stringi)
> stri_trans_general("женщина", "cyrillic-latin")
[1] "ženŝina"
> stri_trans_general("женщина", "russian-latin/bgn")
[1] "zhenshchina"
With either transform, the only remaining issue is the "ё" letter.
> stri_trans_general("Ёж", "russian-latin/bgn")
[1] "Yëzh"
> stri_trans_general("подъезд", "russian-latin/bgn")
[1] "podʺyezd"
> stri_trans_general("мальчик", "russian-latin/bgn")
[1] "malʹchik"
I had to remove all the "ё", "ʹ", and "ʺ" characters:
> iconv(stri_trans_general("ёж", "russian-latin/bgn"),from="UTF8",to="ASCII",sub="")
[1] "yzh"
Alternatively, one can replace the "Ё" and "ё" letters with plain "E" and "e" before transliteration
> gsub('ё','e',gsub('Ё','E','Ёжики на ёлке'))
[1] "Eжики на eлке"
or do the same replacement after transliteration.
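Putting the pieces together, a small helper along these lines (my own sketch, not from the original answer) handles the "ё" substitution and strips the stray modifier letters in one go:
library(stringi)
translit_ru <- function(x) {
  x <- gsub("ё", "e", gsub("Ё", "E", x))            # handle ё/Ё up front
  x <- stri_trans_general(x, "russian-latin/bgn")   # BGN/PCGN romanization
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop leftover ʹ and ʺ
}
translit_ru("Ёжики на ёлке")
# expected: [1] "Ezhiki na elke"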

The same can be done with the stringi package as above, but with a different transform identifier, for Serbian Latin:
`stri_trans_general("жшчћђ", "Serbian-Latin/BGN")`
All characters should be transliterated correctly to Serbian Latin.
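For reference, a quick check of what that call should return (assuming the same stringi setup as above; the comment shows the standard Serbian Latin mapping):
library(stringi)
stri_trans_general("жшчћђ", "Serbian-Latin/BGN")
# expected: [1] "žščćđ"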

One caveat: if you afterwards use base R to filter the data on Cyrillic values, you get all NAs, but if dplyr is used then everything is fine.

Related

Filtering out entries in a column that contain UTF-8 arabic characters in R

I have a data set called event_table that has a column titled "c.Comments", which contains strings mostly in English but with some Arabic in a few of the comment entries. I want to filter out rows in which the comments entry contains Arabic characters.
I read the data into R from an xlsx file, and the Arabic characters show as UTF-8 escapes like "<U+4903><U+483d>" etc.
I've tried using regular expressions to achieve what I want, but the strings I'm trying to match refuse to be filtered out. I've tried all kinds of different regular expressions, but none of them seem to do the trick. I tried filtering out the literal "<U+":
event_table <- event_table %>%
  filter(!grepl("<U+", c.Comments, fixed = TRUE))
event_table <- event_table %>%
  filter(!grepl("<U\\+", c.Comments))
"\x", "\d\d\d\d", and all sorts of other combinations have done nothing for me
I'm starting to suspect that my method of filtering may be the issue rather than the regular expression, so any suggestions would be greatly appreciated.
Arabic chars can be detected with grep/grepl using a PCRE regex like \p{Arabic}:
> df <- data.frame(x=c("123", "abc", "ﺏ"))
> df
         x
1      123
2      abc
3 <U+FE8F>
> grepl("\\p{Arabic}", df$x, perl=TRUE)
[1] FALSE FALSE TRUE
In your case, the code will look like this:
event_table <- event_table %>%
filter(!grepl("\\p{Arabic}", c.Comments, perl=TRUE))
Look at the ?Syntax help page. The Unicode character associated with <U+4903> may vary with the assumed codepage. On my machine the R character would be created with the string "\u4903", but it prints as a Chinese glyph. The regex engine R uses with perl=TRUE (documented in the ?regex help page, which is worth reading) is PCRE.
The pattern in this grepl expression matches the ordinary printing characters, so it returns FALSE for a character like "\u4903":
grepl("[[:alnum:]]|[[:punct:]]", "\u4903")
[1] FALSE
And I don't think you should be negating that grepl result:
dplyr::filter(data.frame("\u4903"), grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
[1] X.䤃.
<0 rows> (or 0-length row.names)
dplyr::filter(data.frame("\u4903"), !grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
X.䤃.
1 䤃

remove invisible characters from a UTF-8 string

I have an R tibble with UTF-8 character column. When I print the contents of this column for a certain problematic record, everything looks fine: one ‭two‬ three. There are, however, problems when I try to use this string in a RDBMS query which I construct in R and send to the database.
If I copy this string to Notepad++ and convert the encoding to ANSI, I can see that the string actually contains some additional characters that cause the problem: one ‭two‬ three.
A partial solution that works is conversion to ASCII:
iconv(my_string, "UTF-8", "ASCII", sub = "")
but all non-ASCII characters are lost that way.
Conversion from UTF-8 to UTF-8 doesn't solve my problem:
iconv(my_string, "UTF-8", "UTF-8", sub = "").
Is it possible to remove all invisible characters like the ones above without losing the UTF-8 encoding?
That is:
how can I convert my string to the form that I see when I print it out in R (without hidden parts)?
Not sure I completely understand what you are trying to do, but you can use stringi or stringr to explicitly specify which characters you want to retain. For your example, it could look something like this; you may have to expand the set of characters to retain, but this approach is one option:
library(stringr)
my_string <- "one ‭two‬ three"
# Keep only upper- and lowercase letters, numbers, punctuation, and whitespace.
# (Use "A-Za-z" rather than "A-z", and note that "|" is a literal character
# inside a character class, so the alternation bars are not needed.)
str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]")
[1] "one two three"
# Just checking that the result is still valid UTF-8
stringi::stri_enc_isutf8(str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]"))
[1] TRUE
EDIT: I do want to note that you should check and see how robust this approach is. I have not dealt with invisible characters often so this may not be the best way to go about removing them.
You haven't given us a way to construct your bad string so I can't test this on your data, but it works on this example.
badString <- "one \u200Btwo\u200B three"
chars <- strsplit(badString, "")[[1]] # Assume badString has one entry; if not, add a loop
chars <- chars[nchar(chars, type = "width") > 0]
goodString <- paste(chars, collapse = "")
Both badString and goodString look the same when printed:
> badString
[1] "one ​two​ three"
> goodString
[1] "one two three"
but they have different numbers of characters:
> nchar(badString)
[1] 15
> nchar(goodString)
[1] 13
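Another option worth trying (my suggestion, not from the answers above): most invisible characters of this kind, including the zero-width space U+200B, carry the Unicode "format" category Cf, so a property-based regex can strip them while leaving all visible UTF-8 characters intact:
badString <- "one \u200Btwo\u200B three"
# Remove all Unicode "format" (Cf) characters: zero-width spaces, joiners, etc.
gsub("\\p{Cf}", "", badString, perl = TRUE)
# [1] "one two three"
# Or with stringi, which uses ICU's regex engine:
stringi::stri_replace_all_regex(badString, "\\p{Cf}", "")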

Convert special letters to english letters in R

Is there a way to convert special letters in a text to English letters in R? For example:
Æ -> AE
Ø -> O
Å -> A
Edit: The reason I need this conversion is that R can't see that these two words are the same:
stringdist('oversættelse','oversaettelse')
[1] 2
grepl('oversættelse','oversaettelse')
FALSE
Some people tend to write using only English characters and others do not. In order to compare texts I need to have them in the same format.
I recently had a very similar problem and was pointed to the question Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?
Basically, the gist is that for many of these special characters there exists more than one Unicode representation, which will mess with text comparisons. The suggested solution is the stringi function stri_trans_nfc; the package also has stri_trans_general, which supports transliteration and might be exactly what you need.
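As a concrete example of the transliteration route, ICU's built-in "Latin-ASCII" transform covers exactly these letters (a quick sketch; the comment shows the expected output under ICU's standard mapping):
library(stringi)
stri_trans_general("Æ Ø Å oversættelse", "Latin-ASCII")
# expected: [1] "AE O A oversaettelse"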
You can use chartr
x <- "ØxxÅxx"
chartr("ØÅ", "OA", x)
[1] "OxxAxx"
And/or gsub
y <- "Æabc"
gsub("Æ", "AE", y)
[1] "AEabc"

print backslash in R strings

GNU R 3.0.2
> bib <- "\cite"
Error: '\c' is an unrecognized escape in character string starting ""\c"
> bib <- "\\cite"
> print(bib)
[1] "\\cite"
> sprintf(bib)
[1] "\\cite"
>
How can I print out the string variable bib with just one "\"?
(I've tried everything conceivable and discovered that R treats the "\\" as one character.)
I see that in many cases this is not a problem, since this is usually handled internally by R, say, if the string were to be used as text for a plot.
But I need to send it to LaTeX, so I really have to remove it.
I see cat does the trick, if only cat could be made to send its result to a string.
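It turns out cat's output can in fact be captured back into a string with capture.output, though the variable already holds a single backslash; a quick sketch:
bib <- "\\cite"
nchar(bib)                    # 5: one backslash plus "cite"
captured <- paste(capture.output(cat(bib)), collapse = "\n")
identical(captured, bib)      # TRUE; it was the same string all along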
You should use cat.
bib <- "\\cite"
cat(bib)
# \cite
You can remove the ## and [1] by setting a few options in knitr. Here is an example chunk:
<<newChunk,echo=FALSE,comment=NA,background=NA>>=
bib <- "\\cite"
cat(bib)
@
which gets you \cite. Note as well that you can set these options globally.
There is no backslash in the character element "\cite". The backslash is being interpreted as an escape, and the two-character sequence "\c" is being interpreted as a control-C, except that is not a recognized character; see ?Quotes. The second version has only one backslash followed by 4 alpha characters. Count the characters to see this:
nchar("\\cite")
[1] 5
OK,
<<echo=FALSE,results='asis'>>=
result <- cat(bib)
@
does the trick (without the result <- assignment, a [1] is added to the output). It just feels kludgy.

Comma separator for numbers in R?

Is there a function in R to display large numbers separated with commas?
i.e., from 1000000 to 1,000,000.
You can try either format or prettyNum, but both functions return a vector of characters. I'd only use that for printing.
> prettyNum(12345.678,big.mark=",",scientific=FALSE)
[1] "12,345.68"
> format(12345.678,big.mark=",",scientific=FALSE)
[1] "12,345.68"
EDIT: As Michael Chirico says in the comment:
Be aware that these have the side effect of padding the printed strings with blank space, for example:
> prettyNum(c(123,1234),big.mark=",")
[1] " 123" "1,234"
Add trim=TRUE to format or preserve.width="none" to prettyNum to prevent this:
> prettyNum(c(123,1234),big.mark=",", preserve.width="none")
[1] "123" "1,234"
> format(c(123,1234),big.mark=",", trim=TRUE)
[1] "123" "1,234"
See ?format:
> format(1e6, big.mark=",", scientific=FALSE)
[1] "1,000,000"
>
The other answers posted obviously work, but I have always used:
library(scales)
label_comma()(1000000)
I think Joe's comment on MatthewR's answer offers the best solution and should be highlighted: as of September 2018, the scales package (part of the tidyverse) does exactly this:
> library(scales)
> x <- 10e5
> comma(x)
[1] "1,000,000"
The scales package appears to play very nicely with ggplot2, allowing for fine control of how numerics are displayed in plots and charts.
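For instance, label_comma() can be handed straight to a ggplot2 scale (a minimal sketch with made-up data; the column names are arbitrary):
library(ggplot2)
library(scales)
df <- data.frame(x = 1:5, y = c(2e5, 4e5, 6e5, 8e5, 1e6))
ggplot(df, aes(x, y)) +
  geom_line() +
  scale_y_continuous(labels = label_comma())  # y-axis labels like "1,000,000"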
