Get R to keep UTF-8 codepoint representation

This question is related to the utf8 package for R. I have a weird problem in which I want emojis in a data set I'm working with to stay in code-point representation (i.e. as '\U0001f602'). I want to use the FindReplace function from the DataCombine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output of utf8_encode as an object in R, the nice escaped representation that my dictionary can match disappears...
First I have to adjust the dictionary a bit:
# lowercase the codepoints and turn "u+1f602" into the escape "\U0001f602"
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+", "\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
This works great: utf8_encode gives output like \U0001f595 that can be matched with dictionary entries when it prints in the console.
utf8_encode(emojimovie$test)
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View, I don't see \U0001f595; I see actual emojis. I think this is why the FindReplace function isn't working: when the output gets saved to an object, it is represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do this for all ~2,000 emojis. I've tried reading as much as I can about UTF-8, but I can't understand why this is happening. I'm stumped! :P

It looks like you are trying to convert the UTF-8 emoji to a text version. I would recommend going the other direction: convert the dictionary's codepoint text to UTF-8 strings and search for those. Something like
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
                      keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
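For example, a minimal sketch of the replacement step (assuming, as in the question, that the dictionary's names are in emojis$Name and the comment text is in emojimovie$test):
# substitute each emoji character for its dictionary name, one code at a time
text <- emojimovie$test
for (i in seq_along(codes)) {
  text <- gsub(codes[i], paste0(" ", emojis$Name[i], " "), text, fixed = TRUE)
}
emojimovie$text2 <- text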
Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings like "\U1F469".
The utf8::utf8_encode function should really only be used if you have UTF-8 text and are trying to print it to the screen.

Related

Is there some way to change the character encoding to its English equivalent in R?

In R, I am extracting data from PDF tables using the tabulizer library, and the names are in Nepali. After extracting, I get this table (screenshot: https://i.stack.imgur.com/Ltpqv.png). Now I want to change column 2's names to their English equivalents. Is there any way to do this in R? The R code I wrote was:
library(tabulizer)
location <- "https://citizenlifenepal.com/wp-content/uploads/2019/10/2nd-AGM.pdf"
out <- extract_tables(location, pages = 113)
## write.table(out, file = "try.txt")
final <- do.call(rbind, out)
final <- as.data.frame(final)  ### creating df
col_name <- c("S.No.", "Types of Insurance", "Inforce Policy Count", "",
              "Sum Assured of Inforce Policies", "", "Sum at Risk", "",
              "Sum at Risk Transferred to Re-Insurer", "",
              "Sum At Risk Retained By Insurer", "")
names(final) <- col_name
final <- final[-1, ]
write.csv(final, file = "/cloud/project/Extracted_data/Citizen_life.csv", row.names = FALSE)
View(final)
It appears that the document is using a non-Unicode encoding. This web site, https://www.ashesh.com.np/preeti-unicode/, can convert some Nepali encodings to Unicode, which would display properly in R, assuming you have the right fonts loaded. When I tried it on the output of your code, the result looked okay to me, but I don't know Nepali:
> out[[1]][1,2]
[1] ";fjlws hLjg aLdf"
When I convert the contents of that string, I get
सावधिक जीवन बीमा
which looks to me like the text on that page of the document. If it's actually written correctly, then converting it to English will need a Nepali speaker to do the translation: hopefully that's you, but if I use Google Translate, it gives
Term life insurance
So here's my suggestion: contact the owner of the www.ashesh.com.np website and find out if they can give you the translation rules. Write an R function to implement them if you can't find one written by someone else. Then do the English translations manually.
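For illustration, a hypothetical sketch of what such a conversion function could look like, assuming you have obtained a Preeti-to-Unicode glyph mapping (the preeti_map below is an invented, incomplete subset inferred from the example above; real Preeti conversion also needs glyph reordering, e.g. the i-matra is typed before its consonant):
# invented illustrative mapping: Preeti glyph -> Unicode Devanagari character
preeti_map <- data.frame(
  from = c(";", "j", "w", "s"),
  to   = c("\u0938", "\u0935", "\u0927", "\u0915"),  # स, व, ध, क
  stringsAsFactors = FALSE
)
preeti_to_unicode <- function(x, map = preeti_map) {
  # apply each substitution literally, in order
  for (i in seq_len(nrow(map))) {
    x <- gsub(map$from[i], map$to[i], x, fixed = TRUE)
  }
  x
}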

Replacing a string containing greater-than/less-than symbols does not work

I want to clear a table of unwanted characters. The table can be downloaded for free here: https://www.aggdata.com/free/germany-postal-codes
It contains all German postal codes, and since German has some special characters like ü, ö, ä, or ß, I need to swap them for other symbols.
Now let's say I want to replace all "ß" with "ss".
My code, collected from different posts on Stack Overflow, looks like this:
postal <- read.csv("~/Downloads/de_postal_codes.csv")
postal <- as.data.frame(sapply(postal, gsub, pattern = "<df>", replacement = "ss"))
When I try to replace other strings for testing, like pattern = "Cot", the code works, but not if the pattern contains the <> symbols. What is the problem here?
I am using R 3.3.3 in RStudio 1.0.136 on macOS 10.13.4.
This seems to work. If you put encoding = "UTF-8" into the read.table command, you see that <df> comes back as \xdf. I don't know much about this area, but trying it with the original encoding seemed to work. Hope this helps.
postal <- read.table("~/Downloads/de_postal_codes.csv", sep = ",", header = TRUE,
                     stringsAsFactors = FALSE)
postal$Place.Name[4]
postal <- as.data.frame(
  sapply(postal, function(x){
    gsub(pattern = "\xdf", replacement = "ss", x = x)
  })
)
postal$Place.Name[4]
Edit: Also, I don't think your sapply was doing the trick; x is not the first argument of gsub (see ?gsub).
Edit 2: I'm using Windows and R version 3.5.0.
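An alternative sketch, assuming the file is Latin-1 encoded (a common case for German data): declaring the encoding up front means the sharp s arrives as a real "ß" that can be matched literally.
postal <- read.csv("~/Downloads/de_postal_codes.csv",
                   fileEncoding = "latin1", stringsAsFactors = FALSE)
# replace the sharp s (U+00DF) in the character columns only
chr <- vapply(postal, is.character, logical(1))
postal[chr] <- lapply(postal[chr], function(x) gsub("\u00df", "ss", x, fixed = TRUE))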

Strings containing an accent do not appear

I'm currently building a Shiny application that needs to be translated into different languages. I have the whole structure, but I'm struggling to get values such as "Validació" that contain accents.
The structure I've followed is the following:
I have a dictionary, which is simply a CSV with the translations: there's a key column and then one column per language. The structure of this dictionary is the following:
key, cat, en
"selecció", "selecció", "Selection"
"Diferències","Diferències", "Differences"
"Descarregar","Descarregar", "Download"
"Diagnòstics","Diagnòstics", "Diagnoses"
I have a script that, once dictionary.csv is modified, generates a .bin file that is later loaded by the code.
In strings.R I have all the strings that appear in the code, and I use a function to translate from the current language to the one I want. The function is the following:
Code:
tr <- function(text){
  sapply(text, function(s) translation[[s]][["cat"]], USE.NAMES = FALSE)
}
When I translate something, since I'm doing it in another file, I assign it to a variable, something like:
str_seleccio <- tr('Selecció')
The problem I'm facing: translating 'Selecció' with tr('Selecció') gives a correct answer when I execute it in the RStudio console, but in the Shiny application it appears as NULL. If the word I translate has no accents, such as "Hello", tr("Hello") gives a correct answer in the Shiny application too, and I can see it through the code.
So tr(word) gets the correct value, but when assigning it, it "loses the value", and I'm a bit lost as to how to fix it.
I know that you can do something like Encoding(str_seleccio) <- "UTF-8", but in this case it is not working. With plain words it used to work, but since the assignment yields NULL it doesn't help here.
Any idea? Any suggestion? What I would like is to add something to the tr function.
The main idea comes from this repository, which, if you take a look, is about the simplest version you can build, but the author has problems with UTF-8 as well:
https://github.com/chrislad/multilingualShinyApp
As suggested in http://shiny.rstudio.com/articles/unicode.html, (re)save all files with UTF-8 encoding.
Additionally, within updateTranslation.R, change:
translationContent <- read.delim("dictionary.csv", header = TRUE, sep = "\t", as.is = TRUE)
to:
translationContent <- read.delim("dictionary.csv", header = TRUE, sep = "\t", as.is = TRUE, fileEncoding = "UTF-8")
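If values still come back NULL after that, a further sketch worth trying (an assumption on my part, not from the linked article): normalize the lookup key to UTF-8 inside tr with base R's enc2utf8, so the key matches the UTF-8 names in the translation list.
tr <- function(text){
  sapply(text, function(s) translation[[enc2utf8(s)]][["cat"]], USE.NAMES = FALSE)
}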
Warning: when you (re)save ui.R, your c-cedilla might get destroyed. Just re-insert it in case that happens.
Happy Easter :)

Can R transform emoji characters to their text equivalents?

In my question yesterday, "Can R read html-encoded emoji characters?", user rensa noted that:
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
How would this look in R?
(Note: I'm posting this question with the intention of answering it immediately, rather than posting this in the question linked above, since it's tangential to that question, but still possibly of use to others.)
The approach below works for transforming an emoji character, or its Unicode escape, into a name.
I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
# Get (MIT-licensed) emojilib data:
emoji_json_file <- "https://raw.githubusercontent.com/muan/emojilib/master/emojis.json"
json_data <- rjson::fromJSON(paste(readLines(emoji_json_file), collapse = ""))

get_name_from_emoji <- function(emoji_unicode, emoji_data = json_data){
  emoji_evaluated <- stringi::stri_unescape_unicode(emoji_unicode)
  # build a named vector: names are emoji names, values are emoji characters
  vector_of_emoji_names_and_characters <- unlist(
    lapply(emoji_data, function(x){
      x$char
    })
  )
  # look up the (first) name whose character matches
  name_of_emoji <- attr(
    which(vector_of_emoji_names_and_characters == emoji_evaluated)[1],
    "names"
  )
  name_of_emoji
}

get_name_from_emoji("\\U0001f917")
# [1] "hugs"
get_name_from_emoji("🤗")  # pasting the actual hugs emoji in also works
# [1] "hugs"

How to convert JSON data from R so it is not all on one line

I extracted data in R using the twitteR package and searchTwitteR, and I then convert it to a JSON file. It works, but when I view the JSON in Notepad++ it is one long string. Is there a way to get it to separate, so that each tweet with its specific information is on its own line?
library(twitteR)
library(rjson)
testreal <- searchTwitteR('startup', n = 100, lang = 'en')
testrealdf <- do.call("rbind", lapply(testreal, as.data.frame))
exportJson <- toJSON(testrealdf)  # convert the data frame to a JSON string
write(exportJson, file = "testrealdf.json")
json_realdf <- fromJSON(file = "testrealdf.json")
My file in Notepad++ looks like one long, unbroken line.
Use jsonlite::toJSON(x=yourDataFrame, pretty=TRUE). From ?toJSON:
pretty adds indentation whitespace to JSON output. Can be TRUE/FALSE
or a number specifying the number of spaces to indent. See prettify
This is just for cosmetics. The whitespace in JSON is not syntactically meaningful. For a serialization format with syntactically meaningful whitespace, check out R's yaml package, which can also read JSON.
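Applied to the question's code, a minimal sketch (using testrealdf from above):
library(jsonlite)
exportJson <- toJSON(testrealdf, pretty = TRUE)  # indented, human-readable JSON
write(exportJson, file = "testrealdf.json")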
