Can R transform emoji characters to their text equivalents? - r

In my question yesterday, "Can R read html-encoded emoji characters?", user rensa noted that:
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
How would this look in R?
(Note: I'm posting this question with the intention of answering it immediately, rather than posting this in the question linked above, since it's tangential to that question, but still possibly of use to others.)

The approach below works for transforming an emoji character or unicode representation into a name.
I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
# Get (MIT-licensed) emojilib data:
emoji_json_file <- "https://raw.githubusercontent.com/muan/emojilib/master/emojis.json"
json_data <- rjson::fromJSON(paste(readLines(emoji_json_file), collapse = ""))
get_name_from_emoji <- function(emoji_unicode, emoji_data = json_data){
emoji_evaluated <- stringi::stri_unescape_unicode(emoji_unicode)
# names(json_data)
vector_of_emoji_names_and_characters <- unlist(
lapply(json_data, function(x){
x$char
})
)
name_of_emoji <- attr(
which(vector_of_emoji_names_and_characters == emoji_evaluated)[1],
"names"
)
name_of_emoji
}
get_name_from_emoji("\\U0001f917")
# [1] "hugs"
get_name_from_emoji("🤗") # An attempt actually pasting the hugs emoji in also works.
# [1] "hugs"

Related

Is there some way to change the characters encoding to its English equivalent IN R?

In R
I am extracting data from Pdf tables using Tabulizer library and the Name are on Nepali language
and after extracting i Get this Table
[1]: https://i.stack.imgur.com/Ltpqv.png
But now i want that column 2's name To change, in its English Equivalent
Is there any way to do this in R
The R code i wrote was
library(tabulizer)
location <- "https://citizenlifenepal.com/wp-content/uploads/2019/10/2nd-AGM.pdf"
out <- extract_tables(location,pages = 113)
##write.table(out,file = "try.txt")
final <- do.call(rbind,out)
final <- as.data.frame(final) ### creating df
col_name <- c("S.No.","Types of Insurance","Inforce Policy Count", "","Sum Assured of Inforce Policies","","Sum at Risk","","Sum at Risk Transferred to Re-Insurer","","Sum At Risk Retained By Insurer","")
names(final) <- col_name
final <- final[-1,]
write.csv(final,file = "/cloud/project/Extracted_data/Citizen_life.csv",row.names = FALSE)
View(final)```
It appears that document is using a non-Unicode encoding. This web site https://www.ashesh.com.np/preeti-unicode/ can convert some Nepali encodings to Unicode, which would display properly in R, assuming you have the right fonts loaded. When I tried it on the output of your code, it did something that looked okay to me, but I don't know Nepali:
> out[[1]][1,2]
[1] ";fjlws hLjg aLdf"
When I convert the contents of that string, I get
सावधिक जीवन बीमा
which looks to me something like the text on that page in the document. If it's actually written correctly, then converting it to English will need some Nepali speaker to do the translation: hopefully that's you, but if I use Google Translate, it gives
Term life insurance
So here's my suggestion: contact the owner of that www.ashesh.com.np website, and find out if they can give you the translation rules. Write an R function to implement them if you can't find one by someone else. Then do the English translations manually.

Wrong encoding while loading the JSON data to R

I'm trying to build a word corpus based on my data frame, which was loaded from a JSON file. While doing it R doesn't see special signs like 'ř' (in the original json data it is visible and encoding is utf-8). I tried encoding in R with source editor and Encoding(x), but none of them works.
I would like to change the signs to latin letters. e.g. ř --> r, but r using gsub function completely destroys my data frame.
Do you have any ideas how to solve it?
#JSON file contains name with "ř", after loading data I get <f8> even though I choose encoding of source file
data5 <- fromJSON(file = "Test1801.json")
data6 <- as.data.frame(data5)
data6 <- tolower(data6) #This and gsub change whole data frame to character values "1"
data6 <- gsub("ř", "r", data6)
Welcome to SO. Please have in mind that you are expected to provide a reproducible example so we can work on your problem.
I understand you're looking after a way to change the symbols to latin letters. That can be accomplished with stringi::stri_trans_general:
require(stringi) # load library
a <- "ř" # assign your weird character to variable
newA <- stri_trans_general(a, "latin-ascii") # convert to latin
newA
> "r"
If you find this answer helpful, please consider marking it as such by ticking on the mark below the voting.

Replacing string with greater/smaller than symbol is not replaced

I want to clear a table from unwanted characters. The table can be downloaded here freely: https://www.aggdata.com/free/germany-postal-codes
It contains all german postal codes, and since in Germany there are some special characters like ü,ö,ä or ß, I need to swap them to other symbols.
Now let's say I want to replace all "ß" by "ss".
I used this code, collected from different posts in Stack Overflow. My code looks like this:
postal <- read.csv("~/Downloads/de_postal_codes.csv")
postal <- as.data.frame(sapply(postal,gsub,pattern="<df>",replacement="ss"))
When I try to replace other strings for testing like pattern = "Cot" the code works, but not if it contains the <> symbols. What is the problem here?
I am using R 3.3.3 in RStudio 1.0.136 on MacOS 10.13.4.
This seems to work. If you put encoding = "UTF-8" in to the read.table command, you see that <df> comes back as \xdf. I don't know much about this area, but trying this with the original encoding seemed to work. Hope this helps
postal <- read.table("~/Downloads/de_postal_codes.csv", sep = ",", header = TRUE,
stringsAsFactors = FALSE)
postal$Place.Name[4]
postal <- as.data.frame(
sapply(postal, function(x){
gsub(pattern="\xdf", replacement="ss", x=x)
})
)
postal$Place.Name[4]
edit: Also, I don't think you're sapply was doing the trick. The x parameter in gsub is not the first variable when you do ?gsub.
edit2: I'm using windows & 3.5.0 R version

Get R to keep UTF-8 Codepoint representation

This question is related to the utf-8 package for R. I have a weird problem in which I want emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the 'FindReplace' function from the Data Combine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I 'save' the output as an object in R the nice utf-8 encoding generated by utf8_encode for which I can use my dictionary, it disappears...
First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
This works great, gives output of \U0001f595 (etc.) that can be matched with dictionary entries when it 'prints' in the console.
utf8_encode(emojimovie$test)
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View I don't see the \U0001f595, I see actual emojis. I think this is why the FindReplace function isn't working -- when it gets saved to an object it just gets represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do this for all ~2,000 or so emojis.... I've tried reading as much as I can about utf-8, but I can't understand why this is happening. I'm stumped! :P
It looks like in the above, you are trying to convert the UTF-8 emoji to a text version. I would recommend going the other direction. Something like
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings litke "\U1F469".
The utf8::utf8_encode should really only be used if you have UTF-8 and are trying to print it to the screen.

How to convert special symbols in web scraping with R?

I am learning how to scrape the web with the XML and the RCurl packages. All goes well except for one thing. Special characters like ö or č they are read in differently into R. For instance the í is read in as í. I assume the latter is some sort of HTML coding for the first.
I have been looking for a way to convert these characters but I have not found it. I am sure other people have stumbled upon this problem as well, and I suspect there must be some sort of function to convert these characters. Does anyone know the solution? Thanks in advance.
Here is an example of the code, sorry I did not provide it earlier.
library(XML)
url <- 'http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles'
tables <- readHTMLTable(url)
Sec <- tables[[6]]
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
enc2utf8(pl1R1) # does not seem to work
Try parsing it first while specifying the encoding, then reading the table, as here: readHTMLTable and UTF-8 encoding.
An example might be:
library(XML)
url <- "http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles"
doc <- htmlParse(url, encoding = "UTF-8") #this will preserve characters
tables <- as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE))
Sec <- tables[[6]]
#not sure what you're trying to do here though
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]

Resources