Wrong encoding while loading the JSON data to R - r

I'm trying to build a word corpus from my data frame, which was loaded from a JSON file. R does not display special characters such as 'ř' correctly, although they are visible in the original JSON data, which is encoded as UTF-8. I tried setting the encoding in the R source editor and with Encoding(x), but neither worked.
I would like to replace these characters with plain Latin letters, e.g. ř --> r, but calling gsub on the data frame completely destroys it.
Do you have any ideas how to solve this?
#JSON file contains name with "ř", after loading data I get <f8> even though I choose encoding of source file
data5 <- fromJSON(file = "Test1801.json")
data6 <- as.data.frame(data5)
data6 <- tolower(data6) #This and gsub change whole data frame to character values "1"
data6 <- gsub("ř", "r", data6)

Welcome to SO. Please have in mind that you are expected to provide a reproducible example so we can work on your problem.
I understand you're looking for a way to change the symbols to Latin letters. That can be accomplished with stringi::stri_trans_general:
library(stringi) # load the package
a <- "ř" # assign the accented character to a variable
newA <- stri_trans_general(a, "latin-ascii") # transliterate to ASCII
newA
# [1] "r"
If you find this answer helpful, please consider marking it as such by ticking on the mark below the voting.
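On the data-frame part of the question: tolower() and gsub() expect a character vector, so passing the whole data frame makes R coerce it to one string per column, which is the likely cause of the collapsed output. A minimal sketch of applying a text function column by column instead, using only base R; the data frame and the helper name clean_text_columns are hypothetical stand-ins, not part of the original post:

```r
# Hypothetical stand-in for the data frame loaded from the JSON file.
df <- data.frame(name = c("Ji\u0159\u00ed", "Petr"), score = c(1, 2),
                 stringsAsFactors = FALSE)

# Apply a character-level function to the character columns only,
# so the data frame keeps its shape and its numeric columns.
clean_text_columns <- function(data, fun) {
  is_text <- vapply(data, is.character, logical(1))
  data[is_text] <- lapply(data[is_text], fun)
  data
}

# Lower-case the text and replace the one character from the question.
cleaned <- clean_text_columns(
  df,
  function(x) gsub("\u0159", "r", tolower(x), fixed = TRUE)
)
str(cleaned)
```

Passing function(x) stringi::stri_trans_general(x, "latin-ascii") as fun should apply the transliteration from the answer above to every text column at once.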

Related

Is there some way to change the characters encoding to its English equivalent IN R?

I am extracting data from PDF tables in R using the tabulizer library. The names are in Nepali, and after extraction I get this table:
[1]: https://i.stack.imgur.com/Ltpqv.png
Now I want to change the names in column 2 to their English equivalents. Is there any way to do this in R?
The R code I wrote was:
library(tabulizer)
location <- "https://citizenlifenepal.com/wp-content/uploads/2019/10/2nd-AGM.pdf"
out <- extract_tables(location,pages = 113)
##write.table(out,file = "try.txt")
final <- do.call(rbind,out)
final <- as.data.frame(final) ### creating df
col_name <- c("S.No.","Types of Insurance","Inforce Policy Count", "","Sum Assured of Inforce Policies","","Sum at Risk","","Sum at Risk Transferred to Re-Insurer","","Sum At Risk Retained By Insurer","")
names(final) <- col_name
final <- final[-1,]
write.csv(final,file = "/cloud/project/Extracted_data/Citizen_life.csv",row.names = FALSE)
View(final)
It appears that document is using a non-Unicode encoding. This web site https://www.ashesh.com.np/preeti-unicode/ can convert some Nepali encodings to Unicode, which would display properly in R, assuming you have the right fonts loaded. When I tried it on the output of your code, it did something that looked okay to me, but I don't know Nepali:
> out[[1]][1,2]
[1] ";fjlws hLjg aLdf"
When I convert the contents of that string, I get
सावधिक जीवन बीमा
which looks to me something like the text on that page in the document. If it's actually written correctly, then converting it to English will need some Nepali speaker to do the translation: hopefully that's you, but if I use Google Translate, it gives
Term life insurance
So here's my suggestion: contact the owner of that www.ashesh.com.np website, and find out if they can give you the translation rules. Write an R function to implement them if you can't find one by someone else. Then do the English translations manually.

Can R transform emoji characters to their text equivalents?

In my question yesterday, "Can R read html-encoded emoji characters?", user rensa noted that:
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
How would this look in R?
(Note: I'm posting this question with the intention of answering it immediately, rather than posting this in the question linked above, since it's tangential to that question, but still possibly of use to others.)
The approach below works for transforming an emoji character or unicode representation into a name.
I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
# Get (MIT-licensed) emojilib data:
emoji_json_file <- "https://raw.githubusercontent.com/muan/emojilib/master/emojis.json"
json_data <- rjson::fromJSON(paste(readLines(emoji_json_file), collapse = ""))
get_name_from_emoji <- function(emoji_unicode, emoji_data = json_data){
  # Turn an escaped form such as "\\U0001f917" into the actual character
  emoji_evaluated <- stringi::stri_unescape_unicode(emoji_unicode)
  # Named character vector: names are emoji names, values are the characters
  vector_of_emoji_names_and_characters <- unlist(
    lapply(emoji_data, function(x){
      x$char
    })
  )
  name_of_emoji <- attr(
    which(vector_of_emoji_names_and_characters == emoji_evaluated)[1],
    "names"
  )
  name_of_emoji
}
get_name_from_emoji("\\U0001f917")
# [1] "hugs"
get_name_from_emoji("🤗") # An attempt actually pasting the hugs emoji in also works.
# [1] "hugs"
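For a whole column of emoji rather than one value at a time, the same lookup can be vectorised with match(). This is a sketch under the assumption that the parsed JSON is a named list whose entries each carry a char field, as the function above expects; the two-entry emoji_data list is a hand-made stand-in so the example runs offline:

```r
# Hand-made stand-in for the parsed emojilib data (assumption: the real
# list has the same shape, name -> list(char = ...)).
emoji_data <- list(
  hugs = list(char = "\U0001f917"),
  joy  = list(char = "\U0001f602")
)

# Build the name/character table once instead of on every call.
emoji_chars <- vapply(emoji_data, function(x) x$char, character(1))

# Vectorised lookup: returns NA for anything that is not in the table.
get_names_from_emoji <- function(emoji, chars = emoji_chars) {
  names(chars)[match(emoji, chars)]
}

get_names_from_emoji(c("\U0001f602", "x", "\U0001f917"))
# [1] "joy"  NA     "hugs"
```

With the full data set, any entries that lack a char field would need to be filtered out before the vapply() call, since it expects exactly one string per entry.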

Get R to keep UTF-8 Codepoint representation

This question is related to the utf8 package for R. I have a weird problem: I want the emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the FindReplace function from the DataCombine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output as an object in R, the nice escaped representation generated by utf8_encode, which I could match against my dictionary, disappears...
First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
This works great, gives output of \U0001f595 (etc.) that can be matched with dictionary entries when it 'prints' in the console.
utf8_encode(emojimovie$test)
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View I don't see the \U0001f595, I see actual emojis. I think this is why the FindReplace function isn't working -- when it gets saved to an object it just gets represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do this for all ~2,000 or so emojis.... I've tried reading as much as I can about utf-8, but I can't understand why this is happening. I'm stumped! :P
It looks like in the above, you are trying to convert the UTF-8 emoji to a text version. I would recommend going the other direction. Something like
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
                      keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings like "\U1F469".
The utf8::utf8_encode should really only be used if you have UTF-8 and are trying to print it to the screen.
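If evaluating generated code with parse() feels uncomfortable, the same conversion can be sketched with strtoi() and intToUtf8() from base R. This is an alternative to the snippet above, not the answerer's method, and codepoint_to_utf8 is a hypothetical helper name:

```r
# Convert one "U+XXXX U+YYYY ..." entry into the UTF-8 string it denotes.
codepoint_to_utf8 <- function(codepoint) {
  parts <- strsplit(trimws(codepoint), "[[:space:]]+")[[1]]
  hex <- sub("^[Uu]\\+", "", parts)      # drop the "U+" prefix
  intToUtf8(strtoi(hex, base = 16L))     # code points -> one string
}

face <- codepoint_to_utf8("U+1F602")

# Literal (fixed) replacement of the emoji in some example text.
gsub(face, "lolface", paste0("so funny ", face), fixed = TRUE)
# [1] "so funny lolface"
```

Applied to the dictionary, something like vapply(emojis$Codepoint, codepoint_to_utf8, character(1)) would give one UTF-8 string per row, which can then be searched for in the original comments.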

How to convert special symbols in web scraping with R?

I am learning how to scrape the web with the XML and the RCurl packages. All goes well except for one thing: special characters like ö or č are read into R differently. For instance, the í is read in as Ã­. I assume the latter is some sort of HTML coding for the first.
I have been looking for a way to convert these characters but I have not found it. I am sure other people have stumbled upon this problem as well, and I suspect there must be some sort of function to convert these characters. Does anyone know the solution? Thanks in advance.
Here is an example of the code, sorry I did not provide it earlier.
library(XML)
url <- 'http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles'
tables <- readHTMLTable(url)
Sec <- tables[[6]]
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
enc2utf8(pl1R1) # does not seem to work
Try parsing it first while specifying the encoding, then reading the table, as here: readHTMLTable and UTF-8 encoding.
An example might be:
library(XML)
url <- "http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles"
doc <- htmlParse(url, encoding = "UTF-8") #this will preserve characters
tables <- readHTMLTable(doc, stringsAsFactors = FALSE) # list with one data frame per table
Sec <- tables[[6]]
#not sure what you're trying to do here though
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
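The symptom in the question is consistent with UTF-8 bytes being decoded as Latin-1, which would explain why declaring the encoding at parse time helps. A small base-R illustration of that assumption (it does not touch the Wikipedia page itself):

```r
correct <- "\u00ed"                 # the letter i with acute, stored as UTF-8
bytes <- charToRaw(correct)         # two bytes: c3 ad

# Pretend the two bytes were declared as Latin-1, one character per byte.
misread <- rawToChar(bytes)
Encoding(misread) <- "latin1"
nchar(enc2utf8(misread))            # 2 characters instead of 1

# Re-declaring the same bytes as UTF-8 restores the single character.
repaired <- misread
Encoding(repaired) <- "UTF-8"
repaired == correct                 # TRUE
```

The bytes never change in this sketch; only the declared encoding does, which is the same kind of fix that htmlParse(url, encoding = "UTF-8") applies up front.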

Quotation issues reading data into R

I have some data that I am trying to load into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada, available here.)
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a " at the end of Central Nova but not at the beginning. So, in order to read in the data, I suppressed the quotes, which worked fine for the first few files, i.e.:
test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split this into 2 variables. The data can't be read in, since R wants to add a column halfway through constructing the data frame.
Excel and OpenOffice both can open these files no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.
Do you know what option I need to enable in R to get this data in? I have >300 files that I need to load (with ~1000 rows each), so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.
Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names
temp <- list.files()
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|\\.csv$", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
  T0 <- readLines(temp[x])
  # Insert the missing opening quote after the 5-digit code and its comma
  T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
  final <- read.csv(text = T0, header = TRUE)
  final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
  NewFilename <- paste("Corrected", temp[x], sep = "_")
  write.csv(pollResults[[x]], file = NewFilename,
            quote = TRUE, row.names = FALSE)
}))
Hope this helps!
This answer is mainly a reply to @AnandaMahto (see comments on the original question).
First, it helps to set some options globally because of the French accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply replace the first comma in each line of data with a comma+quotation. This works because the first field is always 5 characters long. Note that it leaves the header untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
Penultimately, write over the original file.
fileConn<-file("pollresults_resultatsbureau13001.csv")
writeLines(temp,fileConn)
close(fileConn)
Finally, simply read the data back into R:
data<-read.csv(file="pollresults_resultatsbureau13001.csv",header = TRUE,sep=",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.
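The quote repair can be checked on a single line without touching any files. A minimal sketch using a shortened version of the problematic row from the question; the header names are invented for the example:

```r
# Shortened sample: a made-up header plus the start of the problem row.
lines <- c(
  'id,name_en,name_fr,poll,place',
  '12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A"'
)

# The first field is 5 characters plus a comma, so insert the missing
# opening quote right after those 6 characters (header left untouched).
lines[-1] <- sub('^(.{6})', '\\1"', lines[-1])

fixed <- read.csv(text = lines, header = TRUE, stringsAsFactors = FALSE)
ncol(fixed)     # 5
fixed$place     # "Pictou, Subd. A"
```

Because the quotes are balanced again, read.csv() can keep its default quote handling and the comma inside "Pictou, Subd. A" no longer starts a new column.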

Resources