Why is the encoding changing when I write to file? (R)

I am attempting to make a boatload of Anki flashcards for Thai, so I did some web scraping with R to extract transliterated elements from an online dictionary. Everything looks good when printed in the console, but when I write the transliteration to a text file, the encoding changes and I lose the tone marks. Using Encoding() revealed that most entries were "UTF-8", which should be fine, but some entries were labeled "unknown". You can download the HTML file from my GitHub; my code for importing and extracting the text is below.
# Install appropriate library
install.packages("rvest")
library(rvest)
# Read in page to local variable
page <- read_html("Thai to English dictionary, translation and transliteration.html")
# Filter for specific tags
translit <- page %>% html_nodes(".tlit-line") %>% html_text()
write(translit, file = 'translit.txt')
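For reference, a minimal sketch of the Encoding() check mentioned above (base R only; it just inspects the declared encodings, using the same translit vector as in the code):
# count how many entries are marked "UTF-8" vs "unknown"
table(Encoding(translit))
# peek at the entries without a declared encoding
head(translit[Encoding(translit) == "unknown"])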

library(stringi)
stringi::stri_write_lines(translit, con = "translit.txt", encoding = "UTF-8")
stri_write_lines (From stringi v1.5.3 by Marek Gagolewski)
Write Text Lines To A Text File. Writes a text file in such a way
that each element of a given character vector becomes a separate text
line.
Usage
stri_write_lines(
  str,
  con,
  encoding = "UTF-8",
  sep = ifelse(.Platform$OS.type == "windows", "\r\n", "\n"),
  fname = con
)
Arguments
str - character vector with data to write
con - name of the output file or a connection object (opened in the binary mode)
encoding - output encoding, NULL or '' for the current default one
sep - newline separator
fname - deprecated alias of con
Details
It is a substitute for the R writeLines function, with the ability to
easily re-encode the output.
We suggest using the UTF-8 encoding for all text files: thus, it is
the default one for the output.
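A quick check after running the call above (my own verification step, not part of the original answer): read the file back with stringi's stri_read_lines() and compare it with the original vector.
library(stringi)
# read the file back, declaring the encoding it was written in
check <- stri_read_lines("translit.txt", encoding = "UTF-8")
# should be TRUE if the tone marks survived the round trip
all(check == translit)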

Related

(R) Save data (vector or data frame) with Chinese characters / UTF-8 on Windows 10

I am trying to save some data downloaded from a website that includes Chinese characters. I have tried many things with no success. RStudio's default text encoding is set to UTF-8, and the Windows 10 region option "Beta: Use Unicode UTF-8 for worldwide language support" is enabled.
Here is the code to reproduce the problem:
##packages used
library(jiebaR) ##here for file_coding
library(htm2txt) ## to get the text
library(httr) ## just in case
library(readtext)
##get original text with Chinese characters
mytxtC <- gettxt("https://archive.li/wip/kRknx")
##print to check that Chinese characters appear
mytxtC
##try to save in UTF-8
write.csv(mytxtC, "csv_mytxtC.csv", row.names = FALSE, fileEncoding = "UTF-8")
##check if it is readable
read.csv("csv_mytxtC.csv", encoding = "UTF-8")
##doesn't work, check file encoding
file_coding("csv_mytxtC.csv")
## answer: "windows-1252"
##try with txt
write(mytxtC, "txt_mytxtC.txt")
toto <- readtext("txt_mytxtC.txt")
toto[1,2]
##still not, try file_coding
file_coding("txt_mytxtC.txt")
## "windows-1252" ```
For information:
Sys.getlocale()
[1] "LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"
I changed the locale with Sys.setlocale() and it seems to be working.
I just added this line at the beginning of the code:
Sys.setlocale("LC_CTYPE","chinese")
I just need to remember to change it back eventually. Still, I find it odd that this one line makes it possible to save as UTF-8 when it was not possible before...
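A small sketch of that workaround with the locale change wrapped so it gets restored afterwards (same objects and file names as above):
old_ctype <- Sys.getlocale("LC_CTYPE")        # remember the current locale
Sys.setlocale("LC_CTYPE", "chinese")          # switch so the conversion succeeds
write.csv(mytxtC, "csv_mytxtC.csv", row.names = FALSE, fileEncoding = "UTF-8")
Sys.setlocale("LC_CTYPE", old_ctype)          # change it back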
This works for me on Windows:
Download the file:
download.file("https://archive.li/wip/kRknx", destfile="external_file", method="libcurl")
Input text:
my_text <- readLines("external_file") # readLines(url) works as well
Check for UTF-8:
> sum(validUTF8(my_text)) == length(my_text)
[1] TRUE
You can also check the file:
> validUTF8("external_file")
[1] TRUE
Here's the only difference I noticed on Windows:
user@somewhere:~/Downloads$ file external_file
external_file: HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators
vs
user@somewhere:~/Downloads$ file external_file
external_file: HTML document, UTF-8 Unicode text, with very long lines
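If you then want to save my_text without the Windows-1252 locale interfering, one option (my own sketch, not part of the answer; the output file name is a placeholder) is to skip re-encoding entirely, since the bytes are already valid UTF-8:
# useBytes = TRUE writes the strings byte-for-byte, so the UTF-8 content
# is not pushed through the Windows-1252 locale on the way out
writeLines(my_text, "mytxtC_utf8.html", useBytes = TRUE)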

unable to read csv file saved with encoding "UTF-8-SIG"

I have data that I crawled using Scrapy, which saves it as a CSV file with the encoding utf-8-sig. The data contains many different special characters: Korean, Russian, Chinese, Spanish, ..., a star symbol (★), this 🎵, and this 🎄...
So Scrapy can save the file, and I can view it in Notepad++ or an app like CSVFileView. But when I load it in R using mydata <- read.csv(<path_to_file>, fileEncoding="UTF-8-SIG", header=FALSE), I get this error:
Error in file(file, "rt", encoding = fileEncoding) :
unsupported conversion from 'UTF-8-SIG' to ''
If I don't specify the encoding, the file loads, but the symbols become garbled characters like ☠ and the first column header gets ï.. prepended to it.
Which encoding should I choose to include all the characters?
As the input is already encoded as UTF-8, you should use the encoding argument to read the file as-is. Using fileEncoding will attempt to re-encode the file.
mydata <- read.csv(<path_to_file>, encoding="UTF-8", header=FALSE)
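One follow-up sketch (my own note, not part of the answer): because utf-8-sig files begin with a byte-order mark, the first cell read this way may start with the invisible character "\ufeff", which you can strip afterwards:
mydata <- read.csv(<path_to_file>, encoding = "UTF-8", header = FALSE)
# drop a leading byte-order mark from the first column, if present
mydata[[1]] <- sub("^\ufeff", "", as.character(mydata[[1]]))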

Keeping Turkish characters with the text mining package for R

Let me start by saying that I'm still pretty much a beginner with R.
Currently I am trying out basic text mining techniques for Turkish texts, using the tm package.
I have, however, encountered a problem with the display of Turkish characters in R.
Here's what I did:
docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")
My thinking was that setting the language to Turkish and the encoding to UTF-8 (which is the original encoding of the text files) should make it possible to display the Turkish characters İ, ı, ğ, Ğ, ş and Ş. Instead, the output converts these characters to I, i, g, G, s and S respectively and saves the file in an ANSI encoding, which cannot represent these characters.
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))
also saves the file without the characters in ANSI encoding.
This does not seem to be an issue with the output file alone.
writeLines(as.character(docs[[1]]))
for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi"
After reading this: UTF-8 file output in R
I also tried the following code:
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)
which didn't change the results.
All of this is on Windows 7 with the most recent versions of both R and RStudio.
Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.
Here is how I keep the Turkish characters intact:
1. Open a new .Rmd file in RStudio (RStudio -> File -> New File -> R Markdown).
2. Copy and paste your text containing the Turkish characters.
3. Save the .Rmd file with encoding (RStudio -> File -> Save with Encoding... -> UTF-8).
4. Read it back in:
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")
After this step you can create your corpus, e.g. starting from VectorSource() in the tm package (see the sketch below). The Turkish characters will appear as they should.
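A minimal sketch of that route end to end, assuming the .Rmd trick above and the tm package (file and object names here are placeholders, not from the original answer):
library(tm)
txt  <- readLines("yourdocument.Rmd", encoding = "UTF-8")  # keep the UTF-8 marking
txt  <- paste(txt, collapse = " ")
docs <- VCorpus(VectorSource(txt),
                readerControl = list(language = "tur"))
# write without re-encoding, so ı, ğ, ş stay intact
writeLines(as.character(docs[[1]]), con = "documents.txt", useBytes = TRUE)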

Get R to keep UTF-8 Codepoint representation

This question is related to the utf8 package for R. I have a weird problem: I want the emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the FindReplace function from the DataCombine package to turn the UTF-8 code points into prose descriptions of the emojis in a data set of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output of utf8_encode as an object in R, the nice code point representation that my dictionary can match against disappears...
First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
This works great: it gives output like \U0001f595 (etc.) that can be matched with the dictionary entries when it prints in the console:
utf8_encode(emojimovie$test)
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View(), I don't see \U0001f595; I see the actual emojis. I think this is why FindReplace isn't working: when the output gets saved to an object, it is represented as emojis again, so the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do this for all ~2,000 or so emojis. I've tried reading as much as I can about UTF-8, but I can't understand why this is happening. I'm stumped! :P
It looks like, in the code above, you are trying to convert the UTF-8 emoji to a text version. I would recommend going in the other direction. Something like:
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings like "\U1F469".
utf8::utf8_encode should really only be used if you have UTF-8 text and are trying to print it to the screen.
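A sketch of the matching step the answer alludes to, using stringi's stri_replace_all_fixed() with vectorize_all = FALSE so every code/Name pair from the dictionary is applied to each comment (my own continuation of the answer's codes vector; the described column name is made up):
library(stringi)
# replace each emoji (as a UTF-8 string) with its prose name from the dictionary
emojimovie$described <- stri_replace_all_fixed(emojimovie$test,
                                               pattern       = codes,
                                               replacement   = emojis$Name,
                                               vectorize_all = FALSE)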

Output accented characters for use with LaTeX

I'm trying to use R to create the content of a .tex file. The content contains many accented letters, and I am not able to write them correctly to the .tex file.
Here is a short minimal example of what I would like to perform:
I have a file texinput.tex, which already exists and is encoded as UTF-8 without BOM. When I manually type é in Notepad++ and save this file, it compiles correctly in LaTeX and the output is as expected.
Then I tried to do this in R:
str.to.write <- "é"
cat(str.to.write, file = "tex_list.tex", append=TRUE)
As a result, the escaped byte \xe9 appears in the .tex file. LaTeX throws this error when trying to compile:
! File ended while scanning use of \UTFviii@three@octets. <inserted text> \par \include{texinput}
I then tried all of the following things before the cat command:
Encoding(str.to.write) <- "latin1"
-> same output and error as above
str.to.write <- enc2utf8(str.to.write)
-> same output and error as above
Encoding(str.to.write) <- "UTF-8"
-> this appears in the tex file: \xe9. LaTeX throws this error: ! Undefined control sequence. \xe
Encoding(str.to.write) <- "bytes"
-> this appears in the tex file: \\xe9. LaTeX compiles without error, but the output is xe9
I know that I could replace é with \'{e}, but I would like an automatic method, because the real content is very long and contains words from three different Latin languages, so it has lots of different accented characters.
However, I would also be happy with a function that automatically sanitizes R output for use with LaTeX. I tried using xtable with sanitize.text.function, but it appears that it doesn't accept character vectors as input.
After quite a bit of searching and trial-and-error, I found something that worked for me:
# create output function
writeTex <- function(x) {
  write.table(x, "tex_list.tex",
              append = TRUE, row.names = FALSE,
              col.names = FALSE, quote = FALSE,
              fileEncoding = "UTF-8")
}
writeTex("é")
Output is as expected (é), and it compiles perfectly well in LaTeX.
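An alternative sketch that stays closer to the original cat() call: open the file through a connection that re-encodes the output to UTF-8 and append to it (same file name as above):
# the encoding argument of file() makes the connection convert to UTF-8 on write
con <- file("tex_list.tex", open = "a", encoding = "UTF-8")
cat("é", file = con)
close(con)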
Use TIPA for processing International Phonetic Alphabet (IPA) symbols in LaTeX. It has become standard in the linguistics field.
