Keeping Turkish characters with the text mining package for R

Let me start by saying that I'm still pretty much a beginner with R.
Currently I am trying out basic text mining techniques for Turkish texts, using the tm package.
I have, however, encountered a problem with the display of Turkish characters in R.
Here's what I did:
docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")
My thinking was that setting the language to Turkish and the encoding to UTF-8 (the original encoding of the text files) should make the Turkish characters İ, ı, ğ, Ğ, ş and Ş display correctly. Instead, the output converts these characters to I, i, g, G, s and S respectively and saves the file in an ANSI encoding, which cannot represent them.
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))
also saves the file in ANSI encoding, without the special characters.
This seems to not only be an issue with the output file.
writeLines(as.character(docs[[1]]))
for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi".
After reading this: UTF-8 file output in R
I also tried the following code:
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)
which didn't change the results.
All of this is on Windows 7, with the most recent versions of both R and RStudio.
Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.

Here is how I keep the Turkish characters intact:
Open a new .Rmd file in RStudio. (RStudio -> File -> New File -> R Markdown)
Copy and Paste your text containing Turkish characters.
Save the .Rmd file with encoding. (RStudio -> File -> Save with Encoding... -> UTF-8)
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")
After this step you can create your corpus, e.g. starting from VectorSource() in the tm package.
Turkish characters will appear as they should.
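Putting those steps together, here is a minimal sketch (file names are illustrative; the write-out step uses the useBytes trick from Kevin Ushey's post, discussed in the write.csv answers further down, to stop R from translating to the native Windows encoding):
library(tm)
# Read the UTF-8 file saved from RStudio and collapse it into one string
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")
# Build the corpus from the in-memory string rather than from DirSource
docs <- VCorpus(VectorSource(yourdocument),
                readerControl = list(language = "tur"))
# Write back out without translating to the native encoding
con <- file("documents.txt", open = "w", encoding = "native.enc")
writeLines(enc2utf8(as.character(docs[[1]])), con = con, useBytes = TRUE)
close(con)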

Related

Why is the encoding changing when I write to file?

I am attempting to make a boatload of Anki flashcards for Thai, so I did some web-scraping with R to extract transliterated elements from a website (dictionary). Everything looks good when printing in the console, but when I try to write the transliteration to a text file, the encoding changes, and I lose tone marks. Using Encoding() revealed that most entries were "UTF-8", which should be fine, but some entries were labeled as "unknown". You can download the HTML file from my GitHub, and my code is below for importing and extracting the text.
# Install appropriate library
install.packages("rvest")
library(rvest)
# Read in page to local variable
page <- read_html("Thai to English dictionary, translation and transliteration.html")
# Filter for specific tags
translit <- page %>% html_nodes(".tlit-line") %>% html_text()
write(translit, file = 'translit.txt')
library(stringi)
stringi::stri_write_lines(translit, "translit.txt", encoding = "UTF-8")
stri_write_lines (From stringi v1.5.3 by Marek Gagolewski)
Write Text Lines To A Text File. Writes a text file in such a way that each element of a given character vector becomes a separate text line.
Usage
stri_write_lines(
  str,
  con,
  encoding = "UTF-8",
  sep = ifelse(.Platform$OS.type == "windows", "\r\n", "\n"),
  fname = con
)
Arguments
str - character vector with data to write
con - name of the output file or a connection object (opened in binary mode)
encoding - output encoding, NULL or '' for the current default one
sep - newline separator
fname - deprecated alias of con
Details
It is a substitute for the R writeLines function, with the ability to
easily re-encode the output.
We suggest using the UTF-8 encoding for all text files: thus, it is
the default one for the output.
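If some elements of the vector are still labeled "unknown" (as reported in the question), it may help to normalize them before writing; a minimal sketch reusing the translit vector from above:
# Mark/convert every element as UTF-8 so nothing passes through
# the native locale during the write
translit <- enc2utf8(translit)
stringi::stri_write_lines(translit, "translit.txt", encoding = "UTF-8")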

write.csv() writes a different result on Mac OS than on Windows 10?

Character strings look completely normal when printed to the RStudio console, but appear as strange characters when written to CSV and opened with Excel.
Reproducible example
The following generates the object that appears as the string "a wit", then writes it to a csv:
# install.packages("dplyr")
library(dplyr)
serialized_char <- "580a000000030003060200030500000000055554462d380000001000000001000080090000000661c2a0776974"
(string <- serialized_char %>%
  # split the hex string into two-character byte pairs
  {substring(., seq(1, nchar(.), 2), seq(2, nchar(.), 2))} %>%
  # turn each pair into a raw byte via "0x.." notation
  paste0("0x", .) %>%
  as.integer %>%
  as.raw %>%
  # rebuild the original R object from the serialized bytes
  unserialize())
[1] "a wit"
write.csv(string, "myfile.csv", row.names=F)
This is what it looks like when written from Mojave (and viewed in Excel on Mojave): it contains undesirable characters.
This is what it looks like when written in High Sierra (and viewed in Excel in High Sierra): it also contains undesirable characters.
This is what it looks like when written from Windows 10 and viewed in Excel on Windows 10: it looks good!
This is what it looks like when written from Mojave but viewed in Excel on Windows 10: it still contains undesirable characters.
Question
I have a lot of character data of the form above (with characters that look strange when written to CSV and opened in Excel) - how can these be cleaned so that the text appears 'normally' in Excel?
What I've tried
I have tried four things so far:
write.csv(string, "myfile.csv", fileEncoding = 'UTF-8')
Encoding(string) <- "latin-1"
Encoding(string) <- "UTF-8"
iconv(string, "UTF-8", "latin1", sub=NA)
The problem isn’t R, the problem is Excel.
Excel has its own ideas about what a platform's character encoding should be. Notably, even on modern versions of macOS, it insists that the platform encoding is Mac Roman rather than the actually prevailing UTF-8.
The file is correctly written as UTF-8 on macOS by default.
To get Excel to read it correctly, you need to choose "File" › "Import…" and from there follow the import wizard, which lets you specify the file encoding.
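Alternatively, Excel picks up UTF-8 on its own when the file starts with a byte-order mark. One way to get that, assuming the readr package is available, is:
library(readr)
# write_excel_csv() prepends a UTF-8 BOM so Excel detects the encoding
write_excel_csv(data.frame(x = string), "myfile.csv")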

write.csv in Japanese from R to Excel

When I use write.csv for my Japanese text, I get gibberish in Excel (which normally handles Japanese fine). I've searched this site for solutions, but am coming up empty-handed. Is there an encoding command to add to write.csv to enable Excel to import the Japanese properly from R? Any help appreciated!
Thanks!
I just ran into this exact same problem - I used what I saw online:
write.csv(merch_df, file = "merch.reduced.csv", fileEncoding = "UTF-8")
and indeed, when opening the resulting file in Excel, I got <U+30BB><U+30D6><U+30F3>, etc. Odd and disappointing.
A little Google and I found this awesome blog post by Kevin Ushey which explains it all... https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
Using the function he proposes:
write_utf8 <- function(text, f = tempfile()) {
  # step 1: ensure our text is utf8 encoded
  utf8 <- enc2utf8(text)
  # step 2: create a connection with 'native' encoding
  # this signals to R that translation before writing
  # to the connection should be skipped
  con <- file(f, open = "w+", encoding = "native.enc")
  # step 3: write to the connection with 'useBytes = TRUE',
  # telling R to skip translation to the native encoding
  writeLines(utf8, con = con, useBytes = TRUE)
  # close our connection
  close(con)
  # read back from the file just to confirm
  # everything looks as expected
  readLines(f, encoding = "UTF-8")
}
works magic. Thank you Kevin!
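For example, to write the strings that were showing up as <U+30BB><U+30D6><U+30F3> (file name illustrative):
write_utf8("セブン", "japanese.txt")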
As a workaround - and diagnostic - have you tried saving as .txt and then both opening the file in Excel and pasting the data into Excel from a text editor?
I ran into the same problem as tchevrier. Japanese text was not displayed correctly in either Excel or a text editor when exporting with write.csv. I found that using:
readr::write_excel_csv(df, "filename.csv")
corrected the issue.
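(For reference: readr::write_excel_csv() prepends a UTF-8 byte-order mark to the file, which is what lets Excel detect the encoding automatically.)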

Output accented characters for use with latex

I'm trying to use R to create the content of a .tex file. The content contains many accented letters, and I am not able to write them to the file correctly.
Here is a short minimal example of what I would like to perform:
I have a file texinput.tex, which already exists and is encoded as UTF-8 without BOM. When I manually write é in Notepad++ and save the file, it compiles correctly in LaTeX and the output is as expected.
Then I tried to do this in R:
str.to.write <- "é"
cat(str.to.write, file = "tex_list.tex", append=TRUE)
As a result, the escaped character \xe9 appears in the .tex file. LaTeX throws this error when trying to compile:
! File ended while scanning use of \UTFviii#three#octets.<inserted text>\par \include{texinput}
I then tried all of the following things before the cat command:
Encoding(str.to.write) <- "latin1"
-> same output and error as above
str.to.write <- enc2utf8(str.to.write)
-> same output and error as above
Encoding(str.to.write) <- "UTF-8"
-> this appears in the .tex file: \xe9. LaTeX throws this error: ! Undefined control sequence. \xe
Encoding(str.to.write) <- "bytes"
-> this appears in the .tex file: \\xe9. LaTeX compiles without error, and the output is xe9
I know that I could replace é by \'{e}, but I would like an automatic method, because the real content is very long and contains words from three different Latin languages, so it has lots of different accented characters.
However, I would also be happy with a function that automatically sanitizes R output for use with LaTeX. I tried using xtable with sanitize.text.function, but it appears that it doesn't accept character vectors as input.
After quite a bit of searching and trial-and-error, I found something that worked for me:
# create output function
writeTex <- function(x) {
  write.table(x, "tex_list.tex",
              append = TRUE, row.names = FALSE,
              col.names = FALSE, quote = FALSE,
              fileEncoding = "UTF-8")
}
writeTex("é")
Output is as expected (é), and it compiles perfectly well in LaTeX.
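If you ever do need pure-ASCII LaTeX (for engines without UTF-8 support), a small lookup table can perform the \'{e}-style substitution automatically. A minimal sketch - the helper and its mapping are illustrative and cover only a few characters:
# Hypothetical helper: replace accented characters with LaTeX escapes
sanitizeTex <- function(x) {
  map <- c("é" = "\\'{e}", "è" = "\\`{e}", "á" = "\\'{a}",
           "ñ" = "\\~{n}", "ç" = "\\c{c}")
  for (ch in names(map)) {
    x <- gsub(ch, map[[ch]], x, fixed = TRUE)
  }
  x
}
cat(sanitizeTex("café"))  # prints caf\'{e}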
Use TIPA for processing International Phonetic Alphabet (IPA) symbols in LaTeX. It has become the standard in the linguistics field.

Import dataset with Spanish special characters into R

I'm new to R, and I've imported a dataset in CSV format, created in Excel, into my project using the "Import Dataset from Text File" function. However, the dataset displays the Spanish special characters (á, é, í, ó, ú, ñ) as the � symbol, as below:
Nombre Direccion Beneficiado
Mu�oz B�rbara H�medo
...
Subsequently I tried this code to make R display the Spanish special characters:
Encoding(dataset) <- "UTF-8"
And received the following answer:
Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
a character vector argument expected
So far I haven't been able to find a solution to this.
I'm working in RStudio version 0.98.1083, on Windows 7.
Thanks in advance for your help.
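For what it's worth, Encoding()<- expects a character vector, not a whole data frame, which is why that error appears. Since the CSV came from Excel on Windows, it is most likely encoded as Latin-1 (Windows-1252), so declaring that at import time should help; a minimal sketch with an illustrative file name:
# Re-read the file, declaring the encoding Excel actually used
dataset <- read.csv("dataset.csv", fileEncoding = "latin1",
                    stringsAsFactors = FALSE)
# Or convert an already-imported data frame column by column
dataset[] <- lapply(dataset, function(col) {
  if (is.character(col)) iconv(col, from = "latin1", to = "UTF-8") else col
})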
