R, Windows and foreign language characters

This has been a longstanding problem with R: it can read non-Latin characters on Unix, but on Windows I cannot get it to. I've reproduced this problem on several English-edition Windows machines over the years. I've tried changing the localisation settings in Windows and numerous other things, to no effect. Has anyone actually been able to read a foreign text file on Windows? Being able to read, write, and display Unicode is a pretty nifty feature for a program.
Environment:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
The problem can be reproduced as follows:
Create a simple file in a language like Russian or Arabic in a text editor and save it as UTF-8 without BOM.
> test_df <- read.table("test2.txt",header=FALSE, sep=";", quote="",fill=FALSE,
encoding="UTF-8",comment.char="",dec=",")
Warning message:
In read.table("test2.txt", header = FALSE, sep = ";", quote = "", :
  incomplete final line found by readTableHeader on 'test2.txt'
> test_df
                    V1 V2
1 <U+043E><U+0439>!yes  9
Using read.csv() yields the same results, minus the warning. I realize that the "<U+xxxx>" output is both searchable and can be converted to readable characters by an external program. But I want to see actual Cyrillic text in charts, tables, output, etc., like I can in every other program I've used.
I've had this problem consistently for a few years. Then yesterday morning I tried the following:
test_df <- read.table("items.txt",header=FALSE, sep=";",quote="",fill=FALSE,
encoding="bytes",comment.char="",dec=",")
And encoding="bytes" worked! I saw Cyrillic in the console. Then I had to reinstall R (same version, same computer, same everything), and the solution evaporated. I've literally retraced all my steps, and it seems like magic. Now encoding="bytes" just produces the same garbage (РєРѕРЅСЊСЏРє) as encoding="pizza" would (the parameter is ignored).
There is also a fileEncoding parameter for read.table. I am not sure exactly what it does, but it doesn't work either and cannot read even English text.
Can you read a non-ASCII text file on your Windows PC? How on earth do you do it?

Try setting the locale. For example,
Sys.setlocale(locale = "Russian")
See ?Sys.setlocale for more information.
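A minimal sketch of that suggestion, assuming the semicolon-separated UTF-8 file "test2.txt" from the question ("Russian" is a Windows-style locale name):
old_locale <- Sys.getlocale("LC_CTYPE")        # remember the current locale
Sys.setlocale("LC_CTYPE", "Russian")           # switch the character-type locale
test_df <- read.table("test2.txt", header = FALSE, sep = ";",
                      fileEncoding = "UTF-8")  # declare the file's encoding
print(test_df)                                 # Cyrillic should now display
Sys.setlocale("LC_CTYPE", old_locale)          # restore the original locale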

Related

(R) Save data (vector or dataframe) with Chinese characters / UTF-8 on Windows 10

I am trying to save some data downloaded from a website that includes some Chinese characters. I have tried many things with no success. RStudio's default text encoding is set to UTF-8, and the Windows 10 region setting "Beta: Use Unicode UTF-8 for worldwide language support" is enabled.
Here is the code to reproduce the problem:
##packages used
library(jiebaR) ##here for file_coding
library(htm2txt) ## to get the text
library(httr) ## just in case
library(readtext)
##get original text with chinese character
mytxtC <- gettxt("https://archive.li/wip/kRknx")
##print to check that chinese characters appear
mytxtC
##try to save in UTF-8
write.csv(mytxtC, "csv_mytxtC.csv", row.names = FALSE, fileEncoding = "UTF-8")
##check if it is readable
read.csv("csv_mytxtC.csv", encoding = "UTF-8")
##doesn't work, check file encoding
file_coding("csv_mytxtC.csv")
## answer: "windows-1252"
##try with txt
write(mytxtC, "txt_mytxtC.txt")
toto <- readtext("txt_mytxtC.txt")
toto[1,2]
##still not, try file_coding
file_coding("txt_mytxtC.txt")
## "windows-1252" ```
For information
``` Sys.getlocale()
[1] "LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252" ```
I changed the locale with Sys.setlocale() and it seems to be working.
I just added this line at the beginning of the code:
Sys.setlocale("LC_CTYPE","chinese")
I just need to remember to change it back eventually. Still, I find it odd that this line makes it possible to save as UTF-8 when it was not possible before...
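To keep the change contained, a sketch (reusing the write.csv call from the question) that saves and restores the locale around the write:
old_ctype <- Sys.getlocale("LC_CTYPE")  # remember the original locale
Sys.setlocale("LC_CTYPE", "chinese")    # Windows name for the Chinese locale
write.csv(mytxtC, "csv_mytxtC.csv", row.names = FALSE, fileEncoding = "UTF-8")
Sys.setlocale("LC_CTYPE", old_ctype)    # restore it afterwards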
This works for me on Windows:
Download the file:
download.file("https://archive.li/wip/kRknx", destfile="external_file", method="libcurl")
Input text:
my_text <- readLines("external_file") # readLines(url) works as well
Check for UTF-8:
> sum(validUTF8(my_text)) == length(my_text)
[1] TRUE
You can also check the file:
> validUTF8("external_file")
[1] TRUE
Here's the only difference I noticed on Windows:
user@somewhere:~/Downloads$ file external_file
external_file: HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators
vs
user@somewhere:~/Downloads$ file external_file
external_file: HTML document, UTF-8 Unicode text, with very long lines
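If the CRLF terminators ever matter downstream, one way to normalize them without touching the UTF-8 bytes (a sketch; "external_file" is the download above, "external_file_lf" is a name of my choosing):
txt <- readLines("external_file", encoding = "UTF-8")  # readLines drops CR/LF on both platforms
con <- file("external_file_lf", open = "wb")           # binary mode, so "\n" is not turned into CRLF
writeLines(txt, con, useBytes = TRUE)                  # useBytes avoids re-encoding to the native locale
close(con)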

write.csv() writes a different result on Mac OS than on Windows 10?

I have character strings that look completely normal when printed to the RStudio console but appear as strange characters when written to CSV and opened with Excel.
Reproducible example
The following generates an object that prints as the string "a wit", then writes it to a CSV:
# install.packages("dplyr")
library(dplyr)
serialized_char <- "580a000000030003060200030500000000055554462d380000001000000001000080090000000661c2a0776974"
(string <- serialized_char %>%
{substring(., seq(1, nchar(.), 2), seq(2, nchar(.), 2))} %>%
paste0("0x", .) %>%
as.integer %>%
as.raw %>%
unserialize())
[1] "a wit"
write.csv(string, "myfile.csv", row.names=F)
This is what it looks like when written from Mojave (and viewed in Excel on OSX Mojave) - contains undesirable characters:
This is when it's written in High Sierra (and viewed in Excel on High Sierra) - contains undesirable characters:
This is when it's written from Windows 10 and viewed in Excel on Windows 10 (looks good!):
This is when it's written from Mojave but viewed in Excel on Windows 10 - still contains undesirable characters:
Question
I have a lot of character data of the form above (with characters that look strange when written to CSV and opened in Excel) - how can these be cleaned so that the text appears normally in Excel?
What I've tried
I have tried four things so far:
write.csv(string, "myfile.csv", fileEncoding = 'UTF-8')
Encoding(string) <- "latin-1"
Encoding(string) <- "UTF-8"
iconv(string, "UTF-8", "latin1", sub=NA)
The problem isn't R; the problem is Excel.
Excel has its own ideas about what a platform's character encoding should be. Notably, it insists, even on modern macOS versions, that the platform encoding is Mac Roman rather than the actually prevailing UTF-8.
The file is correctly written as UTF-8 on macOS by default.
To get Excel to read it correctly, you need to choose "File" › "Import…" and from there follow the import wizard, which lets you specify the file encoding.
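Alternatively, a sketch that sidesteps the import wizard: readr's write_excel_csv() prepends a UTF-8 byte order mark, which current Excel versions use to detect the encoding when the file is opened directly.
library(readr)
# "string" is the unserialized value from the question; the BOM written by
# write_excel_csv() lets Excel recognize the file as UTF-8 on double-click.
write_excel_csv(data.frame(x = string), "myfile.csv")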

RStudio will not write "UTF-8" encoding with emoji data all of a sudden

I am working on a project that uses text files with emojis, and I started having issues with writing a data frame of emojis to a CSV file. I have been working with these files for some time now, and so far I've been able to save the text data using write.csv(x, "filename") and view it with read.csv("filename", encoding = "UTF-8") without any problems. Yesterday, quite suddenly, that stopped working. All the files that I previously saved still display emojis using the read.csv() function, but I cannot write and read any new files. For example, if I have:
x <- c("😂","😃","😄")
View(x)
write.csv(x, "testemoji.csv")
x2 <- read.csv("testemoji.csv", encoding = "UTF-8")
View(x2)
x displays the emojis correctly, while x2 shows garbled characters instead.
I am using R version 3.6.3 and Windows 10.
What I have tried so far:
write.csv(x, "filename", fileEncoding = "UTF-8")
write.table(x, "filename", fileEncoding = "UTF-8")
write.csv2(x, "filename", fileEncoding = "UTF-8")
x2<- read.csv2("filename", encoding = "UTF-8")
I've tried every option under Tools › Global Options › Code › Saving › Default text encoding.
I've also tried messing with the locale language on the computer and the Beta UTF-8 option.
When I check the encoding with Encoding(x$v1) it returns "UTF-8", "UTF-8", "UTF-8", but when I check Encoding(x2$x) it returns "unknown", "unknown", "unknown".
Trying to change the encoding with Encoding(x2$x) <- "UTF-8" does not change the outcome.
I have been working on this project for 3 months now with no issues. I can't understand why it would come on so suddenly. To my recollection, I have not changed any preferences or settings in R, RStudio, or my computer before this happened. The deadline for this project is coming up in a week, and I am getting desperate for answers. If anyone could please help, I would greatly appreciate it. Thank you.
It might be worth switching to the readr library for better encoding support. This worked for me:
readr::write_csv(data.frame(x),'testemoji.csv')
x2<- readr::read_csv("testemoji.csv")
View(x2)
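The likely reason this helps (my understanding, not stated in the answer): readr reads and writes UTF-8 regardless of the OS locale, so the Windows code page never enters the picture. A quick round-trip check:
library(readr)
x <- c("\U0001F602", "\U0001F603", "\U0001F604")  # the same three emojis, written as escapes
write_csv(data.frame(x), "testemoji.csv")         # readr always writes UTF-8
x2 <- read_csv("testemoji.csv")                   # and reads it back as UTF-8
identical(x, x2$x)                                # should be TRUE if the round trip preserved UTF-8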

Using Chinese characters without changing locale in R

I can use Chinese characters in R: I can put them in strings inside a data.frame, substitute them with gsub, and they display normally on screen. I can save them to a file using write.table, but I can't read them back with read.table! I'm using fileEncoding="UTF-8" for both write.table and read.table, but the latter gives me:
invalid multibyte string at ...
I've read about changing the locale, but since the Chinese characters work everywhere else, I would rather not mess with the locale (my machine uses a mix of English and Portuguese locales). Is that possible?
I'm using RKWard in Ubuntu 14.10.
EDIT: Chinese characters work perfectly everywhere in the files; they just produce errors when used for quoting...
Sorry, I arrived too late. I am using Ubuntu 20.04 and the following worked for my file:
library(readr)
lists <- read_delim("LISTS.csv", ";", escape_double = FALSE,
                    locale = locale(encoding = "ISO-8859-1"), trim_ws = TRUE)
Good luck
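For the UTF-8 files in the question, the same readr approach can declare the encoding per call, without touching the system locale (a sketch; "mydata.txt" is a stand-in for the asker's file):
library(readr)
# locale() here describes the file being read, not the system locale
df <- read_delim("mydata.txt", delim = "\t",
                 locale = locale(encoding = "UTF-8"))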

UTF-8 characters get lost when converting from list to data.frame in R

I am using R 3.2.0 with RStudio 0.98.1103 on Windows 7 64-bit. The Windows "regional and language settings" of my computer are English (United States).
For some reason the following code replaces my Czech characters "č" and "ř" with "c" and "r" in the text "Koryčany nad přehradou" when I read an XML file in UTF-8 encoding from the web, parse it to a list, and convert the list to a data.frame.
library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName
#this still displays correctly "Koryčany nad přehradou"
print(siteName)
#make a data.frame from the list item. I suspect the problem is here.
df <- data.frame(name=siteName, id=1)
#now the Czech characters are lost. I see only "Korycany nad prehradou"
View(df)
write.csv(df,"test.csv")
#the test.csv file also contains "Korycany nad prehradou"
#instead of "Koryčany nad přehradou"
What is the problem? How do I make R show my data.frame correctly with all the UTF-8 special characters, and save the .csv file without losing the Czech characters "č" and "ř"?
This is not a perfect answer, but the following workaround solved the problem for me. I tried to understand the behavior of R and to make the example so that my R script produces the same results on both the Windows and Linux platforms:
(1) Get XML data in UTF-8 from the Internet
library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName
(2) Print out the text from the Internet: the encoding is UTF-8, and the display in the R console is also correct using both the Czech and the English locales on Windows:
> Sys.getlocale(category="LC_CTYPE")
[1] "English_United States.1252"
> print(siteName)
[1] "Koryčany nad přehradou"
> Encoding(siteName)
[1] "UTF-8"
>
(3) Try to create and view a data.frame. This has a problem. The data.frame displays incorrectly both in the RStudio view and in the console:
df <- data.frame(name=siteName, id=1)
df
name id
1 Korycany nad prehradou 1
(4) Try to use a matrix instead. Surprisingly the matrix displays correctly in the R console.
m <- as.matrix(df)
View(m) #this shows incorrectly in RStudio
m #however, this shows correctly in the R console.
name id
[1,] "Koryčany nad přehradou" "1"
(5) Change the locale. If I'm on Windows, set the locale to Czech. If I'm on Unix or Mac, set the locale to UTF-8. NOTE: This has some problems when I run the script in RStudio; apparently RStudio doesn't always react immediately to the Sys.setlocale command.
#remember the original locale.
original.locale <- Sys.getlocale(category="LC_CTYPE")
#for Windows set locale to Czech. Otherwise set locale to UTF-8
new.locale <- ifelse(.Platform$OS.type=="windows", "Czech_Czech Republic.1250", "en_US.UTF-8")
Sys.setlocale("LC_CTYPE", new.locale)
(6) Write the data to a text file. IMPORTANT: don't use write.csv; use write.table instead. When my locale is Czech on my English Windows, I must use fileEncoding="UTF-8" in write.table. Now the text file shows up correctly in Notepad++ and also in Excel.
write.table(m, "test-czech-utf8.txt", sep="\t", fileEncoding="UTF-8")
(7) Set the locale back to the original:
Sys.setlocale("LC_CTYPE", original.locale)
(8) Try to read the text file back into R. NOTE: When reading the file, I had to set the encoding parameter (NOT fileEncoding!). The data.frame read from the file still displays incorrectly, but when I convert it to a matrix the Czech UTF-8 characters are preserved:
data.from.file <- read.table("test-czech-utf8.txt", sep="\t", encoding="UTF-8")
#the data.frame still has the display problem, "č" and "ř" get "lost"
> data.from.file
name id
1 Korycany nad prehradou 1
#see if a matrix displays correctly: YES it does!
matrix.from.file <- as.matrix(data.from.file)
> matrix.from.file
name id
1 "Koryčany nad přehradou" "1"
So the lesson learned is that I need to convert my data.frame to a matrix and set my locale to Czech (on Windows) or to UTF-8 (on Mac and Linux) before writing my data with Czech characters to a file. When writing the file, I must make sure fileEncoding is set to UTF-8. On the other hand, when I later read the file, I can keep working in the English locale, but in read.table I must set encoding="UTF-8".
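A condensed sketch of the whole workaround (the helper names are mine, not from the answer):
write_utf8_table <- function(df, path) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old))  # restore the locale even on error
  new.locale <- ifelse(.Platform$OS.type == "windows",
                       "Czech_Czech Republic.1250", "en_US.UTF-8")
  Sys.setlocale("LC_CTYPE", new.locale)
  write.table(as.matrix(df), path, sep = "\t",  # matrix, not data.frame, per step (4)
              fileEncoding = "UTF-8")
}

read_utf8_table <- function(path) {
  # encoding (not fileEncoding) on read, as noted in step (8)
  as.matrix(read.table(path, sep = "\t", encoding = "UTF-8"))
}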
If anybody has a better solution, I'll welcome your suggestions.
