Non-English character CSV encoding error among PC/Mac/Ubuntu - R

This problem has troubled me for a year. R has trouble opening my CSV file containing simplified Chinese characters. The data is encoded as GBK, I believe. I have three computers with different languages and operating systems, and they give mixed results when opening the same Chinese CSV file. Could someone tell me why the results differ?
(1) Windows + English OS + English R and RStudio: UNABLE to read my CSV, even when I set the encoding to UTF-8, GBK, or any other Chinese encoding you can name.
(2) Mac + English OS + English R: ABLE to read the Chinese CSV without forcing the encoding (update: after I reinstalled the operating system as El Capitan, it could no longer open my CSV correctly).
(3) Windows + Chinese OS + Chinese R: ABLE to read the CSV without forcing the encoding or GBK.
(4) Windows + English OS + Chinese R: UNABLE.
(5) Ubuntu + English OS + English R: ABLE.
In the Windows cases (English and Chinese OS), Notepad can open the CSV correctly, but in the English case Excel cannot. Whenever I cannot open my CSV with Excel, R cannot open it either.
If I convert the CSV with Google Sheets, Excel can open it, but R still cannot.
How does encoding work in R, and why do the results change with the OS language?
read.csv(...,encoding=)

It could be related to Excel's CSV encoding handling. If your Windows operating system is English, Excel might not open the CSV correctly. A workaround is to use Google Sheets, or a spreadsheet program installed on Ubuntu, to convert the file and then try opening it in R.
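If you prefer to stay inside R, a minimal sketch of forcing the encoding when reading is shown below; the file names and the GBK/UTF-8 choices are assumptions about how the file happens to be saved, not tested facts.
# file saved in the Windows Chinese default encoding (GBK/GB18030); hypothetical name
df <- read.csv("chinese_data.csv", fileEncoding = "GBK", stringsAsFactors = FALSE)
# the same data after converting the file to UTF-8 (e.g. via Google Sheets)
df <- read.csv("chinese_data_utf8.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
Note that fileEncoding= re-encodes the file as it is read, whereas encoding= only declares how the resulting strings are marked, so fileEncoding= is usually the argument you want here.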

I have figured out how to solve this. It concerns large files (under 800 MB) containing Simplified Chinese characters. The key is to know the default Chinese encoding of your operating system.
Ubuntu uses UTF-8 as its default Chinese encoding, so there you should encode the file as UTF-8 instead of GB18030 or other GB-family encodings.
(1) Download OpenOffice (free and quick to install, and it handles larger files than the Calc that ships with Ubuntu).
(2) Detect your CSV's encoding: simply open the CSV in OpenOffice and choose an encoding that displays your Chinese characters correctly.
(3) Save the CSV in the encoding that matches your operating system. The Windows default for Chinese is GBK; Ubuntu's is UTF-8.
This should solve both the file-size problem and the encoding problem. You do not even need to force the encoding; a plain read.csv() will work.
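To confirm which default your own system is using before re-saving, here is a quick base-R check; the locale strings in the comments are only examples of what you might see.
Sys.getlocale("LC_CTYPE")  # e.g. "Chinese (Simplified)_China.936" on Chinese Windows (code page 936 = GBK)
localeToCharset()          # the character set R assumes for the current locale, e.g. "GBK" or "UTF-8"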

Related

How to read a CSV file with unknown formatting and unknown encoding in R? (example file provided)

I have tried my best to read a CSV file in R but failed. I have provided a sample of the file at the following Google Drive link.
Data
I found that it is a tab-delimited file by opening it in a text editor. The file is read in Excel without issues, but when I try to read it in R using the "readr" package or the base R functions, it fails, and I am not sure why. I have tried different encodings like UTF-8, UTF-16 and UTF-16LE. Could you please help me write the correct script to read this file? Currently I am converting the file in Excel to a comma-delimited file in order to read it in R, but I am sure there must be something I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters being provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading files with unusual encodings into R. In your case this seems to do the trick:
read.delim(con <- file("temp.csv", encoding = "UCS-2LE"))
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
As for your second question, could we build the same logic in R to guess the encoding and delimiter and read many types of file without explicitly specifying them? Yes, this would certainly be possible. Whether it is desirable, I'm not sure.
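On that last point, a small sketch of what is already available, assuming the readr package is installed; guess_encoding() ranks likely encodings for a file, and the top candidate could then be passed to the file() call above.
library(readr)
guess_encoding("temp.csv")  # returns candidate encodings with confidence scores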

How to save a dataframe as a .csv file with UTF-8 encoding and LF line endings in R using RStudio?

I came across this weird situation:
I need to save a dataframe to a .csv file with UTF-8 encoding and LF line endings. I'm using the latest versions of R and RStudio on a Windows 10 machine.
My first, naive attempt was:
write.csv(df, fileEncoding="UTF-8", eol="\n")
Checking with Notepad++, it appears the encoding is UTF-8; however, the line endings are CRLF, not LF. OK, let's double-check with Notepad: surprise, surprise, according to Notepad the encoding is ANSI. At this point I'm confused.
After looking at the docs for the function write.csv I read that:
CSV files do not record an encoding
I'm not an expert on the topic, so I decided to fall back to saving the file as a simple .txt using write.table, as follows:
write.table(df, fileEncoding="UTF-8", eol="\n")
Again, the same result as above, no change whatsoever. I also tried the combinations
write.csv(df)
write.table(df)
without specifying encodings, but nothing changed. Then I set the default encoding in RStudio to UTF-8 with LF line endings in the global options
and ran the tests again. No change. What am I missing?
This is an odd one, at least for me. Nonetheless, by reading the docs for write.table I found the solution. Apparently, on Windows, to save files Unix-style you have to open a binary connection to the file and then save the file using the desired eol:
f <- file("filename.csv", "wb")
write.csv(df, file=f, eol="\n")
close(f)
As far as the UTF-8 format is concerned, global settings should work fine.
You can check that the eol is LF using Notepad++. UTF-8 is harder to verify: on Linux, isutf8 (from moreutils) says the files are indeed UTF-8, but Windows' Notepad disagrees when saving and says they are ANSI.
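If a package is an option, readr offers a shortcut. This is a sketch assuming readr >= 2.0, whose write_csv() takes an eol argument and always writes UTF-8 output.
library(readr)
write_csv(df, "filename.csv", eol = "\n")  # UTF-8 output with LF line endings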

Corrupted Rmarkdown script: How can I get the Cyrillic characters back?

For weeks I had been working on a script with lots of Cyrillic characters (both inside chunks and outside of them). One day I opened a new R Markdown script in which I wrote English, while the other document was still in my R session. Afterwards, when I returned to the Cyrillic document, everything written had turned into something like this: 8 иÑлÑ 1995 --> ÐлаÑÑÑ - наÑодÑ
The question is: where is the source of the problem? And how can the corrupted script be turned back into its original form (with the Cyrillic characters)?
UPDATE!!
I have tried reopening the RStudio script with the encodings CP1251, CP1252, windows-1251 and UTF-8, but it does not work; the weird symbols merely change into other weird symbols. The problem is that I saved the document with the default encoding (CP1251 / windows-1251) at the very beginning.
Solution:
If you work with Cyrillic and Latin characters, always be sure to save the RStudio script with UTF-8 encoding when your computer runs Windows (I do not know about Mac). If you close the script and open it again, reopen the file with UTF-8 encoding.
Assuming you're using RStudio: open your *.Rmd file and then try to reopen it "with encoding", via the File menu (Reopen with Encoding).
Select "Show all encodings" and choose your specific encoding; I suggest windows-1251 for Cyrillic.
Note: apparently the issue can also occur when the *.Rmd file is opened standalone at one time and from within an R project at another.
Hope that helps.

Reading CSV files with Chinese Characters

There are a number of StackOverflow posts about opening CSV files containing (UTF-8-encoded) Chinese characters in R on Windows. None of the answers I've found seems to work completely.
If I call read.csv() with encoding="UTF-8", then the Chinese characters are shown in escaped form (<U+XXXX>; I've manually verified the code points are at least correct). However, if I interrogate the data frame to get just one row, or a specific cell from a row, it is printed properly.
One post suggested this is due to strings being typed as factors. However, setting stringsAsFactors=FALSE had no effect.
Other posts say the locale must be set correctly. My system locale is apparently English_United Kingdom.1252; a Windows code page looks decidedly non-Unicode friendly! If I try to change it to any of en.UTF-8, en_GB.UTF-8 or en_US.UTF-8 (or even UTF-8 or Unicode), I get an error saying that my OS cannot honour the request.
If I try Sys.setlocale(category="LC_ALL", locale="Chinese"), the locale does change (albeit to another Windows code page; still no Unicode) but then the CSV files can't be parsed. That said, if I read the files in the English locale and then switch to Chinese afterwards, the data frame is printed out correctly in the console. However, this is cludgy and, regardless, View(myData) now shows mojibake rather than the encoded Unicode code points.
Is there any way to just make it all work? That is, correct Chinese characters are echoed from the data frame to the console and View works, without having to perform secret handshakes when reading the data?
My gut feeling is that the problem is the locale: It should be set to a UTF-8 locale and then everything should [might] just work. However, I don't know how to do that...
The <U+XXXX> notation is good and means your characters were read in properly. The issue is on R's side with printing to the console, which shouldn't be a big problem unless you are copying and pasting output. Writing out is a bit trickier: you want to open a UTF-8 file connection and then write to that file.
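A minimal sketch of that suggestion, assuming the data frame is called myData and the output file name is only an example:
con <- file("chinese_out.csv", open = "w", encoding = "UTF-8")  # text connection that re-encodes output to UTF-8
write.csv(myData, file = con, row.names = FALSE)
close(con)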

Wrong text file output of special characters using UTF-8 encoding in R 3.1.2 with Mac OS X

I am having problems writing a CSV file containing Spanish accented characters, using R 3.1.2 and Mac OS X 10.6.
I cannot write words with accents to a text file.
When I do:
con <- file("y.csv",encoding="UTF-8")
write.csv("Ú",con)
I get y.csv file which has the following content:
"","x"
"1","√ö"
That is, "√ö" instead of "Ú".
When using write.table the outcome is equivalent.
Encoding("Ú") is "UTF-8"
If I do write.xlsx("Ú","y.xlsx") I get y.xlsx file which successfully shows Ú.
I have also tried to convert to other encodings using iconv() with no success.
I have set the default encoding to UTF-8 in RStudio and in TextEdit. When using plain R (not RStudio) the problem is the same.
In RStudio the special characters appear correctly (in files), and also in the console in R.
Sys.getlocale() gives
"es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8"
In Mac OS X Terminal
file -I y.csv
gives
y.csv: text/plain; charset=utf-8
I don't see where the problem is. Any help, please?
Just came across this other question that seems like a near duplicate of this one:
Export UTF-8 BOM to .csv in R
The problem was not one of encoding in R, but of TextEdit, which did not show the right characters even though I had selected UTF-8 encoding in its preferences. It was solved by using a different editor. I was using Mac OS X 10.6.8 and TextEdit 1.6.
Maybe you can indeed write words with accents, but Excel expects a different encoding. Try writing your CSV with, for example, write_csv(), and open the CSV in Excel with this workaround (an R-side alternative is sketched after the link below):
open Excel
then choose tab Data | Get External Data | From Text
choose your file
and in step 1 of the Text Import Wizard, choose file origin 65001: Unicode (UTF-8).
See also http://www.openforis.org/support/questions/279/wrong-characters-display-when-exporting-files-to-csv-from-collect
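On the R side, a related option, assuming the readr package is available: write_excel_csv() writes UTF-8 with a byte order mark, which recent versions of Excel use to detect the encoding automatically, so the import wizard step may not even be needed (df stands for your data).
library(readr)
write_excel_csv(df, "y.csv")  # UTF-8 with a BOM; Excel then recognises the encoding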
