UTF-8 file encoding in R

I have a .csv file which should be in UTF-8 encoding. I exported it from SQL Server Management Studio. However, when I import it into R, it fails on the lines containing ÿ. I use read.csv2 and specify the file encoding "UTF-8-BOM".
Notepad++ displays the ÿ correctly and says the file is UTF-8 encoded. Is this a bug in R's encoding handling, or is ÿ in fact not part of the UTF-8 encoding scheme?
I have uploaded a small tab-delimited .txt file that fails here:
https://www.dropbox.com/s/i2d5yj8sv299bsu/TestData.txt
Thanks

That ÿ is probably part of the BOM (byte order mark) at the beginning of the file. If the editor or parser doesn't recognize BOMs, it treats them as garbage characters. See https://www.ultraedit.com/support/tutorials-power-tips/ultraedit/unicode.html for more details.
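You can check this yourself by inspecting the first raw bytes of the file; a minimal sketch (assuming the downloaded file is saved as TestData.txt in the working directory):
con <- file("TestData.txt", "rb")
readBin(con, what = "raw", n = 4)   # first four bytes of the file
close(con)
A result starting with ef bb bf is a UTF-8 BOM; ff fe is a UTF-16LE BOM, which shows up as ÿþ if the file is read as Latin-1.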

Related

read_csv doesn't get accents correctly

I'm reading a UTF-8 encoded file with readr::read_csv("path_to/file.csv", locale = locale(encoding = "utf-8")), but the Spanish accents still don't come out correctly.
I wrote the file with readr::write_csv(file, "path_to/file.csv"), and the output of readr::guess_encoding("path_to/file.csv") is UTF-8 with 100% confidence.
As a side note, every time I wrote the file the R session ran into a fatal error, but the file was still written.
What can I do to get strings with correct accents?
EDIT
I've found this issue in readr's GitHub repo, pointing out that the error should disappear with the latest vroom release, but in my case it didn't.
I have solved the accent-mark issue: I wasn't correctly sourcing my funs.R file, which contains all the data-preprocessing functions used before writing the csv. Apparently source() reads files with R's default encoding, which isn't necessarily the same as the encoding of the file itself. I just had to pass the encoding = "UTF-8" argument to source(), as in the sketch below.
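A minimal sketch of that fix (the path is a placeholder from the question):
source("path_to/funs.R", encoding = "UTF-8")
Without the encoding argument, source() falls back to the session's default encoding, which on Windows is usually the native codepage rather than UTF-8.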
I wasn't able to resolve the fatal error.

R write.csv not handling characters like é correctly

When I look at the data in R, characters like "é" are displayed correctly.
I export it to a csv file using write.csv. When I open the csv file in Excel, "é" is displayed as "√©". Is the problem with write.csv or with Excel? What can I do to fix it?
Thanks
Try the write_excel_csv() function from the readr package; it writes UTF-8 with a byte order mark, which tells Excel the file is UTF-8 encoded:
readr::write_excel_csv(your_dataframe, "file_path")
It's a problem with Excel. Try importing the data instead of opening the file directly:
Go to 'Data' --> 'From Text/CSV' and then select '65001: Unicode (UTF-8)'. That will match the encoding coming from R.
Try experimenting with the fileEncoding parameter of write.csv:
write.csv(..., fileEncoding="UTF-16LE")
From write.csv documentation:
fileEncoding: character string: if non-empty declares the encoding to be used on a file (not a connection) so the character data can be re-encoded as they are written. See file.
CSV files do not record an encoding, and this causes problems if they are not ASCII for many other applications. Windows Excel 2007/10 will open files (e.g., by the file association mechanism) correctly if they are ASCII or UTF-16 (use fileEncoding = "UTF-16LE") or perhaps in the current Windows codepage (e.g., "CP1252"), but the ‘Text Import Wizard’ (from the ‘Data’ tab) allows far more choice of encodings. Excel:mac 2004/8 can import only ‘Macintosh’ (which seems to mean Mac Roman), ‘Windows’ (perhaps Latin-1) and ‘PC-8’ files. OpenOffice 3.x asks for the character set when opening the file.
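A fuller sketch of the fileEncoding suggestion above (the data frame is invented purely for illustration):
df <- data.frame(name = c("José", "Renée"), value = 1:2)
write.csv(df, "accents_utf16.csv", fileEncoding = "UTF-16LE", row.names = FALSE)
Per the note quoted above, Windows Excel should then open accents_utf16.csv correctly when double-clicked, without going through the Text Import Wizard.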

How to read a csv file with unknown formatting and unknown encoding in R? (example file provided)

I have tried my best to read a CSV file in R but failed. I have provided a sample of the file at the following Google Drive link.
Data
I found that it is a tab-delimited file by opening it in a text editor. The file is read in Excel without issues, but when I try to read it in R using the "readr" package or the base R functions, it fails. Not sure why. I have tried different encodings like UTF-8, UTF-16, and UTF-16LE. Could you please help me write the correct script to read this file? Currently I am converting the file in Excel to a comma-delimited version to read it in R, but I am sure there must be something I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters being provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading files with unusual encodings into R. In your case this seems to do the trick:
read.delim(con <- file("temp.csv", encoding = "UCS-2LE"))
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
In terms of your second question, could we build the same logic in R to guess encoding, delimiters and read in many types of file without us explicitly saying what the encoding and delimiter is - yes, this would certainly be possible. Whether it is desirable I'm not sure.
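If you prefer to detect the encoding programmatically, a hedged sketch with readr (the file name is a placeholder for your downloaded sample):
library(readr)
guess_encoding("temp.csv")                                    # lists likely encodings with confidence scores
read_tsv("temp.csv", locale = locale(encoding = "UTF-16LE"))  # then read with the best guess
guess_encoding() only inspects a sample of the file, so treat its answer as a starting point rather than a guarantee.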

How to save a dataframe as a .csv file with UTF-8 encoding and LF line ending in R using Rstudio?

I came across this weird situation:
I need to save a dataframe to a .csv file with UTF-8 encoding and LF line endings. I'm using the latest versions of R and RStudio on a Windows 10 machine.
My first attempt was, naively:
write.csv(df, fileEncoding="UTF-8", eol="\n")
Checking with Notepad++, the encoding appears to be UTF-8, but the line endings are CRLF, not LF. OK, let's double-check with Notepad: surprise, surprise, according to Notepad the encoding is ANSI. At this point I'm confused.
After looking at the docs for the function write.csv I read that:
CSV files do not record an encoding
I'm not an expert on the topic, so I decided to step back and save the file as a simple .txt using write.table as follows:
write.table(df, fileEncoding="UTF-8", eol="\n")
Again, the same result as above; no change whatsoever. I tried the combinations
write.csv(df)
write.table(df)
without specifying encodings, but no change. Then I set the default encoding in RStudio to UTF-8 with LF line endings and ran the tests again. No change. What am I missing?
This is an odd one, at least for me. Nonetheless, by reading the docs of write.table I found the solution. Apparently, on Windows, to save files Unix-style you have to open a binary connection to the file and then write to it with the desired eol:
f <- file("filename.csv", "wb")
write.csv(df, file=f, eol="\n")
close(f)
As far as the UTF-8 encoding is concerned, the global settings should work fine.
Check that the eol is LF using Notepad++. UTF-8 is harder to check, since on Linux isutf8 (from moreutils) says the files are indeed UTF-8, but Windows' Notepad disagrees when saving and calls them ANSI.
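An alternative sketch, if the readr package is an option (df and the file name are placeholders): readr::write_csv() always writes UTF-8, and recent versions (2.0 and later) expose the line ending directly.
library(readr)
write_csv(df, "filename.csv", eol = "\n")   # UTF-8 output with LF line endings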

non-English character CSV encoding error among PC/MAC/Ubuntu

This problem has troubled me for a year. R has trouble opening my csv file containing simplified Chinese characters. The data is encoded as GBK, I believe. I have three computers with different languages and operating systems, and they give mixed results when opening the same Chinese csv file. Could someone tell me why the results differ?
(1) Windows + English OS + English R and RStudio: UNABLE to read my csv even if I encode it as UTF-8, GBK, or any other encoding for Chinese you can name.
(2) Mac + English OS + English R: ABLE to read the Chinese csv without forcing the encoding (update: after I reinstalled the operating system to El Capitan, it could no longer open my csv correctly).
(3) Windows + Chinese OS + Chinese R: ABLE to read the csv without forcing the encoding, or with gbk.
(4) Windows + English OS + Chinese R: UNABLE.
(5) Ubuntu + English OS + English R: ABLE.
In the Windows cases (English and Chinese OS), Notepad can open the csv correctly, but Excel cannot in the English case. Whenever I cannot open my csv with Excel, R cannot either.
If I convert the csv with Google Sheets, Excel can open it, but R still cannot.
How does encoding work in R, and why do the results change with the OS language?
read.csv(...,encoding=)
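A hedged sketch of that idea (the file name is a placeholder; GB18030 is a superset of GBK, so it is usually a safe choice for simplified Chinese):
df <- read.csv("chinese_data.csv", fileEncoding = "GB18030")
Note the difference between the two arguments: fileEncoding re-encodes the file as it is read, while encoding only declares the encoding of strings that have already been read in.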
It could be related to Excel's csv encoding handling. If your Windows operating system is English, Excel might not open the csv correctly. A workaround is to convert it to csv using Google Sheets, or a spreadsheet installed on Ubuntu, and then try opening it in R.
I have figured out how to solve it. This concerns files of up to roughly 800 MB containing simplified Chinese characters. The key is that you should know the default Chinese encoding of your operating system.
Ubuntu uses UTF-8 as the default Chinese encoding, so there you should encode the file as UTF-8 instead of GB18030 or other GB-family encodings.
(1) Download OpenOffice (free and fast to install, and it handles larger files than Calc on Ubuntu).
(2) Detect your csv's encoding: simply open the csv in OpenOffice and choose an encoding that displays your Chinese characters correctly.
(3) Save the csv in the correct encoding for your operating system: the default for Chinese is GBK on Windows and UTF-8 on Ubuntu.
This should solve both the file-size problem and the encoding problem. You do not even need to force the encoding; a normal read.csv will work.
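If you would rather do the conversion in R itself, a hedged sketch (file names are placeholders) is to read with fileEncoding and write back out as UTF-8:
df <- read.csv("data_gbk.csv", fileEncoding = "GB18030")   # or "GBK"
write.csv(df, "data_utf8.csv", fileEncoding = "UTF-8", row.names = FALSE)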
