read_csv doesn't get accents correctly - r

I'm reading a UTF-8 encoded file with readr::read_csv("path_to/file.csv", locale = locale(encoding = "utf-8")), but it still doesn't read the Spanish accents correctly.
I wrote the file with readr::write_csv(file, "path_to/file.csv") and the output of readr::guess_encoding("path_to/file.csv") is UTF-8 with 100% confidence.
As a side note, every time I wrote the file, the R session ran into a fatal error, but the file was still written.
What can I do to get strings with correct accents?
EDIT
I've found this issue in readr's GitHub repo, pointing out that the error should disappear with the latest vroom release, but in my case it didn't.

I have solved the accent marks issue: when calling my funs.R file, which contains all the relevant data-preprocessing functions and was used before writing the CSV, I wasn't doing it correctly. Apparently, file sourcing is done with R's default encoding, which isn't necessarily the same as the encoding of the file itself. I just had to pass the encoding = "utf-8" argument to source().
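In other words (paths as in my project; df stands in for whatever data frame you are writing):
# tell source() that the script itself is UTF-8, so string literals survive
source("funs.R", encoding = "UTF-8")
# then write and read as before
readr::write_csv(df, "path_to/file.csv")
df <- readr::read_csv("path_to/file.csv", locale = readr::locale(encoding = "UTF-8"))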
I wasn't able to resolve the fatal error.

Related

Reading diacritics in R

I have imported several .txt files (texts written in Spanish) to RStudio using the following code:
content = readLines(paste("my_texts", "text1",sep = "/"))
However, when I read the texts in RStudio, they contain codes instead of diacritics. For example, I see the code <97> instead of an "ó" or the code <96> instead of an "ñ".
I have also realized that if the .txt file was originally written using a computer configured in Spanish, I don't see the codes but the actual diacritics. And if the texts were written using a computer configured in English, then I do get the codes (even though when opening the .txt file in TextEdit I see the diacritics).
I don't know why R displays those symbols and what I can do to retain the diacritics I see in the original .txt files.
I read I could possibly solve this by changing the encoding to UTF-8, so I tried this:
content = readLines(paste("my_texts", "text1",sep = "/"), encoding = "UTF-8")
But that didn't work. Any ideas what those codes are and how to keep my diacritics?
As you figured out, you need to set the correct encoding. Unfortunately the text file was written using a legacy encoding rather than UTF-8 — namely, MacRoman. Ideally the application producing the file would not use this encoding, and Apple products by default no longer produce it.
But since this is what you’ve got, we have to deal with it, and we can. Unfortunately we need to take a detour, because the encoding argument of readLines is a bit useless. Instead, we need to manually open a file connection:
con = file(file.path("my_texts", "text1"), encoding = "macintosh")
contents = readLines(con)
close(con) # always make sure to close connections!
(Inside a function, prefer on.exit(close(con)) right after opening the connection, so it gets closed even if an error occurs.)
Do note that the encoding name “macintosh” is strictly speaking not portable, so this might not work on all platforms.
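Alternatively, you can read the lines as-is and convert them to UTF-8 once with iconv, then work in UTF-8 from there; this is only a sketch, and the same caveat about the "macintosh" name applies:
# read the raw lines, then convert from MacRoman to UTF-8 in one step
raw_lines = readLines(file.path("my_texts", "text1"), warn = FALSE)
contents = iconv(raw_lines, from = "macintosh", to = "UTF-8")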

How to read csv file with unknown formatting and unknown encoding in R Program? (example file provided)

I have tried my best to read a CSV file in R but failed. I have provided a sample of the file in the following Gdrive link.
Data
I found that it is a tab-delimited file by opening it in a text editor. The file is read in Excel without issues, but when I try to read it in R using the "readr" package or base R functions, it fails. Not sure why. I have tried different encodings like UTF-8, UTF-16, UTF-16LE. Could you please help me write the correct script to read this file? Currently, I am converting the file in Excel to a comma-delimited file in order to read it in R. But I am sure there must be something that I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters being provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading in files with unusual encodings into R. In your case this seems to do the trick:
read.delim(con <- file("temp.csv", encoding = "UCS-2LE"))
(adapted from R: can't read unicode text files even when specifying the encoding).
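If you prefer readr, a roughly equivalent call declares the encoding via locale; for a file like this, UCS-2 LE can be read as UTF-16LE (a sketch, assuming the tab delimiter found above):
readr::read_tsv("temp.csv", locale = readr::locale(encoding = "UTF-16LE"))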
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
In terms of your second question (could we build the same logic in R to guess the encoding and delimiter, and read in many types of file without us explicitly saying what they are?): yes, this would certainly be possible. Whether it is desirable I'm not sure.
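For the encoding-guessing part, readr already offers a building block you could combine with a reader; a rough sketch (file name as above, and no guarantee the top guess is right):
enc <- readr::guess_encoding("temp.csv") # tibble of candidate encodings with confidence scores
readr::read_tsv("temp.csv", locale = readr::locale(encoding = enc$encoding[1]))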

Best way to fix UTF-8 encoding in my R package's functions

I'm trying to deal with UTF-8 encoding in my R package. My R version is 3.4.4 on Windows.
My package is composed of some functions that print to the console and draw graphs, which need UTF-8 encoding (French text).
I tried adding these lines to my R script (at the beginning of the script containing my function, and inside my function), but the printing comes out like this: "Répartition de la différence"
Sys.setlocale("LC_CTYPE","french")
options(encoding = "UTF-8")
In another script, after loading my package, I also added these few lines, but I have the same encoding problem ...
Any ideas ?
You can add a line specifying Encoding: UTF-8 in your DESCRIPTION file.
See https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Character-encoding-issues
If the DESCRIPTION file is not entirely in ASCII it should contain an
‘Encoding’ field specifying an encoding. This is used as the encoding
of the DESCRIPTION file itself and of the R and NAMESPACE files, and
as the default encoding of .Rd files. The examples are assumed to be
in this encoding when running R CMD check, and it is used for the
encoding of the CITATION file. Only encoding names latin1, latin2 and
UTF-8 are known to be portable. (Do not specify an encoding unless one
is actually needed: doing so makes the package less portable. If a
package has a specified encoding, you should run R CMD build etc in a
locale using that encoding.)
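For example, an abridged DESCRIPTION carrying the field might look like this (package name and version are placeholders; a real file needs more fields):
Package: mypackage
Version: 0.1.0
Encoding: UTF-8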
Please let me know if it solves your issue.

How to save a dataframe as a .csv file with UTF-8 encoding and LF line ending in R using Rstudio?

I came across this weird situation:
I need to save a dataframe to a .csv file UTF-8 and with a LF ending. I'm using the latest version of R and Rstudio on a Windows 10 machine.
My first attempt was to do naively:
write.csv(df, file = "filename.csv", fileEncoding = "UTF-8", eol = "\n")
Checking with Notepad++, it appears the encoding is UTF-8; however, the line ending is CRLF and not LF. OK, let's double-check with Notepad: surprise, surprise, according to Notepad the encoding is ANSI. At this point I'm confused.
After looking at the docs for the function write.csv I read that:
CSV files do not record an encoding
I'm not an expert on the topic, so I decide to revert back and save the file as a simple .txt using write.table as follows:
write.table(df, file = "filename.txt", fileEncoding = "UTF-8", eol = "\n")
again, the same result as above. No changes whatsoever. I tried the combinations
write.csv(df)
write.table(df)
without specified encodings, but no change. Then I set the default text encoding in RStudio to UTF-8 and the line ending to LF (Tools - Global Options - Code - Saving)
and ran the tests again. No change. What am I missing??
This is an odd one, at least for me. Nonetheless, by reading the docs of write.table I found the solution. Apparently on Windows, to save files Unix-style you have to open a binary connection to a file and then save the file using the desired eol:
f <- file("filename.csv", "wb")
write.csv(df, file=f, eol="\n")
close(f)
As far as the UTF-8 format is concerned, global settings should work fine.
Check that the eol is LF using Notepad++. UTF-8 is harder to check, since on Linux isutf8 (from moreutils) says the files are indeed UTF-8 but Windows' Notepad disagrees when saving and says they are ANSI. If the data contains only ASCII characters this is expected: an ASCII-only file is byte-for-byte identical in ANSI and in UTF-8 (without a BOM), so Notepad labels it ANSI.
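If you would rather check from within R than with an external editor, base R's validUTF8() offers a rough sanity check (file name as above; this only tests whether the bytes are valid UTF-8, not which encoding was intended):
# read the file's raw bytes and test whether they form valid UTF-8
bytes <- readBin("filename.csv", what = "raw", n = file.size("filename.csv"))
validUTF8(rawToChar(bytes))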

R changing names when there is ä ü ö

OK, this is an extremely annoying problem and I was not able to find a solution on the internet, so I come to you.
When importing data sets that contain German names with umlauts (ä, ö, ü), R modifies the names: something like Möhlin -> M<f6>hlin.
When writing code, words containing umlauts cause no problems until the script is saved. After reloading a saved script, all my beloved umlauts are modified: all the names of my plots, the names of my variables, etc.
Can anyone help me, please?
Try setting the locale:
Sys.setlocale(category = "LC_ALL", locale = "German")
Try changing default codepage to UTF-8 in RStudio via:
Tools - Global Options - Code - Saving - Default Text Encoding - UTF-8
then restart RStudio and save and reopen your script with umlauts.
I'd just try to make sure all your files are UTF-8 encoded, i.e. that they know their umlauts.
Thus, when writing and reading files, try to always explicitly set the file encoding to "UTF-8".
For instance, when writing df to file,
write.csv(tt, "output.csv", fileEncoding = "UTF-8")
The same logic applies to read.csv(), etc.
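For instance, the matching read call (same file name):
df <- read.csv("output.csv", fileEncoding = "UTF-8")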
Note that opening files that way will only work properly when you saved them as UTF-8 in the first place.
I know that some people like to use stringr for string manipulation in general when working with non-English text, but I have never used it.
