R text encoding

The R Data Import/Export manual says that a good way to guess the encoding of a text file is to use the "file" command-line tool (available in Rtools). How would one use this? I already have the newest version of Rtools installed. Is this something I can do from my R session, or do I need to open up the command prompt?

In the context of the R Data Import/Export manual, I interpret it as running file at a command prompt.
However, you can invoke a system command from R with the system() function. For example, if I have a file called mpi.R in the current directory, I can do:
> foo <- system('file mpi.R', intern=TRUE, ignore.stdout=FALSE, ignore.stderr=TRUE, wait=TRUE)
> print(foo)
[1] "mpi.R: ASCII text"

The "command prompt" here refers to a "Terminal" window (OS X or Linux) or "Command Prompt" (Windows). From these you have access to the command-line file utility, which, as the manual states, provides a good description of the type and format of (text) files.
You can also run this straight from R, using the system() function to pass the call to file. For example, on my system, in the current working directory I have three text files:
> list.files(pattern = "*.txt")
[1] "00005802.txt" "googlebooks-eng-all-totalcounts-20120701.txt"
[3] "sentences.txt"
> system("file *.txt")
00005802.txt: Par archive data
googlebooks-eng-all-totalcounts-20120701.txt: ASCII text, with very long lines, with no line terminators
sentences.txt: ASCII English text, with very long lines
Note that file may report just "ASCII text" when a file contains only the first 128 ASCII characters; such a file is also valid UTF-8, since the two encodings map those 128 characters to identical single bytes.
Also, file is not always right: for instance, 00005802.txt above is in fact UTF-8 encoded text that I converted from a PDF using pdftotext.
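The ASCII/UTF-8 point is easy to verify at the byte level. A quick illustration (shown in Python rather than R, since it makes the raw bytes easy to inspect; the byte values themselves are language-independent):

```python
# Pure ASCII text encodes to exactly the same bytes under ASCII and UTF-8,
# so file reporting "ASCII text" is also a valid description of a UTF-8 file.
s = "plain ASCII text"
assert s.encode("ascii") == s.encode("utf-8")

# Non-ASCII characters are where encodings diverge:
# UTF-8 uses two bytes for U+00E9, Latin-1 uses one.
print("é".encode("utf-8"))    # b'\xc3\xa9'
print("é".encode("latin-1"))  # b'\xe9'
```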
Also beware that on most Windows platforms, you cannot set your system locale to UTF-8 in R. Check your current locale with Sys.getlocale(); to set it, use Sys.setlocale().

Related

Which is the correct encoding for a degree character?

I have a line of code that alters text
temperature<-as.numeric(gsub("°.*","",temp))
R does not like the "°" character. When I save the file it says I need to use a different encoding.
I have tried all sorts of different encodings from the list, but they all save the code in some variation of
temperature<-as.numeric(gsub("??.*","",temp))
My current solution is to open the script in Notepad and copy-paste the code into RStudio. Which encoding do I need to save a ° in RStudio?
The full solution in RStudio was to go to File -> Save with Encoding -> select ISO-8859-1 -> check the box "Set as default encoding for source files". Now the file opens properly with the degree character every time.
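The underlying issue is that the degree sign has different byte representations in ISO-8859-1 and UTF-8, so a file saved in one encoding and read in the other gets mangled. A small demonstration (in Python, purely to show the bytes; the same bytes apply when R or RStudio reads the script):

```python
# The degree sign exists in both encodings, but with different byte sequences.
deg = "°"
print(deg.encode("latin-1"))  # b'\xb0'      (ISO-8859-1: one byte)
print(deg.encode("utf-8"))    # b'\xc2\xb0'  (UTF-8: two bytes)

# Reading UTF-8 bytes as if they were Latin-1 produces the classic "Â°" mojibake:
print(deg.encode("utf-8").decode("latin-1"))  # 'Â°'
```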

Best way to fix encoding UTF-8 in my function's package R

I am trying to deal with UTF-8 encoding in my R package. My R version is 3.4.4 on Windows.
My package is composed of functions that print to the console and draw graphs, and these need UTF-8 encoding (French text).
I tried adding the following lines to my R script (at the beginning of the script containing my function, and inside the function itself), but the output still looks like "Répartition de la différence":
Sys.setlocale("LC_CTYPE","french")
options(encoding = "UTF-8")
In another script, after loading my package, I also added these few lines, but I have the same encoding problem.
Any ideas?
You can add a line specifying Encoding: UTF-8 in your DESCRIPTION file.
See https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Character-encoding-issues
If the DESCRIPTION file is not entirely in ASCII it should contain an
‘Encoding’ field specifying an encoding. This is used as the encoding
of the DESCRIPTION file itself and of the R and NAMESPACE files, and
as the default encoding of .Rd files. The examples are assumed to be
in this encoding when running R CMD check, and it is used for the
encoding of the CITATION file. Only encoding names latin1, latin2 and
UTF-8 are known to be portable. (Do not specify an encoding unless one
is actually needed: doing so makes the package less portable. If a
package has a specified encoding, you should run R CMD build etc in a
locale using that encoding.)
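Concretely, a minimal DESCRIPTION file carrying the field might look like this (the package name and other field values below are placeholders, not taken from the question):

```
Package: monpaquet
Title: Example Package Printing French Text
Version: 0.1.0
Description: Functions whose console output and graph labels
    contain accented French text.
Encoding: UTF-8
```

After adding the field, rebuild and reinstall the package (R CMD build / R CMD INSTALL) in a locale that uses that encoding, as the quoted passage advises.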
Please let me know if it solves your issue.

How to save a dataframe as a .csv file with UTF-8 encoding and LF line ending in R using Rstudio?

I came across this weird situation:
I need to save a dataframe to a .csv file UTF-8 and with a LF ending. I'm using the latest version of R and Rstudio on a Windows 10 machine.
My first attempt was to do naively:
write.csv(df, fileEncoding="UTF-8", eol="\n")
Checking with Notepad++, it appears the encoding is UTF-8, but the line ending is CRLF, not LF. OK, let's double-check with Notepad: surprise, surprise, the encoding, according to Notepad, is ANSI. At this point I'm confused.
After looking at the docs for the function write.csv I read that:
CSV files do not record an encoding
I'm not an expert on the topic, so I decided to step back and save the file as a simple .txt using write.table as follows:
write.table(df, fileEncoding="UTF-8", eol="\n")
Again, the same result as above. No changes whatsoever. I tried the combinations
write.csv(df)
write.table(df)
without specified encodings, but no change. Then I set the default encoding in RStudio to UTF-8 with LF line endings (as in the picture below)
and ran the tests again. No change. What am I missing??
This is an odd one, at least for me. Nonetheless, by reading the docs of write.table I found the solution. Apparently on Windows, to save files Unix-style you have to open a binary connection to a file and then save the file using the desired eol:
f <- file("filename.csv", "wb")
write.csv(df, file=f, eol="\n")
close(f)
As far as the UTF-8 format is concerned, global settings should work fine.
Check that the eol is LF using Notepad++. UTF-8 is harder to check: on Linux, isutf8 (from moreutils) says the files are indeed UTF-8, but Windows' Notepad disagrees when saving and calls them ANSI.
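If you don't want to trust an editor's display, you can inspect the file's raw bytes yourself: CRLF endings show up as the byte pair \r\n, and bytes that decode cleanly as UTF-8 are at least plausibly UTF-8. A small self-contained sketch (in Python; the file name and contents are made up for illustration):

```python
# Write a small sample file with LF endings and UTF-8 text in binary mode,
# then inspect the raw bytes we just wrote.
with open("sample.csv", "wb") as f:
    f.write("x,y\ncafé,1\n".encode("utf-8"))

data = open("sample.csv", "rb").read()
print("CRLF" if b"\r\n" in data else "LF only")  # binary mode preserves bare \n

# A decode that raises no error is a quick (not conclusive) UTF-8 check.
data.decode("utf-8")
print("decodes as UTF-8")
```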

Wrong text file output of special characters using UTF-8 encoding in R 3.1.2 with Mac OS X

I am having problems to write a csv file with Spanish accents, using R 3.1.2 and Mac OS X 10.6.
I cannot write words with accents into text file.
When I do:
con <- file("y.csv",encoding="UTF-8")
write.csv("Ú",con)
I get y.csv file which has the following content:
"","x"
"1","√ö"
That is, "√ö" instead of "Ú".
When using write.table the outcome is equivalent.
Encoding("Ú") is "UTF-8"
If I do write.xlsx("Ú","y.xlsx") I get y.xlsx file which successfully shows Ú.
I have also tried to convert to other encodings using iconv() with no success.
I have set default encoding "UTF-8" in RStudio and on TextEdit. When using only R (not RStudio) the problem is the same.
In RStudio the special characters appear correctly (in files), and also in the console in R.
Sys.getlocale() gives
"es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8"
In Mac OS X Terminal
file -I y.csv
gives
y.csv: text/plain; charset=utf-8
I don't see where the problem is. Any help, please?
Just came across this other question that seems like a near duplicate of this one:
Export UTF-8 BOM to .csv in R
The problem was not one of codification in R, but from TextEdit, which did not show the right characters, although I had selected UTF-8 codification in preferences. Got solved using a different editor. I was using Mac OS X 10.6.8 and TextEdit 1.6.
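The "√ö" output is a recognizable symptom: the two UTF-8 bytes of "Ú" (0xC3 0x9A) are being displayed as if they were Mac Roman, the legacy default encoding of older Mac text editors. The mojibake can be reproduced directly (illustrated in Python, which ships a mac_roman codec):

```python
# "Ú" (U+00DA) is the two-byte sequence 0xC3 0x9A in UTF-8.
raw = "Ú".encode("utf-8")
print(raw)  # b'\xc3\x9a'

# Interpreting those same bytes as Mac Roman yields the mojibake seen in TextEdit:
print(raw.decode("mac_roman"))  # '√ö'
```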
Maybe you can write words with accents, but Excel expects a different encoding. Try writing your csv with, for example, write_csv(), and open the csv with a workaround:
open Excel
then choose tab Data | Get External Data | From Text
choose your file
and in step 1 of the text import wizard, choose file origin 65001: Unicode(UTF8).
See also http://www.openforis.org/support/questions/279/wrong-characters-display-when-exporting-files-to-csv-from-collect
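The BOM approach from the linked question works because Excel treats a leading byte-order mark (the bytes EF BB BF) as a signal that the file is UTF-8; without it, Excel assumes a legacy code page. In R, readr::write_excel_csv() writes such a BOM. The effect on the bytes can be illustrated (in Python, where the "utf-8-sig" encoding prefixes the BOM automatically):

```python
import csv

# "utf-8-sig" writes the UTF-8 byte-order mark EF BB BF before the data.
with open("bom.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["Ú", "café"])

# Excel uses this BOM to detect UTF-8 when opening the file directly.
assert open("bom.csv", "rb").read().startswith(b"\xef\xbb\xbf")
print("BOM present")
```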

Converting .Rd file to plain text

I'm trying to convert R documentation files (extension .Rd) into plain text. I am aware that RdUtils contains a tool called Rdconv, but as far as I know it can only be used from the command line. Is there a way to access Rdconv (or a similar conversion tool) from within an R session?
Try
tools::Rd2txt("path/to/file.Rd")
You can always invoke a system command, e.g. with the system2() function:
input <- '~/Projekty/stringi/man/stri_length.Rd'
output <- '/tmp/out.txt'
system2('R', paste('CMD Rdconv -t txt', input, '-o', output))
readLines(output)
## [1] "Count the Number of Characters"
## ...
Make sure that R is in your system's search path. If not, replace the first argument of system2() above with the full path, e.g. C:\Program Files\R\3.1\bin\R.exe.
