UTF-8 with R Markdown, knitr and Windows - r

What?
An .Rmd file is error-free rendered via knitr (or rmarkdown) within from Linux. Related material (i.e. child R scripts and CSV input data) is all set in UTF-8.
Executing the same script from within Windows (actually the script is inside a cloned git repository), does not render all characters cleanly, since it's set to Windows-1252.
Examples
For example, the string "sans réserves", sourced from a CSV into some data.frame's column content, is typeset as "sans réserves". To read this one correctly, it suffices to add encoding='UTF-8' to read.csv, obviously while reading-in the data.
Another example, that concerns an entry among other R code lines, is the string "Trésorier Général". It is typeset as "Trésorier Général". Fortunately, the following advice
read_chunk(lines = readLines("TestSpanishText.R", encoding = "UTF-8"))
taken from https://stackoverflow.com/a/15714617/1172302, works and the string is rendered as expected.
Related
[Update] There are some related Q&As, but they are more than 2-3 years old. As well, this page https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding, points to the very issue.
Questions
Is there another, easier way to overcome this issue regarding UTF-8 and Windows, inside R? Recommendations on how to approach such a problem? I am trying to follow a one source for all principle.
ps- An interesting reading: https://superuser.com/a/221602/128768

Related

Reading diacritics in R

I have imported several .txt files (texts written in Spanish) to RStudio using the following code:
content = readLines(paste("my_texts", "text1",sep = "/"))
However, when I read the texts in RStudio, they contain codes instead of diacritics. For example, I see the code <97> instead of an "ó" or the code <96> instead of an "ñ".
I have realized also that if the .txt file was originally written using a computer configured in Spanish, I don't see the codes but the actual diacritics. And if the texts were written using a a computer configured in English, then I do get the codes (even though when opening the .txt file on TextEdit I see the diacritics).
I don't know why R displays those symbols and what I can do to retain the diacritics I see in the original .txt files.
I read I could possibly solve this by changing the encoding to UTF-8, so I tried this:
content = readLines(paste("my_texts", "text1",sep = "/"), encoding = "UTF-8")
But that didn't work. Any ideas what those codes are and how to keep my diacritics?
As you figured out, you need to set the correct encoding. Unfortunately the text file was written using a legacy encoding rather than UTF-8 — namely, MacRoman. Ideally the application producing the file would not use this encoding, and Apple products by default no longer produce it.
But since this is what you’ve got, we have to deal with it, and we can. But unfortunately we need to go a detour because the encoding argument of readLines is a bit useless. Instead, we need to manually open a file connection:
con = file(file.path("my_texts", "text1"), encoding = "macintosh")
on.exit(close(con)) # Always make sure to close connections!
contents = readLines(con)
Do note that the encoding name “macintosh” is strictly speaking not portable, so this might not work on all platforms.

How does R (RGui) parse multiline character strings?

The RGui (Windows; R version 3.5.3) appears to ignore tab characters that occur at the beginning of a line within a character string (press CTRL+R over the lines of code):
# REPLACE "<TAB>" WITH AN ACTUAL TAB CHARACTER TO GET THE CODE INTENDED BELOW.
foo <- 'LINE1
<TAB>LINE2
<TAB>LINE3
'
foo
# [1] "LINE1\nLINE2\nLINE3\n"
longstring <- removetabsatbeginningoflines('
<TAB>Sometimes I have really long strings that I format
<TAB>so that they read nicely (not with too long of a
<TAB>line length). Tabs at the beginning of the lines
<TAB>within a string preserve my code indenting scheme
<TAB>that I use to make the code more readable. If the
<TAB>tabs are not removed automatically by the parser,
<TAB>then I need to wrap the string in a function that
<TAB>removes them.')
The tab characters are preserved when the above code is source'd from a file.
Why doesn't RGui keep the tab characters?
Where is this behavior documented?
What other non-intuitive, related behaviors does RGui have with regard to parsing (multiline) strings?
I can't find anywhere that this is documented, so this is a worthwhile question to publicly answer.
If you are using RGui's built-in 'R Editor', then all Tab characters entered via the Tab key, or already existing in a text file that you have opened into the 'R Editor', will not be respected when submitting using Ctrl-R (This is difficult to represent here in an example given that tabs are stripped from answers).
I imagine the 'R Editor' is not meant to be used for serious code editing, and you may be better off using a dedicated IDE (e.g. RStudio) or more full-featured editor (e.g. Emacs, Notepad++).
You can route around this issue in RGui by manually replacing tabs as \t when editing in RGui, but this may not be appropriate if you want to keep actual Tabs in your file. Tabs will also be correctly processed when using source() to directly run the code stored in the text file.

How to extract a database from a text file (word or libreoffice) with styles and content

I ask my question after searching an answer on stackoverflow and on the web, without success.
I'm sorry if there is already an answer somewhere.
Global objective
I aim to create my questionnaires in libreoffice ( I need to print it, it's not for an online survey), and secondly to use it in a R shiny app I've created for register the collected answers and to export the data.
I want to create the fields in R (questions, answers...) automatically from the styles of my questionnaires in .odt, .docx or others formats.
I need to have well formatted questionnaires, nice-looking.
There is the problem:
I have written a questionnaire on a libreoffice .odt file (or if necessary in microsoft word).
I uses styles for different text blocks: one style for the "questions", one for the "answer", one for the parts of the questionnaire, one for the "instructions"...
I want to get a database ( in .csv format) with one column with the styles, and one column with the text content.
Solutions?
I try to open the xml files in the .odt or .docx archives, but the conversion to a simpler and readable format seems quite difficult.
Is it possible to export a toc from libreoffice or word to a spreadsheet format?
R can read in such files (.odt or .dox, or.xml) ?
Thank you very much for your ideas, and more generaly for your feedbacks on my project.
I'm sorry for my english
I would recommend using .Rmd (for rmarkdown) or .Rnw (for knitr) files as the source for your questionaires, rather than starting with .odt or .docx. You can produce output in various formats, including .docx, .pdf, .html (only .pdf for .Rnw) to display the questionaire to the subjects, but you can also develop functions to manage the data, or even interactive displays to collect and record the data.
I'm not familiar with R packages that do all of this for you, but I expect they already exist. Maybe someone else will give an answer with more details.
You might explore using the .fodt format in libreOffice Writer. That format is an "unzipped" version of the Writer xml format, so could be directly readable by xml utilities (and probably R, with appropriate libraries). I note that for another answer you seemed to want to avoid markdown or knitr composition, and .fodt would provide a "text" format completely compatible with LibreOffice as a front end.
(Note the other parts of LibreOffice have "flat" versions, so you could, in theory, process text versions of spreadsheets, graphics, and presentation files in your R utility.)
A few web searches indicates some relevant libraries and utilities for R exist, which may get you closer to what you need for your project.

How to use UTF-8 characters in dygraph title in R

Using Rstudio [Windows8], when I use the dygraph function to plot a time series, I have a problem when trying to use UTF-8 characters in the main title.
library(dygraphs)
dygraph(AirPassengers, main = "Título")
This results in a title: "T?tulo"
I have tried to convert "Título" to the utf-8 enconding, but it doesn't work.
You can use enc2utf8.
dygraph(AirPassengers, main = enc2utf8("Título"))
You need to make sure your locale settings support the character that you want to use, and that the file is saved with the right encoding. Saving as UTF-8 worked for me.
I was able to replicate your situation in Windows 7 and tried a bunch of things. Embedded in Rmarkdown, here is a minimal working example.
```{r}
Sys.setlocale("LC_ALL","German")
#note that windows locale names are different from unix & mac, usually
#the name of nationality works here.
#also works with "Faroese", "Hungarian", and others who have this letter.
#the locale has to be set in a preceding block to take effect.
```
```{r}
Encoding("Título")
library(dygraphs)
dygraph(AirPassengers, main = "Título")
```
You can try out the encoding given to the title to have with Encoding(). Languages like Faroese, Hungarian, and German encode "Título" as latin or unknown, both of which seem to cause no problems for dygraph's javascript. UTF-8 wrote it as <U+00ED> which was a problem for the javascript, as well as some, but not all other functions. With a matching locale, converting to utf-8 as #Michele recommended has the same result.
Also, if you don't have the title in many places, it is possible to just manually find and replace the title in the html/javascript file that is made. The problem occurs on conversion, but if the file is already made, the title variable can be successfully changed. The letter still has a question mark in Rstudio "Viewer" output, but I recommend making the entire file for javascript regularly, as I've seen other functions malfunction in the viewer window.

Create and save R's default codebooks as a pdf

If I load data(mtcars) it comes with a very neat codebook that I can call using ?mtcars.
I'm interested to document my data in the same way and, furthermore, save that neat codebook as a pdf.
Is it possible to save the 'content' of ?mtcars and how is it created?
Thanks, Eric
P.S. I did read this thread.
update 2012-05-14 00:39:59 PDT
I am looking for a solution using only R; unfortunately I cannot rely on other software (e.g. Tex)
update 2012-05-14 09:49:05 PDT
Thank you very much everyone for the many answers.
Reading these answers I realized that I should have made my priorities much clearer. Therefore, here is a list of my priorities in regard to this question.
R, I am looking for a solution that is based exclusively on R.
Reproducibility, that the codebook can be part of a automated script.
Readability, the text should be easy to read.
Searchability, a file that can be open with any standard software and searched (this is why I thought pdf would be a good solution, but this is overruled by 1 through 3).
I am currently labeling my variables using label() from the Hmisc package and might end up writing a .txt codebook using Label() from the same package.
(I'm not completely sure what you're after, but):
Like other package documentation, the file for mtcars is an .Rd file. You can convert it into other formats (ASCII) than pdf, but the usual way of producing a pdf does use pdflatex.
However, most information in such an .Rd file is written more or less by hand (unless you use yet another R package like roxygen/roxygen2 help you to generate parts of it automatically.
For user-data, usually Noweb is much more convenient.
.Rnw -Sweave-> -> .tex -pdflatex-> pdf is certainly the most usual way with such files.
However, you can use it e.g. with Openoffice (if that is installed) or use it with plain ASCII files instead of TeX.
Have a look at package knitr which may be easier with pure-ASCII files. (I'm not an expert, just switching over from Sweave)
If html is an option, both Sweave and knitr can work with that.
I don't know how to get the pdf of individual data sets but you can build the pdf of the entire datasets package from the LaTeX version using:
path <- find.package('datasets')
system(paste(shQuote(file.path(R.home("bin"), "R")),"CMD",
"Rd2pdf",shQuote(path)))
I'm not sure on this but it only makes sense you'd have to have some sort of LaTeX program like MikTex. Also I'm not sure how this will work on different OS as mine is windows and this works for me.
PS this is only a partial answer to your question as you want to do this for your data, but if nothing else it may get the ball rolling.
The help page that is displayed when entering ?mtcars is generated from an .Rd file, which is a LaTeX-like file that is used for all of R's help pages. Although .Rd files are LaTeX-like, you don't actually need to know LaTeX to read or write them. The actual mtcars.Rd file is available here: http://commondatastorage.googleapis.com/jthetzel-public/mtcars.Rd , which can be viewed with any text editor.
.Rd files included in the ./man directory of a package are converted to .html files when installing the package. They are converted by functions in the "tools" package.. If you would like functionality like ?mtcars for your datasets, you would need to create a package for them. That might sound complicated if you have never created a package before, but it is easy enough to learn and will make you a better R programmer. There are a number of examples of dataset-only packages on CRAN, for example msProstate: http://cran.r-project.org/web/packages/msProstate/index.html . Consider downloading the package source to see how it is organized.
For more information on creating your own packages, writing .Rd files, and building packages:
http://cran.r-project.org/doc/manuals/R-exts.html, especially "1.1.5 Data in packages".
Edit
And if you want to convert the .Rd file in your package to a .pdf, you can do so when building your package, but you will need a LaTeX compiler. If you are on Windows, see here: http://cran.r-project.org/bin/windows/Rtools/ .
You can't create a PDF with just R; you need to use other software that creates PDFs.
You could use a combination of utils::promptData, tools::Rd2HTML, and a simple custom function to open the created HTML file in the users' browser.
It would probably be easier to just make a package containing your data sets. Look at the "datasets" package for an example.
It looks like that if you want to generate a pdf, an external tool like LaTeX is always needed. I would recommend using a simple ASCII text format to generate such a file. In principle the .Rd files are also ASCII text, but I do not find them particularly readable.
Instead, I would recommend using a plain text ASCII format such as Markdown (which is e.g. used on StackOverflow) to write the text file. Such a file is already much more readable than an .Rd formatted file, and as a bonus it can quite easily be processed into a PDF should you choose to do so later on. The knitr package I think is capable of generating PDF files from Markdown sources. In addition, knitr allows you to mix in R code in the Markdown text. This code can be evaluated and the results (even figures) added to the resulting PDF.
In practice you can use sprintf to generate character vectors that you can pipe to a file in order to dynamically generate the markdown text. Just write the template one time, and mark the places for the text you want to add later like this:
base_text = "
First header
============
This document was generated on %s, by %s.
"
text_forfile = sprintf(text, some_date, some_name)
Just dump the text in text_forfile to a .md file and your done, no external tools needed. See this post on SO for how dump text to a file.

Resources