Non-English special characters in knitr

I am using knitr 1.1 in R 3.0.0 within WinEdt (RWinEdt 2.0). I am having problems with knitr recognizing Swedish characters (ä, ö, å). This is not an issue with R; those characters are even recognized in file names, directory names, objects, etc. In Sweave it was not a problem either.
I already have \usepackage[utf8]{inputenc} in my document, but knitr does not seem able to handle the special characters. After running knitr, I get the following message:
Warning in remind_sweave(if (in.file) input) :
It seems you are using the Sweave-specific syntax; you may need Sweave2knitr("deskriptiv 130409.Rnw") to convert it to knitr
processing file: deskriptiv 130409.Rnw
(*) NOTE: I saw chunk options "label=läser_in_data"
please go to http://yihui.name/knitr/options (it is likely that you forgot to
quote "character" options)
Error in parse(text = str_c("alist(", quote_label(params), ")"), srcfile = NULL) :
1:15: unexpected input
1: alist(label=lä
^
Calls: knit ... parse_params -> withCallingHandlers -> eval -> parse
Execution halted
The particular label it complains about is label=läser. Changing the label is not enough, since knitr even complains if R objects use äåö.
I used Sweave2knitr() since the file originally was created for Sweave, but the result was not better: now all äåö have been transformed to äpåö, both in the R chunks and in the latex text, and knitr still gives an error message.
Session info:
R version 3.0.0 (2013-04-03)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Swedish_Sweden.1252 LC_CTYPE=Swedish_Sweden.1252 LC_MONETARY=Swedish_Sweden.1252
[4] LC_NUMERIC=C LC_TIME=Swedish_Sweden.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.1
loaded via a namespace (and not attached):
[1] digest_0.6.3 evaluate_0.4.3 formatR_0.7 stringr_0.6.2 tools_3.0.0
As I mentioned there are file names and objects with Swedish characters (since that has not been a problem before), and also the text needs to be in Swedish.
Thank you for any help in getting knitr to work outside of English.

I think you have to contact the maintainer of the R-Sweave mode in WinEdt if you are using this mode to call knitr. The issue is WinEdt has to pass the encoding of the file to knit() if you are not using the native encoding of your OS. You mentioned UTF-8 but that is not the native encoding for Windows, so you must not use \usepackage[utf8]{inputenc} unless you are sure your file is UTF8-encoded.
There are several problems mixed up here, and a single answer is unlikely to solve them all.
The first problem is label=läser, which really should be label='läser', i.e. you must quote all the chunk labels (check other labels in the document as well); knitr tries to automatically quote your labels when you write <<foo>>= (it is turned to <<'foo'>>=), but this does not work when you use <<label=foo>>= (you have to write <<label='foo'>>= explicitly). But this problem is perhaps not essential here.
I think the real problem here is the file encoding (which is nasty under Windows). You seem to be using UTF-8 under a system that does not respect UTF-8 by default. In this case you have to call knit('yourfile.Rnw', encoding = 'UTF-8'), i.e. pass the encoding to knit(). I do not use WinEdt, so I have no idea how to do that. You can hard-code the encoding in the configurations, but that is not recommended.
Two suggestions:
do not use UTF-8 under Windows; use your system native encoding (Windows-1252, I guess) instead;
or use RStudio instead of WinEdt, which can pass the encoding to knitr;
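Both suggestions presuppose knowing which encoding the file is actually saved in. One way to check, sketched here in base R, is to test whether the file's bytes form valid UTF-8. The helper name file_is_utf8 is hypothetical (it is not part of knitr), and the check is only a heuristic: a pure-ASCII file also passes, and validUTF8() requires R 3.3.0 or later.

```r
# Hypothetical helper (not part of knitr): TRUE when the file's bytes are valid UTF-8.
file_is_utf8 <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  validUTF8(rawToChar(bytes))
}

# Two small test files containing "läser": one in UTF-8, one in Latin-1/Windows-1252.
utf8_file   <- tempfile(fileext = ".Rnw")
latin1_file <- tempfile(fileext = ".Rnw")
writeBin(charToRaw(enc2utf8("l\u00e4ser")), utf8_file)
writeBin(as.raw(c(0x6c, 0xe4, 0x73, 0x65, 0x72)), latin1_file)

utf8_result   <- file_is_utf8(utf8_file)    # ä stored as the two bytes C3 A4
latin1_result <- file_is_utf8(latin1_file)  # ä stored as the single byte E4

unlink(c(utf8_file, latin1_file))
```

If the check returns TRUE, passing encoding = 'UTF-8' to knit() is the consistent choice; otherwise the system's native encoding is the more likely candidate.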
BTW, since the Sweave2knitr() hint popped up, there must be other problems in your Rnw document. To diagnose them, there are two ways to go:
if you use UTF-8, run Sweave2knitr('deskriptiv 130409.Rnw', encoding = 'UTF-8')
if you use the native encoding of your OS, just run Sweave2knitr('deskriptiv 130409.Rnw')
Please read the documentation if you have questions about the diagnostic information printed out by Sweave2knitr().

R-Sweave invokes knitr through the knitr.edt macro, which itself uses the code in knitrSweave.R to launch knit. The knit command in this latter script is near the top and reads res <- knit(filename).
Following Yihui's suggestion, you can try to replace this command with
res <- knit(filename, encoding = 'UTF-8')
The knitr.edt and knitrSweave.R files should be in your %b\Contrib\R-Sweave folder, where %b is your WinEdt user folder (something like "C:\Users\userA\AppData\Roaming\WinEdt Team\WinEdt 7" under Win 7).
Currently, I do not know how we could pass the encoding as an argument to avoid this hard coding solution.
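One possible direction, offered only as a sketch: if the macro could hand the encoding to the script as an extra command-line argument (an assumption; whether knitr.edt can do this is not established here), the script could read it instead of hard-coding it. The helper parse_knit_args is hypothetical and not part of knitr or R-Sweave.

```r
# Hypothetical helper: the first argument is the file name, the optional second
# argument is the encoding. Falls back to R's "encoding" option when absent.
parse_knit_args <- function(args, default_encoding = getOption("encoding")) {
  if (length(args) < 1L) stop("expected at least a file name")
  list(
    filename = args[[1L]],
    encoding = if (length(args) >= 2L) args[[2L]] else default_encoding
  )
}

# In the launcher script the arguments would come from
# commandArgs(trailingOnly = TRUE); they are spelled out here for illustration.
with_encoding    <- parse_knit_args(c("deskriptiv 130409.Rnw", "UTF-8"))
without_encoding <- parse_knit_args("deskriptiv 130409.Rnw")

# The call in knitrSweave.R would then become something like:
# res <- knit(with_encoding$filename, encoding = with_encoding$encoding)
```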
I would suggest avoiding extended characters in file names, which can only be a source of problems. Personally, I never use such names.

Related

Problem with spell checking packages in R

I'm trying to check spelling some words in Russian using "hunspell" library in R.
bad_words <- hunspell("Язвенная болзень", dict='ru_RU.dic')
I have installed Russian dictionary, from here: https://code.google.com/archive/p/hunspell-ru/
It has encoding UTF-8. However, I have following error:
Failed to convert line 1 to ISO8859-1 encoding. Try spelling with a UTF8 dictionary.
It seems strange: neither the dictionary nor the R file is encoded in ISO8859-1...
What is the problem?
If you are operating on Windows, my first guess would be that this is related to the lack of native UTF-8 support in R on Windows. This will be resolved when R 4.2 is released; you might wish to try the development release and see whether the problem persists.
Another thing to check is whether your DESCRIPTION file contains the line Encoding: UTF-8, such that your source files are treated as having this encoding.
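Before blaming the dictionary, it can help to confirm what R itself believes about the session and the string. A minimal diagnostic in base R, assuming the text is entered with \u escapes so that the check does not depend on how the script file is saved (the variable names are illustrative):

```r
# "Язвенная болзень" written with Unicode escapes, so the literal is UTF-8
# regardless of the encoding of the source file.
words <- paste0(
  "\u042f\u0437\u0432\u0435\u043d\u043d\u0430\u044f",
  " ",
  "\u0431\u043e\u043b\u0437\u0435\u043d\u044c"
)

session_is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])  # is the session locale UTF-8?
marked_as       <- Encoding(words)                 # encoding mark on the string
is_valid        <- validUTF8(words)                # are the bytes valid UTF-8?
n_chars         <- nchar(words, type = "chars")
n_bytes         <- nchar(words, type = "bytes")    # Cyrillic letters take 2 bytes each
```

If session_is_utf8 is FALSE, strings may be converted to the native code page on their way into a package, which would be consistent with the ISO8859-1 message in the question.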

RMarkdown: UTF-8 works with Knit button but not with render()

I'm working in RMarkdown, trying to render a document that has some UTF-8 characters in it. When I push the "Knit" button in RStudio, everything works great. But when I use the render() function, the UTF-8 gets lost. Here's a short snippet of reproducible code:
---
output: html_document
---
Total nitrogen (µg/L)
Water temperature (°C)
Pushing the Knit button gives me the correct output, whether I view it in RStudio or in Chrome. But if I render the file with render(), I get:
Total nitrogen (Âµg/L)
Water temperature (Â°C)
I'm working in Windows, which may be the source of much of the problem. Here's my locale info.
Sys.getlocale("LC_ALL")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
I've tried adding a code chunk with "options(encoding = 'UTF-8')" but it doesn't help. I'm using pwalk() to generate 36 reports automatically with different parameters, so I need to get this working with render().
You can force encoding:
render("test.Rmd", encoding = "UTF-8")
You can also set the encoding from your R console:
options(encoding = 'UTF-8')
render("test.Rmd")
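The kind of garbling that motivates forcing the encoding can be reproduced in base R: it is what results when UTF-8 bytes are decoded as a single-byte encoding. In this sketch latin1 stands in for Windows-1252 (an assumption made for portability; the two agree on the bytes involved), and the variable names are illustrative.

```r
original <- "Total nitrogen (\u00b5g/L)"

# The micro sign is stored as two bytes (C2 B5) in UTF-8.
utf8_bytes <- charToRaw(enc2utf8(original))

# Decode those same bytes as if they were Latin-1: each byte becomes its own
# character, so a stray "Â" (byte C2) appears before the micro sign.
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

extra_chars <- nchar(misread, type = "chars") - nchar(original, type = "chars")
```

The misread string is one character longer than the original, which is the signature of a two-byte UTF-8 sequence being read as two single-byte characters.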
I considered posting this as a comment, since it does not necessarily answer your question, but it is too long for one.
Firstly, using the knit button in RStudio does call render, so all things being equal, both running from the console and via the GUI will produce the same output.
your environment
An important note from jjallaire in an old closed issue on Github:
when RStudio calls render it is in a fresh R process rather than in the global environment of the current session (which is what you would get when calling render at the console)
A good question that provides context is here.
initial conclusion
If the document renders correctly using the GUI button and not from the console, then there is something in your environment causing the encoding to be read incorrectly.
Try from a clean session; if it still produces the same output, that would suggest an issue in the environment at startup. Check the encoding...
getOption("encoding")
# [1] "native.enc"
Instead of placing options(encoding = "UTF-8") in a code chunk, execute it before your render statement. You can check that it has changed by running the getOption as above again and confirm it now returns # [1] "UTF-8"
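A minimal sketch of that sequence in base R. The render() call is left as a comment because the report file name is hypothetical; restoring the old value afterwards is a precaution added here, since the option also sets the default encoding for file connections opened later in the session.

```r
# options() returns the previous values, so they can be restored afterwards.
old <- options(encoding = "UTF-8")
during <- getOption("encoding")

# rmarkdown::render("report.Rmd")   # hypothetical file name; call render here

options(old)
after <- getOption("encoding")
```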

Why does RMarkdown `render` behavior depend on whether it's called from RStudio Server or from a PHP shell?

I have an RMarkdown document that includes 'special characters', such as ë. If I render the document using RStudio Server's "knit document" button, it renders fine. When I render it by using the RStudio Server button to source another R script that calls RMarkdown's render function, it also renders fine.
However, for some reason that's beyond me (but hopefully not for long), I get different results when that same R script is called by index.php using:
$results = shell_exec("R --file='/home/username/public_html/some/subdirectories/process.R' --no-save 2>&1");
When I do this, in the resulting .html file, the special symbols (I guess the unicode symbols) are replaced by <U+00EB>. I've tried to look up whether this is some kind of variation of HTML elements that I didn't know about yet, but I have been unable to find anything about this.
(Note: any link to a place where I can learn more about this, and about why my browser doesn't show it as the ë it represents, is also greatly appreciated!)
Reproducible example
Contents of example.php:
<?php
shell_exec("R --file='/home/username/public_html/subdirectory/example.R' --no-save 2>&1");
?>
Contents of example.R (this is what I needed on my server):
workingPath <- "/home/username/public_html/subdirectory";
### Set path to RStudio's pandoc version
Sys.setenv(PATH=paste(Sys.getenv("PATH"),
"/usr/lib/rstudio-server/bin/pandoc",
sep=":"));
### Set HOME and LANG
Sys.setenv(HOME = '/home/username');
Sys.setenv(LANG = 'en_US.UTF-8');
require(rmarkdown);
renderResults <-
render(file.path(workingPath, 'example.Rmd'),
output_file = file.path(workingPath, 'example.html'),
intermediates_dir = file.path(workingPath, 'tmp'),
encoding="UTF-8");
Contents of example.Rmd:
---
title: 'Reproducable example'
output: html_document
---
```{r}
cat("This is an ë symbol.");
```
Results of this example:
When I run this from R Studio, I get:
cat("This is an ë symbol.");
## This is an ë symbol.
When I run this from PHP, I get:
cat("This is an ë symbol.");
## This is an <U+00EB> symbol.
(note how, interestingly, the echo'ed ë does show up normally...)
I now resorted to doing a str_replace in the index.php file, but that's not ideal.
I've checked the render manual, but I can't find anything about this behavior.
I've also looked at specifying options for pandoc in the YAML header of the .Rmd file, but the only thing that seems to come close is the --ascii option, and that doesn't do anything. The R Studio RMarkdown page doesn't provide any hints, either.
Could it perhaps have to do with environment variables that are set in RStudio? I already had to set:
Sys.setenv(HOME = '/home/oupsyusr');
Sys.setenv(LANG = 'en_US.UTF-8');
in the R script to get Pandoc going in the first place when called in the R script called from the PHP shell; but if this is the problem, how can I figure out which settings RStudio sets to which values, or more accurately, which of those are important? I ran:
Sys.getenv()
From within R Studio, and that shows quite a list. I recognize none of the entries as having to do with encoding or so.
Or, does knitr cause this? When I store and inspect the .md file, the Unicode element things already show up. However, the knitr help page with chunk options doesn't say anything about unicode or encoding in general.
Does anybody know where this is documented, or does anybody happen to have encountered this situation before?
I'm running RStudio 0.99.903 and R 3.3.1 on CentOS 6.8.
Usually, issues of this form (where unicode characters are converted to a unicode code point representation, e.g. <U+00EB> in this case) are caused by an attempt to run R within a non-UTF-8 locale.
Typically, this can be verified by checking the output of Sys.getlocale("LC_ALL"). If you see a C locale reported, then you likely need to enforce a UTF-8 locale with something like:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
substituting the particular UTF-8 locale flavor based on your desired language. (For reference, the set of available locales can usually be queried from a terminal with something like locale -a).
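A hedged sketch of that check in base R. The helper use_utf8_locale and its list of candidate locale names are assumptions (available locales differ between systems), and it sets only LC_CTYPE, the category that governs character handling, rather than LC_ALL.

```r
# Hypothetical helper: try candidate UTF-8 locales until the system accepts one.
use_utf8_locale <- function(candidates = c("en_US.UTF-8", "C.UTF-8")) {
  for (loc in candidates) {
    # Sys.setlocale() returns "" (with a warning) when the locale is unavailable.
    res <- suppressWarnings(Sys.setlocale("LC_CTYPE", loc))
    if (nzchar(res)) return(res)
  }
  NA_character_  # none of the candidates exist on this system
}

before <- Sys.getlocale("LC_CTYPE")
chosen <- use_utf8_locale()
after  <- Sys.getlocale("LC_CTYPE")
```

Calling something like this at the top of the script run from PHP, before render(), would make the locale explicit instead of relying on whatever environment the web server provides.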

Input Chinese characters not correctly echoed in ESS

I have a weird encoding issue in my Emacs and R environment. With Sys.setlocale("LC_ALL","zh_CN.utf-8"); in my .Rprofile, Chinese characters display correctly everywhere except when input is echoed.
> linkTexts[5]
font
"使用帮助"
> functionNotExist()
错误: 没有"functionNotExist"这个函数
> fire <- "你好"
> fire
[1] " "
As we can see, Chinese characters contained in the vector linkTexts, Chinese error messages, and typed Chinese characters are all displayed perfectly, yet echoed input characters are shown only as blank placeholders.
sessionInfo() is here, which is as expected given the Sys.setlocale("LC_ALL","zh_CN.utf-8"); setting:
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] zh_CN.utf-8/zh_CN.utf-8/zh_CN.utf-8/C/zh_CN.utf-8/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.96-1.1
loaded via a namespace (and not attached):
[1] compiler_2.15.2 tools_2.15.2
And I have no locale settings in the .Emacs file.
To me, this seems to be an Emacs encoding issue, but I just don't know how to correct it. Any idea or suggestion? Thanks.
Your examples work for me out of the box. You can set emacs process decoding/encoding with M-x set-buffer-process-coding-system. Once you figure out what encoding works (if any) you can make the change permanent with:
(add-hook 'ess-R-post-run-hook
(lambda () (set-buffer-process-coding-system
'utf-8-unix 'utf-8-unix)))
Replace utf-8-unix with your chosen encoding.
I am not very convinced that the above will help. linkTexts in your example displays well but fire does not, so this doesn't look like an Emacs/ESS issue.
VitoshKa has made the perfectly correct suggestion. I just want to add some of my own findings here, as people may meet different but similar special-character problems that can be solved in the same way.
The root cause is the input encoding setting of the current buffer process. As shown by the M-x describe-current-coding-system command, the default buffer process coding system was correct for output (utf-8-unix) but wrong for input:
Coding systems for process I/O:
encoding input to the process: 1 -- iso-latin-1-unix (alias: iso-8859-1-unix latin-1-unix)
decoding output from the process: U -- utf-8-unix (alias: mule-utf-8-unix)
Changing the coding system for input to utf-8-unix, either with 'M-x set-buffer-process-coding-system' or by adding the ess-post-run-hook to .emacs as suggested by VitoshKa, is sufficient to solve the Chinese character display problem.
The other problem people may meet due to this setting concerns special characters in ESS. When trying to input special characters, you may get the error message 错误: 句法分析器%d行里不能有多字节字符, or, in English, invalid multibyte character in parser at line %d.
> x <- data.frame(part = c("målløs", "ny"))
错误: 句法分析器1行里不能有多字节字符
And with the correct utf-8-unix setting for input coding system of buffer process, the above error for special characters disappears.

Encoding: knitr and child files

I am using Windows 7, R 2.15.3 and RStudio 0.97.320 with knitr 1.1. I am not sure what my pandoc version is, but I downloaded it a couple of days ago.
sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Spanish_Argentina.1252 LC_CTYPE=Spanish_Argentina.1252 LC_MONETARY=Spanish_Argentina.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Argentina.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.15.3
I would like to get my reports both in html and Word, so I'm using markdown and pandoc.
I write in Spanish, with accents on vowels and tildes on the n: á-ú and ñ.
I have read many posts and I see problems similar to the one I'm having have been solved with new versions of knitr. But there is one issue I haven't found a solution for.
When I started, I used the 'system default' encoding that appears in the RStudio dialog, i.e. ISO 8859-1, and the RStudio previews worked great. However, when I tried to get Word documents, pandoc choked on the accented vowels. I found a post showing how to solve this using iconv:
iconv -t utf-8 "myfile.md" | pandoc -o "myfile.docx"| iconv -f utf-8
While this did solve pandoc's unrecognized utf-8 characters complaints, for some reason pandoc stops finding my plots, with an error like this one:
pandoc: Could not find image `figure/Parent.png', skipping...
If I use only unaccented characters, pandoc finds the images with no problems. I looked at the two .md files with a hex editor, and I can't see any difference when I compare the sections that handle the figures:
![plot of chunk Parent](figure/Parent.png)
although obviously the accented characters are completely different... I have verified that the image files do exist in the figure folder.
Anyway, after reading many posts I decided to set RStudio to use UTF-8 encoding. With only one level of files things work great. For example, I can -independently- knit and then pandoc into Word the following 2 Rmd files:
Parent - SAVED WITH utf-8 encoding in RStudio
========================================================
u with an accent: "ú" SAVED WITH utf-8 encoding in RStudio
```{r fig.width=7, fig.height=6}
plot(cars, main='Parent ú')
```
and separately:
Child - SAVED WITH utf-8 encoding in RStudio
========================================================
u with an accent: "ú" Child file
```{r fig.width=7, fig.height=6}
plot(cars, main='One File Child ú')
```
and I get two perfect previews in RStudio and two perfect Word documents from pandoc.
The problem arises when I try to call the child part from the parent part. In other words, if I add to the first file the following lines:
```{r CallChild, child='TestUTFChild.Rmd'}
```
then all the accents in the child file become garbled, as if the UTF-8 were being interpreted as ISO 8859-1. Pandoc stops reading the file as well, complaining that it's not UTF-8.
If anybody could point me in the right direction, either:
1. With pandoc not finding the plots if I stay with ISO 8859-1. I have also tried Windows-1252 because it's what I saw in the sessionInfo, but the result is the same.
or
2. With the call to the child file, if UTF-8 is the way to go. I have looked for a way of setting some option to force the encoding in the child call, but I haven't found it yet.
Many thanks!
I think this problem should be fixed in the latest development version. See instructions in the development repository on how to install the devel version. Then you should be able to choose UTF-8 in RStudio, and get a UTF-8 encoded output file.
Just in case anyone is interested in the gory details: the reason for the failure before was that I wrote the child output with the encoding you provided, but did not read it with the same encoding. Now I just avoid writing output files for child documents.
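The failure mode described here, writing with one encoding and reading with another, can be reproduced in base R. This is only an illustration of the mismatch, not knitr's actual code; the file and variable names are made up, and latin1 stands in for the Windows code page.

```r
child_out <- "u with an accent: \u00fa"
path <- tempfile(fileext = ".md")

# Write the text as UTF-8 bytes, independent of the session locale.
con <- file(path, open = "wb")
writeLines(enc2utf8(child_out), con, useBytes = TRUE)
close(con)

# Reading with the same encoding recovers the text.
same <- readLines(path, encoding = "UTF-8")

# Reading the same bytes as Latin-1 turns the two-byte "ú" into two characters.
different <- enc2utf8(readLines(path, encoding = "latin1"))

unlink(path)

round_trip_ok <- same == child_out
garbled_extra <- nchar(different, type = "chars") - nchar(child_out, type = "chars")
```

Avoiding the intermediate file, as the fix does, removes the second decoding step and with it the opportunity for a mismatch.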