RStudio does not read non-English characters in paths

I want to list files and folders containing Japanese characters in my working directory with list.files(), but when I try this, it does not show the proper file names.
For example, the "test" directory has the folders "test1", "test2" and "テスト3", and running list.files() returns unreadable characters for the one with Japanese characters, like this:
> getwd()
[1] "C:/Users/10040153/Documents/test"
> list.files()
[1] "繝<86>繧ケ繝<88>3" "test1" "test2"
What I tried
Set "Default text encoding" to UTR-8
Changed locale setting to Japanese with sys.setlocale(locale = "Japanese"), which returned [1]"LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"
Reinstalled R and RStudio
Rebooted the computer
None of these helped.
I suspect this is an issue with RStudio rather than with R itself, because the same code runs without problems in plain R. Does anybody have an idea?
System environment
Windows 10 x64
RStudio
R version 4.1.2 (2021-11-01)
Update
`Encoding<-`(list.files(), "UTF-8") solved the problem.
> `Encoding<-`(list.files(), "UTF-8")
[1] "テスト3" "test1" "test2"
I know this has something to do with encoding, but how can I make it work in the global environment?
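For example, a small wrapper can apply the same fix on every call (just a sketch; list_files_utf8 is a made-up name, not a base R function):
list_files_utf8 <- function(...) {
  files <- list.files(...)
  Encoding(files) <- "UTF-8"  # mark the returned names as UTF-8, as in the update above
  files
}
list_files_utf8()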

This is a known bug in RStudio; see https://github.com/rstudio/rstudio/issues/10451. If you're willing to try a fix, we have one in the dailies as of last week:
https://dailies.rstudio.com/

Related

Turkish Language Encoding Problem in R Studio

I am using RStudio to create plots of economic variables. In our language, when you don't use our specific letters such as "ğ, ş, ı, ü, ç", a word can mean something different, and sometimes it even becomes a swear word. I can't create graphs with these letters. I tried to use this command:
Sys.setlocale(category = "LC_ALL", locale = "Turkish")
The output is
OS reports request to set locale to "Turkish" cannot be honored
[1] ""
How can I solve this problem? Any ideas?
If you have the same problem and your system is a Mac, first open Terminal and run this command:
defaults write org.R-project.R force.LANG en_US.UTF-8
This solved it for me. I hope it works for your system, too.
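After restarting R, a quick check (just a sketch) is to confirm that the reported locale actually changed:
Sys.getlocale("LC_ALL")  # should now include a UTF-8 flavour rather than "C"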

Error in R Tesseract

I have the R Tesseract package working with the default eng.traineddata under OSX, but it simply won't find other languages.
trial <- ocr("test.png", engine = tesseract(language = "jpn", datapath="/Users/histmr/Library/R/3.3/library/tesseract/tessdata"))
Generates the error:
Failed loading language 'jpn'
Tesseract couldn't load any languages!
Error in tesseract_engine_internal(datapath, language) :
Unable to find training data for: jpn
I've checked with
tesseract_info()
$datapath
[1] "/Users/histmr/Library/R/3.3/library/tesseract/tessdata/"
$available
[1] "eng" "jpn"
$version
[1] "3.05.00"
Sometimes I see references to a "TESSDATA_PREFIX environment variable", but I don't know where that is. How can I get the correct directory path (I can see the file in the directory) or edit the TESSDATA_PREFIX environment variable?
The problem seems to occur with Japanese but NOT French
tesseract_download("fra")
french <- tesseract("fra")
Works fine! But
tesseract_download("jpn")
japanese <- tesseract("jpn")
Generates an error
The error message Error in tesseract_engine_internal(datapath, language) says the language file, in your case jpn.traineddata, is not available in TESSDATA_PREFIX, which is the default path for storing all the trained language data. If you haven't set the path, you can open a terminal and type the command below.
export TESSDATA_PREFIX=/Users/histmr/Library/R/3.3/library/tesseract/tessdata/
Hope this helps.
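If you'd rather not edit your shell profile, the same idea can be tried from within R before creating the engine (a sketch; the path is the one from the question, and whether this takes effect may depend on your Tesseract build):
Sys.setenv(TESSDATA_PREFIX = "/Users/histmr/Library/R/3.3/library/tesseract/tessdata/")  # point Tesseract at the tessdata folder
library(tesseract)
japanese <- tesseract(language = "jpn")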
One possible problem is multiple installs of Tesseract (I used Homebrew and MacPorts) creating multiple TESSDATA folders. Strangely, R was happier with a seemingly identical folder in a different place closer to root, one that is ordinarily hidden under OSX. I got things working with
export TESSDATA_PREFIX=/opt/local/share
I hope this helps.

Why does RMarkdown `render` behavior depend on whether it's called from RStudio Server or from a PHP shell?

I have an RMarkdown document that includes 'special characters', such as ë. If I render the document using RStudio Server's "knit document" button, it renders fine. When I render it by using the RStudio Server button to source another R script that calls RMarkdown's render function, it also renders fine.
However, for some reason that's beyond me (but hopefully not for long), I get different results when that same R script is called by index.php using:
$results = shell_exec("R --file='/home/username/public_html/some/subdirectories/process.R' --no-save 2>&1");
When I do this, in the resulting .html file the special symbols (I guess the Unicode symbols) are replaced by <U+00EB>. I've tried to look up whether this is some kind of HTML element variant I didn't know about yet, but I have been unable to find anything about it.
(Note: any link to a place where I can learn more about this, and, while we're at it, about why my browser doesn't show it as the ë it represents, is also greatly appreciated!)
Reproducible example
Contents of example.php:
<?php
shell_exec("R --file='/home/username/public_html/subdirectory/example.R' --no-save 2>&1");
?>
Contents of example.R (this is what I needed on my server):
workingPath <- "/home/username/public_html/subdirectory";
### Set path to RStudio's pandoc version
Sys.setenv(PATH=paste(Sys.getenv("PATH"),
"/usr/lib/rstudio-server/bin/pandoc",
sep=":"));
### Set HOME and LANG
Sys.setenv(HOME = '/home/username');
Sys.setenv(LANG = 'en_US.UTF-8');
require(rmarkdown);
renderResults <-
render(file.path(workingPath, 'example.Rmd'),
output_file = file.path(workingPath, 'example.html'),
intermediates_dir = file.path(workingPath, 'tmp'),
encoding="UTF-8");
Contents of example.Rmd:
---
title: 'Reproducable example'
output: html_document
---
```{r}
cat("This is an ë symbol.");
```
Results of this example:
When I run this from R Studio, I get:
cat("This is an ë symbol.");
## This is an ë symbol.
When I run this from PHP, I get:
cat("This is an ë symbol.");
## This is an <U+00EB> symbol.
(Note how, interestingly, the echoed ë does show up normally...)
I now resorted to doing a str_replace in the index.php file, but that's not ideal.
I've checked the render manual, but I can't find anything about this behavior.
I've also looked at specifying options for pandoc in the YAML header of the .Rmd file, but the only thing that seems to come close is the --ascii option, and that doesn't do anything. The R Studio RMarkdown page doesn't provide any hints, either.
Could it perhaps have to do with environment variables that are set in RStudio? I already had to set:
Sys.setenv(HOME = '/home/oupsyusr');
Sys.setenv(LANG = 'en_US.UTF-8');
in the R script to get Pandoc going in the first place when it is called from the PHP shell; but if this is the problem, how can I figure out which settings RStudio sets to which values, or, more accurately, which of those are important? I ran:
Sys.getenv()
from within RStudio, and that shows quite a list. None of the entries look to me like they have anything to do with encoding.
Or does knitr cause this? When I store and inspect the intermediate .md file, the <U+00EB>-style representations already show up. However, the knitr help page with chunk options doesn't say anything about Unicode or encoding in general.
Does anybody know where this is documented, or does anybody happen to have encountered this situation before?
I'm running RStudio 0.99.903 and R 3.3.1 on CentOS 6.8.
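One way to narrow down the environment-variable question is to dump Sys.getenv() to a file from both contexts and compare the results (a sketch using only base R; env_rstudio.txt is just a made-up file name):
env <- Sys.getenv()  # named character vector of all environment variables
writeLines(paste(names(env), env, sep = "="), "env_rstudio.txt")
# run the same two lines from the PHP-launched script with a different file name,
# then compare the two files, e.g. with diff in a terminal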
Usually, issues of this form (where unicode characters are converted to a unicode code point representation, e.g. <U+00EB> in this case) are caused by an attempt to run R within a non-UTF-8 locale.
Typically, this can be verified by checking the output of Sys.getlocale("LC_ALL"). If you see a C locale reported, then you likely need to enforce a UTF-8 locale with something like:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
substituting the particular UTF-8 locale flavor based on your desired language. (For reference, the set of available locales can usually be queried from a terminal with something like locale -a).
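Applied to the reproducible example above, that would mean forcing the locale near the top of example.R, before render() runs (a sketch; the exact locale name should be one that locale -a reports on the server):
### Force a UTF-8 locale before rendering
Sys.setenv(HOME = '/home/username');
Sys.setenv(LANG = 'en_US.UTF-8');
Sys.setlocale("LC_ALL", "en_US.UTF-8");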

Rstudio: Cmd + C/V not working in editor

I have used pipe to copy and paste data between RStudio (v0.99.467) and Excel on my Mac, OS X 10.9.5.
pipe("pbcopy", "w")
pipe("pbpaste")
For some time I tried to use pipe("pbcopy", "r"), but RStudio stopped responding (because my code was wrong). After that, I found that Cmd + C/V no longer works in the editor (though it still works in the R console). I reinstalled RStudio and removed .rstudio-desktop, but the problem still exists. Does anyone know what is going on? Can I remove the .bash file that stores the RStudio shortcut preferences (assuming a reinstall won't delete it)? BTW, where is that shortcut .bash file in RStudio?
On OSX Mojave using R 3.5.1, you can use the following block to capture the clipboard:
clipboard <- system("pbpaste", intern = T)
I can also confirm that the following block is working:
clipboard <- scan(pipe("pbpaste", "r"), what = character())
However, connections are sometimes tricky to work with. For example:
clipboard <- readLines(pipe("pbpaste", "r"))
Returns an empty character vector, likely because there's no newline terminator in the clipboard!
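For the write direction from the question, pipe("pbcopy", "w"), a minimal sketch that opens the connection, writes the lines, and closes it again looks like this:
con <- pipe("pbcopy", "w")  # connection to the macOS clipboard
writeLines(c("first line", "second line"), con)
close(con)  # flush and close so the text actually lands on the clipboard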

Encoding: knitr and child files

I am using Windows 7, R 2.15.3 and RStudio 0.97.320 with knitr 1.1. I'm not sure what my pandoc version is, but I downloaded it a couple of days ago.
sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Spanish_Argentina.1252 LC_CTYPE=Spanish_Argentina.1252 LC_MONETARY=Spanish_Argentina.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Argentina.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.15.3
I would like to get my reports both in html and Word, so I'm using markdown and pandoc.
I write in Spanish, with accents on vowels and tildes on the n: á-ú and ñ.
I have read many posts, and I see that problems similar to the one I'm having have been solved with new versions of knitr. But there is one issue I haven't found a solution for.
When I started, I used the 'system default' encoding that appears in the RStudio dialog, i.e. ISO 8859-1, and the RStudio previews worked great. However, when I tried to get Word documents, pandoc choked on the accented vowels. I found a post showing how to solve this using iconv:
iconv -t utf-8 "myfile.md" | pandoc -o "myfile.docx" | iconv -f utf-8
While this did solve pandoc's complaints about unrecognized UTF-8 characters, for some reason pandoc stops finding my plots, with an error like this one:
pandoc: Could not find image `figure/Parent.png', skipping...
If I use only non-accented characters, pandoc finds the images with no problems. I looked at the two .md files with a hex editor, and I can't see any difference when I compare the sections that reference the figures:
![plot of chunk Parent](figure/Parent.png)
although obviously the accented characters are completely different... I have verified that the image files do exist in the figure folder.
Anyway, after reading many posts I decided to set RStudio to use UTF-8 encoding. With only one level of files, things work great. For example, I can, independently, knit and then pandoc into Word the following two Rmd files:
Parent - SAVED WITH utf-8 encoding in RStudio
========================================================
u with an accent: "ú" SAVED WITH utf-8 encoding in RStudio
```{r fig.width=7, fig.height=6}
plot(cars, main='Parent ú')
```
and separately:
Child - SAVED WITH utf-8 encoding in RStudio
========================================================
u with an accent: "ú" Child file
```{r fig.width=7, fig.height=6}
plot(cars, main='One File Child ú')
```
and I get two perfect previews in RStudio and two perfect Word documents from pandoc.
The problem arises when I try to call the child part from the parent part. In other words, if I add to the first file the following lines:
```{r CallChild, child='TestUTFChild.Rmd'}
```
then all the accents in the child file become garbled, as if the UTF-8 were being interpreted as ISO 8859-1. Pandoc stops reading the file as well, complaining it's not UTF-8.
If anybody could point me in the right direction, either:
1. With pandoc not finding the plots if I stay with ISO 8859-1. I have also tried Windows-1252 because it's what I saw in the sessionInfo, but the result is the same.
or
2. With the call to the child file, if UTF-8 is the way to go. I have looked for a way of setting some option to force the encoding in the child call, but I haven't found it yet.
Many thanks!
I think this problem should be fixed in the latest development version. See instructions in the development repository on how to install the devel version. Then you should be able to choose UTF-8 in RStudio, and get a UTF-8 encoded output file.
Just in case anyone is interested in the gory details: the reason for the failure before was that I wrote the child output with the encoding you provided, but did not read it with the same encoding. Now I just avoid writing output files for child documents.
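For reference, installing the development version of knitr usually amounts to something like this (a sketch, assuming the devtools package is available; the repository's own instructions take precedence):
devtools::install_github("yihui/knitr")  # development version from GitHub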
