Warning: Input string not available in this locale (R)

RStudio version 1.0.136
R version 3.3.2
It's strange: when I run code (it has Chinese comments) line by line in a .Rmd file with R Markdown, the console prints the following warning:
Warning message:
In strsplit(code, "\n", fixed = TRUE) :
input string 1 is invalid in this locale
It's so annoying, because the warning appears for every single line.
I have changed the default text encoding in RStudio's settings, but neither UTF-8 nor GB2312 stops the warning from appearing.
Please note that it only appears when I run code line by line; if I select a chunk and press the button to produce an HTML file, the warning doesn't appear.
My code is as follows:
```{r}
da=read.table("m-intcsp7309.txt",header=T)
head(da)
# date intel sp三列 (three columns: date, intel, sp)
length(da$date)
# 444数据 (444 observations)
intc=log(da$intc+1)
# 测试 (test)
plot(cars)
# 测试警告信息 (test the warning message)
plot(cars)
# 为什么会出现警告? (why does the warning appear?)
plot(cars)
```
I have tested it: it does not arise from the Chinese comments. I just ran into it using English only.
Here is more information:
Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;
LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;
LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;
LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"

I had a similar issue with gsub() and was able to resolve it, without changing the locale, simply by setting useBytes = TRUE. The same should work in strsplit(). From the documentation:
If TRUE the matching is done byte-by-byte rather than character-by-character, and inputs with marked encodings are not converted.
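A minimal sketch of the same idea applied to the strsplit() call from the warning above (the sample string is made up):
code <- "plot(cars)\n# 测试"
strsplit(code, "\n", fixed = TRUE, useBytes = TRUE) # byte-wise split, no locale conversion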

Embed this directly in the Rmarkdown script that contains the Chinese character comment(s):
Sys.setlocale('LC_ALL','C')
If you just run it in the R console before running the R Markdown script, that may change the setting temporarily and work, but as you said, it won't survive an R restart. That's why it's better to embed that line directly into the script(s) that need it.
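For instance, as a chunk near the top of the .Rmd (a sketch; where exactly you put the chunk is up to you):
```{r}
Sys.setlocale('LC_ALL', 'C') # set the locale each time the document is run
```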

If you get this warning just after or while building your vignettes during check() of your package, it is probably linked to this problem: https://github.com/r-lib/rcmdcheck/issues/140
If you update {processx} and {rcmdcheck}, this should work better.
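A minimal sketch of that update (plain CRAN installs; adjust to however you manage packages):
install.packages(c("processx", "rcmdcheck"))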

Setting useBytes = TRUE in gsub() seems to work best, e.g. gsub('pattern text', 'replacement text', x, useBytes = TRUE), where x is the character vector being processed.

Related

Encoding discrepancy in RScript

I have been struggling with an encoding problem in a program that needs to run both in RStudio and via Rscript. After wasting half a day on this I have a kludgy workaround, but I would like to understand why the Rscript version marks a string as latin1 when it is in fact UTF-8, and whether there is a better alternative to my solution. Example:
x <- "Ø28"
print(x)
print(paste("Marked as", Encoding(x)))
print(paste("Valid UTF = ", validUTF8(x)))
x <- iconv(x, "UTF-8", "latin1")
print(x)
In RStudio, the output is:
[1] "Ø28"
[1] "Marked as latin1"
[1] "Valid UTF = FALSE"
[1] NA
and when run using Rscript from a batch file on Windows, the output from the same code is:
[1] "Ã\23028"
[1] "Marked as latin1"
[1] "Valid UTF = TRUE"
[1] "Ø28"
In the latter case, it does not strike me as entirely helpful that a string defined within an R program by a simple assignment is marked as latin1 when it is in fact UTF-8.
The solution I used in the end was to write a function that tests the actual (rather than declared) encoding of character variables using validUTF8(), and, if that returns TRUE, uses iconv() to convert to latin1. It is still a bit of a pain, since I have to call it repeatedly, and a global solution would be better. There is quite a bit out there on encoding problems with R, but nothing I can find that solves this when running programs with Rscript. Any suggestions?
R 3.5.0, RStudio 1.1.453, Windows 7 / Windows Server 2008 (don't ask...)
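For reference, a minimal sketch of the workaround described above (fix_encoding is a made-up name):
fix_encoding <- function(x) {
  # the declared encoding can lie, so test the actual bytes instead
  if (is.character(x) && all(validUTF8(x))) iconv(x, "UTF-8", "latin1") else x
}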

How to ensure English error messages in testthat unit tests

I have a lot of unit tests using the testthat package that expect English error messages.
If other developers run the tests on a computer configured for a non-English locale, the error messages are emitted in a different language and my tests fail.
How can I make testthat change the language settings only during the test run time, without manually or permanently changing the language or test environment from outside of R (as proposed, e.g., here: in R how to get error messages in english)?
library(testthat)
# works only in english locales...
expect_error(log("a"), "non-numeric argument to mathematical function", fixed = TRUE)
Edit 1: Changing the locale at run time does not change the language of the error messages (tested on Ubuntu and macOS High Sierra):
Sys.setlocale( locale = "en_US.UTF-8")
Sys.getlocale() # en_US is active now but messages are still in another language
Edit 2: It seems that Sys.setenv("LANGUAGE" = "EN") changes the error-message language immediately (tested on macOS). Where should I put this command for testthat? In the testthat.R file?
Related: The R console is in German language, how can I set R to English?
Edit 3: As a first workaround I have put
Sys.setenv("LANGUAGE" = "EN") # work-around to always get English R (error) messages
into my testthat.R file under the tests folder (it seems to work, but I am not sure whether this is the right or best way...).
Setting Sys.setenv("LANGUAGE" = "EN") works for me as well.
However, when testing with devtools::test() - as Ctrl+Shift+T in RStudio does - I had to call Sys.setenv() in the test scripts inside the tests/testthat/ directory, because devtools::test() calls testthat::test_dir(), circumventing the tests/testthat.R file.
So far, this has had no undesirable side effects. The environment variable is only set for that particular R process, as described in the help page:
Sys.setenv sets environment variables (for other processes called from within R or future calls to Sys.getenv from this R process).
For completeness, you can also unset the variable again on Windows (see comments).
Sys.setenv("LANGUAGE" = "DE")
expect_error(log("a"), "Nicht-numerisches Argument")
Sys.setenv("LANGUAGE" = "FR")
expect_error(log("a"), "argument non numérique ")
Sys.unsetenv("LANGUAGE")
RStudio might also give trouble (I was not able to change the language there interactively), but when executing with devtools::test() it works.
Finally, wrap it in a helper function:
expect_error_lang <- function(..., lang = "EN") {
  Sys.setenv("LANGUAGE" = lang)
  expect_error(...)
  Sys.unsetenv("LANGUAGE")
}
#...
expect_error_lang(log("a"), "non-numeric")
expect_error_lang(log("a"), "Nicht-numerisches", lang = "DE")
expect_error_lang(log("a"), "argument non", lang = "FR")

R: Encoding of labelled data and knit to html problems

First of all, sorry for not providing a reproducible example and for posting images; a word of explanation on why is at the end.
I'd really appreciate some help, comments or otherwise; I think I did my best to be as specific and concise as I can.
The problem I'm trying to solve is how (and where) to set up encoding so that Polish letters come through after a .Rmd document is knitted to HTML.
I'm working with a labelled SPSS file imported into R via the haven library, and using sjPlot tools to make tables and graphs.
I already spent almost all day trying to sort this out, but I'm stuck with no idea where to go.
My sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
Whenever I run (via console / script)
sjt.frq(df$sex, encoding = "Windows-1250")
I get a nice table with proper encoding in the RStudio viewer pane.
Trying it with no encoding, sjt.frq(df$sex), gives a table with the Polish letters garbled.
I could live with setting the encoding on every call to sjt.frq, but the problem is that no matter how I set up sjt.frq inside a markdown document, it always gets knitted the wrong way.
Running the chunk inside the .Rmd is OK (for a completely unknown reason encoding = "UTF-8" worked here as well, and it didn't previously); knitting the same document is not OK (note that the HTML header has all the Polish characters).
Also, it looks like it could be HTML- or sjPlot-specific, because knitr can print Polish letters when they are in a vector and printed as if to the console.
Is there anything I can set up or change to make this work?
While testing different options I discovered that manually converting the sex variable to a factor and assigning the labels again works, and RStudio knits to HTML with proper encoding:
df$sex <- factor(df$sex, labels = c("kobieta", "mężczyzna"))
sjt.frq(df$sex, encoding = "Windows-1250")
Regarding the lack of a reproducible example: I tried to simulate it with fake data:
# Get libraries
library(sjPlot)
library(sjlabelled)
x <- rep(1:4, 4)
x <- set_labels(x, labels = c("ąę", "ćŁ", "óŚŚ", "abcd"))
# Run a frequency table similar to df$sex above
sjt.frq(x)
sjt.frq(x, encoding = "UTF-8")
sjt.frq(x, encoding = "Windows-1250")
The thing is, each of these sjt.frq calls knits the way it should (although only encoding = "Windows-1250" renders properly in the RStudio viewer pane).
If you run sjt.frq(), a complete HTML page is returned, which is displayed in a viewer.
However, inside markdown/knitr documents only parts of the HTML output are needed: you don't need the <head> part, for instance, because the knitr document creates its own header for the HTML page. Thus there is a separate print() method for knitr documents, which uses another return value for inclusion in the knitr file.
Compare:
dummy <- sjt.frq(df$sex, encoding = "Windows-1250")
dummy$output.complete # used for default display in viewer
dummy$knitr # used in knitr documents
Since the encoding is set in the <meta> tag, which is not included in the $knitr value, the encoding argument in sjt.frq() has no effect on knitr documents.
I think this might help you: rmarkdown::render_site(encoding = 'UTF-8'). Maybe there are other options to encode the text, or you may need to modify the final HTML file, changing the charset encoding there.
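A minimal sketch of that last idea, post-processing the knitted file (the file name is made up, and I assume the header uses a <meta charset="..."> tag):
html <- readLines("report.html", warn = FALSE)
html <- sub('charset="[^"]*"', 'charset="windows-1250"', html) # rewrite the declared charset
writeLines(html, "report.html")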

renderMarkdown locally vs. shiny-server

For a shiny application, I have a small issue with renderMarkdown.
Consider a text file with the following simple contents:
Markdown Test File
+ Item 1
+ Item 2
Let's save this file as "Markdown Test.txt". Now, let's read it in and process it, using the following R code:
filename <- "Markdown Test.txt"
text.in <- readLines(filename)
text.out <- renderMarkdown(text=text.in)
When I run this locally - i.e. on my Windows machine - I get:
> text.out
[1] "<p>Markdown Test File</p>\n\n<ul>\n<li>Item 1</li>\n<li>Item 2</li>\n</ul>\n"
This looks good. However, running the same code on the machine that hosts shiny server, I get:
> text.out
[1] "<p>Markdown Test File+ Item 1+ Item 2</p>\n"
As you can see, the Markdown conversion is far from perfect; e.g. the list is not converted.
On the Windows machine I have:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
On the shiny machine, I get:
> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
So I'm assuming this has to do with encoding, but the little I know about encoding I wish I didn't... my experiments with dos2unix and Sys.setlocale() led to nothing but frustration.
Would anyone happen to have a clever "one liner" that can fix this? Any help appreciated!
Thanks, Philipp
I'm not sure if R has a dedicated function to fix line endings, but one way is to use gsub to replace \r\n with \n (or just strip the \r characters).
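A minimal sketch of that, applied to the example above (assuming stray \r characters are what breaks the list parsing):
text.in <- readLines(filename) # on Linux this can leave a trailing \r on each line
text.in <- gsub("\r", "", text.in) # strip carriage returns
text.out <- renderMarkdown(text = paste(text.in, collapse = "\n"))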

How to source() .R file saved using UTF-8 encoding?

The following, when copied and pasted directly into R, works fine:
> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."
However, if I make a file called character_test.R containing the EXACT SAME code, save it with UTF-8 encoding (so as to retain the special Chinese characters), and then source() it in R, I get the following error:
> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") :
C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2:
^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.
> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_2.12.1
and
> l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI code page, in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page: Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.
This doesn't seem to be a fundamental, unsolvable problem; there's just something wrong with the source function. You can get 90% of the way there by doing this instead:
eval(parse(filename, encoding="UTF-8"))
This will work almost exactly like source() with default arguments, but won't let you use echo = TRUE, print.eval = TRUE, etc.
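For example, using the path from the question and evaluating in the global environment, as source() does by default:
eval(parse("C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8"), envir = .GlobalEnv)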
We talked about this a lot in the comments to my previous post, but I don't want it to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console (see the screenshot in the comments) and with input from a file, see this screenshot:
The file "myfile.r" contains:
russian <- function() print ("Американские с...");
The console contains:
> source("myfile.r", encoding="utf-8")
Error in source(.....
> Sys.setlocale("LC_CTYPE","ru")
[1] "Russian_Russia.1251"
> source("myfile.r", encoding="utf-8")
> russian()
[1] "Американские с..."
Note that the file input fails at first, and the error points to the same character as the original poster's error (the one after "R); once the locale is set, sourcing succeeds. I cannot test this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same; you just replace the locale with "chinese" (the naming is a bit inconsistent, consult the documentation).
I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files, with many non-ASCII characters in them. But some characters cause it to fail. For example, the following
danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")
is fine in both UTF-8 and UCS-2LE without the Russian line, but fails if that line is included. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.
Locale seems irrelevant here. It's just a file; you tell R what encoding the file is in, so why should your locale matter?
For me (on Windows) I do:
source.utf8 <- function(f) {
  l <- readLines(f, encoding = "UTF-8")      # read the file as UTF-8
  eval(parse(text = l), envir = .GlobalEnv)  # parse and evaluate, like source()
}
It works fine.
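For example, with the file from the question:
source.utf8("C:\\Users\\Tony\\Desktop\\character_test.R")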
Building on crow's answer, this solution makes RStudio's Source button work.
When you hit that Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:
source <- function(f, encoding = 'UTF-8') {
  l <- readLines(f, encoding = encoding)
  eval(parse(text = l), envir = .GlobalEnv)
}
You can then add that snippet to your .Rprofile file, so it runs on startup.
I encountered this problem when trying to source a .R file containing some Chinese characters. In my case, I found that merely setting "LC_CTYPE" to "chinese" is not enough, but setting "LC_ALL" to "chinese" works well.
Note that it's not enough to get the encoding right when reading or writing a plain-text file with non-ASCII characters in RStudio (or R?); the locale setting counts too.
PS: the command is Sys.setlocale(category = "LC_ALL", locale = "chinese"). Please replace the locale value correspondingly.
On Windows, when you copy-paste a Unicode or UTF-8-encoded string into a text control that is set to single-byte input (ASCII... depending on locale), the unknown bytes are replaced by question marks. If I take the first 4 characters of your string and copy-paste them into e.g. Notepad and then save it, the file becomes, in hex:
52 3F 3F 3F 3F
What you have to do is find an editor which you can set to UTF-8 before copy-pasting the text into it; then the saved file (of your first 4 characters) becomes:
52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB
This will then be recognized as valid UTF-8 by R.
I used "Notepad2" to try this, but I am sure there are many more.
