For a shiny application, I have a small issue with renderMarkdown.
Consider a text file with the following simple contents:
Markdown Test File
+ Item 1
+ Item 2
Let's save this file as "Markdown Test.txt". Now, let's read it in and process it, using the following R code:
filename <- "Markdown Test.txt"
text.in <- readLines(filename)
text.out <- renderMarkdown(text=text.in)
When I run this locally - i.e. on my Windows machine - I get:
> text.out
[1] "<p>Markdown Test File</p>\n\n<ul>\n<li>Item 1</li>\n<li>Item 2</li>\n</ul>\n"
This looks good. However, running the same code on the machine that hosts shiny server, I get:
> text.out
[1] "<p>Markdown Test File+ Item 1+ Item 2</p>\n"
As you can see, the Markdown conversion is far from perfect; e.g. the list is not converted.
On the Windows machine I have:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
On the shiny machine, I get:
> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
So, I'm assuming that this has to do with the encoding, but the little I know about encoding I wish I didn't... my experiments with dos2unix and Sys.setlocale() led to nothing but frustration.
Would anyone happen to have a clever "one liner" that can fix this? Any help appreciated!
Thanks, Philipp
I'm not sure whether R has a dedicated package for fixing line endings, but one way is to use gsub to replace \r\n with \n (or just strip the \r characters). Note that sub would only replace the first match in each string.
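A minimal sketch of that idea, assuming stray carriage returns (or lines that are joined without a newline) are what confuses the converter. The helper name normalize_markdown_input is made up for illustration, and the renderMarkdown call is left commented out because it needs the markdown package:

```r
# Hypothetical helper (not part of any package): strip carriage returns
# and join the lines into a single newline-separated string.
normalize_markdown_input <- function(lines) {
  lines <- gsub("\r", "", lines, fixed = TRUE)
  paste(lines, collapse = "\n")
}

# Simulated readLines() result that still carries Windows carriage returns.
text.in <- c("Markdown Test File\r", "\r", "+ Item 1\r", "+ Item 2\r")
text.clean <- normalize_markdown_input(text.in)

# Assumption: the cleaned string is then passed on as before, e.g.
# text.out <- markdown::renderMarkdown(text = text.clean)
```

Passing one string with explicit \n separators removes any doubt about how the converter joins a character vector.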
I have been struggling with an encoding problem with a program that needs to run both in RStudio and using RScript. After wasting half a day on this I have a kludgy workaround, but would like to understand why the RScript version marks a string as latin1 when it is in fact UTF-8, and whether there is a better alternative to my solution. Example:
x <- "Ø28"
print(x)
print(paste("Marked as", Encoding(x)))
print(paste("Valid UTF = ", validUTF8(x)))
x <- iconv(x, "UTF-8", "latin1")
print(x)
In RStudio, the output is:
[1] "Ø28"
[1] "Marked as latin1"
[1] "Valid UTF = FALSE"
[1] NA
and when run using RScript from a batch file in Windows the output from the same code is:
[1] "Ã\23028"
[1] "Marked as latin1"
[1] "Valid UTF = TRUE"
[1] "Ø28"
In the latter case, it does not strike me as entirely helpful that a string defined within an R program by a simple assignment is marked as Latin-1 when in fact it is UTF-8. The solution I used in the end was to write a function that tests the actual (rather than declared) encoding of character variables using validUTF8, and if that returns TRUE, then use iconv to convert to latin1. It is still a bit of a pain since I have to call that repeatedly, and it would be better to have a global solution. There is quite a bit out there on encoding problems with R, but nothing that I can find that solves this when running programs with RScript. Any suggestions?
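The workaround described above could look roughly like the following sketch. The function name to_latin1_if_utf8 is made up for illustration; this is not the asker's actual code, only one way to test the real bytes with validUTF8 and convert with iconv:

```r
# Hypothetical helper: convert only those elements whose bytes are valid UTF-8,
# regardless of what Encoding() claims. Characters that latin1 cannot
# represent will come back as NA from iconv().
to_latin1_if_utf8 <- function(x) {
  stopifnot(is.character(x))
  convert <- !is.na(x) & validUTF8(x)
  x[convert] <- iconv(x[convert], from = "UTF-8", to = "latin1")
  x
}

# "Ø28" stored as UTF-8 bytes (C3 98 32 38) and as latin1 bytes (D8 32 38).
as_utf8   <- rawToChar(as.raw(c(0xc3, 0x98, 0x32, 0x38)))
as_latin1 <- rawToChar(as.raw(c(0xd8, 0x32, 0x38)))

charToRaw(to_latin1_if_utf8(as_utf8))    # converted to the latin1 bytes
charToRaw(to_latin1_if_utf8(as_latin1))  # not valid UTF-8, left untouched
```

Pure ASCII strings also pass validUTF8, but converting them is harmless because their bytes are the same in both encodings.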
R 3.5.0, RStudio 1.1.453, Windows 7 / Windows Server 2008 (don't ask...)
First of all, sorry for not providing a reproducible example and for posting images; I explain why at the end.
I'd really appreciate some help, in comments or otherwise. I have tried to be as specific and concise as I can.
The problem I'm trying to solve is how (and where) to set the encoding so that Polish letters survive when a .Rmd document is knitted to HTML.
I'm working with a labelled SPSS file imported into R via the haven library, and I use sjPlot tools to make tables and graphs.
I have already spent almost a whole day trying to sort this out, but I'm stuck with no idea where to go next.
My sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
Whenever I run (via console / script)
sjt.frq(df$sex, encoding = "Windows-1250")
I get a nice table with proper encoding in the rstudio viewer pane:
Trying with no encoding sjt.frq(df$sex) gives this:
I could live with setting the encoding each time I call sjt.frq, but the problem is that no matter how I set up sjt.frq inside a markdown document, it always gets knitted the wrong way.
Running the chunk inside the .Rmd is OK (for a completely unknown reason, encoding = "UTF-8" worked here as well, although it didn't previously):
Knitting same document, not OK:
(note, that html header has all the polish characters)
Also, it looks like it could be specific to either HTML or sjPlot, because knitr can print Polish letters when they are in a vector and are passed as if they were printed to the console:
Is there anything I can set up / change in order to make this work?
While testing different options, I discovered that manually converting the sex variable to a factor and assigning the labels again works: RStudio then knits to HTML with the proper encoding.
df$sex <- factor(df$sex, label = c("kobieta", "mężczyzna"))
sjt.frq(df$sex, encoding = "Windows-1250")
Regarding no reproducible example:
I tried to simulate this example with fake data:
# Get libraries
library(sjPlot)
library(sjlabelled)
x <- rep(1:4, 4)
x<- set_labels(x, labels = c("ąę", "ćŁ", "óŚŚ", "abcd"))
# Run freq table similar to df$sex above
sjt.frq(x)
sjt.frq(x, encoding = "UTF-8")
sjt.frq(x, encoding = "Windows-1250")
The thing is, each sjt.frq call knits the way it should (although only encoding = "Windows-1250" renders properly in the RStudio viewer pane).
If you run sjt.frq(), a complete HTML-page is returned, which is displayed in a viewer.
However, inside markdown/knitr documents only parts of the HTML output are required: you don't need the <head> part, for instance, because the knitr document creates its own header for the HTML page. Thus, there is a separate print() method for knitr documents, which uses a different return value to include in the knitr file.
Compare:
dummy <- sjt.frq(df$sex, encoding = "Windows-1250")
dummy$output.complete # used for default display in viewer
dummy$knitr # used in knitr-documents
Since the encoding is located in the <meta>-tag, which is not included in the $knitr-value, the encoding-argument in sjt.frq() has no effect on knitr-documents.
I think that this might help you: rmarkdown::render_site(encoding = 'UTF-8'). Maybe there are also other options to encode text, or you need to modify the final HTML-file, changing the charset encoding there.
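Since the question reports that re-assigning the labels by hand fixes the knitted output, another option might be to re-encode the value labels to UTF-8 before calling sjt.frq(). The sketch below is an assumption, not something taken from sjPlot: the helper name labels_to_utf8 is hypothetical, it assumes the labels live in a "labels" attribute as a named vector, and it assumes they are stored as CP1250 bytes:

```r
# Hypothetical helper: re-encode the names of a "labels" attribute to UTF-8.
# `from` is an assumption; use whatever encoding the labels really have.
labels_to_utf8 <- function(x, from = "CP1250") {
  labs <- attr(x, "labels")
  if (!is.null(labs) && !is.null(names(labs))) {
    names(labs) <- iconv(names(labs), from = from, to = "UTF-8")
    attr(x, "labels") <- labs
  }
  x
}

# Toy data: the second label is "mężczyzna" stored as CP1250 bytes.
sex <- c(1, 2, 2, 1)
labs <- c(1, 2)
names(labs) <- c(
  "kobieta",
  rawToChar(as.raw(c(0x6d, 0xea, 0xbf, 0x63, 0x7a, 0x79, 0x7a, 0x6e, 0x61)))
)
attr(sex, "labels") <- labs

sex <- labels_to_utf8(sex)
Encoding(names(attr(sex, "labels")))
```

Whether this resolves the knitting problem depends on how the labels were actually stored at import time, so it is worth checking Encoding() on the real labels first.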
Rstudio Version 1.0.136
R Version 3.3.2
Strangely, when I run code (which contains Chinese comments) line by line in a .Rmd file with R Markdown, the console prints the following warning:
Warning message:
In strsplit(code, "\n", fixed = TRUE) :
input string 1 is invalid in this locale
It's very annoying, because it appears for every line.
I have changed the default text encoding in RStudio's settings, but neither UTF-8 nor GB2312 prevents the warning message from appearing.
Please note that it only appears when I run the code line by line; if I select a chunk and press the button to produce HTML, the warning doesn't appear.
my code is as follows:
```{r}
da=read.table("m-intcsp7309.txt",header=T)
head(da)
# date intel sp三列
length(da$date)
# 444数据
intc=log(da$intc+1)
# 测试
plot(cars)
# 测试警告信息
plot(cars)
# 为什么会出现警告?
plot(cars)
```
I have tested this, and it does not arise from the Chinese comments: just now I got the warning when using only English.
Here is more information:
Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;
LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;
LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;
LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
I had a similar issue with gsub() and was able to resolve it, without changing the locale, simply by setting useBytes = TRUE. The same should work in strsplit(). From the documentation:
If TRUE the matching is done byte-by-byte rather than character-by-character, and inputs with marked encodings are not converted.
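A small sketch of what that looks like with strsplit(). The input below is built from raw bytes so that it contains non-ASCII bytes regardless of the session locale (they are meant to resemble a GBK-encoded comment); this only illustrates the argument and does not change how RStudio itself calls strsplit():

```r
# Non-ASCII bytes followed by a newline and the ASCII text "x <- 1".
code <- rawToChar(as.raw(c(0xb2, 0xe2, 0xca, 0xd4, 0x0a,
                           0x78, 0x20, 0x3c, 0x2d, 0x20, 0x31)))

# Byte-wise splitting: no attempt is made to interpret the input
# as characters in the current locale.
parts <- strsplit(code, "\n", fixed = TRUE, useBytes = TRUE)[[1]]
length(parts)
```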
Embed this directly in the Rmarkdown script that contains the Chinese character comment(s):
Sys.setlocale('LC_ALL','C')
If you just run it in the R console before running the rmarkdown script, that may temporarily change the setting and work, but as you said, it won't stay that way if you restart R. That's why it's better to directly embed that line into the script(s) that need it.
If you get this warning just after or during building your vignettes during the check() of your package, then it is probably linked to this problem: https://github.com/r-lib/rcmdcheck/issues/140
If you update {processx} and {rcmdcheck}, this should work better.
Setting useBytes = TRUE in gsub seems to work best. For example, gsub('pattern text', 'replacement text', x, useBytes = TRUE), where x is the character vector to process.
I am simply trying to create a corpus from Russian, UTF-8 encoded text. The problem is, the Corpus method from the tm package is not encoding the strings correctly.
Here is a reproducible example of my problem:
Load in the Russian text:
> data <- c("Renault Logan, 2005","Складское помещение, 345 м²",
"Су-шеф","3-к квартира, 64 м², 3/5 эт.","Samsung galaxy S4 mini GT-I9190 (чёрный)")
Create a VectorSource:
> vs <- VectorSource(data)
> vs # outputs correctly
Then, create the corpus:
> corp <- Corpus(vs)
> inspect(corp) # output is not encoded properly
The output that I get is:
> inspect(corp)
<<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
Renault Logan, 2005
[[2]]
<<PlainTextDocument (metadata: 7)>>
Ñêëàäñêîå ïîìåùåíèå, 345 ì<U+00B2>
[[3]]
<<PlainTextDocument (metadata: 7)>>
Ñó-øåô
[[4]]
<<PlainTextDocument (metadata: 7)>>
3-ê êâàðòèðà, 64 ì<U+00B2>, 3/5 ýò.
[[5]]
<<PlainTextDocument (metadata: 7)>>
Samsung galaxy S4 mini GT-I9190 (÷¸ðíûé)
Why does it output incorrectly? There doesn't seem to be any option to set the encoding on the Corpus method. Is there a way to set it after the fact? I have tried this:
> title_corpus <- tm_map(title_corpus, enc2utf8)
Error in FUN(X[[1L]], ...) : argumemt is not a character vector
But, it errors as shown.
Well, there seems to be good news and bad news.
The good news is that the data appears to be fine even if it doesn't display correctly with inspect(). Try looking at
content(corp[[2]])
# [1] "Складское помещение, 345 м²"
The reason it looks funny in inspect() is that the authors changed the way the print.PlainTextDocument function works. It formerly would cat the value to the screen. Now, however, they feed the data through writeLines(). This function uses the locale of the system to format the characters/bytes in the document (the locale can be viewed with Sys.getlocale()). It turns out Linux and OS X have a proper "UTF-8" encoding, but Windows uses language-specific code pages. So if the characters aren't in the code page, they get escaped or translated to funny characters. This means this should work just fine on a Mac, but not on a PC.
Try going a step further and building a DocumentTermMatrix
dtm <- DocumentTermMatrix(corp)
Terms(dtm)
Hopefully you will see (as I do) the words correctly displayed.
If you like, this article about writing UTF-8 files on Windows has some more information about this OS specific issue. I see no easy way to get writeLines to output UTF-8 to stdout() on Windows. I'm not sure why the package maintainers changed the print method, but one might ask or submit a feature request to change it back.
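To check whether a given session is likely to be affected by the locale issue described above, base R can report whether the current locale is a UTF-8 one. This is only a diagnostic sketch; it does not change how inspect() prints:

```r
# l10n_info() describes the current locale; the "UTF-8" element is TRUE
# when the session's character encoding is UTF-8.
info <- l10n_info()
is_utf8_session <- info[["UTF-8"]]
is_utf8_session

# The character-type locale itself, for comparison with the outputs above.
Sys.getlocale("LC_CTYPE")
```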
I'm surprised the answer has not been posted yet. Don't bother messing with locale. I'm using tm package version 0.6.0 and it works absolutely fine, provided you add the following little piece of magic :
Encoding(data) <- "UTF-8"
Well, here is the reproducible code :
data <- c("Renault Logan, 2005","Складское помещение, 345 м²","Су-шеф","3-к квартира, 64 м², 3/5 эт.","Samsung galaxy S4 mini GT-I9190 (чёрный)")
Encoding(data)
# [1] "unknown" "unknown" "unknown" "unknown" "unknown"
Encoding(data) <- "UTF-8"
Encoding(data)
# [1] "unknown" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
Just put it in a text file saved with UTF-8 encoding, then source it normally in R. But do not use source.with.encoding(..., encoding = "UTF-8"); it will throw an error.
I forgot where I learned this trick, but I picked it up somewhere along the way this past week, while surfing the Web trying to learn how to process UTF-8 text in R. Things were a lot cleaner in Python (just convert everything to Unicode!). R's approach is much less straightforward for me, and it did not help that the documentation is sparse and confusing.
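It may help to see that Encoding<- only declares how the existing bytes should be read; it does not convert anything. The sketch below builds a string from raw UTF-8 bytes so that it behaves the same in any locale:

```r
# UTF-8 bytes for the two Cyrillic letters of "Су" (U+0421, U+0443).
bytes <- as.raw(c(0xd0, 0xa1, 0xd1, 0x83))
s <- rawToChar(bytes)

before <- Encoding(s)     # no declared encoding yet
Encoding(s) <- "UTF-8"    # declare the bytes to be UTF-8
after <- Encoding(s)

# The underlying bytes are unchanged by the declaration.
same_bytes <- identical(charToRaw(s), bytes)
c(before = before, after = after)
```

This is why the trick only works when the bytes really are UTF-8, for example when the script file itself was saved as UTF-8.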
I had a problem with German UTF-8 encoding while importing the texts. For me, the next oneliner helped:
Sys.setlocale("LC_ALL", "de_DE.UTF-8")
Try to run the same with Russian?
Sys.setlocale("LC_ALL", "ru_RU.UTF-8")
Of course, that goes after library(tm) and before creating a corpus.
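Locale names differ between operating systems, so a name such as ru_RU.UTF-8 may not exist on every machine. The sketch below tries several candidates in turn and relies on Sys.setlocale() returning an empty string when a request cannot be honored. The helper name and the Windows-style name Russian_Russia.1251 are assumptions for illustration:

```r
# Hypothetical helper: set the first locale from `candidates` that the
# operating system accepts, and return its name (NA if none worked).
set_first_available_locale <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    res <- suppressWarnings(Sys.setlocale(category, loc))
    if (nzchar(res)) return(loc)
  }
  NA_character_
}

old <- Sys.getlocale("LC_CTYPE")
chosen <- set_first_available_locale(c("ru_RU.UTF-8", "Russian_Russia.1251", "C"))
chosen

# Restore the previous setting once the locale-sensitive work is done.
invisible(Sys.setlocale("LC_CTYPE", old))
```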
How can I permanently remove a library in R?
.libPaths()
[1] "\\\\per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell/R/win-library/2.15"
[2] "C:/Program Files/R/R-2.15.2/library"
[3] "C:/Program Files/RStudio/R/library"
The first item is my corporate "My Documents" folder, and the apostrophe in the path from my surname is causing all kinds of grief when using R CMD INSTALL --build on a package I'm making, not to mention issues using packages installed there when I'm offline from the network.
I want to use C:/Program Files/R/R-2.15.2/library as the default instead, but I don't want to have to rely on an Rprofile.site.
What I've tried
> .libPaths(.libPaths()[2:3])
> .libPaths()
[1] "C:/Program Files/R/R-2.15.2/library" "C:/Program Files/RStudio/R/library"
That seems to work, but only until I restart my R session, and then I'm back to the original .libPaths() output...
Restarting R session...
> .libPaths()
[1] "\\\\per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell/R/win-library/2.15"
[2] "C:/Program Files/R/R-2.15.2/library"
[3] "C:/Program Files/RStudio/R/library"
I thought maybe .libPaths() was using R_LIBS_USER
> Sys.getenv("R_LIBS_USER")
[1] "//per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell/R/win-library/2.15"
So I've tried to unset it using Sys.unsetenv("R_LIBS_USER") but it doesn't persist between sessions.
Additional Info
If it matters, here are some environment variables that might be relevant...
> Sys.getenv("R_HOME")
[1] "C:/PROGRA~1/R/R-215~1.2"
> Sys.getenv("R_USER")
[1] "//per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell"
> Sys.getenv("R_LIBS_USER")
[1] "//per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell/R/win-library/2.15"
> Sys.getenv("R_LIBS_SITE")
[1] ""
I've tried Sys.unsetenv("R_LIBS_USER") but this also doesn't stick between sessions
Just set the environment variable R_LIBS in Windows to something like
R_LIBS=C:/Program Files/R/R-2.15.2/library
Restart R.
This is a bit of a late response to the question, but it might be useful for others.
In order to set up my own path (and remove one of the original ones) I have:
used .libPaths() inside R to check the current library paths;
identified which paths to keep. In my case, I kept R's original library but removed the link to My Documents;
found the R home path using R.home() or Sys.getenv("R_HOME");
located R-Home\R-3.2.2\etc\Rprofile.site, which is read every time R starts, so any modification there persists across every run of R;
edited Rprofile.site by adding the following,
.libPaths(.libPaths()[2])
.libPaths("d:/tmp/R/win-library/3.2")
How does it work?
The first line drops every path except one (the second in the original list); the second line sets a new path in front of it. .libPaths() always keeps R's own library (.Library) in the list, so we end up with two paths.
Note that I use forward slashes in the path despite being on Windows; R accepts them on every operating system.
restarted R (using Ctrl+Shift+F10).
This will work every time now.
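The behaviour of .libPaths() that this relies on can be tried safely in a throwaway session. The sketch below uses a temporary directory as a stand-in for a real library folder (the directory name is made up) and restores the original paths afterwards:

```r
# Stand-in library directory; .libPaths() silently ignores paths
# that do not exist, so create it first.
lib <- file.path(tempdir(), "example-library")
dir.create(lib, showWarnings = FALSE, recursive = TRUE)

old_paths <- .libPaths()

# Calling .libPaths() with an argument replaces the user-defined paths,
# but R's own library (.Library) always stays in the list.
.libPaths(lib)
new_paths <- .libPaths()
new_paths

# Put things back the way they were for the rest of the session.
.libPaths(old_paths)
```

Because the change only lasts for the running session, putting the call into Rprofile.site or .Rprofile is what makes it persistent.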
Use this function in .Rprofile
set_lib_paths <- function(lib_vec) {
  lib_vec <- normalizePath(lib_vec, mustWork = TRUE)
  shim_fun <- .libPaths
  shim_env <- new.env(parent = environment(shim_fun))
  shim_env$.Library <- character()
  shim_env$.Library.site <- character()
  environment(shim_fun) <- shim_env
  shim_fun(lib_vec)
}
set_lib_paths("~/code/library") # where "~/code/library" is your package directory
Original Source: https://milesmcbain.xyz/hacking-r-library-paths/
I have put the Sys.unsetenv("R_LIBS_USER") command in a .Rprofile file in my Windows "own documents" folder. It seems to help. My problem was that being in an Active Directory environment made R startup and package loading incredibly slow when connected via VPN.
If you want to do this in the Rprofile file (#library/base/R/), you can search for the lines where the R_LIBS_* environment variables are set (e.g. Sys.setenv(R_LIBS_SITE=....) and Sys.setenv(R_LIBS_USER=.....)).
You can also search that code for the .libPaths() call, which sets the library tree. You can then achieve your goal by a combination of commenting out, unsetting, and setting the R_LIBS variables before the .libPaths() call, as you wish. For example, something like:
Sys.unsetenv("R_LIBS")
Sys.unsetenv("R_LIBS_USER")
Sys.setenv(R_LIBS_SITE = "D:/R/libs/site")
Sys.setenv(R_LIBS_USER = "D:/R/libs/user")
Sys.setenv(R_LIBS = "D:/R/libs")