I'm working with files containing text in Hindi and parsing them. I wrote my code in RStudio and executed it without many issues. But now I need to execute the same script from the command line using R.exe/Rscript.exe, and it doesn't work the same way. I've run a simple script from both RStudio and the terminal:
n_p<-'नाम'
Encoding(n_p)
gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
sessionInfo()
Output In RStudio:
> n_p<-'नाम'
>
> Encoding(n_p)
[1] "UTF-8"
>
> gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 3
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rJava_0.9-10
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0
Output with R.exe in cmd (for debugging purposes; Rscript.exe gives similar, if not identical, output):
> n_p<-'à☼"à☼_à☼r'
>
> Encoding(n_p)
[1] "latin1"
>
> gregexpr(n_p,c('adfdafc','à☼"à☼_à☼r adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 9
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0
I've tried changing locales, but Sys.setlocale refuses to work properly. In some cases, gregexpr gives an error when it can't parse non-ASCII text. And finally, when it does run without errors, it doesn't match regular expressions properly. I can't provide a reproducible example at the moment, but I will try to later.
Help.
The right answer is to run Rscript with the option --encoding=<file encoding>.
There is no need to set locale, and as you probably found out, it doesn't work anyway. If your file is UTF-8:
Rscript.exe --encoding=UTF-8 file.R
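As a complementary sketch (not part of the answer above): if the script can be edited, one way to make it independent of the file encoding is to write the non-ASCII literals as \u escapes, which R marks as UTF-8 when parsing. The example below reuses the pattern from the question; the use of regexpr with fixed = TRUE is an illustrative choice, not something the question requires.

```r
# Build the pattern from Unicode escapes instead of typing the Hindi text literally.
# A string literal containing \u escapes is marked as UTF-8 by the parser,
# so the script file itself stays pure ASCII.
n_p <- "\u0928\u093e\u092e"                 # the word used in the question
x   <- c("adfdafc", paste(n_p, "adsfdfa"))

print(Encoding(n_p))

# fixed = TRUE treats the pattern as a literal string rather than a regex
m   <- regexpr(n_p, x, fixed = TRUE)
pos <- as.integer(m)   # -1 means no match; otherwise the start position
print(pos)
```

This avoids relying on how R.exe or Rscript.exe decodes the script file, at the cost of less readable literals.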
You need to ensure that R is running in a suitable locale:
When running Rterm, use Sys.getlocale() to find your current locale.
You can set your locale using:
Sys.setlocale(category = "LC_ALL", locale = "hi-IN")
# Try "hi-IN.UTF-8" too...
Locale names are listed on MSDN, among other places.
If you have the correct value, put the Sys.setlocale() command in your ~/.Rprofile.
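Because Sys.setlocale() returns an empty string (with a warning) when the OS does not honor the request, it can help to probe a few candidate names before putting one in ~/.Rprofile. The helper below is a hypothetical sketch; the name try_ctype and the candidate locale names are assumptions for illustration, and which names are accepted depends on the operating system.

```r
# Hypothetical helper: try candidate locale names for LC_CTYPE and return
# the first one the OS accepts, or NA if none is honored.
try_ctype <- function(candidates) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the original setting when the function exits
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  for (loc in candidates) {
    # Sys.setlocale() returns "" (plus a warning) when the request fails
    res <- suppressWarnings(Sys.setlocale("LC_CTYPE", loc))
    if (nzchar(res)) return(loc)
  }
  NA_character_
}

# Candidate names are guesses; adjust them for your platform
accepted <- try_ctype(c("hi-IN", "hi_IN.UTF-8", "Hindi"))
print(accepted)

# Reports whether the current session is running in a UTF-8 locale
print(l10n_info()[["UTF-8"]])
```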
References
https://cran.r-project.org/bin/windows/base/rw-FAQ.html
http://withr.me/configure-character-encoding-for-r-under-linux-and-windows/
Related
I was working on a toy project and tried using some unicode variable names to match a paper I was attempting to implement.
The following code works fine on R 3.4.3 on Windows (RStudio version 1.1.456) and R 3.5.1 on OSX:
> µ <- function(ß, n) ß * n
> µ(2, 3)
[1] 6
This code gives the following error, with α typed as ALT+224:
> α <- 2
Error: unexpected input in "\"
The file was saved as UTF-8, so this is surprising to me.
make.names is consistent with the results above:
> make.names('µ')
[1] "µ"
> make.names('α')
[1] "a"
What is the rule for non-ASCII letters? Why are mu and scharfes S (ß) OK but alpha isn't?
Edit: Output of sessionInfo()
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.3 tools_3.4.3 yaml_2.2.0
Edit2: It seems like Sys.setlocale should be the answer, but here is what happens when I try this:
> Sys.setlocale("LC_ALL", 'en_US.UTF-8')
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
Working with Ben Bolker we determined the issue was that the current session was using character encoding Windows-1252, which has some non-ASCII characters but not many. This is despite the fact that RStudio saved the file as UTF-8.
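A small sketch of why the three letters behave differently, assuming the iconv build on your platform knows the encoding name "CP1252" and returns NA for characters it cannot convert (some iconv implementations substitute a placeholder instead):

```r
# The three letters from the question, written as Unicode escapes
chars <- c(mu = "\u00b5", eszett = "\u00df", alpha = "\u03b1")

# Try to convert each letter to Windows-1252; with most iconv builds,
# characters that have no Windows-1252 equivalent come back as NA
converted <- iconv(chars, from = "UTF-8", to = "CP1252")

representable <- !is.na(converted)
print(representable)
```

Under that assumption, mu and ß are reported as representable and alpha is not, which is consistent with the behaviour seen in the question.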
Changing the locale of a running R session does not seem to be possible. At least on Windows, I get a warning (see the question).
I have a partial solution. If you are given a file like this and want to run it and have interactive access to the results, the following will mostly work (variables will be translated to Win-1252):
> source('utf-8-file.r', encoding='UTF-8')
I would be very excited to see a better solution, one which allows editing and running the file and entering such snippets into the console of RStudio on Windows.
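The source() call above can be tried end to end with a throwaway file. This is only a sketch: the temporary script and the variable name greeting are made up for illustration, and the example deliberately uses a character (é) that exists in Windows-1252 so that the translation mentioned above is lossless.

```r
# Write a tiny script as UTF-8 bytes to a temporary file
script <- tempfile(fileext = ".R")
writeLines("greeting <- 'caf\u00e9'", script, useBytes = TRUE)

# Evaluate it in a separate environment, declaring the file's encoding
env <- new.env()
source(script, local = env, encoding = "UTF-8")

print(nchar(env$greeting))   # number of characters, not bytes
unlink(script)
```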
I am reading input text file line by line with the following function:
lines_reader<-function(filename){
conn<-file(filename,open="r")
linn<-readLines(conn,encoding="UCS-2LE")
close(conn)
return(linn)
}
If I try to plot these lines in the R environment, letters with accent marks are not rendered properly, appearing as "Ã" or "Ã¨" instead of "à" or "è".
How can I cope with this? What encoding should I choose?
Here are my session and locale info:
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C
[5] LC_TIME=Italian_Italy.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.2.0
> Sys.getlocale()
[1] "LC_COLLATE=Italian_Italy.1252;LC_CTYPE=Italian_Italy.1252;LC_MONETARY=Italian_Italy.1252;LC_NUMERIC=C;LC_TIME=Italian_Italy.1252"
How about changing the encoding that you are using:
lines_reader<-function(filename){
conn<-file(filename,open="r")
linn<-readLines(conn,encoding="UTF-8")
close(conn)
return(linn)
}
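A variant worth trying, sketched under the assumption that the input really is UTF-8: the encoding argument of readLines() only marks the strings it returns, whereas the encoding argument of file() makes the connection re-encode the input as it is read. The parameter name file_encoding and the sample file below are made up for illustration.

```r
# Variant of lines_reader that declares the encoding on the connection,
# so the input is converted to the session's native encoding while reading
lines_reader <- function(filename, file_encoding = "UTF-8") {
  conn <- file(filename, open = "r", encoding = file_encoding)
  on.exit(close(conn))   # close the connection even if readLines() fails
  readLines(conn)
}

# Create a small UTF-8 test file with accented Italian words
tmp <- tempfile(fileext = ".txt")
writeLines(c("citt\u00e0", "perch\u00e9"), tmp, useBytes = TRUE)

lines <- lines_reader(tmp)
print(nchar(lines))   # character counts, which are wrong if decoding failed
unlink(tmp)
```

If the file is actually UCS-2LE, as the question's code suggests, the same function can be called with file_encoding = "UCS-2LE".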
What am I missing here? It works on another PC. From R, I'm trying to use Sys.setenv() so that Ghostscript can be invoked. Normally I can then use tools::find_gs_cmd() to check that it is set.
Here's the Ghostscript executable:
> dir("C:/Program Files/gs/gs9.16/bin/")
[1] "gsdll64.dll" "gsdll64.lib" "gswin64.exe" "gswin64c.exe"
But when I try to set it, it does not seem to take effect:
> Sys.setenv(R_GSCMD="C:/Program Files/gs/gs9.16/bin/gswin64c.exe")
> tools::find_gs_cmd("gswin64c.exe")
gswin64c.exe
""
If I use Sys.getenv(), it is there in the list of environment variables all right, so I'm a little unsure what is happening.
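A diagnostic sketch that may help narrow this down; it does not identify the cause. It checks that the path exists before setting R_GSCMD, then calls tools::find_gs_cmd() with no argument so that the lookup falls through to the environment variable. The path is the one from the question and is assumed to exist only on that machine.

```r
# Path taken from the question; it will not exist on other machines
gs_path <- "C:/Program Files/gs/gs9.16/bin/gswin64c.exe"

# Only set the variable if the executable is really there
if (file.exists(gs_path)) {
  Sys.setenv(R_GSCMD = gs_path)
}

# What R currently sees in the environment ("" if unset)
print(Sys.getenv("R_GSCMD"))

# With no argument, find_gs_cmd() consults R_GSCMD before its other fallbacks
found <- tools::find_gs_cmd()
print(found)   # "" means no Ghostscript executable was located
```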
My system is:
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
[5] LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods base
R doesn't display Arabic text correctly. I get very weird output when I use Arabic. Here's a screenshot:
The problem is that I want to create a wordcloud with Arabic text and I need to solve this problem first.
R version: R 2.15.2 GUI 1.53 Leopard build 64-bit (6335)
Here is more info:
> options("encoding")
$encoding
[1] "native.enc"
> Encoding("الله")
[1] "unknown"
SessionInfo():
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/C/C/C/de_DE/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.15.2
>
Some tinkering:
> x = "مرحبا"
> Encoding(x) = "UTF-8"
> x
[1] "<U+0645><U+0631><U+062D><U+0628><U+0627>"
> Encoding(iconv(x))
[1] "unknown"
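To separate a storage problem from a display problem in tinkering like the above, one can inspect the code points directly. This is a sketch using the same word, written with \u escapes so that it does not depend on how the console decodes typed input; validUTF8() requires R 3.3.0 or later, which is newer than the version in the question.

```r
# The same Arabic word, written as Unicode escapes
x <- "\u0645\u0631\u062d\u0628\u0627"

# Check that the bytes form valid UTF-8
print(validUTF8(x))

# Decode to integer code points; 1605 is U+0645, the first letter
cp <- utf8ToInt(x)
print(cp)
```

If the code points are correct but the console still prints <U+0645>-style escapes, the string is stored properly and the problem is in how the current locale displays it.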
More info:
> Sys.getlocale()
[1] "C/C/C/C/de_DE/C"
> Sys.setlocale("LC_ALL", "en_US.utf8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.utf8") :
OS reports request to set locale to "en_US.utf8" cannot be honored
>
This solved the problem:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
This works:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
Just wanted to point out that I'm not having this problem (Arabic characters are displayed correctly without any change to locale), even though I am not in a UTF-8 locale. Not sure what to make of this, so if someone else does please enlighten us.
I'm using RStudio 0.98.1091 and my sessionInfo is as follows:
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C LC_TIME=French_France.1252