How to read extra-ASCII characters in R?

I am reading an input text file line by line with the following function:
lines_reader <- function(filename) {
  conn <- file(filename, open = "r")
  linn <- readLines(conn, encoding = "UCS-2LE")
  close(conn)
  return(linn)
}
If I try to display these lines in the R environment, accented letters are not rendered correctly: they appear as "Ã " or "Ã¨" instead of "à" or "è".
How can I cope with this? Which encoding should I choose?
Here are my session and locale info:
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C
[5] LC_TIME=Italian_Italy.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.2.0
> Sys.getlocale()
[1] "LC_COLLATE=Italian_Italy.1252;LC_CTYPE=Italian_Italy.1252;LC_MONETARY=Italian_Italy.1252;LC_NUMERIC=C;LC_TIME=Italian_Italy.1252"

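How about changing the encoding you are using? Seeing "Ã" in place of "à" is typical of UTF-8 bytes being displayed as Windows-1252, so try UTF-8:
lines_reader <- function(filename) {
  conn <- file(filename, open = "r")
  linn <- readLines(conn, encoding = "UTF-8")
  close(conn)
  return(linn)
}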
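As a rough check of this suggestion, the sketch below writes a small UTF-8 file byte by byte and reads it back with the same kind of function. It assumes the input file really is UTF-8, which the "Ã" symptom suggests but does not prove; the temporary file and the enc argument are illustrative additions, not part of the original code.

```r
# Write "à è" as raw UTF-8 bytes so the test file does not depend on the locale.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xc3, 0xa0, 0x20, 0xc3, 0xa8, 0x0a)), tmp)

# Same reader as above, with the encoding exposed as an (assumed) argument.
lines_reader <- function(filename, enc = "UTF-8") {
  conn <- file(filename, open = "r")
  on.exit(close(conn))             # close the connection even if readLines fails
  readLines(conn, encoding = enc)  # marks the strings as UTF-8, no conversion
}

x <- lines_reader(tmp)
Encoding(x)    # how R has marked the string
charToRaw(x)   # the underlying bytes, unchanged from the file
unlink(tmp)
```

Note that the encoding argument of readLines() only marks the strings; it does not re-encode them. If the file were genuinely UCS-2LE, the encoding would have to be given to the connection instead, for example file(filename, open = "r", encoding = "UCS-2LE").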

Related

Configure R to use utf-8 by default

To avoid problems when using R and RStudio, can R be configured to work with UTF-8 by default? My system is Windows 10 64-bit and my R scripts are saved as ANSI, but when I open those files in RStudio the Latin characters (ñ, accented letters, the euro symbol €) are displayed incorrectly. I have to convert the files to a different character encoding before I can use them in RStudio.
From the R console:
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
[3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.5
In RStudio, open the Tools menu and click Global Options (at the bottom).
Then click Code, then Saving, where you can set the default text encoding.
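For files that already exist in the old encoding, a one-off conversion may still be needed. The following is a minimal sketch of such a conversion, assuming the "ANSI" files are Windows-1252 (matching the Spanish_Spain.1252 locale shown above); convert_to_utf8 and its arguments are hypothetical names, not an existing R or RStudio function.

```r
# Hypothetical helper: re-encode a text file to UTF-8.
# from = "CP1252" is an assumption about the source encoding.
convert_to_utf8 <- function(path_in, path_out, from = "CP1252") {
  lines_in <- readLines(path_in, warn = FALSE)            # bytes are read unchanged
  lines_out <- iconv(lines_in, from = from, to = "UTF-8") # NA marks a failed conversion
  if (anyNA(lines_out)) stop("some lines could not be converted from ", from)
  writeLines(lines_out, path_out, useBytes = TRUE)        # write the UTF-8 bytes as they are
  invisible(path_out)
}

# Demo on a temporary file containing "ñ €" encoded as Windows-1252.
src <- tempfile(fileext = ".R")
dst <- tempfile(fileext = ".R")
writeBin(as.raw(c(0xf1, 0x20, 0x80, 0x0a)), src)
convert_to_utf8(src, dst)
converted <- readLines(dst, encoding = "UTF-8")
charToRaw(converted)   # the same text, now as UTF-8 bytes
```

Writing to a separate output path keeps the original file intact in case the assumed source encoding turns out to be wrong.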

Running Rscript with UTF-8 on Windows 10 [duplicate]

I'm working with files containing text in Hindi and parsing them. I wrote my code in RStudio and executed it without many issues. But now I need to execute the same script from the command line using R.exe/Rscript.exe, and it doesn't work the same way. I ran a simple script from both RStudio and the terminal:
n_p<-'नाम'
Encoding(n_p)
gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
sessionInfo()
Output In RStudio:
> n_p<-'नाम'
>
> Encoding(n_p)
[1] "UTF-8"
>
> gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 3
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rJava_0.9-10
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0
Output with R.exe in cmd (for debugging purposes; Rscript.exe gives similar, if not identical, output):
> n_p<-'à☼"à☼_à☼r'
>
> Encoding(n_p)
[1] "latin1"
>
> gregexpr(n_p,c('adfdafc','à☼"à☼_à☼r adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 9
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0
I've tried changing locales, but Sys.setlocale refuses to work properly. In some cases, gregexpr gives an error when it can't parse non-ASCII code. And when it does run without errors, it doesn't match regular expressions properly. I can't provide a reproducible example at the moment, but I will try to later.
Help.
The right answer is to run Rscript with the option --encoding=<file encoding>.
There is no need to set the locale and, as you probably found out, it doesn't work anyway. If your file is UTF-8:
Rscript.exe --encoding=UTF-8 file.R
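Where passing an option on the command line is not practical, another possibility is to keep the script source pure ASCII by writing the non-ASCII literal with \u escapes, so that it no longer matters how the file is decoded. This is a sketch of an alternative, not something the answer above proposes; the code points are those of the three characters in the question's literal.

```r
# "\u0928\u093e\u092e" spells the same three code points as the Hindi literal
# in the question, but the source file itself contains only ASCII.
n_p <- "\u0928\u093e\u092e"
Encoding(n_p)   # strings built from \u escapes are marked as UTF-8

# Same search as in the question, with the subject built from the escaped literal.
m <- gregexpr(n_p, c("adfdafc", paste(n_p, "adsfdfa")))
m[[2]]          # match position and length, counted in characters
```

This only protects literals inside the script; text read from files still needs its encoding declared when it is read.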
You need to ensure that R is running in a suitable locale.
When running Rterm, use Sys.getlocale() to find your current locale.
You can set your locale using:
Sys.setlocale(category = "LC_ALL", locale = "hi-IN")
# Try "hi-IN.UTF-8" too...
You can find locale names on MSDN.
Once you have the correct value, put the Sys.setlocale() command in your ~/.Rprofile.
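Because Sys.setlocale() returns an empty string (with a warning) when the requested locale cannot be honoured, trying several candidate names can be automated. The helper below is a minimal sketch; try_locale is a hypothetical name, and which locale names exist depends entirely on the operating system.

```r
# Hypothetical helper: try candidate locale names in order and return the
# first one the OS accepts, or NA if none of them can be set.
try_locale <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    res <- suppressWarnings(Sys.setlocale(category, loc))
    if (nzchar(res)) return(res)   # a non-empty result means the locale was set
  }
  NA_character_
}

# The first name is deliberately invalid; "C" is always available.
chosen <- try_locale(c("definitely_not_a_locale", "C"))
chosen
```

In practice the candidate vector would hold the names being tested, such as "hi-IN" and "hi-IN.UTF-8" from the answer above.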
References
https://cran.r-project.org/bin/windows/base/rw-FAQ.html
http://withr.me/configure-character-encoding-for-r-under-linux-and-windows/

How to fix an invalid multibyte string error with recommenderRegistry$get_entries()?

I'm trying the "recommenderlab" package and getting an error with the recommenderRegistry$get_entries() function.
To get information about the recommendation algorithms, I tried
library(recommenderlab)
recommenderRegistry$get_entries()
But the output is truncated in the middle, as shown below.
$AR_binaryRatingMatrix
Recommender method: AR for binaryRatingMatrix
Description: Recommender based on association rules.
Reference: NA
Parameters:
Error in dput(p, control = list()) :
invalid multibyte string at '<ff><fe><63><6f>nfidence<ff><fe>"
When I tried
names(recommenderRegistry[["AR", "binaryRatingMatrix"]][['parameters']])
all the parameters are displayed without an error.
[1] "support" "confidence" "maxlen" "sort_measure" "sort_decreasing" "apriori_control" "verbose"
Additionally, I tried re-installing it, but the result is the same.
recommenderRegistry$get_entries() should show the 15 algorithms available in this package. Has anyone encountered a similar problem?
The system information is as follows:
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Korean_Korea.949
[2] LC_CTYPE=Korean_Korea.949
[3] LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C
[5] LC_TIME=Korean_Korea.949
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1
[3] yaml_2.2.0
It looks like your problem may be your locale setting:
locale:
[1] LC_COLLATE=Korean_Korea.949
[2] LC_CTYPE=Korean_Korea.949
[3] LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C
[5] LC_TIME=Korean_Korea.949
You can try Sys.setlocale("LC_ALL", "C")
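Changing the locale for the whole session can affect other code, so one option is to switch it only around the failing call and restore it afterwards. The sketch below is a narrower variant of the suggestion above: it changes only LC_CTYPE rather than LC_ALL, and with_ctype_c is a hypothetical helper name. Whether this avoids the dput() error is untested here.

```r
# Hypothetical helper: evaluate an expression with LC_CTYPE set to "C",
# then restore the previous setting even if the expression fails.
with_ctype_c <- function(expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", "C")
  force(expr)   # the argument is evaluated lazily, i.e. under the "C" locale
}

# Demonstration: the locale seen inside the call, and the one restored after it.
before <- Sys.getlocale("LC_CTYPE")
inside <- with_ctype_c(Sys.getlocale("LC_CTYPE"))
after <- Sys.getlocale("LC_CTYPE")

# Intended use (not run here, requires recommenderlab):
# with_ctype_c(recommenderlab::recommenderRegistry$get_entries())
```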

Graphics API version mismatch

I have a big problem.
I updated R to the latest version.
Since the update, every time I try to plot something with any graphics function (even the basic plot), I get this error:
Error: Graphics API version mismatch
This is my session info:
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Italian_Italy.1252
[2] LC_CTYPE=Italian_Italy.1252
[3] LC_MONETARY=Italian_Italy.1252
[4] LC_NUMERIC=C
[5] LC_TIME=Italian_Italy.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] compiler_3.3.2 tools_3.3.2
Note: no visible binding for global variable 'x'
Note: no visible binding for global variable 'envir'
All installed packages are updated to the latest available version.
