R console does not recognize the encoding of directly input text

Why is text typed directly in an R script not treated as UTF-8? The file itself is saved and opened as UTF-8. For example, the encoding of Chinese strings mixed with numbers is reported as unknown in the console.
> str <- c("字符編碼問題", "字符編碼問題1", "字符編碼問題2")
> Encoding(str)
[1] "unknown" "unknown" "unknown"
Using the enc2utf8() function to declare the encoding changes the encoding recognized in R.
> str_utf <- enc2utf8(str)
> Encoding(str_utf)
[1] "UTF-8" "UTF-8" "UTF-8"
This affects text-processing functions such as str_detect(), because the text stored in the str_utf object is UTF-8 encoded whereas the text in the script has an unknown encoding. Declaring the encoding of every single search string is cumbersome. So my question is: how can the text in the script be recognized as UTF-8 by default?
> str_detect(str, "題[:digit:]")
[1] FALSE TRUE TRUE
# fails to detect text patterns in the UTF-8 object
> str_detect(str_utf, "題[:digit:]")
[1] FALSE FALSE FALSE
> str_detect(str_utf, enc2utf8("題[:digit:]"))
[1] FALSE TRUE TRUE
My R session information is as follows:
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Traditional)_Hong Kong SAR.950 LC_CTYPE=Chinese (Traditional)_Hong Kong SAR.950
[3] LC_MONETARY=Chinese (Traditional)_Hong Kong SAR.950 LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Hong Kong SAR.950
system code page: 932
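For reference, a minimal sketch of how R's encoding marks behave, written with \u escapes so that it does not depend on the encoding of the script file or on the locale. It uses base R's grepl() instead of str_detect(), which is why the digit class is spelled [[:digit:]]; the variable names are only illustrative.

```r
# ASCII-only strings are never marked, even after enc2utf8().
ascii <- "abc123"
Encoding(ascii)            # "unknown"
Encoding(enc2utf8(ascii))  # "unknown"

# A literal written with a Unicode escape is stored and marked as UTF-8.
s <- "\u984c1"             # the character U+984C followed by the digit 1
Encoding(s)                # "UTF-8"

# Both pattern and subject are marked UTF-8, so no re-declaration is needed.
found <- grepl("\u984c[[:digit:]]", s)
```

Writing non-ASCII literals as escapes is tedious, but it is one way to get UTF-8 marked strings without calling enc2utf8() on each of them.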

Related

Using urltools::url_parse with UTF-8 domains

The function url_parse is very fast and works fine most of the time. But recently, domain names may contain non-ASCII characters, for example
url <- "www.cordes-tiefkühlprodukte.de"
Now if I apply url_parse on this url, I get a special character "<fc>" in the domain column:
url_parse(url)
scheme domain port path parameter fragment
1 <NA> www.cordes-tiefk<fc>hlprodukte.de <NA> <NA> <NA> <NA>
My question is: how can I "fix" this entry to UTF-8? I tried iconv and some functions from the stringi package, but with no success.
(I am aware of httr::parse_url, which does not have this problem. So one approach would be to detect the URLs that are not ASCII, use parse_url on those few special cases, and use url_parse on the rest. However, this leads to the problem of (efficiently) detecting the non-ASCII URLs.)
EDIT: Unfortunately, url1 <- URLencode(enc2utf8(url)) does not help. When I do
robotstxt::paths_allowed(
url1,
domain=urltools::suffix_extract(urltools::domain(url1))
)
I get the error "could not resolve host". However, when I plug in the original URL and the second-level domain by hand, paths_allowed works.
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] urltools_1.7.3 fortunes_1.5-4
loaded via a namespace (and not attached):
[1] compiler_3.6.1 Rcpp_1.0.1 triebeard_0.3.0
I could reproduce the issue. I could convert the column domain to UTF-8 by reading it with readr::parse_character and latin1 encoding:
library(urltools)
library(tidyverse)
url <- "www.cordes-tiefkühlprodukte.de"
parts <-
url_parse(url) %>%
mutate(domain = parse_character(domain, locale = locale(encoding = "latin1")))
parts
scheme domain port path parameter fragment
1 <NA> www.cordes-tiefkühlprodukte.de <NA> <NA> <NA> <NA>
I guess that the encoding you have to specify (here latin1) depends only on your locale and not on the url's special characters, but I'm not 100% sure about that.
Just for reference, another method that works fine for me is:
library(stringi)
url <- "www.cordes-tiefkühlprodukte.de"
url <- stri_escape_unicode(url)
dat <- urltools::url_parse(url)
for(cn in colnames(dat)) dat[,cn] <- stri_unescape_unicode(dat[,cn])
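The question also mentions the problem of detecting non-ASCII URLs. A minimal base-R sketch, assuming it is enough to check whether any byte is outside the 7-bit range; is_ascii is a hypothetical helper name and www.example.com is a placeholder URL.

```r
# TRUE for elements that contain only 7-bit ASCII bytes.
is_ascii <- function(x) {
  vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )
}

urls <- c("www.example.com", "www.cordes-tiefk\u00fchlprodukte.de")
flags <- is_ascii(urls)
```

The flags could then be used to route the ASCII URLs to url_parse and the remaining ones to a slower parser.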

Stringi/stringr patterns behave differently if the function is sourced in R

I have been using the stringi package for a while now and everything works fine.
I recently wanted to put some regex inside a function and store that function in a separate file. The code works just fine if the function is loaded from the script but when it is sourced I do not get the expected result.
Here is the code to reproduce the issue :
clean <- function(text){
stri_replace_all_regex(str = text,
pattern = "(?i)[^a-zàâçéèêëîïôûùüÿñæœ0-9,\\.\\?!']",
replacement = " ")
}
text <- "A sample text with some french accent é, è, â, û and some special characters |, [, ( that needs to be cleaned."
clean(text) # OK
[1] "A sample text with some french accent é, è, â, û and some special characters , , that needs to be cleaned."
source("clean.r")
clean(text) # KO
[1] "A sample text with some french accent , , , and some special characters , , that needs to be cleaned."
I want to remove everything that is not a letter, an accented letter, a digit, or one of the punctuation characters ?, !, ', the comma, and the period.
The code works just fine if the function is loaded inside the script directly. If it is sourced then it gives a different result.
I also tried using stringr and I have the same problem. My files are saved in UTF-8 encoding.
I do not understand why this is happening, any help is greatly appreciated.
Thank you.
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.1.5 data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1 yaml_2.1.14
Try converting the text to ASCII first. This will change the characters, and may allow the same behaviour when you source the function in R.
+1 to Felipe Alvarenga
https://stackoverflow.com/a/45941699/2069472
text <- "Ábcdêãçoàúü"
iconv(text, to = "ASCII//TRANSLIT")
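If transliterating to ASCII is not acceptable, another option is to keep the accents but make the sourced file itself pure ASCII by writing the accented letters as \u escapes; calling source("clean.r", encoding = "UTF-8") may also be worth trying. The sketch below is an assumption-laden variant of the original function: it uses base gsub() instead of stringi, lists lowercase accented letters only (no case-insensitive flag), and clean_ascii_source is a hypothetical name.

```r
# Characters to keep, written without any non-ASCII byte in the source file.
keep <- paste0(
  "a-zA-Z0-9,.?!' ",
  "\u00e0\u00e2\u00e7\u00e9\u00e8\u00ea\u00eb",  # a-grave ... e-diaeresis
  "\u00ee\u00ef\u00f4\u00fb\u00f9\u00fc"          # i-circumflex ... u-diaeresis
)

# Replace every character outside the keep list with a space.
clean_ascii_source <- function(text) {
  gsub(paste0("[^", keep, "]"), " ", text)
}

text <- "caf\u00e9 | na\u00efve [ok]"
cleaned <- clean_ascii_source(text)
```

Because the pattern no longer contains raw accented bytes, its meaning does not change with the encoding that source() assumes for the file.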

R source() encoding bug?

I have found a very strange bug concerning the encoding of character constants in R.
main.R:
options(encoding = "UTF-8")
print(Sys.getlocale())
print(getOption("encoding"))
print("first run")
source("internal.R")
print("")
print("second run")
source("internal.R", encoding = "UTF-8")
print("")
internal.R
print(Sys.getlocale())
print(getOption("encoding"))
char_constant="Тут не просто живут баги, тут у них гнездо"
print(Encoding(char_constant))
Now let's see the output after pushing the Source button in RStudio:
[1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8"
[1] "UTF-8"
[1] "first run"
[1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8"
[1] "UTF-8"
[1] "unknown"
[1] ""
[1] "second run"
[1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8"
[1] "UTF-8"
[1] "UTF-8"
[1] ""
Notice the difference in encoding: "unknown" the first time and "UTF-8" the second time.
There is an obvious small bug: source ignores the default encoding option.
The real problem is that mixing different encodings in data.table causes a lot of trouble, and RStudio produces a "UTF-8" constant when you execute a single line but an "unknown" constant when you source the whole file.
Does anybody have any idea what is going on and how to work around it?
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin14.5.0 (64-bit)
Running under: OS X 10.12.4 (unknown)
locale:
[1] ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.0
On Windows, R's source function does not work with files that include characters that aren't part of the current system encoding. You may have trouble with RStudio's Run All and Source on Save commands, as they rely on source.
Take a look at: https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding
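As a possible workaround for mixed encoding marks in tabular data, one could normalize all character columns with enc2utf8() before joining or comparing them. This is only a sketch for plain data frames; normalize_utf8 is a hypothetical helper, and whether it resolves the data.table problems described in the question has not been verified.

```r
# Convert every character column to UTF-8 (ASCII-only strings stay unmarked).
normalize_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

# The same word stored once as latin1 bytes and once as UTF-8.
latin1 <- rawToChar(as.raw(c(0x66, 0x61, 0xE7, 0x69, 0x6C, 0x65)))
Encoding(latin1) <- "latin1"
df <- data.frame(id = 1:2, txt = c(latin1, "fa\u00e7ile"),
                 stringsAsFactors = FALSE)

out <- normalize_utf8(df)
Encoding(out$txt)
```

After the call both elements carry the same mark and the same bytes, so byte-wise comparisons no longer depend on how each string entered the session.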

Using special characters in Rstudio

I am working with some special characters in RStudio. It converts them into plain letters.
print("Safarzyńska2013")
[1] "Safarzynska2013"
x <- "Māori"
x
[1] "Maori"
Is there any way to read in the exact original characters?
Following info might be helpful:
Rstudio default encoding is UTF-8
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.1.1
This is not an exclusively RStudio problem.
Typing print("Safarzyńska2013") on the console of RGui also converts them to plain letters. Running this code from an UTF-8 encoded Script in RGui returns [1] "Safarzy?ska2013".
I don't think that it is a good idea to type such special chars on the console. x <- "SomeString"; Encoding(x) returns "unknown" and that is probably the problem: R has no idea what encoding you are using on the console and probably has no chance to get your original encoding.
I put "Safarzyńska2013\nMāori\n" in a text file encoded with UTF-8. Then the following works fine:
tbl <- read.table('c:/test1.txt', encoding = 'UTF-8', stringsAsFactors = FALSE)
tbl[1,1]
tbl[2,1]
Encoding(tbl[1,1]) # returns "UTF-8"
If you really want to use the console, you probably will have to mask the special chars. In ?Encoding we find the following example to create a word with special chars:
x <- "fa\xE7ile"
Encoding(x)
Actually I don't know at the moment how to get these codes for your special chars and ?Encoding has also no hints...
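One way to obtain such codes, sketched here under the assumption that the string is already available as valid UTF-8 somewhere (for example read from a file as above), is to list its Unicode code points with utf8ToInt() and turn the non-ASCII ones into \u escapes. to_escapes is a hypothetical helper; it only covers code points up to U+FFFF, since higher ones need the \U form.

```r
# Code points of a UTF-8 string.
x <- "M\u0101ori"
codes <- utf8ToInt(x)
sprintf("U+%04X", codes)

# Rewrite a string as ASCII text with \uXXXX escapes for non-ASCII characters.
to_escapes <- function(s) {
  cp <- utf8ToInt(enc2utf8(s))
  parts <- ifelse(cp < 128L,
                  intToUtf8(cp, multiple = TRUE),
                  sprintf("\\u%04x", cp))
  paste0(parts, collapse = "")
}

escaped <- to_escapes("Safarzy\u0144ska2013")
```

The resulting text can be pasted between quotes in a script or at the console, where R parses the escapes back into the original characters.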
Go to the File menu of RStudio, then click Save with Encoding..., choose UTF-8, tick Set as default encoding for source files, and save.
Hope this helps

how to display and input chinese (and other non-ASCII) character in r console?

My system: win7 ultimate 64 english version + r-3.1(64) .
Here is my sessionInfo.
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
LC_MONETARY=English_United States.1252 LC_NUMERIC=C
LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
1. Can't input Chinese characters into the R console
When I type a Chinese character in the R console, it turns into garbled characters.
2. Can't display Chinese characters in the R console
When I read data in the R console, the Chinese characters turn into garbled characters.
You can download the data, and test it with
read.table("r1.csv",sep=",")
Download Data
Please see the screenshot if you don't know how to download the data from my website.
How can I setup my pc to properly display and input chinese characters in r console?
I have installed the Chinese language pack and enabled it, but the problem remains.
It is probably not very well documented, but you want to use Sys.setlocale in order to use Chinese. The method applies to many other languages as well. The solution is not obvious, as the official documentation of Sys.setlocale does not specifically mention it as a way to solve display issues.
> print('ÊÔÊÔ') #试试, meaning let's give it a shot in Chinese
[1] "ÊÔÊÔ" #won't show up correctly
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> Sys.setlocale(category = "LC_ALL", locale = "chs") #cht for traditional Chinese, etc.
[1] "LC_COLLATE=Chinese_People's Republic of China.936;LC_CTYPE=Chinese_People's Republic of China.936;LC_MONETARY=Chinese_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese_People's Republic of China.936"
> print('试试')
[1] "试试"
> read.table("c:/CHS.txt",sep=" ") #Chinese: the 1st record/observation
V1 V2 V3 V4 V5 V6
1 122 第一 122 条 122 记录
If you just want to change the display encoding, without changing other aspects of locales, use LC_CTYPE instead of LC_ALL:
> Sys.setlocale(category = "LC_CTYPE", locale = "chs")
[1] "Chinese_People's Republic of China.936"
> print('试试')
[1] "试试"
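Because Sys.setlocale changes the state of the whole session, it can be convenient to switch LC_CTYPE only for one piece of code and restore it afterwards. A minimal sketch; with_ctype is a hypothetical helper, and the "C" locale is used in the demonstration only because it exists on every platform (on Windows one would pass "chs" as above).

```r
# Evaluate `code` with LC_CTYPE temporarily set to `locale`.
with_ctype <- function(locale, code) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)  # restore on exit
  Sys.setlocale("LC_CTYPE", locale)
  force(code)  # `code` is lazy, so it runs under the new locale
}

before <- Sys.getlocale("LC_CTYPE")
inside <- with_ctype("C", Sys.getlocale("LC_CTYPE"))
after <- Sys.getlocale("LC_CTYPE")
```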
Of course, this only applies to the official R console. If you use another IDE, such as the very popular RStudio, you don't need to do this at all to be able to type and display Chinese, even if you don't have the Chinese locale loaded.
Migrate some useful stuff from the following comments:
If the data still fails to show up correctly, then we should also look into the file encoding. If the file is UTF-8 encoded, either data <- read.table("your_file", sep=',', fileEncoding="UTF-8-BOM", header=TRUE) or fileEncoding="UTF-8" will do, depending on which encoding the file really has.
But you may want to stay away from UTF-8 with BOM, as it is not recommended: What's different between UTF-8 and UTF-8 without BOM?
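To decide between the two fileEncoding values, one can look at the first three bytes of the file, since a UTF-8 byte order mark is the byte sequence EF BB BF. A minimal sketch with a hypothetical helper has_utf8_bom; the temporary file and its contents are made up for the demonstration.

```r
# TRUE if the file starts with the UTF-8 byte order mark.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Demonstration file: a small CSV saved with a BOM.
bom_file <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("a,b\n1,2\n")), bom_file)

enc <- if (has_utf8_bom(bom_file)) "UTF-8-BOM" else "UTF-8"
data <- read.table(bom_file, sep = ",", fileEncoding = enc, header = TRUE)
names(data)
```

With the BOM-aware encoding, the first column name is read cleanly instead of carrying the mark as stray leading characters.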
