R source() encoding bug?

I have found a very strange bug involving the encoding of character constants in R.
main.R:
options(encoding = "UTF-8")
print(Sys.getlocale())
print(getOption("encoding"))
print("first run")
source("internal.R")
print("")
print("second run")
source("internal.R", encoding = "UTF-8")
print("")
internal.R:
print(Sys.getlocale())
print(getOption("encoding"))
char_constant="Тут не просто живут баги, тут у них гнездо"
print(Encoding(char_constant))
Now let's look at the output after pressing the Source button in RStudio:
[1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8"
[1] "UTF-8"
[1] "first run"
[1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8"
[1] "UTF-8"
[1] "unknown"
[1] ""
[1] "second run"
[1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8"
[1] "UTF-8"
[1] "UTF-8"
[1] ""
Notice the difference in encoding: "unknown" the first time and "UTF-8" the second time.
The obvious small bug is that source() ignores the default encoding option.
The real problem is that mixing different encodings in data.table causes a lot of trouble, and RStudio produces a "UTF-8" constant when you execute just one line but an "unknown" constant when you source the whole file.
Does anybody have an idea what is going on and how to work around it?
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin14.5.0 (64-bit)
Running under: OS X 10.12.4 (unknown)
locale:
[1] ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.0

On Windows, R's source function does not work with files that include characters that aren't part of the current system encoding. You may have trouble with RStudio's Run All and Source on Save commands, as they rely on source.
Take a look at: https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding
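Until that changes, a minimal workaround (just a sketch, not an official fix) is to stop relying on options(encoding) and always pass the encoding to source() explicitly, for example through a small wrapper (source_utf8 is a made-up name):
# Hypothetical wrapper: always source files as UTF-8 so string constants
# keep a "UTF-8" encoding mark regardless of options(encoding).
source_utf8 <- function(path, ...) {
  source(path, encoding = "UTF-8", ...)
}
source_utf8("internal.R")
This mirrors the second run above, where passing encoding = "UTF-8" explicitly made Encoding(char_constant) report "UTF-8".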

Related

R console does not recognize the encoding of directly input text

I have been wondering why text typed directly into an R script does not seem to be treated as UTF-8 encoded, even though the file itself is saved and opened as UTF-8. For example, the encoding of Chinese text strings mixed with numbers is reported as unknown in the console.
> str <- c("字符編碼問題", "字符編碼問題1", "字符編碼問題2")
> Encoding(str)
[1] "unknown" "unknown" "unknown"
Using the enc2utf8() function to declare the encoding changes the encoding recognized in R.
> str_utf <- enc2utf8(str)
> Encoding(str_utf)
[1] "UTF-8" "UTF-8" "UTF-8"
This issue affects text processing functions such as str_detect(), because the text stored inside the str_utf object is UTF-8 encoded whereas the text in the script has an unknown encoding. Declaring the encoding of every single search string is cumbersome. So my question is: how can the text in the script be recognized as UTF-8 by default?
> str_detect(str, "題[:digit:]")
[1] FALSE TRUE TRUE
# fails to detect text patterns in the UTF-8 object
> str_detect(str_utf, "題[:digit:]")
[1] FALSE FALSE FALSE
> str_detect(str_utf, enc2utf8("題[:digit:]"))
[1] FALSE TRUE TRUE
My R session information is as follows:
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Traditional)_Hong Kong SAR.950 LC_CTYPE=Chinese (Traditional)_Hong Kong SAR.950
[3] LC_MONETARY=Chinese (Traditional)_Hong Kong SAR.950 LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Hong Kong SAR.950
system code page: 932
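In the absence of a definitive answer here, one hedged workaround in the spirit of the enc2utf8() calls above is to route both the strings and the pattern through enc2utf8() inside a tiny helper, so individual call sites stay clean (detect_utf8 is a hypothetical name, not a stringr function):
library(stringr)

# Hypothetical helper: mark both text and pattern as UTF-8 before matching,
# so neither side carries the "unknown" native-encoding mark.
detect_utf8 <- function(strings, pattern) {
  str_detect(enc2utf8(strings), enc2utf8(pattern))
}

detect_utf8(c("字符編碼問題", "字符編碼問題1", "字符編碼問題2"), "題[:digit:]")
# expected (per the question's example): FALSE TRUE TRUE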

stringi/stringr patterns behave differently if the function is sourced in R

I have been using the stringi package for a while now and everything works fine.
I recently wanted to put some regex inside a function and store that function in a separate file. The code works just fine if the function is defined in the script, but when it is sourced I do not get the expected result.
Here is the code to reproduce the issue:
clean <- function(text){
  stri_replace_all_regex(str = text,
                         pattern = "(?i)[^a-zàâçéèêëîïôûùüÿñæœ0-9,\\.\\?!']",
                         replacement = " ")
}
text <- "A sample text with some french accent é, è, â, û and some special characters |, [, ( that needs to be cleaned."
clean(text) # OK
[1] "A sample text with some french accent é, è, â, û and some special characters , , that needs to be cleaned."
source("clean.r")
clean(text) # KO
[1] "A sample text with some french accent , , , and some special characters , , that needs to be cleaned."
I want to remove everything that is not a letter, an accented letter, or one of the punctuation characters ?, !, comma, and period.
The code works just fine if the function is loaded inside the script directly. If it is sourced then it gives a different result.
I also tried using stringr and I have the same problem. My files are saved in UTF-8 encoding.
I do not understand why this is happening, any help is greatly appreciated.
Thank you.
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.1.5 data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1 yaml_2.1.14
Try converting the text to ASCII first. This will change the characters, but it may give the same behaviour whether or not the function is sourced in R.
+1 to Felipe Alvarenga
https://stackoverflow.com/a/45941699/2069472
text <- "Ábcdêãçoàúü"
iconv(text, to = "ASCII//TRANSLIT")
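An alternative, hedged sketch (related to the source() behaviour discussed in the first question above) is to keep the accented characters and instead tell source() how the file is encoded; clean.r and text are the names from the question:
# Declare the file's encoding when sourcing, so the accented characters
# in the regex survive parsing on a Windows (latin1) locale.
source("clean.r", encoding = "UTF-8")
clean(text)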

Non-ASCII characters in R, reading from .sav file

I am trying to read a .sav file into RStudio. The file contains data from a Spanish language survey, and when I read it into R -- even though my default text encoding has already been set to ISO-8859-1 -- the display of special characters is incorrect.
For example, the word "Camión" appears as
"Cami<c3><b3>n"
even though it shows up correctly as "Camión" in PSPP.
This is what I did:
install.packages("memisc")
library(memisc)
jcv2014 <- as.data.set(spss.system.file('myfile.sav'))
Later, I wanted to create a list of just the variable labels, so I did the following:
library(foreign)
jcv2014.spss <- read.spss("myfile.sav", to.data.frame=FALSE, use.value.labels=FALSE)
jcv2014_vars <- attr(jcv2014.spss, "variable.labels")
(I'm not sure if this is the best way to do it, but it worked)
Anyway, this time around, I still didn't get the proper accents but there was a different sort of encoding:
A variable label that was supposed to be "¿Qué calificación le daría..." instead appeared as
"\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."
I'm not sure how to get the proper characters, but they appear correctly in PSPP. I tried changing the default text encoding in R to both ISO-8859-1 and UTF-8, to no avail. I don't know what the original file was encoded in, but I guessed it would be one of those.
Any ideas?
And if it helps, I have R version 3.1.1 and OS X Yosemite version 10.10.1, and I am using PSPP, not SPSS.
Thanks so much in advance!!!
Can you just set the encoding once you've read the data in?
# Here's your sentence
s <- "\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."
# it has no encoding
Encoding(s)
# [1] "unknown"
# but if you specify UTF-8, then it shows up correctly
iconv(s, 'UTF-8')
# [1] "¿Qué calificación le daría..."
# This also works
Encoding(s) <- 'UTF-8'
s
# [1] "¿Qué calificación le daría..."
Here are the results of my sessionInfo() call. You should post yours too.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.4 hexbin_1.27.0 ggplot2_1.0.0 data.table_1.9.2 yaml_2.1.13
[6] redshift_0.4 RJDBC_0.2-4 rJava_0.9-6 DBI_0.3.1
loaded via a namespace (and not attached):
[1] colorspace_1.2-4 digest_0.6.4 grid_3.1.1 gtable_0.1.2 labeling_0.2
[6] lattice_0.20-29 MASS_7.3-33 munsell_0.4.2 plyr_1.8.1 proto_0.3-10
[11] Rcpp_0.11.2 scales_0.2.4 stringr_0.6.2 tools_3.1.1
Update: it looks like you may not have a locale that supports UTF-8. Here are the locale settings for each category on my system. You might try using Sys.setlocale() and updating them one by one on your system (or just use LC_ALL if you don't feel the need to test each one incrementally). See ?Sys.setlocale for more info.
cat_str <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_NUMERIC",
"LC_TIME", "LC_MESSAGES", "LC_PAPER", "LC_MEASUREMENT")
sapply(cat_str, Sys.getlocale)
# LC_COLLATE LC_CTYPE LC_MONETARY LC_NUMERIC LC_TIME LC_MESSAGES
# "en_US.UTF-8" "en_US.UTF-8" "en_US.UTF-8" "C" "en_US.UTF-8" "en_US.UTF-8"
# LC_PAPER LC_MEASUREMENT
# "" ""

Using special characters in Rstudio

I am working with some special characters in RStudio. It converts them into plain letters.
print("Safarzyńska2013")
[1] "Safarzynska2013"
x <- "Māori"
x
[1] "Maori"
Is there any way to read in the exact original characters?
The following info might be helpful:
RStudio's default encoding is UTF-8
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.1.1
This is not an exclusively RStudio problem.
Typing print("Safarzyńska2013") in the console of RGui also converts the characters to plain letters. Running this code from a UTF-8 encoded script in RGui returns [1] "Safarzy?ska2013".
I don't think it is a good idea to type such special characters in the console. x <- "SomeString"; Encoding(x) returns "unknown", and that is probably the problem: R has no idea what encoding you are using in the console and probably has no way to recover your original encoding.
I put "Safarzyńska2013\nMāori\n" in a text file encoded with UTF-8. Then the following works fine:
tbl <- read.table('c:/test1.txt', encoding = 'UTF-8', stringsAsFactors = FALSE)
tbl[1,1]
tbl[2,1]
Encoding(tbl[1,1]) # returns "UTF-8"
If you really want to use the console, you will probably have to mask the special characters. In ?Encoding we find the following example of creating a word with special characters:
x <- "fa\xE7ile"
Encoding(x)
Actually I don't know at the moment how to obtain these codes for your special characters, and ?Encoding has no hints either...
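One hedged way to obtain such codes is to use \u escapes with the characters' Unicode code points (ń is U+0144 and ā is U+0101); strings containing \u escapes are marked as UTF-8 by R:
x <- "Safarzy\u0144ska2013"  # ń written as \u0144
y <- "M\u0101ori"            # ā written as \u0101
x
y
Encoding(x)  # should report "UTF-8"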
Go to the File menu of RStudio, then click Save with Encoding..., choose UTF-8, set it as the default encoding for source files, and save.
Hope this helps.

Force character vector encoding from "unknown" to "UTF-8" in R

I have a problem with inconsistent encoding of character vector in R.
The text file I read the table from is encoded (via Notepad++) in UTF-8 (I also tried UTF-8 without BOM).
I want to read the table from this text file, convert it to a data.table, set a key and make use of binary search. When I tried to do so, the following appeared:
Warning message:
In [.data.table(poli.dt, "żżonymi", mult = "first") :
A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support
mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and
others not. But if either latin1 or UTF-8 is used exclusively, and all
unknown encodings are ascii, then the result should be ok. In future
we will check for you and avoid this warning if everything is ok. The
tricky part is doing this without impacting performance for ascii-only
cases.
and binary search does not work.
I realised that my data.table key column consists of both "unknown" and "UTF-8" Encoding types:
> table(Encoding(poli.dt$word))
unknown UTF-8
2061312 2739122
I tried to convert this column (before creating a data.table object) with the use of:
Encoding(word) <- "UTF-8"
word<- enc2utf8(word)
but with no effect.
I also tried a few different ways of reading a file into R (setting all helpful parameters, e.g. encoding = "UTF-8"):
data.table::fread
utils::read.table
base::scan
colbycol::cbc.read.table
but with no effect.
==================================================
My R.version:
> R.version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 0.3
year 2014
month 03
day 06
svn rev 65126
language R
version.string R version 3.0.3 (2014-03-06)
nickname Warm Puppy
My session info:
> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.2 colbycol_0.8 filehash_2.2-2 rJava_0.9-6
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3
The Encoding function returns unknown if a character string has a "native encoding" mark (CP-1250 in your case) or if it's in ASCII.
To discriminate between these two cases, call:
library(stringi)
stri_enc_mark(poli.dt$word)
To check whether each string is a valid UTF-8 byte sequence, call:
all(stri_enc_isutf8(poli.dt$word))
If it's not the case, your file is definitely not in UTF-8.
I suspect that you haven't forced the UTF-8 mode in the data read function (try inspecting the contents of poli.dt$word to verify this statement). If my guess is true, try:
read.csv2(file("filename", encoding="UTF-8"))
or
poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
I could not find a solution myself to a similar problem.
I could not translate characters of unknown encoding from a txt file back into something more manageable in R.
Therefore, I was in a situation where the same character appeared more than once in the same dataset, because it was encoded differently ("X" in a Latin setting and "X" in a Greek setting).
However, the txt saving operation preserved that encoding difference, which is of course correct.
Trying some of the above methods, nothing worked.
The problem is well described: "cannot distinguish ASCII from UTF-8 and the bit will not stick even if you set it".
A good workaround is to "export your data.frame to a CSV temporary file and reimport with data.table::fread(), specifying Latin-1 as source encoding".
Reproducing / copying the example from the above source:
library(data.table)
df <- your_data_frame_with_mixed_utf8_or_latin1_and_unknown_str_fields
fwrite(df, "temp.csv")
your_clean_data_table <- fread("temp.csv", encoding = "Latin-1")
I hope it will help someone.
