How does R handle Unicode / UTF-8? - r

If I write
`Δ` <- function(a,b) (a-b)/a
then I can include U+0394 so long as it's enclosed in backticks. (By contrast, Δ <- function(a,b) (a-b)/a fails with unexpected input in "�".) So apparently R parses UTF-8 or Unicode or something like that. The assignment goes well and so does the evaluation of, e.g.,
`Δ`(1:5, 9:13)
And I can also evaluate Δ(1:5, 9:13).
Finally, if I define something like winsorise <- function(x, λ = .05) { ... } then λ (U+03BB) doesn't need to be "introduced to" R with a backtick. I can then call winsorise(data, .1) with no problems.
The only mention of Unicode I can find in R's documentation is over my head. Could someone who understands it better explain what's going on "under the hood" when R needs the ` to understand assignment to ♔, but can parse ♔(a,b,c) once assigned?

I can't speak to what's going on under the hood regarding the function calls vs. function arguments, but this email from Prof. Ripley from 2008 may shed some light (excerpt below):
R passes around, prints and plots UTF-8 character data pretty well, but it translates to the native encoding for almost all character-level manipulations (and not just on Windows). ?Encoding spells out the exceptions [...]
The reason R does this translation (on Windows at least) is mentioned in the documentation that the OP linked to:
Windows has no UTF-8 locales, but rather expects to work with UCS-2 strings. R (being written in standard C) would not work internally with UCS-2 without extensive changes.
The R documentation for ?Quotes explains how you can sometimes use out-of-locale characters anyway (emphasis added):
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used directly in R code. Almost always, other names can be used provided they are quoted. The preferred quote is the backtick (`), and deparse will normally use it, but under many circumstances single or double quotes can be used (as a character constant will often be converted to a name). One place where backticks may be essential is to delimit variable names in formulae: see formula.
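For instance, here is a small hypothetical illustration (assuming a UTF-8 locale, so the non-syntactic column name is preserved) of why backticks matter in formulae:
d <- data.frame(y = rnorm(5), x = rnorm(5))
names(d)[2] <- "Δ price"          # a non-syntactic column name
lm(y ~ `Δ price`, data = d)       # backticks are required to refer to it in the formula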
There is another way to get at such characters, which is using a Unicode escape sequence (like \u0394 for Δ). This is usually a bad idea if you're using that character for anything other than text on a plot (i.e., don't do this for variable or function names; cf. this quote from the R 2.7 release notes, when much of the current UTF-8 support was added):
If a string presented to the parser contains a \uxxxx escape invalid in the current locale, the string is recorded in UTF-8 with the encoding declared. This is likely to throw an error if it is used later in the session, but it can be printed, and used for e.g. plotting on the windows() device. So "\u03b2" gives a Greek small beta and "\u2642" a 'male sign'. Such strings will be printed as e.g. <U+2642> except in the Rgui console (see below).
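As a minimal sketch of that point (assuming an interactive graphics device is available), a string built from an escape can still be drawn as plot text even when it is not representable in the current locale:
beta <- "\u03b2"                          # Greek small beta via a Unicode escape
plot(1:10, main = paste("Estimates of", beta))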
I think this addresses most of your questions, though I don't know why there is a difference between the function name and function argument examples you gave; hopefully someone more knowledgeable can chime in on that. FYI, on Linux all of these different ways of assigning and calling a function work without error (because the system locale is UTF-8, so no translation needs to occur):
Δ <- function(a,b) (a-b)/a # no error
`Δ` <- function(a,b) (a-b)/a # no error
"Δ" <- function(a,b) (a-b)/a # no error
"\u0394" <- function(a,b) (a-b)/a # no error
Δ(1:5, 9:13) # -8.00 -4.00 -2.67 -2.00 -1.60
`Δ`(1:5, 9:13) # same
"Δ"(1:5, 9:13) # same
"\u0394"(1:5, 9:13) # same
sessionInfo()
# R version 3.1.2 (2014-10-31)
# Platform: x86_64-pc-linux-gnu (64-bit)
# locale:
# LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
# LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
# LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
# attached base packages:
# stats graphics grDevices utils datasets methods base

For the record, under R-devel (2015-02-11 r67792), Win 7, English UK locale, I see:
options(encoding = "UTF-8")
`Δ` <- function(a,b) (a-b)/a
## Error: \uxxxx sequences not supported inside backticks (line 1)
Δ <- function(a,b) (a-b)/a
## Error: unexpected input in "\"
"Δ" <- function(a,b) (a-b)/a # OK
`Δ`(1:5, 9:13)
## Error: \uxxxx sequences not supported inside backticks (line 1)
Δ(1:5, 9:13)
## Error: unexpected input in "\"
"Δ"(1:5, 9:13)
## Error: could not find function "Δ"

Related

Encoding discrepancy in RScript

I have been struggling with an encoding problem with a program that needs to run both in RStudio and using RScript. After wasting half a day on this I have a kludgy workaround, but would like to understand why the RScript version marks a string as latin1 when it is in fact UTF-8, and whether there is a better alternative to my solution. Example:
x <- "Ø28"
print(x)
print(paste("Marked as", Encoding(x)))
print(paste("Valid UTF = ", validUTF8(x)))
x <- iconv(x, "UTF-8", "latin1")
print(x)
In RStudio, the output is:
[1] "Ø28"
[1] "Marked as latin1"
[1] "Valid UTF = FALSE"
[1] NA
and when run using RScript from a batch file in Windows the output from the same code is:
[1] "Ã\23028"
[1] "Marked as latin1"
[1] "Valid UTF = TRUE"
[1] "Ø28"
In the latter case, it does not strike me as entirely helpful that a string defined within an R program by a simple assignment is marked as Latin-1 when in fact it is UTF-8. The solution I used in the end was to write a function that tests the actual (rather than declared) encoding of character variables using validUTF8, and if that returns TRUE, then use iconv to convert to latin1. It is still a bit of a pain since I have to call that repeatedly, and it would be better to have a global solution. There is quite a bit out there on encoding problems with R, but nothing that I can find that solves this when running programs with RScript. Any suggestions?
R 3.5.0, RStudio 1.1.453, Windows 7 / Windows Server 2008 (don't ask...)
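For reference, here is a sketch of the kludgy workaround described in the question; fix_declared_encoding is a hypothetical helper name, and the logic simply re-declares strings whose bytes are valid UTF-8 but whose encoding mark says otherwise:
fix_declared_encoding <- function(x) {
  # Strings whose bytes are valid UTF-8 but which are not marked as UTF-8
  mismatch <- validUTF8(x) & Encoding(x) != "UTF-8"
  x[mismatch] <- iconv(x[mismatch], from = "UTF-8", to = "latin1")
  x
}
x <- fix_declared_encoding(x)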

encoding error with read_html

I am trying to web scrape a page. I thought of using the package rvest.
However, I'm stuck in the first step, which is to use read_html to read the content.
Here's my code:
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url, encoding = "ISO-8895-1")
And I got the following error:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]
I tried using what similar questions had as answers, but it did not solve my issue:
obra_caridade <- read_html(iconv(url, to = "UTF-8"),
encoding = "UTF-8")
obra_caridade <- read_html(iconv(url, to = "ISO-8895-1"),
encoding = "ISO-8895-1")
Both attempts returned a similar error.
Does anyone have any suggestions about how to solve this issue?
Here's my session info:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rvest_0.3.2 xml2_1.1.1
loaded via a namespace (and not attached):
[1] httr_1.2.1 magrittr_1.5 R6_2.2.1 tools_3.3.1 curl_2.6 Rcpp_0.12.11
What's the issue?
Your issue here is in correctly determining the encoding of the webpage.
The good news
Your approach looks like a good one to me, since you looked at the source code and found the Meta charset, given as ISO-8895-1. It is certainly ideal to be told the encoding rather than having to resort to guesswork.
The bad news
I don't believe that encoding exists. Firstly, when I search for it online the results tend to look like typos. Secondly, R provides you with a list of supported encodings via iconvlist(). ISO-8895-1 is not in the list, so entering it as an argument to read_html isn't useful. I think it'd be nice if entering a non-supported encoding threw a warning, but this doesn't seem to happen.
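You can verify this yourself before passing an encoding name on to read_html, for example:
"ISO-8895-1" %in% iconvlist()   # FALSE: not a recognised encoding name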
Quick solution
As suggested by @MrFlick in a comment, using encoding = "latin1" appears to work.
I suspect the Meta charset has a typo and it should read ISO-8859-1 (which is the same thing as latin1).
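In code, the quick fix amounts to the following (a sketch, re-using the url from the question):
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url, encoding = "latin1")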
Tips on guessing an encoding
What is your browser doing?
When loading the page in a browser, you can see what encoding it is using to read the page. If the page looks right, this seems like a sensible guess. In this instance, Firefox uses Western encoding (i.e. ISO-8859-1).
Guessing with R
rvest::guess_encoding is a nice, user-friendly function which can give a quick estimate. You can provide the function with a url e.g. guess_encoding(url), or copy in phrases with more complex characters e.g. guess_encoding("Situação do Termo/Convênio:").
One thing to note about this function is it can only detect from 30 of the more common encodings, but there are many more possibilities.
As mentioned earlier, iconvlist() provides a list of supported encodings. By looping through these encodings and examining some text in the page to see if it's what we expect, we should end up with a shortlist of possible encodings (and rule many encodings out).
Sample code can be found at the bottom of this answer.
Final comments
All the above points towards ISO-8859-1 being a sensible guess for the encoding.
The page URL has a .br domain, indicating the site is Brazilian, and, according to Wikipedia, this encoding has complete language coverage for Brazilian Portuguese, which suggests it might not be a crazy choice for whoever created the webpage. I believe this is also a reasonably common encoding type.
Code
Sample code for 'Guessing with R' point 2 (using iconvlist()):
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
# 1. See which encodings don't throw an error
read_page <- lapply(unique(iconvlist()), function(encoding_attempt) {
  # Optional progress indicator (fraction of encodings tried), since this can take some time
  print(match(encoding_attempt, iconvlist()) / length(iconvlist()))
  read_attempt <- tryCatch(expr = read_html(url, encoding = encoding_attempt),
                           error = function(condition) NA,
                           warning = function(condition) message(condition))
  return(read_attempt)
})
names(read_page) <- unique(iconvlist())
# 2. See which encodings correctly display some complex characters
read_phrase <- lapply(read_page, function(encoded_page)
  if (inherits(encoded_page, "xml_document"))
    html_text(html_nodes(encoded_page, ".dl-horizontal:nth-child(1) dt")))
# We've ended up with 27 encodings which could be sensible...
encoding_shortlist <- names(read_phrase)[read_phrase == "Situação:"]

iconvlist() inconsistency on alpine linux

I have a docker container set up that is based on artemklevtsov/r-alpine:latest. When I run my R scripts I see this error:
Invalid encoding UTF-8: defaulting to UTF-8.
I tracked this down to this code in the httr library:
https://github.com/hadley/httr/blob/master/R/content-parse.r#L5
It looks like iconvlist() on alpine returns encodings that have a trailing comma, e.g.:
iconvlist()
[1] "..." "ISO8859-1," "ISO8859-2," "ISO8859-3," "ISO8859-4,"
[6] "ISO8859-5," "ISO8859-6," "ISO8859-7," "UCS-2BE," "UCS-2LE,"
[11] "US_ASCII," "UTF-16BE," "UTF-16LE," "UTF-32BE," "UTF-8,"
Therefore UTF-8 never matches UTF-8,. Has anyone run into this issue before? The list of encodings I get on my local Mac (OSX) is correct and doesn't have trailing commas. It also doesn't happen on CentOS, so it looks like it's specific to alpine.
Is there a way to get around this? Maybe through a configuration in R or by modifying the iconvlist() output?
I have the same issue, this time from calling readr::read_csv, which uses base::iconvlist and gives the same error message Invalid encoding UTF-8: defaulting to UTF-8. This is on alpine:3.12 using R 3.6.3 provided by apk add R and, based on the details below, I think the issue will be present on any version of alpine and R unless steps have been taken to address it directly.
I found a couple of solutions. TLDR:
Remove the commas from the file at system.file("iconvlist", package = "utils"), or
Recompile R using the gnu-libiconv library for more comprehensive iconv support.
Solution 1
The base::iconvlist() function uses an iconvlist file as a fallback method to get the list of encodings the system supports. On alpine this fallback method will always be used, for reasons outlined below, but the iconvlist file contains commas, which R is not expecting.
The easiest solution is to remove the commas from the iconvlist file, which can be found with base::system.file().
> system.file("iconvlist", package = "utils")
[1] "/usr/lib/R/library/utils/iconvlist"
One way to remove the commas, from the command line (not R) is:
sed -i 's/,//g' /usr/lib/R/library/utils/iconvlist
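If you would rather stay inside R, a roughly equivalent fix is the following sketch (it assumes the iconvlist file is writable by the current user):
icfile <- system.file("iconvlist", package = "utils")
writeLines(gsub(",", "", readLines(icfile)), icfile)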
Subsequent calls to base::iconvlist() will read and parse the new file without the commas, and other functions that rely on base::iconvlist() will be able to successfully check for support, e.g. for "UTF-8".
> iconvlist()
[1] "..." "ISO8859-1" "ISO8859-2" "ISO8859-3" "ISO8859-4" "ISO8859-5"
[7] "ISO8859-6" "ISO8859-7" "UCS-2BE" "UCS-2LE" "US_ASCII" "UTF-16BE"
[13] "UTF-16LE" "UTF-32BE" "UTF-8" "UTF32-LE" "WCHAR_T"
> "UTF-8" %in% iconvlist()
[1] TRUE
Why is this necessary?
International conversion (iconv) of character encodings is a feature that R expects to be provided by the operating system, as stipulated in the R Installation and Administration manual. Operating systems provide their own implementations of iconv functionality, sometimes with fewer features. Since alpine is designed to be minimal, it is not surprising that it provides only what is necessary to meet the POSIX standards.
When R is built on a system it first checks the extent of iconv support from the host's C development libraries, before it compiles features into R's internals. Crucially, support for the C function iconvlist is checked for, which is not present on alpine, as shown in the apk build log for R: checking for iconvlist... no, so this C function is not available to R internally.
R's base::iconvlist() function will first try to get encodings using pre-compiled C code via .Internal(iconv(..., which will call iconvlist (in C) if available. As the iconvlist C function is not present on alpine, this .Internal call will always return NULL, and the R function will fall back to reading the info from the iconvlist file:
> iconvlist
function ()
{
int <- .Internal(iconv(NULL, "", "", "", TRUE, FALSE))
if (length(int))
return(sort.int(int))
icfile <- system.file("iconvlist", package = "utils")
# ... (truncated)
Why is the iconvlist file in an unexpected format?
The iconvlist file is created when R is built, from the command iconv -l which lists the available encodings. This is the utility program at /usr/bin/iconv not an R or C function. There is no standard for the format of the output of iconv -l. Alpine tries to conform to POSIX standards, and these only require that the -l option writes "values to standard output in an unspecified format".
R is expecting the file format to contain values separated by spaces (base::iconvlist() parses the file with strsplit(ext, "[[:space:]]")), which is true for other Linux variants, e.g. Debian, CentOS, but not for alpine's musl libc version, which has the commas.
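A small sketch of why the commas break the lookup, roughly mirroring the parsing that base::iconvlist() does with the file:
ext  <- readLines(system.file("iconvlist", package = "utils"))
encs <- unlist(strsplit(ext, "[[:space:]]"))
"UTF-8" %in% encs   # FALSE on alpine, because the entry is the string "UTF-8,"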
Solution 2
A more rigorous solution is to build R from source using an alternative iconv C library implementation that provides the iconvlist C function. base::iconvlist() can then fetch the encodings via its .Internal(iconv(... call, and never needs to fall back to the iconvlist file.
An implementation that provides iconvlist is GNU libiconv, which has been packaged for alpine and can be installed with:
apk add gnu-libiconv gnu-libiconv-dev
The package gnu-libiconv-dev provides headers in /usr/include/gnu-libiconv/, so the compiler needs to be pointed here in preference to the existing ones in /usr/include. This is outside my expertise but can be done by adding -I/usr/include/gnu-libiconv to the CFLAGS environment variable.
export CFLAGS="-I/usr/include/gnu-libiconv $CFLAGS"
Running ./configure should yield check results similar to:
... (truncated)
checking for iconv.h... yes
checking for iconv... in libiconv
checking whether iconv accepts "UTF-8", "latin1", "ASCII" and "UCS-*"... yes
checking whether iconv accepts "CP1252"... yes
checking for iconvlist... yes
... (truncated)
After make I can run ./bin/R and, even if the iconvlist file still contains commas, calls to base::iconvlist() yield well-formatted results:
> iconvlist()
[1] "850"
[2] "862"
[3] "866"
[4] "ANSI_X3.4-1968"
[5] "ANSI_X3.4-1986"
... (truncated)
# The unsorted list is coming from the internal C functions, not the file
> .Internal(iconv(NULL, "", "", "", TRUE, FALSE))
[1] "ANSI_X3.4-1968"
[2] "ANSI_X3.4-1986"
[3] "ASCII"
[4] "CP367"
[5] "IBM367"
... (truncated)

Force character vector encoding from "unknown" to "UTF-8" in R

I have a problem with inconsistent encoding of character vector in R.
The text file from which I read a table is encoded (via Notepad++) in UTF-8 (I tried UTF-8 without BOM, too).
I want to read table from this text file, convert it do data.table, set a key and make use of binary search. When I tried to do so, the following appeared:
Warning message:
In [.data.table(poli.dt, "żżonymi", mult = "first") :
A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support
mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and
others not. But if either latin1 or UTF-8 is used exclusively, and all
unknown encodings are ascii, then the result should be ok. In future
we will check for you and avoid this warning if everything is ok. The
tricky part is doing this without impacting performance for ascii-only
cases.
and binary search does not work.
I realised that my data.table key column contains both "unknown" and "UTF-8" encoding marks:
> table(Encoding(poli.dt$word))
unknown UTF-8
2061312 2739122
I tried to convert this column (before creating a data.table object) with the use of:
Encoding(word) <- "UTF-8"
word<- enc2utf8(word)
but with no effect.
I also tried a few different ways of reading a file into R (setting all helpful parameters, e.g. encoding = "UTF-8"):
data.table::fread
utils::read.table
base::scan
colbycol::cbc.read.table
but with no effect.
==================================================
My R.version:
> R.version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 0.3
year 2014
month 03
day 06
svn rev 65126
language R
version.string R version 3.0.3 (2014-03-06)
nickname Warm Puppy
My session info:
> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.2 colbycol_0.8 filehash_2.2-2 rJava_0.9-6
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3
The Encoding function returns unknown if a character string has a "native encoding" mark (CP-1250 in your case) or if it's in ASCII.
To discriminate between these two cases, call:
library(stringi)
stri_enc_mark(poli.dt$word)
To check whether each string is a valid UTF-8 byte sequence, call:
all(stri_enc_isutf8(poli.dt$word))
If it's not the case, your file is definitely not in UTF-8.
I suspect that you haven't forced UTF-8 mode in the function you used to read the data (try inspecting the contents of poli.dt$word to verify this). If my guess is correct, try:
read.csv2(file("filename", encoding="UTF-8"))
or
poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
I ran into a similar problem and could not find a solution myself. I could not translate the unknown-encoding characters from a txt file back into something more manageable in R, so the same character appeared more than once in the same dataset because it had been encoded differently (e.g. "X" under a Latin code page and "X" under a Greek code page), and saving to txt preserved that difference. Nothing I tried from the methods above worked.
The problem is well described elsewhere as: R "cannot distinguish ASCII from UTF-8 and the bit will not stick even if you set it".
A good workaround is to export your data.frame to a temporary CSV file and reimport it with data.table::fread(), specifying Latin-1 as the source encoding. Reproducing the example from that source:
library(data.table)
df <- your_data_frame_with_mixed_utf8_or_latin1_and_unknown_str_fields
fwrite(df, "temp.csv")
your_clean_data_table <- fread("temp.csv", encoding = "Latin-1")
I hope this helps someone.

How to source() .R file saved using UTF-8 encoding?

The following, when copied and pasted directly into R works fine:
> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."
However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:
> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") :
C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2:
^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.
> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_2.12.1
and
> l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI Code Page in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page--Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.
This doesn't seem to be a fundamental, unsolvable problem--there's just something wrong with the source function. You can get 90% of the way there by doing this instead:
eval(parse(filename, encoding="UTF-8"))
This'll work almost exactly like source() with default arguments, but won't let you do echo=T, eval.print=T, etc.
We talked about this a lot in the comments to my previous post, but I don't want it to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console and with input from a file:
The file "myfile.r" contains:
russian <- function() print ("Американские с...");
The console contains:
source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."
Note that sourcing from the file fails at first, and the error points to the same character as the original poster's error (the one after "R). I cannot test this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: just replace the locale with "chinese" (the naming is a bit inconsistent, so consult the documentation).
I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files, with many non-ASCII characters in them. But some characters cause it to fail. For example, the following
danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj kie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")
is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.
Locale seems irrelevant here. It's just a file, you tell it what encoding the file is, why should your locale matter?
For me (on Windows) I do:
source.utf8 <- function(f) {
  l <- readLines(f, encoding = "UTF-8")
  eval(parse(text = l), envir = .GlobalEnv)
}
It works fine.
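Usage, with the path from the original question:
source.utf8("C:\\Users\\Tony\\Desktop\\character_test.R")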
Building on crow's answer, this solution makes RStudio's Source button work.
When hitting that Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:
source <- function(f, encoding = 'UTF-8') {
  l <- readLines(f, encoding = encoding)
  eval(parse(text = l), envir = .GlobalEnv)
}
You can then add that script to an .Rprofile file, so it will execute on startup.
I encountered this problem when I tried to source a .R file containing some Chinese characters. In my case, I found that merely setting "LC_CTYPE" to "chinese" is not enough, but setting "LC_ALL" to "chinese" works well.
Note that it's not enough to get the encoding right when you read or write a plain text file containing non-ASCII characters in RStudio (or R?); the locale setting counts too.
P.S. The command is Sys.setlocale(category = "LC_ALL", locale = "chinese"). Please replace the locale value as appropriate for your language.
On Windows, when you copy-paste a Unicode or UTF-8 encoded string into a text control that is set to single-byte input (ASCII, or whatever the locale's code page is), the unknown bytes are replaced by question marks. If I take the first 4 characters of your string and copy-paste them into e.g. Notepad and then save the file, its contents in hex are:
52 3F 3F 3F 3F
What you have to do is find an editor that you can set to UTF-8 before copy-pasting the text into it; then the saved file (of your first 4 characters) becomes:
52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB
This will then be recognized as valid UTF-8 by R.
I used "Notepad2" to try this, but I am sure many other editors would work.
