How to source() .R file saved using UTF-8 encoding? - r

The following, when copied and pasted directly into R works fine:
> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."
However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:
> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") :
C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2:
^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.
> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_2.12.1
and
> l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252

On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI Code Page in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page--Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.
This doesn't seem to be a fundamental, unsolvable problem--there's just something wrong with the source function. You can get 90% of the way there by doing this instead:
eval(parse(filename, encoding="UTF-8"))
This'll work almost exactly like source() with default arguments, but won't let you do echo=T, eval.print=T, etc.
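A minimal sketch of that idea wrapped in a function, in case it helps. The name source_utf8 and the temporary demo script are made up for illustration; only parse() and eval() come from the answer above. It evaluates the file's expressions one by one and deliberately does not try to reproduce echo or eval.print.

```r
# Hypothetical helper (not part of base R): evaluate a UTF-8 encoded script
# without going through source().
source_utf8 <- function(path, envir = globalenv()) {
  # parse() reads the file as-is and marks its strings as UTF-8
  exprs <- parse(path, encoding = "UTF-8")
  for (expr in exprs) eval(expr, envir = envir)
  invisible(NULL)
}

# Demo: write a one-line UTF-8 script to a temporary file, then run it.
# The \u escapes become real Chinese characters in the file.
script <- tempfile(fileext = ".R")
writeLines('character_test <- function() "\u8bd5\u8bd5"', script, useBytes = TRUE)

demo_env <- new.env()
source_utf8(script, envir = demo_env)
result <- demo_env$character_test()
print(result)
```

Evaluating into a separate environment (demo_env here) keeps the demo from touching the global workspace; pass nothing for envir to get the usual source()-like behaviour.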

We discussed this at length in the comments on my previous post, but I don't want it to get lost on page 3 of the comments: you have to set the locale. It then works both with input typed at the R console and with input read from a file:
The file "myfile.r" contains:
russian <- function() print ("Американские с...");
The console contains:
source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."
Note that reading the file fails at first, and the error points to the same character as in the original poster's error (the one after "R). I cannot test this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: just replace the locale with "chinese" (the naming is a bit inconsistent; consult the documentation).

I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files with many non-ASCII characters in. But some characters cause it to fail. For example the following
danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")
is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.
Locale seems irrelevant here. It's just a file, you tell it what encoding the file is, why should your locale matter?

For me (on windows) I do:
source.utf8 <- function(f) {
l <- readLines(f, encoding="UTF-8")
eval(parse(text=l),envir=.GlobalEnv)
}
It works fine.

Building on crow's answer, this solution makes RStudio's Source button work.
When you hit that Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:
source <- function(f, encoding = 'UTF-8') {
l <- readLines(f, encoding=encoding)
eval(parse(text=l),envir=.GlobalEnv)
}
You can then add that script to an .Rprofile file, so it will execute on startup.

I encountered this problem when trying to source a .R file containing some Chinese characters. In my case, merely setting "LC_CTYPE" to "chinese" was not enough, but setting "LC_ALL" to "chinese" worked well.
Note that getting the encoding right is not enough when you read or write a plain text file containing non-ASCII characters in RStudio (or R?). The locale setting matters too.
PS. The command has the form Sys.setlocale(category = "LC_CTYPE", locale = "chinese"). Replace the category and locale values as appropriate.
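Since changing the locale affects the whole session, it can be handy to switch it only for the duration of one call. The following is a sketch under that assumption; with_ctype is a made-up helper name, and the locale name "chinese" is the Windows spelling used above, so on other systems the call is expected to fail with the error raised below.

```r
# Hypothetical helper: evaluate `expr` under a temporary LC_CTYPE locale
# and restore the previous setting afterwards.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) if the locale is unavailable
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(new, "")) stop("locale not available on this system: ", locale)
  expr  # lazily evaluated here, i.e. while the new locale is active
}

# Usage: report the active LC_CTYPE while the temporary locale is set,
# or the error message if this system does not know the locale name.
result <- tryCatch(
  with_ctype("chinese", Sys.getlocale("LC_CTYPE")),
  error = function(e) conditionMessage(e)
)
print(result)
```

In the thread's setting one would pass something like source("myfile.r", encoding = "UTF-8") as the second argument instead of Sys.getlocale().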

On Windows, when you copy and paste a Unicode or UTF-8 encoded string into a text control that is set to single-byte input (ASCII or similar, depending on the locale), the unknown bytes are replaced by question marks. If I take the first 5 characters of your string, paste them into e.g. Notepad and save, the file contains, in hex:
52 3F 3F 3F 3F
What you have to do is find an editor that you can set to UTF-8 before pasting the text into it. The saved file (of your first 5 characters) then becomes:
52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB
This will then be recognized as valid UTF-8 by R.
I used "Notepad2" to try this, but I am sure there are many other editors that can do it.
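The same byte-level check can be done from R itself instead of a hex editor. This is only a sketch: the temporary file stands in for your saved script, and the string is "R" followed by the four Chinese characters that come next in the question's text, written as \u escapes so the example does not depend on how this snippet is saved.

```r
# "R" plus the four Chinese characters that follow it in the question's string
first_chars <- "R\u540c\u65f6\u4e5f\u88ab"

# The UTF-8 bytes of that string (printed as hex)
utf8_bytes <- charToRaw(enc2utf8(first_chars))
print(utf8_bytes)

# Stand-in for a script saved by an editor: write the bytes, then inspect the file
path <- tempfile(fileext = ".R")
writeBin(utf8_bytes, path)
file_bytes <- readBin(path, what = "raw", n = file.size(path))

# Bytes above 0x7F mean non-ASCII content survived the save;
# validUTF8() reports whether the bytes form well-formed UTF-8
has_non_ascii <- any(as.integer(file_bytes) > 127L)
is_valid <- validUTF8(rawToChar(file_bytes))
print(c(has_non_ascii = has_non_ascii, is_valid = is_valid))
```

A file that shows only 3F bytes where the Chinese text should be has already lost the characters, and no encoding argument to source() can bring them back.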

Related

encoding error with read_html

I am trying to web scrape a page. I thought of using the package rvest.
However, I'm stuck in the first step, which is to use read_html to read the content.
Here's my code:
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url,
encoding = "ISO-8895-1")
And I got the following error:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]
I tried using what similar questions had as answers, but it did not solve my issue:
obra_caridade <- read_html(iconv(url, to = "UTF-8"),
encoding = "UTF-8")
obra_caridade <- read_html(iconv(url, to = "ISO-8895-1"),
encoding = "ISO-8895-1")
Both attempts returned a similar error.
Does anyone have any suggestions on how to solve this issue?
Here's my session info:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rvest_0.3.2 xml2_1.1.1
loaded via a namespace (and not attached):
[1] httr_1.2.1 magrittr_1.5 R6_2.2.1 tools_3.3.1 curl_2.6 Rcpp_0.12.11
What's the issue?
Your issue here is in correctly determining the encoding of the webpage.
The good news
Your approach looks like a good one to me since you looked at the source code and found the Meta charset, given as ISO-8895-1. It is certainly ideal to be told the encoding, rather than have to resort to guess-work.
The bad news
I don't believe that encoding exists. Firstly, when I search for it online the results tend to look like typos. Secondly, R provides you with a list of supported encodings via iconvlist(). ISO-8895-1 is not in the list, so entering it as an argument to read_html isn't useful. I think it'd be nice if entering a non-supported encoding threw a warning, but this doesn't seem to happen.
Quick solution
As suggested by @MrFlick in a comment, using encoding = "latin1" appears to work.
I suspect the Meta charset has a typo and it should read ISO-8859-1 (which is the same thing as latin1).
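To see why latin1 is a plausible reading, one can look at the four bytes quoted in the error message. This is just an illustrative sketch in base R; it does not touch the web page, and the variable names are made up.

```r
# The four bytes reported in the error message
bytes <- as.raw(c(0xE3, 0x6F, 0x20, 0x65))
text_raw <- rawToChar(bytes)

# They are not valid UTF-8, which is what the parser complained about
is_utf8 <- validUTF8(text_raw)

# Interpreted as ISO-8859-1 (latin1) and converted to UTF-8,
# they turn into ordinary Portuguese text
text_utf8 <- iconv(text_raw, from = "latin1", to = "UTF-8")

print(is_utf8)
print(text_utf8)
```

The byte 0xE3 is "ã" in ISO-8859-1, so the fragment reads like the end of a word such as "Situação" followed by a space and an "e".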
Tips on guessing an encoding
What is your browser doing?
When loading the page in a browser, you can see what encoding it is using to read the page. If the page looks right, this seems like a sensible guess. In this instance, Firefox uses Western encoding (i.e. ISO-8859-1).
Guessing with R
rvest::guess_encoding is a nice, user-friendly function which can give a quick estimate. You can provide the function with a url e.g. guess_encoding(url), or copy in phrases with more complex characters e.g. guess_encoding("Situação do Termo/Convênio:").
One thing to note about this function is it can only detect from 30 of the more common encodings, but there are many more possibilities.
As mentioned earlier, iconvlist() provides a list of supported encodings. By looping through these encodings and examining some text in the page to see if it's what we expect, we should end up with a shortlist of possible encodings (and rule many encodings out).
Sample code can be found at the bottom of this answer.
Final comments
All the above points towards ISO-8859-1 being a sensible guess for the encoding.
The page url contains a .br extension indicating it's Brazilian, and - according to Wikipedia - this encoding has complete language coverage for Brazilian Portuguese, which suggests it might not be a crazy choice for whoever created the webpage. I believe this is also a reasonably common encoding type.
Code
Sample code for 'Guessing with R' point 2 (using iconvlist()):
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
# 1. See which encodings don't throw an error
read_page <- lapply(unique(iconvlist()), function(encoding_attempt) {
  # Optional print statement to show progress to 1 since this can take some time
  print(match(encoding_attempt, iconvlist()) / length(iconvlist()))
  # A read that fails or warns is stored as NA so that step 2 can skip it
  read_attempt <- tryCatch(expr=read_html(url, encoding=encoding_attempt),
                           error=function(condition) NA,
                           warning=function(condition) {
                             message(conditionMessage(condition))
                             NA
                           })
  return(read_attempt)
})
names(read_page) <- unique(iconvlist())
# 2. See which encodings correctly display some complex characters
read_phrase <- lapply(read_page, function(encoded_page)
  # Only successfully parsed pages are xml_document objects
  if(inherits(encoded_page, "xml_document"))
    html_text(html_nodes(encoded_page, ".dl-horizontal:nth-child(1) dt")))
# We've ended up with 27 encodings which could be sensible...
encoding_shortlist <- names(read_phrase)[read_phrase == "Situação:"]

How does R handle Unicode / UTF-8?

If I write
`Δ` <- function(a,b) (a-b)/a
then I can include U+394 so long as it's enclosed in backticks. (By contrast, Δ <- function(a,b) (a-b)/a fails with unexpected input in "�".) So apparently R parses UTF-8 or Unicode or something like that. The assignment goes well and so does the evaluation of, e.g.,
`Δ`(1:5, 9:13)
I can also evaluate Δ(1:5, 9:13).
Finally, if I defined something like winsorise <- function(x, λ=.05) { ... } then λ (U+3bb) doesn't need to be "introduced to" R with a backtick. I can then call winsorise(data, .1) with no problems.
The only mention in R's documentation I can find of unicode is over my head. Could someone who understands it better explain to me — what's going on "under the hood" when R needs the ` to understand assignment to ♔, but can parse ♔(a,b,c) once assigned?
I can't speak to what's going on under the hood regarding the function calls vs. function arguments, but this email from Prof. Ripley from 2008 may shed some light (excerpt below):
R passes around, prints and plots UTF-8 character data pretty well, but it translates to the native encoding for almost all character-level manipulations (and not just on Windows). ?Encoding spells out the exceptions [...]
The reason R does this translation (on Windows at least) is mentioned in the documentation that the OP linked to:
Windows has no UTF-8 locales, but rather expects to work with UCS-2 strings. R (being written in standard C) would not work internally with UCS-2 without extensive changes.
The R documentation for ?Quotes explains how you can sometimes use out-of-locale characters anyway (emphasis added):
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used directly in R code. Almost always, other names can be used provided they are quoted. The preferred quote is the backtick (`), and deparse will normally use it, but under many circumstances single or double quotes can be used (as a character constant will often be converted to a name). One place where backticks may be essential is to delimit variable names in formulae: see formula.
There is another way to get at such characters, which is using the unicode escape sequence (like \u0394 for Δ). This is usually a bad idea if you're using that character for anything other than text on a plot (i.e., don't do this for variable or function names; cf. this quote from the R 2.7 release notes, when much of the current UTF-8 support was added):
If a string presented to the parser contains a \uxxxx escape invalid in the current locale, the string is recorded in UTF-8 with the encoding declared. This is likely to throw an error if it is used later in the session, but it can be printed, and used for e.g. plotting on the windows() device. So "\u03b2" gives a Greek small beta and "\u2642" a 'male sign'. Such strings will be printed as e.g. <U+2642> except in the Rgui console (see below).
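A small sketch of what "recorded in UTF-8 with the encoding declared" looks like from the R side, using the Greek small beta from the quote. The variable names are illustrative only; the last value depends on your locale, so no output is shown for it.

```r
beta <- "\u03b2"               # Greek small beta, written as an escape

enc_mark <- Encoding(beta)     # the declared encoding mark of the string
code_point <- utf8ToInt(beta)  # its Unicode code point as an integer
native <- enc2native(beta)     # the string converted to the current locale

print(enc_mark)
print(code_point)
print(native)                  # unchanged in a UTF-8 locale, may be escaped elsewhere
```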
I think this addresses most of your questions, though I don't know why there is a difference between the function name and function argument examples you gave; hopefully someone more knowledgeable can chime in on that. FYI, on Linux all of these different ways of assigning and calling a function work without error (because the system locale is UTF-8, so no translation need occur):
Δ <- function(a,b) (a-b)/a # no error
`Δ` <- function(a,b) (a-b)/a # no error
"Δ" <- function(a,b) (a-b)/a # no error
"\u0394" <- function(a,b) (a-b)/a # no error
Δ(1:5, 9:13) # -8.00 -4.00 -2.67 -2.00 -1.60
`Δ`(1:5, 9:13) # same
"Δ"(1:5, 9:13) # same
"\u0394"(1:5, 9:13) # same
sessionInfo()
# R version 3.1.2 (2014-10-31)
# Platform: x86_64-pc-linux-gnu (64-bit)
# locale:
# LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
# LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
# LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
# attached base packages:
# stats graphics grDevices utils datasets methods base
For the record, under R-devel (2015-02-11 r67792), Win 7, English UK locale, I see:
options(encoding = "UTF-8")
`Δ` <- function(a,b) (a-b)/a
## Error: \uxxxx sequences not supported inside backticks (line 1)
Δ <- function(a,b) (a-b)/a
## Error: unexpected input in "\"
"Δ" <- function(a,b) (a-b)/a # OK
`Δ`(1:5, 9:13)
## Error: \uxxxx sequences not supported inside backticks (line 1)
Δ(1:5, 9:13)
## Error: unexpected input in "\"
"Δ"(1:5, 9:13)
## Error: could not find function "Δ"

how to display and input chinese (and other non-ASCII) character in r console?

My system: win7 ultimate 64 english version + r-3.1(64) .
Here is my sessionInfo.
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
LC_MONETARY=English_United States.1252 LC_NUMERIC=C
LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
1.can't input chinese character into r console
When I input a chinese character in r console, it turns to garbled character .
2.can't display chinese character on the r console
When I read data in r console, the chinese character turns into a garbled character .
You can download the data, and test it with
read.table("r1.csv",sep=",")
How can I setup my pc to properly display and input chinese characters in r console?
I have updated the chinese language pack ,and enabled it,but problem remains still.
It is probably not very well documented, but you want to use Sys.setlocale in order to use Chinese; the method applies to many other languages as well. The solution is not obvious because the official documentation of Sys.setlocale doesn't specifically mention it as a way to solve display issues.
> print('ÊÔÊÔ') #试试, meaning let's give it a shot in Chinese
[1] "ÊÔÊÔ" #won't show up correctly
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> Sys.setlocale(category = "LC_ALL", locale = "chs") #cht for traditional Chinese, etc.
[1] "LC_COLLATE=Chinese_People's Republic of China.936;LC_CTYPE=Chinese_People's Republic of China.936;LC_MONETARY=Chinese_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese_People's Republic of China.936"
> print('试试')
[1] "试试"
> read.table("c:/CHS.txt",sep=" ") #Chinese: the 1st record/observation
V1 V2 V3 V4 V5 V6
1 122 第一 122 条 122 记录
If you just want to change the display encoding, without changing other aspects of locales, use LC_CTYPE instead of LC_ALL:
> Sys.setlocale(category = "LC_CTYPE", locale = "chs")
[1] "Chinese_People's Republic of China.936"
> print('试试')
[1] "试试"
Now, of course this only applies to the official R console. If you use other IDE's, such as the very popular RStudio, you don't need to do this at all to be able to type and display Chinese, even if you didn't have the Chinese locale loaded.
Migrate some useful stuff from the following comments:
If the data still fails to show up correctly, then we should also look into the file encoding. If the file is UTF-8 encoded, either data <- read.table("your_file", sep=',', fileEncoding="UTF-8-BOM", header=TRUE) or fileEncoding="UTF-8" will do, depending on which encoding the file really has.
But you may want to stay away from UTF-8 with a BOM, as it is not recommended; see: What's different between UTF-8 and UTF-8 without BOM?
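One way to decide between the two fileEncoding values is to look at the first three bytes of the file, since a UTF-8 byte order mark is always EF BB BF. The sketch below builds its own small file to stay self-contained; the file contents and variable names are made up, and the content is kept to ASCII so that only the BOM handling is being shown.

```r
# Build a small CSV that starts with a UTF-8 byte order mark
csv_path <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("id,label\n1,first\n2,second\n")), csv_path)

# Peek at the first three bytes to see which variant the file really is
has_bom <- identical(readBin(csv_path, what = "raw", n = 3L), bom)
file_enc <- if (has_bom) "UTF-8-BOM" else "UTF-8"

# Read with the matching encoding so the BOM does not leak into the header
data <- read.table(csv_path, sep = ",", header = TRUE, fileEncoding = file_enc)
print(names(data))
```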

Force character vector encoding from "unknown" to "UTF-8" in R

I have a problem with inconsistent encoding of character vector in R.
The text file which I read a table from is encoded (via Notepad++) in UTF-8 (I tried with UTF-8 without BOM, too.).
I want to read a table from this text file, convert it to a data.table, set a key and make use of binary search. When I tried to do so, the following appeared:
Warning message:
In [.data.table(poli.dt, "żżonymi", mult = "first") :
A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support
mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and
others not. But if either latin1 or UTF-8 is used exclusively, and all
unknown encodings are ascii, then the result should be ok. In future
we will check for you and avoid this warning if everything is ok. The
tricky part is doing this without impacting performance for ascii-only
cases.
and binary search does not work.
I realised that my data.table-key column consists of both: "unknown" and "UTF-8" Encoding types:
> table(Encoding(poli.dt$word))
unknown UTF-8
2061312 2739122
I tried to convert this column (before creating a data.table object) with the use of:
Encoding(word) <- "UTF-8"
word<- enc2utf8(word)
but with no effect.
I also tried a few different ways of reading a file into R (setting all helpful parameters, e.g. encoding = "UTF-8"):
data.table::fread
utils::read.table
base::scan
colbycol::cbc.read.table
but with no effect.
==================================================
My R.version:
> R.version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 0.3
year 2014
month 03
day 06
svn rev 65126
language R
version.string R version 3.0.3 (2014-03-06)
nickname Warm Puppy
My session info:
> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.2 colbycol_0.8 filehash_2.2-2 rJava_0.9-6
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3
The Encoding function returns unknown if a character string has a "native encoding" mark (CP-1250 in your case) or if it's in ASCII.
To discriminate between these two cases, call:
library(stringi)
stri_enc_mark(poli.dt$word)
To check whether each string is a valid UTF-8 byte sequence, call:
all(stri_enc_isutf8(poli.dt$word))
If it's not the case, your file is definitely not in UTF-8.
I suspect that you haven't forced the UTF-8 mode in the data read function (try inspecting the contents of poli.dt$word to verify this statement). If my guess is true, try:
read.csv2(file("filename", encoding="UTF-8"))
or
poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
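For readers without stringi, the same distinction can be sketched in base R. This is an illustration, not a replacement for the calls above: the sample words are made up, and the single byte 0xBF is used because it is "ż" in CP-1250, the native code page in the question.

```r
# An ASCII string is never marked, so it reports "unknown";
# a non-ASCII string written with \u escapes is marked "UTF-8"
words <- c("abc", "\u017c\u017conymi")
marks <- Encoding(words)
print(marks)

# A string whose bytes are really CP-1250: convert it to UTF-8 explicitly
cp1250_bytes <- rawToChar(as.raw(0xBF))
word_utf8 <- iconv(cp1250_bytes, from = "CP1250", to = "UTF-8")
converted_mark <- Encoding(word_utf8)
print(converted_mark)

# After conversion every element is a valid UTF-8 byte sequence
all_valid <- all(validUTF8(c(words, word_utf8)))
print(all_valid)
```

The point is that "unknown" alone says nothing about whether a string is ASCII or native-encoded, which is why a mixed column has to be re-encoded rather than merely re-marked.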
I could not find a solution myself to a similar problem.
I could not convert the characters with unknown encoding, read from a txt file, into something more manageable in R.
As a result, the same character appeared more than once in the same dataset because it was encoded differently ("X" in a Latin setting and "X" in a Greek setting).
Saving to txt preserved that encoding difference, as it should.
I tried some of the methods above, but nothing worked.
The problem is well described as "cannot distinguish ASCII from UTF-8 and the bit will not stick even if you set it".
A good workaround is to "export your data.frame to a CSV temporary file and reimport with data.table::fread(), specifying Latin-1 as source encoding".
Reproducing the example given in the above source:
library(data.table)
df <- your_data_frame_with_mixed_utf8_or_latin1_and_unknown_str_fields
fwrite(df,"temp.csv")
your_clean_data_table <- fread("temp.csv",encoding = "Latin-1")
I hope it helps someone.

Knitting Rmd treats non-english characters differently

I've tried to write a reproducible example below. It is a mix of .Rmd and .r. Hopefully you can see why.
The problem I have is that non-english characters are treated differently depending on whether code is run directly in the console or when Knitted to HTML.
In the example below I create a small data.frame with characters ü and ö, write it to csv, then read it back in again.
If the writing and reading both take place inside or outside a chunk, then all is well.
But if the writing and reading take place in different places then a different encoding is used (I think), and the characters get mixed up.
This means that when reading in data I need a different encoding when compiling an .Rmd file than when working directly in R.
As far as I can see the locale is always the same, so I don't understand what's going on.
Any ideas?
Write and read csv directly to create new datafile
df2 <- data.frame(Cäl1 = c(1,2), Col2 = c("ü","a"))
write.csv(df2, file="df2.csv")
read.csv("df2.csv")
Sys.getlocale(category = "LC_ALL")
Now try Knitting the whole document (just running the chunk behaves differently)
```{r read_inside}
read.csv("df2.csv")
Sys.getlocale(category = "LC_ALL")
```
this second chunk will work because the data.frame is created inside the chunk
```{r write_read_inside}
df2 <- data.frame(Cäl1 = c(1,2), Col2 = c("ü","a"))
write.csv(df2, file="df2.csv")
read.csv("df2.csv")
Sys.getlocale(category = "LC_ALL")
```
Session Info:
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.15.0
So the answer is to guarantee UTF-8 encoding, e.g. write.csv(..., fileEncoding = 'UTF-8'). The root problem was that RStudio uses UTF-8 by default, whereas R uses the native encoding of the OS by default. We can either ask R to use UTF-8 in write.csv, or ask RStudio to use the native encoding (options(encoding = 'native.enc')).
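A minimal sketch of the first option, assuming you control both the writing and the reading step. The column names are simplified from the question (ASCII only) and the file is a temporary one; the idea is just that both sides name the same encoding explicitly, so the result no longer depends on who runs the code.

```r
# Data with a non-ASCII value, written as an escape so this snippet is ASCII-only
df2 <- data.frame(Col1 = c(1, 2), Col2 = c("\u00fc", "a"))

csv_path <- tempfile(fileext = ".csv")

# Write and read with the same explicit encoding
write.csv(df2, file = csv_path, row.names = FALSE, fileEncoding = "UTF-8")
back <- read.csv(csv_path, fileEncoding = "UTF-8")

print(back)
```

In a session whose native encoding cannot represent "ü" at all, the value may still come back escaped; the explicit fileEncoding only guarantees that the file on disk is written and read as UTF-8.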
