encoding error with read_html - r

I am trying to web scrape a page. I thought of using the package rvest.
However, I'm stuck in the first step, which is to use read_html to read the content.
Here's my code:
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url,
encoding = "ISO-8895-1")
And I got the following error:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]
I tried using what similar questions had as answers, but it did not solve my issue:
obra_caridade <- read_html(iconv(url, to = "UTF-8"),
encoding = "UTF-8")
obra_caridade <- read_html(iconv(url, to = "ISO-8895-1"),
encoding = "ISO-8895-1")
Both attempts returned a similar error.
Does anyone have any suggestions about how to solve this issue?
Here's my session info:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rvest_0.3.2 xml2_1.1.1
loaded via a namespace (and not attached):
[1] httr_1.2.1 magrittr_1.5 R6_2.2.1 tools_3.3.1 curl_2.6 Rcpp_0.12.11

What's the issue?
Your issue here is in correctly determining the encoding of the webpage.
The good news
Your approach looks like a good one to me since you looked at the source code and found the Meta charset, given as ISO-8895-1. It is certainly ideal to be told the encoding, rather than have to resort to guess-work.
The bad news
I don't believe that encoding exists. Firstly, when I search for it online the results tend to look like typos. Secondly, R provides you with a list of supported encodings via iconvlist(). ISO-8895-1 is not in the list, so entering it as an argument to read_html isn't useful. I think it'd be nice if entering a non-supported encoding threw a warning, but this doesn't seem to happen.
Quick solution
As suggested by @MrFlick in a comment, using encoding = "latin1" appears to work.
I suspect the Meta charset has a typo and it should read ISO-8859-1 (which is the same thing as latin1).
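For instance, a minimal sketch using the same URL as above ("latin1" is the alias R accepts for ISO-8859-1):
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url, encoding = "latin1")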
Tips on guessing an encoding
What is your browser doing?
When loading the page in a browser, you can see what encoding it is using to read the page. If the page looks right, this seems like a sensible guess. In this instance, Firefox uses Western encoding (i.e. ISO-8859-1).
Guessing with R
rvest::guess_encoding is a nice, user-friendly function which can give a quick estimate. You can provide the function with a url e.g. guess_encoding(url), or copy in phrases with more complex characters e.g. guess_encoding("Situação do Termo/Convênio:").
One thing to note about this function is that it can only choose from around 30 of the more common encodings, but there are many more possibilities.
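For instance, a quick sketch (the function returns a small table of candidate encodings with confidence scores; the exact output depends on your rvest version):
guess_encoding(url)
guess_encoding("Situação do Termo/Convênio:")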
As mentioned earlier, iconvlist() provides a list of supported encodings. By looping through these encodings and examining some text in the page to see if it's what we expect, we should end up with a shortlist of possible encodings (and rule many encodings out).
Sample code can be found at the bottom of this answer.
Final comments
All the above points towards ISO-8859-1 being a sensible guess for the encoding.
The page url contains a .br extension indicating it's Brazilian, and - according to Wikipedia - this encoding has complete language coverage for Brazilian Portuguese, which suggests it might not be a crazy choice for whoever created the webpage. I believe this is also a reasonably common encoding type.
Code
Sample code for 'Guessing with R' point 2 (using iconvlist()):
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
# 1. See which encodings don't throw an error
read_page <- lapply(unique(iconvlist()), function(encoding_attempt) {
  # Optional print statement to show progress (a fraction approaching 1), since this can take some time
  print(match(encoding_attempt, iconvlist()) / length(iconvlist()))
  read_attempt <- tryCatch(expr = read_html(url, encoding = encoding_attempt),
                           error = function(condition) NA,
                           warning = function(condition) message(condition))
  return(read_attempt)
})
names(read_page) <- unique(iconvlist())
# 2. See which encodings correctly display some complex characters
read_phrase <- lapply(read_page, function(encoded_page)
  if (!is.na(encoded_page))
    html_text(html_nodes(encoded_page, ".dl-horizontal:nth-child(1) dt")))
# We've ended up with 27 encodings which could be sensible...
encoding_shortlist <- names(read_phrase)[read_phrase == "Situação:"]
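Printing the shortlist then shows which encodings render the phrase as expected; latin1 / ISO-8859-1 should be among them:
encoding_shortlist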

Related

R: Encoding of labelled data and knit to html problems

First of all, sorry for not providing a reproducible example and posting images, a word of explanation why I did it is at the end.
I'd really appreciate some help, comments or otherwise; I think I did my best to be as specific and concise as I can.
The problem I'm trying to solve is how (and where) to set up the encoding in order to get Polish letters after a .Rmd document is knitted to html.
I'm working with a labelled SPSS file imported to R via the haven library, and I'm using sjPlot tools to make tables and graphs.
I already spent almost all day trying to sort this out, but I feel I'm stuck with no idea where to go.
My sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
Whenever I run (via console / script)
sjt.frq(df$sex, encoding = "Windows-1250")
I get a nice table with proper encoding in the RStudio viewer pane:
Trying with no encoding, sjt.frq(df$sex) gives this:
I could live with setting the encoding each time a call to sjt.frq is made, but the problem is that no matter how I set up sjt.frq inside a markdown document, it always gets knitted the wrong way.
Running the chunk inside the .Rmd is OK (for a completely unknown reason encoding = "UTF-8" worked as well here, and it didn't previously):
Knitting the same document is not OK:
(note that the html header has all the Polish characters)
Also, it looks like it could be either HTML or sjPlot specific, because knitr can print Polish letters when they are in a vector and are passed as if they were printed to the console:
Is there anything I can set up / change in order to make this work?
While testing different options I discovered that manually converting the sex variable to a factor and assigning the labels again works, and RStudio knits to html with proper encoding:
df$sex <- factor(df$sex, label = c("kobieta", "mężczyzna"))
sjt.frq(df$sex, encoding = "Windows-1250")
Regarding no reproducible example:
I tried to simulate this example with fake data:
# Get libraries
library(sjPlot)
library(sjlabelled)
x <- rep(1:4, 4)
x<- set_labels(x, labels = c("ąę", "ćŁ", "óŚŚ", "abcd"))
# Run freq table similar to df$sex above
sjt.frq(x)
sjt.frq(x, encoding = "UTF-8")
sjt.frq(x, encoding = "Windows-1250")
Thing is, each sjt.frq call knits the way it should (although only encoding = "Windows-1250" renders properly in the RStudio viewer pane).
If you run sjt.frq(), a complete HTML page is returned, which is displayed in a viewer.
However, for use inside markdown/knitr documents, only parts of the HTML output are required: you don't need the <head> part, for instance, as the knitr document creates its own header for the HTML page. Thus, there is a separate print() method for knitr documents, which uses another return value to include in the knitr file.
Compare:
dummy <- sjt.frq(df$sex, encoding = "Windows-1250")
dummy$output.complete # used for default display in viewer
dummy$knitr # used in knitr-documents
Since the encoding is located in the <meta> tag, which is not included in the $knitr value, the encoding argument in sjt.frq() has no effect on knitr documents.
I think that this might help you: rmarkdown::render_site(encoding = 'UTF-8'). Maybe there are also other options to encode text, or you need to modify the final HTML-file, changing the charset encoding there.
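As a rough sketch of that last option (the output file name and the charset strings here are assumptions; check what your knitted file actually contains), you could post-process the HTML after knitting:
html_file <- "my_report.html"  # hypothetical name of the knitted output
page <- readLines(html_file, encoding = "UTF-8")
# swap the declared charset so the browser interprets the sjPlot output as intended
page <- gsub('charset="utf-8"', 'charset="windows-1250"', page, ignore.case = TRUE)
writeLines(page, html_file, useBytes = TRUE)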

How to get proper encoding using browseURL()?

I'm basically trying to browse a URL with Japanese letters in it. This question builds on my first question from yesterday. My code now generates the right URL, and if I just take the URL and put it into my browser I get the right result, but if I try to automate the process by integrating browseURL() I get a wrong result.
E.g. I am trying to call following URL:
http://www.google.com/trends/trendsReport?hl=en-US&q=VWゴルフ %2B VWポロ %2B VWパサート %2B VWティグアン&date=1%2F2010 68m&cmpt=q&content=1&export=1
if I now use
browseURL("http://www.google.com/trends/trendsReport?hl=en-US&q=VWゴルフ %2B VWポロ %2B VWパサート %2B VWティグアン&date=1%2F2010 68m&cmpt=q&content=1&export=1")
I can see in the browser that it browsed
www.google.com/trends/trendsReport?hl=en-US&q=VW%E3%83%BB%EF%BD%BDS%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BDt%20%2B%20VW%E3%83%BB%EF%BD%BD%7C%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BD%20%2B%20VW%E3%83%BB%EF%BD%BDp%E3%83%BB%EF%BD%BDT%E3%83%BB%EF%BD%BD[%E3%83%BB%EF%BD%BDg%20%2B%20VW%E3%83%BB%EF%BD%BDe%E3%83%BB%EF%BD%BDB%E3%83%BB%EF%BD%BDO%E3%83%BB%EF%BD%BDA%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BD&date=1%2F2010%2068m&cmpt=q&content=1&export=1
which seems to be an encoding mistake. I already tried
browseURL(URL, encodeIfNeeded=TRUE)
but that doesn't seem to change a thing, and as far as I interpret the function it also shouldn't, because this function is there to generate those "%B" letters, which makes it even more surprising that I get them even when encodeIfNeeded = FALSE.
Any help is highly appreciated!
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 8 (build 9200)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=Japanese_Japan.932 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.2.1
I think this will get around the issue:
library(httr)
library(curl)
gt_url <- "http://www.google.com/trends/trendsReport?hl=en-US&q=VWゴルフ %2B VWポロ %2B VWパサート %2B VWティグアン&date=1%2F2010 68m&cmpt=q&content=1&export=1"
# ensure the %2B's aren't getting in the way then
# ask httr to carve up the url and put it back together
parts <- parse_url(URLdecode(gt_url))
browseURL(build_url(parts))
That gives this (too long to paste but I want to make sure OP gets to see the whole content).
I also now see why you have to do it this way (both download.file and GET with write_disk don't work due to the javascript redirect).

Force character vector encoding from "unknown" to "UTF-8" in R

I have a problem with inconsistent encoding of a character vector in R.
The text file which I read a table from is encoded (via Notepad++) in UTF-8 (I tried UTF-8 without BOM, too).
I want to read a table from this text file, convert it to a data.table, set a key and make use of binary search. When I tried to do so, the following appeared:
Warning message:
In [.data.table(poli.dt, "żżonymi", mult = "first") :
A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support
mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and
others not. But if either latin1 or UTF-8 is used exclusively, and all
unknown encodings are ascii, then the result should be ok. In future
we will check for you and avoid this warning if everything is ok. The
tricky part is doing this without impacting performance for ascii-only
cases.
and binary search does not work.
I realised that my data.table key column consists of both "unknown" and "UTF-8" encoding types:
> table(Encoding(poli.dt$word))
unknown UTF-8
2061312 2739122
I tried to convert this column (before creating a data.table object) with the use of:
Encoding(word) <- "UTF-8"
word<- enc2utf8(word)
but with no effect.
I also tried a few different ways of reading a file into R (setting all helpful parameters, e.g. encoding = "UTF-8"):
data.table::fread
utils::read.table
base::scan
colbycol::cbc.read.table
but with no effect.
==================================================
My R.version:
> R.version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 0.3
year 2014
month 03
day 06
svn rev 65126
language R
version.string R version 3.0.3 (2014-03-06)
nickname Warm Puppy
My session info:
> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.2 colbycol_0.8 filehash_2.2-2 rJava_0.9-6
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3
The Encoding function returns unknown if a character string has a "native encoding" mark (CP-1250 in your case) or if it's in ASCII.
To discriminate between these two cases, call:
library(stringi)
stri_enc_mark(poli.dt$word)
To check whether each string is a valid UTF-8 byte sequence, call:
all(stri_enc_isutf8(poli.dt$word))
If it's not the case, your file is definitely not in UTF-8.
I suspect that you haven't forced the UTF-8 mode in the data read function (try inspecting the contents of poli.dt$word to verify this statement). If my guess is true, try:
read.csv2(file("filename", encoding="UTF-8"))
or
poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
I could not find a solution myself to a similar problem.
I could not translate the unknown-encoding characters from a txt file back into something more manageable in R.
Therefore, I was in a situation where the same character appeared more than once in the same dataset, because it was encoded differently ("X" in a Latin setting and "X" in a Greek setting).
However, the txt saving operation preserved that encoding difference, which is of course correct behaviour.
Trying some of the above methods, nothing worked.
The problem is well described as "cannot distinguish ASCII from UTF-8 and the bit will not stick even if you set it".
A good workaround is to "export your data.frame to a CSV temporary file and reimport with data.table::fread(), specifying Latin-1 as source encoding".
Reproducing / copying the example given from the above source:
library(data.table)
df <- your_data_frame_with_mixed_utf8_or_latin1_and_unknown_str_fields
fwrite(df,"temp.csv")
your_clean_data_table <- fread("temp.csv",encoding = "Latin-1")
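To check the result, you could repeat the diagnostic used in the question on the re-imported table (a sketch, assuming the key column is called word as in the question):
table(Encoding(your_clean_data_table$word))
# ideally a single encoding class now, rather than a mix of unknown and UTF-8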
I hope it will help someone.

Quantmod, getSymbols error trying to replicate answer

I just downloaded the package Quantmod and have been playing with getSymbols. I want to be able to get data for multiple stocks as in this question: getSymbols and using lapply, Cl, and merge to extract close prices.
Unfortunately, when I try to duplicate the answer:
tickers <- c("SPY","DIA","IWM","SMH","OIH","XLY",
"XLP","XLE","XLI","XLB","XLK","XLU")
getSymbols(tickers, from="2001-03-01", to="2011-03-11")
I get the following error message:
Error in download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, :
cannot open URL
'http://chart.yahoo.com/table.csv?s=SPY&a=2&b=01&c=2001&d=2&e=11&f=2011&g=d&q=q&y=0&z=SPY&x=.csv'
In addition: Warning message:
In download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m,
: cannot open: HTTP status was '0 (null)'
Here is my sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quantmod_0.4-0 TTR_0.22-0 xts_0.9-7 zoo_1.7-10 Defaults_1.1-1
loaded via a namespace (and not attached):
[1] grid_3.0.2 lattice_0.20-23 tools_3.0.2
EDIT: In response to OP's comment:
So the bottom line seems to be that sites which provide free downloads of historical data are quirky, to say the least. They do not necessarily work for all valid symbols, and sometimes they become unavailable for no apparent reason. ichart.yahoo.com/table.csv worked for me 24 hours ago, but does not work (for me) at the moment. This may be because Yahoo has imposed a 24 hour lockout on my IP, which they will do if they detect activity interpretable as a DDOS attack. Or it might be for some other reason...
The updated code below, which queries Google, does work (again, at the moment) for everything except DJIA. Note that there is more success if you specify the exchange and the symbol (EXCHANGE:SYMBOL). I was unable to download SMH without the exchange. Finally, if you are having problems, try uncommenting the print statement and pasting the url into a browser to see what happens.
tickers <- c("SPY","DJIA","IWM","NYSEARCA:SMH","OIH","XLY",
"XLP","XLE","XLI","XLB","XLK","XLU")
g <- function(x, from, to, output="csv") {
  uri <- "http://www.google.com/finance/historical"
  q.symbol <- paste("q", x, sep="=")
  q.from <- paste("startdate", from, sep="=")
  q.to <- paste("enddate", to, sep="=")
  q.output <- paste("output", output, sep="=")
  query <- paste(q.symbol, q.output, q.from, q.to, sep="&")
  url <- paste(uri, query, sep="?")
  # print(url)
  try(assign(x, read.csv(url), envir=.GlobalEnv))
}
lapply(tickers,g,from="2001-03-01",to="2011-03-11",output="csv")
You can download DJI from the St. Louis Fed, which is very reliable. Unfortunately, you get all of it (from 1896), and it's a time series.
getSymbols("DJIA",src="FRED")
Original Response:
This worked for me, for everything except SMH and OIH.
tickers <- c("SPY","DJIA","IWM","SMH","OIH","XLY",
"XLP","XLE","XLI","XLB","XLK","XLU")
f <- function(x) {
  uri <- "http://ichart.yahoo.com/table.csv"
  symbol <- paste("s", x, sep="=")
  from <- "a=2&b=1&c=2001"
  to <- "d=2&e=11&f=2011"
  period <- "g=d"
  ignore <- "ignore=.csv"
  query <- paste(symbol, from, to, period, ignore, sep="&")
  url <- paste(uri, query, sep="?")
  try(assign(x, read.csv(url), envir=.GlobalEnv))
}
lapply(tickers,f)
The main difference between this and getSymbols(...) is that this uses ichart.yahoo.com (as documented here), whereas getSymbols(...) uses chart.yahoo.com. The former seems to be much more reliable. In my experience, using getSymbols(...) with Yahoo is a monumental headache.
If the suggestion of @user2492310, to use src="google", works for you, then clearly this is the way to go. It didn't work for me.
One other note: SMH and OIH did not exist in 2001. The others all go back to at least 2000. So it might be that ichart.yahoo.com (and chart.yahoo.com) throws an error if you provide a date range outside of the symbol's operating range.

How to source() .R file saved using UTF-8 encoding?

The following, when copied and pasted directly into R works fine:
> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."
However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:
> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") :
C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2:
^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.
> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_2.12.1
and
> l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI Code Page in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page--Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.
This doesn't seem to be a fundamental, unsolvable problem--there's just something wrong with the source function. You can get 90% of the way there by doing this instead:
eval(parse(filename, encoding="UTF-8"))
This'll work almost exactly like source() with default arguments, but won't let you do echo=T, eval.print=T, etc.
We talked about this a lot in the comments to my previous post, but I don't want this to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console (see screenshot in comments) and with input from a file, see this screenshot:
The file "myfile.r" contains:
russian <- function() print ("Американские с...");
The console contains:
source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."
Note that the file input fails and it points to the same character as the original poster's error (the one after "R). I cannot do this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same; you just replace the locale with "chinese" (the naming is a bit inconsistent, consult the documentation).
I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files with many non-ASCII characters in them. But some characters cause it to fail. For example, the following
danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")
is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.
Locale seems irrelevant here. It's just a file, you tell it what encoding the file is, why should your locale matter?
For me (on windows) I do:
source.utf8 <- function(f) {
  l <- readLines(f, encoding="UTF-8")
  eval(parse(text=l), envir=.GlobalEnv)
}
It works fine.
Building on crow's answer, this solution makes RStudio's Source button work.
When hitting that Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:
source <- function(f, encoding = 'UTF-8') {
  l <- readLines(f, encoding=encoding)
  eval(parse(text=l), envir=.GlobalEnv)
}
You can then add that script to an .Rprofile file, so it will execute on startup.
I encountered this problem when I tried to source a .R file containing some Chinese characters. In my case, I found that merely setting "LC_CTYPE" to "chinese" is not enough, but setting "LC_ALL" to "chinese" works well.
Note that it's not enough to get the encoding right when you read or write a plain text file with non-ASCII characters in RStudio (or R?). The locale setting counts too.
PS: the command is Sys.setlocale(category = "LC_CTYPE", locale = "chinese"). Please replace the locale value accordingly.
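A minimal sketch of that workflow, assuming a Windows system where the "chinese" locale is available and using the file from the original question:
Sys.setlocale(category = "LC_ALL", locale = "chinese")  # LC_CTYPE alone was not enough here
source("C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")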
On Windows, when you copy-paste a Unicode or UTF-8 encoded string into a text control that is set to single-byte input (ASCII... depending on locale), the unknown bytes will be replaced by question marks. If I take the first 4 characters of your string and copy-paste them into e.g. Notepad and then save it, the file in hex becomes:
52 3F 3F 3F 3F
What you have to do is find an editor which you can set to UTF-8 before copy-pasting the text into it; then the saved file (of your first 4 characters) becomes:
52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB
This will then be recognized as valid utf-8 by [R].
I used "Notepad2" for trying this, but i am sure there are many more.
