How to get proper encoding using browseURL()?

I'm basically trying to browse a URL with Japanese letters in it. This question builds on my first question from yesterday. My code now generates the right URL, and if I paste that URL into my browser I get the right result, but when I try to automate the process with browseURL() I get a wrong result.
E.g. I am trying to call the following URL:
http://www.google.com/trends/trendsReport?hl=en-US&q=VWゴルフ %2B VWポロ %2B VWパサート %2B VWティグアン&date=1%2F2010 68m&cmpt=q&content=1&export=1
if I now use
browseURL("http://www.google.com/trends/trendsReport?hl=en-US&q=VWゴルフ %2B VWポロ %2B VWパサート %2B VWティグアン&date=1%2F2010 68m&cmpt=q&content=1&export=1")
I can see in the browser that it browsed
www.google.com/trends/trendsReport?hl=en-US&q=VW%E3%83%BB%EF%BD%BDS%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BDt%20%2B%20VW%E3%83%BB%EF%BD%BD%7C%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BD%20%2B%20VW%E3%83%BB%EF%BD%BDp%E3%83%BB%EF%BD%BDT%E3%83%BB%EF%BD%BD[%E3%83%BB%EF%BD%BDg%20%2B%20VW%E3%83%BB%EF%BD%BDe%E3%83%BB%EF%BD%BDB%E3%83%BB%EF%BD%BDO%E3%83%BB%EF%BD%BDA%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BD&date=1%2F2010%2068m&cmpt=q&content=1&export=1
which seems to be an encoding mistake. I already tried
browseURL(URL, encodeIfNeeded=TRUE)
but that doesn't seem to change a thing, and as far as I interpret the function it also shouldn't, because this argument is there to generate those "%"-escapes, which makes it even more surprising that I get them even when encodeIfNeeded = FALSE.
Any help is highly appreciated!
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 8 (build 9200)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=Japanese_Japan.932 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.2.1

I think this will get around the issue:
library(httr)
library(curl)
gt_url <- "http://www.google.com/trends/trendsReport?hl=en-US&q=VWゴルフ %2B VWポロ %2B VWパサート %2B VWティグアン&date=1%2F2010 68m&cmpt=q&content=1&export=1"
# ensure the %2B's aren't getting in the way then
# ask httr to carve up the url and put it back together
parts <- parse_url(URLdecode(gt_url))
browseURL(build_url(parts))
That gives this (the result is too long to paste, but I want to make sure the OP gets to see the whole content).
I also now see why you have to do it this way (both download.file and httr::GET with write_disk fail due to the JavaScript redirect).
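For reference, a quick way to inspect what parse_url() recovers before rebuilding (a sketch; parse_url() returns a plain named list, so the pieces are easy to examine):
library(httr)
parts <- parse_url(URLdecode(gt_url))  # gt_url as defined above
str(parts$query)   # one list element per query parameter
build_url(parts)   # reassembles the URL with consistent percent-encoding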

Related

Using urltools::url_parse with UTF-8 domains

The function url_parse is very fast and works fine most of the time. But nowadays domain names may contain UTF-8 characters, for example:
url <- "www.cordes-tiefkühlprodukte.de"
Now if I apply url_parse to this URL, I get a special character "<fc>" in the domain column:
url_parse(url)
  scheme                            domain port path parameter fragment
1   <NA> www.cordes-tiefk<fc>hlprodukte.de <NA> <NA>      <NA>     <NA>
My question is: how can I "fix" this entry to UTF-8? I tried iconv and some functions from the stringi package, but without success.
(I am aware of httr::parse_url, which does not have this problem. So one approach would be to detect the URLs that are not ASCII, and use url_parse on the ASCII ones and parse_url on the few special cases. However, this leads to the problem of (efficiently) detecting the non-ASCII URLs.)
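For what it's worth, a minimal sketch of detecting the non-ASCII URLs quickly (this uses stringi, which is not part of the question's setup):
library(stringi)
urls <- c("www.cordes-tiefkühlprodukte.de", "www.example.com")
ascii <- stri_enc_isascii(urls)  # TRUE where the URL is pure ASCII
urls[!ascii]  # the few special cases to hand to httr::parse_url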
EDIT: Unfortunately, url1 <- URLencode(enc2utf8(url)) does not help. When I do
robotstxt::paths_allowed(
  url1,
  domain = urltools::suffix_extract(urltools::domain(url1))
)
I get an error "could not resolve host". However, plugging in the original URL and the second-level domain by hand, paths_allowed works.
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] urltools_1.7.3 fortunes_1.5-4
loaded via a namespace (and not attached):
[1] compiler_3.6.1 Rcpp_1.0.1 triebeard_0.3.0
I could reproduce the issue. I was able to convert the domain column to UTF-8 by reading it with readr::parse_character and latin1 encoding:
library(urltools)
library(tidyverse)
url <- "www.cordes-tiefkühlprodukte.de"
parts <-
  url_parse(url) %>%
  mutate(domain = parse_character(domain, locale = locale(encoding = "latin1")))
parts
  scheme                         domain port path parameter fragment
1   <NA> www.cordes-tiefkühlprodukte.de <NA> <NA>      <NA>     <NA>
I guess that the encoding you have to specify (here latin1) depends only on your locale and not on the URL's special characters, but I'm not 100% sure about that.
Just for reference, another method that works fine for me is:
library(stringi)
url <- "www.cordes-tiefkühlprodukte.de"
url <- stri_escape_unicode(url)
dat <- urltools::url_parse(url)
for (cn in colnames(dat)) dat[, cn] <- stri_unescape_unicode(dat[, cn])

encoding error with read_html

I am trying to web scrape a page. I thought of using the package rvest.
However, I'm stuck in the first step, which is to use read_html to read the content.
Here's my code:
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url,
                           encoding = "ISO-8895-1")
And I got the following error:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]
I tried using what similar questions had as answers, but it did not solve my issue:
obra_caridade <- read_html(iconv(url, to = "UTF-8"),
encoding = "UTF-8")
obra_caridade <- read_html(iconv(url, to = "ISO-8895-1"),
encoding = "ISO-8895-1")
Both attempts returned a similar error.
Does anyone have any suggestions about how to solve this issue?
Here's my session info:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rvest_0.3.2 xml2_1.1.1
loaded via a namespace (and not attached):
[1] httr_1.2.1 magrittr_1.5 R6_2.2.1 tools_3.3.1 curl_2.6 Rcpp_0.12.11
What's the issue?
Your issue here is in correctly determining the encoding of the webpage.
The good news
Your approach looks like a good one to me since you looked at the source code and found the Meta charset, given as ISO-8895-1. It is certainly ideal to be told the encoding, rather than have to resort to guess-work.
The bad news
I don't believe that encoding exists. Firstly, when I search for it online the results tend to look like typos. Secondly, R provides you with a list of supported encodings via iconvlist(). ISO-8895-1 is not in the list, so entering it as an argument to read_html isn't useful. I think it'd be nice if entering a non-supported encoding threw a warning, but this doesn't seem to happen.
Quick solution
As suggested by @MrFlick in a comment, using encoding = "latin1" appears to work.
I suspect the Meta charset has a typo and it should read ISO-8859-1 (which is the same thing as latin1).
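A minimal sketch of that quick solution (same URL as above; "latin1" is an alias for ISO-8859-1 that R's iconv always accepts):
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url, encoding = "latin1")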
Tips on guessing an encoding
What is your browser doing?
When loading the page in a browser, you can see what encoding it is using to read the page. If the page looks right, this seems like a sensible guess. In this instance, Firefox uses Western encoding (i.e. ISO-8859-1).
Guessing with R
rvest::guess_encoding is a nice, user-friendly function which can give a quick estimate. You can provide the function with a url e.g. guess_encoding(url), or copy in phrases with more complex characters e.g. guess_encoding("Situação do Termo/Convênio:").
One thing to note about this function is it can only detect from 30 of the more common encodings, but there are many more possibilities.
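For instance, a quick sketch using a phrase from the page:
library(rvest)
guess_encoding("Situação do Termo/Convênio:")
# returns a small table of candidate encodings ranked by confidence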
As mentioned earlier, iconvlist() provides a list of supported encodings. By looping through these encodings and examining some text in the page to see if it's what we expect, we should end up with a shortlist of possible encodings (and rule many encodings out).
Sample code can be found at the bottom of this answer.
Final comments
All the above points towards ISO-8859-1 being a sensible guess for the encoding.
The page URL contains a .br extension indicating it's Brazilian, and, according to Wikipedia, this encoding has complete language coverage for Brazilian Portuguese, which suggests it might not be a crazy choice for whoever created the webpage. I believe this is also a reasonably common encoding type.
Code
Sample code for 'Guessing with R' point 2 (using iconvlist()):
library(rvest)

url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"

# 1. See which encodings don't throw an error
read_page <- lapply(unique(iconvlist()), function(encoding_attempt) {
  # Optional print statement to show progress (fraction done), since this can take some time
  print(match(encoding_attempt, iconvlist()) / length(iconvlist()))
  read_attempt <- tryCatch(expr = read_html(url, encoding = encoding_attempt),
                           error = function(condition) NA,
                           warning = function(condition) message(condition))
  return(read_attempt)
})
names(read_page) <- unique(iconvlist())

# 2. See which encodings correctly display some complex characters
read_phrase <- lapply(read_page, function(encoded_page)
  if (inherits(encoded_page, "xml_document"))
    html_text(html_nodes(encoded_page, ".dl-horizontal:nth-child(1) dt")))

# We've ended up with 27 encodings which could be sensible...
encoding_shortlist <- names(read_phrase)[vapply(read_phrase, identical, logical(1), "Situação:")]

setMethodS3 fails to correctly reassign default function

When I create S3 methods in R 2.14.1 and then call them, the S3 objects fail to execute the methods in cases where the method has the same name as a function already loaded into the workspace (i.e. base functions). Instead, the base function is called and returns an error. This example uses 'match'. I never had this problem before today. Since I last ran this code, I installed R 3.0.2 but kept my 2.14.1 version. I ran into some (different) trouble with 3.0.2 due to certain packages not being up to date on CRAN, so I reverted RStudio to 2.14.1, and then this problem cropped up. Here's an example:
rm(list=ls())
library(R.oo)
this.df <- data.frame(letter=c("A","B","C"), number=1:3)
setConstructorS3("TestClass", function(DF) {
  if (missing(DF)) {
    data <- NA
  } else {
    data <- DF
  }
  extend(Object(), "TestClass",
         .data=data
  )
})
setMethodS3(name="match", class="TestClass", function(this, letter, number, ...) {
  ret <- rep(TRUE, nrow(this$.data))
  if (!missing(number))
    ret <- ret & (this$.data$number %in% number)
  if (!missing(letter)) {
    ret <- ret & (this$.data$letter %in% letter)
  }
  return(ret)
})
setMethodS3("get_data", "TestClass", function(this, ...) {
  return(this$.data[this$match(...), ])
})
hope <- TestClass(this.df)
hope$match()
Error in match(this, ...) : argument "table" is missing, with no default
hope$get_data()
Here's the sessionInfo() for clues:
sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] R.oo_1.13.0 R.methodsS3_1.4.2
loaded via a namespace (and not attached):
[1] tools_2.14.1
I tried a lot of combinations of the arguments in setMethodS3 with no luck.
How can I fix this?
First, I highly recommend calling S3 methods the regular way and not via the <object>$method(...) way, e.g. match(hope) instead of hope$match(). If you do that, everything works as expected.
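A minimal sketch of that regular-way call, using the objects from the question (expected results derived from the method's own logic):
hope <- TestClass(this.df)
match(hope)                # dispatches to the TestClass method: TRUE TRUE TRUE
match(hope, letter = "A")  # TRUE FALSE FALSE for the example data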
Second, I can reproduce this issue with R 3.0.2 and R.oo 1.17.0. There appears to be some issue with using the particular method name match() here, because if you instead use match2(), calling hope$match2() works as expected. I've seen similar problems when trying to create S3 methods named assign() and get(). The latter actually generates an error if tried, e.g. "Trying to use an unsafe generic method name (trust us, it is for a good reason): get". I'll add assign() and most likely match() to the list of no-no names. DETAILS: Those functions are very special in R, so one should avoid using their names. If you use them anyway, S3 generic functions are created for them, all calls get dispatched via those generics, and that is not compatible with how R itself relies on these functions.
Finally, you should really, really update your R; it's ancient, and few people will bother trying to help with issues unless you run the most recent stable R version (now R 3.0.2, soon to be R 3.1.0). At a minimum, make sure to run the latest versions of the packages (your R.methodsS3 and R.oo versions are nearly two and one years old by now, with important updates since).
Hope this helps

how to display đ, ư, ơ, ă in R graphs

I am trying to put Vietnamese labels in R graphs. I use RStudio and save my code with UTF-8 encoding. It handles the Vietnamese characters I put in the code well; I mean, everything shows up properly in the code. However, in the graphs I make, while many characters display OK, several important ones do not show up properly, including:
đ - which displays incorrectly as d
ư - which displays incorrectly as u
ơ - which displays incorrectly as o
ă - which displays incorrectly as a
Unfortunately this makes my graphs look unprofessional and untrustworthy.
I would really appreciate it if someone can help me figure this out.
Thanks much!
Trang
@DWin: I am on Windows 7, and here is my sessionInfo():
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] MASS_7.3-17 MatchIt_2.4-20 tools_2.15.0
@krlmlr: Here's my code for a simple graph:
knorelative.count <- matrix(nrow=5, ncol=1)
knorelative.count[,1] <- c(1579, 638, 215, 100, 120)
par(mar=c(2,4,4,2))
barplot(prop.table(knorelative.count), beside=TRUE,
        yaxt="n", ylim=c(0,.6),
        legend=c("không ai biết",
                 "không biết nhiều hơn biết",
                 "nửa biết, nửa không biết",
                 "biết nhiều hơn không biết",
                 "tất cả đều biết"),
        main="Người khác trong gia đình, họ hàng biết hay không")
axis(2, at=seq(0,.6,.1), labels=paste(100*seq(0,.6,.1), "%", sep=""), las=1)
When I run this, the đ in the main title and the two ơ's and the đ in the legend turn into d and o.
You are often safer specifying Unicode characters by their hex codes:
plot(1:4,rep(1,4),pch=c("\u0111","\u01B0","\u01A1","\u0103"),cex=4)
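For reference, the escapes above correspond to the four characters from the question (codepoints taken from the Unicode charts):
# đ = "\u0111", ư = "\u01B0", ơ = "\u01A1", ă = "\u0103"
cat("\u0111 \u01B0 \u01A1 \u0103\n")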
For any Vietnamese folks out there who run into the same problem, here's an example of the fix using hex codes suggested by James:
print("trường")
[1] "truờng"
print("tr\u01B0ờng")
[1] "trường"
While I can type Vietnamese (in this example, the word trường) into my R console OK, any kind of output (e.g. print, graph) fails to display the character ư. Replacing ư with the hex code fixes the output.
(Note: I used the function paste earlier, but then edited this based on James's suggestion to stick the hex code in the character string.)
I am so thankful to learn this way. I will do this for the report I am currently writing.
Trang
Another approach: save the code in a separate file, e.g. R/graph_display.R, and then run it with
eval(parse("R/graph_display.R", encoding = "UTF-8"))
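For what it's worth, source() accepts the same encoding argument, so an equivalent call (a sketch) would be:
source("R/graph_display.R", encoding = "UTF-8")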

R exits unexpectedly when trying to print a date vector with length more than 130K

I have tried the following code, and then R exits unexpectedly:
temp <- rep(as.Date("2009-01-01")+1:365, 365)
print(temp)
Has anyone tried this before? Is it a bug, or is there anything I can do?
I have increased the memory available to R from 1024M to 2047M, but the same thing happens.
Thanks.
UPDATE #1
Here's my sessionInfo()
> sessionInfo()
R version 2.11.0 (2010-04-22)
i386-pc-mingw32
locale:
[1] LC_COLLATE=Chinese_Hong Kong S.A.R..950 LC_CTYPE=Chinese_Hong Kong S.A.R..950
[3] LC_MONETARY=Chinese_Hong Kong S.A.R..950 LC_NUMERIC=C
[5] LC_TIME=Chinese_Hong Kong S.A.R..950
attached base packages:
[1] stats graphics grDevices utils datasets methods base
And actually I am trying to do some format(foo, "%y%M"), but it also exits unexpectedly; by "unexpectedly" I mean R closes itself without any warning. Thanks again.
This looks to me like a known bug that was fixed in 2.11.1. See the 2.11.1 changelog, section BUG FIXES, 8th item.
Works fine for me (R 2.12.0, running on Fedora Core 13)
You may check the output of getOption("max.print") and lower it a little (using, for instance, options(max.print=5000)).
In general, however, there is no need to print out such a vector, as you will not be able to read the complete output anyway. Functions like str(temp) or head(temp) are your friends!
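A minimal sketch of the max.print workaround and the inspection helpers mentioned above:
getOption("max.print")       # the current limit on printed entries
options(max.print = 5000)    # lower it so huge vectors don't flood the console
temp <- rep(as.Date("2009-01-01") + 1:365, 365)
head(temp)  # first few values
str(temp)   # compact structural summary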
