RCurl memory leak in getURL method

It looks like we have hit a bug in RCurl. The method getURL seems to be leaking memory. A simple test case to reproduce the bug is given here:
library(RCurl)
handle <- getCurlHandle()
range <- 1:100
for (r in range) {
  x <- getURL(url = "news.google.com.au", curl = handle)
}
If I run this code, the memory allocated to the R session is never recovered.
We are using RCurl for some long-running experiments, and we are running out of memory on the test system.
The specs of our test system are as follows:
OS: Ubuntu 14.04 (64 bit)
Memory: 24 GB
RCurl version: 1.95-4.3
Any ideas about how to get around this issue?
Thanks

See if getURLContent() also exhibits the problem, i.e. replace getURL() with getURLContent().
The function getURLContent() is a richer version of getURL() and one that gets more attention.
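For what it's worth, a minimal sketch of that substitution in the original test loop (untested; getURLContent() accepts the same curl= argument as getURL()):
library(RCurl)
handle <- getCurlHandle()
for (r in 1:100) {
  # same request as before, but via the richer getURLContent()
  x <- getURLContent("news.google.com.au", curl = handle)
}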

I just hit this too, and made the following code change to work around it:
LEAK (Old code)
h <- basicHeaderGatherer()
tmp <- tryCatch(getURL(url = url,
                       headerfunction = h$update,
                       useragent = R.version.string,
                       timeout = timeout_secs),
                error = function(x) { .__curlError <<- TRUE; .__curlErrorMessage <<- x$message })
NO LEAK (New code)
method <- "GET"
h <- basicHeaderGatherer()
t <- basicTextGatherer()
tmp <- tryCatch(curlPerform(url = url,
                            customrequest = method,
                            writefunction = t$update,
                            headerfunction = h$update,
                            useragent = R.version.string,
                            verbose = FALSE,
                            timeout = timeout_secs),
                error = function(x) { .__curlError <<- TRUE; .__curlErrorMessage <<- x$message })
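One difference worth noting with the curlPerform() approach: the response body is no longer the return value, so (if I recall the basicTextGatherer/basicHeaderGatherer interface correctly) you read it back from the gatherers afterwards:
body    <- t$value()   # response body accumulated by the text gatherer
headers <- h$value()   # response headers, including the status line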

Related

Connect to redis cluster in R

Suppose there are several hosts and ports for the redis servers, like
10.0.1.1:6381
10.0.1.1:6382
10.0.1.2:6381
10.0.1.2:6382
how can I configure redux::hiredis()?
I have googled around but can't find a solution. I noticed that the db parameter of the redis_config function carries the note "Do not use in a redis clustering context.", so I wondered whether there is a supported way to connect to a cluster. In addition, I have also tried passing redis://10.0.1.1:6381,10.0.1.1:6382,10.0.1.2:6381,10.0.1.2:6382 to the url parameter, but that failed as well.
Any suggestions? Or is there another package you would suggest?
My initial solution was to write a function that points to the correct node based on the error message:
library(redux)
library(stringr)  # for str_match() / str_split()

check_redis <- function(key = "P10000", host = "10.0.1.1", port = 6381) {
  r <- redux::hiredis(host = host, port = port)
  status <- tryCatch(
    {
      r$EXISTS(key = key)
    },
    error = function(e) {
      # extract the ip:port of the correct node from the cluster error message
      address <- str_match(e$message,
                           "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+:[0-9]+")
      host <- str_split(address, ":", simplify = TRUE)[1]
      port <- as.integer(str_split(address, ":", simplify = TRUE)[2])
      return(list(host = host, port = port))
    }
  )
  if (is.list(status)) {
    # reconnect to the node the cluster pointed us to
    r <- redux::hiredis(host = status$host, port = status$port)
  }
  return(r)
}
It can redirect to the correct node, but this solution is neither elegant nor efficient, so please advise.
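A hypothetical call, reusing the placeholder key and node address from above, would look like:
r <- check_redis(key = "P10000", host = "10.0.1.1", port = 6381)  # client pointed at the right node
r$GET("P10000")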

R GDAX-API Delete Request

I am having trouble with DELETE requests in R. I have been successful in making GET and POST requests using the code below. Any help / pointers will be appreciated.
It will require an api.key, secret & passphrase from GDAX to work.
Here is my function:
library(RCurl)
library(jsonlite)
library(httr)
library(digest)
cancel_order <- function(api.key,
                         secret,
                         passphrase) {
  api.url <- "https://api.gdax.com"
  # get url extension ----
  req.url <- "/orders/"
  # define method ----
  method <- "DELETE"
  url <- paste0(api.url, req.url)
  timestamp <-
    format(as.numeric(Sys.time()), digits = 13)  # create nonce
  key <- base64Decode(secret, mode = "raw")      # decode api secret
  # create final end point ----
  what <- paste0(timestamp, method, req.url)
  # create encoded signature ----
  sign <-
    base64Encode(hmac(key, what, algo = "sha256", raw = TRUE))  # hash
  # define headers ----
  httpheader <- list(
    'CB-ACCESS-KEY' = api.key,
    'CB-ACCESS-SIGN' = sign,
    'CB-ACCESS-TIMESTAMP' = timestamp,
    'CB-ACCESS-PASSPHRASE' = passphrase,
    'Content-Type' = 'application/json'
  )
  ## ------------------------------------------------
  response <- getURL(
    url = url,
    curl = getCurlHandle(useragent = "R"),
    httpheader = httpheader
  )
  print(rawToChar(response))  # rawToChar only on macOS and not on Win
}
The error I get is "{\"message\":\"invalid signature\"}", even though the same code and signature scheme work with GET & POST.
Ref: GDAX API DOCs
Just a guess, as I am not familiar with the API, but perhaps you are missing the 'order-id' ...
look at: https://docs.gdax.com/?javascript#cancel-an-order
OK. I took #mrflick's advice and pointed my connection to requestbin, based on his feedback on a different but related question.
After careful inspection, I realized that my request was for some reason being treated as a POST request and not a DELETE request. So I replaced the getURL function with another, higher-level function from RCurl to make it work:
response <- httpDELETE(
  url = url,
  curl = getCurlHandle(useragent = "R"),
  httpheader = httpheader
)
Everything else remains the same. Apparently there never was an issue with the signature.
I have now added this function to my unofficial wrapper, rgdax.
EDIT::
The unofficial wrapper is now official and on CRAN.
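For anyone who prefers to avoid RCurl entirely, the same request could presumably also be sent with httr (which the question already loads). This is an untested sketch that reuses url, sign, timestamp, api.key and passphrase from the function above:
response <- httr::DELETE(
  url,
  httr::add_headers(
    "CB-ACCESS-KEY"        = api.key,
    "CB-ACCESS-SIGN"       = sign,
    "CB-ACCESS-TIMESTAMP"  = timestamp,
    "CB-ACCESS-PASSPHRASE" = passphrase,
    "Content-Type"         = "application/json"
  ),
  httr::user_agent("R")
)
httr::content(response, as = "text")  # the API's JSON reply as a string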

R: 'unable to connect to 'maps.googleapis.com' on port 80' inside foreach loop

I'm new to stackoverflow, so please correct me if I make any major mistakes.
As part of a bigger project, I have a function that requests routes from Google and calculates the driving time; I do this with the ggmap package. This worked perfectly fine until I tried to speed things up in other parts of the project and needed to call the driving-time function within a foreach loop. In the loop, when I use %dopar%, it throws this error:
unable to connect to 'maps.googleapis.com' on port 80.
Does anyone know, where this error comes from and how it can be fixed?
I managed to produce a small example that shows the behaviour:
# necessary packages
library(ggmap)
library(doParallel)
library(doSNOW)
library(foreach)

# some lines to test the function in a for and a foreach loop
Origins      <- c("Bern", "Biel", "Thun", "Spiez")
Destinations <- c("Biel", "Thun", "Spiez", "Bern")
numRoutes <- length(Origins)
# numCores = detectCores()
# I use only 1 core in testing to make sure that the debug file is readable
cl <- snow::makeCluster(1, outfile = "debug.txt")
registerDoSNOW(cl)

timesDoPar <- foreach(idx = 1:numRoutes,
                      .packages = c("ggmap")) %dopar% {
  getDrivingTime(Origins[idx], Destinations[idx])
}

timesDo <- foreach(idx = 1:numRoutes,
                   .packages = c("ggmap")) %do% {
  getDrivingTime(Origins[idx], Destinations[idx])
}

stopCluster(cl)
The function (with some extra for debugging):
getDrivingTime <- function(from, to) {
  if (from == to) {
    drivingTimeMin <- 0
  } else {
    route_simple <- tryCatch({
      message("Trying to get route from Google")
      route(from, to, structure = "route", mode = "driving", output = "simple")
    },
    error = function(cond) {
      message("Route throws an error:\nHere's the original error message:")
      message(cond)
      return(data.frame(minutes = 0))
    },
    warning = function(cond) {
      message("Route throws a warning:\nHere's the original warning message:")
      message(cond)
      return(data.frame(minutes = 0))
    },
    finally = {
      message(paste0("\nProcessed route: ", from, "; ", to, "\n\n"))
    })
    drivingTimeMin <- sum(route_simple$minutes, na.rm = TRUE)
  }
  return(drivingTimeMin)
}
I'm aware that in this example it makes absolutely no sense to use parallel programming, especially with only one core, but it is needed in the scope of the full project.
I couldn't find any useful information related to this except for this question, where the asker suggests that the problem might be with their company's network. I don't think that this is the case for me, since it works with %do%. I haven't been able to test it on another network yet, though.
(I'm working on Windows 7, using a portable version of R (R version 3.1.0) and R Studio (Version 0.98.501))

What's the "internal method" of R's download.file?

I'm trying to download the following dataset with download.file, which only works when method = "wget":
# Doesn't work
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "auto")
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "curl")
# Works
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "wget")
According to help(download.file),
If method = "auto" is chosen (the default), the internal method is
chosen for file:// URLs, and for the others provided
capabilities("http/ftp") is true (which it almost always is).
Looking at the source code, "internal method" refers to:
if (method == "internal") {
status <- .External(C_download, url, destfile, quiet,
mode, cacheOK)
if (!quiet)
flush.console()
}
But still, I don't know what .External(C_download) does, especially across platforms. It's important for me to know this instead of relying on wget because I'm writing a package that should work cross-platform.
The source code for this is in the R sources (download the current version from http://cran.r-project.org/sources.html). The relevant code (as of R 3.2.1) is in "./src/modules/internet/internet.c" and "./src/modules/internet/nanohttp.c".
According to the latter, the code for the minimalist HTTP GET functionality is based on libxml2-2.3.6.
The files are also available on the R svn site at https://svn.r-project.org/R/branches/R-3-2-branch/src/modules/internet/internet.c and https://svn.r-project.org/R/branches/R-3-2-branch/src/modules/internet/nanohttp.c if you'd prefer not to download the whole .tgz file and decompress it.
If you look at the code, most of it is consistent across platforms. However, on Windows, the wininet code seems to be used.
The code was identified by looking initially in the utils package, since that is where the R command download.file is found. I grepped for download in the c files in the "./src/library/utils/src" directory and found that the relevant code was in "sock.c". There was a comment high up in that file which read /* from src/main/internet.c */ and so I next went to "internet.c".
With respect to your specific file, the issue is that the link you have returns a 302 Found status code. On Windows and using wget, the download routine follows the Location field of the 302 response and gets the actual file. Using the curl method works, but only if you supply the parameter extra = "-L":
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "curl", extra="-L")
There's a package called downloader which claims to offer a good cross-platform solution for https. Given an http URL, it just passes the call on to download.file. Here's a version that works for http too. It also defaults to binary transfers, which generally seems to be a good idea.
my_download <- function(url, destfile, method, quiet = FALSE,
                        mode = "wb", cacheOK = TRUE,
                        extra = getOption("download.file.extra")) {
  if (.Platform$OS.type == "windows" &&
      (missing(method) || method %in% c("auto", "internal", "wininet"))) {
    seti2 <- utils::"setInternet2"
    internet2_start <- seti2(NA)
    on.exit(suppressWarnings(seti2(internet2_start)))
    suppressWarnings(seti2(TRUE))
  } else {
    if (missing(method)) {
      if (nzchar(Sys.which("wget")[1])) {
        method <- "wget"
      } else if (nzchar(Sys.which("curl")[1])) {
        method <- "curl"
        # make sure curl follows redirects; extra may be NULL by default
        if (is.null(extra) || !grepl("-L", extra)) {
          extra <- paste("-L", extra)
        }
      } else if (nzchar(Sys.which("lynx")[1])) {
        method <- "lynx"
      } else {
        stop("no download method found")
      }
    }
  }
  download.file(url = url, destfile = destfile, method = method, quiet = quiet,
                mode = mode, cacheOK = cacheOK, extra = extra)
}
You can answer this yourself. Just type download.file at the console prompt and you should see this near the top of the function definition:
if (method == "auto") { # this is actually the default from
# getOption("download.file.method", default = "auto")
if (capabilities("http/ftp"))
method <- "internal"
else if (length(grep("^file:", url))) {
method <- "internal"
url <- URLdecode(url)
}
else if (system("wget --help > /dev/null") == 0L)
method <- "wget"
else if (system("curl --help > /dev/null") == 0L)
method <- "curl"
else if (system("lynx -help > /dev/null") == 0L)
method <- "lynx"
else stop("no download method found")
}
if (method == "internal") {
status <- .External(C_download, url, destfile, quiet,
mode, cacheOK)
if (!quiet)
flush.console()
}

CURL handle goes Stale when inside foreach()

Alright, so I've recently figured out that I can query a website behind a login screen for a CSV report. Then I thought, wouldn't it be even better to do this concurrently? After all, some reports take a lot longer to produce than others, and if I were querying 10 different reports at once, that would be way more efficient. So I'm now in over my head twice here, playing around with HTTPS protocols and also parallel processing. I think my frankencode is almost there, though, but it gives me:
"Error in ( : task 1 failed - "Stale CURL handle being passed to libcurl"
Note that the "curl" handle is very much current, as the "html" variable shows that the login succeeded. Something happens in its parallel chunk that makes it stale.
library(RCurl)
library(doParallel)

registerDoParallel(cores = 4)

agent <- "Firefox/23.0"
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

curl <- getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  curl = curl
)

un <- "username#domain.com"
pw <- "password"
html <- postForm(paste("https://login.salesforce.com/?un=", un, "&pw=", pw, sep = ""), curl = curl)

urls <- c("https://xyz123.salesforce.com/00O400000046ayd?export=1&enc=UTF-8&xf=csv",
          "https://xyz123.salesforce.com/00O400000045sWu?export=1&enc=UTF-8&xf=csv",
          "https://xyz123.salesforce.com/00O400000045z3Q?export=1&enc=UTF-8&xf=csv")

x <- foreach(i = 1:4, .combine = rbind, .packages = c("RCurl")) %dopar% {
  xxx <- getURL(urls[i], curl = curl)
}
