What's the "internal method" of R's download.file? - r

I'm trying to download the following dataset with download.file, which only works when method = "wget":
# Doesn't work
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "auto")
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "curl")
# Works
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "wget")
According to help(download.file),
If method = "auto" is chosen (the default), the internal method is
chosen for file:// URLs, and for the others provided
capabilities("http/ftp") is true (which it almost always is).
Looking at the source code, "internal method" refers to:
if (method == "internal") {
status <- .External(C_download, url, destfile, quiet,
mode, cacheOK)
if (!quiet)
flush.console()
}
But I still don't know what .External(C_download) does, especially across platforms. It's important for me to know this, rather than relying on wget, because I'm writing a package that should work cross-platform.

The source code for this is in the R sources (download the current version from http://cran.r-project.org/sources.html). The relevant code (as of R 3.2.1) is in "./src/modules/internet/internet.c" and "./src/modules/internet/nanohttp.c".
According to the latter, the code for the minimalist HTTP GET functionality is based on libxml2-2.3.6.
The files are also available on the R svn site at https://svn.r-project.org/R/branches/R-3-2-branch/src/modules/internet/internet.c and https://svn.r-project.org/R/branches/R-3-2-branch/src/modules/internet/nanohttp.c if you'd prefer not to download the whole .tgz file and decompress it.
If you look at the code, most of it is consistent across platforms. However, on Windows, the wininet code seems to be used.
The code was identified by looking initially in the utils package, since that is where the R command download.file is found. I grepped for download in the C files in the "./src/library/utils/src" directory and found that the relevant code was in "sock.c". There was a comment high up in that file which read /* from src/main/internet.c */, and so I next went to "internet.c".
With respect to your specific file, the issue is that the link you have returns a 302 Found status code. On Windows and with wget, the download routine follows the Location field of the 302 response and gets the actual file. The curl method also works, but only if you supply the parameter extra = "-L":
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "curl", extra="-L")
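If you want to confirm the redirect from within R, a minimal check is the following (assuming R >= 3.2.0, which provides curlGetHeaders()):
h <- curlGetHeaders("http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta", redirect = FALSE)
attr(h, "status")  # should be 302; the Location: header holds the URL of the actual file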
There's a package called downloader which claims to offer a good cross-platform solution for https. Given an http URL, it just passes the call on to download.file. Here's a version that works for http too. It also defaults to binary transfers, which generally seems to be a good idea.
my_download <- function(url, destfile, method, quiet = FALSE, mode = "wb",
                        cacheOK = TRUE, extra = getOption("download.file.extra")) {
    if (.Platform$OS.type == "windows" &&
        (missing(method) || method %in% c("auto", "internal", "wininet"))) {
        # on Windows, switch the internal download code to wininet (Internet2)
        seti2 <- utils::"setInternet2"
        internet2_start <- seti2(NA)
        on.exit(suppressWarnings(seti2(internet2_start)))
        suppressWarnings(seti2(TRUE))
    } else {
        if (missing(method)) {
            if (nzchar(Sys.which("wget")[1])) {
                method <- "wget"
            } else if (nzchar(Sys.which("curl")[1])) {
                method <- "curl"
                # curl needs -L to follow redirects such as the 302 above
                if (is.null(extra) || !grepl("-L", extra)) {
                    extra <- paste("-L", extra)
                }
            } else if (nzchar(Sys.which("lynx")[1])) {
                method <- "lynx"
            } else {
                stop("no download method found")
            }
        }
    }
    download.file(url = url, destfile = destfile, method = method, quiet = quiet,
                  mode = mode, cacheOK = cacheOK, extra = extra)
}
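A hypothetical call, using the Box URL from the question (the destination file name is only an illustration):
tmp <- tempfile(fileext = ".dta")
my_download("http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta", destfile = tmp)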

You can answer this yourself. Just type download.file at the console prompt and you should see this near the top of the function definition:
if (method == "auto") { # this is actually the default from
# getOption("download.file.method", default = "auto")
if (capabilities("http/ftp"))
method <- "internal"
else if (length(grep("^file:", url))) {
method <- "internal"
url <- URLdecode(url)
}
else if (system("wget --help > /dev/null") == 0L)
method <- "wget"
else if (system("curl --help > /dev/null") == 0L)
method <- "curl"
else if (system("lynx -help > /dev/null") == 0L)
method <- "lynx"
else stop("no download method found")
}
if (method == "internal") {
status <- .External(C_download, url, destfile, quiet,
mode, cacheOK)
if (!quiet)
flush.console()
}
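To see which branch "auto" would take on your machine, a roughly equivalent check (using Sys.which() instead of the system() calls above) is:
Sys.which(c("wget", "curl", "lynx"))  # non-empty paths show which external downloaders are available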

Related

Why is there a Service Unavailable error when using lapply?

I am using the spotifyr library to find audio features for multiple tracks. For example, I can do this to find the audio features of a specific song using its id:
analysis2 <- get_track_audio_features("2xLMifQCjDGFmkHkpNLD9h",
                                      authorization = get_spotify_access_token())
Yesterday I wrote the function below, which takes all the tracks in a data frame, finds the audio features for each of them, and stores them in a list; it was working fine.
get_analysis <- function(track_id) {
    analysis <- get_track_audio_features(track_id,
                                         authorization = get_spotify_access_token())
}
tracks_list <- lapply(all_tracks$track.id, get_analysis)
Now I am getting an error saying Request failed [503] and Error in get_track_audio_features(track_id, authorization = get_spotify_access_token()) : Service Unavailable (HTTP 503).
I am still able to find the audio features of a specific song so I am not sure which service is unavailable.
I suspect you are reaching a song in your data for which the response is denied by Spotify. You could add an error-catching mechanism to see which one it is:
get_analysis <- function(track_id){
    analysis <- tryCatch(
        expr = {
            get_track_audio_features(track_id, authorization = get_spotify_access_token())
        },
        error = function(e){
            print(track_id)
        })
    return(analysis)
}
tracks_list <- lapply(all_tracks$track.id, get_analysis)
I looked at the source code for the package and didn't see any sneaky rate-limiting issues, and the Web API page lists error 503 as a generic error that resolves itself after waiting (https://developer.spotify.com/documentation/web-api/). So you could also try adding a 10-minute wait (I couldn't find exactly how long it should be on Spotify's website):
get_analysis <- function(track_id){
    output <- tryCatch(
        expr = {
            get_track_audio_features(track_id, authorization = get_spotify_access_token())
        },
        error = function(e){
            print(track_id)
            return(e)
        })
    return(output)
}
wait.function <- function(){
    Sys.sleep(600)
}
get_analysis_master <- function(all_tracks){
    k <- 1
    tracks_list <- list()
    for(track.id in all_tracks$track.id){
        output <- get_analysis(track.id)
        if(!inherits(output, "error")){
            tracks_list[[k]] <- output
            k <- k + 1
        } else {
            wait.function()
        }
    }
    return(tracks_list)
}
tracks_list <- get_analysis_master(all_tracks)
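An alternative sketch that retries the same track instead of skipping it; the attempt count and wait time below are guesses, not documented Spotify limits, and it assumes spotifyr is loaded and authorized as above:
get_analysis_retry <- function(track_id, attempts = 3, wait_secs = 600) {
    for (i in seq_len(attempts)) {
        output <- tryCatch(
            get_track_audio_features(track_id, authorization = get_spotify_access_token()),
            error = function(e) e)
        if (!inherits(output, "error")) return(output)
        if (i < attempts) Sys.sleep(wait_secs)  # wait, then retry the same track
    }
    output  # still an error after all attempts; returned so the caller can inspect it
}
tracks_list <- lapply(all_tracks$track.id, get_analysis_retry)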

open.connection failing in geocoding function

I'm currently running a geocoding function (using the google_places function in the googleway package). The function will run for a while (I have almost 3k locations), then throw the following error:
Error in open.connection(con, "rb") :
schannel: next InitializeSecurityContext failed: SEC_E_ILLEGAL_MESSAGE (0x80090326) - This error usually occurs when a fatal SSL/TLS alert is received (e.g. handshake failed). More detail may be available in the Windows System event log.
Having consulted the system event log, I found the following information:
The machine-default permission settings do not grant Local Activation permission for the COM Server application with CLSID
{9BA05972-F6A8-11CF-A442-00A0C90A8F39}
and APPID
{9BA05972-F6A8-11CF-A442-00A0C90A8F39}
I'm not really sure what to do with this information. From my limited knowledge, it appears this is some sort of security/firewall issue. How should I go about giving R the permissions needed to run this function?
I am running Windows 10 with Windows Defender as antivirus/firewall. For reference, this is the function I am using for geocoding:
metro.locater <- function(lat, lon){
    library(googleway)
    # putting latitude and longitude into the same vector
    latlon <- c(lat, lon)
    # getting places result
    res <- google_places(location = latlon,
                         place_type = "subway_station", radius = 50000,
                         rankby = "distance",
                         key = "myKey")
    # condition handling
    if(res$status == 'OK'){
        closest <- res$results[1:3, ]
        return(closest)
    } else {
        try(return(res$status))
    }
}
I was able to fix the issue by using an adverb (function operator) I'd used with another geocoding function, which retries the call up to 5 times when it fails to return a result. Given that this worked, it seems likely that this was just a transient error rather than a systemic issue.
The adverb I used:
safely <- function(fn, ..., max_attempts = 5) {
    function(...) {
        this_env <- environment()
        for(i in seq_len(max_attempts)) {
            ok <- tryCatch({
                assign("result", fn(...), envir = this_env)
                TRUE
            },
            error = function(e) {
                FALSE
            })
            if(ok) {
                return(this_env$result)
            }
        }
        msg <- sprintf(
            "%s failed after %d tries; returning NULL.",
            deparse(match.call()),
            max_attempts
        )
        warning(msg)
        NULL
    }
}
Taken from Repeating values in loop until error disappears.
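As a hypothetical usage with the geocoding function above (the coordinates are just an example, roughly central Bern):
safe_metro_locater <- safely(metro.locater)
nearest <- safe_metro_locater(46.948, 7.447)  # retried up to 5 times; NULL with a warning if all attempts fail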

Unable to create folder with RCurl

I'm having trouble using the ftpUpload() function of RCurl to upload a file to a non-existent folder on an SFTP server. I want the folder to be created if it's not there, using the ftp.create.missing.dirs option. Here's my code currently:
.opts <- list(ftp.create.missing.dirs = TRUE)
ftpUpload(what = "test.txt",
          to = "sftp://ftp.testserver.com:22/newFolder/existingfile.txt",
          userpwd = paste(user, pwd, sep = ":"), .opts = .opts)
It doesn't seem to be working as I get the following error:
* Initialized password authentication
* Authentication complete
* Failed to close libssh2 file
I can upload a file to an existing folder successfully; it's just when the folder isn't there that I get the error.
The problem seems to be due to the fact that you are trying to create the new folder, as seen in this question: Create an remote directory using SFTP / RCurl
The code that produces the error can be found in the Microsoft R Open Git repository:
case SSH_SFTP_CLOSE:
    if(sshc->sftp_handle) {
        rc = libssh2_sftp_close(sshc->sftp_handle);
        if(rc == LIBSSH2_ERROR_EAGAIN) {
            break;
        }
        else if(rc < 0) {
            infof(data, "Failed to close libssh2 file\n");
        }
        sshc->sftp_handle = NULL;
    }
    if(sftp_scp)
        Curl_safefree(sftp_scp->path);
In this code, rc holds the return value of the libssh2_sftp_close function (more info here: https://www.libssh2.org/libssh2_sftp_close_handle.html), which tries to close the handle for the nonexistent directory, resulting in the error.
Try using curlPerform, for example:
curlPerform(url = "ftp.xxx.xxx.xxx.xxx/", postquote = "MkDir /newFolder/", userpwd = "user:pass")

R: 'unable to connect to 'maps.googleapis.com' on port 80' inside foreach loop

I'm new to stackoverflow, so please correct me if I make any major mistakes.
As a part of a bigger project I have a function that requests routes from Google and calculates the driving time, I do this with the package ggmap. This worked perfectly fine until I tried to speed things up on other parts of the project and needed to call the driving time function within a foreach loop. In the loop, when I use %dopar% it throws this error:
unable to connect to 'maps.googleapis.com' on port 80.
Does anyone know, where this error comes from and how it can be fixed?
I managed to produce a small example that shows the behaviour:
# necessary packages
library(ggmap)
library(doParallel)
library(doSNOW)
library(foreach)
# some lines to test the function in a for and a foreach loop
Origins <- c("Bern","Biel","Thun","Spiez")
Destinations <- c("Biel","Thun","Spiez","Bern")
numRoutes = length(Origins)
# numCores = detectCores()
# I use only 1 core in testing to make sure that the debug-file is readable
cl <- snow::makeCluster(1, outfile = "debug.txt")
registerDoSNOW(cl)
timesDoPar <- foreach(idx = 1:numRoutes,
                      .packages = c("ggmap")) %dopar% {
    getDrivingTime(Origins[idx], Destinations[idx])
}
timesDo <- foreach(idx = 1:numRoutes,
                   .packages = c("ggmap")) %do% {
    getDrivingTime(Origins[idx], Destinations[idx])
}
stopCluster(cl)
The function (with some extra for debugging):
getDrivingTime <- function(from, to){
    if (from == to){
        drivingTimeMin = 0
    } else {
        route_simple <- tryCatch({
            message("Trying to get route from Google")
            route(from, to, structure = "route", mode = "driving", output = "simple")
        },
        error = function(cond) {
            message("Route throws an error:\nHere's the original error message:")
            message(cond)
            return(data.frame(minutes = 0))
        },
        warning = function(cond) {
            message("Route throws a warning:\nHere's the original warning message:")
            message(cond)
            return(data.frame(minutes = 0))
        },
        finally = {
            message(paste0("\nProcessed route: ", from, "; ", to, "\n\n"))
        })
        drivingTimeMin = sum(route_simple$minutes, na.rm = TRUE)
    }
    return(drivingTimeMin)
}
I'm aware that in this example it makes no sense to use parallel programming, especially with only one core, but in the scope of the full project it is needed.
I couldn't find any useful information related to this except for this question, where the person asking suggests that the problem might be with their company's network. I don't think that this is the case for me, since it works with %do%. I haven't been able to test it on another network yet, though.
(I'm working on Windows 7, using a portable version of R (R version 3.1.0) and R Studio (Version 0.98.501))
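A diagnostic sketch that could help narrow this down (run it before stopCluster(cl)): test plain connectivity from inside a worker. If this also fails, the worker R process itself is being blocked, for example by the firewall, rather than anything specific to ggmap:
snow::clusterEvalQ(cl, tryCatch(readLines("http://maps.googleapis.com", n = 1),
                                error = function(e) conditionMessage(e)))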

RCurl memory leak in getURL method

It looks like we have hit a bug in RCurl. The method getURL seems to be leaking memory. A simple test case to reproduce the bug is given here:
library(RCurl)
handle <- getCurlHandle()
range <- 1:100
for (r in range) {
    x <- getURL(url = "news.google.com.au", curl = handle)
}
If I run this code, the memory allocated to the R session is never recovered.
We are using RCurl for some long running experiments and we are running out of memory on the test system.
The specs of our test system are as follows:
OS: Ubuntu 14.04 (64 bit)
Memory: 24 GB
RCurl version: 1.95-4.3
Any ideas about how to get around this issue?
Thanks
See if getURLContent() also exhibits the problem, i.e. replace getURL() with getURLContent().
The function getURLContent() is a richer version of getURL() and one that gets more attention.
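Applied to the test case above, that substitution would look like this:
library(RCurl)
handle <- getCurlHandle()
for (r in 1:100) {
    x <- getURLContent(url = "news.google.com.au", curl = handle)
}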
I just hit this too, and made the following code change to work around it:
LEAK (Old code)
h <- basicHeaderGatherer()
tmp <- tryCatch(getURL(url = url,
                       headerfunction = h$update,
                       useragent = R.version.string,
                       timeout = timeout_secs),
                error = function(x) { .__curlError <<- TRUE; .__curlErrorMessage <<- x$message })
NO LEAK (New code)
method <- "GET"
h <- basicHeaderGatherer()
t <- basicTextGatherer()
tmp <- tryCatch(curlPerform(url = url,
customrequest = method,
writefunction = t$update,
headerfunction = h$update,
useragent=R.version.string,
verbose = FALSE,
timeout = timeout_secs),
error = function(x) { .__curlError <<- TRUE; .__curlErrorMessage <<- x$message })
