Why does my download.file fail to complete? - r

I was trying to retrieve dozens of files from a website (addresses listed at urls) with the following code
L <- lapply(urls, read.xls, sheet=1,header=T,skip=1,perl="C:/perl/bin/perl.exe",row.names=NULL)
But after a few successful downloads I kept receiving this error:
Trying URL 'http://www.xyz.com'
Error in download.file(xls, tf, mode = "wb") :
cannot open URL 'http://www.xyz.com'
In addition: Warning message:
In download.file(xls, tf, mode = "wb") :
cannot open: HTTP status was '0 (nil)'
Error in file.exists(tfn) : invalid 'file' argument
Why am I getting this error?

The error is caused by R's timeout option, which defaults to 60 seconds.
You can retrieve it by calling:
getOption("timeout")
To change it you simply run options(timeout = X), where X is your desired timeout in seconds.
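For example, a minimal sketch (the 300-second value is an arbitrary example; urls and the read.xls arguments are the ones from the question):
# Check the current download timeout (60 seconds by default)
getOption("timeout")
# Allow each download up to 300 seconds, then retry the loop
options(timeout = 300)
L <- lapply(urls, read.xls, sheet = 1, header = TRUE, skip = 1,
            perl = "C:/perl/bin/perl.exe", row.names = NULL)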

Related

Reading CSV from URL previously worked, now returning error

I have a program that I intended to rerun regularly with minimal code manipulation. The code below previously ran successfully, but it has stopped working. I thought perhaps the server was down, but when I pasted the URL into my browser, it initiated a download of the csv file. So I think I'm missing something...
nyc_temp_data <- read.csv("https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries
&dataTypes=TMAX&stations=USW00094728&startDate=2014-01-01&endDate=2020-05-01&includeAttributes=true&units=standard&format=csv")
When I ran it today, I get the following error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open URL 'https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries
&dataTypes=TMAX&stations=USW00094728&startDate=2014-01-01&endDate=2020-05-01&includeAttributes=true&units=standard&format=csv': HTTP status was '400 '
Sub-optimal solution but gets the job done:
# Store the url: csv_url => character vector
csv_url <- "https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=TMAX&stations=USW00094728&startDate=2014-01-01&endDate=2020-05-01&includeAttributes=true&units=standard&format=csv"
# Store the full filepath to the desired output location / filename:
# output_fpath => character vector
output_fpath <- paste0(getwd(), "/nyc_temp_data.csv")
# Download the url and store it at the filepath: nyc_temp_data.csv written to disk
download.file(csv_url, output_fpath)
# Read in the csv from the file path. nyc_temp_data => data.frame
nyc_temp_data <- read.csv(output_fpath)
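One thing worth checking (not spelled out in the thread): in the answer above the csv_url string sits on a single line, whereas the read.csv() call in the question breaks the URL across two lines. If that line break is literal, the string contains an embedded newline and the request sent to the server is malformed, which on its own can produce an HTTP 400. A hedged sketch that assembles the same query from its parts so no stray whitespace can creep in:
# Build the query URL from its parts; paste0 adds no separators or newlines
base_url <- "https://www.ncei.noaa.gov/access/services/data/v1"
csv_url  <- paste0(base_url,
                   "?dataset=daily-summaries",
                   "&dataTypes=TMAX",
                   "&stations=USW00094728",
                   "&startDate=2014-01-01",
                   "&endDate=2020-05-01",
                   "&includeAttributes=true&units=standard&format=csv")
nyc_temp_data <- read.csv(csv_url)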

Why is there a database connection issue in the RNCEP package?

I am trying to use the "RNCEP" package in RStudio. I ran the following code:
install.packages("RNCEP", dependencies=TRUE)
library(RNCEP)
wx.extent <- NCEP.gather(variable= 'air', level=850, months.minmax=c(8,9),
years.minmax=c(2006,2007), lat.southnorth=c(50,55), lon.westeast=c(0,5),
reanalysis2 = FALSE, return.units = TRUE)
I got the following error messages:
trying URL
'http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis/pressure/air.2006.nc.das'
Content length 660 bytes
Error in NCEP.gather.pressure(variable = variable, months.minmax =
months.minmax, :
There is a problem connecting to the NCEP database with the
information provided.
Try entering
http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis/pressure/air.2006.nc.das
into a web browser to obtain an error message.
In addition: Warning messages:
1: In
download.file(paste("http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis",
: cannot open URL
'http://www.cfauth.com/?cfru=aHR0cDovL3d3dy5lc3JsLm5vYWEuZ292L3BzZC90aHJlZGRzL2RvZHNDL0RhdGFzZXRzL25jZXAucmVhbmFseXNpcy9wcmVzc3VyZS9haXIuMjAwNi5uYy5kYXM=':
HTTP status was '401 Unauthorized'
Please suggest the correct syntax to download NCEP data.
Thanks,
Sam
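The warning gives a useful hint: the request to esrl.noaa.gov was answered by www.cfauth.com with '401 Unauthorized', which typically indicates that an authenticating proxy or web filter on the local network intercepted the request before it reached the NOAA server, rather than a problem with the NCEP.gather() syntax. Following the suggestion in the error message, a small diagnostic sketch (not part of the RNCEP API) to see what the network actually returns for the .das URL:
# Fetch the metadata URL that NCEP.gather() requests first; DAS metadata in
# the response means the connection is fine, while an HTML login page points
# to a proxy/web filter intercepting the request.
das_url <- "http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis/pressure/air.2006.nc.das"
head(readLines(das_url, warn = FALSE))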

“Error in open.connection(x, "rb") : Timeout was reached”

When I do some web scraping (using a for loop to scrape multiple pages), sometimes, after scraping the 35th out of 40 pages, I get the following error:
“Error in open.connection(x, "rb") : Timeout was reached”
And sometimes I receive in addition this message:
“In addition: Warning message: closing unused connection 3”
Below is a list of things I would like to clarify:
1) I have read that I might need to explicitly define the user agent. I have tried that with:
read_html(curl('www.link.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
but it did not change anything.
2) I noticed that when I turn on a VPN and change location, my scraping sometimes works without any error. I would like to understand why.
3) I have also read it might depend on the proxy. I would like to understand how and why.
4) In addition to the error I have, I would like to understand this warning, as it might be a clue to understanding the error:
Warning message: closing unused connection 3
Does that mean that when I am doing web scraping I should call a function at the end to close the connection?
I have already read the following posts on stackoverflow but there is no clear resolution:
Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"
rvest Error in open.connection(x, "rb") : Timeout was reached
Error in open.connection(x, "rb") : Couldn't connect to server
Did you try this?
https://stackoverflow.com/a/38463559
library(rvest)
url = "http://google.com"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
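For intermittent timeouts in a long loop, a common complement to the workaround above is to give each request an explicit time limit and retry a few times before moving on. A hedged sketch using httr and rvest (the 3-attempt count, the 30-second limit, and the urls vector of pages are illustrative, not from the thread):
library(httr)
library(rvest)

read_page <- function(url, attempts = 3) {
  for (i in seq_len(attempts)) {
    # Hard per-request time limit plus an explicit user agent
    resp <- tryCatch(GET(url, user_agent("Mozilla/5.0"), timeout(30)),
                     error = function(e) NULL)
    if (!is.null(resp) && status_code(resp) == 200) {
      return(read_html(content(resp, as = "text", encoding = "UTF-8")))
    }
    Sys.sleep(2)  # brief pause before retrying
  }
  NULL  # give up on this page after all attempts fail
}

pages <- lapply(urls, read_page)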

download.file "operation timed out" error with large files

I'm using R 3.1.2 with RStudio 0.98 on Windows 7 32-bit.
I want to download some weather forecasts files of the GFS model, to be found on an open ftp server, e.g.:
ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2
The internet connection goes through a proxy (.Renviron is properly configured), and I'm basically using the download.file function for this purpose.
url <- file.path("ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2")
download.file(url, destfile="temp.grb2", mode="wb")
Where I get the following error message:
trying URL 'ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2'
using Synchronous WinInet calls
Error in download.file(url, destfile = "temp.grb2", mode = "wb", :
cannot open URL 'ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2'
In addition: Warning message:
In download.file(url, destfile = "temp.grb2", mode = "wb", :
InternetOpenUrl failed: 'Operation timed out'
This message appears exactly 30 seconds after running those lines, and no issues appear when downloading a smaller file, such as 'ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.inv', so I assume it's a matter of timeout configuration.
Setting options(timeout = 240) doesn't seem to work.
On another computer, using R 3.0.2 with RStudio 0.98 on Windows 8 64-bit, and without a proxy connection, it works perfectly.
Any suggestions?
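The output line 'using Synchronous WinInet calls', together with the failure at exactly 30 seconds, suggests the limit is enforced by the WinInet backend itself, which appears not to honour options(timeout) here. A hedged workaround sketch, assuming RCurl is installed and picks up the proxy settings from the environment, is to fetch the file with libcurl and write the bytes yourself:
library(RCurl)

url <- "ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2"

# Give libcurl a generous total-transfer timeout (in seconds); adjust the
# value to the expected download time of the GRIB2 file.
bin <- getBinaryURL(url, .opts = curlOptions(connecttimeout = 60, timeout = 1800))
writeBin(bin, "temp.grb2")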

Request URL failed/timeout in R

I'm trying to get a csv file from a url but it seems to be timing out after one minute. The csv file is being created at the time of the request so it needs a little more than a minute. I tried to increase the timeout but it didn't work, it still fails after a minute.
I'm using url and read.csv as follows:
# Start the timer
ptm <- proc.time()
urlCSV <- getURL("http://someurl.com/getcsv", timeout = 200)
txtCSV <- textConnection(urlCSV)
csvFile <- read.csv(txtCSV)
close(txtCSV)
# Stop the timer
proc.time() - ptm
resulting log:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot open: HTTP status was '500 Internal Server Error'
user system elapsed
0.225 0.353 60.445
It keeps failing when it reaches one minute; what could be the problem? Or how do I increase the timeout?
I tried the URL in a browser and it works fine, but it takes more than a minute to load the csv.
libcurl has a CONNECTTIMEOUT setting http://curl.haxx.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html.
You can set this in RCurl:
library(RCurl)
> getCurlOptionsConstants()[["connecttimeout"]]
[1] 78
myOpts <- curlOptions(connecttimeout = 200)
urlCSV <- getURL("http://someurl.com/getcsv", .opts = myOpts)
You're getting a 500 error from the server, which suggests the timeout is happening there, and is therefore outside your control (unless you can ask for less data).
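One clarification on the option names: CONNECTTIMEOUT only limits how long libcurl waits to establish the connection, while the total transfer time is governed by the separate TIMEOUT option, which is what the timeout = 200 argument in the question maps to. A sketch combining both, using the question's placeholder URL:
library(RCurl)

# connecttimeout: max seconds to establish the connection
# timeout:        max seconds for the entire transfer
myOpts <- curlOptions(connecttimeout = 30, timeout = 200)
urlCSV <- getURL("http://someurl.com/getcsv", .opts = myOpts)
csvFile <- read.csv(textConnection(urlCSV))
Either way, if the server itself aborts the request after 60 seconds, as the 500 response suggests, no client-side timeout will help.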
