Request URL failed/timeout in R

I'm trying to get a CSV file from a URL, but the request seems to time out after one minute. The CSV file is generated at the time of the request, so it needs a little more than a minute. I tried to increase the timeout, but it didn't work; it still fails after a minute.
I'm using getURL and read.csv as follows:
library(RCurl)

# Start the timer
ptm <- proc.time()
urlCSV <- getURL("http://someurl.com/getcsv", timeout = 200)
txtCSV <- textConnection(urlCSV)
csvFile <- read.csv(txtCSV)
close(txtCSV)
# Stop the timer
proc.time() - ptm
resulting log:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot open: HTTP status was '500 Internal Server Error'
user system elapsed
0.225 0.353 60.445
It keeps failing when it reaches one minute. What could be the problem, and how do I increase the timeout?
I tried the URL in a browser and it works fine, but it takes more than a minute to load the CSV.

libcurl has a CONNECTTIMEOUT setting: http://curl.haxx.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html
You can set this in RCurl:
library(RCurl)
> getCurlOptionsConstants()[["connecttimeout"]]
[1] 78
myOpts <- curlOptions(connecttimeout = 200)
urlCSV <- getURL("http://someurl.com/getcsv", .opts = myOpts)

You're getting a 500 error from the server, which suggests the timeout is happening there, and is therefore outside your control (unless you can ask for less data).
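If the server does eventually finish building the CSV, the connect timeout from this answer can be combined with the total-transfer timeout the question already passes to getURL, in a single curlOptions() call. A minimal sketch, assuming both limits need to be generous (the URL is the question's placeholder and the values are examples):
library(RCurl)
# "connecttimeout" caps the connection phase; "timeout" caps the whole transfer.
myOpts <- curlOptions(connecttimeout = 200, timeout = 300)
urlCSV <- getURL("http://someurl.com/getcsv", .opts = myOpts)
csvFile <- read.csv(text = urlCSV)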

Related

Reading CSV from URL previously worked, now returning error

I have a program that I intended to rerun regularly with minimal code manipulation. The code below previously ran successfully, but it stopped working. I thought perhaps the server was down, but when I pasted the URL into my browser, it initiated a download of the CSV file. So I think I'm missing something...
nyc_temp_data <- read.csv("https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries
&dataTypes=TMAX&stations=USW00094728&startDate=2014-01-01&endDate=2020-05-01&includeAttributes=true&units=standard&format=csv")
When I ran it today, I got the following error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open URL 'https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries
&dataTypes=TMAX&stations=USW00094728&startDate=2014-01-01&endDate=2020-05-01&includeAttributes=true&units=standard&format=csv': HTTP status was '400 '
Sub-optimal solution but gets the job done:
# Store the url: csv_url => character vector
csv_url <- "https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries
&dataTypes=TMAX&stations=USW00094728&startDate=2014-01-01&endDate=2020-05-01&includeAttributes=true&units=standard&format=csv"
# Store the full filepath to the desired output location / filename:
# output_fpath => character vector
output_fpath <- paste0(getwd(), "/nyc_temp_data.csv")
# Download the url and store it at the filepath: nyc_temp_data.csv => stdout
download.file(csv_url, output_fpath)
# Read in the csv from the file path. nyc_temp_data => data.frame
nyc_temp_data <- read.csv(output_fpath)
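One thing worth double-checking when a request like this starts returning HTTP 400 is how the long URL is wrapped in the script: an R string literal that continues onto a second line contains an embedded newline, which some servers reject. A hedged sketch that builds the same query from shorter pieces with paste0() so that no newline is introduced:
# Assemble the long URL from parts; each piece stays on its own source line.
csv_url <- paste0(
  "https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries",
  "&dataTypes=TMAX&stations=USW00094728",
  "&startDate=2014-01-01&endDate=2020-05-01",
  "&includeAttributes=true&units=standard&format=csv"
)
nyc_temp_data <- read.csv(csv_url)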

“Error in open.connection(x, "rb") : Timeout was reached”

When I do some web scraping (using a for loop to scrape multiple pages), sometimes, after scraping the 35th out of 40 pages, I get the following error:
"Error in open.connection(x, "rb") : Timeout was reached"
And sometimes I additionally receive this message:
"In addition: Warning message: closing unused connection 3"
Below is a list of things I would like to clarify:
1) I have read that I might need to explicitly define the user agent. I have tried that with:
read_html(curl('www.link.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
but it did not change anything.
2) I noticed that when I turn on a VPN and change location, my scraping sometimes works without any error. I would like to understand why.
3) I have also read that it might depend on the proxy. I would like to understand how and why.
4) In addition to the error, I would like to understand the following warning, as it might be a clue that leads to understanding the error:
Warning message: closing unused connection 3
Does that mean that when I am web scraping I should somehow call a function at the end to close a connection?
I have already read the following posts on Stack Overflow, but there is no clear resolution:
Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"
rvest Error in open.connection(x, "rb") : Timeout was reached
Error in open.connection(x, "rb") : Couldn't connect to server
Did you try this?
https://stackoverflow.com/a/38463559
library(rvest)
url = "http://google.com"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
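For a loop over many pages, a hedged sketch that extends this download-first approach with tryCatch, so one page that times out doesn't abort the whole run (page_urls is a hypothetical vector of the 40 page addresses):
library(rvest)

page_urls <- paste0("http://www.link.com/page-", 1:40)  # hypothetical page list

pages <- lapply(page_urls, function(u) {
  tryCatch({
    tmp <- tempfile(fileext = ".html")
    download.file(u, destfile = tmp, quiet = TRUE)
    page <- read_html(tmp)   # parsed fully into memory, so the temp file can go
    unlink(tmp)
    page
  }, error = function(e) {
    message("Skipping ", u, ": ", conditionMessage(e))
    NULL                     # keep a placeholder so the results stay aligned
  })
})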

rvest Error in open.connection(x, "rb") : Timeout was reached

I'm trying to scrape the content from http://google.com, but the following error message comes up.
library(rvest)
html("http://google.com")
Error in open.connection(x, "rb") :
  Timeout was reached
In addition: Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
Since I'm using a company network, this may be caused by a firewall or proxy. I tried to use set_config, but it's not working.
I encountered the same "Error in open.connection(x, "rb") : Timeout was reached" issue when working behind a proxy on the office network.
Here's what worked for me:
library(rvest)
url = "http://google.com"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
Credit : https://stackoverflow.com/a/38463559
This is probably an issue with your call to read_html (or html in your case) not properly identifying itself to the server it's trying to retrieve content from, which is the default behaviour. Using curl, add a user agent to the handle argument of read_html so that your scraper identifies itself.
library(rvest)
library(curl)
read_html(curl('http://google.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
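Since the question also mentions trying set_config behind a corporate proxy, an alternative hedged sketch uses httr to set a user agent, a longer timeout and (if needed) a proxy on a single request, then hands the result to rvest; the proxy host, port and 60-second limit below are placeholders:
library(httr)
library(rvest)

resp <- GET("http://google.com",
            user_agent("Mozilla/5.0"),
            timeout(60),                          # seconds; placeholder value
            use_proxy("proxy.example.com", 8080)) # placeholder proxy settings
page <- read_html(content(resp, as = "text", encoding = "UTF-8"))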
I ran into this issue because my VPN was switched on. Immediately after turning it off, I re-tried, and it resolved the issue.
I was facing a similar problem, and a small hack solved it.
There were two characters in the hyperlink that were creating the problem for me.
So I replaced "è" with "e" and "é" with "e", and it worked.
Just make sure the hyperlink still remains valid.
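A hedged alternative to replacing accented characters by hand is to percent-encode the link before requesting it; URLencode() is in base R's utils package, and the URL below is hypothetical:
library(rvest)

raw_url <- "http://www.link.com/catégorie-spéciale"  # hypothetical accented link
safe_url <- utils::URLencode(raw_url)                # percent-encodes the accents
page <- read_html(safe_url)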
I got the error message when my laptop was connected to my router over Wi-Fi, but my ISP was having some sort of outage:
read_html(brand_url)
Error in open.connection(x, "rb") :
Timeout was reached: [somewebsite.com.au] Operation timed out after 10024 milliseconds with 0 out of 0 bytes received
In the above case, my Wi-Fi was still connected to the modem, but pages wouldn't load via rvest (nor in a browser). It was temporary and lasted about two minutes.
It may also be worth noting that a different error message is received when Wi-Fi is turned off entirely:
brand_page <- read_html(brand_url)
Error in open.connection(x, "rb") :
Could not resolve host: somewebsite.com.au
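Since the two failure modes surface as different error messages, a hedged sketch can tell them apart inside a script, so a temporary outage can be retried later while a DNS failure is reported straight away (brand_url is the variable from the example above):
library(rvest)

page <- tryCatch(
  read_html(brand_url),
  error = function(e) {
    msg <- conditionMessage(e)
    if (grepl("Timeout was reached", msg)) {
      message("Timed out; the connection may recover, try again later.")
    } else if (grepl("Could not resolve host", msg)) {
      message("DNS lookup failed; check the network connection.")
    } else {
      message("Other error: ", msg)
    }
    NULL
  }
)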

download.file "operation timed out" error with large files

I'm using R 3.1.2 with RStudio 0.98 on 32-bit Windows 7.
I want to download some weather forecast files from the GFS model, found on an open FTP server, e.g.:
ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2
The internet connection goes through a proxy (.Renviron is properly configured), and I'm basically using the download.file function for this purpose.
url <- file.path("ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2")
download.file(url, destfile="temp.grb2", mode="wb")
Where I get the following error message:
trying URL 'ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2'
using Synchronous WinInet calls
Error in download.file(url, destfile = "temp.grb2", mode = "wb", :
cannot open URL 'ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2'
In addition: Warning message:
In download.file(url, destfile = "temp.grb2", mode = "wb", :
InternetOpenUrl failed: 'Operation timed out'
This message appears exactly 30 seconds after running those lines, and no issues appear when downloading a smaller file, such as 'ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.inv', so I assume it's a matter of timeout configuration.
Setting options(timeout=240) doesn't seem to work.
On another computer, using R 3.0.2 with RStudio 0.98 on 64-bit Windows 8, and without a proxy connection, it works perfectly.
Any suggestions?
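Since the log shows synchronous WinInet calls, and the question reports that options(timeout=240) is not being honoured, one hedged workaround is to fetch the file with RCurl (used earlier on this page) and pass the timeout straight to libcurl; the 600-second value is an assumption sized to a large GRIB file:
library(RCurl)

url <- "ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2"
# libcurl also reads standard proxy environment variables (e.g. ftp_proxy),
# which .Renviron can provide.
bin <- getBinaryURL(url, .opts = curlOptions(timeout = 600, connecttimeout = 60))
writeBin(bin, "temp.grb2")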

Why does my download.file fail to complete?

I was trying to retrieve dozens of files from a website (addresses listed in urls) with the following code:
library(gdata)  # provides read.xls
L <- lapply(urls, read.xls, sheet=1, header=T, skip=1, perl="C:/perl/bin/perl.exe", row.names=NULL)
But after a few successful downloads I kept receiving this error:
Trying URL 'http://www.xyz.com'
Error in download.file(xls, tf, mode = "wb") :
cannot open URL 'http://www.xyz.com'
In addition: Warning message:
In download.file(xls, tf, mode = "wb") :
cannot open: HTTP status was '0 (nil)'
Error in file.exists(tfn) : invalid 'file' argument
Why am I getting this error?
The error is caused by the timeout option, which defaults to 60 seconds.
You can retrieve it by calling:
getOption("timeout")
To change it you simply run options(timeout = X), where X is your desired timeout in seconds.
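A short sketch of the check-then-raise pattern described above, re-running the lapply from the question afterwards; 300 seconds is an arbitrary example value and urls is the vector of addresses from the question:
library(gdata)         # read.xls, as used in the question
getOption("timeout")   # check the current value (default is 60 seconds)
options(timeout = 300) # raise it before re-running the downloads
L <- lapply(urls, read.xls, sheet = 1, header = TRUE, skip = 1,
            perl = "C:/perl/bin/perl.exe", row.names = NULL)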
