I'm trying to get a CSV file from a URL, but the request seems to time out after one minute. The CSV file is generated at request time, so it needs a little more than a minute. I tried to increase the timeout, but it didn't work; the request still fails after a minute.
I'm using getURL and read.csv as follows:
library(RCurl)

# Start the timer
ptm <- proc.time()
urlCSV <- getURL("http://someurl.com/getcsv", timeout = 200)
txtCSV <- textConnection(urlCSV)
csvFile <- read.csv(txtCSV)
close(txtCSV)
# Stop the timer
proc.time() - ptm
Resulting log:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot open: HTTP status was '500 Internal Server Error'
user system elapsed
0.225 0.353 60.445
It keeps failing when it reaches one minute. What could be the problem, and how do I increase the timeout?
I tried the URL in a browser and it works fine, but it takes more than a minute to load the CSV.
libcurl has a CONNECTTIMEOUT setting: http://curl.haxx.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html
You can set this in RCurl:
library(RCurl)
> getCurlOptionsConstants()[["connecttimeout"]]
[1] 78
myOpts <- curlOptions(connecttimeout = 200)
urlCSV <- getURL("http://someurl.com/getcsv", .opts = myOpts)
You're getting a 500 error from the server, which suggests the timeout is happening there, and is therefore outside your control (unless you can ask for less data).
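Note that libcurl's CONNECTTIMEOUT caps only the connection phase; the total transfer time is capped separately by CURLOPT_TIMEOUT, which RCurl exposes as the timeout option. A minimal sketch setting both, assuming the question's placeholder URL (the actual request is left commented since the endpoint isn't real):

```r
library(RCurl)

# CONNECTTIMEOUT limits only how long libcurl waits to establish the
# connection; CURLOPT_TIMEOUT ("timeout" in RCurl) limits the whole
# transfer. Setting both gives a slow server room to respond.
myOpts <- curlOptions(connecttimeout = 200,  # seconds to connect
                      timeout        = 200)  # seconds for the full transfer

# With a real endpoint you would then run:
# urlCSV  <- getURL("http://someurl.com/getcsv", .opts = myOpts)
# csvFile <- read.csv(text = urlCSV)
```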
I have a program that I intend to rerun regularly with minimal code manipulation. The code below previously ran successfully, but it stopped working. I thought perhaps the server was down, but when I pasted the URL into my browser, it initiated a download of the CSV file. So I think I'm missing something...
nyc_temp_data <- read.csv("https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=TMAX&stations=USW00094728&startDate=2014-01-01&endDate=2020-05-01&includeAttributes=true&units=standard&format=csv")
When I ran it today, I got the following error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open URL 'https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=TMAX&stations=USW00094728&startDate=2014-01-01&endDate=2020-05-01&includeAttributes=true&units=standard&format=csv': HTTP status was '400 '
A sub-optimal solution, but it gets the job done:
# Store the url: csv_url => character vector
csv_url <- "https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=TMAX&stations=USW00094728&startDate=2014-01-01&endDate=2020-05-01&includeAttributes=true&units=standard&format=csv"
# Store the full file path to the desired output location / filename:
# output_fpath => character vector
output_fpath <- paste0(getwd(), "/nyc_temp_data.csv")
# Download the url and save it at the file path: nyc_temp_data.csv => file on disk
download.file(csv_url, output_fpath)
# Read in the csv from the file path: nyc_temp_data => data.frame
nyc_temp_data <- read.csv(output_fpath)
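As a side note, query URLs this long are easier to keep free of stray whitespace or line breaks (which servers typically reject with a 400) if they are assembled from their parameters. A small sketch using the same NOAA endpoint:

```r
# Assemble the query string from named parameters so the final URL is
# built on one logical line and cannot pick up embedded whitespace.
params <- c(dataset           = "daily-summaries",
            dataTypes         = "TMAX",
            stations          = "USW00094728",
            startDate         = "2014-01-01",
            endDate           = "2020-05-01",
            includeAttributes = "true",
            units             = "standard",
            format            = "csv")
csv_url <- paste0("https://www.ncei.noaa.gov/access/services/data/v1?",
                  paste(names(params), params, sep = "=", collapse = "&"))
stopifnot(!grepl("[[:space:]]", csv_url))  # no embedded whitespace or newlines
```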
When I do some web scraping (using a for loop to scrape multiple pages), sometimes, after scraping the 35th out of 40 pages, I get the following error:
“Error in open.connection(x, "rb") : Timeout was reached”
And sometimes I additionally receive this message:
“In addition: Warning message: closing unused connection 3”
Below is a list of things I would like to clarify:
1) I have read that I might need to explicitly define the user agent. I have tried that with:
read_html(curl('www.link.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
but it did not change anything.
2) I noticed that when I turn on a VPN and change location, my scraping sometimes works without any error. I would like to understand why.
3) I have also read that it might depend on the proxy. I would like to understand how and why.
4) In addition to the error, I would like to understand this warning, as it might be a clue to the cause of the error:
Warning message: closing unused connection 3
Does that mean that when I am web scraping I should somehow call a function at the end to close a connection?
I have already read the following posts on Stack Overflow, but there is no clear resolution:
Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"
rvest Error in open.connection(x, "rb") : Timeout was reached
Error in open.connection(x, "rb") : Couldn't connect to server
Did you try this?
https://stackoverflow.com/a/38463559
library(rvest)
url = "http://google.com"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
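When only a page or two out of a long loop times out, a small retry wrapper often rides out transient failures. A minimal sketch in base R; with_retries and fetch are illustrative names, not rvest functions:

```r
# Retry a request function up to `tries` times, pausing `wait` seconds
# between attempts. `fetch` is any zero-argument function that performs
# the request, e.g. function() read_html(page_url).
with_retries <- function(fetch, tries = 3, wait = 5) {
  for (i in seq_len(tries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) return(result)  # success: return the value
    message("Attempt ", i, " failed: ", conditionMessage(result))
    Sys.sleep(wait)  # give a flaky server or connection time to recover
  }
  stop("All ", tries, " attempts failed")
}
```

Inside the loop this would be used as, for example, page <- with_retries(function() read_html(page_url)), where page_url is whatever URL the iteration is scraping.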
I'm trying to scrape the content from http://google.com, but the following error message comes out:
library(rvest)
html("http://google.com")
Error in open.connection(x, "rb") :
Timeout was reached In addition:
Warning message: 'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
Since I'm using a company network, this may be caused by a firewall or proxy. I tried to use set_config, but it's not working.
I encountered the same Error in open.connection(x, "rb") : Timeout was reached issue when working behind a proxy in the office network.
Here's what worked for me,
library(rvest)
url = "http://google.com"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
Credit : https://stackoverflow.com/a/38463559
This is probably an issue with your call to read_html (or html in your case) not properly identifying itself to the server it's trying to retrieve content from, which is the default behaviour. Using curl, add a user agent to the handle argument of read_html so your scraper identifies itself.
library(rvest)
library(curl)
read_html(curl('http://google.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
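If the user agent alone doesn't help, the same handle can also carry libcurl's timeout options; connecttimeout and timeout are standard libcurl option names accepted by curl::new_handle, and the values below are illustrative (the fetch itself is left commented since it needs network access):

```r
library(curl)

# One handle carrying both an identifying user agent and explicit
# timeouts for the connection phase and the whole transfer.
h <- curl::new_handle(useragent      = "Mozilla/5.0",
                      connecttimeout = 30,   # seconds to establish the connection
                      timeout        = 120)  # seconds for the whole transfer

# With rvest loaded and a live connection, the request would then be:
# page <- rvest::read_html(curl("http://google.com", handle = h))
```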
I ran into this issue because my VPN was switched on. Immediately after turning it off, I retried, and that resolved the issue.
I was facing a similar problem, and a small hack solved it.
There were two characters in the hyperlink that were causing the problem for me.
Hence I replaced "è" with "e" and "é" with "e", and it worked.
Just ensure that the hyperlink remains valid afterwards.
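Rather than hand-replacing accented characters (which changes the URL and may break it), base R's URLencode can percent-encode them. A small sketch; the URL below is a made-up example, not the asker's link:

```r
# Percent-encode non-ASCII characters in a URL instead of swapping them
# out by hand ("café" is an illustrative path segment). enc2utf8 pins the
# string to UTF-8 so the encoding is locale-independent.
raw_url  <- enc2utf8("http://example.com/café")
safe_url <- URLencode(raw_url)
safe_url  # the "é" is now percent-encoded as %C3%A9
```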
I got the error message when my laptop was connected to my router over wifi, but my ISP was having some sort of outage:
read_html(brand_url)
Error in open.connection(x, "rb") :
Timeout was reached: [somewebsite.com.au] Operation timed out after 10024 milliseconds with 0 out of 0 bytes received
In the above case, my wifi was still connected to the modem, but pages wouldn't load via rvest (or in a browser). The outage was temporary and lasted about two minutes.
It may also be worth noting that a different error message appears when wifi is turned off entirely:
brand_page <- read_html(brand_url)
Error in open.connection(x, "rb") :
Could not resolve host: somewebsite.com.au
I'm using R 3.1.2 with RStudio 0.98 on 32-bit Windows 7.
I want to download some weather forecasts files of the GFS model, to be found on an open ftp server, e.g.:
ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2
The internet connection is made through a proxy (.Renviron is properly configured), and I'm basically using the download.file function for this purpose.
url <- file.path("ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2")
download.file(url, destfile="temp.grb2", mode="wb")
Where I get the following error message:
trying URL 'ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2'
using Synchronous WinInet calls
Error in download.file(url, destfile = "temp.grb2", mode = "wb", :
cannot open URL 'ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.grb2'
In addition: Warning message:
In download.file(url, destfile = "temp.grb2", mode = "wb", :
InternetOpenUrl failed: 'Operation timed out'
This message appears exactly 30 seconds after running those lines, and no issues appear when downloading a smaller file, such as 'ftp://nomads.ncdc.noaa.gov/GFS/Grid4/201412/20141221/gfs_4_20141221_0000_000.inv', so I assume it's a matter of timeout configuration.
Setting options(timeout = 240) doesn't seem to work.
On another computer, using R 3.0.2 with RStudio 0.98 on 64-bit Windows 8, and without a proxy connection, it works perfectly.
Any suggestions?
I was trying to retrieve dozens of files from a website (addresses stored in the vector urls) with the following code:
L <- lapply(urls, read.xls, sheet=1,header=T,skip=1,perl="C:/perl/bin/perl.exe",row.names=NULL)
But after a few successful downloads I kept receiving this error:
Trying URL 'http://www.xyz.com'
Error in download.file(xls, tf, mode = "wb") :
cannot open URL 'http://www.xyz.com'
In addition: Warning message:
In download.file(xls, tf, mode = "wb") :
cannot open: HTTP status was '0 (nil)'
Error in file.exists(tfn) : invalid 'file' argument
Why am I getting this error?
The error is caused by the timeout option, which defaults to 60 seconds.
You can retrieve the current value by calling:
getOption("timeout")
To change it, simply run options(timeout = X), where X is your desired timeout in seconds.
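For example, a minimal base-R sketch raising the limit before a long download:

```r
# The download timeout is a global option, consulted each time
# download.file or a url() connection is opened.
getOption("timeout")    # defaults to 60 (seconds) in a fresh session
options(timeout = 300)  # allow up to five minutes for slow downloads
getOption("timeout")    # now 300
```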