When I do some web scraping (using a for loop to scrape multiple pages), I sometimes get the following error after scraping the 35th of 40 pages:
“Error in open.connection(x, "rb") : Timeout was reached”
Sometimes I also receive this message:
“In addition: Warning message: closing unused connection 3”
Below is a list of things I would like to clarify:
1) I have read that I might need to define the user agent explicitly. I tried that with:
read_html(curl('www.link.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
but it did not change anything.
2) I noticed that when I turn on a VPN and change location, my scraping sometimes works without any error. I would like to understand why.
3) I have also read that it might depend on the proxy. I would like to understand how and why.
4) In addition to the error, I would like to understand this warning, as it might be a clue to the cause of the error:
Warning message: closing unused connection 3
Does that mean that when I am doing web scraping I should call a function at the end to close the connection?
I have already read the following posts on Stack Overflow, but none of them has a clear resolution:
Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"
rvest Error in open.connection(x, "rb") : Timeout was reached
Error in open.connection(x, "rb") : Couldn't connect to server
Did you try this?
https://stackoverflow.com/a/38463559
library(rvest)
url = "http://google.com"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
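For the multi-page loop described in the question, a transient timeout on one page does not have to abort the whole run. The following is a minimal sketch of a retry wrapper, not a confirmed fix for this error; with_retry and its parameters are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: call f() up to max_attempts times,
# pausing wait_seconds between failed attempts.
with_retry <- function(f, max_attempts = 3, wait_seconds = 1) {
  for (attempt in seq_len(max_attempts)) {
    # Capture an error as a value instead of letting it stop the loop
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_attempts) Sys.sleep(wait_seconds)
  }
  # Every attempt failed: re-signal the last error
  stop(result)
}

# Usage sketch (assumes rvest is installed and urls is a character vector):
# pages <- lapply(urls, function(u) {
#   with_retry(function() rvest::read_html(u), max_attempts = 3, wait_seconds = 5)
# })
```

Pausing between attempts also spaces out the requests, which may help if the server is throttling rapid successive calls.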
My script has started throwing errors when accessing SharePoint. It used to work.
sp_con = sp_connection("https://asdf.sharepoint.com/sites/staff",
credentialFile = "H:/SharePoint API/creds.yml", Office365 = T)
The error was
Error in sp_connection("https://asdf.sharepoint.com/sites/staff", :
Receiving access cookies failed.
In addition: Warning message:
In readLines(file) :
incomplete final line found on 'Y:/Operations/SharePoint API/creds.yml'
I googled the warning message and found solutions that fix it, but I still get the access cookies error. Thanks in advance for any ideas!
I have a problem obtaining data from a specific website. It occurs when I try to download the raw page with R 3.6.3 using the following example code:
website_raw <- readLines("https://tge.pl/gaz-rdn?dateShow=09-02-2022")
The result I got is:
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
InternetOpenUrl failed: 'the connection with the server was reset'
readLines() used to work fine on this website, but for the past week it has failed. I also tried download.file(): at first the result was the same (error, connection reset), but after setting options(download.file.method = "libcurl") the file starts to download and then suddenly stops with this message:
trying URL 'https://tge.pl/gaz-rdn?dateShow=09-02-2022'
Error in download.file("https://tge.pl/gaz-rdn?dateShow=09-02-2022", "test.html") :
cannot open URL 'https://tge.pl/gaz-rdn?dateShow=09-02-2022'
In addition: Warning message:
In download.file("https://tge.pl/gaz-rdn?dateShow=09-02-2022", "test.html") :
URL 'https://tge.pl/gaz-rdn?dateShow=09-02-2022': status was 'Failure when receiving data from the peer'
I also tried disabling "Use Internet Explorer library/proxy for HTTP" in the RStudio Global Options, but it didn't help. Another approach I tested was read_html() from the rvest package, which gave the following error:
Error in open.connection(x, "rb") : Send failure: Connection was reset
Downloading data from other websites works fine, though, with all of the methods I tried.
Is there any way I can download data from this website with R?
Any kind of help or suggestion will be highly appreciated.
In order to find some features that I need, I want to establish a connection to a website using open(mycon, "r"). To do this, I used the code below which is provided by #Dunois:
myx <- httr::HEAD(example)$url
mycon <- url(myx)
open(mycon, "r")
where example is a link to a website. This code works for most websites; however, in some cases, such as "https://www.pixilink.com/140079#mode=tour" or "https://www.pixilink.com/141152#mode=0", it doesn't work. These pages exist (I checked them in my browser), so I am not sure why the connection cannot be established. The error message I get is:
Error in open.connection(mycon, "r") : cannot open the connection
In addition: Warning message:
In open.connection(mycon, "r") :
cannot open URL 'https://www.pixilink.com/140079#mode=tour': HTTP status was '400 Bad Request'
I would appreciate it if you could shed some light on this and clarify why I get this error message.
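One possible explanation, which is an assumption and not something confirmed in this thread, is that the fragment (the part of the URL starting with #) is being sent to the server as part of the request. Browsers keep the fragment on the client side and never send it, which would explain why the same links open there. The sketch below strips the fragment before opening the connection; strip_fragment is a hypothetical helper introduced for illustration.

```r
# Hypothetical helper: drop everything from the first "#" onwards.
strip_fragment <- function(u) {
  sub("#.*$", "", u)
}

strip_fragment("https://www.pixilink.com/140079#mode=tour")

# Usage sketch with the code from the question (requires network access):
# mycon <- url(strip_fragment(myx))
# open(mycon, "r")
# ... read from mycon ...
# close(mycon)  # close explicitly to avoid "closing unused connection" warnings
```

If the server still returns 400 without the fragment, the cause lies elsewhere and this workaround does not apply.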
I'm trying to scrape the content from http://google.com, but the following error message comes out:
library(rvest)
html("http://google.com")
Error in open.connection(x, "rb") : Timeout was reached
In addition: Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
Since I'm on a company network, this may be caused by a firewall or proxy. I tried to use set_config, but it did not work.
I encountered the same Error in open.connection(x, "rb") : Timeout was reached issue when working behind a proxy on the office network.
Here's what worked for me:
library(rvest)
url = "http://google.com"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
Credit : https://stackoverflow.com/a/38463559
This is probably because your call to read_html (or html in your case) does not identify itself to the server it is trying to retrieve content from, which is the default behaviour. Using curl, create a handle with a user agent set, pass it to curl(), and give the resulting connection to read_html so that your scraper identifies itself.
library(rvest)
library(curl)
read_html(curl('http://google.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
I ran into this issue because my VPN was switched on. Immediately after turning it off, I re-tried, and it resolved the issue.
I was facing a similar problem and a small hack solved it.
There were two characters in the hyperlink that were causing the problem for me.
I replaced "è" with "e" and "é" with "e", and it worked.
Just make sure that the hyperlink is still valid after the replacement.
I got the error message when my laptop was connected to my router over wifi, but my ISP was having some sort of outage:
read_html(brand_url)
Error in open.connection(x, "rb") :
Timeout was reached: [somewebsite.com.au] Operation timed out after 10024 milliseconds with 0 out of 0 bytes received
In the above case, my wifi was still connected to the modem, but pages wouldn't load via rvest (nor in a browser). It was temporary and lasted ~2 minutes.
It may also be worth noting that a different error message is received when wifi is turned off entirely:
brand_page <- read_html(brand_url)
Error in open.connection(x, "rb") :
Could not resolve host: somewebsite.com.au
I was trying to retrieve dozens of files from a website (addresses listed in urls) with the following code:
L <- lapply(urls, read.xls, sheet=1,header=T,skip=1,perl="C:/perl/bin/perl.exe",row.names=NULL)
But after a few successful downloads I kept receiving this error:
Trying URL 'http://www.xyz.com'
Error in download.file(xls, tf, mode = "wb") :
cannot open URL 'http://www.xyz.com'
In addition: Warning message:
In download.file(xls, tf, mode = "wb") :
cannot open: HTTP status was '0 (nil)'
Error in file.exists(tfn) : invalid 'file' argument
Why am I getting this error?
The error is caused by the timeout option, which defaults to 60 seconds.
You can retrieve it by calling:
getOption("timeout")
To change it you simply run options(timeout = X), where X is your desired timeout in seconds.
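Because options(timeout = X) changes the setting for the rest of the session, it can be convenient to raise it only around the slow call and restore it afterwards. The following is a minimal sketch of that pattern; with_timeout is a hypothetical helper introduced for illustration, and the value of 300 seconds is an arbitrary example.

```r
# Hypothetical helper: evaluate expr with a temporary timeout,
# then restore the previous value even if expr fails.
with_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)   # options() returns the previous values
  on.exit(options(old), add = TRUE)   # restore on exit, including on error
  force(expr)                         # expr is evaluated lazily, i.e. only here
}

# The option is 300 inside the call and back to its previous value afterwards
with_timeout(300, getOption("timeout"))
getOption("timeout")

# Usage sketch for the download in the question (requires network access):
# with_timeout(300, download.file("http://www.xyz.com", "file.xls", mode = "wb"))
```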