RCurl error: Connection reset by peer

I am scraping a website for links using the XML and RCurl packages in R, and I need to make multiple calls (several thousand).
The script I use has the following form:
library(RCurl)
library(XML)
raw <- getURL("http://www.example.com", encoding = "UTF-8", .mapUnicode = FALSE)
parsed <- htmlParse(raw)
links <- xpathSApply(parsed, "//a/@href")
...
...
return(links)
When run a single time, there is no problem.
However, when applied to a list of URLs (using sapply), I receive the following error:
Error in function (type, msg, asError = TRUE) : Recv failure:
Connection reset by peer
If I retry the same request later, it usually succeeds.
I am new to cURL and web scraping, and not sure how to fix or avoid this.
Thank you in advance.

Try something like this:
library(RCurl)
curl <- getCurlHandle()  # reuse a single curl handle for all requests
for(i in 1:length(links)){
  WebPage <- try(getURL(links[[i]], ssl.verifypeer = FALSE, curl = curl))
  while(inherits(WebPage, "try-error")){
    Sys.sleep(1)  # wait a moment, then retry the same URL
    WebPage <- try(getURL(links[[i]], ssl.verifypeer = FALSE, curl = curl))
  }
  # ... process WebPage here ...
}
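If you prefer to keep the sapply structure from the question, the same retry idea can be wrapped in a helper function first. This is only a sketch: get_links, max_tries and urls are hypothetical names, and the XPath expression comes from the question above.
library(RCurl)
library(XML)
# hypothetical helper: fetch one URL with a few retries, then extract the link hrefs
get_links <- function(url, max_tries = 5){
  for(attempt in seq_len(max_tries)){
    raw <- try(getURL(url, encoding = "UTF-8", .mapUnicode = FALSE), silent = TRUE)
    if(!inherits(raw, "try-error")) break
    Sys.sleep(1)  # brief pause before retrying after "Connection reset by peer"
  }
  if(inherits(raw, "try-error")) return(character(0))  # give up on this URL
  xpathSApply(htmlParse(raw), "//a/@href")
}
all_links <- sapply(urls, get_links)  # urls = the list of URLs from the question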

Related

jsonlite::fromJSON failed to connect to Port 443 in quantmod getFX()

I needed a function that automatically gets the exchange rate from a website (oanda.com), so I used the quantmod package and built a function based on it.
library(quantmod)
ForeignCurrency <- "MYR"  # has to be a character string
getExchangeRate <- function(ForeignCurrency){
  Conv <- paste("EUR/", ForeignCurrency, sep = "")
  getFX(Conv, from = Sys.Date() - 179, to = Sys.Date())  # downloads the series, e.g. EURMYR
  Conv2 <- paste0("EUR", ForeignCurrency)
  Table <- as.data.frame(get(Conv2))
  ExchangeRate <- 1/mean(Table[, 1])  # invert the average EUR/<currency> rate
  ExchangeRate
}
ExchangeRate <- getExchangeRate(ForeignCurrency)
ExchangeRate
On my personal PC it works perfectly and does what I want. If I run this on my work PC, I get the following error:
Warning: Unable to import “EUR/MYR”.
Failed to connect to www.oanda.com port 443: Timed out
I have already googled a lot; it seems to be a firewall problem, but none of the suggestions I found work. After checking the getFX() function, the problem seems to be in the jsonlite::fromJSON function that getFX() uses.
Has anyone faced a similar problem? I am quite familiar with R, but I have no expertise with firewalls/ports. Do I have to change something in the R settings, or is it a problem independent of R where something in the proxy settings needs to be changed?
Can you please help? :-)
The code below shows a workaround for getFX() in an enterprise context, where you often have to go through a proxy to reach the internet.
library(httr)
# url that you can find inside https://github.com/joshuaulrich/quantmod/blob/master/R/getSymbols.R
url <- "https://www.oanda.com/fx-for-business//historical-rates/api/data/update/?&source=OANDA&adjustment=0&base_currency=EUR&start_date=2022-02-17&end_date=2022-02-17&period=daily&price=mid&view=table&quote_currency_0=VND"
# original call inside the quantmod library:
#   tbl <- jsonlite::fromJSON(oanda.URL, simplifyVector = FALSE)  # jsonlite::fromJSON handles the connection
# add use_proxy() with your proxy address and proxy port to get through the proxy
response <- httr::GET(url, use_proxy("XX.XX.XX.XX", XXXX))
status <- status_code(response)
if(status == 200){
  content <- httr::content(response)
  # use jsonlite to get the single attributes: quote currency, exchange rate (= average) and base currency
  exportJson <- jsonlite::toJSON(content, auto_unbox = TRUE)
  getJsonObject <- jsonlite::fromJSON(exportJson, flatten = FALSE)
  print(getJsonObject$widget$quoteCurrency)
  print(getJsonObject$widget$average)
  print(getJsonObject$widget$baseCurrency)
}
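If you need a drop-in replacement for the original getExchangeRate(), the workaround can be wrapped into a small helper. This is a hypothetical sketch, not part of the original answer: it assumes the oanda URL format shown above stays stable and that the response exposes $widget$average as used above; the proxy host/port are placeholders you have to fill in.
getExchangeRateViaProxy <- function(ForeignCurrency, proxy_host = "XX.XX.XX.XX", proxy_port = 8080){
  day <- format(Sys.Date() - 1, "%Y-%m-%d")  # single day instead of the 179-day window in the question
  url <- paste0("https://www.oanda.com/fx-for-business//historical-rates/api/data/update/",
                "?&source=OANDA&adjustment=0&base_currency=EUR",
                "&start_date=", day, "&end_date=", day,
                "&period=daily&price=mid&view=table",
                "&quote_currency_0=", ForeignCurrency)
  response <- httr::GET(url, httr::use_proxy(proxy_host, proxy_port))
  if(httr::status_code(response) != 200) stop("oanda request failed")
  json <- jsonlite::fromJSON(jsonlite::toJSON(httr::content(response), auto_unbox = TRUE), flatten = FALSE)
  1/as.numeric(json$widget$average)  # same inversion as in the question's original function
}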

How to make a new request while there is an error? (fromJSON)

I have some code that makes requests to an API using the jsonlite package.
My request is:
aux <- fromJSON (www ... js)
The problem is that there is a rate limit on requests, and sometimes this error is returned:
*Error in open.connection (con, "rb"): HTTP error 429.*
When there is an error, I need the code to wait X seconds, make a new request, and repeat this until I get the requested data.
I found the try and tryCatch functions and the retry package, but I couldn't make them work the way I need.
Try this approach:
aux <- tryCatch(fromJSON (www ... js), error = function(e) {return(NA)})
while(all(is.na(aux))) {
Sys.sleep(30) #Change as per requirement.
aux <- tryCatch(fromJSON(www ... js), error = function(e) {return(NA)})
}
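If you would rather not loop indefinitely, a bounded-retry variant is a small extension of the same idea. This is only a sketch: request_url stands in for the elided URL in the question, and max_attempts is an assumed limit you can adjust.
library(jsonlite)
request_url <- "https://example.com/data.json"  # hypothetical stand-in for the elided URL
max_attempts <- 10  # assumed limit; adjust as needed
aux <- NA
for(attempt in seq_len(max_attempts)){
  aux <- tryCatch(fromJSON(request_url), error = function(e) NA)
  if(!all(is.na(aux))) break  # success: stop retrying
  Sys.sleep(30)  # back off before retrying (HTTP 429 means "too many requests")
}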

Error is returned while using getURL() function in R language

I have started learning data science and am new to the R language.
I am trying to read data from the HTTPS URL below using the getURL function from the RCurl package.
While executing the code below, I receive an SSL protocol error.
R Code
# load the RCurl library
library(RCurl)
# specify the URL for the iris data CSV
urlfile <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# download the file
downloaded <- getURL(urlfile, ssl.verifypeer = FALSE)
Error
Error in function (type, msg, asError = TRUE) : Unknown SSL
protocol error in connection to archive.ics.uci.edu:443
Can anyone help me with this?
First see if you can read data from the URL with:
fileURL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
myfile <- readLines(fileURL)
head(myfile)
If you can read data from the URL, then the embedded double quotes in the data may be causing your problem.
Try read.csv with the quote parameter:
iris <- read.csv(fileURL, header = FALSE, sep = ",", quote = "\"'")
names(iris) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "class")
head(iris)
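Alternatively, if you want to stay with RCurl's getURL() from the question, the text it returns can be passed straight to read.csv() via its text argument instead of re-reading the URL. A minimal sketch, assuming the getURL call succeeds (for example with ssl.verifypeer = FALSE as in the question); the column names follow the answer above.
library(RCurl)
urlfile <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
downloaded <- getURL(urlfile, ssl.verifypeer = FALSE)
# parse the downloaded text directly instead of fetching the URL again
iris_df <- read.csv(text = downloaded, header = FALSE,
                    col.names = c("sepal_length", "sepal_width",
                                  "petal_length", "petal_width", "class"))
head(iris_df)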

Avoiding "Could not resolve host" error to stop the program running in R

I use the getURL function from the RCurl package in R to read content from a list of links.
When trying to fetch a broken link from the list, I get the error "Error in function (type, msg, asError = TRUE) : Could not resolve host:" and the program stops running.
I use the try command to avoid the program stopping, but it doesn't work.
try(getURL(URL, ssl.verifypeer = FALSE, useragent = "R"))
Any hint on how I can keep the program from stopping when it tries to fetch a broken link?
You need to do some kind of error handling; I would argue tryCatch is actually better for your situation.
I'm assuming you are inside a loop over the links. You can then check the result of your try/tryCatch to see whether an error was thrown and, if so, move to the next iteration of the loop using next.
status <- tryCatch(
getURL(URL, ssl.verifypeer=FALSE, useragent="R"),
error = function(e) e
)
if(inherits(status, "error")) next
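For completeness, a minimal version of that loop might look like the sketch below; the links vector and the choice to collect the successful responses in a list are assumptions, not part of the original answer.
library(RCurl)
results <- vector("list", length(links))  # links = the list of URLs from the question
for(i in seq_along(links)){
  status <- tryCatch(
    getURL(links[[i]], ssl.verifypeer = FALSE, useragent = "R"),
    error = function(e) e
  )
  if(inherits(status, "error")) next  # skip broken links (e.g. "Could not resolve host")
  results[[i]] <- status  # otherwise keep the page content
}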

scrape webpage bypass server error

I am trying to scrape the webpage below:
parenturl = "http://www.liberty.co.uk/fcp/product/Liberty//Rosa-A-Tana-Lawn/1390"
but I get the error below:
srcpage = getURLContent(GET(parenturl)$url,timeout(10))
Error in function (type, msg, asError = TRUE) : Empty reply from server
Is it possible to bypass this and scrape the webpage?
Many thanks in advance.
Try using the httr library instead:
library(httr)
pg <- GET("http://www.liberty.co.uk/fcp/product/Liberty//Rosa-A-Tana-Lawn/1390")
print(content(pg))
# too much to paste here
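If the goal is to go on and extract elements from the page, the httr response can be handed to the XML package used elsewhere on this page. A minimal sketch, assuming the page is reachable and that you want the link targets (the XPath is just an example).
library(httr)
library(XML)
pg <- GET("http://www.liberty.co.uk/fcp/product/Liberty//Rosa-A-Tana-Lawn/1390")
html <- content(pg, as = "text")  # raw HTML as a single string
parsed <- htmlParse(html, asText = TRUE)  # parse it with the XML package
links <- xpathSApply(parsed, "//a/@href")  # e.g. pull all link hrefs
head(links)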
