I am trying to scrape the webpage below:
parenturl = "http://www.liberty.co.uk/fcp/product/Liberty//Rosa-A-Tana-Lawn/1390"
but I get the following error:
srcpage = getURLContent(GET(parenturl)$url,timeout(10))
Error in function (type, msg, asError = TRUE) : Empty reply from server
Is it possible to bypass this and scrape the webpage?
Many thanks in advance.
Try using the httr library instead:
library(httr)
pg <- GET("http://www.liberty.co.uk/fcp/product/Liberty//Rosa-A-Tana-Lawn/1390")
print(content(pg))
# too much to paste here
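If you only need particular elements rather than the whole page, here is a minimal sketch of parsing the httr response and querying it with XPath (the XML package and the //a/@href expression are purely illustrative; swap in the nodes you actually want):
library(httr)
library(XML)
pg <- GET("http://www.liberty.co.uk/fcp/product/Liberty//Rosa-A-Tana-Lawn/1390")
# parse the HTML text returned in the response body
doc <- htmlParse(content(pg, as = "text"), asText = TRUE)
# example: collect every link target on the page
hrefs <- xpathSApply(doc, "//a/@href")
head(hrefs)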
I am currently trying to build a small web scraper.
I am using the following code to scrape a website:
library(rvest)
webpage <- "https://www.whisky.de/shop/Schottland/Single-Malt/Macallan-Triple-Cask-15-Jahre.html"
content <- read_html(webpage)
However, when I run the second line with the read_html command, I get the following error message:
Error in open.connection(x, "rb") :
SSL certificate problem: certificate has expired
Does anyone know where this is coming from? When I used this code a few days ago, I did not have any trouble with it.
I am using Mac OS X 10.15.5 and RStudio (1.2.5033).
I have also installed the "rvest" library.
Many thanks for your help in advance!
I was getting the same problem for another website, but the other answer did not solve it for me. I'm posting what worked for me in case it is useful to someone else.
library(tidyverse)
library(rvest)
webpage <- "https://www.whisky.de/shop/Schottland/Single-Malt/Macallan-Triple-Cask-15-Jahre.html"
content <- webpage %>%
httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
read_html()
See here for a discussion about this solution.
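As a quick usage example, with the libraries and the content object from the snippet above you can then query the parsed page as usual (the "title" selector here is purely illustrative):
content %>%
  html_node("title") %>%
  html_text()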
Try using the GET function.
library(httr)
library(rvest)
webpage <- "https://www.whisky.de/shop/Schottland/Single-Malt/Macallan-Triple-Cask-15-Jahre.html"
content <- read_html(GET(webpage))
The GET function is part of the httr R package; make sure you use GET and not get.
I had the same problem. I fixed it by changing the SSL settings in R. Just add the following line at the beginning of your code (at least before you call read_html()):
httr::set_config(httr::config(ssl_verifypeer = FALSE, ssl_verifyhost = FALSE))
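For reference, here is a minimal sketch of that setting in context, routing the request through httr (same URL as in the question; the relaxed SSL checks apply to every httr request made afterwards in the session):
library(httr)
library(rvest)
# turn off certificate verification for all subsequent httr requests
httr::set_config(httr::config(ssl_verifypeer = FALSE, ssl_verifyhost = FALSE))
webpage <- "https://www.whisky.de/shop/Schottland/Single-Malt/Macallan-Triple-Cask-15-Jahre.html"
content <- read_html(GET(webpage))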
I have started learning data science and am new to the R language.
I am trying to read data from the HTTPS URL below using the getURL function from the RCurl package.
While executing the code below, I receive an SSL protocol error.
R Code
# load the RCurl library
library(RCurl)
# specify the URL for the iris data CSV
urlfile <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# download the file
downloaded <- getURL(urlfile, ssl.verifypeer = FALSE)
Error
Error in function (type, msg, asError = TRUE) : Unknown SSL
protocol error in connection to archive.ics.uci.edu:443
Can anyone help me with this?
First see if you can read data from the URL with:
fileURL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
myfile <- readLines(fileURL)
head(myfile)
If you can read data from the URL, then the embedded double quotes in the data may be causing your problem.
Try read.csv with the quote parameter:
iris <- read.csv(fileURL, header = FALSE, sep = ",", quote = "\"'")
names(iris) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "class")
head(iris)
I'm trying to read API data from the BLS into R. I am using Version 1.0, which does not require registration and is open for public use.
Here is my code:
url <-"http://api.bls.gov/publicAPI/v1/timeseries/data/LAUCN040010000000005"
raw.data <- readLines(url, warn = F)
library(rjson)
rd <- fromJSON(raw.data)
And here is the error message I receive:
Error in fromJSON(raw.data) : incomplete list
If I just go to the URL in my web browser, it seems to work (it pulls up a JSON page). I am not really sure what is going on when I try to get this into R.
When you've used readLines, the object returned is a vector of length 4:
length(raw.data)
You can look at the individual pieces via:
raw.data[1]
If you stick the pieces back together using paste
fromJSON(paste(raw.data, collapse = ""))
everything works. Alternatively,
jsonlite::fromJSON(url)
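To see what came back, a small sketch of inspecting the parsed object (shown with jsonlite; check the str() output rather than assuming any particular field names):
library(jsonlite)
url <- "http://api.bls.gov/publicAPI/v1/timeseries/data/LAUCN040010000000005"
rd <- jsonlite::fromJSON(url)
# look at the top two levels of the result to find the series observations
str(rd, max.level = 2)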
I use the getURL function from the RCurl package in R to read content from a list of links.
When trying to fetch a broken link of the list I get the error "Error in function (type, msg, asError = TRUE) : Could not resolve host:" and the program stops running.
I use try() to keep the program from stopping, but it doesn't work.
try(getURL(URL, ssl.verifypeer = FALSE, useragent = "R"))
Any hint on how can I avoid the program to stop running when trying to get a broken link?
You need some form of error handling here, and I would argue tryCatch is actually better for your situation.
I'm assuming you are inside a loop over the links. You can then check the result of your try/tryCatch to see whether an error was thrown and, if so, move to the next iteration of your loop with next (see the full-loop sketch after the snippet below).
status <- tryCatch(
getURL(URL, ssl.verifypeer=FALSE, useragent="R"),
error = function(e) e
)
if(inherits(status, "error")) next
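Putting it together, a minimal sketch of the full loop (links is assumed to be your character vector of URLs, and the results list is just one way of keeping the successful responses):
library(RCurl)
results <- list()
for (URL in links) {
  status <- tryCatch(
    getURL(URL, ssl.verifypeer = FALSE, useragent = "R"),
    error = function(e) e
  )
  # skip broken links and move on to the next URL
  if (inherits(status, "error")) next
  results[[URL]] <- status
}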
I am scraping a website for links using the XML and RCurl packages of R. I need to make multiple calls (several thousand).
The script I use is in the following form:
raw <- getURL("http://www.example.com",encoding="UTF-8",.mapUnicode = F)
parsed <- htmlParse(raw)
links <- xpathSApply(parsed,"//a/#href")
...
...
return(links)
When used a single time, there is no problem.
However, when applied to a list of urls (using sapply), I receive the following error:
Error in function (type, msg, asError = TRUE) : Recv failure:
Connection reset by peer
If I retry the same request later, it usually succeeds.
I am new to Curl and web scraping, and not sure how to fix or avoid this.
Thank you in advance
Try something like this:
library(RCurl)
# reuse a single curl handle across requests
curl <- getCurlHandle()
for (i in seq_along(links)) {
  WebPage <- try(getURL(links[[i]], ssl.verifypeer = FALSE, curl = curl))
  # if the request failed, wait a moment and retry until it succeeds
  while (inherits(WebPage, "try-error")) {
    Sys.sleep(1)
    WebPage <- try(getURL(links[[i]], ssl.verifypeer = FALSE, curl = curl))
  }
}