I have an R script that uses rvest to pull some data from AccuWeather. The AccuWeather URLs contain IDs that uniquely correspond to cities. I'm trying to pull IDs in a given range along with the associated city names. rvest works perfectly for a single ID, but when I iterate through a for loop it eventually returns this error: "Error in open.connection(x, "rb") : HTTP error 502."
I suspect this error is due to the website blocking me. How do I get around this? I want to scrape quite a large range (10,000 IDs), and it keeps giving me this error after roughly 500 iterations of the loop. I also tried closeAllConnections() and Sys.sleep(), but to no avail. I'd really appreciate any help with this problem.
EDIT: Solved. I found a way around it through this thread: Use tryCatch skip to next value of loop upon error?. I used tryCatch() with error = function(e) e as an argument; it suppressed the error message and allowed the loop to continue without breaking. Hopefully this will be helpful to anyone else stuck on a similar problem.
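For illustration, a minimal sketch of that skip-on-error pattern (hedged; read_html() and the URL prefix are taken from the original loop, which follows below):
library(rvest)

base_url <- "https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/"
for (i in 300000:300010) {
  page <- tryCatch(
    read_html(paste0(base_url, i)),
    error = function(e) e   # return the condition object instead of stopping
  )
  if (inherits(page, "error")) next   # skip IDs that error (e.g. HTTP 502)
  # ... parse `page` as usual ...
}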
library(rvest)
library(httr)
# create matrix to store IDs and Cities
# each ID corresponds to a single city
id_mat <- matrix(0, ncol = 2, nrow = 10001)
# initialize index for matrix row
j <- 1

for (i in 300000:310000) {
  z <- as.character(i)
  # pull city name from website
  accu <- read_html(paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = ""))
  citystate <- accu %>% html_nodes('h1') %>% html_text()
  # store values
  id_mat[j, 1] <- i
  id_mat[j, 2] <- citystate
  # increment by 1
  i <- i + 1
  j <- j + 1
  # close connections after 200 pulls, wait 5 minutes, and loop again
  if (i %% 200 == 0) {
    closeAllConnections()
    Sys.sleep(300)
    next
  } else {
    # sleep for 1 or 2 seconds every loop
    Sys.sleep(sample(2, 1))
  }
}
The problem seems to come from scientific notation when the ID is pasted into the URL; see How to disable scientific notation?
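For context, a quick hedged illustration of what options(scipen) changes for numeric formatting (the exact behavior of as.character()/paste() can differ between R versions and between integer and double IDs):
# with default options, R prefers scientific notation when it is shorter
format(3e5)                       # "3e+05"

# raising the scipen penalty forces fixed notation
options(scipen = 999)
format(3e5)                       # "300000"

# an explicit conversion avoids relying on global options altogether
format(3e5, scientific = FALSE)   # "300000"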
I changed your code slightly, and now it seems to be working:
library(rvest)
library(httr)
id_mat <- matrix(0, ncol = 2, nrow = 10001)

# try to download the page; return 1 on success, 0 on any error or warning
readUrl <- function(url) {
  out <- tryCatch(
    {
      download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
      return(1)
    },
    error = function(cond) {
      return(0)
    },
    warning = function(cond) {
      return(0)
    }
  )
  return(out)
}

j <- 1
options(scipen = 999)

for (i in 300000:310000) {
  z <- as.character(i)
  # build the URL for this ID
  url <- paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = "")
  if (readUrl(url) == 1) {
    # readUrl() has already saved the page to scrapedpage.html, so just parse it
    accu <- read_html("scrapedpage.html")
    citystate <- accu %>% html_nodes('h1') %>% html_text()
    # store values
    id_mat[j, 1] <- i
    id_mat[j, 2] <- citystate
    j <- j + 1
    # close connections after 200 pulls, wait 5 minutes, and loop again
    if (i %% 200 == 0) {
      closeAllConnections()
      Sys.sleep(300)
      next
    } else {
      # sleep for 1 or 2 seconds every loop
      Sys.sleep(sample(2, 1))
    }
  } else {
    er <- 1
  }
}
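An alternative sketch (not from the original answer) is to let httr retry transient 5xx responses with a backoff before giving up, via httr::RETRY():
library(httr)
library(rvest)

# hedged sketch: fetch one city name, retrying up to 5 times on failures
fetch_city <- function(id) {
  resp <- RETRY("GET",
                paste0("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", id),
                times = 5, pause_base = 2)
  if (http_error(resp)) return(NA_character_)
  read_html(content(resp, "text", encoding = "UTF-8")) %>%
    html_nodes("h1") %>%
    html_text()
}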
I have a series of functions that go to a website and collect data. Sometimes the website returns a 404 error and my code breaks. It could take 10 minutes of processing until I get a 404 error, or the code (more often than not) runs without the 404 error.
I have the following code:
linkToStopAt = as.character(unique(currentData$linkURL)[1])
myLinksToSearchOver = as.character(unique(currentData$page))
tmp = NULL
i <- 1
out_lst = list()
while (i <= length(myLinksToSearchOver)) {
  print(paste("Processing page: ", i))
  tmp <- possible_collectPageData(myLinksToSearchOver[i]) %>%
    add_column(page = myLinksToSearchOver[i])
  if (linkToStopAt %in% tmp$linkURL) {
    print(paste("We stopped at: ", i))
    break
  }
  out_lst[[i]] <- tmp
  i <- i + 1
}
Broken down as:
linkToStopAt = as.character(unique(currentData$linkURL)[1]) gives me a single URL; the while loop will break if it sees this URL.
myLinksToSearchOver = as.character(unique(currentData$page)) gives me multiple links that the while loop will search over; once it finds linkToStopAt on one of these links, the while loop breaks.
tmp <- possible_collectPageData(myLinksToSearchOver[i]) %>% add_column(page = myLinksToSearchOver[i]) This is a big function, which relies on many other functions...
So, the while loop runs until it finds a link linkToStopAt on one of the pages from myLinksToSearchOver. The function possible_collectPageData just does all my scraping/data processing etc. Each page from myLinksToSearchOver is stored in out_lst[[i]] <- tmp.
I sometimes receive a specific error in the console: "Error in if (nrow(df) != nrow(.data)) { : argumento tiene longitud cero" (argument is of length zero).
What I want to do is something like:
repeat {
tmpCollectData <- try(while("ALL-MY-WHILE-LOOP-HERE??")) #try(execute(f))
if (!(inherits(tmpCollectData, "Error in if (nrow(df) != nrow(.data)) { : argumento tiene longitud cero")))
break
}
Where, if the while loop breaks with that error, it just runs it all again, setting tmp = NULL, i = 1, out_lst = list(), etc. (Basically start again; I can do this manually by just re-executing the code.)
You could create a function that does your work, and then wrap the call to that function in try(), with silent=TRUE. Then place that in a while(TRUE) loop, breaking out if get_data() does NOT return an error:
Function to do your work
get_data <- function(links, stoplink) {
  i <- 1
  out_lst <- list()
  while (i <= length(links)) {
    print(paste("Processing page: ", i))
    tmp <- possible_collectPageData(links[i]) %>% add_column(page = links[i])
    if (stoplink %in% tmp$linkURL) {
      print(paste("We stopped at: ", i))
      break
    }
    out_lst[[i]] <- tmp
    i <- i + 1
  }
  return(out_lst)
}
Infinite loop that gets broken if result does not have any error.
while (TRUE) {
  result <- try(get_data(myLinksToSearchOver, linkToStopAt), silent = TRUE)
  if (!"try-error" %in% class(result)) break
}
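A hedged variant (not from the original answer) caps the number of retries so a persistent failure, such as the site being down, cannot loop forever:
attempts <- 0
repeat {
  result <- try(get_data(myLinksToSearchOver, linkToStopAt), silent = TRUE)
  if (!inherits(result, "try-error")) break
  attempts <- attempts + 1
  if (attempts >= 5) {
    stop("get_data() failed 5 times in a row: ",
         conditionMessage(attr(result, "condition")))
  }
}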
I am trying to scrape some (a lot) of NCAA men's basketball data from a website called RealGM. My code is below:
library(htmltab)

tables <- list()
for (i in 0:1548) {
  for (j in 0:16) {
    for (k in 0:4) {
      a <- i + 1
      b <- 2003 + j
      c <- k + 1
      url <- paste("https://basketball.realgm.com/ncaa/conferences/Big-Ten-Conference/2/Michigan/", a, "/individual-games/", b, "/minutes/Season/desc/", c, sep = "")
      tables[[paste(i, j, k, sep = "")]] <- htmltab(url, rm_nodata_cols = F, which = 1)
    }
  }
}
I've used similar methods in the past to pull data off of sites like Sports Reference which keep player data in tables.
In this loop, the variable a controls the team, b controls the year, and c controls the page number for the game log set.
My issue here is that some of the referenced URLs contain no tables, i.e. there is no 4th page of game logs for Michigan's 2003 team, but there are 5 pages for their 2018 team.
Unfortunately, htmltab returns an error when no table is found, and that aborts my loop. Is there a workaround so that it will just skip those URLs and/or continue through the rest of the process?
I was able to figure out how to do this by first checking whether a table exists and, if not, going to the next iteration of the loop:
library(htmltab)
library(rvest)   # for read_html() and html_nodes()

tables <- list()
for (i in 0:1548) {
  for (j in 0:16) {
    for (k in 0:4) {
      a <- i + 1
      b <- 2003 + j
      c <- k + 1
      url <- paste("https://basketball.realgm.com/ncaa/conferences/Big-Ten-Conference/2/Michigan/", a, "/individual-games/", b, "/minutes/Season/desc/", c, sep = "")
      # skip this URL if the page contains no table
      test <- html_nodes(read_html(url), "table")
      if (length(test) == 0) {
        next
      }
      tables[[paste(i, j, k, sep = "")]] <- htmltab(url, rm_nodata_cols = F, which = 1)
    }
  }
}
One option is to use tryCatch and skip the URLs that give an error.
library(htmltab)

tables <- list()
for (i in 1:1549) {
  for (j in 2003:2019) {
    for (k in 1:5) {
      url <- paste0("https://basketball.realgm.com/ncaa/conferences/Big-Ten-Conference/2/Michigan/", i, "/individual-games/", j, "/minutes/Season/desc/", k)
      tables[[paste0(i, j, k)]] <- tryCatch({
        htmltab(url, rm_nodata_cols = F, which = 1)
      }, error = function(e) {
        # cat() returns NULL, so nothing is stored for URLs that error
        cat("Wrong URL : ", url, " skipping\n")
      })
    }
  }
}
I have code that reads each line of my dataframe's first column, visits the website, and then downloads the photo of each deputy. But it doesn't work properly because some deputies don't have a photo yet.
That's why my code breaks and stops working. I tried using "next" and if clauses, but it still didn't work, so a friend recommended using tryCatch(). I couldn't find enough information online, and the code still doesn't work.
The file is here:
https://gist.github.com/gabrielacaesar/940f3ef14eaf29d18c3780a66053bbee
library(data.table)   # for fread()
library(httr)         # for GET()

deputados <- fread("dep-legislatura56-14jan2019.csv")

i <- 1
while (i <= 514) {
  this.could.go.wrong <- tryCatch(
    attemptsomething(),
    error = function(e) next
  )
  url <- deputados$uri[i]
  api_content <- rawToChar(GET(url)$content)
  pessoa_info <- jsonlite::fromJSON(api_content)
  pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
  download.file(pessoa_foto, basename(pessoa_foto), mode = "wb")
  Sys.sleep(0.5)
  i <- i + 1
}
Here is a solution using purrr:
library(purrr)
download_picture <- function(url) {
  api_content <- rawToChar(httr::GET(url)$content)
  pessoa_info <- jsonlite::fromJSON(api_content)
  pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
  download.file(pessoa_foto, basename(pessoa_foto), mode = "wb")
}
walk(deputados$uri, possibly(download_picture, NULL))
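If you also want to know which URLs failed instead of silently skipping them, a hedged variant (not from the original answer) is to use purrr::safely(), which keeps the error objects:
library(purrr)

results <- map(deputados$uri, safely(download_picture))
failed  <- deputados$uri[map_lgl(results, ~ !is.null(.x$error))]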
Simply wrap tryCatch around the lines that can potentially raise errors and have it return NULL or NA in the error block:
i <- 1
while (i <= 514) {
  tryCatch({
    url <- deputados$uri[i]
    api_content <- rawToChar(GET(url)$content)
    pessoa_info <- jsonlite::fromJSON(api_content)
    pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
    download.file(pessoa_foto, basename(pessoa_foto), mode = "wb")
    Sys.sleep(0.5)
  }, error = function(e) return(NULL))
  i <- i + 1
}
I am trying to get network data on who follows whom, based on a closed list of Twitter accounts followed by a given user. That is, given User A, I'd like to retrieve its friends list and then learn which of its friends follow each other.
The first issue I had was the rate limit set by the Twitter API, but I seemed to solve it using the Sys.sleep() function. Although very slow, the function shown below worked fine the first time. However, when I try to get the same info from other users' friends lists, it keeps giving me errors of the type:
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached
Error HTTP 503
And other errors. So far I've tried it three times on three different laptops and got a different error each time. Any idea why this may be happening?
Thanks a lot in advance!
friendsnet <- function(tuser) {
  require(twitteR)
  # if rate limit is hit, wait for 15 minutes
  limit <- getCurRateLimitInfo()[53, 3]
  print(paste("Look up limit", limit))
  if (limit == 0) {
    print("sleeping for fifteen minutes")
    Sys.sleep(900)
  }
  # Find user
  tuser <- getUser(tuser)
  print(tuser$screenName)
  # Empty data frame
  df <- NULL
  print("empty data frame")
  # Get names of friends
  f <- lookupUsers(tuser$getFriendIDs())
  f.id <- sapply(f, id)
  f.name <- sapply(f, screenName)
  f2 <- as.data.frame(cbind(f.id, f.name))
  print("list of friends")
  print(head(f2))
  for (i in f2$f.name) {
    # if rate limit is hit, wait for 15 minutes
    limit <- getCurRateLimitInfo()[53, 3]
    print(paste("Look up limit", limit))
    if (limit == 0) {
      print("sleeping for fifteen minutes")
      Sys.sleep(900)
    }
    A <- getUser(i)
    friends.object <- lookupUsers(A$getFriendIDs())
    # Convert list into data frame
    friends.id <- sapply(friends.object, id)
    friends.name <- sapply(friends.object, screenName)
    friends <- as.data.frame(cbind(friends.id, friends.name))
    for (j in f2$f.name) {
      if (i != j) {
        if ((j %in% friends$friends.name) == TRUE) {
          print(paste(i, "follows", j))
          df <- rbind(df, data.frame(i, j))
        }
      }
    }
  }
  return(df)
}
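One hedged pattern for transient failures like curl timeouts or HTTP 503s (a sketch, not an answer from this thread) is to retry each API call a few times with tryCatch() before giving up; with_retry() below is a hypothetical helper:
# minimal retry wrapper for transient errors (timeouts, HTTP 503)
with_retry <- function(thunk, tries = 3, wait = 30) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(thunk(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    Sys.sleep(wait)
  }
  stop("All ", tries, " attempts failed")
}

# example: retry a single lookup
# tuser <- with_retry(function() getUser("some_screen_name"))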
See the R code below; I'm using the jsonlite package to scrape data from a website:
library(jsonlite)
url <- "http://fantasy.premierleague.com/web/api/elements/"
seasonsdata <- data.frame(matrix(NA,nrow=1,ncol=20))
seasonsdata <- seasonsdata[-1,]
fetchData <- function(i) {
  res <- try(a <- fromJSON(paste0(url, i)))
  if (!inherits(res, "try-error")) {
    b <- data.frame(a[1], a[20], a[21], as.data.frame(a$season_history))
  }
}

seasonsdata <- lapply(1:696, fetchData)
seasonsdata <- do.call(rbind, lapply(seasonsdata, data.frame, stringsAsFactors = FALSE))
The code works fine for 'i' up to 10 at least; I get the desired output. However, when I increase 'i' to 696, I get the error:
Error in data.frame(a[1], a[20], a[21], as.data.frame(a$season_history)) :
arguments imply differing number of rows: 1, 0
Any advice?
If a$season_history is empty (page 57 is an example), then when you do data.frame(a[1], a[20], a[21], as.data.frame(a$season_history)) the first three elements have one row (they are scalars) and the last element has zero rows. In your function you can first check whether a$season_history is there. If it's not, you can create a row of NAs in its place.
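A quick hedged illustration of the error with made-up inputs:
# data.frame() cannot recycle a 1-row scalar against a 0-row data frame
data.frame(x = 1, y = as.data.frame(matrix(NA, ncol = 2, nrow = 0)))
# Error in data.frame(...) : arguments imply differing number of rows: 1, 0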
However, there is another problem with your code that you may not be aware of yet. Not every page up to 696 exists, and you get a 404 error when you try to pull data from a missing one. I added some steps to remove those pages before the final do.call(rbind, ...) step.
library(jsonlite)

url <- "http://fantasy.premierleague.com/web/api/elements/"
seasonsdata <- data.frame(matrix(NA, nrow = 1, ncol = 20))
seasonsdata <- seasonsdata[-1, ]

fetchData <- function(i) {
  res <- try(a <- fromJSON(paste0(url, i)))
  if (!inherits(res, "try-error")) {
    if (nrow(as.data.frame(a$season_history)) == 0) {
      # no season history: pad with a row of NAs so the columns still line up
      b <- data.frame(a[1], a[20], a[21], as.data.frame(matrix(NA, ncol = 17)))
    } else {
      b <- data.frame(a[1], a[20], a[21], as.data.frame(a$season_history))
    }
  }
}

seasonsdata <- lapply(1:696, fetchData)
# drop pages that errored (404s) or returned nothing
seasonsdata <- seasonsdata[!sapply(seasonsdata, is.null)]
seasonsdata <- seasonsdata[sapply(seasonsdata, is.data.frame)]
seasonsdata <- do.call(rbind, lapply(seasonsdata, data.frame, stringsAsFactors = FALSE))
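As a hedged aside (not part of the original answer), dplyr::bind_rows() pads missing columns with NA rather than erroring, so it can be a more forgiving final step if the per-page data frames do not all share the same columns:
library(dplyr)

seasonsdata <- bind_rows(seasonsdata)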