for loop - skipping iterations with error [duplicate]

When retrieving the h1 title using rvest, I sometimes run into 404 pages. This stops the process and returns the following error:
Error in open.connection(x, "rb") : HTTP error 404.
See the example below:
Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html"))
Code used to retrieve the h1:
library(rvest)
sapply(Data$Pages, function(url){
  url %>%
    as.character() %>%
    read_html() %>%
    html_nodes('h1') %>%
    html_text()
})
Is there a way to include an argument to ignore errors and continue the process?

You're looking for try or tryCatch, which are how R handles error catching.
With try, you just need to wrap the thing that might fail in try(), and it will return the error and keep running:
library(rvest)
sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text()
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] "Error in open.connection(x, \"rb\") : HTTP error 404.\n"
However, while that will get everything, it will also insert bad data into our results. tryCatch allows you to configure what happens when an error is raised by passing it a function to run when that condition arises:
sapply(Data$Pages, function(url){
  tryCatch(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text(),
    error = function(e){NA}  # a function that returns NA regardless of what it's passed
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] NA
There we go; much better.
Update
In the tidyverse, the purrr package offers two functions, safely and possibly, which work like try and tryCatch. They are adverbs, not verbs, meaning they take a function, modify it so as to handle errors, and return a new function (not a data object) which can then be called. Example:
library(tidyverse)
library(rvest)
df <- Data %>% rowwise() %>%                          # Evaluate each row (URL) separately
  mutate(Pages = as.character(Pages),                 # Convert factors to character for read_html
         title = possibly(~.x %>% read_html() %>%     # Try to take a URL, read it,
                            html_nodes('h1') %>%      # select header nodes,
                            html_text(),              # and collect text inside.
                          NA)(Pages))                 # If error, return NA. Call modified function on URLs.
df %>% select(title)
## Source: local data frame [4 x 1]
## Groups: <by row>
##
## # A tibble: 4 × 1
## title
## <chr>
## 1 'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages
## 2 OMG, this Japanese Trump Commercial is everything
## 3 Omar Mateen posted to Facebook during Orlando mass shooting
## 4 <NA>
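The paragraph above mentions safely as well as possibly, but only possibly is demonstrated. As a minimal sketch (the helper name get_title is mine, not from the answer), safely wraps the same scraper so that each call returns a list with result and error elements instead of throwing:
library(rvest)
library(purrr)
# Hypothetical helper: plain scraper that can still throw on a 404
get_title <- function(url) {
  url %>% read_html() %>% html_nodes('h1') %>% html_text()
}
# safely() returns a new function that never errors
safe_get_title <- safely(get_title, otherwise = NA_character_)
results <- map(as.character(Data$Pages), safe_get_title)
titles  <- map(results, "result")  # scraped titles, or NA for the 404 page
errors  <- map(results, "error")   # NULL where the request succeeded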

You can see this question for a fuller explanation of the tryCatch pattern used here:
urls<-c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html")
readUrl <- function(url) {
  out <- tryCatch(
    {
      message("This is the 'try' part")
      url %>% as.character() %>% read_html() %>% html_nodes('h1') %>% html_text()
    },
    error = function(cond) {
      message(paste("URL does not seem to exist:", url))
      message("Here's the original error message:")
      message(cond)
      return(NA)
    }
  )
  return(out)
}
y <- lapply(urls, readUrl)
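A short usage note (an assumption on my part, not part of the original answer): lapply() returns a list, so one way to line the titles up with their URLs is:
names(y) <- urls
# First h1 per page, or NA for the pages that failed
titles <- vapply(y, function(x) as.character(x)[1], character(1))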

Related

Scraping book table from Goodreads

I'm attempting to scrape a table of read books from the Goodreads website using rvest. The data is formatted as a table; however, I am getting errors when attempting to extract this info.
First we load some packages and set the URL to scrape:
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
Running this code:
dat <- read_html(url) %>%
html_nodes('//*[#id="booksBody"]') %>%
html_table()
Produces: Error in tokenize(css) : Unexpected character '/' found at position 1
Trying it again, but without the first /:
dat <- read_html(url) %>%
html_nodes('/*[#id="booksBody"]') %>%
html_table()
Produces: Error in parse_simple_selector(stream) : Expected selector, got <EOF at 20>
And finally, just trying to get the table directly, without the intermediate call to html_nodes:
dat <- read_html(url) %>%
html_table('/*[#id="booksBody"]')
Produces: Error in if (header) { : argument is not interpretable as logical
Would appreciate any guidance on how to scrape this table.
Scraping the first 5 pages
library(tidyverse)
library(rvest)
library(httr2)
get_books <- function(page) {
  cat("Scraping page:", page, "\n")
  books <-
    str_c("https://www.goodreads.com/review/list/4622890-emily-may?page=", page,
          "&shelf=%23ALL%23") %>%
    read_html() %>%
    html_elements(".bookalike.review")
  tibble(
    title = books %>%
      html_elements(".title a") %>%
      html_text2(),
    author = books %>%
      html_elements(".author a") %>%
      html_text2(),
    rating = books %>%
      html_elements(".avg_rating .value") %>%
      html_text2() %>%
      as.numeric(),
    date = books %>%
      html_elements(".date_added .value") %>%
      html_text2() %>%
      lubridate::mdy()
  )
}
df <- map_dfr(0:5, get_books)
# A tibble: 180 x 4
title author rating date
<chr> <chr> <dbl> <date>
1 Sunset "Cave~ 4.19 2023-01-14
2 Green for Danger (Inspector Cockrill~ "Bran~ 3.84 2023-01-12
3 Stone Cold Fox "Crof~ 4.22 2023-01-12
4 What If I'm Not a Cat? "Wint~ 4.52 2023-01-10
5 The Prisoner's Throne (The Stolen He~ "Blac~ 4.85 2023-01-07
6 The Kind Worth Saving (Henry Kimball~ "Swan~ 4.13 2023-01-06
7 Girl at War "Novi~ 4 2022-12-29
8 If We Were Villains "Rio,~ 4.23 2022-12-29
9 The Gone World "Swet~ 3.94 2022-12-28
10 Batman: The Dark Knight Returns "Mill~ 4.26 2022-12-28
# ... with 170 more rows
# i Use `print(n = ...)` to see more rows
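As a possible extension in the spirit of the error-handling questions above (my suggestion, not part of this answer): wrapping get_books in purrr::possibly() lets a page that 404s or times out contribute zero rows instead of stopping the whole run:
# Hedged sketch: a failing page yields an empty tibble rather than an error
safe_get_books <- possibly(get_books, otherwise = tibble())
df <- map_dfr(0:5, safe_get_books)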
I can get the first 30 books using this:
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
book_table <- read_html(url) %>%
html_elements('table#books') %>%
html_table() %>%
.[[1]]
book_table
There is some cleaning that you might need to do on the captured data. Moreover, to get the complete list I am afraid rvest alone would not be enough; you might need to use something like RSelenium to scroll through the list.

How to deal with HTTP error 504 when scraping data from hundreds of webpages?

I am trying to scrape voting data from the website of the Russian parliament. I am working with nearly 600 webpages, and I am trying to scrape data from within those pages as well. Here is the code I have written thus far:
# load packages
library(rvest)
library(purrr)
library(dplyr)    # needed for tibble()
library(stringr)  # needed for str_trim()
library(writexl)
# base url
base_url <- sprintf("http://vote.duma.gov.ru/?convocation=AAAAAAA6&sort=date_asc&page=%d", 1:789)
# loop over pages
map_df(base_url, function(i) {
  pg <- read_html(i)
  tibble(
    title = html_nodes(pg, ".item-left a") %>% html_text() %>% str_trim(),
    link = html_elements(pg, '.item-left a') %>%
      html_attr('href') %>%
      paste0('http://vote.duma.gov.ru', .)
  )
}) -> duma_votes_data
The above code executed successfully. This results in a df containing the titles and links. I am now trying to extract the date information. Here is the code I have written for that:
# extract date of vote
duma_votes_data$date <- map(duma_votes_data$link, ~ {
  .x %>%
    read_html() %>%
    html_nodes(".date-p span") %>%
    html_text() %>%
    paste(collapse = " ")
})
After running this code, I receive the following error:
Error in open.connection(x, "rb") : HTTP error 504.
What is the best way to get around this issue? I have read about the possibility of incorporating Sys.sleep() into my code, but I am not sure where it should go. Note that this code is for all 789 pages, as indicated in base_url. The code does work with around 40 pages, so worst case I could do everything in small chunks and combine the resulting dfs into a single df.
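A minimal sketch of one way to combine Sys.sleep() with tryCatch() and a small retry loop; the helper name read_html_retry, the pause lengths, and the retry count are assumptions of mine, not from the post:
library(rvest)
library(purrr)
# Hypothetical helper: retry a flaky page a few times, pausing between attempts,
# and give up with NULL instead of stopping the whole run
read_html_retry <- function(url, tries = 3, pause = 5) {
  for (i in seq_len(tries)) {
    page <- tryCatch(read_html(url), error = function(e) NULL)
    if (!is.null(page)) return(page)
    Sys.sleep(pause)  # wait before the next attempt, e.g. after a 504
  }
  NULL
}
duma_votes_data$date <- map_chr(duma_votes_data$link, function(link) {
  Sys.sleep(1)  # be polite between requests
  page <- read_html_retry(link)
  if (is.null(page)) return(NA_character_)
  page %>% html_nodes(".date-p span") %>% html_text() %>% paste(collapse = " ")
})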

Error in doc_parse_file using rvest to scrape multiple pages

I'm trying to scrape the links from multiple pages of a web forum, and I'm getting an error message that I'm not sure how to fix.
I tried the following, using rvest and purrr:
pages <- c("https://www.immigrationboards.com/eea-route-applications/page") %>%
paste0(1:18000) %>%
paste0(c(".html"))
i<-1
pages.subset<-pages[1:(i+49)==(i+49)]
pages.subset<-as_data_frame(pages.subset)
scrape_links<-function(pages.subset){read_html(pages.subset) %>% html_node(".topictitle") %>% html_attr('href')}
links<-map_df(pages.subset, scrape_links)
However, I got this error message:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=360].
Does anyone have any ideas as to how to solve this?
Although I am not 100% sure what caused the error, it seems that passing an entire data.frame as a list to the map_df command blew things up. I have readjusted your code:
library(tidyverse)
library(rvest)
pages <- c("https://www.immigrationboards.com/eea-route-applications/page") %>%
paste0(1:18000) %>%
paste0(c(".html"))
scrape_links <- function(url) {
out <- url %>%
read_html() %>%
html_node(".topictitle") %>%
html_attr("href")
return(out)
}
links <- tibble(page = pages[1:(50) == (50)]) %>%
mutate(url = map_chr(page, scrape_links))
head(links)
# # A tibble: 6 x 2
# page url
# <chr> <chr>
# 1 https://www.immigrationboards.com/eea-route-applications/page50.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 2 https://www.immigrationboards.com/eea-route-applications/page100.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 3 https://www.immigrationboards.com/eea-route-applications/page150.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 4 https://www.immigrationboards.com/eea-route-applications/page200.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 5 https://www.immigrationboards.com/eea-route-applications/page250.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 6 https://www.immigrationboards.com/eea-route-applications/page300.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
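A possible follow-up in keeping with the error-handling theme of this page (my addition, not part of the original answer): if some of the subsetted pages fail to load, scrape_links can be wrapped with purrr::possibly() so a bad page yields NA instead of aborting map_chr():
# Hedged sketch: failed pages come back as NA_character_
safe_scrape_links <- possibly(scrape_links, otherwise = NA_character_)
links <- tibble(page = pages[1:(50) == (50)]) %>%
  mutate(url = map_chr(page, safe_scrape_links))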

(...) is not a length 1 atomic vector error

I am trying to scrape IMDB data, and for one variable I keep getting an error:
Error: Result 28 is not a length 1 atomic vector
library(rvest)
library(purrr)
library(tidyr)
topmovies <- read_html("http://www.imdb.com/chart/top")
links <- topmovies %>%
  html_nodes(".titleColumn") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  xml2::url_absolute("http://imdb.com")
pages <- links %>% map(read_html)
budget <- try(pages %>%
  map_chr(. %>%
    html_nodes("#titleDetails .txt-block:nth-child(11)") %>%
    html_text() %>%
    # gsub("\\D", "", .) %>%
    extract_numeric()), silent = TRUE)
When I do it with try (as in the code), I get
budget
[1] "Error : cannot convert object to a data frame\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<Rcpp::exception: cannot convert object to a data frame>
It would be great if anyone could tell me what is going wrong / why try isn't just skipping that result?
This can happen when the scrape returns an empty result (NULL or a zero-length vector). Try adding
map_if(is_empty, ~ NA_character_)
to the pipe chain and see if it works.
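The answer above is only a fragment, so here is a hedged sketch of where map_if() might sit in the original pipeline; switching map_chr() to map() so empty results can be replaced before extraction, and using readr::parse_number() in place of the deprecated tidyr::extract_numeric(), are my substitutions, not part of the answer:
library(rvest)
library(purrr)
# Reuses `pages` from the question above
budget <- pages %>%
  map(~ .x %>%
        html_nodes("#titleDetails .txt-block:nth-child(11)") %>%
        html_text()) %>%
  map_if(is_empty, ~ NA_character_) %>%  # replace zero-length results with NA
  map_chr(1) %>%                         # now every element has a first value to take
  readr::parse_number()                  # pull the number out of the text; NA stays NA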
