I'm trying to scrape the links from multiple pages of a web forum, and I'm getting an error message that I'm not sure how to fix.
I tried the following, using rvest and purrr:
pages <- c("https://www.immigrationboards.com/eea-route-applications/page") %>%
paste0(1:18000) %>%
paste0(c(".html"))
i<-1
pages.subset<-pages[1:(i+49)==(i+49)]
pages.subset<-as_data_frame(pages.subset)
scrape_links<-function(pages.subset){read_html(pages.subset) %>% html_node(".topictitle") %>% html_attr('href')}
links<-map_df(pages.subset, scrape_links)
However, I got this error message:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=360].
Does anyone have any ideas as to how to solve this?
Although I am not 100% sure what caused the error, it seems that passing an entire data frame to map_df() blew things up (there is a short demo of why after the output below). I have readjusted your code:
library(tidyverse)
library(rvest)

pages <- c("https://www.immigrationboards.com/eea-route-applications/page") %>%
  paste0(1:18000) %>%
  paste0(c(".html"))

scrape_links <- function(url) {
  out <- url %>%
    read_html() %>%
    html_node(".topictitle") %>%
    html_attr("href")
  return(out)
}

links <- tibble(page = pages[1:(50) == (50)]) %>%
  mutate(url = map_chr(page, scrape_links))
head(links)
# # A tibble: 6 x 2
# page url
# <chr> <chr>
# 1 https://www.immigrationboards.com/eea-route-applications/page50.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 2 https://www.immigrationboards.com/eea-route-applications/page100.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 3 https://www.immigrationboards.com/eea-route-applications/page150.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 4 https://www.immigrationboards.com/eea-route-applications/page200.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 5 https://www.immigrationboards.com/eea-route-applications/page250.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 6 https://www.immigrationboards.com/eea-route-applications/page300.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
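To see why the original version blew up: map_df(), like map(), iterates over the elements of its input, and for a data frame those elements are its columns. So scrape_links() was handed the whole 360-element character vector at once, while read_html() expects a single string. A minimal sketch of the difference (the toy URLs here are just placeholders):
library(purrr)
library(tibble)

urls_chr <- c("page50.html", "page100.html", "page150.html")
urls_tbl <- tibble(value = urls_chr)

map(urls_tbl, length)  # visits the columns: a single element of length 3
# $value
# [1] 3

map(urls_chr, length)  # visits the URLs: three elements, one per URL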
Related
I am facing trouble and need help.
I have a list of links (about 9000) which I run through in a loop, doing some processing on each one.
The links look like this:
link1
link2
link3
link4
.....
link9000
But sometimes the 2nd link fails with a timeout, and sometimes the 2nd link works and the 400th (or some other random link) times out instead. Is there any way I can retry a failed link again and again? I have added:
status_c <- httr::GET(Links, config = httr::config(connecttimeout = 150))
but I still get timeouts. Any suggestions? (final_links_bind holds the full list of links.)
Some sample links:
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
for (i in 1:nrow(final_links_bind)) {
  Links <- final_links_bind[i, ]
  BP_ID <- final_bp_bind[i, ]
  # print(Links)
  status_c <- GET(Links, timeout(120))
  status <- status_code(status_c)
  if (status == "200") {
    url_parse <- read_html(Links)
    col_name <- url_parse %>%
      html_nodes("tr") %>%
      html_text()
    col_name <- stringr::str_remove_all(col_name, "\\t|\\n|\\r")
    pattern_col_no <- grep("využití", col_name)
    col_name <- as.data.frame(col_name)
    method_selected <- col_name[pattern_col_no, ]
    WRITE_CSV_DATA <- rbind(WRITE_CSV_DATA,
                            data.frame(BP_ID = c(BP_ID), method_selected = c(method_selected), Links = c(Links)))
    # METHOD_OF_USE <- rbind(method_selected, METHOD_OF_USE)
    print(WRITE_CSV_DATA)
  } else {
    print("LINK NOT WORKING")
    no_Links <- sorted_link[i, ]
    not_working_link <- rbind(not_working_link, no_Links)
  }
}
It is not clear how you want the final output, but here is how to scrape and skip the links that are not working:
library(rvest)
library(httr2)
library(tidyverse)
Given this data frame of links, notice the third one is not working:
df <- tibble(
  links = c(
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789"
  )
)
# A tibble: 4 × 1
links
<chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
Create a function to scrape the table, specifically the third row:
get_info <- function(link) {
  cat("Scraping", link, "\n")
  link %>%
    read_html() %>%
    html_table() %>%
    pluck(2) %>%
    slice(3) %>%
    pull(2)
}
And mutate() a new column with the info, NA if the link is not working: possibly() returns NA (NA_character_) instead of stopping the code when a link fails.
df %>%
  mutate(
    info = map_chr(links, possibly(get_info, otherwise = NA_character_))
  )
# A tibble: 4 × 2
links info
<chr> <chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711 rodinný dům
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703 rodinný dům
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999 NA
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789 rodinný dům
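The question also asks about retrying a failed link a few times before giving up. One way to do that, as a sketch reusing get_info() from above, is purrr::insistently(), which re-runs a function with a backoff pause whenever it errors; wrapping the result in possibly() then turns a link that still fails after all retries into NA instead of stopping the loop:
# Retry each link up to 3 times with a short backoff, then fall back to NA
get_info_retry <- get_info %>%
  insistently(rate = rate_backoff(pause_base = 2, max_times = 3)) %>%
  possibly(otherwise = NA_character_)

df %>%
  mutate(info = map_chr(links, get_info_retry))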
When retrieving the h1 title using rvest, I sometimes run into 404 pages. This stops the process and returns this error:
Error in open.connection(x, "rb") : HTTP error 404.
See the example below
Data <- data.frame(Pages = c(
  "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
  "http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
  "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
  "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html"))
Code used to retrieve h1
library(rvest)

sapply(Data$Pages, function(url){
  url %>%
    as.character() %>%
    read_html() %>%
    html_nodes('h1') %>%
    html_text()
})
Is there a way to include an argument to ignore errors and continue the process?
You're looking for try or tryCatch, which are how R handles error catching.
With try, you just need to wrap the thing that might fail in try(), and it will return the error and keep running:
library(rvest)

sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text()
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] "Error in open.connection(x, \"rb\") : HTTP error 404.\n"
However, while that will get everything, it will also insert bad data into our results. tryCatch allows you to configure what happens when an error is raised, by passing it a function to run when that condition arises:
sapply(Data$Pages, function(url){
  tryCatch(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text(),
    error = function(e){NA}  # a function that returns NA regardless of what it's passed
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] NA
There we go; much better.
Update
In the tidyverse, the purrr package offers two functions, safely and possibly, which work like try and tryCatch. They are adverbs, not verbs, meaning they take a function, modify it so as to handle errors, and return a new function (not a data object) which can then be called. Example:
library(tidyverse)
library(rvest)
df <- Data %>% rowwise() %>%                          # Evaluate each row (URL) separately
  mutate(Pages = as.character(Pages),                 # Convert factors to character for read_html
         title = possibly(~ .x %>% read_html() %>%    # Try to take a URL, read it,
                            html_nodes('h1') %>%      # select header nodes,
                            html_text(),              # and collect text inside.
                          NA)(Pages))                 # If error, return NA. Call modified function on URLs.
df %>% select(title)
## Source: local data frame [4 x 1]
## Groups: <by row>
##
## # A tibble: 4 × 1
## title
## <chr>
## 1 'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages
## 2 OMG, this Japanese Trump Commercial is everything
## 3 Omar Mateen posted to Facebook during Orlando mass shooting
## 4 <NA>
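safely() is mentioned above but not shown. Unlike possibly(), it keeps both the value and the error: the modified function returns a list with result and error components for every call, so you can inspect afterwards which pages failed and why. A quick sketch on the same data:
safe_read <- safely(~ .x %>% read_html() %>% html_nodes('h1') %>% html_text())
res <- map(as.character(Data$Pages), safe_read)

res[[4]]$result  # NULL, because the request failed
res[[4]]$error   # the HTTP 404 error object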
You can see this question for an explanation here.
urls <- c(
  "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
  "http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
  "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
  "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html")
readUrl <- function(url) {
  out <- tryCatch(
    {
      message("This is the 'try' part")
      url %>% as.character() %>% read_html() %>% html_nodes('h1') %>% html_text()
    },
    error = function(cond) {
      message(paste("URL does not seem to exist:", url))
      message("Here's the original error message:")
      message(cond)
      return(NA)
    }
  )
  return(out)
}
y <- lapply(urls, readUrl)
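y is then a list with one character vector (or NA) per URL. As a small follow-up sketch, you can collapse it into a data frame next to the original URLs, keeping just the first h1 per page:
titles <- vapply(y, function(x) as.character(x[1]), character(1))  # failed pages become NA
data.frame(url = urls, title = titles, stringsAsFactors = FALSE)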
I want to scrape a large number of websites. For this, I first read in the websites' HTML sources and store them as xml_nodesets. As I only need the websites' contents, I then extract those from the xml_nodesets. To achieve this, I have written the following code:
# required packages
library(purrr)
library(dplyr)
library(xml2)
library(rvest)
# urls of the example sources
test_files <- c("https://en.wikipedia.org/wiki/Web_scraping", "https://en.wikipedia.org/wiki/Data_scraping")
# reading in the html sources, storing them as xml_nodesets
test <- test_files %>%
  map(., ~ xml2::read_html(.x, encoding = "UTF-8"))

# extracting selected nodes (contents)
test_tbl <- test %>%
  map(., ~ tibble(
    # scrape contents
    test_html = rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]')
  ))
Unfortunately, this produces the following error:
Error: All columns in a tibble must be vectors.
x Column `test_html` is a `xml_nodeset` object.
I think I understand the substance of this error, but I can't find a way around it. It's also a bit strange, because I was able to run this code smoothly in January and suddenly it is not working anymore. I suspected package updates to be the reason, but installing older versions of xml2, rvest or tibble didn't help either. Also, scraping only a single website doesn't produce any errors:
test <- read_html("https://en.wikipedia.org/wiki/Web_scraping", encoding = "UTF-8") %>%
  rvest::html_nodes(xpath = '//*[(@id = "toc")]')
Do you have any suggestions on how to solve this issue? Thank you very much!
EDIT: I removed %>% html_text from ...
test_tbl <- test %>%
  map(., ~ tibble(
    # scrape contents
    test_html = rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]')
  ))
... as the version with html_text doesn't produce this error. The edited code above (without it) does, though.
You need to store the objects in a list.
test %>%
  purrr::map(~ tibble(
    # scrape contents
    test_html = list(rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]'))
  ))
#[[1]]
# A tibble: 1 x 1
# test_html
# <list>
#1 <xml_ndst>
#[[2]]
# A tibble: 1 x 1
# test_html
# <list>
#1 <xml_ndst>
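If you then want the text of those nodes rather than the raw xml_nodeset, one way to unpack the list-column afterwards (a sketch, building on the code above):
test %>%
  purrr::map(~ tibble(
    test_html = list(rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]'))
  )) %>%
  dplyr::bind_rows() %>%
  dplyr::mutate(toc_text = purrr::map_chr(test_html,
                                          ~ paste(rvest::html_text(.x), collapse = "\n")))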
I am trying to scrape a list of plumbers from http://www.yellowpages.com.au to build a tibble.
The code works fine for each section (name, phone number, email), but when I put it together in a function to build the tibble, it hits an error because some listings don't have phone numbers or emails.
url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"
testscrape <- function(){
  webpage <- read_html(url)
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40", "@")
  return(tibble(docname = docname, ph_no = ph_no, email = email))
}
Then I run the function:
test_run <- testscrape
test_run()
And the following error appears:
Error: Tibble columns must have compatible sizes.
* Size 36: Existing data.
* Size 17: Column `ph_no`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Browse[1]>
Which leaves it hanging.
I appreciate that there are fewer phone numbers than listed plumbers, so how do I return NA for a plumber with no phone number so that the values line up with the relevant plumbers?
Thanks in advance.
You can pad each extracted vector to the length of the longest one by indexing it with a common sequence; indexing past the end of a vector returns NA, so missing values become NA and the columns line up.
library(rvest)
library(stringr)
library(tibble)

testscrape <- function(url){
  webpage <- read_html(url)
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40", "@")
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)
# docname ph_no email
# <lgl> <lgl> <lgl>
#1 NA NA NA
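A more robust alternative is to walk over each listing node and use html_node() (singular), which returns a missing node when there is no match, so a listing without a phone or email gets NA in that row and the columns stay aligned by construction. This is only a sketch: the .left per-listing container and the inner selectors are assumptions carried over from the code above and may need adjusting to the real page structure:
library(rvest)
library(tibble)

scrape_listings <- function(url) {
  listings <- read_html(url) %>% html_nodes(".left")  # assumed per-listing container
  tibble(
    docname = listings %>% html_node(".listing-name") %>% html_text(),
    ph_no   = listings %>% html_node(".contact-phone .contact-text") %>% html_text(),
    email   = listings %>% html_node(".contact-email") %>% html_attr("href")
  )
}

scrape_listings(url)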