Scraping book table from goodreads - r

I'm attempting to scrape a table of read books from the Goodreads website using rvest. The data is formatted as a table; however, I am getting errors when attempting to extract it.
First we load some packages and set the URL to scrape:
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
Running this code:
dat <- read_html(url) %>%
  html_nodes('//*[@id="booksBody"]') %>%
  html_table()
Produces: Error in tokenize(css) : Unexpected character '/' found at position 1
Trying it again, but without the first /:
dat <- read_html(url) %>%
  html_nodes('/*[@id="booksBody"]') %>%
  html_table()
Produces: Error in parse_simple_selector(stream) : Expected selector, got <EOF at 20>
And finally, just trying to get the table directly, without the intermediate call to html_nodes:
dat <- read_html(url) %>%
  html_table('/*[@id="booksBody"]')
Produces: Error in if (header) { : argument is not interpretable as logical
Would appreciate any guidance on how to scrape this table.
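For what it's worth, the first error arises because html_nodes()/html_elements() treat a plain string as a CSS selector; an XPath expression has to be passed explicitly through the xpath argument. In addition, booksBody appears to be the <tbody>, while html_table() wants the enclosing <table>, which one of the answers below targets with the CSS selector 'table#books'. A hedged sketch of the selector fix only, since Goodreads may still not serve the full list to a plain request:
# Sketch: pass XPath via the xpath argument and select the whole table,
# not just the tbody.
dat <- read_html(url) %>%
  html_elements(xpath = '//table[@id="books"]') %>%
  html_table()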

Scraping the first 5 pages
library(tidyverse)
library(rvest)
library(httr2)
get_books <- function(page) {
  cat("Scraping page:", page, "\n")
  books <-
    str_c("https://www.goodreads.com/review/list/4622890-emily-may?page=", page,
          "&shelf=%23ALL%23") %>%
    read_html() %>%
    html_elements(".bookalike.review")
  tibble(
    title = books %>%
      html_elements(".title a") %>%
      html_text2(),
    author = books %>%
      html_elements(".author a") %>%
      html_text2(),
    rating = books %>%
      html_elements(".avg_rating .value") %>%
      html_text2() %>%
      as.numeric(),
    date = books %>%
      html_elements(".date_added .value") %>%
      html_text2() %>%
      lubridate::mdy()
  )
}
df <- map_dfr(0:5, get_books)
# A tibble: 180 x 4
title author rating date
<chr> <chr> <dbl> <date>
1 Sunset "Cave~ 4.19 2023-01-14
2 Green for Danger (Inspector Cockrill~ "Bran~ 3.84 2023-01-12
3 Stone Cold Fox "Crof~ 4.22 2023-01-12
4 What If I'm Not a Cat? "Wint~ 4.52 2023-01-10
5 The Prisoner's Throne (The Stolen He~ "Blac~ 4.85 2023-01-07
6 The Kind Worth Saving (Henry Kimball~ "Swan~ 4.13 2023-01-06
7 Girl at War "Novi~ 4 2022-12-29
8 If We Were Villains "Rio,~ 4.23 2022-12-29
9 The Gone World "Swet~ 3.94 2022-12-28
10 Batman: The Dark Knight Returns "Mill~ 4.26 2022-12-28
# … with 170 more rows
# ℹ Use `print(n = ...)` to see more rows

I can get the first 30 books using this:
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
book_table <- read_html(url) %>%
  html_elements('table#books') %>%
  html_table() %>%
  .[[1]]
book_table
There is some cleaning you might need to do on the captured data. Moreover, to get the complete list, I am afraid rvest alone would not be enough. You might need to use something like RSelenium to scroll through the list.
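A rough sketch of what the RSelenium route could look like. Everything here is an assumption about the setup (a local Selenium server started with rsDriver(), Firefox, and Goodreads appending rows on scroll) rather than tested code:
library(RSelenium)
library(rvest)

# Start a browser session (assumes Firefox and a free port; adjust as needed).
rD    <- rsDriver(browser = "firefox", port = 4555L, verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://www.goodreads.com/review/list/4622890?shelf=read")

# Scroll a few times so the infinite-scroll list keeps loading.
for (i in 1:10) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)  # give the page time to append more rows
}

# Hand the rendered HTML back to rvest.
book_table <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_element("table#books") %>%
  html_table()

remDr$close()
rD$server$stop()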

Related

Webscraping PolitiTweet with rvest

The webpage https://polititweet.org/ stores the complete tweet history of certain politicians, CEOs and so on. Importantly, they also provide deleted tweets, which I am interested in. Now, I would like to write a web scraper in R to retrieve the texts of the deleted tweets from Elon Musk, but I fail, as the HTML only gives me href attributes rather than the tweet text.
That's my try (after an edit due to @Bensstats):
library(rvest)
url_page1<- read_html("https://polititweet.org/tweets?page=1&deleted=True&account=44196397&search=")
tweets_deleted <- html_nodes(url_page1, ".tweet-card") |> html_attr("href")
tweets_deleted
With this, I get the IDs of the deleted tweets on page 1. However, what I want is the deleted text itself.
Moreover, there are 9 pages of deleted tweets for Musk. As this number of pages is likely to increase in the future, I would like to extract the number of pages automatically and then automate the process for each page (via a loop or something like that).
I would really appreciate it if somebody of you has an idea how to fix these problems!
Thanks a lot!
Get all of Elon's deleted tweets, pages 1:9.
library(tidyverse)
library(rvest)
get_tweets <- function(page) {
  tweets <-
    str_c(
      "https://polititweet.org/tweets?page=",
      page,
      "&deleted=True&account=44196397&search="
    ) %>%
    read_html() %>%
    html_elements(".tweet-card")
  tibble(
    tweeter = tweets %>%
      html_element("strong+ .has-text-grey") %>%
      html_text2(),
    tweet = tweets %>%
      html_element(".small-top-margin") %>%
      html_text2(),
    time = tweets %>%
      html_element(".is-paddingless") %>%
      html_text2() %>%
      str_remove_all("Posted ")
  )
}
map_dfr(1:9, get_tweets)
# A tibble: 244 × 3
tweeter tweet time
<chr> <chr> <chr>
1 @elonmusk "@BBCScienceNews Tesla Megapacks are extremely e… Nov.…
2 @elonmusk "\u2014 PolitiTweet.org" Nov.…
3 @elonmusk "@BuzzPatterson They could help it \U0001f923 \u… Nov.…
4 @elonmusk "\U0001f37f \u2014 PolitiTweet.org" Nov.…
5 @elonmusk "Let\u2019s call the fact-checkers \u2026 \u2014… Nov.…
6 @elonmusk "@SharkJumping \u2014 Po… Nov.…
7 @elonmusk "Can you believe this app only costs $8!? https:… Nov.…
8 @elonmusk "@langdon @EricFrohnhoefer @pokemoniku He\u2019s… Nov.…
9 @elonmusk "@EricFrohnhoefer @MEAInd I\u2019ve been at Twit… Nov.…
10 @elonmusk "@ashleevance @mtaibbi @joerogan Twitter drives … Nov.…
# … with 234 more rows
# ℹ Use `print(n = ...)` to see more rows
Since you wanted it to automatically detect the number of pages and scrape them, here's a possible solution where you just supply a link to the function:
get_tweets <- function(link) {
  page <- link %>%
    read_html()
  pages <- page %>%
    html_elements(".pagination-link") %>%
    last() %>%
    html_text2() %>%
    as.numeric()
  twitter <- function(link, page) {
    tweets <-
      link %>%
      str_replace(pattern = "page=1", str_c("page=", page)) %>%
      read_html() %>%
      html_elements(".tweet-card")
    tibble(
      tweeter = tweets %>%
        html_element("strong+ .has-text-grey") %>%
        html_text2(),
      tweet = tweets %>%
        html_element(".small-top-margin") %>%
        html_text2(),
      time = tweets %>%
        html_element(".is-paddingless") %>%
        html_text2() %>%
        str_remove_all("Posted ")
    )
  }
  map2_dfr(link, 1:pages, twitter)
}
get_tweets("https://polititweet.org/tweets?page=1&deleted=True&account=44196397&search=")
I highly recommend a selector tool such as SelectorGadget to help you pick CSS elements.
Edit
You are going to want to change the CSS selector:
library(rvest)
url <- read_html("https://polititweet.org/tweets?account=44196397&deleted=True")
tweets_deleted <- html_nodes(url, "p.small-top-margin") |> html_text()
tweets_deleted %>%
  gsub('\\n', '', .)

R Scraping for image details from several thousand pages

I am trying to scrape details from a website in order to gather details for pictures with a script in R.
What I need is:
Image name (1.jpg)
Image caption ("A recruit demonstrates the proper use of a CO2 portable extinguisher to put out a small outside fire.")
Photo credit ("Photo courtesy of: James Fortner")
There are over 16,000 files, and thankfully the web URL goes "...asp?photo=1, 2, 3, 4", so there is a base URL which doesn't change, just the last section with the image number. I would like the script to loop for either a set number (I tell it where to start) or just break when it gets to a page which doesn't exist.
Using the code below, I can get the caption of the photo, but only one line. I would also like to get the photo credit, which is on a separate line; there are three lines between the main caption and the photo credit. I'd be fine if the table which is generated had two or three blank columns to account for the lines, as I can delete them later.
library(rvest)
library(dplyr)
link = "http://fallschurchvfd.org/photovideo.asp?photo=1"
page = read_html(link)
caption = page %>% html_nodes(".text7 i") %>% html_text()
info = data.frame(caption, stringsAsFactors = FALSE)
write.csv(info, "photos.csv")
Scraping with rvest and tidyverse
library(tidyverse)
library(rvest)
get_picture <- function(page) {
  cat("Scraping page", page, "\n")
  page <- str_c("http://fallschurchvfd.org/photovideo.asp?photo=", page) %>%
    read_html()
  tibble(
    image_name = page %>%
      html_element(".text7 img") %>%
      html_attr("src"),
    caption = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist() %>%
      nth(1),
    credit = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist() %>%
      nth(3)
  )
}
# Get pages 1:50
df <- map_dfr(1:50, possibly(get_picture, otherwise = tibble()))
# A tibble: 42 × 3
image_name caption credit
<chr> <chr> <chr>
1 /photos/1.jpg Recruit Clay Hamric demonstrates the use… James…
2 /photos/2.jpg A recruit demonstrates the proper use of… James…
3 /photos/3.jpg Recruit Paul Melnick demonstrates the pr… James…
4 /photos/4.jpg Rescue 104 James…
5 /photos/5.jpg Rescue 104 James…
6 /photos/6.jpg Rescue 104 James…
7 /photos/15.jpg Truck 106 operates a ladder pipe from Wi… Jim O…
8 /photos/16.jpg Truck 106 operates a ladder pipe as heav… Jim O…
9 /photos/17.jpg Heavy fire vents from the roof area of t… Jim O…
10 /photos/18.jpg Arlington County Fire and Rescue Associa… James…
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
For the images, you can use the command line tool curl. For example, to download images 1.jpg through 100.jpg
curl -O "http://fallschurchvfd.org/photos/[0-100].jpg"
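If you'd rather stay in R, download.file() in a loop does roughly the same job. This is only a sketch; the /photos/<n>.jpg pattern is taken from the image_name column above and may not hold for every photo:
dir.create("photos", showWarnings = FALSE)
for (i in 1:100) {
  img_url <- paste0("http://fallschurchvfd.org/photos/", i, ".jpg")
  dest    <- file.path("photos", paste0(i, ".jpg"))
  tryCatch(
    download.file(img_url, dest, mode = "wb", quiet = TRUE),
    error   = function(e) message("No image at ", img_url),
    warning = function(w) message("Problem with ", img_url)
  )
}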
For the R code, if you grab the whole .text7 section, then you can split into caption and photo credit subsequently:
extractedtext <- page %>% html_nodes(".text7") %>% html_text()
caption <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1]
credit <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
As a loop
library(rvest)
library(tidyverse)
df <- data.frame(id = 1:20,
                 image = NA,
                 caption = NA,
                 credit = NA)
for (i in 1:20) {
  cat(i, " ")  # to monitor progress and debug
  link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
  tryCatch({  # This is to avoid stopping on an error message for missing pages
    page <- read_html(link)
    df$image[i] <- page %>% html_nodes(".text7 img") %>% html_attr("src")
    extractedtext <- page %>% html_nodes(".text7") %>% html_text()
    df$caption[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1]  # This is an awkward way of saying "list 1, element 1"
    df$credit[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
  },
  error = function(e) {cat("ERROR :", conditionMessage(e), "\n")})
}
I get inconsistent results with this current code; for example, page 15 has more line breaks than page 1.
TODO: enhance string extraction; switch to an 'append' method of adding data to a data.frame (vs pre-allocate and insert).
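On the string-extraction TODO, one less brittle option (a sketch, assuming the credit line always begins with "Photo courtesy", as it does in the sample rows above) is to split on newlines and pick pieces by content rather than by position:
parse_text7 <- function(txt) {
  # Split on newlines, trim stray tabs/spaces, and drop empty pieces.
  parts <- stringr::str_squish(strsplit(txt, "\r?\n")[[1]])
  parts <- parts[parts != ""]
  data.frame(
    caption = if (length(parts)) parts[1] else NA_character_,
    credit  = grep("^Photo courtesy", parts, value = TRUE)[1]
  )
}
The output of page %>% html_element(".text7") %>% html_text() could then go straight into parse_text7(), replacing the positional nth(1)/nth(3) picks.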

for loop - skipping iterations with error [duplicate]

When retrieving the h1 title using rvest, I sometimes run into 404 pages. This stops the process and returns this error:
Error in open.connection(x, "rb") : HTTP error 404.
See the example below
Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html"))
Code used to retrieve h1
library(rvest)
sapply(Data$Pages, function(url){
  url %>%
    as.character() %>%
    read_html() %>%
    html_nodes('h1') %>%
    html_text()
})
Is there a way to include an argument to ignore errors and continue the process?
You're looking for try or tryCatch, which are how R handles error catching.
With try, you just need to wrap the thing that might fail in try(), and it will return the error and keep running:
library(rvest)
sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text()
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] "Error in open.connection(x, \"rb\") : HTTP error 404.\n"
However, while that will get everything, it will also insert bad data into our results. tryCatch allows you to configure what happens when an error is raised by passing it a function to run when that condition arises:
sapply(Data$Pages, function(url){
  tryCatch(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('h1') %>%
      html_text(),
    error = function(e){NA} # a function that returns NA regardless of what it's passed
  )
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
# [4] NA
There we go; much better.
Update
In the tidyverse, the purrr package offers two functions, safely and possibly, which work like try and tryCatch. They are adverbs, not verbs, meaning they take a function, modify it so as to handle errors, and return a new function (not a data object) which can then be called. Example:
library(tidyverse)
library(rvest)
df <- Data %>% rowwise() %>%                        # Evaluate each row (URL) separately
  mutate(Pages = as.character(Pages),               # Convert factors to character for read_html
         title = possibly(~ .x %>% read_html() %>%  # Try to take a URL, read it,
                            html_nodes('h1') %>%    # select header nodes,
                            html_text(),            # and collect text inside.
                          NA)(Pages))               # If error, return NA. Call modified function on URLs.
df %>% select(title)
## Source: local data frame [4 x 1]
## Groups: <by row>
##
## # A tibble: 4 × 1
## title
## <chr>
## 1 'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages
## 2 OMG, this Japanese Trump Commercial is everything
## 3 Omar Mateen posted to Facebook during Orlando mass shooting
## 4 <NA>
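Since safely() is mentioned above but not shown: unlike possibly(), it returns both the result and the error object for each call, which helps when you want to see why a page failed. A small sketch:
safe_title <- safely(
  ~ .x %>% read_html() %>% html_nodes("h1") %>% html_text(),
  otherwise = NA_character_
)
res <- map(as.character(Data$Pages), safe_title)
# Each element is list(result = ..., error = ...); pull the pieces apart:
titles <- map_chr(res, ~ .x$result[1])
errors <- map(res, "error")  # NULL where the page was read successfully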
See this question for a fuller explanation.
urls<-c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html")
readUrl <- function(url) {
  out <- tryCatch(
    {
      message("This is the 'try' part")
      url %>% as.character() %>% read_html() %>% html_nodes('h1') %>% html_text()
    },
    error = function(cond) {
      message(paste("URL does not seem to exist:", url))
      message("Here's the original error message:")
      message(cond)
      return(NA)
    }
  )
  return(out)
}
y <- lapply(urls, readUrl)

How to find all Event IDs in an efficient way?

How could I crawl this database with rvest to identify all tournament IDs for each year? Currently, I'm just going from 1:max(event_id), which is really a drain on compute time.
https://www.worldloppet.com/results/
The results filter seems to be dynamic on the webpage, so the url doesn't change.
outlist <- list()
for (event_id in 2483:2570) {
  # event_id = 2483  # debugging leftover; if uncommented it pins every iteration to one event
  # update progress
  message('Retrieving Event ', event_id)
  race_url = paste0('https://www.worldloppet.com/browse/?id=', event_id)
  event_info = read_html(race_url) %>%
    html_nodes('h2') %>%
    .[1] %>%
    gsub('<br>', '<br> ', .) %>%
    gsub("<[^>]+>", "", .) %>%
    str_split(., ' ') %>%
    unlist()
  #event_info$eventid <- event_id
  outlist <- c(outlist, list(c(event_id, event_info)))
  # temporary break
  Sys.sleep(3)
}
You can extract all links containing the word browse from the HTML document:
library(tidyverse)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
read_html("https://www.worldloppet.com/results/") %>%
html_nodes("a") %>%
html_attr("href") %>%
as.character() %>%
keep(~ .x %>% str_detect("browse")) %>%
paste0("https://www.worldloppet.com",.)
#> [1] "https://www.worldloppet.com/browse/?id=2570"
#> [2] "https://www.worldloppet.com/browse/?id=1818"
#> [3] "https://www.worldloppet.com/browse/?id=1817"
#> [4] "https://www.worldloppet.com/browse/?id=2518"
#> [5] "https://www.worldloppet.com/browse/?id=2517"
Created on 2022-02-09 by the reprex package (v2.0.1)
The IDs of the races can be found in the links, which can be extracted using the html_attr function. From there we can use some regex to find the numbers; here I include id= to make sure the match is an id, as I'm not sure whether you want to include links like masters=9173.
library(rvest)
library(stringi)
url <- "https://www.worldloppet.com/results/"
page <- read_html(url)
string <- html_attr(html_elements(page, "a"), "href")
matches <- stri_extract_all_regex(string, "(?<=id=).*", simplify = T)
as.integer(matches[!is.na(matches)])
# first 5
[1] 2570 1818 1817 2518 2517
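To connect this back to the loop in the question, you could iterate over the harvested IDs instead of 1:max(event_id). A sketch reusing matches from above:
event_ids <- sort(unique(as.integer(matches[!is.na(matches)])))

event_rows <- lapply(event_ids, function(event_id) {
  race_url <- paste0("https://www.worldloppet.com/browse/?id=", event_id)
  heading  <- html_text2(html_elements(read_html(race_url), "h2"))[1]
  Sys.sleep(3)  # keep the polite delay from the original loop
  data.frame(event_id = event_id, info = heading)
})
event_info <- do.call(rbind, event_rows)
html_text2() is used here in place of the gsub() calls, since it already renders <br> as a line break; adjust to taste.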

Resources