How to find all Event IDs in an efficient way?

How could I crawl this database with rvest to identify all tournament IDs for each year? Currently I'm just looping from 1:max(event_id), which is a real drain on compute time.
https://www.worldloppet.com/results/
The results filter on the webpage seems to be dynamic, so the URL doesn't change.
library(rvest)
library(stringr)

outlist <- list()
for (event_id in 2483:2570) {
  # update progress
  message('Retrieving Event ', event_id)
  race_url <- paste0('https://www.worldloppet.com/browse/?id=', event_id)
  event_info <- read_html(race_url) %>%
    html_nodes('h2') %>%
    .[1] %>%
    as.character() %>%
    gsub('<br>', '<br> ', .) %>%
    gsub("<[^>]+>", "", .) %>%
    str_split(' ') %>%
    unlist()
  outlist <- c(outlist, list(c(event_id, event_info)))
  # polite pause between requests
  Sys.sleep(3)
}

You can extract all links containing the word browse from the HTML document:
library(tidyverse)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
read_html("https://www.worldloppet.com/results/") %>%
html_nodes("a") %>%
html_attr("href") %>%
as.character() %>%
keep(~ .x %>% str_detect("browse")) %>%
paste0("https://www.worldloppet.com",.)
#> [1] "https://www.worldloppet.com/browse/?id=2570"
#> [2] "https://www.worldloppet.com/browse/?id=1818"
#> [3] "https://www.worldloppet.com/browse/?id=1817"
#> [4] "https://www.worldloppet.com/browse/?id=2518"
#> [5] "https://www.worldloppet.com/browse/?id=2517"
Created on 2022-02-09 by the reprex package (v2.0.1)

The IDs can be found in the page's links, which can be extracted using the html_attr function. From there we can use some regex to find the numbers; here I include id= to make sure the match really is an event ID, as I'm not sure whether you want to include links like masters=9173.
library(rvest)
library(stringi)
url <- "https://www.worldloppet.com/results/"
page <- read_html(url)
string <- html_attr(html_elements(page, "a"), "href")
matches <- stri_extract_all_regex(string, "(?<=id=).*", simplify = TRUE)
as.integer(matches[!is.na(matches)])
# first 5
[1] 2570 1818 1817 2518 2517
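Either way, the extracted IDs can then drive the original loop directly, instead of a guessed 1:max range. A sketch reusing the asker's loop:
ids <- as.integer(matches[!is.na(matches)])
for (event_id in ids) {
  message('Retrieving Event ', event_id)
  race_url <- paste0('https://www.worldloppet.com/browse/?id=', event_id)
  # ... same h2 parsing as in the question ...
  Sys.sleep(3)
}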

Related

Scraping book table from goodreads

I'm attempting to scrape a table of read books from the Goodreads website using rvest. The data is formatted as a table; however, I am getting errors when attempting to extract this info.
First we load some packages and set the URL to scrape:
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
Running this code:
dat <- read_html(url) %>%
  html_nodes('//*[#id="booksBody"]') %>%
  html_table()
Produces: Error in tokenize(css) : Unexpected character '/' found at position 1
Trying it again, but without the first /:
dat <- read_html(url) %>%
  html_nodes('/*[#id="booksBody"]') %>%
  html_table()
Produces: Error in parse_simple_selector(stream) : Expected selector, got <EOF at 20>
And finally, just trying to get the table directly, without the intermediate call to html_nodes:
dat <- read_html(url) %>%
  html_table('/*[#id="booksBody"]')
Produces: Error in if (header) { : argument is not interpretable as logical
Would appreciate any guidance on how to scrape this table.
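The first two errors occur because html_nodes() interprets its argument as a CSS selector by default; an XPath expression must be passed via the xpath argument, and XPath attribute tests use @id rather than #id. A minimal sketch of a corrected call (html_table() also wants a <table> node, so this targets the table itself rather than its tbody):
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
# Pass the XPath explicitly; note @id, not #id
dat <- read_html(url) %>%
  html_element(xpath = '//table[@id="books"]') %>%
  html_table()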
Scraping the first 5 pages
library(tidyverse)
library(rvest)
library(httr2)
get_books <- function(page) {
  cat("Scraping page:", page, "\n")
  books <-
    str_c("https://www.goodreads.com/review/list/4622890-emily-may?page=", page,
          "&shelf=%23ALL%23") %>%
    read_html() %>%
    html_elements(".bookalike.review")
  tibble(
    title = books %>%
      html_elements(".title a") %>%
      html_text2(),
    author = books %>%
      html_elements(".author a") %>%
      html_text2(),
    rating = books %>%
      html_elements(".avg_rating .value") %>%
      html_text2() %>%
      as.numeric(),
    date = books %>%
      html_elements(".date_added .value") %>%
      html_text2() %>%
      lubridate::mdy()
  )
}
df <- map_dfr(0:5, get_books)
# A tibble: 180 x 4
title author rating date
<chr> <chr> <dbl> <date>
1 Sunset "Cave~ 4.19 2023-01-14
2 Green for Danger (Inspector Cockrill~ "Bran~ 3.84 2023-01-12
3 Stone Cold Fox "Crof~ 4.22 2023-01-12
4 What If I'm Not a Cat? "Wint~ 4.52 2023-01-10
5 The Prisoner's Throne (The Stolen He~ "Blac~ 4.85 2023-01-07
6 The Kind Worth Saving (Henry Kimball~ "Swan~ 4.13 2023-01-06
7 Girl at War "Novi~ 4 2022-12-29
8 If We Were Villains "Rio,~ 4.23 2022-12-29
9 The Gone World "Swet~ 3.94 2022-12-28
10 Batman: The Dark Knight Returns "Mill~ 4.26 2022-12-28
# ... with 170 more rows
# ℹ Use `print(n = ...)` to see more rows
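When mapping over several pages like this, it is polite to pause between requests; purrr::slowly() can wrap the scraper. A small addition not in the original answer:
# Rate-limited variant of get_books: waits 2 seconds between successive calls
get_books_slowly <- slowly(get_books, rate = rate_delay(2))
df <- map_dfr(0:5, get_books_slowly)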
I can get the first 30 books using this:
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
book_table <- read_html(url) %>%
  html_elements('table#books') %>%
  html_table() %>%
  .[[1]]
book_table
You might need to do some cleaning of the captured data. Moreover, to get the complete list, I am afraid rvest alone would not be enough: you might need something like RSelenium to scroll through the list, as sketched below.
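A minimal RSelenium sketch of that idea, assuming a working local Selenium driver; untested against Goodreads, so the scroll count and timings are illustrative only:
library(RSelenium)
library(rvest)
driver <- rsDriver(browser = "firefox", port = 4555L)
remDr <- driver$client
remDr$navigate("https://www.goodreads.com/review/list/4622890?shelf=read")
# Scroll a few times so the infinite-scroll list loads more rows
for (i in 1:5) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)
}
book_table <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_element("table#books") %>%
  html_table()
remDr$close()
driver$server$stop()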

Get maps location without clicking on it

I have a URL from Google which points to a location given its latitude and longitude, as shown below.
Currently the user has to click on the link below to navigate to the map.
So I wanted to check whether we can obtain the location without clicking on it.
HTML('<a href="https://maps.google.com/?q=50.89091,14.86668">Google Maps</a>')
How about this:
library(rvest)
library(stringr)
h <- read_html(htmltools::HTML('<a href="https://maps.google.com/?q=50.89091,14.86668">Google Maps</a>'))
h %>%
  html_elements("a") %>%
  html_attr("href") %>%
  gsub(".*\\?q\\=(.*)$", "\\1", .) %>%
  str_split(., ",", simplify = TRUE) %>%
  as.numeric()
#> [1] 50.89091 14.86668
Created on 2022-12-29 by the reprex package (v2.0.1)
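An alternative that avoids hand-rolled regex is to let httr::parse_url() pull the q parameter out of the query string. A sketch, with the href spelled out as a hypothetical example of the same ?q=lat,lng shape:
library(httr)
href <- "https://maps.google.com/?q=50.89091,14.86668"  # hypothetical example link
coords <- parse_url(href)$query$q
as.numeric(strsplit(coords, ",")[[1]])
#> [1] 50.89091 14.86668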

Scraping 'Artsy' using rvest

I am trying to get information from Artsy using the rvest package in R. I want the name of the painting, year, price, place (name of gallery, auction, etc.), name of the artist, and the materials that are used. Information on the material is provided on the inner page of each painting. The code that I tried is provided below:
library(rvest)
library(dplyr)
library(tidyverse)
get_material = function(painting_link) {
  painting_page = read_html(painting_link)
  material = painting_page %>% html_nodes('h2+ .kPqROo') %>%
    html_text() %>% paste(collapse = ",")
  return(material)
}
for (page_result in 2:3) {
  link = paste0("https://www.artsy.net/collect?page=", page_result, "&additional_gene_ids%5B0%5D=painting")
  page = read_html(link)
  painting_name_year = page %>% html_nodes("#main .kjRHrZ") %>% html_text()
  painting_link = page %>% html_nodes('#main .kjRHrZ') %>% html_attr("<div color="black60" font-family="sans" class="Box-sc-15se88d-0 Text-sc-18gcpao-0 kjRHrZ">\n<i>") %>% paste("https://www.artsy.net", ., sep="/")
  price = page %>% html_nodes('.ibabyz') %>% html_text()
  place = page %>% html_nodes('hWKLzd') %>% html_text()
  artist = page %>% html_nodes('.bQOCym .bQOCym') %>% html_text()
  material = sapply(painting_link, FUN = get_material, USE.NAMES = FALSE)
}
artsy <- data.frame(painting_name_year, price, place, artist)
view(artsy)
The code for painting_link, place, and material is not working. Moreover, each observation is repeated three times. How can I fix this problem?
You can remove the loop. First, generate the start URL list. Then, rather than scraping some info from the landing pages before visiting the individual listing pages, gather all the URLs of the individual listings first.
You can then gain a little efficiency by working across more CPU cores, gathering the data you want from all the listings via a function call against each URL.
N.B. As this operation is I/O bound, you would likely see better efficiency with an asynchronous method. If I can find a decent tutorial/reference on this I will maybe update this answer.
If the function returns a tibble of the desired info for each listing URL, you can generate the final dataframe by calling future_map_dfr on the listing links and the user-defined function.
library(purrr)
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.3
#> Warning: package 'forcats' was built under R version 4.0.3
library(jsonlite)
#> Warning: package 'jsonlite' was built under R version 4.0.3
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
library(furrr)
#> Warning: package 'furrr' was built under R version 4.0.3
#> Loading required package: future
#> Warning: package 'future' was built under R version 4.0.3
library(stringr)
get_art_links <- function(link) {
  hrefs <- read_html(link) %>%
    html_nodes("[href*=artwork][class]") %>%
    html_attr("href") %>%
    paste0("https://www.artsy.net", .)
  return(hrefs)
}
get_listing_json <- function(page) {
  data <- page %>%
    html_node('[type="application/ld+json"]') %>%
    html_text() %>%
    jsonlite::parse_json()
  return(data)
}
get_listing_info <- function(link) {
  page <- read_html(link)
  json <- get_listing_json(page)
  artist <- json$brand$name
  title <- page %>%
    html_node('[data-test="artworkSidebar"] h2 > i') %>%
    html_text()
  production_date <- json$productionDate
  material <- page %>%
    html_node('[data-test="artworkSidebar"] h2 + div') %>%
    html_text()
  width <- json$width
  height <- json$height
  place <- stringr::str_match(json$description, "from (.*?),")[, 2]
  price <- json$offers$price
  currency <- json$offers$priceCurrency
  availability <- str_replace(json$offers$availability, "https://schema.org/", "")
  return(tibble(artist, title, production_date, material, width, height, place, price, currency, availability))
}
pages <- 2:3 %>% as.character()
urls <- sprintf("https://www.artsy.net/collect?page=%s&additional_gene_ids[0]=painting", pages)
links <- purrr::map(urls, get_art_links) %>%
  unlist()
no_cores <- future::availableCores() - 1
future::plan(future::multisession, workers = no_cores)
results <- future_map_dfr(links, .f = get_listing_info)
Created on 2021-05-16 by the reprex package (v0.3.0)
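Since any single listing page can fail (timeouts, delisted works), it may be worth wrapping the scraper in purrr::possibly() so one bad URL does not abort the whole run. A hedged addition, not part of the original answer:
# Return an empty tibble instead of an error when a single listing fails
safe_get_listing_info <- purrr::possibly(get_listing_info, otherwise = tibble())
results <- future_map_dfr(links, .f = safe_get_listing_info)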

How to check whether an XML node set is empty in R?

I'm writing a function that iterates over XML nodes in R; for this, I've been looking for a verb that affirms or denies the presence of an empty XML node set (something like isEmptyNodeSet).
In other words, a function that returns TRUE if a case like the following occurs:
library(magrittr)
library(rvest)
#> Loading required package: xml2
library(xml2)
"https://www.admin.ch/ch/d/gg/pc/ind2010.html" %>%
read_html() %>%
html_nodes("a.adminCHlink, div#spalteContentPlus h2 ~ ul") %>%
.[[1]] %>%
html_nodes("strong")
#> {xml_nodeset (0)}
Created on 2019-01-12 by the reprex package (v0.2.1)
Thanks so much in advance (and sorry if the answer is obvious, I'm an XML rookie)!
Either use is_empty <- function(x) if (length(x) == 0) TRUE else FALSE (thanks @Chase).
Or use rlang::is_empty() or purrr::is_empty(), which do exactly the same thing.
The code then becomes:
library(magrittr)
library(rvest)
#> Loading required package: xml2
library(xml2)
"https://www.admin.ch/ch/d/gg/pc/ind2010.html" %>%
read_html() %>%
html_nodes("a.adminCHlink, div#spalteContentPlus h2 ~ ul") %>%
.[[1]] %>%
html_nodes("strong") %>%
rlang::is_empty()
#> [1] TRUE
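For the original use case of iterating over nodes, the predicate drops straight into a condition. A minimal sketch, with page standing in for whatever document is being iterated:
nodes <- page %>% html_nodes("strong")
if (rlang::is_empty(nodes)) {
  message("Empty node set, skipping")
}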

Scraping <li> elements with Rvest

Good morning,
I'm new to scraping with R, and I'm having a hard time scraping a list of elements from a webpage in a useful manner.
This is my script:
library(rvest)
url <- read_html("https://www.pole-emploi.fr/annuaire/provins-77070")
webpage <- url %>%
  html_nodes('.zone') %>%
  html_text()
webpage
When I run the script, all the elements appear squeezed together without any whitespace between them, which is understandable since each item is enclosed in a single tag.
[1] "77114GouaixHerméNoyen-sur-SeineVilliers-sur-Seine"
[2] "77118BalloyBazoches-lès-BrayGravon"
I would like to have them either like this (or separated by commas):
[1] "77114 Gouaix Hermé Noyen-sur-Seine Villiers-sur-Seine"
[2] "77118 Balloy Bazoches-lès-Bray Gravon"
Or, even better, in a tidy format:
Postal City
77114 Gouaix
77114 Hermé
77114 Noyen-sur-Seine
77114 Villiers-sur-Seine
I have tried to find other selectors or XPaths in the page without success; the most I have managed is to select a single element of the list.
Any help would be greatly appreciated.
Thanks in advance.
Each list element looks like this (truncated for brevity):
<li class="zone">\n<span class="code-postal">77114</span><ul>\n<li>Gouaix</li>\n<li>Hermé</li>\n ...
So, each one has a set of child nodes that look uniform. We can target the <span> and the <li> elements in the nested <ul> to get what you want:
library(rvest)
library(tidyverse)
pg <- read_html("https://www.pole-emploi.fr/annuaire/provins-77070")
html_nodes(pg, ".zone") %>%
map_df(~{
data_frame(
postal = html_node(.x, "span") %>% html_text(trim=TRUE),
city = html_nodes(.x, "ul > li") %>% html_text(trim=TRUE)
)
})
## # A tibble: 95 x 2
## postal city
## <chr> <chr>
## 1 77114 Gouaix
## 2 77114 Hermé
## 3 77114 Noyen-sur-Seine
## 4 77114 Villiers-sur-Seine
## 5 77118 Balloy
## 6 77118 Bazoches-lès-Bray
## 7 77118 Gravon
## 8 77126 Châtenay-sur-Seine
## 9 77126 Égligny
## 10 77134 Les Ormes-sur-Voulzie
## # ... with 85 more rows
The tidyverse method with an explicit anonymous function (vs. .x via a formula function):
html_nodes(pg, ".zone") %>%
map_df(function(x) {
data_frame(
postal = html_node(x, "span") %>% html_text(trim=TRUE),
city = html_nodes(x, "ul > li") %>% html_text(trim=TRUE)
)
})
And a pure base R version:
elements <- html_nodes(pg, ".zone")
tmp <- lapply(elements, function(x) {
  data.frame(
    postal = html_text(html_node(x, "span"), trim = TRUE),
    city = html_text(html_nodes(x, "ul > li"), trim = TRUE),
    stringsAsFactors = FALSE
  )
})
Reduce(rbind.data.frame, tmp)
# or
do.call(rbind.data.frame, tmp)
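A side note on package evolution: data_frame() has since been deprecated in favour of tibble(), and html_node()/html_nodes() are superseded by html_element()/html_elements(). The same approach in current spelling, as a sketch:
library(rvest)
library(purrr)
library(tibble)
pg <- read_html("https://www.pole-emploi.fr/annuaire/provins-77070")
html_elements(pg, ".zone") %>%
  map_df(~ tibble(
    postal = html_element(.x, "span") %>% html_text2(),
    city = html_elements(.x, "ul > li") %>% html_text2()
  ))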
