Scraping <li> elements with Rvest - r

Good morning,
I'm new to scraping with R, and I'm having a hard time scraping a list of elements from a webpage in a useful format.
This is my script:
library(rvest)
url <- read_html("https://www.pole-emploi.fr/annuaire/provins-77070")
webpage <- url %>%
html_nodes('.zone') %>%
html_text()
webpage
When I run the script, all the elements appear squeezed together without any whitespace between them, which makes sense since each item is enclosed in a single tag.
[1] "77114GouaixHerméNoyen-sur-SeineVilliers-sur-Seine"
[2] "77118BalloyBazoches-lès-BrayGravon"
I would like to have them either like this (or separated by commas):
[1] "77114 Gouaix Hermé Noyen-sur-Seine Villiers-sur-Seine"
[2] "77118 Balloy Bazoches-lès-Bray Gravon"
Or, even better, in a tidy format:
Postal City
77114 Gouaix
77114 Hermé
77114 Noyen-sur-Seine
77114 Villiers-sur-Seine
I have tried to find other selectors or XPaths in the page without success. The most I have managed is to select a single element of the list.
Any help would be greatly appreciated.
Thanks in advance.

Each list element looks like this (truncated for brevity):
<li class="zone">\n<span class="code-postal">77114</span><ul>\n<li>Gouaix</li>\n<li>Hermé</li>\n ...
So, each one has a set of child nodes that look uniform. We can target the <span> and the <li> elements in the nested <ul> to get what you want:
library(rvest)
library(tidyverse)
pg <- read_html("https://www.pole-emploi.fr/annuaire/provins-77070")
html_nodes(pg, ".zone") %>%
  map_df(~{
    data_frame(
      postal = html_node(.x, "span") %>% html_text(trim=TRUE),
      city = html_nodes(.x, "ul > li") %>% html_text(trim=TRUE)
    )
  })
## # A tibble: 95 x 2
## postal city
## <chr> <chr>
## 1 77114 Gouaix
## 2 77114 Hermé
## 3 77114 Noyen-sur-Seine
## 4 77114 Villiers-sur-Seine
## 5 77118 Balloy
## 6 77118 Bazoches-lès-Bray
## 7 77118 Gravon
## 8 77126 Châtenay-sur-Seine
## 9 77126 Égligny
## 10 77134 Les Ormes-sur-Voulzie
## # ... with 85 more rows
The same tidyverse approach with an explicit anonymous function (instead of the .x formula shorthand):
html_nodes(pg, ".zone") %>%
  map_df(function(x) {
    data_frame(
      postal = html_node(x, "span") %>% html_text(trim=TRUE),
      city = html_nodes(x, "ul > li") %>% html_text(trim=TRUE)
    )
  })
And a pure base R version:
elements <- html_nodes(pg, ".zone")
lapply(elements, function(x) {
  data.frame(
    postal = html_text(html_node(x, "span"), trim=TRUE),
    city = html_text(html_nodes(x, "ul > li"), trim=TRUE),
    stringsAsFactors = FALSE
  )
}) -> tmp
Reduce(rbind.data.frame, tmp)
# or
do.call(rbind.data.frame, tmp)
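If you prefer the flat, comma-separated strings from the question's first example output, here is a minimal sketch building on the same selectors (an addition of mine, not part of the original answer):
html_nodes(pg, ".zone") %>%
  map_chr(~ paste(
    html_node(.x, "span") %>% html_text(trim = TRUE),
    html_nodes(.x, "ul > li") %>% html_text(trim = TRUE) %>% paste(collapse = ", ")
  ))
## e.g. "77114 Gouaix, Hermé, Noyen-sur-Seine, Villiers-sur-Seine"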

Related

If a link fails, try again or skip to the next link

I am facing trouble and need help.
I have a list of links (about 9,000) that I am running through in a loop, doing some processing on each one.
The links look like this:
link1
link2
link3
link4
.....
link9000
The trouble is that sometimes the 2nd link fails with a timeout, while other times the 2nd link works and the 400th or some other random link times out instead. Is there any way I can retry a failed link? I have added:
status_c <- httr::GET(Links, config = httr::config(connecttimeout = 150))
but I still get timeouts. Any suggestions? (final_links_bind holds the full list of links.)
Some sample links:
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
for (i in 1:nrow(final_links_bind)) {
  Links <- final_links_bind[i,]
  BP_ID <- final_bp_bind[i,]
  #print(Links)
  status_c <- GET(Links, timeout(120))
  status <- status_code(status_c)
  if (status == "200") {
    url_parse <- read_html(Links)
    col_name <- url_parse %>%
      html_nodes("tr") %>%
      html_text()
    col_name <- stringr::str_remove_all(col_name, "\\\t|\\\n|\\\r")
    pattern_col_no <- grep("využití", col_name)
    col_name <- as.data.frame(col_name)
    method_selected <- col_name[pattern_col_no,]
    WRITE_CSV_DATA <- rbind(WRITE_CSV_DATA, data.frame(BP_ID = c(BP_ID), method_selected = c(method_selected), Links = c(Links)))
    #METHOD_OF_USE <- rbind(method_selected, METHOD_OF_USE)
    print(WRITE_CSV_DATA)
  } else {
    print("LINK NOT WORKING")
    no_Links <- sorted_link[i,]
    not_working_link <- rbind(not_working_link, no_Links)
  }
}
It is not clear how you want the final output, but here is how to scrape and skip links that are not working:
library(rvest)
library(httr2)
library(tidyverse)
Given this data frame of links, notice the third one is not working:
df <- tibble(
  links = c(
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789"
  )
)
# A tibble: 4 × 1
links
<chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
Create a function to scrape the table, specifically the third row of the second table:
get_info <- function(link) {
  cat("Scraping", link, "\n")
  link %>%
    read_html() %>%
    html_table() %>%
    pluck(2) %>%
    slice(3) %>%
    pull(2)
}
Then mutate() a new column with the info. If a link is not working, possibly() returns NA (NA_character_) instead of stopping the code.
df %>%
  mutate(
    info = map_chr(links, possibly(get_info, otherwise = NA_character_))
  )
# A tibble: 4 × 2
links info
<chr> <chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711 rodinný dům
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703 rodinný dům
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999 NA
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789 rodinný dům
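The question also asks about retrying a failed link a few times before giving up. One possible addition (my own suggestion, not part of the answer above) is to wrap the scraper in purrr::insistently(), which retries a function when it errors, while keeping possibly() as the final fallback:
# Retry get_info() up to 3 times with exponential backoff before giving up;
# if it still fails, possibly() returns NA_character_ as before.
get_info_retry <- insistently(
  get_info,
  rate = rate_backoff(pause_base = 2, max_times = 3),
  quiet = FALSE
)
df %>%
  mutate(
    info = map_chr(links, possibly(get_info_retry, otherwise = NA_character_))
  )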

How to find all Event IDs in an efficient way?

How could I crawl this database with rvest to identify all tournament IDs for each year? Currently, I'm just going from 1:max(event_id), which is really a drain on compute time.
https://www.worldloppet.com/results/
The results filter on the webpage seems to be dynamic, so the URL doesn't change.
outlist <- list()
for (event_id in 2483:2570) {
  # event_id = 2483  # debugging leftover; if left in, it overrides the loop variable
  # update progress
  message('Retrieving Event ', event_id)
  race_url = paste0('https://www.worldloppet.com/browse/?id=', event_id)
  event_info = read_html(race_url) %>%
    html_nodes('h2') %>%
    .[1] %>%
    gsub('<br>', '<br> ', .) %>%
    gsub("<[^>]+>", "", .) %>%
    str_split(., ' ') %>%
    unlist()
  #event_info$eventid <- event_id
  outlist <- c(outlist, list(c(event_id, event_info)))
  # temporary break
  Sys.sleep(3)
}
You can extract all links containing the word browse from the HTML document:
library(tidyverse)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
read_html("https://www.worldloppet.com/results/") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  as.character() %>%
  keep(~ .x %>% str_detect("browse")) %>%
  paste0("https://www.worldloppet.com", .)
#> [1] "https://www.worldloppet.com/browse/?id=2570"
#> [2] "https://www.worldloppet.com/browse/?id=1818"
#> [3] "https://www.worldloppet.com/browse/?id=1817"
#> [4] "https://www.worldloppet.com/browse/?id=2518"
#> [5] "https://www.worldloppet.com/browse/?id=2517"
Created on 2022-02-09 by the reprex package (v2.0.1)
The IDs of the races can be found in the links, which can be extracted using the html_attr function. From there we can use some regex to find the numbers; here I anchor on id= to make sure the link is an event ID, as I'm not sure whether you want to include links like masters=9173.
library(rvest)
library(stringi)
url <- "https://www.worldloppet.com/results/"
page <- read_html(url)
string <- html_attr(html_elements(page, "a"), "href")
matches <- stri_extract_all_regex(string, "(?<=id=).*", simplify = T)
as.integer(matches[!is.na(matches)])
# first 5
[1] 2570 1818 1817 2518 2517
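To reuse the extracted IDs in the original loop instead of 1:max(event_id), here is a simplified sketch (it substitutes html_text(trim = TRUE) for the question's gsub chain, so the <br> handling differs slightly):
library(rvest)
ids <- as.integer(matches[!is.na(matches)])
outlist <- lapply(ids, function(event_id) {
  message("Retrieving Event ", event_id)
  race_url <- paste0("https://www.worldloppet.com/browse/?id=", event_id)
  event_info <- read_html(race_url) %>%
    html_node("h2") %>%
    html_text(trim = TRUE)
  Sys.sleep(3)  # be polite between requests
  c(event_id, event_info)
})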

Extracting Details from Google Earth KML File in R

I am trying to take the details from a series of locations in a Google Earth kml file.
Getting the IDs and coordinates works, but for the name of the location (which sits in the first table cell, i.e. the td tag, of the Description), when I do it for ALL the locations it returns the same value for every one of them (Stratford Road, the name of the first location).
library(sf)
library(tidyverse)
library(rvest)
removeHtmlTags <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}
getHtmlTableCells <- function(htmlString) {
  # Convert html to html doc
  htmldoc <- read_html(htmlString)
  # get html for each cell (i.e. within <td></td>)
  table_cells_with_tags <- html_nodes(htmldoc, "td")
  # remove the html tags (<td></td>)
  return(removeHtmlTags(table_cells_with_tags))[1]
}
download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1","aqms.kml")
locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  mutate(latitude = st_coordinates(geometry)[,1],
         longitiude = st_coordinates(geometry)[,2],
         name = getHtmlTableCells(Description)[1]) %>%
  st_drop_geometry()
Now if I use the function on individual locations and take the first table cell (td), it works, returning Stratford Road and Selly Oak for the first two, as below.
getHtmlTableCells(locations$Description[1])[1]
getHtmlTableCells(locations$Description[2])[1]
What am I doing wrong?
read_html() is not vectorised: it does not accept a vector of different HTML strings to parse. We can apply your function over each element of the vector instead:
locations <- st_read("aqms.kml", stringsAsFactors = FALSE)
locations %>%
  rename(id = Name) %>%
  mutate(latitude = st_coordinates(geometry)[,1],
         longitiude = st_coordinates(geometry)[,2],
         name = sapply(Description, function(x) getHtmlTableCells(x)[1])) %>%
  st_drop_geometry()
#> latitude longitiude name
#> 1 -1.871622 52.45920 Stratford Road
#> 2 -1.934559 52.44513 Selly Oak (Bristol Road)
#> 3 -1.830070 52.43771 Acocks Green
#> 4 -1.898731 52.48180 Colmore Row
#> 5 -1.896764 52.48607 St Chads Queensway
#> 6 -1.891955 52.47990 Moor Street Queensway
#> 7 -1.918173 52.48138 Birmingham Ladywood
#> 8 -1.902121 52.47675 Lower Severn Street
#> 9 -1.786413 52.56815 New Hall
#> 10 -1.874989 52.47609 Birmingham A4540 Roadside
Alternatively, since you're already using regex inside your function, you could use stringr::str_extract (which is vectorised) to extract the text.
library(sf)
library(tidyverse)
locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  mutate(latitude = st_coordinates(geometry)[,1],
         longitiude = st_coordinates(geometry)[,2],
         name = str_extract(Description, '(?<=Location</td> <td>)[^<]+')) %>%
  st_drop_geometry()
Where (?<=Location</td> <td>) is a lookbehind for the Location td tag that precedes our name, and [^<]+ matches anything up to the next tag following the name.
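For example, on a made-up Description snippet (hypothetical input, just to illustrate the regex):
str_extract(
  "<td>Location</td> <td>Stratford Road</td>",
  '(?<=Location</td> <td>)[^<]+'
)
#> [1] "Stratford Road"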
Your getHtmlTableCells function isn't vectorized. If you pass it a single HTML string it works fine, but if you pass it multiple strings it will only process the first. Also, you have put the [1] after the closing parenthesis of return(), where it does nothing; it needs to go inside the parentheses. Once you do this, it is easy to vectorize the function using sapply.
So make a tiny change in your function...
getHtmlTableCells <- function(htmlString) {
  # Convert html to html doc
  htmldoc <- read_html(htmlString)
  # get html for each cell (i.e. within <td></td>)
  table_cells_with_tags <- html_nodes(htmldoc, "td")
  # remove the html tags (<td></td>)
  return(removeHtmlTags(table_cells_with_tags)[1])
}
and vectorize it like this:
download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1","aqms.kml")
locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
rename(id = Name) %>%
mutate(latitude = st_coordinates(geometry)[,1],
longitiude = st_coordinates(geometry)[,2],
name = sapply(as.list(Description), getHtmlTableCells)) %>%
st_drop_geometry()
Which gives the correct result:
locations$name
#> [1] "Stratford Road" "Selly Oak (Bristol Road)"
#> [3] "Acocks Green" "Colmore Row"
#> [5] "St Chads Queensway" "Moor Street Queensway"
#> [7] "Birmingham Ladywood" "Lower Severn Street"
#> [9] "New Hall" "Birmingham A4540 Roadside"

How to get rvest or sapply to skip NA values?

I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs (author_reg), which I'm using to scrape affiliation data. However, I have several columns indicating multiple authors (each of which I need the affiliation data for). When there aren't multiple authors, the cell has an NA value. Some of the columns are mostly NA values so how do I alter my code so it skips the NA values but doesn't delete them?
Here is the code I'm using:
library(rvest)
library(purrr)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500", "NA", "NA")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation_author_1 <- sapply(df$author_reg_1, function(x) {
  links = c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  # here we try both links and store under attempts
  attempts = links %>% map(function(i){
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })
  # the good ones will have "character" class, the failed ones, try-error
  gdlink = which(sapply(attempts, class) != "try-error")
  if(length(gdlink) > 0){
    return(attempts[[gdlink[1]]])
  }
  else{
    return("True 404 error")
  }
})
Thanks in advance for your help!
Looking at the target links, you can try the following approach. First, scrape all the links from https://ideas.repec.org/e/ and build the full URLs. Then, check whether each link exists. (There are about 26,000 links at this URL, and I do not have time to check them all, so I just used 100 URLs in the following demonstration.) Finally, extract all existing links.
library(rvest)
library(httr)
library(tidyverse)
# Get all possible links from this webpage. There are 26665 links.
read_html("https://ideas.repec.org/e/") %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  .[grepl(x = ., pattern = "html")] -> x
# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")
# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]
# Check if each URL exists or not. If FALSE, the link exists.
foo <- sapply(mylinks_samples, http_error)
# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]
Then, for each link, I tried to extract the affiliation information. There are several spots with h3 tags, so I specifically targeted the h3 elements inside the node with id = "affiliation" (via XPath). If there is no affiliation information, R returns character(0); when enframe() is applied, these elements are removed. For instance, pab127 does not have any affiliation information, so there is no entry for that link.
lapply(urls, function(x){
  read_html(x, encoding = "UTF-8") %>%
    html_nodes(xpath = '//*[@id="affiliation"]') %>%
    html_nodes("h3") %>%
    html_text() %>%
    trimws() -> foo
  return(foo)
}) -> mylist
Then, I assigned names to mylist with the links and created a data frame.
names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")
enframe(mylist) %>%
unnest(value)
name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
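If you just want the original sapply approach to skip the NA entries while keeping their positions, a minimal sketch built on the question's own code (not the answer's method) is to return NA early. Note the question stores missing IDs as the string "NA", so both checks are included:
df$affiliation_author_1 <- sapply(df$author_reg_1, function(x) {
  # Keep the row but skip scraping when the ID is missing
  if (is.na(x) || x == "NA") return(NA_character_)
  links <- c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  attempts <- lapply(links, function(i) {
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text(), silent = TRUE)
  })
  gdlink <- which(sapply(attempts, class) != "try-error")
  if (length(gdlink) > 0) attempts[[gdlink[1]]] else "True 404 error"
})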

rvest: for loop/map to pull multiple tables using html_node & html_table

I'm trying to programmatically pull all of the box scores for a given day from Basketball Reference (I used January 4th, 2020, which has multiple games). I started by creating a vector of integers to denote the number of box scores to pull:
games<- c(1:3)
Then I used developer tools from my browser to determine what each table contains (you can use selector gadget):
#content > div.game_summaries > div:nth-child(1) > table.team
Then I used purrr::map to create a list of the tables to pull, using games:
map_list<- map(.x= '', paste, '#content > div.game_summaries > div:nth-child(', games, ') > table.teams',
sep = "")
# check map_list
map_list
Then I tried to run this list through a for loop to generate three tables, using tidyverse and rvest, which delivered an error:
for (i in map_list){
  read_html('https://www.basketball-reference.com/boxscores/') %>%
    html_node(map_list[[1]][i]) %>%
    html_table() %>%
    glimpse()
}
Error in selectr::css_to_xpath(css, prefix = ".//") :
Zero length character vector found for the following argument: selector
In addition: Warning message:
In selectr::css_to_xpath(css, prefix = ".//") :
NA values were found in the 'selector' argument, they have been removed
For reference, if I explicitly denote the html or call the exact item from map_list, the code works as intended (run below items for reference):
read_html('https://www.basketball-reference.com/boxscores/') %>%
  html_node('#content > div.game_summaries > div:nth-child(1) > table.teams') %>%
  html_table() %>%
  glimpse()
read_html('https://www.basketball-reference.com/boxscores/') %>%
  html_node(map_list[[1]][1]) %>%
  html_table() %>%
  glimpse()
How do I make this work with a list? I have looked at other threads but even though they use the same site, they're not the same issue.
Using your current map_list, if you want to use a for loop, this is what you should use:
library(rvest)
for (i in seq_along(map_list[[1]])){
  read_html('https://www.basketball-reference.com/boxscores/') %>%
    html_node(map_list[[1]][i]) %>%
    html_table() %>%
    glimpse()
}
But I think this is simpler, as you don't need map to create map_list since paste is vectorized:
map_list<- paste0('#content > div.game_summaries > div:nth-child(', games, ') > table.teams')
url <- 'https://www.basketball-reference.com/boxscores/'
webpage <- url %>% read_html()
purrr::map(map_list, ~webpage %>% html_node(.x) %>% html_table)
#[[1]]
# X1 X2 X3
#1 Indiana 111 Final
#2 Atlanta 116
#[[2]]
# X1 X2 X3
#1 Toronto 121 Final
#2 Brooklyn 102
#[[3]]
# X1 X2 X3
#1 Boston 111 Final
#2 Chicago 104
This page is reasonably straightforward to scrape. Here is a possible solution: first scrape the game summary nodes (div elements with class="game_summary"). This provides a list of all of the games played, and it allows the use of the html_node function, which guarantees a return for each node, keeping the list sizes equal.
Each game summary is made up of three subtables; the first and third can be scraped directly. The second table does not have a class assigned, which makes it trickier to retrieve.
library(rvest)
page <- read_html('https://www.basketball-reference.com/boxscores/')
#find all of the game summaries on the page
games<-page %>% html_nodes("div.game_summary")
#Each game summary has 3 sub tables
#game score is table 1 of class=teams
#the stats is table 3 of class=stats
# the quarterly score is the second table and does not have a class defined
table1 <- games %>% html_node("table.teams") %>% html_table()
stats <- games %>% html_node("table.stats") %>% html_table()
quarter <- sapply(games, function(g){
  g %>% html_nodes("table") %>% .[2] %>% html_table()
})
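If you then want the game scores as a single data frame rather than a list, a possible follow-up (assuming dplyr is loaded and the columns are the default X1/X2/X3 shown above) is:
library(dplyr)
# Stack the per-game score tables, tagging each row with the game it came from
all_scores <- bind_rows(table1, .id = "game")
all_scores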
