How to skip a page scrape when table is missing in R

I'm building a scraper that pulls the name of a player and the years he played, for thousands of different players. I have built an otherwise successful function to do this, but unfortunately in some instances the table with the other half of the data I need (years played) does not exist. For those instances, I'd like to tell the scraper to skip the page. Here is the code:
(note: the object "url_final" is the list of active webpage URLs of which there are many)
library(rvest)
library(curl)
library(tidyverse)
library(httr)
df <- map_dfr(.x = url_final,
              .f = function(x) {
                Sys.sleep(.3); cat(1)
                fyr <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
                  html_table() %>%
                  .[[1]]
                fyr <- fyr %>%
                  select(1) %>%
                  mutate(name = str_extract(string = x, pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
              })
Here is an example of an active page in which you can recreate the scrape by replacing "url_final" as the .x call in map_dfr with:
https://www.sports-reference.com/cbb/players/karl-aaker-1.html
Here is an example of one of the instances in which there is no table, which returns an error and breaks the loop of the scrape:
https://www.sports-reference.com/cbb/players/karl-aaker-1.html

How about adding tryCatch, which will ignore any errors?
library(tidyverse)
library(rvest)
df <- map_dfr(.x = url_final,
              .f = function(x) {
                Sys.sleep(.3); cat(1)
                tryCatch({
                  fyr <- read_html(curl::curl(x,
                                              handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
                    html_table() %>%
                    .[[1]]
                  fyr <- fyr %>%
                    select(1) %>%
                    mutate(name = str_extract(string = x,
                                              pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
                }, error = function(e) message('Skipping url', x))
              })
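An alternative to tryCatch is purrr's possibly(), which wraps a function so that any error returns a fallback value; since map_dfr() simply drops NULL results, failing pages are skipped. A minimal sketch along the same lines as the code above (the scrape_player helper name is just illustrative):
library(rvest)
library(curl)
library(tidyverse)

# Illustrative helper: scrape one player page (same logic as the map body above)
scrape_player <- function(x) {
  Sys.sleep(.3); cat(1)
  read_html(curl::curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
    html_table() %>%
    .[[1]] %>%
    select(1) %>%
    mutate(name = str_extract(string = x, pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
}

# possibly() turns errors (e.g. no table on the page) into NULL, which map_dfr drops
df <- map_dfr(url_final, possibly(scrape_player, otherwise = NULL))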

Related

Launch web browser and copy information contained R

I'm trying to find a way to copy-paste the title and the abstract from a PubMed page.
I started using
browseURL("https://pubmed.ncbi.nlm.nih.gov/19592249") ## final numbers are the PMID
Now I can't find a way to obtain the title and the abstract as plain text. I have to do this for multiple PMIDs, so I need to automate it. It would also be fine to simply copy everything on that page and then keep only the parts I need.
Is it possible to do that? Thanks!
I suppose what you're trying to do is scrape PubMed for articles of interest?
Here's one way to do this using the rvest package:
#Required libraries.
library(magrittr)
library(rvest)
#Function.
getpubmed <- function(url){
  dat <- rvest::read_html(url)
  pid <- dat %>% html_elements(xpath = '//*[@title="PubMed ID"]') %>% html_text2() %>% unique()
  ptitle <- dat %>% html_elements(xpath = '//*[@class="heading-title"]') %>% html_text2() %>% unique()
  pabs <- dat %>% html_elements(xpath = '//*[@id="enc-abstract"]') %>% html_text2()
  return(data.frame(pubmed_id = pid, title = ptitle, abs = pabs, stringsAsFactors = FALSE))
}
#Test run.
urls <- c("https://pubmed.ncbi.nlm.nih.gov/19592249", "https://pubmed.ncbi.nlm.nih.gov/22281223/")
df <- do.call("rbind", lapply(urls, getpubmed))
The code should be fairly self-explanatory. (I've not added the contents of df here for brevity.) The function getpubmed does no error-handling or anything of that sort, but it is a start. By supplying a vector of URLs to the do.call("rbind", lapply(urls, getpubmed)) construct, you can get back a data.frame consisting of the PubMed ID, title, and abstract as columns.
Another option would be to explore the easyPubMed package.
I would also use a function and rvest. However, I would pass the PMID in as the function's argument, use html_node as only a single node needs to be matched, and use faster CSS selectors. String cleaning is done via the stringr package:
library(rvest)
library(stringr)
library(dplyr)
get_abstract <- function(pid){
  page <- read_html(paste0('https://pubmed.ncbi.nlm.nih.gov/', pid))
  df <- tibble(
    title = page %>% html_node('.heading-title') %>% html_text() %>% str_squish(),
    abstract = page %>% html_node('#enc-abstract') %>% html_text() %>% str_squish()
  )
  return(df)
}
get_abstract('19592249')
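To run this over several PMIDs, the function can simply be mapped over a vector of IDs and the rows bound together; a minimal usage sketch with purrr (the IDs are the two examples used above):
library(purrr)

# Map get_abstract over a vector of PMIDs and bind the resulting tibbles into one
pids <- c('19592249', '22281223')
abstracts <- map_dfr(pids, get_abstract)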

Naming Error when nesting df to scrape multiple pages using Rvest, Glue and Purrr

I'm trying to scrape football data for weeks 1-17 and by position (QB, RB, WR, TE). I've successfully created a function to scrape the data, but after I create a scaffold and try to add a third column to nest the weekly page-per-position results in, I get the error at the bottom:
library(tidyverse)
library(rvest)
library(glue)
url <- "https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1"
scrape_19 <- function(week, position) {
  Sys.sleep(3)
  cat(".")
  url <- glue("https://fantasy.nfl.com/research/scoringleaders?position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}")
  read_html(url) %>%
    html_nodes("table") %>%
    html_table(header = T) %>%
    simplify() %>%
    first() %>%
    setNames(paste0(colnames(.), as.character(.[1, ]))) %>%
    slice(-1) %>%
    list()
}
weeks <- 1:17
positions <- c("QB","RB","WR","TE") #only qb rb wr te as all columns
match - will do k and dst separately later
scaffold <- tibble(week = weeks,
                   position = list(positions)) %>%
  tidyr::unnest()
scaffold
# here's where I get an error and get stuck
tbl_data <- scaffold %>%
  mutate(data = purrr::map2(week, position, ~scrape_19(.x, .y)[[1]]))
Error in names(object) <- nm : attempt to set an attribute on NULL
I feel like I'm so close, yet so far.
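One likely cause is that for some week/position combinations the page contains no table, so html_nodes("table") matches nothing, first() returns NULL, and setNames() fails. A hedged sketch of one way to guard the function (returning an empty slot for such pages so map2 keeps going), assuming that is indeed the failure mode and that the libraries from the question are loaded:
# Hypothetical guarded variant of scrape_19: skip pages without a table instead of erroring
scrape_19_safe <- function(week, position) {
  Sys.sleep(3)
  cat(".")
  url <- glue("https://fantasy.nfl.com/research/scoringleaders?position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}")
  tables <- read_html(url) %>% html_nodes("table") %>% html_table(header = T)
  if (length(tables) == 0) return(list(NULL))  # no table on this page: return an empty slot
  tables %>%
    simplify() %>%
    first() %>%
    setNames(paste0(colnames(.), as.character(.[1, ]))) %>%
    slice(-1) %>%
    list()
}

tbl_data <- scaffold %>%
  mutate(data = purrr::map2(week, position, ~scrape_19_safe(.x, .y)[[1]]))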

Scraping pages with inconsistent lengths in dataframe

I want to scrape all the names from this page, with the result being one tibble of three columns. My code only works if all the data is there, hence my error:
Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 20: Columns `huisarts`, `url`
* Length 21: Column `praktijk`
How can I let my code run but fill the tibble with NAs where the data isn't there?
My code for a pausing robot, later used in the scraper function:
pauzing_robot <- function(periods = c(0, 1)) {
  tictoc <- runif(1, periods[1], periods[2])
  cat(paste0(Sys.time()),
      "- Sleeping for ", round(tictoc, 2), "seconds\n")
  Sys.sleep(tictoc)
}
Scraper:
library(tidyverse)
library(rvest)
scrape_page <- function(pagina_nummer) {
  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))
  pauzing_robot(periods = c(0, 1.5))
  tibble(
    huisarts = page %>%
      html_nodes(".media-heading.title.orange") %>%
      html_text() %>%
      str_trim(),
    praktijk = page %>%
      html_nodes(".location") %>%
      html_text() %>%
      str_trim(),
    url = page %>%
      html_nodes(".media-heading.title.orange") %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      str_trim() %>%
      paste0("https://www.zorgkaartnederland.nl", .)
  )
}
The total number of pages is 445, but for example's sake only scraping three:
huisartsen <- map_df(sample(1:3), scrape_page)
Page 2 seems to be the problem with inconsistent lengths because this code works:
huisartsen <- map_df(3:4, scrape_page)
If possible with tidyverse code. Thanks in advance.
You need to retrieve the list of parent nodes:
parents <- page %>% html_nodes("li.media")
Then parse the parent nodes with the html_node() function:
tibble(
  huisarts = parents %>%
    html_node(".media-heading.title.orange") %>%
    html_text() %>%
    str_trim(),
  praktijk = parents %>%
    html_node(".location") %>%
    html_text() %>%
    str_trim(),
  url = parents %>%
    html_node(".media-heading.title.orange a") %>%
    html_attr("href") %>%
    str_trim() %>%
    paste0("https://www.zorgkaartnederland.nl", .)
)
The html_node function will always return a value, even if it is just NA.
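Folding that parent-node approach back into the question's scrape_page function might look like the sketch below (selectors taken from the answer above; pauzing_robot is the function defined in the question). Fields that are missing for a listing come back as NA rather than shortening the column:
library(tidyverse)
library(rvest)

scrape_page <- function(pagina_nummer) {
  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))
  pauzing_robot(periods = c(0, 1.5))
  # One row per listing: parse within each parent node so missing fields become NA
  parents <- page %>% html_nodes("li.media")
  tibble(
    huisarts = parents %>% html_node(".media-heading.title.orange") %>% html_text() %>% str_trim(),
    praktijk = parents %>% html_node(".location") %>% html_text() %>% str_trim(),
    url = parents %>% html_node(".media-heading.title.orange a") %>% html_attr("href") %>%
      str_trim() %>% paste0("https://www.zorgkaartnederland.nl", .)
  )
}

huisartsen <- map_df(1:3, scrape_page)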

Looping through a list of webpages with rvest follow_link

I'm trying to web-scrape the government release calendar (https://www.gov.uk/government/statistics) and use the rvest follow_link functionality to go to each publication link and scrape text from the next page. I have this working for each single page of results (40 publications are displayed per page), but I can't get a loop to work so that I can run the code over all publications listed.
This is the code I run first to get the list of publications (just from the first 10 pages of results):
#Loading the rvest package
library('rvest')
library('dplyr')
library('tm')
#######PUBLISHED RELEASES################
###function to add number after 'page=' in url to loop over all pages of published releases results (only 40 publications per page)
###check the site and see how many pages you want to scrape, to cover months of interest
##titles of publications - creates a list
publishedtitles <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
                          function(url_base){
                            url_base %>% read_html() %>%
                              html_nodes('h3 a') %>%
                              html_text()
                          })
##Dates of publications
publisheddates <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
                         function(url_base){
                           url_base %>% read_html() %>%
                             html_nodes('.public_timestamp') %>%
                             html_text()
                         })
##Organisations
publishedorgs <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
                        function(url_base){
                          url_base %>% read_html() %>%
                            html_nodes('.organisations') %>%
                            html_text()
                        })
##Links to publications
publishedpartial_links <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
                                 function(url_base){
                                   url_base %>% read_html() %>%
                                     html_nodes('h3 a') %>%
                                     html_attr('href')
                                 })
#Check all lists are the same length - if not, have to deal with missings before next step
# length(publishedtitles)
# length(publisheddates)
# length(publishedorgs)
# length(publishedpartial_links)
#str(publishedorgs)
#Combining all the lists to form a data frame
published <- data.frame(Title = unlist(publishedtitles),
                        Date = unlist(publisheddates),
                        Organisation = unlist(publishedorgs),
                        PartLinks = unlist(publishedpartial_links))
#adding prefix to partial links, to turn into full URLs
published$Links = paste("https://www.gov.uk", published$PartLinks, sep="")
#Drop partial links column
keeps <- c("Title", "Date", "Organisation", "Links")
published <- published[keeps]
Then I want to run something like the below, but over all pages of results. I've run this code manually, changing the parameters for each page, so I know it works.
session1 <- html_session("https://www.gov.uk/government/statistics?page=1")
list1 <- list()
for(i in published$Title[1:40]){
  nextpage1 <- session1 %>% follow_link(i) %>% read_html()
  list1[[i]] <- nextpage1 %>%
    html_nodes(".grid-row") %>% html_text()
  df1 <- data.frame(text = list1)
  df1 <- as.data.frame(t(df1))
}
So the above would need to change page=1 in the html_session, and also the published$Title[1:40] - I'm struggling with creating a function or loop that includes both variables.
I think I should be able to do this using lapply:
df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
             function(url_base){
               for(i in published$Title[1:40]){
                 nextpage1 <- url_base %>% follow_link(i) %>% read_html()
                 list1[[i]] <- nextpage1 %>%
                   html_nodes(".grid-row") %>% html_text()
               }
             })
But I get the error
Error in follow_link(., i) : is.session(x) is not TRUE
I've also tried other methods of looping and turning it into a function but didn't want to make this post too long!
Thanks in advance for any suggestions and guidance :)
It looks like you may just need to start a session inside the lapply function. In the last chunk of code, url_base is simply a text string that gives the base URL. Would something like this work:
df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
             function(url_base){
               for(i in published$Title[1:40]){
                 tmpSession <- html_session(url_base)
                 nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
                 list1[[i]] <- nextpage1 %>%
                   html_nodes(".grid-row") %>% html_text()
               }
             })
To change the published$Title[1:40] for each iteration of the lapply function, you could make an object that holds the lower and upper bounds of the indices:
lowers <- cumsum(c(1, rep(40, 9)))
uppers <- cumsum(rep(40, 10))
Then, you could include those in the call to lapply
df <- lapply(1:10, function(j){
  url_base <- paste0('https://www.gov.uk/government/statistics?page=', j)
  for(i in published$Title[lowers[j]:uppers[j]]){
    tmpSession <- html_session(url_base)
    nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
    list1[[i]] <- nextpage1 %>%
      html_nodes(".grid-row") %>% html_text()
  }
})
I'm not sure if this is what you want or not; I might have misunderstood the things that are supposed to be changing.
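One version caveat, not part of the original answer: in rvest 1.0 and later, html_session() and follow_link() were renamed to session() and session_follow_link(). A minimal sketch of the same loop body under the newer names (published is the data frame built earlier in the question):
library(rvest)

# Same logic as above, using the rvest >= 1.0 function names
list1 <- list()
for (i in published$Title[1:40]) {
  tmpSession <- session("https://www.gov.uk/government/statistics?page=1")
  nextpage1 <- tmpSession %>% session_follow_link(i)
  list1[[i]] <- nextpage1 %>% html_nodes(".grid-row") %>% html_text()
}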

rvest web content scraping issue / car trading website

Question
I wanted to rvest specific parts of the website (a car sales platform).
The CSS is frankly too confusing for me to figure out what's wrong on my own.
#### scraping the website www.otomoto.pl with used cars #####
baseURL_otomoto = "https://www.otomoto.pl/osobowe/?page="
i <- 1
for (i in 1:7000) {
  link = paste0(baseURL_otomoto, i)
  out = read_html(link)
  print(i)
  print(link)
  ### building year
  build_year = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  mileage = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[2]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  volume = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[3]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  fuel_type = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[4]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  price = html_nodes(out, xpath = '//div[@class="offer-item__price"]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  link = html_nodes(out, xpath = '//div[@class="offer-item__title"]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  offer_details = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
}
Any guesses what might be the reason for this behaviour?
PS#1.
How do I rvest all the build_year, mileage and fuel_type data from the offers available on the analysed website at once, as a data.frame? Using classes (xpath = '//div[@class=...) didn't work in my case.
PS#2.
I wanted to rvest details of the actual offers using, for instance:
gear_type = html_nodes(out, xpath = '//*[@id="parameters"]/ul[1]/li[10]/div') %>%
  html_text() %>%
  str_replace_all("\n", "") %>%
  str_replace_all("\r", "") %>%
  str_trim()
The arguments in ul[a] are for a in (1:2), and in li[b] for b in (1:12). Unfortunately, this concept fails, as the resulting data frame is empty. Any guesses why?
First and foremost, learn about CSS selectors and XPath. Your selectors are very long and extremely fragile (some of them did not work for me at all, a mere two weeks later). For example, instead of:
html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
  html_text()
you can write:
html_nodes(out, css="[data-code=year]") %>% html_text()
Second, read the documentation of the libraries that you use. The str_replace_all pattern may be a regular expression, which saves you one call (use str_replace_all("[\n\r]", "") instead of str_replace_all("\n", "") %>% str_replace_all("\r", "")). html_text can do text trimming for you, which means that str_trim() is not needed at all.
Third, if you copy-paste some code, step back and think whether a function wouldn't be a better solution; usually it would. In your case, personally, I would probably skip the str_replace_all calls until the data-cleaning step, when I would call them on the data.frame holding the entire scraped data.
To create a data.frame from your data, call the data.frame() function with column names and content, like this:
data.frame(build_year = build_year,
           mileage = mileage,
           volume = volume,
           fuel_type = fuel_type,
           price = price,
           link = link,
           offer_details = offer_details)
Or you could initialize the data.frame with one column only and then add further vectors as columns:
output_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
output_df$volume <- html_nodes(out, css="[data-code=engine_capacity]") %>%
  html_text(TRUE)
Finally, you should note that data.frame columns must all be the same length, while some of the data that you scrape is optional. At the moment of writing this answer I had a few offers without engine capacity and without an offer description. You have to use two html_nodes calls in succession (as a single CSS selector will not match what doesn't exist). But even then, html_nodes will silently drop missing data. This can be worked around by piping the html_nodes output into an html_node call:
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)
The final version of my approach to the loop internals is below. Just make sure that you initialize an empty data.frame before calling it, and that you merge the output of the current iteration with the final data frame (using, for example, rbind), or each iteration will overwrite the results of the previous one. Or you could use do.call(rbind, lapply()), which is idiomatic R for such a task; see the sketch after the code block below.
As a side note, when scraping a large amount of quickly changing data, consider decoupling the data-downloading and data-processing steps. Imagine that there is some corner case you haven't accounted for which will cause R to terminate. How will you proceed if such a condition appears in the middle of your iterations? The longer you stay on one page, the more duplicates you introduce (as more offers appear and existing ones are pushed down onto further pages), and the more offers you miss (as sales are concluded and offers disappear forever).
current_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
current_df$mileage = html_nodes(out, css="[data-code=mileage]") %>%
  html_text(TRUE)
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)
current_df$fuel_type = html_nodes(out, css="[data-code=fuel_type]") %>%
  html_text(TRUE)
current_df$price = out %>% html_nodes(xpath="//div[@class='offer-price']//span[contains(@class, 'number')]") %>%
  html_text(TRUE)
current_df$link = out %>% html_nodes(css = "div.offer-item__title h2 > a") %>%
  html_text(TRUE) %>%
  str_replace_all("[\n\r]", "")
current_df$offer_details = out %>% html_nodes("div.offer-item__title") %>%
  html_node("h3") %>%
  html_text(TRUE)
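For completeness, a hedged sketch of the do.call(rbind, lapply()) pattern mentioned above, with the loop internals collapsed into a helper built only from the selectors already used in this answer. The scrape_otomoto_page name and the page range are illustrative, and the selectors may no longer match the live site:
library(rvest)

# Illustrative helper: one results page -> one data.frame of offers
scrape_otomoto_page <- function(page_number) {
  out <- read_html(paste0("https://www.otomoto.pl/osobowe/?page=", page_number))
  # Parse within each offer's parameter list so missing fields come back as NA
  params <- out %>% html_nodes("ul.offer-item__params")
  data.frame(
    build_year = params %>% html_node("[data-code=year]") %>% html_text(TRUE),
    mileage    = params %>% html_node("[data-code=mileage]") %>% html_text(TRUE),
    volume     = params %>% html_node("[data-code=engine_capacity]") %>% html_text(TRUE),
    fuel_type  = params %>% html_node("[data-code=fuel_type]") %>% html_text(TRUE),
    stringsAsFactors = FALSE
  )
}

# Idiomatic replacement for rbind-ing inside a for loop (here only the first 5 pages)
all_offers <- do.call(rbind, lapply(1:5, scrape_otomoto_page))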
