Looping through a list of webpages with rvest follow_link - r

I'm trying to webscrape the government release calendar: https://www.gov.uk/government/statistics and use the rvest follow_link functionality to go to each publication link and scrape text from the next page. I have this working for each single page of results (40 publications are displayed per page), but can't get a loop to work so that I can run the code over all publications listed.
This is the code I run first to get the list of publications (just from the first 10 pages of results):
#Loading the rvest package
library('rvest')
library('dplyr')
library('tm')
#######PUBLISHED RELEASES################
###function to add number after 'page=' in url to loop over all pages of published releases results (only 40 publications per page)
###check the site and see how many pages you want to scrape, to cover months of interest
##titles of publications - creates a list
publishedtitles <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
url_base %>% read_html() %>%
html_nodes('h3 a') %>%
html_text()
})
##Dates of publications
publisheddates <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
url_base %>% read_html() %>%
html_nodes('.public_timestamp') %>%
html_text()
})
##Organisations
publishedorgs <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
url_base %>% read_html() %>%
html_nodes('.organisations') %>%
html_text()
})
##Links to publications
publishedpartial_links <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
url_base %>% read_html() %>%
html_nodes('h3 a') %>%
html_attr('href')
})
#Check all lists are the same length - if not, have to deal with missings before next step
# length(publishedtitles)
# length(publisheddates)
# length(publishedorgs)
# length(publishedpartial_links)
#str(publishedorgs)
#Combining all the lists to form a data frame
published <-data.frame(Title = unlist(publishedtitles), Date = unlist(publisheddates), Organisation = unlist(publishedorgs), PartLinks = unlist(publishedpartial_links))
#adding prefix to partial links, to turn into full URLs
published$Links = paste("https://www.gov.uk", published$PartLinks, sep="")
#Drop partial links column
keeps <- c("Title", "Date", "Organisation", "Links")
published <- published[keeps]
Then I want to run something like the below, but over all pages of results. I've run this code manually, changing the parameters for each page, so I know it works.
session1 <- html_session("https://www.gov.uk/government/statistics?page=1")
list1 <- list()
for(i in published$Title[1:40]){
nextpage1 <- session1 %>% follow_link(i) %>% read_html()
list1[[i]]<- nextpage1 %>%
html_nodes(".grid-row") %>% html_text()
df1 <- data.frame(text=list1)
df1 <-as.data.frame(t(df1))
}
So the above would need to change page=1 in the html_session, and also the published$Title[1:40]. I'm struggling to create a function or loop that includes both variables.
I think I should be able to do this using lapply:
df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
for(i in published$Title[1:40]){
nextpage1 <- url_base %>% follow_link(i) %>% read_html()
list1[[i]]<- nextpage1 %>%
html_nodes(".grid-row") %>% html_text()
}
}
)
But I get the error
Error in follow_link(., i) : is.session(x) is not TRUE
I've also tried other methods of looping and turning it into a function but didn't want to make this post too long!
Thanks in advance for any suggestions and guidance :)

It looks like you just need to start a session inside the lapply function. In the last chunk of code, url_base is simply a text string that gives the base URL, and follow_link needs a session, which is why the is.session(x) check fails. Would something like this work:
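df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    # url_base is only a character string, so open a session on it first
    tmpSession <- html_session(url_base)
    # collect the results in a list local to this call
    page_text <- list()
    for(i in published$Title[1:40]){
      nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
      page_text[[i]] <- nextpage1 %>%
        html_nodes(".grid-row") %>% html_text()
    }
    # return the list so that lapply keeps it (a for loop on its own returns NULL)
    page_text
  }
)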
To change the published$Title[1:40] for each iteration of the lapply function, you could make an object that holds the lower and upper bounds of the indices:
lowers <- cumsum(c(1, rep(40, 9)))
uppers <- cumsum(rep(40, 10))
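To see what these bounds select, here is a small sketch with made-up titles standing in for published$Title. It assumes every results page lists exactly 40 publications, which may not hold for the last page.

```r
# Hypothetical stand-in for published$Title: 400 titles, 40 per results page
titles <- paste("Publication", 1:400)

lowers <- cumsum(c(1, rep(40, 9)))   # first index of each page: 1, 41, ..., 361
uppers <- cumsum(rep(40, 10))        # last index of each page: 40, 80, ..., 400

# Titles that belong to results page 3
page3 <- titles[lowers[3]:uppers[3]]

# The same grouping without explicit bounds: one list element per page
by_page <- split(titles, rep(1:10, each = 40))
```

Either form gives page j its own block of titles; lowers and uppers are the objects used in the next step.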
Then, you could include those in the call to lapply
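df <- lapply(1:10, function(j){
  url_base <- paste0('https://www.gov.uk/government/statistics?page=', j)
  # one session per results page
  tmpSession <- html_session(url_base)
  # only the titles listed on page j
  page_titles <- published$Title[lowers[j]:uppers[j]]
  page_text <- list()
  for(i in page_titles){
    nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
    page_text[[i]] <- nextpage1 %>%
      html_nodes(".grid-row") %>% html_text()
  }
  # return the collected text so that lapply keeps it
  page_text
}
)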
I'm not sure whether this is what you want; I might have misunderstood which things are supposed to change.
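An alternative that avoids sessions altogether, assuming the Links column built in the question holds working full URLs, is to read each link directly. This is only a sketch; scrape_text is a hypothetical helper name and ".grid-row" is the selector from the question.

```r
library(rvest)

# Hypothetical helper: read each page once and collect the text of one CSS selector
scrape_text <- function(links, css = ".grid-row") {
  out <- lapply(links, function(link) {
    link %>% read_html() %>% html_nodes(css) %>% html_text()
  })
  names(out) <- links
  out
}

# With the data frame from the question this would be called as:
# all_text <- scrape_text(published$Links)
```

Because the links come from the data frame rather than from the results page, no page number or title index has to be tracked.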

Related

R: How can I open a list of links to scrape the homepage of a news website?

I'm trying to build a web scraper in R to scrape articles published on www.20min.ch, a news website. Their API is openly accessible, so I could create a dataframe containing titles, URLs, descriptions, and timestamps with rvest. The next step would be to access every single link, create a list of article texts, and combine it with my dataframe. However, I don't know how to automate access to those articles. Ideally, I would like to read_html link 1, copy the text with html_nodes, and then proceed to link 2...
This is what I wrote so far:
site20min <- read_xml("https://api.20min.ch/rss/view/1")
site20min
url_list <- site20min %>% html_nodes('link') %>% html_text()
df20min <- data.frame(Title = character(),
Zeit = character(),
Lead = character(),
Text = character()
)
for(i in 1:length(url_list)){
myLink <- url_list[i]
site20min <- read_html(myLink)
titel20min <- site20min %>% html_nodes('h1 span') %>% html_text()
zeit20min <- site20min %>% html_nodes('#story_content .clearfix span') %>% html_text()
lead20min <- site20min %>% html_nodes('#story_content h3') %>% html_text()
text20min <- site20min %>% html_nodes('.story_text') %>% html_text()
df20min_a <- data.frame(Title = titel20min)
df20min_b <- data.frame(Zeit = zeit20min)
df20min_c <- data.frame(Lead = lead20min)
df20min_d <- data.frame(Text = text20min)
}
What I need is R to open every single link and extract some information:
site20min_1 <- read_html("https://www.20min.ch/schweiz/news/story/-Es-liegen-auch-Junge-auf-der-Intensivstation--14630453")
titel20min_1 <- site20min_1 %>% html_nodes('h1 span') %>% html_text()
zeit20min_1 <- site20min_1 %>% html_nodes('#story_content .clearfix span') %>% html_text()
lead20min_1 <- site20min_1 %>% html_nodes('#story_content h3') %>% html_text()
text20min_1 <- site20min_1 %>% html_nodes('.story_text') %>% html_text()
It should not be too much of a problem to rbind this to a dataframe, but at the moment some of my results turn out empty.
Thanks for your help!

You're on the right track with setting up a dataframe. You can loop through each link and rbind it to your existing dataframe structure.
First, you can set a vector of urls to be looped through. Based on the edit, here is such a vector:
url_list <- c("http://www.20min.ch/ausland/news/story/14618481",
"http://www.20min.ch/schweiz/news/story/18901454",
"http://www.20min.ch/finance/news/story/21796077",
"http://www.20min.ch/schweiz/news/story/25363072",
"http://www.20min.ch/schweiz/news/story/19113494",
"http://www.20min.ch/community/social_promo/story/20407354",
"https://cp.20min.ch/de/stories/635-stressfrei-durch-den-verkehr-so-sieht-der-alltag-von-busfahrer-claudio-aus")
Next, you can set a dataframe structure that includes everything you're looking to gather.
# Set up the dataframe first
df20min <- data.frame(Title = character(),
Link = character(),
Lead = character(),
Zeit = character())
Finally, you can loop through each url in your list and add the relevant info to your dataframe.
# Go through a loop
for(i in 1:length(url_list)){
myLink <- url_list[i]
site20min <- read_xml(myLink)
# Extract the info
titel20min <- site20min %>% html_nodes('title') %>% html_text()
link20min <- site20min %>% html_nodes('link') %>% html_text()
zeit20min <- site20min %>% html_nodes('pubDate') %>% html_text()
lead20min <- site20min %>% html_nodes('description') %>% html_text()
# Structure into dataframe
df20min_a <- data.frame(Title = titel20min, Link =link20min, Lead = lead20min)
df20min_b <- df20min_a [-(1:2),]
df20min_c <- data.frame(Zeit = zeit20min)
# Insert into final dataframe
df20min <- rbind(df20min, cbind(df20min_b,df20min_c))
}
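Rows can only be combined when each selector yields exactly one value per article, and the question mentions that some results come back empty. A minimal sketch of one way to guard against that, using the article selectors from the question and a hypothetical helper first_text: html_node (singular) returns a single match per page and gives NA when nothing matches, so every article produces exactly one row.

```r
library(rvest)

# Hypothetical helper: text of the first match, or NA if the selector matches nothing
first_text <- function(page, css) {
  page %>% html_node(css) %>% html_text()
}

# One row per article; selectors are the ones used in the question
scrape_article <- function(link) {
  page <- read_html(link)
  data.frame(
    Title = first_text(page, "h1 span"),
    Lead  = first_text(page, "#story_content h3"),
    Text  = first_text(page, ".story_text"),
    stringsAsFactors = FALSE
  )
}

# With a vector of article URLs this would be:
# articles <- do.call(rbind, lapply(url_list, scrape_article))
```

Whether these selectors still match the live site is not something this sketch can confirm; it only shows how missing pieces turn into NA instead of breaking the row.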

R Webscraping: How to feed URLS into a function

My end goal is to take all 310 articles from this page and its following pages and run them through this function:
library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
library(lubridate)
library(dplyr)
scrape_docs <- function(URL){
doc <- read_html(URL)
speaker <- html_nodes(doc, ".diet-title a") %>%
html_text()
date <- html_nodes(doc, ".date-display-single") %>%
html_text() %>%
mdy()
title <- html_nodes(doc, "h1") %>%
html_text()
text <- html_nodes(doc, "div.field-docs-content") %>%
html_text()
all_info <- list(speaker = speaker, date = date, title = title, text = text)
return(all_info)
}
I assume the way to go forward would be to somehow create a list of the URLs I want, then iterate that list through the scrape_docs function. As it stands, however, I'm having a hard time understanding how to go about that. I thought something like this would work, but I seem to be missing something key given the following error:
xml_attr cannot be applied to object of class "character".
source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"
pages <- 4
all_links <- tibble()
for(i in seq_len(pages)){
page <- paste0(source_col,i) %>%
read_html() %>%
html_attr("href") %>%
html_attr()
tmp <- page[[1]]
all_links <- bind_rows(all_links, tmp)
}
all_links

You can get all the URLs by doing
library(rvest)
source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"
all_urls <- source_col %>%
read_html() %>%
html_nodes("td a") %>%
html_attr("href") %>%
.[c(FALSE, TRUE)] %>%
paste0("https://www.presidency.ucsb.edu", .)
Now do the same by changing the page number in source_col to get remaining data.
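A minimal sketch of that step, assuming the page index starts at 0 as in source_col and that 310 results at 100 per page means pages 0 to 3. The function get_article_urls is a hypothetical wrapper around the pipeline above and is only defined here, not run.

```r
library(rvest)

source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"

# Drop the trailing page number, then append each page index (assumed: 0 to 3)
base_url  <- sub("page=0$", "page=", source_col)
page_urls <- paste0(base_url, 0:3)

# Hypothetical wrapper: the same extraction as above, for one results page
get_article_urls <- function(page_url) {
  page_url %>%
    read_html() %>%
    html_nodes("td a") %>%
    html_attr("href") %>%
    .[c(FALSE, TRUE)] %>%
    paste0("https://www.presidency.ucsb.edu", .)
}

# all_urls <- unlist(lapply(page_urls, get_article_urls))
```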
You can then use a for loop or map to extract all the data.
purrr::map(all_urls, scrape_docs)
Testing the function scrape_docs on 1 URL
scrape_docs(all_urls[1])
#$speaker
#[1] "Dwight D. Eisenhower"
#$date
#[1] "1958-04-02"
#$title
#[1] "Special Message to the Congress Relative to Space Science and Exploration."
#$text
#[1] "\n To the Congress of the United States:\nRecent developments in long-range
# rockets for military purposes have for the first time provided man with new mac......

Web scraping in R: the same element gets scraped multiple times. How could I fix this?

I am trying to scrape some URLs from the Dutch train disruptions website. The problem is that on every page the first URL gets scraped 7 times. The HTML only contains the URL once, so I don't understand why it is scraped multiple times.
The problem occurs the same way on every page: Every time, the first URL is scraped 7 times and on the rest of the page just once.
I am using the following script:
library(tidyverse)
library(rvest)
scrape_css_attr <- function(css,group,attribute,html_page){
txt <- html_page %>%
html_nodes(group) %>%
lapply(.%>% html_nodes(css) %>% html_attr(attribute) %>% ifelse(identical(.,character(0)),NA,.)) %>%
unlist()
return(txt)
}
get_element_data <- function(link){
if(!is.na(link)){
html <- read_html(link)
Sys.sleep(2)
datum <- html %>%
html_node(".disruption-cause") %>%
html_text()
return(tibble(datum=datum))
}
}
get_elements_from_url <- function(url){
html_page <- read_html(url)
Sys.sleep(2)
element_urls <- scrape_css_attr(".resolved","div","href",html_page)
element_urls <- element_urls[!is.na(element_urls)]
element_urls <- paste0("https://www.rijdendetreinen.nl", element_urls)
element_data_detail <- element_urls %>%
map(get_element_data) %>%
bind_rows()
elements_data <- tibble(element_urls=element_urls)
elements_data_overview <- elements_data[complete.cases(elements_data[,1]), ]
return(bind_cols(elements_data_overview,element_data_detail))
}
scrape_write_table <- function(url){
list_of_pages <- str_c(url, 1)
list_of_pages %>%
map(get_elements_from_url) %>%
bind_rows()
}
trainDisruptions <- scrape_write_table("https://www.rijdendetreinen.nl/storingen?lines=&reasons=&date_before=31-12-2018&date_after=01-01-2018&page=")
View(trainDisruptions)
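One possible explanation, offered as an assumption because the site's real markup is not shown here: scrape_css_attr looks inside every div on the page, including wrapper divs that contain all the entries, and the ifelse call keeps only the first match per div. Each wrapper would then contribute the first link again. The sketch below reproduces that pattern on a made-up page with two wrappers.

```r
library(rvest)

# Made-up page: two nested wrapper divs around two disruption entries
page <- read_html('
  <div id="outer">
    <div id="inner">
      <div class="item"><a class="resolved" href="/a">A</a></div>
      <div class="item"><a class="resolved" href="/b">B</a></div>
    </div>
  </div>')

# Same idea as scrape_css_attr: one lookup per div, keeping only the first match
per_div <- page %>%
  html_nodes("div") %>%
  lapply(function(node) html_attr(html_nodes(node, ".resolved"), "href")[1]) %>%
  unlist()

# Selecting the links directly visits each of them exactly once
direct <- page %>% html_nodes("a.resolved") %>% html_attr("href")
```

Here per_div contains "/a" once for each wrapper and once for its own entry, while direct contains each link a single time. If the real page wraps its entries in several layers of div, selecting the links directly may remove the repeats.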

Scraping information from multiple webpages using rvest

I am trying to scrape the results from the 2012-2016 Stockholm Marathon races. I am able to do so using the code outlined below, but every time that I've scraped the results from one year I have to go through the process of manually changing the URL to capture the next year.
This bothers me as the only thing that needs to change is the year (2012) in http://results.marathon.se/2012/?content=list&event=STHM&num_results=250&page=1&pid=list&search[sex]=M&lang=SE.
How can I modify the code below so that it scrapes the results from each year, outputting the results into a single dataframe that also includes a column to indicate the year to which the observation belongs?
library(dplyr)
library(rvest)
library(tidyverse)
# Find the total number of pages to scrape
tot_pages <- read_html('http://results.marathon.se/2012/?content=list&event=STHM&num_results=250&page=1&pid=list&search[sex]=M&lang=EN') %>%
html_nodes('a:nth-child(6)') %>% html_text() %>% as.numeric()
#Store the URLs in a vector
URLs <- sprintf('http://results.marathon.se/2012/?content=list&event=STHM&num_results=250&page=%s&pid=list&search[sex]=M&lang=EN', 1:tot_pages)
#Create a progress bar
pb <- progress_estimated(tot_pages, min = 0)
# Create a function to scrape the name and finishing time from each page
getdata <- function(URL) {
pb$tick()$print()
pg <- read_html(URL)
html_nodes(pg, 'tbody td:nth-child(3)') %>% html_text() %>% as_tibble() %>% set_names(c('Name')) %>%
mutate(finish_time = html_nodes(pg, 'tbody .right') %>% html_text())
}
#Map everything into a dataframe
map_df(URLs, getdata) -> results

You can use lapply to do this:
library(dplyr)
library(rvest)
library(tidyverse)
# make a vector of the years you want
years <- seq(2012,2016)
# now use lapply to iterate your code over those years
Results.list <- lapply(years, function(x) {
# make a target url with the relevant year
link <- sprintf('http://results.marathon.se/%s/?content=list&event=STHM&num_results=250&page=1&pid=list&search[sex]=M&lang=EN', x)
# Find the total number of pages to scrape
tot_pages <- read_html(link) %>%
html_nodes('a:nth-child(6)') %>% html_text() %>% as.numeric()
# Store the URLs in a vector
URLs <- sprintf('http://results.marathon.se/%s/?content=list&event=STHM&num_results=250&page=%s&pid=list&search[sex]=M&lang=EN', x, 1:tot_pages)
#Create a progress bar
pb <- progress_estimated(tot_pages, min = 0)
# Create a function to scrape the name and finishing time from each page
getdata <- function(URL) {
pb$tick()$print()
pg <- read_html(URL)
html_nodes(pg, 'tbody td:nth-child(3)') %>% html_text() %>% as_tibble() %>% set_names(c('Name')) %>%
mutate(finish_time = html_nodes(pg, 'tbody .right') %>% html_text())
}
#Map everything into a dataframe
map_df(URLs, getdata) -> results
# add an id column indicating which year
results$year <- x
return(results)
})
# now collapse the resulting list into one tidy df
Results <- bind_rows(Results.list)
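The URL construction above relies on sprintf recycling its arguments: a single year combined with a vector of page numbers gives one URL per page. A small sketch of just that step, with the page count of 3 chosen for illustration rather than read from the site:

```r
url_template <- "http://results.marathon.se/%s/?content=list&event=STHM&num_results=250&page=%s&pid=list&search[sex]=M&lang=EN"

# One year, several pages: the year is recycled across the page numbers
pages_2012 <- sprintf(url_template, 2012, 1:3)

# Several years, first page only: the page number is recycled across the years
years <- 2012:2016
first_pages <- sprintf(url_template, years, 1)
```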

R return multiple nodes in 1 search using rvest (massive list of urls)

I am using rvest to scrape a website. It works, but it is highly inefficient, and I can't figure out how to get it to work better.
url is a list of over 10,000 URLs.
number <- sapply(url, function(x)
read_html(x) %>%
html_nodes(".js-product-artnr") %>%
html_text())
price_new <- sapply(url, function(x)
read_html(x) %>%
html_nodes(".product-page__price__new") %>%
html_text())
price_old <- sapply(url, function(x)
read_html(x) %>%
html_nodes(".product-page__price__old") %>%
html_text())
The problem with the above is that rvest visits the 10,000 URLs to get the first node, ".js-product-artnr", then visits the same 10,000 URLs again for the second node, and so on. In the end I expect to need about 10 different nodes from these 10,000 pages. Getting them one by one and combining them into a data frame later takes way too long; there must be a better way.
I am looking for something like below, to get all information in 1 search
info <- sapply(url, function(x)
read_html(x) %>%
html_nodes(".js-product-artnr") %>%
html_nodes(".product-page__price__new") %>%
html_nodes(".product-page__price__old") %>%
html_text())

This works for me.
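func <- function(url){
  # read the page once, then reuse it for every selector
  sample <- read_html(url)
  scrape1 <- html_nodes(sample, ".js-product-artnr") %>%
    html_text()
  scrape2 <- html_nodes(sample, ".product-page__price__new") %>%
    html_text()
  scrape3 <- html_nodes(sample, ".product-page__price__old") %>%
    html_text()
  # cbind assumes each selector matches the same number of nodes on the page
  df <- cbind(scrape1, scrape2, scrape3)
  final_df <- as.data.frame(df)
  return(final_df)
}
# urls_all is the vector of page URLs (called url in the question)
data <- lapply(urls_all, func)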
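If some pages lack one of the fields (for example, no old price), the columns end up with different lengths. A hedged variation, assuming each page describes a single product: html_node (singular) returns one value per page and NA when the selector matches nothing. extract_product is a hypothetical name, and the HTML below is made up for illustration.

```r
library(rvest)

# Hypothetical helper: one row per page, with NA for any field that is missing
extract_product <- function(page) {
  data.frame(
    number    = page %>% html_node(".js-product-artnr") %>% html_text(),
    price_new = page %>% html_node(".product-page__price__new") %>% html_text(),
    price_old = page %>% html_node(".product-page__price__old") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# Made-up product page without an old price
html <- '<div><span class="js-product-artnr">12345</span>
<span class="product-page__price__new">9.99</span></div>'
res <- extract_product(read_html(html))

# Over the real URLs this would be:
# products <- do.call(rbind, lapply(urls_all, function(u) extract_product(read_html(u))))
```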

Resources