How do I prevent a for loop from overwriting my results in R?

I would like to scrape all the seasons from 2003-2004 to 2019-2020 of the Dutch football league, including all 34 playing rounds per season (I am using this website: https://www.voetbal.com/wedstrijdgegevens/ned-eredivisie-2003-2004-spieltag/). As you can see in my code, it only shows me the results of the last season. I think it's overwriting the other seasons. What am I doing wrong? What do I have to add to my code? Can anybody help me?
Here is the code I use:
library(tidyverse)
library(dplyr)
library(ggplot2)
library(caret)
library(rvest)
library(devtools)
library(httr)
library(tidyr)
library(tibble)
library(xml2)
library(tidyr)
library(stringr)
url <- sprintf("https://www.voetbal.com/wedstrijdgegevens/ned-eredivisie-%d-%d-spieltag/", 2003:2019, 2004:2020)
basis <- function(url){
  website <- read_html(url)
  Sys.sleep(2)
  # date and kick-off time of each match
  datum <- website %>%
    html_nodes(".data .standard_tabelle td[nowrap]:nth-of-type(1)") %>%
    html_text()
  tijdstip <- website %>%
    html_nodes(".data .standard_tabelle td[nowrap]:nth-of-type(2)") %>%
    html_text()
  # home club, away club and final result
  thuisclub <- website %>%
    html_nodes(".data .standard_tabelle [align='right'] a") %>%
    html_text()
  uitclub <- website %>%
    html_nodes(".standard_tabelle td:nth-of-type(5) a") %>%
    html_text()
  uitslag <- website %>%
    html_nodes(".data .standard_tabelle td[nowrap]:nth-of-type(6)") %>%
    html_text()
  return(tibble(datum = datum, tijdstip = tijdstip, thuisclub = thuisclub, uitclub = uitclub, uitslag = uitslag))
}
overige_seizoenen <- function(url){
  for (i in 1:17){
    list_of_pages <- str_c(url[[i]], 1:34)
    table <- list_of_pages %>%
      map(basis) %>%
      bind_rows()
  }
  return(table)
}
jochem <- overige_seizoenen(url)

The error is in the for loop: on every iteration, table <- ... replaces the result of the previous season, so when the loop finishes only the last season (2019-2020) is left. Collect each season's result in a list instead, and bind everything together once after the loop. So try this version of the function:
overige_seizoenen <- function(url){
  seizoenen <- vector("list", length(url))
  for (i in seq_along(url)){
    list_of_pages <- str_c(url[[i]], 1:34)
    seizoenen[[i]] <- list_of_pages %>%
      map(basis) %>%
      bind_rows()
  }
  return(bind_rows(seizoenen))
}
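If you would rather skip the explicit loop altogether, here is a minimal purrr-only sketch (assuming the url vector and the basis() function defined above; the name alle_paginas is just illustrative):
# build every season/round combination: 17 seasons x 34 rounds = 578 pages
alle_paginas <- map(url, ~ str_c(.x, 1:34)) %>% unlist()

# scrape each round page with basis() and stack everything into one tibble
jochem <- map_dfr(alle_paginas, basis)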

Related

Loop with rvest

I'm very new to all this and am trying to work through some examples on Stack Overflow to build up my confidence.
I found this answer by @RonakShah:
Using rvest to scrape data that is not in table
and, since I'm familiar with HTML, thought I'd use it to build up my confidence with loops.
My issue is that I can't make the loop work.
Could someone please point out where I'm going wrong? It's bits and pieces of code I've found on the message boards, but I'm not getting anywhere!
library(rvest)
page<- (0:2)
urls <- list()
for (i in 1:length(page)) {
  url <- paste0("https://concreteplayground.com/sydney/bars?page=", page[i])
  urls[[i]] <- url
}
tbl <- list()
j <- 1
for (j in seq_along(urls)) {
  tbl[[j]] <- urls[[j]] %>% read_html()
  name <- tbl[[j]] %>% html_nodes('p.name a') %>% html_text() %>% trimws()
  address <- tbl[[j]] %>% html_nodes('p.address') %>% html_text() %>% trimws()
  links <- tbl[[j]] %>% html_nodes('p.name a') %>% html_attr('href')
  data.frame(name, address, links)
  j <- j + 1
}
#convert list to data frame
tbl <- do.call(rbind, tbl)
Create urls using paste0 directly, no need for a loop.
library(rvest)
pages <- 1:2
urls <- paste0("https://concreteplayground.com/sydney/bars?page=", pages)
If you put the code on that page in a function, you can use it with map_df to get a combined dataframe directly; map_df does the job of the for loop and do.call(rbind, tbl) together.
get_web_data <- function(url) {
  webpage <- url %>% read_html()
  name <- webpage %>% html_nodes('p.name a') %>% html_text() %>% trimws()
  address <- webpage %>% html_nodes('p.address') %>% html_text() %>% trimws()
  links <- webpage %>% html_nodes('p.name a') %>% html_attr('href')
  data.frame(name, address, links)
}
purrr::map_df(urls, get_web_data)

R Webscraping: How to feed URLs into a function

My end goal is to be able to take all 310 articles from this page and its following pages and run them through this function:
library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
library(lubridate)
library(dplyr)
scrape_docs <- function(URL){
  doc <- read_html(URL)
  speaker <- html_nodes(doc, ".diet-title a") %>%
    html_text()
  date <- html_nodes(doc, ".date-display-single") %>%
    html_text() %>%
    mdy()
  title <- html_nodes(doc, "h1") %>%
    html_text()
  text <- html_nodes(doc, "div.field-docs-content") %>%
    html_text()
  all_info <- list(speaker = speaker, date = date, title = title, text = text)
  return(all_info)
}
I assume the way to go forward would be to somehow create a list of the URLs I want, then iterate that list through the scrape_docs function. As it stands, however, I'm having a hard time understanding how to go about that. I thought something like this would work, but I seem to be missing something key given the following error:
xml_attr cannot be applied to object of class "character".
source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"
pages <- 4
all_links <- tibble()
for(i in seq_len(pages)){
  page <- paste0(source_col, i) %>%
    read_html() %>%
    html_attr("href") %>%
    html_attr()
  tmp <- page[[1]]
  all_links <- bind_rows(all_links, tmp)
}
all_links
You can get all the URLs by doing:
library(rvest)
source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"
all_urls <- source_col %>%
  read_html() %>%
  html_nodes("td a") %>%
  html_attr("href") %>%
  .[c(FALSE, TRUE)] %>%
  paste0("https://www.presidency.ucsb.edu", .)
Now do the same by changing the page number at the end of source_col to get the remaining data.
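A minimal sketch of that step (assuming the results pages are numbered 0 to 3, that the base URL is source_col with its trailing "0" removed, and that the same "td a" selector works on every page; the helper name get_page_urls is illustrative):
library(rvest)
library(purrr)

# drop the trailing page number from the search URL defined above
base_url <- sub("page=0$", "page=", source_col)

get_page_urls <- function(page_no) {
  paste0(base_url, page_no) %>%
    read_html() %>%
    html_nodes("td a") %>%
    html_attr("href") %>%
    .[c(FALSE, TRUE)] %>%
    paste0("https://www.presidency.ucsb.edu", .)
}

# document links from all four result pages, as one character vector
all_urls <- map(0:3, get_page_urls) %>% unlist()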
You can then use a for loop or map to extract all the data.
purrr::map(all_urls, scrape_docs)
Testing the function scrape_docs on 1 URL
scrape_docs(all_urls[1])
#$speaker
#[1] "Dwight D. Eisenhower"
#$date
#[1] "1958-04-02"
#$title
#[1] "Special Message to the Congress Relative to Space Science and Exploration."
#$text
#[1] "\n To the Congress of the United States:\nRecent developments in long-range
# rockets for military purposes have for the first time provided man with new mac......

Web scraping in R: the same element gets scraped multiple times. How could I fix this?

I am trying to scrape some URLs from the Dutch train disruptions website. The problem is that on every page the first URL gets scraped 7 times, even though the HTML only contains that URL once, so I don't understand why it is scraped multiple times.
The problem occurs in the same way on every page: each time, the first URL is scraped 7 times and the rest of the page just once.
I am using the following script:
library(tidyverse)
library(rvest)
scrape_css_attr <- function(css, group, attribute, html_page){
  txt <- html_page %>%
    html_nodes(group) %>%
    lapply(. %>% html_nodes(css) %>% html_attr(attribute) %>% ifelse(identical(., character(0)), NA, .)) %>%
    unlist()
  return(txt)
}

get_element_data <- function(link){
  if(!is.na(link)){
    html <- read_html(link)
    Sys.sleep(2)
    datum <- html %>%
      html_node(".disruption-cause") %>%
      html_text()
    return(tibble(datum = datum))
  }
}

get_elements_from_url <- function(url){
  html_page <- read_html(url)
  Sys.sleep(2)
  element_urls <- scrape_css_attr(".resolved", "div", "href", html_page)
  element_urls <- element_urls[!is.na(element_urls)]
  element_urls <- paste0("https://www.rijdendetreinen.nl", element_urls)
  element_data_detail <- element_urls %>%
    map(get_element_data) %>%
    bind_rows()
  elements_data <- tibble(element_urls = element_urls)
  elements_data_overview <- elements_data[complete.cases(elements_data[, 1]), ]
  return(bind_cols(elements_data_overview, element_data_detail))
}

scrape_write_table <- function(url){
  list_of_pages <- str_c(url, 1)
  list_of_pages %>%
    map(get_elements_from_url) %>%
    bind_rows()
}
trainDisruptions <- scrape_write_table("https://www.rijdendetreinen.nl/storingen?lines=&reasons=&date_before=31-12-2018&date_after=01-01-2018&page=")
View(trainDisruptions)
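Reading the code, the repetition most likely comes from scrape_css_attr(".resolved", "div", "href", html_page): the group selector "div" also matches the nested container <div>s that wrap the list, and for a wrapper holding many links the vectorized ifelse() keeps only the first href, so the first URL comes back once per wrapper. Since get_elements_from_url drops the NAs straight away anyway, a hedged sketch (the helper name is illustrative) would be to select the links directly:
# hypothetical replacement for the scrape_css_attr() call inside get_elements_from_url():
# match the .resolved links once each instead of walking every <div> on the page
scrape_resolved_urls <- function(html_page) {
  html_page %>%
    html_nodes(".resolved") %>%
    html_attr("href")
}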

Looping through a list of webpages with rvest follow_link

I'm trying to web-scrape the government release calendar (https://www.gov.uk/government/statistics) and use the rvest follow_link functionality to go to each publication link and scrape text from the next page. I have this working for a single page of results at a time (40 publications are displayed per page), but I can't get a loop to work so that I can run the code over all of the publications listed.
This is the code I run first to get the list of publications (just from the first 10 pages of results):
#Loading the rvest package
library('rvest')
library('dplyr')
library('tm')
#######PUBLISHED RELEASES################
###function to add number after 'page=' in url to loop over all pages of published releases results (only 40 publications per page)
###check the site and see how many pages you want to scrape, to cover months of interest
##titles of publications - creates a list
publishedtitles <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('h3 a') %>%
      html_text()
  })
##Dates of publications
publisheddates <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('.public_timestamp') %>%
      html_text()
  })
##Organisations
publishedorgs <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('.organisations') %>%
      html_text()
  })
##Links to publications
publishedpartial_links <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('h3 a') %>%
      html_attr('href')
  })
#Check all lists are the same length - if not, have to deal with missings before next step
# length(publishedtitles)
# length(publisheddates)
# length(publishedorgs)
# length(publishedpartial_links)
#str(publishedorgs)
#Combining all the lists to form a data frame
published <- data.frame(Title = unlist(publishedtitles),
                        Date = unlist(publisheddates),
                        Organisation = unlist(publishedorgs),
                        PartLinks = unlist(publishedpartial_links))
#adding prefix to partial links, to turn into full URLs
published$Links = paste("https://www.gov.uk", published$PartLinks, sep="")
#Drop partial links column
keeps <- c("Title", "Date", "Organisation", "Links")
published <- published[keeps]
Then I want to run something like the below, but over all pages of results. I've run this code manually, changing the parameters for each page, so I know it works.
session1 <- html_session("https://www.gov.uk/government/statistics?page=1")
list1 <- list()
for(i in published$Title[1:40]){
  nextpage1 <- session1 %>% follow_link(i) %>% read_html()
  list1[[i]] <- nextpage1 %>%
    html_nodes(".grid-row") %>% html_text()
  df1 <- data.frame(text = list1)
  df1 <- as.data.frame(t(df1))
}
So the above would need to change page=1 in the html_session, and also the published$Title[1:40] - I'm struggling with creating a function or loop that includes both variables.
I think I should be able to do this using lapply:
df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    for(i in published$Title[1:40]){
      nextpage1 <- url_base %>% follow_link(i) %>% read_html()
      list1[[i]] <- nextpage1 %>%
        html_nodes(".grid-row") %>% html_text()
    }
  }
)
But I get the error
Error in follow_link(., i) : is.session(x) is not TRUE
I've also tried other methods of looping and turning it into a function but didn't want to make this post too long!
Thanks in advance for any suggestions and guidance :)
It looks like you may just need to start a session inside the lapply function. In the last chunk of code, url_base is simply a text string that gives the base URL. Would something like this work:
df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    for(i in published$Title[1:40]){
      tmpSession <- html_session(url_base)
      nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
      list1[[i]] <- nextpage1 %>%
        html_nodes(".grid-row") %>% html_text()
    }
  }
)
To change the published$Title[1:40] for each iteration of the lapply function, you could make an object that holds the lower and upper bounds of the indices:
lowers <- cumsum(c(1, rep(40, 9)))
uppers <- cumsum(rep(40, 10))
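For reference, those bounds evaluate to:
lowers
# [1]   1  41  81 121 161 201 241 281 321 361
uppers
# [1]  40  80 120 160 200 240 280 320 360 400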
Then, you could include those in the call to lapply
df <- lapply(1:10, function(j){
  url_base <- paste0('https://www.gov.uk/government/statistics?page=', j)
  for(i in published$Title[lowers[j]:uppers[j]]){
    tmpSession <- html_session(url_base)
    nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
    list1[[i]] <- nextpage1 %>%
      html_nodes(".grid-row") %>% html_text()
  }
})
Not sure if this is what you want or not; I might have misunderstood which things are supposed to be changing.
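One caveat worth hedging: inside the anonymous function the for loop is the last expression, and a for loop returns NULL, while list1[[i]] <- ... modifies a local copy rather than the list1 in your workspace, so df would come back as a list of NULLs. A variant of the sketch above that collects and returns the results from inside the function (the local name page_list is just illustrative):
df <- lapply(1:10, function(j){
  url_base <- paste0('https://www.gov.uk/government/statistics?page=', j)
  page_list <- list()                          # collect this page's texts locally
  for(i in published$Title[lowers[j]:uppers[j]]){
    tmpSession <- html_session(url_base)
    nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
    page_list[[i]] <- nextpage1 %>%
      html_nodes(".grid-row") %>% html_text()
  }
  page_list                                    # last expression, so lapply keeps it
})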

Loop URL and store info in R

I'm trying to write a for loop that will loop through many websites, extract a few elements, and store the results in a table in R. Here's my go so far; I'm just not sure how to start the for loop, or how to copy all the results into one variable to be exported later.
library("dplyr")
library("rvest")
library("leaflet")
library("ggmap")
url <- c(html("http://www.webiste_name.com/"))
agent <- html_nodes(url,"h1 span")
fnames<-html_nodes(url, "#offNumber_mainLocContent span")
address <- html_nodes(url,"#locStreetContent_mainLocContent")
scrape<-t(c(html_text(agent),html_text(fnames),html_text(address)))
View(scrape)
Given that your question isn't fully reproducible, here's a toy example that loops through three URLs (Red Sox, Blue Jays and Yankees):
library(rvest)
# teams
teams <- c("BOS", "TOR", "NYY")
# init
df <- NULL
# loop
for(i in teams){
  # find url
  url <- paste0("http://www.baseball-reference.com/teams/", i, "/")
  page <- read_html(url)
  # grab table
  table <- page %>%
    html_nodes(css = "#franchise_years") %>%
    html_table() %>%
    as.data.frame()
  # bind to dataframe
  df <- rbind(df, table)
}
# view captured data
View(df)
The loop works because it replaces i in paste0 with each team in sequence.
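Concretely, on the first pass of the loop the URL is built as:
paste0("http://www.baseball-reference.com/teams/", "BOS", "/")
#[1] "http://www.baseball-reference.com/teams/BOS/"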
I would go with lapply.
The code would look something like this:
library("rvest")
library("dplyr")
#a vector of urls you want to scrape
URLs <- c("http://...1", "http://...2", ....)
df <- lapply(URLs, function(u){
  html.obj <- read_html(u)
  agent <- html_nodes(html.obj, "h1 span") %>% html_text
  fnames <- html_nodes(html.obj, "#offNumber_mainLocContent span") %>% html_text
  address <- html_nodes(html.obj, "#locStreetContent_mainLocContent") %>% html_text
  data.frame(Agent = agent, Fnames = fnames, Address = address)
})
df <- do.call(rbind, df)
View(df)
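As a small alternative for the combine step (just a suggestion), dplyr::bind_rows() does the same job as do.call(rbind, df) and also copes when some pages return zero rows:
# combine the list of per-URL data frames returned by lapply above
df <- dplyr::bind_rows(df)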
