Need help scraping a big archive - R

For a school project I have to scrape a website, which isn't a problem. But for it to count as big data I wanted to scrape the whole archive (the past 5 years). The only thing that changes in the URL is the date at the end, but I don't know how to write a script that changes only that date.
The website I'm using is this: https://www.ongelukvandaag.nl/archief/.
The dates I need run from 01-01-2015 until 24-09-2020. I already figured out the first part of the code and I'm able to scrape one page. I'm a beginner at R and would like to know if anyone could help me. The code is shown below. Thanks in advance!
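(For reference, the date sequence itself can be generated in base R and pasted onto the base URL; a minimal sketch, assuming the archive path really ends in a dd-mm-yyyy date:)
# Build every archive URL from 01-01-2015 to 24-09-2020
dates <- seq(as.Date("2015-01-01"), as.Date("2020-09-24"), by = "day")
urls  <- paste0("https://www.ongelukvandaag.nl/archief/", format(dates, "%d-%m-%Y"))
head(urls)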
This is what I have so far; the errors are underneath the code.
install.packages("XML")
install.packages("reshape")
install.packages("robotstxt")
install.packages("Rcrawler")
install.packages("RSelenium")
install.packages("devtools")
install.packages("exifr")
install.packages("Publish")
devtools::install_github("r-lib/xml2")
library(rvest)
library(dplyr)
library(xml)
library(stringr)
library(jsonlite)
library(xml12)
library(purrr)
library(tidyr)
library(reshape)
library(XML)
library(robotstxt)
library(Rcrawler)
library(RSelenium)
library(ps)
library(devtools)
library(exifr)
library(Publish)
#Create an url object
url<-"https://www.ongelukvandaag.nl/archief/%d "
#Verify the web can be scraped
paths_allowed(paths = c(url))
#Obtain the links for every day from 2015 to 2020
map_df(2015:2020, function(i){
page<-read_html(sprintf(url,i))
data.frame(Links = html_attr(html_nodes(page, ".archief a"),"href"))
}) -> Links %>%
Links$Links<-paste("https://www.ongelukvandaag.nl/",Links$Links,sep = "")
#Scrape what you want from each link:
d<- map(Links$Links, function(x) {
Z <- read_html(x)
Date <- Z %>% html_nodes(".text-muted") %>% html_text(trim = TRUE) # Last update
All_title <- Z %>% html_nodes("h2") %>% html_text(trim = TRUE) # Title
return(tibble(All_title,Date))
})
The errors I get:
Error in open.connection(x, "rb") : HTTP error 400.
Error in paste("https://www.ongelukvandaag.nl/", Links$Links, sep = "") : object 'Links' not found
Error in map(Links$Links, function(x) { : object 'Links' not found
And the packages "xml12" and "xml" don't work in this version of RStudio.

Take a look at my code and my comments:
library(purrr)
library(rvest) # don't load a lot of libraries if you don't need them
url <- "https://www.ongelukvandaag.nl/archief/"
bigdata <-
  map_dfr(
    2015:2020,
    function(year){
      year_pg <- read_html(paste0(url, year))
      list_dates <- year_pg %>% html_nodes(xpath = "//div[@class='archief']/a") %>% html_text() # in case some dates are missing
      map_dfr(
        list_dates,
        function(date) {
          pg <- read_html(paste0(url, date))
          items <- pg %>% html_nodes("div.full > div.row")
          items <- items[sapply(items, function(x) length(x %>% html_node(xpath = "./descendant::h2"))) > 0] # drop NA items
          data.frame(
            date = date,
            title = items %>% html_node(xpath = "./descendant::h2") %>% html_text(),
            update = items %>% html_node(xpath = "./descendant::h4") %>% html_text(),
            image = items %>% html_node(xpath = "./descendant::img") %>% html_attr("src")
          )
        }
      )
    }
  )
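If some of the date pages are missing or fail to load, read_html will stop the whole run. One hedged way around that is to wrap read_html in purrr::possibly and skip the failed pages; safe_read and scrape_date below are illustrative names, and the selectors are reused from the code above:
library(purrr)
# possibly() returns NULL instead of throwing an error when a page fails to load
safe_read <- possibly(read_html, otherwise = NULL)
scrape_date <- function(date) {
  Sys.sleep(1)                          # small pause between requests
  pg <- safe_read(paste0(url, date))
  if (is.null(pg)) return(NULL)         # NULL results are dropped by map_dfr()
  items <- pg %>% html_nodes("div.full > div.row")
  data.frame(
    date  = date,
    title = items %>% html_node(xpath = "./descendant::h2") %>% html_text()
  )
}
scrape_date() can then replace the anonymous inner function in the map_dfr() call above.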

Related

Web Scraping using Rvest and Stringr: Can't figure out what I'm doing wrong

I have code to scrape a senate website and extract all the information about representatives into a data frame. It runs fine up until I try to scrape the part about their term information. The function I'm using just returns "NA" instead of the term assignments. I would really appreciate some help figuring out what I'm doing wrong in the last block of code (base_link3 onwards).
install.packages("tidyverse")
install.packages("rvest")
library(rvest)
library(dplyr)
library(stringr)
#Create blank lists
member_list <- list()
photo_list <- list()
memberlink_list <- list()
cycle_list <- list()
#Scrape data
cycles <- c("2007","2009","2011","2013","2015","2017","2019","2021")
base_link <- "https://www.legis.state.pa.us/cfdocs/legis/home/member_information/mbrList.cfm?Body=S&SessYear="
for(cycle in cycles) {
member_list[[cycle]] <- read_html(paste(base_link, cycle, sep="")) %>%
html_nodes(".MemberInfoList-MemberBio a") %>%
html_text()
memberlink_list[[cycle]] <- read_html(paste(base_link, cycle, sep="")) %>%
html_nodes(".MemberInfoList-MemberBio a") %>%
html_attr("href")
photo_list[[cycle]] <- read_html(paste(base_link, cycle, sep="")) %>%
html_nodes(".MemberInfoList-PhotoThumb img") %>%
html_attr("src")
cycle_list[[cycle]] <- rep(cycle, times = length(member_list[[cycle]]))
}
#Assemble data frame
member_list2 <- unlist(member_list)
cycle_list2 <- unlist(cycle_list)
photo_list2 <- unlist(photo_list)
memberlink_list2 <- unlist(memberlink_list)
senate_directory <- data.frame(cycle_list2, member_list2, photo_list2, memberlink_list2) %>%
rename(Cycle = cycle_list2,
Member = member_list2,
Photo = photo_list2,
Link = memberlink_list2)
#New Section from March 12
##Trying to use each senator's individual page
#Convert memberlink_list into dataframe
df <- data.frame(matrix(unlist(memberlink_list), nrow=394, byrow=TRUE),stringsAsFactors=FALSE)
colnames(df) <- "Link" #rename column to link
base_link3 <- paste0("https://www.legis.state.pa.us/cfdocs/legis/home/member_information/", df$Link) #creating each senator's link
terminfo <- sapply(base_link2, function(x) {
val <- x %>%
read_html %>%
html_nodes('div.MemberBio-TermInfo') %>%
html_text() %>%
str_extract('(?<=Senate Term )\\d+')
if(length(val)) val else NA
}, USE.NAMES = FALSE)
terminfo <- data.frame(terminfo, df$Link)
I am not sure exactly what you are looking for, but something like this might help you. Note that the site's robots.txt specifies a crawl delay of 5 seconds, which your code above does not implement or respect.
library(httr)
library(purrr)
library(rvest)    # needed for html_nodes()/html_text()
library(stringr)  # needed for str_extract()
extract_terminfo <- function(link) {
  html <- httr::GET(link)
  Sys.sleep(runif(1, 5, 6))  # respect the 5-second crawl delay
  val <- html %>%
    content(as = "parsed") %>%
    html_nodes('div.MemberBio-TermInfo') %>%
    html_text() %>%
    str_extract('(?<=Term Expires: )\\d+')
  if (length(val) > 0) {
    return(data.frame(terminfo = val, link = link))
  } else {
    return(data.frame(terminfo = "historic", link = link))
  }
}
link <- base_link3[1]
link
extract_terminfo(link)
term_info <- map_dfr(base_link3[1:3],extract_terminfo)
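As an aside, the crawl delay mentioned above can be checked directly with the robotstxt package (the same package installed in the first question); a small sketch:
library(robotstxt)
# Print the raw robots.txt, which contains the Crawl-delay directive
cat(get_robotstxt("www.legis.state.pa.us"))
# Confirm the member pages may be scraped at all
paths_allowed("https://www.legis.state.pa.us/cfdocs/legis/home/member_information/")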

R: How can I open a list of links to scrape the homepage of a news website?

I'm trying to build a web scraper to scrape articles published on www.20min.ch, a news website, with R. Their API is openly accessible, so I could create a data frame containing titles, URLs, descriptions, and timestamps with rvest. The next step would be to access every single link, create a list of article texts, and combine it with my data frame. However, I don't know how to automate the access to those articles. Ideally, I would like to read_html link 1, copy the text with html_nodes, then proceed to link 2...
This is what I wrote so far:
site20min <- read_xml("https://api.20min.ch/rss/view/1")
site20min
url_list <- site20min %>% html_nodes('link') %>% html_text()
df20min <- data.frame(Title = character(),
Zeit = character(),
Lead = character(),
Text = character()
)
for(i in 1:length(url_list)){
myLink <- url_list[i]
site20min <- read_html(myLink)
titel20min <- site20min %>% html_nodes('h1 span') %>% html_text()
zeit20min <- site20min %>% html_nodes('#story_content .clearfix span') %>% html_text()
lead20min <- site20min %>% html_nodes('#story_content h3') %>% html_text()
text20min <- site20min %>% html_nodes('.story_text') %>% html_text()
df20min_a <- data.frame(Title = titel20min)
df20min_b <- data.frame(Zeit = zeit20min)
df20min_c <- data.frame(Lead = lead20min)
df20min_d <- data.frame(Text = text20min)
}
What I need is R to open every single link and extract some information:
site20min_1 <- read_html("https://www.20min.ch/schweiz/news/story/-Es-liegen-auch-Junge-auf-der-Intensivstation--14630453")
titel20min_1 <- site20min_1 %>% html_nodes('h1 span') %>% html_text()
zeit20min_1 <- site20min_1 %>% html_nodes('#story_content .clearfix span') %>% html_text()
lead20min_1 <- site20min_1 %>% html_nodes('#story_content h3') %>% html_text()
text20min_1 <- site20min_1 %>% html_nodes('.story_text') %>% html_text()
It should not be too much of a problem to rbind this into a data frame, but at the moment some of my results turn out empty.
Thanks for your help!
You're on the right track with setting up a dataframe. You can loop through each link and rbind it to your existing dataframe structure.
First, you can set a vector of urls to be looped through. Based on the edit, here is such a vector:
url_list <- c("http://www.20min.ch/ausland/news/story/14618481",
"http://www.20min.ch/schweiz/news/story/18901454",
"http://www.20min.ch/finance/news/story/21796077",
"http://www.20min.ch/schweiz/news/story/25363072",
"http://www.20min.ch/schweiz/news/story/19113494",
"http://www.20min.ch/community/social_promo/story/20407354",
"https://cp.20min.ch/de/stories/635-stressfrei-durch-den-verkehr-so-sieht-der-alltag-von-busfahrer-claudio-aus")
Next, you can set up a dataframe structure that includes everything you're looking to gather.
# Set up the dataframe first
df20min <- data.frame(Title = character(),
                      Link = character(),
                      Lead = character(),
                      Zeit = character())
Finally, you can loop through each url in your list and add the relevant info to your dataframe.
# Go through a loop
for (i in 1:length(url_list)) {
  myLink <- url_list[i]
  site20min <- read_xml(myLink)
  # Extract the info
  titel20min <- site20min %>% html_nodes('title') %>% html_text()
  link20min <- site20min %>% html_nodes('link') %>% html_text()
  zeit20min <- site20min %>% html_nodes('pubDate') %>% html_text()
  lead20min <- site20min %>% html_nodes('description') %>% html_text()
  # Structure into dataframe
  df20min_a <- data.frame(Title = titel20min, Link = link20min, Lead = lead20min)
  df20min_b <- df20min_a[-(1:2), ]
  df20min_c <- data.frame(Zeit = zeit20min)
  # Insert into final dataframe
  df20min <- rbind(df20min, cbind(df20min_b, df20min_c))
}
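A side note on the loop above: rbind-ing onto df20min in every iteration re-copies the growing data frame each time. Collecting each iteration's rows in a list and binding once at the end is usually faster; a sketch of the same logic, assuming rvest/xml2 are loaded as in the question:
library(purrr)
library(dplyr)
rows <- map(url_list, function(myLink) {
  site20min <- read_xml(myLink)
  titel20min <- site20min %>% html_nodes('title') %>% html_text()
  link20min  <- site20min %>% html_nodes('link') %>% html_text()
  zeit20min  <- site20min %>% html_nodes('pubDate') %>% html_text()
  lead20min  <- site20min %>% html_nodes('description') %>% html_text()
  df_a <- data.frame(Title = titel20min, Link = link20min, Lead = lead20min)
  cbind(df_a[-(1:2), ], data.frame(Zeit = zeit20min))  # drop the two channel-level rows, as above
})
df20min <- bind_rows(rows)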

Looping through a list of webpages with rvest follow_link

I'm trying to web-scrape the government release calendar, https://www.gov.uk/government/statistics, and use the rvest follow_link functionality to go to each publication link and scrape text from the next page. I have this working for a single page of results (40 publications are displayed per page), but can't get a loop to work so that I can run the code over all publications listed.
This is the code I run first to get the list of publications (just from the first 10 pages of results):
#Loading the rvest package
library('rvest')
library('dplyr')
library('tm')
#######PUBLISHED RELEASES################
###function to add number after 'page=' in url to loop over all pages of published releases results (only 40 publications per page)
###check the site and see how many pages you want to scrape, to cover months of interest
##titles of publications - creates a list
publishedtitles <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
url_base %>% read_html() %>%
html_nodes('h3 a') %>%
html_text()
})
##Dates of publications
publisheddates <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
url_base %>% read_html() %>%
html_nodes('.public_timestamp') %>%
html_text()
})
##Organisations
publishedorgs <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
url_base %>% read_html() %>%
html_nodes('.organisations') %>%
html_text()
})
##Links to publications
publishedpartial_links <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
url_base %>% read_html() %>%
html_nodes('h3 a') %>%
html_attr('href')
})
#Check all lists are the same length - if not, have to deal with missings before next step
# length(publishedtitles)
# length(publisheddates)
# length(publishedorgs)
# length(publishedpartial_links)
#str(publishedorgs)
#Combining all the lists to form a data frame
published <-data.frame(Title = unlist(publishedtitles), Date = unlist(publisheddates), Organisation = unlist(publishedorgs), PartLinks = unlist(publishedpartial_links))
#adding prefix to partial links, to turn into full URLs
published$Links = paste("https://www.gov.uk", published$PartLinks, sep="")
#Drop partial links column
keeps <- c("Title", "Date", "Organisation", "Links")
published <- published[keeps]
Then I want to run something like the code below, but over all pages of results. I've run this code manually, changing the parameters for each page, so I know it works.
session1 <- html_session("https://www.gov.uk/government/statistics?page=1")
list1 <- list()
for(i in published$Title[1:40]){
nextpage1 <- session1 %>% follow_link(i) %>% read_html()
list1[[i]]<- nextpage1 %>%
html_nodes(".grid-row") %>% html_text()
df1 <- data.frame(text=list1)
df1 <-as.data.frame(t(df1))
}
So the above would need to change page=1 in the html_session, and also published$Title[1:40]; I'm struggling to create a function or loop that includes both variables.
I think I should be able to do this using lapply:
df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
function(url_base){
for(i in published$Title[1:40]){
nextpage1 <- url_base %>% follow_link(i) %>% read_html()
list1[[i]]<- nextpage1 %>%
html_nodes(".grid-row") %>% html_text()
}
}
)
But I get the error
Error in follow_link(., i) : is.session(x) is not TRUE
I've also tried other methods of looping and turning it into a function but didn't want to make this post too long!
Thanks in advance for any suggestions and guidance :)
It looks like you may just need to start a session inside the lapply function. In the last chunk of code, url_base is simply a text string that gives the base URL. Would something like this work:
df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
             function(url_base){
               for(i in published$Title[1:40]){
                 tmpSession <- html_session(url_base)
                 nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
                 list1[[i]] <- nextpage1 %>%
                   html_nodes(".grid-row") %>% html_text()
               }
             })
To change published$Title[1:40] for each iteration of the lapply function, you could make an object that holds the lower and upper bounds of the indices:
lowers <- cumsum(c(1, rep(40, 9)))  # 1, 41, 81, ..., 361
uppers <- cumsum(rep(40, 10))       # 40, 80, 120, ..., 400
Then, you could include those in the call to lapply
df <- lapply(1:10, function(j){
  url_base <- paste0('https://www.gov.uk/government/statistics?page=', j)
  for(i in published$Title[lowers[j]:uppers[j]]){
    tmpSession <- html_session(url_base)
    nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
    list1[[i]] <- nextpage1 %>%
      html_nodes(".grid-row") %>% html_text()
  }
})
Not sure if this is what you want or not, I might have misunderstood the things that are supposed to be changing.
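One further refinement to either version: html_session(url_base) is re-created inside the inner for loop, so the listing page is downloaded once per title. Creating the session once per page and collecting the results in a local list avoids that; a sketch (results and page_list are illustrative names):
results <- lapply(1:10, function(j){
  url_base   <- paste0('https://www.gov.uk/government/statistics?page=', j)
  tmpSession <- html_session(url_base)  # one session per results page
  page_list  <- list()
  for(i in published$Title[lowers[j]:uppers[j]]){
    nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
    page_list[[i]] <- nextpage1 %>% html_nodes(".grid-row") %>% html_text()
  }
  page_list
})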

R scraping xpath

I am new to scraping and for a first task I decided to scrape this webpage: https://finstat.sk/databaza-financnych-udajov?EmployeeExact=False&RpvsInsert=False&Sort=assets&PerPage=20
Lower on the page there is a list that contains numeric information that I would like to scrape. Would you please help me with that? I tried this code:
library('rvest')
url <- 'https://finstat.sk/databaza-financnych-udajov?EmployeeExact=False&RpvsInsert=False&Sort=assets&PerPage=20'
webpage <- read_html(url)
tabulka <- html_nodes(webpage, xpath='/html/body/div[5]/div/div[3]/div[4]/div[2]/div/div/div[3]/table/tbody/tr[1]') %>%
html_table() %>%
head(tabulka)
After I run this I get the error: length(n) == 1L is not TRUE
Maybe this:
library(rvest)
library(tidyverse)
scrape_data <- function(x) {
  page <- read_html(sprintf("https://finstat.sk/databaza-financnych-udajov?EmployeeExact=False&RpvsInsert=False&Sort=assets&Page=%s", x))
  first_two_cols <- lapply(c("td.data-table-column-pinned", "td.hidden-xs"), function(x) page %>% html_nodes(x) %>% html_text(trim = T)) %>% data.frame()
  remaining_cols <- lapply(3:7, function(x) page %>% html_nodes(sprintf(".nowrap:nth-child(%s)", x)) %>% html_text(trim = T)) %>% data.frame()
  cbind(first_two_cols, remaining_cols) %>% set_names(paste0("var", 1:7))
}
# The following scrapes 5 pages, but the number can be adjusted:
df <- map_df(1:5, scrape_data)
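If you scrape many more pages, it is worth adding a short pause between requests; one way to do that without changing scrape_data (scrape_slowly is just an illustrative wrapper):
scrape_slowly <- function(x) {
  Sys.sleep(2)  # small pause between page requests
  scrape_data(x)
}
df <- map_df(1:5, scrape_slowly)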

Use R to do web Crawler and it can not capture content I need(text mining)(Taiwanese BBS, ptt)

This is Joe from National Taipei University of Business, Taiwan. I'm currently doing research on online games and e-sports by text mining social media. I chose to get the data from the most popular BBS in Taiwan, "PTT", but it seems my code can only capture the article titles and cannot reach the contents.
I tried to get the texts from www.ptt.cc/bbs/LoL/index6402.html down to index6391, and the code I used is the following.
install.packages("httr")
install.packages("XML")
install.packages("RCurl")
install.packages("xml2")
library(httr)
library(XML)
library(RCurl)
library(xml2)
data <- list()
for( i in 6391:6402) {
tmp <- paste(i, '.html', sep='')
url <- paste('https://www.ptt.cc/bbs/LoL/index', tmp, sep='')
tmp <- read_html(url)
html <- htmlParse(getURL(url))
url.list <- xml_find_all(tmp, "//div[@class='title']/a[@href]")
data <- rbind(data, as.matrix(paste('https://www.ptt.cc', url.list, sep='')))
}
data <- unlist(data)
getdoc <- function(line){
start <- regexpr('https://www', line)[1]
end <- regexpr('html', line)[1]
if(start != -1 & end != -1){
url <- substr(line, start, end+3)
html <- htmlParse(getURL(url), encoding='UTF-8')
doc <- xpathSApply(html, "//div[@id='main-content']", xmlValue)
name <- strsplit(url, '/')[[1]][4]
write(doc, gsub('html', 'txt', name))
}
}
setwd("E:/data")
sapply(data, getdoc)
But this code can only capture the titles, and my txt files are empty. I'm not sure which part went wrong, so I need some advice from you at Stack Overflow.
Any advice will be very much appreciated, and anyone helping me with this will be in the acknowledgements of my thesis. If you're curious, I will inform you of the research results after it is done. :)
Something like:
library(tidyverse)
library(rvest)
# change the end number
pages <- map(6391:6392, ~read_html(sprintf("https://www.ptt.cc/bbs/LoL/index%d.html", .)))
map(pages, ~xml_find_all(., "//div[@class='title']/a[@href]")) %>%
  map(xml_attr, "href") %>%
  flatten_chr() %>%
  map_df(function(x) {
    URL <- sprintf("https://www.ptt.cc%s", x)
    pg <- read_html(URL)
    data_frame(
      url = URL,
      text = html_nodes(pg, xpath = "//div[@id='main-content']") %>% html_text()
    )
  }) -> df
glimpse(df)
## Observations: 40
## Variables: 2
## $ url <chr> "https://www.ptt.cc/bbs/LoL/M.1481947445.A.17B.html", "https://www.ptt.cc/b...
## $ text <chr> "作者rainnawind看板LoL標題[公告] LoL 板 開始舉辦樂透!時間Sat Dec 17 12:04:03 2016\nIMT KDM 勝...
to make a data frame, or swap out the last part with:
dir.create("pttdocs")
map(pages, ~xml_find_all(., "//div[@class='title']/a[@href]")) %>%
  map(xml_attr, "href") %>%
  flatten_chr() %>%
  walk(function(x) {
    URL <- sprintf("https://www.ptt.cc%s", x)
    basename(x) %>%
      tools::file_path_sans_ext() %>%
      sprintf(fmt = "%s.txt") %>%
      file.path("pttdocs", .) -> fil
    pg <- read_html(URL)
    html_nodes(pg, xpath = "//div[@id='main-content']") %>%
      html_text() %>%
      writeLines(fil)
  })
to write files to a directory.
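Since the end goal is text mining, the saved .txt files can be read back into a single data frame afterwards; a minimal base-R sketch (corpus_df is an illustrative name):
files <- list.files("pttdocs", pattern = "\\.txt$", full.names = TRUE)
corpus_df <- data.frame(
  doc  = basename(files),
  text = vapply(files, function(f) paste(readLines(f, warn = FALSE), collapse = "\n"), character(1)),
  stringsAsFactors = FALSE
)
str(corpus_df)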
