Scraping data from a site with multiple urls - r

I've been trying to scrape a list of companies off of the site -Company list401.html. I can scrape the single table off of this page with this code:
> fileurl = read_html("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/2005/1")
> content = fileurl %>%
+ html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
+ html_table()
> contentframe = data.frame(content)
> view(contentframe)
However, I need all of the data from 1955 through 2005, with companies 1 through 500 for each year, whereas this page shows only 100 companies for a single year at a time. I've recognized that the only parts of the url that change are "...fortune500_archive/full/" YEAR "/" followed by 1, 101, 201, 301, or 401 (the start of the range of companies shown).
I also understand that I have to create a loop that will automatically collect this data for me as opposed to me manually replacing the url after saving each table. I've tried a few variations of sapply functions from reading other posts and watching videos, but none will work for me and I'm lost.

A few suggestions to get you started. First, it may be useful to write a function to download and parse each page, e.g.
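library(rvest)
# Download and parse one page of the list for a given year and starting rank
getData <- function(year, start) {
  url <- sprintf("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/%d/%d.html",
                 year, start)
  fileurl <- read_html(url)
  content <- fileurl %>%
    html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
    html_table()
  # html_table() returns a list of tables; convert it to a data frame and return it
  data.frame(content)
}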
We can then loop through the years and pages using lapply (as well as do.call(rbind, ...) to rbind all 5 dataframes from each year together). E.g.:
D <- lapply(2000:2005, function(year) {
  do.call(rbind, lapply(seq(1, 500, 100), function(start) {
    cat(paste("Retrieving", year, ":", start, "\n"))
    getData(year, start)
  }))
})
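Note that D is a list with one data frame per year. A minimal sketch of how the per-year results could be combined into a single data frame with a year column follows. fake_getData is a hypothetical offline stand-in for getData (its rank and company columns are made up), used only so the sketch runs without network access.

```r
# Hypothetical stand-in for getData() so the sketch runs offline
fake_getData <- function(year, start) {
  data.frame(rank = start:(start + 99),
             company = paste0("Company_", start:(start + 99)))
}

years  <- 2004:2005
starts <- seq(1, 401, by = 100)   # 1, 101, 201, 301, 401

D <- lapply(years, function(year) {
  pages <- lapply(starts, function(start) fake_getData(year, start))
  year_df <- do.call(rbind, pages)   # stack the five pages of one year
  year_df$year <- year               # remember which year the rows belong to
  year_df
})

# Stack all years into one data frame
all_years <- do.call(rbind, D)
```

With the real getData, a short Sys.sleep() between requests would be a polite addition.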

Related

Using Sys.sleep breaks rvest scrape

I am trying to scrape a website that has hundreds of pages. I have been using the following code to get through all pages, but in order to not overwhelm the website, there must be a pause between scrapes. I have been trying to induce this pause using Sys.sleep(15), but this causes the final dataframe to come out empty. Any ideas why this is happening?
Version one:
a <- lapply(paste0("https://website.com/page/", 1:500),
            function(url) {
              url %>% read_html() %>%
                html_nodes(".text") %>%
                html_text()
              Sys.sleep(15)
            })
raw_posts <- unlist(a)
a <- data.frame(raw_posts)
This simply returns an empty data frame.
Version two:
url_base <- "https://website.com/page/"
map_df(1:500, function(i) {
  Sys.sleep(15)
  cat(" bababooeey ")
  pg <- read_html(sprintf(url_base, i))
  data.frame(text = html_text(html_nodes(pg, ".text")),
             date = html_text(html_nodes(pg, "time")),
             stringsAsFactors = FALSE)
}) -> b
This just pastes the same set of results found on the same page over and over.
Does anything stand out as being wrongly coded?
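Two things stand out, sketched below with a hypothetical offline stand-in (fetch_text) in place of the read_html()/html_text() step. In version one, an R function returns the value of its last expression, and Sys.sleep() returns NULL, so every element of the list is NULL. In version two, url_base contains no format placeholder such as %d, so sprintf(url_base, i) yields the same URL for every i.

```r
# Hypothetical stand-in for the real download-and-parse step (no network needed)
fetch_text <- function(url) paste("text from", url)

urls <- paste0("https://website.com/page/", 1:3)

# Version one as written: Sys.sleep() is the last expression, so NULL is returned
broken <- lapply(urls, function(url) {
  fetch_text(url)
  Sys.sleep(0)   # the question uses 15 seconds; 0 keeps the sketch fast
})

# Fix: store the result, pause, then return the stored result
fixed <- lapply(urls, function(url) {
  res <- fetch_text(url)
  Sys.sleep(0)
  res
})

# Version two: the page number must appear in the format string
url_base  <- "https://website.com/page/%d"
page_urls <- sprintf(url_base, 1:3)
```

Here unlist(broken) is NULL, which is why the resulting data frame is empty, while fixed holds one result per page.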

rvest scraper working, but not returning newest data from website + not returning links

I'm using rvest to scrape the title, date and nested link for Danish parliamentary committee agendas. In general it works fine and I get the data I want, but I have two issues that I hope you can help with. As an example I'm scraping this committee website for the information in the table and the nested links. https://www.ft.dk/da/udvalg/udvalgene/liu/dokumenter/udvalgsdagsordner?committeeAbbreviation=LIU&session=20211
First problem - Missing newest data:
The scraper does not get the newest data although it is available on the website. For example, on the page linked above there are two entries from June that are not detected. The problem is consistent across the other committee pages, where the newest entries are not picked up either.
Q: Does anybody know why the data is not showing up in R even though it is present on the website, and is there a solution for getting it?
Second problem - Missing links:
For the particular committee (LIU) linked to above, I'm not able to get the full nested links to the agendas, even though it works for all the other committees. Instead it just returns www.ft.dk as the nested link. Up until now I have solved it by manually adding every nested link to the dataset, but it is rather time consuming. Does anybody know why this is not working and can help solve it?
Q: How do I get the nested link for the individual committee agenda?
I'm using loops to go through all the different committee pages, but here's the basic code:
library(tidyverse)
library(rvest)
library(httr)
library(dplyr)
library(purrr)
library(stringr)
# base url of Folketinget for committee agendas
base.url <- "https://www.ft.dk/da/udvalg/udvalgene/"
#List of all committees
committee <- c("§71","BEU", "BUU", "UPN", "EPI", "ERU", "EUU", "FIU", "FOU", "FÆU", "GRA", "GRU", "BOU", "IFU", "KIU", "KEF", "KUU", "LIU", "MOF", "REU", "SAU", "SOU", "SUU", "TRU", "UFU", "URU", "UUI", "UFO", "ULØ", "UFS", "UPV", "UER", "UET", "UUF")
## Set up search archives
if (!dir.exists("./DO2011-2022/")) {
dir.create("./DO2011-2022/")
}
search.archive <- "./DO2011-2022/dagsorden_search/"
if (!dir.exists(search.archive)) {
dir.create(search.archive)
}
# empty data set
cols <- c("date", "title", "cmte", "link")
df <- cols %>% t %>% as_tibble(.name_repair = "unique") %>% `[`(0, ) %>% rename_all(~cols)
## Set up main date parameters
first.yr <- 2011
last.yr <- 2022
session <- 1:2
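# record the start time so the elapsed scraping time can be reported at the end
start <- Sys.time()
# main loop over committees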
for (i in committee) {
for(current.yr in first.yr:last.yr) {
for(j in session) {
print(paste("Working on committee:", i, "Year", current.yr, "session", j))
result.page <- 1
## INTERIOR LOOP OVER SEARCH PAGES
repeat {
# build archive file name
file.name <- paste0(search.archive, i,
current.yr, "session", j,
"-page-",
result.page,
".html")
# construct url to pull
final.url <- paste0(base.url,i, "/dokumenter/udvalgsdagsordner?committeeAbbreviation=", i,
"&session=", current.yr, j, "&pageSize=200&pageNumber=", result.page)
# check archive / pull in page
#Fix problem with missing data from 2021 page - its because newly downloaded data is not on previous downloaded pages.
if(!current.yr == 2021){
if (file.exists(file.name)) {
page <- read_html(x = file.name)
} else {
page <- read_html(final.url)
tmp <- page %>% as.character
#Sys.sleep(3 + rpois(lambda = 2, n = 1))
write(x = tmp, file = file.name)
}
}
else{
page <- read_html(final.url)
tmp <- page %>% as.character
Sys.sleep(5)
write(x = tmp, file = file.name)
}
# only grab length of results once
if (result.page == 1) {
# get total # search results
total.results <- page %>%
html_nodes('.pagination-text-container-top .results') %>%
html_text(trim = T) %>%
str_extract("[[:digit:]]*") %>%
as.numeric
# break out of loop if no results on page (typical for session=2)
if (length(total.results) == 0) break
# count search pages to visit (NB: 200 = number of results per page)
count.pages <- ceiling(total.results / 200)
# print total results to console
print(paste("Total of", total.results, "for committee", i))
}
if(i == "FOU"|i == "GRU"){
titles <- page %>% html_nodes('.column-documents:nth-child(1) .column-documents__icon-text') %>% html_text(trim = T)
}
else{
titles <- page %>% html_nodes('.highlighted+ .column-documents .column-documents__icon-text') %>% html_text(trim = T) }
dates <- page %>% html_nodes('.highlighted .column-documents__icon-text') %>% html_text(trim = T)
# Solution to problem with links for LIU
if(i == "LIU"){
links <- page %>% html_nodes(".column-documents__link") %>% html_attr('href') %>% unique()
}
else{
links <- page %>% html_nodes(xpath = "//td[@data-title = 'Titel']/a[@class = 'column-documents__link']") %>% html_attr('href')
}
links <- paste0("https://www.ft.dk", links)
# build data frame from data
df <- df %>% add_row(
date = dates,
title = titles,
cmte = i,
link = links)
## BREAK LOOP when result.page == length of search result pages by year
if (result.page == count.pages) break
## iterate search page by ONE
result.page <- result.page + 1
} #END PAGE LOOP
} #END SESSION LOOP
} #END YEAR LOOP
} #END COMMITTEE LOOP
end <- Sys.time()
#Scraping time
end - start
If I alternatively use selectorgadget instead of xpath to get the links, I get the following error:
Error in tokenize(css) : Unclosed string at 42
links <- page %>% html_nodes(".highlighted .column-documents__icon-text']") %>% html_attr('href')
Thanks in advance.
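The tokenizer error comes from the stray '] at the end of the CSS selector: the single quote near the end opens a string that is never closed. That selector would also target the text span rather than the link element, so html_attr('href') would likely return NA even once the quote is removed. A minimal sketch of a CSS-only way to collect the links follows, run against a small inline HTML fragment whose structure and href values are assumptions modelled on the selectors in the code above, not the real page. It uses xml2::url_absolute() to resolve relative links against the site root instead of paste0().

```r
library(rvest)

# Assumed miniature of the results table; the real page markup may differ
html <- '<table>
  <tr>
    <td data-title="Dato" class="highlighted"><span class="column-documents__icon-text">01-06-2022</span></td>
    <td data-title="Titel"><a class="column-documents__link" href="/udvalg/liu/dagsorden-1.htm"><span class="column-documents__icon-text">Dagsorden 1</span></a></td>
  </tr>
  <tr>
    <td data-title="Dato" class="highlighted"><span class="column-documents__icon-text">08-06-2022</span></td>
    <td data-title="Titel"><a class="column-documents__link" href="/udvalg/liu/dagsorden-2.htm"><span class="column-documents__icon-text">Dagsorden 2</span></a></td>
  </tr>
</table>'

page <- read_html(html)

# CSS equivalent of the XPath used for the other committees
links <- page %>%
  html_nodes("td[data-title='Titel'] a.column-documents__link") %>%
  html_attr("href")

# Resolve relative hrefs against the site root
links <- xml2::url_absolute(links, "https://www.ft.dk/")
```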

Adding ifelse() into a Map function

I've got a simple Map function that scrapes text files from a blog site. It's pretty easy to get a scraper that gets all of the text files and downloads them to my working directory. My goal: use an ifelse() or a plain if statement to only scrape a file based on a certain date.
Eg, if four files were posted on 1/31/19, and I pointed my ifelse at that date, the function would return those four files. Code:
library(tidyverse)
library(rvest)
library(httr)       # config()
library(lubridate)  # parse_date_time()
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href")
# Getting date elements
dates <- page %>%
html_nodes("node.dates") %>%
html_text()
dates <- parse_date_time(dates, "%m/%d/%Y", tz = "EST",
locale = Sys.getlocale("LC_TIME"))
# Function
out <- Map(function(ln) {
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", ln)),
config(ssl_verifypeer = FALSE))
write <- writeBin(fun1$response$content)
ifelse(dates == '2019-01-31', write, "He's dead, Jim")
}, links)
I've tried various ways to get that if statement in there, and also moving the writeBin around. (Usually the writeBin would not be vectorized - I did it for easy viewing in my ifelse). Error:
Error in ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok] :
replacement has length zero
If I leave out the if code, everything works great, it just returns many text files, when I only want the ones from the specified date.
Based on the description, it seems you want to check the corresponding element of 'dates' for each element of 'links' and then apply the if/else. If that is the case, we can pass two arguments to Map so that each link is paired with its own date:
Map(function(ln, y) {
  fun1 <- html_session(URLencode(
    paste0("https://www.example-blog", ln)),
    config(ssl_verifypeer = FALSE))
  if (y == '2019-01-31') {
    # writeBin() needs a destination; using basename(ln) as the file name is an assumption
    writeBin(fun1$response$content, basename(ln))
  } else "He's dead, Jim"
},
links, dates)
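A minimal offline sketch of the same idea: Map() walks links and dates in parallel, so each call sees one link together with its own date. The file names, the dates, and the download_file stand-in are hypothetical; in real use the stand-in would be replaced by the html_session()/writeBin() step.

```r
# Hypothetical inputs; dates are kept as character strings for simplicity
links  <- c("/files/a.txt", "/files/b.txt", "/files/c.txt")
dates  <- c("2019-01-31", "2019-02-01", "2019-01-31")
target <- "2019-01-31"

# Hypothetical stand-in for the download step; it only reports what it would fetch
download_file <- function(ln) paste0("downloaded ", basename(ln))

# Pair each link with its date and act only on matching dates
out <- Map(function(ln, d) {
  if (d == target) download_file(ln) else "skipped"
}, links, dates)

# Alternative: filter first so that only matching links are ever requested
wanted <- links[dates == target]
```

Filtering before the loop avoids requesting pages that will be thrown away, which may matter when the site is slow or rate-limited.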

Extract data from multiple webpages from a website which reloads automatically in r

I have seen other posts that show how to extract data from multiple webpages. The problem with my website is that when I scroll down to see how many pages the data is divided into, the page automatically loads the next batch of data, so I cannot identify the number of pages. I don't know HTML and JavaScript well enough to identify the attribute or method being called, so I have found another way to get the number of pages.
When the website is loaded in a browser, it shows the number of records found. Dividing that number by 30 (the number of records per page) gives the number of pages, e.g. if 90 records are found, then 90/30 = 3 pages.
Here is the code to get the number of records found on that page:
active_name_data1 <- html_nodes(webpage,'.active')
active1 <- html_text(active_name_data1)
as.numeric(gsub("[^\\d]+", "", word(active1[1],start = 1,end =1), perl=TRUE))
Another approach is to get the attribute for the number of pages, i.e.
url='http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
webpage <- read_html(url)
active_data_html <- html_nodes(webpage,'a.act')
active <- html_text(active_data_html)
Here active gives me the page numbers, i.e. "1" " 2" " 3" " 4".
So I'm unable to work out how to get the active page's data and then iterate over the other pages to get the entire data set.
Here is what I have tried (uuu_df2 is the data frame with the multiple links I want to crawl):
library(rvest)
library(xml2)   # xml_find_first(), xml_find_all(), xml_text()
library(plyr)   # llply(), ldply()
uuu_df2 <- data.frame(x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'),
stringsAsFactors = FALSE)
urlList <- llply(uuu_df2[,1], function(url){
  this_pg <- read_html(url)
  results_count <- this_pg %>%
    xml_find_first(".//span[@id='resultCount']") %>%
    xml_text() %>%
    as.integer()
  if(!is.na(results_count) && (results_count > 0)){
    cards <- this_pg %>%
      xml_find_all('//div[@class="SRCard"]')
    df <- ldply(cards, .fun=function(x){
      y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                      excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                      locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                      society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
      return(y)
    })
  } else {
    df <- NULL
  }
  return(df)
}, .progress = 'text')
names(urlList) <- uuu_df2[,1]
a = dplyr::bind_rows(urlList)
But this code just gives me the data from the active page and does not iterate through the other pages of each link.
P.S.: If a link doesn't have any records, the code skips it and moves on to the next link in the list.
Any suggestions on what changes should be made to the code would be helpful. Thanks in advance.
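A minimal sketch of the record-count idea described above: extract the digits from the count text, divide by 30 records per page, round up, and build one URL per page. The count text, the placeholder URL, and the page query parameter are all assumptions for illustration; the site's real paging mechanism would need to be checked, for example in the browser's network tab.

```r
# Example text as it might appear in the record-count element (assumed)
count_text <- "95 results found"

# Keep only the digits and convert to a number
n_records <- as.numeric(gsub("[^0-9]", "", count_text))

per_page <- 30
n_pages  <- ceiling(n_records / per_page)   # round up so a partial last page is included

# Placeholder URL; "page" is a hypothetical parameter name
base_url  <- "http://www.example.com/search?cityName=Thane"
page_urls <- paste0(base_url, "&page=", seq_len(n_pages))
```

Each element of page_urls could then be passed to the same parsing function that already works for the active page.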

How to scrape all subreddit posts in a given time period

I have a function to scrape all the posts in the Bitcoin subreddit between 2014-11-01 and 2015-10-31.
However, I'm only able to extract about 990 posts that go back only to October 25. I don't understand what's happening. I included a Sys.sleep of 15 seconds between each extract after referring to https://github.com/reddit/reddit/wiki/API, to no avail.
Also, I experimented with scraping from another subreddit (fitness), but it also returned around 900 posts.
require(jsonlite)
require(dplyr)
getAllPosts <- function() {
url <- "https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&limit=100"
extract <- fromJSON(url)
posts <- extract$data$children$data %>% dplyr::select(name, author, num_comments, created_utc,
title, selftext)
after <- posts[nrow(posts),1]
url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=",after,"&limit=100")
extract.next <- fromJSON(url.next)
posts.next <- extract.next$data$children$data
# execute while loop as long as there are any rows in the data frame
while (!is.null(nrow(posts.next))) {
posts.next <- posts.next %>% dplyr::select(name, author, num_comments, created_utc,
title, selftext)
posts <- rbind(posts, posts.next)
after <- posts[nrow(posts),1]
url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=",after,"&limit=100")
Sys.sleep(15)
extract <- fromJSON(url.next)
posts.next <- extract$data$children$data
}
posts$created_utc <- as.POSIXct(posts$created_utc, origin="1970-01-01")
return(posts)
}
posts <- getAllPosts()
Does reddit have some kind of limit that I'm hitting?
Yes, all reddit listings (posts, comments, etc.) are capped at 1000 items; they're essentially just cached lists, rather than queries, for performance reasons.
To get around this, you'll need to do some clever searching based on timestamps.
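One way to do this, sketched under the assumption that the timestamp search syntax from the question is still accepted, is to split the full range into smaller windows and page through each window separately, so that no single listing reaches the 1000-item cap. The window length of seven days is an arbitrary choice; busy periods may need shorter windows.

```r
start_ts <- 1414800000          # range start from the question
end_ts   <- 1446335999          # range end from the question
window   <- 7 * 24 * 60 * 60    # seven days in seconds (arbitrary choice)

# Window boundaries: each window starts one second after the previous one ends
from <- seq(start_ts, end_ts, by = window)
to   <- pmin(from + window - 1, end_ts)

# One search URL per window; %% produces a literal percent sign
urls <- sprintf(
  "https://www.reddit.com/r/bitcoin/search.json?q=timestamp%%3A%.0f..%.0f&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&limit=100",
  from, to
)
```

Each URL in urls would then be paged with the after parameter as in getAllPosts(), and the per-window results combined with rbind.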
