Using rvest to create a database from multiple XML files - r

Using R to extract relevant data from multiple online XML files to create a database
I just started learning R to do text analysis. Here is what I am trying to do: I want to use rvest in R to create a CSV database of bill summaries from the 116th Congress from online XML files. The database should have two columns:
The title of the bill.
The summary text of the bill.
The website source is https://www.govinfo.gov/bulkdata/BILLSUM/116/hr
The issue I am having is that I would like to collect all the summaries listed there, so I need to scrape multiple links. But I don't know how to make R run a function over a series of different links and then extract the expected data.
I have tried the following code, but I am not sure exactly how to apply it to my specific problem. My code also returned an error. Please see my code below. Thanks in advance for any help!
library(rvest)
library(tidyverse)
library(purrr)
html_source <- "https://www.govinfo.gov/bulkdata/BILLSUM/116/hr?page="
map_df(1:997, function(i) {
cat(".")
pg <- read_html(sprintf(html_source, i))
data.frame(title = html_text(html_nodes(pg, "title")),
bill_text %>% html_node("summary-text") %>% html_text(),
stringsAsFactors = FALSE)
}) -> Bills
Error in open.connection(x, "rb") : HTTP error 406.

At the bottom of that page is a link to a zipfile with all of the XML files, so instead of scraping each one individually (which will get onerous with a suggested crawl-delay of 10s) you can just download the zipfile and parse the XML files with xml2 (rvest is for HTML):
library(xml2)
library(purrr)
local_dir <- "~/Downloads/BILLSUM-116-hr"
local_zip <- paste0(local_dir, '.zip')
download.file("https://www.govinfo.gov/bulkdata/BILLSUM/116/hr/BILLSUM-116-hr.zip", local_zip)
# returns vector of paths to unzipped files
xml_files <- unzip(local_zip, exdir = local_dir)
bills <- xml_files %>%
map(read_xml) %>%
map_dfr(~list(
# note xml2 functions only take XPath selectors, not CSS ones
title = xml_find_first(.x, '//title') %>% xml_text(),
summary = xml_find_first(.x, '//summary-text') %>% xml_text()
))
bills
#> # A tibble: 1,367 x 2
#> title summary
#> <chr> <chr>
#> 1 For the relief of certain aliens w… Provides for the relief of certain …
#> 2 "To designate the facility of the … "Designates the facility of the Uni…
#> 3 Consolidated Appropriations Act, 2… <p><b>Consolidated Appropriations A…
#> 4 Financial Institution Customer Pro… <p><strong>Financial Institution Cu…
#> 5 Zero-Baseline Budget Act of 2019 <p><b>Zero-Baseline Budget Act of 2…
#> 6 Agriculture, Rural Development, Fo… "<p><b>Highlights: </b></p> <p>This…
#> 7 SAFETI Act <p><strong>Security for the Adminis…
#> 8 Buy a Brick, Build the Wall Act of… <p><b>Buy a Brick, Build the Wall A…
#> 9 Inspector General Access Act of 20… <p><strong>Inspector General Access…
#> 10 Federal CIO Authorization Act of 2… <p><b>Federal CIO Authorization Act…
#> # … with 1,357 more rows
The summary column is HTML-formatted, but by and large this is pretty clean already.
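If plain-text summaries and a CSV file are needed, as the question asks, the HTML can be stripped before writing. The following is only a sketch under stated assumptions: `strip_html` is a hypothetical helper (not part of xml2), the demo data frame stands in for the real `bills` tibble, and the output path is arbitrary.

```r
library(xml2)

# Hypothetical helper: parse one summary as an HTML fragment and keep only its text.
strip_html <- function(x) {
  if (is.na(x) || !nzchar(x)) return(x)
  # Wrapping in <body> makes xml2 treat the string as literal HTML, not a path.
  xml_text(read_html(paste0("<body>", x, "</body>")))
}

# Small stand-in for the `bills` tibble built above (made-up summary text).
bills_demo <- data.frame(
  title = c("SAFETI Act", "Private relief bill"),
  summary = c("<p><strong>SAFETI Act</strong></p> <p>This bill does something.</p>",
              "Provides for the relief of certain persons."),
  stringsAsFactors = FALSE
)

# Strip the markup from every summary; plain-text summaries pass through unchanged.
bills_demo$summary <- vapply(bills_demo$summary, strip_html, character(1),
                             USE.NAMES = FALSE)

# Write the two-column CSV; the location is an assumption for illustration.
csv_path <- file.path(tempdir(), "bill_summaries.csv")
write.csv(bills_demo, csv_path, row.names = FALSE)
```

The same `vapply()` call should work on `bills$summary` from the answer, since that column is a plain character vector.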

Related

Is there a way to put a wildcard character in a web address when using rvest?

I am new to web scraping and am using R and rvest to try to pull some info for a friend. This project might be a bit much for my first, but hopefully someone can help or tell me if it is possible.
I am trying to pull info from https://www.veteranownedbusiness.com/mo like business name, address, phone number, and description. I started by pulling all the names of the businesses and was going to loop through each page to pull the information by business. The problem I ran into is that the business URLs have numbers assigned to them:
www.veteranownedbusiness.com/business/**32216**/accel-polymers-llc
Is there a way to tell R to ignore this number or accept any entry in its spot so that I could loop through the business names?
Here is the code I have so far to get and clean the business titles if it helps:
library(rvest)
library(tibble)
library(stringr)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
HTML <- read_html(vet_name_list)
biz_names_html <- html_nodes(HTML, '.top_level_listing a')
biz_names <- html_text(biz_names_html)
biz_names <- biz_names[biz_names != ""]
biz_names_lower <- tolower(biz_names)
biz_names_sym <- gsub("[][!#$&%()*,.:;<=>@^_`|~.{}]", "", biz_names_lower)
biz_names_dub <- str_squish(biz_names_sym)
biz_name_clean <- chartr(" ", "-", biz_names_dub)
No, I'm afraid you can't use wildcards to get a valid URL. What you can do is scrape all the correct URLs from the page, number and all.
To do this, we find all the correct nodes (I'm using XPath here rather than CSS selectors since it gives a bit more flexibility), then get the href attribute from each node.
This produces a data frame of business names and URLs. Here's a fully reproducible example:
library(rvest)
library(tibble)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
biz <- read_html(vet_name_list) %>%
html_nodes(xpath = "//tr[@class='top_level_listing']/td/a[@href]")
tibble(business = html_text(biz),
url = paste0(biz %>% html_attr("href")))
#> # A tibble: 550 x 2
#> business url
#> <chr> <chr>
#> 1 Accel Polymers, LLC /business/32216/ac~
#> 2 Beacon Car & Pet Wash /business/35987/be~
#> 3 Compass Quest Joplin Regional Veteran Services /business/21943/co~
#> 4 Financial Assistance for Military Experiencing Divorce /business/20797/fi~
#> 5 Focus Marines Foundation /business/29376/fo~
#> 6 Make It Virtual Assistant /business/32204/ma~
#> 7 Malachi Coaching & Training Ministries Int'l /business/24060/ma~
#> 8 Mike Jackson - Author /business/29536/mi~
#> 9 The Mission Continues /business/14492/th~
#> 10 U.S. Small Business Conference & EXPO /business/8266/us-~
#> # ... with 540 more rows
Created on 2022-08-05 by the reprex package (v2.0.1)
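The scraped hrefs are relative, so they need the site's base URL before they can be visited. A minimal sketch of that step, assuming the base URL below and using an inline HTML snippet in place of the live listing page; `fetch_pages` is a hypothetical helper and the one-second delay is an arbitrary choice:

```r
library(rvest)
library(xml2)

# Assumed base URL for resolving the relative hrefs.
base_url <- "https://www.veteranownedbusiness.com"

# Inline stand-in for the listing page so the example runs offline.
listing <- read_html(paste0(
  '<table><tr class="top_level_listing"><td>',
  '<a href="/business/32216/accel-polymers-llc">Accel Polymers, LLC</a>',
  '</td></tr></table>'
))

links <- html_nodes(listing, xpath = "//tr[@class='top_level_listing']/td/a[@href]")

# Resolve each relative href against the base URL.
full_urls <- url_absolute(html_attr(links, "href"), base_url)

# Hypothetical helper: read each business page, pausing between requests.
fetch_pages <- function(urls, delay = 1) {
  lapply(urls, function(u) {
    Sys.sleep(delay)
    read_html(u)
  })
}
```

Calling `fetch_pages(full_urls)` would then return one parsed page per business, from which the address, phone number, and description can be extracted with whatever selectors those pages use.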

Converting PDF to text with pdftools in R returning empty string

In the following example, the result is empty for every page in the PDF.
library(pdftools)
rm(list = ls())
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
url = "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
destfile = file.path(getwd(), basename(url))
download.file(url, destfile, mode = "wb")
file = list.files(path=".", pattern="pdf$")
pdf_text(file)
I am not sure whether there is a problem with the PDF file and the way it was scanned and saved that prevents PDF reading.
Is there a workaround for PDF files like this or a better package/library that I should consider?
I would guess that the issue is that it's a scanned document, so you probably need an OCR tool to extract the text from it. One option is the tesseract package:
library(tesseract)
url = "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
eng <- tesseract("eng")
text <- tesseract::ocr(url, engine = eng)
#> Converting page 1 to file16a069b77ed2SBS72-Pricing-Supplement_1.png... done!
#> Converting page 2 to file16a069b77ed2SBS72-Pricing-Supplement_2.png... done!
#> Converting page 3 to file16a069b77ed2SBS72-Pricing-Supplement_3.png... done!
#> Converting page 4 to file16a069b77ed2SBS72-Pricing-Supplement_4.png... done!
#> Converting page 5 to file16a069b77ed2SBS72-Pricing-Supplement_5.png... done!
#> Converting page 6 to file16a069b77ed2SBS72-Pricing-Supplement_6.png... done!
#> Converting page 7 to file16a069b77ed2SBS72-Pricing-Supplement_7.png... done!
#> Converting page 8 to file16a069b77ed2SBS72-Pricing-Supplement_8.png... done!
text[[1]]
#> [1] "APPLICABLE PRICING SUPPLEMENT DATED 28 JANUARY 2022\nThe Standard Bank of South Africa Limited\n(dncorporated with limited liability under Registration Number 1962/000738/06\nin the Republic of South Africa)\nIssue of ZAR404,000,000 Senior Unsecured Floating Rate Notes due 02 February 2029\nUnder its ZAR110,000,000,000 Domestic Medium Term Note Programme\nThis document constitutes the Applicable Pricing Supplement relating to the issue of Notes described herein.\nTerms used herein shall be deemed to be defined as such for the purposes of the terms and conditions (the\n“Terms and Conditions\") set forth in the Programme Memorandum dated 24 December 2020 (the \"Programme\nMemorandum\"), as updated and amended from time to time. This Pricing Supplement must be read in\nconjunction with such Programme Memorandum. To the extent that there is any conflict or inconsistency between\nthe contents of this Pricing Supplement and the Programme Memorandum, the provisions of this Pricing\nSupplement shall prevail.\nDESCRIPTION OF THE NOTES\nl. Issuer The Standard Bank of South Africa\nLimited\n2. Debt Officer Amo Daehnke, Group Chief\nFinancial and Value Management\nOfficer of Standard Bank Group\nLimited\n3. Status of the Notes Senior Unsecured\n4. (a) Series Number 72\n(b) Tranche Number ]\n5. Aggregate Nominal Amount ZAR404,000,000\n6. Redemption/Payment Basis N/A\n7. Type of Notes Floating Rate Notes\n8. Interest Payment Basis Floating Rate\n9. Form of Notes Registered Notes\n10. Automatic/Optional Conversion from one Interest/Payment N/A\nBasis to another\nll. Issue Date 2 February 2022\n12. Business Centre Johannesburg\n13. Additional Business Centre N/A\n14. Specified Denomination ZAR]1,000,000\n15. Calculation Amount ZAR1,000,000\n16. Issue Price 100%\n17. Interest Commencement Date 02 February 2022\n18. Maturity Date 02 February 2029\n19. Maturity Period N/A\n1\n"
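To avoid running OCR on PDFs that already have a text layer, the output of `pdf_text()` can be checked first. This is only a sketch: `needs_ocr` is a hypothetical helper, and treating "every page is blank" as the signal for a scanned document is an assumption.

```r
# Hypothetical helper: TRUE when no page contains any non-whitespace text,
# which is what pdf_text() returned for the scanned file in the question.
needs_ocr <- function(pages) {
  length(pages) == 0 || all(!nzchar(trimws(pages)))
}

# Character vectors standing in for pdf_text() results.
scanned_pages <- c("", "  \n", "")
digital_pages <- c("APPLICABLE PRICING SUPPLEMENT", "")

scanned_flag <- needs_ocr(scanned_pages)
digital_flag <- needs_ocr(digital_pages)

# Possible use with the packages above (not run here):
# pages <- pdftools::pdf_text(file)
# if (needs_ocr(pages)) pages <- tesseract::ocr(file, engine = tesseract::tesseract("eng"))
```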

Why can't I read clickable links for webscraping with rvest?

I am trying to webscrape this website.
The content I need is available after clicking on each title. I can get the content I want if I do this for example (I am using SelectorGadget):
library("rvest")
url_boe ="https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"
sample_text = html_text(html_nodes(read_html(url_boe), "#output .page-section"))
However, I would need to get each text for each clickable link in the website. So I usually do:
url_boe = "https://www.bankofengland.co.uk/news/speeches"
html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")
I get an empty object though. I tried different variants of the code but with the same result.
How can I read those links and then apply the code in the first part to all the links?
Can anyone help me?
Thanks!
As @KonradRudolph has noted before, the links are inserted dynamically into the webpage, so a plain read_html() call never sees them. I have therefore written some code using RSelenium and rvest to tackle this issue:
library(rvest)
library(RSelenium)
# URL
url = "https://www.bankofengland.co.uk/news/speeches"
# Base URL
base_url = "https://www.bankofengland.co.uk"
# Instantiate a Selenium server
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")
# Assign the client to an object
rem_dr <- rD[["client"]]
# Navigate to the URL
rem_dr$navigate(url)
# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])
# Extract links and concatenate them with the base_url
links <- page %>%
html_nodes(".release-speech") %>%
html_attr('href') %>%
paste0(base_url, .)
# Get links names
links_names <- page %>%
html_nodes('#SearchResults .exclude-navigation') %>%
html_text()
# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]
# Create a data.frame with the results
df <- data.frame(links_names, links)
# Close the client and the server
rem_dr$close()
rD$server$stop()
The resulting data.frame looks like this:
> head(df)
links_names
1 Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2 Tackling climate for real: progress and next steps - speech by Andrew Bailey
3 Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5 Responsible openness in the Insurance Sector - speech by Anna Sweeney
6 Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4 https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5 https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6 https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
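To answer the second half of the question, the selector from the first snippet can be applied to each collected link. A rough sketch, assuming the individual speech pages are static enough for `read_html()` (as the asker's first snippet suggests); both function names are hypothetical and the delay is arbitrary:

```r
library(rvest)

# Hypothetical helper: pull the speech body out of one parsed page,
# using the selector from the question.
extract_speech_text <- function(page) {
  sections <- html_text(html_nodes(page, "#output .page-section"))
  paste(trimws(sections), collapse = "\n")
}

# Hypothetical helper: visit every link, pausing between requests.
scrape_speeches <- function(links, delay = 1) {
  vapply(links, function(link) {
    Sys.sleep(delay)
    extract_speech_text(read_html(link))
  }, character(1), USE.NAMES = FALSE)
}

# Offline check with an inline page that mimics the expected structure.
demo_page <- read_html(paste0(
  '<div id="output">',
  '<div class="page-section">First part.</div>',
  '<div class="page-section">Second part.</div>',
  '</div>'
))
demo_text <- extract_speech_text(demo_page)
```

With the data frame above, something like `df$text <- scrape_speeches(as.character(df$links))` would add one text column per speech.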

Using Rvest to webscrape rankings

I am wanting to scrape all rankings from the following website:
https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating
I have tried using CSS selectors, which tell me to use ".ratingNum", but it leaves me with blank data. I have also tried using the GET function, which results in a similar problem.
# Attempt 1
url <- 'https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating'
webpage <- read_html(url)
rank_data_html <- html_nodes(webpage,'.rankingNum')
rank_data <- html_table(rank_data_html)
head(rank_data)
# Attempt 2
res <- GET("https://www.glassdoor.com/ratingsDetails/full.htm",
query=list(employerId="432",
employerName="McDonalds"))
doc <- read_html(content(res, as="text"))
html_nodes(doc, ".ratingNum")
rank_data <- html_table(rank_data_html)
head(rank_data)
I expect the result to give me a list of all of the rankings, but instead it is giving me an empty list, or a list that doesn't include the rankings.
Your list is empty because you're GETing an unpopulated HTML document. Frequently when this happens you have to resort to RSelenium and co., but Glassdoor's public-facing API actually has everything you need – if you know where to look.
(Note: I'm not sure if this is officially part of Glassdoor's public API, but I think it's fair game if they haven't made more of an effort to conceal it. I tried to find some information, but their documentation is pretty meager. Usually companies will look the other way if you're just doing a smallish analysis and not slamming their servers or trying to profit from their data, but it's still a good idea to heed their ToS. You might want to shoot them an email describing what you're doing, or even ask about becoming an API partner. Make sure you adhere to their attribution rules. Continue at your own peril.)
Take a look at the network tab in your browser's developer tools. You will see some GET requests that return JSON, and one of those has the address you need. Send a GET request to it and parse the JSON:
library(httr)
library(purrr)
library(dplyr)
ratings <- paste0("https://www.glassdoor.com/api/employer/432-rating.htm?",
"locationStr=&jobTitleStr=&filterCurrentEmployee=false")
req_obj <- GET(ratings)
cont <- content(req_obj)
ratings_df <- map(cont$ratings, bind_cols) %>% bind_rows()
ratings_df
You should end up with a data frame containing the ratings data. Just don't forget that "ceoRating", "bizOutlook", and "recommend" are proportions from 0 to 1 (or percentages if multiplied by 100), while the rest reflect average user ratings on a 5-point scale:
# A tibble: 9 x 3
hasRating type value
<lgl> <chr> <dbl>
1 TRUE overallRating 3.3
2 TRUE ceoRating 0.72
3 TRUE bizOutlook 0.42
4 TRUE recommend 0.570
5 TRUE compAndBenefits 2.8
6 TRUE cultureAndValues 3.1
7 TRUE careerOpportunities 3.2
8 TRUE workLife 3.1
9 TRUE seniorManagement 2.9
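Because the two scales are easy to mix up, it may help to label them explicitly. A small sketch using a few rows copied from the output above; the `scale` and `display_value` column names are made up for illustration:

```r
# A few rows copied from the output above, as a stand-in for ratings_df.
ratings_demo <- data.frame(
  type  = c("overallRating", "ceoRating", "bizOutlook", "recommend", "workLife"),
  value = c(3.3, 0.72, 0.42, 0.57, 3.1),
  stringsAsFactors = FALSE
)

# The three measures reported as proportions rather than 5-point averages.
proportion_types <- c("ceoRating", "bizOutlook", "recommend")
is_proportion <- ratings_demo$type %in% proportion_types

# Record the scale and convert proportions to percentages for display.
ratings_demo$scale <- ifelse(is_proportion, "percent", "out of 5")
ratings_demo$display_value <- ifelse(is_proportion,
                                     ratings_demo$value * 100,
                                     ratings_demo$value)
```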

Web crawler in R with heading and summary

I'm trying to extract links from here with the article heading and a brief summary of each link.
The output should have the article heading and the brief summary of each article which is on the same page.
I'm able to get the links. Can you please suggest how I can get the heading and summary for each link? Please see my code below.
install.packages('rvest')
#Loading the rvest package
library('rvest')
library(xml2)
#Specifying the url for desired website to be scraped
url <- 'http://money.howstuffworks.com/business-profiles.htm'
webpage <- read_html(url)
pg <- read_html(url)
head(html_attr(html_nodes(pg, "a"), "href"))
We can use purrr to inspect each node and extract the relevant information:
library(rvest)
library(purrr)
url <- 'http://money.howstuffworks.com/business-profiles.htm'
articles <- read_html(url) %>%
html_nodes('.infinite-item > .media') %>%
map_df(~{
title <- .x %>%
html_node('.media-heading > h3') %>%
html_text()
head <- .x %>%
html_node('p') %>%
html_text()
link <- .x %>%
html_node('p > a') %>%
html_attr('href')
data.frame(title, head, link, stringsAsFactors = F)
})
head(articles)
#> title
#> 1 How Amazon Same-day Delivery Works
#> 2 10 Companies That Completely Reinvented Themselves
#> 3 10 Trade Secrets We Wish We Knew
#> 4 How Kickstarter Works
#> 5 Can you get rich selling stuff online?
#> 6 Are the Golden Arches really supposed to be giant french fries?
#> head
#> 1 The Amazon same-day delivery service aims to get your package to you in no time at all. Learn how Amazon same-day delivery works. See more »
#> 2 You might be surprised at what some of today's biggest companies used to do. Here are 10 companies that reinvented themselves from HowStuffWorks. See more »
#> 3 Trade secrets are often locked away in corporate vaults, making their owners a fortune. Which trade secrets are the stuff of legend? See more »
#> 4 Kickstarter is a service that utilizes crowdsourcing to raise funds for your projects. Learn about how Kickstarter works at HowStuffWorks. See more »
#> 5 Can you get rich selling your stuff online? Find out more in this article by HowStuffWorks.com. See more »
#> 6 Are McDonald's golden arches really suppose to be giant french fries? Check out this article for a brief history of McDonald's golden arches. See more »
#> link
#> 1 http://money.howstuffworks.com/amazon-same-day-delivery.htm
#> 2 http://money.howstuffworks.com/10-companies-reinvented-themselves.htm
#> 3 http://money.howstuffworks.com/10-trade-secrets.htm
#> 4 http://money.howstuffworks.com/kickstarter.htm
#> 5 http://money.howstuffworks.com/can-you-get-rich-selling-online.htm
#> 6 http://money.howstuffworks.com/mcdonalds-arches.htm
Obligatory comment: In this case I saw no disclaimer against harvesting on their Terms and conditions, but always be sure to check the terms of a site before scraping it.
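One reason the per-node approach in the answer is robust: `html_node()` returns a missing node when an item lacks the element, so each article still yields exactly one row. A sketch with inline HTML and made-up class names (not the real page structure), using base R in place of purrr:

```r
library(rvest)

# Inline stand-in: the second item has no summary paragraph or link.
page <- read_html(paste0(
  '<div class="item"><h3>Article A</h3>',
  '<p>Summary A <a href="/a.htm">See more</a></p></div>',
  '<div class="item"><h3>Article B</h3></div>'
))

items <- html_nodes(page, ".item")

# Extract one row per item; missing pieces become NA instead of shifting rows.
rows <- lapply(items, function(item) {
  data.frame(
    title   = html_text(html_node(item, "h3")),
    summary = html_text(html_node(item, "p")),
    link    = html_attr(html_node(item, "p > a"), "href"),
    stringsAsFactors = FALSE
  )
})
articles_demo <- do.call(rbind, rows)
```

Calling `html_nodes()` separately for titles, summaries, and links would instead return vectors of different lengths whenever an item is incomplete.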
