Why can't I read clickable links for web scraping with rvest? - r

I am trying to scrape the Bank of England speeches website.
The content I need is only available after clicking on each title. I can get the content I want for a single speech, for example (I am using SelectorGadget):
library("rvest")
url_boe ="https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"
sample_text = html_text(html_nodes(read_html(url_boe), "#output .page-section"))
However, I need to get the text behind every clickable link on the page. So I usually do:
url_boe = "https://www.bankofengland.co.uk/news/speeches"
html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")
I get an empty object, though. I have tried different variants of the code, but with the same result.
How can I read those links and then apply the code from the first part to all of them?
Can anyone help me?
Thanks!

As @KonradRudolph has noted, the links are inserted dynamically into the webpage. Therefore, I have put together some code using RSelenium and rvest to tackle this issue:
library(rvest)
library(RSelenium)
# URL
url = "https://www.bankofengland.co.uk/news/speeches"
# Base URL
base_url = "https://www.bankofengland.co.uk"
# Instantiate a Selenium server
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")
# Assign the client to an object
rem_dr <- rD[["client"]]
# Navigate to the URL
rem_dr$navigate(url)
# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])
# Extract links and concatenate them with the base_url
links <- page %>%
  html_nodes(".release-speech") %>%
  html_attr('href') %>%
  paste0(base_url, .)
# Get the link names
links_names <- page %>%
  html_nodes('#SearchResults .exclude-navigation') %>%
  html_text()
# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]
# Create a data.frame with the results
df <- data.frame(links_names, links)
# Close the client and the server
rem_dr$close()
rD$server$stop()
The resulting data.frame looks like this:
> head(df)
links_names
1 Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2 Tackling climate for real: progress and next steps - speech by Andrew Bailey
3 Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5 Responsible openness in the Insurance Sector - speech by Anna Sweeney
6 Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4 https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5 https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6 https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
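To then pull the speech text for each of those links, you can reuse the selector from the first snippet and loop over df$links. A minimal sketch, assuming the same "#output .page-section" selector applies to every speech page and adding a pause between requests to stay polite:
# For each speech URL, download the page and extract the main text
df$text <- sapply(df$links, function(u) {
  Sys.sleep(2)  # be gentle with the server
  page <- read_html(as.character(u))
  paste(html_text(html_nodes(page, "#output .page-section")), collapse = " ")
})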

Related

Is there a way to put a wildcard character in a web address when using rvest?

I am new to web scraping and am using R and rvest to try to pull some info for a friend. This project might be a bit much for my first, but hopefully someone can help or tell me if it is possible.
I am trying to pull info from https://www.veteranownedbusiness.com/mo like business name, address, phone number, and description. I started by pulling all the names of the businesses and was going to loop through each page to pull the information by business. The problem I ran into is that the business URLs have numbers assigned to them:
www.veteranownedbusiness.com/business/32216/accel-polymers-llc
Is there a way to tell R to ignore this number or accept any entry in its spot so that I could loop through the business names?
Here is the code I have so far to get and clean the business titles if it helps:
library(rvest)
library(tibble)
library(stringr)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
HTML <- read_html(vet_name_list)
biz_names_html <- html_nodes(HTML, '.top_level_listing a')
biz_names <- html_text(biz_names_html)
biz_names <- biz_names[biz_names != ""]
biz_names_lower <- tolower(biz_names)
biz_names_sym <- gsub("[][!#$&%()*,.:;<=>#^_`|~.{}]", "", biz_names_lower)
biz_names_dub <- str_squish(biz_names_sym)
biz_name_clean <- chartr(" ", "-", biz_names_dub)
No, I'm afraid you can't use wildcards to get a valid URL. What you can do is scrape all the correct URLs from the page, number and all.
To do this, we find all the correct nodes (I'm using XPath here rather than CSS selectors since it gives a bit more flexibility). You then get the href attribute from each node.
This produces a data frame of business names and URLs. Here's a fully reproducible example:
library(rvest)
library(tibble)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
biz <- read_html(vet_name_list) %>%
  html_nodes(xpath = "//tr[@class='top_level_listing']/td/a[@href]")
tibble(business = html_text(biz),
       url = paste0(biz %>% html_attr("href")))
#> # A tibble: 550 x 2
#> business url
#> <chr> <chr>
#> 1 Accel Polymers, LLC /business/32216/ac~
#> 2 Beacon Car & Pet Wash /business/35987/be~
#> 3 Compass Quest Joplin Regional Veteran Services /business/21943/co~
#> 4 Financial Assistance for Military Experiencing Divorce /business/20797/fi~
#> 5 Focus Marines Foundation /business/29376/fo~
#> 6 Make It Virtual Assistant /business/32204/ma~
#> 7 Malachi Coaching & Training Ministries Int'l /business/24060/ma~
#> 8 Mike Jackson - Author /business/29536/mi~
#> 9 The Mission Continues /business/14492/th~
#> 10 U.S. Small Business Conference & EXPO /business/8266/us-~
#> # ... with 540 more rows
Created on 2022-08-05 by the reprex package (v2.0.1)
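From here, you could prepend the site root and visit each business page in a loop. A minimal sketch, assuming each detail page can simply be read with read_html; the html_text() call is only a placeholder until the exact nodes for address, phone number, and description are identified on those pages:
base_url <- "https://www.veteranownedbusiness.com"
biz_urls <- paste0(base_url, html_attr(biz, "href"))
# Visit each business page politely and keep the extracted text for later parsing
biz_pages <- lapply(biz_urls, function(u) {
  Sys.sleep(2)  # be gentle with the server
  read_html(u) %>% html_text()
})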

How can I crawl/scrape (using R) the non-table EPA CompTox Dashboard?

The EPA CompTox Chemical Dashboard received an update, and my old code is no longer able to scrape the Boiling Point for chemicals. Is anyone able to help me scrape the Experimental Average Boiling Point? I need to be able to write R code that can loop through several chemicals.
Example webpages:
Acetone: https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8021482
Methane: https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8025545
I have tried read_html() and xmlParse() without success. The Experimental Average Boiling Point (ExpAvBP) value does not show up in the XML.
I have tried using ContentScraper() from the RCrawler package, but it only returns NA whatever I try. Furthermore, this would only work for the first webpage listed, as the cell id changes with each chemical.
ContentScraper(Url = "https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8021482", XpathPatterns = "//*[@id='cell-225']")
I have tried using readLines(), but the information is all crammed into the last script tag, and I am unsure how to isolate just the ExpAvBP value. And it looks like the value is stored elsewhere? For example, below is what I believe is the boiling point information within the last script tag.
Acetone:
{unit:c_,name:"Boiling Point",predicted:{rawData:[{value:c$,minValue:e,maxValue:e,source:am,description:an,modelName:"TEST_BP",modelId:T,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:{value:B,link:"https:\u002F\u002Fs3.amazonaws.com\u002Fepa-comptox\u002Ftest-reports\u002FDTXCID101482-TEST_BP.html",showLink:a},qmrf:{value:e,link:e,showLink:d}},{value:44.8,minValue:e,maxValue:e,source:ci,description:cj,modelName:"EPISUITE_BP",modelId:dV,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:{value:M,link:e,showLink:d},qmrf:{value:e,link:e,showLink:d}},{value:46.458,minValue:e,maxValue:e,source:ad,description:V,modelName:"ACD_BP",modelId:135,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:{value:M,link:e,showLink:d},qmrf:{value:e,link:e,showLink:d}},{value:da,minValue:e,maxValue:e,source:aL,description:bo,modelName:"OPERA_BP",modelId:dS,hasOpera:a,globalApplicability:q,hasQmrfPdf:a,details:{value:B,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fcalculation_details?model_id=27&search=21482",showLink:a},qmrf:{value:B,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fdownload_qmrf_pdf?model=27",showLink:a}}],count:bu,mean:47.06289999999999,min:c$,max:da,range:[c$,da],median:45.629},experimental:{rawData:[{value:db,minValue:e,maxValue:e,source:aN,description:aO,experimentalDetails:[]},{value:ak,minValue:ak,maxValue:ak,source:ck,description:cl,experimentalDetails:[]},{value:ak,minValue:ak,maxValue:ak,source:ck,description:cl,experimentalDetails:[]},{value:ak,minValue:ak,maxValue:ak,source:"Food and Agriculture Organization of the United Nations",description:"The Joint FAO\u002FWHO Expert Committee on Food Additives (JECFA) is an international expert scientific committee that is administered jointly by the Food and Agriculture Organization of the United Nations (FAO) and the World Health Organization (WHO). Website: \u003Ca href="http:\u002F\u002Fwww.fao.org\u002Fhome\u002F" target="_blank"\u003Ehttp:\u002F\u002Fwww.fao.org\u002Fhome\u002F\u003C\u002Fa\u003E",experimentalDetails:[]},{value:56.05,minValue:e,maxValue:e,source:"Abooali et al. Int. J. Refrig. 2014, 40, 282–293",description:"Abooali, D.; Sobati, M. A. Novel method for prediction of normal boiling point and enthalpy of vaporization at normal boiling point of pure refrigerants: A QSPR approach. (\u003Ca href="http:\u002F\u002Fdx.doi.org\u002F10.1016\u002Fj.ijrefrig.2013.12.007" target="_blank"\u003EInt. J. Refrig. 2014, 40, 282–293\u003C\u002Fa\u003E)\r\n",experimentalDetails:[]},{value:bO,minValue:bO,maxValue:bO,source:hI,description:hJ,experimentalDetails:[]}],count:dK,mean:55.98518333333333,min:db,max:bO,range:[db,bO],median:ak},arrKey:"BOILING_POINT"}
Methane:
{unit:cO,name:"Boiling Point",predicted:{rawData:[{value:at,minValue:f,maxValue:f,source:bB,description:bb,modelName:"ACD_BP",modelId:135,hasOpera:d,globalApplicability:f,hasQmrfPdf:d,details:{value:ag,link:f,showLink:d},qmrf:{value:f,link:f,showLink:d}},{value:hl,minValue:f,maxValue:f,source:aF,description:ba,modelName:"OPERA_BP",modelId:dv,hasOpera:a,globalApplicability:s,hasQmrfPdf:a,details:{value:O,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fcalculation_details?model_id=27&search=25545",showLink:a},qmrf:{value:O,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fdownload_qmrf_pdf?model=27",showLink:a}},{value:cP,minValue:f,maxValue:f,source:bZ,description:b_,modelName:"EPISUITE_BP",modelId:dy,hasOpera:d,globalApplicability:f,hasQmrfPdf:d,details:{value:ag,link:f,showLink:d},qmrf:{value:f,link:f,showLink:d}}],count:bH,mean:-129.25300000000001,min:at,max:cP,range:[at,cP],median:hl},experimental:{rawData:[{value:at,minValue:at,maxValue:at,source:hm,description:hn,experimentalDetails:[]},{value:cQ,minValue:f,maxValue:f,source:bC,description:bD,experimentalDetails:[]}],count:H,mean:ho,min:at,max:cQ,range:[at,cQ],median:ho},arrKey:"BOILING_POINT"}
Any help or insight would be greatly appreciated!
As the data is not in a table format, we have to extract the text and then pull out the boiling temperature by matching the pattern BoilingPoint.
library(rvest)
library(dplyr)
library(stringr)
library(RSelenium)
url = 'https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8025545'
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
df = remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="__layout"]/div/div[5]/div[2]/main/div/div[3]/div[2]/div/div[2]/div[2]/div[3]') %>%
  html_text()
Now get the boiling temperature. Reference: https://stackoverflow.com/a/35936065/12135618
df1 = df %>% str_remove_all( '\n') %>% str_replace_all( ' ', '')
as.numeric(sub(".*?BoilingPoint.*?(\\d+).*", "\\1", df1))
[1] 163
You may have to do further fine-tuning to capture the decimal part of the boiling temperature.
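For example, a slightly extended pattern could also capture an optional minus sign and decimals (a sketch, assuming the cleaned text still contains "BoilingPoint" followed by the value):
# Capture an optional sign and decimal part after "BoilingPoint"
as.numeric(sub(".*?BoilingPoint[^-0-9]*(-?\\d+\\.?\\d*).*", "\\1", df1))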

Using rvest to creating a database from multiple XML files

Using R to extract relevant data from multiple online XML files to create a database
I just started to learn R to do text analysis. Here is what I am trying to do: use rvest in R to create a CSV database of bill summaries from the 116th Congress from online XML files. The database should have two columns:
The title of the bill.
The summary text of the bill.
The website source is https://www.govinfo.gov/bulkdata/BILLSUM/116/hr
The issue I am having is that I would like to collect all the bill summaries returned from the search, so I need to scrape multiple links. But I don't know how to make R run the function over a series of different links and then extract the expected data.
I have tried the following code, but I am not sure how exactly to apply it to my specific problem. Also, I got an error from my code. Please see my code below. Thanks for any help in advance!
library(rvest)
library(tidyverse)
library(purrr)
html_source <- "https://www.govinfo.gov/bulkdata/BILLSUM/116/hr?page="
map_df(1:997, function(i) {
  cat(".")
  pg <- read_html(sprintf(html_source, i))
  data.frame(title = html_text(html_nodes(pg, "title")),
             bill_text %>% html_node("summary-text") %>% html_text(),
             stringsAsFactors = FALSE)
}) -> Bills
Error in open.connection(x, "rb") : HTTP error 406.
At the bottom of that page is a link to a zipfile with all of the XML files, so instead of scraping each one individually (which will get onerous with a suggested crawl-delay of 10s) you can just download the zipfile and parse the XML files with xml2 (rvest is for HTML):
library(xml2)
library(purrr)
local_dir <- "~/Downloads/BILLSUM-116-hr"
local_zip <- paste0(local_dir, '.zip')
download.file("https://www.govinfo.gov/bulkdata/BILLSUM/116/hr/BILLSUM-116-hr.zip", local_zip)
# returns vector of paths to unzipped files
xml_files <- unzip(local_zip, exdir = local_dir)
bills <- xml_files %>%
  map(read_xml) %>%
  map_dfr(~list(
    # note xml2 functions only take XPath selectors, not CSS ones
    title = xml_find_first(.x, '//title') %>% xml_text(),
    summary = xml_find_first(.x, '//summary-text') %>% xml_text()
  ))
bills
#> # A tibble: 1,367 x 2
#> title summary
#> <chr> <chr>
#> 1 For the relief of certain aliens w… Provides for the relief of certain …
#> 2 "To designate the facility of the … "Designates the facility of the Uni…
#> 3 Consolidated Appropriations Act, 2… <p><b>Consolidated Appropriations A…
#> 4 Financial Institution Customer Pro… <p><strong>Financial Institution Cu…
#> 5 Zero-Baseline Budget Act of 2019 <p><b>Zero-Baseline Budget Act of 2…
#> 6 Agriculture, Rural Development, Fo… "<p><b>Highlights: </b></p> <p>This…
#> 7 SAFETI Act <p><strong>Security for the Adminis…
#> 8 Buy a Brick, Build the Wall Act of… <p><b>Buy a Brick, Build the Wall A…
#> 9 Inspector General Access Act of 20… <p><strong>Inspector General Access…
#> 10 Federal CIO Authorization Act of 2… <p><b>Federal CIO Authorization Act…
#> # … with 1,357 more rows
The summary column is HTML-formatted, but by and large this is pretty clean already.
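If plain text is preferred for the CSV, the residual HTML in the summary column could be stripped before writing it out. A minimal sketch, assuming a simple tag-removal regex is good enough here (the output file name is just an example):
# Strip HTML tags from the summaries and write the two-column database to CSV
bills_clean <- bills
bills_clean$summary <- gsub("<[^>]+>", "", bills_clean$summary)
write.csv(bills_clean, "bill_summaries_116_hr.csv", row.names = FALSE)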

Using Rvest to webscrape rankings

I am wanting to scrape all rankings from the following website:
https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating
I have tried using CSS selectors, which tell me to use ".ratingNum", but it leaves me with blank data. I have also tried using the GET function, which results in a similar problem.
# Attempt 1
url <- 'https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating'
webpage <- read_html(url)
rank_data_html <- html_nodes(webpage,'.rankingNum')
rank_data <- html_table(rank_data_html)
head(rank_data)
# Attempt 2
res <- GET("https://www.glassdoor.com/ratingsDetails/full.htm",
query=list(employerId="432",
employerName="McDonalds"))
doc <- read_html(content(res, as="text"))
html_nodes(doc, ".ratingNum")
rank_data <- html_table(rank_data_html)
head(rank_data)
I expect the result to give me a list of all of the rankings, but instead it is giving me an empty list, or a list that doesn't include the rankings.
Your list is empty because you're GETing an unpopulated HTML document. Frequently when this happens you have to resort to RSelenium and co., but Glassdoor's public-facing API actually has everything you need – if you know where to look.
(Note: I'm not sure if this is officially part of Glassdoor's public API, but I think it's fair game if they haven't made more of an effort to conceal it. I tried to find some information, but their documentation is pretty meager. Usually companies will look the other way if you're just doing a smallish analysis and not slamming their servers or trying to profit from their data, but it's still a good idea to heed their ToS. You might want to shoot them an email describing what you're doing, or even ask about becoming an API partner. Make sure you adhere to their attribution rules. Continue at your own peril.)
Take a look at the network analysis tab in your browser's developer tools. You will see some GET requests that return JSON, and one of them has the address you need. Send a GET request and parse the JSON:
library(httr)
library(purrr)
library(dplyr)
ratings <- paste0("https://www.glassdoor.com/api/employer/432-rating.htm?",
"locationStr=&jobTitleStr=&filterCurrentEmployee=false")
req_obj <- GET(ratings)
cont <- content(req_obj)
ratings_df <- map(cont$ratings, bind_cols) %>% bind_rows()
ratings_df
You should end up with a data frame containing ratings data. Just don't forget that "ceoRating", "bizOutlook", and "recommend" are proportions from 0 to 1 (or percentages if multiplied by 100), while the rest reflect average user ratings on a 5-point scale:
# A tibble: 9 x 3
hasRating type value
<lgl> <chr> <dbl>
1 TRUE overallRating 3.3
2 TRUE ceoRating 0.72
3 TRUE bizOutlook 0.42
4 TRUE recommend 0.570
5 TRUE compAndBenefits 2.8
6 TRUE cultureAndValues 3.1
7 TRUE careerOpportunities 3.2
8 TRUE workLife 3.1
9 TRUE seniorManagement 2.9
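If you want everything on a comparable footing, the proportion-type rows could be rescaled to percentages, for example (a sketch using the type names shown above):
# Rescale the 0-1 proportion metrics to percentages; leave the 5-point ratings as they are
ratings_df %>%
  mutate(value = ifelse(type %in% c("ceoRating", "bizOutlook", "recommend"),
                        value * 100, value))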

Writing a loop to read_html through a column of url's

I am using rvest to scrape some corporate documents from the US Securities and Exchange Commission. Starting with a specific company, I successfully extracted the URLs to each of their 10-K documents and put those URLs in a data frame named xcel. I then would like to further scrape each of those URLs.
I am thinking it makes the most sense to use a for loop to go through each of the URLs in the xcel$fullurl column, use the read_html function on each of them, and extract the table on each page.
I am having trouble getting the actual for loop to work. If you think a for loop is not the way to go, I would love to hear any other advice.
library(rvest)
library(stringi)
sec <- read_html("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000072903&type=10-k&dateb=&owner=exclude&count=40")
xcel <- sec %>%
  html_nodes("#documentsbutton") %>%
  html_attr("href")
xcel <- data.frame(xcel)
xcel$xcell <- paste0("https://www.sec.gov", xcel$xcell)
xcel$fullurl <- paste0(xcel$xcell, xcel$xcel)
as.character(xcel$fullurl) # set of URL's that I want to scrape from
# Problem starts here
for (i in xcel$fullurl){
  pageurl <- xcel$fullurl
  phase2 <- read_html(pageurl[i])
  hopefully <- phase2 %>%
    html_table("tbody")
}
Hopefully this should give me the ensuing table from each of the sites.
You could loop over each URL using map/lapply and extract the 1st table from each
library(rvest)
library(dplyr)
library(purrr)
map(xcel$fullurl, ~ .x %>% read_html() %>% html_table() %>% .[[1]])
# Seq Description Document Type Size
#1 1 10-K xcel1231201510-k.htm 10-K 6375358
#2 2 EXHIBIT 10.28 xcelex1028q42015.htm EX-10.28 57583
#3 3 EXHIBIT 10.29 xcelex1029q42015.htm EX-10.29 25233
#4 4 EXHIBIT 12.01 xcelex1201q42015.htm EX-12.01 50108
#5 5 EXHIBIT 21.01 xcelex2101q42015.htm EX-21.01 22841
#.....
This would return a list of dataframes. If you want to combine all of them into a single dataframe you could use map_dfr instead of map.
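For instance, a sketch that stacks the first table from every filing into one data frame, tagging each row with the URL it came from (the source_url column name is just a choice here):
# Combine the first table from each filing into a single data frame
urls <- as.character(xcel$fullurl)
all_docs <- map_dfr(set_names(urls),
                    ~ .x %>% read_html() %>% html_table() %>% .[[1]],
                    .id = "source_url")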
