Writing a loop to read_html through a column of URLs - R

I am using rvest to scrape some corporate documents from the US Securities and Exchange Commission. Starting with a specific company, I successfully extracted the URLs to each of their 10-K filings and put those URLs in a data frame named xcel. I would now like to scrape each of those URLs in turn.
I think it makes the most sense to use a for loop to go through each of the URLs in the xcel$fullurl column, call read_html() on each of them, and extract the table on each page.
I am having trouble getting the actual for loop to work. If you think a for loop is not the way to go, I would love to hear any other advice.
library(rvest)
library(stringi)

sec <- read_html("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000072903&type=10-k&dateb=&owner=exclude&count=40")

xcel <- sec %>%
  html_nodes("#documentsbutton") %>%
  html_attr("href")

xcel <- data.frame(xcel)
xcel$fullurl <- paste0("https://www.sec.gov", xcel$xcel)  # prepend the domain to each relative href
xcel$fullurl <- as.character(xcel$fullurl)                # set of URLs that I want to scrape from
# Problem starts here
for (i in xcel$fullurl){
  pageurl <- xcel$fullurl
  phase2 <- read_html(pageurl[i])
  hopefully <- phase2 %>%
    html_table("tbody")
}
Hopefully this should give me the resulting table from each of the sites.

You could loop over each URL using map/lapply and extract the first table from each page:
library(rvest)
library(dplyr)
library(purrr)
map(xcel$fullurl, ~ .x %>% read_html() %>% html_table() %>% .[[1]])
# Seq Description Document Type Size
#1 1 10-K xcel1231201510-k.htm 10-K 6375358
#2 2 EXHIBIT 10.28 xcelex1028q42015.htm EX-10.28 57583
#3 3 EXHIBIT 10.29 xcelex1029q42015.htm EX-10.29 25233
#4 4 EXHIBIT 12.01 xcelex1201q42015.htm EX-12.01 50108
#5 5 EXHIBIT 21.01 xcelex2101q42015.htm EX-21.01 22841
#.....
This returns a list of data frames. If you want to combine all of them into a single data frame, you could use map_dfr instead of map.
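For example, a minimal sketch of the combined version (the .id column name here is just an illustration), which tags each row with the filing page it came from:
library(rvest)
library(purrr)
# Bind the first table from every filing page into one data frame,
# recording the source URL of each row in an .id column.
all_tables <- map_dfr(
  set_names(xcel$fullurl),
  ~ .x %>% read_html() %>% html_table() %>% .[[1]],
  .id = "source_url"
)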


Why can't I read clickable links for webscraping with rvest?

I am trying to web scrape this website. The content I need is available after clicking on each title. I can get the content I want if I do this, for example (I am using SelectorGadget):
library(rvest)
url_boe <- "https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"
sample_text <- html_text(html_nodes(read_html(url_boe), "#output .page-section"))
However, I need to get the text behind each clickable link on the website. So I usually do:
url_boe = "https://www.bankofengland.co.uk/news/speeches"
html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")
I get an empty object though. I tried different variants of the code but with the same result.
How can I read those links and then apply the code in the first part to all the links?
Can anyone help me?
Thanks!
As @KonradRudolph has noted, the links are inserted dynamically into the webpage. Therefore, I have produced code using RSelenium and rvest to tackle this issue:
library(rvest)
library(RSelenium)
# URL
url = "https://www.bankofengland.co.uk/news/speeches"
# Base URL
base_url = "https://www.bankofengland.co.uk"
# Instantiate a Selenium server
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")
# Assign the client to an object
rem_dr <- rD[["client"]]
# Navigate to the URL
rem_dr$navigate(url)
# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])
# Extract links and concatenate them with the base_url
links <- page %>%
  html_nodes(".release-speech") %>%
  html_attr('href') %>%
  paste0(base_url, .)
# Get the link names
links_names <- page %>%
  html_nodes('#SearchResults .exclude-navigation') %>%
  html_text()
# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]
# Create a data.frame with the results
df <- data.frame(links_names, links)
# Close the client and the server
rem_dr$close()
rD$server$stop()
The resulting data.frame looks like this:
> head(df)
links_names
1 Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2 Tackling climate for real: progress and next steps - speech by Andrew Bailey
3 Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5 Responsible openness in the Insurance Sector - speech by Anna Sweeney
6 Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4 https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5 https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6 https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
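From there, the selector from the first part of the question can be applied to every collected link, for example with purrr (a minimal sketch; the two-second pause is just a politeness delay, not something the site requires):
library(rvest)
library(purrr)
# Fetch the speech text for every link collected above.
speeches <- map_chr(df$links, function(link) {
  Sys.sleep(2)  # brief pause between requests
  link %>%
    read_html() %>%
    html_nodes("#output .page-section") %>%
    html_text() %>%
    paste(collapse = "\n")
})
df$speech_text <- speeches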

TABLEAU: How can I measure similarity of sets of dimensions across dates?

This is a bit of a complicated one, but I'll do my best to explain. I have a dataset comprised of data that I scrape from a particular video-on-demand interface every day. Each day there are around 120 titles on display (a grid of 12 x 10), and the data includes a range of variables: date of scrape, title of programme, vertical/horizontal position of programme, genre, synopsis, etc.
One of the things I want to do is analyse the similarity of what's on offer on a day-to-day basis. By this I mean that I want to compare how many of the titles on a given day appeared on the previous date (ideally expressed as a percentage). So if 40 (out of 120) titles were the same as the previous day, the similarity would be about 33%.
Here's the thing - I know how to do this (thanks to some kindly stranger on this very site who helped me write a script using R). You can see the post here which gives some more detail: Calculate similarity within a dataframe across specific rows (R)
However, this method creates a similarity score based on the total number of titles on a day-to-day basis whereas I also want to be able to explore the similarity after applying other filters. Specifically, I want to narrow the focus to titles that appear within the first four rows and columns. In other words: how many of these titles are the same as the previous day in those positions? I could do this by modifying the R script, but it seems that the better way would be to do this within Tableau so that I can change these parameters in "real-time", so to speak. I.e. if I want to focus on the top 6 rows and columns I don't want to have to run the R script all over again and update the underlying data!
It feels as though I'm missing something very obvious here - maybe it's a simple table calculation? Or I need to somehow tell Tableau how to subset the data?
Hopefully this all makes sense, but I'm happy to clarify if not. Also, I can't provide you the underlying data (for research reasons!) but I can provide a sample if it would help.
Thanks in advance :)
You can have the best of both worlds. Use Tableau to connect to your data, filter as desired, then have Tableau call an R script to calculate similarity and return the results to Tableau for display.
If this fits your use case, you need to learn the mechanics to put this into play. On the Tableau side, you’ll be using the functions that start with the word SCRIPT to call your R code, for example SCRIPT_REAL(), or SCRIPT_INT() etc. Those are table calculations, so you’ll need to learn how table calculations work, in particular with regard to partitioning and addressing. This is described in the Tableau help. You’ll also have to point Tableau at the host for your R code, by managing external services under the Help->Settings and Performance menu.
On the R side, you’ll have to write your function, of course, and then use Rserve (via the Rserve() function) to make it accessible to Tableau. Tableau sends vectors of arguments to R and expects a vector in response. The partitioning and addressing mentioned above control the size and ordering of those vectors.
It can be a bit tricky to get the mechanics working, but they do work. Practice on something simple first.
See Tableau’s web site resources for more information. The official name for this functionality is Tableau “analytics extensions”.
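To make the moving parts concrete, here is a minimal sketch of the two sides; the similarity calculation itself is only a placeholder to be replaced with the logic from the linked R script:
# R side: load Rserve and start the server Tableau will talk to.
library(Rserve)
Rserve(args = "--no-save")
# Tableau side (calculated field), shown here as a comment.
# SCRIPT_REAL passes the addressed values of [Title] and [Date] to R as
# vectors (.arg1, .arg2) and expects one numeric value back per row:
#
# SCRIPT_REAL("
#   titles <- .arg1
#   dates  <- .arg2
#   # ...compute day-over-day similarity here...
#   rep(0, length(titles))   # placeholder return value
# ", ATTR([Title]), ATTR([Date]))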
I am sharing a strategy to solve this in R.
Step 1: Load the libraries and data
library(tidyverse)
library(lubridate)
movies <- tibble(read.csv("movies.csv"))
movies$date <- as.Date(movies$date, format = "%d-%m-%Y")
Step 2: Set the rows and columns you want to restrict your similarity search to in two variables. Say you are restricting the search to 5 columns and 4 rows only:
filter_for_row <- 4
filter_for_col <- 5
Step 3: Get the final result
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>%  # restrict search to designated rows and columns
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>%                                       # remove duplicate titles screened on any given day
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%  # was it screened the previous day?
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = T),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
  date       total_movies_displayed similar_movies similarity_percent
  <date>                      <int>          <dbl>              <dbl>
1 2018-08-13                     17              0              0
2 2018-08-14                     17             10              0.588
3 2018-08-15                     17              9              0.529
If you change the filters to 12 and 12 respectively, then:
filter_for_row <- 12
filter_for_col <- 12
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>%
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = T),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
  date       total_movies_displayed similar_movies similarity_percent
  <date>                      <int>          <dbl>              <dbl>
1 2018-08-13                     68              0              0
2 2018-08-14                     75             61              0.813
3 2018-08-15                     72             54              0.75
Good Luck
As Alex has suggested, you can have the best of both worlds. But to the best of my knowledge, Tableau Desktop only interfaces with R (or Python etc.) through calculated fields, i.e. SCRIPT_INT, SCRIPT_REAL and so on. These functions are table calculations, which in Tableau only work within a context: their results cannot be further aggregated or mixed with LOD expressions, and the values they produce cannot be hard-coded and reused independently of that context. Thus, for your use case (again, to the best of my knowledge), you can build a parameter-dependent view in Tableau after hard-coding the values through a programming language of your choice. I therefore suggest that, prior to importing the data into Tableau, a new column be created in your dataset by running the following (or the equivalent in whichever language you prefer):
movies_edited <- movies %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  ungroup()

write.csv(movies_edited, "movies_edited.csv")
This creates a new column named similarity in the dataset, wherein 1 denotes that the title was also screened on the previous day, 0 denotes that it was not screened on the immediately previous day, and NA means it is the first day of its screening.
I have imported this dataset into Tableau and created a parameter-dependent view, as you desired.

Implementing Code Conditionally in R Based on Features of Dataset

I'm looking to streamline my code and minimize the manual tweaks needed for each data set I run through it. I receive batches of data by country, but each country is slightly different in terms of fields and field names, so it requires tweaking each time I run a new country. I would like to eliminate the tweaks and do some selective coding. (Many of the challenges I handle easily with ifelse(), but I haven't been able to do a conditional mutate, for example.)
This is a logic question, so please let me know if I should have uploaded a data set.
This is a new example I just added; I realized that since the one I had used before was a mutate, there were many tools available to answer the question. In this example, I am dealing with data from various countries, each data frame with varying dimensionality, which I want to keep. I could, of course, use different code for each, but I think it would be cleaner to use the same code and have it accommodate the various countries' data.
I have created a version of this using mutate with ifelse, creating variables for these non-common dimensions, and that works. I'm wondering if there is an alternative in R where I can run select snippets of code (and a good answer may be that there is no such option inside pipes). I know how to do this with separate sets of code and if {} else {}.
Keep in mind, this is part of a much larger block of code that I need all the countries to run though...this is just an illustrative subset.
# As you can see, I comment out each country's unique variables (and spelling!)
P_Region_HP_Brand <- P_Region_HP %>%
  left_join(M_brand) %>%
  left_join(M_prodcat) %>%
  group_by(Calendar_Year, Calendar_Quarter, Calendar_Month, Calendar_Month_txt, Date,
           region_b_frcst5, region_b_frcst7, Country, country_b,
           BrandSummary, rank_m, Launch_Year, Launch_Month, Model, PriceSegment, SumProdCat, ProductCategory, True_Wireless, ProductType,
           # SPORTS, VOICE.ASSISTANT.FUNCTION   # JPN
           # Sports, Heart.Rate.Sensor          # EU3
           # HEARTMON, WTRRSST                  # USA
           Sports, DIST_TYP                     # CHN
  ) %>%
  summarize(Dollars = sum(Dollars),                 # ALL (inc USA)
            Local_Currency = sum(Local_Currency),   # ALL
            Units = sum(Units)) %>%
  select(Calendar_Year, Calendar_Quarter, Calendar_Month, Calendar_Month_txt, Date, Launch_Year, Launch_Month,
         region_b_frcst5, region_b_frcst7, Country, country_b,
         BrandSummary, Model, PriceSegment, SumProdCat, ProductCategory, True_Wireless, ProductType,
         Units, Dollars, Local_Currency, rank_m,    # ALL (inc USA)
         # HEARTMON, WTRRSST,                       # USA
         # SPORTS, VOICE.ASSISTANT.FUNCTION         # JPN
         # Sports, Heart.Rate.Sensor                # EU3
         Sports, DIST_TYP                           # CHN
  ) %>%
  as.data.frame() %>%
  arrange(Country, desc(Date), desc(Local_Currency))
Does anyone know a solution for this that will allow me to keep my code simple and run select lines for given countries?
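One pattern that might illustrate the idea (not from the original post; the column lists below are assembled from the comments above and are only assumptions) is to keep a lookup of each country's extra columns and select them with dplyr's any_of(), which silently ignores names that are absent:
library(dplyr)
# Hypothetical lookup of country-specific columns.
country_extra_cols <- list(
  JPN = c("SPORTS", "VOICE.ASSISTANT.FUNCTION"),
  EU3 = c("Sports", "Heart.Rate.Sensor"),
  USA = c("HEARTMON", "WTRRSST"),
  CHN = c("Sports", "DIST_TYP")
)
common_cols <- c("Calendar_Year", "Calendar_Quarter", "Calendar_Month", "Date",
                 "Country", "BrandSummary", "Model", "ProductType")
current_country <- "CHN"  # set once per batch
P_Region_HP_Brand <- P_Region_HP %>%
  group_by(across(any_of(c(common_cols, country_extra_cols[[current_country]])))) %>%
  summarize(Dollars = sum(Dollars),
            Local_Currency = sum(Local_Currency),
            Units = sum(Units),
            .groups = "drop")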

Using rvest to create a database from multiple XML files

Using R to extract relevant data from multiple online XML files to create a database
I just started to learn R to do text analysis. Here is what I am trying to do: I'm trying to use rvest in R to create a CSV database of bill summaries from the 116th Congress from online XML files. The database should have two columns:
The title of the bill.
The summary text of the bill.
The website source is https://www.govinfo.gov/bulkdata/BILLSUM/116/hr
The issue I am having is that I would like to collect all the bill summaries that are returned from the search, so I need to web scrape multiple links. But I don't know how to get R to run the function over a series of different links and then extract the expected data.
I have tried the following code, but I am not sure exactly how to apply it to my specific problem. Also, I got an error from my code. Please see my code below. Thanks for any help in advance!
library(rvest)
library(tidyverse)
library(purrr)

html_source <- "https://www.govinfo.gov/bulkdata/BILLSUM/116/hr?page="

map_df(1:997, function(i) {
  cat(".")
  pg <- read_html(sprintf(html_source, i))
  data.frame(title = html_text(html_nodes(pg, "title")),
             bill_text %>% html_node("summary-text") %>% html_text(),
             stringsAsFactors = FALSE)
}) -> Bills
Error in open.connection(x, "rb") : HTTP error 406.
At the bottom of that page is a link to a zipfile with all of the XML files, so instead of scraping each one individually (which will get onerous with a suggested crawl-delay of 10s) you can just download the zipfile and parse the XML files with xml2 (rvest is for HTML):
library(xml2)
library(purrr)
local_dir <- "~/Downloads/BILLSUM-116-hr"
local_zip <- paste0(local_dir, '.zip')
download.file("https://www.govinfo.gov/bulkdata/BILLSUM/116/hr/BILLSUM-116-hr.zip", local_zip)
# returns vector of paths to unzipped files
xml_files <- unzip(local_zip, exdir = local_dir)
bills <- xml_files %>%
  map(read_xml) %>%
  map_dfr(~ list(
    # note: xml2 functions only take XPath selectors, not CSS ones
    title = xml_find_first(.x, '//title') %>% xml_text(),
    summary = xml_find_first(.x, '//summary-text') %>% xml_text()
  ))
bills
#> # A tibble: 1,367 x 2
#> title summary
#> <chr> <chr>
#> 1 For the relief of certain aliens w… Provides for the relief of certain …
#> 2 "To designate the facility of the … "Designates the facility of the Uni…
#> 3 Consolidated Appropriations Act, 2… <p><b>Consolidated Appropriations A…
#> 4 Financial Institution Customer Pro… <p><strong>Financial Institution Cu…
#> 5 Zero-Baseline Budget Act of 2019 <p><b>Zero-Baseline Budget Act of 2…
#> 6 Agriculture, Rural Development, Fo… "<p><b>Highlights: </b></p> <p>This…
#> 7 SAFETI Act <p><strong>Security for the Adminis…
#> 8 Buy a Brick, Build the Wall Act of… <p><b>Buy a Brick, Build the Wall A…
#> 9 Inspector General Access Act of 20… <p><strong>Inspector General Access…
#> 10 Federal CIO Authorization Act of 2… <p><b>Federal CIO Authorization Act…
#> # … with 1,357 more rows
The summary column is HTML-formatted, but by and large this is pretty clean already.
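If you do want plain text in the summary column, one option (a follow-on sketch, not part of the original answer; the output filename is just an example) is to run each summary back through an HTML parser and keep only the text, then write the two-column database to CSV:
library(dplyr)
library(purrr)
library(rvest)
# Strip the HTML tags from each summary, keeping just the text content.
bills_clean <- bills %>%
  mutate(summary = map_chr(summary, ~ .x %>% minimal_html() %>% html_text2()))
write.csv(bills_clean, "bill_summaries_116_hr.csv", row.names = FALSE)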

Using rvest to web scrape rankings

I am wanting to scrape all rankings from the following website:
https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating
I have tried using CSS selectors, which point me to ".ratingNum", but that leaves me with blank data. I have also tried using the GET function, which results in a similar problem.
# Attempt 1
url <- 'https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating'
webpage <- read_html(url)

rank_data_html <- html_nodes(webpage, '.ratingNum')
rank_data <- html_table(rank_data_html)
head(rank_data)

# Attempt 2
res <- GET("https://www.glassdoor.com/ratingsDetails/full.htm",
           query = list(employerId = "432",
                        employerName = "McDonalds"))
doc <- read_html(content(res, as = "text"))

rank_data_html <- html_nodes(doc, ".ratingNum")
rank_data <- html_table(rank_data_html)
head(rank_data)
I expect the result to give me a list of all of the rankings, but instead it is giving me an empty list, or a list that doesn't include the rankings.
Your list is empty because you're GETing an unpopulated HTML document. Frequently when this happens you have to resort to RSelenium and co., but Glassdoor's public-facing API actually has everything you need – if you know where to look.
(Note: I'm not sure if this is officially part of Glassdoor's public API, but I think it's fair game if they haven't made more of an effort to conceal it. I tried to find some information, but their documentation is pretty meager. Usually companies will look the other way if you're just doing a smallish analysis and not slamming their servers or trying to profit from their data, but it's still a good idea to heed their ToS. You might want to shoot them an email describing what you're doing, or even ask about becoming an API partner. Make sure you adhere to their attribution rules. Continue at your own peril.)
Take a look at the network analysis tab in your browser's developer tools. You will see some GET requests that return JSON, and one of them has the address you need. Send a GET request and parse the JSON:
library(httr)
library(purrr)
library(dplyr)
ratings <- paste0("https://www.glassdoor.com/api/employer/432-rating.htm?",
"locationStr=&jobTitleStr=&filterCurrentEmployee=false")
req_obj <- GET(ratings)
cont <- content(req_obj)
ratings_df <- map(cont$ratings, bind_cols) %>% bind_rows()
ratings_df
You should end up with a data frame containing the ratings data. Just don't forget that "ceoRating", "bizOutlook", and "recommend" are proportions from 0 to 1 (or percentages if multiplied by 100), while the rest reflect average user ratings on a 5-point scale:
# A tibble: 9 x 3
hasRating type value
<lgl> <chr> <dbl>
1 TRUE overallRating 3.3
2 TRUE ceoRating 0.72
3 TRUE bizOutlook 0.42
4 TRUE recommend 0.570
5 TRUE compAndBenefits 2.8
6 TRUE cultureAndValues 3.1
7 TRUE careerOpportunities 3.2
8 TRUE workLife 3.1
9 TRUE seniorManagement 2.9
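If a one-row, wide layout is easier to work with downstream, the long result can be reshaped with tidyr (a small follow-on sketch, not part of the original answer):
library(dplyr)
library(tidyr)
# One column per rating type, a single row of values.
ratings_wide <- ratings_df %>%
  select(type, value) %>%
  pivot_wider(names_from = type, values_from = value)
ratings_wide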
