How to web scrape a file whose address changes in R - r

I am interested in this Excel file, whose structure does not change: https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a
I can get it from this page: https://rigcount.bakerhughes.com/na-rig-count
The last URL does not change over time, whereas the first one does.
But I guess the URL of the file is located somewhere in the elements of the fixed webpage, even when it changes, and that the generation of the filename follows a repetitive procedure.
Therefore, is there a way, in R, to get the file (which is updated every week or so) in an automated manner, without downloading it manually each time?

You skipped the part of the question where you talk about what you have already tried, or about searching the web for tutorials. But it was easy to do, so here goes. You'll have to look up an rvest tutorial for more explanation.
library(rvest)    # to allow easy scraping
library(magrittr) # to allow %>% pipe commands

page <- read_html("https://rigcount.bakerhughes.com/na-rig-count")

# Find links that match Excel-type files as defined by the page
links <- page %>%
  html_nodes("span.file--mime-application-vnd-ms-excel-sheet-binary-macroEnabled-12") %>%
  html_nodes("a")

links_df <- data.frame(
  title = links %>% html_attr("title"),
  link  = links %>% html_attr("href")
)

links_df
# title
# 1 north_america_rotary_rig_count_jan_2000_-_current.xlsb
# 2 north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb
# link
# 1 https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a
# 2 https://rigcount.bakerhughes.com/static-files/c7852ea5-5bf5-4c47-b52c-f025597cdddf
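Once the current link is in links_df, the workbook itself can be fetched in the same script. A minimal sketch, assuming the first matching title is the file you want (the local file name is arbitrary):
# Pick the workbook whose (stable) title matches, then download it as a binary file
target <- links_df$link[grepl("north_america_rotary_rig_count", links_df$title)][1]
download.file(target, destfile = "na_rig_count.xlsb", mode = "wb")
Reading the .xlsb afterwards needs a package that handles the binary workbook format (for example readxlsb).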

Related

How do I extract certain html nodes using rvest?

I'm new to web-scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a url, however I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen) which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"

read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which can be seen by disabling JS in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then Ctrl+F to search for where the product name exists within the non-rendered content.
The title info you want is present in lots of places and there are numerous ways to obtain it. I will offer an "as on the tin" option, as the code gives clear indications as to what is being selected for.
I've updated the syntax and reduced the volume of imported external dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
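As a quick sanity check that the selectors matched (nothing here is specific to the page beyond the variables created above):
# Collect the scalar results into a one-row summary; the spec tables stay in `specs`
data.frame(name = name, price = price, spec_tables = length(specs))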

Finding all csv links from website using R

I am trying to download the data files from the ICE website (https://www.theice.com/clear-us/risk-management#margin-rates) containing info on margin strategy. I tried to do so by applying the following code in R:
library(rvest)
library(stringr)

page <- read_html("https://www.theice.com/clear-us/risk-management#margin-rates")

raw_list <- page %>%      # takes the page above for which we've read the html
  html_nodes("a") %>%     # find all links in the page
  html_attr("href") %>%   # get the url for these links
  str_subset("\\.csv")    # find those that end in csv only
However, it only finds two csv files. That is, it doesn't detect any of the files displayed when clicking on Margin Rates and going to Historic ICE Risk Model Parameter. See below:
raw_list
[1] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Asset_Haircuts_History.csv"
[2] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Currency_Haircuts_History.csv"
I am wondering how I can do that so later on I can select the files and download them.
Thanks a lot in advance
We can look at the network traffic in browser devtools to find the url for each dropdown action.
The Historic ICE Risk Model Parameter dropdown pulls from this page:
https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml;jsessionid=7945F3FE58331C88218978363BA8963C?getParameterFileTable&category=Historical
We remove the jsessionid (per QHarr's comment) and use that as our endpoint:
endpoint <- "https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml?getParameterFileTable&category=Historical"
page <- read_html(endpoint)
Then we can get the full csv list:
raw_list <- page %>%
  html_nodes(".table-partitioned a") %>% # add specificity as QHarr suggests
  html_attr("href")
Output:
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210226.CSV'
...
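To then select and download a subset of those files, one option is to prefix the site root onto the relative paths (a sketch; the STRATEGY filter is just an example):
base <- "https://www.theice.com"
csv_urls <- paste0(base, raw_list)

# e.g. keep only the margin strategy files and download each one
strategy_urls <- csv_urls[grepl("STRATEGY", csv_urls)]
for (u in strategy_urls) {
  download.file(u, destfile = basename(u), mode = "wb")
}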
It seems the page does not load that part instantly, so it is missing from your request. The network monitor indicates that a file "ClearUSRiskArrayFiles.shtml" is loaded about 400 ms later. That file seems to provide the required links once you specify the year and month in the URL.
library(rvest)
library(stringr)
page <- read_html("https://www.theice.com/iceriskmodel/ClearUSRiskArrayFiles.shtml?getRiskArrayTable=&type=icus&year=2021&month=03")
raw_list <- page %>% # takes the page above for which we've read the html
  html_nodes("a") %>% # find all links in the page
  html_attr("href")
head(raw_list[grepl("csv", raw_list)], 3L)
#> [1] "/publicdocs/irm_files/icus/2021/03/NYB0312E.csv.zip"
#> [2] "/publicdocs/irm_files/icus/2021/03/NYB0311E.csv.zip"
#> [3] "/publicdocs/irm_files/icus/2021/03/NYB0311F.csv.zip"
Created on 2021-03-12 by the reprex package (v1.0.0)
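Since only the year and month change in that URL, a small helper (a sketch; the function name is illustrative) pulls the list for any month:
risk_array_links <- function(year, month) {
  url <- sprintf(
    "https://www.theice.com/iceriskmodel/ClearUSRiskArrayFiles.shtml?getRiskArrayTable=&type=icus&year=%d&month=%02d",
    year, month
  )
  # Collect every link on the page, then keep only the csv entries
  hrefs <- read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href")
  hrefs[grepl("csv", hrefs)]
}

# e.g. links for February 2021
head(risk_array_links(2021, 2), 3L)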

How do I find html_node on search form?

I have a list of names (first name, last name, and date-of-birth) that I need to search the Fulton County Georgia (USA) Jail website to determine if a person is in or released from jail.
The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400
The site requires you enter a last name and first name, then it gives you a list of results.
I have found some stackoverflow posts that have given me some direction, but I'm still struggling to figure this out. I'm using this post as an example to follow. I am using SelectorGadget to help figure out the CSS tags.
Here is the code I have so far. Right now I can't figure out what html_node to use.
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session(fc.url)
# Grab initial form
form.unfilled <- jail %>% html_node("form")
form.unfilled
The result I get from form.unfilled is {xml_missing} <NA> which I know isn't right.
I think if I can figure out the html_node value, I can proceed to using set_values and submit_form.
Thanks.
It appears that on the initial call the webpage opens onto "http://justice.fultoncountyga.gov/PAJailManager/default.aspx". Once the session is started you should be able to jump to the search page:
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form")
Note: Verify that your actions are within the terms of service for the website. Many sites do have a policy against scraping.
The website relies heavily on Javascript to render itself. When opening the link you provided in a fresh browser instance, you get redirected to http://justice.fultoncountyga.gov/PAJailManager/default.aspx, where you have to click the "Jail Records" link. This executes a bit of Javascript to send you to the page with the form.
rvest is unable to execute arbitrary Javascript. You might have to look at RSelenium. Selenium basically remote-controls a browser (for example Firefox or Chrome), which executes the Javascript as intended.
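For completeness, a minimal RSelenium sketch (untested; the element names follow the "Jail Records" link mentioned above and the form fields used in the rvest code below in this thread, and may differ on the live page):
library(RSelenium)
library(rvest)

# Start a browser session; the port is arbitrary
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client

# Follow the Javascript redirect path: landing page, then the "Jail Records" link
remDr$navigate("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
remDr$findElement(using = "link text", value = "Jail Records")$clickElement()

# Fill and submit the search form (field names assumed from the form code below)
remDr$findElement(using = "name", value = "LastName")$sendKeysToElement(list("DOE"))
remDr$findElement(using = "name", value = "FirstName")$sendKeysToElement(list("JOHN"))
remDr$findElement(using = "name", value = "SearchSubmit")$clickElement()

# Hand the rendered page back to rvest
results <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes("table")

remDr$close()
rD$server$stop()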
Thanks to Dave2e.
Here is the code that works. This question is answered (but I'll post another one because I'm not getting a table of data as a result).
Note: I cannot find any Terms of Service on this site that I'm querying
library(rvest)
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form") %>% html_form()
form.unfilled
#name values
lname <- "DOE"
fname <- "JOHN"
# Fill the form with name values
form.filled <- form.unfilled %>%
  set_values("LastName" = lname,
             "FirstName" = fname)

# Submit form
r <- submit_form(jail2, form.filled,
                 submit = "SearchSubmit")
#grab tables from submitted form
table <- r %>% html_nodes("table")
#grab a table with some data
table[[5]] %>% html_table()
# resulting text in this table:
# " An error occurred while processing your request.Please contact your system administrator."

Parse CDATA with R

I'm scraping and analyzing data from a car auction website. My goal is to develop date-time and sentiment analysis skills, and I like old cars. The website is Bring A Trailer-- they do not offer API access (I asked), but robots.txt is OK.
SO user '42' pointed out that this is not permitted by BAT's terms, so I have removed their base url. I will likely remove the question. After thinking about it, I can do what I want by saving a couple of webpages from my browser and analyzing that data. I don't need ALL the auctions; I just followed a tutorial that did, and here I am reading the TOS instead of doing what I wanted in the first place...
Some of the data is easily accessed, but the best parts are hard, and I'm stuck with that. I'm really looking for advice on my approach.
My first steps work: I can find and locally cache the webpages:
library(tidyverse)
library(rvest)

data_dir <- "bat_data-html/"

# Step 1: Create list of links to listings ----------------------------
base_url <- "https://"
pages <- read_html(file.path(base_url, "/auctions/")) %>%
  html_nodes(".auctions-item-title a") %>%
  html_attr("href") %>%
  file.path

pages <- head(pages, 3) # use a subset for testing code

# Step 2: Save auction pages locally ---------------------------------
dir.create(data_dir, showWarnings = FALSE)
p <- progress_estimated(length(pages))

# Download each auction page
walk(pages, function(url){
  download.file(url, destfile = file.path(data_dir, basename(url)), quiet = TRUE)
  p$tick()$print()
})
I can also process metadata about the auction from these cached pages, identifying the css selectors with SelectorGadget and specifying them to rvest:
# Step 3: Process each auction info into df ----------------------------
files <- dir(data_dir, pattern = "*", full.names = TRUE)

# Function: get_auction_details, to be applied to each auction page
get_auction_details <- function(file) {
  pagename <- basename(file) # the filename of the page (trailing index for multiples)
  page <- read_html(file)    # read the html into R (consider options = "NOCDATA")
  # Grab the title of the auction stored in the ".listing-post-title" tag on the page
  title <- page %>% html_nodes(".listing-post-title") %>% html_text()
  # Grab the "BAT essentials" of the auction stored in the ".listing-essentials-item" tag on the page
  essence <- page %>% html_nodes(".listing-essentials-item") %>% html_text()
  # Assemble into a data frame
  info_tbl0 <- as_tibble(essence)
  info_tbl <- add_row(info_tbl0, value = title, .before = 1)
  names(info_tbl)[1] <- pagename
  return(info_tbl)
}

# Apply the get_auction_details function to each element of files
bat0 <- map_df(files, get_auction_details) # run function
bat <- gather(bat0) %>% subset(value != "NA") # serialize results

# Save as csv
write_csv(bat, path = "data-csv/bat04.csv") # this table contains the expected metadata:
key,value
1931-ford-model-a-12,Modified 1931 Ford Model A Pickup
1931-ford-model-a-12,Lot #8576
1931-ford-model-a-12,Seller: TargaEng
But the auction data (bids, comments) is inside of a CDATA section:
<script type='text/javascript'>
/* <![CDATA[ */
var BAT_VMS = { ...bids, comments, results
/* ]]> */
</script>
I've tried selecting elements within this section using the path that I find with SelectorGadget, but they are not found -- this gives an empty list:
tmp <- page %>% html_nodes(".comments-list") %>% html_text()
Looking at the text within this CDATA section, I see some xml tags but it is not structured in the cached file like it is when I inspect the auction section of the live webpage.
To extract this information, should I try to parse the information "as-is" from within this CDATA section, or can I transform it so that it can be parsed like XML? Or am I barking up the wrong tree?
I appreciate any advice!
It's buried in the xml2 documentation, but you can use this option to keep the CDATA intact.
# Instead of rvest::read_html(file), read with:
xml2::read_xml(file, options = "NOCDATA")
After reading the feed in this way, you'll be able to access the comments list the way you wanted.
tmp <- page %>% html_nodes(".comments-list") %>% html_text()
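Applied to the caching workflow above, that amounts to swapping the reader for each cached page. A sketch, assuming the cached page parses as XML; the file name is taken from the metadata output earlier and is only illustrative:
library(rvest)

# Re-read one cached auction page, keeping the CDATA contents as parseable text
page <- xml2::read_xml("bat_data-html/1931-ford-model-a-12", options = "NOCDATA")

comments <- page %>%
  html_nodes(".comments-list") %>%
  html_text()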

Web scrape subtitles from opensubtitles.org in R

I'm new to web scraping, and I'm currently trying to download subtitle files for over 100,000 films for a research project. Each film has a unique IMDb ID (e.g., the ID for Inception is 1375666). I have a list in R containing the 102524 IDs, and I want to download the corresponding subtitles from opensubtitles.org.
Each film has its own page on the site, for example, Inception has:
https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-1375666
The link to download the subtitles is obtained by clicking on the first link in the table called "Movie name", which takes you to a new page, then clicking the "Download button" on that page.
I'm using rvest to scrape the pages, and I've written this code:
for (i in 1:102524) {
  subtitle.url = paste0("https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-", movie.ids[i])
  read_html(subtitle.url) %>%
    html_nodes(".head+ .expandable .bnone")
  # Not sure where to go from here
}
Any help on how to do this will be greatly appreciated.
EDIT: I know I'm asking something pretty complicated, but any pointers on where to start would be great.
Following the link and the download button, we can see that the actual subtitle file is downloaded from https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922 (for your example). I found this out by inspecting the Network tab in Mozilla's Developer Tools while doing a download.
We can download directly from that address using:
download.file('https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922',
              destfile = 'subtitle-6961922.zip')
The base url (https://www.opensubtitles.org/en/download/vrf-108d030f/sub/) is fixed for all the downloads as far as I can see, so we only need the site's id.
The id is found within the search page doing:
id <- read_html(subtitle.url) %>%
  html_node('.bnone') %>%
  html_attr('href') %>%
  stringr::str_extract('\\d+')
So, putting it all together:
search_url <- 'https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-'
download_url <- 'https://www.opensubtitles.org/en/download/vrf-108d030f/sub/'
for (i in 1:102524) {
  subtitle.url = paste0(search_url, movie.ids[i])
  id <- read_html(subtitle.url) %>%
    html_node('.bnone') %>%
    html_attr('href') %>%
    stringr::str_extract('\\d+')
  download.file(paste0(download_url, id),
                destfile = paste0('subtitle-', movie.ids[i], '.zip'))
  # Wait somewhere between 1 and 4 seconds before the next download
  # as a courtesy to the site
  Sys.sleep(runif(1, 1, 4))
}
Keep in mind this will take a long time!
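With roughly 100,000 requests, some downloads will inevitably fail; a small wrapper that records failures instead of stopping the loop may help (a sketch; the names are illustrative):
failed <- character(0)

safe_download <- function(url, destfile) {
  tryCatch(
    download.file(url, destfile = destfile, quiet = TRUE),
    error = function(e) {
      # Remember the failed url and keep going
      failed <<- c(failed, url)
      message("Failed: ", url)
    }
  )
}

# Use safe_download() in place of download.file() inside the loop above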
