Finding all csv links from a website using R

I am trying to download the data files from the ICE website (https://www.theice.com/clear-us/risk-management#margin-rates) containing info on margin strategy. I tried to do so by applying the following code in R:
page <- read_html("https://www.theice.com/clear-us/risk-management#margin-rates")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.csv") # find those that end in csv only
However, it only finds two csv files. That is, it doesn't detect any of the files displayed when clicking on Margin Rates and going to Historic ICE Risk Model Parameter. See below:
raw_list
[1] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Asset_Haircuts_History.csv"
[2] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Currency_Haircuts_History.csv"
I am wondering how I can do that so later on I can select the files and download them.
Thanks a lot in advance

We can look at the network traffic in browser devtools to find the url for each dropdown action.
The Historic ICE Risk Model Parameter dropdown pulls from this page:
https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml;jsessionid=7945F3FE58331C88218978363BA8963C?getParameterFileTable&category=Historical
We remove the jsessionid (per QHarr's comment) and use that as our endpoint:
endpoint <- "https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml?getParameterFileTable&category=Historical"
page <- read_html(endpoint)
Then we can get the full csv list:
raw_list <- page %>%
html_nodes(".table-partitioned a") %>% # add specificity as QHarr suggests
html_attr("href")
Output:
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210226.CSV'
...
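Those hrefs are relative, so to download any of them you can prepend the domain and pass the full URL to download.file(). A minimal sketch (the local file names are just the basenames):
# sketch: turn the relative hrefs into full URLs and download a couple of them
full_urls <- paste0("https://www.theice.com", raw_list)
for (u in head(full_urls, 2)) {
  download.file(u, destfile = basename(u), mode = "wb")
}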

It seems that part of the page is not loaded instantly, so it is missing from your request. The network monitor indicates that a file "ClearUSRiskArrayFiles.shtml" is loaded about 400 ms later. That file seems to provide the required links once you specify the year and month in the URL.
library(rvest)
library(stringr)
page <- read_html("https://www.theice.com/iceriskmodel/ClearUSRiskArrayFiles.shtml?getRiskArrayTable=&type=icus&year=2021&month=03")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href")
head(raw_list[grepl("csv", raw_list)], 3L)
#> [1] "/publicdocs/irm_files/icus/2021/03/NYB0312E.csv.zip"
#> [2] "/publicdocs/irm_files/icus/2021/03/NYB0311E.csv.zip"
#> [3] "/publicdocs/irm_files/icus/2021/03/NYB0311F.csv.zip"
Created on 2021-03-12 by the reprex package (v1.0.0)
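Since the endpoint takes year and month as query parameters, the same request can be wrapped in a small helper to pull links for any month. A sketch (get_icus_links() is a made-up name):
library(rvest)

# sketch: build the endpoint for a given year/month and return the csv links
get_icus_links <- function(year, month) {
  url <- sprintf(
    "https://www.theice.com/iceriskmodel/ClearUSRiskArrayFiles.shtml?getRiskArrayTable=&type=icus&year=%d&month=%02d",
    year, month
  )
  links <- read_html(url) %>% html_nodes("a") %>% html_attr("href")
  links[grepl("csv", links)]
}

head(get_icus_links(2021, 3), 3L)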

Related

How do I extract certain html nodes using rvest?

I'm new to web scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a URL, but I'm not able to extract the nodes I need. See the sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen 7), which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
html_nodes(".buying-zone__title") %>%
html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which can be seen by disabling JS in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then Ctrl+F to search for where the product name appears within the non-rendered content.
The title info you want is present in lots of places and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, as the code gives clear indications of what is being selected.
I've updated the syntax and reduced the number of imported external dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
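If helpful, the scraped pieces can then be gathered into a single object for downstream use. A sketch reusing the objects created above (no output shown, since the page content changes over time):
# collect the scraped values into one list (sketch)
product <- list(
  name  = name,
  price = price,
  specs = specs   # a list of tibbles, one per spec table
)
str(product, max.level = 1)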

How do you download the last file that was added in a directory on internet?

I would like to download the last archive (meteorological data) that has been added to this website using RStudio:
https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/
Do you know how to do it? I am able to download a particular file, but then I have to write its exact name, which would have to be changed manually every time; I do not want that, I want the latest file to be detected automatically.
Thanks.
The function download_CDC() downloads the files for you. Passing 1 as the input downloads the latest one, saved under the name provided by the website.
library(tidyverse)
library(rvest)
base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
files <- base_url %>%
read_html() %>%
html_elements("a+ a") %>%
html_attr("href")
download_CDC <- function(item_number) {
base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
download.file(paste0(base_url, files[item_number]),
destfile = files[item_number],
mode = "wb")
}
download_CDC(1)
It's a bit naïve (no error checking; it blindly takes the last link from the file-list page), but it works with that particular listing.
Most web scraping in R happens through rvest: html_element("a:last-of-type") extracts the last <a> element via a CSS selector - your last archive - and html_attr('href') extracts the href attribute from that element, the actual link to the file.
library(rvest)
last_link <- function(url) {
last_href <- read_html(url) |>
html_element("a:last-of-type") |>
html_attr('href')
paste0(url,last_href)
}
url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
last_link(url)
#> [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/RW-20220720.tar.gz"
Created on 2022-07-21 by the reprex package (v2.0.1)
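To actually fetch the archive, the URL returned by last_link() can be passed to download.file(). A sketch (basename() is used to derive a local file name):
# sketch: download the most recent archive into the working directory
latest <- last_link(url)
download.file(latest, destfile = basename(latest), mode = "wb")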

How to web-scrape a file whose address changes in R

I am interested in this excel file, whose structure does not change: https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a
I can get it from this page: https://rigcount.bakerhughes.com/na-rig-count
The latter URL does not change over time, whereas the first one does.
But I guess the URL of the file is located somewhere in the elements of the fixed webpage, even if it changes, and that the generation of the file name follows a repetitive procedure.
Therefore, is there a way, in R, to get the file (which is updated every week or so) in an automated manner, without downloading it manually each time?
You skipped the part of the question where you show what you had already tried, or what you found when searching the web for tutorials. But it was easy to do, so here goes. You'll have to look up an rvest tutorial for more explanation.
library(rvest) # to allow easy scraping
library(magrittr) # to allow %>% pipe commands
page <- read_html("https://rigcount.bakerhughes.com/na-rig-count")
# Find links that match excel type files as defined by the page
links <- page %>%
html_nodes("span.file--mime-application-vnd-ms-excel-sheet-binary-macroEnabled-12") %>%
html_nodes("a")
links_df <- data.frame(
title = links %>% html_attr("title"),
link = links %>% html_attr("href")
)
links_df
# title
# 1 north_america_rotary_rig_count_jan_2000_-_current.xlsb
# 2 north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb
# link
# 1 https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a
# 2 https://rigcount.bakerhughes.com/static-files/c7852ea5-5bf5-4c47-b52c-f025597cdddf
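From there, the workbook can be fetched automatically each week by passing the link to download.file(). A sketch (the file is saved under its listed title):
# sketch: download the first workbook under the file name shown on the page
download.file(links_df$link[1], destfile = links_df$title[1], mode = "wb")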

How to do web scraping using R when redirecting another html?

I am a new user of the rvest package in R, trying to do web scraping on the Marriott website.
I would like to make a list of the names and prices of Marriott hotels in Japan from the url: https://www.marriott.com/search/submitSearch.mi?destinationAddress.destination=Japan&destinationAddress.country=JP&searchType=InCity&filterApplied=false.
What I have done is as below:
# libraries
library(rvest)
library(dplyr)
library(stringr) # needed for str_subset() below
#get the url
url = "https://www.marriott.com/hotel-search.mi" # url
html = read_html(url) # read webpage
# pull out links to get the labels
links = html %>%
html_nodes(".js-region-pins") %>%
html_attr("href") %>%
str_subset("^.*Japan*")
Here, links contains the URL of the page that lists the 47 Japanese hotels, as below:
links
[1] "/search/submitSearch.mi?destinationAddress.destination=Japan&destinationAddress.country=JP&searchType=InCity&filterApplied=false"
Then,
url_japan = paste("https://www.marriott.com",links,sep="")
url_japan
[1] "https://www.marriott.com/search/submitSearch.mi?destinationAddress.destination=Japan&destinationAddress.country=JP&searchType=InCity&filterApplied=false"
Here is the problem I came across.
When we jump to url_japan, it appears that the loaded page is redirected to another url (https://www.marriott.com/search/findHotels.mi).
In this case, how can I continue web scraping with the rvest package?
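A hedged starting point: rvest::session() follows ordinary HTTP redirects, so you can try opening url_japan in a session and parsing whatever HTML the final page returns. If the hotel list is rendered with JavaScript (which is likely here), a headless-browser approach such as the RSelenium example in the last question below would be needed instead.
library(rvest)

# sketch: let the session follow the redirect, then parse the landing page
sess <- session(url_japan)
page <- read_html(sess)
page %>% html_elements("a") %>% html_attr("href") %>% head()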

R and phantom.js: Can't scrape main content block

I'm trying to scrape the topline data points (Total GHG, GHG per capita, GHG per BTU) and download the charts from the following page, using R and phantom.js:
http://apps1.eere.energy.gov/sled/#/results/home?city=Omaha&abv=NE
This is my code:
url <- "http://apps1.eere.energy.gov/sled/#/results/home?city=Omaha&abv=NE"
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
system("phantomjs scrape.js > eere.html")
pg <- read_html("eere.html")
pg %>% html_nodes("CLASS") %>% html_text()
Where CLASS is the class that I haven't yet identified. The eere.html I get back only contains header and footer content, and fails to capture the body of the page.
Any advice?
With PhantomJS via RSelenium, it would look like
library(RSelenium)
library(rvest)
# start remote driver and browser
remdr <- rsDriver(browser = 'phantomjs', verbose = FALSE)
remdr$client$navigate('http://apps1.eere.energy.gov/sled/#/results/home?city=Omaha&abv=NE')
Sys.sleep(2) # wait 2 secs for page to load if you're not running line by line
page <- remdr$client$getPageSource()
# parse HTML with rvest
page[[1]] %>% read_html() %>% html_nodes('h5') %>% html_text()
#> [1] "Total GHG: metric tons"
#> [2] "GHG per capita: metric tons/person"
#> [3] "GHG per BTU: metric tons/MMBTU"
# clean up
remdr$client$close()
remdr$server$stop()
#> [1] TRUE
RSelenium uses an OOP style that's uncommon in R, but workable. Consequently the docs are arranged in a similarly unusual manner, but they are actually thorough if you dig in.
Lastly, RSelenium is best avoided if you don't need it. It's an important and necessary tool in the R toolbox, but because of what it does it's inherently heavy and slow compared to the rest of R. Given that the site offers good data download options and explains how to recreate the data in question, it's ultimately unnecessary here. For a couple of pages it may be practical, but beyond that there's a point at which it's quicker to just rebuild the data.
