R and phantom.js: Can't scrape main content block

I'm trying to scrape the topline data points (Total GHG, GHG per capita, GHG per BTU) and download the charts from the following page, using R and phantom.js:
http://apps1.eere.energy.gov/sled/#/results/home?city=Omaha&abv=NE
This is my code:
url <- "http://apps1.eere.energy.gov/sled/#/results/home?city=Omaha&abv=NE"
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
system("phantomjs scrape.js > eere.html")
library(rvest) # for read_html(), html_nodes(), html_text()
pg <- read_html("eere.html")
pg %>% html_nodes("CLASS") %>% html_text()
Where CLASS is the class that I haven't yet identified. Here's the html I'm getting for eere.html. It only contains header and footer content, and fails to grab the body of the page.
Any advice?

With PhantomJS via RSelenium, it would look like
library(RSelenium)
library(rvest)
# start remote driver and browser
remdr <- rsDriver(browser = 'phantomjs', verbose = FALSE)
remdr$client$navigate('http://apps1.eere.energy.gov/sled/#/results/home?city=Omaha&abv=NE')
Sys.sleep(2) # wait 2 secs for page to load if you're not running line by line
page <- remdr$client$getPageSource()
# parse HTML with rvest
page[[1]] %>% read_html() %>% html_nodes('h5') %>% html_text()
#> [1] "Total GHG: metric tons"
#> [2] "GHG per capita: metric tons/person"
#> [3] "GHG per BTU: metric tons/MMBTU"
# clean up
remdr$client$close()
remdr$server$stop()
#> [1] TRUE
RSelenium uses an OOP style that's uncommon in R, but workable. Consequently the docs are arranged in a similarly unusual manner, but they are actually thorough if you dig in.
Lastly, RSelenium is best avoided if you don't need it. It's an important and necessary tool in the R toolbox, but because of what it does it's inherently heavy and slow compared to the rest of R. Given that the site offers good data download options and explains how the data in question are put together, it's ultimately unnecessary here. For a couple of pages it may be practical, but beyond that there's a point at which it's quicker to just rebuild the data from the downloads.
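Alternatively, the PhantomJS-only approach in the question can work if the capture is delayed until the Angular app has rendered. A minimal sketch (the delay is my assumption, not part of the original answer):
url <- "http://apps1.eere.energy.gov/sled/#/results/home?city=Omaha&abv=NE"
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
  // give the client-side app time to render before grabbing the DOM
  window.setTimeout(function () {
    console.log(page.content); // page source after rendering
    phantom.exit();
  }, 5000); // 5 seconds; adjust as needed
});", url), con = "scrape.js")
system("phantomjs scrape.js > eere.html")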

Related

Finding all csv links from website using R

I am trying to download the data files from the ICE website (https://www.theice.com/clear-us/risk-management#margin-rates) containing info on margin strategy. I tried to do so by applying the following code in R:
library(rvest)   # read_html(), html_nodes(), html_attr()
library(stringr) # str_subset()
page <- read_html("https://www.theice.com/clear-us/risk-management#margin-rates")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.csv") # find those that end in csv only
However, it only finds two csv files. That is, it doesn't detect any of the files displayed when clicking on Margin Rates and going to Historic ICE Risk Model Parameter. See below:
raw_list
[1] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Asset_Haircuts_History.csv"
[2] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Currency_Haircuts_History.csv"
I am wondering how I can do that so later on I can select the files and download them.
Thanks a lot in advance
We can look at the network traffic in browser devtools to find the url for each dropdown action.
The Historic ICE Risk Model Parameter dropdown pulls from this page:
https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml;jsessionid=7945F3FE58331C88218978363BA8963C?getParameterFileTable&category=Historical
We remove the jsessionid (per QHarr's comment) and use that as our endpoint:
endpoint <- "https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml?getParameterFileTable&category=Historical"
page <- read_html(endpoint)
Then we can get the full csv list:
raw_list <- page %>%
html_nodes(".table-partitioned a") %>% # add specificity as QHarr suggests
html_attr("href")
Output:
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210226.CSV'
...
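The hrefs above are relative, so to actually download the files you can prepend the site root (a small sketch, not from the original answer; the destination folder name is arbitrary):
# download each CSV, prepending the site root to the relative hrefs
dir.create("ice_csv", showWarnings = FALSE)
for (path in raw_list) {
  download.file(paste0("https://www.theice.com", path),
                destfile = file.path("ice_csv", basename(path)),
                mode = "wb")
}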
It seems the page does not load that part of the content immediately, so it is missing from your request. The network monitor indicates that a file "ClearUSRiskArrayFiles.shtml" is loaded about 400 ms later. That file provides the required links once you specify year and month in the URL.
library(rvest)
library(stringr)
page <- read_html("https://www.theice.com/iceriskmodel/ClearUSRiskArrayFiles.shtml?getRiskArrayTable=&type=icus&year=2021&month=03")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href")
head(raw_list[grepl("csv", raw_list)], 3L)
#> [1] "/publicdocs/irm_files/icus/2021/03/NYB0312E.csv.zip"
#> [2] "/publicdocs/irm_files/icus/2021/03/NYB0311E.csv.zip"
#> [3] "/publicdocs/irm_files/icus/2021/03/NYB0311F.csv.zip"
Created on 2021-03-12 by the reprex package (v1.0.0)
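Since year and month are plain query parameters in that URL, you can also loop over them to collect links for other periods (a sketch building on the URL above; adjust the ranges to taste):
library(rvest)
base_url <- "https://www.theice.com/iceriskmodel/ClearUSRiskArrayFiles.shtml?getRiskArrayTable=&type=icus&year=%d&month=%02d"
csv_links <- unlist(lapply(1:3, function(m) {  # e.g. Jan-Mar 2021
  hrefs <- read_html(sprintf(base_url, 2021, m)) %>%
    html_nodes("a") %>%
    html_attr("href")
  hrefs[grepl("csv", hrefs)]
}))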

How to web-scrape a file whose address changes in R

I am interested in this excel file, which structure does not change : https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a
Which i can get from this page : https://rigcount.bakerhughes.com/na-rig-count
The last url does not change over time, whereas the first one does.
But I guess the URL of the file is located somewhere in the elements of the fixed webpage, even if it changes, and that the file name is generated by a repeatable procedure.
Therefore, is there a way, in R, to get the file (which is updated every week or so) in an automated manner, without downloading it manually each time?
You skipped the part of the question where you describe what you had already tried, or what you found searching the web for tutorials. But it was easy to do, so here goes. You'll have to look up an rvest tutorial for more explanation.
library(rvest) # to allow easy scraping
library(magrittr) # to allow %>% pipe commands
page <- read_html("https://rigcount.bakerhughes.com/na-rig-count")
# Find links that match excel type files as defined by the page
links <- page %>%
html_nodes("span.file--mime-application-vnd-ms-excel-sheet-binary-macroEnabled-12") %>%
html_nodes("a")
links_df <- data.frame(
title = links %>% html_attr("title"),
link = links %>% html_attr("href")
)
links_df
#                                                                  title
# 1              north_america_rotary_rig_count_jan_2000_-_current.xlsb
# 2 north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb
#                                                                                  link
# 1 https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a
# 2 https://rigcount.bakerhughes.com/static-files/c7852ea5-5bf5-4c47-b52c-f025597cdddf
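From there, fetching the current workbook each week is a single download.file() call (a sketch; the destination file name is arbitrary, and as.character() guards against the link column being a factor on older R versions):
# grab the first workbook (the rig count history) as a local .xlsb file
download.file(as.character(links_df$link[1]), destfile = "rig_count.xlsb", mode = "wb")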

rvest r data scraping returning empty table

New to programming and trying to scrape data from the site below. When I run the code below it returns an empty dataset or table. Any help or alternatives will be greatly appreciated.
library(rvest) # for read_html(), html_node(), html_text()
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
tab <- url %>% read_html %>%
html_node("dogruns_wrapper") %>%
html_text()
View(tab)
I have tried with XPath with the same result, and using html_table() instead of html_text() returns the error "no applicable method for 'html_table' applied to an object of class "xml_missing"".
As Mislav stated, the table is generated with JavaScript, so your best option is RSelenium.
In addition, if you want to get the table, you can get it with less code if you use html_table().
My try:
# Load packages
library(rvest) #Loading the rvest package
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the webpage
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# define url
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
# go to website
remDr$navigate(url)
# as it's being loaded with JavaScript and it has a slow load, add a sleep here
Sys.sleep(10) # increase as needed
# get the html object of the webpage
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# read the table in the html_obj
tab <- html_obj %>% html_table() %>% .[[1]]
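When you're finished, it's worth closing the browser session that was started above (a small addition, not part of the original answer):
# close the remote browser session opened above
remDr$close()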
Hope it helps! However, always check if webpages allow scraping before doing it!
Check Terms and conditions:
Except for the direct purpose of viewing, printing, accessing or
interacting with the Web Site for your own personal use or as
otherwise indicated on the Web Site or these Terms and Conditions, you
must not copy, reproduce, modify, communicate to the public, adapt,
transfer, distribute, download or store any of the contents of the Web
Site (including Race Information as described below), or incorporate
any part of the Web Site into another web site without GRV’s written
consent.

Triggering doPostBack javascript with RSelenium to scrape multi-page table

I am struggling to 'web-scrape' data from a table which spans several pages. The pages are linked via JavaScript.
The data I am interested in is based on the website's search function:
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
I am able to download the first page with the rvest package:
library(rvest)
library(tidyverse)
NI <- read_html(url)
NI.res <- NI %>%
html_nodes("table") %>%
html_table(fill=TRUE)
NI.res <- NI.res[[1]][c(1:10),c(1:5)]
So far so good.
As far as I understand, the RSelenium package is the way forward for navigating websites driven by JavaScript, where HTML scraping via changing URLs is not possible. I installed the package and ran it in combination with the Docker Toolbox (all working fine):
library (RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-chrome')
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",
port = 4445L,
browserName = "chrome")
remDr$open()
My hope was that by triggering the JavaScript I could navigate to the next page, repeat the rvest commands, and obtain the data contained on the 2nd, 3rd, etc. page (eventually that should be part of a loop or purrr::map() call).
Navigate to the table with search results (1st page):
remDr$navigate("http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/1989&td=01/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0")
Trigger the JavaScript. The content of the JavaScript call is taken from hovering with the mouse over the page index on the website (below the table). In the case below, the JavaScript leading to page 2 is triggered:
remDr$executeScript("__doPostBack('ctl00$MainContentPlaceHolder$SearchResultsGridView','Page$2');", args=list("dummy"))
Repeat the scraping with rvest
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
NI <- read_html(url)
NI.res <- NI %>%
html_nodes("table") %>%
html_table(fill=TRUE)
NI.res2 <- NI.res[[1]][c(1:10),c(1:5)]
Unfortunately though, the triggering of the javascript appears not to work. The scraped results are again those from page 1 and not from page 2. I might be missing something rather basic here, but I can't figure out what.
My attempt is partly informed by SO posts here, here and here. I also saw this post.
Context: Eventually, in further steps, I will have to trigger a click on each single finding/row which shows up on all pages and also scrape the information behind each entry. Hence, as far as I understood, RSelenium will be the main tool here.
Grateful for any hint!
UPDATE
I made 'some' progress following the approach suggested here. It is a) still not doing everything I intend to do and b) very likely not the most elegant way to do it. But maybe it's of some help to others or opens up a way forward. Note that this approach does not require RSelenium.
I basically created a loop over the JavaScript calls (page index) that lead to the other pages of the table I want to scrape. The crucial detail is the __EVENTARGUMENT argument, to which I assign the respective page number (my knowledge of JS is basically zero).
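The loop below assumes a session and the page's ASP.NET form were captured beforehand, roughly like this (a sketch of the setup the snippet relies on, using the older rvest session helpers):
library(rvest)
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
pgsession <- html_session(url)       # keeps cookies between requests
pgform <- html_form(pgsession)[[1]]  # the ASP.NET form holding __VIEWSTATE etc.
page.list <- list()                  # container for the scraped pages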
for (i in 2:15) {
  target <- paste0("Page$", i)  # page index passed to __doPostBack
  page <- rvest:::request_POST(
    pgsession,
    "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0",
    body = list(
      `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
      `__EVENTTARGET` = "ctl00$MainContentPlaceHolder$SearchResultsGridView",
      `__EVENTARGUMENT` = target,
      `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
      `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
      `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value
    ),
    encode = "form"
  )
  x <- read_html(page) %>%
    html_nodes(css = "#ctl00_MainContentPlaceHolder_SearchResultsGridView") %>%
    html_table(fill = TRUE) %>%
    as.data.frame()
  d <- x[1:(nrow(x) - 2), 1:5]  # drop the pager rows at the bottom of the table
  page.list[[i]] <- d           # the for loop increments i itself; no manual i = i + 1 needed
}
However, this code can only trigger the JavaScript for pages that are visible in the page index below the table when the site first opens (1-11). Only pages 2 to 11 can be scraped with this loop; since the scripts for page 12 and beyond are not present on the initial page, they can't be triggered.

Scrape contents of dynamic pop-up window using R

I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
library(XML) # provides readHTMLTable()
data <- data.frame()
for (i in 1:100){
  print(paste("page", i, "of 100"))
  url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
  temp <- readHTMLTable(stringsAsFactors = FALSE, url, which = 1, encoding = "UTF-8")
  data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
`Sale Date`=as.Date(`Sale Date`, format="%m.%d.%Y"),
`Premium Price USD`=as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I believe your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_inXXXXXXX… cookie from that endeavour. I just grabbed it from inspecting the session with Developer Tools in Chrome and that will also work for you (but it's worth the time to learn how to do it via httr).
You'll need to scrape two additional <a … tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session to do the login (which will get you the cookies for free) and then continue to use that session in the scraping (vs read_html), which means you don't have to make the httr GET/POST calls yourself.
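For the html_session route, the login step would look roughly like the following; the wp-login.php URL and the field names (log, pwd) are my guesses at a standard WordPress login form, not confirmed against the site:
library(rvest)
# hypothetical sketch: login URL and form field names are assumptions
login <- html_session("http://www.skatepress.com/wp-login.php")
login_form <- html_form(login)[[1]]
filled <- set_values(login_form, log = "your_username", pwd = "your_password")
logged_in <- submit_form(login, filled)
# reuse the logged-in session (and its cookies) for the rest of the scraping
artworks <- jump_to(logged_in, "http://www.skatepress.com/skates-top-10000/artworks/1/")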
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see it call those two PHP scripts via Developer Tools, which also shows the data it passes in. I'm also really surprised the site doesn't have any anti-scraping clauses in its ToS, but it doesn't.
