Parse table data into R but it's blank, JavaScript?

This is my first post and I'm a beginner with R, so patience is requested if I should have found an answer to my question elsewhere.
I'm trying to cobble together a table with data pulled from multiple CME pages (https://www.cmegroup.com/trading/energy/crude-oil/western-canadian-select-wcs-crude-oil-futures.html is one).
I've tried using rvest but get a blank table.
I think this is because of the JavaScript that is being used to populate the table in real time? I've looked around this site for similar problems and haven't quite figured out how best to pull this data. Any help is much appreciated.
library(rvest)
library(dplyr)

WCS_page <- "https://www.cmegroup.com/trading/energy/crude-oil/canadian-heavy-crude-oil-net-energy-index-futures_quotes_globex.html"
WCS_diff <- read_html(WCS_page)

month <- WCS_diff %>%
  rvest::html_nodes('th') %>%
  xml2::xml_find_all("//scope[contains(@col, 'Month')]") %>%
  rvest::html_text()

price <- WCS_diff %>%
  rvest::html_nodes('tr') %>%
  xml2::xml_find_all("//td[contains(@class, 'quotesFuturesProductTable1_CLK0_last')]") %>%
  rvest::html_text()

WTI_df <- data.frame(month, price)

knitr::kable(
  WTI_df %>% head(10))

Yes, the page is using JS to load the data.
The easy way to check is to view the page source and then search for some of the text you saw in the table. For example, the word "May" never shows up in the raw HTML, so it must have been loaded later.
The next step is to use something like the Chrome DevTools to inspect the network requests that were made. In this case there is a clear winner, and your structured data is coming down from here:
https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/6038/G
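A minimal sketch of pulling that feed directly with jsonlite instead of scraping the rendered page; the shape of the returned object (for example, whether it exposes a quotes element) is an assumption, so inspect it before relying on any field names.
library(jsonlite)

# Fetch the JSON feed found via the browser's network inspector
feed <- fromJSON("https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/6038/G")

# Check the top-level structure before relying on any particular field names
str(feed, max.level = 1)

# If the payload exposes a quotes table (an assumption), it can be used directly:
# head(feed$quotes, 10)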

Related

How do I extract certain html nodes using rvest?

I'm new to web-scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a url, however I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen) which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which you can confirm by disabling JavaScript in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then Ctrl+F to search for where the product name exists within the non-rendered content.
The title info you want is present in lots of places and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, as the code gives clear indications of what is being selected.
I've also updated the syntax and reduced the number of imported external dependencies.
library(magrittr)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)

# Product name is carried in a product-name attribute present in the static HTML
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")

# Spec tables
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()

# Price is stored in the value attribute of the element with id gtm-product-display-price
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()

Scraping text from a webpage using 'rvest' and SelectorGadget

I am trying to get some text from a webpage. To simplify my question, let me use @RonakShah's Stack Overflow account as an example to extract the reputation value. With SelectorGadget showing "div, div", I used the following code:
library(rvest)
so <- read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
html_nodes("div") %>% html_nodes("div") %>% html_text()
This gave an object so with as many as 307 items.
Then, I turned the object into a dataframe:
so <- as.data.frame(so)
View(so)
Then I manually went through all the items in the data frame until I found the correct value, so$so[69]. My question is how to quickly find the specific target value. In my real case it is a little more complicated to do manually, as there are multiple items with the same values and I need to identify the correct order. Thanks.
You need to find a specific tag and its respective class that sit closer to your target. You can find that using SelectorGadget.
library(rvest)
read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
html_nodes("div.grid--cell.fs-title") %>%
html_text()
#[1] "254,328"
As far as scraping Stack Overflow is concerned, it has an API to get information about users/questions/answers. In R, there is a wrapper package around it called stackr (not on CRAN) which makes this very easy.
library(stackr)
data <- stack_users(3962914)
data$reputation
#[1] 254328
data has a lot of other information about the user as well.
3962914 is the user id of the user you are interested in, which can be found from their profile link (https://stackoverflow.com/users/3962914/ronak-shah).
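If you prefer not to rely on a non-CRAN package, here is a minimal sketch that queries the public Stack Exchange API directly with jsonlite; the endpoint and the items/reputation field names follow the public API documentation, so verify them against the current API version.
library(jsonlite)

# Query the /users endpoint for the same user id
res <- fromJSON("https://api.stackexchange.com/2.3/users/3962914?site=stackoverflow")

# Per the API docs, user records come back inside an items element
res$items$reputation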

Web scraping tables on college basketball stats

I am new to web scraping and working on a test project in which I am trying to scrape every table of data on the following website for this particular team. There should be 15 tables, but when I run my code it only seems to pull the first 6 of the 15. How do I go about getting the rest of the tables?
Here is the code:
library(tidyverse)
library(rvest)
library(stringr)
library(lubridate)
library(magrittr)
iowa_stats<- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
iowa_stats %>% html_table()
Edit: I decided to dig a little deeper into the problem and see if I could get any more insight. I started with the first table that doesn't appear when you call the html_table command, which is the 'Totals' table, and followed the path of the HTML all the way down to the table to see if I could figure out what's wrong, using the following code.
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper")
This is as far as I can get before hitting an error. At the next step there should be div#div_totals.table_container.is_setup, in which the table is stored, but if I add that to the above code it doesn't exist. When I type the following, it doesn't exist either.
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper") %>% html_nodes("div")
Does someone who is better with HTML/CSS have any idea why this is the case?
It looks like this webpage is storing some of the tables as comments. To solve this, read and save the web page, remove the comment tags, and then process it normally.
library(rvest)
library(dplyr)
iowa_stats <- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")

# Only save and work with the body
body <- html_node(iowa_stats, "body")
write_xml(body, "temp.xml")

# Drop the lines carrying the comment markers so the hidden tables become plain markup
lines <- readLines("temp.xml")
lines <- lines[-grep("<!--", lines)]
lines <- lines[-grep("-->", lines)]
writeLines(lines, "temp2.xml")

# Read the file back in and process normally
body <- read_html("temp2.xml")
html_nodes(body, "table") %>% html_table()

Web scraping a table with no HTML class

I'm exploring web scraping some weather data, specifically the table on the right panel of this page: https://wrcc.dri.edu/cgi-bin/cliMAIN.pl?ak4988
I'm able to navigate to the appropriate location (see below), but have not been able to pull out the table, e.g. with html_nodes("table").
library(tidyverse)
library(rvest)
url <- read_html("https://wrcc.dri.edu/cgi-bin/cliMAIN.pl?ak4988")

url %>%
  html_nodes("frame") %>%
  magrittr::extract2(2)
# {html_node}
# <frame src="/cgi-bin/cliRECtM.pl?ak4988" name="Graph">
I've also looked at the namespace with no luck
xml_ns(url)
# <->
This works for me.
library(rvest)
library(magrittr)
library(plyr)

# Doing URLs one by one: hit the frame's own URL directly
url <- "https://wrcc.dri.edu/cgi-bin/cliRECtM.pl?ak4988"

# Pull the first table from the frame's page
pricesdata <- read_html(url) %>% html_nodes(xpath = "//table[1]") %>% html_table(fill = TRUE)

df <- ldply(pricesdata, data.frame)
Originally I was hitting the wrong URL; the comment from Mogzol pointed me in the right direction. The parent page is a frameset, and each frame (including the scrolling panel that holds the table) loads its own URL, so the table has to be read from the frame's URL (/cgi-bin/cliRECtM.pl?ak4988) rather than from the parent page. Thanks!!
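A small sketch of following the frame programmatically instead of hard-coding its URL; it assumes the table lives in the second frame, as the output shown in the question suggests.
library(rvest)
library(xml2)

# Read the parent (frameset) page and collect the frame URLs
parent <- read_html("https://wrcc.dri.edu/cgi-bin/cliMAIN.pl?ak4988")
frame_srcs <- parent %>% html_nodes("frame") %>% html_attr("src")

# The second frame held the table in the question's output; resolve its relative URL
frame_url <- url_absolute(frame_srcs[2], "https://wrcc.dri.edu")

# Scrape the first table from the frame's own page
tbl <- read_html(frame_url) %>% html_nodes(xpath = "//table[1]") %>% html_table(fill = TRUE)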

Scraping an HTML table in R after changing a JavaScript dropdown option

I am looking to scrape the main table from this website. I have managed to get the table into R and working, but the one problem is that the website defaults to PS4, while I want the data for Xbox (this is changed in the top-right dropdown menu).
Ideally there would be a way to pass a parameter in the URL that defines the platform, but I haven't been able to find anything about that.
Looking around, it seems that PhantomJS would be the best way to go, but I have no experience using JavaScript and I'm not sure how you would implement performing an action on the page and then scraping the resulting table.
This is currently all I have in terms of my main code scraping the data:
library(rvest)

url1 <- "https://www.futbin.com/19/players?page="
pge <- 1

tbl <- paste0(url1, pge) %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()
Thanks in advance for any help.
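As a rough illustration of the browser-automation approach described above, here is a hedged sketch using RSelenium to change the dropdown before scraping; the CSS selector for the Xbox entry ("#xbox-option") is purely hypothetical and must be replaced with the real one found by inspecting the page.
library(RSelenium)
library(rvest)

# Start a browser session driven by Selenium
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://www.futbin.com/19/players?page=1")

# Click the Xbox entry in the top-right dropdown ("#xbox-option" is a placeholder selector)
xbox <- remDr$findElement(using = "css selector", "#xbox-option")
xbox$clickElement()
Sys.sleep(2)  # give the table time to refresh

# Hand the rendered HTML to rvest and extract the table as before
tbl <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()

remDr$close()
rD$server$stop()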
