Scraping data from variable table with rvest - r

I'm attempting to scrape tables from this page:
https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds
I'm trying to gather the info under "player", "over", and "under",
so the first row would be Joe Flacco 1.5 +140 1.5 -190 (these numbers change, so they may be different by the time you read this).
As an example of code I used on the same website, but different table/link, I used this:
library(rvest)
url <- 'https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-yds'
test <- url %>%
read_html() %>%
html_nodes('.default-color , .sportsbook-outcome-cell__line , .sportsbook-row-name') %>%
html_text()
This code gives me the exact data that I want.
Note that this working code is for a separate page: https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-yds
I used the SelectorGadget extension to determine the selector value.
The 2 different pages I'm looking at are accessible from the header above "Bal Ravens"
Pass Yds is the default table selection for the page; Pass TDs is next to it, which gets you to the page which I'm also attempting to scrape.
For some reason, scraping the Pass TDs table using the same method as Pass Yds table leaves me with an empty string:
url <- 'https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds'
test <- url %>%
read_html() %>%
html_nodes('.sportsbook-row-name , .default-color , .sportsbook-outcome-cell__line') %>%
html_text()
Note that when using selectorGadget for this page, it gives me a different html_nodes selector
I have also tried using XPath and finding the individual tables (with html_table) on the Inspect page. Again, this process works with the Pass Yds page, but not the Pass TDs page. I assume this problem relates to the fact that the table on the website is variable, with Pass Yds being the default.
If anyone could help me with this, or point me in the direction to information regarding scraping these menu-selectable tables, I would greatly appreciate it.
Thanks!
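
One way to narrow this down is to count how many nodes each selector matches in the HTML that read_html() actually receives. The sketch below runs against a small hand-written page (the markup is made up for illustration, not copied from DraftKings). If the same check returns zeros for the Pass TDs URL, the table is probably filled in by JavaScript after the page loads, in which case a browser-driving tool such as RSelenium would be needed instead of rvest alone.

```r
library(rvest)

# Count how many nodes each CSS selector matches in a parsed page.
selector_counts <- function(page, selectors) {
  vapply(selectors, function(sel) length(html_elements(page, sel)), integer(1))
}

# Hypothetical static HTML: two of the three classes are present,
# the third would only appear after JavaScript runs in a browser.
static_page <- minimal_html('
  <div class="sportsbook-row-name">Player A</div>
  <span class="sportsbook-outcome-cell__line">1.5</span>
  <div id="root"></div>
')

counts <- selector_counts(
  static_page,
  c(".sportsbook-row-name", ".sportsbook-outcome-cell__line", ".default-color")
)
print(counts)
```

To run the same check on the live page, replace static_page with read_html(url).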

Related

How do I extract certain html nodes using rvest?

I'm new to web-scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a url, however I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen) which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
html_nodes(".buying-zone__title") %>%
html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page. You can see this by disabling JavaScript in the browser, or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then use Ctrl+F to find where the product name appears in the non-rendered content.
The title info you want is present in several places, so there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, where the selectors give a clear indication of what is being selected.
I've updated the syntax and reduced the volume of imported external dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
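
To see what those three selectors do without hitting the site, here is a minimal sketch against hand-written HTML. The markup and values are invented for illustration and only mimic the attribute names used above. It shows an attribute-presence selector ([product-name]), an id selector, and an exact match on the class attribute.

```r
library(rvest)

# Hypothetical markup that mimics the attributes targeted above.
page <- minimal_html('
  <div product-name="Example Bike 9"></div>
  <input id="gtm-product-display-price" type="hidden" value="1299.99">
  <table class="sprocket__table spec">
    <tr><th>Part</th><th>Spec</th></tr>
    <tr><td>Frame</td><td>Carbon</td></tr>
  </table>
')

# [product-name] matches any element that has that attribute.
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")

# The price is read from the value attribute, then converted to a number.
price <- page %>%
  html_element("#gtm-product-display-price") %>%
  html_attr("value") %>%
  as.numeric()

# html_elements() + html_table() returns a list with one tibble per table.
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
```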

Obtaining Linked HTML Text in a Table with RVest

I'm trying to obtain the game IDs for each game listed in this page.
https://www.chess.com/member/bogginssloggins
Here's what I'm doing now:
First, I'm downloading the HTML with RSelenium and saving it as htmlfile.txt (the table doesn't render unless you use Selenium)
Then, I'm using RVest to parse the HTML.
Here is my code, skipping the RSelenium part
library(rvest)
html <- read_html("htmlfile.txt")
GameTable <- html %>% html_table() %>% .[[1]]
Unfortunately GameTable doesn't include the game IDs, just the data actually visible in the table. A sample GameID would be something like the link below.
https://www.chess.com/analysis/game/live/9296762565?username=bogginssloggins
These game links are clearly present in the HTML, but I don't know how to grab them systematically and link them to the corresponding rows of the table. My ideal output would be the data in the table on the webpage (e.g. the players in the game, who won, etc.), plus a column for the game ID. I believe one of the important things to look for is "archived-games-link" in the HTML. There are twenty of those links in the HTML and twenty rows in the table, so they should match up. However, when I run the code below
"htmlfile.txt" %>% read_html() %>%
html_nodes("[class='archived-games-link']") %>%
html_attr("href")
I get only 18 results returned, even though when I ctrl+f for "archived-games-link" in the html document 20 results are returned.
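
One possible explanation (an assumption, not verified against the real page) is that [class='archived-games-link'] only matches elements whose class attribute is exactly that string, so links that carry an additional class are skipped, whereas .archived-games-link matches them too. The sketch below uses made-up HTML, with a hypothetical second class name and example.com URLs, to show the difference and to attach one link per row by selecting within each row.

```r
library(rvest)

# Hypothetical table: the second link has an extra class.
html <- minimal_html('
  <table>
    <tr><th>Players</th><th>Result</th></tr>
    <tr>
      <td><a class="archived-games-link" href="https://www.example.com/game/live/1">A vs B</a></td>
      <td>1-0</td>
    </tr>
    <tr>
      <td><a class="archived-games-link extra-class" href="https://www.example.com/game/live/2">C vs D</a></td>
      <td>0-1</td>
    </tr>
  </table>
')

# Exact attribute match misses the link with two classes.
exact <- html %>% html_elements("[class='archived-games-link']")

# Class selector matches both.
by_class <- html %>% html_elements(".archived-games-link")

# Take only the data rows, then pull one href per row so that the
# links stay aligned with the table (NA where a row has no link).
rows <- html %>% html_elements(xpath = "//tr[td]")
game_url <- rows %>% html_element(".archived-games-link") %>% html_attr("href")

game_table <- html %>% html_element("table") %>% html_table()
game_table$game_url <- game_url
```

The game ID can then be taken from the end of each URL.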

Why is html_nodes() in R not giving me the desired output for this webpage?

I am looking to extract all the links for each episode on this webpage, however I appear to be having difficulty using html_nodes() where I haven't experienced such difficulty before. I am trying to iterate the code using "." such that all the attributes for the page are obtained with that CSS. This code is meant to give an output of all the attributes, but instead I get {xml_nodeset (0)}. I know what to do once I have all the attributes in order to obtain the links specifically out of them, but this step is proving a stumbling block for this website.
Here is the code I have begun in R:
episode_list_page_1 <- "https://jrelibrary.com/episode-list/"
episode_list_page_1 %>%
read_html() %>%
html_node("body") %>%
html_nodes(".type-text svelte-fugjkr first-mobile first-desktop") %>%
html_attrs()
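
One thing worth checking in the selector itself: in CSS a space means "descendant", so ".type-text svelte-fugjkr first-mobile first-desktop" looks for a <first-desktop> element nested inside other elements, not for one element with four classes. To match several classes on the same element, chain them with dots. A minimal sketch on made-up HTML (the link is a placeholder):

```r
library(rvest)

# Hypothetical element carrying all four classes.
page <- minimal_html('
  <div class="type-text svelte-fugjkr first-mobile first-desktop">
    <a href="https://example.com/ep1">Episode 1</a>
  </div>
')

# Spaces are descendant combinators, so this matches nothing here.
descendant <- page %>%
  html_elements(".type-text svelte-fugjkr first-mobile first-desktop")

# Dots chain classes on a single element.
compound <- page %>%
  html_elements(".type-text.svelte-fugjkr.first-mobile.first-desktop")

links <- compound %>% html_elements("a") %>% html_attr("href")
```

Fixing the selector alone will not help if the elements are missing from the HTML that read_html() downloads.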
rvest alone does not work here because this page uses JavaScript to insert another webpage into an iframe in order to display the information.
If you search the embedded script you will find a reference to this page: "https://datawrapper.dwcdn.net/eoqPA/66/", which will redirect you to "https://datawrapper.dwcdn.net/eoqPA/67/". This second page contains the data you are looking for as embedded JSON, which is rendered via JavaScript.
The links to the shows are extractable, and there is also a link to a Google doc that is the full index.
Searching this second page turns up the link to that Google doc:
library(rvest)
library(dplyr)
library(stringr)
page2 <-read_html("https://datawrapper.dwcdn.net/eoqPA/67/")
#find all of the links on the page:
str_extract_all(html_text(page2), 'https:.*?\\"')
#isolate the Google docs
print(str_extract_all(html_text(page2), 'https://docs.*?\\"') )
#[[1]]
#[1] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing"
#[2] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"
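
Note that a pattern ending in \\" keeps the closing quote as part of each match. A small variation, sketched here on a made-up string (the example.com URLs are placeholders), is to match everything up to, but not including, the next double quote:

```r
library(stringr)

# Hypothetical fragment of embedded JSON text.
embedded <- '{"source":"https://docs.example.com/spreadsheets/d/abc/edit","csv":"https://docs.example.com/spreadsheets/d/abc/export?format=csv"}'

# Original style: lazy match that ends on the closing quote (quote included).
with_quote <- str_extract_all(embedded, 'https://docs.*?\\"')[[1]]

# Alternative: stop before the first double quote.
clean <- str_extract_all(embedded, 'https://docs[^"]*')[[1]]
```

With a real export?format=csv link, the clean URL can usually be passed straight to read.csv().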

Rvest is unable to find the node specified by css selector, how do I fix it?

I am scraping data from this website and for some reason, I'm unable to get the name of the seller, even though I use the exact node returned by SelectorGadget. I have, however, managed to get all the other data with Rvest.
I managed to scrape the seller's name with RSelenium but that takes too much time. Anyway, here's the link of the page I'm scraping:
https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946
Here's the code I've used
SellerName <-
read_html("https://kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946") %>%
html_nodes(".link-4200870613") %>%
html_text()
You can easily regex the seller name out of the response, as it is contained in a script tag (presumably the page renders it from there when a browser runs the JavaScript, which rvest does not do).
library(rvest)
library(magrittr)
library(stringr)
p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>% html_text()
seller_name <- str_match_all(p,'"sellerName":"(.*?)"')[[1]][,2][1]
print(seller_name)
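
The same idea can be tried offline. This is a minimal sketch on made-up HTML; the JSON keys other than "sellerName" and all values are invented. It restricts the search to script tags and uses a capture group to return only the text between the quotes.

```r
library(rvest)
library(stringr)

# Hypothetical page: the visible markup is empty, the data sits in a script tag.
page <- minimal_html('
  <div id="app"></div>
  <script>window.__data = {"adId":"1","sellerName":"Example Seller","price":20};</script>
')

# Text content of every script tag on the page.
script_text <- page %>% html_elements("script") %>% html_text()

# Column 2 of str_match() holds the first capture group;
# scripts without a match give NA and are dropped.
matches <- str_match(script_text, '"sellerName":"(.*?)"')[, 2]
seller_name <- matches[!is.na(matches)][1]
print(seller_name)
```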

Scraping a HTML table in R, after changing a Javascript dropdown option

I am looking to scrape the main table from this website. I have managed to get the table into R and working, but the one problem is the website defaults to PS4, while I wanted the data for Xbox (this is changed in the top-right dropdown menu).
Ideally there would be a way to pass a parameter in the URL that will define the platform in this way, but I haven't been able to find anything about that.
Looking around it seems that PhantomJS would be the best way to go but I have no experience using Javascript and I'm not sure how you would implement performing an action on the page, then scraping the resulting table.
This is currently all I have in terms of my main code scraping the data:
library(rvest)
url1 <- "https://www.futbin.com/19/players?page="
pge <- 1
tbl <- paste0(url1, pge) %>%
read_html() %>%
html_nodes(xpath='//*[@id="repTb"]') %>%
html_table()
Thanks in advance for any help.
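
If the dropdown stores its choice in a cookie rather than in the URL (an assumption, not something confirmed for this site), one option that avoids PhantomJS is to send that cookie with the request. The sketch below separates parsing from fetching so the parsing part can be tried on hand-written HTML. The cookie name and value in the commented call are hypothetical and would need to be checked in the browser's developer tools. The fetch function requires the httr package.

```r
library(rvest)

# Parse the table with id "repTb" out of an already-loaded page.
parse_players <- function(page) {
  page %>% html_element(xpath = '//*[@id="repTb"]') %>% html_table()
}

# Fetch one results page, optionally sending cookies with the request.
fetch_players <- function(page_number, cookies = character(0)) {
  resp <- httr::GET(
    paste0("https://www.futbin.com/19/players?page=", page_number),
    httr::set_cookies(.cookies = cookies)
  )
  httr::stop_for_status(resp)
  parse_players(read_html(httr::content(resp, as = "text", encoding = "UTF-8")))
}

# Offline check of the parsing step on made-up HTML.
demo <- minimal_html('
  <table id="repTb">
    <tr><th>Name</th><th>Rating</th></tr>
    <tr><td>Player A</td><td>90</td></tr>
  </table>
')
players <- parse_players(demo)

# Hypothetical usage; the cookie name and value are placeholders:
# xbox_players <- fetch_players(1, cookies = c(platform = "xbox"))
```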
