Obtaining Linked HTML Text in a Table with rvest

I'm trying to obtain the game IDs for each game listed in this page.
https://www.chess.com/member/bogginssloggins
Here's what I'm doing now:
First, I'm downloading the HTML with RSelenium and saving it as htmlfile.txt (the table doesn't render unless you use Selenium)
Then, I'm using rvest to parse the HTML.
Here is my code, skipping the RSelenium part
library(rvest)
html <- read_html("htmlfile.txt")
GameTable <- html %>% html_table() %>% .[[1]]
Unfortunately GameTable doesn't include the game IDs, just the data actually visible in the table. A sample GameID would be something like the link below.
https://www.chess.com/analysis/game/live/9296762565?username=bogginssloggins
These game links are definitely present in the HTML, but I don't know how to systematically grab them and link them to the corresponding rows of the table. My ideal output would be the data in the table on the webpage (e.g. the players in the game, who won, etc.), plus a column for the game ID. I believe one of the important things to look for is the "archived-games-link" class in the HTML. There are twenty of those links in the HTML and twenty rows in the table, so it seems like they should match up. However, when I run the code below,
"htmlfile.txt" %>% read_html() %>%
html_nodes("[class='archived-games-link']") %>%
html_attr("href")
I get only 18 results returned, even though when I ctrl+f for "archived-games-link" in the html document 20 results are returned.
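One possible explanation for the mismatch: the exact-match attribute selector [class='archived-games-link'] only matches elements whose class attribute is exactly that string, so any link that carries an additional class is skipped. A class selector (.archived-games-link) matches regardless of extra classes. Below is a minimal sketch of the full pipeline; the game-ID regex is inferred from the sample link above, and the assumption that the links come back in the same order as the table rows is unverified:
library(rvest)
html <- read_html("htmlfile.txt")
# .archived-games-link matches the class even when other classes are present
links <- html %>%
html_nodes(".archived-games-link") %>%
html_attr("href")
GameTable <- html %>% html_table() %>% .[[1]]
# assuming one link per table row, in row order
GameTable$gameID <- sub(".*/game/live/([0-9]+).*", "\\1", links)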

How do I extract certain html nodes using rvest?

I'm new to web scraping, so I may not be doing all the proper checks here. I'm attempting to scrape information from a URL, but I'm not able to extract the nodes I need. See the sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen 7), which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
html_nodes(".buying-zone__title") %>%
html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which you can confirm by disabling JavaScript in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then Ctrl+F to search for where the product name exists within the non-rendered content.
The title info you want is present in lots of places, and there are numerous ways to obtain it. I will offer an "as on the tin" option, as the code gives clear indications as to what is being selected.
I've updated the syntax and reduced the volume of imported external dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
# the product name is stored in a product-name attribute
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
# each spec table carries the class "sprocket__table spec"
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
# the price sits in the value attribute of the #gtm-product-display-price element
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
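Note that html_table() returns specs as a list with one tibble per spec table. If a single data frame is more convenient, one option (an assumption about the shape you want, not part of the original answer) is to stack them:
library(dplyr)
specs_df <- bind_rows(specs)  # stack the per-table specs into one data frame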

Scraping data from variable table with rvest

I'm attempting to scrape tables from this page:
https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds
I'm trying to gather the info under "player", "over", and "under",
so the first row would be Joe Flacco 1.5 +140 1.5 -190 (these numbers change, so they may be different when you're reading this).
As an example of code I used on the same website, but different table/link, I used this:
library(rvest)
url <- 'https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-yds'
test <- url %>%
read_html() %>%
html_nodes('.default-color , .sportsbook-outcome-cell__line , .sportsbook-row-name') %>%
html_text()
This code gives me the exact data that I want.
Note that this working code is for a separate page: https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-yds
I used selectorGadget extension to ascertain the selector value
The two different pages I'm looking at are accessible from the header above "Bal Ravens".
Pass Yds is the default table selection for the page; Pass TDs is next to it, which takes you to the page I'm also attempting to scrape.
For some reason, scraping the Pass TDs table using the same method as the Pass Yds table leaves me with an empty string:
url <- 'https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds'
test <- url %>%
read_html() %>%
html_nodes('.sportsbook-row-name , .default-color , .sportsbook-outcome-cell__line') %>%
html_text()
Note that when I use SelectorGadget on this page, it gives me a different html_nodes selector.
I have also tried using XPath and finding the individual tables (with html_table) via the Inspect panel. Again, this process works with the Pass Yds page, but not the Pass TDs page. I assume this problem relates to the fact that the table on the website is variable, with Pass Yds being the default.
If anyone could help me with this, or point me in the direction to information regarding scraping these menu-selectable tables, I would greatly appreciate it.
Thanks!
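This question is left unanswered above; one common workaround for menu-selectable tables like this is to render the page in a real browser before parsing, so the JavaScript has a chance to build the Pass TDs table. A rough RSelenium sketch follows; the browser choice, the five-second wait, and the assumption that the rendered DOM uses the same selectors are all untested guesses:
library(RSelenium)
library(rvest)
# start a local Selenium session (assumes a matching webdriver is installed)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds")
Sys.sleep(5)  # give the JavaScript time to render the Pass TDs table
# parse the fully rendered DOM instead of the raw page source
test <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.sportsbook-row-name , .default-color , .sportsbook-outcome-cell__line') %>%
html_text()
remDr$close()
rD$server$stop()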

Not able to select href of a tag with Xpath (rvest)

I am scraping https://ic.gc.ca/eic/site/bsf-osb.nsf/eng/h_br02281.html with the rvest package in R. I would like to get the hyperlink associated with each company name. In the HTML, each name sits inside an <a href=...> tag in the first column of the table.
My code looks like this:
library(rvest)
library(dplyr)
url = "https://ic.gc.ca/eic/site/bsf-osb.nsf/eng/h_br02281.html"
ccaa = read_html(url)
links = ccaa %>%
html_nodes("body") %>%
xml_find_all("//td[1]//a[@href]") %>%
html_text()
But this is only returning the names of the firms/cases, not the links that they are associated with.
How can I get these links? The end goal of this is to put these links into a data frame (along with other information) which will be rendered in a Shiny data table. Then, when a user is interested in a particular insolvency case, they can click on the link to see more information.
I am somewhat new to R and to asking questions on Stack Overflow, so please let me know if you require more information.
Replace html_text with html_attr('href'): html_text() returns the text inside the tag, whereas html_attr() returns the value of the named attribute.
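Applied to the code above (untested, but only the last pipeline step changes):
links = ccaa %>%
html_nodes("body") %>%
xml_find_all("//td[1]//a[@href]") %>%
html_attr("href")  # the URL in the href attribute, rather than the link text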

Why is html_nodes() in R not giving me the desired output for this webpage?

I am looking to extract all the links for each episode on this webpage, but I appear to be having difficulty using html_nodes() where I haven't had such difficulty before. I am trying to select with a "." class selector so that all the matching attributes on the page are returned. This code is meant to output all of those attributes, but instead I get {xml_nodeset (0)}. I know what to do once I have the attributes in order to pull out the links specifically, but this step is proving a stumbling block on this website.
Here is the code I have begun in R:
episode_list_page_1 <- "https://jrelibrary.com/episode-list/"
episode_list_page_1 %>%
read_html() %>%
html_node("body") %>%
html_nodes(".type-text svelte-fugjkr first-mobile first-desktop") %>%
html_attrs()
This rvest approach does not work here because the page uses JavaScript to insert another webpage into an iframe, and that is where the information is displayed.
If you search the embedded script you will find a reference to this page: "https://datawrapper.dwcdn.net/eoqPA/66/", which will redirect you to "https://datawrapper.dwcdn.net/eoqPA/67/". This second page contains the data you are looking for as embedded JSON, generated via JavaScript.
The links to the shows are extractable, and there is a link to a Google doc that is the full index.
Searching that second page turns up the link to the Google doc:
library(rvest)
library(dplyr)
library(stringr)
page2 <-read_html("https://datawrapper.dwcdn.net/eoqPA/67/")
#find all of the links on the page:
str_extract_all(html_text(page2), 'https:.*?\\"')
#isolate the Google docs
print(str_extract_all(html_text(page2), 'https://docs.*?\\"') )
#[[1]]
#[1] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing"
#[2] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"
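Since that second URL is a Google Sheets CSV export, it can be read straight into a data frame, assuming the sheet stays publicly shared:
episodes <- read.csv("https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8")
head(episodes)  # inspect the first rows of the full episode index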

How can I scrape this recipe?

I am trying to webscrape some recipes for my own personal collection. It works great on some sites because the website structure sometimes easily allows for scraping, but some are harder. This one I have no idea how to deal with:
https://www.koket.se/halloumigryta-med-tomat-linser-och-chili
For the moment, let's just assume I want the ingredients on the left. If I inspect the website it looks like what I want are the two article class="ingredients" chunks. But I can't seem to get there.
I start with the following:
library(rvest)
library(tidyverse)
read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
html_nodes(".recipe-column-wrapper") %>%
html_nodes(xpath = '//*[@id="react-recipe-page"]')
However, running the above code shows that all of the ingredients are stored in data-item like so:
<div id="react-recipe-page" data-item="{
"chefNames":"<a href='/kockar/siri-barje'>Siri Barje</a>",
"groupedIngredients":[{
"header":"Kokosris",
"ingredients":[{
"name":"basmatiris","unit":"dl","amount":"3","amount_info":{"from":3},"main":false,"ingredient":true
}
<<<and so on>>>
So I am a little bit puzzled, because from inspecting the website everything seems to be neatly placed in things I can extract, but now it's not. Instead, I'd need some serious regular expressions in order to get everything like I want it.
So my question is: am I missing something? Is there some way I can get the contents of the ingredients articles?
(I tried SelectorGadget, but it just gave me No valid path found).
You can extract attributes by using html_attr("data-item") from the rvest package.
Furthermore, the data-item attribute looks like it's JSON, which you can convert to a list using fromJSON from the jsonlite package:
library(jsonlite)
html <- read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
html_nodes(".recipe-column-wrapper") %>%
html_nodes(xpath = '//*[@id="react-recipe-page"]')
# pull the data-item attribute and parse its JSON payload
recipe <- html %>% html_attr("data-item") %>%
fromJSON()
Lastly, the recipe list contains lots of values that are not relevant here, but the ingredients and measurements are in there as well, in the element recipe$ingredients.
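As a sketch of that last step, assuming the groupedIngredients structure shown in the question's data-item snippet (a header plus a nested ingredients table per section):
library(tidyr)
# fromJSON simplifies groupedIngredients to a data frame whose ingredients
# column is a list of per-section ingredient tables; unnest() flattens
# them into one row per ingredient
ingredients <- recipe$groupedIngredients %>% unnest(ingredients)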
