How do I extract certain html nodes using rvest?

I'm new to web-scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a url, however I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen) which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
html_nodes(".buying-zone__title") %>%
html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.

Much of that page is rendered dynamically by JavaScript. You can confirm this by disabling JavaScript in the browser, or by comparing the rendered page against the page source. rvest only sees the page source, which is why a selector that matches in the browser inspector can return an empty node set.
You can view the page source with Ctrl+U, then use Ctrl+F to find where the product name appears in the non-rendered content.
The title is present in several places, and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, where the selector itself makes clear what is being selected.
I've also updated the syntax (html_element()/html_elements() rather than html_node()/html_nodes()) and reduced the number of imported dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
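To see why the attribute-based selectors work where the class selector does not, here is a minimal offline sketch. The HTML fragment is a hypothetical stand-in for the raw page source (the real markup and the price value are assumptions, not copied from the site): the product name lives in an attribute, and no element with the buying-zone__title class exists until JavaScript runs.

```r
library(rvest)

# Hypothetical stand-in for the non-rendered page source.
static_html <- minimal_html('
  <div product-name="Madone SLR 9 eTap Gen"></div>
  <input id="gtm-product-display-price" value="100.50">
')

# The class is only added client-side, so nothing matches in the static source.
missing <- static_html %>% html_elements(".buying-zone__title")
length(missing)

# Attribute selectors find the data that is present in the source.
name <- static_html %>%
  html_element("[product-name]") %>%
  html_attr("product-name")

price <- static_html %>%
  html_element("#gtm-product-display-price") %>%
  html_attr("value") %>%
  as.numeric()
```

Checking length() on the node set before calling html_text() or html_attr() is a cheap way to tell "wrong selector" apart from "content not in the static HTML".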

Related

How to scrape this website in R using rvest?

I’m trying to scrape this website using rvest: https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx
Notice that the site loads quickly, but the data takes some time to appear. I realized that, while the content appears as html text in a web browser Inspector, the nodes appear empty when scraped using rvest.
library(dplyr)
library(rvest)
camara <- "https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx" %>%
session()
camara %>%
html_elements("h2")
camara %>%
html_elements(".box-proyecto")
camara %>%
html_elements("#trabajo-en-sala") %>%
html_elements("#info-tabs") %>%
html_elements("#ajax-container") %>%
html_elements("#pnlTablaOrdinaria")
All of these should return at least some text content, but they appear empty.
I tried using V8 to interpret javascript according to these instructions, but the site appears to use JS only for interface elements, not for data retrieval.
I also tried to run it through PhantomJS following these instructions, but couldn’t run the script due to permission issues.
It seems that I need to perform a GET request for the data, but the URL I found on the site’s code returns nothing: https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx?_=1628291424652
I can’t use RSelenium as I’m working remotely through a headless server.

You need to pick up a session cookie (ASP.NET_SessionId) from the initial url. You could use session for this, for example:
library(rvest)
library(magrittr)
r <- session('https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx') %>%
session_jump_to('https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx')
tables <- r %>% read_html() %>% html_table()

Not able to select href of a tag with Xpath (rvest)

I am scraping https://ic.gc.ca/eic/site/bsf-osb.nsf/eng/h_br02281.html with the rvest package in R. I would like to get the hyperlink associated with the company name. That portion of the html code looks like this:
html
My code looks like this:
library(rvest)
library(dplyr)
url = "https://ic.gc.ca/eic/site/bsf-osb.nsf/eng/h_br02281.html"
ccaa = read_html(url)
links = ccaa %>%
html_nodes("body") %>%
xml_find_all("//td[1]//a[@href]") %>%
html_text()
But this is only returning the names of the firms/cases, not the links that they are associated with.
How can I get these links? The end goal of this is to put these links into a data frame (along with other information) which will be rendered in a Shiny data table. Then, when a user is interested in a particular insolvency case, they can click on the link to see more information.
I am somewhat new to R and to asking questions on Stack Overflow, so please let me know if you require more information.

Replace html_text with html_attr('href').
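A minimal sketch of that change, run against a small inline table rather than the live site. The table contents and the https://example.com/ base URL are made up for illustration; only the XPath and the html_attr("href") call come from the question and answer.

```r
library(rvest)

# Hypothetical table: company names in the first column, each wrapped in a link.
page <- minimal_html('
  <table>
    <tr><td><a href="/cases/alpha.html">Alpha Corp.</a></td><td>2021-01-05</td></tr>
    <tr><td><a href="/cases/beta.html">Beta Ltd.</a></td><td>2021-02-11</td></tr>
  </table>
')

# Select the anchors once, then read both the text and the href from them.
anchors <- page %>% html_elements(xpath = "//td[1]//a[@href]")

links <- data.frame(
  company = html_text2(anchors),
  # Relative hrefs are resolved against an assumed base URL.
  url = xml2::url_absolute(html_attr(anchors, "href"), "https://example.com/"),
  stringsAsFactors = FALSE
)
```

Keeping the name and the URL in the same data frame row makes it straightforward to render the link later, for example in a Shiny data table.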

Why is html_nodes() in R not giving me the desired output for this webpage?

I am looking to extract all the links for each episode on this webpage, however I appear to be having difficulty using html_nodes() where I haven't experienced such difficulty before. I am trying to iterate the code using "." such that all the attributes for the page are obtained with that CSS. This code is meant to give an output of all the attributes, but instead I get {xml_nodeset (0)}. I know what to do once I have all the attributes in order to obtain the links specifically out of them, but this step is proving a stumbling block for this website.
Here is the code I have begun in R:
episode_list_page_1 <- "https://jrelibrary.com/episode-list/"
episode_list_page_1 %>%
read_html() %>%
html_node("body") %>%
html_nodes(".type-text svelte-fugjkr first-mobile first-desktop") %>%
html_attrs()

rvest alone does not work here because the page uses JavaScript to insert another web page into an iframe, and that embedded page is what displays the information.
If you search the embedded script you will find a reference to this page: "https://datawrapper.dwcdn.net/eoqPA/66/", which redirects to "https://datawrapper.dwcdn.net/eoqPA/67/". This second page contains the data you are looking for as embedded JSON, which is rendered via JavaScript.
The links to the shows are extractable, and there is also a link to a Google doc that holds the full index.
Searching the second page for links turns up that Google doc:
library(rvest)
library(dplyr)
library(stringr)
page2 <- read_html("https://datawrapper.dwcdn.net/eoqPA/67/")
# find all of the links on the page:
str_extract_all(html_text(page2), 'https:.*?\\"')
# isolate the Google Docs links
print(str_extract_all(html_text(page2), 'https://docs.*?\\"'))
#[[1]]
#[1] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing"
#[2] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"
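The same idea can be tried offline. The sketch below uses a made-up script block (the __DATA__ variable and both URLs are hypothetical) and a slightly different pattern that stops before the closing quote, so the matches need no further cleanup.

```r
library(rvest)
library(stringr)

# Hypothetical page with URLs embedded in a JSON-like script block.
page <- minimal_html('
  <script>
    window.__DATA__ = {"source": "https://docs.example.com/sheet/abc/export?format=csv", "home": "https://example.com/"};
  </script>
')

# Read the raw script content; it is plain text as far as rvest is concerned.
script_text <- page %>% html_element("script") %>% html_text()

# Match each URL up to, but not including, the closing double quote.
urls <- str_extract_all(script_text, 'https://[^"]+')[[1]]

# Keep only the links whose host starts with "docs.".
doc_urls <- str_subset(urls, "^https://docs\\.")
```

If the embedded JSON escapes slashes or quotes, the pattern would need adjusting; parsing the JSON properly is the more robust route when the structure is known.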

Rvest is unable to find the node specified by css selector, how do I fix it?

I am scraping data from this website and for some reason, I'm unable to get the name of the seller, even though I use the exact node returned by SelectorGadget. I have, however, managed to get all the other data with Rvest.
I managed to scrape the seller's name with RSelenium but that takes too much time. Anyway, here's the link of the page I'm scraping:
https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946
Here's the code I've used
SellerName <-
read_html("https://kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946") %>%
html_nodes(".link-4200870613") %>%
html_text()

You can easily regex the seller name out of the response, as it is contained in a script tag (presumably the browser populates the page from there when it runs JavaScript, which rvest does not).
library(rvest)
library(magrittr)
library(stringr)
p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>% html_text()
seller_name <- str_match_all(p,'"sellerName":"(.*?)"')[[1]][,2][1]
print(seller_name)
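Regex: "sellerName":"(.*?)" matches the literal key, then lazily captures everything up to the next double quote; [,2] selects that capture group and [1] takes the first match.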
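For a quick offline check of the pattern, here is a small sketch against a made-up script block (the adId, sellerName and price values are hypothetical). It uses str_match() rather than str_match_all(), which is enough when only the first occurrence is wanted.

```r
library(rvest)
library(stringr)

# Hypothetical page where the seller name only exists inside a script tag.
page <- minimal_html(
  '<script>window.__data = {"adId":"1","sellerName":"Jane Doe","price":20};</script>'
)

page_text <- page %>% html_element("script") %>% html_text()

# Column 1 of the result is the full match, column 2 is the capture group.
m <- str_match(page_text, '"sellerName":"(.*?)"')
seller_name <- m[1, 2]

# When the key is absent, str_match() yields NA instead of raising an error.
no_match <- str_match("no seller here", '"sellerName":"(.*?)"')[1, 2]
```

Checking for NA before using the result guards against the site renaming or dropping the key.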

Scrape title attribute from CSS with rvest

I use rvest to scrape web data.
I have the following CSS code from a website:
<abbr class="intabbr" title="2.856.890">2,9M</abbr>
I scrape this data with
library(rvest)
library(dplyr)
n <- read_html("https://www.last.fm/de/music/Fang+Island")
n %>%
html_node("abbr") %>%
html_text()
This gives me "2M", but what I would like to get is the "2.856.890".
I am not very knowledgeable in CSS: is it possible to get the information I want by changing the expression in html_node()?
This post suggests that it is not possible, however this one suggests that it might be possible since it pops up as a tooltip on the page?

Use html_attr to get a tag's attribute:
n %>%
html_node("abbr") %>%
html_attr("title")
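A minimal offline sketch using the abbr element quoted in the question. The conversion step at the end is an addition, assuming the dots in the title are thousands separators.

```r
library(rvest)

# The element from the question, wrapped in a minimal document.
page <- minimal_html('<abbr class="intabbr" title="2.856.890">2,9M</abbr>')

node <- page %>% html_element("abbr.intabbr")

short_label <- html_text(node)          # visible, abbreviated text
full_value  <- html_attr(node, "title") # tooltip text stored in the attribute

# Assumption: dots are thousands separators, so strip them before converting.
listeners <- as.numeric(gsub(".", "", full_value, fixed = TRUE))
```

Selecting on the class as well as the tag ("abbr.intabbr") makes it less likely that an unrelated abbr earlier in the page is picked up.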
