How can I extract folding information of HTML using the rvest package? - r

I am trying to scrape data from this website [link].
On the page, I find a section of collapsed, hidden information.
I tried this:
library(rvest)
library(dplyr)
url <- "https://carro.mercadolibre.com.co/MCO-611624087-chevrolet-camaro-2017-62-ss-_JM#position=16&type=item&tracking_id=f0c0ddc3-84a0-46ce-8545-5df59fe50a63"
session(url) %>%
  html_node(xpath = '//*[@id="root-app"]/div/div[3]/div/div[1]/div[2]/div[1]/div/div[2]/div[1]') %>%
  html_text2()
But the code doesn't capture all of the information:
[1] "Frenos ABS: Sí\n\nAirbag para conductor y pasajero: Sí\n\nPotencia: 455 hp"
If I click on the collapsed section, the rest of the information is shown.
Another way to extract the information is to use the div class "ui-pdp-specs-groups":
session(url) %>%
  html_node(".ui-pdp-specs-groups-collapsable.ui-pdp-specs") %>%
  html_text2()
[1] "Items del vehículo\n\nFrenos ABS: Sí\n\nAirbag para conductor y pasajero: Sí\n\nPotencia: 455 hp\n\nVer más características"
How can I extract the missing information from the website?

It is pulled in dynamically from a script tag. You can use a regex on the page source as a string (not parsed as HTML) to pull out the relevant info.
In this case the pattern used returns all the technical specifications plus some other page info. I parse it into a JSON object with jsonlite, then extract the technical specifications, and finally print the section containing the data you want.
There is a little work left to do to parse out just the values as shown on screen, separating them from the page-placement instructions that are carried alongside them for when the website renders the page (see the sketch after the code below):
R:
library(rvest)
library(stringr)
library(dplyr)
library(jsonlite)

# Read the page and keep it as a string so the regex can search the script contents
page <- read_html('https://carro.mercadolibre.com.co/MCO-611624087-chevrolet-camaro-2017-62-ss-_JM#position=16&type=item&tracking_id=f0c0ddc3-84a0-46ce-8545-5df59fe50a63') %>% toString()

# Capture the JSON assigned to window.__PRELOADED_STATE__
res <- page %>% stringr::str_match("window\\.__PRELOADED_STATE__ = (.*?);\n") %>% .[2]

# Parse the JSON and drill down to the technical specifications
data <- jsonlite::parse_json(res)
technical_spec <- data$initialState$components$technical_specifications
all_specs <- technical_spec$specs
print(all_specs[3])
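If you only want the human-readable values, one option is to walk the nested list and keep just the text leaves. This is a minimal sketch, assuming the display values live in fields named "text" (inspect all_specs to confirm the actual structure):
# Minimal sketch: recursively collect every character field named "text",
# dropping the layout/placement metadata carried alongside the values.
# The "text" field name is an assumption about the JSON structure.
get_texts <- function(x) {
  if (!is.list(x)) return(character())
  found <- if (is.character(x[["text"]])) x[["text"]] else character()
  c(found, unlist(lapply(x, get_texts), use.names = FALSE))
}
get_texts(all_specs)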

Related

How do I extract certain html nodes using rvest?

I'm new to web scraping, so I may not be doing all the proper checks here. I'm attempting to scrape information from a URL, but I'm not able to extract the nodes I need. See the sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen), which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which you can confirm by disabling JavaScript in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then Ctrl+F to search for where the product name exists within the non-rendered content.
The title info you want is present in lots of places, and there are numerous ways to obtain it. I will offer a 'does what it says on the tin' option, as the code gives clear indications of what is being selected.
I've updated the syntax and reduced the number of imported external dependencies.
library(magrittr)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)

# The product name is carried in a product-name attribute
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")

# The spec tables have the class "sprocket__table spec"
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()

# The price sits in the value attribute of #gtm-product-display-price
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
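Since the title lives in several places, another hedged option is the JSON-LD script tag that many product pages embed; whether this particular page exposes a "name" field there is an assumption to verify:
# Hedged sketch: product pages often embed structured data as JSON-LD.
# The selector is standard; the presence and shape of the "name" field
# on this particular page is an assumption.
library(jsonlite)
ld <- page %>%
  html_element('script[type="application/ld+json"]') %>%
  html_text() %>%
  jsonlite::parse_json()
ld$name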

Confusion Regarding HTML Code For Web Scraping With R

I am struggling with the rvest package in R, most likely due to my lack of knowledge about CSS or HTML. Here is an example (my guess is that ".quote-header-info" is what is wrong; I also tried ".Trsdu ..." but had no luck either):
library(rvest)
url="https://finance.yahoo.com/quote/SPY"
website=read_html(url) %>%
html_nodes(".quote-header-info") %>%
html_text() %>% toString()
website
Below is the webpage I am trying to scrape; specifically, I am looking to grab the value "416.74". I took a peek at the documentation (https://cran.r-project.org/web/packages/rvest/rvest.pdf), but I think the issue is that I don't understand the breakdown of the webpage I am looking at.
The tricky part is determining the correct set of attributes to select only this one HTML node.
In this case, it is the span tag with the classes Trsdu(0.3s) and Fz(36px):
library(rvest)

url <- "https://finance.yahoo.com/quote/SPY"

# read the page once
page <- read_html(url)

# now extract information from the page
price <- page %>%
  html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
  html_text()
price
Note: "(", ")", and "." are all special characters thus the need to double escape "\\" them.
Those classes are dynamic and change much more frequently than other parts of the HTML, so they should be avoided. You have at least two more robust options:
1. Extract the JavaScript object housing that data (plus a lot more) from a script tag, then parse it with jsonlite.
2. Use positional matching against other, more stable, HTML elements.
I show both below. The advantage of the first is that you can extract lots of other page data from the generated JSON object.
library(magrittr)
library(rvest)
library(stringr)
library(jsonlite)

page <- read_html('https://finance.yahoo.com/quote/SPY')

# Option 1: capture the JSON assigned to root.App.main inside a script tag
data <- page %>%
  toString() %>%
  stringr::str_match('root\\.App\\.main = (.*?[\\s\\S]+)(?=;[\\s\\S]+\\(th)') %>% .[2]
json <- jsonlite::parse_json(data)
print(json$context$dispatcher$stores$StreamDataStore$quoteData$SPY$regularMarketPrice$raw)

# Option 2: positional matching against more stable elements
print(page %>% html_node('#quote-header-info div:nth-of-type(2) ~ div div:nth-child(1) span') %>% html_text() %>% as.numeric())
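To reuse the JSON route for other tickers, you could wrap it in a helper. A sketch, assuming the root.App.main pattern and the JSON path stay as above (both are liable to change):
# Hedged helper: parameterise the JSON route by ticker symbol.
# Pattern and JSON path are copied from above and may break if the
# page structure changes.
get_price <- function(symbol) {
  page <- read_html(paste0('https://finance.yahoo.com/quote/', symbol))
  raw_json <- page %>%
    toString() %>%
    stringr::str_match('root\\.App\\.main = (.*?[\\s\\S]+)(?=;[\\s\\S]+\\(th)') %>%
    .[2]
  json <- jsonlite::parse_json(raw_json)
  json$context$dispatcher$stores$StreamDataStore$quoteData[[symbol]]$regularMarketPrice$raw
}
get_price('SPY')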

How to scrape NBA data?

I want to compare rookies across leagues with stats like points per game (PPG) and such. ESPN and NBA have great tables to scrape from (as does Basketball Reference), but I just found out that the tables aren't stored in the HTML, so I can't use rvest. For context, I'm trying to scrape tables like this one (from NBA):
https://i.stack.imgur.com/SdKjE.png
I'm trying to learn how to use httr and jsonlite for this, but I'm running into some issues. I followed the answer in this post, but it's not working out for me.
This is what I've tried:
library(httr)
library(jsonlite)

coby.white <- GET('https://www.nba.com/players/coby/white/1629632')
out <- content(coby.white, as = "text") %>%
  fromJSON(flatten = FALSE)
However, I get an error:
Error: lexical error: invalid char in json text.
<!DOCTYPE html><html class="" l
(right here) ------^
Is there an easier way to scrape a table from ESPN or NBA, or is there a solution to this issue?
PPG and other stats come from:
https://data.nba.net/prod/v1/2019/players/1629632_profile.json
and player info (e.g. weight, height) from:
https://www.nba.com/players/active_players.json
So, you could use jsonlite to parse them, e.g.:
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
You can find these in the network tab when refreshing the page. It looks like you can use the player id in the URL to get different players' info for the season, as sketched below.
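Putting that together, a small helper can fetch any player's profile by id. A sketch, assuming the URL pattern above holds for other seasons and ids:
# Hedged sketch: build the profile URL from season and player id.
# The URL pattern comes from the endpoint above; coverage across
# seasons and players is an assumption.
library(jsonlite)
get_player_profile <- function(player_id, season = "2019") {
  url <- sprintf("https://data.nba.net/prod/v1/%s/players/%s_profile.json",
                 season, player_id)
  jsonlite::read_json(url)
}
profile <- get_player_profile("1629632")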
You actually can web scrape this with rvest; here's an example of scraping White's totals table from Basketball Reference. Anything on Sports Reference's sites that is not the first table on the page is stored inside an HTML comment, meaning we must extract the comment nodes first and then extract the desired data table.
library(rvest)
library(dplyr)

cobywhite <- 'https://www.basketball-reference.com/players/w/whiteco01.html'
totalsdf <- cobywhite %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node("#totals") %>%
  html_table()
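The same comment trick should work for the page's other commented-out tables; swap in the id of the table you want. The "#advanced" id below is an assumption, so check the page source for the exact id:
# Hedged variant: same comment-node extraction, different table id.
# "#advanced" is assumed - verify against the page source.
advanceddf <- cobywhite %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node("#advanced") %>%
  html_table()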

read_html not retrieving all data from simple html page, instead returning incomplete html?

read_html() usually returns all the page html for a given url.
But when I try it on this URL, I can see that not all of the page is returned.
Why is this (and more importantly, how do I fix it)?
Reproducible example
library(rvest)

page_html <- "https://raw.githubusercontent.com/mjaniec2013/ExecutionTime/master/ExecutionTime.R" %>%
  read_html()

page_html %>% html_text() %>% cat()
# We can see not all the page html has been retrieved

# And just to be sure
page_html %>% as.character()
Notes
It looks like GitHub is okay with bots visiting, so I don't think this is a GitHub issue.
I tried the same scrape with Ruby's Nokogiri library; it gives exactly the same result as read_html, so it doesn't look like something specific to R or read_html().
This looks like the parser is treating the assignment operator (<-) in the page as an unclosed tag:
fakepage <- "<html>the text after <- will be lost</html>"

read_html(fakepage) %>%
  html_text()
[1] "the text after "
As the page you're after is a plain text file, you can use readr::read_file() in this instance.
readr::read_file("https://raw.githubusercontent.com/mjaniec2013/ExecutionTime/master/ExecutionTime.R")
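A base R alternative, if you'd rather avoid the readr dependency, is readLines(); this sketch re-joins the lines into a single string:
# Base R sketch: readLines() fetches the raw text line by line
url <- "https://raw.githubusercontent.com/mjaniec2013/ExecutionTime/master/ExecutionTime.R"
txt <- paste(readLines(url), collapse = "\n")
cat(txt)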

R Web scrape - Error

Okay, so I am stuck on what would seem to be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund's name based on the entered URL. Here is an example of my code:
library(rvest)

url <- html("http://www.morningstar.com/funds/xnas/fbalx/quote.html")
url %>%
  read_html() %>%
  html_node('r_title')
I would expect it to return the name Fidelity Balanced Fund, but instead I get the following error: {xml_missing}.
Suggestions?
Aaron
Edit:
I also tried scraping via an XHR request, but I think my issue is not knowing which CSS selector or XPath to use to find the appropriate data.
XHR code:
get.morningstar.Table1 <- function(Symbol.i, htmlnode){
  try(res <- GET(url = "http://quotes.morningstar.com/fundq/c-header",
                 query = list(
                   t = Symbol.i,
                   region = "usa",
                   culture = "en-US",
                   version = "RET",
                   test = "QuoteiFrame"
                 )))
  tryCatch(x <- content(res) %>%
             html_nodes(htmlnode) %>%
             html_text() %>%
             trimws(),
           error = function(e) x <- NA)
  return(x)
} # The HTML node in this case is a vkey
Still, the same question stands: am I using the correct CSS selector/XPath to look things up? The XHR code works great for requests that have a clear CSS selector.
OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.
I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:
library(rvest)

url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'
page <- read_html(url)

# The <title> tag is in the static html even though it's hidden on the rendered page
title <- page %>%
  html_node('title') %>%
  html_text()

# Strip the ticker symbol and trailing junk from the title
symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, '\\1', title)
As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:
mypage %>%
  html_node('.myClass')
Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.
A final note: other sites contain the same info and are easier to scrape, like Yahoo Finance. For example:
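Yahoo Finance serves the fund's long name in the static page title. A sketch, assuming the current "Name (TICKER) ..." title format:
# Hedged sketch: parse the fund name out of Yahoo Finance's <title>.
# The "Name (TICKER) ..." title format is an assumption and may change.
library(rvest)
yurl <- "https://finance.yahoo.com/quote/FBALX"
ytitle <- read_html(yurl) %>%
  html_node("title") %>%
  html_text()
fund_name <- sub(" \\(FBALX\\).*", "", ytitle)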
