Confusion Regarding HTML Code For Web Scraping With R

I am struggling with the rvest package in R, most likely due to my lack of knowledge of CSS and HTML. Here is an example (my guess is that ".quote-header-info" is what is wrong; I also tried ".Trsdu ..." but with no luck either):
library(rvest)
url <- "https://finance.yahoo.com/quote/SPY"
website <- read_html(url) %>%
  html_nodes(".quote-header-info") %>%
  html_text() %>%
  toString()
website
Below is the webpage I am trying to scrape; specifically, I am looking to grab the value "416.74". I took a peek at the documentation (https://cran.r-project.org/web/packages/rvest/rvest.pdf), but I think the issue is that I don't understand the structure of the webpage I am looking at.

The tricky part is determining the correct set of attributes to select only this one html node.
In this case it is the span tag with the classes Trsdu(0.3s) and Fz(36px):
library(rvest)
url <- "https://finance.yahoo.com/quote/SPY"
# read the page once
page <- read_html(url)
# now extract information from the page
price <- page %>%
  html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
  html_text()
price
Note: "(", ")", and "." are all special characters thus the need to double escape "\\" them.

Those classes are dynamic and change much more frequently than other parts of the html, so they should be avoided. You have at least two more robust options:
1. Extract the javascript object housing that data (plus a lot more) from a script tag, then parse it with jsonlite.
2. Use positional matching against other, more stable, html elements.
I show both below. The advantage of the first is that you can extract lots of other page data from the json object generated.
library(magrittr)
library(rvest)
library(stringr)
library(jsonlite)
page <- read_html('https://finance.yahoo.com/quote/SPY')
# option 1: pull the json assigned to root.App.main out of a script tag
data <- page %>%
  toString() %>%
  stringr::str_match('root\\.App\\.main = (.*?[\\s\\S]+)(?=;[\\s\\S]+\\(th)') %>%
  .[2]
json <- jsonlite::parse_json(data)
print(json$context$dispatcher$stores$StreamDataStore$quoteData$SPY$regularMarketPrice$raw)
# option 2: positional matching against more stable html elements
print(page %>% html_node('#quote-header-info div:nth-of-type(2) ~ div div:nth-child(1) span') %>% html_text() %>% as.numeric())
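If you want other page data, you can explore the parsed object before drilling down; a quick sketch (the exact store names are whatever Yahoo happens to serve at the time):
# list the top-level stores available in the json object
names(json$context$dispatcher$stores)
# then inspect any store of interest, one level deep
str(json$context$dispatcher$stores$StreamDataStore$quoteData$SPY, max.level = 1)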

Related

How can I extract folded information from html using the rvest package?

I am trying to scrape data from this website [link].
On the page, there is a piece of information hidden in a collapsed ("folding") section.
I tried this:
library(rvest)
library(dplyr)
url <- "https://carro.mercadolibre.com.co/MCO-611624087-chevrolet-camaro-2017-62-ss-_JM#position=16&type=item&tracking_id=f0c0ddc3-84a0-46ce-8545-5df59fe50a63"
session(url) %>%
  html_node(xpath = '//*[@id="root-app"]/div/div[3]/div/div[1]/div[2]/div[1]/div/div[2]/div[1]') %>%
  html_text2()
But the code doesn't catch all the information:
[1] "Frenos ABS: Sí\n\nAirbag para conductor y pasajero: Sí\n\nPotencia: 455 hp"
If I click on the folding section, the hidden information is shown on the page.
Another way to extract the information is using the div class "ui-pdp-specs-groups":
session(url) %>%
  html_node(".ui-pdp-specs-groups-collapsable.ui-pdp-specs") %>%
  html_text2()
[1] "Items del vehículo\n\nFrenos ABS: Sí\n\nAirbag para conductor y pasajero: Sí\n\nPotencia: 455 hp\n\nVer más características"
How can I extract the missing information from the website?
It is pulled dynamically from a script tag. You can use regex on the page source as a string (not parsed as html) to pull out the relevant info.
In this case the pattern used returns all the technical specifications plus some other page info. I parse it into a json object with jsonlite, then extract the technical specifications, and finally print the section containing the data you want.
There is a little work left to do to separate the values shown on screen from the page-placement instructions that are carried alongside them for when the website renders; see the sketch after the code.
R:
library(rvest)
library(stringr)
library(dplyr)
library(jsonlite)
# read the page source and keep it as a string rather than parsed html
page <- read_html('https://carro.mercadolibre.com.co/MCO-611624087-chevrolet-camaro-2017-62-ss-_JM#position=16&type=item&tracking_id=f0c0ddc3-84a0-46ce-8545-5df59fe50a63') %>% toString()
# pull out the json assigned to window.__PRELOADED_STATE__
res <- page %>% stringr::str_match("window\\.__PRELOADED_STATE__ = (.*?);\n") %>% .[2]
data <- jsonlite::parse_json(res)
technical_spec <- data$initialState$components$technical_specifications
all_specs <- technical_spec$specs
print(all_specs[3])
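To separate the on-screen values from the placement instructions, one rough approach is to flatten a spec group and keep only the human-readable text fields; a sketch, assuming the display strings sit under elements named "text" (inspect the structure first to confirm):
# inspect the nesting to see where the display strings live
str(all_specs[[3]], max.level = 2)
# flatten the nested list; names become dotted paths like "...label.text"
flat <- unlist(all_specs[[3]])
# keep only the entries whose path ends in "text"
flat[grepl("text$", names(flat))]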

What makes table web scraping with the rvest package sometimes fail?

I'm playing with the rvest package and trying to figure out why it sometimes fails to scrape objects that definitely seem to be tables.
Consider for instance a script like this:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="options"]/table/tbody/tr/td/table[2]/tbody') %>%
  html_table()
population
If I inspect population, it's an empty list:
> population
list()
Another example:
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="Col1-1-OptionContracts-Proxy"]/section/section[1]/div[2]') %>%
  html_table()
population
I was wondering if the use of PhantomJS is mandatory - as explained here - or if the problem is elsewhere.
Neither of your current xpaths actually selects just the table. In both cases you need to pass an html table node to html_table, as under the hood it checks:
html_table.xml_node(.): html_name(x) == "table"
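You can check what a given xpath actually selected by asking for the node's tag name; a minimal sketch against the first page (if the xpath matches nothing at all, html_name returns NA):
require(rvest)
page <- xml2::read_html("http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY")
# html_table only works on <table> nodes; devtools paths often include a
# <tbody> the server never sent, so this may select nothing or a non-table
page %>%
  html_node(xpath = '//*[@id="options"]/table/tbody/tr/td/table[2]/tbody') %>%
  html_name()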
Also, long xpaths are too fragile, especially when applying a path that is valid for browser-rendered content to the html rvest returns, as javascript doesn't run with rvest. Personally, I prefer nice short CSS selectors. You can use class, the second-fastest selector type, and only need to specify a single class:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node('.optionchain') %>%
  html_table()
The table needs cleaning, of course, due to "merged" cells in the source, but you get the idea; a rough cleanup sketch follows.
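For instance, assuming (as is typical for this layout) that the real column headers sit in the first row; the exact indices depend on the current page:
# promote the first row to column names and drop it from the data
names(population) <- as.character(unlist(population[1, ]))
population <- population[-1, ]
head(population)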
With xpath you could do:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//table[2]') %>%
  html_table()
Note: I shortened the xpath and work with a single node that represents a table.
For your second:
Again, your xpath is not selecting a table element. The table's class is multi-valued, but a single correctly chosen class will suffice in xpath, i.e. //*[contains(@class,"calls")]. Select a single table node.
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//*[contains(@class,"calls")]') %>%
  html_table()
Once again, my preference is for a css selector (less typing!)
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node('.calls') %>%
  html_table()
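The puts table on the same page carries a puts class, so the same idea extends to grabbing both tables at once; a sketch, assuming those class names are still in use:
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
# html_nodes + html_table returns one data frame per matched table
tables <- url %>%
  xml2::read_html() %>%
  html_nodes('.calls, .puts') %>%
  html_table()
calls <- tables[[1]]
puts <- tables[[2]]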

Scrape title attribute from CSS with rvest

I use rvest to scrape web data.
I have the following HTML from a website:
<abbr class="intabbr" title="2.856.890">2,9M</abbr>
I scrape this data with
library(rvest)
library(dplyr)
n <- read_html("https://www.last.fm/de/music/Fang+Island")
n %>%
html_node("abbr") %>%
html_text()
This gives me "2,9M", but what I would like to get is "2.856.890".
I am not very knowledgeable in CSS: is it possible to get the information I want by changing the expression in html_node()?
This post suggests that it is not possible; however, this one suggests that it might be possible, since it pops up as a tooltip on the page.
Use html_attr to get a tag's attribute:
n %>%
  html_node("abbr") %>%
  html_attr("title")
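Since the title value uses dots as thousands separators, turning it into a number only needs the dots stripped first; for example:
n %>%
  html_node("abbr") %>%
  html_attr("title") %>%
  gsub(".", "", ., fixed = TRUE) %>%  # "2.856.890" -> "2856890"
  as.numeric()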

Excluding Nodes in rvest

I am scraping blog text using rvest and am struggling to figure out a simple way to exclude specific nodes. The following pulls the text:
AllandSundry_test <- read_html("http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/")
testpost <- AllandSundry_test %>%
  html_node("#contentmiddle") %>%
  html_text() %>%
  as.character()
I want to exclude the two nodes with IDs "contenttitle" and "commentblock". Below, I try excluding just the comments using the ID "commentblock".
testpost <- AllandSundry_test %>%
  html_node("#contentmiddle") %>%
  html_node(":not(#commentblock)") %>%
  html_text() %>%
  as.character()
When I run this, the result is simply the date -- all the rest of the text is gone. Any suggestions?
I have spent a lot of time searching for an answer, but I am new to R (and html), so I appreciate your patience if this is something obvious.
You were almost there. You should use html_nodes instead of html_node.
html_node retrieves the first element it encounters, while html_nodes returns every matching element in the page as a list.
The toString() function collapses the list of strings into one.
library(rvest)
AllandSundry_test <- read_html("http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/")
testpost <- AllandSundry_test %>%
  html_nodes("#contentmiddle>:not(#commentblock)") %>%
  html_text() %>%
  as.character() %>%
  toString()
testpost
#> [1] "\n\t\tMar\n\t\t3\n\t, Mar, 3, \n\t\tLet's go back to
#> commenting on the weather\n\t\t\n\t\t, Let's go back to commenting on
#> the weather, Let's go back to commenting on the weather, I have just
#> returned from the grocery store, and I need to get something off my chest.
#> When did "Got any big plans for the rest of the day?" become
#> the default small ...<truncated>
You still need to clean up the string a bit.
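A minimal cleanup pass might look like this (a sketch; how far to go depends on what you need downstream):
library(magrittr)
cleaned <- testpost %>%
  gsub("[\r\n\t]+", " ", .) %>%  # collapse tabs and newlines into spaces
  gsub(" {2,}", " ", .) %>%      # squeeze repeated spaces
  trimws()
cleaned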
It certainly looks like GGamba solved it for you; however, on my machine, I had to remove the > after #contentmiddle. Therefore, this section was instead:
html_nodes("#contentmiddle:not(#commentblock)")
Best of luck!
Jesse

rvest + selector gadget returns empty list

I'm attempting to scrape political endorsement data from wikipedia tables (a pretty generic scraping task) and the regular process of using rvest on the css path identified by selector gadget is failing.
The wiki page is here, and the css path .jquery-tablesorter:nth-child(11) td seems to select the right part of the page.
Armed with the css, I would normally just use rvest to directly access these data, as follows:
"https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012" %>%
html %>%
html_nodes(".jquery-tablesorter:nth-child(11) td")
but this returns:
list()
attr(,"class")
[1] "XMLNodeSet"
Do you have any ideas?
This might help:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
tab <- URL %>%
  read_html() %>%
  html_node("table.wikitable:nth-child(11)") %>%
  html_table()
This code stores the table that you requested as a dataframe in the variable tab.
> View(tab)
I find that if I use the xpath suggested by Chrome, it works.
Chrome suggests an xpath of //*[@id="mw-content-text"]/table[4].
I can then run it as follows:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
tab <- URL %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[4]') %>%
  html_table()
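A variant that avoids counting child positions at all is to pull every wikitable on the page and index the resulting list in R; a sketch (the table's position in the list can still shift as the article is edited):
library(rvest)
URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
# one data frame per wikitable; fill = TRUE pads rows with merged cells
all_tabs <- URL %>%
  read_html() %>%
  html_nodes("table.wikitable") %>%
  html_table(fill = TRUE)
tab <- all_tabs[[4]]  # pick the endorsement table by inspection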
