Using the following documentation, I have been trying to scrape a series of tables from marketwatch.com.
Here is the one represented by the code below; the link and XPath are already included in the code:
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation <- url %>%
html() %>%
html_nodes(xpath='//*[#id="maincontent"]/div[2]/div[1]') %>%
html_table()
valuation <- valuation[[1]]
I get the following warning:
Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
Thanks in advance.
That website doesn't use an html table, so html_table() can't find anything. It actually uses div classes column and data lastcolumn.
So you can do something like
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation_col <- url %>%
read_html() %>%
html_nodes(xpath='//*[#class="column"]')
valuation_data <- url %>%
read_html() %>%
html_nodes(xpath='//*[#class="data lastcolumn"]')
Or even
url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="section"]')
To get you most of the way there.
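If the two node sets line up one to one, you can pair them into a data frame yourself. A minimal sketch, assuming the class names above still match the page and every label has a matching value:

library(rvest)

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
page <- read_html(url)

# pull the label and value text out of the two div classes identified above
labels <- page %>%
  html_nodes(xpath = '//*[@class="column"]') %>%
  html_text(trim = TRUE)
values <- page %>%
  html_nodes(xpath = '//*[@class="data lastcolumn"]') %>%
  html_text(trim = TRUE)

# only sensible if the two vectors are the same length - check first
valuation <- data.frame(label = labels, value = values, stringsAsFactors = FALSE)
head(valuation)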
Please also read their terms of use - particularly 3.4.
I am trying to extract data from the following website: https://2010-2014.kormany.hu/hu/hirek. When I try to extract, for example, the articles' links from that website using the following code, I get nothing.
library(rvest)
library(dplyr)
library(XML)
url <- 'https://2015-2019.kormany.hu/hu/hirek'
links <- read_html(url) %>%
  html_nodes("div") %>%
  html_nodes(xpath = '//*[@class="article"]') %>%
  html_nodes("h2") %>%
  html_nodes("a") %>%
  html_attr("href")
links
> character(0)
I don't even get anything if I run the following code:
links <- read_html(url) %>% html_nodes("div")
links
> character(0)
This is very strange since, when I inspect the website, it seems that I should be getting the list of URLs from the code I provided. According to the website's source (view-source:https://2015-2019.kormany.hu/hu/hirek), there are "div" nodes. Does anyone know what I could be doing wrong?
Today I retried my code and it works perfectly. I am not sure what was happening yesterday.
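For anyone hitting the same intermittent failure: check the HTTP status of the response before blaming the selectors, since a temporary 4xx/5xx or a redirect will also produce empty node sets. A small sketch:

library(httr)
library(rvest)

url <- 'https://2015-2019.kormany.hu/hu/hirek'
resp <- GET(url)
status_code(resp)               # anything other than 200 explains empty results
page <- read_html(resp)         # xml2 can read straight from an httr response
length(html_nodes(page, "div")) # should be well above zero on a normal load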
I've been trying to scrape a table from this page: https://ficeda.com.mx/index.php?id=precios
My code looks like this
url_data <- "https://ficeda.com.mx/index.php?id=precios"
url_data %>%
read_html() %>%
html_node(xpath = "/html/body/div[1]/div/div/div[3]/table[1]") %>%
html_table()
But it gives me this error:
Error in UseMethod("html_table") :
  no applicable method for 'html_table' applied to an object of class "xml_missing"
Does anyone know what might be going on?
Thanks!
That data is actually inside an iframe. Either make your initial request, extract the src link from the iframe, and then make a request there; or simply make a request directly to the iframe document:
library(rvest)

url_data <- "https://www.ficeda.com.mx/app_precios/pages/index/v2"
url_data %>%
  read_html() %>%
  html_node('.category_title + table') %>%
  html_table()
The iframe src from the original endpoint:
read_html('https://ficeda.com.mx/index.php?id=precios') %>%
  html_node('#linea2') %>%
  html_attr('src')
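For completeness, a sketch of the two-step version, using the #linea2 id from above; if the src comes back relative, url_absolute() can resolve it against the site root:

library(rvest)

# step 1: get the iframe's src from the outer page
iframe_src <- read_html('https://ficeda.com.mx/index.php?id=precios') %>%
  html_node('#linea2') %>%
  html_attr('src') %>%
  url_absolute(base = 'https://ficeda.com.mx/') # no-op if the src is already absolute

# step 2: request the iframe document itself and pull the table
prices <- read_html(iframe_src) %>%
  html_node('.category_title + table') %>%
  html_table()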
I'm playing with the rvest package and trying to figure out why it sometimes fails to scrape objects that definitely seem to be tables.
Consider for instance a script like this:
require(rvest)

url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="options"]/table/tbody/tr/td/table[2]/tbody') %>%
  html_table()
population
If I inspect population, it's an empty list:
> population
list()
Another example:
require(rvest)

url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="Col1-1-OptionContracts-Proxy"]/section/section[1]/div[2]') %>%
  html_table()
population
I was wondering if the use of PhantomJS is mandatory - as explained here - or if the problem is elsewhere.
Neither of your current XPaths actually selects just the table. In both cases you need to pass an html table node to html_table(), as under the hood there is:
html_table.xml_node(.) : html_name(x) == "table"
Also, long XPaths are too fragile, especially when applying a path that is valid for browser-rendered content to the HTML rvest returns, as JavaScript doesn't run with rvest. Personally, I prefer nice short CSS selectors. You can use the second-fastest selector type, class, and only need to specify a single class:
require(rvest)

url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node('.optionchain') %>%
  html_table()
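You can verify what a selector actually grabbed before calling html_table(); per the method shown above, the node must be named "table":

require(rvest)

node <- xml2::read_html("http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY") %>%
  html_node('.optionchain')
html_name(node) # should print "table"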
The table needs cleaning of course, due to "merged" cells in source, but you get the idea.
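A rough cleanup sketch (the exact layout of the merged rows is an assumption, so inspect the raw result first): promote the first row to header names and drop it.

# population is the data frame returned by html_table() above
tbl <- population
names(tbl) <- make.unique(as.character(unlist(tbl[1, ]))) # guard against duplicate header labels
tbl <- tbl[-1, ]
head(tbl)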
With xpath you could do:
require(rvest)

url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//table[2]') %>%
  html_table()
Note: I reduce the xpath and work with a single node which represents a table.
For your second:
Again, your XPath is not selecting a table element. The table's class attribute is multi-valued, but a single correctly chosen class will suffice in XPath, i.e. //*[contains(@class,"calls")]. Select a single table node.
require(rvest)

url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//*[contains(@class,"calls")]') %>%
  html_table()
Once again, my preference is for a css selector (less typing!)
require(rvest)

url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node('.calls') %>%
  html_table()
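The puts side presumably has a matching class; a sketch assuming a .puts class exists alongside .calls (check the page source to confirm):

require(rvest)

page  <- xml2::read_html("https://finance.yahoo.com/quote/SPY/options?straddle=false")
calls <- page %>% html_node('.calls') %>% html_table()
puts  <- page %>% html_node('.puts')  %>% html_table() # assumed class name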
I am trying to get the table located at the following URL:
https://wallethub.com/edu/most-innovative-states/31890/
I used the following code (with SelectorGadget to get the CSS), but it is not working. Is it a special case, or is there another approach I can use?
library(rvest)

page <- read_html("https://wallethub.com/edu/most-innovative-states/31890/")
table <- page %>%
  html_node(xpath = "/html/body/div[2]/main/div/div[1]/article/div[3]/div[1]/div[6]/table/thead/tr/th[3]") %>%
  html_table()
table
head(table)
Just selecting on table worked for me:
page %>%
  html_node("table") %>%
  html_table()
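If the first table on the page is not the one you want, a general fallback is to pull every table and pick by position:

library(rvest)

page <- read_html("https://wallethub.com/edu/most-innovative-states/31890/")
tables <- page %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
length(tables)    # how many tables rvest found
head(tables[[1]]) # inspect each until you find the right one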
I'm attempting to scrape political endorsement data from Wikipedia tables (a pretty generic scraping task), and the regular process of using rvest on the CSS path identified by SelectorGadget is failing.
The wiki page is here, and the CSS path .jquery-tablesorter:nth-child(11) td seems to select the right part of the page.
Armed with the css, I would normally just use rvest to directly access these data, as follows:
"https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012" %>%
html %>%
html_nodes(".jquery-tablesorter:nth-child(11) td")
but this returns:
list()
attr(,"class")
[1] "XMLNodeSet"
Do you have any ideas?
This might help. The jquery-tablesorter class is added by JavaScript after the page loads in a browser, so it never appears in the HTML rvest downloads; target the static wikitable class instead:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
tab <- URL %>% read_html %>%
html_node("table.wikitable:nth-child(11)") %>% html_table()
This code stores the table that you requested as a dataframe in the variable tab.
> View(tab)
I find that if I use the xpath suggestion from Chrome it works.
Chrome suggests an XPath of //*[@id="mw-content-text"]/table[4]
I can then run as follows
library(rvest)
URL <-"https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
tab <- URL %>%
read_html %>%
html_node(xpath='//*[#id="mw-content-text"]/table[4]') %>%
html_table()
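Since positional indexes like table[4] shift whenever the page is edited, a sturdier variant of the same idea is to grab every wikitable and pick the one you want by inspection:

library(rvest)

URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
tabs <- URL %>%
  read_html() %>%
  html_nodes("table.wikitable") %>%
  html_table(fill = TRUE)
length(tabs) # then inspect tabs[[i]] to find the endorsement table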