Can't scrape a table with rvest in R

I've been trying to scrape a table from this page: https://ficeda.com.mx/index.php?id=precios
My code looks like this:

library(rvest)

url_data <- "https://ficeda.com.mx/index.php?id=precios"

url_data %>%
  read_html() %>%
  html_node(xpath = "/html/body/div[1]/div/div/div[3]/table[1]") %>%
  html_table()
But it gives me this error:

Error in UseMethod("html_table") :
  no applicable method for 'html_table' applied to an object of class "xml_missing"
Does anyone know what might be going on?
Thanks!

That data is actually inside an iframe. Either make your initial request, extract the src link from the iframe, and then make a second request to that URL; or simply request the iframe document directly:
library(rvest)

url_data <- "https://www.ficeda.com.mx/app_precios/pages/index/v2"

url_data %>%
  read_html() %>%
  html_node('.category_title + table') %>%
  html_table()
The iframe src from the original endpoint:

read_html('https://ficeda.com.mx/index.php?id=precios') %>%
  html_node('#linea2') %>%
  html_attr('src')
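Putting the two steps together, a minimal sketch of the first approach (this assumes the iframe src comes back as an absolute URL; if it is relative you would need to prepend the domain):

library(rvest)

# grab the iframe's src attribute from the outer page
iframe_src <- read_html('https://ficeda.com.mx/index.php?id=precios') %>%
  html_node('#linea2') %>%
  html_attr('src')

# then request the iframe document itself and parse the table
iframe_src %>%
  read_html() %>%
  html_node('.category_title + table') %>%
  html_table()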

Related

Issue webscraping a website: not extracting anything

I am trying to extract data from the following website: 'https://2010-2014.kormany.hu/hu/hirek'. When I try to extract, for example, the articles' links from that website using the following, I get nothing.
library(rvest)
library(dplyr)
library(XML)
url <- 'www.2015-2019.kormany.hu/hu/hirek'
links <- read_html(url) %>%
  html_nodes("div") %>%
  html_nodes(xpath = '//*[@class="article"]') %>%
  html_nodes("h2") %>%
  html_nodes("a") %>%
  html_attr("href")
links
> character(0)
I don't even get anything if I run the following code:
links <- read_html(url) %>% html_nodes("div")
links
> character(0)
This is very strange since, when I inspect the website, it seems that I should be getting a list of URLs from the code above. According to the website's source, there are "div" nodes ('view-source:https://2015-2019.kormany.hu/hu/hirek'). Does anyone know what I could be doing wrong?
Today I re-tried my code and it works perfectly. I am not sure what was happening yesterday.
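For reference, a runnable sketch of the same scrape (this assumes the URL needs its https:// scheme spelled out, which the snippet above omits; the selectors are unchanged from the question):

library(rvest)
library(dplyr)

url <- 'https://2015-2019.kormany.hu/hu/hirek'
links <- read_html(url) %>%
  html_nodes(xpath = '//*[@class="article"]') %>%
  html_nodes("h2") %>%
  html_nodes("a") %>%
  html_attr("href")
links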

How to fix following error in R 'Error in UseMethod("xml_find_all")' while web scraping with rvest?

I am new to R and am currently working on an assignment dealing with web scraping.
I am supposed to read in all the sentences from this web page: https://www.cs.columbia.edu/~hgs/audio/harvard.html
This is my current code:
library(xml2)
library(rvest)
url <- 'https://www.cs.columbia.edu/~hgs/audio/harvard.html'
read_html(url)
sentences <- url %>%
  html_nodes("li") %>%
  html_text()
And every time I run it, I get this error:
Error in UseMethod("xml_find_all") : no applicable method for
'xml_find_all' applied to an object of class "character"
Can you please help me? I don't understand what I'm doing wrong.
You forgot to assign the result of read_html(url) to a variable (I imagine it was intended to replace url). So url %>% html_nodes("li") is piping a "string" instead of an "xml_document", which is what the error is telling you (internally, rvest::html_nodes calls the function xml2::xml_find_all).
You could do this:
html <- read_html(url)
sentences <- html %>%
  html_nodes("li") %>%
  html_text()
Or this, if you are reading url only once:
sentences <- read_html(url) %>%
  html_nodes("li") %>%
  html_text()
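A quick way to see the mismatch is to compare the classes involved:

class(url)             # "character": just a string, hence the UseMethod error
class(read_html(url))  # "xml_document" "xml_node": what html_nodes() expects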

What makes table web scraping with rvest package sometimes fail?

I'm playing with the rvest package and trying to figure out why it sometimes fails to scrape objects that definitely seem to be tables.
Consider for instance a script like this:
require(rvest)

url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="options"]/table/tbody/tr/td/table[2]/tbody') %>%
  html_table()
population
If I inspect population, it's an empty list:
> population
list()
Another example:
require(rvest)

url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="Col1-1-OptionContracts-Proxy"]/section/section[1]/div[2]') %>%
  html_table()
population
I was wondering if the use of PhantomJS is mandatory - as explained here - or if the problem is elsewhere.
Neither of your current xpaths actually selects just the table. In both cases I think you need to pass an html table element to html_table(), as under the hood there is:
html_table.xml_node(.) : html_name(x) == "table"
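As a quick sanity check you can confirm that what a selector matched really is a table before calling html_table(); html_name() returns the element's tag name:

require(rvest)

url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
node <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//table[2]')
html_name(node)  # "table", so html_table(node) will parse it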
Also, long xpaths are too fragile, especially when you apply a path that is valid for the browser-rendered page to the html that rvest actually receives, since javascript doesn't run with rvest. Personally, I prefer nice short CSS selectors. You can use the second fastest selector type, class, and you only need to specify a single class:
require(rvest)

url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node('.optionchain') %>%
  html_table()
The table needs cleaning of course, due to "merged" cells in source, but you get the idea.
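A minimal cleaning sketch (this assumes the real headers ended up in the first row of the parsed data frame; the live layout may differ):

tbl <- population
names(tbl) <- as.character(unlist(tbl[1, ]))  # promote the first row to column names
tbl <- tbl[-1, ]                              # drop the now-duplicated header row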
With xpath you could do:
require(rvest)

url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//table[2]') %>%
  html_table()
Note: I reduce the xpath and work with a single node which represents a table.
For your second:
Again, your xpath is not selecting a table element. The table's class attribute is multi-valued, but a single correctly chosen class will suffice in xpath, i.e. //*[contains(@class,"calls")]. Select a single table node.
require(rvest)

url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//*[contains(@class,"calls")]') %>%
  html_table()
Once again, my preference is for a css selector (less typing!):
require(rvest)

url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node('.calls') %>%
  html_table()

How to use Rvest to scrape data

I am trying to get the address from this site https://www.uchealth.com/our-locations/#hospitals
I tried:
html_nodes(xpath = "//*[@id='uch_location_results']/div[1]/div/div[2]/address") %>%
  html_text()
Any suggestions on what I am doing wrong?
If you use the network tab you will find a source URL for the addresses:
library(rvest)

r <- read_html('https://www.uchealth.com/wp-content/themes/uchealth-2016-interim/ajax/location_search.php?region=hospitals') %>%
  html_nodes('address') %>%
  html_text()
The names of the hospitals are available with the css selector h3.
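For example, a sketch that pairs names with addresses (this assumes the h3 headings and address blocks align one to one on that endpoint):

library(rvest)

page <- read_html('https://www.uchealth.com/wp-content/themes/uchealth-2016-interim/ajax/location_search.php?region=hospitals')
hospitals <- data.frame(
  name    = page %>% html_nodes('h3') %>% html_text(trim = TRUE),
  address = page %>% html_nodes('address') %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)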

How to scrape a table with rvest and xpath?

Using the following documentation, I have been trying to scrape a series of tables from marketwatch.com. Here is the one represented by the code below:
The link and xpath are already included in the code:
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"

valuation <- url %>%
  html() %>%
  html_nodes(xpath = '//*[@id="maincontent"]/div[2]/div[1]') %>%
  html_table()

valuation <- valuation[[1]]
I get the following warning:
Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
Thanks in advance.
That website doesn't use an html table, so html_table() can't find anything. It actually uses div classes column and data lastcolumn.
So you can do something like:

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"

valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="column"]')

valuation_data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="data lastcolumn"]')
Or even:

url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="section"]')
To get you most of the way there.
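A sketch of how you might pair the two node sets into a data frame (this assumes the labels and values align one to one):

valuation <- data.frame(
  field = valuation_col %>% html_text(trim = TRUE),
  value = valuation_data %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)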
Please also read their terms of use - particularly 3.4.
