How to use Rvest to scrape data - r

I am trying to get the address from this site https://www.uchealth.com/our-locations/#hospitals
I tried:
read_html("https://www.uchealth.com/our-locations/#hospitals") %>%
  html_nodes(xpath = "//*[@id='uch_location_results']/div[1]/div/div[2]/address") %>%
  html_text()
Any suggestions on what I am doing wrong?

If you use the browser's network tab, you will find the source URL that supplies the addresses:
library(rvest)
r <- read_html('https://www.uchealth.com/wp-content/themes/uchealth-2016-interim/ajax/location_search.php?region=hospitals') %>%
  html_nodes('address') %>%
  html_text()
The names of the hospitals are available with the following CSS selector:
h3
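If you want the names and addresses together, a minimal sketch (assuming each h3 corresponds to exactly one address node, in the same order) could be:
library(rvest)

page <- read_html('https://www.uchealth.com/wp-content/themes/uchealth-2016-interim/ajax/location_search.php?region=hospitals')

# Pair each hospital name with its address; assumes one h3 per address,
# in the same order - check that the two lengths match before trusting it
hospitals <- data.frame(
  name    = page %>% html_nodes('h3') %>% html_text(trim = TRUE),
  address = page %>% html_nodes('address') %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)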

Related

Obtaining "Character(0)" error when using rvest to get Google results headlines

Sorry if my question is simple or badly asked; I am very new to web scraping with R.
I am trying to scrape the headlines from a Google search. Sorry if this is exactly the same request as the one asked in the link below; however, it does not work for me (it still returns "character(0)").
Character(0) error when using rvest to webscrape Google search results
Here are the two scripts I tried, based on the answers provided in the link above:
#Script 1
library(rvest)
library(dplyr)
web1 <- read_html("https://www.google.at/search?q=munich+prices")
web1 %>%
  html_nodes(xpath = '//div/div/div/a/div[not(div)]') %>%
  html_text()
#Script 2
library(rvest)
library(dplyr)
web1 <- read_html("https://www.google.at/search?q=munich+prices")
web1 %>%
  html_nodes(xpath = '//div/div/div/a/h3/div[not(div)]') %>%
  html_text()
The two scripts still return "character(0)" for me.
Does anyone have an idea?
Thank you for your help.
Victor
As requested, here is a screenshot of the code I ran (it is the same as Script 2 above).
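A diagnostic worth trying (not from the original thread): rvest does not execute JavaScript, and Google often returns much simpler markup to non-browser clients than what you see in the browser's inspector, so the first step is to look at what was actually fetched. A small sketch:
library(rvest)

web1 <- read_html("https://www.google.at/search?q=munich+prices")

# Check which heading nodes are actually present before guessing a deep XPath
web1 %>% html_nodes("h3") %>% html_text()

# Or write the fetched document to disk and inspect it in a text editor
xml2::write_html(web1, "google_result.html")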

Issue webscraping a website: not extracting anything

I am trying to extract data from the following website: 'https://2010-2014.kormany.hu/hu/hirek'. When I try to extract, for example, the articles' links from that website using the following code, I get nothing.
library(rvest)
library(dplyr)
library(XML)
url <- 'https://2015-2019.kormany.hu/hu/hirek'
links <- read_html(url) %>%
  html_nodes("div") %>%
  html_nodes(xpath = '//*[@class="article"]') %>%
  html_nodes("h2") %>%
  html_nodes("a") %>%
  html_attr("href")
links
> character(0)
I don't even get anything if I run the following code:
links <- read_html(url) %>% html_nodes("div")
links
> character(0)
This is very strange since, when I inspect the website, it seems I should be getting a list of URLs from the code I provided. According to the website's source ('view-source:https://2015-2019.kormany.hu/hu/hirek'), there are "div" nodes. Does anyone know what I could be doing wrong?
Today I re-tried my code and it works perfectly. I am not sure what was happening yesterday.
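For reference, a cleaned-up version of the same scrape, assuming the site is responding normally; the XPath can target the article headings directly instead of chaining several html_nodes() calls:
library(rvest)

url <- 'https://2015-2019.kormany.hu/hu/hirek'

# Assumes each article heading link sits inside an element with class "article"
links <- read_html(url) %>%
  html_nodes(xpath = '//*[@class="article"]//h2/a') %>%
  html_attr("href")

links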

What makes table web scraping with rvest package sometimes fail?

I'm playing with the rvest package and trying to figure out why it sometimes fails to scrape objects that definitely look like tables.
Consider for instance a script like this:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="options"]/table/tbody/tr/td/table[2]/tbody') %>%
  html_table()
population
If I inspect population, it's an empty list:
> population
list()
Another example:
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="Col1-1-OptionContracts-Proxy"]/section/section[1]/div[2]') %>%
  html_table()
population
I was wondering if the use of PhantomJS is mandatory - as explained here - or if the problem is elsewhere.
Neither of your current XPaths actually selects just the table. In both cases you need to pass an actual table node to html_table(), as under the hood there is:
html_table.xml_node(.) : html_name(x) == "table"
Also, long XPaths are fragile, especially when you apply a path that is valid for the browser-rendered content to the HTML that rvest actually returns - JavaScript doesn't run with rvest. Personally, I prefer nice short CSS selectors. You can use the second-fastest selector type, class, and only need to specify a single class:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node('.optionchain') %>%
  html_table()
The table needs cleaning, of course, due to "merged" cells in the source, but you get the idea.
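One generic cleanup pattern (my own sketch, assuming the real column headers sit in the first row of the parsed table, which you should verify for this page) is to promote that row to column names and drop it:
# population here is the data frame returned by html_table() above
tbl <- population
names(tbl) <- as.character(unlist(tbl[1, ]))  # use the first row as column names
tbl <- tbl[-1, ]                              # then drop that row from the data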
With XPath you could do:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//table[2]') %>%
  html_table()
Note: I reduced the XPath and work with a single node that represents a table.
For your second example: again, your XPath is not selecting a table element. The table's class attribute is multi-valued, but a single well-chosen class will suffice in XPath, i.e. //*[contains(@class,"calls")]. Select a single table node.
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//*[contains(@class,"calls")]') %>%
  html_table()
Once again, my preference is for a CSS selector (less typing!):
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node('.calls') %>%
  html_table()
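If you want every table on the page rather than just the calls table, a small extension (nothing page-specific assumed) is to select all table nodes and let html_table() return a list of data frames:
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"

all_tables <- url %>%
  xml2::read_html() %>%
  html_nodes('table') %>%   # every table node in the document
  html_table()              # a list with one data frame per table

length(all_tables)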

Web Scraping in R (Getting piece of Information from a table)

I'm trying to learn web scraping in R on my own...
It feels really difficult without HTML knowledge.
crime_wiki <- read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate")
crime_wiki %>%
html_nodes(".firstHeading") %>% html_text()
crime_wiki %>%
html_nodes("dl+ h2 .mw-headline") %>% html_text()
The code above worked fine; I got what I wanted.
When I tried to get city names (from Albuquerque to Wichita), it didn't work.
I wrote
crime_wiki %>%
html_nodes(".jquery-tablesorter a") %>% html_text()
What did I do wrong?
Ultimately, here is what I want to do: when I click each city name, the linked pages seem to have the same format, so I want to pull the same piece of information from each page, such as the name of the mayor, for all the cities in the table...
The following code allowed me to get the city names:
library(rvest)
crime_wiki <- read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate")
crime_wiki %>%
  html_nodes("td a") %>%
  html_text()
I'm not familiar with your use of ".jquery-tablesorter a". I used SelectorGadget to get the names of the nodes, i.e., "td a". Note that with the code I've shared, I would need to remove the last 4 elements if I wanted only the city names.
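A minimal sketch of that trimming step, plus collecting the link targets for the follow-up scraping mentioned in the question (assuming, as stated above, that the unwanted entries are the last 4 elements returned):
library(rvest)

crime_wiki <- read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate")

city_links <- crime_wiki %>% html_nodes("td a")

# head(x, -4) keeps everything except the last 4 elements
city_names <- city_links %>% html_text() %>% head(-4)
city_urls  <- city_links %>% html_attr("href") %>% head(-4)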

How to scrape a table with rvest and xpath?

Using the following documentation, I have been trying to scrape a series of tables from marketwatch.com.
Here is the one represented by the code below:
The link and xpath are already included in the code:
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation <- url %>%
  html() %>%
  html_nodes(xpath = '//*[@id="maincontent"]/div[2]/div[1]') %>%
  html_table()
valuation <- valuation[[1]]
I get the following warning:
Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
Thanks in advance.
That website doesn't use an HTML table, so html_table() can't find anything. It actually uses divs with the classes "column" and "data lastcolumn".
So you can do something like:
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="column"]')
valuation_data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="data lastcolumn"]')
Or even
url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="section"]')
To get you most of the way there.
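One possible way to finish the job (my own assembly, not part of the original answer) is to pair the label nodes with the value nodes, assuming the two node sets come back in the same order:
library(rvest)

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
page <- read_html(url)

labels <- page %>%
  html_nodes(xpath = '//*[@class="column"]') %>%
  html_text(trim = TRUE)

values <- page %>%
  html_nodes(xpath = '//*[@class="data lastcolumn"]') %>%
  html_text(trim = TRUE)

# Assumes labels and values line up one-to-one
# (data.frame() will error if their lengths differ)
valuation <- data.frame(label = labels, value = values, stringsAsFactors = FALSE)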
Please also read their terms of use - particularly 3.4.
