Web Scraping in R (Getting a Piece of Information from a Table)

I'm trying to learn web scraping in R on my own...
It feels really difficult without HTML knowledge.
library(rvest)
crime_wiki <- read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate")
crime_wiki %>%
html_nodes(".firstHeading") %>% html_text()
crime_wiki %>%
html_nodes("dl+ h2 .mw-headline") %>% html_text()
The code above worked fine; I got what I wanted.
But when I tried to get the city names (from Albuquerque to Wichita), it didn't work.
I wrote:
crime_wiki %>%
html_nodes(".jquery-tablesorter a") %>% html_text()
What did I do wrong?
Ultimately, what I want to do is this: each city name links to a page that seems to have the same format, so I'd like to pull the same piece of information from each page, such as the name of the mayor, for every city in the table...

The following code allowed me to get the city names:
library(rvest)
crime_wiki <- read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate")
crime_wiki %>%
html_nodes("td a") %>%
html_text()
I'm not familiar with your use of ".jquery-tablesorter a". As far as I know, classes like jquery-tablesorter are added by JavaScript after the page loads, so they aren't present in the raw HTML that read_html retrieves. I used SelectorGadget to get the name of the nodes, i.e., "td a". Note that with the code I've shared, I would need to remove the last 4 elements if I wanted only the city names.
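For the ultimate goal of visiting each city's page, one approach is to collect the hrefs from the same "td a" nodes and then read each linked page. The sketch below is an illustration only: the infobox xpath, the link filtering, and the get_mayor helper are assumptions that may need adjusting against real city pages.
library(rvest)
library(purrr)
crime_wiki <- read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate")
# Relative links from the same table cells used for the city names
city_links <- crime_wiki %>%
html_nodes("td a") %>%
html_attr("href")
# Build absolute URLs; as with the names, you may need to drop non-city links at the end
city_urls <- paste0("https://en.wikipedia.org", city_links)
# Assumed infobox layout: the "Mayor" label sits in a th whose sibling td holds the name.
# Verify this xpath against an actual city page before relying on it.
get_mayor <- function(url) {
read_html(url) %>%
html_node(xpath = '//table[contains(@class,"infobox")]//th[contains(., "Mayor")]/following-sibling::td[1]') %>%
html_text(trim = TRUE)
}
# Try a few pages first; possibly() returns NA instead of stopping on an error
mayors <- map_chr(head(city_urls, 3), possibly(get_mayor, NA_character_))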

Related

How to log scrape paths rvest used?

Background: Using rvest I'd like to scrape all the details of all the art pieces by the painter Paolo Uccello on wikiart.org. The endgame will look something like this:
> names(uccello_dt)
[1] "title"  "year"   "style"  "genre"  "media"  "imgSRC" "infoSRC"
Problem: When a scraping attempt doesn't go as planned, I get back character(0). This isn't helpful for me in understanding exactly what path the scrape took to get character(0). I'd like to have my scrape attempts output what path it specifically took so that I can better troubleshoot my failures.
What I've tried:
I use Firefox, so after each failed attempt I go back to the web inspector to make sure that I am using the correct CSS selector / element tag. I've been keeping the rvest documentation by my side to better understand its functions. It's been trial and error that's taking much longer than I think it should. Here's a cleaned-up source of one of many failures:
library(tidyverse)
library(data.table)
library(rvest)
sample_url <-
read_html(
"https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
)
imgSrc <-
sample_url %>%
html_nodes(".wiki-detailed-item-container") %>% html_nodes(".masonry-detailed-artwork-item") %>% html_nodes("aside") %>% html_nodes(".wiki-layout-artist-image-wrapper") %>% html_nodes("img") %>%
html_attr("src") %>%
as.character()
title <-
sample_url %>%
html_nodes(".masonry-detailed-artwork-title") %>%
html_text() %>%
as.character()
Thank you in advance.
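rvest itself doesn't log the path a scrape took, but one general way to make each step visible is to wrap html_nodes in a small helper that reports how many nodes a selector matched, so a step that returns 0 nodes shows up immediately. log_nodes below is a hypothetical name, not an rvest function; this is a sketch, not a fix for the wikiart page specifically.
library(rvest)
# Hypothetical helper: report how many nodes each selector matched
log_nodes <- function(x, css) {
nodes <- html_nodes(x, css)
message(sprintf("selector '%s' matched %d node(s)", css, length(nodes)))
nodes
}
sample_url <- read_html("https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed")
title <- sample_url %>%
log_nodes(".masonry-detailed-artwork-title") %>%
html_text()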

rvest sometimes works, sometimes returns 0 nodes

I am relatively new to web scraping, and I have recently been using rvest.
I am scraping news headlines, paragraphs, and links from a Yahoo News page (about 10 of each at a time). The code I am using to do it is below:
library(rvest)
# url is the Yahoo News page being scraped
headlines <- read_html(url) %>%
html_nodes("#web a") %>%
html_text()
paragraphs <- read_html(url) %>%
html_nodes("#web p") %>%
html_text()
links <- read_html(url) %>%
html_nodes("#web a") %>%
html_attr("href")
My issue is that sometimes my code works perfectly and I get what I need (three vectors of info, each of length 10), and then a second later on another test it returns nothing:
> headlines <- read_html(url) %>%
+ html_nodes("#web a") %>%
+ html_text()
> headlines
character(0)
Does anyone know why this is or how to make it more reliable? I am putting the code into a dashboard and want to be able to reliably check the top news articles every day. Does rvest or Yahoo News have rate limits that could be blocking me? I am not currently aware of any. For context, I am testing the dashboard constantly (100 times a day minimum); is it possible that this could be overworking it?
Thank you in advance for any guidance.
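One general-purpose sketch (not a confirmed fix for Yahoo specifically): parse the page once, reuse the parsed document for all three extractions, and retry with a pause when a request comes back empty. read_with_retry below is a hypothetical helper, and the "#web a" check is just the selector from the question.
library(rvest)
# Hypothetical helper: retry the request a few times if the expected nodes are missing
read_with_retry <- function(url, tries = 3, wait = 5) {
for (i in seq_len(tries)) {
page <- read_html(url)
if (length(html_nodes(page, "#web a")) > 0) return(page)
Sys.sleep(wait)  # back off before retrying
}
page
}
page <- read_with_retry(url)
# Extract everything from the same parsed document instead of requesting the page three times
headlines  <- page %>% html_nodes("#web a") %>% html_text()
paragraphs <- page %>% html_nodes("#web p") %>% html_text()
links      <- page %>% html_nodes("#web a") %>% html_attr("href")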

How to scrape NBA data?

I want to compare rookies across leagues with stats like Points per game (PPG) and such. ESPN and NBA have great tables to scrape from (as does Basketball-reference), but I just found out that they're not stored in html, so I can't use rvest. For context, I'm trying to scrape tables like this one (from NBA):
https://i.stack.imgur.com/SdKjE.png
I'm trying to learn how to use httr and jsonlite for this, but I'm running into some issues. I followed the answer in this post, but it's not working out for me.
This is what I've tried:
library(httr)
library(jsonlite)
library(magrittr) # for the %>% pipe
coby.white <- GET('https://www.nba.com/players/coby/white/1629632')
out <- content(coby.white, as = "text") %>%
fromJSON(flatten = FALSE)
However, I get an error:
Error: lexical error: invalid char in json text.
<!DOCTYPE html><html class="" l
(right here) ------^
Is there an easier way to scrape a table from ESPN or NBA, or is there a solution to this issue?
PPG and other stats come from:
https://data.nba.net/prod/v1/2019/players/1629632_profile.json
and player info (e.g. weight, height) comes from:
https://www.nba.com/players/active_players.json
So, you could use jsonlite to parse them, e.g.
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
You can find these in the network tab when refreshing the page. It looks like you can use the player id in the URL to get different players' info for the season.
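For example, a small sketch of wrapping that endpoint so the player id can be swapped out; player_profile is a hypothetical helper, and the structure of the returned list is something to inspect rather than assume.
library(jsonlite)
# Hypothetical helper: fetch the profile JSON for any player id and season
player_profile <- function(player_id, season = 2019) {
url <- sprintf("https://data.nba.net/prod/v1/%s/players/%s_profile.json", season, player_id)
jsonlite::read_json(url)
}
coby <- player_profile("1629632")
# Inspect the nested list to locate PPG and other stats; the element names
# depend on the JSON structure, so check with str() first
str(coby, max.level = 3)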
You actually can web scrape this with rvest; here's an example of scraping White's totals table from Basketball Reference. Anything on Sports Reference's sites that is not the first table on the page is stored inside an HTML comment, meaning we must extract the comment nodes first and then extract the desired data table.
library(rvest)
library(dplyr)
cobywhite <- 'https://www.basketball-reference.com/players/w/whiteco01.html'
totalsdf <- cobywhite %>%
read_html() %>%
html_nodes(xpath = '//comment()') %>%  # grab the comment nodes
html_text() %>%
paste(collapse = '') %>%               # stitch the comment text into one HTML string
read_html() %>%                        # re-parse that string as HTML
html_node("#totals") %>%               # select the totals table
html_table()

What makes table web scraping with the rvest package sometimes fail?

I'm playing with the rvest package and trying to figure out why it sometimes fails to scrape objects that definitely seem to be tables.
Consider for instance a script like this:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
xml2::read_html() %>%
html_nodes(xpath='//*[#id="options"]/table/tbody/tr/td/table[2]/tbody') %>%
html_table()
population
If I inspect population, it's an empty list:
> population
list()
Another example:
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
xml2::read_html() %>%
html_nodes(xpath='//*[#id="Col1-1-OptionContracts-Proxy"]/section/section[1]/div[2]') %>%
html_table()
population
I was wondering if the use of PhantomJS is mandatory - as explained here - or if the problem is elsewhere.
Neither of your current xpaths actually selects just the table. In both cases I think you need to pass an HTML table to html_table, as under the hood there is:
html_table.xml_node(.) : html_name(x) == "table"
Also, long xpaths are too fragile, especially when applying a path that is valid for browser-rendered content to the HTML rvest returns, as JavaScript doesn't run with rvest. Personally, I prefer nice short CSS selectors. You can use a class selector (the second fastest selector type) and only need to specify a single class:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
xml2::read_html() %>%
html_node('.optionchain') %>%
html_table()
The table needs cleaning, of course, due to "merged" cells in the source, but you get the idea.
With xpath you could do:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
xml2::read_html() %>%
html_node(xpath='//table[2]') %>%
html_table()
Note: I reduced the xpath and work with a single node that represents a table.
For your second example:
Again, your xpath is not selecting a table element. The table's class attribute is multi-valued, but a single correctly chosen class will suffice in xpath, i.e. //*[contains(@class,"calls")]. Select a single table node.
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
xml2::read_html() %>%
html_node(xpath='//*[contains(@class,"calls")]') %>%
html_table()
Once again, my preference is for a CSS selector (less typing!):
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
xml2::read_html() %>%
html_node('.calls') %>%
html_table()

How to use rvest to scrape data

I am trying to get the addresses from this site: https://www.uchealth.com/our-locations/#hospitals
I tried:
html_nodes(xpath = "//*[#id='uch_location_results']/div[1]/div/div[2]/address") %>%
html_text()
Any suggestions on what I am doing wrong?
If you use the network tab, you will find a source URL for the addresses:
library(rvest)
r <- read_html('https://www.uchealth.com/wp-content/themes/uchealth-2016-interim/ajax/location_search.php?region=hospitals') %>%
html_nodes('address') %>%
html_text()
The names of the hospitals are available with the following css selector:
h3
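Putting the two together (assuming the AJAX endpoint still returns the same markup), you can pull the addresses and hospital names in one pass:
library(rvest)
# Parse the AJAX response once and extract both pieces from it
page <- read_html('https://www.uchealth.com/wp-content/themes/uchealth-2016-interim/ajax/location_search.php?region=hospitals')
addresses <- page %>% html_nodes('address') %>% html_text()
hospital_names <- page %>% html_nodes('h3') %>% html_text()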