Extracting a web table using rvest (in R)

I am looking to pull in the table at http://www.nfl.com/inactives?week=5 in order to process active and inactive players. I am very familiar with rvest and have tried the following code:
library(rvest)
url <- paste0("http://www.nfl.com/inactives?week=5")
Table <- url %>%
read_html() %>%
html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "yui3-datatable-cell", " " ))]') %>%
html_table()
TableNew <- Table[[1]]
TableNew
Nothing comes up correctly, though. Ideally, I would like to put all the players and their team names into a single table. I appreciate your insights.
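A minimal sketch of the simpler route, assuming the table is present in the static HTML; the yui3-datatable-cell class suggests the table may be built by JavaScript at run time, in which case read_html() will not see it and a browser-driven tool such as RSelenium would be needed:
library(rvest)
page <- read_html("http://www.nfl.com/inactives?week=5")
tables <- page %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)  # a list of data frames, one per <table> found in the static HTML
length(tables)             # 0 here would suggest the table is rendered client-side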

Related

How to skip webpages during a web scraping in rvest

I'm trying to collect information using the rvest package in R.
While collecting the data with a for loop, I found that some of the pages contain no information, so the loop stops with the error: Error in open.connection(x, "rb") : HTTP error 404.
Here is my R code. Pages 15138 and 15140 do have information, whereas 15139 does not. How can I skip 15139 in the for loop?
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(stringi)
source_url <- "https://go2senkyo.com/local/senkyo/"
senkyo <- data.frame()
for (i in 15138:15140) {
Sys.sleep(0.5)
target_page <- paste0(source_url, i)
recall_html <- read_html(target_page, encoding = "UTF-8")
prefecture <- recall_html %>%
html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
html_text()
city <- recall_html %>%
html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl", " " ))]') %>%
html_text()
city <- trimws(gsub("[\r\n]", "", city ))
senkyo2 <- cbind(prefecture, city)
senkyo <- rbind(senkyo , senkyo2)
}
I'm looking forward to your answer!
You can handle exceptions a few different ways. I'm a noob when it comes to scraping, but here are a few options for your situation.
Tailor Your Loop Range
If you know that you don't want the value 15139, you can remove it from the vector of values you loop over, like:
for (i in c(15138,15140)) {
This will completely ignore 15139 when running your loop.
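For longer ranges, or when several pages need skipping, the same idea can be written with setdiff(); a small sketch:
skip <- 15139                          # pages known to be missing
for (i in setdiff(15138:15140, skip)) {
  # loop body unchanged
}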
Add Control Flow
This is basically the same thing as tailoring your loop range, but handles the exception within the loop itself, like:
for (i in 15138:15140) {
Sys.sleep(0.5)
# control statement
if (i == 15139) {
next # moves to next iteration of loop, in this case 15140
}
target_page <- paste0(source_url, i) # not run if i == 15139, since the loop skipped to the next iteration
# ... rest of the loop body as before
}
Condition Handling Tools
This is where I get out of my depth and constantly reference Advanced R. Essentially, you can wrap potentially failing code in functions like try(), which insulates your loop from errors, keeps it from breaking, and gives you flexibility about what to do when your code fails in specific ways.
My usual approach would be to add something to your code like:
# wrap the part of your code that can break in try()
recall_html <- try(read_html(target_page, encoding = "UTF-8"))
# you'll still see your error, but it won't stop your code, unless you set silent = TRUE
# you'll need to add control flow to keep your loop from breaking at the next function, however
if (inherits(recall_html, "try-error")) { # inherits() is safer here, since a successful read_html() result has more than one class
next
} else {
prefecture <- recall_html %>%
html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
html_text()
# ... continue with the city extraction, cbind(), and rbind() as in the original loop
}
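An alternative sketch using tryCatch(), which lets you decide up front what a failed request should return instead of checking the class afterwards (the NULL return value and the message are my own choices, not part of the code above):
recall_html <- tryCatch(
  read_html(target_page, encoding = "UTF-8"),
  error = function(e) {
    message("Skipping ", target_page, ": ", conditionMessage(e))
    NULL
  }
)
if (is.null(recall_html)) next  # move on to the next page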

Error with rvest - NAs introduced by coercion (xpath & css)

I'm attempting to scrape a website and collect the daily prices for various articles of clothing over an extended period. I've followed the tutorial on RStudio's blog, but I am unable to replicate the idea on my test case despite using SelectorGadget. I've tried the following code but still receive NAs:
url<- "https://www.zara.com/us/en/authentic-jeans-p00840407.html?v1=9035594&v2=1204074"
jeans <- url %>%
read_html() %>%
html_nodes(".description , .product-price span") %>%
html_text() %>%
as.numeric()
I've also attempted to use the xpath format, still with no luck:
jeans <- url %>%
read_html() %>%
html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "product-price", " " ))]') %>%
html_text() %>%
as.numeric()
I'd greatly appreciate any insight you might share, and I would also appreciate any resources that detail how to build a database over time from scraped data or how to batch rvest scraping requests!
Thank you!
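For reference, the NAs most likely come from calling as.numeric() on price text that still contains a currency symbol (for example a leading "$"). A minimal sketch of cleaning the text first, assuming the .product-price span selector from the question actually matches the price element:
library(rvest)
url <- "https://www.zara.com/us/en/authentic-jeans-p00840407.html?v1=9035594&v2=1204074"
price_text <- url %>%
  read_html() %>%
  html_nodes(".product-price span") %>%
  html_text()
price <- as.numeric(gsub("[^0-9.]", "", price_text))  # strip everything except digits and the decimal point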

rvest: css / xpath from SelectorGadget not working

I want to extract all download links for txt files from this page: https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle. To do so, I used SelectorGadget, which suggested the selector .bt-link-dokument. With this information I wrote this code:
library(rvest)
library(tidyverse)
protokolle <-
read_html("https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle")
txts <-
protokolle %>%
html_nodes(".bt-link-dokument")
The result is the same if I try
txts <-
protokolle %>%
html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "bt-link-dokument", " " ))]')
For a reason I do not understand, txts contains only {xml_nodeset (0)}. Any ideas on what went wrong?
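A quick diagnostic sketch (my own suggestion, not from the question): check whether the class name occurs in the raw HTML at all. If it does not, the document list is injected by JavaScript after the page loads, so read_html() never sees it and the links would have to be fetched from the site's underlying data endpoint or with a browser-driven tool such as RSelenium.
page_source <- paste(readLines("https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle", warn = FALSE), collapse = "\n")
grepl("bt-link-dokument", page_source, fixed = TRUE)  # FALSE would mean the nodes are not in the static HTML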

Web scraping tables from over 5K websites listed by url in a .csv file, all in R

So, I am working to extract data from the following website: http://livingwage.mit.edu
...at the county level, and have tried many different iterations of using the rvest package to extract the data. Unfortunately, there are about 5K counties.
I have extracted all the urls into a single column .csv file. The urls have the form "http://livingwage.mit.edu/counties/..." where "..." is the state code followed by the county code.
The data I want has the css identifier as (from SelectorGadget)
css = '.wages_table .even .col-NaN , .wages_table .results .col-NaN'
or the xpath of
xpath = //*[contains(concat( " ", @class, " " ), concat( " ", "wages_table", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "even", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "col-NaN", " " ))] | //*[contains(concat( " ", @class, " " ), concat( " ", "wages_table", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "results", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "col-NaN", " " ))]
This is where I started:
library(rvest)
url <- read_html("http://livingwage.mit.edu/counties/01001")
url %>%
html_nodes("table") %>%
.[[1]] %>%
html_table()
...but was only able to extract one table at a time, and got the headers and the final row, which I did not want.
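As an aside, once html_table() returns a data frame, unwanted rows can be dropped directly; a small sketch continuing from the url object above (treating the first row as the header and assuming the unwanted trailing row is the last one):
tbl <- url %>% html_nodes("table") %>% .[[1]] %>% html_table(header = TRUE)
tbl <- tbl[-nrow(tbl), ]  # drop the final (unwanted) row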
So, I tried something like this:
counties <- 01001:54500
urls <- paste0("http://livingwage.mit.edu/counties/", counties)
get_table <- function(url) {
url %>%
read_html() %>%
html_nodes("table") %>%
.[[1]] %>%
html_table()
}
results <- sapply(urls, get_table)
...but quickly realized that the county numbers are neither sequential (they are mostly odd) nor continuous, i.e., one state may have only 4 counties, with urls that only go up to ~/10009, for example.
Finally, I got as far as this when trying to access the .csv list of urls on my desktop:
URL <- read.csv("~/Desktop/LW_url.csv", header=T)
URL %>%
html_nodes("table", ".wages_table .even .col-NaN , .wages_table .results .col-NaN") %>%
.[[1]] %>%
html_table()
...and I know that the css selector and the read.csv() output do not talk to each other nicely here.
Any help in making this happen would be thoroughly appreciated.
I think this is what you are looking for.
install.packages("pbapply") # a nice variant of lapply that shows a progress bar and estimated run time
library(rvest)
library(dplyr)
library(magrittr)
library(pbapply)
## Get State urls
lwc.url <- "http://livingwage.mit.edu"
state.urls <- read_html(lwc.url)
state.urls %<>% html_nodes(".col-md-6 a") %>% xml_attr("href") %>%
paste0(lwc.url, .)
## get county urls and county names
county.urls <- lapply(state.urls, function(x) read_html(x) %>%
html_nodes(".col-md-3 a") %>% xml_attr("href") %>%
paste0(lwc.url, .)) %>% unlist
## Get the tables Hourly wage & typical Expenses
dfs <- pblapply(county.urls, function(x){
LWC <- read_html(x)
df <- rbind(
LWC %>% html_nodes("table") %>% .[[1]] %>%
html_table() %>% setNames(c("Info", names(.)[-1])),
LWC %>% html_nodes("table") %>% .[[2]] %>%
html_table() %>% setNames(c("Info", names(.)[-1])))
title <- LWC %>% html_nodes("h1") %>% html_text
df$State <- trimws(gsub(".*,", "", title))
df$County <- trimws(gsub(".*for (.*) County.*", "\\1", title))
df$url <- x
df
})
df <- data.table::rbindlist(dfs)
View(df)
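One design note (my own observation, not part of the answer above): if any county page ever returns tables with a different column layout, rbindlist() will error; passing fill = TRUE pads the missing columns with NA instead.
df <- data.table::rbindlist(dfs, fill = TRUE)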

How to get table using rvest()

I want to grab some data from the Pro Football Reference website using the rvest package. First, let's grab the results for all games played in 2015 from this url: http://www.pro-football-reference.com/years/2015/games.htm
library("rvest")
library("dplyr")
#grab table info
url <- "http://www.pro-football-reference.com/years/2015/games.htm"
urlHtml <- url %>% read_html()
dat <- urlHtml %>% html_table(header=TRUE) %>% .[[1]] %>% as_data_frame()
Is that how you would have done it? :)
dat could be cleaned up a bit. Two of the variables seem to have blanks for names. Plus the header row is repeated between each week.
colnames(dat) <- c("week", "day", "date", "winner", "at", "loser",
"box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL")
dat2 <- dat %>% filter(!(box == ""))
head(dat2)
Looks good!
Now let's look at an individual game. At the webpage above, click on "Boxscore" in the very first row of the table: The Sept 10th game played between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm.
I want to grab the individual snap counts for each player (about half way down the page). Pretty sure these will be our first two lines of code:
gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
But now I can't figure out how to grab the specific table I want. I use the Selector Gadget to highlight the table of Patriots snap counts. I do this by clicking on the table in several places, then 'unclicking' the other tables that were highlighted. I end up with a path of:
#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left
Each of these attempts returns {xml_nodeset (0)}
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right")
gameHtml %>% html_nodes("#home_snap_counts")
Maybe let's try using xpath. All of these attempts also return {xml_nodeset (0)}
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))] | //*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "tooltip", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')
How can I grab that table? I'll also point out that when I do "View Page Source" in Google Chrome, the tables I want almost seem to be commented out: they're shown in green instead of the usual red/black/blue color scheme. That is not the case for the table of game results we pulled first, whose "View Page Source" markup has the usual red/black/blue color scheme. Is the green coloring indicative of what's preventing me from grabbing this snap count table?
Thanks!
The information you are looking for is displayed programmatically at run time. One solution is to use RSelenium.
Looking at the web page's source, however, the table information is present in the code but hidden, because the tables are stored inside HTML comments. Here is my solution, where I remove the comment markers and reprocess the page normally.
I saved the file to the working directory and then read the file in using the readLines function.
Now I search for the HTML comment begin and end markers and remove them. I save the file a second time (minus the comment markers) in order to reread the file and process the selected tables.
gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
gameHtml %>% html_nodes("tbody")
#Only save and work with the body
body<-html_node(gameHtml,"body")
write_xml(body, "nfl.xml")
#Find and remove comments
lines<-readLines("nfl.xml")
lines<-lines[-grep("<!--", lines)]
lines<-lines[-grep("-->", lines)]
writeLines(lines, "nfl2.xml")
#Read the file back in and process normally
body<-read_html("nfl2.xml")
html_table(html_nodes(body, "table")[29])
#extract the attributes and find the attribute of interest
a<-html_attrs(html_nodes(body, "table"))
#find the tables of interest.
homesnap<-which(sapply(a, function(x){x[2]})=="home_snap_counts")
html_table(html_nodes(body, "table")[homesnap])
visitsnap<-which(sapply(a, function(x){x[2]})=="vis_snap_counts")
html_table(html_nodes(body, "table")[visitsnap])
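An alternative sketch that avoids writing files to disk (my own variant, not part of the answer above): pull the comment nodes directly with an XPath comment() selector, re-parse their text as HTML, and then select the snap-count tables by the ids identified above.
library(rvest)
library(xml2)
gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- read_html(gameUrl)
# collect the text of every HTML comment and parse it as a page of its own
hidden <- gameHtml %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html()
home_snaps <- hidden %>% html_nodes("table#home_snap_counts") %>% html_table(fill = TRUE)
vis_snaps <- hidden %>% html_nodes("table#vis_snap_counts") %>% html_table(fill = TRUE)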
