How to skip webpages during web scraping with rvest

I'm trying to collect information using rvest package in R.
While collecting the data with a for loop, I found that some of the pages do not contain any information, which raises an error: Error in open.connection(x, "rb") : HTTP error 404.
Here is my R code. Pages 15138 and 15140 do have information, whereas 15139 does not. How can I skip 15139 in the for loop?
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(stringi)
source_url <- "https://go2senkyo.com/local/senkyo/"
senkyo <- data.frame()
for (i in 15138:15140) {
Sys.sleep(0.5)
target_page <- paste0(source_url, i)
recall_html <- read_html(target_page, encoding = "UTF-8")
prefecture <- recall_html %>%
html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
html_text()
city <- recall_html %>%
html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl", " " ))]') %>%
html_text()
city <- trimws(gsub("[\r\n]", "", city ))
senkyo2 <- cbind(prefecture, city)
senkyo <- rbind(senkyo , senkyo2)
}
I'm looking forward to your answer!

You can handle exceptions a few different ways. I'm a noob when it comes to scraping, but here are a few options for your situation.
Tailor Your Loop Range
If you know that you don't want the value 15139, you can remove it from the vector of options, like:
for (i in c(15138,15140)) {
This will completely ignore 15139 when running your loop.
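If several pages are known to be missing, the same idea extends to dropping a whole set of page numbers up front. This is only a minimal sketch: `missing_pages` is a hypothetical vector you would fill in yourself, and the loop body is a placeholder for the scraping code from the question.

```r
all_pages <- 15138:15140
missing_pages <- c(15139)  # assumption: page numbers already known to return 404

# keep only the pages that are not in the known-missing set
pages_to_scrape <- setdiff(all_pages, missing_pages)

for (i in pages_to_scrape) {
  print(i)  # placeholder: the scraping code from the question would go here
}
```

This only helps when the bad pages are known in advance. For pages that fail unpredictably, error handling inside the loop is the more robust route.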
Add Control Flow
This is basically the same thing as tailoring your loop range, but handles the exception within the loop itself, like:
for (i in 15138:15140) {
Sys.sleep(0.5)
# control statement
if (i == 15139) {
next # moves to next iteration of loop, in this case 15140
}
target_page <- paste0(source_url, i) # not run if i == 15139, since loop skipped to next iteration
Condition Handling Tools
This is where I get out of my depth and constantly reference Advanced R. Essentially, you can wrap potentially buggy code in functions like try(), which insulates your loop from errors, keeps it from breaking, and gives you flexibility about what to do if your code breaks in specific ways.
My usual approach would be to add something to your code like:
# wrap the part of your code that can break in try()
recall_html <- try(read_html(target_page, encoding = "UTF-8"))
# the error message is still printed unless you set silent = TRUE, but it won't stop your code
# you'll need to add control flow to keep your loop from breaking at the next function, however
if (inherits(recall_html, "try-error")) {
next
} else {
prefecture <- recall_html %>%
html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
html_text()
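The same skip-on-error pattern can also be written with tryCatch() instead of try(). The sketch below is not the code from the answer above: `fetch_or_null` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for read_html() so the example runs without a network connection. In real use you would presumably pass a function that calls read_html(url, encoding = "UTF-8").

```r
# Hypothetical helper: returns NULL instead of raising when a page fails to load
fetch_or_null <- function(url, fetch) {
  tryCatch(
    fetch(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Assumption: offline stand-in for read_html(); the page ending in 15139 "fails"
fake_fetch <- function(url) {
  if (grepl("15139$", url)) stop("HTTP error 404.")
  paste("content of", url)
}

source_url <- "https://go2senkyo.com/local/senkyo/"
pages <- list()
for (i in 15138:15140) {
  target_page <- paste0(source_url, i)
  page <- fetch_or_null(target_page, fake_fetch)
  if (is.null(page)) next  # skip this iteration when the page could not be read
  pages[[as.character(i)]] <- page
}
names(pages)
```

Returning NULL from the error handler keeps the check inside the loop to a single is.null() test, and the message records which pages were skipped.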

Related

How scrape text from webpage that requires interaction in r

I am trying to scrape reviews from a webpage to determine word frequency. However, only partial reviews are given when the review is longer. You have to click on "More" to get the webpage to show the full review. Here is the code I am using to extract the text of the review. How can I "click" on more to get the full review?
library(rvest)
tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
webpage <- read_html(tripAdvisorURL)
reviewData <- html_nodes(webpage, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "partial_entry", " " ))]')
head(reviewData)
html_text(reviewData[[1]])
[1] "The rooms were clean and we slept so good we had room 10 and 12 we
didn’t use 12 but it joins 10 .kind of strange but loved the hotel ..me
personally I would take the hot tub out it was kinda old..the lady
that...More"
As mentioned in the comment, you can use RSelenium together with rvest for more interactivity:
library(RSelenium)
rmDr <- rsDriver(browser = "chrome")
myclient <- rmDr$client
tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
myclient$navigate(tripAdvisorURL)
# select all "More" buttons and loop over them to click each one
webEles <- myclient$findElements(using = "css",value = ".ulBlueLinks")
for (webEle in webEles) {
webEle$clickElement()
}
mypagesource <- myclient$getPageSource()
read_html(mypagesource[[1]]) %>%
html_nodes(".partial_entry") %>%
html_text()

rvest: css / xpath from SelectorGadget not working

I want to extract all download links for txt-files from this page: https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle. To do so, I tried SelectorGadget and selected the following:
With this information I wrote this code:
library(rvest)
library(tidyverse)
protokolle <-
read_html("https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle")
txts <-
protokolle %>%
html_nodes(".bt-link-dokument")
The result is the same if I try
txts <-
protokolle %>%
html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "bt-link-dokument", " " ))]')
For a reason I do not understand, txts contains only {xml_nodeset (0)}. Any ideas on what went wrong?

Extracting web table using Rvest (in R)

I am looking to pull a table in at http://www.nfl.com/inactives?week=5 in order to process active and inactive players. I am very familiar with rvest and have tried using the code:
library(rvest)
url <- paste0("http://www.nfl.com/inactives?week=5")
Table <- url %>%
read_html() %>%
html_nodes(xpath= '//*[contains(concat( " ", @class, " " ), concat( " ", "yui3-datatable-cell", " " ))]') %>%
html_table()
TableNew <- Table[[1]]
TableNew
Nothing is coming up correctly though. Ideally, I would like to be able to put all the players and their team name into one single table. I appreciate your insights.

How to get table using rvest()

I want to grab some data from Pro Football Reference website using the rvest package. First, let's grab results for all games played in 2015 from this url http://www.pro-football-reference.com/years/2015/games.htm
library("rvest")
library("dplyr")
#grab table info
url <- "http://www.pro-football-reference.com/years/2015/games.htm"
urlHtml <- url %>% read_html()
dat <- urlHtml %>% html_table(header=TRUE) %>% .[[1]] %>% as_data_frame()
Is that how you would have done it? :)
dat could be cleaned up a bit. Two of the variables seem to have blanks for names. Plus the header row is repeated between each week.
colnames(dat) <- c("week", "day", "date", "winner", "at", "loser",
"box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL")
dat2 <- dat %>% filter(!(box == ""))
head(dat2)
Looks good!
Now let's look at an individual game. At the webpage above, click on "Boxscore" in the very first row of the table: The Sept 10th game played between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm.
I want to grab the individual snap counts for each player (about half way down the page). Pretty sure these will be our first two lines of code:
gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
But now I can't figure out how to grab the specific table I want. I use the Selector Gadget to highlight the table of Patriots snap counts. I do this by clicking on the table in several places, then 'unclicking' the other tables that were highlighted. I end up with a path of:
#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left
Each of these attempts returns {xml_nodeset (0)}
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right")
gameHtml %>% html_nodes("#home_snap_counts")
Maybe let's try using xpath. All of these attempts also return {xml_nodeset (0)}
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))] | //*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "tooltip", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')
How can I grab that table? I'll also point out, when I do "View Page Source" in Google Chrome, the tables I want almost seem to be commented out? That is, they're typed in green, instead of the usual red/black/blue color scheme. That is not the case for the table of game results we pulled first. "View Page Source" for that table is the usual red/black/blue color scheme. Is the greenness indicative of what's preventing me from being able to grab this snap count table?
Thanks!
The information you are looking for is displayed programmatically at run time. One solution is to use RSelenium.
Looking at the web page's source, the tables are present in the code but hidden because they are stored as comments. My solution removes the comment markers and then reprocesses the page normally.
I saved the file to the working directory and then read the file in using the readLines function.
Now I search for the html’s begin and end comment flags and then remove them. I save the file a second time (less the comment flags) in order to reread and process the file for the selected tables.
gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
gameHtml %>% html_nodes("tbody")
#Only save and work with the body
body <- html_node(gameHtml, "body")
# write_xml() comes from xml2, which newer rvest versions no longer attach
xml2::write_xml(body, "nfl.xml")
#Find and remove comments
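lines <- readLines("nfl.xml")
# drop every line holding a comment marker; grepl() stays safe even when nothing matches
lines <- lines[!grepl("<!--", lines, fixed = TRUE)]
lines <- lines[!grepl("-->", lines, fixed = TRUE)]
writeLines(lines, "nfl2.xml")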
#Read the file back in and process normally
body<-read_html("nfl2.xml")
html_table(html_nodes(body, "table")[29])
#extract the attributes and find the attribute of interest
a<-html_attrs(html_nodes(body, "table"))
#find the tables of interest.
homesnap<-which(sapply(a, function(x){x[2]})=="home_snap_counts")
html_table(html_nodes(body, "table")[homesnap])
visitsnap<-which(sapply(a, function(x){x[2]})=="vis_snap_counts")
html_table(html_nodes(body, "table")[visitsnap])
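The uncommenting step can also be done in memory, without writing intermediate files. The sketch below is a variation on the answer above, not the answerer's code. It assumes a tiny made-up HTML string in place of the real boxscore page, with a table whose id matches the one in the question wrapped in an HTML comment.

```r
library(rvest)

# Assumption: a tiny stand-in for the real page, with the table inside an HTML comment
raw_html <- '<html><body><div id="all_home_snap_counts"><!--
<table id="home_snap_counts">
<tr><th>Player</th><th>Num</th></tr>
<tr><td>A</td><td>10</td></tr>
</table>
--></div></body></html>'

# Parsed as-is, the table cannot be selected because it lives inside a comment
n_before <- length(html_nodes(read_html(raw_html), "#home_snap_counts"))

# Strip the comment markers from the text, then parse again
uncommented <- gsub("<!--|-->", "", raw_html)
snap <- read_html(uncommented) %>%
  html_node("#home_snap_counts") %>%
  html_table()
snap
```

Note that this strips every comment marker in the document, so genuine comments become visible text as well. For the live page, the raw text would have to be obtained first, for example with readLines() on the URL pasted into a single string.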

R how to collapse the vector of character strings without losing the " " at the beginning and end of each word

I want to get the line up data from this site; http://festileaks.com/pinkpop-festival/pinkpop-programma-2015/ , as input for a function.
To use the data for my function I need it to be in a list. For this to work all the artists need to be between " " , which is the case in the vector of character strings. As soon as I collapse the vector into one string the "artist" changes into artist (without " "). Is there a way to get the artists into one list and not lose the " " ?
This is the code I use (Rpackage rvest);
Pinkpop_2015 <- read_html("http://festileaks.com/pinkpop-festival/pinkpop-programma-2015/")
Pinkpophtml <- html_nodes(Pinkpop_2015, ".g7-one_third > a")
Pinkpoplineup <- html_text(Pinkpophtml)
Pinkpoplineup <- gsub("Â", "", Pinkpoplineup)
Pinkpoplineup <- gsub("-", "", Pinkpoplineup)
Pinkpoplineup <- gsub("^ ", '', Pinkpoplineup)
Plineup_string = paste(Pinkpoplineup, collapse=" ,")
Ok.. so based on your new character vector (I took just the first three occurrences):
Pinkpoplineup <- "Vrijdag 12 juni\n Muse\n Elbow"
You can try
Pinkpoplineup <- unlist(strsplit(Pinkpoplineup, split="\n "))
Pinkpoplineup <- paste0('\"', Pinkpoplineup, '\"', collapse=",")
Quotes are now escaped and you can see the result using
cat(Pinkpoplineup)
or writing the object to a file
write(Pinkpoplineup, file = "Pinkpoplineup.txt")
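It may help to see that the quotes R prints around each element are only how print() displays character strings; they are not part of the strings themselves. The sketch below reuses the three values from the example above, adds literal quote characters with sprintf(), and also shows as.list() as one possible way to get a list (whether a list or a single quoted string is needed depends on the function being called, which is not shown in the question).

```r
lineup <- c("Vrijdag 12 juni", "Muse", "Elbow")

# nchar() confirms the display quotes are not stored in the string
nchar(lineup[2])

# Add literal double quotes around each element, then collapse into one string
quoted <- paste(sprintf('"%s"', lineup), collapse = ",")
cat(quoted, "\n")

# If an actual R list is needed, no quoting is required at all
lineup_list <- as.list(lineup)
str(lineup_list)
```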

Resources