I'm using R to scrape a list of ~1,000 URLs. The script often fails in a way which is not reproducible; when I re-run it, it may succeed or it may fail at a different URL. This leads me to believe that the problem may be caused by my internet connection momentarily dropping or by a momentary error on the server whose URL I'm scraping.
How can I design my R code to continue to the next URL if it encounters an error? I've tried using the try function but that doesn't seem to work for this scenario.
library(XML)
df <- data.frame(URL=c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/"))
for (i in 1:nrow(df)) {
  URL <- df$URL[i]
  # Exception handling
  Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
  if (inherits(Test, "try-error")) next
  HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  print(URL)
  print(Result[1])
}
Let's assume that the URL to be scraped is accessible at this step:
Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
if(inherits(Test, "try-error")) next
But then the URL stops working just before this step:
HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
Then htmlTreeParse won't work, R will throw a warning/error, and my for loop will break. I want the for loop to continue to the next URL to be scraped. How can I accomplish this?
Thanks
Try this:
library(XML)
library(httr)
df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")
for (i in 1:length(df)) {
  URL <- df[i]
  response <- GET(URL)
  if (response$status_code != 200) next
  HTML <- htmlTreeParse(content(response, type = "text"), useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}
# [1] "http://www.ask.com/"
# [1] "\n \n Answers \n "
# [1] "http://www.bing.com/"
# [1] "Images"
So there are potentially (at least) two things going on here: the HTTP request fails, or there are no <li> tags in the response. This uses GET(...) from the httr package to return the whole response so the status code can be checked, and it also skips pages that have no <li> tags.
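If the connection itself drops, GET(...) can throw an error instead of returning a response, and that would still stop the loop. A minimal sketch of guarding against that with tryCatch, reusing the loop above (the 10-second timeout is an arbitrary choice):
library(XML)
library(httr)
df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")
for (i in 1:length(df)) {
  URL <- df[i]
  # Return NULL instead of raising an error if the request itself fails
  response <- tryCatch(GET(URL, timeout(10)), error = function(e) NULL)
  if (is.null(response) || status_code(response) != 200) next
  HTML <- htmlTreeParse(content(response, type = "text"), useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}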
I'm working on a project collecting data from https://www.hockey-reference.com/boxscores/. I'm trying to get every table for a season. I've generated a list of URLs by combining https://www.hockey-reference.com/boxscores/ with each date of the calendar and each team name, like "https://www.hockey-reference.com/boxscores/20171005WSH.html".
I've stored every URL in a list, but some of them lead to a 404 error. I'm trying to use the RCurl package's url.exists function to find out in advance which URLs will return a 404 and remove them from the list. The problem is that every URL in the list (including URLs that really do exist) returns FALSE from url.exists inside the for loop... I've also tried calling url.exists(url_list[i]) in the console, but it still returns FALSE.
Here's my code:
library(rvest)
library(RCurl)

##### Variables ####
team_names = c("ANA","ARI","BOS","BUF","CAR","CGY","CHI","CBJ","COL","DAL","DET","EDM","FLA","LAK","MIN","MTL","NSH","NJD","NYI","NYR","OTT","PHI","PHX","PIT","SJS","STL","TBL","TOR","VAN","VGK","WPG","WSH")
S2017 = read.table(file = "2018_season", header = TRUE, sep = ",")
dates = as.character(S2017[,1])

#### formatting the dates ####
for (i in 1:length(dates)) {
  dates[i] = gsub("-", "", dates[i])
}
dates = unique(dates)

##### generating the URLs ####
url_list = c()
for (j in 1:2) { # dates
  for (k in 1:length(team_names)) {
    print(k)
    url_site = paste("https://www.hockey-reference.com/boxscores/", dates[j], team_names[k], ".html", sep = "")
    url_list = rbind(url_site, url_list)
  }
}

url_list_raffined = c()
for (l in 1:40) {
  print(l)
  if (url.exists(url_list[l], .header = TRUE) == TRUE) {
    url_list_raffined = c(url_list_raffined, url_list[l])
  }
}
Any ideas about my problem?
Thanks
Instead of RCurl, you could use the httr package:
library(httr)
library(rvest)
library(xml2)
resp <- httr::GET(url_address, httr::timeout(60))
if (resp$status_code == 200) {
  html <- xml2::read_html(resp)
  txt <- rvest::html_text(html) # or select specific nodes first with rvest::html_nodes()
  # save the results somewhere or do your operations..
}
Here, url_address is the address you are trying to download. You will probably need to put this in a function or loop to iterate over all your addresses.
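For example, a hedged sketch of how the url_list / url_list_raffined loop from the question might look with this approach (the 60-second timeout is an arbitrary choice):
library(httr)
url_list_raffined <- c()
for (l in 1:length(url_list)) {
  # A dropped connection raises an error rather than returning a response, so guard it
  resp <- tryCatch(httr::GET(url_list[l], httr::timeout(60)), error = function(e) NULL)
  if (!is.null(resp) && httr::status_code(resp) == 200) {
    url_list_raffined <- c(url_list_raffined, url_list[l])
  }
}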
I feel like I'm very close to a solution here, but I can't seem to figure out why I'm not getting any result. I have an HTML page and I'm trying to parse some IDs out of it. I'm 99% certain my regex is right, but for some reason I'm not getting any output.
In the HTML source there are many IDs wrapped in text like /boardgame/9999/asdf. My regex should pull out the /9999/ bit, but I can't figure out why it just returns the same input HTML character string that I put in.
library(RCurl)
library(XML)
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.parse <- sub("boardgame(.*?)[a-z]", "\\1", html)
Any thoughts?
I think your pattern was not accurate; in this case it was also picking up other words starting with "boardgames". (Note too that sub() returns the whole input string with only the matched portion replaced, which is why you get the full HTML back.)
This should work for a single ID.
id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
In my hands, it returns:
[1] "226501"
Also, I found many IDs in this html page. To catch them all in one list, you could do as follows.
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.list <- list()
while (regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html) > 0) {
  id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
  my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
  id.list[[length(id.list) + 1]] <- gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
  html <- substr(html, id.pos + attributes(id.pos)$match.length, nchar(html))
}
id.list
id.list
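As a side note (a sketch, not part of the answer above): gregexpr finds every match in one pass, so the same IDs can be collected without the while loop.
library(RCurl)
url <- "https://boardgamegeek.com/browse/boardgame/page/1"
html <- getURL(url, followlocation = TRUE)
# Find every "boardgame/<digits>/" occurrence at once, then keep only the digits
matches <- regmatches(html, gregexpr("boardgame/[[:digit:]]{3,10}/", html))[[1]]
ids <- unique(gsub("[^[:digit:]]", "", matches))
head(ids)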
I am new to R. I want to scrape multiple HTML pages and create a dataset whose columns contain specific data extracted via XPath. I found a useful scraping tutorial.
My plan was to follow the script in the link and make it work/understand first and then customize to my website/html/xpath.
However, when I run the second block of code (Scraping the Blog Posts), I get this error:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "xml_node".
This is the line that breaks the code:
pages<-sapply(pages,xmlValue)
The pages variable contains a nodeset:
{xml_nodeset (1)}
[1] <span class="pages">Page 1 of 25</span>
I assume that xmlValue cannot be applied to this data type, or something of that nature.
Since the code in the tutorial works for the author, I may have missed something obvious, or there is a problem with the library loading sequence and the related masking of functions (although I played with that).
Any suggestion or assistance is much appreciated.
Consider XML as your only needed package with xpathSApply calls:
library(XML)

theURL <- "http://www.r-bloggers.com/search/web%20scraping"
page_data <- htmlParse(readLines(theURL, warn = FALSE))
pages <- xpathSApply(page_data, '//*[@id="leftcontent"]/div[11]/span[1]', xmlValue)
pages <- as.numeric(regmatches(pages, regexpr("[0-9]+$", pages)))

scrape_r_bloggers_page <- function(doc, page){
  titles <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlValue)
  descriptions <- xpathSApply(doc, '//div[contains(@id,"post")]/div[2]/p[1]', xmlValue)
  dates <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/div', xmlValue)
  authors <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/a', xmlValue)
  urls <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlGetAttr, "href")
  blog_posts_df <- data.frame(title = titles,
                              description = descriptions,
                              author = authors,
                              date = dates,
                              url = urls,
                              page = page)
}

blogsdf <- scrape_r_bloggers_page(page_data, 1)

blogsList <- lapply(c(2:(pages-1)), function (page) {
  Sys.sleep(1)
  theURL <- paste("http://www.r-bloggers.com/search/web%20scraping/page/", page, "/", sep = "")
  page_data <- htmlParse(readLines(theURL, warn = FALSE))
  scrape_r_bloggers_page(page_data, page)
})

finaldf <- rbind(blogsdf, do.call(rbind, blogsList))
That "tutorial" is a weird mix of rvest and XML. If you use rvest, then use the functions in that package like html_text instead. The xml2 package also works well with rvest, but not XML. The warning message from html should also tell you its out of date.
page_data <- html(theURL)
## Warning message: 'html' is deprecated.
page_data %>%
  html_nodes(xpath = '//*[@id="leftcontent"]/div[11]/span[1]') %>%
  html_text
[1] "Page 1 of 25"
I have this code:
library(rvest)
url_list <- c("https://github.com/rails/rails/pull/100",
"https://github.com/rails/rails/pull/200",
"https://github.com/rails/rails/pull/300")
mine <- function(url){
  url_content <- html(url)
  url_mainnode <- html_node(url_content, "*")
  url_mainnode_text <- html_text(url_mainnode)
  url_mainnode_text <- gsub("\n", "", url_mainnode_text) # clean up the text
  url_mainnode_text
}
messages <- lapply(url_list, mine)
However, as I make the list longer, I tend to run into:
Error in html.response(r, encoding = encoding) :
server error: (500) Internal Server Error
I know that in Ruby I can use rescue to keep iterating through a list even if some attempts at applying a function fail. Is there something similar in R?
One option is to use try() (see ?try for more info). Here's an implementation:
library(rvest)
url_list <- c("https://github.com/rails/rails/pull/100",
"https://github.com/rails/rails/pull/200",
"https://github.com/rails/rails/pull/300")
mine <- function(url){
  url_content <- try(html(url), silent = TRUE)
  if (inherits(url_content, "try-error")) return(NA_character_) # skip URLs whose request fails
  url_mainnode <- html_node(url_content, "*")
  url_mainnode_text <- html_text(url_mainnode)
  url_mainnode_text <- gsub("\n", "", url_mainnode_text) # clean up the text
  url_mainnode_text
}
messages <- lapply(url_list, mine)
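An alternative sketch using tryCatch instead of try, which makes the fallback value explicit (NA_character_ is just an arbitrary placeholder for URLs that fail):
mine <- function(url){
  tryCatch({
    url_content <- html(url)
    url_mainnode <- html_node(url_content, "*")
    url_mainnode_text <- html_text(url_mainnode)
    gsub("\n", "", url_mainnode_text) # clean up the text
  }, error = function(e) NA_character_) # any error (e.g. a 500) yields NA instead of stopping lapply
}
messages <- lapply(url_list, mine)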
I'm trying to get a data table off of a website using the RCurl package. My code works successfully for the URL that you get to by clicking through the website:
http://statsheet.com/mcb/teams/air-force/game_stats/
Once you try to select previous years (which I want), my code no longer works.
Example link:
http://statsheet.com/mcb/teams/air-force/game_stats?season=2012-2013
I'm guessing this has something to do with the reserved symbol(s) in the year-specific address. I've tried URLencode as well as manually encoding the address, but that hasn't worked either.
My code:
library(RCurl)
library(XML)
#Define URL
theurl <- URLencode("http://statsheet.com/mcb/teams/air-force/game_stats?season=2012-2013", reserved = TRUE)
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[1]/thead[1]/tr[2]/th", xmlValue)
results <- xpathSApply(pagetree,"//*/table[1]/tbody/tr/td", xmlValue)
content <- as.data.frame(matrix(results, ncol = 19, byrow = TRUE))
testtablehead <- c("W/L","Opponent",tablehead[c(2:18)])
names(content) <- testtablehead
The relevant error that R returns:
Error in function (type, msg, asError = TRUE) :
  Could not resolve host: http%3a%2f%2fstatsheet.com%2fmcb%2fteams%2fair-force%2fgame_stats%3fseason%3d2012-2013; No data record of requested type
Does anyone have an idea what the problem is and how to fix it?
Skip the unneeded encoding and the manual download of the URL:
library(XML)
url <- "http://statsheet.com/mcb/teams/air-force/game_stats?season=2012-2013"
pagetree <- htmlTreeParse(url, useInternalNodes = TRUE)
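From there, the original XPath calls should work against pagetree. As a further sketch (assuming the page layout is unchanged), readHTMLTable from the XML package can also pull every table on the page directly:
library(XML)
url <- "http://statsheet.com/mcb/teams/air-force/game_stats?season=2012-2013"
pagetree <- htmlTreeParse(url, useInternalNodes = TRUE)
# Reuse the original XPath queries...
tablehead <- xpathSApply(pagetree, "//*/table[1]/thead[1]/tr[2]/th", xmlValue)
# ...or let readHTMLTable return every table on the page as a list of data frames
tables <- readHTMLTable(pagetree, stringsAsFactors = FALSE)
str(tables, max.level = 1)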