"Rescue" command in R? - r

I have this code:
library(rvest)
url_list <- c("https://github.com/rails/rails/pull/100",
"https://github.com/rails/rails/pull/200",
"https://github.com/rails/rails/pull/300")
mine <- function(url){
  url_content <- html(url)
  url_mainnode <- html_node(url_content, "*")
  url_mainnode_text <- html_text(url_mainnode)
  url_mainnode_text <- gsub("\n", "", url_mainnode_text) # clean up the text
  url_mainnode_text
}
messages <- lapply(url_list, mine)
However, as I make the list longer I tend to run into:
Error in html.response(r, encoding = encoding) :
server error: (500) Internal Server Error
I know that in Ruby I can use rescue to keep iterating through a list even when some of the function calls fail. Is there something similar in R?

One option is to use try(). For more info, see ?try. Here's an implementation:
library(rvest)
url_list <- c("https://github.com/rails/rails/pull/100",
"https://github.com/rails/rails/pull/200",
"https://github.com/rails/rails/pull/300")
mine <- function(url){
  try(url_content <- html(url))
  url_mainnode <- html_node(url_content, "*")
  url_mainnode_text <- html_text(url_mainnode)
  url_mainnode_text <- gsub("\n", "", url_mainnode_text) # clean up the text
  url_mainnode_text
}
messages <- lapply(url_list, mine)
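Another option, if you want failing URLs to come back as NA instead of stopping the loop, is tryCatch(). Here is a minimal sketch of that idea (it also uses read_html(), the current name for the deprecated html()):
library(rvest)

mine_safely <- function(url){
  tryCatch({
    url_content <- read_html(url)        # read_html() is the current name for html()
    url_mainnode <- html_node(url_content, "*")
    url_mainnode_text <- html_text(url_mainnode)
    gsub("\n", "", url_mainnode_text)    # clean up the text
  }, error = function(e) NA_character_)  # a failed URL yields NA instead of an error
}

messages <- lapply(url_list, mine_safely)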

Related

error htmlTreeParse in R

I'm trying to get the body text from the webpage www.kinyo.es, but it returns this error:
Error in which(value == defs) :
argument "code" is missing, with no default
In addition: Warning messages:
1: XML content does not seem to be XML: 'Error displaying the error page: Application Instantiation Error: Could not connect to MySQL.'
2: XML content does not seem to be XML: ''
My code is the following loop:
# packages needed for the calls below
library(RCurl)
library(XML)
library(stringr)
library(stringi)

for(i in 1:n)
{
  # get the URL
  u <- webpage[i]
  doc <- getURL(u)
  # get the text from the body
  html <- htmlTreeParse(doc, useInternal = TRUE)
  txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
  txt <- toString(txt)
  txt
  # clean
  txt <- str_replace_all(txt, "[\r\n\t,]", "")
  txt <- tolower(txt)
  txt
  search <- c("wi-fi", "router", "switch", "adsl", "wireless")
  search
  stri_count_fixed(txt, search)
  conta[i] <- sum(stri_count_fixed(txt, search))
  #txt
}
This is a bit of a stretch, as I read your other questions and I can only suppose this is what you are after:
library(rvest)
library(stringr)
count_keywords <- function(url, keywords){
  read_html(url) %>%
    html_nodes(xpath = '//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]') %>%
    html_text() %>%
    toString() %>%
    str_count(keywords) %>%
    sum()
}
urls <- c('http://www.dlink.com/it/it', 'http://www.kinyo.es')
search <- c("Wi-Fi","Router","Switch","ADSL")
res <- sapply(urls, count_keywords, search)
res
#> http://www.dlink.com/it/it http://www.kinyo.es
#> 11 0
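If one of the sites fails to respond at all (as www.kinyo.es sometimes seems to, judging by the MySQL error above), sapply() will stop on the error. A small hedged wrapper around count_keywords() can turn that into NA instead:
count_keywords_safely <- function(url, keywords){
  # NA marks a site that could not be fetched or parsed
  tryCatch(count_keywords(url, keywords), error = function(e) NA_integer_)
}

res <- sapply(urls, count_keywords_safely, search)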

What is the argument error in xmlValue?

I am new to R. I want to scrape multiple HTML pages and create a dataset with columns containing specific data extracted via XPath. I found a useful scraping tutorial.
My plan was to follow the script in the link and make it work/understand first and then customize to my website/html/xpath.
However, when I run second block in the code (Scraping the Blog Posts), I get this error:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "xml_node".
This is the line that breaks the code:
pages<-sapply(pages,xmlValue)
pages variable contains a nodeset:
{xml_nodeset (1)}
[1] <span class="pages">Page 1 of 25</span>
I assume that xmlValue cannot be applied to this datatype or something of that nature.
Since the code in the tutorial works for the author, I may have missed something obvious, or there may be a problem with the library loading sequence and related masking of functions (although I played with that).
Any suggestion or assistance is much appreciated.
Consider XML as your only needed package with xpathSApply calls:
library(XML)

theURL <- "http://www.r-bloggers.com/search/web%20scraping"
page_data <- htmlParse(readLines(theURL, warn = FALSE))
pages <- xpathSApply(page_data, '//*[@id="leftcontent"]/div[11]/span[1]', xmlValue)
pages <- as.numeric(regmatches(pages, regexpr("[0-9]+$", pages)))

scrape_r_bloggers_page <- function(doc, page){
  titles <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlValue)
  descriptions <- xpathSApply(doc, '//div[contains(@id,"post")]/div[2]/p[1]', xmlValue)
  dates <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/div', xmlValue)
  authors <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/a', xmlValue)
  urls <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlValue)
  blog_posts_df <- data.frame(title = titles,
                              description = descriptions,
                              author = authors,
                              date = dates,
                              url = urls,
                              page = page)
}

blogsdf <- scrape_r_bloggers_page(page_data, 1)
blogsList <- lapply(c(2:(pages - 1)), function(page) {
  Sys.sleep(1)
  theURL <- paste("http://www.r-bloggers.com/search/web%20scraping/page/", page, "/", sep = "")
  page_data <- htmlParse(readLines(theURL, warn = FALSE))
  scrape_r_bloggers_page(page_data, page)
})
finaldf <- rbind(blogsdf, do.call(rbind, blogsList))
That "tutorial" is a weird mix of rvest and XML. If you use rvest, then use the functions in that package like html_text instead. The xml2 package also works well with rvest, but not XML. The warning message from html should also tell you its out of date.
page_data <- html(theURL)
## Warning message: 'html' is deprecated.
page_data %>%
  html_nodes(xpath = '//*[@id="leftcontent"]/div[11]/span[1]') %>%
  html_text()
[1] "Page 1 of 25"

how to scrape all pages (1,2,3,.....n) from a website using rvest

# I would like to read the list of .html files to extract data. Appreciate your help.
library(rvest)
library(XML)
library(stringr)
library(data.table)
library(RCurl)
u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- ("C:/R/BNB/")
pages <- html_text(html_node(u1, ".results_count"))
Total_Pages <- substr(pages, 4, 7)
TP <- as.numeric(Total_Pages)
# reading first two pages, writing them as separate .html files
for (i in 1:TP) {
  url <- paste(u0, "page=/", i, sep = "")
  download.file(url, paste(download_folder, i, ".html", sep = ""))
  # create html object
  html <- html(paste(download_folder, i, ".html", sep = ""))
}
Here is a potential solution:
library(rvest)
library(stringr)

u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- getwd() # note change in output directory
TP <- max(as.integer(html_text(html_nodes(u1, "a.page-numbers"))), na.rm = TRUE)

# read all TP pages, writing them as separate .html files
for (i in 1:TP) {
  url <- paste(u0, "page/", i, "/", sep = "")
  print(url)
  destfile <- file.path(download_folder, paste0(i, ".html"))
  download.file(url, destfile)
  # create html object
  html <- read_html(destfile)
}
I could not find the class .results_count in the html, so instead I looked for the page-numbers class and picked the highest returned value.
Also, the html() function is deprecated, so I replaced it with read_html().
Good luck
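To then meet the original goal of reading the saved .html files back in and extracting data, a sketch along these lines might work. The ".job_title" selector is a hypothetical placeholder; inspect the listing page and substitute the real one:
library(rvest)

# files written by the loop above
files <- file.path(download_folder, paste0(1:TP, ".html"))

job_titles <- unlist(lapply(files, function(f){
  page <- read_html(f)
  # ".job_title" is a made-up selector -- replace with the actual one for the listings
  html_text(html_nodes(page, ".job_title"))
}))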

Handling temperamental errors in R

I'm using R to scrape a list of ~1,000 URLs. The script often fails in a way which is not reproducible; when I re-run it, it may succeed or it may fail at a different URL. This leads me to believe that the problem may be caused by my internet connection momentarily dropping or by a momentary error on the server whose URL I'm scraping.
How can I design my R code to continue to the next URL if it encounters an error? I've tried using the try function but that doesn't seem to work for this scenario.
library(XML)
df <- data.frame(URL = c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/"))
for (i in 1:nrow(df)) {
  URL <- df$URL[i]
  # Exception handling
  Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
  if (inherits(Test, "try-error")) next
  HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  print(URL)
  print(Result[1])
}
Let's assume that the URL to be scraped is accessible at this step:
Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
if(inherits(Test, "try-error")) next
But then the URL stops working just before this step:
HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
Then htmlTreeParse won't work, R will throw up a warning/error, and my for loop will break. I want the for loop to continue to the next URL to be scraped - how can I accomplish this?
Thanks
Try this:
library(XML)
library(httr)
df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")
for (i in 1:length(df)) {
  URL <- df[i]
  response <- GET(URL)
  if (response$status_code != 200) next
  HTML <- htmlTreeParse(content(response, type = "text"), useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}
# [1] "http://www.ask.com/"
# [1] "\n \n Answers \n "
# [1] "http://www.bing.com/"
# [1] "Images"
So there are potentially (at least) two things going on here: the http request fails, or there are no <li> tags in the response. This uses GET(...) in the httr package to return the whole response and check the status code. It also checks for absence of <li> tags.
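One more wrinkle worth hedging: if the connection drops mid-request, GET() itself can throw an error before any status code is returned. A minimal sketch that also guards that call (same URLs as above, with GET() wrapped in tryCatch()):
library(XML)
library(httr)

df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")

for (i in 1:length(df)) {
  URL <- df[i]
  # NULL signals that the request itself failed (e.g. a dropped connection)
  response <- tryCatch(GET(URL), error = function(e) NULL)
  if (is.null(response) || status_code(response) != 200) next
  HTML <- htmlTreeParse(content(response, type = "text"), useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}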

How to fix the error in R of "no lines available in input"?

What I need to do is read data from hundreds of links, and some of the links contain no data. With this code:
urls <-paste0("http://somelink.php?station=",station, "&start=", Year, "01-01&etc")
myData <- lapply(urls, read.table, header = TRUE, sep = '|')
an error pops up saying "no lines available in input". I've tried using try, but I get the same error. Please help, thanks.
Here are 2 possible solutions (untested because your example is not reproducible):
Using try:
myData <- lapply(urls, function(x) {
  tmp <- try(read.table(x, header = TRUE, sep = '|'))
  if (!inherits(tmp, 'try-error')) tmp
})
Using tryCatch:
myData <- lapply(urls, function(x) {
  tryCatch(read.table(x, header = TRUE, sep = '|'), error = function(e) NULL)
})
Does this help?
dims <- sapply(myData, dim)[2,]
bad_Ones <- myData[dims==1]
good_Ones <- myData[dims>1]
If myData still grabs something off the station page, the above code should separate the myData list into two groups. good_Ones would be the list you would want to work with (assuming the above is accurate, of course).
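With the tryCatch() version above, failed URLs come back as NULL, so another simple way to split the results is a small base-R sketch like this:
# keep only the reads that produced a data frame
good_Ones <- Filter(Negate(is.null), myData)
# and note which URLs failed
bad_urls <- urls[vapply(myData, is.null, logical(1))]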
