xpathSApply not finding required node - r

I'm trying to write some code to return the values of a given element in an xml feed. The following code works for all of the feeds except uk_legislation_feed. Can someone give me a hint as to why this might be and how to fix the problem? Thanks.
library(XML)
uk_legislation_feed <- c("http://www.legislation.gov.uk/new/data.feed", "xml", "//title")
test_feed <- c("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml", "xml", "//zipcode")
ons_feed <- c("https://www.ons.gov.uk/releasecalendar?rss", "xml", "//title")
read_data <- function(feed) {
  if (feed[2] == "xml") {
    if (!file.exists(feed[1])) download.file(feed[1], "tmp.xml", "curl")
    dat <- xmlRoot(xmlTreeParse("tmp.xml", useInternalNodes = TRUE))
  }
  titles <- xpathSApply(dat, feed[3], xmlValue)
  return(titles)
}

The uk_legislation_feed is an Atom feed that declares a default namespace, xmlns="http://www.w3.org/2005/Atom", with no prefix, so unprefixed XPath expressions such as //title do not match its nodes. You need to register the namespace URI under a prefix of your own and use that prefix in the XPath expression:
url <- "http://www.legislation.gov.uk/new/data.feed"
webpage <- readLines(url)
file <- xmlParse(webpage)
nmsp <- c(ns="http://www.w3.org/2005/Atom")
titles <- xpathSApply(file, "//ns:title", xmlValue, namespaces = nmsp)
titles
# [1] "Search Results"
# [2] "The Air Navigation (Restriction of Flying) (RNAS Culdrose) (Amendment) \
# Regulations 2016"
...
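If you want to keep the question's generic read_data() helper, here is a minimal namespace-aware sketch. The extra ns argument, the prefixed XPath in the feed definition, and the unconditional download are my additions, not part of the original code:

library(XML)

read_data <- function(feed, ns = NULL) {
  if (feed[2] == "xml") {
    # always refresh the local copy in this sketch
    download.file(feed[1], "tmp.xml", method = "curl")
    dat <- xmlParse("tmp.xml")
  }
  if (is.null(ns)) {
    xpathSApply(dat, feed[3], xmlValue)
  } else {
    xpathSApply(dat, feed[3], xmlValue, namespaces = ns)
  }
}

# the Atom feed needs a prefixed XPath plus the matching namespace mapping
uk_legislation_feed <- c("http://www.legislation.gov.uk/new/data.feed", "xml", "//ns:title")
read_data(uk_legislation_feed, ns = c(ns = "http://www.w3.org/2005/Atom"))

# feeds without a default namespace keep working with the plain XPath
ons_feed <- c("https://www.ons.gov.uk/releasecalendar?rss", "xml", "//title")
read_data(ons_feed)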

Related

Extracting data from html pages using regex

I feel like I'm very close to a solution here but can't seem to figure out why I'm not getting any result. I have an html page and I'm trying to parse out some IDs from it. I'm 99% certain my regex code is right, but for some reason I'm not getting any output.
In the html source, there are many ids that are wrapped with text like: /boardgame/9999/asdf. My regex code should pull out the /9999/ bit, but I can't figure out why it's just returning the same input html character string that I put in.
library(RCurl)
library(XML)
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.parse <- sub("boardgame(.*?)[a-z]", "\\1", html)
Any thoughts?
I think your pattern was not accurate: it also picked up other text starting with "boardgame". And since sub() only replaces the matched portion, the rest of the html string comes back unchanged, which is why you get the same input back.
This should work for one single ID.
id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
In my hands, it returns:
[1] "226501"
Also, I found many IDs on this html page. To catch them all in one list, you could do the following.
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.list <- list()
while (regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html) > 0) {
  id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
  my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
  id.list[[length(id.list) + 1]] <- gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
  html <- substr(html, id.pos + attributes(id.pos)$match.length, nchar(html))
}
id.list
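As a side note, base R can also collect all the IDs in one pass without the while loop. A small sketch, assuming the /boardgame/<id>/ pattern shown in the question:

library(RCurl)

html <- getURL("https://boardgamegeek.com/browse/boardgame/page/1", followlocation = TRUE)
# gregexpr() finds every match at once and regmatches() extracts the matched substrings
ids <- regmatches(html, gregexpr("boardgame/[[:digit:]]{3,10}/", html))[[1]]
# keep only the digits and drop duplicates
unique(gsub("[^[:digit:]]", "", ids))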

How can I remove "/url?q=" from text data in R

I want to remove "/url?q=" from text data in R Studio.
This is my code for google search:
## Code for Google Search
# Enter Search Term Here
search.term <- "r-project"
# Creating Function
getGoogleURL <- function(search.term, domain = '.co.in', quotes = TRUE) {
  # Getting Search Term
  search.term <- gsub(' ', '%20', search.term)
  if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  # Putting Search Term in Google Search
  getGoogleURL <- paste('http://www.google', domain, '/search?q=', search.term, sep = '')
}
## Get Links from Google Search
# Creating Function to Get URLs From Search Results
getGoogleLinks <- function(google.url) {
  # Creating a File to Save URLs
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R(3.4.0)"))
  # Removing HTML code and Setting Nodes
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//h3[@class='r']//a")
  return(sapply(nodes, function(x) xmlAttrs(x)[["href"]]))
}
## Remove quoted text, Create URL List
quotes <- "FALSE"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
links <- getGoogleLinks(search.url)
## Print URL List
links
And my result is:
[1] "/url?q=https://www.r-project.org/&sa=U&ved=0ahUKEwj78ZWXoabUAhUcTI8KHaTEDTIQFggUMAA&usg=AFQjCNEqtiOAIA7OOTa3meWC8zaTjjTy8A"
[2] "/url?q=http://www.cran.r-project.org/&sa=U&ved=0ahUKEwj78ZWXoabUAhUcTI8KHaTEDTIQjBAIGzAB&usg=AFQjCNF8QmYbLzG0c66QZM2wsXF1n1-9tQ"
What can I do to remove "/url?q=" from the links above?
You can use gsub. Note that ? is a regex metacharacter, so it has to be escaped (or pass fixed = TRUE):
gsub("/url?q=", "", links)
I solved it this way, since the prefix is a fixed number of characters:
links <- substring(links,8)
As an alternative to @JTeam's answer, you can try this (given that the links always start with /url?q=); note the collapse = '=' so the remaining '=' signs inside the URL are put back:
lapply(links, function(x) paste0(strsplit(x, '=')[[1]][-1], collapse = '='))
This gives you a nice list of clean links (if you prefer a vector, try sapply)
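If you also want to drop Google's tracking parameters and decode the percent-escapes, here is a sketch that assumes every link has the /url?q=<target>&sa=... shape shown in the output above:

# strip the prefix, cut everything from the first "&sa=" onwards, then decode
clean_links <- sub("^/url\\?q=", "", links)
clean_links <- sub("&sa=.*$", "", clean_links)
sapply(clean_links, URLdecode, USE.NAMES = FALSE)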

What is the argument error in xmlValue?

I am new to R. I want to scrape multiple html pages and create a dataset whose columns contain specific data extracted via XPath. I found a useful scraping tutorial.
My plan was to follow the script in the link, get it working and understand it first, and then customize it to my own website/html/xpath.
However, when I run second block in the code (Scraping the Blog Posts), I get this error:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "xml_node".
This is the line that breaks the code:
pages<-sapply(pages,xmlValue)
pages variable contains a nodeset:
{xml_nodeset (1)}
[1] <span class="pages">Page 1 of 25</span>
I assume that xmlValue cannot be applied to this datatype or something of that nature.
Since the code in the tutorial works for the author, I may have missed something obvious or there is a problem of the library loading sequence and related masking of functions. (although I played with that).
Any suggestion or assistance is much appreciated.
Consider using XML as your only needed package, with xpathSApply calls:
library(XML)
theURL <- "http://www.r-bloggers.com/search/web%20scraping"
page_data <- htmlParse(readLines(theURL, warn = FALSE))
pages <- xpathSApply(page_data, '//*[@id="leftcontent"]/div[11]/span[1]', xmlValue)
pages <- as.numeric(regmatches(pages, regexpr("[0-9]+$", pages)))
scrape_r_bloggers_page <- function(doc, page) {
  titles <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlValue)
  descriptions <- xpathSApply(doc, '//div[contains(@id,"post")]/div[2]/p[1]', xmlValue)
  dates <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/div', xmlValue)
  authors <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/a', xmlValue)
  urls <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlGetAttr, 'href')
  blog_posts_df <- data.frame(title = titles,
                              description = descriptions,
                              author = authors,
                              date = dates,
                              url = urls,
                              page = page)
}
blogsdf <- scrape_r_bloggers_page(page_data, 1)
blogsList <- lapply(c(2:(pages-1)), function(page) {
  Sys.sleep(1)
  theURL <- paste("http://www.r-bloggers.com/search/web%20scraping/page/", page, "/", sep = "")
  page_data <- htmlParse(readLines(theURL, warn = FALSE))
  scrape_r_bloggers_page(page_data, page)
})
finaldf <- rbind(blogsdf, do.call(rbind, blogsList))
That "tutorial" is a weird mix of rvest and XML. If you use rvest, then use the functions in that package like html_text instead. The xml2 package also works well with rvest, but not XML. The warning message from html should also tell you its out of date.
page_data <- html(theURL)
##Warning message: 'html' is deprecated.
page_data %>%
html_nodes(xpath='//*[@id="leftcontent"]/div[11]/span[1]') %>%
html_text
[1] "Page 1 of 25"

R XML xpath queries return NULL or list()

I would like to extract data as dataframes from an XML file available under: http://www.uniprot.org/uniprot/P43405.xml
I only get back empty results, although I think that the xpath queries are okay.
library(RCurl)
library(XML)
url <- "http://www.uniprot.org/uniprot/P43405.xml"
urldata <- getURL(url)
xmlfile <- xmlParse(urldata)
# some xpath queries
xmlfile["//entry/comment[#type='function']/text"]
xmlfile["//entry/comment[#type='PTM']/text"]
xpathSApply(xmlfile,"//uniprot/entry",xmlGetAttr, 'dataset')
xpathSApply(xmlfile,"//uniprot/entry",xmlValue)
Can anyone help me with this problem?
Thanks, Frank
Namespaces are missing:
library(RCurl)
library(XML)
url <- "http://www.uniprot.org/uniprot/P43405.xml"
urldata <- getURL(url)
xmlfile <- xmlParse(urldata)
getNodeSet(xmlfile, "//entry//comment")
namespaces <- c(ns="http://uniprot.org/uniprot")
getNodeSet(xmlfile, "//ns:entry//ns:comment", namespaces)
getNodeSet(xmlfile, "//ns:entry//ns:comment[#type='PTM']/ns:text", namespaces)
xpathSApply(xmlfile,"//ns:uniprot/ns:entry",xmlGetAttr, 'dataset', namespaces=namespaces)
xpathSApply(xmlfile,"//ns:uniprot/ns:entry",xmlValue, namespaces=namespaces)
References:
?xpathApply
How can I use xpath querying using R's XML library?
Thanks for the help! Yes, the namespaces were missing. I added some additional code; maybe that will help others get familiar with XML.
library(RCurl)
library(XML)
url <- "http://www.uniprot.org/uniprot/P43405.xml"
urldata <- getURL(url)
xmlfile <- xmlParse(urldata)
getNodeSet(xmlfile, "//entry//comment")
# one needs the namespace here
namespaces <- c(ns="http://uniprot.org/uniprot")
# extract all comments, make a data frame
comments.uniprot <- getNodeSet(xmlfile, "//ns:entry//ns:comment", namespaces)
comments.dataframe <- as.data.frame(sapply(comments.uniprot, xmlValue))
comments.attributes <- as.data.frame(sapply(comments.uniprot, xmlGetAttr,'type'))
comments.all <- cbind(comments.attributes,comments.dataframe)
# only extract PTM comments
PTMs <- getNodeSet(xmlfile, "//ns:entry//ns:comment[@type='PTM']/ns:text", namespaces)
PTMs2 <- sapply(PTMs, xmlValue)
PTMs2.dataframe <- as.data.frame(PTMs2)
xpathSApply(xmlfile,"//ns:uniprot/ns:entry",xmlGetAttr, 'dataset', namespaces=namespaces)
xpathSApply(xmlfile,"//ns:uniprot/ns:entry/ns:accession",xmlValue, namespaces=namespaces)

Handling temperamental errors in R

I'm using R to scrape a list of ~1,000 URLs. The script often fails in a way which is not reproducible; when I re-run it, it may succeed or it may fail at a different URL. This leads me to believe that the problem may be caused by my internet connection momentarily dropping or by a momentary error on the server whose URL I'm scraping.
How can I design my R code to continue to the next URL if it encounters an error? I've tried using the try function but that doesn't seem to work for this scenario.
library(XML)
df <- data.frame(URL=c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/"))
for (i in 1:nrow(df)) {
  URL <- df$URL[i]
  # Exception handling
  Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
  if (inherits(Test, "try-error")) next
  HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  print(URL)
  print(Result[1])
}
Let's assume that the URL to be scraped is accessible at this step:
Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
if(inherits(Test, "try-error")) next
But then the URL stops working just before this step:
HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
Then htmlTreeParse won't work, R will throw a warning/error, and my for loop will break. I want the for loop to continue to the next URL to be scraped - how can I accomplish this?
Thanks
Try this:
library(XML)
library(httr)
df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")
for (i in 1:length(df)) {
  URL <- df[i]
  response <- GET(URL)
  if (response$status_code != 200) next
  HTML <- htmlTreeParse(content(response, type = "text"), useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}
# [1] "http://www.ask.com/"
# [1] "\n \n Answers \n "
# [1] "http://www.bing.com/"
# [1] "Images"
So there are potentially (at least) two things going on here: the http request fails, or there are no <li> tags in the response. This uses GET(...) in the httr package to return the whole response and check the status code. It also checks for absence of <li> tags.
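GET(...) itself can still throw an R error if the connection drops entirely (DNS failure, timeout), so to be safe you can wrap the whole body in tryCatch. A sketch along those lines (the skip message and the first_li/items names are mine):

library(XML)
library(httr)

urls <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")
for (URL in urls) {
  first_li <- tryCatch({
    response <- GET(URL)
    stop_for_status(response)  # turn HTTP error codes into R errors
    HTML <- htmlTreeParse(content(response, as = "text", encoding = "UTF-8"),
                          useInternalNodes = TRUE)
    items <- xpathSApply(HTML, "//li", xmlValue)
    if (length(items) == 0) NA_character_ else items[1]
  }, error = function(e) {
    message("skipping ", URL, ": ", conditionMessage(e))
    NA_character_
  })
  if (is.na(first_li)) next
  print(URL)
  print(first_li)
}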
