What is the argument error in xmlValue?

I am new to R. I want to scrape multiple HTML pages and create a dataset whose columns contain specific data extracted via XPath. I found a useful scraping tutorial.
My plan was to follow the script in the link, get it working and understand it first, and then customize it to my own website/HTML/XPath.
However, when I run the second block in the code (Scraping the Blog Posts), I get this error:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "xml_node".
This is the line that breaks the code:
pages <- sapply(pages, xmlValue)
The pages variable contains a nodeset:
{xml_nodeset (1)}
[1] <span class="pages">Page 1 of 25</span>
I assume that xmlValue cannot be applied to this datatype or something of that nature.
Since the code in the tutorial works for the author, I may have missed something obvious, or there may be a problem with the library loading order and the related masking of functions (although I have played with that).
Any suggestion or assistance is much appreciated.
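(For context: html_nodes comes from rvest and returns xml2 node objects, which XML::xmlValue does not understand. A minimal sketch of the direct fix for the breaking line, separate from the answers below, is to use rvest's own extractor, html_text:)
library(rvest)

# pages is the xml_nodeset returned by html_nodes;
# html_text is the rvest/xml2 counterpart of XML::xmlValue
pages <- html_text(pages)
pages
# [1] "Page 1 of 25"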

Consider using XML as your only package, with xpathSApply calls:
library(XML)
theURL <- "http://www.r-bloggers.com/search/web%20scraping"
page_data <- htmlParse(readLines(theURL, warn = FALSE))

# total number of result pages, e.g. "Page 1 of 25" -> 25
pages <- xpathSApply(page_data, '//*[@id="leftcontent"]/div[11]/span[1]', xmlValue)
pages <- as.numeric(regmatches(pages, regexpr("[0-9]+$", pages)))

scrape_r_bloggers_page <- function(doc, page) {
  titles       <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlValue)
  descriptions <- xpathSApply(doc, '//div[contains(@id,"post")]/div[2]/p[1]', xmlValue)
  dates        <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/div', xmlValue)
  authors      <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/a', xmlValue)
  # pull the href attribute for the link rather than the anchor text
  urls         <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlGetAttr, "href")
  data.frame(title = titles,
             description = descriptions,
             author = authors,
             date = dates,
             url = urls,
             page = page)
}

blogsdf <- scrape_r_bloggers_page(page_data, 1)
blogsList <- lapply(2:(pages - 1), function(page) {
  Sys.sleep(1)
  theURL <- paste("http://www.r-bloggers.com/search/web%20scraping/page/", page, "/", sep = "")
  page_data <- htmlParse(readLines(theURL, warn = FALSE))
  scrape_r_bloggers_page(page_data, page)
})
finaldf <- rbind(blogsdf, do.call(rbind, blogsList))

That "tutorial" is a weird mix of rvest and XML. If you use rvest, then use the functions in that package like html_text instead. The xml2 package also works well with rvest, but not XML. The warning message from html should also tell you its out of date.
page_data <- html(theURL)
##Warning message: 'html' is deprecated.
page_data %>%
html_nodes(xpath='//*[#id="leftcontent"]/div[11]/span[1]') %>%
html_text
[1] "Page 1 of 25"

Related

Read files recursively from github repos

I have some files on GitHub that I would like to read recursively in R. If I do this, I get a list of all files.
library(httr)
req <- GET("https://api.github.com/repos/jakevdp/data-USstates/git/trees/master?recursive=1")
stop_for_status(req)
all.files <- unlist(lapply(content(req)$tree, "["), use.names = F)
file.names.only <- unlist(lapply(content(req)$tree, "[", "path"), use.names = F)
This is not quite what I wanted. I would like to be able to read these from the repository itself, just like using list.files locally. How can I make this work? Or, at least, how can I get a list of full URLs to each file in the repository so that they can be read directly?
Say, from this repository: https://github.com/jakevdp/data-USstates
We can do this fairly simply with the rvest library. We select the links by using the .js-navigation-open html node, and then pull the href values from the links. We get a couple of empty strings with that, and .[. != ""] removes those.
library(rvest)
fileList <- read_html("https://github.com/jakevdp/data-USstates") %>%
  html_nodes(".js-navigation-open") %>%
  html_attr("href") %>%
  .[. != ""]   # remove empty elements
fileList
# [1] "/jakevdp/data-USstates/blob/master/README.md"         "/jakevdp/data-USstates/blob/master/state-abbrevs.csv"
# [3] "/jakevdp/data-USstates/blob/master/state-areas.csv"   "/jakevdp/data-USstates/blob/master/state-population.csv"

Extracting data from html pages using regex

I feel like I'm very close to a solution here but can't seem to figure out why I'm not getting any result. I have an html page and I'm trying to parse out some IDs from it. I'm 99% certain my regex code is right, but for some reason I'm not getting any output.
In the html source, there are many ids that are wrapped with text like: /boardgame/9999/asdf. My regex code should pull out the /9999/ bit, but I can't figure out why it's just returning the same input html character string that I put in.
library(RCurl)
library(XML)
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.parse <- sub("boardgame(.*?)[a-z]", "\\1", html)
Any thoughts?
I think your pattern was not accurate. In this case, you were also picking up other words starting with "boardgames".
This should work for one single ID.
id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
In my hands, it returns:
[1] "226501"
Also, I found many IDs in this html page. To catch them all in one list, you could do as follows.
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.list <- list()
while (regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html) > 0) {
  id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
  my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
  id.list[[length(id.list) + 1]] <- gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
  html <- substr(html, id.pos + attributes(id.pos)$match.length, nchar(html))
}
id.list
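A more compact alternative (a sketch using the same pattern, not part of the answer above) collects every match in a single pass with gregexpr and regmatches:
library(RCurl)

url <- "https://boardgamegeek.com/browse/boardgame/page/1"
html <- getURL(url, followlocation = TRUE)

# all "boardgame/<digits>/" occurrences at once
matches <- regmatches(html, gregexpr("boardgame/[[:digit:]]{3,10}/", html))[[1]]

# strip everything but the digits and drop duplicates
ids <- unique(gsub("[^[:digit:]]", "", matches))
ids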

R XML xpath queries return NULL or list()

I would like to extract data as data frames from an XML file available at: http://www.uniprot.org/uniprot/P43405.xml
I only get back empty results, although I think the XPath queries are okay.
library(RCurl)
library(XML)
url <- "http://www.uniprot.org/uniprot/P43405.xml"
urldata <- getURL(url)
xmlfile <- xmlParse(urldata)
# some xpath queries
xmlfile["//entry/comment[#type='function']/text"]
xmlfile["//entry/comment[#type='PTM']/text"]
xpathSApply(xmlfile,"//uniprot/entry",xmlGetAttr, 'dataset')
xpathSApply(xmlfile,"//uniprot/entry",xmlValue)
Can anyone help me with this problem?
Thanks, Frank
Namespaces are missing:
library(RCurl)
library(XML)
url <- "http://www.uniprot.org/uniprot/P43405.xml"
urldata <- getURL(url)
xmlfile <- xmlParse(urldata)
getNodeSet(xmlfile, "//entry//comment")
namespaces <- c(ns="http://uniprot.org/uniprot")
getNodeSet(xmlfile, "//ns:entry//ns:comment", namespaces)
getNodeSet(xmlfile, "//ns:entry//ns:comment[@type='PTM']/ns:text", namespaces)
xpathSApply(xmlfile,"//ns:uniprot/ns:entry",xmlGetAttr, 'dataset', namespaces=namespaces)
xpathSApply(xmlfile,"//ns:uniprot/ns:entry",xmlValue, namespaces=namespaces)
References:
?xpathApply
How can I use xpath querying using R's XML library?
Thanks for the help! Yes, the namespaces were missing. I added some additional code; maybe it will help others get familiar with XML.
library(RCurl)
library(XML)
url <- "http://www.uniprot.org/uniprot/P43405.xml"
urldata <- getURL(url)
xmlfile <- xmlParse(urldata)
getNodeSet(xmlfile, "//entry//comment")   # returns list() - one needs the namespace here
namespaces <- c(ns="http://uniprot.org/uniprot")
# extract all comments, make a data frame
comments.uniprot <- getNodeSet(xmlfile, "//ns:entry//ns:comment", namespaces)
comments.dataframe <- as.data.frame(sapply(comments.uniprot, xmlValue))
comments.attributes <- as.data.frame(sapply(comments.uniprot, xmlGetAttr,'type'))
comments.all <- cbind(comments.attributes,comments.dataframe)
# only extract PTM comments
PTMs <- getNodeSet(xmlfile, "//ns:entry//ns:comment[@type='PTM']/ns:text", namespaces)
PTMs2 <- sapply(PTMs, xmlValue)
PTMs2.dataframe <- as.data.frame(PTMs2)
xpathSApply(xmlfile,"//ns:uniprot/ns:entry",xmlGetAttr, 'dataset', namespaces=namespaces)
xpathSApply(xmlfile,"//ns:uniprot/ns:entry/ns:accession",xmlValue, namespaces=namespaces)

xpathSApply not finding required node

I'm trying to write some code to return the values of a given element in an xml feed. The following code works for all of the feeds except uk_legislation_feed. Can someone give me a hint as to why this might be and how to fix the problem? Thanks.
library(XML)
uk_legislation_feed <- c("http://www.legislation.gov.uk/new/data.feed", "xml", "//title")
test_feed <- c("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml", "xml", "//zipcode")
ons_feed <- c("https://www.ons.gov.uk/releasecalendar?rss", "xml", "//title")
read_data <- function(feed) {
  if (feed[2] == "xml") {
    if (!file.exists(feed[1])) download.file(feed[1], "tmp.xml", "curl")
    dat <- xmlRoot(xmlTreeParse("tmp.xml", useInternalNodes = TRUE))
  }
  titles <- xpathSApply(dat, feed[3], xmlValue)
  return(titles)
}
Because the uk_legislation_feed document declares a default namespace, http://www.w3.org/2005/Atom, without a prefix, unprefixed XPath expressions do not match its nodes. Hence, you will need to map a prefix to that namespace URI and use it in the XPath expression:
url <- "http://www.legislation.gov.uk/new/data.feed"
webpage <- readLines(url)
file <- xmlParse(webpage)
nmsp <- c(ns="http://www.w3.org/2005/Atom")
titles <- xpathSApply(file, "//ns:title", xmlValue,
namespaces = nmsp)
titles
# [1] "Search Results"
# [2] "The Air Navigation (Restriction of Flying) (RNAS Culdrose) (Amendment) \
# Regulations 2016"
...
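To fold this into the original read_data helper, one option (a sketch, assuming each feed vector gains an optional fourth element holding the namespace URI; not part of the answer above) is:
library(XML)

uk_legislation_feed <- c("http://www.legislation.gov.uk/new/data.feed", "xml", "//ns:title",
                         "http://www.w3.org/2005/Atom")
ons_feed <- c("https://www.ons.gov.uk/releasecalendar?rss", "xml", "//title")

read_data <- function(feed) {
  # download to a temporary file first, then parse locally
  tmp <- tempfile(fileext = ".xml")
  download.file(feed[1], tmp, quiet = TRUE)
  dat <- xmlParse(tmp)
  if (length(feed) >= 4 && nzchar(feed[4])) {
    # the feed declares a default namespace: map it to the "ns" prefix
    xpathSApply(dat, feed[3], xmlValue, namespaces = c(ns = feed[4]))
  } else {
    xpathSApply(dat, feed[3], xmlValue)
  }
}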

Handling temperamental errors in R

I'm using R to scrape a list of ~1,000 URLs. The script often fails in a way which is not reproducible; when I re-run it, it may succeed or it may fail at a different URL. This leads me to believe that the problem may be caused by my internet connection momentarily dropping or by a momentary error on the server whose URL I'm scraping.
How can I design my R code to continue to the next URL if it encounters an error? I've tried using the try function but that doesn't seem to work for this scenario.
library(XML)
df <- data.frame(URL=c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/"))
for (i in 1:nrow(df)) {
  URL <- df$URL[i]
  # Exception handling
  Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
  if (inherits(Test, "try-error")) next
  HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  print(URL)
  print(Result[1])
}
Let's assume that the URL to be scraped is accessible at this step:
Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
if(inherits(Test, "try-error")) next
But then the URL stops working just before this step:
HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
Then htmlTreeParse won't work, R will throw a warning/error, and my for loop will break. I want the for loop to continue to the next URL to be scraped; how can I accomplish this?
Thanks
Try this:
library(XML)
library(httr)
df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")
for (i in 1:length(df)) {
  URL <- df[i]
  response <- GET(URL)
  if (response$status_code != 200) next
  HTML <- htmlTreeParse(content(response, type = "text"), useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}
# [1] "http://www.ask.com/"
# [1] "\n \n Answers \n "
# [1] "http://www.bing.com/"
# [1] "Images"
So there are potentially (at least) two things going on here: the HTTP request fails, or there are no <li> tags in the response. This uses GET(...) from the httr package to return the whole response and check the status code, and it also checks for the absence of <li> tags.
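A dropped connection can also make GET itself throw an error rather than return a non-200 status, so one further safeguard (a sketch, not part of the answer above) is to wrap the request in tryCatch and skip the URL on any failure:
library(XML)
library(httr)

df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")

for (i in seq_along(df)) {
  URL <- df[i]
  # NULL signals any request failure (timeout, dropped connection, DNS error, ...)
  response <- tryCatch(GET(URL, timeout(10)), error = function(e) NULL)
  if (is.null(response) || status_code(response) != 200) next
  HTML <- htmlTreeParse(content(response, type = "text"), useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}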
