Extracting data from HTML pages using regex in R

I feel like I'm very close to a solution here but can't figure out why I'm not getting any result. I have an HTML page and I'm trying to parse some IDs out of it. I'm 99% certain my regex is right, but for some reason I'm not getting any output.
In the HTML source, there are many IDs wrapped in text like /boardgame/9999/asdf. My regex should pull out the /9999/ bit, but I can't figure out why it just returns the same input HTML character string that I put in.
library(RCurl)
library(XML)
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.parse <- sub("boardgame(.*?)[a-z]", "\\1", html)
Any thoughts?

I think your pattern was not accurate: it was also picking up other words that start with "boardgame", such as "boardgames". Note as well that sub() returns the whole input string with the match replaced rather than just the captured group, which is why you get the entire HTML back.
This should work for a single ID:
# locate the first "boardgame/<digits>/<letter>" occurrence
id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
# extract the matching substring, then strip the leading "boardgame/" and the trailing "/<letters>"
my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
In my hands, it returns:
[1] "226501"
There are also many IDs in this HTML page. To catch them all in one list, you could do as follows.
url <- "https://boardgamegeek.com/browse/boardgame/page/1"
html <- getURL(url, followlocation = TRUE)
id.list <- list()
while (regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html) > 0) {
  id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
  my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
  id.list[[length(id.list) + 1]] <- gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
  # drop everything up to the current match and keep searching the remainder
  html <- substr(html, id.pos + attributes(id.pos)$match.length, nchar(html))
}
id.list
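For comparison, all of the IDs can also be pulled out in one pass with gregexpr() and regmatches(); this is just a base-R sketch along the same lines, not part of the answer above:
library(RCurl)

url <- "https://boardgamegeek.com/browse/boardgame/page/1"
html <- getURL(url, followlocation = TRUE)

# find every "boardgame/<digits>/" occurrence in the page at once
matches <- regmatches(html, gregexpr("boardgame/[[:digit:]]+/", html))[[1]]

# keep only the digits and drop duplicates (each game is linked more than once)
ids <- unique(gsub("[^[:digit:]]", "", matches))
head(ids)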

Related

R - Getting "Error: '...' does not exist in current working directory", but it's clearly there

I'm trying to write a loop in R. The first part of my code, which concatenates a URL with each of three years (1999-2001), works fine.
url <- 'https://www.baseball-almanac.com/players/baseball_births.php?order=LastName,%20FirstName&y='
birth_yrs <- as.character(1999:2001)
for (i in birth_yrs) {
  nam <- paste("year", i, sep = ".")
  assign(nam, i)
  nam2 <- paste(url, i, sep = "")
  assign(nam, paste(url, i, sep = ""))
}
This gives me the following values in my Global Environment:
[Screenshot: view of my Global Environment]
What I would like to do now is use the read_html() function from the xml2 package in a loop to save each HTML page. My code is the following:
for (i in birth_yrs) {
  nam3 <- paste("baseball", i, sep = ".")
  assign(nam3, read_html(paste("year", i, sep = "")))
}
Running this code gives me the following error message:
Error: 'year1999' does not exist in current working directory ('C:/Users/.....').
When I run the code:
test <- read_html(year.1999)
It works perfectly with no issues.
Any suggestions would be greatly appreciated.
Thank you.
As stated by @Waldi, you are providing a text string. If you want to use the content of a variable when all you have is its name as a string, you can use get(). When you give get() a string, it searches for a variable whose name matches that string and returns the content stored in that variable. Try:
for (i in birth_yrs) {
  nam3 <- paste("baseball", i, sep = ".")
  assign(nam3, read_html(get(paste("year", i, sep = "."))))
}
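To see the difference in isolation, here is a minimal illustration (not part of the original answer):
x <- "https://www.baseball-almanac.com"
get("x")  # returns the URL string stored in the variable x
# read_html(get("x")) would therefore fetch that URL, whereas
# read_html("x") would look for a local file literally named "x"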
Store the data in a list. You can then use lapply to extract whatever values you want from each page, as sketched after the code below.
library(rvest)
url <- paste0('https://www.baseball-almanac.com/players/baseball_births.php?order=LastName,%20FirstName&y=', 1999:2001)
url_data <- lapply(url, read_html)
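For example, once every page is parsed into url_data, a single lapply can pull values out of each document. A sketch of the idea ("h1" is only a placeholder selector, not taken from the actual page):
library(rvest)

# url_data is the list of parsed pages created above
page_headings <- lapply(url_data, function(page) {
  page %>%
    html_nodes("h1") %>%   # placeholder selector; replace with the element you need
    html_text()
})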

What is the argument error in xmlValue?

I am new to R. I want to scrape multiple HTML pages and create a dataset whose columns hold specific data extracted via XPath. I found a useful scraping tutorial.
My plan was to follow the script in the link, get it working and understand it first, and then customize it to my own website/HTML/XPath.
However, when I run the second block in the code (Scraping the Blog Posts), I get this error:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "xml_node".
This is the line that breaks the code:
pages<-sapply(pages,xmlValue)
The pages variable contains a nodeset:
{xml_nodeset (1)}
[1] <span class="pages">Page 1 of 25</span>
I assume that xmlValue cannot be applied to this datatype or something of that nature.
Since the code in the tutorial works for the author, I may have missed something obvious, or there is a problem with the library loading order and the related masking of functions (although I experimented with that).
Any suggestion or assistance is much appreciated.
Consider XML as the only package you need, using xpathSApply calls:
library(XML)

theURL <- "http://www.r-bloggers.com/search/web%20scraping"
page_data <- htmlParse(readLines(theURL, warn = FALSE))

# total number of result pages, e.g. "Page 1 of 25" -> 25
pages <- xpathSApply(page_data, '//*[@id="leftcontent"]/div[11]/span[1]', xmlValue)
pages <- as.numeric(regmatches(pages, regexpr("[0-9]+$", pages)))

scrape_r_bloggers_page <- function(doc, page) {
  titles       <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlValue)
  descriptions <- xpathSApply(doc, '//div[contains(@id,"post")]/div[2]/p[1]', xmlValue)
  dates        <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/div', xmlValue)
  authors      <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/a', xmlValue)
  urls         <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlGetAttr, "href")
  data.frame(title = titles,
             description = descriptions,
             author = authors,
             date = dates,
             url = urls,
             page = page)
}

blogsdf <- scrape_r_bloggers_page(page_data, 1)
blogsList <- lapply(2:(pages - 1), function(page) {
  Sys.sleep(1)
  theURL <- paste("http://www.r-bloggers.com/search/web%20scraping/page/", page, "/", sep = "")
  page_data <- htmlParse(readLines(theURL, warn = FALSE))
  scrape_r_bloggers_page(page_data, page)
})
finaldf <- rbind(blogsdf, do.call(rbind, blogsList))
That "tutorial" is a weird mix of rvest and XML. If you use rvest, then use the functions in that package like html_text instead. The xml2 package also works well with rvest, but not XML. The warning message from html should also tell you its out of date.
page_data <- html(theURL)
## Warning message: 'html' is deprecated.
page_data %>%
  html_nodes(xpath = '//*[@id="leftcontent"]/div[11]/span[1]') %>%
  html_text()
# [1] "Page 1 of 25"

How to scrape all pages (1, 2, 3, ..., n) from a website using rvest

I would like to read the list of .html files to extract data. I'd appreciate your help.
library(rvest)
library(XML)
library(stringr)
library(data.table)
library(RCurl)

u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- "C:/R/BNB/"
pages <- html_text(html_node(u1, ".results_count"))
Total_Pages <- substr(pages, 4, 7)
TP <- as.numeric(Total_Pages)
# read each page and write it out as a separate .html file
for (i in 1:TP) {
  url <- paste(u0, "page=/", i, sep = "")
  download.file(url, paste(download_folder, i, ".html", sep = ""))
  # create html object
  html <- html(paste(download_folder, i, ".html", sep = ""))
}
Here is a potential solution:
library(rvest)
library(stringr)

u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- getwd()  # note the change in output directory

# highest page number shown in the pagination links
TP <- max(as.integer(html_text(html_nodes(u1, "a.page-numbers"))), na.rm = TRUE)

# read each page and write it out as a separate .html file
for (i in 1:TP) {
  url <- paste(u0, "page/", i, "/", sep = "")
  print(url)
  download.file(url, paste(download_folder, "/", i, ".html", sep = ""))
  # create html object
  html <- read_html(paste(download_folder, "/", i, ".html", sep = ""))
}
I could not find the .results_count class in the HTML, so instead I looked for the page-numbers class and picked the highest value returned.
Also, the html() function is deprecated, so I replaced it with read_html().
Good luck
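Once the files are on disk, you can read them back and pull out whatever you need. A rough sketch (the ".job_listing" selector is only a guess and would need to be checked against the actual page markup):
library(rvest)

# parse every saved page and extract the text of each job listing
saved_files <- paste(download_folder, "/", 1:TP, ".html", sep = "")
job_text <- lapply(saved_files, function(f) {
  read_html(f) %>%
    html_nodes(".job_listing") %>%   # hypothetical selector; inspect the page to find the real one
    html_text()
})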

xpathSApply not finding required node

I'm trying to write some code to return the values of a given element in an XML feed. The following code works for all of the feeds except uk_legislation_feed. Can someone give me a hint as to why this might be and how to fix the problem? Thanks.
library(XML)
uk_legislation_feed <- c("http://www.legislation.gov.uk/new/data.feed", "xml", "//title")
test_feed <- c("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml", "xml", "//zipcode")
ons_feed <- c("https://www.ons.gov.uk/releasecalendar?rss", "xml", "//title")
read_data <- function(feed) {
  if (feed[2] == "xml") {
    if (!file.exists(feed[1])) download.file(feed[1], "tmp.xml", "curl")
    dat <- xmlRoot(xmlTreeParse("tmp.xml", useInternalNodes = TRUE))
  }
  titles <- xpathSApply(dat, feed[3], xmlValue)
  return(titles)
}
The uk_legislation_feed document declares a default namespace, http://www.w3.org/2005/Atom, with no prefix, so unprefixed XPath expressions such as //title do not match its nodes. You need to register that namespace URI under a prefix of your own and use the prefix in the XPath expression:
url <- "http://www.legislation.gov.uk/new/data.feed"
webpage <- readLines(url)
file <- xmlParse(webpage)
nmsp <- c(ns="http://www.w3.org/2005/Atom")
titles <- xpathSApply(file, "//ns:title", xmlValue,
namespaces = nmsp)
titles
# [1] "Search Results"
# [2] "The Air Navigation (Restriction of Flying) (RNAS Culdrose) (Amendment) \
# Regulations 2016"
...
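If you want to keep the read_data() wrapper from the question, one option is to let each feed carry its own namespace. A sketch, assuming a fourth element of the feed vector holds the namespace URI (my own convention, not part of the original code):
library(XML)

uk_legislation_feed <- c("http://www.legislation.gov.uk/new/data.feed", "xml",
                         "//ns:title", "http://www.w3.org/2005/Atom")

read_data <- function(feed) {
  if (feed[2] == "xml") {
    download.file(feed[1], "tmp.xml", "curl")
    dat <- xmlParse("tmp.xml")
  }
  # pass a namespace mapping only when the feed supplies one (4th element)
  if (length(feed) >= 4) {
    xpathSApply(dat, feed[3], xmlValue, namespaces = c(ns = feed[4]))
  } else {
    xpathSApply(dat, feed[3], xmlValue)
  }
}

read_data(uk_legislation_feed)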

Handling temperamental errors in R

I'm using R to scrape a list of ~1,000 URLs. The script often fails in a way which is not reproducible; when I re-run it, it may succeed or it may fail at a different URL. This leads me to believe that the problem may be caused by my internet connection momentarily dropping or by a momentary error on the server whose URL I'm scraping.
How can I design my R code to continue to the next URL if it encounters an error? I've tried using the try function but that doesn't seem to work for this scenario.
library(XML)
df <- data.frame(URL=c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/"))
for (i in 1:nrow(df)) {
  URL <- df$URL[i]
  # Exception handling
  Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
  if (inherits(Test, "try-error")) next
  HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  print(URL)
  print(Result[1])
}
Let's assume that the URL to be scraped is accessible at this step:
Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
if(inherits(Test, "try-error")) next
But then the URL stops working just before this step:
HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
Then htmlTreeParse won't work, R will throw a warning/error, and my for loop will break. I want the for loop to continue to the next URL to be scraped - how can I accomplish this?
Thanks
Try this:
library(XML)
library(httr)
df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")
for (i in 1:length(df)) {
  URL <- df[i]
  response <- GET(URL)
  if (response$status_code != 200) next
  HTML <- htmlTreeParse(content(response, type = "text"), useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}
# [1] "http://www.ask.com/"
# [1] "\n \n Answers \n "
# [1] "http://www.bing.com/"
# [1] "Images"
So there are (at least) two things potentially going on here: the HTTP request fails, or there are no <li> tags in the response. The code above uses GET(...) from the httr package to return the whole response and check the status code, and it skips any URL whose response contains no <li> tags.
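GET() itself can still throw an error when the connection drops entirely (DNS failure, timeout and so on), so if a flaky connection really is the culprit you may also want to wrap the request in tryCatch(). A sketch of that idea (safe_get is my own helper name, not part of httr):
library(httr)

safe_get <- function(url) {
  # return NULL instead of stopping when the request itself errors out
  tryCatch(GET(url, timeout(10)), error = function(e) NULL)
}

for (i in 1:length(df)) {
  URL <- df[i]
  response <- safe_get(URL)
  if (is.null(response) || status_code(response) != 200) next
  # ... parse and print as in the loop above ...
}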
