Web-scraping with xpathSApply

I am doing some web scraping with packages XML and html, and I need to isolate the country name and the two numeric values that you see below:
<tr><td>Tonga</td>
<td class="RightAlign">3,000</td>
<td class="RightAlign">6,000</td>
</tr>
Here is the code I've written so far. I think that I just need the right regexes?
# a vector to store the results
pages<-character(0)
country_names<-character(0)
# go through all 6 pages containing the info we want, and store
# the html in a list
for (i in 1:6) {
  who_search <- paste(who_url, i, '.html', sep='')
  page <- htmlTreeParse(who_search, useInternalNodes = TRUE)
  pages <- c(page, pages)
  # extract the country names on this page ("????" is the XPath being asked for)
  country <- xpathSApply(page, "????", xmlValue)
  country_names <- c(country, country_names)
}

There is no need to use xpathSApply here; use readHTMLTable instead:
library(XML)
library(RCurl)
page = htmlParse('http://www.who.int/diabetes/facts/world_figures/en/index4.html')
readHTMLTable(page)
Country 2000 2030
1 Albania 86,000 188,000
2 Andora 6,000 18,000
3 Armenia 120,000 206,000
4 Austria 239,000 366,000
5 Azerbaijan 337,000 733,000
6 Belarus 735,000 922,000
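The figures come back as text with thousands separators, so they usually need converting before any arithmetic. A minimal sketch of that step, using a few rows retyped from the output above as stand-in data (the name `who` and the hand-built data frame are assumptions, not the live page):

```r
# Stand-in for the table returned by readHTMLTable (three rows retyped by hand).
who <- data.frame(
  Country = c("Albania", "Andora", "Armenia"),
  `2000`  = c("86,000", "6,000", "120,000"),
  `2030`  = c("188,000", "18,000", "206,000"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Strip the thousands separators and convert the year columns to numeric.
# as.character() guards against the columns arriving as factors.
num_cols <- c("2000", "2030")
who[num_cols] <- lapply(who[num_cols], function(col) {
  as.numeric(gsub(",", "", as.character(col), fixed = TRUE))
})

str(who)
```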
Using xpathSApply (note the use of gsub to clean the result):
country <- xpathSApply(page, '//*[@id="primary"]/table/tbody/tr',
function(x) gsub('\n','' ,xmlValue(x))
)
> country
[1] "Albania 86,000 188,000 "
[2] "Andora 6,000 18,000 "
[3] "Armenia 120,000 206,000 "
[4] "Austria 239,000 366,000 "
[5] "Azerbaijan 337,000 733,000 "
EDIT: As mentioned in the comments, we can use xpathSApply without gsub by selecting the individual cells:
val = xpathSApply(page, '//tbody/tr/td', xmlValue) ##gets a vector of table
as.data.frame(matrix(val, ncol=3, byrow=TRUE)) ##transform to matrix
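For the original question about isolating the country name and the two numbers, here is a small sketch that runs against the HTML fragment from the question instead of the live site. Wrapping the row in a `<table>` element and the variable names `snippet`, `country` and `figures` are assumptions made for illustration; the XPath expressions may need adjusting for the real page layout.

```r
library(XML)

# The row from the question, wrapped in a <table> so it parses as a table row.
snippet <- '<table><tr><td>Tonga</td>
<td class="RightAlign">3,000</td>
<td class="RightAlign">6,000</td>
</tr></table>'

doc <- htmlParse(snippet, asText = TRUE)

# First cell of each row holds the country name.
country <- xpathSApply(doc, "//tr/td[1]", xmlValue)

# Cells with class "RightAlign" hold the two figures; drop the commas.
figures <- xpathSApply(doc, "//tr/td[@class='RightAlign']", xmlValue)
figures <- as.numeric(gsub(",", "", figures, fixed = TRUE))

country
figures
```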

Related

Extract all text & tags between two heading tags (<h3>) with rvest

This page shows six sections listing people between <h3> tags.
How can I use XPath to select these six sections separately (using rvest), perhaps into a nested list? My goal is to later lapply through these six sections to fetch the people's names and affiliations (separated by section).
The HTML isn't so well-structured, i.e. not every text is located within specific tags. An example:
<h3>Editor-in-Chief</h3>
Claudio Ronco – <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark – <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi – <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />
I access the site with the following code:
journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- rvest::html_session(journal_url,
httr::user_agent("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"))
webpage <- rvest::html_nodes(webpage, css = '#editorialboard')
I tried various XPaths to extract the six sections with html_nodes into a nested list of six lists, but none of them work properly:
# this gives me a list of 190 (instead of 6) elements, leaving out the text between <i> and </i>
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 190 (instead of 6) elements, leaving out text that are not between tags
webpage <- rvest::html_nodes(webpage, xpath = '//*[preceding-sibling::h3 and following-sibling::h3]')
# error "VECTOR_ELT() can only be applied to a 'list', not a 'logical'"
webpage <- rvest::html_nodes(webpage, xpath = '//* and text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 274 (instead of 6) elements
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3]')

Are you ok with an ugly solution that does not use XPath? I don't think you can get a nested list from the structure of this website... But I am not very experienced in xpath.
I first got the headings, divided the raw text using the heading names and then, within each group, divided the members using '\n' as a separator.
library(rvest)
journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- read_html(journal_url) %>% html_node(css = '#editorialboard')
# get h3 headings
headings <- webpage %>% html_nodes('h3') %>% html_text()
# get raw text
raw.text <- webpage %>% html_text()
# split raw text on h3 headings and put in a list
list.members <- list()
raw.text.2 <- raw.text
for (h in headings) {
# split on headings
b <- strsplit(raw.text.2, h, fixed=TRUE)
# split members using \n as separator
c <- strsplit(b[[1]][1], '\n', fixed=TRUE)
# clean empty elements from vector
c <- list(c[[1]][c[[1]] != ""])
# add vector of member to list
list.members <- c(list.members, c)
# update text
raw.text.2 <- b[[1]][2]
}
# remove first element of main list
list.members <- list.members[2:length(list.members)]
# add final segment of raw.text to list
c <- strsplit(raw.text.2, '\n', fixed=TRUE)
c <- list(c[[1]][c[[1]] != ""])
list.members <- c(list.members, c)
# add names to list
names(list.members) <- headings
Then you get a list of the groups and each element of the list is a vector with strings for each member (using all info)
> list.members$`Editor-in-Chief`
[1] "Claudio Ronco – St. Bartolo Hospital, Vicenza, Italy"
> list.members$`Clinical Engineering`
[1] "William R. Clark – Purdue University, West Lafayette, IN, USA"
[2] "Hideyuki Kawanashi – Tsuchiya General Hospital, Hiroshima, Japan"
[3] "Tadayuki Kawasaki – Mobara Clinic, Mobara City, Japan"
[4] "Jeongchul Kim – Wake Forest School of Medicine, Winston-Salem, NC, USA"
[5] "Anna Lorenzin – International Renal Research Institute of Vicenza, Vicenza, Italy"
[6] "Ikuto Masakane – Honcho Yabuki Clinic, Yamagata City, Japan"
[7] "Michio Mineshima – Tokyo Women's Medical University, Tokyo, Japan"
[8] "Tomotaka Naramura – Kurashiki University of Science and the Arts, Kurashiki, Japan"
[9] "Mauro Neri – International Renal Research Institute of Vicenza, Vicenza, Italy"
[10] "Masanori Shibata – Koujukai Rehabilitation Hospital, Nagoya City, Japan"
[11] "Yoshihiro Tange – Kyushu University of Health and Welfare, Nobeoka-City, Japan"
[12] "Yoshiaki Takemoto – Osaka City University, Osaka City, Japan"
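A possible alternative, sketched on a shortened copy of the HTML from the question and not tested against the live page: walk the direct children of the container, start a new group at every h3, and treat each br as a line break. It assumes the xml2 package (which rvest builds on); the inline `html` string and the names `board`, `kids` and `sections` are made up for the example, and the en dash is replaced by a plain hyphen to keep the snippet ASCII-only.

```r
library(xml2)

html <- '<div id="editorialboard"><h3>Editor-in-Chief</h3>
Claudio Ronco - <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark - <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi - <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />
</div>'

board <- xml_find_first(read_html(html), "//div[@id='editorialboard']")
kids  <- xml_contents(board)          # child nodes, including bare text nodes
tag   <- xml_name(kids)               # "h3", "text", "i", "br", ...

is_head <- tag == "h3"
group   <- cumsum(is_head)            # every h3 opens a new group
pieces  <- ifelse(tag == "br", "\n", xml_text(kids))

# Keep only content that follows a heading; the factor keeps empty sections too.
keep     <- !is_head & group > 0
by_group <- split(pieces[keep],
                  factor(group[keep], levels = seq_len(sum(is_head))))

sections <- lapply(by_group, function(p) {
  lines <- trimws(strsplit(paste(p, collapse = ""), "\n", fixed = TRUE)[[1]])
  lines[nzchar(lines)]
})
names(sections) <- xml_text(kids[is_head])

sections
```

Because the grouping uses the node order and not the heading text, it does not depend on a heading string being unique within the raw text.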

openxlsx: read.xlsx throws an error if the sheet name contains the "&" character

Create an .xlsx file with three sheets named: "Test 1", "S&P500 TR" and "SP500 TR". Put some random content in each sheet and save it as "Book1.xlsx".
Run:
> a <- getSheetNames("Book1.xlsx")
> a
[1] "Test 1" "S&P500 TR" "SP500 TR"
Now try:
> read.xlsx("Book1.xlsx", a[2])
Error in read.xlsx.default("Book1.xlsx", a[2]) :
Cannot find sheet named "S&P500 TR"

First check whether typing the sheet name "S&P500 TR" directly, instead of using a[2], changes anything.
Alternatively, you can use the readxl package for importing:
library(readxl)
X1 <- read_excel("C:/1.xls", sheet = "S&P500 TR")
This is a spreadsheet that I had, and this is the result after importing it:
head(X1)
# A tibble: 6 × 4
# Year Month Community ` Average Daily`
# <dbl> <chr> <chr> <dbl>
# 1 2016 Jan Arlington 5.35
# 2 2016 Jan Ashland 1.26
# 3 2016 Jan Bedford 2.62
# 4 2016 Jan Belmont 3.03
# 5 2016 Jan Boston 84.89
# 6 2016 Jan Braintree 8.16

I ran into the same problem, but found a workaround. First load the workbook using loadWorkbook() (read.xlsx() would return a data frame, not a workbook). Then rename the problematic sheet to avoid the ampersand. To fix the code in your example:
wb = loadWorkbook("Book1.xlsx")
renameWorksheet(wb, "S&P500 TR", "NEW NAME")
output = read.xlsx(wb, "NEW NAME")
Hope this helps!

First load the workbook, then use the which and grepl functions to return the sheet number containing the sheet name (which can include the '&' character when done in this way). This seems to work quite well in an application I am currently working on.
An (incomplete) example is given below that should be easily modified to your context. In my case 'i' is a file name (looping over many files). The "toy" code is here:
wb <- loadWorkbook(file = i)
which( grepl("CAPEX & Depreciation", names(wb)) )
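To round this off, here is a hedged sketch of reading the sheet by position once its index is known. It relies on getSheetNames() returning the unescaped names (as shown in the question) and on read.xlsx() accepting a numeric sheet index; `sheet_position` is a hypothetical helper, and the file is only touched if "Book1.xlsx" from the question exists in the working directory.

```r
# Hypothetical helper: position of an exactly matching sheet name.
sheet_position <- function(sheet_names, target) {
  idx <- match(target, sheet_names)
  if (is.na(idx)) stop("No sheet named '", target, "'")
  idx
}

path <- "Book1.xlsx"  # file name taken from the question
if (file.exists(path) && requireNamespace("openxlsx", quietly = TRUE)) {
  idx   <- sheet_position(openxlsx::getSheetNames(path), "S&P500 TR")
  sp500 <- openxlsx::read.xlsx(path, sheet = idx)  # read by index, not by name
}
```

Exact matching with match() avoids the case where a grepl pattern hits more than one sheet.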

Quantmod FRED Metadata in R

library(quantmod)
getSymbols("GDPC1",src = "FRED")
I am trying to extract the numerical economic/financial data in FRED, but also the metadata. I am trying to chart CPI and use the metadata as labels/footnotes. Is there a way to extract this data using the quantmod package?
Title: Real Gross Domestic Product
Series ID: GDPC1
Source: U.S. Department of Commerce: Bureau of Economic Analysis
Release: Gross Domestic Product
Seasonal Adjustment: Seasonally Adjusted Annual Rate
Frequency: Quarterly
Units: Billions of Chained 2009 Dollars
Date Range: 1947-01-01 to 2014-01-01
Last Updated: 2014-06-25 7:51 AM CDT
Notes: BEA Account Code: A191RX1
Real gross domestic product is the inflation adjusted value of the
goods and services produced by labor and property located in the
United States.
For more information see the Guide to the National Income and Product
Accounts of the United States (NIPA) -
(http://www.bea.gov/national/pdf/nipaguid.pdf)

You can use the same code that's in the body of getSymbols.FRED, but change ".csv" to ".xls", then read the metadata you're interested in from the .xls file.
library(gdata)
Symbol <- "GDPC1"
FRED.URL <- "http://research.stlouisfed.org/fred2/series"
tmp <- tempfile()
download.file(paste0(FRED.URL, "/", Symbol, "/downloaddata/", Symbol, ".xls"),
destfile=tmp)
read.xls(tmp, nrows=17, header=FALSE)
# V1 V2
# 1 Title: Real Gross Domestic Product
# 2 Series ID: GDPC1
# 3 Source: U.S. Department of Commerce: Bureau of Economic Analysis
# 4 Release: Gross Domestic Product
# 5 Seasonal Adjustment: Seasonally Adjusted Annual Rate
# 6 Frequency: Quarterly
# 7 Units: Billions of Chained 2009 Dollars
# 8 Date Range: 1947-01-01 to 2014-01-01
# 9 Last Updated: 2014-06-25 7:51 AM CDT
# 10 Notes: BEA Account Code: A191RX1
# 11 Real gross domestic product is the inflation adjusted value of the
# 12 goods and services produced by labor and property located in the
# 13 United States.
# 14
# 15 For more information see the Guide to the National Income and Product
# 16 Accounts of the United States (NIPA) -
# 17 (http://www.bea.gov/national/pdf/nipaguid.pdf)
Instead of hardcoding nrows=17, you can use grep to search for the row that has the headers of the data, and subset to only include rows before that.
dat <- read.xls(tmp, header=FALSE, stringsAsFactors=FALSE)
dat[seq_len(grep("DATE", dat[, 1])-1),]
unlink(tmp) # remove the temp file when you're done with it.
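To use the metadata as chart labels or footnotes, it can help to collapse the two-column block into a named character vector. The sketch below works on a hand-typed stand-in for the first rows of that block, and it assumes that continuation lines of the notes have an empty first column; `collapse_metadata` and `meta` are hypothetical names, and the real layout of the file may differ.

```r
# Collapse "Key:" / value rows into a named vector, joining continuation rows
# (rows whose first column is empty) onto the preceding key.
collapse_metadata <- function(meta) {
  key    <- sub(":$", "", trimws(as.character(meta[[1]])))
  value  <- trimws(as.character(meta[[2]]))
  starts <- nzchar(key)
  group  <- cumsum(starts)
  keep   <- group > 0
  joined <- tapply(value[keep], group[keep],
                   function(v) paste(v[nzchar(v)], collapse = " "))
  setNames(as.vector(joined), key[starts])
}

# Stand-in data retyped from the output above (not downloaded).
meta <- data.frame(
  V1 = c("Title:", "Series ID:", "Frequency:", "Notes:", "", ""),
  V2 = c("Real Gross Domestic Product", "GDPC1", "Quarterly",
         "BEA Account Code: A191RX1",
         "Real gross domestic product is the inflation adjusted value of the",
         "goods and services produced by labor and property located in the"),
  stringsAsFactors = FALSE
)

info <- collapse_metadata(meta)
footnote <- paste0(info[["Title"]], " (", info[["Series ID"]], "), ",
                   info[["Frequency"]])
footnote
```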

FRED has a straightforward, well-documented JSON interface (http://api.stlouisfed.org/docs/fred/) which provides both metadata and time series data for all of its economic series. Access requires a FRED account and API key, but these are available on request from http://api.stlouisfed.org/api_key.html .
The Excel descriptive data you asked for can be retrieved using
get.FRSeriesTags <- function(seriesNam)
{
# seriesNam = character string containing the ID identifying the FRED series to be retrieved
#
library("httr")
library("jsonlite")
# dummy FRED api key; request valid key from http://api.stlouisfed.org/api_key.html
apiKey <- "&api_key=abcdefghijklmnopqrstuvwxyz123456"
base <- "http://api.stlouisfed.org/fred/"
seriesID <- paste("series_id=", seriesNam,sep="")
fileType <- "&file_type=json"
#
# get series descriptive data
#
datType <- "series?"
url <- paste(base, datType, seriesID, apiKey, fileType, sep="")
series <- fromJSON(url)$seriess
#
# get series tag data
#
datType <- "series/tags?"
url <- paste(base, datType, seriesID, apiKey, fileType, sep="")
tags <- fromJSON(url)$tags
#
# format as excel descriptive rows
#
description <- data.frame(Title=series$title[1],
Series_ID = series$id[1],
Source = tags$notes[tags$group_id=="src"][1],
Release = tags$notes[tags$group_id=="gen"][1],
Frequency = series$frequency[1],
Units = series$units[1],
Date_Range = paste(series[1, c("observation_start","observation_end")], collapse=" to "),
Last_Updated = series$last_updated[1],
Notes = series$notes[1],
row.names=series$id[1])
return(t(description))
}
Retrieving the actual time series data would be done in a similar way. There are several json packages available for R but jsonlite works particularly well for this application.
There's a bit more to setting this up than the previous answer but perhaps worth it if you do much with FRED data.

How to remove all NAs in character strings in a dataframe column in R?

I have a CSV file like
LocationList,Identity,Category
"New York,New York,United States","42","S"
"NA,California,United States","89","lyt"
"Hartford,Connecticut,United States","879","polo"
"San Diego,California,United States","45454","utyr"
"Seattle,Washington,United States","uytr","69"
"NA,NA,United States","87","tree"
I want to remove all 'NA' from the 'LocationList' Column
The Desired Result -
LocationList,Identity,Category
"New York,New York,United States","42","S"
"California,United States","89","lyt"
"Hartford,Connecticut,United States","879","polo"
"San Diego,California,United States","45454","utyr"
"Seattle,Washington,United States","uytr","69"
"United States","87","tree"
The number of columns is not fixed and may increase or decrease. I also want to write to the CSV file without quotes and without escaping for the 'LocationList' column.
How can I achieve this in R?
I am new to R; any help is appreciated.

In this case, you just want to replace the string "NA," with nothing. However, this is not the standard way to remove NA values.
Assuming dat is your data, use
dat$LocationList <- gsub("^(NA,)+", "", dat$LocationList)
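The pattern above only strips NA entries at the start of the string. If NA can appear anywhere in the list, one option is to split on commas and drop the pieces that are exactly "NA". This is a sketch, not part of the original answer; `drop_na_tokens` is a hypothetical helper, and the third test string is invented to show why a plain gsub("NA,", ...) can be risky.

```r
# Remove list entries that are exactly "NA", wherever they occur.
drop_na_tokens <- function(x) {
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(parts, function(p) paste(p[p != "NA"], collapse = ","), character(1))
}

locations <- c("NA,California,United States",
               "NA,NA,United States",
               "Regina,NA,CANADA")   # invented example with "NA" inside a word

drop_na_tokens(locations)
```

Because whole entries are compared, the "NA" inside "CANADA" is left alone, whereas a substring replacement of "NA," would also alter a value such as "CANADA,".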

Try:
my.data <- read.table(text='LocationList,Identity,Category
"New York,New York,United States","42","S"
"NA,California,United States","89","lyt"
"Hartford,Connecticut,United States","879","polo"
"San Diego,California,United States","45454","utyr"
"Seattle,Washington,United States","uytr","69"
"NA,NA,United States","87","tree"', header=T, sep=",")
my.data$LocationList <- gsub("NA,", "", my.data$LocationList)
my.data
# LocationList Identity Category
# 1 New York,New York,United States 42 S
# 2 California,United States 89 lyt
# 3 Hartford,Connecticut,United States 879 polo
# 4 San Diego,California,United States 45454 utyr
# 5 Seattle,Washington,United States uytr 69
# 6 United States 87 tree
If you get rid of the quotes when you write to a conventional csv file, you will have trouble reading the data in later. This is because you have commas already inside each value in the LocationList variable, so you would have commas both in the middle of fields and marking the break between fields. You might try using write.csv2() instead, which will indicate new fields with a semicolon ;. You could use:
write.csv2(my.data, file="myFile.csv", quote=FALSE, row.names=FALSE)
Which yields the following file:
LocationList;Identity;Category
New York,New York,United States;42;S
California,United States;89;lyt
Hartford,Connecticut,United States;879;polo
San Diego,California,United States;45454;utyr
Seattle,Washington,United States;uytr;69
United States;87;tree
(I now notice that the values for Identity and Category for row 5 are presumably messed up. You may want to switch those before writing to file.)
x <- my.data[5, 2]
my.data[5, 2] <- my.data[5, 3]
my.data[5, 3] <- x
rm(x)

Importing wikipedia tables in R

I regularly extract tables from Wikipedia. Excel's web import does not work properly for wikipedia, as it treats the whole page as a table. In google spreadsheet, I can enter this:
=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)
and this function will download the 3rd table, which lists all the counties of the UP of Michigan, from that page.
Is there something similar in R, or can it be created via a user-defined function?

Building on Andrie's answer, and addressing SSL. If you can take one additional library dependency:
library(httr)
library(XML)
url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"
r <- GET(url)
doc <- readHTMLTable(
doc=content(r, "text"))
doc[6]

The function readHTMLTable in package XML is ideal for this.
Try the following:
library(XML)
doc <- readHTMLTable(
doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")
doc[[6]]
V1 V2 V3 V4
1 County Population Land Area (sq mi) Population Density (per sq mi)
2 Alger 9,862 918 10.7
3 Baraga 8,735 904 9.7
4 Chippewa 38,413 1561 24.7
5 Delta 38,520 1170 32.9
6 Dickinson 27,427 766 35.8
7 Gogebic 17,370 1102 15.8
8 Houghton 36,016 1012 35.6
9 Iron 13,138 1166 11.3
10 Keweenaw 2,301 541 4.3
11 Luce 7,024 903 7.8
12 Mackinac 11,943 1022 11.7
13 Marquette 64,634 1821 35.5
14 Menominee 25,109 1043 24.3
15 Ontonagon 7,818 1312 6.0
16 Schoolcraft 8,903 1178 7.6
17 TOTAL 317,258 16,420 19.3
readHTMLTable returns a list of data.frames, one for each table element of the HTML page. You can use names to get information about each element:
> names(doc)
[1] "NULL"
[2] "toc"
[3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
[4] "NULL"
[5] "Cities and Villages of the Upper Peninsula"
[6] "Upper Peninsula Land Area and Population Density by County"
[7] "19th Century Population by Census Year of the Upper Peninsula by County"
[8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"
[9] "NULL"
[10] "NULL"
[11] "NULL"
[12] "NULL"
[13] "NULL"
[14] "NULL"
[15] "NULL"
[16] "NULL"
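Since positions such as 6 can shift whenever the article is edited, selecting by caption may be more robust. A small sketch using a hand-built stand-in for the list that readHTMLTable returns; `pick_table` and the toy `tables` list are assumptions for illustration:

```r
# Hypothetical helper: return the first table whose name matches the caption exactly.
pick_table <- function(tables, caption) {
  hits <- which(names(tables) == caption)
  if (length(hits) == 0L) stop("No table named '", caption, "'")
  tables[[hits[1]]]
}

# Stand-in for the list returned by readHTMLTable (contents retyped by hand).
tables <- setNames(
  list(
    data.frame(x = 1),
    data.frame(x = 2),
    data.frame(County = c("Alger", "Baraga"),
               Population = c("9,862", "8,735"),
               stringsAsFactors = FALSE)
  ),
  c("NULL", "toc", "Upper Peninsula Land Area and Population Density by County")
)

counties <- pick_table(
  tables, "Upper Peninsula Land Area and Population Density by County")
counties
```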

Here is a solution that works with the secure (https) link:
install.packages("htmltab")
library(htmltab)
htmltab("https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan",3)

One simple way to do it is to use the RGoogleDocs interface to have Google Docs do the conversion for you:
http://www.omegahat.org/RGoogleDocs/run.html
You can then use the =ImportHtml Google Docs function with all its pre-built magic.

A tidyverse solution using rvest. It's very useful if you need to find the table based on some keywords, for example in the table headers. Here is an example where we want to get the table on Vital statistics of Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse available tables on the page.
library(magrittr)
library(rvest)
library(stringr)  # provides str_which()
# define the page to load
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>%
# list all tables on the page
html_nodes(css = "table") %>%
# select the one containing needed key words
extract2(., str_which(string = . , pattern = "Live births")) %>%
# convert to a table
html_table(fill = TRUE) %>%
# open the result in the data viewer
View()

That table is the only table nested inside a td that is the second child of its parent, so you can specify that pattern with CSS. Rather than use a type selector of table to grab the child table, you can use the class, which is faster:
library(rvest)
t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>%
html_node('td:nth-child(2) .wikitable') %>%
html_table()
print(t)
