I am new to R and web scraping. For practice I am trying to scrape book titles from a fake website that has multiple pages ('http://books.toscrape.com/catalogue/page-1.html') and then calculate certain metrics based on the book titles. There are 20 books on each page and 50 pages in total. I have managed to scrape and calculate metrics for the first 20 books, but I want to calculate the metrics for the full 1000 books on the website.
The current output looks like this:
[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
[7] "The Dirty Little Secrets of Getting Your Dream Job"
[8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
[9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"
[12] "Shakespeare's Sonnets"
[13] "Set Me Free"
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
[15] "Rip it Up and Start Again"
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
[17] "Olio"
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"
[19] "Libertarianism for Beginners"
[20] "It's Only the Himalayas"
I want this to be 1000 books long instead of 20, so that I can use the same code to calculate the metrics for all 1000 books.
Code:
library(rvest)

url <- 'http://books.toscrape.com/catalogue/page-1.html'

# Scrape the 20 titles on the first page
url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title') -> titles

titles
What would be the best way to scrape every book from the website and make the list 1000 book titles long instead of 20? Thanks in advance.
Generate the 50 URLs, then iterate over them, e.g. with purrr::map:
library(rvest)

urls <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')

titles <- purrr::map(
  urls,
  . %>%
    read_html() %>%
    html_nodes('h3 a') %>%
    html_attr('title')
)
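purrr::map() returns a list of 50 character vectors (one per page). If you want a single vector of 1000 titles to feed into your metric calculations, you can flatten it, for example (all_titles is just an illustrative name):

# Flatten the list of per-page vectors into one character vector of titles
all_titles <- unlist(titles)
length(all_titles)  # should be 1000 if every page was scraped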
Something like this, perhaps?
library(tidyverse)
library(rvest)
library(data.table)
# Vector with URLs to scrape (all 50 pages)
url <- paste0("http://books.toscrape.com/catalogue/page-", 1:50, ".html")

# Scrape each page into a list of data.tables
L <- lapply(url, function(x) {
  print(paste0("scraping: ", x, " ... "))
  data.table(titles = read_html(x) %>%
               html_nodes('h3 a') %>%
               html_attr('title'))
})

# Bind the list into a single data.table
data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
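If you assign the bound table (the name books below is just illustrative), you can check that the expected number of titles came back before computing your metrics:

# Bind the per-page tables and check the total number of titles
books <- data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
nrow(books)  # 50 pages x 20 books = 1000, if every page scraped cleanly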
This page shows six sections listing people between <h3> tags.
How can I use XPath to select these six sections separately (using rvest), perhaps into a nested list? My goal is to later lapply through these six sections to fetch the people's names and affiliations (separated by section).
The HTML isn't well structured, i.e. not all of the text sits inside specific tags. An example:
<h3>Editor-in-Chief</h3>
Claudio Ronco – <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark – <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi – <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />
I access the site with the following code:
journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- rvest::html_session(journal_url,
httr::user_agent("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"))
webpage <- rvest::html_nodes(webpage, css = '#editorialboard')
I tried various XPaths to extract the six sections with html_nodes into a nested list of six lists, but none of them work properly:
# this gives me a list of 190 (instead of 6) elements, leaving out the text between <i> and </i>
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 190 (instead of 6) elements, leaving out text that is not between tags
webpage <- rvest::html_nodes(webpage, xpath = '//*[preceding-sibling::h3 and following-sibling::h3]')
# error "VECTOR_ELT() can only be applied to a 'list', not a 'logical'"
webpage <- rvest::html_nodes(webpage, xpath = '//* and text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 274 (instead of 6) elements
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3]')
Are you OK with an ugly solution that does not use XPath? I don't think you can get a nested list from the structure of this website, but I am not very experienced with XPath.
I first got the headings, divided the raw text using the heading names, and then, within each group, split the members using '\n' as a separator.
library(rvest)

journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- read_html(journal_url) %>% html_node(css = '#editorialboard')
# get h3 headings
headings <- webpage %>% html_nodes('h3') %>% html_text()
# get raw text
raw.text <- webpage %>% html_text()
# split raw text on h3 headings and put in a list
list.members <- list()
raw.text.2 <- raw.text
for (h in headings) {
  # split on the current heading
  b <- strsplit(raw.text.2, h, fixed = TRUE)
  # split the members (the text before this heading) using \n as separator
  c <- strsplit(b[[1]][1], '\n', fixed = TRUE)
  # drop empty elements from the vector
  c <- list(c[[1]][c[[1]] != ""])
  # add the vector of members to the list
  list.members <- c(list.members, c)
  # keep the text after the heading for the next iteration
  raw.text.2 <- b[[1]][2]
}

# remove the first element of the main list (the text before the first heading)
list.members <- list.members[2:length(list.members)]

# add the final segment of raw.text (the last group) to the list
c <- strsplit(raw.text.2, '\n', fixed = TRUE)
c <- list(c[[1]][c[[1]] != ""])
list.members <- c(list.members, c)
# add names to list
names(list.members) <- headings
You then get a list of the groups, and each element of the list is a character vector with one string per member (containing all the info):
> list.members$`Editor-in-Chief`
[1] "Claudio Ronco – St. Bartolo Hospital, Vicenza, Italy"
> list.members$`Clinical Engineering`
[1] "William R. Clark – Purdue University, West Lafayette, IN, USA"
[2] "Hideyuki Kawanashi – Tsuchiya General Hospital, Hiroshima, Japan"
[3] "Tadayuki Kawasaki – Mobara Clinic, Mobara City, Japan"
[4] "Jeongchul Kim – Wake Forest School of Medicine, Winston-Salem, NC, USA"
[5] "Anna Lorenzin – International Renal Research Institute of Vicenza, Vicenza, Italy"
[6] "Ikuto Masakane – Honcho Yabuki Clinic, Yamagata City, Japan"
[7] "Michio Mineshima – Tokyo Women's Medical University, Tokyo, Japan"
[8] "Tomotaka Naramura – Kurashiki University of Science and the Arts, Kurashiki, Japan"
[9] "Mauro Neri – International Renal Research Institute of Vicenza, Vicenza, Italy"
[10] "Masanori Shibata – Koujukai Rehabilitation Hospital, Nagoya City, Japan"
[11] "Yoshihiro Tange – Kyushu University of Health and Welfare, Nobeoka-City, Japan"
[12] "Yoshiaki Takemoto – Osaka City University, Osaka City, Japan"
I'm looking to extract names and professions of those who testified in front of Congress from the following text:
text <- c(("FULL COMMITTEE HEARINGS\\", \\" 2017\\",\n\\" April 6, 2017—‘‘The 2017 Tax Filing Season: Internal Revenue\\", \", \"\\"\nService Operations and the Taxpayer Experience.’’ This hearing\\", \\" examined\nissues related to the 2017 tax filing season, including\\", \\" IRS performance,\ncustomer service challenges, and information\\", \\" technology. Testimony was\nheard from the Honorable John\\", \\" Koskinen, Commissioner, Internal Revenue\nService, Washington,\\", \", \"\\" DC.\\", \\" May 25, 2017—‘‘Fiscal Year 2018 Budget\nProposals for the Depart-\\", \\" ment of Treasury and Tax Reform.’’ The hearing\ncovered the\\", \\" President’s 2018 Budget and touched on operations of the De-\n\\", \\" partment of Treasury and Tax Reform. Testimony was heard\\", \\" from the\nHonorable Steven Mnuchin, Secretary of the Treasury,\\", \", \"\\" United States\nDepartment of the Treasury, Washington, DC.\\", \\" July 18, 2017—‘‘Comprehensive\nTax Reform: Prospects and Chal-\\", \\" lenges.’’ The hearing covered issues\nsurrounding potential tax re-\\", \\" form plans including individual, business,\nand international pro-\\", \\" posals. Testimony was heard from the Honorable\nJonathan Talis-\\", \", \"\\" man, former Assistant Secretary for Tax Policy 2000–\n2001,\\", \\" United States Department of the Treasury, Washington, DC; the\\",\n\\" Honorable Pamela F. Olson, former Assistant Secretary for Tax\\", \\" Policy\n2002–2004, United States Department of the Treasury,\\", \\" Washington, DC; the\nHonorable Eric Solomon, former Assistant\\", \", \"\\" Secretary for Tax Policy\n2006–2009, United States Department of\\", \\" the Treasury, Washington, DC; and\nthe Honorable Mark J.\\", \\" Mazur, former Assistant Secretary for Tax Policy\n2012–2017,\\", \\" United States Department of the Treasury, Washington, DC.\\",\n\\" (5)\\", \\"VerDate Sep 11 2014 14:16 Mar 28, 2019 Jkt 000000 PO 00000 Frm 00013\nFmt 6601 Sfmt 6601 R:\\\\DOCS\\\\115ACT.000 TIM\\"\", \")\")"
)
The full text is available here: https://www.congress.gov/116/crpt/srpt19/CRPT-116srpt19.pdf
It seems that the names are between "Testimony was heard from" and the next ".". So how can I extract the names between these two patterns? The text is much longer (a 50-page document), but I figured that if I can do it once, I can do it for the rest of the text.
I know I can't simply use NLP for name extraction, because the text also contains names of persons who didn't testify, for example.
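A literal take on that pattern would be something like the sketch below (clean_text stands in for a tidied, single-string version of the raw text); the catch is that periods inside abbreviations and initials cut the matches short, which is what motivates the sentence-tokenizer approach in the answer that follows.

library(stringr)

# Capture everything between "Testimony was heard from" and the next period
# (sketch only: initials such as the "J." in "Mark J. Mazur" truncate the match)
hits <- str_match_all(clean_text, "Testimony was heard from([^.]*)")[[1]][, 2]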
NLP is likely unavoidable because of the many abbreviations in the text. Try this workflow:
Tokenize by sentence
Remove sentences without "Testimony"
Extract persons + professions from remaining sentences
There are a couple of packages with sentence tokenizers, but openNLP has generally worked best for me when dealing with abbreviation-laden sentences. The following code should get you close to your goal:
library(tidyverse)
library(pdftools)
library(openNLP)
# Get the data
testimony_url <- "https://www.congress.gov/116/crpt/srpt19/CRPT-116srpt19.pdf"
download.file(testimony_url, "testimony.pdf")
text_raw <- pdf_text("testimony.pdf")
# Clean the character vector and smoosh into one long string.
text_string <- str_squish(text_raw) %>%
  str_replace_all("- ", "") %>%
  paste(collapse = " ") %>%
  NLP::as.String()
# Annotate and extract the sentences.
annotations <- NLP::annotate(text_string, Maxent_Sent_Token_Annotator())
sentences <- text_string[annotations]
# Some sentences starting with "Testimony" list multiple persons. We need to
# split these and clean up a little.
name_title_vec <- str_subset(sentences, "Testimony was") %>%
  str_split(";") %>%
  unlist() %>%
  str_trim() %>%
  str_remove("^(Testimony .*? from|and) ") %>%
  str_subset("^\\(\\d\\)", negate = TRUE)
# Put in data frame and separate name from profession/title.
testimony_tibb <- tibble(name_title_vec) %>%
  separate(name_title_vec, c("name", "title"), sep = ", ", extra = "merge")
You should end up with the below data frame. Some additional cleaning may be necessary:
# A tibble: 95 x 2
name title
<chr> <chr>
1 the Honorable John Koskin… Commissioner, Internal Revenue Service, Washington, DC.
2 the Honorable Steven Mnuc… Secretary of the Treasury, United States Department of the Treasury…
3 the Honorable Jonathan Ta… former Assistant Secretary for Tax Policy 2000–2001, United States …
4 the Honorable Pamela F. O… former Assistant Secretary for Tax Policy 2002–2004, United States …
5 the Honorable Eric Solomon former Assistant Secretary for Tax Policy 2006–2009, United States …
6 the Honorable Mark J. Maz… "former Assistant Secretary for Tax Policy 2012–2017, United States…
7 Mr. Daniel Garcia-Diaz Director, Financial Markets and Community Investment, United States…
8 Mr. Grant S. Whitaker president, National Council of State Housing Agencies, Washington, …
9 the Honorable Katherine M… Ph.D., professor of public policy and planning, and faculty directo…
10 Mr. Kirk McClure Ph.D., professor, Urban Planning Program, School of Public Policy a…
# … with 85 more rows
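As one example of that additional cleaning, honorifics could be stripped from the name column; the prefix list below is just an assumption about what appears in the data:

# Strip common honorifics from the name column (prefix list is an assumption)
testimony_tibb <- testimony_tibb %>%
  mutate(name = str_remove(name, "^(the Honorable|Mr\\.|Ms\\.|Mrs\\.|Dr\\.)\\s+"))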
This is a link to a journal page:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1535-9
I'm trying to get the following: Author Affiliations (all authors), Corresponding Author, and Corresponding Author's Email. Note: it is assumed the corresponding author is the last author listed in the authors section at the top of the article. I've used SelectorGadget to identify some tags for other elements like Abstract and Publication Date, but I just can't seem to figure out how to get these three. The following is my code to get the authors as a character vector:
library(rvest)
library(stringr)

# url is the URL for the list of articles on a particular page
s <- html_session(url)
page <- s %>% follow_link(art) %>% read_html()
str_replace_all(str_squish(page %>% html_nodes(".AuthorName") %>% html_text()), "[0-9]|Email author", "")
And this returns a character vector of all the authors involved, in this case of length 8, one element per author. But now I need to follow the links on their names to get the affiliations and their emails. I'm sure all the code I need is in front of me, but I'm a little lost, as I'm new to R and web scraping (I had to learn this quickly for my current project).
Update
The answer below is perfect.
I am not sure the email address always matches the author in the last position: when I open the page source in Chrome, the email address appears under a separate, independent list.
library(rvest)
#> Loading required package: xml2
library(data.table)
library(tidyverse)
xml <- read_html('https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1535-9')
xml %>%
  html_nodes('.EmailAuthor') %>%
  html_attr('href')
#> [1] "mailto:liuj#cs.uky.edu"
# get email address
xml %>%
  html_nodes('.AuthorName') %>%
  html_text()
#> [1] "Ye<U+00A0>Yu" "Jinpeng<U+00A0>Liu" "Xinan<U+00A0>Liu" "Yi<U+00A0>Zhang"
#> [5] "Eamonn<U+00A0>Magner" "Erik<U+00A0>Lehnert" "Chen<U+00A0>Qian" "Jinze<U+00A0>Liu"
# get name
data.table(
  name = xml %>%
    html_nodes('meta') %>%
    html_attr('name'),
  content = xml %>%
    html_nodes('meta') %>%
    html_attr('content')
) %>%
  # keep the affiliation meta tags; they appear in the same order as the author names above
  filter(name %in% c('citation_author_institution')) %>%
  select(content)
#> content
#> 1 Department of Computer Science, University of Kentucky, Lexington, USA
#> 2 Department of Computer Science, University of Kentucky, Lexington, USA
#> 3 Department of Computer Science, University of Kentucky, Lexington, USA
#> 4 Department of Computer Science, University of Kentucky, Lexington, USA
#> 5 Department of Computer Science, University of Kentucky, Lexington, USA
#> 6 Seven Bridges Genomics Inc, Cambridge, USA
#> 7 Department of Computer Engineering, University of California Santa Cruz, Santa Cruz, USA
#> 8 Department of Computer Science, University of Kentucky, Lexington, USA
Created on 2018-11-02 by the reprex package (v0.2.1)
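If you also want each affiliation attached to its author rather than relying on the last-position assumption, one option is to keep both meta entries and pair every institution with the author tag that precedes it. A sketch, assuming the usual convention that each citation_author meta tag is immediately followed by its citation_author_institution tags (meta and authors are just illustrative names):

# Keep author and affiliation meta tags, then pair each affiliation with the
# most recent author tag (assumes authors precede their institutions)
meta <- data.table(
  name    = xml %>% html_nodes('meta') %>% html_attr('name'),
  content = xml %>% html_nodes('meta') %>% html_attr('content')
) %>%
  filter(name %in% c('citation_author', 'citation_author_institution')) %>%
  mutate(author_id = cumsum(name == 'citation_author'))

authors <- meta %>%
  group_by(author_id) %>%
  summarise(
    author_name = content[name == 'citation_author'][1],
    affiliation = paste(content[name == 'citation_author_institution'], collapse = "; ")
  )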
I have raw bibliographic data as follows:
bib =
c("Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte",
"Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in",
"Republican China*, Cambridge: Harvard University Press, 1976.",
"", "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing",
"History* and the Crisis of Traditional Chinese Historiography,\"",
"*Historiography East & West*2.2 (Sept. 2004): 173-204", "",
"Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:",
"Yale University Press, 1988.", "")
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte"
[2] "Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in"
[3] "Republican China*, Cambridge: Harvard University Press, 1976."
[4] ""
[5] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing"
[6] "History* and the Crisis of Traditional Chinese Historiography,\""
[7] "*Historiography East & West*2.2 (Sept. 2004): 173-204"
[8] ""
[9] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:"
[10] "Yale University Press, 1988."
[11] ""
I would like to collapse the elements between the ""s into single lines, so that:
clean_bib[1]=paste(bib[1], bib[2], bib[3])
clean_bib[2]=paste(bib[5], bib[6], bib[7])
clean_bib[3]=paste(bib[9], bib[10])
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
Is there a one-liner that does this automatically?
You can use tapply, grouping on the cumulative count of "" elements, and then paste the groups together:
unname(tapply(bib,cumsum(bib==""),paste,collapse=" "))
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] " Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] " Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
[4] ""
You can also do:
unname(c(by(bib,cumsum(bib==""),paste,collapse=" ")))
or
unname(tapply(bib,cumsum(grepl("^$",bib)),paste,collapse=" "))
etc.
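If you want to drop the leading spaces and the trailing empty element visible in the output above, a small cleanup step could be:

clean_bib <- unname(tapply(bib, cumsum(bib == ""), paste, collapse = " "))
clean_bib <- trimws(clean_bib)           # drop the leading spaces
clean_bib <- clean_bib[clean_bib != ""]  # drop the empty trailing element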
Similar to the other answer, this uses split and sapply. The second line just removes any elements that contain only "".
vec <- unname(sapply(split(bib, f = cumsum(bib %in% "")), paste0, collapse = " "))
vec[!vec %in% ""]