I am trying to scrape a website with an R script in order to gather details about its pictures.
What I need is:
Image name (1.jpg)
Image caption ("A recruit demonstrates the proper use of a CO2 portable extinguisher to put out a small outside fire.")
Photo credit ("Photo courtesy of: James Fortner")
There are over 16,000 files, and thankfully the web URL goes "...asp?photo=1, 2, 3, 4", so there is a base URL which doesn't change, just the last section with the image number. I would like the script to loop either over a set number of pages (I tell it where to start) or until it reaches a page which doesn't exist and breaks.
Using the code below, I can get the caption of the photo, but only one line. I would also like to get the photo credit, which is on a separate line; there are three lines between the main caption and the photo credit. I'd be fine if the generated table had two or three blank columns to account for those lines, as I can delete them later.
library(rvest)
library(dplyr)
link = "http://fallschurchvfd.org/photovideo.asp?photo=1"
page = read_html(link)
caption = page %>% html_nodes(".text7 i") %>% html_text()
info = data.frame(caption, stringsAsFactors = FALSE)
write.csv(info, "photos.csv")
Scraping with rvest and tidyverse
library(tidyverse)
library(rvest)
get_picture <- function(page) {
  cat("Scraping page", page, "\n")
  page <- str_c("http://fallschurchvfd.org/photovideo.asp?photo=", page) %>%
    read_html()
  tibble(
    image_name = page %>%
      html_element(".text7 img") %>%
      html_attr("src"),
    caption = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist() %>%
      nth(1),
    credit = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist() %>%
      nth(3)
  )
}
# Get the first 1:50
df <- map_dfr(1:50, possibly(get_picture, otherwise = tibble()))
# A tibble: 42 × 3
image_name caption credit
<chr> <chr> <chr>
1 /photos/1.jpg Recruit Clay Hamric demonstrates the use… James…
2 /photos/2.jpg A recruit demonstrates the proper use of… James…
3 /photos/3.jpg Recruit Paul Melnick demonstrates the pr… James…
4 /photos/4.jpg Rescue 104 James…
5 /photos/5.jpg Rescue 104 James…
6 /photos/6.jpg Rescue 104 James…
7 /photos/15.jpg Truck 106 operates a ladder pipe from Wi… Jim O…
8 /photos/16.jpg Truck 106 operates a ladder pipe as heav… Jim O…
9 /photos/17.jpg Heavy fire vents from the roof area of t… Jim O…
10 /photos/18.jpg Arlington County Fire and Rescue Associa… James…
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
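The possibly() wrapper above simply skips pages that fail. If you would rather stop at the first page that doesn't exist, as the question asks, a repeat loop around the same get_picture() function is one option. A minimal sketch, assuming that a missing photo id either makes read_html() error or yields a page with no image (worth verifying on a known-missing id):
library(rvest)
library(tidyverse)

results <- list()
i <- 1   # starting photo number
repeat {
  row <- tryCatch(get_picture(i), error = function(e) NULL)
  # Stop when the request errors or no image is found on the page
  # (assumed to signal that the photo doesn't exist).
  if (is.null(row) || is.na(row$image_name[1])) break
  results[[length(results) + 1]] <- row
  i <- i + 1
}
df <- bind_rows(results)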
For the images, you can use the command-line tool curl. For example, to download images 1.jpg through 100.jpg:
curl -O "http://fallschurchvfd.org/photos/[1-100].jpg"
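If you prefer to stay in R, a plain download.file() loop over the same URL pattern is an alternative. A sketch, assuming the images really do live at /photos/<n>.jpg as in the curl example:
# Download 1.jpg through 100.jpg into a local "photos" folder.
dir.create("photos", showWarnings = FALSE)
for (i in 1:100) {
  url  <- paste0("http://fallschurchvfd.org/photos/", i, ".jpg")
  dest <- file.path("photos", paste0(i, ".jpg"))
  tryCatch(download.file(url, dest, mode = "wb", quiet = TRUE),
           error = function(e) message("Skipping ", url))
}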
For the R code, if you grab the whole .text7 section, then you can split into caption and photo credit subsequently:
extractedtext <- page %>% html_nodes(".text7") %>% html_text()
caption <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1]
credit <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
As a loop
library(rvest)
library(tidyverse)

df <- data.frame(id = 1:20,
                 image = NA,
                 caption = NA,
                 credit = NA)

for (i in 1:20){
  cat(i, " ") # to monitor progress and debug
  link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
  tryCatch({ # This is to avoid stopping on an error message for missing pages
    page <- read_html(link)
    df$image[i] <- page %>% html_nodes(".text7 img") %>% html_attr("src")
    extractedtext <- page %>% html_nodes(".text7") %>% html_text()
    df$caption[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1] # This is an awkward way of saying "list 1, element 1"
    df$credit[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
  },
  error = function(e){cat("ERROR :", conditionMessage(e), "\n")})
}
I get inconsistent results with this code; for example, page 15 has more line breaks than page 1.
TODO: make the string extraction more robust; switch to an 'append' method of adding rows to the data.frame (vs pre-allocate and insert).
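One way to make the split more robust to the varying number of line breaks is to split on any run of line-break characters and drop the empty pieces. A sketch, assuming the caption is the first non-empty line of .text7 and the credit is the last (check a few pages to confirm):
library(rvest)
library(stringr)

page <- read_html("http://fallschurchvfd.org/photovideo.asp?photo=15")

# Split on any run of carriage returns/newlines/tabs, squish whitespace,
# and drop empty pieces so the result does not depend on how many blank
# lines a given page has.
parts <- page %>%
  html_nodes(".text7") %>%
  html_text() %>%
  str_split("[\r\n\t]+") %>%
  unlist() %>%
  str_squish()
parts <- parts[parts != ""]

caption <- parts[1]              # assumes the caption is the first non-empty line
credit  <- parts[length(parts)]  # assumes the credit is the last non-empty line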
Related
I'm attempting to scrape a table of read books from the Goodreads website using rvest. The data is formatted as a table, however I am getting errors when attempting to extract this info.
First we load some packages and set the url to scrape
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
Running this code:
dat <- read_html(url) %>%
  html_nodes('//*[@id="booksBody"]') %>%
  html_table()
Produces: Error in tokenize(css) : Unexpected character '/' found at position 1
Trying it again, but without the first /:
dat <- read_html(url) %>%
  html_nodes('/*[@id="booksBody"]') %>%
  html_table()
Produces: Error in parse_simple_selector(stream) : Expected selector, got <EOF at 20>
And finally, just trying to get the table directly, without the intermediate call to html_nodes:
dat <- read_html(url) %>%
  html_table('/*[@id="booksBody"]')
Produces: Error in if (header) { : argument is not interpretable as logical
Would appreciate any guidance on how to scrape this table
Scraping the first 5 pages
library(tidyverse)
library(rvest)
library(httr2)
get_books <- function(page) {
  cat("Scraping page:", page, "\n")
  books <-
    str_c("https://www.goodreads.com/review/list/4622890-emily-may?page=", page,
          "&shelf=%23ALL%23") %>%
    read_html() %>%
    html_elements(".bookalike.review")
  tibble(
    title = books %>%
      html_elements(".title a") %>%
      html_text2(),
    author = books %>%
      html_elements(".author a") %>%
      html_text2(),
    rating = books %>%
      html_elements(".avg_rating .value") %>%
      html_text2() %>%
      as.numeric(),
    date = books %>%
      html_elements(".date_added .value") %>%
      html_text2() %>%
      lubridate::mdy()
  )
}
df <- map_dfr(0:5, get_books)
# A tibble: 180 x 4
title author rating date
<chr> <chr> <dbl> <date>
1 Sunset "Cave~ 4.19 2023-01-14
2 Green for Danger (Inspector Cockrill~ "Bran~ 3.84 2023-01-12
3 Stone Cold Fox "Crof~ 4.22 2023-01-12
4 What If I'm Not a Cat? "Wint~ 4.52 2023-01-10
5 The Prisoner's Throne (The Stolen He~ "Blac~ 4.85 2023-01-07
6 The Kind Worth Saving (Henry Kimball~ "Swan~ 4.13 2023-01-06
7 Girl at War "Novi~ 4 2022-12-29
8 If We Were Villains "Rio,~ 4.23 2022-12-29
9 The Gone World "Swet~ 3.94 2022-12-28
10 Batman: The Dark Knight Returns "Mill~ 4.26 2022-12-28
# ... with 170 more rows
# i Use `print(n = ...)` to see more rows
I can get the first 30 books using this -
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
book_table <- read_html(url) %>%
  html_elements('table#books') %>%
  html_table() %>%
  .[[1]]
book_table
There is some cleaning that you might need to do in the data captured. Moreover, to get the complete list I am afraid rvest would not be enough. You might need to use something like RSelenium to scroll through the list.
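A rough sketch of the RSelenium scrolling idea, in case you stay on the infinite-scroll page rather than the paginated URL used above (untested; assumes rsDriver() can start a local browser and that ten scrolls are enough to load the whole shelf):
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", port = 4567L)
remDr <- driver$client
remDr$navigate("https://www.goodreads.com/review/list/4622890?shelf=read")

for (i in 1:10) {
  # Scroll to the bottom so the next batch of rows is loaded
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)
}

book_table <- read_html(remDr$getPageSource()[[1]]) %>%
  html_element("table#books") %>%
  html_table()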
I am currently experimenting with web scraping my own Stack Overflow profile (logged out) using rvest. To find the CSS tags I use the SelectorGadget extension for Google Chrome. To start, I would like to extract the numbers and their headers under the Stats section of my profile (the extension highlights them green and yellow while finding the tags).
This gives me the following CSS tags: .md\:fl-auto and .fc-dark. The .fc-dark tag is for the numbers and .md\:fl-auto for the headers (reputation, reached, etc.). Extracting the numbers works, but when extracting the headers I get the following error: Error: '\:' is an unrecognized escape in character string starting "".md\:". Is it possible to extract this CSS tag and save both outputs in a data frame? Here is a reproducible example:
library(rvest)
library(dplyr)
link <- "https://stackoverflow.com/users/14282714/quinten"
profile <- read_html(link)
numbers <- profile %>% html_nodes(".fc-dark") %>% html_text()
numbers
[1] "12,688" "49k" "847" "9"
headers <- profile %>% html_nodes(".md\:fl-auto") %>% html_text()
Error: '\:' is an unrecognized escape in character string starting "".md\:"
I am open to better options for web scraping my StackOverflow profile!
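As an alternative to scraping the HTML, the public Stack Exchange API exposes the same profile data as JSON. A minimal sketch, using the user id from the profile URL above; inspect the returned fields yourself, as only reputation is assumed here:
library(jsonlite)

res <- fromJSON("https://api.stackexchange.com/2.3/users/14282714?site=stackoverflow")
str(res$items)        # see which fields the API returns
res$items$reputation  # reputation is a standard field of the user object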
library(rvest)
library(dplyr)
library(stringr)

profile %>%
  html_nodes(".md\\:fl-auto") %>%
  html_text() %>%
  stringr::str_squish() %>%
  as_tibble() %>%
  tidyr::separate(value, into = c("number", "header"), sep = "\\s") %>%
  mutate(number = stringr::str_remove(number, "\\,") %>%
           sub("k", "000", ., fixed = TRUE) %>%
           as.numeric())
Output:
# A tibble: 4 x 2
number header
<dbl> <chr>
1 12688 reputation
2 49000 reached
3 847 answers
4 10 questions
I'm trying to scrape the information about the nurse jobs on that link: https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR
I managed to do it on the first page of results, but when I try to do it on the other few hundred pages, read_html() doesn't work anymore.
The first page works perfectly fine:
install.packages("rvest")
install.packages("dplyr")
library(rvest)
library(dplyr)
link = "https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR"
page = read_html(link)
But then for the following code I get the error message: Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, : Failed to parse text
link = "https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=2"
page = read_html(link)
Could you please tell me where I'm going wrong when scraping the second page of results? Thanks.
[EDIT] Thanks for the answers. For anybody interested, this is what I ended up doing using @Dave2e's answer (I am too much of a beginner to use RSelenium), and it works fine (scraping_onepage is the function I created to scrape one page):
#extract the number of pages of results
link = "https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR"
page = read_html(link)
extract = page %>% html_nodes(".total") %>% html_text()
number_pages = as.numeric(substring(extract, 24, 26))
#initialization of nurse_jobs for the loop
nurse_jobs <- scraping_onepage(page)
#loop
s <- session("https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR")
for (page_result in seq(from = 2, to = number_pages, by = 1)) {
  link = paste0("https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=", page_result)
  s1 <- session_jump_to(s, link) # method: https://stackoverflow.com/questions/73044507/read-html-returning-error-in-read-xml-raw-failed-to-parse-text-while
  page = read_html(s1)
  nurse_jobs1 <- scraping_onepage(page)
  nurse_jobs = rbind(nurse_jobs, nurse_jobs1)
}
Here I scraped from page 2 to 100 without any error. It should work for all 362 pages available. The code is inspired by @Dave2e's answer.
library(tidyverse)
library(rvest)
library(httr2)
ses <-
  "https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR" %>%
  session()

n_pages <- ses %>%
  read_html() %>%
  html_element("li:nth-child(10) a") %>%
  html_text2() %>%
  as.numeric()

get_info <- function(index_page) {
  cat("Scraping page", index_page, "...", "\n")
  page <- session_jump_to(ses,
                          paste0("https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=",
                                 index_page)) %>%
    read_html()
  tibble(
    from_page = index_page,
    position = page %>%
      html_elements("h2 a") %>%
      html_text2(),
    practice = page %>%
      html_elements(".vacancy h3") %>%
      html_text2(),
    salary = page %>%
      html_elements(".salary") %>%
      html_text2(),
    type = page %>%
      html_elements(".left dl~ dl+ dl dd") %>%
      html_text2()
  )
}
df <- map_dfr(2:100, get_info)
# A tibble: 1,980 × 5
from_page position practice salary type
<int> <chr> <chr> <chr> <chr>
1 2 Practice Nurse or Nurse Practitioner General … Depen… Perm…
2 2 Practice Nurse General … Depen… Perm…
3 2 Practice Nurse General … Depen… Perm…
4 2 Practice Nurse General … Depen… Perm…
5 2 Practice Nurse General … Depen… Perm…
6 2 Practice Nurse General … Depen… Perm…
7 2 Practice Nurse General … Depen… Perm…
8 2 Practice Nurse General … Depen… Perm…
9 2 Practice Nurse General … Depen… Perm…
10 2 Staff Nurse Neurology £2565… Perm…
# … with 1,970 more rows
You may be able to create a session and then jump from page to page:
library(rvest)
s <- session("https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR")
link = "https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=2"
#jump to next page
s <- session_jump_to(s, link)
page = read_html(s)
page %>% html_elements("div.vacancy")
session_history(s) #display history
This should work, but I have not fully tested it to verify.
If you want to scrape a few hundred pages with an easy pagination structure (a next-page button), you might be better off using something like RSelenium to automate the clicking and scraping process. A clever trick for XPaths in Google Chrome is Inspect -> right-click the element's code -> Copy XPath; you can do that for the next-page button. Previous iterations of this issue have involved encoding errors, but the encoding of this site is UTF-8 and the error persists even when that is specified, which suggests the page relies on JavaScript and again points towards Selenium. Alternatively, if the coding is too difficult, you can use Octoparse, a free web-scraping tool that makes pagination loops easy.
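A rough sketch of that RSelenium approach (untested; the XPath for the next-page button is a placeholder you would replace with the one copied from Chrome's inspector):
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", port = 4545L)
remDr <- driver$client
remDr$navigate("https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR")

results <- list()
for (i in 1:5) {
  page <- read_html(remDr$getPageSource()[[1]])
  results[[i]] <- page %>% html_elements("div.vacancy") %>% html_text2()
  # Placeholder XPath -- copy the real one for the next-page button from the inspector.
  next_btn <- remDr$findElement(using = "xpath", value = "//a[contains(., 'Next')]")
  next_btn$clickElement()
  Sys.sleep(2)  # give the next page time to load
}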
I am using rvest to (try to) scrape all the author affiliation data from RePEc, a database of academic publications. I have the authors' short IDs (author_reg), which I'm using to scrape affiliation data. However, I have several columns indicating multiple authors (each of which I need the affiliation data for). When there aren't multiple authors, the cell has an NA value, and some of the columns are mostly NA values. How do I alter my code so it skips the NA values but doesn't delete them?
Here is the code I'm using:
library(rvest)
library(purrr)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500", "NA", "NA")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation_author_1 <- sapply(df$author_reg_1, function(x) {
  links = c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  # here we try both links and store under attempts
  attempts = links %>% map(function(i){
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })
  # the good ones will have "character" class, the failed ones, try-error
  gdlink = which(sapply(attempts, class) != "try-error")
  if(length(gdlink) > 0){
    return(attempts[[gdlink[1]]])
  }
  else{
    return("True 404 error")
  }
})
Thanks in advance for your help!
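One way to skip the NA values without dropping those rows is to guard the lookup function before any links are built. A minimal sketch, reusing the try()/link logic from the question and assuming the missing entries are either real NA or the literal string "NA" (as in the sample vector):
library(rvest)
library(purrr)

http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"

get_affiliation <- function(x) {
  # Skip missing ids but keep the row by returning NA instead of dropping it.
  if (is.na(x) || x == "NA") return(NA_character_)
  links <- c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  attempts <- map(links, function(i) {
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text(), silent = TRUE)
  })
  gdlink <- which(sapply(attempts, class) != "try-error")
  if (length(gdlink) > 0) {
    paste(attempts[[gdlink[1]]], collapse = "; ")  # collapse multiple affiliations
  } else {
    "True 404 error"
  }
}

df$affiliation_author_1 <- sapply(df$author_reg_1, get_affiliation)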
Looking at the target links, you can try the following approach. First, scrape all the links from https://ideas.repec.org/e/ and build the complete URLs. Then, check whether each link exists. (There are about 26,000 links at this URL, and I do not have time to check them all, so I used 100 URLs in the following demonstration.) Finally, extract the links that exist.
library(rvest)
library(httr)
library(tidyverse)
# Get all possible links from this webpage. There are 26665 links.
read_html("https://ideas.repec.org/e/") %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href") %>%
.[grepl(x = ., pattern = "html")] -> x
# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")
# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]
# Check if each URL exists or not. If FALSE, a link exists.
foo <- sapply(mylinks_samples, http_error)
# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]
Then, for each link, I tried to extract the affiliation information. There are several places with h3 elements, so I specifically targeted the h3 nodes inside the element whose id is "affiliation". If there is no affiliation information, R returns character(0). When enframe() is applied, these elements are removed. For instance, pab127 does not have any affiliation information, so there is no entry for this link.
lapply(urls, function(x){
  read_html(x, encoding = "UTF-8") %>%
    html_nodes(xpath = '//*[@id="affiliation"]') %>%
    html_nodes("h3") %>%
    html_text() %>%
    trimws() -> foo
  return(foo)}) -> mylist
Then, I assigned names to mylist with the links and created a data frame.
names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")
enframe(mylist) %>%
  unnest(value)
name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
My question regards R being able to read a URL link. The example I use is solely for illustration purposes. Say that I have the following webpage that I want to read (chosen at random):
https://www.mcdb.ucla.edu/faculty
It has a list of professor names, each with a URL link. I am trying to build a script that can read a webpage similar to this, access each URL link, and search for certain keywords regarding their publications.
My current script, posted below, scans an individual page for certain keywords.
library(rvest)
library(dplyr)
library(tidyverse)
library(stringr)
prof <- readLines("https://www.mcdb.ucla.edu/faculty/jsadams")
text_df <- data_frame(text = prof)
text_df <- as.data.frame.table(text_df)
keywords <- c("nonskeletal", "antimicrobial response")
text_df %>%
  filter(str_detect(text, keywords[1]) | str_detect(text, keywords[2]))
This should return publications 1, 2 and 4 under the section "Selected Publications" on the professor's webpage.
Now I am trying to get R to read each professor's page from the faculty link (https://www.mcdb.ucla.edu/faculty) and see if each professor has publications with the keywords listed above.
Read https://www.mcdb.ucla.edu/faculty.
Access each link and read each faculty member's page.
Return whether the "keywords" are present (TRUE/FALSE).
List the professor's publications, or the text containing the "keywords".
I have already been able to do this for each individual page, but I would prefer a loop or function so I do not have to copy and paste each professor's page URL each time.
Just a slight disclaimer: I have no connection with UCLA or the professor on that website; the professor URL I chose just happened to be the first one listed on the faculty webpage.
I'd approach this as follows. This is "quick and dirty" code, but hopefully provides a basis for something better.
First, you need the correct selectors to get the faculty names and the links to their pages. Create a data frame with that information:
library(dplyr)
library(rvest)
library(tidytext)
page <- read_html("https://www.mcdb.ucla.edu/faculty")
table1 <- page %>%
  html_nodes(xpath = "//table[1]/tr/td/a")
names <- table1 %>%
  html_text() %>%
  unlist(use.names = FALSE)
links <- table1 %>%
  html_attrs() %>%
  unlist(use.names = FALSE)
data1 <- data.frame(name = names, href = links)
head(data1)
name href
1 John Adams /faculty/jsadams
2 Utpal Banerjee /faculty/banerjee
3 Siobhan Braybrook /faculty/siobhanb
4 Jau-Nian Chen /faculty/chenjn
5 Amander Clark /faculty/clarka
6 Daniel Cohn /faculty/dcohn
Next, you need a function that takes the values in the href column, fetches the staff page, and looks for keywords. I took a different approach from yours, using tidytext to break all of the publications down into individual words and then counting rows where any of the keywords occur. This means that "antimicrobial response" has to match as two separate words, so you may want to do that differently (see the phrase-matching sketch at the end of this answer).
The function returns a count which is > 0 if any of the keywords were present.
get_pubs <- function(href) {
  page <- read_html(paste0("https://www.mcdb.ucla.edu", href))
  pubs <- data.frame(text = page %>%
                       html_nodes("div.mcdb-faculty-pubs p") %>%
                       html_text(),
                     stringsAsFactors = FALSE)
  pubs <- pubs %>%
    unnest_tokens(word, text)
  pubs %>%
    filter(word %in% c("nonskeletal", "antimicrobial", "response")) %>%
    nrow()
}
Now you can apply the function to each href:
data1 <- data1 %>%
  mutate(count = sapply(href, function(x) get_pubs(x)))
Which faculty had at least one keyword in their publications?
data1 %>%
  filter(count > 0)
name href count
1 John Adams /faculty/jsadams 9
2 Arjun Deb /faculty/adeb 1
3 Tracy Johnson /faculty/tljohnson 1
4 Chentao Lin /faculty/clin 1
5 Jeffrey Long /faculty/jeffalong 1
6 Matteo Pellegrini /faculty/matteop 1
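If "antimicrobial response" needs to match as a single phrase rather than as two separate tokens (the limitation noted above), a str_detect() variant of get_pubs is one option. A sketch reusing the same selector and URL prefix:
library(rvest)
library(dplyr)
library(stringr)

get_pubs_phrase <- function(href,
                            keywords = c("nonskeletal", "antimicrobial response")) {
  page <- read_html(paste0("https://www.mcdb.ucla.edu", href))
  pubs <- page %>%
    html_nodes("div.mcdb-faculty-pubs p") %>%
    html_text()
  # Count publications containing any keyword or phrase, case-insensitively.
  sum(str_detect(tolower(pubs), str_c(tolower(keywords), collapse = "|")))
}

data1 <- data1 %>%
  mutate(count_phrase = sapply(href, get_pubs_phrase))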