Scraping from one URL to another URL in R

Scraping from one URL to another URL in R - r

My question is in regards to R being able to read a URL link. The example that I use is solely for illustration purposes. Say that I have the following webpage that I want to read (chosen at random);
https://www.mcdb.ucla.edu/faculty
It has a list of professor names with a URL link, I am trying to build a script which can read a webpage similar to this for instance and access each URL link and make a search for certain keywords regarding their publications.
I currently have my script to scan an individual website for certain keywords which I post below.
library(rvest)
library(dplyr)
library(tidyverse)
library(stringr)
prof <- readLines("https://www.mcdb.ucla.edu/faculty/jsadams")
library(dplyr)
text_df <- data_frame(text = prof)
text_df <- as.data.frame.table(text_df)
keywords <- c("nonskeletal", "antimicrobial response")
text_df %>%
filter(str_detect(text, keywords[1]) | str_detect(text, keywords[2]))
This should return publications 1, 2 and 4 under the section "Selected Publications" on the professors webpage.
Now I am trying to get R to read each professors page from the faculty link (https://www.mcdb.ucla.edu/faculty) and see if each professor has publications with the keywords listed above.
Read: https://www.mcdb.ucla.edu/faculty
Access each link and read each faculty member page:
Return if value "keywords" = TRUE:
List professors publications or text that has the "keywords" in:
I have already been able to do this for each individual page but I would perhaps prefer a loop or function so I do not have to copy and paste each professors page URL each time.
Just a slight disclaimer - I have no connection with the UCLA or the professor on that website, the professor URL I chose just so happened to be the first professor listed on the faculty of professors webpage.

I'd approach this as follows. This is "quick and dirty" code, but hopefully provides a basis for something better.
First, you need the correct selectors to get the faculty names and the links to their pages. Create a data frame with that information:
library(dplyr)
library(rvest)
library(tidytext)
page <- read_html("https://www.mcdb.ucla.edu/faculty")
table1 <- page %>%
html_nodes(xpath = "///table[1]/tr/td/a")
names <- table1 %>%
html_text() %>%
unlist(use.names = FALSE)
links <- table1 %>%
html_attrs() %>%
unlist(use.names = FALSE)
data1 <- data.frame(name = names, href = links)
head(data1)
name href
1 John Adams /faculty/jsadams
2 Utpal Banerjee /faculty/banerjee
3 Siobhan Braybrook /faculty/siobhanb
4 Jau-Nian Chen /faculty/chenjn
5 Amander Clark /faculty/clarka
6 Daniel Cohn /faculty/dcohn
Next, you need a function that takes the values in the href column, fetches the staff page and looks for keywords. I took a different approach to you, using tidytext to break all of the publications down into individual words, then counting rows where any of the keywords occur. This means that "antimicrobial response" has to be two separate words, so you may want to do that differently.
The function returns a count which is > 0 if any of the keywords were present.
get_pubs <- function(href) {
page <- read_html(paste0("https://www.mcdb.ucla.edu", href))
pubs <- data.frame(text = page %>%
html_nodes("div.mcdb-faculty-pubs p") %>%
html_text(),
stringsAsFactors = FALSE)
pubs <- pubs %>%
unnest_tokens(word, text)
pubs %>%
filter(word %in% c("nonskeletal", "antimicrobial", "response")) %>%
nrow()
}
Now you can apply the function to each href:
data1 <- data1 %>%
mutate(count = sapply(href, function(x) get_pubs(x)))
Which faculty had at least one keyword in their publications?
data1 %>%
filter(count > 0)
name href count
1 John Adams /faculty/jsadams 9
2 Arjun Deb /faculty/adeb 1
3 Tracy Johnson /faculty/tljohnson 1
4 Chentao Lin /faculty/clin 1
5 Jeffrey Long /faculty/jeffalong 1
6 Matteo Pellegrini /faculty/matteop 1

Related

R Scraping for image details from several thousand pages

I am trying to scrape details from a website in order to gather details for pictures with a script in R.
What I need is:
Image name (1.jpg)
Image caption ("A recruit demonstrates the proper use of a CO2 portable extinguisher to put out a small outside fire.")
Photo credit ("Photo courtesy of: James Fortner")
There are over 16,000 files, and thankfully the web url goes "...asp?photo=1, 2, 3, 4" so there is base url which doesn't change, just the last section with the image number. I would like the script to loop for either a set number (I tell it where to start) or it just breaks when it gets to a page which doesn't exisit.
Using the code below, I can get the caption of the photo, but only one line. I would like to get the photo credit, which is on a separate line; there are three between the main caption and photo credit. I'd be fine if the table which is generated had two or three blank columns to account for the lines, as I can delete them later.
library(rvest)
library(dplyr)
link = "http://fallschurchvfd.org/photovideo.asp?photo=1"
page = read_html(link)
caption = page %>% html_nodes(".text7 i") %>% html_text()
info = data.frame(caption, stringsAsFactors = FALSE)
write.csv(info, "photos.csv")

Scraping with rvest and tidyverse
library(tidyverse)
library(rvest)
get_picture <- function(page) {
cat("Scraping page", page, "\n")
page <- str_c("http://fallschurchvfd.org/photovideo.asp?photo=", page) %>%
read_html()
tibble(
image_name = page %>%
html_element(".text7 img") %>%
html_attr("src"),
caption = page %>%
html_element(".text7") %>%
html_text() %>%
str_split(pattern = "\r\n\t\t\t\t") %>%
unlist %>%
nth(1),
credit = page %>%
html_element(".text7") %>%
html_text() %>%
str_split(pattern = "\r\n\t\t\t\t") %>%
unlist %>%
nth(3)
)
}
# Get the first 1:50
df <- map_dfr(1:50, possibly(get_picture, otherwise = tibble()))
# A tibble: 42 × 3
image_name caption credit
<chr> <chr> <chr>
1 /photos/1.jpg Recruit Clay Hamric demonstrates the use… James…
2 /photos/2.jpg A recruit demonstrates the proper use of… James…
3 /photos/3.jpg Recruit Paul Melnick demonstrates the pr… James…
4 /photos/4.jpg Rescue 104 James…
5 /photos/5.jpg Rescue 104 James…
6 /photos/6.jpg Rescue 104 James…
7 /photos/15.jpg Truck 106 operates a ladder pipe from Wi… Jim O…
8 /photos/16.jpg Truck 106 operates a ladder pipe as heav… Jim O…
9 /photos/17.jpg Heavy fire vents from the roof area of t… Jim O…
10 /photos/18.jpg Arlington County Fire and Rescue Associa… James…
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows

For the images, you can use the command line tool curl. For example, to download images 1.jpg through 100.jpg
curl -O "http://fallschurchvfd.org/photos/[0-100].jpg"
For the R code, if you grab the whole .text7 section, then you can split into caption and photo credit subsequently:
extractedtext <- page %>% html_nodes(".text7") %>% html_text()
caption <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1]
credit <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
As a loop
library(rvest)
library(tidyverse)
df<-data.frame(id=1:20,
image=NA,
caption=NA,
credit=NA)
for (i in 1:20){
cat(i, " ") # to monitor progress and debug
link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
tryCatch({ # This is to avoid stopping on an error message for missing pages
page <- read_html(link)
close(link)
df$image[i] <- page %>% html_nodes(".text7 img") %>% html_attr("src")
extractedtext <- page %>% html_nodes(".text7") %>% html_text()
df$caption[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1] # This is an awkward way of saying "list 1, element 1"
df$credit[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
},
error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}
I get inconsistent results with this current code, for example, page 15 has more line breaks than page 1.
TODO: enhance string extraction; switch to an 'append' method of adding data to a data.frame (vs pre-allocate and insert).

How to extract webpage data with a node and class in rvest

I am performing webscraping on a site and have been able to get basic data, but I now need to collect data from a more complicated part of the page.
I am using rvest to pull data from the AAA gas prices website:
https://gasprices.aaa.com/
I am now trying to pull county-level data, which is only displayed on the map (if you hover your cursor over an individual county. I need to get the county gas prices for individual counties in different states. For example, if you click on Maine, to go to the Maine page (https://gasprices.aaa.com/?state=ME), I need to webscrape the price for Aroostook (the northernmost county on the map).
I have been able to use rvest to extract the data for the metro areas (lower on the page), using html_nodes and the node "td". However, the code for the map is more complex. Instead of the simple "td" node, the developer tools (in Chrome) gives <td class="fm-tooltip-comment">$4.928</td on the line with the price ($4.928 is the current price in Aroostook, as of the date of this post). I cannot seem to identify that with the rvest package to extract it.
I have read that the class can be used, or others have proposed using the css code to designate it within rvest, but I am unfamiliar with how to do so. Pulling the metro-area numbers was straightforward, however the county-level prices embedded within the map do not seem as accessible.
Is there a way to extract this county-level data so that I can webscrape in R? And, can this then be repeatable for all the counties/states from which I must select? Do I need the css code, and if so how do I access it/write it properly for rvest to use?

It looks like the information you are looking for is store in the "index.php" file that gets downloaded when the web page loads.
The current link for Maine is "https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=89346&ver=5.9.3".
I am not sure what the r=89346 value is for, maybe a timestamp, tracking id, temporary token (to prevent web scraping) etc. I suspect this URL will change thus you may need to use the developer tools on the browser to obtain the current url.
Also, map_id refers to state but I don't know the rational, Florida is 1, NC is 35 and Maine is 21.
Download this file, then extract the JSON data and convert. The data starts with a {"st1": and ends with }}.
library(dplyr)
#read the index_php file and turn it into character string
index_php <-readLines("https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=19770&ver=5.9.3")
index_php <- paste(index_php, collapse = " ")
#extract out the correct JSON data part and convert
jsondata <- stringr::str_extract(index_php, "\\{\"st1\":.+?\\}\\}")
data<-jsonlite::fromJSON(jsondata)
#create a data frame with the results
answer <- bind_rows(data)
id name shortname link comment image color_map color_map_over
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Androscoggin "" "" $4.964 "" #ca3338 #ca3338
2 2 Aroostook "" "" $4.928 "" #dd7a7a #dd7a7a
3 3 Cumberland "" "" $4.944 "" #ca3338 #ca3338
4 4 Franklin "" "" $4.936 "" #dd7a7a #dd7a7a
5 5 Hancock "" "" $4.900 "" #01b5da #01b5da
6 6 Kennebec "" "" $4.955 "" #ca3338 #ca3338
There are some extra columns which need removal, I leave it as an exercise for the reader.

So, you can gather the state info, including state level prices from the initial US page. You can also, from there, gather the urls for each state page. Make a request to each of those pages, and store the returned html. You can then, depending on whether the county data is in a php file, either extract the php file links, request that file and process out the info you want, or, in the case of no php file, extract the necessary data from the html already stored from the state requests.
Below extracts all the prices for all states and counties. There is a state DataFrame and a state with counties DataFrame.
library(tidyverse)
library(rvest)
get_data <- function(state, url) {
# extract county and price data from php files. Pass in state abbreviation and php file URI.
s <- read_html(url) %>%
html_text() %>%
str_match("map_data\\s+:\\s+(.*\\}),") %>%
.[, 2]
return(
tibble(
state = state,
county = s %>% str_match_all(',"name":"(.*?)"') %>% .[[1]] %>% .[, 2],
price = s %>% str_match_all(',"comment":"(.*?)"') %>% .[[1]] %>% .[, 2]
)
)
}
start_url <- "https://gasprices.aaa.com/?state=US"
page <- read_html(start_url)
# get state price info and urls for state pages
data_strings <- page %>%
html_text() %>%
stringr::str_match('placestxt = (".*")') %>%
.[, 2] %>%
str_replace_all('\\"', "") %>%
str_split(";")
df_state <- data.frame(subset(data_strings[[1]], lapply(data_strings, function(x) {
x != ""
})[[1]]) %>% map(., ~ str_split(.x, ",")) %>% unlist(recursive = F)) %>%
transpose() %>%
.[c(1:4)] %>%
set_names("abbr", "state", "price", "url")
state_data <- lapply(df_state$url, read_html)
# find the php file links
df_state$data_url <- lapply(state_data, function(item) {
item %>%
html_element("[src*=js_data]") %>%
html_attr("src")
})
# separate out dataframe according to whether county data is in php file or in previously stored html
no_valid_data_url <- df_state %>% filter(is.na(data_url))
has_valid_data_url <- df_state %>% filter(!is.na(data_url))
# grab the data for states where there are php files with county info
df_state_county <- map2_dfr(has_valid_data_url$state, has_valid_data_url$data_url, get_data)
# add in missing info i.e. # handle cases where data_url is NA e.g. https://gasprices.aaa.com/?state=DC
if (nrow(no_valid_data_url) > 0) {
html_to_use <- state_data[match(no_valid_data_url$abbr, df_state$abbr)]
df_state_county_no_data_url <- map_dfr(html_to_use, function(html) {
state_node <- html %>% html_element(".selected")
state_text <- state_node %>% html_text(trim = T)
return(
data.frame(
state = state_text,
county = state_text,
price = html %>% html_element('td:contains("Current Avg.") + td') %>% html_text()
)
)
})
df_state_county <- rbind(df_state_county, df_state_county_no_data_url)
}
head(df_state, 2)
head(df_state_county, 2)

How to get rvest or sapply to skip NA values?

I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs (author_reg), which I'm using to scrape affiliation data. However, I have several columns indicating multiple authors (each of which I need the affiliation data for). When there aren't multiple authors, the cell has an NA value. Some of the columns are mostly NA values so how do I alter my code so it skips the NA values but doesn't delete them?
Here is the code I'm using:
library(rvest)
library(purrr)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500", "NA", "NA")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation_author_1 <- sapply(df$author_reg_1, function(x) {
links = c(paste0(http1, x, ".html"),paste0(http2, x, ".html"))
# here we try both links and store under attempts
attempts = links %>% map(function(i){
try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
})
# the good ones will have "character" class, the failed ones, try-error
gdlink = which(sapply(attempts,class) != "try-error")
if(length(gdlink)>0){
return(attempts[[gdlink[1]]])
}
else{
return("True 404 error")
}
})
Thanks in advance for your help!

As far as I see the target links, you can try the following way. First, you want to scrape all links from https://ideas.repec.org/e/ and create all links. Then, check if each link exists or not. (There are about 26000 links with this URL, and I do not have time to check all. So I just used 100 URLs in the following demonstration.) Extract all existing links.
library(rvest)
library(httr)
library(tidyverse)
# Get all possible links from this webpage. There are 26665 links.
read_html("https://ideas.repec.org/e/") %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href") %>%
.[grepl(x = ., pattern = "html")] -> x
# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")
# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]
# Check if each URL exists or not. If FALSE, a link exists.
foo <- sapply(mylinks_sample, http_error)
# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]
Then, for each link, I tried to extract affiliation information. There are several spots with h3. So I tried to specifically target h3 that stays in xpath containing id = "affiliation". If there is no affiliation information, R returns character(0). When enframe() is applied, these elements are removed. For instance, pab127 does not have any affiliation information, so there is no entry for this link.
lapply(urls, function(x){
read_html(x, encoding = "UTF-8") %>%
html_nodes(xpath = '//*[#id="affiliation"]') %>%
html_nodes("h3") %>%
html_text() %>%
trimws() -> foo
return(foo)}) -> mylist
Then, I assigned names to mylist with the links and created a data frame.
names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")
enframe(mylist) %>%
unnest(value)
name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~

rvest: for loop/map to pull multiple tables using html_node & html_table

I'm trying to programmatically pull all of the box scores for a given day from NBA Reference (I used January 4th, 2020 which has multiple games). I started by creating a list of integers to denote the amount of box scores to pull:
games<- c(1:3)
Then I used developer tools from my browser to determine what each table contains (you can use selector gadget):
#content > div.game_summaries > div:nth-child(1) > table.team
Then I used purrr::map to create a list of the the tables to pull, using games:
map_list<- map(.x= '', paste, '#content > div.game_summaries > div:nth-child(', games, ') > table.teams',
sep = "")
# check map_list
map_list
Then I tried to run this list through a for loop to generate three tables, using tidyverse and rvest, which delivered an error:
for (i in map_list){
read_html('https://www.basketball-reference.com/boxscores/') %>%
html_node(map_list[[1]][i]) %>%
html_table() %>%
glimpse()
}
Error in selectr::css_to_xpath(css, prefix = ".//") :
Zero length character vector found for the following argument: selector
In addition: Warning message:
In selectr::css_to_xpath(css, prefix = ".//") :
NA values were found in the 'selector' argument, they have been removed
For reference, if I explicitly denote the html or call the exact item from map_list, the code works as intended (run below items for reference):
read_html('https://www.basketball-reference.com/boxscores/') %>%
html_node('#content > div.game_summaries > div:nth-child(1) > table.teams') %>%
html_table() %>%
glimpse()
read_html('https://www.basketball-reference.com/boxscores/') %>%
html_node(map_list[[1]][1]) %>%
html_table() %>%
glimpse()
How do I make this work with a list? I have looked at other threads but even though they use the same site, they're not the same issue.

Using your current map_list, if you want to use for loop this is what you should use
library(rvest)
for (i in seq_along(map_list[[1]])){
read_html('https://www.basketball-reference.com/boxscores/') %>%
html_node(map_list[[1]][i]) %>%
html_table() %>%
glimpse()
}
but I think this is simpler as you don't need to use map to create map_list since paste is vectorized :
map_list<- paste0('#content > div.game_summaries > div:nth-child(', games, ') > table.teams')
url <- 'https://www.basketball-reference.com/boxscores/'
webpage <- url %>% read_html()
purrr::map(map_list, ~webpage %>% html_node(.x) %>% html_table)
#[[1]]
# X1 X2 X3
#1 Indiana 111 Final
#2 Atlanta 116
#[[2]]
# X1 X2 X3
#1 Toronto 121 Final
#2 Brooklyn 102
#[[3]]
# X1 X2 X3
#1 Boston 111 Final
#2 Chicago 104

This page is reasonably straight forward to scrape. Here is a possible solution, first scrape the game summary nodes "div with class=game_summary". This provides a list of all of the games played. Also this allows the use of html_node function which guarantees a return, thus keeping the list sizes equal.
Each game summary is made up of three subtables, the first and third tables can be scraped directly. The second table does not have a class assigned thus making it more tricky to retrieve.
library(rvest)
page <- read_html('https://www.basketball-reference.com/boxscores/')
#find all of the game summaries on the page
games<-page %>% html_nodes("div.game_summary")
#Each game summary has 3 sub tables
#game score is table 1 of class=teams
#the stats is table 3 of class=stats
# the quarterly score is the second table and does not have a class defined
table1<-games %>% html_node("table.teams") %>% html_table()
stats <-games %>% html_node("table.stats") %>% html_table()
quarter<-sapply(games, function(g){
g %>% html_nodes("table") %>% .[2] %>% html_table()
})

Web Scraping Notable Names

I'm trying to get the
Gender
Race or Ethnicity
Sexual orientation
Occupation
Nationality
from each site listed here: https://www.nndb.com/lists/494/000063305/
Here's an individual site so viewers can see the single page.
I'm trying to model my R code after this site but it's difficult because on the individual sites there aren't headings for Gender, for example. Can someone assist?
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
b_dataset <- map_df(1:91, function(i) {
page <- read_html(sprintf(url_base, i))
data.frame(ICOname = html_text(html_nodes(page, ".name")))
})

I'll take you halfway there: it's not too difficult to figure out from here.
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
First, the following will generate a list of A-Z surname list URLs, and then consequently each person's profile URLs.
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
html_nodes(".newslink") %>%
html_attrs() %>%
map(pluck(1, 1))
all_ppl_urls <- map(
all_surname_urls,
function(x) read_html(x) %>%
html_nodes("a") %>%
html_attrs() %>%
map(pluck(1, 1))
) %>%
unlist()
all_ppl_urls <- setdiff(
all_ppl_urls[!duplicated(all_ppl_urls)],
c(all_surname_urls, "http://www.nndb.com/")
)
You are correct---there are no separate headings for gender or any other, really. You'll just have to use tools such as SelectorGadget to see what elements contain what you need. In this case it's simply p.
all_ppl_urls[1] %>%
read_html() %>%
html_nodes("p") %>%
html_text()
The output will be
[1] "AKA Lee William Aaker"
[2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
[3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
[4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
...
Although the output is not clean, things rarely are when webscraping---this is actually relatively easier one. You can use series of grepl and map to subset the contents that you need, and make a dataframe out of them.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Scraping from one URL to another URL in R - r

Related

R Scraping for image details from several thousand pages

How to extract webpage data with a node and class in rvest

How to get rvest or sapply to skip NA values?

rvest: for loop/map to pull multiple tables using html_node & html_table

Web Scraping Notable Names

Categories

Resources