Web Scraping Notable Names in R

I'm trying to get the
Gender
Race or Ethnicity
Sexual orientation
Occupation
Nationality
from each site listed here: https://www.nndb.com/lists/494/000063305/
Here's an individual profile so viewers can see what a single page looks like.
I'm trying to model my R code after this site, but it's difficult because the individual pages don't have headings for Gender, for example. Can someone assist?
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
b_dataset <- map_df(1:91, function(i) {
  page <- read_html(sprintf(url_base, i))
  data.frame(ICOname = html_text(html_nodes(page, ".name")))
})

I'll take you halfway there; the rest is not too difficult to figure out from here.
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
First, the following generates the list of A-Z surname index URLs and, from those, each person's profile URL.
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
  html_nodes(".newslink") %>%
  html_attrs() %>%
  map(pluck(1, 1))

all_ppl_urls <- map(
  all_surname_urls,
  function(x) read_html(x) %>%
    html_nodes("a") %>%
    html_attrs() %>%
    map(pluck(1, 1))
) %>%
  unlist()

all_ppl_urls <- setdiff(
  all_ppl_urls[!duplicated(all_ppl_urls)],
  c(all_surname_urls, "http://www.nndb.com/")
)
You are correct: there are no separate headings for gender, or any of the other fields really. You'll just have to use a tool such as SelectorGadget to see which elements contain what you need. In this case it's simply the p elements.
all_ppl_urls[1] %>%
  read_html() %>%
  html_nodes("p") %>%
  html_text()
The output will be
[1] "AKA Lee William Aaker"
[2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
[3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
[4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
...
Although the output is not clean, things rarely are when web scraping; this is actually a relatively easy case. You can use a series of grepl and map calls to subset the contents you need and build a data frame out of them.
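For example, a minimal sketch along those lines, assuming the field labels shown in the output above (the get_profile() helper name and the extra stop markers Born/Died/AKA/Executive summary are just illustrative):

library(purrr)
library(rvest)

# Fields requested in the question; extra labels act as "stop markers"
labels <- c("Gender", "Race or Ethnicity", "Sexual orientation",
            "Occupation", "Nationality")
stop_pat <- paste0(c(labels, "Born", "Died", "AKA", "Executive summary"),
                   ":", collapse = "|")

get_profile <- function(url) {
  paras <- url %>%
    read_html() %>%
    html_nodes("p") %>%
    html_text()

  values <- map_chr(labels, function(lbl) {
    # paragraph containing this label, if any
    hit <- paras[grepl(paste0(lbl, ":"), paras, fixed = TRUE)][1]
    if (is.na(hit)) return(NA_character_)
    # keep the text between "<label>:" and the next known label
    val <- sub(paste0(".*", lbl, ":\\s*"), "", hit)
    trimws(sub(paste0("(", stop_pat, ").*"), "", val))
  })

  out <- as.data.frame(t(values), stringsAsFactors = FALSE)
  names(out) <- labels
  out
}

# e.g. profiles <- map_df(all_ppl_urls, get_profile)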

Related

How to get rvest or sapply to skip NA values?

I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs (author_reg), which I'm using to scrape affiliation data. However, I have several columns indicating multiple authors (each of which I need affiliation data for). When there aren't multiple authors, the cell has an NA value. Some of the columns are mostly NA values, so how do I alter my code so that it skips the NA values but doesn't delete them?
Here is the code I'm using:
library(rvest)
library(purrr)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500", "NA", "NA")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation_author_1 <- sapply(df$author_reg_1, function(x) {
  links = c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  # here we try both links and store under attempts
  attempts = links %>% map(function(i){
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })
  # the good ones will have "character" class, the failed ones, try-error
  gdlink = which(sapply(attempts, class) != "try-error")
  if(length(gdlink) > 0){
    return(attempts[[gdlink[1]]])
  } else {
    return("True 404 error")
  }
})
Thanks in advance for your help!
Looking at the target links, you can try the following approach. First, scrape all the links from https://ideas.repec.org/e/ and build the complete URLs. Then, check whether each link exists. (There are about 26,000 links under this URL, and I do not have time to check them all, so I just used 100 URLs in the demonstration below.) Finally, extract the links that exist.
library(rvest)
library(httr)
library(tidyverse)

# Get all possible links from this webpage. There are 26665 links.
read_html("https://ideas.repec.org/e/") %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  .[grepl(x = ., pattern = "html")] -> x

# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")

# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]

# Check if each URL exists or not. If FALSE, a link exists.
foo <- sapply(mylinks_samples, http_error)

# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]
Then, for each link, I tried to extract the affiliation information. There are several places with h3 elements, so I specifically targeted the h3 nodes that sit under the element with id = "affiliation" via XPath. If there is no affiliation information, R returns character(0). When enframe() is applied, these elements are dropped. For instance, pab127 does not have any affiliation information, so there is no entry for that link.
lapply(urls, function(x){
  read_html(x, encoding = "UTF-8") %>%
    html_nodes(xpath = '//*[@id="affiliation"]') %>%
    html_nodes("h3") %>%
    html_text() %>%
    trimws() -> foo
  return(foo)
}) -> mylist
Then, I assigned names to mylist with the links and created a data frame.
names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")
enframe(mylist) %>%
  unnest(value)
name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
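As for the title question of skipping the NA entries without deleting them, a minimal sketch is to add an early return at the top of the function passed to sapply, so missing IDs keep their position in the result. This reuses http1, http2 and the author_reg column from the question; collapsing multiple affiliations into a single string is an extra assumption so that sapply returns a plain character vector.

library(rvest)
library(purrr)

df$affiliation_author_1 <- sapply(df$author_reg, function(x) {
  # skip missing IDs (real NA or the literal string "NA") but keep their slot
  if (is.na(x) || identical(x, "NA")) return(NA_character_)

  links <- c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  attempts <- map(links, function(i) {
    try(read_html(i) %>%
          html_nodes("#affiliation h3") %>%
          html_text() %>%
          paste(collapse = "; "),   # assumption: one string per author
        silent = TRUE)
  })

  gdlink <- which(sapply(attempts, class) != "try-error")
  if (length(gdlink) > 0) attempts[[gdlink[1]]] else "True 404 error"
})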

R Webscraping: How to feed URLs into a function

My end goal is to take all 310 articles from this page and its following pages and run them through this function:
library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
library(lubridate)
library(dplyr)
scrape_docs <- function(URL){
  doc <- read_html(URL)

  speaker <- html_nodes(doc, ".diet-title a") %>%
    html_text()

  date <- html_nodes(doc, ".date-display-single") %>%
    html_text() %>%
    mdy()

  title <- html_nodes(doc, "h1") %>%
    html_text()

  text <- html_nodes(doc, "div.field-docs-content") %>%
    html_text()

  all_info <- list(speaker = speaker, date = date, title = title, text = text)
  return(all_info)
}
I assume the way to go forward would be to somehow create a list of the URLs I want, then iterate that list through the scrape_docs function. As it stands, however, I'm having a hard time understanding how to go about that. I thought something like this would work, but I seem to be missing something key given the following error:
xml_attr cannot be applied to object of class "character".
source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"

pages <- 4
all_links <- tibble()

for(i in seq_len(pages)){
  page <- paste0(source_col, i) %>%
    read_html() %>%
    html_attr("href") %>%
    html_attr()

  tmp <- page[[1]]

  all_links <- bind_rows(all_links, tmp)
}

all_links
You can get all the URLs by doing
library(rvest)
source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"
all_urls <- source_col %>%
  read_html() %>%
  html_nodes("td a") %>%
  html_attr("href") %>%
  .[c(FALSE, TRUE)] %>%
  paste0("https://www.presidency.ucsb.edu", .)
Now do the same by changing the page number in source_col to get remaining data.
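For example, a minimal sketch of that step, assuming the results span pages 0 through 3 (as pages <- 4 in the question suggests), swaps the page number into source_col and collects the article links from every page:

library(purrr)
library(rvest)

# build one search-results URL per page by replacing the trailing page number
page_urls <- map_chr(0:3, ~ sub("page=0$", paste0("page=", .x), source_col))

all_urls <- map(page_urls, function(u) {
  read_html(u) %>%
    html_nodes("td a") %>%
    html_attr("href") %>%
    .[c(FALSE, TRUE)] %>%
    paste0("https://www.presidency.ucsb.edu", .)
}) %>%
  unlist()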
You can then use a for loop or map to extract all the data.
purrr::map(all_urls, scrape_docs)
Testing the function scrape_docs on 1 URL
scrape_docs(all_urls[1])
#$speaker
#[1] "Dwight D. Eisenhower"
#$date
#[1] "1958-04-02"
#$title
#[1] "Special Message to the Congress Relative to Space Science and Exploration."
#$text
#[1] "\n To the Congress of the United States:\nRecent developments in long-range
# rockets for military purposes have for the first time provided man with new mac......
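If a single data frame is preferred over a list of lists, one possible follow-up, assuming each selector in scrape_docs matches exactly one node per page, is to bind the results row-wise:

library(purrr)

# one row per article; columns speaker, date, title, text
docs_df <- map_dfr(all_urls, scrape_docs)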

How do I add a loop when using R to scrape data?

I'm trying to create a database of crime data by zip code based on Trulia.com's data. I have the code below, but so far it only produces one line of data. In the code below, Zipcodes is just a list of US zip codes. Can anyone tell me what I need to add to make this run through my entire list "i"?
Here is a link to one of the Trulia pages for reference: https://www.trulia.com/real_estate/20004-Washington/crime/
UPDATE:
Here are zip codes for download: https://www.dropbox.com/s/uxukqpu0v88d7tf/Zip%20Code%20Database%20wo%20Boston.xlsx?dl=0
I also changed the code a bit this time after realizing that the crime stats appear in different orders depending on the zip code. Is it possible to have the loop produce four lines per zip code? The current version works but only produces the last zip code in the dataset. I can't figure out how to make sure each zip code's data is recorded on separate lines, so that earlier zip codes aren't overwritten, leaving only the last one.
Please help!!
library(rvest)
data <- data.frame(Zipcodes)

for(i in data$Zip.Code) {
  site <- paste("https://www.trulia.com/real_estate/", i, "-Boston/crime/", sep = "")
  site <- html(site)
  crime <- data.frame(zip = i,
                      type = site %>% html_nodes(".brs") %>% html_text(),
                      stringsAsFactors = FALSE)
}

View(crime)
If that code doesn't work, try this:
data <- data.frame(Zillow_Data_for_R_Test)

for(i in data$Zip.Code) {
  site <- paste("https://www.trulia.com/real_estate/", i, "-Boston/crime/", sep = "")
  site <- read_html(site)
  crime <- data.frame(zip = i,
                      theft = site %>% html_nodes(".crime-text-0") %>% html_text(),
                      assault = site %>% html_nodes(".crime-text-1") %>% html_text(),
                      arrest = site %>% html_nodes(".crime-text-2") %>% html_text(),
                      vandalism = site %>% html_nodes(".crime-text-3") %>% html_text(),
                      robbery = site %>% html_nodes(".crime-text-4") %>% html_text(),
                      type = site %>% html_nodes(".clearfix") %>% html_text(),
                      stringsAsFactors = FALSE)
}

View(crime)
The comment from @r2evans already provides an answer. Since @ShanCham asked how to actually implement this, I wanted to guide with the following code, which is just more verbose than the comment and could therefore not be posted as an additional comment.
library(rvest)

# only two exemplary zipcodes; could be more, of course
zipcodes <- c("02110", "02125")

crime <- lapply(zipcodes, function(z) {
  site <- read_html(paste0("https://www.trulia.com/real_estate/", z, "-Boston/crime/"))

  # for illustrative purposes:
  # introduced as.numeric for the numeric columns,
  # excluded some of your other columns, and shortened the text in type
  data.frame(zip = z,
             theft = site %>% html_nodes(".crime-text-0") %>% html_text() %>% as.numeric(),
             assault = site %>% html_nodes(".crime-text-1") %>% html_text() %>% as.numeric(),
             type = site %>% html_nodes(".clearfix") %>% html_text() %>% paste(collapse = " ") %>% substr(1, 50),
             stringsAsFactors = FALSE)
})
class(crime)
# list

# The output is a list of data.frames that can be bound together into one data.frame
crime <- do.call(rbind, crime)

# crime is a data.frame, hence classes/types are kept
class(crime$type)
# [1] "character"
class(crime$assault)
# [1] "numeric"

Scraping from one URL to another URL in R

My question regards R being able to read a URL link. The example I use is solely for illustration purposes. Say that I have the following webpage that I want to read (chosen at random):
https://www.mcdb.ucla.edu/faculty
It has a list of professor names, each with a URL link. I am trying to build a script that can read a webpage like this, access each URL link, and search for certain keywords regarding their publications.
My current script, posted below, scans an individual page for certain keywords.
library(rvest)
library(dplyr)
library(tidyverse)
library(stringr)

prof <- readLines("https://www.mcdb.ucla.edu/faculty/jsadams")

text_df <- data_frame(text = prof)
text_df <- as.data.frame.table(text_df)

keywords <- c("nonskeletal", "antimicrobial response")

text_df %>%
  filter(str_detect(text, keywords[1]) | str_detect(text, keywords[2]))
This should return publications 1, 2 and 4 under the section "Selected Publications" on the professor's webpage.
Now I am trying to get R to read each professor's page from the faculty link (https://www.mcdb.ucla.edu/faculty) and see whether each professor has publications containing the keywords listed above.
1. Read https://www.mcdb.ucla.edu/faculty
2. Access each link and read each faculty member's page
3. Check whether any of the "keywords" appear on the page
4. If so, list the professor's publications or text that contain the "keywords"
I have already been able to do this for each individual page, but I would prefer a loop or function so I do not have to copy and paste each professor's page URL each time.
Just a slight disclaimer: I have no connection with UCLA or the professor on that website; the professor URL I chose just happened to be the first one listed on the faculty webpage.
I'd approach this as follows. This is "quick and dirty" code, but hopefully provides a basis for something better.
First, you need the correct selectors to get the faculty names and the links to their pages. Create a data frame with that information:
library(dplyr)
library(rvest)
library(tidytext)

page <- read_html("https://www.mcdb.ucla.edu/faculty")

table1 <- page %>%
  html_nodes(xpath = "//table[1]/tr/td/a")

names <- table1 %>%
  html_text() %>%
  unlist(use.names = FALSE)

links <- table1 %>%
  html_attrs() %>%
  unlist(use.names = FALSE)

data1 <- data.frame(name = names, href = links)
head(data1)
name href
1 John Adams /faculty/jsadams
2 Utpal Banerjee /faculty/banerjee
3 Siobhan Braybrook /faculty/siobhanb
4 Jau-Nian Chen /faculty/chenjn
5 Amander Clark /faculty/clarka
6 Daniel Cohn /faculty/dcohn
Next, you need a function that takes the values in the href column, fetches the staff page, and looks for keywords. I took a different approach from yours, using tidytext to break all of the publications down into individual words and then counting rows where any of the keywords occur. This means that "antimicrobial response" is treated as two separate words, so you may want to handle that differently (see the phrase-matching sketch at the end of this answer).
The function returns a count, which is > 0 if any of the keywords were present.
get_pubs <- function(href) {
  page <- read_html(paste0("https://www.mcdb.ucla.edu", href))

  pubs <- data.frame(text = page %>%
                       html_nodes("div.mcdb-faculty-pubs p") %>%
                       html_text(),
                     stringsAsFactors = FALSE)

  pubs <- pubs %>%
    unnest_tokens(word, text)

  pubs %>%
    filter(word %in% c("nonskeletal", "antimicrobial", "response")) %>%
    nrow()
}
Now you can apply the function to each href:
data1 <- data1 %>%
  mutate(count = sapply(href, function(x) get_pubs(x)))
Which faculty had at least one keyword in their publications?
data1 %>%
  filter(count > 0)
name href count
1 John Adams /faculty/jsadams 9
2 Arjun Deb /faculty/adeb 1
3 Tracy Johnson /faculty/tljohnson 1
4 Chentao Lin /faculty/clin 1
5 Jeffrey Long /faculty/jeffalong 1
6 Matteo Pellegrini /faculty/matteop 1
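As flagged above, if "antimicrobial response" should be matched as a single phrase rather than two separate tokens, one alternative sketch is to skip the tokenization and test the raw publication text with str_detect; the get_pubs_phrase() name is just illustrative:

library(stringr)

get_pubs_phrase <- function(href,
                            keywords = c("nonskeletal", "antimicrobial response")) {
  page <- read_html(paste0("https://www.mcdb.ucla.edu", href))

  pubs <- page %>%
    html_nodes("div.mcdb-faculty-pubs p") %>%
    html_text()

  # number of publications containing at least one keyword/phrase (case-insensitive)
  sum(str_detect(tolower(pubs), str_c(tolower(keywords), collapse = "|")))
}

# drop-in replacement: data1 %>% mutate(count = sapply(href, get_pubs_phrase))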

How to use rvest's read_html to read a list of HTML files?

I have a list of web URLs that all point to the same kind of page, just with different information.
Like this:
http://www.halfordsautocentres.com/autocentres/chesterfield
http://www.halfordsautocentres.com/autocentres/derby-london-road
http://www.halfordsautocentres.com/autocentres/derby-wyvern-way
Each one has a different address under the CSS selector .store-details__address.
I have written the following code, which outputs the correct address for a single page:
derby <- read_html("http://www.halfordsautocentres.com/autocentres/derby-wyvern-way")

derby %>%
  html_node(".store-details__address") %>%
  html_text()
[1] "Unit 7, Wyvern Way, Wyvern Retail Park, Derby, DE21 6NZ"
How can I make read_html read a list of URLs rather than just a single one?
Thanks.
You can use any looping strategy that you want: for, lapply, purrr::map.
require(rvest)

urls <- c("http://www.halfordsautocentres.com/autocentres/chesterfield",
          "http://www.halfordsautocentres.com/autocentres/derby-london-road",
          "http://www.halfordsautocentres.com/autocentres/derby-wyvern-way")
Base R using a for loop
out <- vector("character", length = length(urls))

for(i in seq_along(urls)){
  derby <- read_html(urls[i])
  out[i] <- derby %>%
    html_node(".store-details__address") %>%
    html_text()
}
Base R with *apply
urls %>%
  lapply(read_html) %>%
  lapply(html_node, ".store-details__address") %>%
  vapply(html_text, character(1))
Here is a tidyverse/purrr approach
require(tidyverse)

urls %>%
  map(read_html) %>%
  map(html_node, ".store-details__address") %>%
  map_chr(html_text)
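Optionally, to keep track of which address came from which centre, the purrr results can be stored alongside the URLs in a tibble (a small illustrative follow-up, not part of the original answer):

addresses <- urls %>%
  map(read_html) %>%
  map(html_node, ".store-details__address") %>%
  map_chr(html_text)

tibble(centre = basename(urls), address = addresses)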
