I am trying to scrape some data from the SEC website. Each parent node has child nodes that contain the text of interest. However, in some cases a particular child node does not exist. So, for example, in this link:
urll <- "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml"
There are 728 parent nodes. Each parent node contains a number of child nodes, each with a specific tag. Here is an example of one full entry (of the 728):
<infoTable>
<nameOfIssuer>APPLE INC</nameOfIssuer>
<titleOfClass>COM</titleOfClass>
<cusip>037833100</cusip>
<value>1486</value>
<shrsOrPrnAmt>
<sshPrnamt>11200</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<putCall>Put</putCall>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>11200</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
In this example the "putCall" tag may or may not exist. When it exists, I want to get the relevant text ("Put" in this instance). However, for this link only 8 of the 728 parent nodes have a "putCall" node. I want to fill in NA where there is no "putCall" node, so that I always have 728 entries for each tag, which I can then coerce into a data frame. This is what I have tried so far, inspired by "Inputting NA where there are missing values when scraping with rvest":
library(polite)
library(rvest)
library(purrr)
library(tidyverse)
library(httr)
session <- bow("https://www.sec.gov/")
urll <- "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml"
test <- session %>%
  nod(urll) %>%
  scrape(verbose = FALSE) %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # not the complete list
  map_df(
    ~ list(
      name_of_issuer = html_elements(.x, xpath = "//*[local-name()='nameOfIssuer']") %>%
        html_text() %>%
        {if (length(.) == 0) NA else .},
      title_of_class = html_elements(.x, xpath = "//*[local-name()='titleOfClass']") %>%
        html_text() %>%
        {if (length(.) == 0) NA else .},
      put_or_call = html_elements(.x, xpath = "//*[local-name()='putCall']") %>%
        html_text() %>%
        {if (length(.) == 0) NA else .}
    )
  )
This fails with the error message:
Error: Can't recycle `name_of_issuer` (size 728) to match `put_or_call` (size 8).
It seems that the NA fill is not working for the "putCall" node, and it returns a list of only 8 entries.
Any suggestions on what I am doing wrong and how to fix it?
Thanks much!
If I simply use httr, I can pass in a valid User-Agent header and rewrite your code to use a data.frame call instead of a list; that way NA is returned where a value is not present.
Swap out html_elements for html_element.
You also need to amend your XPaths to be relative (note the leading . in .//); otherwise you get the first matching node in the document repeated for each row.
library(tidyverse)
library(rvest)
library(httr)

headers <- c("User-Agent" = "Safari/537.36")

r <- httr::GET(url = "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml", httr::add_headers(.headers = headers))

r %>%
  content() %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # not the complete list
  map_df(
    ~ data.frame(
      name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>%
        html_text(),
      title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>%
        html_text(),
      put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>%
        html_text()
    )
  )
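If you prefer to stay with the polite session from your original code, the same fix (relative XPaths plus html_element) should carry over. A minimal sketch, assuming scrape() parses the XML as it did in your attempt; the user_agent value is only a placeholder you should replace with your own identification:
library(polite)
library(rvest)
library(purrr)

# placeholder user agent; SEC asks that automated requests identify themselves
session <- bow("https://www.sec.gov/", user_agent = "your-name your.email@example.com")
urll <- "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml"

test <- session %>%
  nod(urll) %>%
  scrape(verbose = FALSE) %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>%
  map_df(
    ~ data.frame(
      name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>% html_text(),
      title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>% html_text(),
      put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>% html_text()
    )
  )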
I am scraping college basketball player images, and https://unfospreys.com/sports/womens-basketball/roster/2020-21 is one of the many pages with these images. Unfortunately, the 15th player on this page, Britney Gore, does not have a player image. As a result, the data.frame() below is not created, because the column imgSrc has length 14 while the column playerName has length 15 (you can run the code separately for each column in the data.frame(), and each line works individually).
library(rvest)
library(xml2)
rosters_url = 'https://unfospreys.com/sports/womens-basketball/roster/2020-21'
rosters_page = rosters_url %>% read_html()
this_rosters_df <- data.frame(
  baseUrl = 'https://unfospreys.com/sports/womens-basketball/roster/2020-21',
  imgSrc = rosters_page %>% html_nodes('div.sidearm-roster-player-image a img') %>% html_attr("data-src"),
  playerName = rosters_page %>% html_nodes('div.sidearm-roster-player-name p') %>% html_text() %>% trimws(),
  stringsAsFactors = FALSE
)
Is there any way for the code to identify on this page that this player doesn't have an image tag and therefore not pull her name, so that we don't have this mismatch in the data frame? I cannot change the fact that there are only 14 image tags for 15 players, but perhaps I can change the code for playerName to exclude all nodes that don't have a child/sibling image tag?
The key to solving this problem is to retrieve the parent nodes for all of the players, then parse this vector of parent nodes for the requested information for each player using the html_node() function (note: no trailing s).
This technique works here because there is a one-to-one relationship between a player's parent node and the requested information: one name, one position, and so on. The advantage of using html_node() instead of html_nodes() is that html_node() will always return a value, even if it is NA. So when there is no image node, an NA is returned and your vectors stay aligned.
library(rvest)
rosters_url <- "https://unfospreys.com/sports/womens-basketball/roster/2020-21"
rosters_page <- rosters_url %>% read_html()
#find the parent node which has all of the desired information for each player
players <- rosters_page %>% html_nodes(".sidearm-roster-player")
#Extract the requested information for each player
baseUrl = 'https://unfospreys.com/sports/womens-basketball/roster/2020-21'
imgSrc = players %>% html_node('img') %>% html_attr("data-src")
playername <- players %>% html_node('.sidearm-roster-player-name p') %>% html_text() %>% trimws()
#build the final answer
data.frame(baseUrl, imgSrc, playername)
You could grab a shared parent and thereby restrict the results to only those where both targets are present. I chose a selector for the parent node that lets me pull the name from an aria-label attribute (to match the displayed name after a substring replacement).
library(rvest)
library(purrr)
library(stringr)
rosters_url <- "https://unfospreys.com/sports/womens-basketball/roster/2020-21"
rosters_page <- rosters_url %>% read_html()
parent_nodes <- rosters_page %>% html_nodes(".sidearm-roster-player-image.column")
this_rosters_df <- map_df(parent_nodes, ~ {
  data.frame(
    imgSrc = .x %>% html_node(".lazyload") %>% html_attr("data-src") %>% url_absolute(., rosters_url),
    playerName = .x %>% html_node("a") %>% html_attr('aria-label') %>% str_replace(' - View Full Bio', ''),
    stringsAsFactors = FALSE
  )
})
head(this_rosters_df)
I built a simple scrape to get a data frame with NFL draft results for 2020. I intend to use this code to map over several years of results, but for some reason, when I change the code for a single-page scrape to any year other than 2020, I get the error at the bottom.
library(tidyverse)
library(rvest)
library(httr)
library(curl)
This scrape for 2020 works flawlessly, although the column names end up in row 1, which isn't a big deal to me as I can deal with it later (mentioning it in case it has to do with the problem):
x <- "https://www.pro-football-reference.com/years/2020/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
html_nodes("table") %>%
html_table() %>%
as.data.frame()
Below, the URL is changed from 2020 to 2019, which is an active page with a table of the same format. For some reason, the same call as above does not work:
x <- "https://www.pro-football-reference.com/years/2019/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
html_nodes("table") %>%
html_table() %>%
as.data.frame()
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 261, 2
There are two tables at the url provided. There is the core draft (table 1, id = "drafts") and the supplemental draft (table 2, id = "drafts_supp").
The as.data.frame() call fails because it is trying to combine the two tables, but they differ in both the number and the names of their columns. You can direct rvest to read just the specific table you are interested in by providing html_node() with either the XPath or the CSS selector. You can find the XPath or selector by inspecting the specific table you are interested in (right-click > Inspect in Chrome/Firefox). Note that for the selector to use the id you'll need #drafts, not just drafts, and for the XPath you typically have to wrap the text in single quotes.
This works: html_node(xpath = '//*[@id="drafts"]')
This doesn't, because of the double quotes: html_node(xpath = "//*[@id="drafts"]")
Note that I believe the html_nodes("table") used in your example is unnecessary, as html_table() already selects only tables.
x <- "https://www.pro-football-reference.com/years/2019/draft.htm"
raw_html <- read_html(x)
# use xpath
raw_html %>%
html_node(xpath = '//*[@id="drafts"]') %>%
html_table()
# use selector
raw_html %>%
html_node("#drafts") %>%
html_table()
I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs (author_reg), which I'm using to scrape affiliation data. However, I have several columns indicating multiple authors (each of which I need the affiliation data for). When there aren't multiple authors, the cell has an NA value. Some of the columns are mostly NA values so how do I alter my code so it skips the NA values but doesn't delete them?
Here is the code I'm using:
library(rvest)
library(purrr)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500", "NA", "NA")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation_author_1 <- sapply(df$author_reg_1, function(x) {
  links = c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  # here we try both links and store under attempts
  attempts = links %>% map(function(i){
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })
  # the good ones will have "character" class, the failed ones, try-error
  gdlink = which(sapply(attempts, class) != "try-error")
  if (length(gdlink) > 0) {
    return(attempts[[gdlink[1]]])
  } else {
    return("True 404 error")
  }
})
Thanks in advance for your help!
Looking at the target links, you can try the following approach. First, scrape all the links from https://ideas.repec.org/e/ and create the complete URLs. Then, check whether each link exists. (There are about 26,000 links on this page, and I do not have time to check them all, so I used only 100 URLs in the demonstration below.) Finally, extract all existing links.
library(rvest)
library(httr)
library(tidyverse)
# Get all possible links from this webpage. There are 26665 links.
read_html("https://ideas.repec.org/e/") %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href") %>%
.[grepl(x = ., pattern = "html")] -> x
# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")
# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]
# Check if each URL exists or not. If FALSE, a link exists.
foo <- sapply(mylinks_samples, http_error)
# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]
Then, for each link, I tried to extract the affiliation information. There are several spots with h3, so I specifically targeted the h3 elements that sit under the node whose id is "affiliation". If there is no affiliation information, R returns character(0), and when enframe() is applied these elements are removed. For instance, pab127 does not have any affiliation information, so there is no entry for this link.
lapply(urls, function(x){
  read_html(x, encoding = "UTF-8") %>%
    html_nodes(xpath = '//*[@id="affiliation"]') %>%
    html_nodes("h3") %>%
    html_text() %>%
    trimws() -> foo
  return(foo)
}) -> mylist
Then, I assigned names to mylist with the links and created a data frame.
names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")
enframe(mylist) %>%
unnest(value)
name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
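If you would rather keep authors with no affiliation as an NA row instead of dropping them (which is closer to what the question asks for), a minimal sketch building on mylist from above is to replace the empty results before unnesting:
# replace character(0) with NA so that unnest() keeps a row for every author
mylist_na <- lapply(mylist, function(x) if (length(x) == 0) NA_character_ else x)
enframe(mylist_na) %>%
  unnest(value)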
The code below works if I remove the Sys.sleep() from within the map() function. I tried to research the error ('Don't know how to pluck from a closure') but I haven't found much on that topic.
Does anyone know where I can find documentation on this error, and any help on why it is happening and how to prevent it?
library(rvest)
library(tidyverse)
library(stringr)
# lets assume 3 pages only to do it quickly
page <- (0:18)
# no need to create a list. Just a vector
urls = paste0("https://www.mlssoccer.com/players?page=", page)
# define this function that collects the player's name from a url
get_the_names = function( url){
url %>%
read_html() %>%
html_nodes("a.name_link") %>%
html_text()
}
# map the urls to the function that gets the names
players = map(urls, get_the_names) %>%
# turn into a single character vector
unlist() %>%
# make lower case
tolower() %>%
# replace the `space` to underscore
str_replace_all(" ", "-")
# Now create a vector of player urls
player_urls = paste0("https://www.mlssoccer.com/players/", players )
# define a function that reads the 3rd table of the url
get_the_summary_stats <- function(url){
url %>%
read_html() %>%
html_nodes("table") %>%
html_table() %>% .[[3]]
}
# lets read 3 players only to speed things up [otherwise it takes a significant amount of time to run...]
a_few_players <- player_urls[1:5]
# get the stats
tables = a_few_players %>%
# important step so I can name the rows I get in the table
set_names() %>%
#map the player urls to the function that reads the 3rd table
# note the `safely` wrap around the get_the_summary_stats' function
# since there are players with no stats and causes an error (eg.brenden-aaronson )
# the output will be a list of lists [result and error]
map(., ~{ Sys.sleep(5)
safely(get_the_summary_stats) }) %>%
# collect only the `result` output (the table) INTO A DATA FRAME
# There is also an `error` output
# also, name each row with the players name
map_df("result", .id = "player") %>%
#keep only the player name (remove the www.mls.... part)
mutate(player = str_replace(player, "https://www.mlssoccer.com/players/", "")) %>%
as_tibble()
tables <- tables %>% separate(Match,c("awayTeam","homeTeam"), extra= "drop", fill = "right")
purrr::safely(...) returns a function, so your map(., ~ { Sys.sleep(5); safely(get_the_summary_stats) }) is returning functions, not data. In R, a "closure" is a function together with its enclosing environment.
Tilde notation is a tidyverse-specific shorthand for terse anonymous functions. Typically (e.g., with lapply) one would write lapply(mydata, function(x) get_the_summary_stats(x)). In tilde notation, the same thing is written as map(mydata, ~ get_the_summary_stats(.)).
So, re-write to:
... %>% map(~ { Sys.sleep(5); safely(get_the_summary_stats)(.); })
From comments by @r2evans
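Putting it together, a minimal sketch of the corrected pipeline, assuming get_the_summary_stats() and a_few_players are defined as in the question:
library(tidyverse)
library(stringr)

tables <- a_few_players %>%
  set_names() %>%
  # call the safely-wrapped function on each url (note the trailing (.x))
  map(~ { Sys.sleep(5); safely(get_the_summary_stats)(.x) }) %>%
  # keep only the "result" element of each safely() output and bind into a data frame
  map_df("result", .id = "player") %>%
  mutate(player = str_replace(player, "https://www.mlssoccer.com/players/", "")) %>%
  as_tibble()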
Hi, I am trying to scrape data from eBay in R. I used the code below, but I ran into a problem: some listings were missing values for a particular selector's elements. To get around it I used a for loop as shown (inspecting each listing and hard-coding the positions where data was missing). Since the amount of data scraped was small it was possible to inspect it by hand, but how do I do this when there is a large amount of data to be scraped?
Thanks in advance
library(rvest)
url<-"https://www.ebay.in/sch/i.html_from=R40&_sacat=0&LH_ItemCondition=4&_ipg=100&_nkw=samsung+j7"
web<- read_html(url)
subdescp<- html_nodes(web, ".lvsubtitle+ .lvsubtitle")
subdescp1<-html_text(subdescp)
head(subdescp1)
library(stringr)
subdescp1<- str_replace_all(subdescp1, "[\t\n\r]" , "")
head(subdescp1)
for (i in c(5,6,10,19,33,34,35)){
a<-subdescp1[1:(i-1)]
b<-subdescp1[i:length(subdescp1)]
subdescp1<-append(a,list("NA"))
subdescp1<-append(subdescp1,b)
}
Z<-as.character(subdescp1)
Z
webpage <- read_html(url)
Descp_data_html <- html_nodes(webpage,'.vip')
Descp_data <- html_text(Descp_data_html)
head(Descp_data)
price_data_html <- html_nodes(web,'.prc .bold')
price_data <- html_text(price_data_html)
head(price_data)
library(stringr)
price_data<-str_replace_all(price_data, "[\t\n]" , "")
price_data<-gsub("Rs. ","",price_data)
price_data<-gsub(",","",price_data)
price_data<- as.numeric(price_data)
price_data
Desc_data_html <- html_nodes(webpage,'.lvtitle+ .lvsubtitle')
Desc_data <- html_text(Desc_data_html, trim = TRUE)
head(Desc_data)
j7_f2<-data.frame(Title = Descp_data, Description= Desc_data, Sub_Description= Z, Pirce = price_data)
For instance, you can use something like this:
library(rvest)
library(xml2)

data <- read_html("url.xml")

# nodes that are always present
var <- data %>% html_nodes(xpath = "//node") %>% xml_text()

# observations that don't have certain nodes - fill them with NA
var_pair <- data %>% html_nodes(xpath = "//node_var_pair")
var_missing <- sapply(var_pair, function(x) {
  # [[1]] errors when the node is absent, which triggers the NA fallback
  tryCatch(xml_text(html_nodes(x, xpath = "./var_missing")[[1]]),
           error = function(err) NA)
})

df <- data.frame(var, var_missing)
Here there are three types of nodes to consider: var gathers the nodes that never have missing data, var_pair contains the parent nodes that you pair with the possibly missing child nodes, and var_missing refers to the nodes with missing information. You can create these variables and aggregate them in a data frame (df).
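As a concrete sketch applied to the eBay page from the question: the per-listing parent selector .sresult below is only a guess (inspect the page to confirm it), while the sub-description selector is taken from your code.
library(rvest)

web <- read_html("https://www.ebay.in/sch/i.html_from=R40&_sacat=0&LH_ItemCondition=4&_ipg=100&_nkw=samsung+j7")

# hypothetical per-listing parent selector; adjust after inspecting the page
listings <- html_nodes(web, ".sresult")

sub_descp <- sapply(listings, function(x) {
  # [[1]] throws an error when the node is absent, which triggers the NA fallback
  tryCatch(html_text(html_nodes(x, ".lvsubtitle+ .lvsubtitle")[[1]]),
           error = function(err) NA)
})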
The process here is simple and has two steps. First, extract all nodes at the block level (not each element, and don't convert to text); this gives a list whose length equals the number of blocks. Second, from this extracted list, extract each element as text and clean it. Because this is done block by block, NAs are automatically filled in the right places where an element is missing. Here is an example from the same eBay India site:
library(rvest)
library(stringr)
# specify the url
url <-"https://www.ebay.in/sch/Mobile-Phones"
# read the page
web <- read_html(url)
# define the supernode that has the entire block of information
super_node <- '.li'
# read as vector of all blocks of supernode (imp: use html_nodes function)
super_node_read <- html_nodes(web, super_node)
# define each node element that you want
node_model_details <- '.lvtitle'
node_description_1 <- '.lvtitle+ .lvsubtitle'
node_description_2 <- '.lvsubtitle+ .lvsubtitle'
node_model_price <- '.prc .bold'
node_shipping_info <- '.bfsp'
# extract the output for each as cleaned text (imp: use html_node function)
model_details <- html_node(super_node_read, node_model_details) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
description_1 <- html_node(super_node_read, node_description_1) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
description_2 <- html_node(super_node_read, node_description_2) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
model_price <- html_node(super_node_read, node_model_price) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
shipping_info <- html_node(super_node_read, node_shipping_info) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
# create the data.frame
mobile_phone_data <- data.frame(
model_details,
description_1,
description_2,
model_price,
shipping_info
)