Cannot identify html node for scraping in rvest - css

I'm trying to grab links from a page for subsequent analysis, but I can only grab about half of them, which may be due to filtering. I'm trying to extract the links highlighted here:
My approach is as follows, which is not ideal because I believe I may be losing some links in the filter() call.
library(rvest)
library(tidyverse)
#initiate session
session <- html_session("https://www.backlisted.fm/episodes")
#collect links for all episodes from the index page:
session %>%
read_html() %>%
html_nodes(".underline-body-links a") %>%
html_attr("href") %>%
tibble(link_temp = .) %>%
filter(str_detect(link_temp, pattern = "episodes/")) %>%
distinct()
#css:
#.underline-body-links #page .html-block a, .underline-body-links #page .product-excerpt a
#result:
link_temp
<chr>
1 /episodes/116-mfk-fisher-how-to-cook-a-wolf
2 https://www.backlisted.fm/episodes/109-barbara-pym-excellent-women
3 /episodes/115-george-amp-weedon-grossmith-the-diary-of-a-nobody
4 https://www.backlisted.fm/episodes/27-jane-gardam-a-long-way-from-verona
5 https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-double-entry
6 https://www.backlisted.fm/episodes/97-ray-bradbury-the-illustrated-man
7 /episodes/114-william-golding-the-inheritors
8 https://www.backlisted.fm/episodes/30-georgette-heyer-venetia
9 https://www.backlisted.fm/episodes/49-anita-brookner-look-at-me
10 https://www.backlisted.fm/episodes/71-jrr-tolkien-the-return-of-the-king
# … with 43 more rows
I've been reading multiple documents but I can't target that one type of href. Any help will be much appreciated. Thank you.

Try this
library(rvest)
library(tidyverse)
session <- html_session("https://www.backlisted.fm/index")
raw_html <- read_html(session)
node <- raw_html %>% html_nodes(css = "li p a")
link <- node %>% html_attr("href")
title <- node %>% html_text()
tibble(title, link)
# A tibble: 117 x 2
# title link
# <chr> <chr>
# 1 "A Month in the Country" https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country
# 2 " - J.L. Carr (with Lissa Evans)" #
# 3 "Good Morning, Midnight - Jean Rhys" https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight
# 4 "It Had to Be You - David Nobbs" https://www.backlisted.fm/episodes/3-david-nobbs-1
# 5 "The Blessing - Nancy Mitford" https://www.backlisted.fm/episodes/4-nancy-mitford-the-blessing
# 6 "Christie Malry's Own Double Entry - B.S. Joh… https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-dou…
# 7 "Passing - Nella Larsen" https://www.backlisted.fm/episodes/6-nella-larsen-passing
# 8 "The Great Fire - Shirley Hazzard" https://www.backlisted.fm/episodes/7-shirley-hazzard-the-great-fire
# 9 "Lolly Willowes - Sylvia Townsend Warner" https://www.backlisted.fm/episodes/8-sylvia-townsend-warner-lolly-willow…
# 10 "The Information - Martin Amis" https://www.backlisted.fm/episodes/9-martin-amis-the-information
# … with 107 more rows
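The output in the question mixes relative (`/episodes/...`) and absolute (`https://www.backlisted.fm/episodes/...`) hrefs, so `distinct()` cannot recognise the same episode written both ways. A minimal sketch of normalising the links first, assuming `xml2::url_absolute()` is available (xml2 is installed alongside rvest); the HTML snippet and the `example.org` base URL are made up for illustration and are not the real page:

```r
library(rvest)

# Stand-in for the episode index page; markup and base URL are hypothetical.
page <- read_html('<html><body><ul>
  <li><p><a href="/episodes/1-first">First</a></p></li>
  <li><p><a href="https://example.org/episodes/1-first">First again</a></p></li>
  <li><p><a href="https://example.org/episodes/2-second">Second</a></p></li>
  <li><p><a href="#">placeholder</a></p></li>
</ul></body></html>')
base_url <- "https://example.org/"

hrefs <- page %>% html_nodes("li p a") %>% html_attr("href")

# Resolve every href against the base URL so both spellings compare equal.
absolute <- xml2::url_absolute(hrefs, base_url)

# Keep only episode pages, then drop duplicates.
episode_links <- unique(absolute[grepl("/episodes/", absolute, fixed = TRUE)])
```

Filtering on `/episodes/` after normalising also drops placeholder hrefs such as `#`, like the one in row 2 of the output above.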

Related

Using R to scrape a table and links from a web page

I am trying to scrape a website with R. I need the table and the links from that table, with each link associated with the correct row. I can get the table and the links, but the web table has two columns with links, some rows don't have links, and the links can't be sorted and joined by the file names. I can't figure out how to create a data frame with the columns and the links associated with the correct row.
library(rvest)
#Read HTML from EPA website
content <- read_html("https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys")
tables <- content %>%
html_table(fill = TRUE)
EPA_table <- tables[[1]]
#get links from table
web <- content %>%
html_nodes("table") %>% html_nodes("tr") %>% html_nodes("a") %>%
html_attr("href") #as above
Use the xpath= argument of html_nodes() to select columns.
## Data links
web <- content %>%
html_nodes("table tr")%>%
html_nodes(xpath="//td[3]") %>% ## xpath
html_nodes("a") %>%
html_attr("href")
EPA_table$web1 <- web ## add Data links column
## metadata links accordingly
web2 <- content %>%
html_nodes("table tr") %>%
html_nodes(xpath="//td[4]") %>% ## xpath
html_nodes("a") %>%
html_attr("href")
The empty Metadata cells can be set to NA; the metadata links then line up with the rows where Metadata is not NA.
EPA_table[EPA_table$Metadata %in% "", "Metadata"] <- NA
EPA_table[!is.na(EPA_table$Metadata), "web2"] <- web2 ## add metadata column
Result
head(EPA_table)
# Survey Indicator
# 1 Lakes 2007 All
# 2 Lakes 2007 Landscape Data
# 3 Lakes 2007 Water Chemistry
# 4 Lakes 2007 Visual Assessment
# 5 Lakes 2007 Site Information
# 6 Lakes 2007 Notes
# Data
# 1 NLA 2007 All Data (ZIP)(1 pg, 5 MB)
# 2 NLA 2007 Basin Landuse Metrics - Data 20061022 (CSV)(1 pg, 307 K)
# 3 NLA 2007 Profile - Data 20091008 (CSV)(1 pg, 888 K)
# 4 NLA 2007 Visual Assessment - Data 20091015 (CSV)(1 pg, 813 K)
# 5 NLA 2007 Site Information - Data 20091113 (CSV)(1 pg, 980 K)
# 6 National Lakes Assessment 2007 Final Data Notes
# Metadata
# 1 <NA>
# 2 NLA 2007 Basin Landuse Metrics - Metadata 20091022 (TXT)(1 pg, 4 K)
# 3 NLA 2007 Profile - Metadata 20091008 (TXT)(1 pg, 650 B)
# 4 NLA 2007 Visual Assessment - Metadata 10091015 (TXT)(1 pg, 7 K)
# 5 NLA 2007 Site Information - Metadata 20091113 (TXT)(1 pg, 8 K)
# 6 <NA>
# web1
# 1 /sites/production/files/2017-02/nla2007_alldata.zip
# 2 /sites/production/files/2013-09/nla2007_basin_landuse_metrics_20061022.csv
# 3 /sites/production/files/2013-09/nla2007_profile_20091008.csv
# 4 /sites/production/files/2014-01/nla2007_visualassessment_20091015.csv
# 5 /sites/production/files/2014-01/nla2007_sampledlakeinformation_20091113.csv
# 6 /national-aquatic-resource-surveys/national-lakes-assessment-2007-final-data-notes
# web2
# 1 <NA>
# 2 /sites/production/files/2013-09/nla2007_basin_landuse_metrics_info_20091022.txt
# 3 /sites/production/files/2013-09/nla2007_profile_info_20091008_0.txt
# 4 /sites/production/files/2014-01/nla2007_visualassessment_info_20091015.txt
# 5 /sites/production/files/2014-01/nla2007_sampledlakeinformation_info_20091113.txt
# 6 <NA>
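One caveat on the XPath used above: an expression starting with `//` is evaluated from the document root, not from the rows it is applied to, so it only works as intended when the page holds a single table. A small sketch of the difference, using a made-up two-table document (the ids `a` and `b` and the hrefs are hypothetical); the relative form `.//td[3]` stays inside the selected rows:

```r
library(rvest)

# Two hypothetical tables; only table "a" is of interest.
doc <- read_html('<html><body>
  <table id="a">
    <tr><td>r1</td><td>x</td><td><a href="/a1.csv">a1</a></td></tr>
    <tr><td>r2</td><td>y</td><td><a href="/a2.csv">a2</a></td></tr>
  </table>
  <table id="b">
    <tr><td>other</td><td>z</td><td><a href="/b1.csv">b1</a></td></tr>
  </table>
</body></html>')

rows <- html_nodes(doc, "#a tr")

# "//td[3]" restarts from the document root, so cells of table b are matched too.
from_root <- rows %>%
  html_nodes(xpath = "//td[3]") %>%
  html_nodes("a") %>%
  html_attr("href")

# ".//td[3]" is relative to each selected row.
from_rows <- rows %>%
  html_nodes(xpath = ".//td[3]") %>%
  html_nodes("a") %>%
  html_attr("href")
```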
I would have gone with CSS selectors and :nth-child to separate out individual columns while looping over the table rows. Using tbody in the selector excludes the header row, so only the table body rows are processed, and that list is passed to map_df.
library(rvest)
library(purrr)
library(dplyr) # provides if_else(), used below
url <- "https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys"
rows <- read_html(url) %>% html_nodes("#narsdata tbody tr")
df <- map_df(rows, function(x) {
data.frame(
Survey = x %>% html_node("td:nth-child(1)") %>% html_text(),
Indicator = x %>% html_node("td:nth-child(2)") %>% html_text(),
Data = x %>% html_node("td:nth-child(3) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
Metadata = x %>% html_node("td:nth-child(4) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
stringsAsFactors = FALSE
)
})
I don't think you really need the file names in addition to the URLs, but if you do, you can expand the data.frame with two additional columns and extract html_text rather than html_attr, e.g.
Data_Name = x %>% html_node("td:nth-child(3) a") %>% html_text(),
Metadata_Name = x %>% html_node("td:nth-child(4) a") %>% html_text()
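The row alignment in this approach comes from `html_node()` (singular), which returns exactly one result per input node and a missing node where nothing matches. Because it is vectorised over a set of rows, the same idea can be sketched without an explicit loop. The table below is a made-up miniature (the id `demo` and the hrefs are hypothetical), not the EPA page:

```r
library(rvest)

# Hypothetical miniature of a table where one row has no metadata link.
doc <- read_html('<html><body><table id="demo"><tbody>
  <tr><td>Lakes 2007</td><td>All</td>
      <td><a href="/all.zip">All data</a></td><td></td></tr>
  <tr><td>Lakes 2007</td><td>Profile</td>
      <td><a href="/profile.csv">Data</a></td>
      <td><a href="/profile.txt">Metadata</a></td></tr>
</tbody></table></body></html>')

rows <- html_nodes(doc, "#demo tbody tr")

# html_node() yields one entry per row; rows without a match become NA.
df <- data.frame(
  Survey    = rows %>% html_node("td:nth-child(1)") %>% html_text(),
  Indicator = rows %>% html_node("td:nth-child(2)") %>% html_text(),
  Data      = rows %>% html_node("td:nth-child(3) a") %>% html_attr("href"),
  Metadata  = rows %>% html_node("td:nth-child(4) a") %>% html_attr("href"),
  stringsAsFactors = FALSE
)
```

The missing metadata link shows up as `NA` in the first row, so no separate index matching is needed.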

How do I clean and organize this list of scraped data?

My issue is that I've managed to, with great assistance from this community, scrape much of the data I desire; however, I have not managed to organize it in any meaningful way. The links I used in source are a representative sample of the many, many links I have for this project.
library(rvest)
library(tidyverse)
#source links
source<-c("http://www.ufcstats.com/fighter-details/f2688492b9a525a3","http://www.ufcstats.com/fighter-details/f1fac969a1d70b08")
fp_e<-map(source, function(career_data){
read_html(career_data)%>%
html_nodes("div ul li")%>%
html_text()%>%
#cleans up the data a bit
str_replace_all(.,"\n\\s+\n\\s+","")%>%
as.data.frame(.)
})
What I want to do with this list is to turn it into a usable dataframe. My original idea was to transpose() it after as.data.frame(); however, all it did was put everything in a single row. Additionally, I was unable to index the data frame. This led me to believe the data frame was not set up how I thought it was. I want to be more specific here but I'm honestly quite confused at this point.
Searching around, I found this question and the answer by neilfws gave me an idea of building the dataframe and inserting the data into it; however, I don't even know where to start. I'm also unsure if it's necessary to do that when it's already set up in a format I like.
This is the first real-world R application I've tried and I'm really stumbling on how to organize this data. Thank you for any help and suggestions!
You can do some data cleaning with the tidyverse library:
library(tidyverse)
library(rvest)
map(source, function(career_data){
read_html(career_data) %>%
html_nodes("div ul li")%>%
html_text() %>%
trimws(whitespace = '[\\s\n]') %>%
tibble(data = .) %>%
separate(data, c('Property', 'Value'), sep = ':') %>%
na.omit() %>%
mutate(Value = trimws(Value, whitespace = '[\n\\s]'))
})
This returns :
#[[1]]
# A tibble: 13 x 2
# Property Value
# <chr> <chr>
# 1 Height "5' 6\""
# 2 Weight "135 lbs."
# 3 Reach "68\""
# 4 STANCE "Orthodox"
# 5 DOB "Oct 16, 1981"
# 6 SLpM "3.70"
# 7 Str. Acc. "39%"
# 8 SApM "2.70"
# 9 Str. Def "66%"
#10 TD Avg. "2.28"
#11 TD Acc. "31%"
#12 TD Def. "65%"
#13 Sub. Avg. "0.3"
#[[2]]
# A tibble: 13 x 2
# Property Value
# <chr> <chr>
# 1 Height "6' 2\""
# 2 Weight "170 lbs."
# 3 Reach "74\""
# 4 STANCE "Southpaw"
# 5 DOB "Aug 25, 1991"
# 6 SLpM "2.53"
# 7 Str. Acc. "47%"
# 8 SApM "2.05"
# 9 Str. Def "55%"
#10 TD Avg. "1.39"
#11 TD Acc. "31%"
#12 TD Def. "70%"
#13 Sub. Avg. "0.4"
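If a single data frame with one row per fighter is wanted, the list of Property/Value tibbles can be stacked and spread. A minimal sketch, assuming every fighter page yields the same set of properties; the two tibbles below are typed in by hand from the output above (three properties only) so the example runs without network access:

```r
library(dplyr)
library(tidyr)

# Hand-copied subset of the output above, standing in for the scraped list.
fighters <- list(
  tibble(Property = c("Height", "Weight", "STANCE"),
         Value    = c("5' 6\"", "135 lbs.", "Orthodox")),
  tibble(Property = c("Height", "Weight", "STANCE"),
         Value    = c("6' 2\"", "170 lbs.", "Southpaw"))
)

wide <- fighters %>%
  bind_rows(.id = "fighter") %>%                      # tag rows with list position
  pivot_wider(names_from = Property, values_from = Value)
```

The `fighter` column holds the position in the list; naming the list with the source URLs before calling `bind_rows()` would carry those through instead.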

scraping wikimedia category trees

I want to use R to scrape the links contained within a wikimedia category tree and the structure of the tree from here. The code below can open up all the collapsible bullet points
library(RSelenium)
rD <- rsDriver(check = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://commons.wikimedia.org/wiki/Category:Sports")
n <- 1
# n <- 10 # takes a long time to expand all bullet points
for(pass in 1:n){
b <- remDr$findElements(using = "css selector", "[title='expand']")
# seq_along() copes with an empty list, and j no longer shadows the outer loop variable
for(j in seq_along(b)){
b[[j]]$clickElement()
}
}
... but I am struggling to build a database that would look like...
I can get the bullet href and names using the code below, but I am struggling to find a way to indicate which level each bullet point refers to (i.e. how deep in the category tree each bullet point is). I am thinking there might be a clever xpath method to count how many CategoryTreeChildren deep each bullet is, but that is reaching well beyond my capabilities.
# for testing I manually expand the bullets for the first couple of branches
# (fully for Bulgaria women badminton, basketball) and the last possible
# branch rather than let the for loop run and run through multiple cycles.
library(tidyverse)
library(rvest)
s <- remDr$getPageSource()
d <- read_html(s[[1]]) %>%
html_nodes("div#mw-subcategories") %>%
html_nodes("div.CategoryTreeItem") %>%
html_nodes("a") %>%
map(xml_attrs) %>%
map_df(~as.list(.)) %>%
as_tibble()
# > d
# # A tibble: 135 x 2
# href title
# <chr> <chr>
# 1 /wiki/Category:Categories_by_sport Category:Categories by sport
# 2 /wiki/Category:Categories_by_sport_by_c~ Category:Categories by sport by co~
# 3 /wiki/Category:Categories_of_Bulgaria_b~ Category:Categories of Bulgaria by~
# 4 /wiki/Category:Female_sportspeople_from~ Category:Female sportspeople from ~
# 5 /wiki/Category:Female_badminton_players~ Category:Female badminton players ~
# 6 /wiki/Category:Maria_Delcheva Category:Maria Delcheva
# 7 /wiki/Category:Petya_Nedelcheva Category:Petya Nedelcheva
# 8 /wiki/Category:Gabriela_Stoeva Category:Gabriela Stoeva
# 9 /wiki/Category:Stefani_Stoeva Category:Stefani Stoeva
# 10 /wiki/Category:Women%27s_basketball_pla~ Category:Women's basketball player~
I have also played around with the WikipediR package. The package description says it can be used to retrieve elements of category trees, but I cannot find an example of how to implement it.
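The XPath idea mentioned above can be sketched by counting, for each link, how many ancestor `div`s carry the `CategoryTreeChildren` class. This is only a sketch: the nested markup below is an assumption about how the expanded tree is structured (the `CategoryTreeSection` wrapper in particular is a guess), so the expression may need adjusting against the real page source. `xml_find_num()` is the xml2 function for XPath expressions that return a number.

```r
library(rvest)
library(xml2)

# Assumed shape of an expanded category tree; not copied from the real page.
page <- read_html('<html><body><div id="mw-subcategories">
  <div class="CategoryTreeSection">
    <div class="CategoryTreeItem"><a href="/wiki/Category:A" title="Category:A">A</a></div>
    <div class="CategoryTreeChildren">
      <div class="CategoryTreeSection">
        <div class="CategoryTreeItem"><a href="/wiki/Category:B" title="Category:B">B</a></div>
        <div class="CategoryTreeChildren">
          <div class="CategoryTreeSection">
            <div class="CategoryTreeItem"><a href="/wiki/Category:C" title="Category:C">C</a></div>
          </div>
        </div>
      </div>
    </div>
  </div>
</div></body></html>')

links <- page %>% html_nodes("div#mw-subcategories div.CategoryTreeItem a")

# Depth = number of enclosing CategoryTreeChildren containers.
depth <- vapply(
  links,
  function(a) xml_find_num(a, "count(ancestor::div[contains(@class, 'CategoryTreeChildren')])"),
  numeric(1)
)

d <- data.frame(
  href  = html_attr(links, "href"),
  title = html_attr(links, "title"),
  depth = depth,
  stringsAsFactors = FALSE
)
```

With RSelenium, `read_html(s[[1]])` from the code above would take the place of the inline HTML.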

Create aggregate df out of webscraping multiple pages using Rvest and Glue

I'm working on scraping data from a table on the following website
https://fantasy.nfl.com/research/scoringleaders?position=1&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1
I want to create a scrape that takes all 17 weeks, all four positions (qb,rb,wr,te) and takes the first 4 pages to get the first 100 rows (only 25 shown on a page at a time).
library(tidyverse)
library(rvest)
library(glue)
scrape_19 <- function(week, position, page) {
Sys.sleep(3)
cat(".")
url <- glue("https://fantasy.nfl.com/research/scoringleaders?{page}position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}")
read_html(url) %>%
html_nodes("table") %>%
html_table(header = T) %>%
simplify() %>%
first() %>%
setNames(paste0(colnames(.), as.character(.[1,]))) %>%
slice(-1) %>%
list()
}
Here are all the iterations of each call in glue:
week = 1:17;
position = 1:4;
page = c("", "offset=26&", "offset=51&", "offset=76&")
The problem I run into is when I try to make one df with all the data for each week, position and page. Here is code that works for week and position but will not work for an additional nested df.
# week and position are the vectors defined above
scaffold <- tibble(week = week,
position = list(position)) %>% tidyr::unnest(position)
scaffold
tbl_data <- scaffold %>%
mutate(data = purrr::map2(week, position, ~scrape_19(.x, .y)[[1]]))
Basically, I need help in crafting the scaffold and turning that scaffold into the final total data set with all weeks, positions and pages.
Here is my attempt. I am not sure if glue() is the way to go. See below.
first_name <- c("Fred", "Ana", "Bob")
last_name <- c("JOhnson", "Trump")
glue('My name is {first_name} {last_name}.')
Error: Variables must be length 1 or 3
Your case is similar to this example, so I created all possible links using loops with map() and then checked whether all the URLs exist. I used map_dfr() to loop through the URLs and bind all the data frames. In this process, I added week and position information as well. If position is 1, it is QB; recode these numbers yourself if necessary. Note that I scraped only four URLs in this demonstration.
library(httr)
library(rvest)
library(tidyverse)
# Create all URLs.
# Create 4 base URLs
paste("https://fantasy.nfl.com/research/scoringleaders?",
c("", "offset=26&", "offset=51&", "offset=76&"),
"position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}",
sep = "") -> mytemp
# For each base URL, create 4 URLs. (4 x 4 = 16 URLs)
map(.x = 1:4,
.f = function(x){gsub(x = mytemp, pattern = "\\{position\\}", replacement = x)}) %>%
unlist -> mytemp
# For each of the 16 URLs, create 17 URLs
map(.x = 1:17,
.f = function(x){gsub(x = mytemp, pattern = "\\{week\\}", replacement = x)}) %>%
unlist -> myurls
# Check if any URLs are invalid
# url_success() is deprecated in httr; !http_error() is the documented replacement
sapply(myurls, function(u) !http_error(u)) %>% table
# TRUE
# 272
# Scrape the tables
map_dfr(.x = myurls[1:4],
.f = function(x){read_html(x) %>%
html_nodes("table") %>%
html_table() %>%
simplify() %>%
first() %>%
setNames(paste0(colnames(.), as.character(.[1,]))) %>%
slice(-1) %>%
mutate(position = str_extract(string = x, pattern = "(?<=position=)\\d+(?=&)"),
week = str_extract(string = x, pattern = "(?<=statWeek=)\\d+"))},
.id = "url") -> foo
url Rank Player Opp PassingYds PassingTD PassingInt RushingYds RushingTD ReceivingRec ReceivingYds
1 1 1 Lamar Jackson QB - BAL #MIA 324 5 - 6 - - -
2 1 2 Dak Prescott QB - DAL NYG 405 4 - 12 - - -
3 1 3 Deshaun Watson QB - HOU #NO 268 3 1 40 1 - -
4 1 4 Matthew Stafford QB - DET #ARI 385 3 - 22 - - -
5 1 5 Patrick Mahomes QB - KC #JAX 378 3 - 2 - - -
ReceivingTD RetTD MiscFumTD Misc2PT FumLost FantasyPoints position week
1 - - - - - 33.56 1 1
2 - - - - - 33.40 1 1
3 - - - - - 30.72 1 1
4 - - - - 1 27.60 1 1
5 - - - - - 27.32 1 1
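The same set of 272 URLs can also be built in one step as a grid of all page, position and week combinations, which keeps those three values as columns instead of extracting them from the URL with regular expressions afterwards. A base R sketch using `expand.grid()`; the offset strings and URL pieces are copied from the question, nothing is fetched:

```r
# All combinations: 4 pages x 4 positions x 17 weeks = 272 rows.
grid <- expand.grid(
  page     = c("", "offset=26&", "offset=51&", "offset=76&"),
  position = 1:4,
  week     = 1:17,
  stringsAsFactors = FALSE
)

# paste0() is vectorised, so each row gets its own URL.
grid$url <- paste0(
  "https://fantasy.nfl.com/research/scoringleaders?", grid$page,
  "position=", grid$position,
  "&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=",
  grid$week
)
```

A scraping function can then be mapped over `grid$url`, with `grid$week` and `grid$position` attached to each result.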

How to scrape multiple tables that are without IDs or Class using R

I'm trying to scrape this webpage using R : http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (All the pages)
I'm new to programming. And everywhere I've looked, tables are mostly identified with IDs or Divs or Class. On this page there's none. Data is stored in Table format. How should I scrape it?
This is what I did :
library(rvest)
webpage <- read_html("http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all")
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
html_nodes("table") %>%
.[9:10] %>%
html_table(fill = TRUE)
colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
"Police Station", "Status", "Mobile Type(GSM/CDMA)",
"FIR/DD/GD Date")
You can scrape the table data by targeting the css id of each table. It looks like each page is composed of 3 different tables pasted one after another. Two of the tables have #AutoNumber15 css id while the third (in the middle) has the #AutoNumber16 css id.
I put a simple code example that should get you started in the right direction.
suppressMessages(library(tidyverse))
suppressMessages(library(rvest))
# define function to scrape the table data from a page
get_page <- function(page_id = 1) {
# default link
link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
# build link
link <- paste0(link, page_id)
# get tables data
wp <- read_html(link)
wp %>%
html_nodes("#AutoNumber16, #AutoNumber15") %>%
html_table(fill = TRUE) %>%
bind_rows()
}
# get the data from the first three pages
iter_page <- 1:3
# this is just a progress bar
pb <- progress_estimated(length(iter_page))
# this code will iterate over pages 1 through 3 and apply the get_page()
# function defined earlier. The Sys.sleep() part is used to pause the code
# after each iteration so that the server is not overloaded with requests.
map_df(iter_page, ~ {
pb$tick()$print()
df <- get_page(.x)
Sys.sleep(sample(10, 1) * 0.1)
as_tibble(df)
})
#> # A tibble: 72 x 4
#> X1 X2 X3
#> <chr> <chr> <chr>
#> 1 FIR/DD/GD Number 000165 State
#> 2 FIR/DD/GD Date 17/08/2017 District
#> 3 Mobile Type(GSM/CDMA) GSM Police Station
#> 4 Mobile Make SAMSUNG J2 Mobile Number
#> 5 Missing/Stolen Date 23/04/2017 IMEI Number
#> 6 Complainant AKEEL KHAN Complainant Contact Number
#> 7 Status Stolen/Theft Report Date/Time on ZIPNET
#> 8 <NA> <NA> <NA>
#> 9 FIR/DD/GD Number FIR No 37/ State
#> 10 FIR/DD/GD Date 17/08/2017 District
#> # ... with 62 more rows, and 1 more variables: X4 <chr>
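The result above is still in a label/value layout, with an all-NA row between records. One possible next step is to number the records by counting those separator rows and then turn each record into one row. A base R sketch on made-up data in the same layout (only two fields per record, and the second Status value is invented); it assumes every record lists the same labels in the same order:

```r
# Made-up rows mimicking the X1/X2 columns of the scraped output.
raw <- data.frame(
  X1 = c("FIR/DD/GD Number", "Status", NA, "FIR/DD/GD Number", "Status"),
  X2 = c("000165", "Stolen/Theft", NA, "FIR No 37/", "Missing"),
  stringsAsFactors = FALSE
)

# Each all-NA row closes a record, so a running count of them labels the records.
record <- cumsum(is.na(raw$X1)) + 1
keep   <- !is.na(raw$X1)

# One named character vector per record, stacked into a character matrix.
by_record <- split(raw[keep, ], record[keep])
wide <- do.call(rbind, lapply(by_record, function(r) setNames(r$X2, r$X1)))
```

The X3/X4 columns of the real output hold a second set of label/value pairs and could be handled the same way before combining the two results.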
