Webscrape headings and lists to a dataframe with rvest

I would like to scrape the hyperlinks on this webpage into a dataframe with the columns shown below. The source page contains headings and lists of links.
subject.heading (problem)
hyperlink.title (OK)
hyperlink (OK)
Getting the links and titles is straightforward (html_nodes("li") and html_nodes("a")). I'm not clear how to incorporate the subject headings into the final dataframe.
library(tidyverse)
library(rvest)
my.url <- read_html("http://www.secnav.navy.mil/fmc/fmb/Pages/Fiscal-Year-2019.aspx") %>%
  html_nodes("#sharePointMainContent")
hyperlink.title <- my.url %>%
  html_nodes("li") %>%
  html_text()
hyperlink <- my.url %>%
  html_nodes("li") %>%
  html_nodes("a") %>%
  html_attr("href")
df <- tibble(hyperlink.title, hyperlink)
I can successfully scrape the headings, but cannot figure out how to incorporate them into the final dataframe properly.
subject.heading <- my.url %>%
  html_nodes("h3") %>%
  html_text() %>%
  str_trim()
Created on 2018-09-03 by the reprex package (v0.2.0).

That page has a weird structure, with tables inside the main table.
What I found to work is to iterate over the cells of the parent table (identified by the s4-wpcell-plain class) with map_df(). Each cell contains another table, but we can simply extract what we are after instead of relying on html_table().
library(tidyverse)
library(rvest)
#> Loading required package: xml2
r <- read_html("http://www.secnav.navy.mil/fmc/fmb/Pages/Fiscal-Year-2019.aspx") %>%
  html_node("#sharePointMainContent>div>table") %>%
  html_nodes(".s4-wpcell-plain") %>%
  map_df(~{
    heading <- .x %>% html_nodes('h3') %>% html_text() %>% str_trim()
    titles <- .x %>% html_nodes('li') %>% html_text()
    links <- .x %>% html_nodes('a') %>% html_attr("href")
    data_frame(heading, titles, links)
  })
r
#> # A tibble: 21 x 3
#> heading titles links
#> <chr> <chr> <chr>
#> 1 DEPARTMENT OF THE NAVY SUMMARY FY 19 DON Press Brief http://www.secna…
#> 2 DEPARTMENT OF THE NAVY SUMMARY Supporting Exhibits http://www.secna…
#> 3 DEPARTMENT OF THE NAVY SUMMARY Budget Highlights Book http://www.secna…
#> 4 DEPARTMENT OF THE NAVY SUMMARY The Bottom Line http://www.secna…
#> 5 DEPARTMENT OF THE NAVY SUMMARY Report to Congress on… http://www.secna…
#> 6 DEPARTMENT OF THE NAVY SUMMARY Ship Building Plan SE… http://www.secna…
#> 7 MILITARY PERSONNEL PROGRAMS Military Personnel, N… http://www.secna…
#> 8 MILITARY PERSONNEL PROGRAMS Military Personnel, M… http://www.secna…
#> 9 MILITARY PERSONNEL PROGRAMS Reserve Personnel, Na… http://www.secna…
#> 10 MILITARY PERSONNEL PROGRAMS Reserve Personnel, Ma… http://www.secna…
#> # ... with 11 more rows
Created on 2018-09-04 by the reprex package (v0.2.0).
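As a follow-up, a minimal alternative sketch (assuming the same page structure, i.e. each .s4-wpcell-plain cell holds one h3 plus matching li/a pairs) keeps the heading-to-link pairing explicit by building one row per cell with list-columns and then unnesting:
library(tidyverse)
library(rvest)
cells <- read_html("http://www.secnav.navy.mil/fmc/fmb/Pages/Fiscal-Year-2019.aspx") %>%
  html_node("#sharePointMainContent>div>table") %>%
  html_nodes(".s4-wpcell-plain")
df <- tibble(
  # one heading per cell; NA if a cell has no <h3>
  heading = map_chr(cells, ~ .x %>% html_node("h3") %>% html_text() %>% str_trim()),
  # list-columns holding every title and href found in that cell
  titles = map(cells, ~ .x %>% html_nodes("li") %>% html_text()),
  links = map(cells, ~ .x %>% html_nodes("a") %>% html_attr("href"))
) %>%
  unnest(c(titles, links))  # one row per link, with the cell's heading repeated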

Related

How to scrape the same type of data from multiple links in R

I have links in a column of a data frame and want to extract the same type of data from each link, all at once, like this:
page <- read_html("https://www.airbnb.co.in/users/show/129534814")
page %>% html_nodes("._a0kct9 ._14i3z6h") %>% html_text()
If your links are in a data frame like this:
df <- data.frame(links = c("https://www.airbnb.co.in/users/show/446820235",
                           "https://www.airbnb.co.in/users/show/221530395",
                           "https://www.airbnb.co.in/users/show/74933177",
                           "https://www.airbnb.co.in/users/show/213865220",
                           "https://www.airbnb.co.in/users/show/362873365",
                           "https://www.airbnb.co.in/users/show/167648591",
                           "https://www.airbnb.co.in/users/show/143273640"))
Then you can scrape the text and store it in your data frame like this:
library(rvest)
df$greeting <- sapply(df$links, function(url) {
  read_html(url) %>% html_nodes("._a0kct9 ._14i3z6h") %>% html_text()
}, USE.NAMES = FALSE)
df
#> links greeting
#> 1 https://www.airbnb.co.in/users/show/446820235 Hi, I’m LuxurybookingsFZE
#> 2 https://www.airbnb.co.in/users/show/221530395 Hi, I’m Blueground
#> 3 https://www.airbnb.co.in/users/show/74933177 Hi, I’m Deluxe Holiday Homes
#> 4 https://www.airbnb.co.in/users/show/213865220 Hi, I’m Andy
#> 5 https://www.airbnb.co.in/users/show/362873365 Hi, I’m Key View
#> 6 https://www.airbnb.co.in/users/show/167648591 Hi, I’m Gregory
#> 7 https://www.airbnb.co.in/users/show/143273640 Hi, I’m AlNisreen
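If one of the profile pages does not return exactly one match, the sapply() call will produce a list rather than a character vector (or fail on assignment). A hedged purrr variant, assuming the same selector, that falls back to NA in that case:
library(purrr)
library(rvest)
df$greeting <- map_chr(df$links, function(url) {
  txt <- read_html(url) %>%
    html_nodes("._a0kct9 ._14i3z6h") %>%
    html_text()
  if (length(txt) == 0) NA_character_ else txt[[1]]  # NA when the node is missing
})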

Is there a function similar to read_html() that can be used on data table or data frame types in R?

I'm attempting to scrape footballdb.com for NFL player injury data for a model I am creating, using links such as this: https://www.footballdb.com/transactions/injuries.html?yr=2016&wk=1&type=reg. The results should end up in a data table. Along with each player's injury information (name, injury, and status throughout the week leading up to the game), I also want to include the season and week of the injury for each player. I started by using nested for loops to generate the URL for each webpage in question, along with the season and week corresponding to each webpage, and stored them in a data table with the columns link, season, and week.
I then tried to use map_df(), read_html(), and html_nodes() to extract the information I wanted from each webpage, but I run into errors because read_html() does not work on objects of the data table or data frame class. I also tried different types of indexing and the $ operator with no luck. Is there any way I can modify the code I have written so far to extract the information I want from a data table? Below is what I have so far:
library(purrr)
library(rvest)
library(data.table)
# Remove the file if it already exists
if (file.exists("./project/volume/data/interim/injuryreports.csv")) {
  file.remove("./project/volume/data/interim/injuryreports.csv")
}
# Declare variables and empty data tables
path1 <- "https://www.footballdb.com/transactions/injuries.html?yr="
seasons <- c("2016", "2017", "2020")
weeks <- 1:17
result <- data.table()
temp <- NULL
# Use nested for loops to get the url, season, and week for each webpage of interest; store them in the result data table
for (s in 1:length(seasons)) {
  for (w in 1:length(weeks)) {
    temp$link <- paste0(path1, seasons[s], "&wk=", as.character(w), "&type=reg")
    temp$season <- as.numeric(seasons[s])
    temp$week <- weeks[w]
    result <- rbind(result, temp)
  }
}
# Get rid of any potential empty values from result
result <- compact(result)
### Errors below ###
DT <- map_df(result, function(x) {
  page <- read_html(x[[1]])
  data.table(
    Season = x[[2]],
    Week = x[[3]],
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
    Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
    Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
    Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
    Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
    GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
  )
})
### End of errors ###
# Write out the injury data table
fwrite(DT, "./project/volume/data/interim/injuryreports.csv")
The issue is that your input object result is a data.table. When you pass it to map_df(), it loops over the columns (!) of the data.table, not the rows.
One approach to make your code work is to split result by link and loop over the resulting list.
Note: For the reprex I only loop over the first two elements of the list. Additionally, I have moved your function outside of the map statement, which makes debugging easier.
library(purrr)
library(rvest)
library(data.table)
# Declare variables and empty data tables
path1 <- "https://www.footballdb.com/transactions/injuries.html?yr="
seasons <- c("2016", "2017", "2020")
weeks <- 1:17
result <- data.table()
temp <- NULL
# Use nested for loops to get the url, season, and week for each webpage of interest; store them in the result data table
for (s in 1:length(seasons)) {
  for (w in 1:length(weeks)) {
    temp$link <- paste0(path1, seasons[s], "&wk=", as.character(w), "&type=reg")
    temp$season <- as.numeric(seasons[s])
    temp$week <- weeks[w]
    result <- rbind(result, temp)
  }
}
# Get rid of any potential empty values from result
result <- compact(result)
result <- split(result, result$link)
get_table <- function(x) {
  page <- read_html(x[[1]])
  data.table(
    Season = x[[2]],
    Week = x[[3]],
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
    Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
    Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
    Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
    Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
    GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
  )
}
DT <- map_df(result[1:2], get_table)
DT
#> Season Week Player Injury Wed Thu Fri
#> 1: 2016 1 Justin Bethel Foot Limited Limited Limited
#> 2: 2016 1 Lamar Louis Knee DNP Limited Limited
#> 3: 2016 1 Kareem Martin Knee DNP DNP DNP
#> 4: 2016 1 Alex Okafor Biceps Full Full Full
#> 5: 2016 1 Frostee Rucker Neck Limited Limited Full
#> ---
#> 437: 2016 10 Will Blackmon Thumb Limited Limited Limited
#> 438: 2016 10 Duke Ihenacho Concussion Full Full Full
#> 439: 2016 10 DeSean Jackson Shoulder DNP DNP DNP
#> 440: 2016 10 Morgan Moses Ankle Limited Limited Limited
#> 441: 2016 10 Brandon Scherff Shoulder Full Full Full
#> GameStatus
#> 1: (09/09) Questionable vs NE
#> 2: (09/09) Questionable vs NE
#> 3: (09/09) Out vs NE
#> 4: --
#> 5: --
#> ---
#> 437: (11/11) Questionable vs Min
#> 438: (11/11) Questionable vs Min
#> 439: (11/11) Doubtful vs Min
#> 440: (11/11) Questionable vs Min
#> 441: --
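An alternative sketch that avoids split() and iterates row-wise with purrr::pmap_dfr() instead, assuming result is still the link/season/week table built by the for loops (i.e. before the split() call); only the first few output columns are shown, the rest follow the same pattern as get_table():
DT <- pmap_dfr(result[1:2, ], function(link, season, week) {
  # the argument names must match the column names of `result`
  page <- read_html(link)
  data.table(
    Season = season,
    Week = week,
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
    Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text()
    # ... remaining columns as in get_table()
  )
})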

Cannot identify html node for scraping in rvest

I'm trying to grab links from a page for subsequent analysis but can only grab about half of them, which may be due to my filtering. The links I'm after are the episode links on the page.
My approach is as follows, which is not ideal because I believe I may be losing some links in the filter() call.
library(rvest)
library(tidyverse)
# initiate session
session <- html_session("https://www.backlisted.fm/episodes")
# collect links for all episodes from the index page:
session %>%
  read_html() %>%
  html_nodes(".underline-body-links a") %>%
  html_attr("href") %>%
  tibble(link_temp = .) %>%
  filter(str_detect(link_temp, pattern = "episodes/")) %>%
  distinct()
# css:
# .underline-body-links #page .html-block a, .underline-body-links #page .product-excerpt a
# result:
link_temp
<chr>
1 /episodes/116-mfk-fisher-how-to-cook-a-wolf
2 https://www.backlisted.fm/episodes/109-barbara-pym-excellent-women
3 /episodes/115-george-amp-weedon-grossmith-the-diary-of-a-nobody
4 https://www.backlisted.fm/episodes/27-jane-gardam-a-long-way-from-verona
5 https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-double-entry
6 https://www.backlisted.fm/episodes/97-ray-bradbury-the-illustrated-man
7 /episodes/114-william-golding-the-inheritors
8 https://www.backlisted.fm/episodes/30-georgette-heyer-venetia
9 https://www.backlisted.fm/episodes/49-anita-brookner-look-at-me
10 https://www.backlisted.fm/episodes/71-jrr-tolkien-the-return-of-the-king
# … with 43 more rows
I've been reading multiple documents but I can't target that one type of href. Any help will be much appreciated. Thank you.
Try this:
library(rvest)
library(tidyverse)
session <- html_session("https://www.backlisted.fm/index")
raw_html <- read_html(session)
node <- raw_html %>% html_nodes(css = "li p a")
link <- node %>% html_attr("href")
title <- node %>% html_text()
tibble(title, link)
# A tibble: 117 x 2
# title link
# <chr> <chr>
# 1 "A Month in the Country" https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country
# 2 " - J.L. Carr (with Lissa Evans)" #
# 3 "Good Morning, Midnight - Jean Rhys" https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight
# 4 "It Had to Be You - David Nobbs" https://www.backlisted.fm/episodes/3-david-nobbs-1
# 5 "The Blessing - Nancy Mitford" https://www.backlisted.fm/episodes/4-nancy-mitford-the-blessing
# 6 "Christie Malry's Own Double Entry - B.S. Joh… https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-dou…
# 7 "Passing - Nella Larsen" https://www.backlisted.fm/episodes/6-nella-larsen-passing
# 8 "The Great Fire - Shirley Hazzard" https://www.backlisted.fm/episodes/7-shirley-hazzard-the-great-fire
# 9 "Lolly Willowes - Sylvia Townsend Warner" https://www.backlisted.fm/episodes/8-sylvia-townsend-warner-lolly-willow…
# 10 "The Information - Martin Amis" https://www.backlisted.fm/episodes/9-martin-amis-the-information
# … with 107 more rows
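One thing worth noting from the question's own output: some hrefs come back relative ("/episodes/..."). If consistent absolute URLs are wanted, xml2::url_absolute() can resolve them against the site root; a small sketch, assuming the title and link vectors built above:
library(xml2)
link <- url_absolute(link, "https://www.backlisted.fm")  # resolves relative hrefs against the site root
tibble(title, link)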

Rvest reading separated article data

I am looking to scrape article data from inquirer.net.
This is a follow-up question to Scrape Data through RVest
Here is the code that works based on the answer:
library(rvest)
#> Loading required package: xml2
library(tibble)
year <- 2020
month <- 06
day <- 13
url <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)
div <- read_html(url) %>% html_node(xpath = '//*[@id="index-wrap"]')
links <- html_nodes(div, xpath = '//a[@rel="bookmark"]')
post_date <- html_nodes(div, xpath = '//span[@class="index-postdate"]') %>%
  html_text()
test <- tibble(date = post_date,
               text = html_text(links),
               link = html_attr(links, "href"))
test
#> # A tibble: 261 x 3
#> date text link
#> <chr> <chr> <chr>
#> 1 1 day a~ ‘We can never let our guard down~ https://newsinfo.inquirer.net/129~
#> 2 1 day a~ PNP spox says mañanita remark di~ https://newsinfo.inquirer.net/129~
#> 3 1 day a~ After stranded mom’s death, Pasa~ https://newsinfo.inquirer.net/129~
#> 4 1 day a~ Putting up lining for bike lanes~ https://newsinfo.inquirer.net/129~
#> 5 1 day a~ PH Army provides accommodation f~ https://newsinfo.inquirer.net/129~
#> 6 1 day a~ DA: Local poultry production suf~ https://newsinfo.inquirer.net/129~
#> 7 1 day a~ IATF assessing proposed design t~ https://newsinfo.inquirer.net/129~
#> 8 1 day a~ PCSO lost ‘most likely’ P13B dur~ https://newsinfo.inquirer.net/129~
#> 9 2 days ~ DOH: No IATF recommendations yet~ https://newsinfo.inquirer.net/129~
#> 10 2 days ~ PH coronavirus cases exceed 25,0~ https://newsinfo.inquirer.net/129~
#> # ... with 251 more rows
I now want to add a new column to this output which has the full article for each row. Before doing the for-loop, I was investigating the html code for the first article: https://newsinfo.inquirer.net/1291178/pnp-spox-says-he-did-not-intend-to-put-sinas-in-bad-light
Digging into the HTML, I notice it is not that clean. From my findings so far, the main article data falls under #article_content, p. So my output right now is split across multiple rows, and a lot of non-article data appears. Here is what I have currently:
article_data <- data.frame(test)
article_url <- read_html(article_data[2, 3])
article <- article_url %>%
  html_nodes("#article_content , p") %>%
  html_text()
View(article)
I'm OK with this being multiple rows because I can just union the final result. But since there are other non-article items, it will mess up what I am trying to do (sentiment analysis).
Can someone please assist with how to clean this data so that the full article is next to each article link?
I could simply union the results excluding the first row and the last two rows, but I'm looking for a cleaner way, because I want to do this for all the article data and not just this one.
After a short look at the structure of the article page, I suggest using the CSS selector ".article_align div p".
library(rvest)
library(dplyr)
url <- "https://newsinfo.inquirer.net/1291178/pnp-spox-says-he-did-not-intend-to-put-sinas-in-bad-light"
read_html(url) %>%
  html_nodes(".article_align div p") %>%
  html_text()
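To get to the stated goal of a new column holding the full article for each row of test, one hedged approach (assuming every link in test points to a standard article page where this selector works, and accepting that ~261 rows means ~261 HTTP requests) is to loop over the links and collapse the paragraphs:
test$article <- sapply(test$link, function(u) {
  read_html(u) %>%
    html_nodes(".article_align div p") %>%
    html_text() %>%
    paste(collapse = " ")  # collapse the paragraphs into one string per article
}, USE.NAMES = FALSE)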

Scrape Data through RVest

I am looking to get the article names by category from https://www.inquirer.net/article-index?d=2020-6-13
I've attempted to read the article names by doing:
library('rvest')
year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-',day, sep = "")
pg <- read_html(url)
test <- pg %>%
  html_nodes("#index-wrap") %>%
  html_text()
This returns only one string of all the article names, and it's very messy.
I ultimately would like to have a dataframe that looks like below:
Date Category Article Name
2020-06-13 News ‘We can never let our guard down’ vs terrorism – Cayetano
2020-06-13 News PNP spox says mañanita remark did not intend to put Sinas in bad light
2020-06-13 News After stranded mom’s death, Pasay LGU helps over 400 stranded individuals
2020-06-13 World 4 dead after tanker truck explodes on highway in China
etc.
etc.
etc.
etc.
2020-06-13 Lifestyle Book: Melania Trump delayed 2017 move to DC to get new prenup
Does anyone know what I may be missing? Very new to this, thanks!
This is maybe the closest you can get:
library(rvest)
#> Loading required package: xml2
library(tibble)
year <- 2020
month <- 06
day <- 13
url <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)
div <- read_html(url) %>% html_node(xpath = '//*[@id="index-wrap"]')
links <- html_nodes(div, xpath = '//a[@rel="bookmark"]')
post_date <- html_nodes(div, xpath = '//span[@class="index-postdate"]') %>%
  html_text()
test <- tibble(date = post_date,
               text = html_text(links),
               link = html_attr(links, "href"))
test
#> # A tibble: 261 x 3
#> date text link
#> <chr> <chr> <chr>
#> 1 1 day a~ ‘We can never let our guard down~ https://newsinfo.inquirer.net/129~
#> 2 1 day a~ PNP spox says mañanita remark di~ https://newsinfo.inquirer.net/129~
#> 3 1 day a~ After stranded mom’s death, Pasa~ https://newsinfo.inquirer.net/129~
#> 4 1 day a~ Putting up lining for bike lanes~ https://newsinfo.inquirer.net/129~
#> 5 1 day a~ PH Army provides accommodation f~ https://newsinfo.inquirer.net/129~
#> 6 1 day a~ DA: Local poultry production suf~ https://newsinfo.inquirer.net/129~
#> 7 1 day a~ IATF assessing proposed design t~ https://newsinfo.inquirer.net/129~
#> 8 1 day a~ PCSO lost ‘most likely’ P13B dur~ https://newsinfo.inquirer.net/129~
#> 9 2 days ~ DOH: No IATF recommendations yet~ https://newsinfo.inquirer.net/129~
#> 10 2 days ~ PH coronavirus cases exceed 25,0~ https://newsinfo.inquirer.net/129~
#> # ... with 251 more rows
Created on 2020-06-14 by the reprex package (v0.3.0)
You forgot read_html(); then use that in the dplyr statement.
library('rvest')
year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day, sep = "")
# added page
page <- read_html(url)
test <- page %>%
  # changed xpath
  html_node(xpath = '//*[@id="index-wrap"]') %>%
  html_text()
test
Update: I'm not great at dplyr, but this is what I have before I go to bed:
library('rvest')
year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day, sep = "")
# added page
page <- read_html(url)
titles <- page %>%
  html_nodes(xpath = '//*[@id="index-wrap"]/h4') %>%
  html_text()
sections <- page %>%
  html_nodes(xpath = '//*[@id="index-wrap"]/ul')
stories <- sections %>%
  html_nodes(xpath = './/li/a') %>%
  html_text()
stories
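To finish the join into the Date / Category / Article Name shape the asker wants, a rough sketch (assuming each h4 category heading is followed by exactly one ul of stories, so titles and sections line up one-to-one):
library(purrr)
library(tibble)
# stories grouped per section, so each category can be repeated the right number of times
per_section <- map(sections, ~ html_nodes(.x, xpath = './/li/a') %>% html_text())
result <- tibble(
  Date = paste(year, month, day, sep = "-"),
  Category = rep(titles, lengths(per_section)),
  `Article Name` = unlist(per_section)
)
result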
