Using R to scrape a table and links from a web page - r

I am trying to scrape a website with R. I need the table and the links from that table, with each link associated with the correct row. I can get the table and the links, but the web table has two columns that contain links, some rows have no links at all, and the links can't simply be sorted and joined by file name. I can't figure out how to build a data frame with the table columns and the links matched to the correct rows.
library(rvest)
#Read HTML from EPA website
content <- read_html("https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys")
tables <- content %>%
html_table(fill = TRUE)
EPA_table <- tables[[1]]
#get links from table
web <- content %>%
html_nodes("table") %>% html_nodes("tr") %>% html_nodes("a") %>%
html_attr("href") #as above

Use the xpath= argument of html_nodes() to select the link columns individually.
## Data links
web <- content %>%
html_nodes("table tr")%>%
html_nodes(xpath="//td[3]") %>% ## xpath
html_nodes("a") %>%
html_attr("href")
EPA_table$web1 <- web ## add Data links column
## metadata links accordingly
web2 <- content %>%
html_nodes("table tr") %>%
html_nodes(xpath="//td[4]") %>% ## xpath
html_nodes("a") %>%
html_attr("href")
The empty Metadata cells can be set to NA; the metadata links then slot into the rows where Metadata is not NA.
EPA_table[EPA_table$Metadata %in% "", "Metadata"] <- NA
EPA_table[!is.na(EPA_table$Metadata), "web2"] <- web2 ## add metadata column
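These assignments assume the scraped link vectors line up with the table rows. A quick sanity check (my addition, not part of the original answer) guards against a silent mismatch if the page layout changes:
## Sanity check: one Data link per row, one Metadata link per non-empty Metadata cell
stopifnot(length(web) == nrow(EPA_table))
stopifnot(length(web2) == sum(!is.na(EPA_table$Metadata)))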
Result
head(EPA_table)
# Survey Indicator
# 1 Lakes 2007 All
# 2 Lakes 2007 Landscape Data
# 3 Lakes 2007 Water Chemistry
# 4 Lakes 2007 Visual Assessment
# 5 Lakes 2007 Site Information
# 6 Lakes 2007 Notes
# Data
# 1 NLA 2007 All Data (ZIP)(1 pg, 5 MB)
# 2 NLA 2007 Basin Landuse Metrics - Data 20061022 (CSV)(1 pg, 307 K)
# 3 NLA 2007 Profile - Data 20091008 (CSV)(1 pg, 888 K)
# 4 NLA 2007 Visual Assessment - Data 20091015 (CSV)(1 pg, 813 K)
# 5 NLA 2007 Site Information - Data 20091113 (CSV)(1 pg, 980 K)
# 6 National Lakes Assessment 2007 Final Data Notes
# Metadata
# 1 <NA>
# 2 NLA 2007 Basin Landuse Metrics - Metadata 20091022 (TXT)(1 pg, 4 K)
# 3 NLA 2007 Profile - Metadata 20091008 (TXT)(1 pg, 650 B)
# 4 NLA 2007 Visual Assessment - Metadata 10091015 (TXT)(1 pg, 7 K)
# 5 NLA 2007 Site Information - Metadata 20091113 (TXT)(1 pg, 8 K)
# 6 <NA>
# web1
# 1 /sites/production/files/2017-02/nla2007_alldata.zip
# 2 /sites/production/files/2013-09/nla2007_basin_landuse_metrics_20061022.csv
# 3 /sites/production/files/2013-09/nla2007_profile_20091008.csv
# 4 /sites/production/files/2014-01/nla2007_visualassessment_20091015.csv
# 5 /sites/production/files/2014-01/nla2007_sampledlakeinformation_20091113.csv
# 6 /national-aquatic-resource-surveys/national-lakes-assessment-2007-final-data-notes
# web2
# 1 <NA>
# 2 /sites/production/files/2013-09/nla2007_basin_landuse_metrics_info_20091022.txt
# 3 /sites/production/files/2013-09/nla2007_profile_info_20091008_0.txt
# 4 /sites/production/files/2014-01/nla2007_visualassessment_info_20091015.txt
# 5 /sites/production/files/2014-01/nla2007_sampledlakeinformation_info_20091113.txt
# 6 <NA>

I would have gone with CSS selectors and :nth-child to separate out the individual columns, looping over the table rows. Using tbody in the selector excludes the header row, so only the table body rows are processed; that list of rows is then passed to map_df.
library(rvest)
library(purrr)
library(dplyr) # for if_else()
url <- "https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys"
rows <- read_html(url) %>% html_nodes("#narsdata tbody tr")
df <- map_df(rows, function(x) {
  data.frame(
    Survey = x %>% html_node("td:nth-child(1)") %>% html_text(),
    Indicator = x %>% html_node("td:nth-child(2)") %>% html_text(),
    Data = x %>% html_node("td:nth-child(3) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
    Metadata = x %>% html_node("td:nth-child(4) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
    stringsAsFactors = FALSE
  )
})
I don't think you really need the file names in addition to the URLs, but if you do, you can expand the data.frame with two additional columns and extract html_text() rather than html_attr(), e.g. (full call sketched below):
Data_Name = x %>% html_node("td:nth-child(3) a") %>% html_text(),
Metadata_Name = x %>% html_node("td:nth-child(4) a") %>% html_text()
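Put together, the expanded call might look like this (a sketch, using the same #narsdata selector and url_absolute() handling as above):
df <- map_df(rows, function(x) {
  data.frame(
    Survey = x %>% html_node("td:nth-child(1)") %>% html_text(),
    Indicator = x %>% html_node("td:nth-child(2)") %>% html_text(),
    Data = x %>% html_node("td:nth-child(3) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
    Data_Name = x %>% html_node("td:nth-child(3) a") %>% html_text(),
    Metadata = x %>% html_node("td:nth-child(4) a") %>% html_attr("href") %>% if_else(is.na(.), ., url_absolute(., url)),
    Metadata_Name = x %>% html_node("td:nth-child(4) a") %>% html_text(),
    stringsAsFactors = FALSE
  )
})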

Related

Is there a function similar to read_html() that can be used on data table or data frame types in R?

I'm attempting to webscrape from footballdb.com to get data related to NFL player injuries for a model I am creating from links such as this: https://www.footballdb.com/transactions/injuries.html?yr=2016&wk=1&type=reg which will then be output in a data table. Along with data related to individual player injury information (i.e. their name, injury, and status throughout the week leading up to the game), I also want to include the season and week of the injury in question for each player. I started by using nested for loops to generate the url for each webpage in question, along with the season and week corresponding to each webpage, which were stored in a data table with columns: link, season, and week.
I then tried to use the functions map_df(), read_html(), and html_nodes() to extract the information I wanted from each webpage, but I ran into errors because read_html() does not work on objects of the data table or data frame class. I then tried different kinds of indexing and the $ operator, with no luck either. Is there any way I can modify the code I have produced so far to extract the information I want from a data table? Below is what I have written thus far:
library(purrr)
library(rvest)
library(data.table)
#Remove file if file already exists
if (file.exists("./project/volume/data/interim/injuryreports.csv")) {
file.remove("./project/volume/data/interim/injuryreports.csv")}
#Declare variables and empty data tables
path1<-("https://www.footballdb.com/transactions/injuries.html?yr=")
seasons<-c("2016", "2017", "2020")
weeks<-1:17
result<-data.table()
temp<-NULL
#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for(s in 1:length(seasons)){
  for(w in 1:length(weeks)){
    temp$link <- paste0(path1, seasons[s], "&wk=", as.character(w), "&type=reg")
    temp$season <- as.numeric(seasons[s])
    temp$week <- weeks[w]
    result <- rbind(result, temp)
  }
}
#Get rid of any potential empty values from result
result<-compact(result)
###Errors Below####
DT <- map_df(result, function(x){
  page <- read_html(x[[1]])
  data.table(
    Season = x[[2]],
    Week = x[[3]],
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
    Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
    Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
    Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
    Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
    GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
  )
})
#####End of Errors###
#Write out injury data table
fwrite(DT,"./project/volume/data/interim/injuryreports.csv")
The issue is that your input result is a data.table. When you pass it to map_df(), it loops over the columns(!) of the data.table, not the rows.
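A quick way to see this behaviour (a toy example, not part of the original code):
library(purrr)
toy <- data.frame(link = c("url1", "url2"), season = c(2016, 2017))
map(toy, length)
#> $link
#> [1] 2
#>
#> $season
#> [1] 2
## map()/map_df() received the two COLUMNS (each of length 2), not the two rows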
One approach to make your code work is to split result by link and loop over the resulting list.
Note: For the reprex I only loop over the first two elements of the list. Additionally I have put your function outside of the map statement which made debugging easier.
library(purrr)
library(rvest)
library(data.table)
#Declare variables and empty data tables
path1<-("https://www.footballdb.com/transactions/injuries.html?yr=")
seasons<-c("2016", "2017", "2020")
weeks<-1:17
result<-data.table()
temp<-NULL
#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for(s in 1:length(seasons)){
  for(w in 1:length(weeks)){
    temp$link <- paste0(path1, seasons[s], "&wk=", as.character(w), "&type=reg")
    temp$season <- as.numeric(seasons[s])
    temp$week <- weeks[w]
    result <- rbind(result, temp)
  }
}
#Get rid of any potential empty values from result
result<-compact(result)
result <- split(result, result$link)
get_table <- function(x) {
  page <- read_html(x[[1]])
  data.table(
    Season = x[[2]],
    Week = x[[3]],
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
    Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
    Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
    Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
    Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
    GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
  )
}
DT <- map_df(result[1:2], get_table)
DT
#> Season Week Player Injury Wed Thu Fri
#> 1: 2016 1 Justin Bethel Foot Limited Limited Limited
#> 2: 2016 1 Lamar Louis Knee DNP Limited Limited
#> 3: 2016 1 Kareem Martin Knee DNP DNP DNP
#> 4: 2016 1 Alex Okafor Biceps Full Full Full
#> 5: 2016 1 Frostee Rucker Neck Limited Limited Full
#> ---
#> 437: 2016 10 Will Blackmon Thumb Limited Limited Limited
#> 438: 2016 10 Duke Ihenacho Concussion Full Full Full
#> 439: 2016 10 DeSean Jackson Shoulder DNP DNP DNP
#> 440: 2016 10 Morgan Moses Ankle Limited Limited Limited
#> 441: 2016 10 Brandon Scherff Shoulder Full Full Full
#> GameStatus
#> 1: (09/09) Questionable vs NE
#> 2: (09/09) Questionable vs NE
#> 3: (09/09) Out vs NE
#> 4: --
#> 5: --
#> ---
#> 437: (11/11) Questionable vs Min
#> 438: (11/11) Questionable vs Min
#> 439: (11/11) Doubtful vs Min
#> 440: (11/11) Questionable vs Min
#> 441: --

Cannot identify html node for scraping in rvest

I'm trying to grab links from a page for subsequent analysis and can only get about half of them, which may be due to filtering. I'm trying to extract the episode links from the episodes index page.
My approach is as follows, which is not ideal because I believe I may be losing some links in the filter() call.
library(rvest)
library(tidyverse)
#initiate session
session <- html_session("https://www.backlisted.fm/episodes")
#collect links for all episodes from the index page:
session %>%
read_html() %>%
html_nodes(".underline-body-links a") %>%
html_attr("href") %>%
tibble(link_temp = .) %>%
filter(str_detect(link_temp, pattern = "episodes/")) %>%
distinct()
#css:
#.underline-body-links #page .html-block a, .underline-body-links #page .product-excerpt a
#result:
link_temp
<chr>
1 /episodes/116-mfk-fisher-how-to-cook-a-wolf
2 https://www.backlisted.fm/episodes/109-barbara-pym-excellent-women
3 /episodes/115-george-amp-weedon-grossmith-the-diary-of-a-nobody
4 https://www.backlisted.fm/episodes/27-jane-gardam-a-long-way-from-verona
5 https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-double-entry
6 https://www.backlisted.fm/episodes/97-ray-bradbury-the-illustrated-man
7 /episodes/114-william-golding-the-inheritors
8 https://www.backlisted.fm/episodes/30-georgette-heyer-venetia
9 https://www.backlisted.fm/episodes/49-anita-brookner-look-at-me
10 https://www.backlisted.fm/episodes/71-jrr-tolkien-the-return-of-the-king
# … with 43 more rows
I've been reading multiple documents but I can't target that one type of href. Any help will be much appreciated. Thank you.
Try this
library(rvest)
library(tidyverse)
session <- html_session("https://www.backlisted.fm/index")
raw_html <- read_html(session)
node <- raw_html %>% html_nodes(css = "li p a")
link <- node %>% html_attr("href")
title <- node %>% html_text()
tibble(title, link)
# A tibble: 117 x 2
# title link
# <chr> <chr>
# 1 "A Month in the Country" https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country
# 2 " - J.L. Carr (with Lissa Evans)" #
# 3 "Good Morning, Midnight - Jean Rhys" https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight
# 4 "It Had to Be You - David Nobbs" https://www.backlisted.fm/episodes/3-david-nobbs-1
# 5 "The Blessing - Nancy Mitford" https://www.backlisted.fm/episodes/4-nancy-mitford-the-blessing
# 6 "Christie Malry's Own Double Entry - B.S. Joh… https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-dou…
# 7 "Passing - Nella Larsen" https://www.backlisted.fm/episodes/6-nella-larsen-passing
# 8 "The Great Fire - Shirley Hazzard" https://www.backlisted.fm/episodes/7-shirley-hazzard-the-great-fire
# 9 "Lolly Willowes - Sylvia Townsend Warner" https://www.backlisted.fm/episodes/8-sylvia-townsend-warner-lolly-willow…
# 10 "The Information - Martin Amis" https://www.backlisted.fm/episodes/9-martin-amis-the-information
# … with 107 more rows
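If you prefer to stay with the original /episodes page and selector, the mix of relative and absolute links visible in the question's output can be normalised with url_absolute() before distinct() (a sketch under that assumption; the filter itself is unchanged):
session %>%
  read_html() %>%
  html_nodes(".underline-body-links a") %>%
  html_attr("href") %>%
  url_absolute("https://www.backlisted.fm") %>% # make relative hrefs absolute
  tibble(link = .) %>%
  filter(str_detect(link, pattern = "episodes/")) %>%
  distinct()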

Create aggregate df out of webscraping multiple pages using Rvest and Glue

I'm working on scraping data from a table on the following website
https://fantasy.nfl.com/research/scoringleaders?position=1&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1
I want to create a scrape that takes all 17 weeks, all four positions (qb,rb,wr,te) and takes the first 4 pages to get the first 100 rows (only 25 shown on a page at a time).
library(tidyverse)
library(rvest)
library(glue)
scrape_19 <- function(week, position, page) {
  Sys.sleep(3)
  cat(".")
  url <- glue("https://fantasy.nfl.com/research/scoringleaders?{page}position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}")
  read_html(url) %>%
    html_nodes("table") %>%
    html_table(header = T) %>%
    simplify() %>%
    first() %>%
    setNames(paste0(colnames(.), as.character(.[1,]))) %>%
    slice(-1) %>%
    list()
}
Here are all the iterations of each call in glue:
week = 1:17;
position = 1:4;
page = c("", "offset=26&", "offset=51&", "offset=76&")
The problem I run into is when I try to make one df with all the data for each week, position and page. Here is code that works for week and position but does not extend to the additional nested page variable.
scaffold <- tibble(week = weeks,
                   position = list(positions)) %>%
  tidyr::unnest()
scaffold
tbl_data <- scaffold %>%
  mutate(data = purrr::map2(week, position, ~scrape_19(.x, .y)[[1]]))
Basically, I need help in crafting the scaffold and turning that scaffold into the final total data set with all weeks, positions and pages.
Here is my attempt. I am not sure if glue() is the way to go. See below.
first_name <- c("Fred", "Ana", "Bob")
last_name <- c("JOhnson", "Trump")
glue('My name is {first_name} {last_name}.')
Error: Variables must be length 1 or 3
Your case is similar to this example: glue() fails because the variables have different lengths. So I created all possible links using loops with map(), then checked whether all of the URLs exist. I used map_dfr() to loop through the URLs and bind the resulting data frames, adding the week and position information in the process. If position is 1, it is QB; map the remaining position codes yourself if necessary. Note that I only scrape four of the URLs in this demonstration.
library(httr)
library(rvest)
library(tidyverse)
# Create all URLs.
# Create 4 base URLs
paste("https://fantasy.nfl.com/research/scoringleaders?",
c("", "offset=26&", "offset=51&", "offset=76&"),
"position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}",
sep = "") -> mytemp
# For each base URL, create 4 URLs. (4 x 4 = 16 URLs)
map(.x = 1:4,
.f = function(x){gsub(x = mytemp, pattern = "\\{position\\}", replacement = x)}) %>%
unlist -> mytemp
# For each of the 16 URLs, create 17 URLs
map(.x = 1:17,
.f = function(x){gsub(x = mytemp, pattern = "\\{week\\}", replacement = x)}) %>%
unlist -> myurls
# Check if any URLs are invalid
sapply(myurls, url_success) %>% table
# TRUE
# 272
# Scrape the tables
map_dfr(.x = myurls[1:4],
        .f = function(x){read_html(x) %>%
            html_nodes("table") %>%
            html_table() %>%
            simplify() %>%
            first() %>%
            setNames(paste0(colnames(.), as.character(.[1,]))) %>%
            slice(-1) %>%
            mutate(position = str_extract(string = x, pattern = "(?<=position=)\\d+(?=&)"),
                   week = str_extract(string = x, pattern = "(?<=statWeek=)\\d+"))},
        .id = "url") -> foo
url Rank Player Opp PassingYds PassingTD PassingInt RushingYds RushingTD ReceivingRec ReceivingYds
1 1 1 Lamar Jackson QB - BAL #MIA 324 5 - 6 - - -
2 1 2 Dak Prescott QB - DAL NYG 405 4 - 12 - - -
3 1 3 Deshaun Watson QB - HOU #NO 268 3 1 40 1 - -
4 1 4 Matthew Stafford QB - DET #ARI 385 3 - 22 - - -
5 1 5 Patrick Mahomes QB - KC #JAX 378 3 - 2 - - -
ReceivingTD RetTD MiscFumTD Misc2PT FumLost FantasyPoints position week
1 - - - - - 33.56 1 1
2 - - - - - 33.40 1 1
3 - - - - - 30.72 1 1
4 - - - - 1 27.60 1 1
5 - - - - - 27.32 1 1
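If you would rather keep the question's glue()-based scrape_19() and scaffold idea, one way to cover all three variables is tidyr::crossing() plus purrr::pmap() (a sketch only, not tested against the live site):
## Sketch: cross week, position and page, then feed each combination to scrape_19()
scaffold <- tidyr::crossing(week = 1:17,
                            position = 1:4,
                            page = c("", "offset=26&", "offset=51&", "offset=76&"))
tbl_data <- scaffold %>%
  dplyr::mutate(data = purrr::pmap(list(week, position, page),
                                   ~ scrape_19(..1, ..2, ..3)[[1]]))
tbl_data <- tidyr::unnest(tbl_data, data)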

R scrape html table and extract background color

I am trying to scrape some data off a wikipedia table from this page:
https://en.wikipedia.org/wiki/Results_of_the_Indian_general_election,_2014 and I am interested in the table:
Summary of the 2014 Indian general election
I would also like to extract the party colors from the table.
Here's what I've tried so far:
library("rvest")
url <-
"https://en.wikipedia.org/wiki/Results_of_the_Indian_general_election,_2014"
electionstats <- read_html(url)
results <- html_nodes(electionstats, xpath='//*[@id="mw-content-text"]/div/table[79]') %>% html_table(fill = T)
party_colors <- electionstats %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>%
  html_table(fill = T)
Printing out party_colors does not show any info about the colors
So, I tried:
party_colors <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>%
  html_nodes('tr')
Now if I print out party_colors, I get:
[1] <tr style="background-color:#E9E9E9">\n<th style="text-align:left;vertical-align:bottom;" rowspan="2"></th>\n<th style="text-align:left; ...
[2] <tr style="background-color:#E9E9E9">\n<th style="text-align:center;">No.</th>\n<th style="text-align:center;">+/-</th>\n<th style="text ...
[3] <tr>\n<td style="background-color:#FF9933"></td>\n<td style="text-align:left;"><a href="/wiki/Bharatiya_Janata_Party" title="Bharatiya J ...
[4] <tr>\n<td style="background-color:#00BFFF"></td>\n<td style="text-align:left;"><a href="/wiki/Indian_National_Congress" title="Indian Na ...
[5] <tr>\n<td style="background-color:#009900"></td>\n<td style="text-align:left;"><a href="/wiki/All_India_Anna_Dravida_Munnetra_Kazhagam" ...
and so on...
But, now, I have no idea how to pull out the colors from this data. I cannot convert the output of the above to a html_table with:
html_table(fill = T)
I get the error:
Error: html_name(x) == "table" is not TRUE
I also tried various options with html_attrs, but I have no idea what the correct attribute I should be passing is.
I even tried SelectorGadget to try and figure out the attribute, but if I select the first column of the table in question, SelectorGadget shows just "td".
I would get the table like you did and then add the color attribute as a column. The "wikitable sortable" class works on many pages, so get the first such table and remove the second header row (which ends up as row 1 of the parsed table).
electionstats <- read_html(url)
x <- html_nodes(electionstats, xpath='//table[@class="wikitable sortable"]')[[1]] %>%
  html_table(fill=TRUE)
# paste names from 2nd row header and then remove
names(x)[6:14] <- paste(names(x)[6:14], x[1,6:14])
x <- x[-1,]
The colors are in the first td of each tr, and you can add them to the empty column 1 or 3 (see str(x)):
names(x)[3] <- "Color"
x$Color <- html_nodes(electionstats, xpath='//table[@class="wikitable sortable"][1]/tr/td[1]') %>%
  html_attr("style") %>% gsub("background-color:", "", .)
## drop table footer, extra columns
x <- x[1:83, 2:14]
head(x)
Party Color Alliance Abbr. Candidates No. Candidates +/- Candidates %
2 Bharatiya Janata Party #FF9933 NDA BJP 428 -5 78.82%
3 Indian National Congress #00BFFF UPA INC 464 24 85.45%
4 All India Anna Dravida Munnetra Kazhagam #009900 ADMK 40 17 7.37%
5 All India Trinamool Congress #00FF00 AITC 131 96 24.13%
6 Biju Janata Dal #006400 BJD 21 3 3.87%
7 Shiv Sena #E3882D NDA SHS 24 11 10.68%
Looks like your xml_nodeset contains both tr and td nodes.
Deal with both trs and tds, converting to data frames (bind_rows() needs dplyr):
library(dplyr)
party_colors_tr <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_nodes('tr')
trs <- bind_rows(lapply(xml_attrs(party_colors_tr), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))
party_colors_td <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_nodes('tr') %>% html_nodes('td')
tds <- bind_rows(lapply(xml_attrs(party_colors_td), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))
Write function for extracting styles from data frames:
library(stringi)
list_styles <- function(nodes_frame) {
  get_cols <- function(x) { stri_detect_fixed(x, 'background-color') }
  has_style <- which(lapply(nodes_frame$style, get_cols) == TRUE)
  res <- strsplit(nodes_frame[has_style,]$style, ':')
  return(res)
}
Create data frame of extracted styles:
l_trs <- list_styles(trs)
df_trs <- data.frame(do.call('rbind', l_trs)[,1], do.call('rbind', l_trs)[,2])
names(df_trs) <- c('style', 'color')
l_tds <- list_styles(tds)
df_tds <- data.frame(do.call('rbind', l_tds)[,1], do.call('rbind', l_tds)[,2])
names(df_tds) <- c('style', 'color')
Combine trs and tds frames:
final_style_frame <- do.call('rbind', list(df_trs, df_tds))
Here are the first 20 rows:
final_style_frame[1:20,]

How to scrape multiple tables that are without IDs or Class using R

I'm trying to scrape this webpage using R : http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (All the pages)
I'm new to programming, and everywhere I've looked, tables are identified with IDs, divs, or classes. On this page there are none; the data is simply stored in table format. How should I scrape it?
This is what I did :
library(rvest)
webpage <- read_html("http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all")
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
html_nodes("table") %>%
.[9:10] %>%
html_table(fill = TRUE)
colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
"Police Station", "Status", "Mobile Type(GSM/CDMA)",
"FIR/DD/GD Dat")
You can scrape the table data by targeting the css id of each table. It looks like each page is composed of 3 different tables pasted one after another. Two of the tables have #AutoNumber15 css id while the third (in the middle) has the #AutoNumber16 css id.
I put a simple code example that should get you started in the right direction.
suppressMessages(library(tidyverse))
suppressMessages(library(rvest))
# define function to scrape the table data from a page
get_page <- function(page_id = 1) {
  # default link
  link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
  # build link
  link <- paste0(link, page_id)
  # get tables data
  wp <- read_html(link)
  wp %>%
    html_nodes("#AutoNumber16, #AutoNumber15") %>%
    html_table(fill = TRUE) %>%
    bind_rows()
}
# get the data from the first three pages
iter_page <- 1:3
# this is just a progress bar
pb <- progress_estimated(length(iter_page))
# this code will iterate over pages 1 through 3 and apply the get_page()
# function defined earlier. The Sys.sleep() part is used to pause the code
# after each iteration so that the sever is not overloaded with requests.
map_df(iter_page, ~ {
  pb$tick()$print()
  df <- get_page(.x)
  Sys.sleep(sample(10, 1) * 0.1)
  as_tibble(df)
})
#> # A tibble: 72 x 4
#> X1 X2 X3
#> <chr> <chr> <chr>
#> 1 FIR/DD/GD Number 000165 State
#> 2 FIR/DD/GD Date 17/08/2017 District
#> 3 Mobile Type(GSM/CDMA) GSM Police Station
#> 4 Mobile Make SAMSUNG J2 Mobile Number
#> 5 Missing/Stolen Date 23/04/2017 IMEI Number
#> 6 Complainant AKEEL KHAN Complainant Contact Number
#> 7 Status Stolen/Theft Report Date/Time on ZIPNET
#> 8 <NA> <NA> <NA>
#> 9 FIR/DD/GD Number FIR No 37/ State
#> 10 FIR/DD/GD Date 17/08/2017 District
#> # ... with 62 more rows, and 1 more variables: X4 <chr>
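The result is still in key/value layout (field names in X1 and X3, their values in X2 and X4, with a blank row between phone records). One possible reshaping step, assuming the tibble above is saved in a variable called raw (my name, not part of the original answer), could look like this:
## Sketch: number the records, stack the two field/value pairs, then spread to columns
tidy_phones <- raw %>%
  mutate(record = cumsum(is.na(X1)) + 1) %>% # blank rows separate records
  filter(!is.na(X1)) %>%
  {bind_rows(select(., record, field = X1, value = X2),
             select(., record, field = X3, value = X4))} %>%
  tidyr::pivot_wider(names_from = field, values_from = value)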
