R: loop not inputting correct indexes

I scraped a bunch of URLs and was using RSelenium to add the links to the archive.org website. My loop was about halfway finished when it broke. So I used the which() function to determine where it left off, then I restarted the loop from that point.
The problem is that the driver kept typing the link at the wrong index instead of the one I specified in the loop. To fix the problem, I subset 'mget_links' to 'mget_links2' so that it only contained the remaining links I needed. I ran the loop again and it somehow still typed an index from 'mget_links'. The driver keeps typing in this link: https://www.bjjcompsystem.com/tournaments/1869/categories/2053146
What is going on here? How do I fix it?
library(rvest)
library(tidyverse)
library(RSelenium); library(netstat)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
mget_links <- pageMen %>%
html_nodes('.categories-grid__category a') %>%
html_attr('href') %>%
paste0('https://www.bjjcompsystem.com', .)
# know the length for the loop
length(mget_links)
remote_driver = rsDriver(browser = 'firefox',
verbose = F,
port = free_port())
rd = remote_driver$client
rd$open()
rd$navigate('https://web.archive.org/save')
rd$maxWindowSize()
for (i in 1:length(mget_links[255:259])){
  save_page_box = rd$findElement(using = 'xpath', '//*[(@id = "web-save-url-input")]')
  save_page_box$clickElement()
  save_page_box$sendKeysToElement(list(mget_links[i], key='enter'))
  Sys.sleep(26)
  return.to.save.page = rd$findElement(using = 'link text', 'Return to Save Page Now')
  return.to.save.page$clickElement()
}
mget_links2 = mget_links[104:259]
# Reran the above code with mget_links2 instead of mget_links[255:259]. I also tried mget_links2[1:156] with no luck.
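A minimal sketch of what the loop index actually evaluates to, assuming 255:259 is the remaining range:
1:length(mget_links[255:259])
#> [1] 1 2 3 4 5
# so inside the loop mget_links[i] pulls links 1-5 from the full vector;
# iterating over the indices themselves keeps them aligned, e.g.
# for (i in 255:259) { ... } or for (link in mget_links2) { ... }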

Related

R webscraping doesn't work when url is in variable

I'm having trouble scraping in R. I want to scrape genre information for several titles on goodreads.
If I do this, it works completely fine and gives me what I need:
library(polite)
library(rvest)
library(dplyr)
session <- bow("https://www.goodreads.com/book/show/29991718-royally-matched",
delay = 5)
genres <- scrape(session) %>%
html_elements(".bookPageGenreLink") %>%
html_text()
genres
However, since I'd like to loop over several pages, I need this to work, but it always returns character(0).
host <- "https://www.goodreads.com/book/show/29991718-royally-matched"
session <- bow(host,
delay = 5)
genres <- scrape(session) %>%
html_elements(".bookPageGenreLink") %>%
html_text()
genres
Something like this would also be fine for me, but it doesn't work either:
link = "29991718-royally-matched"
session <- bow(paste0("https://www.goodreads.com/book/show/29991718-royally-matched", link),
delay = 5)
genres <- scrape(session) %>%
html_elements(".bookPageGenreLink") %>%
html_text()
genres
If I open the website and disable javascript, it still works completely fine, so I don't think Selenium is necessary and I really can't figure out why this doesn't work, which drives me crazy.
Thank you so much for your support!
Solution (kind of)
So I noticed that the success of my scrapings was kind of dependent on the random moods of the scraping gods.
So I did the following:
links <- c("31752345-black-mad-wheel", "00045101-The-Mad-Ship", "2767052-the-hunger-games", "18619684-the-time-traveler-s-wife", "29991718-royally-matched")
data <- data.frame(links)
for (link in links) {
  print(link)
  genres <- character(0)
  url <- paste0('https://www.goodreads.com/book/show/', link)
  # I don't know why, but resaving it kinda helped
  host <- url
  # I had the theory that repeating the scraping would eventually lead to a result. For me that didn't work though
  try <- 0
  while (identical(genres, character(0)) & (try < 10)) {
    try <- try + 1
    print(paste0(try, ": ", link))
    session <- bow(host,
                   delay = 5)
    scraping <- scrape(session)
    genres <- scraping %>% html_elements(".bookPageGenreLink") %>%
      html_text()
  }
  if (identical(genres, character(0))) {
    print("Scraping unsuccessful... :(")
  } else {
    print("Scraping success!!")
    genres.df <- data.frame(genres)
    data <- left_join(data,
                      genres.df, by = c("link"))
  }
}
## then I created a list of the missing titles
missing_titles <- data %>%
filter(is.na(genre_1))
missing_links <- unique(missing_titles$link)
So the next step(s) were closing R (while saving the workspace, of course), restarting it, and re-feeding the loop with missing_links instead of links. It took me about 7 iterations of that to get everything I needed, and on the last run I had to insert the last remaining link directly into example 1, since it did not work inside the loop, for whatever reason.
I hope the code kind of works, since I wanted to spare you pages of wild data formatting.
If someone has an explanation for why I needed to go through this hassle, I would still very much appreciate it.
You could consider using the R package RSelenium as follows:
library(RSelenium)
library(rvest)
url <- "https://www.goodreads.com/book/show/29991718-royally-matched"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
page_Content <- remDr$getPageSource()[[1]]
read_html(page_Content) %>% html_elements(".bookPageGenreLink") %>% html_text()
Afterwards, you can loop over the URLs you want.
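For instance, a minimal sketch of such a loop, reusing the open remDr session and the selector above (the book_urls vector is only an illustration built from IDs mentioned in the question):
book_urls <- paste0("https://www.goodreads.com/book/show/",
                    c("29991718-royally-matched", "2767052-the-hunger-games"))
genres_by_book <- lapply(book_urls, function(u) {
  remDr$navigate(u)
  Sys.sleep(3)  # give the page time to render
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_elements(".bookPageGenreLink") %>%
    html_text()
})
names(genres_by_book) <- book_urls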

How to scrape hrefs embedded in a dropdown list of a web table using rselenium R

I'm trying to scrape links to all minutes and agenda provided in this website: https://www.charleston-sc.gov/AgendaCenter/
I've managed to scrape the section IDs associated with each category (and the years for each category) so I can loop through the contents within each category-year (please see below). But I don't know how to scrape the hrefs that live inside the contents, especially because the links to agendas live inside the drop-down menu under 'Download', so it seems like I need to go through extra clicks to scrape the hrefs.
How do I scrape the minutes and agenda (inside the download dropdown) for each table I select? Ideally, I would like a table with the date, title of the agenda, links to minutes, and links to agenda.
I'm using RSelenium for this. Please see the code I have so far below, which allows me to click through each category and year, but not much else. Please help!
rm(list = ls())
library(RSelenium)
library(tidyverse)
library(httr)
library(XML)
library(stringr)
library(RCurl)
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
co <- str_match(t, 'aria-label="(.*?)"[ ]href="java')[,2]
yr <- str_match(t, 'id="(.*?)" aria-label')[,2]
df <- data.frame(cbind(co, yr)) %>%
mutate_all(as.character) %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(id = ifelse(grepl('^a0', yr), gsub('a0', '', yr), NA)) %>%
tidyr::fill(c(co,id), .direction='down')%>% drop_na(co)
remDr <- remoteDriver(port=4445L, browserName = "chrome")
remDr$open()
remDr$navigate('https://www.charleston-sc.gov/AgendaCenter/')
remDr$screenshot(display = T)
for (j in unique(df$id)){
  remDr$findElement(using = 'xpath',
                    value = paste0('//*[@id="cat',j,'"]/h2'))$clickElement()
  for (k in unique(df[which(df$id==j),'yr'])){
    remDr$findElement(using = 'xpath',
                      value = paste0('//*[@id="',k,'"]'))$clickElement()
    # NEED TO SCRAPE THE HREF ASSOCIATED WITH MINUTES AND AGENDA DOWNLOAD HERE #
  }
}
Maybe you don't really need to click through all the elements? You can use the fact that all downloadable links have ViewFile in their href:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
viewfile <- str_extract_all(t, '.*ViewFile.*', simplify = T)
viewfile <- viewfile[viewfile!='']
library(data.table) # I use data.table because it's more convenient - but it can be done without it too
dt.viewfile <- data.table(origStr=viewfile)
# list the elements and patterns we will be looking for:
searchfor <- list(
Title='name=[^ ]+ title=\"(.+)\" href',
Date='<strong>(.+)</strong>',
href='href=\"([^\"]+)\"',
label= 'aria-label=\"([^\"]+)\"'
)
for (this.i in names(searchfor)){
this.full <- paste0('.*',searchfor[[this.i]],'.*');
dt.viewfile[grepl(this.full, origStr), (this.i):=gsub(this.full,'\\1',origStr)]
}
# Clean records:
dt.viewfile[, `:=`(Title=na.omit(Title),Date=na.omit(Date),label=na.omit(label)),
by=href]
dt.viewfile[,Date:=gsub('<abbr title=".*">(.*)</abbr>','\\1',Date)]
dt.viewfile <- unique(dt.viewfile[,.(Title,Date,href,label)]); # 690 records
What you get as a result is a table with the links to all downloadable files. You can now download them using any tool you like, for example download.file() or GET():
dt.viewfile[, full.url:=paste0('https://www.charleston-sc.gov', href)]
dt.viewfile[, filename:=fs::path_sanitize(paste0(Title, ' - ', Date), replacement = '_')]
for (i in seq_len(nrow(dt.viewfile[1:10,]))){ # remove `1:10` limitation to process all records
  url <- dt.viewfile[i, full.url]
  destfile <- dt.viewfile[i, filename]
  cat('\nDownloading', url, ' to ', destfile)
  fil <- GET(url, write_disk(destfile))
  # our destination file doesn't have an extension; we need to get it from the server:
  serverFilename <- gsub("inline;filename=(.*)",'\\1',headers(fil)$`content-disposition`)
  serverExtension <- tools::file_ext(serverFilename)
  # Adding the extension to the file we just saved
  file.rename(destfile, paste0(destfile,'.',serverExtension))
}
Now the only problem we have is that the original webpage was only showing records for the last 3 years. But instead of clicking View More through RSelenium, we can simply load the page with earlier dates, something like this:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017', encoding = 'UTF-8')
then repeat the rest of the code as necessary.
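One way that repetition might look, wrapped in a small helper and run over the two search URLs used above (the helper name is just a placeholder):
get_viewfile_lines <- function(url) {
  t <- readLines(url, encoding = 'UTF-8')
  vf <- str_extract_all(t, '.*ViewFile.*', simplify = TRUE)
  vf[vf != '']
}
search_urls <- c('https://www.charleston-sc.gov/AgendaCenter/',
                 'https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017')
viewfile <- unlist(lapply(search_urls, get_viewfile_lines))
dt.viewfile <- data.table(origStr = viewfile)
# ...then continue with the searchfor / cleaning steps above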

Issue with scrolling down to scrape Google Reviews

I am trying to scrape data from Google Reviews (stars, commentary, date, etc.).
I tried to adapt some code that I found online but am having problems making it work. Apparently, R is not managing to scroll down the Google reviews and only returns the first ten reviews (the ones that Google displays without scrolling).
Has someone come across the same issue? Thanks!
#install.packages("rvest")
#install.packages("xml2")
#install.packages("RSelenium")
library(rvest)
library(xml2)
library(RSelenium)
rmDr=rsDriver(port = 4444L, browser=c("firefox"))
myclient= rmDr$client
myclient$navigate("https://www.google.com/search?client=firefox-b-d&q=emporio+santa+maria#lrd=0x94ce576a4e45ed99:0xa36a342d3ceb06c3,1,,,")
#click on the snippet to switch focus----------
webEle <- myclient$findElement(using = "css",value = ".review-snippet")
webEle$clickElement()
#simulate scrolling down several times-------------
scroll_down_times=1000
for(i in 1:scroll_down_times){
  webEle$sendKeysToActiveElement(sendKeys = list(key="page_down"))
  #the content needs time to load; wait a few seconds every 5 scroll-downs
  if(i %% 5 == 0){
    Sys.sleep(3)
  }
}
#loop and simulate clicking on all "click on more" elements-------------
webEles <- myclient$findElements(using = "css",value = ".review-more-link")
for(webEle in webEles){
tryCatch(webEle$clickElement(),error=function(e){print(e)}) # trycatch to prevent any error from stopping the loop
}
pagesource= myclient$getPageSource()[[1]]
#this should get you the full review, including translation and original text-------------
reviews=read_html(pagesource) %>%
html_nodes(".review-full-text") %>%
html_text()
#number of stars
stars <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes("g-review-stars > span") %>%
html_attr("aria-label")
#time posted
post_time <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes(".dehysf") %>%
html_text()
Your code is basically correct, but you didn't target the right element; use .review-dialog-list as the CSS selector instead. That element is where the scroll bar resides.
library(RSelenium)
rmDr <- rsDriver(browser = "firefox")
driver <- rmDr$client
driver$navigate("https://www.google.com/search?client=firefox-b-d&q=emporio+santa+maria#lrd=0x94ce576a4e45ed99:0xa36a342d3ceb06c3,1,,,")
Sys.sleep(3) # wait a couple of seconds to let browser render the review window.
webEle <- driver$findElement(using = "css",value = ".review-dialog-list")
for(i in 1 : 5){
webEle$sendKeysToElement(sendKeys = list(key = "page_down"))
Sys.sleep(1)
}
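After the scroll loop, the extraction from the question should work unchanged on the freshly rendered source; a short sketch, assuming Google's markup still matches those selectors:
library(rvest)  # for read_html() and html_nodes()
pagesource <- driver$getPageSource()[[1]]
reviews <- read_html(pagesource) %>%
  html_nodes(".review-full-text") %>%
  html_text()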

Sreality.cz web scraping

I have tried scraping data from a real estate site and arranging the data in a way that can then easily be filtered and checked in a spreadsheet. I'm actually a little embarrassed that I can't move this R code forward.
Now that I have all the links to the posts, I cannot work out how to loop through the previously compiled data frame and get the details from all the URLs.
Could you please help me with it? Thanks a lot.
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe operator
library(RSelenium) # to get the loaded HTML of the dynamically generated pages
library(xml2)
complete <- data.frame()
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
URL.base <- "https://www.sreality.cz/hledani/prodej/byty?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=dnes&strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=tyden&strana="
for (i in 1:10000) {
  # Specifying the url for the desired website to be scraped
  main_link <- paste0(URL.base, i)
  # go to website
  remDr$navigate(main_link)
  # get page source and save it as an html object with rvest
  main_page <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
  # get the data
  name <- html_nodes(main_page, css=".name.ng-binding") %>% html_text()
  locality <- html_nodes(main_page, css=".locality.ng-binding") %>% html_text()
  norm_price <- html_nodes(main_page, css=".norm-price.ng-binding") %>% html_text()
  sreality_url <- main_page %>% html_nodes(".title") %>% html_attr("href")
  sreality_url2 <- sreality_url[c(4:24)]
  name2 <- name[c(4:24)]
  record <- data.frame(cbind(name2, locality, norm_price, sreality_url2))
  complete <- rbind(complete, record)
}
# Write CSV in R
write.csv(complete, file = "MyData.csv")
I would do this differently:
I would create a function, say scraper, that groups together all the scraping steps you have already defined. Then I would build the full set of page URLs with str_c (say 30 of them) and run a simple lapply over that list. Note that I would not use RSelenium at all. (Libraries: rvest, stringr, tibble, dplyr.)
url = 'https://www.sreality.cz/hledani/prodej/byty?strana='
This is the base URL; starting from here you should be able to build the URL strings for all the pages (1 to however many) you are interested in (and for all the possible variants: praha, olomouc, ostrava, etc.).
main_page = read_html('https://www.sreality.cz/hledani/prodej/byty?strana=')
Here you create all the links according to the number of pages you want:
list.of.pages = str_c(url, 1:30)
Then define a separate function for each piece of data you are interested in; this way you are more precise, debugging is easier, and data quality is better. (I assume your CSS selectors are right; otherwise you will get empty objects.)
For names:
name = function(url) {
data = html_nodes(url, css=".name.ng-binding") %>%
html_text()
return(data)
}
For locality:
locality = function(url) {
data = html_nodes(url, css=".locality.ng-binding") %>%
html_text()
return(data)
}
For the normalized price:
normprice = function(url) {
data = html_nodes(url, css=".norm-price.ng-binding") %>%
html_text()
return(data)
}
For the hrefs:
sreality_url = function(url) {
data = html_nodes(url, css=".title") %>%
html_attr("href")
return(data)
}
Those are the individual functions (the CSS selectors, which I didn't test, don't look quite right to me, but this gives you the right framework to work with). After that, combine them into a tibble:
get.data.table = function(html){
  name = name(html)
  locality = locality(html)
  normprice = normprice(html)
  hrefs = sreality_url(html)
  combine = tibble(adtext = name,
                   loc = locality,
                   price = normprice,
                   URL = hrefs)
  return(combine)
}
Then the final scraper:
scrape.all = function(urls){
  urls %>%
    lapply(read_html) %>%      # parse each page before extracting from it
    lapply(get.data.table) %>%
    bind_rows() %>%
    write.csv(file = 'MyData.csv')
}
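And a short usage sketch, assuming the objects defined above:
scrape.all(list.of.pages)  # writes MyData.csv with the 30 pages combined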

R Selenium (or rvest): How to scrape tables in sub(sub)pages listed in a main page

RSelenium
I quite often need to scrape and analyze public data on health-care contracts, and I have partially automated this in VBA.
I probably deserve a couple of downvotes, although I spent last night trying to set up RSelenium; I succeeded in firing up the server and running some examples that copy single tables to data frames. I am a beginner at web scraping.
I am working with a dynamically generated site.
https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false
I deal with three levels of pages:
Level 1
My top pages have the following structure (column A contains links, at the bottom there are pages):
========
A, B, C
link_A,15,10
link_B,23,12
link_c,21,12
link_D,32,12
========
1,2,3,4,5,6,7,8,9,...
======================
I have just learned SelectorGadget, which indicates:
Table: .table-striped
Pagination (1 2 3 4 5 6 7): .pagination-container
Level 2
Under each link (link_A, link_B) in the table there is a subpage which contains a table. Example: https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009
============
F, G, H
link_agreements,34,23
link_agreements,23,23
link_agreements,24,24
============
SelectorGadget indicates
.table-striped
Level 3
Again, under each link (link_agreements) there is another sub-subpage with the data that I want to collect:
https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009&AgreementTechnicalCode=761176
============
X,Y,Z
orthopedics, 231,323
traumatology, 323,248
hematology, 323,122
Again, SelectorGadget indicates
.table-striped
I would like to iteratively collect all the subpages into a data frame that would look like:
Info from top page; info from sub-subpages
link_A (from top page); 15 (value from column A), orthopedics, 231, 323
link_A (from top page); 15 (value from column A), traumatology, 323, 248
link_A (from top page); 15 (value from column A), hematology, 323, 122
Is there a cookbook, or some good examples for RSelenium or rvest, showing how to iterate through links in the tables and get the data from the sub(sub)-pages into a data frame?
I would appreciate any info, an example, any hints or a book indicating how to do it with RSelenium or any other scraping package.
P.S. Warning: I am also encountering invalid SSL certificate issues with this page; I am working with the Firefox Selenium driver, so each time I manually need to skip the warning - but that is another topic.
P.S. Here is the code I tried so far, which I found to be a dead end.
install.packages("RSelenium")
install.packages("wdman")
library(RSelenium)
library(wdman)
library(XML)
Next I started Selenium. I immediately had issues with "Java 8 present, Java 7 needed", which I solved by removing all java?.exe files from Windows/System32 or SysWOW64.
library(wdman)
library(XML)
selServ <- selenium(verbose = TRUE) #installs selenium
selServ$process
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4567
, browserName = "firefox")
remDr$open(silent = F)
remDr$navigate("https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=17480&OW=13&OrthopedicSupply=False&Code=130000111&AgreementTechnicalCode=773979")
webElem <- remDr$findElement(using = "class name", value = "table-striped")
webElemtxt <- webElem$getElementAttribute("outerHTML")[[1]]
table <- readHTMLTable(webElemtxt, header = FALSE, as.data.frame = TRUE)[[1]]
webElem$clickElement()
webElem$sendKeysToElement(list(key="tab",key="enter"))
Here my struggle with RSelenium ended. I could not send keys to Chrome, I could not work with Firefox because it demanded correct SSL certificates and I could not effectively bypass it.
table<-0
library(rvest)
# PRIMARY TABLE EXTRACTION
for (i in 1:10){
url<-paste0("https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=",i)
page<-html_session(url)
table[i]<-html_table(page)
}
library(data.table)
primary_table<-rbindlist(table,fill=TRUE)
# DATA CLEANING REQUIRED IN PRIMARY TABLE to clean the variable
# `Kod Sortuj według kodu świadczeniodawcy`
# Clean it and store it in the primary_table column; only then will the secondary table extraction work
#SECONDARY TABLE EXTRACTION
for (i in 1:10){
url<-paste0("https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20795&OW=15&OrthopedicSupply=False&Code=",primary_table[i,2])
page<-html_session(url)
table[i]<-html_table(page)
# This is the key by which you can identify whose secondary table this is.
table[i][[1]][1,1]<-primary_table[i,2]
}
secondary_table<-rbindlist(table,fill=TRUE)
Here is the answer I developed based on hrbrmstr's aid: rvest: extract tables with url's instead of text
Practically all the credit goes to him. I modified his code to deal with subpages. I am also grateful to Bharath. My code works but may be very untidy; I hope it will be adaptable for others. Feel free to simplify the code and propose changes.
library(rvest)
library(tidyverse)
library(stringr)
# error: Peer certificate cannot be authenticated with given CA certificates
# https://stackoverflow.com/questions/40397932/r-peer-certificate-cannot-be-authenticated-with-given-ca-certificates-windows
library(httr)
set_config(config(ssl_verifypeer = 0L))
# Helpers
# First based on https://stackoverflow.com/questions/35947123/r-stringr-extract-number-after-specific-string
# str_extract(myStr, "(?i)(?<=ProviderID\\D)\\d+")
get_id <-
function (x, myString) {
require(stringr)
str_extract(x, paste0("(?i)(?<=", myString, "\\D)\\d+"))
}
rm_extra <- function(x) { gsub("\r.*$", "", x) }
mk_gd_col_names <- function(x) {
tolower(x) %>%
gsub("\ +", "_", .)
}
URL <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=%d"
get_table <- function(page_num = 1) {
pg <- read_html(httr::GET(sprintf(URL, page_num)))
tab <- html_nodes(pg, "table")
html_table(tab)[[1]][,-c(1,11)] %>%
set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
mutate_all(funs(rm_extra)) %>%
mutate(link = html_nodes(tab, xpath=".//td[2]/a") %>% html_attr("href")) %>%
mutate(provider_id=get_id(link,"ProviderID")) %>%
as_tibble()
}
pb <- progress_estimated(10)
map_df(1:10, function(i) {
pb$tick()$print()
get_table(page_num = i)
}) -> full_df
#===========level 2===============
# %26 escapes "&"
URL2a <- "https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId="
URL2b <- "&OW=15&OrthopedicSupply=False&Code="
paste0(URL2a,full_df[1,11],URL2b,full_df[1,1])
get_table2 <- function(page_num = 1) {
pg <- read_html(httr::GET(paste0(URL2a,full_df[page_num,11],URL2b,full_df[page_num,1])))
tab <- html_nodes(pg, "table")
html_table(tab)[[1]][,-c(1,8)] %>%
set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
mutate_all(funs(rm_extra)) %>%
mutate(link = html_nodes(tab, xpath=".//td[2]/a") %>% html_attr("href")) %>%
mutate(provider_id=get_id(link,"ProviderID")) %>%
mutate(technical_code=get_id(link,"AgreementTechnicalCode")) %>%
as_tibble()
}
pb <- progress_estimated(nrow(full_df))
map_df(1:nrow(full_df), function(i) {
pb$tick()$print()
get_table2(page_num = i)
}) -> full_df2
#===========level 3===============
URL3a <- "https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId="
URL3b <- "&OW=15&OrthopedicSupply=False&Code=150000001&AgreementTechnicalCode="
paste0(URL3a,full_df2[1,8],URL3b,full_df2[1,9])
get_table3 <- function(page_num = 1) {
pg <- read_html(httr::GET(paste0(paste0(URL3a,full_df2[page_num,8],URL3b,full_df2[page_num,9]))))
tab <- html_nodes(pg, "table")
provider <- as.numeric(full_df2[page_num,8])
html_table(tab)[[1]][,-c(1,8)] %>%
set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
mutate_all(funs(rm_extra)) %>%
mutate(provider_id=provider) %>%
as_tibble()
}
pb <- progress_estimated(nrow(full_df2)+1)
map_df(1:nrow(full_df2), function(i) {
pb$tick()$print()
get_table3(page_num = i)
} ) -> full_df3
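To end up with the combined "info from top page; info from sub-subpages" shape asked for in the question, one more join should do it; a sketch (the column choice is an assumption, since the scraped header names depend on the site):
combined <- full_df3 %>%
  left_join(
    full_df %>%
      mutate(provider_id = as.numeric(provider_id)) %>%  # full_df stores the id as character
      select(provider_id, link),
    by = "provider_id"
  )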
