I am trying to scrape data from Google Reviews (stars, commentary, date, etc.).
I tried to adapt some code I found online, but I am having trouble making it work. Apparently R does not manage to scroll down the Google Reviews pane and only returns the first ten reviews (the ones Google displays without scrolling).
Has anyone come across the same issue? Thanks!
#install.packages("rvest")
#install.packages("xml2")
#install.packages("RSelenium")
library(rvest)
library(xml2)
library(RSelenium)
rmDr <- rsDriver(port = 4444L, browser = c("firefox"))
myclient <- rmDr$client
myclient$navigate("https://www.google.com/search?client=firefox-b-d&q=emporio+santa+maria#lrd=0x94ce576a4e45ed99:0xa36a342d3ceb06c3,1,,,")

# click on the snippet to switch focus ----------
webEle <- myclient$findElement(using = "css", value = ".review-snippet")
webEle$clickElement()

# simulate scrolling down several times -------------
scroll_down_times <- 1000
for (i in 1:scroll_down_times) {
  webEle$sendKeysToActiveElement(sendKeys = list(key = "page_down"))
  # the content needs time to load: wait 3 seconds every 5 scroll-downs
  if (i %% 5 == 0) {
    Sys.sleep(3)
  }
}
# loop and simulate clicking on all "read more" elements -------------
webEles <- myclient$findElements(using = "css", value = ".review-more-link")
for (webEle in webEles) {
  tryCatch(webEle$clickElement(), error = function(e) { print(e) }) # tryCatch prevents any error from stopping the loop
}

pagesource <- myclient$getPageSource()[[1]]

# this should get you the full review, including translation and original text -------------
reviews <- read_html(pagesource) %>%
  html_nodes(".review-full-text") %>%
  html_text()
# number of stars
stars <- read_html(pagesource) %>%
  html_node(".review-dialog-list") %>%
  html_nodes("g-review-stars > span") %>%
  html_attr("aria-label")

# time posted
post_time <- read_html(pagesource) %>%
  html_node(".review-dialog-list") %>%
  html_nodes(".dehysf") %>%
  html_text()
Your code is fine, but you didn't target the correct element: use .review-dialog-list as the CSS selector instead. That is the element that owns the scroll bar.
library(RSelenium)
rmDr <- rsDriver(browser = "firefox")
driver <- rmDr$client
driver$navigate("https://www.google.com/search?client=firefox-b-d&q=emporio+santa+maria#lrd=0x94ce576a4e45ed99:0xa36a342d3ceb06c3,1,,,")
Sys.sleep(3) # wait a few seconds to let the browser render the review window
webEle <- driver$findElement(using = "css", value = ".review-dialog-list")
for (i in 1:5) {
  webEle$sendKeysToElement(sendKeys = list(key = "page_down"))
  Sys.sleep(1)
}
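If you are unsure how many scrolls are enough, a small extension of the above (just a sketch, reusing the .review-dialog-list element and the .review-full-text class from this thread) is to keep paging down until the number of loaded reviews stops growing, and then parse the page source with rvest as in the question:

n_prev <- -1
repeat {
  for (i in 1:5) {
    webEle$sendKeysToElement(sendKeys = list(key = "page_down"))
    Sys.sleep(1)
  }
  # count the reviews currently loaded in the dialog
  n_now <- length(driver$findElements(using = "css", value = ".review-full-text"))
  if (n_now == n_prev) break  # nothing new loaded, stop scrolling
  n_prev <- n_now
}

library(rvest)
pagesource <- driver$getPageSource()[[1]]
reviews <- read_html(pagesource) %>%
  html_nodes(".review-full-text") %>%
  html_text()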
I'm having trouble scraping in R. I want to scrape genre information for several titles on goodreads.
If I do this, it works completely fine and gives me what I need:
library(polite)
library(rvest)
library(dplyr)

session <- bow("https://www.goodreads.com/book/show/29991718-royally-matched",
               delay = 5)
genres <- scrape(session) %>%
  html_elements(".bookPageGenreLink") %>%
  html_text()
genres
However, since I'd like to loop over several pages, I need the following to work, but it always returns character(0).
host <- "https://www.goodreads.com/book/show/29991718-royally-matched"
session <- bow(host,
               delay = 5)
genres <- scrape(session) %>%
  html_elements(".bookPageGenreLink") %>%
  html_text()
genres
Something like this would also be fine for me, but it doesn't work either:
link <- "29991718-royally-matched"
session <- bow(paste0("https://www.goodreads.com/book/show/", link),
               delay = 5)
genres <- scrape(session) %>%
  html_elements(".bookPageGenreLink") %>%
  html_text()
genres
If I open the website with JavaScript disabled, it still works completely fine, so I don't think Selenium is necessary, and I really can't figure out why this doesn't work, which drives me crazy.
Thank you so much for your support!
Solution (kind of)
So I noticed that the success of my scrapings was kind of dependent on the random moods of the scraping gods.
So I did the following:
links <- c("31752345-black-mad-wheel", "00045101-The-Mad-Ship", "2767052-the-hunger-games", "18619684-the-time-traveler-s-wife", "29991718-royally-matched")
data <- data.frame(link = links)

for (link in links) {
  print(link)
  genres <- character(0)
  url <- paste0("https://www.goodreads.com/book/show/", link)
  # I don't know why, but re-saving it kinda helped
  host <- url
  # I had the theory that repeating the scraping would eventually lead to a result. For me that didn't work, though.
  try <- 0
  while (identical(genres, character(0)) && try < 10) {
    try <- try + 1
    print(paste0(try, ": ", link))
    session <- bow(host,
                   delay = 5)
    scraping <- scrape(session)
    genres <- scraping %>%
      html_elements(".bookPageGenreLink") %>%
      html_text()
  }
  if (identical(genres, character(0))) {
    print("Scraping unsuccessful... :(")
  } else {
    print("Scraping success!!")
    genres.df <- data.frame(link = link, genres = genres)
    data <- left_join(data,
                      genres.df, by = c("link"))
  }
}
## then I created a list of the missing titles
## (genre_1 comes from reshaping the genres to a wide format, which is omitted here)
missing_titles <- data %>%
  filter(is.na(genre_1))
missing_links <- unique(missing_titles$link)
The next steps were closing R (while saving the workspace, of course), restarting it, and re-feeding the loop with missing_links instead of links. It took me about 7 iterations of that to get everything I needed, and on the last run I had to plug the one remaining link directly into example 1, since it did not work inside the loop, for whatever reason.
I hope the code kind of works, since I wanted to spare you pages of wild data formatting.
If someone has an explanation for why I needed to go through this hassle, I would still very much appreciate it.
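For reference, the retry logic above can be folded into a small helper function. This is just a compact restatement of the loop (assuming the same URL pattern and CSS selector as above), not an explanation of the flakiness:

library(polite)
library(rvest)

# retry the polite scrape until genres are found or max_tries is reached
scrape_genres <- function(link, max_tries = 10) {
  url <- paste0("https://www.goodreads.com/book/show/", link)
  genres <- character(0)
  try <- 0
  while (identical(genres, character(0)) && try < max_tries) {
    try <- try + 1
    session <- bow(url, delay = 5)
    genres <- scrape(session) %>%
      html_elements(".bookPageGenreLink") %>%
      html_text()
  }
  genres
}

# e.g. scrape_genres("29991718-royally-matched")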
You can consider using the R package RSelenium as follows:
library(RSelenium)
library(rvest)
url <- "https://www.goodreads.com/book/show/29991718-royally-matched"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
page_Content <- remDr$getPageSource()[[1]]
read_html(page_Content) %>% html_elements(".bookPageGenreLink") %>% html_text()
Afterwards, you can loop over the URLs you want.
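For example, with the Selenium session from above still open, you could loop over several book slugs along these lines (a sketch; the slug vector is only illustrative):

# illustrative slugs; replace with your own list
links <- c("29991718-royally-matched", "2767052-the-hunger-games")

genres_by_book <- lapply(links, function(link) {
  remDr$navigate(paste0("https://www.goodreads.com/book/show/", link))
  Sys.sleep(2)  # give the page a moment to render
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_elements(".bookPageGenreLink") %>%
    html_text()
})
names(genres_by_book) <- links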
I'm very new to R, but trying to create a web scraper to help inspire my team.
The site I want to scrape is https://www.parl.ca/legisinfo/Home.aspx?Page=1 . If you visit that site it will push you to a page where the 43rd session is selected by default, but I want to include all sessions. If you deselect the session, the site sends you to the URL above.
My problem is finding the last page. By clicking >> at the bottom of the page, I know there are 328 pages in total. I want the code to stay relevant over time, so I want it to find the last page number itself so I can build a list of pages. I don't know how to do that when the number for the last page isn't actually in the page text. The code below returns ">>" instead of "328".
url <- 'https://www.parl.ca/legisinfo/Home.aspx?Page=1'
url %>%
  read_html() %>%
  html_nodes('.resultPageCurrent') %>%
  html_text() %>%
  tail(1L)
When I look at the source code I do find the magic 328 number, I just don't know how to point to it. Thank you for your help!
An interesting problem. I decided to see if there was an API, and there are a number of them. See: https://openparliament.ca/api/
Not spending much time reading, I opted for the bills API endpoint. I couldn't easily see a request that returned the total bill count, or all bills in one go. I did find I could call one of the endpoints in a loop, using a query string to return 500 bills at a time (the maximum I could get in one request) and an offset to get the next batch. I used an open-ended while loop with a break (gasp), as I was unsure how else to stop making requests when next_url in the response JSON was NULL.
library(jsonlite)
library(data.table)
library(dplyr)

n <- 0
result <- list()

while (TRUE) {
  url <- sprintf("https://api.openparliament.ca/bills/?limit=500&offset=%i&format=json", n)
  data <- jsonlite::read_json(url, simplifyVector = TRUE)
  next_url <- data$pagination$next_url
  df <- data$objects %>% mutate(name_en = name$en, name_fr = name$fr) %>% select(-c('name'))
  result[[(n / 500) + 1]] <- df
  if (is.null(next_url)) {
    break
  } else {
    n <- n + 500
  }
}

df <- rbindlist(result)
df %>% arrange(desc(introduced)) %>% head()
df contains the data you see on the page for all bills listed via API. Currently 6,620 rows.
Py:
import requests
import pandas as pd

n = 0
results = []

def get_dict_list(value: list) -> list:
    output = []
    for i in value:
        d = {k: v for k, v in i.items() if k != 'name'}
        d['name_en'] = i['name']['en']
        d['name_fr'] = i['name']['fr']
        output.append(d)
    return output

with requests.Session() as s:
    while True:
        url = f'https://api.openparliament.ca/bills/?limit=500&offset={n}&%5D=&format=json'
        r = s.get(url)
        data = r.json()
        next_url = data['pagination']['next_url']
        results.extend(get_dict_list(data['objects']))
        if next_url is None:
            break
        n += 500

df = pd.DataFrame(results)
I decided to try using RSelenium to navigate to the last page and grab the last page number. This did the trick, but it was not very elegant or efficient. I allow myself that as someone who is new to R and web scraping.
#Step 1 - call packages
library(tidyverse)
library(rvest)
library(RSelenium)
#Step 2 - Identify the main webpage
url <- 'https://www.parl.ca/legisinfo/Home.aspx'
#Step 3 - Open browser
rD <- rsDriver(port = 4546L, version = "latest", browser=c("firefox"))
remDr <- rD[["client"]]
#Step 4 - Navigate to page
remDr$navigate(url)
#Step 5 - Select filters
webElem <- remDr$findElement(using = 'css selector',".filterTopRemoveImage")
webElem$clickElement() #Click to remove default filter so that all Parliamentary sessions are included
webElem <- remDr$findElement(using = 'css selector',".filterRefiner~ div:nth-child(5) .filterRefinement")
webElem$clickElement() #Click to select only House of Commons Bills
#Step 6 - Get last page by navigating >>
webElem <- remDr$findElement(using = 'link text',">>")
webElem$clickElement()
html <- remDr$getPageSource()[[1]] #Get everything on the page
html <- read_html(html) #Parse HTML
lastpage <- html %>% html_nodes(".resultPageCurrent:nth-child(12)") %>% html_text()
lastpage
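With lastpage in hand, you can then build the list of page URLs the question asks for, for example:

# build the full list of page URLs from the scraped last page number
pages <- paste0("https://www.parl.ca/legisinfo/Home.aspx?Page=",
                seq_len(as.numeric(trimws(lastpage))))
head(pages)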
I'm trying to scrape links to all minutes and agendas provided on this website: https://www.charleston-sc.gov/AgendaCenter/
I've managed to scrape the section IDs associated with each category (and the years for each category) so I can loop through the contents within each category-year (please see below). But I don't know how to scrape the hrefs that live inside the contents. In particular, because the links to agendas live inside the drop-down menu under 'Download', it seems like I need to go through extra clicks to scrape those hrefs.
How do I scrape the minutes and agendas (inside the Download drop-down) for each table I select? Ideally, I would like a table with the date, the title of the agenda, links to the minutes, and links to the agenda.
I'm using RSelenium for this. Please see the code I have so far below, which allows me to click through each category and year, but not much else. Please help!
rm(list = ls())
library(RSelenium)
library(tidyverse)
library(httr)
library(XML)
library(stringr)
library(RCurl)

t  <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
co <- str_match(t, 'aria-label="(.*?)"[ ]href="java')[,2]
yr <- str_match(t, 'id="(.*?)" aria-label')[,2]

df <- data.frame(cbind(co, yr)) %>%
  mutate_all(as.character) %>%
  filter_all(any_vars(!is.na(.))) %>%
  mutate(id = ifelse(grepl('^a0', yr), gsub('a0', '', yr), NA)) %>%
  tidyr::fill(c(co, id), .direction = 'down') %>%
  drop_na(co)

remDr <- remoteDriver(port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate('https://www.charleston-sc.gov/AgendaCenter/')
remDr$screenshot(display = T)

for (j in unique(df$id)){
  remDr$findElement(using = 'xpath',
                    value = paste0('//*[@id="cat', j, '"]/h2'))$clickElement()
  for (k in unique(df[which(df$id == j), 'yr'])){
    remDr$findElement(using = 'xpath',
                      value = paste0('//*[@id="', k, '"]'))$clickElement()
    # NEED TO SCRAPE THE HREF ASSOCIATED WITH MINUTES AND AGENDA DOWNLOAD HERE #
  }
}
Maybe you don't really need to click through all the elements? You can use the fact that all downloadable links have ViewFile in their href:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')

viewfile <- str_extract_all(t, '.*ViewFile.*', simplify = T)
viewfile <- viewfile[viewfile != '']

library(data.table) # I use data.table because it's more convenient - but it can be done without it too
dt.viewfile <- data.table(origStr = viewfile)

# list the elements and patterns we will be looking for:
searchfor <- list(
  Title = 'name=[^ ]+ title=\"(.+)\" href',
  Date  = '<strong>(.+)</strong>',
  href  = 'href=\"([^\"]+)\"',
  label = 'aria-label=\"([^\"]+)\"'
)

for (this.i in names(searchfor)){
  this.full <- paste0('.*', searchfor[[this.i]], '.*');
  dt.viewfile[grepl(this.full, origStr), (this.i) := gsub(this.full, '\\1', origStr)]
}

# Clean records:
dt.viewfile[, `:=`(Title = na.omit(Title), Date = na.omit(Date), label = na.omit(label)),
            by = href]
dt.viewfile[, Date := gsub('<abbr title=".*">(.*)</abbr>', '\\1', Date)]
dt.viewfile <- unique(dt.viewfile[, .(Title, Date, href, label)]); # 690 records
The result is a table with the links to all downloadable files. You can now download them using any tool you like, for example download.file() or GET():
dt.viewfile[, full.url := paste0('https://www.charleston-sc.gov', href)]
dt.viewfile[, filename := fs::path_sanitize(paste0(Title, ' - ', Date), replacement = '_')]

for (i in seq_len(nrow(dt.viewfile[1:10,]))){ # remove `1:10` limitation to process all records
  url      <- dt.viewfile[i, full.url]
  destfile <- dt.viewfile[i, filename]
  cat('\nDownloading', url, ' to ', destfile)
  fil <- GET(url, write_disk(destfile))

  # our destination file doesn't have an extension, we need to get it from the server:
  serverFilename  <- gsub("inline;filename=(.*)", '\\1', headers(fil)$`content-disposition`)
  serverExtension <- tools::file_ext(serverFilename)

  # add the extension to the file we just saved
  file.rename(destfile, paste0(destfile, '.', serverExtension))
}
Now the only problem is that the original webpage only shows records for the last three years. But instead of clicking View More through RSelenium, we can simply load the page for earlier dates, something like this:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017', encoding = 'UTF-8')
Then repeat the rest of the code as necessary.
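If you need to cover a longer period, you could loop over several date windows with the same search URL pattern (a sketch; it assumes the startDate/endDate parameters accept arbitrary MM/DD/YYYY ranges, which I have not verified beyond the example above):

windows <- list(c("10/14/2008", "10/14/2011"),
                c("10/14/2011", "10/14/2014"),
                c("10/14/2014", "10/14/2017"))

t <- unlist(lapply(windows, function(w) {
  url <- paste0("https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all",
                "&startDate=", w[[1]], "&endDate=", w[[2]])
  readLines(url, encoding = "UTF-8")
}))
# then continue with str_extract_all(t, '.*ViewFile.*', simplify = TRUE) as above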
I'm scraping the reviews of a Google Play app in R, but I can't get the number of votes. The code I'm using is: likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label"), and I get no value. How can it be done?
How can I scrape the votes?
FULL CODE
# Loading the packages
library(rvest)
library(magrittr)  # for the '%>%' pipe symbol
library(RSelenium) # to get the loaded html of the page

url <- 'https://play.google.com/store/apps/details?id=com.gospace.parenteral&showAllReviews=true'

# starting local RSelenium (this is the only way of starting RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()

# go to the website
remDr$navigate(url)

# get the page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()

likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label")
What it returns:
NA NA NA
What I want it to return:
3 3 2
Maybe you are using SelectorGadget to get the CSS selector. Like you, I tried to do that, but the selector it returns is not the correct one.
Inspecting the HTML code of the page, I realized that the correct element is contained in the tag with class = "jUL89d y92BAb".
So the code you should use is this one:
html_obj %>% html_nodes('.jUL89d') %>% html_text()
My personal recommendation is to always check the source code to confirm the output of SelectorGadget.
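Putting it together, you can convert the scraped text into numbers and keep working with it (a sketch; the .jUL89d class comes from the answer above and may well change as Google updates the page):

votes_raw <- html_obj %>% html_nodes(".jUL89d") %>% html_text()
votes <- as.numeric(gsub("\\D", "", votes_raw))  # strip any non-digit characters
votes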
I quite often need to scrape and analyze public data on health-care contracts, and I have partially automated that in VBA.
I probably deserve a couple of downvotes, but I spent last night trying to set up RSelenium and succeeded in firing up the server and running some examples that copy single tables to data frames. I am a beginner at web scraping.
I am working with a dynamically generated site:
https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false
I deal with three levels of pages:
Level 1
My top pages have the following structure (column A contains links; at the bottom there are page numbers):
========
A, B, C
link_A,15,10
link_B,23,12
link_C,21,12
link_D,32,12
========
1,2,3,4,5,6,7,8,9,...
======================
I have just learned SelectorGadget, which indicates:
table: .table-striped
page numbers (1, 2, 3, 4, 5, 6, 7): .pagination-container
Level 2
Under each link (link_A, link_B, ...) in the table there is a subpage which contains a table. Example: https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009
============
F, G, H
link_agreements,34,23
link_agreements,23,23
link_agreements,24,24
============
SelectorGadget indicates:
.table-striped
Level 3
Again, under each link (link_agreements) there is another sub-subpage with the data that I want to collect:
https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009&AgreementTechnicalCode=761176
============
X,Y,Z
orthopedics, 231,323
traumatology, 323,248
hematology, 323,122
Again, SelectorGadget indicates:
.table-striped
I would like to iteratively collect all the subpages into a data frame that would look like:
Info from top page; info from sub-subpages
link_A (from top page); 15 (value from column B); orthopedics, 231, 323
link_A (from top page); 15 (value from column B); traumatology, 323, 248
link_A (from top page); 15 (value from column B); hematology, 323, 122
Is there a cookbook or some good examples for RSelenium or rvest showing how to iterate through the links in the tables and get the data from the sub(sub)-pages into a data frame?
I would appreciate any info, an example, any hints, or a book showing how to do it with RSelenium or any other scraping package.
P.S. Warning: I am also encountering invalid SSL certificate issues with this page. I am working with the Firefox Selenium driver, so each time I need to manually skip the warning; that is a topic for another question.
P.S. The code I tried so far, which I found to be a dead end:
install.packages("RSelenium")
install.packages("wdman")
library(RSelenium)
library(wdman)
library(XML)
Next I started selenium, I immediately had issues with "java 8 present, java 7 needed issues solved by removing all java?.exe files wrom Windows/System32 or SysWOW64
library(wdman)
library(XML)

selServ <- selenium(verbose = TRUE) # installs Selenium
selServ$process

remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4567,
                      browserName = "firefox")
remDr$open(silent = F)
remDr$navigate("https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=17480&OW=13&OrthopedicSupply=False&Code=130000111&AgreementTechnicalCode=773979")

webElem <- remDr$findElement(using = "class name", value = "table-striped")
webElemtxt <- webElem$getElementAttribute("outerHTML")[[1]]
table <- readHTMLTable(webElemtxt, header = FALSE, as.data.frame = TRUE)[[1]]

webElem$clickElement()
webElem$sendKeysToElement(list(key = "tab", key = "enter"))
Here my struggle with RSelenium ended. I could not send keys to Chrome, and I could not work with Firefox because it demanded correct SSL certificates, which I could not effectively bypass.
table <- list()
library(rvest)

# PRIMARY TABLE EXTRACTION
for (i in 1:10){
  url <- paste0("https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=", i)
  page <- html_session(url)
  table[i] <- html_table(page)
}

library(data.table)
primary_table <- rbindlist(table, fill = TRUE)

# DATA CLEANING REQUIRED IN PRIMARY TABLE to clean the variable
# `Kod Sortuj według kodu świadczeniodawcy`.
# Clean it and store it in the primary table column; only then will the secondary table extraction work.

# SECONDARY TABLE EXTRACTION
for (i in 1:10){
  url <- paste0("https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20795&OW=15&OrthopedicSupply=False&Code=", primary_table[i, 2])
  page <- html_session(url)
  table[i] <- html_table(page)
  # This is the key by which you can identify whose secondary table this is.
  table[i][[1]][1, 1] <- primary_table[i, 2]
}
secondary_table <- rbindlist(table, fill = TRUE)
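A minimal sketch of the cleaning step flagged in the comments above (assuming the extra text in the Kod column starts at a carriage return, which is the same assumption the gsub("\r.*$", "", x) helper in the answer below makes):

library(data.table)
# keep only the code itself: drop everything from the first carriage return onwards
data.table::set(primary_table, j = 2L,
                value = gsub("\r.*$", "", primary_table[[2]]))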
Here is the answer I developed based on hbmstr's aid in rvest: extract tables with url's instead of text.
Practically all the credit goes to him. I modified his code to deal with the subpages. I am also grateful to Bharath. My code works, but it may be very untidy. I hope it will be adaptable for others. Feel free to simplify the code and propose changes.
library(rvest)
library(tidyverse)
library(stringr)
# error: Peer certificate cannot be authenticated with given CA certificates
# https://stackoverflow.com/questions/40397932/r-peer-certificate-cannot-be-authenticated-with-given-ca-certificates-windows
library(httr)
set_config(config(ssl_verifypeer = 0L))
# Helpers
# First based on https://stackoverflow.com/questions/35947123/r-stringr-extract-number-after-specific-string
# str_extract(myStr, "(?i)(?<=ProviderID\\D)\\d+")
get_id <- function (x, myString) {
  require(stringr)
  str_extract(x, paste0("(?i)(?<=", myString, "\\D)\\d+"))
}

rm_extra <- function(x) { gsub("\r.*$", "", x) }

mk_gd_col_names <- function(x) {
  tolower(x) %>%
    gsub("\ +", "_", .)
}
URL <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=%d"
get_table <- function(page_num = 1) {
  pg  <- read_html(httr::GET(sprintf(URL, page_num)))
  tab <- html_nodes(pg, "table")
  html_table(tab)[[1]][, -c(1, 11)] %>%
    set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
    mutate_all(funs(rm_extra)) %>%
    mutate(link = html_nodes(tab, xpath = ".//td[2]/a") %>% html_attr("href")) %>%
    mutate(provider_id = get_id(link, "ProviderID")) %>%
    as_tibble()
}

pb <- progress_estimated(10)
map_df(1:10, function(i) {
  pb$tick()$print()
  get_table(page_num = i)
}) -> full_df
#===========level 2===============
# %26 escapes "&"
URL2a <- "https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId="
URL2b <- "&OW=15&OrthopedicSupply=False&Code="
paste0(URL2a,full_df[1,11],URL2b,full_df[1,1])
get_table2 <- function(page_num = 1) {
  pg  <- read_html(httr::GET(paste0(URL2a, full_df[page_num, 11], URL2b, full_df[page_num, 1])))
  tab <- html_nodes(pg, "table")
  html_table(tab)[[1]][, -c(1, 8)] %>%
    set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
    mutate_all(funs(rm_extra)) %>%
    mutate(link = html_nodes(tab, xpath = ".//td[2]/a") %>% html_attr("href")) %>%
    mutate(provider_id = get_id(link, "ProviderID")) %>%
    mutate(technical_code = get_id(link, "AgreementTechnicalCode")) %>%
    as_tibble()
}

pb <- progress_estimated(nrow(full_df))
map_df(1:nrow(full_df), function(i) {
  pb$tick()$print()
  get_table2(page_num = i)
}) -> full_df2
#===========level 3===============
URL3a <- "https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId="
URL3b <- "&OW=15&OrthopedicSupply=False&Code=150000001&AgreementTechnicalCode="
paste0(URL3a,full_df2[1,8],URL3b,full_df2[1,9])
get_table3 <- function(page_num = 1) {
  pg  <- read_html(httr::GET(paste0(URL3a, full_df2[page_num, 8], URL3b, full_df2[page_num, 9])))
  tab <- html_nodes(pg, "table")
  provider <- as.numeric(full_df2[page_num, 8])
  html_table(tab)[[1]][, -c(1, 8)] %>%
    set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
    mutate_all(funs(rm_extra)) %>%
    mutate(provider_id = provider) %>%
    as_tibble()
}

pb <- progress_estimated(nrow(full_df2) + 1)
map_df(1:nrow(full_df2), function(i) {
  pb$tick()$print()
  get_table3(page_num = i)
}) -> full_df3
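If you want the combined table sketched at the top of the question (top-page info next to sub-subpage rows), one option is a join on provider_id, for example (a sketch; it assumes provider_id is enough to identify the provider, and converts it to character to match full_df):

combined <- full_df3 %>%
  mutate(provider_id = as.character(provider_id)) %>%  # full_df stores provider_id as character
  left_join(full_df, by = "provider_id", suffix = c("_plan", "_provider"))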