RSelenium: Scraping a dynamically loaded page that loads slowly

I'm not sure whether it's because my internet connection is slow, but I'm trying to scrape a website that loads more information as you scroll down the page. I'm executing a script that scrolls to the end of the page and waits for the Selenium/Chrome server to load the additional content. The server does load the new content, because I'm able to scrape information that wasn't on the page originally and the new content shows up in the Chrome window, but it only updates once. I set a Sys.sleep() of a minute on each iteration so the content has plenty of time to load, but it still doesn't update more than once. Am I using RSelenium incorrectly? Are there other ways of scraping a site that loads content dynamically?
Anyway, any kind of advice or help you can provide would be awesome.
Below is what I think is the relevant portion of my code with regards to loading the new content at the end of the page:
for(i in 1:3){
  webElem <- remDr$findElement('css', 'body')
  remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);')
  Sys.sleep(60)
}
Below is the full code:
library(RSelenium)
library(rvest)
library(stringr)

rsDriver(port = 4444L, browser = 'chrome')
remDr <- remoteDriver(browser = 'chrome')
remDr$open()
remDr$navigate('http://www.codewars.com/kata')

#find the total number of recorded katas
tot_kata <- remDr$findElement(using = 'css', '.is-gray-text')$getElementText() %>%
  unlist() %>%
  str_extract('\\d+') %>%
  as.numeric()

#there are about 30 katas per page reload
tot_pages <- (tot_kata/30) %>%
  ceiling()

#will be 1:tot_pages once I know the below code works
for(i in 1:3){
  webElem <- remDr$findElement('css', 'body')
  remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);')
  Sys.sleep(60)
}

page_source <- remDr$getPageSource()
kata_vector <- read_html(page_source[[1]]) %>%
  html_nodes('.item-title a') %>%
  html_attr('href') %>%
  str_replace('/kata/', '')

remDr$close()

The website provides an API, which should be the first port of call. Failing that, you can access individual pages directly, for example:
http://www.codewars.com/kata?page=21
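If those paginated URLs return server-rendered HTML (an assumption I haven't verified), a minimal rvest sketch along those lines might look like the following; the '.item-title a' selector is taken from the question and the page numbering is a guess:
library(rvest)
library(stringr)

# assumes ?page= returns static HTML and that kata links match '.item-title a'
kata_vector <- unlist(lapply(1:3, function(i) {
  read_html(paste0('http://www.codewars.com/kata?page=', i)) %>%
    html_nodes('.item-title a') %>%
    html_attr('href') %>%
    str_replace('/kata/', '')
}))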
If you want to use RSelenium to scroll to the bottom of the page until there is no more content, you can use the "Loading..." element, which has the class js-infinite-marker. While this element is still on the page, we attempt to scroll down to it every second (with some error catching for any issues). Once the element is no longer present, we assume all content has loaded:
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = 'chrome')
remDr <- rD$client # You don't need to use the open method
remDr$navigate('http://www.codewars.com/kata')
chk <- FALSE
while(!chk){
  webElem <- remDr$findElements("css", ".js-infinite-marker")
  if(length(webElem) > 0L){
    tryCatch(
      remDr$executeScript("elem = arguments[0];
                           elem.scrollIntoView();
                           return true;", list(webElem[[1]])),
      error = function(e){}
    )
    Sys.sleep(1L)
  }else{
    chk <- TRUE
  }
}
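Once the loop exits, the page source can be grabbed and parsed exactly as in the question's code (assuming rvest and stringr are loaded as above):
page_source <- remDr$getPageSource()
kata_vector <- read_html(page_source[[1]]) %>%
  html_nodes('.item-title a') %>%
  html_attr('href') %>%
  str_replace('/kata/', '')
remDr$close()
rD$server$stop()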

Related

Using purrr::map to loop through web pages for scraping with RSelenium

I have a basic R script, which I have cobbled together using RSelenium, that allows me to log in to a website; once authenticated, the script goes to the first page of interest and pulls three pieces of text from the page.
Luckily for me, the URL is constructed in such a way that I can pass a vector of numbers into it to move to the next page of interest, hence the use of map().
While on each page, I want to scrape the same three elements and store them in a master data frame for later analysis.
I wish to use the map family of functions so that I can become more familiar with them, but I am really struggling to get these to work. Could anyone kindly tell me where I am going wrong?
Here is the main part of my code (go to website and log in)
library(RSelenium)
# https://stackoverflow.com/questions/55201226/session-not-created-this-version-of-chromedriver-only-supports-chrome-version-7/56173984
rd <- rsDriver(browser = "chrome",
               chromever = "88.0.4324.27",
               port = netstat::free_port())
remdr <- rd[["client"]]
# url of the site's login page
url <- "https://www.myWebsite.com/"
# Navigating to the page
remdr$navigate(url)
# Wait 5 secs for the page to load
Sys.sleep(5)
# Find the initial login button to bring up the username and password fields
loginbutton <- remdr$findElement(using = 'css selector','.plain')
# Click the initial login button to bring up the username and password fields
loginbutton$clickElement()
# Find the username box
username <- remdr$findElement(using = 'css selector','#username')
# Find the password box
password <- remdr$findElement(using = 'css selector','#password')
# Find the final login button
login <- remdr$findElement(using = 'css selector','#btnLoginSubmit1')
# Input the username
username$sendKeysToElement(list("myUsername"))
# Input the password
password$sendKeysToElement(list("myPassword"))
# Click login
login$clickElement()
And hey presto we're in!
Now my code takes me to the initial page of interest (index = 1)
Above I mentioned that I am looking to increment through each page; I can do this by substituting an integer into the URL at the rcId parameter, see below:
#remdr$navigate("https://myWebsite.com/rc_redesign/#/layout/jcard/drugCard?accountId=XXXXXX&rcId=1&searchType=R&reimbCode=&searchTerm=&searchTexts=*") # Navigating to the page
For each rcId in 1:9999 I wish to grab the following 3 elements and store them in a data frame
hcpcs_info <- remdr$findElement(using = 'class','is-jcard-heading')
hcpcs <- hcpcs_info$getElementText()[[1]]
hcpcs_description <- remdr$findElement(using = 'class','is-jcard-desc')
hcpcs_desc <- hcpcs_description$getElementText()[[1]]
tc_info <- remdr$findElement(using = 'css selector','.col-12.ng-star-inserted')
therapeutic_class <- tc_info$getElementText()[[1]]
I have tried creating a separate function and passing it to map(), but I am not advanced enough to piece this together; below is what I have tried.
my_function <- function(index) {
  remdr$navigate(sprintf("https://rc2.reimbursementcodes.com/rc_redesign/#/layout/jcard/drugCard?accountId=113479&rcId=%d&searchType=R&reimbCode=*&searchTerm=*&searchTexts=*",index)
  Sys.sleep(5)
  hcpcs_info[index] <- remdr$findElement(using = 'class','is-jcard-heading')
  hcpcs[index] <- hcpcs_info$getElementText()[index][[1]])
}
x <- 1:10 %>%
  map(~ my_function(.x))
Any help would be greatly appreciated
Try the following:
library(RSelenium)
library(tibble)
purrr::map_df(1:10, ~{
  remdr$navigate(sprintf("https://rc2.reimbursementcodes.com/rc_redesign/#/layout/jcard/drugCard?accountId=113479&rcId=%d&searchType=R&reimbCode=*&searchTerm=*&searchTexts=*", .x))
  Sys.sleep(5)
  hcpcs_info <- remdr$findElement(using = 'class','is-jcard-heading')
  hcpcs <- hcpcs_info$getElementText()[[1]]
  hcpcs_description <- remdr$findElement(using = 'class','is-jcard-desc')
  hcpcs_desc <- hcpcs_description$getElementText()[[1]]
  tc_info <- remdr$findElement(using = 'css selector','.col-12.ng-star-inserted')
  therapeutic_class <- tc_info$getElementText()[[1]]
  tibble(hcpcs, hcpcs_desc, therapeutic_class)
}) -> result
result
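Since the question mentions looping over rcId values 1:9999, some pages may time out or lack one of the elements. As a sketch (not part of the original answer), the scraping step can be wrapped in purrr::possibly() so that a failing page yields an empty row instead of aborting the whole run:
library(purrr)
library(tibble)

scrape_one <- function(rcId) {
  remdr$navigate(sprintf("https://rc2.reimbursementcodes.com/rc_redesign/#/layout/jcard/drugCard?accountId=113479&rcId=%d&searchType=R&reimbCode=*&searchTerm=*&searchTexts=*", rcId))
  Sys.sleep(5)
  hcpcs <- remdr$findElement(using = 'class', 'is-jcard-heading')$getElementText()[[1]]
  hcpcs_desc <- remdr$findElement(using = 'class', 'is-jcard-desc')$getElementText()[[1]]
  therapeutic_class <- remdr$findElement(using = 'css selector', '.col-12.ng-star-inserted')$getElementText()[[1]]
  tibble(hcpcs, hcpcs_desc, therapeutic_class)
}

# a page that errors returns an empty tibble, so map_df() simply skips it
safe_scrape <- possibly(scrape_one, otherwise = tibble())
result <- map_df(1:10, safe_scrape)  # widen to 1:9999 once it works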

RSelenium scraping returning odd results

I am trying to scrape some news sources' search pages using RSelenium. Here's my code:
library(rvest)
library(RSelenium)
#open the browser
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]
#create a blank space to put the links
urlslist_final = list()
##loop through the page number at the end until done with ~1000 / 20 = 50
for (i in 1:2) { ##change this to 50
  url = paste0('https://www.npr.org/search?query=kavanaugh&page=', i)
  #navigate to it
  remDr$navigate(url)
  #get the links
  webElems <- remDr$findElements(using = "css", "[href]")
  urlslist_final[[i]] = unlist(sapply(webElems, function(x) {x$getElementAttribute("href")}))
  #don't go too fast
  Sys.sleep(runif(1, 1, 5))
} #close the loop
remDr$close()
# stop the selenium server
rD[["server"]]$stop()
If I set i = 1 and click over to the browser after the page is navigated to, then I get the desired 166 links, including the specific search-result links I'm trying to scrape:
> str(urlslist_final)
List of 1
$ : chr [1:166] "https://media.npr.org/templates/favicon/favicon-180x180.png" "https://media.npr.org/templates/favicon/favicon-96x96.png" "https://media.npr.org/templates/favicon/favicon-32x32.png" "https://media.npr.org/templates/favicon/favicon-16x16.png" ...
However, if I just run my loop, I get only 91 results, and none of them are the actual results from the search:
> str(urlslist_final)
List of 2
$ : chr [1:91] "https://media.npr.org/templates/favicon/favicon-180x180.png" "https://media.npr.org/templates/favicon/favicon-96x96.png" "https://media.npr.org/templates/favicon/favicon-32x32.png" "https://media.npr.org/templates/favicon/favicon-16x16.png" ...
Any help understanding the difference here? What can I do differently? I tried just using rvest, but I couldn't get it to find the result links, which are embedded in the page's script.
Thanks to my friend Thom, here's a good solution:
#scroll on the page
webscroll <- remDr$findElement("css", "body")
webscroll$sendKeysToElement(list(key = "end"))
I put that code between navigating to the page and capturing the links, which made the site render the search results as if I were browsing it normally, so I could scrape the links.
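For reference, here is roughly how that scroll slots into the loop from the question (the short extra pause after the keypress is my addition, to give the results time to render):
for (i in 1:2) { ##change this to 50
  url = paste0('https://www.npr.org/search?query=kavanaugh&page=', i)
  remDr$navigate(url)
  #scroll on the page so the search results render
  webscroll <- remDr$findElement("css", "body")
  webscroll$sendKeysToElement(list(key = "end"))
  Sys.sleep(2)
  #get the links
  webElems <- remDr$findElements(using = "css", "[href]")
  urlslist_final[[i]] = unlist(sapply(webElems, function(x) {x$getElementAttribute("href")}))
  #don't go too fast
  Sys.sleep(runif(1, 1, 5))
}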

Scrape a paginated table in R

Apologies in advance if this is very basic, but I'm lost on this!
I want to scrape the following table in R,
http://dgsp.cns.gob.mx/Transparencia/wConsultasGeneral.aspx
However, this page is rendered dynamically (I believe via JavaScript). I tried RSelenium, but I am not having success scraping the 17 pages of this table.
Could you give me a hint about how to scrape the entire content of this table?
Given it's just 17 pages, I would manually click through the pages and save the HTML source. It would take no more than 3-5 minutes this way.
However, if you want to do it programmatically, we can start by writing a function that takes a page number, finds the link for that page, clicks on the link, and returns the HTML source for that page:
get_html <- function(i) {
  webElem <- remDr$findElement(using = "link text", as.character(i))
  webElem$clickElement()
  Sys.sleep(s)
  remDr$getPageSource()[[1]]
}
Initialize some values:
s <- 2 # seconds to wait between each page
total_pages <- 17
html_pages <- vector("list", total_pages)
Start the browser, navigate to page 1, and save the source:
library(RSelenium)
rD <- rsDriver()
remDr <- rD[["client"]]
base_url <- "http://dgsp.cns.gob.mx/Transparencia/wConsultasGeneral.aspx"
remDr$navigate(base_url)
src <- remDr$getPageSource()[[1]]
html_pages[[1]] <- src
For pages 2 to 17, we use a for-loop and call the function we wrote above, taking care to account specially for page 11:
for (i in 2:total_pages) {
  if (i == 11) {
    webElem <- remDr$findElement(using = "link text", "...")
    webElem$clickElement()
    Sys.sleep(s)
    html_pages[[i]] <- remDr$getPageSource()[[1]]
  } else {
    html_pages[[i]] <- get_html(i)
  }
}
remDr$close()
The result is html_pages, a list of length 17, with each element containing the HTML source of one page. How to parse the data from the HTML into some other form (e.g. a data frame) is probably a separate question in itself.
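As a rough sketch of that parsing step (assuming the table of interest is the first <table> on each page and that rvest's html_table() can read it; both are assumptions, so the result will likely need cleaning):
library(rvest)
library(dplyr)

tables <- lapply(html_pages, function(src) {
  read_html(src) %>%
    html_table() %>%
    .[[1]]  # assumes the data of interest is the first table on the page
})
full_table <- bind_rows(tables)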

Click on cross domain iframe element using Rselenium

I am using R, version 3.3.2. Using the RSelenium package, I am trying to scrape some data from this website: http://www.dziv.hr/en/e-services/on-line-database-search/patents/
I am using Rselenium and my code looks like this:
selServ <- RSelenium::startServer(javaargs = c("-Dwebdriver.gecko.driver=\"C:/Users/Mislav/Documents/geckodriver.exe\""))
remDr <- remoteDriver(extraCapabilities = list(marionette = TRUE))
remDr$open()
Sys.sleep(2)
# Simulate browser session and fill out form
remDr$navigate("http://www.dziv.hr/hr/e-usluge/pretrazivanje-baza-podataka/patent/")
This doesn't work:
webel <- remDr$findElement(using = "xpath", "/input[@id = 'TB1']")
Then I wanted to switch to the iframe using the switchToFrame() function, but the iframe does not have an id.
Then I tried to use an index, webel <- remDr$switchToFrame(1), but this just returns NULL.
Also, I noticed the iframe is on a different domain.
Is it possible to scrape data from this website?
You can just select the first iframe and pass it to the switchToFrame method:
webElem <- remDr$findElements("css", "iframe")
remDr$switchToFrame(webElem[[1]])
webel <- remDr$findElement(using = "xpath", "//input[@id = 'TB1']")
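From there you can interact with the element as usual, for example sending a search term (the text here is just a placeholder) and pressing Enter:
webel$sendKeysToElement(list("some search term", key = "enter"))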

Web-scraping in R: How to scrape the data behind "+ More" buttons?

Suppose I want to get information about Amenities from this webpage (https://www.airbnb.com/rooms/6676364). Scraping works OK, but only for the visible part.
But how do I extract the rest, which is hidden behind the "+ More" button?
I tried selecting the node from the page source with xpathSApply, but it only returns "+ More".
Do you know a solution to this problem?
My RSelenium approach:
url <- "https://www.airbnb.com/rooms/12344760"
library('RSelenium')
pJS <- phantom()
library('XML')
shell.exec(paste0("C:\\Users\\Daniil\\Desktop\\R-language,Python\\file.bat"))
Sys.sleep(10)
checkForServer()
startServer()
remDr <- remoteDriver(browserName="chrome", port=4444)
remDr$open(silent=T)
remDr$navigate(url)
var <- remDr$findElement('id','details') ### extracting the whole details table ###
vartxt <- var$getElementAttribute("outerHTML")[[1]]
varxml <- htmlParse(vartxt, useInternalNodes=T)
Amenities <- xpathSApply(varxml, "//div[@class = 'expandable-content expandable-content-full']", xmlValue)
This also doesn't work.
After you navigate the RSelenium driver to the target URL, use the following XPath to find the <a> element whose inner text equals '+ More' within the amenities <div>:
remDr$navigate(url)
link <- remDr$findElement(using = 'xpath', "//div[#class='row amenities']//a[.='+ More']")
Then click the link to get the complete list of amenities:
link$clickElement()
Lastly, pass the current page's HTML source to whatever R function you want to use for further processing:
doc <- htmlParse(remDr$getPageSource()[[1]])
....
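For example, the amenities text could then be pulled out of doc with the corrected XPath from the question (the class names come from the question and may have changed since):
amenities <- xpathSApply(doc,
                         "//div[@class = 'expandable-content expandable-content-full']",
                         xmlValue)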
