RSelenium scraping returning odd results

I am trying to scrape some news sources search pages using RSelenium. Here's my code:
library(rvest)
library(RSelenium)

# open the browser
rD <- rsDriver(browser = c("chrome"), chromever = "73.0.3683.68")
remDr <- rD[["client"]]

# create a blank list to put the links in
urlslist_final <- list()

# loop through the page number at the end until done with ~1000 / 20 = 50 pages
for (i in 1:2) {  # change this to 50
  url <- paste0("https://www.npr.org/search?query=kavanaugh&page=", i)
  # navigate to it
  remDr$navigate(url)
  # get the links
  webElems <- remDr$findElements(using = "css", "[href]")
  urlslist_final[[i]] <- unlist(sapply(webElems, function(x) x$getElementAttribute("href")))
  # don't go too fast
  Sys.sleep(runif(1, 1, 5))
}  # close the loop

remDr$close()
# stop the selenium server
rD[["server"]]$stop()
If I set i = 1 and click over to the browser window after the page has been navigated to, then I get the desired result of 166 links, including the specific search-result links I'm trying to scrape:
> str(urlslist_final)
List of 1
$ : chr [1:166] "https://media.npr.org/templates/favicon/favicon-180x180.png" "https://media.npr.org/templates/favicon/favicon-96x96.png" "https://media.npr.org/templates/favicon/favicon-32x32.png" "https://media.npr.org/templates/favicon/favicon-16x16.png" ...
However, if I just run my loop, I get only 91 results, and none of them are the actual results from the search:
> str(urlslist_final)
List of 2
$ : chr [1:91] "https://media.npr.org/templates/favicon/favicon-180x180.png" "https://media.npr.org/templates/favicon/favicon-96x96.png" "https://media.npr.org/templates/favicon/favicon-32x32.png" "https://media.npr.org/templates/favicon/favicon-16x16.png" ...
Can anyone help me understand why the difference here? What can I do differently? I tried using just rvest, but I couldn't get it to find the result links embedded in the page's script.

Thanks to my friend Thom, here's a good solution:
#scroll on the page
webscroll <- remDr$findElement("css", "body")
webscroll$sendKeysToElement(list(key = "end"))
I put that code between navigating to the page and capturing the links; scrolling made the site behave as if I were browsing it normally, so the search results loaded and I could scrape the links.
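For reference, here is roughly how that scroll step slots into the original loop (a sketch only, reusing the same selectors and sleep call as above; the key = "end" keypress is what forces the lazy-loaded results to render):
for (i in 1:2) {
  url <- paste0("https://www.npr.org/search?query=kavanaugh&page=", i)
  remDr$navigate(url)
  # scroll to the bottom so the search results actually render
  webscroll <- remDr$findElement("css", "body")
  webscroll$sendKeysToElement(list(key = "end"))
  # give the results a moment to load before grabbing the links
  Sys.sleep(runif(1, 1, 5))
  webElems <- remDr$findElements(using = "css", "[href]")
  urlslist_final[[i]] <- unlist(sapply(webElems, function(x) x$getElementAttribute("href")))
}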

Related

How do I create a clickable webElement from an elementID in a shadow DOM using RSelenium?

I'm trying to download data from this GIS site, but some of the button elements I need are in a shadow DOM.
My strategy was to run a JS script that uses shadowRoot.querySelector to get the elementId of the button I need to click. I thought I could then use RSelenium::webElement to create the button element, but I get an error when I try to click it.
pacman::p_load(RSelenium, glue, dplyr, rvest)
driver <- rsDriver(browser = c("chrome"), chromever = "90.0.4430.24")
chrome <- driver$client
wisc_url <- "https://data.dhsgis.wi.gov/datasets/wi-dhs::covid-19-historical-data-by-county/about"
chrome$navigate(url = wisc_url)
Sys.sleep(5)
# open side panel
wisc_dl_panel_button <- chrome$findElement("css selector", "#main-region > div.content-hero > div.content-hero-footer > div.content-footer.flex-row > div.content-footer-right > div.yielded > button")
wisc_dl_panel_button$clickElement()
# sometimes it needs time to create the file
Sys.sleep(120)
# get elementId from shadow DOM
wisc_dlopts_elt_id <- chrome$executeScript("return document.querySelector('hub-download-card').shadowRoot.querySelector('calcite-card').querySelector('calcite-dropdown')")[[1]]
wisc_dlopts_elt <- webElement(elementId = wisc_dlopts_elt_id,
                              browserName = "chrome",
                              port = 4567)
wisc_dlopts_elt$clickElement()
## Error in checkError(res) :
## Undefined error in httr call. httr output: length(url) == 1 is not TRUE
I don't think the problem is the particular elementId I'm using. I tried this create-a-webElement approach with other elements that could be clicked via the usual findElement >> clickElement route, and I still get the same error.
BTW, I can solve this particular problem with a JS script, but I don't know JS that well, so I'd rather have an RSelenium solution that I can apply more generally if different use cases pop up in the future.
Use find_elements from the shadowr package:
library(shadowr)
remDr <- RSelenium::remoteDriver(
  remoteServerAddr = "host.docker.internal",
  port = 4445, browser = "chrome")
remDr$open(silent = TRUE)
wisc_url <- "https://data.dhsgis.wi.gov/datasets/wi-dhs::covid-19-vaccination-data-by-county/explore?location=44.718554%2C-89.846850%2C7.23"
remDr$navigate(url = wisc_url)
Sys.sleep(5)
shadow_rd <- shadow(remDr)
wisc_dl_panel_button <- find_elements(shadow_rd, 'button[aria-describedby*="tooltip"]')
wisc_dl_panel_button[[3]]$clickElement()
wisc_dl_panel_button <- find_elements(shadow_rd, 'calcite-button')
wisc_dl_panel_button[[1]]$clickElement()
For details and examples, please check out my GitHub page https://github.com/ricilandolt/shadowr and https://www.shadowr.org/ (disclosure: I'm a contributor to this package).
Did you try using findElements?
If you have the CSS selector or XPath, you can try something like this:
frames <- remDr$findElements("css selector", "your_css")
then
frames[[1]]$clickElement()
That's what I usually do.
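Since the asker mentions a JS workaround, here is a minimal sketch of that route (an alternative, not the shadowr approach above): skip constructing a webElement altogether and click the shadow-DOM node directly from executeScript, reusing the same selector chain as in the question:
# click the dropdown inside the shadow DOM via JavaScript (sketch)
chrome$executeScript(
  "document.querySelector('hub-download-card')
     .shadowRoot.querySelector('calcite-card')
     .querySelector('calcite-dropdown')
     .click();"
)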

Using purrr::map to loop through web pages for scraping with RSelenium

I have a basic R script which I have cobbled together using RSelenium. It allows me to log into a website; once authenticated, the script goes to the first page of interest and pulls 3 pieces of text from the page.
Luckily for me, the URL is constructed in such a way that I can substitute a sequence of numbers into it to reach each subsequent page of interest, hence the use of map().
While on each page I want to scrape the same 3 elements and store them in a master data frame for later analysis.
I wish to use the map family of functions so that I can become more familiar with them, but I am really struggling to get this to work. Could anyone kindly tell me where I am going wrong?
Here is the main part of my code (go to the website and log in):
library(RSelenium)
# https://stackoverflow.com/questions/55201226/session-not-created-this-version-of-chromedriver-only-supports-chrome-version-7/56173984
rd <- rsDriver(browser = "chrome",
               chromever = "88.0.4324.27",
               port = netstat::free_port())
remdr <- rd[["client"]]
# url of the site's login page
url <- "https://www.myWebsite.com/"
# Navigating to the page
remdr$navigate(url)
# Wait 5 secs for the page to load
Sys.sleep(5)
# Find the initial login button to bring up the username and password fields
loginbutton <- remdr$findElement(using = 'css selector','.plain')
# Click the initial login button to bring up the username and password fields
loginbutton$clickElement()
# Find the username box
username <- remdr$findElement(using = 'css selector','#username')
# Find the password box
password <- remdr$findElement(using = 'css selector','#password')
# Find the final login button
login <- remdr$findElement(using = 'css selector','#btnLoginSubmit1')
# Input the username
username$sendKeysToElement(list("myUsername"))
# Input the password
password$sendKeysToElement(list("myPassword"))
# Click login
login$clickElement()
And hey presto we're in!
Now my code takes me to the initial page of interest (index = 1)
Above I mentioned that I am looking to increment through each page, and I can do this by substituting an integer into the URL at the rcId parameter, see below:
#remdr$navigate("https://myWebsite.com/rc_redesign/#/layout/jcard/drugCard?accountId=XXXXXX&rcId=1&searchType=R&reimbCode=&searchTerm=&searchTexts=*") # Navigating to the page
For each rcId in 1:9999 I wish to grab the following 3 elements and store them in a data frame
hcpcs_info <- remdr$findElement(using = 'class','is-jcard-heading')
hcpcs <- hcpcs_info$getElementText()[[1]]
hcpcs_description <- remdr$findElement(using = 'class','is-jcard-desc')
hcpcs_desc <- hcpcs_description$getElementText()[[1]]
tc_info <- remdr$findElement(using = 'css selector','.col-12.ng-star-inserted')
therapeutic_class <- tc_info$getElementText()[[1]]
I have tried creating a separate function and passing it to map, but I am not advanced enough to piece this together. Below is what I have tried:
my_function <- function(index) {
remdr$navigate(sprintf("https://rc2.reimbursementcodes.com/rc_redesign/#/layout/jcard/drugCard?accountId=113479&rcId=%d&searchType=R&reimbCode=*&searchTerm=*&searchTexts=*",index)
Sys.sleep(5)
hcpcs_info[index] <- remdr$findElement(using = 'class','is-jcard-heading')
hcpcs[index] <- hcpcs_info$getElementText()[index][[1]])
}
x <- 1:10 %>%
map(~ my_function(.x))
Any help would be greatly appreciated
Try the following:
library(RSelenium)
library(tibble)
purrr::map_df(1:10, ~ {
  remdr$navigate(sprintf("https://rc2.reimbursementcodes.com/rc_redesign/#/layout/jcard/drugCard?accountId=113479&rcId=%d&searchType=R&reimbCode=*&searchTerm=*&searchTexts=*", .x))
  Sys.sleep(5)
  hcpcs_info <- remdr$findElement(using = 'class', 'is-jcard-heading')
  hcpcs <- hcpcs_info$getElementText()[[1]]
  hcpcs_description <- remdr$findElement(using = 'class', 'is-jcard-desc')
  hcpcs_desc <- hcpcs_description$getElementText()[[1]]
  tc_info <- remdr$findElement(using = 'css selector', '.col-12.ng-star-inserted')
  therapeutic_class <- tc_info$getElementText()[[1]]
  tibble(hcpcs, hcpcs_desc, therapeutic_class)
}) -> result
result
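If some rcId pages fail to load, findElement() will error and abort the whole map_df call. One way to guard against that (the scrape_one wrapper below is illustrative, not part of the answer above) is purrr::possibly(), which returns a fallback value instead of stopping on error:
library(purrr)
library(tibble)

scrape_one <- function(id) {
  remdr$navigate(sprintf("https://rc2.reimbursementcodes.com/rc_redesign/#/layout/jcard/drugCard?accountId=113479&rcId=%d&searchType=R&reimbCode=*&searchTerm=*&searchTexts=*", id))
  Sys.sleep(5)
  tibble(
    hcpcs             = remdr$findElement(using = 'class', 'is-jcard-heading')$getElementText()[[1]],
    hcpcs_desc        = remdr$findElement(using = 'class', 'is-jcard-desc')$getElementText()[[1]],
    therapeutic_class = remdr$findElement(using = 'css selector', '.col-12.ng-star-inserted')$getElementText()[[1]]
  )
}

# pages that fail simply contribute an empty tibble instead of aborting the loop
result <- map_df(1:10, possibly(scrape_one, otherwise = tibble()))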

Scrape a paginated table in R

Apologies in advance if this is very basic, but I'm lost on this!
I want to scrape the following table in R:
http://dgsp.cns.gob.mx/Transparencia/wConsultasGeneral.aspx
However, this page is rendered with, I believe, JavaScript. I tried RSelenium, but I am not having success scraping the 17 pages of this table.
Could you give me a hint about how to scrape the entire content of this table?
Given it's just 17 pages, I would manually click through the pages and save the HTML source. It would take no more than 3-5 minutes this way.
However, if you want to do it programmatically, we can start by writing a function that takes a page number, finds the link for that page, clicks on the link, and returns the HTML source for that page:
get_html <- function(i) {
  webElem <- remDr$findElement(using = "link text", as.character(i))
  webElem$clickElement()
  Sys.sleep(s)
  remDr$getPageSource()[[1]]
}
Initialize some values:
s <- 2 # seconds to wait between each page
total_pages <- 17
html_pages <- vector("list", total_pages)
Start the browser, navigate to page 1, and save the source:
library(RSelenium)
rD <- rsDriver()
remDr <- rD[["client"]]
base_url <- "http://dgsp.cns.gob.mx/Transparencia/wConsultasGeneral.aspx"
remDr$navigate(base_url)
src <- remDr$getPageSource()[[1]]
html_pages[1] <- src
For pages 2 to 17, we use a for loop and call the function we wrote above, taking special care to handle page 11 (where the pager link text is "..." rather than the page number):
for (i in 2:total_pages) {
  if (i == 11) {
    webElem <- remDr$findElement(using = "link text", "...")
    webElem$clickElement()
    Sys.sleep(s)
    html_pages[i] <- remDr$getPageSource()[[1]]
  } else {
    html_pages[i] <- get_html(i)
  }
}
remDr$close()
The result is html_pages, a list of length 17, with each element the HTML source for each page. How to parse the data from HTML into some other form (e.g. a dataframe) is probably a separate question by itself.
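As a starting point, here is a minimal parsing sketch (an assumption: the data sits in an ordinary <table> element that rvest::html_table() can read, which should be verified against the saved source):
library(rvest)
library(dplyr)

# parse each saved page and stack the resulting data frames
tables <- lapply(html_pages, function(src) {
  read_html(src) %>%
    html_element("table") %>%   # assumption: the first <table> holds the data
    html_table()
})
all_rows <- bind_rows(tables)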

RSelenium: Scraping a dynamically loaded page that loads slowly

I'm not sure if it is because my internet is slow, but I'm trying to scrape a website that loads information as you scroll down the page. I'm executing a script that goes to the end of the page and waits for the Selenium/Chrome server to load the additional content. The server does update and load the new content, because I am able to scrape information that wasn't on the page originally, and the new content shows up in the Chrome window, but it only updates once. I set a Sys.sleep() call to wait for a minute each time so the content would have plenty of time to load, but it still doesn't update more than once. Am I using RSelenium incorrectly? Are there other ways of scraping a site that loads dynamically?
Anyway, any kind of advice or help you can provide would be awesome.
Below is what I think is the relevant portion of my code with regards to loading the new content at the end of the page:
for (i in 1:3) {
  webElem <- remDr$findElement('css', 'body')
  remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);')
  Sys.sleep(60)
}
Below is the full code:
library(RSelenium)
library(rvest)
library(stringr)
rsDriver(port = 4444L, browser = 'chrome')
remDr <- remoteDriver(browser = 'chrome')
remDr$open()
remDr$navigate('http://www.codewars.com/kata')
#find the total number of recorded katas
tot_kata <- remDr$findElement(using = 'css', '.is-gray-text')$getElementText() %>%
  unlist() %>%
  str_extract('\\d+') %>%
  as.numeric()
#there are about 30 katas per page reload
tot_pages <- (tot_kata/30) %>%
  ceiling()
#will be 1:tot_pages once I know the below code works
for (i in 1:3) {
  webElem <- remDr$findElement('css', 'body')
  remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);')
  Sys.sleep(60)
}
page_source <- remDr$getPageSource()
kata_vector <- read_html(page_source[[1]]) %>%
  html_nodes('.item-title a') %>%
  html_attr('href') %>%
  str_replace('/kata/', '')
remDr$close()
The website provides an API, which should be the first port of call. Failing that, you can access individual pages using, for example:
http://www.codewars.com/kata?page=21
If you want to scroll to the bottom of the page until there is no more content with RSelenium, you can use the "Loading..." element, which has class js-infinite-marker. While this element is still on the page, we attempt to scroll down to it every second (with some error catching for any issues). If the element is not present, we assume all content is loaded:
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = 'chrome')
remDr <- rD$client # You don't need to use the open method
remDr$navigate('http://www.codewars.com/kata')
chk <- FALSE
while (!chk) {
  webElem <- remDr$findElements("css", ".js-infinite-marker")
  if (length(webElem) > 0L) {
    tryCatch(
      remDr$executeScript("elem = arguments[0];
                           elem.scrollIntoView();
                           return true;", list(webElem[[1]])),
      error = function(e) {}
    )
    Sys.sleep(1L)
  } else {
    chk <- TRUE
  }
}
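As a follow-up to the answer's first suggestion, here is a rough sketch of the non-Selenium route via the ?page=N URLs (this assumes those pages return server-rendered markup that the question's .item-title a selector can pick up, which may not hold in practice):
library(rvest)
library(stringr)

# fetch one listing page and extract the kata slugs from the link hrefs
get_kata_page <- function(n) {
  read_html(paste0("http://www.codewars.com/kata?page=", n)) %>%
    html_nodes(".item-title a") %>%
    html_attr("href") %>%
    str_replace("/kata/", "")
}

# first three pages, roughly 30 katas each
kata_vector <- unlist(lapply(1:3, get_kata_page))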

Web-scraping in R: how to scrape the data behind "+More" buttons?

Suppose I want to get information about Amenities from this webpage (https://www.airbnb.com/rooms/6676364). It works OK only for the visible part.
But how do I extract the rest, hidden behind the "+More" button?
I tried the node from the page source with the help of xpathSApply, but it just returns "+ More".
Do you know a solution to this problem?
My RSelenium approach:
url <- "https://www.airbnb.com/rooms/12344760"
library('RSelenium')
pJS <- phantom()
library('XML')
shell.exec(paste0("C:\\Users\\Daniil\\Desktop\\R-language,Python\\file.bat"))
Sys.sleep(10)
checkForServer()
startServer()
remDr <- remoteDriver(browserName="chrome", port=4444)
remDr$open(silent=T)
remDr$navigate(url)
var <- remDr$findElement('id','details') ### extracting all table###
vartxt <- var$getElementAttribute("outerHTML")[[1]]
varxml <- htmlParse(vartxt, useInternalNodes=T)
Amenities <- xpathSApply(varxml, "//div[@class = 'expandable-content expandable-content-full']", xmlValue)
This also doesn't work.
After you navigate the RSelenium driver to the target URL, use the following XPath to find the <a> element whose inner text equals '+ More' within the amenities <div>:
remDr$navigate(url)
link <- remDr$findElement(using = 'xpath', "//div[@class='row amenities']//a[.='+ More']")
Then click the link to get the complete list of amenities:
link$clickElement()
Lastly, pass the current page's HTML source to whatever R function you want to use for further processing:
doc <- htmlParse(remDr$getPageSource()[[1]])
....
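For instance, one possible continuation (a sketch only, reusing the expandable-content class from the question's XPath attempt; the exact class names on the live page may differ):
# extract the text of the now-expanded amenities sections
amenities <- xpathSApply(doc,
                         "//div[contains(@class, 'expandable-content')]",
                         xmlValue)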
