Using RSelenium findElement function in an if else statement - web-scraping

I have been trying to use RSelenium and cannot figure out how to write an 'if' 'else' statement around the findElement function. Can anyone help with the basic problem below? This is just a test to understand a piece of my overall code. Basically, I want to check whether a page contains a term and, if it does, make a calculation. Many thanks!
# Start Selenium Server --------------------------------------------------------
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
# Simulate browser session
remDrv$navigate('http://agcensus.dacnet.nic.in/districtsummarytype.aspx')
# Test if the page has the words 'District Tables' then make a calculation
if (remDrv$findElement(using = "xpath", "//*[contains(text(), 'District Tables')]"))
{2+2}
else
{4+4}
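Note that findElement() does not return TRUE or FALSE; it returns a webElement if the node exists and throws an error otherwise, so it cannot be used directly as an if condition. Below is a minimal sketch of one way to express the check, wrapping the call in tryCatch (the same pattern used in the answers further down):
# NULL if the text is not found, a webElement otherwise
elem <- tryCatch(
  remDrv$findElement(using = "xpath", "//*[contains(text(), 'District Tables')]"),
  error = function(e) NULL
)
if (!is.null(elem)) {
  2 + 2
} else {
  4 + 4
}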

Related

What is the difference between rvest::html_text and RSelenium::getPageSource?

I'm scraping a number of webpages, and I noticed the difference between the results that rvest (read_html, then html_text) provides and those that RSelenium (getPageSource()) provides.
More specifically, when dropdown menus are involved, html_text only gives you the names of the choices, while with RSelenium you can also get the URL of the page that you will be directed to once you choose one.
My questions here are: (1) why the difference, and what exactly is the nature of the difference? and (2) is there a way to get the same source text extraction as the RSelenium one, but using something faster such as the rvest package?
I have tried using webdriver, a PhantomJS implementation, per the suggestion from rvest vs RSelenium results for text extracting, and its getSource function does provide the same results as RSelenium. However, while this is faster than RSelenium, it is still much slower than rvest.
library(rvest)
library(RSelenium)
library(webdriver)
library(tictoc)
library(robotstxt)
test_url <- "https://www.bea.gov"
robotstxt::paths_allowed(test_url)
# rvest
tictoc::tic()
resultA <- html_text(read_html(test_url))
tictoc::toc()
# RSelenium
tictoc::tic()
# assumes a Selenium server is already running on port 4445 (e.g. via Docker)
remDr <- remoteDriver(port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(test_url)
resultB <- remDr$getPageSource()[[1]]
tictoc::toc()
# webdriver
tictoc::tic()
pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)
ses$go(test_url)
resultC <- ses$getSource()
tictoc::toc()
You can see that resultA is different from resultB and resultC. More specifically, my focus is the part from the word "Tools" onwards, which is where the dropdown menu for choosing the different tabs under "Tools" on this website appears.
Showing just a small chunk, the output for "BEARFACTS" in rvest is:
BEARFACTS\n \n \n
while in RSelenium it is something like the following:
<li class=\"expanded dropdown\">\n BEARFACTS\n
The difference between RSelenium and rvest is:
RSelenium runs a real web browser, so it will load any JavaScript contained in the webpage (JavaScript is often used to load additional HTML elements or data after the initial HTML has loaded).
rvest does not run JavaScript, and therefore retrieves the page HTML faster, but will miss any elements loaded with JavaScript after the initial page load.
Some useful tips:
When scraping a page that doesn't load content via JavaScript, use rvest.
When you must use RSelenium, try using a headless option to improve speed (it will load the page in a browser just like normal, but it won't display any of the graphical elements, so it will be faster).
Example of using RSelenium headless
eCaps <- list(chromeOptions = list(
  args = c('--headless', '--disable-gpu', '--window-size=1280,800')
))
rD <- rsDriver(browser = "chrome", verbose = TRUE, chromever = "78.0.3904.105",
               port = 4447L, extraCapabilities = eCaps)
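A brief usage sketch of what typically follows (nothing here is specific to headless mode; rsDriver returns both the server and a client):
remDr <- rD$client
remDr$navigate("https://www.bea.gov")
# ... find elements / get the page source as usual ...
remDr$close()
rD$server$stop()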

How can I find attributes in RSelenium

I would like to find all "a href" attributes of a particular webpage. There are examples of how to do it with XML, rvest or splashr, but I would like to do it with RSelenium, and not by finding the elements first and then calling getElementAttribute(..., "href").
I'm looking for something similar to read_html, html_nodes and html_attr from rvest, or render_html from splashr, which later works with rvest.
EDIT: Ideally something similar to render_html, which lets all scripts finish.
Start of code:
library(RSelenium)
rd <- rsDriver(verbose = F)
remotedriver <- rd$client
remotedriver$navigate('https://stackoverflow.com/')
Now I get stuck on how to find all URLs that points somewhere. I have tried
library(XML)
html_parse <- htmlParse(remotedriver$getPageSource()[[1]])
But I don't know which functions can be used to work with the html_parse object.
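One option is to query the parsed source directly with XPath, which returns every href in a single call without touching getElementAttribute(). A hedged sketch using XML::xpathSApply (rvest's read_html / html_nodes / html_attr on remotedriver$getPageSource()[[1]] would work the same way):
library(XML)
# parse the browser-rendered source (the browser has already run the scripts)
html_parse <- htmlParse(remotedriver$getPageSource()[[1]])
# every href attribute of every <a> node, as a character vector
hrefs <- xpathSApply(html_parse, "//a/@href")
head(hrefs)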

R - Waiting for page to load in RSelenium with PhantomJS

I put together a crude scraper that scrapes prices/airlines from Expedia:
# Start the Server
rD <- rsDriver(browser = "phantomjs", verbose = FALSE)
# Assign the client
remDr <- rD$client
# Establish a wait for an element
remDr$setImplicitWaitTimeout(1000)
# Navigate to Expedia.com
appURL <- "https://www.expedia.com/Flights-Search?flight-type=on&starDate=04/30/2017&mode=search&trip=oneway&leg1=from:Denver,+Colorado,to:Oslo,+Norway,departure:04/30/2017TANYT&passengers=children:0,adults:1"
remDr$navigate(appURL)
# Give a crawl delay to see if it gives time to load web page
Sys.sleep(10) # Been testing with 10
###ADD JAVASCRIPT INJECTION HERE###
remDr$executeScript(?)
# Extract Prices
webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(prices)
# Extract Airlines
webElem <- remDr$findElements(using = "css", "[data-test-id='airline-name']")
airlines <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(airlines)
# close client/server
remDr$close()
rD$server$stop()
As you can see, I built in an implicit wait timeout and a Sys.sleep call so that the page has time to load in PhantomJS and so that I don't overload the website with requests.
Generally speaking, when looping over a date range, the scraper works well. However, when looping through 10+ dates consecutively, Selenium sometimes throws a StaleElementReference error and stops execution. I know the reason for this is that the page has yet to load completely and class='dollars price-emphasis' doesn't exist yet. The URL construction is fine.
Whenever the page successfully loads all the way, the scraper gets nearly 60 prices and flights. I mention this because there are times when the script returns only 15-20 entries (when checking the same date manually in a browser, there are 60). From this I conclude that I'm only finding 20 of 60 elements, meaning the page has only partially loaded.
I want to make this script more robust by injecting JavaScript that waits for the page to fully load prior to looking for elements. I know the way to do this is remDr$executeScript(), and I have found many useful JS snippets for accomplishing this, but due to limited knowledge in JS, I'm having problems adapting these solutions to work syntactically with my script.
Here are several solutions that have been proposed from Wait for page load in Selenium & Selenium - How to wait until page is completely loaded:
Base Code:
remDr$executeScript(
WebDriverWait wait = new WebDriverWait(driver, 20);
By addItem = By.cssSelector("class=dollars price-emphasis");, args = list()
)
Additions to base script:
1) Check for Staleness of an Element
# get the "Add Item" element
WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(addItem));
# wait the element "Add Item" to become stale
wait.until(ExpectedConditions.stalenessOf(element));
2) Wait for Visibility of element
wait.until(ExpectedConditions.visibilityOfElementLocated(addItem));
I have tried to use
remDr$executeScript("return document.readyState").equals("complete") as a check before proceeding with the scrape, but the page always shows as complete, even if it's not.
Does anyone have any suggestions about how I could adapt one of these solutions to work with my R script? Any ideas on how I could wait for the page to load entirely, with nearly 60 found elements? I'm still learning, so any help would be greatly appreciated.
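As an aside, the readyState check mentioned above can be written in plain R rather than the Java-style .equals() form (a minimal sketch; as observed, it only reports on the initial document load, not on content rendered afterwards by JavaScript):
# poll until the browser reports the initial document as loaded
while (remDr$executeScript("return document.readyState;")[[1]] != "complete") {
  Sys.sleep(0.5)
}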
Solution using while/tryCatch:
remDr$navigate("<webpage url>")
webElem <-NULL
while(is.null(webElem)){
webElem <- tryCatch({remDr$findElement(using = 'name', value = "<value>")},
error = function(e){NULL})
#loop until element with name <value> is found in <webpage url>
}
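Applied to the question above, the same pattern would wait for the first price element before extracting anything (a sketch; the selector and appURL are the ones from the original script):
remDr$navigate(appURL)
webElem <- NULL
while (is.null(webElem)) {
  # keep retrying until at least one price element is present
  webElem <- tryCatch(
    remDr$findElement(using = "css", "[class='dollars price-emphasis']"),
    error = function(e) NULL
  )
  Sys.sleep(0.5)  # small pause between retries to avoid hammering the page
}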
To tack on a bit more convenience to Victor's great answer: a common element on almost every page is body, which can be accessed via CSS. I also made it a function and added a quick random sleep (always good practice). This should work on most web pages with text, without you needing to specify the element:
## use double arrow to assign to the global environment permanently
# remDr <<- remDr
wetest <- function(sleepmin, sleepmax){
  remDr <- get("remDr", envir = globalenv())
  webElemtest <- NULL
  while (is.null(webElemtest)) {
    webElemtest <- tryCatch({remDr$findElement(using = 'css', "body")},
                            error = function(e){NULL})
    # loop until the body element is found on the current page
  }
  randsleep <- sample(seq(sleepmin, sleepmax, by = 0.001), 1)
  Sys.sleep(randsleep)
}
Usage:
remDr$navigate("https://bbc.com/news")
clickable <- remDr$findElements(using = 'xpath', '//button[contains(@href,"")]')
clickable[[1]]$clickElement()
wetest(sleepmin=.5,sleepmax=1)

Error in findElement function in RSelenium

I am trying to run this code:
library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
url <- "http://www.magicbricks.com/property-for-rent/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Service-Apartment,Residential-House,Villa&cityName=Mumbai"
remDr$open()
remDr$navigate(url)
webElem1 <- remDr$findElement("name", ">5 BHK")
webElem2 <- remDr$findElement("css", "#refinebedrooms li:nth-child(6)")
webElem3 <- remDr$findElement("css", "#viewMoreButton a")
But I keep getting the following error:
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
What does this mean? And how can I overcome it? I am new to R and a first-time user of RSelenium, so any kind of help would be much appreciated. TIA
Firstly, if you are new I would strongly recommend going over the RSelenium help file and then trying the package.
The element with name >5 BHK does not exist, and that is why you are getting an error. Note that webElem2 points to the same element that webElem1 was meant to find (had it worked).
So to answer your question, you have to identify where the error occurs, and the error is pretty self-explanatory: NoSuchElement.
That means one of your three web elements (1, 2, 3) cannot be seen on the page by the webdriver. If you want to identify elements using CSS, and assuming you are new to HTML too, I would suggest using SelectorGadget to find an element's CSS selector or XPath.
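One way to identify which of the three lookups fails is to wrap each in tryCatch, the same pattern used in the answers above (a sketch; the locators are the ones from the question):
locators <- list(
  webElem1 = list(using = "name", value = ">5 BHK"),
  webElem2 = list(using = "css", value = "#refinebedrooms li:nth-child(6)"),
  webElem3 = list(using = "css", value = "#viewMoreButton a")
)
found <- sapply(locators, function(loc) {
  !is.null(tryCatch(remDr$findElement(using = loc$using, value = loc$value),
                    error = function(e) NULL))
})
found  # FALSE marks the locator that cannot be found on the page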

How do I click a javascript "link"? Is it my xpath or my relenium/selenium usage?

As with the beginning of any problem before I post it on Stack Overflow, I think I have tried everything. This is a learning experience for me in how to work with JavaScript and XML, so I'm guessing my problem is there.
My question is: how do I get the results of clicking on the parcel number links, which are JavaScript links? I've tried getting the XPath of the link and using the $click method, which followed my intuition, but this wasn't right or is at least not working for me.
Firefox 26.0
R 3.0.2
require(relenium)
library(XML)
library(stringr)
initializing_parcel_number <- "00000000000"
firefox <- firefoxClass$new()
firefox$get("http://www.muni.org/pw/public.html")
inputElement <- firefox$findElementByXPath("/html/body/form[2]/table/tbody/tr[2]/td/table[1]/tbody/tr[3]/td[4]/input[1]")
inputElement$sendKeys(initializing_parcel_number)
inputElement$sendKeys(key = "ENTER")
##xpath to the first link. Or is it?
first_link <- "/html/body/table/tbody/tr[2]/td/table[5]/tbody/tr[2]/td[1]/a"
##How I'm trying to click the thing.
linkElement <- firefox$findElementByXPath(first_link)
linkElement$click()
You can do this using RSelenium. See http://johndharrison.github.io/RSelenium/ . DISCLAIMER: I am the author of the RSelenium package. A basic vignette on operation can be viewed at RSelenium basics and RSelenium: Testing Shiny apps.
If you are unsure of what element is selected, you can use the highlightElement utility method in the webElement class; see the commented-out code.
The element click event won't work in this case. You need to simulate a click using JavaScript:
require(RSelenium)
# RSelenium::startServer # if needed
initializing_parcel_number <- "00000000000"
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.muni.org/pw/public.html")
webElem <- remDr$findElement(using = "name", "PAR1")
# webElem$highlightElement() # to visually check which element is selected
webElem$sendKeysToElement(list(initializing_parcel_number, key = "enter"))
# get first link containing javascript:getParcel
webElem <- remDr$findElement(using = "css selector", '[href*="javascript:getParcel"]')
# webElem$highlightElement() # to visually check which element is selected
# send a webElement as an argument.
remDr$executeScript("arguments[0].click();", list(webElem))
