I would like to use RSelenium to download files (by clicking on the Excel image) from this website http://highereducationstatistics.education.gov.au/. However, before downloading the file, a series of drag-and-drop actions (see this image http://highereducationstatistics.education.gov.au/images/preDragDimension.png) has to be performed so the right dataset can be chosen (see http://highereducationstatistics.education.gov.au/GettingStarted.aspx for instructions).
I am wondering whether RSelenium has this kind of drag-and-drop functionality. I have searched all day and my guess is that mouseMoveToLocation combined with other functions like buttondown might be the answer, but I have no idea how to use them. Can anyone help with this?
Thanks very much.
First navigate with RSelenium to the page using:
library(RSelenium)
rD <- rsDriver() # runs a chrome browser, wait for necessary files to download
remDr <- rD$client
remDr$navigate("http://highereducationstatistics.education.gov.au/")
Then locate the element you want to drag to the chart/grid. In this example I will be selecting the Course Level from the left menu.
webElem1 <- remDr$findElement('xpath', "//span[@title = 'Course Level']")
Select the element where you want to drop this menu item. In this case the element has an id = "olapClientGridXrowHeader":
webElem2 <- remDr$findElement(using = 'id', "olapClientGridXrowHeader")
Once both items are selected, drag the first one into the second one like this:
remDr$mouseMoveToLocation(webElement = webElem1)
remDr$buttondown()
remDr$mouseMoveToLocation(webElement = webElem2)
remDr$buttonup()
Notice that these methods work on the remote driver, not the elements.
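If you need to drag several dimensions, the four calls above can be wrapped into a small helper. A minimal sketch, using only the methods already shown (the function name drag_and_drop is mine, not part of RSelenium):
# Hypothetical helper: drag one element onto another via the remote
# driver's low-level mouse methods (a sketch, not an RSelenium function).
drag_and_drop <- function(remDr, from, to) {
  remDr$mouseMoveToLocation(webElement = from)  # hover over the source element
  remDr$buttondown()                            # press and hold the left button
  remDr$mouseMoveToLocation(webElement = to)    # move to the drop target
  remDr$buttonup()                              # release to drop
}
# Usage: drag the "Course Level" dimension onto the row header
drag_and_drop(remDr, webElem1, webElem2)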
I am trying to webscrape information from this website: https://www.nea.gov.sg/weather/rain-areas and download the 240km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of end points. Save the images to a zip file.
Edit: Is there a way to do it with just rvest, avoiding the use of for loops?
I've found out that the image address can be acquired by right-clicking the image and selecting "copy image address". An example: https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png
I've noted that the string of numbers represents the date and time, so the ones I'd need would be 20220731xxxxxxx where x is the time. However, how would I then use this to web-scrape?
Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.
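One Selenium-free idea, assuming the URL pattern above (.../dpsri_240km_<YYYYMMDDHHMM>0000dBR.dpsri.png) holds and that a scan exists at every five-minute mark (not verified for that date), is to build the URLs directly and download them. A rough sketch:
# Sketch: build the image URLs from the timestamp pattern and download them.
# Assumes the URL format above is stable and every 5-minute scan exists.
library(purrr)
times <- seq(
  from = as.POSIXct("2022-07-31 01:00:00", tz = "Asia/Singapore"),
  to   = as.POSIXct("2022-07-31 03:00:00", tz = "Asia/Singapore"),
  by   = "5 min"
)
img_urls <- paste0(
  "https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_",
  format(times, "%Y%m%d%H%M"), "0000dBR.dpsri.png"
)
dir.create("radar", showWarnings = FALSE)
walk(img_urls, function(u) {
  download.file(u, file.path("radar", basename(u)), mode = "wb")
})
zip("radar_scans.zip", list.files("radar", full.names = TRUE))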
You can consider the following code to save screenshots of the webpage:
library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Elem <- remDr$findElement("xpath", '//*[#id="rain-area-slider"]/div/button')
web_Elem$clickElement()
for (i in 1:10) {
  print(i)
  Sys.sleep(1)
  path_To_File <- paste0("C:/file", i, ".png")
  remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}
Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.
1. Load packages
We will begin by loading our packages of interest:
# Load packages ----
pacman::p_load(
httr,
png,
purrr,
RSelenium,
rvest,
servr
)
2. Setup
Now, we need to start the Selenium server with firefox. The following code will start a firefox instance. Run it and wait for firefox to launch:
# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())
Now that the firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:
cl <- rsd$client
The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:
# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")
Let's get scraping
Now we're going to begin the actual scraping! @EmmanuelHamel took the clever approach of simply clicking on the "play" button in order to launch the automatic "slideshow". He then took a screenshot of the webpage every second in order to capture the changes in the image. The approach I use is somewhat different.
In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and I click on each "step" one after the other. After clicking on a step, I get the URL of the image, then I click on the other step... all the way to the 13th step.
Here I get the HTML element for each step:
# Get the selector for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]
Then, I click on each element and get the image URL at each step. After you run this code, check how your code manipulates the webpage on the firefox instance, isn't that cool?
img_urls <- map_chr(rail_steps, function(step){
  cl$mouseMoveToLocation(webElement = step)
  cl$click()
  img_el <- cl$findElement(using = "css", value = "#rain_overlay")
  Sys.sleep(1)
  img_url <- img_el$getElementAttribute(attrName = "src")[[1]]
  img_url
})
Finally, I create an image folder img where I download and save the images:
# Create an image folder then download all images in it ----
dir.create("img")
walk(img_urls, function(img_url){
GET(url = img_url) |>
content() |>
writePNG(target = paste0("img/", basename(img_url)))
})
Important
The downloaded images do not contain the background map from the webpage... only the points! You can download the background map and then lay the points on top of it (using image-processing software, for example). Here is how to download the background map:
# Download the background map----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
content() |>
writePNG(target = "base_image.png")
If you want to combine the images programmatically, you may want to look into the magick package in R.
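For example, here is a rough sketch of overlaying one downloaded frame on the base map with magick (the file names assume the img/ folder and base_image.png created above; that the two images line up is an assumption you should check):
# Sketch: composite one radar frame over the background map with magick.
library(magick)
base  <- image_read("base_image.png")
frame <- image_read(list.files("img", full.names = TRUE)[1])
# Scale the overlay to the base map's width (assumes matching aspect ratio)
frame <- image_scale(frame, as.character(image_info(base)$width))
combined <- image_composite(base, frame, operator = "over")
image_write(combined, "combined_frame.png")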
I'm trying to write code for web scraping in R when you have to provide inputs.
Specifically, I have a platform where I need to complete two fields, then click submit and get the results.
But I don't know how to use my R columns as inputs on the platform.
I searched for an example but I didn't find any.
Please, could anyone give me a simple example?
Thank you
EDIT:
I don't have any code yet. I was looking for an example where you use an input to complete a field on a site and then scrape the result.
The photo shows the fields on my URL. In R I have a dataframe with two columns, one for CNP/CUI and one for VIN/SASIU, with 100 rows or more. I want to use these columns as inputs and capture the output for every row.
EDIT2:
The example provided by @Dominik S.Meier worked for me when I had a list of inputs. For column inputs I will post another question.
But until then I want to mention a few things that helped me; maybe they will help somebody else.
You need to be sure that all the versions match: R version, browser version, browser driver version, Java version. For me the chromedriver version didn't match, even though I had downloaded the right one. The problem was that I had three Chrome versions installed and I think the wrong driver was being picked. I fixed it with: rD <- rsDriver(browser = c("chrome"), port = 4444L, chromever = "83.0.4103.39").
Because one element didn't have an id to use as in webElem <- remDr$findElement(using = "id", "trimite"), I used a CSS selector. You can find the CSS selector with right click -> Copy -> Copy selector (in the HTML code on the page).
If you don't get results, you may not be using the right selector. That happened to me and the result was list(). Then I tried selectors from further up in the HTML code. I don't know if it is the right solution, but it worked for me.
Hope it will help. Thank you.
Using RSelenium (see the RSelenium documentation for more info):
library(RSelenium)
rD <- rsDriver(browser = c("firefox")) #specify browser type you want Selenium to open
remDr <- rD$client
remDr$navigate("https://pro.rarom.ro/istoric_vehicul/dosar_vehicul.aspx") # navigates to webpage
# select first input field
option <- remDr$findElement(using='id', value="inputEmail")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("email@email.com"))
# select second input field
option <- remDr$findElement(using='id', value="inputEmail2")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("email@email.com"))
# select third input field
option <- remDr$findElement(using='id', value="inputVIN")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("123"))
# press the submit button
webElem <- remDr$findElement(using = "id", "trimite")
webElem$highlightElement()
webElem$clickElement()
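To run a whole data frame of inputs through the same form (as the question asks), the steps above can be repeated per row. A rough sketch, with the caveat that the column names (cnp, vin), which field id maps to CNP/CUI, and reading the result from the page body are all my assumptions, not confirmed details of the site:
# Sketch: loop over a data frame of inputs, reusing the steps above.
# Column names, field mapping, and the result selector are assumptions.
inputs <- data.frame(
  cnp = c("111", "222"),
  vin = c("ABC123", "DEF456"),
  stringsAsFactors = FALSE
)
results <- vector("list", nrow(inputs))
for (i in seq_len(nrow(inputs))) {
  remDr$navigate("https://pro.rarom.ro/istoric_vehicul/dosar_vehicul.aspx")
  Sys.sleep(2)  # give the page time to load
  remDr$findElement(using = "id", "inputEmail")$sendKeysToElement(list(inputs$cnp[i]))
  remDr$findElement(using = "id", "inputVIN")$sendKeysToElement(list(inputs$vin[i]))
  remDr$findElement(using = "id", "trimite")$clickElement()
  Sys.sleep(2)  # wait for the result to render
  results[[i]] <- remDr$findElement(using = "css", "body")$getElementText()[[1]]
}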
The BCOGC keeps a database of applications for drilling wells in northeast British Columbia. By default, some filters are active to only highlight approved applications within the last month, even though the application database holds 30K+ records:
When the filter is deactivated:
To download the entire data set, remove or deactivate any filters, then click on Actions > Download > CSV. I want to download the entire data set (containing 30K+ records) automatically using R.
When I use
library(tidyverse)
df <- read_csv(
file = 'https://reports.bcogc.ca/ogc/f?p=200:21::CSV::::'
)
it only downloads whatever the default query specifies, so around 150 records, not 30K+.
How can I use R to download the entire data set automatically? Is this a task for httr or RSelenium?
OK, I'm going to go with Selenium then since it doesn't necessarily require Docker (though the example I'm using is with Docker :-). I'm pretty sure I could get Splash/splashr to do this as well, but it involves a file download and I think there are issues with that and the Splash back-end. As the splashr author, I avoid having to deal with GitHub issues if I use Selenium for this example as well ;-)
Anyway, you should install RSelenium. I can't really provide support for that but it's well documented and the rOpenSci folks are super helpful. I'd highly suggest getting Docker to run on your system or getting your department to set up a Selenium server you all can use.
There are a couple gotchas for this use-case:
Some element names we need to instrument are dynamically generated so we have to work around that
This involves downloading a CSV file so we need to map a filesystem path in Docker so it downloads properly
This is a super slow site so you need to figure out wait times after each interaction (I'm not going to do that since you may be on a slower or faster network and network speed does play a part here, tho not that much)
I'd suggest working through the vignettes for RSelenium before trying the below to get a feel for how it works. You're essentially coding up human page interactions.
You will need to start Docker with a mapped directory. See download file with Rselenium & docker toolbox for all the info but here's how I did it on my macOS box:
docker run -d -v /Users/hrbrmstr/Downloads://home/seluser/Downloads -p 4445:4444 selenium/standalone-firefox:2.53.1
That makes Selenium accessible on port 4445, uses Firefox (b/c Chrome is evil) and maps my local downloads directory to the Firefox default dir for the selenium user in the Docker container. That means well_authorizations_issued.csv is going to go there (eventually).
Now, we need to crank up R and connect it to this Selenium instance. We need to create a custom Firefox profile since we're saving stuff to disk and we don't want the browser to prompt us for anything:
library(RSelenium)
makeFirefoxProfile(
list(
browser.download.dir = "home/seluser/Downloads",
browser.download.folderList = 2L,
browser.download.manager.showWhenStarting = FALSE,
browser.helperApps.neverAsk.saveToDisk = "text/csv"
)
) -> ffox_prof
remoteDriver(
browserName = "firefox", port = 4445L,
extraCapabilities = ffox_prof
) -> remDr
invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
You will need to uncomment the Sys.sleep()s and experiment with various "wait time" values between calls. Some will be short (1-2s), others will be larger (20s, 30s, or higher).
I'm not displaying the output of the screenshots here but they are one way to figure out timings (i.e. keep generating screenshots after an element interaction until the gray spinner boxes are gone, and keep a mental note of how many seconds that took).
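If you'd rather not guess fixed sleep values, one option is to poll for the element you expect next and only continue once it exists. A rough sketch (the helper name wait_for and the timeout values are mine, not from the original answer):
# Sketch: poll for an element instead of hard-coding a Sys.sleep() value.
wait_for <- function(remDr, using, value, timeout = 60, poll = 2) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    found <- tryCatch(
      remDr$findElement(using = using, value = value),
      error = function(e) NULL
    )
    if (!is.null(found)) return(found)
    Sys.sleep(poll)  # wait a bit before trying again
  }
  stop("Timed out waiting for element: ", value)
}
# e.g. block until the Actions button exists before clicking it
btn1 <- wait_for(remDr, "css", "button#WA_ISSUED_actions_button")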
Now, the one tricky bit noted above is figuring out where the checkbox to turn off the filter is, since it has a dynamic id. However, we aren't actually going to click on the checkbox b/c the folks who built the app have the click event trapped on the span element that surrounds it, so we have to find the li element that contains the checkbox label text, then go to that span element and click on it.
box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
^^ definitely needs a delay (you likely saw it spin a while in-person when clicking yourself so you can count that and add in some buffer seconds).
Then, we click on the drop-down "menu" (it's really a button):
btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
Then the download "menu" item (it's really a button):
btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
^^ also really needs a delay as the Download "dialog" takes a few seconds to come up (it did for me at least).
Now, find the CSV box which is really an a tag:
lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
That last bit is something you'll have to experiment with. It takes a while to process the request and then transfer the ~9MB file. The call to remDr$screenshot() actually waits for the download to complete, so you can remove the display and decoding code, assign the output to a variable, and use that as an automatic "wait"er.
I tried this 3x on 2 different macOS systems and it worked fine. YMMV.
I'm guessing you'll want to automate this eventually so you could have a system() call towards the top of the script that starts the Selenium Docker container, then does the rest of the bits and then issues another system() call to shut down the Docker container.
Alternately, https://github.com/richfitz/stevedore is now on CRAN so it is a pure R interface to starting/stopping Docker containers (amongst many other things) so you could use that instead of system() calls.
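For the system() route, a minimal sketch of what the wrapper could look like (the container name sel4r and the sleep are arbitrary placeholders):
# Sketch: start and stop the Selenium container around the scraping code.
system(paste(
  "docker run -d --name sel4r",
  "-v /Users/hrbrmstr/Downloads:/home/seluser/Downloads",
  "-p 4445:4444 selenium/standalone-firefox:2.53.1"
))
Sys.sleep(10)  # give the container a moment to come up
# ... the RSelenium code from above goes here ...
system("docker stop sel4r")
system("docker rm sel4r")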
If you can't use Docker, you need to install a "webdriver" executable for Firefox on your Windows box and also get the Selenium Java archive, ensure you have Java installed and then do the various manual incantations to get that going (which is beyond the scope of this answer).
Here's a shortened, contiguous version of the above:
library(RSelenium)
# start Selenium before doing this
makeFirefoxProfile(
list(
browser.download.dir = "home/seluser/Downloads",
browser.download.folderList = 2L,
browser.download.manager.showWhenStarting = FALSE,
browser.helperApps.neverAsk.saveToDisk = "text/csv"
)
) -> ffox_prof
remoteDriver(
browserName = "firefox", port = 4445L,
extraCapabilities = ffox_prof
) -> remDr
invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)
box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)
btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)
btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)
lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
done <- remDr$screenshot()
# stop Selenium
I put together a crude scraper that scrapes prices/airlines from Expedia:
# Load RSelenium and start the server
library(RSelenium)
rD <- rsDriver(browser = "phantomjs", verbose = FALSE)
# Assign the client
remDr <- rD$client
# Establish a wait for an element
remDr$setImplicitWaitTimeout(1000)
# Navigate to Expedia.com
appURL <- "https://www.expedia.com/Flights-Search?flight-type=on&starDate=04/30/2017&mode=search&trip=oneway&leg1=from:Denver,+Colorado,to:Oslo,+Norway,departure:04/30/2017TANYT&passengers=children:0,adults:1"
remDr$navigate(appURL)
# Give a crawl delay to see if it gives time to load web page
Sys.sleep(10) # Been testing with 10
###ADD JAVASCRIPT INJECTION HERE###
remDr$executeScript(?)
# Extract Prices
webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(prices)
# Extract Airlines
webElem <- remDr$findElements(using = "css", "[data-test-id='airline-name']")
airlines <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(airlines)
# close client/server
remDr$close()
rD$server$stop()
As you can see, I built in an ImplicitWaitTimeout and a Sys.sleep call so that the page has time to load in PhantomJS and so that I don't overload the website with requests.
Generally speaking, when looping over a date range, the scraper works well. However, when looping through 10+ dates consecutively, Selenium sometimes throws a StaleElementReference error and stops execution. I know the reason for this is that the page has yet to load completely and the class='dollars price-emphasis' element doesn't exist yet. The URL construction is fine.
Whenever the page successfully loads all the way, the scraper gets near 60 prices and flights. I'm mentioning this because there are times when the script returns only 15-20 entries (when checking this date normally on a browser, there are 60). Here, I conclude that I'm only finding 20 of 60 elements, meaning the page has only partially loaded.
I want to make this script more robust by injecting JavaScript that waits for the page to fully load prior to looking for elements. I know the way to do this is remDr$executeScript(), and I have found many useful JS snippets for accomplishing this, but due to limited knowledge in JS, I'm having problems adapting these solutions to work syntactically with my script.
Here are several solutions that have been proposed from Wait for page load in Selenium & Selenium - How to wait until page is completely loaded:
Base Code:
remDr$executeScript(
WebDriverWait wait = new WebDriverWait(driver, 20);
By addItem = By.cssSelector("class=dollars price-emphasis");, args = list()
)
Additions to base script:
1) Check for Staleness of an Element
# get the "Add Item" element
WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(addItem));
# wait the element "Add Item" to become stale
wait.until(ExpectedConditions.stalenessOf(element));
2) Wait for Visibility of element
wait.until(ExpectedConditions.visibilityOfElementLocated(addItem));
I have tried to use
remDr$executeScript("return document.readyState").equals("complete") as a check before proceeding with the scrape, but the page always shows as complete, even if it's not.
Does anyone have any suggestions about how I could adapt one of these solutions to work with my R script? Any ideas on how I could wait for the page to load entirely, with nearly 60 elements found? I'm still learning, so any help would be greatly appreciated.
Solution using while/tryCatch:
remDr$navigate("<webpage url>")
webElem <- NULL
while (is.null(webElem)) {
  webElem <- tryCatch({remDr$findElement(using = 'name', value = "<value>")},
                      error = function(e) {NULL})
  # loop until element with name <value> is found in <webpage url>
}
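The same idea can be extended to wait for a minimum number of matching elements, which is closer to the "nearly 60 results" situation in the question. A rough sketch (the threshold, timeout, and polling interval are arbitrary guesses):
# Sketch: poll until enough result elements are present (or time runs out),
# rather than trusting document.readyState. Threshold/timing are guesses.
wait_for_results <- function(remDr, css, min_n = 50, timeout = 60, poll = 2) {
  deadline <- Sys.time() + timeout
  repeat {
    found <- remDr$findElements(using = "css", css)
    if (length(found) >= min_n || Sys.time() > deadline) return(found)
    Sys.sleep(poll)
  }
}
webElem <- wait_for_results(remDr, "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x) x$getElementText()))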
To tack on a bit more convenience to Victor's great answer, a common element on tons of pages is body which can be accessed via css. I also made it a function and added a quick random sleep (always good practice). This should work without you needing to assign the element on most web pages with text:
## use double arrow to assign to the global environment permanently
# remDr <<- remDr
wetest <- function(sleepmin, sleepmax) {
  remDr <- get("remDr", envir = globalenv())
  webElemtest <- NULL
  while (is.null(webElemtest)) {
    webElemtest <- tryCatch({remDr$findElement(using = 'css', "body")},
                            error = function(e) {NULL})
    # loop until the body element is found on the current page
  }
  randsleep <- sample(seq(sleepmin, sleepmax, by = 0.001), 1)
  Sys.sleep(randsleep)
}
Usage:
remDr$navigate("https://bbc.com/news")
clickable <- remDr$findElements(using = 'xpath', '//button[contains(@href, "")]')
clickable[[1]]$clickElement()
wetest(sleepmin=.5,sleepmax=1)
I am trying to get some information about enterprises from the Internet. Most of the information is located in this page: http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul, the page looks like this:
In this page I have to click on the tab Busqueda de Companias and then the interesting side starts. When I click I get the next screen:
In this page I have to set the option Nombre and then I have to insert a string with a name. For example I will add the string PROAÑO & ASOCIADOS CIA. LTDA. and I will get the next screen:
Then, I have to click on Buscar and I will get the next screen:
In this screen I have the information for this enterprise. Then, I have to click on the tab Informacion Estados Financieros and I will get the next screen:
On this final screen I have to click on the tab Estado Situacion and I will get the enterprise's information in the columns Codigo de la cuenta contable, Nombre de la cuenta contable and Valor. I would like to get that information saved in a dataframe. Most of the complexity I found starts when I have to set the element Nombre, insert a string, click Buscar, and keep clicking until I reach the tab Informacion Estados Financieros. I have tried using html_session and html_form from the rvest package but the elements are empty.
Could you help me with some steps to solve this problem?
RSelenium Coded Example
Here is a self-contained code example, using the web-site referenced in the question.
Observation: Please do not run this code.
Why? Having 1k Stack users hit the web-site is a DDOS attack.
Prerequisites
The code below will install RSelenium, before running the code you need to:
Install Firefox
Add the Selenium IDE Plugin
https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
Install RStudio [Recommendation]
Create a project and open the code file below
The code below will take you from the second page [http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul] through to the final page where the information you are interested in is...
Useful References:
If you are interested in using RSelenium I strongly recommend you read the following references, thanks go to John Harrison for developing the RSelenium package.
RSelenium Basics
http://rpubs.com/johndharrison/12843
RSelenium Headless Browsing
http://rpubs.com/johndharrison/RSelenium-headless
RSelenium Vignette
https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html
Code Example
# We want to make this as easy as possible to use
# So we need to install required packages for the user...
#
if (!require(RSelenium)) install.packages("RSelenium")
if (!require(XML)) install.packages("XML")
if (!require(RJSONIO)) install.packages("RJSONIO")
if (!require(stringr)) install.packages("stringr")
# Data
#
mainPage <- "http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul"
businessPage <- "http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul"
# StartServer
# We assume RSelenium is not setup, so we check if the RSelenium
# server is available, if not we install RSelenium server.
checkForServer()
# OK. now we start the server
RSelenium::startServer()
remDr <- RSelenium::remoteDriver$new()
# We assume the user has installed Firefox and the Selenium IDE
# https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
#
# OK, we open Firefox
remDr$open(silent = T) # Open up a firefox window...
# Now we open the browser and required URL...
# This is the page that matters...
remDr$navigate(businessPage)
# First things first: on the first page, let's get the ids for the radio button,
# name element, and button. We need all three.
#
radioButton <- remDr$findElements(using = 'css selector', ".z-radio-cnt")
nameElement <- remDr$findElements(using = 'css selector', ".z-combobox-inp")
searchButton <- remDr$findElements(using = 'css selector', ".z-button-cm")
# Optional: we can highlight the radio elements returned
# lapply(radioButton, function(x){x$highlightElement()})
# Optional: we can highlight the nameElement returned
# lapply(nameElement, function(x){x$highlightElement()})
# Optional: we can highlight the searchButton returned
# lapply(searchButton, function(x){x$highlightElement()})
# Now we can select and press the third radio button
radioButton[[3]]$clickElement()
# We fill in the required name...
nameElement[[1]]$sendKeysToElement(list("PROAÑO & ASOCIADOS CIA. LTDA."))
# This is subtle but required: the page triggers a drop-down list, so rather than
# hitting the searchButton, we first select and click the entry in the drop-down menu...
selectElement <- remDr$findElements(using = 'css selector', ".z-comboitem-text")
selectElement[[1]]$clickElement()
# OK, now we can click the search button, which will cause the next page to open
searchButton[[1]]$clickElement()
# New Page opens...
#
# Ok, so now we first pull the list of buttons...
finPageButton <- remDr$findElements(using = 'class name', "m_iconos")
# Now we can press the required button to open the page we want to get too...
finPageButton[[9]]$clickElement()
# We are now on the required page.
We are now on the target page [see image].
Extracting the table values...
The next step is to extract the table values. To do this, we pull the .z-listitem css-selector data. Now we can check to confirm if we see the lines of data. We do, so we can now extract the values returned and populate either a list or Dataframe.
# Ok, now we need to extract the table, we identify and pull out the
# '.z-listitem' and assign to modalWindow
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
# Now we can extract the lines from modalWindow. Each row is returned as a
# single line of text, so we split it into three parts based on the
# line marker '\n'
lineText <- str_split(modalWindow[[1]]$getElementText()[1], '\n')
lineText
Here is the result:
> lineText <- stringr::str_split(modalWindow[[1]]$getElementText()[1], '\n')
> lineText
[[1]]
[1] "10"
[2] "OPERACIONES DE INGRESO CON PARTES RELACIONADAS EN PARAÍSOS FISCALES, JURISDICCIONES DE MENOR IMPOSICIÓN Y REGÍMENES FISCALES PREFERENTES"
[3] "0.00"
Dealing with Hidden Data.
The Selenium WebDriver and thus RSelenium only interact with visible elements of a web page. If we try to read the entire table, we will only return table items that are visible (unhidden).
We can get around this issue by scrolling to the bottom of the table: the scroll action forces the table to populate all of its rows, and we can then extract the complete table.
# Select the .z-listbox-body
modalWindow <- remDr$findElements(using = 'css selector', ".z-listbox-body")
# Now we tell the window we want to scroll to the bottom of the table
# This triggers the table to populate all the rows
modalWindow[[1]]$executeScript("window.scrollTo(0, document.body.scrollHeight)")
# Now we can extract the complete table
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
lineText <- stringr::str_split(modalWindow[[9]]$getElementText(), '\n')
lineText
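To finish with the dataframe the question asks for, the split rows can be stacked into three columns. A sketch, assuming every visible row splits cleanly into the three fields (account code, account name, value) and that the values are plain decimals:
# Sketch: turn the extracted rows into a data frame with the three columns
# from the question. Assumes each row splits into exactly three fields.
rows <- lapply(modalWindow, function(el) {
  stringr::str_split(el$getElementText()[[1]], '\n')[[1]]
})
rows <- Filter(function(x) length(x) == 3, rows)  # keep well-formed rows only
balance <- data.frame(
  codigo = sapply(rows, `[`, 1),
  nombre = sapply(rows, `[`, 2),
  valor  = as.numeric(sapply(rows, `[`, 3)),  # assumes plain decimal values
  stringsAsFactors = FALSE
)
head(balance)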
What the code does
The code example above is meant to be self-contained. By that I mean it should install everything you need, including the required packages. Once the dependent R packages are installed, the code calls checkForServer(); if the Selenium server is not installed, the call will install it. This may take some time.
My recommendation is that you step through the code, as I have not incorporated any delays (in production you would want to). Note also that I have not optimised for speed but rather for a modicum of clarity [from my perspective]...
The code was shown to work on:
Mac OS X 10.11.5
RStudio 0.99.893
R version 3.2.4 (2016-03-10) -- "Very Secure Dishes"
Check out RSelenium
First, install RSelenium and use the above linked vignette to get familiar with the basics
Then see this webinar on using RSelenium, which goes through some detailed scraping step-by-step and is quite easy to follow:
http://johndharrison.blogspot.hk/2014/05/orange-county-r-users-group-oc-rug.html