I put together a crude scraper that scrapes prices/airlines from Expedia:
# Start the Server
rD <- rsDriver(browser = "phantomjs", verbose = FALSE)
# Assign the client
remDr <- rD$client
# Establish a wait for an element
remDr$setImplicitWaitTimeout(1000)
# Navigate to Expedia.com
appurl <- "https://www.expedia.com/Flights-Search?flight-type=on&starDate=04/30/2017&mode=search&trip=oneway&leg1=from:Denver,+Colorado,to:Oslo,+Norway,departure:04/30/2017TANYT&passengers=children:0,adults:1"
remDr$navigate(appurl)
# Give a crawl delay to see if it gives time to load web page
Sys.sleep(10) # Been testing with 10
###ADD JAVASCRIPT INJECTION HERE###
remDr$executeScript(?)
# Extract Prices
webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(prices)
# Extract Airlines
webElem <- remDr$findElements(using = "css", "[data-test-id='airline-name']")
airlines <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(airlines)
# close client/server
remDr$close()
rD$server$stop()
As you can see, I built in an implicit wait timeout and a Sys.sleep() call so that the page has time to load in PhantomJS and so I don't overload the website with requests.
Generally speaking, when looping over a date range, the scraper works well. However, when looping through 10+ dates consecutively, Selenium sometimes throws a StaleElementReference error and stops execution. The reason is that the page hasn't finished loading, so the class='dollars price-emphasis' elements don't exist yet. The URL construction is fine.
Whenever the page loads completely, the scraper finds close to 60 prices and flights. I mention this because there are times when the script returns only 15-20 entries (when I check the same date manually in a browser, there are 60). In those cases I'm finding only 20 of the 60 elements, meaning the page has only partially loaded.
I want to make this script more robust by injecting JavaScript that waits for the page to fully load before looking for elements. I know the way to do this is remDr$executeScript(), and I have found many useful JS snippets for accomplishing this, but because my JS knowledge is limited, I'm having problems adapting these solutions so they work syntactically with my script.
Here are several solutions that have been proposed from Wait for page load in Selenium & Selenium - How to wait until page is completely loaded:
Base Code:
remDr$executeScript(
WebDriverWait wait = new WebDriverWait(driver, 20);
By addItem = By.cssSelector("class=dollars price-emphasis");, args = list()
)
Additions to base script:
1) Check for Staleness of an Element
# get the "Add Item" element
WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(addItem));
# wait the element "Add Item" to become stale
wait.until(ExpectedConditions.stalenessOf(element));
2) Wait for Visibility of element
wait.until(ExpectedConditions.visibilityOfElementLocated(addItem));
I have tried using remDr$executeScript("return document.readyState").equals("complete") as a check before proceeding with the scrape, but the page always reports complete, even when it isn't.
Does anyone have any suggestions about how I could adapt one of these solutions to work with my R script? Any ideas on how I could wait until the page has fully loaded, with all ~60 elements present? I'm still learning, so any help would be greatly appreciated.
Solution using while/tryCatch:
remDr$navigate("<webpage url>")
webElem <-NULL
while(is.null(webElem)){
webElem <- tryCatch({remDr$findElement(using = 'name', value = "<value>")},
error = function(e){NULL})
#loop until element with name <value> is found in <webpage url>
}
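A hedged variation on the same idea: cap the number of attempts so the loop cannot hang forever if the element never appears (the cap and the pause are arbitrary):
webElem <- NULL
attempts <- 0
while (is.null(webElem) && attempts < 30) {
  webElem <- tryCatch({remDr$findElement(using = 'name', value = "<value>")},
                      error = function(e){NULL})
  attempts <- attempts + 1
  Sys.sleep(0.5)  # brief pause between retries
}
if (is.null(webElem)) warning("element never appeared within the retry limit")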
To tack on a bit more convenience to Victor's great answer: a common element on virtually all pages is body, which can be accessed via CSS. I also made it a function and added a quick random sleep (always good practice). This should work on most web pages with text, without you needing to specify an element:
## use the double arrow to assign remDr to the global environment permanently
# remDr <<- remDr
wetest <- function(sleepmin, sleepmax){
  remDr <- get("remDr", envir = globalenv())
  webElemtest <- NULL
  while (is.null(webElemtest)) {
    webElemtest <- tryCatch({remDr$findElement(using = 'css', "body")},
                            error = function(e){NULL})
    # loop until the body element is found on the current page
  }
  randsleep <- sample(seq(sleepmin, sleepmax, by = 0.001), 1)
  Sys.sleep(randsleep)
}
Usage:
remDr$navigate("https://bbc.com/news")
clickable <- remDr$findElements(using = 'xpath', '//button[contains(@href, "")]')
clickable[[1]]$clickElement()
wetest(sleepmin=.5,sleepmax=1)
Related
I am trying to webscrape information from this website: https://www.nea.gov.sg/weather/rain-areas and download the 240km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of the end points, then save the images to a zip file.
Edit: Is there a way to do it with just rvest and avoiding the usage of for loops?
I've found that the image address can be acquired by clicking on the image and selecting "copy image address". An example: https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png
I've noted that the string of numbers represents the date and time, so the ones I need would be 20220731xxxxxxx, where x encodes the time. However, how would I then use this to webscrape?
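For illustration, assuming the filename pattern from the example address holds (YYYYMMDDHHMM followed by 0000dBR.dpsri.png), the candidate URLs for that window could be built in base R like this (a sketch only, since the exact naming convention is an assumption):
times  <- seq(from = as.POSIXct("2022-07-31 01:00:00", tz = "Asia/Singapore"),
              to   = as.POSIXct("2022-07-31 03:00:00", tz = "Asia/Singapore"),
              by   = "5 min")
stamps <- format(times, "%Y%m%d%H%M")
img_urls <- paste0("https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_",
                   stamps, "0000dBR.dpsri.png")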
Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.
You can consider the following code to save the screenshots of the webpage:
library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Elem <- remDr$findElement("xpath", '//*[@id="rain-area-slider"]/div/button')
web_Elem$clickElement()
for(i in 1:10) {
  print(i)
  Sys.sleep(1)
  path_To_File <- paste0("C:/file", i, ".png")
  remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}
Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.
1. Load packages
We will begin by loading our packages of interest:
# Load packages ----
pacman::p_load(
  httr,
  png,
  purrr,
  RSelenium,
  rvest,
  servr
)
2. Setup
Now, we need to start the Selenium server with firefox. The following code will start a firefox instance. Run it and wait for firefox to launch:
# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())
Now that the firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:
cl <- rsd$client
The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:
# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")
Let's get scraping
Now we're going to begin the actual scraping! @EmmanuelHamel took the clever approach of simply clicking on the "play" button to launch the automatic "slideshow", then taking a screenshot of the webpage every second to capture the changes in the image. The approach I use is somewhat different.
In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and click on each "step" one after the other. After clicking on a step, I get the URL of the image, then I click on the next step... all the way to the 13th step.
Here I get the HTML element for each step:
# Get the selector for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]
Then, I click on each element and get the image URL at each step. After you run this code, check how your code manipulates the webpage on the firefox instance, isn't that cool?
img_urls <- map_chr(rail_steps, function(step){
  cl$mouseMoveToLocation(webElement = step)
  cl$click()
  img_el <- cl$findElement(using = "css", value = "#rain_overlay")
  Sys.sleep(1)
  img_url <- img_el$getElementAttribute(attrName = "src")[[1]]
  img_url
})
Finally, I create an image folder img where I download and save the images:
# Create an image folder then download all images in it ----
dir.create("img")
walk(img_urls, function(img_url){
  GET(url = img_url) |>
    content() |>
    writePNG(target = paste0("img/", basename(img_url)))
})
Important
The downloaded images do not contain the background map from the webpage, only the rain points. You can download the background map and then lay the points on top of it (using image-processing software, for example). Here is how to download the background map:
# Download the background map ----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
  content() |>
  writePNG(target = "base_image.png")
If you want to combine the images programmatically, you may want to look into the magick package in R.
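As a rough sketch of that idea (using the example overlay downloaded above; magick's image_read(), image_composite() and image_write() do the work, and the resize step is just a precaution in case the dimensions differ):
library(magick)
base    <- image_read("base_image.png")
overlay <- image_read("img/dpsri_240km_2022091920000000dBR.dpsri.png")  # one of the downloaded overlays
overlay <- image_resize(overlay, paste0(image_info(base)$width, "x", image_info(base)$height))
image_composite(base, overlay, operator = "over") |>
  image_write("combined.png")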
I am web-scraping a website to collect data for research purposes using RSelenium, Docker and rvest.
I've built a script that automatically 'clicks' through the pages whose content I want to download. My problem is that the results change each time I run the script: the number of observations of the variable I'm interested in changes. It concerns about 50,000 observations, and when running the script several times, the total number of observations differs by a few hundred.
I'm thinking it has something to do with the internet connection being too slow or with the website not being able to load quickly enough... or something. When I change Sys.sleep(2) the results change too, but with no clear pattern as to whether higher values make it better or worse.
In the R terminal I run:
docker run -d -p 4445:4444 selenium/standalone-chrome
Then my code looks something like this:
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("url of website")
pages <- 100 # for example, I want information from the first hundred pages
variable <- vector("list", pages)
i <- 1
while (i <= pages) {
  variable[[i]] <- remDr$getPageSource()[[1]] %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("node that indicates the information I want") %>% # select the information I want
    html_text()
  element_next_page <- remDr$findElement(using = 'css selector', "node that indicates the 'next page' button") # select the button that goes to the next page
  element_next_page$sendKeysToElement(list(key = "enter")) # go to the next page
  Sys.sleep(2) # I believe this is done to not overload the website I'm scraping
  i <- i + 1
}
variable <- unlist(variable)
Somehow, running this multiple times keeps returning a different number of observations once I unlist variable.
Does anyone have the same experience, and tips on what to do?
Thanks.
You could consider including the following code before extracting the text:
for(i in 1:100) {
  print(i)
  remDr$executeScript(paste0("scroll(0, ", i * 2000, ")"))
}
This code forces the browser to scroll through "almost everywhere in the web page", which can help the page load sections that are not loaded at first. This approach is used in the following post: How to webscrape texts that are contained into sublinks of a link in R?.
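If scrolling alone is not enough, a hedged alternative to the fixed Sys.sleep(2) is a small helper that polls the page source until the number of target nodes stops changing between checks (the selector and timings below are placeholders; it assumes the same packages as in the question are loaded):
wait_for_stable_count <- function(remDr, css, max_checks = 15) {
  last_n <- -1
  for (check in 1:max_checks) {
    n <- remDr$getPageSource()[[1]] %>%
      read_html(encoding = "UTF-8") %>%
      html_nodes(css) %>%
      length()
    if (n > 0 && n == last_n) break  # the count stopped growing, assume the page is loaded
    last_n <- n
    Sys.sleep(1)
  }
  invisible(last_n)
}
# e.g. call wait_for_stable_count(remDr, "node that indicates the information I want")
# right after sending the 'enter' keypress that moves to the next page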
The BCOGC keeps a database of applications for drilling wells in northeast British Columbia. By default, some filters are active to only highlight approved applications within the last month, even though the application database holds 30K+ records:
When the filter is deactivated:
To download the entire data set, remove or deactivate any filters, click on Actions > Download > CSV. I want to download the entire data set (containing 30K+ records) automatically using R.
When I use
library(tidyverse)
df <- read_csv(
  file = 'https://reports.bcogc.ca/ogc/f?p=200:21::CSV::::'
)
it only downloads whatever the default query specifies, so around 150 records, not 30K+.
How can I use R to download the entire data set automatically? Is this a task for httr or RSelenium?
OK, I'm going to go with Selenium then, since it doesn't necessarily require Docker (though the example I'm using is with Docker :-). Pretty sure I could get Splash/splashr to do this as well, but it involves a file download and I think there are issues with that and the Splash back-end. As the splashr author, I also avoid having to deal with GitHub issues if I use Selenium for this example ;-)
Anyway, you should install RSelenium. I can't really provide support for that, but it's well documented and the rOpenSci folks are super helpful. I'd highly suggest getting Docker to run on your system or getting your department to set up a Selenium server you all can use.
There are a couple gotchas for this use-case:
Some element names we need to instrument are dynamically generated so we have to work around that
This involves downloading a CSV file so we need to map a filesystem path in Docker so it downloads properly
This is a super slow site so you need to figure out wait times after each interaction (I'm not going to do that since you may be on a slower or faster network and network speed does play a part here, tho not that much)
I'd suggest working through the vignettes for RSelenium before trying the below to get a feel for how it works. You're essentially coding up human page interactions.
You will need to start Docker with a mapped directory. See download file with Rselenium & docker toolbox for all the info but here's how I did it on my macOS box:
docker run -d -v /Users/hrbrmstr/Downloads://home/seluser/Downloads -p 4445:4444 selenium/standalone-firefox:2.53.1
That makes Selenium accessible on port 4445, uses Firefox (b/c Chrome is evil) and maps my local downloads directory to the Firefox default dir for the selenium user in the Docker container. That means well_authorizations_issued.csv is going to go there (eventually).
Now, we need to crank up R and connect it to this Selenium instance. We need to create a custom Firefox profile since we're saving stuff to disk and we don't want the browser to prompt us for anything:
library(RSelenium)
makeFirefoxProfile(
  list(
    browser.download.dir = "/home/seluser/Downloads",
    browser.download.folderList = 2L,
    browser.download.manager.showWhenStarting = FALSE,
    browser.helperApps.neverAsk.saveToDisk = "text/csv"
  )
) -> ffox_prof
remoteDriver(
  browserName = "firefox", port = 4445L,
  extraCapabilities = ffox_prof
) -> remDr
invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
You will need to uncomment the Sys.sleep()s and experiment with various "wait time" values between calls. Some will be short (1-2s), others will be larger (20s, 30s, or higher).
I'm not displaying the output of the screenshots here but those are one way to figure out timings (i.e. keep generating screen shots after an element interaction until gray spinner boxes are gone — etc — and keep a mental note of how many seconds that was).
Now, the one tricky bit noted above is figuring out where the checkbox is to turn off the filter, since it has a dynamic id. However, we aren't actually going to click on the checkbox b/c the daft fools who created that app have no idea what they are doing and actually have the click event trapped by the span element that surrounds it, so we have to find the li element that contains the checkbox label text, then go to the span element and click on it.
box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
^^ definitely needs a delay (you likely saw it spin a while in-person when clicking yourself so you can count that and add in some buffer seconds).
Then, we click on the drop-down "menu" (it's really a button):
btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
Then the download "menu" item (it's really a button):
btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
^^ also really needs a delay, as the Download "dialog" takes a few seconds to come up (it did for me at least).
Now, find the CSV box which is really an a tag:
lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
That last bit is something you'll have to experiment with. It takes a while to process the request and then transfer the ~9MB file. The call to remDr$screenshot() actually waits for the download to complete, so you can remove the display and decoding code, assign the output to a variable, and use that as an automatic "wait"-er.
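If you prefer something more explicit than the screenshot trick, a hedged alternative is to poll the local side of the Docker volume mapping until the CSV appears and stops growing (the path here is the local directory mapped with -v above; adjust to yours):
csv_path <- "/Users/hrbrmstr/Downloads/well_authorizations_issued.csv"  # local side of the -v mapping
while (!file.exists(csv_path)) Sys.sleep(2)  # wait for the download to start
repeat {
  size_before <- file.size(csv_path)
  Sys.sleep(2)
  if (identical(size_before, file.size(csv_path))) break  # size stable, download finished
}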
I tried this 3x on 2 different macOS systems and it worked fine. YMMV.
I'm guessing you'll want to automate this eventually, so you could have a system() call toward the top of the script that starts the Selenium Docker container, then do the rest of the bits, and then issue another system() call to shut down the Docker container.
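A rough sketch of that automation, assuming a container name of selenium_ff (the name and the mapped path are placeholders):
system("docker run -d --name selenium_ff -v /Users/you/Downloads://home/seluser/Downloads -p 4445:4444 selenium/standalone-firefox:2.53.1")
Sys.sleep(5)  # give the container a moment to start accepting connections
# ... all of the RSelenium steps above ...
system("docker stop selenium_ff")
system("docker rm selenium_ff")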
Alternately, https://github.com/richfitz/stevedore is now on CRAN so it is a pure R interface to starting/stopping Docker containers (amongst many other things) so you could use that instead of system() calls.
If you can't use Docker, you need to install a "webdriver" executable for Firefox on your Windows box and also get the Selenium Java archive, ensure you have Java installed and then do the various manual incantations to get that going (which is beyond the scope of this answer).
Here's a shortened, contiguous version of the above:
library(RSelenium)
# start Selenium before doing this
makeFirefoxProfile(
  list(
    browser.download.dir = "/home/seluser/Downloads",
    browser.download.folderList = 2L,
    browser.download.manager.showWhenStarting = FALSE,
    browser.helperApps.neverAsk.saveToDisk = "text/csv"
  )
) -> ffox_prof
remoteDriver(
  browserName = "firefox", port = 4445L,
  extraCapabilities = ffox_prof
) -> remDr
invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)
box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)
btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)
btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)
lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
done <- remDr$screenshot()
# stop Selenium
I am struggling to web-scrape data from a table which spans several pages. The pages are linked via JavaScript.
The data I am interested in is based on the website's search function:
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
I am able to download the first page with the rvest package:
library(rvest)
library(tidyverse)
NI <- read_html(url)
NI.res <- NI %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
NI.res <- NI.res[[1]][c(1:10), c(1:5)]
So far so good.
As far as I understand, the RSelenium package is the way forward for navigating websites with JavaScript, i.e. when HTML scraping via changing URLs is not possible. I installed the package and ran it in combination with Docker Toolbox (all working fine):
library (RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-chrome')
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
My hope was that by triggering the JavaScript I could navigate to the next page, repeat the rvest command, and obtain the data contained on the 2nd, 3rd, etc. page (eventually this should be part of a loop or a purrr::map function).
Navigate to the table with search results (1st page):
remDr$navigate("http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/1989&td=01/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0")
Trigger the JavaScript. The content of the JavaScript call is taken from hovering with the mouse over the index of pages on the website (below the table). In the case below, the JavaScript leading to page 2 is triggered:
remDr$executeScript("__doPostBack('ctl00$MainContentPlaceHolder$SearchResultsGridView','Page$2');", args=list("dummy"))
Repeat the scraping with rvest:
url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0"
NI <- read_html(url)
NI.res <- NI %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
NI.res2 <- NI.res[[1]][c(1:10), c(1:5)]
Unfortunately though, triggering the JavaScript appears not to work: the scraped results are again those from page 1 and not from page 2. I might be missing something rather basic here, but I can't figure out what.
My attempt is partly informed by SO posts here, here and here. I also saw this post.
Context: Eventually, in further steps, I will have to trigger a click on each single finding/row that shows up on all pages and also scrape the information behind each entry. Hence, as far as I understand, RSelenium will be the main tool here.
Grateful for any hint!
UPDATE
I made 'some' progress following the approach suggested here. It's a) still not doing everything I intend to do and b) very likely not the most elegant way to do it. But maybe it's of some help to others / opens up a way forward. Note that this approach does not require RSelenium.
I basically created a loop over the JavaScript calls (page index), each of which leads to another page of the table I want to scrape. The crucial detail is the __EVENTARGUMENT argument, to which I assign the respective page number (my knowledge of JS is basically zero).
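For context, the loop below relies on a few objects I created beforehand but did not show; a plausible sketch of that setup (assuming the html_session()/html_form() approach from the linked post) would be:
library(rvest)
pgsession <- html_session(url)           # url is the search-results URL used above
pgform    <- html_form(pgsession)[[1]]   # the ASP.NET form carrying __VIEWSTATE etc.
page.list <- list()                      # container filled by the loop below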
for (i in 2:15) {
  target <- paste0("Page$", i)
  page <- rvest:::request_POST(pgsession, "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1&ba=0",
    body = list(
      `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
      `__EVENTTARGET` = "ctl00$MainContentPlaceHolder$SearchResultsGridView",
      `__EVENTARGUMENT` = target,
      `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
      `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
      `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value
    ),
    encode = "form")
  x <- read_html(page) %>%
    html_nodes(css = "#ctl00_MainContentPlaceHolder_SearchResultsGridView") %>%
    html_table(fill = TRUE) %>%
    as.data.frame()
  d <- x[1:(nrow(x) - 2), 1:5]  # drop the pager rows at the bottom of the table
  page.list[[i]] <- d
}
However, this code is not able to trigger the JavaScript for, i.e. go to, the pages which are not visible in the page index below the table when the site is first opened (pages 1-11). Only pages 2 to 11 can be scraped with this loop; since the scripts for page 12 and onwards are not visible, they can't be triggered.
I would like to use RSelenium to download files (by clicking on the Excel image) from this website: http://highereducationstatistics.education.gov.au/. However, before downloading the file, a series of drag-and-drop actions (see this image: http://highereducationstatistics.education.gov.au/images/preDragDimension.png) has to be performed so the right dataset can be chosen (see http://highereducationstatistics.education.gov.au/GettingStarted.aspx for instructions).
I am wondering whether RSelenium has this type of drag-and-drop functionality. I have searched the whole day and guess that mouseMoveToLocation combined with other functions like buttondown might be the answer, but I have no idea how to use them. Can anyone help with this?
Thanks very much.
First navigate with RSelenium to the page using:
library(RSelenium)
rD <- rsDriver() # runs a chrome browser, wait for necessary files to download
remDr <- rD$client
remDr$navigate("http://highereducationstatistics.education.gov.au/")
Then locate the element you want to drag to the chart/grid. In this example I will be selecting the Course Level from the left menu.
webElem1 <- remDr$findElement('xpath', "//span[@title = 'Course Level']")
Select the element where you want to drop this menu item. In this case the element has an id = "olapClientGridXrowHeader":
webElem2 <- remDr$findElement(using = 'id', "olapClientGridXrowHeader")
Once both items are selected, drag the first one into the second one like this:
remDr$mouseMoveToLocation(webElement = webElem1)
remDr$buttondown()
remDr$mouseMoveToLocation(webElement = webElem2)
remDr$buttonup()
Notice that these methods work on the remote driver, not the elements.
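If you end up doing several of these drags, a small convenience wrapper keeps the four calls together (just a hedged repackaging of the snippet above):
drag_and_drop <- function(remDr, from_elem, to_elem) {
  remDr$mouseMoveToLocation(webElement = from_elem)  # hover over the source element
  remDr$buttondown()                                 # press and hold the left mouse button
  remDr$mouseMoveToLocation(webElement = to_elem)    # move to the drop target
  remDr$buttonup()                                   # release to drop
}
drag_and_drop(remDr, webElem1, webElem2)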