I am using RSelenium to download a number of .xls files. I was able to get a somewhat passable solution using the following code to set up the server, which specifies not to create a pop-up when I click on the download link and where to download the file to. However, without fail, once I download the 101st file (saved as "report (100).xls) the download pop-up begins appearing in the browser Selenium is driving.
eCaps <- list(
chromeOptions =
list(prefs = list(
"profile.default_content_settings.popups" = 0L,
"download.prompt_for_download" = FALSE,
"download.default_directory" = "mydownloadpath"
)
)
)
rd <- rsDriver(browser = "chrome", port=4566L, extraCapabilities = eCaps)
The function to download then looks like:
vote.downloading <- function(url){
#NB: this function assumes browser already up and running, options set correctly
Sys.sleep(1.5)
browser$navigate(url)
down_button <- browser$findElement(using="css",
"table:nth-child(4) tr:nth-child(3) a")
down_button$clickElement()
}
For reference, the sites I'm getting the download from look like this: http://www.moscow_city.vybory.izbirkom.ru/region/moscow_city?action=show&root=774001001&tvd=4774001137463&vrn=4774001137457&prver=0&pronetvd=null®ion=77&sub_region=77&type=427&vibid=4774001137463
The link being used for the download reads "Версия для печати" for those who don't know Russian.
I can't simply stop the function when the dialog begins popping up and pick up where I left off, because it's part of a larger function that scrapes links from drop-down menus that lead to the sites from the download link. This would also be extremely annoying, as there are 400+ files to download.
Is there some way I can alter the Chrome profile or my scraping function to prevent the system dialog from popping up every 101 files? Or is there a better way altogether to get these files downloaded?
No need for Selenium:
library(httr)
httr::GET(
url = "http://www.moscow_city.vybory.izbirkom.ru/servlet/ExcelReportVersion",
query = list(
region="77",
sub_region="77",
root="774001001",
global="null",
vrn="4774001137457",
tvd="4774001137463",
type="427",
vibid="4774001137463",
condition="",
action="show",
version="null",
prver="0",
sortorder="0"
),
write_disk("/tmp/report.xls"), ## CHANGE ME
verbose()
) -> res
I save it off to an object so you can run warn_for_status() or other such checks.
It shld be straightforward to wrap that in a function with parameters to make it more generic.
Related
I am trying to webscrape information from this website: https://www.nea.gov.sg/weather/rain-areas and download the 240km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of end points. Save the images to a zip file.
Edit: Is there a way to do it with just rvest and avoiding the usage of for loops?
I've fount out that the image address can be acquired by clicking on the image and selecting copy image address. An example :https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png
I've noted that the string of numbers would represent the date and time. So the one I'd need would be 20220731xxxxxxx where x would be the time. However, how would I then use this to webscrape?
Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.
You can consider the following code to save the screenshots of the webpage :
library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Elem <- remDr$findElement("xpath", '//*[#id="rain-area-slider"]/div/button')
web_Elem$clickElement()
for(i in 1 : 10)
{
print(i)
Sys.sleep(1)
path_To_File <- paste0("C:/file", i, ".png")
remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}
Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.
1. Load packages
We will begin by loading our packages of interest:
# Load packages ----
pacman::p_load(
httr,
png,
purrr,
RSelenium,
rvest,
servr
)
2. Setup
Now, we need to start the Selenium server with firefox. The following code will start a firefox instance. Run it and wait for firefox to launch:
# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())
Now that the firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:
cl <- rsd$client
The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:
# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")
Let's get scraping
Now we're going to begin the actual scraping! #EmmanuelHamel took the clever approach of simply clicking on the "play" button in order to launch the automatic "slideshow". He then took a screenshot of the webpage every second in order to capture the changes in the image. The approach I use is somewhat different.
In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and I click on each "step" one after the other. After clicking on a step, I get the URL of the image, then I click on the other step... all the way to the 13th step.
Here I get the HTML element for each step:
# Get the selector for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]
Then, I click on each element and get the image URL at each step. After you run this code, check how your code manipulates the webpage on the firefox instance, isn't that cool?
img_urls <- map_chr(rail_steps, function(step){
cl$mouseMoveToLocation(webElement = step)
cl$click()
img_el <- cl$findElement(using = "css", value = "#rain_overlay")
Sys.sleep(1)
imcg_url <-
img_el$getElementAttribute(attrName = "src")[[1]]
})
Finally, I create an image folder img where I download and save the images:
# Create an image folder then download all images in it ----
dir.create("img")
walk(img_urls, function(img_url){
GET(url = img_url) |>
content() |>
writePNG(target = paste0("img/", basename(img_url)))
})
Important
The downloaded images do not contain the background map on the webpage... only the points! You can download the background map then lay the points on top of it (using an image processing software for example). Here is how to download the background map:
# Download the background map----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
content() |>
writePNG(target = "base_image.png")
If you want to combine the images programmatically, you may want to look into the magick package in R.
I’m new to web scraping. I can do the very basic stuff of scraping pages using URLs and css selector tools with R. Now I have run into problems.
For hobby purposes I would like to be able to scrape the following URL:
https://matchpadel.halbooking.dk/newlook/proc_baner.asp (a time slot booking system for sports)
However, the URL does not change when I navigate to different dates or adresses (‘Område’).
I have read a couple of similar problems suggesting to inspect the webpage, look under ’Network’ and then ‘XHR’ or ‘JS’ to find the data source of the table and get information from there. I am able to do this, but to be honest, I have no idea what to do from there.
I would like to retrieve data on what time slots are available across dates and adresses (the ‘Område’ drop down on the webpage).
If anyone is willing to help me and my understanding, it would be greatly appreciated.
Have a nice day!
The website you have linked looks to be run on Javascript which changes dynamically. You need to extract your desired information using RSelenium library which opens a browser and then you need to choose your dropdown and get data.
Find the sample code here to fire up firefox to your website. From here you can write codes to select different types of ‘Område’ dropdown and get the following table info using remdr$getPageSource() and then using Rvest functions to extract the data
# load libraries
library(RSelenium)
# open browser
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
Sys.sleep(2)
shell(selCommand, wait = FALSE, minimized = TRUE)
Sys.sleep(2)
remdr <- remoteDriver(port = 4567L, browserName = "firefox")
Sys.sleep(10)
remdr$open()
remdr$navigate(url = 'https://matchpadel.halbooking.dk/newlook/proc_baner.asp')
EDIT: From the comments I received so far, I managed to use RSelenium to access the PDF files I am looking for, using the following code:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[#id='cmbGrupo']/option[#value='PDF|412']")
option$clickElement()
Now, I need R to click the download button, but I could not manage to do so. I tried:
button <- remote_driver$findElement(using = "xpath", "//*[#id='download']")
button$clickElement()
But I get the following error:
Selenium message:Unable to locate element: //*[#id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
Erro: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Can someone tell what is wrong here?
Thanks!
Original question:
I have several webpages from which I need to download embedded PDF files and I am looking for a way to automate it with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398
This is a webpage from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent to the US Securities and Exchange Commission - SEC) to download Notes to Financial Statements (Notas Explicativas) from Brazilian companies.
I tried several options but the website seems to be built in a way that makes it difficult to extract the direct links.
I tried what is suggested in here Downloading all PDFs from URL, but the html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector.
Similarly, when I tried the approach in here https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, the html_attr("href") generates an empty vector.
I am not used to web scraping codes in R, so I cannot figure out what is happening.
I appreciate any help!
If someone is facing the same problem I did, I am posting the solution I used:
# set Firefox profile to download PDFs automatically
pdfprof <- makeFirefoxProfile(list(
"pdfjs.disabled" = TRUE,
"plugin.scan.plid.all" = FALSE,
"plugin.scan.Acrobat" = "99.0",
"browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))
driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
option <- remote_driver$findElement(using = 'xpath', "//select[#id='cmbGrupo']/option[#value='PDF|412']") # select the option to open PDF file
option$clickElement()
# Find iframes in the webpage
web.elem <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem, function(x){x$getElementAttribute("id")}) # see their names
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # see their names
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[#id='download']")
button$clickElement() # download
Sys.sleep(3) # Need sometime to finish download and then close the window
remote_driver$close() # Close the window
The BCOGC keeps a database of applications for drilling wells in northeast British Columbia. By default, some filters are active to only highlight approved applications within the last month, even though the application database holds 30K+ records:
When the filter is deactivated:
To download the entire data set, remove or deactivate any filters, click on Actions > Download > CSV. I want to download the entire data set (containing 30K+ records) automatically using R.
When I use
library(tidyverse)
df <- read_csv(
file = 'https://reports.bcogc.ca/ogc/f?p=200:21::CSV::::'
)
it only downloads whatever the default query specifies, so around 150 records, not 30K+.
How can I use R to download the entire data set automatically? Is this a task for httr or RSelenium?
OK, I'm going to go with Selenium then since it doesn't necessarily require Docker (though the example I'm using is with Docker :-) Pretty sure I could get Splash/splashr to do this as well, but it involves a file download and I think there's issues with that and the Splash back-end. As the splashr author, I avoid having to deal with GitHub issues if I use Selenium for this example as well ;-)
Anyway, you should install RSelenium. I can't really provide support for that but it's well documented and the rOpenSci folks are super helpful. I'd highly suggest getting Docker to run on your system or getting your department to setup a Selenium server you all can use.
There are a couple gotchas for this use-case:
Some element names we need to instrument are dynamically generated so we have to work around that
This involves downloading a CSV file so we need to map a filesystem path in Docker so it downloads properly
This is a super slow site so you need to figure out wait times after each interaction (I'm not going to do that since you may be on a slower or faster network and network speed does play a part here, tho not that much)
I'd suggest working through the vignettes for RSelenium before trying the below to get a feel for how it works. You're essentially coding up human page interactions.
You will need to start Docker with a mapped directory. See download file with Rselenium & docker toolbox for all the info but here's how I did it on my macOS box:
docker run -d -v /Users/hrbrmstr/Downloads://home/seluser/Downloads -p 4445:4444 selenium/standalone-firefox:2.53.1
That makes Selenium accessible on port 4445, uses Firefox (b/c Chrome is evil) and maps my local downloads directory to the Firefox default dir for the selenium user in the Docker container. That means well_authorizations_issued.csv is going to go there (eventually).
Now, we need to crank up R and connect it to this Selenium instance. We need to create a custom Firefox profile since we're saving stuff to disk and we don't want the browser to prompt us for anything:
library(RSelenium)
makeFirefoxProfile(
list(
browser.download.dir = "home/seluser/Downloads",
browser.download.folderList = 2L,
browser.download.manager.showWhenStarting = FALSE,
browser.helperApps.neverAsk.saveToDisk = "text/csv"
)
) -> ffox_prof
remoteDriver(
browserName = "firefox", port = 4445L,
extraCapabilities = ffox_prof
) -> remDr
invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
You will need to uncomment out the Sys.sleep()s and experiment with various "wait time" values between calls. Some will be short (1-2s) others will be larger (20s, 30s, or higher).
I'm not displaying the output of the screenshots here but those are one way to figure out timings (i.e. keep generating screen shots after an element interaction until gray spinner boxes are gone — etc — and keep a mental note of how many seconds that was).
Now, the one tricky bit noted above is figuring out the where the checkbox is to turn off the filter since it has a dynamic id. However, we aren't actually going to click on the checkbox b/c the daft fools who created that app have no idea what they are doing and actually have the click-event trapped with the span element that surrounds it, so we have to find the li element that contains the checkbox label text then go to the span element and click on it.
box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
^^ definitely needs a delay (you likely saw it spin a while in-person when clicking yourself so you can count that and add in some buffer seconds).
Then, we click on the drop-down "menu" (it's really a button):
btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
Then the download "menu" item (it's really a button:
btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
^^ also needs rly needs a delay as the Download "dialog" takes a few seconds to come up (it did for me at least).
Now, find the CSV box which is really an a tag:
lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
That last bit is something you'll have to experiment with. It takes a while to process the request and then transfer the ~9MB file. The call to rmDr$screenshot() actually waits for the download to complete so you can remove the display and decoding code and assign the output to a variable and use that as an automatic "wait"er.
I tried this 3x on 2 different macOS systems and it worked fine. YMMV.
I'm guessing you'll want to automate this eventually so you could have a system() call towards the top of the script that starts the Selenium Docker container, then does the rest of the bits and then issues another system() call to shut down the Docker container.
Alternately, https://github.com/richfitz/stevedore is now on CRAN so it is a pure R interface to starting/stopping Docker containers (amongst many other things) so you could use that instead of system() calls.
If you can't use Docker, you need to install a "webdriver" executable for Firefox on your Windows box and also get the Selenium Java archive, ensure you have Java installed and then do the various manual incantations to get that going (which is beyond the scope of this answer).
Here's a shortened, contiguous version of the above:
library(RSelenium)
# start Selenium before doing this
makeFirefoxProfile(
list(
browser.download.dir = "home/seluser/Downloads",
browser.download.folderList = 2L,
browser.download.manager.showWhenStarting = FALSE,
browser.helperApps.neverAsk.saveToDisk = "text/csv"
)
) -> ffox_prof
remoteDriver(
browserName = "firefox", port = 4445L,
extraCapabilities = ffox_prof
) -> remDr
invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)
box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)
btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)
btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)
lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
done <- remDr$screenshot()
# stop Selenium
My goal is to download an image from a URL. In my case I can't use download.file because my picture is in a web page requiring login and it has some java scripts running in the background before the real image gets visible. This is why I need to do it using RSelenium package.
As suggested here, I've built a docker container with a standalone-chrome tag. Output from Docker terminal:
$ docker-machine ip
192.168.99.100
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c651dab3a948 selenium/standalone-chrome:3.4.0 "/opt/bin/entry_po..." 24 hours ago Up 24 hours 0.0.0.0:4445->4444/tcp cranky_kalam
Here's what I've tried:
require(RSelenium)
# Avoid download prompt to pop up and parsing default download folder
eCaps <- list(
chromeOptions =
list(prefs = list(
"profile.default_content_settings.popups" = 0L,
"download.prompt_for_download" = FALSE,
"download.default_directory" = "C:/temp/Pictures"
)
)
)
# Open connection
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",port = 4445L,browserName="chrome",extraCapabilities = eCaps)
remDr$open()
# Navigate to desired URL with picture
url <- "https://www.google.be/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
remDr$navigate(url)
remDr$screenshot(display = TRUE) # Everything looks fine here
# Move mouse to the page's center
webElem <- remDr$findElement(using = 'xpath',value = '/html/body')
remDr$mouseMoveToLocation(webElement = webElem)
# Right click and
remDr$click(2)
remDr$screenshot(display = TRUE) # I don't see the right-click dialog!
# Try to move right-click dialog to 'Save as' or 'Save image as'
remDr$sendKeysToActiveElement(list(key = 'down_arrow',
key = 'down_arrow',
key = 'enter'))
### NOTHING HAPPENS
I've tried to play around with the amount of key = 'down_arrow' and every time I look into C:/temp/Pictures nothing has been saved.
Please note that this is just an example and I know I could have downloaded this picture with download.file. I need a solution with RSelenium for my real case.
I tried using remDr$click(buttonId = 2) to perform Right click but to no avail. Thus, one workaround to save the image would be extracting links from the webpage and using download.file to download it.
#navigate
url <- "https://www.google.be/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
remDr$navigate(url)
#get the link of image
link = remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('img') %>%
html_attr('src')
[1] "https://www.google.be/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
#download using download.file in your current working directory.
download.file(link, basename(url), method = 'curl')