RSelenium - Save page as

My goal is to download an image from a URL. In my case I can't use download.file because the picture sits on a web page that requires a login, and some JavaScript runs in the background before the real image becomes visible. This is why I need to do it with the RSelenium package.
As suggested here, I've built a Docker container with the standalone-chrome tag. Output from the Docker terminal:
$ docker-machine ip
192.168.99.100
$ docker ps
CONTAINER ID   IMAGE                              COMMAND                  CREATED        STATUS        PORTS                    NAMES
c651dab3a948   selenium/standalone-chrome:3.4.0   "/opt/bin/entry_po..."   24 hours ago   Up 24 hours   0.0.0.0:4445->4444/tcp   cranky_kalam
Here's what I've tried:
require(RSelenium)
# Stop the download prompt from popping up and set the default download folder
eCaps <- list(
  chromeOptions =
    list(prefs = list(
      "profile.default_content_settings.popups" = 0L,
      "download.prompt_for_download" = FALSE,
      "download.default_directory" = "C:/temp/Pictures"
    ))
)
# Open connection
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L,
                      browserName = "chrome", extraCapabilities = eCaps)
remDr$open()
# Navigate to desired URL with picture
url <- "https://www.google.be/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
remDr$navigate(url)
remDr$screenshot(display = TRUE) # Everything looks fine here
# Move mouse to the page's center
webElem <- remDr$findElement(using = 'xpath',value = '/html/body')
remDr$mouseMoveToLocation(webElement = webElem)
# Right click
remDr$click(2)
remDr$screenshot(display = TRUE) # I don't see the right-click dialog!
# Try to move through the right-click menu to 'Save as' / 'Save image as'
remDr$sendKeysToActiveElement(list(key = 'down_arrow',
                                   key = 'down_arrow',
                                   key = 'enter'))
### NOTHING HAPPENS
I've tried playing around with the number of key = 'down_arrow' entries, and every time I look in C:/temp/Pictures nothing has been saved.
Please note that this is just an example and I know I could have downloaded this picture with download.file. I need a solution with RSelenium for my real case.

I tried using remDr$click(buttonId = 2) to perform a right click, but to no avail. Thus, one workaround to save the image is to extract the image link from the webpage and use download.file to download it.
# navigate
library(rvest)
library(magrittr)
url <- "https://www.google.be/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
remDr$navigate(url)
# get the link of the image
link <- remDr$getPageSource()[[1]] %>%
  read_html() %>% html_nodes('img') %>%
  html_attr('src')
link
# [1] "https://www.google.be/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
# download using download.file into your current working directory
download.file(link, basename(url), method = 'curl')
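Since the real page requires a login, a plain download.file call may be rejected there. One possible workaround, sketched under the assumption that the RSelenium session is already logged in, is to reuse that session's cookies with httr so the direct request is authenticated:
# Hedged sketch: pass the cookies from the logged-in RSelenium session along
# with the image request. Assumes `remDr` is logged in and `link` holds the
# image URL extracted above.
library(httr)
cookies <- remDr$getAllCookies()
cookie_vec <- setNames(
  vapply(cookies, function(x) x$value, character(1)),
  vapply(cookies, function(x) x$name, character(1))
)
resp <- GET(link, set_cookies(.cookies = cookie_vec),
            write_disk(basename(link), overwrite = TRUE))
stop_for_status(resp)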

Related

Webscraping images in R and saving them into a zip file

I am trying to webscrape information from this website: https://www.nea.gov.sg/weather/rain-areas and download the 240km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of end points, and then save the images to a zip file.
Edit: Is there a way to do it with just rvest, avoiding the use of for loops?
I've found out that the image address can be acquired by clicking on the image and selecting "copy image address". An example: https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png
I've noted that the string of numbers represents the date and time, so the ones I'd need would be of the form 20220731xxxxxxx, where x would be the time. However, how would I then use this to webscrape?
Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.
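Based on the naming pattern noted above, one option is to construct the candidate URLs directly rather than interacting with the page. The sketch below assumes that the dpsri_240km_YYYYMMDDHHMM0000dBR.dpsri.png convention also holds for 2022-07-31 and that those files are still hosted; if they have been removed from the server, the downloads will simply fail:
# Build the timestamps at five-minute intervals, then the matching URLs
times <- seq(
  from = as.POSIXct("2022-07-31 01:00:00", tz = "Asia/Singapore"),
  to   = as.POSIXct("2022-07-31 03:00:00", tz = "Asia/Singapore"),
  by   = "5 min"
)
urls <- paste0(
  "https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_",
  format(times, "%Y%m%d%H%M"), "0000dBR.dpsri.png"
)
# Download each file (Map avoids an explicit for loop) and zip the results
dir.create("radar", showWarnings = FALSE)
files <- file.path("radar", basename(urls))
Map(function(u, f) try(download.file(u, f, mode = "wb")), urls, files)
zip("radar_scans.zip", files)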
You can consider the following code to save screenshots of the webpage:
library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Elem <- remDr$findElement("xpath", '//*[@id="rain-area-slider"]/div/button')
web_Elem$clickElement()

for(i in 1:10) {
  print(i)
  Sys.sleep(1)
  path_To_File <- paste0("C:/file", i, ".png")
  remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}
Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.
1. Load packages
We will begin by loading our packages of interest:
# Load packages ----
pacman::p_load(
  httr,
  png,
  purrr,
  RSelenium,
  rvest,
  servr
)
2. Setup
Now we need to start the Selenium server with Firefox. The following code will start a Firefox instance; run it and wait for Firefox to launch:
# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())
Now that the firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:
cl <- rsd$client
The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:
# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")
Let's get scraping
Now we're going to begin the actual scraping! @EmmanuelHamel took the clever approach of simply clicking on the "play" button in order to launch the automatic "slideshow". He then took a screenshot of the webpage every second in order to capture the changes in the image. The approach I use is somewhat different.
In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and I click on each "step" one after the other. After clicking on a step, I get the URL of the image, then I click on the next step... all the way to the 13th step.
Here I get the HTML element for each step:
# Get the HTML element for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]
Then, I click on each element and get the image URL at each step. After you run this code, check how it manipulates the webpage in the Firefox instance. Isn't that cool?
img_urls <- map_chr(rail_steps, function(step) {
  cl$mouseMoveToLocation(webElement = step)
  cl$click()
  img_el <- cl$findElement(using = "css", value = "#rain_overlay")
  Sys.sleep(1)
  img_el$getElementAttribute(attrName = "src")[[1]]
})
Finally, I create an image folder img where I download and save the images:
# Create an image folder then download all images in it ----
dir.create("img")

walk(img_urls, function(img_url) {
  GET(url = img_url) |>
    content() |>
    writePNG(target = paste0("img/", basename(img_url)))
})
Important
The downloaded images do not contain the background map on the webpage... only the points! You can download the background map then lay the points on top of it (using an image processing software for example). Here is how to download the background map:
# Download the background map----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
content() |>
writePNG(target = "base_image.png")
If you want to combine the images programmatically, you may want to look into the magick package in R.
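For example, a minimal magick sketch, assuming the base_image.png and img/ folder created above, that lays one radar frame on top of the base map:
# Overlay one downloaded radar frame on the base map with magick.
# The scaling step is an assumption: it forces the frame to the base map's width.
library(magick)

base_map <- image_read("base_image.png")
frame    <- image_read(list.files("img", full.names = TRUE)[1])

frame    <- image_scale(frame, as.character(image_info(base_map)$width))
combined <- image_composite(base_map, frame, operator = "over")
image_write(combined, "combined_frame_01.png")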

problem finding element on web page using RSelenium

I have been trying to get the "Excel CSV" element on a web page using remDrv$findElements in R, but have not been able to achieve it. How could I locate the element using the xpath, css, etc. arguments?
I tried:
library(RSelenium)
test_link="https://sinca.mma.gob.cl/cgi-bin/APUB-MMA/apub.htmlindico2.cgi?page=pageFrame&header=Talagante&macropath=./RM/D28/Cal/PM25&macro=PM25.horario.horario&from=080522&to=210909&"
rD <- rsDriver(port = 4446L, browser = "firefox", chromever = "92.0.4515.107") # starts a browser; wait for the necessary files to download
remDrv <- rD$client
#remDrv$open(silent = TRUE)
url<-test_link
remDrv$navigate(url)
remDrv$findElements(using = "xpath", "/html/body/table/tbody/tr/td/table[2]/tbody/tr[1]/td/label/span[3]/a")
link: https://sinca.mma.gob.cl/cgi-bin/APUB-MMA/apub.htmlindico2.cgi?page=pageFrame&header=Talagante&macropath=./RM/D28/Cal/PM25&macro=PM25.horario.horario&from=080522&to=210909&
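One thing to try, assuming the visible link text on that page actually contains "Excel CSV" (not verified here), is to locate it by partial link text instead of a long positional XPath; if nothing matches, the link may sit inside a frame that you have to switch into first:
# Locate the link by its visible text rather than a positional XPath.
# If length(elems) is 0, the element may be inside a frame: find the iframes
# and call remDrv$switchToFrame() before searching again.
elems <- remDrv$findElements(using = "partial link text", value = "Excel CSV")
length(elems)
if (length(elems) > 0) {
  elems[[1]]$getElementAttribute("href")[[1]]  # inspect the link target
  elems[[1]]$clickElement()                    # or click it to download
}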

How to download embedded PDF files from a webpage using RSelenium?

EDIT: From the comments I received so far, I managed to use RSelenium to access the PDF files I am looking for, using the following code:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()
Now, I need R to click the download button, but I could not manage to do so. I tried:
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()
But I get the following error:
Selenium message: Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Can someone tell what is wrong here?
Thanks!
Original question:
I have several webpages from which I need to download embedded PDF files and I am looking for a way to automate it with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398
This is a webpage from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent to the US Securities and Exchange Commission - SEC) to download Notes to Financial Statements (Notas Explicativas) from Brazilian companies.
I tried several options but the website seems to be built in a way that makes it difficult to extract the direct links.
I tried what is suggested here: Downloading all PDFs from URL, but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector.
Similarly, when I tried the approach here: https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, html_attr("href") returns an empty vector.
I am not used to web scraping codes in R, so I cannot figure out what is happening.
I appreciate any help!
If someone is facing the same problem I did, I am posting the solution I used:
# set Firefox profile to download PDFs automatically
pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))
driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open PDF file
option$clickElement()
# Find iframes in the webpage
web.elem <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem, function(x){x$getElementAttribute("id")}) # see their names
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # see their names
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # Need sometime to finish download and then close the window
remote_driver$close() # Close the window

How to properly use RSelenium to wait for a page to load?

So I'm trying to scrape the name and location of various breweries in the US via this link:
https://www.brewersassociation.org/directories/breweries/
As you can see, the HTML takes a second to load. This means that when I scrape the HTML with RSelenium, only half the page has loaded. Here's the code I'm running, which should be reproducible for anyone with RSelenium:
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$setTimeout(type="page load")
remDr$navigate("https://www.brewersassociation.org/directories/breweries/?location=MI")
remDr$screenshot(display=TRUE)
However, if you look at the screenshot, only half of the page has loaded. I've tried setTimeout and a few other commands, but they don't seem to let the page load fully. Any advice or ideas on how to fix this?
You could try this:
library(RSelenium)
driver <- rsDriver(browser=c("firefox"), port = 4567L)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.brewersassociation.org/directories/breweries/?location=MI")
# You can wait 3 seconds
Sys.sleep(3)
# Now scroll down the page and wait for the rest of it to load
scroll_d <- remote_driver$findElement(using = "css", value = "body")
# A single scroll is not enough by itself, but it is a way to automate the process.
# If you scroll the page enough times you will be able to see the whole page.
scroll_d$sendKeysToElement(list(key = "end"))
# To know when to stop, you can for example use the alphabetical headings to monitor how far the list has loaded.
This answer is just one idea for solving the problem.
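Building on that idea, here is a minimal sketch that keeps pressing End until the document height stops growing; the 20-iteration cap and 2-second pause are arbitrary choices:
# Scroll repeatedly and stop once the page height no longer changes,
# i.e. once the lazily loaded content appears to be fully rendered.
scroll_d <- remote_driver$findElement(using = "css", value = "body")
last_height <- -1
for (i in 1:20) {
  scroll_d$sendKeysToElement(list(key = "end"))
  Sys.sleep(2)
  new_height <- remote_driver$executeScript("return document.body.scrollHeight;")[[1]]
  if (identical(new_height, last_height)) break
  last_height <- new_height
}
page_source <- remote_driver$getPageSource()[[1]]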

RSelenium: Error while downloading files with Chrome

I am using RSelenium to download a number of .xls files. I was able to get a somewhat passable solution using the following code to set up the server, which specifies not to show a download prompt when I click on the download link and where to download the file to. However, without fail, once I download the 101st file (saved as "report (100).xls"), the download pop-up begins appearing in the browser Selenium is driving.
eCaps <- list(
  chromeOptions =
    list(prefs = list(
      "profile.default_content_settings.popups" = 0L,
      "download.prompt_for_download" = FALSE,
      "download.default_directory" = "mydownloadpath"
    ))
)
rd <- rsDriver(browser = "chrome", port = 4566L, extraCapabilities = eCaps)
The function to download then looks like:
vote.downloading <- function(url){
  # NB: this function assumes browser already up and running, options set correctly
  Sys.sleep(1.5)
  browser$navigate(url)
  down_button <- browser$findElement(using = "css",
                                     "table:nth-child(4) tr:nth-child(3) a")
  down_button$clickElement()
}
For reference, the sites I'm getting the download from look like this: http://www.moscow_city.vybory.izbirkom.ru/region/moscow_city?action=show&root=774001001&tvd=4774001137463&vrn=4774001137457&prver=0&pronetvd=null&region=77&sub_region=77&type=427&vibid=4774001137463
The link being used for the download reads "Версия для печати" for those who don't know Russian.
I can't simply stop the function when the dialog begins popping up and pick up where I left off, because it's part of a larger function that scrapes links from drop-down menus leading to the pages with the download link. This would also be extremely annoying, as there are 400+ files to download.
Is there some way I can alter the Chrome profile or my scraping function to prevent the system dialog from popping up every 101 files? Or is there a better way altogether to get these files downloaded?
No need for Selenium:
library(httr)

httr::GET(
  url = "http://www.moscow_city.vybory.izbirkom.ru/servlet/ExcelReportVersion",
  query = list(
    region = "77",
    sub_region = "77",
    root = "774001001",
    global = "null",
    vrn = "4774001137457",
    tvd = "4774001137463",
    type = "427",
    vibid = "4774001137463",
    condition = "",
    action = "show",
    version = "null",
    prver = "0",
    sortorder = "0"
  ),
  write_disk("/tmp/report.xls"), ## CHANGE ME
  verbose()
) -> res
I save it off to an object so you can run warn_for_status() or other such checks.
It should be straightforward to wrap that in a function with parameters to make it more generic.
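For instance, a minimal sketch of such a wrapper; which query fields actually vary between reports is an assumption here, so adjust the parameters as needed:
# Wrap the request in a reusable function. The choice of which fields to expose
# as parameters is a guess; the remaining query values are kept as constants.
download_report <- function(vrn, tvd, vibid, root, dest,
                            region = "77", sub_region = "77", type = "427") {
  res <- httr::GET(
    url = "http://www.moscow_city.vybory.izbirkom.ru/servlet/ExcelReportVersion",
    query = list(
      region = region, sub_region = sub_region, root = root, global = "null",
      vrn = vrn, tvd = tvd, type = type, vibid = vibid,
      condition = "", action = "show", version = "null",
      prver = "0", sortorder = "0"
    ),
    httr::write_disk(dest, overwrite = TRUE)
  )
  httr::warn_for_status(res)
  invisible(res)
}

# Example call with the values from the request above
download_report(vrn = "4774001137457", tvd = "4774001137463",
                vibid = "4774001137463", root = "774001001",
                dest = "report.xls")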
