I want to take a screenshot of an entire webpage using RSelenium. I have this code working:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remdriv <- driver$client
remdriv$navigate("https://stackoverflow.com/questions/73115385/counting-all-elements-row-wise-in-list-column")
remdriv$screenshot(file = "post.png")
But when I run this, the screenshot captures only what the driver's browser window is currently showing, i.e. the visible viewport.
What I want is a full-length screenshot of the entire webpage as a single image. What can I do to capture that within RSelenium or another R tool?
I think you have to scroll down, take multiple screenshots, and then combine them into a single image. I haven't yet managed to zoom out, which would be another option.
The scrolling doesn't work perfectly and you still need to wrap the scroll-and-capture steps in a loop to finish the script, but I hope this is useful as a starting point:
# Load packages
if (!require(pacman)) {install.packages("pacman")}
pacman::p_load(here, RSelenium, stringr)
# Stop currently running server
if(exists("rD")) suppressWarnings(rD$server$stop())
# Load RSelenium
rD <- rsDriver(browser = "chrome", chromever = "106.0.5249.61", port = 4567L)
remDr <- rD[["client"]]
link <- "https://stackoverflow.com/questions/73115385/counting-all-elements-row-wise-in-list-column"
remDr$open()
remDr$navigate(link)
# Get browser height and width
browser_height <- remDr$executeScript("return document.body.offsetHeight;")[[1]]
browser_width <- remDr$executeScript("return document.body.offsetWidth;")[[1]]
remDr$getWindowSize()
remDr$setWindowSize(browser_width, browser_height)
remDr$getWindowSize()
# This is what actually can be displayed
browser_final_height <- remDr$getWindowSize()$height # 1175
# Get inner window height and width
inner_window_height <- remDr$executeScript("return window.innerHeight")[[1]]
inner_window_width <- remDr$executeScript("return window.innerWidth")[[1]]
# Check how many times the inner window fits into what should be the document height
num_screen <- (browser_height / inner_window_height)
# Move to top of window
remDr$executeScript("window.scrollBy(0, -5000);")
# Scroll down (loop from here to end)
remDr$executeScript(str_c("window.scrollBy(0, ", inner_window_height, ");"))
# Take screenshot
remDr$screenshot(file = here("results", "screenshots", "ex2.png"))
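# --- Sketch of the loop hinted at above (my completion, not part of the original
# script): repeat scroll + screenshot once per viewport, then stitch the pieces
# together with the magick package (an extra dependency, install it if needed).
library(magick)

remDr$executeScript("window.scrollTo(0, 0);")  # back to the top of the page
shot_files <- character(0)

for (i in seq_len(ceiling(num_screen))) {
  Sys.sleep(1)  # let the page settle after scrolling
  shot_files[i] <- here("results", "screenshots", str_c("ex2_", i, ".png"))
  remDr$screenshot(file = shot_files[i])
  remDr$executeScript(str_c("window.scrollBy(0, ", inner_window_height, ");"))
}

# Stack the captures vertically; the last one usually overlaps the previous capture,
# so some cropping (e.g. magick::image_crop) may still be needed.
full_page <- image_append(image_read(shot_files), stack = TRUE)
image_write(full_page, here("results", "screenshots", "ex2_full.png"))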
# Close server
remDr$close()
rD$server$stop()
I am trying to webscrape information from this website: https://www.nea.gov.sg/weather/rain-areas and download the 240km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of both end points, then save the images to a zip file.
Edit: Is there a way to do this with just rvest, avoiding the use of for loops?
I've found out that the image address can be obtained by clicking on the image and selecting "copy image address". An example: https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png
I've noted that the string of numbers represents the date and time, so the one I'd need would be 20220731xxxxxxx, where the x's encode the time. However, how would I then use this to scrape the images?
Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.
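One way to act on that observation is to build the image URLs directly from a sequence of times and download them without a browser. This is only a sketch: the filename pattern (YYYYMMDDHHMM followed by 0000 and the dBR.dpsri.png suffix) and the Singapore timezone are inferred from the single example URL above, so verify them before relying on it.
# Sketch: build the radar image URLs for 2022-07-31 01:00-03:00 at 5-minute steps,
# download them, and zip the result. The filename pattern is an assumption.
library(purrr)

times <- seq(
  from = as.POSIXct("2022-07-31 01:00:00", tz = "Asia/Singapore"),
  to   = as.POSIXct("2022-07-31 03:00:00", tz = "Asia/Singapore"),
  by   = "5 min"
)

img_urls <- paste0(
  "https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_",
  format(times, "%Y%m%d%H%M"), "0000dBR.dpsri.png"
)

dir.create("radar", showWarnings = FALSE)
dest_files <- file.path("radar", basename(img_urls))

# walk2 avoids an explicit for loop, in the spirit of the edit above
walk2(img_urls, dest_files, ~ download.file(.x, .y, mode = "wb"))

zip("radar_2022-07-31.zip", files = dest_files)
If some of these timestamps return 404, the slider on the page is the reliable way to see which times actually exist, which is what the answers below do with RSelenium.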
You can consider the following code to save screenshots of the webpage:
library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Elem <- remDr$findElement("xpath", '//*[@id="rain-area-slider"]/div/button')
web_Elem$clickElement()
# Take a screenshot every second while the slideshow advances
for (i in 1:10) {
  print(i)
  Sys.sleep(1)
  path_To_File <- paste0("C:/file", i, ".png")
  remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}
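Since the original goal was a zip file, the screenshots saved by the loop above can be bundled afterwards; a short sketch, with an arbitrary archive name:
# Bundle the screenshots taken above into a single zip archive
shot_files <- paste0("C:/file", 1:10, ".png")
zip("rain_areas_screenshots.zip", files = shot_files)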
Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.
1. Load packages
We will begin by loading our packages of interest:
# Load packages ----
pacman::p_load(
httr,
png,
purrr,
RSelenium,
rvest,
servr
)
2. Setup
Now we need to start the Selenium server with Firefox. The following code will start a Firefox instance; run it and wait for Firefox to launch:
# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())
Now that the firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:
cl <- rsd$client
The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:
# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")
Let's get scraping
Now we're going to begin the actual scraping! @EmmanuelHamel took the clever approach of simply clicking on the "play" button in order to launch the automatic "slideshow". He then took a screenshot of the webpage every second in order to capture the changes in the image. The approach I use is somewhat different.
In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and I click on each "step" one after the other. After clicking on a step, I get the URL of the image, then I click on the other step... all the way to the 13th step.
Here I get the HTML element for each step:
# Get the selector for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]
Then, I click on each element and get the image URL at each step. After you run this code, check how your code manipulates the webpage in the Firefox instance. Isn't that cool?
img_urls <- map_chr(rail_steps, function(step){
  # Click on the step, wait for the overlay to refresh, then grab the image URL
  cl$mouseMoveToLocation(webElement = step)
  cl$click()
  img_el <- cl$findElement(using = "css", value = "#rain_overlay")
  Sys.sleep(1)
  img_el$getElementAttribute(attrName = "src")[[1]]
})
Finally, I create an image folder img where I download and save the images:
# Create an image folder then download all images in it ----
dir.create("img")
walk(img_urls, function(img_url){
  GET(url = img_url) |>
    content() |>
    writePNG(target = paste0("img/", basename(img_url)))
})
Important
The downloaded images do not contain the background map from the webpage... only the points! You can download the background map and then lay the points on top of it (using image processing software, for example). Here is how to download the background map:
# Download the background map ----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
  content() |>
  writePNG(target = "base_image.png")
If you want to combine the images programmatically, you may want to look into the magick package in R.
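For example, a minimal magick sketch could look like the following. It assumes the files saved by the code above exist, and resizing the overlay to the base map's width is a guess that may need adjusting (an offset might also be required to line the points up with the map):
# Sketch: lay one downloaded radar overlay on top of the background map with magick
library(magick)

base_map <- image_read("base_image.png")
overlay  <- image_read(list.files("img", full.names = TRUE)[1])

# Scale the overlay to the base map's width before compositing
base_info <- image_info(base_map)
overlay   <- image_resize(overlay, paste0(base_info$width, "x"))

combined <- image_composite(base_map, overlay, operator = "over")
image_write(combined, "combined_1.png")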
EDIT: From the comments I received so far, I managed to use RSelenium to access the PDF files I am looking for, using the following code:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()
Now, I need R to click the download button, but I could not manage to do so. I tried:
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()
But I get the following error:
Selenium message:Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
Erro: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Can someone tell what is wrong here?
Thanks!
Original question:
I have several webpages from which I need to download embedded PDF files and I am looking for a way to automate it with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398
This is a webpage from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent to the US Securities and Exchange Commission - SEC) to download Notes to Financial Statements (Notas Explicativas) from Brazilian companies.
I tried several options but the website seems to be built in a way that makes it difficult to extract the direct links.
I tried what is suggested here: Downloading all PDFs from URL, but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector.
Similarly, when I tried the approach here: https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, html_attr("href") returns an empty vector.
I am not used to web scraping code in R, so I cannot figure out what is happening.
I appreciate any help!
If someone is facing the same problem I did, I am posting the solution I used:
# set Firefox profile to download PDFs automatically
pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = "application/pdf"
))
driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open PDF file
option$clickElement()
# Find iframes in the webpage
web.elem <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem, function(x){x$getElementAttribute("id")}) # see their names
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # see their names
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # Need sometime to finish download and then close the window
remote_driver$close() # Close the window
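Note that with this profile Firefox saves the PDF to its default download folder. If you want the files in a specific directory, you can add the usual Firefox download preferences to the same profile; the folder path below is just a placeholder, not from the original script:
# Sketch: same profile as above, plus preferences that point downloads at a folder
pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = "application/pdf",
  "browser.download.folderList" = 2L,            # 2 = use a custom directory
  "browser.download.dir" = "C:/pdf_downloads",   # placeholder path
  "browser.download.useDownloadDir" = TRUE
))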
I want to get some stats for the games in each round from this website (https://www.sofascore.com/pt/torneio/futebol/brazil/brasileiro-serie-a/325). There are 38 rounds, but the page only shows the first 11. To get to the rest of the rounds I have to scroll an inner scroll bar, but I don't know how to do that.
I use the package RSelenium in R.
Here's the code (so far)...
After this, I don't know what to do...
require(RSelenium)
click <- function(xpath) {
  webElem <- remDr$findElement(using = "xpath", value = xpath)
  webElem$clickElement()
}
driver <- rsDriver(port = 5799L, browser = c('chrome'), chromever = "88.0.4324.96")
url = 'https://www.sofascore.com/pt/torneio/futebol/brazil/brasileiro-serie-a/325'
remDr <- driver[['client']]
remDr$navigate(url) #link
Sys.sleep(1)
# games by round
click('//*[@id="__next"]/main/div/div[2]/div[1]/div[1]/div[6]/div/div[1]/a[2]')
# round options
click('//*[@id="__next"]/main/div/div[2]/div[1]/div[1]/div[6]/div/div[2]/div/div/div/div[1]/div/div[1]/div[2]/div')
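Not an answer from the original thread, but the general technique for an inner scroll bar is to scroll the container element itself with JavaScript instead of the window. A sketch, where the CSS selector '.rounds-scroll-container' is a placeholder you would have to replace after inspecting the page with the browser's developer tools:
# Sketch: scroll an inner container (not the window) from RSelenium.
# '.rounds-scroll-container' is a placeholder selector; find the real one
# with the element inspector.
scroll_inner <- function(selector, pixels) {
  remDr$executeScript(
    paste0("document.querySelector('", selector, "').scrollLeft += ", pixels, ";")
  )
}

# Scroll the round list to the right in steps until the later rounds are rendered
for (i in 1:5) {
  scroll_inner(".rounds-scroll-container", 500)
  Sys.sleep(1)
}
If the container scrolls vertically rather than horizontally, use scrollTop instead of scrollLeft.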
So I'm trying to scrape the name and location of various breweries in the US via this link:
https://www.brewersassociation.org/directories/breweries/
As you can see, the HTML takes a second to load. This means that when I scrape the HTML with RSelenium, only half of the page has loaded. Here's the code I'm running, which should be reproducible for anyone with RSelenium:
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
port = 4445L,
browserName = "chrome")
remDr$open()
remDr$setTimeout(type="page load")
remDr$navigate("https://www.brewersassociation.org/directories/breweries/?location=MI")
remDr$screenshot(display=TRUE)
However, if you look at the screenshot, only half of the page has loaded. I've tried setTimeout and a few other commands, but they don't seem to let the page load fully. Any advice or ideas on how to fix this?
You could try this:
library(RSelenium)
driver <- rsDriver(browser=c("firefox"), port = 4567L)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.brewersassociation.org/directories/breweries/?location=MI")
# You can wait 3 seconds for the initial load
Sys.sleep(3)
# Now scroll down so the rest of the page gets a chance to load
scroll_d <- remote_driver$findElement(using = "css", value = "body")
# Sending the "end" key scrolls the page once. That alone is not enough,
# but it is a building block you can automate: if you scroll the page enough
# times you will eventually see the whole page.
scroll_d$sendKeysToElement(list(key = "end"))
# How do you know when to stop? For example, you can use the alphabetical
# ordering of the brewery list to monitor how far down you are.
This answer is just one idea for a way to solve the problem.
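To automate that idea, a minimal sketch could keep scrolling until the document height stops growing; the 20-iteration cap and the 2-second pause are arbitrary choices, not from the original answer:
# Sketch: scroll until the page height stops increasing (i.e. everything has loaded)
get_height <- function() {
  remote_driver$executeScript("return document.body.scrollHeight;")[[1]]
}

last_height <- 0
for (i in 1:20) {                 # arbitrary upper bound on attempts
  scroll_d$sendKeysToElement(list(key = "end"))
  Sys.sleep(2)                    # give lazy-loaded content time to appear
  new_height <- get_height()
  if (new_height == last_height) break
  last_height <- new_height
}

# Now the full list should be in the page source
page_html <- remote_driver$getPageSource()[[1]]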
How can image downloading be disabled when using Firefox in Rselenium? I want to see if doing so makes a scraping script faster.
I've read the RSelenium package manual, including the sections on getFirefoxProfile and makeFirefoxProfile.
I've found this link that shows how to handle chromedriver.
I can disable images for a Firefox instance that I manually open in Windows 10, but RSelenium does not appear to use that same profile.
Previously you would just set the appropriate preference (in this case permissions.default.image). However, there is now an issue with Firefox resetting this value; see:
https://github.com/seleniumhq/selenium/issues/2171
A workaround is given here:
https://github.com/gempesaw/Selenium-Remote-Driver/issues/248
Implementing this in RSelenium:
library(RSelenium)
fprof <- makeFirefoxProfile(list(
  permissions.default.image = 2L,
  browser.migration.version = 9999L
))
rD <- rsDriver(browser = "firefox", extraCapabilities = fprof)
remDr <- rD$client
remDr$navigate("http://www.google.com/ncr")
remDr$screenshot(display = TRUE)
# clean up
rm(rD)
gc()
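To check whether blocking images actually speeds up your scraping, one rough approach (a sketch, with an arbitrary test URL; substitute a page representative of your own work) is to time the same navigation with and without the profile:
# Sketch: compare page-load time with and without the image-blocking profile
time_navigation <- function(extra_caps = list(), port = 4567L,
                            url = "http://www.google.com/ncr") {
  rD <- rsDriver(browser = "firefox", port = port, extraCapabilities = extra_caps)
  remDr <- rD$client
  elapsed <- system.time(remDr$navigate(url))[["elapsed"]]
  remDr$close()
  rD$server$stop()
  elapsed
}

# Use different ports so the second server does not clash with the first
with_images    <- time_navigation(port = 4567L)
without_images <- time_navigation(extra_caps = fprof, port = 4568L)
c(with_images = with_images, without_images = without_images)
A single navigation is a noisy measurement, so repeating it a few times gives a fairer comparison.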