I have been trying to locate the "Excel CSV" element on a web page using remDrv$findElements in R, but without success. How can I select the element using the xpath, css, or other locator arguments?
I tried:
library(RSelenium)
test_link <- "https://sinca.mma.gob.cl/cgi-bin/APUB-MMA/apub.htmlindico2.cgi?page=pageFrame&header=Talagante&macropath=./RM/D28/Cal/PM25&macro=PM25.horario.horario&from=080522&to=210909&"
rD <- rsDriver(port = 4446L, browser = "firefox", chromever = "92.0.4515.107") # starts a firefox browser; waits for the necessary driver files to download
remDrv <- rD$client
#remDrv$open(silent = TRUE)
url<-test_link
remDrv$navigate(url)
remDrv$findElements(using = "xpath", "/html/body/table/tbody/tr/td/table[2]/tbody/tr[1]/td/label/span[3]/a")
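For what it's worth, a minimal, untested sketch: since the URL contains page=pageFrame, the target link likely sits inside a frame, and findElements cannot see into a frame until you switch to it. The frame index and link text below are assumptions; inspect the page to confirm:
frames <- remDrv$findElements(using = "css selector", "frame, iframe")
remDrv$switchToFrame(frames[[1]]) # frame index is a guess; inspect the page
# "partial link text" avoids brittle absolute XPaths:
el <- remDrv$findElement(using = "partial link text", "Excel CSV")
# XPath alternative: remDrv$findElement(using = "xpath", "//a[contains(., 'Excel CSV')]")
# CSS alternative (the href pattern is an assumption):
# remDrv$findElement(using = "css selector", "a[href*='csv']")
el$clickElement()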
EDIT: From the comments I received so far, I managed to use RSelenium to access the PDF files I am looking for, using the following code:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()
Now, I need R to click the download button, but I could not manage to do so. I tried:
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()
But I get the following error:
Selenium message:Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Can someone tell me what is wrong here?
Thanks!
Original question:
I have several webpages from which I need to download embedded PDF files and I am looking for a way to automate it with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398
This is a webpage from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent to the US Securities and Exchange Commission - SEC) to download Notes to Financial Statements (Notas Explicativas) from Brazilian companies.
I tried several options but the website seems to be built in a way that makes it difficult to extract the direct links.
I tried what is suggested here: Downloading all PDFs from URL, but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector.
Similarly, when I tried the approach here (https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/), html_attr("href") also returns an empty vector.
I am not used to web-scraping code in R, so I cannot figure out what is happening.
I appreciate any help!
If someone is facing the same problem I did, I am posting the solution I used:
# set Firefox profile to download PDFs automatically
pdfprof <- makeFirefoxProfile(list(
"pdfjs.disabled" = TRUE,
"plugin.scan.plid.all" = FALSE,
"plugin.scan.Acrobat" = "99.0",
"browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))
driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open the PDF file
option$clickElement()
# Find iframes in the webpage
web.elem <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem, function(x){x$getElementAttribute("id")}) # see their names
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # see their names
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # Needs some time to finish the download before closing the window
remote_driver$close() # Close the window
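One caveat (not in the original post): if you loop this over several documents instead of closing the window, hop back to the top-level page first, since the driver is still focused on the inner iframe:
remote_driver$switchToFrame(NULL) # NULL switches back to the default (top-level) content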
I am trying to web scrape the data from the Flipkart site. The link for the webpage is as follows:
https://www.flipkart.com/mi-a1-black-64-gb/product-reviews/itmexnsrtzhbbneg?aid=overall&pid=MOBEX9WXUSZVYHET
I need to automate navigation to the next page by clicking the NEXT button on the webpage. Below is the code I'm using:
nextButton <- remDr$findElement(value = '//div[@class="_2kUstJ"]')$clickElement()
Error
Selenium message:Element is not clickable at point
I even tried scrolling the webpage, as suggested by many Stack Overflow answers, using the code below:
remDr$executeScript("arguments[0].scrollIntoView(true);", nextButton)
But this code also gives an error:
Error in checkError(res) : Undefined error in httr call. httr output: No method for S4 class:webElement
Kindly suggest a solution. I'm using the Firefox browser and Selenium to automate this from R.
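(Note: in the snippet above, nextButton stores the return value of clickElement(), not the element itself, and RSelenium's executeScript expects its script arguments wrapped in a list. A minimal, untested sketch of the fix:)
# find the element first (no chained clickElement), then pass it inside args = list(...)
nextButton <- remDr$findElement(value = '//div[@class="_2kUstJ"]')
remDr$executeScript("arguments[0].scrollIntoView(true);", args = list(nextButton))
nextButton$clickElement()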
If you do not mind using Chrome driver, the following code worked:
eCaps <- list(chromeOptions = list(
args = c('--headless', '--disable-gpu', '--window-size=1880,1000', "--no-sandbox", "--disable-dev-shm-usage")
))
remDr <- rsDriver(port = 4565L,browser = "chrome",extraCapabilities = eCaps)
remCl <- remDr[["client"]]
remCl$navigate("https://www.flipkart.com/mi-a1-black-64-gb/product-reviews/itmexnsrtzhbbneg?aid=overall&pid=MOBEX9WXUSZVYHET")
remCl$findElement(using = "css selector", "._3fVaIS > span:nth-child(1)")$clickElement()
Alternatively, we can first scroll to the end of the page and then click Next.
#Navigate to webpage
remDr$navigate("https://www.flipkart.com/mi-a1-black-64-gb/product-reviews/itmexnsrtzhbbneg?aid=overall&pid=MOBEX9WXUSZVYHET")
#Scroll to the end
webElem <- remDr$findElement("css", "html")
webElem$sendKeysToElement(list(key="end"))
#click on Next
remDr$findElement(using = "xpath", '//*[@id="container"]/div/div[3]/div/div/div[2]/div[13]/div/div/nav/a[11]/span')$clickElement()
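The absolute XPath above is tied to Flipkart's current markup; if it breaks, a text-based locator may be more resilient (assuming the pager link is literally labelled "Next"):
# assumption: the pager link contains a span with the text "Next"
remDr$findElement(using = "xpath", "//a[.//span[text()='Next']]")$clickElement()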
I am using RSelenium to download a number of .xls files. I got a somewhat passable solution with the following code to set up the server: it tells Chrome not to show a pop-up when I click a download link, and where to save the file. However, without fail, once I download the 101st file (saved as "report (100).xls"), the download pop-up starts appearing in the browser Selenium is driving.
eCaps <- list(
chromeOptions =
list(prefs = list(
"profile.default_content_settings.popups" = 0L,
"download.prompt_for_download" = FALSE,
"download.default_directory" = "mydownloadpath"
)
)
)
rd <- rsDriver(browser = "chrome", port=4566L, extraCapabilities = eCaps)
The function to download then looks like:
vote.downloading <- function(url){
#NB: this function assumes browser already up and running, options set correctly
Sys.sleep(1.5)
browser$navigate(url)
down_button <- browser$findElement(using="css",
"table:nth-child(4) tr:nth-child(3) a")
down_button$clickElement()
}
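For completeness, a hypothetical way to drive it over the scraped links (browser and urls here are assumptions, not part of the original code):
browser <- rd$client # the client from the rsDriver call above
urls <- character(0) # filled by the link-scraping step (not shown)
invisible(lapply(urls, vote.downloading))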
For reference, the sites I'm getting the download from look like this: http://www.moscow_city.vybory.izbirkom.ru/region/moscow_city?action=show&root=774001001&tvd=4774001137463&vrn=4774001137457&prver=0&pronetvd=null&region=77&sub_region=77&type=427&vibid=4774001137463
The link being used for the download reads "Версия для печати" ("Print version") for those who don't know Russian.
I can't simply stop the function when the dialog begins popping up and pick up where I left off, because it's part of a larger function that scrapes links from drop-down menus that lead to the sites from the download link. This would also be extremely annoying, as there are 400+ files to download.
Is there some way I can alter the Chrome profile or my scraping function to prevent the system dialog from popping up every 101 files? Or is there a better way altogether to get these files downloaded?
No need for Selenium:
library(httr)
httr::GET(
url = "http://www.moscow_city.vybory.izbirkom.ru/servlet/ExcelReportVersion",
query = list(
region="77",
sub_region="77",
root="774001001",
global="null",
vrn="4774001137457",
tvd="4774001137463",
type="427",
vibid="4774001137463",
condition="",
action="show",
version="null",
prver="0",
sortorder="0"
),
write_disk("/tmp/report.xls"), ## CHANGE ME
verbose()
) -> res
I save it off to an object so you can run warn_for_status() or other such checks.
It should be straightforward to wrap that in a function with parameters to make it more generic.
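For example, a hypothetical wrapper along those lines; splitting out vrn, tvd and vibid as the report-specific fields is an assumption based on the URL structure:
library(httr)

download_izbirkom_xls <- function(vrn, tvd, vibid, dest) {
  res <- GET(
    url = "http://www.moscow_city.vybory.izbirkom.ru/servlet/ExcelReportVersion",
    query = list(
      region = "77", sub_region = "77", root = "774001001", global = "null",
      vrn = vrn, tvd = tvd, type = "427", vibid = vibid,
      condition = "", action = "show", version = "null",
      prver = "0", sortorder = "0"
    ),
    write_disk(dest, overwrite = TRUE) # save response body to disk
  )
  warn_for_status(res) # warn (rather than error) on HTTP failures
  invisible(res)
}

# e.g. download_izbirkom_xls("4774001137457", "4774001137463",
#                            "4774001137463", "/tmp/report.xls")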
My goal is to download an image from a URL. In my case I can't use download.file because my picture is on a web page requiring login, and some JavaScript runs in the background before the real image becomes visible. This is why I need to do it with the RSelenium package.
As suggested here, I've built a docker container with a standalone-chrome tag. Output from Docker terminal:
$ docker-machine ip
192.168.99.100
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c651dab3a948 selenium/standalone-chrome:3.4.0 "/opt/bin/entry_po..." 24 hours ago Up 24 hours 0.0.0.0:4445->4444/tcp cranky_kalam
Here's what I've tried:
require(RSelenium)
# Avoid the download prompt popping up and set the default download folder
eCaps <- list(
chromeOptions =
list(prefs = list(
"profile.default_content_settings.popups" = 0L,
"download.prompt_for_download" = FALSE,
"download.default_directory" = "C:/temp/Pictures"
)
)
)
# Open connection
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100",port = 4445L,browserName="chrome",extraCapabilities = eCaps)
remDr$open()
# Navigate to desired URL with picture
url <- "https://www.google.be/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
remDr$navigate(url)
remDr$screenshot(display = TRUE) # Everything looks fine here
# Move mouse to the page's center
webElem <- remDr$findElement(using = 'xpath',value = '/html/body')
remDr$mouseMoveToLocation(webElement = webElem)
# Right click and
remDr$click(2)
remDr$screenshot(display = TRUE) # I don't see the right-click dialog!
# Try to move right-click dialog to 'Save as' or 'Save image as'
remDr$sendKeysToActiveElement(list(key = 'down_arrow',
key = 'down_arrow',
key = 'enter'))
### NOTHING HAPPENS
I've tried playing around with the number of key = 'down_arrow' presses, and every time I look in C:/temp/Pictures nothing has been saved.
Please note that this is just an example and I know I could have downloaded this picture with download.file. I need a solution with RSelenium for my real case.
I tried using remDr$click(buttonId = 2) to perform a right click, but to no avail. Thus, one workaround to save the image is to extract the link from the webpage and use download.file to download it.
#navigate
url <- "https://www.google.be/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
remDr$navigate(url)
#get the link of the image (requires rvest, which also provides %>%)
library(rvest)
link <- remDr$getPageSource()[[1]] %>%
  read_html() %>% html_nodes('img') %>%
  html_attr('src')
[1] "https://www.google.be/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png"
#download using download.file in your current working directory.
download.file(link, basename(url), method = 'curl')
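One caveat for the login-protected real case: download.file opens a fresh, unauthenticated connection. A hedged sketch of carrying the Selenium session's cookies over to an httr request instead (getAllCookies() is a standard RSelenium method; everything else is an assumption about your site):
library(httr)
# copy the browser session's cookies so the direct request is authenticated
cks <- remDr$getAllCookies()
cookie_vec <- setNames(
  vapply(cks, function(x) x$value, character(1)),
  vapply(cks, function(x) x$name, character(1))
)
GET(link,
    set_cookies(.cookies = cookie_vec),
    write_disk(basename(url), overwrite = TRUE))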
How can image downloading be disabled when using Firefox in RSelenium? I want to see if doing so makes a scraping script faster.
I've read the RSelenium package manual, including the sections on getFirefoxProfile and makeFirefoxProfile.
I've found this link that shows how to handle chromedriver.
I can disable images for a Firefox instance that I open manually in Windows 10, but RSelenium does not appear to use that same profile.
Previously you would simply set the appropriate preference (in this case permissions.default.image); however, there is now an issue with Firefox resetting this value, see:
https://github.com/seleniumhq/selenium/issues/2171
A workaround is given here:
https://github.com/gempesaw/Selenium-Remote-Driver/issues/248
Implementing this in RSelenium:
library(RSelenium)
fprof <- makeFirefoxProfile(list(permissions.default.image = 2L,
browser.migration.version = 9999L))
rD <- rsDriver(browser = "firefox", extraCapabilities = fprof)
remDr <- rD$client
remDr$navigate("http://www.google.com/ncr")
remDr$screenshot(display = TRUE)
# clean up
rm(rD)
gc()
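A quick, informal check that the preference took effect (an assumption, not part of the original answer): with images blocked, a rendered <img> should report a natural width of 0.
# should return 0 when images are blocked (-1 if the page has no <img> at all)
remDr$executeScript(
  "return document.images.length ? document.images[0].naturalWidth : -1;"
)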