I'm trying to scrape a government database with multiple dropdown menus. Using RSelenium, I've managed to click on the button taking me to the interactive database ("Sistema de Catastros de superficie frutícola regional"), and I'm now trying to click on the dropdown menus (e.g. region, year), but I keep getting NoSuchElement errors. I've tried multiple xpaths based on Inspect Element and the SelectorGadget Chrome extension, to no avail. It looks like each of the dropdown menus is a combobox.
If it helps, my end goal is to go through each of the regions, years, and crops, scraping the table generated by each combination.
library(RSelenium)
library(tidyverse)
rdriver = rsDriver(browser = "chrome", port = 9515L, chromever = "106.0.5249.61")
obj = rdriver$client
obj$navigate("https://www.odepa.gob.cl/estadisticas-del-sector/catastros-fruticolas")
link = obj$findElement(using = 'xpath', value = '//*[@id="content"]/div/div/div/div/div[1]/div[2]/div/div[2]/div[1]/div/div/div[3]/div/p[2]/a')$clickElement()
When you click on the button, a new tab appears.
You have to switch tabs with the help of
remDr$switchToWindow(remDr$getWindowHandles()[[2]])
Here is an example.
library(rvest)
library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
Sys.sleep(15)
url <- "https://www.odepa.gob.cl/estadisticas-del-sector/catastros-fruticolas"
remDr$navigate(url)
Sys.sleep(15)
web_Obj_Button <- remDr$findElement("xpath", '//*[@id="content"]/div/div/div/div/div[1]/div[2]/div/div[2]/div[1]/div/div/div[3]/div/p[2]/a')
web_Obj_Button$clickElement()
remDr$switchToWindow(remDr$getWindowHandles()[[2]])
web_Obj_Date <- remDr$findElement("css selector", "#mat-select-value-3 > span > span")
web_Obj_Date$clickElement()
remDr$screenshot(TRUE)
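Since the end goal is to loop over every region, year, and crop, it can help to enumerate the combinations up front and iterate over that grid. A minimal sketch in base R (the region/year/crop values below are placeholders, not the site's real lists, which would be read from the page's `mat-option` elements after opening each dropdown):

```r
# Placeholder values; the real lists would be scraped from the page's
# <mat-option> elements after opening each dropdown.
regions <- c("Valparaiso", "Metropolitana")
years   <- c(2017, 2020)
crops   <- c("Palto", "Nogal")

# One row per dropdown combination to visit.
combos <- expand.grid(region = regions, year = years, crop = crops,
                      stringsAsFactors = FALSE)
nrow(combos)  # 2 * 2 * 2 = 8
```

Each row of `combos` then drives one round of dropdown clicks and one table scrape.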
I intend to download and clean databases using RSelenium. I am able to open the link; however, I am having trouble downloading and opening the database. I believe the xpath is right, but when I try to open it I receive the following error:
Selenium message:no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="ESTBAN_AGENCIA"]"}
My code is the following:
dir <- getwd()
file_path <- paste0(dir, "\\DataBase") %>% str_replace_all("/", "\\\\")
eCaps <- list(
  chromeOptions = list(
    prefs = list("download.default_directory" = file_path)
  )
)
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
#Creating server
rD <- rsDriver(browser = "chrome",
chromever = "101.0.4951.15",
port = 4812L,
extraCapabilities = eCaps)
#Creating the driver to use R
remDr <- remoteDriver(
remoteServerAddr = "localhost",
browserName = "chrome",
port = 4812L)
#Open server
remDr$open()
#Navegating in the webpage of ESTABAN
remDr$navigate("https://www.bcb.gov.br/acessoinformacao/legado?url=https:%2F%2Fwww4.bcb.gov.br%2Ffis%2Fcosif%2Festban.asp")
##Download
remDr$findElement(using = "xpath", '//*[@id="ESTBAN_AGENCIA"]/option[1]')
The element you are trying to access is inside an iframe, and you need to switch to that iframe first in order to access the element.
remDr$navigate("https://www.bcb.gov.br/acessoinformacao/legado?url=https:%2F%2Fwww4.bcb.gov.br%2Ffis%2Fcosif%2Festban.asp")
#Switch to Iframe
webElem <- remDr$findElement("css", "iframe#framelegado")
remDr$switchToFrame(webElem)
##Download
remDr$findElement(using = "xpath", '//*[@id="ESTBAN_AGENCIA"]/option[1]')
My project includes fetching a token from a RESTful API using R. The API uses OAuth 2.0. The project is confidential, so I cannot provide more information.
Can anyone please help me access the token using R code? For instance:
URL: facebook.com
username: ABC
Password: qwerty
Here is one approach that can be considered for Facebook:
library(RSelenium)
url <- "https://www.facebook.com/"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Obj_Username <- remDr$findElement("css selector", '#email')
web_Obj_Username$sendKeysToElement(list("ABC"))
web_Obj_Password <- remDr$findElement("css selector", '#pass')
web_Obj_Password$sendKeysToElement(list("qwerty"))
html_Content <- remDr$getPageSource()[[1]]
id_Submit <- unlist(stringr::str_extract_all(html_Content, 'royal_login_button" type="submit" id=".........'))
id_Submit <- stringr::str_extract_all(id_Submit, 'id=".........')
id_Submit <- stringr::str_remove_all(id_Submit, 'id|=|\"')
web_Obj_Submit <- remDr$findElement("id", id_Submit)
web_Obj_Submit$clickElement()
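The fixed-width `'.........'` pattern assumes the generated id is always exactly nine characters. A capture group is more robust; here is a sketch in base R against a hypothetical fragment of the page source (the real id is generated per session, so this string is illustrative only):

```r
# Hypothetical fragment of the login page source; the submit button id
# is generated dynamically, so it must be extracted on every run.
html_Content <- '<button data-testid="royal_login_button" type="submit" id="u_0_b_aB">Log In</button>'

# Capture whatever sits inside id="...", regardless of its length.
m <- regexec('type="submit" id="([^"]+)"', html_Content)
id_Submit <- regmatches(html_Content, m)[[1]][2]
id_Submit  # "u_0_b_aB"
```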
I am trying to collect some data from the OTC Markets (within the confines of their robots.txt) and I cannot connect to the webpage.
The first step I tried was just to scrape the HTML right off the page, but the page requires javascript to load.
So I downloaded PhantomJS and connected that way. However, this leads to a 404 error page.
I then changed the user agent to something resembling a real browser to see if it would let me connect, and still, no luck! What is going on here?
Here is a reproducible version of my code, any help would be appreciated. Phantomjs can be downloaded here: http://phantomjs.org/
library(rvest)
library(xml2)
library(V8)
# example website, I have no correlation to this stock
url <- 'https://www.otcmarkets.com/stock/YTROF/profile'
# create javascript file that phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
html <- system("phantomjs.exe_PATH scrape.js", intern = TRUE)
page_html <- read_html(paste(html, collapse = "\n"))
I have been able to get the HTML content with the following code, which is based not on PhantomJS but on Selenium:
library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate('https://www.otcmarkets.com/stock/YTROF/profile')
remDr$executeScript("scroll(0, 5000)")
remDr$executeScript("scroll(0, 10000)")
remDr$executeScript("scroll(0, 15000)")
Sys.sleep(4)
remDr$screenshot(display = TRUE, useViewer = TRUE)
html_Content <- remDr$getPageSource()[[1]]
It is important to give time to the page to load before we extract the html content.
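Instead of fixed `Sys.sleep()` calls, a small polling helper can retry until the page is ready or a timeout expires. A sketch in base R (`wait_for` is a hypothetical helper name, not part of RSelenium):

```r
# Retry `fn` until it returns without error, or stop once
# `timeout` seconds have elapsed.
wait_for <- function(fn, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- tryCatch(fn(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (Sys.time() > deadline) stop("timed out waiting for element")
    Sys.sleep(interval)
  }
}

# e.g. wait until the profile table exists before reading the page source:
# tbl <- wait_for(function() remDr$findElement("css", "table"))
```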
Here is another approach based on RDCOMClient:
library(RDCOMClient)
url <- 'https://www.otcmarkets.com/stock/YTROF/profile'
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
Sys.sleep(5)
html_Content <- doc$documentElement()$innerText()
I am trying to run an RSelenium instance to download some PDF files for me without having to click through the dialog boxes (or have them open in pdf.js).
But even if I set my configurations, the Firefox instance still loads the default profile.
RSelenium version: 1.73
Firefox version: 56.0 (32-bit)
Windows: 7 Ultimate
Create profile and start server:
library(RSelenium)
library(rvest)
library(XML)
library(stringi)
cprof <- makeFirefoxProfile(list(
  pdfjs.disabled = TRUE,
  plugin.scan.plid.all = FALSE,
  plugin.scan.Acrobat = "99.0",
  browser.helperApps.neverAsk.saveToDisk = "application/pdf",
  browser.download.dir = "C:\\temp"
))
remDr <- rsDriver(port = 4477L, browser = "firefox", check = FALSE, extraCapabilities = cprof)
remDr <- remDr[["client"]]
After Firefox launches I check the configuration, and the settings have remained in their default state.
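One thing worth checking (an assumption, not verified against this exact Firefox/RSelenium combination): Firefox only honors `browser.download.dir` when `browser.download.folderList` is set to 2, meaning "use a custom location". The profile above never sets it, so Firefox may be falling back to its defaults. A sketch of the preference list with that key added:

```r
# browser.download.folderList: 0 = desktop, 1 = downloads folder,
# 2 = the custom directory named in browser.download.dir.
prefs <- list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = "application/pdf",
  "browser.download.folderList" = 2L,
  "browser.download.dir" = "C:\\temp"
)
# cprof <- makeFirefoxProfile(prefs)  # then pass cprof to rsDriver() as before
```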
I have the following code, with which I try to open each url loaded from the for loop in a new tab. What I have made until now is this:
library("RSelenium")
startServer()
checkForServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.google.com/")
Sys.sleep(5)
myurllist <- c("https://cran.r-project.org/", "http://edition.cnn.com/", "https://cran.r-project.org/web/packages/")
for (i in 1:length(myurllist)) {
url <- myurllist[i]
webElem <- remDr$findElement("css", "urlLink")
webElem$sendKeysToElement(list(key = "t"))
remDr$navigate(url)
Sys.sleep(5)
}
From a Selenium question I found this answer:
A new tab is opened by pressing CTRL+T, not T:
library("RSelenium")
startServer()
checkForServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.google.com/")
url_list <- c("http://edition.cnn.com/", "https://cran.r-project.org/web/packages/")
for (url in url_list) {
webElem <- remDr$findElement("css", "html")
webElem$sendKeysToElement(list(key="control", "t"))
remDr$navigate(url)
}