I have had my first go at using RSelenium today to scrape data from websites. I can navigate to the data I require via the tabs and drop-down menus (the hard bit?) but am now stuck at the point of extracting the actual data I need (the easy bit!)
My code so far is:
library(RSelenium)
checkForServer()
startServer()
remDr <- remoteDriver$new()
remDr$open()
remDr$navigate("https://www.whoscored.com/Teams/31")
webElem1 <- remDr$findElement(value = '//a[@href = "#team-squad-stats-detailed"]')
webElem1$clickElement()
webElem2 <- remDr$findElement("id", "category")
webElem2$clickElement()
webElem2$sendKeysToElement(list(key="down_arrow", key="down_arrow", key="down_arrow",
key="down_arrow", key="down_arrow", key="enter"))
webElem3 <- remDr$findElement("id", "subcategory")
webElem3$clickElement()
webElem3$sendKeysToElement(list(key="down_arrow", key="enter"))
webElem4 <- remDr$findElement("id", "statsAccumulationType")
webElem4$clickElement()
webElem4$sendKeysToElement(list(key="down_arrow", key="down_arrow", key="down_arrow",
key="enter"))
webElem5 <- remDr$findElement("id", "player-table-statistics-body")
Can someone advise the simplest way to now extract the data in this player table into CSV form, please? I am used to using the XML package and readHTMLTable to scrape other (static) websites, but I am stuck on how to combine that with my RSelenium steps above.
Thank you
EDIT: having come back to this with fresh eyes, the answer I have found is below:
library(XML)  # for readHTMLTable
webElem5 <- remDr$findElement(using = "id", value = "statistics-table-detailed")
webElem5txt <- webElem5$getElementAttribute("outerHTML")[[1]]
table <- readHTMLTable(webElem5txt, header=TRUE, as.data.frame=TRUE)[[1]]
This allows me to proceed with what I need on this part of the website.
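Since the goal was a CSV file, a minimal follow-up (the file name here is just an example) would be:
write.csv(table, "team-squad-stats.csv", row.names = FALSE)  # export the parsed table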
If I may, I would like to ask for help with another part of the same site. I navigate to the data I need as follows:
remDr$navigate("https://www.whoscored.com/Matches/959894")
webElem1 <- remDr$findElement(using = "link text", value = "Match Centre")
webElem1$clickElement()
webElem2 <- remDr$findElement(value = '//a[@href = "#chalkboard"]')
webElem2$clickElement()
The data I would like to extract is in these boxes, but since the HTML shows they are not built as tables, I don't really know how to proceed.
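One route I could presumably try (a sketch only, untested against the live page; the XPath is a placeholder for whatever class those boxes actually use) would be to grab the page source after the clicks and extract the text with the XML package:
pageSrc <- remDr$getPageSource()[[1]]   # HTML of the page in its current state
doc <- htmlParse(pageSrc, asText = TRUE)
# placeholder XPath: substitute the real container class/id found by inspecting the page
boxText <- xpathSApply(doc, '//div[contains(@class, "chalkboard")]', xmlValue)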
EDIT: From the comments I received so far, I managed to use RSelenium to access the PDF files I am looking for, using the following code:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()
Now, I need R to click the download button, but I could not manage to do so. I tried:
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()
But I get the following error:
Selenium message:Unable to locate element: //*[#id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
Erro: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Can someone tell me what is wrong here?
Thanks!
Original question:
I have several webpages from which I need to download embedded PDF files and I am looking for a way to automate it with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398
This is a webpage from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent to the US Securities and Exchange Commission - SEC) to download Notes to Financial Statements (Notas Explicativas) from Brazilian companies.
I tried several options but the website seems to be built in a way that makes it difficult to extract the direct links.
I tried what is suggested here (Downloading all PDFs from URL), but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector.
Similarly, when I tried the approach here, https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, html_attr("href") returns an empty vector.
I am not used to web scraping code in R, so I cannot figure out what is happening.
I appreciate any help!
If someone is facing the same problem I did, I am posting the solution I used:
# set Firefox profile to download PDFs automatically
pdfprof <- makeFirefoxProfile(list(
"pdfjs.disabled" = TRUE,
"plugin.scan.plid.all" = FALSE,
"plugin.scan.Acrobat" = "99.0",
"browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))
driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open PDF file
option$clickElement()
# Find iframes in the webpage
web.elem <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem, function(x){x$getElementAttribute("id")}) # see their names
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # see their names
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # Needs some time to finish the download before closing the window
remote_driver$close() # Close the window
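If you also want to shut down the Selenium server that rsDriver() started (it lives in the same list as the client), you can stop it as well:
driver[["server"]]$stop()  # stop the local Selenium server started by rsDriver()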
So I have tried the traditional approaches:
library(curl)  # curl() comes from the curl package
data <- curl("https://www.covid19india.org")
or
readLines("https://www.covid19india.org")
but was not able to extract the data.
The data I want is at the district level. For example, if we go to this URL and click on Maharashtra, we can see all the districts related to Maharashtra, and similarly for the other states.
Any guidance will be of great help.
My humble solution to your problem would be to use RSelenium to remote-control a web browser, access the page you are showing, and click on the element you wish to open. After this, a page read should show the required information.
A raw example to show it works (just tried it):
library(RSelenium)
library(rvest)
I use Firefox, but Chrome and other options are available:
driver <- rsDriver(browser = "firefox")
remDr <- driver[['client']]
Navigate to the page:
remDr$navigate("https://www.covid19india.org/")
Get the content of the page into R and read it as HTML:
src <- remDr$getPageSource()[[1]]
pg <- read_html(src)
As you noted, there is no Mumbai information yet:
no_mumbai <- pg %>% html_node(xpath='/html/body/div/div/div/div[2]/div[1]/table/tbody[1]') %>% html_text()
Get the button by CSS selector and then "click" on it:
maharastra <- remDr$findElement(using = "css selector", ".table > tbody:nth-child(2) > tr:nth-child(1) > td:nth-child(1)")
maharastra$clickElement()
Get the content of the page into R again and read it as HTML:
src <- remDr$getPageSource()[[1]]
pg <- read_html(src)
Now I just read the same info again and Mumbai shows up!
with_mumbai <- pg %>% html_node(xpath='/html/body/div/div/div/div[2]/div[1]/table/tbody[1]') %>% html_text()
This code is not perfect (probably far from it), but it does the job. You would have to combine it with loops and parsing to get the clean info, as sketched below. Most probably there are better data sources, such as a governmental API.
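To turn the raw text into something tabular, one option (a sketch; the table index is an assumption and may need adjusting to the page's current markup) is to let rvest parse the page's tables into data frames instead of reading them as plain text:
# html_table() returns one data frame per <table> node found on the page
state_tables <- pg %>% html_table(fill = TRUE)
districts <- state_tables[[1]]  # pick the table holding the district rows; the index is an assumption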
I have this RSelenium setup (using selenium really shouldn't impact the answer to this question):
library(tidyverse)
library(rvest)
library(httr)
library(RSelenium) # running through docker
## RSelenium setup
remDr <- remoteDriver(port = 4445L, browserName = "chrome")
remDr$open()
## Navigate to Google Books
remDr$navigate("https://books.google.com/")
books <- remDr$findElement(using = "css", "[name = 'q']")
## Search for whatever, the Civil War, for example
books$sendKeysToElement(list("the civil war", key = "enter"))
## Getting Google web elements (10 per page)
bookElem <- remDr$findElements(using = "xpath", "//h3[@class = 'LC20lb']//parent::a")
## Click on each book link
links <- sapply(bookElem, function(bookElem){
bookElem$getElementAttribute("href")
})
This works great - and compiles all of the links from the first page of results (Google automatically limits it to 10 results, so ten links). What I would like is to have that same links vector compile every result link from the first, say, 12 pages (to keep it manageable). So:
goog_pgs <- seq(1:12) # to set the limit
Where I'm lost: how do I feed that into my links vector? The links from each page are too different and aren't simple enough to just feed the number to its end. I've tried inserting the following:
nextButton <- remDr$findElements("xpath", "//*[@id = 'pnnext']")
next_page <- sapply(nextButton, function(nextButton) {
  nextButton$clickElement()
})
And that does not work. What's the solution here?
You can use the sequence 1:12 as something to iterate over, using a for loop, lapply, or another looping mechanism. I have a terrible time with the apply functions, so I swapped in map from purrr. The steps that need to be done repeatedly are finding the books, getting the href of each book, and clicking the "next" button. With some modification, you can use:
books_12 <- map(1:12, function(pg) {
  bookElem <- remDr$findElements(using = "xpath", "//h3[@class = 'LC20lb']//parent::a")
  links <- map_chr(bookElem, ~ .$getElementAttribute("href")[[1]])
  nextButton <- remDr$findElement("xpath", "//*[@id='pnnext']")
  nextButton$clickElement()
  links
})
Note that getElementAttribute returns a list; since each element only has one href, I kept the first (only) one with [[1]]. This yields a list of 12 vectors of 10 URLs each.
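If you then want a single flat vector of links rather than a list of vectors, you can collapse it afterwards, for example:
all_links <- unlist(books_12)  # one character vector containing every collected href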
This does not really align with Stack Overflow policy since I am not showing what I have done, but I really have no clue how to even start on this question given my lack of technical expertise. I hope someone can post a solution or at least point me in the right direction.
I want to download all the data from this website:
http://aps.dac.gov.in/APY/Public_Report1.aspx
I need to download all the data, i.e. all seasons * all years * all states * all crops. The longer (frustrating!) approach is to just click all the boxes and press download.
However, I was wondering if anyone has any programming solution to download this data. I would preferably want to do this in R because that's the language I understand but feel free to tag other programming languages.
Here's a solution using RSelenium to start a browser instance and direct it to do your bidding.
library(RSelenium)
driver <- rsDriver()
remDr <- driver[["client"]]
remDr$navigate("http://aps.dac.gov.in/APY/Public_Report1.aspx") #navigate to your page
You basically need to tell the browser to select each checkbox you want to mark, using SelectorGadget to find the unique ID for each, then pass them one by one to findElement. Then use the webElem methods to make the page do things.
webElem <- remDr$findElement(using = 'id', value = "TreeViewSeasonn0CheckBox")
webElem$highlightElement() #quick flash as a check we're in the right box
webElem$clickElement() #performs the click
#now do the same for each other box
webElem <- remDr$findElement(using = 'id', value = "TreeView1n0CheckBox")
webElem$highlightElement()
webElem$clickElement()
webElem <- remDr$findElement(using = 'id', value = "TreeView2n0CheckBox")
webElem$highlightElement()
webElem$clickElement()
webElem <- remDr$findElement(using = 'id', value = "TreeViewYearn0CheckBox")
webElem$highlightElement()
webElem$clickElement()
Now choose the report form you want and click the download button. Assuming it's Excel format here.
webElem <- remDr$findElement(using = 'id', value = "DdlFormat")
webElem$sendKeysToElement(list("Excel", key = "enter"))
webElem <- remDr$findElement(using = 'id', value = "Button1")
webElem$clickElement() #does the click
For what it's worth, the site timed out on trying to download all the data for me. Your results may vary.
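As a side note, the four checkbox blocks above only differ in the element id, so you could wrap the find/highlight/click pattern in a small helper (a sketch reusing the same ids as above):
check_box <- function(remDr, id) {
  webElem <- remDr$findElement(using = "id", value = id)
  webElem$highlightElement()  # quick visual check that the right box was found
  webElem$clickElement()
}
ids <- c("TreeViewSeasonn0CheckBox", "TreeView1n0CheckBox",
         "TreeView2n0CheckBox", "TreeViewYearn0CheckBox")
for (id in ids) check_box(remDr, id)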
I’m trying to automate browsing on a site with RSelenium in order to retrieve the latest planned release dates. My problem is that an age check pops up when I visit the URL. The age-check page consists of two buttons, which I haven’t managed to click through RSelenium. The code I use so far is appended below; what is the solution to this problem?
# Variable and URL
s4 <- "https://www.systembolaget.se"
#Start Server
rd <- rsDriver()
remDr <- rd[["client"]]
#Load Page
remDr$navigate(s4)
webE <- remDr$findElements("class name", "action")
webE$isElementEnabled()
webE$clickElement()
You need to target the selector more accurately:
# Variable and URL
s4 <- "https://www.systembolaget.se"
#Start Server
rd <- rsDriver()
remDr <- rd[["client"]]
#Load Page
remDr$navigate(s4)
webE <- remDr$findElement("css", "#modal-agecheck .action.primary")
webE$clickElement()
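If the modal is not in the DOM yet when findElement runs, you will still get a NoSuchElement error; one simple way to guard against that (a sketch, not the only option) is a short retry loop before clicking:
webE <- NULL
for (i in 1:5) {  # retry up to five times, one second apart
  webE <- tryCatch(
    remDr$findElement("css", "#modal-agecheck .action.primary"),
    error = function(e) NULL
  )
  if (!is.null(webE)) break
  Sys.sleep(1)
}
webE$clickElement()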