Getting information with web scraping from a multi-screen web page in R

I am trying to get some information about enterprises from the Internet. Most of the information is located on this page: http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul. The page looks like this:
On this page I have to click on the tab Busqueda de Companias, and that is where the interesting part starts. When I click, I get the next screen:
On this page I have to select the option Nombre and then insert a string with a name. For example, if I add the string PROAÑO & ASOCIADOS CIA. LTDA. I get the next screen:
Then I have to click on Buscar and I get the next screen:
On this screen I have the information for the enterprise. Then I have to click on the tab Informacion Estados Financieros and I get the next screen:
On this final screen I have to click on the tab Estado Situacion, which shows the enterprise's information in the columns Codigo de la cuenta contable, Nombre de la cuenta contable and Valor. I would like to get that information saved in a dataframe. The complicated part starts when I have to select the element Nombre, insert a string, click Buscar, and keep clicking until I reach the tab Informacion Estados Financieros. I have tried using html_session and html_form from the rvest package, but the elements are empty.
Could you help me with some steps to solve this problem?

RSelenium Code Example
Here is a self-contained code example, using the website referenced in the question.
Observation: please do not run this code.
Why? Having 1,000 Stack Overflow users hit the website would amount to a DDoS attack.
Prerequisites
The code below will install RSelenium. Before running the code you need to:
Install Firefox
Add the Selenium IDE Plugin
https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
Install RStudio [Recommendation]
Create a project and open the code file below
The code below will take you from the second page [http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul] through to the final page that contains the information you are interested in.
Useful References:
If you are interested in using RSelenium I strongly recommend you read the following references, thanks go to John Harrison for developing the RSelenium package.
RSelenium Basics
http://rpubs.com/johndharrison/12843
RSelenium Headless Browsing
http://rpubs.com/johndharrison/RSelenium-headless
RSelenium Vignette
https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html
Code Example
# We want to make this as easy as possible to use
# So we need to install required packages for the user...
#
if (!require(RSelenium)) install.packages("RSelenium")
if (!require(XML)) install.packages("XML")
if (!require(RJSONIO)) install.packages("RJSONIO")
if (!require(stringr)) install.packages("stringr")
# Data
#
mainPage <- "http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul"
businessPage <- "http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul"
# StartServer
# We assume RSelenium is not setup, so we check if the RSelenium
# server is available, if not we install RSelenium server.
checkForServer()
# OK. now we start the server
RSelenium::startServer()
remDr <- RSelenium::remoteDriver$new()
# We assume the user has installed Firefox and the Selenium IDE
# https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
#
# OK, we open Firefox
remDr$open(silent = T) # Open up a firefox window...
# Now we open the browser and required URL...
# This is the page that matters...
remDr$navigate(businessPage)
# First things first: on the first page, let's get the elements for the radio button,
# the name field, and the search button. We need all three.
#
radioButton <- remDr$findElements(using = 'css selector', ".z-radio-cnt")
nameElement <- remDr$findElements(using = 'css selector', ".z-combobox-inp")
searchButton <- remDr$findElements(using = 'css selector', ".z-button-cm")
# Optional: we can highlight the radio elements returned
# lapply(radioButton, function(x){x$highlightElement()})
# Optional: we can highlight the nameElement returned
# lapply(nameElement, function(x){x$highlightElement()})
# Optional: we can highlight the searchButton returned
# lapply(searchButton, function(x){x$highlightElement()})
# Now we can select and press the third radio button
radioButton[[3]]$clickElement()
# We fill in the required name...
nameElement[[1]]$sendKeysToElement(list("PROAÑO & ASOCIADOS CIA. LTDA."))
# This is subtle but required: the page triggers a drop-down list, so rather than
# hitting the searchButton straight away, we first select the matching item in the drop-down menu...
selectElement <- remDr$findElements(using = 'css selector', ".z-comboitem-text")
selectElement[[1]]$clickElement()
# OK, now we can click the search button, which will cause the next page to open
searchButton[[1]]$clickElement()
# New Page opens...
#
# Ok, so now we first pull the list of buttons...
finPageButton <- remDr$findElements(using = 'class name', "m_iconos")
# Now we can press the required button to open the page we want to get too...
finPageButton[[9]]$clickElement()
# We are now on the required page.
We are now on the target page [see image].
Extracting the table values...
The next step is to extract the table values. To do this, we pull the elements matching the .z-listitem CSS selector and check that we see the lines of data. We do, so we can extract the returned values and populate either a list or a dataframe.
# Ok, now we need to extract the table, we identify and pull out the
# '.z-listitem' and assign to modalWindow
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
# Now we can extract the lines from modalWindow. Each line is returned as a
# single string, so we split it into three parts on the line break '\n'.
lineText <- str_split(modalWindow[[1]]$getElementText()[1], '\n')
lineText
Here is the result:
> lineText <- stringr::str_split(modalWindow[[1]]$getElementText()[1], '\n')
> lineText
[[1]]
[1] "10"
[2] "OPERACIONES DE INGRESO CON PARTES RELACIONADAS EN PARAÍSOS FISCALES, JURISDICCIONES DE MENOR IMPOSICIÓN Y REGÍMENES FISCALES PREFERENTES"
[3] "0.00"
Dealing with Hidden Data.
The Selenium WebDriver and thus RSelenium only interact with visible elements of a web page. If we try to read the entire table, we will only return table items that are visible (unhidden).
We can work around this by scrolling to the bottom of the table; the scroll action forces the table to populate all of its rows. We can then extract the complete table.
# Select the .z-listbox-body
modalWindow <- remDr$findElements(using = 'css selector', ".z-listbox-body")
# Now we tell the window we want to scroll to the bottom of the table
# This triggers the table to populate all the rows
modalWindow[[1]]$executeScript("window.scrollTo(0, document.body.scrollHeight)")
# Now we can extract the complete table
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
lineText <- stringr::str_split(modalWindow[[9]]$getElementText(), '\n')
lineText
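As a final step, and to get back to the original goal of a dataframe, the scraped rows can be reshaped into a three-column data.frame. This is a minimal sketch assuming every visible row splits cleanly into account code, account name and value; the column names are my own shorthand for Codigo de la cuenta contable, Nombre de la cuenta contable and Valor:
# Sketch: turn the scraped rows into a dataframe.
# Assumes each element of 'modalWindow' splits into exactly three parts.
rowList <- lapply(modalWindow, function(el) {
  parts <- stringr::str_split(el$getElementText()[[1]], "\n")[[1]]
  if (length(parts) == 3) parts else NULL
})
rowList <- Filter(Negate(is.null), rowList)
balanceSheet <- data.frame(
  codigo = sapply(rowList, `[`, 1),
  cuenta = sapply(rowList, `[`, 2),
  # as.numeric() assumes plain decimal values; adjust if the page uses thousands separators
  valor  = as.numeric(sapply(rowList, `[`, 3)),
  stringsAsFactors = FALSE
)
head(balanceSheet)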
What the code does
The code example above is meant to be self-contained. By that I mean it should install everything you need, including required packages. Once the dependent R packages are installed, the code calls checkForServer(); if the Selenium server is not installed, the call will install it. This may take some time.
My recommendation is that you step through the code, as I have not incorporated any delays (in production you would want to). Note also that I have not optimised for speed but rather for a modicum of clarity [from my perspective].
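If you do add delays, a simple polling helper is usually more robust than fixed Sys.sleep() calls. The helper below is my own addition, not part of the example above; it just retries findElements() until the selector matches something or a timeout is hit:
# Sketch: wait until a CSS selector matches at least one element, or time out.
waitForElements <- function(remDr, css, timeout = 10, poll = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    elems <- remDr$findElements(using = "css selector", value = css)
    if (length(elems) > 0) return(elems)
    if (Sys.time() > deadline) stop("Timed out waiting for: ", css)
    Sys.sleep(poll)
  }
}
# Example: wait for the result rows before reading them
# modalWindow <- waitForElements(remDr, ".z-listitem")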
The code was shown to work on:
Mac OS X 10.11.5
RStudio 0.99.893
R version 3.2.4 (2016-03-10) -- "Very Secure Dishes"

Check out RSelenium
First, install RSelenium and use the above-linked vignette to get familiar with the basics.
Then see this webinar on using RSelenium, which goes through some detailed scraping step-by-step and is quite easy to follow:
http://johndharrison.blogspot.hk/2014/05/orange-county-r-users-group-oc-rug.html

Related

Webscraping images in r and saving them into a zip file

I am trying to webscrape information from this website: https://www.nea.gov.sg/weather/rain-areas and download the 240km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of end points, and save the images to a zip file.
Edit: Is there a way to do it with just rvest and avoiding the usage of for loops?
I've found out that the image address can be acquired by clicking on the image and selecting copy image address. An example: https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png
I've noted that the string of numbers represents the date and time, so the one I'd need would be 20220731xxxxxxx where x would be the time. However, how would I then use this to webscrape?
Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.
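For reference, here is a minimal sketch of how the timestamp pattern described above could be turned into URLs and downloaded directly, assuming the historical images are still served under the same path (the URL template is copied from the example link above; this is an illustration of the pattern, not a tested solution):
# Sketch: build the radar-image URLs from the timestamp pattern and download them.
# Assumption: historical images for 2022-07-31 are still available at this path.
times <- seq(
  from = as.POSIXct("2022-07-31 01:00:00", tz = "Asia/Singapore"),
  to   = as.POSIXct("2022-07-31 03:00:00", tz = "Asia/Singapore"),
  by   = "5 min"
)
stamps <- format(times, "%Y%m%d%H%M")
urls <- paste0(
  "https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_",
  stamps, "0000dBR.dpsri.png"
)
dir.create("radar", showWarnings = FALSE)
files <- file.path("radar", basename(urls))
Map(function(u, f) try(download.file(u, f, mode = "wb")), urls, files)
zip("radar_scans.zip", files)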
You can consider the following code to save screenshots of the webpage:
library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Elem <- remDr$findElement("xpath", '//*[@id="rain-area-slider"]/div/button')
web_Elem$clickElement()
for (i in 1:10) {
  print(i)
  Sys.sleep(1)
  path_To_File <- paste0("C:/file", i, ".png")
  remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}
Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.
1. Load packages
We will begin by loading our packages of interest:
# Load packages ----
pacman::p_load(
httr,
png,
purrr,
RSelenium,
rvest,
servr
)
2. Setup
Now, we need to start the Selenium server with firefox. The following code will start a firefox instance. Run it and wait for firefox to launch:
# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())
Now that the firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:
cl <- rsd$client
The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:
# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")
Let's get scraping
Now we're going to begin the actual scraping! @EmmanuelHamel took the clever approach of simply clicking on the "play" button in order to launch the automatic "slideshow". He then took a screenshot of the webpage every second in order to capture the changes in the image. The approach I use is somewhat different.
In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and I click on each step one after the other. After clicking on a step, I get the URL of the image, then I move on to the next step... all the way to the 13th step.
Here I get the HTML element for each step:
# Get the selector for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]
Then, I click on each element and get the image URL at each step. After you run this code, check how your code manipulates the webpage on the firefox instance, isn't that cool?
img_urls <- map_chr(rail_steps, function(step){
cl$mouseMoveToLocation(webElement = step)
cl$click()
img_el <- cl$findElement(using = "css", value = "#rain_overlay")
Sys.sleep(1)
img_el$getElementAttribute(attrName = "src")[[1]]
})
Finally, I create an image folder img where I download and save the images:
# Create an image folder then download all images in it ----
dir.create("img")
walk(img_urls, function(img_url){
GET(url = img_url) |>
content() |>
writePNG(target = paste0("img/", basename(img_url)))
})
Important
The downloaded images do not contain the background map on the webpage... only the points! You can download the background map then lay the points on top of it (using an image processing software for example). Here is how to download the background map:
# Download the background map----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
content() |>
writePNG(target = "base_image.png")
If you want to combine the images programmatically, you may want to look into the magick package in R.
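For example, a minimal magick sketch for laying one of the downloaded overlays on top of the base map could look like the following (the file names assume the downloads created above; the resize step assumes both images cover the same extent, which may need adjusting):
# Sketch: composite a downloaded radar overlay onto the background map with magick.
library(magick)
base_map <- image_read("base_image.png")
overlay  <- image_read(list.files("img", full.names = TRUE)[1])
# Scale the overlay to the base map's width before compositing
overlay  <- image_resize(overlay, geometry_size_pixels(image_info(base_map)$width))
combined <- image_composite(base_map, overlay, operator = "over")
image_write(combined, path = "combined_1.png")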

Scraping a Webpage with RSelenium

I am trying to scrape this link here: https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer -- and return the player props on the page in some sort of workable table within R where I can clean it to a final result.
I am working with the RSelenium package in combination with the tidyverse and rvest in order to scrape this info into R. I have had success on other pages on this website in the past, but can't seem to crack this one.
I've gotten as far as Inspecting the webpage down to the most granular <div> that contains the entire list of players on the page, and copied the corresponding xpath from that line of the inspection.
My code looks as such:
# Run this code to scrape the player props for goals from draftkings
library(tidyverse)
library(RSelenium)
library(rvest)
# start up local selenium server
rD <- rsDriver(browser = "chrome", port=6511L, chromever = "96.0.4664.45")
remote_driver <- rD$client
# Open chrome
remote_driver$open()
# Navigate to URL
url <- "https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer"
remote_driver$navigate(url)
# Find the table via the XML path
table_xml <- remote_driver$findElement(using = "xpath", value = "//*[@id='root']/section/section[2]/section/div[3]/div/div[3]/div/div/div[2]/div")
# Locates the table, turns it into a list, and binds into a single dataframe
player_prop_table <- table_xml$getElementAttribute("innerHTML")
That last line, instead of returning a workable list, tibble, or dataframe like I'm used to, returns a large list that contains the same raw HTML I see in the Chrome inspect tool.
What am I missing here in terms of successfully scraping this page?
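One possibility worth trying (a sketch, not a verified solution for this page): getElementAttribute("innerHTML") returns raw HTML as a string, so it still needs to be parsed before it becomes a table. Since rvest is already loaded, the string can be passed to read_html() and picked apart with CSS selectors; the selectors below are hypothetical placeholders that would have to be matched to DraftKings' actual markup.
# Sketch: parse the raw innerHTML with rvest; the CSS selectors are placeholders.
raw_html <- table_xml$getElementAttribute("innerHTML")[[1]]
page <- rvest::read_html(raw_html)
players <- page |> rvest::html_elements(".player-name-selector") |> rvest::html_text2()
odds    <- page |> rvest::html_elements(".odds-selector") |> rvest::html_text2()
# The two vectors must line up one-to-one before binding them into a table
player_prop_table <- tibble::tibble(player = players, odds = odds)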

How to download embedded PDF files from webpage using RSelenium?

EDIT: From the comments I received so far, I managed to use RSelenium to access the PDF files I am looking for, using the following code:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()
Now, I need R to click the download button, but I could not manage to do so. I tried:
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()
But I get the following error:
Selenium message:Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
Erro: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Can someone tell me what is wrong here?
Thanks!
Original question:
I have several webpages from which I need to download embedded PDF files and I am looking for a way to automate it with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398
This is a webpage from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent to the US Securities and Exchange Commission - SEC) to download Notes to Financial Statements (Notas Explicativas) from Brazilian companies.
I tried several options but the website seems to be built in a way that makes it difficult to extract the direct links.
I tried what is suggested here in Downloading all PDFs from URL, but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector.
Similarly, when I tried the approach here, https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, html_attr("href") returns an empty vector.
I am not used to web scraping codes in R, so I cannot figure out what is happening.
I appreciate any help!
If someone is facing the same problem I did, I am posting the solution I used:
# set Firefox profile to download PDFs automatically
pdfprof <- makeFirefoxProfile(list(
"pdfjs.disabled" = TRUE,
"plugin.scan.plid.all" = FALSE,
"plugin.scan.Acrobat" = "99.0",
"browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))
driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open the PDF file
option$clickElement()
# Find iframes in the webpage
web.elem <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem, function(x){x$getElementAttribute("id")}) # see their names
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # see their names
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # Need sometime to finish download and then close the window
remote_driver$close() # Close the window

Web Scraping in R when you have inputs

I'm trying to write code for web scraping in R when you have to provide inputs.
Specifically, I have a platform where I need to complete 2 fields, then click submit and get the results.
But I don't know how to use my R columns as inputs on the platform.
I searched for an example but I didn't find any.
Please, could anyone give me a simple example?
Thank you
EDIT:
I don't have any code yet. I was looking for an example where you use an input to complete a field on a site and then scrape the result.
The photo shows the fields at my URL. In R I have a dataframe with 2 columns, one for CNP/CUI and one for VIN/SASIU, with 100 rows or more, and I want to use these columns as inputs and take the output for every row.
EDIT2:
The example provided by @Dominik S. Meier worked for me when I had a list of inputs. For column inputs I will post another question.
But until then I want to mention a few things that helped me; maybe they will help somebody else.
You need to be sure that all the versions match: R version, browser version, browser driver version, Java version. For me the chromedriver version didn't match, even though I downloaded the right version. The problem was that I had three Chrome versions installed and I think it didn't choose the right one. I fixed it with: rD <- rsDriver(browser = c("chrome"), port = 4444L, chromever = "83.0.4103.39").
Because one element didn't have an id like in e.g. webElem <- remDr$findElement(using = "id", "trimite"), I used a CSS selector. You can find the CSS selector with right click -> Copy -> Copy selector (in the HTML code of the page).
If you don't get the results, maybe you are not using the right selector. I did that and the result was list(). Then I tried other CSS selectors from higher up in the HTML code. I don't know if it is the right solution, but for me it worked.
Hope it will help. Thank you.
Using RSelenium (see here for more info):
library(RSelenium)
rD <- rsDriver(browser = c("firefox")) #specify browser type you want Selenium to open
remDr <- rD$client
remDr$navigate("https://pro.rarom.ro/istoric_vehicul/dosar_vehicul.aspx") # navigates to webpage
# select first input field
option <- remDr$findElement(using='id', value="inputEmail")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("email#email.com"))
# select second input field
option <- remDr$findElement(using='id', value="inputEmail2")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("email@email.com"))
# select third input field
option <- remDr$findElement(using='id', value="inputVIN")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("123"))
#press key
webElem <- remDr$findElement(using = "id", "trimite")
webElem$highlightElement()
webElem$clickElement()
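To extend this to a whole dataframe of inputs (the case mentioned in the question), one option is to wrap the steps above in a function and call it once per row, collecting whatever you need from the results page. This is a rough sketch under the assumption that the form can simply be reloaded for each submission; the element IDs are the ones used above, while the dataframe 'inputs' and the ".rezultat" results selector are placeholders:
# Sketch: submit the form once per row of a dataframe of inputs.
# 'inputs' is assumed to have columns 'cui' and 'vin'; ".rezultat" is a
# placeholder selector for whatever element holds the results.
scrape_row <- function(remDr, cui, vin) {
  remDr$navigate("https://pro.rarom.ro/istoric_vehicul/dosar_vehicul.aspx")
  remDr$findElement(using = "id", "inputEmail")$sendKeysToElement(list(as.character(cui)))
  remDr$findElement(using = "id", "inputVIN")$sendKeysToElement(list(as.character(vin)))
  remDr$findElement(using = "id", "trimite")$clickElement()
  Sys.sleep(2)  # give the results time to load
  res <- remDr$findElements(using = "css selector", value = ".rezultat")
  sapply(res, function(x) x$getElementText()[[1]])
}
results <- lapply(seq_len(nrow(inputs)), function(i) {
  scrape_row(remDr, inputs$cui[i], inputs$vin[i])
})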

Use RSelenium to perform drag-and-drop action

I would like to use RSelenium to download files (by clicking on the Excel image) from this website http://highereducationstatistics.education.gov.au/. However, before downloading the file, a series of drag-and-drop actions (see this image http://highereducationstatistics.education.gov.au/images/preDragDimension.png) have to be performed so that the right dataset can be chosen (see http://highereducationstatistics.education.gov.au/GettingStarted.aspx for instructions).
I am wondering whether RSelenium has this type of drag-and-drop functionality. I have searched this whole day and guess that mouseMoveToLocation combined with other functions like buttondown might be the answer, but I have no idea how to use them. Can anyone help with this?
Thanks very much.
First navigate with RSelenium to the page using:
library(RSelenium)
rD <- rsDriver() # runs a chrome browser, wait for necessary files to download
remDr <- rD$client
remDr$navigate("http://highereducationstatistics.education.gov.au/")
Then locate the element you want to drag to the chart/grid. In this example I will be selecting the Course Level from the left menu.
webElem1 <- remDr$findElement('xpath', "//span[@title = 'Course Level']")
Select the element where you want to drop this menu item. In this case the element has an id = "olapClientGridXrowHeader":
webElem2 <- remDr$findElement(using = 'id', "olapClientGridXrowHeader")
Once both items are selected, drag the first one into the second one like this:
remDr$mouseMoveToLocation(webElement = webElem1)
remDr$buttondown()
remDr$mouseMoveToLocation(webElement = webElem2)
remDr$buttonup()
Notice that these methods work on the remote driver, not the elements.
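Since several dimensions usually have to be dragged into place before the right dataset is shown, it can be convenient to wrap those four calls in a small helper. This is just a convenience wrapper around the calls shown above; the dimension name in the commented example is hypothetical:
# Sketch: reusable drag-and-drop helper built from the calls above.
dragTo <- function(remDr, fromElem, toElem) {
  remDr$mouseMoveToLocation(webElement = fromElem)
  remDr$buttondown()
  remDr$mouseMoveToLocation(webElement = toElem)
  remDr$buttonup()
}
# Example usage: drag another dimension onto the row header
# webElem3 <- remDr$findElement('xpath', "//span[@title = 'Some Other Dimension']")
# dragTo(remDr, webElem3, webElem2)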
