Web scraping in R where input is required

I have used the rvest package in R to scrape unique URLs before.
However, I am now stuck on a particular website. The URL stays static, and I need to select a series of dropdowns and then scrape the table that appears.
It would be helpful if someone could point me in the right direction for websites like these. Is R even capable of doing this?
Edit: I have done some research and it seems RSelenium can handle such tasks. Unfortunately, I have no exposure to it. Can someone recommend an example/blog/material online on using Selenium specifically for clicking and scraping, for someone as much of a noob as I am?

I have written a blog post about an RSelenium example:
https://guillaumepressiat.github.io/blog/2021/04/RSelenium-paginated-tables
This website covers a lot about Selenium; you will have to map it to the RSelenium API package (the verbs are almost the same in all languages: findElement, etc.): https://www.guru99.com/selenium-tutorial.html
But as an example based on your question, maybe something like this to begin with:
# https://stackoverflow.com/q/67021563/10527496
# java -jar selenium-server-standalone-3.9.1.jar
library(RSelenium)
library(tidyverse)
library(rvest)
library(httr)
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4444L, # change port according to terminal
  browserName = "firefox"
)
remDr$open()
# remDr$getStatus()
url <- "https://fcainfoweb.nic.in/reports/Report_Menu_Web.aspx"
remDr$navigate(url)
Sys.sleep(5)
# first : radio buttons
u1 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_0')
u2 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_1')
u3 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_2')
u4 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_3')
dynam <- remDr$mouseMoveToLocation(webElement = u1)
u1$click()
Sys.sleep(5)
# second : Select input
s1 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Ddl_Rpt_Option0')
# get available choices
s_choices <- read_html(s1$getElementAttribute('innerHTML')[[1]]) %>%
  html_nodes('option') %>%
  html_attrs() %>%
  unlist() %>%
  .[3:length(.)] %>%
  as.vector()
dynam <- remDr$mouseMoveToLocation(webElement = s1)
s1$click()
s1$sendKeysToElement(sendKeys = list(s_choices[1], key = "enter"))
# s_choices[1] is "Daily Prices"
Sys.sleep(5)
# get date choices
s_date_choices <- remDr$findElement(using = "id", value = "ctl00_MainContent_Txt_FrmDate")
dynam <- remDr$mouseMoveToLocation(webElement = s_date_choices)
s_date_choices$click()
s_date_choices$sendKeysToElement(sendKeys = list('01/01/2021', key = "enter"))
Sys.sleep(5)
s_table <- remDr$findElement(using = "id", value = "Panel1")
# get first tables as an example
results_1 <- read_html(s_table$getElementAttribute('innerHTML')[[1]]) %>%
  html_table(fill = TRUE) %>%
  .[2:length(.)]
We get a list of three tables as a result.
Making a function from this code to loop over a vector of dates should be possible after that (you will probably have to reload a fresh start page at the base URL for each date).
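For instance, here is a minimal sketch of such a loop, assuming the element IDs used above stay stable across page reloads (the date format is assumed to be dd/mm/yyyy, as in the example above):
# sketch only: wrap the steps above in a function and lapply over dates
scrape_date <- function(date, remDr) {
  remDr$navigate("https://fcainfoweb.nic.in/reports/Report_Menu_Web.aspx")
  Sys.sleep(5)
  remDr$findElement(using = "id", "ctl00_MainContent_Rbl_Rpt_type_0")$clickElement()
  Sys.sleep(5)
  s1 <- remDr$findElement(using = "id", "ctl00_MainContent_Ddl_Rpt_Option0")
  s1$clickElement()
  s1$sendKeysToElement(list("Daily Prices", key = "enter"))
  Sys.sleep(5)
  d1 <- remDr$findElement(using = "id", "ctl00_MainContent_Txt_FrmDate")
  d1$clickElement()
  d1$sendKeysToElement(list(date, key = "enter")) # date as "dd/mm/yyyy" (assumed)
  Sys.sleep(5)
  panel <- remDr$findElement(using = "id", "Panel1")
  read_html(panel$getElementAttribute("innerHTML")[[1]]) %>%
    html_table(fill = TRUE) %>%
    .[2:length(.)]
}
dates <- format(seq(as.Date("2021-01-01"), as.Date("2021-01-05"), by = "day"), "%d/%m/%Y")
all_tables <- lapply(dates, scrape_date, remDr = remDr)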

Related

Extract href tag using RSelenium

I am trying to get the store addresses of Apple Stores for multiple countries using RSelenium.
library(RSelenium)
library(tidyverse)
library(netstat)
# start the server
rs_driver_object <- rsDriver(browser = "chrome",
                             chromever = "100.0.4896.60",
                             verbose = F,
                             port = free_port())
# create a client object
remDr <- rs_driver_object$client
# maximise window size
remDr$maxWindowSize()
# navigate to the website
remDr$navigate("https://www.apple.com/uk/retail/storelist/")
# click on search bar
search_box <- remDr$findElement(using = "id", "dropdown")
country_name <- "United States" # for a single country. I can loop over multiple countries
# in the search box, pass on the country name and hit enter
search_box$sendKeysToElement(list(country_name, key = "enter"))
search_box$clickElement() # I am not sure if I need to click, but I am doing it anyway
The page now shows the location of each store. Each store has a hyperlink that takes me to the store's page, which contains the full address I want to extract.
However, I am stuck on how to click on each individual store in this last step.
I thought I would get the href for all the stores on the page:
store_address <- remDr$findElement(using = 'class', 'store-address')
store_address$getElementAttribute('href')
But it returns an empty list. Where do I go from here?
After obtaining the page with the list of stores, we can do the following (read_html() and friends come from rvest, which also needs to be loaded):
library(rvest)
link <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.state') %>%
  html_nodes('a') %>%
  html_attr('href') %>%
  paste0('https://www.apple.com', .)
[1] "https://www.apple.com/retail/thesummit/" "https://www.apple.com/retail/bridgestreet/"
[3] "https://www.apple.com/retail/anchorage5thavenuemall/" "https://www.apple.com/retail/chandlerfashioncenter/"
[5] "https://www.apple.com/retail/santanvillage/" "https://www.apple.com/retail/arrowhead/"

How to scrape hrefs embedded in a dropdown list of a web table using RSelenium in R

I'm trying to scrape links to all the minutes and agendas provided on this website: https://www.charleston-sc.gov/AgendaCenter/
I've managed to scrape the section IDs associated with each category (and the years for each category) to loop through the contents within each category-year (please see below). But I don't know how to scrape the hrefs that live inside those contents. Especially because the links to the agendas live inside the dropdown menu under 'Download', it seems I need to go through extra clicks to scrape the hrefs.
How do I scrape the minutes and agendas (inside the Download dropdown) for each table I select? Ideally, I would like a table with the date, the title of the agenda, the links to the minutes, and the links to the agenda.
I'm using RSelenium for this. Please see the code I have so far below, which allows me to click through each category and year, but not much else. Please help!
rm(list = ls())
library(RSelenium)
library(tidyverse)
library(httr)
library(XML)
library(stringr)
library(RCurl)
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
co <- str_match(t, 'aria-label="(.*?)"[ ]href="java')[,2]
yr <- str_match(t, 'id="(.*?)" aria-label')[,2]
df <- data.frame(cbind(co, yr)) %>%
  mutate_all(as.character) %>%
  filter_all(any_vars(!is.na(.))) %>%
  mutate(id = ifelse(grepl('^a0', yr), gsub('a0', '', yr), NA)) %>%
  tidyr::fill(c(co, id), .direction = 'down') %>%
  drop_na(co)
remDr <- remoteDriver(port=4445L, browserName = "chrome")
remDr$open()
remDr$navigate('https://www.charleston-sc.gov/AgendaCenter/')
remDr$screenshot(display = T)
for (j in unique(df$id)) {
  remDr$findElement(using = 'xpath',
                    value = paste0('//*[@id="cat', j, '"]/h2'))$clickElement()
  for (k in unique(df[which(df$id == j), 'yr'])) {
    remDr$findElement(using = 'xpath',
                      value = paste0('//*[@id="', k, '"]'))$clickElement()
    # NEED TO SCRAPE THE HREF ASSOCIATED WITH MINUTES AND AGENDA DOWNLOAD HERE #
  }
}
Maybe you don't really need to click through all the elements? You can use the fact that all downloadable links have ViewFile in their href:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
viewfile <- str_extract_all(t, '.*ViewFile.*', simplify = T)
viewfile <- viewfile[viewfile!='']
library(data.table) # I use data.table because it's more convenient - but can be done without too
dt.viewfile <- data.table(origStr=viewfile)
# list the elements and patterns we will be looking for:
searchfor <- list(
  Title = 'name=[^ ]+ title=\"(.+)\" href',
  Date = '<strong>(.+)</strong>',
  href = 'href=\"([^\"]+)\"',
  label = 'aria-label=\"([^\"]+)\"'
)
for (this.i in names(searchfor)) {
  this.full <- paste0('.*', searchfor[[this.i]], '.*')
  dt.viewfile[grepl(this.full, origStr), (this.i) := gsub(this.full, '\\1', origStr)]
}
# Clean records:
dt.viewfile[, `:=`(Title = na.omit(Title), Date = na.omit(Date), label = na.omit(label)),
            by = href]
dt.viewfile[,Date:=gsub('<abbr title=".*">(.*)</abbr>','\\1',Date)]
dt.viewfile <- unique(dt.viewfile[,.(Title,Date,href,label)]); # 690 records
What you have as a result is a table with links to all the downloadable files. You can now download them using any tool you like, for example download.file() or httr::GET():
dt.viewfile[, full.url:=paste0('https://www.charleston-sc.gov', href)]
dt.viewfile[, filename:=fs::path_sanitize(paste0(Title, ' - ', Date), replacement = '_')]
for (i in seq_len(nrow(dt.viewfile[1:10,]))) { # remove `1:10` limitation to process all records
  url <- dt.viewfile[i, full.url]
  destfile <- dt.viewfile[i, filename]
  cat('\nDownloading', url, ' to ', destfile)
  fil <- GET(url, write_disk(destfile))
  # our destination file doesn't have an extension; we need to get it from the server:
  serverFilename <- gsub("inline;filename=(.*)", '\\1', headers(fil)$`content-disposition`)
  serverExtension <- tools::file_ext(serverFilename)
  # add the extension to the file we just saved
  file.rename(destfile, paste0(destfile, '.', serverExtension))
}
Now the only problem is that the original webpage only shows records for the last three years. But instead of clicking View More through RSelenium, we can simply load the page with earlier dates, something like this:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017', encoding = 'UTF-8')
Then repeat the rest of the code as necessary.
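For example, a small helper to build that search URL for an arbitrary date window, using the query parameters visible in the URL above:
# helper to build the AgendaCenter search URL for a given date range
agenda_search_url <- function(start, end) {
  paste0('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all',
         '&startDate=', format(start, '%m/%d/%Y'),
         '&endDate=', format(end, '%m/%d/%Y'))
}
t <- readLines(agenda_search_url(as.Date('2014-10-14'), as.Date('2017-10-14')),
               encoding = 'UTF-8')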

Can't find and click on a dynamic element with RSelenium

I'm using RSelenium to click on a dynamic element after a search on this webpage: http://www.in.gov.br/web/guest/inicio.
Every time I search for a word, I would like to find the words/link 'Ministério Da Educação' (the Portuguese equivalent of Ministry of Education) on the right side of the results page and click on it.
I have used the inspect-element feature of Google Chrome, but I am not having any success finding and clicking that element. I have already tried using XPath, CSS selectors, IDs, ...
I am using the following code:
## search parameters
string_search <- "contrato"
date_search <- format(
  as.Date("17/04/2019", "%d/%m/%Y"),
  "%d/%m/%Y") # Brazilian format
## start Selenium driver
library(RSelenium)
selCommand <- wdman::selenium(
  jvmargs = c("-Dwebdriver.firefox.verboseLogging=true"),
  retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE) # for Windows
# system(selCommand) # for Linux
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
## navigation & search
remDr$navigate("http://www.in.gov.br/web/guest/inicio")
Sys.sleep(5)
# from date
datefromkey <- remDr$findElement(using = 'css', "#calendario_advanced_from")
datefromkey$clickElement()
datefromkey$sendKeysToElement(list(key = "enter"))
datefromkey$clearElement()
datefromkey$sendKeysToElement(list(date_search))
datefromkey$sendKeysToElement(list(key = "enter"))
# to date
datetokey <- remDr$findElement(using = 'css', "#calendario_advanced_to")
datetokey$clickElement()
datetokey$sendKeysToElement(list(key = "enter"))
datetokey$clearElement()
datetokey$sendKeysToElement(list(date_search))
datetokey$sendKeysToElement(list(key = "enter"))
# string to search
wordkey <- remDr$findElement(using = 'css', "#input-advanced_search")
wordkey$sendKeysToElement(list('"', string_search, '"'))
# click search button
press_button <- remDr$findElement(using = 'class', "btn")
press_button$clickElement()
Here is where I struggle:
1) First attempt: using a broader tag
# using a broader tag
categorykey <- remDr$findElement(using = 'id', '_3_facetNavigation')
categorykey$getElementText()
With getElementText() I see that "Ministério da Educação" is there, but I do not know how to click on the link.
2) Second attempt: using the XPath
categorykey <- remDr$findElement('xpath',
  '//li[@id="yui_patched_v3_11_0_1_1555545676970_404"]/text()')
It returns an error. Selenium can't locate the element.
I found the solution myself after watching this video on YouTube:
How to locate Dynamic Elements in Selenium Webdriver - XPATH Tutorial
The code would be like this:
categorykey <- remDr$findElement('xpath',
  '//*[contains(@data-value, "ministério da educação")]')
categorykey$getElementText()
# just to see if it's right
categorykey$clickElement()
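More generally for dynamic elements, a small polling helper can stand in for fixed Sys.sleep() calls; this is a sketch, not part of the original answer:
# sketch: retry findElement until the element appears or we give up
wait_for_element <- function(remDr, using, value, timeout = 10) {
  for (i in seq_len(timeout)) {
    elem <- tryCatch(remDr$findElement(using = using, value = value),
                     error = function(e) NULL)
    if (!is.null(elem)) return(elem)
    Sys.sleep(1)
  }
  stop("element not found within ", timeout, " seconds: ", value)
}
categorykey <- wait_for_element(remDr, 'xpath',
  '//*[contains(@data-value, "ministério da educação")]')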

Scraping tables from an .aspx web page with multiple dropdown options

I would like to scrape the table data from this page: http://agmarknet.gov.in/PriceTrends/SA_Month_PriMar.aspx.
It asks you to select multiple options, such as "Commodity", "State", "Year", and "Month", and then press the submit button to get the table.
My attempt is to scrape the table associated with "Commodity" = "Tomato", "State" = "Karnataka", "Year" = "2016", and "Month" = all months of data. I am working with the following code in R:
url<-"http://agmarknet.gov.in/PriceTrends/SA_Month_PriMar.aspx"
pgsession <- html_session(url)
pgform <-html_form(pgsession)[[1]]
filled_form <- set_values(pgform,
                          "ctl00$cphBody$Commodit_list" = "Tomato",
                          "ctl00$cphBody$State_list" = "Karnataka",
                          "ctl00$cphBody$Yea_list" = "2016",
                          "ctl00$cphBody$Mont_list" = "January"
)
d <- submit_form(session=pgsession, form=filled_form)
y <- d %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(header = TRUE)
dim(y)
but I am getting an error message:
Submitting with 'ctl00$ddlDistrict'
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
  Internal Server Error (HTTP 500).
I am not able to scrape the required table from the web page. Please help me extract the table with the desired options from the page.
Here is a method that uses the RSelenium package to scrape data for all months of 2016.
library(RSelenium)
library(rvest)
library(tidyverse)
url <- "http://agmarknet.gov.in/PriceTrends/SA_Month_PriMar.aspx"
rD <- rsDriver()
remDr <- rD$client
lst <- lapply(seq(2, 13), function(x) {
  remDr$navigate(url)
  webElem_commodity <- remDr$findElement(using = "css", "#cphBody_Commodit_list")
  opts_commodity <- webElem_commodity$selectTag() # get all the associated option tags
  commodity_num <- which(opts_commodity$text == "Tomato") # find the required option
  opts_commodity$elements[[commodity_num]]$clickElement() # select the required option
  Sys.sleep(10) # for state names to load
  webElem_state <- remDr$findElement(using = "css", "#cphBody_State_list")
  opts_state <- webElem_state$selectTag()
  state_num <- which(opts_state$text == "Karnataka")
  opts_state$elements[[state_num]]$clickElement()
  Sys.sleep(10) # for years to load
  webElem_yr <- remDr$findElement(using = "css", "#cphBody_Yea_list")
  opts_yr <- webElem_yr$selectTag()
  yr_num <- which(opts_yr$text == "2016")
  opts_yr$elements[[yr_num]]$clickElement()
  Sys.sleep(10) # for months to load
  webElem_month <- remDr$findElement(using = "css", "#cphBody_Mont_list")
  opts_month <- webElem_month$selectTag()
  opts_month$elements[[x]]$clickElement() # select a different month in each lapply iteration
  Sys.sleep(10) # for the submit button to become active
  webElem_submit <- remDr$findElement(using = "css", "#cphBody_But_Submit")
  webElem_submit$clickElement()
  page_source <- remDr$getPageSource()
  read_html(page_source[[1]]) %>% # read the table
    html_nodes("table") %>%
    .[[5]] %>%
    html_table(header = TRUE, fill = TRUE, trim = TRUE) %>%
    head(-1) # drop the last row, which contains the averages at the bottom of the scraped table
})
remDr$close()
rD$server$stop()
# lst is a list with 12 elements; each element holds the data for one month of 2016
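Assuming the twelve tables share the same columns and that dropdown options 2 through 13 correspond to January through December, they can then be combined into a single data frame:
# combine the monthly tables into one data frame (column layout assumed identical)
names(lst) <- month.name
tomato_karnataka_2016 <- bind_rows(lst, .id = "month")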

RSelenium: Select option from dropdown

I use RSelenium to fill in a web form. To select an option from a dropdown I use the following:
xpathoption <- paste0("//select[@id = '", samplepatient[p, 'name'], "']/option[", samplepatient[p, 'value'], "]")
optionelem <- remDrv$findElement(using = "xpath", xpathoption)
selectelem <- remDrv$findElement(using = "xpath",
                                 paste0("//select[@id = '", samplepatient[p, 'name'], "']"))
optionelem$clickElement()
selectelem$screenshot(display = T)
I use the following to check if the correct option was selected:
remDrv$findElement(using = "xpath", paste0("//select[@id = '", samplepatient[p, 'name'], "']"))$getElementAttribute("value")[[1]]
The problem I have is that when the clickElement() command is run twice, the result of the last command changes. I also checked the outcome with screenshot(); it likewise shows that a different option is selected when clickElement() is run twice.
Is there a different way to select an option from a dropdown list that does not cause this behavior?
I am using Docker on Ubuntu with Firefox 3.0.1.
The form is from a calculator I want to use. To open the form itself, you first need to check the disclaimer box, like so:
remDrv$navigate('http://riskcalculator.facs.org/RiskCalculator/')
remDrv$findElement(using = "xpath", "//input[@id = 'chkDisclaimer']")$clickElement()
Sys.sleep(1)
remDrv$findElement(using = "xpath", "//input[@id = 'btnContinue']")$clickElement()
Sys.sleep(1)
A reproducible example after the disclaimer is:
# select age group
optionelem <- remDrv$findElement(using = "xpath", "//select[@id = 'AgeGroup']/option[3]")
selectelem <- remDrv$findElement(using = "xpath", "//select[@id = 'AgeGroup']")
#first attempt
optionelem$clickElement()
selectelem$getElementAttribute("value")
# result = 3
#second attempt
optionelem$clickElement()
selectelem$getElementAttribute("value")
# result = 1
As mentioned in one of the comments, the issue has nothing to do with RSelenium but with the Docker image used. I now use a Chrome image (standalone-chrome) that does not have the same issue when selecting an option from a dropdown.
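A minimal sketch of that setup; the port mapping and untagged image are assumptions, so match them to your own environment:
# docker run -d -p 4445:4444 selenium/standalone-chrome
library(RSelenium)
remDrv <- remoteDriver(remoteServerAddr = "localhost",
                       port = 4445L, browserName = "chrome")
remDrv$open()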
I don't come across any issues when selecting options using clickElement. For example:
remDrv$navigate('http://riskcalculator.facs.org/RiskCalculator/')
remDrv$findElement("id", "chkDisclaimer")$clickElement()
Sys.sleep(1)
remDrv$findElement("id", "btnContinue")$clickElement()
Sys.sleep(1)
#select age group
ageElems <- remDrv$findElements("css", "#AgeGroup option")
ageElems[[3]]$clickElement()
#select Diabetes
diaElems <- remDrv$findElements("css", "#Diabetes option")
diaElems[[2]]$clickElement()
# Select Gender
genderElems <- remDrv$findElements("css", "#Gender option")
genderElems[[1]]$clickElement()
When running in Docker you can use a "debug" image and a VNC viewer to see exactly what's happening in the browser.
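For example, something like the following; the debug image name and the default VNC password are the Selenium project conventions at the time of writing, so treat them as assumptions:
# docker run -d -p 4445:4444 -p 5900:5900 selenium/standalone-chrome-debug
# then point a VNC viewer at localhost:5900 (default password: "secret")
# and watch the clicks happen live while your R session drives the browser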
