I would like to get news headlines for a company from Yahoo. I use RSelenium to start a remote browser and accept cookies. I found the surrounding CSS class "StretchedBox" and I can literally see the headline via browser inspection. How can I store these headlines? Next, I would like to scroll down with RSelenium and save more of these elements (say, for several days).
library('RSelenium')
# Start Remote Browser
rD <- rsDriver(port = 4840L, browser = c("firefox"))
remDr <- rD[["client"]]
# Navigate to Yahoo Finance News for Specific Company
# This takes unusual long time
remDr$navigate("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")
# Get "accept all cookies" botton
webElems <- remDr$findElements(using = "xpath", "//button[starts-with(#class, 'btn primary')]")
# We can check if we did get the proper button by checking the text of the element:
unlist(lapply(webElems, function(x) {x$getElementText()}))
# We found two buttons, and we want to click the first one:
webElems[[1]]$clickElement()
# wait for page loading
Sys.sleep(5)
# I am looking for news headline in or after the StretchedBox
boxes <- remDr$findElements(using = "class", "StretchedBox")
boxes[1] # empty
boxes[[1]]$browserName
Finally, I found an XPath from which I could getElementText() the news article headlines.
library('RSelenium')
# Start Browser
rD <- rsDriver(port = 4835L, browser = c("firefox"))
remDr <- rD[["client"]]
# Navigate to Yahoo Financial News
remDr$navigate("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")
# Click Accept Cookies
webElems <- remDr$findElements(using = "xpath", "//button[starts-with(@class, 'btn primary')]")
unlist(lapply(webElems, function(x) {x$getElementText()}))
webElems[[1]]$clickElement()
# extract headlines from html/css by xpath
headlines <- remDr$findElements(using = "xpath", "//h3[@class = 'Mb(5px)']//a")
# extract headline text
headlines <- sapply(headlines, function(x){x$getElementText()})
headlines[1]
# [[1]]
# [1] "What Kind Of Investors Own Most Of Apple Inc. (NASDAQ:AAPL)?"
Related
I am trying to get the store addresses of Apple stores for multiple countries using RSelenium.
library(RSelenium)
library(tidyverse)
library(netstat)
# start the server
rs_driver_object <- rsDriver(browser = "chrome",
                             chromever = "100.0.4896.60",
                             verbose = FALSE,
                             port = free_port())
# create a client object
remDr <- rs_driver_object$client
# maximise window size
remDr$maxWindowSize()
# navigate to the website
remDr$navigate("https://www.apple.com/uk/retail/storelist/")
# click on search bar
search_box <- remDr$findElement(using = "id", "dropdown")
country_name <- "United States" # for a single country. I can loop over multiple countries
# in the search box, pass on the country name and hit enter
search_box$sendKeysToElement(list(country_name, key = "enter"))
search_box$clickElement() # I am not sure if I need to click, but I am doing it anyway
The page now shows the location of each store. Each store has a hyperlink that takes me to the store's page, where the full address (which I want to extract) is listed.
However, I am stuck on how to click on each individual store in this last step.
I thought I would get the href for all the stores on the particular page:
store_address <- remDr$findElement(using = 'class', 'store-address')
store_address$getElementAttribute('href')
But it returns an empty list. How do I go on from here?
After obtaining the page with the list of stores, we can do:
library(rvest) # for read_html(), html_nodes(), html_attr()
link <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.state') %>%
  html_nodes('a') %>%
  html_attr('href') %>%
  paste0('https://www.apple.com', .)
[1] "https://www.apple.com/retail/thesummit/" "https://www.apple.com/retail/bridgestreet/"
[3] "https://www.apple.com/retail/anchorage5thavenuemall/" "https://www.apple.com/retail/chandlerfashioncenter/"
[5] "https://www.apple.com/retail/santanvillage/" "https://www.apple.com/retail/arrowhead/"
I'm trying to scrape links to all minutes and agenda provided in this website: https://www.charleston-sc.gov/AgendaCenter/
I've managed to scrape the section IDs associated with each category (and the years for each category) to loop through the contents within each category-year (please see below). But I don't know how to scrape the hrefs that live inside the contents. Especially because the links to the agendas live inside the drop-down menu under 'Download', it seems like I need to go through extra clicks to scrape the hrefs.
How do I scrape the minutes and agendas (inside the download drop-down) for each table I select? Ideally, I would like a table with the date, the title of the agenda, links to the minutes, and links to the agendas.
I'm using RSelenium for this. Please see the code I have so far below, which allows me to click through each category and year, but not much else. Please help!
rm(list = ls())
library(RSelenium)
library(tidyverse)
library(httr)
library(XML)
library(stringr)
library(RCurl)
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
co <- str_match(t, 'aria-label="(.*?)"[ ]href="java')[,2]
yr <- str_match(t, 'id="(.*?)" aria-label')[,2]
df <- data.frame(cbind(co, yr)) %>%
  mutate_all(as.character) %>%
  filter_all(any_vars(!is.na(.))) %>%
  mutate(id = ifelse(grepl('^a0', yr), gsub('a0', '', yr), NA)) %>%
  tidyr::fill(c(co, id), .direction = 'down') %>%
  drop_na(co)
remDr <- remoteDriver(port=4445L, browserName = "chrome")
remDr$open()
remDr$navigate('https://www.charleston-sc.gov/AgendaCenter/')
remDr$screenshot(display = T)
for (j in unique(df$id)){
  remDr$findElement(using = 'xpath',
                    value = paste0('//*[@id="cat', j, '"]/h2'))$clickElement()
  for (k in unique(df[which(df$id == j), 'yr'])){
    remDr$findElement(using = 'xpath',
                      value = paste0('//*[@id="', k, '"]'))$clickElement()
    # NEED TO SCRAPE THE HREF ASSOCIATED WITH MINUTES AND AGENDA DOWNLOAD HERE #
  }
}
Maybe you don't really need to click through all the elements? You can use the fact that all downloadable links have ViewFile in their href:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
viewfile <- str_extract_all(t, '.*ViewFile.*', simplify = T)
viewfile <- viewfile[viewfile!='']
library(data.table) # I use data.table because it's more convenient, but this can be done without it too
dt.viewfile <- data.table(origStr=viewfile)
# list the elements and patterns we will be looking for:
searchfor <- list(
  Title = 'name=[^ ]+ title=\"(.+)\" href',
  Date = '<strong>(.+)</strong>',
  href = 'href=\"([^\"]+)\"',
  label = 'aria-label=\"([^\"]+)\"'
)
for (this.i in names(searchfor)){
  this.full <- paste0('.*', searchfor[[this.i]], '.*')
  dt.viewfile[grepl(this.full, origStr), (this.i) := gsub(this.full, '\\1', origStr)]
}
# Clean records:
dt.viewfile[, `:=`(Title = na.omit(Title), Date = na.omit(Date), label = na.omit(label)),
            by = href]
dt.viewfile[, Date := gsub('<abbr title=".*">(.*)</abbr>', '\\1', Date)]
dt.viewfile <- unique(dt.viewfile[, .(Title, Date, href, label)]) # 690 records
What you have as a result is a table with links to all the downloadable files. You can now download them using any tool you like, for example download.file() or GET():
dt.viewfile[, full.url:=paste0('https://www.charleston-sc.gov', href)]
dt.viewfile[, filename:=fs::path_sanitize(paste0(Title, ' - ', Date), replacement = '_')]
for (i in seq_len(nrow(dt.viewfile[1:10,]))){ # remove `1:10` limitation to process all records
  url <- dt.viewfile[i, full.url]
  destfile <- dt.viewfile[i, filename]
  cat('\nDownloading', url, ' to ', destfile)
  fil <- GET(url, write_disk(destfile))
  # our destination file doesn't have an extension; we need to get it from the server:
  serverFilename <- gsub("inline;filename=(.*)", '\\1', headers(fil)$`content-disposition`)
  serverExtension <- tools::file_ext(serverFilename)
  # add the extension to the file we just saved
  file.rename(destfile, paste0(destfile, '.', serverExtension))
}
Now the only problem we have is that the original webpage was only showing records for the last 3 years. But instead of clicking View More through RSelenium, we can simply load the page with earlier dates, something like this:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017', encoding = 'UTF-8')
then repeat the rest of the code as necessary.
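If you need several date windows, a small helper that pastes the dates into the search URL keeps this tidy. A sketch; the URL parameters are exactly the ones shown above:
fetch_agenda_page <- function(startDate, endDate) {
  url <- paste0('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all',
                '&startDate=', startDate, '&endDate=', endDate)
  readLines(url, encoding = 'UTF-8')
}
t <- fetch_agenda_page('10/14/2014', '10/14/2017')
# now feed `t` back into the str_extract_all(t, '.*ViewFile.*') step above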
I'm using RSelenium to click on a dynamic element after a search on this webpage: http://www.in.gov.br/web/guest/inicio.
Every time I search for a word, I would like to find the words/link 'Ministério Da Educação' (the Portuguese equivalent of Ministry of Education) on the right side of the results page and click on it.
I have used the inspect-element feature of Google Chrome, but I am not having any success finding and clicking that element. I have already tried using an XPath, a CSS selector, an id ...
I am using the following code:
## search parameters
string_search <- "contrato"
date_search <- format(
  as.Date("17/04/2019", "%d/%m/%Y"),
  "%d/%m/%Y") # Brazilian date format
## start Selenium driver
library(RSelenium)
selCommand <- wdman::selenium(
  jvmargs = c("-Dwebdriver.firefox.verboseLogging=true"),
  retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE) # for windows
# system(selCommand) # for Linux
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
## navigation & search
remDr$navigate("http://www.in.gov.br/web/guest/inicio")
Sys.sleep(5)
# from date
datefromkey <- remDr$findElement(using = 'css', "#calendario_advanced_from")
datefromkey$clickElement()
datefromkey$sendKeysToElement(list(key = "enter"))
datefromkey$clearElement()
datefromkey$sendKeysToElement(list(date_search))
datefromkey$sendKeysToElement(list(key = "enter"))
# to date
datetokey <- remDr$findElement(using = 'css', "#calendario_advanced_to")
datetokey$clickElement()
datetokey$sendKeysToElement(list(key = "enter"))
datetokey$clearElement()
datetokey$sendKeysToElement(list(date_search))
datetokey$sendKeysToElement(list(key = "enter"))
# string to search
wordkey <- remDr$findElement(using = 'css', "#input-advanced_search")
wordkey$sendKeysToElement(list('"', string_search, '"'))
# click search button
press_button <- remDr$findElement(using = 'class', "btn")
press_button$clickElement()
Here is where I struggle:
1) First attempt: using a broader tag
# using a broader tag
categorykey <- remDr$findElement(using = 'id', '_3_facetNavigation')
categorykey$getElementText()
With getElementText() I see that "Ministério da Educação" is there, but I do not know how to click on the link.
2) Second attempt: using an XPath
categorykey <- remDr$findElement('xpath',
  '//li[@id="yui_patched_v3_11_0_1_1555545676970_404"]/text()')
It returns an error. Selenium can't locate the element.
Found the solution myself after watching this video on YouTube:
How to locate Dynamic Elements in Selenium Webdriver - XPATH Tutorial
The code would be like this:
categorykey <- remDr$findElement('xpath',
  '//*[contains(@data-value,"ministério da educação")]')
categorykey$getElementText() # just to see if it's right
categorykey$clickElement()
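As an alternative that avoids the XPath entirely: since the facet is rendered as a link, RSelenium's 'partial link text' locator can find it by its visible text (assuming the label text is stable across searches):
# locate the facet by (part of) its visible link text instead of an attribute
categorykey <- remDr$findElement(using = 'partial link text', 'Educação')
categorykey$clickElement()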
Start RSelenium
library(RSelenium)
RSelenium::startServer()
pJS <- phantom()
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
Go to the site and wait a bit
remDr$navigate("http://ideal-scope.com/online-holloway-cut-adviser/")
Sys.sleep(5)
Now, when I try to find elements for the text boxes,
depthElem <- remDr$findElements("name","depth_textbox")
tableElem <- remDr$findElements("name","table_textbox")
crownElem <- remDr$findElements("name","crown_textbox")
pavilionElem <- remDr$findElements("name","pavilion_textbox")
...this just gives me a bunch of objects that are empty list()s.
If I do findElement instead of findElements, I get:
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
How can I select these textboxes? Why can't I select them by searching name?
The findElements method returns an empty list when no elements are present. The page has the content you require in an iframe. You will need to switch to the iframe first before you can search for the elements:
remDr$navigate("http://ideal-scope.com/online-holloway-cut-adviser/")
# get iframes
webElems <- remDr$findElements("css", "iframe")
# there is only one
remDr$switchToFrame(webElems[[1]])
depthElem <- remDr$findElement("name","depth_textbox")
# > depthElem$getElementAttribute("name")
# [[1]]
# [1] "depth_textbox"
remDr$findElement("name","table_textbox")
crownElem <- remDr$findElement("name","crown_textbox")
pavilionElem <- remDr$findElement("name","pavilion_textbox")
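Once located, the elements can be used as usual. A short sketch that fills the boxes with arbitrary example values and then returns to the top-level document:
# fill in example proportions (arbitrary values, for illustration only)
depthElem$clearElement()
depthElem$sendKeysToElement(list("61.5"))
tableElem$clearElement()
tableElem$sendKeysToElement(list("57"))
pavilionElem$clearElement()
pavilionElem$sendKeysToElement(list("40.8"))
# switch back to the top-level document when done with the iframe
remDr$switchToFrame(NULL)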
I need to scrape product information from an e-commerce page, but the page has infinite scrolling. Currently I am able to scrape only the products shown without scrolling down. Below is the code for it.
require(RCurl)
require(XML)
require(dplyr)
require(stringr)
webpage <- getURL("http://www.jabong.com/kids/clothing/girls-clothing/kids-tops-t-shirts/?source=topnav_kids")
linklist <- str_extract_all(webpage, '(?<=href=")[^"]+')[[1]]
linklist <- as.data.frame(linklist)
linklist <- filter(linklist, grepl("\\?pos=", linklist))
linklist <- unique(linklist)
a <- as.data.frame(linklist)
a[2] <- "Jabong.com"
a <- add_rownames(a, "ID")
a$V3 <- gsub(" ", "", paste(a$V2, a$linklist))
a <- a[, -(1:3)]
colnames(a) <- "Links"
Well, if scrolling is truly infinite, then it is impossible to get ALL of the links... If you wanted to settle for a finite number, you can indeed fruitfully use RSelenium here.
library(RSelenium)
#start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()
# load your page
remDr$navigate("http://www.jabong.com/kids/clothing/girls-clothing/kids-tops-t-shirts/?source=topnav_kids")
# scroll down 5 times, allowing 3 seconds for the page to load every time
for(i in 1:5){
  remDr$executeScript(paste("scroll(0,", i*10000, ");"))
  Sys.sleep(3)
}
# get the page html
page_source <- remDr$getPageSource()
# get the URLs that you are looking for
pp <- xml2::read_html(page_source[[1]]) %>%
  rvest::html_nodes("a") %>%
  rvest::html_attr("data-original-href") %>%
  {.[!is.na(.)]}
The result is 312 links (in my browser). The more you have RSelenium scroll down, the more links you'll get.
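If you would rather scroll until the page stops producing new content than a fixed five times, you can compare the document height before and after each scroll; a minimal sketch:
# keep scrolling until the document height stops growing
last_height <- 0
repeat {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(3) # allow the next batch of products to load
  new_height <- remDr$executeScript("return document.body.scrollHeight;")[[1]]
  if (new_height == last_height) break # nothing new was loaded
  last_height <- new_height
}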