Web scraping of key stats in Yahoo! Finance with R

Is anyone experienced in scraping data from the Yahoo! Finance key statistics page with R? I am familiar with scraping data directly from HTML using read_html(), html_nodes(), and html_text() from the rvest package. However, this page (MSFT key stats) is a bit more complicated: I am not sure whether all the stats are delivered via XHR, JS, or the document itself, though I am guessing the data is stored as JSON. If anyone knows a good way to extract and parse the data from this page with R, please share it. Many thanks in advance!
Or, if there is a more convenient way to extract these metrics via quantmod or Quandl, kindly let me know; that would be an excellent solution!

I know this is an older thread, but I used it to scrape the Yahoo analyst tables, so I figured I would share.
# Yahoo web scrape: analyst tables
library(XML)

symbol <- "HD"
url <- paste0("https://finance.yahoo.com/quote/", symbol, "/analysts?p=", symbol)
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")

earningEstimates <- readHTMLTable(tableNodes[[1]])
revenueEstimates <- readHTMLTable(tableNodes[[2]])
earningHistory <- readHTMLTable(tableNodes[[3]])
epsTrend <- readHTMLTable(tableNodes[[4]])
epsRevisions <- readHTMLTable(tableNodes[[5]])
growthEst <- readHTMLTable(tableNodes[[6]])
Cheers,
Sody
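For anyone who prefers rvest, roughly the same tables can be pulled without the XML package. This is only a sketch, assuming the analysts page still renders plain <table> elements; the table order may differ from the version above, so check the indices before relying on them.
# rvest sketch of the same idea (table indices are an assumption)
library(rvest)
symbol <- "HD"
url <- paste0("https://finance.yahoo.com/quote/", symbol, "/analysts?p=", symbol)
tables <- read_html(url) %>%
  html_nodes("table") %>%
  html_table()
earningEstimates <- tables[[1]]  # verify which table is which before using the rest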

I gave up on Excel a long time ago. R is definitely the way to go for things like this.
library(XML)

stocks <- c("AXP", "BA", "CAT", "CSCO")

for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")

  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))

  # ADD COLUMN TO IDENTIFY STOCK
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}

# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)

# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]

# SAVE TO CSV
write.table(stockdata, "C:/Users/your_path_here/Desktop/MyData.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)

# REMOVE TEMP OBJECTS
rm(df, stockdatalist)
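A small variation that avoids assign()/get() is to build the per-stock data frames in a list and bind them once at the end. This is just a sketch and keeps the same assumption as above, namely that the ninth table on the Finviz quote page holds the key stats.
# List-based sketch of the loop above (same Finviz table-index assumption)
library(XML)
stocks <- c("AXP", "BA", "CAT", "CSCO")
get_finviz_table <- function(s) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  html <- htmlTreeParse(readLines(url), useInternalNodes = TRUE, asText = TRUE)
  tbl <- readHTMLTable(getNodeSet(html, "//table")[[9]])
  tbl$stock <- s  # keep the ticker alongside its stats
  tbl
}
stockdata <- do.call(rbind, lapply(stocks, get_finviz_table))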

When I use the methods shown here with the XML library, I get a warning:
Warning in readLines(page) : incomplete final line found on
'https://finance.yahoo.com/quote/DIS/key-statistics?p=DIS'
We can use rvest and xml2 for a cleaner approach. This example demonstrates how to pull a key statistic from the Yahoo! Finance key-statistics page; here I want to obtain the float of an equity. I don't believe the float is available from quantmod, but some of the key stats values are, so you'll have to check the list of fields.
library(xml2)
library(rvest)

getFloat <- function(stock){
  url <- paste0("https://finance.yahoo.com/quote/", stock, "/key-statistics?p=", stock)
  tables <- read_html(url) %>%
    html_nodes("table") %>%
    html_table()

  # The float sits in the third table, fourth row, second column
  float <- as.vector(tables[[3]][4, 2])

  # Take the magnitude suffix (k/M/B), strip the letters, then scale accordingly
  last <- substr(float, nchar(float), nchar(float))
  float <- gsub("[a-zA-Z]", "", float)
  float <- as.numeric(as.character(float))

  if (last == "k") {
    float <- float * 1000
  } else if (last == "M") {
    float <- float * 1000000
  } else if (last == "B") {
    float <- float * 1000000000
  }
  return(float)
}
getFloat("DIS")
[1] 1.81e+09
That's a lot of shares of Disney available.
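As mentioned above, some of the key stats are also reachable through quantmod's getQuote() interface. A minimal sketch, assuming the Yahoo quote API still serves these fields (the field names below are assumptions; run yahooQF() to see the current list):
# quantmod sketch: field names are assumptions -- check yahooQF() for what is available
library(quantmod)
fields <- yahooQF(c("Market Capitalization", "P/E Ratio", "Earnings/Share"))
getQuote("DIS", what = fields)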

Related

Google Search in R [duplicate]

I used the following code:
library(XML)
library(RCurl)

getGoogleURL <- function(search.term, domain = '.co.uk', quotes = TRUE)
{
  search.term <- gsub(' ', '%20', search.term)
  if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  getGoogleURL <- paste('http://www.google', domain, '/search?q=',
                        search.term, sep = '')
}

getGoogleLinks <- function(google.url)
{
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//a[@href][@class='l']")
  return(sapply(nodes, function(x) x <- xmlAttrs(x)[[1]]))
}
search.term <- "cran"
quotes <- "FALSE"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
links <- getGoogleLinks(search.url)
I would like to find all the links that resulted from my search, but I get the following result:
> links
list()
How can I get the links?
In addition, I would like to get the headlines and summaries of the Google results; how can I get those?
And finally, is there a way to get the links that reside in the ChillingEffects.org results?
If you look at the html variable, you can see that the search result links are all nested in <h3 class="r"> tags.
Try changing your getGoogleLinks function to:
getGoogleLinks <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//h3[@class='r']//a")
  return(sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]]))
}
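With that change, the helpers from the question chain together as below. This is only a usage sketch; Google has changed its result markup several times since, so the XPath may need updating again.
# Usage sketch with the helpers defined in the question
search.url <- getGoogleURL(search.term = "cran", quotes = FALSE)
links <- getGoogleLinks(search.url)
head(links)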
I created this function to read in a list of company names and then get the top website result for each. It will get you started; you can then adjust it as needed.
# Libraries. URLencode() comes from base R's utils package, so only rvest needs loading.
library(rvest)

# Load data
d <- read.csv("P:\\needWebsites.csv")
c <- as.character(d$Company.Name)

# Function for getting the top website result
getWebsite <- function(name)
{
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>% # Get all nodes of type cite. You can change this to grab other node types.
    html_text()
  result <- results[1]
  return(as.character(result)) # Drop the [1] above if you want to see all the results.
}

# Apply the function to the list of company names
websites <- data.frame(Website = sapply(c, getWebsite))
The other solutions here don't work for me. Here's my take on @Bryce-Chamberlain's approach, which works for me as of August 2019. It also answers another closed question: company name to URL in R.
# install.packages("rvest")
get_first_google_link <- function(name, root = TRUE) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- xml2::read_html(url)
  # extract all links
  nodes <- rvest::html_nodes(page, "a")
  links <- rvest::html_attr(nodes, "href")
  # extract the first link of the search results
  link <- links[startsWith(links, "/url?q=")][1]
  # clean it
  link <- sub("^/url\\?q\\=(.*?)\\&sa.*$", "\\1", link)
  # get the root if relevant
  if (root) link <- sub("^(https?://.*?/).*$", "\\1", link)
  link
}
companies <- data.frame(company = c("apple acres llc","abbvie inc","apple inc"))
companies <- transform(companies, url = sapply(company,get_first_google_link))
companies
#> company url
#> 1 apple acres llc https://www.appleacresllc.com/
#> 2 abbvie inc https://www.abbvie.com/
#> 3 apple inc https://www.apple.com/
Created on 2019-08-10 by the reprex package (v0.2.1)
The free solutions don't work anymore, and they don't allow you to search for regions outside your location. Here's a solution using the Google Custom Search API. The API allows 100 free calls per day, and one call returns only 10 results (a single page), so the function below returns at most the first 10 results.
library(dplyr)  # needed for mutate(), rename(), select(), row_number() and the %>% pipe

Google.Search.API <- function(keyword, google.key, google.cx, country = "us")
{
  # keyword = keywords[10]; country = "us"
  url <- paste0("https://www.googleapis.com/customsearch/v1?"
                , "key=", google.key
                , "&q=", gsub(" ", "+", keyword)
                , "&gl=", country      # Country
                , "&hl=en"             # Language from browser, English
                , "&cx=", google.cx
                , "&fields=items(link)"
  )

  d2 <- url %>%
    httr::GET(ssl.verifypeer = TRUE) %>%
    httr::content(.) %>% .[["items"]] %>%
    data.table::rbindlist(.) %>%
    mutate(keyword, SERP = row_number(), search.engine = "Google API") %>%
    rename(source = link) %>%
    select(search.engine, keyword, SERP, source)

  pause <- round(runif(1, min = 1.1, max = 5), 1)
  if (nrow(d2) == 0)
  {cat("\nPausing", pause, "seconds. Failed for:", keyword)} else
  {cat("\nPausing", pause, "seconds. Successful for:", keyword)}
  Sys.sleep(pause)
  rm(keyword, country, pause, url, google.key, google.cx)
  return(d2)
}
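Hypothetical usage, where the key and the search-engine ID are placeholders for your own Custom Search credentials:
# Placeholders only -- substitute your own API key and custom search engine ID
res <- Google.Search.API("cran r project",
                         google.key = "YOUR_API_KEY",
                         google.cx = "YOUR_SEARCH_ENGINE_ID")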

R: create an object in a function and name it using a variable, while retaining the object for use in the current session

I have a function that I use to get financial data from the Wall Street Journal website. Basically, I want to make a copy of the data held in symData and give it the same name as symbol. That way the objects stay in the workspace and can be reused for looking at other information. I don't want to keep them permanently, so creating temp files on the filesystem is not my favoured method.
The problem I have is that I can't figure out how to do it.
library(httr)
library(XML)
library(data.table)

getwsj.quotes <- function(symbol)
{
  myUrl <- sprintf("https://quotes.wsj.com/AU/XASX/%s/FINANCIALS", symbol)
  symbol.data <- GET(myUrl)
  x <- content(symbol.data, as = 'text')
  wsj.tables <- sub('cr_dataTable cr_sub_capital', '\\1', x)
  symData <- readHTMLTable(wsj.tables)
  mytemp <- summary(symData)
  print(mytemp)
  d2e <- gsub('^.* ', '', names(symData[[8]]))
  my.out <- sprintf("%s has Debt to Equity Ratio of %s", symbol, d2e)
  print(my.out)
}

TickerList <- c("AMC", "ANZ")
for (Ticker in TickerList)
{
  Ticker.Data <- lapply(Ticker, FUN = getwsj.quotes)
}
The Ticker.Data output is:
> Ticker.Data
[[1]]
[1] "ANZ has Debt to Equity Ratio of 357.41"
The output from mytemp <- summary(symData) has the following:
Length Class Mode
NULL 12 data.frame list
NULL 2 data.frame list
...
I tried various ways of doing it when I call the function, and all I ever get is the last symbol's data. I have searched for hours trying to get an answer but so far, no luck. I need to walk away for a few hours.
Any information would be most helpful.
Regards
Stephen
Edited: I changed my answer based on the suggestion by @MrFlick. It solved another problem.
library(httr)
library(XML)
library(data.table)

getwsj.quotes <- function(Symbol)
{
  MyUrl <- sprintf("https://quotes.wsj.com/AU/XASX/%s/FINANCIALS", Symbol)
  Symbol.Data <- GET(MyUrl)
  x <- content(Symbol.Data, as = 'text')
  wsj.tables <- sub('cr_dataTable cr_sub_capital', '\\1', x)
  SymData <- readHTMLTable(wsj.tables)
  return(SymData)
}
TickerList <- c("AMC", "ANZ", "BHP", "BXB", "CBA", "COL", "CSL", "IAG", "MQG", "NAB", "RIO", "S32", "SCG", "SUN", "TCL", "TLS", "WBC", "WES", "WOW", "WPL")
SymbolDataList <- lapply(TickerList, FUN = getwsj.quotes)
Thanks again.
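If the goal is simply to have each symbol's tables addressable by name in the current session, which is what the original question asked for, naming the list elements is usually enough. A minimal sketch:
# Name the list by ticker instead of creating one object per symbol with assign()
names(SymbolDataList) <- TickerList
SymbolDataList[["ANZ"]]  # the tables scraped for ANZ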

Trying to webscrape an unchanging URL with data spread over pages

I am new to web scraping. The URL I am working with is this (https://tsmc.tripura.gov.in/doc_list). At present, I am able to extract data from the first page. Since the URL is unchanging, I don't have an identifier for the other pages that I could use to create a loop for extracting the data tables.
Here is my code:
install.packages("XML")
install.packages("RCurl")
install.packages("rlist")
install.packages("bitops")
library(bitops)
library(XML)
library(RCurl)
url1<- getURL("https://tsmc.tripura.gov.in/doc_list",.opts =
list(ssl.verifypeer = FALSE))
table1<- readHTMLTable(url1)
table1<- list.clean(table1, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(table1, function(t) dim(t)[1]))
table1[[which.max(n.rows)]]
View(table1)
table11= table1[["NULL"]]
Please help. Thanks!
Perhaps try this solution:
url <- "https://tsmc.tripura.gov.in/doc_list?page="
sq <- seq(1, 30) # There appears to be 30 pages so we create a sequence of 1:30 results
links <- paste0(url, sq) #Paste the sequence after the url "page="
store <- NULL
tbl <- NULL
library(rvest) #extract the tables
for(i in links){
store[[i]] = read_html(i)
tbl[[i]] = html_table(store[[i]])
}
library(plyr)
df <- ldply(tbl, data.frame) #combine the list of data frames into one large data frame
df$`.id` <- gsub("https://tsmc.tripura.gov.in/doc_list?page=", " ", df$`.id`, fixed = TRUE)
Which gives 846 observations across 8 variables.
EDIT: I found that the first URL does not carry a page number. In order to add the first page and rbind it with the rest of the data, use the following:
firsturl <- "https://tsmc.tripura.gov.in/doc_list"
first_store = read_html(firsturl)
first_tbl = html_table(first_store)
first_df <- as.data.frame(first_tbl)
first_df$`.id` <- 0
df2 <- rbind(first_df, df)
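A slightly more compact variant builds the unpaged first URL and the numbered pages into one vector before looping. This is a sketch that assumes the bare URL serves page 1 and the remaining pages run from ?page=2 to ?page=30, and that the first table on each page is the one of interest.
# Sketch: build all page URLs up front, scrape each, then bind the results
library(rvest)
links <- c("https://tsmc.tripura.gov.in/doc_list",
           paste0("https://tsmc.tripura.gov.in/doc_list?page=", 2:30))
tables <- lapply(links, function(u) html_table(read_html(u))[[1]])
df_all <- do.call(rbind, tables)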

Web scraping of stock key stats from Finviz with R

I tried to scrape some stock key stats from Finviz. I applied code from the original question: Web scraping of key stats in Yahoo! Finance with R. To collect stats for as many stocks as possible, I created a list of stock symbols and descriptions like this:
Symbol Description
A Agilent Technologies
AAA Alcoa Corp
AAC Aac Holdings Inc
BABA Alibaba Group Holding Ltd
CRM Salesforce.Com Inc
...
I selected the first column and stored it as a character vector in R called stocks. Then I applied the code:
for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")

  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))

  # ADD COLUMN TO IDENTIFY STOCK
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}

# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)

# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]
However, Finviz doesn't have a page for some of the stocks, and I get error messages like this:
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open URL 'http://finviz.com/quote.ashx?t=AGM.A': HTTP status was '404
Not Found'
There are a good number of stocks in this situation, so I can't delete them from my list manually. Is there a way to skip fetching the page for those stocks?
Maybe something along these lines? Try filtering the stocks before running your for loop.
library(tidyverse)

# AGM.A should produce an error
stocks <- c("AXP", "BA", "CAT", "AGM.A")
urls <- paste0("http://finviz.com/quote.ashx?t=", stocks)

# Test the urls with possibly() first and find out which come back as NA
temp_ind <- map(urls, possibly(readLines, otherwise = NA_real_))
ind <- map_lgl(map(temp_ind, c(1)), is.na)
ind <- which(ind == TRUE)
filter.stocks <- stocks[-ind]

# AGM.A is removed and you can pass the stocks that work to your for loop.
filter.stocks
[1] "AXP" "BA"  "CAT"
As statxiong pointed out, url.exists() gives a simpler version:
library(RCurl)
library(tidyverse)
stocks[map_lgl(urls, url.exists)]
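A base-R alternative is to wrap the read in tryCatch() inside the loop itself, so a missing page simply skips that symbol. A rough sketch:
# Sketch: skip symbols whose Finviz page cannot be opened
for (s in stocks) {
  webpage <- tryCatch(readLines(paste0("http://finviz.com/quote.ashx?t=", s)),
                      error = function(e) NULL)
  if (is.null(webpage)) next  # page missing (e.g. HTTP 404) -- move on to the next symbol
  # ... continue with htmlTreeParse() / readHTMLTable() as in the loop above
}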

Use xpathSApply in R

I would like to get the href information from the page below.
http://www.mitbbs.com/bbsdoc1/USANews_101_0.html
I would prefer to get something from each topic like this:
/USANews/31587637.html
/USANews/31587633.html
/USANews/31587631.html
...
The code I used is below, but it doesn't work.
library("XML")
library("httr")
library("stringr")
data <- list()
for( i in 101:201){
url <- paste('bbsdoc1/USANews_', i, '_0.html', sep='')
html <- content(GET("http://www.mitbbs.com/", path = url),as = 'parsed')
url.list <- xpathSApply(html, "//td[#align='left' height=26]/[#class='news1' href]", xmlAttrs)
data <- rbind(data, url.list)
}
Your suggestions are really appreciated!
You should look into the rvest package, which simplifies things a lot.
library(rvest); library(dplyr)

myList <- read_html("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html") %>%
  html_nodes(".news1") %>% xml_attr("href")
myList

myList %>% gsub("/article_t", "", .)
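To cover the full 101:201 page range from the question, the same rvest call can be wrapped in a loop. A sketch:
# Sketch: collect the .news1 hrefs across the page range from the question
library(rvest)
pages <- 101:201
links <- unlist(lapply(pages, function(i) {
  url <- paste0("http://www.mitbbs.com/bbsdoc1/USANews_", i, "_0.html")
  read_html(url) %>% html_nodes(".news1") %>% html_attr("href")
}))
links <- gsub("/article_t", "", links)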
Retrieve the document
library(XML)
html = htmlParse("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html")
and extract the links and text you're interested in using the appropriate XPath queries
href = "//a[./#class='news1']/#href"
text = "//a[./#class='news1']/text()"
df = data.frame(
url=sub("article_t/", "", sapply(html[href], as.character)),
text=trimws(sapply(html[text], xmlValue)))
trimws() is a function in recent versions of R.
