I tried to scrape some key stats for stocks from Finviz. I adapted the code from the original question, Web scraping of key stats in Yahoo! Finance with R. To collect stats for as many stocks as possible, I created a list of stock symbols and descriptions like this:
Symbol Description
A Agilent Technologies
AAA Alcoa Corp
AAC Aac Holdings Inc
BABA Alibaba Group Holding Ltd
CRM Salesforce.Com Inc
...
I extracted the first column, stored it as a character vector in R, and called it stocks. Then I applied this code:
library(XML)

for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")
  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))
  # ADD COLUMN TO IDENTIFY STOCK
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}
# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]
However, for some of the stocks, Finviz doesn't have a page, and I get error messages like this:
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open URL 'http://finviz.com/quote.ashx?t=AGM.A': HTTP status was '404 Not Found'
There are a good number of stocks in this situation, so I can't delete them from my list manually. Is there a way to skip fetching the page for those stocks?
Maybe something along these lines? Try filtering the stocks before running your for loop.
library(tidyverse)

# AGM.A should produce an error
stocks <- c("AXP", "BA", "CAT", "AGM.A")
urls <- paste0("http://finviz.com/quote.ashx?t=", stocks)

# Test the urls with possibly() and find the NAs
temp_ind <- map(urls, possibly(readLines, otherwise = NA_real_))
ind <- map_lgl(map(temp_ind, 1), is.na)
ind <- which(ind)
filter.stocks <- stocks[-ind]

# AGM.A is removed, and you can feed the stocks that work into your for loop.
filter.stocks
[1] "AXP" "BA"  "CAT"
As statxiong pointed out, url.exists from RCurl makes for a simpler version:
library(RCurl)
library(tidyverse)
stocks[map_lgl(urls, url.exists)]
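If you would rather keep everything in one loop, wrapping the download in tryCatch is another option. A minimal sketch (the message text is illustrative):

for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- tryCatch(readLines(url), error = function(e) NULL)
  if (is.null(webpage)) {
    message("Skipping ", s, ": page not available")
    next
  }
  # ...parse the tables as before...
}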
I have a function that I use to get financial data from the Wall Street Journal website. Basically, I want to make a copy of the data held in symData and give it the same name as symbol. That way the objects are in the workspace and can be reused for looking at other information. I don't want to keep them permanently, so creating temp files on the filesystem is not my favoured method.
The problem I have is that I can't figure out how to do it.
library(httr)
library(XML)
library(data.table)

getwsj.quotes <- function(symbol)
{
  myUrl <- sprintf("https://quotes.wsj.com/AU/XASX/%s/FINANCIALS", symbol)
  symbol.data <- GET(myUrl)
  x <- content(symbol.data, as = 'text')
  wsj.tables <- sub('cr_dataTable cr_sub_capital', '\\1', x)
  symData <- readHTMLTable(wsj.tables)
  mytemp <- summary(symData)
  print(mytemp)
  d2e <- gsub('^.* ', '', names(symData[[8]]))
  my.out <- sprintf("%s has Debt to Equity Ratio of %s", symbol, d2e)
  print(my.out)
}

TickerList <- c("AMC", "ANZ")
for (Ticker in TickerList)
{
  Ticker.Data <- lapply(Ticker, FUN = getwsj.quotes)
}
The Ticker.Data output is:
> Ticker.Data
[[1]]
[1] "ANZ has Debt to Equity Ratio of 357.41"
The output from mytemp <- summary(symData) has the following:
Length Class Mode
NULL 12 data.frame list
NULL 2 data.frame list
...
I tried various ways of doing it when I call the function, and all I ever get is the last symbol's data. I have searched for hours trying to find an answer but so far, no luck. I need to walk away for a few hours.
Any information would be most helpful.
Regards
Stephen
Edited: I changed my answer based on the suggestion by @MrFlick. It solved another problem.
library(httr)
library(XML)
library(data.table)

getwsj.quotes <- function(Symbol)
{
  MyUrl <- sprintf("https://quotes.wsj.com/AU/XASX/%s/FINANCIALS", Symbol)
  Symbol.Data <- GET(MyUrl)
  x <- content(Symbol.Data, as = 'text')
  wsj.tables <- sub('cr_dataTable cr_sub_capital', '\\1', x)
  SymData <- readHTMLTable(wsj.tables)
  return(SymData)
}
TickerList <- c("AMC", "ANZ", "BHP", "BXB", "CBA", "COL", "CSL", "IAG", "MQG", "NAB", "RIO", "S32", "SCG", "SUN", "TCL", "TLS", "WBC", "WES", "WOW", "WPL")
SymbolDataList <- lapply(TickerList, FUN = getwsj.quotes)
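Since SymbolDataList holds one entry per ticker, naming the list gives you reusable per-symbol data; list2env() will even promote the elements to individual workspace objects if that is what you are after. A small sketch:

names(SymbolDataList) <- TickerList
# Access one symbol's tables directly:
SymbolDataList[["ANZ"]]
# Or create one workspace object per ticker:
list2env(SymbolDataList, envir = .GlobalEnv)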
Thanks again.
I am new to web scraping. The url I am working with is this (https://tsmc.tripura.gov.in/doc_list). At present, I am able to extract data from the first page. Since the url is unchanging, I don't have an identifier for the other pages with which to create a loop for data table extraction.
Here is my code:
install.packages("XML")
install.packages("RCurl")
install.packages("rlist")
install.packages("bitops")
library(bitops)
library(XML)
library(RCurl)
url1<- getURL("https://tsmc.tripura.gov.in/doc_list",.opts =
list(ssl.verifypeer = FALSE))
table1<- readHTMLTable(url1)
table1<- list.clean(table1, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(table1, function(t) dim(t)[1]))
table1[[which.max(n.rows)]]
View(table1)
table11= table1[["NULL"]]
Please help. Thanks!
Perhaps try this solution:
url <- "https://tsmc.tripura.gov.in/doc_list?page="
sq <- seq(1, 30) # There appears to be 30 pages so we create a sequence of 1:30 results
links <- paste0(url, sq) #Paste the sequence after the url "page="
store <- NULL
tbl <- NULL
library(rvest) #extract the tables
for(i in links){
store[[i]] = read_html(i)
tbl[[i]] = html_table(store[[i]])
}
library(plyr)
df <- ldply(tbl, data.frame) #combine the list of data frames into one large data frame
df$`.id` <- gsub("https://tsmc.tripura.gov.in/doc_list?page=", " ", df$`.id`, fixed = TRUE)
Which gives 846 observations across 8 variables.
EDIT: I found that the first url does not have a sequence. In order to add the first page and rbind it with the rest of the data use the following:
firsturl <- "https://tsmc.tripura.gov.in/doc_list"
first_store = read_html(firsturl)
first_tbl = html_table(first_store)
first_df <- as.data.frame(first_tbl)
first_df$`.id` <- 0
df2 <- rbind(first_df, df)
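If you prefer to handle the bare first page and the numbered pages in one pass, a purrr::map_dfr() sketch like the following would also work (it assumes the first table on each page is the one you want):

library(rvest)
library(purrr)

base_url <- "https://tsmc.tripura.gov.in/doc_list"
urls <- c(base_url, paste0(base_url, "?page=", 1:30))

df_all <- map_dfr(seq_along(urls), function(i) {
  page_tbl <- html_table(read_html(urls[i]))[[1]]
  page_tbl$page <- i - 1  # the bare doc_list url is page 0
  page_tbl
})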
How can I ignore a data set if some column names don't exist in it?
I have a list of weather data from a stream, but I think certain key weather conditions are missing from some records, which gives me this error from rbind:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
My code:
weatherDf <- data.frame()

for (i in weatherData) {
  # Get the airport code.
  airport <- i$airport
  # Get the date.
  date <- as.POSIXct(as.numeric(as.character(i$timestamp))/1000, origin = "1970-01-01", tz = "UTC-1")
  # Get the data in dailysummary only.
  dailySummary <- i$dailysummary
  weatherDf <- rbind(weatherDf, ldply(
    list(dailySummary),
    function(x) c(airport, format(as.Date(date), "%Y-%m-%d"), x[["meanwindspdi"]],
                  x[["meanwdird"]], x[["meantempm"]], x[["humidity"]])
  ))
}
So how can I make sure these key conditions below exist in the data:
meanwindspdi
meanwdird
meantempm
humidity
If any of them does not exist, then ignore that whole record. Is that possible?
EDIT:
The content of weatherData is in jsfiddle (I can't post it here as it is too long, and I don't know the best place to share the data publicly for R...)
EDIT 2:
I get an error when I try to export the data to a txt file:
> write.table(weatherData,"/home/teelou/Desktop/data/data.txt",sep="\t",row.names=FALSE)
Error in data.frame(date = list(pretty = "January 1, 1970", year = "1970", :
arguments imply differing number of rows: 1, 0
What does it mean? It seems that there are some errors in the data...
EDIT 3:
I have exported my entire data in .RData to my google drive:
https://drive.google.com/file/d/0B_w5RSQMxtRSbjdQYWJMX3pfWXM/view?usp=sharing
If you use RStudio, then you can just import the data.
EDIT 4:
target_names <- c("meanwindspdi", "meanwdird", "meantempm", "humidity")

# If it has data, then loop over it.
if (!is.null(weatherData)) {
  # Initialize a data frame.
  weatherDf <- data.frame()
  for (i in weatherData) {
    if (!all(target_names %in% names(i)))
      next
    # Get the airport code.
    airport <- i$airport
    # Get the date.
    date <- as.POSIXct(as.numeric(as.character(i$timestamp))/1000, origin = "1970-01-01", tz = "UTC-1")
    # Get the data in dailysummary only.
    dailySummary <- i$dailysummary
    weatherDf <- rbind(weatherDf, ldply(
      list(dailySummary),
      function(x) c(airport, format(as.Date(date), "%Y-%m-%d"), x[["meanwindspdi"]],
                    x[["meanwdird"]], x[["meantempm"]], x[["humidity"]])
    ))
  }
  # Rename the columns.
  colnames(weatherDf) <- c("airport", "key_date", "ws", "wd", "tempi", "humidity")
  # Convert certain weatherDf columns to numeric.
  columns <- c("ws", "wd", "tempi", "humidity")
  weatherDf[, columns] <- lapply(columns, function(x) as.numeric(weatherDf[[x]]))
}
Inspect the weatherDf:
> View(weatherDf)
Error in .subset2(x, i, exact = exact) : subscript out of bounds
You can use next to skip the current iteration of the loop and go to the next iteration:
target_names <- c("meanwindspdi", "meanwdird", "meantempm", "humidity")

for (i in weatherData) {
  if (!all(target_names %in% names(i)))
    next
  # continue with loop...
}
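One caveat: judging by your ldply call, the four fields live inside i$dailysummary rather than at the top level of i, so the guard probably needs to test that element instead. A sketch:

for (i in weatherData) {
  if (!all(target_names %in% names(i$dailysummary)))
    next
  # continue with loop...
}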
Is anyone experienced in scraping data from the Yahoo! Finance key statistics page with R? I am familiar with scraping data directly from html using read_html(), html_nodes(), and html_text() from the rvest package. However, this web page, MSFT key stats, is a bit complicated, and I am not sure if all the stats are kept in XHR, JS, or Doc. I am guessing the data is stored in JSON. If anyone knows a good way to extract and parse the data for this web page with R, kindly answer my question; great thanks in advance!
Or, if there is a more convenient way to extract these metrics via quantmod or Quandl, kindly let me know; that would be an extremely good solution!
I know this is an older thread, but I used it to scrape the Yahoo analyst tables, so I figured I would share.
# Yahoo webscrape Analysts
library(XML)

symbol <- "HD"
url <- paste0("https://finance.yahoo.com/quote/", symbol, "/analysts?p=", symbol)
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")

earningEstimates <- readHTMLTable(tableNodes[[1]])
revenueEstimates <- readHTMLTable(tableNodes[[2]])
earningHistory <- readHTMLTable(tableNodes[[3]])
epsTrend <- readHTMLTable(tableNodes[[4]])
epsRevisions <- readHTMLTable(tableNodes[[5]])
growthEst <- readHTMLTable(tableNodes[[6]])
Cheers,
Sody
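For a cleaner parse, the same tables can be pulled with rvest. A sketch, assuming the table order matches the page layout:

library(rvest)

symbol <- "HD"
url <- paste0("https://finance.yahoo.com/quote/", symbol, "/analysts?p=", symbol)
tables <- read_html(url) %>% html_nodes("table") %>% html_table()
earningEstimates <- tables[[1]]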
I gave up on Excel a long time ago. R is definitely the way to go for things like this.
library(XML)

stocks <- c("AXP", "BA", "CAT", "CSCO")

for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")
  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))
  # ADD COLUMN TO IDENTIFY STOCK
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}
# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]
# SAVE TO CSV
write.table(stockdata, "C:/Users/your_path_here/Desktop/MyData.csv", sep=",",
row.names=FALSE, col.names=FALSE)
# REMOVE TEMP OBJECTS
rm(df, stockdatalist)
When I use the methods shown here with the XML library, I get a warning:
Warning in readLines(page) : incomplete final line found on
'https://finance.yahoo.com/quote/DIS/key-statistics?p=DIS'
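That warning is harmless; it just means the response does not end with a newline. You can silence it with the warn argument of readLines:

webpage <- readLines(url, warn = FALSE)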
We can use rvest and xml2 for a cleaner approach. This example demonstrates how to pull a key statistic from the Yahoo! Finance key-statistics page; here I want to obtain the float of an equity. I don't believe float is available from quantmod, but some of the key stats values are; you'll have to reference the list.
library(xml2)
library(rvest)

getFloat <- function(stock){
  url <- paste0("https://finance.yahoo.com/quote/", stock, "/key-statistics?p=", stock)
  tables <- read_html(url) %>%
    html_nodes("table") %>%
    html_table()
  float <- as.vector(tables[[3]][4, 2])
  last <- substr(float, nchar(float), nchar(float))  # unit suffix: k, M, or B
  float <- gsub("[a-zA-Z]", "", float)
  float <- as.numeric(as.character(float))
  if (last == "k") {
    float <- float * 1000
  } else if (last == "M") {
    float <- float * 1000000
  } else if (last == "B") {
    float <- float * 1000000000
  }
  return(float)
}
getFloat("DIS")
[1] 1.81e+09
That's a lot of shares of Disney available.
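As a design note, the if/else chain can be condensed with a named lookup vector; an equivalent sketch:

suffix_mult <- c(k = 1e3, M = 1e6, B = 1e9)
mult <- suffix_mult[last]  # NA if the suffix is unrecognised
if (!is.na(mult)) float <- float * unname(mult)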
If I want to load stock data, this is how I do it (for Google as an example):
## most recent close price
getSymbols("GOOG")
last(GOOG)[,4]
## total equity
getFinancials("GOOG")
viewFinancials(GOOG.f, type='BS', period='A',subset = NULL)['Total Equity',1]
## Net Income
viewFinancials(GOOG.f, type='IS', period='Q',subset = NULL)['Net Income',1]
...the list goes on.
But it would be much more practical to type GOOG only once and then have it replaced with a generic name in the rest of the code. How can this be done in quantmod?
The option auto.assign=FALSE should solve the problem.
Below is a modified version of your code. Extending it to a larger number of tickers and treating them, e.g., in a loop should be straightforward.
library(quantmod)
CollectionOfTickers <- c("GOOG")
IndexOfCurrentTicker <- 1
# the part that follows could be extracted as a function
CurrentTicker <- getSymbols(CollectionOfTickers[IndexOfCurrentTicker], auto.assign=FALSE)
Cl(last(CurrentTicker)) ## most recent close price
## total equity
CurrentTickerFinancials <- getFinancials(CollectionOfTickers[IndexOfCurrentTicker], auto.assign=FALSE)
viewFinancials(CurrentTickerFinancials, type='BS', period='A',subset = NULL)['Total Equity',1]
## Net Income
viewFinancials(CurrentTickerFinancials, type='IS', period='Q',subset = NULL)['Net Income',1]
Note that "GOOG" is no longer hard-coded. It is defined only once, in the vector CollectionOfTickers and the entry of this vector is retrieved by using the variable IndexOfCurrentTicker which could represent a looping variable in a larger collection of tickers.
Edit
A variant of this code to perform a loop over several tickers could be programmed like this:
library(quantmod)

CollectionOfTickers <- c("GOOG", "AAPL", "TSLA", "MSFT")

for (TickerName in CollectionOfTickers) {
  CurrentTicker <- getSymbols(TickerName, auto.assign = FALSE)
  cat("========\nData for ticker ", TickerName, "\n")
  ## most recent close price:
  print(Cl(last(CurrentTicker)))
  CurrentTickerFinancials <- getFinancials(TickerName, auto.assign = FALSE)
  ## total equity:
  print(viewFinancials(CurrentTickerFinancials, type = 'BS', period = 'A', subset = NULL)['Total Equity', 1])
  ## Net Income:
  print(viewFinancials(CurrentTickerFinancials, type = 'IS', period = 'Q', subset = NULL)['Net Income', 1])
  cat("========\n")
}
The code quality could be improved by some further refactoring, but in any case this should work.
Hope this helps.
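If you want to keep the values for later use instead of printing them, collecting the results in a named list works well. A sketch:

library(quantmod)

Tickers <- c("GOOG", "AAPL", "TSLA", "MSFT")
Results <- lapply(setNames(Tickers, Tickers), function(tk) {
  CurrentTicker <- getSymbols(tk, auto.assign = FALSE)
  list(last_close = as.numeric(Cl(last(CurrentTicker))))
})
Results[["AAPL"]]$last_close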
I think this is what you want. If you need something else...post back...
require(XML)
require(plyr)

getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding = "UTF-8")
  # search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
  if (length(nodes) > 0) {
    measures <- sapply(nodes, xmlValue)
    # Clean up the column names
    measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:", "", measures))
    # Remove dups
    dups <- which(duplicated(measures))
    # print(dups)
    for (i in seq_along(dups))
      measures[dups[i]] <- paste(measures[dups[i]], i, sep = " ")
    # use the sibling node to get each value
    values <- sapply(nodes, function(x) xmlValue(getSibling(x)))
    df <- data.frame(t(values))
    colnames(df) <- measures
    return(df)
  } else {
    # break
    cat("Could not find", symbol, "\n")
    return(data.frame(NA))
  }
}
tickers <- c("AXP","BA","CAT","CSCO","CVX","DD","DIS","GE","GS","HD","IBM","INTC","JNJ","JPM","KO","MCD","MMM","MRK","MSFT","NKE","PFE","PG","T","TRV","UNH","UTX","V","VZ","WMT","XOM")
stats <- ldply(tickers, getKeyStats_xpath)
stats <- stats[!rowSums(is.na(stats)) == length(stats),]
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)