Using R to Analyze Balance Sheets and Income Statements - r

I am interested in analyzing balance sheets and income statements using R. I have seen that there are R packages that pull information from Yahoo and Google Finance, but all the examples I have seen concern historical stock price information. Is there a way I can pull historical information from balance sheets and income statements using R?

I have only found a partial solution to this on the net: I managed to retrieve the balance sheet and income statement information for a single year, but I don't know how to do it for multiple years.
There is an R package called quantmod, which you can install from CRAN:
install.packages('quantmod')
Then you can do the following. Suppose you want the financial information for a company listed on the NYSE, say General Electric (ticker: GE):
library(quantmod)
getFinancials('GE')
viewFinancials(GE.f)
To get only the annually reported income statement as a data frame, use this:
viewFinancials(GE.f, "IS", "A")
Please let me know if you find out how to do this for multiple years.
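Follow-up: viewFinancials() appears to return a matrix with one column per reporting period (typically the last few annual or quarterly periods), so converting it to a data frame may already give you more than one year. A minimal sketch, assuming the GE.f object created above and that the data source quantmod uses is still serving financials:
library(quantmod)
getFinancials('GE')                      # creates GE.f in the workspace
is_a <- viewFinancials(GE.f, "IS", "A")  # matrix: rows = line items, columns = period end dates
# one column per fiscal year that the data source provides
is_df <- as.data.frame(is_a)
head(is_df)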

The question you really want to ask (and get an answer to!) is: where can I get free XBRL data for analysing corporate balance sheets, and is there a library for consuming such data in R?
XBRL (Extensible Business Reporting Language - http://en.wikipedia.org/wiki/XBRL) is a standard for marking up accounting statements (income statements, balance sheets, profit & loss statements) in XML format so that they can easily be parsed by computer and put into a spreadsheet.
As far as I know, a lot of corporate regulators (e.g. the SEC in the US, ASIC in Australia) are encouraging the companies under their jurisdiction to report using such a format, or running pilots, but I don't believe it has been mandated at this point. If you limited your investment universe (I am assuming you want this data in electronic format for investment purposes) to firms that have made their quarterly reports freely available in XBRL form, I expect you will have a pretty short list of firms to invest in!
Bloomberg, Reuters et al all have pricey feeds for obtaining corporate fundamental data. There may also be someone out there running a tidy business publishing balance sheets in XBRL format. Cheaper, but still paid for, are XIgnite's xFundamentals and xGlobalFundamentals web services, but you aren't getting full balance sheet data from them.

To read in the financial information, try this function (I picked it up several months ago and made some small adjustments):
require(XML)
require(plyr)

getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding = "UTF-8")

  # search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")

  if (length(nodes) > 0) {
    measures <- sapply(nodes, xmlValue)

    # clean up the column names
    measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:", "", measures))

    # make duplicated names unique by appending an index
    dups <- which(duplicated(measures))
    for (i in seq_along(dups))
      measures[dups[i]] <- paste(measures[dups[i]], i, sep = " ")

    # the sibling <td> of each header node holds the value
    values <- sapply(nodes, function(x) xmlValue(getSibling(x)))

    df <- data.frame(t(values))
    colnames(df) <- measures
    return(df)
  } else {
    # no matching nodes: return NULL (break is only valid inside loops)
    return(NULL)
  }
}
To use it, for example to compare 3 companies and write the data to a CSV file, do the following:
tickers <- c("AAPL","GOOG","F")
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)
Just tried it. Still working.
UPDATE, as Yahoo changed its web site layout:
The function above no longer works because Yahoo has again changed its web site layout. Fortunately, it is still easy to get the financial information, as the tags for fundamental data have not changed.
As an example, to download a file with EPS and the P/E ratio for MSFT, AAPL and Ford, insert the following into your browser:
http://finance.yahoo.com/d/quotes.csv?s=MSFT+AAPL+F&f=ser
After hitting return/enter, the CSV will be downloaded automatically to your computer and you should get a file like the one shown below (data as of 7/22/2016):
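Alternatively, you can pull the same CSV straight into R instead of the browser. A minimal sketch, assuming the quotes.csv endpoint is still reachable and that the column names below match the s, e and r tags:
# s = symbol, e = EPS, r = P/E ratio (per the Yahoo tag list)
url <- "http://finance.yahoo.com/d/quotes.csv?s=MSFT+AAPL+F&f=ser"
fundamentals <- read.csv(url, header = FALSE,
                         col.names = c("symbol", "eps", "pe"))
fundamentals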
some yahoo tags for fundamental data:

You are making the common mistake of confusing 'access to Yahoo or Google data' with 'everything I see on Yahoo or Google Finance can be downloaded'.
When R functions download historical stock price data, they almost always access an interface explicitly designed for this purpose, e.g. a CGI handler providing CSV files given a stock symbol and a start and end date. So this is easy: all we need to do is form the appropriate query, hit the web server, fetch the CSV file and parse it.
Now balance sheet information is (as far as I know) not available in such an interface. So you will need to 'screen scrape' and parse the html directly.
It is not clear that R is the best tool for this. I am aware of some Perl modules for the purpose of getting non-time-series data off Yahoo Finance but have not used them.
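For what it's worth, if you do go the screen-scraping route in R, rvest makes the HTML parsing reasonably painless. A minimal sketch (the URL is a placeholder; the page you target must serve its figures in a plain HTML table):
library(rvest)

# hypothetical URL -- substitute the page you actually want to scrape
url <- "https://example.com/some-balance-sheet-page"

page   <- read_html(url)
tables <- html_table(page, fill = TRUE)  # list of all <table> elements on the page
balance_sheet <- tables[[1]]             # pick the one you need after inspecting the list
head(balance_sheet)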

Taking the last two comments into consideration, you may be able to acquire corporate financial statements economically using EDGAR Online. It isn't free, but it is less expensive than Bloomberg or Reuters. Another thing to consider is normalization/standardization of financial reporting. Just because two companies are in the same industry and sell similar products does not mean that, if you laid their income statements or balance sheets side by side, the reporting items would line up. Compustat provides normalized/standardized financial reports.

I don't know anything about R, but assuming that it can call a REST API and consume data in XML form, you can try the Mergent Company Fundamentals API at http://www.mergent.com/servius/ - there is lots of very detailed financial statement data (balance sheets / income statements / cash flow statements / ratios), standardized across companies, going back more than 20 years.

I have written a C# program that I think does what you want. It parses the HTML from nasdaq.com pages and creates one CSV file per stock that includes income statement, cash flow, and balance sheet values going back 5-10 years, depending on the age of the stock. I am now working to add some analysis calculations (mostly historic ratios at this point). I'm interested in learning about R and its applications to fundamental analysis. Maybe we can help each other.

I recently found this R package on CRAN, which I believe does exactly what you are asking for:
XBRL: Extraction of business financial information from XBRL documents
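A minimal sketch of how that package is typically used, assuming xbrlDoAll() is still the main entry point; the instance-document path below is just a placeholder for a filing you have downloaded (e.g. from SEC EDGAR):
library(XBRL)

# path or URL to an XBRL instance document from a filing
inst <- "path/to/company-20151231.xml"   # placeholder

xbrl_data <- xbrlDoAll(inst, cache.dir = "xbrl.Cache", verbose = FALSE)

# xbrlDoAll() returns a list of data frames; the reported numbers live in $fact
str(xbrl_data$fact)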

You can get all three types of financial statements from Intrinio in R for free. Additionally, you can get both as-reported statements and standardized statements. The problem with pulling XBRL filings from the SEC is that there is no standardized option, which means you have to manually map financial statement items if you want to do cross-equity comparisons. Here is an example:
#Install httr, which you need to request data via API
install.packages("httr")
require("httr")
#Install jsonlite which parses JSON
install.packages("jsonlite")
require("jsonlite")
#Create variables for your username and password; get those at intrinio.com/login
username <- "Your_API_Username"
password <- "Your_API_Password"
#Build the different parts of the API call for the income statement
base <- "https://api.intrinio.com/"
endpoint <- "financials/"
type <- "standardized"
stock <- "YUM"
statement <- "income_statement"
fiscal_period <- "Q2"
fiscal_year <- "2015"
#Pasting them together to make the API call
call1 <- paste(base,endpoint,type,"?","identifier","=", stock, "&","statement","=",statement,"&","fiscal_period",
"=", fiscal_period, "&", "fiscal_year", "=", fiscal_year, sep="")
# call1 Looks like this "https://api.intrinio.com/financials/standardized?identifier=YUM&statement=income_statement&fiscal_period=Q2&fiscal_year=2015"
#Now we use the API call to request the data from Intrinio's database
YUM_Income <- GET(call1, authenticate(username,password, type = "basic"))
#That gives us the raw response, which isn't in a usable format yet, so we parse it
test1 <- unlist(content(YUM_Income, "text"))
#Convert from JSON to flattened list
parsed_statement <- fromJSON(test1)
#Then make your data frame:
df1 <- data.frame(parsed_statement)
I wrote this script to make it easy to change out the ticker, dates, and statement type so you can get the financial statement for any US company for any period.
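For example, wrapping the call above in a small function (a sketch, using the same api.intrinio.com endpoint and basic-auth credentials as above) makes swapping the ticker, statement and period a one-liner:
library(httr)
library(jsonlite)

get_statement <- function(stock, statement = "income_statement",
                          fiscal_period = "Q2", fiscal_year = "2015",
                          username, password) {
  call <- paste0("https://api.intrinio.com/financials/standardized",
                 "?identifier=", stock,
                 "&statement=", statement,
                 "&fiscal_period=", fiscal_period,
                 "&fiscal_year=", fiscal_year)
  resp <- GET(call, authenticate(username, password, type = "basic"))
  data.frame(fromJSON(content(resp, "text")))
}

# usage
# yum_q2_2015 <- get_statement("YUM", "income_statement", "Q2", "2015",
#                              username, password)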

I actually do this in Google Sheets. I find it the easiest way, and the fact that it can pull live data is a bonus. Lastly, it doesn't consume any of my storage to save these statements.
=importhtml("http://investing.money.msn.com/investments/stock-income-statement/?symbol=US%3A"&B1&"&stmtView=Ann", "table",0)
where cell B1 contains the ticker.
You can do the same thing for the balance sheet and cash flow statement as well.
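If you'd rather stay in R, the same table can be pulled with rvest, assuming the MSN page still serves it as plain HTML:
library(rvest)

ticker <- "AAPL"
url <- paste0("http://investing.money.msn.com/investments/stock-income-statement/",
              "?symbol=US%3A", ticker, "&stmtView=Ann")

page   <- read_html(url)
tables <- html_table(page, fill = TRUE)   # list of all tables on the page
income_statement <- tables[[1]]           # pick the right one after inspecting the list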

1. Subscribe to the Yahoo Finance API on RapidAPI here
2. Get your key
3. Insert your key into the code:
name="AAPL"
{raw=httr::GET(paste("https://yahoo-finance15.p.rapidapi.com//api/yahoo/qu/quote/",name,"/financial-data", sep = ""),
httr::add_headers("x-rapidapi-host"= "yahoo-finance15.p.rapidapi.com",
"x-rapidapi-key"="insert your Key here")
)
raw=jsonlite::fromJSON(rawToChar(raw$content))
values=sapply(1:length(raw$financialData),function(x){sapply(raw, "[", x)[[1]][1]})
names(values)=names(raw$financialData)
values=as.data.frame(t(values))
row.names(values)=name
}
values
Pros: easy way to get data.
Cons: the free version is limited to 500 requests per month.

Related

API Webscrape OpenFDA with R

I am scraping OpenFDA (https://open.fda.gov/apis). I know my particular inquiry has 6974 hits, which are organized into 100 hits per page (the API's maximum download). I am trying to use R (rvest, jsonlite, purrr, tidyverse, httr) to download all of this data.
I checked the website information with curl in terminal and downloaded a couple of sites to see a pattern.
I've tried a few lines of code and I can only get 100 entries to download. The code below seems to work decently, but it will only pull 100 entries, i.e. one page. To skip the first 100, which I can pull down and merge later, here is the code that I have used:
url_json <- "https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=100&skip=6973"
raw_json <- httr::GET(url_json, accept_json())
data<- httr::content(raw_json, "text")
my_content_from_json <- jsonlite::fromJSON(data)
dplyr::glimpse(my_content_from_json)
dataframe1 <- my_content_from_json$results
view(dataframe1)
SOLUTION below in the responses. Thanks!
From the comments:
It looks like the API parameters skip and limit work better than the search_after parameter. They allow pulling down 1,000 entries simultaneously according to the documentation (open.fda.gov/apis/query-parameters). To provide these parameters into the query string, an example URL would be
https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=1000&skip=0
after which you can loop to get the remaining entries with skip=1000, skip=2000, etc. as you've done above.
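A sketch of that loop, using the same API-key placeholder and the limit/skip parameters from the documentation:
library(httr)
library(jsonlite)
library(dplyr)

base_url <- "https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=1000&skip="
skips    <- seq(0, 6000, by = 1000)   # 6974 hits -> 7 requests of up to 1000 each

pages <- lapply(skips, function(s) {
  raw_json <- GET(paste0(base_url, s), accept_json())
  fromJSON(content(raw_json, "text"))$results
})

# combine; bind_rows copes with list-columns, though the nested openFDA fields
# may still need flattening afterwards
all_results <- bind_rows(pages)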

How to get from Excel STOCKHISTORY to a list in R?

I need to find historical time series for some stocks and have the result in R.
I already tried the package "quantmod", but unfortunately most of the stocks are not covered.
I found Excel's "STOCKHISTORY" to yield good results.
Hence, to allow for more solutions, I phrase my question more openly:
How do I get from a table that contains stocks (ticker), start date and end date to a list in R that contains each respective stock and its stock price time series?
My starting point looks like this:
My aim at the very end is to have something like this:
(It's also OK if I end up with every single stock price time series as a CSV.)
My ideas so far:
Excel VBA Solution 1
Write a macro that executes Excel's "STOCKHISTORY" function on each of these stocks and writes the results as CSV files or similar? Then, after that, read them into R and create a list.
Excel VBA Solution 2
Write a macro that executes Excel's "STOCKHISTORY" function on each of these stocks,
each one in a new worksheet? Bad idea, since there are more than 4000 stocks.
R Solution
(If possible) call the "STOCKHISTORY" function from R directly (?)
Any suggestions on how to tackle this?
Kind regards
I would recommend using an API, especially over trying to connect to Excel via VBA. There are many that are free but require an API key from their website. For example, Alpha Vantage:
library(tidyverse)
library(jsonlite)
library(httr)
symbol = "IBM"
av_key = "demo"
url <- str_c("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=", symbol ,"&apikey=", av_key, "&datatype=csv")
d <- read_csv(url)
d %>% head
Credit and other options: https://quantnomad.com/2020/07/06/best-free-api-for-historical-stock-data-examples-in-r/
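To go from your ticker table to a named list of time series, one sketch (assuming an Alpha Vantage key and a data frame with Ticker, Start and End columns mirroring your screenshot; those column names are just placeholders):
library(tidyverse)

av_key <- "YOUR_KEY"   # your Alpha Vantage key

# hypothetical input table, mirroring the ticker / start date / end date screenshot
stocks <- tibble(Ticker = c("IBM", "MSFT"),
                 Start  = as.Date(c("2019-01-01", "2020-01-01")),
                 End    = as.Date(c("2020-12-31", "2021-12-31")))

get_prices <- function(symbol, start, end) {
  url <- str_c("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY",
               "&symbol=", symbol, "&outputsize=full",
               "&apikey=", av_key, "&datatype=csv")
  read_csv(url) %>%
    filter(timestamp >= start, timestamp <= end)
}

# named list: one price time series per ticker
price_list <- set_names(
  pmap(list(stocks$Ticker, stocks$Start, stocks$End), get_prices),
  stocks$Ticker
)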

Equivalent R function to slackr_upload for microsoft teams

We have recently moved from Slack to Microsoft Teams. There was a useful package (slackr) that allowed files to be uploaded to Slack from R (example below), so I am wondering if there is an equivalent for Microsoft Teams.
library(slackr)

slackrSetup(incoming_webhook_url = "webhook-url",
            api_token = "api-token")

d1 <- data.frame(col1 = "a", col2 = "b")

write.table(d1, file = paste0("my-location/export.csv"))

slackr_upload(paste0("my-location/export.csv"),
              channel = "my-channel")
I have found that there is a teamr package which is useful for messages, but it doesn't allow uploading of files. I have attempted to at least format the contents of the data frame as a markdown table in the message sent from teamr, but as the tables can be quite large (500 rows, 20-30 columns), this isn't convenient for the Microsoft Teams users to extract the data.
Alternatively, I can create and send an email with an attachment from R, but hoping there is an approach to keep it to teams that I have missed.
Like @Gakku said, I think this could be achieved with the Microsoft365R package.
I think something along these lines would put it in a specific team, even a specific channel, creating an upload folder along the way:
library(Microsoft365R)
team <- get_team("NAME OF YOUR TEAM")
channel <- team$get_channel("NAME OF YOUR CHANNEL")
channel$get_folder()$create_folder("UPLOAD LOCATION")
channel$get_folder()$get_item("UPLOAD LOCATION")$upload("UPLOAD_FILE.CSV")
I know this is old, but in case someone comes across this, look at microsoft365r which lets you upload files and much more in MS teams.

How to bulk download historical weather data from Canadian government website using R

I'm trying to download climate data in bulk from the Canadian Government's national weather and climate data reporting service using R. Below are instructions provided via their website which explain how to do this using Cygwin, which I was able to do. However, I'd like to include the file retrieval as part of an R script that iterates over multiple stations and time frames, grabs the data, and processes it. I'm somewhat new to R, so I'm having trouble with this process.
Readme.txt
URL based procedure to automatically download data in bulk from Climate Website
(http://www.climate.weather.gc.ca)
Version: 2016-05-10
ENVIRONMENT AND CLIMATE CHANGE CANADA
To read this file online, please visit:
ftp://client_climate@ftp.tor.ec.gc.ca/Pub/Get_More_Data_Plus_de_donnees/
Folder: Get_More_Data_Plus_de_donnees > Readme.txt
Instructions on how to download all weather data for one station from Environment and Climate Change Canada's Climate website:
A daily updated list of Climate stations in the National Archive, including their Climate ID, Station ID, WMO ID, TC ID, and co-ordinates can be found in the following folder:
Get_More_Data_Plus_de_donnees > Station Inventory EN.csv
Use the following utility to download data:
wget (GNU / Linux Operating systems)
Cygwin (Windows operating systems): https://www.cygwin.com
Homebrew (OS X - Apple): http://brew.sh/
Example to download all available hourly data for Yellowknife A, from 1998 to 2008, in .csv format
Command line:
for year in `seq 1998 2008`;do for month in `seq 1 12`;do wget --content-disposition "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=1706&Year=${year}&Month=${month}&Day=14&timeframe=1&submit=Download+Data" ;done;done
WHERE:
year = change values in command line (`seq 1998 2008`)
month = change values in command line (`seq 1 12`)
format= [csv|xml]: the format output
timeframe = 1: for hourly data
timeframe = 2: for daily data
timeframe = 3: for monthly data
Day: the value of the "day" variable is not used and can be an arbitrary value
For another station, change the value of the variable stationID
For the data in XML format, change the value of the variable format to xml in the URL.
For information in French, change Download+Data with ++T%C3%A9l%C3%A9charger+%0D%0Ades+donn%C3%A9es, also change _e with _f in the url.
For questions or concerns please contact our National Climate Services office at:
ec.services.climatiques-climate.services.ec@canada.ca
As stated above, the Cygwin command is:
for year in `seq 2015 2018`;do for month in `seq 1 12`;do wget --content-disposition "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=51459&Year=${year}&Month=${month}&Day=14&timeframe=1&submit=Download+Data" ;done;done
I know that download.file() has a wget method, as used by the Cygwin command; however, when I tried the following:
download.file("http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=51459&Year=$2018&Month=$12&Day=14&timeframe=2&submit=Download+Data","X:/folderX/example.csv", method = "wget")
I get a 'wget' call had nonzero exit status error.
Not sure if this has something to do with --content-disposition from the Cygwin command or if I'm even approaching this with the right function or not, so any direction is greatly appreciated.
Thanks.
You are using http; you need https.
You also have $ in front of your numbers.
What you have:
http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=51459&Year=$2018&Month=$12&Day=14&timeframe=2&submit=Download+Data
WORKING LINK:
https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=51459&Year=2018&Month=3&Day=31&timeframe=1
There is also some documentation you'll need for general use of the climate data, such as the station list, so you know which stations to request:
https://drive.google.com/uc?authuser=0&id=1egfzGgzUb0RFu_EE5AYFZtsyXPfZ11y2&export=download
You can find information on the columns here:
https://climate.weather.gc.ca/glossary_e.html#weather
And some general data information here:
https://climate.weather.gc.ca/about_the_data_index_e.html https://drive.google.com/drive/folders/1WJCDEU34c60IfOnG4rv5EPZ4IhhW9vZH
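Building on the working URL above, a plain download.file() loop in R (a sketch for station 51459 with hourly data, mirroring the wget command) avoids Cygwin entirely:
dest_dir <- "X:/folderX"   # change to your output folder

for (year in 2015:2018) {
  for (month in 1:12) {
    url <- paste0("https://climate.weather.gc.ca/climate_data/bulk_data_e.html",
                  "?format=csv&stationID=51459&Year=", year,
                  "&Month=", month, "&Day=14&timeframe=1")
    dest <- file.path(dest_dir, sprintf("station51459_%d_%02d.csv", year, month))
    download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }
}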
I didn't see anything useful, at all, in the link you posted. It seems to be some kind of landing page, but it doesn't contain any weather-related data. Anyway, here is how to loop through an array of URLs, and download data from each. Just modify it to suit your needs.
library(RCurl)
library(XML)

pageNum <- 1:10
url  <- "http://www.totaljobs.com/JobSearch/Results.aspx?Keywords=Leadership&LTxt=&Radius=10&RateType=0&JobType1=&CompanyType=&PageNum="
urls <- paste0(url, pageNum)

allPages <- lapply(urls, function(x) getURLContent(x)[[1]])
xmlDocs  <- lapply(allPages, function(x) XML::htmlParse(x))
Here is another example.
mydownload <- function (start_date, end_date) {
  start_date <- as.Date(start_date)  ## convert to Date object
  end_date   <- as.Date(end_date)    ## convert to Date object
  dates <- as.Date("1970/01/01") + (start_date : end_date)  ## date sequence
  ## a loop to download data
  for (i in 1:length(dates)) {
    string_date <- as.character(dates[i])  ## already "YYYY-MM-DD", which the URL below expects
    myfile <- paste0("C:/Users/Excel/Desktop/weather/", string_date, ".csv")
    myurl  <- paste("https://sci.ncas.ac.uk/leedsweather/Archive/CUSTOM-ARC-", string_date, "-METRIC.csv", sep = "")
    download.file(url = myurl, destfile = myfile, quiet = TRUE)
  }
}

mydownload("2013/11/25", "2013/11/30")
As @EmilyKothe was saying, you can use the R weathercan package to download ECCC station data in bulk. I find it still a bit annoying since not everyone uses R.
So I built a Shiny web application based on weathercan functionality:
https://nickrongkp.shinyapps.io/WeatherCan/
The app is hosted on shinyapps.io (free account with limited running time) so please close the browser as soon as you are done for others' sake.
Source code:
https://github.com/nickyrong/ShinyWeatherCan
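For R users, a minimal weathercan sketch (assuming the current weather_dl() interface) for the same station and period as the example above:
# install.packages("weathercan")
library(weathercan)

hourly_51459 <- weather_dl(station_ids = 51459,
                           start = "2015-01-01",
                           end   = "2018-12-31",
                           interval = "hour")
head(hourly_51459)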

requesting data from the Center for Disease Control using RSocrata or XML in R

My goal is to obtain a time series from week 1 of 1996 to week 46 of 2016 of legionellosis cases from this website supported by the Centers for Disease Control and Prevention (CDC) of the United States. A coworker attempted to scrape only the tables that contain legionellosis cases with the code below:
#install.packages('rvest')
library(rvest)
## Code to get all URLS
getUrls <- function(y1, y2, clist) {
  root  <- "https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year="
  root1 <- "&mmwr_week="
  root2 <- "&mmwr_table=2"
  root3 <- "&request=Submit&mmwr_location="
  urls  <- NULL
  for (year in y1:y2) {
    for (week in 1:53) {
      for (part in clist) {
        urls <- c(urls, paste(root, year, root1, week, root2, part, root3, sep = ""))
      }
    }
  }
  return(urls)
}
TabList<-c("A","B") ## can change to get not just 2 parts of the table but as many as needed.
WEB <- as.data.frame(getUrls(1996,2014,TabList)) # Only applies from 1996-2014. After 2014, the root url changes.
head(WEB)
#Example of how to extract data from a single webpage.
url <- 'https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year=1996&mmwr_week=20&mmwr_table=2A&request=Submit&mmwr_location='
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table, fill = TRUE)[[2]]
#test if Legionellosis is in the table. Returns a vector showing the columns index if the text is found.
#Can use this command to filter only pages that you need and select only those columns.
test <- grep("Leg", sb)
sb <- sb[,c(1,test)]
### This code only works if you have 3 columns for headings. Need to adapt to be more general for all tables.
#Get Column names
colnames(sb) <- paste(sb[2,], sb[3,], sep="_")
colnames(sb)[1] <- "Area"
sb <- sb[-c(1:3),]
#Remove commas from numbers so that you can then convert columns to numerical values. Only important if numbers above 1000
Dat <- sapply(sb, FUN= function(x)
as.character(gsub(",", "", as.character(x), fixed = TRUE)))
Dat<-as.data.frame(Dat, stringsAsFactors = FALSE)
However, the code is not finished and I thought it may be best to use the API since the structure and layout of the table in the webpages changes. This way we wouldn't have to comb through the tables to figure out when the layout changes and how to adjust the web scraping code accordingly. Thus I attempted to pull the data from the API.
Now, I found two help documents from the CDC that provide the data. One appears to provide data from 2014 onward using RSocrata (which can be seen here), while the other appears to be more generalized and uses XML-format requests over HTTP (which can be seen here). The XML request over HTTP requires a database ID, which I could not find. Then I stumbled onto RSocrata and decided to try that instead. But the code snippet provided, along with the token ID I set up, did not work:
install.packages("RSocrata")
library("RSocrata")
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au?$$app_token=tdWMkm9ddsLc6QKHvBP6aCiOA")
How can I fix this? My end goal is a table of legionellosis cases from 1996 to 2016 on a weekly basis by state.
I'd recommend checking out this issue thread in the RSocrata GitHub repo where they're discussing a similar issue with passing tokens into the RSocrata library.
In the meantime, you can actually leave off the $$app_token parameter, and as long as you're not flooding us with requests, it'll work just fine. There's a throttling limit you can sneak under without using an app token.
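In other words, something as simple as this should work until the token issue is fixed (a sketch, assuming the cmap-p7au resource from your call is the dataset you want):
library(RSocrata)

# no app token; fine for light use per the note above
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au")
head(df)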
