How/where to match tidycensus variables with Census Bureau variables? - r

Problem
I am given a long list of specific variable codes for the DP05 table - in the Census Bureau's format. For instance:
target_dp05_vars = c(perc_white = "HC03_VC53",
                     perc_black = "HC03_VC55",
                     perc_native = "HC03_VC56")
Since tidycensus uses its own variable naming convention, I can't use the above easily. How do I easily crosswalk to the tidycensus definition?
Temporary solution
In the meantime, I've manually downloaded the Bureau's metadata file and eliminated rows with HC02 and HC04 prefixes so it lines up positionally with the tidycensus variables, giving me an internal crosswalk, but it's tedious.
I'd love to just feed those HCs as a named vector into get_acs() and perhaps just specify the table as DP05.

tidycensus doesn't use its own variable naming convention - it uses variable IDs as specified by the Census API. For example, see https://api.census.gov/data/2017/acs/acs5/profile/variables.html, which is accessible in R with:
library(tidycensus)
dp17 <- load_variables(2017, "acs5/profile", cache = TRUE)
The IDs you've provided appear to be FactFinder codes.
If you want the full DP05 table in one tidycensus call, you can do the following (e.g. for counties in New York) with tidycensus 0.9:
dp05 <- get_acs(geography = "county",
                table = "DP05",
                state = "NY")
Mappings of variable IDs to their meanings are in turn available with load_variables().
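To build a crosswalk from the FactFinder-style codes, one option is to pull just the DP05 rows out of the load_variables() result and match your target descriptions against the label column. A minimal sketch, assuming dplyr and stringr are available:

library(dplyr)
library(stringr)

dp17 <- load_variables(2017, "acs5/profile", cache = TRUE)
dp05_vars <- dp17 %>%
  filter(str_detect(name, "^DP05"))
# Inspect dp05_vars$label to find the rows corresponding to perc_white,
# perc_black and perc_native, then build a named vector of the matching
# Census API IDs to pass to get_acs(variables = ...).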
Note: I am getting intermittent server errors with these calls from the API, which may be due to the government shutdown. If it doesn't work at first, try again.

Related

How to load internally stored data before functions in an R package?

I am currently writing an R package containing project-specific data cleaning functions for my collaborators, using devtools and roxygen2 and following the RStudio suggested formats. Many of these functions essentially fix typos/common data entry errors using reference files (dataframes) that are currently stored in the package's sysdata.rda file under /R. Prior to the issue I present below, referencing the files and using them within functions was working fine. Lazy load is set to true.
I would like to make a function that allows users to add a row to these reference files if they come across a novel typo/error. From my research and from reading the very helpful information at https://r-pkgs.org/data.html, it seems the best way to do this is to copy the reference files into a new environment and then allow the user to edit those files within that session-specific environment. Ideally, these changes would persist across sessions, but I cannot figure out how to make that work, so I am continuing down this path.
For brevity, I've only included one of these files, called column_standardized, which contains the standard names for the columns as well as the potential alternatives we regularly come across. A function called "standardize.columns" coerces column names of input data frames to the standard names and also reorders them to our agreed standard.
Here is a short reproducible version of column_standardized:
column_standardized <- data.frame(
  standard_name = c("date", "date", "time", "ID", "ID", "ID", "location", "location"),
  other_names = c("DATE", "day", "TIME", "id", "individual", "name", "LOCATION", "locale")
)
To do this, I created a file in /R that contains the following code, based heavily on the example in https://r-pkgs.org/data.html (section 8.5, Internal State). The file is titled "aaaaa.R" so it comes before other functions in /R:
(The reason I set key, correct_col, and alt_col is so that the same function can act upon different reference files with the same code. It works fine and I am not necessarily seeking feedback on the data manipulation aspects of this function.)
the <- new.env(parent = emptyenv())
the$column_standardized <- column_standardized
#' Add alternative names to reference
#'
#' @param correct A character string of the standard/correct name
#' @param alt A character string of the alternative name
#' @param data_type A character string indicating which reference file to edit. Options are 'column' (and others in reality).
#'
#' @return NA; edits included reference files for current R session.
#' @export
add.alt <- function(correct, alt, data_type){
  if (data_type == "column") {
    key <- the$column_standardized
    correct_col <- "standard_name"
    alt_col <- "other_names"
  } else {
    # stop here: otherwise key is undefined and the code below fails cryptically
    stop("Data type not found. Acceptable options are 'column', etc.")
  }
  new_key <- data.frame(matrix(ncol = length(colnames(key)), nrow = 1))
  colnames(new_key) <- colnames(key)
  new_key[1, correct_col] <- correct
  new_key[1, alt_col] <- alt
  key <- rbind(key, new_key)
  if (data_type == "column") {
    the$column_standardized <- key
    invisible(key)
  }
}
No errors/issues are flagged when I run document() or load_all(), but when I check the package it is unable to install because column_standardized does not exist. I assume this is because sysdata.rda is loaded after .R files in /R.
I have also tried to put column_standardized in the /data folder and call it using system.file but run into the same error.
The actual file is over 300 rows long, and there are multiple reference files, and so I don't think it makes sense to just recreate the data frame in the environment from scratch, although I've considered it.
Finally, to my specific questions:
Is there a way to load the system data first so that .R files in /R can reference the data included?
I am not wed to storing them internally although that would be ideal for privacy reasons, and could move the dataframes to /data or another location. This seemed initially simplest but I could be wrong.
Is there a modification that would allow each user to "permanently" modify these reference files? It wouldn't be too much of a headache for them to run add.alt() each session since the files already contain the most common errors, and once user data is edited/standardized one usually does not need to restandardize in another session. If someone knows a solution, however, it would probably be ideal.
I am potentially completely off-base here as this is my first time developing a package, so any tips are appreciated! Many thanks in advance, and happy to provide more documentation information if I've forgotten anything crucial.
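One possible direction for both the load-order and the persistence questions, offered as an untested sketch rather than a confirmed fix: defer populating the environment until .onLoad(), which runs after sysdata.rda has been loaded into the namespace, and optionally restore user additions from a per-user file (tools::R_user_dir() requires R >= 4.0; the file name below is purely illustrative).

# R/zzz.R -- sketch only
the <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # by the time the namespace is loaded, objects from sysdata.rda are visible
  the$column_standardized <- column_standardized

  # optionally restore additions a user saved in an earlier session
  user_file <- file.path(tools::R_user_dir(pkgname, which = "data"),
                         "column_standardized.rds")
  if (file.exists(user_file)) {
    the$column_standardized <- readRDS(user_file)
  }
}

On the same assumption, add.alt() could finish with saveRDS(the$column_standardized, user_file) (after dir.create() on the parent directory) so that an addition carries over to the next session.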

Read in a list of shapefiles and row bind them in R (preferably using tidy syntax and sf)

I have a directory with a bunch of shapefiles for 50 cities (and will accumulate more). They are divided into three groups: cities' political boundaries (CityA_CD.shp, CityB_CD.shp, etc.), neighborhoods (CityA_Neighborhoods.shp, CityB_Neighborhoods.shp, etc.), and Census blocks (CityA_blocks.shp, CityB_blocks.shp, etc.). They use common file-naming syntaxes, have the same set of attribute variables, and are all in the same CRS. (I transformed all of them as such using QGIS.) I need to write a list of each group of files (political boundaries, neighborhoods, blocks) to read as sf objects and then bind the rows to create one large sf object for each group. However I am running into consistent problems developing this workflow in R.
library(tidyverse)
library(sf)
library(mapedit)
# This first line succeeds in creating a character vector of the files that match the regex pattern.
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
# This second line creates a list object from the files.
shapefile_list <- lapply(filenames, st_read)
# This third line (adopted from https://github.com/r-spatial/sf/issues/798) fails as follows.
districts <- mapedit:::combine_list_of_sf(shapefile_list)
Error: Column `District_I` can't be converted from character to numeric
# This fourth line fails in an apparently different way (also adopted from https://github.com/r-spatial/sf/issues/798).
districts <- do.call(what = sf:::rbind.sf, args = shapefile_list)
Error in CPL_get_z_range(obj, 2) : z error - expecting three columns;
The first error appears to indicate that one of my shapefiles has an incorrect variable class for the common variable District_I, but R provides no information to clue me in to which file is causing the error.
The second error seems to be looking for a z coordinate but is only finding x and y in the geometry attribute.
I have four questions on this front:
How can I have R identify which list item it was attempting to read and bind when an error halts the process?
How can I force R to ignore the incompatibility issue and coerce the variable class to character so that I can deal with the variable inconsistency (if that's what it is) in R?
How can I drop a variable that is causing an error entirely from the sf objects as they are read (i.e. omit District_I for all read_sf calls in the process)?
More generally, what is going on and how can I solve the second error?
Thanks all as always for your help.
P.S.: I know this post isn't "reproducible" in the desired way, but I'm not sure how to make it so besides copying the contents of all my shapefiles. If I'm mistaken on this point, I'd gladly accept any wisdom on this front.
UPDATE:
I've run
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
shapefile_list <- lapply(filenames, st_read)
districts <- mapedit:::combine_list_of_sf(shapefile_list)
successfully on a subset of three of the shapefiles. So I've confirmed that there is some class conflict involving the column District_I in one of the files that causes the hold-up when running the code on the full batch. But again, I need the error to identify the file name causing the issue so I can fix it in the file OR need the code to coerce District_I to character in all files (which is the class I want that variable to be in anyway).
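A quick way to spot the offending file might be to tabulate the class of District_I in each list element against the file names; a minimal sketch, assuming District_I is present in every shapefile:

col_classes <- sapply(shapefile_list, function(x) class(x$District_I)[1])
data.frame(file = basename(filenames), District_I_class = col_classes)

Whichever file reports a different class from the rest is the one tripping up the bind.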
A note, particularly regarding Pablo's recommendation:
districts <- do.call(what = dplyr::rbind_all, shapefile_list)
results in an error
Error in (function (x, id = NULL) : unused argument
followed by a long string of digits and coordinates. So,
mapedit:::combine_list_of_sf(shapefile_list)
is definitely the mechanism to read from the list and merge the files, but I still need a way to diagnose the source of the column incompatibility error across shapefiles.
So after much fretting and some great guidance from Pablo (and his link to https://community.rstudio.com/t/simplest-way-to-modify-the-same-column-in-multiple-dataframes-in-a-list/13076), the following works:
library(tidyverse)
library(sf)
# Reads in all shapefiles from Directory that include the string "_CDs".
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
# Applies st_read from the sf package to each file name in the character vector, turning the file list into a list of sf objects.
shapefile_list <- lapply(filenames, st_read)
# Creates a function that transforms a problem variable to class character for all shapefile reads.
my_func <- function(data, my_col){
  my_col <- enexpr(my_col)
  output <- data %>%
    mutate(!!my_col := as.character(!!my_col))
}
# Applies the new function to our list of shapefiles and specifies "District_I" as our problem variable.
districts <- map_dfr(shapefile_list, ~my_func(.x, District_I))
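For readers less comfortable with the enexpr()/!! tidy-evaluation machinery, the same coercion can be sketched without it (untested, and assuming District_I is the only problem column), reusing the combine step that already worked above:

shapefile_list <- lapply(shapefile_list, function(x) {
  x$District_I <- as.character(x$District_I)
  x
})
districts <- mapedit:::combine_list_of_sf(shapefile_list)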

Can AZ ML workbench reference multiple data sources from Data Prep Transform Dataflow expression

Using AZ ML Workbench for a class project (it is the required tool), I coded the desired logic below in an exploration notebook but cannot find a way to include it in a Data Prep Transform Dataflow.
all_columns = df.columns
sum_columns = [col_name for col_name in all_columns if col_name not in ['NPI', 'Gender', 'State', 'Credentials', 'Specialty']]
sum_op_columns = list(set(sum_columns) & set(df_op['Drug Name'].values))
The logic uses the column names from one data source, df_op (opioid drugs), to choose which subset of columns to include from another data source, df (all drugs). When adding a Python script/expression Transform Dataflow, I only see the ability to reference the single df. Alternatives?
I may have a way for you to access both data frames.
In Workbench, once you have the data sources that you need loaded, right click on one and select "Generate Data Access Code File".
Once there you're automatically given code to access that specific file. However, you can use the same code to access the other files.
In the screenshot above, I have two data sources. I can use the below code to access them both as a pandas data frame and manipulate them as I need.
df_salary = datasource.load_datasource('SalaryData.dsource')
df_startup = datasource.load_datasource('50-Startups.dsource')
I believe from there you can save your updated data frame to a CSV and then use that in the train script.
Hope that helps or at least points you to another solution.

How to diagnose duplicated levels in an R import

I use R to analyze large data files from IPUMS, which publishes sophisticated microdata on Census records. IPUMS offers its extracts as SPSS, SAS or Stata files. To get the data into R, I've had the most luck downloading the SPSS version and using the read.spss function from the "foreign" library:
library(foreign);
ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE);
This works brilliantly, save for this perpetual warning:
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
(If anyone is feeling heroic, I uploaded the zipped .sav file here (39Mb) as well as the .SPS file and the more human-readable codebook. This is just a sample IPUMS extract and, like all IPUMS data, contains no private information.)
My question is whether my data is compromised by duplicate factors in the SPSS file or whether this is something I can fix after the import.
To figure out which of the columns was the culprit, I wrote a little diagnostic:
ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE);
for (name in names(ipums)) {
  type <- class(ipums[[name]]);
  if (type == "factor") {
    print(name);
    print(anyDuplicated(levels(ipums[[name]])));
  }
}
This loop correctly identifies the column BPLD as the culprit. That's a detailed version of a person's birthplace that has 536 possible values in the .SPS file, as confirmed by this code:
fac <- levels(ipums$BPLD)
length(fac) #536
anyDuplicated(fac) #153
fac[153] #"Br. Virgin Islands, ns"
When I look at the .SPS file, I do see in fact that there are two entries for this location:
26052 "Br. Virgin Islands, ns"
26069 "Br. Virgin Islands, ns"
However, I don't see a single instance of this location in the data:
NROW(subset(ipums, ipums$BPLD=="Br. Virgin Islands, ns")) #0
This may well be because this is not a common location that's likely to show up in the data, but I cannot assume that will always be the case in future projects. So part two of my question is whether an SPSS file with duplicate factors will at least import the correct values, or whether a file that produces this warning message is potentially damaged.
As for fixing the problem, I see a few related StackOverflow posts, like this one, but I'm not sure if they address the problem I have with complex public data from a third-party. What is the most efficient way for me to clean up factors with duplicate values so that I can have full confidence in the data?
SPSS does not require uniqueness of value labels. In this dataset, BPLD is a string. I believe read.spss will create a factor with duplicate levels but will assign all the duplicate values to just one of them. You can use droplevels() after reading the data to get rid of the unused level.
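Following that suggestion, a minimal sketch of the clean-up (untested against the actual extract):

ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE)
ipums$BPLD <- droplevels(ipums$BPLD)   # drops levels with no observations
anyDuplicated(levels(ipums$BPLD))      # should be 0 once the unused duplicate is gone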
Could you try importing and specifying factors as false with either:
# haven't tested
read.spss(x..., stringsAsFactors = FALSE)
or, from the help page for read.spss:
read.spss(x...,use.value.labels=FALSE)
?read.spss
#use.value.labels
#logical: convert variables with value labels into R factors with those levels?
#This is only done if there are at least as many labels as values of the
#variable (when values without a matching label are returned as NA).
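Concretely, with the file from the question this would look something like the sketch below; labelled columns then come back as raw coded values rather than factors, and the labels can be re-applied later from the codebook/.SPS file if needed.

ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE,
                   use.value.labels = FALSE)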

Using R to Analyze Balance Sheets and Income Statements

I am interested in analyzing balance sheets and income statements using R. I have seen that there are R packages that pull information from Yahoo and Google Finance, but all the examples I have seen concern historical stock price information. Is there a way I can pull historical information from balance sheets and income statements using R?
I have found only a partial solution to your issue on the net, as I managed to retrieve the balance sheet info and financial statement for just one year. I don't know how to do it for more years.
There is a package in R, called quantmod, which you can install from CRAN
install.packages('quantmod')
Then you can do the following. Suppose you want to get the financial info for a company listed on the NYSE: General Electric, ticker GE.
library(quantmod)
getFinancials('GE')
viewFinancials(GE.f)
To get only the income statement, reported annually, as a data frame, use this:
viewFinancials(GE.f, "IS", "A")
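Since the question is specifically about balance sheets, the same call should work with the type argument switched; a sketch based on quantmod's documented statement types ('BS', 'IS', 'CF'):

viewFinancials(GE.f, "BS", "A")   # annual balance sheet
viewFinancials(GE.f, "CF", "A")   # annual cash flow statement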
Please let me know if you find out how to do this for multiple years.
The question you want to ask (and get an answer to!) is: where can I get free XBRL data for analysing corporate balance sheets, and is there a library for consuming such data in R?
XBRL (Extensible Business Reporting Language - http://en.wikipedia.org/wiki/XBRL) is a standard for marking up accounting statements (income statements, balance sheets, profit & loss statements) in XML format such that they can easily be parsed by computer and put into a spreadsheet.
As far as I know, a lot of corporate regulators (e.g. the SEC in the US, ASIC in Australia) are encouraging the companies under their jurisdiction to report using such a format, or running pilots, but I don't believe it has been mandated at this point. If you limit your investment universe (I am assuming you want this data in electronic format for investment purposes) to firms that have made their quarterly reports freely available in XBRL form, I expect you will have a pretty short list of firms to invest in!
Bloomberg, Reuters et al all have pricey feeds for obtaining corporate fundamental data. There may also be someone out there running a tidy business publishing balance sheets in XBRL format. Cheaper, but still paid for, are XIgnite's xFundamentals and xGlobalFundamentals web services, but you aren't getting full balance sheet data from them.
To read in the financial information, try this function (I picked it up several months ago and made some small adjustments):
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding = "UTF-8")
  # search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
  if (length(nodes) > 0) {
    measures <- sapply(nodes, xmlValue)
    # Clean up the column names
    measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:", "", measures))
    # Remove dups by appending an index to repeated names
    dups <- which(duplicated(measures))
    # print(dups)
    for (i in seq_along(dups))
      measures[dups[i]] <- paste(measures[dups[i]], i, sep = " ")
    # use the sibling <td> node to get each value
    values <- sapply(nodes, function(x) xmlValue(getSibling(x)))
    df <- data.frame(t(values))
    colnames(df) <- measures
    return(df)
  } else {
    # nothing found for this symbol; break is invalid outside a loop, so return NULL
    return(NULL)
  }
}
To use it, for example to compare 3 companies and write the data into a CSV file, do the following:
tickers <- c("AAPL","GOOG","F")
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)
Just tried it. Still working.
UPDATE as Yahoo changed its website layout:
The function above does not work anymore as Yahoo again changed its website layout. Fortunately it's still easy to get the financial info, as the tags for getting fundamental data have not been changed.
As an example, to download a file with EPS and P/E ratio for MSFT, AAPL and Ford, insert the following into your browser:
http://finance.yahoo.com/d/quotes.csv?s=MSFT+AAPL+F&f=ser
After entering the above URL into your browser's address bar and hitting return/enter, the CSV will be downloaded automatically to your computer, and you should get a file like the one shown below (data as of 7/22/2016):
some yahoo tags for fundamental data:
You are making the common mistake of confusing 'access to Yahoo or Google data' with 'everything I see on Yahoo or Google Finance can be downloaded'.
When R functions download historical stock price data, they almost always access an interface explicitly designed for this purpose, e.g. a CGI handler providing CSV files given a stock symbol and start and end dates. So this is easy, as all we need to do is form the appropriate query, hit the webserver, fetch the CSV file and parse it.
Now balance sheet information is (as far as I know) not available in such an interface. So you will need to 'screen scrape' and parse the html directly.
It is not clear that R is the best tool for this. I am aware of some Perl modules for the purpose of getting non-time-series data off Yahoo Finance but have not used them.
Taking the last two comments into consideration, you may be able to acquire corporate financial statements economically using EDGAR Online. It isn't free, but it is less expensive than Bloomberg and Reuters. Another thing to consider is financial reporting normalization/standardization. Just because two companies are in the same industry and sell similar products does not necessarily mean that if you laid the two companies' income statements or balance sheets side by side, the reporting items would line up. Compustat has normalized/standardized financial reports.
I don't know anything about R, but assuming that it can call a REST API and consume data in XML form, you can try the Mergent Company Fundamentals API at http://www.mergent.com/servius/ - there's lots of very detailed financial statement data (balance sheets / income statements / cashflow statements / ratios), standardized across companies, going back more than 20 years
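R can indeed call a REST API and consume XML; a minimal sketch with httr and xml2 is below. The endpoint URL and node names are purely illustrative placeholders, not the actual Mergent Company Fundamentals API.

library(httr)
library(xml2)

# hypothetical endpoint and XML structure, shown only to illustrate the pattern
resp <- GET("https://api.example.com/fundamentals?ticker=GE",
            add_headers(Authorization = "Bearer YOUR_API_KEY"))
doc <- read_xml(content(resp, as = "text", encoding = "UTF-8"))
data.frame(item  = xml_text(xml_find_all(doc, "//lineItem/name")),
           value = xml_text(xml_find_all(doc, "//lineItem/value")))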
I have written a C# program that I think does what you want. It parses the HTML from nasdaq.com pages and creates one CSV file per stock that includes income statement, cash flow, and balance sheet values going back 5-10 years depending on the age of the stock. I am now working to add some analysis calculations (mostly historic ratios at this point). I'm interested in learning about R and its applications to fundamental analysis. Maybe we can help each other.
I recently found this R package on CRAN, which I believe does exactly what you are asking:
XBRL: Extraction of business financial information from XBRL documents
You can get all three types of financial statements from Intrinio in R for free. Additionally, you can get as-reported statements and standardized statements. The problem with pulling XBRL filings from the SEC is that there is no standardized option, which means you have to manually map financial statement items if you want to do cross-equity comparisons. Here is an example:
#Install httr, which you need to request data via API
install.packages("httr")
require("httr")
#Install jsonlite which parses JSON
install.packages("jsonlite")
require("jsonlite")
#Create variables for your username and password, get those at intrinio.com/login
username <- "Your_API_Username"
password <- "Your_API_Password"
#Making an API call for an income statement. This puts together the different parts of the API call
base <- "https://api.intrinio.com/"
endpoint <- "financials/"
type <- "standardized"
stock <- "YUM"
statement <- "income_statement"
fiscal_period <- "Q2"
fiscal_year <- "2015"
#Pasting them together to make the API call
call1 <- paste(base,endpoint,type,"?","identifier","=", stock, "&","statement","=",statement,"&","fiscal_period",
"=", fiscal_period, "&", "fiscal_year", "=", fiscal_year, sep="")
# call1 Looks like this "https://api.intrinio.com/financials/standardized?identifier=YUM&statement=income_statement&fiscal_period=Q2&fiscal_year=2015"
#Now we use the API call to request the data from Intrinio's database
YUM_Income <- GET(call1, authenticate(username,password, type = "basic"))
#That gives us the statement data, but it isn't in a good format so we parse it
test1 <- unlist(content(YUM_Income, "text"))
#Convert from JSON to flattened list
parsed_statement <- fromJSON(test1)
#Then make your data frame:
df1 <- data.frame(parsed_statement)
I wrote this script to make it easy to change out the ticker, dates, and statement type so you can get the financial statement for any US company for any period.
I actually do this in Google Sheets. I find it the easiest way to do it, and the fact that it can pull real live data is another bonus. Lastly, it doesn't consume any of my space to save these statements.
=importhtml("http://investing.money.msn.com/investments/stock-income-statement/?symbol=US%3A"&B1&"&stmtView=Ann", "table",0)
where cell B1 contains the ticker.
You can do the same thing for balance sheet, and cash flow as well.
1- Subscribe to the Yahoo Finance API on RapidAPI here
2- Get your key
3- Insert your key in the code:
name="AAPL"
{raw=httr::GET(paste("https://yahoo-finance15.p.rapidapi.com//api/yahoo/qu/quote/",name,"/financial-data", sep = ""),
httr::add_headers("x-rapidapi-host"= "yahoo-finance15.p.rapidapi.com",
"x-rapidapi-key"="insert your Key here")
)
raw=jsonlite::fromJSON(rawToChar(raw$content))
values=sapply(1:length(raw$financialData),function(x){sapply(raw, "[", x)[[1]][1]})
names(values)=names(raw$financialData)
values=as.data.frame(t(values))
row.names(values)=name
}
values
Pros: Easy way to get data
Cons: the free version is limited to 500 requests per month
