How to connect data dictionaries to unlabeled data [in R]

I'm working with some large government datasets from the Department of Transportation that are available as tab-delimited text files accompanied by data dictionaries. For example, the auto complaints file is 670 MB of unlabeled data when unzipped, and comes with a dictionary. Here are some excerpts:
Last updated: April 24, 2014
FIELDS:
=======
Field#  Name     Type/Size  Description
------  -------  ---------  --------------------------------------
1       CMPLID   CHAR(9)    NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.
                            IS AN UPDATEABLE FIELD, THUS DATA FOR A
                            GIVEN RECORD POTENTIALLY COULD CHANGE FROM
                            ONE DATA OUTPUT FILE TO THE NEXT.
2       ODINO    CHAR(9)    NHTSA'S INTERNAL REFERENCE NUMBER.
                            THIS NUMBER MAY BE REPEATED FOR
                            MULTIPLE COMPONENTS.
                            ALSO, IF LDATE IS PRIOR TO DEC 15, 2002,
                            THIS NUMBER MAY BE REPEATED FOR MULTIPLE
                            PRODUCTS OWNED BY THE SAME COMPLAINANT.
Some of the fields have foreign keys listed like so:
21      CMPL_TYPE  CHAR(4)  SOURCE OF COMPLAINT CODE:
                              CAG  = CONSUMER ACTION GROUP
                              CON  = FORWARDED FROM A CONGRESSIONAL OFFICE
                              DP   = DEFECT PETITION, RESULT OF A DEFECT PETITION
                              EVOQ = HOTLINE VOQ
                              EWR  = EARLY WARNING REPORTING
                              INS  = INSURANCE COMPANY
                              IVOQ = NHTSA WEB SITE
                              LETR = CONSUMER LETTER
                              MAVQ = NHTSA MOBILE APP
                              MIVQ = NHTSA MOBILE APP
                              MVOQ = OPTICAL MARKED VOQ
                              RC   = RECALL COMPLAINT, RESULT OF A RECALL INVESTIGATION
                              RP   = RECALL PETITION, RESULT OF A RECALL PETITION
                              SVOQ = PORTABLE SAFETY COMPLAINT FORM (PDF)
                              VOQ  = NHTSA VEHICLE OWNERS QUESTIONNAIRE
There are import instructions for Microsoft Access, which I don't have and would not use if I did. But I THINK this data dictionary was meant to be machine-readable.
My question: Is this data dictionary a standard format of some kind? I've tried to Google around, but it's hard to do so without the right terminology. I would like to import it into R, though I'm flexible so long as it can be done programmatically.
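I don't know of a formal standard for this layout, but it is regular enough to parse. Here is a minimal sketch in R, assuming every field definition starts with a field number, a name, and a TYPE(SIZE) token as in the excerpts above; both file names are placeholders:
dict_lines <- readLines("CMPL_dictionary.txt")   # placeholder file name

# Field definitions look like "1 CMPLID CHAR(9) NHTSA'S INTERNAL ..."
field_pattern <- "^\\s*(\\d+)\\s+(\\w+)\\s+([A-Z]+)\\((\\d+)\\)"
m <- regmatches(dict_lines, regexec(field_pattern, dict_lines))
m <- m[lengths(m) > 0]                           # keep only matching lines

fields <- data.frame(
  num  = as.integer(sapply(m, `[`, 2)),
  name = sapply(m, `[`, 3),
  type = sapply(m, `[`, 4),
  size = as.integer(sapply(m, `[`, 5)),
  stringsAsFactors = FALSE
)

# Label the tab-delimited data with the recovered column names
complaints <- read.delim("FLAT_CMPL.txt",        # placeholder file name
                         header = FALSE, quote = "",
                         col.names = fields$name,
                         stringsAsFactors = FALSE)
Continuation lines in the description column simply fail the pattern and are skipped, which is all you need for column naming.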


Extracting full article text via the newsanchor package [in R]

I am using the newsanchor package in R to try to extract full article content via NewsAPI. So far I have done the following:
library(newsanchor)                 # R client for newsapi.org
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df          # one row per article
This gives me a data frame with information on (at most) 100 articles. However, these rows do not contain the full article text; rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation, but I am not really sure.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, in the Response object section it says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So the content is restricted to 260 characters. However, test$url has the link to the source article, which you can use to scrape the entire content; but since the articles are aggregated from many different sources, I don't think there is one automated way to do this.
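That said, a crude generic attempt is possible with rvest. This is only a sketch: concatenating all <p> text on a page will pick up navigation and boilerplate on some sites, and site-specific selectors would be needed for clean results:
library(rvest)

# Naive full-text fetch: concatenate all <p> text on the page
get_full_text <- function(url) {
  tryCatch({
    page <- read_html(url)
    paste(html_text2(html_elements(page, "p")), collapse = "\n")
  }, error = function(e) NA_character_)  # keep going if a site fails
}

test$full_text <- vapply(test$url, get_full_text, character(1))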

Matching columns of two datasets

I've been assigned a problem where I need to match mobile apps and publishers between two data sets (one is Google Play, the other is iTunes).
Here is a description of the variables used in the iTunes dataset (the Google Play dataset's variable names are similar or the same).
anon_ios_app_id: anonymized iOS app id
anon_ios_publisher_id: anonymized iOS publisher id
points: the “worth” of the match; 10 points is the highest worth and 0.5 the lowest.
ios_name: name of the mobile app in the itunes store
ios_publisher_name: name of the publisher of the app in the itunes store
category_name: the category of the app
type: Game or Non-game
I've done some analysis to look for apps in the two data sets that share the same name and publisher. As an example, I searched for apps with "Walmart" in their names.
GooglePlay <- read.csv("...\\GooglePlay.csv", header = TRUE)
iTunes <- read.csv("...\\iTunes.csv", header = TRUE)
grep("Walmart", iTunes$ios_name)      # row indices of matching iTunes apps
[1] 41203 51026 63522 64330 112441 113516 115510 117588 117788 119558 119605 120002 165514 248817
[15] 277425 290010 463244 546799 565806
grep("Walmart", GooglePlay$gp_name)   # same for Google Play
[1] 154 31984 162284 162342 162792 168722 168774 169339 325520 325601 357122 360050 436084 437144
[15] 441458 447177 503260
During my analysis, I did find that some apps had the same name and publisher in both data sets. For example
GooglePlay$gp_name[154]
[1] Walmart Photo
GooglePlay$gp_publisher_name[154]
[1] Kodak Alaris Inc.
iTunes$ios_name[165514]
[1] Walmart Photo
iTunes$ios_publisher_name[165514]
[1] Kodak Alaris Inc.
My objectives are:
1. Provide one unified file with all the respective IDs/names of the matched apps/publishers.
2. Provide one number: the SUM(iOS points + GP points) for the matched apps.
What functions should I use to match apps and publishers from the two data sets? How do I make a unified file of those matches?
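As a starting point, base R's merge() does an exact join on name and publisher. This is only a sketch: the Google Play column names (gp_name, gp_publisher_name, gp_points) are assumptions based on the excerpts above, and exact string matching will miss near-duplicates (casing, punctuation, suffixes like "Inc."):
# Exact join on app name + publisher name; gp_* column names are assumed
matched <- merge(iTunes, GooglePlay,
                 by.x = c("ios_name", "ios_publisher_name"),
                 by.y = c("gp_name", "gp_publisher_name"))

# Objective 1: one unified file of matched apps/publishers
write.csv(matched, "matched_apps.csv", row.names = FALSE)

# Objective 2: one number, SUM(iOS points + GP points)
sum(matched$points + matched$gp_points)
For fuzzy matching of near-identical names, agrep() in base R or the stringdist package would be the natural next step.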

Web scraping SEC Edgar 10-K and 10-Q filings

Is anyone experienced with scraping SEC 10-K and 10-Q filings? I got stuck while trying to scrape monthly realized share repurchases from these filings. Specifically, I would like to get the following information for each month from 2004 to 2014: 1. Period; 2. Total Number of Shares Purchased; 3. Average Price Paid per Share; 4. Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs; 5. Maximum Number (or Approximate Dollar Value) of Shares that May Yet Be Purchased Under the Plans or Programs. I have 90,000+ forms to parse in total, so it won't be feasible to do it manually.
This information is usually reported under "Part 2 Item 5 Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" in 10-Ks, and under "Part 2 Item 2 Unregistered Sales of Equity Securities and Use of Proceeds" in 10-Qs.
Here is one example of the 10-Q filings that I need to parse:
https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm
If a firm has no share repurchases, this table can be missing from the quarterly report.
I have tried to parse the HTML files with Python's BeautifulSoup, but the results are not satisfactory, mainly because these files are not written in a consistent format.
For example, the only way I can think of to parse these forms is:
from bs4 import BeautifulSoup
import requests
import unicodedata
import re

url = 'https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm'

def remove_invalid_tags(soup, invalid_tags=('sup', 'br')):
    # Replace formatting tags with spaces so cell text stays contiguous
    for tag_name in invalid_tags:
        for tag in soup.find_all(tag_name):
            tag.replace_with(' ')

def parse_html(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    tables = soup.find_all('table')
    # Header text that identifies a share-repurchase table
    identifier = re.compile(r'Total.*Number.*of.*Shares.*\w*Purchased.*',
                            re.UNICODE | re.IGNORECASE | re.DOTALL)
    rep_tables = []
    for table in tables:
        remove_invalid_tags(table)
        # Normalize Unicode (e.g. non-breaking spaces) before matching
        table_text = unicodedata.normalize('NFKD', table.text)
        if identifier.search(table_text):
            rep_tables.append(table)
    return rep_tables
The above code only returns the messy tables that may contain the repurchase information. However, 1) it is not reliable; 2) it is very slow; 3) the subsequent steps to scrape date/month, share price, number of shares, etc. are much more painful. I am wondering if there are more feasible languages/approaches/applications/databases for getting this information? Thanks a million!
I'm not sure about Python, but in R there is a beautiful solution using the finstr package (https://github.com/bergant/finstr).
finstr automatically extracts the financial statements (income statement, balance sheet, cash flow, etc.) from EDGAR using the XBRL format.
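A minimal sketch of that workflow, following the finstr README; the instance URL below is a placeholder that would need to point at the XBRL instance document of an actual filing:
library(XBRL)
library(finstr)

# Placeholder: URL of a filing's XBRL instance document
xbrl_url <- "https://www.sec.gov/path/to/filing-xbrl-instance.xml"

xbrl_data <- xbrlDoAll(xbrl_url)             # download and parse the XBRL files
statements <- xbrl_get_statements(xbrl_data) # balance sheet, income statement, ...
statements
Whether the monthly repurchase table itself is covered depends on what the filer actually tagged in XBRL, so this may get you the statements rather than the Item 2 table.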

R webcorpus attribute extraction

I am using the tm.plugin.webmining package to get the latest news about a company, say Microsoft, using the following command:
corpus <- WebCorpus(GoogleBlogSearchSource(stock))
When I run meta(corpus[[1]]) I get:
Metadata:
author : character(0)
datetimestamp: 2014-07-17 20:28:10
description : Microsoft Layoffs – What it Means for MSFT StockInvestorplace.comWhile the layoffs are obviously
going to be hardest on the workers, as investors we still have to take
a rational and objective look at the corporation to see what it means
for MSFT – particularly if you are personally a
Microsoft stock holder ...Why Microsoft (MSFT) Stock Is Up
TodayTheStreet.comEarnings Preview: Microsoft Corporation (MSFT),
Apple Inc (AAPL), Facebook ...International Business TimesWhat Do
Microsoft's Layoff Plans Tell Us About Satya Nadella's Vision?Motley
FoolTech Insider -Insider Monkey (blog)all 2,176 news articles »
heading : Microsoft Layoffs – What it Means for MSFT Stock - Investorplace.com
id : tag:news.google.com,2005:cluster=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
language : character(0)
origin : http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEadqFvThyxvJU3O5uHa6wiyoWNEw&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778559643673&ei=Cr3LU8jGNMnNkwX_lYCICQ&url=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
So I can see that the different attributes are there, but when I run
Headers <- sapply(meta(corpus), FUN = function(x) { attr(x, "heading") })
Headers is a list of 100 items, all with NULL values. I am pretty sure this particular code was running a few days back. What changed in between: I reinstalled the packages on a new system and updated R from 3.1.0 to 3.1.1.
What can I do to get separate lists of headers, descriptions, timestamps, etc., which I later want to convert into a 100x3 data frame?
With the newest R, please try the following code:
headers <- meta(corpus, tag = "heading")
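From there, a sketch of the 100x3 data frame the asker wants, assuming the same tag-based access works for the other attributes; datetimestamp values are POSIXct, so do.call(c, ...) is used instead of unlist() to keep the class:
headers      <- unlist(meta(corpus, tag = "heading"))
descriptions <- unlist(meta(corpus, tag = "description"))
timestamps   <- do.call(c, meta(corpus, tag = "datetimestamp"))

news <- data.frame(heading = headers,
                   description = descriptions,
                   datetimestamp = timestamps,
                   stringsAsFactors = FALSE)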

SSRS: How to create 'Color Analytics Map' in Reporting Services 2008 / Report Builder 3.0

I'm setting up a map report in Reporting Services (Report Builder 3.0) running on SQL Server 2008 R2.
I've got a dataset like this:
CODE  VAL
AF    11
AL     7
DZ     7
VI    15
AD     7
AO     6
AI     8
AQ    10
I've downloaded the shape file (.shp) from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/. I imported this file into my report, and in the 'Select fields to match spatial data and analytical data' dialog I registered a match between the spatial data field 'ISO_A2' and my own 'CODE' field. I've checked manually that these columns match.
I've set my VAL field as the field to 'visualize' in the wizard.
I've changed the distribution range in the 'Map Color Rules Properties' dialog to run from 1 through 20.
When I preview my report, all the countries are rendered with no color (all backgrounds are grey, same as the map background)! I tried a lot of different color setups, but no luck. It seems like the spatial data and my test dataset don't get joined correctly!? :(
How can I ensure that the spatial data and my analytical data are joined correctly?
Am I doing something obviously wrong?
Any solutions/hints highly appreciated! ;)
Regards
Alex
