Web scraping SEC Edgar 10-K and 10-Q filings - web-scraping

Is anyone experienced with scraping SEC 10-K and 10-Q filings? I got stuck while trying to scrape monthly realised share repurchases from these filings. Specifically, I would like to get the following information for each month from 2004 to 2014:
1. Period
2. Total Number of Shares Purchased
3. Average Price Paid per Share
4. Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs
5. Maximum Number (or Approximate Dollar Value) of Shares that May Yet Be Purchased Under the Plans or Programs
I have 90,000+ forms to parse in total, so it won't be feasible to do it manually.
This information is usually reported in 10-Ks under "Part 2 Item 5 Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" and in 10-Qs under "Part 2 Item 2 Unregistered Sales of Equity Securities and Use of Proceeds".
Here is one example of the 10-Q filings that I need to parse:
https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm
If a firm has no share repurchases, this table can be missing from the quarterly report.
I have tried to parse the HTML files with BeautifulSoup in Python, but the results are not satisfactory, mainly because these files are not written in a consistent format.
For example, the only way I can think of to parse these forms is:
from bs4 import BeautifulSoup
import requests
import unicodedata
import re

url = 'https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm'

def remove_invalid_tags(soup, invalid_tags=['sup', 'br']):
    # Replace tags that split up text runs (superscripts, line breaks) with spaces.
    for tag in invalid_tags:
        for match in soup.find_all(tag):
            match.replace_with(' ')

def parse_html(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    tables = soup.find_all('table')
    # Header of the issuer-purchases-of-equity-securities table.
    identifier = re.compile(r'Total.*Number.*of.*Shares.*\w*Purchased.*',
                            re.UNICODE | re.IGNORECASE | re.DOTALL)
    rep_tables = []
    for table in tables:
        remove_invalid_tags(table)
        # Normalise Unicode and drop non-ASCII characters before matching.
        table_text = unicodedata.normalize('NFKD', table.text).encode('ascii', 'ignore').decode('ascii')
        if identifier.search(table_text):
            rep_tables.append(table)
    return rep_tables
The above code only returns the messy tables that may contain the repurchase information. However, 1) it is not reliable; 2) it is very slow; 3) the subsequent steps to scrape the date/month, share price, number of shares, etc. are much more painful to do. I am wondering if there are more feasible languages/approaches/applications/databases for getting this information? Thanks a million!

I'm not sure about Python, but in R there is a beautiful solution using the 'finstr' package (https://github.com/bergant/finstr).
'finstr' automatically extracts the financial statements (income statement, balance sheet, cash flow, etc.) from EDGAR using the XBRL format.
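A minimal sketch of how that typically looks, assuming the XBRL package is installed alongside finstr; the instance-document URL below is a placeholder, not a real filing:

library(XBRL)
library(finstr)

# URL of the filing's XBRL instance document (the .xml file inside the
# EDGAR archive folder of the 10-K/10-Q) -- replace with a real one.
xbrl_url <- "https://www.sec.gov/Archives/edgar/data/<CIK>/<accession>/<ticker>-20141231.xml"

xbrl_data  <- xbrlDoAll(xbrl_url)              # parse the XBRL instance
statements <- xbrl_get_statements(xbrl_data)   # income statement, balance sheet, cash flow, ...
statements

Note that this works on the structured XBRL data, so it covers the core statements rather than every table in the filing text; whether the issuer-purchases table is exposed that way will depend on the filing.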

Related

How can I use BeautifulSoup to get the following data from Kickstarter?

I am trying to get some data from Kickstarter. How can I use the BeautifulSoup library?
Kickstarter link
https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=7
This is the information I need:
Crowdfunding goal
Total crowdfunding
Total backers
Length of the campaign (# of days)
This is my current code
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=1')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', class_='js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg')
len(results)
I'll give you some hints that I know of, and hope you can do the rest yourself.
1. Crawling can get you into legal trouble if you abuse a site's Terms of Service.
2. find_all should be used with a 'for' statement; it works like "find all" on a web page (Ctrl + F).
e.g.
for a in soup.find_all('div', class_='js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'):
    print(a)
3. The links should be opened inside a 'for' statement - https://www.kickstarte...seed=2600008&page=1
The page number at the end of the URL is what you change in the for statement, so you can crawl all the data in order.
4. You have to follow links twice: the listing page above contains the list of projects, and you need to get the link of each project from it.
So the code's algorithm looks like this:
base_url = 'https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page='
for i in range(1, 10000):
    soup = BeautifulSoup(requests.get(base_url + str(i)).text, 'html.parser')
    for a in soup.find_all('a'):          # narrow this down to the project links on the listing page
        pj_link = a.get('href')
        r2 = requests.get(pj_link)
        soup2 = BeautifulSoup(r2.text, 'html.parser')
        # ... extract goal, total pledged, backers and campaign length here

Extracting full article text via the newsanchor package [in R]

I am using the newsanchor package in R to try to extract full article content via NewsAPI. For now I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a data frame full of info on (at most) 100 articles. These, however, do not contain the entire article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation but I am not really sure.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, in the Response object section it says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So the content is restricted to only 260 characters. However, test$url has the link to the source article, which you can use to scrape the entire content; but since the articles are aggregated from various sources, I don't think there is one automated way to do this.
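If you do go the manual route, here is a minimal sketch for a single article, assuming the rvest package and that the target site keeps its body text in <p> tags (the right selector varies from source to source):

library(rvest)

url  <- test$url[1]                          # source URL of the first article
page <- read_html(url)
# "p" is an assumption: most news sites put body text in <p> tags,
# but each source may need its own CSS selector.
paragraphs   <- html_text(html_nodes(page, "p"))
article_text <- paste(paragraphs, collapse = " ")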

How can I set an inbuilt timer in RStudio to execute snippets of code?

I have some code in RStudio which sends an API request to Google BigQuery to run a saved query. Then my script downloads the data back into RStudio to be fed into a machine learning model.
It's a lot of medical data and I would like some of the process to be even more automated than before.
tags<-read.csv('patient_health_codes.csv',stringsAsFactors = FALSE)
tags<-tail(tags, 6)
This section takes a CSV to iterate over patient health groups (such as Eczema = 123456) - Section 1
MD2DS="2018-07-20"
MD2DE="2018-07-20"
This section above fills in date periods for the query execution function - Section 2
sapply(health_tags$ID, function(x) query_details(MD2_date_start = MD2DS,
                                                 MD2_date_end   = MD2DE,
                                                 Sterile_tag    = as.character(x)))
This section executes the query on Google BigQuery and iterates over all the different patient groups in x, i.e. Eczema, Asthma, Allergy Group, and so on. - Section 3
project <- "private-health-clinic"
bq_table=paste0('[private-health-clinic:medical.london_',Sterile_tag,']')
sql =paste0('SELECT * FROM ', bq_table)
This section names each table after its patient group - section 4
data <- query_exec(sql, project = project, max_pages = Inf)
write.csv(data, file =paste0("medical_", Sterile_tag, ".csv"))
This code downloads and writes the big query table as a CSV on RStudio - Section 5
My question is: how do I tell RStudio that, when someone executes section 3, it should wait 1 hour of real time before executing section 4, and then execute section 5 five minutes after that?
Thank you in advance for the help - I'm not an R expert!
Just add this after section 3:
Sys.sleep(3600)
And after section 4, add:
Sys.sleep(300)
Depending on how long it takes to execute that code, it might be worthwhile to use Sys.sleep for the desired amount of waiting time minus the time spent calculating, as follows:
t0 <- Sys.time()
# section 3
t1 <- Sys.time()
# elapsed time in seconds (a plain t1 - t0 is a difftime whose units are not guaranteed to be seconds)
Sys.sleep(3600 - as.numeric(difftime(t1, t0, units = "secs")))
# section 4
t2 <- Sys.time()
Sys.sleep(300 - as.numeric(difftime(t2, t1, units = "secs")))
# section 5
Otherwise the waiting time will be added to the time spent running the sections.

How to connect data dictionaries to the unlabeled data

I'm working with some large government datasets from the Department of Transportation that are available as tab-delimited text files accompanied by data dictionaries. For example, the auto complaints file is a 670 MB file of unlabeled data (when unzipped), and comes with a dictionary. Here are some excerpts:
Last updated: April 24, 2014
FIELDS:
=======
Field#  Name     Type/Size  Description
------  -------- ---------  --------------------------------------
1       CMPLID   CHAR(9)    NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.
                            IS AN UPDATEABLE FIELD, THUS DATA FOR A
                            GIVEN RECORD POTENTIALLY COULD CHANGE FROM
                            ONE DATA OUTPUT FILE TO THE NEXT.
2       ODINO    CHAR(9)    NHTSA'S INTERNAL REFERENCE NUMBER.
                            THIS NUMBER MAY BE REPEATED FOR
                            MULTIPLE COMPONENTS.
                            ALSO, IF LDATE IS PRIOR TO DEC 15, 2002,
                            THIS NUMBER MAY BE REPEATED FOR MULTIPLE
                            PRODUCTS OWNED BY THE SAME COMPLAINANT.
Some of the fields have foreign keys listed like so:
21 CMPL_TYPE CHAR(4) SOURCE OF COMPLAINT CODE:
CAG =CONSUMER ACTION GROUP
CON =FORWARDED FROM A CONGRESSIONAL OFFICE
DP =DEFECT PETITION,RESULT OF A DEFECT PETITION
EVOQ =HOTLINE VOQ
EWR =EARLY WARNING REPORTING
INS =INSURANCE COMPANY
IVOQ =NHTSA WEB SITE
LETR =CONSUMER LETTER
MAVQ =NHTSA MOBILE APP
MIVQ =NHTSA MOBILE APP
MVOQ =OPTICAL MARKED VOQ
RC =RECALL COMPLAINT,RESULT OF A RECALL INVESTIGATION
RP =RECALL PETITION,RESULT OF A RECALL PETITION
SVOQ =PORTABLE SAFETY COMPLAINT FORM (PDF)
VOQ =NHTSA VEHICLE OWNERS QUESTIONNAIRE
There are import instructions for Microsoft Access, which I don't have and would not use if I did. But I THINK this data dictionary was meant to be machine-readable.
My question: Is this data dictionary a standard format of some kind? I've tried to Google around, but it's hard to do so without the right terminology. I would like to import the data into R, though I'm flexible so long as it can be done programmatically.
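For what it's worth, a minimal sketch of the kind of programmatic import this allows, assuming the unzipped complaints file is tab-delimited with no header row; the file name "FLAT_CMPL.txt" is a guess, and the field_names vector must be completed, in order, for every field listed in the dictionary:

# field_names must list every field, in order, exactly as the dictionary gives them.
field_names <- c("CMPLID", "ODINO")   # ...and so on for the remaining fields

complaints <- read.delim("FLAT_CMPL.txt", header = FALSE, sep = "\t",
                         quote = "", stringsAsFactors = FALSE)
names(complaints) <- field_names      # length must match the number of columns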

Collect tweets with their related tweeters

I am doing text mining on tweets. I have collected random tweets from different accounts about some topic, transformed the tweets into a data frame, and was able to find the most frequent tweeters among those tweets (by using the column "screenName")... like these tweets:
[1] "ISCSP_ORG: #cybercrime NetSafe publishes guide to phishing:
Auckland, Monday 04 June 2013 – Most New Zealanders will have...
http://t.co/dFLyOO0Djf"
[1] "ISCSP_ORG: #cybercrime Business Briefs: MILL CREEK — H.M. Jackson
High School DECA chapter members earned the organizatio...
http://t.co/auqL6mP7AQ"
[1] "BNDarticles: How do you protect your #smallbiz from #cybercrime?
Here are the top 3 new ways they get in & how to stop them.
http://t.co/DME9q30mcu"
[1] "TweetMoNowNa: RT #jamescollinss: #senatormbishop It's the same
problem I've been having in my fight against #cybercrime. \"Vested
Interests\" - Tell me if …"
[1] "jamescollinss: #senatormbishop It's the same problem I've been
having in my fight against #cybercrime. \"Vested Interests\" - Tell me
if you work out a way!"
Different tweeters have sent many tweets (in the collected dataset).
Now, I want to collect/group the related tweets by their corresponding tweeters/users.
Is there any way to do it using R? Any suggestions? Your help would be very much appreciated.
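A minimal sketch of the kind of grouping this asks for, assuming the tweets are in a data frame called tweets_df with screenName and text columns (typical of twitteR's twListToDF output; your column names may differ):

library(dplyr)

# One row per account, with the number of tweets and the collected texts.
tweets_by_user <- tweets_df %>%
  group_by(screenName) %>%
  summarise(n_tweets = n(),
            texts    = paste(text, collapse = " || "))

# Or, in base R, a named list of tweet texts keyed by screen name:
tweets_split <- split(tweets_df$text, tweets_df$screenName)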
