extract sections of 10-K filings by R - quanteda - r

The edgar package can extract Item 1 (Business) and Item 7 (MD&A) from 10-K filings on EDGAR. Are there other R tools that can extract other items, such as Item 1A (Risk Factors) or Item 7A (Quantitative and Qualitative Disclosures about Market Risk)?
I would like to do it with quanteda as it is very fast and effective for coding.
Thank you so much for your time and consideration!
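One possible starting point, sketched in base R: once a filing is available as plain text, a section can be cut out between two item headings with a regular expression. The sample text and the extract_item() helper below are hypothetical, and real filings usually repeat the headings in a table of contents, so the first match may need to be skipped.

```r
# Toy stand-in for the plain text of a 10-K (assumption: HTML already stripped).
filing <- paste(
  "ITEM 1. BUSINESS We make widgets.",
  "ITEM 1A. RISK FACTORS Demand may fall.",
  "ITEM 1B. UNRESOLVED STAFF COMMENTS None.",
  "ITEM 7A. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK Rates matter.",
  "ITEM 8. FINANCIAL STATEMENTS Omitted."
)

# Hypothetical helper: return the text from the heading "Item <start>." up to,
# but not including, the heading "Item <end>." (case-insensitive, first match only).
extract_item <- function(text, start, end) {
  pattern <- paste0("(?is)item\\s+", start, "\\..*?(?=item\\s+", end, "\\.)")
  m <- regmatches(text, regexpr(pattern, text, perl = TRUE))
  if (length(m) == 0) NA_character_ else trimws(m)
}

risk_factors <- extract_item(filing, "1A", "1B")
market_risk  <- extract_item(filing, "7A", "8")
print(risk_factors)
print(market_risk)
```

The resulting character strings can then be passed to quanteda::corpus() for further processing.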

Related

Extract specific lines of text in r

I have a .txt file with thousands of lines containing metadata about research articles. Each record lists the publication year (PY), title (TI), DOI (DI), publication type (PT) and abstract (AB), among other fields. The file holds almost 300 papers. The first two records look like this:
PT J
AU Filieri, Raffaele
Acikgoz, Fulya
Ndou, Valentina
Dwivedi, Yogesh
TI Is TripAdvisor still relevant? The influence of review credibility,
review usefulness, and ease of use on consumers' continuance intention
SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
DI 10.1108/IJCHM-05-2020-0402
EA NOV 2020
PY 2020
AB Purpose - Recent figures show that users are discontinuing their usage
of TripAdvisor, the leading user-generated content (UGC) platform in the
tourism sector. Hence, it is relevant to study the factors that
influence travelers' continued use of TripAdvisor.
Design/methodology/approach - The authors have integrated constructs
from the technology acceptance model, information systems (IS)
continuance model and electronic word of mouth literature. They used
PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297
users of TripAdvisor recruited through Prolific.
Findings - Findings reveal that perceived ease of use, online consumer
review (OCR) credibility and OCR usefulness have a positive impact on
customer satisfaction, which ultimately leads to continuance intention
of UGC platforms. Customer satisfaction mediates the effect of the
independent variables on continuance intention.
Practical implications - Managers of UGC platforms (i.e. TripAdvisor)
can benefit from the findings of this study. Specifically, they should
improve the ease of use of their platforms by facilitating travelers'
information searches. Moreover, they should use signals to make credible
and helpful content stand out from the crowd of reviews.
Originality/value - This is the first study that adopts the IS
continuance model in the travel and tourism literature to research the
factors influencing consumers' continued use of travel-based UGC
platforms. Moreover, the authors have extended this model by including
new constructs that are particularly relevant to UGC platforms, such as
performance heuristics and OCR credibility.
ZR 0
ZA 0
Z8 0
ZS 0
TC 0
ZB 0
Z9 0
SN 0959-6119
EI 1757-1049
UT WOS:000592516500001
ER
PT J
AU Li, Yelin
Bu, Hui
Li, Jiahong
Wu, Junjie
TI The role of text-extracted investor sentiment in Chinese stock price
prediction with the enhancement of deep learning
SO INTERNATIONAL JOURNAL OF FORECASTING
VL 36
IS 4
BP 1541
EP 1562
DI 10.1016/j.ijforecast.2020.05.001
PD OCT-DEC 2020
PY 2020
AB Whether investor sentiment affects stock prices is an issue of
long-standing interest for economists. We conduct a comprehensive study
of the predictability of investor sentiment, which is measured directly
by extracting expectations from online user-generated content (UGC) on
the stock message board of Eastmoney.com in the Chinese stock market. We
consider the influential factors in prediction, including the selections
of different text classification algorithms, price forecasting models,
time horizons, and information update schemes. Using comparisons of the
long short-term memory (LSTM) model, logistic regression, support vector
machine, and Naive Bayes model, the results show that daily investor
sentiment contains predictive information only for open prices, while
the hourly sentiment has two hours of leading predictability for closing
prices. Investors do update their expectations during trading hours.
Moreover, our results reveal that advanced models, such as LSTM, can
provide more predictive power with investor sentiment only if the inputs
of a model contain predictive information. (C) 2020 International
Institute of Forecasters. Published by Elsevier B.V. All rights
reserved.
CT 14th International Conference on Services Systems and Services
Management (ICSSSM)
CY JUN 16-18, 2017
CL Dongbei Univ Finance & Econ, Sch Management Sci & Engn, Dalian, PEOPLES
R CHINA
HO Dongbei Univ Finance & Econ, Sch Management Sci & Engn
SP Tsinghua Univ; Chinese Univ Hong Kong; IEEE Syst Man & Cybernet Soc
ZA 0
TC 0
ZB 0
ZS 0
Z8 0
ZR 0
Z9 0
SN 0169-2070
EI 1872-8200
UT WOS:000570797300025
ER
Now I want to extract the abstract of each article and store it in a data frame. The following code returns only the first abstract:
library(NLP)  # provides as.String()
f <- readLines("sample.txt")
# extract the first match only
pattern <- "AB\\s*(.*?)\\s*ZR"
result <- regmatches(as.String(f), regexec(pattern, as.String(f)))
result[[1]][2]
[1] "Purpose - Recent figures show that users are discontinuing their usage\n of TripAdvisor, the leading user-generated content (UGC) platform in the\n tourism sector. Hence, it is relevant to study the factors that\n influence travelers' continued use of TripAdvisor.\n Design/methodology/approach - The authors have integrated constructs\n from the technology acceptance model, information systems (IS)\n continuance model and electronic word of mouth literature. They used\n PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297\n users of TripAdvisor recruited through Prolific.\n Findings - Findings reveal that perceived ease of use, online consumer\n review (OCR) credibility and OCR usefulness have a positive impact on\n customer satisfaction, which ultimately leads to continuance intention\n of UGC platforms. Customer satisfaction mediates the effect of the\n independent variables on continuance intention.\n Practical implications - Managers of UGC platforms (i.e. TripAdvisor)\n can benefit from the findings of this study. Specifically, they should\n improve the ease of use of their platforms by facilitating travelers'\n information searches. Moreover, they should use signals to make credible\n and helpful content stand out from the crowd of reviews.\n Originality/value - This is the first study that adopts the IS\n continuance model in the travel and tourism literature to research the\n factors influencing consumers' continued use of travel-based UGC\n platforms. Moreover, the authors have extended this model by including\n new constructs that are particularly relevant to UGC platforms, such as\n performance heuristics and OCR credibility."
The problem is that I want all of the abstracts, and the field that follows AB differs between records, so a fixed end marker such as ZR does not work. The general rule is: take the text starting at AB plus every following line that begins with a space. Can anybody help me with this?
You can first group the lines: whenever a line does not start with a space character the group counter is moved up by one.
Then you can aggregate f by group and select the abstracts from the aggregated vector:
group <- cumsum(!grepl("^ ", f))
f2 <- aggregate(f, list(group), function(x) paste(x, collapse = " "))[, 2]
f2[grepl("^AB ", f2)]
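To get from there to the data frame asked for in the question, the grouped fields can be filtered and the AB prefix stripped. This is a minimal sketch on made-up input; it assumes, as in the question, that continuation lines start with a space.

```r
# Toy input in the same tagged layout (assumption: continuation lines start with a space).
f <- c("PT J", "TI First title",
       "AB First abstract line one", " continues here.", "PY 2020", "ER",
       "PT J", "TI Second title", "AB Second abstract.", "PY 2019", "ER")

# A new group starts at every line that does not begin with a space.
group <- cumsum(!grepl("^ ", f))

# Glue each field's lines together, dropping the leading indentation.
fields <- unname(vapply(split(f, group),
                        function(x) paste(trimws(x), collapse = " "),
                        character(1)))

# Keep only the AB fields and strip the tag itself.
abstracts <- data.frame(
  abstract = sub("^AB\\s+", "", fields[grepl("^AB ", fields)]),
  stringsAsFactors = FALSE
)
print(abstracts)
```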
A completely different approach. If your text file has the layout you are showing, you could also read everything in a data.frame with readr::read_fwf. When doing this you have all the info from the articles available. You could use tidyr::fill to fill out the missing meta info.
library(dplyr)
library(readr)
articles <- read_fwf("tests/SO text.txt", fwf_empty("tests/SO text.txt", col_names = c("mi", "text")))
articles <- articles %>%
filter(!(is.na(mi) & is.na(text))) # removes empty lines between articles.
articles
# A tibble: 98 x 2
mi text
<chr> <chr>
1 PT J
2 AU Filieri, Raffaele
3 NA Acikgoz, Fulya
4 NA Ndou, Valentina
5 NA Dwivedi, Yogesh
6 TI Is TripAdvisor still relevant? The influence of review credibility,
7 NA review usefulness, and ease of use on consumers' continuance intention
8 SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
9 DI 10.1108/IJCHM-05-2020-0402
10 EA NOV 2020
# ... with 88 more rows
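From a table like this, one way to collect the abstracts is sketched below. The small data frame stands in for the read_fwf result and is made up; the article counter assumes that every record starts with a PT line.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the two-column table produced by read_fwf above.
articles <- data.frame(
  mi   = c("PT", "TI", NA, "AB", NA, "ER", "PT", "TI", "AB", "ER"),
  text = c("J", "First title,", "continued", "First abstract", "continued.", NA,
           "J", "Second title", "Second abstract.", NA),
  stringsAsFactors = FALSE
)

abstracts <- articles %>%
  mutate(article = cumsum(mi %in% "PT")) %>%  # assumption: each record starts with PT
  fill(mi) %>%                                # carry each tag down its continuation lines
  filter(mi == "AB") %>%
  group_by(article) %>%
  summarise(abstract = paste(text, collapse = " "), .groups = "drop")

print(abstracts)
```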
Try it with this regex:
^AB (?:(?!^[A-Z]{2} )([\s\S]))*
PCRE demo (requires perl=TRUE in R, and multiline mode so that ^ matches at the start of each line, e.g. a (?m) prefix).
If you want to drop the prefix, add \K after it: ^AB \K
You can use
(?m)^AB\h+\K.*(?:\R\h.+)*
See the regex demo. Details:
(?m) - a multiline flag making ^ match at the start of each line
^ - start of a line
AB - an AB substring
\h+ - one or more horizontal whitespaces
\K - match reset operator that discards the text matched so far
.* - the rest of the line
(?:\R\h.+)* - zero or more consecutive lines that start with a horizontal whitespace.
In R, you may use it like
library(NLP)  # provides as.String()
x <- as.String(f)
regmatches(x, gregexpr("(?m)^AB\\h+\\K.*(?:\\R\\h.+)*", x, perl=TRUE))
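A self-contained variant of the same idea, without the NLP dependency: the lines are joined with paste() and the embedded line breaks are flattened afterwards. The input is made up, and the sketch again assumes that continuation lines begin with a space.

```r
# Toy input (assumption: continuation lines begin with a space).
f <- c("PT J", "AB First abstract, line one", " and line two.", "PY 2020", "ER",
       "PT J", "AB Second abstract.", "ER")

# Join the lines into one string so the multiline regex can span line breaks.
x <- paste(f, collapse = "\n")

# All matches of the pattern above; [[1]] because x is a single string.
m <- regmatches(x, gregexpr("(?m)^AB\\h+\\K.*(?:\\R\\h.+)*", x, perl = TRUE))[[1]]

# Replace each line break and the indentation after it with a single space.
abstracts <- gsub("\\s*\\n\\s*", " ", m)
print(abstracts)
```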

Using calendar adjustment when forecasting

I am reading the online textbook "Forecasting: Principles and Practice" by George Athanasopoulos and Rob J. Hyndman, which has examples in R code.
A section on calendar adjustments explains that if we "look at average daily production instead of average monthly production, we effectively remove the variation due to the different month lengths. Simpler patterns are usually easier to model and lead to more accurate forecasts."
This is the example data set with accompanying plots:
monthly -> daily
I don't understand the second line of the given example code:
monthdays <- rep(c(31,28,31,30,31,30,31,31,30,31,30,31),14)
monthdays[26 + (4*12)*(0:2)] <- 29
par(mfrow=c(2,1))
plot(milk, main="Monthly milk production per cow",
     ylab="Pounds", xlab="Years")
plot(milk/monthdays, main="Average milk production per cow per day",
     ylab="Pounds", xlab="Years")
I understand that the first line creates a vector of the # of days in each month and repeats it 14 times because the data set is 14 years. But I have no idea what the second line is doing and where those numbers and calculations are coming from?
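Evaluating the index expression on its own may help. The sketch below only unpacks the arithmetic; reading the three positions as leap-year Februaries assumes that the series starts in a January two years before a leap year (e.g. January 1962, so that month 26 is February 1964).

```r
monthdays <- rep(c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31), 14)

# 0:2 is c(0, 1, 2); 4 * 12 = 48 months = 4 years.
leap_feb <- 26 + (4 * 12) * (0:2)
print(leap_feb)                    # positions 26, 74 and 122 in the vector

# Which month of the year, and which year of the series, is each position?
print((leap_feb - 1) %% 12 + 1)    # month number within the year
print((leap_feb - 1) %/% 12 + 1)   # year number within the 14-year series

# Each of these positions is a February, so it gets 29 days instead of 28.
monthdays[leap_feb] <- 29
```

So the second line overwrites the February entries of years 3, 7 and 11 of the series, which are four years apart, with 29 days.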

HMM text recognition in R depmixs4

I'm wondering how to use the depmixS4 package for R to run an HMM on a dataset. Which functions would I use to get a classification of a test data set?
I have a file of training data, a file of label data, and a file of test data.
The training data consists of 4620 rows. Each row has 1079 values: 83 windows with 13 values per window (83 x 13 = 1079), in other words 83 states with 13 observations each. Each row is one spoken word, so there are 4620 utterances. The data contains only 7 distinct words, and each word has 660 different utterances (7 x 660 = 4620).
So we have words (0-6)
The label file is a list in which each row is labeled 0-6 according to the word it represents. For example, row 300 is labeled 2, row 450 is labeled 6 and row 520 is labeled 0.
The test file contains about 5000 rows structured exactly like the training data, except that there are no labels associated with it.
I want to use an HMM trained on the training data to classify the test data.
How would I use depmixS4 to output a classification of my test data?
I'm looking at :
depmix(response, data=NULL, nstates, transition=~1, family=gaussian(),
prior=~1, initdata=NULL, respstart=NULL, trstart=NULL, instart=NULL,
ntimes=NULL,...)
but I don't know what response refers to or any of the other parameters.
Here's a quick, albeit incomplete, test to get you started, if only to familiarize you with the basic outline. Please note that this is a toy example and it merely scratches the surface of HMM design/analysis. The vignette for the depmixS4 package, for instance, offers quite a lot of context and examples. Meanwhile, here's a brief intro.
Let's say that you wanted to investigate if industrial production offers clues about economic recessions. First, let's load the relevant packages and then download the data from the St. Louis Fed:
library(quantmod)
library(depmixS4)
library(TTR)
fred.tickers <-c("INDPRO")
getSymbols(fred.tickers,src="FRED")
Next, transform the data into rolling 1-year percentage changes to minimize noise in the data and convert the result into data.frame format for analysis in depmixS4:
indpro.1yr <-na.omit(ROC(INDPRO,12))
indpro.1yr.df <-data.frame(indpro.1yr)
Now, let's run a simple HMM model and choose just 2 states--growth and contraction. Note that we're only using industrial production to search for signals:
model <- depmix(response=INDPRO ~ 1,
                family = gaussian(),
                nstates = 2,
                data = indpro.1yr.df,
                transition=~1)
Now let's fit the resulting model, generate posterior states for analysis, and estimate probabilities of recession. Also, we'll bind the data with dates in an xts format for easier viewing/analysis. (Note the use of set.seed(1), which is used to create a replicable starting value to launch the modeling.)
set.seed(1)
model.fit <- fit(model, verbose = FALSE)
model.prob <- posterior(model.fit)
prob.rec <- model.prob[,2]  # second column of the posterior() output
prob.rec.dates <- xts(prob.rec, order.by=as.Date(index(indpro.1yr)))
Finally, let's review and ideally plot the data:
head(prob.rec.dates)
[,1]
1920-01-01 1.0000000
1920-02-01 1.0000000
1920-03-01 1.0000000
1920-04-01 0.9991880
1920-05-01 0.9999549
1920-06-01 0.9739622
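High values (say, above 0.80) suggest that the economy is in recession/contraction, provided the column you picked corresponds to the contraction state. The states are numbered arbitrarily, so check the fitted state means first.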
Again, a very, very basic introduction, perhaps too basic. Hope it helps.
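For the data layout described in the question, the rows would first have to be reshaped so that each window becomes one time step. The sketch below uses random numbers in place of the real training file and assumes that the 1079 values of a row are stored window by window (13 values for window 1, then 13 for window 2, and so on); if the file is ordered differently, byrow has to change. The vector of sequence lengths is what the ntimes argument in the depmix signature above is meant for.

```r
set.seed(42)
n_windows <- 83   # time steps per utterance
n_feats   <- 13   # values per window

# Fake stand-in for the training file: 3 utterances x 1079 values.
train <- matrix(rnorm(3 * n_windows * n_feats), nrow = 3)

# Turn every utterance into an 83 x 13 block and stack the blocks.
long <- do.call(rbind, lapply(seq_len(nrow(train)), function(i) {
  matrix(train[i, ], nrow = n_windows, ncol = n_feats, byrow = TRUE)
}))
long <- as.data.frame(long)
names(long) <- paste0("f", seq_len(n_feats))

# One entry per utterance: the length of each independent sequence.
ntimes <- rep(n_windows, nrow(train))

print(dim(long))
print(ntimes)
```

A common approach from there, not shown here, is to fit one model per word and assign each test utterance to the word whose model gives it the highest likelihood.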

Extraction of text data from text file using R

How can I automatically extract the individual tags from the following sample text (.txt) file using R and convert them to Excel format, so that the Excel columns get tag headings such as PMID, TI, DP, FAU, etc., with the respective values listed under these headings?
PMID- 24579777
OWN - NLM
STAT- Publisher
DA - 20141210
LR - 20141210
IS - 1476-8259 (Electronic)
IS - 1025-5842 (Linking)
VI - 18
IP - 11
DP - 2015 Aug
TI - A complete structural performance analysis and modelling of hydroxyapatite
scaffolds with variable porosity.
PG - 1225-1237
AB - The use of hydroxyapatite (HA) scaffolds for bone regeneration is an alternative
procedure to treat bone defects due to cancer, other diseases or traumas.
Although the use of HA has been widely studied in the literature, there are still
some disparities regarding its mechanical performance. This paper presents a
complete analysis of the structural performance of porous HA scaffolds based on
experimental tests, numerical simulations and theoretical studies. HA scaffolds
with variable porosity were considered and fabricated by the water-soluble
polymer method, using poly vinyl alcohol as pore former. These scaffolds were
then characterised by scanning electron microscopy, stereo microscopy, X-ray
diffraction, porosity analysis and mechanical tests. Different scaffold models
were proposed and analysed by the finite element method to obtain numerical
predictions of the mechanical properties. Also theoretical predictions based on
the (Gibson LJ, Ashby MF. 1988. Cellular solids: structure and properties.
Oxford: Pergamon Press) model were obtained. Finally the experimental, numerical
and theoretical results were compared. From this comparison, it was observed that
the proposed numerical and theoretical models can be used to predict, with
adequate accuracy, the mechanical performance of HA scaffolds for different
porosity values.
FAU - Gallegos-Nieto, Enrique
AU - Gallegos-Nieto E
AD - a Facultad de Ingenieria, Centro de Investigacion y Estudios de
Posgrado,Universidad Autonoma de San Luis Potosi , CP 78290 SLP , Mexico.
FAU - Medellin-Castillo, Hugo I
AU - Medellin-Castillo HI
FAU - de Lange, Dirk F
AU - de Lange DF
LA - ENG
PT - JOURNAL ARTICLE
DEP - 20140228
TA - Comput Methods Biomech Biomed Engin
JT - Computer methods in biomechanics and biomedical engineering
JID - 9802899
OTO - NOTNLM
OT - compressive strength
OT - finite element method
OT - hydroxyapatite
OT - modulus of elasticity
OT - porosity
OT - scaffolds
EDAT- 2014/03/04 06:00
MHDA- 2014/03/04 06:00
CRDT- 2014/03/04 06:00
PHST- 2014/02/28 [aheadofprint]
AID - 10.1080/10255842.2014.889690 [doi]
PST - ppublish
SO - Comput Methods Biomech Biomed Engin. 2015 Aug;18(11):1225-1237. Epub 2014 Feb 28.
Without writing out the entire script, here are the basics of what you could do:
Read the file in as a table with "-" as the separator: data <- read.table('filename', sep="-", header=FALSE, fill=TRUE, stringsAsFactors=FALSE). Note that values that themselves contain hyphens (dates, ISSNs, DOIs) are split as well, so splitting only on the first "-" of each line is safer.
Loop through all the data in the table (for loops would work, or you could use the apply functions) and trim leading/trailing spaces: data[i,j]=gsub("^ *",'',data[i,j]), data[i,j]=gsub(" *$",'',data[i,j])
Transpose the data: newdata=t(data)
Write to an Excel spreadsheet (you can also use write.csv, as that is Excel compatible and built into R):
library(xlsx)
write.xlsx(newdata, "c:/mydata.xlsx",col.names=F, row.names=F)
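A more concrete base R sketch of the same idea, for a single record. It splits each line only at the tag separator, appends continuation lines to the field they belong to, and joins repeated tags (such as FAU) with "; ". The input lines are a shortened copy of the sample, and the output path is a temporary file chosen for illustration.

```r
# Shortened copy of the sample record; continuation lines are indented.
lines <- c(
  "PMID- 24579777",
  "DP  - 2015 Aug",
  "TI  - A complete structural performance analysis and modelling of hydroxyapatite",
  "      scaffolds with variable porosity.",
  "FAU - Gallegos-Nieto, Enrique",
  "FAU - Medellin-Castillo, Hugo I"
)

# A field starts at every line of the form "TAG - value"; other lines continue it.
is_tag <- grepl("^[A-Z]+ *- ", lines)
merged <- unname(vapply(split(trimws(lines), cumsum(is_tag)),
                        paste, character(1), collapse = " "))

tags   <- sub("^([A-Z]+) *- .*$", "\\1", merged)
values <- sub("^[A-Z]+ *- ", "", merged)

# Collapse repeated tags into one cell, keeping the tags in order of appearance.
fields <- vapply(split(values, factor(tags, levels = unique(tags))),
                 paste, character(1), collapse = "; ")

# One row per record, one column per tag.
result <- as.data.frame(as.list(fields), stringsAsFactors = FALSE)
out_file <- file.path(tempdir(), "pubmed.csv")  # illustrative output location
write.csv(result, out_file, row.names = FALSE)
print(result)
```

For a file with several records, the same steps would be applied per record (for example after splitting the lines at each PMID line) and the resulting rows combined.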

readPDF (tm package) in R

I tried to read an online PDF document in R using the readPDF function from the tm package. My script goes like this:
safex <- readPDF(PdftotextOptions='-layout')(elem=list(uri='C:/Users/FCG/Desktop/NoteF7000.pdf'),language='en',id='id1')
R reported that the running command had status 309. I tried different pdftotext options, but the message stays the same and the text file that is created has no content.
Can anyone read this PDF?
readPDF has bugs and probably isn't worth bothering with (check out this well-documented struggle with it).
Assuming that...
you've got xpdf installed (see here for details)
your PATHs are all in order (see here for details of how to do that) and you've restarted your computer.
Then you might be better off avoiding readPDF and instead using this workaround:
system(paste('"C:/Program Files/xpdf/pdftotext.exe"',
'"C:/Users/FCG/Desktop/NoteF7000.pdf"'), wait=FALSE)
And then read the text file into R like so...
require(tm)
mycorpus <- Corpus(URISource("C:/Users/FCG/Desktop/NoteF7001.txt"))
And have a look to confirm that it went well:
inspect(mycorpus)
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Market Notice
Number: Date F7001 08 May 2013
New IDX SSF (EWJG) The following new IDX SSF contract will be added to the list and will be available for trade today.
Summary Contract Specifications Contract Code Underlying Instrument Bloomberg Code ISIN Code EWJG EWJG IShares MSCI Japan Index Fund (US) EWJ US EQUITY US4642868487 1 (R1 per point)
Contract Size / Nominal
Expiry Dates & Times
10am New York Time; 14 Jun 2013 / 16 Sep 2013
Underlying Currency Quotations Minimum Price Movement (ZAR) Underlying Reference Price
USD/ZAR Bloomberg Code (USDZAR Currency) Price per underlying share to two decimals. R0.01 (0.01 in the share price)
4pm underlying spot level as captured by the JSE.
Currency Reference Price
The same method as the one utilized for the expiry of standard currency futures on standard quarterly SAFEX expiry dates.
JSE Limited Registration Number: 2005/022939/06 One Exchange Square, Gwen Lane, Sandown, South Africa. Private Bag X991174, Sandton, 2146, South Africa. Telephone: +27 11 520 7000, Facsimile: +27 11 520 8584, www.jse.co.za
Executive Director: NF Newton-King (CEO), A Takoordeen (CFO) Non-Executive Directors: HJ Borkum (Chairman), AD Botha, MR Johnston, DM Lawrence, A Mazwai, Dr. MA Matooane , NP Mnxasana, NS Nematswerani, N Nyembezi-Heita, N Payne Alternate Directors: JH Burke, LV Parsons
Member of the World Federation of Exchanges
Company Secretary: GC Clarke
Settlement Method
Cash Settled
-
Clearing House Fees -
On-screen IDX Futures Trading: o 1 BP for Taker (Aggressor) o Zero Booking Fees for Maker (Passive) o No Cap o Floor of 0.01 Reported IDX Futures Trades o 1.75 BP for both buyer and seller o No Cap o Floor of 0.01
Initial Margin Class Spread Margin V.S.R. Expiry Date
R 10.00 R 5.00 3.5 14/06/2013, 16/09/2013
The above instrument has been designated as "Foreign" by the South African Reserve Bank
Should you have any queries regarding IDX Single Stock Futures, please contact the IDX team on 011 520-7399 or idx#jse.co.za
Graham Smale Director: Bonds and Financial Derivatives Tel: +27 11 520 7831 Fax:+27 11 520 8831 E-mail: grahams#jse.co.za
Distributed by the Company Secretariat +27 11 520 7346
Page 2 of 2
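The workaround above can also be wrapped in a small function that checks the exit status and waits for the conversion to finish before reading the result. This is only a sketch: pdf_to_text() is a hypothetical helper, it assumes a pdftotext executable (from xpdf or poppler) can be found on the PATH, and it reuses the -layout option from the question.

```r
# Hypothetical helper: convert a PDF with the external pdftotext tool and
# return the extracted text as a character vector of lines.
pdf_to_text <- function(pdf, exe = Sys.which("pdftotext")) {
  if (!nzchar(exe)) stop("pdftotext not found on PATH")
  # Write the text next to the PDF, with a .txt extension.
  txt <- sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
  # system2() waits for the command and returns its exit status.
  status <- system2(exe, args = c("-layout", shQuote(pdf), shQuote(txt)))
  if (status != 0) stop("pdftotext failed with status ", status)
  readLines(txt, warn = FALSE)
}

# Usage (path taken from the question):
# note <- pdf_to_text("C:/Users/FCG/Desktop/NoteF7000.pdf")
```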
