Extraction of text data from text file using R


How can I use R to automatically extract the individual tags from the following sample text (.txt) file and convert it to an Excel file, where the Excel columns get tag headings like PMID, TI, DP, FAU, etc., and the respective values are sorted under these headings?
PMID- 24579777
OWN - NLM
STAT- Publisher
DA - 20141210
LR - 20141210
IS - 1476-8259 (Electronic)
IS - 1025-5842 (Linking)
VI - 18
IP - 11
DP - 2015 Aug
TI - A complete structural performance analysis and modelling of hydroxyapatite
scaffolds with variable porosity.
PG - 1225-1237
AB - The use of hydroxyapatite (HA) scaffolds for bone regeneration is an alternative
procedure to treat bone defects due to cancer, other diseases or traumas.
Although the use of HA has been widely studied in the literature, there are still
some disparities regarding its mechanical performance. This paper presents a
complete analysis of the structural performance of porous HA scaffolds based on
experimental tests, numerical simulations and theoretical studies. HA scaffolds
with variable porosity were considered and fabricated by the water-soluble
polymer method, using poly vinyl alcohol as pore former. These scaffolds were
then characterised by scanning electron microscopy, stereo microscopy, X-ray
diffraction, porosity analysis and mechanical tests. Different scaffold models
were proposed and analysed by the finite element method to obtain numerical
predictions of the mechanical properties. Also theoretical predictions based on
the (Gibson LJ, Ashby MF. 1988. Cellular solids: structure and properties.
Oxford: Pergamon Press) model were obtained. Finally the experimental, numerical
and theoretical results were compared. From this comparison, it was observed that
the proposed numerical and theoretical models can be used to predict, with
adequate accuracy, the mechanical performance of HA scaffolds for different
porosity values.
FAU - Gallegos-Nieto, Enrique
AU - Gallegos-Nieto E
AD - a Facultad de Ingenieria, Centro de Investigacion y Estudios de
Posgrado,Universidad Autonoma de San Luis Potosi , CP 78290 SLP , Mexico.
FAU - Medellin-Castillo, Hugo I
AU - Medellin-Castillo HI
FAU - de Lange, Dirk F
AU - de Lange DF
LA - ENG
PT - JOURNAL ARTICLE
DEP - 20140228
TA - Comput Methods Biomech Biomed Engin
JT - Computer methods in biomechanics and biomedical engineering
JID - 9802899
OTO - NOTNLM
OT - compressive strength
OT - finite element method
OT - hydroxyapatite
OT - modulus of elasticity
OT - porosity
OT - scaffolds
EDAT- 2014/03/04 06:00
MHDA- 2014/03/04 06:00
CRDT- 2014/03/04 06:00
PHST- 2014/02/28 [aheadofprint]
AID - 10.1080/10255842.2014.889690 [doi]
PST - ppublish
SO - Comput Methods Biomech Biomed Engin. 2015 Aug;18(11):1225-1237. Epub 2014 Feb 28.

Without writing out the entire script, here are the basics of what you could do:
Read the file in as a table with "-" as the separator. read.table takes header=FALSE here (row.names=F and col.names=F are not valid for reading), and fill=TRUE is needed because values that contain hyphens, such as 1476-8259, get split into extra columns: data=read.table('filename', sep="-", header=FALSE, fill=TRUE, quote="", stringsAsFactors=FALSE)
Trim leading/trailing spaces in every column. trimws is vectorised, so no explicit loop is needed: data[]=lapply(data, trimws)
Transpose the data: newData=t(data)
Write to an Excel spreadsheet (you can also use write.csv, as that is Excel-compatible and built in to R):
library(xlsx)
write.xlsx(newData, "c:/mydata.xlsx", col.names=F, row.names=F)
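As a rough base-R sketch of the same idea that avoids splitting on every hyphen: it assumes each field starts with a 2-4 letter tag followed by "- " and that wrapped lines do not (as in PubMed's MEDLINE export). The sample `lines` vector and the output file name are made up for illustration; in practice you would use readLines('filename').

```r
# Hypothetical sample standing in for readLines("filename").
lines <- c(
  "PMID- 24579777",
  "IS  - 1476-8259 (Electronic)",
  "IS  - 1025-5842 (Linking)",
  "DP  - 2015 Aug",
  "TI  - A complete structural performance analysis and modelling of hydroxyapatite",
  "      scaffolds with variable porosity.",
  "FAU - Gallegos-Nieto, Enrique",
  "FAU - Medellin-Castillo, Hugo I"
)

# A new field starts with a tag, optional padding, then "- ".
is_field <- grepl("^[A-Z]{2,4} *- ", lines)
field_id <- cumsum(is_field)

# Glue each wrapped line onto the field it belongs to.
fields <- vapply(split(trimws(lines), field_id), paste, character(1), collapse = " ")

# Split only at the first hyphen, so values like "1476-8259" stay intact.
tag   <- unname(sub(" *-.*$", "", fields))
value <- unname(sub("^[A-Z]+ *- *", "", fields))

# Collapse repeated tags (IS, FAU, ...) into one cell per tag.
tags   <- unique(tag)
record <- vapply(tags, function(t) paste(value[tag == t], collapse = "; "), character(1))

# One row per record, one column per tag.
wide <- as.data.frame(t(record), stringsAsFactors = FALSE)

# write.csv output opens directly in Excel.
write.csv(wide, file.path(tempdir(), "record.csv"), row.names = FALSE)
```

For a file with many records you would repeat this per record (for example, splitting on lines that start with PMID) and row-bind the results, filling tags that are missing from a record with NA.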

Related

extract sections of 10-K filings by R - quanteda

We have the edgar package to extract Item 1 (business description) and Item 7 (MD&A) from 10-K filings on EDGAR. Are there other tools in R to extract other items, such as 1A - Risk Factors and 7A - Quantitative and Qualitative Disclosures about Market Risk?
I would like to do it with quanteda as it is very fast and effective for coding.
Thank you so much for your time and consideration!

Extract specific lines of text in r

I have a .txt file with thousands of lines. The file contains meta information about research articles. Every paper has information about the published year (PY), title (TI), DOI number (DI), publishing type (PT) and abstract (AB). In total, the text file holds information on almost 300 papers. The format of the information for the first two articles is as follows.
PT J
AU Filieri, Raffaele
Acikgoz, Fulya
Ndou, Valentina
Dwivedi, Yogesh
TI Is TripAdvisor still relevant? The influence of review credibility,
review usefulness, and ease of use on consumers' continuance intention
SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
DI 10.1108/IJCHM-05-2020-0402
EA NOV 2020
PY 2020
AB Purpose - Recent figures show that users are discontinuing their usage
of TripAdvisor, the leading user-generated content (UGC) platform in the
tourism sector. Hence, it is relevant to study the factors that
influence travelers' continued use of TripAdvisor.
Design/methodology/approach - The authors have integrated constructs
from the technology acceptance model, information systems (IS)
continuance model and electronic word of mouth literature. They used
PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297
users of TripAdvisor recruited through Prolific.
Findings - Findings reveal that perceived ease of use, online consumer
review (OCR) credibility and OCR usefulness have a positive impact on
customer satisfaction, which ultimately leads to continuance intention
of UGC platforms. Customer satisfaction mediates the effect of the
independent variables on continuance intention.
Practical implications - Managers of UGC platforms (i.e. TripAdvisor)
can benefit from the findings of this study. Specifically, they should
improve the ease of use of their platforms by facilitating travelers'
information searches. Moreover, they should use signals to make credible
and helpful content stand out from the crowd of reviews.
Originality/value - This is the first study that adopts the IS
continuance model in the travel and tourism literature to research the
factors influencing consumers' continued use of travel-based UGC
platforms. Moreover, the authors have extended this model by including
new constructs that are particularly relevant to UGC platforms, such as
performance heuristics and OCR credibility.
ZR 0
ZA 0
Z8 0
ZS 0
TC 0
ZB 0
Z9 0
SN 0959-6119
EI 1757-1049
UT WOS:000592516500001
ER
PT J
AU Li, Yelin
Bu, Hui
Li, Jiahong
Wu, Junjie
TI The role of text-extracted investor sentiment in Chinese stock price
prediction with the enhancement of deep learning
SO INTERNATIONAL JOURNAL OF FORECASTING
VL 36
IS 4
BP 1541
EP 1562
DI 10.1016/j.ijforecast.2020.05.001
PD OCT-DEC 2020
PY 2020
AB Whether investor sentiment affects stock prices is an issue of
long-standing interest for economists. We conduct a comprehensive study
of the predictability of investor sentiment, which is measured directly
by extracting expectations from online user-generated content (UGC) on
the stock message board of Eastmoney.com in the Chinese stock market. We
consider the influential factors in prediction, including the selections
of different text classification algorithms, price forecasting models,
time horizons, and information update schemes. Using comparisons of the
long short-term memory (LSTM) model, logistic regression, support vector
machine, and Naive Bayes model, the results show that daily investor
sentiment contains predictive information only for open prices, while
the hourly sentiment has two hours of leading predictability for closing
prices. Investors do update their expectations during trading hours.
Moreover, our results reveal that advanced models, such as LSTM, can
provide more predictive power with investor sentiment only if the inputs
of a model contain predictive information. (C) 2020 International
Institute of Forecasters. Published by Elsevier B.V. All rights
reserved.
CT 14th International Conference on Services Systems and Services
Management (ICSSSM)
CY JUN 16-18, 2017
CL Dongbei Univ Finance & Econ, Sch Management Sci & Engn, Dalian, PEOPLES
R CHINA
HO Dongbei Univ Finance & Econ, Sch Management Sci & Engn
SP Tsinghua Univ; Chinese Univ Hong Kong; IEEE Syst Man & Cybernet Soc
ZA 0
TC 0
ZB 0
ZS 0
Z8 0
ZR 0
Z9 0
SN 0169-2070
EI 1872-8200
UT WOS:000570797300025
ER
Now, I want to extract the abstract of each article and store it in a data frame. To extract the abstract I have the following code, which gives me only the first match.
library(NLP) # provides as.String
f = readLines("sample.txt")
# extract first match....
pattern <- "AB\\s*(.*?)\\s*ZR"
result <- regmatches(as.String(f), regexec(pattern, as.String(f)))
result[[1]][2]
[1] "Purpose - Recent figures show that users are discontinuing their usage\n of TripAdvisor, the leading user-generated content (UGC) platform in the\n tourism sector. Hence, it is relevant to study the factors that\n influence travelers' continued use of TripAdvisor.\n Design/methodology/approach - The authors have integrated constructs\n from the technology acceptance model, information systems (IS)\n continuance model and electronic word of mouth literature. They used\n PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297\n users of TripAdvisor recruited through Prolific.\n Findings - Findings reveal that perceived ease of use, online consumer\n review (OCR) credibility and OCR usefulness have a positive impact on\n customer satisfaction, which ultimately leads to continuance intention\n of UGC platforms. Customer satisfaction mediates the effect of the\n independent variables on continuance intention.\n Practical implications - Managers of UGC platforms (i.e. TripAdvisor)\n can benefit from the findings of this study. Specifically, they should\n improve the ease of use of their platforms by facilitating travelers'\n information searches. Moreover, they should use signals to make credible\n and helpful content stand out from the crowd of reviews.\n Originality/value - This is the first study that adopts the IS\n continuance model in the travel and tourism literature to research the\n factors influencing consumers' continued use of travel-based UGC\n platforms. Moreover, the authors have extended this model by including\n new constructs that are particularly relevant to UGC platforms, such as\n performance heuristics and OCR credibility."
The problem is that I want to extract all the abstracts, but the line that follows the abstract is different for most of them, so the pattern above does not generalise. The rule that holds for every abstract is: extract the text starting from AB plus every following line that begins with a space. Can anybody help me with this?

You can first group the lines: whenever a line does not start with a space character the group counter is moved up by one.
Then you can aggregate f by group and select the abstracts from the aggregated vector:
group <- cumsum(!grepl("^ ", f))
f2 <- aggregate(f, list(group), function(x) paste(x, collapse = " "))[, 2]
f2[grepl("^AB ", f2)]
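To make the grouping step concrete, here is a small self-contained sketch. The vector `f` is a made-up miniature of the file (two records, each with a wrapped abstract), and it uses split/vapply instead of aggregate, which is just one way to do the same collapse.

```r
# Hypothetical miniature of readLines("sample.txt").
f <- c("PT J", "AB First abstract line", "   continues here", "ZR 0", "ER",
       "PT J", "AB Second abstract", "   also continues", "TC 0", "ER")

# Every line that does not start with a space opens a new group.
group <- cumsum(!grepl("^ ", f))

# Collapse each group into a single string, trimming the indentation.
merged <- vapply(split(f, group),
                 function(x) paste(trimws(x), collapse = " "),
                 character(1))

# Keep the AB fields, drop the tag, and store them in a data frame.
abstracts <- sub("^AB ", "", merged[grepl("^AB ", merged)])
abstracts_df <- data.frame(abstract = unname(abstracts), stringsAsFactors = FALSE)
```

Because the grouping relies only on leading whitespace, it does not matter which tag follows the abstract.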

A completely different approach. If your text file has the layout you are showing, you could also read everything in a data.frame with readr::read_fwf. When doing this you have all the info from the articles available. You could use tidyr::fill to fill out the missing meta info.
library(dplyr)
library(readr)
articles <- read_fwf("tests/SO text.txt", fwf_empty("tests/SO text.txt", col_names = c("mi", "text")))
articles <- articles %>%
filter(!(is.na(mi) & is.na(text))) # removes empty lines between articles.
articles
# A tibble: 98 x 2
mi text
<chr> <chr>
1 PT J
2 AU Filieri, Raffaele
3 NA Acikgoz, Fulya
4 NA Ndou, Valentina
5 NA Dwivedi, Yogesh
6 TI Is TripAdvisor still relevant? The influence of review credibility,
7 NA review usefulness, and ease of use on consumers' continuance intention
8 SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
9 DI 10.1108/IJCHM-05-2020-0402
10 EA NOV 2020
# ... with 88 more rows
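The tidyr::fill step mentioned above is not shown, so here is a minimal sketch of how it might look. The small `articles` tibble is invented to mimic the read_fwf result, and the `article` counter (a new record whenever mi is "PT") is an assumption about how records are delimited.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the tibble returned by read_fwf above.
articles <- tibble(
  mi   = c("PT", "AU", NA, "TI", NA, "AB", NA, "ER"),
  text = c("J", "Filieri, Raffaele", "Acikgoz, Fulya",
           "Is TripAdvisor still relevant?", "continued title",
           "Purpose - first line", "second line", NA)
)

abstracts <- articles %>%
  # Number the articles before filling: each "PT" row starts a new one.
  mutate(article = cumsum(!is.na(mi) & mi == "PT")) %>%
  # Carry each tag down over its continuation rows.
  fill(mi) %>%
  filter(mi == "AB") %>%
  group_by(article) %>%
  summarise(abstract = paste(text, collapse = " "), .groups = "drop")
```

The same pattern works for any other tag, for example filtering on "TI" to get complete titles.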

Try it with this regex:
^AB (?:(?!^[A-Z]{2} )([\s\S]))*
PCRE Demo (requires perl=TRUE in R)
If you want to drop the AB prefix from the match, add \K right after it, i.e. start the pattern with ^AB \K

You can use
(?m)^AB\h+\K.*(?:\R\h.+)*
See the regex demo. Details:
(?m) - a multiline flag making ^ match at the start of each line
^ - start of a line
AB - an AB substring
\h+ - one or more horizontal whitespaces
\K - match reset operator that discards the text matched so far
.* - the rest of the line
(?:\R\h.+)* - zero or more consecutive lines that start with a horizontal whitespace.
In R, you may use it like
x <- as.String(f)
regmatches(x, gregexpr("(?m)^AB\\h+\\K.*(?:\\R\\h.+)*", x, perl=TRUE))
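A self-contained way to try this without the NLP package is sketched below. It assumes that joining the lines with paste(f, collapse = "\n") is an acceptable substitute for as.String(f); the sample string is made up.

```r
# Hypothetical miniature of the file, already joined with newlines
# (equivalent to paste(readLines("sample.txt"), collapse = "\n")).
x <- paste(c("PT J", "AB First line", "   second line", "ZR 0", "ER",
             "PT J", "AB Other abstract", "   more text", "TC 0", "ER"),
           collapse = "\n")

# perl = TRUE is required for (?m), \h, \R and \K.
m <- regmatches(x, gregexpr("(?m)^AB\\h+\\K.*(?:\\R\\h.+)*", x, perl = TRUE))[[1]]

# Each match still contains the line breaks and indentation; flatten them.
abstracts <- gsub("\\s*\n\\s*", " ", m)
```

The result is a character vector with one element per abstract, which can go straight into data.frame(abstract = abstracts).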

Creating multiple dataframes out of one based on string search in R

I am relatively new to R. I have a dataframe that has more than 10 million rows that contain 500,000 PMIDs (a type of ID). However, the code I use to run on it can only handle 4000-5000 PMIDs at most. Here is a sample of what the raw dataframe (it's all in one column) looks like:
PMID- 28524368
OT - cardiomyopathy
OT - encephalitis
LID - 10.1111/jmp.12273 [doi]
PL - Denmark
PMID- 28523858
OT - Pan troglodytes
PST - aheadofprint
LID - 10.1111/echo.13561 [doi]
STAT- Publisher
FAU - Ruivo, Catarina
PMID- 52528302
CI - (c) 2017, Wiley Periodicals, Inc.
DA - 20170518
OWN - NLM
PMID- 18325287
STAT- Publisher
OWN - NLM
DA - 20170519
LA - eng
PMID- 95625132
FAU - Oumerzouk, Jawad
JID - 0135232
PL - Australia
PMID- 47628853
LA - eng
STAT- Publisher
AID - 10.1111/jmp.12273 [doi]
As you can see in the example dataframe, there are only 6 PMIDs. So for the sake of the example, let's say I need to make multiple dataframes and each dataframe should only have 2 PMIDs (in my actual code I will probably do around 4000 PMIDs). Thus, I would like to split up my dataframe into 3 different dataframes that look like this (start at one PMID and end before the third PMID comes)
df1:
PMID- 28524368
OT - cardiomyopathy
OT - encephalitis
LID - 10.1111/jmp.12273 [doi]
PL - Denmark
PMID- 28523858
OT - Pan troglodytes
PST - aheadofprint
LID - 10.1111/echo.13561 [doi]
STAT- Publisher
FAU - Ruivo, Catarina
df2:
PMID- 52528302
CI - (c) 2017, Wiley Periodicals, Inc.
DA - 20170518
OWN - NLM
PMID- 18325287
STAT- Publisher
OWN - NLM
DA - 20170519
LA - eng
df3:
PMID- 95625132
FAU - Oumerzouk, Jawad
JID - 0135232
PL - Australia
PMID- 47628853
LA - eng
STAT- Publisher
AID - 10.1111/jmp.12273 [doi]
Note that the number of rows between one PMID and the next varies, so the split must be done by string matching on PMID. I don't know how to do this on such a large dataset (how do I avoid creating the dataframes manually? A for loop?)
Any suggestions would be appreciated.

Make a little counter whenever you hit the start of a new group, then split. Here's a simplified example:
x <- rep(1:3,5)
grpsize <- 2
split(x, (cumsum(x==1)+grpsize-1) %/% grpsize)
#$`1`
#[1] 1 2 3 1 2 3
#
#$`2`
#[1] 1 2 3 1 2 3
#
#$`3`
#[1] 1 2 3
On your full data then you could use grepl to identify the start of each group:
split(df, (cumsum(grepl("^PMID",df$var)) + grpsize - 1) %/% grpsize)
Arguably you could add the counter as a new column on your dataset and use it as an identifier to go from a long to a wide dataset.
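Applied to data shaped like the question's, the counter idea might look like the sketch below. The data frame is a shortened, made-up version of the sample, and the column names `record` and `chunk` are just illustrative.

```r
# Hypothetical miniature of the one-column data frame: 6 PMIDs.
df <- data.frame(var = c("PMID- 28524368", "OT  - cardiomyopathy",
                         "PMID- 28523858", "PST - aheadofprint",
                         "PMID- 52528302", "DA  - 20170518",
                         "PMID- 18325287", "LA  - eng",
                         "PMID- 95625132", "PL  - Australia",
                         "PMID- 47628853", "LA  - eng"),
                 stringsAsFactors = FALSE)
grpsize <- 2

# Running record number: increases by one at every PMID line.
df$record <- cumsum(grepl("^PMID", df$var))

# Map records 1-2 to chunk 1, 3-4 to chunk 2, and so on.
df$chunk <- (df$record + grpsize - 1) %/% grpsize

# One data frame per chunk.
parts <- split(df, df$chunk)
```

The `record` column is also the identifier you would need to reshape the data from long to wide later on.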

Although the solution from @thelatemail seemed very promising, it did not work on my dataset. Even when I tried the code on a smaller subset of only 1 million rows, it would freeze my computer, and I had to keep restarting and reloading all the code and the large file. Perhaps it works better on numerical data, on less data, or with data.table or dplyr, or maybe I was just coding it wrong. I'm not sure why I couldn't get it to work, but I was able to come up with my own solution:
# row indices at which each PMID record starts
a <- which(grepl("^PMID", df$V1))
# creates dataframes of 4000 PMIDs each: from the first row of a block
# up to the row just before the next block starts, so the last PMID
# of each block keeps all of its rows
df1 <- df[a[1]:(a[4001] - 1), , drop = FALSE]
df2 <- df[a[4001]:(a[8001] - 1), , drop = FALSE]
etc...until df100, ha. Very tedious, but I couldn't come up with a way to avoid doing this manually (perhaps by creating a function?). Regardless, my code ran within seconds, so I'm not complaining. Plus, the tedious part was just mindless work that only took 10-15 minutes.
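For what it's worth, the manual part can probably be wrapped in a small function along these lines. This is only a sketch: `split_by_pmid` is a made-up name, the toy data frame stands in for the real one, and it assumes the first row of the data is a PMID line.

```r
# Hypothetical helper: cut a one-column data frame into blocks of
# `chunk_size` PMID records using row indices only.
split_by_pmid <- function(df, col = "V1", chunk_size = 4000) {
  starts <- grep("^PMID", df[[col]])                       # first row of every record
  first  <- starts[seq(1, length(starts), by = chunk_size)] # first row of every block
  last   <- c(first[-1] - 1, nrow(df))                     # row before the next block
  Map(function(i, j) df[i:j, , drop = FALSE], first, last)
}

# Toy data: 5 PMIDs, split into blocks of 2.
df <- data.frame(V1 = c("PMID- 1", "OT  - a", "PMID- 2", "LA  - eng",
                        "PMID- 3", "DA  - x", "PMID- 4", "PL  - y",
                        "PMID- 5"),
                 stringsAsFactors = FALSE)
chunks <- split_by_pmid(df, chunk_size = 2)
```

The result is a list, so instead of df1, df2, ... you would run your code with lapply(chunks, your_function) or loop over chunks[[i]].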

Name matching with different length data frames in R

I have two dataframes with numerous variables. Of primary concern are the following variables, df1.organization_name and df2.legal.name. I'm just using fully qualified SQL-esque names here.
df1 has dimensions of 15 x 2700 whereas df2 has dimensions of 10x40,000. And essentially, the 'common' or 'matching' columns are name fields.
I reviewed this post Merging through fuzzy matching of variables in R and it was very helpful but I can't really figure out how to wrangle the script to get it to work with my dfs.
I keep getting an error - Error in which(organization_name[i] == LEGAL.NAME) :
object 'LEGAL.NAME' not found.
Desired Matching and Outcome
What I am trying to do is compare each of my df1.organization_name values to every df2.legal.name value and flag the pairs that are a very close match (like >=85%). Then, like in the script above, I want to take the matched customer name and the matched comparison name and put them into a data.frame for later analysis.
So, if one of my customer names is 'Johns Hopkins Auto Repair' and one of my public list names is, 'John Hopkins Microphone Repair', I would call that a good match and I want some sort of indicator appended to my customer list (in another column) that says, 'Partial Match' and the name from the public list.
Example(s) of the dfs for text wrangling:
df1.organization_name (these are fake names b/c I can't post customer names)
- My Company LLC
- John Johns DBA John's Repair
- Some Company Inc
- Ninja Turtles LLP
- Shredder Partners
df2.LEGAL.NAME (these are real names from the open source file)
- $1 & UP STORE CORP.
- $1 store 0713
- LLC 0baid/munir/gazem
- 1 2 3 MONEY EXCHANGE LLC
- 1 BOY & 3 GIRLS, LLC
- 1 STAR BEVERAGE INC
- 1 STOP LLC
- 1 STOP LLC
- 1 STOP LLC DBA TIENDA MEXICANA LA SAN JOSE
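One possible way to get a percentage-style score in base R is sketched below, using utils::adist (edit distance) scaled by the longer name's length. This is an assumption about what ">=85%" should mean, not the method from the linked post, and the name vectors are invented. Note that columns are referenced through their data frame (e.g. df2$LEGAL.NAME); using the bare name LEGAL.NAME outside the data frame is a typical cause of the "object not found" error.

```r
# Hypothetical stand-ins for df1$organization_name and df2$LEGAL.NAME.
org   <- c("My Company LLC", "Some Company Inc", "Shredder Partners")
legal <- c("MY COMPANY, LLC", "1 STOP LLC", "SOME COMPANY INC.")

# Edit distance between every pair, ignoring case.
d <- adist(toupper(org), toupper(legal))

# Scale to a 0-1 similarity using the longer of the two names.
longest <- outer(nchar(org), nchar(legal), pmax)
sim <- 1 - d / longest

# Best candidate per organization name and its score.
best     <- apply(sim, 1, which.max)
best_sim <- sim[cbind(seq_along(org), best)]

threshold <- 0.85
result <- data.frame(
  organization_name = org,
  best_legal_name   = ifelse(best_sim >= threshold, legal[best], NA),
  similarity        = round(best_sim, 2),
  match             = ifelse(best_sim >= threshold, "Partial Match", "No Match"),
  stringsAsFactors  = FALSE
)
```

With thousands of names against tens of thousands, the full distance matrix gets large, so you may want to process the organization names in batches.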
- 1 Stop Money Centers, LLC/Richard

readPDF (tm package) in R

I tried to read an online PDF document in R using the readPDF function. My script goes like this:
safex <- readPDF(PdftotextOptions='-layout')(elem=list(uri='C:/Users/FCG/Desktop/NoteF7000.pdf'),language='en',id='id1')
R showed the message that the running command had status 309. I tried different pdftotext options, but I get the same message, and the text file that is created has no content.
Can anyone read this PDF?

readPDF has bugs and probably isn't worth bothering with (check out this well-documented struggle with it).
Assuming that...
you've got xpdf installed (see here for details)
your PATHs are all in order (see here for details of how to do that) and you've restarted your computer.
Then you might be better off avoiding readPDF and instead using this workaround (wait=TRUE makes R wait until pdftotext has finished writing the text file):
system(paste('"C:/Program Files/xpdf/pdftotext.exe"',
'"C:/Users/FCG/Desktop/NoteF7000.pdf"'), wait=TRUE)
And then read the text file, which pdftotext writes next to the PDF with a .txt extension, into R like so...
library(tm)
mycorpus <- Corpus(URISource("C:/Users/FCG/Desktop/NoteF7000.txt"))
And have a look to confirm that it went well:
inspect(mycorpus)
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Market Notice
Number: Date F7001 08 May 2013
New IDX SSF (EWJG) The following new IDX SSF contract will be added to the list and will be available for trade today.
Summary Contract Specifications Contract Code Underlying Instrument Bloomberg Code ISIN Code EWJG EWJG IShares MSCI Japan Index Fund (US) EWJ US EQUITY US4642868487 1 (R1 per point)
Contract Size / Nominal
Expiry Dates & Times
10am New York Time; 14 Jun 2013 / 16 Sep 2013
Underlying Currency Quotations Minimum Price Movement (ZAR) Underlying Reference Price
USD/ZAR Bloomberg Code (USDZAR Currency) Price per underlying share to two decimals. R0.01 (0.01 in the share price)
4pm underlying spot level as captured by the JSE.
Currency Reference Price
The same method as the one utilized for the expiry of standard currency futures on standard quarterly SAFEX expiry dates.
JSE Limited Registration Number: 2005/022939/06 One Exchange Square, Gwen Lane, Sandown, South Africa. Private Bag X991174, Sandton, 2146, South Africa. Telephone: +27 11 520 7000, Facsimile: +27 11 520 8584, www.jse.co.za
Executive Director: NF Newton-King (CEO), A Takoordeen (CFO) Non-Executive Directors: HJ Borkum (Chairman), AD Botha, MR Johnston, DM Lawrence, A Mazwai, Dr. MA Matooane , NP Mnxasana, NS Nematswerani, N Nyembezi-Heita, N Payne Alternate Directors: JH Burke, LV Parsons
Member of the World Federation of Exchanges
Company Secretary: GC Clarke
Settlement Method
Cash Settled
-
Clearing House Fees -
On-screen IDX Futures Trading: o 1 BP for Taker (Aggressor) o Zero Booking Fees for Maker (Passive) o No Cap o Floor of 0.01 Reported IDX Futures Trades o 1.75 BP for both buyer and seller o No Cap o Floor of 0.01
Initial Margin Class Spread Margin V.S.R. Expiry Date
R 10.00 R 5.00 3.5 14/06/2013, 16/09/2013
The above instrument has been designated as "Foreign" by the South African Reserve Bank
Should you have any queries regarding IDX Single Stock Futures, please contact the IDX team on 011 520-7399 or idx#jse.co.za
Graham Smale Director: Bonds and Financial Derivatives Tel: +27 11 520 7831 Fax:+27 11 520 8831 E-mail: grahams#jse.co.za
Distributed by the Company Secretariat +27 11 520 7346
Page 2 of 2
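The workaround above could be wrapped in a small helper so that failures show up as R errors rather than an empty text file. This is only a sketch: `pdf_to_text` is a made-up name, and it assumes pdftotext is on the PATH (otherwise pass the full path to the executable as `exe`).

```r
# Hypothetical wrapper around the pdftotext command-line tool.
pdf_to_text <- function(pdf, exe = Sys.which("pdftotext"), layout = TRUE) {
  if (!nzchar(exe)) stop("pdftotext not found; install xpdf and check your PATH")
  if (!file.exists(pdf)) stop("no such file: ", pdf)

  # Write the text next to the PDF, with a .txt extension.
  txt  <- sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
  args <- c(if (layout) "-layout", shQuote(pdf), shQuote(txt))

  # system2 waits for the command and returns its exit status.
  status <- system2(exe, args)
  if (status != 0) stop("pdftotext exited with status ", status)

  readLines(txt, warn = FALSE)
}
```

Usage would be something like txt <- pdf_to_text("C:/Users/FCG/Desktop/NoteF7000.pdf"), after which the lines can be inspected directly or passed on to tm.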
