readPDF (tm package) in R - r

I tried to read an online PDF document in R using the readPDF function from the tm package. My script is:
safex <- readPDF(PdftotextOptions='-layout')(elem=list(uri='C:/Users/FCG/Desktop/NoteF7000.pdf'),language='en',id='id1')
R reported that the running command had status 309. I tried different pdftotext options, but the message stays the same, and the text file that gets created has no content.
Can anyone read this PDF?

readPDF has bugs and probably isn't worth bothering with (check out this well-documented struggle with it).
Assuming that...
you've got xpdf installed (see here for details)
your PATHs are all in order (see here for details of how to do that) and you've restarted your computer.
Then you might be better off avoiding readPDF and instead using this workaround:
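# Convert the PDF to a text file; wait=TRUE blocks until pdftotext has finished,
# so the text file exists before it is read in the next step.
system(paste('"C:/Program Files/xpdf/pdftotext.exe"',
'"C:/Users/FCG/Desktop/NoteF7000.pdf"'), wait=TRUE)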
And then read the text file into R like so...
require(tm)
mycorpus <- Corpus(URISource("C:/Users/FCG/Desktop/NoteF7001.txt"))
And have a look to confirm that it went well:
inspect(mycorpus)
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Market Notice
Number: Date F7001 08 May 2013
New IDX SSF (EWJG) The following new IDX SSF contract will be added to the list and will be available for trade today.
Summary Contract Specifications Contract Code Underlying Instrument Bloomberg Code ISIN Code EWJG EWJG IShares MSCI Japan Index Fund (US) EWJ US EQUITY US4642868487 1 (R1 per point)
Contract Size / Nominal
Expiry Dates & Times
10am New York Time; 14 Jun 2013 / 16 Sep 2013
Underlying Currency Quotations Minimum Price Movement (ZAR) Underlying Reference Price
USD/ZAR Bloomberg Code (USDZAR Currency) Price per underlying share to two decimals. R0.01 (0.01 in the share price)
4pm underlying spot level as captured by the JSE.
Currency Reference Price
The same method as the one utilized for the expiry of standard currency futures on standard quarterly SAFEX expiry dates.
JSE Limited Registration Number: 2005/022939/06 One Exchange Square, Gwen Lane, Sandown, South Africa. Private Bag X991174, Sandton, 2146, South Africa. Telephone: +27 11 520 7000, Facsimile: +27 11 520 8584, www.jse.co.za
Executive Director: NF Newton-King (CEO), A Takoordeen (CFO) Non-Executive Directors: HJ Borkum (Chairman), AD Botha, MR Johnston, DM Lawrence, A Mazwai, Dr. MA Matooane , NP Mnxasana, NS Nematswerani, N Nyembezi-Heita, N Payne Alternate Directors: JH Burke, LV Parsons
Member of the World Federation of Exchanges
Company Secretary: GC Clarke
Settlement Method
Cash Settled
-
Clearing House Fees -
On-screen IDX Futures Trading: o 1 BP for Taker (Aggressor) o Zero Booking Fees for Maker (Passive) o No Cap o Floor of 0.01 Reported IDX Futures Trades o 1.75 BP for both buyer and seller o No Cap o Floor of 0.01
Initial Margin Class Spread Margin V.S.R. Expiry Date
R 10.00 R 5.00 3.5 14/06/2013, 16/09/2013
The above instrument has been designated as "Foreign" by the South African Reserve Bank
Should you have any queries regarding IDX Single Stock Futures, please contact the IDX team on 011 520-7399 or idx@jse.co.za
Graham Smale Director: Bonds and Financial Derivatives Tel: +27 11 520 7831 Fax:+27 11 520 8831 E-mail: grahams@jse.co.za
Distributed by the Company Secretariat +27 11 520 7346
Page 2 of 2
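The workaround above could be wrapped in a small helper that stops with an error when the conversion does not succeed, which would surface a non-zero status like the one in the question. This is only a sketch: pdf_to_text is a hypothetical name, and it assumes pdftotext is on the PATH and accepts the -layout option used in the question.

```r
# Hypothetical helper: convert a PDF with pdftotext and return its lines.
pdf_to_text <- function(pdf, exe = Sys.which("pdftotext")) {
  if (!nzchar(exe)) stop("pdftotext not found on PATH")
  # Write the text next to the PDF, swapping the extension.
  txt <- sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
  # system2() waits for the command and returns its exit status.
  status <- system2(exe, args = c("-layout", shQuote(pdf), shQuote(txt)))
  if (status != 0) stop("pdftotext failed with status ", status)
  readLines(txt, warn = FALSE)
}
```

It would be called as pdf_to_text("C:/Users/FCG/Desktop/NoteF7000.pdf"), with the exe argument pointing at pdftotext.exe if it is not on the PATH.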

Related

Converting PDF to text with pdftools in R returning empty string

In the following example, the result is empty for every page in the PDF.
library(pdftools)
rm(list = ls())
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
url = "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
destfile = file.path(getwd(), basename(url))
download.file(url, destfile, mode = "wb")
file = list.files(path=".", pattern="pdf$")
pdf_text(file)
I am not sure whether there is a problem with the PDF file and the way it was scanned and saved that prevents PDF reading.
Is there a workaround for PDF files like this or a better package/library that I should consider?
I would guess that the issue is that it's a scanned document, so you probably need an OCR tool to extract the text from it. One option would be the tesseract package:
library(tesseract)
url = "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
eng <- tesseract("eng")
text <- tesseract::ocr(url, engine = eng)
#> Converting page 1 to file16a069b77ed2SBS72-Pricing-Supplement_1.png... done!
#> Converting page 2 to file16a069b77ed2SBS72-Pricing-Supplement_2.png... done!
#> Converting page 3 to file16a069b77ed2SBS72-Pricing-Supplement_3.png... done!
#> Converting page 4 to file16a069b77ed2SBS72-Pricing-Supplement_4.png... done!
#> Converting page 5 to file16a069b77ed2SBS72-Pricing-Supplement_5.png... done!
#> Converting page 6 to file16a069b77ed2SBS72-Pricing-Supplement_6.png... done!
#> Converting page 7 to file16a069b77ed2SBS72-Pricing-Supplement_7.png... done!
#> Converting page 8 to file16a069b77ed2SBS72-Pricing-Supplement_8.png... done!
text[[1]]
#> [1] "APPLICABLE PRICING SUPPLEMENT DATED 28 JANUARY 2022\nThe Standard Bank of South Africa Limited\n(dncorporated with limited liability under Registration Number 1962/000738/06\nin the Republic of South Africa)\nIssue of ZAR404,000,000 Senior Unsecured Floating Rate Notes due 02 February 2029\nUnder its ZAR110,000,000,000 Domestic Medium Term Note Programme\nThis document constitutes the Applicable Pricing Supplement relating to the issue of Notes described herein.\nTerms used herein shall be deemed to be defined as such for the purposes of the terms and conditions (the\n“Terms and Conditions\") set forth in the Programme Memorandum dated 24 December 2020 (the \"Programme\nMemorandum\"), as updated and amended from time to time. This Pricing Supplement must be read in\nconjunction with such Programme Memorandum. To the extent that there is any conflict or inconsistency between\nthe contents of this Pricing Supplement and the Programme Memorandum, the provisions of this Pricing\nSupplement shall prevail.\nDESCRIPTION OF THE NOTES\nl. Issuer The Standard Bank of South Africa\nLimited\n2. Debt Officer Amo Daehnke, Group Chief\nFinancial and Value Management\nOfficer of Standard Bank Group\nLimited\n3. Status of the Notes Senior Unsecured\n4. (a) Series Number 72\n(b) Tranche Number ]\n5. Aggregate Nominal Amount ZAR404,000,000\n6. Redemption/Payment Basis N/A\n7. Type of Notes Floating Rate Notes\n8. Interest Payment Basis Floating Rate\n9. Form of Notes Registered Notes\n10. Automatic/Optional Conversion from one Interest/Payment N/A\nBasis to another\nll. Issue Date 2 February 2022\n12. Business Centre Johannesburg\n13. Additional Business Centre N/A\n14. Specified Denomination ZAR]1,000,000\n15. Calculation Amount ZAR1,000,000\n16. Issue Price 100%\n17. Interest Commencement Date 02 February 2022\n18. Maturity Date 02 February 2029\n19. Maturity Period N/A\n1\n"
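If a folder mixes scanned and text-based PDFs, one possible pattern is to try plain text extraction first and fall back to OCR only when every page comes back empty. The sketch below assumes the pdftools package (its pdf_ocr_text function relies on tesseract being installed); extract_text is a hypothetical name, and the two extractor arguments exist only so the fallback logic can be swapped or checked in isolation.

```r
# Hypothetical helper: use the text layer if there is one, otherwise OCR.
extract_text <- function(path,
                         text_fun = pdftools::pdf_text,
                         ocr_fun = pdftools::pdf_ocr_text) {
  pages <- text_fun(path)
  # A scanned PDF typically yields only empty (or whitespace) page strings.
  if (all(!nzchar(trimws(pages)))) {
    pages <- ocr_fun(path)
  }
  pages
}
```

OCR is much slower than reading an embedded text layer, which is the reason for trying pdf_text first.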

Extract specific lines of text in r

I have a .txt file with thousands of lines. The file contains meta information about research articles. Every paper has fields such as published year (PY), title (TI), DOI number (DI), publication type (PT) and abstract (AB). The text file holds this information for almost 300 papers. The format for the first two articles is as follows.
PT J
AU Filieri, Raffaele
Acikgoz, Fulya
Ndou, Valentina
Dwivedi, Yogesh
TI Is TripAdvisor still relevant? The influence of review credibility,
review usefulness, and ease of use on consumers' continuance intention
SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
DI 10.1108/IJCHM-05-2020-0402
EA NOV 2020
PY 2020
AB Purpose - Recent figures show that users are discontinuing their usage
of TripAdvisor, the leading user-generated content (UGC) platform in the
tourism sector. Hence, it is relevant to study the factors that
influence travelers' continued use of TripAdvisor.
Design/methodology/approach - The authors have integrated constructs
from the technology acceptance model, information systems (IS)
continuance model and electronic word of mouth literature. They used
PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297
users of TripAdvisor recruited through Prolific.
Findings - Findings reveal that perceived ease of use, online consumer
review (OCR) credibility and OCR usefulness have a positive impact on
customer satisfaction, which ultimately leads to continuance intention
of UGC platforms. Customer satisfaction mediates the effect of the
independent variables on continuance intention.
Practical implications - Managers of UGC platforms (i.e. TripAdvisor)
can benefit from the findings of this study. Specifically, they should
improve the ease of use of their platforms by facilitating travelers'
information searches. Moreover, they should use signals to make credible
and helpful content stand out from the crowd of reviews.
Originality/value - This is the first study that adopts the IS
continuance model in the travel and tourism literature to research the
factors influencing consumers' continued use of travel-based UGC
platforms. Moreover, the authors have extended this model by including
new constructs that are particularly relevant to UGC platforms, such as
performance heuristics and OCR credibility.
ZR 0
ZA 0
Z8 0
ZS 0
TC 0
ZB 0
Z9 0
SN 0959-6119
EI 1757-1049
UT WOS:000592516500001
ER
PT J
AU Li, Yelin
Bu, Hui
Li, Jiahong
Wu, Junjie
TI The role of text-extracted investor sentiment in Chinese stock price
prediction with the enhancement of deep learning
SO INTERNATIONAL JOURNAL OF FORECASTING
VL 36
IS 4
BP 1541
EP 1562
DI 10.1016/j.ijforecast.2020.05.001
PD OCT-DEC 2020
PY 2020
AB Whether investor sentiment affects stock prices is an issue of
long-standing interest for economists. We conduct a comprehensive study
of the predictability of investor sentiment, which is measured directly
by extracting expectations from online user-generated content (UGC) on
the stock message board of Eastmoney.com in the Chinese stock market. We
consider the influential factors in prediction, including the selections
of different text classification algorithms, price forecasting models,
time horizons, and information update schemes. Using comparisons of the
long short-term memory (LSTM) model, logistic regression, support vector
machine, and Naive Bayes model, the results show that daily investor
sentiment contains predictive information only for open prices, while
the hourly sentiment has two hours of leading predictability for closing
prices. Investors do update their expectations during trading hours.
Moreover, our results reveal that advanced models, such as LSTM, can
provide more predictive power with investor sentiment only if the inputs
of a model contain predictive information. (C) 2020 International
Institute of Forecasters. Published by Elsevier B.V. All rights
reserved.
CT 14th International Conference on Services Systems and Services
Management (ICSSSM)
CY JUN 16-18, 2017
CL Dongbei Univ Finance & Econ, Sch Management Sci & Engn, Dalian, PEOPLES
R CHINA
HO Dongbei Univ Finance & Econ, Sch Management Sci & Engn
SP Tsinghua Univ; Chinese Univ Hong Kong; IEEE Syst Man & Cybernet Soc
ZA 0
TC 0
ZB 0
ZS 0
Z8 0
ZR 0
Z9 0
SN 0169-2070
EI 1872-8200
UT WOS:000570797300025
ER
Now, I want to extract the abstract of each article and store it in the data frame. To extract the abstract I have the following code, which gives me the first match of abstract.
f = readLines("sample.txt")
#extract first match....
pattern <- "AB\\s*(.*?)\\s*ZR"
result <- regmatches(as.String(f), regexec(pattern, as.String(f)))
result[[1]][2]
[1] "Purpose - Recent figures show that users are discontinuing their usage\n of TripAdvisor, the leading user-generated content (UGC) platform in the\n tourism sector. Hence, it is relevant to study the factors that\n influence travelers' continued use of TripAdvisor.\n Design/methodology/approach - The authors have integrated constructs\n from the technology acceptance model, information systems (IS)\n continuance model and electronic word of mouth literature. They used\n PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297\n users of TripAdvisor recruited through Prolific.\n Findings - Findings reveal that perceived ease of use, online consumer\n review (OCR) credibility and OCR usefulness have a positive impact on\n customer satisfaction, which ultimately leads to continuance intention\n of UGC platforms. Customer satisfaction mediates the effect of the\n independent variables on continuance intention.\n Practical implications - Managers of UGC platforms (i.e. TripAdvisor)\n can benefit from the findings of this study. Specifically, they should\n improve the ease of use of their platforms by facilitating travelers'\n information searches. Moreover, they should use signals to make credible\n and helpful content stand out from the crowd of reviews.\n Originality/value - This is the first study that adopts the IS\n continuance model in the travel and tourism literature to research the\n factors influencing consumers' continued use of travel-based UGC\n platforms. Moreover, the authors have extended this model by including\n new constructs that are particularly relevant to UGC platforms, such as\n performance heuristics and OCR credibility."
The problem is that I want to extract all the abstracts, but the tag that follows the abstract differs between articles, so this pattern does not work in general. The rule that holds for every abstract is: extract the text starting at AB, plus every following line that begins with a space. Can anybody help me with this?
You can first group the lines: whenever a line does not start with a space character the group counter is moved up by one.
Then you can aggregate f by group and select the abstracts from the aggregated vector:
group <- cumsum(!grepl("^ ", f))
f2 <- aggregate(f, list(group), function(x) paste(x, collapse = " "))[, 2]
f2[grepl("^AB ", f2)]
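Since the question asks for the abstracts in a data frame, the grouping idea above can be taken one step further. The following is a minimal sketch on made-up lines that mimic the layout in the question (continuation lines start with spaces); the sample data and the abstracts_df name are assumptions for illustration.

```r
# Made-up sample in the same tagged layout as the question's file.
f <- c(
  "PT J",
  "AU Doe, Jane",
  "   Roe, Richard",
  "TI A first title",
  "AB First abstract, line one",
  "   line two of the first abstract.",
  "ZR 0",
  "ER",
  "",
  "PT J",
  "TI A second title",
  "AB Second abstract on a single line.",
  "CT Some conference",
  "ER"
)

# A new group starts at every line that does not begin with a space.
group <- cumsum(!grepl("^ ", f))

# Collapse each group (a tag plus its continuation lines) into one string.
fields <- vapply(split(f, group),
                 function(x) paste(trimws(x), collapse = " "),
                 character(1))

# Keep the AB fields, drop the tag, and store one abstract per row.
abstracts <- sub("^AB\\s+", "", fields[grepl("^AB ", fields)])
abstracts_df <- data.frame(abstract = unname(abstracts), stringsAsFactors = FALSE)
```

Because the grouping does not depend on which tag follows the abstract, it handles both the ZR and the CT case.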
A completely different approach. If your text file has the layout you are showing, you could also read everything in a data.frame with readr::read_fwf. When doing this you have all the info from the articles available. You could use tidyr::fill to fill out the missing meta info.
library(dplyr)
library(readr)
articles <- read_fwf("tests/SO text.txt", fwf_empty("tests/SO text.txt", col_names = c("mi", "text")))
articles <- articles %>%
filter(!(is.na(mi) & is.na(text))) # removes empty lines between articles.
articles
# A tibble: 98 x 2
mi text
<chr> <chr>
1 PT J
2 AU Filieri, Raffaele
3 NA Acikgoz, Fulya
4 NA Ndou, Valentina
5 NA Dwivedi, Yogesh
6 TI Is TripAdvisor still relevant? The influence of review credibility,
7 NA review usefulness, and ease of use on consumers' continuance intention
8 SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
9 DI 10.1108/IJCHM-05-2020-0402
10 EA NOV 2020
# ... with 88 more rows
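The tidyr::fill step mentioned above is not shown in the answer, so here is one way it might look. This is a sketch on a hand-built data frame shaped like the read_fwf result; the sample rows, the article counter, and the assumption that every record starts with a PT row are illustrative.

```r
library(dplyr)
library(tidyr)

# Hand-built rows shaped like the read_fwf() result (tag column + text column).
articles <- data.frame(
  mi = c("PT", "AU", NA, "TI", "AB", NA, "ZR", "ER", "PT", "TI", "AB", "ER"),
  text = c("J", "Doe, Jane", "Roe, Richard", "A first title",
           "First abstract, line one", "line two.", "0", NA,
           "J", "A second title", "Second abstract.", NA),
  stringsAsFactors = FALSE
)

abstracts <- articles %>%
  mutate(article = cumsum(mi %in% "PT")) %>%  # assumed: a PT row opens each record
  fill(mi) %>%                                 # carry each tag down onto its continuation lines
  filter(mi == "AB") %>%
  group_by(article) %>%
  summarise(abstract = paste(text, collapse = " "), .groups = "drop")
```

The same fill-then-summarise pattern would work for any other multi-line field, such as TI or AU.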
Try it with this regex:
^AB (?:(?!^[A-Z]{2} )([\s\S]))*
PCRE Demo (requires perl=TRUE in R)
If you want to drop the prefix from the match, add \K after it, so that the pattern starts with ^AB \K
You can use
(?m)^AB\h+\K.*(?:\R\h.+)*
See the regex demo. Details:
(?m) - a multiline flag making ^ match at the start of each line
^ - start of a line
AB - an AB substring
\h+ - one or more horizontal whitespaces
\K - the match reset operator, which discards the text matched so far
.* - the rest of the line
(?:\R\h.+)* - zero or more consecutive lines that start with a horizontal whitespace.
In R, you may use it like
library(NLP)  # provides as.String(), which joins the lines with "\n"
x <- as.String(f)
regmatches(x, gregexpr("(?m)^AB\\h+\\K.*(?:\\R\\h.+)*", x, perl=TRUE))
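The matches returned this way still contain the line breaks and indentation of the continuation lines. A small follow-up sketch, using made-up lines and plain paste instead of as.String so that no extra package is needed, shows how they might be flattened into one string per abstract.

```r
# Made-up sample in the same tagged layout as the question's file.
f <- c(
  "TI A first title",
  "AB First abstract, line one",
  "   line two of the first abstract.",
  "ZR 0",
  "ER",
  "TI A second title",
  "AB Second abstract on a single line.",
  "CT Some conference",
  "ER"
)

# Join the lines into one string, then pull out every AB block.
x <- paste(f, collapse = "\n")
m <- regmatches(x, gregexpr("(?m)^AB\\h+\\K.*(?:\\R\\h.+)*", x, perl = TRUE))[[1]]

# Replace each line break and its surrounding whitespace with a single space.
abstracts <- gsub("\\s*\\R\\s*", " ", m, perl = TRUE)
```

The result is a plain character vector, which can be placed in a data frame column directly.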

Extract the paragraphs from a PDF that contain a keyword using R

I need to extract from a PDF file the paragraphs that contain a keyword. I tried various pieces of code, but none of them returned anything.
I have seen this code from user @Tyler Rinker (Extract before and after lines based on keyword in Pdf using R programming), but it extracts only the line containing the keyword plus the lines before and after it.
library(textreadr)
library(tidyverse)
loc <- function(var, regex, n = 1, ignore.case = TRUE){
locs <- grep(regex, var, ignore.case = ignore.case)
out <- sort(unique(c(locs - 1, locs, locs + 1)))
out <- out[out > 0]
out[out <= length(var)]
}
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
read_pdf() %>%
slice(loc(text, 'cancer'))
However, I need to get the paragraphs and store each one in a row in my database. Could you help me?
The text lines in paragraphs will all be quite long unless it is the final line of the paragraph. We can count the characters in each line and do a histogram to show this:
library(textreadr)
doc <- read_pdf('https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf')
hist(nchar(doc$text), 20)
So anything less than about 75 characters is either not in a paragraph or at the end of a paragraph. We can therefore stick a line break on the short ones, paste all the lines together, then split on linebreaks:
doc$text[nchar(doc$text) < 75] <- paste0(doc$text[nchar(doc$text) < 75], "\n")
txt <- paste(doc$text, collapse = " ")
txt <- strsplit(txt, "\n")[[1]]
So now we can just do our regex and find the paragraphs with the key word:
grep("cancer", txt, value = TRUE)
#> [1] " Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but stresses that, in order for them to work, they should be voluntary, and the government should exempt all life-saving drugs from import duties and other taxes such as excise duty and VAT. He is, however, critical about a proposal for mandatory price negotiation of newly patented drugs. He feels this will erode India's credibility in implementing the Patent Act in © 2006 KPMG International. KPMG International is a Swiss cooperative that serves as a coordinating entity for a network of independent firms operating under the KPMG name. KPMG International provides no services to clients. Each member firm of KPMG International is a legally distinct and separate entity and each describes itself as such. All rights reserved. Collaboration for Growth 24"
#> [2] " a fair and transparent manner. To deal with diabetes, medicines are not the only answer; awareness about the need for lifestyle changes needs to be increased, he adds. While industry leaders have long called for the development of PPPs for the provision of health care in India, particularly in rural areas, such initiatives are currently totally unexplored. However, the government's 2006 draft National Pharmaceuticals Policy proposes the introduction of PPPs with drug manufacturers and hospitals as a way of vastly increasing the availability of medicines to treat life-threatening diseases. It notes, for example, that while an average estimate of the value of drugs to treat the country's cancer patients is $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the near non-accessibility of the medicines to a vast majority of the affected population, mainly because of the high cost of these medicines,” says the Policy, which also calls for tax and excise exemptions for anti-cancer drugs."
#> [3] " 50.1 percent of Aventis Pharma is held by European drug major Sanofi-Aventis and, in early April 2006, it was reported that UB Holdings had sold its 10 percent holding in the firm to Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective, anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1 million, with domestic sales up 9.1 percent at $129.8 million and exports increasing 12 percent to $51.2 million. Sales were led by 83 percent annual growth for the diabetes treatment Lantus (insulin glargine), followed by the rabies vaccine Rabipur (+22 percent), the diabetes drug Amaryl (glimepiride) and epilepsy treatment Frisium (clobazam), both up 18 percent, the angiotensin-coverting enzyme inhibitor Cardace (ramipril +15 percent), Clexane (enoxaparin), an anticoagulant, growing 14 percent and Targocid (teicoplanin), an antibiotic, whose sales advanced 8 percent."
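To store each paragraph in its own row, as the question asks, the same steps can end in a data frame. The sketch below runs on made-up lines rather than the PDF, and the threshold of 40 characters is an assumption chosen for this toy input; for a real document it would be read off the histogram as described above.

```r
# Made-up lines standing in for doc$text.
lines <- c(
  "This first paragraph line is long enough to count as body text here.",
  "It ends on a short line.",
  "A second paragraph also starts with a long line that mentions cancer.",
  "Short ending."
)

threshold <- 40  # assumed cut-off between full lines and paragraph-ending lines

# Mark short lines as paragraph ends, join everything, split on the markers.
short <- nchar(lines) < threshold
lines[short] <- paste0(lines[short], "\n")
txt <- strsplit(paste(lines, collapse = " "), "\n")[[1]]

# One paragraph per row, then keep the rows that mention the keyword.
paragraphs <- data.frame(paragraph = trimws(txt), stringsAsFactors = FALSE)
hits <- paragraphs[grepl("cancer", paragraphs$paragraph, ignore.case = TRUE), , drop = FALSE]
```

Headings and page footers are also short lines, so with a real PDF they end up attached to a neighbouring paragraph, as the KPMG footer does in the output above.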
Created on 2020-09-16 by the reprex package (v0.3.0)

error reading text file into new columns of a dataframe using some text editing

I have a text file (0001.txt) which contains the data as below:
<DOC>
<DOCNO>1100101_business_story_11931012.utf8</DOCNO>
<TEXT>
The Telegraph - Calcutta (Kolkata) | Business | Local firms go global
6 Local firms go global
JAYANTA ROY CHOWDHURY
New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.
Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.
Though the first-quarter investment was 15 per cent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
According to analysts, confidence in global recovery, cheap corporate buys abroad and easier rules governing investment overseas had spurred flow of capital and could see total investment abroad top $12 billion this year and rise to $18-20 billion next fiscal.
For example, Titagarh Wagons plans to expand abroad on the back of the proposed Asian railroad project.
We plan to travel all around the world with the growth of the railroads, said Umesh Chowdhury of Titagarh Wagons.
India is full of opportunities, but we are all also looking at picks abroad, said Gautam Mitra, managing director of Indian Structurals Engineering Company.
Mitra plans to open a holding company in Switzerland to take his business in structurals to other Asian and African countries.
Indian companies created 3 lakh jobs in the US, while contributing $105 billion to the US economy between 2004 and 2007, according to commerce ministry statistics. During 2008-09, Singapore, the Netherlands, Cyprus, the UK, the US and Mauritius together accounted for 81 per cent of the total outward investment.
Bose said, And not all of it is organic growth. Much of our investment abroad reflects takeovers and acquisitions.
In the last two years, Suzlon acquired Portugals Martifers stake in German REpower Systems for $122 million. McNally Bharat Engineering has bought the coal and minerals processing business of KHD Humboldt Wedag. ONGC bought out Imperial Energy for $2 billion.
Indias foreign assets and liabilities today add up to more than 60 per cent of its gross domestic product. By the end of 2008-09, total foreign investment was $67 billion, more than double of that at the end of March 2007.
</TEXT>
</DOC>
Above, all text data is within the HTML code for text i.e.
<TEXT> and </TEXT>.
I want to read it into an R dataframe in a way that there will be four columns and the data should be read as:
Title Author Date Text
The Telegraph - Calcutta (Kolkata) JAYANTA ROY CHOWDHURY Dec. 31 Indian companies are stepping out of their homes to try their luck on foreign shores. Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09. Though the first-quarter investment was 15 percent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
This is what I tried, using readr and dplyr:
# read text file
library(dplyr)
library(readr)
dat <- read_csv("0001.txt") %>% slice(-8)
# print part of data frame
head(dat, n=2)
In the code above, I tried to skip the first few lines of the text file (which are not important) and then read the rest into a data frame.
But I did not get the result I was looking for, and I cannot tell what I am doing wrong.
Could someone please help?
To be able to read data into R as a data frame or table, the data needs to have a consistent structure maintained by separators. One of the most common formats is a file with comma separated values (CSV).
The data you're working with doesn't have separators though. It's essentially a string with minimally enforced structure. Because of this, it sounds like the question is more related to regular expressions (regex) and data mining than it is to reading text files into R. So I'd recommend looking into those two things if you do this task often.
That aside, to do what you're wanting in this example, I'd recommend reading the text file into R as a single string of text first. Then you can parse the data you want using regex. Here's a basic, rough draft of how to do that:
fileName <- "Path/to/your/data/0001.txt"
string <- readChar(fileName, file.info(fileName)$size)
df <- data.frame(
Title=sub("\\s+[|]+(.*)","",string),
Author=gsub("(.*)+?([A-Z]{2,}.*[A-Z]{2,})+(.*)","\\2",string),
Date=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+(.*)","\\2",string),
Text=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+[: ]+(.*)","\\3",string))
Output:
str(df)
'data.frame': 1 obs. of 4 variables:
$ Title : chr "The Telegraph - Calcutta (Kolkata)"
$ Author: chr "JAYANTA ROY CHOWDHURY"
$ Date : chr "Dec. 31"
$ Text : chr "Indian companies are stepping out of their homes to"| __truncated__
The reason why regex can be useful is that it allows for very specific patterns in strings. The downside is when you're working with strings that keep changing formats. That will likely mean some slight adjustments to the regex used.
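An alternative that leans less on long regular expressions is to work line by line. The sketch below is not the approach from the answer above; it assumes every file has the same line order inside the TEXT block (title line, headline, author, then a first body line that opens with the dateline), and it uses an inline copy of the first lines of the sample instead of reading 0001.txt.

```r
# First lines of the sample from the question, as readLines() would return them.
raw <- c(
  "<DOC>",
  "<DOCNO>1100101_business_story_11931012.utf8</DOCNO>",
  "<TEXT>",
  "The Telegraph - Calcutta (Kolkata) | Business | Local firms go global",
  "6 Local firms go global",
  "JAYANTA ROY CHOWDHURY",
  "New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.",
  "Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.",
  "</TEXT>",
  "</DOC>"
)

# Keep only the lines between <TEXT> and </TEXT>.
body <- raw[(which(raw == "<TEXT>") + 1):(which(raw == "</TEXT>") - 1)]

# Assumed positions: 1 = title line, 3 = author, 4 = dateline + first sentence.
dateline <- regmatches(body[4], regexpr("[A-Z][a-z]{2}\\. [0-9]{1,2}", body[4]))
first_sentence <- sub("^[^:]*:\\s*", "", body[4])

df <- data.frame(
  Title = trimws(sub("\\|.*$", "", body[1])),
  Author = body[3],
  Date = dateline,
  Text = paste(c(first_sentence, body[-(1:4)]), collapse = " "),
  stringsAsFactors = FALSE
)
```

For many files, the same steps could be put in a function and the resulting one-row data frames combined with rbind.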
read.table( file = ... , sep = "|") will solve your issue.

Removing rows with a specific word in Corpus

I have a Corpus with multiple texts (news articles) scraped from the internet.
Some of the texts contain the description of the photo that is used in the article. I want to remove that.
I found an existing thread about this topic, but it did not help me. See link: Removing rows from Corpus with multiple documents
I want to remove every row that contains the words "PHOTO FILE" (in caps). This solution was posted:
require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
for(j in seq(textVector)) {
newCorp<-textVector
newCorp[[j]] <- textVector[[j]][-grep("PHOTO", textVector[[j]], ignore.case = FALSE)]
}
This does not seem to work for me though. The code runs but nothing is removed.
What does work is this:
require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
newCorp <- VCorpus(VectorSource(textVector[-grep("PHOTO", textVector,
ignore.case = FALSE)]))
But that removes every file that contains the word and I do not want that.
Would greatly appreciate if someone can help me on this.
Addition:
Here is an example of one of the texts:
[1] "Top News | Wed Apr 19, 2017 | 3:53pm BST\nFILE PHOTO: People walk accross a plaza in the Canary Wharf financial district, London, Britain, January 9, 2017. REUTERS/Dylan Martinez/File Photo\nLONDON Britain's current account deficit, one of the weak points of its economy, was bigger than previously thought in the years up to 2012, according to new estimates from the Office for National Statistics on Wednesday.\nThe figures showed British companies had paid out more interest to foreign holders of corporate bonds than initially estimated, resulting in a larger current account deficit.\nThe deficit, one of the biggest among advanced economies, has been in the spotlight since June's Brexit vote.\nBank of England Governor Mark Carney said in the run-up to the referendum that Britain was reliant on the \"kindness of strangers\", highlighting how the country needed tens of billions of pounds of foreign finance a year to balance its books.\nThe ONS said the current account deficit for 2012 now stood at 4.4 percent of gross domestic product, compared with 3.7 percent in its previous estimate.\nThe ONS revised up the deficit for every year dating back to 1998 by an average of 0.6 percentage points. The biggest revisions occurred from 2005 onwards.\nLast month the ONS said Britain's current account deficit tumbled to 2.4 percent of GDP in the final three months of 2016, less than half its reading of 5.3 percent in the third quarter.\nRevised data for 2012 onward is due on Sept. 29, and it is unclear if Wednesday's changes point to significant further upward revisions, as British corporate bond yields have declined markedly since 2012 and touched a new low in mid-2016. .MERUR00\nThe ONS also revised up its earlier estimates of how much Britons saved. 
The household savings ratio for 2012 rose to 9.8 percent from 8.3 percent previously, with a similar upward revision for 2011.\nThe ratio for Q4 2016, which has not yet been revised, stood at its lowest since 1963 at 3.3 percent.\nThe ONS said the changes reflected changes to the treatment of self-employed people paying themselves dividends from their own companies, as well as separating out the accounts of charities, which had previously been included with households.\nMore recent years may produce similarly large revisions to the savings ratio. Around 40 percent of the roughly 2.2 million new jobs generated since the beginning of 2008 fell into the self-employed category.\n"
So I wish to delete the sentence (row) of FILE PHOTO
Let's say that initially the text is contained in the file input.txt.
The raw file is as follows:
THis is a text that contains a lot
of information
and PHOTO FILE.
Great!
my_text<-readLines("input.txt")
[1] "THis is a text that contains a lot" "of information" "and PHOTO FILE." "Great!"
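If you get rid of the spurious element (using !grepl rather than -grep, so that nothing is dropped by accident when there is no match)
my_text[!grepl("PHOTO FILE", my_text, fixed = TRUE)]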
you end up with
[1] "THis is a text that contains a lot" "of information" "Great!"
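In the question each document is a single string with embedded "\n" characters, so the same filter has to be applied after splitting each document into lines. The following is a minimal sketch on two made-up documents; drop_photo_lines is a hypothetical helper, and the pattern "FILE PHOTO" is taken from the example text in the question.

```r
# Two made-up documents; the first contains a photo caption line.
texts <- c(
  "Top News | Wed Apr 19, 2017\nFILE PHOTO: People walk across a plaza. REUTERS/File Photo\nLONDON Britain's current account deficit was bigger than previously thought.",
  "Another article\nwith no photo caption."
)

# Hypothetical helper: split one document into lines, drop matching lines, re-join.
drop_photo_lines <- function(doc, pattern = "FILE PHOTO") {
  lines <- strsplit(doc, "\n", fixed = TRUE)[[1]]
  paste(lines[!grepl(pattern, lines, fixed = TRUE)], collapse = "\n")
}

cleaned <- vapply(texts, drop_photo_lines, character(1), USE.NAMES = FALSE)
```

The cleaned vector can then be turned back into a corpus with VCorpus(VectorSource(cleaned)), as in the question's own code.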

Resources