How to have all elements of a text file processed in R

I have a single text file, NPFile, that contains 100 different newspaper articles and is 3523 lines long. I am trying to pick out and parse different data fields for each article for text processing. These fields are: Full text:, Publication date:, Publication title:, etc.
I am using grep to pick out the lines that contain the data fields I want. Although I can get the line numbers (the start and end positions of each field), I get a warning when I try to use those line numbers to extract the actual text and put it into a vector:
# Find full text of article, clean and store in a variable
findft <- grep('Full text:', NPFile, ignore.case = TRUE)
endft <- grep('Publication date:', NPFile)
ftfield <- NPFile[findft:endft]
The last line, ftfield <- NPFile[findft:endft], gives this warning message:
1: In findft:endft :
numerical expression has 100 elements: only the first used
The start vector findft and end vector endft each contain 100 elements, but as the warning indicates, ftfield only contains the extraction for the first pair (which is 11 lines long). I assumed, mistakenly, that the respective lines for each of the 100 instances of the full-text field would be extracted and stored in ftfield, but obviously I have not coded this correctly. Any help would be appreciated.
Example of Data (These are the fields and data associated with one of the 100 in the text file):
Waiting for the 500-year flood; Red River rampage: Severe weather events, new records are more frequent than expected.
Full text: AS THE RED River raged over makeshift dikes futilely erected against its wrath in North Dakota, drowning cities beneath a column of water 26 feet above flood level, meteorologists were hard pressed to describe its magnitude in human chronology.
A 500-year flood, some call it, a catastrophic weather event that would have occurred only once since Christopher Columbus arrived on the shores of the New World. Whether it could be termed a 700-year flood or a 300-year flood is open to question.
The flood's size and power are unprecedented. While the Red River has ravaged the upper Midwest before, the height of the flood crest in Fargo and Grand Forks has been almost incomprehensible.
But climatological records are being broken more rapidly than ever. A 100-year-storm may as likely repeat within a few years as waiting another century. It is simply a way of classifying severity, not the frequency. "There isn't really a hundred-year event anymore," states climatologist Tom Karl of the National Oceanic and Atmospheric Administration.
Reliable, consistent weather records in the U.S. go back only 150 years or so. Human development has altered the Earth's surface and atmosphere, promoting greater weather changes and effects than an untouched environment would generate by itself.
What might be a 500-year event in the Chesapeake Bay is uncertain. Last year was the record for freshwater gushing into the bay. The January 1996 torrent of melted snowfall into the estuary recorded a daily average that exceeded the flow during Tropical Storm Agnes in 1972, a benchmark for 100-year meteorological events in these parts. But, according to the U.S. Geological Survey, the impact on the bay's ecosystem was not as damaging as in 1972.
Sea level in the Bay has risen nearly a foot in the past century, three times the rate of the past 5,000 years, which University of Maryland scientist Stephen Leatherman ties to global climate warming. Estuarine islands and upland shoreline are eroding at an accelerated pace.
The topography of the bay watershed is, of course, different from that of the Red River. It's not just flow rates and rainfall, but how the water is directed and where it can escape without intruding too far onto dry land. We can only hope that another 500 years really passes before the Chesapeake region is so tested.
Pub Date: 4/22/97
Publication date: Apr 22, 1997
Publication title: The Sun; Baltimore, Md.
Title: Waiting for the 500-year flood; Red River rampage: Severe weather events, new records are more frequent than expected.:   [FINAL Edition ]
From this data example above, ftfield has 11 lines when I examined it:
[1] "Full text: AS THE RED River raged over makeshift dikes futilely erected against its wrath in North Dakota, drowning cities beneath a column of water 26 feet above flood level, meteorologists were hard pressed to describe its magnitude in human chronology."
[2] "A 500-year flood, some call it, a catastrophic weather event that would have occurred only once since Christopher Columbus arrived on the shores of the New World. Whether it could be termed a 700-year flood or a 300-year flood is open to question."
[3] "The flood's size and power are unprecedented. While the Red River has ravaged the upper Midwest before, the height of the flood crest in Fargo and Grand Forks has been almost incomprehensible."
[4] "But climatological records are being broken more rapidly than ever. A 100-year-storm may as likely repeat within a few years as waiting another century. It is simply a way of classifying severity, not the frequency. \"There isn't really a hundred-year event anymore,\" states climatologist Tom Karl of the National Oceanic and Atmospheric Administration."
[5] "Reliable, consistent weather records in the U.S. go back only 150 years or so. Human development has altered the Earth's surface and atmosphere, promoting greater weather changes and effects than an untouched environment would generate by itself."
[6] "What might be a 500-year event in the Chesapeake Bay is uncertain. Last year was the record for freshwater gushing into the bay. The January 1996 torrent of melted snowfall into the estuary recorded a daily average that exceeded the flow during Tropical Storm Agnes in 1972, a benchmark for 100-year meteorological events in these parts. But, according to the U.S. Geological Survey, the impact on the bay's ecosystem was not as damaging as in 1972."
[7] "Sea level in the Bay has risen nearly a foot in the past century, three times the rate of the past 5,000 years, which University of Maryland scientist Stephen Leatherman ties to global climate warming. Estuarine islands and upland shoreline are eroding at an accelerated pace."
[8] "The topography of the bay watershed is, of course, different from that of the Red River. It's not just flow rates and rainfall, but how the water is directed and where it can escape without intruding too far onto dry land. We can only hope that another 500 years really passes before the Chesapeake region is so tested."
[9] "Pub Date: 4/22/97"
[10] ""
[11] "Publication date: Apr 22, 1997"
And, lastly, findft[1] corresponds with endft[1] and so on until findft[100] and endft[100].

I'll assume that findft contains several indices, as does endft. I'm also assuming that both have the same length and that they are paired by position (e.g. findft[5] corresponds to endft[5]), and that you want all NPFile elements between each pair of indices.
If so, try:
ftfield <- lapply(seq_along(findft), function(x) NPFile[findft[x]:endft[x]])
This returns a list. I can't guarantee that it will work because there is no data example to test against.
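To see the shape of the result, here is the same call on a tiny made-up stand-in for NPFile (two articles only; the data below is invented for illustration):

```r
# Toy stand-in for NPFile: two articles, each running from a
# "Full text:" line to a "Publication date:" line
NPFile <- c("Full text: body of article one",
            "more of article one",
            "Publication date: Apr 22, 1997",
            "Full text: body of article two",
            "Publication date: May 1, 1997")

findft <- grep("Full text:", NPFile, ignore.case = TRUE)
endft  <- grep("Publication date:", NPFile)

ftfield <- lapply(seq_along(findft), function(x) NPFile[findft[x]:endft[x]])
length(ftfield)  # 2: one character vector per article
ftfield[[1]]     # the three lines of the first article
```

Each list element is a character vector holding one article's lines, which you can then clean or collapse as needed.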

We can do this with Map: get the sequence of values for each corresponding element of 'findft' and 'endft', then subset 'NPFile' with that index:
Map(function(x, y) NPFile[x:y], findft, endft)
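With made-up stand-in data (invented here for illustration), Map returns one list element per article, and each element can then be collapsed into a single string for downstream text processing:

```r
# Two-article stand-in for NPFile
NPFile <- c("Full text: body of article one",
            "second line of article one",
            "Publication date: Apr 22, 1997",
            "Full text: body of article two",
            "Publication date: May 1, 1997")
findft <- grep("Full text:", NPFile)
endft  <- grep("Publication date:", NPFile)

ftfield <- Map(function(x, y) NPFile[x:y], findft, endft)

# Collapse each article's lines into a single string
articles <- vapply(ftfield, paste, character(1), collapse = " ")
length(articles)  # 2
```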

Extract the paragraphs from a PDF that contain a keyword using R

I need to extract from a PDF file the paragraphs that contain a keyword. I have tried various pieces of code, but none extracted anything.
I have seen this code from user @Tyler Rinker (Extract before and after lines based on keyword in Pdf using R programming), but it extracts the line where the keyword is, plus the lines before and after.
library(textreadr)
library(tidyverse)

loc <- function(var, regex, n = 1, ignore.case = TRUE){
  locs <- grep(regex, var, ignore.case = ignore.case)
  out <- sort(unique(c(locs - 1, locs, locs + 1)))
  out <- out[out > 0]
  out[out <= length(var)]
}

doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
  read_pdf() %>%
  slice(loc(text, 'cancer'))
However, I need to get the paragraphs and store each one in a row in my database. Could you help me?
Lines inside a paragraph will all be quite long; a line is short only when it ends a paragraph. We can count the characters in each line and plot a histogram to show this:
library(textreadr)
doc <- read_pdf('https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf')
hist(nchar(doc$text), 20)
So anything shorter than about 75 characters is either not in a paragraph or at the end of one. We can therefore append a line break to the short lines, paste all the lines together, then split on line breaks:
doc$text[nchar(doc$text) < 75] <- paste0(doc$text[nchar(doc$text) < 75], "\n")
txt <- paste(doc$text, collapse = " ")
txt <- strsplit(txt, "\n")[[1]]
So now we can just do our regex and find the paragraphs with the key word:
grep("cancer", txt, value = TRUE)
#> [1] " Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but stresses that, in order for them to work, they should be voluntary, and the government should exempt all life-saving drugs from import duties and other taxes such as excise duty and VAT. He is, however, critical about a proposal for mandatory price negotiation of newly patented drugs. He feels this will erode India's credibility in implementing the Patent Act in © 2006 KPMG International. KPMG International is a Swiss cooperative that serves as a coordinating entity for a network of independent firms operating under the KPMG name. KPMG International provides no services to clients. Each member firm of KPMG International is a legally distinct and separate entity and each describes itself as such. All rights reserved. Collaboration for Growth 24"
#> [2] " a fair and transparent manner. To deal with diabetes, medicines are not the only answer; awareness about the need for lifestyle changes needs to be increased, he adds. While industry leaders have long called for the development of PPPs for the provision of health care in India, particularly in rural areas, such initiatives are currently totally unexplored. However, the government's 2006 draft National Pharmaceuticals Policy proposes the introduction of PPPs with drug manufacturers and hospitals as a way of vastly increasing the availability of medicines to treat life-threatening diseases. It notes, for example, that while an average estimate of the value of drugs to treat the country's cancer patients is $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the near non-accessibility of the medicines to a vast majority of the affected population, mainly because of the high cost of these medicines,” says the Policy, which also calls for tax and excise exemptions for anti-cancer drugs."
#> [3] " 50.1 percent of Aventis Pharma is held by European drug major Sanofi-Aventis and, in early April 2006, it was reported that UB Holdings had sold its 10 percent holding in the firm to Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective, anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1 million, with domestic sales up 9.1 percent at $129.8 million and exports increasing 12 percent to $51.2 million. Sales were led by 83 percent annual growth for the diabetes treatment Lantus (insulin glargine), followed by the rabies vaccine Rabipur (+22 percent), the diabetes drug Amaryl (glimepiride) and epilepsy treatment Frisium (clobazam), both up 18 percent, the angiotensin-coverting enzyme inhibitor Cardace (ramipril +15 percent), Clexane (enoxaparin), an anticoagulant, growing 14 percent and Targocid (teicoplanin), an antibiotic, whose sales advanced 8 percent."
Created on 2020-09-16 by the reprex package (v0.3.0)
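The same heuristic can be seen end to end on a few made-up lines (invented here for illustration, with the length threshold lowered to 30 characters to suit the toy data):

```r
# Made-up "PDF lines": long lines are paragraph interiors,
# short lines are paragraph endings
lines <- c("This line is part of a long paragraph about cancer",
           "and it keeps going for a while here too,",
           "short ending.",
           "Another paragraph on a different topic entirely",
           "also ends short.")

# Mark paragraph endings with a line break, rejoin, then split on breaks
lines[nchar(lines) < 30] <- paste0(lines[nchar(lines) < 30], "\n")
paras <- trimws(strsplit(paste(lines, collapse = " "), "\n")[[1]])

grep("cancer", paras, value = TRUE)  # returns only the first paragraph
```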

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with ngrams, using the following code:
library(tidytext)
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
But when I run this I get the following error message:
error: unnest_tokens expects all columns of input to be atomic vectors (not lists)
My text column consists of a lot of tweets with rows that look like the following and is of class character.
president_tweets$text <- c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"
)
---------Update:----------
It looks like the sentimentr or exploratory package caused the conflict. I reloaded my packages without these two and now it works again!
Hmmmmm, I am not able to reproduce your problem.
library(tidytext)
library(dplyr)
president_tweets <- data_frame(text = c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"))
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
#> # A tibble: 205 x 1
#> bigrams
#> <chr>
#> 1 the united
#> 2 united states
#> 3 states senate
#> 4 senate just
#> 5 just passed
#> 6 passed the
#> 7 the biggest
#> 8 biggest in
#> 9 in history
#> 10 history tax
#> # ... with 195 more rows
The current CRAN version of tidytext does in fact not allow list-columns but we have changed the column handling so that the development version on GitHub now supports list-columns. Are you sure you don't have any of these in your data frame/tibble? What are the data types of all of your columns? Are any of them of type list?
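If you're unsure, a quick check for list-columns looks like this (a sketch with a made-up data frame; the `meta` column is deliberately added as a list-column for demonstration):

```r
# Toy data frame with a deliberately added list-column named "meta"
president_tweets <- data.frame(text = c("tweet one", "tweet two"),
                               stringsAsFactors = FALSE)
president_tweets$meta <- list(1:2, 3)

# Which columns are lists? (here: text FALSE, meta TRUE)
vapply(president_tweets, is.list, logical(1))

# Drop list-columns before calling unnest_tokens
clean <- president_tweets[, !vapply(president_tweets, is.list, logical(1)),
                          drop = FALSE]
```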

error reading text file into new columns of a dataframe using some text editing

I have a text file (0001.txt) which contains the data as below:
<DOC>
<DOCNO>1100101_business_story_11931012.utf8</DOCNO>
<TEXT>
The Telegraph - Calcutta (Kolkata) | Business | Local firms go global
6 Local firms go global
JAYANTA ROY CHOWDHURY
New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.
Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.
Though the first-quarter investment was 15 per cent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
According to analysts, confidence in global recovery, cheap corporate buys abroad and easier rules governing investment overseas had spurred flow of capital and could see total investment abroad top $12 billion this year and rise to $18-20 billion next fiscal.
For example, Titagarh Wagons plans to expand abroad on the back of the proposed Asian railroad project.
We plan to travel all around the world with the growth of the railroads, said Umesh Chowdhury of Titagarh Wagons.
India is full of opportunities, but we are all also looking at picks abroad, said Gautam Mitra, managing director of Indian Structurals Engineering Company.
Mitra plans to open a holding company in Switzerland to take his business in structurals to other Asian and African countries.
Indian companies created 3 lakh jobs in the US, while contributing $105 billion to the US economy between 2004 and 2007, according to commerce ministry statistics. During 2008-09, Singapore, the Netherlands, Cyprus, the UK, the US and Mauritius together accounted for 81 per cent of the total outward investment.
Bose said, And not all of it is organic growth. Much of our investment abroad reflects takeovers and acquisitions.
In the last two years, Suzlon acquired Portugals Martifers stake in German REpower Systems for $122 million. McNally Bharat Engineering has bought the coal and minerals processing business of KHD Humboldt Wedag. ONGC bought out Imperial Energy for $2 billion.
Indias foreign assets and liabilities today add up to more than 60 per cent of its gross domestic product. By the end of 2008-09, total foreign investment was $67 billion, more than double of that at the end of March 2007.
</TEXT>
</DOC>
Above, all the text data sits between the <TEXT> and </TEXT> tags.
I want to read it into an R dataframe in a way that there will be four columns and the data should be read as:
Title Author Date Text
The Telegraph - Calcutta (Kolkata) JAYANTA ROY CHOWDHURY Dec. 31 Indian companies are stepping out of their homes to try their luck on foreign shores. Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09. Though the first-quarter investment was 15 percent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
What I was trying is shown below, using dplyr and readr:
# read text file
library(dplyr)
library(readr)
dat <- read_csv("0001.txt") %>% slice(-8)
# print part of data frame
head(dat, n=2)
In the code above, I tried to skip the first few lines (which are not important) from the text file and then read the rest into a dataframe.
But I could not get what I was looking for, and I am confused about what I am doing wrong.
Could someone please help?
To be able to read data into R as a data frame or table, the data needs to have a consistent structure maintained by separators. One of the most common formats is a file with comma separated values (CSV).
The data you're working with doesn't have separators though. It's essentially a string with minimally enforced structure. Because of this, it sounds like the question is more related to regular expressions (regex) and data mining than it is to reading text files into R. So I'd recommend looking into those two things if you do this task often.
That aside, to do what you're wanting in this example, I'd recommend reading the text file into R as a single string of text first. Then you can parse the data you want using regex. Here's a basic, rough draft of how to do that:
fileName <- "Path/to/your/data/0001.txt"
string <- readChar(fileName, file.info(fileName)$size)

df <- data.frame(
  Title = sub("\\s+[|]+(.*)", "", string),
  Author = gsub("(.*)+?([A-Z]{2,}.*[A-Z]{2,})+(.*)", "\\2", string),
  Date = gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+(.*)", "\\2", string),
  Text = gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+[: ]+(.*)", "\\3", string))
Output:
str(df)
'data.frame': 1 obs. of 4 variables:
$ Title : chr "The Telegraph - Calcutta (Kolkata)"
$ Author: chr "JAYANTA ROY CHOWDHURY"
$ Date : chr "Dec. 31"
$ Text : chr "Indian companies are stepping out of their homes to"| __truncated__
The reason regex is useful here is that it allows for very specific patterns in strings. The downside appears when you're working with strings that keep changing format; that will likely mean slight adjustments to the regex each time.
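As a minimal, self-contained sketch of the same idea (simplified, illustrative patterns applied to an in-memory string standing in for the file contents):

```r
# A trimmed stand-in for the file contents
string <- "The Telegraph - Calcutta (Kolkata) | Business | Local firms go global
JAYANTA ROY CHOWDHURY
New Delhi, Dec. 31: Indian companies are stepping out."

title  <- sub("\\s*\\|.*", "", string)                 # everything before the first pipe
author <- regmatches(string, regexpr("[A-Z]{2,}(\\s[A-Z]{2,})+", string))  # run of ALL-CAPS words
date   <- regmatches(string, regexpr("[A-Z][a-z]{2}\\.\\s[0-9]{1,2}", string))  # e.g. "Dec. 31"
text   <- sub(".*[0-9]{1,2}:\\s*", "", string)         # everything after the dateline colon

data.frame(Title = title, Author = author, Date = date, Text = text)
```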
read.table(file = ..., sep = "|") will solve your issue.

Removing rows with a specific word in Corpus

I have a Corpus with multiple texts (news articles) scraped from the internet.
Some of the texts contain the description of the photo that is used in the article. I want to remove that.
I found an existing thread about this topic but it could not help me. See link: Removing rows from Corpus with multiple documents
I want to remove every row that contains the words "PHOTO FILE" (in caps). This solution was posted:
require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
for (j in seq(textVector)) {
  newCorp <- textVector
  newCorp[[j]] <- textVector[[j]][-grep("PHOTO", textVector[[j]], ignore.case = FALSE)]
}
This does not seem to work for me though. The code runs but nothing is removed.
What does work is this:
require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
newCorp <- VCorpus(VectorSource(textVector[-grep("PHOTO", textVector, ignore.case = FALSE)]))
But that removes every file that contains the word and I do not want that.
Would greatly appreciate if someone can help me on this.
Addition:
Here is an example of one of the texts:
[1] "Top News | Wed Apr 19, 2017 | 3:53pm BST\nFILE PHOTO: People walk accross a plaza in the Canary Wharf financial district, London, Britain, January 9, 2017. REUTERS/Dylan Martinez/File Photo\nLONDON Britain's current account deficit, one of the weak points of its economy, was bigger than previously thought in the years up to 2012, according to new estimates from the Office for National Statistics on Wednesday.\nThe figures showed British companies had paid out more interest to foreign holders of corporate bonds than initially estimated, resulting in a larger current account deficit.\nThe deficit, one of the biggest among advanced economies, has been in the spotlight since June's Brexit vote.\nBank of England Governor Mark Carney said in the run-up to the referendum that Britain was reliant on the \"kindness of strangers\", highlighting how the country needed tens of billions of pounds of foreign finance a year to balance its books.\nThe ONS said the current account deficit for 2012 now stood at 4.4 percent of gross domestic product, compared with 3.7 percent in its previous estimate.\nThe ONS revised up the deficit for every year dating back to 1998 by an average of 0.6 percentage points. The biggest revisions occurred from 2005 onwards.\nLast month the ONS said Britain's current account deficit tumbled to 2.4 percent of GDP in the final three months of 2016, less than half its reading of 5.3 percent in the third quarter.\nRevised data for 2012 onward is due on Sept. 29, and it is unclear if Wednesday's changes point to significant further upward revisions, as British corporate bond yields have declined markedly since 2012 and touched a new low in mid-2016. .MERUR00\nThe ONS also revised up its earlier estimates of how much Britons saved. 
The household savings ratio for 2012 rose to 9.8 percent from 8.3 percent previously, with a similar upward revision for 2011.\nThe ratio for Q4 2016, which has not yet been revised, stood at its lowest since 1963 at 3.3 percent.\nThe ONS said the changes reflected changes to the treatment of self-employed people paying themselves dividends from their own companies, as well as separating out the accounts of charities, which had previously been included with households.\nMore recent years may produce similarly large revisions to the savings ratio. Around 40 percent of the roughly 2.2 million new jobs generated since the beginning of 2008 fell into the self-employed category.\n"
So I wish to delete the sentence (row) containing FILE PHOTO.
Let's say that initially the text is contained in the file input.txt.
The raw file is as follows:
THis is a text that contains a lot
of information
and PHOTO FILE.
Great!
my_text <- readLines("input.txt")
my_text
[1] "THis is a text that contains a lot" "of information" "and PHOTO FILE." "Great!"
If you get rid of the spurious element:
my_text[-grep("PHOTO FILE", my_text)]
you end up with
[1] "THis is a text that contains a lot" "of information" "Great!"
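For the per-document case (removing matching lines inside each document rather than dropping whole documents), one approach, sketched here on plain character vectors rather than a tm corpus, is to split each document on line breaks, drop the offending lines, and rejoin. Note also that x[-grep(pattern, x)] returns an empty vector when there are no matches, which is why the loop version above silently misbehaves; grepl avoids that pitfall:

```r
# Two toy documents; only the first contains a "FILE PHOTO" caption line
docs <- c("First line\nFILE PHOTO: caption here\nLast line",
          "No photo caption in this one\nJust text")

cleaned <- vapply(docs, function(d) {
  lines <- strsplit(d, "\n")[[1]]
  # keep only lines NOT containing the pattern; safe even when nothing matches
  paste(lines[!grepl("FILE PHOTO", lines, fixed = TRUE)], collapse = "\n")
}, character(1), USE.NAMES = FALSE)

cleaned
# [1] "First line\nLast line"  "No photo caption in this one\nJust text"
```

The cleaned character vector can then be fed back into VCorpus(VectorSource(...)) if a tm corpus is needed.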

Read text file, create new rows by separator

I am trying to import a text file into a data frame with a single column and multiple rows, creating a new row for every sentence; later I want to repeat the process for every word.
Like this.
Mr. Trump has been leading most national polls in the Republican presidential contest, but he is facing a potentially changed landscape. With the Iowa caucuses less than three months away, attention has shifted to national security in the wake of the terrorist attacks in Paris last week. While the Republican electorate so far has favored political outsiders like Mr. Trump and Ben Carson, the concerns over terrorism and the arrival of refugees from Syria into the United States could change things.
should be read as
V1
[1] Mr
[2] Trump has been leading most national polls in the Republican presidential contest, but he is facing a potentially changed landscape
[3] With the Iowa caucuses less than three months away, attention has shifted to national security in the wake of the terrorist attacks in Paris last week
[4] While the Republican electorate so far has favored political outsiders like Mr
[5] Trump and Ben Carson, the concerns over terrorism and the arrival of refugees from Syria into the United States could change things
Thanks.
We can use strsplit
strsplit(txt, '[.]\\s*')[[1]]
data
txt <- "Mr. Trump has been leading most national polls in the Republican presidential contest, but he is facing a potentially changed landscape. With the Iowa caucuses less than three months away, attention has shifted to national security in the wake of the terrorist attacks in Paris last week. While the Republican electorate so far has favored political outsiders like Mr. Trump and Ben Carson, the concerns over terrorism and the arrival of refugees from Syria into the United States could change things."
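To get the one-column data frame the question asks for, wrap the split in data.frame (a sketch on a shortened toy string; note that, as in the expected output above, splitting on every period also breaks "Mr." into its own row):

```r
# Shortened toy string; splitting on "." puts one sentence fragment per row
txt <- "Mr. Trump has been leading most national polls. With the Iowa caucuses less than three months away, attention has shifted."

df <- data.frame(V1 = strsplit(txt, "[.]\\s*")[[1]])
df  # three rows: "Mr", then the two sentences
```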
