Adding extra lines between paragraphs in txt files - r

I have over 5,000 txt files of news articles that look just like the one below. I am trying to create a corpus in which each paragraph of every txt file in a folder becomes its own document. There is a command (corpus_reshape) in the quanteda R package that helps me create this corpus with paragraphs, instead of full articles, as the documents. However, the command isn't able to identify the single-"enter" paragraph breaks in the body of the article; instead it looks for larger gaps between text to determine where one paragraph ends and the next begins. In other words, from the text file below, the command only creates four documents: the first starting with "Paying for the Papal Visit", the second with "Copyright 1979 The Washington Post", the third with "NO ONE KNOWS" and the last with "End of Document". But the body of the article (between "Body" and "End of Document") actually consists of four paragraphs that corpus_reshape couldn't identify.
So, I need to somehow go back through all 5,000+ txt files and increase the number of empty lines between paragraphs in the body of the text, so that, when I create the corpus, it can accurately parse out all paragraphs. I would greatly appreciate any help. Thank you!
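For reference, here is a rough sketch of the downstream step I am aiming for once the files are fixed (the folder path is just a placeholder), using the readtext and quanteda packages:
library(readtext)
library(quanteda)
# read every txt file in the folder and build a corpus with one document per article
articles <- readtext("path/to/fixed_txt_folder/*.txt")
corp <- corpus(articles)
# reshape so that each paragraph (separated by blank lines) becomes its own document
corp_para <- corpus_reshape(corp, to = "paragraphs")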
Update: Below I have added both a link to a downloadable copy of the txt file from the example and the text as copied and pasted from the txt file.
Downloadable link:
https://drive.google.com/file/d/1SO5S5XgNlc4H-Ms8IEvK_ogZ57qSoTkw/view?usp=sharing
Pasted Text:
Paying for the Papal Visit
The Washington Post
September 16, 1979, Sunday, Final Edition
Copyright 1979 The Washington Post
Section: Outlook; Editorial; B6
Length: 403 words
Body
NO ONE KNOWS how much Pope John Paul II's week-long U.S. visit will end up costing -- or even how to calculate the cost. But already who picks up the tab has become a subject of considerable unnecessary controversy in three cities. Some religious and civil-liberties groups in Philadelphia and Boston are challenging -- or nit-picking -- proposals by governments in these cities to spend public money on facilities connected with outdoor papal masses; and in New York, local and Roman Catholic officials have been locked in negotiations over who will pay for what.
But by and large, here in Washington and in Chicago and Des Moines, these details are being handled as they should be: without making a separation-of-church-and-state issue out of the logistics. Spending by a city for the smoothest and safest handling of a major event is a legitimate secular municipal function. Even a spokesman for Americans United for Separation of Church and State has agreed that there is nothing wrong with using public money for cleanup, police overtime, police protection and traffic control.
Playing host to world figures and huge turnouts is indeed expensive, as District officials regularly remind us when they are haggling with Congress for more federal help with the bills. Still, here in the capital, whether it is the pope, angry American farmers, anti-war demonstrators or civil-rights marchers, public spending for special services is considered normal and essential. Much of the hair-splitting in other cities over the papal-visit expenses has to do with whether public money should pay for platforms from which the pope will celebrate outdoor masses. That's a long reach for a constitutional controversy, and not worth it.
Far better is the kind of cooperation that separate church and state groups here are demonstrating in their planning. For example, there will be a chainlink fence surrounding the stage and altar from which the pope will say the mass on the Mall and extending to other nearby areas. The police recommended the fence, estimated to cost about $25,000, the church has agreed to pay for the portion around the stage and altar. To help clean up, the church plans to produce hundreds of Scouts on the Monday holiday for volunteer duty. This approach to the visit is a lot more sensible -- and helpful to all taxpayers -- than a drawn-out argument and threats of legal action.
End of Document

This will add three blank lines at the end of those paragraphs. The logic used was to add the extra lines when the line length was 50 characters or more. You may want to modify that threshold; it was chosen because the longest line in the "paragraphs" you were happy with was 46 characters.
txt <- readLines("/home/david/Documents/R_code/misc files/WP_1979.9.16.txt")
spread <- ifelse(nchar(txt) < 50,
                 paste0(txt, "\n"),        # these lines are left alone
                 paste0(txt, "\n\n\n\n"))  # longer lines are padded with blank lines
cat(spread, file = "/home/david/Documents/R_code/misc files/spread.txt")
The cat function doesn't modify the lines much, but note that readLines drops the end-of-line returns on input, which is why they have to be added back here. Some of the "lines" in the input text were just empty:
nchar(txt)
[1] 0 0 26 19 41 0 0 34 31 17 4 0 0 566 519 643 672 0 0 15
Now the same operation on spread.txt yields a different "picture". I think the added "\n" padding accounts for the new zero-length lines, and cat's default separator (a single space) for the extra character on most of the other lines, but the corpus processing machinery should not mind:
nchar( readLines("/home/david/Documents/R_code/misc files/spread.txt" ))
#------------
[1] 0 1 27 20 42 1 1 35 32 18 5 1 1 567 0 0 0 520 0 0 0 644 0 0 0 673 0
[28] 0 0 1 1 16
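If you need to do the same thing to every file in the folder, a loop along these lines should work (the directory paths are hypothetical; adjust them to your setup):
in_dir  <- "/home/david/Documents/R_code/misc files/articles"        # folder with the original txt files
out_dir <- "/home/david/Documents/R_code/misc files/articles_spread" # folder for the padded copies
dir.create(out_dir, showWarnings = FALSE)
for (f in list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)) {
  txt    <- readLines(f)
  spread <- ifelse(nchar(txt) < 50,
                   paste0(txt, "\n"),         # short lines left alone
                   paste0(txt, "\n\n\n\n"))   # long lines padded with blank lines
  cat(spread, file = file.path(out_dir, basename(f)))
}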

Related

R (regex) - removing apartment, unit, and other words from end of address

I have a large dataset of addresses that I plan to geocode in ArcGIS (Google geolocating is too expensive). Examples of the addresses are below.
9999 ST PAUL ST BSMT
GARRISON BL & BOARMAN AVENUE REAR
1234 MAIN STREET 123
1234 MAIN ST UNIT1
ArcGIS doesn't recognize addresses that include units and other words at the end. So I want to remove these words so that the addresses look like the below.
9999 ST PAUL ST
GARRISON BL & BOARMAN AVENUE
1234 MAIN STREET
1234 MAIN ST
The key challenges include:
ST is used both to abbreviate streets and indicate "SAINT" in street names.
Addresses end in many different indicators such as STREET and AVENUE.
There are intersections (indicated with &) that might include indicators like ST and AVENUE twice.
Using R, I'm attempting to apply the sub() function to solve the problem but I have not had success. Below is my latest attempt.
sub("(.*)ST","\\1",df$Address,perl=T)
I know that many other questions ask something similar, but none addresses this problem directly, and I suspect it is relevant to other users.
Although I feel simply removing the last word should work for you, just to be a little safer you can use this regex to retain what you want and discard what you don't.
(.*(?:ST|AVENUE|STREET)\b).*
Here, .*(?:ST|AVENUE|STREET)\b captures your intended data by matching everything from the start in a greedy manner and only stopping at the last occurrence of ST, AVENUE, or STREET; whatever comes after that is discarded, which is what you wanted. In your current case there is only one trailing word, but it can discard more than one word, or indeed anything that occurs after those specific words. The intended data is captured in group 1, so just replace the match with \1.
So instead of this,
sub("(.*)ST","\\1",df$Address,perl=T)
try this,
sub("(.*(?:ST|AVENUE|STREET)\b).*","\\1",df$Address,perl=T)
See this demo
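To see it in action, here is a quick check on the sample addresses from the question (addr is just a stand-in vector, not your real data frame):
addr <- c("9999 ST PAUL ST BSMT",
          "GARRISON BL & BOARMAN AVENUE REAR",
          "1234 MAIN STREET 123",
          "1234 MAIN ST UNIT1")
sub("(.*(?:ST|AVENUE|STREET)\\b).*", "\\1", addr, perl = TRUE)
# returns "9999 ST PAUL ST", "GARRISON BL & BOARMAN AVENUE",
#         "1234 MAIN STREET", "1234 MAIN ST"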

How is a dictionary created when making dictionary-based text classifications? How are the values determined?

I'm trying to do sentiment analysis of about 1 million tweets I've collected from Twitter. I've found a lot of dictionaries related to text categorization. The dictionaries I found rate words between -4 and +4. For example,
fan 3
angry -2
revenge -2
bad -3
calm 2
celebration 3
What I wonder is how the numbers are assigned to words. How can I be sure that the numbers are valid? How are these dictionaries created?
The example you have provided appears to be (subjectively) rating the words based on their "positive/negative" meaning. So the following tweet "That was a bad celebration; I am an angry fan." would score a +1 whereas "I am a fan of that celebration!" would score a +6.
The final sum for any tweet can then be used in a strategy to do something. You could send bags of candy to anyone that tweets -10 or below in hopes of cheering them up. You could persist a tweet with a score of +50 or higher so that it shows to more people.
It's all an analysis game and there are no "right answers" when it comes to assigning subjective numbers to words until you provide a specific intent of what you wish to do with the resultant data.
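As a toy illustration (not a recommendation of any particular dictionary), scoring text against such a word list in R can be as simple as summing the values of the words that appear:
dict <- c(fan = 3, angry = -2, revenge = -2, bad = -3, calm = 2, celebration = 3)
score_tweet <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z']+"))  # crude tokenizer
  sum(dict[words], na.rm = TRUE)                     # words not in the dictionary count as 0
}
score_tweet("That was a bad celebration; I am an angry fan.")  # -3 + 3 - 2 + 3 = 1
score_tweet("I am a fan of that celebration!")                 # 3 + 3 = 6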

Trying to use regex (or R) to turn press releases into a dataset

I'm working on a project to turn press releases from Operation Inherent Resolve, which detail airstrikes against ISIS in Syria and Iraq, into a usable dataset. So far we've been hand-coding everything, but it just takes insanely long.
Every press release is structured like this:
November 23, 2016
Military Strikes Continue Against ISIL Terrorists in Syria and Iraq
U.S. Central Command
SOUTHWEST ASIA, November 23, 2016 - On Nov. 22, Coalition military forces conducted 17 strikes against ISIL terrorists in Syria and Iraq. In Syria, Coalition military forces conducted 11 strikes using attack, bomber, fighter, and remotely piloted aircraft against ISIL targets. Additionally in Iraq, Coalition military forces conducted six strikes coordinated with and in support of the Government of Iraq using attack, bomber, fighter, and remotely piloted aircraft against ISIL targets.
The following is a summary of the strikes conducted since the last press release:
Syria
Near Abu Kamal, one strike destroyed an oil rig.
Near Ar Raqqah, four strikes engaged an ISIL tactical unit, destroyed two vehicles, an oil tanker truck, an oil pump, and a VBIED, and damaged a road.
Iraq
Near Rawah, one strike engaged an ISIL tactical unit and destroyed a vehicle, a mortar system, and a weapons cache.
Near Mosul, four strikes engaged three ISIL tactical units, destroyed six ISIL-held buildings, a mortar system, a vehicle, a weapons cache, a supply cache, and an artillery system, and damaged five supply routes, and a bridge.
more text I don't need, about 5 exceptions where they amend previous reports I'll just fix by hand, and then the next report
What I'm trying to do is pull out just the date of the strike and how many strikes per city for both Iraq and Syria and reformat that information into a proper dataset organized as one row per date, like this:
             Rawah   Mosul   And So On
1/1/2014       1       4
1/2/2014       2       5
The bad: There's a different number of cities listed for each country in each press release, and a different number of strikes listed each time.
The good: Every one of these press releases is worded exactly the same.
The string "SOUTHWEST ASIA," is always in front of the date
A 4-space indent followed by the word "Near" is always in front of the city
The city and a comma are always in front of the number of strikes
The number of strikes is always in front of the word "airstrike" or "airstrikes"
The question is whether it's possible to make a regex to either copy/cut everything matching those criteria in order or just delete everything else. I think to grab the arbitrary number of cities (with unknown names) and unknown numbers of strikes it would have to be based on copying/saving everything next to the unchanging markers.
I've tried using Notepad++'s find/replace function with something like *(foobar)*, but I can only match one thing at a time, and when I try to replace everything but the matched string, it just deletes the whole file instead of protecting every instance of the matching string.
I suggest searching by using Near (.*?),. You can back-reference with \1.
I did a quick scan of the documents, and it seems the more recent ones change a bit of the format, adding "Strikes at [country]" rather than your example of just "[country]". But each one lists the city in a Near [city], format.
This would grab you the cities, of course, but you would have to do some pretty hacky things to get the number of strikes, since there doesn't seem to be a standard for that.
If you are only dealing with the records that have your formatting, try Near (.*?), (.*? ) and you should get the spelled out number of strikes per city by referencing \2, and the city by referencing \1.
So, if you were to find and replace in Notepad++, you would use something like .*Near (.*?), (.*? ).* as your find, and something like \1 -- \2 as your replace. From there you would need to draft up a small script to translate the spelled numbers to digits, and output those where they need to go. The pattern \w* \d{1,2}, \d{4} will match a date in the long format, something else you could pipe into a python script or something to construct your table of data. Sorry I couldn't help more there!
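If it helps to do this in R rather than Notepad++, here is a rough base-R sketch along the same lines (release_text is just a shortened stand-in for one press release, and the patterns assume the exact wording shown in the question):
release_text <- "SOUTHWEST ASIA, November 23, 2016 - On Nov. 22, Coalition military forces conducted 17 strikes ... Near Abu Kamal, one strike destroyed an oil rig. Near Ar Raqqah, four strikes engaged an ISIL tactical unit ..."
# the date always follows "SOUTHWEST ASIA,"
release_date <- regmatches(release_text,
                           regexpr("(?<=SOUTHWEST ASIA, )\\w+ \\d{1,2}, \\d{4}",
                                   release_text, perl = TRUE))
# every city appears as "Near <city>, <spelled-out number> strike(s)"
hits   <- regmatches(release_text,
                     gregexpr("Near (.*?), (\\w+) strike", release_text, perl = TRUE))[[1]]
cities <- sub("Near (.*?), (\\w+) strike", "\\1", hits, perl = TRUE)
counts <- sub("Near (.*?), (\\w+) strike", "\\2", hits, perl = TRUE)  # still spelled out ("one", "four", ...)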

A weird word appears with topic analysis in r

I have a paragraph:
disgusting do at was horrific we have stayed please to at traveler photos ironic i did post those witnessed each every thing in pictures gave us fist free then moved us to rooms were any better we slept with clothes on entire there never once took off shoes to walk on carpet shower etc holes in wall stains on bedding curtains couch chair no working electric in lamps cords nothing could be plugged in when we called down to fix it so we no lighting except bathroom light tv toilets constantly plugged up shower drain.
It looks a little grammatically weird because I have already cleaned the paragraph. I use the following code to extract word frequencies.
library(tm)
# `example` holds the paragraph above as a single character string
# create corpus
docs <- Corpus(VectorSource(example))
# stem document
docs <- tm_map(docs, stemDocument)
# create document-term matrix
dtm <- DocumentTermMatrix(docs)
# convert row names
rownames(dtm) <- "example"
# collapse matrix by summing over columns
freq <- colSums(as.matrix(dtm))
# length should be total number of terms
length(freq)
# create sort order (descending)
ord <- order(freq, decreasing = TRUE)
# list all terms in decreasing order of freq and write to disk
freq[ord]
Then freq[ord] lists the term frequencies, and I am wondering why there is a word ani in there; apparently, ani does not appear in my paragraph. Thanks.
Just figured out the problem: the following line transforms any into ani. Does anyone know how to avoid that?
docs <- tm_map(docs, stemDocument)
It's the word "any" after having being stemmed. The (in this case faulty) logic of the underlying function, wordStem, which uses Dr. Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball, changed the y to an i.

Splitting data from a data.frame of length 1 by using a delimiter in R

I just imported a text file into R, and now I have a data frame of length 1.
I had 152 reviews separated by a * character in that text file. Each review is like a paragraph long.
How can I get a data frame of length 152, with each review in its own row? I used this line to import the file into R:
myReviews <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", header=FALSE,sep="*")
(myReviews has a length of 1 here.) I need it to be 152, the number of reviews inside the text file.
How can I split the data from the data frame by the "*" delimiter, or just import the text file correctly by putting it into a data frame of length 152 instead of 1?
EDIT: Example of the data
I have 152 of this kind of data, all separated by "*":
I boarded Air Canada flight AC 7354 to Washington DCA the next morning, April 13th. After arriving at DCA, I
discovered to my dismay that the suitcase I had checked in at Tel Aviv was not on my flight. I spent 5 hours at
Pearson trying to sort out the mess resulting from the bumped connection.
*First time in 40 years to fly with Air Canada. We loved the slightly wider seats. Disappointed in no movies to
watch as not all travelers have i-phones etc. Also, on our trip down my husband asked twice for coffee, black. He
did not get any and was not offered anything else. On our trip home, I asked for a black coffee. She said she was
making more as several travelers had asked. I did not my coffee or anything else to drink. It was a long trip with
nothing to drink especially when they had asked. When trying to get their attention, they seemed to avoid my
tries. We found the two female stewardess's not very friendly. It may seem a small thing but very frustrating for
a traveler.
*Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD,
very light load and had 3 seats to myself. A very enthusiastic and friendly crew as usual on this transpacific
route that I take several times a year. Arrived 20 min ahead of schedule. The expected high level of service from
our flag carrier, Air Canada. Altitude Elite member.
They are airline reviews, each review separated by a *. I would like to take all of the reviews and put them in one data.frame in R, but each review should get its own "slot" or "position". The "*" is intended to be the separator.
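To make the goal concrete, something along these lines is what I am picturing (read the whole file as text, split on the literal *, one row per review), though I am not sure this is the right approach:
raw  <- readLines("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt")
one  <- paste(raw, collapse = " ")         # undo the line wrapping
revs <- trimws(strsplit(one, "\\*")[[1]])  # split on the literal * delimiter
revs <- revs[nzchar(revs)]                 # drop any empty pieces
myReviews <- data.frame(review = revs, stringsAsFactors = FALSE)
nrow(myReviews)                            # should be 152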
