Extracting the first N matches from a text string in R?

I am using stringr in R, and I have a string of text that lists titles of news articles. I want to extract these titles, but only the first N-number of titles that appear. In my example string of text, I have three article titles, but I only want to extract the first two.
How can I tell str_extract to only collect the first 2 titles? Thank you.
Here is my current code with the example text.
library(stringr)
Here is the example text.
texting <- ("Time: Friday, September 14, 2018 4:34:00 PM EDT\r\nJob Number: 73591483\r\nDocuments (100)\r\n 1. U.S. Stocks Rebound Slightly After Tech-Driven Slump\r\n Client/Matter: -None-\r\n Search Terms: trade war or US-China trade or china tariff and not dealbook\r\n Search Type: Terms and Connectors\r\n Narrowed by:\r\n Content Type Narrowed by\r\n News Sources: The New York Times; Content Type: News;\r\n Timeline: Jan 01, 2018 to Dec 31, 2018\r\n 2. Shifting Strategy on Tariffs\r\n Client/Matter: -None-\r\n Search Terms: trade war or US-China trade or china tariff and not dealbook\r\n 100. Example")
titles.1 <- str_extract_all(texting, "\\d+\\.\\s.+")
titles.1
The current code brings back all three matches in the string:
[[1]]
[1] "1. U.S. Stocks Rebound Slightly After Tech-Driven Slump"
[2] "2. Shifting Strategy on Tariffs"
[3] "100. Example"
I only want it to collect the first two matches.

You can use the option simplify = TRUE to get a character matrix (rather than a list) as the result. Then just pick the first N elements from it:
titles.1 <- str_extract_all(texting, "\\d+\\.\\s.+", simplify = TRUE)[1:2]
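If you prefer to keep the default list output, an equivalent sketch is to take the first element of the list and keep the first two matches with head():
library(stringr)
# str_extract_all() returns a list with one character vector per input string;
# [[1]] grabs the matches for our single string, head(..., 2) keeps the first two
titles.1 <- head(str_extract_all(texting, "\\d+\\.\\s.+")[[1]], 2)
titles.1
#> [1] "1. U.S. Stocks Rebound Slightly After Tech-Driven Slump"
#> [2] "2. Shifting Strategy on Tariffs"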

Related

How to convert a PDF listing the world's ministers and cabinet members by country to a .csv in R

The CIA publishes a list of world leaders and cabinet ministers for all countries multiple times a year. This information is in PDF form.
I want to convert this PDF to CSV using R and then separate and tidy the data.
I am getting the PDF from "https://www.cia.gov/library/publications/resources/world-leaders-1/"
under the link 'PDF Version for Prior Years' located at the center right hand side of the page.
Each PDF has some introductory pages and then lists the Leaders and Ministers for each country, with each 'Title' and 'Name' separated by a '..........' of varying length.
I have tried to use the pdftools package to convert from PDF, but I am not quite sure how to deal with the format of the data for sorting and tidying.
Here are the first steps I have taken with a downloaded PDF:
library(pdftools)
text <- pdf_text("Data/April2006ChiefsDirectory.pdf")
test <- as.data.frame(text)
Starting with a single PDF, I want to list each Minister in a separate row, with individual columns for year, country, title and name.
With the steps I have taken so far, converting the PDF into .csv without any additional tidying, the data is in a single column and each row has a string of text containing the title and name for multiple countries.
I am a novice at data tidying; any help would be much appreciated.
You can do it with tabulizer, but it is going to require some work to clean it up if you want to import all 240 pages of the document.
Here I import page 4, which is the first page with information about the leaders:
library(tabulizer)
mw_table <- extract_tables(
  "https://www.cia.gov/library/publications/resources/world-leaders-1/pdfs/2019/January2019ChiefsDirectory.pdf",
  output = "data.frame",
  pages = 4,
  area = list(c(35.68168, 40.88842, 740.97853, 497.74737)),
  guess = FALSE
)
head(mw_table[[1]])
#> X Afghanistan
#> 1 Last Updated: 20 Dec 2017
#> 2 Pres. Ashraf GHANI
#> 3 CEO Abdullah ABDULLAH, Dr.
#> 4 First Vice Pres. Abdul Rashid DOSTAM
#> 5 Second Vice Pres. Sarwar DANESH
#> 6 First Deputy CEO Khyal Mohammad KHAN
You can pass a vector of the pages you want to import as the pages argument. Note that all the country names will be buried among the people's names in the second column. You can probably work out a way to identify the country rows by looking for the empty "" occurrences in the first column; a rough sketch of that idea follows.
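A hedged sketch of that clean-up step (untested, and assuming every country header row really does leave the first column empty):
df <- mw_table[[1]]
# assumption: a row whose first column is "" carries a country name in column 2
is_country <- df[[1]] == ""
df$country <- NA_character_
df$country[is_country] <- df[[2]][is_country]
# carry each country name forward onto the minister rows that follow it
# (note: on page 4 the first country, Afghanistan, sits in the column header,
#  so its own rows keep NA in this sketch)
for (i in seq_len(nrow(df))[-1]) {
  if (is.na(df$country[i])) df$country[i] <- df$country[i - 1]
}
# drop the country-header rows, keeping only the title/name rows
leaders <- df[!is_country, ]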

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets containing mentions that start with '#'. I need to extract all of them and save the mentions from each tweet as a single string like "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse the list in each row into a single string separated by spaces, as described above?
Thanks in advance.
I believe it would be best to use an AsIs list column (created with I()) in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string per tweet, you may try sapply as a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
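As a quick check, here is a hedged demo on the lis sample vector defined in the answer above:
mentions <- str_extract_all(lis, "#\\w+")
collapsed <- sapply(mentions, function(x) paste(x, collapse = ", "))
collapsed[1:2]
#> [1] "#DineshKarthik, #KKRiders"
#> [2] "#IPL, #prabhakaran285, #ChennaiIPL, #cutenessoverload, #lineofduty"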
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that tweets can also contain email addresses, as well as URLs with #'s in them (and not just the silly URLs with a username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.

keeping the row number in a data frame column

I have a bunch of .txt files (articles) in a folder. I use a for loop to read the text from all of them into R:
input_loc <- "C:/Users/User/Desktop/Folder"
files <- dir(input_loc, full.names = TRUE)
text <- c()
for (f in files) {
  text <- c(text, paste(readLines(f), collapse = "\n"))
}
From here, I tokenize into paragraphs and get each paragraph of each article:
library(tokenizers)
paragraphs <- tokenize_paragraphs(text)
sapply(paragraphs, length)
paragraphs
Then I unlist and transform it into a data frame:
par_unlisted<-unlist(paragraphs)
par_unlisted
par_unlisted_df<-as.data.frame(par_unlisted)
BUT doing that, I no longer have the per-article paragraph numbering (e.g. if the first article has 6 paragraphs, the first paragraph of the second article would still have a [1] in front before unlisting, while after unlisting it will have a [7]).
What I would like, once I have the data frame, is a column with the paragraph number, plus another column named "article" with the number of the article.
Thank you in advance.
EDIT
This is roughly what I get once I reach paragraphs:
> paragraphs
[[1]]
[1] "The Miami Dolphins have decided to use their non-exclusive franchise
tag on wide receiver Jarvis Landry."
[2] "The Dolphins tweeted the announcement Tuesday, the first day teams
could use their franchise or transition tags. The salary for wide receivers
getting the franchise tag this offseason is expected to be around $16.2
million, which will be quite the raise for Landry, who made $894,000 last
season."
[[2]]
[1] "Despite months of little-to-no movement on contract negotiations,
Jarvis Landry has often stated his desire to stay in Miami."
[2] "The Dolphins used their lone tool to wipe away negotation-driven stress
-- at least in the immediate future -- and ensure Landry won't be lured away
from Miami, placing the franchise tag on the receiver on Tuesday, the team
announced."
I want to keep the paragraph number ([n]) as a column in the data frame, because when I unlist them they no longer stay separated per article and then per paragraph; I get them in sequence instead. Basically, in the example I've just posted, I no longer have
[[1]]
[1] ...
[2] ...
[[2]]
[1] ...
[2] ...
but I get
[1] ...
[2] ...
[3] ...
[4] ...
Consider iterating through the paragraphs list to build a list of data frames with the needed article and paragraph numbers, then row-bind all the data frame elements at the end.
Input Data
paragraphs <- list(
c("The Miami Dolphins have decided to use their non-exclusive franchise tag on wide receiver Jarvis Landry.",
"The Dolphins tweeted the announcement Tuesday, the first day teams could use their franchise or transition tags. The salary for wide receivers
getting the franchise tag this offseason is expected to be around $16.2 million, which will be quite the raise for Landry, who made $894,000 last
season."),
c("Despite months of little-to-no movement on contract negotiations, Jarvis Landry has often stated his desire to stay in Miami.",
"The Dolphins used their lone tool to wipe away negotation-driven stress -- at least in the immediate future -- and ensure Landry won't be lured away
from Miami, placing the franchise tag on the receiver on Tuesday, the team announced."))
Dataframe Build
df_list <- lapply(seq_along(paragraphs), function(i)
  setNames(data.frame(i, 1:length(paragraphs[[i]]), paragraphs[[i]]),
           c("article_num", "paragraph_num", "paragraph"))
)
final_df <- do.call(rbind, df_list)
Output Result
final_df
# article_num paragraph_num paragraph
# 1 1 1 The Miami Dolphins have decided to use their non-e...
# 2 1 2 The Dolphins tweeted the announcement Tuesday, the...
# 3 2 1 Despite months of little-to-no movement on contrac...
# 4 2 2 The Dolphins used their lone tool to wipe away neg...
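For comparison, an equivalent sketch with dplyr/tibble (assuming those packages are installed; bind_rows() adds the article number via its .id argument):
library(dplyr)
library(tibble)
# one tibble per article, each with its own paragraph numbering
final_df <- bind_rows(
  lapply(paragraphs, function(p)
    tibble(paragraph_num = seq_along(p), paragraph = p)),
  .id = "article_num"   # "1", "2", ... (stored as character)
)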

Removing rows with a specific word in Corpus

I have a Corpus with multiple texts (news articles) scraped from the internet.
Some of the texts contain the description of the photo that is used in the article. I want to remove that.
I found an existing question about this topic but it could not help me. See link: Removing rows from Corpus with multiple documents
I want to remove every row that contains the words "PHOTO FILE" (in caps). This solution was posted:
require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
for (j in seq(textVector)) {
  newCorp <- textVector
  newCorp[[j]] <- textVector[[j]][-grep("PHOTO", textVector[[j]], ignore.case = FALSE)]
}
This does not seem to work for me though. The code runs but nothing is removed.
What does work is this:
require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
newCorp <- VCorpus(VectorSource(textVector[-grep("PHOTO", textVector, ignore.case = FALSE)]))
But that removes every document that contains the word, and I do not want that.
Would greatly appreciate if someone can help me on this.
Addition:
Here is an example of one of the texts:
[1] "Top News | Wed Apr 19, 2017 | 3:53pm BST\nFILE PHOTO: People walk accross a plaza in the Canary Wharf financial district, London, Britain, January 9, 2017. REUTERS/Dylan Martinez/File Photo\nLONDON Britain's current account deficit, one of the weak points of its economy, was bigger than previously thought in the years up to 2012, according to new estimates from the Office for National Statistics on Wednesday.\nThe figures showed British companies had paid out more interest to foreign holders of corporate bonds than initially estimated, resulting in a larger current account deficit.\nThe deficit, one of the biggest among advanced economies, has been in the spotlight since June's Brexit vote.\nBank of England Governor Mark Carney said in the run-up to the referendum that Britain was reliant on the \"kindness of strangers\", highlighting how the country needed tens of billions of pounds of foreign finance a year to balance its books.\nThe ONS said the current account deficit for 2012 now stood at 4.4 percent of gross domestic product, compared with 3.7 percent in its previous estimate.\nThe ONS revised up the deficit for every year dating back to 1998 by an average of 0.6 percentage points. The biggest revisions occurred from 2005 onwards.\nLast month the ONS said Britain's current account deficit tumbled to 2.4 percent of GDP in the final three months of 2016, less than half its reading of 5.3 percent in the third quarter.\nRevised data for 2012 onward is due on Sept. 29, and it is unclear if Wednesday's changes point to significant further upward revisions, as British corporate bond yields have declined markedly since 2012 and touched a new low in mid-2016. .MERUR00\nThe ONS also revised up its earlier estimates of how much Britons saved. The household savings ratio for 2012 rose to 9.8 percent from 8.3 percent previously, with a similar upward revision for 2011.\nThe ratio for Q4 2016, which has not yet been revised, stood at its lowest since 1963 at 3.3 percent.\nThe ONS said the changes reflected changes to the treatment of self-employed people paying themselves dividends from their own companies, as well as separating out the accounts of charities, which had previously been included with households.\nMore recent years may produce similarly large revisions to the savings ratio. Around 40 percent of the roughly 2.2 million new jobs generated since the beginning of 2008 fell into the self-employed category.\n"
So I wish to delete the sentence (row) containing FILE PHOTO.
Let's say that initially the text is contained in the file input.txt.
The raw file is as follows:
THis is a text that contains a lot
of information
and PHOTO FILE.
Great!
my_text <- readLines("input.txt")
[1] "THis is a text that contains a lot" "of information" "and PHOTO FILE." "Great!"
If you get rid of the spurious element:
my_text[-grep("PHOTO FILE", my_text, value = FALSE, perl = TRUE)]
you end up with:
[1] "THis is a text that contains a lot" "of information" "Great!"

Text summarization in R language

I have a long text file. Using R, I want to summarize the text in roughly 10 to 20 lines, or in a few short sentences.
How can I summarize text in about 10 lines with R?
You may try this (from the LSAfun package):
library(LSAfun)
genericSummary(D, k = 1)
where 'D' is your text document and 'k' is the number of sentences to be used in the summary. (Further options are described in the package documentation.)
For more information:
http://search.r-project.org/library/LSAfun/html/genericSummary.html
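A minimal hedged sketch of how that might look for the 10-line summary asked about (the file name article.txt is only an example):
library(LSAfun)
# read the whole file into one string (file name is hypothetical)
D <- paste(readLines("article.txt"), collapse = " ")
# extract the 10 most representative sentences as the summary
genericSummary(D, k = 10)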
There's a package called lexRankr that summarizes text in the same way that Reddit's /u/autotldr bot summarizes articles. This article has a full walkthrough on how to use it, but here is a quick example you can test yourself in R:
# load needed packages
library(xml2)
library(rvest)
library(lexRankr)

# url to scrape
monsanto_url = "https://www.theguardian.com/environment/2017/sep/28/monsanto-banned-from-european-parliament"

# read page html
page = xml2::read_html(monsanto_url)

# extract text from page html using selector
page_text = rvest::html_text(rvest::html_nodes(page, ".js-article__body p"))

# perform lexrank for top 3 sentences
top_3 = lexRankr::lexRank(page_text,
                          # only 1 article; repeat same docid for all of input vector
                          docId = rep(1, length(page_text)),
                          # return 3 sentences to mimic /u/autotldr's output
                          n = 3,
                          continuous = TRUE)
#reorder the top 3 sentences to be in order of appearance in article
order_of_appearance = order(as.integer(gsub("_","",top_3$sentenceId)))
#extract sentences in order of appearance
ordered_top_3 = top_3[order_of_appearance, "sentence"]
> ordered_top_3
[1] "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a parliamentary hearing into allegations of regulatory interference."
[2] "Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg."
[3] "A Monsanto letter to MEPs seen by the Guardian said that the European parliament was not “an appropriate forum” for discussion on the issues involved."
