counting sentences containing a specific key word in R - r

UPDATE
Here is what I have done so far.
library(tm)
library(NLP)
library(SnowballC)
# set directory
setwd("C:\\Users\\...\\Data pretest all TXT")
# create corpus with tm package
pretest <- Corpus(DirSource("\\Users\\...\\Data pretest all TXT"), readerControl = list(language = "en"))
pretest is a large SimpleCorpus with 36 elements.
My folder contains 36 txt files.
# check what went in
summary(pretest)
# create TDM
pretest.tdm <- TermDocumentMatrix(pretest, control = list(stopwords = TRUE,
tolower = TRUE, stemming = TRUE))
# convert corpus to data frame
dataframePT <- data.frame(text = unlist(sapply(pretest, `[`, "content")),
stringsAsFactors = FALSE)
dataframePT has 36 observations. So I think until here it is okay.
# load stringr library
library(stringr)
# define sentences
v = strsplit(dataframePT[,1], "(?<=[A-Za-z ,]{10})\\.", perl = TRUE)
lapply(v, function(x) (stringr::str_count(x, "gain")))
My output looks like this
...
[[35]]
[1] NA
[[36]]
[1] NA
So there are actually 36 files, so that's good. But I don't know why it returns NA.
Thank you in advance for any suggestions.

library(NLP)
library(tm)
library(SnowballC)
Load data:
data("crude")
crude.tdm <- TermDocumentMatrix(crude, control = list(stopwords = TRUE, tolower = TRUE, stemming= TRUE))
First convert corpus to data frame
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = F)
one can also inspect the content: crude[[2]]$content
now we need to define a sentence - here I define it with an entity that has at least 10 A-Z or a-z characters mixed with spaces and "," and ending with ".". And I split the documents by that rule using look-behind the .
z = strsplit(dataframe[,1], "(?<=[A-Za-z ,]{10})\\.", perl = T)
but this is not needed for crude corpus since every sentence ends with .\n so one can do:
z = strsplit(dataframe[,1], "\\.n\", perl = T)
I will stick with my previous definition of sentence since one wants it functioning not only for crude corpus. The definition is not perfect so I am keen on hearing your thoughts?
Lets check the output
z[[2]]
[1] "OPEC may be forced to meet before a\nscheduled June session to readdress its production cutting\nagreement if the organization wants to halt the current slide\nin oil prices, oil industry analysts said"
[2] "\n \"The movement to higher oil prices was never to be as easy\nas OPEC thought"
[3] " They may need an emergency meeting to sort out\nthe problems,\" said Daniel Yergin, director of Cambridge Energy\nResearch Associates, CERA"
[4] "\n Analysts and oil industry sources said the problem OPEC\nfaces is excess oil supply in world oil markets"
[5] "\n \"OPEC's problem is not a price problem but a production\nissue and must be addressed in that way,\" said Paul Mlotok, oil\nanalyst with Salomon Brothers Inc"
[6] "\n He said the market's earlier optimism about OPEC and its\nability to keep production under control have given way to a\npessimistic outlook that the organization must address soon if\nit wishes to regain the initiative in oil prices"
[7] "\n But some other analysts were uncertain that even an\nemergency meeting would address the problem of OPEC production\nabove the 15.8 mln bpd quota set last December"
[8] "\n \"OPEC has to learn that in a buyers market you cannot have\ndeemed quotas, fixed prices and set differentials,\" said the\nregional manager for one of the major oil companies who spoke\non condition that he not be named"
[9] " \"The market is now trying to\nteach them that lesson again,\" he added.\n David T"
[10] " Mizrahi, editor of Mideast reports, expects OPEC\nto meet before June, although not immediately"
[11] " However, he is\nnot optimistic that OPEC can address its principal problems"
[12] "\n \"They will not meet now as they try to take advantage of the\nwinter demand to sell their oil, but in late March and April\nwhen demand slackens,\" Mizrahi said"
[13] "\n But Mizrahi said that OPEC is unlikely to do anything more\nthan reiterate its agreement to keep output at 15.8 mln bpd.\"\n Analysts said that the next two months will be critical for\nOPEC's ability to hold together prices and output"
[14] "\n \"OPEC must hold to its pact for the next six to eight weeks\nsince buyers will come back into the market then,\" said Dillard\nSpriggs of Petroleum Analysis Ltd in New York"
[15] "\n But Bijan Moussavar-Rahmani of Harvard University's Energy\nand Environment Policy Center said that the demand for OPEC oil\nhas been rising through the first quarter and this may have\nprompted excesses in its production"
[16] "\n \"Demand for their (OPEC) oil is clearly above 15.8 mln bpd\nand is probably closer to 17 mln bpd or higher now so what we\nare seeing characterized as cheating is OPEC meeting this\ndemand through current production,\" he told Reuters in a\ntelephone interview"
[17] "\n Reuter"
and the original:
cat(crude[[2]]$content)
OPEC may be forced to meet before a
scheduled June session to readdress its production cutting
agreement if the organization wants to halt the current slide
in oil prices, oil industry analysts said.
"The movement to higher oil prices was never to be as easy
as OPEC thought. They may need an emergency meeting to sort out
the problems," said Daniel Yergin, director of Cambridge Energy
Research Associates, CERA.
Analysts and oil industry sources said the problem OPEC
faces is excess oil supply in world oil markets.
"OPEC's problem is not a price problem but a production
issue and must be addressed in that way," said Paul Mlotok, oil
analyst with Salomon Brothers Inc.
He said the market's earlier optimism about OPEC and its
ability to keep production under control have given way to a
pessimistic outlook that the organization must address soon if
it wishes to regain the initiative in oil prices.
But some other analysts were uncertain that even an
emergency meeting would address the problem of OPEC production
above the 15.8 mln bpd quota set last December.
"OPEC has to learn that in a buyers market you cannot have
deemed quotas, fixed prices and set differentials," said the
regional manager for one of the major oil companies who spoke
on condition that he not be named. "The market is now trying to
teach them that lesson again," he added.
David T. Mizrahi, editor of Mideast reports, expects OPEC
to meet before June, although not immediately. However, he is
not optimistic that OPEC can address its principal problems.
"They will not meet now as they try to take advantage of the
winter demand to sell their oil, but in late March and April
when demand slackens," Mizrahi said.
But Mizrahi said that OPEC is unlikely to do anything more
than reiterate its agreement to keep output at 15.8 mln bpd."
Analysts said that the next two months will be critical for
OPEC's ability to hold together prices and output.
"OPEC must hold to its pact for the next six to eight weeks
since buyers will come back into the market then," said Dillard
Spriggs of Petroleum Analysis Ltd in New York.
But Bijan Moussavar-Rahmani of Harvard University's Energy
and Environment Policy Center said that the demand for OPEC oil
has been rising through the first quarter and this may have
prompted excesses in its production.
"Demand for their (OPEC) oil is clearly above 15.8 mln bpd
and is probably closer to 17 mln bpd or higher now so what we
are seeing characterized as cheating is OPEC meeting this
demand through current production," he told Reuters in a
telephone interview.
Reuter
You can clean it a bit if you wish, removing the trailing \n but it is not needed for your request.
Now you can do all sorts of things, like:
Which sentences contain the word "gain"
lapply(z, function(x) (grepl("gain", x)))
or the frequency of word "gain" per sentence:
lapply(z, function(x) (stringr::str_count(x, "gain")))

Hi I recommend using filter function from dplyr package and grepl function to search a pattern inside
pattern <- "word1|word2"
df<- df %>%
filter(grepl(pattern,column_name)
The df would be limited to only those matching that condition. So then just use nrow function to count how many rows last :)
Example:
a1<-1:10
a2<-11:20
(data<-data.frame(a1,a2,stringsAsFactors = F))
a1 a2
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
(data<-data %>% filter(grepl("5|7",data$a2)))
a1 a2
1 5 15
2 7 17
(nrow(data))
[1] 2

Related

Extract the paragraphs from a PDF that contain a keyword using R

I need to extract from a pdf file the paragraphs that contain a keyword. Tried various codes but none got anything.
I have seen this code from a user #Tyler Rinker (Extract before and after lines based on keyword in Pdf using R programming) but it extracts the line where the keyword is, the before and after.
library(textreadr)
library(tidyverse)
loc <- function(var, regex, n = 1, ignore.case = TRUE){
locs <- grep(regex, var, ignore.case = ignore.case)
out <- sort(unique(c(locs - 1, locs, locs + 1)))
out <- out[out > 0]
out[out <= length(var)]
}
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
read_pdf() %>%
slice(loc(text, 'cancer'))
However, I need to get the paragraphs and store each one in a row in my database. Could you help me?
The text lines in paragraphs will all be quite long unless it is the final line of the paragraph. We can count the characters in each line and do a histogram to show this:
library(textreadr)
doc <- read_pdf('https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf')
hist(nchar(doc$text), 20)
So anything less than about 75 characters is either not in a paragraph or at the end of a paragraph. We can therefore stick a line break on the short ones, paste all the lines together, then split on linebreaks:
doc$text[nchar(doc$text) < 75] <- paste0(doc$text[nchar(doc$text) < 75], "\n")
txt <- paste(doc$text, collapse = " ")
txt <- strsplit(txt, "\n")[[1]]
So now we can just do our regex and find the paragraphs with the key word:
grep("cancer", txt, value = TRUE)
#> [1] " Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but stresses that, in order for them to work, they should be voluntary, and the government should exempt all life-saving drugs from import duties and other taxes such as excise duty and VAT. He is, however, critical about a proposal for mandatory price negotiation of newly patented drugs. He feels this will erode India's credibility in implementing the Patent Act in © 2006 KPMG International. KPMG International is a Swiss cooperative that serves as a coordinating entity for a network of independent firms operating under the KPMG name. KPMG International provides no services to clients. Each member firm of KPMG International is a legally distinct and separate entity and each describes itself as such. All rights reserved. Collaboration for Growth 24"
#> [2] " a fair and transparent manner. To deal with diabetes, medicines are not the only answer; awareness about the need for lifestyle changes needs to be increased, he adds. While industry leaders have long called for the development of PPPs for the provision of health care in India, particularly in rural areas, such initiatives are currently totally unexplored. However, the government's 2006 draft National Pharmaceuticals Policy proposes the introduction of PPPs with drug manufacturers and hospitals as a way of vastly increasing the availability of medicines to treat life-threatening diseases. It notes, for example, that while an average estimate of the value of drugs to treat the country's cancer patients is $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the near non-accessibility of the medicines to a vast majority of the affected population, mainly because of the high cost of these medicines,” says the Policy, which also calls for tax and excise exemptions for anti-cancer drugs."
#> [3] " 50.1 percent of Aventis Pharma is held by European drug major Sanofi-Aventis and, in early April 2006, it was reported that UB Holdings had sold its 10 percent holding in the firm to Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective, anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1 million, with domestic sales up 9.1 percent at $129.8 million and exports increasing 12 percent to $51.2 million. Sales were led by 83 percent annual growth for the diabetes treatment Lantus (insulin glargine), followed by the rabies vaccine Rabipur (+22 percent), the diabetes drug Amaryl (glimepiride) and epilepsy treatment Frisium (clobazam), both up 18 percent, the angiotensin-coverting enzyme inhibitor Cardace (ramipril +15 percent), Clexane (enoxaparin), an anticoagulant, growing 14 percent and Targocid (teicoplanin), an antibiotic, whose sales advanced 8 percent."
Created on 2020-09-16 by the reprex package (v0.3.0)

filtering text and storing the filtered sentence/paragraph into a new column

I am trying to extract some sentences from text data. I want to extract the sentences which correspond to medical device company released. I can run the following code:
df_text <- unlist(strsplit(df$TD, "\\."))
df_text
df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text
Which gives me:
[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"
So I extracted the sentences which contain the sentence medical device company released. However, I want to do this but store the results in a new column from which grp the sentence came from.
Expected output:
grp TD newCol
3613 text NA # does not contain the sentence
4973 text medical device company released
5570 text NA # does not contain the sentence
Data:
df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. (BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley#dowjones.com [ 02-10-09 1437ET ]\n ",
" --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n By Anjali Athavaley \n Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley#dowjones.com [ 12-09-11 0924ET ]\n ",
" In September, the company issued a guidance range of 43 cents to 44 cents a share. \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share. \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ] \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion. \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion. \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at . \n\nThe company plans to report third-quarter earnings Oct. 14. \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68. \n\nCompany Web site: \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires#DowJones.com \n\nOrder free Annual Report for General Electric Co. \n\nVisit or call 1-888-301-0513 [ 10-06-05 0904ET ] \n "
)), class = "data.frame", row.names = c(NA, -3L))
We can get data in separate rows keeping the grp intact and keep only sentence that has "medical device company released" in it.
library(dplyr)
df %>%
tidyr::separate_rows(TD, sep = "\\.") %>%
group_by(grp) %>%
summarise(newCol = toString(grep(pattern = "medical device company released",
TD, ignore.case = TRUE, value = TRUE)))
# grp newCol
# <chr> <chr>
#1 3613 ""
#2 4973 "\n\nThe medical device company released its financia…
#3 5570 ""

R Programming Need String based unique solution for splitting huge text

I have a text file with a sample text like below all in small case:
"venezuela probes ex-oil czar ramirez over alleged graft scheme
caracas/houston (reuters) - venezuela is investigating rafael ramirez, a
once powerful oil minister and former head of state oil company pdvsa, in
connection with an alleged $4.8 billion vienna-based corruption scheme, the
state prosecutor's office announced on friday.
5.5 hours ago
— reuters
amazon ordered not to pull in customers who can't spell `birkenstock'
a german court has ordered amazon not to lure internet shoppers to its
online marketplace when they mistakenly search for "brikenstock",
"birkenstok", "bierkenstock" and other variations in google.
6 hours ago
— business standard"
What I need in R is to get these two pieces of text, separated out.
The first piece of text would correspond with the text1 variable and the second piece of text should correspond with the text2 variable.
Please remember I have many text-like paragraphs in this file. The solution would have to work for, say, 100,000 texts.
The only thing I thought that could be used as a delimiter is "—" but with that I lose the source of the information such as "reuters" or "business standard". I need that as well.
Would you know how to accomplish this in R?
Read the text from field with readLines and then split on the shifted cumsum of the occurence of that special dash in from of the publisher:
Lines <- readLines("Lines.txt") # from file in wd()
split(Lines, cumsum(c(0, head(grepl("—", Lines),-1))) )
#--------------
$`0`
[1] "venezuela probes ex-oil czar ramirez over alleged graft scheme"
[2] "caracas/houston (reuters) - venezuela is investigating rafael ramirez, a "
[3] "once powerful oil minister and former head of state oil company pdvsa, in "
[4] "connection with an alleged $4.8 billion vienna-based corruption scheme, the "
[5] "state prosecutor's office announced on friday."
[6] "5.5 hours ago"
[7] "— reuters"
$`1`
[1] "amazon ordered not to pull in customers who can't spell `birkenstock'"
[2] "a german court has ordered amazon not to lure internet shoppers to its "
[3] "online marketplace when they mistakenly search for \"brikenstock\", "
[4] "\"birkenstok\", \"bierkenstock\" and other variations in google."
[5] "6 hours ago"
[6] "— business standard'"
It's not a regular "-". Its a "—". And notice the by default readLines will omit the blank lines.
Here's what I could do. I do not like the loop in this, but I could not vectorize it. Hopefully this answer will at least serve as a starting point for other better answers.
Assumptions: All publisher names are preceeded by "— "
TEXT <- read.delim2("C:/Users/Arani.das/Desktop/TEXT.txt", header=FALSE, quote="", stringsAsFactors=F)
TEXT$Publisher <- grepl("— ", TEXT$V1)
TEXT$V1 <- gsub("^\\s+|\\s+$", "", TEXT$V1) #trim whitespaces in start and end of line
TEXT$FLAG <- 1 #grouping variable
for(i in 2:nrow(TEXT)){
if(TEXT$Publisher[i-1]==T){TEXT$FLAG[i]=TEXT$FLAG[i]+1}else{TEXT$FLAG[i]=TEXT$FLAG[i-1]}
} # Grouping entries
TEXT <- data.table::data.table(TEXT, key="FLAG")
TEXT2 <- TEXT[, list(News=paste0(V1[1:(length(V1)-2)], collapse=" "), Time=V1[length(V1)-1], Publisher=V1[length(V1)]), by="FLAG"]
Output:
FLAG News Time Publisher
1 Venezuela... 5.5 hours ago — reuters
2 amazon... 6 hours ago — business standard

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with 'ngrams'. with the following code:
library(tidytext)
unnest_tokens(tbl = president_tweets,
output = bigrams,
input = text,
token = "ngrams",
n = 2)
But when I run this I get the following error message:
error: unnest_tokens expects all columns of input to be atomic vectors (not lists)
My text column consists of a lot of tweets with rows that look like the following and is of class character.
president_tweets$text <– c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"
)
---------Update:----------
It looks like the sentimetr or exploratory package caused the conflict. I reloaded my packages without these and now it works again!
Hmmmmm, I am not able to reproduce your problem.
library(tidytext)
library(dplyr)
president_tweets <- data_frame(text = c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"))
unnest_tokens(tbl = president_tweets,
output = bigrams,
input = text,
token = "ngrams",
n = 2)
#> # A tibble: 205 x 1
#> bigrams
#> <chr>
#> 1 the united
#> 2 united states
#> 3 states senate
#> 4 senate just
#> 5 just passed
#> 6 passed the
#> 7 the biggest
#> 8 biggest in
#> 9 in history
#> 10 history tax
#> # ... with 195 more rows
The current CRAN version of tidytext does in fact not allow list-columns but we have changed the column handling so that the development version on GitHub now supports list-columns. Are you sure you don't have any of these in your data frame/tibble? What are the data types of all of your columns? Are any of them of type list?

Removing rows with a specific word in Corpus

I have a Corpus with multiple texts (news articles) scraped from the internet.
Some of the texts contain the description of the photo that is used in the article. I want to remove that.
I found an existing string about this topic but it could not help me. See link: Removing rows from Corpus with multiple documents
I want to remove every row that contains the words "PHOTO FILE" (in caps). This solution was posted:
require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
for(j in seq(textVector)) {
newCorp<-textVector
newCorp[[j]] <- textVector[[j]][-grep("PHOTO", textVector[[j]], ignore.case = FALSE)]
}
This does not seem to work for me though. The code runs but nothing is removed.
What does work is this:
require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
newCorp <- VCorpus(VectorSource(textVector[-grep("PHOTO", textVector,
ignore.case = FALSE)]))
But that removes every file that contains the word and I do not want that.
Would greatly appreciate if someone can help me on this.
Addition:
Here is an example of one of the texts:
[1] "Top News | Wed Apr 19, 2017 | 3:53pm BST\nFILE PHOTO: People walk accross a plaza in the Canary Wharf financial district, London, Britain, January 9, 2017. REUTERS/Dylan Martinez/File Photo\nLONDON Britain's current account deficit, one of the weak points of its economy, was bigger than previously thought in the years up to 2012, according to new estimates from the Office for National Statistics on Wednesday.\nThe figures showed British companies had paid out more interest to foreign holders of corporate bonds than initially estimated, resulting in a larger current account deficit.\nThe deficit, one of the biggest among advanced economies, has been in the spotlight since June's Brexit vote.\nBank of England Governor Mark Carney said in the run-up to the referendum that Britain was reliant on the \"kindness of strangers\", highlighting how the country needed tens of billions of pounds of foreign finance a year to balance its books.\nThe ONS said the current account deficit for 2012 now stood at 4.4 percent of gross domestic product, compared with 3.7 percent in its previous estimate.\nThe ONS revised up the deficit for every year dating back to 1998 by an average of 0.6 percentage points. The biggest revisions occurred from 2005 onwards.\nLast month the ONS said Britain's current account deficit tumbled to 2.4 percent of GDP in the final three months of 2016, less than half its reading of 5.3 percent in the third quarter.\nRevised data for 2012 onward is due on Sept. 29, and it is unclear if Wednesday's changes point to significant further upward revisions, as British corporate bond yields have declined markedly since 2012 and touched a new low in mid-2016. .MERUR00\nThe ONS also revised up its earlier estimates of how much Britons saved. The household savings ratio for 2012 rose to 9.8 percent from 8.3 percent previously, with a similar upward revision for 2011.\nThe ratio for Q4 2016, which has not yet been revised, stood at its lowest since 1963 at 3.3 percent.\nThe ONS said the changes reflected changes to the treatment of self-employed people paying themselves dividends from their own companies, as well as separating out the accounts of charities, which had previously been included with households.\nMore recent years may produce similarly large revisions to the savings ratio. Around 40 percent of the roughly 2.2 million new jobs generated since the beginning of 2008 fell into the self-employed category.\n"
So I wish to delete the sentence (row) of FILE PHOTO
Let's say that initially the text is contained in the file input.txt.
The raw file is as follows:
THis is a text that contains a lot
of information
and PHOTO FILE.
Great!
my_text<-readLines("input.txt")
[1] "THis is a text that contains a lot" "of information" "and PHOTO FILE." "Great!"
If you get rid of the spurious element
blah[-grep("PHOTO FILE",blah,value = F,perl=T)]
you end up with
[1] "THis is a text that contains a lot" "of information" "Great!"

Resources