Extracting words based on a $ and unnesting the data - r

I am trying to do two things to my data. The data looks like:
# A tibble: 10 x 2
grp newCol
<int> <chr>
1 6303 "The company sees earnings of $3.40 to $3.60 a share for all of 2008, agai…
2 7686 " -- reaffirmed its fiscal 2015 guidance of per share diluted earnings b…
3 9577 "Analysts polled by FactSet anticipate earnings of 96 cents a share, down …
4 6475 ""
5 5229 "The company also expects income to be \"significantly impacted\" by costs…
6 2648 "Hoku also expects losses for the foreseeable future on significant cost i…
7 3691 "St. Louis-based Emerson now sees full-year earnings of $2.40 to $2.60 a s…
8 9297 ""
9 2080 "The restaurant group also raised its earnings guidance for fiscal 2007 to…
10 3513 "Guidance, For the full fiscal year 2008, the Company is moderating its pr…
I can run the following:
x <- d %>%
mutate(
extractedWords = str_extract_all(newCol, "([^\\s]+\\s){2}earnings(\\s[^\\s]+){12}")
)
Where I get:
# A tibble: 10 x 1
extractedWords
<list>
1 <chr [1]>
2 <chr [3]>
3 <chr [1]>
4 <chr [0]>
5 <chr [0]>
6 <chr [0]>
7 <chr [1]>
8 <chr [0]>
9 <chr [2]>
10 <chr [4]>
First, I want to modify the str_extract_all(newCol, "([^\\s]+\\s){2}earnings(\\s[^\\s]+){12}") call, which currently extracts the 2 words before the word earnings and the 12 words after it. I want to change it so that it extracts the words before the dollar ($) symbol instead.
Secondly I want to unnest the columns. When I run:
x %>%
unnest(extractedWords)
The number of rows in the data goes from 10 to 12. I want to unnest it, but paste each c("text", "more text") into a single string such as text, more text, or separated by | (or some variation).
Data:
d <- structure(list(grp = c(6303L, 7686L, 9577L, 6475L, 5229L, 2648L,
3691L, 9297L, 2080L, 3513L), newCol = c("The company sees earnings of $3.40 to $3.60 a share for all of 2008, against $2.74 a share from continuing operations in 2007, an increase of 24% to 31%. ",
" -- reaffirmed its fiscal 2015 guidance of per share diluted earnings between , We believe the guidance outlook for fiscal 2015 remains realistic and takes into consideration the heightened competitive market trends for the Diagnostics segment offset by strategic investments that target the growing outpatient segment and further growth of our Life Science segment through our focus on global expansion, increasing industrial market efforts, and emerging success in the AgriBio and genomics research areas., FISCAL 2015 GUIDANCE REAFFIRMED, For the fiscal year ending September 30, 2015, management expects net revenues to be in the range of $193 million to $200 million and per share diluted earnings to be between $0.85 and $0.91. The per share estimates assume an increase in average diluted shares outstanding from approximately 41.9 million at fiscal 2014 year end to approximately 42.4 million at fiscal 2015 year end. The revenue and earnings guidance provided in this press release is from expected internal growth and does not include the impact of any additional acquisitions the Company might complete during fiscal 2015.",
"Analysts polled by FactSet anticipate earnings of 96 cents a share, down 10 cents from a year earlier. Revenue is expected to have decreased 5.5% to $30.4 billion, which would mark the fourth consecutive quarterly decline after six years of growth. Verizon has said it expects earnings and sales will be roughly flat this year.",
"", "The company also expects income to be \"significantly impacted\" by costs related to its pending acquisition of SRS Labs Inc. (SRSL)., The company also expects income to be \"significantly impacted\" by costs related to its pending acquisition of SRS Labs Inc. (SRSL).",
"Hoku also expects losses for the foreseeable future on significant cost increases. ",
"St. Louis-based Emerson now sees full-year earnings of $2.40 to $2.60 a share, down from its February forecast of $2.70 to $2.95 a share. The company also expects net sales for the fiscal year to fall 13% to 15% to $21 billion to $21.7 billion. Sales are expected to be hurt by about 5% because of currency translations, but boosted by 1% because of acquisitions.",
"", "The restaurant group also raised its earnings guidance for fiscal 2007 to a range of $3.45 a share to $3.50 a share, and said its expects earnings for the third quarter ending in July of 85 cents to 89 cents a share and same-store sales growth of 5% to 6%. ",
"Guidance, For the full fiscal year 2008, the Company is moderating its previously issued guidance and expects net sales to be approximately $950 million and earnings per diluted share to be approximately $1.00, which includes approximately $0.45 per diluted share of restructuring charges and other unusual items. For the twelve months ended February 2, 2008, net sales were $1.09 billion and earnings per diluted share were $2.59., Set forth below is our reconciliation of net earnings per share, calculated in accordance with generally accepted accounting principles, or GAAP, to net earnings per share, as adjusted, for certain historical periods and certain future periods. For reference, we also include our previous guidance for third quarter fiscal 2008. Net earnings per share, as adjusted, excludes (i) the net impact of certain restructuring costs and other unusual items as well as the write off of unamortized financing costs during the first three quarters of fiscal 2008 and (ii) the anticipated impact of certain restructuring costs in the fourth quarter of fiscal 2008. We believe that investors often look at ongoing operations as a measure of assessing performance and as a basis for comparing past results against future results. Therefore, we believe that presenting our results and expected results excluding these items provides useful information to investors because this allows investors to make decisions based on our ongoing operations. We use the results excluding these items to discuss our business with investment institutions, our board of directors and others. Further, we believe that presenting our results and expected results excluding these items provides useful information to investors because this allows investors to compare our results and our expected results for the periods presented to other periods., Guidance Results for Guidance Guidance"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L))

I'm not 100% sure what you mean by the first part of your question. I initially assumed you wanted all words after the word earnings and before $, which can be done with a 'positive lookahead' that allows any number of words until the first dollar sign (hence the *?). The pattern shown here is the edited one described in the EDIT note at the end.
Rather than unnesting, I loop over the extractedWords column using purrr::map_chr, which returns a character vector, so no further unnesting is necessary.
library(tidyverse)
d %>%
  mutate(
    extractedWords = str_extract_all(newCol, "([^\\s]+\\s){2}\\$(\\s?[^\\s]+){12}")
  ) %>%
  mutate(result = map_chr(extractedWords, str_c, "", collapse="|"))
EDIT: edited the regular expression to extract 2 words before the dollar sign and 12 words after it. Note I had to escape the dollar sign (\\$) for it to work, since a dollar sign has a special meaning in a regular expression.
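For readers who want to see the two ideas from this answer in isolation (a lookahead that stops at the first dollar sign, and collapsing several matches into one string per row), here is a minimal base-R sketch. The sample sentences and the pattern are made up for illustration and are not the ones from the question; base R's gregexpr()/regmatches() stand in for str_extract_all().

```r
# Hypothetical sample text (not the data from the question)
txt <- c(
  "sees earnings of $3.40 a share",
  "no figures here",
  "earnings of $5 million and earnings near $1 million"
)

# Lazy match from "earnings" up to, but not including, the next "$".
# perl = TRUE is needed for the lookahead (?=...) in base R.
hits <- regmatches(txt, gregexpr("earnings[^$]*?(?=\\$)", txt, perl = TRUE))

# Collapse each list element into a single string, so the result has
# exactly one entry per input row (an empty string when nothing matched).
result <- vapply(
  hits,
  function(h) paste(trimws(h), collapse = " | "),
  character(1)
)
result
```

Because every list element is reduced to one string, the row count stays the same as in the input, which is the behaviour the question asks for instead of unnest().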

Related

Extract the paragraphs from a PDF that contain a keyword using R

I need to extract from a PDF file the paragraphs that contain a keyword. I have tried various snippets, but none of them returned anything.
I have seen this code from the user Tyler Rinker (Extract before and after lines based on keyword in Pdf using R programming), but it extracts only the line where the keyword is, plus the line before and the line after.
library(textreadr)
library(tidyverse)
loc <- function(var, regex, n = 1, ignore.case = TRUE){
  locs <- grep(regex, var, ignore.case = ignore.case)
  out <- sort(unique(c(locs - 1, locs, locs + 1)))
  out <- out[out > 0]
  out[out <= length(var)]
}
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
  read_pdf() %>%
  slice(loc(text, 'cancer'))
However, I need to get the paragraphs and store each one in a row in my database. Could you help me?
The text lines in paragraphs will all be quite long unless it is the final line of the paragraph. We can count the characters in each line and do a histogram to show this:
library(textreadr)
doc <- read_pdf('https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf')
hist(nchar(doc$text), 20)
So anything less than about 75 characters is either not in a paragraph or at the end of a paragraph. We can therefore stick a line break on the short ones, paste all the lines together, then split on linebreaks:
doc$text[nchar(doc$text) < 75] <- paste0(doc$text[nchar(doc$text) < 75], "\n")
txt <- paste(doc$text, collapse = " ")
txt <- strsplit(txt, "\n")[[1]]
Now we can run the regex and find the paragraphs that contain the keyword:
grep("cancer", txt, value = TRUE)
#> [1] " Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but stresses that, in order for them to work, they should be voluntary, and the government should exempt all life-saving drugs from import duties and other taxes such as excise duty and VAT. He is, however, critical about a proposal for mandatory price negotiation of newly patented drugs. He feels this will erode India's credibility in implementing the Patent Act in © 2006 KPMG International. KPMG International is a Swiss cooperative that serves as a coordinating entity for a network of independent firms operating under the KPMG name. KPMG International provides no services to clients. Each member firm of KPMG International is a legally distinct and separate entity and each describes itself as such. All rights reserved. Collaboration for Growth 24"
#> [2] " a fair and transparent manner. To deal with diabetes, medicines are not the only answer; awareness about the need for lifestyle changes needs to be increased, he adds. While industry leaders have long called for the development of PPPs for the provision of health care in India, particularly in rural areas, such initiatives are currently totally unexplored. However, the government's 2006 draft National Pharmaceuticals Policy proposes the introduction of PPPs with drug manufacturers and hospitals as a way of vastly increasing the availability of medicines to treat life-threatening diseases. It notes, for example, that while an average estimate of the value of drugs to treat the country's cancer patients is $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the near non-accessibility of the medicines to a vast majority of the affected population, mainly because of the high cost of these medicines,” says the Policy, which also calls for tax and excise exemptions for anti-cancer drugs."
#> [3] " 50.1 percent of Aventis Pharma is held by European drug major Sanofi-Aventis and, in early April 2006, it was reported that UB Holdings had sold its 10 percent holding in the firm to Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective, anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1 million, with domestic sales up 9.1 percent at $129.8 million and exports increasing 12 percent to $51.2 million. Sales were led by 83 percent annual growth for the diabetes treatment Lantus (insulin glargine), followed by the rabies vaccine Rabipur (+22 percent), the diabetes drug Amaryl (glimepiride) and epilepsy treatment Frisium (clobazam), both up 18 percent, the angiotensin-coverting enzyme inhibitor Cardace (ramipril +15 percent), Clexane (enoxaparin), an anticoagulant, growing 14 percent and Targocid (teicoplanin), an antibiotic, whose sales advanced 8 percent."
Created on 2020-09-16 by the reprex package (v0.3.0)
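The same line-length heuristic can be tried on a small in-memory example without downloading the PDF. This is only a sketch: the four lines and the cutoff of 40 characters are invented for illustration, and a real document needs its own threshold (for instance, read off the histogram as described above).

```r
# Hypothetical lines, as they might come out of a PDF reader
lines <- c(
  "This first line is long enough to count as the body of a paragraph here,",
  "and it ends on a short line.",
  "A second paragraph starts on this line and mentions the keyword cancer,",
  "then it also ends."
)

threshold <- 40  # assumed cutoff; choose it from the histogram of nchar()

# Mark short lines as paragraph ends, glue everything together, then split
short <- nchar(lines) < threshold
lines[short] <- paste0(lines[short], "\n")
txt <- paste(lines, collapse = " ")
paragraphs <- trimws(strsplit(txt, "\n")[[1]])

# Keep the paragraphs containing the keyword, one per row
hits <- data.frame(
  paragraph = grep("cancer", paragraphs, value = TRUE),
  stringsAsFactors = FALSE
)
hits
```

Storing the result in a data frame gives one paragraph per row, which is the shape requested in the question.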

Apply Sentimentr on Dataframe with Multiple Sentences in 1 String Per Row

I have a dataset where I am trying to get the sentiment by article. I have about 1000 articles. Each article is a string. This string has multiple sentences within it. I ideally would like to add another column that would summarise the sentiment for each article. Is there an efficient way to do this using dplyr?
Below is an example dataset with just 2 articles.
date<- as.Date(c('2020-06-24', '2020-06-24'))
text <- c('3 more cops recover as PNP COVID-19 infections soar to 519', 'QC suspends processing of PWD IDs after reports of abuse in issuance of cards')
link<- c('https://newsinfo.inquirer.net/1296981/3-more-cops-recover-as-pnps-covid-19-infections-soar-to-519,3,10,4,11,9,8', 'https://newsinfo.inquirer.net/1296974/qc-suspends-processing-of-pwd-ids-after-reports-of-abuse-in-issuance-of-cards')
V4 <-c('MANILA, Philippines — Three more police officers have recovered from the new coronavirus disease, increasing the total number of recoveries in the Philippine National Police to (PNP) 316., This developed as the total number of COVID-19 cases in the PNP rose to 519 with one new infection and nine deaths recorded., In a Facebook post on Wednesday, the PNP also recorded 676 probable and 876 suspects for the disease., PNP chief Gen. Archie Gamboa previously said the force would will intensify its health protocols among its personnel after recording a recent increase in deaths., The latest fatality of the ailment is a police officer in Cebu City, which is under enhanced community quarantine as COVID-19 cases continued to surge there., ATM, \r\n\r\nFor more news about the novel coronavirus click here.\r\nWhat you need to know about Coronavirus.\r\n\r\n\r\n\r\nFor more information on COVID-19, call the DOH Hotline: (02) 86517800 local 1149/1150.\r\n\r\n \r\n \r\n \r\n\r\n \r\n , The Inquirer Foundation supports our healthcare frontliners and is still accepting cash donations to be deposited at Banco de Oro (BDO) current account #007960018860 or donate through PayMaya using this link .',
'MANILA, Philippines — Quezon City will halt the processing of identification cards to persons with disability for two days starting Thursday, June 25, so it could tweak its guidelines after reports that unqualified persons had issued with the said IDs., In a statement on Wednesday, Quezon City Mayor Joy Belmonte said the suspension would the individual who issued PWD ID cards to six members of a family who were not qualified but who paid P2,000 each to get the IDs., Belmonte said the suspect, who is a local government employee, was already issued with a show-cause order to respond to the allegation., According to city government lawyer Nino Casimir, the suspect could face a grave misconduct case that could result in dismissal., The IDs are issued to only to persons qualified under the Act Expanding the Benefits and Privileges of Persons with Disability (Republic Act No. 10754)., The IDs entitle PWDs to a 20 percent discount and VAT exemption on goods and services., /atm')
df<-data.frame(date, text, link, V4)
head(df)
I looked into doing this with the sentimentr package and came up with the code below. However, it only outputs each sentence's sentiment (I split the text on ., with strsplit), and I instead want to aggregate everything at the full-article level after applying this strsplit.
library(sentimentr)
library(dplyr)  # %>%, group_by(), mutate()
library(tidyr)  # unnest()
full <- df %>%
  group_by(V4) %>%
  mutate(V2 = strsplit(as.character(V4), "[.],")) %>%
  unnest(V2) %>%
  get_sentences() %>%
  sentiment()
The desired output is simply an extra column in my df data frame with a summary sum(sentiment) for each article.
Additional info based on answer below:
df %>%
group_by(V4) %>% # group by not really needed
mutate(V4 = gsub("[.],", ".", V4),
sentiment_score = sentiment_by(V4))
# A tibble: 2 x 5
# Groups: V4 [2]
date text link V4 sentiment_score$e~ $word_count $sd $ave_sentiment
<date> <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 2020-06-24 3 more cops recover as P~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Three more police officers ~ 1 172 0.204 -0.00849
2 2020-06-24 QC suspends processing o~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Quezon City will halt the p~ 1 161 0.329 -0.174
Warning message:
Can't combine <sentiment_by> and <sentiment_by>; falling back to <data.frame>.
x Some attributes are incompatible.
i The author of the class should implement vctrs methods.
i See <https://vctrs.r-lib.org/reference/faq-error-incompatible-attributes.html>.
If you need the sentiment over the whole text, there is no need to split the text into sentences first; the sentiment functions take care of this. I replaced the ., in your text with plain periods, as the sentiment functions need these to find sentence boundaries. They also recognize that an abbreviation such as "mr." is not the end of a sentence. If you use get_sentences() first, you get the sentiment per sentence and not over the whole text.
The function sentiment_by handles the sentiment over the whole text and averages it nicely. See the help page for the averaging.function argument if you need to change this. The by part of the function can deal with any grouping you want to apply.
df %>%
group_by(V4) %>% # group by not really needed
mutate(V4 = gsub("[.],", ".", V4),
sentiment_score = sentiment_by(V4))
# A tibble: 2 x 5
# Groups: V4 [2]
date text link V4 sentiment_score$~ $word_count $sd $ave_sentiment
<date> <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 2020-06-24 3 more cops recov~ https://newsinfo.inquire~ "MANILA, Philippines — Three~ 1 172 0.204 -0.00849
2 2020-06-24 QC suspends proce~ https://newsinfo.inquire~ "MANILA, Philippines — Quezo~ 1 161 0.329 -0.174
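As a variation, the score can be stored as a plain numeric column instead of a nested data frame, which also avoids the vctrs warning shown in the question. The sketch below assumes the sentimentr package is installed and uses two invented articles rather than the data above; sentiment_by() gives one averaged score per element, and sentiment() gives per-sentence scores that can be summed per article.

```r
library(sentimentr)

# Hypothetical two-article data frame (not the data from the question)
articles <- data.frame(
  id = c(1, 2),
  body = c("I love this wonderful product. It works very well.",
           "This is a terrible outcome. Everyone hates it."),
  stringsAsFactors = FALSE
)

# Split once, then reuse the sentence object for both summaries
sentences <- get_sentences(articles$body)

# One averaged score per article (element), as a plain numeric column
articles$ave_sentiment <- sentiment_by(sentences)$ave_sentiment

# Alternative: sum of the per-sentence scores within each article
per_sentence <- sentiment(sentences)
articles$sum_sentiment <- as.numeric(
  tapply(per_sentence$sentiment, per_sentence$element_id, sum)
)
articles
```

Which summary is more useful depends on the analysis: a sum grows with article length, while the default average from sentiment_by() does not.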

filtering text and storing the filtered sentence/paragraph into a new column

I am trying to extract some sentences from text data. I want to extract the sentences that contain the phrase medical device company released. I can run the following code:
df_text <- unlist(strsplit(df$TD, "\\."))
df_text
df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text
Which gives me:
[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"
So I extracted the sentences that contain the phrase medical device company released. However, I want to store the result in a new column, so that I can see which grp each sentence came from.
Expected output:
grp TD newCol
3613 text NA # does not contain the sentence
4973 text medical device company released
5570 text NA # does not contain the sentence
Data:
df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. (BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley#dowjones.com [ 02-10-09 1437ET ]\n ",
" --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n By Anjali Athavaley \n Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. 
approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley#dowjones.com [ 12-09-11 0924ET ]\n ",
" In September, the company issued a guidance range of 43 cents to 44 cents a share. \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share. \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ] \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion. \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion. \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at . \n\nThe company plans to report third-quarter earnings Oct. 14. \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68. \n\nCompany Web site: \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires#DowJones.com \n\nOrder free Annual Report for General Electric Co. \n\nVisit or call 1-888-301-0513 [ 10-06-05 0904ET ] \n "
)), class = "data.frame", row.names = c(NA, -3L))
We can split the text into separate rows while keeping grp intact, and then keep only the sentences that contain "medical device company released".
library(dplyr)
df %>%
  tidyr::separate_rows(TD, sep = "\\.") %>%
  group_by(grp) %>%
  summarise(newCol = toString(grep(pattern = "medical device company released",
                                   TD, ignore.case = TRUE, value = TRUE)))
# grp newCol
# <chr> <chr>
#1 3613 ""
#2 4973 "\n\nThe medical device company released its financia…
#3 5570 ""
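The output above contains empty strings where the expected output in the question shows NA. If NA is preferred, one option is a small helper that returns NA_character_ when nothing matches. This is a base-R sketch on invented two-row data; find_sentence is a hypothetical helper name, not part of any package.

```r
# Hypothetical data (not the data from the question)
df <- data.frame(
  grp = c("1", "2"),
  TD = c("Sales rose. Costs fell.",
         "Intro text. The medical device company released its outlook. More text."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on periods, keep matching sentences,
# and return NA when no sentence contains the pattern.
find_sentence <- function(text, pattern) {
  sentences <- trimws(strsplit(text, "\\.")[[1]])
  hits <- grep(pattern, sentences, ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0) NA_character_ else toString(hits)
}

df$newCol <- vapply(
  df$TD, find_sentence, character(1),
  pattern = "medical device company released",
  USE.NAMES = FALSE
)
df[, c("grp", "newCol")]
```

Note that splitting on every period also breaks text at abbreviations and decimals (for example "Corp." or "$2.78"), so the extracted sentence can be cut short on real newswire text.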

extract the first number after a specific word in a text column [duplicate]

This question already has answers here:
Extract number after a certain word
I have some text data and I want to extract from it the first number after the phrase "expects earnings of". What I currently have is the following:
x <- d %>%
mutate(
expectsEarningsOf = str_match_all(newCol, "expects earnings of (.*?) cents")
)
This extracts the text, along with the number, that appears after the phrase "expects earnings of" and before the word "cents". I now want to extract just the first number after "expects earnings of". I thought about something like:
x <- d %>%
mutate(
expectsEarningsOf = str_match_all(newCol, "expects earnings of (.*?) anyStringCharacter")
)
Here anyStringCharacter stands for any non-numeric character.
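One way to express that idea is to skip any non-digit characters after the phrase and capture the first run of digits, with an optional decimal part. The sketch below uses base R's regexec()/regmatches() on three invented sentences rather than the data below; the pattern is an assumption about what counts as "the first number".

```r
# Hypothetical sentences (not the data from the question)
x <- c(
  "For 2008, True Religion expects earnings of $1.48 to $1.52 a share.",
  "Wall Street expects earnings of 36 cents a share.",
  "The company sees earnings of $1.13 to $1.16 a share."
)

# Skip non-digits after the phrase, then capture digits with optional decimals
pattern <- "expects earnings of [^0-9]*([0-9]+(\\.[0-9]+)?)"

# regexec() returns the first match per string, including capture groups
m <- regmatches(x, regexec(pattern, x))

# Element 2 of each match is the first capture group; NA when no match
first_number <- vapply(
  m,
  function(g) if (length(g) >= 2) g[2] else NA_character_,
  character(1)
)
as.numeric(first_number)
```

Because [^0-9]* can span arbitrary text, a sentence such as "expects earnings of about the same as in 2013" would return 2013; restricting the skipped characters (for example to an optional "$") is stricter if that matters.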
Data:
d <- structure(list(grp = c(2635L, 1276L, 10799L, 10882L, 6307L, 7622L,
2448L, 6467L, 3224L, 2064L, 9232L, 5039L, 2888L, 5977L, 3565L
), newCol = c("For 2008, True Religion expects earnings of $1.48 to $1.52 a share and net sales of $210 million to $215 million. The company expects to incur additional marketing expenses of about $1.7 million. ",
"But Hospira also said it now expects net sales on a GAAP basis to grow at a rate of 1% to 2% this year, reduced from earlier expectations by lower-than-expected international sales and purchasing delays in the medication-management business. After the second quarter, the company had projected growth in a range of 3% to 5%. ",
"14 Nov 2013 16:04 EDT *Thermogenesis Sees Net Savings About $1.5 Million From Reorganization",
" The Company announced that net sales for this nine week period increased by 25.4% to $185.3 million while comparable store sales for this period decreased by 0.5%. Based on this quarter-to-date performance, the Company now expects net sales for the fourth quarter of fiscal 2013 to be in the range of $208 million to $210 million, comparable store sales to be in the range of -1.5% to -0.5% and GAAP net income to be in the range of $23.3 million to $24.3 million, with a GAAP diluted income per common share range of $0.43 to $0.45 on approximately 54.0 million estimated weighted average shares outstanding. Excluding $0.9 million, or $0.02 per adjusted diluted share in tax-effected expenses related to the founders' transaction(1) , adjusted net income is expected to be approximately $24.2 million to $25.2 million, or $0.44 to $0.46 per diluted share based on estimated adjusted diluted weighted average shares outstanding of approximately 54.6 million., 9 Jan 2014 16:45 EDT *Five Below, Inc. Updates 4Q Fiscal 2013 Guidance Based On Qtr-To-Date Results",
"", "1323 GMT Raiffeisen Centrobank calls Verbund's (VER.VI) recent guidance increase for 2014 a \"mixed bag,\" raising its target price to EUR15.60 from EUR14.30. The bank retains its hold rating as positive effects are mostly due to one-offs, although the utility's sustainable cost savings were a positive surprise. \"The power price environment is still bleak following a weakish outlook for Central European economies, coal prices falling further and only lacklustre hopes for a quick fix of the European energy and climate policy,\" Raiffeisen adds. Verbund's shares trade up 0.6% at EUR15.34. (Nicole.lundeen#wsj.com; #nicole_lundeen) ",
"As a result of its third quarter results and current fourth quarter outlook, the Company has updated its guidance for fiscal 2007. The Company now expects net sales to range from $2.68 billion to $2.7 billion, which compares to prior expectations of $2.7 billion to $2.75 billion. Same-store sales for the year are expected to increase approximately 2.5% to 3% compared to previous expectations of an increase of approximately 3.0% to 4.5%. The Company now expects full year net income to range from $2.37 to $2.43 per diluted share, which compares to its prior guidance of $2.49 to $2.56 per diluted share. ",
" Sempra Energy (SRE) sees earnings next year growing 15% from this year's estimate, putting 2010 expectations above Wall Street's, as the parent of San Diego Gas & Electric anticipates much lower capital spending for the next five years.",
"Outlook for 2008: Midpoint for EPS guidance increased, For the full year 2008, the company now expects results from continuing operations as follows: earnings per diluted share of between $3.10 and $3.20, compared to the previous range of $3.00 to $3.20; revenue growth of approximately 9%, and operating income to approach 17% of revenues. Over the same period, the company expects cash from operations to approximate $900 million and capital expenditures of between $240 million and $260 million. These estimates exclude potential special charges.",
"California Pizza Kitchen expects second-quarter earnings of 34 cents to 36 cents a share. Wall Street expects earnings of 36 cents a share. ",
" -- Q1 2013 gross margin within guidance, sales ahead of guidance , \"We achieved first quarter sales ahead of and gross margin in line with our guidance, and reiterate our expectation for a sales acceleration during the year, with a second quarter markedly stronger than the first quarter and a large second half, leading to expected 2013 full year net sales at a similar level to that of 2012. The underlying assumptions are unchanged, with foundry and logic preparing for very lithography-intensive 14-20 nm technology nodes to be used for next generation mobile end-products; while lithography investments in memory are still muted, memory chip price recovery and discussions on scanner shipment capability are signs of potential upside for second half deliveries. EUV technology industrialization continues to make steady progress on the trajectory set with the introduction of the improved source concept last year: firstly, the EUV light sources have now been demonstrated at 55 Watts with adequate dose control; secondly, the scanners themselves have demonstrated production-worthy, 10 nm node compatible imaging and overlay specifications. We therefore confirm our expectation of the ramp of EUV-enabled semiconductor production in 2015, supported by our NXE:3300B scanners, two of which are being prepared for shipment and installation in Q2 and Q3,\" said Eric Meurice, President and Chief Executive Officer of ASML., -- For the second quarter of 2013, ASML expects net sales of about EUR 1.1 ",
"In the first quarter, Covanceexpects earnings of 60 cents a share on a modest sequential increase in net revenues. Analysts predicted income of 66 cents share on $534 million in revenue, which is nearly flat with the latest quarter's revenue.",
"The company said Monday it expects to report revenue of about $875 million for 2007, up sharply from $196 million in 2006, mostly because of new military contracts. However, it expects net income to remain nearly the same at $16.6 million. ",
"For the fourth quarter, the company sees earnings of $1.13 to $1.16 a share. ",
"Chip maker now expects earnings from continuing operations of 15c-17c a share, excluding restructuring charges, and a revenue decline of 25% to 30% sequentially, because of weak demand. Shares fall 6% late., Chip maker now expects earnings from continuing operations of 15c-17c a share, excluding restructuring charges, and a revenue decline of 25% to 30% sequentially, because of weak demand. Shares fall 6% late."
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-15L))
The first number after "expects earnings of":
library(stringr)
str_extract_all(d$newCol, "(?<=expects earnings of )\\d+")
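This solution uses a positive lookbehind, (?<=expects earnings of ), which tells the regex engine to match \\d+ only if it is immediately preceded by "expects earnings of " (including the trailing space). Note that \\d+ captures only the leading run of digits, so a figure written as 1.13 would come back as 1.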
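The question also asks for the matches to be pasted together per row rather than unnested into extra rows. A minimal sketch of that idea follows. It uses base R's regmatches()/gregexpr() with perl = TRUE as a stand-in for str_extract_all() so that it runs without extra packages; the vector txt is made-up sample text, not the real d$newCol.

```r
# Made-up sentences standing in for d$newCol (assumption, for illustration only)
txt <- c(
  "Chip maker now expects earnings of 15 cents; analysts say it expects earnings of 17 cents.",
  "Wall Street expects earnings of 36 cents a share.",
  ""
)

# Same lookbehind as above: digits immediately preceded by "expects earnings of "
pattern <- "(?<=expects earnings of )\\d+"
hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))

# Collapse each row's matches into one string, so the row count stays the same.
# Rows without a match become an empty string.
collapsed <- vapply(hits, paste, character(1), collapse = " | ")
```

With the real data, the same vapply() call could be applied to the list column inside mutate(), which keeps one row per grp instead of growing the table the way unnest() does.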

error reading text file into new columns of a dataframe using some text editing

I have a text file (0001.txt) which contains the data as below:
<DOC>
<DOCNO>1100101_business_story_11931012.utf8</DOCNO>
<TEXT>
The Telegraph - Calcutta (Kolkata) | Business | Local firms go global
6 Local firms go global
JAYANTA ROY CHOWDHURY
New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.
Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.
Though the first-quarter investment was 15 per cent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
According to analysts, confidence in global recovery, cheap corporate buys abroad and easier rules governing investment overseas had spurred flow of capital and could see total investment abroad top $12 billion this year and rise to $18-20 billion next fiscal.
For example, Titagarh Wagons plans to expand abroad on the back of the proposed Asian railroad project.
We plan to travel all around the world with the growth of the railroads, said Umesh Chowdhury of Titagarh Wagons.
India is full of opportunities, but we are all also looking at picks abroad, said Gautam Mitra, managing director of Indian Structurals Engineering Company.
Mitra plans to open a holding company in Switzerland to take his business in structurals to other Asian and African countries.
Indian companies created 3 lakh jobs in the US, while contributing $105 billion to the US economy between 2004 and 2007, according to commerce ministry statistics. During 2008-09, Singapore, the Netherlands, Cyprus, the UK, the US and Mauritius together accounted for 81 per cent of the total outward investment.
Bose said, And not all of it is organic growth. Much of our investment abroad reflects takeovers and acquisitions.
In the last two years, Suzlon acquired Portugals Martifers stake in German REpower Systems for $122 million. McNally Bharat Engineering has bought the coal and minerals processing business of KHD Humboldt Wedag. ONGC bought out Imperial Energy for $2 billion.
Indias foreign assets and liabilities today add up to more than 60 per cent of its gross domestic product. By the end of 2008-09, total foreign investment was $67 billion, more than double of that at the end of March 2007.
</TEXT>
</DOC>
Above, all of the text sits between the <TEXT> and </TEXT> tags.
I want to read it into an R data frame with four columns, like this:
Title Author Date Text
The Telegraph - Calcutta (Kolkata) JAYANTA ROY CHOWDHURY Dec. 31 Indian companies are stepping out of their homes to try their luck on foreign shores. Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09. Though the first-quarter investment was 15 percent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
This is what I tried, using dplyr and readr:
# read text file
library(dplyr)
library(readr)
dat <- read_csv("0001.txt") %>% slice(-8)
# print part of data frame
head(dat, n=2)
In the code above, I tried to skip the first few lines of the file (which are not important) and then read the rest into a data frame.
But I did not get what I was looking for, and I can't work out what I am doing wrong.
Could someone please help?
To be able to read data into R as a data frame or table, the data needs to have a consistent structure maintained by separators. One of the most common formats is a file with comma separated values (CSV).
The data you're working with doesn't have separators though. It's essentially a string with minimally enforced structure. Because of this, it sounds like the question is more related to regular expressions (regex) and data mining than it is to reading text files into R. So I'd recommend looking into those two things if you do this task often.
That aside, to do what you're wanting in this example, I'd recommend reading the text file into R as a single string of text first. Then you can parse the data you want using regex. Here's a basic, rough draft of how to do that:
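fileName <- "Path/to/your/data/0001.txt"
# Read the whole file into a single string
string <- readChar(fileName, file.info(fileName)$size)
# Each column is carved out of that one string with its own regex.
# Title keeps everything before the first " |", so any lines ahead of the
# title line (such as the <DOC> header) have to be dropped beforehand.
df <- data.frame(
  Title=sub("\\s+[|]+(.*)","",string),
  Author=gsub("(.*)+?([A-Z]{2,}.*[A-Z]{2,})+(.*)","\\2",string),
  Date=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+(.*)","\\2",string),
  Text=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+[: ]+(.*)","\\3",string))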
Output:
str(df)
'data.frame': 1 obs. of 4 variables:
$ Title : chr "The Telegraph - Calcutta (Kolkata)"
$ Author: chr "JAYANTA ROY CHOWDHURY"
$ Date : chr "Dec. 31"
$ Text : chr "Indian companies are stepping out of their homes to"| __truncated__
Regex is useful here because it lets you describe very specific patterns in a string. The downside shows up when the format of the strings keeps changing: each variation will likely need some adjustment to the regex.
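As an alternative to matching against one long string, the file can be handled line by line. The sketch below is only an illustration under stated assumptions: the vector lines stands in for readLines("0001.txt") and is shortened from the question, and the layout (title bar, headline, author, then the article starting with "Place, Mon. DD: ") is assumed to be the same in every file.

```r
# Stand-in for readLines("0001.txt"); shortened sample from the question
lines <- c(
  "<DOC>",
  "<DOCNO>1100101_business_story_11931012.utf8</DOCNO>",
  "<TEXT>",
  "The Telegraph - Calcutta (Kolkata) | Business | Local firms go global",
  "6 Local firms go global",
  "JAYANTA ROY CHOWDHURY",
  "New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.",
  "Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.",
  "</TEXT>",
  "</DOC>"
)

# Keep only the lines strictly between <TEXT> and </TEXT>
start <- which(lines == "<TEXT>")
end   <- which(lines == "</TEXT>")
body  <- lines[(start + 1):(end - 1)]

# Assumed layout: body[1] = title bar, body[2] = headline, body[3] = author,
# body[4] onward = article text, whose first line starts with "Place, Mon. DD: "
first_para <- body[4]

df <- data.frame(
  # Title: everything before the first "|"
  Title  = trimws(sub("\\|.*$", "", body[1])),
  Author = body[3],
  # Date: first "Mon. DD" pattern in the opening paragraph
  Date   = regmatches(first_para,
                      regexpr("[A-Z][a-z]{2}\\. [0-9]{1,2}", first_para)),
  # Text: opening paragraph without the dateline, plus the remaining lines
  Text   = paste(c(sub("^[^:]*:\\s*", "", first_para), body[-(1:4)]),
                 collapse = " "),
  stringsAsFactors = FALSE
)
```

Working on lines keeps each regex short, at the cost of relying on the line positions staying fixed across files.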
read.table( file = ... , sep = "|") will solve your issue.

Resources