Wrangling Data in R - r

I'm trying to extract the graduation rate from the data below so that I can analyze it. I believe I need str_split (in R), but I don't understand what type of data this is or what all the \'s mean. I scraped it from a website using the rvest package and the code below:
library(rvest)  # also provides the %>% pipe
url <- "https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/"
grad_rate <- read_html(url) %>%
  html_nodes("script") %>%
  html_text() %>%
  purrr::pluck(9)
grad_rate
"{\"title\":\"College readiness\",\"anchor\":\"College_readiness\",\"analytics_id\":\"CollegeReadiness\",\"subtitle\":\"Learn more about how to help your child graduate ready for college. \\u003ca href=\\\"/gk/articles/jump-start-college-planning/\\\" target=\\\"_blank\\\"\\u003eSee how.\\u003c/a\\u003e\",\"icon_classes\":\"icon-graduation\",\"info_text\":\"\\u003cp\\u003eThis rating shows how well students at this school are prepared for college compared to students at other schools in this state, based on key measures, like graduation rates, college entrance tests and advanced coursework when available.\\u003c/p\\u003e\\u003cp\\u003e\\u003ca href=\\\"/gk/ratings/#collegereadinessrating\\\" target=\\\"_blank\\\"\\u003eLearn more about this rating.\\u003c/a\\u003e\\u003c/p\\u003e\\n\",\"rating\":9,\"sources\":\"\\u003cdiv class=\\\"sourcing\\\"\\u003e\\u003ch1\\u003eGreatSchools profile data sources \\u0026amp; information\\u003c/h1\\u003e\\u003cdiv\\u003e\\u003ch4 \\u003eGreatSchools College Readiness Rating\\u003c/h4\\u003e\\u003cp\\u003eThe College Readiness Rating uses this high school's graduation rates, college entrance exam participation and performance, or AP, IB, or Dual Enrollment participation and AP performance to determine how well schools are preparing students for success in college and beyond. 
The College Readiness Rating was created using 2015 4-year high school graduation rate data from MSDE, using 2016 demographic data from NCES, and the following data from the 2016 Civil Rights Data Collection: percentage of students enrolled in IB, AP or Dual Enrollment classes in grades 9-12, and percentage of students passing 1 or more AP exams grades 9-12.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: GreatSchools; this rating was calculated in 2019 | \\u003cspan class=\\\"emphasis\\\"\\u003eSee more\\u003c/span\\u003e: \\u003ca href=\\\"/gk/ratings/#collegereadinessrating\\\"; target=\\\"_blank\\\"\\u003eAbout this rating\\u003c/a\\u003e\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003e4-year high school graduation rate\\u003c/h4\\u003e\\u003cp\\u003eGraduation rates reflect how many students graduate from this school on time.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: MSDE, 2015\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003eAP course participation\\u003c/h4\\u003e\\u003cp\\u003eAdvanced Placement classes are college-level courses students can take in high school. The percentage of students taking AP classes may reflect whether the school culture is focused on college.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2016\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003ePercentage of students passing 1 or more AP exams grades 9-12\\u003c/h4\\u003e\\u003cp\\u003eThe AP exam pass rate reflects how many students at this school earned a passing score on at least one AP exam. 
Students who do well on AP exams (passing with a score of 3, 4, or 5) may receive college credit.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2016\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003ePercentage of students enrolled in Dual Enrollment classes grades 9-12\\u003c/h4\\u003e\\u003cp\\u003eThe Dual Enrollment participation rate reflects the percentage of students at this school who are taking college courses while in high school. Credits for these courses apply both to high school diploma requirements and college graduation requisites.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2016\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003ePercentage of students enrolled in IB grades 9-12\\u003c/h4\\u003e\\u003cp\\u003eInternational Baccalaureate (IB) is an internationally recognized, high-standards program that emphasizes creative and critical thinking. A high school may have specific IB classes students can take, or a school-wide IB program that affects all classes. Some colleges give college credit for IB courses. 
\\u003ca href='/gk/articles/what-is-ib-international-baccalaureate/' target='_blank'\\u003eMore about IB\\u003c/a\\u003e\\n\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2016\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003eSAT/ACT participation rate\\u003c/h4\\u003e\\u003cp\\u003eThe SAT/ACT participation rate shows the percentage of eligible students in grades 11 or 12 at this school who took the SAT or ACT.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2014\\u003c/p\\u003e\\u003c/div\\u003e\\u003c/div\\u003e\",\"feedback\":{\"feedback_cta\":\"Did you find the information about college success useful? What can we do better?\",\"feedback_link\":\"https://s.qualaroo.com/45194/cb0e676f-324a-4a74-bc02-72ddf1a2ddd6?school=115\\u0026state=MD\",\"button_text\":\"Answer\"},\"share_content\":\"\\u003cdiv class=\\\"sharing-modal\\\"\\u003e\\u003cdiv class=\\\"sharing-row js-emailSharingLinks js-slTracking\\\" data-url=\\\"https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/?utm_source=profile\\u0026utm_medium=Email\\u0026subject=Severna+Park+High+School+-+College+readiness\\u0026body=Check+out+the+Severna+Park+High+School+-+College+readiness%250D%250A#College_readiness\\\" data-type=\\\"Email\\\" data-module=\\\"College_readiness\\\" data-link=\\\"mailto:?subject=Severna Park High School - College readiness\\u0026body=Check out the Severna Park High School - College readiness%0D%0Ahttps://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School//?utm_source=profile%26utm_medium=email#College_readiness\\\"\\u003e\\u003cdiv class=\\\"sharing-icon-box\\\"\\u003e\\u003cspan class=\\\"icon-mail\\\"\\u003e\\u003c/span\\u003e\\u003c/div\\u003e\\u003cspan class=\\\"sharing-row-text\\\"\\u003eEmail\\u003c/span\\u003e\\u003c/div\\u003e\\u003cdiv 
class=\\\"sharing-row js-sharingLinks js-slTracking\\\" data-url=\\\"https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/?utm_source=profile\\u0026utm_medium=Facebook#College_readiness\\\" data-siteparams=\\\"\\u0026t=Severna Park High School - College readiness\\\" data-type=\\\"Facebook\\\" data-module=\\\"College_readiness\\\" data-link=\\\"https://www.facebook.com/sharer/sharer.php?u=\\\"\\u003e\\u003cdiv class=\\\"sharing-icon-box\\\"\\u003e\\u003cspan class=\\\"icon-facebook\\\"\\u003e\\u003c/span\\u003e\\u003c/div\\u003e\\u003cspan class=\\\"sharing-row-text\\\"\\u003eFacebook\\u003c/span\\u003e\\u003c/div\\u003e\\u003cdiv class=\\\"sharing-row js-sharingLinks js-slTracking\\\" data-url=\\\"https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/?utm_source=profile\\u0026utm_medium=Twitter#College_readiness\\\" data-siteparams=\\\"\\u0026via=GreatSchools\\u0026text=Severna Park High School - College readiness\\\" data-type=\\\"Twitter\\\" data-module=\\\"College_readiness\\\" data-link=\\\"https://twitter.com/share?url=\\\"\\u003e\\u003cdiv class=\\\"sharing-icon-box\\\"\\u003e\\u003cspan class=\\\"icon-twitter\\\"\\u003e\\u003c/span\\u003e\\u003c/div\\u003e\\u003cspan class=\\\"sharing-row-text\\\"\\u003eTwitter\\u003c/span\\u003e\\u003c/div\\u003e\\u003cdiv class=\\\"sharing-row\\\"\\u003e\\u003cdiv class=\\\"sharing-icon-box\\\"\\u003e\\u003cspan class=\\\"icon-link\\\"\\u003e\\u003c/span\\u003e\\u003c/div\\u003e\\u003cspan class=\\\"sharing-row-text\\\"\\u003ePermalink\\u003c/span\\u003e\\u003cdiv\\u003e\\u003cinput class=\\\"permalink js-permaLink js-slTracking\\\" type=\\\"text\\\" value=\\\"https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/?utm_source=profile\\u0026utm_medium=Permalink#College_readiness\\\" /\\u003e\\u003cspan class=\\\"acknowledgement\\\"\\u003eCopied to 
clipboard\\u003c/span\\u003e\\u003c/div\\u003e\\u003c/div\\u003e\\u003c/div\\u003e\",\"data\":[{\"title\":\"College readiness\",\"anchor\":\"College_readiness\",\"data\":[{\"narration\":\"\\u003cdiv class=\\\"auto-narration\\\"\\u003e \\u003ch3 class=\\\"positive\\\"\\u003eGood news!\\u003c/h3\\u003e \\u003cp\\u003eThis school is \\u003cspan class=\\\"emphasis\\\"\\u003efar above\\u003c/span\\u003e the state average in key measures of college and career readiness.\\u003c/p\\u003e \\u003cp\\u003eEven at schools with strong college and career readiness, there may be students who are not getting the opportunities they need to succeed.\\u003c/p\\u003e \\u003chr /\\u003e \\u003cp class=\\\"parent-tip\\\"\\u003e\\u003cimg src='/assets/school_profiles/owl.png' /\\u003e\\u003cspan class=\\\"speech-bubble left\\\"\\u003eParent tip\\u003c/span\\u003e\\u003c/p\\u003e \\u003cp class=\\\"footnote\\\"\\u003eAsk the school what it’s doing to help all students succeed in advanced classes and prepare for \\u003ca href=\\\"/gk/articles/improving-sat-scores/\\\"\\u003ecollege entrance tests\\u003c/a\\u003e.\\u003c/p\\u003e \\u003c/div\\u003e\\n\",\"title\":\"College readiness\",\"values\":[{\"label\":\"94\",\"score\":93,\"breakdown\":\"4-year high school graduation rate\",\"state_average\":86,\"state_average_label\":\"87\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"Graduation rates reflect how many students graduate from this school on time.\"},{\"label\":\"51\",\"score\":50,\"breakdown\":\"AP course participation\",\"state_average\":26,\"state_average_label\":\"27\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"Advanced Placement classes are college-level courses students can take in high school. 
The percentage of students taking AP classes may reflect whether the school culture is focused on college.\"},{\"label\":\"73\",\"score\":72,\"breakdown\":\"Percentage of students passing 1 or more AP exams grades 9-12\",\"state_average\":62,\"state_average_label\":\"63\",\"display_type\":\"bar\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"The AP exam pass rate reflects how many students at this school earned a passing score on at least one AP exam. Students who do well on AP exams (passing with a score of 3, 4, or 5) may receive college credit.\"},{\"label\":\"6\",\"score\":5,\"breakdown\":\"Percentage of students enrolled in Dual Enrollment classes grades 9-12\",\"state_average\":2,\"state_average_label\":\"3\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"The Dual Enrollment participation rate reflects the percentage of students at this school who are taking college courses while in high school. Credits for these courses apply both to high school diploma requirements and college graduation requisites.\"},{\"label\":\"\\u003c1\",\"score\":0,\"breakdown\":\"Percentage of students enrolled in IB grades 9-12\",\"state_average\":2,\"state_average_label\":\"2\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"International Baccalaureate (IB) is an internationally recognized, high-standards program that emphasizes creative and critical thinking. A high school may have specific IB classes students can take, or a school-wide IB program that affects all classes. Some colleges give college credit for IB courses. 
\\u003ca href='/gk/articles/what-is-ib-international-baccalaureate/' target='_blank'\\u003eMore about IB\\u003c/a\\u003e\\n\"},{\"label\":\"93\",\"score\":93,\"breakdown\":\"SAT/ACT participation rate\",\"state_average\":57,\"state_average_label\":\"57\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"The SAT/ACT participation rate shows the percentage of eligible students in grades 11 or 12 at this school who took the SAT or ACT.\"}]}]}],\"showTabs\":false,\"faq\":{\"cta\":\"Notice something missing or confusing?\",\"content\":\"\\u003cp\\u003eCollege readiness information comes from state or national education agencies (click on the \\\"Sources\\\" link for details).\\u003c/p\\u003e \\u003cp\\u003eWhen information is missing in our display, it's most likely because this school did not offer an AP course, IB or dual enrollment classes, or participate in one of the two college readiness tests, the ACT or SAT (some states mandate which college readiness test schools use). It's also possible that the missing data was not included in the data we received from the state.\\u003c/p\\u003e \\u003cp\\u003eDid you find the information about college readiness useful? What can we do better? \\u003ca href=\\\"https://s.qualaroo.com/45194/34aea707-ec71-4130-b6bb-2864e0528c64\\\" target=\\\"_blank\\\"\\u003eShare your feedback.\\u003c/a\\u003e\\u003c/p\\u003e \\u003cp\\u003e\\u003ca href=\\\"/gk/ratings/#collegereadinessrating\\\" target=\\\"_blank\\\"\\u003eLearn more about this rating.\\u003c/a\\u003e\\u003c/p\\u003e \\u003cp\\u003eStill have questions? 
\\u003ca href=\\\"https://greatschools.zendesk.com/hc/en-us\\\" target=\\\"_blank\\\"\\u003eVisit our FAQ page.\\u003c/a\\u003e\\u003c/p\\u003e\\n\",\"element_type\":\"faq\"},\"no_data_summary\":\"This section includes information about this school’s graduation rates, SAT/ACT tests, and AP coursework.\\n\",\"qualaroo_module_link\":\"https://s.qualaroo.com/45194/34aea707-ec71-4130-b6bb-2864e0528c64?state=MD\\u0026school=115\"}"
Thanks for any help!

The good news is that you can easily extract this value from the response text with a regex:
library(rvest)
library(magrittr)
library(stringr)
p <- read_html('https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/') %>% html_text()
rate <- str_match_all(p,'"College readiness","values":\\[\\{"label":"(.*?)"')[[1]][,2][1]
print(as.numeric(rate))
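On the question of what the \'s mean: they are escape characters that R adds when it prints a string, and the string itself is JSON (the \u003c sequences are JSON's escaped form of <). Here is a minimal base R sketch of the same regex idea on a made-up, shortened sample; json_txt and the other variable names are just assumptions for illustration, not the real page content.

```r
# Hypothetical, shortened sample with the same shape as the scraped string
json_txt <- '{"title":"College readiness","values":[{"label":"94","score":93,"breakdown":"4-year high school graduation rate"}]}'

# print() would show each inner quote as \" but the backslash is not stored
n_chars <- nchar("\"")   # a single character

# Capture the first "label" value that follows "values":[{
pattern <- '"values":\\[\\{"label":"([^"]*)"'
m <- regmatches(json_txt, regexec(pattern, json_txt))

# m[[1]][1] is the full match, m[[1]][2] is the captured group
grad_label <- as.numeric(m[[1]][2])
grad_label
```

Since the text is JSON, a JSON parser such as jsonlite::fromJSON() would probably be more robust than a regex if the page layout changes.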

Related

Extract the paragraphs from a PDF that contain a keyword using R

I need to extract from a PDF file the paragraphs that contain a keyword. I have tried various snippets of code, but none of them returned anything.
I have seen this code from user @Tyler Rinker (Extract before and after lines based on keyword in Pdf using R programming), but it extracts only the line containing the keyword plus the lines before and after it.
library(textreadr)
library(tidyverse)
loc <- function(var, regex, n = 1, ignore.case = TRUE){
  locs <- grep(regex, var, ignore.case = ignore.case)
  out <- sort(unique(c(locs - 1, locs, locs + 1)))
  out <- out[out > 0]
  out[out <= length(var)]
}
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
  read_pdf() %>%
  slice(loc(text, 'cancer'))
However, I need to get the paragraphs and store each one in a row in my database. Could you help me?
Lines inside a paragraph will all be fairly long, except for the final line of each paragraph. We can count the characters in each line and plot a histogram to show this:
library(textreadr)
doc <- read_pdf('https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf')
hist(nchar(doc$text), 20)
So anything less than about 75 characters is either not in a paragraph or at the end of a paragraph. We can therefore stick a line break on the short ones, paste all the lines together, then split on linebreaks:
doc$text[nchar(doc$text) < 75] <- paste0(doc$text[nchar(doc$text) < 75], "\n")
txt <- paste(doc$text, collapse = " ")
txt <- strsplit(txt, "\n")[[1]]
So now we can just run our regex to find the paragraphs containing the keyword:
grep("cancer", txt, value = TRUE)
#> [1] " Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but stresses that, in order for them to work, they should be voluntary, and the government should exempt all life-saving drugs from import duties and other taxes such as excise duty and VAT. He is, however, critical about a proposal for mandatory price negotiation of newly patented drugs. He feels this will erode India's credibility in implementing the Patent Act in © 2006 KPMG International. KPMG International is a Swiss cooperative that serves as a coordinating entity for a network of independent firms operating under the KPMG name. KPMG International provides no services to clients. Each member firm of KPMG International is a legally distinct and separate entity and each describes itself as such. All rights reserved. Collaboration for Growth 24"
#> [2] " a fair and transparent manner. To deal with diabetes, medicines are not the only answer; awareness about the need for lifestyle changes needs to be increased, he adds. While industry leaders have long called for the development of PPPs for the provision of health care in India, particularly in rural areas, such initiatives are currently totally unexplored. However, the government's 2006 draft National Pharmaceuticals Policy proposes the introduction of PPPs with drug manufacturers and hospitals as a way of vastly increasing the availability of medicines to treat life-threatening diseases. It notes, for example, that while an average estimate of the value of drugs to treat the country's cancer patients is $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the near non-accessibility of the medicines to a vast majority of the affected population, mainly because of the high cost of these medicines,” says the Policy, which also calls for tax and excise exemptions for anti-cancer drugs."
#> [3] " 50.1 percent of Aventis Pharma is held by European drug major Sanofi-Aventis and, in early April 2006, it was reported that UB Holdings had sold its 10 percent holding in the firm to Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective, anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1 million, with domestic sales up 9.1 percent at $129.8 million and exports increasing 12 percent to $51.2 million. Sales were led by 83 percent annual growth for the diabetes treatment Lantus (insulin glargine), followed by the rabies vaccine Rabipur (+22 percent), the diabetes drug Amaryl (glimepiride) and epilepsy treatment Frisium (clobazam), both up 18 percent, the angiotensin-coverting enzyme inhibitor Cardace (ramipril +15 percent), Clexane (enoxaparin), an anticoagulant, growing 14 percent and Targocid (teicoplanin), an antibiotic, whose sales advanced 8 percent."
Created on 2020-09-16 by the reprex package (v0.3.0)
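To store each matching paragraph in its own row, the same steps can be tried on a small made-up vector of lines first. This is only a sketch: the lines, the threshold of 40 characters, and the column name paragraph are assumptions chosen for the toy data, not values taken from the PDF.

```r
# Hypothetical lines standing in for doc$text
lines <- c(
  "This first line of a paragraph is long enough to be treated as body text here,",
  "and it ends on a short line.",
  "A second paragraph also starts with a long line that mentions cancer therapies,",
  "then stops."
)

# Short lines are assumed to close a paragraph, so mark them with a line break
threshold <- 40
short <- nchar(lines) < threshold
lines[short] <- paste0(lines[short], "\n")

# Join everything, then split again on the markers to get one string per paragraph
paragraphs <- strsplit(paste(lines, collapse = " "), "\n")[[1]]
paragraphs <- trimws(paragraphs)

# Keep the paragraphs containing the keyword, one per row
hits <- grep("cancer", paragraphs, value = TRUE)
result <- data.frame(paragraph = hits, stringsAsFactors = FALSE)
result
```

The resulting data frame has one row per matching paragraph, so it can be written to a database table as is.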

filtering text and storing the filtered sentence/paragraph into a new column

I am trying to extract some sentences from text data, specifically the sentences that contain the phrase medical device company released. I can run the following code:
df_text <- unlist(strsplit(df$TD, "\\."))
df_text
df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text
Which gives me:
[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"
So I extracted the sentences that contain the phrase medical device company released. However, I want to store the result in a new column, keeping track of which grp each sentence came from.
Expected output:
grp   TD    newCol
3613  text  NA                               # does not contain the sentence
4973  text  medical device company released
5570  text  NA                               # does not contain the sentence
Data:
df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. (BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley#dowjones.com [ 02-10-09 1437ET ]\n ",
" --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n By Anjali Athavaley \n Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. 
approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley#dowjones.com [ 12-09-11 0924ET ]\n ",
" In September, the company issued a guidance range of 43 cents to 44 cents a share. \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share. \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ] \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion. \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion. \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at . \n\nThe company plans to report third-quarter earnings Oct. 14. \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68. \n\nCompany Web site: \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires#DowJones.com \n\nOrder free Annual Report for General Electric Co. \n\nVisit or call 1-888-301-0513 [ 10-06-05 0904ET ] \n "
)), class = "data.frame", row.names = c(NA, -3L))
We can split the data into separate rows, one per sentence, keeping grp intact, and then keep only the sentences that contain "medical device company released":
library(dplyr)
df %>%
  tidyr::separate_rows(TD, sep = "\\.") %>%
  group_by(grp) %>%
  summarise(newCol = toString(grep(pattern = "medical device company released",
                                   TD, ignore.case = TRUE, value = TRUE)))
# grp newCol
# <chr> <chr>
#1 3613 ""
#2 4973 "\n\nThe medical device company released its financia…
#3 5570 ""
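If you would rather keep the original TD column and get NA instead of "" for rows without a match, a base R variant along these lines might work. It is a sketch on a made-up two-row data frame; df_small and its contents are assumptions for illustration.

```r
# Hypothetical miniature version of df
df_small <- data.frame(
  grp = c("a1", "b2"),
  TD = c("Sales rose. Shares fell.",
         "Profit grew.\n\nThe medical device company released its outlook. More text."),
  stringsAsFactors = FALSE
)

pattern <- "medical device company released"

# Split each document into sentences, then keep the matching ones (or NA)
df_small$newCol <- vapply(strsplit(df_small$TD, "\\."), function(sentences) {
  hit <- grep(pattern, sentences, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) NA_character_ else toString(trimws(hit))
}, character(1))

df_small[, c("grp", "newCol")]
```

Note that splitting on "\\." also breaks on abbreviations such as "U.S.", exactly as in the approach above, so sentences containing them will be cut short.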

how to delete documents in corpus that are similar

I have a corpus of news articles on a given topic. Some of these articles are the exact same article but have been given additional headers and footers that very slightly change the content. I am trying to delete all but one of the potential duplicates so the final corpus only contains unique articles.
I decided to use cosine similarity to identify the potential duplicates:
myDfm <- dfm(as.character(docs$text_main), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, selection=docnames(myDfm), margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim)
After looking at a subset of the data, I chose a cutoff of .9 cosine similarity or above to indicate duplicates. (I am okay with any error that this introduces.) Given this, I have set the diagonal to 0 (i.e., not a dup) and recoded the matrix to indicate which documents are duplicates and which are not:
diag(cosinemat) <- 0
cosinemat[cosinemat >= .9] <- 1
cosinemat[cosinemat < .9] <- 0
The problem I'm running into is figuring out how to delete all but one of the duplicate documents. Initially, I envisioned a for loop that goes through each column cell by cell and, for any cell with a value of 1 (i.e., a duplicate), deletes the column with the same name as the row of the current cell, reconstitutes the matrix, and continues to the next cell. The for loop doesn't seem to like the line of code that deletes the columns with the name of the current row when the cell is equal to 1. I'm also not sure it's okay to reconstitute the object you're looping through. Something like this:
cosine_df <- as.data.frame(cosinemat)
for(col in 1:ncol(cosine_df)){
  for(row in 1:nrow(cosine_df)){
    if(cosine_df[col,row] == 0){
      next
    }
    if(cosine_df[col,row] == 1){
      cosine_df <- cosine_df[!rownames(cosine_df) %in% paste(rownames(cosine_df)[col,row]]
    }
  }
}
I'm not set on this approach, and I'm open to creative solutions, so long as I am able to identify similar documents and to delete all but one document.
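One loop-free direction that might work is to look only at the upper triangle of the similarity matrix and drop every document that is similar to an earlier one. The sketch below uses a made-up 4 x 4 matrix rather than the real cosinemat; sim, the doc names, and the 0.9 cutoff applied here are assumptions for illustration.

```r
# Hypothetical cosine similarity matrix (symmetric, 1 on the diagonal)
sim <- matrix(c(1.00, 0.20, 0.10, 0.15,
                0.20, 1.00, 0.95, 0.30,
                0.10, 0.95, 1.00, 0.25,
                0.15, 0.30, 0.25, 1.00),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), paste0("doc", 1:4)))

# Zero the diagonal and lower triangle so each pair is considered only once
sim[lower.tri(sim, diag = TRUE)] <- 0

# A document is flagged when it is similar to any document that comes before it
is_dup <- apply(sim >= 0.9, 2, any)

# Names of the documents to keep
keep <- names(is_dup)[!is_dup]
keep
```

Because nothing is deleted while iterating, there is no need to rebuild the matrix mid-loop; the names in keep could then be used to subset the corpus. In a chain where A matches B and B matches C, only A survives, which may or may not be what you want.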
Here's a subset of the documents if it helps:
docs <- structure(list(text_main = c("Congressional Documents and PublicationsMay 26, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:287 wordsBody(Washington, DC) Reps. Ted Deutch (D-FL) and Gus Bilirakis (R-FL) joined with Reps. Steve Israel (D-NY), Mike Kelly (R-PA), Ted Lieu (D-CA), Adam Kinzinger (R-IL), Hakeem Jeffries (D-NY), Lee Zeldin (R-NY), and Susan Davis (D-CA) to introduce a resolution (H. Res. 750) urging the European Union (EU) to designate the entirety of Hizballah as a terrorist organization and increase pressure on the organizations and its members. Currently, the EU only designates Hizballah's military wing as a terrorist organization, while the United States makes no distinction between its military and political branches when listing the group on its Foreign Terrorist Organization list.Upon introduction, the Members of Congress released the following statement:\"Hizballah is an Iranian-backed terrorist organization with a global reach that engages in significant illicit criminal activity to fund its terrorism. It doesn't matter what part of the organization you're associated with; if you are connected with Hizballah, you are contributing to the rocket attacks on innocent Israeli civilians, targeted bombings of Jews around the world, slaughter of civilians in Syria, and destabilization of the Middle East. There is no distinction between parts of Hizballah when every part contributes to terrorism. We urge our EU allies to help rein in Hizballah's dangerous worldwide activities.\"The resolution can be viewed here .Last year, Congress passed the Hizballah International Financing Prevention Act which tightened sanctions on Hizballah's criminal and financial networks.Read this original document at: ",
"Congressional Documents and PublicationsApril 20, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:499 wordsBodyToday, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:\"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses.\"In 2004, the U.S. Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. 
The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here .Read this original document at: ",
"Targeted News ServiceApril 20, 2016 Wednesday 7:41 AM ESTCopyright 2016 Targeted News Service LLC All Rights ReservedLength:511 wordsByline:Targeted News ServiceDateline:WASHINGTON BodyRep. Ted Deutch, D-Fla. (21st CD), issued the following news release:Today, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:\"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses.\"In 2004, the U.S. 
Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here ().Contact: Jason Attermann, 202/225-3001Copyright Targeted News Services30FurigayJof-5501453 30FurigayJof",
"US Official NewsFebruary 13, 2013 WednesdayCopyright 2013 Plus Media Solutions Private Limited All Rights ReservedLength:298 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release: Rep. Ted Deutch (D-FL) and Rep. Gus Bilirakis (R-GL) issued the following statements regarding the Bulgarian governments report that two individuals responsible for the July 2012 terrorist attack on a bus in Burgas, Bulgaria, have ties to Hezbollah. Five Israeli tourists and the Bulgarian bus driver were killed in the attack.Congressman Bilirakis: The Bulgarian governments report is yet another example of Hezbollah's deliberate use of terror across the globe. Contrary to some European opinions, Hezbollah is not merely a political organization and is actively involved in terrorist activities. As I have requested many times, the European Union must finally recognize Hezbollah for what it is: a terrorist organization. I commend the Bulgarian government for their thorough investigation and call on the members of the European Union to examine these findings closely.Congressman Deutch: The results of the Bulgarian governments investigation into the deadly attack in Burgas confirms what we already knew - Hezbollah is a terrorist organization that is willing to perpetrate attacks on innocent civilians around the globe. I continue to urge our European partners to formally designate Hezbollah as a terrorist organization. Failure to do so only emboldens Hezbollah to continue its reign of terror in Europe and around the world.In September 2012, Congressmen Bilirakis and Deutch initiated a bi-partisan letter signed by 268 Members of Congress to the President and Ministers of the Commission of the European Union, urging them to include Hezbollah on the European Union's list of terrorist organizations. For further information please visit: ",
"Congressional Documents and PublicationsMay 4, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:204 wordsBodyWashington, May 4 -Rep. Ted Deutch released the following statement on the Florida legislature's passage of SB 444, which expands upon the Protecting Florida's Investments Act, legislation he authored in 2007 in the Florida State Senate:\"I applaud the Florida Legislature's passage of SB 444, legislation that will help ensure national and international security by preventing Florida's taxpayer dollars from supporting companies who choose to violate federal law by bolstering the Iranian regime. I congratulate the bill's sponsors, Sen. Ellyn Bogdanoff and Rep. Mack Bernard. This bill prevents state and local governments from awarding contracts to companies found to be investing in the Iranian energy sector. It is consistent with federal policy and sends a clear message that Floridians will not support any company that puts profit over international security. The Iranian regime continues to pursue its illicit nuclear weapons program, continues to engage in the most egregious human rights violations, and continues to support terrorism across the globe. We must continue to utilize every economic tool at our disposal to bring this regime to its knees. I urge Governor Scott to act quickly to sign this bill into law.\"",
"Congressional Documents and PublicationsMarch 23, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:128 wordsBodyBoca Raton, Mar 23 -Congressman Ted Deutch (D-FL) released the following statement in reaction to the explosion of a bomb today in Jerusalem that killed a 59-year-old woman and injured dozens more:\"Today's horrific bombing in Jerusalem is yet another attack in a surge of violence perpetuated by Palestinian terrorists against innocent Israeli citizens,\" said Congressman Ted Deutch. \"The victims of this heinous attack and the Israeli people deserve the full support of the international community as they seek to defend themselves against this relentless violence. It is deplorable that as Israelis endure this latest bombing in Jerusalem, as well as ongoing rocket attacks by Hamas, some astonishingly still seek to blame Israel for the lack of peace in the region.\"",
"States News ServiceMarch 26, 2015 ThursdayCopyright 2015 States News ServiceLength:218 wordsByline:States News ServiceDateline:WASHINGTON BodyThe following information was released by the office of Florida Rep. Ted Deutch:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\"",
"Congressional Documents and PublicationsMarch 26, 2015Copyright 2015 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:250 wordsBodyCongressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums. In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\"For a fact sheet on H.R. 2, please go to: .Read this original document at: ",
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\" In case of any query regarding this article or other content needs please contact: ",
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\" In case of any query regarding this article or other content needs please contact: "
)), row.names = c(NA, 10L), class = "data.frame", .Names = "text_main")
Here is the matrix of similarity for the same subset of documents:
cosine_df <- structure(list(text1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text2 = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 0), text3 = c(0, 1, 0, 0, 0, 0, 0, 0,
0, 0), text4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text5 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), text6 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0), text7 = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1), text8 = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 1), text9 = c(0, 0, 0, 0, 0, 0, 1, 1,
0, 1), text10 = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)), .Names = c("text1",
"text2", "text3", "text4", "text5", "text6", "text7", "text8",
"text9", "text10"), row.names = c("text1", "text2", "text3",
"text4", "text5", "text6", "text7", "text8", "text9", "text10"
), class = "data.frame")
In case anyone else has a similar problem, this was the solution I ended up creating:
library(quanteda)
myDfm <- dfm(as.character(docs$text_main), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, selection=docnames(myDfm), margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim) #this produces a matrix of the document similarities
threshold <- .9
similar_indices <- unique(apply(cosinemat, 1,
                                function(x) which(x > threshold)))
## keep only the first element of each set
## (apply() may return a list, a matrix, or a plain vector; test with
## is.list()/is.matrix(), since class() of a matrix has length 2 in R >= 4.0)
if (is.list(similar_indices)) {
  unique_indices <- unique(sapply(similar_indices, function(x) as.numeric(x[1])))
} else if (is.matrix(similar_indices)) {
  unique_indices <- unique(apply(similar_indices, 2, function(x) as.numeric(x[1])))
} else {
  unique_indices <- similar_indices
}
## get only the unique texts
docs_unique <- docs[unique_indices, , drop = FALSE]  # drop = FALSE keeps a data frame even when docs has a single column
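For anyone who only has the similarity matrix, the same idea can be sketched in base R without the list/matrix branching. This is an alternative to the code above, not the same algorithm: it keeps a document unless some earlier document is above the threshold. The matrix values and the names sim, earlier, is_duplicate and keep are made up for illustration.

```r
# Toy similarity matrix (hypothetical values): text2/text3 are near-duplicates,
# and so are text4/text5/text6.
doc_names <- paste0("text", 1:6)
sim <- matrix(0, nrow = 6, ncol = 6, dimnames = list(doc_names, doc_names))
diag(sim) <- 1
sim[2, 3] <- sim[3, 2] <- 0.95
sim[4, 5] <- sim[5, 4] <- 0.97
sim[4, 6] <- sim[6, 4] <- 0.93
sim[5, 6] <- sim[6, 5] <- 0.99

threshold <- 0.9

# Zero out the diagonal and the lower triangle, so column j only holds
# the similarities between document j and the documents before it.
earlier <- sim
earlier[lower.tri(earlier, diag = TRUE)] <- 0

# A document counts as a duplicate if any earlier document exceeds the threshold.
is_duplicate <- colSums(earlier > threshold) > 0
keep <- which(!is_duplicate)

names(keep)
```

With a data frame like docs, the retained rows would then be docs[keep, , drop = FALSE]. Note that a document is dropped when it matches any earlier document, even one that was itself dropped, which may or may not be what you want for chains of partial duplicates.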

Transforming kwic objects into single dfm

I have a corpus of newspaper articles of which only specific parts are of interest for my research. I'm not happy with the results I get from classifying texts along different frames because the data contains too much noise. I therefore want to extract only the relevant parts from the documents. I was thinking of doing so by transforming several kwic objects generated by the quanteda package into a single dfm.
So far I've tried the following:
exampletext <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
mycorpus <- corpus(exampletext)
mycorpus.nat <- corpus(kwic(mycorpus, "nationalit*", window = 5, valuetype = "glob"))
mycorpus.cit <- corpus(kwic(mycorpus, "citizenship", window = 5, valuetype = "glob"))
mycorpus.kwic <- mycorpus.nat + mycorpus.cit
mydfm <- dfm(mycorpus.kwic)
This, however, generates a dfm that contains 4 documents instead of 2, and even more when both keywords are present in the same document. I can't think of a way to bring the dfm back down to the original number of documents.
Thank you for helping me out.
We recently added a window argument to tokens_select() for this purpose:
require(quanteda)
txt <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
toks <- tokens(txt)
mt_nat <- dfm(tokens_select(toks, "nationalit*", window = 5))
mt_cit <- dfm(tokens_select(toks, "citizenship*", window = 5))
Please make sure that you are using the latest version of quanteda.
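If the goal is a single dfm with one row per original document, one option is to pass both patterns to tokens_select() in one call instead of building two dfms. A minimal sketch, assuming a quanteda version whose tokens_select() has the window argument; the two short texts and the names d1, d2, toks_ctx and mt are shortened stand-ins for the example above:

```r
library(quanteda)

# Two short stand-in documents (hypothetical, trimmed from the example texts)
txt <- c(
  d1 = "Athletes who change their nationality must wait three years before competing for that country.",
  d2 = "She moved to Britain in March and was granted British citizenship soon after."
)
toks <- tokens(txt)

# Keep each keyword plus 5 tokens on either side, for both patterns at once
toks_ctx <- tokens_select(toks, pattern = c("nationalit*", "citizenship*"), window = 5)

# One dfm; documents are not split or multiplied by the selection
mt <- dfm(toks_ctx)
ndoc(mt)
```

Because the selection works on the tokens of each document rather than on kwic rows, the document count should stay at two here, however many keywords a document contains.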

error reading text file into new columns of a dataframe using some text editing

I have a text file (0001.txt) which contains the data as below:
<DOC>
<DOCNO>1100101_business_story_11931012.utf8</DOCNO>
<TEXT>
The Telegraph - Calcutta (Kolkata) | Business | Local firms go global
6 Local firms go global
JAYANTA ROY CHOWDHURY
New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.
Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.
Though the first-quarter investment was 15 per cent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
According to analysts, confidence in global recovery, cheap corporate buys abroad and easier rules governing investment overseas had spurred flow of capital and could see total investment abroad top $12 billion this year and rise to $18-20 billion next fiscal.
For example, Titagarh Wagons plans to expand abroad on the back of the proposed Asian railroad project.
We plan to travel all around the world with the growth of the railroads, said Umesh Chowdhury of Titagarh Wagons.
India is full of opportunities, but we are all also looking at picks abroad, said Gautam Mitra, managing director of Indian Structurals Engineering Company.
Mitra plans to open a holding company in Switzerland to take his business in structurals to other Asian and African countries.
Indian companies created 3 lakh jobs in the US, while contributing $105 billion to the US economy between 2004 and 2007, according to commerce ministry statistics. During 2008-09, Singapore, the Netherlands, Cyprus, the UK, the US and Mauritius together accounted for 81 per cent of the total outward investment.
Bose said, And not all of it is organic growth. Much of our investment abroad reflects takeovers and acquisitions.
In the last two years, Suzlon acquired Portugals Martifers stake in German REpower Systems for $122 million. McNally Bharat Engineering has bought the coal and minerals processing business of KHD Humboldt Wedag. ONGC bought out Imperial Energy for $2 billion.
Indias foreign assets and liabilities today add up to more than 60 per cent of its gross domestic product. By the end of 2008-09, total foreign investment was $67 billion, more than double of that at the end of March 2007.
</TEXT>
</DOC>
All of the text above sits between the <TEXT> and </TEXT> tags.
I want to read it into an R data frame with four columns, like this:
Title | Author | Date | Text
The Telegraph - Calcutta (Kolkata) | JAYANTA ROY CHOWDHURY | Dec. 31 | Indian companies are stepping out of their homes to try their luck on foreign shores. Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09. Though the first-quarter investment was 15 percent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
This is what I tried, using dplyr and readr:
# read text file
library(dplyr)
library(readr)
dat <- read_csv("0001.txt") %>% slice(-8)
# print part of data frame
head(dat, n=2)
In the code above, I tried to skip the first few lines of the text file (which are not important) and then read the rest into a data frame.
But I did not get what I was looking for, and I can't tell what I am doing wrong.
Could someone please help?
To be able to read data into R as a data frame or table, the data needs to have a consistent structure maintained by separators. One of the most common formats is a file with comma separated values (CSV).
The data you're working with doesn't have separators though. It's essentially a string with minimally enforced structure. Because of this, it sounds like the question is more related to regular expressions (regex) and data mining than it is to reading text files into R. So I'd recommend looking into those two things if you do this task often.
That aside, to do what you're wanting in this example, I'd recommend reading the text file into R as a single string of text first. Then you can parse the data you want using regex. Here's a basic, rough draft of how to do that:
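fileName <- "Path/to/your/data/0001.txt"
# read the whole file into a single string
string <- readChar(fileName, file.info(fileName)$size)
# each column is carved out of that same string with its own pattern
df <- data.frame(
  # drop everything from the first " |" onward
  Title=sub("\\s+[|]+(.*)","",string),
  # capture group 2: the upper-case name
  Author=gsub("(.*)+?([A-Z]{2,}.*[A-Z]{2,})+(.*)","\\2",string),
  # capture group 2: abbreviated month, period, day (e.g. "Dec. 31")
  Date=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+(.*)","\\2",string),
  # capture group 3: everything after the date and the colon
  Text=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+[: ]+(.*)","\\3",string))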
Output:
str(df)
'data.frame': 1 obs. of 4 variables:
$ Title : chr "The Telegraph - Calcutta (Kolkata)"
$ Author: chr "JAYANTA ROY CHOWDHURY"
$ Date : chr "Dec. 31"
$ Text : chr "Indian companies are stepping out of their homes to"| __truncated__
The reason why regex can be useful is that it allows for very specific patterns in strings. The downside is when you're working with strings that keep changing formats. That will likely mean some slight adjustments to the regex used.
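If the files always follow the same line layout, a line-based variant may be easier to adjust than one long regex per column. The sketch below is only an illustration under that assumption: it hard-codes that, inside <TEXT>, line 1 is the title line, line 3 the author and line 4 the dateline, which may not hold for other files. The lines vector is a shortened inline copy of the sample; for a real file it would come from readLines("0001.txt").

```r
# Shortened inline copy of the sample file
lines <- c(
  "<DOC>",
  "<DOCNO>1100101_business_story_11931012.utf8</DOCNO>",
  "<TEXT>",
  "The Telegraph - Calcutta (Kolkata) | Business | Local firms go global",
  "6 Local firms go global",
  "JAYANTA ROY CHOWDHURY",
  "New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.",
  "Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.",
  "</TEXT>",
  "</DOC>"
)

# Keep only the lines between <TEXT> and </TEXT>
start <- which(lines == "<TEXT>")
end <- which(lines == "</TEXT>")
body <- lines[(start + 1):(end - 1)]

# Assumed positions: title line, (headline), author, dateline, then paragraphs
title <- trimws(sub("\\|.*$", "", body[1]))   # text before the first "|"
author <- body[3]
dateline <- body[4]

# Date such as "Dec. 31": capital letter, two lower-case letters, period, day
date <- regmatches(dateline, regexpr("[A-Z][a-z]{2}\\. [0-9]{1,2}", dateline))

# Article text: the dateline after its first colon, plus the remaining paragraphs
first_par <- sub("^[^:]*:\\s*", "", dateline)
text <- paste(c(first_par, body[-(1:4)]), collapse = " ")

df <- data.frame(Title = title, Author = author, Date = date, Text = text,
                 stringsAsFactors = FALSE)
str(df)
```

The trade-off is the same as with the regex version: it is simple to read, but it breaks as soon as a file inserts or drops a line before the dateline.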
read.table( file = ... , sep = "|") will solve your issue.

Resources