This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 1 year ago.
I have a collection of texts which are organised in a data frame in the following way:
I would need the texts to be organised in the following way:
I have been through a lot of previous questions here, but all of the merging suggested involves calculations, which is not the case here. I have also consulted the tidytext package but could not find a function to merge text in this way.
Any help is appreciated.
Edit
A piece of the actual data frame would be:
dput(df1)
structure(list(Title = c("Immigrants five times better off in Britain - Daily Star",
"Immigrants five times better off in Britain - Daily Star", "Immigrants five times better off in Britain - Daily Star",
"Immigrants five times better off in Britain - Daily Star", "Immigrants five times better off in Britain - Daily Star",
"Immigrants five times better off in Britain - Daily Star", "Immigrants five times better off in Britain - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star"
), Content = c("IMMIGRANTS from Romania and Bulgaria would be five times better off if they moved to Britain.",
"Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox!",
"Related content", "And families with two kids would be nine times richer, according to shock new figures.",
"From 2014, the 29 million citizens of Romania and Bulgaria become eligible to live anywhere in Europe – and there are fears that millions will be heading to the UK.",
"Migration Watch UK says our minimum wage of £254 a week compares to an average £55 a week in those countries.",
"Chairman Sir Andrew Green said: “Given the incentives, it would be absurd to suggest that there will not be a significant inflow.”",
"US President-elect Donald Trump has reaffirmed plans to deport millions of illegal immigrants from America in a bold statement to the world.",
"Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox!",
"The 70-year-old billionaire will promise to tackle criminals who were illegally living in America in a broadcast due to be aired later this evening.",
"Appearing in his first tv interview since his shocking election win, Trump said that two to three million immigrants with criminal records in the US would either be jailed or deported.",
"He told CBS show 60 Minutes: \"What we are going to do is get the people that are criminal and have criminal records, gang members, drug dealers. where a lot of these people, probably two million, it could even be three million, we are getting them out of our country, they're here illegally.",
"\"After the border is secure and after everything gets normalised, we're going to make a determination on the people that they're talking about who are terrific people, they're terrific people, but we are gonna (sic) make a determination at that.",
"\"But before we make that determination, it's very important, we are going to secure our border.\"",
"Trump also confirmed plans were underway to construct a \"great wall\" on the US-Mexican border.",
"A spokeswoman for Mr Trump yesterday confirmed that the 70-year-old tycoon had set up a taskforce to begin plans on constructing the wall, which could cost as much as £9.3billion.",
"But the President-elect did concede that parts of the wall may have to be a fence.",
"When asked if he would accept a fence, Trump said: \"For certain areas I would, but certain areas, a wall is more appropriate. I’m very good at this, it’s called construction.\"",
"Congressman Louie Gohmert confirmed yesterday that Trump's wall would is only likely to stretch for “around half” the length of the border, which spans California, Arizona, New Mexico and Texas.",
"Plans to build the wall has seen widespread protests across the US, with demonstrators taking to the streets to protest about their new president.",
"Scores have been arrested and a man was shot in Portland, Oregon, following an argument between activists.",
"In Los Angeles, officers were scouring the route of an earlier protest after an undercover officer lost his gun and handcuffs during a scuffle.",
"THOUSANDS of immigrants are getting access to UK state handouts as soon as they arrive thanks to an EU loophole.",
"Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox!",
"Related content", "In the past five years 100,000 wives, husbands and children of EU citizens have moved to Britain under a lax system that bypasses rules for Brits.",
"British people who want close family from outside Europe to move to the UK have to prove they earn around £18,000 a year before they get visas.",
"But separate rules for EU citizens mean they do not have to bring in the same wages before flying in relatives. They then get the same right to benefits as unemployed Brits.",
"Sir Andrew Green, chairman of Migration Watch, said: “This is a loophole that must be closed.",
"“It is absurd that EU citizens should be in a more favourable position than our own citizens.”"
)), row.names = c(NA, -30L), class = c("tbl_df", "tbl", "data.frame"
))
Thanks
PS.: Sorry for the images, the system did not allow me to add actual tables.
We can use
aggregate(Text ~ Book, df1, FUN = paste, collapse = ' ')
-output
Book Text
1 Book1 Text.a Text.b
2 Book2 Text.c Text.d
For the OP's data
aggregate(Content ~ Title, df1, FUN = paste, collapse = ' ')
-output
Title
1 Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star
2 Immigrants five times better off in Britain - Daily Star
3 Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star
Content
1 US President-elect Donald Trump has reaffirmed plans to deport millions of illegal immigrants from America in a bold statement to the world. Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox! The 70-year-old billionaire will promise to tackle criminals who were illegally living in America in a broadcast due to be aired later this evening. Appearing in his first tv interview since his shocking election win, Trump said that two to three million immigrants with criminal records in the US would either be jailed or deported. He told CBS show 60 Minutes: "What we are going to do is get the people that are criminal and have criminal records, gang members, drug dealers. where a lot of these people, probably two million, it could even be three million, we are getting them out of our country, they're here illegally. "After the border is secure and after everything gets normalised, we're going to make a determination on the people that they're talking about who are terrific people, they're terrific people, but we are gonna (sic) make a determination at that. "But before we make that determination, it's very important, we are going to secure our border." Trump also confirmed plans were underway to construct a "great wall" on the US-Mexican border. A spokeswoman for Mr Trump yesterday confirmed that the 70-year-old tycoon had set up a taskforce to begin plans on constructing the wall, which could cost as much as £9.3billion. But the President-elect did concede that parts of the wall may have to be a fence. When asked if he would accept a fence, Trump said: "For certain areas I would, but certain areas, a wall is more appropriate. I’m very good at this, it’s called construction." Congressman Louie Gohmert confirmed yesterday that Trump's wall would is only likely to stretch for “around half” the length of the border, which spans California, Arizona, New Mexico and Texas. 
Plans to build the wall has seen widespread protests across the US, with demonstrators taking to the streets to protest about their new president. Scores have been arrested and a man was shot in Portland, Oregon, following an argument between activists. In Los Angeles, officers were scouring the route of an earlier protest after an undercover officer lost his gun and handcuffs during a scuffle.
2 IMMIGRANTS from Romania and Bulgaria would be five times better off if they moved to Britain. Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox! Related content And families with two kids would be nine times richer, according to shock new figures. From 2014, the 29 million citizens of Romania and Bulgaria become eligible to live anywhere in Europe – and there are fears that millions will be heading to the UK. Migration Watch UK says our minimum wage of £254 a week compares to an average £55 a week in those countries. Chairman Sir Andrew Green said: “Given the incentives, it would be absurd to suggest that there will not be a significant inflow.”
3 THOUSANDS of immigrants are getting access to UK state handouts as soon as they arrive thanks to an EU loophole. Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox! Related content In the past five years 100,000 wives, husbands and children of EU citizens have moved to Britain under a lax system that bypasses rules for Brits. British people who want close family from outside Europe to move to the UK have to prove they earn around £18,000 a year before they get visas. But separate rules for EU citizens mean they do not have to bring in the same wages before flying in relatives. They then get the same right to benefits as unemployed Brits. Sir Andrew Green, chairman of Migration Watch, said: “This is a loophole that must be closed. “It is absurd that EU citizens should be in a more favourable position than our own citizens.”
Or this can be done in tidyverse
library(dplyr)
library(stringr)
df1 %>%
  group_by(Title) %>%
  summarise(Content = str_c(Content, collapse = " "), .groups = 'drop')
data
df1 <- structure(list(Book = c("Book1", "Book1", "Book2", "Book2"),
                      Text = c("Text.a", "Text.b", "Text.c", "Text.d")),
                 class = "data.frame", row.names = c(NA, -4L))
I would like to load multiple pages from a single website and extract specific attributes from different classes, as shown below. Then I would like to create a dataframe with the parsed information from all the pages.
Extract from multiple pages
for page in range(1,10):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    soup = bs(res.text, 'lxml')
Parsing
soup_content = soup.find_all('li', {'class':['list-item ceu clearfix','list-item gsc clearfix','list-item euco clearfix','list-item eg clearfix' ]})
datePublished = []
headline = []
description =[]
urls = []
for i in range(len(soup_content)):
    datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
    headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
    description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
    urls.append('https://www.consilium.europa.eu{}'.format(soup.find('a', {'itemprop': 'url'}).attrs['href']))
To DataFrame
df = pd.DataFrame(data = zip(datePublished, headline, description, urls), columns=['date','title', 'description', 'link'])
df
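One detail worth knowing about this zip-based construction: if the parallel lists ever end up with different lengths (for example, because one `append` was skipped when a field was missing), `zip` silently truncates to the shortest list, so rows are dropped without any error. A minimal illustration with toy values:

```python
import pandas as pd

# zip stops at the shortest input, so a missing append in one list
# silently drops rows instead of raising an error
dates = ['2023-01-30', '2023-01-27', '2023-01-26']
titles = ['a', 'b']  # one append was skipped
df = pd.DataFrame(zip(dates, titles), columns=['date', 'title'])
print(len(df))  # 2, not 3
```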
To expand on my comments, this should work:
maxPage = 9
datePublished = []
headline = []
description =[]
urls = []
for page in range(1, maxPage+1):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    print(f'[page {page:>3}]', res.status_code, res.reason, 'from', res.url)
    soup = BeautifulSoup(res.content, 'lxml')
    soup_content = soup.find_all('li', {'class': ['list-item ceu clearfix', 'list-item gsc clearfix', 'list-item euco clearfix', 'list-item eg clearfix']})
    for i in range(len(soup_content)):
        datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
        headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
        description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
        # search within the current item: soup.find would return the page's first link every time
        urls.append('https://www.consilium.europa.eu{}'.format(soup_content[i].find('a', {'itemprop': 'url'}).attrs['href']))
When I ran it, 179 unique rows were collected [20 rows from all pages except the 7th, which had 19].
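Once the four lists are filled, they can be combined into a DataFrame, deduplicated, and have their dates parsed. A hedged sketch with toy values standing in for the scraped lists (the real values come from the pages above):

```python
import pandas as pd

# toy stand-ins for the scraped lists
datePublished = ['2023-01-30', '2023-01-30', '2023-01-27']
headline = ['Statement A', 'Recommendation B', 'Forward look']
description = ['desc a', 'desc b', 'desc c']
urls = ['https://example.org/a', 'https://example.org/b', 'https://example.org/c']

df = pd.DataFrame(zip(datePublished, headline, description, urls),
                  columns=['date', 'title', 'description', 'link'])
df = df.drop_duplicates()                # guards against a page fetched twice
df['date'] = pd.to_datetime(df['date'])  # ISO-format strings parse directly
print(len(df))  # 3
```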
There are different ways to reach your goal:
@Driftr95 came up with a modification of yours using range(), which is fine while iterating over a specific number of pages.
Using a while loop to be flexible in the number of pages, without knowing the exact count. You can also use a counter if you like, to break the loop after a certain number of iterations.
...
I would recommend the second one, and also avoiding the bunch of lists, because you have to ensure they all have the same length. Instead, use a single list of dicts, which is more structured.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
base_url = 'https://www.consilium.europa.eu'
path = '/en/press/press-releases'
url = base_url + path
data = []
while True:
    print(url)
    soup = BeautifulSoup(requests.get(url).text)
    for e in soup.select('li.list-item'):
        data.append({
            'date': e.find_previous('h2').text,
            'title': e.h3.text,
            'desc': e.p.text,
            'url': base_url + e.h3.a.get('href')
        })
    if soup.select_one('li[aria-label="Go to the next page"] a[href]'):
        url = base_url + path + soup.select_one('li[aria-label="Go to the next page"] a[href]').get('href')
    else:
        break
df = pd.DataFrame(data)
Output

| | date | title | desc | url |
|---|---|---|---|---|
| 0 | 30 January 2023 | Statement by the High Representative on behalf of the EU on the alignment of certain third countries concerning restrictive measures in view of the situation in the Democratic Republic of the Congo | Statement by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Implementing Decision (CFSP) 2022/2398 of 8 December 2022 implementing Decision 2010/788/CFSP concerning restrictive measures in view of the situation in the Democratic Republic of the Congo. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/statement-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-third-countries-concerning-restrictive-measures-in-view-of-the-situation-in-the-democratic-republic-of-the-congo/ |
| 1 | 30 January 2023 | Council adopts recommendation on adequate minimum income | The Council adopted a recommendation on adequate minimum income to combat poverty and social exclusion. Income support is considered adequate when it ensures a life in dignity at all stages of life. Member states are recommended to gradually achieve the adequate level of income support by 2030 at the latest, while safeguarding the sustainability of public finances. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/council-adopts-recommendation-on-adequate-minimum-income/ |
| 2 | 27 January 2023 | Forward look: 30 January - 12 February 2023 | Overview of the main subjects to be discussed at meetings of the Council of the EU over the next two weeks and upcoming media events. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/forward-look/ |
| 3 | 27 January 2023 | Russia: EU prolongs economic sanctions over Russia’s military aggression against Ukraine | The Council prolonged restrictive measures in view of Russia's actions destabilising the situation in Ukraine by six months. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/russia-eu-prolongs-economic-sanctions-over-russia-s-military-aggression-against-ukraine/ |
| 4 | 27 January 2023 | Media advisory – Agriculture and Fisheries Council meeting on 30 January 2023 | Main agenda items, approximate timing, public sessions and press opportunities. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/media-advisory-agriculture-and-fisheries-council-meeting-on-30-january-2023/ |
| ... | ... | ... | ... | ... |
| 435 | 6 July 2022 | EU support to the African Union Mission in Somalia: Council approves further support under the European Peace Facility | The Council approved €120 million in support to the military component of AMISOM/ATMIS for 2022 under the European Peace Facility. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/eu-support-to-the-african-union-mission-in-somalia-council-approves-further-support-under-the-european-peace-facility/ |
| 436 | 6 July 2022 | Report by President Charles Michel to the European Parliament plenary session | Report by European Council President Charles Michel to the European Parliament plenary session on the outcome of the European Council meeting of 23-24 June 2022. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/report-by-president-charles-michel-to-the-european-parliament-plenary-session/ |
| 437 | 5 July 2022 | Declaration by the High Representative on behalf of the EU on the alignment of certain countries concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them | Declaration by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Decision (CFSP) 2022/950 of 20 June 2022 amending Decision (CFSP) 2016/1693 concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/declaration-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-countries-concerning-restrictive-measures-against-isil-da-esh-and-al-qaeda-and-persons-groups-undertakings-and-entities-associated-with-them/ |
| 438 | 5 July 2022 | Remarks by President Charles Michel after his meeting in Skopje with Prime Minister of North Macedonia Dimitar Kovačevski | During his visit to North Macedonia, President Michel expressed his support for proposed compromise solution on the country's accession negotiations. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/remarks-by-president-charles-michel-after-his-meeting-in-skopje-with-prime-minister-of-north-macedonia-dimitar-kovacevski/ |
| 439 | 4 July 2022 | Readout of the telephone conversation between President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed | President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed valued their open and frank exchange and agreed to speak in the near future to take stock. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/04/readout-of-the-telephone-conversation-between-president-charles-michel-and-prime-minister-of-ethiopia-abiy-ahmed/ |
| ... | ... | ... | ... | ... |
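The counter variant mentioned above (breaking the while loop after a fixed number of iterations) can be sketched without any network access. `fetch_next_url` here is a hypothetical stand-in for the requests/next-page lookup in the real loop:

```python
max_pages = 3
page_count = 0
visited = []

def fetch_next_url(url):
    # hypothetical stand-in: returns the next page's url, or None on the last page
    n = int(url.rsplit('=', 1)[-1])
    return f'https://example.org/press?page={n + 1}' if n < 10 else None

url = 'https://example.org/press?page=1'
while url is not None and page_count < max_pages:
    visited.append(url)   # scrape the page here
    page_count += 1
    url = fetch_next_url(url)

print(len(visited))  # 3: stops after max_pages even though more pages exist
```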
I am trying to extract some sentences from text data. I want to extract the sentences that contain the phrase "medical device company released". I can run the following code:
df_text <- unlist(strsplit(df$TD, "\\."))
df_text
df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text
Which gives me:
[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"
So I have extracted the sentences that contain the phrase "medical device company released". However, I want to store the result in a new column, keeping track of which grp each sentence came from.
Expected output:
grp TD newCol
3613 text NA # does not contain the sentence
4973 text medical device company released
5570 text NA # does not contain the sentence
Data:
df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. (BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley#dowjones.com [ 02-10-09 1437ET ]\n ",
" --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n By Anjali Athavaley \n Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. 
approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley#dowjones.com [ 12-09-11 0924ET ]\n ",
" In September, the company issued a guidance range of 43 cents to 44 cents a share. \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share. \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ] \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion. \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion. \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at . \n\nThe company plans to report third-quarter earnings Oct. 14. \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68. \n\nCompany Web site: \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires#DowJones.com \n\nOrder free Annual Report for General Electric Co. \n\nVisit or call 1-888-301-0513 [ 10-06-05 0904ET ] \n "
)), class = "data.frame", row.names = c(NA, -3L))
We can get the data into separate rows, keeping grp intact, and keep only the sentences that contain "medical device company released".
library(dplyr)
df %>%
  tidyr::separate_rows(TD, sep = "\\.") %>%
  group_by(grp) %>%
  summarise(newCol = toString(grep(pattern = "medical device company released",
                                   TD, ignore.case = TRUE, value = TRUE)))
# grp newCol
# <chr> <chr>
#1 3613 ""
#2 4973 "\n\nThe medical device company released its financia…
#3 5570 ""
I have a single text file, NPFile, 3523 lines long, that contains 100 different newspaper articles. I am trying to pick out and parse different data fields for each article for text processing. These fields are: Full text:, Publication date:, Publication title:, etc.
I am using grep to pick out the lines that contain the data fields I want. Although I can get the line numbers (the start and end positions of the fields), I get a warning when I try to use those line numbers to extract the actual text and put it into a vector:
#Find full text of article, clean and store in a variable
findft <- grep('Full text:', NPFile, ignore.case = TRUE)
endft <- grep('Publication date:', NPFile)
ftfield <- (NPFile[findft:endft])
The last line, ftfield <- (NPFile[findft:endft]), gives this warning message:
1: In findft:endft :
numerical expression has 100 elements: only the first used
The starting points findft and the ending points endft each contain 100 elements, but as the warning indicates, only the first element of each is used, so ftfield contains just the first extract (which is 11 lines long). I was wrongly assuming that the lines for each of the 100 instances of the full-text field would be extracted and stored in ftfield, but obviously I have not coded this correctly. Any help would be appreciated.
Example of Data (These are the fields and data associated with one of the 100 in the text file):
Waiting for the 500-year flood; Red River rampage: Severe weather events, new records are more frequent than expected.
Full text: AS THE RED River raged over makeshift dikes futilely erected against its wrath in North Dakota, drowning cities beneath a column of water 26 feet above flood level, meteorologists were hard pressed to describe its magnitude in human chronology.
A 500-year flood, some call it, a catastrophic weather event that would have occurred only once since Christopher Columbus arrived on the shores of the New World. Whether it could be termed a 700-year flood or a 300-year flood is open to question.
The flood's size and power are unprecedented. While the Red River has ravaged the upper Midwest before, the height of the flood crest in Fargo and Grand Forks has been almost incomprehensible.
But climatological records are being broken more rapidly than ever. A 100-year-storm may as likely repeat within a few years as waiting another century. It is simply a way of classifying severity, not the frequency. "There isn't really a hundred-year event anymore," states climatologist Tom Karl of the National Oceanic and Atmospheric Administration.
Reliable, consistent weather records in the U.S. go back only 150 years or so. Human development has altered the Earth's surface and atmosphere, promoting greater weather changes and effects than an untouched environment would generate by itself.
What might be a 500-year event in the Chesapeake Bay is uncertain. Last year was the record for freshwater gushing into the bay. The January 1996 torrent of melted snowfall into the estuary recorded a daily average that exceeded the flow during Tropical Storm Agnes in 1972, a benchmark for 100-year meteorological events in these parts. But, according to the U.S. Geological Survey, the impact on the bay's ecosystem was not as damaging as in 1972.
Sea level in the Bay has risen nearly a foot in the past century, three times the rate of the past 5,000 years, which University of Maryland scientist Stephen Leatherman ties to global climate warming. Estuarine islands and upland shoreline are eroding at an accelerated pace.
The topography of the bay watershed is, of course, different from that of the Red River. It's not just flow rates and rainfall, but how the water is directed and where it can escape without intruding too far onto dry land. We can only hope that another 500 years really passes before the Chesapeake region is so tested.
Pub Date: 4/22/97
Publication date: Apr 22, 1997
Publication title: The Sun; Baltimore, Md.
Title: Waiting for the 500-year flood; Red River rampage: Severe weather events, new records are more frequent than expected.: [FINAL Edition ]
From this data example above, ftfield has 11 lines when I examined it:
[1] "Full text: AS THE RED River raged over makeshift dikes futilely erected against its wrath in North Dakota, drowning cities beneath a column of water 26 feet above flood level, meteorologists were hard pressed to describe its magnitude in human chronology."
[2] "A 500-year flood, some call it, a catastrophic weather event that would have occurred only once since Christopher Columbus arrived on the shores of the New World. Whether it could be termed a 700-year flood or a 300-year flood is open to question."
[3] "The flood's size and power are unprecedented. While the Red River has ravaged the upper Midwest before, the height of the flood crest in Fargo and Grand Forks has been almost incomprehensible."
[4] "But climatological records are being broken more rapidly than ever. A 100-year-storm may as likely repeat within a few years as waiting another century. It is simply a way of classifying severity, not the frequency. \"There isn't really a hundred-year event anymore,\" states climatologist Tom Karl of the National Oceanic and Atmospheric Administration."
[5] "Reliable, consistent weather records in the U.S. go back only 150 years or so. Human development has altered the Earth's surface and atmosphere, promoting greater weather changes and effects than an untouched environment would generate by itself."
[6] "What might be a 500-year event in the Chesapeake Bay is uncertain. Last year was the record for freshwater gushing into the bay. The January 1996 torrent of melted snowfall into the estuary recorded a daily average that exceeded the flow during Tropical Storm Agnes in 1972, a benchmark for 100-year meteorological events in these parts. But, according to the U.S. Geological Survey, the impact on the bay's ecosystem was not as damaging as in 1972."
[7] "Sea level in the Bay has risen nearly a foot in the past century, three times the rate of the past 5,000 years, which University of Maryland scientist Stephen Leatherman ties to global climate warming. Estuarine islands and upland shoreline are eroding at an accelerated pace."
[8] "The topography of the bay watershed is, of course, different from that of the Red River. It's not just flow rates and rainfall, but how the water is directed and where it can escape without intruding too far onto dry land. We can only hope that another 500 years really passes before the Chesapeake region is so tested."
[9] "Pub Date: 4/22/97"
[10] ""
[11] "Publication date: Apr 22, 1997"
And, lastly, findft[1] corresponds to endft[1], and so on up to findft[100] and endft[100].
I'll assume that findft contains several indexes, as does endft. I'm also assuming that both have the same length and are paired by index (e.g. findft[5] corresponds to endft[5]), and that you want all NPFile elements between each pair of indexes.
If this is so, try:
ftfield <- lapply(seq_along(findft), function(x) NPFile[findft[x]:endft[x]])
This will return a list. I can't guarantee that this will work because there is no data example to work with.
We can do this with Map: generate the sequence of indexes from each element of 'findft' to the corresponding element of 'endft', then subset 'NPFile' with that index.
Map(function(x, y) NPFile[x:y], findft, endft)
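For illustration, here is a minimal sketch with made-up NPFile, findft, and endft values (hypothetical, since the question provides no data example). Each list element returned by Map holds the lines of one article:

```r
# Hypothetical example data: a 10-line file containing two articles,
# starting at lines 1 and 6 and ending at lines 4 and 9
NPFile <- paste("line", 1:10)
findft <- c(1, 6)
endft  <- c(4, 9)

# Each list element holds the lines of one article
ftfield <- Map(function(x, y) NPFile[x:y], findft, endft)
ftfield[[1]]
# "line 1" "line 2" "line 3" "line 4"
```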
I have a corpus of newspaper articles of which only specific parts are of interest for my research. I'm not happy with the results I get from classifying texts along different frames because the data contains too much noise. I therefore want to extract only the relevant parts from the documents. I was thinking of doing so by transforming several kwic objects generated by the quanteda package into a single df.
So far I've tried the following
exampletext <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
mycorpus <- corpus(exampletext)
mycorpus.nat <- corpus(kwic(mycorpus, "nationalit*", window = 5, valuetype = "glob"))
mycorpus.cit <- corpus(kwic(mycorpus, "citizenship", window = 5, valuetype = "glob"))
mycorpus.kwic <- mycorpus.nat + mycorpus.cit
mydfm <- dfm(mycorpus.kwic)
This, however, generates a dfm that contains 4 documents instead of 2, and even more when both keywords are present in a document. I can't think of a way to bring the dfm back down to the original number of documents.
Thank you for helping me out.
We recently added a window argument to tokens_select() for this purpose:
require(quanteda)
txt <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
toks <- tokens(txt)
mt_nat <- dfm(tokens_select(toks, "nationalit*", window = 5))
mt_cit <- dfm(tokens_select(toks, "citizenship*", window = 5))
Please make sure that you are using the latest version of quanteda.
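Since tokens_select() accepts a character vector of patterns, you can also pass both keywords in a single call, which keeps one document per input text instead of producing separate kwic documents. A minimal sketch with made-up texts (the document names and sentences here are illustrative):

```r
library(quanteda)

# Hypothetical two-document corpus
txt <- c(doc1 = "She was granted British citizenship last year.",
         doc2 = "Athletes who change their nationality must wait three years.")
toks <- tokens(txt)

# Selecting both patterns in one call preserves the document count:
# each document keeps only tokens within 5 positions of either keyword
toks_kw <- tokens_select(toks, c("nationalit*", "citizenship*"), window = 5)
mt <- dfm(toks_kw)
ndoc(mt)  # same number of documents as the input
```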
I am trying to import a text file as a data frame with a single column and multiple rows. I want a new row created for every sentence, and I want to repeat the process for every word.
Like this.
Mr. Trump has been leading most national polls in the Republican presidential contest, but he is facing a potentially changed landscape. With the Iowa caucuses less than three months away, attention has shifted to national security in the wake of the terrorist attacks in Paris last week. While the Republican electorate so far has favored political outsiders like Mr. Trump and Ben Carson, the concerns over terrorism and the arrival of refugees from Syria into the United States could change things.
should be read as
V1
[1] Mr
[2] Trump has been leading most national polls in the Republican presidential contest, but he is facing a potentially changed landscape
[3] With the Iowa caucuses less than three months away, attention has shifted to national security in the wake of the terrorist attacks in Paris last week
[4] While the Republican electorate so far has favored political outsiders like Mr
[5] Trump and Ben Carson, the concerns over terrorism and the arrival of refugees from Syria into the United States could change things
Thanks.
We can use strsplit
strsplit(txt, '[.]\\s*')[[1]]
data
txt <- "Mr. Trump has been leading most national polls in the Republican presidential contest, but he is facing a potentially changed landscape. With the Iowa caucuses less than three months away, attention has shifted to national security in the wake of the terrorist attacks in Paris last week. While the Republican electorate so far has favored political outsiders like Mr. Trump and Ben Carson, the concerns over terrorism and the arrival of refugees from Syria into the United States could change things."
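Note that splitting on every period also breaks abbreviations such as "Mr.", which is why "Mr" ends up in its own row above. If that matters, a sketch that protects "Mr." with a fixed-length negative lookbehind (which requires perl = TRUE):

```r
txt <- "Mr. Trump has been leading. He faces a changed landscape."

# Don't split after "Mr": the negative lookbehind (?<!Mr) blocks the
# match when the period is immediately preceded by "Mr"
strsplit(txt, "(?<!Mr)[.]\\s*", perl = TRUE)[[1]]
# "Mr. Trump has been leading"  "He faces a changed landscape"
```

Other abbreviations ("Mrs.", "Dr.", etc.) would each need their own fixed-length lookbehind; for robust sentence splitting a tokenizer (e.g. from a dedicated NLP package) is usually safer than a regex.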