I have a collection of texts organised in a data frame in the following way:
I would need the texts to be organised in the following way instead:
I have been through a lot of previous questions here, but all of the merging suggested involves calculations, which is not the case here. I have also consulted the tidytext package but did not find a function to merge text in this way.
Any help is appreciated.
Edit
A piece of the actual data frame would be:
dput(df1)
structure(list(Title = c("Immigrants five times better off in Britain - Daily Star",
"Immigrants five times better off in Britain - Daily Star", "Immigrants five times better off in Britain - Daily Star",
"Immigrants five times better off in Britain - Daily Star", "Immigrants five times better off in Britain - Daily Star",
"Immigrants five times better off in Britain - Daily Star", "Immigrants five times better off in Britain - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star"
), Content = c("IMMIGRANTS from Romania and Bulgaria would be five times better off if they moved to Britain.",
"Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox!",
"Related content", "And families with two kids would be nine times richer, according to shock new figures.",
"From 2014, the 29 million citizens of Romania and Bulgaria become eligible to live anywhere in Europe – and there are fears that millions will be heading to the UK.",
"Migration Watch UK says our minimum wage of £254 a week compares to an average £55 a week in those countries.",
"Chairman Sir Andrew Green said: “Given the incentives, it would be absurd to suggest that there will not be a significant inflow.”",
"US President-elect Donald Trump has reaffirmed plans to deport millions of illegal immigrants from America in a bold statement to the world.",
"Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox!",
"The 70-year-old billionaire will promise to tackle criminals who were illegally living in America in a broadcast due to be aired later this evening.",
"Appearing in his first tv interview since his shocking election win, Trump said that two to three million immigrants with criminal records in the US would either be jailed or deported.",
"He told CBS show 60 Minutes: \"What we are going to do is get the people that are criminal and have criminal records, gang members, drug dealers. where a lot of these people, probably two million, it could even be three million, we are getting them out of our country, they're here illegally.",
"\"After the border is secure and after everything gets normalised, we're going to make a determination on the people that they're talking about who are terrific people, they're terrific people, but we are gonna (sic) make a determination at that.",
"\"But before we make that determination, it's very important, we are going to secure our border.\"",
"Trump also confirmed plans were underway to construct a \"great wall\" on the US-Mexican border.",
"A spokeswoman for Mr Trump yesterday confirmed that the 70-year-old tycoon had set up a taskforce to begin plans on constructing the wall, which could cost as much as £9.3billion.",
"But the President-elect did concede that parts of the wall may have to be a fence.",
"When asked if he would accept a fence, Trump said: \"For certain areas I would, but certain areas, a wall is more appropriate. I’m very good at this, it’s called construction.\"",
"Congressman Louie Gohmert confirmed yesterday that Trump's wall would is only likely to stretch for “around half” the length of the border, which spans California, Arizona, New Mexico and Texas.",
"Plans to build the wall has seen widespread protests across the US, with demonstrators taking to the streets to protest about their new president.",
"Scores have been arrested and a man was shot in Portland, Oregon, following an argument between activists.",
"In Los Angeles, officers were scouring the route of an earlier protest after an undercover officer lost his gun and handcuffs during a scuffle.",
"THOUSANDS of immigrants are getting access to UK state handouts as soon as they arrive thanks to an EU loophole.",
"Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox!",
"Related content", "In the past five years 100,000 wives, husbands and children of EU citizens have moved to Britain under a lax system that bypasses rules for Brits.",
"British people who want close family from outside Europe to move to the UK have to prove they earn around £18,000 a year before they get visas.",
"But separate rules for EU citizens mean they do not have to bring in the same wages before flying in relatives. They then get the same right to benefits as unemployed Brits.",
"Sir Andrew Green, chairman of Migration Watch, said: “This is a loophole that must be closed.",
"“It is absurd that EU citizens should be in a more favourable position than our own citizens.”"
)), row.names = c(NA, -30L), class = c("tbl_df", "tbl", "data.frame"
))
Thanks.
P.S.: Sorry for the images; the system did not allow me to add actual tables.
We can use
aggregate(Text ~ Book, df1, FUN = paste, collapse = ' ')
-output
Book Text
1 Book1 Text.a Text.b
2 Book2 Text.c Text.d
For the OP's data
aggregate(Content ~ Title, df1, FUN = paste, collapse = ' ')
-output
Title
1 Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star
2 Immigrants five times better off in Britain - Daily Star
3 Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star
Content
1 US President-elect Donald Trump has reaffirmed plans to deport millions of illegal immigrants from America in a bold statement to the world. Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox! The 70-year-old billionaire will promise to tackle criminals who were illegally living in America in a broadcast due to be aired later this evening. Appearing in his first tv interview since his shocking election win, Trump said that two to three million immigrants with criminal records in the US would either be jailed or deported. He told CBS show 60 Minutes: "What we are going to do is get the people that are criminal and have criminal records, gang members, drug dealers. where a lot of these people, probably two million, it could even be three million, we are getting them out of our country, they're here illegally. "After the border is secure and after everything gets normalised, we're going to make a determination on the people that they're talking about who are terrific people, they're terrific people, but we are gonna (sic) make a determination at that. "But before we make that determination, it's very important, we are going to secure our border." Trump also confirmed plans were underway to construct a "great wall" on the US-Mexican border. A spokeswoman for Mr Trump yesterday confirmed that the 70-year-old tycoon had set up a taskforce to begin plans on constructing the wall, which could cost as much as £9.3billion. But the President-elect did concede that parts of the wall may have to be a fence. When asked if he would accept a fence, Trump said: "For certain areas I would, but certain areas, a wall is more appropriate. I’m very good at this, it’s called construction." Congressman Louie Gohmert confirmed yesterday that Trump's wall would is only likely to stretch for “around half” the length of the border, which spans California, Arizona, New Mexico and Texas. 
Plans to build the wall has seen widespread protests across the US, with demonstrators taking to the streets to protest about their new president. Scores have been arrested and a man was shot in Portland, Oregon, following an argument between activists. In Los Angeles, officers were scouring the route of an earlier protest after an undercover officer lost his gun and handcuffs during a scuffle.
2 IMMIGRANTS from Romania and Bulgaria would be five times better off if they moved to Britain. Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox! Related content And families with two kids would be nine times richer, according to shock new figures. From 2014, the 29 million citizens of Romania and Bulgaria become eligible to live anywhere in Europe – and there are fears that millions will be heading to the UK. Migration Watch UK says our minimum wage of £254 a week compares to an average £55 a week in those countries. Chairman Sir Andrew Green said: “Given the incentives, it would be absurd to suggest that there will not be a significant inflow.”
3 THOUSANDS of immigrants are getting access to UK state handouts as soon as they arrive thanks to an EU loophole. Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox! Related content In the past five years 100,000 wives, husbands and children of EU citizens have moved to Britain under a lax system that bypasses rules for Brits. British people who want close family from outside Europe to move to the UK have to prove they earn around £18,000 a year before they get visas. But separate rules for EU citizens mean they do not have to bring in the same wages before flying in relatives. They then get the same right to benefits as unemployed Brits. Sir Andrew Green, chairman of Migration Watch, said: “This is a loophole that must be closed. “It is absurd that EU citizens should be in a more favourable position than our own citizens.”
Or this can be done in tidyverse
library(dplyr)
library(stringr)
df1 %>%
  group_by(Title) %>%
  summarise(Content = str_c(Content, collapse = " "), .groups = 'drop')
data
df1 <- structure(list(Book = c("Book1", "Book1", "Book2", "Book2"),
Text = c("Text.a", "Text.b", "Text.c", "Text.d")),
class = "data.frame", row.names = c(NA,
-4L))
I am trying to extract some sentences from text data. I want to extract the sentences which contain the phrase medical device company released. I can run the following code:
df_text <- unlist(strsplit(df$TD, "\\."))
df_text
df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text
Which gives me:
[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"
So I extracted the sentences which contain the phrase medical device company released. However, I want to store the result in a new column, aligned with the grp each sentence came from.
Expected output:
grp TD newCol
3613 text NA # does not contain the sentence
4973 text medical device company released
5570 text NA # does not contain the sentence
Data:
df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. (BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley#dowjones.com [ 02-10-09 1437ET ]\n ",
" --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n By Anjali Athavaley \n Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. 
approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley#dowjones.com [ 12-09-11 0924ET ]\n ",
" In September, the company issued a guidance range of 43 cents to 44 cents a share. \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share. \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ] \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion. \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion. \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at . \n\nThe company plans to report third-quarter earnings Oct. 14. \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68. \n\nCompany Web site: \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires#DowJones.com \n\nOrder free Annual Report for General Electric Co. \n\nVisit or call 1-888-301-0513 [ 10-06-05 0904ET ] \n "
)), class = "data.frame", row.names = c(NA, -3L))
We can split the data into separate rows, keeping grp intact, and then keep only the sentences that contain "medical device company released".
library(dplyr)
df %>%
  tidyr::separate_rows(TD, sep = "\\.") %>%
  group_by(grp) %>%
  summarise(newCol = toString(grep(pattern = "medical device company released",
                                   TD, ignore.case = TRUE, value = TRUE)))
# grp newCol
# <chr> <chr>
#1 3613 ""
#2 4973 "\n\nThe medical device company released its financia…
#3 5570 ""
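An alternative sketch that avoids splitting the rows at all is to pull the matching sentence straight out of TD with stringr::str_extract. The `[^.]*` "sentence" pattern and the toy data frame below are illustrative assumptions, not the OP's full data:

```r
library(dplyr)
library(stringr)

# Toy stand-in for the question's data frame (illustrative only)
df <- tibble(
  grp = c("3613", "4973", "5570"),
  TD  = c("Wal-Mart plans to add positions. More text.",
          "Intro. The medical device company released its outlook. More.",
          "GE issued new guidance. More text.")
)

res <- df %>%
  mutate(newCol = str_extract(
    TD,
    # a "sentence" here is any run of non-period characters around the phrase
    regex("[^.]*medical device company released[^.]*", ignore_case = TRUE)
  ))
```

Rows whose TD does not contain the phrase get NA in newCol, which matches the expected output in the question.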
I have a single text file, NPFile, that contains 100 different newspaper articles and is 3523 lines long. I am trying to pick out and parse different data fields from each article for text processing. These fields are: Full text:, Publication date:, Publication title:, etc.
I am using grep to pick out the lines that contain the data fields I want. Although I can get the line numbers (the start and end positions of the fields), I get a warning when I try to use those line numbers to extract the actual text and put it into a vector:
#Find full text of article, clean and store in a variable
findft <- grep('Full text:', NPFile, ignore.case = TRUE)
endft <- grep('Publication date:', NPFile)
ftfield <- NPFile[findft:endft]
The last line, ftfield <- NPFile[findft:endft], gives this warning message:
1: In findft:endft :
numerical expression has 100 elements: only the first used
The starting points findft and ending points endft each contain 100 elements, but as the warning indicates, ftfield only contains the first span (which is 11 lines long). I was assuming, mistakenly, that the lines for each of the 100 instances of the full-text field would be extracted and stored in ftfield, but obviously I have not coded this correctly. Any help would be appreciated.
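For what it's worth, the warning comes from the fact that R's `:` operator is not vectorised: given two vectors, it uses only the first element of each. A minimal sketch with made-up numbers:

```r
start <- c(2, 10, 20)
end   <- c(4, 12, 22)

# `:` uses only start[1] and end[1] and warns about the rest,
# so this produces just the sequence 2 3 4
first_only <- suppressWarnings(start:end)

# To get one sequence per (start, end) pair, apply `:` element-wise
all_spans <- Map(`:`, start, end)
all_spans[[2]]   # 10 11 12
```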
Example of Data (These are the fields and data associated with one of the 100 in the text file):
Waiting for the 500-year flood; Red River rampage: Severe weather events, new records are more frequent than expected.
Full text: AS THE RED River raged over makeshift dikes futilely erected against its wrath in North Dakota, drowning cities beneath a column of water 26 feet above flood level, meteorologists were hard pressed to describe its magnitude in human chronology.
A 500-year flood, some call it, a catastrophic weather event that would have occurred only once since Christopher Columbus arrived on the shores of the New World. Whether it could be termed a 700-year flood or a 300-year flood is open to question.
The flood's size and power are unprecedented. While the Red River has ravaged the upper Midwest before, the height of the flood crest in Fargo and Grand Forks has been almost incomprehensible.
But climatological records are being broken more rapidly than ever. A 100-year-storm may as likely repeat within a few years as waiting another century. It is simply a way of classifying severity, not the frequency. "There isn't really a hundred-year event anymore," states climatologist Tom Karl of the National Oceanic and Atmospheric Administration.
Reliable, consistent weather records in the U.S. go back only 150 years or so. Human development has altered the Earth's surface and atmosphere, promoting greater weather changes and effects than an untouched environment would generate by itself.
What might be a 500-year event in the Chesapeake Bay is uncertain. Last year was the record for freshwater gushing into the bay. The January 1996 torrent of melted snowfall into the estuary recorded a daily average that exceeded the flow during Tropical Storm Agnes in 1972, a benchmark for 100-year meteorological events in these parts. But, according to the U.S. Geological Survey, the impact on the bay's ecosystem was not as damaging as in 1972.
Sea level in the Bay has risen nearly a foot in the past century, three times the rate of the past 5,000 years, which University of Maryland scientist Stephen Leatherman ties to global climate warming. Estuarine islands and upland shoreline are eroding at an accelerated pace.
The topography of the bay watershed is, of course, different from that of the Red River. It's not just flow rates and rainfall, but how the water is directed and where it can escape without intruding too far onto dry land. We can only hope that another 500 years really passes before the Chesapeake region is so tested.
Pub Date: 4/22/97
Publication date: Apr 22, 1997
Publication title: The Sun; Baltimore, Md.
Title: Waiting for the 500-year flood; Red River rampage: Severe weather events, new records are more frequent than expected.: [FINAL Edition ]
From this data example above, ftfield has 11 lines when I examined it:
[1] "Full text: AS THE RED River raged over makeshift dikes futilely erected against its wrath in North Dakota, drowning cities beneath a column of water 26 feet above flood level, meteorologists were hard pressed to describe its magnitude in human chronology."
[2] "A 500-year flood, some call it, a catastrophic weather event that would have occurred only once since Christopher Columbus arrived on the shores of the New World. Whether it could be termed a 700-year flood or a 300-year flood is open to question."
[3] "The flood's size and power are unprecedented. While the Red River has ravaged the upper Midwest before, the height of the flood crest in Fargo and Grand Forks has been almost incomprehensible."
[4] "But climatological records are being broken more rapidly than ever. A 100-year-storm may as likely repeat within a few years as waiting another century. It is simply a way of classifying severity, not the frequency. \"There isn't really a hundred-year event anymore,\" states climatologist Tom Karl of the National Oceanic and Atmospheric Administration."
[5] "Reliable, consistent weather records in the U.S. go back only 150 years or so. Human development has altered the Earth's surface and atmosphere, promoting greater weather changes and effects than an untouched environment would generate by itself."
[6] "What might be a 500-year event in the Chesapeake Bay is uncertain. Last year was the record for freshwater gushing into the bay. The January 1996 torrent of melted snowfall into the estuary recorded a daily average that exceeded the flow during Tropical Storm Agnes in 1972, a benchmark for 100-year meteorological events in these parts. But, according to the U.S. Geological Survey, the impact on the bay's ecosystem was not as damaging as in 1972."
[7] "Sea level in the Bay has risen nearly a foot in the past century, three times the rate of the past 5,000 years, which University of Maryland scientist Stephen Leatherman ties to global climate warming. Estuarine islands and upland shoreline are eroding at an accelerated pace."
[8] "The topography of the bay watershed is, of course, different from that of the Red River. It's not just flow rates and rainfall, but how the water is directed and where it can escape without intruding too far onto dry land. We can only hope that another 500 years really passes before the Chesapeake region is so tested."
[9] "Pub Date: 4/22/97"
[10] ""
[11] "Publication date: Apr 22, 1997"
And, lastly, findft[1] corresponds with endft[1] and so on until findft[100] and endft[100].
I'll assume that findft and endft each contain several indexes, that they have the same length, and that they are paired by position (e.g. findft[5] corresponds to endft[5]), and that you want all the NPFile elements between each pair of indexes.
If this is so, try:
ftfield <- lapply(seq_along(findft), function(x) NPFile[findft[x]:endft[x]])
This will return a list. I can't guarantee that this will work because there is no data example to work with.
We can do this with Map: get the sequence of values for each corresponding element of findft and endft, then subset NPFile based on that index.
Map(function(x, y) NPFile[x:y], findft, endft)
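If each article's block should end up as a single string (one element per start/end pair), the same Map call can wrap the subset in paste. A sketch with a made-up two-article NPFile, since the real file isn't available:

```r
# Made-up stand-in for NPFile, for illustration only
NPFile <- c("Full text: first article, line 1", "line 2",
            "Publication date: Apr 22, 1997",
            "Full text: second article, line 1",
            "Publication date: Jun 4, 1984")

findft <- grep("Full text:", NPFile)
endft  <- grep("Publication date:", NPFile)

# One character string per article, lines joined with a space
ftfield <- unlist(Map(function(x, y) paste(NPFile[x:y], collapse = " "),
                      findft, endft))
length(ftfield)   # 2, one element per article
```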
I have a corpus of newspaper articles of which only specific parts are of interest for my research. I'm not happy with the results I get from classifying texts along different frames because the data contains too much noise. I therefore want to extract only the relevant parts from the documents. I was thinking of doing so by transforming several kwic objects generated by the quanteda package into a single df.
So far I've tried the following
exampletext <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
mycorpus <- corpus(exampletext)
mycorpus.nat <- corpus(kwic(mycorpus, "nationalit*", window = 5, valuetype = "glob"))
mycorpus.cit <- corpus(kwic(mycorpus, "citizenship", window = 5, valuetype = "glob"))
mycorpus.kwic <- mycorpus.nat + mycorpus.cit
mydfm <- dfm(mycorpus.kwic)
This, however, generates a dfm that contains 4 documents instead of 2, and even more when both keywords are present in the same document. I can't think of a way to bring the dfm back down to the original number of documents.
Thank you for helping me out.
We recently added a window argument to tokens_select() for this purpose:
require(quanteda)
txt <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
toks <- tokens(txt)
mt_nat <- dfm(tokens_select(toks, "nationalit*", window = 5))
mt_cit <- dfm(tokens_select(toks, "citizenship*", window = 5))
Please make sure that you are using the latest version of Quanteda.
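To make it clearer what `window = 5` does, here is a minimal base R sketch (a toy token vector and a hypothetical helper `keep_window`, not quanteda's internals) that keeps each match plus up to five tokens on either side, which is the selection `tokens_select(toks, "citizenship*", window = 5)` performs:

```r
# Toy illustration of tokens_select(..., window = 5):
# keep each matching token plus up to 5 tokens on either side.
toks <- c("she", "was", "granted", "British", "citizenship", "in",
          "March", "after", "moving", "to", "Britain", "this", "year")

keep_window <- function(tokens, pattern, window = 5) {
  hits <- grep(pattern, tokens)
  keep <- unique(unlist(lapply(hits, function(i) {
    max(1, i - window):min(length(tokens), i + window)
  })))
  tokens[sort(keep)]
}

keep_window(toks, "^citizenship", window = 5)
# keeps tokens 1 through 10: "citizenship" plus its 5-token context
```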
In RStudio I have followed the approach of R code to search a word in a paragraph and copy the sentence in a variable
to identify the sentence which contains the key word (e.g. pollination below) I require.
However, I also want to extract the one sentence preceding and the one sentence following the sentence that contains the key word.
Desired output for input below:
They range much further north than honey bees, and colonies can be found on Ellesmere Island in northern Canada, only 880 km from the north pole! With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long (see below), especially Bombus terrestris which seems to be the most popular species sold for this purpose.
Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses.
If there are many occurrences of the word pollination, how can I obtain all of them through a loop?
Here is my R code so far:
text <- "Bumblebees are found mainly in northern temperate regions, though there are a few native South American species and New Zealand has some naturalised species that were introduced around 100 years ago to pollinate red clover. They range much further north than honey bees, and colonies can be found on Ellesmere Island in northern Canada, only 880 km from the north pole!
With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long (see below), especially Bombus terrestris which seems to be the most popular species sold for this purpose. Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses. Now, though I dearly love bumblebees, I do think that this might not be a very good idea. No matter what security measures are taken, mated queens WILL escape eventually and that will probably lead to their establishment in the wild. And yet another non-native invasion of a country that has suffered more than most from such things. This invasion may or may not be benign, but isn't it better to err on the side of caution? Apparently there are already colonies of Bombus terrestris on Tasmania, so I suppose it is now only a matter of time before they reach the mainland."
#end
library(qdap)
sent_detect(text)
##There are NINE sentences in text
##Output
[1] "Bumblebees are found mainly in northern temperate regions, though there are a few native South American species and New Zealand has some naturalised species that were introduced around 100 years ago to pollinate red clover."
[2] "They range much further north than honey bees, and colonies can be found on Ellesmere Island in northern Canada, only 880 km from the north pole!"
[3] "With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long, especially Bombus terrestris which seems to be the most popular species sold for this purpose."
[4] "Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses."
[5] "Now, though I dearly love bumblebees, I do think that this might not be a very good idea."
[6] "No matter what security measures are taken, mated queens WILL escape eventually and that will probably lead to their establishment in the wild."
[7] "And yet another non-native invasion of a country that has suffered more than most from such things."
[8] "This invasion may or may not be benign, but isn't it better to err on the side of caution?"
[9] "Apparently there are already colonies of Bombus terrestris on Tasmania, so I suppose it is now only a matter of time before they reach the mainland."
#End
Using the quanteda package, I confirm there are NINE sentences, and then tokenize the text:
library(quanteda)
nsentence(text)
# [1] 9
##Searching for word pollination - it finds the first occurrence only
dat <- data.frame(text=sent_detect(text), stringsAsFactors = FALSE)
Search(dat, "pollination")
[1] "With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long, especially Bombus terrestris which seems to be the most popular species sold for this purpose."
#End
You can use base R pattern-matching functions:
d <- sent_detect(text)
# grep the sentence with the keyword:
n <- which(grepl('pollination', d))
# 3
# get context of +-1
d[(n - 1):(n + 1)]
# [1] "They range much further north than honey bees, and colonies can be found on Ellesmere Island in northern Canada, only 880 km from the north pole!"
# [2] "With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long, especially Bombus terrestris which seems to be the most popular species sold for this purpose."
# [3] "Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses."
# nice output:
cat(d[(n - 1):(n + 1)])
# if there are multiple sentences with the keyword:
lapply(which(grepl('pollination', d)), function(n) {
  cat(d[(n - 1):(n + 1)], "\n")
})
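One caveat with `d[(n - 1):(n + 1)]`: if the keyword falls in the first or last sentence, `n - 1` is 0 or `n + 1` runs past the end of the vector. A small helper (hypothetical name `context_window`) that clamps the indices avoids this:

```r
# Return each keyword sentence together with its neighbours,
# clamping the window to the bounds of the sentence vector.
context_window <- function(sentences, keyword, before = 1, after = 1) {
  hits <- which(grepl(keyword, sentences))
  lapply(hits, function(n) {
    sentences[max(1, n - before):min(length(sentences), n + after)]
  })
}

d <- c("First sentence.", "It mentions pollination.", "Last sentence.")
context_window(d, "pollination")
# returns a list with one element: all three sentences
```

This also handles multiple occurrences, returning one context window per match.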
Here's a fairly straightforward way of doing this:
dat[c(inds <- grep("[Pp]ollination", dat[[1]]) + 1, inds - 2),]
## [1] "Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses."
## [2] "They range much further north than honey bees, and colonies can be found on Ellesmere Island in northern Canada, only 880 km from the north pole!"
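The `c(inds <- grep(...) + 1, inds - 2)` trick works, but it returns the following sentence before the preceding one. If you want the neighbours in reading order, a plainer base R version (a sketch, assuming as in the question's data that the match is neither the first nor the last sentence) is:

```r
# Neighbours in reading order: preceding sentence, then following sentence.
sents <- c("Before.", "Glasshouse pollination is discussed here.", "After.")
i <- grep("[Pp]ollination", sents)
sents[c(i - 1, i + 1)]
# [1] "Before." "After."
```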