How to delete similar documents in a corpus - R
I have a corpus of news articles on a given topic. Some of these articles are the exact same article but have been given additional headers and footers that very slightly change the content. I am trying to delete all but one of the potential duplicates so the final corpus only contains unique articles.
I decided to use cosine similarity to identify the potential duplicates:
myDfm <- dfm(as.character(docs$text_main), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, selection=docnames(myDfm), margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim)
After looking at a subset of the data, I chose a cutoff of .9 cosine similarity or above to indicate duplicates. (I am okay with any error that this introduces.) Given this, I have set the diagonal to 0 (i.e., not a dup) and recoded the matrix to indicate which documents are duplicates and which are not:
diag(cosinemat) <- 0
cosinemat[cosinemat >= .9] <- 1
cosinemat[cosinemat < .9] <- 0
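For readers who want to see what the thresholding step computes, here is a minimal sketch in Python. It is not the quanteda implementation: it assumes plain whitespace tokenisation and raw word counts, and the three short texts, `cosine_similarity`, and `duplicate_matrix` are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts, using raw word counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def duplicate_matrix(texts, threshold=0.9):
    """0/1 matrix: 1 where two *different* texts reach the threshold."""
    n = len(texts)
    return [
        [int(i != j and cosine_similarity(texts[i], texts[j]) >= threshold)
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical texts: the first two share a body but differ in header/footer.
texts = [
    "header the council met today to discuss the budget",
    "the council met today to discuss the budget footer",
    "a completely different story about football results",
]
dup = duplicate_matrix(texts)
for row in dup:
    print(row)
```

The `i != j` check plays the role of `diag(cosinemat) <- 0`, so a document is never flagged as a duplicate of itself.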
The problem I'm running into is figuring out how to delete all but one of the duplicate documents. Initially, I envisioned a for loop that goes through each column cell by cell and, for any cell with a value of 1 (i.e., a duplicate), deletes the column with the same name as the row of the current cell, reconstitutes the matrix, and continues on to the next cell. The for loop doesn't seem to like the line of code that deletes the columns with the name of the current row when the cell is equal to 1. I'm also not sure it's okay to reconstitute the object you're looping through. Something like this:
cosine_df <- as.data.frame(cosinemat)
for(col in 1:ncol(cosine_df)){
for(row in 1:nrow(cosine_df)){
if(cosine_df[col,row] == 0){
next
}
if(cosine_df[col,row] == 1){
cosine_df <- cosine_df[!rownames(cosine_df) %in% paste(rownames(cosine_df)[col,row]]
}
}
}
I'm not set on this approach, and I'm open to creative solutions, so long as I am able to identify similar documents and to delete all but one document.
Here's a subset of the documents if it helps:
docs <- structure(list(text_main = c("Congressional Documents and PublicationsMay 26, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:287 wordsBody(Washington, DC) Reps. Ted Deutch (D-FL) and Gus Bilirakis (R-FL) joined with Reps. Steve Israel (D-NY), Mike Kelly (R-PA), Ted Lieu (D-CA), Adam Kinzinger (R-IL), Hakeem Jeffries (D-NY), Lee Zeldin (R-NY), and Susan Davis (D-CA) to introduce a resolution (H. Res. 750) urging the European Union (EU) to designate the entirety of Hizballah as a terrorist organization and increase pressure on the organizations and its members. Currently, the EU only designates Hizballah's military wing as a terrorist organization, while the United States makes no distinction between its military and political branches when listing the group on its Foreign Terrorist Organization list.Upon introduction, the Members of Congress released the following statement:\"Hizballah is an Iranian-backed terrorist organization with a global reach that engages in significant illicit criminal activity to fund its terrorism. It doesn't matter what part of the organization you're associated with; if you are connected with Hizballah, you are contributing to the rocket attacks on innocent Israeli civilians, targeted bombings of Jews around the world, slaughter of civilians in Syria, and destabilization of the Middle East. There is no distinction between parts of Hizballah when every part contributes to terrorism. We urge our EU allies to help rein in Hizballah's dangerous worldwide activities.\"The resolution can be viewed here .Last year, Congress passed the Hizballah International Financing Prevention Act which tightened sanctions on Hizballah's criminal and financial networks.Read this original document at: ",
"Congressional Documents and PublicationsApril 20, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:499 wordsBodyToday, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:\"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses.\"In 2004, the U.S. Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. 
The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here .Read this original document at: ",
"Targeted News ServiceApril 20, 2016 Wednesday 7:41 AM ESTCopyright 2016 Targeted News Service LLC All Rights ReservedLength:511 wordsByline:Targeted News ServiceDateline:WASHINGTON BodyRep. Ted Deutch, D-Fla. (21st CD), issued the following news release:Today, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:\"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses.\"In 2004, the U.S. 
Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here ().Contact: Jason Attermann, 202/225-3001Copyright Targeted News Services30FurigayJof-5501453 30FurigayJof",
"US Official NewsFebruary 13, 2013 WednesdayCopyright 2013 Plus Media Solutions Private Limited All Rights ReservedLength:298 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release: Rep. Ted Deutch (D-FL) and Rep. Gus Bilirakis (R-GL) issued the following statements regarding the Bulgarian governments report that two individuals responsible for the July 2012 terrorist attack on a bus in Burgas, Bulgaria, have ties to Hezbollah. Five Israeli tourists and the Bulgarian bus driver were killed in the attack.Congressman Bilirakis: The Bulgarian governments report is yet another example of Hezbollah's deliberate use of terror across the globe. Contrary to some European opinions, Hezbollah is not merely a political organization and is actively involved in terrorist activities. As I have requested many times, the European Union must finally recognize Hezbollah for what it is: a terrorist organization. I commend the Bulgarian government for their thorough investigation and call on the members of the European Union to examine these findings closely.Congressman Deutch: The results of the Bulgarian governments investigation into the deadly attack in Burgas confirms what we already knew - Hezbollah is a terrorist organization that is willing to perpetrate attacks on innocent civilians around the globe. I continue to urge our European partners to formally designate Hezbollah as a terrorist organization. Failure to do so only emboldens Hezbollah to continue its reign of terror in Europe and around the world.In September 2012, Congressmen Bilirakis and Deutch initiated a bi-partisan letter signed by 268 Members of Congress to the President and Ministers of the Commission of the European Union, urging them to include Hezbollah on the European Union's list of terrorist organizations. For further information please visit: ",
"Congressional Documents and PublicationsMay 4, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:204 wordsBodyWashington, May 4 -Rep. Ted Deutch released the following statement on the Florida legislature's passage of SB 444, which expands upon the Protecting Florida's Investments Act, legislation he authored in 2007 in the Florida State Senate:\"I applaud the Florida Legislature's passage of SB 444, legislation that will help ensure national and international security by preventing Florida's taxpayer dollars from supporting companies who choose to violate federal law by bolstering the Iranian regime. I congratulate the bill's sponsors, Sen. Ellyn Bogdanoff and Rep. Mack Bernard. This bill prevents state and local governments from awarding contracts to companies found to be investing in the Iranian energy sector. It is consistent with federal policy and sends a clear message that Floridians will not support any company that puts profit over international security. The Iranian regime continues to pursue its illicit nuclear weapons program, continues to engage in the most egregious human rights violations, and continues to support terrorism across the globe. We must continue to utilize every economic tool at our disposal to bring this regime to its knees. I urge Governor Scott to act quickly to sign this bill into law.\"",
"Congressional Documents and PublicationsMarch 23, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:128 wordsBodyBoca Raton, Mar 23 -Congressman Ted Deutch (D-FL) released the following statement in reaction to the explosion of a bomb today in Jerusalem that killed a 59-year-old woman and injured dozens more:\"Today's horrific bombing in Jerusalem is yet another attack in a surge of violence perpetuated by Palestinian terrorists against innocent Israeli citizens,\" said Congressman Ted Deutch. \"The victims of this heinous attack and the Israeli people deserve the full support of the international community as they seek to defend themselves against this relentless violence. It is deplorable that as Israelis endure this latest bombing in Jerusalem, as well as ongoing rocket attacks by Hamas, some astonishingly still seek to blame Israel for the lack of peace in the region.\"",
"States News ServiceMarch 26, 2015 ThursdayCopyright 2015 States News ServiceLength:218 wordsByline:States News ServiceDateline:WASHINGTON BodyThe following information was released by the office of Florida Rep. Ted Deutch:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\"",
"Congressional Documents and PublicationsMarch 26, 2015Copyright 2015 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:250 wordsBodyCongressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums. In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\"For a fact sheet on H.R. 2, please go to: .Read this original document at: ",
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\" In case of any query regarding this article or other content needs please contact: ",
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\" In case of any query regarding this article or other content needs please contact: "
)), row.names = c(NA, 10L), class = "data.frame", .Names = "text_main")
Here is the matrix of similarity for the same subset of documents:
cosine_df <- structure(list(text1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text2 = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 0), text3 = c(0, 1, 0, 0, 0, 0, 0, 0,
0, 0), text4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text5 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), text6 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0), text7 = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1), text8 = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 1), text9 = c(0, 0, 0, 0, 0, 0, 1, 1,
0, 1), text10 = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)), .Names = c("text1",
"text2", "text3", "text4", "text5", "text6", "text7", "text8",
"text9", "text10"), row.names = c("text1", "text2", "text3",
"text4", "text5", "text6", "text7", "text8", "text9", "text10"
), class = "data.frame")
In case anyone else has a similar problem, this was the solution I ended up creating:
library(quanteda)
myDfm <- dfm(as.character(docs$text_main), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, selection=docnames(myDfm), margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim) #this produces a matrix of the document similarities
threshold <- .9
similar_indices <- unique(apply(cosinemat, 1,
function(x) which(x > threshold)))
## keep only the first element of each set
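## is.list()/is.matrix() are used instead of class(x) == "...", because
## class() of a matrix is c("matrix", "array") in R >= 4.0
if (is.list(similar_indices)) { # sets of different sizes
  unique_indices <- unique(sapply(similar_indices, function(x) as.numeric(x[1])))
} else if (is.matrix(similar_indices)) { # all sets have the same size
  unique_indices <- unique(apply(similar_indices, 2, function(x) as.numeric(x[1])))
} else { # plain vector: every document matched only itself
  unique_indices <- similar_indices
}
## get only the unique texts (drop = FALSE keeps a one-column data frame a data frame)
docs_unique <- docs[unique_indices, , drop = FALSE]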
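As a cross-check of the "keep one document per group" idea, here is a small sketch in Python of a greedy alternative. It is not the code above, and `keep_first_of_each_group` is a name made up for illustration. It walks the documents in order and keeps one only if it is not flagged as a duplicate of something already kept. The matrix is the 0/1 `cosine_df` from the question, written out row by row.

```python
def keep_first_of_each_group(dup):
    """Return the indices to keep.

    A document is dropped when it is flagged (1) as a duplicate of any
    document that has already been kept.
    """
    kept = []
    for i in range(len(dup)):
        if not any(dup[i][j] for j in kept):
            kept.append(i)
    return kept


# 0/1 duplicate matrix from the question (text1 ... text10), diagonal = 0
dup = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
]
kept = keep_first_of_each_group(dup)
print([f"text{i + 1}" for i in kept])
# prints ['text1', 'text2', 'text4', 'text5', 'text6', 'text7']
```

On this subset both approaches keep the same six documents. They can differ when similarity is not transitive (A is similar to B and B to C, but A is not similar to C): the greedy pass never keeps two documents that are flagged as duplicates of each other.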
Related
How to crawl multiple pages and create a dataframe with parsing?
I would like to load multiple pages from a single website and extract specific attributes from different classes as below. Then I would like to create a dataframe with parsed information from multiple pages.
Extract from multiple pages
for page in range(1,10):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    soup = bs(res.text, 'lxml')
Parsing
soup_content = soup.find_all('li', {'class':['list-item ceu clearfix','list-item gsc clearfix','list-item euco clearfix','list-item eg clearfix' ]})
datePublished = []
headline = []
description =[]
urls = []
for i in range(len(soup_content)):
    datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
    headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
    description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
    urls.append('https://www.consilium.europa.eu{}'.format(soup.find('a', {'itemprop': 'url'}).attrs['href']))
To DataFrame
df = pd.DataFrame(data = zip(datePublished, headline, description, urls), columns=['date','title', 'description', 'link'])
df
To expand on my comments, this should work:
import requests
from bs4 import BeautifulSoup

maxPage = 9
datePublished = []
headline = []
description =[]
urls = []
for page in range(1, maxPage+1):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    print(f'[page {page:>3}]', res.status_code, res.reason, 'from', res.url)
    soup = BeautifulSoup(res.content, 'lxml')
    soup_content = soup.find_all('li', {'class':['list-item ceu clearfix','list-item gsc clearfix','list-item euco clearfix','list-item eg clearfix' ]})
    # parse inside the page loop so that every page is collected, not only the last one
    for i in range(len(soup_content)):
        datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
        headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
        description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
        # search within the current item; soup.find(...) would repeat the page's first link for every row
        urls.append('https://www.consilium.europa.eu{}'.format(soup_content[i].find('a', {'itemprop': 'url'}).attrs['href']))
When I ran it, 179 unique rows were collected [20 rows from all pages except the 7th, which had 19].
There are different ways to get to your goal:
@Driftr95 comes up with a modification of yours using range(), which is fine when iterating over a specific number of pages.
Using a while-loop to be flexible in the number of pages, without knowing the exact one. You can also use a counter if you like to break the loop at a certain number of iterations.
...
I would recommend the second one, and also avoiding the bunch of lists, because you have to ensure they all have the same length. Instead use a single list of dicts, which looks more structured.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = 'https://www.consilium.europa.eu'
path ='/en/press/press-releases'
url = base_url+path
data = []
while True:
    print(url)
    soup = BeautifulSoup(requests.get(url).text)
    for e in soup.select('li.list-item'):
        data.append({
            'date':e.find_previous('h2').text,
            'title':e.h3.text,
            'desc':e.p.text,
            'url':base_url+e.h3.a.get('href')
        })
    if soup.select_one('li[aria-label="Go to the next page"] a[href]'):
        url = base_url+path+soup.select_one('li[aria-label="Go to the next page"] a[href]').get('href')
    else:
        break
df = pd.DataFrame(data)
Output
    | date | title | desc | url
0 | 30 January 2023 | Statement by the High Representative on behalf of the EU on the alignment of certain third countries concerning restrictive measures in view of the situation in the Democratic Republic of the Congo | Statement by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Implementing Decision (CFSP) 2022/2398 of 8 December 2022 implementing Decision 2010/788/CFSP concerning restrictive measures in view of the situation in the Democratic Republic of the Congo. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/statement-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-third-countries-concerning-restrictive-measures-in-view-of-the-situation-in-the-democratic-republic-of-the-congo/
1 | 30 January 2023 | Council adopts recommendation on adequate minimum income | The Council adopted a recommendation on adequate minimum income to combat poverty and social exclusion. Income support is considered adequate when it ensures a life in dignity at all stages of life. Member states are recommended to gradually achieve the adequate level of income support by 2030 at the latest, while safeguarding the sustainability of public finances. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/council-adopts-recommendation-on-adequate-minimum-income/
2 | 27 January 2023 | Forward look: 30 January - 12 February 2023 | Overview of the main subjects to be discussed at meetings of the Council of the EU over the next two weeks and upcoming media events. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/forward-look/
3 | 27 January 2023 | Russia: EU prolongs economic sanctions over Russia’s military aggression against Ukraine | The Council prolonged restrictive measures in view of Russia's actions destabilising the situation in Ukraine by six months. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/russia-eu-prolongs-economic-sanctions-over-russia-s-military-aggression-against-ukraine/
4 | 27 January 2023 | Media advisory – Agriculture and Fisheries Council meeting on 30 January 2023 | Main agenda items, approximate timing, public sessions and press opportunities. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/media-advisory-agriculture-and-fisheries-council-meeting-on-30-january-2023/
...
435 | 6 July 2022 | EU support to the African Union Mission in Somalia: Council approves further support under the European Peace Facility | The Council approved €120 million in support to the military component of AMISOM/ATMIS for 2022 under the European Peace Facility. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/eu-support-to-the-african-union-mission-in-somalia-council-approves-further-support-under-the-european-peace-facility/
436 | 6 July 2022 | Report by President Charles Michel to the European Parliament plenary session | Report by European Council President Charles Michel to the European Parliament plenary session on the outcome of the European Council meeting of 23-24 June 2022. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/report-by-president-charles-michel-to-the-european-parliament-plenary-session/
437 | 5 July 2022 | Declaration by the High Representative on behalf of the EU on the alignment of certain countries concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them | Declaration by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Decision (CFSP) 2022/950 of 20 June 2022 amending Decision (CFSP) 2016/1693 concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/declaration-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-countries-concerning-restrictive-measures-against-isil-da-esh-and-al-qaeda-and-persons-groups-undertakings-and-entities-associated-with-them/
438 | 5 July 2022 | Remarks by President Charles Michel after his meeting in Skopje with Prime Minister of North Macedonia Dimitar Kovačevski | During his visit to North Macedonia, President Michel expressed his support for proposed compromise solution on the country's accession negotiations. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/remarks-by-president-charles-michel-after-his-meeting-in-skopje-with-prime-minister-of-north-macedonia-dimitar-kovacevski/
439 | 4 July 2022 | Readout of the telephone conversation between President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed | President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed valued their open and frank exchange and agreed to speak in the near future to take stock. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/04/readout-of-the-telephone-conversation-between-president-charles-michel-and-prime-minister-of-ethiopia-abiy-ahmed/
...
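The two recommendations in this answer (follow the "next page" link until there is none, and collect rows as a single list of dicts) can be sketched without any network access. Everything here is hypothetical: `PAGES` is an in-memory stand-in for the website, `fetch` replaces the `requests.get` and parsing step, and `max_pages` is the optional counter mentioned above.

```python
# Hypothetical in-memory "site": each page has items and an optional next link.
PAGES = {
    "/releases": {
        "items": [("30 January 2023", "Title A"), ("30 January 2023", "Title B")],
        "next": "/releases?page=2",
    },
    "/releases?page=2": {
        "items": [("27 January 2023", "Title C")],
        "next": None,
    },
}


def fetch(path):
    """Stand-in for downloading and parsing one page."""
    return PAGES[path]


def collect_all(start_path, max_pages=100):
    """Follow 'next' links until none is left or max_pages is reached."""
    rows = []
    path, seen = start_path, 0
    while path is not None and seen < max_pages:
        page = fetch(path)
        for date, title in page["items"]:
            # one dict per row: the fields of a row cannot get out of sync
            rows.append({"date": date, "title": title})
        path = page["next"]
        seen += 1
    return rows


rows = collect_all("/releases")
print(rows)
```

Because each row is built in one place, a missing field shows up as an error for that row instead of silently shifting one of several parallel lists. A list of dicts like `rows` can be passed directly to `pd.DataFrame`.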
filtering text and storing the filtered sentence/paragraph into a new column
I am trying to extract some sentences from text data. I want to extract the sentences which contain the phrase medical device company released. I can run the following code:
df_text <- unlist(strsplit(df$TD, "\\."))
df_text
df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text
Which gives me:
[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"
So I extracted the sentences which contain the phrase medical device company released. However, I want to do this but store the results in a new column, so that I can see which grp each sentence came from.
Expected output:
grp   TD    newCol
3613  text  NA # does not contain the sentence
4973  text  medical device company released
5570  text  NA # does not contain the sentence
Data:
df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. 
(BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley#dowjones.com [ 02-10-09 1437ET ]\n ", " --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n By Anjali Athavaley \n Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. 
The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley#dowjones.com [ 12-09-11 0924ET ]\n ", " In September, the company issued a guidance range of 43 cents to 44 cents a share. \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. 
The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share. \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ] \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion. \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion. \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at . \n\nThe company plans to report third-quarter earnings Oct. 14. \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68. \n\nCompany Web site: \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires#DowJones.com \n\nOrder free Annual Report for General Electric Co. \n\nVisit or call 1-888-301-0513 [ 10-06-05 0904ET ] \n " )), class = "data.frame", row.names = c(NA, -3L))
We can split the text into one row per sentence while keeping grp intact, and then keep only the sentences that contain "medical device company released":
library(dplyr)
df %>%
  tidyr::separate_rows(TD, sep = "\\.") %>%
  group_by(grp) %>%
  summarise(newCol = toString(grep(pattern = "medical device company released", TD, ignore.case = TRUE, value = TRUE)))
#  grp   newCol
#  <chr> <chr>
#1 3613  ""
#2 4973  "\n\nThe medical device company released its financia…
#3 5570  ""
Groups without a match get an empty string rather than NA, because toString() of an empty result is "". If NA is needed, the column can be converted afterwards with dplyr::na_if(newCol, "").
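For comparison, here is a minimal base R sketch of the same idea that returns NA for groups without a match, as in the expected output. The data frame `df_demo` and the helper `extract_sentence()` are made up for illustration; they are not part of the original question or answer.

```r
# Hypothetical toy data standing in for the question's `df`
df_demo <- data.frame(
  grp = c("a", "b", "c"),
  TD  = c("Sales rose. Costs fell.",
          "Intro text.\n\nThe medical device company released its outlook. More text.",
          "Nothing relevant here."),
  stringsAsFactors = FALSE
)

# Split one text on literal periods and return the matching sentence(s),
# or NA when no sentence contains the pattern
extract_sentence <- function(text, pattern) {
  sentences <- unlist(strsplit(text, ".", fixed = TRUE))
  hits <- grep(pattern, sentences, ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0) return(NA_character_)
  paste(trimws(hits), collapse = "; ")
}

# Apply the helper row by row so each result stays aligned with its grp
df_demo$newCol <- vapply(df_demo$TD, extract_sentence, character(1),
                         pattern = "medical device company released",
                         USE.NAMES = FALSE)
```

Because the helper works on one row at a time, the grp column never has to be reshaped. The trade-off is that several matching sentences in one text are pasted together with "; ".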
Wrangling Data in R
I'm trying to break this data (specifically extract the graduation rate) out into being analyzed in a useful way. I believe I need to str_split (using R) but am not understanding what type of data it is and what all the \'s mean / etc. I scraped this from a website using the rvest package and below code: url <- "https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/" grad_rate <- read_html(url) %>% html_nodes("script") %>% html_text() %>% purrr::pluck(9) grad_rate "{\"title\":\"College readiness\",\"anchor\":\"College_readiness\",\"analytics_id\":\"CollegeReadiness\",\"subtitle\":\"Learn more about how to help your child graduate ready for college. \\u003ca href=\\\"/gk/articles/jump-start-college-planning/\\\" target=\\\"_blank\\\"\\u003eSee how.\\u003c/a\\u003e\",\"icon_classes\":\"icon-graduation\",\"info_text\":\"\\u003cp\\u003eThis rating shows how well students at this school are prepared for college compared to students at other schools in this state, based on key measures, like graduation rates, college entrance tests and advanced coursework when available.\\u003c/p\\u003e\\u003cp\\u003e\\u003ca href=\\\"/gk/ratings/#collegereadinessrating\\\" target=\\\"_blank\\\"\\u003eLearn more about this rating.\\u003c/a\\u003e\\u003c/p\\u003e\\n\",\"rating\":9,\"sources\":\"\\u003cdiv class=\\\"sourcing\\\"\\u003e\\u003ch1\\u003eGreatSchools profile data sources \\u0026amp; information\\u003c/h1\\u003e\\u003cdiv\\u003e\\u003ch4 \\u003eGreatSchools College Readiness Rating\\u003c/h4\\u003e\\u003cp\\u003eThe College Readiness Rating uses this high school's graduation rates, college entrance exam participation and performance, or AP, IB, or Dual Enrollment participation and AP performance to determine how well schools are preparing students for success in college and beyond. 
The College Readiness Rating was created using 2015 4-year high school graduation rate data from MSDE, using 2016 demographic data from NCES, and the following data from the 2016 Civil Rights Data Collection: percentage of students enrolled in IB, AP or Dual Enrollment classes in grades 9-12, and percentage of students passing 1 or more AP exams grades 9-12.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: GreatSchools; this rating was calculated in 2019 | \\u003cspan class=\\\"emphasis\\\"\\u003eSee more\\u003c/span\\u003e: \\u003ca href=\\\"/gk/ratings/#collegereadinessrating\\\"; target=\\\"_blank\\\"\\u003eAbout this rating\\u003c/a\\u003e\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003e4-year high school graduation rate\\u003c/h4\\u003e\\u003cp\\u003eGraduation rates reflect how many students graduate from this school on time.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: MSDE, 2015\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003eAP course participation\\u003c/h4\\u003e\\u003cp\\u003eAdvanced Placement classes are college-level courses students can take in high school. The percentage of students taking AP classes may reflect whether the school culture is focused on college.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2016\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003ePercentage of students passing 1 or more AP exams grades 9-12\\u003c/h4\\u003e\\u003cp\\u003eThe AP exam pass rate reflects how many students at this school earned a passing score on at least one AP exam. 
Students who do well on AP exams (passing with a score of 3, 4, or 5) may receive college credit.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2016\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003ePercentage of students enrolled in Dual Enrollment classes grades 9-12\\u003c/h4\\u003e\\u003cp\\u003eThe Dual Enrollment participation rate reflects the percentage of students at this school who are taking college courses while in high school. Credits for these courses apply both to high school diploma requirements and college graduation requisites.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2016\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003ePercentage of students enrolled in IB grades 9-12\\u003c/h4\\u003e\\u003cp\\u003eInternational Baccalaureate (IB) is an internationally recognized, high-standards program that emphasizes creative and critical thinking. A high school may have specific IB classes students can take, or a school-wide IB program that affects all classes. Some colleges give college credit for IB courses. 
\\u003ca href='/gk/articles/what-is-ib-international-baccalaureate/' target='_blank'\\u003eMore about IB\\u003c/a\\u003e\\n\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2016\\u003c/p\\u003e\\u003c/div\\u003e\\u003cdiv\\u003e\\u003ch4\\u003eSAT/ACT participation rate\\u003c/h4\\u003e\\u003cp\\u003eThe SAT/ACT participation rate shows the percentage of eligible students in grades 11 or 12 at this school who took the SAT or ACT.\\u003c/p\\u003e\\u003cp\\u003e\\u003cspan class=\\\"emphasis\\\"\\u003eSource\\u003c/span\\u003e: Civil Rights Data Collection, 2014\\u003c/p\\u003e\\u003c/div\\u003e\\u003c/div\\u003e\",\"feedback\":{\"feedback_cta\":\"Did you find the information about college success useful? What can we do better?\",\"feedback_link\":\"https://s.qualaroo.com/45194/cb0e676f-324a-4a74-bc02-72ddf1a2ddd6?school=115\\u0026state=MD\",\"button_text\":\"Answer\"},\"share_content\":\"\\u003cdiv class=\\\"sharing-modal\\\"\\u003e\\u003cdiv class=\\\"sharing-row js-emailSharingLinks js-slTracking\\\" data-url=\\\"https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/?utm_source=profile\\u0026utm_medium=Email\\u0026subject=Severna+Park+High+School+-+College+readiness\\u0026body=Check+out+the+Severna+Park+High+School+-+College+readiness%250D%250A#College_readiness\\\" data-type=\\\"Email\\\" data-module=\\\"College_readiness\\\" data-link=\\\"mailto:?subject=Severna Park High School - College readiness\\u0026body=Check out the Severna Park High School - College readiness%0D%0Ahttps://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School//?utm_source=profile%26utm_medium=email#College_readiness\\\"\\u003e\\u003cdiv class=\\\"sharing-icon-box\\\"\\u003e\\u003cspan class=\\\"icon-mail\\\"\\u003e\\u003c/span\\u003e\\u003c/div\\u003e\\u003cspan class=\\\"sharing-row-text\\\"\\u003eEmail\\u003c/span\\u003e\\u003c/div\\u003e\\u003cdiv 
class=\\\"sharing-row js-sharingLinks js-slTracking\\\" data-url=\\\"https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/?utm_source=profile\\u0026utm_medium=Facebook#College_readiness\\\" data-siteparams=\\\"\\u0026t=Severna Park High School - College readiness\\\" data-type=\\\"Facebook\\\" data-module=\\\"College_readiness\\\" data-link=\\\"https://www.facebook.com/sharer/sharer.php?u=\\\"\\u003e\\u003cdiv class=\\\"sharing-icon-box\\\"\\u003e\\u003cspan class=\\\"icon-facebook\\\"\\u003e\\u003c/span\\u003e\\u003c/div\\u003e\\u003cspan class=\\\"sharing-row-text\\\"\\u003eFacebook\\u003c/span\\u003e\\u003c/div\\u003e\\u003cdiv class=\\\"sharing-row js-sharingLinks js-slTracking\\\" data-url=\\\"https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/?utm_source=profile\\u0026utm_medium=Twitter#College_readiness\\\" data-siteparams=\\\"\\u0026via=GreatSchools\\u0026text=Severna Park High School - College readiness\\\" data-type=\\\"Twitter\\\" data-module=\\\"College_readiness\\\" data-link=\\\"https://twitter.com/share?url=\\\"\\u003e\\u003cdiv class=\\\"sharing-icon-box\\\"\\u003e\\u003cspan class=\\\"icon-twitter\\\"\\u003e\\u003c/span\\u003e\\u003c/div\\u003e\\u003cspan class=\\\"sharing-row-text\\\"\\u003eTwitter\\u003c/span\\u003e\\u003c/div\\u003e\\u003cdiv class=\\\"sharing-row\\\"\\u003e\\u003cdiv class=\\\"sharing-icon-box\\\"\\u003e\\u003cspan class=\\\"icon-link\\\"\\u003e\\u003c/span\\u003e\\u003c/div\\u003e\\u003cspan class=\\\"sharing-row-text\\\"\\u003ePermalink\\u003c/span\\u003e\\u003cdiv\\u003e\\u003cinput class=\\\"permalink js-permaLink js-slTracking\\\" type=\\\"text\\\" value=\\\"https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/?utm_source=profile\\u0026utm_medium=Permalink#College_readiness\\\" /\\u003e\\u003cspan class=\\\"acknowledgement\\\"\\u003eCopied to 
clipboard\\u003c/span\\u003e\\u003c/div\\u003e\\u003c/div\\u003e\\u003c/div\\u003e\",\"data\":[{\"title\":\"College readiness\",\"anchor\":\"College_readiness\",\"data\":[{\"narration\":\"\\u003cdiv class=\\\"auto-narration\\\"\\u003e \\u003ch3 class=\\\"positive\\\"\\u003eGood news!\\u003c/h3\\u003e \\u003cp\\u003eThis school is \\u003cspan class=\\\"emphasis\\\"\\u003efar above\\u003c/span\\u003e the state average in key measures of college and career readiness.\\u003c/p\\u003e \\u003cp\\u003eEven at schools with strong college and career readiness, there may be students who are not getting the opportunities they need to succeed.\\u003c/p\\u003e \\u003chr /\\u003e \\u003cp class=\\\"parent-tip\\\"\\u003e\\u003cimg src='/assets/school_profiles/owl.png' /\\u003e\\u003cspan class=\\\"speech-bubble left\\\"\\u003eParent tip\\u003c/span\\u003e\\u003c/p\\u003e \\u003cp class=\\\"footnote\\\"\\u003eAsk the school what it’s doing to help all students succeed in advanced classes and prepare for \\u003ca href=\\\"/gk/articles/improving-sat-scores/\\\"\\u003ecollege entrance tests\\u003c/a\\u003e.\\u003c/p\\u003e \\u003c/div\\u003e\\n\",\"title\":\"College readiness\",\"values\":[{\"label\":\"94\",\"score\":93,\"breakdown\":\"4-year high school graduation rate\",\"state_average\":86,\"state_average_label\":\"87\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"Graduation rates reflect how many students graduate from this school on time.\"},{\"label\":\"51\",\"score\":50,\"breakdown\":\"AP course participation\",\"state_average\":26,\"state_average_label\":\"27\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"Advanced Placement classes are college-level courses students can take in high school. 
The percentage of students taking AP classes may reflect whether the school culture is focused on college.\"},{\"label\":\"73\",\"score\":72,\"breakdown\":\"Percentage of students passing 1 or more AP exams grades 9-12\",\"state_average\":62,\"state_average_label\":\"63\",\"display_type\":\"bar\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"The AP exam pass rate reflects how many students at this school earned a passing score on at least one AP exam. Students who do well on AP exams (passing with a score of 3, 4, or 5) may receive college credit.\"},{\"label\":\"6\",\"score\":5,\"breakdown\":\"Percentage of students enrolled in Dual Enrollment classes grades 9-12\",\"state_average\":2,\"state_average_label\":\"3\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"The Dual Enrollment participation rate reflects the percentage of students at this school who are taking college courses while in high school. Credits for these courses apply both to high school diploma requirements and college graduation requisites.\"},{\"label\":\"\\u003c1\",\"score\":0,\"breakdown\":\"Percentage of students enrolled in IB grades 9-12\",\"state_average\":2,\"state_average_label\":\"2\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"International Baccalaureate (IB) is an internationally recognized, high-standards program that emphasizes creative and critical thinking. A high school may have specific IB classes students can take, or a school-wide IB program that affects all classes. Some colleges give college credit for IB courses. 
\\u003ca href='/gk/articles/what-is-ib-international-baccalaureate/' target='_blank'\\u003eMore about IB\\u003c/a\\u003e\\n\"},{\"label\":\"93\",\"score\":93,\"breakdown\":\"SAT/ACT participation rate\",\"state_average\":57,\"state_average_label\":\"57\",\"display_type\":\"person\",\"lower_range\":0,\"upper_range\":100,\"tooltip_html\":\"The SAT/ACT participation rate shows the percentage of eligible students in grades 11 or 12 at this school who took the SAT or ACT.\"}]}]}],\"showTabs\":false,\"faq\":{\"cta\":\"Notice something missing or confusing?\",\"content\":\"\\u003cp\\u003eCollege readiness information comes from state or national education agencies (click on the \\\"Sources\\\" link for details).\\u003c/p\\u003e \\u003cp\\u003eWhen information is missing in our display, it's most likely because this school did not offer an AP course, IB or dual enrollment classes, or participate in one of the two college readiness tests, the ACT or SAT (some states mandate which college readiness test schools use). It's also possible that the missing data was not included in the data we received from the state.\\u003c/p\\u003e \\u003cp\\u003eDid you find the information about college readiness useful? What can we do better? \\u003ca href=\\\"https://s.qualaroo.com/45194/34aea707-ec71-4130-b6bb-2864e0528c64\\\" target=\\\"_blank\\\"\\u003eShare your feedback.\\u003c/a\\u003e\\u003c/p\\u003e \\u003cp\\u003e\\u003ca href=\\\"/gk/ratings/#collegereadinessrating\\\" target=\\\"_blank\\\"\\u003eLearn more about this rating.\\u003c/a\\u003e\\u003c/p\\u003e \\u003cp\\u003eStill have questions? 
\\u003ca href=\\\"https://greatschools.zendesk.com/hc/en-us\\\" target=\\\"_blank\\\"\\u003eVisit our FAQ page.\\u003c/a\\u003e\\u003c/p\\u003e\\n\",\"element_type\":\"faq\"},\"no_data_summary\":\"This section includes information about this school’s graduation rates, SAT/ACT tests, and AP coursework.\\n\",\"qualaroo_module_link\":\"https://s.qualaroo.com/45194/34aea707-ec71-4130-b6bb-2864e0528c64?state=MD\\u0026school=115\"}" Thanks for any help!
The good news is that you can easily pull this value out of the response text with a regular expression:
library(rvest)
library(magrittr)
library(stringr)
# Read the whole page as text
p <- read_html('https://www.greatschools.org/maryland/severna-park/115-Severna-Park-High-School/') %>% html_text()
# Capture the first "label" value that follows the "College readiness" values array
rate <- str_match_all(p,'"College readiness","values":\\[\\{"label":"(.*?)"')[[1]][,2][1]
print(as.numeric(rate))
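On the question of what the data is: the scraped string looks like JSON. The `\"` sequences are just how R prints double quotes inside a string, and `\\u003c` is the JSON escape for `<`. As a rough illustration of how the regular expression works, the sketch below runs the same pattern with base R on a short fragment copied by hand from the output in the question. The name `snippet` is made up for the example.

```r
# Abbreviated fragment of the scraped string (not the full response)
snippet <- '"title":"College readiness","values":[{"label":"94","score":93,"breakdown":"4-year high school graduation rate"}'

# Same pattern as in the answer: a lazy capture of the first "label" value
# that follows the opening of the "values" array
pattern <- '"College readiness","values":\\[\\{"label":"(.*?)"'

# regexec() returns the full match plus capture groups; element 2 is the group
m <- regmatches(snippet, regexec(pattern, snippet, perl = TRUE))
grad_rate <- as.numeric(m[[1]][2])
```

The capture group is lazy (`.*?`), so it stops at the first closing quote instead of running on to a later one. A JSON parser would be more robust than a regular expression if more than one field is needed.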
Using unique function on a specific column [duplicate]
I have a dataframe with Twitter data where the tweet message is in the first column (text), and the number of retweets is in the second column (retweetCount). I would like to remove rows where the tweet message is repeated. In the past, I've used the unique function to remove duplicate observations from a dataframe, like so:
df_no_duplicates <- unique(df)
But for my Twitter data, this would only remove rows where both the exact text and the exact retweetCount are repeated. Can I specify for the unique function to only work on the text column? If possible, I would also like to specify the function further with the following logic: IF text is repeated in dataframe, THEN only keep the observation with the greatest retweetCount. Here's a reproducible sample of my data (although I'm not sure if there are any repeat messages in the first 50 rows):
dput(head(df, 50))
structure(list(text = c("as always making sense of it all for us ive never felt less welcome in this country brexit ", "never underestimate power of stupid people in a democracy brexit", "a quick guide to brexit and beyond after britain votes to quit eu ", "this selfinflicted wound will be his legacy cameron falls on sword after brexit euref ", "so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", "this is a very good summary no biasspinagenda of the legal ramifications of the leave result brexit ", "you cant make this up cornwall votes out immediately pleads to keep eu cash this was never a rehearsal ", "no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", "i wouldnt mind so much but the result is based on a pack of lies and unaccountable promises democracy didnt win brexit pro", "so the uk is out cameron resigned scotland wants to leave great britain sinn fein 
plans to unify ireland and its o", "absolutely brilliant poll on brexit by ", "think the brexit campaign relies on the same sort of logic that drpepper does whats the worst that can happen thingsthatarewellbrexit", "am baffled by nigel farages claim that brexit is a victory for real people as if the 47 voting remain are fucking smu", "not one of the uks problems has been solved by brexit vote migration inequality the uks centurylong decline as", "scotland should never leave eu calls for new independence vote grow brexit", "the most articulate take on brexit is actually this ft reader comment today ", "david cameron has said he is set to resign as british prime minister after uk votes to leave eu brexit ", "im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\nremain", "life is too short to wear boring shoes brexit", "pm at buckingham palace for audience with the queen brexit", "i hate people too but i dont think id vote for armageddon over it brexit", "text = when you send a message\r\n\r\nsext = when you send a sexy message\r\n\r\nbrexit = when you send an entire global economy to he", "i actually was pretty confident that the brits wouldnt vote for a brexit didnt see this coming", "pm at buckingham palace for audience with the queen brexit", "now just the time can say if it is the right decision brexit", "no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", "that was whatever your view on brexit a superb speech hope next pm will be as good a statesman as david cameron ", "david cameron to step down as over 52pc of britains vote to leave the european union brexit", "between brexit and euro2016 england have got a few johnsons to worry about so heres a quick guideeurefresults ", "scotland voted overwhelmingly to remain in the eu ", "brexit is great enough on the merits but watching the tears and tantrums is the icing on the cake ", 
"the nightmare has begun it will be a long one todays column on brexit ", "brexit why premier league clubs may be unable to sign foreign players under age of 18\r\n ", "brexit why premier league clubs may be unable to sign foreign players under age of 18\r\n ", "cant think about brexit without thinking about this ", "brexit likely to help rajoy win sundays election but could be nightmare for him if he gets to govern given economic fragil", "trump praises uk public for taking back control of country brexit", "expert many feel globalisation isnt working for them yes mate thats the 999 of punters who it is not working for abc730 brexit", "cornwall votes against europe then expects to keep eu funding good luck with that ", "weve done it without a bullet being fired nigel farage forgetting that a member of parliament was assassinated over b", "londoners call for capital to gain independence after brexit vote ", "12 trump and brexit are direct results of pressure on working class when big companies bow down to", "just a reminder that the brexit newspapers were easily worth more than a 2 swing none of the men who own them pay the", "i always loved gb thought about moving there some day but the decision they made yesterday is really shocking disa", "winter is coming gameofthrones brexit ", "the most articulate take on brexit is actually this ft reader comment today ", "aw\r\n\r\ni worry that the brexit thing will justaid tyrannys spread", "breaking brexit spain proposes shared sovereignty over gibraltar", "the entirety of scotland voted to remain you imbecile brexit ", "diane calling it right again \r\nthe dispossessed voted for brexit jeremy corbyn offers real change\r\nhttp" ), retweetCount = c(0, 251, 39, 0, 6462, 0, 1391, 31595, 15, 6462, 20521, 0, 871, 10, 184, 1239, 143, 0, 0, 218, 0, 3482, 0, 218, 0, 31595, 0, 25, 777, 14, 404, 6, 1, 0, 10756, 4, 198, 0, 666, 12387, 609, 0, 237, 1, 0, 1239, 0, 2431, 6, 84)), .Names = c("text", "retweetCount"), row.names = c(NA, 
50L), class = "data.frame")
The reprex data needs a little work, but I think this will work in general using dplyr from the tidyverse:
library(tidyverse)
df2 <- df %>%
  group_by(text) %>%
  summarise(retweetCount = max(retweetCount)) %>%
  distinct()
I can't test on your data, but summarise() already returns one row per text, so the final distinct() is probably not necessary.
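If you would rather stay in base R, one common pattern is to sort by retweetCount in descending order and then drop repeated text values with duplicated(). This is a sketch on a small made-up data frame (`tweets`), not on the data from the question:

```r
# Hypothetical toy data with two repeated messages
tweets <- data.frame(
  text = c("pm at buckingham palace", "winter is coming",
           "pm at buckingham palace", "winter is coming"),
  retweetCount = c(218, 0, 300, 0),
  stringsAsFactors = FALSE
)

# Sort so the highest retweetCount for each text comes first
ordered <- tweets[order(tweets$retweetCount, decreasing = TRUE), ]

# duplicated() flags every occurrence after the first, so negating it
# keeps exactly one row per text: the one with the largest count
deduped <- ordered[!duplicated(ordered$text), ]
```

Unlike the summarise() approach, this keeps whole rows, which matters once the data frame has more columns than text and retweetCount.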
Transforming kwic objects into single dfm
I have a corpus of newspaper articles of which only specific parts are of interest for my research. I'm not happy with the results I get from classifying texts along different frames because the data contains too much noise. I therefore want to extract only the relevant parts from the documents. I was thinking of doing so by transforming several kwic objects generated by the quanteda package into a single df. So far I've tried the following exampletext <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. 
Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
mycorpus <- corpus(exampletext)
mycorpus.nat <- corpus(kwic(mycorpus, "nationalit*", window = 5, valuetype = "glob"))
mycorpus.cit <- corpus(kwic(mycorpus, "citizenship", window = 5, valuetype = "glob"))
mycorpus.kwic <- mycorpus.nat + mycorpus.cit
mydfm <- dfm(mycorpus.kwic)
This, however, generates a dfm that contains 4 documents instead of 2, and when both keywords are present in a document even more. I can't think of a way to bring the dfm down to the original number of documents. Thank you for helping me out.
We recently added window argument to tokens_select() for this purpose: require(quanteda) txt <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
toks <- tokens(txt)
mt_nat <- dfm(tokens_select(toks, "nationalit*", window = 5))
mt_cit <- dfm(tokens_select(toks, "citizenship*", window = 5))
Please make sure that you are using the latest version of quanteda.
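To make the effect of the window argument concrete, here is a rough base R sketch of the same idea: keep each keyword plus a few tokens on either side, but do it per document, so the number of documents never changes. The helper `keep_window()`, the toy texts in `docs`, and the whitespace tokenization are assumptions made for this illustration. This is not how quanteda implements tokens_select().

```r
# Keep every token within `window` positions of a token matching `pattern`
keep_window <- function(tokens, pattern, window = 5) {
  hits <- grep(pattern, tokens, ignore.case = TRUE)
  if (length(hits) == 0) return(character(0))
  idx <- unlist(lapply(hits, function(i) {
    seq(max(1, i - window), min(length(tokens), i + window))
  }))
  # Overlapping windows are merged by de-duplicating the positions
  tokens[sort(unique(idx))]
}

# Two shortened toy documents, one keyword each
docs <- c(
  d1 = "athletes who change their nationality must wait three years before competing",
  d2 = "moved to Britain in March and was granted British citizenship this year"
)

# Naive whitespace tokenization, then one filtered token vector per document
toks_demo <- strsplit(tolower(docs), "\\s+")
kept <- lapply(toks_demo, keep_window,
               pattern = "^(nationalit.*|citizenship)$", window = 2)
```

Here `kept` still has one element per input document, which is the property the question asks for. In quanteda itself, tokens_select() accepts a character vector of patterns, so passing c("nationalit*", "citizenship") in a single call and building one dfm from the result should likewise give one row per original document.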