I would like to load multiple pages from a single website and extract specific attributes from different classes, as below. Then I would like to create a dataframe with the parsed information from all of the pages.
Extract from multiple pages
for page in range(1,10):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    soup = bs(res.text, 'lxml')
Parsing
soup_content = soup.find_all('li', {'class':['list-item ceu clearfix','list-item gsc clearfix','list-item euco clearfix','list-item eg clearfix' ]})
datePublished = []
headline = []
description =[]
urls = []
for i in range(len(soup_content)):
    datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
    headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
    description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
    urls.append('https://www.consilium.europa.eu{}'.format(soup.find('a', {'itemprop': 'url'}).attrs['href']))
To DataFrame
df = pd.DataFrame(data = zip(datePublished, headline, description, urls), columns=['date','title', 'description', 'link'])
df
To expand on my comments, this should work:
import requests
from bs4 import BeautifulSoup

maxPage = 9
# create the lists once, before the page loop, so every page appends to them
datePublished = []
headline = []
description = []
urls = []
for page in range(1, maxPage+1):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    print(f'[page {page:>3}]', res.status_code, res.reason, 'from', res.url)
    soup = BeautifulSoup(res.content, 'lxml')
    soup_content = soup.find_all('li', {'class':['list-item ceu clearfix','list-item gsc clearfix','list-item euco clearfix','list-item eg clearfix' ]})
    for i in range(len(soup_content)):
        datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
        headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
        description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
        # look up the link inside the current item, not in the whole page
        urls.append('https://www.consilium.europa.eu{}'.format(soup_content[i].find('a', {'itemprop': 'url'}).attrs['href']))
When I ran it, 179 unique rows were collected [20 rows from all pages except the 7th, which had 19].
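To check a unique-row count like the one above, something along these lines may help. It is only a sketch: the sample values are made up, and `unique_rows` is a hypothetical helper, not part of the code above.

```python
def unique_rows(rows):
    """Return rows with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # tuples are hashable, so they can go into a set
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up sample values standing in for the scraped lists
datePublished = ["2023-01-30", "2023-01-30", "2023-01-30"]
headline = ["Title A", "Title B", "Title A"]
description = ["Desc A", "Desc B", "Desc A"]
urls = ["/a", "/b", "/a"]  # the third entry repeats the first

rows = list(zip(datePublished, headline, description, urls))
deduped = unique_rows(rows)
print(len(rows), "collected,", len(deduped), "unique")
```

Comparing the two counts shows whether any item was picked up more than once across pages.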
There are different ways to reach your goal:
@Driftr95 suggests a modification of your code using range(), which is fine when iterating over a known number of pages.
Use a while loop to stay flexible when the exact number of pages is unknown. You can also add a counter if you want to break the loop after a certain number of iterations.
...
I would recommend the second one, and I would also avoid the bunch of separate lists, because you have to ensure they all have the same length. Instead, use a single list of dicts, which is more structured.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
base_url = 'https://www.consilium.europa.eu'
path ='/en/press/press-releases'
url = base_url+path
data = []
while True:
    print(url)
    soup = BeautifulSoup(requests.get(url).text)
    for e in soup.select('li.list-item'):
        data.append({
            'date':e.find_previous('h2').text,
            'title':e.h3.text,
            'desc':e.p.text,
            'url':base_url+e.h3.a.get('href')
        })
    if soup.select_one('li[aria-label="Go to the next page"] a[href]'):
        url = base_url+path+soup.select_one('li[aria-label="Go to the next page"] a[href]').get('href')
    else:
        break
df = pd.DataFrame(data)
Output
| | date | title | desc | url |
|---|---|---|---|---|
| 0 | 30 January 2023 | Statement by the High Representative on behalf of the EU on the alignment of certain third countries concerning restrictive measures in view of the situation in the Democratic Republic of the Congo | Statement by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Implementing Decision (CFSP) 2022/2398 of 8 December 2022 implementing Decision 2010/788/CFSP concerning restrictive measures in view of the situation in the Democratic Republic of the Congo. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/statement-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-third-countries-concerning-restrictive-measures-in-view-of-the-situation-in-the-democratic-republic-of-the-congo/ |
| 1 | 30 January 2023 | Council adopts recommendation on adequate minimum income | The Council adopted a recommendation on adequate minimum income to combat poverty and social exclusion. Income support is considered adequate when it ensures a life in dignity at all stages of life. Member states are recommended to gradually achieve the adequate level of income support by 2030 at the latest, while safeguarding the sustainability of public finances. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/council-adopts-recommendation-on-adequate-minimum-income/ |
| 2 | 27 January 2023 | Forward look: 30 January - 12 February 2023 | Overview of the main subjects to be discussed at meetings of the Council of the EU over the next two weeks and upcoming media events. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/forward-look/ |
| 3 | 27 January 2023 | Russia: EU prolongs economic sanctions over Russia’s military aggression against Ukraine | The Council prolonged restrictive measures in view of Russia's actions destabilising the situation in Ukraine by six months. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/russia-eu-prolongs-economic-sanctions-over-russia-s-military-aggression-against-ukraine/ |
| 4 | 27 January 2023 | Media advisory – Agriculture and Fisheries Council meeting on 30 January 2023 | Main agenda items, approximate timing, public sessions and press opportunities. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/media-advisory-agriculture-and-fisheries-council-meeting-on-30-january-2023/ |
| ... | ... | ... | ... | ... |
| 435 | 6 July 2022 | EU support to the African Union Mission in Somalia: Council approves further support under the European Peace Facility | The Council approved €120 million in support to the military component of AMISOM/ATMIS for 2022 under the European Peace Facility. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/eu-support-to-the-african-union-mission-in-somalia-council-approves-further-support-under-the-european-peace-facility/ |
| 436 | 6 July 2022 | Report by President Charles Michel to the European Parliament plenary session | Report by European Council President Charles Michel to the European Parliament plenary session on the outcome of the European Council meeting of 23-24 June 2022. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/report-by-president-charles-michel-to-the-european-parliament-plenary-session/ |
| 437 | 5 July 2022 | Declaration by the High Representative on behalf of the EU on the alignment of certain countries concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them | Declaration by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Decision (CFSP) 2022/950 of 20 June 2022 amending Decision (CFSP) 2016/1693 concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/declaration-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-countries-concerning-restrictive-measures-against-isil-da-esh-and-al-qaeda-and-persons-groups-undertakings-and-entities-associated-with-them/ |
| 438 | 5 July 2022 | Remarks by President Charles Michel after his meeting in Skopje with Prime Minister of North Macedonia Dimitar Kovačevski | During his visit to North Macedonia, President Michel expressed his support for proposed compromise solution on the country's accession negotiations. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/remarks-by-president-charles-michel-after-his-meeting-in-skopje-with-prime-minister-of-north-macedonia-dimitar-kovacevski/ |
| 439 | 4 July 2022 | Readout of the telephone conversation between President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed | President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed valued their open and frank exchange and agreed to speak in the near future to take stock. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/04/readout-of-the-telephone-conversation-between-president-charles-michel-and-prime-minister-of-ethiopia-abiy-ahmed/ |
| ... | ... | ... | ... | ... |
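The counter idea mentioned above (breaking the while loop after a set number of iterations) could look roughly like the sketch below. To keep it runnable without network access, the page fetching is replaced by a hypothetical `fetch_page` callable that returns the rows of a page plus the next URL (or None); `FAKE_SITE` and `collect_pages` are likewise made-up names for illustration.

```python
def collect_pages(fetch_page, start_url, max_pages=None):
    """Follow 'next page' links until none is left or max_pages is reached.

    fetch_page is a hypothetical callable: url -> (rows, next_url or None).
    """
    data = []
    url = start_url
    pages_seen = 0
    while url is not None:
        # optional counter: stop early once the cap is reached
        if max_pages is not None and pages_seen >= max_pages:
            break
        rows, url = fetch_page(url)
        data.extend(rows)
        pages_seen += 1
    return data

# Stand-in for the real site: three fake pages linked in a chain
FAKE_SITE = {
    "/p1": ([{"title": "a"}, {"title": "b"}], "/p2"),
    "/p2": ([{"title": "c"}], "/p3"),
    "/p3": ([{"title": "d"}], None),
}

all_rows = collect_pages(FAKE_SITE.get, "/p1")
capped_rows = collect_pages(FAKE_SITE.get, "/p1", max_pages=2)
print(len(all_rows), len(capped_rows))
```

In a real run, `fetch_page` would wrap the request and parsing from the example, returning the list of dicts for one page and the href of the "next page" link if there is one.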
I am trying to extract some sentences from text data. I want to extract the sentences that contain the phrase "medical device company released". I can run the following code:
df_text <- unlist(strsplit(df$TD, "\\."))
df_text
df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text
Which gives me:
[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"
So I extracted the sentences that contain the phrase "medical device company released". However, I want to store the result in a new column, so that I can see which grp each sentence came from.
Expected output:
grp   TD    newCol
3613  text  NA                               # does not contain the sentence
4973  text  medical device company released
5570  text  NA                               # does not contain the sentence
Data:
df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. (BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley#dowjones.com [ 02-10-09 1437ET ]\n ",
" --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n By Anjali Athavaley \n Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. 
approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley#dowjones.com [ 12-09-11 0924ET ]\n ",
" In September, the company issued a guidance range of 43 cents to 44 cents a share. \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share. \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ] \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion. \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion. \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at . \n\nThe company plans to report third-quarter earnings Oct. 14. \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68. \n\nCompany Web site: \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires#DowJones.com \n\nOrder free Annual Report for General Electric Co. \n\nVisit or call 1-888-301-0513 [ 10-06-05 0904ET ] \n "
)), class = "data.frame", row.names = c(NA, -3L))
We can split the data into separate rows, keeping grp intact, and keep only the sentences that contain "medical device company released".
library(dplyr)
df %>%
  tidyr::separate_rows(TD, sep = "\\.") %>%
  group_by(grp) %>%
  summarise(newCol = toString(grep(pattern = "medical device company released",
                                   TD, ignore.case = TRUE, value = TRUE)))
# grp newCol
# <chr> <chr>
#1 3613 ""
#2 4973 "\n\nThe medical device company released its financia…
#3 5570 ""
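For comparison, the same split, filter, and collapse idea can be sketched in plain Python. This is not a translation of the dplyr code above: the records are shortened, made-up stand-ins for df, `matching_sentences` is a hypothetical helper, and unlike the R version it strips surrounding whitespace from each sentence.

```python
def matching_sentences(text, phrase):
    """Split text on '.' and join the pieces containing phrase (case-insensitive)."""
    pieces = text.split(".")
    hits = [p.strip() for p in pieces if phrase.lower() in p.lower()]
    return ", ".join(hits)  # empty string when nothing matches

# Shortened, made-up stand-ins for the rows of df
records = [
    {"grp": "3613", "TD": "Wal-Mart plans to add positions. Shares are off."},
    {"grp": "4973", "TD": "Earnings will grow.\n\nThe medical device company "
                          "released its outlook Friday. More text."},
]

phrase = "medical device company released"
result = {r["grp"]: matching_sentences(r["TD"], phrase) for r in records}
print(result)
```

Groups without a match end up with an empty string, which mirrors the "" rows in the output above.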
I have a corpus of news articles on a given topic. Some of these articles are the exact same article but have been given additional headers and footers that very slightly change the content. I am trying to delete all but one of the potential duplicates so the final corpus only contains unique articles.
I decided to use cosine similarity to identify the potential duplicates:
myDfm <- dfm(as.character(docs$text_main), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, selection=docnames(myDfm), margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim)
After looking at a subset of the data, I chose a cutoff of .9 cosine similarity or above to indicate duplicates. (I am okay with any error that this introduces.) Given this, I have converted the diagonal to 0 (i.e., not a dup) and altered the matrix to indicate which documents are duplicates and which are not:
diag(cosinemat) <- 0
cosinemat[cosinemat >= .9] <- 1
cosinemat[cosinemat < .9] <- 0
The problem I'm running into is figuring out how to delete all but one of the duplicate documents. Initially, I envisioned a for loop that goes through each column cell by cell and, for any cell that has a value of 1 (i.e., is a duplicate), deletes the column with the same name as the row of the current cell, reconstitutes the matrix, and continues on to the next cell. The for loop doesn't seem to like the line of code that deletes the columns with the name of the current row when the cell is equal to 1. Also, I'm not sure it's okay to reconstitute the object you're looping through. Something like this:
cosine_df <- as.data.frame(cosinemat)
for(col in 1:ncol(cosine_df)){
  for(row in 1:nrow(cosine_df)){
    if(cosine_df[col,row] == 0){
      next
    }
    if(cosine_df[col,row] == 1){
      cosine_df <- cosine_df[!rownames(cosine_df) %in% paste(rownames(cosine_df)[col,row]]
    }
  }
}
I'm not set on this approach, and I'm open to creative solutions, so long as I am able to identify similar documents and to delete all but one document.
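One way to avoid modifying the matrix while looping over it is a greedy pass: walk through the documents in order and keep a document only if it is not too similar to any document already kept. The sketch below illustrates that idea in plain Python under several assumptions: it uses simple word counts rather than a quanteda dfm, the three texts are made up, and `cosine` and `drop_near_duplicates` are hypothetical helper names.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def drop_near_duplicates(texts, threshold=0.9):
    """Return indices of texts to keep: the first member of each near-duplicate set."""
    vectors = [Counter(re.findall(r"\w+", t.lower())) for t in texts]
    kept = []
    for i, vec in enumerate(vectors):
        # keep document i only if it is below the threshold against all kept ones
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Made-up texts: the first two differ only in their header
texts = [
    "Wire one header. The council adopted a new recommendation on minimum income today.",
    "Wire two header. The council adopted a new recommendation on minimum income today.",
    "A completely different article about heart valve sales and earnings guidance.",
]
kept = drop_near_duplicates(texts)
print(kept)
```

Note that this is greedy rather than a full clustering: a document is compared only against the ones already kept, so the result can depend on document order.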
Here's a subset of the documents if it helps:
docs <- structure(list(text_main = c("Congressional Documents and PublicationsMay 26, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:287 wordsBody(Washington, DC) Reps. Ted Deutch (D-FL) and Gus Bilirakis (R-FL) joined with Reps. Steve Israel (D-NY), Mike Kelly (R-PA), Ted Lieu (D-CA), Adam Kinzinger (R-IL), Hakeem Jeffries (D-NY), Lee Zeldin (R-NY), and Susan Davis (D-CA) to introduce a resolution (H. Res. 750) urging the European Union (EU) to designate the entirety of Hizballah as a terrorist organization and increase pressure on the organizations and its members. Currently, the EU only designates Hizballah's military wing as a terrorist organization, while the United States makes no distinction between its military and political branches when listing the group on its Foreign Terrorist Organization list.Upon introduction, the Members of Congress released the following statement:\"Hizballah is an Iranian-backed terrorist organization with a global reach that engages in significant illicit criminal activity to fund its terrorism. It doesn't matter what part of the organization you're associated with; if you are connected with Hizballah, you are contributing to the rocket attacks on innocent Israeli civilians, targeted bombings of Jews around the world, slaughter of civilians in Syria, and destabilization of the Middle East. There is no distinction between parts of Hizballah when every part contributes to terrorism. We urge our EU allies to help rein in Hizballah's dangerous worldwide activities.\"The resolution can be viewed here .Last year, Congress passed the Hizballah International Financing Prevention Act which tightened sanctions on Hizballah's criminal and financial networks.Read this original document at: ",
"Congressional Documents and PublicationsApril 20, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:499 wordsBodyToday, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:\"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses.\"In 2004, the U.S. Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. 
The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here .Read this original document at: ",
"Targeted News ServiceApril 20, 2016 Wednesday 7:41 AM ESTCopyright 2016 Targeted News Service LLC All Rights ReservedLength:511 wordsByline:Targeted News ServiceDateline:WASHINGTON BodyRep. Ted Deutch, D-Fla. (21st CD), issued the following news release:Today, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:\"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses.\"In 2004, the U.S. 
Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here ().Contact: Jason Attermann, 202/225-3001Copyright Targeted News Services30FurigayJof-5501453 30FurigayJof",
"US Official NewsFebruary 13, 2013 WednesdayCopyright 2013 Plus Media Solutions Private Limited All Rights ReservedLength:298 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release: Rep. Ted Deutch (D-FL) and Rep. Gus Bilirakis (R-GL) issued the following statements regarding the Bulgarian governments report that two individuals responsible for the July 2012 terrorist attack on a bus in Burgas, Bulgaria, have ties to Hezbollah. Five Israeli tourists and the Bulgarian bus driver were killed in the attack.Congressman Bilirakis: The Bulgarian governments report is yet another example of Hezbollah's deliberate use of terror across the globe. Contrary to some European opinions, Hezbollah is not merely a political organization and is actively involved in terrorist activities. As I have requested many times, the European Union must finally recognize Hezbollah for what it is: a terrorist organization. I commend the Bulgarian government for their thorough investigation and call on the members of the European Union to examine these findings closely.Congressman Deutch: The results of the Bulgarian governments investigation into the deadly attack in Burgas confirms what we already knew - Hezbollah is a terrorist organization that is willing to perpetrate attacks on innocent civilians around the globe. I continue to urge our European partners to formally designate Hezbollah as a terrorist organization. Failure to do so only emboldens Hezbollah to continue its reign of terror in Europe and around the world.In September 2012, Congressmen Bilirakis and Deutch initiated a bi-partisan letter signed by 268 Members of Congress to the President and Ministers of the Commission of the European Union, urging them to include Hezbollah on the European Union's list of terrorist organizations. For further information please visit: ",
"Congressional Documents and PublicationsMay 4, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:204 wordsBodyWashington, May 4 -Rep. Ted Deutch released the following statement on the Florida legislature's passage of SB 444, which expands upon the Protecting Florida's Investments Act, legislation he authored in 2007 in the Florida State Senate:\"I applaud the Florida Legislature's passage of SB 444, legislation that will help ensure national and international security by preventing Florida's taxpayer dollars from supporting companies who choose to violate federal law by bolstering the Iranian regime. I congratulate the bill's sponsors, Sen. Ellyn Bogdanoff and Rep. Mack Bernard. This bill prevents state and local governments from awarding contracts to companies found to be investing in the Iranian energy sector. It is consistent with federal policy and sends a clear message that Floridians will not support any company that puts profit over international security. The Iranian regime continues to pursue its illicit nuclear weapons program, continues to engage in the most egregious human rights violations, and continues to support terrorism across the globe. We must continue to utilize every economic tool at our disposal to bring this regime to its knees. I urge Governor Scott to act quickly to sign this bill into law.\"",
"Congressional Documents and PublicationsMarch 23, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:128 wordsBodyBoca Raton, Mar 23 -Congressman Ted Deutch (D-FL) released the following statement in reaction to the explosion of a bomb today in Jerusalem that killed a 59-year-old woman and injured dozens more:\"Today's horrific bombing in Jerusalem is yet another attack in a surge of violence perpetuated by Palestinian terrorists against innocent Israeli citizens,\" said Congressman Ted Deutch. \"The victims of this heinous attack and the Israeli people deserve the full support of the international community as they seek to defend themselves against this relentless violence. It is deplorable that as Israelis endure this latest bombing in Jerusalem, as well as ongoing rocket attacks by Hamas, some astonishingly still seek to blame Israel for the lack of peace in the region.\"",
"States News ServiceMarch 26, 2015 ThursdayCopyright 2015 States News ServiceLength:218 wordsByline:States News ServiceDateline:WASHINGTON BodyThe following information was released by the office of Florida Rep. Ted Deutch:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\"",
"Congressional Documents and PublicationsMarch 26, 2015Copyright 2015 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:250 wordsBodyCongressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums. In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\"For a fact sheet on H.R. 2, please go to: .Read this original document at: ",
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\" In case of any query regarding this article or other content needs please contact: ",
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\" In case of any query regarding this article or other content needs please contact: "
)), row.names = c(NA, 10L), class = "data.frame", .Names = "text_main")
Here is the matrix of similarity for the same subset of documents:
cosine_df <- structure(list(text1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text2 = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 0), text3 = c(0, 1, 0, 0, 0, 0, 0, 0,
0, 0), text4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text5 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), text6 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0), text7 = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1), text8 = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 1), text9 = c(0, 0, 0, 0, 0, 0, 1, 1,
0, 1), text10 = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)), .Names = c("text1",
"text2", "text3", "text4", "text5", "text6", "text7", "text8",
"text9", "text10"), row.names = c("text1", "text2", "text3",
"text4", "text5", "text6", "text7", "text8", "text9", "text10"
), class = "data.frame")
In case anyone else has a similar problem, this was the solution I ended up creating:
library(quanteda)
library(quanteda.textstats) # textstat_simil() lives in this package as of quanteda 3.0
# recent quanteda versions require a tokens object as input to dfm()
myDfm <- dfm(tokens(as.character(docs$text_main)), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim) #this produces a matrix of the document similarities
threshold <- .9
## for each document, the indices of all documents it is similar to (itself included)
similar_indices <- unique(apply(cosinemat, 1,
function(x) which(x > threshold)))
## keep only the first element of each set
if(is.list(similar_indices)) { # apply() returns a list when the sets differ in size
unique_indices <- unique(sapply(similar_indices, function(x) as.numeric(x[1])))
} else if (is.matrix(similar_indices)){ # is.matrix() is safe in R >= 4.0, where class() of a matrix has length 2
unique_indices <- unique(apply(similar_indices, 2, function(x) as.numeric(x[1])))
} else {
unique_indices <- similar_indices
}
## get only the unique texts (drop = FALSE keeps a data frame even with a single column)
docs_unique <- docs[unique_indices , , drop = FALSE]
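To make the thresholding step easier to follow, here is a minimal base R sketch of the same "keep the first document of each similar set" idea. The 3x3 similarity matrix is made up for illustration (it is not computed from the documents above), and the sketch assumes the diagonal is 1, so every document matches at least itself.

```r
# Hypothetical similarity matrix: text1 and text3 are near-duplicates
sim <- matrix(c(1.00, 0.10, 0.95,
                0.10, 1.00, 0.20,
                0.95, 0.20, 1.00),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("text", 1:3), paste0("text", 1:3)))
threshold <- 0.9

# For each row, take the index of the first document above the threshold.
# Because each function call returns exactly one value, apply() always
# gives back a plain vector here, so no list/matrix branching is needed.
first_match <- apply(sim > threshold, 1, function(x) which(x)[1])

# Documents to keep: one representative per group of similar documents
keep <- unique(unname(first_match))
keep
```

Here `keep` contains the indices 1 and 2, so text3 is dropped as a duplicate of text1. This first-match rule behaves well when similarity is roughly transitive (if A matches B and B matches C, then A matches C); for chains of partial matches it may keep more documents than expected.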
I am using R for text analysis. I used the 'readtext' function to pull in text from a PDF. However, as you can imagine, it is pretty messy. I used 'gsub' to replace text for different purposes. The general goal is to use one delimiter, '%%%%%', to split records into rows, and another delimiter, '#', to split each record into columns. I accomplished the first but am at a loss as to how to accomplish the latter. A sample of the data in the data frame is as follows:
895 "The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […"
896 "Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"
I want to take this data and split the #Published, #Authors, #Journal, and #URL fields into columns -- c("Published", "Authors", "Journal", "URL").
Any suggestions?
Thanks in advance!
This seems to work OK:
dfr <- data.frame(TEXT=c("The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […",
"Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"),
stringsAsFactors = FALSE)
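library(magrittr)
# split each record on any of the field markers, then stack the pieces row-wise
do.call(rbind, strsplit(dfr$TEXT, "#Published::|#Authors:|#Country:|#Journal:")) %>%
as.data.frame %>% # character matrix -> data frame
setNames(nm = c("Preamble","Published","Authors","Country","Journal"))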
Basically, this splits the text on any of the four field markers (note the double :: after Published!), row-binds the pieces, converts the result to a data frame, and assigns column names.
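One caveat with the split above: it relies on every record having the same markers in the same order, and since #URL: is not in the pattern, the URL of the second record stays inside the Journal column. If some fields are optional, a possible alternative is to extract each field separately and fill in NA where it is missing. The sketch below is only that, a sketch: `extract_field` is a hypothetical helper name, the two records are shortened stand-ins for the real data, and the example URL is made up. It assumes a field ends at the next `#Word:` marker or at the end of the string.

```r
# Hypothetical helper: return the text following "#<field>:" up to the next
# "#Word:" marker (or the end of the string); NA if the field is absent.
extract_field <- function(x, field) {
  # (?s) lets "." match newlines; ":+" tolerates the double colon after Published
  pattern <- paste0("(?s)#", field, ":+\\s*(.*?)\\s*(?=#[A-Za-z]+:|$)")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(hits,
         function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

# Shortened, made-up records in the same layout as the sample data
records <- c(
  "Title one\n#Published:: June 6, 1994#Authors: Baker A, Weiner JP#Journal:Final report",
  "Title two#Published:: June 6, 1994#Authors: Bolling DR#Journal:Army report #HR 94-\n004#URL: http://example.org/rec"
)

fields <- c("Published", "Authors", "Journal", "URL")
out <- as.data.frame(
  setNames(lapply(fields, function(f) extract_field(records, f)), fields),
  stringsAsFactors = FALSE
)
out
```

The first record has no #URL: marker, so its URL is NA. The "#HR 94-" inside the second record's Journal text is not treated as a marker, because it is not followed directly by a colon.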