How to crawl multiple pages and create a dataframe with parsing? - web-scraping

I would like to load multiple pages from a single website and extract specific attributes from different classes, as below. Then I would like to create a dataframe with the parsed information from multiple pages.
Extract from multiple pages
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

for page in range(1, 10):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    soup = bs(res.text, 'lxml')
Parsing
soup_content = soup.find_all('li', {'class':['list-item ceu clearfix','list-item gsc clearfix','list-item euco clearfix','list-item eg clearfix' ]})
datePublished = []
headline = []
description = []
urls = []
for i in range(len(soup_content)):
    datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
    headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
    description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
    urls.append('https://www.consilium.europa.eu{}'.format(soup_content[i].find('a', {'itemprop': 'url'}).attrs['href']))
To DataFrame
df = pd.DataFrame(data = zip(datePublished, headline, description, urls), columns=['date','title', 'description', 'link'])
df

To expand on my comments, this should work:
import requests
from bs4 import BeautifulSoup

maxPage = 9
datePublished = []
headline = []
description = []
urls = []
for page in range(1, maxPage+1):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    print(f'[page {page:>3}]', res.status_code, res.reason, 'from', res.url)
    soup = BeautifulSoup(res.content, 'lxml')
    soup_content = soup.find_all('li', {'class': ['list-item ceu clearfix', 'list-item gsc clearfix', 'list-item euco clearfix', 'list-item eg clearfix']})
    for i in range(len(soup_content)):
        datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
        headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
        description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
        urls.append('https://www.consilium.europa.eu{}'.format(soup_content[i].find('a', {'itemprop': 'url'}).attrs['href']))
When I ran it, 179 unique rows were collected [20 rows from all pages except the 7th, which had 19].
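The four lists can then be zipped into a DataFrame as in the question's final step, and dropping duplicates gives the unique-row count. A minimal sketch with made-up placeholder rows in place of the scraped data (names mirror the question's code):

```python
import pandas as pd

# Placeholder rows standing in for the lists filled by the loop above;
# the values here are invented purely for illustration.
datePublished = ["2023-01-30", "2023-01-30"]
headline = ["Title A", "Title B"]
description = ["Desc A", "Desc B"]
urls = ["https://www.consilium.europa.eu/a", "https://www.consilium.europa.eu/b"]

df = pd.DataFrame(
    zip(datePublished, headline, description, urls),
    columns=["date", "title", "description", "link"],
)
print(len(df.drop_duplicates()))  # count of unique rows
```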

There are different ways to reach your goal:
#Driftr95 comes up with a modification of yours using range(), which is fine while iterating over a specific number of pages.
Using a while-loop keeps you flexible in the number of pages, without knowing the exact count up front. You can also use a counter if you like, to break the loop after a certain number of iterations.
...
I would recommend the second one, and also avoiding the bunch of lists, because you have to ensure they all have the same length. Instead use a single list of dicts, which is more structured.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = 'https://www.consilium.europa.eu'
path = '/en/press/press-releases'
url = base_url + path
data = []

while True:
    print(url)
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    for e in soup.select('li.list-item'):
        data.append({
            'date': e.find_previous('h2').text,
            'title': e.h3.text,
            'desc': e.p.text,
            'url': base_url + e.h3.a.get('href')
        })
    if soup.select_one('li[aria-label="Go to the next page"] a[href]'):
        url = base_url + path + soup.select_one('li[aria-label="Go to the next page"] a[href]').get('href')
    else:
        break

df = pd.DataFrame(data)
Output (columns: date, title, desc, url)

0    30 January 2023
     Statement by the High Representative on behalf of the EU on the alignment of certain third countries concerning restrictive measures in view of the situation in the Democratic Republic of the Congo
     Statement by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Implementing Decision (CFSP) 2022/2398 of 8 December 2022 implementing Decision 2010/788/CFSP concerning restrictive measures in view of the situation in the Democratic Republic of the Congo.
     https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/statement-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-third-countries-concerning-restrictive-measures-in-view-of-the-situation-in-the-democratic-republic-of-the-congo/

1    30 January 2023
     Council adopts recommendation on adequate minimum income
     The Council adopted a recommendation on adequate minimum income to combat poverty and social exclusion. Income support is considered adequate when it ensures a life in dignity at all stages of life. Member states are recommended to gradually achieve the adequate level of income support by 2030 at the latest, while safeguarding the sustainability of public finances.
     https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/council-adopts-recommendation-on-adequate-minimum-income/

2    27 January 2023
     Forward look: 30 January - 12 February 2023
     Overview of the main subjects to be discussed at meetings of the Council of the EU over the next two weeks and upcoming media events.
     https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/forward-look/

3    27 January 2023
     Russia: EU prolongs economic sanctions over Russia’s military aggression against Ukraine
     The Council prolonged restrictive measures in view of Russia's actions destabilising the situation in Ukraine by six months.
     https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/russia-eu-prolongs-economic-sanctions-over-russia-s-military-aggression-against-ukraine/

4    27 January 2023
     Media advisory – Agriculture and Fisheries Council meeting on 30 January 2023
     Main agenda items, approximate timing, public sessions and press opportunities.
     https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/media-advisory-agriculture-and-fisheries-council-meeting-on-30-january-2023/

...

435  6 July 2022
     EU support to the African Union Mission in Somalia: Council approves further support under the European Peace Facility
     The Council approved €120 million in support to the military component of AMISOM/ATMIS for 2022 under the European Peace Facility.
     https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/eu-support-to-the-african-union-mission-in-somalia-council-approves-further-support-under-the-european-peace-facility/

436  6 July 2022
     Report by President Charles Michel to the European Parliament plenary session
     Report by European Council President Charles Michel to the European Parliament plenary session on the outcome of the European Council meeting of 23-24 June 2022.
     https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/report-by-president-charles-michel-to-the-european-parliament-plenary-session/

437  5 July 2022
     Declaration by the High Representative on behalf of the EU on the alignment of certain countries concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them
     Declaration by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Decision (CFSP) 2022/950 of 20 June 2022 amending Decision (CFSP) 2016/1693 concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them.
     https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/declaration-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-countries-concerning-restrictive-measures-against-isil-da-esh-and-al-qaeda-and-persons-groups-undertakings-and-entities-associated-with-them/

438  5 July 2022
     Remarks by President Charles Michel after his meeting in Skopje with Prime Minister of North Macedonia Dimitar Kovačevski
     During his visit to North Macedonia, President Michel expressed his support for proposed compromise solution on the country's accession negotiations.
     https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/remarks-by-president-charles-michel-after-his-meeting-in-skopje-with-prime-minister-of-north-macedonia-dimitar-kovacevski/

439  4 July 2022
     Readout of the telephone conversation between President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed
     President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed valued their open and frank exchange and agreed to speak in the near future to take stock.
     https://www.consilium.europa.eu/en/press/press-releases/2022/07/04/readout-of-the-telephone-conversation-between-president-charles-michel-and-prime-minister-of-ethiopia-abiy-ahmed/

...
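The counter mentioned in the second option (breaking the while-loop after a fixed number of iterations) could be sketched as below. To keep the sketch runnable without network access, the request-plus-next-link step is abstracted into a hypothetical fetch_next callable; in a real crawler it would wrap requests.get() and the "Go to the next page" selector from the code above.

```python
# Counter-capped while-loop: follow "next page" links, but stop after
# max_iterations pages even when more pages exist.
def crawl(fetch_next, start_url, max_iterations=5):
    data, url, page = [], start_url, 0
    while url is not None and page < max_iterations:
        items, url = fetch_next(url)  # returns (items, next_url or None)
        data.extend(items)
        page += 1
    return data

# Toy paginator: three "pages" of two items each, last page has no next link.
pages = {"p1": ([1, 2], "p2"), "p2": ([3, 4], "p3"), "p3": ([5, 6], None)}
print(crawl(pages.get, "p1"))                    # walks all three pages
print(crawl(pages.get, "p1", max_iterations=2))  # counter stops after two
```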

Related

Apply Sentimentr on Dataframe with Multiple Sentences in 1 String Per Row

I have a dataset where I am trying to get the sentiment by article. I have about 1000 articles. Each article is a string containing multiple sentences. Ideally, I would like to add another column that summarises the sentiment of each article. Is there an efficient way to do this using dplyr?
Below is an example dataset with just 2 articles.
date<- as.Date(c('2020-06-24', '2020-06-24'))
text <- c('3 more cops recover as PNP COVID-19 infections soar to 519', 'QC suspends processing of PWD IDs after reports of abuse in issuance of cards')
link<- c('https://newsinfo.inquirer.net/1296981/3-more-cops-recover-as-pnps-covid-19-infections-soar-to-519,3,10,4,11,9,8', 'https://newsinfo.inquirer.net/1296974/qc-suspends-processing-of-pwd-ids-after-reports-of-abuse-in-issuance-of-cards')
V4 <-c('MANILA, Philippines — Three more police officers have recovered from the new coronavirus disease, increasing the total number of recoveries in the Philippine National Police to (PNP) 316., This developed as the total number of COVID-19 cases in the PNP rose to 519 with one new infection and nine deaths recorded., In a Facebook post on Wednesday, the PNP also recorded 676 probable and 876 suspects for the disease., PNP chief Gen. Archie Gamboa previously said the force would will intensify its health protocols among its personnel after recording a recent increase in deaths., The latest fatality of the ailment is a police officer in Cebu City, which is under enhanced community quarantine as COVID-19 cases continued to surge there., ATM, \r\n\r\nFor more news about the novel coronavirus click here.\r\nWhat you need to know about Coronavirus.\r\n\r\n\r\n\r\nFor more information on COVID-19, call the DOH Hotline: (02) 86517800 local 1149/1150.\r\n\r\n \r\n \r\n \r\n\r\n \r\n , The Inquirer Foundation supports our healthcare frontliners and is still accepting cash donations to be deposited at Banco de Oro (BDO) current account #007960018860 or donate through PayMaya using this link .',
'MANILA, Philippines — Quezon City will halt the processing of identification cards to persons with disability for two days starting Thursday, June 25, so it could tweak its guidelines after reports that unqualified persons had issued with the said IDs., In a statement on Wednesday, Quezon City Mayor Joy Belmonte said the suspension would the individual who issued PWD ID cards to six members of a family who were not qualified but who paid P2,000 each to get the IDs., Belmonte said the suspect, who is a local government employee, was already issued with a show-cause order to respond to the allegation., According to city government lawyer Nino Casimir, the suspect could face a grave misconduct case that could result in dismissal., The IDs are issued to only to persons qualified under the Act Expanding the Benefits and Privileges of Persons with Disability (Republic Act No. 10754)., The IDs entitle PWDs to a 20 percent discount and VAT exemption on goods and services., /atm')
df<-data.frame(date, text, link, V4)
head(df)
So I have been looking up how to do this using the sentimentr package and created the code below. However, this only outputs each sentence's sentiment (I do this with a strsplit on "[.],"), and I want instead to aggregate everything at the full-article level after applying this strsplit.
library(dplyr)
library(tidyr)
library(sentimentr)
full <- df %>%
  group_by(V4) %>%
  mutate(V2 = strsplit(as.character(V4), "[.],")) %>%
  unnest(V2) %>%
  get_sentences() %>%
  sentiment()
The desired output I am looking for is to simply add an extra column my df dataframe with a summary sum(sentiment) for each article.
Additional info based on answer below:
df %>%
  group_by(V4) %>% # group_by not really needed
  mutate(V4 = gsub("[.],", ".", V4),
         sentiment_score = sentiment_by(V4))
# A tibble: 2 x 5
# Groups: V4 [2]
date text link V4 sentiment_score$e~ $word_count $sd $ave_sentiment
<date> <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 2020-06-24 3 more cops recover as P~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Three more police officers ~ 1 172 0.204 -0.00849
2 2020-06-24 QC suspends processing o~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Quezon City will halt the p~ 1 161 0.329 -0.174
Warning message:
Can't combine <sentiment_by> and <sentiment_by>; falling back to <data.frame>.
x Some attributes are incompatible.
i The author of the class should implement vctrs methods.
i See <https://vctrs.r-lib.org/reference/faq-error-incompatible-attributes.html>.
If you need the sentiment over the whole text, there is no need to split the text into sentences first; the sentiment functions take care of this. I replaced the "., " in your text back to periods, as this is needed by the sentiment functions. The sentiment functions recognize that "mr." is not the end of a sentence. If you use get_sentences() first, you get the sentiment per sentence and not over the whole text.
The function sentiment_by handles the sentiment over the whole text and averages it nicely. Check the help for the averaging.function option if you need to change this. The by part of the function can deal with any grouping you want to apply.
df %>%
  group_by(V4) %>% # group_by not really needed
  mutate(V4 = gsub("[.],", ".", V4),
         sentiment_score = sentiment_by(V4))
# A tibble: 2 x 5
# Groups: V4 [2]
date text link V4 sentiment_score$~ $word_count $sd $ave_sentiment
<date> <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 2020-06-24 3 more cops recov~ https://newsinfo.inquire~ "MANILA, Philippines — Three~ 1 172 0.204 -0.00849
2 2020-06-24 QC suspends proce~ https://newsinfo.inquire~ "MANILA, Philippines — Quezo~ 1 161 0.329 -0.174

filtering text and storing the filtered sentence/paragraph into a new column

I am trying to extract some sentences from text data. I want to extract the sentences which contain the phrase "medical device company released". I can run the following code:
df_text <- unlist(strsplit(df$TD, "\\."))
df_text
df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text
Which gives me:
[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"
So I extracted the sentences which contain the phrase "medical device company released". However, I want to store the result in a new column, keeping track of which grp the sentence came from.
Expected output:
grp TD newCol
3613 text NA # does not contain the sentence
4973 text medical device company released
5570 text NA # does not contain the sentence
Data:
df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. (BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley#dowjones.com [ 02-10-09 1437ET ]\n ",
" --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n By Anjali Athavaley \n Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. 
approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley#dowjones.com [ 12-09-11 0924ET ]\n ",
" In September, the company issued a guidance range of 43 cents to 44 cents a share. \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share. \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ] \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion. \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion. \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at . \n\nThe company plans to report third-quarter earnings Oct. 14. \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68. \n\nCompany Web site: \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires#DowJones.com \n\nOrder free Annual Report for General Electric Co. \n\nVisit or call 1-888-301-0513 [ 10-06-05 0904ET ] \n "
)), class = "data.frame", row.names = c(NA, -3L))
We can get the data into separate rows keeping the grp intact, and keep only the sentence that has "medical device company released" in it.
library(dplyr)
df %>%
  tidyr::separate_rows(TD, sep = "\\.") %>%
  group_by(grp) %>%
  summarise(newCol = toString(grep(pattern = "medical device company released",
                                   TD, ignore.case = TRUE, value = TRUE)))
# grp newCol
# <chr> <chr>
#1 3613 ""
#2 4973 "\n\nThe medical device company released its financia…
#3 5570 ""

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with ngrams, using the following code:
library(tidytext)
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
But when I run this I get the following error message:
error: unnest_tokens expects all columns of input to be atomic vectors (not lists)
My text column consists of a lot of tweets with rows that look like the following and is of class character.
president_tweets$text <- c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"
)
---------Update:----------
It looks like the sentimentr or exploratory package caused the conflict. I reloaded my packages without these and now it works again!
Hmmmmm, I am not able to reproduce your problem.
library(tidytext)
library(dplyr)
president_tweets <- data_frame(text = c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"))
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
#> # A tibble: 205 x 1
#> bigrams
#> <chr>
#> 1 the united
#> 2 united states
#> 3 states senate
#> 4 senate just
#> 5 just passed
#> 6 passed the
#> 7 the biggest
#> 8 biggest in
#> 9 in history
#> 10 history tax
#> # ... with 195 more rows
The current CRAN version of tidytext does in fact not allow list-columns but we have changed the column handling so that the development version on GitHub now supports list-columns. Are you sure you don't have any of these in your data frame/tibble? What are the data types of all of your columns? Are any of them of type list?

Transforming kwic objects into single dfm

I have a corpus of newspaper articles of which only specific parts are of interest for my research. I'm not happy with the results I get from classifying texts along different frames because the data contains too much noise. I therefore want to extract only the relevant parts from the documents. I was thinking of doing so by transforming several kwic objects generated by the quanteda package into a single df.
So far I've tried the following
exampletext <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
mycorpus <- corpus(exampletext)
mycorpus.nat <- corpus(kwic(mycorpus, "nationalit*", window = 5, valuetype = "glob"))
mycorpus.cit <- corpus(kwic(mycorpus, "citizenship", window = 5, valuetype = "glob"))
mycorpus.kwic <- mycorpus.nat + mycorpus.cit
mydfm <- dfm(mycorpus.kwic)
This, however, generates a dfm that contains 4 documents instead of 2, and even more when both keywords are present in a document. I can't think of a way to bring the dfm back down to the original number of documents.
Thank you for helping me out.
We recently added a window argument to tokens_select() for this purpose:
require(quanteda)
txt <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
toks <- tokens(txt)
mt_nat <- dfm(tokens_select(toks, "nationalit*", window = 5))
mt_cit <- dfm(tokens_select(toks, "citizenship*", window = 5))
Please make sure that you are using the latest version of quanteda.
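To keep the dfm at the original number of documents, both patterns can be passed to a single tokens_select() call; the tokens object, and hence the dfm built from it, keeps one row per input text. A minimal, self-contained sketch (the two short texts here stand in for the longer ones above):

```r
library(quanteda)

# Two short example texts standing in for the ones above.
toks <- tokens(c(d1 = "She changed her nationality last year.",
                 d2 = "He was granted citizenship in March."))

# Select tokens around either keyword in one pass, so the resulting
# dfm has one document per original text rather than one per keyword.
mt_both <- dfm(tokens_select(toks, c("nationalit*", "citizenship*"), window = 5))
ndoc(mt_both)  # 2
```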

Error reading a text file into new columns of a data frame using some text editing

I have a text file (0001.txt) which contains the data as below:
<DOC>
<DOCNO>1100101_business_story_11931012.utf8</DOCNO>
<TEXT>
The Telegraph - Calcutta (Kolkata) | Business | Local firms go global
6 Local firms go global
JAYANTA ROY CHOWDHURY
New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.
Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.
Though the first-quarter investment was 15 per cent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
According to analysts, confidence in global recovery, cheap corporate buys abroad and easier rules governing investment overseas had spurred flow of capital and could see total investment abroad top $12 billion this year and rise to $18-20 billion next fiscal.
For example, Titagarh Wagons plans to expand abroad on the back of the proposed Asian railroad project.
We plan to travel all around the world with the growth of the railroads, said Umesh Chowdhury of Titagarh Wagons.
India is full of opportunities, but we are all also looking at picks abroad, said Gautam Mitra, managing director of Indian Structurals Engineering Company.
Mitra plans to open a holding company in Switzerland to take his business in structurals to other Asian and African countries.
Indian companies created 3 lakh jobs in the US, while contributing $105 billion to the US economy between 2004 and 2007, according to commerce ministry statistics. During 2008-09, Singapore, the Netherlands, Cyprus, the UK, the US and Mauritius together accounted for 81 per cent of the total outward investment.
Bose said, And not all of it is organic growth. Much of our investment abroad reflects takeovers and acquisitions.
In the last two years, Suzlon acquired Portugals Martifers stake in German REpower Systems for $122 million. McNally Bharat Engineering has bought the coal and minerals processing business of KHD Humboldt Wedag. ONGC bought out Imperial Energy for $2 billion.
Indias foreign assets and liabilities today add up to more than 60 per cent of its gross domestic product. By the end of 2008-09, total foreign investment was $67 billion, more than double of that at the end of March 2007.
</TEXT>
</DOC>
Above, all the text data sits between the <TEXT> and </TEXT> tags.
I want to read it into an R data frame with four columns, so that the data reads as:
Title Author Date Text
The Telegraph - Calcutta (Kolkata) JAYANTA ROY CHOWDHURY Dec. 31 Indian companies are stepping out of their homes to try their luck on foreign shores. Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09. Though the first-quarter investment was 15 percent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
Here is what I tried, using dplyr as shown below:
# read text file
library(dplyr)
library(readr)
dat <- read_csv("0001.txt") %>% slice(-8)
# print part of data frame
head(dat, n=2)
In the code above, I tried to skip the first few lines of the text file (which are not important) and then read the rest into a data frame.
But I did not get what I was looking for, and I am confused about what I am doing wrong.
Could someone please help?
To be able to read data into R as a data frame or table, the data needs to have a consistent structure maintained by separators. One of the most common formats is a file with comma separated values (CSV).
The data you're working with doesn't have separators though. It's essentially a string with minimally enforced structure. Because of this, it sounds like the question is more related to regular expressions (regex) and data mining than it is to reading text files into R. So I'd recommend looking into those two things if you do this task often.
That aside, to do what you're wanting in this example, I'd recommend reading the text file into R as a single string of text first. Then you can parse the data you want using regex. Here's a basic, rough draft of how to do that:
fileName <- "Path/to/your/data/0001.txt"
string <- readChar(fileName, file.info(fileName)$size)
df <- data.frame(
  Title  = sub("\\s+[|]+(.*)", "", string),
  Author = gsub("(.*)+?([A-Z]{2,}.*[A-Z]{2,})+(.*)", "\\2", string),
  Date   = gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+(.*)", "\\2", string),
  Text   = gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+[: ]+(.*)", "\\3", string)
)
Output:
str(df)
'data.frame': 1 obs. of 4 variables:
$ Title : chr "The Telegraph - Calcutta (Kolkata)"
$ Author: chr "JAYANTA ROY CHOWDHURY"
$ Date : chr "Dec. 31"
$ Text : chr "Indian companies are stepping out of their homes to"| __truncated__
The reason why regex can be useful is that it allows for very specific patterns in strings. The downside is when you're working with strings that keep changing formats. That will likely mean some slight adjustments to the regex used.
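For instance, if a later file wrote the date as "December 31" instead of "Dec. 31", the Date pattern would need a small adjustment. A hypothetical variant that accepts either form:

```r
string <- "New Delhi, December 31: Indian companies are stepping out."

# Accept an abbreviated month ("Dec.") or a full month name ("December"):
# after the first three letters, allow either a period or more lowercase letters.
date <- sub("(.*?)([A-Z][a-z]{2}(\\.|[a-z]*)\\s[0-9]{1,2})(.*)", "\\2",
            string, perl = TRUE)
date  # "December 31"
```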
read.table(file = ..., sep = "|") will solve your issue.
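A minimal sketch of that suggestion (using the text argument here so the example is self-contained; with a real file you would pass file = "0001.txt" instead; fill = TRUE pads lines that have fewer "|"-separated fields, and quote = "" keeps apostrophes literal):

```r
# Two lines standing in for the start of 0001.txt: only the first
# contains "|" separators, so fill = TRUE pads the second to 3 fields.
lines <- c("The Telegraph - Calcutta (Kolkata) | Business | Local firms go global",
           "6 Local firms go global")
dat <- read.table(text = lines, sep = "|", fill = TRUE, quote = "",
                  strip.white = TRUE, stringsAsFactors = FALSE)
head(dat, n = 2)
```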
