How to delete everything BUT two patterns found by regex - r

I'm trying to turn a couple thousand press releases on anti-ISIS airstrikes into an organized dataset. So far I have working code that handles one release at a time, but it chokes on more than one because each release has a single date followed by a varying number of strike entries.
Using ((?<=SOUTHWEST ASIA,).*(?<=-)) and ((?<=Near).*?(?=airstrik)) I can match the two things I need individually, but I can't figure out how to set it up to preserve all strings matching either of those regexes while deleting everything else.
I've tried ((?<=SOUTHWEST ASIA,).*(?<=-))|((?<=Near).*?(?=airstrik)) and ((?<=SOUTHWEST ASIA,).*(?<=-)).*((?<=Near).*?(?=airstrik)) but both of those wind up matching everything in the document.
What I'm trying to do is take the whole document and delete everything but the matching strings so I go from this:
November 23, 2016
Military Strikes Continue Against ISIL Terrorists in Syria and Iraq
U.S. Central Command
SOUTHWEST ASIA, November 23, 2016 - On Nov. 22, Coalition military forces conducted 17 strikes against ISIL terrorists in Syria and Iraq. In Syria, Coalition military forces conducted 11 strikes using attack, bomber, fighter, and remotely piloted aircraft against ISIL targets. Additionally in Iraq, Coalition military forces conducted six strikes coordinated with and in support of the Government of Iraq using attack, bomber, fighter, and remotely piloted aircraft against ISIL targets.
The following is a summary of the strikes conducted since the last press release:
Syria
Near Abu Kamal, one strike destroyed an oil rig.
Near Ar Raqqah, four strikes engaged an ISIL tactical unit, destroyed two vehicles, an oil tanker truck, an oil pump, and a VBIED, and damaged a road.
Iraq
Near Rawah, one strike engaged an ISIL tactical unit and destroyed a vehicle, a mortar system, and a weapons cache.
Near Mosul, four strikes engaged three ISIL tactical units, destroyed six ISIL-held buildings, a mortar system, a vehicle, a weapons cache, a supply cache, and an artillery system, and damaged five supply routes, and a bridge.
more text I don't need, about 5 exceptions where they amend previous reports I'll just fix by hand, and then the next report
To this:
SOUTHWEST ASIA, November 23, 2016
Near Abu Kamal, one strike
Near Ar Raqqah, four strikes
Near Rawah, one strike
Near Mosul, four strikes
SOUTHWEST ASIA, November 22, 2016
Near Abu Kamal, one strike
Near Ar Raqqah, four strikes
Near Rawah, one strike
Near Mosul, four strikes
I can match and pull out the dates and cities/strikes separately, but that doesn't work for my purposes, so I need a way to clean up the source document so it looks like the above.

You can use the str_extract_all function from the stringr package, and pass it your regex.
I think if you pass your two regexes and separate them with |, it should work. If you need to test your regex, you can go to https://regex101.com/
Best,
Colin
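To make the suggestion above concrete, here is a minimal sketch of the alternation with str_extract_all. The two patterns are simplified stand-ins for the ones in the question (an assumption, not the asker's exact regexes): each one is lazy and stays on its own line, so neither alternative can swallow the rest of the document. The sample text is a shortened copy of the release quoted in the question.

```r
library(stringr)

# Shortened sample of one press release (abridged from the question).
doc <- c(
  "U.S. Central Command",
  "SOUTHWEST ASIA, November 23, 2016 - On Nov. 22, Coalition military forces conducted 17 strikes against ISIL terrorists in Syria and Iraq.",
  "Syria",
  "Near Abu Kamal, one strike destroyed an oil rig.",
  "Near Ar Raqqah, four strikes engaged an ISIL tactical unit."
)
text <- paste(doc, collapse = "\n")

# Alternative 1: the dateline, up to (not including) the " -" separator.
# Alternative 2: "Near <place>, <n> strike(s)".
# "." does not match a newline here, so each match stays on one line.
pattern <- "SOUTHWEST ASIA,.*?(?= -)|Near .*?strikes?"

kept <- str_extract_all(text, pattern)[[1]]
```

Calling writeLines(kept) then prints the dateline followed by its strike entries, in document order. On the full set of releases the patterns would probably need tuning, for example if a dateline uses a different separator.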

Related

Processing large text files in R (Speed up a loop for separating sentences)

I have large text documents (150K lines per document; 160 documents). When I read them in as a large VCorpus and convert them to a data frame, it runs quite quickly. Now I want to split the text into one sentence per row and remove the sentences that do not contain a certain keyword. When I run the code, R crashes. If I try it with one document, the code runs for approximately 10 minutes.
The data (text) looks like this (just with much more text):
| Company1 | Company1 hat engaged in child safety in South Africa. The Community rewards this with great recognition. Charity is a big part of the companies culture.
| Company2 | Company2 opened up several factories in the that do not pay the minimum wage. Affordable Housing is not one of thair priorities.
There is also a small example structure below.
library(readr)
library(qdap)
library(tm)
corp <- VCorpus(DirSource("test"))
text <- as.data.frame(corp)
names(text)[1] <- 'doc_id'
names(text)[2] <- 'text'
text <- dput(text)
text <- structure(list(doc_id = c("Company1.txt", "Company2.txt", "Company3.txt"
), text = c("Acucap's market cap on listing in 2002 was a meagre R372m, which would make it one of the smallest stocks in the real estate sector in today's terms. This may be a reflection of how the real estate sector of the JSE has evolved over the past 10 years. But, it is also amazing that Acucap - at a current market cap of more than R7bn - has managed to retain much of the same level of entrepreneurial spirit as at the time of its listing. Still at the helm is sector veteran Paul Theodosiou. The team has grown since listing. The core operational supporting team is Jonathan Rens, Gavin Jones and Craig Kotze, and the finance team is led by founding chief financial officer Baden Marlow. Acucap's physical portfolio is concentrated in retail, which comprises 74%, offices make up 24% and industrial takes the remaining 2%. Some of the larger retail properties include Festival Mall (the largest single asset in the portfolio) in Kempton Park, Bayside Mall in Tableview and Keywest shopping centre in Krugersdorp. The office portfolio includes assets like Mowbray Golf Park in Pinelands in the Cape, the Microsoft offices in Bryanston and 82 Grayston Drive in the Sandton CBD. Though small in Acucap's portfolio, the industrial portfolio is of a high quality and offers good scope for growth as expansion continues at the N1 Business Park in Midrand and the Montague Business Park in Cape Town. Acucap's other investments include a 17,2% shareholding in Sycom Property Fund (a fellow listed real estate property company) as well as a 100% shareholding in Sycom's management company. Sycom's other significant shareholder is Hyprop Investments. The latter's 33,9% stake in Sycom has been the subject of much speculation over the past few years as the ownership structure has created somewhat of an impasse. Acucap has, however, stated its strategic intent to retain Sycom as a separate fund. Company1 hat engaged in child safety in South Africa. 
The Community rewards this with great recognition. Charity is a big part of the companies culture.",
"Australian copper producer Aditya Birla Minerals Ltd. said June 16 that there is a risk that the overall reserves at its Nifty copper mine in Western Australia may be adversely affected due to the effects of an earlier ground collapse at the mine. The company also said it will be unable to provide its annual reserve update until all investigative activities are finalized. Aditya Birla is expecting site costs for the June quarter to be in the range of A$17 million to A$19 million, higher than the earlier estimate of A$12 million to A$15 million due to higher-than-expected mine activity, unplanned maintenance on critical infrastructure and additional employee costs. The company was forced to halt operations at the Nifty mine in March following suspected underground subsidence, which led to the standing down of 350 mine employees until at least July 15. Phase two of probe drilling, which is being undertaken as part of efforts to assess mine safety, is progressing ahead of schedule and is now expected to be completed by the end of June instead of mid-July. Aditya Birla said preliminary observations from this second phase are similar to those from phase one drilling, in that there is less water being intersected between levels 16 and 20 than previously expected. Areas with potential to self-propagate into new sinkholes are being reviewed and, as Aditya Birla expected, the rock mass strength has deteriorated on top of mined-out areas. Management has begun an initial review of the results to identify new gaps and/or the need for additional confirmatory drilling. The first phase of the seismic system installation has been commissioned and is fully functional, albeit limited in coverage to upper parts of the mine. The pit has been dewatered with residual mud left at the pit sump. Aditya Birla added that while surface cracking has continued to develop in the area affected by the sinkhole, this was expected and does not present a hazard. 
However, washouts will occur after heavy rainfall, leading to widening and deepening of the cracks.",
"In preparation for the potential completion of the merger with AGL Resources Inc., the board of directors of Nicor Inc. announced a special pro rata dividend. The dividend is contingent upon the merger being completed prior to Nicor's next scheduled dividend record date, Dec. 31, to ensure that \"shareholders continue to receive a dividend at the current rate until the closing of the merger,\" the company said. In a Nov. 1 news release, Nicor said its board of directors declared a pro rata dividend of 0.5 cent per share per day from Oct. 1 until and including the day before the merger effective date. The dividend is the daily equivalent of the current quarterly dividend rate of 46.5 cents per share. It will be paid to Nicor shareholders of record at the close of business on the day immediately before the effective merger date, which is expected in the fourth quarter. \"The dividend will be paid as soon as practical following the completion of the merger,\" Nicor said. \"This pro rata dividend is in addition to the previously announced Nov. 1 dividend that was paid to shareholders of record Sept. 30. Following the merger closing, AGL Resources is expected to pay a dividend at the rate of $0.004945055 per share, per day for the remainder of the current quarterly dividend period. These dividend payment scenarios, together with a similar plan announced today by AGL Resources, will synchronize the companies' dividends as of the merger effective date in accordance with the merger agreement.\" If the merger is not completed by Dec. 31, Nicor shareholders of record Dec. 31 will receive the regular quarterly dividend of 46.5 cents per share, payable Feb. 1, 2012, and a new pro rata dividend will be announced to ensure that shareholders receive a dividend at the current rate until the merger is completed. Company2 opened up several factories in the that do not pay the minimum wage. Affordable Housing is not one of thair priorities."
)), class = "data.frame", row.names = c(NA, 3L))
options(stringsAsFactors = FALSE)
Sys.setlocale('LC_ALL','C')
library(dplyr)
library(tidyr)
sckeywords <- c("Affordable Housing", "Benefit The Masses",
                "Charitability", "Charitable", "Charitably", " Charities ", " Charity ")
pat <- paste0(sckeywords, collapse = '|')
text2 <- (text) %>%
  separate_rows(text, sep = '\\.\\s*') %>%
  slice({
    tmp <- grep(pat, text, ignore.case = TRUE)
    sort(unique(c(tmp-1, tmp, tmp + 1)))
  })
Can I make it run faster somehow without needing more hardware capacity?
I have 16 GB of RAM and a 4 core CPU (i5-10210U).
There are several ways to increase the speed:
I would use data.table instead of dplyr for this amount of data
library(data.table)
sckeywords <- c("Affordable Housing", "Benefit The Masses",
"Charitability", "Charitable", "Charitably", " Charities ", " Charity ")
pat <- paste0(sckeywords, collapse = '|')
setDT(text)
text <- text[, .(text = unlist(tstrsplit(text, "\\.\\s*", type.convert = TRUE))),
by = "doc_id"]
text <- text[grepl(pat, text)]
You can try to speed up the regex. One possibility is to create a trie, e.g. with trieregex. In your case, it would be:
trie_1 <- "(?:Charitab(?:ility|l[ey])|\\ Charit(?:ies\\ |y\\ )|Benefit\\ The\\ Masses|Affordable\\ Housing)"
text <- text[grepl(trie_1, text)]
# or without the white space around the last two words
trie_2 <- "(?:Charit(?:ab(?:ility|l[ey])|ies|y)|Benefit\\ The\\ Masses|Affordable\\ Housing)"
text <- text[grepl(trie_2, text)]
However, in the short example, using fixed patterns was actually the fastest (checked with microbenchmark), so you could try this:
text <- text[grepl(sckeywords[1], text, fixed = TRUE) | grepl(sckeywords[2], text, fixed = TRUE) |
grepl(sckeywords[3], text, fixed = TRUE) | grepl(sckeywords[4], text, fixed = TRUE) |
grepl(sckeywords[5], text, fixed = TRUE) | grepl(sckeywords[6], text, fixed = TRUE) |
grepl(sckeywords[7], text, fixed = TRUE)]
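The chain of grepl() calls above can be written more compactly by looping over the keywords and OR-ing the results together. A small sketch in base R, with made-up sentences and a shortened keyword list (both are illustrative assumptions, not the real data):

```r
# Shortened keyword list and invented sentences, for illustration only.
keywords <- c("Affordable Housing", "Charitable", " Charity ")
sentences <- c(
  "Affordable Housing is not a priority",
  "The team has grown",
  "Our Charity work continues",
  "A Charitable act"
)

# One fixed-string search per keyword, combined with element-wise OR.
hits <- Reduce(`|`, lapply(keywords, function(k) grepl(k, sentences, fixed = TRUE)))

matching <- sentences[hits]
```

This does the same work as the long expression, so it should not be faster or slower by itself; it only avoids repeating one grepl() call per keyword when the list grows.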
If this is still too slow, try to use other tools like awk or sed.
Use parallel processing. You could try to parallelise everything:
make a list of the files to read in and send parts of this file list to the different cores/workers. They then read in the files so that you don't need to send big files to different cores (you can also first try to skip this and start with your original code if the file is not too big)
make a clean data.table and filter it
return the data and join everything
When you run into memory issues, you could also try to save the cleaned data.tables separately in the workers instead of returning them.
For a parallel framework, I'd recommend foreach or the futureverse
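The steps above could look roughly like the sketch below. To keep it free of extra packages it uses mclapply() from the parallel package that ships with R, not foreach or the futureverse, so treat it as a stand-in for whichever framework you pick. The helper filter_file(), the two throwaway files and the shortened keyword list are all hypothetical.

```r
library(parallel)

keywords <- c("Affordable Housing", "Charity")  # shortened list, for illustration
pat <- paste(keywords, collapse = "|")

# Hypothetical helper: read one file, split it into sentences, keep keyword hits.
filter_file <- function(path, pat) {
  txt <- paste(readLines(path, warn = FALSE), collapse = " ")
  sentences <- unlist(strsplit(txt, "\\.\\s*"))
  keep <- grepl(pat, sentences)
  data.frame(doc_id = rep(basename(path), sum(keep)),
             text = sentences[keep],
             stringsAsFactors = FALSE)
}

# Two small temporary files stand in for the real documents.
dir <- tempfile("docs")
dir.create(dir)
writeLines("The team has grown. Charity is a big part of the culture.",
           file.path(dir, "Company1.txt"))
writeLines("Profits rose. Affordable Housing is not a priority.",
           file.path(dir, "Company2.txt"))
files <- sort(list.files(dir, full.names = TRUE))

# mclapply() forks, which Windows does not support, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
parts <- mclapply(files, filter_file, pat = pat, mc.cores = cores)

# Join the per-file results.
result <- do.call(rbind, parts)
```

Each worker only receives a file path and returns the already filtered rows, which is the point of the first step: the big raw text never has to be shipped between processes.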
Edit
Here is a regex that ignores the case of the first character. However, I'm not an expert on regexes, so maybe there are better ways to optimise the regex!
trie_2 <- "(?:[Cc]harit(?:ab(?:ility|l[ey])|ies|y)|[Bb]enefit\\ [Tt]he\\ [Mm]asses|[Aa]ffordable\\ [Hh]ousing)"
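A quick way to see what this pattern does and does not cover is to run it on a few test strings next to a plain ignore.case search. The sample strings and the shortened keyword list below are made up for illustration; perl = TRUE is used here as an assumption so that the escaped spaces are interpreted by PCRE.

```r
trie_2 <- "(?:[Cc]harit(?:ab(?:ility|l[ey])|ies|y)|[Bb]enefit\\ [Tt]he\\ [Mm]asses|[Aa]ffordable\\ [Hh]ousing)"

samples <- c("charity begins at home", "Affordable housing", "CHARITY",
             "benefit the masses", "no keyword here")

# Only the first letter of each word may vary in case.
first_letter_only <- grepl(trie_2, samples, perl = TRUE)

# For comparison: a plain alternation that ignores case everywhere.
plain <- "Charity|Benefit The Masses|Affordable Housing"
any_case <- grepl(plain, samples, ignore.case = TRUE)
```

The two results differ only for the all-caps "CHARITY", which the trie pattern misses. Whether that matters depends on how the keywords are written in the documents.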

Matching statutory provisions of two in R

In advance: Sorry for all the Norwegian references, but I hope I've explained my problem well enough for them to still make sense...
So, in 2005 Norway got a new criminal law. The old one was somewhat unstructured (only three chapters), while the statutory provisions in the 2005 version have been structured into 31 chapters, depending on the area of the offense (can be seen here: https://lovdata.no/dokument/NL/lov/2005-05-20-28). I call these "areas of law". For example, in the 2005 version laws regarding sexual offenses are in chapter 26. Logically, then the statutory provisions that belong to this chapter are categorized as belonging to the area of law called "s
Some of the old laws have been structured into the new chapters, some new ones have been added, and some have been repealed. I have what is called a "law mirror": a list showing where each old provision is in the new law, if it hasn't been repealed. The new law came into force for offenses committed from the 1st of October 2015.
An example of a law mirror: https://no.wikipedia.org/wiki/Straffeloven_(lovspeil). I've pivoted the list longer, such that it looks like this:
Law Mirror: "Seksuallovbrud" means sexual offense, "kap_2005" says which chapter of the 2005 law the statutory provision (Norwegian: "paragraf") falls under, and "straffelov" specifies whether the provision comes from the 2005 or the 1902 version of the law.
The data I have consist of two separate data frames. Df1 is the law mirror. Df2 consists of cases in the Norwegian court of appeals from between 1993 and 2019, where the criminal law was the basis of the verdict. I've made a dummy (strl1902) in Df2 for whether the verdict in the case came before or after the new law came into force. Equal to 1 if it's the old one. I've also extracted the number of the statutory provision.
On the basis of this I want to categorize the cases using statutory provisions from the old criminal law into the areas of law from the new law.
This is where I need help:
Does anyone have an idea of how I can distinguish between the provisions from the old and the new law, so that I can also make dummies for the provisions from the 1902 law that sort them into the areas of law of the 2005 law?
Hope this makes sense.
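One possible way to set this up, sketched under several assumptions: the law mirror (Df1) has the columns paragraf, straffelov and kap_2005 described above, the cases (Df2) have the provision number in a column also called paragraf plus the strl1902 dummy, and all values below are invented for illustration. The idea is to turn the dummy into the same law-version key the mirror uses and then join on provision number and law version together, so that identical numbers from the two laws cannot be confused.

```r
# Invented law mirror rows: provision number, law version, 2005 chapter.
df1 <- data.frame(
  paragraf   = c(192, 291, 268),
  straffelov = c(1902, 2005, 1902),
  kap_2005   = c(26, 26, 27)
)

# Invented cases: provision number and the old-law dummy from the question.
df2 <- data.frame(
  case_id  = 1:3,
  paragraf = c(192, 291, 268),
  strl1902 = c(1, 0, 1)
)

# Recode the dummy into the law version used as key in the mirror.
df2$straffelov <- ifelse(df2$strl1902 == 1, 1902, 2005)

# Join on provision AND law version; repealed provisions would get NA.
merged <- merge(df2, df1, by = c("paragraf", "straffelov"), all.x = TRUE)
merged <- merged[order(merged$case_id), ]

# Example dummy for one area of law (chapter 26).
merged$kap26 <- as.integer(merged$kap_2005 == 26)
```

With all.x = TRUE, cases whose provision has no row in the mirror are kept with a missing chapter, which makes it easy to inspect them separately.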

Remove a section from Corpus

I have a quanteda corpus of hundreds of documents. How do I remove specific sections, like the abstract and footnotes? Otherwise, I am faced with doing it manually. Thanks
As requested, here is a text example. It is from a regular journal article. It shows the metadata, then the abstract, then keywords, then introduction, then author contact details, then body of article, then Note, then Disclosure statement, then Notes on contributors, then references. I would like to remove everything apart from the introduction and body of the article. I would also like to remove the author name and journal title, which are repeated throughout.
" Behavioral Sciences of Terrorism and Political Aggression
ISSN: 1943-4472 (Print) 1943-4480 (Online) Journal homepage: http://www.tandfonline.com/loi/rirt20
Sometimes they come back: responding to
American foreign fighter returnees and other
Elusive threats
Christopher J. Wright
To cite this article: Christopher J. Wright (2018): Sometimes they come back: responding to
American foreign fighter returnees and other Elusive threats, Behavioral Sciences of Terrorism and
Political Aggression, DOI: 10.1080/19434472.2018.1464493
To link to this article: https://doi.org/10.1080/19434472.2018.1464493
Published online: 23 Apr 2018.
Submit your article to this journal
Article views: 57
View related articles
View Crossmark data
Full Terms & Conditions of access and use can be found at
http://www.tandfonline.com/action/journalInformation?journalCode=rirt20
"
"BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION, 2018
https://doi.org/10.1080/19434472.2018.1464493
Sometimes they come back: responding to American foreign
fighter returnees and other Elusive threats
Christopher J. Wright
Department of Criminal Justice, Austin Peay State University, Clarksville, TN, USA
ABSTRACT ARTICLE HISTORY
Much has been made of the threat of battle hardened jihadis from Received 8 January 2018
Islamist insurgencies, especially Syria. But do Americans who Accepted 10 April 2018
return home after gaining experience fighting abroad pose a
KEYWORDS
greater risk than homegrown jihadi militants with no such Terrorism; foreign fighters;
experience? Using updated data covering 1990–2017, this study domestic terrorism;
shows that the presence of a returnee decreases the likelihood homegrown terrorism;
that an executed plot will cause mass casualties. Plots carried out lone-wolf; homeland security
Introduction: being afraid. Being a little afraid
How great of a threat do would-be jihadis pose to their home country? And do those who
return home after gaining experience fighting abroad in Islamist insurgencies or attending
terror training camps pose a greater risk than other jihadi militants? The fear, as first outlined
by Hegghammer (2013), is two-fold. First, individuals that have gone abroad to fight might
CONTACT Christopher J. Wright wrightc#apsu.edu Department of Criminal Justice, Austin Peay State University,
Clarksville, TN 37043, USA
© 2018 Society for Terrorism Research
"
"2 C. J. WRIGHT
Many of the earliest studies on Western foreign fighters suggested that those who
returned were in fact more deadly than those with no experience fighting in Islamist insur-
gencies. Hegghammer’s (2013) analysis suggested that these foreign fighter returnees
were a greater danger than when they left. Likewise, Byman (2015), Nilson (2015),
Kenney (2015), and Vidno (2011) came to similar conclusions while offering key insights
into the various mechanisms linking foreign fighting with successful plot execution and
greater mass casualties.
Other studies came to either mixed conclusions or directly contradicted the earlier find-
ings. Adding several years of data to Hegghammer’s (2013) earlier study, Hegghammer
"
" BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION 3
for them to form the types of large, local networks that would be necessary to carry out a
large-scale attack without attracting the attention of security services’ (p. 92).
"
Note
1. Charges were brought against Noor Zahi Salman, the widow of the Omar Mateen who carried
out the June, 2016 attack against the Pulse Nightclub in Orlando, Florida (US Department of
Justice., 2017a, January 17). However, in March of 2018 a jury acquitted her of the charges that
she had foreknowledge of the attack.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes on contributors
Christopher J. Wright, Ph.D., is an Assistant Professor at Austin Peay State University where he
teaches in the Homeland Security Concentration.
ORCID
Christopher J. Wright http://orcid.org/0000-0003-0043-6616
References
Byman, D. (2015). The homecomings: What happens when Arab foreign fighters in Iraq and Syria
return? Studies in Conflict & Terrorism, 38(8), 581–602.
Byman, D. (2016). The Jihadist returnee threat: Just how dangerous? Political Science Quarterly, 131(1),
69–99.
Byman, D., & Shapiro, J. (2014). Be afraid. Be a little afraid: The threat of terrorism from Western foreign
fighters in Syria and Iraq. Foreign Policy at Brookings. Washington, DC: Brookings. Retrieved from
https://www.brookings.edu/wp-content/uploads/2016/06/Be-Afraid-web.pdf
The approach
The key here is to determine the regular markers that precede each section, and then to use them as tags in a call to corpus_segment(). It's the tags that will need tweaking, based on their degree of regularity across documents.
Based on what you supplied above, I pasted that into a plain text file that I named example.txt. This code extracted the Introduction and what I think is the body of the article, but for that I had to decide on a tag that marked its ending. Below, I used "Disclosure statement". So:
library("quanteda")
crp <- readtext::readtext("~/tmp/example.txt") %>%
corpus()
pat <- c("\nIntroduction?", "\nCONTACT", "©", "\nDisclosure statement")
crpextracted <- corpus_segment(crp, pattern = pat)
summary(crpextracted)
## Corpus consisting of 4 documents:
##
## Text Types Tokens Sentences pattern
## example.txt.1 62 74 5 Introduction:
## example.txt.2 18 21 2 CONTACT
## example.txt.3 156 253 11 ©
## example.txt.4 101 180 19 Disclosure statement
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/quanteda/* on x86_64 by kbenoit
## Created: Fri Jul 6 19:51:01 2018
## Notes: corpus_segment.corpus(crp, pattern = pat)
When you examine the text in the "Introduction:" tagged segment, you can see that everything from that string until the next tag was extracted as a new document:
corpus_subset(crpextracted, pattern == "\nIntroduction:") %>%
texts() %>% cat()
## being afraid. Being a little afraid
##
## How great of a threat do would-be jihadis pose to their home country? And do those who
##
## return home after gaining experience fighting abroad in Islamist insurgencies or attending
##
## terror training camps pose a greater risk than other jihadi militants? The fear, as first outlined
##
## by Hegghammer (2013), is two-fold. First, individuals that have gone abroad to fight might
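To show the idea behind tag-based segmentation without any package, here is a rough base R sketch. It is not how corpus_segment() is implemented, only an illustration of the principle: find where each tag starts, cut the text at those positions, and keep the segment you want. The toy text and the literal tags are assumptions, and the sketch assumes every tag occurs exactly once.

```r
# Toy document with the same kind of section markers as the example above.
txt <- paste(
  "ABSTRACT", "Summary text.",
  "Introduction: being afraid", "Body of the intro.",
  "CONTACT author details",
  "Disclosure statement", "No conflict.",
  sep = "\n"
)

# Literal tags; each is assumed to occur exactly once in the text.
tags <- c("\nIntroduction", "\nCONTACT", "\nDisclosure statement")

# Start position of each tag, in document order.
starts <- vapply(tags, function(tag) as.integer(regexpr(tag, txt, fixed = TRUE)),
                 integer(1))
starts <- sort(starts)

# Each segment runs from its tag to just before the next tag (or the end).
ends <- c(starts[-1] - 1L, nchar(txt))
segments <- substring(txt, starts, ends)
names(segments) <- names(starts)

intro <- segments[["\nIntroduction"]]
```

Anything before the first tag (here the abstract) falls outside every segment and is dropped, which is the behaviour the question asks for.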
How to remove pdf junk
PDF conversions typically produce unwanted junk such as running headers and footers. Here's how to remove them. (Note: You will want to do this before the step above.) How do you construct the toreplace pattern? You will need to understand something about regular expressions and do some experimentation.
library("stringr")
toreplace <- '\\n*\" \" BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION,{0,1} \\d+\\n*'
texts(crp) <- str_replace_all(texts(crp), regex(toreplace), "")
cat(texts(crp))
To demonstrate this on a section from your example:
# demonstration
x <- '
" " BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION 3
'
str_replace_all(x, regex(toreplace), "")
## [1] ""

Regex to extract double quotes and string in quotes R

I have a data frame with a column of "text." Each row of this column is filled with text from media articles.
I am trying to extract a string that occurs like this: "term" (including the double quotes around the term). I tried the following regular expression to capture instances where a word is sandwiched between two double quotes:
stri_extract_all_regex(df$text, '"(.+?)"')
This seems to capture some instances of what I am looking for, but in other cases - where I know the criteria are met - it does not. It also captures what seem to be just quotes of longer text (and not other instances of quoted text). Here are the results of using the above:
[[19]]
[1] "\"play a constructive and positive role\""
[2] "\"active and hectic reception\""
[3] "\"[term]\""
I would like to have only exactly "term" as the output (including the double quotes). I am trying to find instances when the term is used alone in quotes.
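One possible reading of this requirement is "a pair of straight double quotes with no whitespace and no other quote between them". A minimal base R sketch of that reading follows; the sample sentences are shortened from the articles quoted further down, and the pattern is an assumption about what counts as a single term.

```r
text <- c(
  'The statement did not use the word "Rohingya". She proposed "Muslim community of Rakhine State".',
  'Nationalists insist on use of the term "Bengali".',
  'No quoted terms here.'
)

# A quote, then one or more characters that are neither a quote nor
# whitespace, then a closing quote. Quoted phrases contain spaces and fail.
pattern <- '"[^"\\s]+"'

terms <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
single_terms <- unlist(terms)
```

Note that this only covers straight quotes. Text that uses curly quotes, as one of the sample articles does, would need those characters added to the pattern.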
Example in R:
test <- c(df$text[12], df$text[18])
res <- stri_extract_all_regex(test, '"\\S+"')
unlist(res)
[1] "\"Rohingya\"" "\"Bengali\"" NA
print(test)
[1] "Former UN general secretary Kofi Annan will advise Myanmar's government on resolving conflicts in Rakhine State, the office of the state counsellor announced today.Former UN Secretary General Kofi Annan speaks at the opening of the Consciouness Summit on climate change in Paris, France on July 21, 2015. Photo: EPARakhine State, one of the poorest in the Union, was wracked by sectarian violence in 2012 that forced more than 100,000 – mostly Muslims who ethnically identify as Rohingya – into squalid displacement camps where they face severe restrictions on movement as well as access to health care, education, and other other basic services.Addressing the ongoing crises has posed one of the most troubling challenges to Daw Aung San Suu Kyi's National League for Democracy-led government.Earlier today, the government announced the formation of an advisory panel that will be chaired by former UN chief, and focus on \"finding lasting solutions to the complex and delicate issues in the Rakhine State\".The board will submit recommendations to the government on \"conflict prevention, humanitarian assistance, rights and reconciliation, institution-building and promotion of development of Rakhine State,\" a statement from the state counsellor's office said.The statement did not use the word \"Rohingya\". Daw Aung San Suu Kyi has come under fire both at home and from international rights groups for failing prioritise to address the group's plight and seeking to placate hardline Buddhist nationalists by avoiding the politically-charged term. 
The government has already requested that the US Embassy and other diplomatic groups avoid the term Rohingya, and in June, she proposed \"Muslim community of Rakhine State\".The proposed neutral terminology, which the state counsellor ordered government officials to adopt, sparked mass protests in Rakhine State and in Yangon by hardline nationalists, who insist on use of the term \"Bengali\" that was also preferred by the previous government's to suggest the group's origins in neighbouring Bangladesh.In July UN special rapporteur for human rights Yanghee Lee urged the government to make ending \"institutionalised discrimination\" against the Rohingya and other Muslims in Rakhine an urgent priority.Myanmar also announced this week that current UN Secretary General Ban Ki-moon will attend the highly-anticipated 21st Century Panglong conference at the end of the month.The five-day talks, aimed at ending a host of complicated border ethnic conflicts that have lasted for decades, will begin on August 31."
[2] "Thousands of Kaman Muslims from the Rakhine State capital Sittwe obtained identity cards this week, some two years after they applied for the documents.“Now the problem is solved. The Kaman got national ID cards. We had proposed to the government and immigration office to work on the process of giving ID cards to Kaman,” said U Tin Hlaing Win, general secretary of the Kaman National Development Party (KNDP).The Kaman, as one of 135 officially recognised ethnic groups in Myanmar entitled to full citizenship, had struggled to get authorities to grant them the IDs due to the complex ethnic demographics and fraught identity politics of Rakhine State.Complicating the process for the Kaman applicants has been the population of more than 1 million Muslims in Rakhine State who self-identify as Rohingya, most of whom are stateless. “Citizenship scrutiny” programs to issue some form of identification to the minority group by the previous and current governments have been met with resistance by Rakhine nationalists.The shared Muslim faith of the Kaman and Rohingya became just one aspect of a contentious debate over terminology earlier this year, when State Counsellor Daw Aung San Suu Kyi put forward the phrase “the Muslim community in Rakhine State” to refer to self-identifying Rohingya in an attempt to chart a middle course on the issue of lexicon. 
“Rohingya” stirs passions among Buddhist nationalists, who insist that they be called “Bengalis” to imply that they are illegal immigrants from neighbouring Bangladesh, despite many tracing familial lineage in Rakhine State back generations.Some Kaman have since viewed the state counsellor’s edict warily, concerned that the Kaman identity might be conflated with that of the Rohingya – who are not entitled to citizenship – and could jeopardise Kaman prospects for ID cards and full rights under the law.Violence in 2012 between Buddhists and Muslims in Rakhine State affected Rakhine Buddhists, Kaman and Rohingya, but the latter suffered the brunt of casualties and displacement.U Tin Hlaing Win told The Myanmar Times last week that some of the Kaman Muslims displaced by the conflict in the island town of Rambre had also recently received national ID cards.“The Kaman are ethnics belonging to Myanmar,” said U Than Htun Aung, a senior immigration officer for Rakhine State. “Township immigration officers will examine [legitimate Kaman claims to citizenship] according to the process and they will ensure they get their rights.”Around 2000 Kaman applied for national ID cards in 2014, but only 38 people were issued the documents.The others were told they had not received IDs because of the purported existence of “fake Kaman”.More than 100,000 people are thought to hold government-issued national ID cards identifying them as Kaman, but KNDP research in 2013 estimated the actual ethnic Kaman population to be about 50,000.U Tin Hlaing Win said sorting out the “fake Kaman” issue was not solely the responsibility of Kaman people, adding that immigration officers through the years, and generations of ethnically mixed marriages and the offspring they produced, were also to blame for the confusion.“According to our research and knowledge tracing family trees, some Kaman identity-card holders were Rakhine plus Bengali or Rakhine plus Indian, not Kaman. 
It [identity problems] should be solved by three groups – we Kaman, the Rakhine and immigration authorities,” U Tin Hlaing Win told The Myanmar Times last week.What most seem to agree on is that “real Kaman” deserve the documentation they need to enjoy the full rights of citizenship.Ethnic Rakhine youth leader Ko Khine Lamin said, “The Rakhine objected to national ID cards for Kaman because of the controversy over fake Kaman. But there are real Kaman who have lived in Rakhine State since a long, long time ago. They should get their ethnic rights through careful examination by immigration officers.”Kaman politicians are not satisfied with their victory this week and are trying to meet with Rakhine State Chief Minister U Nyi Pu to raise other difficulties Kaman people face, such as transportation barriers. They also intend to ask the chief minister for rehabilitation programs for Kaman internally displaced people, as well as education and health support for the broader Kaman community."
The above code can only return terms in [1].
The "(.+?)" pattern matches ", then any characters other than line breaks, as few as possible, up to the closest following ". Because it can match whitespace too, it also matches "play a constructive and positive role" and "active and hectic reception".
To match a streak of non-whitespace chars in-between double quotes, you need to use
stri_extract_all_regex(df$text, '"\\S+"')
The "\S+" pattern matches ", then 1 or more non-whitespace chars, and then a closing ".
If you only want to match word chars (letters, digits, _) in between double quotes, use
'"\\w+"'
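The difference between the three patterns can be checked on a short made-up string. This is a minimal sketch using base R's regmatches() and gregexpr() instead of stringi, so it runs without extra packages. The sample sentence and the helper name extract_all are illustrative assumptions, not part of the question's data.

```r
# Made-up sample text (not taken from the question's data frame)
x <- 'She used "Rohingya" and "[term]", not "a longer phrase".'

# Hypothetical helper: return every match of `pattern` in a single string
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

lazy_any   <- extract_all('".+?"', x)   # anything between quotes, spaces included
non_space  <- extract_all('"\\S+"', x)  # no whitespace allowed between the quotes
word_chars <- extract_all('"\\w+"', x)  # only letters, digits and _ between the quotes

print(lazy_any)
print(non_space)
print(word_chars)
```

On this input the lazy pattern should return all three quoted pieces, '"\\S+"' should drop the multi-word phrase, and '"\\w+"' should keep only "Rohingya", because [ and ] are not word characters.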
To match curly quotes, use '["“]\\S+["”]' regex:
> res <- stri_extract_all_regex(test, '["“]\\S+["”]')
> unlist(res)
[1] "\"Rohingya\"" "\"Bengali\"" "\u0093Rohingya\u0094"
[4] "\u0093Bengalis\u0094"
And if you need to "normalize" the double quotes, use
> gsub("[“”]", '"', unlist(res))
[1] "\"Rohingya\"" "\"Bengali\"" "\"Rohingya\"" "\"Bengalis\""

fread EOF instead of separator

I'm trying to read a huge file with fread, but I guess something is wrong with the layout of the file.
If I try to read the file with
data = fread(input = "../data.txt", sep = "\t")
on this file (I just took the line with the error and a few before and after):
ID imdbID Title Year Rating Runtime Genre Released Director Writer Cast Metacritic imdbRating imdbVotes Poster Plot FullPlot Language Country Awards lastUpdated Type
683 tt0000683 The Fatal Hour 1908 14 min Short, Crime 1908-08-18 D.W. Griffith D.W. Griffith George Gebhardt, Harry Solter, Linda Arvidson, Florence Auer 5.9 26 Pong Lee, a Mephistophelian, saffron-skinned varlet, has for some time carried on this atrocious female white slave traffic, in which sinister business he was assisted by a stygian whelp, ... Pong Lee, a Mephistophelian, saffron-skinned varlet, has for some time carried on this atrocious female white slave traffic, in which sinister business he was assisted by a stygian whelp, by name Hendricks. Pong writes Hendricks that he has need for five young girls, and so Hendricks sets out to secure them. Visiting a rural district, he has no trouble, by his glib, affable manner, in gaining the confidence of several young and pretty girls. Pong is on hand with a closed carriage to bag the prey. One of the girls, as she is seized, emits a yell that alarms the neighborhood and brings to the scene several policemen and a couple of detectives, who have long been on the lookout for these caitiffs. The Chinese get away with the carriage, however, and Hendricks by subterfuge throws the police on the wrong scent. One of the detectives is a woman, and possessed of shrewd powers of deduction, hence does not swallow the bald story of the villain, and exercises her natural acumen with success. She shadows Hendricks, and by means of a flirtation inveigles him to a restaurant, where she succeeds in doping his drink. He falls asleep and she secures the letter written by Pong, which discloses the hiding place of the Chinaman. This she immediately telephones to the police, and while so doing Hendricks awakes and starts off to warn his friends. 
He arrives at the old deserted house ahead of the police, but escape is impossible, so the police rescue the girls, but fail to secure Pong and Hendricks, who afterwards seize the girl detective, and taking her to the house, tie her to a post and arrange a large pistol on the face of a clock in such a way that when the hands point to twelve the gun is fired and the girl will receive the charge. Twenty minutes are allowed for them to get away, for the hands are now indicating 11:40. Certain death seems to be her fate, and would have been had not an accident disclosed her plight. Hendricks after leaving the place is thrown by a street car, and this serves to discover his identity, so he is captured and a wild ride is made to the house in which the poor girl is incarcerated. This incident is shown in alternate scenes. There is the helpless girl, with the clock ticking its way towards her destruction, and out on the road is the carriage, tearing along at breakneck speed to the rescue, arriving just in time to get her safely out of range of the pistol as it goes off. In conclusion we can promise this to be an exceedingly thrilling film, of more than ordinary interest. English USA 2015-10-24 01:44:09.623000000 movie
684 tt0000684 Father Gets in the Game 1908 10 min Short, Comedy 1908-10-10 D.W. Griffith D.W. Griffith Mack Sennett, Harry Solter, George Gebhardt, Linda Arvidson 5.1 39 "You have got to keep up with the bandwagon or quit." This never impressed old Wilkins so forcibly as when his son and daughter give him the go-by, stamping him as a "has-been," and away ... "You have got to keep up with the bandwagon or quit." This never impressed old Wilkins so forcibly as when his son and daughter give him the go-by, stamping him as a "has-been," and away out of the game. Even Mrs. Wilkins, who is as vivacious as a widow, snubs him. He keenly feels his condition and resolves to alter it. With this in view, he enlists the services of Professor Dyem, the celebrated Dermatologist and Tonsorial Artist. After a session with the Professor, beheld the transmogrified Wilkins. What a change! Shorn of his grizzly beard, his locks raven, complexion florid, eye clear and step elastic, he views himself in the mirror. He hardly recognizes himself. In fact, it requires his valet to convince him that he is he. "Am I in it? Well. I guess. If I don't keep up with and even beat that bandwagon by a city block, my name is not Pill Wilkins." He sallies forth and makes for the park. The first person he encounters is his wife. He approaches her in elation, but she mistakes him for an impudent masher and he receives the weight of her parasol over his head for his trouble. The next one he meets is his daughter. She is seated on a bench, waiting for Charley. He takes a seat beside her and when he tries to make himself known she draws herself up to full height and with a blow sends him backward over the bench onto the grass. Well, he changes his tactics, and gets reckless. Along comes his son with his best girl, so he decides to win her out for spite. 
Now this young lady has a sensitive pneumogastric nerve, and when he sits beside her on the bench and slyly suggests a cold bottle and a hot bird, she is "his'n." This is done so coolly and so quickly, that young Wilkins, who, of course, does not recognize his respected papa, is speechless with rage. He follows them, however, to the café, where his intrusion is resented and he is rudely thrown from the place. At the Wilkins' domicile there is an indignation meeting. Mother, daughter and son all rush in to relate their experiences to father. He is not to be found. Suddenly a hilarious individual enters. "'Tis he, the insulter: a drunk and disorderly." They are about to have him thrown out when the valet comes to his rescue and explains that the jubilant gentleman is no other than their dear papa, who has not only caught up with the bandwagon, but is sitting on the seat with the driver. They all gasp in surprise, and young Wilkins takes a wreath of laurel from a statue and places it on old Wilkins' brow, saying: "Pop, you are the candy." English USA 2015-10-02 04:59:48.643000000 movie
685 tt0000685 The Feud and the Turkey 1908 15 min Short, Drama, Romance 1908-12-08 D.W. Griffith D.W. Griffith Harry Solter, Linda Arvidson, Arthur V. Johnson, Robert Harron 5.8 13 The Wilkinsons and Caulfields, owing to a trivial dispute, had been at loggerheads for years and as time went on the feeling became more bitter, until they even forbade their children ... The Wilkinsons and Caulfields, owing to a trivial dispute, had been at loggerheads for years and as time went on the feeling became more bitter, until they even forbade their children playing together. The little ones, however, in their childish innocence, could not appreciate the odium of their elders, and Bobby Wilkinson and Nellie Caulfield became child lovers. This incensed Colonel Wilkinson, who tore the children apart, ordered Bobby never to be seen in her company again. The Colonel's action ignited the ire of the Caulfields and a furious conflict ensued, resulting in the shooting to death of George, the Colonel's youngest son, a boy of fourteen. From that time on the clans kept strictly to themselves. But love knows no clannishness, and, despite family hatred, Bob and Nellie remained lovers. After ten years, driven to desperation by this apparently insurmountable barrier, they elope and are married. Bob decides to brave the storm of his father's anger and present his wife, but the old Colonel drives him from the house, disowning him. Old Aunt Dinah and Uncle Daniel, the colored servants, were so attached to the young folks that they go with them. Two years later we find the little family, now increased by an infant son, having a hard of it. It is Christmas morning and no turkey for dinner. Old Aunt Dinah, believing in the efficacy of prayer, gets down on her knees in the kitchen to ask the good Lord to send them a bird. Uncle Daniel, touched by this demonstration of faith, takes a gun and determines to get a turkey at any hazard. 
Over the hills he goes, but his journey is hopelessly fruitless until he comes to the rear of the Colonel's house. Tillie, the cook, has just hung a fat turkey on a post outside the kitchen door. When Daniel sees it he can't resist the temptation. Back home he hustles and finds Dinah still at prayer, he lays the fowl on the floor beside her and sneaks out. When Dinah sees it she surely thinks it was due to her prayers. Well, the turkey is cooked and an old-fashioned Christmas anticipated. Meanwhile the Colonel has discovered his loss and tracks the thief to Bob's estate. Entering, a tragedy seems inevitable, but when the old Colonel sees the young one, his grandson, in the cradle, his heart goes out to it and the feud ends then and there. All hands sit down and enjoy a real Merry Christmas dinner. English USA 2015-08-29 00:33:15.610000000 movie
686 tt0000686 Fiestas del carnaval de 1908 en Barcelona 1908 Documentary, Short Fructuós Gelabert Fructuós Gelabert Spain 2015-11-09 14:24:29.583000000 movie
I get this error:
> Error in fread(input = "../data.txt", sep="\t" : Expected sep (' ') but new line, EOF (or other
> non printing character) ends field 20 when detecting types ( first):
> 684 tt0000684 Father Gets in the Game 1908 10 min Short,
> Comedy 1908-10-10 D.W. Griffith D.W. Griffith Mack Sennett, Harry
> Solter, George Gebhardt, Linda Arvidson 5.1 39 "You have got to keep
> up with the bandwagon or quit." This never impressed old Wilkins so
> forcibly as when his son and daughter give him the go-by, stamping him
> as a "has-been," and away ... "You have got to keep up with the
> bandwagon or quit." This never impressed old Wilkins so forcibly as
> when his son and daughter give him the go-by, stamping him as a
> "has-been," and away out of the game. Even Mrs. Wilkins, who is as
> vivacious as a widow, snubs him. He keenly feels his condition and
> resolves to alter it. With this in view, he enlists the services of
> Professor Dyem, the celebrated Dermatologist and Tonsorial Artist.
> After a session with the Professor, beheld the transmogrified Wilkins.
> W
How can I solve it?
I'm not 100% sure what the error in your data is, but try adding fill = TRUE to the fread options:
data = fread(input = "../data.txt", sep = "\t", fill = TRUE)
I had a similar error, and it seemed that fread was having trouble identifying my column separation. Setting fill to TRUE lets fread pad rows that are missing fields; at least then you can check the resulting data frame and find out where the weirdness is.
Add fill = TRUE to the fread call.
What's happening: the rows in the data have unequal length. With fill = TRUE, blank fields are implicitly filled.
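To see whether rows really do have unequal length, and which ones, the number of tab-separated fields per line can be counted before reading. This is a rough sketch using base R's count.fields() on a small made-up file. The file contents and variable names are assumptions for illustration, and quoting is switched off (quote = "") so that stray double quotes inside a field are treated as ordinary characters.

```r
# Write a tiny made-up tab-separated file; the third line is one field short
path <- tempfile(fileext = ".txt")
writeLines(c("ID\tTitle\tYear",
             "683\tThe Fatal Hour\t1908",
             "684\tFather Gets in the Game",
             "685\tThe Feud and the Turkey\t1908"),
           path)

# Count fields on every line, treating quotes and '#' as ordinary characters
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# Lines whose field count differs from the header's
bad_lines <- which(n_fields != n_fields[1])

print(n_fields)
print(bad_lines)
```

Any line numbers reported in bad_lines are the rows that fill = TRUE would pad, which gives a starting point for inspecting the raw file.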

Resources