Remove a section from Corpus - r

I have a quanteda corpus of hundreds of documents. How do I remove specific sections, such as the abstract and footnotes? Otherwise, I am faced with doing it manually. Thanks.
As requested, here is a text example from a regular journal article. It shows the metadata, then the abstract, keywords, introduction, author contact details, body of the article, Note, Disclosure statement, Notes on contributors, and references, in that order. I would like to remove everything apart from the introduction and the body of the article. I would also like to remove the author name and journal title, which are repeated throughout.
" Behavioral Sciences of Terrorism and Political Aggression
ISSN: 1943-4472 (Print) 1943-4480 (Online) Journal homepage: http://www.tandfonline.com/loi/rirt20
Sometimes they come back: responding to
American foreign fighter returnees and other
Elusive threats
Christopher J. Wright
To cite this article: Christopher J. Wright (2018): Sometimes they come back: responding to
American foreign fighter returnees and other Elusive threats, Behavioral Sciences of Terrorism and
Political Aggression, DOI: 10.1080/19434472.2018.1464493
To link to this article: https://doi.org/10.1080/19434472.2018.1464493
Published online: 23 Apr 2018.
Submit your article to this journal
Article views: 57
View related articles
View Crossmark data
Full Terms & Conditions of access and use can be found at
http://www.tandfonline.com/action/journalInformation?journalCode=rirt20
"
"BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION, 2018
https://doi.org/10.1080/19434472.2018.1464493
Sometimes they come back: responding to American foreign
fighter returnees and other Elusive threats
Christopher J. Wright
Department of Criminal Justice, Austin Peay State University, Clarksville, TN, USA
ABSTRACT ARTICLE HISTORY
Much has been made of the threat of battle hardened jihadis from Received 8 January 2018
Islamist insurgencies, especially Syria. But do Americans who Accepted 10 April 2018
return home after gaining experience fighting abroad pose a
KEYWORDS
greater risk than homegrown jihadi militants with no such Terrorism; foreign fighters;
experience? Using updated data covering 1990–2017, this study domestic terrorism;
shows that the presence of a returnee decreases the likelihood homegrown terrorism;
that an executed plot will cause mass casualties. Plots carried out lone-wolf; homeland security
Introduction: being afraid. Being a little afraid
How great of a threat do would-be jihadis pose to their home country? And do those who
return home after gaining experience fighting abroad in Islamist insurgencies or attending
terror training camps pose a greater risk than other jihadi militants? The fear, as first outlined
by Hegghammer (2013), is two-fold. First, individuals that have gone abroad to fight might
CONTACT Christopher J. Wright wrightc#apsu.edu Department of Criminal Justice, Austin Peay State University,
Clarksville, TN 37043, USA
© 2018 Society for Terrorism Research
"
"2 C. J. WRIGHT
Many of the earliest studies on Western foreign fighters suggested that those who
returned were in fact more deadly than those with no experience fighting in Islamist insur-
gencies. Hegghammer’s (2013) analysis suggested that these foreign fighter returnees
were a greater danger than when they left. Likewise, Byman (2015), Nilson (2015),
Kenney (2015), and Vidno (2011) came to similar conclusions while offering key insights
into the various mechanisms linking foreign fighting with successful plot execution and
greater mass casualties.
Other studies came to either mixed conclusions or directly contradicted the earlier find-
ings. Adding several years of data to Hegghammer’s (2013) earlier study, Hegghammer
"
" BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION 3
for them to form the types of large, local networks that would be necessary to carry out a
large-scale attack without attracting the attention of security services’ (p. 92).
"
Note
1. Charges were brought against Noor Zahi Salman, the widow of the Omar Mateen who carried
out the June, 2016 attack against the Pulse Nightclub in Orlando, Florida (US Department of
Justice., 2017a, January 17). However, in March of 2018 a jury acquitted her of the charges that
she had foreknowledge of the attack.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes on contributors
Christopher J. Wright, Ph.D., is an Assistant Professor at Austin Peay State University where he
teaches in the Homeland Security Concentration.
ORCID
Christopher J. Wright http://orcid.org/0000-0003-0043-6616
References
Byman, D. (2015). The homecomings: What happens when Arab foreign fighters in Iraq and Syria
return? Studies in Conflict & Terrorism, 38(8), 581–602.
Byman, D. (2016). The Jihadist returnee threat: Just how dangerous? Political Science Quarterly, 131(1),
69–99.
Byman, D., & Shapiro, J. (2014). Be afraid. Be a little afraid: The threat of terrorism from Western foreign
fighters in Syria and Iraq. Foreign Policy at Brookings. Washington, DC: Brookings. Retrieved from
https://www.brookings.edu/wp-content/uploads/2016/06/Be-Afraid-web.pdf

The approach
The key here is to determine the regular markers that precede each section, and then to use them as tags in a call to corpus_segment(). It's the tags that will need tweaking, based on their degree of regularity across documents.
I pasted what you supplied above into a plain text file named example.txt. The code below extracts the Introduction and what I think is the body of the article. For the body I had to pick a tag that marks its ending, and I used "Disclosure statement":
library("quanteda")
crp <- readtext::readtext("~/tmp/example.txt") %>%
    corpus()
pat <- c("\nIntroduction?", "\nCONTACT", "©", "\nDisclosure statement")
crpextracted <- corpus_segment(crp, pattern = pat)
summary(crpextracted)
## Corpus consisting of 4 documents:
##
## Text Types Tokens Sentences pattern
## example.txt.1 62 74 5 Introduction:
## example.txt.2 18 21 2 CONTACT
## example.txt.3 156 253 11 ©
## example.txt.4 101 180 19 Disclosure statement
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/quanteda/* on x86_64 by kbenoit
## Created: Fri Jul 6 19:51:01 2018
## Notes: corpus_segment.corpus(crp, pattern = pat)
When you examine the text in the "Introduction:" tagged segment, you can see that everything from that string up to the next tag was extracted as a new document:
corpus_subset(crpextracted, pattern == "\nIntroduction:") %>%
    texts() %>% cat()
## being afraid. Being a little afraid
##
## How great of a threat do would-be jihadis pose to their home country? And do those who
##
## return home after gaining experience fighting abroad in Islamist insurgencies or attending
##
## terror training camps pose a greater risk than other jihadi militants? The fear, as first outlined
##
## by Hegghammer (2013), is two-fold. First, individuals that have gone abroad to fight might
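The same tag-based idea can also be sketched in base R, which may help when experimenting with tags before running them over the whole corpus. This is only an illustration under stated assumptions: the text in `txt` is a made-up miniature of the example, and the names `tags`, `tag_regex`, and `segments` are hypothetical, not part of quanteda.

```r
# Made-up miniature document with the same kind of section markers
txt <- paste0(
  "ABSTRACT\nSome abstract text.",
  "\nIntroduction: being afraid\nHow great of a threat?",
  "\nCONTACT someone",
  "\nDisclosure statement\nNo conflict."
)

# Tags that mark the start of each section (assumed regular across documents)
tags <- c("\nIntroduction:", "\nCONTACT", "\nDisclosure statement")
tag_regex <- paste(tags, collapse = "|")

m <- gregexpr(tag_regex, txt)
found  <- regmatches(txt, m)[[1]]                 # tags, in order of appearance
pieces <- regmatches(txt, m, invert = TRUE)[[1]]  # text before, between, and after tags

# Drop the piece before the first tag so that each tag lines up with its text
segments <- data.frame(
  tag  = trimws(found),
  text = trimws(pieces[-1]),
  stringsAsFactors = FALSE
)

# Keep only the section of interest
intro <- segments$text[segments$tag == "Introduction:"]
cat(intro)
```

Everything before the first tag (here the abstract) is discarded, which mirrors how the segments above begin at the first matched tag.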
How to remove pdf junk
PDF conversions usually produce unwanted junk such as running headers and footers. Here's how to remove them. (Note: you will want to do this before the step above.) Constructing the toreplace pattern requires some understanding of regular expressions and some experimentation.
library("stringr")
toreplace <- '\\n*\" \" BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION,{0,1} \\d+\\n*'
texts(crp) <- str_replace_all(texts(crp), regex(toreplace), "")
cat(texts(crp))
To demonstrate this on a section from your example:
# demonstration
x <- '
" " BEHAVIORAL SCIENCES OF TERRORISM AND POLITICAL AGGRESSION 3
'
str_replace_all(x, regex(toreplace), "")
## [1] ""

Related

Matching statutory provisions of two in R

In advance: sorry for all the Norwegian references, but I hope I've explained my problem well enough for them to still make sense...
So, in 2005 Norway got a new criminal law. The old one was somewhat unstructured (only three chapters), while the statutory provisions in the 2005 version have been structured into 31 chapters, depending on the area of the offense (can be seen here: https://lovdata.no/dokument/NL/lov/2005-05-20-28). I call these "areas of law". For example, in the 2005 version laws regarding sexual offenses are in chapter 26. Logically, then the statutory provisions that belong to this chapter are categorized as belonging to the area of law called "s
Some of the old laws have been structured into the new chapters, some new ones have been added, and some have been repealed. I have what is called a "law mirror": a list where you can find where the old provisions are in the new law, if they haven't been repealed. The new law came into force for offenses committed from the 1st of October 2015.
An example of a law mirror: https://no.wikipedia.org/wiki/Straffeloven_(lovspeil). I've pivoted the list longer, such that it looks like this:
Law Mirror: "Seksuallovbrud" means sexual offense, "kap_2005" says which chapter in the 2005 law the statutory provision (Norwegian: "paragraf") falls under, and "straffelov" specifies whether the provision comes from the 2005 or the 1902 version of the law.
The data I have consist of two separate data frames. Df1 is the law mirror. Df2 consists of cases in the Norwegian court of appeals from between 1993 and 2019, where the criminal law was the basis of the verdict. I've made a dummy (strl1902) in Df2 for whether the verdict in the case came before or after the new law came into force. Equal to 1 if it's the old one. I've also extracted the number of the statutory provision.
On the basis of this I want to categorize the cases using statutory provisions from the old criminal law into the areas of law from the new law.
This is where I need help:
Does anyone have an idea of how I can distinguish between the provisions from the old and the new law, so that I can also make dummies that place the provisions from the 1902 law into the areas of law of the 2005 law?
Hope this makes sense.
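One way to approach this could be a join on both the law version and the provision number, so that the same provision number in the 1902 and 2005 laws is never confused. The sketch below is a hedged illustration only: the rows are made-up values, the layout of Df1 and Df2 is assumed from the description above, and `case_id` and `kap26` are hypothetical names.

```r
# Assumed shape of the law mirror (Df1): one row per provision and law version.
# All values are made up for illustration.
df1 <- data.frame(
  straffelov = c(1902, 1902, 2005, 2005),
  paragraf   = c(192, 257, 291, 321),
  kap_2005   = c(26, 27, 26, 27)
)

# Assumed shape of the cases (Df2): dummy for the old law plus provision number
df2 <- data.frame(
  case_id  = 1:4,
  strl1902 = c(1, 1, 0, 0),
  paragraf = c(192, 257, 291, 321)
)

# Turn the dummy into the same key that the law mirror uses
df2$straffelov <- ifelse(df2$strl1902 == 1, 1902, 2005)

# Left join: every case keeps its row, and repealed provisions get NA in kap_2005
out <- merge(df2, df1, by = c("straffelov", "paragraf"), all.x = TRUE)
out <- out[order(out$case_id), ]

# Example dummy for one area of law (chapter 26)
out$kap26 <- as.integer(out$kap_2005 == 26)
print(out)
```

Joining on the pair (straffelov, paragraf) is the design choice that keeps old and new provisions apart; dummies for the other chapters can be built from `kap_2005` in the same way.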

How could you remove the similar portion from two large strings?

I am working on classification of some documents and a number of the documents have large sections of similar (and usually irrelevant) text. I would like to identify and remove those similar sections, as I believe I may be able to make a better model.
An example would be proposals by an organization, each of which contains the same paragraph regarding the organization's mission statement and purpose.
A couple of points that make this difficult:
the similar sections are not known ahead of time, making a fixed pattern inappropriate
they could be located anywhere in the documents, and the documents do not have a consistent structure
the pattern could be many characters long, e.g. 3000+ characters
I don't want to remove every similar word, just large sections
I don't want to identify which strings are similar, rather I want to remove the similar sections.
I've considered regex and looked through some packages like stringr, strdist, and the base functions, but these utilities seem useful if you already know the pattern and the pattern is much shorter, or if the documents have a similar structure. In my case the text could be structured differently and the pattern is not predefined, but rather whatever is similar between the documents.
I considered making and comparing lists of 3000-grams for each document but this didn't seem feasible or easy to implement.
Below is an example of a complete solution, but really I am not even sure how to approach this problem, so information in that direction would be useful as well.
Example code
doc_a <- "this document discusses african hares in the northern sahara. african hares
are the most common land dwelling mammal in the northern sahara. crocodiles eat
african hares. this text is from a book written for the foundation for education
in northern africa."
doc_b <- "this document discusses the nile. The nile delta is in egypt. the nile is the
longest river in the world. the nile has lots of crocodiles. crocodiles and
alligators are different. crocodiles eat african hares. crocodiles are the most common
land dwelling reptile in egypt. this text is from a book written for the foundation
for education in northern africa."
# this function would trim similar sections of 6 or more words in length
# (length in characters is also acceptable)
trim_similar(doc_a, doc_b, 6)
Output
[1] "this document discusses african hares in the northern sahara. african hares
mammal in the northern sahara. crocodiles eat african hares."
[2] "this document discusses the nile. The nile delta is in egypt. the nile is the
longest river in the world. the nile has lots of crocodiles. crocodiles and alligators
are different. crocodiles eat african hares. crocodiles reptile in egypt."
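A minimal sketch of one possible `trim_similar()`, assuming that "similar" means an exact run of at least `n` whitespace-delimited words that occurs in both documents. It compares word n-grams rather than character n-grams, handles only two documents, and does not attempt fuzzy matching, so it is a starting point, not a full solution. The helper names `ngrams` and `covered` are hypothetical.

```r
trim_similar <- function(a, b, n = 6) {
  wa <- strsplit(trimws(a), "\\s+")[[1]]
  wb <- strsplit(trimws(b), "\\s+")[[1]]

  # All windows of n consecutive words, each pasted into one string
  ngrams <- function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }
  ga <- ngrams(wa)
  gb <- ngrams(wb)

  # Flag every word covered by an n-gram that also occurs in the other document
  covered <- function(w, own, other) {
    drop <- rep(FALSE, length(w))
    for (i in which(own %in% other)) drop[i:(i + n - 1)] <- TRUE
    drop
  }

  c(paste(wa[!covered(wa, ga, gb)], collapse = " "),
    paste(wb[!covered(wb, gb, ga)], collapse = " "))
}

doc_a <- "this document discusses african hares in the northern sahara. african hares
are the most common land dwelling mammal in the northern sahara. crocodiles eat
african hares. this text is from a book written for the foundation for education
in northern africa."
doc_b <- "this document discusses the nile. The nile delta is in egypt. the nile is the
longest river in the world. the nile has lots of crocodiles. crocodiles and
alligators are different. crocodiles eat african hares. crocodiles are the most common
land dwelling reptile in egypt. this text is from a book written for the foundation
for education in northern africa."

res <- trim_similar(doc_a, doc_b, 6)
print(res)
```

Because overlapping n-grams are merged through the word-level flags, a shared passage of any length of at least `n` words is removed as a whole. Line breaks are collapsed to single spaces in the result.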

Regex to extract double quotes and string in quotes R

I have a data frame with a column of "text." Each row of this column is filled with text from media articles.
I am trying to extract a string that occurs like this: "term" (including the double quotes around the term). I tried the following regular expression to capture instances where a word is sandwiched between two double quotes:
stri_extract_all_regex(df$text, '"(.+?)"')
This seems to capture some instances of what I am looking for, but in other cases - where I know the criteria are met - it does not. It also captures what seem to be just quotes of longer text (and not other instances of quoted text). Here are the results of using the above:
[[19]]
[1] "\"play a constructive and positive role\""
[2] "\"active and hectic reception\""
[3] "\"[term]\""
I would like to have only exactly "term" as the output (including the double quotes). I am trying to find instances when the term is used alone in quotes.
Example in R:
test <- c(df$text[12], df$text[18])
res <- stri_extract_all_regex(test, '"\\S+"')
unlist(res)
[1] "\"Rohingya\"" "\"Bengali\"" NA
print(test)
[1] "Former UN general secretary Kofi Annan will advise Myanmar's government on resolving conflicts in Rakhine State, the office of the state counsellor announced today.Former UN Secretary General Kofi Annan speaks at the opening of the Consciouness Summit on climate change in Paris, France on July 21, 2015. Photo: EPARakhine State, one of the poorest in the Union, was wracked by sectarian violence in 2012 that forced more than 100,000 – mostly Muslims who ethnically identify as Rohingya – into squalid displacement camps where they face severe restrictions on movement as well as access to health care, education, and other other basic services.Addressing the ongoing crises has posed one of the most troubling challenges to Daw Aung San Suu Kyi's National League for Democracy-led government.Earlier today, the government announced the formation of an advisory panel that will be chaired by former UN chief, and focus on \"finding lasting solutions to the complex and delicate issues in the Rakhine State\".The board will submit recommendations to the government on \"conflict prevention, humanitarian assistance, rights and reconciliation, institution-building and promotion of development of Rakhine State,\" a statement from the state counsellor's office said.The statement did not use the word \"Rohingya\". Daw Aung San Suu Kyi has come under fire both at home and from international rights groups for failing prioritise to address the group's plight and seeking to placate hardline Buddhist nationalists by avoiding the politically-charged term. 
The government has already requested that the US Embassy and other diplomatic groups avoid the term Rohingya, and in June, she proposed \"Muslim community of Rakhine State\".The proposed neutral terminology, which the state counsellor ordered government officials to adopt, sparked mass protests in Rakhine State and in Yangon by hardline nationalists, who insist on use of the term \"Bengali\" that was also preferred by the previous government's to suggest the group's origins in neighbouring Bangladesh.In July UN special rapporteur for human rights Yanghee Lee urged the government to make ending \"institutionalised discrimination\" against the Rohingya and other Muslims in Rakhine an urgent priority.Myanmar also announced this week that current UN Secretary General Ban Ki-moon will attend the highly-anticipated 21st Century Panglong conference at the end of the month.The five-day talks, aimed at ending a host of complicated border ethnic conflicts that have lasted for decades, will begin on August 31."
[2] "Thousands of Kaman Muslims from the Rakhine State capital Sittwe obtained identity cards this week, some two years after they applied for the documents.“Now the problem is solved. The Kaman got national ID cards. We had proposed to the government and immigration office to work on the process of giving ID cards to Kaman,” said U Tin Hlaing Win, general secretary of the Kaman National Development Party (KNDP).The Kaman, as one of 135 officially recognised ethnic groups in Myanmar entitled to full citizenship, had struggled to get authorities to grant them the IDs due to the complex ethnic demographics and fraught identity politics of Rakhine State.Complicating the process for the Kaman applicants has been the population of more than 1 million Muslims in Rakhine State who self-identify as Rohingya, most of whom are stateless. “Citizenship scrutiny” programs to issue some form of identification to the minority group by the previous and current governments have been met with resistance by Rakhine nationalists.The shared Muslim faith of the Kaman and Rohingya became just one aspect of a contentious debate over terminology earlier this year, when State Counsellor Daw Aung San Suu Kyi put forward the phrase “the Muslim community in Rakhine State” to refer to self-identifying Rohingya in an attempt to chart a middle course on the issue of lexicon. 
“Rohingya” stirs passions among Buddhist nationalists, who insist that they be called “Bengalis” to imply that they are illegal immigrants from neighbouring Bangladesh, despite many tracing familial lineage in Rakhine State back generations.Some Kaman have since viewed the state counsellor’s edict warily, concerned that the Kaman identity might be conflated with that of the Rohingya – who are not entitled to citizenship – and could jeopardise Kaman prospects for ID cards and full rights under the law.Violence in 2012 between Buddhists and Muslims in Rakhine State affected Rakhine Buddhists, Kaman and Rohingya, but the latter suffered the brunt of casualties and displacement.U Tin Hlaing Win told The Myanmar Times last week that some of the Kaman Muslims displaced by the conflict in the island town of Rambre had also recently received national ID cards.“The Kaman are ethnics belonging to Myanmar,” said U Than Htun Aung, a senior immigration officer for Rakhine State. “Township immigration officers will examine [legitimate Kaman claims to citizenship] according to the process and they will ensure they get their rights.”Around 2000 Kaman applied for national ID cards in 2014, but only 38 people were issued the documents.The others were told they had not received IDs because of the purported existence of “fake Kaman”.More than 100,000 people are thought to hold government-issued national ID cards identifying them as Kaman, but KNDP research in 2013 estimated the actual ethnic Kaman population to be about 50,000.U Tin Hlaing Win said sorting out the “fake Kaman” issue was not solely the responsibility of Kaman people, adding that immigration officers through the years, and generations of ethnically mixed marriages and the offspring they produced, were also to blame for the confusion.“According to our research and knowledge tracing family trees, some Kaman identity-card holders were Rakhine plus Bengali or Rakhine plus Indian, not Kaman. 
It [identity problems] should be solved by three groups – we Kaman, the Rakhine and immigration authorities,” U Tin Hlaing Win told The Myanmar Times last week.What most seem to agree on is that “real Kaman” deserve the documentation they need to enjoy the full rights of citizenship.Ethnic Rakhine youth leader Ko Khine Lamin said, “The Rakhine objected to national ID cards for Kaman because of the controversy over fake Kaman. But there are real Kaman who have lived in Rakhine State since a long, long time ago. They should get their ethnic rights through careful examination by immigration officers.”Kaman politicians are not satisfied with their victory this week and are trying to meet with Rakhine State Chief Minister U Nyi Pu to raise other difficulties Kaman people face, such as transportation barriers. They also intend to ask the chief minister for rehabilitation programs for Kaman internally displaced people, as well as education and health support for the broader Kaman community."
The above code can only return terms in [1].
The "(.+?)" pattern matches ", then any char other than line break chars, as few as possible, up to the closest (leftmost) ". It means it can match whitespaces, too, and thus matches "play a constructive and positive role" and "active and hectic reception".
To match a streak of non-whitespace chars in-between double quotes, you need to use
stri_extract_all_regex(df$text, '"\\S+"')
The "\S+" pattern matches ", then 1 or more non-whitespace chars, and then a closing ".
If you only want to match word chars (letters, digits, _) in between double quotes, use
'"\\w+"'
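The difference between the patterns can be checked on a short string. This is a small sketch in base R (using regmatches() instead of stringi so that it runs without extra packages); the sentence in `x` is shortened from the example text, and `extract` is a hypothetical helper.

```r
x <- 'The statement did not use the word "Rohingya". She proposed "Muslim community of Rakhine State" instead.'

# Return all matches of a pattern in a single string
extract <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

lazy_any  <- extract('"(.+?)"', x)  # any quoted span, whitespace included
no_spaces <- extract('"\\S+"', x)   # quoted span without whitespace
word_only <- extract('"\\w+"', x)   # quoted letters, digits, or _

print(lazy_any)
print(no_spaces)
print(word_only)
```

Only the first pattern picks up the multi-word quotation. The other two return just the single quoted term.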
To match curly quotes, use '["“]\\S+["”]' regex:
> res <- stri_extract_all_regex(test, '["“]\\S+["”]')
> unlist(res)
[1] "\"Rohingya\"" "\"Bengali\"" "\u0093Rohingya\u0094"
[4] "\u0093Bengalis\u0094"
And if you need to "normalize" the double quotes, use
> gsub("[“”]", '"', unlist(res))
[1] "\"Rohingya\"" "\"Bengali\"" "\"Rohingya\"" "\"Bengalis\""

Replicate Postgres pg_trgm text similarity scores in R?

Does anyone know how to replicate the (pg_trgm) postgres trigram similarity score from the similarity(text, text) function in R? I am using the stringdist package and would rather use R to calculate these on a matrix of text strings in a .csv file than run a bunch of postgresql queries.
Running similarity(string1, string2) in postgres gives me a score between 0 and 1.
I tried using the stringdist package to get a score, but I think I still need to divide the result of the code below by something.
stringdist(string1, string2, method="qgram",q = 3 )
Is there a way to replicate the pg_trgm score with the stringdist package or another way to do this in R?
An example would be getting the similarity score between the description of a book and the description of a genre like science fiction. For example, suppose I have the following two book descriptions and one genre description:
book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one another, and, above all, avoid being contaminated by Area X itself.
They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."
book 2= "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."
genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."
How can I get a similarity score for the description of each book against the description of the science fiction genre like pg_trgm using an R script?
How about something like this?
library(textcat)
?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.
round(textcat_xdist(
    list(
        text1 = "hello there",
        text2 = "why hello there",
        text3 = "totally different"
    ),
    method = "cosine"),
    3)
# text1 text2 text3
#text1 0.000 0.078 0.731
#text2 0.078 0.000 0.739
#text3 0.731 0.739 0.000
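Note that the values above are distances between n-gram profiles, not the pg_trgm score. If the goal is a number comparable to similarity(), the behaviour described in the pg_trgm documentation can be sketched in base R: text is lowercased and split into words, each word is treated as having two spaces prefixed and one space suffixed, and the score is the number of shared trigrams divided by the number of distinct trigrams across both strings. The function names `trigrams` and `trgm_similarity` are hypothetical, and the sketch assumes ASCII letters and digits only, so results may differ from Postgres for other characters or locales.

```r
# Set of padded trigrams of a string, following the pg_trgm description
trigrams <- function(s) {
  words <- strsplit(tolower(s), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words)]
  grams <- unlist(lapply(words, function(w) {
    padded <- paste0("  ", w, " ")          # two spaces before, one after
    starts <- seq_len(nchar(padded) - 2)
    substring(padded, starts, starts + 2)
  }))
  unique(grams)
}

# Shared trigrams divided by all distinct trigrams (a Jaccard-style ratio)
trgm_similarity <- function(a, b) {
  ta <- trigrams(a)
  tb <- trigrams(b)
  if (length(ta) == 0 || length(tb) == 0) return(0)
  length(intersect(ta, tb)) / length(union(ta, tb))
}

# 4 shared trigrams out of 11 distinct ones
score <- trgm_similarity("word", "two words")
print(score)
```

To score each book against the genre, the function could be applied over a vector of descriptions, for example with sapply(). It is worth comparing a handful of values against Postgres before relying on it.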

Standard Unit Conversion Tables

I've been tasked with looking into some issues regarding errors with unit conversions (Meters to Feet for example). Is there a definitive source for standard conversion numbers? I would like the conversion factors to be the highest precision possible.
I know there are a number of sites out there that offer a plethora of conversions, but what I am looking for is a standard, either official or de facto, to use as a baseline.
A table with 32 and 64 bit numbers would be perfect.
The unit definition file of GNU Units lists a large number of sources:
Most units data was drawn from
NIST Special Publication 811, Guide for the
Use of the International System of Units (SI).
Barry N. Taylor. 1995
CRC Handbook of Chemistry and Physics 70th edition
Oxford English Dictionary
Websters New Universal Unabridged Dictionary
Units of Measure by Stephen Dresner
A Dictionary of English Weights and Measures by Ronald Zupko
British Weights and Measures by Ronald Zupko
Realm of Measure by Isaac Asimov
United States standards of weights and measures, their
creation and creators by Arthur H. Frazier.
French weights and measures before the Revolution: a
dictionary of provincial and local units by Ronald Zupko
Weights and Measures: their ancient origins and their
development in Great Britain up to AD 1855 by FG Skinner
The World of Measurements by H. Arthur Klein
For Good Measure by William Johnstone
NTC's Encyclopedia of International Weights and Measures
by William Johnstone
Sizes by John Lord
Sizesaurus by Stephen Strauss
CODATA Recommended Values of Physical Constants available at
http://physics.nist.gov/cuu/Constants/index.html
How Many? A Dictionary of Units of Measurement. Available at
http://www.unc.edu/~rowlett/units/index.html
Numericana. http://www.numericana.com
UK history of measurement
http://www.ukmetrication.com/history.htm
NIST Handbook 44, Specifications, Tolerances, and
Other Technical Requirements for Weighing and Measuring
Devices. 2011
NIST Special Publication 447, Weights and Measures Standards
of the United States: a brief history. Lewis V. Judson.
1963; rev. 1976
For many units, the file contains the official definition, so it will be as precise as it gets. The comments often explain where that definition came from.
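As a small illustration of why starting from the official definition helps: the international foot is defined as exactly 0.3048 m, so a conversion can store that one exact factor and derive the inverse by division instead of hard-coding a rounded reciprocal. The sketch below is in R, whose numerics are 64-bit doubles; the names `m_per_ft`, `feet_to_meters`, and `meters_to_feet` are hypothetical.

```r
# Exact by definition (international foot)
m_per_ft <- 0.3048

feet_to_meters <- function(ft) ft * m_per_ft
meters_to_feet <- function(m)  m / m_per_ft   # derived, not a rounded constant

# A rounded reciprocal, as often found in ad hoc tables
rounded_ft_per_m <- 3.28084

exact   <- meters_to_feet(1000)
rounded <- 1000 * rounded_ft_per_m
print(exact - rounded)   # the discrepancy introduced by the rounded factor
```

The same pattern (one exact defining factor per unit, everything else derived) can be applied to other units whose definitions are exact.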
