Error when Importing XML Document into R - r

I am trying to import an XML document and convert it to a dataframe in R. Usually the following code works fine:
library(XML)
xmlfile <- xmlTreeParse(file.choose())
topxml <- xmlRoot(xmlfile)
topxml2 <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
psycinfo <- data.frame(t(topxml2), row.names=NULL, stringsAsFactors=FALSE)
However, when I try this I get a data frame with one row and 22570 columns (22570 is the number of rows I actually want, so that each record has its own row with multiple columns).
I've attached a snippet of what my XML data looks like for the first two records, which should be on separate rows.
<records>
<rec resultID="1">
<header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2016-10230-001">
<controlInfo>
<bkinfo>
<btl>Reducing conservativeness of stabilization conditions for switched ts fuzzy systems.</btl>
<aug />
</bkinfo>
<chapinfo />
<revinfo />
<jinfo>
<jtl>Neurocomputing: An International Journal</jtl>
<issn type="Print">09252312</issn>
</jinfo>
<pubinfo>
<dt year="2016" month="02" day="16">20160216</dt>
</pubinfo>
<artinfo>
<ui type="doi">10.1016/j.neucom.2016.01.067</ui>
<tig>
<atl>Reducing conservativeness of stabilization conditions for switched ts fuzzy systems.</atl>
</tig>
<aug>
<au>Jaballi, Ahmed</au>
<au>Hajjaji, Ahmed El</au>
<au>Sakly, Anis</au>
</aug>
<sug>
<subj type="unclass">No terms assigned</subj>
</sug>
<ab>In this paper, less conservative sufficient conditions for the existence of switching laws for stabilizing switched TS fuzzy systems via a fuzzy Lyapunov function (FLF) and estimates the basin of attraction are proposed. The conditions are found by exploring properties of the membership functions and are formulated in terms of linear matrix inequalities (LMIs), which can be solved very efficiently using the convex optimization techniques. Finally, the effectiveness and the reduced conservatism of the proposed results are shown through two numerical examples. (PsycINFO Database Record (c) 2016 APA, all rights reserved)</ab>
<pubtype>Journal</pubtype>
<pubtype>Peer Reviewed Journal</pubtype>
</artinfo>
<language>English</language>
</controlInfo>
<displayInfo>
<pLink>
<url>http://search.ebscohost.com/login.aspx?direct=true&db=psyh&AN=2016-10230-001&site=ehost-live&scope=site</url>
</pLink>
</displayInfo>
</header>
</rec>
<rec resultID="2">
<header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2016-08643-001">
<controlInfo>
<bkinfo>
<btl>Self–other relations in biodiversity conservation in the community: Representational processes and adjustment to new actions.</btl>
<aug />
</bkinfo>
<chapinfo />
<revinfo />
<jinfo>
<jtl>Journal of Community & Applied Social Psychology</jtl>
<issn type="Print">10529284</issn>
<issn type="Electronic">10991298</issn>
</jinfo>
<pubinfo>
<dt year="2016" month="02" day="15">20160215</dt>
</pubinfo>
<artinfo>
<ui type="doi">10.1002/casp.2267</ui>
<tig>
<atl>Self–other relations in biodiversity conservation in the community: Representational processes and adjustment to new actions.</atl>
</tig>
<aug>
<au>Mouro, Carla</au>
<au>Castro, Paula</au>
</aug>
<sug>
<subj type="unclass">No terms assigned</subj>
</sug>
<ab>This research explores the simultaneous role of two Self–Other relations in the elaboration of representations at the micro- and ontogenetic levels, assuming that it can result in acceptance and/or resistance to new laws. Drawing on the Theory of Social Representations, it concretely looks at how individuals elaborate new representations relevant for biodiversity conservation in the context of their relations with their local community (an interactional Other) and with the legal/reified sphere (an institutional Other). This is explored in two studies in Portuguese Natura 2000 sites where a conservation project calls residents to protect an at-risk species. Study 1 shows that (i) agreement with the institutional Other (the laws) and meta-representations of the interactional Other (the community) as approving of conservation independently help explain (at the ontogenetic level) internalisation of conservation goals and willingness to act; (ii) the same meta-representations operating at the micro-genetic level attenuate the negative relation between ambivalence and willingness to act. Study 2 shows that a meta-representation of the interactional Other as showing no clear position regarding conservation increases ambivalence. Findings demonstrate the necessarily social nature of representational processes and the importance of considering them at more than one level for understanding responses to new policy/legal proposals. Copyright © 2016 John Wiley & Sons, Ltd. (PsycINFO Database Record (c) 2016 APA, all rights reserved)</ab>
<pubtype>Journal</pubtype>
<pubtype>Peer Reviewed Journal</pubtype>
</artinfo>
<language>English</language>
</controlInfo>
<displayInfo>
<pLink>
<url>http://search.ebscohost.com/login.aspx?direct=true&db=psyh&AN=2016-08643-001&site=ehost-live&scope=site</url>
</pLink>
</displayInfo>
</header>
</rec>
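A likely reading of the one-row result: each `rec` has a single child (`header`), so applying `xmlValue` one level below the root returns one value per record, and `t()` turns that vector into a single row. The usual way out is to loop over the `rec` nodes and pull named fields from each. Below is a minimal sketch of that one-row-per-record idea, written with Python's standard library purely for illustration rather than in R; the shortened sample and the choice of fields (`title`, `journal`, `authors`) are assumptions, not part of the original export.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the export (assumption for illustration).
# Note that a literal "&" has to be written as "&amp;" in well-formed XML.
SAMPLE = """<records>
<rec resultID="1"><header><controlInfo>
<jinfo><jtl>Neurocomputing: An International Journal</jtl></jinfo>
<artinfo><tig><atl>First title.</atl></tig>
<aug><au>Jaballi, Ahmed</au><au>Sakly, Anis</au></aug></artinfo>
</controlInfo></header></rec>
<rec resultID="2"><header><controlInfo>
<jinfo><jtl>Journal of Community &amp; Applied Social Psychology</jtl></jinfo>
<artinfo><tig><atl>Second title.</atl></tig>
<aug><au>Mouro, Carla</au><au>Castro, Paula</au></aug></artinfo>
</controlInfo></header></rec>
</records>"""

root = ET.fromstring(SAMPLE)

rows = []
for rec in root.iter("rec"):
    # One dictionary per <rec>, so each record becomes one row.
    rows.append({
        "resultID": rec.get("resultID"),
        "title": rec.findtext(".//atl"),
        "journal": rec.findtext(".//jtl"),
        # Repeated elements such as <au> are collapsed into one cell.
        "authors": "; ".join(au.text for au in rec.iter("au")),
    })

for row in rows:
    print(row)
```

The same shape carries over to R: select the `rec` nodes first, then extract each field relative to the record, so that repeated elements such as `au` or `pubtype` do not change the number of columns.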

Related

Matching statutory provisions of two in R

In advance: Sorry for all the Norwegian references, but I hope I've explained my problem well enough for them to still make sense...
So, in 2005 Norway got a new criminal law. The old one was somewhat unstructured (only three chapters), while the statutory provisions in the 2005 version have been structured into 31 chapters, depending on the area of the offense (can be seen here: https://lovdata.no/dokument/NL/lov/2005-05-20-28). I call these "areas of law". For example, in the 2005 version laws regarding sexual offenses are in chapter 26. Logically, then the statutory provisions that belong to this chapter are categorized as belonging to the area of law called "s
Some of the old laws have been structured into the new chapters, some new ones have been added, and some have been repealed. I have what is called a "law mirror": a list where you can find where the old provisions are in the new law, if they haven't been repealed. The new law came into force for offenses committed from the 1st of October 2015.
An example of a law mirror: https://no.wikipedia.org/wiki/Straffeloven_(lovspeil). I've pivoted the list longer, such that it looks like this:
Law mirror: "Seksuallovbrud" means sexual offense, "kap_2005" says which chapter in the 2005 law the statutory provision (Norwegian: "paragraf") falls under, and "straffelov" specifies whether the provision comes from the 2005 or the 1902 version of the law.
The data I have consist of two separate data frames. Df1 is the law mirror. Df2 consists of cases in the Norwegian court of appeals between 1993 and 2019 where the criminal law was the basis of the verdict. I've made a dummy (strl1902) in Df2 for whether the verdict in the case came before or after the new law came into force; it equals 1 if the old law applies. I've also extracted the number of the statutory provision.
On the basis of this I want to categorize the cases using statutory provisions from the old criminal law into the areas of law from the new law.
This is where I need help:
Do any of you have any idea of how I can distinguish between the provisions from the old and the new law, such that I also can make dummies for the provisions from the 1902 law, such that they are separated into the areas of the law of the 2005 law?
Hope this makes sense.
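One way to think about it: a provision number alone is ambiguous, because the same number can exist in both laws, so the lookup key should be the pair (law version, provision number). Below is a minimal sketch of that idea in plain Python, not R. The column names `straffelov`, `paragraf`, `kap_2005` and `strl1902` come from the description above; the provision numbers, chapters and case ids are placeholder values invented for illustration and should be replaced by the real law mirror and case data.

```python
# Law mirror rows as (straffelov, paragraf, kap_2005).
# Placeholder values for illustration only, not real legal data.
law_mirror = [
    ("1902", "192", 26),
    ("2005", "291", 26),
    ("1902", "233", 25),
    ("2005", "275", 25),
]

# Key on BOTH the law version and the provision number.
chapter_of = {(law, par): kap for law, par, kap in law_mirror}

# Cases with the dummy strl1902 (1 = judged under the 1902 law); hypothetical rows.
cases = [
    {"case_id": "A", "strl1902": 1, "paragraf": "192"},
    {"case_id": "B", "strl1902": 0, "paragraf": "291"},
    {"case_id": "C", "strl1902": 1, "paragraf": "999"},
]

def area_of_law(case):
    """Return the 2005 chapter for a case, or None if the mirror has no entry."""
    law = "1902" if case["strl1902"] == 1 else "2005"
    return chapter_of.get((law, case["paragraf"]))

for case in cases:
    case["kap_2005"] = area_of_law(case)
    print(case)
```

In R the equivalent would be a join of Df2 onto Df1 using both columns as keys. Cases that come back without a chapter (like the hypothetical case "C") are the ones whose provision was repealed or is missing from the mirror.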

XPath to extract all text between two 'p' elements scrapy

I am trying to scrape a database using Scrapy and Splash, which requires login so unfortunately, I am unable to share the full website. The database contains a list of companies showing their name and a short description.
I am struggling to find an XPath expression that would yield all the text between the two 'p' tags as shown:
<p class="pre-wrap ng-binding"
ng-bind-html="object._source.startup.general_information.project_public_description"
ng-click="listView.showDetail(object)" role="button" tabindex="0">
<div>With the vision of providing creative sustainable solutions for global food crisis,
AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,
addressing the ever-growing demand for fish protein. Company’s additives improve both growth
performance and feed utilization, enabling the <strong><em>growth of more fish with less
feed</em></strong>. A unique peptide production system, enables large commercial
scale production at significant lower cost and carbon footprint. Growing more fish with less
feed also promote several SDG’s including the reduction of pressure on fish population in
the sea, providing food security and reducing hunger and poverty, climate change and
responsible production. </div>
</p>
All the company descriptions are in the same format (between two 'p' elements), but as shown in the HTML, there are <strong><em> elements as well. I would like help creating an XPath that gets all the text, including the text inside the <strong><em> elements, as one single text block (that is, one description; when viewed on the website there is no separation in the text).
I tried the following, but it only gets the part before the <strong><em> element: //p[@class='pre-wrap ng-binding']//div//text()
I used the following code:
'the descript': ''.join(startup.xpath('//div//text()').getall()),
scrapy shell
In [1]: html = """<html>
...: <body>
...: <p class="pre-wrap ng-binding"
...: ng-bind-html="object._source.startup.general_information.project_public_description"
...: ng-click="listView.showDetail(object)" role="button" tabindex="0">
...: <div>With the vision of providing creative sustainable solutions for global food crisis,
...: AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,
...: addressing the ever-growing demand for fish protein. Company’s additives improve both growth
...: performance and feed utilization, enabling the <strong><em>growth of more fish with less
...: feed</em></strong>. A unique peptide production system, enables large commercial
...: scale production at significant lower cost and carbon footprint. Growing more fish with less
...: feed also promote several SDG’s including the reduction of pressure on fish population in
...: the sea, providing food security and reducing hunger and poverty, climate change and
...: responsible production. </div>
...: </p>
...: </body>
...: </html>"""
In [2]: selector = scrapy.Selector(text=html)
In [3]: ''.join(selector.xpath('//div//text()').getall())
Out[3]: 'With the vision of providing creative sustainable solutions for global food crisis,\n AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,\n addressing the ever-growing demand for fish protein. Company’s additives improve both growth\n performance and feed utilization, enabling the growth of more fish with less\n feed. A unique peptide production system, enables large commercial\n scale production at significant lower cost and carbon footprint. Growing more fish with less\n feed also promote several SDG’s including the reduction of pressure on fish population in\n the sea, providing food security and reducing hunger and poverty, climate change and\n responsible production.\xa0'
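The joined string in Out[3] still carries the source line breaks, the indentation and a trailing non-breaking space (`\xa0`). If a single clean text block is wanted, the whitespace can be collapsed after joining. A minimal sketch using only the standard library; the helper name `collapse_whitespace` and the shortened sample string are assumptions for illustration.

```python
def collapse_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends.

    str.split() with no arguments splits on any Unicode whitespace,
    which includes newlines and the non-breaking space (\xa0).
    """
    return " ".join(text.split())


# Shortened stand-in for the joined output of .getall() (assumption).
raw = (
    "With the vision,\n    AquiNovo develops additives, enabling the "
    "growth of more fish with less\n    feed. Responsible production.\xa0"
)

clean = collapse_whitespace(raw)
print(clean)
```

In the spider this would wrap the existing expression, for example `collapse_whitespace(''.join(startup.xpath('//div//text()').getall()))`.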

How to extract sentences between point and brackets with R?

I have:
Stringa=" This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009). "
Desired output:
[1]This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).
[2]Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)
[3] Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)
[4]Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientific conclusions without further experimentation” (Prensky, 2009)
I use: unlist(str_extract_all(string = Stringa, pattern = "\\. [A-Za-z][^()]+ \\("))
But it doesn't work.
I don't want to extract ‘Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. ‘ and ‘Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. ‘
If there are no abbreviations in the text, you may use
regmatches(Stringa, gregexpr("[^.?!\\s][^.!?]*?\\([^()]*\\)", Stringa, perl=TRUE))
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)"
[2] "Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)"
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)"
[4] "Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009)"
See the regex demo and the R demo.
Details
[^.?!\\s] - any char but ., ?, ! and whitespace
[^.!?]*? - any 0+ chars other than ., ?, ! as few as possible
\([^()]*\) - a (, 0+ chars other than ( and ) and then a ).
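The three parts described above can be checked on a small input. The sketch below applies the same pattern with Python's `re` module, which accepts it unchanged; it is an illustration outside R, and the shortened `text` is a made-up stand-in for `Stringa`, not the original data.

```python
import re

# Same pattern as in the answer above:
#   [^.?!\s]    first char of a sentence: not ., ?, ! or whitespace
#   [^.!?]*?    as few non-terminator chars as possible
#   \([^()]*\)  a parenthesised citation with no nested parentheses
PATTERN = re.compile(r"[^.?!\s][^.!?]*?\([^()]*\)")

# Shortened stand-in for Stringa (assumption for illustration).
text = (
    " Alpha one(Lee,1991).Given the gap. "
    "Some scholars note (Agarwal & Dhar, 2014). "
    "Perhaps models. Instead, they mine (Prensky, 2009). "
)

matches = PATTERN.findall(text)
for m in matches:
    print(m)
```

Sentences without a citation ("Given the gap", "Perhaps models") are skipped because the lazy middle part cannot cross the sentence-ending period before reaching a `(`.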
We can handle this using gregexpr and regmatches, with the following regex pattern:
.*?\([^)]+\).*?(?=\w|$)
This will capture any content up to the first parenthesis, followed by a (...) term. The script below will capture all such matches in the source text.
m <- gregexpr(".*?\\([^)]+\\).*?(?=\\w|$)", x, perl=TRUE)
regmatches(x, m)
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)."
[2] "Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). "
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). "
[4] "Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation”(Prensky, 2009). "

Bayesian networks for text analysis in R

I have a one-page story (i.e. text data), and I need to apply a Bayesian network to that story and analyse it. Could someone tell me whether this is possible in R? If yes, then how should I proceed?
The objective of the analysis is to extract action descriptions from narrative text.
The data considered for analysis -
Krishna’s Dharam-shasthra to Arjuna:
The Gita is the conversation between Krishna and Arjuna leading up to the battle.
Krishna emphasised on two terms: Karma and Dharma. He told Arjun that this was a righteous war; a war of Dharma. Dharma is the way of righteousness or a set of rules and laws laid down. The Kauravas were on the side of Adharma and had broken rules and laws and hence Arjun would have to do his Karma to uphold Dharma.
Arjuna doesn't want to fight. He doesn't understand why he has to shed his family's blood for a kingdom that he doesn't even necessarily want. In his eyes, killing his evil and killing his family is the greatest sin of all. He casts down his weapons and tells Krishna he will not fight. Krishna, then, begins the systematic process of explaining why it is Arjuna's dharmic duty to fight and how he must fight in order to restore his karma.
Krishna first explains the samsaric cycle of birth and death. He says there is no true death of the soul simply a sloughing of the body at the end of each round of birth and death. The purpose of this cycle is to allow a person to work off their karma, accumulated through lifetimes of action. If a person completes action selflessly, in service to God, then they can work off their karma, eventually leading to a dissolution of the soul, the achievement of enlightenment and vijnana, and an end to the samsaric cycle. If they act selfishly, then they keep accumulating debt, putting them further and further into karmic debt.
What I want is a POS (part-of-speech) tagger to separate verbs, nouns etc., and then to create a meaningful network from the result.
The steps that should be followed in pre-processing are:
syntactic processing (POS tagger)
SRL algorithm (semantic role labelling of characters of the story)
coreference resolution
Using all of the above I need to create a knowledge database and create a Bayesian network.
This is what I have tried so far to get a POS tagger working:
txt <- c("As the years went by, they remained isolated in their city. Their numbers increased by freeing women from slavery.
Doom would come to the world in the form of Ares the god of war and the Son of Zeus. Ares was unhappy with the gods as he wanted to prove just how foul his father’s creation was. Hence, he decided to corrupt the mortal men created by Zeus. Fearing his wrath upon the world Zeus decided to create the God killer in order to stop Ares. He then commanded Hippolyta to mould a baby from the sand and clay of the island. Then the five goddesses went back into the Underworld, drawing out the last soul that remained in the Well and giving it incredible powers. The soul was merged with the clay and became flesh. Hippolyta had her daughter and named her Diana, Princess of the Amazons, the first child born on Paradise Island.
Each of the six members of the Greek Pantheon granted Diana a gift: Demeter, great strength; Athena, wisdom and courage; Artemis, a hunter's heart and a communion with animals; Aphrodite, beauty and a loving heart; Hestia, sisterhood with fire; Hermes, speed and the power of flight. Diana was also gifted with a sword, the Lasso of truth and the bracelets of penance as weapons to defeat Ares.
The time arrived when Diana, protector of the Amazons and mankind was sent to the Man's World to defeat Ares and rid the mortal men off his corruption. Diana believed that only love could truly rid the world of his influence. Diana was successfully able to complete the task she was sent out by defeating Ares and saving the world.
")
writeLines(txt, tf <- tempfile())
library(stringi)
library(cleanNLP)
cnlp_init_tokenizers()
anno <- cnlp_annotate(tf)
names(anno)
get_token(anno)
cnlp_init_spacy()
anno <- cnlp_annotate(tf)
get_token(anno)
cnlp_init_corenlp()

Replicate Postgres pg_trgm text similarity scores in R?

Does anyone know how to replicate the Postgres (pg_trgm) trigram similarity score from the similarity(text, text) function in R? I am using the stringdist package and would rather use R to calculate these on a matrix of text strings in a .csv file than run a bunch of PostgreSQL queries.
Running similarity(string1, string2) in Postgres gives me a score between 0 and 1.
I tried using the stringdist package to get a score, but I think I still need to divide the result of the code below by something.
stringdist(string1, string2, method="qgram",q = 3 )
Is there a way to replicate the pg_trgm score with the stringdist package or another way to do this in R?
An example would be getting the similarity score between the description of a book and the description of a genre like science fiction. For example, suppose I have the following two book descriptions and one genre description:
book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one another, and, above all, avoid being contaminated by Area X itself.
They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."
book 2= "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."
genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."
How can I get a similarity score for the description of each book against the description of the science fiction genre like pg_trgm using an R script?
How about something like this?
library(textcat)
?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.
round(textcat_xdist(
list(
text1="hello there",
text2="why hello there",
text3="totally different"
),
method="cosine"),
3)
# text1 text2 text3
#text1 0.000 0.078 0.731
#text2 0.078 0.000 0.739
#text3 0.731 0.739 0.000
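Note that the cosine distance above is a different measure from the pg_trgm score. As the pg_trgm documentation describes it, each string is lowercased and split into words of alphanumeric characters, every word is padded with two leading spaces and one trailing space, and the similarity is the number of shared distinct trigrams divided by the number of distinct trigrams in either string. The sketch below follows that description in Python rather than R, to show the arithmetic; the function names are made up for this example, and edge cases such as non-ASCII text may behave differently in Postgres.

```python
import re


def trigrams(text: str) -> set:
    """Distinct trigrams of a string, following the pg_trgm description."""
    grams = set()
    # Words are runs of letters and digits; everything else is a separator.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trgm_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 if there are none)."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        return 0.0
    return len(ta & tb) / len(union)


# "word" has 5 trigrams, "two words" has 10, and 4 are shared: 4 / 11.
score = trgm_similarity("word", "two words")
print(round(score, 4))
```

In R, `stringdist(a, b, method = "jaccard", q = 3)` computes a closely related distance on trigram sets, so `1 - stringdist(...)` is in the same spirit; it does not apply the per-word padding, though, so the numbers should not be expected to match Postgres exactly.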
