XPath to extract all text between two 'p' elements (scrapy / web-scraping)

I am trying to scrape a database using Scrapy and Splash. It requires a login, so unfortunately I am unable to share the full website. The database contains a list of companies showing their name and a short description.
I am struggling to find an XPath expression that yields all the text between the two 'p' tags shown here:
<p class="pre-wrap ng-binding"
ng-bind-html="object._source.startup.general_information.project_public_description"
ng-click="listView.showDetail(object)" role="button" tabindex="0">
<div>With the vision of providing creative sustainable solutions for global food crisis,
AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,
addressing the ever-growing demand for fish protein. Company’s additives improve both growth
performance and feed utilization, enabling the <strong><em>growth of more fish with less
feed</em></strong>. A unique peptide production system, enables large commercial
scale production at significant lower cost and carbon footprint. Growing more fish with less
feed also promote several SDG’s including the reduction of pressure on fish population in
the sea, providing food security and reducing hunger and poverty, climate change and
responsible production. </div>
</p>
All the company descriptions are in the same format (between two 'p' tags), but as shown in the HTML there are <strong><em> elements inside as well. I would like help creating an XPath expression that gets all the text, including the text inside the <strong><em> element, as one single block (one description; when viewed on the website there is no separation in the text).
I tried the following, but it only gets the part before the <strong><em> element: //p[@class='pre-wrap ng-binding']//div//text()
I used the following code:
'the descript': ''.join(startup.xpath('//div//text()').getall()),

Demonstration in a scrapy shell:
In [1]: html = """<html>
...: <body>
...: <p class="pre-wrap ng-binding"
...: ng-bind-html="object._source.startup.general_information.project_public_description"
...: ng-click="listView.showDetail(object)" role="button" tabindex="0">
...: <div>With the vision of providing creative sustainable solutions for global food crisis,
...: AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,
...: addressing the ever-growing demand for fish protein. Company’s additives improve both growth
...: performance and feed utilization, enabling the <strong><em>growth of more fish with less
...: feed</em></strong>. A unique peptide production system, enables large commercial
...: scale production at significant lower cost and carbon footprint. Growing more fish with less
...: feed also promote several SDG’s including the reduction of pressure on fish population in
...: the sea, providing food security and reducing hunger and poverty, climate change and
...: responsible production. </div>
...: </p>
...: </body>
...: </html>"""
In [2]: selector = scrapy.Selector(text=html)
In [3]: ''.join(selector.xpath('//div//text()').getall())
Out[3]: 'With the vision of providing creative sustainable solutions for global food crisis,\n AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,\n addressing the ever-growing demand for fish protein. Company’s additives improve both growth\n performance and feed utilization, enabling the growth of more fish with less\n feed. A unique peptide production system, enables large commercial\n scale production at significant lower cost and carbon footprint. Growing more fish with less\n feed also promote several SDG’s including the reduction of pressure on fish population in\n the sea, providing food security and reducing hunger and poverty, climate change and\n responsible production.\xa0'
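For reference, XPath's string() function does the concatenation in one step: string(node-set) returns the full text value of the first matched element, including the text inside <strong><em>. A minimal sketch continuing the session above (assuming the class attribute is exactly 'pre-wrap ng-binding'; use contains(@class, 'pre-wrap') if the element can carry extra classes):
In [4]: selector.xpath('string(//p[@class="pre-wrap ng-binding"]/div)').get()
This returns the same single string as Out[3]; wrapping the expression in normalize-space(...) would additionally collapse the embedded newlines and runs of spaces.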

Related

Bayesian networks for text analysis in R

I have a one-page story (i.e. text data) on which I need to use a Bayesian network and analyse it. Could someone tell me whether this is possible in R? If yes, how do I proceed?
The objective of the analysis is to extract action descriptions from narrative text.
The data considered for analysis:
Krishna’s Dharam-shasthra to Arjuna:
The Gita is the conversation between Krishna and Arjuna leading up to the battle.
Krishna emphasised on two terms: Karma and Dharma. He told Arjun that this was a righteous war; a war of Dharma. Dharma is the way of righteousness or a set of rules and laws laid down. The Kauravas were on the side of Adharma and had broken rules and laws and hence Arjun would have to do his Karma to uphold Dharma.
Arjuna doesn't want to fight. He doesn't understand why he has to shed his family's blood for a kingdom that he doesn't even necessarily want. In his eyes, killing his evil and killing his family is the greatest sin of all. He casts down his weapons and tells Krishna he will not fight. Krishna, then, begins the systematic process of explaining why it is Arjuna's dharmic duty to fight and how he must fight in order to restore his karma.
Krishna first explains the samsaric cycle of birth and death. He says there is no true death of the soul simply a sloughing of the body at the end of each round of birth and death. The purpose of this cycle is to allow a person to work off their karma, accumulated through lifetimes of action. If a person completes action selflessly, in service to God, then they can work off their karma, eventually leading to a dissolution of the soul, the achievement of enlightenment and vijnana, and an end to the samsaric cycle. If they act selfishly, then they keep accumulating debt, putting them further and further into karmic debt.
What I want is a POS tagger to separate the verbs, nouns, etc., and then to create a meaningful network from them.
The steps that should be followed in pre-processing are:
syntactic processing (POS tagging)
SRL algorithm (semantic role labelling of the characters of the story)
coreference resolution
Using all of the above I need to create a knowledge database and then build a Bayesian network.
This is what I have tried so far to get the POS tags:
txt <- c("As the years went by, they remained isolated in their city. Their numbers increased by freeing women from slavery.
Doom would come to the world in the form of Ares the god of war and the Son of Zeus. Ares was unhappy with the gods as he wanted to prove just how foul his father’s creation was. Hence, he decided to corrupt the mortal men created by Zeus. Fearing his wrath upon the world Zeus decided to create the God killer in order to stop Ares. He then commanded Hippolyta to mould a baby from the sand and clay of the island. Then the five goddesses went back into the Underworld, drawing out the last soul that remained in the Well and giving it incredible powers. The soul was merged with the clay and became flesh. Hippolyta had her daughter and named her Diana, Princess of the Amazons, the first child born on Paradise Island.
Each of the six members of the Greek Pantheon granted Diana a gift: Demeter, great strength; Athena, wisdom and courage; Artemis, a hunter's heart and a communion with animals; Aphrodite, beauty and a loving heart; Hestia, sisterhood with fire; Hermes, speed and the power of flight. Diana was also gifted with a sword, the Lasso of truth and the bracelets of penance as weapons to defeat Ares.
The time arrived when Diana, protector of the Amazons and mankind was sent to the Man's World to defeat Ares and rid the mortal men off his corruption. Diana believed that only love could truly rid the world of his influence. Diana was successfully able to complete the task she was sent out by defeating Ares and saving the world.
")
writeLines(txt, tf <- tempfile())
library(stringi)
library(cleanNLP)
# tokenizer-only backend: splits sentences and tokens but assigns no POS tags
cnlp_init_tokenizers()
anno <- cnlp_annotate(tf)
names(anno)
get_token(anno)
# spaCy backend (requires Python with spaCy): the token table now includes POS tags
cnlp_init_spacy()
anno <- cnlp_annotate(tf)
get_token(anno)
# CoreNLP backend (requires Java and the CoreNLP models)
cnlp_init_corenlp()
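For the POS-tagging step alone, the udpipe package avoids the Python/Java backends entirely. A minimal sketch of that alternative technique, assuming udpipe is installed and the English model can be downloaded:
library(udpipe)
model <- udpipe_download_model(language = "english")  # fetch a pre-trained English model
ud <- udpipe_load_model(model$file_model)
anno <- as.data.frame(udpipe_annotate(ud, x = txt))
# keep verbs and nouns as candidate nodes for the knowledge database
subset(anno, upos %in% c("VERB", "NOUN"), select = c(token, lemma, upos))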

Find & replace text not already inside an <A> tag - RegEx .Net

I am working with XML data in .NET from the Federal Register, which contains many references to Executive Orders and chapters from the U.S. Code.
I'd like to hyperlink these references, unless they're already inside an <a> tag (which is determined by the XML, and often links within the document itself).
The pattern I've written matches and deletes the leading and trailing characters instead of displaying them, even when I include the boundary character in the replacement string:
[?!]([0-9]{1,2})[ ]{0,1}(U\.S\.C\.|USC)[\s]{0,1}([0-9]{1,5})(\b)[^]
An example of the initial XML:
<p>The Regulatory Flexibility Act of 1980 (RFA), 5 U.S.C. 604(b), as amended, requires Federal agencies to consider the potential impact of regulations on small entities during rulemaking.</p>
<p>Small entities include small businesses, small not-for-profit organizations, and small governmental jurisdictions.</p>
<p>Section 605 of the RFA allows an agency to certify a rule, in lieu of preparing an analysis, if the rulemaking is not expected to have a significant economic impact on a substantial number of small entities. Reference: 13 USC 401</p>
<ul>
<li><em>Related laws from 14USC301-345 do not apply.</em></li>
<li>14 USC 301 does apply.</li>
</ul>
As you can see, some references include ranges of U.S. Code sections (e.g. 14 USC 301-345) or references to specific subsections (e.g. 5 U.S.C. 604(b) ). I'd only want to link to the first reference in the range, so the link should terminate at the - or the (.
If I'm understanding you correctly, I think the following should work.
var re = new Regex(@"\d{1,2}\s?U\.?S\.?C\.?\s?\d{1,5}\b(?!</a>)");
var matches = re.Matches(text);
// matches[0].Value = 5 U.S.C. 604
// matches[1].Value = 14USC301
You might even be able to simplify the regex to \d+\s?U\.?S\.?C\.?\s?\d+\b(?!</a>) – I'm not sure if the upper limits of 2 and 5 are significant.
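To perform the replacement without disturbing the surrounding text, a MatchEvaluator can wrap each match in a link. A minimal sketch, where xml holds the document text and uscode.example.com is a placeholder URL of my own, not a real endpoint:
var re = new Regex(@"\d{1,2}\s?U\.?S\.?C\.?\s?\d{1,5}\b(?!</a>)");
// wrap each unlinked citation; citations already followed by </a> fail the lookahead and are skipped
var linked = re.Replace(xml, m =>
    $"<a href=\"https://uscode.example.com/?q={Uri.EscapeDataString(m.Value)}\">{m.Value}</a>");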

Error when Importing XML Document into R

I am trying to import an XML document and convert it to a dataframe in R. Usually the following code works fine:
xmlfile <- xmlTreeParse(file.choose())
topxml <- xmlRoot(xmlfile)
topxml2 <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
psycinfo <- data.frame(t(topxml2), row.names = NULL, stringsAsFactors = FALSE)
However, when I try this I get a dataframe with one row and 22570 columns (22570 is the number of rows I actually want, so that each record gets its own row with multiple columns).
I've attached a snippet of what my XML data looks like for the first two records, which should be on separate rows.
<records>
<rec resultID="1">
<header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2016-10230-001">
<controlInfo>
<bkinfo>
<btl>Reducing conservativeness of stabilization conditions for switched ts fuzzy systems.</btl>
<aug />
</bkinfo>
<chapinfo />
<revinfo />
<jinfo>
<jtl>Neurocomputing: An International Journal</jtl>
<issn type="Print">09252312</issn>
</jinfo>
<pubinfo>
<dt year="2016" month="02" day="16">20160216</dt>
</pubinfo>
<artinfo>
<ui type="doi">10.1016/j.neucom.2016.01.067</ui>
<tig>
<atl>Reducing conservativeness of stabilization conditions for switched ts fuzzy systems.</atl>
</tig>
<aug>
<au>Jaballi, Ahmed</au>
<au>Hajjaji, Ahmed El</au>
<au>Sakly, Anis</au>
</aug>
<sug>
<subj type="unclass">No terms assigned</subj>
</sug>
<ab>In this paper, less conservative sufficient conditions for the existence of switching laws for stabilizing switched TS fuzzy systems via a fuzzy Lyapunov function (FLF) and estimates the basin of attraction are proposed. The conditions are found by exploring properties of the membership functions and are formulated in terms of linear matrix inequalities (LMIs), which can be solved very efficiently using the convex optimization techniques. Finally, the effectiveness and the reduced conservatism of the proposed results are shown through two numerical examples. (PsycINFO Database Record (c) 2016 APA, all rights reserved)</ab>
<pubtype>Journal</pubtype>
<pubtype>Peer Reviewed Journal</pubtype>
</artinfo>
<language>English</language>
</controlInfo>
<displayInfo>
<pLink>
<url>http://search.ebscohost.com/login.aspx?direct=true&db=psyh&AN=2016-10230-001&site=ehost-live&scope=site</url>
</pLink>
</displayInfo>
</header>
</rec>
<rec resultID="2">
<header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2016-08643-001">
<controlInfo>
<bkinfo>
<btl>Self–other relations in biodiversity conservation in the community: Representational processes and adjustment to new actions.</btl>
<aug />
</bkinfo>
<chapinfo />
<revinfo />
<jinfo>
<jtl>Journal of Community & Applied Social Psychology</jtl>
<issn type="Print">10529284</issn>
<issn type="Electronic">10991298</issn>
</jinfo>
<pubinfo>
<dt year="2016" month="02" day="15">20160215</dt>
</pubinfo>
<artinfo>
<ui type="doi">10.1002/casp.2267</ui>
<tig>
<atl>Self–other relations in biodiversity conservation in the community: Representational processes and adjustment to new actions.</atl>
</tig>
<aug>
<au>Mouro, Carla</au>
<au>Castro, Paula</au>
</aug>
<sug>
<subj type="unclass">No terms assigned</subj>
</sug>
<ab>This research explores the simultaneous role of two Self–Other relations in the elaboration of representations at the micro†and ontogenetic levels, assuming that it can result in acceptance and/or resistance to new laws. Drawing on the Theory of Social Representations, it concretely looks at how individuals elaborate new representations relevant for biodiversity conservation in the context of their relations with their local community (an interactional Other) and with the legal/reified sphere (an institutional Other). This is explored in two studies in Portuguese Natura 2000 sites where a conservation project calls residents to protect an atâ€risk species. Study 1 shows that (i) agreement with the institutional Other (the laws) and metaâ€representations of the interactional Other (the community) as approving of conservation independently help explain (at the ontogenetic level) internalisation of conservation goals and willingness to act; (ii) the same metaâ€representations operating at the microâ€genetic level attenuate the negative relation between ambivalence and willingness to act. Study 2 shows that a metaâ€representation of the interactional Other as showing no clear position regarding conservation increases ambivalence. Findings demonstrate the necessarily social nature of representational processes and the importance of considering them at more than one level for understanding responses to new policy/legal proposals. Copyright © 2016 John Wiley & Sons, Ltd. (PsycINFO Database Record (c) 2016 APA, all rights reserved)</ab>
<pubtype>Journal</pubtype>
<pubtype>Peer Reviewed Journal</pubtype>
</artinfo>
<language>English</language>
</controlInfo>
<displayInfo>
<pLink>
<url>http://search.ebscohost.com/login.aspx?direct=true&db=psyh&AN=2016-08643-001&site=ehost-live&scope=site</url>
</pLink>
</displayInfo>
</header>
</rec>
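A sketch of a per-record approach with the xml2 package, assuming the snippet above sits under a single <records> root (psycinfo.xml and the chosen fields are placeholders): select one node per <rec>, then build each column with an explicit XPath instead of applying xmlSApply to the root.
library(xml2)
doc  <- read_xml("psycinfo.xml")            # placeholder path
recs <- xml_find_all(doc, "//rec")          # one node per record, i.e. one row each
psycinfo <- data.frame(
  resultID = xml_attr(recs, "resultID"),
  title    = xml_text(xml_find_first(recs, ".//artinfo/tig/atl")),
  journal  = xml_text(xml_find_first(recs, ".//jinfo/jtl")),
  doi      = xml_text(xml_find_first(recs, ".//ui[@type='doi']")),
  abstract = xml_text(xml_find_first(recs, ".//ab")),
  stringsAsFactors = FALSE
)
xml_find_first() returns one (possibly empty) match per record, so a missing field becomes NA instead of shifting rows.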

How to search a word in a dictionary with a document in R?

I have created a dictionary of words. Now I need to check whether each word in the dictionary is present in a document. A sample of the document is given below:
Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5–1.5 cm) elsewhere in the body.
There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure. Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter. The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location.
From this document, I have split each paragraph into sentences as follows:
[1] "Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5–1.5 cm) elsewhere in the body."
[2] "There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure."
[3] "Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter."
[4] "The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location."
The dictionary includes the following words:
Laparoscopic surgery
minimally invasive surgery
bandaid surgery
keyhole surgery
surgical technique
small incisions
fiber optic cable system
Now I want to match all the words in the dictionary against each sentence using R. The code that I have worked out so far is given below.
c <- "Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5–1.5 cm) elsewhere in the body.
There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure. Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter. The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location."
library(tm)
library(openNLP)
convert_text_to_sentences <- function(text, lang = "en") {
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)
  text <- as.String(text)
  sentence.boundaries <- annotate(text, sentence_token_annotator)
  sentences <- text[sentence.boundaries]
  return(sentences)
}
q <- convert_text_to_sentences(c)
Assuming q is a character vector (or list) of the sentences and you're interested in exact matches of the keywords only, then you can use regular expressions:
matches = lapply(q, function(x) dict[sapply(dict, grepl, x, ignore.case=T)])
You get a list of the same length as q. Each list element contains a vector of the dictionary words found in the corresponding sentence.
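For example, with the dictionary above stored in dict (output shown for the first sentence, which contains every entry except "fiber optic cable system"):
dict <- c("Laparoscopic surgery", "minimally invasive surgery", "bandaid surgery",
          "keyhole surgery", "surgical technique", "small incisions",
          "fiber optic cable system")
matches <- lapply(q, function(x) dict[sapply(dict, grepl, x, ignore.case = TRUE)])
matches[[1]]
# [1] "Laparoscopic surgery"       "minimally invasive surgery" "bandaid surgery"
# [4] "keyhole surgery"            "surgical technique"         "small incisions"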

Replicate Postgres pg_trgm text similarity scores in R?

Does anyone know how to replicate the Postgres (pg_trgm) trigram similarity score from the similarity(text, text) function in R? I am using the stringdist package and would rather use R to calculate these scores on a matrix of text strings from a .csv file than run a bunch of PostgreSQL queries.
Running similarity(string1, string2) in Postgres gives me a score between 0 and 1.
I tried using the stringdist package to get a score, but I think I still need to divide the result of the code below by something.
stringdist(string1, string2, method = "qgram", q = 3)
Is there a way to replicate the pg_trgm score with the stringdist package or another way to do this in R?
An example would be getting the similarity score between the description of a book and the description of a genre such as science fiction. For example, suppose I have two book descriptions and want to score each against a genre description:
book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one another, and, above all, avoid being contaminated by Area X itself.
They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."
book 2= "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."
genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."
How can I get a pg_trgm-like similarity score for each book description against the science fiction genre description using an R script?
How about something like this?
library(textcat)
?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.
round(textcat_xdist(
  list(
    text1 = "hello there",
    text2 = "why hello there",
    text3 = "totally different"
  ),
  method = "cosine"),
  3)
#       text1 text2 text3
# text1 0.000 0.078 0.731
# text2 0.078 0.000 0.739
# text3 0.731 0.739 0.000
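If you specifically want pg_trgm's 0-to-1 scale rather than a distance, a rough approximation is Jaccard similarity over trigram sets via stringdist. This is only a sketch: pg_trgm also pads each word with blanks before extracting trigrams, so the numbers will be close to, but not exactly, what Postgres reports.
library(stringdist)
trgm_similarity <- function(a, b) {
  # pg_trgm lower-cases its input, so do the same before comparing trigram sets
  1 - stringdist(tolower(a), tolower(b), method = "jaccard", q = 3)
}
trgm_similarity(book1, genre1)  # book1/genre1 hold the descriptions above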
