In XQuery, how would I go about keeping track of certain variables, and in this case, return the average of all such variables?
Here is a small excerpt of the database I'm experimenting with.
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
</book>
</catalog>
Now, what I am trying to do, or rather trying to understand, is how to obtain the average price of all three books combined.
What I've tried looks like:
for $eachElement in catalog/book
where $eachElement/price > 0
return avg($eachElement/price)
What this does is return each price individually. As you can probably guess, this isn't what I want.
Could anyone explain to me how I could get the average price of all three books, and how XQuery completes this process?
You did use a FLWOR expression, i.e. you are iterating over a sequence. This sequence is generated by catalog/book, which holds all the book elements. When you put this into a for expression, each book is processed separately. Thus, you compute the average of exactly one element in each step, which, of course, will always be the element itself.
Instead, you should pass the whole sequence to avg. You can filter out unwanted values using a predicate []:
avg(catalog/book[price > 0]/price)
With the for expression you are indeed 'iterating' through all items individually. The function avg(..) actually accepts a sequence of elements, so what you should do is select all the price elements you want to average and pass them directly to the avg(..) function:
avg(catalog/book/price)
In information retrieval, the inverted index has entries which are the words of the corpus, and each word has a postings list, which is the list of documents it appears in.
If stemming is applied, the index entry would be a stem, so multiple words may map to the same entry if they share the same stem. For example:
Without stemming:
(slowing) --> [D1, D5, D9,...]
(slower) --> [D9, D10, D20,...]
(slow) --> [D2,...]
With stemming:
(slow) --> [D1, D2, D5, D9, D10, D20,...]
I want to avoid stemming, and instead would like to make each entry in my inverted index a bag of words (inflections), such as (slow, slows, slowing, slowed, slower, slowest). For example:
(slow, slows, slowing, slowed, slower, slowest) --> [D1, D2, D5, D9, D10, D20,...]
Would that be possible and feasible or not?
Short Answer:
Simply avoid stemming if you don't want slow and slows to be considered a match.
Long Answer:
Question:
I want to avoid stemming, and instead would like to make each entry in my inverted index a bag of words (inflections) such as (slow, slows, slowing, slowed, slower, slowest).
Let me try to clear some confusion that you have about inverted lists. It is the documents that are stored in the postings for each term (not the terms themselves).
The words are typically stored in an in-memory dictionary (implemented with a hash table or a trie) containing pointers to the postings (the list of documents which contain that particular term), which are stored on secondary storage and loaded on the fly.
A simple example (without showing document weights):
(information) --> [D1, D5, D9,...]
(informative) --> [D9, D10, D20,...]
(retrieval) --> [D1, D9, D17,...]
..
So, if you don't want to apply stemming, that's fine... In fact, the above example shows an unstemmed index, where the words information and informative appear in their non-conflated forms. In a conflated term index (with a stemmer or a lemmatizer), you would substitute the different forms with an equivalent representation (say inform). In that case, the index will be:
(inform) --> [D1, D5, D9, D10, D20,...] --- union of the different forms
(retrieval) --> [D1, D9, D17,...]
..
So, this conflated representation matches all possible forms of the word information, e.g. informative, informational etc.
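To make the dictionary-plus-postings picture concrete, here is a minimal Python sketch (the document texts and the crude_stem helper are invented purely for illustration; a real system would use a proper stemmer and keep postings on disk):
from collections import defaultdict

# Toy documents, invented for illustration.
docs = {
    "D1": "information retrieval systems store information",
    "D5": "information access",
    "D9": "informative experiments with information retrieval",
    "D10": "an informative overview",
}

# Unstemmed index: every surface form keeps its own postings list.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["information"]))   # documents containing the exact form "information"
print(sorted(index["informative"]))   # documents containing the exact form "informative"

# Conflated index: map each surface form to a common key (a crude, made-up
# "stemmer" here) and take the union of the postings.
def crude_stem(term):
    for suffix in ("ative", "ation"):
        if term.endswith(suffix):
            return term[:-len(suffix)]
    return term

conflated = defaultdict(set)
for term, postings in index.items():
    conflated[crude_stem(term)] |= postings

print(sorted(conflated["inform"]))    # union of the postings of "information" and "informative"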
Longer Answer:
Now let's say you want to achieve the best of both worlds, i.e. a representation which allows this conflation to be done in a user-controlled way, e.g. wrapping a word in quotes to require an exact match ("slow" vs. slow in the query), or some indicator to include synonyms of a query term for semantic search (e.g. syn(slow) to include synonyms of the word slow).
For this, you need to maintain separate postings for the non-conflated words, plus additional pointers indicating equivalence between sets of related terms (stem relation, synonym relation, semantic relation, etc.).
Coming back to our example, you would have something like:
(E1)-->(information) --> [D1, D5, D9,...]
|---->(informative) --> [D9, D10, D20,...]
|---->(data) --> [D20, D23, D25,...]
(E2)-->(retrieval) --> [D1, D9, D17,...]
|---->(search) --> [D20, D30, D31,...]
..
Here, I have shown two examples of equivalence classes (concept representations) of two sets of terms information, data... and retrieval, search....
Depending on the query syntax, it would then be possible at retrieval time to support either exact search or relaxed search (based on inflections/synonyms, etc.).
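As a rough illustration of this layout, here is a small Python sketch (the postings, the E1/E2 classes and the lookup helper are hypothetical; a real engine would derive the exact/relaxed choice from the query syntax, e.g. "slow" vs. syn(slow)):
# Separate postings per surface form, plus an equivalence map from each
# term to its class; class membership drives "relaxed" lookups.
postings = {
    "information": {"D1", "D5", "D9"},
    "informative": {"D9", "D10", "D20"},
    "data":        {"D20", "D23", "D25"},
    "retrieval":   {"D1", "D9", "D17"},
    "search":      {"D20", "D30", "D31"},
}

equivalence = {  # E1 and E2 from the example above
    "E1": {"information", "informative", "data"},
    "E2": {"retrieval", "search"},
}
term_to_class = {t: c for c, terms in equivalence.items() for t in terms}

def lookup(term, exact=True):
    """Exact lookup returns the term's own postings; relaxed lookup
    takes the union over the term's whole equivalence class."""
    if exact or term not in term_to_class:
        return postings.get(term, set())
    members = equivalence[term_to_class[term]]
    return set().union(*(postings.get(t, set()) for t in members))

print(sorted(lookup("information")))               # exact match only
print(sorted(lookup("information", exact=False)))  # relaxed: union over class E1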
I have a one-page story (i.e. text data); I need to apply a Bayesian network to that story and analyse it. Could someone tell me whether this is possible in R? If yes, how should I proceed?
The objective of the analysis is to extract action descriptions from narrative text.
The data considered for analysis:
Krishna’s Dharam-shasthra to Arjuna:
The Gita is the conversation between Krishna and Arjuna leading up to the battle.
Krishna emphasised on two terms: Karma and Dharma. He told Arjun that this was a righteous war; a war of Dharma. Dharma is the way of righteousness or a set of rules and laws laid down. The Kauravas were on the side of Adharma and had broken rules and laws and hence Arjun would have to do his Karma to uphold Dharma.
Arjuna doesn't want to fight. He doesn't understand why he has to shed his family's blood for a kingdom that he doesn't even necessarily want. In his eyes, killing his evil and killing his family is the greatest sin of all. He casts down his weapons and tells Krishna he will not fight. Krishna, then, begins the systematic process of explaining why it is Arjuna's dharmic duty to fight and how he must fight in order to restore his karma.
Krishna first explains the samsaric cycle of birth and death. He says there is no true death of the soul simply a sloughing of the body at the end of each round of birth and death. The purpose of this cycle is to allow a person to work off their karma, accumulated through lifetimes of action. If a person completes action selflessly, in service to God, then they can work off their karma, eventually leading to a dissolution of the soul, the achievement of enlightenment and vijnana, and an end to the samsaric cycle. If they act selfishly, then they keep accumulating debt, putting them further and further into karmic debt.
What I want is a POS tagger to separate verbs, nouns, etc., and then to create a meaningful network using them.
The steps that should be followed in pre-processing are:
syntactic processing (POS tagger)
SRL algorithm (semantic role labelling of the characters of the story)
coreference resolution
Using all of the above I need to create a knowledge database and create a Bayesian network.
This is what I have tried so far to get the POS tagger:
txt <- c("As the years went by, they remained isolated in their city. Their numbers increased by freeing women from slavery.
Doom would come to the world in the form of Ares the god of war and the Son of Zeus. Ares was unhappy with the gods as he wanted to prove just how foul his father’s creation was. Hence, he decided to corrupt the mortal men created by Zeus. Fearing his wrath upon the world Zeus decided to create the God killer in order to stop Ares. He then commanded Hippolyta to mould a baby from the sand and clay of the island. Then the five goddesses went back into the Underworld, drawing out the last soul that remained in the Well and giving it incredible powers. The soul was merged with the clay and became flesh. Hippolyta had her daughter and named her Diana, Princess of the Amazons, the first child born on Paradise Island.
Each of the six members of the Greek Pantheon granted Diana a gift: Demeter, great strength; Athena, wisdom and courage; Artemis, a hunter's heart and a communion with animals; Aphrodite, beauty and a loving heart; Hestia, sisterhood with fire; Hermes, speed and the power of flight. Diana was also gifted with a sword, the Lasso of truth and the bracelets of penance as weapons to defeat Ares.
The time arrived when Diana, protector of the Amazons and mankind was sent to the Man's World to defeat Ares and rid the mortal men off his corruption. Diana believed that only love could truly rid the world of his influence. Diana was successfully able to complete the task she was sent out by defeating Ares and saving the world.
")
# write the story to a temporary file and keep its path in tf
writeLines(txt, tf <- tempfile())
library(stringi)
library(cleanNLP)
# tokenizer-only backend: splits the text into tokens but does not add POS tags
cnlp_init_tokenizers()
anno <- cnlp_annotate(tf)
names(anno)
get_token(anno)
# spaCy backend (requires a Python spaCy installation): adds POS tags etc.
cnlp_init_spacy()
anno <- cnlp_annotate(tf)
get_token(anno)
# CoreNLP backend as a further alternative
cnlp_init_corenlp()
I am trying to import an XML document and convert it to a dataframe in R. Usually the following code works fine:
library(XML)
xmlfile <- xmlTreeParse(file.choose())
topxml <- xmlRoot(xmlfile)
topxml2 <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
psycinfo <- data.frame(t(topxml2), row.names=NULL, stringsAsFactors=FALSE)
However, when I try this I get a dataframe with one row and 22570 columns (22570 is the number of rows I ideally want, so that each record has its own row with multiple columns).
I've attached a snippet of what my XML data looks like for the first two records, which should be on separate rows.
<records>
<rec resultID="1">
<header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2016-10230-001">
<controlInfo>
<bkinfo>
<btl>Reducing conservativeness of stabilization conditions for switched ts fuzzy systems.</btl>
<aug />
</bkinfo>
<chapinfo />
<revinfo />
<jinfo>
<jtl>Neurocomputing: An International Journal</jtl>
<issn type="Print">09252312</issn>
</jinfo>
<pubinfo>
<dt year="2016" month="02" day="16">20160216</dt>
</pubinfo>
<artinfo>
<ui type="doi">10.1016/j.neucom.2016.01.067</ui>
<tig>
<atl>Reducing conservativeness of stabilization conditions for switched ts fuzzy systems.</atl>
</tig>
<aug>
<au>Jaballi, Ahmed</au>
<au>Hajjaji, Ahmed El</au>
<au>Sakly, Anis</au>
</aug>
<sug>
<subj type="unclass">No terms assigned</subj>
</sug>
<ab>In this paper, less conservative sufficient conditions for the existence of switching laws for stabilizing switched TS fuzzy systems via a fuzzy Lyapunov function (FLF) and estimates the basin of attraction are proposed. The conditions are found by exploring properties of the membership functions and are formulated in terms of linear matrix inequalities (LMIs), which can be solved very efficiently using the convex optimization techniques. Finally, the effectiveness and the reduced conservatism of the proposed results are shown through two numerical examples. (PsycINFO Database Record (c) 2016 APA, all rights reserved)</ab>
<pubtype>Journal</pubtype>
<pubtype>Peer Reviewed Journal</pubtype>
</artinfo>
<language>English</language>
</controlInfo>
<displayInfo>
<pLink>
<url>http://search.ebscohost.com/login.aspx?direct=true&amp;db=psyh&amp;AN=2016-10230-001&amp;site=ehost-live&amp;scope=site</url>
</pLink>
</displayInfo>
</header>
</rec>
<rec resultID="2">
<header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2016-08643-001">
<controlInfo>
<bkinfo>
<btl>Self–other relations in biodiversity conservation in the community: Representational processes and adjustment to new actions.</btl>
<aug />
</bkinfo>
<chapinfo />
<revinfo />
<jinfo>
<jtl>Journal of Community & Applied Social Psychology</jtl>
<issn type="Print">10529284</issn>
<issn type="Electronic">10991298</issn>
</jinfo>
<pubinfo>
<dt year="2016" month="02" day="15">20160215</dt>
</pubinfo>
<artinfo>
<ui type="doi">10.1002/casp.2267</ui>
<tig>
<atl>Self–other relations in biodiversity conservation in the community: Representational processes and adjustment to new actions.</atl>
</tig>
<aug>
<au>Mouro, Carla</au>
<au>Castro, Paula</au>
</aug>
<sug>
<subj type="unclass">No terms assigned</subj>
</sug>
<ab>This research explores the simultaneous role of two Self–Other relations in the elaboration of representations at the micro- and ontogenetic levels, assuming that it can result in acceptance and/or resistance to new laws. Drawing on the Theory of Social Representations, it concretely looks at how individuals elaborate new representations relevant for biodiversity conservation in the context of their relations with their local community (an interactional Other) and with the legal/reified sphere (an institutional Other). This is explored in two studies in Portuguese Natura 2000 sites where a conservation project calls residents to protect an at-risk species. Study 1 shows that (i) agreement with the institutional Other (the laws) and meta-representations of the interactional Other (the community) as approving of conservation independently help explain (at the ontogenetic level) internalisation of conservation goals and willingness to act; (ii) the same meta-representations operating at the micro-genetic level attenuate the negative relation between ambivalence and willingness to act. Study 2 shows that a meta-representation of the interactional Other as showing no clear position regarding conservation increases ambivalence. Findings demonstrate the necessarily social nature of representational processes and the importance of considering them at more than one level for understanding responses to new policy/legal proposals. Copyright © 2016 John Wiley & Sons, Ltd. (PsycINFO Database Record (c) 2016 APA, all rights reserved)</ab>
<pubtype>Journal</pubtype>
<pubtype>Peer Reviewed Journal</pubtype>
</artinfo>
<language>English</language>
</controlInfo>
<displayInfo>
<pLink>
<url>http://search.ebscohost.com/login.aspx?direct=true&amp;db=psyh&amp;AN=2016-08643-001&amp;site=ehost-live&amp;scope=site</url>
</pLink>
</displayInfo>
</header>
</rec>
I know that I can represent any relation as an RDF triple, as in:
Barack Obama -> president of -> USA
(I am aware that this is not RDF, I am just illustrating)
But how do I add additional information about this relation, for example the time dimension? I mean, he is in his second presidential period, and any period lasts only for a certain span of time. And what about before and after his presidential periods?
There are several options to do this. I'll illustrate some of the more popular ones.
Named Graphs / Quads
In RDF, named graphs are subsets of an RDF dataset that are assigned a specific identifier (the "graph name"). In most RDF databases, this is implemented by adding a fourth element to the RDF triple, turning it from a triple into a "quad" (sometimes it's also called the 'context' of the triple).
You can use this mechanism to express information about a certain collection of statements. For example (using pseudo N-Quads syntax for RDF):
:i1 a :TimePeriod .
:i1 :begin "2009-01-20T00:00:00Z"^^xsd:dateTime .
:i1 :end "2017-01-20T00:00:00Z"^^xsd:dateTime .
:barackObama :presidentOf :USA :i1 .
Notice the fourth element in the last statement: it links the statement "Barack Obama is president of the USA" to the named graph identified by :i1.
The named graphs approach is particularly useful in situations where you have data to express about several statements at once. It is of course also possible to use it for data about individual statements (as the above example illustrates), though it may quickly become cumbersome if used in that fashion (every distinct time period will need its own named graph).
Representing the relation as an object
An alternative approach is to model the relation itself as an object. The relation between "Barack Obama" and "USA" is not just that one is the president of the other, but that one is president of the other between certain dates. To express this in RDF (as Joshua Taylor also illustrated in his comment):
:barackObama :hasRole :president_44 .
:president_44 a :Presidency ;
:of :USA ;
:begin "2009-01-20T00:00:00Z"^^xsd:dateTime ;
:end "2017-01-20T00:00:00Z"^^xsd:dateTime .
The relation itself has now become an object (an instance of the "Presidency" class, with identifier :president_44).
Compared to using named graphs, this approach is much more tailored to asserting data about individual statements. A possible downside is that it becomes a bit more complex to query the relation in SPARQL.
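To make that extra hop concrete, here is a hedged sketch using Python's rdflib (the http://example.org/ namespace is an assumption, since the examples above use a bare default prefix):
from rdflib import Graph

# The relation-as-object triples from above, under an assumed example.org prefix.
data = """
@prefix :    <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:barackObama :hasRole :president_44 .
:president_44 a :Presidency ;
    :of :USA ;
    :begin "2009-01-20T00:00:00Z"^^xsd:dateTime ;
    :end   "2017-01-20T00:00:00Z"^^xsd:dateTime .
"""

g = Graph()
g.parse(data=data, format="turtle")

# One extra join hop compared to a plain :presidentOf triple:
# go through the role node to reach the country and the dates.
query = """
PREFIX : <http://example.org/>
SELECT ?country ?begin ?end WHERE {
    :barackObama :hasRole ?role .
    ?role :of ?country ;
          :begin ?begin ;
          :end ?end .
}
"""
for country, begin, end in g.query(query):
    print(country, begin, end)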
RDF Reification
Not sure this approach actually still counts as "popular", but RDF reification is the historically W3C-sanctioned approach to asserting "statements about statements". In this approach we turn the statement itself into an object:
:obamaPresidency a rdf:Statement ;
rdf:subject :barackObama ;
rdf:predicate :presidentOf ;
rdf:object :USA ;
:trueBetween [
:begin "2009-01-20T00:00:00Z"^^xsd:dateTime ;
:end "2017-01-20T00:00:00Z"^^xsd:dateTime .
] .
There are several good reasons not to use RDF reification in this case, however:
it's conceptually a bit strange. The knowledge that we want to express is about the temporal aspect of the relation, but using RDF reification we are saying something about the statement.
What we have expressed in the above example is: "the statement about Barack Obama being president of the USA is valid between ... and ...". Note that we have not expressed that Barack Obama actually is the president of the USA! You could of course still assert that separately (by just adding the original triple as well as the reified one), but this creates a further duplication/maintenance problem.
It is a pain to use in SPARQL queries.
As Joshua also indicated in his comment, the W3C Note on defining N-ary RDF relations is useful to look at, as it goes into more depth about these (and other) approaches.
RDF*, or RDF-star, allows expressing additional information about an RDF triple, allowing nested structures such as:
<< :BarackObama :presidentOf :USA >> :since :2009
Bit of a contrived example (since the term of presidency could be expressed by simply normalizing the data structure), but it’s quite useful for expressing “external” concerns (such as probability or provenance).
See Olaf’s blog post and, for technical details, https://www.w3.org/2021/12/rdf-star.html.
I’m pretty sure Apache Jena supports it already, not quite sure about other products like Neo4j.
I read about MapReduce at http://en.wikipedia.org/wiki/MapReduce and understood the example of how to get the count of a "word" in many "documents". However, I did not understand the following line:
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.
Can someone elaborate on the difference again (the MapReduce framework vs. the map and reduce combination)? Especially, what does reduce in functional programming do?
Thanks a great deal.
The main difference would be that MapReduce is apparently patentable. (Couldn't help myself, sorry...)
On a more serious note, the MapReduce paper, as I remember it, describes a methodology of performing calculations in a massively parallelised fashion. This methodology builds upon the map / reduce construct which was well known for years before, but goes beyond into such matters as distributing the data etc. Also, some constraints are imposed on the structure of data being operated upon and returned by the functions used in the map-like and reduce-like parts of the computation (the thing about data coming in lists of key/value pairs), so you could say that MapReduce is a massive-parallelism-friendly specialisation of the map & reduce combination.
As for the Wikipedia comment on the function being mapped in the functional programming's map / reduce construct producing one value per input... Well, sure it does, but here there are no constraints at all on the type of said value. In particular, it could be a complex data structure like perhaps a list of things to which you would again apply a map / reduce transformation. Going back to the "counting words" example, you could very well have a function which, for a given portion of text, produces a data structure mapping words to occurrence counts, map that over your documents (or chunks of documents, as the case may be) and reduce the results.
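As a tiny, self-contained Python sketch of that idea (the documents are invented): the mapped function is free to return a complex value per document, and reduce then folds those values into a single one.
from collections import Counter
from functools import reduce

# Hypothetical document chunks.
documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# map: each document becomes a complex value (a word -> count mapping) ...
per_doc_counts = map(lambda doc: Counter(doc.split()), documents)

# ... reduce: combine all of those mappings into one value.
total_counts = reduce(lambda a, b: a + b, per_doc_counts, Counter())

print(total_counts)  # e.g. Counter({'the': 3, 'quick': 2, 'dog': 2, ...})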
In fact, that's exactly what happens in this article by Phil Hagelberg. It's a fun and supremely short example of a MapReduce-word-counting-like computation implemented in Clojure with map and something equivalent to reduce (the (apply + (merge-with ...)) bit -- merge-with is implemented in terms of reduce in clojure.core). The only difference between this and the Wikipedia example is that the objects being counted are URLs instead of arbitrary words -- other than that, you've got a counting words algorithm implemented with map and reduce, MapReduce-style, right there. The reason why it might not fully qualify as being an instance of MapReduce is that there's no complex distribution of workloads involved. It's all happening on a single box... albeit on all the CPUs the box provides.
For in-depth treatment of the reduce function -- also known as fold -- see Graham Hutton's A tutorial on the universality and expressiveness of fold. It's Haskell based, but should be readable even if you don't know the language, as long as you're willing to look up a Haskell thing or two as you go... Things like ++ = list concatenation, no deep Haskell magic.
Using the word count example, the original functional map() would take a set of documents, optionally distribute subsets of that set, and for each document emit a single value representing the number of words (or a particular word's occurrences) in that document. A functional reduce() would then add up those per-document counts into one global total. So you get a total count (either of all words or of a particular word).
In MapReduce, the map would emit a (word, count) pair for each word in each document. A MapReduce reduce() would then add up the count of each word in each document without mixing them into a single pile. So you get a list of words paired with their counts.
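A minimal single-process Python sketch of that shape (a real framework distributes the map, shuffle and reduce phases across machines; the documents are invented):
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# map phase: emit a (word, 1) pair for every word occurrence in every document.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# shuffle phase: group the emitted values by key (the framework does this for you).
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# reduce phase: each reducer sees one key and its values, and emits (key, result).
word_counts = [(word, sum(counts)) for word, counts in grouped.items()]

print(word_counts)  # a list of (word, count) pairs, e.g. [('the', 3), ('quick', 2), ...]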
MapReduce is a framework built around splitting a computation into parallelizable mappers and reducers. It builds on the familiar idiom of map and reduce - if you can structure your tasks such that they can be performed by independent mappers and reducers, then you can write it in a way which takes advantage of a MapReduce framework.
Imagine a Python interpreter which recognized tasks which could be computed independently, and farmed them out to mapper or reducer nodes. If you wrote
reduce(lambda x, y: x+y, map(int, ['1', '2', '3']))
or
sum([int(x) for x in ['1', '2', '3']])
you would be using functional map and reduce methods in a MapReduce framework. With current MapReduce frameworks, there's a lot more plumbing involved, but it's the same concept.