Count number of word occurrences in strings using XQuery - xquery

We have a very similar XML file to this:
<?xml version="1.0" encoding="UTF-8"?>
<nodeOne>
<nodeTwo>
<nodeThree>
foo bar zoo
</nodeThree>
</nodeTwo>
</nodeOne>
<nodeOne>
<nodeTwo>
<nodeThree>
foo bar
</nodeThree>
</nodeTwo>
</nodeOne>
<nodeOne>
<nodeTwo>
<nodeThree>
zoo bar
</nodeThree>
</nodeTwo>
</nodeOne>
What I would like to achieve is to count the occurrences of every word (delimitered by a whitespace) inside nodeThree. Considering the above example, the output would be something like:
foo 2
bar 3
zoo 2
I've tried to fetch every text() of nodeThree, and tried to tokenize() it into sequences of strings. Then, I thought, I might be able to join them and group, count afterwards, but I was not able to do so. Tried a lot of things until now.

First note that your XML is ill-formed (i.e. it's not really XML) unless you create a single root node to wrap it.
If performance is a concern, this problem is much better suited to leverage a word index with frequency data, like in an XML database. Solving this in pure XQuery may be considerably slower for large XML but solves the problem:
let $xml :=
<root>
<nodeOne>
<nodeTwo>
<nodeThree>
foo bar zoo
</nodeThree>
</nodeTwo>
</nodeOne>
<nodeOne>
<nodeTwo>
<nodeThree>
foo bar
</nodeThree>
</nodeTwo>
</nodeOne>
<nodeOne>
<nodeTwo>
<nodeThree>
zoo bar
</nodeThree>
</nodeTwo>
</nodeOne>
</root>
let $toks := $xml//text()/fn:tokenize(fn:normalize-space(.),'\s')
for $t in distinct-values($toks)
let $count := count($toks[. = $t])
return element { $t } {
attribute count { $count }
}
=>
<foo count="2"/>
<bar count="3"/>
<zoo count="2"/>

Related

concatenation of one value with multiple values from another xquery

Hej. I am new at XML-query and have some problem finding a way to combine the result of two xquery into one.
Consider the XML:
<programs>
<program id="488">
<editor>Anna</editor>
<channel>132</channel>
<category>5</category>
</program>
<program id="178">
<editor>Olle</editor>
<channel>132</channel>
<category>68</category>
</program>
<program id="179">
<editor>Olle</editor>
<channel>132</channel>
<category>10</category>
</program>
</programs>
I want to extract list of editors along with the categories they have worked on which would be like this:
<li>Anna 5 </li>
<li>Olle 68 10</li>
Here is the xquery code I am using
let $editors :=
for $d in $sr/sr/programs/program
where $d/channel = "132"
return $d/editor
let $cat :=
for $a in $sr/sr/programs/program
where $a/editor = data($editors)
return concat($a/editor ,' ', $a/category)
for $result in distinct-values($cat)
return <li>{string($result)}</li>
Appreciate all the helps!
I would iterate over the distinct list of editors, obtained from an XPath with a predicate to restrict to those with channel equal to 132, and then retrieve a distinct list of categories for those editors with a similar XPath with predicate filter, and then return the HTML with those values:
for $editor in distinct-values($sr/sr/programs/program[channel eq 132]/editor)
let $categories :=
distinct-values($sr/sr/programs/program[channel eq 132 and editor eq $editor]/category)
return <li>{$editor, $categories}</li>

Exist-db Add node with XQUERY

I have a listPers.xml (TEI List containing persons, obviously ) . I want to write a function to update the listPers.xml
My function looks like this:
declare function app:addPerson($node as node(), $model as map(*)) {
let $person := "<person xml:id=""><persName><forename>Albert</forename><surname>Test</surname></persName></person>"
let $list := doc(concat($config:app-root, '/resources/listPers_test.xml'))
return
update insert $person into $list//tei:listPerson
};
And the listPerson.xml
looks more or less like a typical list with person-entries
I have a tei:header (here omitted) followed by
<text>
<body>
<listPerson xml:id="person">
<person xml:id="abbadie_jacques">
<persName ref="http://d-nb.info/gnd/100002307">
<forename>Jacques</forename>
<surname>Abbadie</surname>
</persName>
<note>Prediger der französisch-reformierten Gemeinde in <rs type="place" ref="#berlin">Berlin</rs>
</note>
</person>
</body>
</text>
</TEI>
(sorry for ruining indentions, it's just an excerpt )
I do not get an error, which means that my app:addPerson should be fine, right?
I want the listPers_test to look like this:
<text>
<body>
<listPerson xml:id="person">
<person xml:id="abbadie_jacques">
<persName ref="http://d-nb.info/gnd/100002307">
<forename>Jacques</forename>
<surname>Abbadie</surname>
</persName>
<note>Prediger der französisch-reformierten Gemeinde in <rs type="place" ref="#berlin">Berlin</rs>
</note>
</person>
<!-- here comes the output that I wish to have :-) -->
<person xml:id=""><persName><forename>Albert</forename><surname>Test</surname></persName></person>
</body>
</text>
</TEI>
In the long run, I aim for an html-form that allows users to input names etc., where ids are generated using sth like
to-lowercase(concat($surname, "_", $forename));
But I will not get into my questions regarding forms and xquery, as I have barely done a quick Google-trip regarding html forms and xquery!
Can anyone hint me at why I do not get the listPers_test.xml file updated with the second value? :-)
All the best and thanks in advance to everyone,
K
Alright, I have a solution for anyone interested in it:
My first snippet $person:= ... contains a STRING, not an element.Changing the line
let $person := "<person xml:id=""><persName><forename>Albert</forename><surname>Test</surname></persName></person>"
to this one actually solves the issue:
let $person := <tei:person xml:id=""><persName><forename>Albert</forename><surname>Test</surname></persName></tei:person>

Filtering by attribute

Below is a excerpt from a XML file with 65 lectures:
<?xml version="1.0" encoding="iso-8859-1" ?>
<university>
<lecture>
<class>English</class>
<hours>3</hours>
<pupils>30</pupils>
</lecture>
<lecture>
<class>Math</class>
<hours>4</hours>
<pupils>27</pupils
</lecture>
<lecture>
<class>Science</class>
<hours>2</hours>
<pupils>25</pupils>
</lecture>
</university>
I need a where clause that gives me a list of lectures with more pupils than an English lecture. However, not with the attribute "30" used, but calling the English's lecture attribute instead
E.g., I want to use a where clause with a condition like pupils > English.pupils, instead of pupils > 30.
(The "pupils > English.pupils" is just puesdo code as an example)
A where clause isn't strictly necessary, but to use one you would make it part of a for iterator:
let $lectures := doc("lectures.xml")/university/lecture
let $english-pupils := $lectures[class = "English"]/pupils/xs:integer(.)
for $lecture in $lectures
where ($lecture/pupils/xs:integer(.) gt $english-pupils)
return $lecture
You could also avoid the flwor altogether by using an XPath predicate.
let $lectures := doc("lectures.xml")/university/lecture
let $english-pupils := $lectures[class = "English"]/pupils/xs:integer(.)
return $lectures[pupils/xs:integer(.) gt $english-pupils]

XQuery "flattening" an element

I am extracting data from an XML file and I need to extract a delimited list of sub-elements. I have the following:
for $record in //record
let $person := $record/person/names
return concat($record/#uid/string()
,",", $record/#category/string()
,",", $person/first_name
,",", $person/last_name
,",", $record/details/citizenships
,"
")
The element "citizenships" contains sub-elements called "citizenship" and as the query stands it sticks them all together in one string, e.g. "UKFrance". I need to keep them in one string but separate them, e.g. "UK|France".
Thanks in advance for any help!
fn:string-join($arg1 as xs:string*, $arg2 as xs:string) is what you're looking for here.
In your currently desired usage, that would look something like the following:
fn:string-join($record/details/citizenships/citizenship, "|")
Testing outside your document, with:
fn:string-join(("UK", "France"), "|")
...returns:
UK|France
Notably, ("UK", "France") is a sequence of strings, just as a query returning multiple citizenships would likewise be a sequence (the entries in which will be evaluated for their string value when passed to fn:string-join(), which is typed as taking a sequence of strings for its first argument).
Consider the following (simplified) query:
declare context item := document { <root>
<record uid="1">
<person>
<citizenships>
<citizenship>France</citizenship>
<citizenship>UK</citizenship>
</citizenships>
</person>
</record>
</root> };
for $record in //record
return concat(fn:string-join($record//citizenship, "|"), "
")
...and its output:
France|UK

MarkLogic Join Query

Hi I am new to marklogic and in Xquery world. I am not able to think of starting point to write the following logic in Marklogic Xquery. I would be thankful if somebody can give me idea/sample so I can achieve the following:
I want to Query A.XML based on a word lookup in B.XML. Query should produce C.XML. The logic should be as follows:
A.XML
<root>
<content> The state passed its first ban on using a handheld cellphone while driving in 2004 Nokia Vodafone Nokia Growth Recession Creicket HBO</content>
</root>
B.XML
<WordLookUp>
<companies>
<company name="Vodafone">Vodafone</company>
<company name="Nokia">Nokia</company>
</companies>
<topics>
<topic group="Sports">Cricket</topic>
<topic group="Entertainment">HBO</topic>
<topic group="Finance">GDP</topic>
</topics>
<moods>
<mood number="4">Growth</mood>
<mood number="-5">Depression</mood>
<mood number="-3">Recession</mood>
</moods>
C.XML (Result XML)
<root>
<content> The state passed its first ban on using a handheld cellphone while driving in 2004 Nokia Vodafone Nokia Growth Recession Creicket HBO</content>
<updatedElement>
<companies>
<company count="1">Vodafone</company>
<company count="2">Nokia</company>
</companies>
<mood>1</mood>
<topics>
<topic count="1">Sports</topic>
<topic count="1">Entertainment</topic>
</topics>
<word-count>22</word-count>
</updatedElement>
</root>
Search each company/text() of A.xml in B.xml, if match found create tag:
TAG {company count="Number of occurrence of that word"}company/#name
{/company}
Search each topic/text() of A.xml in B.xml, if match found create tag
TAG {topic topic="Number of occurrences of that word"}topic/#group{/topic}
Search each mood/text() of A.xml in B.xml, if match found
[occurrences of first word * {/mood[first word]/#number}] + [occurrences of second word * {/mood[second word]/#number})]....
get the word count of element.
This was a fun one, and I learned a few things in the process. Thanks!
Note: to get the results you wanted, I fixed a typo in A.xml ("Creicket" -> "Cricket").
The following solution uses two MarkLogic-specific functions:
cts:highlight (for replacing matching text with nodes which you can then count)
cts:tokenize (for breaking up a given string into word, space, and punctuation parts)
It also includes some powerful magic specific to those two functions, respectively:
the dynamic binding of the special variable $cts:text (which isn't really necessary for this particular use case, but I digress), and
the data model extension which adds these subtypes of xs:string:
cts:word,
cts:space, and
cts:punctuation.
Enjoy!
xquery version "1.0-ml";
(: Generic function using MarkLogic's ability to find query matches within a single node :)
declare function local:find-matches($content, $search-text) {
cts:highlight($content, $search-text, <MATCH>{$cts:text}</MATCH>)
//MATCH
};
(: Generic function using MarkLogic's ability to tokenize text into words, punctuation, and spaces :)
declare function local:get-words($text) {
cts:tokenize($text)[. instance of cts:word]
};
(: The rest of this is pure XQuery :)
let $content := doc("A.xml")/root/content,
$lookup := doc("B.xml")/WordLookUp
return
<root>
{$content}
<updatedElement>
<companies>{
for $company in $lookup/companies/company
let $results := local:find-matches($content, string($company))
where exists($results)
return
<company count="{count($results)}">{string($company/#name)}</company>
}</companies>
<mood>{
sum(
for $mood in $lookup/moods/mood
let $results := local:find-matches($content, string($mood))
return count($results) * $mood/#number
)
}</mood>
<topics>{
for $topic in $lookup/topics/topic
let $results := local:find-matches($content, string($topic))
where exists($results)
return
<topic count="{count($results)}">{string($topic/#group)}</topic>
}</topics>
<word-count>{
count(local:get-words($content))
}</word-count>
</updatedElement>
</root>
Let me know if you have any follow-up questions about how all the above works. At first, I was inclined to use cts:search or cts:contains, which are the bread and butter for search in MarkLogic. But I realized that this example wasn't so much about search (finding documents) as it was about looking up matching text within an already-given document. If you needed to extend this somehow to aggregate across a large number of documents, then you'd want to look into the additional use of cts:search or cts:contains.
One final caveat: if you think your content might have <MATCH> elements already, you'll want to use a different element name when calling cts:highlight (a name which you can guarantee won't conflict with your content's existing element names). Otherwise, you'll potentially get the wrong number of results (higher than the accurate count).
ADDENDUM:
I was curious if this could be done without cts:highlight, given that cts:tokenize already breaks up the text into all the words for you. The same result is produced using this alternative implementation of local:find-matches (provided you swap the order of the function declarations because one depends on the other):
(: Find word matches by comparing them one-by-one :)
declare function local:find-matches($content, $search-text) {
local:get-words($content)[cts:stem(.) = cts:stem($search-text)]
};
It uses cts:stem to normalize the given word to its stem, so, for example searching for "pass" will match "passed", etc. However, this still won't work for multi-word (phrase) searches. So to be safe, I'd stick with using cts:highlight, which, like cts:search and cts:contains, can handle any cts:query you give it (including simple word/phrase searches like we do above).
Might make sense to step back and ask if you might be better served modeling your data and or documents for use with a document oriented database instead of an rdbms
This is simpler/shorter and fully compliant XQuery not containing any implementation extensions, which make it work with any compliant XQuery 1.0 processor:
let $content := doc('file:///c:/temp/delete/A.xml')/*/*,
$lookup := doc('file:///c:/temp/delete/B.xml')/*,
$words := tokenize($content, '\W+')[.]
return
<root>
{$content}
<updatedElement>
<companies>
{for $c in $lookup/companies/*,
$occurs in count(index-of($words, $c))
return
if($occurs)
then
<company count="{$occurs}">
{$c/text()}
</company>
else ()
}
</companies>
<mood>
{
sum($lookup/moods/*[false or index-of($words, data(.))]/#number)
}
</mood>
<topics>
{for $t in $lookup/topics/*,
$occurs in count(index-of($words, $t))
return
if($occurs)
then
<topic count="{$occurs}">
{data($t/#group)}
</topic>
else ()
}
</topics>
<word-count>{count($words)}</word-count>
</updatedElement>
</root>
When applied on the provided files A.xml and B.XML (contained in the local directory c:/temp/delete), the wanted, correct result is produced:
<root>
<content> The state passed its first ban on using a handheld cellphone while driving in 2004 Nokia Vodafone Nokia Growth Recession Cricket HBO</content>
<updatedElement>
<companies>
<company count="1">Vodafone</company>
<company count="2">Nokia</company>
</companies>
<mood>1</mood>
<topics>
<topic count="1">Sports</topic>
<topic count="1">Entertainment</topic>
</topics>
<word-count>22</word-count>
</updatedElement>
</root>

Resources