Compare two elements of the same document in MarkLogic - xquery

I have a MarkLogic 8 database in which there are documents which have two date time fields:
created-on
active-since
I am trying to write an XQuery to find all the documents for which the value of active-since is less than the value of created-on.
Currently I am using the following FLWOR expression:
for $entity in fn:collection("entities")
let $id := fn:data($entity//id)
let $created-on := fn:data($entity//created-on)
let $active-since := fn:data($entity//active-since)
where $active-since < $created-on
return
(
$id,
$created-on,
$active-since
)
The above query takes too long to execute, and the execution time will keep growing as the number of documents increases.
Also, I have an element range index on both of the dateTime fields mentioned above, but they are not being used here. The cts-element-query function only compares one element with a set of atomic values. In my case I am trying to compare two elements of the same document.
I think there should be a better and optimized solution for this problem.
Please let me know in case there is any search function or any other approach which will be suitable in this scenario.

This may be efficient enough for you.
Take one of the values and build a range query per value. This all uses the range indexes, so in that sense it is efficient. However, at some point a large query gets built. It reads similarly to a FLWOR statement. If you really wanted to be a bit more efficient, you could find out which of your elements has fewer unique values (the size of the index) and use that one for your iteration, thus building a smaller query. Also, note that I constrain the element-values call to your collection. This is just in case that element also occurs in documents outside your collection, and it keeps the list to only those values you know are in your collection:
let $q := cts:or-query(
(: one range query per distinct created-on value in the collection :)
for $created-on in cts:element-values(xs:QName("created-on"), (), (), cts:collection-query("entities"))
return cts:element-range-query(xs:QName("active-since"), "<", $created-on)
)
return
cts:search(
fn:collection("entities"),
$q
)
So, let's explain what is happening with a simple example:
Let's say I have elements A and B, each with a range index defined.
Let's pretend we have the following combinations in 5 documents:
A,B
2,3
4,2
2,7
5,4
2,9
let $q := cts:or-query(
for $a in cts:element-values(xs:QName("A"))
return cts:element-range-query(xs:QName("B"), "<", $a)
)
return $q
This would create the following query:
cts:or-query(
(
cts:element-range-query(xs:QName("B"), "<", 2),
cts:element-range-query(xs:QName("B"), "<", 4),
cts:element-range-query(xs:QName("B"), "<", 5)
)
)
And in the example above, the only match would be the document with the combination: (5,4)

You might try using cts:value-tuples(). Pass in three references: active-since, created-on, and the URI reference (cts:uri-reference()). Then iterate over the results looking for the ones where active-since is less than created-on, and you'll have the URI of the doc.
It's not the prettiest code, but it will let all the data come from RAM, so it should scale nicely.

I am now using the following script to get the count of documents for which the value of active-since is less than the value of created-on:
fn:sum(
for $value-pairs in cts:value-tuples(
(
cts:element-reference(xs:QName("created-on")),
cts:element-reference(xs:QName("active-since"))
),
("fragment-frequency"),
cts:collection-query("entities")
)
let $created-on := json:array-values($value-pairs)[1]
let $active-since := json:array-values($value-pairs)[2]
return
if($active-since lt $created-on) then cts:frequency($value-pairs) else 0
)
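The counting logic of the script above can be modelled outside MarkLogic. The following is a minimal Python sketch, not MarkLogic code: the `tuples` list is made-up sample data standing in for the co-occurrence tuples and their frequencies, and `count_active_before_created` is a hypothetical helper name.

```python
from datetime import datetime

# Hypothetical sample data: (created-on, active-since, frequency).
# Each row stands in for one value tuple plus its fragment frequency.
tuples = [
    ("2015-03-01T10:00:00", "2015-02-20T09:00:00", 3),  # active-since earlier
    ("2015-03-01T10:00:00", "2015-03-05T12:00:00", 2),  # active-since later
    ("2015-04-10T08:30:00", "2015-04-10T08:30:00", 1),  # equal, not counted
    ("2015-05-01T00:00:00", "2015-01-01T00:00:00", 4),  # active-since earlier
]


def count_active_before_created(rows):
    """Sum the frequencies of tuples where active-since < created-on."""
    total = 0
    for created_on, active_since, frequency in rows:
        if datetime.fromisoformat(active_since) < datetime.fromisoformat(created_on):
            total += frequency
    return total


print(count_active_before_created(tuples))
```

Because each distinct pair of values is compared once and weighted by its frequency, the work grows with the number of distinct value pairs rather than with the number of documents.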

Sorry, I do not have enough reputation to comment, so I have to post this as an answer. Why do you think that ML will not return (2,3) and (4,2)? I believe we are using an or-query, which returns a document as soon as any single sub-query is true.
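The question raised here can be checked with a small simulation. This is a plain Python sketch that only models or-query semantics over the five example documents; it does not call MarkLogic, and the function names are hypothetical.

```python
# The five (A, B) documents from the example.
docs = [(2, 3), (4, 2), (2, 7), (5, 4), (2, 9)]

# Distinct A values, as a value lexicon lookup would list them.
a_values = sorted({a for a, _ in docs})


def or_query_matches(documents, thresholds):
    """Documents whose B is below ANY of the thresholds (or-query semantics)."""
    return [d for d in documents if any(d[1] < t for t in thresholds)]


def same_document_matches(documents):
    """Documents whose own B is below their own A."""
    return [d for d in documents if d[1] < d[0]]


print(or_query_matches(docs, a_values))
print(same_document_matches(docs))
```

Under this model the or-query compares B against every A value in the lexicon, not only against the A value of the same document, so the two result lists can differ.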

Related

How to convert string to XPATH in BaseX

How can I convert a string into an XPath expression? Below is the code:
let $ti := "item/title"
let $tiValue := "Welcome to America"
return db:open('test')/*[ $tiValue = $ti]/base-uri()
Here is one way to solve it:
let $ti := "item/title"
let $tiValue := "Welcome to America"
let $input := db:open('test')
let $steps := tokenize($ti, '/')
let $process-step := function($input, $step) { $input/*[name() = $step] }
let $output := fold-left($input, $steps, $process-step)
let $test := $output[. = $tiValue]
return $test/base-uri()
The path string is split into single steps (item, title). With fold-left, all child nodes of the current input (initially db:open('test')) will be matched against the current step (initially, item). The result will be used as new input and matched against the next step (title), and so on. Finally, only those nodes with $tiValue as text value will be returned.
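The same fold can be sketched in Python with functools.reduce. This is an illustrative model only: the sample XML is made up, `process_step` and `select_by_path` are hypothetical names, and the fold starts at the root element of the sample rather than at a database document node.

```python
from functools import reduce
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
xml = """<root>
  <item><title>Welcome to America</title></item>
  <item><title>Another title</title></item>
  <other><title>Welcome to America</title></other>
</root>"""


def process_step(nodes, step):
    """Return all children of the current nodes whose name equals the step."""
    return [child for node in nodes for child in node if child.tag == step]


def select_by_path(root, path, value):
    """Fold the path steps over the input, then keep nodes with the given text."""
    steps = path.split("/")
    matched = reduce(process_step, steps, [root])
    return [node for node in matched if node.text == value]


root = ET.fromstring(xml)
result = select_by_path(root, "item/title", "Welcome to America")
print(len(result), result[0].tag)
```

Each step narrows the node list produced by the previous step, which is what fold-left does with $process-step in the XQuery version.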
Your question is very unclear - the basic problem is that you've shown us some code that doesn't do what you want, and you're asking us to work out what you want by guessing what was going on in your head when you wrote the incorrect code.
I suspect -- I may be wrong -- that you were hoping this might somehow give you the result of
db:open('test')/*[item/title = $ti]/base-uri()
and presumably $ti might hold different path expressions on different occasions.
XQuery 3.0/3.1 doesn't have any standard way to evaluate an XPath expression supplied dynamically as a string (unless you count the rather devious approach of using fn:transform() to invoke an XSLT transformation that uses the xsl:evaluate instruction).
BaseX, however, has an xquery:eval() function that will do the job for you. See https://docs.basex.org/wiki/XQuery_Module

MarkLogic optic query using two indexes returns no results

I want to use the MarkLogic optic API to join two range indexes but somehow they don't join. Is the query I wrote wrong or can't I compare the indexes used?
I have two indexes defined:
an element-attribute range index x/@refid
a range field index 'id'
Both are of type string and have the same collation defined. Both indexes have data that I can retrieve with cts:values() function. Both are huge indexes and I want to join them using optics so I have constructed the following query :
import module namespace op="http://marklogic.com/optic"
at "/MarkLogic/optic.xqy";
let $subfrag := op:fragment-id-col("subfrag")
let $notfrag := op:fragment-id-col("notfrag")
let $query :=
cts:and-query((
cts:collection-query("latest")
))
let $subids := op:from-lexicons(
map:entry("subid", cts:field-reference("id")), (), $subfrag) => op:where($query)
let $notids := op:from-lexicons(
map:entry("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid"))),
(),
$notfrag)
return $subids
=> op:join-cross-product($notids)
=> op:where(op:eq($notfrag, $subfrag))
=> op:result()
This query uses join-cross-product, and when I remove the op:where clause I get all values from the left and right side. I verified that some of them are equal, so the clause should keep only the rows I'm actually interested in. But somehow it doesn't work and I get an empty result. Also, if I replace one of the values in the op:eq with a string value, it doesn't return a result.
When I use the same variable in the op:eq operator (like op:eq($notfrag, $notfrag)) I get results back so the statement as is works. Just not the comparison between the two indexes.
I have also used variants with join-inner and left-outer-join but those are also returning no results.
Am I comparing two incomparable indexes or am I missing some statement (as documentation/example is a bit thin).
(of course I can solve by not using optics but in this case it would be a perfect fit)
[update]
I eventually got it working by changing the final statement:
return $subids
=> op:join-cross-product($notids)
=> op:where(op:eq(op:col('subid'), op:col('notid')))
=> op:result()
So somehow you cannot use the fragment definitions in the condition. After this I replaced the join-cross-product with a join-inner construction which should be a bit more efficient.
And to be complete, I initially used the example from the MarkLogic documentation found here (https://docs.marklogic.com/guide/app-dev/OpticAPI#id_87356), specifically the last example where they use a fragment column definition to be used as param in the join-inner statement that didn't work in my case.
Cross products are typically useful only for small row sets.
Putting both references in the same from-lexicons() accessor does an implicit join, meaning that the engine forms rows by constructing a local cross-product of the values indexed for each document.
Such a query could be expressed by:
op:from-lexicons(
map:entry("subid", cts:field-reference("id"))
=>map:with("notid", cts:element-attribute-reference(xs:QName("x"),
xs:QName("refid"))))
=>op:where(cts:collection-query("latest"))
=>op:result()
Making the joins explicitly could be done with:
let $subids := op:from-lexicons(
map:entry("subid", cts:field-reference("id")), (), $subfrag)
=> op:where($query)
let $notids := op:from-lexicons(
map:entry("notid", cts:element-attribute-reference(xs:QName("x"),
xs:QName("refid"))),
(),
$notfrag)
return $subids
=> op:join-inner($notids, op:on($notfrag, $subfrag))
=> op:result()
Hoping that helps,

searching in multiple collections joined by common fields in xquery marklogic

I have two collections ('A' and 'B') with millions of transport insurance documents. The two collections have four elements in common (customer-no, date-of-insurance, insurance-no, accident-number), and one element (license-no) exists only in collection 'A'. I want to extract all the documents that are present in both collections and also have the element that exists only in collection 'A'. I am able to retrieve all the customer-nos from 'A' with cts:search. Then I loop through each of these customer-nos to look for license-no in 'A'. It gives an empty sequence. But I know this is not possible. Could someone guide me with appropriate search logic?
let $col-A := cts:search(
doc(),
cts:and-query((
cts:collection-query('col-A'),
cts:element-value-query(xs:QName('abc:Acusno'), '*', (("wildcarded")))
)))
for $each in $col-A
let $col-B := cts:search(doc(),
cts:and-query((cts:collection-query('col-B'),
cts:element-value-query(xs:QName('abc:Bcusno'), $each)
)))
return $col-B
returns empty sequence
Your first cts:search is returning entire documents, which you are then passing in as argument into the value-query. You probably want to pass in just the value of abc:Acusno. You could do that with something like $each//abc:Acusno.
Your code is not using a very efficient approach though, and what if certain Acusno values occur multiple times?
I would recommend putting a range index on abc:Acusno, and using cts:values to pull up the unique values that match a given query. Then feed that entire list as one argument without any looping to a query against abc:Bcusno. You don't have to use a range index, and range query on Bcusno, but it could be useful to have that index anyhow. The code would then look something like this:
let $query :=
cts:and-query((
cts:collection-query('col-A'),
cts:element-query(xs:QName('abc:Acusno'), cts:true-query())
))
let $customerNrs :=
cts:values(
cts:element-reference(xs:QName("abc:Acusno")),
(),
(),
$query
)
return cts:search(
collection(),
cts:and-query((
cts:collection-query('col-B'),
cts:element-range-query(xs:QName('abc:Bcusno'), '=', $customerNrs)
))
)
Note: be careful when returning full search lists like this. You might want to paginate the response.
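The shape of this approach (collect the distinct values once, match them in a single pass, then paginate) can be modelled in a few lines of Python. This is only a sketch with made-up in-memory data; the dictionaries, the key names and the helper functions are hypothetical and stand in for the lexicon lookup and the search.

```python
# Made-up documents standing in for the two collections.
col_a = [
    {"Acusno": "C1", "license-no": "L-1"},
    {"Acusno": "C2", "license-no": "L-2"},
    {"Acusno": "C1", "license-no": "L-3"},  # repeated customer number
]
col_b = [{"Bcusno": "C1"}, {"Bcusno": "C3"}, {"Bcusno": "C2"}, {"Bcusno": "C4"}]


def distinct_customer_numbers(docs):
    """Unique customer numbers in A (the role of the value lexicon lookup)."""
    return {d["Acusno"] for d in docs if "Acusno" in d}


def match_in_b(docs, customer_numbers):
    """Documents in B whose customer number is in the given set, in one pass."""
    return [d for d in docs if d.get("Bcusno") in customer_numbers]


def paginate(results, start, page_size):
    """Return one page of results instead of the full list."""
    return results[start:start + page_size]


numbers = distinct_customer_numbers(col_a)
matches = match_in_b(col_b, numbers)
print(paginate(matches, 0, 1))
```

Working from the set of distinct values means a customer number that occurs many times in A is looked up only once.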
HTH!

Combined search query for a few xml documents

I have in each books directory /books/{book_id}/ a couple of xml documents.
/books/{book_id}/basic.xml and /books/{book_id}/formats.xml.
First one is
<document book_id="{book_id}">
<title>The book</title>
</document>
and the second is
<document book_id="{book_id}">
<format>a</format>
<format>b</format>
<format>c</format>
</document>
How can I find all books in the /books/ directory with format eq 'a' and title eq *'book'* in one query? I have one variant where I first find all books by format with cts:search() and then filter the result in a "for loop" by checking the title in the basic.xml file.
Thank you!
This question is tagged MarkLogic as well as XQuery. For completeness, I have included a MarkLogic solution that is a single statement:
let $res := cts:search(doc(), cts:and-query(
(
cts:element-word-query(xs:QName("title"), '*book*', ('wildcarded'))
,
cts:element-attribute-range-query(xs:QName("document"), xs:QName("book_id"), '=', cts:element-attribute-values(xs:QName("document"), xs:QName("book_id"), (), (), cts:element-value-query(xs:QName("format"), 'a')))
)
)
)
return $res
OK. Now let's break this down and have a look.
Note: This sample requires a single range index on the attribute book_id.
I took advantage of the fact that you have the same attribute in the same namespace in both types of documents. This allowed the following:
I could use a single index
Then I used element-attribute-values for the list of book_ids
-- This was constrained by the 'format' element
The list of book_ids above was used to filter the books (range query)
Which was then further filtered by the title
This approach joins the two documents using a range index which is super-fast - especially on the integer value of the book_id
It should be noted that in this particular case, I was able to isolate the proper documents because title elements only exist in one type of document.
Now, let's look at a cleaner example of the same query.
(: I used a word-query so that I could do wildcarded searches for documents with 'book' in the title. This is because your sample has the title 'The book', yet you search for 'book', so I can only conclude that you meant to do wildcard searches :)
let $title-constraint := "*book*"
(: This could also be a sequence :)
let $format-constraint := "a"
(: used for the right-side of the element-range-query :)
let $format-filter := cts:element-attribute-values(xs:QName("document"), xs:QName("book_id"), (), (), cts:element-value-query(xs:QName("format"), $format-constraint))
(: final results :)
let $res := cts:search(doc(), cts:and-query((
cts:element-word-query(xs:QName("title"), $title-constraint, ('wildcarded'))
,
cts:element-attribute-range-query(xs:QName("document"), xs:QName("book_id"), '=', $format-filter)
)
) )
return $res
Maybe stating the obvious, the best approach would be to change the model so the format is in the same document as the title and can be matched by a single query.
If that's not possible, one alternative would be to turn on the uri lexicon in the database configuration (if it's not enabled already).
Assuming that the title is more selective than the format, something along the following lines might work.
let $title-uris := cts:uris((), (), cts:and-query((
cts:directory-query("/books/", "infinity"),
cts:element-word-query(xs:QName("title"), "book")
)))
let $title-dirs :=
for $uri in $title-uris
return fn:replace($uri, "/basic\.xml$", "/")
let $format-uris := cts:uris((), (), cts:and-query((
cts:directory-query($title-dirs),
cts:element-value-query(xs:QName("format"), "a")
)))
let $book-docs :=
for $uri in $format-uris
return fn:replace($uri, "/formats\.xml$", "/basic.xml")
for $doc in fn:doc($book-docs)
return ... do something with the basic document ...
The extra cost beyond the document reads consists of two lookups in the uri lexicon and the string manipulation. The benefit is in reading only the documents that match.
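The string manipulation between the two lexicon lookups can be illustrated in Python. This is a sketch only: the URI lists are made up and stand in for the results of the two lookups, `book_documents` is a hypothetical helper, and the file name formats.xml is taken from the question.

```python
import re


def book_documents(title_uris, format_a_uris):
    """Map title matches to directories, keep format matches in those
    directories, then map the survivors back to their basic.xml URIs."""
    title_dirs = {re.sub(r"/basic\.xml$", "/", uri) for uri in title_uris}
    matching = [
        uri for uri in format_a_uris
        if re.sub(r"formats\.xml$", "", uri) in title_dirs
    ]
    return [re.sub(r"/formats\.xml$", "/basic.xml", uri) for uri in matching]


# Made-up lookup results: basic.xml documents matching the title,
# and formats.xml documents containing format 'a'.
title_uris = ["/books/1/basic.xml", "/books/7/basic.xml"]
format_a_uris = ["/books/1/formats.xml", "/books/2/formats.xml"]

print(book_documents(title_uris, format_a_uris))
```

Only book 1 appears in both lists here, so only its basic.xml would need to be read.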
In general, it's better at scale to use the indexes to match the relevant documents instead of reading the documents into memory and filtering out the irrelevant documents. The cts:uris() and cts:search() functions always match using the indexes first (and only filter when the search option is specified). XPaths optimize by matching with the indexes when possible but have to fall back to filtering for some predicates. Unless you're careful, it's usually better to limit XPaths to navigation of nodes in memory.
Hoping that helps,
How can I find all books in /books/ directory with format eq 'a' and title eq 'book' by one query?
Try:
doc('basic.xml')/document[@book_id='X']/title[contains(., 'book')]
[doc('format.xml')/document[@book_id='X'][format = 'a']]
If the last predicate selects nothing, no title is returned. If it selects something, the title is returned.
You should, of course, replace X with your ID. And you can set the relative path to include the ID. If you have a set of ID's you want to go over, you can do this:
for $id in ('{book_id1}', '{book_id2}')
return
doc(concat($id, '/basic.xml'))/document[@book_id=$id]/title[contains(., 'book')]
[doc(concat($id, '/format.xml'))/document[@book_id=$id][format = 'a']]
You'll get the drift ;)
PS: I'm not sure if {...} is a legal URI pathpart, but I assume you'll replace it with something sensible. Otherwise, escape it with the appropriate percent-encoding.
I think I found a better solution:
let $book_ids := cts:values(
cts:element-attribute-reference(xs:QName("document"), xs:QName("book_id") ),
(),
("map"),
cts:and-query((
cts:directory-query(("/books/"), "infinity"),
cts:element-query(xs:QName("title"),"book")
))
)
return
cts:search(
/,
cts:and-query((
cts:element-attribute-value-query(xs:QName("document"), xs:QName("book_id"), map:keys($book_ids)),
cts:element-value-query(xs:QName("format"), "a")
))
)

Distinct-Nodes taking too long in BaseX (XQuery)

I am trying to get all distinct start elements (/products/p:category/start) of a big file. I have written the query given below. It is taking too long to return a result. I am attaching the query info and the XML file.
After running for a couple of minutes, I stopped the execution.
The query is trying to get all the distinct start elements. There are 3 lakh (300,000) category elements.
declare namespace functx = "http://www.functx.com";
declare namespace p="a:b:c";
declare function functx:is-node-in-sequence(
$node as node()? ,
$seq as node()*
) as xs:boolean {
some $nodeInSeq in $seq satisfies deep-equal($nodeInSeq,$node)
};
declare function functx:distinct-nodes(
$nodes as node()*
) as node()* {
for $seq in (1 to count($nodes))
return $nodes[$seq]
[not(functx:is-node-in-sequence(.,$nodes[position() < $seq]))]} ;
let $diff_starts := functx:distinct-nodes(/products/p:category/start)
return $diff_starts
Please let me know if you require further details.
xml file
Comparing a rather large number of nodes with the function provided by FunctX is very expensive: the cost grows far faster than linearly with the number of items.
FunctX is generally a neat library, but often does not scale very well for larger data as you have it (although XML databases can very well handle data much larger without problems).
In this query, I first fetch all distinct values in linear time (in the number of node lookups; for getting distinct values BaseX uses a hashtable ), and another linear scan over all nodes to retrieve the first result node for each of those values. Total execution time on my laptop was about 700ms.
declare namespace p="a:b:c";
for $date in distinct-values(/products/p:category/start)
return (/products/p:category/start[. eq $date])[1]
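The difference between the two strategies can be sketched in Python. This is an illustrative model, not BaseX code: the list of strings stands in for the string values of the start nodes, and both function names are hypothetical.

```python
def first_index_per_value_pairwise(values):
    """Model of the pairwise strategy: compare each item with all earlier items."""
    return [i for i, v in enumerate(values) if v not in values[:i]]


def first_index_per_value_hashed(values):
    """Model of the hash-based strategy: remember the first index per value."""
    first_seen = {}
    for index, value in enumerate(values):
        first_seen.setdefault(value, index)
    return list(first_seen.values())


# Made-up start values for illustration.
starts = ["2020-01-01", "2020-02-01", "2020-01-01", "2020-03-01", "2020-02-01"]

print(first_index_per_value_pairwise(starts))
print(first_index_per_value_hashed(starts))
```

Both functions return the position of the first node for each distinct value, but the pairwise version rescans all earlier items for every item, while the hashed version does a single lookup per item.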
