How to show inserted XML documents in order in MarkLogic?

I am inserting some XML documents from the UI into MarkLogic Server and at the same time showing them in a list. I want to show the documents in order: the document inserted first should come first in the list, the second document second, and so on. But MarkLogic is showing them in no particular order.

The insert order is not persisted or preserved by MarkLogic Server. If you want the insert order of your documents to be preserved, the data (or the data's properties) will need some value upon which the server can apply an "order by" clause:
for $doc in fn:doc()
order by $doc//some-aspect-of-the-xml-structure
return
$doc
The documents are indeed independent of each other in a "shared nothing" architecture. This helps MarkLogic run much faster than some relational database approaches, where "rows" share membership and ordering in a "table" and as a result have trouble clustering efficiently.

You can order documents by their last-updated date:
(: If the URI lexicon is enabled; otherwise you can iterate with fn:collection() :)
for $uri in cts:uris((), "document")
let $updated-date := xdmp:document-get-properties($uri, fn:QName("http://marklogic.com/cpf", "last-updated"))
order by $updated-date/text()
return $uri
There is another way, without using the URI lexicon (note that the prop:last-updated property is only maintained if the database's "maintain last modified" setting is enabled):
for $doc in fn:collection()
let $uri := xdmp:node-uri($doc)
let $updated-date := xdmp:document-get-properties($uri, fn:QName("http://marklogic.com/cpf", "last-updated"))
order by $updated-date/text()
return $uri


In XQuery Marklogic how to sort dynamically?
let $sortelement := 'Salary'
for $doc in collection('employee')
order by $doc/$sortelement
return $doc
PS: Sorting will change based on user input, e.g. date or name in place of Salary.
If Salary is the name of an element, then you could more generically select any element in the XPath with * and then apply a predicate filter testing whether its local-name() matches the variable holding the selected element name, $sortelement:
let $sortelement := 'Salary'
for $doc in collection('employee')
order by $doc/*[local-name() eq $sortelement]
return $doc
This manner of sorting all items in the collection may work with a smaller number of documents, but if you are working with hundreds of thousands or millions of documents, you may find that pulling back all docs is either slow or blows out the Expanded Tree Cache.
A more efficient solution would be to create range indexes on the elements that you intend to sort on, and then perform a search with options that order the results with cts:index-order and an appropriate reference to the indexed item, such as cts:element-reference(), cts:json-property-reference(), or cts:field-reference().
For example:
let $sortelement := 'Salary'
return
cts:search(doc(),
cts:collection-query("employee"),
cts:index-order(cts:element-reference(xs:QName($sortelement)))
)
Not recommended, because the chances of introducing security issues, runtime crashes, and just 'bad results' are much higher and more difficult to control --
BUT available as a last resort:
ALL XQuery can be dynamically created as a string and then evaluated using xdmp:eval.
Much better to follow the guidance of Mads and use the search APIs instead of raw XQuery FLWOR expressions -- note that these APIs actually 'compile down' to a data structure. This is what the 'cts constructors' do: https://docs.marklogic.com/cts/constructors
I find it helps to think of cts searches as a structured search described by data -- which the cts:xxx are simply helper functions to create the data structure.
(they don't actually do any searching; they build up a data structure that is used to do the searching)
If you look at the source to the search:xxx apis you can see how this is done.
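A quick way to see that cts constructors just build data is to serialize a query. This is a minimal sketch (the query values are arbitrary examples): xdmp:describe prints the constructor form, and embedding the query in an element yields its XML representation.

```xquery
(: cts constructors build a data structure; nothing is searched here :)
let $query :=
  cts:and-query((
    cts:collection-query("employee"),
    cts:element-value-query(xs:QName("Name"), "Smith")))
return (
  (: prints the constructor form of the query :)
  xdmp:describe($query, (), ()),
  (: serializing into an element shows the query as XML :)
  <query>{$query}</query>
)
```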

How to get total number of nodes in an XML in MarkLogic Database

I have an XML like below in my database:
<PersonalData>
<Person>
<Name></Name>
<Age></Age>
<AccountNo>
<Number>123</Number>
<SwiftCode>1235</SwiftCode>
</AccountNo>
<AccountNo>
<Number>15523</Number>
<SwiftCode>188235</SwiftCode>
</AccountNo>
</Person>
</PersonalData>
In this XML, I have multiple AccountNo nodes, and I have around 1M similar records in my database. I want to find the count of AccountNo nodes across my entire database.
One way in which you can report the count of AccountNo elements would be to use an XPath and count:
count(//AccountNo)
You can also use cts:search and specify the AccountNo in the $expression XPath, and then count() the results:
count(cts:search(//AccountNo, cts:true-query()))
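If an index-resolvable count is good enough, xdmp:estimate avoids loading documents entirely. Note the caveat: it counts matching fragments (typically documents), not individual AccountNo elements, so it will undercount when a document contains several AccountNo nodes. A sketch:

```xquery
(: resolves from indexes alone; counts fragments, not elements :)
xdmp:estimate(
  cts:search(fn:collection(),
    cts:element-query(xs:QName("AccountNo"), cts:true-query())))
```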
Another way to get a count of all the AccountNo elements would be to run a CoRB job that selects the docs containing those elements, and then in the process module return a line for every element in the doc and write the results to a text file. Below is an example OPTIONS-FILE that could be used to achieve that:
URIS-MODULE=INLINE-XQUERY|let $uris := cts:uris('',(),cts:element-query(xs:QName("AccountNo"), cts:true-query())) return (count($uris), $uris)
PROCESS-MODULE=INLINE-XQUERY|declare variable $URI external; doc($URI)//AccountNo ! 1
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
EXPORT-FILE-NAME=AccountNoCounts.txt
DISK-QUEUE=true
Then you could get the line count from the result file, which would tell you how many elements there are: wc -l AccountNoCounts.txt
If you need this count often and need the response to be fast, you could create a TDE that projects a row for each AccountNo element, and then select the count with SQL (e.g. SELECT count(1) FROM Person.AccountNo) or use the Optic API against that TDE and op.count().
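As a rough sketch of that TDE approach (the schema, view, and column names here are illustrative assumptions, chosen to match the SQL above), a template projecting one row per AccountNo element could be inserted like this:

```xquery
xquery version "1.0-ml";
import module namespace tde = "http://marklogic.com/xdmp/tde"
  at "/MarkLogic/tde.xqy";

(: one row is projected for every AccountNo element :)
tde:template-insert(
  "/templates/account-no.xml",
  <template xmlns="http://marklogic.com/xdmp/tde">
    <context>/PersonalData/Person/AccountNo</context>
    <rows>
      <row>
        <schema-name>Person</schema-name>
        <view-name>AccountNo</view-name>
        <columns>
          <column>
            <name>number</name>
            <scalar-type>string</scalar-type>
            <val>Number</val>
          </column>
        </columns>
      </row>
    </rows>
  </template>)
```

Once the template is loaded, SELECT count(1) FROM Person.AccountNo should resolve from the projected rows rather than the documents.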

Querying on Global Secondary indexes with a usage of contains operator

I've been reading the DynamoDB docs and was unable to understand whether it makes sense to query a Global Secondary Index using the 'contains' operator.
My problem is as follows: my DynamoDB document has a list of embedded objects, and every object has a 'code' field which is unique:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
]
}
I want to be able to get all documents that contain entities with entity.code = X.
For this purpose I'm considering adding a Global Secondary Index that would contain all entity.codes that are present in current db document separated by a comma. So the example above would look like:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
],
"entitiesGlobalSecondaryIndex":"entityCode1,entityCode2"
}
And then I would like to apply filter expression on entitiesGlobalSecondaryIndex something like: entitiesGlobalSecondaryIndex contains entityCode1.
Would this be efficient, or does using a global secondary index this way not make sense, so that DynamoDB will simply check the condition against every document, similar to a scan?
Any help is very appreciated,
Thanks
The contains operator of a query cannot be run on a partition key. In order for a query to use any sort of operator (contains, begins_with, >, <, etc.) you must have a range attribute -- aka your sort key.
You can very well set up a GSI with some value as your PK and this code as your SK. However, GSIs are replications of the table -- there is a slight potential for the data in a GSI to lag behind that of the master copy. If the query you're doing against this GSI isn't very frequent, then you're probably safe from that.
However, if you are trying to do this against the entire table at once, then it's no better than a scan.
If what you need is a specific Code to return all its documents at once, then you could do a GSI with that as the PK. If you add a date field as the SK of this GSI it would even be time sorted. If you query against that code in that index, you'll get every single one of them.
Since you may have multiple codes, if there aren't too many per document, you could use a Sparse Index: if you have an entity with code "AAAA" then you also have an attribute named AAAA (or AAAAflag or something). It is always null/does not exist unless the entity list contains that code. If you create a GSI on this AAAAflag attribute, it will only contain documents that have that entity code, and ignore all documents where the attribute does not exist. This may work for you if you can also provide a good PK to keep the numbers well partitioned and if you don't have too many codes.
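To illustrate the sparse-index idea (the attribute names here are hypothetical), an item carries a flag attribute only for the codes it actually contains, so a GSI keyed on that attribute indexes only those items:

```json
{
  "documentId": "doc-1",
  "entities": [
    { "code": "AAAA", "name": "entity1Name" },
    { "code": "BBBB", "name": "entity2Name" }
  ],
  "AAAAflag": "doc-1",
  "BBBBflag": "doc-1"
}
```

An item with no "AAAA" entity simply has no AAAAflag attribute, so it never appears in the AAAAflag GSI.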
Filter expressions, by the way, are different from all of the above. Filter expressions are run on the data that would be returned, after it is already read out of the table. This is useful if you have a multi-access-pattern setup but don't want a particular call to get all the documents associated with a particular PK -- in the interest of keeping the data your code works with concise. A query with a filter expression still reads everything the query matches, but only presents what makes it past the filter.
If you are only querying against a particular PK at any given time and you want to know whether it contains any entities of x, then a filter expression would work perfectly. Of course, this is only per PK and not for your entire table.
If all you need is numbers, then you could do a count attribute on the document, or a meta document on that partition that contains these values and could be queried directly.
Lastly, and I have no idea if this would work or not, if your entities attribute is a map type you might very well be able to filter against entities code - and maybe even with entities.code.contains(value) if it was an SK - but I do not know if this is possible or not

MarkLogic commit frame/return sequence guarantee

I have a simple single-node MarkLogic server from which I need to purge documents daily.
The test query below selects the documents, then returns a sequence which I want to do the following:
1. Output the name of the file being extracted
2. Ensure the directory path of the file in #1 exists
3. Save a zipped version of the document to the file in #1
4. Delete the document
Is this structure safe? It returns a sequence for each document to be deleted, and the last item in the returned sequence deletes the document. If any of the prior steps fail, will the document still be deleted? Should I trust the engine to execute the returned sequence in the given order?
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
let $dateLimitAll := current-dateTime() - xs:dayTimeDuration("P1460D")
let $dateLimitSome := current-dateTime() - xs:dayTimeDuration("P730D")
for $adoc in doc()[1 to 5]
let $docDate := $adoc/Unit/created
let $uri := document-uri($adoc)
let $path:= fn:concat("d:/purge/" , $adoc/Unit/xmldatastore/state/data(), "/", fn:year-from-dateTime($docDate), "/", fn:month-from-dateTime($docDate))
let $filename := fn:concat($path, "/", $uri, ".zip")
where ( ($docDate < $dateLimitAll) or (($docDate < $dateLimitSome) and ($adoc/Unit/xmldatastore/state != "FIRMED") and ($adoc/Unit/xmldatastore/state != "SHIPPED")))
return (
$filename,
xdmp:filesystem-directory-create($path, map:new(map:entry("createParents", fn:true()))),
xdmp:save($filename, xdmp:zip-create(<parts xmlns="xdmp:zip"><part>{$uri}</part></parts>, doc($uri))),
xdmp:document-delete($uri)
)
p.s. please ignore the [1 to 5] doc limit. Added for testing.
If any of the prior steps fail, will the document still be deleted?
If there is an error in the execution of that module, the transaction will rollback and the delete from the database will be undone.
However, the directory and zip file written to the filesystem will persist and will not be deleted. The xdmp:filesystem-directory-create() and xdmp:save() functions do not rollback or get undone if a transaction rolls back.
Should I trust the engine to execute the return sequence in order given?
Not sure that it matters much, given the statement above.
Is this structure safe?
It is unclear how many documents you might be dealing with. You may find that the filter is better/faster using cts:search and some indexes to target the candidate documents. Also, even if you can select the set of documents to process faster, if there are a lot of documents, you could still exceed execution time limits.
Another approach might be to break up the work: select the URIs of the documents that match the criteria, and then have a separate query execution for each document that is responsible for saving the zip file and deleting the document from the database. This is likely to be faster, as you can process multiple documents in parallel; it avoids the risk of a timeout; and in the event of an exception, it allows some items to fail without causing the entire set to fail and roll back.
Tools such as CoRB were built exactly for this type of batch work.
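A sketch of that split, assuming a dateTime range index exists on the created element (without it, cts:element-range-query will error; the path and cutoff are carried over from the question, and the simpler of the two criteria is used):

```xquery
xquery version "1.0-ml";
(: pass 1: select candidate URIs from indexes;
   each document is then archived and deleted in its own spawned task :)
for $uri in cts:uris((), (),
  cts:element-range-query(xs:QName("created"), "<",
    fn:current-dateTime() - xs:dayTimeDuration("P1460D")))
return
  xdmp:spawn-function(
    function() {
      xdmp:save(fn:concat("d:/purge/", $uri, ".zip"),
        xdmp:zip-create(
          <parts xmlns="xdmp:zip"><part>{$uri}</part></parts>,
          fn:doc($uri))),
      xdmp:document-delete($uri)
    },
    <options xmlns="xdmp:eval"><update>true</update></options>)
```

Each spawned task is its own transaction, so one failure rolls back only that document's delete, not the whole batch.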

Does a logical partition scan in CosmosDB always return items in the same order?

In CosmosDB, using the SQL API (though the API might not matter), for queries that do not use ORDER BY over a specific logical partition (e.g. WHERE CustomerId = 123), I am wondering if the response will always return the results in the same order.
A use case could be something like an audit log, where it is possible that the timestamp _ts is not granular enough, so you are likely to find the same value twice at some point, and the source of events doesn't allow creating a sequence that can be used for ordering.
wondering if the response will return the results always in the same order.
Based on my previous tests, if you do not set any sort rules, the results are sorted by default based on the time the items were created in the database, whether the container is partitioned or not.
In my sample documents, the sort order did not change when I changed the id, the partition key (name), or _ts.
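If the ordering matters, it is safer not to rely on that default. A sketch of an explicit, deterministic ordering using id as a tie-breaker (this multi-property ORDER BY assumes a matching composite index is defined on the container):

```sql
SELECT * FROM c
WHERE c.CustomerId = 123
ORDER BY c._ts ASC, c.id ASC
```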