This question is about the Data Hub Framework.
I have 3-4 conditions in which I perform operations such as xdmp:node-replace and xdmp:document-delete, and after all the conditions I try to insert the document using xdmp:document-insert.
When I run each condition independently, with the others commented out, it works fine, but when I run two or more conditions together I get XDMP-CONFLICTINGUPDATES.
$envelope comes from the STAGING database, and I am using it in writer.xqy.
The code sample is below:
let $con1 := if ($envelope/*:test/text() eq "abc")
  then xdmp:node-replace(....) else ()
let $con2 := if ($envelope/*:test/text() eq "123")
  then xdmp:node-replace(....) else ()
let $con3 := if ($envelope/*:test/text() eq "cde")
  then xdmp:document-delete(....) else ()
return if ($envelope//*:FLAG/text() eq "1")
  then
    xdmp:document-insert($id, $envelope, xdmp:default-permissions(), map:get($options, "entity"))
  else ()
Any suggestions?
XDMP-CONFLICTINGUPDATES means you are trying to update the same node more than once within a single transaction. Solving this type of error can be infamously tricky, and doing so is a rite of passage for every MarkLogician.
In your case, it is caused by updating a node with xdmp:node-replace and then updating the document node that is the parent of that node with xdmp:document-insert. Because you are updating both the node and its parent, you are in effect updating that node twice, which causes the error. It may also occur when you try to both delete and insert a document at the same URI within the same transaction.
Here is a simple query you can run in Query Console to reproduce this behavior:
xquery version "1.0-ml";
xdmp:document-insert("/test.xml", <test><value></value></test>);
xquery version "1.0-ml";
let $d := fn:doc("/test.xml")
let $_ := xdmp:node-replace($d//value, <value>test</value>)
return
xdmp:document-insert("/test.xml", $d)
In the case of this demonstration, as well as your code, the xdmp:document-insert is redundant and can simply be removed.
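As a minimal sketch of that fix, assuming /test.xml was created by the first statement above, the second statement reduces to the node-level update alone:

```xquery
xquery version "1.0-ml";
(: Only one update touches the document in this statement, :)
(: so there is nothing for it to conflict with. :)
let $d := fn:doc("/test.xml")
return
  xdmp:node-replace($d//value, <value>test</value>)
```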
Likely the XQuery statement above is attempting multiple updates to the same node in the same single-statement transaction. The xdmp:node-replace calls are performing updates at each operation to the same node. See the documentation for more details.
Here are two solutions that may work for you:
Use conditional statements to decide what kind of update needs to be performed on the node, e.g., whether the node needs to be deleted, or whether it needs to be updated and how. At the end of your script you can then apply that single update to the node.
Perform in-memory updates to the node, then commit the node to the database at the end of the transaction. Here is one library you could use: https://github.com/ryanjdew/XQuery-XML-Memory-Operations
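The second approach can be sketched without any library by rebuilding the document in memory with a recursive typeswitch and writing it once. This is only an illustration against the /test.xml demo document, which is assumed to exist; local:set-value is a hypothetical helper, not part of MarkLogic or of the library linked above.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: returns a copy of $node in which every :)
(: <value> element is replaced by one holding $new. :)
declare function local:set-value($node as node(), $new as xs:string) as node()
{
  typeswitch ($node)
    case document-node() return
      document { for $child in $node/node() return local:set-value($child, $new) }
    case element(value) return
      <value>{ $new }</value>
    case element() return
      element { fn:node-name($node) } {
        $node/@*,
        for $child in $node/node() return local:set-value($child, $new)
      }
    default return $node
};

(: All changes happen on the in-memory copy; the database sees :)
(: exactly one update, the final xdmp:document-insert. :)
let $d := fn:doc("/test.xml")
return
  xdmp:document-insert("/test.xml", local:set-value($d, "test"))
```

Because the copy is never stored until the last line, any number of conditional changes can be chained before the insert without triggering a conflict.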
One general possibility for complicated updates: use XSLT.
This is a multi-transaction statement. There are multiple ways to handle it in your scenario:
Use xdmp:eval
Use an in-memory update (mem) library for MarkLogic to replace your nodes
Rewrite your query to avoid the transaction conflict
I have a requirement where I iterate through 10,000,000 documents, and for each document I perform some operation and store some values in '/count.xml'. When I move on to the second document, I update '/count.xml' with the updated value.
This is what I am currently doing; here $total-records is 10,000,000:
let $total-records := xdmp:estimate(cts:search( //some code))
let $batch-size := 5000
let $pagination := 0
let $bs :=
for $records in 1 to fn:ceiling($total-records div $batch-size )
let $start := fn:sum($pagination + 1)
let $end := fn:sum($batch-size + $pagination)
let $_ := xdmp:set($pagination, $end)
return
xdmp:spawn-function
(
function() {
for $each in cts:search( //some code)[$start to $end]
return //some operation and update '/count.xml' with some updated values
},
<options xmlns="xdmp:eval"><commit>auto</commit><update>true</update></options>
)
let $doc := doc("/count.xml")
return ()
So the issue is that I need to read '/count.xml' after all documents have been iterated, but with the above code using spawned tasks, let $doc := doc("/count.xml") will not return the latest version, because the spawned tasks run on different threads.
I need a solution where let $doc := doc("/count.xml") waits until all spawned tasks have completed.
I have come across the <result>{fn:true()}</result> option as well, but I do not know whether it will work, because the variable $bs is not used anywhere and the documentation says 'When the calling request uses the value future in any operation, it will automatically wait for the spawned task to complete and it will use the result.'
Is there any other alternative where the let $doc := doc("/count.xml") line is executed only after all spawned tasks have completed?
To process 10 million documents, you would probably need to spawn something like 10,000 batches of 1,000 docs. I don't think that will work well from within MarkLogic.
I'd advise looking into the built-in aggregation features of MarkLogic. See for instance cts:sum-aggregate. You might be able to pre-calculate per-document intermediate results that you could then aggregate at run-time using those aggregation features. That would definitely be the most performant option, and would scale best.
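As a rough sketch of that idea: the element name count and the collection name my-documents below are hypothetical, and the query assumes an element range index has been configured on that element.

```xquery
xquery version "1.0-ml";

(: Sums the indexed <count> values across all matching documents :)
(: straight from the range index, without opening any document. :)
(: "count" and "my-documents" are placeholder names. :)
cts:sum-aggregate(
  cts:element-reference(xs:QName("count")),
  (),
  cts:collection-query("my-documents")
)
```

The result could then be written to '/count.xml' with a single xdmp:document-insert, instead of once per document.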
An alternative would be to orchestrate your calculations from outside of MarkLogic. Otherwise you end up either flooding the task queue, or running into timeout limits, or both. Tools like Corb2 and DMSDK could help with this.
Note: you can indeed make spawns wait for result by using the <result> option, but either use <result>true</result> or <result>{fn:true()}</result> (note the parentheses behind fn:true, it is a function).
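A small sketch of how that option might be used, with three trivial tasks standing in for the real batches. The point is that the spawned results have to be consumed (here by fn:sum) for the caller to wait on them.

```xquery
xquery version "1.0-ml";

(: Each spawn returns a value future because of <result>. :)
let $results :=
  for $batch in 1 to 3
  return
    xdmp:spawn-function(
      function() { $batch * 10 },  (: placeholder for the real batch work :)
      <options xmlns="xdmp:eval"><result>{fn:true()}</result></options>
    )
(: Using the futures in an operation makes the caller wait :)
(: until every spawned task has finished. :)
return fn:sum($results)
```

With millions of documents the concerns above about the task queue and timeouts still apply, so this is best suited to a modest number of batches.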
HTH!
From the requirement as given, one cannot tell the difference between writing the final result of a query across 10 million docs once and writing the result after querying one document at a time. Since your example does no writes to the queried documents, it need not be spawned or run in a separate thread or transaction. Rather, as the other answer says, you can make use of aggregate functions to do a single query over the entire set, compute the final result, and store it in one operation. This will likely run very quickly (or can be made to).
If the requirement is actually that each single document MUST be queried and then, sequentially, a shared document written to, this can only be observed by using separate transactions, serially. It is going to be horrendously slow, almost certainly longer than the timeout for the calling transaction. This means you must orchestrate it from outside, if the requirement is that the same caller starts the process and finishes it (a highly implementation-specific requirement that, if true, is likely to have other implications beyond those given).
Something close that is achievable, but still horrendously slow, is to have an outside query poll the updated shared document and return 'success' once the job is done.
But again, with this many documents, if you are forcing a write transaction for each one, it is going to take longer (or at least is not easily guaranteed NOT to take longer) than a single transaction timeout, so it must be invoked from 'outside'.
This is where I would recommend revisiting the requirements to determine the core functionality/result that is desired and if it is truly required to implement exactly as stated vs a more performant implementation that achieves the desired result.
If the core functionality needed is that every single query be 'checkpointed' with a document update, then there are other implications such as transaction rollback that need to be considered.
I noticed that insert/delete XQueries executed with the BaseX client always return an empty string. I find this very confusing and unintuitive.
Is there a way to find out if the query was "successful" without querying the database again (and using potentially buggy "transitive" logic like "if I deleted a node, there must be 'oldNodeCount-1' nodes in the XML")?
XQuery Update statements do not return anything -- that's how they are defined. But you're not the only one who does not like those restrictions, and BaseX added two ways around this limitation:
Returning Results
By default, it is not possible to mix different types of expressions
in a query result. The outermost expression of a query must either be
a collection of updating or non-updating expressions. But there are
two ways out:
The BaseX-specific update:output() function bridges this gap: it caches the results of its arguments at runtime and returns them after
all updates have been processed. The following example performs an
update and returns a success message:
update:output("Update successful."), insert node <c/> into doc('factbook')/mondial
With the MIXUPDATES option, all updating constraints will be turned off. Returned nodes will be copied before they are modified by
updating expressions. An error is raised if items are returned within
a transform expression.
If you want to modify nodes in main memory, you can use the transform
expression.
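For reference, a transform expression of the kind mentioned in the quoted documentation could look like the sketch below. It only modifies an in-memory copy, so nothing on disk is touched; the element names are made up for illustration.

```xquery
(: XQuery Update Facility transform expression: :)
(: copy a node, modify the copy, return the copy. :)
copy $c := <mondial><country/></mondial>
modify insert node <c/> into $c
return $c
```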
The transform expression will not help you, as you seem to be modifying data on disk. Enabling MIXUPDATES allows you to both update the document and return something at the same time, for example by running something like:
let $node := <c/>
return ($node, insert node $node into doc('factbook')/mondial)
MIXUPDATES allows you to return something which can be further processed. Results are copied before being returned. If you run multiple update operations and do not get the expected results, make sure you understand the concept of the pending update list.
The db:output() function intentionally breaks its interface contract: it is defined to be an updating function (not having any output), but at the same time it prints some information to the query info. You cannot further process these results, but the output can help you debug some issues.
Pending Update List
Either way, you will not get an immediate result from the update; you have to add something on your own. Be aware that updates are not visible until the pending update list is applied, i.e. after the query has finished.
Compatibility
Obviously, these options are BaseX-specific. If you strongly require compatible and standard XQuery, you cannot use these expressions.
I would like to execute this kind of flwor query (I am using Saxon) :
for $baseItem in collection('file:/xmlDir?select=*.xml;recurse=yes')/item
let $itemToRetrieve := xs:string($baseItem/item)
let $itemFilter := xs:string($baseItem/filter)
let $fileName := tokenize("*xmlPath($baseItem)*", '/')[last()]
where $itemFilter = 'test'
return ($itemToRetrieve, $fileName)
This way, when working on a large collection, I could quickly find where the returned items were found by the processor, without having to use an external program such as a find command.
I have tried to use document-uri() and base-uri() functions but without success.
Is there a way to achieve this ?
The document-uri() function should give you what you want. I just tried
collection($someURI)!document-uri(.)
and it works for me provided the items in the collection are all document nodes (but it fails with a type error if the collection includes non-XML resources, which are retrieved as items other than document nodes).
Another approach is to use uri-collection() which gives you the URIs of the resources rather than the resources themselves; you can then fetch the particular resources you want using the doc() function (or json-doc() or unparsed-text() depending on the type of resource).
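Applied to the query in the question, the first approach might look like the sketch below. It assumes every resource in the collection is an XML document, and it uses root() to get from the item element back to its document node.

```xquery
for $baseItem in collection('file:/xmlDir?select=*.xml;recurse=yes')/item
let $itemToRetrieve := xs:string($baseItem/item)
let $itemFilter := xs:string($baseItem/filter)
(: document-uri() needs the document node, not the element :)
let $fileName := tokenize(document-uri(root($baseItem)), '/')[last()]
where $itemFilter = 'test'
return ($itemToRetrieve, $fileName)
```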
I want to tell if an XML document has been constructed (e.g. using xdmp:unquote) or has been retrieved from a database. One method I have tried is to check the document-uri property
declare variable $doc as document-node() external;
if (fn:exists(fn:document-uri($doc))) then
'on database'
else
'in memory'
This seems to work well enough but I can't see anything in the MarkLogic documentation that guarantees this. Is this method reliable? Is there some other technique I should be using?
I think that behavior has been stable for a while. You could always check for the URI too, as long as you expect it to be from the current database:
xdmp:exists(fn:doc(fn:document-uri($doc)))
Or if you are in an update context and need ACID guarantees, use fn:exists.
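Combining the two checks into one function could look like the following sketch. local:in-database is a hypothetical name, and the check only makes sense when the document is expected to come from the current database.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: true only if the node reports a URI :)
(: and a document actually exists at that URI. :)
declare function local:in-database($doc as document-node()) as xs:boolean
{
  let $uri := fn:document-uri($doc)
  return fn:exists($uri) and xdmp:exists(fn:doc($uri))
};

(: A constructed document, as produced by xdmp:unquote :)
local:in-database(xdmp:unquote("<a/>"))
```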
The real test would be to try to call xdmp:node-replace or similar, and catch the expected error. Those node-level update functions do not work on constructed nodes. But that requires an update context, and might be tricky to implement in a robust way.
If your XML document is in memory, you can use the in-mem-update API:
import module namespace mem = "http://xqdev.com/in-mem-update" at "/MarkLogic/appservices/utils/in-mem-update.xqy";
If your XML document exists in your database, you can use fn:exists() or fn:doc-available().
The real test of in-memory versus in-database is xdmp:node-replace.
If you are able to replace, update, or delete a node, then it is in the database; if doing so throws an exception, it is not in the database.
Now there are two situations:
1. Your document has not been created at all:
you can use fn:empty() to check whether it has been created.
2. Your document has been created and is in memory:
if fn:empty() returns false and xdmp:node-replace throws an exception, then it is in memory.