I am writing code that needs to return a modified version of an XML node, without changing the original node in the parent document.
How can I copy/clone the node so that the original context will not be connected to/affected by it? I don't want changes made to this node to change the original node in the parent document, just to the copy that my function is returning.
What I'm looking for would be very similar to whatever cts:highlight is doing internally:
Returns a copy of the node, replacing any text matching the query
with the specified expression. You can use this function to easily
highlight any text found in a query. Unlike fn:replace and other
XQuery string functions that match literal text, cts:highlight matches
every term that matches the search, including stemmed matches or
matches with different capitalization. [marklogic docs > cts:highlight]
The easiest way to create a clone/copy of a node is to use the computed document node constructor:
document{ $doc }
If you are cloning a node that is not a document-node() and you want a clone of the original node rather than a document-node(), you can use XPath to select the cloned node from the new document-node():
document{ $foo }/node()
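A quick way to check that the copy is really detached, sketched here with a hypothetical $parent and $foo (illustrative names, not from the question): the is operator compares node identity, and the root of the copy should be the newly constructed document node rather than the original parent.

```xquery
let $parent := <parent><child>text</child></parent>
let $foo := $parent/child
let $copy := document { $foo }/node()
return (
  $copy is $foo,                           (: false: the copy has its own identity :)
  deep-equal($copy, $foo),                 (: true: same content :)
  root($copy) instance of document-node()  (: true: rooted in the new document node :)
)
```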
Just for completeness: in general, the standard XQuery Update Facility has copy-modify expressions that explicitly perform a copy. With no modifications, this is like explicit cloning.
copy $node := $foo
modify ()
return $node
I am not sure if MarkLogic supports this syntax or not though. As far as I know, it uses its own function library for updates.
In-memory XML nodes are not directly modifiable. Instead, you make your desired changes while constructing a new node. If you know XSLT, that can be a good way to do it. If not, you can use an XQuery technique called recursive descent.
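A minimal sketch of recursive descent, assuming you want to rename every old element to new while copying everything else unchanged. The function name local:transform and the element names old and new are hypothetical, chosen only for illustration.

```xquery
(: Rebuild the tree top-down, changing nodes on the way.
   The element names "old" and "new" are hypothetical. :)
declare function local:transform($node as node()) as node()* {
  typeswitch ($node)
    case document-node() return
      document { for $child in $node/node() return local:transform($child) }
    case element(old) return
      (: the modification: construct a renamed element :)
      element new {
        $node/@*,
        for $child in $node/node() return local:transform($child)
      }
    case element() return
      (: identity copy of any other element :)
      element { node-name($node) } {
        $node/@*,
        for $child in $node/node() return local:transform($child)
      }
    default return $node
};

local:transform(<root><old id="1">text</old><keep/></root>)
```

The original nodes are never touched. Every element in the result is newly constructed, so the returned tree is independent of the source document.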
I noticed that insert/delete XQueries executed with the BaseX client always return an empty string. I find this confusing and unintuitive.
Is there a way to find out if the query was "successful" without querying the database again (and using potentially buggy "transitive" logic like "if I deleted a node, there must be 'oldNodeCount-1' nodes in the XML")?
XQuery Update statements do not return anything -- that's how they are defined. But you're not the only one who does not like those restrictions, and BaseX added two ways around this limitation:
Returning Results
By default, it is not possible to mix different types of expressions
in a query result. The outermost expression of a query must either be
a collection of updating or non-updating expressions. But there are
two ways out:
The BaseX-specific update:output() function bridges this gap: it caches the results of its arguments at runtime and returns them after
all updates have been processed. The following example performs an
update and returns a success message:
update:output("Update successful."), insert node <c/> into doc('factbook')/mondial
With the MIXUPDATES option, all updating constraints will be turned off. Returned nodes will be copied before they are modified by
updating expressions. An error is raised if items are returned within
a transform expression.
If you want to modify nodes in main memory, you can use the transform
expression.
The transform expression will not help you, as you seem to modify the data on disk. Enabling MIXUPDATES allows you to both update the document and return something at the same time, for example running something like
let $node := <c/>
return ($node, insert node $node into doc('factbook')/mondial)
MIXUPDATES allows you to return something that can be processed further. Results are copied before being returned. If you run multiple update operations and do not get the expected results, make sure you understand the concept of the pending update list.
The update:output() function (named db:output() in older BaseX versions) intentionally breaks its interface contract: it is defined to be an updating function (not having any output), but at the same time it prints some information to the query info. You cannot process these results further, but the output can help you debug some issues.
Pending Update List
Either way, you will not get an immediate result from the update; you have to add something on your own. Be aware that updates are not visible until the pending update list is applied, i.e. after the query has finished.
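As a hedged illustration of the pending update list, the following BaseX-specific sketch reuses the factbook database from the quoted documentation. The count is evaluated while the query runs, before the pending update list is applied, so it should not yet include the newly inserted <c/> element.

```xquery
(: BaseX-specific; assumes the 'factbook' database from the quoted docs.
   The argument of update:output() is evaluated before the insert is applied. :)
update:output(count(doc('factbook')/mondial/c)),
insert node <c/> into doc('factbook')/mondial
```

Running the same query a second time should report a count one higher, because the first insert has been applied by then.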
Compatibility
Obviously, these options are BaseX-specific. If you strongly require compatible and standard XQuery, you cannot use these expressions.
I would like to execute this kind of FLWOR query (I am using Saxon):
for $baseItem in collection('file:/xmlDir?select=*.xml;recurse=yes')/item
let $itemToRetrieve := xs:string($baseItem/item)
let $itemFilter := xs:string($baseItem/filter)
let $fileName := tokenize("*xmlPath($baseItem)*", '/')[last()]
where $itemFilter = 'test'
return ($itemToRetrieve, $fileName)
This way, when working on a large collection, I could quickly find where the returned items were found by the processor, without having to use an external program such as the find command.
I have tried to use the document-uri() and base-uri() functions, but without success.
Is there a way to achieve this?
The document-uri() function should give you what you want. I just tried
collection($someURI)!document-uri(.)
and it works for me, provided the items in the collection are all document nodes. It fails with a type error if the collection includes non-XML resources, which are retrieved as items other than document nodes.
Another approach is to use uri-collection() which gives you the URIs of the resources rather than the resources themselves; you can then fetch the particular resources you want using the doc() function (or json-doc() or unparsed-text() depending on the type of resource).
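Applied to the query from the question, this could look like the sketch below. It assumes every resource in the collection is an XML document with an item root element, and it iterates over the document nodes themselves, because document-uri() returns the empty sequence for anything other than a document node.

```xquery
for $doc in collection('file:/xmlDir?select=*.xml;recurse=yes')
let $baseItem := $doc/item
(: take the last path segment of the document URI as the file name :)
let $fileName := tokenize(document-uri($doc), '/')[last()]
where xs:string($baseItem/filter) = 'test'
return (xs:string($baseItem/item), $fileName)
```

If you prefer to keep iterating over the item elements, document-uri(root($baseItem)) should give the same URI.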
I'm trying to find an example of how to use the discard-document function of Saxon. I have about 50 files of 40 MB each, and they use about 4.5 GB of memory in my XQuery script.
I've tried to use saxon:discard-document(doc("filename.xml")) after every call to the XML file, but maybe this is not the correct way to do it? There is no difference in memory usage after using that.
I also found some questions about its usage (7 years ago), and they were suggesting running the xpath using discard-document. But I have many calls to that document, so I would have to replace all declarations with saxon:discard-document(doc("filename.xml"))/xpath/etc/etc/etc
Thanks
I think this is a good question and there is not much information available so I will try to answer it myself.
Here is an example of how to use saxon:discard-document:
declare function local:doStuffInDocument($doc as document-node()) {
  $doc//testPath
};
let $urls := ("http://url1", "http://url2")
let $results :=
  for $url in $urls
  let $doc := saxon:discard-document(doc($url))
  return local:doStuffInDocument($doc)
return $results
Using similar code, I managed to reduce memory consumption from over 4 GB to only 300 MB.
To understand what discard-document does, here is a great comment from Michael Kay, found on the SF mailing list:
Just to explain what discard-document() does:
Saxon maintains (owned by the Transformer/Controller) a table that
maps document URIs to document nodes. When you call the document()
function, Saxon looks to see if the URI is in this table, and if it
is, it returns the corresponding document node. If it isn't, it reads
and parses the resource found at that URI. The effect of
saxon:discard-document() is to remove the entry for a document from
this mapping table. (Of course, if a document is referenced from this
table then the garbage collector will hold the document in memory; if
it is not referenced from the table then it becomes eligible for
garbage collection. It won't be garbage collected if it's referenced
from a global variable; but it will still be absent from the table in
the event that another call on document() uses the same URI again.)
And another one from Michael Kay, found on the Altova mailing list:
In Saxon, if you use the doc() or document() function, then the file
will be loaded into memory, and will stay in memory until the end of
the run, just in case it's referenced again. So you will hit the same
memory problem with lots of small files as with one large file -
worse, in fact, since there is a significant per-document overhead.
However, there's a workaround: an extension function
saxon:discard-document() that causes a document to be discarded from
memory by the garbage collector as soon as there are no more
references to it.
It's probably useful to understand what actually happens below the covers. The doc() function looks in a cache to see if the document is already there; if not, it reads the document, adds it to the cache, and then returns it. The discard-document() function looks to see if the document is in the cache, removes it if it is, and then returns it. By removing the document from the cache, it makes it eligible for garbage collection when the document is no longer referenced. If using discard-document has no effect on memory consumption, that's probably because there is something else still referencing the document - for example, a global variable.
Can someone tell me the exact difference between node() and element() types in XQuery? The documentation states that element() is an element node, while node() is any node, so if I understand it correctly element() is a subset of node().
The thing is I have an XQuery function like this:
declare function local:myFunction($arg1 as element()) as element() {
let $value := data($arg1/subelement)
etc...
};
Now I want to call the function with a parameter which is obtained by another function, say functionX (which I have no control over):
let $parameter := someNamespace:functionX()
return local:myFunction($parameter)
The problem is that functionX returns a node(), so it will not let me pass the $parameter directly. I tried changing the type of my function to take a node() instead of an element(), but then I can’t seem to read any data from it. $value is just empty.
Is there some way of converting the node to an element, or am I just missing something?
EDIT: As far as I can tell the problem is in the part where I try to get the subelement using $arg1/subelement. Apparently you can do this if $arg1 is an element() but not if it is a node().
UPDATE: I have tested the example provided by Dimitre below, and it indeed works fine, both with Saxon and with eXist DB (which is what I am using as the XQuery engine). The problem actually occurs with the request:get-data() function from eXist DB. This function gets data provided by the POST request when using eXist through REST, parses it as XML and returns it as a node(). But for some reason when I pass the data to another function XQuery doesn’t acknowledge it as being a valid element(), even though it is. If I extract it manually (i.e. copy the output and paste it to my source code), assign it to a variable and pass it to my function all goes well. But if I pass it directly it gives me a runtime error (and indeed fails the instance of test).
I need to be able to either make it ignore this type-check or “typecast” the data to an element().
data() returning empty for an element just because the argument type is node() sounds like a bug to me. What XQuery processor are you using?
It sounds like you need to placate static type checking, which you can do using a treat as expression. I don't believe a dynamic test using instance of will suffice.
Try this:
let $parameter := someNamespace:functionX() treat as element()
return local:myFunction($parameter)
Quoting from the 4th edition of Michael Kay's magnum opus, "The treat as operator is essentially telling the system that you know what the runtime type is going to be, and you want any checking to be deferred until runtime, because you're confident that your code is correct." (p. 679)
UPDATE: I think the above is actually wrong, since treat as is just an assertion. It doesn't change the type annotation node(), which means it's also a wrong assertion and doesn't help you. Hmmm... What I really want is cast as, but that only works for atomic types. I guess I'm stumped. Maybe you should change XQuery engines. :-) I'll report back if I think of something else. Also, I'm curious to find out if Dimitre's solution works for you.
UPDATE #2: I had backpedaled here earlier. Can I backpedal again? ;-) Now my theory is that treat as will work based on the fact that node() is interpreted as a union of the various specific node type annotations, and not as a run-time type annotation itself (see the "Note" in the "Item types" section of the XQuery formal semantics). At run time, the type annotation will be element(). Use treat as to guarantee to the type checker that this will be true. Now I wait with bated breath: does it work for you?
EXPLANATORY ADDENDUM: Assuming this works, here's why. node() is a union type. Actual items at run time are never annotated with node(). "An item type is either an atomic type, an element type, an attribute type, a document node type, a text node type, a comment node type, or a processing instruction type." Notice that node() is not in that list. Thus, your XQuery engine isn't complaining that an item has type node(); rather it's complaining that it doesn't know what the type is going to be (node() means it could end up being attribute(), element(), text(), comment(), processing-instruction(), or document-node()).
Why does it have to know? Because you're telling it elsewhere that it's an element (in your function's signature). It's not enough to narrow it down to one of the above six possibilities. Static type checking means that you have to guarantee, at compile time, that the types will match up (element with element, in this case). treat as is used to narrow down the static type from a general type (node()) to a more specific type (element()). It doesn't change the dynamic type. cast as, on the other hand, is used to convert an item from one type to another, changing both the static and dynamic types (e.g., xs:string to xs:boolean). It makes sense that cast as can only be used with atomic values (and not nodes), because what would it mean to convert an attribute to an element (etc.)? And there's no such thing as converting a node() item to an element() item, because there's no such thing as a node() item. node() only exists as a static union type.
Moral of the story? Avoid XQuery processors that use static type checking. (Sorry for the snarky conclusion; I feel I've earned the right. :-) )
NEW ANSWER BASED ON UPDATED INFORMATION: It sounds like static type checking is a red herring (a big fat one). I believe you are in fact not dealing with an element but a document node, which is the invisible root node that contains the top-level element (document element) in the XPath data model representation of a well-formed XML document.
The tree is thus modeled like this:
[document-node]
|
<docElement>
|
<subelement>
and not like this:
<docElement>
|
<subelement>
I had assumed you were passing the <docElement> node. But if I'm right, you were actually passing the document node (its parent). Since the document node is invisible, its serialization (what you copied and pasted) is indistinguishable from an element node, and the distinction was lost when you pasted what is now interpreted as a bare element constructor in your XQuery. (To construct a document node in XQuery, you have to wrap the element constructor with document{ ... }.)
The instance of test fails because the node is not an element but a document-node. (It's not a node() per se, because there's no such thing; see explanation above.)
Also, this would explain why data() returns empty when you tried to get the <subelement> child of the document node (after relaxing the function argument type to node()). The first tree representation above shows that <subelement> is not a child of the document node; thus it returns the empty sequence.
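The difference can be reproduced in a few lines. This is a sketch with a hand-built document node standing in for whatever request:get-data() returns, using the element names from the tree diagrams above.

```xquery
(: A constructed document node stands in for the parsed request body. :)
let $doc := document { <docElement><subelement>x</subelement></docElement> }
return (
  $doc instance of element(),        (: false :)
  $doc instance of document-node(),  (: true :)
  empty($doc/subelement),            (: true: subelement is not a child of the document node :)
  data($doc/*/subelement)            (: "x": reached through the document element :)
)
```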
Now for the solution. Before passing the (document node) parameter, get its element child (the document element), by appending /* (or /element() which is equivalent) like this:
let $parameter := someNamespace:functionX()/*
return local:myFunction($parameter)
Alternatively, let your function take a document node and update the argument you pass to data():
declare function local:myFunction($arg1 as document-node()) as element() {
let $value := data($arg1/*/subelement)
etc...
};
Finally, it looks like the description of eXist's request:get-data() function is perfectly consistent with this explanation. It says: "If its not a binary document, we attempt to parse it as XML and return a document-node()." (emphasis added)
Thanks for the adventure. This turned out to be a common XPath gotcha (awareness of document nodes), but I learned a few things from our detour into static type checking.
This works perfectly using Saxon 9.3:
declare namespace my = "my:my";
declare namespace their = "their:their";
declare function my:fun($arg1 as element()) as element()
{
$arg1/a
};
declare function their:fun2($arg1 as node()) as node()
{
$arg1
};
my:fun(their:fun2(/*) )
when the code above is applied on the following XML document:
<t>
<a/>
</t>
the correct result is produced with no error messages:
<a/>
Update:
The following should work even with the most punctilious static type-checking XQuery implementation:
declare namespace my = "my:my";
declare namespace their = "their:their";
declare function my:fun($arg1 as element()) as element()
{
$arg1/a
};
declare function their:fun2($arg1 as node()) as node()
{
$arg1
};
let $vRes := their:fun2(/*)
return
  (: the instance of test prevents a run-time type error :)
  if ($vRes instance of element())
  then
    (: treat as assures the static type checker that the type is element() :)
    my:fun($vRes treat as element())
  else ()
node() matches any node: an element, attribute, processing instruction, text node, etc.
But data() atomizes its argument and returns atomic values, which are not nodes.
You might want to try item(), which matches both nodes and atomic values.
See 2.5.4.2 Matching an ItemType and an Item in the W3C XQuery spec.
Although it's not shown in your example code, I assume you are actually returning a value (like the $value you are working with) from the local:myFunction.