I'm trying to find an example of how to use Saxon's discard-document function. I have about 50 files of 40 MB each, so they use about 4.5 GB of memory in my XQuery script.
I've tried calling saxon:discard-document(doc("filename.xml")) after every access to the XML file, but maybe that is not the correct way to do it? There is no difference in memory usage after using it.
I also found some questions about its usage (from 7 years ago) which suggested running the XPath through discard-document. But I have many calls to that document, so I would have to replace every reference with saxon:discard-document(doc("filename.xml"))/xpath/etc/etc/etc
Thanks
I think this is a good question, and there is not much information available, so I will try to answer it myself.
Here is an example of how to use saxon:discard-document:
declare function local:doStuffInDocument($doc as document-node()) {
  $doc//testPath
};

let $urls := ("http://url1", "http://url2")
let $results :=
  for $url in $urls
  let $doc := saxon:discard-document(doc($url))
  return local:doStuffInDocument($doc)
return $results
Using similar code, I managed to reduce memory consumption from 4+ GB to only 300 MB.
To understand what discard-document does, here is a helpful comment from Michael Kay on the SourceForge mailing list:
Just to explain what discard-document() does:
Saxon maintains (owned by the Transformer/Controller) a table that
maps document URIs to document nodes. When you call the document()
function, Saxon looks to see if the URI is in this table, and if it
is, it returns the corresponding document node. If it isn't, it reads
and parses the resource found at that URI. The effect of
saxon:discard-document() is to remove the entry for a document from
this mapping table. (Of course, if a document is referenced from this
table then the garbage collector will hold the document in memory; if
it is not referenced from the table then it becomes eligible for
garbage collection. It won't be garbage collected if it's referenced
from a global variable; but it will still be absent from the table in
the event that another call on document() uses the same URI again.)
And another one from Michael Kay, found on the Altova mailing list:
In Saxon, if you use the doc() or document() function, then the file
will be loaded into memory, and will stay in memory until the end of
the run, just in case it's referenced again. So you will hit the same
memory problem with lots of small files as with one large file -
worse, in fact, since there is a significant per-document overhead.
However, there's a workaround: an extension function
saxon:discard-document() that causes a document to be discarded from
memory by the garbage collector as soon as there are no more
references to it.
It's probably useful to understand what actually happens under the covers. The doc() function looks in a cache to see if the document is already there; if not, it reads the document, adds it to the cache, and then returns it. The discard-document() function looks to see if the document is in the cache, removes it if it is, and then returns it. By removing the document from the cache, it becomes eligible for garbage collection once it is no longer referenced. If using discard-document has no effect on memory consumption, that's probably because something else is still referencing the document - for example, a global variable.
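That caveat is easy to reproduce. A minimal sketch (the file and element names are made up):

declare namespace saxon = "http://saxon.sf.net/";
(: The cache entry for "big-file.xml" is removed right away, but the global
   variable keeps the tree reachable, so the garbage collector cannot
   reclaim it while the query is still running. :)
declare variable $pinned := saxon:discard-document(doc("big-file.xml"));
count($pinned//record)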
I have a set of events that have been refactored to another package. This works as-is until I execute the event replay. Digging deeper, I noticed a payloadtype column in the domainevententry table and figured changing this would be sufficient, but alas, it seems the XML root element of the event needs to be changed as well. I am hoping there is a simple way to do this.
I cannot find any examples of upcasting to different packages or of using XStream aliasing, so any help would be greatly appreciated.
Thanks
As you noticed, the default payload type stored in events is the fully qualified class name. This ensures that out-of-the-box serialization and deserialization work as intended. Moving classes around, however, means the payload type can no longer be found, requiring some adjustment to be made.
You could use the EventTypeUpcaster, as referred to in the Reference Guide. The EventTypeUpcaster is dedicated to adjusting the payload type, and thus can also be used to deal with changing package names.
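A sketch of what that could look like, assuming a recent Axon 4 release where EventTypeUpcaster exposes a from(...).to(...) builder (the package and class names here are hypothetical, and the revisions are assumed to be unset):

import org.axonframework.serialization.upcasting.event.EventTypeUpcaster;

// Rewrites the stored payload type (and hence the XML root element) from
// the old fully qualified name to the new one during deserialization.
EventTypeUpcaster upcaster =
        EventTypeUpcaster.from("com.example.legacy.SomethingHappenedEvent", null)
                         .to("com.example.relocated.SomethingHappenedEvent", null);

The resulting upcaster then needs to be registered with your upcaster chain, as described in the Reference Guide.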
When using the (default) XStreamSerializer, aliasing the tags would indeed also work. How to set aliases can be seen here, for example. As noted in that sample, the alias is added to the XStream instance. The XStreamSerializer uses an XStream instance to support de-/serialization from/to XML. To adjust the XStream instance, you can simply use the builder paradigm of the XStreamSerializer. The JavaDoc of the builder should be specific enough to help you out with how to use it.
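For instance (a sketch; the class names are hypothetical, and the builder calls assume Axon 4's XStreamSerializer):

import com.thoughtworks.xstream.XStream;
import org.axonframework.serialization.xml.XStreamSerializer;

XStream xstream = new XStream();
// XStream uses the fully qualified class name as the XML root element by
// default, so alias the old name to the relocated class:
xstream.alias("com.example.legacy.SomethingHappenedEvent",
        com.example.relocated.SomethingHappenedEvent.class);

XStreamSerializer serializer = XStreamSerializer.builder()
        .xStream(xstream)
        .build();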
Went the long way round with this, but it seems to work. As always, back up your database before executing large-volume changes. I also restarted the service using the database on completion. Needless to say, I will make sure the events are in logical packages before deploying next time :)
Database Engine: Postgres 10
Table: domainevententry
UPDATE domainevententry
SET
    payloadtype = '<new.package.Classname>',
    payload = lo_from_bytea(0, decode(REPLACE(
        subquery.output,
        '<old.package.Classname>',
        '<new.package.Classname>'
    ), 'escape'))
FROM (
    SELECT eventidentifier, payloadtype, encode(lo_get(payload::oid), 'escape') AS output
    FROM domainevententry
    WHERE eventidentifier IN (
        '<event guid 1>',
        '<event guid 2>'
    )
    AND payloadtype = '<old.package.Classname>'
) AS subquery
WHERE domainevententry.eventidentifier = subquery.eventidentifier;
Once that completed, I needed to update the OWNER of the large object:
ALTER LARGE OBJECT <LargeObjectId> OWNER TO database_role;
Probably not the most elegant solution, but given my time constraints, it did the job. There are probably encoding issues with this solution for the large object, but it all worked out for me in the end. Feel free to share any optimizations that would make the above more suitable.
Firing off the Axon Framework replays rebuilt the projections and everything lined up.
I noticed that (insert/delete) XQueries executed with the BaseX client always return an empty string. I find this very confusing and unintuitive.
Is there a way to find out if the query was "successful" without querying the database again (and using potentially buggy "transitive" logic like "if I deleted a node, there must be 'oldNodeCount-1' nodes in the XML")?
XQuery Update statements do not return anything -- that's how they are defined. But you're not the only one who dislikes those restrictions, and BaseX added two ways around this limitation:
Returning Results
By default, it is not possible to mix different types of expressions
in a query result. The outermost expression of a query must either be
a collection of updating or non-updating expressions. But there are
two ways out:
The BaseX-specific update:output() function bridges this gap: it caches the results of its arguments at runtime and returns them after
all updates have been processed. The following example performs an
update and returns a success message:
update:output("Update successful."), insert node <c/> into doc('factbook')/mondial
With the MIXUPDATES option, all updating constraints will be turned off. Returned nodes will be copied before they are modified by
updating expressions. An error is raised if items are returned within
a transform expression.
If you want to modify nodes in main memory, you can use the transform
expression.
The transform expression will not help you, as you seem to modify the data on disk. Enabling MIXUPDATES allows you to both update the document and return something at the same time, for example by running something like:
let $node := <c/>
return ($node, insert node $node into doc('factbook')/mondial)
MIXUPDATES allows you to return something that can be processed further. Results are copied before being returned; if you run multiple update operations and do not get the expected results, make sure you have understood the concept of the pending update list.
The db:output() function intentionally breaks its interface contract: it is defined as an updating function (not having any output), but at the same time it prints some information to the query info. You cannot process these results further, but the output can help you debug some issues.
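For example (a sketch mirroring the update:output call above, assuming db:output behaves the same way -- it is the older name of that function in BaseX releases before 9.0):

db:output("Update successful."), insert node <c/> into doc('factbook')/mondial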
Pending Update List
Either way, you will not be able to get an immediate result from the update; you have to add something on your own -- and be aware that updates are not visible until the pending update list is applied, i.e. after the query has finished.
Compatibility
Obviously, these options are BaseX-specific. If you strictly require compatible, standard XQuery, you cannot use these expressions.
I want to tell whether an XML document has been constructed (e.g. using xdmp:unquote) or has been retrieved from a database. One method I have tried is to check the document-uri property:
declare variable $doc as document-node() external;

if (fn:exists(fn:document-uri($doc))) then
  'on database'
else
  'in memory'
This seems to work well enough but I can't see anything in the MarkLogic documentation that guarantees this. Is this method reliable? Is there some other technique I should be using?
I think that behavior has been stable for a while. You could also check for the URI, as long as you expect it to be from the current database:
xdmp:exists(fn:doc(fn:document-uri($doc)))
Or, if you are in an update context and need ACID guarantees, use fn:exists.
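That would be the same check as above, just with fn:exists (a one-line sketch):

fn:exists(fn:doc(fn:document-uri($doc)))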
The real test would be to try calling xdmp:node-replace or similar and to catch the expected error. Those node-level update functions do not work on constructed nodes. But that requires an update context, and it might be tricky to implement in a robust way.
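A sketch of that test, assuming an update context and a single root element (the function name is made up, and the error is caught generically rather than matched against a specific code):

declare function local:is-database-doc($doc as document-node()) as xs:boolean {
  try {
    (: replacing a node with itself fails for constructed nodes :)
    xdmp:node-replace($doc/element(), $doc/element()),
    fn:true()
  }
  catch ($e) {
    fn:false()
  }
};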
If your XML document is in memory, you can use the in-mem-update API:
import module namespace mem = "http://xqdev.com/in-mem-update" at "/MarkLogic/appservices/utils/in-mem-update.xqy";
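For instance (a sketch; mem:node-replace is one of the functions that module provides, and the element names are made up):

let $doc := xdmp:unquote('<root><old-element/></root>')
return mem:node-replace($doc/root/old-element, <new-element/>)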
If your XML document exists in your database, you can use fn:exists() or fn:doc-available().
The real test of in-memory vs. in-database is xdmp:node-replace.
If you are able to replace, update, or delete a node, then it is in the database; if it throws an exception, then it is not in the database.
Now there are two situations:
1. Your document has not been created at all: you can use fn:empty() to check whether it exists.
2. Your document has been created and is in memory: if fn:empty() returns false and xdmp:node-replace throws an exception, then it is in memory.
I'm trying to understand when I should call Query.close(Object) or Query.closeAll();
From the docs I see that it will "close a query result" and "release all resources associated with it," but what does "close" mean? Does it mean I can't use objects that I get out of a Query after I've called close on them?
I ask because I'm trying to compartmentalize my code with functions like getUser(id), which will create a Query, fetch the User, and destroy the query. If I have to keep the Query around just to access the User, then I can't really do that compartmentalization.
What happens if I don't call close on an object? Will it get collected as garbage? Will I be able to access that object later?
Please let me know if I can further specify my question somehow.
You can't use the query results after you have closed them (i.e. the List object you got back from query.execute). You can still access the objects in the results if you copied them into your own List, or made references to them in your code. Failure to call close can leak memory.
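A minimal sketch of that copy-then-close pattern, using the plain JDO API (the User class and the surrounding method are hypothetical):

import java.util.ArrayList;
import java.util.List;
import javax.jdo.PersistenceManager;
import javax.jdo.Query;

public List<User> getUsers(PersistenceManager pm) {
    Query query = pm.newQuery(User.class);
    try {
        @SuppressWarnings("unchecked")
        List<User> results = (List<User>) query.execute();
        // Copy into our own list so the objects remain usable after closeAll()
        return new ArrayList<>(results);
    } finally {
        query.closeAll(); // releases the query's resources; the copies survive
    }
}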
When your query method returns a single object it is easy to simply close the query before returning the single object.
On the other hand, when your query method returns a collection, the query method itself cannot close the query before returning the result, because the query needs to stay open while the caller is iterating through the results.
This puts the responsibility for closing a query that returns a collection on the caller, and it can introduce leaks if the caller neglects to close the query - I thought there must be a safer way, and there is!
Guido, a long-time DataNucleus user, created an 'auto-closing' collection facade that wraps the collection returned by JDO's Query.execute method. Usage is extremely simple: wrap the query result inside an instance of the auto-closing collection object.
Instead of returning the Query result set like this:
return q.execute();
simply return an 'auto closing' wrapped version of it:
return new JDOQueryResultCollection(q, q.execute());
To the caller it appears like any other Collection, but the wrapper keeps a reference to the query that created the collection result and automatically closes it when the wrapper is disposed of by the GC.
Guido kindly gave us permission to include his clever auto-closing code in our open-source exPOJO library. The auto-closing classes are completely independent of exPOJO and can be used in isolation. The classes of interest are in the expojo_jdo*.jar file, which can be downloaded from:
http://www.expojo.com/
JDOQueryResultCollection is the only class used directly but it does have a few supporting classes.
Simply add the jar to your project and import com.sas.framework.expojo.jdo.JDOQueryResultCollection into any Java file that includes query methods that return results as a collection.
Alternatively, you can extract the source files from the jar and include them in your project. They are:
JDOQueryResultCollection.java
Disposable.java
AutoCloseQueryIterator.java
ReadonlyIterator.java