eXist-db / XQuery compression:zip() of XML files saves text only

In eXist-db 4.4 (XQuery 3.1), I am automating the compression of a number of XML files. The problem is that the zipped entries contain only the text content of each document, not the XML markup.
This function uses compression:zip() to create a ZIP file from a batch of documents:
declare option exist:serialize "expand-xincludes=no";
declare option exist:serialize "method=xml media-type=application/xml";
declare function zip:create-zip-by-batch()
{
[...]
let $zipobject := compression:zip(zip:get-entry-for-zip($x), false())
let $zipname := "foozipname.zip"
let $store := xmldb:store("/db/foodirectory", $zipname, $zipobject)
return $store
};
The function above calls the following one, which serializes each document and wraps it in an <entry> element, as the documentation describes:
declare option exist:serialize "expand-xincludes=no";
declare option exist:serialize "method=xml media-type=application/xml";
declare function zip:get-entry-for-zip($x)
{
[...for each $foo document in $x, create an <entry>...]
let $serialized := serialize($foo, map { "method": "xml" })
let $entry :=
<entry name="somefooname" type='xml' method='store'>
{$serialized}
</entry>
[...return a sequence of $entry...]
};
I think it's missing a configuration for serialization, but I can't figure it out...
Thanks in advance for any help.

Here is a query for eXist demonstrating how to compress XML documents into a ZIP file and store it in the database:
xquery version "3.1";
(: create a test collection with 10 test files: 1.xml = <x>1</x>
thru 10.xml = <x>10</x> :)
let $prepare := xmldb:create-collection("/db", "test")
let $populate := (1 to 10) ! xmldb:store("/db/test", . || ".xml", <x>{.}</x>)
(: construct zip-bound <entry> elements for the documents in the test collection :)
let $entries := collection("/db/test") !
<entry name="{util:document-name(.)}" type="xml" method="store">{
serialize(., map { "method": "xml" })
}</entry>
(: compress the entries and store in database :)
let $zip := compression:zip($entries, false())
return
xmldb:store("/db", "test.zip", $zip)
The resulting ZIP file contains the 10 test XML documents, intact. For a variant showing how to write the ZIP file to a location on your file system, see https://gist.github.com/joewiz/aa8d84500b1f1478779cdf2cc1934348.
For a fuller discussion of serialization options in eXist, see my answer to an earlier question: https://stackoverflow.com/a/49290616/659732.

Related

How to execute XQuery on all XML documents in the folder

I need to make sure a particular node exists in many XML files. I have to switch the context each time I want to query another document.
Is there any way I can execute XQuery on all documents in the directory without switching the context?
I may be a little late, but the following XQuery will most likely do what you want. It returns the path of each XML file that does not contain a specific element:
let $path := "."
for $file in file:list( $path, true(), '*.xml')
let $path := $path || "/" || $file
where not(
exists(fetch:xml($path)/foo/bar[text() = "Text"])
)
return $path
If you only want to know whether all XML files in a specific directory contain a specific element, the following query might be useful:
declare variable $path := "/Users/michael/Code/foo";
every $doc in file:list($path, true(), '*.xml') (: returns a sequence of file-names :)
=> for-each(concat($path,"/", ?)) (: returns a sequence of full paths to each file :)
=> for-each(fetch:xml#1) (: returns a sequence of documents :)
satisfies exists(
$doc/*/*[text() = "Text"]
)
Hope this helps ;-)

eXist-db serialize is expand-xincludes=no ignored?

In eXist-db 4.4 (XQuery 3.1), I am compressing a number of XML files into a .zip file in a directory. The compression process uses serialize().
The XML files have some large XIncludes which, according to the documentation, are expanded automatically during serialization. I have tried to turn off XInclude expansion in two places in the code (a prolog option declaration and the serialization map), but the serializer still outputs all of the included content:
declare option exist:serialize "expand-xincludes=no";
declare function zip:get-entries-for-zip()
{
(: get documents prefixed by 'MS609' :)
let $pref := "MS609"
(: get list of document names :)
let $doclist := xmldb:get-child-resources($globalvar:URIdata)[starts-with(., $pref)]
(: output serialized entries :)
let $entries :=
for $n in $doclist
return
<entry name="{$n}" type='text' method='store'>
{serialize(doc(concat($globalvar:URIdata, "/", $n)), map { "method": "xml", "expand-xincludes": "no"})}
</entry>
return $entries
};
The XML data with xincludes to reproduce this problem can be found here http://medieval-inquisition.huma-num.fr/downloads under the description "BM MS609 Edition (tei-xml)".
Many thanks in advance.
The expand-xincludes serialization parameter is specific to eXist and, as such (or at least at present), cannot be set using the fn:serialize() function. Instead, use the util:serialize() function:
util:serialize($document, "expand-xincludes=no")
Alternatively, since you're ultimately interested in zipping the contents of a collection, you can skip the explicit serialization step, declare your serialization options in the query's prolog (or set it inline using util:declare-option()), and simply provide the compression:zip() function the URI path(s) to the collections/documents you want to zip. For example:
xquery version "3.1";
declare option exist:serialize "expand-xincludes=no";
let $sources := "/db/apps/my-app/my-data" (: or a sequence of paths to individual docs:) ! xs:anyURI(.)
let $preserve-collection-structure := false()
let $zip := compression:zip($sources, $preserve-collection-structure)
return
xmldb:store("/db", "my-data.zip", $zip)
For more on serialization options in eXist, see my earlier answer to a similar question: https://stackoverflow.com/a/49290616/659732.
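As a minimal sketch of the inline variant mentioned above, the option can be set at run time with util:declare-option() instead of in the prolog. The collection path and ZIP name are the same placeholders used in the example above, and binding the call to a throwaway variable assumes the let clauses are evaluated in order:

```xquery
xquery version "3.1";

(: Placeholder path, as in the example above; replace with your own collection. :)
let $sources := xs:anyURI("/db/apps/my-app/my-data")
(: Set the eXist-specific serialization option inline rather than in the prolog. :)
let $_ := util:declare-option("exist:serialize", "expand-xincludes=no")
let $zip := compression:zip($sources, false())
return
    xmldb:store("/db", "my-data.zip", $zip)
```

The prolog declaration is probably the safer choice when the option should apply to the whole query; the inline form is mainly useful when the value is only known at run time.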

Recursive copy of a folder with XQuery

I have to copy an entire project folder into the MarkLogic server. Instead of doing it manually, I decided to do it with a recursive function, but it is becoming the worst idea I have ever had. I'm having problems with the transactions and with the syntax, and being new to this, I can't find a proper way to solve it. Here's my code; thank you for the help!
import module namespace dls = "http://marklogic.com/xdmp/dls" at "/MarkLogic/dls.xqy";
declare option xdmp:set-transaction-mode "update";
declare function local:recursive-copy($filesystem as xs:string, $uri as xs:string)
{
for $e in xdmp:filesystem-directory($filesystem)/dir:entry
return
if($e/dir:type/text() = "file")
then dls:document-insert-and-manage($e/dir:filename, fn:false(), $e/dir:pathname)
else
(
xdmp:directory-create(concat(concat($uri, data($e/dir:filename)), "/")),
local:recursive-copy($e/dir:pathname, $uri)
)
};
let $filesystemfolder := 'C:\Users\WB523152\Downloads\expath-ml-console-0.4.0\src'
let $uri := "/expath_console/"
return local:recursive-copy($filesystemfolder, $uri)
MLCP would have been nice to use. However, here is my version:
declare option xdmp:set-transaction-mode "update";
declare variable $prefix-replace := ('C:/', '/expath_console/');
declare function local:recursive-copy($filesystem as xs:string){
for $e in xdmp:filesystem-directory($filesystem)/dir:entry
return
if($e/dir:type/text() = "file")
then
let $source := $e/dir:pathname/text()
let $dest := fn:replace($source, $prefix-replace[1], $prefix-replace[2])
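(: xdmp:document-load reads the file at $source and stores it under the <uri> given in the options :)
let $_ := xdmp:document-load($source,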
<options xmlns="xdmp:document-load">
<uri>{$dest}</uri>
</options>)
return <record>
<from>{$source}</from>
<to>{$dest}</to>
</record>
else
local:recursive-copy($e/dir:pathname)
};
let $filesystemfolder := 'C:\Temp'
return <results>{local:recursive-copy($filesystemfolder)}</results>
Please note the following:
I changed my sample to the C:\Temp dir
The output is XML only because by convention I try to do this in case I want to analyze results. It is actually how I found the error related to conflicting updates.
I chose to define a simple prefix replace on the URIs
I saw no need for DLS in your description
I saw no need for the explicit creation of directories in your use case
The reason you were getting conflicting updates is that you were using just the filename as the URI. Across the whole directory structure, these names are not unique, hence the conflicting update when the same URI is inserted twice.
This is not solid code:
You would have to ensure that a URI is valid. Not all filesystem paths/names are OK for a URI, so you would want to test for this and escape chars if needed.
Large filesystems would time out, so spawning in batches may be useful.
As an example, I might gather the list of docs as in my XML and then process that list by spawning a new task for every 100 documents. This could be accomplished by a simple loop over xdmp:spawn-function or by using a library such as taskbot by @mblakele
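The batching idea can be sketched roughly as follows. This is an illustration under stated assumptions, not tested code: local:list-files is a hypothetical helper, the batch size of 100 and the C:\Temp folder come from the answer above, the prefix swap is copied from the answer's $prefix-replace, and the transaction-mode option follows the style of MarkLogic 7/8 (newer releases may prefer a different option):

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: collect every file path below a filesystem directory. :)
declare function local:list-files($dir as xs:string) as xs:string*
{
  for $e in xdmp:filesystem-directory($dir)/dir:entry
  return
    if ($e/dir:type = "file")
    then fn:string($e/dir:pathname)
    else local:list-files(fn:string($e/dir:pathname))
};

let $batch-size := 100
let $files := local:list-files('C:\Temp')
(: Start positions of the batches: 1, 101, 201, ... :)
for $start in (1 to fn:count($files))[(. - 1) mod $batch-size = 0]
let $batch := fn:subsequence($files, $start, $batch-size)
return
  (: Each batch is loaded by its own task on the task server. :)
  xdmp:spawn-function(
    function() {
      for $source in $batch
      (: Same prefix swap as in the answer above; adjust to your own URI scheme. :)
      let $dest := fn:replace($source, 'C:/', '/expath_console/')
      return
        xdmp:document-load($source,
          <options xmlns="xdmp:document-load">
            <uri>{$dest}</uri>
          </options>)
    },
    <options xmlns="xdmp:eval">
      <transaction-mode>update-auto-commit</transaction-mode>
    </options>
  )
```

Because each task commits on its own, a failure in one batch should not roll back the others, which also means a partial load is possible and may need to be checked afterwards.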

MarkLogic - How to insert element into XML

How do I insert a node into an XML document?
let $a := <a><b>bbb</b></a>)
return
xdmp:node-insert-after(doc("/example.xml")/a/b, <c>ccc</c>);
Expected Output:
<a><c>ccc</c><b>bbb</b></a>
Please help to get the output.
I believe you should be using xdmp:node-insert-before, in the following way:
xdmp:document-insert('/example.xml', <a><b>bbb</b></a>);
xdmp:node-insert-before(fn:doc('/example.xml')/a/b, <c>ccc</c>);
fn:doc('/example.xml');
(: returns <a><c>ccc</c><b>bbb</b></a> :)
Nodes are immutable, so in-memory mutation can only be done by creating a new copy.
The copy can use the unmodified contained nodes from the original:
declare function local:insert-after(
$prior as node(),
$inserted as node()+
) as element()
{
let $container := $prior/parent::element()
return element {fn:node-name($container)} {
$container/namespace::*,
$container/attribute(),
$prior/preceding-sibling::node(),
$prior,
$inserted,
$prior/following-sibling::node()
}
};
let $a := <a><b>bbb</b></a>
return local:insert-after($a//b, <c>ccc</c>)
Creating a copy in memory and then inserting the copy is faster than inserting and modifying a document in the database.
Depending on how many documents are inserted, the difference could be significant.
There are community libraries for copying with changes, but sometimes it's as easy to write a quick function (recursive where necessary).
You can use the code below to insert the element into the XML:
xdmp:node-insert-child(fn:doc('directory URI'),element {fn:QName('http://yournamespace','elementName') }{$elementValue})
Here fn:QName places the new element in the target namespace, which prevents xmlns="" from being added to the inserted node.
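A small in-memory sketch of the difference, using a made-up namespace URI (http://example.com/ns is only a placeholder): an element built with a literal constructor and no namespace declaration is in no namespace, so when it is inserted under a parent that has a default namespace, the serializer has to emit xmlns="" to undeclare it. An element built with fn:QName carries the namespace itself.

```xquery
xquery version "1.0-ml";

(: Placeholder namespace URI, for illustration only. :)
let $ns := "http://example.com/ns"
(: Literal constructor with no namespace declared: the element is in no namespace. :)
let $no-ns := <c>ccc</c>
(: Computed constructor with fn:QName: the element is in $ns. :)
let $in-ns := element { fn:QName($ns, "c") } { "ccc" }
return (
  fn:namespace-uri($no-ns),  (: the empty (zero-length) URI :)
  fn:namespace-uri($in-ns)   (: http://example.com/ns :)
)
```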

split document by using MarkLogic Flow Editor

I am trying to split my incoming documents using "Information Studio Flows" (MarkLogic v 8.0-1.1). The problem is in the "Transform" section.
This is the document I am importing. For simplicity, I have reduced its content to a single stwtext element:
<docs>
<stwtext id="RD-10-00258" update="03.2011" seq="RQ-10-00001">
<head>
<ti>
<i>j</i>
</ti>
<ff-list>
<ff id="0103"/>
</ff-list>
</head><p>
Symbol für die
<vw idref="RD-19-04447">Stromdichte</vw>
.
</p>
</stwtext>
</docs>
This is my "xquery transform" content:
xquery version "1.0-ml";
(: Copyright 2002-2015 MarkLogic Corporation. All Rights Reserved. :)
(:
:: Custom action. It must be a CPF action module.
:: Replace this text completely, or use it as a template and
:: add imports, declarations,
:: and code between START and END comment tags.
:: Uses the external variables:
:: $cpf:document-uri: The document being processed
:: $cpf:transition: The transition being executed
:)
import module namespace cpf = "http://marklogic.com/cpf"
at "/MarkLogic/cpf/cpf.xqy";
(: START custom imports and declarations; imports must be in Modules/ on filesystem :)
(: END custom imports and declarations :)
declare option xdmp:mapping "false";
declare variable $cpf:document-uri as xs:string external;
declare variable $cpf:transition as node() external;
if ( cpf:check-transition($cpf:document-uri,$cpf:transition))
then
try {
(: START your custom XQuery here :)
let $doc := fn:doc($cpf:document-uri)
return
xdmp:eval(
for $wpt in fn:doc($doc)//stwtext
return
xdmp:document-insert(
fn:concat("/rom-data/", fn:concat($wpt/@id,".xml")),
$wpt
)
)
(: END your custom XQuery here :)
,
cpf:success( $cpf:document-uri, $cpf:transition, () )
}
catch ($e) {
cpf:failure( $cpf:document-uri, $cpf:transition, $e, () )
}
else ()
When I run this snippet, I get the error:
Invalid URI format
and its long description:
XDMP-URI: (err:FODC0005) fn:doc(fn:doc("/8122584828241226495/12835482492021535301/URI=/content/home/admin/Vorlagen/testing/v10.new-ML.xml")) -- Invalid URI format: "
j
Symbol für die
Stromdichte
"
In /18200382103958065126.xqy on line 37
In xdmp:invoke("/18200382103958065126.xqy", (xs:QName("trgr:uri"), "/8122584828241226495/12835482492021535301/URI=/content/home/admi...", xs:QName("trgr:trigger"), ...), <options xmlns="xdmp:eval"><isolation>different-transaction</isolation><prevent-deadlocks>t...</options>)
$doc = fn:doc("/8122584828241226495/12835482492021535301/URI=/content/home/admin/Vorlagen/testing/v10.new-ML.xml")
In /MarkLogic/cpf/triggers/internal-cpf.xqy on line 179
In execute-action("on-state-enter", "http://marklogic.com/states/initial", "/8122584828241226495/12835482492021535301/URI=/content/home/admi...", (xs:QName("trgr:uri"), "/8122584828241226495/12835482492021535301/URI=/content/home/admi...", xs:QName("trgr:trigger"), ...), <options xmlns="xdmp:eval"><isolation>different-transaction</isolation><prevent-deadlocks>t...</options>, (fn:doc("http://marklogic.com/cpf/pipelines/14379829270688061297.xml")/p:pipeline, fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline), fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline/p:state-transition[1]/p:default-action, fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline/p:state-transition[1])
$caller = "on-state-enter"
$state-or-status = "http://marklogic.com/states/initial"
$uri = "/8122584828241226495/12835482492021535301/URI=/content/home/admi..."
$vars = (xs:QName("trgr:uri"), "/8122584828241226495/12835482492021535301/URI=/content/home/admi...", xs:QName("trgr:trigger"), ...)
$invoke-options = <options xmlns="xdmp:eval"><isolation>different-transaction</isolation><prevent-deadlocks>t...</options>
$pipelines = (fn:doc("http://marklogic.com/cpf/pipelines/14379829270688061297.xml")/p:pipeline, fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline)
$action-to-execute = fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline/p:state-transition[1]/p:default-action
$chosen-transition = fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline/p:state-transition[1]
$raw-module-name = "/18200382103958065126.xqy"
$module-kind = "xquery"
$module-name = "/18200382103958065126.xqy"
In /MarkLogic/cpf/triggers/internal-cpf.xqy on line 320
I thought it was a problem with the "Document setting" in the "load" section of the "Flow editor":
URI=/content{$path}/{$filename}{$dot-ext}
but if I remove it, I receive the same error.
I have no idea what to do. I am really new to this. Please help.
First of all, Information Studio has been deprecated in MarkLogic 8. I would also very much recommend looking into the aggregate_record feature of MarkLogic Content Pump:
http://docs.marklogic.com/guide/ingestion/content-pump#id_65814
Apart from that, there are several issues with your code. You are calling fn:doc twice, effectively trying to interpret the document's contents as a URI. There is also an unnecessary xdmp:eval wrapping the FLWOR expression; xdmp:eval expects a string as its first parameter. I think you can shorten it to the following (showing the inner part of the action only):
(: START your custom XQuery here :)
let $doc := fn:doc($cpf:document-uri)
for $wpt in $doc//stwtext
return
xdmp:document-insert(
fn:concat("/roempp-data/", fn:concat($wpt/@id,".xml")),
$wpt
)
(: END your custom XQuery here :)
HTH!
Many thanks, @grtjn. This is my approach; it is practically the same solution:
(: START your custom XQuery here :)
xdmp:log(fn:doc($cpf:document-uri), "debug"),
let $doc := fn:doc($cpf:document-uri)
return
xdmp:eval('
declare variable $doc external;
for $wpt in $doc//stwtext
return (
xdmp:document-insert(
fn:concat("/roempp-data/", fn:concat($wpt/@id,".xml")),
$wpt,
xdmp:default-permissions(),
"roempp-data"
)
)'
,
(xs:QName("doc"), $doc),
<options xmlns="xdmp:eval">
<database>{xdmp:database("roempp-tutorial")}</database>
</options>
)
(: END your custom XQuery here :)
OK, now it works. That is fine, but I found that after the loading is finished, I see two documents in MarkLogic:
my split document "/rom-data/RD-10-00258.xml" with one root element "stwtext" (as desired)
the original document "URI=/content/home/admin/Vorlagen/testing/v10.new-ML.xml" with root element "docs"
Is it possible to prevent the original document from being inserted?
