Split document by using MarkLogic Flow Editor

I am trying to split my incoming documents using "Information Studio Flows" (MarkLogic v 8.0-1.1). The problem is in the "Transform" section.
This is the document I am importing. For simplicity I have reduced its content to one stwtext element:
<docs>
<stwtext id="RD-10-00258" update="03.2011" seq="RQ-10-00001">
<head>
<ti>
<i>j</i>
</ti>
<ff-list>
<ff id="0103"/>
</ff-list>
</head><p>
Symbol für die
<vw idref="RD-19-04447">Stromdichte</vw>
.
</p>
</stwtext>
</docs>
This is my "xquery transform" content:
xquery version "1.0-ml";
(: Copyright 2002-2015 MarkLogic Corporation. All Rights Reserved. :)
(:
:: Custom action. It must be a CPF action module.
:: Replace this text completely, or use it as a template and
:: add imports, declarations,
:: and code between START and END comment tags.
:: Uses the external variables:
:: $cpf:document-uri: The document being processed
:: $cpf:transition: The transition being executed
:)
import module namespace cpf = "http://marklogic.com/cpf"
at "/MarkLogic/cpf/cpf.xqy";
(: START custom imports and declarations; imports must be in Modules/ on filesystem :)
(: END custom imports and declarations :)
declare option xdmp:mapping "false";
declare variable $cpf:document-uri as xs:string external;
declare variable $cpf:transition as node() external;
if ( cpf:check-transition($cpf:document-uri,$cpf:transition))
then
try {
(: START your custom XQuery here :)
let $doc := fn:doc($cpf:document-uri)
return
xdmp:eval(
for $wpt in fn:doc($doc)//stwtext
return
xdmp:document-insert(
fn:concat("/rom-data/", fn:concat($wpt/@id,".xml")),
$wpt
)
)
(: END your custom XQuery here :)
,
cpf:success( $cpf:document-uri, $cpf:transition, () )
}
catch ($e) {
cpf:failure( $cpf:document-uri, $cpf:transition, $e, () )
}
else ()
When I run the snippet, I get the error:
Invalid URI format
and its long description:
XDMP-URI: (err:FODC0005) fn:doc(fn:doc("/8122584828241226495/12835482492021535301/URI=/content/home/admin/Vorlagen/testing/v10.new-ML.xml")) -- Invalid URI format: "
j
Symbol für die
Stromdichte
"
In /18200382103958065126.xqy on line 37
In xdmp:invoke("/18200382103958065126.xqy", (xs:QName("trgr:uri"), "/8122584828241226495/12835482492021535301/URI=/content/home/admi...", xs:QName("trgr:trigger"), ...), <options xmlns="xdmp:eval"><isolation>different-transaction</isolation><prevent-deadlocks>t...</options>)
$doc = fn:doc("/8122584828241226495/12835482492021535301/URI=/content/home/admin/Vorlagen/testing/v10.new-ML.xml")
In /MarkLogic/cpf/triggers/internal-cpf.xqy on line 179
In execute-action("on-state-enter", "http://marklogic.com/states/initial", "/8122584828241226495/12835482492021535301/URI=/content/home/admi...", (xs:QName("trgr:uri"), "/8122584828241226495/12835482492021535301/URI=/content/home/admi...", xs:QName("trgr:trigger"), ...), <options xmlns="xdmp:eval"><isolation>different-transaction</isolation><prevent-deadlocks>t...</options>, (fn:doc("http://marklogic.com/cpf/pipelines/14379829270688061297.xml")/p:pipeline, fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline), fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline/p:state-transition[1]/p:default-action, fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline/p:state-transition[1])
$caller = "on-state-enter"
$state-or-status = "http://marklogic.com/states/initial"
$uri = "/8122584828241226495/12835482492021535301/URI=/content/home/admi..."
$vars = (xs:QName("trgr:uri"), "/8122584828241226495/12835482492021535301/URI=/content/home/admi...", xs:QName("trgr:trigger"), ...)
$invoke-options = <options xmlns="xdmp:eval"><isolation>different-transaction</isolation><prevent-deadlocks>t...</options>
$pipelines = (fn:doc("http://marklogic.com/cpf/pipelines/14379829270688061297.xml")/p:pipeline, fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline)
$action-to-execute = fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline/p:state-transition[1]/p:default-action
$chosen-transition = fn:doc("http://marklogic.com/cpf/pipelines/15861601524191348323.xml")/p:pipeline/p:state-transition[1]
$raw-module-name = "/18200382103958065126.xqy"
$module-kind = "xquery"
$module-name = "/18200382103958065126.xqy"
In /MarkLogic/cpf/triggers/internal-cpf.xqy on line 320
I thought it was a problem with the "Document setting" in the "load" section of the "Flow editor":
URI=/content{$path}/{$filename}{$dot-ext}
but if I remove it, I receive the same error.
I have no idea what to do. I am really new to this. Please help.

First of all, Information Studio has been deprecated in MarkLogic 8. I would also strongly recommend looking into the aggregate_record feature of MarkLogic Content Pump:
http://docs.marklogic.com/guide/ingestion/content-pump#id_65814
Apart from that, there are several issues with your code. You are calling fn:doc twice, effectively trying to interpret the doc contents as a uri. There is an unnecessary xdmp:eval wrapping the FLWOR statement, which expects a string as first param. I think you can shorten it to (showing inner part of the action only):
(: START your custom XQuery here :)
let $doc := fn:doc($cpf:document-uri)
for $wpt in $doc//stwtext
return
xdmp:document-insert(
fn:concat("/roempp-data/", fn:concat($wpt/@id,".xml")),
$wpt
)
(: END your custom XQuery here :)
HTH!

Many thanks, @grtjn. This is my approach; in practice it is the same solution:
(: START your custom XQuery here :)
xdmp:log(fn:doc($cpf:document-uri), "debug"),
let $doc := fn:doc($cpf:document-uri)
return
xdmp:eval('
declare variable $doc external;
for $wpt in $doc//stwtext
return (
xdmp:document-insert(
fn:concat("/roempp-data/", fn:concat($wpt/@id,".xml")),
$wpt,
xdmp:default-permissions(),
"roempp-data"
)
)'
,
(xs:QName("doc"), $doc),
<options xmlns="xdmp:eval">
<database>{xdmp:database("roempp-tutorial")}</database>
</options>
)
(: END your custom XQuery here :)
OK, now it works. That is fine, but I found that after the load is finished, I see two documents in MarkLogic:
my split document "/rom-data/RD-10-00258.xml" with the single root element "stwtext" (as desired)
the original document "URI=/content/home/admin/Vorlagen/testing/v10.new-ML.xml" with the root element "docs"
Is it possible to prevent the original document from being inserted?

Related

How to manipulate file-paths

I know this seems like a duplicate, and I am sure it more or less is ...
However, it really bugs me, and I cannot make anything of the posts before:
I am building a digital edition, utilizing TEI, XML, XSLT (and probably eXist-db; maybe I will switch to Node/JavaScript).
I built a function that should transform each file in a specified directory to HTML. (My XSL file works well.)
declare function app:XMLtoHTML-forAll ($node as node(), $model as map(*), $query as xs:string?){
let $ref := xs:string(request:get-parameter("document", ""))
let $xml := doc(concat("/db/apps/BookOfOrders/data/edition/",$ref))
let $xsl := doc("/db/apps/BookOfOrders/resources/xslt/xmlToHtml.xsl")
let $params :=
<parameters>
{for $p in request:get-parameter-names()
let $val := request:get-parameter($p,())
where not($p = ("document","directory","stylesheet"))
return
<param name="{$p}" value="{$val}"/>
}
</parameters>
return
transform:transform($xml, $xsl, $params)
};
There is a list of files in the apps/BookofOrders/data/edition/ named FolioX.html, where x is the page-number. (I'll probably change names to [FolioNumber].xml, but that's not the issue)
I am trying to make a text slider (so that when I open the page, a page is presented and further buttons are created, and I can slide to the right and read the rest of the pages).
I have a table of content, that is linked to the transformed files:
declare function app:toc($node as node(), $model as map(*)) {
for $doc in collection("/db/apps/BookOfOrders/data/edition")/tei:TEI
return
<li>{document-uri(root($doc))}</li>
};
I guess I am wondering on how to change the link inside to for example Folio29 to Folio30.
Can I take a part of the provided link and make the destination of a link flexible, similar but not identical to what I did in the toc-function above?
I'd be really happy if anyone could point me in the right direction.

Given an expression like document-uri(root($doc)) (perhaps more simply util:document-name($doc), since you're using eXist) that returns the path to (or filename of) the document ending in "FolioX", you just need to isolate X, then cast it as an integer so you can perform addition/subtraction on the value:
document-uri(root($doc)) => substring-after("Folio") => xs:integer()
util:document-name($doc) => substring-after("Folio") => xs:integer()
Then add 1, and you've got your next document. Subtract 1, and you've got the previous one.
However, this could lead to broken links: Folio0 or Folio98 (assuming there are only 97). To avoid this, you might want to retrieve the complete list of Folios, find the current position, and then never hit 0 or 98:
let $this-folio := $doc => util:document-name()
let $collection := $doc => util:collection-name()
let $all-folios := xmldb:get-child-resources($collection)
(: sort the filenames using UCA Numeric collation to ensure Folio2 < Folio10.
: see https://www.w3.org/TR/xpath-functions-31/#uca-collations :)
let $sorted-folios := $all-folios => sort("?numeric=yes")
let $this-folio-n := index-of($sorted-folios, $this-folio)
let $prev-folio := if ($this-folio-n gt 1) then "Folio" || $this-folio-n - 1 else ()
let $next-folio := if ($this-folio-n lt count($all-folios)) then "Folio" || $this-folio-n + 1 else ()
return
<nav>
<prev>{$prev-folio}</prev>
<this>{"Folio" || $this-folio-n}</this>
<next>{$next-folio}</next>
</nav>
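
The "isolate X, cast it, then add or subtract" step can also be sketched as a small standalone function. This is only an illustration under assumptions: the helper name local:neighbour is hypothetical, and it assumes file names of the form "Folio" + number + extension, as described in the question.

```xquery
xquery version "3.1";

(: Hypothetical helper, not part of the original answer.
 : Assumes names such as "Folio29.xml" or "Folio29.html". :)
declare function local:neighbour($name as xs:string, $offset as xs:integer) as xs:string {
  let $n :=
    $name
    => substring-after("Folio")      (: "29.xml" :)
    => replace("\.[^.]+$", "")       (: drop the extension: "29" :)
    => xs:integer()
  return "Folio" || ($n + $offset)
};

(
  local:neighbour("Folio29.xml", 1),   (: next page :)
  local:neighbour("Folio29.xml", -1)   (: previous page :)
)
```

This version does no bounds checking, so it still needs the list-based guard above to avoid linking to a page that does not exist.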

eXist-db serialize is expand-xincludes=no ignored?

In eXist-db 4.4, XQuery 3.1, I am compressing a number of XML files to a .zip in a directory. The compression process uses serialize().
The XML files have some large xincludes which according to the documentation are automatically processed in serializing. I have attempted to 'turn off' the xinclude serialization in two places in the code (prologue declare and map), but the serializer is still outputting all xincludes:
declare option exist:serialize "expand-xincludes=no";
declare function zip:get-entries-for-zip()
{
(: get documents prefixed by 'MS609' :)
let $pref := "MS609"
(: get list of document names :)
let $doclist := xmldb:get-child-resources($globalvar:URIdata)[starts-with(., $pref)]
(: output serialized entries :)
let $entries :=
for $n in $doclist
return
<entry name="{$n}" type='text' method='store'>
{serialize(doc(concat($globalvar:URIdata, "/", $n)), map { "method": "xml", "expand-xincludes": "no"})}
</entry>
return $entries
};
The XML data with xincludes to reproduce this problem can be found here http://medieval-inquisition.huma-num.fr/downloads under the description "BM MS609 Edition (tei-xml)".
Many thanks in advance.

The expand-xincludes serialization parameter is specific to eXist and, as such (or at least at present), cannot be set using the fn:serialize() function. Instead, use the util:serialize() function:
util:serialize($document, "expand-xincludes=no")
Alternatively, since you're ultimately interested in zipping the contents of a collection, you can skip the explicit serialization step, declare your serialization options in the query's prolog (or set it inline using util:declare-option()), and simply provide the compression:zip() function the URI path(s) to the collections/documents you want to zip. For example:
xquery version "3.1";
declare option exist:serialize "expand-xincludes=no";
let $sources := "/db/apps/my-app/my-data" (: or a sequence of paths to individual docs:) ! xs:anyURI(.)
let $preserve-collection-structure := false()
let $zip := compression:zip($sources, $preserve-collection-structure)
return
xmldb:store("/db", "my-data.zip", $zip)
For more on serialization options in eXist, see my earlier answer to a similar question: https://stackoverflow.com/a/49290616/659732.
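
If the explicit entry-by-entry approach from the question is kept, the loop might look roughly like the sketch below, with util:serialize() swapped in for fn:serialize(). The collection path in $data-collection is a placeholder standing in for the question's $globalvar:URIdata, which is not shown in the post.

```xquery
xquery version "3.1";

(: Placeholder path; substitute the real data collection. :)
declare variable $data-collection := "/db/apps/my-app/my-data";

(: Build one <entry> per matching document, serializing each
 : document without expanding its XIncludes. :)
for $n in xmldb:get-child-resources($data-collection)[starts-with(., "MS609")]
return
  <entry name="{$n}" type="text" method="store">{
    util:serialize(doc($data-collection || "/" || $n), "expand-xincludes=no")
  }</entry>
```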

Recursive copy of a folder with XQuery

I have to copy an entire project folder into the MarkLogic server, and instead of doing it manually I decided to do it with a recursive function, but it is becoming the worst idea I have ever had. I'm having problems with the transactions and with the syntax, and being new I can't find a proper way to solve it. Here's my code; thank you for the help!
import module namespace dls = "http://marklogic.com/xdmp/dls" at "/MarkLogic/dls.xqy";
declare option xdmp:set-transaction-mode "update";
declare function local:recursive-copy($filesystem as xs:string, $uri as xs:string)
{
for $e in xdmp:filesystem-directory($filesystem)/dir:entry
return
if($e/dir:type/text() = "file")
then dls:document-insert-and-manage($e/dir:filename, fn:false(), $e/dir:pathname)
else
(
xdmp:directory-create(concat(concat($uri, data($e/dir:filename)), "/")),
local:recursive-copy($e/dir:pathname, $uri)
)
};
let $filesystemfolder := 'C:\Users\WB523152\Downloads\expath-ml-console-0.4.0\src'
let $uri := "/expath_console/"
return local:recursive-copy($filesystemfolder, $uri)

MLCP would have been nice to use. However, here is my version:
declare option xdmp:set-transaction-mode "update";
declare variable $prefix-replace := ('C:/', '/expath_console/');
declare function local:recursive-copy($filesystem as xs:string){
for $e in xdmp:filesystem-directory($filesystem)/dir:entry
return
if($e/dir:type/text() = "file")
then
let $source := $e/dir:pathname/text()
let $dest := fn:replace($source, $prefix-replace[1], $prefix-replace[2])
let $_ := xdmp:document-load($source,
<options xmlns="xdmp:document-load">
<uri>{$dest}</uri>
</options>)
return <record>
<from>{$source}</from>
<to>{$dest}</to>
</record>
else
local:recursive-copy($e/dir:pathname)
};
let $filesystemfolder := 'C:\Temp'
return <results>{local:recursive-copy($filesystemfolder)}</results>
Please note the following:
I changed my sample to the C:\Temp dir
The output is XML only because by convention I try to do this in case I want to analyze results. It is actually how I found the error related to conflicting updates.
I chose to define a simple prefix replace on the URIs
I saw no need for DLS in your description
I saw no need for the explicit creation of directories in your use case
The reason you were getting conflicting updates is that you were using just the filename as the URI. Across the whole directory structure, these names were not unique, hence the conflicting update when the same URI was inserted twice.
This is not solid code:
You would have to ensure that a URI is valid. Not all filesystem paths/names are OK for a URI, so you would want to test for this and escape chars if needed.
Large filesystems would time-out, so spawning in batches may be useful.
As an example, I might gather the list of docs as in my XML and then process that list by spawning a new task for every 100 documents. This could be accomplished by a simple loop over xdmp:spawn-function or by using a library such as taskbot by @mblakele
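
A rough sketch of that batching idea follows, under several assumptions: the file list in $paths is made-up sample data (in practice it would come from walking the directory as above), the URI mapping is a simple backslash-to-slash replacement chosen for illustration, and the explicit xdmp:commit() with transaction-mode "update" follows the MarkLogic 7/8 style of calling xdmp:spawn-function.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample input; replace with the real list of files. :)
declare variable $paths as xs:string* := ("C:\Temp\a.xml", "C:\Temp\sub\b.xml");
declare variable $batch-size as xs:integer := 100;

for $i in 1 to (fn:count($paths) + $batch-size - 1) idiv $batch-size
let $batch := fn:subsequence($paths, ($i - 1) * $batch-size + 1, $batch-size)
return
  (: Each batch is loaded by its own task on the task server. :)
  xdmp:spawn-function(
    function() {
      for $p in $batch
      return
        xdmp:document-load($p,
          <options xmlns="xdmp:document-load">
            <uri>{fn:replace($p, "\\", "/")}</uri>
          </options>),
      xdmp:commit()
    },
    <options xmlns="xdmp:eval">
      <transaction-mode>update</transaction-mode>
    </options>
  )
```

Keeping each task small means a single slow or failing batch does not time out or roll back the whole load.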

How to tidy-up Processing Instructions in Marklogic

I have content in my legacy database that is neither valid HTML nor valid XML. Since it would be difficult to clean up the legacy data, I want to tidy it up in MarkLogic using xdmp:tidy. I am currently using ML 8.
<sub>
<p>
<???†?>
</p>
</sub>
I'm passing this content to tidy functionality in a way :
declare variable $xml as node() :=
<content>
<![CDATA[<p><???†?></p>]]>
</content>;
xdmp:tidy(xdmp:quote($xml//text()),
<options xmlns="xdmp:tidy">
<assume-xml-procins>yes</assume-xml-procins>
<quiet>yes</quiet>
<tidy-mark>no</tidy-mark>
<enclose-text>yes</enclose-text>
<indent>yes</indent>
</options>)
As a result it returns :
<p>
<? ?†?>
</p>
This result is not valid XML (I checked it with an XML validator), so when I try to insert it into MarkLogic, it throws an error saying 'MALFORMED BODY | Invalid Processing Instruction names'.
I did some investigation into PIs, but without much luck. I could have tried saving the content without the PI, but this is not a valid PI either.

That is because what you think is a PI is in fact not a PI.
From W3C:
2.6 Processing Instructions
[Definition: Processing instructions (PIs) allow documents to contain
instructions for applications.]
Processing Instructions
[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
[17] PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
So the PI name cannot start with ? as in your sample ??†
You probably want to clean up the content before you pass it to tidy.
Like below:
declare variable $xml as node() :=
<content><![CDATA[<p>Hello <???†?>world</p>]]></content>;
declare function local:copy($input as item()*) as item()* {
for $node in $input
return
typeswitch($node)
case text()
return fn:replace($node,"<\?[^>]+\?>","")
case element()
return
element {name($node)} {
(: output each attribute in this element :)
for $att in $node/@*
return
attribute {name($att)} {$att}
,
(: output all the sub-elements of this element recursively :)
for $child in $node
return local:copy($child/node())
}
(: otherwise pass it through. Used for text(), comments, and PIs :)
default return $node
};
xdmp:tidy(local:copy($xml),
<options xmlns="xdmp:tidy">
<assume-xml-procins>no</assume-xml-procins>
<quiet>yes</quiet>
<tidy-mark>no</tidy-mark>
<enclose-text>yes</enclose-text>
<indent>yes</indent>
</options>)
This would do the trick to get rid of all PIs (real and fake PIs)
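
To see what the regular expression in the text() branch does in isolation, here is a minimal sketch that applies it to the sample string directly. It uses only fn:replace; the variable name $raw is just for illustration.

```xquery
xquery version "1.0-ml";

(: The same pattern as in local:copy: "<?" followed by one or more
 : characters other than ">", followed by "?>". :)
let $raw := "<p>Hello <???†?>world</p>"
return fn:replace($raw, "<\?[^>]+\?>", "")
(: returns: <p>Hello world</p> :)
```

Because the pattern only requires the "<?" ... "?>" delimiters, it removes well-formed PIs as well as the malformed ones.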
Regards,
Peter

How to dynamically create a search query based on a set of quoted strings in MarkLogic

I have the following query, where I want to build a string of values from a list and use that comma-separated string as an or-query. It does not return any results; however, when I return just the concatenated string, it gives the exact value needed for the query.
The query is as follows:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
declare variable $docURI as xs:string external ;
declare variable $orQuery as xs:string external ;
let $tags :=
<tags>
<tag>"credit"</tag>
<tag>"bank"</tag>
<tag>"private banking"</tag>
</tags>
let $docURI := "/2012-10-22_CSGN.VX_(Citi)_Credit_Suisse_(CSGN.VX)__Model_Update.61198869.xml"
let $orQuery := (string-join($tags/tag, ','))
for $x in cts:search(doc($docURI)/doc/Content/Section/Paragraph, cts:or-query(($orQuery)))
let $r := cts:highlight($x, cts:or-query($orQuery), <b>{$cts:text}</b>)
return <result>{$r}</result>
The exact query that i want to run is :
cts:search(doc($docURI)/doc/Content/Section/Paragraph, cts:or-query(("credit","bank","private banking")))
and when i do
return (string-join($tags/tag, ','))
it gives me exactly what i require
"credit","bank","private banking"
But why does it not return any result in or-query?

The string-join step is not needed. It passes in a single literal string. In XQuery, sequences are your friend.
I think you want to do something like this:
let $tags-to-search := ($tags/tag/text() ! replace(., '^"|"$', '')) (: a sequence of tags :)
return cts:search(doc($docURI)/doc/Content/Section/Paragraph, cts:word-query($tags-to-search))
cts:word-query is the default query used for the second parameter of cts:search if you pass in a string. cts:word-query also returns matches for any of the items in a sequence if it is given one.
https://docs.marklogic.com/cts:word-query
EDIT: Added the replace step for the quotes as suggested by Abel. This is specific to the data as presented by the original question. The overall approach remains the same.
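
The difference between the joined string and a sequence can be checked without running a search at all. The sketch below reuses the $tags element from the question and simply counts the items in each value; the variable names $joined and $terms are illustrative.

```xquery
xquery version "1.0-ml";

let $tags :=
  <tags>
    <tag>"credit"</tag>
    <tag>"bank"</tag>
    <tag>"private banking"</tag>
  </tags>
(: One string that merely looks like three quoted terms. :)
let $joined := fn:string-join($tags/tag, ',')
(: Three separate strings with the surrounding quotes stripped. :)
let $terms := $tags/tag ! fn:replace(., '^"|"$', '')
return (fn:count($joined), fn:count($terms))
(: returns: 1, 3 :)
```

The search engine therefore sees a single phrase containing commas and quote characters in the first case, and three independent terms in the second.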

Maybe you need something like this:
let $orQuery := for $tag in $tags/tag return cts:word-query($tag)

I used fn:tokenize instead, and it worked perfectly for my use case.
That is because I was trying to pass these arguments from Java using the XCC API, and it would not return anything with string values:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
declare variable $docURI as xs:string external ;
declare variable $orQuery as xs:string external ;
let $input := "credit,bank"
let $tokens := fn:tokenize($input, ",")
let $docURI := "2012-11-19 0005.HK (Citi) HSBC Holdings Plc (0005.HK)_ Model Update.61503613.pdf"
for $x in cts:search(fn:doc($docURI), cts:or-query(($tokens)))
let $r := cts:highlight($x, cts:or-query(($tokens)), <b>{$cts:text}</b>)
return <result>{$r}</result>
