BaseX - XQuery - Out of memory when writing results to CSV file

I am trying to write an XQuery result to a CSV file, see the attached code (resulting in at least 1.6 million lines, and probably a lot more later).
However several minutes into execution the program fails with an 'out of main memory' error. I am using a laptop with 4GB of memory. I would have thought that writing to file would prevent memory bottlenecks. Also, I am already using the copynode-false pragma.
I might have gone about the code the wrong way, since this is my first XQuery/BaseX-program. Or this might be non-solvable without extra hardware.. (current Database-SIZE: 3092 MB; NODES: 142477344) Any assistance would be much appreciated!
let $params :=
<output:serialization-parameters xmlns:output="http://www.w3.org/2010/xslt-xquery-serialization">
<output:method value="csv"/>
<output:csv value="header=yes, separator=semicolon"/>
</output:serialization-parameters>
return file:write(
'/tmp/output.csv',
(# db:copynode false #){<csv>{
for $stand in //*:stand
return <record>{$stand//*:kenmerk}</record>
(: {$stand//*:identificatieVanVerblijfsobject}
{$stand//*:inOnderzoek}
{$stand//*:documentdatum}
{$stand//*:documentnummer} :)
}</csv>},
$params
)

It’s a good idea to use the copynode pragma to save memory. In the given case, it’s probably the total amount of newly created element nodes that will simply consume too much memory before the data can be written to disk.
If you have large data sets, the xquery serialization format may be the better choice. Maps and arrays consume less memory than XML nodes:
let $params := map {
'format': 'xquery',
'header': true(),
'separator': 'semicolon'
}
let $data := map {
'names': [
'kenmerk', 'inOnderzoek'
],
'records': (
for $stand in //*:stand
return [
string($stand//*:kenmerk),
string($stand//*:inOnderzoek)
]
)
}
return file:write-text(
'/tmp/output.csv',
csv:serialize($data, $params)
)
Another approach is to use the window clause and write the results in chunks:
for tumbling window $stands in //*:stand
start at $s when true()
end at $e when $e - $s eq 100000
let $first := $s = 1
let $path := '/tmp/output.csv'
let $csv := <csv>{
for $stand in $stands
return <record>{
$stand//*:kenmerk,
$stand//*:inOnderzoek
}</record>
}</csv>
let $params := map {
'method': 'csv',
'csv': map {
'separator': 'semicolon',
'header': $first
}
}
return if ($first) then (
file:write($path, $csv, $params)
) else (
file:append($path, $csv, $params)
)
After the first write operation, subsequent table rows are appended to the original file. The chunk size (here: about 100000 rows per loop) can be freely adjusted. As in your original code, the serialization parameters can also be specified as XML, and it is of course also possible to use the xquery serialization format in the second example.
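To see how the window clause cuts a sequence into chunks, here is a minimal sketch that runs without a database. It assumes a plain integer sequence in place of the //*:stand nodes and a tiny chunk size; the chunk element and its attributes are made up for illustration.

```xquery
xquery version "3.1";

(: Stand-in input: the integers 1 to 10 instead of database nodes. :)
for tumbling window $chunk in 1 to 10
  (: every item not yet covered by a window starts a new one :)
  start at $s when true()
  (: close the window once the end position is 2 past the start position :)
  end at $e when $e - $s eq 2
return
  <chunk first="{$s eq 1}" size="{count($chunk)}">{$chunk}</chunk>
```

This yields four chunks with 3, 3, 3 and 1 items. Note that an end condition of the form $e - $s eq N produces N + 1 items per chunk, and that the last, incomplete window is still returned because the clause does not say "only end".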

Related

How to manipulate file-paths

I know this seems like a duplicate, and I am sure it more or less is ...
However, it really bugs me, and I cannot make anything of the posts before:
I am building a digital edition, utilizing TEI, XML, XSLT (and probably eXist-db; maybe I will switch to Node/JavaScript).
I built a function that should transform each file in a specified directory to HTML. (My XSL file works well.)
declare function app:XMLtoHTML-forAll ($node as node(), $model as map(*), $query as xs:string?){
let $ref := xs:string(request:get-parameter("document", ""))
let $xml := doc(concat("/db/apps/BookOfOrders/data/edition/",$ref))
let $xsl := doc("/db/apps/BookOfOrders/resources/xslt/xmlToHtml.xsl")
let $params :=
<parameters>
{for $p in request:get-parameter-names()
let $val := request:get-parameter($p,())
where not($p = ("document","directory","stylesheet"))
return
<param name="{$p}" value="{$val}"/>
}
</parameters>
return
transform:transform($xml, $xsl, $params)
};
There is a list of files in the apps/BookofOrders/data/edition/ named FolioX.html, where x is the page-number. (I'll probably change names to [FolioNumber].xml, but that's not the issue)
I am trying to make a text slider (so that when I open the page, a page is presented and further buttons are created, and I can slide to the right and read the rest of the pages).
I have a table of content, that is linked to the transformed files:
declare function app:toc($node as node(), $model as map(*)) {
for $doc in collection("/db/apps/BookOfOrders/data/edition")/tei:TEI
return
<li>{document-uri(root($doc))}</li>
};
I guess I am wondering on how to change the link inside to for example Folio29 to Folio30.
Can I take a part of the provided link and make the destination of a link flexible, similar but not identical to what I did in the toc-function above?
I'd be really happy if anyone could point me in the right direction.
Given an expression like document-uri(root($doc)) (perhaps more simply util:document-name($doc), since you're using eXist) that returns the path to (or filename of) the document ending in "FolioX", you just need to isolate X, then cast it as an integer so you can perform addition/subtraction on the value:
document-uri(root($doc)) => substring-after("Folio") => xs:integer()
util:document-name($doc) => substring-after("Folio") => xs:integer()
Then add 1, and you've got your next document. Subtract 1, and you've got the previous one.
However, this could lead to broken links: Folio0 or Folio98 (assuming there are only 97). To avoid this, you might want to retrieve the complete list of Folios, find the current position, and then never hit 0 or 98:
let $this-folio := $doc => util:document-name()
let $collection := $doc => util:collection-name()
let $all-folios := xmldb:get-child-resources($collection)
(: sort the filenames using UCA Numeric collation to ensure Folio2 < Folio10.
: see https://www.w3.org/TR/xpath-functions-31/#uca-collations :)
let $sorted-folios := $all-folios => sort("?numeric=yes")
let $this-folio-n := index-of($sorted-folios, $this-folio)
let $prev-folio := if ($this-folio-n gt 1) then "Folio" || $this-folio-n - 1 else ()
let $next-folio := if ($this-folio-n lt count($all-folios)) then "Folio" || $this-folio-n + 1 else ()
return
<nav>
<prev>{$prev-folio}</prev>
<this>{"Folio" || $this-folio-n}</this>
<next>{$next-folio}</next>
</nav>
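If the folio numbers have gaps, or the file names carry an extension, it may be safer to pick the neighbours directly from the sorted list instead of rebuilding the names from a position. The following is only a sketch under assumptions: the file names are hypothetical stand-ins for what xmldb:get-child-resources would return, and the sort key is the integer extracted from each name, so no collation support is needed.

```xquery
xquery version "3.1";

(: Hypothetical file names standing in for the collection listing. :)
let $all-folios := ("Folio10.xml", "Folio2.xml", "Folio1.xml", "Folio3.xml")
let $this-folio := "Folio3.xml"
(: Sort by the number inside the name, so that Folio2 comes before Folio10. :)
let $sorted := sort($all-folios, (), function($name) {
  xs:integer(replace($name, '\D', ''))
})
let $pos := index-of($sorted, $this-folio)
return
  <nav>
    <prev>{$sorted[$pos - 1]}</prev>
    <this>{$this-folio}</this>
    <next>{$sorted[$pos + 1]}</next>
  </nav>
```

For the sample names this gives Folio2.xml as the previous and Folio10.xml as the next document. At either end of the list the positional lookup simply returns nothing, so the prev or next element stays empty and no broken link is produced.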

XQuery: filter large amounts of data

I have an XML file (about 3 GB) containing 150k entries.
sample entry:
<entry>
.... lots of data here ....
<customer-id>1</customer-id>
</entry>
Each of these entries has a specific customer-id. I have to filter the dataset based on a blacklist (a sequence of 3k ids), for example:
let $blacklist-customers := ('0',
'1',
'2',
'3',
....
'3000')
I currently do the check whether or not the customer-id from each entry is included within the blacklist like this:
for $entry in //entry
let $customer-id:= $entry//customer-id
let $inblacklist := $blacklist = //$customer-id
return if (not($inblacklist)) then $entry else ()
If it is not included, it will be returned.
With this approach, I get an out of main memory error after about 2 minutes of processing.
I tried to adjust the code so that I group first and only check once per group whether it is included in the blacklist. But I still get an out of main memory error that way.
for $entry in //entry
let $customer-id:= $entry//customer-id
group by $customer-id
let $inblacklist := $blacklist = //$customer-id
return if (not($inblacklist)) then $entry else ()
The processing takes place in BaseX.
What are the reasons for the out of main memory error and what is the best approach to solve this problem?
Also does grouping the data reduce the amount of iterations needed if I follow the second approach or not?

Xquery. How to check current incremental backup status?

I have written an XQuery that gets executed while an incremental backup is in progress. I know the backup status returns three possible values:
completed, in-progress and failed. I am not sure of the exact value of the last one, but anyway, this is my XQuery:
xquery version "1.0-ml";
declare function local:escape-for-regex
( $arg as xs:string? ) as xs:string {
replace($arg,
'(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))','\\$1')
} ;
declare function local:substring-before-last
( $arg as xs:string? ,
$delim as xs:string ) as xs:string {
if (matches($arg, local:escape-for-regex($delim)))
then replace($arg,
concat('^(.*)', local:escape-for-regex($delim),'.*'),
'$1')
else ''
} ;
let $server-info := doc("/config/server-info.xml")
let $content-database :="xyzzy"
let $backup-directory:=$server-info/configuration/server-info/backup-directory/text()
let $backup-latest-dateTime := xdmp:filesystem-directory(fn:concat( $backup-directory,'/',$content-database))/dir:entry[1]/dir:filename/text()
let $backup-latest-date := fn:substring-before($backup-latest-dateTime,"-")
let $backup-info := cts:search(/,cts:element-value-query(xs:QName("directory-name"),$backup-latest-date))
let $new-backup := if($backup-info)
then fn:false()
else fn:true()
let $db-bkp-status := if($new-backup)
then (xdmp:database-backup-status(())[./*:forest/*:backup-path[fn:contains(., $backup-latest-dateTime)]][./*:forest/*:incremental-backup eq "false"]/*:status)
else (xdmp:database-backup-status(())[./*:forest/*:backup-path[fn:contains(., $backup-latest-dateTime)]][./*:forest/*:incremental-backup eq "true"][./*:forest/*:incremental-backup-path[fn:contains(., fn:replace(local:substring-before-last(xs:string(fn:current-date()), "-"), "-", ""))]]/*:status)
return $db-bkp-status
We maintain a configuration file that stores the backup status. On a new full backup day, $backup-info returns nothing. On a daily incremental backup day, it returns the config. I'm using it just to check whether today's backup is a new full backup or an incremental one. On an incremental day, $backup-info is non-empty, so $new-backup is false and execution goes to the last line, i.e. the else branch. This doesn't return anything for incremental backups, neither completed nor in-progress. I wonder how MarkLogic picks up the timestamp. Please assist on this.
Feel free to provide your own xquery from scratch. I can update mine.
I even took the job id and searched for it in the output of the function xdmp:database-backup-status(()), but that job id doesn't exist in the result set either.
MarkLogic provides the Admin modules to supply much of the information you are attempting to get via other methods. The Admin UI modules (typically found in /opt/MarkLogic/Modules/MarkLogic/Admin/Lib) contain a lot of helpful code that can be adapted to get these sorts of details. In this case I would refer to database-status-form.xqy:
define function db-mount-state(
$fstats as node()*,
$fcounts as node()*,
$dbid as xs:unsignedLong)
{
let $times := $fstats/fs:last-state-change,
$ls := max($times),
$since :=
if (not(empty($ls)))
then concat(" since ", longDate($ls), " ", longTimeSecs($ls))
else ""
return concat(database-status($dbid,$fstats,$fcounts),$since)
}
define function backup-recov-state($fstats as node()*)
{
if(empty($fstats/fs:backups/fs:backup)
and
empty($fstats/fs:restore))
then
"No backup or restore in progress"
else
if(empty($fstats/fs:backups/fs:backup))
then
"Restore in progress (see below for details)"
else
"Backup in progress (see below for details)"
}
... Call the functions against your database, then pull the details from the elements you want:
let $last-full-backup := max($fstats/fs:last-backup)
let $last-incremental-backup := max($fstats/fs:last-incr-backup)
return ($last-full-backup, $last-incremental-backup)
These are just sample code snippets, not executable, but they should get you moving in the right direction.

Compare two elements of the same document in MarkLogic

I have a MarkLogic 8 database in which there are documents which have two date time fields:
created-on
active-since
I am trying to write an Xquery to search all the documents for which the value of active-since is less than the value of created-on
Currently I am using the following FLWOR expression:
for $entity in fn:collection("entities")
let $id := fn:data($entity//id)
let $created-on := fn:data($entity//created-on)
let $active-since := fn:data($entity//active-since)
where $active-since < $created-on
return
(
$id,
$created-on,
$active-since
)
The above query takes too long to execute and with increase in the number of documents the execution time of this query will also increase.
Also, I have an element-range-index for both of the above-mentioned dateTime fields, but they are not getting used here. The cts-element-query function only compares one element with a set of atomic values. In my case I am trying to compare two elements of the same document.
I think there should be a better and optimized solution for this problem.
Please let me know in case there is any search function or any other approach which will be suitable in this scenario.
This may be efficient enough for you.
Take one of the values and build a range query per value. This all uses the range indexes, so in that sense, it is efficient. However, at some point, a large query is built. It reads similar to a FLWOR statement. If you really wanted to be a bit more efficient, you could find out which of your elements has fewer unique values (the size of the index) and use that for your iteration, thus building a smaller query. Also, note that on the element-values call, I constrain it to your collection. This is just in case you happen to have that element in documents outside of your collection. This keeps the list to only those values you know are in your collection:
let $q := cts:or-query(
for $created-on in cts:element-values(xs:QName("created-on"), (), cts:collection-query("entities"))
return cts:element-range-query(xs:QName("active-since"), "<", $created-on)
)
return
cts:search(
fn:collection("entities"),
$q
)
So, let's explain what is happening in a simple example:
Let's say I have elements A and B, each with a range index defined.
Let's pretend we have the following combinations in 5 documents:
A,B
2,3
4,2
2,7
5,4
2,9
let $q := cts:or-query(
for $a in cts:element-values(xs:QName("A"))
return cts:element-range-query(xs:QName("B"), "<", $a)
)
This would create the following query:
cts:or-query(
(
cts:element-range-query(xs:QName("B"), "<", 2),
cts:element-range-query(xs:QName("B"), "<", 4),
cts:element-range-query(xs:QName("B"), "<", 5)
)
)
And in the example above, the only match would be the document with the combination: (5,4)
You might try using cts:value-tuples(). Pass in three references: active-since, created-on, and the URI reference. Then iterate over the results looking for ones where active-since is less than created-on, and you'll have the URI of the doc.
It's not the prettiest code, but it will let all the data come from RAM, so it should scale nicely.
I am now using the following script to get the count of documents for which the value of active-since is less than the value of created-on:
fn:sum(
for $value-pairs in cts:value-tuples(
(
cts:element-reference(xs:QName("created-on")),
cts:element-reference(xs:QName("active-since"))
),
("fragment-frequency"),
cts:collection-query("entities")
)
let $created-on := json:array-values($value-pairs)[1]
let $active-since := json:array-values($value-pairs)[2]
return
if($active-since lt $created-on) then cts:frequency($value-pairs) else 0
)
Sorry for not having enough reputation, hence I need to comment here on your answer. Why do you think that ML will not return (2,3) and (4,2). I believe we are using an Or-query which will take any single query as true and return the document.
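The point raised in this comment can be checked without a database. The sketch below is not MarkLogic code: it assumes the five (A, B) combinations from the example modelled as plain XQuery maps, and it imitates the or-query by testing B against every distinct value of A, next to the comparison that was actually asked for (B less than A within the same document).

```xquery
xquery version "3.1";

(: The five (A, B) combinations from the example, modelled as maps. :)
let $docs := (
  map { 'A': 2, 'B': 3 },
  map { 'A': 4, 'B': 2 },
  map { 'A': 2, 'B': 7 },
  map { 'A': 5, 'B': 4 },
  map { 'A': 2, 'B': 9 }
)
(: Distinct values of A, as an index lookup would return them: 2, 4, 5. :)
let $a-values := distinct-values($docs ! ?A)
(: Or-query semantics: B is smaller than ANY value of A in the data set. :)
let $or-query-matches := $docs[some $a in $a-values satisfies ?B lt $a]
(: Requested semantics: B is smaller than the A of the SAME document. :)
let $same-document-matches := $docs[?B lt ?A]
return map {
  'or-query': array { $or-query-matches ! (?A || ',' || ?B) },
  'same-document': array { $same-document-matches ! (?A || ',' || ?B) }
}
```

With this data, the or-query variant selects (2,3), (4,2) and (5,4), whereas the same-document comparison selects only (4,2) and (5,4). So an or-query built from all values of one element can over-select, and its result would still need a per-document check.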

Distinct-Nodes taking too long in BaseX (XQuery)

I am trying to get all distinct start elements (/products/p:category/start nodes) of a big file. I have written the query given below. It is taking too long to get the result. I am attaching the query info and the XML file.
After running for a couple of minutes, I stopped the execution.
The query is trying to get all the distinct start elements. There are 3 lakh (300,000) category elements.
declare namespace functx = "http://www.functx.com";
declare namespace p="a:b:c";
declare function functx:is-node-in-sequence(
$node as node()? ,
$seq as node()*
) as xs:boolean {
some $nodeInSeq in $seq satisfies deep-equal($nodeInSeq,$node)
};
declare function functx:distinct-nodes(
$nodes as node()*
) as node()* {
for $seq in (1 to count($nodes))
return $nodes[$seq]
[not(functx:is-node-in-sequence(.,$nodes[position() < $seq]))]} ;
let $diff_starts := functx:distinct-nodes(/products/p:category/start)
return $diff_starts
Please let me know if you require further details.
xml file
Comparing a rather large number of nodes with the function provided by FunctX is very expensive: each node is compared with all nodes before it, so the cost grows quadratically rather than linearly with the number of items.
FunctX is generally a neat library, but it often does not scale very well for data as large as yours (although XML databases can handle much larger data without problems).
In this query, I first fetch all distinct values in linear time (in the number of node lookups; for getting distinct values, BaseX uses a hash table), and then run another linear scan over all nodes to retrieve the first result node for each of those values. Total execution time on my laptop was about 700ms.
declare namespace p="a:b:c";
for $date in distinct-values(/products/p:category/start)
return (/products/p:category/start[. eq $date])[1]
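A variant of the same idea is to group the nodes by their string value and keep the first node of each group. This is only a sketch: the inline sample document is a hypothetical stand-in for the large file, and it assumes that the start elements contain nothing but text, so that comparing string values is equivalent to the deep-equal comparison in the original function.

```xquery
xquery version "3.1";
declare namespace p = "a:b:c";

(: Hypothetical sample standing in for the large input file. :)
let $doc :=
  <products>
    <p:category><start>2015-01-01</start></p:category>
    <p:category><start>2015-02-01</start></p:category>
    <p:category><start>2015-01-01</start></p:category>
  </products>
for $start in $doc/p:category/start
(: after grouping, $start holds all nodes that share the same value :)
group by $value := string($start)
return $start[1]
```

For the sample this returns two start elements, one per distinct value. The order in which the groups come back is implementation-dependent, so add an order by clause if the order matters.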

Resources