XQuery: filter large amounts of data

I have an XML file (about 3 GB) containing 150k entries.
Sample entry:
<entry>
.... lots of data here ....
<customer-id>1</customer-id>
</entry>
Each of these entries has a specific customer-id. I have to filter the dataset based on a blacklist (a sequence of 3k ids), e.g.
let $blacklist-customers := ('0',
'1',
'2',
'3',
....
'3000')
I currently check whether or not the customer-id of each entry is included in the blacklist like this:
for $entry in //entry
let $customer-id := $entry//customer-id
let $inblacklist := $blacklist = $customer-id
return if (not($inblacklist)) then $entry else ()
If it is not included, it will be returned.
Following this approach, after about 2 minutes of processing I get an out of main memory error.
I tried to adjust the code so that it groups first and only checks once per group whether the customer-id is included in the blacklist. But I still get an out of main memory error that way.
for $entry in //entry
let $customer-id := $entry//customer-id
group by $customer-id
let $inblacklist := $blacklist = $customer-id
return if (not($inblacklist)) then $entry else ()
The processing takes place in BaseX.
What are the reasons for the out of main memory error and what is the best approach to solve this problem?
Also, does grouping the data reduce the number of iterations needed in the second approach?
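For context, a common way to avoid both the repeated scan of the blacklist sequence and a per-entry sequence comparison is to put the blacklist into an XQuery 3.1 map, which gives hash-based lookups. The following is a minimal sketch (assuming a BaseX version with XQuery 3.1 map support), not the original poster's code:

let $blacklist := map:merge(
  (: placeholder: the full 3k-id list goes here :)
  for $id in ('0', '1', '2', '3000')
  return map:entry($id, true())
)
for $entry in //entry
where not(map:contains($blacklist, string(($entry//customer-id)[1])))
return $entry

With a map, each lookup costs roughly constant time instead of scanning up to 3k items for every entry.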

Related

BaseX - XQuery - Out of memory when writing results to CSV file

I am trying to write an XQuery result to a CSV file, see attached code (resulting in at least 1.6 million lines; it will probably become a lot more...).
However, several minutes into execution, the program fails with an 'out of main memory' error. I am using a laptop with 4 GB of memory. I would have thought that writing to a file would prevent memory bottlenecks. Also, I am already using the copynode-false pragma.
I might have gone about the code the wrong way, since this is my first XQuery/BaseX program. Or this might not be solvable without extra hardware... (current database SIZE: 3092 MB; NODES: 142477344) Any assistance would be much appreciated!
let $params :=
  <output:serialization-parameters xmlns:output="http://www.w3.org/2010/xslt-xquery-serialization">
    <output:method value="csv"/>
    <output:csv value="header=yes, separator=semicolon"/>
  </output:serialization-parameters>
return file:write(
  '/tmp/output.csv',
  (# db:copynode false #) {<csv>{
    for $stand in //*:stand
    return <record>{$stand//*:kenmerk}</record>
    (: {$stand//*:identificatieVanVerblijfsobject}
       {$stand//*:inOnderzoek}
       {$stand//*:documentdatum}
       {$stand//*:documentnummer} :)
  }</csv>},
  $params
)
It’s a good idea to use the copynode pragma to save memory. In the given case, it’s probably the total number of newly created element nodes that simply consumes too much memory before the data can be written to disk.
If you have large data sets, the xquery serialization format may be the better choice. Maps and arrays consume less memory than XML nodes:
let $params := map {
  'format': 'xquery',
  'header': true(),
  'separator': 'semicolon'
}
let $data := map {
  'names': [
    'kenmerk', 'inOnderzoek'
  ],
  'records': (
    for $stand in //*:stand
    return [
      string($stand//*:kenmerk),
      string($stand//*:inOnderzoek)
    ]
  )
}
return file:write-text(
  '/tmp/output.csv',
  csv:serialize($data, $params)
)
Another approach is to use the window clause and write the results in chunks:
for tumbling window $stands in //*:stand
  start at $s when true()
  end at $e when $e - $s eq 100000
let $first := $s = 1
let $path := '/tmp/output.csv'
let $csv := <csv>{
  for $stand in $stands
  return <record>{
    $stand//*:kenmerk,
    $stand//*:inOnderzoek
  }</record>
}</csv>
let $params := map {
  'method': 'csv',
  'csv': map {
    'separator': 'semicolon',
    'header': $first
  }
}
return if ($first) then (
  file:write($path, $csv, $params)
) else (
  file:append($path, $csv, $params)
)
After the first write operation, subsequent table rows will be appended to the original file. The chunk size (here: 100000 rows per loop) can be freely adjusted. As in your original code, the serialization parameters can also be specified as XML; and it’s of course also possible to use the xquery serialization format in the second example.

MarkLogic optic query using two indexes returns no results

I want to use the MarkLogic Optic API to join two range indexes, but somehow they don't join. Is the query I wrote wrong, or can't I compare the indexes used?
I have two indexes defined:
an element-attribute range index x/@refid
a range field index 'id'
Both are of type string and have the same collation defined. Both indexes have data that I can retrieve with the cts:values() function. Both are huge indexes, and I want to join them using the Optic API, so I have constructed the following query:
import module namespace op="http://marklogic.com/optic"
  at "/MarkLogic/optic.xqy";

let $subfrag := op:fragment-id-col("subfrag")
let $notfrag := op:fragment-id-col("notfrag")
let $query :=
  cts:and-query((
    cts:collection-query("latest")
  ))
let $subids := op:from-lexicons(
  map:entry("subid", cts:field-reference("id")), (), $subfrag) => op:where($query)
let $notids := op:from-lexicons(
  map:entry("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid"))),
  (),
  $notfrag)
return $subids
  => op:join-cross-product($notids)
  => op:where(op:eq($notfrag, $subfrag))
  => op:result()
This query uses the join-cross-product, and when I remove the op:where clause I get all values left and right. I verified that some are equal, so the clause should filter only those rows I'm actually interested in. But somehow it doesn't work and I get an empty result. Also, if I replace one of the values in the op:eq with a string value, it doesn't return a result.
When I use the same variable in the op:eq operator (like op:eq($notfrag, $notfrag)) I get results back, so the statement as such works. Just not the comparison between the two indexes.
I have also tried variants with join-inner and left-outer-join, but those also return no results.
Am I comparing two incomparable indexes, or am I missing some statement (as documentation/examples are a bit thin)?
(of course I can solve by not using optics but in this case it would be a perfect fit)
[update]
I eventually got it working by changing the final statement:
return $subids
=> op:join-cross-product($notids)
=> op:where(op:eq(op:col('subid'), op:col('notid')))
=> op:result()
So somehow you cannot use the fragment definitions in the condition. After this, I replaced the join-cross-product with a join-inner construction, which should be a bit more efficient.
And to be complete: I initially used the example from the MarkLogic documentation found here (https://docs.marklogic.com/guide/app-dev/OpticAPI#id_87356), specifically the last example, where they use a fragment column definition as a param in the join-inner statement; that didn't work in my case.
Cross products are typically useful only for small row sets.
Putting both references in the same from-lexicons() accessor does an implicit join, meaning that the engine forms rows by constructing a local cross-product of the values indexed for each document.
Such a query could be expressed by:
op:from-lexicons(
  map:entry("subid", cts:field-reference("id"))
    => map:with("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid")))
)
  => op:where(cts:collection-query("latest"))
  => op:result()
Making the joins explicitly could be done with:
let $subids := op:from-lexicons(
  map:entry("subid", cts:field-reference("id")), (), $subfrag)
  => op:where($query)
let $notids := op:from-lexicons(
  map:entry("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid"))),
  (),
  $notfrag)
return $subids
  => op:join-inner($notids, op:on($notfrag, $subfrag))
  => op:result()
Hoping that helps,

List of all document names in marklogic forest

I just want to find all document names in a forest.
I know the forest name (ABC) and I need to find all documents in that forest (ABC). My output should look like this:
Forest ABC has
A.xml
B.xml
C.xml
and so on...
Searches and lexicon lookups can be constrained by forest, so you should be able to get the document names from the URI lexicon with a call similar to the following:
cts.values(cts.uriReference(), null, null, null, null, xdmp.forest('ABC'))
That said, there aren't many common motivations for looking up the names of documents in a forest. What are you trying to accomplish?
In order to list all of the URIs from a particular forest, you can use cts:uris() and specify the forest-id in the 5th parameter:
cts:uris((), (), cts:true-query(), (), xdmp:forest("ABC"))
Your comment suggested that the reason why you are attempting to list all of the URIs from a particular forest was so that you could delete the ones that are duplicates.
The code below could be used to obtain all of the URIs from the specified forest, and then remove those that are duplicates.
If you attempt to read the document properties and an XDMP-DBDUPURI exception is thrown, catch that exception and then delete the document from the problem forest in a different transaction.
(: update this with the name of the problem forest :)
declare variable $PROBLEM-FOREST := xdmp:forest("ABC");
declare variable $URIS := cts:uris((), (), cts:true-query(), (), $PROBLEM-FOREST);

for $uri in $URIS
return
  try {
    let $properties := xdmp:document-get-properties($uri, xs:QName("foo"))
    return ()
  } catch($e) {
    if ($e/error:code = "XDMP-DBDUPURI") then
      xdmp:invoke-function(
        function() { xdmp:document-delete($uri) },
        <options xmlns="xdmp:eval">
          <isolation>different-transaction</isolation>
          <database>{$PROBLEM-FOREST}</database>
        </options>
      )
    else ()
  }
Depending on how many documents are in this forest, you may run into timeout issues. You might consider running this as a CORB job, where the forest's URIs are selected in the URIS-MODULE and each inspection/delete is then handled individually in the PROCESS-MODULE (a sketch of such a URIS-MODULE follows).
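As a minimal sketch (an assumption of how the job could be set up, not code from the original thread): CORB expects the URIS-MODULE to return the URI count first, followed by the URIs themselves, so a forest-constrained selection could look like this:

xquery version "1.0-ml";
(: hypothetical CORB URIS-MODULE: select the URIs from the problem
   forest and emit the count before the sequence of URIs :)
let $uris := cts:uris((), (), cts:true-query(), (), xdmp:forest("ABC"))
return (count($uris), $uris)

Each returned URI would then be passed to the PROCESS-MODULE, which would perform the try/catch inspection and delete shown above for a single document.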

Compare two elements of the same document in MarkLogic

I have a MarkLogic 8 database in which there are documents that have two dateTime fields:
created-on
active-since
I am trying to write an XQuery to search all the documents for which the value of active-since is less than the value of created-on.
Currently I am using the following FLWOR expression:
for $entity in fn:collection("entities")
let $id := fn:data($entity//id)
let $created-on := fn:data($entity//created-on)
let $active-since := fn:data($entity//active-since)
where $active-since < $created-on
return
(
$id,
$created-on,
$active-since
)
The above query takes too long to execute, and as the number of documents increases, its execution time will increase as well.
Also, I have an element-range-index for both of the above-mentioned dateTime fields, but they are not getting used here. The cts:element-range-query function only compares one element with a set of atomic values; in my case I am trying to compare two elements of the same document.
I think there should be a better, optimized solution for this problem.
Please let me know if there is any search function or other approach that would be suitable in this scenario.
This may be efficient enough for you.
Take one of the values and build a range query per value. This all uses the range indexes, so in that sense it is efficient. However, at some point, there is a large query that is built. It reads similar to a FLWOR statement. If you really wanted to be a bit more efficient, you could find out which of your elements has fewer unique values (size of the index) and use that for your iteration, thus building a smaller query. Also, you will note that on the element-values call, I also constrain it to your collection. This is just in case you happen to have that element in documents outside of your collection. It keeps the list to only those values you know are in your collection:
let $q := cts:or-query(
  for $created-on in cts:element-values(xs:QName("created-on"), (), (), cts:collection-query("entities"))
  return cts:element-range-query(xs:QName("active-since"), "<", $created-on)
)
return
  cts:search(
    fn:collection("entities"),
    $q
  )
So, let's explain what is happening in a simple example:
Let's say I have elements A and B, each with a range index defined.
Let's pretend we have combinations like this in 5 documents:
A,B
2,3
4,2
2,7
5,4
2,9
let $q := cts:or-query(
  for $a in cts:element-values(xs:QName("A"))
  return cts:element-range-query(xs:QName("B"), "<", $a)
)
This would create the following query:
cts:or-query(
  (
    cts:element-range-query(xs:QName("B"), "<", 2),
    cts:element-range-query(xs:QName("B"), "<", 4),
    cts:element-range-query(xs:QName("B"), "<", 5)
  )
)
And in the example above, the only match would be the document with the combination: (5,4)
You might try using cts:value-tuples(). Pass in three references: active-since, created-on, and the URI reference. Then iterate the results looking for ones where active-since is less than created-on, and you'll have the URI of the doc.
It's not the prettiest code, but it will let all the data come from RAM, so it should scale nicely.
I am now using the following script to get the count of documents for which the value of active-since is less than the value of created-on:
fn:sum(
  for $value-pairs in cts:value-tuples(
    (
      cts:element-reference(xs:QName("created-on")),
      cts:element-reference(xs:QName("active-since"))
    ),
    ("fragment-frequency"),
    cts:collection-query("entities")
  )
  let $created-on := json:array-values($value-pairs)[1]
  let $active-since := json:array-values($value-pairs)[2]
  return
    if ($active-since lt $created-on) then cts:frequency($value-pairs) else 0
)
Sorry for not having enough reputation, hence I need to comment here on your answer. Why do you think that ML will not return (2,3) and (4,2)? I believe we are using an or-query, which will treat any single matching sub-query as true and return the document.

Distinct-Nodes taking too long in BaseX (XQuery)

I am trying to get all distinct start elements (/products/p:category/start) of a big file. I have written the query given below. It is taking too long to get the result. I am attaching the query info and the XML file.
After running for a couple of minutes, I stopped the execution.
The query is trying to get all the distinct start elements. There are 300,000 (3 lakh) category elements.
declare namespace functx = "http://www.functx.com";
declare namespace p = "a:b:c";

declare function functx:is-node-in-sequence(
  $node as node()?,
  $seq as node()*
) as xs:boolean {
  some $nodeInSeq in $seq satisfies deep-equal($nodeInSeq, $node)
};

declare function functx:distinct-nodes(
  $nodes as node()*
) as node()* {
  for $seq in (1 to count($nodes))
  return $nodes[$seq][not(functx:is-node-in-sequence(., $nodes[position() < $seq]))]
};

let $diff_starts := functx:distinct-nodes(/products/p:category/start)
return $diff_starts
Please let me know if you require further details.
Comparing a rather large number of nodes with the function provided by FunctX is very expensive; its cost is far beyond linear in the number of items.
FunctX is generally a neat library, but it often does not scale very well for data as large as yours (although XML databases can handle much larger data without problems).
In the following query, all distinct values are first fetched in linear time (in the number of node lookups; for getting distinct values, BaseX uses a hash table), and another linear scan over all nodes retrieves the first result node for each of those values. Total execution time on my laptop was about 700 ms.
declare namespace p="a:b:c";
for $date in distinct-values(/products/p:category/start)
return (/products/p:category/start[. eq $date])[1]
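As a side note (an alternative sketch, not from the original answer), the same deduplication can be expressed in a single pass with an XQuery 3.0 group by clause; after grouping, each non-grouping variable holds all nodes that share a value, so the first one per group can be kept:

declare namespace p = "a:b:c";
(: group all start nodes by their string value; $start is bound to the
   whole group after the group by clause, so head() yields its first node :)
for $start in /products/p:category/start
group by $value := string($start)
return head($start)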
