Distinct-Nodes taking too long in BaseX (XQuery)

I am trying to get all distinct /products/p:category/start elements of a big file. I have written the query below, but it is taking too long to return a result. I am attaching the query info and the XML file.
After running for a couple of minutes, I stopped the execution.
The query is trying to get all the distinct start elements. There are 3 lakh (300,000) category elements.
declare namespace functx = "http://www.functx.com";
declare namespace p = "a:b:c";

declare function functx:is-node-in-sequence(
  $node as node()?,
  $seq as node()*
) as xs:boolean {
  some $nodeInSeq in $seq satisfies deep-equal($nodeInSeq, $node)
};

declare function functx:distinct-nodes(
  $nodes as node()*
) as node()* {
  for $seq in (1 to count($nodes))
  return $nodes[$seq]
    [not(functx:is-node-in-sequence(., $nodes[position() < $seq]))]
};
let $diff_starts := functx:distinct-nodes(/products/p:category/start)
return $diff_starts
Please let me know if you require further details.
xml file

Comparing a rather large number of nodes with the function provided by FunctX is very expensive; its cost grows far faster than linearly with the number of items.
FunctX is generally a neat library, but it often does not scale well for data as large as yours (although XML databases can handle much larger data without problems).
In the query below, I first fetch all distinct values in linear time (in the number of node lookups; for getting distinct values BaseX uses a hash table), and then do another linear scan over all nodes to retrieve the first result node for each of those values. Total execution time on my laptop was about 700 ms.
declare namespace p="a:b:c";
for $date in distinct-values(/products/p:category/start)
return (/products/p:category/start[. eq $date])[1]
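An alternative single-pass formulation (a sketch, assuming the same namespace and document structure as above) groups the start nodes by their string value and returns the first node of each group:
declare namespace p = "a:b:c";
for $start in /products/p:category/start
group by $value := string($start)
return $start[1]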

Related

BaseX - XQuery - Out of memory when writing results to CSV file

I am trying to write an XQuery result to a CSV file, see the attached code (resulting in at least 1.6 million lines, and it will probably become a lot more).
However, several minutes into execution the program fails with an 'out of main memory' error. I am using a laptop with 4 GB of memory. I would have thought that writing to a file would prevent memory bottlenecks. Also, I am already using the copynode-false pragma.
I might have gone about the code the wrong way, since this is my first XQuery/BaseX program. Or this might not be solvable without extra hardware. (Current database SIZE: 3092 MB; NODES: 142477344.) Any assistance would be much appreciated!
let $params :=
  <output:serialization-parameters xmlns:output="http://www.w3.org/2010/xslt-xquery-serialization">
    <output:method value="csv"/>
    <output:csv value="header=yes, separator=semicolon"/>
  </output:serialization-parameters>
return file:write(
  '/tmp/output.csv',
  (# db:copynode false #) {<csv>{
    for $stand in //*:stand
    return <record>{$stand//*:kenmerk}</record>
    (: {$stand//*:identificatieVanVerblijfsobject}
       {$stand//*:inOnderzoek}
       {$stand//*:documentdatum}
       {$stand//*:documentnummer} :)
  }</csv>},
  $params
)
It’s a good idea to use the copynode pragma to save memory. In this case, it’s probably the sheer number of newly created element nodes that consumes too much memory before the data can be written to disk.
If you have large data sets, the xquery serialization format may be the better choice. Maps and arrays consume less memory than XML nodes:
let $params := map {
  'format': 'xquery',
  'header': true(),
  'separator': 'semicolon'
}
let $data := map {
  'names': [
    'kenmerk', 'inOnderzoek'
  ],
  'records': (
    for $stand in //*:stand
    return [
      string($stand//*:kenmerk),
      string($stand//*:inOnderzoek)
    ]
  )
}
return file:write-text(
  '/tmp/output.csv',
  csv:serialize($data, $params)
)
Another approach is to use the window clause and write the results in chunks:
for tumbling window $stands in //*:stand
  start at $s when true()
  end at $e when $e - $s eq 100000
let $first := $s = 1
let $path := '/tmp/output.csv'
let $csv := <csv>{
  for $stand in $stands
  return <record>{
    $stand//*:kenmerk,
    $stand//*:inOnderzoek
  }</record>
}</csv>
let $params := map {
  'method': 'csv',
  'csv': map {
    'separator': 'semicolon',
    'header': $first
  }
}
return if ($first) then (
  file:write($path, $csv, $params)
) else (
  file:append($path, $csv, $params)
)
After the first write operation, subsequent table rows will be appended to the original file. The chunk size (here: 100,000 rows per iteration) can be freely adjusted. As in your original code, the serialization parameters can also be specified as XML; and it’s of course also possible to use the xquery serialization format in the second example.
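For reference, a sketch of the serialization parameters for the chunked variant written as XML instead of a map (assuming the $first flag from the window example above; the header switch is toggled with an attribute value template):
let $params :=
  <output:serialization-parameters xmlns:output="http://www.w3.org/2010/xslt-xquery-serialization">
    <output:method value="csv"/>
    <output:csv value="header={if ($first) then 'yes' else 'no'}, separator=semicolon"/>
  </output:serialization-parameters>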

MarkLogic optic query using two indexes returns no results

I want to use the MarkLogic Optic API to join two range indexes, but somehow they don't join. Is the query I wrote wrong, or can't I compare the indexes used?
I have two indexes defined:
an element-attribute range index x/@refid
a range field index 'id'
Both are of type string and have the same collation defined. Both indexes have data that I can retrieve with the cts:values() function. Both are huge indexes and I want to join them using Optic, so I have constructed the following query:
import module namespace op="http://marklogic.com/optic"
  at "/MarkLogic/optic.xqy";

let $subfrag := op:fragment-id-col("subfrag")
let $notfrag := op:fragment-id-col("notfrag")
let $query :=
  cts:and-query((
    cts:collection-query("latest")
  ))
let $subids := op:from-lexicons(
  map:entry("subid", cts:field-reference("id")), (), $subfrag) => op:where($query)
let $notids := op:from-lexicons(
  map:entry("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid"))),
  (),
  $notfrag)
return $subids
  => op:join-cross-product($notids)
  => op:where(op:eq($notfrag, $subfrag))
  => op:result()
This query uses the join-cross-product, and when I remove the op:where clause I get all values left and right. I verified that some are equal, so the clause should filter only those rows I'm actually interested in. But somehow it doesn't work and I get an empty result. Also, if I replace one of the values in the op:eq with a string value, it doesn't return a result.
When I use the same variable in the op:eq operator (like op:eq($notfrag, $notfrag)) I get results back, so the statement as such works. Just not the comparison between the two indexes.
I have also tried variants with join-inner and left-outer-join, but those also return no results.
Am I comparing two incomparable indexes, or am I missing some statement (the documentation/examples are a bit thin)?
(Of course I can solve this without using Optic, but in this case it would be a perfect fit.)
[update]
I eventually got it working by changing the final statement:
return $subids
  => op:join-cross-product($notids)
  => op:where(op:eq(op:col('subid'), op:col('notid')))
  => op:result()
So somehow you cannot use the fragment column definitions in the condition. After this I replaced the join-cross-product with a join-inner construction, which should be a bit more efficient (a sketch of that variant is shown below).
And to be complete: I initially used the example from the MarkLogic documentation (https://docs.marklogic.com/guide/app-dev/OpticAPI#id_87356), specifically the last example, where a fragment column definition is used as a parameter in the join-inner statement; that didn't work in my case.
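For reference, a sketch of that join-inner variant, using the op:col references that worked in the fix above (column names as defined in the original query):
return $subids
  => op:join-inner($notids, op:on(op:col('subid'), op:col('notid')))
  => op:result()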
Cross products are typically useful only for small row sets.
Putting both references in the same from-lexicons() accessor does an implicit join, meaning that the engine forms rows by constructing a local cross-product of the values indexed for each document.
Such a query could be expressed as:
op:from-lexicons(
  map:entry("subid", cts:field-reference("id"))
    => map:with("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid"))))
  => op:where(cts:collection-query("latest"))
  => op:result()
Making the join explicit could be done with:
let $subids := op:from-lexicons(
  map:entry("subid", cts:field-reference("id")), (), $subfrag)
  => op:where($query)
let $notids := op:from-lexicons(
  map:entry("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid"))),
  (),
  $notfrag)
return $subids
  => op:join-inner($notids, op:on($notfrag, $subfrag))
  => op:result()
Hoping that helps,

How do I get output for an XQuery in MarkLogic as a one-line output?

I will elaborate: when I execute the following command:
let $value := xdmp:forest-status(
  xdmp:forest-open-replica(
    xdmp:database-forests(xdmp:database("Documents"))))
return $value
The above query returns a lot of information about the forests of the "Documents" database, like forest-id, host-id, etc.
I only need it to return the "state" of my forest. How do I do that?
Use XPath to select what you want to return.
let $value := xdmp:forest-status(
  xdmp:forest-open-replica(
    xdmp:database-forests(xdmp:database("Documents"))))
return $value/*:state/text()
Also, there is no need for a FLWOR; you could make it a one-liner:
xdmp:forest-status(
  xdmp:forest-open-replica(
    xdmp:database-forests(xdmp:database("Documents"))))/*:state/text()
Or you may find that using the arrow operator makes things easier to read than nested function calls and the many parentheses wrapping them:
(xdmp:database("Documents")
=> xdmp:database-forests()
=> xdmp:forest-open-replica()
=> xdmp:forest-status()
)/*:state/text()
The XML elements in the response are in the http://marklogic.com/xdmp/status/forest namespace. So, you would either need to declare the namespace (i.e. declare namespace f = "http://marklogic.com/xdmp/status/forest";) and use the prefix in your XPath (i.e. f:state), or just use the wildcard as I have done with *:state.
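For reference, the prefixed variant of the same query would look like this:
declare namespace f = "http://marklogic.com/xdmp/status/forest";

(xdmp:database("Documents")
  => xdmp:database-forests()
  => xdmp:forest-open-replica()
  => xdmp:forest-status()
)/f:state/text()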

Compare two elements of the same document in MarkLogic

I have a MarkLogic 8 database in which there are documents that have two dateTime fields:
created-on
active-since
I am trying to write an XQuery to search all the documents for which the value of active-since is less than the value of created-on.
Currently I am using the following FLWOR expression:
for $entity in fn:collection("entities")
let $id := fn:data($entity//id)
let $created-on := fn:data($entity//created-on)
let $active-since := fn:data($entity//active-since)
where $active-since < $created-on
return
(
$id,
$created-on,
$active-since
)
The above query takes too long to execute, and as the number of documents increases, the execution time will also increase.
Also, I have an element range index for both of the above-mentioned dateTime fields, but they are not getting used here. The cts:element-range-query function only compares one element with a set of atomic values; in my case I am trying to compare two elements of the same document.
I think there should be a better and more optimized solution for this problem.
Please let me know if there is any search function or any other approach that would be suitable in this scenario.
This may be efficient enough for you.
Take one of the values and build a range query per value. This all uses the range indexes, so in that sense it is efficient. However, at some point a large query is built; it reads similar to a FLWOR statement. If you really wanted to be a bit more efficient, you could find out which of your elements has fewer unique values (the size of the index) and use that one for the iteration, thus building a smaller query. Also, you will note that on the element-values call I also constrain it to your collection. This is just in case you happen to have that element in documents outside of your collection; it keeps the list to only those values you know are in your collection:
let $q := cts:or-query(
  for $created-on in cts:element-values(xs:QName("created-on"), (), (), cts:collection-query("entities"))
  return cts:element-value-range-query(xs:QName("active-since"), "<", $created-on)
)
return
  cts:search(
    fn:collection("entities"),
    $q
  )
So, let's explain what is happening with a simple example:
Let's say I have elements A and B, each with a range index defined.
Let's pretend we have combinations like this in 5 documents:
A,B
2,3
4,2
2,7
5,4
2,9
let $q := cts:or-query(
  for $a in cts:element-values(xs:QName("A"))
  return cts:element-value-range-query(xs:QName("B"), "<", $a)
)
This would create the following query:
cts:or-query(
  (
    cts:element-value-range-query(xs:QName("B"), "<", 2),
    cts:element-value-range-query(xs:QName("B"), "<", 4),
    cts:element-value-range-query(xs:QName("B"), "<", 5)
  )
)
And in the example above, the only match would be the document with the combination: (5,4)
You might try using cts:value-tuples(). Pass in three references: active-since, created-on, and the URI reference. Then iterate the results looking for ones where active-since is less than created-on, and you'll have the URI of the doc.
It's not the prettiest code, but it will let all the data come from RAM, so it should scale nicely.
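A sketch of that approach with cts:value-tuples (assuming the element range indexes from the question and that the URI lexicon is enabled, so cts:uri-reference() is available):
for $tuple in cts:value-tuples(
  (
    cts:element-reference(xs:QName("active-since")),
    cts:element-reference(xs:QName("created-on")),
    cts:uri-reference()
  ),
  (),
  cts:collection-query("entities")
)
let $active-since := json:array-values($tuple)[1]
let $created-on := json:array-values($tuple)[2]
let $uri := json:array-values($tuple)[3]
where $active-since lt $created-on
return $uri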
I am now using the following script to get the count of documents for which the value of active-since is less than the value of created-on:
fn:sum(
  for $value-pairs in cts:value-tuples(
    (
      cts:element-reference(xs:QName("created-on")),
      cts:element-reference(xs:QName("active-since"))
    ),
    ("fragment-frequency"),
    cts:collection-query("entities")
  )
  let $created-on := json:array-values($value-pairs)[1]
  let $active-since := json:array-values($value-pairs)[2]
  return
    if ($active-since lt $created-on) then cts:frequency($value-pairs) else 0
)
Sorry for not having enough reputation, hence I need to comment here on your answer. Why do you think that ML will not return (2,3) and (4,2)? I believe we are using an or-query, which will treat any single matching sub-query as true and return the document.

idl: pass keyword dynamically to isa function to test structure read by read_csv

I am using IDL 8.4. I want to use the isa() function to determine the type of the input read by read_csv(). I want to use /number, /integer, /float and /string, as I want to make sure some fields are float, others integer, and others I don't care about. I can do it like this, but it is not very readable to the human eye.
str = read_csv(filename, header=inheader)
; TODO check header
if not isa(str.(0), /integer) then stop
if not isa(str.(1), /number) then stop
if not isa(str.(2), /float) then stop
I am hoping I can do something like
expected_header = ['id', 'x', 'val']
expected_type = ['/integer', '/number', '/float']
str = read_csv(filename, header=inheader)
if not array_equal(strlowcase(inheader), expected_header) then stop
for i=0l, n_elements(expected_type)-1 do begin
  if not isa(str.(i), expected_type[i]) then stop
endfor
The above doesn't work, as '/integer' is taken literally and I guess isa() is looking for a named structure. How can you do something similar?
Ideally I want to pick the expected type based on the header read from the file, so that the script still works as long as the header specifies the expected fields.
EDIT:
my tentative solution is to write a wrapper for ISA(). Not very pretty, but it does what I wanted... if there is a cleaner solution, please let me know.
Also, read_csv is defined to return only one of long, long64, double and string, so I could write the function to test only for those, but I just wanted to make it work in general so that I can reuse it for other similar cases.
function isa_generic, var, typ
  ; calls isa() http://www.exelisvis.com/docs/ISA.html with keyword
  ; if 'n', test /number
  ; if 'i', test /integer
  ; if 'f', test /float
  ; if 's', test /string
  if typ eq 'n' then return, isa(var, /number)
  if typ eq 'i' then return, isa(var, /integer)
  if typ eq 'f' then return, isa(var, /float)
  if typ eq 's' then return, isa(var, /string)
  print, 'unexpected typename: ', typ
  stop
end
IDL has some limited reflection abilities, which will do exactly what you want:
expected_types = ['integer', 'number', 'float']
expected_header = ['id', 'x', 'val']
str = read_csv(filename, header=inheader)
if ~array_equal(strlowcase(inheader), expected_header) then stop
foreach type, expected_types, index do begin
  if ~isa(str.(index), _extra=create_struct(type, 1)) then stop
endforeach
It's debatable if this is really "easier to read" in your case, since there are only three cases to test. If there were 500 cases, it would be a lot cleaner than writing 500 slightly different lines.
This snippet uses some rather esoteric IDL features, so let me explain what's happening a bit:
expected_types is just a list of (string) keyword names in the order they should be used.
The foreach part iterates over expected_types, putting the keyword string into the type variable and the iteration count into index.
This is equivalent to using for index = 0, n_elements(expected_types) - 1 do and then using expected_types[index] instead of type, but the foreach loop is easier to read IMHO. Reference here.
_extra is a special keyword that can pass a structure as if it were a set of keywords. Each of the structure's tags is interpreted as a keyword. Reference here.
The create_struct function takes one or more pairs of (string) tag names and (any type) values, then returns a structure with those tag names and values. Reference here.
Finally, I replaced not (bitwise not) with ~ (logical not). This step, like foreach vs for, is not necessary in this instance, but can avoid headaches when debugging some types of code where the distinction matters.
--
Reflective abilities like these can do an awful lot, and come in super handy. They're work-horses in other languages, but IDL programmers don't seem to use them as much. Here's a quick list of common reflective features I use in IDL, with links to the documentation for each:
create_struct - Create a structure from (string) tag names and values.
n_tags - Get the number of tags in a structure.
_extra, _strict_extra, and _ref_extra - Pass keywords by structure or reference.
call_function - Call a function by its (string) name.
call_procedure - Call a procedure by its (string) name.
call_method - Call a method (of an object) by its (string) name.
execute - Run complete IDL commands stored in a string.
Note: Be very careful using the execute function. It will blindly execute any IDL statement you (or a user, file, web form, etc.) feed it. Never ever feed untrusted or web user input to the IDL execute function.
You can't access the keywords quite like that, but there is a typename parameter to ISA that might be useful. This is untested, but should work:
expected_header = ['id', 'x', 'val']
expected_type = ['int', 'long', 'float']

str = read_csv(filename, header=inheader)
if not array_equal(strlowcase(inheader), expected_header) then stop
for i = 0L, n_elements(expected_type) - 1L do begin
  if not isa(str.(i), expected_type[i]) then stop
endfor
