List of all document names in a MarkLogic forest - XQuery

I just want to find all document names in a forest.
I know the forest name (ABC) and I need to find all documents in that forest. My output should look like this:
Forest ABC has
A.xml
B.xml
C.xml
and so on...

Searches and lexicon lookups can be constrained by forest, so you should be able to get the document names from the URI lexicon with a call similar to the following:
cts.values(cts.uriReference(), null, null, null, null, xdmp.forest('ABC'))
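In XQuery, a roughly equivalent lookup (a sketch; it assumes the URI lexicon is enabled on the database) would be:
cts:values(cts:uri-reference(), (), (), (), (), xdmp:forest("ABC"))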
That said, there aren't many common motivations for looking up the names of documents in a forest. What are you trying to accomplish?

In order to list all of the URIs from a particular forest, you can use cts:uris() and specify the forest-id in the 5th parameter:
cts:uris((), (), cts:true-query(), (), xdmp:forest("ABC"))
Your comment suggested that the reason why you are attempting to list all of the URIs from a particular forest was so that you could delete the ones that are duplicates.
The code below could be used to obtain all of the URIs from the specified forest, and then remove them from that forest if they are duplicates.
If you attempt to read the document properties and an XDMP-DBDUPURI exception is thrown, catch that exception and then delete the document in a different transaction from the problem forest.
(: update this with the name of problem forest :)
declare variable $PROBLEM-FOREST := xdmp:forest("ABC");
declare variable $URIS := cts:uris((), (), cts:true-query(), (), $PROBLEM-FOREST);

for $uri in $URIS
return
  try {
    let $properties := xdmp:document-get-properties($uri, xs:QName("foo"))
    return ()
  } catch ($e) {
    if ($e/error:code = "XDMP-DBDUPURI") then
      xdmp:invoke-function(
        function() { xdmp:document-delete($uri) },
        <options xmlns="xdmp:eval">
          <isolation>different-transaction</isolation>
          <database>{$PROBLEM-FOREST}</database>
        </options>
      )
    else ()
  }
Depending on how many documents are in this forest, you may run into timeout issues. You might consider running this as a CoRB job where the forest's URIs are selected in the URIS-MODULE and then each inspection/delete is handled individually in the PROCESS-MODULE.
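As a rough sketch of how that CoRB split might look (the module names are placeholders, and the xs:QName("foo") properties lookup is just carried over from the example above): the URIS-MODULE returns the URI count followed by the URIs, and the PROCESS-MODULE receives each URI in the external variable $URI.

(: uris-module.xqy - select the URIs from the problem forest :)
xquery version "1.0-ml";
declare variable $URIS := cts:uris((), (), cts:true-query(), (), xdmp:forest("ABC"));
(fn:count($URIS), $URIS)

(: process-module.xqy - invoked by CoRB once per URI :)
xquery version "1.0-ml";
declare variable $URI as xs:string external;
try {
  let $properties := xdmp:document-get-properties($URI, xs:QName("foo"))
  return ()
} catch ($e) {
  if ($e/error:code = "XDMP-DBDUPURI") then
    xdmp:invoke-function(
      function() { xdmp:document-delete($URI) },
      <options xmlns="xdmp:eval">
        <isolation>different-transaction</isolation>
        <database>{xdmp:forest("ABC")}</database>
      </options>
    )
  else ()
}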

Related

MarkLogic optic query using two indexes returns no results

I want to use the MarkLogic optic API to join two range indexes but somehow they don't join. Is the query I wrote wrong or can't I compare the indexes used?
I have two indexes defined:
an element-attribute range index x/@refid
a range field index 'id'
Both are of type string and have the same collation defined. Both indexes have data that I can retrieve with the cts:values() function. Both are huge indexes and I want to join them using the Optic API, so I have constructed the following query:
import module namespace op="http://marklogic.com/optic"
  at "/MarkLogic/optic.xqy";

let $subfrag := op:fragment-id-col("subfrag")
let $notfrag := op:fragment-id-col("notfrag")
let $query :=
  cts:and-query((
    cts:collection-query("latest")
  ))
let $subids :=
  op:from-lexicons(
    map:entry("subid", cts:field-reference("id")), (), $subfrag)
  => op:where($query)
let $notids :=
  op:from-lexicons(
    map:entry("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid"))),
    (),
    $notfrag)
return $subids
  => op:join-cross-product($notids)
  => op:where(op:eq($notfrag, $subfrag))
  => op:result()
This query uses join-cross-product, and when I remove the op:where clause I get all values on the left and right. I verified that some are equal, so the clause should filter down to only those rows I'm actually interested in. But somehow it doesn't work and I get an empty result. Also, if I replace one of the values in the op:eq with a string value it doesn't return a result.
When I use the same variable in the op:eq operator (like op:eq($notfrag, $notfrag)) I get results back so the statement as is works. Just not the comparison between the two indexes.
I have also used variants with join-inner and left-outer-join but those are also returning no results.
Am I comparing two incomparable indexes, or am I missing some statement (as the documentation/examples are a bit thin)?
(of course I can solve by not using optics but in this case it would be a perfect fit)
[update]
I eventually got it working by changing the final statement:
return $subids
=> op:join-cross-product($notids)
=> op:where(op:eq(op:col('subid'), op:col('notid')))
=> op:result()
So somehow you cannot use the fragment definitions in the condition. After this I replaced the join-cross-product with a join-inner construction which should be a bit more efficient.
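A sketch of what that join-inner variant might look like (this is my reading of the description, reusing the column names from the query above):

return $subids
  => op:join-inner($notids, op:on(op:col('subid'), op:col('notid')))
  => op:result()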
And to be complete, I initially used the example from the MarkLogic documentation found here (https://docs.marklogic.com/guide/app-dev/OpticAPI#id_87356), specifically the last example where they use a fragment column definition to be used as param in the join-inner statement that didn't work in my case.
Cross products are typically useful only for small rows sets.
Putting both references in the same from-lexicons() accessor does an implicit join, meaning that the engine forms rows by constructing a local cross-product of the values indexed for each document.
Such a query could be expressed by:
op:from-lexicons(
  map:entry("subid", cts:field-reference("id"))
    => map:with("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid")))
)
  => op:where(cts:collection-query("latest"))
  => op:result()
Making the joins explicit could be done with:
let $subids :=
  op:from-lexicons(
    map:entry("subid", cts:field-reference("id")), (), $subfrag)
  => op:where($query)
let $notids :=
  op:from-lexicons(
    map:entry("notid", cts:element-attribute-reference(xs:QName("x"), xs:QName("refid"))),
    (),
    $notfrag)
return $subids
  => op:join-inner($notids, op:on($notfrag, $subfrag))
  => op:result()
Hoping that helps,

Searching in multiple collections joined by common fields in XQuery (MarkLogic)

I have two collections ('A' and 'B') with millions of transport insurance documents. The two collections have four elements in common (customer-no, date-of-insurance, insurance-no, accident-number), and one element (license-no) exists only in collection 'A'. I want to extract all the documents that are present in both collections and also have the element from collection 'A'. I am able to retrieve all the customer-nos from 'A' with cts:search. Then I loop through each of these customer-nos to look for the matching documents in 'B'. It gives an empty sequence, but I know that cannot be right. Could someone guide me with the appropriate search logic?
let $col-A := cts:search(
  doc(),
  cts:and-query((
    cts:collection-query('col-A'),
    cts:element-value-query(xs:QName('abc:Acusno'), '*', (("wildcarded")))
  )))
for $each in $col-A
let $col-B := cts:search(
  doc(),
  cts:and-query((
    cts:collection-query('col-B'),
    cts:element-value-query(xs:QName('abc:Bcusno'), $each)
  )))
return $col-B
This returns an empty sequence.
Your first cts:search is returning entire documents, which you are then passing in as an argument to the value-query. You probably want to pass in just the value of abc:Acusno. You could do that with something like $each//abc:Acusno, as shown below.
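For example, a sketch of just that minimal fix, keeping the rest of the query as-is (and assuming the abc prefix is bound as in your module):

for $each in $col-A
let $cusnos :=
  for $c in $each//abc:Acusno
  return fn:string($c)
let $col-B := cts:search(
  doc(),
  cts:and-query((
    cts:collection-query('col-B'),
    cts:element-value-query(xs:QName('abc:Bcusno'), $cusnos)
  )))
return $col-B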
Your code is not using a very efficient approach though, and what if certain Acusno values occur multiple times?
I would recommend putting a range index on abc:Acusno, and using cts:values to pull up the unique values that match a given query. Then feed that entire list as one argument, without any looping, to a query against abc:Bcusno. You don't have to use a range index and range query on Bcusno, but it could be useful to have that index anyhow. The code would then look something like this:
let $query :=
  cts:and-query((
    cts:collection-query('col-A'),
    cts:element-query(xs:QName('abc:Acusno'), cts:true-query())
  ))
let $customerNrs :=
  cts:values(
    cts:element-reference(xs:QName("abc:Acusno")),
    (),
    (),
    $query
  )
return cts:search(
  collection(),
  cts:and-query((
    cts:collection-query('col-B'),
    cts:element-range-query(xs:QName('abc:Bcusno'), '=', $customerNrs)
  ))
)
Note: be careful when returning full search lists like this. You might want to paginate the response, for instance as shown below.
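A sketch of simple pagination using a positional predicate on the search results (the page size of 10 is arbitrary, and $customerNrs comes from the query above):

let $page-size := 10
let $page := 1
let $start := ($page - 1) * $page-size + 1
return cts:search(
  collection(),
  cts:and-query((
    cts:collection-query('col-B'),
    cts:element-range-query(xs:QName('abc:Bcusno'), '=', $customerNrs)
  ))
)[$start to $start + $page-size - 1]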
HTH!

Compare two elements of the same document in MarkLogic

I have a MarkLogic 8 database in which there are documents which have two date time fields:
created-on
active-since
I am trying to write an XQuery to search all the documents for which the value of active-since is less than the value of created-on.
Currently I am using the following FLWOR exression:
for $entity in fn:collection("entities")
let $id := fn:data($entity//id)
let $created-on := fn:data($entity//created-on)
let $active-since := fn:data($entity//active-since)
where $active-since < $created-on
return
(
$id,
$created-on,
$active-since
)
The above query takes too long to execute and with increase in the number of documents the execution time of this query will also increase.
Also, I have an element-range-index on both of the above-mentioned dateTime fields, but they are not getting used here. The cts:element-range-query function only compares one element with a set of atomic values; in my case I am trying to compare two elements of the same document.
I think there should be a better and optimized solution for this problem.
Please let me know in case there is any search function or any other approach which will be suitable in this scenario.
This may be efficient enough for you.
Take one of the values and build a range query per value. This all uses the range indexes, so in that sense it is efficient. However, at some point there is a large query that is built; it reads similar to a FLWOR statement. If you really wanted to be a bit more efficient, you could find out which of your elements has fewer unique values (the size of the index) and use that one for your iteration, thus building a smaller query. Also, you will note that on the element-values call, I also constrain it to your collection. This is just in case you happen to have that element in documents outside of your collection; it keeps the list to only those values you know are in your collection:
let $q := cts:or-query(
  for $created-on in cts:element-values(xs:QName("created-on"), (), (), cts:collection-query("entities"))
  return cts:element-range-query(xs:QName("active-since"), "<", $created-on)
)
return
  cts:search(
    fn:collection("entities"),
    $q
  )
So, let's explain what is happening with a simple example:
Let's say I have elements A and B - each with a range index defined.
Let's pretend we have combinations like this in 5 documents:
A,B
2,3
4,2
2,7
5,4
2,9
let $q := cts:or-query(
  for $a in cts:element-values(xs:QName("A"))
  return cts:element-range-query(xs:QName("B"), "<", $a)
)
This would create the following query:
cts:or-query((
  cts:element-range-query(xs:QName("B"), "<", 2),
  cts:element-range-query(xs:QName("B"), "<", 4),
  cts:element-range-query(xs:QName("B"), "<", 5)
))
And in the example above, the only match would be the document with the combination: (5,4)
You might try using cts:value-tuples(). Pass in three references: active-since, created-on, and the URI reference. Then iterate the results looking for ones where active-since is less than created-on, and you'll have the URI of the doc.
It's not the prettiest code, but it will let all the data come from RAM, so it should scale nicely.
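A rough sketch of that approach (it assumes element range indexes on created-on and active-since, and that the URI lexicon is enabled for cts:uri-reference()):

for $tuple in cts:value-tuples(
  (
    cts:element-reference(xs:QName("active-since")),
    cts:element-reference(xs:QName("created-on")),
    cts:uri-reference()
  ),
  (),
  cts:collection-query("entities")
)
let $active-since := json:array-values($tuple)[1]
let $created-on := json:array-values($tuple)[2]
let $uri := json:array-values($tuple)[3]
where $active-since lt $created-on
return $uri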
I am now using the following script to get the count of documents for which the value of active-since is less than the value of created-on:
fn:sum(
  for $value-pairs in cts:value-tuples(
    (
      cts:element-reference(xs:QName("created-on")),
      cts:element-reference(xs:QName("active-since"))
    ),
    ("fragment-frequency"),
    cts:collection-query("entities")
  )
  let $created-on := json:array-values($value-pairs)[1]
  let $active-since := json:array-values($value-pairs)[2]
  return
    if ($active-since lt $created-on) then cts:frequency($value-pairs) else 0
)
Sorry, I don't have enough reputation, hence I need to comment here on your answer. Why do you think that ML will not return (2,3) and (4,2)? I believe we are using an or-query, which will treat any single matching query as true and return the document.

Combined search query for a few xml documents

I have in each books directory /books/{book_id}/ a couple of xml documents.
/books/{book_id}/basic.xml and /books/{book_id}/formats.xml.
First one is
<document book_id="{book_id}">
<title>The book</title>
</document>
and the second is
<document book_id="{book_id}">
<format>a</format>
<format>b</format>
<format>c</format>
</document>
How can I find all books in the /books/ directory with format eq 'a' and title eq *'book'* in one query? I have one working variant where I first find all books by format with cts:search() and then filter the result in a for loop by checking the title in the basic.xml file.
Thank you!
This question is listed as MarkLogic as well as XQuery. For completeness, I have included a MarkLogic solution that is a single statement:
let $res := cts:search(doc(), cts:and-query((
  cts:element-word-query(xs:QName("title"), '*book*', ('wildcarded')),
  cts:element-attribute-range-query(xs:QName("document"), xs:QName("book_id"), '=',
    cts:element-attribute-values(xs:QName("document"), xs:QName("book_id"), (), (),
      cts:element-value-query(xs:QName("format"), 'a')))
)))
return $res
OK. Now let's break this down and have a look.
Note: This sample requires a single range index on the attribute book_id.
I took advantage of the fact that you have the same attribute in the same namespace in both types of documents. This allowed the following:
I could use a single index
Then I used element-attribute-values for the list of book_ids
-- This was constrained by the 'format' element
The list of book_ids above was used to filter the books (range query)
Which was then further filtered by the title
This approach joins the two documents using a range index which is super-fast - especially on the integer value of the book_id
It should be noted that in this particular case, I was able to isolate the proper documents because title elements only exist in one type of document.
Now, let's look at a cleaner example of the same query.
(: I used a word-query so that I could do wildcarded searches for documents with 'book' in the title. This is because your sample has the title 'The book', yet you search for 'book', so I can only conclude that you meant to have wildcard searches :)
let $title-constraint := "*book*"

(: This could also be a sequence :)
let $format-constraint := "a"

(: used for the right-side of the element-range-query :)
let $format-filter := cts:element-attribute-values(xs:QName("document"), xs:QName("book_id"), (), (),
  cts:element-value-query(xs:QName("format"), $format-constraint))

(: final results :)
let $res := cts:search(doc(), cts:and-query((
  cts:element-word-query(xs:QName("title"), $title-constraint, ('wildcarded')),
  cts:element-attribute-range-query(xs:QName("document"), xs:QName("book_id"), '=', $format-filter)
)))
return $res
Maybe stating the obvious, the best approach would be to change the model so the format is in the same document as the title and can be matched by a single query.
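For instance, if each book were a single combined document along these lines (a hypothetical restructured model), one query would do:

<document book_id="123">
  <title>The book</title>
  <format>a</format>
  <format>b</format>
</document>

cts:search(
  fn:collection(),
  cts:and-query((
    cts:directory-query("/books/", "infinity"),
    cts:element-word-query(xs:QName("title"), "book"),
    cts:element-value-query(xs:QName("format"), "a")
  ))
)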
If that's not possible, one alternative would be to turn on the uri lexicon in the database configuration (if it's not enabled already).
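Enabling it can be done through the Admin UI, or with a short Admin API script like this sketch (the database name "Documents" is just a placeholder):

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $config := admin:database-set-uri-lexicon($config, xdmp:database("Documents"), fn:true())
return admin:save-configuration($config)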
Assuming that the title is more selective than the format, something along the following lines might work.
let $title-uris := cts:uris((), (), cts:and-query((
  cts:directory-query("/books/", "infinity"),
  cts:element-word-query(xs:QName("title"), "book")
)))
let $title-dirs :=
  for $uri in $title-uris
  return fn:replace($uri, "/basic\.xml$", "/")
let $format-uris := cts:uris((), (), cts:and-query((
  cts:directory-query($title-dirs),
  cts:element-value-query(xs:QName("format"), "a")
)))
let $book-docs :=
  for $uri in $format-uris
  return fn:replace($uri, "/formats\.xml$", "/basic.xml")
for $doc in fn:doc($book-docs)
return ... do something with the basic document ...
The extra cost beyond the document reads consists of two lookups in the uri lexicon and the string manipulation. The benefit is in reading only the documents that match.
In general, it's better at scale to use the indexes to match the relevant documents instead of reading the documents into memory and filtering out the irrelevant documents. The cts:uris() and cts:search() functions always match using the indexes first (and only filter when the search option is specified). XPaths optimize by matching with the indexes when possible but have to fallback to filtering for some predicates. Unless you're careful, it's usually better to limit XPaths to navigation of nodes in memory.
Hoping that helps,
How can I find all books in /books/ directory with format eq 'a' and title eq 'book' by one query?
Try:
doc('basic.xml')/document[@book_id='X']/title[contains(., 'book')]
  [doc('formats.xml')/document[@book_id='X'][format = 'a']]
The last predicate, if it turns out empty, will result in the title not being found. If it exists, then the title will be returned.
You should, of course, replace X with your ID. And you can set the relative path to include the ID. If you have a set of IDs you want to go over, you can do this:
for $id in ('{book_id1}', '{book_id2}')
return
doc(concat($id, '/basic.xml'))/document[@book_id=$id]/title[contains(., 'book')]
  [doc(concat($id, '/formats.xml'))/document[@book_id=$id][format = 'a']]
You'll get the drift ;)
PS: I'm not sure if {...} is a legal URI path part, but I assume you'll replace it with something sensible. Otherwise, escape it with the appropriate percent-encoding.
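For example, fn:encode-for-uri() takes care of that (a quick illustration):

fn:encode-for-uri('{book_id1}')  (: returns "%7Bbook_id1%7D" :)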
I think I found a better solution:
let $book_ids := cts:values(
  cts:element-attribute-reference(xs:QName("document"), xs:QName("book_id")),
  (),
  ("map"),
  cts:and-query((
    cts:directory-query(("/books/"), "infinity"),
    cts:element-query(xs:QName("title"), "book")
  ))
)
return
  cts:search(
    /,
    cts:and-query((
      cts:element-attribute-value-query(xs:QName("document"), xs:QName("book_id"), map:keys($book_ids)),
      cts:element-value-query(xs:QName("format"), "a")
    ))
  )

idl: pass keyword dynamically to isa function to test structure read by read_csv

I am using IDL 8.4. I want to use the isa() function to determine the type of input read by read_csv(). I want to use /number, /integer, /float and /string, since I want to make sure some fields are float, others integer, and for others I don't care. I can do it like this, but it is not very readable to the human eye.
str = read_csv(filename, header=inheader)
; TODO check header
if not isa(str.(0), /integer) then stop
if not isa(str.(1), /number) then stop
if not isa(str.(2), /float) then stop
I am hoping I can do something like
expected_header = ['id', 'x', 'val']
expected_type = ['/integer', '/number', '/float']
str = read_csv(filename, header=inheader)
if not array_equal(strlowcase(inheader), expected_header) then stop
for i=0l, n_elements(expected_type)-1 do begin
  if not isa(str.(i), expected_type[i]) then stop
endfor
The above doesn't work, as '/integer' is taken literally and I guess isa() is looking for a named structure. How can you do something similar?
Ideally I want to pick expected type based on header read from file, so that script still works as long as header specifies expected field.
EDIT:
my tentative solution is to write a wrapper for ISA(). Not very pretty, but it does what I wanted... if there is a cleaner solution, please let me know.
Also, read_csv is defined to return only one of long, long64, double and string, so I could write a function to test within this limitation, but I just wanted to make it work in general so that I can reuse it for other similar cases.
function isa_generic, var, typ
  ; calls isa() http://www.exelisvis.com/docs/ISA.html with keyword
  ; if 'n', test /number
  ; if 'i', test /integer
  ; if 'f', test /float
  ; if 's', test /string
  if typ eq 'n' then return, isa(var, /number)
  if typ eq 'i' then return, isa(var, /integer)
  if typ eq 'f' then return, isa(var, /float)
  if typ eq 's' then return, isa(var, /string)
  print, 'unexpected typename: ', typ
  stop
end
IDL has some limited reflection abilities, which will do exactly what you want:
expected_types = ['integer', 'number', 'float']
expected_header = ['id', 'x', 'val']
str = read_csv(filename, header=inheader)
if ~array_equal(strlowcase(inheader), expected_header) then stop
foreach type, expected_types, index do begin
if ~isa(str.(index), _extra=create_struct(type, 1)) then stop
endforeach
It's debatable if this is really "easier to read" in your case, since there are only three cases to test. If there were 500 cases, it would be a lot cleaner than writing 500 slightly different lines.
This snippet uses some rather esoteric IDL features, so let me explain what's happening a bit:
expected_types is just a list of (string) keyword names in the order they should be used.
The foreach part iterates over expected_types, putting the keyword string into the type variable and the iteration count into index.
This is equivalent to using for index = 0, n_elements(expected_types) - 1 do and then using expected_types[index] instead of type, but the foreach loop is easier to read IMHO. Reference here.
_extra is a special keyword that can pass a structure as if it were a set of keywords. Each of the structure's tags is interpreted as a keyword. Reference here.
The create_struct function takes one or more pairs of (string) tag names and (any type) values, then returns a structure with those tag names and values. Reference here.
Finally, I replaced not (bitwise not) with ~ (logical not). This step, like foreach vs for, is not necessary in this instance, but can avoid headache when debugging some types of code, where the distinction matters.
--
Reflective abilities like these can do an awful lot, and come in super handy. They're work-horses in other languages, but IDL programmers don't seem to use them as much. Here's a quick list of common reflective features I use in IDL, with links to the documentation for each:
create_struct - Create a structure from (string) tag names and values.
n_tags - Get the number of tags in a structure.
_extra, _strict_extra, and _ref_extra - Pass keywords by structure or reference.
call_function - Call a function by its (string) name.
call_procedure - Call a procedure by its (string) name.
call_method - Call a method (of an object) by its (string) name.
execute - Run complete IDL commands stored in a string.
Note: Be very careful using the execute function. It will blindly execute any IDL statement you (or a user, file, web form, etc.) feed it. Never ever feed untrusted or web user input to the IDL execute function.
You can't access the keywords quite like that, but there is a typename parameter to ISA that might be useful. This is untested, but should work:
expected_header = ['id', 'x', 'val']
expected_type = ['int', 'long', 'float']
str = read_csv(filename, header=inheader)
if not array_equal(strlowcase(inheader), expected_header) then stop
for i = 0L, n_elements(expected_type) - 1L do begin
if not isa(str.(i), expected_type[i]) then stop
endfor
