Is Marklogic search with position possible? - xquery

There is some explanation of a use case below; the actual question follows.
I am using ML search queries on some documents that contain elements of the form:
<resource>
<version>
<metadata label="author">Jim</metadata>
...
</version>
<version>
<metadata label="author">John</metadata>
...
</version>
</resource>
Note the versioning of metadata. The uppermost version element contains the up-to-date info for the document.
The queries are based on user input; for example, the user looks for documents whose author is John.
I don't know a better way to combine an attribute value query with an element value query than this:
cts:near-query((cts:element-attribute-value-query(xs:QName("metadata"), xs:QName("label"), "author"), cts:element-value-query(xs:QName("metadata"), "John")), 0)
It does work though, so I am fine with it. What doesn't work is choosing only the last version in the resource (/resource/version[1]). If, at a certain point, the "author" was changed from "John" to "Jim", the document with the resource as shown above will always be found, because I don't know how to look only for values in the last (uppermost) version element. So I have to filter the results once more over XPath in a loop.
Is there a way to do this on an ML search query level?

You could create a field with a path that points to the metadata element with @label="author" in the first version element:
/resource/version[1]/metadata[@label="author"]
Then you could search that named field with cts:field-value-query():
cts:search(doc(), cts:field-value-query("author", "John"))
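To see what that path selects, here is a small sketch in Python that uses the standard library's ElementTree as a stand-in for path evaluation. This is an assumption for illustration only: it does not touch MarkLogic or its indexes, and the helper names latest_author and any_author are made up.

```python
import xml.etree.ElementTree as ET

# Sample document from the question: Jim is the current author, John an older one.
DOC = """<resource>
  <version><metadata label="author">Jim</metadata></version>
  <version><metadata label="author">John</metadata></version>
</resource>"""


def latest_author(xml_text):
    """Return the author recorded in the first (uppermost) version only."""
    root = ET.fromstring(xml_text)
    # Relative to <resource>: first <version> child, then its author metadata.
    node = root.find("version[1]/metadata[@label='author']")
    return node.text if node is not None else None


def any_author(xml_text):
    """Return the authors of all versions, which is what an unscoped query matches."""
    root = ET.fromstring(xml_text)
    return [m.text for m in root.findall("version/metadata[@label='author']")]


print(latest_author(DOC))
print(any_author(DOC))
```

The scoped lookup sees only the author of the first version, while the unscoped one also sees the historical value; that difference is what the field path is meant to push down into the index.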

With XPath alone, someone (you or MarkLogic) has to take the hit of filtering on the value you want. This is true even if you use a searchable expression with the filtered option. That is the nature of repeating elements in a document.
The most efficient way is to index the path in question separately and then query against that value.
Some options:
TDE template to extract the value. Extremely powerful and likely my choice, but it steps away from your simple example, so I will skip it here.
Range index. Nice, but memory-mapped, and mainly useful if you want range queries; your query is a simple value query, so we will skip this too.
Field. Simple, elegant, and what is needed here. Define the XPath to exactly what you want, and a second indexing pass will pull that value out and index it separately with its own indexing rules. You can then query this value.
Please note the semicolons (;): there are three separate executions here: (1) field creation, (2) document insert, and (3) search.
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
at "/MarkLogic/admin.xqy";
let $config := admin:get-configuration()
let $dbid := xdmp:database("Documents")
let $field-name := "latest-resource"
return
if(empty(admin:database-get-fields($config, $dbid)[./*:field-name="latest-resource"]))
then
let $field-spec := <field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://marklogic.com/xdmp/database">
<field-name>{$field-name}</field-name>
<field-path><path>/resource/version[./metadata/@label="author"][1]</path>
<weight>1.0</weight>
</field-path>
<word-searches>false</word-searches>
<field-value-searches>true</field-value-searches>
</field>
let $_ := admin:save-configuration(admin:database-add-field($config,$dbid,$field-spec))
return ()
else ();
(:--------------------------------------------:)
(
xdmp:document-insert("/sample/jim-first.xml", <resource>
<version>
<metadata label="author">Jim</metadata>
</version>
<version>
<metadata label="author">John</metadata>
</version>
</resource>),
xdmp:document-insert("/sample/john-first.xml", <resource>
<version>
<metadata label="author">John</metadata>
</version>
<version>
<metadata label="author">Jim</metadata>
</version>
</resource>)
);
(: ----------------------------------------------------- :)
cts:search(doc(), cts:field-value-query("latest-resource", "Jim"))
In this case, only the document in which Jim appears in the first version is returned:
<resource>
<version>
<metadata label="author">Jim</metadata>
</version>
<version>
<metadata label="author">John</metadata>
</version>
</resource>

Related

Does MarkLogic search API support comma for cts:or-query

I'm using MarkLogic 8, and our query looks like this:
query=Color:red,yellow,black AND Size:middle
The search options look like this:
<options xmlns="http://marklogic.com/appservices/search">
<grammar>
<quotation>"</quotation>
<implicit>
<cts:and-query strength="20" xmlns:cts="http://marklogic.com/cts"/>
</implicit>
<starter strength="30" apply="grouping" delimiter=")">(</starter>
<starter strength="40" apply="prefix" element="cts:not-query" tokenize="word">NOT</starter>
<joiner strength="10" apply="infix" element="cts:or-query" tokenize="word">OR</joiner>
<joiner strength="20" apply="infix" element="cts:and-query" tokenize="word">AND</joiner>
<joiner strength="10" apply="infix" element="cts:or-query">,</joiner>
<joiner strength="50" apply="constraint">:</joiner>
</grammar>
<constraint name="Color"><value><element name="Color" ns="" /></value></constraint>
<constraint name="Size"><value><element name="Size" ns="" /></value></constraint>
</options>
We use this to parse our query text:
cts:query(search:parse($query, $options))
However, the query is not parsed the way we expect:
<cts:or-query xmlns:cts="http://marklogic.com/cts">
<cts:element-value-query>
<cts:element>Color</cts:element>
<cts:text xml:lang="en">red</cts:text>
</cts:element-value-query>
<cts:word-query>
<cts:text xml:lang="en">yellow</cts:text>
</cts:word-query>
<cts:word-query>
<cts:text xml:lang="en">black</cts:text>
</cts:word-query>
<cts:element-value-query>
<cts:element>Size</cts:element>
<cts:text xml:lang="en">middle</cts:text>
</cts:element-value-query>
</cts:or-query>
I know that we can write the input query like this:
query=Color:red OR Color:yellow OR Color:black AND Size:middle
But that's too long.
Is there any way to shorten our input query?
The MarkLogic Search API does not do that. However, you can write a small custom search constraint on the Search API to accomplish what you are trying to do. Custom constraints are passed 2 parameters: the information on the left and right sides of the colon. You could then create the proper query to match as you like. You could probably accomplish this by extending the search library as well.
However, it is also something that you can likely take care of in your logic before passing the query to the server.
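As a rough sketch of that client-side approach, the comma list can be expanded into the long form before the string is sent to the server. The function name expand_commas and the regular expression are my own assumptions; it only handles simple unquoted values made of word characters, and it relies on the grouping parentheses that the grammar in the question already defines.

```python
import re

# Matches a constraint with a comma-separated value list, e.g. Color:red,yellow,black.
# Assumption: values are plain word characters (no quotes, spaces or escapes).
_COMMA_LIST = re.compile(r"\b(\w+):(\w+(?:,\w+)+)")


def expand_commas(query):
    """Rewrite Name:a,b,c as (Name:a OR Name:b OR Name:c)."""
    def repl(match):
        name = match.group(1)
        values = match.group(2).split(",")
        return "(" + " OR ".join(f"{name}:{value}" for value in values) + ")"

    return _COMMA_LIST.sub(repl, query)


print(expand_commas("Color:red,yellow,black AND Size:middle"))
```

Wrapping the expansion in parentheses keeps the OR group together, so it does not depend on the relative strengths of OR and AND in the grammar.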
It might be worth looking into cts:parse. You have to translate your options to bindings yourself (not too difficult), but you'll get a slightly more advanced and faster parser for your search strings. Among other things, it allows expressions like:
Color = (yellow red black) AND Size:middle
See also: http://docs.marklogic.com/guide/search-dev/cts_query#id_15151
HTH!

eXist-db ft:query returning zero result while running eXide or oxygen

I am running ft:query on a collection which is stored in eXist-db but it's returning zero results. If I use fn:contains function it works perfect but ft:query returns zero results. Below is my XML structure, index configuration file, and query:
test.xml
<article xmlns="http://www.rsc.org/schema/rscart38"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
type="ART"
xsi:schemaLocation="http://www.rsc.org/schema/rscart38 http://www.rsc.org/schema/rscart38/rscart38.xsd" dtd="RSCART3.8">
<metainfo last-modified="2012-11-23T19:16:50.023Z">
<subsyear>1997</subsyear>
<collectiontype>rscart</collectiontype>
<collectionname>journals</collectionname>
<docid>A605867A</docid>
<doctitle>NMR studies on hydrophobic interactions in solution Part
2.—Temperature and urea effect on
the self-association of ethanol in water</doctitle>
<summary/>
</article>
collection.xconf
<collection xmlns="http://exist-db.org/collection-config/1.0">
<index rsc="http://www.rsc.org/schema/rscart38"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
type="ART"
xsi:schemaLocation="http://www.rsc.org/schema/rscart38 http://www.rsc.org/schema/rscart38/rscart38.xsd"
dtd="RSCART3.8">
<fulltext default="all" attributes="false"/>
<lucene>
<analyzer id="nosw" class="org.apache.lucene.analysis.standard.StandardAnalyzer">
<param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet"/>
</analyzer>
<text qname="//rsc:article" analyzer="nosw"/>
</lucene>
<create path="//rsc:doctitle" type="xs:string"/>
<create path="//rsc:journal-full-title" type="xs:string"/>
<create path="//rsc:journal-full-title" type="xs:string"/>
</index>
</collection>
test.xq
declare namespace rsc="http://www.rsc.org/schema/rscart38";
let $coll := collection('/db/apps/test/RSC')
let $hits := $coll//rsc:doctitle[ft:query(., 'studies')]
return
$hits
Let's start from your query. The key part of your query is:
$coll//rsc:doctitle[ft:query(., 'studies')]
This performs a full text query for the string studies on rsc:doctitle elements in the collection. For this ft:query() function to work, there must be an index configuration for the named elements. This brings us to your index configuration.
In your index configuration, you have a full text (Lucene) index:
<text qname="//rsc:article" analyzer="nosw"/>
A couple of issues:
The @qname attribute should be a QName - simply, an element or attribute name. You've expressed this as a path. Remove the path //, leaving just rsc:article.
Your code does a full text query on rsc:doctitle, not on rsc:article, so I would expect your code, as written, to return 0 results. Change the existing index to rsc:doctitle, or add a new index on rsc:doctitle so that you could query either one. Reindex the collection afterwards, and as Adam suggested, check the Monex app's Indexing pane to ensure that the database has applied your index configuration as expected.
Lastly, contains() does not require an index to be in place. It benefits from the presence of a range index (i.e., your <create> elements), but range indexes are quite different from full text indexes. To learn more about these, I'd suggest reading the eXist documentation on indexing, http://exist-db.org/exist/apps/doc/indexing.xml.
I am not certain if configuring a Standard Analyzer without stopwords in the way you have done is correct. Can you check with Monex that your index has your terms in it?
Note also, if you created the index config after loading the data, then you need to reindex the collection. When you reindex, it is also worth monitoring $EXIST_HOME/webapp/WEB-INF/exist.log to ensure that the indexing is done as expected.

Improve performance of query with range indexes in eXist-db

Reading the docs at http://exist-db.org/exist/apps/doc/indexing.xml
I'm finding it difficult to understand whether and how I can improve the performance of a 'read' query (with 2 parameters: a string and an integer).
Does eXist-db have a default structural index? Can I improve a 2-parameter query with a 'range index'?
More details about my XML db (note there are 2 different dbs simply merged on the same root):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<db>
<docs>
<doc>
<header>
<year>2001</year>
<number>1</number>
<type>O</type>
</header>
<metas>
<meta>
<number>26001</number>
<details>
<detail>
<description>legge</description>
<number>19</number>
<date>14/01/1994</date>
</detail>
<detail>
<description>decreto legge</description>
<number>453</number>
<date>15/11/1993</date>
</detail>
</details>
</meta>
</metas>
</doc>
<doc>
<header>
<year>2001</year>
<number>2</number>
<type>O</type>
</header>
<metas>
<meta>
<number>26002</number>
<details>
<detail>
<description>decreto legislativo</description>
<number>29</number>
<date>03/02/1993</date>
</detail>
</details>
</meta>
<meta>
<number>26016</number>
<details>
<detail>
<description>decreto legislativo</description>
<number>29</number>
<date>03/02/1993</date>
</detail>
</details>
</meta>
</metas>
</doc>
</docs>
<full_text_docs>
<doc>
<header>
<year>2001</year>
<number>1</number>
<type>O</type>
<president>ferrari</president>
</header>
<text>lorem ipsum ...
</text>
</doc>
<doc>
<header>
<year>2001</year>
<number>2</number>
<type>O</type>
<president>ferrari</president>
</header>
<text>lorem ipsum......
</text>
</doc>
</full_text_docs>
</db>
This is my xquery
xquery version "3.0";
let $doc := doc("/db//index_test/test_general.xml")//db/docs/doc
let $fulltxt := doc("/db//index_test/test_general.xml")//db/full_text_docs/doc
return <root> {
for $a in $doc[metas/meta/details/detail[date="03/02/1993" and number = "29"]]/header
return $fulltxt[header/year/text()=$a/year/text() and
header/number/text()=$a/number/text() and
header/type/text()=$a/type/text()
]
} </root>
Basically, I find the detail/number and detail/date that match the input in the first db and use the results to query the second db. The results are all the matching <full_text_header> documents.
I would like to know if I can create indexes for the number and date fields to improve performance. Note that this is the ONLY query I need to optimize (the only one I run on this db); obviously the number and date values change :).
SOLUTION:
For a clear explanation, read joewiz's answer. My problem was getting the .xconf file recognized. It has to be placed in /db/yourcollectiondir. If you're using eXide, when you create the file you should select the XML type with the template "eXist-db collection configuration". When you save the file you will see the prompt "Apply configuration?"; click 'OK'. Only then run this XQuery: xmldb:reindex('/db/yourcollectiondir').
Now, if everything is right, when you run an XQuery involving an index you will see its usage in "Monitoring and profiling".
As that documentation page states, eXist does create a structural index for all XML stored in the database. This is not an index of values, though, so without further indexes, queries based on value (rather than structure) involve a lookup of values in the DOM. As your data grows larger, looking up values in the DOM gets slower and slower. This is where value-based indexes, such as a range index, save the day. (For a fuller explanation, see the "Indexing" section of Wolfgang Meier's "Tuning the Database" article, which is essential for getting the most performance out of eXist.)
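As a conceptual analogy only (this is plain Python, not how eXist stores or evaluates anything), the difference between scanning values and consulting a value index looks roughly like this. The sample rows mirror the detail elements from the question; all function and variable names are made up for illustration.

```python
from collections import defaultdict

# One row per <detail> element, tagged with the <doc> it belongs to.
details = [
    {"doc": 1, "number": 19, "date": "14/01/1994"},
    {"doc": 1, "number": 453, "date": "15/11/1993"},
    {"doc": 2, "number": 29, "date": "03/02/1993"},
]


def scan(rows, number, date):
    """Without a value index: inspect every row (cost grows with the data)."""
    return [r["doc"] for r in rows if r["number"] == number and r["date"] == date]


def build_index(rows, field):
    """Value index: map each distinct value of `field` to the row positions holding it."""
    index = defaultdict(set)
    for position, row in enumerate(rows):
        index[row[field]].add(position)
    return index


by_number = build_index(details, "number")
by_date = build_index(details, "date")


def lookup(number, date):
    """With value indexes: two dictionary lookups and a set intersection."""
    hits = by_number.get(number, set()) & by_date.get(date, set())
    return [details[position]["doc"] for position in sorted(hits)]


print(scan(details, 29, "03/02/1993"))
print(lookup(29, "03/02/1993"))
```

Both functions return the same documents; the indexed version simply avoids touching rows whose values cannot match.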
So, yes, you can create indexes for the <number> and <date> fields. I'd recommend the "new range" index, as described on that documentation page. Your collection.xconf file setting up these indexes would look like this:
<collection xmlns="http://exist-db.org/collection-config/1.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<index>
<range>
<create qname="number" type="xs:integer"/>
<create qname="date" type="xs:string"/>
</range>
</index>
</collection>
You have to store this within the /db/system/config/ collection, in a subcollection corresponding to the location of your data in the database. So if your data is located in /db/apps/myapp/data, you would place this collection.xconf file in /db/system/config/db/apps/myapp/data.
Note that the configuration here would only affect the for clause's queries of date and number values, and not the predicates in the return clause, which depend on the values of <year> and <type> elements. So, to ensure your query maximizes the use of indexes, you should declare indexes on these too; it seems that xs:integer would be the appropriate type for each.
Lastly, I would suggest eliminating the /text() steps, which are completely extraneous. For more on the use/abuse of text(), see Evan Lenz's article, "text() is a code smell".
Update (2016-07-17): With the updated code sample above, I have a couple of additional suggestions. First, since the code is in /db/index_test, we will store our files as follows:
Assuming you're using eXide, when you store the collection.xconf file in a collection, eXide will prompt you to have a copy of the file placed in the correct location in /db/system/config. If you're not using eXide, you need to store the collection.xconf file there yourself.
Using the unmodified query, I can confirm that despite the presence of the collection.xconf file, monex shows no indexes are being applied:
Let's make a few modifications to the file to ensure indexes are properly applied:
xquery version "3.0";
<root> {
for $a in doc("/db/index_test/test_general.xml")//detail[date = "03/02/1993" and number = 29]/ancestor::doc/header
return
doc("/db/index_test/test_general.xml")/db/full_text_docs/doc
[
header/year = $a/year and
header/number = $a/number and
header/type = $a/type
]
} </root>
With these modifications, monex shows that indexes are applied to the comparisons in the for clause:
The insights here are derived from the "Tuning the Database" article. To get full indexing for all comparisons, you will need to define additional indexes and may need to make similar modifications to your query.
One final note: the version of monex you see in these pictures is using a feature I added this weekend, called "Tare", which tries to filter out other operations from the query profiling results in order to help the user see just the effects of their own query. This feature is still just a pull request, so running the current release version, you won't see identical results.

Marklogic xquery gives different result on different servers

I have an XML file, a query and two servers.
I loaded the XML file into both using mlcp and put attribute range indexes where I think they are needed.
Our DEV server acts as I expected, but the TEST server gives back only the first map element in the document. I checked all db settings, reloaded the docs, and re-indexed both servers, with no result...
The document looks like this:
<geo version="0.3" xmlns="http://www.nvsp.nl/geo-mapping">
<meta-data>
<!--Generated by DIKW for NetwerkVSP STTip-->
<dateCreated>2014-06-27 15:17:17.643318</dateCreated>
</meta-data>
<map ppc4_id="3902" wijk_id="390213">
<bruto>196</bruto>
<stickers>19</stickers>
<netto>177</netto>
<aktief>J</aktief>
</map>
<map ppc4_id="3902" wijk_id="3902B01">
<bruto>36</bruto>
<stickers>3</stickers>
<netto>33</netto>
<aktief>J</aktief>
</map>
<map ppc4_id="3902" wijk_id="3902K01">
<bruto>245</bruto>
<stickers>44</stickers>
<netto>201</netto>
<aktief>J</aktief>
</map>
<map ppc4_id="3903" wijk_id="390301">
<bruto>256</bruto>
<stickers>37</stickers>
<netto>219</netto>
<aktief>J</aktief>
</map>
with roughly another 35000 map elements following.
The XQuery intends to find maps with certain ppc4_id or wijk_id attributes, like so:
xquery version "1.0-ml";
declare namespace gm = "http://www.nvsp.nl/geo-mapping";
let $p4_id := "6626"
let $wijk_id := "662601"
let $uri := '/data/map/geo-mapping.xml'
(: setup query:)
let $q2 := cts:element-attribute-value-query(xs:QName("gm:map"), xs:QName("ppc4_id"), $p4_id)
let $q3 := cts:element-attribute-value-query(xs:QName("gm:map"), xs:QName("wijk_id"), $wijk_id)
(: return map with wijk_id from geo:)
let $maps := cts:search(//gm:map,$q2,("unfiltered"))
return $maps
Now the DEV server finds appropriate results like:
<map ppc4_id="6626" wijk_id="662601" xmlns="http://www.nvsp.nl/geo-mapping">
<bruto>220</bruto>
<stickers>11</stickers>
<netto>209</netto>
<aktief>J</aktief>
</map>
element
<map ppc4_id="6626" wijk_id="662602" xmlns="http://www.nvsp.nl/geo-mapping">
<bruto>198</bruto>
<stickers>13</stickers>
<netto>185</netto>
<aktief>J</aktief>
</map>
... more map elements ...
But the TEST server gives back only the first map element from the doc! No matter what id I ask for.
The scary part is that it does not complain or give an error, but gives back a wrong answer.
I'm observing the same with 7.0-2.3. What you effectively see happening is that the unfiltered search returns the fragment for the entire geo-mapping document. And for some reason the searchable expression is returning just the first map element within it on your test server. Maybe there is a version difference?
What you observe is caused by the 'unfiltered' option. Run filtered and it will work fine without any extra indexes. From the looks of it, adding an attribute range index doesn't help, nor does enabling positions, though I thought it should. Maybe Mike's suggestions can help investigate what is happening there.
What does help is adding a fragment root for the map element. But I wouldn't recommend using fragmentation on such a large document. Split the geo-mapping into separate map documents. That makes getting accurate estimates much easier.
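A minimal sketch of such a split, done outside the database in Python before loading. The URI scheme (/data/map/<wijk_id>.xml) and the function name split_maps are assumptions of mine, and the sketch assumes wijk_id is unique per map; how you then load the pieces (mlcp or otherwise) is up to you.

```python
import xml.etree.ElementTree as ET

NS = "http://www.nvsp.nl/geo-mapping"


def split_maps(xml_text, uri_prefix="/data/map/"):
    """Return {uri: serialized <map> element} for each map in a geo-mapping document.

    Assumption: wijk_id is unique, so it can serve as the document name.
    """
    root = ET.fromstring(xml_text)
    docs = {}
    for map_elem in root.findall(f"{{{NS}}}map"):
        uri = f"{uri_prefix}{map_elem.get('wijk_id')}.xml"
        # Each <map> is serialized on its own, keeping its namespace declaration.
        docs[uri] = ET.tostring(map_elem, encoding="unicode")
    return docs
```

With one map per document, an unfiltered search resolves to exactly the matching maps, because each fragment now contains a single map element.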
HTH!
You can use several tools to figure out what a query is doing. In this case https://docs.marklogic.com/xdmp:plan and https://docs.marklogic.com/xdmp:query-trace should help.
You could also try https://docs.marklogic.com/xdmp:query-meters but it's generally more useful for performance analysis.
Also it's often useful to https://docs.marklogic.com/xdmp:describe your results. Sometimes that reveals subtleties that don't show up in the XML or browser rendering.

Group testNG tests without annotations

I'm responsible for enabling unit tests for one of our ETL components. I want to accomplish this using TestNG with a generic Java test class and a number of test definitions in testng.xml passing various parameters to the class. The Oracle and ETL guys should be able to add new tests without changing the Java code, so we need to use the XML suite file instead of annotations.
Question
Is there a way to group tests in testng.xml (similarly to how it is done with annotations)?
I mean something like
<group name="first_group">
<test>
<class ...>
<parameter ...>
</test>
</group>
<group name="second_group">
<test>
<class ...>
<parameter ...>
</test>
</group>
I've checked the testng.dtd and figured out that similar syntax is not allowed. But is there a workaround to allow grouping?
Thanks in advance
You can specify groups within testng.xml and then run testng using -groups
<test name="Regression1">
<groups>
<run>
<exclude name="brokenTests" />
<include name="checkinTests" />
</run>
</groups>
....
No, this is not possible at the moment.
As a rule of thumb, I don't like adding information in XML that points into Java code, because refactorings might silently break your entire build.
For example, if you rename a method or a class name, your tests might start mysteriously breaking until you remember you need to update your XML as well.
Feel free to bring this up on the testng-users mailing-list and we can see if there's interest for such a feature.
--
Cedric
