Reading the docs (http://exist-db.org/exist/apps/doc/indexing.xml),
I'm finding it difficult to understand how, and whether, I can improve the performance of a 'read' query (with 2 parameters: a string and an integer).
Does eXist-db have a default structural index? Can I improve a 2-parameter query with a 'range index'?
More details about my XML db (note there are 2 different dbs simply merged on the same root):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<db>
<docs>
<doc>
<header>
<year>2001</year>
<number>1</number>
<type>O</type>
</header>
<metas>
<meta>
<number>26001</number>
<details>
<detail>
<description>legge</description>
<number>19</number>
<date>14/01/1994</date>
</detail>
<detail>
<description>decreto legge</description>
<number>453</number>
<date>15/11/1993</date>
</detail>
</details>
</meta>
</metas>
</doc>
<doc>
<header>
<year>2001</year>
<number>2</number>
<type>O</type>
</header>
<metas>
<meta>
<number>26002</number>
<details>
<detail>
<description>decreto legislativo</description>
<number>29</number>
<date>03/02/1993</date>
</detail>
</details>
</meta>
<meta>
<number>26016</number>
<details>
<detail>
<description>decreto legislativo</description>
<number>29</number>
<date>03/02/1993</date>
</detail>
</details>
</meta>
</metas>
</doc>
</docs>
<full_text_docs>
<doc>
<header>
<year>2001</year>
<number>1</number>
<type>O</type>
<president>ferrari</president>
</header>
<text>lorem ipsum ...
</text>
</doc>
<doc>
<header>
<year>2001</year>
<number>2</number>
<type>O</type>
<president>ferrari</president>
</header>
<text>lorem ipsum......
</text>
</doc>
</full_text_docs>
</db>
This is my XQuery:
xquery version "3.0";
let $doc := doc("/db//index_test/test_general.xml")//db/docs/doc
let $fulltxt := doc("/db//index_test/test_general.xml")//db/full_text_docs/doc
return <root> {
for $a in $doc[metas/meta/details/detail[date="03/02/1993" and number = "29"]]/header
return $fulltxt[header/year/text()=$a/year/text() and
header/number/text()=$a/number/text() and
header/type/text()=$a/type/text()
]
} </root>
Basically, I find the detail/number and detail/date values that match the input in the first db and use the results to query the second db. The results are all the matching <doc> elements under <full_text_docs>.
I would like to know whether I can create indexes on the number and date fields to improve performance. Note that this is the ONLY query I need to optimize (the only one I run on this db); obviously the number and date values change :).
SOLUTION:
For a clear explanation, read joewiz's answer. My problem was getting the .xconf file recognized correctly. It has to be placed in /db/yourcollectiondir. If you're using eXide, select the XML type with the template "eXist-db collection configuration" when you create the file. When you save the file, you will see the prompt "Apply configuration?"; click 'OK'. Then run this XQuery: xmldb:reindex('/db/yourcollectiondir').
Now, if everything is set up correctly, when you run an XQuery that involves an index you will see its usage in "Monitoring and profiling".
As that documentation page states, eXist does create a structural index for all XML stored in the database. This is not an index of values, though, so without further indexes, queries based on value (rather than structure) involve a lookup of values in the DOM. As your data grows larger, looking up values in the DOM gets slower and slower. This is where value-based indexes, such as a range index, save the day. (For a fuller explanation, see the "Indexing" section of Wolfgang Meier's "Tuning the Database" article, which is essential for getting the most performance out of eXist.)
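To make the difference concrete, here is a toy illustration in Python (not how eXist implements its indexes; the `id` attribute and the function names are made up for the example). It contrasts scanning every <detail> for matching values with answering the same question from a value map built once:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in for the <docs> part of the database; the id attribute is hypothetical.
XML = """<docs>
  <doc id="a"><detail><number>19</number><date>14/01/1994</date></detail></doc>
  <doc id="b"><detail><number>29</number><date>03/02/1993</date></detail></doc>
  <doc id="c"><detail><number>29</number><date>03/02/1993</date></detail></doc>
</docs>"""

root = ET.fromstring(XML)


def scan(number, date):
    """No value index: visit every <detail> and compare its values."""
    return [
        doc.get("id")
        for doc in root.iter("doc")
        for d in doc.iter("detail")
        if int(d.findtext("number")) == number and d.findtext("date") == date
    ]


# Value index: build the (number, date) -> doc ids map once ...
value_index = defaultdict(list)
for doc in root.iter("doc"):
    for d in doc.iter("detail"):
        key = (int(d.findtext("number")), d.findtext("date"))
        value_index[key].append(doc.get("id"))


def lookup(number, date):
    """... then answer each query with a single dictionary lookup."""
    return value_index.get((number, date), [])
```

Both functions return the same documents; the scan costs time proportional to the size of the data on every query, while the lookup does not.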
So, yes, you can create indexes for the <number> and <date> fields. I'd recommend the "new range" index, as described on that documentation page. Your collection.xconf file setting up these indexes would look like this:
<collection xmlns="http://exist-db.org/collection-config/1.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<index>
<range>
<create qname="number" type="xs:integer"/>
<create qname="date" type="xs:string"/>
</range>
</index>
</collection>
You have to store this within the /db/system/config/ collection, in a subcollection corresponding to the location of your data in the database. So if your data is located in /db/apps/myapp/data, you would place this collection.xconf file in /db/system/config/db/apps/myapp/data.
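The path rule is purely mechanical. A small Python sketch of it (the function name is hypothetical; only the /db/system/config prefix comes from the eXist convention described above):

```python
CONFIG_ROOT = "/db/system/config"


def config_collection(data_collection: str) -> str:
    """Return the collection in which collection.xconf must be stored
    for data that lives in `data_collection`."""
    return CONFIG_ROOT + "/" + data_collection.strip("/")
```

For example, config_collection("/db/apps/myapp/data") gives "/db/system/config/db/apps/myapp/data".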
Note that the configuration here would only affect the for clause's comparisons of date and number values. The predicates in the return clause also depend on the values of the <year> and <type> elements, which have no index yet. So, to ensure your query maximizes the use of indexes, you should declare indexes on these as well; xs:integer looks appropriate for <year>, while <type> holds values such as O in your sample and would need xs:string.
Lastly, I would suggest eliminating the /text() steps, which are completely extraneous. For more on the use/abuse of text(), see Evan Lenz's article, "text() is a code smell".
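One reason text() is risky: it selects individual text nodes, not the element's string value. A rough Python analogy using ElementTree (the mixed-content <year> element is invented for the illustration, and ElementTree's `.text` is only an approximation of XPath's text()):

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed content: the year is split around a child element.
year = ET.fromstring("<year>20<b>01</b></year>")

# Roughly what year/text() sees: only the text node directly under <year>.
first_text_node = year.text

# Roughly what comparing the element itself uses: the full string value.
string_value = "".join(year.itertext())
```

Here first_text_node is "20" while string_value is "2001", so a comparison against "2001" succeeds only on the element's string value.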
Update (2016-07-17): With the updated code sample above, I have a couple of additional suggestions. First, since the code is in /db/index_test, we will store our files as follows:
Assuming you're using eXide, when you store the collection.xconf file in a collection, eXide will prompt you to have a copy of the file placed in the correct location in /db/system/config. If you're not using eXide, you need to store the collection.xconf file there yourself.
Using the unmodified query, I can confirm that despite the presence of the collection.xconf file, monex shows no indexes are being applied:
Let's make a few modifications to the file to ensure indexes are properly applied:
xquery version "3.0";
<root> {
for $a in doc("/db/index_test/test_general.xml")//detail[date = "03/02/1993" and number = 29]/ancestor::doc/header
return
doc("/db/index_test/test_general.xml")/db/full_text_docs/doc
[
header/year = $a/year and
header/number = $a/number and
header/type = $a/type
]
} </root>
With these modifications, monex shows that indexes are applied to the comparisons in the for clause:
The insights here are derived from the "Tuning the Database" article. To get full indexing for all comparisons, you will need to define additional indexes and may need to make similar modifications to your query.
One final note: the version of monex you see in these pictures is using a feature I added this weekend, called "Tare", which tries to filter out other operations from the query profiling results in order to help the user see just the effects of their own query. This feature is still just a pull request, so running the current release version, you won't see identical results.
There is some explanation of a use case below; the actual question follows.
I am using ML search queries on some documents that contain elements of the form:
<resource>
<version>
<metadata label="author">Jim</metadata>
...
</version>
<version>
<metadata label="author">John</metadata>
...
</version>
</resource>
Note the versioning of metadata. The uppermost version element contains the up-to-date info for the document.
The queries are based on user input; the user looks, e.g., for documents whose author is John.
I am not knowledgeable enough to combine attribute value and element/text value queries in a better way than this:
cts:near-query((cts:element-attribute-value-query(xs:QName("metadata"), xs:QName("label"), "author"), cts:element-value-query(xs:QName("metadata"), "John")), 0)
It does work though, so I am fine with it. What doesn't work is choosing only the last version in the resource (/resource/version[1]). If, at a certain point, the "author" was changed from "John" to "Jim", the document with the resource as shown above will always be found, because I don't know how to look only for values in the last (uppermost) version element. So I have to filter the results once more over XPath in a loop.
Is there a way to do this on an ML search query level?
You could create a field with a path that points to the metadata with @label="author" in the first version element: /resource/version[1]/metadata[@label="author"]. You could then use a cts:field-value-query().
Then you could search that named field:
cts:search(doc(), cts:field-value-query("author", "John"))
With XPath alone, someone (you or MarkLogic) has to take the hit of filtering on the value you want. This is true even when you use a searchable expression with the filtered option. Such is the nature of repeating elements in a document.
The most efficient way is to index the path in question separately and then query against that value.
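As a toy model of that idea in Python (this is not MarkLogic's field implementation; the function names are made up, and the two sample documents mirror the ones inserted further down), the value at /resource/version[1]/metadata[@label="author"] is pulled out once per document and then queried directly:

```python
import xml.etree.ElementTree as ET

DOCS = {
    "/sample/jim-first.xml":
        "<resource>"
        "<version><metadata label='author'>Jim</metadata></version>"
        "<version><metadata label='author'>John</metadata></version>"
        "</resource>",
    "/sample/john-first.xml":
        "<resource>"
        "<version><metadata label='author'>John</metadata></version>"
        "<version><metadata label='author'>Jim</metadata></version>"
        "</resource>",
}


def latest_author(xml_text):
    """Extract the author from the first (uppermost) <version> only."""
    first_version = ET.fromstring(xml_text).find("version")
    if first_version is None:
        return None
    node = first_version.find("metadata[@label='author']")
    return None if node is None else node.text


# "Indexing": map the extracted value to the document URIs that carry it.
field_index = {}
for uri, text in DOCS.items():
    field_index.setdefault(latest_author(text), []).append(uri)


def field_value_query(value):
    """Look the value up in the separate index; older versions are never seen."""
    return field_index.get(value, [])
```

A query for "Jim" then finds only /sample/jim-first.xml, even though "Jim" also appears in an older version of the other document.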
Some options:
TDE template to extract the value. Extremely powerful and likely my choice, but it steps away from your simple example, so I will pass on it here.
Range index. Nice, but range indexes are memory-mapped and pay off when you want range queries; yours is a simple value query, so we will skip this.
Field. Simple, elegant, and what is needed here. Define the XPath to exactly what you want, and a second indexing pass will pull that value out and index it separately with its own indexing rules. You can then query this value.
Please note the semicolons (;): there are three separate executions here: (1) field creation, (2) document insert and (3) search.
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
at "/MarkLogic/admin.xqy";
let $config := admin:get-configuration()
let $dbid := xdmp:database("Documents")
let $field-name := "latest-resource"
return
if(empty(admin:database-get-fields($config, $dbid)[./*:field-name="latest-resource"]))
then
let $field-spec := <field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://marklogic.com/xdmp/database">
<field-name>{$field-name}</field-name>
<field-path><path>/resource/version[./metadata/@label="author"][1]</path>
<weight>1.0</weight>
</field-path>
<word-searches>false</word-searches>
<field-value-searches>true</field-value-searches>
</field>
let $_ := admin:save-configuration(admin:database-add-field($config,$dbid,$field-spec))
return ()
else ();
(:--------------------------------------------:)
(
xdmp:document-insert("/sample/jim-first.xml", <resource>
<version>
<metadata label="author">Jim</metadata>
</version>
<version>
<metadata label="author">John</metadata>
</version>
</resource>),
xdmp:document-insert("/sample/john-first.xml", <resource>
<version>
<metadata label="author">John</metadata>
</version>
<version>
<metadata label="author">Jim</metadata>
</version>
</resource>)
);
(: ----------------------------------------------------- :)
cts:search(doc(), cts:field-value-query("latest-resource", "Jim"))
In this case, only the document in which Jim appears in the first version is returned:
<resource>
<version>
<metadata label="author">Jim</metadata>
</version>
<version>
<metadata label="author">John</metadata>
</version>
</resource>
My XSLT is primitive and my XQuery almost non-existent. This should be trivial, so I won't post a whole example.
I have an XQuery that I'm compiling and executing via the dotnet saxon9ee-api:
import schema default element namespace "" at "MessingAbout.xsd";
for $v in (validate { doc("MessingAbout.xml") })/element(SQUARE,FILLEDSQUARETYPE)
return <OUTPUT>{$v/@colour}</OUTPUT>
which works very nicely.
I want to use the "ContextItem" though, so I can query different XMLS, and I've got this to work, by setting the ContextItem in the XQueryEvaluator to a document.
import schema default element namespace "" at "MessingAbout.xsd";
for $v in /SQUARE
return <OUTPUT>{$v/@colour}</OUTPUT>
but I'd like to validate the context item and then use that to do things like element(SQUARE,FILLEDSQUARETYPE)... but how do you do this?
I'm not quite sure what you're attempting to do, but given "MessingAbout.xsd":
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
<xs:complexType name="FILLEDSQUARETYPE">
<xs:attribute name="colour" type="xs:string"/>
</xs:complexType>
<xs:element name="SQUARE" type="FILLEDSQUARETYPE"/>
</xs:schema>
and "MessingAbout.xml":
<SQUARE colour="red"/>
your first query produces <OUTPUT colour="red"/>, which I assume is what you expect. To use the context item in the second query, I rewrote it as:
import schema default element namespace "" at "MessingAbout.xsd";
for $v in (validate { . })/element(SQUARE,FILLEDSQUARETYPE)
return <OUTPUT>{$v/@colour}</OUTPUT>
and passed the source document on the command line: -q:test2.xq -s:MessingAbout.xml.
That gives me the same result as the first query. I hope that's helpful.
As well as the approaches suggested by Martin and Norm, you have the option of doing the validation in the calling application, e.g. Java or C#. Build the document using a s9api DocumentBuilder with validation options set, and then pass the resulting typed XdmNode as the context item when running the query. This approach is preferable if you want to do more with the validated document than just running one query. But if you do it this way, it's useful for the query to assert that it's expecting a validated document, which you can do with a "declare context-item" in the query prolog.
I am running ft:query on a collection stored in eXist-db, but it returns zero results. If I use the fn:contains function it works perfectly, but ft:query returns zero results. Below are my XML structure, index configuration file, and query:
test.xml
<article xmlns="http://www.rsc.org/schema/rscart38"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
type="ART"
xsi:schemaLocation="http://www.rsc.org/schema/rscart38 http://www.rsc.org/schema/rscart38/rscart38.xsd" dtd="RSCART3.8">
<metainfo last-modified="2012-11-23T19:16:50.023Z">
<subsyear>1997</subsyear>
<collectiontype>rscart</collectiontype>
<collectionname>journals</collectionname>
<docid>A605867A</docid>
<doctitle>NMR studies on hydrophobic interactions in solution Part
2.—Temperature and urea effect on
the self-association of ethanol in water</doctitle>
<summary/>
</article>
collection.xconf
<collection xmlns="http://exist-db.org/collection-config/1.0">
<index rsc="http://www.rsc.org/schema/rscart38"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
type="ART"
xsi:schemaLocation="http://www.rsc.org/schema/rscart38 http://www.rsc.org/schema/rscart38/rscart38.xsd"
dtd="RSCART3.8">
<fulltext default="all" attributes="false"/>
<lucene>
<analyzer id="nosw" class="org.apache.lucene.analysis.standard.StandardAnalyzer">
<param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet"/>
</analyzer>
<text qname="//rsc:article" analyzer="nosw"/>
</lucene>
<create path="//rsc:doctitle" type="xs:string"/>
<create path="//rsc:journal-full-title" type="xs:string"/>
<create path="//rsc:journal-full-title" type="xs:string"/>
</index>
</collection>
test.xq
declare namespace rsc="http://www.rsc.org/schema/rscart38";
let $coll := collection('/db/apps/test/RSC')
let $hits := $coll//rsc:doctitle[ft:query(., 'studies')]
return
$hits
Let's start from your query. The key part of your query is:
$coll//rsc:doctitle[ft:query(., 'studies')]
This performs a full text query for the string studies on rsc:doctitle elements in the collection. For this ft:query() function to work, there must be an index configuration for the named elements. This brings us to your index configuration.
In your index configuration, you have a full text (Lucene) index:
<text qname="//rsc:article" analyzer="nosw"/>
A couple of issues:
The @qname attribute should be a QName - simply, an element or attribute name. You've expressed this as a path. Remove the path //, leaving just rsc:article.
Your code does a full text query on rsc:doctitle, not on rsc:article, so I would expect your code, as written, to return 0 results. Change the existing index to rsc:doctitle, or add a new index on rsc:doctitle so that you could query either one. Reindex the collection afterwards, and as Adam suggested, check the Monex app's Indexing pane to ensure that the database has applied your index configuration as expected.
Lastly, contains() does not require an index to be in place. It benefits from the presence of a range index (i.e., your <create> elements), but range indexes are quite different from full text indexes. To learn more about these, I'd suggest reading the eXist documentation on indexing, http://exist-db.org/exist/apps/doc/indexing.xml.
I am not certain if configuring a Standard Analyzer without stopwords in the way you have done is correct. Can you check with Monex that your index has your terms in it?
Note also that if you created the index config after loading the documents, you need to reindex the collection. When you reindex, it is also worth monitoring $EXIST_HOME/webapp/WEB-INF/exist.log to ensure that the indexing is done as expected.
I have an XML file, a query and two servers.
I loaded the XML file into both using mlcp and put attribute range indexes where I think they are needed.
Our DEV server acts as I expected, but the TEST server gives back only the first map element in the document. I checked all database settings, reloaded the docs, and re-indexed both servers, with no result...
The document looks like this:
<geo version="0.3" xmlns="http://www.nvsp.nl/geo-mapping">
<meta-data>
<!--Generated by DIKW for NetwerkVSP STTip-->
<dateCreated>2014-06-27 15:17:17.643318</dateCreated>
</meta-data>
<map ppc4_id="3902" wijk_id="390213">
<bruto>196</bruto>
<stickers>19</stickers>
<netto>177</netto>
<aktief>J</aktief>
</map>
<map ppc4_id="3902" wijk_id="3902B01">
<bruto>36</bruto>
<stickers>3</stickers>
<netto>33</netto>
<aktief>J</aktief>
</map>
<map ppc4_id="3902" wijk_id="3902K01">
<bruto>245</bruto>
<stickers>44</stickers>
<netto>201</netto>
<aktief>J</aktief>
</map>
<map ppc4_id="3903" wijk_id="390301">
<bruto>256</bruto>
<stickers>37</stickers>
<netto>219</netto>
<aktief>J</aktief>
</map>
with roughly another 35000 map elements following.
The XQuery is intended to find maps with certain ppc4_id or wijk_id attributes, like so:
xquery version "1.0-ml";
declare namespace gm = "http://www.nvsp.nl/geo-mapping";
let $p4_id := "6626"
let $wijk_id := "662601"
let $uri := '/data/map/geo-mapping.xml'
(: setup query:)
let $q2 := cts:element-attribute-value-query(xs:QName("gm:map"), xs:QName("ppc4_id"), $p4_id)
let $q3 := cts:element-attribute-value-query(xs:QName("gm:map"), xs:QName("wijk_id"), $wijk_id)
(: return map with wijk_id from geo:)
let $maps := cts:search(//gm:map,$q2,("unfiltered"))
return $maps
Now the DEV server finds appropriate results like:
<map ppc4_id="6626" wijk_id="662601" xmlns="http://www.nvsp.nl/geo-mapping">
<bruto>220</bruto>
<stickers>11</stickers>
<netto>209</netto>
<aktief>J</aktief>
</map>
element
<map ppc4_id="6626" wijk_id="662602" xmlns="http://www.nvsp.nl/geo-mapping">
<bruto>198</bruto>
<stickers>13</stickers>
<netto>185</netto>
<aktief>J</aktief>
</map>
... more map elements ...
But the TEST server gives back only the first map element from the doc! No matter what id I ask for.
The scary part is that it does not complain or give an error, but gives back a wrong answer.
I'm observing the same with 7.0-2.3. What you effectively see happening is that the unfiltered search returns the fragment for the entire geo-mapping document. And for some reason the searchable expression is returning just the first map element within it on your test server. Maybe there is a version difference?
What you observe is caused by the 'unfiltered' option. Run the search filtered and it will work fine without any extra indexes. From the looks of it, adding an attribute range index doesn't help, nor does enabling positions, though I thought it should. Maybe Mike's suggestions can help investigate what is happening there.
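A toy model of the behaviour described above, in Python (this is not MarkLogic's implementation; the data structure and function names are invented, and it assumes the whole geo-mapping document is a single fragment from which an unfiltered search hands back the first node matching the searchable expression):

```python
# One fragment (the whole document) holding many <map> entries as (ppc4_id, wijk_id).
FRAGMENTS = {
    "/data/map/geo-mapping.xml": [
        ("3902", "390213"),
        ("3902", "3902B01"),
        ("6626", "662601"),
        ("6626", "662602"),
    ],
}


def candidate_fragments(ppc4_id):
    """Index resolution: which fragments contain the value anywhere?"""
    return [
        uri for uri, maps in FRAGMENTS.items()
        if any(m[0] == ppc4_id for m in maps)
    ]


def search_unfiltered(ppc4_id):
    """Trust the index: take the first <map> of each candidate fragment."""
    return [FRAGMENTS[uri][0] for uri in candidate_fragments(ppc4_id)]


def search_filtered(ppc4_id):
    """Re-check every <map> in the candidate fragments against the query."""
    return [
        m for uri in candidate_fragments(ppc4_id)
        for m in FRAGMENTS[uri] if m[0] == ppc4_id
    ]
```

In this model, search_unfiltered("6626") returns the 3902 entry that happens to come first in the fragment, while search_filtered("6626") returns the two 6626 entries. With one map per fragment, the two searches would agree.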
What does help is adding a fragment root for the map element. But I wouldn't recommend using fragmentation on such a large document. Split the geo-mapping document into separate map documents instead. That makes getting accurate estimates much easier.
HTH!
You can use several tools to figure out what a query is doing. In this case https://docs.marklogic.com/xdmp:plan and https://docs.marklogic.com/xdmp:query-trace should help.
You could also try https://docs.marklogic.com/xdmp:query-meters but it's generally more useful for performance analysis.
Also it's often useful to https://docs.marklogic.com/xdmp:describe your results. Sometimes that reveals subtleties that don't show up in the XML or browser rendering.
I want to create a catalog metadata with a computed string. So following Aspeli's book and the developer manual I proceeded to create an indexer:
# indexer.py
@grok.adapter(Entry, name='bind_representation')
@indexer(Entry)
def bindIndexer(context):
print str(IBindRepresentable(context))
return str(IBindRepresentable(context))
and register the index with genericSetup:
<!-- profiles/default/catalog.xml -->
<?xml version="1.0"?>
<object name="portal_catalog" meta_type="Plone Catalog Tool">
<index name="bind_representation" meta_type="ZCTextIndex">
<!-- I tried with meta_type="FieldIndex" too -->
<indexed_attr value="bind_representation"/>
<!-- copied from other text metadata -->
<extra name="index_type" value="Okapi BM25 Rank"/>
<extra name="lexicon_id" value="plaintext_lexicon"/>
</index>
</object>
The problems are: (1) only the index is registered, not the metadata, and (2) after reindexing the whole ZODB, bind_representation still doesn't find any entries to index, even though they exist.
The examples cited only deal with pre-existing indexes, so I'm not sure about the content of catalog.xml. bindIndexer seems not to be called at all, since its print statement is never executed. I copied bindIndexer to entry.py too, to make sure it wasn't being ignored, but still nothing.
What am I missing?
Thanks.
1- To add a new metadata column, you have to use this syntax:
<?xml version="1.0"?>
<object name="portal_catalog" meta_type="Plone Catalog Tool">
...
<column value="bind_representation"/>
</object>
2a- You are adapting your content class; you should adapt your content interface (most likely IEntry).
2b- You are using a ZCTextIndex: that index won't show you all entries anyway (even after fixing the previous point) because it's based on a lexicon. You should probably use this instead (unless you have specific bounds):
<index name="bind_representation" meta_type="FieldIndex">
<indexed_attr value="bind_representation"/>
</index>
More info:
http://bluebream.zope.org/doc/1.0/manual/componentarchitecture.html#adapters
http://maurits.vanrees.org/weblog/archive/2009/12/catalog