Marklogic xquery gives different result on different servers - xquery

I have an XML file, a query and two servers.
I loaded the xml file into both using mlcp ad put attribute range indexes on where I think they are needed.
Our dev server acts as I expected, but the TEST server gives back only the first map element in the document. Checked all db setting, reloaded the docs, re-indexed both servers no result...
The document looks like this:
<geo version="0.3" xmlns="http://www.nvsp.nl/geo-mapping">
<meta-data>
<!--Generated by DIKW for NetwerkVSP STTip-->
<dateCreated>2014-06-27 15:17:17.643318</dateCreated>
</meta-data>
<map ppc4_id="3902" wijk_id="390213">
<bruto>196</bruto>
<stickers>19</stickers>
<netto>177</netto>
<aktief>J</aktief>
</map>
<map ppc4_id="3902" wijk_id="3902B01">
<bruto>36</bruto>
<stickers>3</stickers>
<netto>33</netto>
<aktief>J</aktief>
</map>
<map ppc4_id="3902" wijk_id="3902K01">
<bruto>245</bruto>
<stickers>44</stickers>
<netto>201</netto>
<aktief>J</aktief>
</map>
<map ppc4_id="3903" wijk_id="390301">
<bruto>256</bruto>
<stickers>37</stickers>
<netto>219</netto>
<aktief>J</aktief>
</map>
with roughly another 35000 map elements following.
The XQuery intents to find maps with certain ppc4_id or wijk_id attributes like so:
xquery version "1.0-ml";
declare namespace gm = "http://www.nvsp.nl/geo-mapping";
let $p4_id := "6626"
let $wijk_id := "662601"
let $uri := '/data/map/geo-mapping.xml'
(: setup query:)
let $q2 := cts:element-attribute-value-query(xs:QName("gm:map"), xs:QName("ppc4_id"), $p4_id)
let $q3 := cts:element-attribute-value-query(xs:QName("gm:map"), xs:QName("wijk_id"), $wijk_id)
(: return map with wijk_id from geo:)
let $maps := cts:search(//gm:map,$q2,("unfiltered"))
return $maps
Now the DEV server finds appropriate results like:
<map ppc4_id="6626" wijk_id="662601" xmlns="http://www.nvsp.nl/geo-mapping">
<bruto>220</bruto>
<stickers>11</stickers>
<netto>209</netto>
<aktief>J</aktief>
</map>
element
<map ppc4_id="6626" wijk_id="662602" xmlns="http://www.nvsp.nl/geo-mapping">
<bruto>198</bruto>
<stickers>13</stickers>
<netto>185</netto>
<aktief>J</aktief>
</map>
... more map elements ...
But the TEST server gives back only the first map element from the doc! No matter what id I ask for.
The scary part is that is does not complain or give an error but gives back a wrong answer?

I'm observing the same with 7.0-2.3. What you effectively see happening is that the unfiltered search returns the fragment for the entire geo-mapping document. And for some reason the searchable expression is returning just the first map element within it on your test server. Maybe there is a version difference?
What you observe is caused by the 'unfiltered' option. Run filtered and it will work fine without any extra indexes. From the looks of it adding an attribute range index doesn't help, nor enabling positions, though I thought that should. Maybe Mike's suggestions can help investigate what is happening there.
What does help is add a fragment root for the map element. But I wouldn't recommend using fragmentation on such a large document. Split the geo-mapping into separate map documents. That makes getting accurate estimates much easier..
HTH!

You can use several tools to figure out what a query is doing. In this case https://docs.marklogic.com/xdmp:plan and https://docs.marklogic.com/xdmp:query-trace should help.
You could also try https://docs.marklogic.com/xdmp:query-meters but it's generally more useful for performance analysis.
Also it's often useful to https://docs.marklogic.com/xdmp:describe your results. Sometimes that reveals subtleties that don't show up in the XML or browser rendering.

Related

eXist-db ft:query returning zero result while running eXide or oxygen

I am running ft:query on a collection which is stored in eXist-db but it's returning zero results. If I use fn:contains function it works perfect but ft:query returns zero results. Below is my XML structure, index configuration file, and query:
test.xml
<article xmlns="http://www.rsc.org/schema/rscart38"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
type="ART"
xsi:schemaLocation="http://www.rsc.org/schema/rscart38 http://www.rsc.org/schema/rscart38/rscart38.xsd" dtd="RSCART3.8">
<metainfo last-modified="2012-11-23T19:16:50.023Z">
<subsyear>1997</subsyear>
<collectiontype>rscart</collectiontype>
<collectionname>journals</collectionname>
<docid>A605867A</docid>
<doctitle>NMR studies on hydrophobic interactions in solution Part
2.—Temperature and urea effect on
the self-association of ethanol in water</doctitle>
<summary/>
</article>
collection.xconf
<collection xmlns="http://exist-db.org/collection-config/1.0">
<index rsc="http://www.rsc.org/schema/rscart38"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
type="ART"
xsi:schemaLocation="http://www.rsc.org/schema/rscart38 http://www.rsc.org/schema/rscart38/rscart38.xsd"
dtd="RSCART3.8">
<fulltext default="all" attributes="false"/>
<lucene>
<analyzer id="nosw" class="org.apache.lucene.analysis.standard.StandardAnalyzer">
<param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet"/>
</analyzer>
<text qname="//rsc:article" analyzer="nosw"/>
</lucene>
<create path="//rsc:doctitle" type="xs:string"/>
<create path="//rsc:journal-full-title" type="xs:string"/>
<create path="//rsc:journal-full-title" type="xs:string"/>
</index>
</collection>
test.xq
declare namespace rsc="http://www.rsc.org/schema/rscart38";
let $coll := collection('/db/apps/test/RSC')
let $hits := $coll//rsc:doctitle[ft:query(., 'studies')]
return
$hits
Let's start from your query. The key part of your query is:
$coll//rsc:doctitle[ft:query(., 'studies')]
This performs a full text query for the string studies on rsc:doctitle elements in the collection. For this ft:query() function to work, there must be an index configuration for the named elements. This brings us to your index configuration.
In your index configuration, you have a full text (Lucene) index:
<text qname="//rsc:article" analyzer="nosw"/>
A couple of issues:
The #qname attribute should be a QName - simply, an element or attribute name. You've expressed this as a path. Remove the path //, leaving just rsc:article.
Your code does a full text query on rsc:doctitle, not on rsc:article, so I would expect your code, as written, to return 0 results. Change the existing index to rsc:doctitle, or add a new index on rsc:doctitle so that you could query either one. Reindex the collection afterwards, and as Adam suggested, check the Monex app's Indexing pane to ensure that the database has applied your index configuration as expected.
Lastly, contains() does not require an index to be in place. It benefits from the presence of a range index (i.e., your <create> elements), but range indexes are quite different from full text indexes. To learn more about these, I'd suggest reading the eXist documentation on indexing, http://exist-db.org/exist/apps/doc/indexing.xml.
I am not certain if configuring a Standard Analyzer without stopwords in the way you have done is correct. Can you check with Monex that your index has your terms in it?
Note also, if you created the index config after loading the index, then you need to reindex the collection. When you reindex it is also worth monitoring $EXIST_HOME/webapp/WEB-INF/exist.log to ensure that the indexing is done as expected.

Improve performance of query with range indexes in eXist-db

Reading the docs http://exist-db.org/exist/apps/doc/indexing.xml
I'm finding difficult to understand how and if I can improve the performances of a 'read' query (with 2 parameters: a string and an integer).
Do eXist-db have a default structural index? Can I improve a 2 params query with a 'range index'?
More details about my XML db (note there are 2 different dbs simply merged on the same root):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<db>
<docs>
<doc>
<header>
<year>2001</year>
<number>1</number>
<type>O</type>
</header>
<metas>
<meta>
<number>26001</number>
<details>
<detail>
<description>legge</description>
<number>19</number>
<date>14/01/1994</date>
</detail>
<detail>
<description>decreto legge</description>
<number>453</number>
<date>15/11/1993</date>
</detail>
</details>
</meta>
</metas>
</doc>
<doc>
<header>
<year>2001</year>
<number>2</number>
<type>O</type>
</header>
<metas>
<meta>
<number>26002</number>
<details>
<detail>
<description>decreto legislativo</description>
<number>29</number>
<date>03/02/1993</date>
</detail>
</details>
</meta>
<meta>
<number>26016</number>
<details>
<detail>
<description>decreto legislativo</description>
<number>29</number>
<date>03/02/1993</date>
</detail>
</details>
</meta>
</metas>
</doc>
</docs>
<full_text_docs>
<doc>
<header>
<year>2001</year>
<number>1</number>
<type>O</type>
<president>ferrari</president>
</header>
<text>lorem ipsum ...
</text>
</doc>
<doc>
<header>
<year>2001</year>
<number>2</number>
<type>O</type>
<president>ferrari</president>
</header>
<text>lorem ipsum......
</text>
</doc>
</full_text_docs>
</db>
This is my xquery
xquery version "3.0";
let $doc := doc("/db//index_test/test_general.xml")//db/docs/doc
let $fulltxt := doc("/db//index_test/test_general.xml")//db/full_text_docs/doc
return <root> {
for $a in $doc[metas/meta/details/detail[date="03/02/1993" and number = "29"]]/header
return $fulltxt[header/year/text()=$a/year/text() and
header/number/text()=$a/number/text() and
header/type/text()=$a/type/text()
]
} </root>
Basically I simply find for the detail/number and detail/date that matches the input in the first db and take the results for querying the second db. The results are all the <full_text_header> documents that matches.
I would to know if I can create indexes for the fields number and date to improve performance. Note this is the ONLY query I need to optimize (the only I do on this db) obviously number and date changes :).
SOLUTION:
For a clear explanation read the joewiz answer. My problem was the correct recognition of the .xconf file. It have to be placed in /db/yourcollectiondir. If you're using eXide when you create the file you should select Xml type with template "eXist-db collection configuration". When you try to save the file you will see a prompt "Apply configuration?" then click 'ok'. Just then run this xquery xmldb:reindex('/db/yourcollectiondir').
Now if all it's right when you run an xquery involving an index you will see the usage in "Monitoring and profiling".
As that documentation page states, eXist does create a structural index for all XML stored in the database. This is not an index of values, though, so without further indexes, queries based on value (rather than structure) would involve a lookup of values in the DOM. As your data grows larger, looking up values in the DOM gets slower and slower. This is where value-based indexes, such a range index, saves the day. (For a fuller explanation, see the "Indexing" section of Wolfgang Meier's "Tuning the Database" article, which is essential for getting the most performance out of eXist.)
So, yes, you can create indexes for the <number> and <date> fields. I'd recommend the "new range" index, as described on that documentation page. Your collection.xconf file setting up these indexes would look like this:
<collection xmlns="http://exist-db.org/collection-config/1.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<index>
<range>
<create qname="number" type="xs:integer"/>
<create qname="date" type="xs:string"/>
</range>
</index>
</collection>
You have to store this within the /db/system/config/ collection, in a subcollection corresponding to the location of your data in the database. So if your data is located in /db/apps/myapp/data, you would place this collection.xconf file in /db/system/config/db/apps/myapp/data.
Note that the configuration here would only affect the for clause's queries of date and number values, and not the predicates in the return clause, which depend on the values of <year> and <type> elements. So, to ensure your query maximized the use of indexes, you should declare indexes on these; it seems that xs:integer would be the appropriate type for each.
Lastly, I would suggest eliminating the /text() steps, which are completely extraneous. For more on the use/abuse of text(), see Evan Lenz's article, "text() is a code smell".
Update (2016-07-17): With the updated code sample above, I have a couple of additional suggestions. First, since the code is in /db/index_test, we will store our files as follows:
Assuming you're using eXide, when you store the collection.xconf file in a collection, eXide will prompt you to have a copy of the file placed in the correct location in /db/system/config. If you're not using eXide, you need to store the collection.xconf file there yourself.
Using the unmodified query, I can confirm that despite the presence of the collection.xconf file, monex shows no indexes are being applied:
Let's make a few modifications to the file to ensure indexes are properly applied:
xquery version "3.0";
<root> {
for $a in doc("/db/index_test/test_general.xml")//detail[date = "03/02/1993" and number = 29]/ancestor::doc/header
return
doc("/db/index_test/test_general.xml")/db/full_text_docs/doc
[
header/year = $a/year and
header/number = $a/number and
header/type = $a/type
]
} </root>
With these modifications, monex shows that indexes are applied to the comparisons in the for clause:
The insights here are derived from the "Tuning the Database" article. To get full indexing for all comparisons, you will need to define additional indexes and may need to make similar modifications to your query.
One final note: the version of monex you see in these pictures is using a feature I added this weekend, called "Tare", which tries to filter out other operations from the query profiling results in order to help the user see just the effects of their own query. This feature is still just a pull request, so running the current release version, you won't see identical results.

TAL Condition in Plone to hide HTML if it's a Document (Page)

I'm trying to modify my /portal_view_customizations/zope.interface.interface-plone.belowcontenttitle.documentbyline template with a tal expression, so that the document's author and the modification date do not show if the current portal type is a Document (Page). I don't mind if it shows for News Items, which are time sensitive, but not the Documents/Pages.
This is my failing Plone TAL expression:
<div class="documentByLine"
id="plone-document-byline"
i18n:domain="plone"
tal:condition="view/show and not:python:here.portal_type == 'Document'">
...
I've also tried:
<div class="documentByLine"
id="plone-document-byline"
i18n:domain="plone"
tal:condition="view/show and not:context/portal_type='Document'">
but still no luck. The tracebacks are pretty cryptic and don't relate to the TAL expression. However, if I get rid of my condition for portal_type, then it works again. Any thoughts are appreciated. A manual would be good, but I've looked at the official ones, and they don't mention this.
TAL's underlying TALES, the expression-engine of which TAL makes use of, doesn't support the mixing of expression-types in one expression. It's syntax allows only one expression-type to be specified (defaults to 'path', if omitted, btw), as no delimiter is provided (like a semicolon for chaining several TAL-statements in one element, e.g.).
But instead of mixing three expression-types you can use one python-expression, try:
python: view.show and context.portal_type()!='Document'
Update:
If you have customized a Plone's default-template via portal_view_customizations ('TTW'), now restricted python-methods cannot be accessed, that' why view/show throws an error.
It returns the allowAnonymousViewAbout-property of the site-properties, you can also check this condition yourself and your expression looks like:
tal:define="name user/getUserName"
tal:condition="python:test(name!='Anonymous User') and context.portal_type()!='Document';
Quick'n'dirty alternative:
Do it with CSS:
body.userrole-anonymous .documentByLine {display:none}
body:not(.template-document_view) .documentByLine {display:block}
This will render the hidden elements, but it's helpful for prototyping and such, or 'quickfixes' (no admin available, etc.).

avoiding XDMP-EXPNTREECACHEFULL and loading document

I am using marklogic 4 and I have some 15000 documents (each of around 10 KB). I want to load the entire content as a document ( and convert the total documents to a single csv file and output to HTTP output stream for downloading). While I load the documents this way:
let $uri := cts:uri-match('products/documents/*.xml')
let $doc := fn:doc ($uri)
The xpath has some 15000 xmls. So fn:doc throws an error XDMP-EXPNTREECACHEFULL.
Is there any workaround for this? I cannot increase tree cache size in admin console because the number of xml files in products/documents/*.xml may increase.
Thanks.
When you want to export large quantities of XML from MarkLogic, the best technique is to write the query so that results can stream, avoiding the expanded tree cache entirely. It is a very different style of coding, though: you'll have to avoid strong typing of any kind, and refactor your code to remove FLWOR expressions. You won't be able to test any of the code in cq or qconsole, either.
Take a look at http://blakeley.com/blogofile/2012/03/19/let-free-style-and-streaming/ for some tips on how to get there. At a minimum the code sample you posted would have to become:
doc(cts:uri-match('products/documents/*.xml'))
In passing I would try to rework that to avoid the *.xml part, because it will be slower than needed. Maybe something like this?
cts:search(
collection(),
cts:directory-query('products/documents/', 'infinity'))
If you need to test for something more than the directory, you could add a cts:and-query with some cts:element-query test.
For general information about this error, see the MarkLogic knowledge base article on XDMP-EXPNTREECACHEFULL

Nested REST Routing

Simple situation: I have a server with thousands of pictures on it. I want to create a restful business layer which will allow me to add tags (categories) to each picture. That's simple. I also want to get lists of pictures that match a single tag. That's simple too. But now I also want to create a method that accepts a list of tags and which will return only pictures that match all these tags. That's a bit more complex, but I can still do that.
The problem is this, however. Say, my rest service is at pictures.example.com, I want to be able to make the following calls:
pictures.example.com/Image/{ID} - Should return a specific image
pictures.example.com/Images - Should return a list of image IDs.
pictures.example.com/Images/{TAG} - Should return a list of image IDs with this tag.
pictures.example.com/Images/{TAG}/{TAG} - Should return a list of image IDs with these tags.
pictures.example.com/Images/{TAG}/{TAG}/{TAG} - Should return a list of image IDs with these tags.
pictures.example.com/Images/{TAG}/{TAG}/{TAG}/{TAG}/{TAG} - Should return a list of image IDs with these tags.
etcetera...
So, how do I set up a RESTful web service projects that will allow me to nest tags like this and still be able to read them all? Without any limitations for the number of tags, although the URL length would be a limit. I might want to have up to 30 tags in a selection and I don't want to set up 30 different routing thingies to get it to work. I want one routing thingie that could technically allow unlimited tags.
Yes, I know there could be other ways to send such a list back and forth. Better even, but I want to know if this is possible. And if it's easy to create. So the URL cannot be different from above examples.
Must be simple, I think. Just can't come up with a good solution...
The URL structure you choose should be based on whatever is easy to implement with your web framework. I would expect something like:
http://pictures.example.com/images?tags=tag1,tag2,tag3,tag4
Is going to be much easier to handle on the server, and I can see no advantage to the path segment approach that you are having trouble with.
I assume you can figure out how to actually write the SQL or filesystem query to filter by multiple tags. In CherryPy, for example, hooking that up to a URL is as simple as:
class Images:
#cherrypy.tools.json_out()
def index(self):
return [cherrypy.url("/images/" + x.id)
for x in mylib.images()]
index.exposed = True
#cherrypy.tools.json_out()
def default(self, *tags):
return [cherrypy.url("/images/" + x.id)
for x in mylib.images(*tags)]
default.exposed = True
...where the *tags argument is a tuple of all the /{TAG} path segments the client sends. Other web frameworks will have similar options.

Resources