MarkLogic 8 query performance drops after inserting a large number of XML files into my database - xquery

I inserted 200,000 XML documents (approximately 1 GB in total) into my database with MLCP. Now I want to run the search query below against that database (with the default index setup in the Admin API) to get all documents.
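import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";
let $options :=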
<options xmlns="http://marklogic.com/appservices/search">
<search-option>unfiltered</search-option>
<term>
<term-option>case-insensitive</term-option>
</term>
<constraint name="Title">
<range collation="http://marklogic.com/collation/" facet="true">
<element ns="http://learning.com" name="title" />
</range>
</constraint>
<constraint name="Keywords">
<range collation="http://marklogic.com/collation/" facet="true">
<element ns="http://learning.com" name="subjectKeyword" />
</range>
</constraint>
<constraint name="Subjects">
<range collation="http://marklogic.com/collation/" facet="true">
<element ns="http://learning.com" name="subjectHeading" />
</range>
</constraint>
<return-results>true</return-results>
<return-query>true</return-query>
</options>
let $result := search:search("**", $options, 1, 20)
return $result
Range indexes:
<range-element-index>
<scalar-type>string</scalar-type>
<namespace-uri>http://learning.com</namespace-uri>
<localname>title</localname>
<collation>http://marklogic.com/collation/</collation>
<range-value-positions>false</range-value-positions>
<invalid-values>ignore</invalid-values>
</range-element-index>
<range-element-index>
<scalar-type>string</scalar-type>
<namespace-uri>http://learning.com</namespace-uri>
<localname>subjectKeyword</localname>
<collation>http://marklogic.com/collation/</collation>
<range-value-positions>false</range-value-positions>
<invalid-values>ignore</invalid-values>
</range-element-index>
<range-element-index>
<scalar-type>string</scalar-type>
<namespace-uri>http://learning.com</namespace-uri>
<localname>subjectHeading</localname>
<collation>http://marklogic.com/collation/</collation>
<range-value-positions>false</range-value-positions>
<invalid-values>ignore</invalid-values>
</range-element-index>
In each XML document, the subjectKeyword and title values look like this:
<lmm:subjectKeyword>anatomy, biology, illustration, cross, section, digestive, human, circulatory, body, small, neck, head, ear, torso, veins, teaching, model, deep, descending, heart, brain, muscles, lungs, diaphragm, c</lmm:subjectKeyword><lmm:title>CORTY_EQ07-014.eps</lmm:title>
But it takes a long time, and Query Console reports either "Too many elements to render" or "Parser Error: Cannot parse result. File Size too large".

I'd also add that if you wanted to fetch all documents (which I wouldn't recommend on a non-trivial database), doing it directly rather than as a wildcarded search is going to be more efficient: fn:doc() (or, as Geert suggests, paginated: fn:doc()[1 to 20]).

First of all, don't try to get all documents at once. MarkLogic would have to go to disk for every document, process it, and serialize it, and last but not least, the client side needs to receive and display it all too. The latter is probably the bottleneck here. This is typically why user applications show search results 10 or 20 at a time. In other words: use pagination.
I also recommend running unfiltered for better performance.
HTH!
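As a rough sketch of the pagination advice above, assuming a 1-based page number and 20 documents per page ($page and $page-size are illustrative names, not from the question), a page can be selected with a positional range in MarkLogic's 1.0-ml dialect:

```xquery
xquery version "1.0-ml";

(: Assumed inputs: which page to show and how many documents per page. :)
let $page := 1
let $page-size := 20
(: Positions are 1-based: page 1 covers 1..20, page 2 covers 21..40, and so on. :)
let $start := ($page - 1) * $page-size + 1
let $end := $start + $page-size - 1
return fn:doc()[$start to $end]
```

The same $start and $page-size values can be passed as the third and fourth arguments of search:search, which is what the 1 and 20 in the question's call already do.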

Pagination is definitely key here, and I'm curious about your facets. From your example, I'm imagining "Title" is almost always unique across your 200k documents. And the lmm:subjectKeyword element seems like it needs a little post-processing to make it more useful as a facet - it's a string of comma-delimited values, which means subjectKeyword will almost always be unique too (I recommend putting each of these values into a separate element, that would be much more useful as a facet). And I'm guessing subjectHeading is mostly unique too.
Facets are generally useful when you have a bounded set of values - e.g. for laptops, bounded sets include manufacturer, monitor size, and buckets for price range. Once you get into hundreds of values, the utility of a facet decreases for a user - how many users really want to sort through hundreds or thousands of values to find what they want? And in your case, we're probably talking about tens of thousands of unique values, if not 200k unique values (particularly for "Title"). And - when you have that many unique values, facet resolution time is going to take longer.
So before exploring the facet resolution time - what problem are you trying to solve with these 3 facets?
Without knowing anything more, I'd post-process that subjectKeyword element into many elements, each with a single keyword in it, and then put a facet on that element. Ideally, you have dozens of keywords, maybe hundreds, and resolving that facet should be very fast.
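To illustrate the post-processing step, here is a minimal sketch of splitting one comma-delimited element into one element per keyword. The helper name local:split-keywords is hypothetical, and the lmm prefix is assumed to map to http://learning.com (the namespace used in the range index definitions). How the result gets written back (for example during ingestion, or in a batched update) is deliberately left out.

```xquery
xquery version "1.0-ml";
declare namespace lmm = "http://learning.com";

(: Hypothetical helper: returns one element per keyword,
   reusing the name of the original element. :)
declare function local:split-keywords($kw as element()) as element()*
{
  for $token in fn:tokenize(fn:string($kw), ",")
  let $value := fn:normalize-space($token)
  (: Skip empty tokens, e.g. from a trailing comma. :)
  where $value ne ""
  return element { fn:node-name($kw) } { $value }
};

local:split-keywords(
  <lmm:subjectKeyword>anatomy, biology, illustration</lmm:subjectKeyword>
)
```

For the sample input this yields three lmm:subjectKeyword elements, one per keyword, so the existing range index on subjectKeyword would then hold individual keywords rather than whole comma-delimited strings.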

Related

Using Java API for Container Constraint (nested)

I'm using MarkLogic v8.
I am trying to apply a container constraint on a structured query to return only documents with value x in element c (nested within elements a and b).
queryBuilder.containerConstraint() takes a parameter for an option name and a StructuredQueryDefinition. My option looks like this:
<options xmlns='http://marklogic.com/appservices/search'>
<constraint name='language'>
<element name=\"name\" ns=\"\"/>
</constraint>
</options>
"name" is the name of the innermost element (c) containing the value I want to reference against. Is this how the option should be constructed, or should 'name' instead be the name of the outermost element?
How should the StructuredQueryDefinition (that containerConstraint() accepts as a parameter) be constructed? Should I be writing raw XML, or are there construction methods I can pass in?
Is there a better way to do this? I already have a working Term search, I just need to be able to filter by a property set inside the document.
I think I found an answer:
Option was as follows:
<search:options
xmlns:search='http://marklogic.com/appservices/search'>
<search:constraint name='language'>
<search:word>
<search:element name='name' ns=''/>
</search:word>
</search:constraint>
</search:options>
Then called the option in a Word Constraint:
queryBuilder.wordConstraint("language", MY_LANGUAGE)
This appears to do what I wanted it to.

How to write OSLC query where clause in Maximo Anywhere to evaluate somevalue < now()

I'm configuring Work Execution. The Work Order History query that is called when retrieving past work orders for assets or locations is open-ended. Consequently, several thousand rows are retrieved each time and the application times out. I can attach a where clause (see below) to limit it to records with actfinish after a specific date. However, what I want to do is something like this...
spi_wm:actfinish>now()-30
<!--WorkOrder History Asset Resource-->
<resource id="workOrderHistoryAssetLoc" class="application.business.WorkOrderObject" defaultOrderBy="wonum asc" describedBy="http://jazz.net/ns/ism/work/smarter_physical_infrastructure#WorkOrder" name="workOrderHistoryAssetLoc" pageSize="50" providedBy="/oslc/sp/WorkManagement">
<attributes id="workOrderHistoryAsset_attributes1">
<attribute describedByProperty="dcterms:identifier" id="workOrderHistoryAsset_identifier_dctermsidentifier1" index="true" name="identifier"/>
<attribute describedByProperty="oslc:shortTitle" id="workOrderHistoryAsset_wonum_oslcshortTitle1" index="true" name="wonum"/>
<attribute describedByProperty="dcterms:title" id="workOrderHistoryAsset_description_dctermstitle1" index="true" method="descriptionChanged" name="description"/>
<attribute describedByProperty="spi:status" id="workOrderHistoryAsset_status_spistatus" index="true" method="statusChanged" name="status"/>
<localAttribute dataType="string" id="workOrderHistoryAsset_statusdesc_string" name="statusdesc"/>
</attributes>
<queryBases id="workOrderHistoryAsset_queryBasesh">
<queryBase defaultForSearch="true" id="workOrderHistoryAsset_queryBase_searchAllWorkOrdersh" name="searchAllWorkOrdersAsset" queryUri="/oslc/os/oslcwodetail?savedQuery=getWithComplexQuery"/>
<!-- TODO AWH 20170130 - add where clause to this query -->
</queryBases>
<whereClause clause="spi:status in ['COMP','CLOSE'] and spi_wm:actfinish>'2016-10-10T09:50:00-04:00'" id="workOrderHistoryAssetLoc_whereClause"/>
</resource>
I see formulas elsewhere in the app.xml, but I don't know what operators or language are available to accomplish something like this. I was hoping the whereClause attribute had the ability to use a resolverClass and resolverFunction so that I could replace a named parameter with a value derived from a JavaScript function... no dice. Any help would be appreciated!
It looks like you are attempting to set the where clause in the app.xml. While I think this could work, it would probably be a million times easier to do the following.
1. Duplicate the resource, then comment out the original.
2. Create a saved query in Maximo with the where clause you need:
   a. spi:status in ['COMP','CLOSE'] and spi_wm:actfinish>'2016-10-10T09:50:00-04:00'
3. Name the saved query "ANYWHERE_WOHIST" or something like that.
4. Modify the duplicate resource to point to your new saved query:
<queryBase defaultForSearch="true" id="workOrderHistoryAsset_queryBase_searchAllWorkOrdersh" name="searchAllWorkOrdersAsset" queryUri="/oslc/os/oslcwodetail?savedQuery=ANYWHERE_WOHIST"/>
Also, this allows the query where clause to be managed in the backend, so when your users decide they want to see something else here you can manage the query within Maximo. We're nearing the end of our project with Anywhere, so feel free to reach out if you'd like to swap war stories.

MarkLogic cts:element-query false positives?

Given this document :-
<items>
<item><type>T1</type><value>V1</value></item>
<item><type>T2</type><value>V2</value></item>
</items>
unsurprisingly, I find that this will pull back the page in a cts:uris() :-
cts:and-query((
cts:element-query(xs:QName('item'),
cts:element-value-query(xs:QName('type'),'T1')
),
cts:element-query(xs:QName('item'),
cts:element-value-query(xs:QName('value'),'V2')
)
))
but somewhat surprisingly (to me at least) I also find that this will too :-
cts:element-query(xs:QName('item'),
cts:and-query((
cts:element-value-query(xs:QName('type'),'T1'),
cts:element-value-query(xs:QName('value'),'V2')
))
)
This doesn't seem right, as there is no single item with type=T1 and value=V2.
To me this seems like a false positive.
Have I misunderstood how cts:element-query works?
(I have to say that the documentation isn't particularly clear in this area).
Or is this something where MarkLogic strives to give me the result I expect, and had I had more or better indexes in place, I would be less likely to get a false positive match.
In addition to the answer by @wst, you only need to enable element value positions to get accurate results from unfiltered search. Here is some code to show this:
xdmp:document-insert("/items.xml", <items>
<item><type>T1</type><value>V1</value></item>
<item><type>T2</type><value>V2</value></item>
</items>);
cts:search(collection(),
cts:element-query(xs:QName('item'),
cts:and-query((
cts:element-value-query(xs:QName('type'),'T1'),
cts:element-value-query(xs:QName('value'),'V2')
))
), 'unfiltered'
)
Without element value positions enabled this returns the test document. After enabling the positions, the query returns nothing.
As @wst said, cts:search() runs filtered by default, whereas cts:uris() (and, for instance, xdmp:estimate()) only runs unfiltered.
HTH!
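The setting can be switched on in the Admin interface, or scripted. A minimal sketch using the Admin API, assuming the database is named "Documents" (substitute your own) and that the code runs as a user with admin privileges:

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is an assumed database name. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
(: Turn on the "element value positions" index setting. :)
let $config := admin:database-set-element-value-positions($config, $db, fn:true())
return admin:save-configuration($config)
```

Changing an index setting like this makes existing content subject to reindexing, so on a large database the unfiltered results may only become accurate once reindexing has finished.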
Yes, I think this is a slight misunderstanding of how queries work. In cts:search, the default behavior is to enable the filtered option. In this case ML will evaluate the query using only indexes, and then once candidate documents have been selected, it will load them into memory, inspect, and filter out false positives. This is more time consuming, but more accurate.
cts:uris is a lexicon function, so queries passed to it will only resolve via indexes, and there is no option to filter false positives.
The simple way to handle this query via indexes would be to change your schema such that documents are based on <item> instead of <items>. Then each item would have a separate index entry, and results would not be commingled before filtering.
Another way that doesn't involve updating documents is to wrap the queries you expect to occur in the same element in a cts:near-query. That would prevent a <type> in one <item> from matching with a <value> in a different <item>. I suggest reading the documentation because you may need to enable one or more position-based indexes for cts:near-query to be accurate.
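If neither the schema nor the index settings can be changed, one hedged alternative to cts:uris is to run the same query through a filtered cts:search and read the URI off each result. This is only a sketch: it trades speed for accuracy, because every candidate document is loaded for filtering.

```xquery
xquery version "1.0-ml";

let $query :=
  cts:element-query(xs:QName("item"),
    cts:and-query((
      cts:element-value-query(xs:QName("type"), "T1"),
      cts:element-value-query(xs:QName("value"), "V2")
    ))
  )
(: "filtered" is already the default for cts:search; it is spelled out for clarity. :)
for $doc in cts:search(fn:doc(), $query, "filtered")
return xdmp:node-uri($doc)
```

On large result sets this should be combined with pagination, since filtering cost grows with the number of candidate documents.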

element-attribute-range-query fetching result but element-attribute-value-query is not fetching any result

I want to fetch the documents that have a particular element attribute value.
I tried cts:element-attribute-value-query but didn't get any results. For the same element attribute value, however, cts:element-attribute-range-query does return results.
Here is the sample snippet I used.
let $s-query := cts:element-attribute-range-query(xs:QName("tit:title"),xs:QName("name"),"=",
"SampleTitle",
("collation=http://marklogic.com/collation/codepoint"))
let $s-query := cts:element-attribute-value-query(xs:QName("tit:title"),xs:QName("name"),
"SampleTitle",
())
return cts:search(fn:doc(),($s-query))
The problem with the range query is that it needs a range index. I have hundreds of databases on multiple hosts, and I would need to create range indexes on each of them.
What could be the problem with attribute-value-query?
After some research, I found the issue.
The result document is actually a French-language document with the following structure. This is a sample.
<doc xml:lang="fr:CA" xmlns:tit="title">
<tit:title name="SampleTitle"/>
</doc>
cts:element-attribute-value-query is a language-dependent query. To get French-language results, the language needs to be specified in the options, as follows.
cts:element-attribute-value-query(xs:QName("tit:title"),xs:QName("name"), "SampleTitle",("lang=fr"))
But cts:element-attribute-range-query doesn't require the language option.
Thanks for the effort.

Xquery on MarkLogic using OR

This is a newbie MarkLogic question. Imagine an xml structure like this, a condensation of my real business problem:
<Person id="1">
<Name>Bob</Name>
<City>Oakland</City>
<Phone>2122931022</Phone>
<Phone>3123032902</Phone>
</Person>
Note that a document can and will have multiple Phone elements.
I have a requirement to return information from EVERY document that has a Phone element that matches ANY of a list of phone numbers. The list may have a couple of dozen phone numbers in it.
I have tried this:
let $a := cts:word-query("3738494044")
let $b := cts:word-query("2373839383")
let $c := cts:word-query("3933849383")
let $or := cts:or-query( ($a, $b, $c) )
return cts:search(/Person/Phone, $or)
which does the query properly, but it returns a sequence of Phone elements inside a Results element. My goal is instead to return all the Name and City elements along with the id attribute from the Person element, for every matching document. Example:
<results>
<match id="18" phone="2123339494" name="bob" city="oakland"/>
<match id="22" phone="3940594844" name="mary" city="denver"/>
etc...
</results>
So I think I need some form of cts:search that allows both this boolean capability but also allows me to specify what part of each document gets returned. At that point then I could further process the result with XPATH. I need to do this efficiently so for example I think it would NOT be efficient to return a list of document uri's and then query for each document in a loop. Thanks!
Your approach is not as bad as you might think. There are only a few changes necessary to make it work as you like.
First of all, you are better off using cts:element-value-query instead of cts:word-query. It will allow you to limit the searched values to a specific element. It performs best when you add an element range index for that element, but it is not required. It can rely on the always present word index as well.
Secondly, there is no need for the cts:or-query. Both cts:word-query and cts:element-value-query functions (as well as all other related functions) accept multiple search strings as one sequence argument. They are automatically treated as or-query.
Thirdly, the phone numbers are your 'primary key' in the result, so returning a list of all matching Phone elements is the way to go. You just need to realize that the resulting Phone elements are still aware of where they came from. You can easily use XPath to navigate to parent and siblings.
Fourthly, there is nothing against looping over the search results. It may sound a bit weird, but it doesn't cost much extra performance. Actually, it is pretty much negligible, in MarkLogic Server that is. Most performance is lost when you try to return many results (more than several thousand), in which case most of the time goes into serializing it all. And if it is likely you will have to handle lots of search results, it is wise to start using pagination straight away.
To get what you ask, you could use the following code:
<results>{
for $phone in
cts:search(
doc()/Person/Phone,
cts:element-value-query(
xs:QName("Phone"),
("3738494044", "2373839383", "3933849383")
)
)
return
<match id="{data($phone/../@id)}" phone="{data($phone)}" name="{data($phone/../Name)}" city="{data($phone/../City)}"/>
}</results>
Best of luck.
Here's what I would do:
let $numbers := ("3738494044", "2373839383", "3933849383")
return
<results>{
for $person in cts:search(/Person, cts:element-value-query(xs:QName("Phone"),$numbers))
return
<match id="{data($person/@id)}" name="{data($person/Name)}" city="{data($person/City)}">
{
for $phone in $person/Phone[cts:contains(.,$numbers)]
return element phone {$phone}
}
</match>
}</results>
First, there's an implicit OR when passing multiple values into word-query and value-query and their cousins, and this query is more efficiently resolved from the indexes, so do this when you can.
Second, an individual might match on more than one phone number, so you need that additional inner loop to effectively group by individual.
I would not create a range index for this - no need, and it isn't necessarily faster. There are indexes for element values by default, so you can leverage those with element-value-query.
You could do all of this with the SearchAPI and a little XSLT. That would make it easy to start combining names and numbers and other conditions in a single query.
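As a sketch of the Search API route, assuming Phone is in no namespace (as in the sample document) and using "phone" as an illustrative constraint name, a value constraint lets the phone numbers be OR-ed together in the query text:

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: "phone" is an illustrative constraint name, not from the question. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="phone">
      <value>
        <element ns="" name="Phone"/>
      </value>
    </constraint>
  </options>
return
  search:search(
    "phone:3738494044 OR phone:2373839383 OR phone:3933849383",
    $options
  )
```

The call returns a search:response element whose results would still need to be reshaped (with XSLT or plain XQuery) into the match elements shown in the question.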
