I have a few XML documents in marklogic which have the structure
<abc:doc>
  <abc:doc-meta>
    <abc:meetings>
      <abc:meeting>
      </abc:meeting>
      <abc:meeting>
      </abc:meeting>
    </abc:meetings>
  </abc:doc-meta>
</abc:doc>
We can have more than one <abc:meeting> element under the <abc:meetings> element.
I am trying to write a cts:search query that returns only documents containing more than one <abc:meeting> element.
Please advise
This is tricky. Ideally, you'd want to drive searches from indexes for best performance. Unfortunately, MarkLogic doesn't keep track of element counts in its universal index, and aggregating counts from a range index can be cumbersome.
The simplest solution overall would be to add a count attribute on abc:meetings, and then add a range index on that. It does mean you'd have to change your data, and you'd have to keep that attribute in sync with each change.
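With that attribute and an int attribute range index in place, the search becomes a pure index lookup. A sketch (the count attribute name is the assumption from above, and the abc prefix must be declared in scope):

(: assumes an int attribute range index on abc:meetings/@count :)
cts:search(
  fn:collection(),
  cts:element-attribute-range-query(
    xs:QName("abc:meetings"), xs:QName("count"), ">", xs:int(1)
  )
)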
You could also just search on the presence of abc:meeting with cts:element-query(), and append an XPath predicate to count the number of elements afterwards. Something like:
cts:search(
  fn:collection(),
  cts:element-query(xs:QName('abc:meeting'), cts:true-query())
)[count(.//abc:meeting) > 1]
If not many documents contain meetings, this might work fairly well for you, but it still requires pulling up all documents that contain meetings, so it could be expensive.
I toyed with the idea of leveraging cts:near-query(), but that is driven by word positions, so it depends on the actual number of tokens inside a meeting. If that were always an exact number of tokens (unlikely, I'd guess), you could use the minimal-distance option on a double cts:element-query() wrapped in a cts:near-query(). It might help optimize the previous option a little, though.
The most performant option I can think of right now involves adding a user-defined aggregate function (UDF). It unfortunately means compiling C++ code. I happen to have written such a UDF in the past, which you should be able to use as-is after compilation and installation. For details, see:
https://github.com/grtjn/doc-count-udf
and
http://docs.marklogic.com/guide/app-dev/aggregateUDFs
HTH!
It boils down to how many "a few" is. If it's thousands or fewer, then the cts:search plus XPath predicate that grtjn presents above will work fine. If it's more, I'd add the count attribute to abc:meetings and then use a pre-commit trigger (e.g. on the collection of these documents) to ensure that the count attribute value is kept in sync. You'd need a range index to be able to query for "documents that have a count of meetings of 2 or greater".
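A minimal sketch of such a pre-commit trigger module, assuming a single abc:meetings element per document (the abc namespace URI below is a placeholder for whatever your schema uses):

xquery version "1.0-ml";
import module namespace trgr = "http://marklogic.com/xdmp/triggers"
  at "/MarkLogic/triggers.xqy";
(: placeholder URI; bind abc to your real namespace :)
declare namespace abc = "http://example.com/abc";

declare variable $trgr:uri as xs:string external;
declare variable $trgr:trigger as node() external;

(: recount meetings and keep the count attribute in sync :)
let $meetings := fn:doc($trgr:uri)//abc:meetings
let $count := fn:count($meetings/abc:meeting)
return
  if (fn:empty($meetings)) then ()
  else if ($meetings/@count) then
    xdmp:node-replace($meetings/@count, attribute count { $count })
  else
    xdmp:node-insert-child($meetings, attribute count { $count })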
Of course, if all you need to query on is whether there's more than one meeting, then just add a "multiple" attribute to abc:meetings with a value of "true". Then you don't need a range index - you can do a cts:element-attribute-value-query on abc:meetings and multiple="true".
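That query would look something like this (a sketch; the abc prefix must be bound to your actual namespace):

cts:search(
  fn:collection(),
  cts:element-attribute-value-query(
    xs:QName("abc:meetings"), xs:QName("multiple"), "true"
  )
)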
/* Method 1 */
FOR EACH customer NO-LOCK WHERE customer.name EQ 'John':
DISPLAY customer.name.
END.
/* Method 2 */
for each customer where (customer.name EQ 'John'):
DISPLAY customer.name.
end.
Could you please explain, using brackets, how the compiler will evaluate these?
Your second example is NOT faster. It is actually slower because you failed to specify NO-LOCK. By default you will therefore get the record(s) with a SHARE-LOCK which will then need to be unlocked as each record goes out of scope. This extra work takes time and results in a slower query.
If you are connecting client/server it can be orders of magnitude slower because:
a) each record will require 3 network messages: one to ask for it, another to return it, and a third to unlock it.
b) NO-LOCK queries can pack multiple records into a response message. This is much more efficient than requesting and sending them one at a time. And since there is no lock, nothing needs to be unlocked.
(In your sample you are only getting one record so the difference is pretty small.)
"Index Brackets" are an entirely different concept from grouping sub-expressions with "(" and ")". An index bracket is a set of records specified by elements of the WHERE clause. In both of your examples above you have an equality match on a field which is the leading component of a unique index. So the "bracket" is exactly one record. (Assuming that this is the standard sports database!)
If you specify something more complicated like:
for each order no-lock
    where order.custid = 1
      and order.ship-date >= 7/1/2019
      and order.ship-date <= 7/31/2019:
Progress uses a static, rule-based query optimizer. The indexes are chosen at compile time. The rules are complicated, but the most important, by far, is that equality matches on leading components keep your index in play. Range matches are next most useful. But once you get a range match, no further fields will be considered for index selection.
There can be cases where using parentheses to group elements of the WHERE clause has an impact on index selection. (The case where there is only one field being selected on is not one of them.) Generally this is going to be a situation where you have a complex query with a mix of AND and OR elements. If nothing else, the use of parentheses in these situations makes operator precedence much less error-prone and easier for a human to read.
There is a lot of material on the index selection process available. I suggest that you start here: https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/wp-abl-triggers/general-rules-for-choosing-a-single-index.html
There are also excellent presentations at every PUG Challenge on the topic. If you cannot attend in person (you should), many of them are available for download: http://pugchallenge.org/downloads.html
We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized; e.g., the moment of the last update record for TSLA may be different than the one for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash key Symbol and the range key Moment, and we believe that would let us achieve the first query easily and efficiently.
We also assume could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table, or in the same table with extra info like a column IsLatestForMachineKey (effectively a bool). With every insert, you'd grab the row where IsLatestForMachineKey=1, compare the Moments, and if the inserted record is newer, set the new one to 1 and the older one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) exactly the same schema. The hash key of both should be symbol, and both should have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key; stocks-historical uses moment as its range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems API.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out-of-order data.
The read operations are pretty straightforward, but I'll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table for that symbol with no range key condition.
The data model I am planning would have a few property "fields" in place, including a "category/tags" property, which would be a list/array of a lot of tags.
I'm planning on querying on one category at a time. I am not interested in indexing which entities have combinations of categories, just individual categories.
To be clear, I am NOT referring to simply excluding a particular property from indexing.
Bonus Question:
It seems Google Datastore doesn't like "monotonically increasing" property values (i.e. timestamps), presumably because they create hotspots on the machines while building indexes. Would just storing the current calendar date help? I could see that creating even more of a "hotspot", since every entity for 24 hours would have the same index value for that property. Is there some other way of storing data about when each entity was recorded?
Indeed, you should encounter no issues creating a built-in index, as mentioned in the reply above. Still, properties with array values can behave in surprising ways. For a query with more than one filter, all conditions defined by the filters must be satisfied by at least one of the array's individual values for the entity to match. Equality filters are the exception to this rule.
Sort order is also unusual: the first value seen in the index determines an entity's sort order.
I don't think a property index (aka built-in index) on an array property creates the index with various value combinations. I believe each value in the array is indexed. For example, if you have a Book with two tags, the index will have two entries, one for each tag. Adding another book with three tags would add 3 more entries to the Tags index. This index allows you to query for books based on a single tag as well as multiple tags.
The "combination of values" that you mentioned happens if you create a composite index containing more than one Array type (e.g. Authors and Tags of a Book), and all/most books have multiple authors and multiple tags.
You should not have any issues creating a builtin index on your Category/Tag.
On your other question about indexing an entity created/modified timestamp, I do see that the Best Practices guide says to avoid indexing such a property:
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.
I am not sure what the alternative would be. If you don't have to query or sort on the timestamp, you are fine storing it with the property excluded from indexing.
Our database contains documents with a lot of metadata, including relationships between those documents. Fictional example:
<document>
  <metadata>
    <document-number>ID 12345 : 2012</document-number>
    <publication-year>2012</publication-year>
    <cross-reference>ID 67890 : 1995</cross-reference>
    <cross-reference>ID 67890 : 1998</cross-reference>
    <cross-reference>ID 67891 : 2000</cross-reference>
    <cross-reference>ID 12345 : 2004</cross-reference>
    <supersedes>ID 12345 : 2004</supersedes>
    ...
  </metadata>
</document>
<document>
  <metadata>
    <document-number>ID 12345 : 2004</document-number>
    <publication-year>2004</publication-year>
    <cross-reference>ID 67890 : 1995</cross-reference>
    <cross-reference>ID 67890 : 1998</cross-reference>
    <cross-reference>ID 67891 : 2000</cross-reference>
    <cross-reference>ID 12345 : 2012</cross-reference>
    <cross-reference>ID 12345 : 2001</cross-reference>
    <superseded-by>ID 12345 : 2012</superseded-by>
    <supersedes>ID 12345 : 2001</supersedes>
    ...
  </metadata>
</document>
We're using a 1-box search, based on the MarkLogic search api, to allow users to search these documents. The search grammar describes a variety of constraints and search options, but mostly (and by default) users search via a field defined to include most of the metadata elements, with (somewhat) carefully chosen weights (what really matters here is that document-number has the highest weight).
The problem is that the business wants quite specific ordering of results, and I can't think of a way to achieve it using the search api.
The requirement that's causing trouble is that if the user's search matches a document number (say they search for "12345"), then all documents with that document number should be at the top of the result-set, ordered by descending date. It's easy enough to get them at the top of the result-set; document-number has the highest weight, so sorting by score works fine. The problem is that the secondary sort by date doesn't work: even though all the document-number matches have higher scores than other documents, they don't have the same score, so they end up ordered by how often the search term appears in the rest of the metadata, which isn't really meaningful at all.
What I think we really need is a way of having the search api score results simply by the highest weighted element that matches the search-term, without reference to any other matches in the document. I've had a look at the scoring algorithms and can't see one that does that; have I missed something or is this just not possible? Obviously, it doesn't have to be score that we order by; if there's some other way to get at the score of the single best match in a document and use it for sorting, that would be fine.
Is there some other solution I haven't even thought of?
I thought of doing two searches (one on document-number, and one on the whole metadata tree) and then combining the results, but that seems like it's going to cause a lot of pain with pagination and performance, which sort of defeats the purpose of using the search api in the first place.
I should add that it is correct to have those other matches in the result-set, so we can't just search only on document-number.
I think you've reached the limits of what the high-level search API can do for you. I have a few tricks to suggest, though. These won't be 100% robust, but they might be good enough for the business. Then you can get on with the application. Sorry if I sound cynical or dismissive, but I don't believe in micromanaging search results.
Simplest possible: re-sort the first page in memory. That first page could be a bit larger than the page you show to the user. Because it is still limited in size, you can make the rules for this fairly complex without suffering much. That would fix your 'descending date' problem. The results from page 1 wouldn't quite match up with page 2, but that might be good enough.
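A minimal sketch of that in-memory re-sort, assuming $query and $qtext hold the parsed cts:query and the raw user input, over-fetching 100 results and showing 10:

let $page := cts:search(fn:doc(), $query)[1 to 100]
let $exact :=
  for $d in $page[cts:contains(.//document-number, cts:word-query($qtext))]
  order by xs:integer(($d//publication-year)[1]) descending
  return $d
(: document-number matches first, newest first; then the rest in score order :)
return ($exact, $page[fn:not(. intersect $exact)])[1 to 10]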
Taking the next step in complexity, consider using document-quality to handle the descending-date issue. This approach is used by http://markmail.org among others. As each document is inserted or updated, set document quality using a number derived from the date. This could be days, weeks, or months since 1970 or some other fixed date. Newer results will tend to float to the top. If any other boosts tend to swamp the date-based boost, you might get close to what you want.
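As a sketch, assuming quality is set wherever you insert or update the document, and using days since 1970 derived from publication-year:

(: set quality from the document date, e.g. days since 1970 :)
let $year := ($doc//publication-year)[1]
let $days := fn:days-from-duration(
  xs:date(fn:concat($year, "-01-01")) - xs:date("1970-01-01"))
return xdmp:document-set-quality(xdmp:node-uri($doc), $days)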
There might also be some use in analyzing the query to extract the potentially boosting terms. If necessary you could then begin a recursive run of xdmp:exists(cts:search(doc(), $query)) on each boosting term as if it were a standalone query. Bail out as soon as you find a true() result: that means you are going to boost that query term with an absurdly high weight to make it float to the top.
Once you know what the boosting term is, rewrite the entire query to set all other term weights to much lower values, perhaps even 0. The lower the weight, the less those non-boosting terms will interfere with the date-based quality and the boosting weight. If there is no boosting term, you might want to make other adjustments. All this is less expensive than it sounds, by the way. Aside from the xdmp:exists calls, it's just in-memory expression evaluation.
Again, though, these are all just tricks to nudge the scores. They won't give you the absolute control over ranking that you're looking for. In my experience, attempts to micromanage scores are doomed to failure. My bet is that your users would be happier with raw TF/IDF, whatever your business managers say.
Another way to do it is to use two searches, as you suggest. Put a range index on document-number (and ideally the document date), extract any potential document-number values from the query (search:parse, extract, then search:resolve is a good strategy), then execute a cts:element-range-query for docs matching those document-number values with date descending. If there aren't enough results to fill up your N-result page, then get the next N-x results from search api. You can keep track of the documents that were returned in the first result set and exclude those URIs from the second one. Keeping track of the pagination won't be too bad.
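A sketch of that range query, assuming a string range index on document-number and that $numbers holds the values extracted from the parsed query:

(: $numbers = document-number values extracted from the parsed query :)
for $doc in cts:search(
    fn:doc(),
    cts:element-range-query(xs:QName("document-number"), "=", $numbers))
order by xs:integer(($doc//publication-year)[1]) descending
return $doc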
This might not perform as well as the first solution, but the time difference for the additional range index query combined with a shorter search api query should be negligible enough for most.
I am working with MarkLogic.
I have a database of around 27,000 documents.
What I want to do is retrieve the keywords that have the maximum frequency in the documents returned as the result of any search query.
I am currently using XQuery functions to count the frequency of each word in the set of all documents retrieved as the query result. However, this is quite inefficient.
I was thinking that it would help me if I could get the list of words on which MarkLogic has performed indexing.
So is there a way to retrieve the list of indexed words from the universal index of MarkLogic?
Normally you would use something like this in MarkLogic:
(
  (: requires an element range index on myelem :)
  for $v in cts:element-values(xs:QName("myelem"))
  let $f := cts:frequency($v)
  order by $f descending
  return $v
)[1 to 10]
This kind of functionality is built into the search:search library, which works very conveniently.
But unfortunately you cannot use that on values from cts:words and the like. There is a little trick that could get you close, though. Instead of using cts:frequency, you could use an xdmp:estimate on a cts:search to get a fragment count:
(
  for $v in cts:words()
  let $f := xdmp:estimate(cts:search(fn:collection(), cts:word-query($v)))
  order by $f descending
  return $v
)[1 to 10]
The performance is lower, but it is still much faster than bluntly running through all documents.
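If you want the counts restricted to documents matching a current search rather than the whole database, you could (again, just a sketch, with $searchquery assumed to be that cts:query) and-join each word with it inside the estimate:

(
  for $v in cts:words()
  let $f := xdmp:estimate(cts:search(
    fn:collection(),
    cts:and-query((cts:word-query($v), $searchquery))
  ))
  where $f gt 0
  order by $f descending
  return $v
)[1 to 10]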
HTH!
What if your search contains multiple terms? How will you calculate the order?
What if some of your terms are very common in your corpus of documents, and others are very rare? Should the count of "the" contribute more to the score than "protease", or should they contribute the same?
If the words occur in the title vs elsewhere in the document, should that matter?
What if one document is relatively short and another is quite long? How do you account for that?
These are some of the basic questions that come up when trying to determine relevancy. Most search engines use a combination of term frequency (how often do the terms occur in your documents), and document frequency (how many documents contain the terms). They can also use the location of the terms in your documents to determine a score, and they can also account for document length in determining a score.
MarkLogic uses a combination of term frequency and document frequency to determine relevance by default. These factors (and others) are used to determine a relevance score for your search criteria, and this score is the default sorting for results returned by search:search from the search API or the low-level cts:search and its supporting operators.
You can look at the details of the options for cts:search to learn about some of the different scoring options. See 'score-logtfidf' and others here:
http://community.marklogic.com/pubs/5.0/apidocs/SearchBuiltins.html#cts:search
I would also look at the search developers guide:
http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf
Many of the concepts are under consideration by the XQuery working group as enhancements for a future version of XQuery. They aren't part of the language today. MarkLogic has been at the forefront of search for a number of years, so you'll find there are many features in the product, and a lot of discussion related to this area in the archives.
"Is there a way to retrieve the list of indexed words from the universal index of marklogic?" No. The universal index is a hash index, so it contains hashes not words.
As noted by others you can create value-based lexicons that can list their contents. Some of these also include frequency information. However, I have another suggestion: cts:distinctive-terms() will identify the most distinctive terms from a sequence of nodes, which could be the current page of search results. You can control whether the output terms are just words, or include more complex terms such as element-word or phrase. See the docs for more details.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/apidoc/SearchBuiltins.xml&category=SearchBuiltins&function=cts:distinctive-terms
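A minimal sketch, run over just the current page of search results (with $query assumed to be your cts:query):

let $page := cts:search(fn:collection(), $query)[1 to 10]
return cts:distinctive-terms($page)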
I have used cts:distinctive-terms(). It gives mostly wildcarded terms in my case, which are not of much use. Further, it is suitable for finding distinctive terms in a single document; when I try to run it on many documents, it is quite slow.
What I want to implement is a dynamic facet that is populated with the keywords of the documents that come up in the search result. I have implemented it, but it is inefficient, as it counts the frequency of all the words in the documents. I want it to be a suggestion or recommendation feature: if you have searched for this particular term or phrase, then you may be interested in these suggested terms or phrases. So I want an efficient method to find the terms that are common in the result set of documents of a search.
I tried cts:words() as suggested. It gives words similar to the search query word and the number of documents in which each is contained. What it does not take into account is the set of search result documents: it just shows the number of documents that contain similar words in the whole database, irrespective of whether those documents are present in the search result or not.