Performance of element-value-query vs element-range-query - XQuery

I have an element range index configured for an element in my database, and I am trying to run a search query on that element. The element contains string values, and I need to search for one particular string value (not a range of values or dates). Though both element-value and element-range queries can be used, and the index is already present, will both queries perform the same way, or does element-range perform better in this scenario?

The range query will be faster.
The element value query uses the universal index, and that isn't fully held in memory.
The range query uses a range index, and that's an in-memory index.
The range query will be much faster as your data grows. It will also be faster if you have a lot of unique terms in that element.
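To illustrate (a sketch, assuming an element named x and a string range index configured on x), the two forms look like:

(: value query, resolved against the universal index :)
cts:search(fn:collection(),
  cts:element-value-query(xs:QName("x"), "some value")),

(: range query, resolved against the in-memory range index :)
cts:search(fn:collection(),
  cts:element-range-query(xs:QName("x"), "=", "some value"))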

The range-query is also answering a different question from the value-query.
Value queries match word sequences, not strings. By default they are stemmed too, so cts:element-value-query(xs:QName("x"),"be fine") will match <x>Is finer</x>. Unless you do an exact (unstemmed, unwildcarded) value query, an unfiltered search will not be able to resolve space and punctuation differences either.
Range queries (on strings) match strings under the rules of a particular collation.
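For example (again assuming element x), an exact value query turns off stemming and wildcarding, while the range query compares whole strings under the index's collation:

(: exact match: unstemmed, unwildcarded, punctuation- and whitespace-sensitive :)
cts:element-value-query(xs:QName("x"), "be fine", "exact"),

(: whole-string comparison under the codepoint collation :)
cts:element-range-query(xs:QName("x"), "=", "be fine",
  "collation=http://marklogic.com/collation/codepoint")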

What is the difference when using brackets on the entire index?

/* Method 1 */
FOR EACH customer NO-LOCK WHERE customer.name EQ 'John':
    DISPLAY customer.name.
END.

/* Method 2 */
for each customer where (customer.name EQ 'John'):
    DISPLAY customer.name.
end.
Could you please explain how the compiler will act depending on where the brackets are placed?
Your second example is NOT faster. It is actually slower because you failed to specify NO-LOCK. By default you will therefore get the record(s) with a SHARE-LOCK which will then need to be unlocked as each record goes out of scope. This extra work takes time and results in a slower query.
If you are connecting client/server it can be orders of magnitude slower because:
a) each record will require 3 network messages: one to ask for it, another to return it, and a third to unlock it.
b) NO-LOCK queries can pack multiple records into a response message. This is much more efficient than requesting and sending them one at a time, and since there is no lock, nothing needs to be unlocked.
(In your sample you are only getting one record so the difference is pretty small.)
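The fix for the second example is simply to add NO-LOCK:

for each customer no-lock where (customer.name EQ 'John'):
    DISPLAY customer.name.
end.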
"Index Brackets" are an entirely different concept from grouping sub-expressions with "(" and ")". An index bracket is a set of records specified by elements of the WHERE clause. In both of your examples above you have an equality match on a field which is the leading component of a unique index. So the "bracket" is exactly one record. (Assuming that this is the standard sports database!)
If you specify something more complicated like:
for each order no-lock where order.custid = 1 and order.ship-date >= 7/1/2019 and order.ship-date <= 7/31/2019:
Progress uses a static, rule-based query optimizer. Indexes are chosen at compile time. The rules are complicated, but the most important, by far, is that equality matches on leading components keep an index in play. Range matches are the next most useful. But once you get a range match, no further fields will be considered for index selection.
There can be cases where using parentheses to group elements of the WHERE clause has an impact on index selection. (The case where there is only one field being selected on is not one of them.) Generally this is going to be a situation where you have a complex query with a mix of AND and OR elements. If nothing else, the use of parentheses in these situations makes operator precedence much less error-prone and easier for a human to read.
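For example, in a hypothetical variation of the query above, the parentheses make it explicit that the OR is resolved before the ANDs:

for each order no-lock where (order.custid = 1 or order.custid = 2) and order.ship-date >= 7/1/2019 and order.ship-date <= 7/31/2019:
    display order.ship-date.
end.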
There is a lot of material on the index selection process available. I suggest that you start here: https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/wp-abl-triggers/general-rules-for-choosing-a-single-index.html
There are also excellent presentations at every PUG Challenge on the topic. If you cannot attend in person (you should), many of them are available for download: http://pugchallenge.org/downloads.html

Search query to find documents that have multiple elements

I have a few XML documents in MarkLogic which have the structure
<abc:doc>
  <abc:doc-meta>
    <abc:meetings>
      <abc:meeting>
      </abc:meeting>
      <abc:meeting>
      </abc:meeting>
    </abc:meetings>
  </abc:doc-meta>
</abc:doc>
We can have more than one <abc:meeting> element under the <abc:meetings> element.
I am trying to write a cts:search query to get only documents that have more than one <abc:meeting> element in the document.
Please advise
This is tricky. Ideally, you'd want to drive searches from indexes for best performance. Unfortunately, MarkLogic doesn't keep track of element counts in its universal index, and aggregating counts from a range index can be cumbersome.
The overall simplest solution would be to add a count attribute on abc:meetings, and then add a range index on that. It does mean you'd have to change your data, and you'd have to keep that attribute in sync with each change.
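With that in place (the sketch below assumes a count attribute on abc:meetings and an int attribute range index on it), finding the documents becomes a pure index lookup:

cts:search(
  fn:collection(),
  cts:element-attribute-range-query(
    xs:QName('abc:meetings'), xs:QName('count'), '>', 1
  )
)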
You could also just search on the presence of abc:meeting with cts:element-query(), and append an XPath predicate to count the number of elements afterwards. Something like:
cts:search(
  collection(),
  cts:element-query(xs:QName('abc:meeting'), cts:true-query())
)[count(.//abc:meeting) > 1]
If not many documents contain meetings, this might work fairly well for you, but it still requires pulling up all documents containing meetings, hence could be expensive.
I played with the thought of leveraging cts:near-query(), but that is driven by word positions, so it depends on the actual number of tokens inside a meeting. If that were always an exact number of tokens (unlikely, I'd guess), you could use the minimal-distance option on a double cts:element-query() wrapped in a cts:near-query(). It might help optimize the previous option a little, though.
The most performant option I can think of right now involves adding a User-Defined aggregate Function (UDF). It unfortunately means compiling C++ code. I happen to have written such a UDF in the past, which you should be able to use as-is after compilation and installation. For details see:
https://github.com/grtjn/doc-count-udf
and
http://docs.marklogic.com/guide/app-dev/aggregateUDFs
HTH!
It boils down to how many "a few" is. If it's thousands or fewer, then what grtjn presents above for a cts:search plus an XPath expression will work fine. If it's more, I'd add the count attribute to abc:meetings and then use a pre-commit trigger (e.g. on the collection of these documents) to ensure that the count attribute value is kept in sync. You'd need a range index to be able to query for "documents that have a count of meetings of 2 or greater".
Of course, if all you need to query on is whether there's more than one meeting, then just add a "multiple" attribute to abc:meetings with a value of "true". Then you don't need a range index - you can do a cts:element-attribute-value-query on abc:meetings and multiple="true".
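That last query would look something like:

cts:search(
  fn:collection(),
  cts:element-attribute-value-query(
    xs:QName('abc:meetings'), xs:QName('multiple'), 'true'
  )
)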

Indexing only individual values in property arrays (instead of indexing every combination of those values) in Google Datastore

The data model I am planning would have a few property "fields" in place, including a "category/tags" property, which would be a list/array of a lot of tags.
I'm planning on querying on one category at a time. I am not interested in indexing which entities have combinations of categories, just individual categories.
To be clear, I am NOT referring to simply not indexing a particular property.
Bonus Question:
It seems Google Datastore doesn't like "monotonically increasing" property values (i.e. timestamps), presumably because they create hotspots on the machines while forming indexes. So would just storing the current calendar date help? I could see that making even more of a "hotspot", since every entity for 24 hours would have the same index value for that property. Is there some way of storing data about when each entity was recorded?
Indeed, one should encounter no issues creating a built-in index, as mentioned in the other reply. Still, properties with array values can behave in surprising ways. When a query has more than one inequality filter on an array property, all conditions defined by the filters must be satisfied by at least one of the array's individual values for the entity to match the query. This does not apply to equality filters, each of which can be satisfied by a different value in the array.
Sort order is also unusual: the first value seen in the index determines an entity's sort order.
I don't think a property index (aka Built-in Index) on an Array property creates the index with various value combinations. I believe each value in the Array is indexed individually. For example, if you have a Book with two tags, the index will have one entry per tag, i.e. two entries. Adding another book with three tags would add 3 more entries to the Tags index. This index allows you to query for books based on a single tag as well as multiple tags.
The "combination of values" that you mentioned happens if you create a composite index containing more than one Array type (e.g. Authors and Tags of a Book), and all/most books have multiple authors and multiple tags.
You should not have any issues creating a builtin index on your Category/Tag.
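As a sketch with the google-cloud-datastore Python client (the Book kind and tags property here are hypothetical):

from google.cloud import datastore

client = datastore.Client()

# Equality filter on an array property: an entity matches when ANY of
# its individual 'tags' values equals the requested tag.
query = client.query(kind="Book")
query.add_filter("tags", "=", "fantasy")
books = list(query.fetch())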
On your other question, on indexing an entity created/modified timestamp, I do see that the Best Practices documentation says to avoid indexing such a property:
"Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates."
I'm not sure what the alternative would be. If you don't have to query or sort on the timestamp, you are fine storing the timestamp while excluding the property from indexing.
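For reference, excluding a property from indexing with the Python client looks like this (the Event kind and created_at property are hypothetical):

import datetime

from google.cloud import datastore

client = datastore.Client()

# Store the timestamp, but keep it out of the built-in indexes.
entity = datastore.Entity(client.key("Event"), exclude_from_indexes=("created_at",))
entity["created_at"] = datetime.datetime.utcnow()
entity["category"] = "example"
client.put(entity)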

Retrieve all items with a column beginning with specified text on DynamoDB

I have a table in DynamoDB:
Id: int, hash key
Name: string
(there are many more columns, but I omitted them)
Typically I just pull out and update items by their Id, and this schema works fine for that.
However, one of the requirements is to have an auto-completing drop down box based on the name. I want to be able to query all items in this DynamoDB table for Name columns starting with a query string.
The SQL way of solving this would be to just add an index on Name and write a query like SELECT Id FROM table WHERE Name LIKE 'query%', but I can't figure out a DynamoDB-friendly way of doing this.
I have considered a few ways to solve this:
Scan the table. This is the easiest option, but least efficient. There's a bit more data in this table than I would be comfortable frequently scanning.
Scan + cache it in memory. But then I have to worry about cache invalidation etc.
Make Name a range key, which supports a begins_with function on the query. However, I'd still have to Scan the table since I want to retrieve results for every single hash key, so this doesn't really work.
Make a global secondary index and query it only with the range key. This also doesn't appear to be possible. I could have a column with a static value and use that as the hash key for the GSI, but that seems like a really ugly hack.
Use a full text search engine like CloudSearch, but this seems like massive overkill for my use case.
Is there a simple solution to this issue?
The use case you described is not directly supported by DynamoDB's Query operation today - DynamoDB typically requires you to specify a hash key and then query on the range key accordingly.
However, there is a popular scatter-gather technique that is commonly used for use cases such as yours. In this case, you would add an attribute bucket_id and create a global secondary index with bucket_id as the hash key and Name as the range key.
The bucket_id refers to a fixed range of IDs or numbers, with enough cardinality to ensure your global secondary index is well-distributed. For instance, bucket_id could range from 0 to 99. Then when updating your base table, whenever a new entry is added, a random bucket_id between 0 and 99 is assigned to it.
During your autocomplete query, the application would send 100 separate queries (scatter), one for each bucket_id value (0 to 99), using BEGINS_WITH on the range key Name. After the results are retrieved, the application would have to combine the 100 sets of responses and re-sort as necessary (gather).
The above process may seem a bit cumbersome, but it allows your system/table to scale well by ensuring the load is evenly distributed over a fixed key range. You can increase the bucket_id range as appropriate. To save cost, you can choose to project KEYS_ONLY onto your global secondary index, so cost of querying is minimized.
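A rough sketch of the scatter-gather in Python with boto3 (the table and index names are hypothetical, and in practice you would issue the queries in parallel rather than in a loop):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")

def autocomplete(prefix, buckets=100):
    # Scatter: one Query per bucket_id, each using begins_with on Name.
    items = []
    for bucket_id in range(buckets):
        resp = table.query(
            IndexName="bucket_id-Name-index",
            KeyConditionExpression=(
                Key("bucket_id").eq(bucket_id) & Key("Name").begins_with(prefix)
            ),
        )
        items.extend(resp["Items"])
    # Gather: merge the per-bucket results and re-sort by Name.
    return sorted(items, key=lambda item: item["Name"])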
The problem is that DynamoDB is essentially a key-value store with support for operations against a single key, and you are trying to search across all values, which doesn't work well. The "simplest" solution to this is to have a known hash key, and then you can Query it directly and specify conditions.
For example, you could query with hash_key='name_search' and range_key=begins_with(myText) or other_key=begins_with(myText) and get the use case you are describing. This will work fine for small sets of data that do not require a large amount of provisioned RCUs.
The problem is that this does not scale, because you are not following any of the DynamoDB best practices (in fact, this is an anti-pattern). Take a look at the Understand Partition Behavior documentation.
My suggestion would be to use a different service/solution to accomplish this rather than trying to squeeze DynamoDB into this use case.

When to include an index (automated heuristic)

I have a piece of software which takes in a database, and uses it to produce graphs based on what the user wants (primarily queries of the form SELECT AVG(<input1>) AS x, AVG(<input2>) AS y FROM <input3> WHERE <key> IN (<vals..>) AND ...). This works nicely.
I have a simple script that is passed a (often large) number of files, each describing a row
name=foo
x=12
y=23.4
....... etc.......
The script goes through each file, saving the variable names and an INSERT query for each. It then loads the variable names, sort | uniqs them, and makes a CREATE TABLE statement out of them (SQLite, amusingly enough, is OK with having all columns be NUMERIC, even if they actually end up containing text data). Once this is done, it executes the INSERTs (in a single transaction, otherwise it would take ages).
To improve performance, I added a basic index on each column. However, this increases database size quite significantly, and only provides a moderate improvement.
Data comes in three basic types:
single value, indicating things like program version, etc.
a few values (<10), indicating things like input parameters used
many values (>1000), primarily output data.
The first type obviously shouldn't need an index, since it will never be sorted upon.
The second type should have an index, because it will commonly be filtered by.
The third type probably shouldn't need an index, because it will be used in output.
It would be annoying to determine which type a particular value is before it is put in the database, but it is possible.
My question is twofold:
Is there some hidden cost to extraneous indexes, beyond the size increase that I have seen?
Is there a better way to index for filtration queries of the form WHERE foo IN (5) AND bar IN (12,14,15)? Note that I don't know which columns the user will pick, beyond the fact that it will be a type 2 column.
Read the relevant documentation:
Query Planning;
Query Optimizer Overview;
EXPLAIN QUERY PLAN.
The most important thing for optimizing queries is avoiding I/O, so tables with fewer than ten rows should not be indexed: all the data fits into a single page anyway, and an index would just force SQLite to read another page.
Indexes are important when you are looking up records in a big table.
Extraneous indexes make table updates slower, because each index needs to be updated as well.
SQLite can use at most one index per table in a query.
This particular query could be optimized best by having a single index on the two columns foo and bar.
However, creating such indexes for all possible combinations of lookup columns is most likely not worth the effort.
If the queries are generated dynamically, the best idea probably is to create one index for each column that has good selectivity, and rely on SQLite to pick the best one.
And don't forget to run ANALYZE.
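As a sketch in SQLite (the table and column names are hypothetical):

-- Best case for this exact filter: one composite index on (foo, bar).
CREATE INDEX idx_data_foo_bar ON data(foo, bar);

-- For dynamically generated queries: one index per selective column,
-- letting SQLite pick the best one at query time.
CREATE INDEX idx_data_foo ON data(foo);
CREATE INDEX idx_data_bar ON data(bar);

-- Refresh the planner's statistics, then inspect its choice.
ANALYZE;
EXPLAIN QUERY PLAN
SELECT AVG(x), AVG(y) FROM data WHERE foo IN (5) AND bar IN (12, 14, 15);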
