Azure CosmosDB IS_DEFINED vs NOT IS_DEFINED

Azure CosmosDB IS_DEFINED vs NOT IS_DEFINED - azure-cosmosdb

I was trying to query a collection, which had few documents. Some of the collections had "Exception" property, where some don't have.
My end query looks some thing like:
Records that do not contain Exception:
**select COUNT(1) from doc c WHERE NOT IS_DEFINED(c.Exception)**
Records that contain Exception:
**select COUNT(1) from doc c WHERE IS_DEFINED(c.Exception)**
But this seems not be working. When NOT IS_DEFINED is returning some count, IS_DEFINED is returning 0 records, where it actually had data.
My data looks something like (some documents can contain Exception property & others don't):
[{
'Name': 'Sagar',
'Age': 26,
'Exception: 'Object reference not set to an instance of the object', ...
},
{
'Name': 'Sagar',
'Age': 26, ...
}]

Update
As Dax Fohl said in an answer NOT IS_DEFINED is implemented now. See the the cosmos dev blob April updates for more details.
To use it properly the queried property should be added to the index of the collection.
Excerpt from the blog post:
Queries with inequality filters or filters on undefined values can now
be run more efficiently. Previously, these filters did not utilize the
index. When executing a query, Azure Cosmos DB would first evaluate
other less expensive filters (such as =, >, or <) in the query. If
there were inequality filters or filters on undefined values
remaining, the query engine would be required to load each of these
documents. Since inequality filters and filters on undefined values
now utilize the index, we can avoid loading these documents and see a
significant improvement in RU charge.
Here’s a full list of query filters with improvements:
Inequality comparison expression (e.g. c.age != 4)
NOT IN expression (e.g. c.name NOT IN (‘Luis’, ‘Andrew’, ‘Deborah’))
NOT IsDefined
Is expressions (e.g. NOT IsDefined(c.age), NOT IsString(c.name))
Coalesce operator expression (e.g. (c.name ?? ‘N/A’) = ‘Thomas’)
Ternary operator expression (e.g. c.name = null ? ‘N/A’ : c.name)
If you have queries with these filters, you should add an index for
the relevant properties.

The main difference between IS_DEFINED and NOT IS_DEFINED is the former utilizes the index while the later does not (same w/ = vs. !=). It's most likely the case here is IS_DEFINED query finishes in a single continuation and thus you get the full COUNT result. On the other hand, it seems that NOT IS_DEFINED query did not finish in a single continuation and thus you got partial COUNT result. You should get the full result by following the query continuation.

Related

Running into some SQLite limitation using IN operator

I have a query that uses WHERE id IN (1,2,3,...) where the list (1,2,3,...) is dynamically generated from an array of integers (not using parameters). Now I have a particular query that takes roughly 500ms with 26623 ids but 50s (100x slower) with 26624 ids.
I couldn't find anything that looks related in https://sqlite.org/limits.html
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
AND flows.id IN (1,2,3) /* <=== here more than 26623 IDs make it super slow */
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
Before I try to make that reproducible in isolate (e.g. search_params is a custom virtual table), does anyone know what limitation I might be running into? It's not the number of IDs per se, since a different query runs just fine with the same IDs.
SQLite version 3.36.0 via better-sqlite3 (Node.js) with a readonly database. The only pragma I use is journal_mode = WAL.
Compiled with (https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/compilation.md#bundled-configuration):
SQLITE_DQS=0
SQLITE_LIKE_DOESNT_MATCH_BLOBS
SQLITE_THREADSAFE=2
SQLITE_USE_URI=0
SQLITE_DEFAULT_MEMSTATUS=0
SQLITE_OMIT_DEPRECATED
SQLITE_OMIT_GET_TABLE
SQLITE_OMIT_TCL_VARIABLE
SQLITE_OMIT_PROGRESS_CALLBACK
SQLITE_OMIT_SHARED_CACHE
SQLITE_TRACE_SIZE_LIMIT=32
SQLITE_DEFAULT_CACHE_SIZE=-16000
SQLITE_DEFAULT_FOREIGN_KEYS=1
SQLITE_DEFAULT_WAL_SYNCHRONOUS=1
SQLITE_ENABLE_MATH_FUNCTIONS
SQLITE_ENABLE_DESERIALIZE
SQLITE_ENABLE_COLUMN_METADATA
SQLITE_ENABLE_UPDATE_DELETE_LIMIT
SQLITE_ENABLE_STAT4
SQLITE_ENABLE_FTS3_PARENTHESIS
SQLITE_ENABLE_FTS3
SQLITE_ENABLE_FTS4
SQLITE_ENABLE_FTS5
SQLITE_ENABLE_JSON1
SQLITE_ENABLE_RTREE
SQLITE_ENABLE_GEOPOLY
SQLITE_INTROSPECTION_PRAGMAS
SQLITE_SOUNDEX
HAVE_STDINT_H=1
HAVE_INT8_T=1
HAVE_INT16_T=1
HAVE_INT32_T=1
HAVE_UINT8_T=1
HAVE_UINT16_T=1
HAVE_UINT32_T=1

Here's the answer from the SQLite forums. Essentially this is a combination of how the query planner handles IN literals and what cost my virtual table estimates. That means I'm running into the exact moment when the query planner makes a different decision.
SQLite NGQP is a cost based query planner. The IN () operator with a list of literal values gets implemented as a kind of temporary table; sometimes SQLite decides to create an index and do lookups, other times it decides to use that table as the outermost loop of the query.
EXPLAIN QUERY PLAN should show that in a more concise manner.
If compiled in DEBUG mode mith WHERETRACE enabled, the .wheretrace command will show how SQLite NGQP reaches its plan. Essential input is the return values from the xBestIndex method of your virtual table, especially the "number of rows" and the "estimated cost". It is paramount to deliver accurate estimates. Cost should reflect processing cost relative to SQLite native tables.
Note that you can name the IN table by making it a CTE and CROSS JOIN to force the query plan that works fast.
https://sqlite.org/forum/forumpost/a3d68ed8b40cf583?t=h
The workaround I use is json_each and serialize the array of integers into a JSON string. In my particular use-case this has some other benefits as well (e.g. I can bind a single parameter and re-use the query with any number of IDs), so I don't mind doing that:
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
-AND flows.id IN (1,2,3)
+AND flows.id IN (SELECT value FROM json_each('[1,2,3]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
I also know that the generic virtual table implementation of better-sqlite3 makes a trade-off between being easy to use (it's ridiculously easy) and achieving maximum performance.

Using the sort() cursor method without the default indexing policy in Azure Cosmos DB for MongoDB API

With Cosmos DB for MongoDB API (Version 3.4), the following find query in combination with the method cursor sort seems to behave incorrectly:
db.test.find({"field1": "value1"}).sort({"field2": 1})
The error occurs, if all of the following conditions are met:
the default indexing policy were discarded - regardless of whether custom indexes were created afterwards using createIndex().
The find() query does not return any documents (Find(filter).Count() == 0)
The Sort document defining the sort order contains only one field. It doesn't matter, whether this field exists or has been indexed. Using two fields in the sort document returns 0 hits which is the correct behavior.
The error also occurs, if all of the following conditions are met:
the default indexing policy were discarded
The find() query returns one or more documents
The Sort document contains exactly one field. This field has not been indexed.
The error message:
The index path corresponding to the specified order-by item is excluded.
The malfunction occurs only when using the CosmosDB, with native MongoDB (mongoDB Atlas, v4.0) it behaves correctly.
Azure Cosmos DB for MongoDB API with MongoDB 3.4 wire protocol (preview feature) is used. The problem occurs with both a MongoDB C#/.NET driver and the mongo shell.
In addition, the problem only occurs with find(). An equivalent aggregation pipeline containing $match and $sort behaves correctly.
Reproduction
Create an Azure Cosmos DB Account with the "Azure Cosmos DB for MongoDB API". Enable the preview feature MongoDB 3.4 (Version 3.2 has not been tested).
Create a new database
Create a new collection, define a shard key
Drop the default indexing policy (using db.test.dropIndexes() )
(Optional) Create new custom indexes
(Optional) Insert documents
Execute command in mongo shell (or the equivalent code with mongoDB C#/.NET driver):
db.test.find({"field1": "value1"}).sort({"field2": 1})
Expected result
All documents that match the query criteria. If there are none, no documents should be returned.
Actual result
Error: error: {
"_t" : "OKMongoResponse",
"ok" : 0,
"code" : 2,
"errmsg" : "Message: {\"Errors\":[\"The index path corresponding to the specified order-by item is excluded.\"]}\r\nActivityId: c50cc751-0000-0000-0000-000000000000, Request URI: /apps/[...]/, RequestStats: \r\nRequestStartTime: 2019-07-11T08:58:48.9880813Z, RequestEndTime: 2019-07-11T08:58:49.0081101Z, Number of regions attempted: 1\r\nResponseTime: 2019-07-11T08:58:49.0081101Z, StoreResult: StorePhysicalAddress: rntbd://[...]/, LSN: 359549, GlobalCommittedLsn: 359548, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 400, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#359549, UsingLocalLSN: True, TransportException: null, ResourceType: Document, OperationType: Query\r\n, SDK: Microsoft.Azure.Documents.Common/2.4.0.0", [...]
Workaround
Adding an additional "dummy" field to the sort document prevents the error:
db.test.find({"field1": "value1"}).sort({"field2": 1, "dummyfield": 1}).count()
The workaround is not satisfactory. It could falsify the result.
Am I doing something wrong, or is Cosmos DB behaving flawed here?

According to Microsoft support, an index needs to be created on the field being sorted. The default indexes can be dropped and custom indexes created. As for the issue of not modifying the index every time a new field is added, there is no other alternative other than performing a client side sort. Unfortunately, client side sorting would take lot of CPU memory on the client side and the sort on index would take work when you would get more fields to index.
Thus I did not find a really satisfying solution:
Using the Default Indexing Policy. However, this can lead to a huge index.
Indexing all elements that need to be sorted. Every time a new element has to be indexed, this leads to a manual modification of the indexing policy.
Only use Client-side sort. In my opinion this leads to a strong limitation of MongoDB functionality.
Using aggregation frameworks instead of the find method. This leads to increased complexity and traffic.
Migrating to native MongoDB.

db.collection.createIndex ({ "$**" : 1 });

Google Datastore python returning less number of entities per page

I am using Python client SDK for Datastore (google-cloud-datastore) version 1.4.0. I am trying to run a key-only query fetch:
query = client.query(kind = 'SomeEntity')
query.keys_only()
Query filter has EQUAL condition on field1 and GREATER_THAN_OR_EQUAL condition on field2. Ordering is done based on field2
For fetch, I am specifying a limit:
query_iter = query.fetch(start_cursor=cursor, limit=100)
page = next(query_iter.pages)
keyList = [entity.key for entity in page]
nextCursor = query_iter.next_page_token
Though there are around 50 entities satisfying this query, each fetch returns around 10-15 results and a cursor. I can use the cursor to get all the results; but this results in additional call overhead
Is this behavior expected?

keys_only query is limited to 1000 entries in a single call. This operation counts as a single entity read.
For another limitations of Datastore, please refer detailed table in the documentation.
However, in the code, you did specify cursor as a starting point for a subsequent retrieval operation. Query can be limited, without cursor:
query = client.query()
query.keys_only()
tasks = list(query.fetch(limit=100))
For detailed instruction how to use limits and cursors, please refer documentation of the Google Gloud Datastore

Azure DocumentDB ARRAY_CONTAINS on nested documents

It seems like the ARRAY_CONTAINS function on nested documents never matches any document.
For example, trying the following simple query with the Azure DocumentDB Query Playground would return no result, even if some nested documents should match this query.
SELECT *
FROM food
WHERE ARRAY_CONTAINS(food.tags.name, "blueberries")
This past question on Stack Overflow also infered that this kind of nested query is valid.
Thank you

The first argument to ARRAY_CONTAINS must be an array. For example, in this case food.tags is valid as an argument, but food.tags.name is not.
Both the following DocumentDB queries are valid and might be what you're looking for:
SELECT food
FROM food
JOIN tag IN food.tags
WHERE tag.name = "blueberries"
Or
SELECT food
FROM food
WHERE ARRAY_CONTAINS(food.tags, { name: "blueberries" })

Riak search queries via the java client

I am trying to perform queries using the OR operator as following:
MapReduceResult result = riakClient.
mapReduce("some_bucket", "Name:c1 OR c2").
addMapPhase(new NamedJSFunction("Riak.mapValuesJson"), true).
execute();
I only get the 1st object in the query (where name='c1').
If I change the order of the query (i.e. Name:c2 OR c1) again I get only the first object in query (where name='c2').
is the OR operator (and other query operators) supported in the java client?

I got this answer from Basho engeneer, Sean C.:
You either need to group the terms or qualify both of them. Without a field identifier, the search query assumes that the default field is being searched. You can determine how the query will be interpreted by using the 'search-cmd explain' command. Here's two alternate ways to express your query:
Name:c1 OR Name:c2
Name:(c1 OR c2)
both options worked for me!

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Azure CosmosDB IS_DEFINED vs NOT IS_DEFINED - azure-cosmosdb

Related

Running into some SQLite limitation using IN operator

Using the sort() cursor method without the default indexing policy in Azure Cosmos DB for MongoDB API

Google Datastore python returning less number of entities per page

Azure DocumentDB ARRAY_CONTAINS on nested documents

Riak search queries via the java client

Categories

Resources