I'm planning on using Cosmos DB (DocumentDB) and I'm trying to understand how queries, indexing and partitions relate to each other.
How to partition and scale in Azure Cosmos DB talks about the partition key, and other documentation indicates that partition key + id = unique id for the document. But then SQL Query and SQL syntax in Azure Cosmos DB says it provides automatic indexing of JSON documents without requiring an explicit schema or the creation of secondary indexes.
I understand that the partition key is important for scalability and for how data is stored. But when it comes to searching, is the partition key essentially an extra filter/WHERE clause? All the documents are indexed, so I can execute a query like:
SELECT *
FROM Families
WHERE Families.address.state = "NY"
Should I still specify the partition key, or somehow indicate that cross-partition queries are allowed, when using this SQL query syntax?
Your first link gives the answer for this:
For partitioned collections, you can use PartitionKey to run the query against a single partition (though Cosmos DB can automatically extract this from the query text), and EnableCrossPartitionQuery to run queries that may need to be run against multiple partitions.
So, yes, you either need to scope the query to a single partition (by setting PartitionKey in the query options, or by including the partition key in the WHERE clause so Cosmos DB can extract it from the query text), or set EnableCrossPartitionQuery to true in the query options.
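To make that concrete, here is a minimal sketch with the DocumentDB Java SDK (the .NET FeedOptions work the same way). The endpoint, key, collection link and the choice of address.state as the partition key are assumptions for illustration, not something from the original question:

import com.microsoft.azure.documentdb.ConnectionPolicy;
import com.microsoft.azure.documentdb.ConsistencyLevel;
import com.microsoft.azure.documentdb.Document;
import com.microsoft.azure.documentdb.DocumentClient;
import com.microsoft.azure.documentdb.FeedOptions;
import com.microsoft.azure.documentdb.FeedResponse;
import com.microsoft.azure.documentdb.PartitionKey;

public class QueryScopeSketch {
    public static void main(String[] args) {
        DocumentClient client = new DocumentClient(
                "<endpoint>", "<key>", ConnectionPolicy.GetDefault(), ConsistencyLevel.Session);

        String query = "SELECT * FROM Families f WHERE f.address.state = 'NY'";

        // Option 1: scope the query to a single partition by supplying the partition key value.
        FeedOptions singlePartition = new FeedOptions();
        singlePartition.setPartitionKey(new PartitionKey("NY"));
        FeedResponse<Document> scoped = client.queryDocuments(
                "/dbs/db/colls/Families", query, singlePartition);

        // Option 2: let the query fan out across all partitions.
        FeedOptions crossPartition = new FeedOptions();
        crossPartition.setEnableCrossPartitionQuery(true);
        FeedResponse<Document> fannedOut = client.queryDocuments(
                "/dbs/db/colls/Families", query, crossPartition);

        System.out.println(scoped.getQueryIterable().toList().size());
        System.out.println(fannedOut.getQueryIterable().toList().size());
    }
}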
You don't have to do that anymore; EnableCrossPartitionQuery is set to true by default nowadays. This means Cosmos won't complain if you skip the partition key in your query.
More info here.
You don't need to specify a partition key for the query. Recent SDK versions enable cross-partition queries by default.
I have a SQL API Cosmos DB collection with the id and partition key both equal to /id.
Given a list of IDs, I need to fetch all those documents. When using the .NET SDK (v3.25), which of the below Container class methods is recommended to get the lowest latency:
In parallel, use ReadItemAsync to read all documents.
Use ReadManyItemsAsync to read all the documents.
Use GetItemQueryIterator with a SQL query of the form SELECT * FROM c where c.id in ('id-1', 'id-2', ...).
If you want to retrieve a large group of individual items, the most efficient way is to use ReadManyItemsAsync() rather than invoking ReadItemAsync() many times in parallel.
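The question is about the .NET SDK, but here is a rough sketch of the same batched point-read pattern using the Java SDK v4, which exposes an analogous readMany. The endpoint, key, database/container names and the minimal MyDoc POJO are assumptions:

import com.azure.cosmos.CosmosAsyncClient;
import com.azure.cosmos.CosmosAsyncContainer;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.models.CosmosItemIdentity;
import com.azure.cosmos.models.FeedResponse;
import com.azure.cosmos.models.PartitionKey;

import java.util.List;
import java.util.stream.Collectors;

public class ReadManySketch {
    public static class MyDoc { public String id; }

    public static void main(String[] args) {
        CosmosAsyncClient client = new CosmosClientBuilder()
                .endpoint("<endpoint>").key("<key>").buildAsyncClient();
        CosmosAsyncContainer container = client.getDatabase("db").getContainer("coll");

        List<String> ids = List.of("id-1", "id-2", "id-3");

        // The partition key equals the id in this container, so each identity is (id, id).
        List<CosmosItemIdentity> identities = ids.stream()
                .map(id -> new CosmosItemIdentity(new PartitionKey(id), id))
                .collect(Collectors.toList());

        // readMany groups the point reads into as few backend requests as possible.
        FeedResponse<MyDoc> response = container.readMany(identities, MyDoc.class).block();
        response.getResults().forEach(doc -> System.out.println(doc.id));
        System.out.println("RU charge: " + response.getRequestCharge());

        client.close();
    }
}

The .NET ReadManyItemsAsync follows the same shape: you hand the SDK a list of (id, partition key) pairs and it batches the reads for you.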
Is there a 'partition key not found' exception when we query with a partition key via QueryRequestOptions? Or is there any other way I can be notified that the logical partition does not exist for a query?
Is there a 'partition key not found' exception when we query with a partition key via QueryRequestOptions?
Assuming you are talking about the value of the partition key, as far as I know there is no such exception in Cosmos DB.
Or is there any other way I can be notified that the logical partition does not exist for a query?
One possible way to find out is to query your container with the partition key value set in the query options and try to fetch at most one document. If you don't get any documents back (i.e., an empty result set), that would mean the logical partition does not exist in the container.
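The question mentions the .NET QueryRequestOptions, but here is a minimal sketch of that check with the older Java SDK used elsewhere in this thread; the client, collection link and partition key value are placeholders:

import com.microsoft.azure.documentdb.Document;
import com.microsoft.azure.documentdb.DocumentClient;
import com.microsoft.azure.documentdb.FeedOptions;
import com.microsoft.azure.documentdb.FeedResponse;
import com.microsoft.azure.documentdb.PartitionKey;

public class LogicalPartitionCheck {

    // Returns true if at least one document exists for the given partition key value.
    static boolean logicalPartitionExists(DocumentClient client, String collectionLink, Object partitionKeyValue) {
        FeedOptions options = new FeedOptions();
        options.setPartitionKey(new PartitionKey(partitionKeyValue));

        // TOP 1 keeps the request charge minimal; we only care whether anything comes back.
        FeedResponse<Document> response = client.queryDocuments(
                collectionLink, "SELECT TOP 1 * FROM c", options);

        return response.getQueryIterable().iterator().hasNext();
    }
}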
When querying Cosmos DB, there is an option to set enableCrossPartitionQuery to true.
I am wondering what happens if I do not set it? Which partition will be used for the query?
thanks
If your collection is partitioned, then query, update and delete operations need the partition key to be set.
If you don't set it, you may see an error telling you that a cross-partition query is required but disabled.
For this situation, if you don't want to set any partition key, or you don't know which partition the data belongs to, you can set enableCrossPartitionQuery = true to avoid the error. Setting enableCrossPartitionQuery = true means the request will scan all the partitions to filter the data, so query performance is bound to decline.
BTW, if your data size is small, the impact may be small. However, if the data size is large, I suggest you try your best to avoid setting this property.
I tested the sample project https://github.com/Azure-Samples/azure-cosmos-db-sql-api-nodejs-getting-started.git and it indeed does not require the partition key when the container is partitioned.
However, based on the statements in the Cosmos DB REST API documentation, and from my test of the Java SDK, the partition key is required when querying a partitioned container. Anyway, if you hit the error indicating a missing partition key, you can add the property enableCrossPartitionQuery = true to solve it. Still, I suggest providing the partition key for query performance.
If my Cosmos DB collection has multiple partitions, is there any reason NOT to set EnableCrossPartitionQuery to true?
I know it is necessary when running a query that could hit multiple partitions. But what if the query uses a valid partition key and will definitely hit only one partition: is there any performance loss or increased cost from setting that flag to true?
But what if the query uses a valid partition key and will definitely hit only one partition: is there any performance loss or increased cost from setting that flag to true?
To my knowledge, if you set the partition key for a partitioned collection, the cost will not change even if you also set EnableCrossPartitionQuery to true, because the request only scans the specific partition you already set. I did a sample test to verify it:
FeedOptions feedOptions = new FeedOptions();
// Scope the query to a single partition...
PartitionKey partitionKey = new PartitionKey("A");
feedOptions.setPartitionKey(partitionKey);
// ...and also set the cross-partition flag, to check whether it changes the request charge.
feedOptions.setEnableCrossPartitionQuery(true);

FeedResponse<Document> queryResults = client.queryDocuments(
        "/dbs/db/colls/part",
        "SELECT * FROM c",
        feedOptions);

System.out.println("Running SQL query...");
for (Document document : queryResults.getQueryIterable()) {
    System.out.println(String.format("\tRead %s", document));
}
System.out.println(queryResults.getRequestCharge());
I think you don't have to struggle with this. The EnableCrossPartitionQuery option only needs to be used if the query against a partitioned collection is not scoped to a single partition key value. If you know the specific partition key, there is no need to set EnableCrossPartitionQuery.
Team,
I have a DynamoDB table with a hash key (userid) and a sort key (ages). Let's say we want to retrieve, for each hash key (userid), the smallest age. What would be the query and filter expression for the DynamoDB query?
Thanks!
I don't think you can do it in a single query. You would need to do a full table scan. If you have a list of hash keys somewhere, then you can do N queries (in parallel) instead.
[Update] Here is another possible approach:
Maintain a second table, where you have just a hash key (userId). This table will contain a record with the smallest age for a given user. To achieve that, make sure that every time you update the main table you also update the second one if the new age is less than the current age in the second table. You can use a conditional update for that (see the sketch below). The update can either be done by the application itself, or you can have an AWS Lambda listening to the DynamoDB stream. Now, if you need the smallest age for each user, you still do a full table scan of the second table, but this scan only reads relevant records, so it will be optimal.
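A minimal sketch of that conditional update with the AWS SDK for Java (v1); the second table name UserMinAge and the attribute names are assumptions:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

public class SmallestAgeWriter {

    private static final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    // Write newAge into the second table only if there is no record yet,
    // or the stored MinAge is larger than newAge.
    public static void updateSmallestAge(String userId, int newAge) {
        UpdateItemRequest request = new UpdateItemRequest()
                .withTableName("UserMinAge")                       // hypothetical second table
                .addKeyEntry("UserId", new AttributeValue(userId))
                .withUpdateExpression("SET MinAge = :age")
                .withConditionExpression("attribute_not_exists(MinAge) OR MinAge > :age")
                .addExpressionAttributeValuesEntry(":age",
                        new AttributeValue().withN(Integer.toString(newAge)));
        try {
            dynamo.updateItem(request);
        } catch (ConditionalCheckFailedException e) {
            // The stored MinAge is already <= newAge; nothing to do.
        }
    }
}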
There are two ways to achieve that:
If you don't need to get this data in real time, you can export your data into other AWS systems, like EMR or Redshift, and perform complex analytics queries there. With this you can write SQL expressions using joins and GROUP BY operators.
You can even perform EMR Hive queries on DynamoDB data, but they perform scans, so it's not very cost efficient.
Another option is to use DynamoDB Streams. You can maintain a separate table that stores:
Table: MinAges
UserId - primary key
MinAge - regular numeric attribute
On every insert/update/delete in the original table you can compute the minimum age for the updated user and store it in the MinAges table (a stream handler sketch is at the end of this answer).
Another option is to write something like this:
storeNewAge(userId, newAge)
def smallestAge = getSmallestAgeFor(userId)
storeSmallestAge(userId, smallestAge)
But since DynamoDB does not have native transaction support, it's dangerous to run code like that, since you may end up with inconsistent data. You can use the DynamoDB transactions library, but those transactions are expensive. Whereas if you are using streams, you will have consistent data at a very low price.
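And a rough sketch of the stream-based variant: a Lambda (Java runtime) subscribed to the main table's stream that applies the same conditional update to MinAges. The attribute names userid/ages and the MinAges layout above are assumptions, and this only handles inserts/updates (a delete of the current minimum would need a re-query):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;

public class MinAgeStreamHandler implements RequestHandler<DynamodbEvent, Void> {

    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbEvent.DynamodbStreamRecord record : event.getRecords()) {
            // Only inserts and updates can lower the minimum.
            if (!"INSERT".equals(record.getEventName()) && !"MODIFY".equals(record.getEventName())) {
                continue;
            }
            var newImage = record.getDynamodb().getNewImage(); // stream must include NEW_IMAGE
            String userId = newImage.get("userid").getS();     // assumed attribute names
            String age = newImage.get("ages").getN();

            try {
                // Same conditional update as above: write only if MinAges has no record
                // for this user yet, or the stored MinAge is larger.
                dynamo.updateItem(new UpdateItemRequest()
                        .withTableName("MinAges")
                        .addKeyEntry("UserId", new AttributeValue(userId))
                        .withUpdateExpression("SET MinAge = :age")
                        .withConditionExpression("attribute_not_exists(MinAge) OR MinAge > :age")
                        .addExpressionAttributeValuesEntry(":age", new AttributeValue().withN(age)));
            } catch (ConditionalCheckFailedException e) {
                // Current minimum is already <= this age; nothing to do.
            }
        }
        return null;
    }
}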
You can do it per user using ScanIndexForward (one query per hash key):
// mapper is a DynamoDBMapper; ascending sort-key order + limit 1 returns the smallest age.
YourEntity requestEntity = new YourEntity();
requestEntity.setHashKey(hashkey);

DynamoDBQueryExpression<YourEntity> queryExpression = new DynamoDBQueryExpression<YourEntity>()
        .withHashKeyValues(requestEntity)
        .withConsistentRead(false);
queryExpression.setIndexName(indexName);    // only if you are querying an index
queryExpression.setScanIndexForward(true);  // ascending by sort key (age), so the first item is the smallest
queryExpression.setLimit(1);

YourEntity smallestAge = mapper.queryPage(YourEntity.class, queryExpression).getResults().get(0);