Simple CosmosDb query high RU - azure-cosmosdb

I am evaluating Cosmos Db for a project and working through the documentation. I have created a sample collection following the documentation on this page https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-getting-started. When I run the first query on this page in the local emulator I get the following results:
Why is the Request Charge 2.89 RUs? From all of the documentation I have read, this should be 1 RU. The collection is partitioned on the id field, is auto-indexed, and has Cross Partition Queries enabled. I have even tried putting both items in the same partition and I get the same results.

1 RU is the cost of a Point-Read operation, not a query. Reference: https://learn.microsoft.com/azure/cosmos-db/request-units:
The cost to read a 1 KB item is 1 Request Unit (or 1 RU).
Also there:
Query patterns: The complexity of a query affects how many RUs are consumed for an operation. Factors that affect the cost of query operations include
If you want to read a single document and you know its id and partition key, just do a point read; it will always be cheaper than a query with an id = "something" filter. If you don't know the partition key, then yes, you need a cross-partition query, because you don't know which partition the document is stored on, and there could be multiple documents with the same id (as long as their partition keys are different, see https://learn.microsoft.com/azure/cosmos-db/partitioning-overview).
You can use any of the available SDKs or work with the REST API.
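As a rough illustration of the difference, here is a minimal Python sketch. It assumes a container object from the `azure-cosmos` SDK (its `read_item` and `query_items` methods are used below); the helper names `point_read` and `query_by_id` are made up for this example.

```python
def point_read(container, item_id, partition_key):
    """Point read: needs both id and partition key (about 1 RU for a 1 KB item)."""
    return container.read_item(item=item_id, partition_key=partition_key)


def query_by_id(container, item_id):
    """Query by id: works without the partition key, but goes through the
    query engine and therefore costs more RUs than a point read."""
    return list(
        container.query_items(
            query="SELECT * FROM c WHERE c.id = @id",
            parameters=[{"name": "@id", "value": item_id}],
            enable_cross_partition_query=True,
        )
    )
```

In the collection from the question, which is partitioned on id, the partition key passed to `point_read` would simply be the id itself.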

Related

Benefit of local index in AWS DynamoDB?

In DynamoDB I have a table like the example data below:
pk       sk                        name        price
====================================================
product  cat#phone#name#iPhone11   iPhone 11   500
product  cat#phone#name#Nokia1100  Nokia 1100  100
product  cat#phone#name#iPhone11   iPhone 11   500
In one case I have to search by name. So first I created a global index for name, where in the index pk = pk and sk = name. Then I ran a search, which worked fine.
Now I have changed my mind and created a local index for name, where name is the sk. It also works fine. My question is: if I use a local index here, is there any benefit? And when should I not use a local index? If a global index is not required here but I have used one, are there any performance issues?
@niloy-rony,
This AWS doc very well explains LSI and GSI in detail.
Now to answer your questions
- An LSI has no separately provisioned throughput: it shares the RCUs and WCUs of the base table, whereas a GSI has its own read and write capacity that you pay for. In both cases you still pay for the index storage, as depicted here in another AWS doc.
- Do not use an LSI if a single partition (i.e. one pk value) of your main table could grow beyond 10 GB, because the pk is the same in the table and in its LSIs. This is also discussed in the link shared above.
- There is no performance difference between LSI and GSI in terms of query latency. However, reads from a GSI are always eventually consistent, whereas an LSI also supports strongly consistent reads.
Edit: adding an excerpt from the AWS doc to explain strongly and eventually consistent reads.
Strongly Consistent Reads - When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful.
Eventually Consistent Reads - When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
Refer to this AWS doc for tips on minimising the propagation delay of data from the main table to GSIs.
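To make the consistency difference concrete, here is a small Python sketch that builds the arguments for a DynamoDB `Query` call against a name index. The table name `products` and index name `name-lsi` in the usage line are hypothetical, and `build_name_query` is a helper invented for this example; the keys of the returned dict are the standard `Query` parameters.

```python
def build_name_query(table, index, index_is_local, pk, name, consistent=False):
    """Build keyword arguments for a DynamoDB Query on a name index."""
    if consistent and not index_is_local:
        # A GSI only supports eventually consistent reads.
        raise ValueError("strongly consistent reads are not supported on a GSI")
    return {
        "TableName": table,
        "IndexName": index,
        # "name" is a DynamoDB reserved word, so it is aliased as #n.
        "KeyConditionExpression": "pk = :pk AND #n = :name",
        "ExpressionAttributeNames": {"#n": "name"},
        "ExpressionAttributeValues": {":pk": {"S": pk}, ":name": {"S": name}},
        "ConsistentRead": consistent,
    }
```

With boto3 this could be used as `boto3.client("dynamodb").query(**build_name_query("products", "name-lsi", True, "product", "iPhone 11", consistent=True))`.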

Checking millions of IDs in Cosmos DB

Given a potentially large (up to 10^7) set of IDs (together with associated partition keys), I need to verify that there is no document in a Cosmos DB collection with an ID that is in the given set.
There are two obvious ways to achieve this:
1. Check the existence for each ID/partition key pair individually using parallel point reads, with AllowBulkExecution = true, and abort as soon as a read comes back successfully.
2. Group the IDs by partition key, and for each group, issue parallel queries of the following form (such that each query is smaller than the maximum query size 256 kB), and abort as soon as any query returns with a non-empty result:
SELECT TOP 1 c.id FROM c
WHERE c.partitionkey = 'partition123' AND ARRAY_CONTAINS(['id1', 'id2', ...], c.id)
Is it possible to say, without trying it out, which one is faster?
Here is a bit more context:
The client is an Azure App Service located in the same region as the Cosmos DB instance.
The Cosmos DB collection contains about 10^7 documents and has a throughput of 4000 RU/s.
The IDs are actually GUID strings of length 36, so the number of IDs per query in Solution 2 would be limited to about 6500 in order to not exceed the maximum query size. In other words, the number of required queries in Solution 2 is about n/6500 where n is the number of IDs in the set.
The number of different partition keys is small (< 10).
The average document size is about 500 B.
Default indexing policy.
A bit more background: The check is part of an import/initial load operation. More precisely, it is part of the validation of an import set so an error can be returned before the write operations begin. So the expected (non-error) case is that none of the IDs in the set already exists. The import operation is not expected to be executed frequently (though certainly more than once), so managing auxiliary processes/data just to optimize for this check would not be a good tradeoff.
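The batching described for Solution 2 could be sketched in Python roughly as follows. This is only an illustration: `batch_ids` and `build_existence_query` are hypothetical helpers, the query is parameterized rather than built by string concatenation, and the byte budget is applied to the JSON-encoded ID list only, so some headroom should be left for the rest of the request.

```python
import json

MAX_QUERY_BYTES = 256 * 1024  # the limit quoted in the question


def batch_ids(ids, max_bytes):
    """Split ids into batches whose JSON-encoded list fits within max_bytes."""
    batches, current, size = [], [], 2  # 2 bytes for the enclosing "[" and "]"
    for doc_id in ids:
        item = len(json.dumps(doc_id).encode("utf-8"))
        extra = item + (2 if current else 0)  # ", " separator between elements
        if current and size + extra > max_bytes:
            batches.append(current)
            current, size = [], 2
            extra = item
        current.append(doc_id)
        size += extra
    if current:
        batches.append(current)
    return batches


def build_existence_query(partition_key, id_batch):
    """Query spec that returns at most one id from the batch that already exists."""
    return {
        "query": "SELECT TOP 1 c.id FROM c "
                 "WHERE c.partitionkey = @pk AND ARRAY_CONTAINS(@ids, c.id)",
        "parameters": [
            {"name": "@pk", "value": partition_key},
            {"name": "@ids", "value": id_batch},
        ],
    }
```

A quoted 36-character GUID plus its separator takes 40 bytes, so a 256 * 1024 byte budget allows roughly 6,500 IDs per batch, which matches the estimate above.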
Not quite sure I understand the need for this but... queries will cost more than a point-read, in terms of RU cost (and given your doc size, those point reads are going to cost 1 RU).
I don't see how you will be able to abandon parallel point-reads if you succeed in finding a particular ID within a given partition. Also remember that an ID is only unique within a partition, so it's possible to have that ID in multiple partitions.
It is likely more efficient to just attempt to write a given ID to a given partition, and see if it succeeds (it'll fail if there's an ID collision).
Lastly: For all practical purposes, you won't have a duplicate ID if you're generating a new GUID for every document you're saving.
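To put the point-read option in numbers, here is a back-of-the-envelope calculation in Python using the figures from the question (10^7 IDs, 4000 RU/s) and the 1 RU per point read mentioned above. It ignores latency, throttling retries, and any other load on the collection, so it is a lower bound rather than a prediction.

```python
def min_seconds_for_point_reads(n_ids, ru_per_read=1.0, provisioned_ru_per_s=4000):
    """Lower bound on wall-clock time if throughput is the only limit."""
    return n_ids * ru_per_read / provisioned_ru_per_s


seconds = min_seconds_for_point_reads(10**7)
print(f"{seconds:.0f} s (about {seconds / 60:.1f} minutes)")
```

Under these assumptions, checking all 10^7 IDs by point reads takes at least 2500 seconds at the provisioned throughput.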

CosmosDb - Determining best partitionKey when only fetching data by their Id

I’ve been dabbling with CosmosDb and am now starting to get in the range of over 10k documents instead of just a few.
I’m struggling with how best to partition.
Some background
• I will have 10-50k documents in CosmosDb (maybe more in later phases)
• I have an index on top of those in Azure Search (for a small subset of these documents' properties)
• I will NOT be performing complex searches in CosmosDb
except:
• I will be fetching documents from cosmosDb by their Id (most likely coming from Azure Search results, when the user clicks one of the results)
o Initially only 1 document will be requested
o Possibly, in the future, I might ask for e.g. 10 documents at the same time, all by their Id.
I currently have 1 partition, which feels like a waste of a good system.
I could partition on e.g. the last digit of the document number, which would give a nice spread of documents across 10 partitions.
My concrete question:
If I spread data equally (almost randomly, to be honest) across 10 partitions, does that speed up fetching documents by Id (assuming many simultaneous calls to the system, each fetching 1 document by Id)?
My reasoning: The last digit would determine the partition, so only 1 partition would be accessed to find the document, which is better than searching all partitions at the same time?
Spreading data across partitions does not make things faster on the read path in a partitioned data store. Where it helps is on the write path, because you are spreading the load out horizontally across many computers simultaneously. And this only matters when the required throughput exceeds what a single partition can deliver. For Cosmos DB this is 10,000 RU/s per physical partition.
The key to fast reads is to indicate the partition key value in your read. The partition key is basically a router to where your data is stored. Once there it uses the index (or id in your case) to find the data.
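A minimal Python sketch of such reads, under the assumptions from the question: the document id is the document number, and the partition key property holds its last digit. The container is assumed to come from the `azure-cosmos` SDK (only `read_item` is used); `partition_key_for` and `fetch_by_numbers` are names invented here.

```python
def partition_key_for(document_number):
    """Derive the partition key value: the last digit of the document number."""
    return str(document_number)[-1]


def fetch_by_numbers(container, document_numbers):
    """Point-read each document, always supplying its partition key value."""
    return [
        container.read_item(
            item=str(number),
            partition_key=partition_key_for(number),
        )
        for number in document_numbers
    ]
```

Because the partition key value can be computed from the id alone, each read is routed to exactly one partition, whether one document or ten are requested.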
There are some helpful articles that provide more detail on partitioning:
Partitioning in Azure Cosmos DB
How to model and partition data on Azure Cosmos DB using a real-world example
Hope this helps.

Firestore checking if username exists, best model to not query the whole database?

Hello, I'm developing an Android app and using Firebase's Firestore. My concern is about creating a username for a user when they sign up for my app. I know I have to check if the username exists in my database, but what if you have 1 million users, or 5 million? I don't think the results will be fast when you query the whole database. Is querying the whole database the only approach? Or should I maybe create a collection called usernames with 24 documents inside, where for example the first document holds a collection of usernames starting with a, the second document holds a collection of usernames starting with b, and so on? Need your help. Thank you.
Actually one of the key characteristics of Firestore is exactly that: the performance of a query is proportional to the size of your result set, not your data set.
So the query performance that you get for finding 1 document in a collection of 5, 24 or 1 million documents will be exactly the same.
In Cloud Firestore, you can use queries to retrieve individual,
specific documents or to retrieve all the documents in a collection
that match your query parameters. Your queries can include multiple,
chained filters and combine filtering and sorting. They're also
indexed by default, so query performance is proportional to the size
of your result set, not your data set.
So the answer is that you should query your already existing collection of documents and not create smaller collection(s) with a subset of documents for the sake of query performance.
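As a sketch of such a check, here is a version in Python (the question is about Android, so treat this as an illustration of the query shape rather than app code). It assumes a collection reference from the `google-cloud-firestore` client and a `username` field on each user document; both the field name and the helper name are assumptions. Newer client versions prefer passing a `filter=` argument to `where`, but the positional form used here is still accepted.

```python
def username_exists(users_ref, username):
    """Return True if at least one user document has this username."""
    # limit(1) keeps the result set, and therefore the cost, at one document.
    matches = users_ref.where("username", "==", username).limit(1).get()
    return len(list(matches)) > 0
```

Typical usage would be `username_exists(db.collection("users"), "alice")`, where `db` is a Firestore client.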

Would using a substring of a GUID in CosmosDB as partitionkey be a bad idea?

I'm doing some R&D to move a product catalog into CosmosDB.
In its simplest terms a Product document will have:
Product Id (GUID)
Product Name
Manufacturer
A manufacturer will log into this system and will only be able to query their own data so there will always be a ManufacturerId = SINGLE_VALUE filter on every query.
When reviewing the Cosmos docs on choosing the correct partition strategy, there seem to be 2 main points.
- Choose a partition key with a high cardinality
- Choose a partition key that gives an even distribution of data.
In my scenario above, choosing product Id as the PartitionKey would be pretty extreme... 1 document per logical partition.
On the other hand, choosing Manufacturer wouldn't be great either, since that won't result in an even distribution (some manufacturers have 10 products, others have 100,000).
One way to ensure an even distribution would be to take the first 4 characters of the GUID and use that as a PartitionKey (so at most 16^4 = 65,536 partitions). Based on the existing dataset I have, this does result in an even distribution of data, but I'm wondering whether there are any downsides to doing this.
Are there any downsides to just using the entire productId as the PartitionKey (1 doc per partition)? The docs seem to indicate that's a valid approach for a system that stores user profiles. Would this approach have implications for searching for multiple products in the same search?
Using a key that is unique per-document is a good way to ensure even distribution to support high performance - so that makes the full product id a great choice. I don't believe you would gain any advantage from using a substring of a full guid as a partition key - and you would be limiting your maximum number of usable partitions.
So why not always use a unique identifier as the partition key?
First, if you add a partition key to a query, you do not need to enable cross-partition query and you will have a lower overall query cost (RU/s). So if you can design your partition key to reduce your need for cross-partition queries it could save RU/s. I don't think a 'substring of a guid' helps you there, because the random nature of the guid would not distribute documents in a way you could take advantage of for efficient querying.
Second, only documents with the same partition key are guaranteed to all be available on the same partition if you need to involve them in a transactional stored procedure. A 'substring of a guid' also doesn't help with this case.
I almost always use 'identifier' based partition keys such as your product id. This doesn't always correspond to the 'id' of the document itself. Sometimes I have multiple documents with content related to the same thing. For example, if I have some product information synced from another system, that sync job can be most efficient if it uses upsert - but due to current lack of partial update support in CosmosDB (see user voice) the whole document needs to be upserted. So in this case I have one document for the synced information, and a separate document for other information. This could look something like:
{
  "id": "12345:myinfo",
  "productid": "12345",
  "info": {},
  "type": "myinfotype"
},
{
  "id": "12345:vendorsync",
  "productid": "12345",
  "syncedinfo": {},
  "type": "vendorsync"
}
Here the product id is the partition key, and I have a couple of different documents related to that product that I know will reside on the same partition so I can query them efficiently or involve them in a transaction.
I have also used this pattern when implementing a revision system, so that all revisions of the same logical document are guaranteed to be placed on the same partition. In that case the document has a "documentid" that is the same for all revisions, and the actual "id" of the document is the document id with the revision number added.
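A minimal Python sketch of that revision pattern follows. The ":" separator mirrors the ids in the example above but is an assumption for revisions, as are the `revision` property and the helper name; only the `documentid` property and the "document id plus revision number" shape come from the description.

```python
def revision_document(document_id, revision, body):
    """Build one revision; all revisions share documentid (the partition key)."""
    return {
        **body,
        "id": f"{document_id}:{revision}",  # unique per revision
        "documentid": document_id,          # same for every revision
        "revision": revision,
    }


revisions = [revision_document("12345", n, {"title": f"draft {n}"}) for n in (1, 2)]
```

Since every revision carries the same `documentid`, partitioning on that property keeps the full history of a logical document in one logical partition.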
Please also review 'Design for Partitioning' here if you haven't already:
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Depending on the size of your docs and the overall number of docs for a manufacturer, I would probably go with ManufacturerID as your PartitionKey.
Would it be unbalanced? Yes. But as long as the biggest manufacturer can stay under the partition limit (12.5GB as of this writing), you would have very efficient querying. If you chose the GUID field, then you would always have to use a cross-partition query, which needs more RUs and is thus more costly and slower. The assumption I'm making here is that the larger manufacturers will probably execute more queries.
If you do think you'll bump up against that partition limit, another idea would be to partition by a sub-category for each manufacturer, if that's possible. Example: Manufacturer = General Motors, Category = SUVs, and then partition on a custom string field that represents Manufacturer_Category. Such a composite partition key is a good compromise between read/write speed and partition balance.
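The composite key idea could look like this in Python. The property names (`manufacturer`, `category`, `partitionKey`) and the helper names are hypothetical; the only part taken from the text is the Manufacturer_Category shape of the value.

```python
def composite_partition_key(manufacturer, category):
    """Combine manufacturer and category into one partition key value."""
    return f"{manufacturer}_{category}"


def with_partition_key(product):
    """Return a copy of the product document with the synthetic key added."""
    key = composite_partition_key(product["manufacturer"], product["category"])
    return {**product, "partitionKey": key}
```

A query for one manufacturer's category can then supply this value as the partition key and stay within a single logical partition.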
-FYI: No need to use substring of a GUID as a partitionKey because CosmosDB will hash your values automatically for you into the appropriate partition key ranges for the number of physical partitions you have.
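The effect of hashing can be illustrated with a small Python simulation. SHA-256 is used here purely as a stand-in: Cosmos DB applies its own internal hash function, and the number of physical partitions (4 in the usage note) is an arbitrary assumption.

```python
import hashlib
import random
import uuid
from collections import Counter


def bucket_for(partition_key, n_partitions):
    """Map a partition key value to a bucket via a hash (stand-in for Cosmos DB's)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


def sample_guids(n, seed=0):
    """Generate n reproducible GUID strings for the simulation."""
    rng = random.Random(seed)
    return [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(n)]


def distribution(keys, n_partitions):
    """Count how many keys land in each bucket."""
    return Counter(bucket_for(key, n_partitions) for key in keys)
```

Calling `distribution(sample_guids(10_000), 4)` shows the full GUIDs spreading across all four buckets without any manual prefixing.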
