Cosmos LIKE query taking different RU/s - azure-cosmosdb

I have a container in Cosmos DB that is partitioned by a synthetic key I created. The container holds around 6 million documents at the moment, and the synthetic key looks like this:
'InfantCereal-Fct-4WEApr2818', where the pieces are separated by dashes and represent Category, Type, and Date respectively.
What I find odd is that when I run LIKE queries as follows, these are the results I get:
SELECT * FROM c where c.Partition LIKE 'InfantCereal-%-4WEApr2818'
Request Charge
316.78 RUs (VERY HIGH)
-----------------------------------------------------
SELECT * FROM c where c.Partition LIKE 'InfantCereal-Fct-%'
Request Charge
17.41 RUs (VERY LOW)
-----------------------------------------------------
SELECT * FROM c where c.Partition LIKE '%-Fct-4WEApr2818'
Request Charge
297.25 RUs (VERY HIGH)
-----------------------------------------------------
SELECT * FROM c where c.Period = '4WE Apr 28 18'
Request Charge
23.56 RUs
Why is it that when I query for any date value the RU charge is very low? In this container there is more variation in dates than in categories or types, so I don't understand why the cost of the other queries is so high.
Also, why is it that querying on Period (which the container isn't partitioned on) costs much less than some of the queries on the partition key?
I am quite new to Cosmos DB and want to make sure I'm not making any mistakes with my data partitioning before moving forward. Any explanation of Cosmos RU/query costs is appreciated.
Thanks!

It appears you're seeing behavior that corresponds to cases where indexes can be used for more efficient queries. In the docs for the LIKE operator, it's noted that this is equivalent to using the system function RegexMatch. And in the RegexMatch docs, there's a note tucked away at the bottom:
This system function will benefit from a range index if the regular
expression can be broken down into either StartsWith, EndsWith,
Contains, or StringEquals system functions.
This indicates you'll get cost savings when your query can be optimized into one of these cases, and it matches what you're observing: 'InfantCereal-Fct-%' reduces to a simple prefix match (StartsWith), which the range index can serve efficiently, while a leading or embedded wildcard cannot be evaluated as a prefix seek.
Accordingly, if you know you're in one of those cases, consider using the explicit system function such as STARTSWITH instead of LIKE, both to ensure the expected index treatment and to be more self-documenting.
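For example, the low-cost query above could be rewritten explicitly (a minimal rewrite; it should return the same results as the LIKE version):
SELECT * FROM c WHERE STARTSWITH(c.Partition, 'InfantCereal-Fct-')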

Related

DynamoDB index/query questions

I need to create a table with the following fields:
place, date, status
My keys are: partition key - place, sort key - date
Status can be either 0 or 1
The table has approximately 300k rows per day and about three days' worth of data at any given time, so about 1 million rows. I have a service that is continuously populating data into this DDB.
I need to run the following queries (only) once per day:
#1 Return count of all places with date = current_date-1
#2 Return count and list of all places with date= current_date-1 and status = 0
Questions:
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields?
If the answer to the above question is yes: for query #2, do I need to create a GSI on date and status, with date as partition key and status as sort key?
Creating a GSI vs. using a filter expression on status for query #2: which of the two is recommended?
Running analytical queries (such as counts) is the wrong usage of a NoSQL database such as DynamoDB, which is designed for scalable lookup use cases.
Even if you get the SCAN to work with one design or another, it will be more expensive and slower than it should be.
A better option is to export the table data from DynamoDB to S3 and then run an Athena query over that data. That is much more flexible for running various analytical queries.
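If you go that route, the export itself is a single API call. A rough sketch with boto3 (the table ARN and bucket are placeholders, and point-in-time recovery must be enabled on the table):
import boto3

dynamodb = boto3.client("dynamodb")
# Placeholders: substitute your real table ARN and bucket.
dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/places",
    S3Bucket="my-export-bucket",
    ExportFormat="DYNAMODB_JSON",
)
The exported files can then be registered as an external table (for example via a Glue crawler) and queried from Athena with SQL.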
The easiest thing for you to do is a full table scan once per day, filtering by yesterday's date, and as part of that keep your own client-side count of whether the status was 0 or 1. The filter is not index-optimized, so it will be a true full table scan.
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to do a new export every day to keep the data fresh, and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to do repeated queries against the data then the export would make more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once per day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
How much would it cost? Million rows, 100 bytes each, so that's a 100 MB table. That's 25,000 read units to fully scan, which is halved down to 12,500 with eventual consistency. On Demand pricing is $0.25 per million read units. 12,500 / 1,000,000 * $0.25 = $0.003. Less than a dime a month. It'd be cheaper still if you run provisioned.
Just do the scan. :)
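A rough sketch of that daily job, assuming the table name "places" and the attribute names from the question (place, date, status), using the boto3 resource API:
import boto3
from boto3.dynamodb.conditions import Attr

# Hypothetical table name; attribute names are taken from the question.
table = boto3.resource("dynamodb").Table("places")
yesterday = "2024-01-01"  # compute current_date - 1 in whatever format the table stores

total = 0
status_zero_places = []
scan_kwargs = {"FilterExpression": Attr("date").eq(yesterday)}
while True:
    page = table.scan(**scan_kwargs)  # eventually consistent by default
    for item in page["Items"]:
        total += 1
        if item["status"] == 0:  # numbers come back as Decimal; comparing to 0 still works
            status_zero_places.append(item["place"])
    if "LastEvaluatedKey" not in page:
        break  # no more pages
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

print(total, len(status_zero_places))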

cosmos db sql api datetimepart function increasing rus

My Cosmos DB has data stored at per-minute time intervals. I would like to query it using different intervals, depending on the start date and end date.
But when I use the DateTimePart function, the RUs increase drastically.
Without the DateTimePart function, a month's worth of data costs 184 RUs; if I add the DateTimePart function it jumps to 6000.
I tried the query like this:
SELECT c.config,c.datetime
FROM c where c.config='DBM' and c.datetime >='2021-04-21T07:02:16'
and c.datetime <='2021-05-21T07:02:16' and (DateTimePart('minute' , c.body.metadata.datetime) % 60) = 0
I also tried using a subquery join, but still had no luck bringing down the RUs.
Does anyone have any other idea for querying the data with an interval? Basically, if I query data for a month but choose an hourly interval, I should get data for every hour of that month.
Sasha
For common and expensive queries, it can make sense to incorporate additional derived fields in your model to optimize the query cost. For example, if you are always querying for items on the hour mark, you could write those records with an additional property minuteOfHour: 0 and then use the more straightforward query:
SELECT c.config,c.datetime
FROM c where c.config='DBM' and c.minuteOfHour = 0 and c.datetime >='2021-04-21T07:02:16'
and c.datetime <='2021-05-21T07:02:16'
In other words, the strategy is to precompute things once rather than forcing the query engine to perform the work every query for every item.
It may also help to add one or more composite indexes to reduce the query cost.
Of course, always test different approaches to see what yields the least cost.
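As a sketch of the write-side change (the account, database, and container names below are placeholders; the point is computing the derived field once at ingest):
from datetime import datetime
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("telemetry").get_container_client("readings")

doc = {
    "id": "dbm-2021-04-21T07:00:00",
    "config": "DBM",
    "datetime": "2021-04-21T07:00:00",
}
# Precompute the derived field once at write time instead of per query.
doc["minuteOfHour"] = datetime.fromisoformat(doc["datetime"]).minute
container.upsert_item(doc)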

How do you synchronize related collections in Cosmos Db?

My application needs to support lookups for invoices by invoice id and by customer. For that reason I created two collections in which I store the (exact) same invoice documents:
InvoicesById, with partition key /InvoiceId
InvoicesByCustomerId, with partition key /CustomerId
Apparently you should query on the partition key, and since there are two queries I need two collections. I guess there may be more in the future.
Updates are primarily done to the InvoicesById collection, but then I need to replicate the change to InvoicesByCustomerId (and others) as well.
Are there any best practices or sane approaches for keeping collections in sync?
I'm thinking change feeds and whatnot. I want to avoid writing this sync code and risking inconsistencies due to the lack of transactions between collections, etc. Or maybe I'm missing something crucial here.
The change feed will do the trick, though I would suggest taking a step back before brute-forcing the problem.
You can find a detailed article describing the partition split issue here: Azure Cosmos DB. Partitioning.
Based on Microsoft's recommendation for maintainable data growth, you should select the partition key with the highest cardinality (in your case I assume that will be InvoiceId), for this main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
You don't need to create a separate container with a CustomerId partition key, as it won't give you the desired and, most importantly, maintainable performance in the future, and it might result in physical partition data skew when too many invoices are linked to the same customer.
To get optimal and scalable query performance, you most probably want InvoiceId as the partition key and an indexing policy that covers CustomerId (and others in the future).
There will be a slight RU overhead (definitely not a multiplication of RUs, but rather a couple of additional RUs per request) when the data you're querying is distributed across a number of physical partitions (PPs), but it will be negligible compared to the issues that occur when the data grows beyond 50, 100, 150 GB.
Why might CustomerId not be the best partition key for data sets that are expected to grow beyond 50 GB?
The main reason is that Cosmos DB is designed to scale horizontally, and the provisioned throughput per PP is limited to [total provisioned per container (or DB)] / [number of PPs].
Once a PP split occurs due to exceeding the 50 GB size, your max throughput for the existing PPs as well as the two newly created PPs will be lower than it was before the split.
So imagine the following scenario (consider days as the measure of time between actions):
Day 1: You've created a container with 10k provisioned RUs and a CustomerId partition key (which generates one underlying physical partition, PP1). Maximum throughput per PP is 10k/1 = 10k RUs.
Day 2: Gradually adding data to the container, you end up with 3 big customers: C1 [10 GB], C2 [20 GB] and C3 [10 GB] of invoices.
Day 3: When another customer is onboarded with C4 [15 GB] of data, Cosmos DB has to split the PP1 data into two newly created partitions, PP2 (30 GB) and PP3 (25 GB). Maximum throughput per PP is 10k/2 = 5k RUs.
Day 4: Two more customers, C5 [10 GB] and C6 [15 GB], are added and both end up in PP2, which leads to another split -> PP4 (20 GB) and PP5 (35 GB). Maximum throughput per PP is now 10k/3 = 3.333k RUs.
IMPORTANT: As a result, on Day 2 the C1 data could be queried with up to 10k RUs, but on Day 4 with only up to 3.333k RUs, which directly impacts the execution time of your queries.
This is the main thing to remember when designing partition keys in the current version of Cosmos DB (12.03.21).
What you are doing is a good solution. Different queries require different partition keys on different Cosmos DB containers with the same data.
How to sync the two containers: use the change feed of the first container (for example, via an Azure Functions Cosmos DB trigger):
https://devblogs.microsoft.com/premier-developer/synchronizing-azure-cosmos-db-collections-for-blazing-fast-queries/
Cassandra has a feature called materialized views for this exact problem, abstracting away the sync. Maybe some day the same feature will be included in Cosmos DB.
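For illustration, a minimal sketch of mirroring changes via the change feed with the Python SDK (the container names are the ones from the question; the endpoint, key, and database name are placeholders). A production setup would normally use the change feed processor or the Azure Functions trigger from the linked post, which handles checkpointing and retries for you:
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("invoices-db")  # hypothetical database name
source = db.get_container_client("InvoicesById")
target = db.get_container_client("InvoicesByCustomerId")

# Read all changes from the source container and mirror them into the target.
for doc in source.query_items_change_feed(is_start_from_beginning=True):
    target.upsert_item(doc)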

Same query on same data consumes as much as double the RUs between calls

We have a Cosmos DB container where documents are inserted (append-only). We are seeing the same query consume wildly varying RUs even though the number of matching documents did not change. I am able to reproduce this in Data Explorer.
Query:
select * from c where c.Id=<someId> and c.Version > <someVersion> order by c.Version asc
PartitionKeyPath = /Id
When it's executed successively in DataExplorer, I get the following query statistics:
Request Charge = 4857.38 RUs, Index lookup time = 9562.75 ms, Retrieved document count = 77
Request Charge = 1900.79 RUs, Index lookup time = 466.72 ms, Retrieved document count = 77
Request Charge = 1878.25 RUs, Index lookup time = 548.80 ms, Retrieved document count = 77
Note the varying RUs and index lookup times (see the actual screenshots for the remaining data). Our logs show the same query taking upwards of 7964 RUs and 20 seconds!
Also note that when I remove the ORDER BY clause, I start getting the same RUs on successive executions.
Per the documentation, Cosmos DB guarantees that the same query on the same data always costs the same number of RUs on repeated executions. Why are we seeing these variations?
I can answer first as to why it's taking so long: you need to add a composite index. You should always explore adding a composite index to optimize ORDER BY queries; they can have a huge impact.
In regards to the guarantee in the RU section of our docs, it is general guidance that is true 99% of the time, but we can't guarantee it. We will update our docs on this to be more concise.
As for why the RU charge was different: based on the query metrics you shared, the cheaper 1,878 RU query had a 548 ms index lookup time, while the more expensive 4,857 RU query had a 9+ second index lookup time. This is probably a hint as to why the RU difference is so high, but it's not possible to say for sure without activity IDs for the operations.
The query engine has a lot of heuristics it uses during execution and, in some cases, a small change could have a significant impact on the query results. Even things like throttling could force a query to be split into an inconsistent number of pages, impacting the RU charge. This is rare, but possible.
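As an illustration of the composite index suggestion, here is one possible indexing policy for this query pattern and a sketch of applying it with the Python SDK (the endpoint, key, database, and container names are placeholders; the paths /Id and /Version come from the query above):
from azure.cosmos import CosmosClient, PartitionKey

indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": '/"_etag"/?'}],
    # Composite index covering the equality filter (/Id) and the ORDER BY property (/Version).
    "compositeIndexes": [
        [
            {"path": "/Id", "order": "ascending"},
            {"path": "/Version", "order": "ascending"},
        ]
    ],
}

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("mydb")
database.replace_container(
    "mycontainer",
    partition_key=PartitionKey(path="/Id"),
    indexing_policy=indexing_policy,
)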

Can a group count query fail due to Big Data? Amazon Neptune Graph Databases

Can a group count query in Amazon Neptune, or any graph database, fail due to big data?
I mean, if the count exceeds the limits of the count datatype, can there be an overflow?
Short answer
The Gremlin query language semantics (as defined by the TinkerPop code) define the output of the count() function as a 64-bit long. So yes, a count cannot exceed the range of a long.
Long answer
Having said that, let's try to calculate the amount of data you would need to insert into the DB to hit that threshold. Each entity (vertex/edge/property) in the DB has a unique ID associated with it. Let us hypothetically assume that the storage of each entity consists of just the identifier. Also, let us assume that the data type of the identifier is the most efficient, i.e. a long (and not a String, which would use more space than a long).
To hit the limit of count, the DB would need to store more than 2^63 entities (a long is signed), each with a unique identifier, i.e. at least (2^63 * 64) bits of data, which is well over 1000 petabytes even at this very conservative estimate.
The point is, you would need to store a huge amount of data before you hit the limit of count. If you are operating with that amount of data, a DB might not be the right storage solution for you.
