Cosmos DB SQL API DateTimePart function increasing RUs - azure-cosmosdb

My Cosmos DB has data stored at per-minute intervals, and I would like to query it using different intervals depending on the start date and end date.
But when using the DateTimePart function, the RUs increase drastically.
Without the DateTimePart function, a month's worth of data costs 184 RUs; if I add the DateTimePart function it jumps to 6000.
I tried the query like this:
SELECT c.config, c.datetime
FROM c WHERE c.config = 'DBM' AND c.datetime >= '2021-04-21T07:02:16'
AND c.datetime <= '2021-05-21T07:02:16' AND (DateTimePart('minute', c.body.metadata.datetime) % 60) = 0
I also tried a subquery join, but still had no luck bringing down the RUs.
Does anyone have any other idea for querying the data with an interval? Basically, if I query data for a month but choose an hourly interval, I should get one data point for every hour of that month.
Sasha

For common and expensive queries, it can make sense to incorporate additional derived fields in your model to optimize the query cost. For example, if you are always querying for items on the hour mark, you could write those records with an additional property minuteOfHour: 0 and then use the more straightforward query:
SELECT c.config, c.datetime
FROM c WHERE c.config = 'DBM' AND c.minuteOfHour = 0 AND c.datetime >= '2021-04-21T07:02:16'
AND c.datetime <= '2021-05-21T07:02:16'
In other words, the strategy is to precompute things once rather than forcing the query engine to perform the work every query for every item.
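As a minimal sketch of the write side (using the @azure/cosmos JavaScript SDK; the database, container, and document shape here are assumptions, not from the question):

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!);
const container = client.database("telemetry").container("readings");

// Compute minuteOfHour once at write time, so queries can filter on a plain
// indexed property instead of evaluating DateTimePart per item on every query.
async function writeReading(reading: { config: string; datetime: string }) {
  // "2021-04-21T07:02:16" -> 2; read the minutes straight off the ISO string
  const minuteOfHour = Number(reading.datetime.slice(14, 16));
  await container.items.create({ ...reading, minuteOfHour });
}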
It may also help to add one or more composite indexes to reduce the query cost.
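For instance, a composite index covering the equality filters plus the range filter in the query above might look like this in the container's indexing policy (a sketch assuming the derived minuteOfHour property; always verify against your actual workload):

"compositeIndexes": [
  [
    { "path": "/config", "order": "ascending" },
    { "path": "/minuteOfHour", "order": "ascending" },
    { "path": "/datetime", "order": "ascending" }
  ]
]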
Of course, always test different approaches to see what yields the least cost.

Related

The best way to calculate total money from multiple orders

Let's say I have a multi-restaurant food ordering app.
I'm storing orders in Firestore as documents.
Each order object/document contains:
total: double
deliveredByUid: str
restaurantId: str
I want to see, at any time during the day, the totals of every driver for each restaurant, like so:
robert: mcdonalds: 10, kfc: 20
alex: mcdonalds: 35, kfc: 10
What is the best way of calculating the totals of all the orders?
I'm currently thinking of the following:
The safest and easiest method, but expensive: each time I need to know the totals, I just query all of that day's documents and calculate them one by one.
Cloud Functions method: each time an order is added/removed, modify a value at a specific Realtime Database child: /totals/driverId/placeId
Manual totals: each time a driver completes an order and writes its id to the order object, make another write to the specific Realtime Database child.
Edit: added the whole order object because I was asked to.
What I would most likely do is make sure orders are completely atomic (or as atomic as they can be). Most likely, I'd perform the order on the client within a transaction or batch write (both are atomic) that would not only create this document in question but also update the delivery driver's document by incrementing their running total. Depending on how extensible I wanted to get, I may even create subcollections within the user's document that represented chunks of time if I wanted to be able to record totals by month or year, or whatever. You really want to think this one through now.
The reason I'd advise against your suggested pattern is because it's not atomic. If the operation succeeds on the client, there is no guarantee it will succeed in the cloud. If you make both writes part of the same transaction then they could never be out of sync and you could guarantee that the total will always be accurate.
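A minimal sketch of that atomic write (Firebase modular JS SDK; the orders and driverTotals collection names are assumptions, not from the question):

import { getFirestore, doc, collection, runTransaction, increment } from "firebase/firestore";

const db = getFirestore();

// Create the order and bump the driver's per-restaurant running total in one
// transaction: either both writes land or neither does.
async function completeOrder(order: { total: number; deliveredByUid: string; restaurantId: string }) {
  const orderRef = doc(collection(db, "orders")); // new auto-id order doc
  const totalsRef = doc(db, "driverTotals", order.deliveredByUid);
  await runTransaction(db, async (tx) => {
    tx.set(orderRef, order);
    tx.set(totalsRef, { [order.restaurantId]: increment(order.total) }, { merge: true });
  });
}

Since nothing is read inside the transaction here, a batch write (writeBatch) would work equally well, as the answer notes.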

DynamoDB index/query questions

I need to create a table with the following fields :
place, date, status
My keys are: partition key - place, sort key - date
Status can be either 0 or 1.
The table has approximately 300k rows per day and about 3 days' worth of data at any given time, so about 1 million rows. I have a service that is continuously writing data into this DDB table.
I need to run the following queries (only) once per day:
#1 Return count of all places with date = current_date-1
#2 Return count and list of all places with date= current_date-1 and status = 0
Questions :
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields?
If the answer to the above is yes: for query #2, do I need to create a GSI on date and status, with date as partition key and status as sort key?
Creating a GSI vs. using a filter expression on status for query #2: which of the two is recommended?
Running analytical queries (such as counts) is a misuse of a NoSQL database like DynamoDB, which is designed for scalable lookup use cases.
Even if you get the SCAN to work with one design or another, it will be more expensive and slower than it should be.
A better option is to export the table data from DynamoDB into S3, and then run an Athena query over that data. It will be much more flexible to run various analytical queries.
Easiest thing for you to do is a full table scan once per day, filtering by yesterday's date, and as part of that keep your own client-side count of whether the status was 0 or 1. The filter is not index optimized, so it will be a true full table scan.
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to run a new export every day to keep the data fresh, and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to run repeated queries against the data, then the export makes more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once per day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
How much would it cost? Million rows, 100 bytes each, so that's a 100 MB table. That's 25,000 read units to fully scan, which is halved down to 12,500 with eventual consistency. On Demand pricing is $0.25 per million read units. 12,500 / 1,000,000 * $0.25 = $0.003. Less than a dime a month. It'd be cheaper still if you run provisioned.
Just do the scan. :)
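If it helps, here's a minimal sketch of that daily scan (AWS SDK for JavaScript v3; the table name and date string format are assumptions):

import { DynamoDBClient, paginateScan } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Full scan filtered to yesterday's date, counting status 0 client-side.
// "date" is a reserved word in DynamoDB expressions, hence the #d alias.
async function dailyCounts(yesterday: string) {
  let total = 0;
  let statusZero = 0;
  const pages = paginateScan(
    { client },
    {
      TableName: "places",
      FilterExpression: "#d = :d",
      ExpressionAttributeNames: { "#d": "date" },
      ExpressionAttributeValues: { ":d": { S: yesterday } },
    }
  );
  for await (const page of pages) {
    for (const item of page.Items ?? []) {
      total++;
      if (item.status?.N === "0") statusZero++;
    }
  }
  return { total, statusZero };
}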

Cosmos LIKE query taking different RU/s

I have a container in Cosmos that is partitioned by a synthetic key I created. It holds around 6 million documents at the moment, and the synthetic key looks like this:
'InfantCereal-Fct-4WEApr2818', where the pieces are separated by a dash and represent Category, Type, and Date respectively.
What I find odd is that when I use LIKE queries as follows, these are the results I get:
SELECT * FROM c WHERE c.Partition LIKE 'InfantCereal-%-4WEApr2818'
Request Charge: 316.78 RUs (VERY HIGH)
-----------------------------------------------------
SELECT * FROM c WHERE c.Partition LIKE 'InfantCereal-Fct-%'
Request Charge: 17.41 RUs (VERY LOW)
-----------------------------------------------------
SELECT * FROM c WHERE c.Partition LIKE '%-Fct-4WEApr2818'
Request Charge: 297.25 RUs (VERY HIGH)
-----------------------------------------------------
SELECT * FROM c WHERE c.Period = '4WE Apr 28 18'
Request Charge: 23.56 RUs
Why is it that when the date piece is left as a wildcard, the RU charge is very low? In this container there is more variation in dates than in category and type, so I don't understand why the cost of the other queries is so high.
Also, why does querying on Period (which the container isn't partitioned on) cost far fewer RUs than the LIKE queries on the partition key?
I am quite new to Cosmos DB and I want to make sure I'm not making any mistakes in my data partitioning before moving forward; any explanation of Cosmos RU/query costs is appreciated.
Thanks!
It appears you're seeing behavior that corresponds to cases where indexes can be used for more efficient queries. In the docs for the LIKE operator, it's noted that this is equivalent to using the system function RegexMatch. And in the RegexMatch docs, there's a note tucked away at the bottom:
This system function will benefit from a range index if the regular
expression can be broken down into either StartsWith, EndsWith,
Contains, or StringEquals system functions.
This indicates you'll get cost savings when your query can be optimized into one of these cases, which matches what you're observing.
Accordingly, if you know you're in one of those cases, consider using the explicit function (such as STARTSWITH) instead of LIKE, both to ensure the expected index treatment and to be more self-documenting.
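For example, the cheap prefix query above could be written explicitly as (same property names as in the question):
SELECT * FROM c WHERE STARTSWITH(c.Partition, 'InfantCereal-Fct-')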

How would I order my collection on timestamp and score

I have a collection with documents that have a createdAt timestamp and a score number. I sort all the documents on score for our leaderboard. But now I want to also have the daily best.
matchResults.orderBy("score").where("createdAt", ">", yesterday).startAt(someValue).limit(10);
But I found that there are limitations when using different fields.
https://firebase.google.com/docs/firestore/query-data/order-limit-data#limitations.
So how could I get the results for today in chunks of 10, sorted on score?
You can use multiple orderBy(...) clauses to order on multiple fields, but this won't exactly meet your needs since you must first order by timestamp and only second by score.
A brute force option would be to fetch all the scores for the given day and truncate the list locally. But that of course won't work well if there are thousands of scores to load.
One simple answer would be to use a datestamp instead of timestamp:
matchResults.where("dayCreated", "==", "YYYY-MM-DD").orderBy("score").startAt(...).limit(10)
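A fuller sketch with the modular JS SDK (the collection name and descending order are assumptions; this shape needs a composite index on dayCreated and score):

import { getFirestore, collection, query, where, orderBy, limit, getDocs } from "firebase/firestore";

// Top 10 scores for a given day ("YYYY-MM-DD"). An equality filter plus an
// orderBy on a different field is allowed by Firestore's query limitations.
async function dailyTop(day: string) {
  const db = getFirestore();
  const q = query(
    collection(db, "matchResults"),
    where("dayCreated", "==", day),
    orderBy("score", "desc"),
    limit(10)
  );
  return (await getDocs(q)).docs.map((d) => d.data());
}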
A second simple answer would be to run a Cloud Function on write events and maintain a daily top scores table separate from your scores data. If the results are frequently viewed, this would ultimately prove more economical and scalable as you would only need to record a small subset (say the top 100) by day, and can simply query that table ordering by score.
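A sketch of that Cloud Function approach (firebase-functions v1 API; the dailyScores collection layout is an assumption):

import * as functions from "firebase-functions";
import * as admin from "firebase-admin";

admin.initializeApp();

// Mirror each new result into a per-day collection; the leaderboard read is
// then a plain orderBy("score").limit(n) against dailyScores/{day}/scores.
export const mirrorScore = functions.firestore
  .document("matchResults/{id}")
  .onCreate((snap) => {
    const { score, createdAt } = snap.data();
    const day = createdAt.toDate().toISOString().slice(0, 10); // "YYYY-MM-DD"
    return admin.firestore().doc(`dailyScores/${day}/scores/${snap.id}`).set({ score });
  });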
Scoreboards are extremely difficult at scale, so don't underestimate the complexity of handling every edge case. Start small and practical, focus on aggregating your results during writes and keep reads small and simple. Limit scope by listing only a top percentage for your "top scores" records and skip complex pagination schemas where possible.

Storing Weighted Graph Time Series in Cassandra

I am new to Cassandra, and I want to brainstorm storing time series of weighted graphs in Cassandra, where each edge weight is incremented at each time step but also decays as a function of time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident on the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite keys described below, which is of type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: Which queries and query-patterns does your application have?
Furthermore, I would be interested in how often a new weight will be calculated and stored for each edge. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory, so you could avoid the read before the write? Some sort of lazy-loading mechanism for this value might be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
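A minimal sketch of that single-table design (CQL; the table and column names are assumptions):

CREATE TABLE edge_weights (
    eid text,
    time timestamp,
    weight double,
    PRIMARY KEY (eid, time)
) WITH CLUSTERING ORDER BY (time DESC);

-- with time clustered descending, the latest update for an edge is simply:
SELECT time, weight FROM edge_weights WHERE eid = 'G-V1-V2' LIMIT 1;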
I would avoid reading before writing in Cassandra, as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small number of queries. What you're suggesting doesn't lend itself well to Cassandra, as there doesn't appear to be any way to avoid reading before you write: even if you use a single table, you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there are better tools for the job.

Having said that, this would be perfectly feasible if you could keep all of table 2 in memory, potentially utilising the row cache. As long as table 2 is small enough that the majority of its rows fit in memory, your reads will be significantly faster, which may make up for the need to perform a read on every write. This would be quite a challenge, however: you would need to ensure only the "last update time" for each row is kept in memory, and that disk rarely needs to be touched.
Another design you may want to look at puts a cache in front of Cassandra to store the last update times. This could run alongside Cassandra or on a separate node, but it would be an in-memory store of the last update times only; when you need to update a row, you query the cache and write your full row to Cassandra (you could even write the last update time there too if you wished). You could use something like Redis for this, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory, and so on.
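A sketch of that cache-fronted update path (node-redis client; the key scheme and field names are assumptions):

import { createClient } from "redis";

const redis = createClient(); // assumes await redis.connect() at startup

// Read the last (time, weight) from the cache, apply the decay formula
// w(t+1) = w(t)*exp(-dt/tau) + 1, then write the new state back.
async function updateEdge(eid: string, now: number, tau: number) {
  const prev = await redis.hGetAll(`edge:${eid}`);
  const dt = prev.time ? now - Number(prev.time) : 0;
  const weight = Number(prev.weight ?? 0) * Math.exp(-dt / tau) + 1;
  await redis.hSet(`edge:${eid}`, { time: String(now), weight: String(weight) });
  return weight; // caller then inserts (eid, now, weight) into the Cassandra time series
}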
