DynamoDB Multiple Conditional Updates of Non-dependent Attributes

I am using the JavaScript SDK for AWS and I am attempting to perform a single update operation on OHLC (Open, High, Low, Close) data. I only want to update the "High" attribute if the new value is higher than the stored value, and the opposite for the "Low".
As far as I can tell, there are 2 options:
Query the DB to get the current OHLC values, calculate the differences, then update the DB again.
Perform 2 updates, one with a conditional expression for "High" and one with a conditional expression for "Low".
The basis of the question is this: "Can I use the conditionExpression to perform multiple, non-dependent update conditions on separate attributes?"

I'm afraid it's not possible to have 2 non-dependent conditions on the same UpdateItem API call.
Your first option is more cost-effective; however, you may need to use versioning if you have high concurrency.
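If you go with the first option, a minimal sketch of the read-then-conditional-write approach with optimistic locking might look like this, using the AWS SDK for JavaScript v3 (the table name, key, and attribute names are assumptions for illustration):
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, GetCommand, UpdateCommand } = require("@aws-sdk/lib-dynamodb");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Hypothetical table and key names; adjust to your schema.
async function applyPrice(symbol, period, price) {
  // 1. Read the current candle.
  const { Item } = await ddb.send(new GetCommand({
    TableName: "ohlc",
    Key: { symbol, period },
  }));

  // 2. Compute the new high/low locally.
  const high = Math.max(Item.high, price);
  const low = Math.min(Item.low, price);

  // 3. Write back only if nobody else updated the item in the meantime
  //    (optimistic locking via a numeric version attribute).
  await ddb.send(new UpdateCommand({
    TableName: "ohlc",
    Key: { symbol, period },
    UpdateExpression: "SET #h = :high, #l = :low, #c = :close, #v = #v + :one",
    ConditionExpression: "#v = :expected",
    ExpressionAttributeNames: { "#h": "high", "#l": "low", "#c": "close", "#v": "version" },
    ExpressionAttributeValues: {
      ":high": high,
      ":low": low,
      ":close": price,
      ":one": 1,
      ":expected": Item.version,
    },
  }));
}
If the conditional check fails because of a concurrent writer, re-read the item and retry.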

Related

The best way to calculate total money from multiple orders

Let's say I have a multi-restaurant food order app.
I'm storing orders in Firestore as documents.
Each order object/document contains:
total: double
deliveredByUid: str
restaurantId: str
I want to see, at any time during the day, the totals for every driver at each restaurant, like so:
robert:
  mcdonalds: 10
  kfc: 20
alex:
  mcdonalds: 35
  kfc: 10
What is the best way of calculating the totals of all the orders?
I'm currently thinking of the following:
The safest and easiest method, but expensive: each time I need to know the totals, I just query all the documents for that day and calculate them one by one.
Cloud Functions method: each time an order is added/removed, modify a value at a specific Realtime Database child: /totals/driverId/placeId
Manual totals: each time a driver completes an order and writes its id to the order object, make another write to the specific Realtime Database child.
Edit: added the whole order object because I was asked to.
What I would most likely do is make sure orders are completely atomic (or as atomic as they can be). Most likely, I'd perform the order on the client within a transaction or batch write (both are atomic) that would not only create this document in question but also update the delivery driver's document by incrementing their running total. Depending on how extensible I wanted to get, I may even create subcollections within the user's document that represented chunks of time if I wanted to be able to record totals by month or year, or whatever. You really want to think this one through now.
The reason I'd advise against your suggested pattern is because it's not atomic. If the operation succeeds on the client, there is no guarantee it will succeed in the cloud. If you make both writes part of the same transaction then they could never be out of sync and you could guarantee that the total will always be accurate.
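For example, a minimal sketch of that pattern with the Firebase JavaScript SDK (v8/compat style, matching the other snippets here), using a batched write and FieldValue.increment; the "orders" and "totals" collection names and the totals document layout are assumptions for illustration:
// Assumed layout: one totals document per driver at totals/{driverUid},
// with one numeric field per restaurant id.
function placeOrder(db, order) {
  const batch = db.batch();

  // 1. Create the order document.
  const orderRef = db.collection("orders").doc();
  batch.set(orderRef, order);

  // 2. Atomically bump the driver's running total for this restaurant.
  const totalsRef = db.collection("totals").doc(order.deliveredByUid);
  batch.set(
    totalsRef,
    { [order.restaurantId]: firebase.firestore.FieldValue.increment(order.total) },
    { merge: true }
  );

  // Both writes succeed or fail together.
  return batch.commit();
}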

Firestore Query crashes while using whereNotEqualTo and multiple orderBy [duplicate]

Let's say I have a collection of cars and I want to filter them by price range and by year range. I know that Firestore has strict limitations due to performance reasons, so something like:
db.collection("products")
.where('price','>=', 70000)
.where('price','<=', 90000)
.where('year','>=', 2015)
.where('year','<=', 2018)
will throw an error:
Invalid query. All where filters with an inequality (<, <=, >, or >=) must be on the same field.
So is there any other way to perform this kind of query without managing the data locally? Maybe some kind of indexing or tricky data organization?
The error message and documentation are quite explicit on this: a Firestore query can only perform range filtering on a single field. Since you're trying to filter ranges on both price and year, that is not possible in a single Firestore query.
There are two common ways around this:
Perform filtering on one field in the query, and on the other field in your client-side code.
Combine the values of the two ranges into a single field in some way that allows your use-case with a single field. This is incredibly non-trivial, and the only successful example of such a combination that I know of is using geohashes to filter on latitude and longitude.
Given the difference in effort between these two, I'd recommend picking the first option.
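A minimal sketch of that first option, filtering on price in the query and on year in client-side code (field names and bounds taken from the question):
const snapshot = await db.collection("products")
  .where('price', '>=', 70000)
  .where('price', '<=', 90000)
  .get();

// Apply the second range filter on the client.
const products = snapshot.docs
  .map(doc => ({ id: doc.id, ...doc.data() }))
  .filter(p => p.year >= 2015 && p.year <= 2018);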
A third option is to model your data differently, as to make it easier to implement your use-case. The most direct implementation of this would be to put all products from 2015-2018 into a single collection. Then you could query that collection with db.collection("products-2015-2018").where('price','>=', 70000).where('price','<=', 90000).
A more general alternative would be to store the products in a collection for each year, and then perform 4 queries to get the results you're looking for: one for each of the collections products-2015, products-2016, products-2017, and products-2018.
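A sketch of that per-year-collection alternative, running the four queries in parallel and merging the results (the collection names follow the naming above):
const years = [2015, 2016, 2017, 2018];

const snapshots = await Promise.all(years.map(year =>
  db.collection(`products-${year}`)
    .where('price', '>=', 70000)
    .where('price', '<=', 90000)
    .get()
));

// Flatten the four result sets into one list.
const products = snapshots.flatMap(snap =>
  snap.docs.map(doc => ({ id: doc.id, ...doc.data() }))
);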
I recommend reading the document on compound queries and their limitations, and watching the video on Cloud Firestore queries.
You can't do multiple range queries because of the limitations mentioned here, but with a little cost to the UI you can still achieve this by indexing the year, like this:
db.collection("products")
.where('price','>=', 70000)
.where('price','<=', 90000)
.where('yearCategory', 'in', ['new', 'old'])
Of course, new and old go out of date, so you can group the years into yearCategory buckets like yr-2014-2017, yr-2017-2020, and so on. The in operator can only take 10 elements per query, so this gives you an idea of how wide a range each bucket should cover.
You can write yearCategory during insert, or, if you have a large range (such as a number of likes), you'd want another process that polls this data and updates the category.
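A small sketch of setting such a bucket field at write time (the bucket boundaries here are just an example):
// Hypothetical helper: map a year onto a coarse bucket.
function yearCategory(year) {
  if (year >= 2017) return 'yr-2017-2020';
  if (year >= 2014) return 'yr-2014-2017';
  return 'yr-older';
}

await db.collection("products").add({
  price: 80000,
  year: 2016,
  yearCategory: yearCategory(2016), // written alongside the raw year
});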
In Flutter you can do something like this:
final _queryList = await db.collection("products").where('price', isGreaterThanOrEqualTo: 70000).get();
final _docList = _queryList.docs.where((doc) => doc['price'] <= 90000).toList();
Add more query conditions as you want, but Firestore only allows a limited combination of conditions in a single query; get the data, and after that you can filter it according to your needs.

DynamoDB: Using filtered expression vs creating separate table with selected data for efficiency

I am writing an API, which has a data model with a boolean status field.
And 90% of the calls to the API will require a filter on status = "active".
Context:
Currently, I have it as a DynamoDB boolean field and use a filter expression over it, but I am debating whether to instead create a separate table in which the relevant identifier acts as the hash key for the query and which stores the item information corresponding to the "active" status, as there can be only one item with "active" status for a particular hash key.
Now my questions are:
Data integrity is a big question here since I will be updating two tables depending upon the request.
Is using separate tables a good practice in DynamoDB in this use case, or am I using the wrong DB?
Is the query execution over a filter expression efficient enough that I can use the current setup?
Scale of the API usage is medium right now but it is expected to increase.
A filter expression is going to be inefficient because filter expressions are applied to results after the scan or query is processed. They could save on network bandwidth in some cases but otherwise you could just as well apply the filter in your own code with pretty much the same results and efficiency.
Your other option would be to create a Global Secondary Index (GSI) with a partition key on the boolean field, which might be better if you have significantly fewer "active" records than "inactive". In that case a useful pattern is to create a surrogate field, say "status_active", which you set to TRUE only for active items, and to NULL for others. Then, if you create a GSI with a partition key on the "status_active" field, it will contain only the active records (NULL values do not get indexed).
The index on a surrogate field is probably the best option as long as you expect the set of active records to be sparse in the table (i.e. there are fewer actives than inactives).
If you expect that about 50% of records would be active and 50% would be inactive then having two tables and dealing with transaction integrity on your own might be a better choice. This is especially attractive if records are only infrequently expected to transition between states. DynamoDB provides very powerful atomic counters and conditional checks that you can use to craft a solution that ensures state transitions are consistent.
If you expect that many records would be active and only a few inactive, then using a filter might actually be the best option, but keep in mind that filtered records still count towards your provisioned throughput, so again, you could simply filter them out in the application with much the same result.
In summary, the answer depends on the distribution of values in the status attribute.
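A minimal sketch of the surrogate-field / sparse GSI pattern with the AWS SDK for JavaScript v3 (the table, index, and attribute names are assumptions for illustration); the surrogate attribute is only present on active items, so a GSI keyed on it contains only active records:
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, UpdateCommand, QueryCommand } = require("@aws-sdk/lib-dynamodb");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Mark an item active: the surrogate attribute is only set on active items,
// so the GSI partitioned on "status_active" stays sparse.
async function markActive(id) {
  await ddb.send(new UpdateCommand({
    TableName: "items",
    Key: { id },
    UpdateExpression: "SET status_active = :t",
    ExpressionAttributeValues: { ":t": "TRUE" },
  }));
}

// Mark an item inactive by removing the surrogate attribute,
// which drops it out of the sparse index.
async function markInactive(id) {
  await ddb.send(new UpdateCommand({
    TableName: "items",
    Key: { id },
    UpdateExpression: "REMOVE status_active",
  }));
}

// Fetch only active items by querying the sparse GSI.
async function getActiveItems() {
  const { Items } = await ddb.send(new QueryCommand({
    TableName: "items",
    IndexName: "status_active-index",
    KeyConditionExpression: "status_active = :t",
    ExpressionAttributeValues: { ":t": "TRUE" },
  }));
  return Items;
}
Note that a GSI partition key must be a string, number, or binary, which is why the sketch stores the flag as the string "TRUE" rather than a boolean.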

Storing Weighted Graph Time Series in Cassandra

I am new to Cassandra, and I want to brainstorm storing time series of weighted graphs in Cassandra, where each edge weight is incremented on every tick but also updated as a function of time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident on the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite keys described below, which is type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: Which queries and query-patterns does your application have?
Furthermore, I would be interested in how often a new weight for each edge will be calculated and stored. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory? So you could avoid the reading before writing? Possibly some sort of lazy-loading mechanism of this value would be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
I would avoid reading before writing in Cassandra as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small number of queries. What you're suggesting doesn't really lend itself to being a good fit for Cassandra, as there doesn't appear to be any way to avoid reading before you write. Even if you use a single table you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there are better tools for the job. Having said that, this would be perfectly feasible if you could keep all of table 2 in memory, potentially utilising the row cache. As long as table 2 is small enough that the majority of its rows fit in memory, your reads will be significantly faster, which may make up for the need to perform a read on every write. This would be quite a challenge, however, and you would need to ensure that only the "last update time" for each row is kept in memory and disk rarely needs to be touched.
Anyway, another design you may want to look at is an implementation where you not only use Cassandra but also a cache in front of Cassandra to store the last updated times. This could run alongside Cassandra or on a separate node, but it could be an in-memory store of the last update times only, and when you need to update a row you query the cache and write your full row to Cassandra (you could even write the last update time if you wished). You could use something like Redis to perform this function, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory and so on.
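A rough sketch of that cache-in-front idea in Node.js, assuming the cassandra-driver and ioredis packages and a single merged time-series table as mentioned in the edit (keyspace, table, and column names are illustrative only):
const cassandra = require("cassandra-driver");
const Redis = require("ioredis");

const cass = new cassandra.Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "graphs",
});
const redis = new Redis();

const TAU = 3600; // decay constant in seconds (illustrative)

// Record a new observation for an edge: read the last state from Redis,
// apply the decay formula, append the new point to Cassandra,
// and refresh the cache -- no Cassandra read on the hot path.
// "now" is a Unix timestamp in seconds.
async function tick(eid, now) {
  const cached = await redis.hgetall(`edge:${eid}`);
  const lastTime = Number(cached.time) || now;
  const lastWeight = Number(cached.weight) || 0;

  const dt = now - lastTime;
  const weight = lastWeight * Math.exp(-dt / TAU) + 1;

  await cass.execute(
    "INSERT INTO edge_updates (eid, time, weight) VALUES (?, ?, ?)",
    [eid, new Date(now * 1000), weight],
    { prepare: true }
  );

  await redis.hset(`edge:${eid}`, "time", now, "weight", weight);
}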

Multi-row atomicity-consistency with Riak?

Let me get to my example:
For the ID=>values 0=>87, 1=>24, 2=>82, 3=>123, 4=>34, 5=>61,
increment all values for keys between 1 and 4 by 10
For a multi-row operation like this, does Riak offer atomicity, i.e. does this operation either fail or succeed as a whole, without partially dirtying the data?
Do queries that aggregate over the rows while they are being updated see consistent results?
I saw no place that deals with this question explicitly, but I guess the "tunable CAP" controls, set to "enable consistency and partition tolerance", seem like the key.
No.
Riak has no concept of atomicity overall (it's an eventually consistent system), and also does not have any concept of a "transaction" where multiple K/V pairs can be modified or read as a set.
