DynamoDB top item per partition - amazon-dynamodb

We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized, e.g. the moment of the last update record for TSLA maybe different than for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash_key Symbol, range_key Moment, and believe we could achieve the first query easily/efficiently.
We also assume could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table or the same table with extra info like a column IsLatestForMachineKey (effectively a bool). With every insert, you'd grab the one where IsLatestForMachineKey=1, compare the Moment and if the insertion is newer, set the new one to 1 and the older one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.

There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems api.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out of order data.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.

Related

3 column query in DynamoDB using DynamooseJs

My table is (device, type, value, timestamp), where (device,type,timestamp) makes a unique combination ( a candidate for composite key in non-DynamoDB DBMS).
My queries can range between any of these three attributes, such as
GET (value)s from (device) with (type) having (timestamp) greater than <some-timestamp>
I'm using dynamoosejs/dynamoose. And from most of the searches, I believe I'm supposed to use a combination of the three fields (as a single field ; device-type-timestamp) as id. However the set: function of Schema doesn't let me use the object properties (such as this.device) and due to some reasons, I cannot do it externally.
The closest I got (id:uuidv4:hashKey, device:string:GlobalSecIndex, type:string:LocalSecIndex, timestamp:Date:LocalSecIndex)
and
(id:uuidv4:rangeKey, device:string:hashKey, type:string:LocalSecIndex, timestamp:Date:LocalSecIndex)
and so on..
However, while using a Query, it becomes difficult to fetch results of particular device,type as the id, (hashKey or rangeKey) keeps missing from the scene.
So the question. How would you do it for such kind of table?
And point to be noted, this table is meant to gather content from IoT devices, which is generated every 5 mins by each device on an average.
I'm curious why you are choosing DynamoDB for this task. Advanced queries like this seem to be much better suited for a SQL based database as opposed to a NoSQL database. Due to the advanced nature of SQL queries, this task in my experience is a lot easier in SQL databases. So I would encourage you to think about if DynamoDB is truly the right system for what you are trying to do here.
If you determine it is, you might have to restructure your data a little bit. You could do something like having a property that is device-type and that will be the device and type values combined. Then set that as an index, and query based on that and sort by the timestamp, and filter out the results that are not greater than the value you want.
You are correct that currently, Dynamoose does not pass in the entire object into the set function. This is something that personally I'm open to exploring. I'm a member on the GitHub project, and if you would like to submit a PR adding that feature I would be more than happy to help explore that option with you and get that into the codebase.
The other thing you might want to explore is having a DynamoDB stream, that will set that device-type property whenever it gets added to your DynamoDB table. That would abstract that logic out of DynamoDB and your application. I'm not sure if it's necessary for what you are doing to decouple it to that level, but it might be something you want to explore.
Finally, depending on your setup, you could figure out which item will be more unique, device or type, and setup an index on that property. Then just query based on that, and filter out the results of the other property that you don't want. I'm not sure if that is what you are looking for, it will of course work, but I'm not sure how many items you will have in your table, and there get to be questions about scalability at a certain level. One way to solve some of those scalability questions might be to set the TTL of your items if you know that you the timestamp you are querying for is constant, or predictable ahead of time.
Overall there are a lot of ways to achieve what you are looking to do. Without more detail about how many items, what exactly those properties will be doing, the amount of scalability you require, which of those properties will be most unique, etc. it's hard to give a good solution. I would highly encourage you to think about if NoSQL is truly the best way to go. That query you are looking to do seems a LOT more like a SQL query. Not saying it's impossible in DynamoDB, but it will require some thought about how you want to structure your data model, and such.
Considering opinion of #charlie-fish, I decided to jump into Dynamoose and improvise the code to pass the model to the set function of the attribute. However, I discovered that the model is already being passed to default parameter of the attribute. So I changed my Schema to the following:
id:hashKey;default: function(model){ return model.device + "" + model.type; }
timestamp:rangeKey
For anyone landing here on this answer, please note that the default & set functions can access attribute options & schema instance using this . However both those functions should be regular functions, rather than arrow functions.
Keeping this here as an answer, but I won't accept it as an answer to my question for sometime, as I want to wait for someone else to hit out a better approach.
I also want to make sure that if a value is passed for id field, it shouldn't be set. For this I can use set to ignore the actual incoming value, which I don't know how, as of yet.

Dynamo DB, How do you query everything AND leverage sort key

I already have an index set up with the second sort key set to what I want (an integer timestamp). The API keeps complaining that I'm not giving it a KeyConditionExpression. Then if I give it one, it says id must be specified. I've tried forcing it to just give me everything using id <> null and it STILL won't do it. Is this even possible?? Maybe its time to get rid of dynamo if it can't do this utterly simple task.
For the love of god, all I'm trying to do is query the entire table AND have it use my sort key. I would have had this going in SQL hours ago..
First of all, DynamoDB is a NOSQL database, so it's intentionally NOT SQL. Perhaps you shouldn't expect to be able to perform SQL like queries that you are used to, and be frustrated by the fact that these are two completely different types of databases, each with its strengths and weaknesses.
Records in DynamoDB are partitioned using the hash key, and may optionally be sorted within each partition.
The hash key should be picked so that items are as evenly distributed over partitions as possible. The use of partitions is what makes DynamoDB extremely scalable and fast. But if what you need is to scan over all your items and get them in sorted order, then you probably either are using the wrong tool for the job, or you need to sort the items on the client side.
The scan operation will simply go through all partitions, returning all items from each partition. At this point, the items can only be sorted within their respective partition.
As an example, consider a set of data being partitioned into 3 partitions:
Partition A Partition B Partition B
Sort key Sort key Sort key
A D C
C E K
P G L
As you can see, you can easily query each partition and get the items in it in sorted order. But if you scan, you will probably get items sorted as
[A, C, P, D, E, G, C, K, L], if the sort order is at all deterministic. At this point you would have to sort the items yourself.
A "trick" that is sometimes seen is to use a "dummy" hash key with an equal value for all items, like you mentioned in your own answer. This way you can query for "dummy = 1" and get the items sorted according to the sort key. However, this completely defeats the purpose of the hash key as all items will be put in the same partition, thus not making the table scale at all. But if you find yourself using DynamoDB even though you have a really small dataset, by all means it would work. But again, with a small data set and use-cases like this, you should probably be using another tool such as RDS in the first place.
Just to elaborate on #JHH though. In general I'd say he is correct that you shouldn't need to sort all elements in DynamoDB. I also have a requirement similar to this, as I need to get the top N number of elements, which could all be in different partitions.
DynamoDB does have a way of doing this, it just isn't out of the box. I don't think that it's so correct to say you should then need an SQL database, as arguably you'd never use a NoSQL database because you will always have one of these limitations. Also if you only ever use NoSQL for large data-sets then you will always have to rework your application later.
What to do then? Well you do have a few options, and it depends on your use-case, lets' assume that you are at least having sorting within your partitions, this makes it easier. We'll also assume you are looking for the max.
The simplest way would be if you would get the first value from every partition. And find the max. If you needed say the top 10 values you could still utilise this strategy but would get too complicated.
Next option is to make use of DynamoDB Streams. Say we want to keep a list of the top 100 elements. These would sit ready and waiting on their own top values partition, sorted and ready for instant retrieval. You would need to maintain this list yourself by checking when items are inserted or updated, that they are greater than the 100th element. If that is the case you would insert the element into the top values partition, and delete the last value. This I think would be the most likely way to approach this problem.
So in NoSQL if there is some sort of query, you would love to do which is oh so easy in SQL, and you cant use your Table/GSI/LSI, then you pretty much need to compute the result manually, and have it ready for consumption.
Now if you weren't going to make use of these top values very often, then you might go with the first method, and scan every partition top values till you had the list you wanted, but depending on how much the values are scattered across partitions this could take many capacity units.
Hope that helps.
Turns out, you can also add an IndexName to a scan. That helps. Furthermore, if you create an index with a sort key, all primary indices MUST be identical for the sort to occur.

Is a scan query always expensive in DynamoDB or should you use a range key

I've been playing around with Amazon DynamoDB and looking through their examples but I think I'm still slightly confused by the example. I've created the example data on a local dynamodb instance to get used to querying data etc. The sample data sets up 3 tables of 'Forum'->'Thread'->'Reply'
Now if I'm in a specific forum, the thread table has a ForumName key I can query against to return relevant threads, but would the very top level (displaying the forums) always have to be a scan operation?
From what I can gather the only way to "select *" in dynamodb is to use a scan and I assume in this instance - where forum is very high level and might have a relatively small number of rows - that it wouldn't be that expensive or are you actually better creating a hash and range key and using that to query this table? I'm not sure what the range key would be in this instance, maybe just a number and then specify in the query that the value has to be > 0? Or perhaps a date it was created and the query always uses a constant date in the past?
I did try a sample query on the 'Forum' table example data using a ComparisonOperator of 'GE' (Greater than or equal) with an attribute value list of 'S'=>'a' but this states that any conditions on the hash key must be of type EQ which implies I couldn't do the above as I would always need to know my 'Name' values upfront
Maybe I'm still struggling having come from an RDBS background especially seen as there are many forum examples out there.
thanks
I think using Scan to get all the forums is fine. I think it is very efficient because it will not return you anything that you don't need (all of the work that scan does is necessary). Also since Scan operation is so simple it is easier to implement and more likely to be efficient

Array calculation in Tableau, maxif routine

I'm fairly new to Tableau, and I'm struggling in building some routines that could be easily implemented in Excel (though it would take forever for big sets of data).
So here is the deal, consider a dataset with the following fields:
int [id_order] -> id of the sales order (deepest level, there are only unique entries of id_order)
int [id_client] -> as I want to know who bought it
date [purchase_date] -> when the customer bought the product
What I want to know is, for each order, when was the last time (if ever) the client has bought something. In order words, what is the highest purchase_date for that user that is smaller than current purchase_date.
In excel, approach is simple (but again, not efficient)
{=max(if(id_client=B1,if(purchase_order
Is there a way to do this kind of calculation in Tableau?
You can do this in Tableau using table calculations. They take a little time to understand how to use well, but are very powerful and flexible. I posted a sample Tableau workbook for a similar question in an answer for SO question Find first time a condition is met
Your situation is similar, but with the extra complication that you want to repeat the analysis for each client id, so you might want to try a recursive approach using the Previous_Value() function instead of the approach used in that example - though I'm not certain that previous_value() will fit your situation.
Still, it might be helpful to download the example workbook I mentioned to get an idea how table calculations can address similar problems.
Just to register the solution, in case someone has the same doubt.
So, basically the solution I found use table calculation, which is not calculated until it's called on a sheet (and is only calculated on the context of the sheet). That's a little bit limiting, so what I do is create a sheet with all the fields I need (+ what is necessary for the table calculation) then export the data (to mdb) and connect to this new file.
So, for my example, the right table calculation is (let's name it last_order_date):
LOOKUP(MAX([purchase_date]),-1)
Explanations. The MAX() is necessary because Lookup (and all table calculations) does not work with data directly, only with aggregations. You can use sum, avg, max, attr, whatever suits you. As in my case there will be only 1 correspondence, any function will do just fine and return the same value.
The -1 indicates that I'm looking for the element immediately before the current entry (of the table, as you define it). If it were FIRST(), it would go for the first entry of the table, and LAST() would go for the last.
Now, I have to put it on a sheet. So I'll bring the fields id_client, id_order, purchase_date and last_order_date.
Then I have to define the parameters of my table calculation last_order_date (Edit Table Calculation). I'll go to Compute using and choose advanced. Now I'll do Partitioning: id_client, and addressing all the rest. What will that do? This mean Tableau will create temporary tables for each id_client, and table calculations will use those tables as parameter.
Additionally, I will Sort by field purchase_date, Max (again the aggregation issue) and ascending, to guarantee my entries are in chronological order.
Now, what will it do? For each entry it will access the table of the id_client, and check what was the purchase_date that is immediately before the current entry (that is being assessed), exactly what I need.
To avoid spending Tableau processing in Visualization, I often put all the fields in details (and leave nothing on screen), use Bar chart (it's good because it allows me to see the data). Then I export it to mdb, then connect to it again. Unfortunately Tableau doesn't directly export to tde.

Storing Weighted Graph Time Series in Cassandra

I am new to Cassandra, and I want to brainstorm storing time series of weighted graphs in Cassandra, where edge weight is incremented upon each time but also updated as a function of time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident on the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite keys described below, which is type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: Which queries and query-patterns does your application have?
Furthermore I would be interested how often a new weight for each edge will be calculated and stored. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory? So you could avoid the reading before writing? Possibly some sort of lazy-loading mechanism of this value would be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
I would avoid reading before writing in Cassandra as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small amount of queries. What you're suggesting doesn't really lend itself to be a good fit for Cassandra, as there doesn't appear to be any way to avoid reading before you write. Even if you use a single table you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there is better tools for the job. Having said that, this would be perfectly feasible if you could keep all data in table 2 in memory, and potentially utilise the row cache. As long as table 2 isn't so large that it can fit the majority of rows in memory, your reads will be significantly faster which may make up for the need to perform a read every write. This would be quite a challenge however and you would need to ensure only the "last update time" for each row is kept in memory, and disk is rarely needed to be touched.
Anyway, another design you may want to look at is an implementation where you not only use Cassandra but also a cache in front of Cassandra to store the last updated times. This could be run alongside Cassandra or on a separate node but could be an in memory store of the last update times only, and when you need to update a row you query the cache, and write your full row to Cassandra (you could even write the last update time if you wished). You could use something like Redis to perform this function, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory and so on and so forth.

Resources