I need to create a sequence. What the code currently does is select max(name) from table where item1 = '' and item2 = '' and item3 = '', and after fetching the max it inserts new elements starting from that max. But this leads to a concurrency issue.
Note: the query will always be performed in the same partition.
Currently I have two approaches:
The first is to create a sequence table with ID and sequence as columns, where the sequence column holds the last sequence number for that ID. This would be updated using optimistic concurrency.
The second is to use a stored procedure.
But I am still looking for a better approach, if one exists.
There is no better approach short of complicating things dramatically with semaphores etc., but I think your approach with optimistic concurrency control will work fine.
Stored procedures will not help you in this instance at all.
ChangeFeed can be one way to update things after the fact, but I don't see how it helps you unless you keep a separate document tracking the current number, and that seems like overkill for what you want.
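If you want a concrete starting point, here is a minimal sketch of the sequence-document approach with optimistic concurrency, using the Python azure-cosmos SDK. The container, field names, and retry loop are illustrative assumptions, not your actual schema:

from azure.core import MatchConditions
from azure.cosmos import CosmosClient, exceptions

client = CosmosClient(url="https://myaccount.documents.azure.com", credential="<key>")
container = client.get_database_client("mydb").get_container_client("sequences")

def next_sequence(seq_id, partition_key):
    # Read the sequence doc, bump it, and write back only if nobody else
    # modified it in the meantime (ETag check); otherwise retry.
    while True:
        doc = container.read_item(item=seq_id, partition_key=partition_key)
        doc["value"] += 1
        try:
            container.replace_item(
                item=doc["id"],
                body=doc,
                etag=doc["_etag"],  # the ETag seen at read time
                match_condition=MatchConditions.IfNotModified,
            )
            return doc["value"]
        except exceptions.CosmosAccessConditionFailedError:
            continue  # lost the race; re-read and try again

The ETag precondition makes the replace fail when another writer got there first, which is exactly the optimistic concurrency you described.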
We are new to DynamoDB and struggling with what seems like it should be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized, e.g. the moment of the last update record for TSLA may be different than for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash_key Symbol and the range_key Moment, and we believe that would make the first query easy and efficient.
We also assume could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
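For reference, both per-symbol reads are simple in boto3 (a sketch; the table name stocks is an assumption, the key names come from above):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("stocks")

# Query 1: all historical values for one symbol, oldest first.
history = table.query(KeyConditionExpression=Key("Symbol").eq("TSLA"))["Items"]

# Latest value for a single symbol: read the partition backwards, take one item.
latest = table.query(
    KeyConditionExpression=Key("Symbol").eq("TSLA"),
    ScanIndexForward=False,  # descending by Moment
    Limit=1,
)["Items"]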
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table, or in the same table with extra info like a column IsLatestForMachineKey (effectively a bool). With every insert, you'd grab the row where IsLatestForMachineKey = 1, compare the Moments, and if the inserted row is newer, set its flag to 1 and the older row's to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems api.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out of order data.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.
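A sketch of that dual write with boto3's low-level client, assuming numeric moments; everything except the table and attribute names from the answer is illustrative:

import boto3

client = boto3.client("dynamodb")

def record_value(symbol, moment, value):
    item = {
        "symbol": {"S": symbol},
        "moment": {"N": str(moment)},
        "value": {"N": str(value)},
    }
    client.transact_write_items(
        TransactItems=[
            # Historical table: unconditional append keyed by (symbol, moment).
            {"Put": {"TableName": "stocks-historical", "Item": item}},
            # Current table: only overwrite if this datum is newer.
            {"Put": {
                "TableName": "stocks-current",
                "Item": item,
                "ConditionExpression": "attribute_not_exists(symbol) OR moment < :m",
                "ExpressionAttributeValues": {":m": {"N": str(moment)}},
            }},
        ]
    )

The transaction writes both rows or neither, and the condition rejects out-of-order data for stocks-current only.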
I already have an index set up with the second sort key set to what I want (an integer timestamp). The API keeps complaining that I'm not giving it a KeyConditionExpression. Then if I give it one, it says id must be specified. I've tried forcing it to just give me everything using id <> null and it STILL won't do it. Is this even possible? Maybe it's time to get rid of Dynamo if it can't do this utterly simple task.
For the love of god, all I'm trying to do is query the entire table AND have it use my sort key. I would have had this going in SQL hours ago...
First of all, DynamoDB is a NoSQL database, so it is intentionally NOT SQL. Perhaps you shouldn't expect to be able to perform the SQL-like queries you are used to, nor be frustrated by the fact that these are two completely different types of databases, each with its own strengths and weaknesses.
Records in DynamoDB are partitioned using the hash key, and may optionally be sorted within each partition.
The hash key should be picked so that items are as evenly distributed over partitions as possible. The use of partitions is what makes DynamoDB extremely scalable and fast. But if what you need is to scan over all your items and get them in sorted order, then you probably either are using the wrong tool for the job, or you need to sort the items on the client side.
The scan operation will simply go through all partitions, returning all items from each partition. At this point, the items can only be sorted within their respective partition.
As an example, consider a set of data being partitioned into 3 partitions:
Partition A    Partition B    Partition C
Sort key       Sort key       Sort key
A              D              C
C              E              K
P              G              L
As you can see, you can easily query each partition and get the items in it in sorted order. But if you scan, you will probably get items sorted as
[A, C, P, D, E, G, C, K, L], if the sort order is at all deterministic. At this point you would have to sort the items yourself.
A "trick" that is sometimes seen is to use a "dummy" hash key with an equal value for all items, like you mentioned in your own answer. This way you can query for "dummy = 1" and get the items sorted according to the sort key. However, this completely defeats the purpose of the hash key as all items will be put in the same partition, thus not making the table scale at all. But if you find yourself using DynamoDB even though you have a really small dataset, by all means it would work. But again, with a small data set and use-cases like this, you should probably be using another tool such as RDS in the first place.
Just to elaborate on #JHH though. In general I'd say he is correct that you shouldn't need to sort all elements in DynamoDB. I also have a requirement similar to this, as I need to get the top N number of elements, which could all be in different partitions.
DynamoDB does have a way of doing this; it just isn't out of the box. I don't think it's quite right to say you then need an SQL database, as by that argument you'd never use a NoSQL database at all, because you will always hit one of these limitations. Also, if you only ever use NoSQL for large data sets, then you will always have to rework your application later.
What to do, then? Well, you have a few options, and it depends on your use case. Let's assume that you at least have sorting within your partitions; this makes it easier. We'll also assume you are looking for the max.
The simplest way would be to get the first value from every partition and find the max among them. If you needed, say, the top 10 values, you could still use this strategy, but it would get too complicated.
The next option is to make use of DynamoDB Streams. Say we want to keep a list of the top 100 elements. These would sit ready and waiting in their own top-values partition, sorted and ready for instant retrieval. You would need to maintain this list yourself: whenever an item is inserted or updated, check whether it is greater than the 100th element; if so, insert the element into the top-values partition and delete the last value. I think this would be the most likely way to approach the problem; see the sketch below.
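Assuming a Lambda function consuming the stream, a dedicated TOP_VALUES partition whose sort key is the value itself, a stream configured to include new images, and a partition already holding its 100 entries, it might look like this (all names are illustrative, and concurrency between stream records is ignored for brevity):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("mytable")
TOP_PK = "TOP_VALUES"  # hypothetical partition holding the current top 100

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        # Requires the stream to be configured with NEW_IMAGE.
        new_value = int(record["dynamodb"]["NewImage"]["value"]["N"])

        # The smallest of the top 100 is the first item in ascending order.
        smallest = table.query(
            KeyConditionExpression=Key("pk").eq(TOP_PK),
            ScanIndexForward=True,
            Limit=1,
        )["Items"]

        if smallest and new_value > int(smallest[0]["value"]):
            # Promote the new value and evict the old 100th entry.
            table.put_item(Item={"pk": TOP_PK, "value": new_value})
            table.delete_item(Key={"pk": TOP_PK, "value": int(smallest[0]["value"])})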
So in NoSQL, if there is some query you would love to do that is oh-so-easy in SQL, and you can't serve it from your table/GSI/LSI, then you pretty much need to compute the result manually and have it ready for consumption.
Now, if you weren't going to use these top values very often, you might go with the first method instead and scan every partition's top values until you had the list you wanted; but depending on how much the values are scattered across partitions, this could cost many capacity units.
Hope that helps.
Turns out, you can also add an IndexName to a Scan. That helps. Furthermore, if you create an index with a sort key, all the hash key values must be identical for the sort to occur.
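For example, with boto3 (a sketch; the index and table names are made up, and as discussed above the result is only globally sorted when the index's hash key holds a single value):

import boto3

table = boto3.resource("dynamodb").Table("mytable")

# Scan accepts an IndexName, so you can read through a GSI or LSI.
resp = table.scan(IndexName="timestamp-index")
items = resp["Items"]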
I've been playing around with Amazon DynamoDB and looking through their examples, but I think I'm still slightly confused. I've created the example data on a local DynamoDB instance to get used to querying data etc. The sample data sets up 3 tables: 'Forum' -> 'Thread' -> 'Reply'.
Now if I'm in a specific forum, the thread table has a ForumName key I can query against to return relevant threads, but would the very top level (displaying the forums) always have to be a scan operation?
From what I can gather, the only way to "select *" in DynamoDB is to use a Scan, and I assume in this instance (where Forum is very high level and might have a relatively small number of rows) that it wouldn't be that expensive. Or are you actually better off creating a hash and range key and using that to query this table? I'm not sure what the range key would be in that case; maybe just a number, with the query specifying that the value has to be > 0? Or perhaps the date it was created, with the query always using a constant date in the past?
I did try a sample query on the 'Forum' table example data using a ComparisonOperator of 'GE' (greater than or equal) with an attribute value list of 'S' => 'a', but this states that any conditions on the hash key must be of type EQ, which implies I couldn't do the above, as I would always need to know my 'Name' values up front.
Maybe I'm still struggling having come from an RDBMS background, especially since there are so many forum examples out there.
thanks
I think using Scan to get all the forums is fine. It is very efficient because it will not return anything you don't need (all of the work the Scan does is necessary). Also, since the Scan operation is so simple, it is easier to implement and more likely to be efficient.
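For completeness, a sketch of that Scan with boto3, paginating in case the forums ever outgrow one response; the table name follows the AWS sample data:

import boto3

table = boto3.resource("dynamodb").Table("Forum")

forums, kwargs = [], {}
while True:
    resp = table.scan(**kwargs)
    forums.extend(resp["Items"])
    if "LastEvaluatedKey" not in resp:
        break  # no more pages
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]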
I am new to Cassandra, and I want to brainstorm storing time series of weighted graphs in Cassandra, where each edge's weight is incremented at each time step but also updated as a function of time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
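In code, that update rule is just (a trivial Python sketch):

import math

def updated_weight(w_old, dt, tau):
    # w(t+1) = w(t) * exp(-dt / tau) + 1
    return w_old * math.exp(-dt / tau) + 1.0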
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident on the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite keys described below, which is type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
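One tick might look roughly like this (a sketch with the Python cassandra-driver; table1/table2 correspond to TABLE 1 and TABLE 2 above, the keyspace and column names are assumptions, and the time column is simplified to a double):

import math, time
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("graphs")

def tick(eid, tau):
    # Read-before-write: fetch the last (time, weight) for this edge.
    row = session.execute(
        "SELECT time, weight FROM table2 WHERE eid = %s", (eid,)
    ).one()
    now = time.time()
    weight = row.weight * math.exp(-(now - row.time) / tau) + 1.0

    # Append to the time series and overwrite the "last" values.
    session.execute(
        "INSERT INTO table1 (eid, time, weight) VALUES (%s, %s, %s)",
        (eid, now, weight),
    )
    session.execute(
        "UPDATE table2 SET time = %s, weight = %s WHERE eid = %s",
        (now, weight, eid),
    )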
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: Which queries and query-patterns does your application have?
Furthermore, I would be interested in how often a new weight for each edge will be calculated and stored. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory, so you could avoid the read before writing? Possibly some sort of lazy-loading mechanism for this value would be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
I would avoid reading before writing in Cassandra, as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small number of queries. What you're suggesting doesn't lend itself to being a good fit for Cassandra, as there doesn't appear to be any way to avoid reading before you write. Even if you use a single table, you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there are better tools for the job.
Having said that, this would be perfectly feasible if you could keep all of table 2 in memory, potentially utilising the row cache. As long as table 2 isn't so large that it can't fit the majority of rows in memory, your reads will be significantly faster, which may make up for the need to perform a read on every write. This would be quite a challenge, however, and you would need to ensure only the "last update time" for each row is kept in memory, and that disk rarely needs to be touched.
Anyway, another design you may want to look at is one where you use not only Cassandra but also a cache in front of it to store the last update times. This could run alongside Cassandra or on a separate node, but it would be an in-memory store of the last update times only; when you need to update a row, you query the cache and write your full row to Cassandra (you could even write the last update time to Cassandra too if you wished). You could use something like Redis for this, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory, and so on. A sketch follows.
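Every name here is illustrative; redis-py fronts the Cassandra driver, with the cache replacing the read-before-write:

import math, time
import redis
from cassandra.cluster import Cluster

cache = redis.Redis()  # in-memory store of last update times/weights
session = Cluster(["127.0.0.1"]).connect("graphs")

def tick(eid, tau):
    now = time.time()
    cached = cache.hgetall(eid)  # avoids a Cassandra read entirely
    if cached:
        dt = now - float(cached[b"time"])
        weight = float(cached[b"weight"]) * math.exp(-dt / tau) + 1.0
    else:
        weight = 1.0  # first time we see this edge

    # Full row to Cassandra, hot fields to the cache.
    session.execute(
        "INSERT INTO table1 (eid, time, weight) VALUES (%s, %s, %s)",
        (eid, now, weight),
    )
    cache.hset(eid, mapping={"time": now, "weight": weight})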
We have a web service method which accepts some data and puts it in Lucene index. We use it to index new and updated entries from our asp.net web app.
These entries are stored in a large SQL Server table (20M rows and growing), and I need a way to reindex the whole table in case the current index gets deleted or corrupted. I'm not sure what the optimal way to retrieve chunks of data from a large table is. Currently, we use the fact that the table has an autoincrement PK, so we fetch chunks of 1000 rows until we stop getting anything back. Kind of like (in pseudo language):
i = 0
empty_in_a_row = 0
while (empty_in_a_row < 20)
{
    result = SELECT col1, col2, col3 FROM mytable WHERE pk >= i AND pk < i + 1000
    // note: pk < i + 1000 rather than BETWEEN, so chunk boundaries don't overlap
    if (result is empty) { empty_in_a_row = empty_in_a_row + 1 }
    else { empty_in_a_row = 0; send result to web service to reindex }
    i = i + 1000
}
This way, we don't need a SELECT COUNT(*), which would be a big performance killer; we just move up the pk values until we stop getting any results. This has its con: if there is a hole of more than 20,000 values somewhere in the table, indexing will stop there, assuming it reached the end. That's a tradeoff we have to live with for now.
Can anyone suggest a more efficient way of getting data from a table to index? I would assume we are not the first ones facing this problem - search engines are widely used nowadays :)
For what we do with Lucene, we rarely need to reindex everything. I can't remember coming across a case where the whole index was corrupted (Lucene is actually quite safe/good at this), but there have been many times when individual items needed to be reindexed for one reason or another. I'd say the most frequent reindexing patterns are:
reindex items by given id (or set of ids)
reindex items by given period of time
The latter, of course, requires a separate db index on the relevant date field(s), which may be a bit costly for 20M+ records, but we decided to go for it (our biggest deployment had up to 10M records) since disk space is cheap these days anyway.
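For illustration, the period-based fetch could look like this; a pyodbc sketch where the updated_at column and index name are assumptions, not our actual schema:

import pyodbc

conn = pyodbc.connect("DSN=mydb")  # connection string is illustrative

# One-time: CREATE INDEX ix_mytable_updated_at ON mytable (updated_at);

def rows_updated_between(start, end):
    cur = conn.cursor()
    cur.execute(
        "SELECT col1, col2, col3 FROM mytable "
        "WHERE updated_at >= ? AND updated_at < ?",
        start, end,
    )
    return cur.fetchall()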
EDIT: added a few explanations as per the question author's comment.
If the source data structure changes, requiring reindexing of all records, our approach is to roll out new code which ensures all new data is correct (basically, it forms the correct Lucene Document from that moment on). Then we can reindex older things in batches (either manually or automatically) by providing the relevant period ranges. To a certain extent, this also applies to Lucene version changes.
Why is a COUNT(*) a performance killer? What about MAX(id)? I'm thinking that an index would provide the information needed for those queries. You do have an index on your primary key, right?
I actually just figured it out: I can use IDENT_CURRENT(table_name) to get the last generated id, and use that instead of MAX() or COUNT(). This method should blow the other two away :)
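Concretely, the loop then gets a hard upper bound instead of the "20 empty batches in a row" heuristic; a sketch with pyodbc, reusing the names from the pseudocode above (send_to_web_service is a hypothetical stand-in for the indexing call):

import pyodbc

conn = pyodbc.connect("DSN=mydb")
cur = conn.cursor()

# Last identity value ever generated for the table (SQL Server).
last_id = cur.execute("SELECT IDENT_CURRENT('mytable')").fetchval()

i = 0
while i <= last_id:
    rows = cur.execute(
        "SELECT col1, col2, col3 FROM mytable WHERE pk >= ? AND pk < ?",
        i, i + 1000,
    ).fetchall()
    if rows:
        send_to_web_service(rows)  # hypothetical reindexing call
    i += 1000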