Need to model a Date Range in DynamoDB

This seems like such an elementary part of databases; I can't believe Dynamo doesn't do it.
Supposing I have a Case. I have 2 dates: when the Case became active, and when it became inactive. I want to write a query that would return the count of active cases for a given Date.
In SQL (and MySQL even has special date indices), I could write an expression like 'where :date between active and inactive.' I can't do this in Dynamo for a couple of reasons:
there is no date type
range conditions only apply to the sort key, and the partition key is a hash, hence no between across two separate date attributes
So far the only things I have come up with were:
Sharding - I should probably shard this table. I did some reading on that, and the way Dynamo does sharding seems simple, although it kinda sucks that you end up with 2 tables
if I do this, then I can just ask for the active count each day and store it
which means that if I wanted the count for a day in the past, I would have to table scan, and worse, scan 2 tables (as I understand it)
Date Partitions - the problem here is which date to partition on; I guess activation. The presumption is that a count for a given Date would have a key expression of active <= :date and a filter expression of inactive is null (see the sketch below)
Distinct Events - if I am recording Events on each case, the count of active cases on a given date is also the distinct set of CaseIDs in the Events table for that date, but that looks like it's not easy to do
Still reading, so I would not be surprised if I am missing something obvious. Actually, one other possible way to do this is to move the event data to Timestream and have it compute this aggregate.
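For what it's worth, the date-partition idea might look something like this with boto3 (table, key, and attribute names are all assumptions here, and pagination is ignored for brevity):

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("cases")  # hypothetical table

def count_active(month: str, date: str) -> int:
    # partition key = activation month, sort key = activation date;
    # the filter drops cases that went inactive on or before the target date
    resp = table.query(
        KeyConditionExpression=Key("month").eq(month) & Key("active").lte(date),
        FilterExpression=Attr("inactive").not_exists() | Attr("inactive").gt(date),
        Select="COUNT",
    )
    return resp["Count"]

Note that cases activated in earlier months would need the same query repeated per partition, with the counts summed.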

Related

DynamoDB architecture for getting recent articles

I'm working on an intranet app, where one of the requirements is to display recent news from inside the company. This thing doesn't need to be "webscale"--we have less than 20,000 users, and of course they won't all be logging in at the same time. We're going with DynamoDB not so much because we need its scalability, but more because it's fairly cheap and easy to use from Lambda (where our app code will run).
With all that being said, I need a way to display the most recent, say, 5 news articles. I can think of two ways to do this:
Use timestamp as the partition key. Get most recent articles by doing a Scan then sorting on timestamp.
Use the same partition key for all posts, and make timestamp the sort key. Use the sort key to get recent articles.
I know neither of these is ideal or playing to DynamoDB's strengths, but is one of them "less bad"? I'm also open to other solutions, of course.
Option 2 (or a variant) looks better. Query on the timestamp sort key with ScanIndexForward = false to get descending order. The timestamp value can be either a numeric timestamp or an ISO-format string (YYYY-MM-DD HH:mm:ss) - either sorts correctly.
For the partition key, you can either use a dummy value for all records (but make it short to save space), or a subset of the timestamp (year or year-month).
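A minimal sketch of option 2 in boto3 (the table and key names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("news")  # hypothetical table

resp = table.query(
    KeyConditionExpression=Key("pk").eq("article"),  # shared dummy partition key
    ScanIndexForward=False,  # newest first
    Limit=5,
)
recent_articles = resp["Items"]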

DynamoDB top item per partition

We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized, e.g. the moment of the last update record for TSLA may be different than for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash_key Symbol and the range_key Moment, and we believe we could achieve the first query easily/efficiently.
We also assume we could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
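That linked single-symbol read would look roughly like this in boto3 (table name assumed): query in descending order and take the first item.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("stocks")  # hypothetical table

resp = table.query(
    KeyConditionExpression=Key("Symbol").eq("TSLA"),
    ScanIndexForward=False,  # newest Moment first
    Limit=1,
)
latest = resp["Items"][0] if resp["Items"] else None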
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table, or in the same table with extra info like a column IsLatestForMachineKey (effectively a bool). With every insert, you'd grab the row where IsLatestForMachineKey=1, compare the Moments, and if the insertion is newer, set the new one to 1 and the old one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
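For reference, the streams idea above could be sketched as a Lambda handler that maintains a separate "latest" table (the table name, attributes, and wiring are all hypothetical; Moments are assumed to be sortable ISO strings):

import boto3

latest = boto3.resource("dynamodb").Table("stocks-latest")  # hypothetical table

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        img = record["dynamodb"]["NewImage"]  # stream images use typed values
        symbol, moment = img["Symbol"]["S"], img["Moment"]["S"]
        value = img["Value"]["N"]
        try:
            # overwrite only if this event is newer than the cached item
            latest.put_item(
                Item={"Symbol": symbol, "Moment": moment, "Value": value},
                ConditionExpression="attribute_not_exists(Moment) OR Moment < :m",
                ExpressionAttributeValues={":m": moment},
            )
        except latest.meta.client.exceptions.ConditionalCheckFailedException:
            pass  # stale event; the cached item is already newer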
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems api.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out of order data.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.
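A sketch of that dual write with the low-level client (attribute names are assumptions):

import boto3

client = boto3.client("dynamodb")

def record(symbol: str, moment: str, value: str) -> None:
    client.transact_write_items(
        TransactItems=[
            {"Put": {
                "TableName": "stocks-historical",
                "Item": {"symbol": {"S": symbol}, "moment": {"S": moment},
                         "value": {"N": value}},
            }},
            {"Put": {
                "TableName": "stocks-current",
                "Item": {"symbol": {"S": symbol}, "moment": {"S": moment},
                         "value": {"N": value}},
                # guard against out-of-order arrivals
                "ConditionExpression": "attribute_not_exists(moment) OR moment < :m",
                "ExpressionAttributeValues": {":m": {"S": moment}},
            }},
        ]
    )

One caveat: inside a transaction a failed condition aborts both puts, so if out-of-order data should still reach stocks-historical, issue the two writes separately instead.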

Storing Weighted Graph Time Series in Cassandra

I am new to Cassandra, and I want to brainstorm storing time series of weighted graphs in Cassandra, where each edge weight is incremented at every time step but also decays as a function of time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident on the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite key described below, which is of type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
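As a concrete sketch of the two-table layout (keyspace and column names are assumptions; the read-before-write concern raised in the answers below still applies):

import math
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("graphs")  # hypothetical keyspace
session.execute("""CREATE TABLE IF NOT EXISTS edge_updates (
    eid text, time timestamp, weight double, PRIMARY KEY (eid, time))""")
session.execute("""CREATE TABLE IF NOT EXISTS edge_latest (
    eid text PRIMARY KEY, time timestamp, weight double)""")

def tick(eid: str, tau: float) -> None:
    now = datetime.utcnow()  # the driver returns naive UTC datetimes
    row = session.execute(
        "SELECT time, weight FROM edge_latest WHERE eid = %s", (eid,)).one()
    if row:
        dt = (now - row.time).total_seconds()
        weight = row.weight * math.exp(-dt / tau) + 1.0
    else:
        weight = 1.0
    session.execute("INSERT INTO edge_updates (eid, time, weight) VALUES (%s, %s, %s)",
                    (eid, now, weight))
    session.execute("INSERT INTO edge_latest (eid, time, weight) VALUES (%s, %s, %s)",
                    (eid, now, weight))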
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: Which queries and query-patterns does your application have?
Furthermore, I would be interested in how often a new weight will be calculated and stored for each edge. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory, so you could avoid the read before the write? Possibly some sort of lazy-loading mechanism for this value would be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
I would avoid reading before writing in Cassandra, as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small number of queries. What you're suggesting doesn't really lend itself to being a good fit for Cassandra, as there doesn't appear to be any way to avoid reading before you write. Even if you use a single table, you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there are better tools for the job.
Having said that, this would be perfectly feasible if you could keep all the data in table 2 in memory, potentially utilising the row cache. As long as table 2 is small enough that the majority of its rows fit in memory, your reads will be significantly faster, which may make up for the need to perform a read on every write. This would be quite a challenge, however; you would need to ensure only the "last update time" for each row is kept in memory, and that disk rarely needs to be touched.
Anyway, another design you may want to look at is one where you not only use Cassandra but also put a cache in front of it to store the last update times. This could run alongside Cassandra or on a separate node: an in-memory store of the last update times only. When you need to update a row, you query the cache and write your full row to Cassandra (you could even write the last update time there too if you wished). You could use something like Redis for this, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory, and so on.
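A rough sketch of that design, with Redis holding only the last time/weight per edge (the key layout and names are assumptions):

import math, time
import redis
from cassandra.cluster import Cluster

r = redis.Redis()
session = Cluster(["127.0.0.1"]).connect("graphs")  # hypothetical keyspace

def tick(eid: str, tau: float) -> None:
    now = time.time()
    cached = r.hgetall(f"edge:{eid}")  # empty dict if this edge is new
    if cached:
        dt = now - float(cached[b"t"])
        weight = float(cached[b"w"]) * math.exp(-dt / tau) + 1.0
    else:
        weight = 1.0
    r.hset(f"edge:{eid}", mapping={"t": now, "w": weight})
    # the full history still goes to Cassandra; CQL timestamps accept epoch millis
    session.execute(
        "INSERT INTO edge_updates (eid, time, weight) VALUES (%s, %s, %s)",
        (eid, int(now * 1000), weight),
    )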

How to work with data stored in database?

Good day everyone,
I have some questions about how to do calculations on data stored in the database. For example, I have a table:
| ID | Item name | quantity of items | item price | date |
and for example I have stored 10000 records.
The first thing I need to do is pick the items from a date interval, so I won't need the whole database for my calculations. Then, for the items from that date interval, I have to add some calculated columns, for example to calculate:
full price = quantity of items * item price
and store them in a new table, with a row for each item. So the table for the items picked from the date interval should look like this:
| ID | Item name | quantity of items | item price | date | full price |
The point is that I don't know how to store the items that I picked from the date interval. Do I have to create some temporary table, or something?
This will be used in an ASP.NET web application, and for the calculations in the database I think I will use SQL queries. Maybe there is an easier way to do it? Thank you for your time and help.
Like other people have said, you can perform these queries on the fly rather than store them.
However, to answer your question, a query like this should do the trick.
I haven't tested this so the syntax might be off a touch, though it will get you on the right track.
Ultimately you have to do an insert with a select
insert into itemFullPrice
select id, itemname, itemqty, itemprice, [date], itemqty * itemprice as fullprice
from items
where [date] between '2012/10/01' and '2012/11/01'
again, don't shoot me if I have the syntax a little off.. it's a busy day today :D
With 10000 records, it wouldn't be a good idea to use temporary tables.
You'd be better off with another table, called ProductsPriceHistory, where you periodically calculate and store, let's say, monthly reports.
This way, your reports would be faster and you wouldn't have to make the calculations every time you want a report.
Be aware this approach is OK if your date intervals are fixed, I mean monthly, quarterly, yearly, etc.
If your date intervals are dynamic, e.g. from 10/20/2011 to 10/25/2011, or from 05/20/2011 to 08/13/2011, this approach wouldn't work.
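If the intervals are fixed, the periodic fill could be as simple as an INSERT...SELECT, e.g. issued from Python with pyodbc (the history table and all names here are assumptions):

import pyodbc

conn = pyodbc.connect("DSN=mydb")  # hypothetical connection string
conn.cursor().execute("""
    INSERT INTO ProductsPriceHistory (itemname, totalqty, fullprice, period)
    SELECT [Item name], SUM([quantity of items]),
           SUM([quantity of items] * [item price]), '2012-10'
    FROM items
    WHERE [date] >= '2012-10-01' AND [date] < '2012-11-01'
    GROUP BY [Item name]
""")
conn.commit()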
Another approach is to make calculations on ASP.Net.
Even with 10000 records, your best bet is to calculate something like this on the fly. This is what structured databases were designed to do.
For instance:
SELECT [quantity of items] * [item price] AS [full price]
, [MyTable].*
FROM [MyTable]
More complex calculations that involve JOINs to 3 or more tables and thousands of records might lend themselves to storing values.
There are a few approaches:
use a SQL query to calculate it on the fly - this way nothing extra is stored in the database
use the same or another table to store the results of the calculation
use a calculated field
If you have a low database load (a few queries per minute, a few thousand rows per fetch), use the first approach.
If calculating on the fly performs poorly (millions of records, x fetches per second...), try the second or third approach.
The third one is OK if your DB supports calculated and persisted fields, as MS SQL Server does (see the sketch below).
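For illustration, a persisted computed column on SQL Server could be added like this (issued from Python via pyodbc; the table and column names are assumptions):

import pyodbc

conn = pyodbc.connect("DSN=mydb")  # hypothetical connection string
conn.cursor().execute(
    "ALTER TABLE items ADD fullprice AS (itemqty * itemprice) PERSISTED"
)
conn.commit()

After that, fullprice is stored with the row and kept up to date by the engine, so selects pay no computation cost.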
EDIT:
Typically, as others said, you will perform calculation in your query. That is, as long as your project is simple enough.
First, when the table where you store all the items and their prices starts getting hit with inserts/updates/deletes from multiple clients, you don't want to block or be blocked by others. You have to understand that, e.g., an update on table X can block your select from table X until it is finished (look up page/row locks). This means that you are going to want a parallel denormalized structure (a table with the product and the calculated values stored alongside it). This is where e.g. reporting comes into play.
Second, when the calculation is simple enough (a*b) and done over not-so-many records, it's OK to do it on the fly. When you have e.g. 10M records and you have to correlate each row with several other rows and do some aggregation over groups, there is a chance that a calculated/persisted field will save you time - you can get results 10-100 times faster using this approach.
You should separate concerns in your application:
aspx pages for presentation
sql server for data persistency
some kind of intermediate "business" layer for extra logic like fullprice = p * q
E.g. if you are using Linq-2-sql for data retrieval, it is very simple to add the fullprice to your entities. The same goes for Entity Framework. Also, if you want, you can already do the computation of p*q in the SQL select. Only if performance really becomes an issue should you start thinking about temporary tables, views with clustered indexes, etc.

solr search for a time within a time range

I'm aware that Solr provides a date field which can store a time instance and then range queries can be performed to match all documents which have that field within a particular range.
My problem is the inverse of this. I need to associate multiple time ranges with documents and then search for all documents which have the searched time within one of those ranges.
For example, I'm indexing outlets and have 3-4 ranges during which each outlet is open. I need to search for all outlets which are open at a particular time instance.
One way of doing this is to index the start time and end time of each duration as separate date fields and compare during search, like
(time1_1 <= t AND time1_2 >= t) OR (time2_1 <= t AND time2_2 >= t) OR (time3_1 <= t AND time3_2 >= t)
Is there a better/faster/cleaner way to do this?
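For concreteness, the OR-of-ranges approach above could be issued via pysolr like this (the field names and core URL are assumptions):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/outlets")  # hypothetical core
t = "2013-03-03T09:00:00Z"
q = (f"(time1_1:[* TO {t}] AND time1_2:[{t} TO *])"
     f" OR (time2_1:[* TO {t}] AND time2_2:[{t} TO *])"
     f" OR (time3_1:[* TO {t}] AND time3_2:[{t} TO *])")
open_now = solr.search(q)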
Your example looks like the entities of your index are the outlet stores and you store their opening and closing times in separate (probably dynamic) fields.
If you ask for a different approach you have to consider to restructure the existing schema or to even create an additional one that uses another entity.
It may seem unusual at first, but if this query is the most essential one to your app, then you should consider making the entity of your new index what you actually want to query: the particular time instance. I take it a time instance is either a whole day, or maybe half or a quarter of a day.
The schema would include fields like the ID, the startdate of the day or half day or whatever you choose, the end of it, and a multivalued list of ids that point to the outlets (stored in your current index (use a multi core setup)).
Even if you choose quarter days to handle morning, afternoon and night hours separately, and even with a preview of several years, data should not explode.
This different schema setup allows you to:
do the most important computation during import, so that it is easily accessible when querying,
run a simple query that returns what you seek in one hit
You could even forgo Date fields by using a custom way to identify the ranges. I am thinking of creating the identifier from the date and a string that indicates whether it is morning or afternoon etc. This would be used as the unique ID in SOLR. If you can create such an ID from any "time instance" that is queried you'd end up with a simple ID lookup.
e.g.
What is open on 2013/03/03 in the morning?
/solr/openhours/select?q=id:2013_03_03_am
returns:
Array of outlet ids.
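Building that ID from an arbitrary "time instance" is then a one-liner, sketched here with pysolr (the core URL and the am/pm convention are assumptions):

from datetime import datetime
import pysolr

def instance_id(t: datetime) -> str:
    # e.g. 2013_03_03_am for any morning timestamp on that day
    return f"{t:%Y_%m_%d}_" + ("am" if t.hour < 12 else "pm")

solr = pysolr.Solr("http://localhost:8983/solr/openhours")  # hypothetical core
hits = solr.search(f"id:{instance_id(datetime(2013, 3, 3, 9, 30))}")
# each hit carries the multivalued list of outlet ids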
