DynamoDB index/query questions - amazon-dynamodb

I need to create a table with the following fields :
place, date, status
My keys are: partition key - place, sort key - date
Status can be either 0 or 1
Table has approximately 300k rows per day and about 3 days worth of data at any given time, so about 1 million rows. I have a service that is continuously populating data to this DDB.
I need to run the following queries (only) once per day :
#1 Return count of all places with date = current_date-1
#2 Return count and list of all places with date= current_date-1 and status = 0
Questions :
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields ?
If answer to above question is yes: for query #2, do I need to create a GSI on date and status? with date as Partition key, and status as sort key?
Creating a GSI vs using filter expression on status for query #2. Which of the two is recommended?

Running analytical queries (such as count) is the wrong use of a NoSQL database such as DynamoDB, which is designed for scalable LOOKUP use cases.
Even if you get the SCAN to work with one design or another, it will be more expensive and slower than it should be.
A better option is to export the table data from DynamoDB into S3, and then run an Athena query over that data. It will be much more flexible to run various analytical queries.

The easiest thing for you to do is a full table scan once per day, filtering by yesterday's date, and as part of that keep your own client-side count of how many rows had status 0 or 1. The filter is not index-optimized, so it will be a true full table scan.
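A minimal sketch of that daily scan, assuming `table` behaves like a boto3 DynamoDB Table resource (only its `scan` method is used) and that dates are stored as strings. The function name and the `#p`/`#d`/`#s` aliases are illustrative; `date` and `status` are DynamoDB reserved words, so they have to be aliased.

```python
from typing import Any, Dict, List, Tuple


def count_yesterday(table: Any, yesterday: str) -> Tuple[int, List[str]]:
    """Scan the whole table, keeping only rows whose date equals `yesterday`.

    Returns (row count for that date, places whose status is 0).
    """
    total = 0
    status_zero_places: List[str] = []
    kwargs: Dict[str, Any] = {
        "FilterExpression": "#d = :d",
        "ProjectionExpression": "#p, #s",
        "ExpressionAttributeNames": {"#p": "place", "#d": "date", "#s": "status"},
        "ExpressionAttributeValues": {":d": yesterday},
    }
    while True:
        page = table.scan(**kwargs)
        for item in page.get("Items", []):
            total += 1
            # boto3 returns numbers as Decimal; Decimal(0) == 0 is True.
            if item["status"] == 0:
                status_zero_places.append(item["place"])
        # Each call returns at most one page; follow the pagination key.
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key
    return total, status_zero_places
```

The first return value answers query #1 and the list answers query #2, both from a single pass over the table.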
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to do a new export every day to keep the data fresh, and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to do repeated queries against the data then the export would make more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once per day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
How much would it cost? Assume a million rows at 100 bytes each, so that's a 100 MB table. At 4 KB per read unit that's about 25,000 read units to fully scan, which is halved down to 12,500 with eventual consistency. On Demand pricing is $0.25 per million read units. 12,500 / 1,000,000 * $0.25 = $0.003. Less than a dime a month. It'd be cheaper still if you run provisioned.
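The same arithmetic as a small sketch. The row count, row size and $0.25 price are the figures quoted in this answer, not looked-up values; the function name is made up for illustration. It treats 4 KB as 4096 bytes, so it lands slightly under the rounded 25,000 read units.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # one read unit covers up to 4 KB of scanned data


def scan_cost_usd(rows: int, bytes_per_row: int,
                  price_per_million_units: float = 0.25,
                  eventually_consistent: bool = True) -> float:
    """Rough on-demand cost of one full table scan."""
    units = math.ceil(rows * bytes_per_row / READ_UNIT_BYTES)
    if eventually_consistent:
        units /= 2  # eventually consistent reads cost half
    return units / 1_000_000 * price_per_million_units


daily = scan_cost_usd(rows=1_000_000, bytes_per_row=100)
print(f"per scan:  ${daily:.4f}")
print(f"per month: ${daily * 30:.2f}")
```

With these inputs the scan comes to roughly $0.003 per run and about $0.09 over 30 days.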
Just do the scan. :)

Related

Fetching top 10 records without using scan in DynamoDB

I have a DynamoDB table containing:
productID (PK), name, description, url, createTimestamp, <constant>
I'm trying to retrieve the latest 10 products by createTimestamp (unix timestamp).
In SQL, I would probably pull out the data like:
select * from [table] order by createTimestamp desc limit 10;
Q: How can I achieve the same result using DynamoDB without using scan?
The table can be pretty large and data will be accessed often (e.g., whenever a user accesses the e-commerce website), so using scan wouldn't be optimal. I'm thinking of creating a GSI using a constant value as PK (because there isn't any other attribute we could use to narrow the results) and createTimestamp as the sort key, but this is considered an anti-pattern. Is there a better alternative?
That’s the way to go, with a GSI having a singular PK and the timestamps in the SK.
If your write rate will exceed 1,000 write units per second then you’ll want to shard the PK value across N randomly chosen values to increase throughput to N,000 writes per second.
That means you’ll need to do N many Query calls to get your unified answer but each Query will be highly efficient and index optimized.
This is a common design pattern.
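A sketch of that sharded pattern, under assumptions: `NUM_SHARDS`, the `shard-<i>` key format and the `query_shard` helper are all hypothetical. `query_shard(shard, limit)` stands in for one Query against the GSI (for example with ScanIndexForward=False and Limit=limit) and must return that shard's items newest first.

```python
import heapq
import random
from itertools import islice
from typing import Any, Callable, Dict, List

NUM_SHARDS = 4  # hypothetical N; size it to the expected write rate


def pick_shard() -> str:
    """GSI partition key value to store on a newly written item."""
    return f"shard-{random.randrange(NUM_SHARDS)}"


def latest_products(query_shard: Callable[[str, int], List[Dict[str, Any]]],
                    limit: int = 10) -> List[Dict[str, Any]]:
    """Run one query per shard and merge down to the newest `limit` items."""
    per_shard = [query_shard(f"shard-{i}", limit) for i in range(NUM_SHARDS)]
    # Each input is already sorted newest first, so a k-way merge is enough.
    merged = heapq.merge(*per_shard,
                         key=lambda item: item["createTimestamp"],
                         reverse=True)
    return list(islice(merged, limit))
```

Asking every shard for `limit` items is what makes the merge correct: the overall top 10 could, in the worst case, all live in one shard.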

Optimizing DynamoDB Read Consumption

I have a table which has a String column holding a date. A sample value is 2018-12-31T23:59:59.999Z. It is not indexed.
Which approach would be better for Read Capacity consumption if I want to fetch all records which are older than a given date?
Should I scan the whole table and apply the logic in my script OR
Should I use a DynamoDB condition while scanning the records?
What I mean to ask is: is RCU computed based on the results being sent, or is it computed at the query level? If it's computed on the results then option 2 is the optimized approach, but if it is not then it doesn't matter.
What do you guys suggest?
The RCU is based on the volume of data that was read from disk by the DynamoDB engine, not the volume of data returned to the caller. Using DynamoDB conditions you will probably get the answer faster, because far fewer bytes are sent over the network, but it will cost you the same in terms of Read Capacity Units.
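A toy model of that behaviour, not DynamoDB's actual accounting code: every item is counted as read before the filter runs, so the consumed units are the same with or without a filter. The function name and the sample data are made up; the 4 KB unit size and the halving for eventually consistent reads are the usual read-capacity rules.

```python
import math
from typing import Callable, Iterable, Tuple


def simulate_scan(items: Iterable[Tuple[int, str]],
                  keep: Callable[[str], bool]) -> Tuple[float, int]:
    """Return (read units consumed, items returned) for a simulated Scan.

    `items` yields (size in bytes, ISO-8601 date string) pairs.
    """
    bytes_read = 0
    returned = 0
    for size, date in items:
        bytes_read += size      # every item is read from storage
        if keep(date):          # the filter is applied after the read
            returned += 1
    units = math.ceil(bytes_read / 4096) / 2  # eventually consistent read
    return units, returned


rows = [(1000, f"2018-12-{day:02d}T00:00:00.000Z") for day in range(1, 31)]
cutoff = "2018-12-10T00:00:00.000Z"  # ISO-8601 strings sort chronologically
print(simulate_scan(rows, lambda d: True))
print(simulate_scan(rows, lambda d: d < cutoff))
```

Both calls report the same number of read units; only the count of returned items differs.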

Can a group count query fail due to Big Data ? Amazon Neptune Graph Databases

Can a group count query in Amazon Neptune or any graph database fail due to Big Data?
I mean, if the count exceeds the limits of the count datatype, can there be an overflow?
Short answer
Gremlin query language semantics (as defined by the TinkerPop code) define the output of the count() function as a 64-bit long. So, yes, count cannot exceed the range of a long.
Long answer
Having said that, let's try to calculate the amount of data you would need to insert into the DB to hit that threshold. Each entity(Vertex/Edge/Property) in the DB contains a unique ID associated with it. Let us hypothetically assume that the storage of each entity consists of just the identifier. Also, let us assume that the data type of the identifier is the most efficient, i.e. a long (and not a String which would use greater space than a long).
To hit the limit of count, the DB would need to store more than 2^63 entities (a 64-bit long is signed, so its maximum is 2^63 - 1), each with a unique identifier, i.e. at least ((2^63)*64) bits of data, which is 2^66 bytes or roughly 74 exabytes. That is far greater than 1000 PetaBytes, even at a very conservative estimate.
The point is, you would need to store a huge amount of data before you hit the limit of count. If you are operating with such an amount of data, a DB might not be the right storage solution for you.
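A quick check of that storage estimate, as a sketch: it assumes 8 bytes per identifier and nothing else stored, uses decimal petabytes, and evaluates both the signed bound (2^63) and the full 64-bit range (2^64). The function name is illustrative.

```python
def min_storage_petabytes(entities: int, bytes_per_id: int = 8) -> float:
    """Lower bound on storage if each entity were only a 64-bit identifier."""
    return entities * bytes_per_id / 10**15


for label, entities in (("2^63 entities", 2**63), ("2^64 entities", 2**64)):
    print(f"{label}: {min_storage_petabytes(entities):,.0f} PB")
# 2^63 entities: 73,787 PB
# 2^64 entities: 147,574 PB
```

Either way the result is tens of thousands of petabytes, far beyond the 1000 PB mark.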

Query dynamoDB by date range

I am developing an application that allows users to read books. I am using DynamoDB for storing details of the books that user reads and I plan to use the data stored in DynamoDB for calculating statistics, such as trending books, authors, etc.
My current schema looks like this:
user_id | timestamp | book_id | author_id
user_id is the partition key, and timestamp is the sort key.
The problem I am having is that, with this schema, I am only able to query the details of the books that a single user (partition key) has read. That is one of the requirements for me.
The other requirement is to query all the records that have been created in a certain date range, e.g. records created in the past 7 days. With this schema, I am unable to run this query.
I have looked into so many other options, and haven't figured out a way to create a schema that would allow me to run both queries.
Retrieve the records of the books read by a single user (Can be done).
Retrieve the records of books read by all the users in last x days (Unable to do it).
I do not want to run a scan, since it will be expensive. I looked into the option of using a GSI on timestamp, but it requires me to specify a hash key, and therefore I cannot query all the records created between two dates.
One naive solution would be to create a GSI with a constant hash key across all books and timestamp as a range key. This will allow you to perform your type of queries.
The problem with this approach is that it is likely to become a scaling bottleneck, as the same hash key means the same node. One workaround for this problem is sharding: create a set of hash keys (ex: from 1 to 10) and assign a random key from this set to every book. Then, when you query, you will need to make 10 queries and merge the results. You can even make the size of this set dynamic, so that it scales with your data.
I would also suggest looking into other tools (not DynamoDB) for this use case, as DDB is not the best tool for data analysis. You might, for example, feed DynamoDB data into CloudSearch or ElasticSearch and do your analysis there.
One solution could be to use a GSI with two more attributes: whenever you ingest a record, also store the date (e.g. 2017-07-02) as the GSI partition key and the time of day (e.g. 04:22:33:000) as the range key.
Maintain a separate checkpoint table containing the process name and the timestamp up to which the table has been read. Every time you read from the table, update the checkpoint so that the next read fetches only incremental data. If you want the last 7 days of data, set the checkpoint timestamp back 7 days and fetch everything between that point and the current time.
You can do this with a query spec, passing the date as the partition key and using the BETWEEN keyword on the timestamp as the range condition.
The number of days to query is the difference between the checkpoint date and the current date, so you fetch the data one day at a time.
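A sketch of the day-by-day querying described above. The attribute names `read_date` and `read_time`, the index name and both helper functions are assumptions for illustration; the returned dictionary is shaped like the keyword arguments of a boto3 Table `query` call, but nothing here talks to DynamoDB.

```python
from datetime import date, timedelta
from typing import Any, Dict, List


def date_partitions(today: date, days: int) -> List[str]:
    """Partition key values covering the last `days` days, oldest first."""
    return [(today - timedelta(days=offset)).isoformat()
            for offset in range(days - 1, -1, -1)]


def query_params(index_name: str, day: str,
                 start: str = "00:00:00:000",
                 end: str = "23:59:59:999") -> Dict[str, Any]:
    """Arguments for one Query against the date-keyed GSI."""
    return {
        "IndexName": index_name,
        "KeyConditionExpression": "#day = :day AND #t BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#day": "read_date", "#t": "read_time"},
        "ExpressionAttributeValues": {":day": day, ":start": start, ":end": end},
    }


# One query per day; the caller merges the results.
for day in date_partitions(date(2017, 7, 2), 7):
    params = query_params("by-date-index", day)
    print(params["ExpressionAttributeValues"][":day"])
```

The first and last day can be narrowed with `start` and `end` when the checkpoint falls in the middle of a day.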

Storing Weighted Graph Time Series in Cassandra

I am new to Cassandra, and I want to brainstorm storing time series of weighted graphs in Cassandra, where edge weight is incremented on each tick but also updated as a function of time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident on the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite keys described below, which is type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
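For reference, the per-tick computation amounts to the following, as a minimal sketch of the formula above. The function and parameter names are illustrative, and the time unit is whatever `tau` is expressed in.

```python
import math


def next_weight(last_weight: float, last_time: float,
                now: float, tau: float) -> float:
    """w(t+1) = w(t) * exp(-dt / tau) + 1, with dt = now - last_time."""
    dt = now - last_time
    return last_weight * math.exp(-dt / tau) + 1.0


# Values read from TABLE 2, then the new row to insert into TABLE 1:
weight = next_weight(last_weight=1.0, last_time=0.0, now=10.0, tau=10.0)
print(round(weight, 5))  # 1.36788
```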
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: Which queries and query-patterns does your application have?
Furthermore, I would be interested in how often a new weight for each edge will be calculated and stored. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory? So you could avoid the reading before writing? Possibly some sort of lazy-loading mechanism of this value would be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
I would avoid reading before writing in Cassandra as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small number of queries. What you're suggesting doesn't really lend itself to Cassandra, as there doesn't appear to be any way to avoid reading before you write. Even if you use a single table you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there are better tools for the job.
Having said that, this would be perfectly feasible if you could keep all the data in table 2 in memory, and potentially utilise the row cache. As long as table 2 is small enough that the majority of its rows fit in memory, your reads will be significantly faster, which may make up for the need to perform a read on every write. This would be quite a challenge, however: you would need to ensure that only the "last update time" for each row is kept in memory, and that disk rarely needs to be touched.
Anyway, another design you may want to look at is one where you use not only Cassandra but also a cache in front of Cassandra to store the last update times. This could run alongside Cassandra or on a separate node, and would be an in-memory store of the last update times only. When you need to update a row you query the cache, and write your full row to Cassandra (you could even write the last update time if you wished). You could use something like Redis for this, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory.
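A rough sketch of that cache-in-front design, under assumptions: a plain dict stands in for Redis, `write_row` is a hypothetical callable that appends one (eid, time, weight) row to the Cassandra time series table, and the class name is made up. Cold starts and cache loss (reloading the last state from Cassandra) are deliberately left out.

```python
import math
from typing import Callable, Dict, Tuple


class EdgeWeightCache:
    """Keeps (last update time, last weight) per edge in memory."""

    def __init__(self, tau: float,
                 write_row: Callable[[str, float, float], None]) -> None:
        self.tau = tau
        self.write_row = write_row
        self.last: Dict[str, Tuple[float, float]] = {}

    def tick(self, eid: str, now: float) -> float:
        # Unknown edges start from weight 0, so their first tick yields 1.
        last_time, last_weight = self.last.get(eid, (now, 0.0))
        weight = last_weight * math.exp(-(now - last_time) / self.tau) + 1.0
        self.last[eid] = (now, weight)      # read-modify-write stays in memory
        self.write_row(eid, now, weight)    # Cassandra only receives appends
        return weight


rows = []
cache = EdgeWeightCache(tau=10.0, write_row=lambda e, t, w: rows.append((e, t, w)))
cache.tick("G-V1-V2", now=0.0)
cache.tick("G-V1-V2", now=10.0)
print(rows)
```

With this split there is no read-before-write against Cassandra at all; the time series table is written once per tick and read only by queries.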
