Efficient DynamoDB schema for time series data - amazon-dynamodb

We are building a conversation system that will support messages between 2 users (and eventually between 3+ users). Each conversation will have a collection of users who can participate/view the conversation as well as a collection of messages. The UI will display the most recent 10 messages in a specific conversation with the ability to "page" (progressive scrolling?) the messages to view messages further back in time.
The plan is to store conversations and the participants in MSSQL and then only store the messages (which represents the data that has the potential to grow very large) in DynamoDB. The message table would use the conversation ID as the hash key and the message CreateDate as the range key. The conversation ID could be anything at this point (integer, GUID, etc) to ensure an even message distribution across the partitions.
In order to avoid hot partitions one suggestion is to create separate tables for time series data because typically only the most recent data will be accessed. Would this lead to issues when we need to pull back previous messages for a user as they scroll/page because we have to query across multiple tables to piece together a batch of messages?
Is there a different/better approach for storing time series data that may be infrequently accessed, but available quickly?

I guess we can assume that there are many "active" conversations in parallel, right? Meaning - we're not dealing with the case where all the traffic is regarding a single conversation (or a few).
If that's the case, and you're using a random number/GUID as your HASH key, your objects will be evenly spread throughout the nodes and, as far as I know, you shouldn't be afraid of skewness. Since the CreateDate is only the RANGE key, all messages for the same conversation will be stored on the same node (based on their ConversationID), so it actually doesn't matter if you query for the latest 5 records or the earliest 5. In both cases it's a query using the index on CreateDate.
I wouldn't break the data into multiple tables. I don't see what benefit it gives you (considering the previous section) and it will make your administrative life a nightmare (just imagine changing throughput for all tables, or backing them up, or creating a CloudFormation template to create your whole environment).
I would be concerned with the number of messages that will be returned when you pull the history. I guess you'll implement that with a Query using the ConversationID as the HASH key and ordering results by CreateDate descending. In that case, I'd return only the first page of results (I think a single Query returns up to 1MB of data, so depending on the average message length that might be enough or not) and fetch the next page only if the user keeps scrolling. Otherwise, you might use a lot of your throughput on really long conversations, and anyway the client doesn't really want to get stuck for a long time waiting for megabytes of data to appear on screen.
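As a rough sketch of that paging idea, the helper below builds the arguments for a low-level DynamoDB Query call. The table name Messages and the function names are assumptions for illustration; ConversationID and CreateDate are the key names from the question, and the client is assumed to be anything exposing a query(**kwargs) method, such as a boto3 DynamoDB client.

```python
from typing import Any, Dict, List, Optional, Tuple

TABLE_NAME = "Messages"  # hypothetical table name


def build_page_request(
    conversation_id: str,
    page_size: int = 10,
    start_key: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a low-level DynamoDB Query call."""
    request: Dict[str, Any] = {
        "TableName": TABLE_NAME,
        # Only the hash key is constrained, so a single conversation is read.
        "KeyConditionExpression": "ConversationID = :cid",
        "ExpressionAttributeValues": {":cid": {"S": conversation_id}},
        # False means descending by the range key (CreateDate): newest first.
        "ScanIndexForward": False,
        # Stop after page_size items instead of reading up to 1 MB.
        "Limit": page_size,
    }
    if start_key is not None:
        # Resume where the previous page stopped (its LastEvaluatedKey).
        request["ExclusiveStartKey"] = start_key
    return request


def fetch_page(
    client: Any,
    conversation_id: str,
    start_key: Optional[Dict[str, Any]] = None,
) -> Tuple[List[Dict[str, Any]], Optional[Dict[str, Any]]]:
    """Return (items, next_start_key); next_start_key is None on the last page."""
    response = client.query(**build_page_request(conversation_id, start_key=start_key))
    return response.get("Items", []), response.get("LastEvaluatedKey")
```

The UI would call fetch_page once for the initial 10 messages and again, passing the returned key, each time the user scrolls further back.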
Hope this helps

Related

Optimize storage and bandwidth use in a chat app using Realtime Database

I'm developing a chat app with Realtime Database as the backend, and this is the way I save the data into the DB:
I identify each message with the full uid of the user sending it.
Do you think this is necessary, or can I save only the first 10 characters (for example) of the uid in order to reduce bytes? My concern is whether at some point 2 different users will have the same first 10 characters.
There is no guarantee that the first 10 characters of a UID are going to be unique, so using those as an identifier is not a great idea.
If you want to use shorter IDs, the two first options that come to mind are:
Create your own identifier for each user in the room, for example by giving them a sequential ID, and then store that.
Use an actual hash function to determine a shorter unique ID for each user. While there is still a chance of collisions (multiple users getting the same ID), the chances are likely smaller than when you just take the first 10 characters.
In all cases, I'd highly recommend calculating the cost savings that you'll accomplish with this though. Message length is more typically the dominant factor in the size of each message.
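The hash-function option could look roughly like the sketch below. The function names, the choice of SHA-256, and the 10-character length are assumptions for illustration; the second function is a standard birthday-bound estimate of how likely a collision becomes as the number of users grows.

```python
import base64
import hashlib
import math


def short_id(uid: str, length: int = 10) -> str:
    """Derive a shorter, deterministic identifier from a full UID."""
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    # URL-safe base64 carries 6 bits per character, so 10 characters
    # keep 60 bits of the hash.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    return encoded[:length]


def collision_probability(users: int, length: int = 10) -> float:
    """Birthday-bound estimate that at least two of `users` IDs collide."""
    space = 64 ** length  # number of possible IDs of this length
    exponent = users * (users - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)
```

Even so, the saving is a handful of bytes per message, which is why measuring the actual cost difference first is worthwhile.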

Storing and querying for announcements between two datetimes

Background
I have to design a table to store announcements in DynamoDB. Each announcement has the following structure:
{
"announcementId": "(For the frontend to identify an announcement to the backend)",
"author": "(id of author)",
"displayStartDatetime": "",
"displayEndDatetime": "",
"title": "",
"description": "",
"image": "(A url to an image)",
"link": "(A single url to another page)"
}
As we are still designing the table, alterations to the structure are permitted. In particular, announcementId, displayStartDatetime and displayEndDatetime can be changed.
The main access pattern is to find the current announcements. Users have a webpage which they can see all current announcements and their details.
Every announcement has a date for when to start showing it (displayStartDatetime) and when to stop showing it (displayEndDatetime). The announcement should still be kept in the table after the current datetime is past displayEndDatetime, for reference by admins.
The start and end datetime are precise to the minute.
Problem
Ideally, I would like a way to query the table for all the current announcements in one query.
However, I have come to the conclusion that it is impossible to fuse two datetimes in one sort key because it is impossible to order two pieces of data of equal importance (e.g. storing the timestamps as a string will mean one will be more important/greater than the other).
Hence, as a compromise, I would like to sort the table values by displayEndDatetime so that I can filter out past announcements. This is because, as time goes on, there will be more past announcements than future announcements, so it will be more beneficial to optimise that.
Compromised Solution
Currently, my (not very good) solutions are:
1. Use one "hot" partition key and use the displayEndDatetime as the sort key.
This allows me to filter out past announcements, but it also means that all the data is in a single partition. I could run a scheduled job every now and then to move the past announcements to different, spaced-out partitions.
2. Scan through the table
I believe Scan will look at every item in the table before it performs any filtering. This solution doesn't seem as good as 1. but it would be the simplest to implement and it would allow me to keep announcementId as the partition key.
3. Scan a GSI of the table
Since Scan will look through every item, it may be more efficient to create a GSI (announcementId (PK), displayEndDatetime (SK)) and scan through that to retrieve all the announcementIds which have not passed. After that, another request could be made to get all the announcements.
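To make the first option (one partition key with displayEndDatetime as the sort key) concrete, here is a minimal sketch of the Query arguments it implies. The table name, the constant partition value ANNOUNCEMENT, and the attribute names pk and sk are assumptions; the sort key is assumed to hold the end datetime followed by "#" and the announcementId so that two announcements ending in the same minute stay unique. The key condition drops past announcements, and a filter expression drops those that have not started yet.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def current_announcements_request(now: datetime) -> Dict[str, Any]:
    """Build Query arguments for announcements visible at `now`.

    `now` should be timezone-aware; it is converted to UTC.
    """
    # Zero-padded, minute-precision text sorts in the same order as time.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M")
    return {
        "TableName": "Announcements",  # hypothetical table name
        # sk is assumed to look like "2024-05-03T09:00#<announcementId>".
        "KeyConditionExpression": "pk = :pk AND sk >= :now",
        # Applied after the key condition: hide announcements not yet started.
        "FilterExpression": "displayStartDatetime <= :now",
        "ExpressionAttributeValues": {
            ":pk": {"S": "ANNOUNCEMENT"},  # hypothetical constant partition value
            ":now": {"S": stamp},
        },
    }
```

The filter still consumes read capacity for future announcements it discards, but with a few dozen upcoming items that cost should be small.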
Question
What is the most optimised solution for storing all announcements and then finding current announcements when using DynamoDB?
Although I have listed a few possible solutions for sorting the displayEndDatetime, the main point is still finding announcements between the start and end datetime.
Edit
Here are the answers to @tugberk's questions on the background:
What is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
I am uncertain of how the admins will use this system, announcements can be very regular (about 3/day) or very infrequent (about 3/month).
How much new data do you anticipate storing daily, and how do you think this will grow?
As mentioned above, this could be about 3 announcements a day or 3 a month. This is likely to remain the same for as long as I should be concerned about.
What is the rate of reads (e.g. peak reads per second)?
I would expect the peak reads per second to be around 500-1000 reads/s. This number is expected to grow as there are more users.
How many announcements a user can see at a time (i.e. what's avg/max number of announcements will be visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most).
I would expect the maximum number of viewable announcements to be up to 30-40. This is because there could be multiple long-running announcements along with short-term announcements. On average, I would expect about 5-10 announcements.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
I think the speed at which the announcement starts showing is important, especially if the admins decide that this is a good platform for urgent announcements (likely urgent to the minute). However, when it stops showing is less important, but to avoid confusing the users the announcement should stop displaying at most 4 hours after it is past its display end datetime.
These types of questions are always hard to answer here, as the answer rests on so many assumptions and it's really hard to have all the facts. But I will try to give you some ideas, which may help you think about your data storage choice as well as give you further options.
I know what I am doing, and really need to use DynamoDB
Edited this answer based on the OP's answers to my original questions.
As you really need to use DynamoDB for this for internal reasons, I think it's more suitable to store the data in two DynamoDB tables, one serving reads and one serving writes, as nearly all access patterns I can think of will hit multiple partitions if you have one table. You can get away with a GSI, but it's not too straightforward how to do it, and I am not sure whether there is any advantage to doing it that way.
The core thing you need to optimize for is reads: you mentioned peaks of 500-1000 reads/s and expect that to grow, which is big enough to make this the part you design your architecture around. Based on your assumption of about 3 announcements a day, writes are nothing to worry about.
General idea is this:
I would consider using one DynamoDB table to handle writes where you can configure author identifier as the partition key, and announcement identifier as the sort key (and make your primary key as the combination of both). This will allow you to query all the announcements for a given author easily.
I would also have a second DynamoDB table to handle reads, where you will only store active announcements, which your application can retrieve in full with a Scan (i.e. O(N)). That is not a concern, as you mentioned there will only be 30-40 active announcements at any point in time. Even if this were 500, you would still be OK with this structure. In terms of partition and sort key, I would just have an active field as the partition key, which always holds the same value (note that DynamoDB key attributes must be a string, number or binary, so store it as the string "true" rather than a boolean), with the announcement ID as the sort key, and make the combination of both the primary key. If you care about the order of these announcements, you can adjust the sort key accordingly, but make sure it stays unique (e.g. by concatenating the announcement identifier: {displayStartDatetime-in-yyyyMMddHHmmss-format}-{announcementId}). This way you guarantee that you only hit one partition. However, you can actually simplify this and have the announcement identifier alone as the partition key and primary key, as I am nearly sure that DynamoDB will store all your data in one partition since it's going to be so small. Better to confirm this though, as I am not 100% sure. The point here is that you are much better off ensuring that this query hits one partition.
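A minimal sketch of what an item in this read table might look like, assuming the announcement structure from the question with ISO 8601 datetimes that carry a UTC offset. The attribute names active, sortKey and expiresAt are made up for illustration; expiresAt is a Unix epoch value in seconds, which is the form DynamoDB TTL works with.

```python
from datetime import datetime
from typing import Any, Dict


def read_table_item(announcement: Dict[str, Any]) -> Dict[str, Any]:
    """Shape one announcement for the read-side table."""
    start = datetime.fromisoformat(announcement["displayStartDatetime"])
    end = datetime.fromisoformat(announcement["displayEndDatetime"])
    item = dict(announcement)
    # Constant partition key so every active announcement shares a partition.
    # Stored as a string because key attributes cannot be booleans.
    item["active"] = "true"
    # Start time first for ordering, announcementId last for uniqueness.
    item["sortKey"] = f"{start:%Y%m%d%H%M%S}-{announcement['announcementId']}"
    # TTL attribute: seconds since the Unix epoch.
    item["expiresAt"] = int(end.timestamp())
    return item
```

With a constant partition key like this, a Query on active = "true" returns the announcements already ordered by start time.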
Here is how this may work, where there are some edge cases I am overlooking:
Record the write inside the first DynamoDB table for an announcement. When an announcement is written, configure displayEndDatetime as the TTL of that row (stored as a Unix epoch timestamp in seconds, which is what DynamoDB TTL expects), with the assumption that you don't need this record in this table when an announcement expires.
Have a job running every N minutes (one or more, depending on the data inconsistency gap you can handle), which Scans the entire first DynamoDB table across partitions (in a paginated way) and decides which announcements are currently visible. Then write that data into the second DynamoDB table, which will handle the reads, in the structure we have established above, so that your consumer can read from it without worrying about any filtering as the data is already filtered (i.e. all the announcements here are visible ones). Note that Scan is fine here as you are running this once every N minutes, with the assumption that you are OK with a data inconsistency gap of at least 1 minute plus processing time. I would suggest running this every 10 minutes or so, if you don't have strong data consistency requirements.
On the read storage system, also configure displayEndDatetime as the TTL for the row so that it gets automatically deleted.
Configure DynamoDB Streams on the first DynamoDB table (stream records are retained for 24 hours, and each record appears exactly once in the stream), and have a Lambda consume this stream to handle item deletions (which will happen when TTL kicks in for a particular row) by keeping a record of the announcement somewhere else for longer retention. You will need to expose that store through a different access pattern (e.g. show all the announcements per author so that they can re-enable old announcements), as you mentioned in your question. You can configure a Lambda event source mapping with DynamoDB Streams, which will allow you to handle failures with retries, etc. Make sure that the logic in these Lambdas is idempotent so that you can retry safely.
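The two moving parts in the steps above can be sketched as plain functions. Everything here is illustrative: the function names are made up, the datetimes are assumed to be ISO 8601 strings in one consistent format and timezone so they compare correctly as text, the stream is assumed to include old images, and a dict stands in for whatever long-term store you pick. TTL-driven deletes are recognised by the dynamodb.amazonaws.com principal in the record's userIdentity.

```python
from typing import Any, Dict, Iterable, List

TTL_PRINCIPAL = "dynamodb.amazonaws.com"  # principal recorded on TTL deletes


def select_visible(announcements: Iterable[Dict[str, Any]], now: str) -> List[Dict[str, Any]]:
    """Periodic job: keep announcements whose display window contains `now`."""
    return [
        a for a in announcements
        if a["displayStartDatetime"] <= now < a["displayEndDatetime"]
    ]


def expired_images(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Stream consumer: pull the old image of every TTL-deleted item."""
    images = []
    for record in event.get("Records", []):
        if record.get("eventName") != "REMOVE":
            continue  # inserts and updates are not archived
        if record.get("userIdentity", {}).get("principalId") != TTL_PRINCIPAL:
            continue  # deleted by a user or application, not by TTL
        old_image = record.get("dynamodb", {}).get("OldImage")
        if old_image:
            images.append(old_image)
    return images


def archive_expired(event: Dict[str, Any], archive: Dict[str, Any]) -> int:
    """Write expired announcements to `archive`, keyed by announcementId.

    Keying by ID makes a retry overwrite the same entry, so the handler
    is safe to run more than once for the same event.
    """
    images = expired_images(event)
    for image in images:
        archive[image["announcementId"]["S"]] = image
    return len(images)
```

In a real Lambda, archive_expired would be called from the handler with the incoming event and a client for the archive store in place of the dict.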
Below are the parts from my original answer that are still relevant to anyone who might be trying to achieve the same thing. I will leave them here, but they are less relevant now that the OP needs to use DynamoDB.
Why DynamoDB?
First of all, I would question why you need DynamoDB for this, as it seems like your requirements are more read heavy than write heavy, and write-heavy workloads are where I think DynamoDB shines the most thanks to its partitioned-out-of-the-box nature.
The questions below would help you understand whether you really need DynamoDB for this, or whether you can get away with a more flexible data storage system:
What is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
How much new data do you anticipate storing daily, and how do you think this will grow?
What is the rate of reads (e.g. peak reads per second)?
How many announcements can a user see at a time (i.e. what's the avg/max number of announcements that will be visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most). This will help you understand whether you will be OK pulling all the visible announcements in one go, or need a pagination system.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
Actually, I don't need DynamoDB
Based on my assumptions about your consumption and admin needs for this use case, I believe you don't need DynamoDB for this, assuming you don't have a high number of writes (which might be wrong). If these assumptions are correct, the above is a super over-engineered solution for you. In that case, I think you are better off using PostgreSQL, which makes it easy to change your access pattern as you see fit with further indexing, and for the current access pattern you can run a range query over the start and end times.
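To give a feel for that range query, here is a small runnable sketch. It uses SQLite from the Python standard library purely as a stand-in so the example runs without a server; the table and column names are made up, and in PostgreSQL the same WHERE clause would apply (with timestamp columns and your driver's placeholder style instead of ?).

```python
import sqlite3
from typing import List, Tuple


def current_announcements(conn: sqlite3.Connection, now: str) -> List[Tuple[str, str]]:
    """Return (announcement_id, title) for announcements visible at `now`."""
    cursor = conn.execute(
        "SELECT announcement_id, title FROM announcements "
        "WHERE display_start <= ? AND display_end > ? "
        "ORDER BY display_start",
        (now, now),
    )
    return cursor.fetchall()


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory table with one past, one current, one future row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE announcements ("
        "announcement_id TEXT PRIMARY KEY, title TEXT, "
        "display_start TEXT, display_end TEXT)"
    )
    conn.executemany(
        "INSERT INTO announcements VALUES (?, ?, ?, ?)",
        [
            ("a1", "Past", "2024-04-01T09:00", "2024-04-02T09:00"),
            ("a2", "Current", "2024-05-01T09:00", "2024-05-03T09:00"),
            ("a3", "Future", "2024-06-01T09:00", "2024-06-02T09:00"),
        ],
    )
    return conn
```

Calling current_announcements(demo_connection(), "2024-05-02T12:00") returns only the row whose window contains that moment.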

Unity + Firebase: is it possible to append data to a key's value, or do I have to retrieve the key's data every time?

I'm a bit worried that I will reach the free data limits of Firebase in a student project.
Basically my question is:
Is it possible to append to the end of the string instead of retrieving the key and value, appending, and uploading again?
What I want to achieve:
I have to create statistics of user right/wrong answers for particular questions.
I want to have a kvp:
answers: 1r/5w/3r
Where the number is the number of the user's guesses and r/w means right/wrong. Whenever the guessing session ends I want to add /numberOfGuesses+RightOrWrongAnswer at the end.
I'm using Unity 2018.
Thank you in advance for all the help!
I don't know how your game is architected or how many people are playing, but I'd be surprised if you hit your free limit on a student project (you can store 1GB and download 10GB). That string is 8 bytes; let's assume the worst case scenario: as a UTF-32 string, that would be 32 bytes of data, so you'd have to pull that down 312 million times to hit the cap (there'll be some overhead, but I can't imagine it being hugely impactful). If you're afraid of being charged, you can opt to not have a credit card on file to be doubly sure you stay on a student budget.
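For anyone who wants to check that estimate, the arithmetic is below. The 10GB download allowance is the figure quoted above, taken here as 10 * 10^9 bytes, and protocol overhead is ignored.

```python
DOWNLOAD_CAP_BYTES = 10 * 10**9  # 10GB download allowance, decimal gigabytes

sample = "1r/5w/3r"
characters = len(sample)           # 8 characters
bytes_per_read = characters * 4    # UTF-32 uses 4 bytes per character
reads_to_hit_cap = DOWNLOAD_CAP_BYTES // bytes_per_read

print(bytes_per_read)     # 32
print(reads_to_hit_cap)   # 312500000
```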
If you want to reduce the amount of reading/writing though, I might suggest that instead of:
key: <value_string> (so, instead of session_id: "1r/5w/3r")
you structure more like:
key:
- wrong: 5
- right: 3
So have two more values nested under your key. One for all the wrong answers, just an incrementing integer. Then one for all the right answers: just an incrementing integer.
The mechanism to "append" would be a transaction, and you should use these whether you're mutating a string or counter. Firebase tries to be smart with data usage and offline caching, but you don't get much more control other than that.
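The general shape of a transaction is that your code is handed the current value (or nothing, if the key does not exist yet) and returns the value to store. The sketch below shows that update step for the two-counter layout as plain Python; it is a stand-in to illustrate the logic, not the Firebase Unity SDK, and the function name is made up.

```python
from typing import Dict, Optional


def record_answer(current: Optional[Dict[str, int]], was_right: bool) -> Dict[str, int]:
    """Return the updated counters after one guessing session.

    `current` is the value stored under the key, or None if nothing
    has been written yet.
    """
    counters = {"right": 0, "wrong": 0}
    counters.update(current or {})
    counters["right" if was_right else "wrong"] += 1
    return counters
```

Because the function builds a new dict from the current value, it can be re-run safely if the transaction is retried against fresher data.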
If order really matters, you might want to get cleverer. You'll generally want to work with the abstractions Realtime Database gives you though to maximize any inherent optimizations (it likes to think in terms of JSON documents, so think about your data layout similarly). This may not be as data optimal, but you may want to consider instead using a ledger of some kind (perhaps using ServerValue.Timestamp to record a single right or wrong answer, and having a cloud function listening to sum up the results in the background after a game - this would be especially useful if you plan on having a lot of users trying to write the same key at the same time).

How to use DynamoDB streams to maintain duplicated data consistency?

From what I understand, one of the use cases of DynamoDB Streams is to maintain/update duplicated data.
Let's say I have a User object, and its name attribute is replicated in many Invoice objects.
When a User edits/updates their name, I will have a Lambda using DynamoDB Streams to then update all Invoices related to this user with the new name.
There could be thousands of Invoices related to this user, so this updating could take a while, especially because I will want to do a rate-limited batch_write so that this operation doesn't throttle my table.
The question is: how can my (web) application know that the Lambda has finished updating? For example, I want to show a loading screen to the client using the application until the duplicated data update is done, so that they don't see any outdated information in their browser.
Or are there other ways of rapidly dealing with updating thousands of duplicated items?
Why aren't you capturing the output of the Lambda? You can make the Lambda return a successful status once all the updates have been persisted to DDB.
Alternatively, an Invoice can keep a reference to the User object instead of storing the exact name, and fetch the name at the time of generating/printing.
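The reference approach can be illustrated with plain dictionaries standing in for the two tables. The field names userId, invoiceId and userName, and the function name, are assumptions for illustration.

```python
from typing import Any, Dict


def render_invoice(invoice: Dict[str, Any], users: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Resolve the user's current name when the invoice is displayed."""
    user = users[invoice["userId"]]  # one extra lookup per invoice
    return {**invoice, "userName": user["name"]}


users = {"u1": {"name": "Alice"}}
invoice = {"invoiceId": "i1", "userId": "u1", "total": 120}

before = render_invoice(invoice, users)
users["u1"]["name"] = "Alice Smith"  # a rename touches a single item
after = render_invoice(invoice, users)
```

The trade-off is an extra read per invoice at display time in exchange for never having to fan out a rename to thousands of items.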

DynamoDB Query Time Based on Table Size

Is there any good documentation on how query times change for a DynamoDB table based on equal read capacity and differing row sizes? I've been reading through the documentation and can't find anything, was wondering if anybody has done any studies into this?
My use case is that I'm putting a million rows into a table a week. These records are referenced quite a bit as they're entered but as time goes on the frequency at which I query those rows decreases. Can I leave those records in the table indefinitely with no detrimental effect on query time, or should I rotate them out so the newer data that is requested more frequently returns faster?
Please don't keep the old data indefinitely. It is advised to archive the data for better performance.
A few points on design and testing:
Design a proper hash key, so that the data is distributed across the partitions
Understand access patterns for time series data
Test your application at scale to avoid problems with "hot" keys when your table becomes larger
Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with a composite primary key consisting of Customer ID as the partition key and date/time as the sort key. In this application, customer data grows indefinitely over time; however, the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.
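As a small illustration of the monthly-table idea, the helpers below map a date to a table name and list the tables that a time-range read would have to touch, newest first. The clicks prefix and the naming scheme are made up for this sketch.

```python
from datetime import date
from typing import List


def table_for(day: date, prefix: str = "clicks") -> str:
    """Name of the monthly table holding items written on `day`."""
    return f"{prefix}_{day:%Y_%m}"


def tables_between(start: date, end: date, prefix: str = "clicks") -> List[str]:
    """Monthly tables covering start..end, newest first."""
    names = []
    year, month = end.year, end.month
    while (year, month) >= (start.year, start.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month -= 1
        if month == 0:
            # Step back from January to December of the previous year.
            year, month = year - 1, 12
    return names
```

A reader paging backwards through history would query these tables in order and move to the next one only when the current table runs out of items.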
Time Series Data Access Pattern
Guidelines for table partitions
