I would like to get a good understanding of what would be the price (in terms of $) of using DynamoDB Titan backend. For this, I need to be able to understand when DynamoDB Titan backend does reads and writes. Right now I am pretty clueless.
Ideally I would like to run a testcase which adds some vertices, edges and then does a rather simple traversal and then see how many reads and writes were done. Any ideas of how I can achieve this? Possibly through metrics?
If it turns out I can't extract this information myself, I would very much appreciate a first brief explanation about when DynamoDB Titan backend performs reads and writes.
For all Titan backends, to understand and estimate the number of writes, we rely on estimating the number of columns for a given KCVStore. You can also measure the number of columns that get written using metrics when using the DynamoDB Storage Backend for Titan.
To enable metrics, enable the configuration options listed here.
Specifically, enable lines 7-11.
Note the max-queue-length configuration property. If the executor-queue-size metric hits max-queue-length for a particular tx.commit() call, then you know that the queue (storage.buffer-size) was not large enough. Once the executor-queue-size metric peaks without reaching max-queue-length, you know you have captured all the columns being written in a tx.commit() call, so its peak will give you the number of columns being changed per tx.commit(). You can look at the UpdateItem metrics for edgestore and graphindex to understand the spread of columns between the two tables.
All Titan storage backends implement KCVStore, and the keys and columns have different meanings depending on the kind of store. There are two stores that get the bulk of writes, assuming you have not turned on user-defined transaction logs. They are edgestore and graphindex.
The edgestore KCVStore is always written to, regardless of whether you configure composite indexes. Each edge, together with all of its edge properties, is represented by two columns (unless you set the schema of that edge label to be unidirectional). For the direct column, the key is the out-vertex of the edge and the column is the in-vertex; for the reverse column, the key is the in-vertex and the column is the out-vertex. Each vertex is represented by at least one column for the VertexExists hidden property, one column for the vertex label (optional), and one column for each vertex property. The key of a vertex is the vertex id, and the columns correspond to vertex properties, hidden vertex properties, and labels.
The graphindex KCVStore will only be written to if you configure composite indexes in the Titan management system. You can index vertex and edge properties. For each pair of indexed value and edge/vertex that has that indexed value, there will be one column in the graphindex KCVStore. The key will be a combination of the index id and value, and the column will be the vertex/edge id.
Now that you know how to count columns, you can use this knowledge to estimate the size and number of writes to edgestore and graphindex when using the DynamoDB Storage Backend for Titan. If you use the multiple-item data model for a KCVStore, you get one item for each key-column pair. If you use the single-item data model for a KCVStore, you get one item for all columns at a key (this is not necessarily true when graph partitioning is enabled, but that is a detail I will not discuss now). As long as each vertex property is less than 1 KB, and the sum of all edge properties for an edge is less than 1 KB, each column will cost 1 WCU to write when using the multiple-item data model for edgestore. Again, each column in graphindex will cost 1 WCU to write if you use the multiple-item data model.
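As a rough sketch of the column-counting rules above (the per-entity counts come from this answer's description, not from the Titan source, and the function names are mine):

```python
# Rough estimator for Titan edgestore/graphindex column counts, based on the
# rules described above. All parameters are assumptions to be tuned per graph.

def edgestore_columns(num_vertices, props_per_vertex, num_edges,
                      labeled_fraction=1.0, unidirectional_fraction=0.0):
    """Columns written to the edgestore KCVStore."""
    # Each vertex: 1 VertexExists column + optional label column + one per property.
    vertex_cols = num_vertices * (1 + labeled_fraction + props_per_vertex)
    # Each edge: 2 columns (direct + reverse), or 1 if the edge label is unidirectional.
    edge_cols = num_edges * (2 - unidirectional_fraction)
    return vertex_cols + edge_cols

def graphindex_columns(num_indexed_values):
    """One column per (indexed value, vertex/edge) pair in composite indexes."""
    return num_indexed_values

# With the multiple-item data model, each column under 1 KB costs ~1 WCU.
print(edgestore_columns(num_vertices=1000, props_per_vertex=3, num_edges=5000))
# 1000*(1+1+3) + 5000*2 = 15000 columns, hence ~15000 WCU to write once
```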
Let's assume you did your estimation and you use the multiple-item data model throughout. Let's assume you estimate that you will be writing 750 columns per second to edgestore and 750 columns per second to graphindex, and that you want to drive this load for a day. You can set the read capacity for both tables to 1, so you know each table will start off with one physical DynamoDB partition to begin with. In us-east-1, the cost for writes is $0.0065 per hour for every 10 units of write capacity, so 24 * 75 * $0.0065 is $11.70 per day for writes for each table. This means the write capacity would cost $23.40 per day for edgestore and graphindex together. The reads could be set to 1 read per second for each of the tables, making the read cost 2 * 24 * $0.0065 = $0.312 for both tables per day. If your AWS account is new, the reads would fall within the free tier, so effectively you would only be paying for the writes.
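The write-cost arithmetic above can be reproduced in a few lines (the $0.0065 per hour per 10 WCU price is the us-east-1 figure quoted above and may be out of date):

```python
# Reproduces the provisioned write-cost arithmetic from the text.
# Assumption: us-east-1 pricing of $0.0065/hour per 10 WCU, as quoted above.

PRICE_PER_10_WCU_HOUR = 0.0065

def daily_write_cost(wcu_per_second):
    units_of_10 = wcu_per_second / 10          # capacity is billed per 10 WCU
    return 24 * units_of_10 * PRICE_PER_10_WCU_HOUR

per_table = daily_write_cost(750)              # 24 * 75 * 0.0065
print(round(per_table, 2))                     # 11.7 per table per day
print(round(2 * per_table, 2))                 # 23.4 for edgestore + graphindex
```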
Another aspect of DynamoDB pricing is storage. If you write 750 columns per second, that is 64.8 million items per day to one table, or 1.9 billion (approximately 2 billion) items per month. The average number of items in the table over the month is then 1 billion. If each item averages out to 412 bytes, with 100 bytes of overhead, then 1 billion 512-byte items are stored over the month, approximately 477 GB. 477 / 25 rounded up is 20, so storage for the first month at this load would cost 20 * $0.25 per month. If you keep adding items at this rate without deleting them, the monthly storage cost will increase by approximately 5 dollars per month.
If you do not have super nodes in your graph, or vertices with a relatively large number of properties, then the writes to the edgestore will be distributed evenly throughout the partition key space. That means your table will split into 2 partitions when it hits 10GB, and then each of those partitions will split into a total of 4 partitions when they hit 10GB, and so on and so forth. The nearest power of 2 to 477 GB / (10 GB / partition) is 2^6 = 64, so your edgestore would split 6 times over the course of the first month; you would probably have around 64 partitions at the end of it. Eventually, your table will have so many partitions that each partition will have very few IOPS. This phenomenon is called IOPS starvation, and you should have a strategy in place to address it. Two commonly used strategies are 1. batch cleanup/archival of old data and 2. rolling (time-series) graphs. In option 1, you spin up an EC2 instance to traverse the graph, write old data to a colder store (S3, Glacier, etc.) and delete it from DynamoDB. In option 2, you direct writes to graphs that correspond to a time period (weeks - 2015W1, months - 2015M1, etc.). As time passes, you down-provision the writes on the older tables, and when the time comes to migrate them to colder storage, you read the entire graph for that time period and delete the corresponding DynamoDB tables. The advantage of this approach is that it allows you to manage your write-provisioning cost with higher granularity, and it allows you to avoid the cost of deleting individual items (because you delete a table for free instead of incurring at least 1 WCU for every item you delete).
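A quick sketch of the partition-split estimate above, assuming evenly distributed keys and the 10 GB-per-partition threshold described here (the actual split behaviour is internal to DynamoDB):

```python
import math

# Estimates splits and final partition count for an evenly written table.
# Assumption: partitions split at 10 GB, doubling each time, per the text.

def partitions_for_size(table_gb, gb_per_partition=10):
    raw = table_gb / gb_per_partition
    splits = math.ceil(math.log2(raw)) if raw > 1 else 0
    return splits, 2 ** splits

print(partitions_for_size(477))  # (6, 64): 6 splits -> ~64 partitions
```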
Related
I need to create a table with the following fields:
place, date, status
My keys are partition key - place, sort key - date
Status can be either 0 or 1
Table has approximately 300k rows per day and about 3 days worth of data at any given time, so about 1 million rows. I have a service that is continuously populating data to this DDB.
I need to run the following queries (and only these) once per day:
#1 Return count of all places with date = current_date-1
#2 Return count and list of all places with date= current_date-1 and status = 0
Questions :
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields?
If the answer to the above question is yes: for query #2, do I need to create a GSI on date and status, with date as the partition key and status as the sort key?
Creating a GSI vs using filter expression on status for query #2. Which of the two is recommended?
Running analytical queries (such as counts) is the wrong usage of a NoSQL database such as DynamoDB, which is designed for scalable lookup use cases.
Even if you get the Scan to work with one design or another, it will be more expensive and slower than it should be.
A better option is to export the table data from DynamoDB into S3, and then run an Athena query over that data. It will be much more flexible to run various analytical queries.
Easiest thing for you to do is a full table scan once per day filtering by yesterday's date, and as part of that keep your own client-side count on if the status was 0 or 1. The filter is not index optimized so it will be a true full table scan.
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to run a new export every day to keep the data fresh, and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to run repeated queries against the data then the export makes more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once per day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
How much would it cost? Million rows, 100 bytes each, so that's a 100 MB table. That's 25,000 read units to fully scan, which is halved down to 12,500 with eventual consistency. On Demand pricing is $0.25 per million read units. 12,500 / 1,000,000 * $0.25 = $0.003. Less than a dime a month. It'd be cheaper still if you run provisioned.
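The same back-of-the-envelope math, spelled out (using the round decimal units from the paragraph above; on-demand pricing of $0.25 per million read request units is assumed from the text):

```python
# Scan cost estimate, mirroring the paragraph above.
# Assumptions: 4 KB (decimal) per strongly consistent read request unit,
# eventual consistency halves the units, $0.25 per million units on demand.

def scan_cost_usd(table_bytes, eventually_consistent=True,
                  price_per_million_rru=0.25):
    rru = table_bytes / 4_000              # 4 KB per strongly consistent read
    if eventually_consistent:
        rru /= 2                           # eventually consistent is half price
    return rru, rru / 1_000_000 * price_per_million_rru

rru, cost = scan_cost_usd(100_000_000)     # 100 MB table
print(rru, cost)                           # 12500 units, ~$0.0031 per full scan
```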
Just do the scan. :)
I've read in the documentation that Cloud Firestore allows a maximum of 10,000 writes per second. But if a collection has sequential values in an indexed field, only 500 writes per second are allowed.
"Maximum write rate to a collection in which documents contain sequential values in an indexed field: 500 per second"
https://firebase.google.com/docs/firestore/quotas?hl=en
In order to increase the writes per second, a "shard" field should be implemented. The 500-writes-per-second limit is then multiplied by the number of shards.
https://cloud.google.com/firestore/docs/solutions/shard-timestamp?hl=en
My question is: Does that mean that a number of 20 shards will increase my writes per second to the maximum of 10,000? And more shards are superfluous?
Further shards would only make sense if I wanted to increase the writes per second for a single document, as I understand it. For example, for a counter that is divided across several documents, in order to avoid the limit of one write per second per document. (This scenario is not relevant for my purpose.)
I think it wouldn't be much of a hassle for me to implement 20 shards from the start, even though I may never need them. Just to make sure I won't have any problems with it in the future as the number of users increases.
I know that one downside would be more complicated queries. But I think that I could easily avoid this in my app because of how my data is structured.
The page you linked has this example of determining the necessary number of shards:
After some research, you determine that the app will receive between 1,000 and 1,500 instrument updates per second. This surpasses the 500 writes per second allowed for collections containing documents with indexed timestamp fields. To increase the write throughput, you need 3 shard values, MAX_INSTRUMENT_UPDATES/500 = 3.
So it indeed seems that you can simply divide your necessary throughput by 500 (the maximum number of writes per shard) to get the number of shard values you'll need.
Don't forget that you'll need to also update the index definitions to drop the existing indexes on your sequential (in the example: timestamp) field, and add composite indexes on that field and the shard field. The 500/second throughput limit comes from the write speed of individual indexes, so it's actually having multiple composite indexes that increases the throughput.
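A minimal sketch of the shard math and the shard-field pattern (the field name `shard` and the helper names are illustrative, not from the Firestore SDK; the actual write would go through the Firestore client):

```python
import random

# Shard-field pattern from the linked Firestore docs, as plain Python:
# compute how many shard values a target write rate needs, and attach a
# random shard value to each document dict before writing it.

MAX_WRITES_PER_SHARD = 500  # per-second limit for sequentially indexed fields

def shards_needed(target_writes_per_second):
    # Ceiling division without importing math.
    return -(-target_writes_per_second // MAX_WRITES_PER_SHARD)

def with_shard(doc, num_shards):
    doc = dict(doc)                              # don't mutate the caller's dict
    doc["shard"] = random.randint(0, num_shards - 1)
    return doc

print(shards_needed(10_000))   # 20 shards reach the 10k/s limit
print(shards_needed(1_500))    # 3, as in the docs example
```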
My application needs to support lookups for invoices by invoice id and by customer. For that reason I created two collections in which I store the (exact) same invoice documents:
InvoicesById, with partition key /InvoiceId
InvoicesByCustomerId, with partition key /CustomerId
Apparently you should use partition keys when doing queries, and since there are two queries I need two collections. I guess there may be more in the future.
Updates are primarily done to the InvoicesById collection, but then I need to replicate the change to InvoicesByCustomer (and others) as well.
Are there any best practice or sane approaches how to keep collections in sync?
I'm thinking change feeds and whatnot. I want to avoid writing this sync code and risking inconsistencies due to missing transactions between collections (etc.). Or maybe I'm missing something crucial here.
Change feed will do the trick, though I would suggest taking a step back before brute-forcing the problem.
You can find a detailed article describing the partition-split issue here: Azure Cosmos DB. Partitioning.
Based on the Microsoft recommendation, for maintainable data growth you should select the partition key with the highest cardinality (in your case I assume it will be InvoiceId), for the main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
You don't need to create a separate container with a CustomerId partition key, as it won't give you the desired (and, most importantly, maintainable) performance in the future, and it might result in physical-partition data skew when too many invoices are linked to the same customer.
To get optimal and scalable query performance you most probably need InvoiceId as partition key and indexing policy by CustomerId (and others in future).
There will be a slight RU consumption overhead (definitely not a multiplication of RUs, but rather a couple of additional RUs per request) when the data you're querying is distributed across a number of physical partitions (PPs), but it will be negligible compared to the issues that occur when the data grows beyond 50, 100, or 150 GB.
Why CustomerId might not be the best partition key for the data sets which are expected to grow beyond 50GB?
The main reason is that Cosmos DB is designed to scale horizontally, and the provisioned throughput per PP is limited to [total provisioned per container (or DB)] / [number of PPs].
Once a PP split occurs because the 50 GB size limit is exceeded, the maximum throughput for the two newly created PPs (as well as the remaining existing PPs) will be lower than it was before the split.
So imagine following scenario (consider days as a measure of time between actions):
You've created a container with 10k provisioned RUs and a CustomerId partition key (which generates one underlying physical partition, PP1). Maximum throughput per PP is 10k/1 = 10k RUs
Gradually adding data to the container, you end up with 3 big customers holding C1 [10GB], C2 [20GB] and C3 [10GB] of invoices
When another customer is onboarded to the system with C4 [15GB] of data, Cosmos DB has to split PP1's data into two newly created partitions, PP2 (30GB) and PP3 (25GB). Maximum throughput per PP is 10k/2 = 5k RUs
Two more customers, C5 [10GB] and C6 [15GB], are added to the system, and both end up in PP2, which leads to another split -> PP4 (20GB) and PP5 (35GB). Maximum throughput per PP is now 10k/3 = 3.333k RUs
IMPORTANT: As a result, on [Day 2] C1's data could be queried with up to 10k RUs, but on [Day 4] with at most 3.333k RUs, which directly impacts the execution time of your queries.
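The scenario boils down to one line of arithmetic, which is easy to replay:

```python
# Provisioned RUs are shared across physical partitions, so each split
# lowers the per-partition maximum, as in the scenario above.

def max_rus_per_partition(provisioned_rus, physical_partitions):
    return provisioned_rus / physical_partitions

for pp_count in (1, 2, 3):
    print(round(max_rus_per_partition(10_000, pp_count), 1))
# 10000.0 -> 5000.0 -> 3333.3
```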
This is the main thing to remember when designing partition keys in the current version of Cosmos DB (12.03.21).
What you are doing is a good solution. Different queries require different partition keys on different Cosmos DB containers holding the same data.
How to sync the two containers: use triggers from the first container.
https://devblogs.microsoft.com/premier-developer/synchronizing-azure-cosmos-db-collections-for-blazing-fast-queries/
Cassandra has a Feature called Materialized Views for this exact problem, abstracting the sync problem. Maybe some day same Feature will be included on Cosmos DB.
Can a group count query in Amazon Neptune or any Graph Databases fail due to Big Data ?
I mean, if the counts exceed the limits of the count datatype, can there be an overflow?
Short answer
Gremlin query language semantics (as defined by the TinkerPop code) define the output of the count() function as a 64-bit long. So, yes, count cannot exceed the range of a long.
Long answer
Having said that, let's try to calculate the amount of data you would need to insert into the DB to hit that threshold. Each entity (vertex/edge/property) in the DB has a unique ID associated with it. Let us hypothetically assume that the storage of each entity consists of just its identifier, and that the identifier uses the most space-efficient data type, i.e. a long (rather than a String, which would use more space than a long).
To hit the limit of count, the DB would need to store at least 2^63 entities (a signed 64-bit long maxes out at 2^63 - 1), each with a unique identifier, i.e. at least (2^63 * 64) bits of data, which works out to tens of thousands of petabytes even at this very conservative estimate.
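For concreteness, the arithmetic (under the conservative assumption of 8 bytes, one long identifier, per entity):

```python
# Scale of data needed before count() could overflow a signed 64-bit long,
# assuming each entity is stored as nothing but one 8-byte identifier.

MAX_LONG = 2**63 - 1           # Java long, as returned by Gremlin count()
bytes_needed = (MAX_LONG + 1) * 8
petabytes = bytes_needed / 10**15
print(f"{petabytes:.0f} PB")   # ~73787 PB: tens of thousands of petabytes
```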
The point is, you would need to store a huge amount of data before hitting the limit of count. If you are operating on that much data, a DB might not be the right storage solution for you.
I have a question. I'm pretty new to DynamoDB but have been working on large-scale aggregation on SQL databases for a long time.
Suppose you have a table called GamePoints (PlayerId, GameId, Points) and would like to create a ranking table Rankings (PlayerId, Points) sorted by points.
This table needs to be updated on an hourly basis but keeping the previous version of its contents is not required. Just the current Rankings.
The query will always be "give me the ranking table (with paging)".
The GamePoints table will get very very large over time.
Questions:
Is this the best practice schema for DynamoDB ?
How would you do this kind of aggregation?
Thanks
You can enable a DynamoDB Stream on the GamePoints table. You can read stream records from the stream to maintain materialized views, including aggregations, like the Rankings table. Set StreamViewType=NEW_IMAGE on your GamePoints table, and set up a Lambda function to consume stream records from your stream and update the points per player using atomic counters (UpdateItem, HK=player_id, UpdateExpression="ADD Points :stream_record_points", ExpressionAttributeValues={":stream_record_points": [put the value from the stream record here.]}; note that expression attribute values take the ":" prefix, while "#" is reserved for expression attribute names). As the hash key of the Rankings table would still be the player ID, you could do a full table scan of the Rankings table every hour to get the n highest players, or retrieve all the players and sort them.
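A sketch of the Lambda side of this pattern: build the UpdateItem parameters from a stream record (the table and attribute names are illustrative; a real handler would pass the params to a boto3 DynamoDB client):

```python
# Turns a DynamoDB stream record (NEW_IMAGE view) into UpdateItem parameters
# that atomically ADD the new points to a hypothetical Rankings table.

def rankings_update_params(stream_record):
    new_image = stream_record["dynamodb"]["NewImage"]
    return {
        "TableName": "Rankings",
        "Key": {"PlayerId": new_image["PlayerId"]},
        # ADD on a number attribute is an atomic counter increment.
        "UpdateExpression": "ADD Points :delta",
        "ExpressionAttributeValues": {":delta": new_image["Points"]},
    }

record = {"dynamodb": {"NewImage": {"PlayerId": {"S": "p1"},
                                    "Points": {"N": "25"}}}}
print(rankings_update_params(record)["UpdateExpression"])  # ADD Points :delta
```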
However, considering the size of fields (player_id and number of points probably do not take more than 100 bytes), an in memory cache updated by a Lambda function could equally well be used to track the descending order list of players and their total number of points in real time. Finally, if your application requires stateful processing of Stream records, you could use the Kinesis Client Library combined with the DynamoDB Streams Kinesis Adapter on your application server to achieve the same effect as subscribing a Lambda function to the Stream of the GamePoints table.
An easy way to do this is by using DynamoDB's hash key and sort key. For example, the hash key is the GameId and the sort key is the Score. You then query the table with a descending sort and a limit to get the real-time top players in O(1).
To get the rank of a given player, you can use the same technique as above: you get the top 1000 scores in O(1) and you then use BinarySearch to find the player's rank amongst the top 1000 scores in O(log n) on your application server.
If the user has a rank of 1000, you can specify that this user has a rank of 1000+. You can also obviously change 1000 to a greater number (100,000 for example).
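A minimal sketch of the rank lookup described above, in pure Python (the standard library's bisect expects ascending order, so we search on negated scores):

```python
from bisect import bisect_left

# Given the cached top-N scores in descending order, find a player's rank
# with binary search, falling back to "N+" for players below the cutoff.

def rank_of(score, top_scores_desc):
    negated = [-s for s in top_scores_desc]        # ascending for bisect
    pos = bisect_left(negated, -score)
    if pos >= len(top_scores_desc):
        return f"{len(top_scores_desc)}+"          # below the cached top N
    return pos + 1                                 # ranks are 1-based

top = [980, 950, 910, 875, 830]
print(rank_of(950, top))  # 2
print(rank_of(800, top))  # 5+
```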
Hope this helps.
Henri
The PutItem operation can be helpful for implementing the persistence logic for your use case:
PutItem Creates a new item, or replaces an old item with a new item.
If an item that has the same primary key as the new item already
exists in the specified table, the new item completely replaces the
existing item. You can perform a conditional put operation (add a new
item if one with the specified primary key doesn't exist), or replace
an existing item if it has certain attribute values. Source:
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_PutItem.html
In terms of querying the data, if you know for sure that you are going to be reading the entire Ranking table, I would suggest doing it through several read operations with minimum acceptable page size so you can make the best use of your provisioned throughput. See the guidelines below for more details:
Instead of using a large Scan operation, you can use the following
techniques to minimize the impact of a scan on a table's provisioned
throughput.
Reduce Page Size
Because a Scan operation reads an entire page (by default, 1 MB), you
can reduce the impact of the scan operation by setting a smaller page
size. The Scan operation provides a Limit parameter that you can use
to set the page size for your request. Each Scan or Query request that
has a smaller page size uses fewer read operations and creates a
"pause" between each request. For example, if each item is 4 KB and
you set the page size to 40 items, then a Query request would consume
only 40 strongly consistent read operations or 20 eventually
consistent read operations. A larger number of smaller Scan or Query
operations would allow your other critical requests to succeed without
throttling.
Isolate Scan Operations
DynamoDB is designed for easy scalability. As a result, an application
can create tables for distinct purposes, possibly even duplicating
content across several tables. You want to perform scans on a table
that is not taking "mission-critical" traffic. Some applications
handle this load by rotating traffic hourly between two tables – one
for critical traffic, and one for bookkeeping. Other applications can
do this by performing every write on two tables: a "mission-critical"
table, and a "shadow" table.
SOURCE: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScanGuidelines.html#QueryAndScanGuidelines.BurstsOfActivity
You can also segment your tables by GameId (e.g. Ranking_GameId) to distribute the data more evenly and give you more granularity in terms of provisioned throughput.