I want to know the number of unique items queried (Get/BatchGet) from DynamoDB in a given time period (say, per day). Is there a way to figure that out?
For example: 500 unique documents were queried in the last 10 minutes.
You can log all data plane activities to CloudTrail and query the logs:
https://aws.amazon.com/about-aws/whats-new/2021/03/now-you-can-use-aws-cloudtrail-to-log-data-plane-api-activity-in-your-amazon-dynamodb-tables/
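For example, once data events are enabled for the table, you could post-process the delivered log files and count distinct keys. Here's a rough Python sketch, assuming CloudTrail delivers gzipped JSON log files to an S3 prefix you know; the bucket, prefix, and the exact shape of `requestParameters` are assumptions, and BatchGetItem records (which bundle keys per table) are skipped:

```python
import gzip
import json

import boto3

s3 = boto3.client("s3")

BUCKET = "my-cloudtrail-bucket"                                   # placeholder
PREFIX = "AWSLogs/111122223333/CloudTrail/us-east-1/2023/01/15/"  # placeholder: narrow to the period you care about

unique_keys = set()

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        for record in json.loads(gzip.decompress(body))["Records"]:
            if record.get("eventSource") != "dynamodb.amazonaws.com":
                continue
            # Only GetItem is handled here; BatchGetItem events bundle their
            # keys per table and would need their own parsing.
            if record.get("eventName") != "GetItem":
                continue
            key = (record.get("requestParameters") or {}).get("key")
            if key is not None:
                unique_keys.add(json.dumps(key, sort_keys=True))

print(f"{len(unique_keys)} unique items read in this period")
```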
This is a simplified version of my problem using a DynamoDB Table. Most items in the Table represent sales across multiple countries. One of my required access patterns is to retrieve all sales in countries which belong to a certain country_grouping between a range of order_dates. The incoming stream of sales data contains the country attribute, but not the country_grouping attribute.
Another entity in the same Table is an infrequently updated reference table, which could be used to identify the country_grouping for each country. Can I design a GSI or otherwise structure the table to retrieve all sales for a given country_grouping between a range of order dates?
Here's an example of the Table structure:
| PK | SK | sale_id | order_date | country | country_grouping |
|----|----|---------|------------|---------|------------------|
| SALE#ID#1 | ORDER_DATE#2022-06-01 | 1 | 2022-06-01 | UK | |
| SALE#ID#2 | ORDER_DATE#2022-09-01 | 2 | 2022-09-01 | France | |
| SALE#ID#3 | ORDER_DATE#2022-07-01 | 3 | 2022-07-01 | Switzerland | |
| COUNTRY_GROUPING#EU | COUNTRY#France | | | France | EU |
| COUNTRY_GROUPING#NATO | COUNTRY#UK | | | UK | NATO |
| COUNTRY_GROUPING#NATO | COUNTRY#France | | | France | NATO |
Possible solution 1
As the sales items are streamed into the Table, query the country_grouping associated with the country in the sale, and write the corresponding country_grouping to each sale record. Then create a GSI where country_grouping is the partition key and the order_date is the sort key. This seems expensive to me, consuming 1 RCU and 1 WCU per sale record imported. If country groupings changed (imagine the UK rejoins the EU), then I would run an update operation against all sales in the UK.
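For concreteness, the enrichment step might look roughly like this (a boto3 sketch; the table name, the in-memory country-to-grouping map, and the attribute names are assumptions based on the layout above):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Sales")  # hypothetical table name

# Hypothetical country -> grouping map, loaded once from the reference items
# (or maintained elsewhere) and cached, rather than read per sale record.
COUNTRY_TO_GROUPING = {"France": "EU", "UK": "NATO"}

def ingest_sale(sale_id: str, order_date: str, country: str) -> None:
    """Denormalise country_grouping onto the sale item as it is streamed in."""
    table.put_item(
        Item={
            "PK": f"SALE#ID#{sale_id}",
            "SK": f"ORDER_DATE#{order_date}",
            "sale_id": sale_id,
            "order_date": order_date,
            "country": country,
            # lets the proposed GSI (country_grouping, order_date) answer
            # "all sales in a grouping between two dates" with one Query
            "country_grouping": COUNTRY_TO_GROUPING[country],
        }
    )
```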
Possible solution 2
Have the application first query to retrieve every country in the desired country_grouping, then send an individual request for each country using a GSI where the partition key is country and the order_date is the sort key. Again, this seems less than ideal, as I consume 1 WCU per country, plus the 1 WCU to obtain the list of countries.
Is there a better way?
Picking an optimal solution depends on factors you haven't mentioned:
How quickly you need it to execute
How many sales records per country and country group you insert
How many sales records per country you expect there to be in the db at query time
How large a Sale item is
For example, if your Sale items are large and/or you insert a lot every second, you're going to need to worry about creating a hot key in the GSI. I'm going to assume your update rate is not too high, the Sale item size isn't too large, and you're going to have thousands or more Sale items per country.
If my assumptions are correct, then I'd go with Solution 2. You'll spend one read unit (it's not a WCU but rather an RCU, and it's only half a read unit if eventually consistent) to Query the country group and get a list of countries. Do one Query for each country in that group to pull all the Sale items matching the specific time range for that country. Since there are lots of matching sales, the cost is about the same. One 400 KB pull from a country_grouping PK is the same cost as four 100 KB pulls from four different country PKs. You can also do the country Query calls in parallel, if you want, to speed execution. If you're returning megabytes of data or more, this will be helpful.
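As a rough sketch of that flow with boto3 (the table name, GSI name, and attribute names are assumptions based on the layout in the question; pagination and per-thread clients are omitted for brevity):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Sales")  # hypothetical table name

def countries_in_grouping(grouping: str) -> list[str]:
    # Single cheap Query against the reference items, e.g. PK = COUNTRY_GROUPING#EU
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(f"COUNTRY_GROUPING#{grouping}")
    )
    return [item["country"] for item in resp["Items"]]

def sales_for_country(country: str, start: str, end: str) -> list[dict]:
    # Assumes a GSI (hypothetically "country-order_date-index") with country
    # as the partition key and order_date as the sort key.
    resp = table.query(
        IndexName="country-order_date-index",
        KeyConditionExpression=Key("country").eq(country)
        & Key("order_date").between(start, end),
    )
    return resp["Items"]

def sales_for_grouping(grouping: str, start: str, end: str) -> list[dict]:
    countries = countries_in_grouping(grouping)
    with ThreadPoolExecutor(max_workers=8) as pool:
        per_country = pool.map(lambda c: sales_for_country(c, start, end), countries)
    return [sale for batch in per_country for sale in batch]

# e.g. sales_for_grouping("EU", "2022-06-01", "2022-09-30")
```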
If in fact you have only a few sales per country, well, then any design will work.
Your solution 1 is probably best. The underlying issue is that the PK actually determines the physical location on a server (both for the original item and for its GSI copy). You duplicate data, because storage is cheap, to get better query performance.
So if, as you said, the UK rejoins the EU, you won't be modifying the GSI entries in place; DynamoDB will write a new entry in a different location because the partition key changed.
How about if you put the country_grouping in the SK of the sale?
For example COUNTRY_GROUPING#EU#ORDER_DATE#2022-07-01
Then you can do a "begins with" query and avoid the GSI, which would consume extra capacity units.
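A minimal sketch of such a query with boto3, assuming (hypothetically) that the sale items are written under a partition key the application can address directly and with the suggested composite sort key; the table name and key values below are made up for illustration:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Sales")  # hypothetical table name

# Hypothetical shared partition key for the sale items; the layout in the
# question keys sales by SALE#ID#<id>, so this assumes the sales are written
# (or duplicated) under a partition the application knows up front.
SALES_PK = "SALE"

# All EU sales in July 2022, via a prefix match on the composite sort key
resp = table.query(
    KeyConditionExpression=Key("PK").eq(SALES_PK)
    & Key("SK").begins_with("COUNTRY_GROUPING#EU#ORDER_DATE#2022-07")
)
print(len(resp["Items"]), "sales found")
```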
The country group lookup can be cached in memory to save some capacity units. Also, I wouldn't design my table around one-time events like the UK leaving. If that happens, do a full scan and update everything; it's a one-time operation, not a big deal.
Also, DynamoDB is not designed to store items for long periods of time. Typically you would store the sales for the past 30 days (for example), set a TTL on the items, and stream them to S3 (or BigQuery) once they expire.
I've read in the documentation that Cloud Firestore allows a maximum of 10,000 writes per second. But if a collection contains documents with sequential values in an indexed field, only 500 writes per second are allowed.
"Maximum write rate to a collection in which documents contain sequential values in an indexed field: 500 per second"
https://firebase.google.com/docs/firestore/quotas?hl=en
In order to increase the writes per second, a "shard" field should be added. The 500-writes-per-second limit is then multiplied by the number of shards.
https://cloud.google.com/firestore/docs/solutions/shard-timestamp?hl=en
My question is: does that mean that 20 shards will increase my writes per second to the maximum of 10,000, and that more shards are superfluous?
As I understand it, further shards would only make sense if I wanted to increase the writes per second for a single document. For example, for a counter that is split across several documents in order to avoid the limit of one write per second per document. (This scenario is not relevant for my purpose.)
I think it wouldn't be much of a hassle for me to implement 20 shards from the start, even though I may never need them. Just to make sure I won't have any problems with it in the future as the number of users increases.
I know that one downside would be more complicated queries. But I think I could easily avoid this in my app because of how my data is structured.
The page you linked has this example of determining the necessary number of shards:
After some research, you determine that the app will receive between 1,000 and 1,500 instrument updates per second. This surpasses the 500 writes per second allowed for collections containing documents with indexed timestamp fields. To increase the write throughput, you need 3 shard values, MAX_INSTRUMENT_UPDATES/500 = 3.
So it indeed seems that you can simply divide your necessary throughput by 500 (the maximum number of writes per shard) to get the number of shard values you'll need.
Don't forget that you'll need to also update the index definitions to drop the existing indexes on your sequential (in the example: timestamp) field, and add composite indexes on that field and the shard field. The 500/second throughput limit comes from the write speed of individual indexes, so it's actually having multiple composite indexes that increases the throughput.
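A minimal sketch of such a sharded write with the google-cloud-firestore client (collection and field names are made up; 3 shard values per the docs example above):

```python
import random

from google.cloud import firestore

db = firestore.Client()

SHARD_VALUES = 3  # ceil(1500 expected writes per second / 500 per shard)

def record_update(instrument_id: str, price: float) -> None:
    db.collection("instrument_updates").add(
        {
            "instrument": instrument_id,
            "price": price,
            "timestamp": firestore.SERVER_TIMESTAMP,  # the sequential, indexed field
            # a random shard value spreads writes across the composite
            # (shard, timestamp) indexes instead of one monotonic index
            "shard": random.randint(0, SHARD_VALUES - 1),
        }
    )
```

Reads on the timestamp field then have to restrict to, or fan out over, the shard values and merge the results, which is the query-side cost the linked solution page describes.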
If you have a lot of entries, you would usually keep a summary document holding the sum of the prices of all transactions, and read that single document to show the overall value in the overview.
My problem is that the value (price * amount) of each entry changes every 5 minutes. Because of that, I can't store a pre-computed sum across all documents.
To calculate the total, I would need the price I paid at that point in time and the amount. That is basically what I am already saving for each transaction, and saving it again somewhere else makes no sense.
I can't just keep a single document with one value that I update all the time, because each transaction was bought at a different price and for a different amount of the item.
I could have thousands of transactions, and the 20k Firestore limit would not be enough for that summary document.
The transactions view shows the latest 50 and is paged, which is fine for Firestore, but I can't read all documents just to compute the overall sum.
Is there any option other than reading all transactions? I thought maybe of using Firebase Cloud Storage and saving a file there with all transactions, just for the summary page.
How can I aggregate 10 records at a time in a collection aggregator? I can only send 10 records at a time to the target, so these 100 records should be split and aggregated as 10 records at a time.
My approach was to generate different correlation IDs for each set of records, but I am not able to write the MEL for that.
Splitter plus collection aggregator will not be an optimal solution for this. Use a batch process to split the records, and then, before sending them to the outbound endpoint, use a batch commit with a size of 10. This will solve your problem.
I would like to get a good understanding of the price (in $) of using the DynamoDB Titan backend. For this, I need to be able to understand when the DynamoDB Titan backend does reads and writes. Right now I am pretty clueless.
Ideally I would like to run a testcase which adds some vertices, edges and then does a rather simple traversal and then see how many reads and writes were done. Any ideas of how I can achieve this? Possibly through metrics?
If it turns out I can't extract this information myself, I would very much appreciate a first brief explanation about when DynamoDB Titan backend performs reads and writes.
For all Titan backends, to understand and estimate the number of writes, we rely on estimating the number of columns for a given KCVStore. You can also measure the number of columns that get written using metrics when using the DynamoDB Storage Backend for Titan.
To enable metrics, enable the configuration options listed here.
Specifically, enable lines 7-11.
Note the max-queue-length configuration property. If the executor-queue-size metric hits max-queue-length for a particular tx.commit() call, then you know that the queue / storage.buffer-size were not large enough. Once the executor-queue-size metric peaks without reaching max-queue-length, you know you have captured all the columns being written in a tx.commit() call, so that will give you the number of columns being changed in a tx.commit(). You can look at UpdateItem metrics for edgestore and graphindex to understand the spread of columns between the two tables.
All Titan storage backends implement KCVStore, and the keys and columns have different meanings depending on the kind of store. There are two stores that get the bulk of writes, assuming you have not turned on user-defined transaction logs. They are edgestore and graphindex.
The edgestore KCVStore is always written to, regardless of whether you configure composite indexes. Each edge, together with all of its edge properties, is represented by two columns (unless you set the schema of that edge label to be unidirectional). For the direct entry, the key is the edge's out-vertex and the column contains the in-vertex; for the reverse entry, the key is the in-vertex and the column contains the out-vertex. Each vertex is represented by at least one column for the VertexExists hidden property, one column for a vertex label (optional), and one column for each vertex property. The key of a vertex is the vertex id, and the columns correspond to vertex properties, hidden vertex properties, and labels.
The graphindex KCVStore will only be written to if you configure composite indexes in the Titan management system. You can index vertex and edge properties. For each pair of indexed value and edge/vertex that has that indexed value, there will be one column in the graphindex KCVStore. The key will be a combination of the index id and value, and the column will be the vertex/edge id.
Now that you know how to count columns, you can use this knowledge to estimate the size and number of writes to edgestore and graphindex when using the DynamoDB Storage Backend for Titan. If you use the multiple-item data model for a KCVStore, you will get one item for each key-column pair. If you use the single-item data model for a KCVStore, you will get one item for all columns at a key (this is not necessarily true when graph partitioning is enabled, but that is a detail I will not discuss now). As long as each vertex property is less than 1 KB, and the sum of all edge properties for an edge is less than 1 KB, each column will cost 1 WCU to write when using the multiple-item data model for edgestore. Again, each column in graphindex will cost 1 WCU to write if you use the multiple-item data model.
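As a back-of-the-envelope helper, the column-counting rules above can be turned into a small estimator. This is only a sketch under the stated assumptions (multiple-item data model, every column under 1 KB, bidirectional edge labels):

```python
def estimate_columns(
    num_vertices: int,
    avg_props_per_vertex: float,
    num_edges: int,
    labeled_vertex_fraction: float = 1.0,
    indexed_value_element_pairs: int = 0,
) -> dict:
    """Rough column counts per the rules described above."""
    # edgestore: 2 columns per edge, plus per vertex one VertexExists column,
    # an optional label column, and one column per vertex property
    edgestore = 2 * num_edges + num_vertices * (
        1 + labeled_vertex_fraction + avg_props_per_vertex
    )
    # graphindex: one column per (indexed value, vertex/edge) pair
    graphindex = indexed_value_element_pairs
    # with the multiple-item data model, ~1 WCU per column written
    return {"edgestore_columns": edgestore, "graphindex_columns": graphindex}

# e.g. estimate_columns(1_000, 3, 5_000, indexed_value_element_pairs=2_000)
```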
Let's assume you did your estimation and you use the multiple-item data model throughout. Let's also assume you estimate that you will be writing 750 columns per second to edgestore and 750 columns per second to graphindex, and that you want to drive this load for a day. You can set the read capacity for both tables to 1, so you know each table will start off with one physical DynamoDB partition. In us-east-1, the cost for writes is $0.0065 per hour for every 10 units of write capacity, so 24 * 75 * $0.0065 is $11.70 per day for writes for each table. This means the write capacity would cost $23.40 per day for edgestore and graphindex together. The reads could be set to 1 read per second for each of the tables, making the read cost 2 * 24 * $0.0065 = $0.312 for both tables per day. If your AWS account is new, the reads would fall within the free tier, so effectively you would only be paying for the writes.
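Spelled out as arithmetic (using the rates quoted above, which reflect provisioned-capacity pricing at the time and may have changed since):

```python
WRITE_RATE = 0.0065   # $ per hour for every 10 write capacity units (us-east-1, as quoted)
READ_RATE = 0.0065    # $ per hour for the minimal read capacity on a table (as quoted)

columns_per_second = 750                             # per table (edgestore and graphindex)
write_cost_per_table_per_day = 24 * (columns_per_second / 10) * WRITE_RATE
print(round(write_cost_per_table_per_day, 2))        # 11.70 per table per day
print(round(2 * write_cost_per_table_per_day, 2))    # 23.40 for both tables per day
print(round(2 * 24 * READ_RATE, 3))                  # 0.312 for reads on both tables per day
```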
Another aspect of DynamoDB pricing is storage. If you write 750 columns per second, that is 64.8 million items per day to one table, or 1.9 billion (approximately 2 billion) items per month. The average number of items in the table over the first month is then about 1 billion. If each item averages out to 412 bytes, and there are 100 bytes of overhead per item, that means roughly 1 billion 512-byte items are stored for a month, approximately 477 GB. 477 / 25 rounded up is 20, so storage for the first month at this load would cost 20 * $0.25 dollars a month. If you keep adding items at this rate without deleting them, the monthly storage cost will increase by approximately 5 dollars per month.
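And the storage estimate, reproducing the figures and the rate quoted above:

```python
import math

columns_per_second = 750
items_per_day = columns_per_second * 86_400        # 64,800,000 items per day
items_per_month = items_per_day * 30               # ~1.94 billion items per month
avg_items_first_month = 1_000_000_000              # table grows from 0 to ~2 billion

item_size_bytes = 412 + 100                        # average item plus per-item overhead
avg_storage_gib = avg_items_first_month * item_size_bytes / 2**30
print(round(avg_storage_gib))                      # ~477

# at the storage rate quoted above ($0.25 per 25 GB increment)
print(math.ceil(avg_storage_gib / 25) * 0.25)      # ~$5 for the first month
```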
If you do not have super nodes in your graph, or vertices with a relatively large number of properties, then the writes to the edgestore will be distributed evenly throughout the partition key space. That means your table will split into 2 partitions when it hits 10 GB, each of those partitions will split again into a total of 4 when they hit 10 GB, and so on. The smallest power of 2 above 477 GB / (10 GB per partition) is 2^6 = 64, so your edgestore would split about 6 times over the course of the first month, and you would probably have around 64 partitions at the end of it. Eventually, your table will have so many partitions that each partition has very few IOPS. This phenomenon is called IOPS starvation, and you should have a strategy in place to address it. Two commonly used strategies are 1. batch cleanup/archival of old data and 2. rolling (time-series) graphs. In option 1, you spin up an EC2 instance to traverse the graph, write old data to a colder store (S3, Glacier, etc.), and delete it from DynamoDB. In option 2, you direct writes to graphs that correspond to a time period (weeks - 2015W1, months - 2015M1, etc). As time passes, you down-provision the writes on the older tables, and when the time comes to migrate them to colder storage, you read the entire graph for that time period and delete the corresponding DynamoDB tables. The advantage of this approach is that it lets you manage your write provisioning cost with higher granularity, and it avoids the cost of deleting individual items (because deleting a table is free, whereas deleting each item costs at least 1 WCU).