ElastiCache vs DynamoDB DAX - amazon-dynamodb

I have a use case where I write data in DynamoDB to two tables, say t1 and t2, in a transaction. My app needs to read data from these tables many times (1 write, at least 4 reads). I am considering DAX vs ElastiCache. Does anyone have any suggestions?
Thanks in advance
K

ElastiCache has no built-in integration with DynamoDB; you would have to manage cache population and invalidation yourself.
DAX is good for read-heavy apps, like yours. But be aware that DAX only helps with eventually consistent reads, so don't use it with banking apps, etc. where the info always needs to be perfectly up to date. Without further info it's hard to say more; these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/

AWS recommends DAX as the solution for this requirement.
ElastiCache is the older, general-purpose option; it is typically used to store session state in addition to cached data.
DAX is designed for read-intensive, latency-sensitive applications and serves eventually consistent reads. DAX maintains two separate caches:
Item cache - populated with items based on GetItem results.
Query cache - keyed by the parameters used in Query or Scan requests.
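For illustration, here is a minimal sketch of hitting both caches with the amazondax Python client; the cluster endpoint, table names, and keys are placeholders, not anything from the question:

```python
# Sketch only: assumes the amazon-dax-client package (imported as amazondax)
# and a running DAX cluster; endpoint, table, and key names are placeholders.
import botocore.session
import amazondax

session = botocore.session.get_session()
dax = amazondax.AmazonDaxClient(
    session,
    region_name="us-east-1",
    endpoints=["my-cluster.abc123.dax-clusters.us-east-1.amazonaws.com:8111"],
)

# GetItem goes through (and populates) the item cache.
item = dax.get_item(TableName="t1", Key={"pk": {"S": "user#123"}})

# Query goes through (and populates) the query cache,
# keyed on the full set of query parameters.
page = dax.query(
    TableName="t2",
    KeyConditionExpression="pk = :pk",
    ExpressionAttributeValues={":pk": {"S": "user#123"}},
)
```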
Cheers!

I'd recommend using DAX with DynamoDB, provided most of your read calls use the item-level API, such as GetItem (and NOT the query-level API).
Why? DAX has one surprising behavior. From the AWS documentation:
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
To elaborate: if a Query result has been cached, and a later write changes items covered by that query, the query cache keeps serving the old result until the cached entry expires.
This out-of-sync issue is also discussed here.
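To make that concrete, here is a hedged sketch; the table, key, and attribute names are hypothetical, and `dax` stands for a DAX client such as amazondax.AmazonDaxClient:

```python
# Sketch only: `dax` is assumed to expose the low-level DynamoDB API
# (query/put_item); the table and attributes are hypothetical.
def show_query_cache_staleness(dax, table_name="t1"):
    query_args = dict(
        TableName=table_name,
        KeyConditionExpression="pk = :pk",
        ExpressionAttributeValues={":pk": {"S": "order#1"}},
    )

    # 1. The result of this Query is stored in the DAX query cache.
    before = dax.query(**query_args)

    # 2. This write goes to DynamoDB and refreshes the DAX item cache ...
    dax.put_item(
        TableName=table_name,
        Item={"pk": {"S": "order#1"}, "status": {"S": "shipped"}},
    )

    # 3. ... but the query cache is not invalidated, so until the cached
    #    entry's TTL expires this can still return the pre-write result.
    after = dax.query(**query_args)
    return before["Items"], after["Items"]
```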

I find DAX useful only for cached queries, PutItem, and GetItem. In general it is very difficult to find a use case for it.
DAX keeps Query/Scan results separate from CRUD operations on individual items. That means if you update an item and then run a Query or Scan, the results will not reflect the change.
You can't invalidate the cache explicitly; entries are only evicted when their TTL is reached or when a node's memory is full and old items are dropped.
Takeaways:
Doing puts/updates and then queries - two separate caches, so they go out of sync.
Looking for a single item - you are left with only the primary key, the default index, and a GetItem request (no Query with Limit 1). You can't use any indexes for gets/updates/deletes.
Using the ConsistentRead option on a Query to get the latest data works, but only for the primary index.
Writing through DAX is slower than writing directly to DynamoDB, since there is an extra hop in the middle.
X-Ray does not work with DAX.
Use case:
You have queries where you don't really care that the results may be out of date.
You are doing few PutItem/UpdateItem calls and a lot of GetItem calls.

Related

Delete items in Cosmos DB with spare RUs

We use Cosmos DB to track all our devices. Data that is related to a device (and not stored in the device document itself) is stored in the same container under the same partition key.
Both the device document and the related documents have /deviceId as the partition key. When a device is removed, I remove the device document. I actually want to remove the entire partition, but this doesn't seem to be possible, so I fall back to a query for all items with this partition key and remove them from the database.
This works fine, but may consume a lot of RUs if there is a lot of related data (which may be true in some cases). I would rather just remove the device and schedule all related data for removal later (it doesn't hurt to have it in the database for a while). When RU utilization is low, I would then start removing these items. Is there a standard solution for this?
The ideal solution would be to schedule this and have Cosmos DB process the deletions when it has spare RUs, just like TTL-based deletion. Is this even possible?
A feature is now in preview to delete all items by partition key, using a fire-and-forget background processing model with a limited amount of available throughput. There's a signup link on the feature request page to get access to the preview.
Currently, the API looks like a new DeleteAllItemsByPartitionKey method in the SDK.
It definitely is possible to set a TTL and then let Cosmos handle expiring data out of the container when it is idle. However, the cost to update the document in the first place is about what it costs to delete it anyway so you're not gaining much.
An approach along the lines you suggest may be to have a separate container (or even a queue) where you insert a new item with the deviceId to retire. Then, in the evenings or at a time when you know the system is idle, run a job that reads the next deviceId from the queue, queries for all items with that partition key, and either deletes the data or sets a TTL to expire it.
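A rough sketch of that pattern with the azure-cosmos Python SDK; the account URL, container names, the "retirements" queue container, and the 7-day TTL are assumptions for illustration, and the container must have TTL enabled:

```python
# Sketch only: assumes the azure-cosmos (v4) Python SDK, TTL enabled on the
# "devices" container, and a separate "retirements" container used as a queue.
from azure.cosmos import CosmosClient

client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("iot")
devices = database.get_container_client("devices")
retirements = database.get_container_client("retirements")

def retire_device(device_id: str) -> None:
    # Cheap operation at removal time: just record that this partition should go.
    retirements.upsert_item({"id": device_id, "deviceId": device_id})

def cleanup_when_idle() -> None:
    # Run during quiet hours: expire every item in each retired device's partition.
    for job in retirements.query_items("SELECT * FROM c", enable_cross_partition_query=True):
        device_id = job["deviceId"]
        # Scoped to one partition key, so no cross-partition fan-out.
        for item in devices.query_items("SELECT * FROM c", partition_key=device_id):
            item["ttl"] = 7 * 24 * 3600  # let Cosmos TTL remove it in the background
            devices.upsert_item(item)
        retirements.delete_item(job, partition_key=device_id)
```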
There is a feature to delete an entire partition in the works that would be perfect for this scenario (in fact, it's designed for it) but no ETA on availability.

Efficient way to Increment Vertex counter property in janusgraph

I am using JanusGraph 0.2.0 with a Cassandra backend and Elasticsearch.
I want to store the number of views in a vertex property, and I need an efficient and scalable way to increment/store the view count without impacting read performance. Options I have considered:
Option 1: Read the views property from the graph while fetching the vertex, and update the new view count in another query. (Won't impact read performance, but the counter is not synchronised.)
g.V().has("key","keyId").valueMap(true);
g.V(id).property('views', 21);
Option 2: Use sack to carry the value 1 and add it to the views property.
g.withSack(0).V().has("key","keyId").
sack(assign).by("views").sack(sum).by(constant(1)).
property("views", sack())
Option 3: Use in-memory storage (Redis) to increment counters, and persist the updates to the graph periodically.
Is there any other, better approach?
Is there any way to use Cassandra's counter functionality in JanusGraph?
There is no way to use Cassandra counters with JanusGraph. What's more, you can't use Cassandra counters in a regular Cassandra table either: counter columns only live in dedicated counter tables. Cassandra counters are designed so that updating a counter doesn't require a lock, which is why you get a lot of limitations in exchange for great performance.
Counting views isn't an easy task. In short, my suggestion would be to go with option 3.
I would go with Redis and periodic updates to JanusGraph when you are in a single data center and a single Redis master can handle all requests (you can of course use a hash ring to split your counters across several Redis servers, but that increases complexity and maintenance cost); see the sketch at the end of this answer.
If you have multiple data centers, or a single Redis master cannot handle all requests, I would go with Cassandra counters.
If the volume of view events is so large that even Cassandra counters (with their cache) cannot keep up, because the disk is accessed too many times and further scaling is too expensive, the logic gets harder. I have never been in that situation, so this is purely theoretical. In that case I would have the application servers cache and group views and periodically send the batched data to RabbitMQ workers, which would update the Cassandra counters and then update the corresponding vertex in JanusGraph with the total view count. Frequently viewed vertices would then be batched, so instead of updating a counter with +1 each time you update it with +100 or +1000 in a single operation. This decreases disk usage dramatically and gives you fast, eventually consistent counters. Again, this solution is only theoretical and should be tested. I believe other solutions exist as well.
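As a minimal sketch of option 3 (not a drop-in implementation): Redis takes the per-view increments on the hot path, and a periodic job drains the counters and applies the accumulated delta to the vertex through Gremlin Server. The key scheme, endpoints, and the assumption that the 'views' property already exists are all illustrative.

```python
# Sketch only: assumes redis-py and gremlinpython, a local Redis and Gremlin
# Server, and a "views:<vertexKey>" key scheme (all made up for illustration).
import redis
from gremlin_python.driver import client

r = redis.Redis(host="localhost", port=6379)
gremlin = client.Client("ws://localhost:8182/gremlin", "g")

def record_view(key_id: str) -> None:
    # Atomic, lock-free increment on the hot path; the graph is not touched.
    r.incr(f"views:{key_id}")

def flush_views_to_graph() -> None:
    # Run from a periodic job (e.g. every minute): drain each counter and
    # apply the accumulated delta to the vertex in a single traversal.
    for redis_key in r.scan_iter("views:*"):
        delta = int(r.getset(redis_key, 0) or 0)
        if delta == 0:
            continue
        key_id = redis_key.decode().split(":", 1)[1]
        gremlin.submit(
            "g.withSack(0).V().has('key', keyId)."
            "sack(assign).by('views').sack(sum).by(constant(delta))."
            "property('views', sack()).iterate()",
            {"keyId": key_id, "delta": delta},
        ).all().result()
```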

DynamoDB conflict resolution strategy

What is the conflict resolution strategy for DynamoDB? The Dynamo white paper talks about GetItem returning multiple versions to be resolved by the client.
This SO question says that Dynamo and DynamoDB are different and that GetItem returns only one value. In that case, what conflict resolution strategy does DynamoDB employ?
See this
"Conflicts can arise if applications update the same item in different regions at about the same time. To ensure eventual consistency, DynamoDB global tables use a “last writer wins” reconciliation between concurrent updates, where DynamoDB makes a best effort to determine the last writer. With this conflict resolution mechanism, all of the replicas will agree on the latest update, and converge toward a state in which they all have identical data."
So the latest write wins, based on DynamoDB's best-effort determination of the last writer across the replicas.
As stated, your question is not very clear: "What is the conflict resolution strategy for DynamoDB" - what conflicts? Are you referring to potentially inconsistent reads?
DynamoDB, for GetItem queries, allows both eventually consistent and strongly consistent reads, configurable with a parameter on the request (as described in the docs here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html). For strongly consistent reads the value returned is the most recent value at the time the query was executed. For eventually consistent reads it is possible to read a slightly out-of-date version of an item, but there is no "conflict resolution" per se.
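For illustration, a small boto3 sketch of that per-request choice (table and key names are made up):

```python
# Sketch only: table and key names are placeholders.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Default: eventually consistent read (may return slightly stale data).
eventual = dynamodb.get_item(
    TableName="orders",
    Key={"pk": {"S": "order#1"}},
)

# Strongly consistent read: returns the most recent committed value,
# at roughly double the read-capacity cost (not available on GSIs,
# and not served from DAX's cache).
strong = dynamodb.get_item(
    TableName="orders",
    Key={"pk": {"S": "order#1"}},
    ConsistentRead=True,
)
```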
You may be thinking of conditional updates, which allow a request to fail if an expected condition is not met at the time the request is executed.
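If that is what you're after, a minimal boto3 sketch of a conditional update (the version attribute and values are hypothetical) looks like this:

```python
# Sketch only: a conditional update that fails instead of silently overwriting
# a concurrent change (a simple optimistic-locking pattern).
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

try:
    dynamodb.update_item(
        TableName="orders",
        Key={"pk": {"S": "order#1"}},
        UpdateExpression="SET #st = :new, version = :next",
        ConditionExpression="version = :expected",
        ExpressionAttributeNames={"#st": "status"},
        ExpressionAttributeValues={
            ":new": {"S": "shipped"},
            ":next": {"N": "2"},
            ":expected": {"N": "1"},
        },
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        # Someone else updated the item first; re-read and retry, or give up.
        pass
    else:
        raise
```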

Is it ok to build architecture around regular creation/deletion of tables in DynamoDB?

I have a messaging app where all messages are arranged into seasons by creation time. There could be billions of messages in each season. I need to delete the messages of old seasons. I thought of a solution that involves DynamoDB table creation/deletion, like this:
Each table contains messages of only one season
When a season becomes 'old' and its messages are no longer needed, the table is deleted
Is this a good pattern, and is it encouraged by Amazon?
PS: I'm asking because I'm wary of two things I've run into in other Amazon services:
In Amazon S3 you have to delete every object before you can delete the bucket. When you have billions of objects, that becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour': when using the SQS API you can act badly with regard to the SQS infrastructure (for example, not polling messages) and can be penalized for it.
Yes, this is an acceptable design pattern; it actually follows a best practice put forward by the AWS team. But there are things to consider for your specific use case:
AWS has a default limit of 256 tables per region, but this can be raised. If you expect to need multiple orders of magnitude more than this, you should probably re-evaluate.
You can delete a DynamoDB table that still contains records; if you have a large number of records that you regularly have to delete, using a rolling set of tables is actually a best practice.
Creating and deleting tables are asynchronous operations, so you do not want your application to depend on the time these operations take to complete. Make sure you create tables well in advance of needing them. Under normal circumstances a table is created within a few seconds to a few minutes, but in very, very rare outage circumstances I've seen it take hours.
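For example, a season-rollover job along these lines might look roughly like the following boto3 sketch; the messages_<season> naming, key schema, and billing mode are assumptions, not anything prescribed by AWS:

```python
# Sketch only: creates next season's table ahead of time and drops an expired one.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def create_season_table(season: str) -> None:
    dynamodb.create_table(
        TableName=f"messages_{season}",
        AttributeDefinitions=[
            {"AttributeName": "conversation_id", "AttributeType": "S"},
            {"AttributeName": "created_at", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "conversation_id", "KeyType": "HASH"},
            {"AttributeName": "created_at", "KeyType": "RANGE"},
        ],
        BillingMode="PAY_PER_REQUEST",
    )
    # Table creation is asynchronous; block until it is ACTIVE before using it.
    dynamodb.get_waiter("table_exists").wait(TableName=f"messages_{season}")

def drop_season_table(season: str) -> None:
    # Deleting the table removes all of its items in one call, instead of
    # billions of individual DeleteItem requests.
    dynamodb.delete_table(TableName=f"messages_{season}")
    dynamodb.get_waiter("table_not_exists").wait(TableName=f"messages_{season}")
```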
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states...
"You can save on resources by storing 'hot' items in one table with higher throughput settings, and 'cold' items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations."
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size or how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know whether it will take 2 seconds, 2 minutes, or 20 minutes), but as long as your solution does not depend on this sort of timing you're fine.
In fact, the idea of sharding your data based on age has the potential to significantly improve the performance of your application, and it will definitely help you control your costs.

Riak and time-sorted records

I'd like to sort some records stored in Riak by a function of each record's score and "age" (current time - creation date). What is the best way to do a "time-sensitive" query in Riak? Thus far, the options I'm aware of are:
Realtime MapReduce - Do the entire calculation in a MapReduce job at query time
ETL job - Periodically do the query in a background job, and store the result back into Riak
Punt it to the app layer - Don't sort using Riak at all, and instead use an application-level layer to sort and cache the records.
MapReduce seems the best on paper; however, I've read mixed reports about the real-world latency of Riak MapReduce.
MapReduce is quite an expensive operation and is not recommended as a real-time querying tool. It works best when run over a limited set of data in batch mode, where the number of concurrent MapReduce jobs can be controlled, so I would not recommend the first option.
Having a process periodically process/aggregate data for a specific time slice, as described in the second option, could work and would allow efficient access to the prepared data through direct key access. The aggregation process could, if you are using the LevelDB backend, be based on a secondary index holding a timestamp. One downside, however, could be that newly inserted records may not show up in the results immediately, which may or may not be a problem in your scenario.
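A rough sketch of feeding that aggregation, assuming the official Python Riak client and a LevelDB backend with secondary indexes enabled; the bucket and index names are made up:

```python
# Sketch only: store records with an integer timestamp secondary index (2i)
# and let a periodic job pull one time slice by index range.
import time
import riak

client = riak.RiakClient(pb_port=8087)
bucket = client.bucket("records")

def store_record(key: str, data: dict) -> None:
    obj = bucket.new(key, data=data)
    # The "_int" suffix marks an integer secondary index in Riak.
    obj.add_index("created_at_int", int(time.time()))
    obj.store()

def keys_for_last_hour():
    now = int(time.time())
    # Range query on the secondary index returns matching keys directly,
    # avoiding a full-bucket MapReduce at query time.
    return bucket.get_index("created_at_int", now - 3600, now)
```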
If you need the computed records to be accurate and will perform a significant number of these queries, you may be better off updating the computed summary records as part of the writing and updating process.
In general it is a good idea to make sure that you can get the data you need as efficiently as possible, preferably through direct key access, and then perform filtering of unneeded data, as well as sorting and aggregation, on the application side.
