Purging technique for DynamoDB - amazon-dynamodb

I am new to Amazon DynamoDB, with a strong background in relational databases :-p
I am writing a service on AWS Lambda that migrates data from DynamoDB to Redshift for analytics. My aim is to keep only active data, say one month's worth, in DynamoDB and purge the rest periodically.
I have researched a lot but could not find a precise purging technique for DynamoDB that avoids a full table scan.
I also want to delete based on the range key attribute, which is a timestamp.
Can somebody help me out here?
Thanks

In my experience, the easiest and most cost-effective way to handle this is to create a new table each month and drop whole old tables once enough time has passed and you are done crunching them.
If you can make your use case work with a TABLE-MMYYYY naming scheme, it will help you a lot.
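As a rough illustration of the table-per-month idea, the sketch below computes the monthly table name and works out which existing tables fall outside the retention window. The `EVENTS` prefix and the `keep_months` parameter are assumptions made for the example; no AWS calls are made here.

```python
from datetime import date


def table_name(prefix: str, day: date) -> str:
    """Name of the monthly table holding data for `day`, in PREFIX-MMYYYY form."""
    return f"{prefix}-{day:%m%Y}"


def tables_to_drop(existing: list[str], prefix: str, today: date,
                   keep_months: int = 1) -> list[str]:
    """Return the monthly tables that are older than the retention window."""
    # Names to keep: the current month plus the previous `keep_months` months.
    keep = set()
    year, month = today.year, today.month
    for _ in range(keep_months + 1):
        keep.add(table_name(prefix, date(year, month, 1)))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    # Only consider tables that follow the naming scheme for this prefix.
    return sorted(t for t in existing
                  if t.startswith(prefix + "-") and t not in keep)


existing = ["EVENTS-122023", "EVENTS-012024", "EVENTS-022024",
            "EVENTS-032024", "OTHER-012024"]
expired = tables_to_drop(existing, "EVENTS", date(2024, 3, 15))
```

In a scheduled Lambda, each name in `expired` could then be passed to boto3's `delete_table(TableName=...)`, which removes the whole table without scanning or deleting individual items.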

Related

DynamoDB usable for largeish event table?

I'm thinking of re-architecting an RDS model to a DynamoDB one, and it appears to be mostly working using a single-table design. We have, however, a log table that can contain 5-10 million rows that are queried on many attributes.
Is there any pattern that might be applicable in migrating to DynamoDB, or is this a case where full scans would be required and we would be better off keeping the log data in a relational table?
Thanks in advance,
Nik
The phrases "log" and "queried on many attributes" suggest to me that DynamoDB is not the best solution for your log data. If the number of distinct queries is fairly limited and well known in advance, you might be able to design your keys to fit your access patterns.
For example, if you commonly query on Color and Quantity attributes, you could design a key like COLOR#Red#QTY#25. You could similarly use local or global secondary indexes for queries involving other attributes.
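A small sketch of how such a composite key could be assembled is shown below. The helper names `composite_key` and `key_prefix` are made up for illustration; the prefix form is what you would typically pass to a `begins_with` condition on a sort key.

```python
def composite_key(**attrs: object) -> str:
    """Join attribute names and values into a key such as COLOR#Red#QTY#25."""
    parts = []
    for name, value in attrs.items():
        # Keyword arguments keep their order, so the leading attribute
        # is the one you can later query by prefix.
        parts.extend([name.upper(), str(value)])
    return "#".join(parts)


def key_prefix(**attrs: object) -> str:
    """Prefix matching every key that starts with the given attributes."""
    return composite_key(**attrs) + "#"


full_key = composite_key(color="Red", qty=25)
red_prefix = key_prefix(color="Red")
```

Because the key is an ordered string, this only helps queries that fix the leading attributes; a query on quantity alone would need a different key or an index.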
But it is not a great solution if you have many attributes that you need to query arbitrarily.
Alternative Solution: Another serverless option to consider is storing your log data in S3 and using Athena to query it using SQL.
You will likely be trading away a bit of latency and speed by taking this approach compared to RDS and DynamoDB. But queries against log data often don't need millisecond response times, so it can cover a lot of use cases.
Data modelling for DynamoDB
Write down all of your access patterns, in order of priority/most used
Research models which are similar to your use-case
Download NoSQL Workbench and create test models where you can visualize your ideas
Run commands against DynamoDB Local and test that your access patterns are fulfilled.
Access Patterns
Your access patterns will ultimately decide if DynamoDB will suit your needs. If you need to query based on multiple fields you can have up to 20 Global Secondary Indexes which will give you some flexibility, but usually if you exceed 8-10 indexes then DynamoDB may not be a good choice or the schema is badly designed.
Use smart designs with sort-key and index-key overloading; this lets you group the data better and makes your access patterns more efficient.
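To make sort-key overloading a little more concrete, here is an in-memory sketch, not a DynamoDB call. The `PK`/`SK` attribute names, the `SERVICE#` and `LOG#` prefixes, and the `query` helper are all assumptions for illustration; the helper only mimics a key-condition query with a `begins_with` on the sort key.

```python
# Different item types share one partition and are told apart by SK prefix.
ITEMS = [
    {"PK": "SERVICE#api", "SK": "LOG#2024-03-01T10:00:00Z", "level": "ERROR"},
    {"PK": "SERVICE#api", "SK": "LOG#2024-03-02T09:30:00Z", "level": "INFO"},
    {"PK": "SERVICE#api", "SK": "CONFIG#retention", "days": 30},
    {"PK": "SERVICE#web", "SK": "LOG#2024-03-01T11:00:00Z", "level": "INFO"},
]


def query(items: list[dict], pk: str, sk_prefix: str = "") -> list[dict]:
    """Mimic a query: exact match on PK, prefix match on SK, sorted by SK."""
    matches = [item for item in items
               if item["PK"] == pk and item["SK"].startswith(sk_prefix)]
    return sorted(matches, key=lambda item: item["SK"])


api_logs = query(ITEMS, "SERVICE#api", "LOG#")
api_logs_march_1 = query(ITEMS, "SERVICE#api", "LOG#2024-03-01")
api_config = query(ITEMS, "SERVICE#api", "CONFIG#")
```

The same partition serves three access patterns here (all logs, logs for a day, configuration) without an extra index, which is the grouping benefit the overloading is meant to provide.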
Log Data Use-case
Storing log data is a pretty common use-case for DynamoDB, and many AWS customers use it for that sole purpose. But I can't overemphasize the importance of understanding your access patterns and working backwards from those to create your model.
Alternatives
If you require query capability or free text search ability, then you could use DynamoDB integrations with OpenSearch (via Lambda/EventBridge) for example, with OpenSearch providing you the flexibility for your queries.
This doesn't seem like a good use case. I have done it and wasn't at all happy with the result; I now load log-like data into Elasticsearch and am much happier.
In my case, I insert the data into DynamoDB to archive it, but also feed it into ES. If I ever kill my ES cluster, I can reload all or some of the data from DynamoDB.

Most efficient way to change synthetic partition key values

I have a collection with thousands of documents, all of which have a synthetic partition key property like:
partitionKey: 'some-document-related-value'
Now I need to change the values of partitionKey. Of course, doing so requires recreating the documents, but I am wondering what the most efficient and straightforward way to do it is.
Should I use an Azure Function with a CosmosDBTrigger (set to start the feed from the beginning)?
The change feed processor?
Some other way?
I'm looking for the quickest solution that's still reliable.
Yes, the change feed is a common way to migrate data from one container to another. Another simple option may be to use the Data Migration Tool, where you build your new partition key in the select statement.
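Whichever mechanism reads the documents, the re-keying step itself is small. The sketch below shows only that step on plain dictionaries; the `tenantId` and `type` fields and the `build_key` rule are hypothetical, and no Cosmos DB SDK calls are shown.

```python
def build_key(doc: dict) -> str:
    """Hypothetical rule for the new synthetic partition key."""
    return f"{doc['tenantId']}-{doc['type']}"


def rekey(doc: dict, key_builder=build_key) -> dict:
    """Return a copy of `doc` with its partitionKey recomputed."""
    new_doc = dict(doc)  # shallow copy, so the source document stays untouched
    new_doc["partitionKey"] = key_builder(doc)
    return new_doc


def migrate(source_docs: list[dict], key_builder=build_key) -> list[dict]:
    """Produce the documents to write into the new container."""
    return [rekey(doc, key_builder) for doc in source_docs]


source = [
    {"id": "1", "tenantId": "t1", "type": "order", "partitionKey": "old-a"},
    {"id": "2", "tenantId": "t2", "type": "invoice", "partitionKey": "old-b"},
]
migrated = migrate(source)
```

In a change feed handler, `rekey` would be applied to each incoming document before it is written to the target container.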
Hopefully this is helpful.

DynamoDB for an evolving application

We are considering using DynamoDB as our back end for our new multi-tenant SaaS application. This application is still very nascent and will evolve over the next few years. We do not know all the entities yet, and the entities we do know will also evolve. Considering these points, is it a good idea to use DynamoDB?
My biggest concern is the fact that we cannot add an LSI to an existing table. So, if my entity were to add a new attribute that needs to be used in a filter, we'd have to create a GSI, which costs as much as another table.
Please share your thoughts/experiences in this regard.
The key consideration with DynamoDB: do you understand how you will need to access the data?
If most of your access will be by key, with a few well-defined queries, DynamoDB might be a decent fit.
Here's a useful slide from one of the Dynamo presentations at AWS Summit

CosmosDB transactions per partition

I am looking at CosmosDB's partitioning facility, and what I have gathered so far is that it is good for performance. It can really help us avoid fan-out queries, but I am stuck on one question about partitioning. On the write side, if I have different types of documents, possibly thousands of them, that belong to the same partition, the write operation will be slow; but if I give them different partition keys, I lose transactional behaviour, because stored procedures are scoped to a single partition.
My use case is that I have different types of documents within the same collection, and at any given time I will be updating and inserting thousands of documents of different types. I have to do that within the same transaction, which means I have to use the same partition key, but then I will be doing hot writes to one partition, which is not recommended in CosmosDB. Any help on how to achieve this will be appreciated.
People use stored procedures to batch their documents, and today that does constrain you to one partition. However, be aware of the other consideration: your partition key should be chosen so that your documents fan out across different partitions. So one batch can be for one partition key and the next batch for another.
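A minimal sketch of that batching approach: group the documents by partition key, then cut each group into batches so that every batch touches exactly one partition. The `tenantId` key and the batch size are assumptions for illustration, and the call that would hand each batch to a stored procedure is left out.

```python
from collections import defaultdict
from typing import Iterator


def batches_by_partition(docs: list[dict], key: str,
                         batch_size: int) -> Iterator[tuple[str, list[dict]]]:
    """Yield (partition key value, batch) pairs, one partition per batch."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc[key]].append(doc)
    for pk, group in grouped.items():
        # Split each partition's documents into chunks of at most batch_size.
        for start in range(0, len(group), batch_size):
            yield pk, group[start:start + batch_size]


docs = [
    {"id": "1", "tenantId": "A"},
    {"id": "2", "tenantId": "B"},
    {"id": "3", "tenantId": "A"},
    {"id": "4", "tenantId": "A"},
]
batches = list(batches_by_partition(docs, key="tenantId", batch_size=2))
```

Each batch can be transactional on its own, but there is no transaction spanning batches, so a failure part-way through leaves earlier batches committed.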
Read more here:
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Hope this helps.
Rafat
It's tricky. I do have a large set of docs within a single partition at the moment; maybe later on I will need to redesign the collection. Right now I am using a bulk insert/update library for CosmosDB (https://learn.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview). It is way faster for large data inserts/updates and is a Microsoft-backed library; however, it supports transactional behaviour only within a single partition. So at the moment, I am safe.

What data store technology/solution allows very fast inserts, lookups and 'selects'

Here's my problem.
I want to ingest lots and lots of data .... right now millions and later billions of rows.
I have been using MySQL and I am playing around with PostgreSQL for now.
Inserting is easy, but before I insert I want to check whether that particular record exists; if it does, I don't want to insert it. As the DB grows, this operation (obviously) takes longer and longer.
If my data were in a hash map, the lookup would be O(1), so I thought I'd create a hash index to help with lookups. But then I realised that if I have to compute the hash again every time, I will slow the process down massively (and if I don't compute the index, I don't have O(1) lookup).
So I am in a quandary: is there a simple solution? Or a complex one? I am happy to try other datastores; however, I need to be able to do reasonably complex queries, e.g. something similar to SELECT statements with WHERE clauses, so I am not sure whether NoSQL solutions are applicable.
I am very much a novice, so I wouldn't be surprised if there is a trivial solution.
NoSQL stores are good at handling huge volumes of inserts and updates.
MongoDB has a really good feature for update/insert (called upsert) that acts based on whether the document already exists.
Check out this page from the MongoDB docs:
http://www.mongodb.org/display/DOCS/Updating#Updating-UpsertswithModifiers
Also, you can check out safe mode on the Mongo connection, which you can set to false to get more efficient inserts.
http://www.mongodb.org/display/DOCS/Connections
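The underlying idea, letting the datastore decide in a single statement whether a record is new, is not specific to MongoDB. The sketch below is a stand-in that uses SQLite from Python's standard library rather than MongoDB: a primary key plus `INSERT OR IGNORE` skips duplicates without a separate lookup. The `records` table and its columns are made up for the example; PostgreSQL offers a comparable `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key gives the database an index to detect duplicates with.
conn.execute("CREATE TABLE records (record_key TEXT PRIMARY KEY, payload TEXT)")


def insert_if_absent(conn: sqlite3.Connection, record_key: str, payload: str) -> bool:
    """Insert the row unless its key already exists; return True if inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO records (record_key, payload) VALUES (?, ?)",
        (record_key, payload),
    )
    return cur.rowcount == 1


first = insert_if_absent(conn, "abc", "original")
second = insert_if_absent(conn, "abc", "duplicate")
stored = conn.execute(
    "SELECT payload FROM records WHERE record_key = ?", ("abc",)
).fetchone()[0]
```

The existence check still costs an index lookup, but it happens once inside the insert instead of as an extra round trip before it.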
You could use CouchDB. It's NoSQL, so you can't do queries per se, but you can create design documents that let you run map/reduce functions on your data.
