Querying DynamoDb timestamp data in range - amazon-dynamodb

I want to migrate my data from DynamoDB to Redshift. I don't want to scan the whole table at once, as this might result in throttling.
My table is as follows:
accountId (hash key), lastUpdatedTime.
I thought I could create a GSI on lastUpdatedTime and then query for the data between day 1 and day 5, and the next day query for the data between day 6 and day 7.
But my understanding is that even with a GSI, it will scan the whole table, as I won't have any hash key to provide; I just have a range of timestamps to query.

Creating a GSI is indeed the right solution. However, the GSI can be slow and expensive to build if you set it to project all attributes. I would recommend creating the GSI on lastUpdatedTime and projecting only the table keys by using the KEYS_ONLY projection type. Then, when you scan the index during the migration, you will retrieve only the item keys, and you can fetch each full item from the base table as you go.
I recommend reading up on GSIs here: https://docs.aws.amazon.com/fr_fr/amazondynamodb/latest/developerguide/GSI.html
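As a minimal sketch with the AWS SDK for Java (v1), adding such a KEYS_ONLY index could look roughly like this. The table name, index name, throughput values, and the assumption that lastUpdatedTime is a string are all illustrative:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class AddKeysOnlyGsi {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Assumed: lastUpdatedTime is stored as a string (e.g. ISO 8601)
        CreateGlobalSecondaryIndexAction createGsi = new CreateGlobalSecondaryIndexAction()
                .withIndexName("lastUpdatedTime-index")               // hypothetical index name
                .withKeySchema(new KeySchemaElement("lastUpdatedTime", KeyType.HASH))
                .withProjection(new Projection().withProjectionType(ProjectionType.KEYS_ONLY))
                .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L));

        client.updateTable(new UpdateTableRequest()
                .withTableName("MyTable")                             // hypothetical table name
                .withAttributeDefinitions(
                        new AttributeDefinition("lastUpdatedTime", ScalarAttributeType.S))
                .withGlobalSecondaryIndexUpdates(
                        new GlobalSecondaryIndexUpdate().withCreate(createGsi)));
    }
}

Once the index becomes ACTIVE, a scan of it returns only keys, which you can then use to read the full items from the base table (for example with BatchGetItem) as you migrate.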

Related

How do I query DynamoDB without specifying a partition key value?

I have a simple table with orderID as PK and userID as SK. I found out that in DynamoDB you need to specify the partition key to use Query. So in this case, how is it possible for me to get all the orders for userID x, since I can't ignore orderID, it being the partition key of this table? Another way to solve this, which works but is not recommended, is using a scan filter, which scans the whole table and then filters the result; it will eventually slow down as the table grows. I wonder how you all handle this scenario?
You can create a Global Secondary Index (GSI) with userID as its partition key and query based on that index.
You can read more about indexes and GSIs in the AWS docs.
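As a hedged sketch with the AWS SDK for Java (v1), assuming the table is named Orders and the GSI on userID is named userID-index, the query would look roughly like this:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;
import java.util.Collections;

public class OrdersByUser {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        QueryRequest request = new QueryRequest()
                .withTableName("Orders")           // hypothetical table name
                .withIndexName("userID-index")     // hypothetical GSI on userID
                .withKeyConditionExpression("userID = :u")
                .withExpressionAttributeValues(Collections.singletonMap(
                        ":u", new AttributeValue().withS("x")));

        QueryResult result = client.query(request);
        result.getItems().forEach(System.out::println); // each item is one order for user x
        // (pagination via LastEvaluatedKey omitted for brevity)
    }
}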

How to query dynamoDB without using hashKey

I have a DynamoDB table with two attributes:
A: primary partition key
B: primary sort key
I want to query this table using attribute B since I don't know the value of A. Is it possible to do so?
Is it possible to make B the partition key of a GSI (global secondary index)? If so, how do I create it and query the table using B, given that B is already a sort key?
You need a partition key to query; you can't do it using the sort key alone. You can only scan.
So the only way out for you is to create a GSI with B as the partition key.
Update
Yes, you can use the range key as the partition key of a GSI.
The drawbacks to using a GSI are:
There is a limit on the number of GSIs per table (originally a maximum of 5; the default quota is now 20), so choose wisely what you need to index. Originally a GSI could only be specified at table creation; DynamoDB now supports adding and deleting GSIs on existing tables.
A GSI will cost you additional money, as you need to assign provisioned throughput to it.
A GSI is eventually consistent: DynamoDB does not guarantee that the moment an item is written to the table, it immediately becomes available for querying through the GSI. The documentation states that propagation is usually immediate, but it can take up to a few seconds for an item to appear in the index.
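Once the index exists, querying it is straightforward. A minimal sketch with the AWS SDK for Java document API, assuming a table named MyTable, a GSI named B-index, and B being a string attribute:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Index;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class QueryByB {
    public static void main(String[] args) {
        DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        Index index = dynamoDB.getTable("MyTable").getIndex("B-index"); // hypothetical names

        QuerySpec spec = new QuerySpec()
                .withKeyConditionExpression("B = :v")
                .withValueMap(new ValueMap().withString(":v", "someValue"));

        for (Item item : index.query(spec)) {   // GSI reads are eventually consistent
            System.out.println(item.toJSONPretty());
        }
    }
}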

Dynamodb query expression

Team,
I have a DynamoDB table with a given hash key (userid) and sort key (age). Let's say we want to retrieve, for each hash key (userid), the element with the smallest age. What would be the query and filter expression for that DynamoDB query?
Thanks!
I don't think you can do it in a single query; you would need to do a full table scan. If you have a list of hash keys somewhere, then you can do N queries (one per hash key, in parallel) instead.
[Update] Here is another possible approach:
Maintain a second table that has just a hash key (userId). This table contains, for each user, the record with the smallest age seen so far. To achieve that, make sure that every time you update the main table you also update the second one, but only if the new age is less than the age currently stored there; you can use a conditional update for that (sketched below). The update can be done either by the application itself or by an AWS Lambda function listening to the DynamoDB stream. Now if you need the smallest age for each user, you still do a full table scan, but of the second table, which contains only the relevant records, so the scan reads as little as possible.
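Here is a hedged sketch of that conditional update with the AWS SDK for Java (v1); the table and attribute names (MinAges, UserId, MinAge) are illustrative assumptions:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;
import java.util.HashMap;
import java.util.Map;

public class MinAgeWriter {
    private static final AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

    static void recordAge(String userId, int newAge) {
        Map<String, AttributeValue> key = new HashMap<>();
        key.put("UserId", new AttributeValue().withS(userId));

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":age", new AttributeValue().withN(Integer.toString(newAge)));

        try {
            client.updateItem(new UpdateItemRequest()
                    .withTableName("MinAges")
                    .withKey(key)
                    .withUpdateExpression("SET MinAge = :age")
                    // only write if there is no minimum yet, or the new age is smaller
                    .withConditionExpression("attribute_not_exists(MinAge) OR MinAge > :age")
                    .withExpressionAttributeValues(values));
        } catch (ConditionalCheckFailedException e) {
            // the stored minimum is already <= newAge; nothing to do
        }
    }
}

Because the condition is evaluated atomically by DynamoDB, concurrent writers can never raise the stored minimum.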
There are two ways to achieve that:
If you don't need this data in real time, you can export your data into other AWS systems, like EMR or Redshift, and perform complex analytics queries there. That way you can write SQL expressions using joins and GROUP BY operators.
You can even run EMR Hive queries directly against DynamoDB data, but they perform scans, so this is not very cost-efficient.
The other option is to use DynamoDB Streams. You can maintain a separate table that stores:
Table: MinAges
UserId - primary key
MinAge - regular numeric attribute
On every insert/update/delete in the original table, compute the minimum age for the affected user and store it in the MinAges table.
Another option is to write something like this in the application:
void storeNewAge(String userId, int newAge) {
    int smallestAge = getSmallestAgeFor(userId); // read the current minimum
    if (newAge < smallestAge)
        storeSmallestAge(userId, newAge);        // this write is not atomic with the read
}
But since this read-then-write sequence is not atomic, it's dangerous to run code like that: concurrent writers may leave you with inconsistent data. You can use the DynamoDB transactions library (DynamoDB has since added native transactions), but transactions are expensive, while with streams you get consistent data at a very low price.
You can do it for a given hash key using ScanIndexForward:
YourEntity requestEntity = new YourEntity();
requestEntity.setHashKey(hashKey);
DynamoDBQueryExpression<YourEntity> queryExpression = new DynamoDBQueryExpression<YourEntity>()
        .withHashKeyValues(requestEntity)
        .withConsistentRead(false);        // consistent reads are not supported on GSIs
queryExpression.setIndexName(indexName);   // only if you are querying an index
queryExpression.setScanIndexForward(true); // ascending sort-key order, so the smallest age comes first
queryExpression.setLimit(1);               // return only that smallest item
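For completeness, a usage sketch assuming a DynamoDBMapper instance named mapper and that YourEntity is annotated for this table and index:

List<YourEntity> results = mapper.query(YourEntity.class, queryExpression);
YourEntity smallest = results.isEmpty() ? null : results.get(0); // smallest age for this hash key

Note that this answers the question for one hash key at a time; to cover all users you still need one query per userid.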

DynamoDB query/sort based on timestamp

In DynamoDB, I have a table where each record has two date attributes, create_date and last_modified_date. These dates are in ISO 8601 format, e.g. 2016-01-22T16:19:52.464Z.
I need to have a way of querying them based on the create_date and last_modified_date e.g.
get all records where create_date > [some_date]
get all records where last_modified_date < [some_date]
In general, I need to get all records where [date_attr] [comparison_op] [some_date].
One way of doing it is to insert a dummy fixed attribute with each record and create an index with the dummy attribute as the partition key and create_date as the sort key (likewise for last_modified_date).
Then I'll be able to query it by providing the fixed dummy attribute as the partition key and the date attribute as the sort key, using any comparison operator: <, >, <=, >=, and so on.
But this doesn't seem good and looks like a hack instead of a proper solution/design. Are there any better solutions?
There are some things that NoSQL databases are not good at, but you can solve this with one of the following approaches:
Move this table's data to a SQL database for searching purposes: this can be effective because you will be able to query exactly as you require, but it can be tedious because you need to keep the data synchronized between two different databases.
Integrate with Amazon CloudSearch: you can feed this table into CloudSearch and then, rather than querying your DynamoDB table, query CloudSearch instead.
Integrate with Elasticsearch: Elasticsearch is similar to CloudSearch, although each has its pros and cons; the end result is the same - rather than querying DynamoDB, you query Elasticsearch.
As you mentioned in your question, add a GSI with a fixed dummy partition key and the date as the sort key (a sketch follows below).
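As a hedged sketch of that last approach with the AWS SDK for Java (v1), assuming a GSI named create_date-index whose partition key is a fixed dummy attribute gsiPk (always "ALL") and whose sort key is create_date:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import java.util.HashMap;
import java.util.Map;

public class RecordsCreatedAfter {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":all", new AttributeValue().withS("ALL")); // the fixed dummy value
        values.put(":d", new AttributeValue().withS("2016-01-22T00:00:00.000Z"));

        // ISO 8601 strings sort lexicographically in chronological order,
        // so a plain string comparison works as a date comparison.
        QueryRequest request = new QueryRequest()
                .withTableName("MyTable")               // hypothetical table name
                .withIndexName("create_date-index")     // hypothetical GSI name
                .withKeyConditionExpression("gsiPk = :all AND create_date > :d")
                .withExpressionAttributeValues(values);

        client.query(request).getItems().forEach(System.out::println);
    }
}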

DynamoDB Change Range Key Column

Is it possible to modify the range key column after table creation, such as adding a new attribute and assigning it as the range key for the table? I tried searching but can't find any articles about changing the range or hash key.
No, unfortunately it's not possible to change the hash key, range key, or indexes after a table is created in DynamoDB. The DynamoDB UpdateTable API documentation is clear about the fact that indexes cannot be modified. I can't find a reference anywhere in the docs that explicitly states that the table keys cannot be modified, but at present they cannot be changed.
Note that DynamoDB is schema-less other than the hash and range key, and you can add other attributes to new items with no problems. Unfortunately, if you need to modify either your hash key or range key, you'll have to make a new table and migrate the data.
Edit (January 2014): DynamoDB now has support for adding global secondary indexes on the fly.
To change the sort key or create an additional one, you will need to create a new table and migrate over to it, as neither action can be done on an existing table.
DynamoDB Streams enable us to migrate tables without any downtime. I've done this to great effect, and the steps I've followed are:
Create a new table (let us call this NewTable), with the desired key structure, LSIs, GSIs.
Enable DynamoDB Streams on the original table
Associate a Lambda to the Stream, which pushes the record into NewTable. (This Lambda should trim off the migration flag in Step 5)
[Optional] Create a GSI on the original table to speed up scanning items. Ensure this GSI only has attributes: Primary Key, and Migrated (See Step 5).
Scan the GSI created in the previous step (or entire table) and use the following Filter:
FilterExpression = "attribute_not_exists(Migrated)"
Update each item in the table with a migration flag (i.e. "Migrated": { "S": "0" }), which sends it to the DynamoDB stream (use the UpdateItem API to ensure no data loss occurs).
NOTE: You may want to increase the write capacity units on the table during the updates.
The Lambda will pick up all items, trim off the Migrated flag, and push them into NewTable.
Once all items have been migrated, repoint the code to the new table.
Remove the original table and the Lambda function once you're happy that all is good.
Following these steps should ensure you have no data loss and no downtime.
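For illustration, here is a hedged sketch of steps 5 and 6 (scanning for unmigrated items and flagging them) with the AWS SDK for Java (v1); the table name (OriginalTable) and key attribute (id) are assumptions:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class FlagForMigration {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        Map<String, AttributeValue> startKey = null;

        do {
            // Step 5: scan only items that have not been flagged yet
            ScanResult page = client.scan(new ScanRequest()
                    .withTableName("OriginalTable")                  // hypothetical name
                    .withFilterExpression("attribute_not_exists(Migrated)")
                    .withExclusiveStartKey(startKey));

            for (Map<String, AttributeValue> item : page.getItems()) {
                // Step 6: flag the item, which publishes it to the stream
                Map<String, AttributeValue> key = new HashMap<>();
                key.put("id", item.get("id"));                       // hypothetical key attribute
                client.updateItem(new UpdateItemRequest()
                        .withTableName("OriginalTable")
                        .withKey(key)
                        .withUpdateExpression("SET Migrated = :zero")
                        .withExpressionAttributeValues(Collections.singletonMap(
                                ":zero", new AttributeValue().withS("0"))));
            }
            startKey = page.getLastEvaluatedKey(); // continue until the scan is exhausted
        } while (startKey != null);
    }
}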
I've documented this on my blog, with code to assist:
https://www.abhayachauhan.com/2018/01/dynamodb-changing-table-schema/
