DynamoDB query expression - amazon-dynamodb

Team,
I have a DynamoDB table with a hash key (userid) and a sort key (age). If we want to retrieve, for each hash key (userid), the item with the smallest age, what would the query and filter expression be for the DynamoDB query?
Thanks!

I don't think you can do it in a query. You would need to do a full table scan. If you have a list of hash keys somewhere, then you can do N queries (in parallel) instead.
[Update] Here is another possible approach:
Maintain a second table where you have just a hash key (userID). This table contains, for each user, the record with the smallest age. To achieve that, make sure that every time you update the main table you also update the second one if the new age is less than the current age in the second table. You can use a conditional update for that. The update can either be done by the application itself, or you can have an AWS Lambda function listening to the DynamoDB stream. If you then need the smallest age for each user, you still do a full table scan of the second table, but this scan only reads relevant records, so it will be optimal.
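A minimal sketch of that conditional update in Java with the AWS SDK v1 DynamoDB client; the table name SmallestAges and the attribute name smallestAge are placeholders, not names from the question:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

import java.util.HashMap;
import java.util.Map;

public class SmallestAgeWriter {

    private final AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

    // Call this alongside every write to the main table.
    public void recordAge(String userId, int age) {
        Map<String, AttributeValue> key = new HashMap<>();
        key.put("userID", new AttributeValue(userId));

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":age", new AttributeValue().withN(Integer.toString(age)));

        try {
            client.updateItem(new UpdateItemRequest()
                    .withTableName("SmallestAges")   // hypothetical second table
                    .withKey(key)
                    .withUpdateExpression("SET smallestAge = :age")
                    .withConditionExpression("attribute_not_exists(smallestAge) OR smallestAge > :age")
                    .withExpressionAttributeValues(values));
        } catch (ConditionalCheckFailedException ignored) {
            // The stored age is already smaller; nothing to do.
        }
    }
}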

There are two ways to achieve that:
If you don't need this data in real time, you can export your data into other AWS systems, like EMR or Redshift, and perform complex analytics queries there. With this you can write SQL expressions using joins and group-by operators.
You can even run EMR Hive queries on DynamoDB data, but they perform scans, so it's not very cost efficient.
Another option is to use DynamoDB Streams. You can maintain a separate table that stores:
Table: MinAges
UserId - primary key
MinAge - regular numeric attribute
On every update/delete/insert on the original table you can query the minimum age for the updated user and store it in the MinAges table (a Lambda sketch is shown at the end of this answer).
Another option is to write something like this:
storeNewAge(userId, newAge)
def smallestAge = getSmallestAgeFor(userId)
storeSmallestAge(userId, smallestAge)
But since DynamoDB does not have native transaction support, it's dangerous to run code like that, since you may end up with inconsistent data. You can use the DynamoDB transactions library, but these transactions are expensive. With streams, by contrast, you get consistent data at a very low price.
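A minimal sketch of such a stream-triggered Lambda, assuming the aws-lambda-java-events 2.x artifact (where stream images use the SDK's AttributeValue type), a stream configured with NEW_AND_OLD_IMAGES, source attributes named userid and age, and the MinAges table described above; all names are illustrative:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

import java.util.HashMap;
import java.util.Map;

public class MinAgeStreamHandler implements RequestHandler<DynamodbEvent, Void> {

    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbStreamRecord record : event.getRecords()) {
            Map<String, AttributeValue> newImage = record.getDynamodb().getNewImage();
            if (newImage == null) {
                continue; // a delete; recomputing the minimum would need a query on the main table
            }
            String userId = newImage.get("userid").getS();
            String age = newImage.get("age").getN();

            Map<String, AttributeValue> key = new HashMap<>();
            key.put("UserId", new AttributeValue(userId));
            Map<String, AttributeValue> values = new HashMap<>();
            values.put(":age", new AttributeValue().withN(age));
            try {
                // Only lower MinAge, never raise it.
                dynamo.updateItem(new UpdateItemRequest()
                        .withTableName("MinAges")
                        .withKey(key)
                        .withUpdateExpression("SET MinAge = :age")
                        .withConditionExpression("attribute_not_exists(MinAge) OR MinAge > :age")
                        .withExpressionAttributeValues(values));
            } catch (ConditionalCheckFailedException ignored) {
                // Stored minimum is already smaller.
            }
        }
        return null;
    }
}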

You can do it using ScanIndexForward (this returns the smallest age for one hash key; run one query per user to cover them all):
YourEntity requestEntity = new YourEntity();
requestEntity.setHashKey(hashkey);
DynamoDBQueryExpression<YourEntity> queryExpression = new DynamoDBQueryExpression<YourEntity>()
        .withHashKeyValues(requestEntity)
        .withConsistentRead(false);
queryExpression.setIndexName(IndexName);   // only if you are querying an index
queryExpression.setScanIndexForward(true); // ascending by sort key, so the first item has the smallest age
queryExpression.setLimit(1);

Related

Copy DynamoDB table while modifying key attribute

I have a DynamoDB table with hundreds of thousands of items which I need duplicated, with one catch: the key needs to be modified. The current key is a combination of 2 fields, e.g. attr1:attr2. I need the new table to have a key consisting only of attr1.
I know copying the table with Data Pipeline is pretty straightforward, but how do I create the new key for the use case I have?
Note: the data size is between 500K and 1M items.
Use Elastic MapReduce to manipulate the data. This article explains how to handle DynamoDB data with EMR. Create a UDF which will parse and manipulate the key, and use it in a query such as:
SELECT UDF(id), all, other, columns FROM your_table
The results can then be saved into another DynamoDB table.
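If you take the Hive-on-EMR route, the UDF itself is a small Java class. A rough sketch, assuming the composite key uses ':' as the separator as in the question (the class and function names are made up):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hive UDF that turns a composite key "attr1:attr2" into just "attr1".
public final class ExtractAttr1 extends UDF {
    public Text evaluate(final Text compositeKey) {
        if (compositeKey == null) {
            return null;
        }
        String value = compositeKey.toString();
        int separator = value.indexOf(':');
        return new Text(separator >= 0 ? value.substring(0, separator) : value);
    }
}

After packaging it in a jar, ADD JAR and CREATE TEMPORARY FUNCTION make it usable in the SELECT above.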

DynamoDbScan Expression using aws-sdk java

I am using the aws-sdk to connect to DynamoDB, and I have run into a scenario where I have one DynamoDB table with different partition/hash keys and have to scan and filter to get results. Scanning the entire table would be a costly operation. Is there a way to scan only a certain partition/hash key of a table?
You have to use a DynamoDB Query. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
In my opinion you shouldn't use Scan, because it is very costly and slow.
You didn't say which programming language you're using, but here are some query examples:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB.html#query-property
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
https://www.dynamodbguide.com/querying/
About indices:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SQLtoNoSQL.Indexes.html
UPDATE #1:
Maybe that will help:
Add a new column to your table with a static value (for example, column name: const_value, value: const).
Create a new secondary index on your table:
'partition key': 'const_value'
'sort key': the column you want to filter on
Then you can use Query instead of Scan.
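To make the index trick concrete, here is a hedged Java (AWS SDK v1) sketch that queries such an index instead of scanning; the table name, index name and the score attribute are invented for illustration:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

import java.util.HashMap;
import java.util.Map;

public class QueryInsteadOfScan {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":c", new AttributeValue("const"));
        values.put(":minScore", new AttributeValue().withN("100"));

        // Query the secondary index instead of scanning the base table;
        // only items under the single "const" partition are read.
        QueryRequest request = new QueryRequest()
                .withTableName("MyTable")
                .withIndexName("const_value-score-index")
                .withKeyConditionExpression("const_value = :c AND score > :minScore")
                .withExpressionAttributeValues(values);

        QueryResult result = client.query(request);
        result.getItems().forEach(System.out::println);
    }
}

Keep in mind that a constant partition key funnels the whole index into one partition, so this works best for small to medium tables.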

DynamoDB how to search for a list of values

I have a DynamoDB instance with a partition key and sort key. Let's say that they are organisation (hash key) and employee id (sort key).
I want to retrieve all employees whose ids are in a list. They all work for the same organisation but they are not all of the employees of that organisation.
In SQL I'd do something like:
select * from table where organisation_id = 'org' and employee_id in [list of ids]
There does not seem to be an equivalent in DynamoDB.
My choices seem to be:
1) Iterate over all employee IDs using a Query OR
2) Use BatchGetItems and provide organisation_id:employee_id for all items
The first seems like it will be slower as it involves multiple requests while the second is a single request but may consume more RCUs.
Which of these is the preferred solution to this problem? Or am I missing a better third way?
I would iterate over your list using GetItem, adding each employee found to a collection. This approach isn't slow - DynamoDB is designed specifically for getting lots of items fast using their keys.
There is no need to use Query as you have both the partition key and range key. You would only use a Query if say you wanted all employees of one organisation.
If your list is particularly large you could use BatchGetItem, which retrieves the items in parallel and therefore reduces latency. You won't find much of a difference though unless you have a lot of items to get.
By the way, DynamoDB does have an 'IN' operator, but you can't use it in KeyConditions.
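For option 2, a BatchGetItem sketch in Java (AWS SDK v1); the table name is a placeholder, and the key attribute names mirror the SQL example:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.BatchGetItemResult;
import com.amazonaws.services.dynamodbv2.model.KeysAndAttributes;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchGetEmployees {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Build one full primary key (organisation_id + employee_id) per employee.
        List<Map<String, AttributeValue>> keys = new ArrayList<>();
        for (String employeeId : Arrays.asList("e-1", "e-2", "e-3")) {
            Map<String, AttributeValue> key = new HashMap<>();
            key.put("organisation_id", new AttributeValue("org"));
            key.put("employee_id", new AttributeValue(employeeId));
            keys.add(key);
        }

        Map<String, KeysAndAttributes> requestItems = new HashMap<>();
        requestItems.put("Employees", new KeysAndAttributes().withKeys(keys));

        // BatchGetItem accepts up to 100 keys per call and may return
        // UnprocessedKeys, which production code should retry.
        BatchGetItemResult result = client.batchGetItem(requestItems);
        result.getResponses().get("Employees").forEach(System.out::println);
    }
}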

Use ConditionExpression to limit insert when ID doesn't exist in other table

Simple thing: while inserting data into table A, I have a hash key id and an additional hash index on column ex_id, which is a kind of foreign key into table B.
When inserting new data into table A, I would like an exception to be raised whenever an item is inserted whose ex_id value doesn't have a corresponding entry in table B.
I thought that ConditionExpression was the way to go, but I can't make it work - probably missing something obvious. I tried to use contains()...
Any ideas?
As far as I know this is not possible on the DynamoDB side, because there are no relationships between tables.
What you can do is add a check at the application level, which verifies the reference itself and throws an exception before inserting the value into table A. (You can query table B for that id; if it is found, insert, otherwise throw an exception.)
DynamoDB does not natively support any kind of foreign keys; everything works on a per-table, per-key basis. DynamoDB's approach is to handle such logic at the client level. For example, see the DynamoDB transactions client. This library allows you to perform transactions across tables which either all succeed or all roll back.
For your case, I would first make a GetItem request to table B (using a consistent read); if the item exists, then write to table A.
Then I would enable streams on table A and write a Lambda function to check whether any data violations get written to the table.
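A rough Java (AWS SDK v1) sketch of the application-level check described above; table names and key attribute names are assumptions, and the check-then-write is still not atomic, which is why the streams-based validation is suggested as a backstop:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;

import java.util.HashMap;
import java.util.Map;

public class InsertWithForeignKeyCheck {

    private final AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

    public void insertIntoTableA(String id, String exId) {
        // 1. Strongly consistent read on table B to verify the "foreign key" exists.
        Map<String, AttributeValue> bKey = new HashMap<>();
        bKey.put("id", new AttributeValue(exId));   // assumed key name in table B
        GetItemRequest check = new GetItemRequest()
                .withTableName("TableB")
                .withKey(bKey)
                .withConsistentRead(true);
        if (client.getItem(check).getItem() == null) {
            throw new IllegalArgumentException("No entry in TableB for ex_id=" + exId);
        }

        // 2. Only then write to table A.
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("id", new AttributeValue(id));
        item.put("ex_id", new AttributeValue(exId));
        client.putItem("TableA", item);
    }
}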

Change the schema of a DynamoDB table: what is the best/recommended way?

What is the Amazon-recommended way of changing the schema of a large table in a production DynamoDB?
Imagine a hypothetical case where we have a table Person, with primary hash key SSN. This table may contain 10 million items.
Now the news comes that due to the critical volume of identity thefts, the government of this hypothetical country has introduced another personal identification: Unique Personal Identifier, or UPI.
We have to add a UPI column and change the schema of the Person table so that the primary hash key is now UPI. We want to support both the current system, which uses SSN, and the new system, which uses UPI, for some time, so we need these two columns to co-exist in the Person table.
What is the Amazon-recommended way to do this schema change?
There are a couple of approaches, but first you must understand that you cannot change the schema of an existing table. To get a different schema, you have to create a new table. You may be able to reuse your existing table, but the result would be the same as if you created a different table.
Lazy migration to the same table, without Streams. Every time you modify an entry in the Person table, create a new item in the Person table using UPI and not SSN as the value for the hash key, and delete the old item keyed at SSN. This assumes that UPI draws from a different range of values than SSN. If SSN looks like XXX-XX-XXXX, then as long as UPI has a different number of digits, you will never have an overlap.
Lazy migration to the same table, using Streams. When streams become generally available, you will be able to turn on a stream for your Person table. Create a stream with the NEW_AND_OLD_IMAGES stream view type and subscribe a Lambda function to it; whenever the Lambda detects a change that adds a UPI to an existing person in the Person table, it should remove the person keyed at SSN and add a person with the same attributes keyed at UPI. This approach has race conditions that can be mitigated by adding an atomic counter-version attribute to the item and conditioning the DeleteItem call on the version attribute (a sketch of such a conditional delete follows these approaches).
Preemptive (scripted) migration to a different table, using Streams. Run a script that scans your table and adds a unique UPI to each Person-item in the Person table. Create a stream on Person table with the NEW_AND_OLD_IMAGES stream view type and subscribe a lambda function to that stream that writes all the new Persons in a new Person_UPI table when the lambda function detects that a Person with a UPI was changed or when a Person had a UPI added. Mutations on the base table usually take hundreds of milliseconds to appear in a stream as stream records, so you can do a hot failover to the new Person_UPI table in your application. Reject requests for a few seconds, point your application to the Person_UPI table during that time, and re-enable requests.
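The version-conditioned DeleteItem mentioned in the Streams-based lazy migration could look roughly like this Java (AWS SDK v1) sketch; the version attribute name and the values are illustrative:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.DeleteItemRequest;

import java.util.Collections;

public class ConditionalDelete {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Delete the SSN-keyed item only if its version still matches the one
        // we read, so a concurrent update is not silently discarded.
        DeleteItemRequest delete = new DeleteItemRequest()
                .withTableName("Person")
                .withKey(Collections.singletonMap("SSN", new AttributeValue("123-45-6789")))
                .withConditionExpression("version = :expected")
                .withExpressionAttributeValues(
                        Collections.singletonMap(":expected", new AttributeValue().withN("7")));

        client.deleteItem(delete);
    }
}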
DynamoDB streams enable us to migrate tables without any downtime. I've done this to great effect, and the steps I've followed are:
Create a new table (let us call this NewTable), with the desired key structure, LSIs, GSIs.
Enable DynamoDB Streams on the original table
Associate a Lambda to the Stream, which pushes the record into NewTable. (This Lambda should trim off the migration flag in Step 5)
[Optional] Create a GSI on the original table to speed up scanning items. Ensure this GSI only has attributes: Primary Key, and Migrated (See Step 5).
Scan the GSI created in the previous step (or entire table) and use the following Filter:
FilterExpression = "attribute_not_exists(Migrated)"
Update each item in the table with a migration flag (i.e. "Migrated": { "S": "0" }), which sends it to the DynamoDB stream (using the UpdateItem API, to ensure no data loss occurs).
NOTE: You may want to increase write capacity units on the table during the updates.
The Lambda will pick up all items, trim off the Migrated flag, and push them into NewTable.
Once all items have been migrated, repoint the code to the new table
Remove the original table and the Lambda function once you're happy that all is good.
Following these steps should ensure you have no data loss and no downtime.
I've documented this on my blog, with code to assist:
https://www.abhayachauhan.com/2018/01/dynamodb-changing-table-schema/
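The "touch" update from step 5 (setting the Migrated flag so each item flows through the stream) could look roughly like this in Java (AWS SDK v1); the table name is a placeholder:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

import java.util.Collections;
import java.util.Map;

public class TouchForMigration {

    private final AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

    // "Touches" one item so the stream (and the Lambda) sees it and copies it to NewTable.
    public void markMigrated(Map<String, AttributeValue> key) {
        client.updateItem(new UpdateItemRequest()
                .withTableName("OriginalTable")
                .withKey(key)
                .withUpdateExpression("SET Migrated = :zero")
                .withExpressionAttributeValues(
                        Collections.singletonMap(":zero", new AttributeValue("0"))));
    }
}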
I'm using a variant of Alexander's third approach. Again, you create a new table that will be updated as the old table is updated. The difference is that you use code in the existing service to write to both tables while you're transitioning instead of using a lambda function. You may have custom persistence code that you don't want to reproduce in a temporary lambda function and it's likely that you'll have to write the service code for this new table anyway. Depending on your architecture, you may even be able to switch to the new table without downtime.
However, the nice part about using a lambda function is that any load introduced by additional writes to the new table would be on the lambda, not the service.
If the changes involve changing the partition key, you can add a new GSI (global secondary index). Moreover, you can always add new columns/attributes to DynamoDB without needing to migrate tables.
