Deleting items in a DynamoDB table with duplicate values

I have a DynamoDB table with the following structure:
{
  accountId: string,   // PARTITION KEY
  userId: string,      // SORT KEY
  email: string,
  dateCreated: number  // timestamp
}
I want to perform an action that deletes all items with duplicate emails from the table except for the one with the oldest dateCreated attribute.
Is this operation possible in DynamoDB?
Thanks

Firstly, you need both the partition key and the sort key to delete an item from DynamoDB. Unless you know both the accountId and the userId, you can't perform the DeleteItem operation.
In the above use case, neither email nor dateCreated is part of the key.
Also, sorting is available on the sort key attribute only.
Approach 1:-
Preferred if it is a one-time activity
Get the data and identify the old values based on dateCreated on the client side
Delete the items from DynamoDB by accountId and userId, as in the sketch below
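A minimal boto3 sketch of this client-side approach (the table name is a hypothetical; attribute shapes are taken from the question):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Accounts")  # hypothetical table name

# Scan the whole table (paginated) and collect every item.
items, kwargs = [], {}
while True:
    page = table.scan(**kwargs)
    items.extend(page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# Group items by email on the client side.
by_email = {}
for item in items:
    by_email.setdefault(item["email"], []).append(item)

# For each email, keep the item with the oldest dateCreated; delete the rest.
with table.batch_writer() as batch:
    for dupes in by_email.values():
        dupes.sort(key=lambda i: i["dateCreated"])
        for item in dupes[1:]:
            batch.delete_item(Key={"accountId": item["accountId"],
                                   "userId": item["userId"]})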
Approach 2:-
Preferred if this is required frequently
Create a GSI with email as the hash key and dateCreated as the sort key
Assuming you know the email value you want to check for duplicates, you can use the Query API with the index name, the email value, and ScanIndexForward left as true (i.e. ascending order, which is the default)
The result set will have the oldest record for that email at the top. You can keep the top record and run the Delete API with accountId and userId for the rest of the items, as sketched below.
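A hedged boto3 sketch of the query-then-delete flow (the GSI name and email value are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Accounts")  # hypothetical table name

# Query the GSI in ascending dateCreated order (the default),
# so the oldest record for this email comes first.
resp = table.query(
    IndexName="email-dateCreated-index",  # hypothetical GSI name
    KeyConditionExpression=Key("email").eq("user@example.com"),
    ScanIndexForward=True,
)

# Keep the first (oldest) item; delete the duplicates.
for item in resp["Items"][1:]:
    table.delete_item(Key={"accountId": item["accountId"],
                           "userId": item["userId"]})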
Approach 3:-
Preferred if the data is manageable as a flat file and you can run a program to find the duplicates
You can export the data to an S3 bucket using AWS Data Pipeline
Run a program that reads the file, finds the duplicates, and calls DeleteItem for each one, as sketched below
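A rough sketch of such a program, assuming a local copy of the export with one JSON record per line (the actual export format depends on the Data Pipeline template used):

import json
from collections import defaultdict

# Group exported records by email.
groups = defaultdict(list)
with open("export.json") as f:  # hypothetical local copy of the S3 export
    for line in f:
        record = json.loads(line)
        groups[record["email"]].append(record)

# Print the keys of every duplicate except the oldest per email;
# feed these to DeleteItem.
for records in groups.values():
    records.sort(key=lambda r: r["dateCreated"])
    for r in records[1:]:
        print(r["accountId"], r["userId"])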
Approach 4:-
Preferred approach if the data is large
You can export the data to AWS EMR using AWS Data Pipeline
Run queries on EMR to find the duplicates, then call DeleteItem for each one
Note:-
Please note that if you are expecting something like SQL with sub-queries to identify the latest updated record and delete the rest, it is NOT possible on DynamoDB


DynamoDB: Good practice to use a timestamp field in a Primary Key

I want to store and retrieve data from a DynamoDB table.
My data (an item = a review a user gave on a feature of an app) has the following attributes:
user string
feature string
appVersion string
timestamp string
rate int
description string
There are multiple features, on multiple versions of the app, and a user can give multiple reviews on these features. So I would like to use (user, appVersion, feature, timestamp) as a primary key.
But it does not seem to be possible to use that many attributes in a primary key in DynamoDB.
The first solution I implemented is to use user as a Partition Key, and a hash of (appVersion, feature, timestamp) as a Sort Key (in a new field named reviewID).
My problem is that I want to retrieve an item for a given user, feature, and appVersion without knowing the timestamp value (let's say I want the item with the latest timestamp, or the list of all items matching the 3 fields).
Without knowing the timestamp, I can't build the Sort Key necessary to retrieve my item. But if I remove the timestamp from the Sort Key, I will not be able to store multiple items having the same (user, appVersion, feature).
What would be the proper way to handle this use case?
I am thinking about using a hash of (user, appVersion, feature) as a Partition Key, and the timestamp as a Sort Key, would this be a correct solution ?
Put the timestamp at the end of your SK, and then when you Query the data, use begins_with on the SK.
PK       SK
UserID   appVersion#feature#timestamp
This will allow you to dynamically query the data. For example, if you want all of a user's reviews for a specific appVersion:
SELECT * FROM Mytable WHERE PK = 'x' AND begins_with("SK", '{VERSION ID}')
This is done using a Query command.
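In boto3 the same query might look like the following hedged sketch (the table name is from the example above; the version prefix is a placeholder):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Mytable")  # name from the example above

# All of one user's reviews for a specific appVersion, any feature/timestamp.
resp = table.query(
    KeyConditionExpression=Key("PK").eq("x")
    & Key("SK").begins_with("1.2.0#"),  # hypothetical appVersion prefix
)
print(resp["Items"])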
The answer from Lee Hannigan will work, I like it.
However, keep in mind that accessing a PK is very fast because it's hash-based.
I am thinking about using a hash of (user, appVersion, feature) as a Partition Key, and the timestamp as a Sort Key, would this be a correct solution?
This might also work; the table would look like this:
PK                                                      SK
User#{user}#AppVersion#{appVersion}#Feature#{feature}   TimeStamp#{timestamp}
If you always know the user, appVersion, and the feature, this will be more optimal, because the full PK lookup is hash-based, whereas an SK lookup within a partition is O(log N). See the sketch below.
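A hedged sketch of that access pattern, fetching the latest item for one composite key (table name and key values are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Reviews")  # hypothetical table name

# Hypothetical composite partition key built from user, appVersion, and feature.
pk = "User#alice#AppVersion#1.2.0#Feature#darkmode"

# Latest review for this key: sort descending by timestamp, take one item.
resp = table.query(
    KeyConditionExpression=Key("PK").eq(pk),
    ScanIndexForward=False,
    Limit=1,
)
latest = resp["Items"][0] if resp["Items"] else None
print(latest)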
One way:
HASH string "modelName": "user"
RANGE string "id": "b0d5be50-4fae-11ed-981f-dbffcc56c88a"
A v1 UUID embeds its creation time, so the id itself can be used as the timestamp.
When searching, you can query the index in reverse order.
Another way:
HASH string "modelName": "user"
RANGE string "createdAt": "2019-10-12T07:20:50.52Z"
For createdAt, use the RFC 3339 time format.
When searching, you can query the index in reverse order.
Both kinds of range key are illustrated below. Put down on paper what you need, and you'll find other ways to manage the HASH/RANGE indexes.
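A small illustration of generating both kinds of range key (plain Python; nothing DynamoDB-specific):

import uuid
from datetime import datetime, timezone

# A version-1 UUID embeds its creation time, so the id can double as a timestamp.
item_id = str(uuid.uuid1())

# An RFC 3339 UTC timestamp sorts chronologically as a string range key.
created_at = datetime.now(timezone.utc).isoformat()

print(item_id, created_at)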

How to future-proof these possible requirement changes (swapping primary key columns) with a DynamoDB table design?

I have the following data structure
item_id String
version String
_id String
data String
_id is simply a UUID to identify the item. There is no need to search for a row by this field yet.
As of now, item_id, an id generated by an external system, is the primary key; i.e., given the item_id, I want to be able to retrieve version, _id, and data from the DynamoDB table.
item_id -> (version, _id, data)
Therefore I am setting item_id as the partition key.
I have two questions for future-proofing (evolution of) the above "schema":
In the future, if I want to incorporate version (version number of the item) into the primary key, can I just modify the table and add it to be the partition key?
If I also want to make the data searchable by _id, is it feasible to modify the table to assign _id as the partition key (it is a unique value because it is a UUID) and reassign item_id as a search key?
I want to avoid creation of new dynamodb table and data migration to create new key structures, because it may lead to down time.
You cannot update primary keys in DynamoDB. From the docs:
You cannot use UpdateItem to update any primary key attributes. Instead, you will need to delete the item, and then use PutItem to create a new item with new attributes.
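In practice the delete-then-put dance might look like this minimal boto3 sketch (table name, key value, and the new-key scheme are all hypothetical):

import boto3

table = boto3.resource("dynamodb").Table("Items")  # hypothetical table name

# "Changing" a primary key means deleting the old item and re-creating it.
# Not atomic as written; for atomicity, wrap both writes in transact_write_items.
old = table.get_item(Key={"item_id": "abc-123"})["Item"]  # hypothetical key

table.delete_item(Key={"item_id": "abc-123"})

new_item = dict(old)
new_item["item_id"] = f'{old["item_id"]}#{old["version"]}'  # hypothetical new key scheme
table.put_item(Item=new_item)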
If you wanted to make data searchable by _id, you could introduce a secondary index with the _id field as the partition key of the index.
For example, let's say your data looked like this (values are illustrative):
item_id   version   _id      data
item-1    v1        uuid-a   ...
item-2    v1        uuid-b   ...
If you defined a secondary index on _id, the index would look like this (same data as the previous example, just a different logical view):
_id      item_id   version   data
uuid-a   item-1    v1        ...
uuid-b   item-2    v1        ...
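Adding such an index does not require a new table or a data migration: a GSI can be created online with UpdateTable. A hedged boto3 sketch (the index name is hypothetical):

import boto3

client = boto3.client("dynamodb")

# Add a GSI on _id to the existing table; no new table or migration needed.
client.update_table(
    TableName="Items",  # hypothetical table name
    AttributeDefinitions=[{"AttributeName": "_id", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "_id-index",  # hypothetical index name
            "KeySchema": [{"AttributeName": "_id", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
            # Provisioned-mode tables also need a ProvisionedThroughput block here.
        }
    }],
)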
DynamoDB doesn't currently have any native versioning functionality, so you'll have to incorporate that into your data model. Fortunately, there's lots of discussion about this use case on the web. AWS has a document of DynamoDB "Best Practices", including an example of versioning.

Handling DocumentClientException with BulkImport

I am using Microsoft.Azure.CosmosDB.BulkExecutor.IBulkExecutor.BulkImportAsync to insert documents as a batch. I have implemented unique constraints for my cosmos db collection. If any of the input documents violates the constraint the entire bulk import operation fails with throwing DocumentClientException. Is this an expected behaviour? Or is there a way we can handle the exceptions for failed documents and make sure the valid documents are inserted?
First of all, thanks to the Microsoft documentation, which explains solid scenarios around this issue:
https://learn.microsoft.com/en-us/azure/data-factory/connector-troubleshoot-guide
This error appears when we define a unique key in addition to the default id field defined by Cosmos. One possible reason is duplication of rows for the unique key within the dataset. Another possible reason is that the delta dataset we are about to load contains unique keys that are already present in the existing Cosmos dataset.
For regular batch jobs there can be updates to an existing unique key, but we cannot update an existing unique key through the batch process: each record enters Cosmos as a new record with a new 'id' field value, and Cosmos updates an existing record only on a matching id field, not on the unique key.
Workaround: since the unique key is already going to be unique for every row across the entire collection, we can use the unique value itself as the 'id' field. Then any update to fields other than the unique key lands on the existing record, because the 'id' for a given unique key is always the same.
In SQL terms:
SELECT <unique_key_field> AS id, <unique_key_field>, field1, field2 FROM <table_name>

DynamoDB sub-item filter using the .NET Core API

First of all, I have table structure like this,
Users:{
UserId
Name
Email
SubTable1:[{
Column-111
Column-112
},
{
Column-121
Column-122
}]
SubTable2:[{
Column-211
Column-212
},
{
Column-221
Column-222
}]
}
As I am new to DynamoDB, I have a couple of questions about this, as follows:
1. Can I create a structure like this?
2. Can we set a primary key for the subtables?
3. Luckily, I found a DynamoDB helper class to do some operations on my DB:
https://www.gopiportal.in/2018/12/aws-dynamodb-helper-class-c-and-net-core.html
But I don't know how to fetch only a particular subtable.
4. Can we fetch only specific columns from my main table? I also need suggestions for the subtables.
Note: I am using .NET Core (C#) to communicate with DynamoDB.
Can I create a structure like this?
Yes
Can we set a primary key for the subtables?
No, a hash key can be set on top-level scalar attributes only (String, Number, etc.)
Luckily, I found a DynamoDB helper class to do some operations on my DB:
https://www.gopiportal.in/2018/12/aws-dynamodb-helper-class-c-and-net-core.html
But I don't know how to fetch only a particular subtable.
When you say subtables, I assume you are referring to the Array datatype in the sample table above. To fetch data from a DynamoDB table, you need the hash key in order to use the Query API. If you don't have the hash key, you can use the Scan API, which scans the entire table; Scan is a costly operation.
A GSI (Global Secondary Index) can be created to avoid the scan operation. However, it can be created on scalar attributes only; a GSI can't be created on an Array attribute.
The other option is to redesign the table to match your query access patterns.
Can we fetch only specific columns from my main table? I also need suggestions for the subtables.
Yes, you can fetch specific columns using a ProjectionExpression. That way you get only the required attributes in the result set.
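For instance, a minimal boto3 sketch (the key value is hypothetical; attribute names come from the sample structure above; the document path SubTable1[0] pulls back just the first element of that list):

import boto3

table = boto3.resource("dynamodb").Table("Users")  # table name from the sample above

# Fetch only specific attributes, including a single element of a list attribute.
resp = table.get_item(
    Key={"UserId": "user-1"},  # hypothetical key value
    ProjectionExpression="UserId, Email, SubTable1[0]",
)
print(resp.get("Item"))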

Query on non-key attribute

It appears that DynamoDB's Query method must include the partition key as part of the filter. How can a query be performed if you do not know the partition key?
For example, you have a User table with the attribute userid set as the partition key. Now we want to look up a user by their phone number. Is it possible to perform the query without the partition key? Using the Scan method this goal can be achieved, but at the expense of reading every item in the table before the filter is applied, as far as I know.
You'll need to set up a global secondary index (GSI), using your phoneNumber column as the index hash key.
You can create a GSI by calling UpdateTable.
Once you create the index, you'll be able to call Query with your IndexName to pull user records based on the phone number.
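A hedged boto3 sketch of querying the index (the index name and phone number are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("User")  # table name from the example

# Query the GSI on phoneNumber instead of scanning the whole table.
resp = table.query(
    IndexName="phoneNumber-index",  # hypothetical index name
    KeyConditionExpression=Key("phoneNumber").eq("+15550100"),
)
print(resp["Items"])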
