I have a DynamoDB table structured like this:
A   B     C     D
1   id1   foo   hi
1   id2   var   hello
A is the partition key and B is the sort key.
Let's say I only have the partition key and don't know the sort key, and I'd like to delete all entries that have the same partition key.
So I am thinking about loading entries with a Query of fixed page size (e.g. 1000) and deleting them in batches until there are no more entries with that partition key left in DynamoDB.
Is it possible to delete entries without loading them first?
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DeleteItem.html
DeleteItem
Deletes a single item in a table by primary key.
For the primary key, you must provide all of the attributes. For
example, with a simple primary key, you only need to provide a value
for the partition key. For a composite primary key, you must provide
values for both the partition key and the sort key.
In order to delete an item you must provide the whole primary key (partition key + sort key). So in your case you would need to query on the partition key, get all of the primary keys, then use those to delete each item (see the sketch after the quoted documentation). You can also use BatchWriteItem to delete up to 25 items per request.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
BatchWriteItem
The BatchWriteItem operation puts or deletes multiple items in one or
more tables. A single call to BatchWriteItem can write up to 16 MB of
data, which can comprise as many as 25 put or delete requests.
Individual items to be written can be as large as 400 KB.
DeleteRequest - Perform a DeleteItem operation on the specified item. The item to be deleted is identified by a Key subelement: Key -
A map of primary key attribute values that uniquely identify the item.
Each entry in this map consists of an attribute name and an attribute
value. For each primary key, you must provide all of the key
attributes. For example, with a simple primary key, you only need to
provide a value for the partition key. For a composite primary key,
you must provide values for both the partition key and the sort key.
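For illustration, a minimal sketch of a single DeleteItem call with a composite key, using the attribute names A and B from the question (the table name and docClient, an AWS.DynamoDB.DocumentClient, are assumptions):

// Hypothetical example: delete one item when both key attributes are known.
// docClient is assumed to be an AWS.DynamoDB.DocumentClient.
await docClient.delete({
  TableName: 'myTable',    // assumed table name
  Key: { A: 1, B: 'id1' }  // partition key + sort key, both required
}).promise()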
No, but you can Query all the items for the partition, and then issue an individual DeleteRequest for each item, which you can batch in multiple BatchWrite calls of up to 25 items.
JS code
async function deleteItems(tableName, partitionId) {
  const queryParams = {
    TableName: tableName,
    KeyConditionExpression: 'partitionId = :partitionId',
    ExpressionAttributeValues: { ':partitionId': partitionId },
  };

  const queryResults = await docClient.query(queryParams).promise()
  if (queryResults.Items && queryResults.Items.length > 0) {
    const batchCalls = chunks(queryResults.Items, 25).map(async (chunk) => {
      const deleteRequests = chunk.map(item => {
        return {
          DeleteRequest: {
            Key: {
              'partitionId': item.partitionId,
              'sortId': item.sortId,
            }
          }
        }
      })

      const batchWriteParams = {
        RequestItems: {
          [tableName]: deleteRequests
        }
      }
      await docClient.batchWrite(batchWriteParams).promise()
    })

    await Promise.all(batchCalls)
  }
}
// https://stackoverflow.com/a/37826698/3221253
function chunks(inputArray, perChunk) {
  return inputArray.reduce((all, one, i) => {
    const ch = Math.floor(i / perChunk);
    all[ch] = [].concat((all[ch] || []), one);
    return all
  }, [])
}
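Note that a single Query returns at most 1 MB of data, so for a large partition the code above would miss items. A hedged sketch of paging on LastEvaluatedKey before batching the deletes (same assumed key names and docClient as above):

// Collect every item for the partition, one 1 MB page at a time.
async function queryAllItems(tableName, partitionId) {
  const items = []
  let lastEvaluatedKey
  do {
    const page = await docClient.query({
      TableName: tableName,
      KeyConditionExpression: 'partitionId = :partitionId',
      ExpressionAttributeValues: { ':partitionId': partitionId },
      ExclusiveStartKey: lastEvaluatedKey,
    }).promise()
    items.push(...page.Items)
    lastEvaluatedKey = page.LastEvaluatedKey
  } while (lastEvaluatedKey)
  return items
}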
For production databases and critical Amazon DynamoDB tables, the recommendation is to use batch-write-item to purge huge amounts of data.
batch-write-item (with DeleteRequest) is 10 to 15 times faster than delete-item.
(Note: timestamp is a DynamoDB reserved word, so it has to be aliased via --expression-attribute-names.)
aws dynamodb scan \
    --table-name "test_table_name" \
    --projection-expression "primary_key, #ts" \
    --filter-expression "#ts < :oldest_date" \
    --expression-attribute-names '{"#ts":"timestamp"}' \
    --expression-attribute-values '{":oldest_date":{"S":"2020-02-01"}}' \
    --max-items 25 \
    --total-segments "$TOTAL_SEGMENT" \
    --segment "$SEGMENT_NUMBER" > $SCAN_OUTPUT_FILE
cat $SCAN_OUTPUT_FILE | jq -r ".Items[] | tojson" | awk '{ print "{\"DeleteRequest\": {\"Key\": " $0 " }}," }' | sed '$ s/.$//' | sed '1 i { "test_table_name": [' | sed '$ a ] }' > $INPUT_FILE
aws dynamodb batch-write-item --request-items file://$INPUT_FILE
Please find more information at https://medium.com/analytics-vidhya/how-to-delete-huge-data-from-dynamodb-table-f3be586c011c
You can use "begins_with" on the range key.
For example (pseudo code):
DELETE WHERE A = '1' AND B BEGINS_WITH 'id'
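DynamoDB itself has no DELETE ... WHERE statement, but begins_with is supported in a Query KeyConditionExpression on the sort key. A hedged sketch of that idea with the DocumentClient (attribute names A and B are from the question; the table name and docClient are assumptions):

// Find every item in partition 1 whose sort key starts with 'id',
// then delete them individually (or batch them as shown above).
const result = await docClient.query({
  TableName: 'myTable',  // assumed table name
  KeyConditionExpression: 'A = :a AND begins_with(B, :prefix)',
  ExpressionAttributeValues: { ':a': 1, ':prefix': 'id' },
}).promise()

for (const item of result.Items) {
  await docClient.delete({ TableName: 'myTable', Key: { A: item.A, B: item.B } }).promise()
}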
Related
I have a DynamoDB table and I need to find the records which fall within a given date range.
So here is my table structure:
{
  "Id": "String",
  "Name": "String",
  "CrawledAt": "String"
}
In this table, Id is used as the partition key and CrawledAt as another field. I also created a local secondary index on the CrawledAt field, named "CrawledAt-index".
Most articles show querying with Id together with CrawledAt, but in my case I don't know the Id; I only need to retrieve records for a particular date range.
Here is the code I have tried:
request = {
    "TableName": "sflnd00001-test",
    "IndexName": "CrawledAt-index",
    "ConsistentRead": False,
    "ProjectionExpression": "Name",
    "KeyConditionExpression": "CrawledAt between :v_start and :v_end",
    "ExpressionAttributeValues": {
        ":v_start": {"S": "2020-01-31T00:00:00.000Z"},
        ":v_end": {"S": "2025-11-31T00:00:00.000Z"}
    }
}
response = table.query(**request)
It's returning this error:
"An error occurred (ValidationException) when calling the Query operation: Invalid KeyConditionExpression: Incorrect operand type for operator or function; operator or function: BETWEEN, operand type: M"
Can someone please tell me how to find the data set within a given date range without providing the primary key?
You cannot do a between or any other function on a partition key; you must always provide the entire partition key.
For your use case, your GSI partition key should be a single static value, and CrawledAt should be the sort key.
{
  "Id": "String",
  "Name": "String",
  "CrawledAt": "String",
  "GsiPk": "Number"
}
Then query the index with a key condition such as the following, with :gsiPk bound to 1 in ExpressionAttributeValues (literal values are not allowed inside a KeyConditionExpression):
"KeyConditionExpression": "GsiPk = :gsiPk AND CrawledAt BETWEEN :v_start AND :v_end"
This would then allow you to retrieve all the data in the table between two dates. But be aware of the caveat of doing this: using a single value for the GSI partition key is not scalable, and would cap writes to that index at approximately 1000 WCU.
If you need more scale, you can assign a random number from 1 to n to GsiPk to spread the items across n partitions, which then requires you to make n queries (one per shard) and merge the results to collect all the data, as sketched below.
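A hedged sketch of that sharded pattern with the DocumentClient (GsiPk and CrawledAt follow the example above; the GSI name, shard count n, and docClient are assumptions):

// Query each of the n GsiPk shards for the date range and merge the results.
const n = 10  // assumed number of shards (items written with GsiPk = 1..n)
const pages = await Promise.all(
  Array.from({ length: n }, (_, i) =>
    docClient.query({
      TableName: 'sflnd00001-test',            // table name from the question
      IndexName: 'GsiPk-CrawledAt-index',      // hypothetical GSI name
      KeyConditionExpression: 'GsiPk = :p AND CrawledAt BETWEEN :v_start AND :v_end',
      ExpressionAttributeValues: {
        ':p': i + 1,
        ':v_start': '2020-01-31T00:00:00.000Z',
        ':v_end': '2025-11-31T00:00:00.000Z',
      },
    }).promise()
  )
)
const allItems = pages.flatMap(page => page.Items)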
Alternatively, you can Scan the table and use a FilterExpression, which is also not a scalable solution:
aws dynamodb scan \
    --table-name MusicCollection \
    --filter-expression "#ts between :a and :b" \
    --expression-attribute-names file://expression-attribute-names.json \
    --expression-attribute-values file://expression-attribute-values.json
(where expression-attribute-names.json maps #ts to the reserved word timestamp)
CURRENTLY
I have a table in DynamoDB with a single attribute - Primary Key - that contains unique values.
PK
------
#A#B#C#
#B#C#
#C#D#E#
#BC#
ISSUE
I am looking to do 2 searches for #B#C#: (1) an exact match, and (2) a containing match, and therefore only want these results:
(1) Exact Match:
#B#C#
(2) Containing Match:
#A#B#C#
#B#C#
Are these 2 searches possible against the primary key?
If so, what is the most efficient query to run? e.g. QUERY or SCAN
Note:
For (2) I am using the following code, but it is returning all items in DB:
params = {
  TableName: 'myTable',
  FilterExpression: "contains(#key, :v)",
  ExpressionAttributeNames: { "#key": "PK" },
  ExpressionAttributeValues: { ":v": "#B#C#" }
}
dynamodb.scan(params, callback)
DynamoDB supports two main types of searches: query and scan. The Query operation finds items based on primary key values. The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index.
If you wanted to find the item with the primary key #B#C#, you would use the query API:
ddbClient.query({
  "TableName": "<YOUR TABLE NAME>",
  "KeyConditionExpression": "#pk = :pk",
  "ExpressionAttributeValues": {
    ":pk": { "S": "#B#C#" }
  },
  "ExpressionAttributeNames": {
    "#pk": "PK"
  }
})
For your second access pattern, you'll need to use the scan API because you are searching across the entire table/secondary index.
You can use scan to test if a primary key has a substring using contains. I don't see anything wrong with the format of your scan operation.
Be careful when using scan this way. Because scan will read your entire table to fetch results, you will have a fairly inefficient operation at scale. If this operation is run infrequently, or you are running it against a sparse index, it's probably fine. However, if it's one of your primary access patterns, you may want to reconsider using the scan API for this operation.
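Since a single Scan call returns at most 1 MB of data before the filter is applied, a complete containing-match search needs to page through the table. A hedged sketch with the DocumentClient (table and attribute names taken from the question; docClient is an assumption):

// Scan the whole table, keeping only items whose PK contains the substring.
async function findContaining(substr) {
  const matches = []
  let lastEvaluatedKey
  do {
    const page = await docClient.scan({
      TableName: 'myTable',
      FilterExpression: 'contains(#key, :v)',
      ExpressionAttributeNames: { '#key': 'PK' },
      ExpressionAttributeValues: { ':v': substr },
      ExclusiveStartKey: lastEvaluatedKey,
    }).promise()
    matches.push(...page.Items)
    lastEvaluatedKey = page.LastEvaluatedKey
  } while (lastEvaluatedKey)
  return matches
}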
Here is the sample JSON which we are planning to insert into a DynamoDB table. As of now we have organizationID as the partition key and __id__ as the sort key. Since we will query based on organizationID, we kept it as the partition key. Is it a good approach to keep __id__ as the sort key?
{
  "__class__": "package",
  "__updated__": "2015-10-19T14:30:13Z",
  "__created__": "2015-10-19T12:32:28Z",
  "transactions": [
    { transaction1 },
    { transaction2 }
  ],
  "carrier": "USPS",
  "organizationID": "6406fa6fd32393908125d4d81ec358",
  "barcode": "9400110891302408",
  "queryString": [
    "xxxxxxx",
    "YYYY",
    "delivered"
  ],
  "deliveredTo": null,
  "__id__": "3232d1a045476786fg22dfg32b82209155b32"
}
As per best practice, you can have a timestamp as the sort key for the above data model. One advantage of having a timestamp as the sort key is that you can sort the data within a particular partition key and identify the latest updated item. This is a very common use case for a sort key.
It doesn't make much sense to keep both the partition and sort key as randomly generated values, because then you can't use the sort key efficiently (unless I'm missing something here).
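For illustration, a hedged sketch of that access pattern with the DocumentClient, assuming __updated__ were promoted to the sort key (the table name and docClient are assumptions; the attribute names come from the sample item):

// Fetch the most recently updated package for one organization.
const result = await docClient.query({
  TableName: 'packages',  // assumed table name
  KeyConditionExpression: 'organizationID = :org',
  ExpressionAttributeValues: { ':org': '6406fa6fd32393908125d4d81ec358' },
  ScanIndexForward: false,  // newest first, since the sort key is the timestamp
  Limit: 1,
}).promise()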
I'm familiar with MySQL and am starting to use Amazon DynamoDB for a new project.
Assume I have a MySQL table like this:
CREATE TABLE foo (
id CHAR(64) NOT NULL,
scheduledDelivery DATETIME NOT NULL,
-- ...other columns...
PRIMARY KEY(id),
INDEX schedIndex (scheduledDelivery)
);
Note the secondary index schedIndex, which is supposed to speed up the following query (which is executed periodically):
SELECT *
FROM foo
WHERE scheduledDelivery <= NOW()
ORDER BY scheduledDelivery ASC
LIMIT 100;
That is: Take the 100 oldest items that are due to be delivered.
With DynamoDB I can use the id column as primary partition key.
However, I don't understand how I can avoid full-table scans in DynamoDB. When adding a secondary index I must always specify a "partition key". However, (in MySQL words) I see these problems:
the scheduledDelivery column is not unique, so it can't be used as a partition key itself AFAIK
adding id as a unique partition key and using scheduledDelivery as the "sort key" sounds like an (id, scheduledDelivery) secondary index to me, which makes that index practically useless
I understand that MySQL and DynamoDB require different approaches, so what would be an appropriate solution in this case?
It's not possible to avoid a full table scan with this kind of query.
However, you may be able to disguise it as a Query operation, which would allow you to sort the results (not possible with a Scan).
You must first create a GSI. Let's name it scheduled_delivery-index.
We will specify our index's partition key to be an attribute named fixed_val, and our sort key to be scheduled_delivery.
fixed_val will contain any value you want, but it must always be that value, and you must know it from the client side. For the sake of this example, let's say that fixed_val will always be 1.
GSI keys do not have to be unique, so don't worry if there are two duplicated scheduled_delivery values.
You would query the table like this:
var now = Date.now();
//...
{
  TableName: "foo",
  IndexName: "scheduled_delivery-index",
  ExpressionAttributeNames: {
    "#f": "fixed_val",
    "#d": "scheduled_delivery"
  },
  ExpressionAttributeValues: {
    ":f": 1,
    ":d": now
  },
  KeyConditionExpression: "#f = :f and #d <= :d",
  ScanIndexForward: true
}
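Assuming the object above is assigned to a variable named params and docClient is an AWS.DynamoDB.DocumentClient (both assumptions), a hedged usage sketch that also mirrors the MySQL LIMIT 100:

// Take the 100 oldest items that are already due, sorted ascending by scheduled_delivery.
const results = await docClient.query({ ...params, Limit: 100 }).promise()
console.log(results.Items)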
I have a table such as the one at the end of this question. I insert into the peers_array field a dynamically keyed array/object such as:
{
  "130": { "to": 5 },
  "175": { "fr": 0 },
  "188": { "fr": 0 },
  "190": { "to": 5 },
  "280": { "fr": 4 }
}
I'm looking for advice on how to wildcard query the key field. Such as:
select * from table where peers_array[*]['to'] > 10
In Elasticsearch I can query like this:
peers_array.*.to: >10
My Table:
CREATE TABLE table (
"id" long primary key,
"sourceRouteId" integer,
"rci" integer,
peers_array object(dynamic),
"partition_date" string primary key
) partitioned by (partition_date) with (number_of_replicas = 0, refresh_interval = 5000);
I'm sorry to say, but this is currently not possible. We'll put it on our backlog. Thanks for reporting this use case.