How does one create a GSI for an Item in an Array in DynamoDB?

I have a DynamoDB table that has:
Partition key: State (i.e. the two-letter state ID)
Sort key: City (name of the city in the state)
Each item also holds an array attribute, let's say Advertisements:
"StateShort": "AK",
"City": "Anchorage",
"Ads": [
{
"AdDateAdded": 1674671363999,
"AdDateExpire": 1682447536551,
"AdIDKey": "ABC-123-GUID-Here",
"AdTitle": "This is the Title to Ad 1"
"AdDescription": "Description of the Details to Ad 1",
"AdOwner": "bob#example.com",
},
{
"AdDateAdded": 1674671363999,
"AdDateExpire": 1682447536551,
"AdIDKey": "DEF-456-GUID-Here",
"AdTitle": "This is the Title to Ad 2"
"AdDescription": "Description of the Details to Ad 2",
"AdOwner": "bob#example.com",
}
]
Querying for all ads in a given State and City is easy-peasy, since those are the PK and SK.
But I do NOT want to Scan to find all the ads a given AdOwner has ("bob#example.com"). They may have ads in other states and cities, which would force me to Scan the entire table.
This FEELS like a perfect use case for a Global Secondary Index.
I've added AdOwner as a GSI but, clearly, it can't find the key inside the array.
Question: Is this solvable with a GSI? If so, what would that structure look like?
After creating the GSI, I tried this code, but it returns no items:
const params = {
  TableName: "My_table",
  IndexName: "AdEmail-index",
  KeyConditionExpression: "#IndexName = :AccountID",
  ExpressionAttributeNames: {
    "#IndexName": "AdOwner",
  },
  ExpressionAttributeValues: {
    ":AccountID": "bob#example",
  },
  ScanIndexForward: false
}

try {
  const item = await dynamo.query(params).promise()
  console.log("what: ", item)
} catch (e) {
  console.log("ERROR", e)
}

No, a global secondary index key must be a top-level attribute of type string, number, or binary.
You should vertically shard your items, giving you more flexibility:
pk    sk                            data    AdOwner
AK    Anchorage#ABC-123-GUID-Here   {}      bob#example.com
AK    Anchorage#DEF-456-GUID-Here   {}      bob#example.com
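For example, each ad could become its own item, with AdOwner promoted to a top-level attribute and the remaining fields kept in data (a sketch; the exact attribute layout is an assumption, not something the original answer spells out):
{
  "pk": "AK",
  "sk": "Anchorage#ABC-123-GUID-Here",
  "AdOwner": "bob#example.com",
  "data": {
    "AdDateAdded": 1674671363999,
    "AdDateExpire": 1682447536551,
    "AdTitle": "This is the Title to Ad 1",
    "AdDescription": "Description of the Details to Ad 1"
  }
}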
All ads in a state and city, still easy, using Query:
SELECT * FROM MY TABLE WHERE pk = 'AK' AND sk BEGINS_WITH 'Anchorage'
You can now create a GSI on AdOwner to fulfill your second access pattern:
SELECT * FROM MY TABLE.INDEX WHERE AdOwner = 'bob#example.com'
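With the Node.js SDK used in the question, that second query might look roughly like this. This is a sketch: the index name "AdOwner-index" is an assumption, and it presumes the table has been resharded as above so AdOwner is a top-level attribute.
const AWS = require("aws-sdk")
const docClient = new AWS.DynamoDB.DocumentClient()

// Fetch every ad owned by one account, regardless of which
// state/city partition each item lives in.
async function adsByOwner(owner) {
  const params = {
    TableName: "My_table",        // table name from the question
    IndexName: "AdOwner-index",   // assumed GSI name
    KeyConditionExpression: "AdOwner = :owner",
    ExpressionAttributeValues: { ":owner": owner },
  }
  const result = await docClient.query(params).promise()
  return result.Items
}

// e.g. const ads = await adsByOwner("bob#example.com")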

DynamoDB filter if primary key contains value

CURRENTLY
I have a table in DynamoDB with a single attribute - Primary Key - that contains unique values.
PK
------
#A#B#C#
#B#C#
#C#D#E#
#BC#
ISSUE
I am looking to do 2 searches for #B#C#: (1) an exact match, and (2) a containing match, so I only want these results:
(1) Exact Match:
#B#C#
(2) Containing Match:
#A#B#C#
#B#C#
Are these 2 searches possible against the primary key?
If so, what is the most efficient query to run? e.g. QUERY or SCAN
Note:
For (2) I am using the following code, but it returns all items in the DB:
params = {
  TableName: 'myTable',
  FilterExpression: "contains(#key, :v)",
  ExpressionAttributeNames: { "#key": "PK" },
  ExpressionAttributeValues: { ":v": "#B#C#" }
}
dynamodb.scan(params, callback)
DynamoDB supports two main types of searches: query and scan. The Query operation finds items based on primary key values. The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index.
If you wanted to find the item with the primary key #B#C#, you would use the query API:
ddbClient.query({
  "TableName": "<YOUR TABLE NAME>",
  "KeyConditionExpression": "#pk = :pk",
  "ExpressionAttributeNames": {
    "#pk": "PK"
  },
  "ExpressionAttributeValues": {
    ":pk": { "S": "#B#C#" }
  }
})
For your second access pattern, you'll need to use the scan API because you are searching across the entire table/secondary index.
You can use scan to test if a primary key has a substring using contains. I don't see anything wrong with the format of your scan operation.
Be careful when using scan this way. Because scan will read your entire table to fetch results, you will have a fairly inefficient operation at scale. If this operation is run infrequently, or you are running it against a sparse index, it's probably fine. However, if it's one of your primary access patterns, you may want to reconsider using the scan API for this operation.
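For completeness, Scan only returns up to 1 MB of data per call, so a paginated version of the containing-match search might look like the sketch below (v2 DocumentClient; the table name is a placeholder):
const AWS = require("aws-sdk")
const docClient = new AWS.DynamoDB.DocumentClient()

// Scan the whole table and keep items whose PK contains the substring,
// following LastEvaluatedKey so results beyond the 1 MB page are not lost.
async function findContaining(substring) {
  const items = []
  let ExclusiveStartKey
  do {
    const page = await docClient.scan({
      TableName: "myTable",
      FilterExpression: "contains(#key, :v)",
      ExpressionAttributeNames: { "#key": "PK" },
      ExpressionAttributeValues: { ":v": substring },
      ExclusiveStartKey,
    }).promise()
    items.push(...page.Items)
    ExclusiveStartKey = page.LastEvaluatedKey
  } while (ExclusiveStartKey)
  return items
}

// e.g. const matches = await findContaining("#B#C#")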

Limit number of documents in a partition for cosmosdb

I have a CosmosDB collection with each partition containing a set of documents. I would like to maintain the collection such that a logical partition ('id' in this case) does not go over the limit of 5 documents. In the sample below, when a sixth entry (say for 8/11/2020) is added, I want to delete the document created on 7/13/2020, since that was updated the earliest.
Basically, I want to make sure that for the item with id 12345 there are only the 5 latest entries and no more. This is to reduce the data in the DB and thus avoid querying more data than what's needed.
{
  "id": 12345,
  "lastUpdated": "8/10/2020"
},
{
  "id": 12345,
  "lastUpdated": "8/3/2020"
},
{
  "id": 12345,
  "lastUpdated": "7/27/2020"
},
{
  "id": 12345,
  "lastUpdated": "7/20/2020"
},
{
  "id": 12345,
  "lastUpdated": "7/13/2020"
}
I could do something like this:
Get all documents for id 12345.
If the count of documents is >= 5, get the oldest document (the 5th one) and delete it.
Insert the new document.
However, that means running 3 operations just to insert a single document.
Is there a more elegant way to do this?
Thanks!
You can use the OFFSET ... LIMIT clause to separate the 5 latest entries from everything older. For more details, you can read the official documentation about the OFFSET LIMIT clause in Azure Cosmos DB.
To find the documents to remove, query everything beyond the 5 most recent (assume at most 100 remain) and either set a TTL on them or delete them directly:
SELECT f.id, f.lastUpdated FROM yourcosmosdb f ORDER BY f.lastUpdated DESC OFFSET 5 LIMIT 100
Then iterate over the results and delete each document:
// feedIterator is obtained from container.GetItemQueryIterator<response>(...) using the query above
List<Task> concurrentDeleteTasks = new List<Task>();
while (feedIterator.HasMoreResults)
{
    FeedResponse<response> res = await feedIterator.ReadNextAsync();
    foreach (var item in res)
    {
        // item.deviceid stands in for whatever your container's partition key value is
        concurrentDeleteTasks.Add(container.DeleteItemAsync<response>(item.id, new PartitionKey(item.deviceid)));
    }
}
await Task.WhenAll(concurrentDeleteTasks);
You can also loop over those documents and set ttl = 10 on each of them; they will then be deleted 10 seconds later.
To get just the latest 5 documents:
SELECT f.id, f.lastUpdated FROM yourcosmosdb f ORDER BY f.lastUpdated DESC OFFSET 0 LIMIT 5
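In the Node.js SDK (@azure/cosmos), the query-then-delete step might look roughly like the sketch below. The endpoint, key, database/container names, and the deviceId partition key field are all assumptions, and each document is assumed to carry its own unique id.
const { CosmosClient } = require("@azure/cosmos")

const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT, key: process.env.COSMOS_KEY })
const container = client.database("mydb").container("mycontainer")

// Keep only the 5 most recently updated documents for one logical key.
async function trimPartition(deviceId) {
  const querySpec = {
    query: "SELECT c.id FROM c WHERE c.deviceId = @deviceId ORDER BY c.lastUpdated DESC OFFSET 5 LIMIT 100",
    parameters: [{ name: "@deviceId", value: deviceId }],
  }
  const { resources: stale } = await container.items.query(querySpec, { partitionKey: deviceId }).fetchAll()

  // Delete everything beyond the newest 5 in parallel.
  await Promise.all(stale.map(doc => container.item(doc.id, deviceId).delete()))
}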

What keys should be made while storing news data in DynamoDB?

I need some guidance regarding how to store news records in DynamoDB.
{
  "news": [{
    "id": "nws_7KnqNr",
    "title": "Dow Jones Futures: From Apple To Zscaler, This Is The New Stock Market Trend",
    "publication_date": "2019-09-11T10:46:24.000Z",
    "url": "https://finance.yahoo.com/m/c5f84bed-ce61-3938-9af1-953d15dbcf65/dow-jones-futures%3A-from-apple.html?.tsrc=rss",
    "summary": "Dow Jones futures: From low Apple TV+ pricing to Roku's sell-off and Ally Financial's breakout, value is in. Already-reeling Zscaler plunged on guidance. RH fell too."
  }],
  "company": {
    "id": "com_NX6GzO",
    "ticker": "AAPL",
    "name": "Apple Inc",
    "lei": "HWUPKR0MPOU8FGXBT394",
    "cik": "0000320193"
  },
  "next_page": "MjAxOS0wOS0xMSAxMDo0NjoyNCBVVEN8NTM1NDYzNg=="
}
This is the sample JSON.
News is being pulled from some API and should be stored in DynamoDB.
What keys need to be made for efficient retrieval?
News can be fetched per company too.
News
----------
id (hash key)
title
publication_date
url
summary
company_id (index - hash key)
Should do the trick. So every element of the "news" array will go here, with the company id. If you want to fetch by news id, you can do it efficiently and also by company id (because of the index).
There will be issues with the index if a few big companies (Apple, for example) account for most of the news and you have a lot of data.
To fix that, use
company_by_month_id (index - hash key)
which is a compound key.
Update:
company_name (index - hash key + timestamp as sort key)
ticker (index - hash key + timestamp as sort key)
timestamp (this is generated)
Query the two indexes created to get the most recent news items based on company name or ticker.
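With such an index in place, fetching the most recent items for a company might look roughly like this (a sketch with the v2 DocumentClient; the table and index names are assumptions):
const AWS = require("aws-sdk")
const docClient = new AWS.DynamoDB.DocumentClient()

// Get the 20 most recent news items for a company name, newest first,
// from a GSI keyed on company_name (hash) + timestamp (sort).
async function latestNewsByCompany(companyName) {
  const result = await docClient.query({
    TableName: "News",                          // assumed table name
    IndexName: "company_name-timestamp-index",  // assumed GSI name
    KeyConditionExpression: "company_name = :c",
    ExpressionAttributeValues: { ":c": companyName },
    ScanIndexForward: false, // descending by the timestamp sort key
    Limit: 20,
  }).promise()
  return result.Items
}

// e.g. const items = await latestNewsByCompany("Apple Inc")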

Delete large data with same partition key from DynamoDB

I have a DynamoDB table structured like this:
A  B    C    D
1  id1  foo  hi
1  id2  var  hello
A is the partition key and B is the sort key.
Let's say I only have the partition key and don't know the sort keys, and I'd like to delete all entries that have the same partition key.
So I am thinking about loading entries via Query with a fixed page size (e.g. 1000) and deleting them in batches until there are no more entries with that partition key left in DynamoDB.
Is it possible to delete entries without loading them first?
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DeleteItem.html
DeleteItem
Deletes a single item in a table by primary key.
For the primary key, you must provide all of the attributes. For example, with a simple primary key, you only need to provide a value for the partition key. For a composite primary key, you must provide values for both the partition key and the sort key.
In order to delete an item you must provide the whole primary key (partition + sort key). So in your case you would need to query on the partition key, get all of the primary keys, then use those to delete each item. You can also use BatchWriteItem
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
BatchWriteItem
The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB.
DeleteRequest - Perform a DeleteItem operation on the specified item. The item to be deleted is identified by a Key subelement: Key - A map of primary key attribute values that uniquely identify the item. Each entry in this map consists of an attribute name and an attribute value. For each primary key, you must provide all of the key attributes. For example, with a simple primary key, you only need to provide a value for the partition key. For a composite primary key, you must provide values for both the partition key and the sort key.
No, but you can Query all the items for the partition, and then issue an individual DeleteRequest for each item, which you can batch in multiple BatchWrite calls of up to 25 items.
JS code
async function deleteItems(tableName, partitionId) {
  // Fetch every item that shares the given partition key value.
  const queryParams = {
    TableName: tableName,
    KeyConditionExpression: 'partitionId = :partitionId',
    ExpressionAttributeValues: { ':partitionId': partitionId },
  };
  const queryResults = await docClient.query(queryParams).promise()

  if (queryResults.Items && queryResults.Items.length > 0) {
    // BatchWriteItem accepts at most 25 requests per call, so chunk the items.
    const batchCalls = chunks(queryResults.Items, 25).map(async (chunk) => {
      const deleteRequests = chunk.map(item => {
        return {
          DeleteRequest: {
            Key: {
              'partitionId': item.partitionId,
              'sortId': item.sortId,
            }
          }
        }
      })
      const batchWriteParams = {
        RequestItems: {
          [tableName]: deleteRequests
        }
      }
      await docClient.batchWrite(batchWriteParams).promise()
    })
    await Promise.all(batchCalls)
  }
}

// https://stackoverflow.com/a/37826698/3221253
function chunks(inputArray, perChunk) {
  return inputArray.reduce((all, one, i) => {
    const ch = Math.floor(i / perChunk);
    all[ch] = [].concat((all[ch] || []), one);
    return all
  }, [])
}
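One caveat the snippet above skips: batchWrite can return UnprocessedItems when requests are throttled, and those need to be retried. A minimal retry sketch (not part of the original answer):
// Retry any delete requests that BatchWriteItem could not process.
async function batchWriteWithRetry(tableName, requests) {
  let RequestItems = { [tableName]: requests }
  while (RequestItems && Object.keys(RequestItems).length > 0) {
    const res = await docClient.batchWrite({ RequestItems }).promise()
    RequestItems = res.UnprocessedItems
    // In production, add exponential backoff between retries.
  }
}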
For production databases and critical Amazon DynamoDB tables, the recommendation is to use batch-write-item to purge huge amounts of data.
batch-write-item (with DeleteRequest) is 10 to 15 times faster than delete-item.
Note that timestamp is a DynamoDB reserved word, so it has to be aliased via --expression-attribute-names:
aws dynamodb scan --table-name "test_table_name" \
  --projection-expression "primary_key, #ts" \
  --filter-expression "#ts < :oldest_date" \
  --expression-attribute-names '{"#ts": "timestamp"}' \
  --expression-attribute-values '{":oldest_date": {"S": "2020-02-01"}}' \
  --max-items 25 --total-segments "$TOTAL_SEGMENT" --segment "$SEGMENT_NUMBER" > $SCAN_OUTPUT_FILE

cat $SCAN_OUTPUT_FILE | jq -r ".Items[] | tojson" | awk '{ print "{\"DeleteRequest\": {\"Key\": " $0 " }}," }' | sed '$ s/.$//' | sed '1 i { "test_table_name": [' | sed '$ a ] }' > $INPUT_FILE

aws dynamodb batch-write-item --request-items file://$INPUT_FILE

Please find more information at https://medium.com/analytics-vidhya/how-to-delete-huge-data-from-dynamodb-table-f3be586c011c
You can use "begins_with" on the range key.
For example (pseudo code)
DELETE WHERE A = '1' AND B BEGINS_WITH 'id'

Tips for querying dynamic object fields in Crate

I have a table such as the one at the end of this question. I insert into the peers_array field a dynamically keyed array/object such as:
{
  "130": { "to": 5 },
  "175": { "fr": 0 },
  "188": { "fr": 0 },
  "190": { "to": 5 },
  "280": { "fr": 4 }
}
I'm looking for advice on how to wildcard query the key field. Such as:
select * from table where peers_array[*]['to'] > 10
In Elasticsearch I can query like this:
peers_array.*.to: >10
My Table:
CREATE TABLE table (
  "id" long primary key,
  "sourceRouteId" integer,
  "rci" integer,
  peers_array object(dynamic),
  "partition_date" string primary key
) partitioned by (partition_date) with (number_of_replicas = 0, refresh_interval = 5000);
I'm sorry to say, but this is currently not possible. We'll put it on our backlog. Thanks for reporting this use case.
