Cloud Datastore: 503 Service Unavailable with 1000+/s concurrent transactions - google-cloud-datastore

When trying to save entities within a transaction to Datastore at a rate of ~1000 transactions per second or more, Datastore consistently returns 503 Service Unavailable until the load backs off to a lower rate.
I'm using the python-datastore client library in a web service to save millions of unique entities to Datastore (NOT Firestore in Datastore mode).
I've tried using the recommended "500/50/5" rule to gradually ramp up to 1000 operations per second and more, but Datastore consistently peaks at the same level irrespective of how gradually the load is increased.
I've also observed that the same Datastore transaction operations are perfectly sustained at 750 operations per second without issues.
My understanding is that Datastore can handle millions of operations per second; does this also apply to transactional operations?
Are there any limits or constraints on transactions when it comes to call volume?
Any suggestions or feedback as to how to tackle this issue would be greatly appreciated!
Here's a sample data model for an "Offer" Kind that's written to Datastore. "id", a uuid, is the entity's key.
{
    "id": "a0cf7d66-5fab-495f-a73c-617570628fd6",
    "loyalty_id": "191200101829",
    "status": "eligible",
    "ce_promotion_id": "6452",
    "hybris_promotion_id": null,
    "offer_promotion_id": "47032",
    "ce_campaign_id": "0382",
    "promotion_type": "offer",
    "display_order": 1,
    "activation_date": null,
    "deactivation_date": null,
    "expiration_date": "2021-04-12T08:00:00Z",
    "scheduled_expiration_date": "2021-04-13T07:56:00Z",
    "redemption_date": null,
    "created_date": "2021-04-11T19:15:28.067053Z",
    "update_date": "2021-04-11T19:15:28.067083Z"
}
I also have 3 composite indexes:
loyalty_id ASC status ASC expiration_date DESC
status ASC scheduled_expiration_date DESC
loyalty_id ASC expiration_date DESC

If the date properties increase or decrease monotonically across writes (created_date and update_date, for example, are monotonically increasing timestamps), the index ranges for those properties become a hotspot. To increase throughput, you would need to make the property non-monotonic when it's written to the database. Here's an approach from the docs:
For instance, if you want to query for entries by timestamp but only
need to return results for a single user at a time, you could prefix
the timestamp with the user id and index that new property instead.
This would still permit queries and ordered results for that user, but
the presence of the user id would ensure the index itself is well
sharded.
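For illustration, here is a minimal sketch of that prefixing idea with the python-datastore client, assuming a hypothetical loyalty_expiration property built from loyalty_id and expiration_date (the property name and helper function are not from the original post):

from google.cloud import datastore

client = datastore.Client()

def save_offer(offer):
    # "offer" is a dict shaped like the sample Offer entity above.
    key = client.key("Offer", offer["id"])
    entity = datastore.Entity(key=key)
    entity.update(offer)
    # Prefix the monotonic timestamp with a well-distributed value so the
    # index entries are spread out instead of all landing on the same range.
    entity["loyalty_expiration"] = f'{offer["loyalty_id"]}#{offer["expiration_date"]}'
    client.put(entity)

Queries for a single loyalty_id can then range-filter on the prefixed property (bounded by '191200101829#...') and still come back ordered by expiration date for that user.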
Another approach is to leave the property as is but turn off all indexes on that property (including the built-in one) except composite indexes that put a well-distributed property in front of it. From your example, these indexes might fit that model if loyalty_id is reasonably random:
loyalty_id ASC status ASC expiration_date DESC
loyalty_id ASC expiration_date DESC
Note: Indexing order matters. It's important here that the monotonic property come last in the composite index.
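As a rough sketch of turning off built-in indexes on the monotonic timestamps with the python-datastore client: created_date and update_date are used here because they don't appear in the composite indexes above (a property excluded from indexes also drops out of any composite index that references it, so only exclude properties the remaining composites don't need):

from google.cloud import datastore

client = datastore.Client()

key = client.key("Offer", "a0cf7d66-5fab-495f-a73c-617570628fd6")
# exclude_from_indexes disables the built-in single-property indexes for
# these monotonic timestamps, so they no longer create hot index ranges.
entity = datastore.Entity(
    key=key,
    exclude_from_indexes=("created_date", "update_date"),
)
entity.update({
    "loyalty_id": "191200101829",
    "status": "eligible",
    "created_date": "2021-04-11T19:15:28.067053Z",
    "update_date": "2021-04-11T19:15:28.067083Z",
    # ... remaining Offer properties as in the sample above
})
client.put(entity)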

Related

Is using "Current Date" a good partition key for data that will be queried by date and id?

I'm new to Azure Cosmos DB and I have this new project where I decided to give it a go.
My DB has only one collection, where around 6,000 new items are added every day, and each looks like this:
{
    "Result": "Pass",
    "date": "23-Sep-2021",
    "id": "user1#example.com"
}
The date is the partition key; it is the date on which the item was added to the collection, and the same id can be added again every day, as follows:
{
    "Result": "Fail",
    "date": "24-Sep-2021",
    "id": "user1#example.com"
}
The application that uses this DB will query by id and date to retrieve the Result.
I read some Azure Cosmos DB documentation and found that selecting the partition key carefully can improve the performance of the database and the RUs consumed per request.
I tried running this query and it consumed 2.9 RUs, with about 23,000 items in the collection.
SELECT * FROM c
WHERE c.id = 'user1#example.com' AND c.date = '24-Sep-2021'
Here are my questions:
Is using date a good partition key for my scenario? Any room for improvement?
Will the consumed RUs per request increase over time as the number of items in the collection increases?
Thanks.
For a write-heavy workload, using date as a partition key is a bad choice because you will always have a hot partition on the current date. However, if the amount of data being written is consistent and the write volume is low, it can work, and you will get a good distribution of data on storage.
In read-heavy scenarios, date can be a good partition key if it is used to answer most of the queries in the app.
The value for id must be unique per partition key value, so for your data model to work you can only have one "id" value per day.
If this is the case for your app, then you can make one additional optimization and replace the query you have with a point read, ReadItemAsync(). This takes the partition key value and the id. It is the fastest and most efficient way to read data because it does not go through the query engine and reads directly from the backend data store. A point read of an item of 1 KB or less always costs 1 RU.
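A minimal sketch of that point read with the Python SDK (azure-cosmos); the account URL, key, database and container names are placeholders, and ReadItemAsync() mentioned above is the .NET equivalent:

from azure.cosmos import CosmosClient

client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# Point read: id plus partition key value (the date); no query engine involved.
item = container.read_item(item="user1#example.com", partition_key="24-Sep-2021")
print(item["Result"])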

Are Azure CosmosDB indexes split by partition

I am sending some IoT events into Azure Cosmos DB. I am partitioning by device id, and I am always querying by device id. I want to know whether the automatically created indexes are separated by partition key. Specifically, if I run a query like
SELECT TOP 5 ... FROM events WHERE deviceId = X ORDER BY timeStamp DESC
Will it use the automatically created index on timeStamp, and if so, is it effective? Basically, what I am asking is whether there are separate indexes on timeStamp for each partition key value (deviceId in my case); otherwise the index would be relatively useless, since the range would contain a lot of irrelevant data from other devices. If this were SQL Server I would create an index on deviceId followed by timeStamp, but I am not sure how Cosmos DB works by default.
Indexes sit within the partition, so yes.
For the query you have, you should also create a composite index with a DESC sort order on the timestamp for the best performance.
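For illustration, a sketch of defining such a composite index (deviceId plus timeStamp DESC, which is an assumption about the intended composite) with the Python SDK when the container is created; account, database and container names are hypothetical, and the same policy can also be set in the portal or by replacing an existing container's indexing policy:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("<database>")

# Composite index matching the filter on deviceId plus ORDER BY timeStamp DESC.
indexing_policy = {
    "compositeIndexes": [
        [
            {"path": "/deviceId", "order": "ascending"},
            {"path": "/timeStamp", "order": "descending"},
        ]
    ]
}

events = database.create_container(
    id="events",
    partition_key=PartitionKey(path="/deviceId"),
    indexing_policy=indexing_policy,
)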

CosmosDB - TOP 1 query with ORDER BY - Retrieved document count and RU

Consider the following query:
SELECT TOP 1 * FROM c
WHERE c.Type = 'Case'
AND c.Entity.SomeField = @someValue
AND c.Entity.CreatedTimeUtc > @someTime
ORDER BY c.Entity.CreatedTimeUtc DESC
Until recently, when I ran this query, the number of documents processed by the query (RetrievedDocumentCount in the query metrics) was the number of documents that satisfied the first two conditions, regardless of the CreatedTimeUtc filter or the TOP 1.
Only when I added a composite index of (Type DESC, Entity.SomeField DESC, Entity.CreatedTimeUtc DESC) and added those properties to the ORDER BY clause did the retrieved document count drop to the number of documents that satisfy all three conditions (still not one document as expected, but better).
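For reference, a sketch of what that workaround looks like from the Python SDK, with the ORDER BY expanded to match the composite index; the parameter values, account URL, and database/container names are placeholders:

from azure.cosmos import CosmosClient

container = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>") \
    .get_database_client("<database>").get_container_client("<container>")

query = """
SELECT TOP 1 * FROM c
WHERE c.Type = 'Case'
  AND c.Entity.SomeField = @someValue
  AND c.Entity.CreatedTimeUtc > @someTime
ORDER BY c.Type DESC, c.Entity.SomeField DESC, c.Entity.CreatedTimeUtc DESC
"""

results = list(container.query_items(
    query=query,
    parameters=[
        {"name": "@someValue", "value": "some-value"},
        {"name": "@someTime", "value": "2021-09-01T00:00:00Z"},
    ],
    enable_cross_partition_query=True,
))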
Then, starting a few days ago, we noticed in our dev environment that the composite index is no longer needed: the retrieved document count changed to only one document (= the number in the TOP, as expected), and the RU charge dropped significantly.
My question: is this a new improvement/fix in Cosmos DB? I couldn't find any announcement or documentation on the matter.
If so, is the roll-out completed or still in progress? We have several production instances in different regions.
Thanks
There have not been any recent changes to our query engine that would explain why this query is suddenly less expensive.
The only thing that would explain this is that fewer results now match the filter than before, so our query engine was able to perform an optimization that it could not have done with a larger set of results.
Thanks.

Performing a conditional expression query on GSI in dynamodb

I know the query below is not supported in DynamoDB since you must use an equality expression on the HASH key.
query({
    TableName,
    IndexName,
    KeyConditionExpression: 'purchases >= :p',
    ExpressionAttributeValues: { ':p': 6 }
});
How can I organize my data so I can efficiently make a query for all items purchased >= 6 times?
Right now I only have 3 columns, orderID (Primary Key), address, confirmations (GSI).
Would it be better to use a different type of database for this type of query?
You would probably want to use the DynamoDB Streams feature to perform aggregation into another DynamoDB table. Streams publish an event for each change to your data, which you can then process with a Lambda function.
I'm assuming that in your primary table you are tracking each purchase by incrementing a counter. A simple version of the logic might be: on each update, check the purchases count for the item, and if it is >= 6, add the item ID to a list attribute (itemIDs or similar) in another DynamoDB table. Depending on how you want to query this statistic, you might create a new entry every day, hour, etc.
Bear in mind DynamoDB has a 400 KB limit per item, so this may not be the best solution depending on how many item IDs you would need to capture in the itemIDs attribute for a given time period.
You would also need to consider how you reset your purchases counter (this might be a scheduled batch job where you reset purchase count back to zero every x time period).
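As a rough sketch of that stream-driven aggregation (the aggregate table name, the period attribute, and the attribute names in the stream image are assumptions, not from the question):

import time
import boto3

dynamodb = boto3.resource("dynamodb")
stats_table = dynamodb.Table("purchase_stats")  # hypothetical aggregate table

def handler(event, context):
    # Invoked by a DynamoDB stream on the primary table (NEW_IMAGE view assumed).
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        purchases = int(image["purchases"]["N"])
        if purchases >= 6:
            # Append the order id to a string set for the current day.
            stats_table.update_item(
                Key={"period": time.strftime("%Y-%m-%d")},
                UpdateExpression="ADD itemIDs :ids",
                ExpressionAttributeValues={":ids": {image["orderID"]["S"]}},
            )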
Alternatively, you could capture the time period in your primary table and create a GSI that is partitioned on the time period and has purchases as the sort key. That way you could efficiently query (rather than scan) a given time period for all items that have a purchase count >= 6.
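A minimal sketch of querying such a GSI with boto3 (the table name, index name, and period attribute are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")

response = table.query(
    IndexName="period-purchases-index",
    KeyConditionExpression=Key("period").eq("2021-09-24") & Key("purchases").gte(6),
)
popular_items = response["Items"]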
You don't need to reorganise your data; just use a scan instead of a query:
scan({
    TableName,
    IndexName,
    // FilterExpression is applied after the items are read, so the scan still
    // consumes read capacity for every item in the index.
    FilterExpression: 'purchases >= :p',
    ExpressionAttributeValues: { ':p': 6 }
});

DynamoDB NoSQL design for queries

I am looking to store a log of user events. There are going to be a lot of entries, so I thought DynamoDB would be a good fit, as everything else is hosted there.
I need to query these events in two ways: the total number of events for a user for a date (range), and occasionally all the events for a date.
I was thinking of storing it in one table as user id (key), sequence number (key), date, time and duration.
Should it be multiple tables? How can this be done most efficiently?
For a small amount of data this structure is ok.
Keep in mind that the sequence number (your range key) has to be provided by you. It seems a good idea to use the date as a Unix timestamp with millisecond accuracy as the sort key.
There is no need for extra tables.
However, your structure depends largely on the read/write capacity that you want to achieve and on the data size.
Suppose user_id is your partition key.
For every distinct partition key value, the total sizes of all table and index items cannot exceed 10 GB.
A single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units.
You need to create your partition keys by taking into consideration these limitations.
For example, a very active user has many events and thus needs more than 1,000 write capacity units. Unfortunately, you have chosen the user id as the partition key.
In this case you are limited to 1,000 write capacity units per partition, so you might see throttling failures.
You need a different structure: for example, sharded partition key values like user_id_1, user_id_2, etc., i.e. a partition-naming mechanism that spreads the data across partitions according to your application's needs (see the sketch after the links below).
Check these links on DynamoDB limitations:
Tables guidance,
Partition distribution
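For illustration, a minimal sketch of that write-sharding idea with boto3; the table name, attribute names, and shard count are all hypothetical:

import random
import boto3

table = boto3.resource("dynamodb").Table("user_events")
NUM_SHARDS = 10  # sized for the write throughput a single hot user needs

def put_event(user_id, timestamp_ms, duration):
    shard = random.randint(1, NUM_SHARDS)
    table.put_item(Item={
        "user_shard": f"{user_id}_{shard}",   # e.g. user_id_3, spreads the writes
        "event_timestamp": timestamp_ms,
        "duration": duration,
    })

# Reads for one user then have to query all NUM_SHARDS key values
# (user_id_1 .. user_id_10) and merge the results client-side.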
I would suggest the following structure for your events table:
user id -- hash key
event date/time (timestamp with milliseconds) -- range key
duration
Having event timestamp as a range key should be sufficient to provide uniqueness for an event (unless a user can have multiple events right in the same millisecond), so you don't need a sequence number.
Having such a schema, you can get all events for a user for a date by using a simple query.
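For example, a sketch of that query with boto3, assuming the schema above with the event timestamp stored as milliseconds since the epoch (the table and attribute names are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("user_events")

# All events for one user on 2017-02-17 (UTC), expressed as a millisecond range.
response = table.query(
    KeyConditionExpression=(
        Key("user_id").eq("65716110-f4df-11e6-bc64-92361f002671")
        & Key("event_timestamp").between(1487289600000, 1487375999999)
    )
)
events = response["Items"]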
Unfortunately, DynamoDB does not support aggregate queries, so you can't get the total number of events for a user quickly (you would have to fetch all the records and count them yourself).
So I would suggest creating a separate table for user events statistics like this:
user id -- hash key
date -- range key
events_cnt (total number of events for a user for a date)
So, after you add a new record to your events table, you have to increment the events counter for the user in the statistics table, as shown below:
var AWS = require('aws-sdk');
var dynamodbDoc = new AWS.DynamoDB.DocumentClient();

var params = {
    TableName: "user_events_stats",
    Key: {
        userId: "65716110-f4df-11e6-bc64-92361f002671",
        date: "2017-02-17",
    },
    // Initialise the counter to 0 on the first write, then add 1.
    UpdateExpression: "SET #events_cnt = if_not_exists(#events_cnt, :zero) + :one",
    ExpressionAttributeNames: {
        "#events_cnt": "events_cnt",
    },
    ExpressionAttributeValues: {
        ":one": 1,
        ":zero": 0,
    },
};

dynamodbDoc.update(params, function(err, data) {
    if (err) console.error("Failed to update events counter", err);
});
