I have a DynamoDB table which contains information about the status of different cron jobs.
Table attributes:
id [HashKey]
jobId [RangeKey]
status ('failed','pending', 'success')
I want to query the items based on the job status field.
E.g., list all jobs which are in the pending state.
So I created the GSI as below.
GSI:
{
  IndexName: 'StatusIndex',
  KeySchema: [
    {
      AttributeName: 'status',
      KeyType: 'HASH',
    },
  ],
  Projection: {
    ProjectionType: 'ALL',
  },
},
But the query on the GSI is very slow when all the items contain the same status value.
id  jobId  status
1   job1   pending
2   job2   pending
3   job3   pending
4   job4   pending
Is this because of not having a range key?
You might be better off with a Parallel Scan here. A Query does not have parallel functionality, so if you're trying to get a very large amount of data in one Query, it will be slow. If you use a Parallel Scan, set the number of segments (and worker threads) to roughly match the number of MB of data in your table to optimise the speed. This will cost you more RCUs than a Query.
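As a rough illustration, here is a minimal parallel Scan sketch using boto3; the table name, segment count, and the status filter are assumptions, not from the question.

import boto3
from boto3.dynamodb.conditions import Attr
from concurrent.futures import ThreadPoolExecutor

table = boto3.resource("dynamodb").Table("CronJobs")  # assumed table name
TOTAL_SEGMENTS = 8  # tune roughly to the table size, as suggested above

def scan_segment(segment):
    # Scan one segment, following pagination, keeping only pending jobs.
    items = []
    kwargs = {
        "Segment": segment,
        "TotalSegments": TOTAL_SEGMENTS,
        "FilterExpression": Attr("status").eq("pending"),
    }
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            return items
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    pending = [i for chunk in pool.map(scan_segment, range(TOTAL_SEGMENTS)) for i in chunk]

Note that the filter is applied after the read, so you still pay RCUs for every item scanned.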
Alternatively, you can consider remodeling your data. You will need a way of running multiple Queries to access the desired data, and a way of running them in parallel from your client. One option is to break the data down into a time series.
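For example, one possible remodel (attribute and index names here are assumptions, not from the question) is to write a synthetic attribute such as statusDate = "pending#2023-09-24" on each item, put a GSI on it, and query one day per request, in parallel:

import boto3
from boto3.dynamodb.conditions import Key
from concurrent.futures import ThreadPoolExecutor

table = boto3.resource("dynamodb").Table("CronJobs")  # assumed table name

def pending_for_day(day):
    # Query the assumed 'StatusDateIndex' GSI for one day's pending jobs
    # (pagination omitted to keep the sketch short).
    resp = table.query(
        IndexName="StatusDateIndex",
        KeyConditionExpression=Key("statusDate").eq(f"pending#{day}"),
    )
    return resp["Items"]

days = ["2023-09-22", "2023-09-23", "2023-09-24"]
with ThreadPoolExecutor() as pool:
    pending = [i for chunk in pool.map(pending_for_day, days) for i in chunk]

This keeps each GSI partition small (one status value per day) instead of piling every pending job under a single 'pending' partition key.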
I'm new to Azure Cosmos DB and I have this new project where I decided to give it a go.
My DB has only one collection, where around 6,000 new items are added every day, and each looks like this:
{
  "Result": "Pass",
  "date": "23-Sep-2021",
  "id": "user1#example.com"
}
The date is the partition key, and it is the date on which the item was added to the collection. The same id can be added again every day, as follows:
{
  "Result": "Fail",
  "date": "24-Sep-2021",
  "id": "user1#example.com"
}
The application that uses this DB will query by id and date to retrieve the Result.
I read some of the Azure Cosmos DB documentation and found that selecting the partition key carefully can improve the performance of the database and reduce the RUs used for each request.
I tried running this query and it consumed 2.9 RUs; the collection has about 23,000 items.
SELECT * FROM c
WHERE c.id = 'user1#example.com' AND c.date = '24-Sep-2021'
Here are my questions:
Is using date a good partition key for my scenario? Is there any room for improvement?
Will the RUs consumed per request increase over time as the number of items in the collection grows?
Thanks.
For a write-heavy workload using date as a partition key is a bad choice because you will always have a hot partition on the current date. However, if the amount of data being written is consistent and the write volume is low, then it can be used and you will have good distribution of data on storage.
In read-heavy scenarios, date can be a good partition key if it is used to answer most of the queries in the app.
The value for id must be unique per partition key value, so for your data model to work you can only have one "id" value per day.
If this is the case for your app, then you can make one additional optimization and replace the query with a point read, ReadItemAsync(). This takes the partition key value and the id. It is the fastest and most efficient way to read data because it does not go through the query engine and reads directly from the backend data store. A point read of an item of 1 KB or less always costs 1 RU.
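For illustration, a minimal point-read sketch using the Python SDK's equivalent of ReadItemAsync (the endpoint, key, database, and container names are placeholders):

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# id plus partition key value -> direct read from the backend store, no query engine.
item = container.read_item(item="user1#example.com", partition_key="24-Sep-2021")
print(item["Result"])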
I am using Azure Cosmos to store customer data in my multi-tenant app. One of my customers started complaining about long wait times when querying their data. As a quick fix, I created a dedicated Cosmos instance for them and copied their data to that dedicated instance. Now I have two databases that contain exact copies of this customer's data. We'll call these databases db1 and db2. db1 contains all data for all customers, including this customer in question. db2 contains only data for this customer in question. Also, for both databases the partition key is an aggregate of tenant id and date, called ownerTime. Also, each database contains a single container named "call".
I then run this query in both databases:
select c.id,c.callTime,c.direction,c.action,c.result,c.duration,c.hasR,c.hasV,c.callersIndexed,c.callers,c.files,c.tags_s,c.ownerTime
from c
where
c.ownerTime = '352897067_202011'
and c.callTime>='2020-11-01T00:00:00'
and c.callTime<='2020-11-30T59:59:59'
and (CONTAINS(c.phoneNums_s, '7941521523'))
As you can see, I am isolating one partition (ownerTime: 352897067_202011). In this partition, there are about 50,000 records in each database.
In db1 (the database with all customer data), this uses 5116.38 RUs. In db2 (the dedicated instance), this query uses 65.8 RUs.
Why is there this discrepancy? The data in these two partitions is exactly the same across the two databases. The indexing policy is exactly the same as well. I suspect that db1 is trying to do a fan-out query. But why would it do that? I have the query set up so that it will only look in this one partition.
Here are the stats I retrieved for the above query after running on each database:
db1
Request Charge: 5116.38 RUs
Retrieved document count: 8
Retrieved document size: 18168 bytes
Output document count: 7
Output document size: 11793 bytes
Index hit document count: 7
Index lookup time: 5521.42 ms
Document load time: 7.8100000000000005 ms
Query engine execution time: 0.23 ms
System function execution time: 0.01 ms
User defined function execution time: 0 ms
Document write time: 0.07 ms
Round Trips: 1
db2
Request Charge: 65.8 RUs
Showing Results: 1 - 7
Retrieved document count: 7
Retrieved document size: 16585 bytes
Output document count: 7
Output document size: 11744 bytes
Index hit document count: 7
Index lookup time: 20.720000000000002 ms
Document load time: 4.8099 ms
Query engine execution time: 0.2001 ms
System function execution time: 0.01 ms
User defined function execution time: 0 ms
Document write time: 0.05 ms
Round Trips: 1
The indexing policy for both databases is:
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*"
    }
  ],
  "excludedPaths": [
    {
      "path": "/\"_etag\"/?"
    },
    {
      "path": "/callers/*"
    },
    {
      "path": "/files/*"
    }
  ]
}
Update: I recently updated this question with a clearer query and the query stats it returned.
Following up from comments.
Since db1 has all customers' data, the physical partition has many more unique values for callTime, so the number of index pages scanned to evaluate callTime will be high. In the case of db2, since only one customer's data is there, the logical and physical partition are effectively the same. So while this is not a fan-out, the query engine still needs to evaluate the range filter on callTime against the index entries for all the other customers' data.
To fix/improve the performance on db1, you should create a composite index on /ownerTime and /callTime; see below.
"compositeIndexes":[
[
{
"path":"/ownerTime"
},
{
"path":"/callTime"
}
]
],
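For context, here is a sketch of how that fragment slots into the indexing policy shown in the question; compositeIndexes sits at the same level as includedPaths and excludedPaths:

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    { "path": "/*" }
  ],
  "excludedPaths": [
    { "path": "/\"_etag\"/?" },
    { "path": "/callers/*" },
    { "path": "/files/*" }
  ],
  "compositeIndexes": [
    [
      { "path": "/ownerTime" },
      { "path": "/callTime" }
    ]
  ]
}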
Thanks.
Microsoft makes it clear that cross-partition queries "fan-out" the query to each partition (link):
The following query doesn't have a filter on the partition key (DeviceId). Therefore, it must fan-out to all physical partitions where it is run against each partition's index:
So I am curious if that "fan-out" can be optimized by doing a range query on a partition key, such as STARTSWITH.
To test it, I created a small Cosmos DB with seven documents:
{
  "partitionKey": "prefix1:",
  "id": "item1a"
},
{
  "partitionKey": "prefix1:",
  "id": "item1b"
},
{
  "partitionKey": "prefix1:",
  "id": "item1c"
},
{
  "partitionKey": "prefix1X:",
  "id": "item1d"
},
{
  "partitionKey": "prefix2:",
  "id": "item2a"
},
{
  "partitionKey": "prefix2:",
  "id": "item2b"
},
{
  "partitionKey": "prefix3:",
  "id": "item3a"
}
It has the default indexing policy with partition key "/partitionKey". Then I ran a bunch of queries:
SELECT * FROM c WHERE STARTSWITH(c.partitionKey, 'prefix1')
-- Actual Request Charge: 2.92 RUs
SELECT * FROM c WHERE c.partitionKey = 'prefix1:' OR c.partitionKey = 'prefix1X:'
-- Actual Request Charge: 3.02 RUs
SELECT * FROM c WHERE STARTSWITH(c.partitionKey, 'prefix1:')
SELECT * FROM c WHERE c.partitionKey = 'prefix1:'
-- Each Query Has Actual Request Charge: 2.89 RUs
SELECT * FROM c WHERE STARTSWITH(c.partitionKey, 'prefix2')
SELECT * FROM c WHERE c.partitionKey = 'prefix2:'
-- Each Query Has Actual Request Charge: 2.86 RUs
SELECT * FROM c WHERE STARTSWITH(c.partitionKey, 'prefix3')
SELECT * FROM c WHERE c.partitionKey = 'prefix3:'
-- Each Query Has Actual Request Charge: 2.83 RUs
SELECT * FROM c WHERE c.partitionKey = 'prefix2:' OR c.partitionKey = 'prefix3:'
-- Actual Request Charge: 2.99 RUs
The request charges were consistent when re-running the queries, and the pattern of charge growth seemed consistent with the result set and query complexity, with the possible exception of the 'OR' queries. However, then I tried this:
SELECT * FROM c
-- Actual Request Charge: 2.35 RUs
And the basic fan-out to all partitions is even faster than targeting a specific partition, even with an equality operator. I don't understand how this can be.
All this being said, my sample database is extremely small with only seven documents. The query set is probably not big enough to trust the results.
So, if I had millions of documents, would STARTSWITH(c.partitionKey, 'prefix') be more optimized than fanning out to all partitions?
I was trying to determine whether there is any benefit to this approach myself, and according to the answers it does not seem like there is.
I did just learn about the new hierarchical partition keys feature that is in private preview and seems to address the problem we are trying to solve:
https://devblogs.microsoft.com/cosmosdb/hierarchical-partition-keys-private-preview/
Hierarchical partition keys are now available in private preview for the Azure Cosmos DB Core (SQL) API. With hierarchical partition keys, also known as sub-partitioning, you can now natively partition your container with up to three levels of partition keys. This enables more optimal partitioning strategies for multi-tenant scenarios or workloads that would otherwise use synthetic partition keys. Instead of having to choose a single partition key – which often leads to performance trade-offs – you can now use up to three keys to further sub-partition your data, enabling more optimal data distribution and higher scale.
Since this allows up to three keys, it could solve the problem by breaking the prefixes up into separate keys, or at least further optimize it if there are more than three.
Example
(Usage example from link):
https://github.com/AzureCosmosDB/HierarchicalPartitionKeysFeedbackGroup#net-v3-sdk-2
// Get the full partition key path
var id = "0a70accf-ec5d-4c2b-99a7-af6e2ea33d3d";
var fullPartitionkeyPath = new PartitionKeyBuilder()
    .Add("Contoso") // TenantId
    .Add("Alice")   // UserId
    .Build();
var itemResponse = await containerSubpartitionByTenantId_UserId.ReadItemAsync<dynamic>(id, fullPartitionkeyPath);
Considerations
From the preview link it looks like you would need to opt in to the preview and create a new container
New containers only – all keys must be specified upon container creation
As you scale, you get fewer "logical partitions" per "physical partition", until eventually each partition key value has its own physical partition.
So:
if I had millions of documents, would STARTSWITH(c.partitionKey, 'prefix') be more optimized than fanning out to all partitions?
Both queries would fan out across multiple partitions.
And I'm pretty sure that since "Azure Cosmos DB uses hash-based partitioning to spread logical partitions across physical partitions", there's no locality between partition keys with a common prefix, and each STARTSWITH query will have to fan out across all the physical partitions.
The docs do suggest that there is some efficiency ordering:
With Azure Cosmos DB, typically queries perform in the following order from fastest/most efficient to slower/less efficient.
GET on a single partition key and item key
Query with a filter clause on a single partition key
Query without an equality or range filter clause on any property
Query without filters
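To make the distinction concrete, here is a minimal sketch (Python SDK; endpoint, key, and names are placeholders) contrasting a query the SDK can route to a single partition with one that must fan out:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# Equality on the partition key: the request is routed to a single partition.
single = list(container.query_items(
    query="SELECT * FROM c WHERE c.partitionKey = @pk",
    parameters=[{"name": "@pk", "value": "prefix1:"}],
    partition_key="prefix1:",
))

# STARTSWITH on the partition key cannot be routed, so it fans out to every partition.
fanout = list(container.query_items(
    query="SELECT * FROM c WHERE STARTSWITH(c.partitionKey, @prefix)",
    parameters=[{"name": "@prefix", "value": "prefix1"}],
    enable_cross_partition_query=True,
))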
I want to use DynamoDB to store historical stock closing values.
My store will have a few stocks, and grow to include more as requirements change.
I figured I'll have a single table where the only key is "DATE" formatted as YYYY-MM-DD.
This means that each item in the table will have a date key and several attributes of the form { TICKER = CLOSING_VALUE }
Queries for a given date will also filter by a subset of desired stock tickers, e.g. ["INTC", "AAPL"].
I am a bit confused, since this single key would have to work as both partition and sort key.
How should I query to retrieve a subset of stock tickers for a given date range ?
Update:
I'm creating the table with...
{
  AttributeDefinitions: [
    {
      AttributeName: Date,
      AttributeType: S
    }
  ],
  TableName: "Historic",
  KeySchema: [
    {
      AttributeName: Date,
      KeyType: HASH
    }
  ]
}
And the query:
{
  table_name: "Historic",
  projection_expression: "USD,CAD",
  filter_expression: "#k between :val1 and :val2",
  expression_attribute_names: { "#k" => "Date" },
  expression_attribute_values: {
    ":val1" => "2019-12-01",
    ":val2" => "2020-01-10"
  }
}
And I get an error:
Aws::DynamoDB::Errors::ValidationException: Either the KeyConditions or KeyConditionExpression parameter must be specified in the request.
You cannot sort by - or efficiently retrieve a range of - the partition key, you can only sort by the sort key. To understand why, you need to understand how DynamoDB stores its data.
The "partition key" is also called in the CreateTable operation a "hash key" - and indeed it works like a key in a hash table: DynamoDB runs a hash function on this key, and using the resulting number, decides which node(s) of its large cluster should hold this partition. This approach allows distributing the table across the cluster, but it makes it impossible to efficiently retrieve the different partitions ordered by their key. The "Scan" operation will return the partitions in seemingly-random order (they are likely to be sorted by the hash function of their key), and it's impossible to efficiently scan just a range of partition keys. It's possible to do this inefficiently - by scanning the entire table and filtering just the partitions you want. If I understand correctly, this is what you were trying to do. But this only makes sense for tiny databases - would that be your case?
As you noticed, the other component of the key is the "sort key". Inside a partition, on one node, the different items of that partition are kept sorted by the sort key. This allows DynamoDB to efficiently retrieve them in this order, or efficiently retrieve only a range of these sort keys - the Query request can do both of these things.
So to achieve what you want, you need the date to be the sort key, not the partition key. How to do the rest of the data modeling depends on what your typical queries look like:
If you have a large number of stocks, but a typical query only asks for a handful of them, the most reasonable approach is to use the stock name as the partition key and, as I said, the date as the sort key. This will allow you to efficiently Query a date range for one particular stock - and if you need 3 different stocks, you'll need to do 3 Query requests (you can and should do them in parallel!), but each of these queries will be efficient and you'll only be paying for the actual data you retrieve, without any post-filtering.
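As a rough sketch of that model with boto3 (the table and attribute names are assumptions, not from the question):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Historic")

def closing_values(ticker, start_date, end_date):
    # One stock's partition, sliced by the YYYY-MM-DD sort key.
    resp = table.query(
        KeyConditionExpression=Key("Ticker").eq(ticker)
        & Key("Date").between(start_date, end_date)
    )
    return resp["Items"]

# One Query per requested ticker; issue them in parallel for several tickers.
intc = closing_values("INTC", "2019-12-01", "2020-01-10")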
If there is a huge number of different dates (e.g., you keep data at 1-second resolution), your partitions can grow huge, and for various reasons this is not recommended. In such a case, you can split each partition into multiple partitions by some coarse time window. For example, instead of having one huge partition for stock "GOOG", have one partition "GOOG Nov 2019", one "GOOG Dec 2019", etc. When you query a small date range, you'll know which specific partition you need to read from, but when the query spans more than one month, you'll need to query several of these partitions. Note that very large queries will read (and return) huge amounts of data and will therefore be expensive, so you're only likely to want to do this in large analytic jobs.
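A brief sketch of that bucketed variant (again, the key names and the "TICKER#YYYY-MM" convention are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Historic")

def query_months(ticker, months, start_ts, end_ts):
    # One Query per monthly bucket, e.g. months = ["2019-11", "2019-12"].
    items = []
    for month in months:
        resp = table.query(
            KeyConditionExpression=Key("TickerMonth").eq(f"{ticker}#{month}")
            & Key("Timestamp").between(start_ts, end_ts)
        )
        items.extend(resp["Items"])
    return items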
I am currently designing dynamo DB schemas for the following use case:
A company can have many channels, and then for each channel, I have multiple channel cards.
I am thinking of having following tables:
Channel Table:
Partition Key: CompanyId
Sort Key: combination of timestamp and a deleted flag
Now after getting the channels for a company, I need to fetch its channel cards, for this, I am thinking to have following Table Schema for ChannelCard.
ChannelCard Table:
Partition Key: channelId
Sort Key: combination of timestamp and a deleted flag
Now to get the channel cards for a company, I need to do the following:
1. Query the channels for the company using the partition key (1 query)
2. Get the channel cards for each channel (one query per channel)
So in this case we will be making many queries; can we reduce the number of queries in our case?
Any suggestions for modifying the database tables or about how to query the database are welcome.
You could also have
Channel Table
Partition Key: CompanyId
Sort Key: Deleted+timestamp
Channel Card Table
Partition Key: CompanyId
Sort Key: Deleted+ChannelCardTimeStamp
GSI #1:
Partition Key: ChannelId
Sort Key: Deleted+ChannelCardTimeStamp
This way you can get the most recent channel cards for any given company with one query, and you can also query for the most recent channel cards for any channel.
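As an illustration with boto3 (the table name, index name, attribute names, and the "ACTIVE#" sort-key prefix are assumptions about how "Deleted+timestamp" might be encoded):

import boto3
from boto3.dynamodb.conditions import Key

cards = boto3.resource("dynamodb").Table("ChannelCard")

# Most recent, non-deleted channel cards for a company, newest first.
by_company = cards.query(
    KeyConditionExpression=Key("CompanyId").eq("company-123")
    & Key("DeletedTimestamp").begins_with("ACTIVE#"),
    ScanIndexForward=False,  # descending sort-key order
    Limit=50,
)["Items"]

# Most recent channel cards for a single channel, via the GSI.
by_channel = cards.query(
    IndexName="ChannelCardsByChannel",
    KeyConditionExpression=Key("ChannelId").eq("channel-456")
    & Key("DeletedTimestamp").begins_with("ACTIVE#"),
    ScanIndexForward=False,
    Limit=50,
)["Items"]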