I have a Cosmos DB container that's around 4 GB. When it was small, it could run date filters relatively quickly at a low RU charge (roughly 3-15), but as the database has grown to contain millions of records it has slowed right down and the RU charge is up in the thousands.
Looking at the documentation for dates (https://learn.microsoft.com/en-us/azure/cosmos-db/working-with-dates), it says:
To execute these queries efficiently, you must configure your collection for Range indexing on strings
However, reading the linked index policy doc (https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy), it sounds like every field has a range index created by default:
The default indexing policy for newly created containers indexes every property of every item, enforcing range indexes for any string or number, and spatial indexes for any GeoJSON object of type Point
Do I need to configure the indexes to anything other than the default?
{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        {
            "path": "/*"
        }
    ],
    "excludedPaths": [
        {
            "path": "/\"_etag\"/?"
        }
    ]
}
When it comes to indexing, you can see the best practices here: you should exclude unused paths from indexing for faster writes, leveraging IndexingPolicy with IncludedPaths and ExcludedPaths. For example:
var collection = new DocumentCollection { Id = "excludedPathCollection" };
collection.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/*" });
collection.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/nonIndexedContent/*" });
So if you are concerned with query costs, indexing won't really help: write costs depend on indexing, read costs do not. If you are seeing thousands of RUs per request, I suspect you are either using cross-partition queries or you have no partition key at all (or a single partition for everything). To cut these costs down you need to either stop using cross-partition queries or, if that is not possible, rearchitect your data in such a fashion that you do not need them.
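If cross-partition fan-out is the culprit, scoping each query to a single partition key value is the fix. Below is a minimal sketch using the @azure/cosmos JavaScript SDK; the database/container names, the /customerId partition key, and the "date" property are hypothetical, not taken from the question:

import { CosmosClient } from "@azure/cosmos";

// Hypothetical names for illustration: a "mydb" database with an "orders"
// container partitioned on /customerId and a string "date" property.
const client = new CosmosClient({ endpoint: "<endpoint>", key: "<key>" });
const container = client.database("mydb").container("orders");

async function ordersSince(customerId: string, isoDate: string) {
  const querySpec = {
    query: "SELECT * FROM c WHERE c.date >= @from",
    parameters: [{ name: "@from", value: isoDate }],
  };
  // Passing partitionKey scopes the query to one logical partition,
  // avoiding a cross-partition fan-out.
  const { resources, requestCharge } = await container.items
    .query(querySpec, { partitionKey: customerId })
    .fetchAll();
  console.log(`RU charge: ${requestCharge}`);
  return resources;
}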
I think Range is the default index kind in Cosmos DB.
I am moving my database from a SQL database to DynamoDB. I currently have a table with these values:
tenantId (PartitionKey)
resourceId (RangeKey)
type
role
name
I currently have the following query: get all the resources belonging to tenant ten that have type t, role r, and a name containing n, where type, role, and name may be null; in that case they are not used as filters.
Using filters it is possible to make this query in DynamoDB, but reading the following article (https://aws.amazon.com/blogs/database/querying-on-multiple-attributes-in-amazon-dynamodb/) I realized it may be an expensive query, as DynamoDB retrieves the data and then filters it server side. That page suggests creating a GSI with the following value:
tenantId-type-role-name
With this index I can easily filter for ten, t, r, and n, but in case I just have to filter for tenantId, type, and name, how should I query the GSI to get all the records that have tenant ten, type t, and a name containing n, with no restriction on role? (The contains statement seems to be supported only in filters.)
I am wondering if I need to create a GSI for each combination, something like:
tenantId-type
tenantId-role
tenantId-name
tenantId-type-role
...
Thanks in advance for your help
Before you build GSIs to make your querying simpler, think about storing your data in a different format.
For example, how many resources do you expect per tenant? Could you store your data like this:
{
    tenant: 123, // (partition key)
    resources: [
        { type: 'type1', role: 'role1', name: 'somename1' },
        { type: 'type2', role: 'role2', name: 'somename2' },
        { type: 'type3', role: 'role3', name: 'somename3' }
    ]
}
In the format above your read times will be rapid and will scale. You can then apply your contains filtering logic in code. DynamoDB records can be up to 400 KB in size, so you could probably store several thousand resources per record in this format.
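To illustrate, here is a rough sketch of that read-then-filter pattern with the AWS SDK for JavaScript (v2) DocumentClient; the table name "Resources" and the item shape are assumptions made for the example:

import { DynamoDB } from "aws-sdk";

const docClient = new DynamoDB.DocumentClient();

// Hypothetical "Resources" table with "tenant" as the partition key and the
// single-item layout shown above.
async function findResources(tenant: number, name?: string) {
  const { Item } = await docClient
    .get({ TableName: "Resources", Key: { tenant } })
    .promise();
  const resources: Array<{ type: string; role: string; name: string }> =
    Item?.resources ?? [];
  // Apply the "contains" logic in code; skip it when no name filter is given.
  return name ? resources.filter((r) => r.name.includes(name)) : resources;
}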
Also note that each GSI has its own read/write capacity usage, which is consumed when you insert into the table. If you take the GSI approach and write a lot to that table, you'll see surprisingly high write usage.
I am confused about what to make my Cosmos DB partition key when my JSON looks like this:
{
    "AE": [
        {
            "storeCode": "XXX",
            "storeClass": "YYY"
        }
    ],
    "AT": [
        {
            "storeCode": "ZZZ",
            "storeClass": "XYZ"
        }
    ]
}
Normally the top level would be country: AT and so on, and I would make the partition key /country, but in this case I have nothing to use at the top level. What do I do?
The JSON comes from a third party, so I don't have the option to change it at the source.
Since I did not find any statements about partition keys for sub-arrays in the official documentation, I can only provide you with a similar thread for your reference: CosmosDB - Correct Partition Key
Here are some explanations by @Mikhail:
Partition Key has to be a single value for each document, it can't be a field in sub-array. Partition Key is used to determine which database node will host your document, and it wouldn't be possible if you specified multiple values, of course.
If your single document contains data from multiple entities, and you will query those entities separately, it might make sense to split your documents per entity. If all those "radars" are related to some higher level entity, use that entity ID as partition key.
To be rigorous, I would suggest contacting the Azure Cosmos DB team to check whether this feature is simply not supported so far and whether it will be implemented in the future.
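Following the advice above, one option is to split the third-party payload into one document per store, hoisting the country code to a top-level property that can then serve as the partition key. A minimal sketch with the @azure/cosmos SDK (the "stores" container name and the id scheme are assumptions):

import { CosmosClient } from "@azure/cosmos";

interface Store {
  storeCode: string;
  storeClass: string;
}

// Hypothetical target: a "stores" container whose partition key path is /country.
const client = new CosmosClient({ endpoint: "<endpoint>", key: "<key>" });
const container = client.database("mydb").container("stores");

async function importFeed(feed: Record<string, Store[]>) {
  for (const [country, stores] of Object.entries(feed)) {
    for (const store of stores) {
      // Each store becomes its own document, with the country code hoisted
      // from the original top-level property name to a regular field.
      await container.items.upsert({
        id: `${country}-${store.storeCode}`, // assumed id scheme
        country,
        ...store,
      });
    }
  }
}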
In MongoDB we can do something like the following in order to select the first or last count elements:
The document looks like as follows:
{
    id: 123,
    aliases: [
        { name: "john" },
        { name: "alpha" },
        { name: "tom" },
        { name: "alpha" }
    ]
}
You can query in mongo and also restrict the number of aliases you want to retrieve from the database as follows:
db.collection.find( { field: value }, { array: {$slice: count } } );
where count = 3
Is there any straightforward way to achieve the same result in DynamoDB?
There is no exact equivalent available in DynamoDB; in fact, there is not even a near equivalent. DynamoDB has a feature to limit the number of items evaluated while processing a query. However, that is not equivalent to limiting the number of items in the result set:
Limit — (Integer)
The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.
I have a collection in Azure DocumentDB in which documents are clustered into 3 sets using a JSON property called clusterName on each document. The 3 clusters of documents are templated somewhat like this:
{
    "clusterName": "CustomerInformation",
    "id": "CustInfo1001",
    "custName": "XXXX"
},
{
    "clusterName": "ZoneInformation",
    "id": "ZoneInfo5005",
    "zoneName": "YYYY"
},
{
    "clusterName": "CustomerZoneAssociation",
    "id": "CustZoneAss9009",
    "custId": "CustInfo1001",
    "zoneId": "ZoneInfo5005"
}
As you can see, the CustomerZoneAssociation document links the CustomerInformation and ZoneInformation documents by their IDs. I need help querying information from the CustomerInformation and ZoneInformation clusters using the IDs associated in the CustomerZoneAssociation cluster. The result I am expecting from the query is:
{
    "clusterName": "CustomerZoneAssociation",
    "id": "CustZoneAss9009",
    "custId": "CustInfo1001",
    "custName": "XXXX",
    "zoneId": "ZoneInfo5005",
    "zoneName": "YYYY"
}
Please suggest a solution which would take only one trip to DocumentDB.
DocumentDB does not support inter-document JOINs... instead, the JOIN keyword is used to perform intra-document cross-products (to be used with nested arrays).
I would recommend one of the following approaches:
Keep in mind that you do not have to normalize every entity as you would with a traditional RDBMS. It may be worth revisiting your data model and de-normalizing parts of your data where appropriate. Also keep in mind that de-normalizing comes with its own trade-offs (fanning out writes vs. issuing follow-up reads). Check out the following SO answer to read more on the trade-offs between normalizing and de-normalizing data.
Write a stored procedure to batch a sequence of operations within a single network request. Check out the following SO answer for a code sample of this approach.
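As a rough sketch of the stored-procedure approach (assuming all three documents live in the same partition, which a stored procedure requires), the server-side JavaScript might look like the following; the procedure name and the follow-up reads are placeholders, not a complete implementation:

// getContext(), getCollection(), and getResponse() are provided by the
// DocumentDB / Cosmos DB server-side JavaScript runtime.
function getCustomerZone(assocId) {
  var collection = getContext().getCollection();
  var response = getContext().getResponse();

  var query = {
    query: "SELECT * FROM c WHERE c.id = @id",
    parameters: [{ name: "@id", value: assocId }],
  };
  var accepted = collection.queryDocuments(
    collection.getSelfLink(),
    query,
    {},
    function (err, docs) {
      if (err) throw err;
      var assoc = docs[0];
      // Follow-up reads for the CustomerInformation and ZoneInformation
      // documents would be issued the same way here and merged into the
      // response body before returning.
      response.setBody(assoc);
    }
  );
  if (!accepted) throw new Error("Query not accepted by the server.");
}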
I am currently using DynamoDB and having a problem scanning. I am able to get paged results in forward order by using ExclusiveStartKey. However, regardless of whether I set ScanIndexForward to true or false, I get results in forward order from my Scan operation. How can I get results in reverse order from a Scan in DynamoDB?
ScanIndexForward is the correct way to get items in descending order by the range key of the table or index you are querying. From the AWS API Reference:
A value that specifies ascending (true) or descending (false) traversal of the index. DynamoDB returns results reflecting the requested order determined by the range key. If the data type is Number, the results are returned in numeric order. For type String, the results are returned in order of ASCII character code values. For type Binary, DynamoDB treats each byte of the binary data as unsigned when it compares binary values.
Based on the docs for Scan, I conclude that there is no way to Scan in reverse. However, I would say that you are not using DynamoDB correctly if you need to do that. When designing a schema for a database like DynamoDB, you should plan the schema based on your expected queries, to ensure that almost all application queries have a good index. Scans are meant more for sysadmin operations or for feeding into MapReduce or analytics. "A Scan operation always scans the entire table, then filters out values to provide the desired result, essentially adding the extra step of removing data from the result set." (Query and Scan Performance) That can lead to performance problems and other issues.
Using DynamoDB is fundamentally different from working with a traditional relational database and requires a big change in the way you think about using it. You need to decide whether DynamoDB's advantages in storage and performance scalability, reliability, and availability are worth accepting its limitations.
As of now, a DynamoDB Scan cannot return sorted results.
You need to use a Query with a new global secondary index (GSI) that has a hash key and a range field. The trick is to use a hash key which is assigned the same value for all items in your table.
I recommend adding a new field to all items, calling it "Status", and setting its value to "OK" or something similar.
Then your query to get all the results sorted would look like this:
{
    TableName: "YourTable",
    IndexName: "Status-YourRange-index",
    KeyConditions: {
        Status: {
            ComparisonOperator: "EQ",
            AttributeValueList: ["OK"]
        }
    },
    ScanIndexForward: false
}
The docs for how to write GSI queries can be found here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Querying
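For reference, executing the same query through the AWS SDK for JavaScript (v2) DocumentClient might look like the sketch below, written with the newer KeyConditionExpression form of the parameters above; note that Status is a DynamoDB reserved word, hence the ExpressionAttributeNames alias:

import { DynamoDB } from "aws-sdk";

const docClient = new DynamoDB.DocumentClient();

docClient
  .query({
    TableName: "YourTable",
    IndexName: "Status-YourRange-index",
    KeyConditionExpression: "#s = :ok",
    ExpressionAttributeNames: { "#s": "Status" }, // Status is a reserved word
    ExpressionAttributeValues: { ":ok": "OK" },
    ScanIndexForward: false, // descending by the index's range key
  })
  .promise()
  .then(({ Items }) => console.log(Items));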