Cosmos DB .NET SDK order by a dynamic field (parameterized) - azure-cosmosdb

I use the .NET SDK to retrieve items from a Cosmos DB instance, using continuation tokens to page through the results. So far this works.
I use a generic Get function to retrieve the items:
var query = container.GetItemQueryIterator<T>(
    new QueryDefinition("SELECT * FROM c"),
    continuationToken: continuationToken,
    requestOptions: new QueryRequestOptions()
    {
        MaxItemCount = itemCount
    });
However, I would like to add a dynamic ORDER BY field, where the caller can decide which field the results should be ordered on. I tried adding a parameterized field like:
new QueryDefinition("SELECT * FROM c order by #orderBy")
    .WithParameter("#orderBy", "fieldname")
But this does not work; I keep getting syntax errors while executing. Is it actually possible to dynamically add an ORDER BY clause?

The .WithParameter() fluent syntax can only substitute values (for example in a WHERE clause), not property names, so you will have to construct your SQL with the ORDER BY appended dynamically to the SQL string.
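Because the property name has to be spliced into the SQL text rather than passed as a parameter, it is worth validating the caller-supplied field against a whitelist before concatenating, so arbitrary input cannot reach the query. A minimal sketch of that idea (shown in Python for brevity; the allowed field names here are made up, and the same check translates directly to C# before building the QueryDefinition):

```python
# Whitelist of sortable properties (hypothetical names for illustration).
ALLOWED_ORDER_BY = {"name", "createdAt", "price"}

def build_order_by_query(field, descending=False):
    """Build a Cosmos SQL string with a validated dynamic ORDER BY."""
    if field not in ALLOWED_ORDER_BY:
        raise ValueError(f"unsupported order-by field: {field}")
    direction = "DESC" if descending else "ASC"
    return f"SELECT * FROM c ORDER BY c.{field} {direction}"
```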
One thing to keep in mind: unless this is a small workload with less than 20 GB of data, this container will not scale unless you use the partition key in your queries. The other consideration is that ORDER BY gets much better performance when you use composite indexes. But if there is a wide range of properties that results can be sorted on, writes may get very expensive from all of the individual composite indexes.
In all cases, if this is meant to scale you should measure and benchmark high concurrency operations.

Related

Cosmos client and versioning of records

I'm working on a web API where I need to store all previous versions of a record, so for the PUT endpoint, instead of updating an existing record, a new record is created with the same partition key. This complicates things a lot, and a simple Read method, which should give you the most recently created record, becomes:
public async Task<IEnumerable<T>> ReadLatestAsync(Expression<Func<T, bool>> predicate)
{
    var entityList = new List<T>();
    var query = _container.GetItemLinqQueryable<T>()
        .Where(predicate).AsQueryable();
    using (var iterator = query.ToFeedIterator())
    {
        while (iterator.HasMoreResults)
        {
            entityList.AddRange(await iterator.ReadNextAsync());
        }
    }
    return entityList.GroupBy(PartitionKey)
        .Select(g => g.OrderByDescending(e => e.TimeStamp).First());
}
where PartitionKey is found in the specific repositories, like this for instance:
public override Func<Project, object> PartitionKey => (x => x.ProjectId);
This has worked okay until now, when I need to add pagination using continuation tokens. For that to work correctly, the whole GroupBy(PartitionKey).Select(g => g.OrderByDescending(e => e.TimeStamp).First()) part needs to execute as part of the Cosmos client query (if the selection is done after the pagination, each GET request will return a different number of records). But the Cosmos client doesn't support GroupBy, so I'm kind of lost as to how to do this.
Are there any queries that could do the same thing without having to use GroupBy?
Or should I just handle the whole versioning in a different way?
This does not look scalable. There are a few alternatives.
If you always do an in-partition query for the latest mutation, you could create a new item with a distinguishing property such as "version": "latest", or something else that lets you tell it apart from the other data in the logical partition. That way you can pass in the partition key value and the value for version and get just that record. I don't know if EF supports point reads, but if you were using the native Cosmos SDK, you could give id a well-known value like "latest" instead of creating a new property; then you could call ReadItemAsync() with that id and partition key value for the most efficient read possible in Cosmos DB.
If the data for a logical partition will grow beyond 20 GB, you will need a different partition key with higher cardinality. In that case, use Change Feed on the container to maintain a materialized view of the data in a second container, upserting the latest value so that you can do efficient single-partition queries (or, again, point reads) against that second container.
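The "latest" marker item can be sketched as follows. This is a toy illustration, not the Cosmos SDK: a plain dict stands in for the container, keyed by (partition key, id) pairs, and the helper names are made up. Every write stores an immutable version record plus an upserted item with id "latest", so reading the newest version becomes a single point read with no query or GroupBy.

```python
import uuid

# A dict stands in for the Cosmos container: keys are (partition key, id) pairs.
store = {}

def put_versioned(pk, doc):
    """Write an immutable version record and upsert a 'latest' marker item."""
    store[(pk, str(uuid.uuid4()))] = dict(doc)  # one record per version
    store[(pk, "latest")] = dict(doc)           # materialized newest version

def read_latest(pk):
    """Fetch the newest version with a single point read."""
    return store.get((pk, "latest"))
```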

CosmosDB, very long index that's also the partition key

We are storing a folder tree; the number of items is huge, so we have partitioned on the parent folder.
When we issue queries such as
SELECT * FROM root WHERE root.parentPath = "\\server\share\shortpath" AND root.isFile
The RU charge is very low and the performance is very good.
But, when we have a long path eg
SELECT * FROM root WHERE root.parentPath = "\\server\share\a very\long\path\longer\than\this" AND root.isFile
the RU charge goes up to 5000 and the performance suffers.
parentPath works well as a partition key as all queries include this field in the filter.
If I add another clause to the query it also becomes very fast, e.g. if I add AND root.name = 'filename'.
It's almost like it's scanning the entire partition based on the hash that's derived from it.
The query returns NO DATA, which is fine, as it's someone looking for child folders under a given node; once you get deep, it just gets very slow.
Query Metrics
x-ms-documentdb-query-metrics:
totalExecutionTimeInMs=1807.61;
queryCompileTimeInMs=0.08;
queryLogicalPlanBuildTimeInMs=0.04;
queryPhysicalPlanBuildTimeInMs=0.06;
queryOptimizationTimeInMs=0.01;
VMExecutionTimeInMs=1807.11;
indexLookupTimeInMs=0.65;
documentLoadTimeInMs=1247.08;
systemFunctionExecuteTimeInMs=0.00;
userFunctionExecuteTimeInMs=0.00;
retrievedDocumentCount=72554;
retrievedDocumentSize=59561577;
outputDocumentCount=0;
outputDocumentSize=49;
writeOutputTimeInMs=0.00;
indexUtilizationRatio=0.00
This is because of a path length limit in indexing v1.
We have increased the path length limit in the new index layout, so migrating the collections to the new layout will fix the issue and provide many performance benefits.
We have rolled out the new index layout for new collections by default. If it is possible for you to recreate the current collection and migrate the existing data over, that would be ideal. Otherwise, an alternative is to trigger the migration process that moves existing collections to the new index layout. The following C# method can be used to do that:
static async Task UpgradeCollectionToIndexV2Async(
    DocumentClient client,
    string databaseId,
    string collectionId)
{
    DocumentCollection collection = (await client.ReadDocumentCollectionAsync(
        string.Format("/dbs/{0}/colls/{1}", databaseId, collectionId))).Resource;
    collection.SetPropertyValue("IndexVersion", 2);
    ResourceResponse<DocumentCollection> replacedCollection =
        await client.ReplaceDocumentCollectionAsync(collection);
    Console.WriteLine(string.Format(
        CultureInfo.InvariantCulture,
        "Upgraded indexing version for database {0}, collection {1} to v2",
        databaseId, collectionId));
}
It could take several hours for the migration to complete, depending on the amount of data in the collection. The issue should be addressed once it is completed.
(This was copy pasted from an email conversation we had to resolve this issue)

Google Datastore Python returning fewer entities per page

I am using Python client SDK for Datastore (google-cloud-datastore) version 1.4.0. I am trying to run a key-only query fetch:
query = client.query(kind = 'SomeEntity')
query.keys_only()
The query filter has an EQUAL condition on field1 and a GREATER_THAN_OR_EQUAL condition on field2; ordering is done on field2.
For fetch, I am specifying a limit:
query_iter = query.fetch(start_cursor=cursor, limit=100)
page = next(query_iter.pages)
keyList = [entity.key for entity in page]
nextCursor = query_iter.next_page_token
Though there are around 50 entities satisfying this query, each fetch returns around 10-15 results and a cursor. I can use the cursor to get all the results, but this adds call overhead.
Is this behavior expected?
A keys-only query is limited to 1000 entries in a single call, and the operation counts as a single entity read.
For other limitations of Datastore, please refer to the detailed limits table in the documentation.
However, in your code you specified a cursor as the starting point for a subsequent retrieval operation. A query can also be limited without a cursor:
query = client.query()
query.keys_only()
tasks = list(query.fetch(limit=100))
For detailed instructions on how to use limits and cursors, please refer to the Google Cloud Datastore documentation.
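If you do need every result despite the short pages, the loop is simply: keep passing each response's cursor back in until no cursor comes back. A sketch of that loop, with fetch_page standing in as a hypothetical wrapper around query.fetch(start_cursor=...) so the logic is self-contained:

```python
def fetch_all(fetch_page, page_size=100):
    """Drain a paged query by feeding each page's cursor into the next call.

    fetch_page(cursor, limit) must return (items, next_cursor), where
    next_cursor is None once the query is exhausted.
    """
    results, cursor = [], None
    while True:
        items, cursor = fetch_page(cursor, page_size)
        results.extend(items)
        if cursor is None:
            return results
```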

Dynamodb query expression

Team,
I have a DynamoDB table with a given hash key (userid) and sort key (age). Let's say we want to retrieve, for each hash key (userid), the element with the smallest age. What would be the query and filter expression for the DynamoDB query?
Thanks!
I don't think you can do it in a single query. You would need to do a full table scan. If you have a list of hash keys somewhere, then you can do N queries (in parallel) instead.
[Update] Here is another possible approach:
Maintain a second table that has just a hash key (userID). This table will contain, for each user, the record with the smallest age. To achieve that, make sure that every time you update the main table you also update the second one if the new age is less than the current age in the second table. You can use a conditional update for that. The update can either be done by the application itself, or you can have an AWS Lambda function listening to the DynamoDB stream. When you need the smallest age for each user, you still do a full table scan of the second table, but this scan will only read relevant records, so it will be optimal.
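The conditional update described above boils down to "write only if nothing is stored yet, or the new age is smaller than what is stored". A sketch of that logic, with a plain dict standing in for the second table; in DynamoDB itself this would be an UpdateItem call with a condition expression along the lines of attribute_not_exists(MinAge) OR MinAge > :newAge:

```python
def record_age(min_ages, user_id, new_age):
    """Keep min_ages[user_id] equal to the smallest age seen for that user."""
    current = min_ages.get(user_id)
    if current is None or new_age < current:
        min_ages[user_id] = new_age  # conditional write: only on improvement
```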
There are two ways to achieve that:
If you don't need this data in real time, you can export your data into other AWS systems, like EMR or Redshift, and perform complex analytic queries there. With this you can write SQL expressions using joins and GROUP BY operators.
You can even run EMR Hive queries directly on DynamoDB data, but they perform scans, so it's not very cost-efficient.
Another option is to use DynamoDB Streams. You can maintain a separate table that stores:
Table: MinAges
UserId - primary key
MinAge - regular numeric attribute
On every update/delete/insert on the original table, you can query the minimum age for the updated user and store it in the MinAges table.
Another option is to write something like this:
storeNewAge(userId, newAge)
def smallestAge = getSmallestAgeFor(userId)
storeSmallestAge(userId, smallestAge)
But since DynamoDB does not have native transaction support, it's dangerous to run code like that, since you may end up with inconsistent data. You can use the DynamoDB transactions library, but those transactions are expensive. With streams, by contrast, you get consistent data at a very low price.
You can do it using ScanIndexForward
YourEntity requestEntity = new YourEntity();
requestEntity.setHashKey(hashkey);

DynamoDBQueryExpression<YourEntity> queryExpression = new DynamoDBQueryExpression<YourEntity>()
    .withHashKeyValues(requestEntity)
    .withConsistentRead(false);
queryExpression.setIndexName(indexName); // if you are using an index
queryExpression.setScanIndexForward(true); // ascending sort-key order, so the first item is the smallest age
queryExpression.setLimit(1);

How to retrieve an entity using a property from datastore

Is it possible to retrieve an entity from the GAE datastore using a property rather than the key?
I can see that entities can be retrieved by key using the syntax below.
quote = mgr.getObjectById(Students.class, id);
Is there an alternative that enables us to use a property instead of key?
Or please suggest any other ways to achieve the requirement.
Thanks,
Karthick.
Of course this is possible. Think of the key of an entity as being like the primary key of an SQL row (but please, don't stretch the analogy too far; the point is that it's a primary key. The implementations of these two data storage systems are very different, and it causes people trouble when they don't keep this in mind).
You should look either here (JDO) to read about JDO queries or here (JPA) to read about JPA queries, depending on what kind of mgr your post refers to. For JDO, you would do something like this:
// begin building a new query on the Cat-kind entities (given a properly annotated
// entity model class "Cat" somewhere in your code)
Query q = pm.newQuery(Cat.class);
// set filter on species property to == param
q.setFilter("species == speciesParam");
// set ordering for query results by age property descending
q.setOrdering("age desc");
// declare the parameters for this query (format is "<Type> <name>")
// as referenced above in filter statement
q.declareParameters("String speciesParam");
// run the query
List<Cat> results = (List<Cat>) q.execute("siamese");
For JPA, you would use JPQL strings to run your queries.
