Cosmos DB read a single document without partition key - azure-cosmosdb

A container has a function called ReadItemAsync. The problem is I do not have the partition key, but only the id of the document. What is the best approach to get just a single item then?
Do I have to get it from a collection? Like:
var allItemsQuery = VesselContainer.GetItemQueryIterator<CoachVessel>("SELECT * FROM c where c.id=....");
var q = VesselContainer.GetItemLinqQueryable<CoachVessel>();
var iterator = q.ToFeedIterator();
var result = new List<CoachVessel>();
while (iterator.HasMoreResults)
{
    foreach (var item in await iterator.ReadNextAsync())
    {
        result.Add(item);
    }
}

Posting as answer.
Yes, you have to do a fan-out (cross-partition) query, but id is only unique within a partition key value, so even then you may end up with multiple items. Frankly, if you don't have the partition key for a point read, then the data model is not correct. It (or the application itself) should be redesigned.
Additionally, for small, single-partition collections this cross-partition query will not be too expensive because the collection is small. However, once the database starts to scale out, it will get increasingly slower and more expensive as the query fans out to an ever-increasing number of physical partitions. As stated above, I would strongly recommend modifying the app to pass the partition key value in the request. This allows a single point read operation, which is extremely fast and efficient.
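To make the difference concrete, here is a minimal sketch of both options with the v3 .NET SDK (Microsoft.Azure.Cosmos). The class and method names are made up for illustration, and it assumes the partition key value is a string:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical helper class; only the SDK calls are real API.
public static class ItemLookup
{
    // Cross-partition query by id. Works without the partition key, but fans out
    // to every physical partition and may return more than one item, because id
    // is only unique within a partition key value.
    public static async Task<List<T>> FindByIdAsync<T>(Container container, string id)
    {
        QueryDefinition query = new QueryDefinition("SELECT * FROM c WHERE c.id = @id")
            .WithParameter("@id", id);

        var results = new List<T>();
        using (FeedIterator<T> iterator = container.GetItemQueryIterator<T>(query))
        {
            while (iterator.HasMoreResults)
            {
                FeedResponse<T> page = await iterator.ReadNextAsync();
                results.AddRange(page);
            }
        }
        return results;
    }

    // Point read. Needs both id and partition key value and goes straight to
    // one partition. Throws CosmosException (404) if the item does not exist.
    public static async Task<T> ReadByIdAsync<T>(
        Container container, string id, string partitionKeyValue)
    {
        ItemResponse<T> response =
            await container.ReadItemAsync<T>(id, new PartitionKey(partitionKeyValue));
        return response.Resource;
    }
}
```

If FindByIdAsync returns more than one item, the caller has to decide which one it wants, which is another sign that the partition key belongs in the request.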
Good luck.

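Try using ReadItemAsync like:
dynamic log = await container.ReadItemAsync<dynamic>(ID, PartitionKey.None);
Note that PartitionKey.None only matches items that were stored without a partition key value, so this will not find an item that lives under a real partition key.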

Related

Cosmos DB .NET SDK order by a dynamic field (parameterized)

I use the .NET SDK to retrieve some items in a Cosmos DB instance using continuationTokens to be able to retrieve paginated pieces of data. So far this works.
I use a generic Get function to retrieve the items:
var query = container.GetItemQueryIterator<T>(
    new QueryDefinition("SELECT * FROM c"),
    continuationToken: continuationToken,
    requestOptions: new QueryRequestOptions()
    {
        MaxItemCount = itemCount
    });
However, I would like to add a dynamic order by field where the caller can decide on which field the results should be ordered. I tried adding a parameterized field like:
new QueryDefinition("SELECT * FROM c order by #orderBy")
.WithParameter("#orderBy", "fieldname")
But this does not work; I keep getting syntax errors while executing. Is it actually possible to dynamically add an order by clause?
The .WithParameter() fluent syntax in QueryDefinition substitutes values (as in a WHERE clause), not property names, so you will have to construct your SQL with the ORDER BY appended dynamically to the SQL string.
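Since the field name ends up concatenated into the SQL text, it is worth validating it first. A minimal sketch, assuming a hypothetical allow-list of sortable fields ("name" and "createdAt" are placeholders, not from the question):

```csharp
using System;
using System.Collections.Generic;

public static class OrderByQueryBuilder
{
    // Hypothetical allow-list; replace with the properties your documents actually have.
    private static readonly HashSet<string> SortableFields =
        new HashSet<string>(StringComparer.Ordinal) { "name", "createdAt" };

    // Builds the SQL text with the ORDER BY field appended.
    // Rejects any field that is not on the allow-list, so caller input
    // never reaches the query string unchecked.
    public static string BuildSql(string orderBy, bool descending = false)
    {
        if (orderBy == null || !SortableFields.Contains(orderBy))
        {
            throw new ArgumentException(
                "Sorting by '" + orderBy + "' is not allowed.", nameof(orderBy));
        }

        string direction = descending ? "DESC" : "ASC";
        return "SELECT * FROM c ORDER BY c." + orderBy + " " + direction;
    }
}
```

The result can be passed to the existing call as new QueryDefinition(OrderByQueryBuilder.BuildSql(fieldName)); any WHERE values can still go through .WithParameter() as usual.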
One thing to keep in mind is that unless this is a small workload with less than 20GB of data, this container will not scale unless you use the partition key in your queries. The other consideration is that ORDER BY gets much better performance when you use composite indexes. But if results can be sorted on a wide range of properties, writes may get very expensive from all of the individual composite indexes.
In all cases, if this is meant to scale you should measure and benchmark high concurrency operations.

Cosmos client and versioning of records

I'm working on a web api where I need to store all previous versions of a record, so for the Put endpoint, instead of updating an existing record, a new record is created, with the same partition key. This complicates things a lot, and a simple Read method, which should give you the most recently created record becomes:
public async Task<IEnumerable<T>> ReadLatestAsync(Expression<Func<T, bool>> predicate)
{
    var entityList = new List<T>();
    var query = _container.GetItemLinqQueryable<T>()
        .Where(predicate).AsQueryable();
    using (var iterator = query.ToFeedIterator())
    {
        while (iterator.HasMoreResults)
        {
            entityList.AddRange(
                await iterator.ReadNextAsync());
        }
    }
    return entityList.GroupBy(PartitionKey).Select(x => x.OrderByDescending(x => x.TimeStamp).First());
}
where PartitionKey is found in the specific repositories, like this for instance:
public override Func<Project, object> PartitionKey => (x => x.ProjectId);
This has worked okay up until now, when I need to add pagination using continuation tokens and need to execute the whole GroupBy(PartitionKey).Select(x => x.OrderByDescending(x => x.TimeStamp).First()) part as part of the Cosmos client query for it to work correctly (if the selection is done after the pagination, each GET request will return a different number of records). But the Cosmos client doesn't support GroupBy, so I'm kind of lost as to how to do this.
Are there any queries that could do the same thing without having to use GroupBy?
Or should I just handle the whole versioning in a different way?
This does not look scalable. There are a few alternatives.
If you always do an in-partition query for the latest mutation, you could create a new item that has a distinguishing property such as "version": "latest", or something else that lets you tell it apart from the other data in the logical partition. That way you can pass in the partition key value and the value for version and get just that record. I don't know if EF supports point reads, but if you were using the native Cosmos SDK, you could give id a well-known value like "latest" instead of creating a new property; then you could call ReadItemAsync() with that id and partition key value for the most efficient read possible in Cosmos DB.
If the data for a logical partition will grow beyond 20GB, then you will need a different partition key with higher cardinality. In either case, you then need to use Change Feed on the container to build a materialized view of the data in a second container, upserting the latest value there so that you can do efficient single-partition queries (or point reads again) against that second container.
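The first alternative (a fixed id of "latest" per logical partition) could look roughly like the sketch below with the native v3 SDK. The ProjectVersion shape, the /projectId partition key path, and the class names are assumptions made for illustration:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json;

// Hypothetical document shape; assumes the container is partitioned on /projectId.
public class ProjectVersion
{
    [JsonProperty("id")]
    public string Id { get; set; }

    [JsonProperty("projectId")]
    public string ProjectId { get; set; }

    [JsonProperty("timeStamp")]
    public long TimeStamp { get; set; }
}

public static class LatestVersionStore
{
    private const string LatestId = "latest";

    // Writes the new version as a history item, then overwrites the
    // "latest" item in the same logical partition.
    // Note: the two writes are not atomic in this sketch.
    public static async Task SaveAsync(Container container, ProjectVersion version)
    {
        var partitionKey = new PartitionKey(version.ProjectId);

        await container.CreateItemAsync(version, partitionKey);

        var latest = new ProjectVersion
        {
            Id = LatestId,
            ProjectId = version.ProjectId,
            TimeStamp = version.TimeStamp
        };
        await container.UpsertItemAsync(latest, partitionKey);
    }

    // Point read of the "latest" item: no query, no GroupBy, no pagination.
    public static async Task<ProjectVersion> ReadLatestAsync(Container container, string projectId)
    {
        ItemResponse<ProjectVersion> response =
            await container.ReadItemAsync<ProjectVersion>(LatestId, new PartitionKey(projectId));
        return response.Resource;
    }
}
```

With this layout, listing the latest version of every project becomes a plain filter on id = "latest", which paginates with continuation tokens without any client-side grouping.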

CosmosDB, very long index that's also the partition key

We are storing a folder tree, the number of items is huge so we have created a partition on the parent folder.
When we issue queries such as
SELECT * FROM root WHERE root.parentPath = "\\server\share\shortpath" AND root.isFile
The RU charge is very low and the performance is very good.
But when we have a long path, e.g.
SELECT * FROM root WHERE root.parentPath = "\\server\share\a very\long\path\longer\than\this" AND root.isFile
The RUs go up to 5000 and the performance suffers.
parentPath works well as a partition key as all queries include this field in the filter.
If I add another clause to the query it also becomes very fast, e.g. if I add something like AND root.name = 'filename'.
It's almost as if it's scanning the entire partition based on the hash that's derived from it.
The query returns no data, which is fine, as it is someone looking for child folders under a given node; once you get deep it just gets very slow.
Query Metrics
x-ms-documentdb-query-metrics:
totalExecutionTimeInMs=1807.61;
queryCompileTimeInMs=0.08;
queryLogicalPlanBuildTimeInMs=0.04;
queryPhysicalPlanBuildTimeInMs=0.06;
queryOptimizationTimeInMs=0.01;
VMExecutionTimeInMs=1807.11;
indexLookupTimeInMs=0.65;
documentLoadTimeInMs=1247.08;
systemFunctionExecuteTimeInMs=0.00;
userFunctionExecuteTimeInMs=0.00;
retrievedDocumentCount=72554;
retrievedDocumentSize=59561577;
outputDocumentCount=0;
outputDocumentSize=49;
writeOutputTimeInMs=0.00;
indexUtilizationRatio=0.00
From string
x-ms-documentdb-query-metrics: totalExecutionTimeInMs=1807.61;queryCompileTimeInMs=0.08;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.01;VMExecutionTimeInMs=1807.11;indexLookupTimeInMs=0.65;documentLoadTimeInMs=1247.08;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=72554;retrievedDocumentSize=59561577;outputDocumentCount=0;outputDocumentSize=49;writeOutputTimeInMs=0.00;indexUtilizationRatio=0.00
This is because of a path length limit in Indexing v1.
We have increased the path length limit to a larger value in the new index layout, so migrating the collections to this new layout would fix the issue and provide many performance benefits.
We have rolled out the new index layout for new collections by default. If it is possible for you to recreate the current collection and migrate existing data over there, it would be great. Otherwise, an alternative is to trigger the migration process to move existing collections to the new index layout. The following C# method can be used to do that:
// Requires: using System; using System.Globalization; using System.Threading.Tasks;
//           using Microsoft.Azure.Documents; using Microsoft.Azure.Documents.Client;
static async Task UpgradeCollectionToIndexV2Async(
    DocumentClient client,
    string databaseId,
    string collectionId)
{
    // Read the current collection definition.
    DocumentCollection collection = (await client.ReadDocumentCollectionAsync(string.Format("/dbs/{0}/colls/{1}", databaseId, collectionId))).Resource;
    // Request the new index layout and write the definition back.
    collection.SetPropertyValue("IndexVersion", 2);
    ResourceResponse<DocumentCollection> replacedCollection = await client.ReplaceDocumentCollectionAsync(collection);
    Console.WriteLine(string.Format(CultureInfo.InvariantCulture, "Upgraded indexing version for database {0}, collection {1} to v2", databaseId, collectionId));
}
It could take several hours for the migration to complete, depending on the amount of data in the collection. The issue should be addressed once it is completed.
(This was copy pasted from an email conversation we had to resolve this issue)
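The query metrics in the question already hint at the scan: 72554 documents were loaded and 0 were returned. As a rough sketch, the header value (the part after "x-ms-documentdb-query-metrics:") can be parsed and that ratio checked in code. The class and method names here are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class QueryMetricsParser
{
    // Parses "key=value;key=value;..." into a dictionary of numeric metrics.
    // Pairs that do not have a numeric value are skipped.
    public static Dictionary<string, double> Parse(string headerValue)
    {
        var metrics = new Dictionary<string, double>(StringComparer.OrdinalIgnoreCase);
        string[] pairs = headerValue.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string pair in pairs)
        {
            string[] parts = pair.Split(new[] { '=' }, 2);
            double value;
            if (parts.Length == 2 &&
                double.TryParse(parts[1].Trim(), NumberStyles.Float, CultureInfo.InvariantCulture, out value))
            {
                metrics[parts[0].Trim()] = value;
            }
        }
        return metrics;
    }

    // Fraction of loaded documents that made it into the result.
    // A value near 0 means most loaded documents were discarded by the filter,
    // i.e. the index did not narrow the lookup. Returns 1 if nothing was loaded.
    public static double OutputToRetrievedRatio(Dictionary<string, double> metrics)
    {
        double retrieved;
        double output;
        metrics.TryGetValue("retrievedDocumentCount", out retrieved);
        metrics.TryGetValue("outputDocumentCount", out output);
        return retrieved > 0 ? output / retrieved : 1.0;
    }
}
```

Logging this ratio next to the request charge makes it easy to spot the queries that behave like the long-path one, both before and after the index migration.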

Microsoft.Azure.Documents.Client for Azure Cosmos multiple calls

I'm trying to understand why the Microsoft.Azure.Documents.Client makes multiple calls when running a query.
var option = new FeedOptions { EnableCrossPartitionQuery = true, MaxItemCount = 100};
var myobj = cosmosClient.CreateDocumentQuery<myCosmosObj>(documentUri, option)
    .Where(x => x.ID == request.Id);
while (myobj.AsDocumentQuery().HasMoreResults)
{
    var results = await myobj.AsDocumentQuery().ExecuteNextAsync<myCosmosObj>();
    resultList.AddRange(results);
}
A Fiddler trace shows 5 calls to the cosmos collection dbs/mycollectionname/colls/docs (the while loop above runs 5 times)
My question: a single network hop would improve performance, so I would like to understand why it is making 5 network calls, and whether there is something I need to do with the configuration to adjust this. I have already tried adjusting the ResultSize. This is roughly a 3GB collection.
David's answer is theoretically correct; however, it is missing a crucial point.
Your code is wrong. The way you create the document query inside the loop means that you will always query the result of the first execution 5 times.
The code should actually be like this:
var query = cosmosClient.CreateDocumentQuery<myCosmosObj>(documentUri, option)
    .Where(x => x.ID == request.Id).AsDocumentQuery();
while (query.HasMoreResults)
{
    var results = await query.ExecuteNextAsync<myCosmosObj>();
    resultList.AddRange(results);
}
This will now properly run your query and it will use the continuation properties of the query object in order to read the next page in ExecuteNextAsync.
With a partitioned collection, the most efficient way to find a document by id is by also specifying the partition key (which then directs your query to a single partition). Without PK, there's really no way to know, up front, which partition your documents will reside in. And that's likely why you're seeing 5 calls (you likely have 5 partitions).
The alternative, which your code shows, is to do a cross-partition query, which has to do one query per partition, to seek the document you're looking for.
One more thing to note: a query will have a higher RU cost than a read. And if you already know the partition key and id, there's no need to invoke the query engine (as you can only retrieve a single document anyway for a given partition key + id combination).
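For the Microsoft.Azure.Documents.Client SDK used in this question, a read by id and partition key could be sketched as follows. The helper name and the choice to return default(T) on a 404 are illustrative assumptions, not part of either answer:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

// Hypothetical helper class; only the SDK calls are real API.
public static class PointRead
{
    // Reads one document by id and partition key value.
    // This is a single request to a single partition and does not
    // go through the query engine.
    public static async Task<T> ReadOrDefaultAsync<T>(
        DocumentClient client,
        string databaseId,
        string collectionId,
        string id,
        object partitionKeyValue)
    {
        Uri documentUri = UriFactory.CreateDocumentUri(databaseId, collectionId, id);
        var options = new RequestOptions { PartitionKey = new PartitionKey(partitionKeyValue) };

        try
        {
            DocumentResponse<T> response = await client.ReadDocumentAsync<T>(documentUri, options);
            return response.Document;
        }
        catch (DocumentClientException ex) when (ex.StatusCode == HttpStatusCode.NotFound)
        {
            // No document with this id exists under this partition key value.
            return default(T);
        }
    }
}
```

This only helps when the caller can supply the partition key value; without it, the cross-partition query from the corrected loop above remains the only option.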

CosmosDB SQL API, still attempting to get list of partitions and their sizes

I've seen a bunch of answers on StackOverflow stating that Microsoft's Cosmos DB simply doesn't support getting a list of partition keys. This has been bothering me, as it seems like a sort of zeroth requirement for any data store to get a list of the logical partition names and sizes. Any other data store will give you things like table sizes, and I can't believe Microsoft would leave this off.
I don't think they'd do this, so it must just not be documented (or documented well at least). In the following code:
var client = new DocumentClient(
    endpoint,
    authKey);
Database db = client.CreateDatabaseQuery().Where(d => d.Id == databaseName).AsEnumerable().FirstOrDefault();
//Sure is a lot of verbose faff. Have to keep specifying things you've already basically specified when you initialized the client...
var collection = client.CreateDocumentCollectionQuery(databaseSelfLink).Where(c => c.Id == myCollectionName).ToArray().FirstOrDefault();
//This yields "/$pk" in the value - so I guess there's just one path,
//but I still have a lot of distinct values in that path.
//I try a DocumentQuery next to drill down.
var partitionKeys = collection.PartitionKey.Paths;
var querySpec = new SqlQuerySpec("SELECT DISTINCT c.PartitionKey FROM c");
var test = client.CreateDocumentQuery(collection.SelfLink, querySpec);
When I set a breakpoint after this last line and look at the test object, I see it has a-k sub-objects, each with an integer value. I'm not sure what these are, but could they be partitions and sizes? Is there a better way to pull them out?
I have a bit of an answer, though not a full one. Therefore, I won't mark this question answered yet.
I found this document: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.documentcollection.partitionkeyrangestatistics?view=azure-dotnet
Here is some modified code pulling back partitions and sizes. It only guarantees reporting on partitions > 1GB according to the above document, though the smallest reported partition I've seen so far is 42MB:
var client = new DocumentClient(
    endpoint,
    authKey);
Database db = client.CreateDatabaseQuery().Where(d => d.Id == databaseName).AsEnumerable().FirstOrDefault();
var collection = client.CreateDocumentCollectionQuery(databaseSelfLink).Where(c => c.Id == myCollectionName).ToArray().FirstOrDefault();
// Re-read the collection, asking the service to include per-partition statistics.
collection = await client.ReadDocumentCollectionAsync(
    collection.SelfLink,
    new RequestOptions { PopulatePartitionKeyRangeStatistics = true });
Console.WriteLine(collection.PartitionKeyRangeStatistics.ToString());
So now all I have to do is parse the strings with the reported partition names and sizes. I'll still have further questions to answer regarding dynamically creating new partitions in order to make a system which can properly scale while using the full provisioned RUs.
