CosmosDB, very long index that's also the partition key

We are storing a folder tree. The number of items is huge, so we have partitioned on the parent folder.
When we issue queries such as
SELECT * FROM root WHERE root.parentPath = "\\server\share\shortpath" AND root.isFile
the RU charge is very low and the performance is very good.
But when we have a long path, e.g.
SELECT * FROM root WHERE root.parentPath = "\\server\share\a very\long\path\longer\than\this" AND root.isFile
the RU charge goes up to 5000 and the performance suffers.
parentPath works well as a partition key, as all queries include this field in the filter.
If I add another clause to the query it also becomes very fast, e.g. adding something like AND root.name = 'filename'.
It's almost as if it's scanning the entire partition that the key's hash resolves to.
The query returns NO DATA, which is fine, since it's someone looking for child folders under a given node; it just gets very slow once the path gets deep.
Query Metrics
x-ms-documentdb-query-metrics:
totalExecutionTimeInMs=1807.61;
queryCompileTimeInMs=0.08;
queryLogicalPlanBuildTimeInMs=0.04;
queryPhysicalPlanBuildTimeInMs=0.06;
queryOptimizationTimeInMs=0.01;
VMExecutionTimeInMs=1807.11;
indexLookupTimeInMs=0.65;
documentLoadTimeInMs=1247.08;
systemFunctionExecuteTimeInMs=0.00;
userFunctionExecuteTimeInMs=0.00;
retrievedDocumentCount=72554;
retrievedDocumentSize=59561577;
outputDocumentCount=0;
outputDocumentSize=49;
writeOutputTimeInMs=0.00;
indexUtilizationRatio=0.00

This is caused by a path length limit in the v1 indexing layout.
We have increased the path length limit in the new index layout, so migrating the collection to the new layout will fix the issue and provide other performance benefits as well.
The new index layout is now the default for new collections. If it is possible for you to recreate the current collection and migrate the existing data over, that would be ideal. Otherwise, an alternative is to trigger the migration process that moves an existing collection to the new index layout. The following C# method can be used to do that:
static async Task UpgradeCollectionToIndexV2Async(
    DocumentClient client, string databaseId, string collectionId)
{
    // Read the current collection definition.
    DocumentCollection collection = (await client.ReadDocumentCollectionAsync(
        string.Format("/dbs/{0}/colls/{1}", databaseId, collectionId))).Resource;

    // Setting IndexVersion to 2 and replacing the collection triggers the migration.
    collection.SetPropertyValue("IndexVersion", 2);
    ResourceResponse<DocumentCollection> replacedCollection =
        await client.ReplaceDocumentCollectionAsync(collection);

    Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
        "Upgraded indexing version for database {0}, collection {1} to v2",
        databaseId, collectionId));
}
It could take several hours for the migration to complete, depending on the amount of data in the collection. The issue should be resolved once it has finished.
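For reference, a sketch of how the migration could be polled with the same SDK; this assumes the IndexTransformationProgress property on the collection read response (populated when PopulateQuotaInfo is set), and the one-minute interval is arbitrary:
static async Task WaitForIndexMigrationAsync(
    DocumentClient client, string databaseId, string collectionId)
{
    while (true)
    {
        // PopulateQuotaInfo must be set for progress to be returned on the response.
        ResourceResponse<DocumentCollection> response = await client.ReadDocumentCollectionAsync(
            string.Format("/dbs/{0}/colls/{1}", databaseId, collectionId),
            new RequestOptions { PopulateQuotaInfo = true });

        long progress = response.IndexTransformationProgress; // 0 to 100
        Console.WriteLine("Index transformation progress: {0}%", progress);
        if (progress >= 100)
        {
            break;
        }

        await Task.Delay(TimeSpan.FromMinutes(1));
    }
}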
(This was copied and pasted from an email conversation we had while resolving this issue.)

Related

How to handle a data model with a long text column + associated embedded metadata in an Android Room database

I'm new to Android, and rather new to SQL in general.
I have a data model where a Text consists of TextMetadata as well as a long string, which is the text content itself. So
Text {
    metadata: {
        author: string,
        title: string
        // more associated metadata
    },
    textContent: long string, or potentially an array of lines or paragraphs
}
I'd like to load a list of the metadata for all texts on the app's landing page, without incurring the cost of reading all the long strings (or having operations slowed down because the table has a column with a long string).
What is the proper pattern here? Should I use two tables and relate them? Or can I use one table/one @Entity with embedded metadata, and do some fancy stuff in the DAO to just list/sort/operate on the embedded metadata?
Most of my background is with NoSQL databases, so I could be thinking about this entirely wrong. Advice on general best practices here would be helpful, but I guess I have two core questions:
Does having a long/very long string/TEXT column cause performance problems when operating on that specific table/row?
Is there a clean way, using Kotlin annotations, to express embedded metadata so it's easy to fetch in the DAO, without having to write a long SELECT over each individual column?
This is a good question, and it is also relevant to other environments.
The core issue: how do you store large data without affecting your database?
As a rule of thumb, you should avoid storing information in your database that is not queryable. Large strings, images, or even metadata that you will never query do not belong in your DB. I was surprised when I realized how many design patterns there are around MongoDB (and they are relevant to other NoSQL databases as well).
So, we know this data should NOT be stored in the DB. But because the alternative (the file system) is far worse (unless you would like to implement your own secured file-system-based store), we should at least try to minimize its footprint.
Our strategy: save large data chunks in a different table, without defining it as an entity (there is no need to wrap it as an entity anyway).
How are we going to do that?
Thankfully, Android Room gives you direct access to SQLite, and it can be used directly (read the docs). This is a good place to recall that Room is built on top of SQLite, which is (in my opinion) a fascinating database; I enjoy working with it very much, and it keeps getting better over time. The advantage: we are still using Android APIs while storing large data in a performant, unified, and secure way. yay
Steps we are going to perform:
Create a class that will manage a new database, for storing large data only.
Define a command that will create our table, constructed of 2 columns:
key (primary key) - the id of the item
value - the item itself
In the original DB, define a column on the Text entity that will hold the id (key) of the stored large text.
Whenever you save an item to your large-items table, get its id and store it in your entity.
You can of course use only 1 table for this, but SQLite requires a certain amount of understanding and is NOT as easy as Android Room, so it's your choice whether to use 1 or 2 tables in your solution.
Below is code that demonstrates the main principle of my proposal:
import android.content.ContentValues
import android.content.Context
import android.database.sqlite.SQLiteDatabase
import android.database.sqlite.SQLiteOpenHelper
import android.provider.BaseColumns

object LargeDataContract {
    // All tables for handling large data will be defined here.
    object TextEntry : BaseColumns {
        const val TABLE_NAME = "text_entry"
        const val COLUMN_NAME_KEY = "key"
        const val COLUMN_NAME_VALUE = "value"
    }
}

// In the future, append to this "create statement" whenever you add more tables to your database.
private const val SQL_CREATE_ENTRIES =
    "CREATE TABLE ${LargeDataContract.TextEntry.TABLE_NAME} (" +
        "${LargeDataContract.TextEntry.COLUMN_NAME_KEY} INTEGER PRIMARY KEY," +
        "${LargeDataContract.TextEntry.COLUMN_NAME_VALUE} TEXT)"

// A helper that will assist you in initializing your database properly.
class LargeDataDbHelper(context: Context) :
    SQLiteOpenHelper(context, DATABASE_NAME, null, DATABASE_VERSION) {

    override fun onCreate(db: SQLiteDatabase) {
        db.execSQL(SQL_CREATE_ENTRIES)
    }

    // onUpgrade is abstract in SQLiteOpenHelper, so it must be overridden. Please read
    // the SQLite documentation to better understand versioning, upgrade, and downgrade.
    override fun onUpgrade(db: SQLiteDatabase, oldVersion: Int, newVersion: Int) {
        // If you change the database schema, you must increment DATABASE_VERSION
        // and handle the migration here.
    }

    companion object {
        const val DATABASE_VERSION = 1
        const val DATABASE_NAME = "LargeData.db"
    }
}

// Create an instance and connect to your database.
val dbHelper = LargeDataDbHelper(context)

// Write an item to your database.
val db = dbHelper.writableDatabase
val values = ContentValues().apply {
    put(LargeDataContract.TextEntry.COLUMN_NAME_VALUE, "some long value goes here")
}
val key = db.insert(LargeDataContract.TextEntry.TABLE_NAME, null, values)
// Now store the returned key in your entity; this is the only reference you should need.
Bottom line: this approach will get you as much performance as possible while still using Android APIs. Granted, it is not the most "intuitive" solution, but this is how we gain performance and make great apps, while educating ourselves and upgrading our knowledge and skill set along the way. Cheers

Cosmos DB .NET SDK order by a dynamic field (parameterized)

I use the .NET SDK to retrieve some items from a Cosmos DB instance, using continuation tokens to retrieve paginated pieces of data. So far this works.
I use a generic Get function to retrieve the items:
var query = container.GetItemQueryIterator<T>(
    new QueryDefinition("SELECT * FROM c"),
    continuationToken: continuationToken,
    requestOptions: new QueryRequestOptions()
    {
        MaxItemCount = itemCount
    });
However, I would like to add a dynamic ORDER BY field, where the caller can decide which field the results should be ordered by. I tried adding a parameterized field like:
new QueryDefinition("SELECT * FROM c order by #orderBy")
.WithParameter("#orderBy", "fieldname")
But this does not work; I keep getting syntax errors while executing. Is it actually possible to add an ORDER BY clause dynamically?
The .WithParameter() fluent syntax in QueryDefinition can only substitute values (typically in the WHERE clause), not property names, so you will have to construct your SQL with the ORDER BY field appended dynamically to the SQL string.
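As a hedged sketch of that approach (the field names and allow-list are illustrative, not from the question), validating the caller-supplied field before concatenating it guards against injection:
private static readonly HashSet<string> AllowedOrderByFields =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "name", "createdAt", "size" };

public static QueryDefinition BuildOrderedQuery(string orderByField)
{
    // The property name is concatenated, not parameterized, so this allow-list
    // is what prevents SQL injection here.
    if (!AllowedOrderByFields.Contains(orderByField))
    {
        throw new ArgumentException($"Unsupported order-by field: {orderByField}");
    }
    return new QueryDefinition($"SELECT * FROM c ORDER BY c.{orderByField}");
}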
One thing to keep in mind: unless this is a small workload with less than 20 GB of data, this container will not scale unless your queries use the partition key. Another consideration is that ORDER BY gets much better performance when backed by composite indexes. But if results can be sorted on a wide range of properties, writes may get very expensive from all of the individual composite indexes.
In all cases, if this is meant to scale, you should measure and benchmark high-concurrency operations.
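For illustration, a minimal sketch of defining a composite index with the .NET SDK, assuming an existing Database instance named database and the hypothetical /name and /createdAt sort paths:
using System.Collections.ObjectModel;

// Container id, partition key path, and indexed paths are all illustrative.
var containerProperties = new ContainerProperties(id: "items", partitionKeyPath: "/pk");
containerProperties.IndexingPolicy.CompositeIndexes.Add(new Collection<CompositePath>
{
    new CompositePath { Path = "/name", Order = CompositePathSortOrder.Ascending },
    new CompositePath { Path = "/createdAt", Order = CompositePathSortOrder.Descending },
});
Container container = await database.CreateContainerIfNotExistsAsync(containerProperties);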

Cosmos client and versioning of records

I'm working on a web API where I need to store all previous versions of a record. So for the PUT endpoint, instead of updating an existing record, a new record is created with the same partition key. This complicates things a lot, and a simple read method, which should give you the most recently created record, becomes:
public async Task<IEnumerable<T>> ReadLatestAsync(Expression<Func<T, bool>> predicate)
{
    var entityList = new List<T>();
    var query = _container.GetItemLinqQueryable<T>()
        .Where(predicate);

    using (var iterator = query.ToFeedIterator())
    {
        while (iterator.HasMoreResults)
        {
            entityList.AddRange(await iterator.ReadNextAsync());
        }
    }

    // Keep only the newest version per partition key.
    return entityList
        .GroupBy(PartitionKey)
        .Select(g => g.OrderByDescending(e => e.TimeStamp).First());
}
where PartitionKey is defined in the specific repositories, like this for instance:
public override Func<Project, object> PartitionKey => (x => x.ProjectId);
This has worked okay up until now, when I need to add pagination using continuation tokens, which requires executing the whole GroupBy(PartitionKey).Select(g => g.OrderByDescending(e => e.TimeStamp).First()) part as part of the Cosmos client query for it to work correctly (if the selection is done after the pagination, each GET request will return a different number of records). But the Cosmos client doesn't support GroupBy, so I'm kind of lost as to how to do this.
Are there any queries that could do the same thing without having to use GroupBy?
Or should I just handle the whole versioning in a different way?
This does not look scalable. There are a few alternatives.
If you always do an in-partition query for the latest mutation, you could create a new item with a distinguishing property such as "version": "latest", something that lets you tell it apart from the other data in the logical partition. That way you can pass in the partition key value plus that version value and get just that record. I don't know whether EF supports point reads, but if you are using the native Cosmos SDK you could instead give id a well-known value like "latest" rather than creating a new property; then you can call ReadItemAsync() with that id and partition key value, which is the most efficient read possible in Cosmos DB.
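A minimal sketch of that "latest" item idea with the native Cosmos SDK, reusing the question's Project type (the projectId partition key parameter is illustrative):
public async Task<Project> ReadLatestProjectAsync(Container container, string projectId)
{
    try
    {
        // Each logical partition upserts one item whose id is literally "latest" on
        // every write; reading it back is then a cheap point read.
        ItemResponse<Project> response = await container.ReadItemAsync<Project>(
            id: "latest",
            partitionKey: new PartitionKey(projectId));
        return response.Resource;
    }
    catch (CosmosException ex) when (ex.StatusCode == System.Net.HttpStatusCode.NotFound)
    {
        return null; // no versions written yet
    }
}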
If the data for a logical partition will grow beyond 20 GB, you will need a different partition key with higher cardinality. In that case, use Change Feed on the container to maintain a materialized view in a second container, upserting the latest value there so that you can do efficient single-partition queries (or, again, point reads) against it.

Cosmos DB read a single document without partition key

A container has a function called ReadItemAsync. The problem is that I do not have the partition key, only the id of the document. What is the best approach to get just a single item in that case?
Do I have to get it via a query over the collection, like this:
// The id comes from wherever the caller obtained it; without a partition key
// this query fans out to every physical partition.
var iterator = VesselContainer.GetItemQueryIterator<CoachVessel>(
    new QueryDefinition("SELECT * FROM c WHERE c.id = @id").WithParameter("@id", id));
var result = new List<CoachVessel>();
while (iterator.HasMoreResults)
{
    foreach (var item in await iterator.ReadNextAsync())
    {
        result.Add(item);
    }
}
Posting as an answer.
Yes, you have to do a fan-out query, and because id is only distinct per partition key, even then you may end up with multiple items. Frankly speaking, if you don't have the partition key for a point read, the model for the database is not correct, and it (or the application itself) should be redesigned.
Additionally: for small, single-partition collections this cross-partition query will not be too expensive, because the collection is small. However, once the database starts to scale out, it will get increasingly slower and more expensive, as the query fans out to an ever-increasing number of physical partitions. As stated above, I would strongly recommend you modify the app to pass the partition key value in the request. That allows a single point-read operation, which is extremely fast and efficient.
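For comparison, a point read once the partition key value is available might look like this (the vesselId and partitionKeyValue names are illustrative):
// id plus partition key addresses exactly one item; cost is about 1 RU for a 1 KB item.
ItemResponse<CoachVessel> response = await VesselContainer.ReadItemAsync<CoachVessel>(
    id: vesselId,
    partitionKey: new PartitionKey(partitionKeyValue));
CoachVessel vessel = response.Resource;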
Good luck.
Try using ReadItemAsync like:
ItemResponse<dynamic> log = await container.ReadItemAsync<dynamic>(ID, PartitionKey.None);
Note that PartitionKey.None only matches items that were created without a partition key value.

More efficient SQL for retrieving thousands of records on a view

I am using LINQ to SQL as my ORM, and I have a list of ids (up to a few thousand) passed into my retriever method; with that list I want to grab all User records that correspond to those unique ids. To clarify, imagine I have something like this:
List<IUser> GetUsersForListOfIds(List<int> ids)
{
    using (var db = new UserDataContext(_connectionString))
    {
        var results = (from user in db.UserDtos
                       where ids.Contains(user.Id)
                       select user);
        return results.Cast<IUser>().ToList();
    }
}
Essentially that gets translated into SQL as
select * from dbo.Users where userId in ([comma-delimited list of ids])
I'm looking for a more efficient way of doing this. The problem is that the IN clause seems to take too long (over 30 seconds).
We will need more information on your database setup, such as indexes and the type of server (see Mitch Wheat's post). The type of database would help as well; some databases handle IN clauses poorly.
From a troubleshooting standpoint: have you isolated the time delay to the SQL server? Can you run the query directly on the server and confirm it's the query that takes the extra time?
SELECT * can also have a performance impact; could you narrow the result set down to just the columns you require?
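As a sketch of that narrowing suggestion (the projected Name property is illustrative; substitute whatever columns you actually need):
// Only the projected columns travel over the wire instead of every column in the view.
var results = (from user in db.UserDtos
               where ids.Contains(user.Id)
               select new { user.Id, user.Name }).ToList();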
Edit: I just saw the 'view' comment that you added. I've had problems with view performance in the past. Is it a materialized view, or could you make it into one? Recreating the view logic as a stored procedure may also help.
Have you tried converting this to a list, so the application does the filtering in memory? i.e.:
List<IUser> GetUsersForListOfIds(List<int> ids)
{
    using (var db = new UserDataContext(_connectionString))
    {
        // ToList() pulls the entire table into memory first; the filter then runs in the app.
        var results = (from user in db.UserDtos.ToList()
                       where ids.Contains(user.Id)
                       select user);
        return results.Cast<IUser>().ToList();
    }
}
This will obviously be memory-intensive if it is run on a public-facing page on a hard-hit site. If it still takes 30+ seconds in staging/development, then my guess is that the view itself takes that long to process, or you're transferring tens of MB of data each time you retrieve it. Either way, my only suggestions are to access the table directly and retrieve only the data you need, rewrite the view, or create a new view for this particular scenario.
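Another common workaround for very large IN lists, not mentioned above, is to chunk the ids so each generated IN clause stays small; the chunk size of 500 here is an assumption to tune against your server:
List<IUser> GetUsersForListOfIdsChunked(List<int> ids)
{
    const int chunkSize = 500; // an assumption; tune for your server
    var users = new List<IUser>();
    using (var db = new UserDataContext(_connectionString))
    {
        for (int offset = 0; offset < ids.Count; offset += chunkSize)
        {
            var chunk = ids.Skip(offset).Take(chunkSize).ToList();
            // Each iteration issues a small "IN (...)" query instead of one huge one.
            users.AddRange(db.UserDtos
                .Where(u => chunk.Contains(u.Id))
                .AsEnumerable()
                .Cast<IUser>());
        }
    }
    return users;
}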
