Cosmos DB Gremlin API query performance impacted by the order in which vertices are added - azure-cosmosdb-gremlinapi

I have a simple Cosmos DB Graph database that models people owning a cat or a dog.
Person - Owns - Dog
Person - Owns - Cat
There are 10,000 Persons that own a Dog and 10,000 Persons that own a Cat. The 10,000 Dog owners were added to the database first, then the 10,000 Cat owners.
The execution profiles for the queries that return dog-owning versus cat-owning Persons are quite different: the Dog query performs significantly better than the Cat query. The gap grows as more nodes are added to the DB (e.g. 50K each of Cat and Dog).
Note the result counts below: Dog only needs to look up 1,000 Persons to return the required dataset, but Cat needs to look up 11,000. Presumably this is because the Dog owners were added first, and Cosmos starts with the first nodes added and traverses them all to find the right edge.
How can I improve the performance of the Cat owners query? This is a simplistic example; our real use case is more complex and would involve filtering on the Person node before traversing to the Cat node, so starting at Cat is not an option.
"gremlin": "g.V().haslabel('Person').out('Owns').haslabel('Dog').limit(20).executionprofile()",
"activityId": "1802c19d-2f59-4378-93c8-0fd50ce97b7f",
"totalTime": 273,
"totalResourceUsage": 710.8800000000001,
"metrics": [
{
"name": "GetVertices",
"time": 172.8297999999995,
"stepResourceUsage": 513.28,
"annotations": {
"percentTime": 63.4,
"percentResourceUsage": 72.2
},
"counts": {
"resultCount": 1000
},
"gremlin": "g.V().haslabel('Person').out('Owns').haslabel('Cat').limit(20).executionprofile()",
"activityId": "5fddfe3e-fc15-431e-aa79-4e5f3acd27c9",
"totalTime": 2354,
"totalResourceUsage": 3998.340000000008,
"metrics": [
{
"name": "GetVertices",
"time": 346.0707000000017,
"stepResourceUsage": 513.28,
"annotations": {
"percentTime": 14.7,
"percentResourceUsage": 12.84
},
"counts": {
"resultCount": 11000
},
I used Gremlinq to set up a sample database with the following code.
int numberOfOwners = 10000;

// Add 10,000 Person -> Owns -> Dog pairs first...
for (int i = 0; i < numberOfOwners; i++)
{
    await _g.AddV(new Person()).As((_, addedPerson)
        => _.AddE<Owns>().To(__ => __.AddV(new Dog())));
}

// ...then 10,000 Person -> Owns -> Cat pairs.
for (int i = 0; i < numberOfOwners; i++)
{
    await _g.AddV(new Person()).As((_, addedPerson)
        => _.AddE<Owns>().To(__ => __.AddV(new Cat())));
}
I've tested this in three different scenarios, and each time queries are slower for data that was added to the database later.

Related

Inconsistent Cosmos DB graph RU cost for the same query

We're using a Cosmos DB Graph API instance provisioned with 120K RUs. We've set up a consistent partitioning structure using the /partition_key property.
When querying our graph using Gremlin, we've noticed that some queries use an unreasonably high number of RUs compared to others. The queries themselves are identical except for the partition_key value.
The following query costs 23.25 RUs, for example:
g.V().has('partition_key', 'xxx')
Whereas the same query with a different partition_key value costs 4.14 RUs:
g.V().has('partition_key', 'yyy')
Looking at the .executionProfile() results for both queries, they look similar.
The expensive query which costs 23.25 RUs (xxx):
[
  {
    "gremlin": "g.V().has('partition_key', 'xxx').executionProfile()",
    "activityId": "ec181c9d-59a1-4849-9c08-111d6b465b88",
    "totalTime": 12,
    "totalResourceUsage": 19.8,
    "metrics": [
      {
        "name": "GetVertices",
        "time": 12.324,
        "stepResourceUsage": 19.8,
        "annotations": {
          "percentTime": 98.78,
          "percentResourceUsage": 100
        },
        "counts": {
          "resultCount": 1
        },
        "storeOps": [
          {
            "fanoutFactor": 1,
            "count": 1,
            "size": 848,
            "storageCount": 1,
            "storageSize": 791,
            "time": 12.02,
            "storeResourceUsage": 19.8
          }
        ]
      },
      {
        "name": "ProjectOperator",
        "time": 0.15259999999999962,
        "stepResourceUsage": 0,
        "annotations": {
          "percentTime": 1.22,
          "percentResourceUsage": 0
        },
        "counts": {
          "resultCount": 1
        }
      }
    ]
  }
]
The cheap query which costs 4.14 RUs (yyy):
[
  {
    "gremlin": "g.V().has('partition_key', 'yyy').executionProfile()",
    "activityId": "841e1c37-471c-461e-b784-b53893a3c349",
    "totalTime": 6,
    "totalResourceUsage": 3.08,
    "metrics": [
      {
        "name": "GetVertices",
        "time": 5.7595,
        "stepResourceUsage": 3.08,
        "annotations": {
          "percentTime": 98.71,
          "percentResourceUsage": 100
        },
        "counts": {
          "resultCount": 1
        },
        "storeOps": [
          {
            "fanoutFactor": 1,
            "count": 1,
            "size": 862,
            "storageCount": 1,
            "storageSize": 805,
            "time": 5.4,
            "storeResourceUsage": 3.08
          }
        ]
      },
      {
        "name": "ProjectOperator",
        "time": 0.07500000000000018,
        "stepResourceUsage": 0,
        "annotations": {
          "percentTime": 1.29,
          "percentResourceUsage": 0
        },
        "counts": {
          "resultCount": 1
        }
      }
    ]
  }
]
Both queries return a single vertex of about the same size.
Can someone please help explain why this is so? Why is one significantly more expensive than the other? Is there some aspect of Cosmos DB partitioning that I don't understand?
Edit 1:
We also did some experimentation by adding other query parameters, such as id and also label. An id clause did indeed reduce the cost of the expensive query from ~23 RUs to ~4.57 RUs. The problem with this approach is that in general it makes all other queries less efficient (i.e. it increases RUs).
For example, other queries (like the fast one in this ticket) go from ~4.14 RUs to ~4.80 RUs, with the addition of an id clause. So that's not really feasible as 99% of queries would be worse off. We need to find the root cause.
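For illustration, the id experiment amounts to a query along these lines (the id value is hypothetical, not from the original post):
g.V().has('partition_key', 'xxx').hasId('some-vertex-id')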
Edit 2:
Queries are run on the same container using the Data Explorer tool in Azure Portal. Here's the partition distribution graph:
The "issue" you're describing can be related to the size boundaries of physical partitions (PP) and logical partition (LP). Cosmos DB allows "infinite" scaling based on its partitioning architecture. But scaling performance and data growth highly depends on logical partition strategy. Microsoft highly recommend to have as granular LP key as possible so data will be equally distributed across PPs.
Initially when creating container for 120k RUs you will end-up with 12 PP - 10k RUs limit per physical partition. Once you start loading data it is possible to end-up with more PP. Following scenarios might lead to "split":
size of the LP (total size of data per your partition key) is larger than 20GB
size of PP (total sum of all LP sizes stored in this PP) is larger than 50GB
As per documentation
A physical partition split simply creates a new mapping of logical partitions to physical partitions.
Based on the PP storage allocation it looks like you had multiple "splits" resulting in provisioning of ~20 PPs.
Every time "split" occurs Cosmos DB creates two new PPs and split existing PP equally between newly created. Once this process is finished "skewed" PP is deleted. You can roughly guess number of splits by PP id's on the Metrics chart you provided (you would have id: [1-12] if no splits happened).
Splits potentially can result in higher RU consumption due to request fan-out and cross-partition queries.

Filtering and sorting with Firestore

I'm building a task management app on Firestore. Tasks can have multiple members and tags. Since I always sort and display content per user (based on due date, priority, etc.), and because of Firestore's limitations around lists and composite indexes, I ended up storing data in the following structure.
projects:
.....
101: {
name: 'task1',
members: {201: true, 202: true, ......},
tags:{'tag1':true, 'tag2':true, 'tag3':true,....}
},
102: {
name: 'task2',
members: {201: true, 202: true, ......},
tags:{'tag1':true, 'tag2':true, 'tag3':true,....}
},
103: {
name: 'task3',
members: {201: true, 202: true, ......},
tags:{'tag1':true, 'tag2':true, 'tag3':true,....}
}
.....
Now, since composite indexes have to be created manually, I ended up implementing a reverse lookup:
users:
.....
201: {
name: 'John',
tasks:
501: {taskId: 601, priority: high, ...... },
502: {taskId: 601, priority: high, ...... },
503: {taskId: 601, priority: high, ...... },
}
202: {
name: 'Doe',
tasks:
504: {taskId: 601, priority: high, ...... },
505: {taskId: 601, priority: high, ...... },
506: {taskId: 601, priority: high, ...... },
}
......
At this point, if I also have to filter by tags, then under users I would have to add a subcollection for each tag and store tasks under those as well. That creates an insane number of documents for each task: a single task with 3 members and 3 tags results in 12 documents, and any change to that task involves 12 writes.
What am I missing here? Is it the way I'm storing the data, or is it more a limitation of Firestore itself?
Firestore does have limitations on querying.
What I see is that you are trying to normalize the data, which is a good approach with an RDBMS. In Firestore it is usually recommended to denormalize the data as much as possible. It's a trade-off: reads will be fast and easy to query, but writes may be slower because you might have to write the same data in multiple places.
Denormalizing, in simple words, means having a flat data structure.
Good read:
https://angularfirebase.com/lessons/firestore-nosql-data-modeling-by-example/
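As an illustration of that trade-off (a sketch only; field names and values are illustrative, not from the question), one denormalized shape keeps a full copy of each task under every member, so a single per-user read returns everything needed for filtering and sorting, at the cost of one write per member whenever a task changes:
users:
  201: {
    name: 'John',
    tasks:
      601: { name: 'task1', dueDate: '2018-01-15', priority: 'high', tags: {'tag1': true, 'tag2': true} },
      602: { name: 'task2', dueDate: '2018-01-20', priority: 'low', tags: {'tag3': true} }
  }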

Get the latest document based on a document's property in Azure Cosmos DB

Let's say I have a Cosmos DB collection (SQL API) that contains a list of messages people have sent, each giving their current mood and a timestamp of when the message was received. People can send messages whenever they want.
In my collection I have something that looks like this:
[
  {
    "PersonName": "John",
    "CurrentMood": "Moody",
    "TimeStamp": "2012-04-23T18:25:43.511Z",
    "id": "25123829-1745-0a09-5436-9cf8bdcc95e3"
  },
  {
    "PersonName": "Jim",
    "CurrentMood": "Happy",
    "TimeStamp": "2012-05-23T17:25:43.511Z",
    "id": "6feb7b41-4b85-164e-dcd4-4e078872c5e2"
  },
  {
    "PersonName": "John",
    "CurrentMood": "Moody",
    "TimeStamp": "2012-05-23T18:25:43.511Z",
    "id": "b021a4a5-ee92-282c-0fe0-b5d6c27019af"
  },
  {
    "PersonName": "Don",
    "CurrentMood": "Sad",
    "TimeStamp": "2012-03-23T18:25:43.511Z",
    "id": "ee72cb36-4304-06e5-ed7c-1d0ff890de48"
  }
]
I would like to run a query that gets the "current" mood of every user who has sent a message (i.e. the latest message received for each person).
It's relatively easy to do for one particular user by combining TOP 1 and ORDER BY:
SELECT TOP 1 *
FROM C
WHERE C.PersonName = "John"
ORDER BY C.TimeStamp DESC
But looping through all my users and running this query for each one feels very wasteful and would become expensive quickly, and I can't find another approach that works.
Note that I will quickly have a lot of people sending a lot of messages.
The common pattern for this is to have two collections: one that stores the (user, timestamp -> mood) documents, and a downstream processor, built with Azure Functions or the change feed API directly, that computes (user -> latest mood):
[Mood Time series Collection] ==> Lambda ==> [Latest Mood Collection]
And the Latest Mood Collection will look something like this for the data stream above. You then use this for your lookups (which are now key lookups).
{
  "PersonName": "Jim",
  "LatestMood": "Happy",
  "LatestTimeStamp": "2012-05-23T17:25:43.511Z",
  "id": "6feb7b41-4b85-164e-dcd4-4e078872c5e2"
},
{
  "PersonName": "John",
  "LatestMood": "Moody",
  "LatestTimeStamp": "2012-05-23T18:25:43.511Z",
  "id": "b021a4a5-ee92-282c-0fe0-b5d6c27019af"
},
{
  "PersonName": "Don",
  "LatestMood": "Sad",
  "LatestTimeStamp": "2012-03-23T18:25:43.511Z",
  "id": "ee72cb36-4304-06e5-ed7c-1d0ff890de48"
}

Google Analytics Reporting API: Possible to get a pivot table of pages and their views from each country?

I currently have a table like
Page | Views | Singapore | US | Country A | ...
-----|-------|-----------|-----|-----------| ...
A | 200 | 50 | 100 | 30 | ...
B | 220 | 20 | 150 | 20 | ...
Generated from the following query:
{
  "reportRequests": [
    {
      "viewId": "XXXXXX",
      "pageSize": "5",
      "dateRanges": [
        {
          "startDate": "2017-10-01",
          "endDate": "2017-10-31"
        }
      ],
      "metrics": [
        {
          "expression": "ga:pageviews"
        }
      ],
      "dimensions": [
        {
          "name": "ga:pageTitle"
        }
      ],
      "orderBys": [
        {
          "sortOrder": "DESCENDING",
          "fieldName": "ga:pageviews"
        }
      ],
      "pivots": [
        {
          "dimensions": [
            {
              "name": "ga:country"
            }
          ],
          "metrics": [
            {
              "expression": "ga:pageviews"
            }
          ]
        }
      ]
    }
  ]
}
But I want to focus only on specific countries. For example, I only want to see the Singapore and US columns, without data for any other country. How can I do that? Total views can still include data from other countries.
If I provide a list of countries, can it always show those columns even if there are no views?
I see a few questions here:
How to filter on specific countries?
You can use query filters or segments to achieve this. The Query Explorer will help you build the correct filters:
https://ga-dev-tools.appspot.com/query-explorer/.
Note: the operator for OR is , (see reference).
"Can it always show those columns even if there are no views?"
By default, no: Google Analytics only returns data it has. If there are no views, there is no data, so nothing is returned; returning data that doesn't exist in GA would waste processing power and network bandwidth. You will have to rebuild the missing rows yourself.
You might want to try the includeEmptyRows option of the v4 Reporting API, though, which defaults to FALSE, as it might do what you're looking for.
Note on the country dimension
Instead of ga:country, you can use ga:countryIsoCode to filter on two-letter country codes and avoid dealing with percent encoding for countries whose names contain several words (e.g. US vs. United%20States).
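Putting these pieces together, here is a sketch of what the additions to the reportRequest might look like (worth verifying against the v4 reference; the dimensionFilterClauses inside the pivot restricts only the pivot columns, so the top-level totals still include all countries, and includeEmptyRows is set at the request level):
{
  "includeEmptyRows": true,
  "pivots": [
    {
      "dimensions": [
        { "name": "ga:countryIsoCode" }
      ],
      "dimensionFilterClauses": [
        {
          "filters": [
            {
              "dimensionName": "ga:countryIsoCode",
              "operator": "IN_LIST",
              "expressions": ["SG", "US"]
            }
          ]
        }
      ],
      "metrics": [
        { "expression": "ga:pageviews" }
      ]
    }
  ]
}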

Cosmos DocumentDb: Inefficient ORDER BY

I'm doing some early trials on Cosmos DB and populated a collection with a set of DTOs. While some simple WHERE queries seem to return quite quickly, others are horribly inefficient. A simple SELECT COUNT(1) FROM c took several seconds and used over 10K request units. Even worse, a little experiment with ordering was also very discouraging. Here's my query:
SELECT TOP 20 c.Foo, c.Location from c
ORDER BY c.Location.Position.Latitude DESC
My collection (if the count was correct; I got super weird results running it while populating the DB, but that's another issue) contains about 300K DTOs. The above query ran for about 30 seconds (I currently have the DB provisioned at 4K RU/s) and ate 87453.439 RUs over 6 round trips. Obviously, that's a no-go.
According to the documentation, the numeric Latitude property should be indexed, so I'm not sure whether it's me screwing up here or whether reality didn't really catch up with the marketing ;)
Any idea on why this doesn't perform properly? Thanks for your advice!
Here's a document as returned:
{
  "Id": "y-139",
  "Location": {
    "Position": {
      "Latitude": 47.3796977,
      "Longitude": 8.523499
    },
    "Name": "Restaurant Eichhörnli",
    "Details": "Nietengasse 16, 8004 Zürich, Switzerland"
  },
  "TimeWindow": {
    "ReferenceTime": "2017-07-01T15:00:00",
    "ReferenceTimeUtc": "2017-07-01T15:00:00+02:00",
    "Direction": 0,
    "Minutes": 45
  }
}
The DB/collection I use is just the default one that can be created for the ToDo application from within the Azure portal. This apparently created the following indexing policy:
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*",
      "indexes": [
        {
          "kind": "Range",
          "dataType": "Number",
          "precision": -1
        },
        {
          "kind": "Hash",
          "dataType": "String",
          "precision": 3
        }
      ]
    }
  ],
  "excludedPaths": []
}
Update as of Dec 2017:
I revisited my unchanged database and ran the same query again. This time it's fast: instead of >87,000 RUs, it eats up around 6 RUs. Bottom line: it appears there was something very, very wrong with Cosmos DB, but whatever it was, it seems to be gone.
