We're using a Cosmos DB Graph API instance provisioned with 120K RUs. We've set up a consistent partitioning structure using the /partition_key property.
When querying our graph using Gremlin, we've noticed that some queries use an unreasonably high number of RUs compared to others. The queries are identical except for the partition_key value.
The following query costs 23.25 RUs, for example:
g.V().has('partition_key', 'xxx')
Whereas the same query with a different partition_key value costs 4.14 RUs:
g.V().has('partition_key', 'yyy')
Looking at the .executionProfile() results for both queries, they look similar.
The expensive query which costs 23.25 RUs (xxx):
[
{
"gremlin": "g.V().has('partition_key', 'xxx').executionProfile()",
"activityId": "ec181c9d-59a1-4849-9c08-111d6b465b88",
"totalTime": 12,
"totalResourceUsage": 19.8,
"metrics": [
{
"name": "GetVertices",
"time": 12.324,
"stepResourceUsage": 19.8,
"annotations": {
"percentTime": 98.78,
"percentResourceUsage": 100
},
"counts": {
"resultCount": 1
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 1,
"size": 848,
"storageCount": 1,
"storageSize": 791,
"time": 12.02,
"storeResourceUsage": 19.8
}
]
},
{
"name": "ProjectOperator",
"time": 0.15259999999999962,
"stepResourceUsage": 0,
"annotations": {
"percentTime": 1.22,
"percentResourceUsage": 0
},
"counts": {
"resultCount": 1
}
}
]
}
]
The cheap query which costs 4.14 RUs (yyy):
[
{
"gremlin": "g.V().has('partition_key', 'yyy').executionProfile()",
"activityId": "841e1c37-471c-461e-b784-b53893a3c349",
"totalTime": 6,
"totalResourceUsage": 3.08,
"metrics": [
{
"name": "GetVertices",
"time": 5.7595,
"stepResourceUsage": 3.08,
"annotations": {
"percentTime": 98.71,
"percentResourceUsage": 100
},
"counts": {
"resultCount": 1
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 1,
"size": 862,
"storageCount": 1,
"storageSize": 805,
"time": 5.4,
"storeResourceUsage": 3.08
}
]
},
{
"name": "ProjectOperator",
"time": 0.07500000000000018,
"stepResourceUsage": 0,
"annotations": {
"percentTime": 1.29,
"percentResourceUsage": 0
},
"counts": {
"resultCount": 1
}
}
]
}
]
Both queries return a single vertex of about the same size.
Can someone help explain why one query is significantly more expensive than the other? Is there some aspect of Cosmos DB partitioning that I don't understand?
Edit 1:
We also experimented with adding other query predicates, such as id and label. An id clause did indeed reduce the cost of the expensive query from ~23 RUs to ~4.57 RUs. The problem with this approach is that it generally makes all other queries less efficient (i.e. it increases their RU cost).
For example, other queries (like the fast one in this ticket) go from ~4.14 RUs to ~4.80 RUs with the addition of an id clause. So that's not really feasible, as 99% of queries would be worse off. We need to find the root cause.
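For clarity, the id-augmented variant referred to above was of roughly this shape ('vertex-id' being a placeholder rather than an actual id from our graph):
g.V().has('partition_key', 'xxx').has('id', 'vertex-id')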
Edit 2:
Queries are run on the same container using the Data Explorer tool in Azure Portal. Here's the partition distribution graph:
The "issue" you're describing can be related to the size boundaries of physical partitions (PP) and logical partition (LP). Cosmos DB allows "infinite" scaling based on its partitioning architecture. But scaling performance and data growth highly depends on logical partition strategy. Microsoft highly recommend to have as granular LP key as possible so data will be equally distributed across PPs.
Initially when creating container for 120k RUs you will end-up with 12 PP - 10k RUs limit per physical partition. Once you start loading data it is possible to end-up with more PP. Following scenarios might lead to "split":
size of the LP (total size of data per your partition key) is larger than 20GB
size of PP (total sum of all LP sizes stored in this PP) is larger than 50GB
As per the documentation:
A physical partition split simply creates a new mapping of logical partitions to physical partitions.
Based on the PP storage allocation, it looks like you had multiple "splits", resulting in ~20 provisioned PPs.
Every time a "split" occurs, Cosmos DB creates two new PPs and divides the existing PP's data equally between them; once this process is finished, the original, oversized PP is deleted. You can roughly guess the number of splits from the PP ids on the Metrics chart you provided (you would have ids 1-12 if no splits had happened).
Splits can potentially result in higher RU consumption due to request fan-out and cross-partition queries.
Related
I have a simple Cosmos DB Graph database that models people owning a cat or a dog.
Person - Owns - Dog
Person - Owns - Cat
There are 10,000 Persons that own a Dog and 10,000 Persons that own a Cat. The 10,000 dog owners were added to the database first, then the 10,000 cat owners were added.
The execution profiles for querying Persons that own Dogs versus Cats are quite different; the performance for Dogs is significantly better than for Cats. The gap grows when there are more nodes in the DB (e.g. 50K each of Cats and Dogs).
Note the result counts below: the Dog query only needs to look up 1,000 Persons to return the required dataset, but the Cat query needs to look up 11,000. Presumably this is because the dog owners were added first and Cosmos starts with the first nodes added, traversing them all to find the right edges.
How can I improve the performance of the Cat owners query? This is a simplistic example; our use case is more complex and would involve filtering on the Person node before traversing to the Cat node, so starting at Cat is not an option.
"gremlin": "g.V().haslabel('Person').out('Owns').haslabel('Dog').limit(20).executionprofile()",
"activityId": "1802c19d-2f59-4378-93c8-0fd50ce97b7f",
"totalTime": 273,
"totalResourceUsage": 710.8800000000001,
"metrics": [
{
"name": "GetVertices",
"time": 172.8297999999995,
"stepResourceUsage": 513.28,
"annotations": {
"percentTime": 63.4,
"percentResourceUsage": 72.2
},
"counts": {
"resultCount": 1000
},
"gremlin": "g.V().haslabel('Person').out('Owns').haslabel('Cat').limit(20).executionprofile()",
"activityId": "5fddfe3e-fc15-431e-aa79-4e5f3acd27c9",
"totalTime": 2354,
"totalResourceUsage": 3998.340000000008,
"metrics": [
{
"name": "GetVertices",
"time": 346.0707000000017,
"stepResourceUsage": 513.28,
"annotations": {
"percentTime": 14.7,
"percentResourceUsage": 12.84
},
"counts": {
"resultCount": 11000
},
I used Gremlinq to set up a sample database using the following code.
int numberOfOwners = 10000;
// add 10,000 Person -Owns-> Dog pairs first
for (int i = 0; i < numberOfOwners; i++)
{
    await _g.AddV(new Person()).As((_, addedPerson)
        => _.AddE<Owns>().To(__ => __.AddV(new Dog())));
}
// then add 10,000 Person -Owns-> Cat pairs
for (int i = 0; i < numberOfOwners; i++)
{
    await _g.AddV(new Person()).As((_, addedPerson)
        => _.AddE<Owns>().To(__ => __.AddV(new Cat())));
}
I've tested this in 3 different scenarios and each time the results of the query are slower for data that has been added to the database later.
Spatial indexing does not seem to be working on a collection that contains a document with GeoJSON coordinates. I've tried using the default indexing policy, which inherently provides spatial indexing on all fields.
I've also tried creating a new Cosmos DB account, database, and collection from scratch, without any success in getting spatial indexing to work with an ST_DISTANCE query.
I've set up a simple collection with the following indexing policy:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/\"location\"/?",
"indexes": [
{
"kind": "Spatial",
"dataType": "Point"
},
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
}
],
"excludedPaths": [
{
"path": "/*",
},
{
"path": "/\"_etag\"/?"
}
]
}
The document that I've inserted into the collection:
{
"id": "document1",
"type": "Type1",
"location": {
"type": "Point",
"coordinates": [
-50,
50
]
},
"name": "TestObject"
}
The query that should return the single document in the collection:
SELECT * FROM f WHERE f.type = "Type1" and ST_DISTANCE(f.location, {'type': 'Point', 'coordinates':[-50,50]}) < 200000
is not returning any results. If I explicitly query without using the spatial index, like so:
SELECT * FROM f WHERE f.type = "Type1" and ST_DISTANCE({'type': 'Point', 'coordinates':[f.location.coordinates[0],f.location.coordinates[1]]}, {'type': 'Point', 'coordinates':[-50,50]}) < 200000
It returns the document as it should, but it doesn't take advantage of the index, which I will need because I will be storing a lot of coordinates.
This seems to be the same issue referenced here. If I add a second document far away and change the '<' to '>' in the first query, it works!
I should mention this is only occurring on Azure. When I use the Azure Cosmos DB Emulator, it works perfectly! What is going on here?! Any tips or suggestions are much appreciated.
UPDATE: I found out why the query works on the emulator and not on Azure: the database on the emulator doesn't use provisioned (shared) throughput across its collections, while I created the database in Azure with provisioned throughput to keep costs down (i.e. 4 collections sharing 400 RU/s). I created a database in Azure without provisioned throughput and the query works with spatial indexing! I will log this issue with Microsoft to see if there is a reason for this behaviour.
Thanks for following up with the additional details regarding a fixed collection being the solution, but I did want to get some additional information.
The Cosmos DB Emulator now supports containers:
By default, you can create up to 25 fixed size containers (only supported using Azure Cosmos DB SDKs), or 5 unlimited containers, using the Azure Cosmos Emulator. By modifying the PartitionCount value, you can create up to 250 fixed size containers or 50 unlimited containers, or any combination of the two that does not exceed 250 fixed size containers (where one unlimited container = 5 fixed size containers). However, it's not recommended to set up the emulator to run with more than 200 fixed size containers, because of the overhead it adds to disk IO operations, which results in unpredictable timeouts when using the endpoint APIs.
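For reference, the PartitionCount value is set when starting the emulator from the command line; a sketch, assuming the current executable name (it may differ in older emulator versions), and note that changing this value may require resetting the emulator's data:
Microsoft.Azure.Cosmos.Emulator.exe /PartitionCount=250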
So I wanted to check which version of the emulator you were using. The current version is azure-cosmosdb-emulator-2.2.2.
I have discrepancies in the revenue metric between the data I collect from the Google Analytics API and the custom reports in the user interface.
The discrepancy is a consistent ratio for every value, with the data collected through the API always greater than the data in the custom reports.
This is the body of the request I'm using:
{
"reportRequests":[
{
"viewId":"xxxxxxxxxx",
"dateRanges": [{"startDate":"2017-07-01","endDate":"2018-12-31"}],
"metrics": [
{"expression": "ga:transactionRevenue","alias": "transactionRevenue","formattingType": "CURRENCY"},
{"expression": "ga:itemRevenue","alias": "itemRevenue","formattingType": "CURRENCY"},
{"expression": "ga:productRevenuePerPurchase","alias": "productRevenuePerPurchase","formattingType": "CURRENCY"}
],
"dimensions": [
{"name": "ga:channelGrouping"},
{"name": "ga:sourceMedium"},
{"name": "ga:dateHour"},
{"name": "ga:transactionId"},
{"name": "ga:keyWord"}
],
"pageSize": "10000"
}]}
This is an extract of the response:
{
"reports": [
{
"columnHeader": {
"dimensions": [
"ga:channelGrouping",
"ga:sourceMedium",
"ga:dateHour",
"ga:transactionId",
"ga:keyWord"
],
"metricHeader": {
"metricHeaderEntries": [
{
"name": "transactionRevenue",
"type": "CURRENCY"
},
{
"name": "itemRevenue",
"type": "CURRENCY"
},
{
"name": "productRevenuePerPurchase",
"type": "CURRENCY"
}
]
}
},
"data": {
"rows": [
{
"dimensions": [
"(Other)",
"bing / (not set)",
"2018052216",
"834042319461-01",
"(not set)"
],
"metrics": [
{
"values": [
"367.675436",
"316.55053699999996",
"316.55053699999996"
]
}
]
},
...
So, if I create a custom report in the Google Analytics user interface and look for the transaction ID 834042319461-01, I get the following result:
Google Analytics custom report filtered by transaction ID 834042319461-01
In the end I have a revenue value of 367.675436 in the API response but a value of 333.12 in the custom report; the API value is 10.37% higher. I get this same 10.37% increase for all values.
Why am I seeing this discrepancy?
What would you recommend doing to solve this problem?
Thanks.
My bet is that you're experiencing sampling (is your time range in the UI smaller than in the API?): https://support.google.com/analytics/answer/2637192?hl=en
Sampling applies when:
you customize the reports
the number of sessions for the overall time range of the report (whether or not your query returns fewer sessions) exceeds 500K (GA) or 100M (GA 360)
The consequence is that:
the report will be based on a subset of the data (the % depends on the total number of sessions)
therefore your report data won't be as accurate as usual
What you can do to reduce sampling:
increase the sample size (this will only decrease sampling to a certain extent and in most cases won't completely remove it). In the UI it's done via the option at the top of the report; in the API it's done using the samplingLevel option (see the sketch after this list)
reduce time range
create filtered views so your reports contain the data you need without needing to customize reports
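A minimal sketch of how the samplingLevel option fits into the request body from the question (LARGE is one of the allowed values; the parts that stay the same are elided):
{
"reportRequests":[
{
"viewId":"xxxxxxxxxx",
"dateRanges": [{"startDate":"2017-07-01","endDate":"2018-12-31"}],
"samplingLevel": "LARGE",
...
}]}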
Because you are looking at a particular transaction ID, this might not be a sampling issue.
If the ratio is consistent (from your question it seems to be 10.37% in every case), I believe this is an issue with the currency you are using.
Try using the local-currency metrics when making monetary-based API calls.
For example -
ga:localTransactionRevenue instead of ga:transactionRevenue
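A sketch of the metrics block from the question with the local-currency variants swapped in (ga:localTransactionRevenue and ga:localItemRevenue are listed in the Dimensions & Metrics Explorer; I'm not certain a local equivalent of ga:productRevenuePerPurchase exists, so verify before relying on one):
"metrics": [
{"expression": "ga:localTransactionRevenue","alias": "localTransactionRevenue","formattingType": "CURRENCY"},
{"expression": "ga:localItemRevenue","alias": "localItemRevenue","formattingType": "CURRENCY"}
],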
I'm doing some early trials with Cosmos and populated a collection with a set of DTOs. While some simple WHERE queries seem to return quite quickly, others are horribly inefficient. A simple COUNT(1) FROM c took several seconds and used over 10K request units. Even worse, a little experiment with ordering was also very discouraging. Here's my query:
SELECT TOP 20 c.Foo, c.Location from c
ORDER BY c.Location.Position.Latitude DESC
My collection contains about 300K DTOs (if the count was correct; I got super weird results running it while populating the DB, but that's another issue). The above query ran for about 30 seconds (I currently have the DB configured for 4K RU/s) and consumed 87,453.439 RUs over 6 round trips. Obviously, that's a no-go.
According to the documentation, the numeric Latitude property should be indexed, so I'm not sure whether it's me screwing up here or whether reality hasn't quite caught up with the marketing ;)
Any idea on why this doesn't perform properly? Thanks for your advice!
Here's a document as returned:
{
"Id": "y-139",
"Location": {
"Position": {
"Latitude": 47.3796977,
"Longitude": 8.523499
},
"Name": "Restaurant Eichhörnli",
"Details": "Nietengasse 16, 8004 Zürich, Switzerland"
},
"TimeWindow": {
"ReferenceTime": "2017-07-01T15:00:00",
"ReferenceTimeUtc": "2017-07-01T15:00:00+02:00",
"Direction": 0,
"Minutes": 45
}
}
The DB/collection I use is just the default one that can be created for the ToDo application from within the Azure portal. This apparently created the following indexing policy:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Hash",
"dataType": "String",
"precision": 3
}
]
}
],
"excludedPaths": []
}
Update as of Dec 2017:
I revisited my unchanged database and ran the same query again. This time, it's fast and instead of > 87000 RUs, it eats up around 6 RUs. Bottom line: It appears there was something very, very wrong with Cosmos DB, but whatever it was, it seems to be gone.
So we have been developing some graph-based analysis tools, using Neo4j as the persistence engine in the background. As part of this we are developing a graph data model suitable for our domain, and we want to use it in the application layer to restrict the types of nodes, or to ensure that nodes of certain types must carry certain properties. Normal data-model restrictions.
So that's the background. What I am asking is whether there is some standard way to represent a data model for a graph DB? The graph equivalent of an XSD, perhaps?
There's an open-source project supporting strong schema definitions in Neo4j: Structr (http://structr.org, see it in action: http://vimeo.com/structr/videos)
With Structr, you can define an in-graph schema of your data model including
Type inheritance
Supported data types: Boolean, String, Integer, Long, Double, Date, Enum (+ values)
Default values
Cardinality (1:1, 1:*, *:1)
Not-null constraints
Uniqueness constraints
Full type safety
Validation
Cardinality enforcement
Support for methods (custom action) is currently being added to the schema.
The schema can be edited with an editor, or directly via REST by modifying the JSON representation of the data model:
{
"query_time": "0.001618446",
"result_count": 4,
"result": [
{
"name": "Whisky",
"extendsClass": null,
"relatedTo": [
{
"id": "96d05ddc9f0b42e2801f06afb1374458",
"name": "Flavour"
},
{
"id": "28f85dca915245afa3782354ea824130",
"name": "Location"
}
],
"relatedFrom": [],
"id": "df9f9431ed304b0494da84ef63f5f2d8",
"type": "SchemaNode",
"_name": "String"
},
{
"name": "Flavour",
...
},
{
"name": "Location",
...
},
{
"name": "Region",
...
}
],
"serialization_time": "0.000829985"
}
{
"query_time": "0.001466743",
"result_count": 3,
"result": [
{
"name": null,
"sourceId": "28f85dca915245afa3782354ea824130",
"targetId": "e4139c5db45a4c1cbfe5e358a84b11ed",
"sourceMultiplicity": null,
"targetMultiplicity": "1",
"sourceNotion": null,
"targetNotion": null,
"relationshipType": "LOCATED_IN",
"sourceJsonName": null,
"targetJsonName": null,
"id": "d43902ad7348498cbdebcd92135926ea",
"type": "SchemaRelationship",
"relType": "IS_RELATED_TO"
},
{
"name": null,
"sourceId": "df9f9431ed304b0494da84ef63f5f2d8",
"targetId": "96d05ddc9f0b42e2801f06afb1374458",
"sourceMultiplicity": null,
"targetMultiplicity": "1",
"sourceNotion": null,
"targetNotion": null,
"relationshipType": "HAS_FLAVOURS",
"sourceJsonName": null,
"targetJsonName": null,
"id": "bc9a6308d1fd4bfdb64caa355444299d",
"type": "SchemaRelationship",
"relType": "IS_RELATED_TO"
},
{
"name": null,
"sourceId": "df9f9431ed304b0494da84ef63f5f2d8",
"targetId": "28f85dca915245afa3782354ea824130",
"sourceMultiplicity": null,
"targetMultiplicity": "1",
"sourceNotion": null,
"targetNotion": null,
"relationshipType": "PRODUCED_IN",
"sourceJsonName": null,
"targetJsonName": null,
"id": "a55fb5c3cc29448e99a538ef209b8421",
"type": "SchemaRelationship",
"relType": "IS_RELATED_TO"
}
],
"serialization_time": "0.000403616"
}
You can access nodes and relationships stored in Neo4j as JSON objects through a RESTful API which is dynamically configured based on the in-graph schema.
$ curl try.structr.org:8082/structr/rest/whiskies?name=Ardbeg
{
"query_time": "0.001267211",
"result_count": 1,
"result": [
{
"flavour": {
"name": "J",
"description": "Full-Bodied, Dry, Pungent, Peaty and Medicinal, with Spicy, Feinty Notes.",
"id": "626ba892263b45e29d71f51889839ebc",
"type": "Flavour"
},
"location": {
"region": {
"name": "Islay",
"id": "4c7dd3fe2779492e85bdfe7323cd78ee",
"type": "Region"
},
"whiskies": [
...
],
"name": "Port Ellen",
"latitude": null,
"longitude": null,
"altitude": null,
"id": "47f90d67e1954cc584c868e7337b6cbb",
"type": "Location"
},
"name": "Ardbeg",
"id": "2db6b3b41b70439dac002ba2294dc5e7",
"type": "Whisky"
}
],
"serialization_time": "0.010824154"
}
In the UI, there's also a data editing (CRUD) tool, as well as CMS components that support building web applications on Neo4j.
Disclaimer: I'm a developer of Structr and founder of the project.
No, there's no standard way to do this. Indeed, even if there were, keep in mind that the only constraints that neo4j currently supports are uniqueness constraints.
Take for example some sample rules:
All nodes labeled :Person must have non-empty properties fname and lname
All nodes labeled :Person must have >= 1 outbound relationship of type :works_for
The trouble with present-day neo4j is that even if you had a (standardized) schema language that could express these things, there wouldn't be a way for the db engine itself to actually enforce those constraints.
So the simple answer is no, there's no standard way of doing that right now.
A few tricks I've seen people use to simulate the same:
Assemble a list of "test suite" Cypher queries with known results. Query for things you know shouldn't be there; non-empty result sets are a sign of a problem/integrity violation. Query for things you know should be there; empty result sets are a problem. (See the sketch after this list.)
Application-level control -- via some layer like spring-data or similar, control who can talk to the database. This essentially moves your data integrity/testing problem up into the app, away from the database.
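For instance, here's a minimal sketch of the "test suite" approach for the two sample rules above, in Cypher (any rows returned indicate a violation):
// Rule 1: every :Person must have non-empty fname and lname
MATCH (p:Person)
WHERE p.fname IS NULL OR p.fname = '' OR p.lname IS NULL OR p.lname = ''
RETURN p
// Rule 2: every :Person must have at least one outbound :works_for relationship
MATCH (p:Person)
WHERE NOT (p)-[:works_for]->()
RETURN p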
It's a common (and IMHO annoying) aspect of many NoSQL solutions (not specifically neo4j) that because of their schema-weakness, they tend to force validation up the tech stack into the application. Doing these things in the application tends to be harder and more error-prone. SQL databases permit you to implement all sorts of schema constraints, triggers, etc -- specifically to make it really damn hard to put the wrong data into the database. The NoSQL databases typically either aren't there yet, or don't do this as a design decision. There are indeed flexibility/performance tradeoffs. Databases can insert faster and be more flexible to adapt quickly if they aren't burdened with checking each atom of data against a long list of schema rules.
EDIT: Two relevant resources: the metagraphs proposal talks about how you could represent the schema as a graph, and neoprofiler is an application that attempts to infer the actual structure of a neo4j database and show you its "profile".
With time, I think it's reasonable to hope that neo4j will include basic integrity features like requiring certain labels to have certain properties (the example above), restricting the data types of certain properties (lname must always be a String, never an integer), and so on. The graph data model is a bit wild and woolly though (in the computational-complexity sense), and there are some constraints on graphs that people desperately want but will probably never get. An example would be the constraint that a graph can't have cycles in it; enforcing that on the creation of every relationship would be very computationally intensive.