As a CosmosDB (SQL API) user, I would like to index every property inside an object except those that are themselves objects or arrays.
By default, the Cosmos index path /* indexes every property. Our data set is getting extremely large (and expensive), so this strategy is no longer optimal. We store our metadata at the root and our customer data wrapped inside an object property named data.
Our platform restricts queries on the data path to value-type properties, so indexing the objects and arrays nested under data only slows down writes and costs RUs to store index entries that are never used.
I have tried several iterations of index policies but cannot find one that fits. Example:
{
"partitionKey": "f402a704-19bb-4f4d-93e6-801c50280cf6",
"id": "4a7a11e5-00b5-4def-8e80-132a8c083f24",
"data": {
"country": "Belgium",
"employee": 250,
"teammates": [
{ "name": "Jake", "id": 123 ...},
{ "name": "kyle", "id": 3252352 ...}
],
"user": {
"name": "Brian",
"addresses": [{ "city": "Moscow" ...}, { "city": "Moscow" ...}]
}
}
}
In this case I want to index only the root properties plus /data/employee and /data/country.
Policies like /data/* will not work because they would also index /data/teammates/name and so on.
/data/? assumes data is a value type, which it never will be, so this doesn't work.
/data/, /data/*/? and /data/*? are not accepted by Cosmos as valid policies.
Additionally, I can't simply exclude /data/teammates/ and /data/user/ because what is inside data is completely dynamic; that might cover this use case, but there are several hundred thousand others that it would not.
I have tried many iterations, but none of the options work, for various reasons. Is there a way to support what I am trying to do?
This indexing policy will index the properties you are asking for.
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/partitionKey/?"
},
{
"path": "/data/country/?"
},
{
"path": "/data/employee/?"
}
],
"excludedPaths": [
{
"path": "/*"
}
]
}
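Because the contents of data are dynamic, a policy like this would have to be generated from the known keys rather than written by hand. The following Python sketch shows one way to derive the included paths from a sample document. `scalar_paths` and `build_policy` are hypothetical helper names, and the approach assumes the scalar keys under data can be enumerated ahead of time, which the question does not confirm.

```python
import json

def scalar_paths(doc, parent="data"):
    """Return index paths for the value-type properties directly under doc[parent]."""
    paths = []
    for key, value in doc.get(parent, {}).items():
        # Skip nested objects and arrays; keep strings, numbers, booleans and null.
        if not isinstance(value, (dict, list)):
            paths.append(f"/{parent}/{key}/?")
    return sorted(paths)

def build_policy(doc, root_paths=("/partitionKey/?",)):
    """Assemble a policy in the same shape as the one above (hypothetical helper)."""
    included = [{"path": p} for p in list(root_paths) + scalar_paths(doc)]
    return {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": included,
        "excludedPaths": [{"path": "/*"}],
    }

# Sample document modelled on the one in the question (elided fields left out).
sample = {
    "partitionKey": "f402a704-19bb-4f4d-93e6-801c50280cf6",
    "id": "4a7a11e5-00b5-4def-8e80-132a8c083f24",
    "data": {
        "country": "Belgium",
        "employee": 250,
        "teammates": [{"name": "Jake", "id": 123}],
        "user": {"name": "Brian", "addresses": [{"city": "Moscow"}]},
    },
}

print(json.dumps(build_policy(sample), indent=2))
```

The sketch does not quote keys that contain special characters, so such keys would need extra handling before being used as paths.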
Suppose I have the following data in my container:
{
"id": "1DBF704E-1623-4844-DC86-EFA729A5C048",
"firstName": "Wylie",
"lastName": "Ramsey",
"country": "AZ",
"city": "Tucson"
}
I use the field "id" as the item id and the field "country" as the partition key. When I query on a specific partition key:
SELECT * FROM c WHERE c.country = "AZ"
(get all the people in "AZ")
Should I add "country" as an index, or do I get it by default because I declared "country" as my partition key?
Is there a difference when using the SDK (that is, adding the new PartitionKey("AZ") option and then sending the query as shown above)?
I created a collection with 50,000 records and disabled indexing on all properties.
Indexing policy:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [], // Included nothing
"excludedPaths": [
{
"path": "/\"_etag\"/?"
},
{
"path": "/*" // Exclude all paths
}
]
}
Querying by id cost 2.85 RUs.
Querying by PartitionKey cost 580 RUs.
Indexing policy with PartitionKey (country) added:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/country/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
},
{
"path": "/*" // Exclude all paths
}
]
}
Adding an index on the PartitionKey brought it down to 2.83 RUs.
So the answer is yes: if you have disabled the default indexing policy and need to search by partition key, you should add an index on it.
In my opinion, it's good practice to query with the partition key in the Cosmos DB SQL API; here's the official doc related to it.
By the way, the Cosmos DB SQL API indexes all properties by default. If you'd like to override the default setting and customise the indexing policy, this doc may provide more details.
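On the SDK part of the question, a query can be scoped to a single partition in code. The sketch below assumes the azure-cosmos Python SDK, whose container object exposes `query_items` with a `partition_key` argument. The helper name `people_in_country` is hypothetical, and the container is passed in so the function makes no assumption about how the client was created. Whether scoping the query this way changes the RU charge when /country is not indexed is not something the measurements above establish.

```python
def people_in_country(container, country):
    """Query one logical partition.

    `container` is assumed to behave like an azure-cosmos ContainerProxy.
    """
    items = container.query_items(
        query="SELECT * FROM c WHERE c.country = @country",
        parameters=[{"name": "@country", "value": country}],
        partition_key=country,  # route the query to a single partition
    )
    return list(items)
```

Calling `people_in_country(container, "AZ")` would correspond to the query in the question, with the country value passed as a parameter instead of being written into the query text.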
Assume we have the following data structure:
"data": [
{
"type": "node--press",
"id": "f04eab99-9174-4d00-bbbe-cdf45056660e",
"attributes": {
"nid": 130,
"uuid": "f04eab99-9174-4d00-bbbe-cdf45056660e",
"title": "TITLE OF NODE",
"revision_translation_affected": true,
"path": {
"alias": "/press/title-of-node",
"pid": 428,
"langcode": "es"
}
...
}
The data returned is compliant with JSON API standards, and I have no problem retrieving and processing it, except for the fact that I need to be able to filter the nodes returned by the path pid.
How can I filter my data by path.pid?
I have tried:
- node-press?filter[path][pid]=428
- node-press?filter[path][pid][value]=428
to no avail
It's not well defined in the filters section of the specification, but other parameters such as include describe accessing nested keys with dot-notation. You could try ?filter[path.pid]=428 and parse the filter that way.
"field_country": {
"data": {
"type": "taxonomy_term--country",
"id": "818f11ab-dd9d-406b-b1ca-f79491eedd73"
}
}
The structure above can be filtered with ?filter[field_country.id]=818f11ab-dd9d-406b-b1ca-f79491eedd73
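For building such URLs on the client, a minimal Python sketch follows. `build_filter_url` is a hypothetical helper. It only formats the query string; whether the server accepts dot-notation in filters depends on the implementation, as noted above.

```python
from urllib.parse import urlencode

def build_filter_url(resource, field_path, value):
    """Build '<resource>?filter[<field_path>]=<value>' with dot-notation for nested keys."""
    # Keep the square brackets literal so the URL matches the form shown above.
    query = urlencode({f"filter[{field_path}]": value}, safe="[]")
    return f"{resource}?{query}"

print(build_filter_url("node-press", "path.pid", 428))
print(build_filter_url("node-press", "field_country.id",
                       "818f11ab-dd9d-406b-b1ca-f79491eedd73"))
```

Values containing reserved characters such as spaces or ampersands are still percent-encoded by `urlencode`; only the brackets are left as-is.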
The Firebase database uses a subset of JSON, so it seems natural to use JSON schema to describe the data model. This would make it possible to use tools that generate HTML forms or TypeScript models from the schema, or that generate random test data.
My question: How would one model key-value pairs in JSON schema, where the key is an id?
Example: (borrowed from firebase spec)
{
"users": {
"mchen": {
"name": "Mary Chen",
// index Mary's groups in her profile
"groups": {
// the value here doesn't matter, just that the key exists
"alpha": true,
"charlie": true
}
},
...
The group name here is used as a group id. In this reference (the groups object) as well as in the group object itself, the id is used as the property name.
The JSON schema for the above example is:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"users": {
"type": "object",
"properties": {
"mchen": {
"type": "object",
"properties": {
"name": {
"type": "string"
},
"groups": {
"type": "object",
"properties": {
"alpha": {
"type": "boolean"
},
"charlie": {
"type": "boolean"
}
}
}
}
}
}
}
}
}
What I would need for the example is something like the following, where NAME is a placeholder for the property name and NAME_TYPE defines its type.
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"users": {
"type": "object",
"properties": {
NAME: {
"type": "object",
NAME_TYPE: "string",
"properties": {
"name": {
"type": "string"
},
"groups": {
"type": "object",
"properties": {
NAME: {
NAME_TYPE: "string",
"type": "boolean"
}
}
}
}
}
}
}
}
}
(Maybe I am on the completely wrong path here or maybe JSON schema isn't able to model the required structure.)
Firebase does support arrays, but they are situational: use them only in specific cases and generally avoid them.
The Firebase structure you posted is very common, and it already contains key:value pairs, so the question is a tad unclear, but I'll give it a shot.
'groups' is the parent key, and its values are the child key:value pairs group1:value, group2:value.
The group1, group2 keys you listed are essentially the same as the ids listed in the first example, except that it's not an array. That is, arrays have sequential, hard-coded indexes (0th, 1st, 2nd, etc.), whereas keys in Firebase are open-ended and can generally be set to any alphanumeric value. They are used more to refer to a specific node than to enforce a particular order (I'm speaking generally here).
In the Firebase structure, those keys could be id0, id1, id2... or a,b,c... or a timestamp... or an auto-generated Firebase id (childByAutoId) that would also make them 'sequential'.
However, you can get into trouble assigning your own keys such as id0, id1, etc.
id0
id1
id2
.
id9
id10
id11
The reality here is that the actual order will be
id0
id1
id10
id11
id2
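The ordering above is what plain string comparison produces. A small Python sketch (not Firebase code) illustrates it, along with zero-padding as one possible workaround. The padding is an assumption added for illustration, not something this answer proposes.

```python
keys = [f"id{n}" for n in range(12)]          # id0 ... id11

# Plain string comparison puts "id10" and "id11" before "id2".
print(sorted(keys)[:5])

# Assumption: zero-padding the numeric part to a fixed width restores numeric order.
padded = [f"id{n:02d}" for n in range(12)]    # id00 ... id11
print(sorted(padded) == padded)
```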
The 'key' is that if you are using the keys to read data in sequentially, set them up as such. You may also want to consider generating your keys with childByAutoId (see docs for language specifics) and orderBy one of the child values such as a timestamp or index.
'groups': {
'auto-generated id': {
'name': 'alpha',
'index': 0,
'timestamp': '20160814074146'
...
},
'auto-generated id': {
'name': 'charlie',
'index': 1,
'timestamp': '20160814073600'
...
},
...
}
In the above case, I can orderBy name, index or timestamp.
Ordering by name or index reads the nodes in the order they are listed; if we order by timestamp, the charlie node is loaded first. Leveraging the child values to orderBy is very flexible.
Also, you can limit the set of data you are loading with startingAt and endingAt. For example, suppose you want to load nodes starting at node 10 and ending at node 14. That is easily done with non-array JSON data, but not with an array, because the entire array must be read in.
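On the schema half of the question, one candidate that the thread does not mention (so treat it as an assumption to verify against your tooling) is JSON Schema's additionalProperties keyword, which in draft-04 may hold a schema that applies to every property not listed by name. The Python sketch below expresses the users/groups shape that way. The tiny `conforms` checker is a hypothetical stand-in that understands only the few keywords used here; it is not a real validator.

```python
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "users": {
            "type": "object",
            # Any user id is allowed as a key; every value must match this schema.
            "additionalProperties": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "groups": {
                        "type": "object",
                        # Any group id is allowed as a key; values are booleans.
                        "additionalProperties": {"type": "boolean"},
                    },
                },
            },
        }
    },
}

TYPES = {"object": dict, "string": str, "boolean": bool}

def conforms(value, sch):
    """Check only 'type', 'properties' and schema-valued 'additionalProperties'."""
    if "type" in sch and not isinstance(value, TYPES[sch["type"]]):
        return False
    if not isinstance(value, dict):
        return True
    named = sch.get("properties", {})
    extra = sch.get("additionalProperties", {})
    for key, child in value.items():
        # Named properties use their own schema; all other keys use the shared one.
        if not conforms(child, named.get(key, extra)):
            return False
    return True

data = {"users": {"mchen": {"name": "Mary Chen",
                            "groups": {"alpha": True, "charlie": True}}}}
print(conforms(data, schema))
```

The related patternProperties keyword could be used instead if the ids must also match a particular format.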
I'm using the following query:
g.V(741440).outE('Notification').order().by('PostedDateLong', decr).range(0,1).as('notificationInfo').match(
__.as('notificationInfo').inV().as('postInfo'),
).select('notificationInfo','postInfo')
It gives the following result:
{
"requestId": "9846447c-4217-4103-ac2e-de3536a3c62a",
"status": {
"message": "",
"code": 200,
"attributes": { }
},
"result": {
"data": [
{
"notificationInfo": {
"id": "c0zs-fw3k-347p-g2g0",
"label": "Notification",
"type": "edge",
"inVLabel": "Comment",
"outVLabel": "User",
"inV": 749664,
"outV": 741440,
"properties": {
"ParentPostId": "823488",
"PostedDate": "2016-05-26T02:35:52.3889982Z",
"PostedDateLong": 635998269523889982,
"Type": "CommentedOnPostNotification",
"NotificationInitiatedByVertexId": "1540312"
}
},
"postInfo": {
"id": 749664,
"label": "Comment",
"type": "vertex",
"properties": {
"PostImage": [
{
"id": "amto-g2g0-2wat",
"value": ""
}
],
"PostedByUser": [
{
"id": "am18-g2g0-2txh",
"value": "orbitpage#gmail.com"
}
],
"PostedTime": [
{
"id": "amfg-g2g0-2upx",
"value": "2016-05-26T02:35:39.1489483Z"
}
],
"PostMessage": [
{
"id": "aln0-g2g0-2t51",
"value": "hi"
}
]
}
}
}
],
"meta": { }
}
}
I want to get the information of the vertex referenced by "NotificationInitiatedByVertexId" (an edge property) in the response as well.
For that I tried the following query:
g.V(741440).outE('Notification').order().by('PostedDateLong', decr).range(0,2).as('notificationInfo').match(
__.as('notificationInfo').inV().as('postInfo'),
g.V(1540312).next().as('notificationByUser')
).select('notificationInfo','postInfo','notificationByUser')
Note: I tried it directly with the vertex id in the subquery, as I didn't know how to get the value from the edge property dynamically in the query itself.
It gives an error. I have tried a lot but am not able to find a solution.
I'm assuming that you are storing a Titan generated identifier in that edge property called NotificationInitiatedByVertexId. If so, please consider the following even though this first part doesn't really answer your question. I don't think you should store a vertex identifier on the edge. Your graph model should explicitly track the relationship of NotificationInitiatedBy with an edge and by storing the identifier of the vertex on the edge itself you are bypassing that. Also, if you ever have to migrate your data in some way, the ids won't be preserved (Titan will generate new ones) and trying to sort that out will be a mess.
Even if that is not a Titan-generated identifier but a logical one you created, I still think I would look to adjust your graph schema and promote that Notification to a vertex. Then your Gremlin traversals would flow more easily.
Now, assuming you don't change that, then I don't see a reason to not just issue two queries in the same request and then combine the results to one data structure. You just need to do a lookup with the vertex id which is going to be pretty fast and inexpensive:
edgeStuff = g.V(741440).outE('Notification').
order().by('PostedDateLong', decr).range(0,1).as('notificationInfo').
... // whatever logic you have
select('notificationInfo','postInfo').next()
vertexStuff = g.V(edgeStuff.get('notificationInfo').value('NotificationInitiatedByVertexId')).next()
[notificationInitiatedBy: vertexStuff, notification: edgeStuff]
Is it possible to avoid the entity overhead when fetching data from Cloud Datastore? Ideally I'd love to have SQL-like results: simple arrays/objects with key:value pairs instead of
{
"batch": {
"entityResultType": "FULL",
"entityResults": [
{
"entity": {
"key": {
"partitionId": {
"datasetId": "e~******************"
},
"path": [
{
"kind": "Answer",
"id": "*******************"
}
]
},
"properties": {
"value": {
"stringValue": "12",
"indexed": false
},
"question": {
"integerValue": "120"
}
}
}
}
],
"endCursor": "********************************************",
"moreResults": "MORE_RESULTS_AFTER_LIMIT",
"skippedResults": null
}
}
which has just too much overhead for me (I plan on running queries over thousands of entities). I just couldn't find in the docs if that's possible or not.
You can use projection queries to query for specific entity properties.
In your example, you could use a projection query to avoid returning the entity's key in the query results.
The other fields are part of the API's data representation and are required to be returned; however, many of them (e.g. endCursor) are only returned once per query batch, so the overhead is low when there are many results.
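If the remaining per-entity wrapping is still inconvenient, it can be flattened on the client after the fetch. This Python sketch relies only on the response shape shown in the question. `flatten_batch` is a hypothetical helper, and it handles just the two value types that appear there.

```python
def flatten_batch(response):
    """Turn the 'entityResults' of a response shaped like the one above into plain dicts."""
    rows = []
    for result in response["batch"]["entityResults"]:
        row = {}
        for name, wrapped in result["entity"]["properties"].items():
            if "integerValue" in wrapped:
                # Integers arrive as strings in the JSON shown above.
                row[name] = int(wrapped["integerValue"])
            elif "stringValue" in wrapped:
                row[name] = wrapped["stringValue"]
            # Other value types are not shown in the question and are skipped here.
        rows.append(row)
    return rows

# Trimmed copy of the response from the question (key and cursor fields omitted).
response = {
    "batch": {
        "entityResults": [
            {"entity": {"properties": {
                "value": {"stringValue": "12", "indexed": False},
                "question": {"integerValue": "120"},
            }}}
        ]
    }
}
print(flatten_batch(response))
```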