Select related documents in one query from Cosmos DB

Consider the following data structure:
[
  {
    "id": "parent",
    "links": [{
      "href": "child"
    }]
  },
  {
    "id": "child",
    "links": [{
      "href": "grand_child"
    }]
  },
  {
    "id": "grand_child",
    "links": []
  },
  {
    "id": "some-other-item",
    "links": [{
      "href": "anywhere"
    }]
  }
]
Is there a way to select all the documents related to the "parent" in a single query using Cosmos DB SQL?
The result must include parent, child and grand_child documents.

Assuming the JSON array shown in the OP is the array of documents in your collection:
No, this cannot be done with the SQL API querying tools. Cosmos DB does not support joining different documents, let alone recursively; joins in Cosmos DB SQL are self-joins within a single document.
But if you definitely need this to happen server-side, you could script the recursive gathering algorithm in a stored procedure (user-defined functions cannot read other documents). Rather ugly, though, imho.
I would suggest implementing this on the client side instead: issue a single query per depth level and merge the results. This also keeps a nice separation of logic and data, and performance should be acceptable as long as you fetch all newly discovered links together in one query per level to avoid a 1+N query pattern.
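A minimal client-side sketch of that per-depth approach with the .NET SDK (the document classes, container and method names are assumptions based on the JSON above); each round trip fetches every newly discovered link in a single parameterized query:
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json;

public class LinkedDoc
{
    [JsonProperty("id")] public string Id { get; set; }
    [JsonProperty("links")] public List<Link> Links { get; set; } = new List<Link>();
}

public class Link
{
    [JsonProperty("href")] public string Href { get; set; }
}

public static class RelatedDocs
{
    // Breadth-first gathering: one query per depth level, merged client-side.
    public static async Task<List<LinkedDoc>> GetRelatedAsync(Container container, string rootId)
    {
        var results = new List<LinkedDoc>();
        var seen = new HashSet<string>();
        var frontier = new List<string> { rootId };

        while (frontier.Count > 0)
        {
            // Fetch the whole current level in a single query to avoid 1+N round trips.
            var query = new QueryDefinition("SELECT * FROM c WHERE ARRAY_CONTAINS(@ids, c.id)")
                .WithParameter("@ids", frontier);

            var next = new List<string>();
            var iterator = container.GetItemQueryIterator<LinkedDoc>(query);
            while (iterator.HasMoreResults)
            {
                foreach (var doc in await iterator.ReadNextAsync())
                {
                    if (!seen.Add(doc.Id)) continue;
                    results.Add(doc);
                    next.AddRange(doc.Links.Select(l => l.Href));
                }
            }
            frontier = next.Where(id => !seen.Contains(id)).Distinct().ToList();
        }
        return results;
    }
}
Against the example data, GetRelatedAsync(container, "parent") would return the parent, child and grand_child documents in three round trips (one per depth level), while some-other-item is never fetched.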
If your actual graph-traversal needs get much more complex, then you'd have to migrate your documents (or just the links) to something capable of querying graphs (e.g. the Gremlin API).

Related

Are Cosmos DB indexes range by default?

I have a Cosmos DB of around 4 GB. When it was small it could perform date filters relatively quickly with a low RU charge (roughly 3-15), but as the database has grown to millions of records it has slowed right down and the RU charge is now in the thousands.
Looking at the documentation on working with dates (https://learn.microsoft.com/en-us/azure/cosmos-db/working-with-dates), it says:
To execute these queries efficiently, you must configure your collection for Range indexing on strings.
However, reading the linked indexing policy doc (https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy), it sounds like every field has a range index created by default:
The default indexing policy for newly created containers indexes every property of every item, enforcing range indexes for any string or number, and spatial indexes for any GeoJSON object of type Point
Do I need to configure the indexes to anything other than the default policy below?
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*"
    }
  ],
  "excludedPaths": [
    {
      "path": "/\"_etag\"/?"
    }
  ]
}
When it comes to indexing, you can see the best practices here:
You should exclude unused paths from indexing for faster writes.
You should leverage IndexingPolicy with IncludedPaths and ExcludedPaths
For example:
var collection = new DocumentCollection { Id = "excludedPathCollection" };
collection.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/*" });
collection.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/nonIndexedContent/*" });
So if you are concerned with query costs, tweaking the indexing policy won't really help: indexing mainly affects write costs, and the default policy already range-indexes every path. If you are seeing thousands of RUs per request, I suspect you are either running cross-partition queries or have no meaningful partition key at all (or a single partition for everything). To cut these costs you need to either stop using cross-partition queries or, if that is not possible, rearchitect your data so that you no longer need them.
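As a rough illustration with the .NET SDK v3 (property names, the partition key value and the container are all assumptions), scoping a date-filtered query to one logical partition keeps it from fanning out across every partition:
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class PartitionScopedQuery
{
    // Restrict the query to a single logical partition so it is not fanned out
    // across all partitions. Property names and values are purely illustrative.
    public static async Task RunAsync(Container container)
    {
        var query = new QueryDefinition("SELECT * FROM c WHERE c.createdDate >= @since")
            .WithParameter("@since", "2019-01-01T00:00:00Z");

        var options = new QueryRequestOptions
        {
            PartitionKey = new PartitionKey("customer-123")  // assumed partition key value
        };

        var iterator = container.GetItemQueryIterator<dynamic>(query, requestOptions: options);
        while (iterator.HasMoreResults)
        {
            foreach (var item in await iterator.ReadNextAsync())
            {
                Console.WriteLine(item);
            }
        }
    }
}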
I think Range is the default index kind in Cosmos DB.

What do I make my CosmosDB partition key when my JSON top level token is variable?

I am confused about what to make my Cosmos DB partition key when my JSON looks like this:
{
  "AE": [
    {
      "storeCode": "XXX",
      "storeClass": "YYY"
    }
  ],
  "AT": [
    {
      "storeCode": "ZZZ",
      "storeClass": "XYZ"
    }
  ]
}
Normally the top level would be a property like "country": "AT" and so on, and I would make the partition key /country; but in this case there is nothing at the top level to use, so what do I do?
The JSON comes from a third party, so I don't have the option to change it at the source.
Since I did not find any statement about using a sub-array as the partition key in the official documentation, I can only provide a similar thread for your reference: CosmosDB - Correct Partition Key
Here is an explanation by @Mikhail:
Partition Key has to be a single value for each document, it can't be a field in sub-array. Partition Key is used to determine which database node will host your document, and it wouldn't be possible if you specified multiple values, of course.
If your single document contains data from multiple entities, and you will query those entities separately, it might make sense to split your documents per entity. If all those "radars" are related to some higher level entity, use that entity ID as partition key.
For rigor, I would suggest contacting the Azure Cosmos DB team to check whether this feature is simply not supported so far and whether it will be implemented in the future.
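If reshaping the feed at ingest time is an option, a minimal sketch along the lines of Mikhail's "split your documents per entity" suggestion could look like this (the container, id scheme and /country partition key are assumptions for illustration):
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json.Linq;

public static class StoreImporter
{
    // Split the third-party payload into one document per store, each carrying an
    // explicit "country" property that can then be used as the partition key (/country).
    public static async Task ImportAsync(Container container, string thirdPartyJson)
    {
        var payload = JObject.Parse(thirdPartyJson);   // the JSON shown in the question

        foreach (var country in payload.Properties())  // "AE", "AT", ...
        {
            foreach (var store in (JArray)country.Value)
            {
                var doc = new
                {
                    id = $"{country.Name}-{(string)store["storeCode"]}",
                    country = country.Name,                 // partition key value
                    storeCode = (string)store["storeCode"],
                    storeClass = (string)store["storeClass"]
                };
                await container.UpsertItemAsync(doc, new PartitionKey(doc.country));
            }
        }
    }
}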

Geo shard key + Cosmos DB

I am trying to create a collection in Cosmos DB and I don't know how to create a good shard key.
I had something like this in mind, but it does not accept my shard key:
{
  "shard_key": ["50.836421", "4.355267"],
  "position": {
    "type": "Point",
    "coordinates": [50.836421, 4.355267]
  }
}
Does someone have experience with this?
You could store shard_key as the single string "[\"50.836421\", \"4.355267\"]"; that is accepted by the Cosmos DB Mongo API (see the sketch after the quote below).
Based on the book and the linked documentation, an array shard key is not supported by MongoDB:
Shard keys cannot be arrays. sh.shardCollection() will fail if any key has an array value and inserting an array into that field is not allowed. Once inserted, a document's shard key value cannot be modified. To change a document's shard key, you must remove the document, change the key, and reinsert it. Thus, you should choose a field that is unchangeable or changes infrequently.
Hope it helps you.
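A minimal sketch of that workaround using the .NET MongoDB driver against the Cosmos DB Mongo API (the connection string, database and collection names are placeholders):
using MongoDB.Bson;
using MongoDB.Driver;

public static class GeoInsertSketch
{
    // Store the coordinates serialized as a single string in shard_key, because an
    // array value is rejected as a shard key. Names and values are illustrative.
    public static void InsertPlace(string connectionString)
    {
        var client = new MongoClient(connectionString);
        var collection = client.GetDatabase("geo").GetCollection<BsonDocument>("places");

        var doc = new BsonDocument
        {
            { "shard_key", "[\"50.836421\", \"4.355267\"]" },   // single string, not an array
            { "position", new BsonDocument
                {
                    { "type", "Point" },
                    { "coordinates", new BsonArray { 50.836421, 4.355267 } }
                }
            }
        };

        collection.InsertOne(doc);
    }
}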

Sliding window with cosmos db temporarily loses records

I have what I figure is a pretty standard setup:
Event Hubs -> Stream Analytics -> Cosmos DB (SQL API)
As events come in, I have a Stream Analytics query which adds up the values using a sliding window of 60 minutes and writes to Cosmos DB. I'd expect at most 5,000 messages/hr, which would update a particular record in Cosmos DB. My problem is that records in Cosmos DB seem to be deleted and rewritten instead of being updated. There is quite a long period (30+ seconds) when no record is visible even though one should be. Eventually, a record shows up with updated values.
I can only think that as events enter and exit the sliding window they're causing updates which rewrite the record rather than updating it and that I'm querying the database during that window. This is obviously a pretty irritating issue and I'm sure I must be doing something wrong to have this behaviour. Is there some setting or some diagnostics which might shed some light on the issue?
I've tried playing with the consistency levels in Cosmos DB but they don't seem to have much impact on what I'm seeing. I'm running my Cosmos DB collections at 5,000 RU but I'm only pumping through about 10 RUs' worth of data, so there are no scaling issues.
My query looks like this:
SELECT
concat(cast(groupid as nvarchar(max)), ':', cast(IPV4_SRC_ADDR as nvarchar(max))) as partitionkey,
concat(cast(L7_PROTO_NAME as nvarchar(max)), ':lastHour') as Id,
System.timestamp as outputtime,
groupid,
IPV4_SRC_ADDR,
L7_PROTO_NAME,
sum(IN_BYTES + OUT_BYTES) as TOTAL_BYTES
INTO
layer7sql
FROM
source TIMESTAMP BY [timestamp]
Group by groupid,
IPV4_SRC_ADDR,
L7_PROTO_NAME,
slidingwindow(Minute,60)
I'm happy to provide any other debugging information which could be useful in understanding my issue.
Additional
The _rid of the records remains the same.
When I query, I get back:
{
  "_rid": "glpSAPQhPgA=",
  "Documents": [
    {
      "partitionkey": "2587:10.1.2.194",
      "id": "SSL:lastHour",
      "outputtime": "2017-12-23T06:28:40.960916Z",
      "groupid": "2587",
      "ipv4_src_addr": "10.1.2.194",
      "l7_proto_name": "SSL",
      "total_bytes": 322,
      "_rid": "glpSAPQhPgAMAAAAAAAAAA==",
      "_self": "dbs/glpSAA==/colls/glpSAPQhPgA=/docs/glpSAPQhPgAMAAAAAAAAAA==/",
      "_etag": "\"02001fd6-0000-0000-0000-5a3df7cd0000\"",
      "_attachments": "attachments/",
      "_ts": 1514010573
    }
  ],
  "_count": 1
}
The query I'm using via the REST API is:
{
  "query": "select * from root r where r.ipv4_src_addr='10.1.2.194' and r.id='SSL:lastHour' order by r.total_bytes desc",
  "parameters": []
}
I am specifying a partition key there, including the field IPV4_SRC_ADDR, but I don't think this is actually a partitioned collection. As an experiment I updated my query to:
SELECT
concat(cast(L7_PROTO_NAME as nvarchar(max)), ':lastHour') as Id,
System.timestamp as outputtime,
groupid,
L7_PROTO_NAME,
sum(IN_BYTES + OUT_BYTES) as TOTAL_BYTES
INTO
layer7sql
FROM
NProbe TIMESTAMP BY [timestamp]
Group by groupid,
L7_PROTO_NAME,
slidingwindow(Minute,60)
So far that appears to be working better. I wonder if maybe some conflicts were taking a while to resolve, and during that window the records weren't visible.

How to perform join operation similar to SQL join in Azure DocumentDB

I have a collection in Azure DocumentDB wherein documents are clustered into 3 sets using a JSON property called clusterName on each document. The 3 clusters of documents are templated somewhat like this:
{
  "clusterName": "CustomerInformation",
  "id": "CustInfo1001",
  "custName": "XXXX"
},
{
  "clusterName": "ZoneInformation",
  "id": "ZoneInfo5005",
  "zoneName": "YYYY"
},
{
  "clusterName": "CustomerZoneAssociation",
  "id": "CustZoneAss9009",
  "custId": "CustInfo1001",
  "zoneId": "ZoneInfo5005"
}
As you can see, the CustomerZoneAssociation document links the CustomerInformation and ZoneInformation documents by their ids. I need help querying information from the CustomerInformation and ZoneInformation clusters using the ids stored in the CustomerZoneAssociation cluster. The result I am expecting from the query is:
{
  "clusterName": "CustomerZoneAssociation",
  "id": "CustZoneAss9009",
  "custId": "CustInfo1001",
  "custName": "XXXX",
  "zoneId": "ZoneInfo5005",
  "zoneName": "YYYY"
}
Please suggest a solution that would take only one trip to DocumentDB.
DocumentDB does not support inter-document JOINs... instead, the JOIN keyword is used to perform intra-document cross-products (to be used with nested arrays).
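To illustrate what that means, the sketch below uses a hypothetical document with a nested zones array (not the flat documents shown above); the JOIN expands the nested array of one document and cannot reach a second document:
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class IntraDocumentJoin
{
    // Hypothetical document shape: { "id": "CustInfo1001", "zones": [ { "zoneId": "ZoneInfo5005" } ] }
    // JOIN here is a cross-product within a single document, not a join across documents.
    public static async Task RunAsync(Container container)
    {
        var query = new QueryDefinition("SELECT c.id, z.zoneId FROM c JOIN z IN c.zones");

        var iterator = container.GetItemQueryIterator<dynamic>(query);
        while (iterator.HasMoreResults)
        {
            foreach (var row in await iterator.ReadNextAsync())
            {
                Console.WriteLine(row);   // one row per (document, nested zone) pair
            }
        }
    }
}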
I would recommend one of the following approaches:
Keep in mind that you do not have to normalize every entity as you would with a traditional RDBMS. It may be worth revisiting your data model and de-normalizing parts of your data where appropriate; a rough sketch of a de-normalized association document is shown after this list. Also keep in mind that de-normalizing comes with its own trade-offs (fanning out writes vs. issuing follow-up reads). Check out the following SO answer to read more on the trade-offs between normalizing and de-normalizing data.
Write a stored procedure to batch a sequence of operations within a single network request. Check out the following SO answer for a code sample on this approach.
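A rough sketch of the first (de-normalization) approach referenced above: the association document is written already carrying custName and zoneName, so the expected result can be served by a single read. The container name and the use of /clusterName as partition key are assumptions for illustration.
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class DenormalizedAssociation
{
    // Write the association de-normalized. The trade-off: if custName or zoneName
    // ever changes, every association document embedding it must be updated too
    // (fanning out writes instead of issuing follow-up reads).
    public static async Task UpsertAsync(Container container)
    {
        var association = new
        {
            clusterName = "CustomerZoneAssociation",
            id = "CustZoneAss9009",
            custId = "CustInfo1001",
            custName = "XXXX",      // copied from the CustomerInformation document
            zoneId = "ZoneInfo5005",
            zoneName = "YYYY"       // copied from the ZoneInformation document
        };

        await container.UpsertItemAsync(association, new PartitionKey(association.clusterName));
    }
}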
