Sliding window with Cosmos DB temporarily loses records - azure-cosmosdb

I have what I figure is a pretty standard setup:
Event Hubs -> Stream Analytics -> Cosmos DB (SQL API)
As events come in, I have a Stream Analytics query which adds up the values using a sliding window of 60 minutes and writes to Cosmos. I'd expect at most 5,000 messages/hr, which would update a particular record in Cosmos. My problem is that records in Cosmos seem to be deleted and rewritten instead of being updated. There is quite a long period (30 seconds+) when no record is visible even though one should be. Eventually, a record shows up with updated values.
I can only think that as events enter and exit the sliding window, they're causing updates which rewrite the record rather than updating it, and that I'm querying the database during that window. This is obviously a pretty irritating issue, and I'm sure I must be doing something wrong to get this behaviour. Is there some setting or some diagnostics which might shed light on the issue?
I've tried playing with the consistency levels in Cosmos, but they don't seem to have much impact on what I'm seeing. I'm running my Cosmos collections at 5,000 RU/s but I'm only pumping through about 10 RU/s' worth of data, so there are no scaling issues.
My query looks like:
SELECT
    CONCAT(CAST(groupid AS nvarchar(max)), ':', CAST(IPV4_SRC_ADDR AS nvarchar(max))) AS partitionkey,
    CONCAT(CAST(L7_PROTO_NAME AS nvarchar(max)), ':lastHour') AS Id,
    System.Timestamp AS outputtime,
    groupid,
    IPV4_SRC_ADDR,
    L7_PROTO_NAME,
    SUM(IN_BYTES + OUT_BYTES) AS TOTAL_BYTES
INTO
    layer7sql
FROM
    source TIMESTAMP BY [timestamp]
GROUP BY
    groupid,
    IPV4_SRC_ADDR,
    L7_PROTO_NAME,
    SlidingWindow(minute, 60)
I'm happy to provide any other debugging information which could be useful in understanding my issue.
Additional
The _rid of the records remains the same.
When I query, I get back:
{
    "_rid": "glpSAPQhPgA=",
    "Documents": [
        {
            "partitionkey": "2587:10.1.2.194",
            "id": "SSL:lastHour",
            "outputtime": "2017-12-23T06:28:40.960916Z",
            "groupid": "2587",
            "ipv4_src_addr": "10.1.2.194",
            "l7_proto_name": "SSL",
            "total_bytes": 322,
            "_rid": "glpSAPQhPgAMAAAAAAAAAA==",
            "_self": "dbs/glpSAA==/colls/glpSAPQhPgA=/docs/glpSAPQhPgAMAAAAAAAAAA==/",
            "_etag": "\"02001fd6-0000-0000-0000-5a3df7cd0000\"",
            "_attachments": "attachments/",
            "_ts": 1514010573
        }
    ],
    "_count": 1
}
and the query I'm using via the REST API is:
{
    "query": "select * from root r where r.ipv4_src_addr='10.1.2.194' and r.id='SSL:lastHour' order by r.total_bytes desc",
    "parameters": []
}
I am specifying a partition key there, including the field IPV4_SRC_ADDR, but I don't think this is actually a partitioned collection. As an experiment, I updated my query to:
SELECT
    CONCAT(CAST(L7_PROTO_NAME AS nvarchar(max)), ':lastHour') AS Id,
    System.Timestamp AS outputtime,
    groupid,
    L7_PROTO_NAME,
    SUM(IN_BYTES + OUT_BYTES) AS TOTAL_BYTES
INTO
    layer7sql
FROM
    NProbe TIMESTAMP BY [timestamp]
GROUP BY
    groupid,
    L7_PROTO_NAME,
    SlidingWindow(minute, 60)
So far that appears to be working better. I wonder if maybe some conflicts were taking a while to resolve, and during that window the records weren't visible.

Related

Is using "Current Date" a good partition key for data that will be queried by date and id?

I'm new to Azure Cosmos DB and I have this new project where I decided to give it a go.
My DB has only one collection where around 6,000 new items are added every day, and each looks like this:
{
    "Result": "Pass",
    "date": "23-Sep-2021",
    "id": "user1#example.com"
}
The date is the partition key, and it is the date on which the item was added to the collection. The same id can be added again every day, as follows:
{
    "Result": "Fail",
    "date": "24-Sep-2021",
    "id": "user1#example.com"
}
The application that uses this DB will query by id and date to retrieve the Result.
I read some Azure Cosmos DB documentation and found that selecting the partition key very carefully can improve the performance of the database and the RUs used per request.
I tried running this query and it consumed 2.9 RUs; the collection has about 23,000 items.
SELECT * FROM c
WHERE c.id = 'user1#example.com' AND c.date = '24-Sep-2021'
Here are my questions:
Is using date a good partition key for my scenario? Any room for improvement?
Will consumed RUs per request increase over time if the number of items in the collection increases?
Thanks.
For a write-heavy workload, using date as a partition key is a bad choice because you will always have a hot partition on the current date. However, if the amount of data being written is consistent and the write volume is low, it can work, and you will get good distribution of data across storage.
In read-heavy scenarios, date can be a good partition key if it is used to answer most of the queries in the app.
The value for id must be unique per partition key value, so for your data model to work you can only have one "id" value per day.
If this is the case for your app, then you can make one additional optimization and replace the query with a point read, ReadItemAsync(), which takes the partition key value and the id. This is the fastest and most efficient way to read data because it does not go through the query engine and reads directly from the backend data store. A point read of an item 1 KB or smaller always costs 1 RU.
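For illustration, here is a minimal sketch of such a point read using the JavaScript SDK (@azure/cosmos); ReadItemAsync() is its .NET counterpart. The endpoint, key, and database/container names below are placeholders, not anything from the question.
const { CosmosClient } = require("@azure/cosmos");

async function readResult() {
  // Placeholder connection details.
  const client = new CosmosClient({ endpoint: "https://<account>.documents.azure.com", key: "<key>" });
  const container = client.database("mydb").container("results");

  // Point read: the id plus the partition key value (the date).
  // This bypasses the query engine and reads straight from the backend store.
  const { resource, requestCharge } = await container
    .item("user1#example.com", "24-Sep-2021")
    .read();

  console.log(resource.Result, requestCharge); // e.g. "Fail", 1
}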

Select related documents in one query from Cosmos DB

Consider the following data structure:
[
{
"id": "parent",
"links":[{
"href": "child"
}]
},
{
"id": "child",
"links":[{
"href": "grand_child"
}]
},
{
"id": "grand_child",
"links":[]
},
{
"id": "some-other-item",
"links":[{
"href": "anywhere"
}]
}
]
Is there a way to select all the documents related to the "parent" in a single query using Cosmos DB SQL?
The result must include parent, child and grand_child documents.
Assuming the JSON array shown in the OP is the array of documents in your collection:
No, this cannot be done using the SQL API querying tools. Cosmos DB does not support joining different documents, even less so recursively. Joins in Cosmos DB SQL are self-joins.
But if you definitely need this to happen server-side, then you can implement the recursive gathering algorithm by scripting it in a stored procedure (user-defined functions cannot read other documents). Rather ugly though, imho.
I would suggest just implementing this on the client side: one query per depth, merging the results as you go (see the sketch below). This also keeps a nice separation of logic and data, and performance should be acceptable as long as you fetch all newly discovered links together in a single query per level, avoiding an N+1 pattern.
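A minimal sketch of that approach with the JavaScript SDK (@azure/cosmos), assuming container is an already-initialized Container and the documents have the id/links shape from the question:
async function fetchRelated(container, rootId) {
  const fetched = new Map(); // id -> document
  let ids = [rootId];
  while (ids.length > 0) {
    // One query per depth level: fetch every document at this level at once.
    const { resources } = await container.items
      .query({
        query: "SELECT * FROM c WHERE ARRAY_CONTAINS(@ids, c.id)",
        parameters: [{ name: "@ids", value: ids }],
      })
      .fetchAll();
    ids = [];
    for (const doc of resources) {
      fetched.set(doc.id, doc);
      // Queue links for the next level, skipping documents already fetched.
      for (const link of doc.links || []) {
        if (!fetched.has(link.href) && !ids.includes(link.href)) ids.push(link.href);
      }
    }
  }
  return [...fetched.values()]; // parent, child, grand_child for rootId "parent"
}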
If your actual graph-traversal needs get much more complex, then you'd have to migrate your documents (or just the links) to something capable of querying graphs (e.g. the Gremlin API).

Are Cosmos DB indexes range by default?

I have a Cosmos DB that's around 4 GB. When it was small it could perform date filters relatively quickly with a low RU charge (3-15ish), but as the DB has grown to millions of records it has slowed right down and the RU charge is up in the thousands.
Looking at the documentation for dates (https://learn.microsoft.com/en-us/azure/cosmos-db/working-with-dates), it says:
To execute these queries efficiently, you must configure your
collection for Range indexing on strings
However, reading the linked indexing policy doc (https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy), it sounds like by default every field has a range index created:
The default indexing policy for newly created containers indexes every property of every item, enforcing range indexes for any string or number, and spatial indexes for any GeoJSON object of type Point
Do I need to configure the indexes to anything other than the default?
{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        {
            "path": "/*"
        }
    ],
    "excludedPaths": [
        {
            "path": "/\"_etag\"/?"
        }
    ]
}
When it comes to indexing, you can see the best practices here:
You should exclude unused paths from indexing for faster writes.
You should leverage IndexingPolicy with IncludedPaths and ExcludedPaths.
For example:
var collection = new DocumentCollection { Id = "excludedPathCollection" };
collection.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/*" });
collection.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/nonIndexedContent/*" });
So if you are concerned with query costs, tweaking the indexing policy won't really help here: with the default policy everything is already range-indexed, and excluding paths mainly reduces write costs, not read costs. If you are seeing thousands of RUs per request, I suspect you are either issuing cross-partition queries or you have no partitioning at all (or a single partition for everything). To cut these costs down you need to either stop using cross-partition queries or, if that is not possible, rearchitect your data so that you do not need them; a query scoped to a single partition is sketched below.
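A minimal sketch of a single-partition query with the JavaScript SDK (@azure/cosmos); the partition key value and the date filter are illustrative placeholders:
async function queryOnePartition(container) {
  // Scoping the query to one partition key value avoids cross-partition fan-out.
  const { resources, requestCharge } = await container.items
    .query(
      {
        query: "SELECT * FROM c WHERE c.date >= @from",
        parameters: [{ name: "@from", value: "2021-09-01" }],
      },
      { partitionKey: "tenant-42" } // placeholder partition key value
    )
    .fetchAll();
  console.log(resources.length, requestCharge);
}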
I think range is the default index in Cosmos DB.

Query dynamodb for text search?

I am looking to optimize the DynamoDB operations, i.e. removing scans and using queries to fetch data quickly.
Table Data:
itemId   itemName   itemOwners
hash1    abc        [user1, user2]
hash2    abcd       [user1, user3]
hash3    xyz        [user2, user3]
I have to do the item search using an item name.
Right now, we scan the whole table.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

let getItems = {
  TableName: ItemsTable,
  FilterExpression: 'contains(#itemName, :searchValue)',
  ExpressionAttributeNames: { '#itemName': 'itemName' },
  ExpressionAttributeValues: { ':searchValue': searchValue },
};
let items = await docClient.scan(getItems).promise();
We then filter the results, keeping items whose itemOwners contains the userId of the searching user.
I wanted to know: is there a better way of doing this search with DynamoDB?
There isn't a way to do a contains in DynamoDB without it being a filter condition, and DynamoDB is not really designed for full-text search. However, there is a way to get some degree of text-search capability out of it. I'm not suggesting you should, but you can.
Basically, you create a record for each word/item combination you want to include in your search. This doesn't allow partial word matches, but it does give you full-word matching. It does, of course, require pre-processing all your data to make it searchable.
If you decide to go this route, I would recommend using DynamoDB Streams to keep the search data up to date: every time an item is inserted/updated/deleted, a Lambda can update the search records for that item. Again, not suggesting you should do this, but you can.
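For illustration, a minimal sketch of the word/item record idea, assuming a hypothetical ItemSearch table with word as the partition key and itemId as the sort key (neither exists in the question):
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Pre-processing: for an item { itemId: 'hash1', itemName: 'abc def' },
// write one search record per word:
// { word: 'abc', itemId: 'hash1' }, { word: 'def', itemId: 'hash1' }

async function searchByWord(searchValue) {
  // Full-word matching is now a Query instead of a Scan.
  const result = await docClient
    .query({
      TableName: 'ItemSearch',
      KeyConditionExpression: '#w = :word',
      ExpressionAttributeNames: { '#w': 'word' },
      ExpressionAttributeValues: { ':word': searchValue.toLowerCase() },
    })
    .promise();
  return result.Items; // [{ word: ..., itemId: ... }, ...]
}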
I would recommend investigating CloudSearch as an alternative to this.

How to perform join operation similar to SQL join in Azure DocumentDB

I have a collection in Azure DocumentDB wherein documents are clustered into 3 sets using a JSON property called clusterName on each document. The 3 clusters of documents are templated somewhat like these:
{
    "clusterName": "CustomerInformation",
    "id": "CustInfo1001",
    "custName": "XXXX"
},
{
    "clusterName": "ZoneInformation",
    "id": "ZoneInfo5005",
    "zoneName": "YYYY"
},
{
    "clusterName": "CustomerZoneAssociation",
    "id": "CustZoneAss9009",
    "custId": "CustInfo1001",
    "zoneId": "ZoneInfo5005"
}
As you can see, the CustomerZoneAssociation document links the CustomerInformation and ZoneInformation documents by their ids. I need help querying out information from the CustomerInformation and ZoneInformation clusters using the ids associated in the CustomerZoneAssociation cluster. The result I am expecting from the query is:
{
    "clusterName": "CustomerZoneAssociation",
    "id": "CustZoneAss9009",
    "custId": "CustInfo1001",
    "custName": "XXXX",
    "zoneId": "ZoneInfo5005",
    "zoneName": "YYYY"
}
Please suggest a solution which would take only 1 trip to DocumentDB.
DocumentDB does not support inter-document JOINs; instead, the JOIN keyword is used to perform intra-document cross-products (for use with nested arrays).
I would recommend one of the following approaches:
Keep in mind that you do not have to normalize every entity as you would with a traditional RDBMS. It may be worth revisiting your data model and de-normalizing parts of your data where appropriate. Also keep in mind that de-normalizing comes with its own trade-offs (fanning out writes vs. issuing follow-up reads). Check out the following SO answer to read more on the trade-offs between normalizing and de-normalizing data.
Write a stored procedure to batch a sequence of operations within a single network request (see the sketch below). Check out the following SO answer for a code sample of this approach.
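A minimal sketch of such a stored procedure (DocumentDB stored procedures are JavaScript), assuming all three documents live in the same partition, since a stored procedure executes against a single partition key:
function getCustomerZone(associationId) {
  var collection = getContext().getCollection();
  var response = getContext().getResponse();

  // Read the association, then follow custId and zoneId, all server-side.
  readById(associationId, function (assoc) {
    readById(assoc.custId, function (cust) {
      readById(assoc.zoneId, function (zone) {
        response.setBody({
          clusterName: assoc.clusterName,
          id: assoc.id,
          custId: assoc.custId,
          custName: cust.custName,
          zoneId: assoc.zoneId,
          zoneName: zone.zoneName
        });
      });
    });
  });

  function readById(id, callback) {
    var accepted = collection.queryDocuments(
      collection.getSelfLink(),
      { query: 'SELECT * FROM root r WHERE r.id = @id', parameters: [{ name: '@id', value: id }] },
      function (err, docs) {
        if (err) throw err;
        if (!docs || docs.length !== 1) throw new Error('Document not found: ' + id);
        callback(docs[0]);
      }
    );
    if (!accepted) throw new Error('Query was not accepted for id ' + id);
  }
}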
