I am trying to use the Cosmos DB change feed (I'm referring to https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor and https://github.com/Azure/azure-cosmos-dotnet-v2/tree/master/samples/code-samples/ChangeFeedProcessorV2).
When I start multiple instances of a consumer, the observer seems to see only one partition key range. I only see the message "Observer opened for partition key range 0" and it starts receiving the change feed. So the feed is received by only one consumer at any given point; if I close that consumer, the next one picks it up happily.
I can't seem to understand partition keys/ranges in Cosmos DB. I've created a database and a collection within it, and defined a partition key, /myId, in which I store a unique GUID. I've saved about 10,000 transactions in the collection.
When I look at the partition key ranges using the API (/dbs/db-name/colls/coll-name/pkranges), I see only one node under PartitionKeyRanges. Below is the output I see:
{
"_rid": "LEAgAL7tmKM=",
"PartitionKeyRanges": [
{
"_rid": "LEAgAL7tmKMCAAAAAAAAUA==",
"id": "0",
"_etag": "\"00007d00-0000-0000-0000-5c3645e70000\"",
"minInclusive": "",
"maxExclusive": "FF",
"ridPrefix": 0,
"_self": "dbs/LAEgAA==/colls/LEAgAL7tmKM=/pkranges/LEAgAL7tmKMCAAAAAAAAUA==/",
"throughputFraction": 1,
"status": "online",
"parents": [],
"_ts": 1547060711
}
],
"_count": 1
}
Shouldn't this show more partition key ranges? Is this behavior expected?
How do I get multiple consumers to receive data, as shown in https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor?
TL;DR: you should be able to ignore partition key ranges and how many of them you have, and just let the Change Feed Processor manage that for you.
Partition key ranges are an implementation detail that we currently leak. The short answer is that we add new partition key ranges when we want to restructure how your data is stored in the backend. This can happen for lots of reasons: you add more data, you consume a lot of RUs for a subsection of that data, or we just want to shuffle things around. Theoretically, if you kept adding data, we'd eventually split the range in two.
We're working on some updates for the v3 SDKs, currently in preview, to abstract this a bit further, since even the answer I've given above is pretty hand-wavy and we should have a more easily understood contract for public APIs.
Quick high-level background:
DynamoDB with single-table design
OpenSearch for full-text search
A DynamoDB Stream that indexes into OpenSearch via a Lambda on DynamoDB create/update/delete
The single-table design approach has been working well for us so far, but we also haven't really had many-to-many relationships to deal with. However, a new relationship we recently needed to account for is Tags for Entry objects:
interface Entry {
  readonly id: string
  readonly title: string
  readonly tags: Tag[]
}
interface Tag {
  readonly id: string
  readonly name: string
}
We want to try to stick to a single query/read to retrieve a list of Entries or a single Entry, but we also want to strike a good balance with how many updates we have to manage.
A few ways we've considered storing the data:
Store all tag data in the Entry
{
"id": "asdf1234",
"title": "Entry Title",
"tags": [
{
"id": "1234asdf",
"name": "stack"
},
{
"id": "4321hjkl",
"name": "over"
},
{
"id": "7657gdfg",
"name": "flow"
}
]
}
This approach makes reads easy, but updates become a pain: anytime a tag is updated, we would need a way to find all Entries that reference that tag and then update each of them.
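For illustration, here is a minimal sketch (TypeScript, AWS SDK v3) of what that fan-out update could look like. The AppTable name, the generic PK/SK key attributes, and a GSI1 inverted index mapping TAG#<id> to the Entries embedding that tag are all assumptions made up for this sketch, not our actual schema:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "AppTable"; // hypothetical table name

// Rename a tag everywhere it is embedded: find every Entry referencing the tag
// (via an assumed GSI keyed on TAG#<id>), then rewrite that Entry's tags array.
// Pagination of the query results is omitted to keep the sketch short.
async function renameTagEverywhere(tagId: string, newName: string): Promise<void> {
  const entries = await ddb.send(new QueryCommand({
    TableName: TABLE,
    IndexName: "GSI1", // assumed inverted index: tag -> entries
    KeyConditionExpression: "GSI1PK = :tag",
    ExpressionAttributeValues: { ":tag": `TAG#${tagId}` },
  }));

  for (const entry of entries.Items ?? []) {
    const tags = (entry.tags as { id: string; name: string }[])
      .map(t => (t.id === tagId ? { ...t, name: newName } : t));

    await ddb.send(new UpdateCommand({
      TableName: TABLE,
      Key: { PK: entry.PK, SK: entry.SK }, // assumed single-table key shape
      UpdateExpression: "SET tags = :tags",
      ExpressionAttributeValues: { ":tags": tags },
    }));
  }
}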
Store only the tag ids in the Entry
{
"id": "asdf1234",
"title": "Entry Title",
"tags": ["1234asdf", "4321hjkl", "7657gdfg"]
}
With this approach, no updates would be required when a Tag is updated, but now we have to do multiple reads to return the full data - we would need to query each Tag by id to retrieve its data before returning the full content back to the client.
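As a rough sketch of what that second read could look like with the AWS SDK v3 document client (the AppTable name and the TAG#<id> key shape are assumptions for illustration):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchGetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "AppTable"; // hypothetical table name

// Fetch the Tag items referenced by an Entry in one batch read.
// BatchGetItem accepts up to 100 keys per request; chunking is omitted here.
async function getTagsForEntry(tagIds: string[]) {
  const result = await ddb.send(new BatchGetCommand({
    RequestItems: {
      [TABLE]: {
        Keys: tagIds.map(id => ({ PK: `TAG#${id}`, SK: `TAG#${id}` })), // assumed key shape
      },
    },
  }));
  return result.Responses?.[TABLE] ?? [];
}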
Store only the tag ids in the Entry but use OpenSearch to query and get data
This option, like the one above, would store only the tag ids on the Entry, but the Entry document indexed on the search side would include all Tag data, added by our stream Lambda. Updates to a Tag would still require querying and updating each affected Entry individually (in search) - the question is whether that's more cost effective than just doing it in DynamoDB.
This scenario presents an interesting uni-directional flow:
writes go straight to DynamoDB
DynamoDB Stream -> Lambda: transform the data -> index into OpenSearch (see the sketch after this list)
reads are exclusively done via OpenSearch
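To make option 3 concrete, here is a rough sketch of what the stream Lambda could look like: it denormalizes Tag data into the Entry document before indexing it into OpenSearch. The "entries" index name, the loadTagsByIds helper, and the OPENSEARCH_ENDPOINT variable are assumptions for illustration:

import { DynamoDBStreamEvent } from "aws-lambda";
import { unmarshall } from "@aws-sdk/util-dynamodb";
import { Client } from "@opensearch-project/opensearch";

const search = new Client({ node: process.env.OPENSEARCH_ENDPOINT! });

// Assumed helper that batch-gets Tag items from DynamoDB by id (see the earlier sketch).
declare function loadTagsByIds(ids: string[]): Promise<{ id: string; name: string }[]>;

export const handler = async (event: DynamoDBStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    if (!record.dynamodb?.NewImage) continue; // deletes are skipped in this sketch

    const entry = unmarshall(record.dynamodb.NewImage as any); // the Entry item stores tag ids only
    const tags = await loadTagsByIds(entry.tags ?? []);

    await search.index({
      index: "entries", // assumed index name
      id: entry.id,
      body: { ...entry, tags }, // full Tag objects live only on the search side
    });
  }
};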
The overall question is: how do applications using NoSQL with single-table design handle these many-to-many scenarios? Is the uni-directional flow described above a good idea / worth it?
Things to consider:
our application leans more heavily on the read side
our application will also utilize search capability quite heavily
Tag updates will not happen often
I have a Cosmos DB container that has (simplified) documents like this:
{
id,
city,
propertyA,
propertyB
}
The partition key is 'city'.
Usually this works well. But there has been a bug in our code and now I have to update a lot of documents in a lot of different partitions.
I made a conditional update document like so:
{
"conditions": "from c where propertyA = 1",
"operations": [
{
"op": "set"
"path": "/propertyB",
"value": true
}
]
}
I send this document to Cosmos DB with the REST API.
Because I want to update all documents in all partitions that satisfy the condition, I set the x-ms-documentdb-query-enablecrosspartition header to 'True'.
But I still need to supply a partition key in the x-ms-documentdb-partitionkey header.
Is there a way to use the REST API to update all the documents for which the condition is true, whatever the partition key is?
Yes - the REST API lets you programmatically create, query, and delete databases, collections of documents, and documents. To change a document partially, you can use the PATCH HTTP method.
You will still need to supply a partition key in the x-ms-documentdb-partitionkey header.
If the collection was created with a partition key, you must provide the document's partition key in that header: a PATCH request targets a single document within a single partition, so there is no single call that patches across all partitions.
For more information, see the documentation:
https://learn.microsoft.com/en-us/rest/api/cosmos-db/patch-a-document
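In practice that means the cross-partition part has to happen on the query, not on the update: query across partitions for the affected documents (and their partition key values), then patch each document with its own partition key. Here is a minimal sketch of that pattern with the JavaScript SDK, assuming a recent @azure/cosmos version that supports partial document updates (patch); the endpoint, key, database, and container names are placeholders:

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT!, // placeholder
  key: process.env.COSMOS_KEY!,           // placeholder
});
const container = client.database("mydb").container("mycoll");

async function fixPropertyB(): Promise<void> {
  // Cross-partition query: find the id and partition key of every affected document.
  const { resources } = await container.items
    .query("SELECT c.id, c.city FROM c WHERE c.propertyA = 1")
    .fetchAll();

  // Patch each document individually, supplying its own partition key ('city').
  for (const doc of resources) {
    await container.item(doc.id, doc.city).patch([
      { op: "set", path: "/propertyB", value: true },
    ]);
  }
}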
I have enabled diagnostic logging on a Cosmos DB account (SQL API). The diagnostic log data is being sent to a storage account, and I can see that a new DataPlaneRequests blob is created every 5 minutes.
So far, so good.
I'm performing CRUD requests against a collection in the Cosmos account. I can see entries within the DataPlaneRequest logs that look like this ('*' used to protect the innocent)...
{ "time": "2020-01-28T03:04:59.2606375Z", "resourceId": "/SUBSCRIPTIONS/****/RESOURCEGROUPS/****/PROVIDERS/MICROSOFT.DOCUMENTDB/DATABASEACCOUNTS/**********", "category": "DataPlaneRequests", "operationName": "Query", "properties": {"activityId": "38f497ee-7e37-435f-8b4a-a2f0d8d65d12","requestResourceType": "DocumentFeed","requestResourceId": "/dbs/****/colls/****/docs","collectionRid": "","databaseRid": "","statusCode": "200","duration": "4.588500","userAgent": "Windows/10.0.14393 documentdb-netcore-sdk/2.8.1","clientIpAddress": "52...***","requestCharge": "4.160000","requestLength": "278","responseLength": "5727","resourceTokenUserRid": "","region": "West US 2","partitionId": ""}}
Every entry in the DataPlaneRequests log has an empty partitionId property value.
(The operationName property value in the log is either "Create" or "Query").
So my question is - why is this property empty?
Here is the documentation for DataPlaneRequests
What I'm actually trying to accomplish, is to obtain information about the load being placed on the physical partitions of a collection.
e.g. I'd like to know that during the past 10 minutes, 10k Create operations were performed in physical-partition "1", while 55k operations were performed in physical-partition "3".
That will allow me to have much more insight into why a collection is experiencing throttling, etc.
When you connect to Cosmos DB, there are two connection modes available: Gateway and Direct. It turns out that only Direct mode causes the partitionId to be included in the logs. (If you read up on how these two modes work differently, that makes sense.)
Anyway, it turns out that the partitionId in the logs is not a reference to a physical partition of a collection, so I'm unable to use that data to solve the problem I was attempting to solve.
There is a physical partition id available in the logs, but it's also of limited use, since it's only tracked for the 3 largest (logical) partition key values of each physical partition, and only if that key's documents amount to >= 1 GB.
I am trying to create multiple unique indexes on a users collection which are also sparse.
Use case:
Allow user to signup with either phone or email
Problem:
MongoDB allows creating such indexes, and the app runs as expected when connected to a real Mongo 3.2 instance. However, with Cosmos, email becomes a unique field but not a sparse one.
Is there a way to achieve this use case with Cosmos via the MongoDB API, i.e. without resorting to checking for uniqueness via a fetch query before every insert?
Attempts:
Going through Cosmos' documentation didn't reveal much. They claim to support the Mongo 3.2 API, and sparse indexes are not mentioned in the caveats. Creating a unique sparse index with createIndex does not result in an error, but it only creates a unique index. Creating another similar index on a different field doesn't give an error either, but it doesn't even create a unique index - it creates a normal index.
Update: I got a response from Azure's support team:
Currently we don’t support sparse indexes.
You can do this, i.e. create a composite index with both the phone and email properties.
I don't think a composite index is going to help me with my use case as I want both phone and email to be globally unique in the collection, not only their combination.
Can you review this Stack Overflow thread: Mongodb unique sparse index
Do the following:
var UserSchema = new Schema({
// ...
email: {type: String, default: null, trim: true, unique: true, sparse: true},
// ...
});
Notice:
unique: true,
sparse: true
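If you are creating the indexes outside Mongoose, the equivalent with the Node.js MongoDB driver against a real MongoDB 3.2+ instance would look roughly like this (connection string, database, and collection names are placeholders); note this illustrates standard MongoDB behaviour, which Cosmos DB's Mongo API did not support at the time:

import { MongoClient } from "mongodb";

async function createSparseUniqueIndexes(): Promise<void> {
  const client = new MongoClient(process.env.MONGO_URI!); // placeholder connection string
  await client.connect();
  const users = client.db("app").collection("users");

  // Uniqueness is only enforced on documents that actually contain the field,
  // so a user with only a phone and a user with only an email can coexist.
  await users.createIndex({ email: 1 }, { unique: true, sparse: true });
  await users.createIndex({ phone: 1 }, { unique: true, sparse: true });

  await client.close();
}

One caveat worth double-checking: a sparse index only skips documents where the field is absent, so a schema default of null may still be indexed (and collide); dropping the default, or using a partial index with an $exists filter on MongoDB 3.2+, avoids that.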
I have two sets of entities in my Firebase Realtime Database schema, called orders and customers.
So far I was not actually relating them in my app, but was just displaying them as if they were related. The current schema looks like this:
{
"orders" : [
{"id" : 1, "name": "abc", "price": 200, "customer": "vik"}
],
"customers" : [
{"cust_id" : "10", "name" : "vik", "type": "existing"}
]
}
So I have an orders list page showing all the orders in a table, which I get by firing /orders.json.
But practically, instead of having the customer name directly in the orders, I should have a cust_id attribute, as that is the key.
That naturally makes it a standard relational schema, where I will be free to change customer attributes without worrying about mismatches in orders.
However, the downside I see right away is that if I have, say, 20 orders to show in the orders list table, then instead of 1 REST call I will end up firing 21 (1 to get the order list and 20 to fetch the customer name for each order).
What are the recommendations or standards around this ?
Firebase is a NoSQL database. So the rules of normalization that you know from relational databases don't necessarily apply.
For example: having the customer name in each order is actually quite normal. It saves having to do a client-side join for each customer record, significantly simplifying the code and improving the speed of the operation. But of course it comes at the cost of having to store data multiple times (quite normal in NoSQL databases), and having to consider if/how you update the duplicated data in case of updates of the customer record.
I recommend reading NoSQL data modeling, watching Firebase for SQL developers, and reading my answer on keeping denormalized data up to date.
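To make that last point concrete, here is a rough firebase-admin sketch of keeping the duplicated customer name up to date when a customer record changes. It assumes orders store cust_id alongside the denormalized customer name, that customers are keyed by cust_id (/customers/$custId), and that an ".indexOn": "cust_id" rule exists for /orders; it's an illustration, not part of the question's actual schema:

import * as admin from "firebase-admin";

admin.initializeApp(); // assumes default credentials and databaseURL from the environment

async function renameCustomer(custId: string, newName: string): Promise<void> {
  const db = admin.database();

  // Find every order that embeds this customer's name.
  const snap = await db.ref("orders").orderByChild("cust_id").equalTo(custId).once("value");

  // Build one atomic multi-location update: the customer record itself,
  // plus the denormalized name on each of its orders.
  const updates: Record<string, unknown> = {};
  updates[`customers/${custId}/name`] = newName;
  snap.forEach(orderSnap => {
    updates[`orders/${orderSnap.key}/customer`] = newName;
    return false; // returning true would stop the iteration early
  });

  await db.ref().update(updates);
}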