DataPlaneRequests partitionID in CosmosDB diagnostic logs always seems to be empty - azure-cosmosdb

I have enabled diagnostic logging of a Cosmos Account (SQL interface). The diagnostic log data is being sent to a storage account - and I can see that there is a new DataPlaneRequests blob created every 5 minutes.
So far, so good.
I'm performing CRUD requests against a collection in the Cosmos account. I can see entries within the DataPlaneRequests logs that look like this ('*' used to protect the innocent)...
{ "time": "2020-01-28T03:04:59.2606375Z", "resourceId": "/SUBSCRIPTIONS/****/RESOURCEGROUPS/****/PROVIDERS/MICROSOFT.DOCUMENTDB/DATABASEACCOUNTS/**********", "category": "DataPlaneRequests", "operationName": "Query", "properties": {"activityId": "38f497ee-7e37-435f-8b4a-a2f0d8d65d12","requestResourceType": "DocumentFeed","requestResourceId": "/dbs/****/colls/****/docs","collectionRid": "","databaseRid": "","statusCode": "200","duration": "4.588500","userAgent": "Windows/10.0.14393 documentdb-netcore-sdk/2.8.1","clientIpAddress": "52...***","requestCharge": "4.160000","requestLength": "278","responseLength": "5727","resourceTokenUserRid": "","region": "West US 2","partitionId": ""}}
Every entry in the DataPlaneRequests log has an empty partitionId property value.
(The operationName property value in the log is either "Create" or "Query").
So my question is - why is this property empty?
Here is the documentation for DataPlaneRequests
What I'm actually trying to accomplish is to obtain information about the load being placed on the physical partitions of a collection.
For example, I'd like to know that during the past 10 minutes, 10k Create operations were performed against physical partition "1", while 55k operations were performed against physical partition "3".
That will allow me to have much more insight into why a collection is experiencing throttling, etc.

When you connect to Cosmos DB, there are two connection modes available: Gateway and Direct. It turns out that only Direct mode causes the partitionId to be included in the logs. (If you read up on how differently these two modes work, that makes sense.)
Anyway, it turns out that the partitionId in the logs is not a reference to a physical partition of the collection, so I'm unable to use that data to solve the problem I was attempting to solve.
There is a physical partition id available in the logs, but it's also of limited use, since it's only tracked for the 3 largest (logical) partition key values of each physical partition, and only if that key's documents add up to >= 1 GB.

Related

What are the usual RUs values you expect when you run simple queries?

A short time ago I started experimenting with Gremlin and Cosmos DB, with the idea of building a prototype for a project at work.
How do I know if a "Request charge" value is good or bad?
For example, I have a query to get a list of flats located in a specific state, which looks like this:
g.V().hasLabel('Flat').as('Flats').outE('has_property')
.inV().has('propertyType',eq('reference')).has('propertyName','City').outE('property_refers_to')
.inV().hasLabel('City').outE('has_property')
.inV().has('propertyType',eq('reference')).has('propertyName','State').outE('property_refers_to')
.inV().hasLabel('State').outE('has_property')
.inV().has('propertyType',eq('scalar')).has('name',eq('Some_State_Name'))
.values('name')
.select('Flats')
Each object property is not stored directly in the node representing an "object" but has its own node. Some properties are "scalar" (like the name of a state or a city) and some are a reference to another node (like the city where a flat is located).
With this query, Azure shows me a "Request charge" value of 73.179.
Is it good?
What are the things I should try to improve?

Cosmos db user id/email as partition key

I have a dilemma about choosing the best (synthetic) value for the partition key for storing user data.
The User document has:
- id (guid)
- email (used to log in, for example)
- profile data
There are 2 main types of queries:
Looking for a user by id (most queries)
Looking for a user by email (login and some admin queries)
I want to avoid cross-partition queries.
If I choose id for the partitionKey (synthetic field), then login queries would be cross-partition.
On the other hand, if I choose email, then it's a problem if the user ever changes their email.
What I am thinking is to introduce a new type within the collection. Something like:
userId: guid,
userEmail: "email1",
partitionKey: "users-mappings"
Then I can have the User document itself as:
id: someguid,
type: "user",
partitionKey: "user_someguid",
profileData: {}
That way, when a user logs in, I first query the mappings type/partition by email, get the guid, and then fetch the actual User document by guid (sketched below).
Also, this way the email can be changed without affecting partitioning.
Is this a valid approach? Any problems with it? Am I missing something?
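Roughly, that lookup flow would look something like this with the JavaScript SDK (a sketch only; the endpoint, key, database, and container names are placeholders):

import { CosmosClient } from "@azure/cosmos";

// Placeholder endpoint/key/database/container names.
const container = new CosmosClient({ endpoint: "https://myaccount.documents.azure.com", key: "<key>" })
  .database("mydb")
  .container("users");

async function getUserByEmail(email: string) {
  // Step 1: look up the guid in the "users-mappings" partition.
  const { resources: mappings } = await container.items
    .query(
      {
        query: "SELECT c.userId FROM c WHERE c.userEmail = @email",
        parameters: [{ name: "@email", value: email }],
      },
      { partitionKey: "users-mappings" }
    )
    .fetchAll();
  if (mappings.length === 0) return undefined;
  const guid = mappings[0].userId;

  // Step 2: point-read the User document by id + partition key.
  const { resource: user } = await container.item(guid, `user_${guid}`).read();
  return user;
}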
Your question does not have a standard answer. In my opinion, your mapping-type solution requires two queries, which is also inefficient. Choosing a partition key is always a process of balancing the pros and cons. Please see the guidance in the official documentation.
Based on your description:
1. Looking for a user by id (most queries)
2. Looking for a user by email (login and some admin queries)
I suggest you prioritize the most frequent query, that is to say, id.
My reasons:
1. id won't change easily; it is relatively stable.
2. A session or cookie can be saved after login, so logins are far less frequent than lookups by id.
3. id is your most frequent query condition, so those queries won't have to fan out across all partitions.
4. If you are concerned about login performance, don't forget to add an indexing policy for the email property (see the sketch below). It can also improve performance.
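If it helps, here is a minimal sketch (JavaScript SDK; all names are placeholders) of creating the container with an indexing policy that explicitly includes the email path. Note that by default Cosmos DB indexes every path anyway, so this only matters if you have narrowed the policy:

import { CosmosClient } from "@azure/cosmos";

// Placeholder endpoint/key/database names.
const database = new CosmosClient({ endpoint: "https://myaccount.documents.azure.com", key: "<key>" })
  .database("mydb");

async function createUsersContainer() {
  await database.containers.createIfNotExists({
    id: "users",
    partitionKey: { paths: ["/id"] },
    indexingPolicy: {
      // Index only the paths needed by queries; email is included for login lookups.
      includedPaths: [{ path: "/email/?" }],
      excludedPaths: [{ path: "/*" }],
    },
  });
}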
As you already know, when querying Cosmos DB, a fan-out query should be the last resort, especially for such a high-volume action as logging in. Plus, the cost in RUs will be significantly higher with large data sets.
In the Cosmos DB SQL API, one pattern is to use synthetic partition keys. You can compose a synthetic partition key by concatenating the id and the email on write. This pattern works for a myriad of query scenarios providing flexibility.
Something like this:
{
  "id": "123",
  "email": "joe@abc.com",
  "partitionKey": "123-joe@abc.com"
}
Then on read, do something like this:
SELECT s.something
FROM s
WHERE STARTSWITH(s.partitionKey, "123")
OR
ENDSWITH(s.partitionKey, "joe@abc.com")
You can also use SUBSTRING() etc...
With the above approach, you can search for a user either by their id or email and still use the efficiency of a partition key, minimizing your query RU cost and optimizing performance.
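For example, with the JavaScript SDK the read side could be sketched like this (endpoint, key, database, and container names are placeholders; the same parameter is used for both predicates so you can pass in either an id or an email):

import { CosmosClient } from "@azure/cosmos";

// Placeholder endpoint/key/database/container names.
const container = new CosmosClient({ endpoint: "https://myaccount.documents.azure.com", key: "<key>" })
  .database("mydb")
  .container("users");

// Find a user by either id or email against the synthetic partition key.
async function findUser(idOrEmail: string) {
  const { resources } = await container.items
    .query({
      query:
        "SELECT s.something FROM s WHERE STARTSWITH(s.partitionKey, @value) OR ENDSWITH(s.partitionKey, @value)",
      parameters: [{ name: "@value", value: idOrEmail }],
    })
    .fetchAll();
  return resources;
}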

How can I query for all new and updated documents since last query?

I need to query a collection and return all documents that are new or updated since the last query. The collection is partitioned by userId. I am looking for a value that I can use (or create and use) that would help facilitate this query. I considered using _ts:
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value]
The problem with _ts is that it is not granular enough and the query could miss updates made in the same second by another client.
In SQL Server I could accomplish this using an IDENTITY column in another table. Let's call that table version. In a transaction I would create a new row in the version table and do the updates to the other table (including updating its version column with the new value). To query for new and updated rows I would use a query like this:
SELECT * FROM table WHERE userId=[some-user-id] and version > [some-value]
How could I do something like this in Cosmos DB? The Change Feed seems like the right option, but without the ability to query the Change Feed, I'm not sure how I would go about this.
In case it matters, the (web/mobile) clients connect to data in Cosmos DB via a web api. I have control of the entire stack - from client to back-end.
As stated in this link:
Today, you see all operations in the change feed. The functionality where you can control change feed, for specific operations such as updates only and not inserts is not yet available. You can add a "soft marker" on the item for updates and filter based on that when processing items in the change feed. Currently change feed doesn't log deletes. Similar to the previous example, you can add a soft marker on the items that are being deleted, for example, you can add an attribute in the item called "deleted" and set it to "true" and set a TTL on the item, so that it can be automatically deleted. You can read the change feed for historic items, for example, items that were added five years ago. If the item is not deleted you can read the change feed as far as the origin of your container.
So the change feed on its own does not cover your requirements.
My idea:
Use an Azure Function with a Cosmos DB trigger to collect all the operations on your specific Cosmos collection. Follow this document to configure the Cosmos DB input of the Azure Function, then follow this document to configure the output as Azure Queue Storage.
Get the ids of the changed items and send them to queue storage as messages. When you want to query the changed items, just consume the messages from the queue at a specific interval and clear the entire queue after that. No items will be missed.
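A rough sketch of that wiring in TypeScript, using the current Node.js programming model for Azure Functions (the connection settings, database, container, and queue names are placeholders):

import { app, InvocationContext, output } from "@azure/functions";

// Queue output binding; the queue and storage connection names are placeholders.
const changedIdsQueue = output.storageQueue({
  queueName: "changed-item-ids",
  connection: "AzureWebJobsStorage",
});

// Cosmos DB trigger: fires for every insert/update in the monitored container.
app.cosmosDB("forwardChangesToQueue", {
  connection: "CosmosDbConnection",      // app setting holding the Cosmos DB connection string
  databaseName: "mydb",                  // placeholder database name
  containerName: "mycoll",               // placeholder container name
  createLeaseContainerIfNotExists: true,
  return: changedIdsQueue,
  handler: (documents: unknown[], context: InvocationContext): string[] => {
    context.log(`Change feed delivered ${documents.length} document(s)`);
    // Returning an array enqueues one message per changed document id.
    return (documents as { id: string }[]).map((d) => d.id);
  },
});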
With your approach, you can get the added/updated documents and save reference values (the _ts and id fields) somewhere (like a blob):
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value] AND id != 'guid' ORDER BY _ts DESC
This is similar to the approach we use to read data from Event Hubs, where checkpointing information (epoch number, sequence number and offset value) is stored in a blob, and at any time only one function can take a lease on that blob.
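A sketch of that query with the JavaScript SDK (all names are placeholders; persisting the checkpoint, e.g. to a blob, is left out):

import { CosmosClient } from "@azure/cosmos";

// Placeholder endpoint/key/database/container names.
const container = new CosmosClient({ endpoint: "https://myaccount.documents.azure.com", key: "<key>" })
  .database("mydb")
  .container("items");

// Returns the documents for a user changed since the last checkpoint,
// excluding the document the checkpoint itself points at.
async function changedSince(userId: string, lastTs: number, lastId: string) {
  const { resources } = await container.items
    .query(
      {
        query:
          "SELECT * FROM c WHERE c.userId = @userId AND c._ts > @ts AND c.id != @id ORDER BY c._ts DESC",
        parameters: [
          { name: "@userId", value: userId },
          { name: "@ts", value: lastTs },
          { name: "@id", value: lastId },
        ],
      },
      { partitionKey: userId }
    )
    .fetchAll();
  // The caller stores the newest _ts/id pair as the next checkpoint.
  return resources;
}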
If you go with the change feed, you can create a listener (a Function or a job) to listen for all added/updated data from the collection and store those values in some collection; while saving the data you can add an identity/version field to every document. This approach may increase your Cosmos DB bill.
This is what the transaction consistency levels are for: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels
Choose strong consistency and your queries will always return the latest write.
Strong: Strong consistency offers a linearizability guarantee. The reads are guaranteed to return the most recent committed version of an item. A client never sees an uncommitted or partial write. Users are always guaranteed to read the latest committed write.

Using multiple consumers with CosmosDB change feed

I am trying to use the Cosmos DB change feed (I'm referring to https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor and https://github.com/Azure/azure-cosmos-dotnet-v2/tree/master/samples/code-samples/ChangeFeedProcessorV2).
When I start multiple instances of a consumer, the observer seems to see only 1 partition key range. I only see a message - "Observer opened for partition Key Range 0" - and it starts receiving the change feed. So the feed is received by only 1 consumer at any given point. If I close one consumer, the next one picks up happily.
I can't seem to understand the partition keys / ranges in Cosmos DB. In Cosmos DB, I've created a database and a collection within it. I've defined a partition key - /myId. I store a unique guid in myId. I've saved about 10000 transactions in the collection.
When I look at the partition key ranges using the API (/dbs/db-name/colls/coll-name/pkranges), I see only one node under PartitionKeyRanges. Below is the output I see:
{
  "_rid": "LEAgAL7tmKM=",
  "PartitionKeyRanges": [
    {
      "_rid": "LEAgAL7tmKMCAAAAAAAAUA==",
      "id": "0",
      "_etag": "\"00007d00-0000-0000-0000-5c3645e70000\"",
      "minInclusive": "",
      "maxExclusive": "FF",
      "ridPrefix": 0,
      "_self": "dbs/LAEgAA==/colls/LEAgAL7tmKM=/pkranges/LEAgAL7tmKMCAAAAAAAAUA==/",
      "throughputFraction": 1,
      "status": "online",
      "parents": [],
      "_ts": 1547060711
    }
  ],
  "_count": 1
}
Shouldn't this show more partition key ranges? Is this behavior expected?
How do I get multiple consumers to receive data as shown under https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor?
TL;DR - you should be able to ignore partition key ranges and the number of them you have, and just let the Change Feed Processor manage that for you.
Partition key ranges are an implementation detail we currently leak. The short answer is that we add new partition key ranges when we want to restructure how your data is stored in the backend. This can happen for lots of reasons, like you add more data, you consume a lot of RUs for a subsection of that data, or we just want to shuffle things around. Theoretically, if you kept adding data, we'd eventually split the range in two.
We're working on some updates for the v3 SDKs that are currently in preview to abstract this a bit further, since even the answer I have given above is pretty hand-wavy, and we should have a more easily understood contract for public APIs.

How can you create a transaction/batch write between multiple Firestore instances?

Firebase allows having multiple projects in a single application.
// Initialize another app with a different config
var secondary = firebase.initializeApp(secondaryAppConfig, "secondary");
// Retrieve the database.
var secondaryDatabase = secondary.database();
Example:
Project 1 has my users collection; Project 2 has my friends collection (suppose there's a reason for that). When I add a new friend in the Project 2 database, I want to increment the friendsCount in the user document in Project 1. For this reason, I want to create a transaction/batch write to ensure consistency in the data.
How can I achieve this? Can I create a transaction or a batch write between different Firestore instances?
No, you cannot use the database transaction feature across multiple databases.
If absolutely required, I'd probably instead create a custom locking feature. From Wikipedia:
To allow several users to edit a database table at the same time and also prevent inconsistencies created by unrestricted access, a single record can be locked when retrieved for editing or updating. Anyone attempting to retrieve the same record for editing is denied write access because of the lock (although, depending on the implementation, they may be able to view the record without editing it). Once the record is saved or edits are canceled, the lock is released. Records can never be saved so as to overwrite other changes, preserving data integrity.
In database management theory, locking is used to implement isolation among multiple database users. This is the "I" in the acronym ACID.
Source: https://en.wikipedia.org/wiki/Record_locking
It's been three years since the question, I know, but since I needed the same thing I found a working solution to perform the double (or even n-fold) transaction. You have to nest the transactions, like this:
db1.runTransaction(t1 =>
  db2.runTransaction(async t2 => {
    await t1.set(.....);
    await t2.update(.....);
    // etc.
  })
).then(...).catch(...);
Since errors propagate through the nested promises, it is safe to execute the double transaction this way: a failure in any one of the databases results in an error for all of them.
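For the scenario in the question, a usage sketch might look like this (assuming both apps were initialized with the names "primary" and "secondary"; the collection and field names are just examples):

import firebase from "firebase/app";
import "firebase/firestore";

const db1 = firebase.app("primary").firestore();   // holds the users collection
const db2 = firebase.app("secondary").firestore(); // holds the friends collection

async function addFriend(userId: string, friend: { id: string; name: string }) {
  await db1.runTransaction(t1 =>
    db2.runTransaction(async t2 => {
      const userRef = db1.collection("users").doc(userId);
      // Reads must happen before writes within a transaction.
      const snapshot = await t1.get(userRef);
      const friendsCount = (snapshot.data()?.friendsCount ?? 0) + 1;

      t2.set(db2.collection("friends").doc(friend.id), friend); // write in project 2
      t1.update(userRef, { friendsCount });                     // write in project 1
    })
  );
}

If the inner transaction fails, its error rejects the outer update function, so the outer transaction does not commit either; note that if the outer transaction retries, the inner one runs again.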
