How to keep documents with 2 partition keys in sync / referential integrity? - azure-cosmosdb

I have a Cosmos DB with high-cardinality synthetic partition keys and type properties.
I need a setup where users can share documents with each other.
For example, this is a document:
{
  "id": "guid",
  "title": "Example document to share",
  "ownerUserId": "user1Guid",
  "type": "usersDocument",
  "partitionKey": "user_user1Guid_documents"
}
Now, a user wants to share this document with another user.
Assumptions:
- one document can be shared with many users (thousands)
- one user can have thousands of documents shared with them
For these two reasons:
- I don't want to embed the sharings in the document documents or in the user documents (writes would very soon become inefficient/expensive); I would prefer those m:n relationships to be separate documents.
- I don't want to put the shares for all users/documents in one partition, as it would create hot spots very soon.
I need both of these queries:
1. ListDocumentsSharedWithMe
At query time, I know the ID of the user the documents are shared with.
2. ListAllUsersISharedThisDocumentWith
At query time, I know the ID of the document that has been shared with different users.
All this makes me think I should have two separate document types with separate partition keys.
For listing all documents shared with me:
{
  "id": "documentGuid",
  "type": "sharedWithMe",
  "partitionKey": "sharedWithMe_myUserGuid"
}
(This could also be a single document holding a collection of shared documents; what matters here is the partitionKey.)
Now I can easily run SQL like SELECT * FROM c WHERE c.type = "sharedWithMe" against the partition key containing my user GUID.
For listing all users I shared some document with, it's similar:
{
  "id": "userISharedWithGuid",
  "type": "documentSharings",
  "partitionKey": "documentShare_documentGuid"
}
Now I can easily run SQL like SELECT * FROM c WHERE c.type = "documentSharings" against the partition key containing my document GUID.
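For reference, here are both lookups sketched with the @azure/cosmos JavaScript SDK; the endpoint, key, and database/container names are placeholders:

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({ endpoint: "https://<account>.documents.azure.com", key: "<key>" });
const container = client.database("appdb").container("documents");

// 1. ListDocumentsSharedWithMe: a single-partition query on the user's share partition.
async function listDocumentsSharedWithMe(userGuid) {
  const { resources } = await container.items
    .query(
      {
        query: "SELECT * FROM c WHERE c.type = @type",
        parameters: [{ name: "@type", value: "sharedWithMe" }],
      },
      { partitionKey: `sharedWithMe_${userGuid}` }
    )
    .fetchAll();
  return resources;
}

// 2. ListAllUsersISharedThisDocumentWith: a single-partition query on the document's share partition.
async function listAllUsersISharedThisDocumentWith(documentGuid) {
  const { resources } = await container.items
    .query(
      {
        query: "SELECT * FROM c WHERE c.type = @type",
        parameters: [{ name: "@type", value: "documentSharings" }],
      },
      { partitionKey: `documentShare_${documentGuid}` }
    )
    .fetchAll();
  return resources;
}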
Question:
When a user shares a document with another user, both documents get created with different partition keys (thus, no stored procedures/transactions).
How can I keep this "atomic-like", or avoid create/update anomalies?
Or is there a better way to model this?

I think your method makes sense; I do something similar, partitioning in multiple ways based on the scope of a query. I assume your main concern is what happens if a failure occurs between saving the first and last of the related documents? Unfortunately, the only way to manage the chain of documents as they save is within your application code, i.e. we make sure we save in the order that makes it easiest to roll back, and then implement a rollback method within the exception handler. This works by keeping a collection of the saved documents in memory.
As you say, since you are across partitions, there is no transaction handling out of the box.
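A minimal sketch of that save-then-roll-back pattern, assuming the @azure/cosmos JavaScript SDK and the two share-document shapes from the question (the container wiring is omitted):

async function shareDocument(container, documentGuid, targetUserGuid) {
  const saved = []; // successfully created docs, kept in memory for rollback

  try {
    // Save in the order that is easiest to roll back.
    const sharedWithMe = {
      id: documentGuid,
      type: "sharedWithMe",
      partitionKey: `sharedWithMe_${targetUserGuid}`,
    };
    await container.items.create(sharedWithMe);
    saved.push(sharedWithMe);

    const documentSharing = {
      id: targetUserGuid,
      type: "documentSharings",
      partitionKey: `documentShare_${documentGuid}`,
    };
    await container.items.create(documentSharing);
    saved.push(documentSharing);
  } catch (err) {
    // Rollback: delete whatever was already created, newest first.
    // (A delete can itself fail, so real code would retry or log for cleanup.)
    for (const doc of saved.reverse()) {
      await container.item(doc.id, doc.partitionKey).delete();
    }
    throw err;
  }
}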

Related

Organizing a Cloud Firestore database

I can't determine the better way of organizing my database for my app:
My users can create items identified by a unique ID.
The queries I need:
- Query 1: Get all the items created by a user
- Query 2: From the UID of an item, get its creator
My database is organized as follows:
Users database
user1: {
  item1_uid,
  item2_uid
},
user2: {
  item3_uid
}
Items database
item1_uid: {
  title,
  description
},
item2_uid: {
  title,
  description
},
item3_uid: {
  title,
  description
}
Query 1 is quite simple, but for query 2 I need to parse the whole users database and go through all the item IDs to see if the one I'm looking for is there. It works right now, but I'm afraid it will slow the request time as the database grows.
Should I add a field with the user ID to the items data? If so, the query will be simpler, but I've heard that I'm not supposed to have the same data twice in the database because it can lead to conflicts when adding or removing items.
Should I add a field with the user ID to the items data?
Yes, this is a very common approach in the NoSQL world and is called denormalization. Denormalization is described, in this "famous" post about NoSQL data modeling, as "copying of the same data into multiple documents in order to simplify/optimize query processing or to fit the user’s data into a particular data model". In other words, the main driver of your data model design is the queries you plan to execute.
More concretely, you could have an extra field in your item documents that contains the ID of the creator. You could even have another one with, e.g., the name of the creator: this way, in one query, you can display the items and their creators.
Now, to keep these different documents in sync (for example, if you change the name of a user, you want it updated in the corresponding items), you can either use a batched write to modify several documents in one atomic operation, or rely on one or more Cloud Functions that detect changes to user documents and reflect them in the item documents.
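A minimal sketch of the batched-write option, assuming the firebase-admin SDK and denormalized creatorId/creatorName fields on each item document (all names here are placeholders):

const admin = require("firebase-admin");
admin.initializeApp();
const db = admin.firestore();

// Query 1 (all items created by a user) becomes a single indexed query
// thanks to the denormalized creatorId field.
async function getItemsOfUser(userId) {
  const snap = await db.collection("items").where("creatorId", "==", userId).get();
  return snap.docs.map(d => ({ id: d.id, ...d.data() }));
}

// When a user is renamed, update the user doc and every item copy atomically.
async function renameUser(userId, newName) {
  const items = await db.collection("items").where("creatorId", "==", userId).get();
  const batch = db.batch();
  batch.update(db.collection("users").doc(userId), { name: newName });
  items.docs.forEach(doc => batch.update(doc.ref, { creatorName: newName }));
  await batch.commit(); // all-or-nothing; a batch is limited to 500 operations
}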

How can I query for all new and updated documents since last query?

I need to query a collection and return all documents that are new or updated since the last query. The collection is partitioned by userId. I am looking for a value that I can use (or create and use) that would help facilitate this query. I considered using _ts:
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value]
The problem with _ts is that it is not granular enough and the query could miss updates made in the same second by another client.
In SQL Server I could accomplish this with an IDENTITY column in another table; let's call the table version. In a transaction I would create a new row in the version table, then do the updates to the other table (including updating its version column with the new value). To query for new and updated rows I would use a query like this:
SELECT * FROM table WHERE userId=[some-user-id] and version > [some-value]
How could I do something like this in Cosmos DB? The Change Feed seems like the right option, but without the ability to query the Change Feed, I'm not sure how I would go about this.
In case it matters, the (web/mobile) clients connect to data in Cosmos DB via a web api. I have control of the entire stack - from client to back-end.
As stated in this link:
Today, you see all operations in the change feed. The functionality where you can control the change feed for specific operations, such as updates only and not inserts, is not yet available. You can add a "soft marker" on the item for updates and filter based on that when processing items in the change feed. Currently the change feed doesn't log deletes. Similar to the previous example, you can add a soft marker on the items that are being deleted: for example, you can add an attribute called "deleted", set it to "true", and set a TTL on the item so that it is automatically deleted. You can read the change feed for historic items, for example, items that were added five years ago. If the item is not deleted, you can read the change feed as far back as the origin of your container.
So the change feed by itself does not cover your requirements.
My idea:
Use an Azure Function with the Cosmos DB trigger to collect all the operations in your specific Cosmos collection. Follow this document to configure the function's input as Cosmos DB, then follow this document to configure the output as Azure Queue Storage.
Get the IDs of the changed items and send them to Queue Storage as messages. When you want to query the changed items, just consume the messages from the queue at a specific unit of time and then clear the entire queue. No items will be missed.
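A minimal sketch of that function, assuming the JavaScript (v3) programming model; the binding names, connection settings, and queue name (CosmosConnection, StorageConnection, changed-items) are placeholders:

function.json:
{
  "bindings": [
    {
      "type": "cosmosDBTrigger",
      "name": "documents",
      "direction": "in",
      "connectionStringSetting": "CosmosConnection",
      "databaseName": "appdb",
      "collectionName": "documents",
      "createLeaseCollectionIfNotExists": true
    },
    {
      "type": "queue",
      "name": "outputQueueItems",
      "direction": "out",
      "queueName": "changed-items",
      "connection": "StorageConnection"
    }
  ]
}

index.js:
// Fan the changed document IDs out to the storage queue, one message each.
module.exports = async function (context, documents) {
  if (documents && documents.length > 0) {
    context.bindings.outputQueueItems = documents.map(d => d.id);
  }
};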
With your approach, you can get the added/updated documents and save a reference value (the _ts and id fields) somewhere (like a blob):
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts >= [some-value] AND id != [some-id] ORDER BY _ts DESC
This is similar to the approach we use to read data from Event Hubs, storing checkpoint information (epoch number, sequence number, and offset value) in a blob; at any given time only one function can take a lease on that blob.
If you go with the change feed, you can create a listener (a Function or a job) that listens for all adds/updates on the collection, stores those values in some collection, and adds an identity/version field to every document as it saves. This approach may increase your Cosmos DB bill.
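A minimal sketch of that checkpointed polling query, assuming the @azure/cosmos JavaScript SDK and reading the filter as _ts >= checkpoint so the id exclusion has an effect; where you persist the checkpoint (e.g. a blob) is up to you:

// checkpoint = { ts: <last _ts seen>, id: <id of the last doc seen> }
async function fetchChangesSince(container, userId, checkpoint) {
  const { resources } = await container.items
    .query({
      query:
        "SELECT * FROM c WHERE c.userId = @userId AND c._ts >= @ts AND c.id != @id ORDER BY c._ts DESC",
      parameters: [
        { name: "@userId", value: userId },
        { name: "@ts", value: checkpoint.ts },
        { name: "@id", value: checkpoint.id },
      ],
    })
    .fetchAll();
  if (resources.length > 0) {
    // The newest row becomes the next checkpoint; persist it after processing.
    // Note: if several documents share the checkpoint's _ts, excluding a
    // single id is not enough, which is the granularity problem in question.
    checkpoint.ts = resources[0]._ts;
    checkpoint.id = resources[0].id;
  }
  return resources;
}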
This is what the transaction consistency levels are for: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels
Choose strong consistency and your queries will always return the latest write.
Strong: Strong consistency offers a linearizability guarantee. Reads are guaranteed to return the most recent committed version of an item. A client never sees an uncommitted or partial write. Users are always guaranteed to read the latest committed write.

What is the recommended way to track document changes in firestore?

Let's say I have an employees collection with one document per employee, and I want to keep a record of all changes made to a single employee doc. I was thinking of the following approach:
Have a pendingEmployeeWrites collection where the client is only allowed to create documents. Each doc here will have an employeeId field (this ID is generated on the client side for new employees).
A Cloud Function will be invoked whenever such a doc is created, and it validates the data. If valid, the employeeId doc in the employees collection is overwritten with this data; otherwise the pendingEmployeeWrites doc is updated to set isFailed to true. The client app is only allowed to read from the employees collection.
Keeping pendingEmployeeWrites as a flat collection instead of a sub-collection allows me to pull all changes made by a user as well as all changes for a particular document. Does this approach make sense, or is there a better approach I should consider?
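A minimal sketch of that Cloud Function, assuming the firebase-functions v1 API; the validation rule here is a stand-in for your real one:

const functions = require("firebase-functions");
const admin = require("firebase-admin");
admin.initializeApp();

exports.applyPendingEmployeeWrite = functions.firestore
  .document("pendingEmployeeWrites/{writeId}")
  .onCreate(async (snap) => {
    const data = snap.data();
    const isValid = typeof data.employeeId === "string"; // placeholder validation
    if (isValid) {
      // Overwrite the employee doc with the validated data.
      await admin.firestore().collection("employees").doc(data.employeeId).set(data);
    } else {
      // Flag the pending write so the client can see it failed.
      await snap.ref.update({ isFailed: true });
    }
  });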

How can you create a transaction/batch write between multiple Firestore instances?

Firebase allows having multiple projects in a single application.
// Initialize another app with a different config
var secondary = firebase.initializeApp(secondaryAppConfig, "secondary");
// Retrieve the database.
var secondaryDatabase = secondary.database();
Example:
Project 1 has my users collection; Project 2 has my friends collection (suppose there's a reason for that). When I add a new friend in the Project 2 database, I want to increment the friendsCount in the user document in Project 1. For this reason, I want to create a transaction/batch write to ensure consistency in the data.
How can I achieve this? Can I create a transaction or a batch write between different Firestore instances?
No, you cannot use the database transaction feature across multiple databases.
If it's absolutely required, I'd probably create a custom locking feature instead. From the wiki:
To allow several users to edit a database table at the same time and also prevent inconsistencies created by unrestricted access, a single record can be locked when retrieved for editing or updating. Anyone attempting to retrieve the same record for editing is denied write access because of the lock (although, depending on the implementation, they may be able to view the record without editing it). Once the record is saved or edits are canceled, the lock is released. Records can never be saved so as to overwrite other changes, preserving data integrity.
In database management theory, locking is used to implement isolation among multiple database users. This is the "I" in the acronym ACID.
Source: https://en.wikipedia.org/wiki/Record_locking
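A minimal sketch of such a lock, assuming the firebase-admin SDK and a hypothetical locks collection living in one of the two projects:

// Acquire a short-lived lock document before touching either database.
async function withLock(db1, resourceId, workFn) {
  const lockRef = db1.collection("locks").doc(resourceId);
  await db1.runTransaction(async (t) => {
    const snap = await t.get(lockRef);
    if (snap.exists && snap.data().expiresAt > Date.now()) {
      throw new Error("resource is locked by another writer");
    }
    t.set(lockRef, { expiresAt: Date.now() + 30000 }); // 30s lease
  });
  try {
    await workFn(); // the cross-project writes happen here
  } finally {
    await lockRef.delete(); // release the lock even on failure
  }
}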
It's been three years since the question, I know, but since I needed the same thing, I found a working solution for performing the double (or even n-fold) transaction. You have to nest the transactions like this:
db1.runTransaction(async t1 => {
  await db2.runTransaction(async t2 => {
    await t1.set(.....);
    await t2.update(.....);
    // etc....
  });
}).then(...).catch(...);
Since errors propagate through the nested promises, it is safe to execute the double transaction this way: a failure in any one of the databases results in an error in all of them.

Firebase query for bi-directional link

I'm designing a chat app much like Facebook Messenger. My two current root nodes are chats and users. A user has an associated list of chats at users/user/chats, and chats are added by autoId in the chats node, e.g. chats/a151jl1j6. That node stores information such as the list of messages, the time of the last message, whether someone is typing, etc.
What I'm struggling with is where to define which two users are in the chat. Originally, I put a reference to the other user as the value of the chatId key in the users/user/chats node, but I thought that was a bad idea in case I ever wanted group chats.
What seems more logical is to have a chats/chat/members node in which I define userId: true, user2Id: true. My issue with this is how to query it efficiently. For example, if the user is about to create a new chat with another user, we want to check whether a chat already exists between them. I'm not sure how to express the query "find the chat whose members contain currentUserId and friendUserId", or whether this is an efficient denormalized way of doing things.
Any hints?
Although the idea of having IDs in the format id1---||---id2 definitely gets the job done, it may not scale if you expect to have large groups, and you have to account for id2---||---id1 comparisons, which gets more complicated as more people join a conversation. Go with that if you don't need to worry about large groups.
I'd actually go with using the autoId chats/a151jl1j6, since you get it for free. The recommended way to structure the data is to make the autoId the key in the other nodes with related child objects: chats/a151jl1j6 would contain the conversation metadata, members/a151jl1j6 the members in that conversation, messages/a151jl1j6 the messages, and so on.
"chats":{
"a151jl1j6":{}}
"members":{
"a151jl1j6":{
"user1": true,
"user2": true
}
}
"messages":{
"a151jl1j6":{}}
The part where this gets a little "inefficient" is querying for conversations that include both user1 and user2. The recommended way is to create an index of conversations for each user and then query the members data:
"user1":{
"chats":{
"a151jl1j6":true
}
}
This is the trade-off of querying relationships with a flattened data structure: the queries are fast, since you are only dealing with a subset of the data, but you end up with a lot of duplicated data that needs to be accounted for when modifying/deleting, i.e. when a user leaves a chat conversation, you have to update multiple structures.
Reference: https://firebase.google.com/docs/database/ios/structure-data#flatten_data_structures
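For completeness, here is a sketch of the "does a chat between us already exist?" check against that index, assuming the namespaced (v8-style) Realtime Database web SDK; note it also matches group chats that happen to contain both users:

// Walk my chat index and test each chat's members node for the friend's ID.
async function findExistingChat(currentUserId, friendUserId) {
  const myChats = await firebase.database().ref(`users/${currentUserId}/chats`).once("value");
  const chatIds = Object.keys(myChats.val() || {});
  for (const chatId of chatIds) {
    const isMember = await firebase
      .database()
      .ref(`members/${chatId}/${friendUserId}`)
      .once("value");
    if (isMember.val() === true) return chatId; // found a shared conversation
  }
  return null; // no existing chat; create a new one
}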
I remember having a similar issue some time ago. The way I solved it:
user 1 has a unique ID, id1
user 2 has a unique ID, id2
Instead of adding a new chat by autoId chats/a151jl1j6, the ID of the chat was id1---||---id2 (a super-original, human-readable delimiter)
(which is exactly what you originally suggested)
Originally, I put a reference to the other user as the value of the chatId key in the users/user/chats node, but I thought that was a bad idea in case I ever wanted group chats.
There is a saying: https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it
There might be a limitation on how many user IDs can fit in the path; you can always hash the value...
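A sketch of that deterministic chat ID, including the hashing fallback, assuming Node's crypto module; the length threshold is an arbitrary placeholder:

const crypto = require("crypto");

// Sort the IDs so (id1, id2) and (id2, id1) map to the same chat key.
function chatIdFor(userIdA, userIdB) {
  const raw = [userIdA, userIdB].sort().join("---||---");
  // Database keys have a length limit; hash when the raw key gets too long.
  return raw.length <= 100 ? raw : crypto.createHash("sha1").update(raw).digest("hex");
}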
