Does moving from None to Consistent indexing mode in Cosmos DB re-index everything? - azure-cosmosdb

The Azure Cosmos DB documentation mentions the following in the index transformation section:
When you move to None indexing mode, the index is dropped immediately. Moving to None is useful when you want to cancel an in-progress transformation and start fresh with a different indexing policy.
I have a partitioned collection with a custom indexing policy and the indexing mode set to consistent. Since no transformation progress is available for a partitioned collection, I was thinking of switching to None mode before starting any new indexing policy update, to cancel any in-progress transformation, and then applying the new policy (consistent mode). But will this cause the entire index to be recreated, even if only a new path was added?
If the answer is yes, what is the best way to check that no transformation is in progress before starting a new one?

Welcome to Stack Overflow, Jatin! I am from the CosmosDB engineering team.
We do now have transformation progress for partitioned collections - this was released recently. It can be fetched by setting the header 'x-ms-documentdb-populatequotainfo' to True. This header, in addition to providing quota and usage information for a partitioned collection, is also used to piggyback the index transformation progress for partitioned collections.
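For illustration, a minimal sketch of checking that progress with the JavaScript SDK (@azure/cosmos). The database/container ids are placeholders, and the option and response header names below are my reading of the REST header mentioned above; please verify them against the SDK version you use:

```typescript
import { CosmosClient } from "@azure/cosmos";

const container = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!)
  .database("mydb")     // placeholder database id
  .container("mycoll"); // placeholder container id

// Ask the service to populate quota info so the index transformation
// progress is piggybacked on the response headers.
const response = await container.read({ populateQuotaInfo: true });

// Response header name as documented for the REST API; 100 means the
// transformation has completed.
const progress =
  response.headers["x-ms-documentdb-collection-index-transformation-progress"];
console.log(`Index transformation progress: ${progress}%`);
```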
However, you do not need to wait for the progress to complete before changing your indexing policy. Even though this means it could take longer for our background re-indexing to finish all previous index updates and catch up to your latest indexing policy, the transformations can be online, meaning that queries continue to work based on the changes between the indexing policy updates. For example, if the only change across the transformations was adding paths, queries on paths that are already indexed will continue to work throughout the re-indexing process, assuming the indexing mode was consistent through all the updates.
However, if you want your new index to potentially become effective sooner, then your suggestion of transitioning to None and then to Consistent could work. Please note that this is equivalent to an offline index rebuild: the existing index is deleted and the new one is constructed from scratch. Queries will not return consistent results on any path until the re-indexing completes.

Welcome to Stack Overflow!
Simply change the indexing policy from Custom to Consistent. Skip going to None and go directly to Consistent indexing policy:
All index transformations are performed online. The items indexed per the old policy are efficiently transformed per the new policy without affecting the write availability or the throughput provisioned to the container. The consistency of read and write operations that are performed by using the REST API, SDKs, or from stored procedures and triggers is not affected during index transformation.
Index transformations
While reindexing is in progress, your query may not return all matching results, if the queries happen to use the index that is being modified, and queries will not return any errors/failures. While reindexing is in progress, queries are eventually consistent regardless of the indexing mode configuration.
If you feel that None is a safe step before going to Consistent, you are free to do so, but it is not strictly necessary.
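For completeness, a minimal sketch of changing the policy in place with the JavaScript SDK (@azure/cosmos). The database/container ids and the included/excluded paths are placeholders, and the exact type names (e.g. IndexingMode) should be verified against the SDK version you use:

```typescript
import { CosmosClient, IndexingMode } from "@azure/cosmos";

const container = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!)
  .database("mydb")     // placeholder ids
  .container("mycoll");

// Read the current container definition, swap in the new indexing policy,
// and replace the container definition in place.
const { resource: definition } = await container.read();

definition!.indexingPolicy = {
  automatic: true,
  indexingMode: IndexingMode.consistent,
  includedPaths: [{ path: "/*" }],
  excludedPaths: [{ path: '/"_etag"/?' }],
};

await container.replace(definition!);
```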

Related

How to handle offline aggregation using Firestore?

I have been scouring the internet for days on a solution to this problem.
That is, how do you handle aggregation when there is no network connection? I have a task management app that aggregates metadata about user tasks. For example, a task can contain tags that are aggregated and shown in a dashboard to the user on a daily basis. This would be easy if the user were always online, since I could use a transaction or Cloud Function to aggregate, but when the user is offline, the aggregation will appear to be incorrect until the user restores their network connection.
Aggregation queries are explained here:
https://firebase.google.com/docs/firestore/solutions/aggregation
Which states a limitation:
Offline support - Client-side transactions will fail when the user's device is offline, which means you need to handle this case in your app and retry at the appropriate time.
However, there has yet to be any example or documentation on how to 'handle this case'. How would I go about addressing this problem?
Some thoughts:
I could cache the item if a transaction fails. This item would be aggregated on top of the stored aggregation. However, going down this route would mean I can't take advantage of Firestore's offline mode, because I'd be using my own cache on every write while offline anyway.
I could aggregate on demand. That is, never store the aggregation. This is going to be very read-heavy, depending on how many tasks a user has. Furthermore, if the aggregation needs to be shared as insights with other users, this option will not work, because other users do not have access to the tasks.
I'm at a loss and any help would be appreciated, thanks!
After a lot of research and trial and error, I found a solution that addresses this problem gracefully.
FieldValue.increment to the rescue.
What FieldValue.increment does is bypass the use of a transaction while respecting Firestore's default offline cache behaviour. It requires the use of set or update on the field directly. The drawback is the inability to use 'withConverter' on the collection for type safety. I'm willing to live with that drawback considering how useful FieldValue.increment is.
I've done multiple tests and can confirm that the values can be incremented/decremented multiple times locally while offline. This offline value is reflected in a get or snapshot call to the cache. When the network connection is restored, the values are updated on the server.
The value itself is not stored in the cache; the cache simply stores the "difference" in the FieldValue sentinel for when it is time to update it on the server.
This method only works with incrementing and decrementing values. Storing averages will not be possible using this method. That is because the true total number of items is not known at the time of its calculation when offline.
Instead, the total number of items is stored alongside the total value. The average is then calculated when and as needed. This way the average will always be accurate from a local perspective when offline, and it will also be accurate online once the total value and count have been synced.
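To make this concrete, here is a minimal sketch using the modular web SDK, where increment() is the counterpart of FieldValue.increment; the collection, document, and field names are made up for illustration:

```typescript
import { doc, getDoc, getFirestore, increment, updateDoc } from "firebase/firestore";

const db = getFirestore(); // assumes initializeApp(...) was called elsewhere

// Works offline: pending increments are coalesced in the local cache and
// applied on the server once the connection is restored.
// (Assumes the stats doc already exists; otherwise use setDoc with { merge: true }.)
await updateDoc(doc(db, "stats", "user123"), {
  totalMinutes: increment(42), // running total
  taskCount: increment(1),     // running count
});

// The average is never stored; compute it on read from total and count.
const snap = await getDoc(doc(db, "stats", "user123"));
const { totalMinutes, taskCount } = snap.data() ?? { totalMinutes: 0, taskCount: 0 };
const average = taskCount ? totalMinutes / taskCount : 0;
```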

Does DynamoDB expose an API to query or detect when there is a conflict in merging item data

DynamoDB is an AP system based on the original Dynamo paper.
Is there any API to detect when a merge conflict has happened or been resolved?
Is there any API to provide a strategy for resolving a conflict if it happens?
Your question is based on a wrong premise. Although DynamoDB shares the name, and some goals and implementation details, with the original "Dynamo" paper, it is not very close, and the data model in particular is completely different.
Whereas in the Dynamo paper multiple clients could store multiple different values for an item concurrently - and later readers need to resolve the conflict - DynamoDB does things very differently:
If two clients replace an item, DynamoDB offers "last write wins": one of these writes will win; you don't know or care which.
If two clients modify different attributes in the same item concurrently, both changes will be merged. I never found this explicitly promised, but it appears to work this way.
You also have a powerful conditional update feature, which can do a modification to a single item based on some condition on the old value of this item. These conditional updates are guaranteed to be isolated, so they can be used to ensure safe concurrent modification. For example, a conditional update can be used to implement so-called optimistic locking: An item has a version attribute among other attributes, a client reads the old item, decides what to change it to, and then does the write - with the condition that the version still hasn't changed. If the condition fails (because some other client raced us), the write fails and the client tries the whole process again (read again, apply a change, and write back).
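As an illustration of that optimistic-locking pattern, here is a minimal sketch with the AWS SDK for JavaScript v3; the table, key, and attribute names are invented for the example:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Read the current item, including its version attribute.
const { Item: current } = await ddb.send(
  new GetCommand({ TableName: "Orders", Key: { orderId: "o-1" } })
);

try {
  // Write back only if nobody else bumped the version in the meantime.
  await ddb.send(
    new UpdateCommand({
      TableName: "Orders",
      Key: { orderId: "o-1" },
      UpdateExpression: "SET #status = :next, version = version + :one",
      ConditionExpression: "version = :expected",
      ExpressionAttributeNames: { "#status": "status" },
      ExpressionAttributeValues: { ":next": "SHIPPED", ":one": 1, ":expected": current?.version },
    })
  );
} catch (err) {
  if ((err as Error).name === "ConditionalCheckFailedException") {
    // Another client raced us: re-read, re-apply the change, and retry.
  } else {
    throw err;
  }
}
```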
DynamoDB also has a new feature of full (multi-item) transactions. This feature did not exist in Dynamo at all.

Cosmos DB continuation token size influences whether query returns new documents

I was messing around with the Azure Cosmos DB (via .NET SDK) and noticed something odd.
Normally when I request a query page by page using continuation tokens, I never get documents that were created after the first continuation token had been created. I can observe changed documents, lack of removed (or rather newly filtered out) documents, but not the new ones.
However, if I only allow 1 kB continuation tokens (the smallest I can set), I get the new documents as well. As long as they end up sorted into the remaining pages, obviously.
This kind of makes sense: with the size limit, I prevent Cosmos DB from including the serialized index lookup and whatnot in the continuation token. As a downside, Cosmos DB has to recreate the resume state for every page I request, which will cost some extra RUs. At least according to this discussion. As a side-effect, new documents end up in the result.
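For illustration, this is roughly how the paging with a capped continuation token looks, sketched here with the @azure/cosmos JavaScript SDK (the experiments above used the .NET SDK, where the corresponding setting is ResponseContinuationTokenLimitInKb); the query, ids, and page size are placeholders:

```typescript
import { CosmosClient } from "@azure/cosmos";

const container = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!)
  .database("mydb")     // placeholder ids
  .container("mycoll");

let continuationToken: string | undefined;
do {
  const page = await container.items
    .query("SELECT * FROM c", {
      maxItemCount: 100,
      continuationToken,
      continuationTokenLimitInKB: 1, // smallest allowed size; forces the token to be pruned
    })
    .fetchNext();

  // ... process page.resources ...
  continuationToken = page.continuationToken;
} while (continuationToken);
```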
Now, I actually have a couple of questions in regards to this.
Is this behavior reliable? I'd love to see some documentation on this.
Is the amount of RUs saved by a larger continuation token significant?
Is there another way to get new documents included in the result?
Are my assumptions completely wrong?
I am from the CosmosDB Engineering Team.
Is this behavior reliable? I'd love to see some documentation on this.
We brought in this feature (limiting continuation token size) due to an ask from customers to help in reducing the response continuation size. We are of the opinion that it's too much detail to expose the effects of pruning the continuation, since for most customers the subtle behavior change shouldn't matter.
Is the amount of RUs saved by a larger continuation token significant?
This depends on the amount of work done in producing the state from the index. For example, if we had to evaluate a range predicate (e.g. _ts > some discrete second), then the RU saved could be significant, since we potentially avoid scanning a whole bunch of index keys corresponding to _ts (this could be O(number of documents), assuming the worst case of having inserted at most 1 document per second). In this scenario, assuming X continuations, we save (X - 1) * O(number of documents) worth of work.
Is there another way to get new documents included in the result?
No, not unless you force CosmosDB to re-evaluate the index on every continuation by setting the header to 1. Typically queries are meant to be executed fairly quickly over continuations, so the chance of users seeing new documents should be fairly small. Ideally we would implement snapshot isolation and retrieve results with the session token from the first continuation, but we haven't done this yet.
Are my assumptions completely wrong?
Your assumptions are spot on :)

Deleting very large collections in Firestore

I need to delete very large collections in Firestore.
Initially I used client-side batch deletes, but when the documentation changed and started to discourage that with the comments
Deleting collections from an iOS client is not recommended.
Deleting collections from a Web client is not recommended.
Deleting collections from an Android client is not recommended.
https://firebase.google.com/docs/firestore/manage-data/delete-data?authuser=0
I switched to a cloud function as recommended in the docs. The cloud function gets triggered when a document is deleted and then deletes all documents in a subcollection as proposed in the above link in the section on "NODE.JS".
The problem I am running into now is that the cloud function seems to be able to manage around 300 deletes per second. With the maximum cloud function runtime of 9 minutes, I can manage up to 162,000 deletes this way. But the collection I want to delete currently holds 237,560 documents, which makes the cloud function time out about halfway through.
I cannot trigger the cloud function again with an onDelete trigger on the parent document, as this one has already been deleted (which triggered the initial call of the function).
So my question is: What is the recommended way to delete large collections in Firestore? According to the docs it's not client side but server side, but the recommended solution does not scale for large collections.
Thanks!
When you have more work than can be performed in a single Cloud Function execution, you will need to either find a way to shard that work across multiple invocations, or continue the work in a subsequent invocation after the first. This is not trivial, and you have to put some thought and work into constructing the best solution for your particular situation.
For a sharding solution, you will have to figure out how to split up the document deletes ahead of time, and have your master function kick off subordinate functions (probably via pubsub), passing each the arguments it needs to figure out which shard to delete. For example, you might kick off a function whose sole purpose is to delete documents that begin with 'a', another for 'b', and so on, by querying for them and then deleting them.
For a continuation solution, you might just start deleting documents from the beginning, go for as long as you can before timing out, remember where you left off, then kick off a subordinate function to pick up where the prior one stopped.
You should be able to use one of these strategies to limit the amount of work done per function, but the implementation details are entirely up to you to work out.
If, for some reason, neither of these strategies is viable, you will have to manage your own server (perhaps via App Engine), and message it (via pubsub) to perform a single unit of long-running work in response to a Cloud Function.
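A rough sketch of the continuation approach, assuming the v1 Cloud Functions API and a Pub/Sub topic named purge-collection (the topic name, collection path, and chunk size are arbitrary choices for illustration, not an official recipe):

```typescript
import * as admin from "firebase-admin";
import * as functions from "firebase-functions/v1";
import { PubSub } from "@google-cloud/pubsub";

admin.initializeApp();
const db = admin.firestore();
const pubsub = new PubSub();

// Deletes one chunk per invocation, then re-publishes the same message so a
// fresh invocation picks up where this one left off.
// (Subcollections of the deleted documents would need the same treatment.)
export const purgeCollection = functions.pubsub
  .topic("purge-collection")
  .onPublish(async (message) => {
    const { path } = message.json as { path: string };

    const snapshot = await db.collection(path).limit(400).get();
    if (snapshot.empty) {
      return; // nothing left to delete
    }

    const batch = db.batch();
    snapshot.docs.forEach((docSnap) => batch.delete(docSnap.ref));
    await batch.commit();

    // Hand the remaining work to the next invocation.
    await pubsub.topic("purge-collection").publishMessage({ json: { path } });
  });
```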

firebase database equivalent of MySQL transaction

I'm looking for something where I can thread a single object through multiple updates to multiple firebase.database.References (before performing a commit), then commit it at the end, and if the commit is unsuccessful, have no changes made to any of my Firebase references.
Does this exist? I thought firebase.database.Transaction would be similar, since it is an atomic update and it involves a callback that says whether it has been committed or not, but the update function, I believe, only works on a single object, and it doesn't seem to return a transaction ID or anything I could pass to other firebase.database.Transactions.
UPDATE
The Firestore Transaction's update seems to return a Transaction, which would perhaps lend itself to chaining: https://firebase.google.com/docs/reference/js/firebase.firestore.Transaction
However, this is different from the other Transaction:
Firebase Database transactions perform an update to a single location based on the current value of that same location. They explicitly do not work across multiple locations, since that would limit their scalability. Sometimes developers work around this by performing a transaction higher up in their JSON tree (at the first common point of the locations). I'd recommend against that, as that would limit the scalability even further.
The only way to efficiently update multiple locations with one API call is with a multi-location update. This, however, does not have reading of the current value built in.
So if you want to update multiple locations based on their current value, you'll have to perform the read operation in your application code, turn that into a multi-location update, and then use security rules to ensure all of those updates follow your application rules. This is quite a non-trivial approach, so I hardly see it being done in practice. See my answer here for an example: Is the way the Firebase database quickstart handles counts secure?
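For example, a minimal multi-location update with the modular web SDK, with made-up paths; the read happens in application code, and security rules would have to enforce that the two writes stay consistent:

```typescript
import { getDatabase, get, ref, update } from "firebase/database";

const db = getDatabase(); // assumes initializeApp(...) was called elsewhere

// Read the current value in application code...
const countRef = ref(db, "stats/user456/completedCount");
const current = (await get(countRef)).val() ?? 0;

// ...then fan the change out as one atomic multi-location update:
// either both writes succeed or neither does.
await update(ref(db), {
  "tasks/task123/done": true,
  "stats/user456/completedCount": current + 1,
});
```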
