Hi, I am wondering how the internal mechanism of subscribing to an Azure Cosmos DB change feed actually works, specifically if you are using azure-cosmosdb-js from Node. Is there some sort of long-polling mechanism that checks a change feed table, or are events pushed to the subscriber over WebSockets?
Are there any limits on the number of subscriptions you can have to any partition key's change feed?
Imagine the change feed as nothing other than an event source that keeps track of document changes.
All the actual change-feed-consuming logic is abstracted into the SDKs. The server simply exposes the change feed as something the SDK can pull from; that is what the change feed processor libraries use to operate.
We don't know much about the Change Feed Processor SDKs mainly because they are not open source. (Edit: Thanks to Matias for pointing out that they are actually open source now). However, from extensive personal usage I can tell you the following.
The Change Feed Processor needs a collection to store some documents. Those documents are just checkpoints for the processor to keep track of its consumption. Each main document in that lease collection corresponds to a physical partition. Each physical partition is polled by the processor at a set interval; you can control this via the FeedPollDelay setting, which "gets or sets the delay in between polling a partition for new changes on the feed, after all current changes are drained".
The library is also capable of spreading the leases across instances if multiple processors are running against a single collection. If one instance fails, the running instances will pick up its leases. Due to polling and delays you might end up reprocessing already-processed documents. You can also choose to set the CheckpointFrequency of the change feed processor.
In terms of "subscriptions", you can have as many as you want. Keep in mind, however, that the change feed processor writes the lease documents. They are smaller than 1 KB, so you will be paying the minimum charge of 10 RUs per change. However, if you end up with more than 40 physical partitions you might have to raise the throughput above the minimum 400 RU/s.
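If it helps to see the mechanics, here is a minimal sketch of what pull-based consumption looks like from Node with the @azure/cosmos SDK, where a stored continuation token plays the role the lease documents play for the processor. The checkpoint helpers, container names and partition key are placeholders, and exact method names can differ between SDK versions:

```typescript
import { CosmosClient } from "@azure/cosmos";

// Hypothetical in-memory checkpoint store; the real processor keeps this in lease documents.
const checkpoints = new Map<string, string>();
const loadToken = (key: string) => checkpoints.get(key);
const saveToken = (key: string, token: string) => checkpoints.set(key, token);

async function pollChangeFeed(): Promise<void> {
  const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!); // assumed env var
  const container = client.database("mydb").container("mycontainer");     // assumed names
  const partitionKey = "some-partition-key";

  const savedToken = loadToken(partitionKey);
  const iterator = container.items.changeFeed(
    partitionKey,
    savedToken
      ? { continuation: savedToken, maxItemCount: 100 }
      : { startFromBeginning: true, maxItemCount: 100 }
  );

  while (iterator.hasMoreResults) {
    const response = await iterator.fetchNext();
    if (!response.result || response.result.length === 0) {
      break; // feed is drained for now; come back after a delay (the FeedPollDelay equivalent)
    }
    for (const doc of response.result) {
      console.log("changed document:", doc.id);
    }
    if (response.continuation) {
      saveToken(partitionKey, response.continuation); // checkpoint, like the lease document does
    }
  }
}

// Nothing is pushed by the server: you simply run this poll loop on an interval.
setInterval(() => pollChangeFeed().catch(console.error), 5000);
```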
The system I'm working on has multiple environments, each running in separate Azure regions. Our CosmosDB is replicated to these regions and multi-region writes are enabled. We're using the default consistency model (Session).
We have Azure Functions that use the Cosmos DB trigger deployed in all three regions. Currently these use the same lease prefix, which means that only one function processes changes at any given time. I know that we can set each region to have a different lease prefix to enable concurrent processing, but I'd like to solidify my understanding before taking this step.
My question is about the behaviour of the change feed with regard to replication in this scenario. According to this link https://github.com/MicrosoftDocs/azure-docs/issues/42248#issuecomment-552207409, data is first converged on the primary region and then the change feed is updated.
Other resources I've read seem to suggest that each region has its own change feed, which is updated upon replication. Also, the previous link recommends only running a change feed processor in the primary region when using multi-master.
In an ideal world, I'd like change feed processors in each region to handle local writes quickly. These functions will make updates to Cosmos DB, and I also want to avoid issues with replication. My question is: what is the actual behaviour in a multi-master configuration (and, by extension, the correct architecture)? Is it "safe" to use per-region change feed processors, or should we use a single processor in the primary region?
You cannot have per-region Change Feed Processors that only process the local changes, because the Change Feed in each region contains the local writes plus the replicated writes from every other region.
Technically you can use a single Change Feed Processor deployment connecting to one of the regions to process the events from all the regions.
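For reference, this is roughly what a single trigger-based deployment looks like in TypeScript (Functions Node.js programming model). The binding settings mentioned in the comments live in function.json, and their exact names depend on the Cosmos DB extension version, so treat this as a sketch rather than a drop-in:

```typescript
import { AzureFunction, Context } from "@azure/functions";

// Wired up by a cosmosDBTrigger binding in function.json. Relevant settings (exact
// property names vary by Cosmos DB extension version) include:
//   - leaseCollectionName / leaseCollectionPrefix: distinct prefixes create independent
//     "subscriptions", but each one still sees ALL writes, local and replicated.
//   - preferredLocations: which region this deployment reads the feed from.
const cosmosTrigger: AzureFunction = async function (
  context: Context,
  documents: Array<Record<string, unknown>>
): Promise<void> {
  if (!documents || documents.length === 0) {
    return;
  }
  for (const doc of documents) {
    // Each batch contains local writes plus writes replicated from the other regions.
    context.log(`Processing document ${doc["id"]}`);
  }
};

export default cosmosTrigger;
```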
Is it possible to limit the speed at which Google Firestore pushes writes made in an app to the online database?
I'm investigating the feasibility of using Firestore to store a data stream from an IoT device via a mobile device/bluetooth.
The main concern is battery cost. The device produces a new data packet roughly every two minutes, and I'm concerned about the additional battery drain that an internet round trip every two minutes, 24 hours a day, would cost. I would also want to limit updates to Wi-Fi connections only.
It's not important for the data to be available online in real time. However, it is possible for multiple sources to add to the same data stream in a two-way sync (i.e. the online DB and all devices end up with the merged data).
I'm currently handling that myself, but when I saw the offline capability of Firestore I hoped I could get all that functionality for free.
I know we can't directly control offline-mode in Firestore, but is there any way to prevent it from always and immediately pushing write changes online?
The only technical question I can see here has to do with how batch writes operate and, more importantly, what they cost. Simply put, a batch write of 100 writes costs the same as performing 100 writes individually. Batching is not a way to avoid Firestore's write costs; the same goes for transactions, and for editing a document (that's a write too). If you really want to avoid those costs, you could store the values for the thirty minutes and let the client send the aggregated data in a single document. You mentioned you need the data to be immediate, though, so I'm not sure that's an option for you. Of course, this depends on what one interprets "immediate" as, relative to the overall timespan; in my opinion (I know opinions aren't really allowed here, but it's kind of part of the question), if the data is stored over months or years, 30 minutes is fairly immediate. Either way, batch writes aren't quite the solution I think you're looking for.
EDIT: You've updated your question, so I'll update my answer. You can build a local cache system and choose how and when you upload; that's completely up to you and your own code. Writes aren't really automatic, so if you only want to send a data packet every hour, you'd send it at that time. You'll likely want to do this in a transaction if multiple devices write to the same stream, so one doesn't overwrite the other if they send at the same time (see the sketch below). Other than that, I don't see Firestore being a problem for you.
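To make that concrete, here is a minimal sketch (Firestore Web SDK v9, modular API) of buffering readings locally and flushing them on your own schedule into a single per-day document inside a transaction, so two devices don't clobber each other. The collection name, document layout and flush policy are assumptions for illustration only:

```typescript
import { initializeApp } from "firebase/app";
import { getFirestore, doc, runTransaction } from "firebase/firestore";

// Assumed config; replace with your own project settings.
const app = initializeApp({ projectId: "my-iot-project" });
const db = getFirestore(app);

interface Reading {
  deviceId: string;
  takenAt: number; // epoch millis
  value: number;
}

const buffer: Reading[] = [];

// Called every ~2 minutes when a packet arrives; no network traffic happens here.
export function recordReading(reading: Reading): void {
  buffer.push(reading);
}

// Called on your own schedule (e.g. hourly, and only on Wi-Fi, as checked by your app code).
export async function flushReadings(): Promise<void> {
  if (buffer.length === 0) return;
  const batchToSend = buffer.splice(0, buffer.length);
  const dayKey = new Date().toISOString().slice(0, 10); // one document per day (assumption)
  const dayDoc = doc(db, "streams", dayKey);

  try {
    // Transaction: merge with whatever other devices have already written for the day.
    await runTransaction(db, async (tx) => {
      const snapshot = await tx.get(dayDoc);
      const existing = (snapshot.data()?.readings ?? []) as Reading[];
      tx.set(dayDoc, { readings: [...existing, ...batchToSend] });
    });
  } catch (err) {
    buffer.unshift(...batchToSend); // put readings back so they are retried on the next flush
    throw err;
  }
}
```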
Is there a way to automatically move expired documents to blob storage via change feed?
I googled but found no solution for automatically moving expired documents to Blob Storage via the change feed option. Is it possible?
There is no built-in functionality for something like that, and the change feed would be of no use in this case.
The change feed processor (which is what the Azure Functions trigger uses too) won't notify you about deleted documents, so you can't listen for them.
Your best bet is to write a custom application that does scheduled archiving and then deletes the archived documents.
As stated in the Cosmos DB TTL documentation: when you configure TTL, the system will automatically delete the expired items based on the TTL value, unlike a delete operation that is explicitly issued by the client application.
So it is controlled by the Cosmos DB system, not the client side. You could follow and vote up this feedback item to push the progress of Cosmos DB.
To come back to this question, one approach I've found that works is to make use of the built-in TTL (letting Cosmos DB expire documents) and to have a backup script that queries for documents that are near their TTL, with a safety window in case of latency; e.g. I set the window at up to 24 hours.
The main reasoning for this is that issuing deletes as a query not only uses RUs, but quite a lot of them. Even when you slim your indexes down you can still have massive RU usage, whereas letting Cosmos TTL the documents itself induces no extra RU use.
A pattern I came up with to help is to have my backup script enable the container-level TTL when it starts (doing an initial backup first to ensure no data loss occurs immediately) and to wrap the work in a try/except/finally, with the finally block removing/turning off the TTL so it can't remove data if my script is down for longer than the window. I'm not yet sure of the performance hit that might occur on large containers when the index has to be updated, but logically this approach seems to be effective.
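For what it's worth, a rough sketch of that backup pass using @azure/cosmos and @azure/storage-blob is below. The TTL value, safety window, container names and the use of _ts to spot documents approaching expiry are all assumptions based on the pattern described above, not a drop-in implementation:

```typescript
import { CosmosClient } from "@azure/cosmos";
import { BlobServiceClient } from "@azure/storage-blob";

const TTL_SECONDS = 30 * 24 * 60 * 60;      // assumed container-level TTL of 30 days
const SAFETY_WINDOW_SECONDS = 24 * 60 * 60; // back up anything within 24h of expiry

async function backupExpiringDocuments(): Promise<void> {
  const cosmos = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!);
  const container = cosmos.database("mydb").container("mycontainer"); // assumed names

  const blobService = BlobServiceClient.fromConnectionString(
    process.env.STORAGE_CONNECTION_STRING!
  );
  const archive = blobService.getContainerClient("cosmos-archive");
  await archive.createIfNotExists();

  // _ts is the server-side last-modified timestamp (epoch seconds); anything older than
  // (TTL - safety window) is close to being expired by Cosmos itself.
  const cutoff = Math.floor(Date.now() / 1000) - (TTL_SECONDS - SAFETY_WINDOW_SECONDS);
  const query = {
    query: "SELECT * FROM c WHERE c._ts <= @cutoff",
    parameters: [{ name: "@cutoff", value: cutoff }],
  };

  const { resources } = await container.items.query(query).fetchAll();
  for (const docItem of resources) {
    const body = JSON.stringify(docItem);
    const blob = archive.getBlockBlobClient(`${docItem.id}.json`);
    await blob.upload(body, Buffer.byteLength(body));
    // No delete issued here: Cosmos DB's TTL removes the original at no extra RU cost.
  }
}

backupExpiringDocuments().catch(console.error);
```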
I am using the Google Calendar API to preprocess events that are being added (adjust their content depending on certain values they may contain). This means that theoretically I need to update any number of events at any given time, depending on how many are created.
The Google Calendar API has usage quotas, especially one stating a maximum of 500 operations per 100 seconds.
To tackle this I am using a time-based trigger (every 2 minutes) that does up to 500 operations (and only updates sync tokens when all events are processed). The downside of this approach is that I have to run a check every 2 minutes, whether or not anything has actually changed.
I would like to replace the time-based trigger with a watch. I'm not sure though if there is any way to limit the amount of watch calls so that I can ensure the 100 seconds quota is not exceeded.
My research so far shows me that it cannot be done. I'm hoping I'm wrong. Any ideas on how this can be solved?
AFAIK, that is one of the best practices suggested by Google. Using watch and push notifications allows you to eliminate the extra network and compute costs involved in polling resources to determine whether they have changed. Here are some tips from this blog on how best to manage working within the quota:
Use push notifications instead of polling.
If you cannot avoid polling, make sure you only poll when necessary (for example, poll only rarely at night).
Use incremental synchronization with sync tokens for all collections instead of repeatedly retrieving all the entries.
Increase page size to retrieve more data at once by using the maxResults parameter.
Update events when they change, avoid re-creating all the events on every sync.
Use exponential backoff for error retries.
Also, if you cannot avoid exceeding your current limit, you can always request additional quota.
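To illustrate the incremental-sync and backoff tips above, here is a hedged sketch using the googleapis Node client. The auth setup, sync-token storage and retry limits are assumptions rather than anything Google prescribes:

```typescript
import { google } from "googleapis";

// Assumed: OAuth2 credentials supplied via environment variables, and a durable place
// to persist the sync token between runs (kept in memory here for brevity).
const auth = new google.auth.OAuth2(
  process.env.GOOGLE_CLIENT_ID,
  process.env.GOOGLE_CLIENT_SECRET
);
auth.setCredentials({ refresh_token: process.env.GOOGLE_REFRESH_TOKEN });

const calendar = google.calendar({ version: "v3", auth });
let storedSyncToken: string | undefined;

async function syncEvents(): Promise<void> {
  let pageToken: string | undefined;
  do {
    const res = await withBackoff(() =>
      calendar.events.list({
        calendarId: "primary",
        maxResults: 2500,           // bigger pages mean fewer calls against the quota
        syncToken: storedSyncToken, // incremental sync: only changed events come back
        pageToken,
      })
    );
    for (const event of res.data.items ?? []) {
      console.log("changed event:", event.id, event.status); // adjust/process the event here
    }
    pageToken = res.data.nextPageToken ?? undefined;
    if (!pageToken && res.data.nextSyncToken) {
      storedSyncToken = res.data.nextSyncToken; // persist only once everything is processed
    }
  } while (pageToken);
  // Note: a 410 response means the sync token expired and a full re-sync is needed.
}

// Simple exponential backoff for rate-limit style errors (403/429).
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const status = err?.response?.status ?? err?.code;
      if (attempt >= maxAttempts - 1 || (status !== 403 && status !== 429)) throw err;
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000 + Math.random() * 250));
    }
  }
}
```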
I'm using Firebase to store user profiles. I tried to put the minimum amount of data in each user profile (following the good practices advised in the documentation about structuring data), but as I have more than 220K user profiles, it still amounts to 150MB when downloading all user profiles as JSON.
And of course, it will grow bigger and bigger as I intend to have a lot more users :)
I can't do queries on those user profiles anymore because each time I do that, I reach 100% Database I/O capacity and thus some other requests, performed by users currently using the app, end up with errors.
I understand that when using queries, Firebase needs to consider all the data in the list and thus read it all from disk, and 150MB of data seems to be too much.
So is there an actual limit before reaching 100% Database I/O capacity? And what is exactly the usefulness of Firebase queries in that case?
If I simply have small amounts of data, I don't really need queries, I could easily download all data. But now that I have a lot of data, I can't use queries anymore, when I need them the most...
The core problem here isn't the query or the size of the data, it's simply the time required to warm the data into memory (i.e. load it from disk) when it's not being frequently queried. It's likely to be only a development issue, as in production this query would likely be a more frequently used asset.
But if the goal is to improve performance on initial load, the only reasonable answer here is to query on less data. 150MB is significant. Try copying a 150MB file between computers over a wireless network and you'll have some idea what it's like to send it over the internet, or to load it into memory from a file server.
A lot here depends on the use case, which you haven't included.
Assuming you have fairly standard search criteria (e.g. you search on email addresses), you can use indices to store email addresses separately to reduce the data set for your query.
/search_by_email/$user_id/<email address>
Now, rather than 50k per record, you have only the bytes needed to store the email address per record, a much smaller payload to warm into memory.
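Here is a small sketch of that index in practice with the Realtime Database modular JS SDK; the paths mirror the structure above, and you would also add an ".indexOn": ".value" rule under /search_by_email so the query is served from an index:

```typescript
import { initializeApp } from "firebase/app";
import {
  getDatabase, ref, set, get, query, orderByValue, equalTo,
} from "firebase/database";

const app = initializeApp({ databaseURL: "https://my-app.firebaseio.com" }); // assumed config
const db = getDatabase(app);

// Keep the index in sync whenever a profile is written:
// /search_by_email/$user_id = <email address>
export async function indexUserEmail(userId: string, email: string): Promise<void> {
  await set(ref(db, `search_by_email/${userId}`), email);
}

// Look up user ids by email against the small index instead of loading full profiles.
export async function findUserIdsByEmail(email: string): Promise<string[]> {
  const q = query(ref(db, "search_by_email"), orderByValue(), equalTo(email));
  const snapshot = await get(q);
  const ids: string[] = [];
  snapshot.forEach((child) => {
    ids.push(child.key as string);
  });
  return ids;
}
```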
Assuming you're looking for robust search capabilities, the best answer is to use a real search engine. For example, enable private backups and export to BigQuery, or go with ElasticSearch (see Flashlight for an example).