Automatically archive Cosmos DB documents on TTL expiry - azure-cosmosdb

Is there a way to automatically move expired documents to blob storage via the change feed?
I searched Google but found no solution for automatically moving expired documents to blob storage via the change feed. Is it possible?

There is no built-in functionality for something like that, and the change feed would be of no use in this case.
The change feed processor (which is what the Azure Function trigger uses too) won't notify you of deleted documents, so you can't listen for them.
Your best bet is to write a custom application that does scheduled archiving and deletes the archived documents.
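As a rough illustration of such a custom archiver, here is a minimal TypeScript sketch using @azure/cosmos and @azure/storage-blob. The database, container, and blob names, the 30-day cutoff, and the partition key property are placeholders for this example, not anything from the original answer; error handling and throughput pacing are left out.

```typescript
// Sketch of a scheduled archive-then-delete job (run it from a cron job or timer trigger).
import { CosmosClient } from "@azure/cosmos";
import { BlobServiceClient } from "@azure/storage-blob";

const cosmos = new CosmosClient(process.env.COSMOS_CONN_STRING!);
const blobService = BlobServiceClient.fromConnectionString(process.env.BLOB_CONN_STRING!);

async function archiveOldDocuments(): Promise<void> {
  const container = cosmos.database("mydb").container("events");   // placeholder names
  const archive = blobService.getContainerClient("cosmos-archive"); // placeholder name
  await archive.createIfNotExists();

  // _ts is the server-maintained last-modified epoch (seconds); archive anything older than 30 days.
  const cutoff = Math.floor(Date.now() / 1000) - 30 * 24 * 60 * 60;
  const iterator = container.items.query({
    query: "SELECT * FROM c WHERE c._ts < @cutoff",
    parameters: [{ name: "@cutoff", value: cutoff }],
  });

  while (iterator.hasMoreResults()) {
    const { resources } = await iterator.fetchNext();
    for (const doc of resources ?? []) {
      const body = JSON.stringify(doc);
      // Upload first; delete only after the blob write succeeds.
      await archive.getBlockBlobClient(`${doc.id}.json`).upload(body, Buffer.byteLength(body));
      await container.item(doc.id, doc.partitionKey).delete(); // use your own partition key value
    }
  }
}
```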

As stated in the Cosmos DB TTL documentation: "When you configure TTL, the system will automatically delete the expired items based on the TTL value, unlike a delete operation that is explicitly issued by the client application."
So it is controlled by the Cosmos DB system, not the client side. You could follow and vote up this feedback item to push for progress from the Cosmos DB team.

To come back to this question, one way I've found that works is to make use of the built-in TTL (let Cosmos DB expire documents) and to have a backup script that queries for documents that are near their TTL, but with a safe window in case of latency - e.g. I use a window of up to 24 hours.
The main reasoning for this is that issuing the deletes yourself not only uses RUs, but quite a lot of them. Even when you slim your indexes down you can still see massive RU usage, whereas letting Cosmos TTL the documents itself incurs no extra RU cost.
A pattern that I came up with to help is to have my backup script enable the container-level TTL when it starts (doing an initial backup first to ensure no data loss occurs instantly) and to have a try-except-finally, with the finally removing/turning off the TTL to stop it potentially removing data in case my script is down for longer than the window. I'm not yet sure of the performance hit that might occur on large containers when it has to update indexes, but in a logical sense this approach seems to be effective.
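To make that pattern concrete, here is a rough TypeScript sketch using @azure/cosmos. The database/container names, the 7-day TTL, the 24-hour safety window, and the writeToBlobStorage helper are all assumptions for illustration, not part of the original answer.

```typescript
// Back up documents that are approaching expiry, enable container-level TTL only while
// the job runs, and turn it off again in a finally block as a safety guard.
import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient(process.env.COSMOS_CONN_STRING!);
const container = client.database("mydb").container("events"); // placeholder names

const ttlSeconds = 7 * 24 * 60 * 60;      // container-level TTL once enabled (assumed value)
const safetyWindowSeconds = 24 * 60 * 60; // back up anything expiring within the next 24h

async function runBackupWithTtlGuard(): Promise<void> {
  const { resource: def } = await container.read();

  // Initial backup before TTL is enabled, so nothing can expire un-archived.
  await backupExpiringDocuments();

  try {
    // Enable container-level TTL for the duration of the run.
    await container.replace({ ...def!, defaultTtl: ttlSeconds });

    // ...main loop: keep calling backupExpiringDocuments() on a schedule while the job runs...
    await backupExpiringDocuments();
  } finally {
    // Turn TTL off again so nothing expires while the backup job is down
    // longer than the safety window.
    await container.replace({ ...def!, defaultTtl: undefined });
  }
}

async function backupExpiringDocuments(): Promise<void> {
  const now = Math.floor(Date.now() / 1000);
  const oldestSafeTs = now - ttlSeconds + safetyWindowSeconds;
  const iterator = container.items.query({
    query: "SELECT * FROM c WHERE c._ts < @oldestSafeTs",
    parameters: [{ name: "@oldestSafeTs", value: oldestSafeTs }],
  });
  while (iterator.hasMoreResults()) {
    const { resources } = await iterator.fetchNext();
    for (const doc of resources ?? []) {
      await writeToBlobStorage(doc); // e.g. with @azure/storage-blob, as in the earlier sketch
    }
  }
}

async function writeToBlobStorage(doc: unknown): Promise<void> {
  /* upload the document as JSON to blob storage */
}
```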

Related

DynamoDB: Time frame to avoid stale read

I am writing to DynamoDB using AWS Lambda and using the AWS Console to read data from DynamoDB. But I have seen instances of stale reads, with the latest records not showing up when I pull records within a few minutes of the write. What is a safe time interval for a data pull that would ensure the latest data is available on read? Would 30 minutes be a safe interval?
The quote below is from the AWS documentation; I just want to understand how recent "recent" is here: "When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data."
Regards,
Dbeings
If you must have a strongly consistent read, you can specify that in your read statement. That way the client will always read from the leader storage node for that partition.
In order for DynamoDB to acknowledge a write, the write must be durable on the leader storage node for that partition and one other storage node for the partition.
If you do an eventually consistent read (which is the default), there is a 1-in-3 chance that the read is served by a node that was not part of the write, and an even smaller chance that the item has not yet been replicated to that third storage node.
So, if you need a strongly consistent read, ask for one and you'll get the newest of that item. There is no real performance degradation for doing a strongly consistent read.
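For example, with the AWS SDK for JavaScript v3 a strongly consistent read is just a flag on the request; the table and key names below are placeholders:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function readLatest(orderId: string) {
  const { Item } = await docClient.send(
    new GetCommand({
      TableName: "Orders",      // placeholder table
      Key: { orderId },         // placeholder key schema
      // Forces the read to be served by the leader storage node for the partition,
      // so it reflects all previously acknowledged writes (default is eventually consistent).
      ConsistentRead: true,
    })
  );
  return Item;
}
```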

Only download changed/updated documents from Firestore

Being new to Firestore, I am trying to keep the number of document downloads as small as possible. I figured that I could download documents only once and store them offline. If something is changed in the cloud, download a new copy and replace the offline version of that document (give relevant documents a last-change timestamp and download when the local version is older). I haven't started with this yet, but I am going to assume something like this must already exist, right?
I'm not sure where to start, and Google isn't giving me many answers with the exception of enablePersistence() on the FirebaseFirestore instance. I have a feeling that this is not what I am looking for, since it would be weird to artificially turn the network on and off every time I want to check for changes.
Am I missing something or am I about to discover an optimisation solution to this problem?
What you're describing isn't something that's built in to Firestore. You're going to have to design and build it yourself using the Firestore capabilities described in the documentation. Persistence is enabled by default, but that's not going to solve your problem, which is rather broadly stated here.
The bottom line is that neither Firestore nor its local cache has any notion of downloading only the documents that have changed, by whatever definition of "change" you choose. When you query Firestore, it will by default always go to the server and download the full set of documents that match your query. If you want to query only the local cache, you can do that as well, but it will not consult the server or be aware of any changes. These capabilities are not sufficient for the optimization you describe.
If you want to get only document changes since the last time you queried, you're going to have to design your data so that you can make such a query, perhaps by adding a timestamp field to each document, and use that to query for documents that have changed since the last time you made a query. You might also need to manage your own local cache, since the SDK's local cache might not be flexible enough for what you want.
I recommend that you read this blog post that describes in more detail how the local cache actually works, and how you might take advantage of it (or not).
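A minimal sketch of that timestamp-based delta sync with the Firebase v9+ web SDK might look like the following; the "notes" collection and updatedAt field are assumptions for the example, and you would still need to persist your own lastSync marker locally:

```typescript
import { initializeApp } from "firebase/app";
import {
  getFirestore, collection, query, where, orderBy, getDocs,
  doc, setDoc, serverTimestamp, Timestamp,
} from "firebase/firestore";

const db = getFirestore(initializeApp({ /* your config */ }));

// On every write, stamp the document so delta queries can pick it up later.
async function saveNote(id: string, data: Record<string, unknown>): Promise<void> {
  await setDoc(doc(db, "notes", id), { ...data, updatedAt: serverTimestamp() }, { merge: true });
}

// Download only documents modified after the last sync point you stored locally.
async function fetchChangedSince(lastSync: Timestamp) {
  const q = query(
    collection(db, "notes"),
    where("updatedAt", ">", lastSync),
    orderBy("updatedAt", "asc")
  );
  const snapshot = await getDocs(q);
  return snapshot.docs.map((d) => ({ id: d.id, ...d.data() }));
}
```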

Will Google Firestore always write to the server immediately in a mobile app with frequent writes?

Is it possible to limit the speed at which Google Firestore pushes writes made in an app to the online database?
I'm investigating the feasibility of using Firestore to store a data stream from an IoT device via a mobile device/bluetooth.
The main concern is battery cost - we receive a new data packet roughly every two minutes, and I'm concerned about the additional battery drain that an internet round trip every two minutes, 24 hours a day, would cost. I would also want to limit updates to Wi-Fi connections only.
It's not important for the data to be available online in real time. However, it is possible for multiple sources to add to the same data stream in a two-way sync (i.e. the online DB and all devices end up with the merged data).
I'm currently handling that myself, but when I saw the offline capability of Firestore I hoped I could get all that functionality for free.
I know we can't directly control offline-mode in Firestore, but is there any way to prevent it from always and immediately pushing write changes online?
The only technical question I can see here has to do with how batch writes operate and, more importantly, cost. Simply put, a batch write of 100 writes costs the same as 100 individual writes. Batching is not a way to avoid Firestore's write costs. The same goes for transactions, and for editing a document (that's a write). If you really want to avoid those costs, you could store the values locally for the thirty minutes and have the client send the aggregated data in a single document. Though you mentioned you need data to be immediate, so I'm not sure that's an option for you. Of course, this depends on what one interprets "immediate" as, relative to the timespan involved. In my opinion (I know those aren't really allowed here, but it's kind of part of the question), if the data is stored over months/years, 30 minutes is fairly immediate. Either way, batch writes aren't quite the solution I think you're looking for.
EDIT: You've updated your question, so I'll update my answer. You can build a local cache system and choose how and when you upload, however you wish; that's completely up to you and your own code. Writes aren't automatic. So if you want to send a data packet only every hour, you'd send it at that time. You'll likely want to do this in a transaction if multiple devices will write to the same stream, so one doesn't overwrite the other if they're sending at the same time. Other than that, I don't see Firestore being a problem for you.
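As a sketch of that "aggregate locally, flush on your own schedule" idea (shown here with the Firebase web SDK in TypeScript purely for illustration; the streams collection, document shape, and flush interval are assumptions, not anything prescribed by Firestore):

```typescript
import { getFirestore, doc, runTransaction, arrayUnion } from "firebase/firestore";

interface Reading { takenAt: number; value: number; }

const db = getFirestore();
const buffer: Reading[] = [];

// Called every ~2 minutes when a packet arrives over Bluetooth; no network traffic here.
function onPacket(reading: Reading): void {
  buffer.push(reading);
}

// Called once per hour (and only on Wi-Fi, if you gate it yourself). A transaction keeps
// concurrent flushes from different devices from overwriting each other's readings.
async function flush(streamId: string): Promise<void> {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length); // in real code, restore these if the flush fails
  const streamRef = doc(db, "streams", streamId);

  await runTransaction(db, async (tx) => {
    const snapshot = await tx.get(streamRef);
    if (!snapshot.exists()) {
      tx.set(streamRef, { readings: batch });
    } else {
      tx.update(streamRef, { readings: arrayUnion(...batch) });
    }
  });
}
```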

How does azure cosmosdb change feed internal communication work?

Hi, I am wondering how the internal mechanism of subscribing to an Azure Cosmos DB change feed actually works, specifically if you are using azure-cosmosdb-js from Node. Is there some sort of long-polling mechanism that checks a change feed table, or are events pushed to the subscriber using WebSockets?
Are there any limits on the number of subscriptions that you can have to any partition keys change feed?
Imagine the change feed as nothing other than an event source that keeps track of document changes.
All the actual change feed consuming logic is abstracted into the SDKs. The server just offers the change feed as something the SDK can consume. That is what the change feed processor libraries are using to operate.
We don't know much about the Change Feed Processor SDKs mainly because they are not open source. (Edit: Thanks to Matias for pointing out that they are actually open source now). However, from extensive personal usage I can tell you the following.
The Change Feed Processor needs a collection to store some documents. Those documents are just checkpoints the processor uses to keep track of consumption. Each main document in those lease collections corresponds to a physical partition. Each physical partition is polled by the processor at a set interval. You can control that interval via the FeedPollDelay setting, which "gets or sets the delay in between polling a partition for new changes on the feed, after all current changes are drained."
The library is also capable of spreading the leases if multiple processors are running against a single collection. If a service fails, the running services will pick up the lease. Due to polling and delays you might end up reprocessing already processed documents. You can also choose to set the CheckpointFrequency of the change feed processor.
In terms of "subscriptions", you can have as many as you want. Keep in mind however that the change feed processor is writing the lease documents. They are smaller than 1kb so you will be paying the minimum charge of 10 RUs per change. However, if you end up with more than 40 physical partitions you might have to raise the throughput from the minimum 400 RU/s.
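For completeness, this is roughly what polling the change feed yourself looks like from Node. It assumes a recent @azure/cosmos release that exposes the pull-model iterator (getChangeFeedIterator / ChangeFeedStartFrom); older SDK versions expose a different change feed API, so verify the exact names against the version you're using:

```typescript
import { CosmosClient, ChangeFeedStartFrom } from "@azure/cosmos";

const container = new CosmosClient(process.env.COSMOS_CONN_STRING!)
  .database("mydb")     // placeholder names
  .container("events");

async function pollChangeFeed(): Promise<void> {
  const iterator = container.items.getChangeFeedIterator({
    changeFeedStartFrom: ChangeFeedStartFrom.Now(),
  });

  // There is no push from the server: you (or the change feed processor on your behalf)
  // repeatedly poll, sleeping between rounds, which is what FeedPollDelay controls
  // in the .NET processor library.
  while (iterator.hasMoreResults) {
    const response = await iterator.readNext();
    if (response.statusCode === 304) {
      // 304 NotModified: no new changes yet; wait before polling again.
      await new Promise((resolve) => setTimeout(resolve, 5000));
      continue;
    }
    for (const item of response.result ?? []) {
      console.log("changed document:", item.id);
    }
  }
}
```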

How to implement synchronized Memcached with database

AFAIK, Memcached does not support synchronization with a database (at least not SQL Server or Oracle). We are planning to use Memcached (it is free) with our OLTP database.
In some business processes we do some heavy validations which require a lot of data from the database. We cannot keep a static copy of this data, as we don't know whether it has been modified, so we fetch it every time, which slows the process down.
One possible solution could be
Write triggers on the database to create/update files named by a prefix-postfix scheme (table-PK1-PK2-PK3-column) whenever records change
Monitor these file changes using FileSystemWatcher and expire the corresponding key (table-PK1-PK2-PK3-column) so updated data is fetched
Problem: There would be around 100,000 users using any combination of data for 10 hours, so we will end up with a lot of files, e.g. categ1-subcateg5-subcateg-78-data100, categ1-subcateg5-subcateg-78-data250, categ2-subcateg5-subcateg-78-data100, categ1-subcateg5-subcateg-33-data100, etc.
I am expecting 5 million files at least. Now it looks like a pathetic solution :(
Other possibilities are
call a web service asynchronously from the trigger, passing the key to be expired
call an exe from the trigger without waiting for it to finish, and have the exe expire the key (I have had some success with this approach on SQL Server using xp_cmdshell to call an exe; calling an exe from an Oracle trigger looks a bit difficult)
Still sounds pathetic, doesn't it?
Any intelligent suggestions, please?
It's not clear (to me) whether the use of Memcached is mandatory or not. I would personally avoid it and instead use SqlDependency and OracleDependency. Both allow you to pass a DB command and get notified when the data that the command would return changes.
If Memcached is mandatory, you can still use these two classes to trigger the invalidation.
MS SQL Server has a "Change Tracking" feature that may be of use to you. You enable the database for change tracking and configure which tables you wish to track. SQL Server then creates change records on every update, insert, and delete on a table and lets you query for changes to records that have been made since the last time you checked. This is very useful for syncing changes and is more efficient than using triggers. It's also easier to manage than making your own tracking tables. This has been a feature since SQL Server 2005.
How to: Use SQL Server Change Tracking
Change tracking only captures the primary keys of the tables and lets you query which fields might have been modified. You can then query the tables, joining on those keys, to get the current data. If you want it to capture the data as well, you can use Change Data Capture, but that requires more overhead and at least SQL Server 2008 Enterprise edition.
Change Data Capture
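To tie this back to the cache-invalidation question, a sketch of polling Change Tracking from a small Node worker and expiring Memcached keys might look like this. It assumes the mssql and memjs npm packages, that change tracking is already enabled on a dbo.Products table, and a "Products-<ProductID>" key scheme; all of these are placeholders.

```typescript
import sql from "mssql";
import memjs from "memjs";

const cache = memjs.Client.create(); // uses MEMCACHIER_SERVERS or localhost:11211 by default

let lastSyncVersion = 0; // persist this somewhere durable between runs

async function invalidateChangedProducts(): Promise<void> {
  const pool = await sql.connect(process.env.SQL_CONN_STRING!);

  const result = await pool
    .request()
    .input("last_sync_version", sql.BigInt, lastSyncVersion)
    .query(`
      SELECT ct.ProductID, ct.SYS_CHANGE_OPERATION
      FROM CHANGETABLE(CHANGES dbo.Products, @last_sync_version) AS ct;

      SELECT CHANGE_TRACKING_CURRENT_VERSION() AS currentVersion;
    `);

  // First result set: changed primary keys; second: the new sync version to store.
  const [changes, versionRows] = result.recordsets as [
    { ProductID: number; SYS_CHANGE_OPERATION: string }[],
    { currentVersion: number }[]
  ];

  for (const change of changes) {
    // Expire the cached entry; the next read repopulates it from the database.
    await cache.delete(`Products-${change.ProductID}`);
  }

  lastSyncVersion = versionRows[0].currentVersion;
}
```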
I have no experience with Oracle, but I believe it has some tracking functionality as well. This article might get you started:
20 Using Oracle Streams to Record Table Changes
