Re-run all changes in Lease Collection - azure-cosmosdb

I created several new pipelines in Azure Data Factory to process CosmosDB Change Feed (which go into Blob storage for ADF processing to on-prem SQL Server), and I'd like to "resnap" the data from the leases collection to force a full re-sync. Is there a way to do this?
For clarity, my set-up is:
Change Feed ->
Azure Function to process the changes -> Blob Storage to hold the JSON documents -> Azure Data Factory which picks up the Blob Storage documents and maps them to on-prem SQL Server stored proc inserts/updates.

The easiest and simplest way is to do it is to simply delete the lease documents and make sure that the StartFromBeginning setting is set to true. Once restarted the change feed service will recreate the leases (if the appropriate setting is configured to true) and reprocess all the documents.
The other way to do so is to update every single lease document and reset the Continuation token "checkpoint" to null, however I don't recommend this method since you might accidentally miss a lease which can lead to issues.

Related

Reading the Realtime database directly VS using Cloud Functions

I have been reading this article about reading the realtime database directly vs calling cloud functions that returns database data.
If I am returning a fairly large chunk of data e.g. a json object that holds data representing 50 user comments from a cloud function does this count
As Outbound Data (Egress) data? If so does this cost $0.12 per gb per month?
The comments are stored like so with an incremental key.
comments: [0 -> {text: “Adsadsads”},
1 -> {text: “dadsacxdg”},
etc.]
Furthermore, I have read you can call goOffline() and goOnline() using the client sdks to stop concurrent connections. Are there any costs associated with closing and
Opening database connections or is it just the speed aspect of opening a connection every time you read?
Would it be more cost effective to call a cloud function that returns the set of 50 comments or allow the devices to read the comments directly from the database
But open/close before/after each read to the database, using orderByKey(), once(), startAt() and limitToFirst()?
e.g something like this
ref(‘comments’).once().orderByKey().startAt(0).limitToFirst(50).
Thanks
If your Cloud Function reads data from Realtime Database and returns (part of) that data to the caller, you pay for the data that is read from the database (at $1/GB) and then also for the data that your Cloud Function returns to the user (at $0.12/GB).
Opening a connection to the database means data is sent from the database to the client, and you are charged for this data (typically a few KB).
Which one is more cost effective is something you can calculate once you have all parameters. I'd recommend against premature cost optimization though: Firebase has a pretty generous free tier on its Realtime Database, so I'd start reading directly from the database and seeing how much traffic that generates. Also: if you are explicitly managing the connection state, and seem uninterested in the realtime nature of Firebase, there might be better/cheaper alternatives than Firebase to fill your needs.

Can we insert a data into Firebase realtime Database?

One child node of my Firebase realtime database has become huge (aroung 20 GB) and I need to purge this and insert the the extracted data of last month from the backup into the Firebase realtime database using Python Admin SDK.
In the documentation, I see the following options:
set - Write or replace data to a defined path, like messages/users/
update - Update some of the keys for a defined path without replacing all of the data
push - Add to a list of data in the database. Every time you push a new node onto a list, your database generates a unique key, like messages/users//
transaction - Use transactions when working with complex data that could be corrupted by concurrent updates
However, I want to add/insert the data from the firebase backup. I have to insert because the app is used in production and I cannot afford the overwrite of data.
Is there any method available to insert/add the data and not overwrite the data?
Any help/support is greatly appreciated.
There is no way to do this in Firebase Realtime Database without reading the current value of each location.
The only operation that allows you to update data based on its existing value is a transaction. A Firebase transaction gives you the (likely) current value at a location, and you then return what the new value should become.
But if the data you're restoring is (largely) the same as the data you have in the database, you might be able to use an update() call with sufficiently deep paths.

cosmosdb - archive data older than n years into cold storage

I researched several places and could not find any direction on what options are there to archive old data from cosmosdb into a cold storage. I see for DynamoDb in AWS it is mentioned that you can move dynamodb data into S3. But not sure what options are for cosmosdb. I understand there is time to live option where the data will be deleted after certain date but I am interested in archiving versus deleting. Any direction would be greatly appreciated. Thanks
I don't think there is a single-click built-in feature in CosmosDB to achieve that.
Still, as you mentioned appreciating any directions, then I suggest you consider DocumentDB Data Migration Tool.
Notes about Data Migration Tool:
you can specify a query to extract only the cold-data (for example, by creation date stored within documents).
supports exporting export to various targets (JSON file, blob
storage, DB, another cosmosDB collection, etc..),
compacts the data in the process - can merge documents into single array document and zip it.
Once you have the configuration set up you can script this
to be triggered automatically using your favorite scheduling tool.
you can easily reverse the source and target to restore the cold data to active store (or to dev, test, backup, etc).
To remove exported data you could use the mentioned TTL feature, but that could cause data loss should your export step fail. I would suggest writing and executing a Stored Procedure to query and delete all exported documents with single call. That SP would not execute automatically but could be included in the automation script and executed only if data was exported successfully first.
See: Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs.
UPDATE:
These days CosmosDB has added Change feed. this really simplifies writing a carbon copy somewhere else.

best practice for bulk update in document DB

we have a scenario where we need to populate the collection every one hour with the latest data whenever we receive the data file in blob from external sources and at the same time , we do not want to impact the live users while updating the collection.
So, we have done below
Created 2 databases and collection 1 in both databases
Created a another collection in different database( configuration database ) with property as Active and Passive and this will have the Database1 and Database2 as values for the above properties
Now , our web job will run every time it sees the file in blob and check this configuration database and identify which one is active or passive and process the xml file and update the collection in passive database as that is not used by the live feed and once it is done , will update the active database to current and passive to live
now , our service will always check which one is active and passive and fetch the data accordingly and show to user
As we have to delete the data and insert the newly data in web job , wanted to know is this is best design we have come up with ? Does deleting and inserting the data will cost ? Is there better way to do bulks delete and insert as we are doing sequentially now
wanted to know is this is best design we have come up with ?
As David Makogon said, as for your solution, you need to manage and pay for multiple databases. If possible, you could create new documents in same collection and control which document is active in your program logic.
Does deleting and inserting the data will cost ?
the operation/request will consume the request units, which will be charged. To know Request Units and DocumentDB Pricing details, please refer to:
What is a Request Unit
DocumentDB pricing details
Is there better way to do bulks delete and insert as we are doing sequentially now
Stored Procedure that provides a way to group operations like inserts and submit them in bulk. You could create the stored procedures and then execute the stored procedure in your Webjobs function.

AWS Auto Scaling Launch Configuration Encrypted EBS Cloud Formation Example

I am creating cloud formation script, which will have ELB. In Auto Scaling launch configuration, I want to add encrypted EBS volume. Couldn't find an encrypted property withing blockdevicemapping. I need to encrypt volume. How can I attach an encrypted EBS volume to an EC2 instance through auto scaling launch configuration?
There is no such property for some strange reason when using launch configurations, however it is there when using blockdevicemappings with simple EC2 instances. See
launchconfig-blockdev vs ec2-blockdev
So you'll either have to use simple instances instead of autoscaling groups, or you can try this workaround:
SnapshotIds are accepted for launchconf blockdev too, and as stated here "Snapshots that are taken from encrypted volumes are automatically encrypted. Volumes that are created from encrypted snapshots are also automatically encrypted."
Create a snapshot from an encrypted empty EBS volume and use it in the CloudFormation template. If your template should work in multiple regions then of course you'll have to create the snapshot in every region and use a Mapping in the template.
As Marton says, there is no such property (unfortunately it often takes a while for CloudFormation to catch up with the main APIs).
Normally each encrypted volume you create will have a different key. However, when using the workaround mentioned (of using an encrypted snapshot) the resulting encrypted volumes will inherit the encryption key from the snapshot and all be the same.
From a cryptography point of view this is a bad idea as you potentially have multiple, different volumes and snapshots with the same key. If an attacker has access to all of these then he can potentially use differences to infer information about the key more easily.
An alternative is to write a script that creates and attaches a new encrypted volume at the boot time of a instance. This is fairly easy to do. You'll need to give the instance permissions to create and attach volumes and either have installed the AWS CLI tool or a library for your preferred scripting language. One you have that you can, from the instance that is booting, create a volume and attach it.
You can find a starting point for such a script here: https://github.com/guardian/machine-images/blob/master/packer/resources/features/ebs/add-encrypted.sh
There is an AutoScaling EBS Block Device type which provides the "Encrypted" option:
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-launchconfig-blockdev-template.html
Hope this helps!
AWS recently announced Default Encryption for New EBS Volumes. You can enable this per region via
EC2 Console > Settings > Always encrypt new EBS volumes
https://aws.amazon.com/blogs/aws/new-opt-in-to-default-encryption-for-new-ebs-volumes/

Resources