Multiple KCL applications with the same application name reading from one Kinesis stream - amazon-dynamodb

I'm confused about how the KCL works. First of all, this is my understanding so far:
One KCL application uses one application name and creates one DynamoDB table.
One KCL application has one worker, with x record processors working in parallel on the x shards of a stream.
The DynamoDB table keeps track of the owner, checkpoint, etc. of each shard.
If I create multiple (let's say 3) KCL applications with different application names, they are essentially different applications reading from the same stream, isolated from each other by their separate DynamoDB tables. All 3 of them will read all x shards of the stream and track their checkpoints separately.
Based on a few docs that I read, for example: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-scaling.html
I would assume that if I create another KCL application with the same application name, there would be 2 KCL applications working on the same stream, with the shards being load-balanced across the 2 workers in the 2 apps.
So, technically, I could run 8 KCL apps (let's say there are 8 shards in the stream) on 8 EC2 instances, and each of them would process exactly one shard without clashing, since each shard is checkpointed in its own row of the DynamoDB table.
I thought that was the case, but this post suggests otherwise: Multiple different consumers of same Kinesis stream
Otherwise, how can I achieve this:
All workers associated with this application name are assumed to be working together on the same stream. These workers may be distributed on multiple instances. If you run an additional instance of the same application code, but with a different application name, the KCL treats the second instance as an entirely separate application that is also operating on the same stream.
as mentioned here https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-implementation-app-java.html#kinesis-record-processor-initialization-java
Reference:
https://www.amazonaws.cn/en/kinesis/data-streams/faqs/#recordprocessor
https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-scaling.html
https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-implementation-app-java.html#kinesis-record-processor-initialization-java

The KCL library needs a ConfigsBuilder, where you pass streamName, applicationName, kinesisAsyncClient, etc. Here, if you specify an application name associated with a stream name, then the KCL creates a
DynamoDB table with the application name and uses the table to
maintain state information.
So if you have multiple streams, you create multiple software.amazon.kinesis.common.ConfigsBuilder instances with the individual streamNames and their associated applicationNames.
Pass each ConfigsBuilder's properties to a software.amazon.kinesis.coordinator.Scheduler.
This way you will have a DynamoDB table for every single stream, and your multi-instance app will consume each stream event only once.
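For illustration, here is a minimal sketch of that wiring with KCL 2.x; the stream name, application name, and the trivial record processor are hypothetical placeholders. Every instance started with the same applicationName joins the same worker fleet, shares the same lease table, and gets a share of the shard leases; an instance started with a different applicationName gets its own lease table and re-reads every shard:

    import java.util.UUID;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;
    import software.amazon.awssdk.services.dynamodb.DynamoDbAsyncClient;
    import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;
    import software.amazon.kinesis.common.ConfigsBuilder;
    import software.amazon.kinesis.coordinator.Scheduler;
    import software.amazon.kinesis.lifecycle.events.*;
    import software.amazon.kinesis.processor.ShardRecordProcessor;

    public class ConsumerApp {

        // Minimal processor: logs sequence numbers and checkpoints at shard end.
        static class LoggingProcessor implements ShardRecordProcessor {
            public void initialize(InitializationInput input) {}
            public void processRecords(ProcessRecordsInput input) {
                input.records().forEach(r -> System.out.println("record: " + r.sequenceNumber()));
            }
            public void leaseLost(LeaseLostInput input) {}
            public void shardEnded(ShardEndedInput input) {
                try { input.checkpointer().checkpoint(); } catch (Exception e) { e.printStackTrace(); }
            }
            public void shutdownRequested(ShutdownRequestedInput input) {}
        }

        public static void main(String[] args) {
            Region region = Region.US_EAST_1;
            KinesisAsyncClient kinesis = KinesisAsyncClient.builder().region(region).build();
            DynamoDbAsyncClient dynamo = DynamoDbAsyncClient.builder().region(region).build();
            CloudWatchAsyncClient cloudwatch = CloudWatchAsyncClient.builder().region(region).build();

            // Workers sharing "my-consumer-app" share one lease table and split the shards.
            ConfigsBuilder configs = new ConfigsBuilder(
                    "my-stream",                  // streamName (hypothetical)
                    "my-consumer-app",            // applicationName = lease table name (hypothetical)
                    kinesis, dynamo, cloudwatch,
                    UUID.randomUUID().toString(), // workerIdentifier, unique per instance
                    LoggingProcessor::new);       // ShardRecordProcessorFactory

            Scheduler scheduler = new Scheduler(
                    configs.checkpointConfig(),
                    configs.coordinatorConfig(),
                    configs.leaseManagementConfig(),
                    configs.lifecycleConfig(),
                    configs.metricsConfig(),
                    configs.processorConfig(),
                    configs.retrievalConfig());

            new Thread(scheduler).start();        // Scheduler implements Runnable
        }
    }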

Related

Reading the Realtime database directly VS using Cloud Functions

I have been reading this article about reading the Realtime Database directly vs. calling Cloud Functions that return database data.
If I am returning a fairly large chunk of data, e.g. a JSON object holding 50 user comments, from a Cloud Function, does this count as Outbound Data (Egress)? If so, does this cost $0.12 per GB?
The comments are stored like so, with an incremental key:
comments: [
  0 -> {text: "Adsadsads"},
  1 -> {text: "dadsacxdg"},
  ...
]
Furthermore, I have read you can call goOffline() and goOnline() using the client SDKs to stop concurrent connections. Are there any costs associated with closing and opening database connections, or is it just the speed penalty of opening a connection every time you read?
Would it be more cost effective to call a Cloud Function that returns the set of 50 comments, or to let the devices read the comments directly from the database, but open/close the connection before/after each read, using orderByKey(), once(), startAt() and limitToFirst()?
e.g. something like this (query methods come before once(), and with orderByKey() the startAt() argument is a string key):
ref('comments').orderByKey().startAt('0').limitToFirst(50).once('value')
Thanks
If your Cloud Function reads data from Realtime Database and returns (part of) that data to the caller, you pay for the data that is read from the database (at $1/GB) and then also for the data that your Cloud Function returns to the user (at $0.12/GB).
Opening a connection to the database means data is sent from the database to the client, and you are charged for this data (typically a few KB).
Which one is more cost effective is something you can calculate once you have all parameters. I'd recommend against premature cost optimization though: Firebase has a pretty generous free tier on its Realtime Database, so I'd start reading directly from the database and seeing how much traffic that generates. Also: if you are explicitly managing the connection state, and seem uninterested in the realtime nature of Firebase, there might be better/cheaper alternatives than Firebase to fill your needs.
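For a rough back-of-the-envelope comparison, here is a tiny sketch; the batch size (5 KB for 50 comments) and the monthly read count (1,000,000) are purely hypothetical placeholders. Both paths pay the $1/GB database download; the Cloud Function path adds $0.12/GB of egress on top:

    public class CostSketch {
        public static void main(String[] args) {
            double gbPerBatch = 5.0 / (1024 * 1024);   // 5 KB expressed in GB (hypothetical size)
            double readsPerMonth = 1_000_000;           // hypothetical traffic
            double dbDownload = gbPerBatch * readsPerMonth * 1.00; // RTDB download, $1/GB
            double fnEgress   = gbPerBatch * readsPerMonth * 0.12; // Cloud Functions egress, $0.12/GB
            System.out.printf("direct reads:       $%.2f/month%n", dbDownload);
            System.out.printf("via Cloud Function: $%.2f/month%n", dbDownload + fnEgress);
        }
    }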

Scan entire DynamoDB table and update records based on a condition

We have a business requirement to deprecate certain field values ("State"). So we need to scan the entire table, find these deprecated field values, take the last record for that partition key (as there can be multiple records for the same partition key; the sort key is LastUpdatedTimeepoch), and then update that record. Right now the table contains around 600k records. What's the best way to do this without bringing down the DB service in production?
I see this thread could help me
https://stackoverflow.com/questions/36780856/complete-scan-of-dynamodb-with-boto3
But my main concern is: this is a one-time activity. As it will take time, we cannot run it in AWS Lambda, since it would exceed the 15-minute limit. So where can I keep the code running?
Create an EC2 instance, assign it an IAM role with access to DynamoDB, and run the function on the EC2 instance.
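Whichever host runs it, the job itself is a paginated scan plus targeted updates. Here is a minimal sketch with the AWS SDK for Java v2; the table name ("MyTable"), partition key name ("pk"), and the old/new values are hypothetical placeholders. Note that State is a DynamoDB reserved word, hence the #st expression attribute name, and the small page size keeps the consumed read capacity modest on a live table:

    import java.util.HashMap;
    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.*;

    public class DeprecateStates {
        public static void main(String[] args) {
            DynamoDbClient ddb = DynamoDbClient.create();
            Map<String, AttributeValue> startKey = null;
            do {
                ScanRequest.Builder scan = ScanRequest.builder()
                        .tableName("MyTable")                             // hypothetical
                        .filterExpression("#st = :old")
                        .expressionAttributeNames(Map.of("#st", "State")) // State is reserved
                        .expressionAttributeValues(Map.of(
                                ":old", AttributeValue.builder().s("DEPRECATED_VALUE").build()))
                        .limit(100);                                      // small pages -> modest RCU use
                if (startKey != null) {
                    scan.exclusiveStartKey(startKey);
                }
                ScanResponse page = ddb.scan(scan.build());

                for (Map<String, AttributeValue> item : page.items()) {
                    // Latest record for this partition key: descending sort order, limit 1.
                    QueryResponse latest = ddb.query(QueryRequest.builder()
                            .tableName("MyTable")
                            .keyConditionExpression("pk = :pk")           // "pk" is hypothetical
                            .expressionAttributeValues(Map.of(":pk", item.get("pk")))
                            .scanIndexForward(false)
                            .limit(1)
                            .build());
                    if (latest.items().isEmpty()) continue;
                    Map<String, AttributeValue> newest = latest.items().get(0);

                    // (If several deprecated items share a pk, the same newest record may be
                    // updated more than once; harmless for a one-off job.)
                    Map<String, AttributeValue> key = new HashMap<>();
                    key.put("pk", newest.get("pk"));
                    key.put("LastUpdatedTimeepoch", newest.get("LastUpdatedTimeepoch"));
                    ddb.updateItem(UpdateItemRequest.builder()
                            .tableName("MyTable")
                            .key(key)
                            .updateExpression("SET #st = :new")
                            .expressionAttributeNames(Map.of("#st", "State"))
                            .expressionAttributeValues(Map.of(
                                    ":new", AttributeValue.builder().s("REPLACEMENT_VALUE").build()))
                            .build());
                }
                startKey = page.lastEvaluatedKey().isEmpty() ? null : page.lastEvaluatedKey();
            } while (startKey != null);
        }
    }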

Re-run all changes in Lease Collection

I created several new pipelines in Azure Data Factory to process the CosmosDB Change Feed (which goes into Blob storage for ADF processing to on-prem SQL Server), and I'd like to "resnap" the data from the leases collection to force a full re-sync. Is there a way to do this?
For clarity, my set-up is:
Change Feed -> Azure Function to process the changes -> Blob Storage to hold the JSON documents -> Azure Data Factory, which picks up the Blob Storage documents and maps them to on-prem SQL Server stored proc inserts/updates.
The easiest and simplest way to do it is to delete the lease documents and make sure the StartFromBeginning setting is set to true. Once restarted, the change feed service will recreate the leases (if the appropriate setting is configured to true) and reprocess all the documents.
The other way to do it is to update every single lease document and reset the continuation token "checkpoint" to null; however, I don't recommend this method, since you might accidentally miss a lease, which can lead to issues.
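If you want to script the first approach, a sketch along these lines with the Azure Cosmos DB Java SDK v4 would clear the lease container; the account, database, and container names are hypothetical, and it assumes the lease container is partitioned on /id (the usual layout for lease collections):

    import com.azure.cosmos.CosmosClient;
    import com.azure.cosmos.CosmosClientBuilder;
    import com.azure.cosmos.CosmosContainer;
    import com.azure.cosmos.models.CosmosItemRequestOptions;
    import com.azure.cosmos.models.CosmosQueryRequestOptions;
    import com.azure.cosmos.models.PartitionKey;
    import com.fasterxml.jackson.databind.JsonNode;

    public class ResetLeases {
        public static void main(String[] args) {
            CosmosClient client = new CosmosClientBuilder()
                    .endpoint("https://myaccount.documents.azure.com:443/") // hypothetical
                    .key(System.getenv("COSMOS_KEY"))
                    .buildClient();
            CosmosContainer leases = client.getDatabase("mydb").getContainer("leases"); // hypothetical

            // Delete every lease document; on restart (with StartFromBeginning = true)
            // the change feed processor recreates them and replays the whole collection.
            for (JsonNode lease : leases.queryItems(
                    "SELECT * FROM c", new CosmosQueryRequestOptions(), JsonNode.class)) {
                leases.deleteItem(lease.get("id").asText(),
                        new PartitionKey(lease.get("id").asText()),
                        new CosmosItemRequestOptions());
            }
            client.close();
        }
    }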

best practice for bulk update in DocumentDB

We have a scenario where we need to repopulate a collection every hour with the latest data, whenever we receive a data file in blob storage from external sources. At the same time, we do not want to impact live users while updating the collection.
So, we have done the following:
Created 2 databases, with the same collection in both databases
Created another collection in a different database (a configuration database) with Active and Passive properties, which hold Database1 and Database2 as their values
Now, our WebJob runs every time it sees the file in blob storage; it checks this configuration database to identify which database is active and which is passive, processes the XML file, and updates the collection in the passive database, as that one is not used by the live feed. Once it is done, it swaps the configuration so the passive database becomes the active one.
Our service always checks which database is active and fetches the data from it to show to the user.
As we have to delete the old data and insert the new data in the WebJob, we wanted to know: is this the best design we could come up with? Does deleting and inserting the data cost anything? Is there a better way to do bulk deletes and inserts than sequentially, as we are doing now?
Is this the best design we could come up with?
As David Makogon said, with your solution you need to manage and pay for multiple databases. If possible, you could create the new documents in the same collection and control which set of documents is active in your program logic.
Does deleting and inserting the data cost anything?
The operations/requests consume request units, which are charged. For details on Request Units and DocumentDB pricing, please refer to:
What is a Request Unit
DocumentDB pricing details
Is there a better way to do bulk deletes and inserts than sequentially, as we are doing now?
A stored procedure provides a way to group operations like inserts and submit them in bulk. You could create the stored procedure and then execute it from your WebJobs function.
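For illustration, here is a hedged sketch of executing a pre-created bulk-insert stored procedure with the Azure Cosmos DB Java SDK v4; the stored procedure id ("bulkImport"), the database/container names, the partition key value, and the documents are all hypothetical, and the stored procedure itself (which loops over the array and calls createDocument) must already exist in the container. Remember that a stored procedure executes within a single partition key:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import com.azure.cosmos.CosmosClient;
    import com.azure.cosmos.CosmosClientBuilder;
    import com.azure.cosmos.CosmosContainer;
    import com.azure.cosmos.models.CosmosStoredProcedureRequestOptions;
    import com.azure.cosmos.models.CosmosStoredProcedureResponse;
    import com.azure.cosmos.models.PartitionKey;

    public class BulkImportCaller {
        public static void main(String[] args) {
            CosmosClient client = new CosmosClientBuilder()
                    .endpoint("https://myaccount.documents.azure.com:443/") // hypothetical
                    .key(System.getenv("COSMOS_KEY"))
                    .buildClient();
            CosmosContainer container = client.getDatabase("mydb").getContainer("mycoll"); // hypothetical

            // The batch to insert (hypothetical documents, all under one partition key).
            List<Object> docs = Arrays.asList(
                    Map.of("id", "1", "pk", "feed", "text", "first"),
                    Map.of("id", "2", "pk", "feed", "text", "second"));

            CosmosStoredProcedureRequestOptions options = new CosmosStoredProcedureRequestOptions();
            options.setPartitionKey(new PartitionKey("feed")); // sprocs run within one partition

            CosmosStoredProcedureResponse response = container.getScripts()
                    .getStoredProcedure("bulkImport")          // created beforehand (hypothetical id)
                    .execute(Arrays.asList(docs), options);    // sproc signature: bulkImport(docs)
            System.out.println("sproc returned: " + response.getResponseAsString());
            client.close();
        }
    }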

DynamoDB limitations when deploying MoonMail

I'm trying to deploy MoonMail on AWS. However, I receive this exception from CloudFormation:
Subscriber limit exceeded: Only 10 tables can be created, updated, or deleted simultaneously
Is there another way to deploy without opening a support case and asking them to raise my limit?
This is an AWS limit for APIs: (link)
API-Specific Limits
CreateTable/UpdateTable/DeleteTable
In general, you can have up to 10 CreateTable, UpdateTable, and DeleteTable requests running simultaneously (in any combination). In other words, the total number of tables in the CREATING, UPDATING or DELETING state cannot exceed 10.
The only exception is when you are creating a table with one or more secondary indexes. You can have up to 5 such requests running at a time; however, if the table or index specifications are complex, DynamoDB might temporarily reduce the number of concurrent requests below 5.
You could try to open a support request with AWS to raise this limit for your account, but I don't feel this is necessary. It seems that you could create the DynamoDB tables a priori, using the AWS CLI or the AWS SDK, and use MoonMail with read-only access to those tables. Using the SDK (example), you could create those tables sequentially, without hitting this simultaneous-creation limit.
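For illustration, a sketch of that sequential approach with the AWS SDK for Java v2; the table names and the single-attribute key schema are hypothetical placeholders, and the real definitions should mirror those in s-resources-cf.json:

    import java.util.List;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.*;

    public class CreateTablesSequentially {
        public static void main(String[] args) {
            DynamoDbClient ddb = DynamoDbClient.create();
            // Hypothetical names; take the real ones from MoonMail's s-resources-cf.json.
            List<String> tables = List.of("table-01", "table-02", "table-03" /* ... up to 12 */);

            for (String name : tables) {
                ddb.createTable(CreateTableRequest.builder()
                        .tableName(name)
                        .attributeDefinitions(AttributeDefinition.builder()
                                .attributeName("id").attributeType(ScalarAttributeType.S).build())
                        .keySchema(KeySchemaElement.builder()
                                .attributeName("id").keyType(KeyType.HASH).build())
                        .provisionedThroughput(ProvisionedThroughput.builder()
                                .readCapacityUnits(1L).writeCapacityUnits(1L).build())
                        .build());
                // Block until the table is ACTIVE, so at most one table is ever in the
                // CREATING state and the 10-concurrent-operations limit is never hit.
                ddb.waiter().waitUntilTableExists(b -> b.tableName(name));
                System.out.println("created " + name);
            }
        }
    }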
Another option is to edit the s-resources-cf.json file to include only 10 tables and deploy. After that, add the missing tables and deploy again.
Whatever solution you apply, consider creating an issue ticket in MoonMail's repo, because as it stands now, it does not work on the first try (there are 12 tables in the resources file).
