Is DynamoDB streams the right option for this use case? - amazon-dynamodb

I have a DynamoDB table that contains key value pairs that will be read by a number of applications. On startup each application will read the entire table and cache it in-memory.
The problem I'm trying to solve is that of getting the applications to update their cache if one or more items in the DynamoDB table have been modified.
DynamoDB streams initially seemed to be the right approach to solving the problem. I have implemented the consumer using Kinesis Client Library (KCL) as recommended by AWS. While implementing it, however, I have encountered some problems that make me believe that I'm on the wrong track. Specifically:
When I create a new consumer using KCL, it creates a new DynamoDB table to do the housekeeping of leases and checkpoints, such that when the application is restarted, KCL knows which records have been consumed and which have not. This is not what I need for this problem. Any stream records that are created while the application is offline are irrelevant, since the entire table is read upon application startup.
Several instances of the same application are running at the same time. Each of them needs to be notified of table updates. To implement that in KCL I need to assign a unique application name to each of them. Otherwise they will share the lease table and only one of the applications will get notified. One table for each application instance doesn't seem right. Also I would then need something to remove unused tables.
I also implemented it using the low level API instead. That works fine when there's a single shard. My implementation doesn't handle re-sharding like KCL, however, so it's too fragile. It seems wrong to have to implement handling of re-sharding for the simple problem I'm trying to solve.
I'm beginning to consider other solutions like:
Implementing a Lambda function that gets triggered on updates to the table. The function sends a notification to an SNS topic. Consumers create SQS subscriptions on the topic and get notified via that. This solution has too many moving parts for my liking.
Make the applications periodically re-read the entire table and determine themselves if changes have been made. This solution feels a bit primitive, but seems to be the simplest.
All solutions that I have considered so far have quite significant drawbacks. What am I missing?
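To make the periodic re-read option concrete, here is a minimal sketch in Python. It assumes the whole table fits in memory as a plain dict of key to value; `load_table` is a hypothetical callable standing in for whatever scan code the application already uses at startup, and `PollingCache` and `diff_snapshots` are illustrative names, not part of any AWS library.

```python
from typing import Callable, Dict, List, Tuple

Snapshot = Dict[str, str]


def diff_snapshots(old: Snapshot, new: Snapshot) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, changed, removed) keys between two full-table snapshots."""
    added = sorted(k for k in new if k not in old)
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    removed = sorted(k for k in old if k not in new)
    return added, changed, removed


class PollingCache:
    """In-memory cache that is refreshed by re-reading the entire table."""

    def __init__(self, load_table: Callable[[], Snapshot]):
        # load_table is a hypothetical function that scans the table
        # and returns every item as a {key: value} dict.
        self._load_table = load_table
        self.items = load_table()

    def refresh(self) -> Tuple[List[str], List[str], List[str]]:
        """Re-read the table, swap the cache, and report what changed."""
        fresh = self._load_table()
        delta = diff_snapshots(self.items, fresh)
        self.items = fresh
        return delta
```

Calling `refresh()` on a timer gives each instance its own view of what changed without any shared lease table, at the cost of one full scan per interval.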

It depends on how your KCL is pushing to the dependent apps, but I believe the SQS path is the correct choice.
You can add a presumably infinite number of consumers without being throttled.
When you do add another dependent app, it won't require changing your KCL to push to it, the new app will simply watch the SQS queue.
You gain the ability to monitor the queue when issues happen.
More moving parts to set up, but once you have the Streams -> SNS -> SQS pipe in place, it's basically bulletproof.
Just my 2¢.
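For illustration, the Streams -> SNS leg of that pipe could look roughly like the Python sketch below. It relies only on the documented shape of a DynamoDB Streams event delivered to Lambda (`Records`, `eventName`, `dynamodb.Keys`, `dynamodb.NewImage`). The `publish` callable is an assumption: in a real function it would typically wrap boto3's `sns.publish(TopicArn=..., Message=...)`, but it is injected here so the sketch runs without AWS credentials. `make_handler` and `stream_event_to_messages` are made-up names.

```python
import json
from typing import Any, Callable, Dict, List


def stream_event_to_messages(event: Dict[str, Any]) -> List[str]:
    """Turn a DynamoDB Streams event into one JSON message per record."""
    messages = []
    for record in event.get("Records", []):
        change = record.get("dynamodb", {})
        messages.append(json.dumps({
            "action": record.get("eventName"),  # INSERT, MODIFY or REMOVE
            "keys": change.get("Keys"),
            # NewImage is only present when the stream view type includes
            # new images, and never for REMOVE events.
            "new_image": change.get("NewImage"),
        }, sort_keys=True))
    return messages


def make_handler(publish: Callable[[str], Any]):
    """Build a Lambda-style handler that forwards each change via `publish`."""
    def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
        messages = stream_event_to_messages(event)
        for message in messages:
            publish(message)
        return {"published": len(messages)}
    return handler
```

Each application instance then reads from its own SQS queue subscribed to the topic, so no instance has to track shards or leases itself.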

Nowadays an AWS AppSync GraphQL API with subscriptions may be the simplest approach to power this type of application, with the least number of moving parts.
Whenever one of your applications starts up, it connects to your AppSync GraphQL API using the Amplify framework or AppSync SDK and subscribes to the updates it's interested in. Then whenever an application updates information in the table via your GraphQL API, all your other applications will be notified of the change, along with the relevant changed data.
AppSync integrates well with DynamoDB out of the box, allowing you to generate DynamoDB tables with appropriate indexes alongside your GraphQL schema, or to generate GraphQL from your existing DynamoDB tables if you so choose. Amplify can even help you automatically generate an AppSync GraphQL API at a higher level with associated DynamoDB tables, indexes, entity relationships, and more, such as Elasticsearch search capabilities, by using its GraphQL transformers.

Related

AWS Amplify creating multiple DynamoDB tables with duplicate information?

I'm building an offline-first mobile application using AWS Amplify, using the local DataStore and cloud sync. So far, I'm following the documentation without any variation (I think.)
As of now, I only have one model, let's call it a Thing. I noticed that after running amplify push, my environment contains not one, but two DynamoDB tables:
Thing-<app-id>-<env>
AmplifyDataStore-<app-id>-<env>
Whenever I save a Thing entity, it appears to be persisted redundantly in both tables. This effectively doubles my DynamoDB storage costs.
Is there a sound technical reason for this, or any way to avoid it? Or am I just making a mistake somewhere that is causing it to persist twice?
Assuming you have k models, then the Amplify DataStore will provision k + 1 tables. The extra table you're noticing is called the "delta sync table." It is used to store incremental changes that have occurred since the last time the client synchronized fully with AppSync. The Delta Sync table carries a short TTL on the records, and they will get dropped if not utilized within that window of time.
To learn more about Delta Sync and DataStore generally, I recommend Ed Lima's AWS AppSync offline reference architecture – powered by the Amplify DataStore. See particularly the section labeled "The Delta Sync table."
Source: I'm an engineer on this product team.

Monitoring table changes in DynamoDB

I am using AWS DynamoDB in order to store information.
I have two machines running separate code that access the information in the database.
One of the machines is writing into the database and the second one is reading.
Since the second one does not know whether or not the information in the database has been changed I need to somehow monitor my database for changes.
I know that there is something called DynamoDB Streams that can provide you with information about changes made in your database, and I already have that code implemented.
The question is as follows: if I am monitoring the database constantly, I need to query this stream all the time, let's say once every minute.
What is the difference between doing that and actually querying the database every minute?
Is it much more efficient?
Is it less costly (resources, moneywise)?
Is there any other, more efficient way of monitoring changes in the database in a specific table from the code?
Any help would be appreciated, thank you.
Most people I have seen do something like this do it with DynamoDB Streams + Lambda for best results. Definitely check out the DynamoDB docs and the Lambda docs on this topic.
There's also an example in the docs of monitoring DynamoDB where changes fire off a message to an SNS topic.
DynamoDB Streams is more efficient and near real time. Think of using Lambda in this way like you would a trigger in a relational database. Why do the extra effort, when the patterns are this very well defined and people use them all the time?
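As a rough sketch of the reading side, the Python below applies stream records to an in-memory copy of the table instead of re-querying it. It assumes the stream view type includes new images and that the table's partition key is a string attribute named `id`; both are assumptions, as are the helper names. The converter handles only a few DynamoDB attribute types (boto3 ships a fuller `TypeDeserializer`).

```python
from decimal import Decimal
from typing import Any, Dict


def from_attr(attr: Dict[str, Any]) -> Any:
    """Convert one DynamoDB attribute value, e.g. {"S": "x"}, to a Python value."""
    kind, value = next(iter(attr.items()))
    if kind == "S":
        return value
    if kind == "N":
        return Decimal(value)  # numbers arrive as strings
    if kind == "BOOL":
        return value
    if kind == "NULL":
        return None
    if kind == "L":
        return [from_attr(v) for v in value]
    if kind == "M":
        return {k: from_attr(v) for k, v in value.items()}
    raise ValueError(f"unsupported attribute type: {kind}")


def apply_record(cache: Dict[Any, Dict[str, Any]], record: Dict[str, Any],
                 key_name: str = "id") -> None:
    """Apply a single stream record (INSERT, MODIFY or REMOVE) to the cache."""
    change = record["dynamodb"]
    key = from_attr(change["Keys"][key_name])  # key_name is an assumed key attribute
    if record["eventName"] == "REMOVE":
        cache.pop(key, None)
    else:
        cache[key] = {k: from_attr(v) for k, v in change["NewImage"].items()}
```

With something like this, the reader only touches the items that actually changed, which is where the efficiency gain over a full query every minute comes from.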

Can I add listeners to Multiple Databases in Firebase?

Regarding the recent announcement of the multi-database support within a Firebase project, can we add listeners to multiple databases? Or should we connect to maximum one database at a time?
For example, let's say that I have created two databases, DB-1 and DB-2. I want to add a listener for changes in node-A in DB-1 and another listener in node-B in DB-2. Is this possible? I've read the documentation but it's a bit contradictory:
Each app instance only connects to one database at any given moment.
...
If each client needs to connect to multiple databases during a session, you can reduce the number of simultaneous connections to each database instance by connecting to each database instance for only as long as is necessary.
You can certainly connect to multiple databases at the same time, according to the documentation. There may be cases when you want to reduce the number of active connections your app is making, especially if you have a lot of shards, each with a lot of activity, so the advice stands for those cases, if this applies to you.

Synchronize Postgres Server Database to SQLite Client database

I am trying to create an app that receives an SQLite database from a server for offline use, with cloud synchronization. The server has a Postgres database with information from many clients.
1) Is it better to delete the SQLite database and create a new one from a query, or to try to synchronize and update the existing separate SQLite files (or is there another, better solution)? The refreshes will be a few times a day per client.
2) if it is the latter, could you give me any leads to resources on how I could do this?
I am pretty new to database applications so please excuse my ignorance and let me know if there is any way I could clarify.
There is no one size fits all approach here. You need to carefully consider exactly what needs to be done, what you are replicating, how much data is involved, and what your write models are, all before you build a solution. Along the way you have to decide how to handle write conflicts and more.
In general the one thing I would say is that such synchronization works best with append-only write models (i.e. inserts, no deletes, no updates), and one way to do it is to log changes that need to be made and replicate those changes.
However, master-master replication is difficult on the best of days and with the best of tools available. Jumping between databases with very different capabilities will introduce a number of additional problems. You are in for a big job.
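To give a feel for the append-only, log-the-changes idea, here is a minimal client-side sketch using Python's built-in `sqlite3` module. It assumes the server hands out changes as `(seq, payload)` pairs with a strictly increasing sequence number; that format, the `events` table, and the function names are all hypothetical and would need to match whatever change log you keep on the Postgres side.

```python
import sqlite3
from typing import Iterable, Tuple

Change = Tuple[int, str]  # (sequence number, payload), an assumed wire format


def init_client(conn: sqlite3.Connection) -> None:
    """Create the local append-only table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "seq INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    conn.commit()


def last_seq(conn: sqlite3.Connection) -> int:
    """Highest sequence number already stored locally (0 if none)."""
    return conn.execute("SELECT COALESCE(MAX(seq), 0) FROM events").fetchone()[0]


def count_events(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


def apply_changes(conn: sqlite3.Connection, changes: Iterable[Change]) -> int:
    """Insert new changes in one transaction; already-seen rows are skipped.

    Returns the number of rows actually added.
    """
    before = count_events(conn)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO events (seq, payload) VALUES (?, ?)",
            list(changes),
        )
    return count_events(conn) - before
```

On each refresh the client would send `last_seq(conn)` to the server and apply whatever comes back. Because rows are only ever inserted, replaying the same batch twice is harmless, which is a large part of why append-only models are easier to synchronize.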
Here's an open source product that claims to solve this for many database types including Postgres. I have no affiliation or commercial interest in this company.
https://github.com/sqlite-sync/SQLite-sync.com
http://sqlite-sync.com/
If you're able and willing to step outside relational databases to use an object store, you might want to have a look at CouchDB and perhaps PouchDB, which use an MVCC-based replication protocol designed to support multi-master replication, including conflict resolution. Under the covers, PouchDB uses adapters for SQLite, IndexedDB, local storage or a remote CouchDB instance to persist client-side data. It auto-selects the best client-side storage option for the given desktop or mobile browser. The SQLite engine can be either WebSQL or a Cordova SQLite plugin.
http://couchdb.apache.org/
https://pouchdb.com/

Firebase and indexing/search

I am considering using Firebase for an application that should allow people to use full-text search over a collection of a few thousand objects. I like the idea of delivering a client-only application (not having to worry about hosting the data), but I am not sure how to handle search. The data will be static, so the indexing itself is not a big deal.
I assume I will need some additional service that runs queries and returns Firebase object handles. I can spin up such a service at some fixed location, but then I have to worry about its availability and scalability. Although I don't expect too much traffic for this app, it can peak at a couple of thousand concurrent users.
Architectural thoughts?
Long-term, Firebase may have more advanced querying, so hopefully it'll support this sort of thing directly without you having to do anything special. Until then, you have a few options:
Write server code to handle the searching. The easiest way would be to run some server code responsible for the indexing/searching, as you mentioned. Firebase has a Node.JS client, so that would be an easy way to interface the service into Firebase. All of the data transfer could still happen through Firebase, but you would write a Node.JS service that watches for client "search requests" at some designated location in Firebase and then "responds" by writing the result set back into Firebase, for the client to consume.
Store the index in Firebase with clients automatically updating it. If you want to get really clever, you could try implementing a server-less scheme where clients automatically index their data as they write it... So the index for the full-text search would be stored in Firebase, and when a client writes a new item to the collection, it would be responsible for also updating the index appropriately. And to do a search, the client would directly consume the index to build the result set. This actually makes a lot of sense for simple cases where you want to index one field of a complex object stored in Firebase, but for full-text-search, this would probably be pretty gnarly. :-)
Store the index in Firebase with server code updating it. You could try a hybrid approach where the index is stored in Firebase and is used directly by clients to do searches, but rather than have clients update the index, you'd have server code that updates the index whenever new items are added to the collection. This way, clients could still search for data when your server is down. They just might get stale results until your server catches up on the indexing.
Until Firebase has more advanced querying, #1 is probably your best bet if you're willing to run a little server code. :-)
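Whichever option you pick, the index itself can be quite small for a few thousand static objects. Below is a minimal Python sketch of an inverted index (token to list of object IDs) plus an AND-style search over it. The object IDs and function names are made up for illustration, and nothing here talks to Firebase; the point is only that the index is a plain dict of lists, so it could be stored as JSON and consumed by clients or by server code.

```python
import re
from collections import defaultdict
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each token to the sorted list of object IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index: Dict[str, List[str]], query: str) -> List[str]:
    """Return IDs of objects containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)
```

This is exact-token matching only, with no stemming, ranking or prefix search, so treat it as a starting point rather than real full-text search.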
Google's current method to do full text search seems to be syncing with either Algolia or BigQuery with Cloud Functions for Firebase.
Here's Firebase's Algolia Full-text search integration example, and their BigQuery integration example that could be extended to support full search.
