Monitoring table changes in DynamoDB - amazon-dynamodb

I am using AWS DynamoDB in order to store information.
I have two machines running separate programs that access the information in the database.
One of the machines is writing into the database and the second one is reading.
Since the second one does not know whether or not the information in the database has been changed I need to somehow monitor my database for changes.
I know that there is something called DynamoDB Streams that can provide information about changes made to your database, and I already have that code implemented.
The question is as follows: if I am monitoring the database constantly, I need to query this stream all the time, let's say once every minute.
What is the difference between doing that and actually querying the database every minute?
Is it much more efficient?
Is it less costly (in resources and money)?
Is there any other, more efficient way of monitoring changes in the database in a specific table from the code?
Any help would be appreciated, thank you.

Most people I have seen do something like this do it with DynamoDB Streams + Lambda for best results. Definitely check out the DynamoDB docs and the Lambda docs on this topic.
There's also an example in the docs of monitoring DynamoDB where changes fire off a message to an SNS topic.
DynamoDB Streams is more efficient than polling the table and delivers changes in near real time. Think of using Lambda in this way like you would a trigger in a relational database. Why go to the extra effort when the pattern is this well defined and people use it all the time?
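To make the trigger idea concrete, here is a minimal sketch of a Python Lambda handler attached to a table's stream. It assumes the standard stream event shape (a `Records` list whose entries carry `eventName` and a `dynamodb` section); the function names `summarize_record` and `handler` are illustrative, and whether `NewImage` or `OldImage` is present depends on the stream view type configured on the table.

```python
from typing import Any, Dict, Optional


def summarize_record(record: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Pull the interesting parts out of one DynamoDB stream record."""
    change = record.get("dynamodb", {})
    return {
        "event": record.get("eventName"),    # INSERT, MODIFY or REMOVE
        "keys": change.get("Keys", {}),      # primary key of the changed item
        "new_image": change.get("NewImage"),  # may be absent, depending on view type
        "old_image": change.get("OldImage"),  # may be absent, depending on view type
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Entry point Lambda would call with a batch of stream records."""
    changes = [summarize_record(r) for r in event.get("Records", [])]
    for change in changes:
        # Replace this print with whatever the reader machine needs,
        # e.g. refreshing a cache entry or sending a notification.
        print(change["event"], change["keys"])
    return {"processed": len(changes)}
```

With this in place the reading machine does not poll at all: the function runs only when the table actually changes.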

BigQuery vs Cloud SQL autoscaling?

I should say up front that I am a beginner with Google Cloud Platform.
I am developing a web application in React using Firebase, so all data is saved in Firestore.
Now I need to have a relational database, and I am very confused as to which is the best between Cloud SQL and BigQuery.
My idea was to have one part of the data on Cloud SQL and the other part on Firestore.
When an event happens, the data from Cloud SQL and firestore are merged and uploaded to BigQuery for analysis.
Example:
On Firestore I have a product that has an array field where IDs are
stored. These IDs are related to the Database saved on Cloud SQL. When
an order is placed it is added to a collection on Firestore and
appended to the database on BigQuery.
My problem is that, from what I have read, there is no possibility of autoscaling on Cloud SQL, while BigQuery does autoscale.
So my question is can you autoscale on CloudSQL?
If it can't be done, is it correct to use BigQuery exclusively?
Is there another solution on GCP that allows you to have a relational database but with autoscaling?
Edit 1
This is the very simplified model of a part of the database on CloudSQL / BigQuery
I'll use a query with two or three inner joins to get all the values I need.
I don't know how to make it non-relational, and therefore be able to use Firestore without a large duplication of data. I am open to any kind of advice.
Not sure that I understood correctly, but I reckon you would like to get some data (from one data source), combine/process that data with the data from a Firestore collection, and load/stream the result into BigQuery, with all of that happening operationally at run time. The question is about the choice of that data source: either Cloud SQL or BigQuery.
Am I right that, from your point of view, the main Cloud SQL drawback is a lack of scalability (autoscaling), and that you would like to consider BigQuery instead of Cloud SQL because of the 'autoscale'?
It is not clear what is the rate of the request/queries you expect, and where the data is located (any requirements on a global access), so it may be difficult to discuss the situation. Anyway...
Thinking about BigQuery: in my opinion, this is a great "database" (the best from my point of view), but mainly for analytical purposes. Each query has some 'initial' latency (the query job won't be executed faster than some threshold), which cannot be significantly reduced, and there are no binary indexes on BigQuery tables. It means that your query will take a few seconds (let's assume 3 or more) every time you run it (unless the result is taken from the cache). If the number of requests is significant, it may become expensive both in BigQuery and in the component used to process the task (e.g. a Cloud Function triggered by some event), as the latter has to wait (and do nothing) during the query time.
In addition, BigQuery is very good at loading or streaming data into it, but not very good at regular data updates inside it; there are plenty of limitations. Thus, depending on your context, it may not be a very good idea to maintain operational data in BigQuery.
If I rule out BigQuery:
Can we sacrifice 'autoscalability' for Cloud SQL?
Can we use a Firestore collection instead of Cloud SQL (and sacrifice the 'relational' property)?
Can we use Cloud SQL and manage the amount of data in the tables that are used for querying, so that there are no delays?
Not sure if I managed to help, but at least I provided some thoughts about the problem.
'Now I need to have a relational database, and I am very confused as to which is the best between Cloud SQL and BigQuery.'
Please be aware that BigQuery cannot be used as a substitute for a relational database; it is oriented toward running analytical queries, not simple CRUD operations and queries (like in Cloud SQL). That doesn’t mean BigQuery can’t handle normalized data and joins. It absolutely can. It just performs better on denormalized data because BigQuery is essentially an OLAP engine. So, denormalize whenever possible (please read here).
You can use read replicas to scale Cloud SQL. Read replica instances allow data from the primary instance to be replicated to one or more replicas. This setup can provide increased read throughput. Please see this.

Firebase Document Write Limit

Hey so with my current feed database design, I am using Redis for the cache for super-fast reads, which are routed through my Google Cloud Functions. The Redis database handles all post data and timeline updates, which is great and all, but I forgot one of the most considerable caveats to this. Firebase Firestore only permits one document write per second, meaning that if I have a document that stores the post data (post_id, user_id, content, like_count), the like_count would be impossible to track with the possibility for many likes per second. Does anyone have any solutions to this?
You can shard your counter among multiple documents and query them in aggregate as needed.
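As a rough illustration of the sharding idea, here is a minimal sketch in Python. It is an in-memory stand-in, not Firestore code: the dictionary plays the role of a subcollection of shard documents, and the class name `ShardedCounter` and the shard naming are assumptions made for the example. In a real setup each increment would be a write to one randomly chosen shard document, and the total would come from reading all shards and summing them.

```python
import random
from typing import Dict


class ShardedCounter:
    """In-memory model of a counter split across several shard documents."""

    def __init__(self, num_shards: int = 10) -> None:
        # Each key stands in for one shard document holding a partial count.
        self.shards: Dict[str, int] = {f"shard_{i}": 0 for i in range(num_shards)}

    def increment(self, amount: int = 1) -> None:
        # Pick a random shard so concurrent likes spread across documents
        # instead of all hitting the same one.
        shard_id = random.choice(list(self.shards))
        self.shards[shard_id] += amount

    def total(self) -> int:
        # Reading the counter means summing every shard.
        return sum(self.shards.values())


likes = ShardedCounter(num_shards=5)
for _ in range(100):
    likes.increment()
print(likes.total())
```

The trade-off is that reads become more expensive (one read per shard), so the shard count is usually chosen to match the expected peak write rate rather than set arbitrarily high.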
You can also try a Cloud Tasks queue to smooth out the write frequency. It will add considerable complexity to the system, but it is really the only generic way in GCP to manage the rate of some work. This might not work out the way you need, however.
If you use Cloud Tasks, your task will need to be configured with a rate limit, and it will have to deliver the document data to write to yet another function or other HTTP endpoint that will perform the write.

Can we use Firebase with Amazon DynamoDB?

I was wondering if it was possible and also whether someone can share their experience of using both. Actually, for me, the real-time thing is not that important; I care more about the NoSQL database. I really care about pricing, and I can see that Firebase prices are okay for almost everything (authentication is even free), but the database is very expensive in my opinion ($5/GB stored and $1/GB downloaded ...). That's why I want to use DynamoDB for the database, because it's way cheaper.
What do you think?
Finally, Firebase just released a new NoSQL service that is somehow similar to DynamoDB. No need to connect to DynamoDB anymore!

Is DynamoDB streams the right option for this use case?

I have a DynamoDB table that contains key value pairs that will be read by a number of applications. On startup each application will read the entire table and cache it in-memory.
The problem I'm trying to solve is that of getting the applications to update their cache if one or more items in the DynamoDB table have been modified.
DynamoDB streams initially seemed to be the right approach to solving the problem. I have implemented the consumer using Kinesis Client Library (KCL) as recommended by AWS. While implementing it, however, I have encountered some problems that make me believe that I'm on the wrong track. Specifically:
When I create a new consumer using KCL, it creates a new DynamoDB table to do the housekeeping of leases and checkpoints, such that when the application is restarted, KCL knows which records have been consumed and which have not. This is not what I need for this problem. Any stream records that are created while the application is offline are irrelevant, since the entire table is read upon application startup.
Several instances of the same application are running at the same time. Each of them needs to be notified of table updates. To implement that in KCL I need to assign a unique application name to each of them. Otherwise they will share the lease table and only one of the applications will get notified. One table for each application instance doesn't seem right. Also I would then need something to remove unused tables.
I also implemented it using the low level API instead. That works fine when there's a single shard. My implementation doesn't handle re-sharding like KCL, however, so it's too fragile. It seems wrong to have to implement handling of re-sharding for the simple problem I'm trying to solve.
I'm beginning to consider other solutions like:
Implementing a Lambda function that gets triggered on updates to the table. The function sends a notification to an SNS topic. Consumers create SQS subscriptions on the topic and get notified that way. This solution has too many moving parts for my liking.
Make the applications periodically re-read the entire table and determine themselves if changes have been made. This solution feels a bit primitive, but seems to be the simplest.
All solutions that I have considered so far have quite significant drawbacks. What am I missing?
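For reference, the periodic re-read option comes down to comparing two snapshots of the table. A minimal sketch, assuming each snapshot has already been loaded into a plain dictionary keyed by the item key (the function name `diff_snapshots` is illustrative):

```python
from typing import Any, Dict, Tuple

Snapshot = Dict[str, Any]


def diff_snapshots(old: Snapshot, new: Snapshot) -> Tuple[Snapshot, Snapshot, Snapshot]:
    """Return (added, removed, changed) between two full-table snapshots."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, changed


cache = {"timeout": "30", "region": "eu"}
fresh = {"timeout": "60", "retries": "3"}
added, removed, changed = diff_snapshots(cache, fresh)
print(added, removed, changed)
```

Every poll still costs a full read of the table, so this only stays cheap while the table is small and the polling interval is long.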
It depends on how your KCL is pushing to the dependent apps, but I believe the SQS path is the correct choice.
You can add a presumably infinite number of consumers without being throttled.
When you do add another dependent app, it won't require changing your KCL to push to it, the new app will simply watch the SQS queue.
You gain the ability to monitor the queue when issues happen.
More moving parts to set up, but once you have the Streams -> SNS -> SQS pipe in place, it's basically bulletproof.
Just my 2¢.
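The first hop of that pipe (stream record to SNS message) can be sketched roughly as below. This is an assumption-laden outline rather than a drop-in function: `make_handler` is a hypothetical helper, the topic ARN is a placeholder, and the publish callable is injected so the code runs without AWS access. In a deployed Lambda you would pass `boto3.client("sns").publish`, which accepts the `TopicArn` and `Message` keyword arguments used here.

```python
import json
from typing import Any, Callable, Dict


def make_handler(publish: Callable[..., Any], topic_arn: str) -> Callable[[Dict[str, Any], Any], int]:
    """Build a Lambda-style handler that forwards stream records to SNS."""

    def handler(event: Dict[str, Any], context: Any) -> int:
        sent = 0
        for record in event.get("Records", []):
            message = {
                "event": record.get("eventName"),
                "keys": record.get("dynamodb", {}).get("Keys", {}),
            }
            # One SNS message per changed item; every SQS queue subscribed
            # to the topic receives its own copy.
            publish(TopicArn=topic_arn, Message=json.dumps(message))
            sent += 1
        return sent

    return handler


# Local dry run with a fake publisher instead of a real SNS client.
outbox = []
fake_publish = lambda **kwargs: outbox.append(kwargs)
handler = make_handler(fake_publish, "arn:aws:sns:us-east-1:123456789012:table-changes")  # placeholder ARN
handler({"Records": [{"eventName": "INSERT", "dynamodb": {"Keys": {"k": {"S": "a"}}}}]}, None)
```

Each application instance then owns one SQS queue subscribed to the topic, which avoids the per-instance KCL lease tables described in the question.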
Nowadays an AWS AppSync GraphQL API with subscriptions may be the simplest approach to power this type of application, with the least number of moving parts.
Whenever one of your applications starts up, it connects to your AppSync GraphQL API using the Amplify framework or AppSync SDK and subscribes to the updates it's interested in. Then whenever an application updates information in the table via your GraphQL API, all your other applications will be notified of the change, along with the relevant changed data.
AppSync integrates well with DynamoDB out of the box, allowing you to generate DynamoDB tables with appropriate indexes alongside your GraphQL or generate GraphQL from your existing DynamoDB tables if you so choose. Amplify can even help you automatically generate an AppSync GraphQL API at a higher level with associated DynamoDB tables, indexes, entity relationships, and more, like Elasticsearch search capabilities, by using its GraphQL transformers.

Synchronize Postgres Server Database to SQLite Client database

I am trying to create an app that receives an SQLite database from a server for offline use, with cloud synchronization. The server has a Postgres database with information from many clients.
1) Is it better to delete the SQLite database and create a new one from a query, or to try to synchronize and update the existing separate SQLite files (or is there another, better solution)? The refreshes will happen a few times a day per client.
2) If it is the latter, could you give me any leads to resources on how I could do this?
I am pretty new to database applications so please excuse my ignorance and let me know if there is any way I could clarify.
There is no one size fits all approach here. You need to carefully consider exactly what needs to be done, what you are replicating, how much data is involved, and what your write models are, all before you build a solution. Along the way you have to decide how to handle write conflicts and more.
In general the one thing I would say is that such synchronization works best with append-only write models (i.e. inserts, no deletes, no updates), and one way to do it is to log changes that need to be made and replicate those changes.
However, master-master replication is difficult on the best of days and with the best of tools available. Jumping between databases with very different capabilities will introduce a number of additional problems. You are in for a big job.
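To illustrate the log-and-replay idea for the simple append-only case, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the `(change_id, key, value)` tuple format, and the `apply_changes` name are all assumptions made for the example; in practice the log would come from a query against the server's Postgres database, and this sketch deliberately ignores updates, deletes, and conflicts.

```python
import sqlite3
from typing import Iterable, Tuple

Change = Tuple[int, str, str]  # (change_id, key, value), assumed log format


def apply_changes(conn: sqlite3.Connection, changes: Iterable[Change], last_applied_id: int) -> int:
    """Replay logged inserts newer than last_applied_id; return the new high-water mark."""
    for change_id, key, value in sorted(changes):
        if change_id <= last_applied_id:
            continue  # already replicated in an earlier sync
        conn.execute(
            "INSERT INTO items (change_id, key, value) VALUES (?, ?, ?)",
            (change_id, key, value),
        )
        last_applied_id = change_id
    conn.commit()
    return last_applied_id


# Client-side database (in memory here; a file path on a real device).
client = sqlite3.connect(":memory:")
client.execute("CREATE TABLE items (change_id INTEGER PRIMARY KEY, key TEXT, value TEXT)")

server_log = [(1, "a", "x"), (2, "b", "y")]
mark = apply_changes(client, server_log, 0)
```

The client only has to remember the last change id it applied, so each refresh transfers just the rows added since the previous one instead of the whole database.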
Here's an open source product that claims to solve this for many database types including Postgres. I have no affiliation or commercial interest in this company.
https://github.com/sqlite-sync/SQLite-sync.com
http://sqlite-sync.com/
If you're able and willing to step outside relational databases to use an object store you might want to have a look at CouchDB and perhaps PouchDB, which use an MVCC-based replication protocol designed to support multi-master replication including conflict resolution. Under the covers, PouchDB uses adapters for SQLite, IndexedDB, local storage or a remote CouchDB instance to persist client-side data. It auto-selects the best client-side storage option for the given desktop or mobile browser. The SQLite engine can be either WebSQL or a Cordova SQLite plugin.
http://couchdb.apache.org/
https://pouchdb.com/
