What is the best way to integrate DynamoDB stream with CloudSearch?

I'm using DynamoDB pretty heavily for a service I'm building. A new client request has come in that requires CloudSearch. I see that a CloudSearch domain can be created from a DynamoDB table via the AWS console.
My question is this:
Is there a way to automatically offload data from a DynamoDB table into a CloudSearch domain via the API, or otherwise at a specified time interval?
I'd prefer this to manually offloading DynamoDB documents to CloudSearch. All help greatly appreciated!

Here are two ideas.
The official AWS way of searching DynamoDB data with CloudSearch
This approach is described pretty thoroughly in the "Synchronizing a Search Domain with a DynamoDB Table" section of http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-dynamodb-data.html.
The downside is that it sounds like a huge pain: you have to either re-create new search domains or maintain an update table in order to sync, and you'd need a cron job or something to execute the script.
The AWS Lambda way
Use the newer AWS Lambda event processing service. It is pretty simple to set up an event stream based on DynamoDB (see http://docs.aws.amazon.com/lambda/latest/dg/wt-ddb.html).
Your Lambda would then submit a search document to CloudSearch based on the DynamoDB event. For an example of submitting a document from a Lambda, see https://gist.github.com/fzakaria/4f93a8dbf483695fb7d5
This approach is a lot nicer in my opinion as it would continuously update your search index without any involvement from you.
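For a rough idea of what that Lambda could look like (the gist above is Node.js), here is a minimal Python sketch, assuming the table's stream is configured with NEW_AND_OLD_IMAGES, a single-attribute key, and a hypothetical CloudSearch document endpoint supplied via an environment variable:

    import json
    import os
    import boto3

    # Hypothetical document endpoint of the CloudSearch domain, supplied as an env var.
    DOC_ENDPOINT = os.environ["CLOUDSEARCH_DOC_ENDPOINT"]
    cloudsearch = boto3.client("cloudsearchdomain", endpoint_url=DOC_ENDPOINT)

    def _fields(image):
        # Flatten DynamoDB-typed attributes ({"S": "x"}, {"N": "1"}) into plain values.
        return {name: list(attr.values())[0] for name, attr in image.items()}

    def handler(event, context):
        batch = []
        for record in event["Records"]:
            key_attr = next(iter(record["dynamodb"]["Keys"].values()))  # single-attribute key assumed
            doc_id = next(iter(key_attr.values()))                      # e.g. {"S": "123"} -> "123"
            if record["eventName"] in ("INSERT", "MODIFY"):
                batch.append({"type": "add", "id": doc_id,
                              "fields": _fields(record["dynamodb"]["NewImage"])})
            else:  # REMOVE
                batch.append({"type": "delete", "id": doc_id})
        if batch:
            # One CloudSearch batch upload per Lambda invocation.
            cloudsearch.upload_documents(documents=json.dumps(batch).encode("utf-8"),
                                         contentType="application/json")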

I'm not so clear on how Lambda would always keep the data in sync with the data in DynamoDB. Consider the following flow:
1. The application updates a DynamoDB table's record A (say to A1).
2. Very closely after that, the application updates the same table's same record A (to A2).
3. The trigger for step 1 causes its Lambda to start executing.
4. The trigger for step 2 causes its Lambda to start executing.
5. Step 4 completes first, so CloudSearch sees A2.
6. Now step 3 completes, so CloudSearch sees A1.
Lambda triggers are not guaranteed to start ONLY after the previous invocation is complete (correct me if I'm wrong, and provide a link).
As we can see, the data goes out of sync.
The closest thing I can think of that would work is to use AWS Kinesis Streams, but only with a single shard (1 MB/s ingestion limit). If that restriction is acceptable, your consumer application can be written so that records are processed strictly sequentially, i.e. the next record is only put into CloudSearch after the previous record has been put.
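As a rough sketch of that single-shard, strictly sequential consumer using boto3 (the stream name and processing logic are hypothetical; a production consumer would also checkpoint its position, e.g. with the KCL):

    import time
    import boto3

    kinesis = boto3.client("kinesis")
    STREAM = "my-update-stream"   # hypothetical single-shard stream name

    def process(record):
        # Push the record to CloudSearch here; return only once the upload succeeded,
        # so the next record is never handled before the previous one is indexed.
        ...

    def consume_sequentially():
        shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
        iterator = kinesis.get_shard_iterator(StreamName=STREAM, ShardId=shard_id,
                                              ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
        while iterator:
            resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
            for record in resp["Records"]:
                process(record)          # strictly one at a time, in shard order
            iterator = resp["NextShardIterator"]
            time.sleep(1)                # stay under the per-shard read limits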

Related

Verify DynamoDB is healthy

I would like to verify in my service's /health check that I have a connection to my DynamoDB.
I am looking for something like SELECT 1 in MySQL (it only pings the DB and returns 1), but for DynamoDB.
I saw this post, but searching for a nonexistent item is an expensive action.
Any ideas on how to just ping my DB?
I believe the SELECT 1 equivalent in DynamoDB is a Scan with a Limit of 1 item. You can read more here.
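A minimal sketch of that check with boto3, using a hypothetical table name (any successful response, even an empty one, proves connectivity and consumes only a tiny amount of read capacity):

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    dynamodb = boto3.client("dynamodb")

    def dynamodb_is_healthy(table_name="my-table"):   # hypothetical table name
        # Roughly the "SELECT 1" of DynamoDB: scan at most one item and treat any
        # successful response as proof of connectivity.
        try:
            dynamodb.scan(TableName=table_name, Limit=1)
            return True
        except (ClientError, EndpointConnectionError):
            return False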
DynamoDB is a managed service from AWS and is highly available anyway. Instead of using a query to verify the health of DynamoDB, why not set up CloudWatch metrics on your table and check for recent alarms in CloudWatch concerning DynamoDB? This will also keep you from spending your read units.
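One way to wire that into a /health endpoint, sketched with boto3 and a hypothetical alarm-name prefix (it assumes you have already created CloudWatch alarms on the table's metrics, e.g. throttled requests and system errors):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def table_alarms_firing(alarm_prefix="dynamodb-my-table"):   # hypothetical naming scheme
        # Instead of querying the table, ask CloudWatch whether any of the alarms
        # defined on the table's metrics are currently in the ALARM state.
        resp = cloudwatch.describe_alarms(AlarmNamePrefix=alarm_prefix, StateValue="ALARM")
        return [a["AlarmName"] for a in resp["MetricAlarms"]]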
The question is perhaps too broad to answer as stated. There are many ways you could set this up, depending on your concerns and constraints.
My recommendation would be not to over-think, or over-do it, in terms of verifying connectivity from your service host to DynamoDB: for example, just performing a periodic GetItem should be sufficient to establish basic network connectivity.
Instead of going about the problem from this angle, perhaps you might want to consider a different approach:
a) set up canary tests that exercise all your service features periodically -- these should be "fail-fast", lightweight tests that run constantly, and in the event of consistent failure you can take action
b) set up error metrics from your service and monitor those metrics: for example, CloudWatch allows you to take action on metrics -- you will likely get more mileage out of this approach than narrowly focusing on a single failure mode (i.e. DynamoDB, which, as others have stated, is a managed service with a very good availability SLA)
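A sketch of option (b): have the service publish a custom error metric and alarm on it (the namespace, metric, and dimension names here are made up for illustration):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def record_dependency_error(dependency="DynamoDB"):
        # Emit a custom error metric from the service; a CloudWatch alarm on this metric
        # (e.g. "more than N errors in 5 minutes") then drives the alerting, rather than
        # a hand-rolled connectivity probe.
        cloudwatch.put_metric_data(
            Namespace="MyService",                      # hypothetical namespace
            MetricData=[{
                "MetricName": "DependencyErrors",
                "Dimensions": [{"Name": "Dependency", "Value": dependency}],
                "Value": 1,
                "Unit": "Count",
            }],
        )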

Possible to create pipeline that writes an SQL database to MongoDB daily?

TL;DR I'd like to combine the power of BigQuery with my MERN-stack application. Is it better to (a) use nodejs-bigquery to write a Node/Express API directly against BigQuery, or (b) create a daily job that writes my (entire) BigQuery DB over to MongoDB, and then use mongoose to write a Node/Express API with MongoDB?
I need to determine the best approach for combining a data ETL workflow that creates a BigQuery database with a react/node web application. The data ETL uses Airflow to create a workflow that (a) backs up daily data into GCS, (b) writes that data to a BigQuery database, and (c) runs a bunch of SQL to create additional tables in BigQuery. It seems to me that my only two options are to:
Do a daily write/convert/transfer/migrate (whatever the correct verb is) from the BigQuery database to MongoDB. I already have a node/express API written using mongoose, connected to a MongoDB cluster, and this approach would allow me to keep that API.
Use the nodejs-bigquery library to create a node API that is directly connected to BigQuery. My app would change from a MERN stack to a (BQ)ERN stack. I would have to re-write the node/express API to work with BigQuery, but I would no longer need MongoDB (nor have to transfer data daily from BigQuery to Mongo). However, BigQuery can be a very slow database if I am looking for a single entry, since it's not meant to be used like Mongo or a SQL database (it has no indexes, so a single-row retrieval runs as a full table scan). Most of my API calls are for very little data from the database.
I am not sure which approach is best. I don't know if having two databases for one web application is a bad practice. I don't know if it's possible to do (1) with the daily transfers from one DB to the other, and I don't know how slow BigQuery will be if I use it directly with my API. I think that if it is easy to add (1) to my data engineering workflow, it is the preferred option, but again, I am not sure.
I am going with (1). It shouldn't be too much work to write a Python script that queries tables from BigQuery, transforms the data, and writes collections to Mongo. There are some things to handle (incremental changes, etc.), however this is much easier than writing a whole new node/BigQuery API.
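A minimal sketch of such a script with google-cloud-bigquery and pymongo, assuming hypothetical table/collection names, a stable id column, and column types that map cleanly onto BSON:

    from google.cloud import bigquery
    from pymongo import MongoClient, ReplaceOne

    bq = bigquery.Client()                              # uses application default credentials
    mongo = MongoClient("mongodb://localhost:27017")    # hypothetical connection string
    target = mongo["appdb"]["listings"]                 # hypothetical database/collection

    def sync_table(table="myproject.analytics.listings", key="listing_id"):
        # Pull the BigQuery table, transform rows as needed, and upsert into Mongo
        # keyed on a stable id so re-running the daily job is idempotent.
        rows = bq.query(f"SELECT * FROM `{table}`").result()
        ops = [ReplaceOne({"_id": row[key]}, {**dict(row), "_id": row[key]}, upsert=True)
               for row in rows]
        if ops:
            target.bulk_write(ops)

    if __name__ == "__main__":
        sync_table()    # e.g. scheduled as one more task in the existing Airflow DAG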
FWIW, in a past life I worked on a web ecommerce site that had 4 different DB back ends (Mongo, MySQL, Redis, ElasticSearch), so more than one is not an issue at all, but you need to consider one of them the DB of record, i.e. if anything does not match between them, one is the source of truth and the other is suspect. In my example, Redis and ElasticSearch were nearly ephemeral: blow them away and they get recreated from the underlying MySQL and Mongo sources. Having MySQL and Mongo at the same time was a bit odd, but that was because we were doing a slow-roll migration. This means various record types were being transitioned from MySQL over to Mongo. The process looked a bit like:
- The ORM layer writes to both MySQL and Mongo; reads still come from MySQL.
- Data is regularly compared.
- A few months elapse with no irregularities, writes to MySQL are turned off, and reads are moved to Mongo.
The end goal was no more MySQL; everything was Mongo. I ran down that tangent because it seems like you could do something similar: write to both DBs in whatever DB abstraction layer you use (ORM, DAO, other things I don't keep up to date with, etc.) and eventually move the reads, as appropriate, to wherever they need to go. If you need large batches for writes, you could buffer at that abstraction layer until a threshold of your choosing is reached before sending them.
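Sketching that dual-write idea in Python terms (every class and method name here is invented for illustration; the in-memory backends just stand in for real Mongo/SQL/BigQuery-backed DAOs):

    class DualWriteRepository:
        # Writes go to both stores; reads come from whichever is currently the
        # source of truth. Flip the read side once the new store is trusted.
        def __init__(self, primary, shadow):
            self.primary = primary   # source of truth (e.g. the existing Mongo-backed DAO)
            self.shadow = shadow     # the store being migrated to (or kept in sync)

        def save(self, key, doc):
            self.primary.save(key, doc)
            try:
                self.shadow.save(key, doc)   # could be buffered/batched in a real system
            except Exception:
                pass                         # log and reconcile later; don't fail the request

        def get(self, key):
            return self.primary.get(key)

    class InMemoryStore:
        # Stand-in backend so the sketch runs on its own; real backends would wrap
        # pymongo / SQL / BigQuery clients behind the same save/get interface.
        def __init__(self):
            self.data = {}

        def save(self, key, doc):
            self.data[key] = doc

        def get(self, key):
            return self.data.get(key)

    repo = DualWriteRepository(InMemoryStore(), InMemoryStore())
    repo.save("listing-1", {"price": 100})
    print(repo.get("listing-1"))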
With all that said, depending on your data complexity, a nightly ETL job would be completely doable as well, but you do run into the extra complexity of managing and monitoring that additional process. Another potential downside is that the data is always stale by up to a day.

What are the ways to access the Flink state from outside the Flink cluster?

I am new to Apache Flink and building a simple application where I am reading events from a Kinesis stream, say something like
    TestEvent {
        String id,
        DateTime created_at,
        Long amount
    }
and performing an aggregation (sum) on the field amount over the above stream, keyed by id. The transformation is equivalent to the SQL select sum(amount) from testevents group by id, where testevents are all the events received so far.
The aggregated result is stored in Flink state, and I want that result to be exposed via an API. Is there any way to do so?
PS: Can we store the Flink state in DynamoDB and create an API there? Or is there any other way to persist and expose the state to the outside world?
I'd recommend ignoring state for now and instead looking at sinks as the primary way for a streaming application to output results.
If you are already using Kinesis for input, you could also use Kinesis to output the results from Flink. You can then use the Kinesis adapter for DynamoDB that is provided by AWS as further described on a related stackoverflow post.
Coming back to your original question: you can query Flink's state and ship a REST API together with your streaming application, but that's a whole lot of work that is not needed to achieve your goal. You could also access checkpointed/savepointed state through the State Processor API, but again that's quite a bit of manual work that can be saved by going the usual route outlined above.
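To make the sink-based route concrete, here is a minimal PyFlink sketch of the keyed running sum, with an in-memory collection and a print sink standing in for the Kinesis source and sink (the real job would use the Kinesis connectors on both ends):

    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    # Stand-in for the Kinesis source: (id, amount) pairs.
    events = env.from_collection(
        [("a", 10), ("b", 5), ("a", 7)],
        type_info=Types.TUPLE([Types.STRING(), Types.LONG()]))

    # Keyed running sum: emits an updated (id, total) every time an event arrives.
    running_sums = (events
                    .key_by(lambda e: e[0])
                    .reduce(lambda a, b: (a[0], a[1] + b[1])))

    # Stand-in for the real sink (Kinesis, DynamoDB, ...): prints each updated sum per id.
    running_sums.print()

    env.execute("sum-amount-by-id")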
Flink's documentation provides some use cases for queryable state.
You can also use the State Processor API to read the state offline.

How to implement outbox pattern in Cosmos DB

I'm looking to implement support for the outbox pattern in Cosmos DB.
However, Cosmos DB doesn't seem to support transactions across collections.
Then how do I do it?
I've been considering a few approaches to implement this:
Use Service Bus transactions
Within a Service Bus transaction scope, send the message (not committed just yet), do the Cosmos DB update and, if it works, commit the Service Bus transaction so the message is made available to subscribers.
Use triggers to insert documents into the outbox collection
As inserts/updates happen, we use Cosmos DB triggers to insert the respective messages into the outbox collection and, from then on, it's business as usual.
Use triggers to execute Azure Functions
Create Azure Functions triggered by Cosmos DB. I almost like this, but it would be so much better to get a message straight to Service Bus.
Use a data pump
Add two fields, UpdateTimestamp and OutboxMessageTimestamp. When a record is updated, so is the UpdateTimestamp.
Some process looks for records in which these two don't match and for each of those creates a notification message and relays it to the respective queues or topics.
Of course, then it updates the second timestamp so they match.
Other ideas on how to do this?
In general, you store things in your Cosmos DB collection. Then you have the change feed sending these changes to some observer (let's say an Azure Function). Your Azure Function can then do whatever you need: put the change in a queue for other consumers, save it into another collection projected differently, etc. Within your Azure Function you should implement your own dead-letter queue for failures that are not related to the function runtime (for example, a write to another collection failing due to an id conflict).
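A minimal sketch of that change-feed observer as a Python Azure Function relaying each change to Service Bus (the connection string, topic name, and the implied cosmosDBTrigger binding in function.json are assumptions, not part of the question):

    import azure.functions as func
    from azure.servicebus import ServiceBusClient, ServiceBusMessage

    # Hypothetical values; in a real app they come from application settings, and
    # function.json binds this function to the collection's change feed via a
    # cosmosDBTrigger input named "documents".
    SERVICE_BUS_CONN = "Endpoint=sb://..."   # hypothetical connection string
    TOPIC = "entity-changes"                 # hypothetical topic name

    def main(documents: func.DocumentList) -> None:
        if not documents:
            return
        with ServiceBusClient.from_connection_string(SERVICE_BUS_CONN) as client:
            with client.get_topic_sender(TOPIC) as sender:
                for doc in documents:
                    # Relay each changed document as an integration event; failures that
                    # are not transient should go to your own dead-letter store.
                    sender.send_messages(ServiceBusMessage(doc.to_json()))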
[UPDATE]
Let me add a bit more as a response to your comment.
From my experience, doing things atomically in distributed systems boils down to:
Always do things in the same order.
Make the second step idempotent (ensuring you can repeat it any number of times and get the same result).
Once the first step has succeeded, repeat the second step until it succeeds.
So, if you want to send an email when something is saved into Cosmos DB, you could:
Save the record in Cosmos DB.
Have an Azure Function listen to the change feed.
Once you receive the inserted document, send the email (a more robust solution would actually put it in a queue from which some dedicated consumer sends the emails).
An alternative would be to put the initial command (to save the record) in a queue and then have two consumers (one for saving and one for sending emails), but then you have a problem of ordering (if that's important for you).
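A toy sketch of that idempotency-plus-retry idea (the in-memory set stands in for a durable record of already-processed document ids, which a real implementation would persist):

    import time

    processed = set()   # stand-in for a durable store of already-handled document ids

    def send_email(doc):
        ...             # the real side effect (the second step)

    def handle_change(doc, max_attempts=5):
        # Make the second step idempotent and keep retrying it once the first step
        # (the Cosmos DB write that produced this change-feed document) has succeeded.
        if doc["id"] in processed:
            return
        for attempt in range(max_attempts):
            try:
                send_email(doc)
                processed.add(doc["id"])
                return
            except Exception:
                time.sleep(2 ** attempt)   # simple backoff; dead-letter after giving up
        raise RuntimeError(f"giving up on document {doc['id']}")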

Firebase Database Migration

Coming from a SQL background, I'm wondering how one goes about doing database migrations in Firebase?
Assume I have the following data in Firebase: {dateFrom: 2015-11-11, timeFrom: 09:00} .... and now the front-end client will store, and expects, data in the form {dateTimeFrom: 2015-11-11T09:00:00-07:00}. How do I update Firebase so that all dateFrom: xxxx and timeFrom: yyyy pairs are removed and replaced with dateTimeFrom: xxxxyyyy? Thanks.
You have to create your own script that reads the data, transforms it, and writes it back. You may either read one node at a time or read the whole DB if it is not big. You may also decide to leave the logic in your client and apply it when the client accesses the data (if it ever does).
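A minimal sketch of such a one-off script with the Firebase Admin SDK for Python, assuming a hypothetical service account file, database URL, and node name, and the fixed UTC offset from the question's example:

    import firebase_admin
    from firebase_admin import credentials, db

    # Hypothetical credentials file and database URL.
    cred = credentials.Certificate("service-account.json")
    firebase_admin.initialize_app(cred, {"databaseURL": "https://my-app.firebaseio.com"})

    ref = db.reference("bookings")        # hypothetical node holding the records
    records = ref.get() or {}

    for key, rec in records.items():
        if "dateFrom" in rec and "timeFrom" in rec:
            # Combine the legacy fields into the new one and drop the old keys;
            # the offset is hard-coded here to match the question's example.
            rec["dateTimeFrom"] = f"{rec.pop('dateFrom')}T{rec.pop('timeFrom')}:00-07:00"
            ref.child(key).set(rec)       # rewrite this record without the old fields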
I think you are looking for this: https://github.com/kevlened/fireway
I think it is a bad idea to pollute a project with conditionals that update data on the fly.
It is a shame Firestore doesn't implement a process for this, as it is very common and needed to keep the app and the DB in sync.
FWIW, since I'm using Swift and there isn't a solution like Fireway (that I know of), I've submitted a feature request to the Firebase team that they've accepted as a potential feature.
You can also submit a DB migration feature request to increase the likelihood that they create the feature.
