Streams with initial state - amazon-dynamodb

I would like to expose something like a subscription or a "sticky query": the goal is to query DynamoDB and return the results via the WebSockets API in API Gateway. Then, whenever DynamoDB changes in a way that would affect the query (I guess I could use Streams for that), I would like to notify the client(s). How can I make sure the client gets the initial list and all updates? In particular, I want to make sure the client doesn't miss any updates that happen right after the subscription is created and before the initial list of results is returned to it...

To inform your clients about changes in your DynamoDB table, DynamoDB Streams could be used. However, the information is only available for 24 hours.
Even if you write your updates to a Kinesis stream, the information will be available for a maximum of 7 days (according to the FAQ).
I suggest splitting your use case in two:
Create a service that returns the initial state of your "sticky query"
Create a stream that notifies your clients about updates to the "sticky query"
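A minimal sketch of that split, assuming Node.js/TypeScript Lambda handlers behind the WebSockets API (AWS SDK v3); the table names, route, and query key below are hypothetical:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, QueryCommand, ScanCommand } from "@aws-sdk/lib-dynamodb";
import { ApiGatewayManagementApiClient, PostToConnectionCommand } from "@aws-sdk/client-apigatewaymanagementapi";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const mgmt = new ApiGatewayManagementApiClient({ endpoint: process.env.WS_ENDPOINT });

// 1) The "subscribe" route: register the connection FIRST, then send the
//    initial state. An update arriving in between is then delivered twice at
//    worst, never missed; the client de-duplicates by item key/version.
export async function subscribe(event: any) {
  const connectionId = event.requestContext.connectionId;
  await ddb.send(new PutCommand({
    TableName: "Subscriptions",                       // hypothetical
    Item: { connectionId, query: "sticky-query-1" },
  }));
  const initial = await ddb.send(new QueryCommand({
    TableName: "Items",                               // hypothetical
    KeyConditionExpression: "pk = :pk",
    ExpressionAttributeValues: { ":pk": "sticky-query-1" },
  }));
  await mgmt.send(new PostToConnectionCommand({
    ConnectionId: connectionId,
    Data: JSON.stringify({ type: "initial", items: initial.Items }),
  }));
  return { statusCode: 200 };
}

// 2) The DynamoDB Streams trigger: fan each change out to all subscribers.
export async function onStream(event: any) {
  const subs = await ddb.send(new ScanCommand({ TableName: "Subscriptions" }));
  for (const record of event.Records) {
    for (const sub of subs.Items ?? []) {
      await mgmt.send(new PostToConnectionCommand({
        ConnectionId: sub.connectionId,
        Data: JSON.stringify({ type: "update", change: record.dynamodb }),
      })).catch(() => { /* stale connection; a real implementation deletes it */ });
    }
  }
}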

Related

What is the best practice for handling asynchronous API calls that take time?

Suppose I have an API to create a cloud instance asynchronously. After I make the API call it just returns a success response, but the cloud instance has not been initialized yet. It takes 1-2 minutes to create the cloud instance, and only after that is the cloud instance information (e.g. IP, hostname, OS) saved to the DB, which means I have to wait 1-2 minutes before I can fetch the data again to show the cloud information. At first I tried making a loading component, but the problem is that I don't know when the cloud instance is initialized (each instance takes a different amount of time to create). I'm considering using WebSockets or cron, or should I redesign my API? Has anyone designed an asynchronous system before, and how do you handle such a case?
If the API that you call gives you no information on when it's done with its asynchronous processing, it seems to me that you'll have to check at intervals until you find that the resource is ready; i.e. to poll it.
This seems to me to roughly fit the description and intent of the Polling Consumer pattern. In general, for asynchronous systems design, I can't recommend Enterprise Integration Patterns enough.
As others noted, you can either have a notification channel using WebSockets or poll the backend. Personally I'd probably go with the latter in this case and would actually create several APIs: one for initiating the work, which returns a URL containing a "job id" at which the status of the job can be polled.
RESTfully that would look something like:
POST /instances to initiate a job
GET /instances to see all the instances that are running/created/stopped
GET /instances/<id> to see the status of a specific instance (initiating, failed, running or whatever)
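As a rough illustration, a client-side sketch of that flow in TypeScript (the JSON shape and the status values are assumptions):

async function createInstanceAndWait(baseUrl: string) {
  // POST /instances kicks off the job and returns its id.
  const created = await fetch(`${baseUrl}/instances`, { method: "POST" });
  const { id } = await created.json();

  // Poll GET /instances/<id> until the job leaves the "initiating" state.
  while (true) {
    const res = await fetch(`${baseUrl}/instances/${id}`);
    const instance = await res.json();
    if (instance.status === "running") return instance;        // ip, hostname, os, ...
    if (instance.status === "failed") throw new Error("instance creation failed");
    await new Promise((resolve) => setTimeout(resolve, 5000)); // wait between polls
  }
}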
WebSockets would work, but might be overkill for this use case. I would probably display a status of 'creating' or something similar after receiving the success response from the API call, and then start polling the API to see if the creation process has finished.

What are the ways to access Flink state from outside the Flink cluster?

I am new to Apache Flink and building a simple application where I am reading events from a Kinesis stream, say something like
TestEvent {
    String id;
    DateTime created_at;
    Long amount;
}
and performing an aggregation (sum) on the field amount of the above stream, keyed by id. The transformation is equivalent to the SQL select sum(amount) from testevents group by id, where testevents are all the events received so far.
The aggregated result is stored in Flink state, and I want the result to be exposed via an API. Is there any way to do so?
PS: Can we store the Flink state in DynamoDB and create an API there? Or is there any other way to persist the state and expose it to the outside world?
I'd recommend ignoring state for now and instead looking at sinks as the primary way for a streaming application to output results.
If you are already using Kinesis for input, you could also use Kinesis to output the results from Flink. You can then use the Kinesis adapter for DynamoDB that is provided by AWS, as further described in a related Stack Overflow post.
Coming back to your original question: you can query Flink's state and ship a REST API together with your streaming application, but that's a whole lot of work that is not needed to achieve your goal. You could also access checkpointed/savepointed state through the state API, but again that's quite a bit of manual work that can be saved by going the usual route outlined above.
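As a sketch of the exposing side of that usual route, assuming Flink (via the Kinesis adapter) ends up writing the per-id sums into a DynamoDB table; the table, attribute, and route names are made up:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// GET /sums/<id>: return the latest sum(amount) written for this key.
export async function handler(event: { pathParameters: { id: string } }) {
  const res = await ddb.send(new GetCommand({
    TableName: "TestEventSums",               // hypothetical
    Key: { id: event.pathParameters.id },
  }));
  return {
    statusCode: res.Item ? 200 : 404,
    body: JSON.stringify(res.Item ?? { error: "unknown id" }),
  };
}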
Flink's documentation on Queryable State provides some use cases.
You can also read the state offline using the State Processor API.

How to implement outbox pattern in Cosmos DB

I'm looking to implement support for the outbox pattern in Cosmos DB.
However, Cosmos DB doesn't seem to support transactions across collections.
Then how do I do it?
I've been considering a few approaches to implement this:
Use Service Bus transactions
Within a Service Bus transaction scope, send the message (not committed just yet), do the Cosmos DB update and, if it works, commit the Service Bus transaction to make the message available to subscribers.
Use triggers to insert documents in the outbox collection
As inserts/updates happen, we use Cosmos DB triggers to insert the respective messages into the outbox collection, and from then on it's business as usual.
Use triggers to execute Azure Functions
Create Azure Functions as Cosmos DB triggers. I almost like this, but it would be so much better to get a message straight to Service Bus.
Use a data pump
Add two fields, UpdateTimestamp and OutboxMessageTimestamp. When a record is updated, so is its UpdateTimestamp.
Some process looks for records in which these two don't match, and for each of those creates a notification message and relays it to the respective queues or topics.
Of course, it then updates the second timestamp so they match.
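A sketch of such a data pump in TypeScript, assuming the @azure/cosmos and @azure/service-bus SDKs; the database, container, and queue names are hypothetical:

import { CosmosClient } from "@azure/cosmos";
import { ServiceBusClient } from "@azure/service-bus";

const container = new CosmosClient(process.env.COSMOS_CONN!)
  .database("app").container("entities");                    // hypothetical
const sender = new ServiceBusClient(process.env.SB_CONN!)
  .createSender("entity-events");                            // hypothetical

export async function pumpOnce(): Promise<void> {
  const { resources } = await container.items
    .query("SELECT * FROM c WHERE c.UpdateTimestamp != c.OutboxMessageTimestamp")
    .fetchAll();
  for (const doc of resources) {
    // Relay first, stamp second: a crash in between re-sends the message,
    // so this is at-least-once delivery and consumers must de-duplicate.
    await sender.sendMessages({ body: doc, messageId: `${doc.id}:${doc.UpdateTimestamp}` });
    doc.OutboxMessageTimestamp = doc.UpdateTimestamp;
    await container.item(doc.id, doc.id).replace(doc);       // assumes id is the partition key
  }
}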
Other ideas on how to do this?
In general, you store things in your Cosmos DB collection, then you have the change feed sending these changes to some observer (let's say an Azure Function). Your Azure Function can then do whatever: put the change in a queue for other consumers, save it into another collection projected differently, etc. Within your Azure Function you should implement your own dead-letter queue for failures that are not related to the function runtime (for example, writing to another collection failed due to an id conflict).
[UPDATE]
Let me add a bit more as a response to your comment.
From my experience, doing things atomically in distributed systems boils down to:
Always do things in the same order
Make the second step idempotent (ensuring you can repeat it any number of times and get the same result)
Once the first step has succeeded, repeat the second step until it succeeds
So, in case you want to send an email when something is saved into Cosmos DB, you could:
Save the record in Cosmos DB
Have an Azure Function listen to the change feed
Once you receive the inserted document, send the email (a more robust solution would actually put it in a queue from which some dedicated consumer sends the emails)
An alternative would be to have the initial command (to save the record) put in a queue and then have 2 consumers (one for saving and one for sending emails), but then you have a problem of ordering (if that's important to you).
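Going back to the change feed route, a minimal sketch of that observer, assuming the Azure Functions Node.js v4 programming model and the @azure/service-bus SDK; the names and connection settings are hypothetical:

import { app, InvocationContext } from "@azure/functions";
import { ServiceBusClient } from "@azure/service-bus";

const sender = new ServiceBusClient(process.env.SB_CONN!).createSender("entity-events");

app.cosmosDB("relayChanges", {
  connection: "CosmosConnection",        // app setting with the Cosmos DB connection
  databaseName: "app",                   // hypothetical
  containerName: "entities",             // hypothetical
  createLeaseContainerIfNotExists: true, // the change feed tracks progress in a lease container
  handler: async (documents: unknown[], context: InvocationContext) => {
    for (const doc of documents) {
      try {
        // messageId lets Service Bus duplicate detection (if enabled on the
        // queue) drop the re-sends that happen when the function retries.
        await sender.sendMessages({ body: doc, messageId: (doc as any).id });
      } catch (err) {
        context.error("relay failed", err);
        // as noted above, push to your own dead-letter store here
      }
    }
  },
});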

Is it possible to do multiple integration requests in one API gateway request?

The challenge
I would like to build a simple and fast (< 50 ms) API with API Gateway that works as an intelligent cache. A request to the API should batch-fetch some items from DynamoDB based on their keys. However, if one or more items are not found in DynamoDB, I would like to somehow trigger a Lambda function (which knows how to populate those missing items in DynamoDB for later requests).
My idea
The way I thought of doing this is by creating any missing items in DynamoDB with only their key, and then using DynamoDB Streams to invoke a Lambda function which reads the records and fills in their blank attributes with values from another external API.
Problem
My problem, however, is that as far as I can figure out I can ONLY make ONE request to DynamoDB (either a BatchGetItem or a conditional PutItem) per request to API Gateway. Is there some way of achieving what I want?
I would really like to use API Gateway, as this needs to scale quite aggressively and I would rather not handle the scaling of servers and applications.
Alternatives
An alternative would be to have the client code decide which items were missing from the response and then send a second request to "queue" those for population; however, I would really like that logic NOT to live on the client side, and I also want to avoid two network round trips.
I also looked into whether DynamoDB could do this "find or create" behaviour for me, but it seems there's no luck: What event can be triggered to fire a lambda function in DynamoDB?
If your API Gateway endpoint calls a Lambda function (instead of doing a DynamoDB proxy which wouldn't be a clean API anyway) then you can run whatever code you want. You can make as many calls to DynamoDB and other services as you want whenever an API Gateway call triggers the Lambda function. The only restriction is the amount of time your Lambda function can run.
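A sketch of that single-Lambda approach with AWS SDK v3; the table name and key schema are hypothetical:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchGetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "CacheItems"; // hypothetical

export async function handler(event: { keys: string[] }) {
  // One BatchGetItem call fetches up to 100 keys.
  const res = await ddb.send(new BatchGetCommand({
    RequestItems: { [TABLE]: { Keys: event.keys.map((k) => ({ pk: k })) } },
  }));
  const found = res.Responses?.[TABLE] ?? [];
  const foundKeys = new Set(found.map((item) => item.pk));

  // Write key-only placeholders for the misses; the Streams-triggered Lambda
  // from the question can then populate them for later requests.
  const misses = event.keys.filter((k) => !foundKeys.has(k));
  await Promise.all(misses.map((k) =>
    ddb.send(new PutCommand({
      TableName: TABLE,
      Item: { pk: k },
      ConditionExpression: "attribute_not_exists(pk)", // don't clobber populated items
    })).catch((e) => { if (e.name !== "ConditionalCheckFailedException") throw e; })
  ));

  return { statusCode: 200, body: JSON.stringify({ items: found, pending: misses }) };
}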

Message guarantees on rapidly updated entities in Firebase

I'd like to understand how Firebase and listening clients behave in the situation where a large number of updates are made to an entity in a short amount of time, and a client is listening to 'value' changes on that entity.
Say I have an entity in Firebase with some simple data.
{
    "entity": 1
}
And the value of that "entity" is updated very rapidly, with something like the code below that writes 1000 integers.
// pseudo-code for making 1000 writes as quickly as possible
for (var i = 0; i < 1000; i++) {
    ref.child('entity').set(i);
}
Ignoring transient issues, would a listening client using the 'on' API in a browser receive ALL 1000 notifications containing 0-999, or does Firebase have throttles in place?
First off, it's important to note that the Firebase realtime database is a state synchronization service, and is not a pub/sub service.
If you have a location that is updating rapidly, the service guarantees that eventually the state will be consistent across all clients, but not that all intermediate states will be surfaced. At most one event will fire for every update, but the server is free to 'squash' successive updates to the same location into one.
On the client making the updates, I think the current behavior is that every change fires a local event, but I could be wrong; this is a notable exception.
In order to achieve guaranteed delivery of every intermediate state, it's possible to push (childByAutoId in Objective-C) onto a list of events at a database location instead of simply updating the value directly. Check out the Firebase REST API docs on saving lists of data.
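For example, adapting the snippet from the question to push onto a list instead of overwriting one value (the entity_events path is made up):

// Every push() creates a unique, chronologically ordered child, so no
// intermediate state can be squashed away.
for (let i = 0; i < 1000; i++) {
  ref.child('entity_events').push(i);
}

// Listeners receive one child_added event per pushed value, in order.
ref.child('entity_events').on('child_added', (snapshot) => {
  console.log(snapshot.val());
});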
