Apache Flink: How to store intermediate data in a streaming application

I am implementing the Misra-Gries algorithm with Flink's DataStream API. It maintains k counters that summarize the stream by incrementing or decrementing them.
What is the best approach to store such counters when implementing the algorithm with the DataStream API? Currently I just declare a HashMap variable in the operator. Is this the right approach, or should I use some other feature like state?

You should store the counters in Flink's managed state, i.e., either keyed state or operator state, and enable checkpointing. Otherwise, the counters will be lost in case of a failure.
If state is used correctly and checkpointing is enabled, Flink periodically checkpoints the state of the application. In case of a failure, the job is restarted and its state is reset to the latest completed checkpoint.
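For context, the per-key summary such state would hold is small: Misra-Gries keeps at most k counters. A minimal, Flink-independent sketch of the update step in Python (the plain dict stands in for what would live in Flink's keyed MapState):

```python
def misra_gries_update(counters, item, k):
    """One Misra-Gries update step, maintaining at most k counters."""
    if item in counters:
        counters[item] += 1          # known heavy-hitter candidate: increment
    elif len(counters) < k:
        counters[item] = 1           # room for a new candidate
    else:
        # No room: decrement every counter, dropping those that hit zero.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]
    return counters

# Any element occurring more than n/(k+1) times is guaranteed to survive.
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
counters = {}
for x in stream:
    misra_gries_update(counters, x, k=2)
print(counters)  # {'a': 3} -- "a" dominates this stream
```

In the Flink job, the same logic would run inside a KeyedProcessFunction, with the dict replaced by MapState so it is checkpointed and restored as described above.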

Related

Cosmos DB - thread-safe pattern to allocate an 'available' document to each request

For example, suppose I was building an airline booking system and all of my seats were individual documents in a Cosmos container, with a PartitionKey of FlightNumber_DepartureDateTime, e.g. UAT123_20220605T1100Z, and an id of SeatNumber, e.g. 12A.
A request comes in to allocate a single seat (any seat without a preference).
I want to be able to query the Cosmos container for seats where allocated: false and allocate the first one to the request by setting allocated: true and allocatedTo: ticketReference. But I need to do this in a thread-safe way so that no two requests get the same seat.
Does Cosmos DB (SQL API) have a standard pattern to solve this problem?
The solution I thought of was to query a document and then update it while checking its ETag; if another thread got in first, the update would fail. If it fails, query another document and keep trying until I can successfully update one to claim the seat for this thread.
Is there a better way?
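The ETag approach described above is standard optimistic concurrency control. As a sketch of the retry loop (using a hypothetical in-memory container in place of the Cosmos SDK; with the real SDK you would pass the ETag as an if-match condition and catch an HTTP 412 Precondition Failed):

```python
class PreconditionFailed(Exception):
    """Stands in for Cosmos DB's HTTP 412 on an ETag mismatch."""

class FakeContainer:
    """Minimal in-memory stand-in for a Cosmos container with ETags."""
    def __init__(self, docs):
        self.docs = {d["id"]: dict(d, _etag=0) for d in docs}

    def query_unallocated(self):
        return [dict(d) for d in self.docs.values() if not d["allocated"]]

    def replace(self, doc, if_match):
        current = self.docs[doc["id"]]
        if current["_etag"] != if_match:
            raise PreconditionFailed()            # another writer got in first
        self.docs[doc["id"]] = dict(doc, _etag=if_match + 1)

def allocate_seat(container, ticket_ref):
    """Claim any free seat; on an ETag conflict, pick another and retry."""
    while True:
        free = container.query_unallocated()
        if not free:
            return None                           # flight is full
        seat = free[0]
        seat.update(allocated=True, allocatedTo=ticket_ref)
        try:
            container.replace(seat, if_match=seat["_etag"])
            return seat["id"]
        except PreconditionFailed:
            continue                              # lost the race; try again

seats = [{"id": s, "allocated": False} for s in ("12A", "12B")]
c = FakeContainer(seats)
print(allocate_seat(c, "TICKET-1"))  # claims a seat, e.g. "12A"
```

Under low contention this converges quickly; under heavy contention on one flight, the stored-procedure approach in the answer below avoids the retries entirely by serializing writers within the partition.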
You could achieve this by using transactions. Cosmos DB allows you to write stored procedures that execute in an atomic transaction, essentially serializing concurrent seat-reservation operations for you within a logical partition.
Quote from "Benefits of using server-side programming" in the Cosmos DB documentation:
Atomic transactions: Azure Cosmos DB database operations that are performed within a single stored procedure or a trigger are atomic. This atomic functionality lets an application combine related operations into a single batch, so that either all of the operations succeed or none of them succeed.
Bear in mind, though, that transactions come at a cost: they limit the scalability of those operations. However, in your scenario, where data is partitioned per flight and those operations are very fast, this may be the preferable and most reliable option.
I have done something similar with Service Bus queues, which let you queue bookings to be saved; you can then run the availability logic before saving each booking, guaranteeing no overbookings.
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-queues-topics-subscriptions

Store a JSON in the PostgreSQL database of a Corda node

I have a scenario where I need to store a large JSON String in the PostgreSQL database of a Corda node.
Does a Corda node support this type of scenario?
Technically I think you can store a large string as part of a state in Corda. However, it would perhaps not be a great idea: the larger the string, the bigger the transaction size, and thus the greater the network latency. Also, if the string becomes part of the state, it gets copied across the state's evolution, so any node that does not have the state's previous transactions would need to download the complete backchain, i.e., all the previously consumed states, which adds further network latency.
It would be better to store a reference in the state instead, such as a hash of the JSON document.
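The hash-reference idea can be sketched in a few lines (Python here for brevity; in a Corda app the hash would be a field on the state, and the JSON itself would live in off-ledger storage or as an attachment):

```python
import hashlib
import json

def json_reference(payload: dict) -> str:
    """Canonicalize the JSON and return a SHA-256 hex digest to store on-ledger."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

large_json = {"customer": "ACME", "items": list(range(1000))}
off_ledger_store = {}                    # e.g. a blob store or Corda attachment
ref = json_reference(large_json)
off_ledger_store[ref] = large_json       # the big payload stays off the ledger

# The state only carries `ref`; anyone holding the JSON can verify it matches.
assert json_reference(off_ledger_store[ref]) == ref
```

Canonicalizing with sorted keys before hashing ensures two semantically identical JSON objects produce the same reference regardless of key order.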

Firebase Persistent database on first installation

My current application developed in Unity uses Firebase Realtime Database with database persistence enabled. This works great for offline use of the application (such as in areas of no signal).
However, if a new user runs the application for the first time without an internet connection, the application freezes. I am guessing it's because it needs to pull down the database for the first time in order for persistence to work.
I am aware of threads such as 'Detect if Firebase connection is lost/regained' that discuss handling database disconnection from Firebase.
However, is there any way I can check whether it is the user's first time using the application (e.g. via the presence of the persistent database)? I could then inform them that they must go online for 'first time setup'.
In addition to frank-van-puffelen's answer, I do not believe that Firebase RTDB should itself cause your game to lock up until a network connection occurs. If your game is playable on first launch without a network connection (i.e., your logic doesn't require some initial state from the network), you may want to check the following issues:
Make sure you can handle null. If your game logic is in a Coroutine, Unity may decide to silently stop it rather than fully failing out.
If you're interacting with the database via transactions, generally assume that they will run twice (once against your local cache, then again when the cache is synced with the server, if the value differs). This means that the first time you perform a change via a transaction, you'll likely see a null previous state.
If you can, prefer listening to ValueChanged over GetValueAsync. You'll always get the callback on your main Unity thread, you'll always get it once on registration with the data in your local cache, and the data will be updated as the server updates. Further, as frank-van-puffelen has noted elsewhere, with GetValueAsync you may not get the data you expect (including null if the user is offline). If your game is frozen because it's waiting on a ContinueWithOnMainThread (always prefer this to ContinueWith in Unity unless you have a reason not to) or an await statement, switching to ValueChanged may work around this as well (though I don't think it should be necessary).
Double-check your object lifetimes. There are a ton of reasons an application may freeze, but when dealing with asynchronous logic, make sure you're aware of the differences between Unity's GameObject lifecycle and C#'s typical object lifecycle (see this post and my own on interacting with asynchronous logic with Unity and Firebase). If an object's OnDestroy is invoked before await, ContinueWith[OnMainThread], or ValueChanged fires, you're in danger of running into null references in your own code. This can happen when a scene changes, on the frame after Destroy is called, or immediately after a DestroyImmediate.
Finally, many Firebase functions have an async and a synchronous variant (e.g. CheckDependencies and CheckDependenciesAsync). I don't think there are any to call out for Realtime Database proper, but if you use the non-async variant of a function (or if you spinlock on the task completing, including forgetting to yield in a coroutine), the game will definitely freeze for a bit. Remember that any cloud product is I/O-bound by nature and will typically run slower than your game's update loop (although Firebase does its best to be as fast as possible).
I hope this helps!
--Patrick
There is nothing in the Firebase Database API to detect whether its offline cache was populated.
But you can detect when you make a connection to the database, for example by listening to the .info/connected node. When that node is first set to true, you can set a flag in local storage, for example in PlayerPrefs.
With that in place, you can check whether the flag is set in PlayerPrefs and, if not, show a message telling the user they need a network connection so you can download the initial data.

What are the ways to access the flink state from outside the flink cluster?

I am new to Apache Flink and building a simple application where I am reading events from a Kinesis stream, say something like
TestEvent {
    String id;
    DateTime created_at;
    Long amount;
}
and performing an aggregation (sum) on the field amount, with the stream keyed by id. The transformation is equivalent to the SQL query select sum(amount) from testevents group by id, where testevents contains all the events received so far.
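As a plain-Python sketch of what that keyed aggregation computes (the dict stands in for Flink's per-key ValueState; event-time and windowing concerns are ignored):

```python
from collections import defaultdict

def running_sums(events):
    """Per-key running sum, emitting the updated total after each event."""
    totals = defaultdict(int)  # stands in for per-key ValueState<Long>
    out = []
    for event in events:
        totals[event["id"]] += event["amount"]
        out.append((event["id"], totals[event["id"]]))
    return out

events = [
    {"id": "a", "amount": 10},
    {"id": "b", "amount": 5},
    {"id": "a", "amount": 7},
]
print(running_sums(events))  # [('a', 10), ('b', 5), ('a', 17)]
```

In Flink this is a keyBy on id followed by a sum (or a reduce) on amount; each emitted pair is what the answers below suggest writing to a sink rather than querying from state.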
The aggregated result is stored in Flink state, and I want it to be exposed via an API. Is there any way to do so?
PS: Can we store the Flink state in DynamoDB and create an API there? Or is there any other way to persist the state and expose it to the outside world?
I'd recommend ignoring state for now and instead looking at sinks as the primary way for a streaming application to output results.
If you are already using Kinesis for input, you could also use Kinesis to output the results from Flink. You could then use the Kinesis adapter for DynamoDB provided by AWS, as described in a related Stack Overflow post.
Coming back to your original question: you can query Flink's state and ship a REST API together with your streaming application, but that's a lot of work that isn't needed to achieve your goal. You could also access checkpointed/savepointed state through the state APIs, but again that's quite a bit of manual work that can be saved by going the usual route outlined above.
Flink's documentation on Queryable State describes some use cases for querying state directly.
You can also read state offline with the State Processor API.

How do I use Correlation in WF4 StateFlow (Platform Update 1)

I have a WF service (CustomerProvisioningService) that receives a Request message and immediately starts a StateFlow (CustomerProvisioningStateFlow), which is marked CanCreateInstance.
The first state in the flow has a sequential flow as its Entry activity, which is a long-running workflow (ProvisionCustomerActivityFlow) with its own Send and ReceiveReply pattern to call out and receive extra information. This workflow is marked CanCreateInstance too.
I presume that for the sequential flow I need to manage correlation based on content (CustomerId), so that I can identify the persisted workflow in the underlying AppFabric SQL persistence store.
Subsequently I have other operations in the StateFlow which are represented by WCF Service calls similar to :
SuspendCustomer(string customerId)
I am assuming that I need to pick up the correct StateFlow instance by correlating on the CustomerId, but I can find no way to apply correlation in the StateFlow, neither by adding CorrelationInitializers nor by referencing a local CorrelationHandle variable.
Now I am questioning if I need correlation on the StateFlow and if so how do I do it? Or am I misunderstanding something here?
Many thanks
Brian
You need to set up request message correlation to route messages to the same workflow instance. I have an example on my blog about how to set this up; it uses a Sequence, but the process is the same with a state machine.