Google PubSub queue for distributed state management - Airflow

I have n sources that a job depends on.
Each source has a separate topic in Google PubSub; when a source is updated it publishes a message to the corresponding topic. When all sources are updated (i.e. when there is at least one new message in each subscription) the job can start.
The job is scheduled with Airflow. The DAG starts with a series of parallel tasks, one for each subscription, that check whether a new message has been published, but without acknowledging it. The next task waits for all the previous ones and uses XCom to see if each of them found a message. If so, it acknowledges the messages and proceeds with the job; otherwise it stops.
In this way I acknowledge the messages only when they are all available, using PubSub as a coordinator. Messages arrive once or twice a day at most.
Basically I'm using PubSub as a way to keep "state". Suppose I have different jobs that depend on the same source. I can create a subscription on the same topic for each job and it all works fine.
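A condensed sketch of that peek-then-acknowledge gate, assuming the google-cloud-pubsub Python client and hypothetical project/subscription names (a real DAG would run the peeks as parallel tasks and pass the ack IDs through XCom):

```python
from google.api_core import exceptions
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical names; one subscription per source.
subscriptions = [
    subscriber.subscription_path("my-project", f"source-{i}") for i in range(3)
]

def peek(subscription):
    """Pull one message WITHOUT acknowledging it; return its ack_id or None."""
    try:
        response = subscriber.pull(
            request={"subscription": subscription, "max_messages": 1},
            timeout=10,
        )
    except exceptions.DeadlineExceeded:
        return None  # no message published yet for this source
    msgs = response.received_messages
    return msgs[0].ack_id if msgs else None

ack_ids = {sub: peek(sub) for sub in subscriptions}

if all(ack_ids.values()):
    # Every source has a fresh message: acknowledge them all, then run the job.
    for sub, ack_id in ack_ids.items():
        subscriber.acknowledge(request={"subscription": sub, "ack_ids": [ack_id]})
    # ...start the actual job here...
# Otherwise stop; unacknowledged messages become visible again after their
# ack deadline, and the next scheduled DAG run repeats the check.
```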
Is there a better way/tool/framework to do this?

Given the volume of messages that you have, and based on my previous implementations, I can recommend persisting state in Firestore: serverless, affordable, fast...
When a message is published, trigger a function that persists the state in Firestore.
Then trigger as many processes as you want; each queries Firestore to check whether all the states are OK, and continues or stops.
It's my pattern for synchronization. Not necessarily the best!
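A rough sketch of that pattern, assuming the google-cloud-firestore Python client, a hypothetical source_states collection, and document IDs matching the source names:

```python
from google.cloud import firestore

db = firestore.Client()
SOURCES = ["source-a", "source-b", "source-c"]  # hypothetical source names

def on_message(source_name):
    """Triggered when a source publishes: persist its state in Firestore."""
    db.collection("source_states").document(source_name).set(
        {"updated": True, "at": firestore.SERVER_TIMESTAMP}
    )

def all_sources_ready():
    """Query Firestore and check that every source has reported an update."""
    docs = {d.id: d.to_dict() for d in db.collection("source_states").stream()}
    return all(docs.get(s, {}).get("updated") for s in SOURCES)
```

Each process can call all_sources_ready() and continue or stop accordingly; remember to reset the flags once the job has run.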
Anyway, if you create a subscription per process, it also works. The messages are duplicated in each subscription, and thus you can process them independently.

Related

Cloud Tasks - waiting for a result

My application needs front-end searching. It searches an external API, for which I'm limited to a few calls per second.
So, I wanted to keep ALL queries related to this external API on the same Cloud Tasks queue, so I could guarantee the number of calls per second.
That means the user would most likely have to wait a second or two when searching.
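For reference, that per-second cap is a queue-level setting. A sketch with the Python client (the question uses Node, but the same rateLimits field applies; project, location, and queue names are hypothetical):

```python
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

# Cap the queue at 2 task dispatches per second.
queue = tasks_v2.Queue(
    name=client.queue_path("my-project", "us-central1", "external-api-queue"),
    rate_limits=tasks_v2.RateLimits(max_dispatches_per_second=2),
)
client.update_queue(request={"queue": queue})
```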
However, using Google's const { CloudTasksClient } = require('@google-cloud/tasks') library, I can create a task, but when I go to check its status using .getTask() it says:
The task no longer exists, though a task with this name existed recently.
Is there any way to poll a task until it's complete and retrieve response data? Or any other recommended methods for this? Thanks in advance.
No. GCP Cloud Tasks provides no way to gather information on the body of requests that successfully completed.
(Which is a shame, because it seems quite natural. I just wrote an email to my own GCP account rep asking about this possible feature. If I get an update I'll put it here.)

Firebase functions invoke in next 20s

So, I'm making a multiplayer mobile game using Xamarin and Firebase. In the game there are many moments when I let players decide what to do and send their decision to the server (by putting a decision enum in a player-specific Firebase database node). Decisions are time-limited (short, no longer than 20s).
I set a listener on that specific node in my Firebase functions to check whether all players have decided, or whether a decision arrives after the deadline. But I need to handle the case where some players send their decision in time (so the server will not yet execute the next action) and one player simply never sends his decision (leaves the game or something) - then the server is never poked again to check the deadline and invoke the function.
That's why I'm looking for something else. I found a method for scheduling functions using crontab, but the minimal time interval there seems to be minutes, which is way too long for me.
My second idea was to wait out that specific time interval inside the previous Firebase function, but that seems like a bad way to deal with this.
What is the best way to dynamically invoke Firebase functions on a short-interval schedule?
The best way to schedule Cloud Functions to run at a specific time is through Google Cloud Tasks. See Doug's blog post for a full description of this: How to schedule a Cloud Function to run in the future with Cloud Tasks (to build a Firestore document TTL)
That said, I regularly use setTimeout in my own Cloud Functions too when I need to delay an operation for a short period of time. Just keep in mind that you pay for the seconds that the function is sleeping, so cost-wise you'll want to trade that time off against what another invocation would cost.
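A minimal Python sketch of that Cloud Tasks approach, scheduling an HTTP task roughly 20 seconds out (project, queue, and endpoint are hypothetical; the blog post linked above has the full pattern):

```python
import datetime

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "deadline-checks")

# Run ~20 seconds from now.
run_at = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(seconds=20)
schedule_time = timestamp_pb2.Timestamp()
schedule_time.FromDatetime(run_at)

task = tasks_v2.Task(
    http_request=tasks_v2.HttpRequest(
        http_method=tasks_v2.HttpMethod.POST,
        url="https://example.com/checkDeadline",  # hypothetical callback endpoint
        headers={"Content-Type": "application/json"},
        body=b'{"gameId": "abc123"}',
    ),
    schedule_time=schedule_time,
)
client.create_task(request={"parent": parent, "task": task})
```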
So for now I decided to use setTimeout; the free Firebase plan seems to limit only the number of function invocations, not their running time, so this shouldn't be a problem. Despite this, I'm still waiting for advice from you.

Event-sourcing: when (and not) should I use Message Queue?

I am building a project from scratch using event-sourcing with Java and Cassandra.
My app will be based on microservices, and in some use cases information will be processed asynchronously. I was wondering what part a message queue (such as RabbitMQ, ActiveMQ Artemis, Kafka, etc.) would play in improving the technology stack in this environment, and whether I correctly understand the scenarios in which I won't need one.
I would start with separating messaging infrastructure like RabbitMQ from event streaming/storing/processing like Kafka. These are two different things made for two (or more) different purposes.
Concerning event sourcing, you need a place to store events. This storage must be append-only and support fast reads of unstructured data based on an identity. One example of such persistence is EventStore.
Event sourcing goes together with CQRS, which means you have to project your changes (events) to another store, which you can query. This is done by projecting events to that store; this is where events get processed to change the domain object state. It is important to understand that using messaging infrastructure for projections is generally a bad idea, due to the nature of messaging and the two-phase commit issue.
If you look at how events get persisted, you can see that they get saved to the store as one transaction. If you then need to publish events, this will be another transaction. Since you are dealing with two different pieces of infrastructure, things can get broken.
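An illustrative sketch of that dual-write hazard (the store and broker here are placeholders, not a specific library):

```python
def handle_command(event_store, broker, event):
    event_store.append(event)  # transaction 1: the event is persisted
    # <-- a crash right here leaves the event stored but never published
    broker.publish(event)      # transaction 2: may never happen
```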
The messaging issue as such is that messages are usually guaranteed to be delivered "at least once" and the order of messages is usually not guaranteed. Also, when your message consumer fails and NACKs the message, it will be redelivered but usually a bit later, again breaking the sequence.
The ordering and duplication concerns, however, do not apply to event streaming servers like Kafka. Also, EventStore will guarantee once-only, in-order event delivery if you use a catch-up subscription.
In my experience, messages are used to send commands and to implement event-driven architecture to connect independent services in a reactive way. Event stores, on the other hand, are used to persist events; only the events that get there are then projected to the query store and also published to the message bus.
Make sure you are clear on the distinction between send(command) and publish(event). Udi Dahan touches on that topic in his essay on busses and brokers.
In most cases where you are event sourcing, you do not want to be reconstructing state from published events. If you need state, then query the technical authority/book of record for the history, and reconstruct the state from the history.
On the other hand, event-driven activity off of a message queue should be fine. When a single event (plus the subscriber's state) has everything you need, then running off of the bus is fine.
In some cases, you might do both. For example, if you were updating cached views, you'd subscribe to various BobChanged events to know when your cached data was stale; to rebuild a stale view, you would reload a representation of the history and transform it into an updated view.
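A small sketch of that "rebuild from history" idea, with a hypothetical store and event shape: the view is a left-fold over the full, ordered history rather than a reaction to individually published events.

```python
def apply(view, event):
    """Fold one event into the view; a trivial example reducer."""
    if event["type"] == "BobChanged":
        view["bob"] = event["value"]
    return view

def rebuild_view(event_store, stream_id):
    """Reload the history from the book of record and transform it into a view."""
    view = {}
    for event in event_store.read_stream(stream_id):  # assumed ordered, complete
        view = apply(view, event)
    return view
```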
In the world of event-sourced applications, message queues usually give you publish-subscribe style communication between producers and consumers. They also usually help you with delivery guarantees: which messages were delivered to which subscribers and which ones were not.
But they don't store all messages indefinitely. You need to have an event store to do any kind of event sourcing.
The question is not 'to queue or not to queue', but it is more like:
can this thing store huge volume of events indefinitely?
does it have publish-subscribe capabilities?
does it provide at-least-once delivery guarantees?
So, you should use something like Kafka or EventStore to have all that out-of-the-box. Alternatively, you can combine event store with message queue manually, but this is going to be more involved.

Does Meteor Merge Box reuse documents when a subscription changes its arguments

If a subscription is rerun with the "same arguments" in a flush cycle, it reuses the observer on the server and the data in Minimongo:
If the subscription is run with the same arguments then the “new” subscription discovers the old “marked for destruction” subscription that’s sitting around, with the same data already ready, and simply reuses that. - Meteor Guide
Additionally, if two subscriptions both request the same document Merge Box will ensure the data is not sent multiple times across DDP.
Furthermore, if a subscription is marked for destruction and rerun with different arguments, the observer cannot be reused. However, my question is: if there are documents published by both the old and the new subscription in the same flush cycle, will the overlapping documents be intelligently recycled on the client, or will they be sent over the wire a second time?
[Assume there are no other subscriptions that share this data.]
I believe the data will be reused; I need to double-check though.

Using SQS or DynamoDB to control order status

I am building a system that processes orders. Each order will follow a workflow, so an order can be, e.g., booked, accepted, payment approved, cancelled, and so on.
Every time the status of an order changes I will post this change to SNS. To know whether an order's status has changed, I will need to make a request to an external API and compare the result with the last known status.
The question is: What is the best place to store the last known order status?
1. An SQS queue. Every time I read a message from the queue, I check the status using the external API, delete the message, and insert another one with the new status.
2. Use a database (like DynamoDB) to control the order status.
You should not use the word "store" to describe something happening with stateful facts and a queue. Stateful, factual information should be stored -- persisted -- to a database.
The queue messages should be treated as "hints" on what work needs to be done -- a request to consider the reasonableness of a proposed action, and if reasonable, perform the action.
What I mean by this, is that when a queue consumer sees a message to create an order, it should check the database and create the order if not already present. Update an order? Check the database to see whether the order is in a correct status for the update to occur. (Canceling an order that has already shipped would be an example of a mismatched state).
Queues, by design, can't be as precise and atomic in their operation as a database should. The Two Generals Problem is one of several scenarios that becomes an issue in dealing with queues (and indeed with designing a queue system) -- messages can be lost or delivered more than once.
What happens in a "queue is authoritative" scenario when a message is delivered (received from the queue) more than once? What happens if a message is lost? There's nothing wrong with using a queue, but I respectfully suggest that in this scenario the queue should not be treated as authoritative.
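A sketch of that "queue as hint, database as authority" idea, assuming boto3, a hypothetical orders table, and a simplified message shape:

```python
import boto3

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

# Which transitions are reasonable from each status.
ALLOWED = {
    "booked": {"accepted", "cancelled"},
    "accepted": {"payment approved", "cancelled"},
    "payment approved": {"shipped", "cancelled"},
}

def handle_message(msg):
    """Treat the queue message as a hint; the database decides."""
    order = table.get_item(Key={"order_id": msg["order_id"]}).get("Item")
    if order is None:
        return  # unknown order: nothing to update (or create it, per your rules)
    if msg["new_status"] in ALLOWED.get(order["status"], set()):
        table.update_item(
            Key={"order_id": msg["order_id"]},
            UpdateExpression="SET #s = :s",  # "status" is a DynamoDB reserved word
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":s": msg["new_status"]},
        )
    # Otherwise drop the hint, e.g. a cancel arriving after the order shipped.
```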
I will go with the database option instead of SQS:
1) Option SQS:
You will have one application which will change the status
Add the status value into SQS
Now another application will check your messages, send the notification, and delete the message
2) Option DynamoDB:
Insert your updated status in DynamoDB
Configure a Lambda function to trigger on updates of that field
The Lambda function will send the notification
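A hypothetical Lambda handler for that option, reacting to DynamoDB Streams records and publishing to SNS when the status field changes (the topic ARN and key names are assumptions):

```python
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:order-status"  # hypothetical

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue  # only react to updates of existing items
        old = record["dynamodb"].get("OldImage", {})
        new = record["dynamodb"].get("NewImage", {})
        if old.get("status") != new.get("status"):
            sns.publish(
                TopicArn=TOPIC_ARN,
                Message=f"Order {new['order_id']['S']} is now {new['status']['S']}",
            )
```

Note that the table's stream must be configured with NEW_AND_OLD_IMAGES for both images to be present.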
The database option looks cleaner. Additionally, you don't have to worry about maintaining a queue, and with a queue you can only read one message at a time unless you implement parallel readers. In a database you can update multiple rows, each update will trigger the Lambda, and you don't have to worry about it.
Hope that helps
