Apache Flink: Execute batches created by keyBy plus global window operator one at a time per key with async I/O

I have a use case in which a Kafka source reads an unbounded stream of events, followed by keyBy(tenantId + specialId), a global window, async I/O, and a Kafka sink.
In the async I/O stage I have a couple of business functions to execute, chosen by "specialId". Execution must be serial per key: one batch for a given key is executed, and the next batch for the same key should run only after the response for the previous one has returned.
With async I/O, however, execution is parallel: any subtask picks up a batch and starts executing it.
How can I make execution serial per key using Flink-provided operators or logic?
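To make the requirement concrete, here is a minimal sketch in plain Python asyncio. It is not Flink code and uses no Flink API; it only illustrates the behavior being asked for, under the assumption that each batch is handled by a coroutine. The names call_backend and process_serially_per_key are hypothetical. Batches for different keys overlap, while batches for the same key wait for the previous response.

```python
import asyncio
from collections import defaultdict


async def call_backend(key, batch, log):
    # Hypothetical stand-in for the business call made from the async stage.
    log.append(("start", key, batch))
    await asyncio.sleep(0.01)
    log.append(("end", key, batch))


async def process_serially_per_key(batches):
    """Run batches concurrently across keys, but one at a time within a key."""
    locks = defaultdict(asyncio.Lock)  # one lock per key, created on first use
    log = []

    async def run(key, batch):
        # The next batch for this key waits until the previous one has returned.
        async with locks[key]:
            await call_backend(key, batch, log)

    await asyncio.gather(*(run(key, batch) for key, batch in batches))
    return log


if __name__ == "__main__":
    demo = [("tenant1|A", 1), ("tenant1|A", 2), ("tenant2|B", 1)]
    for entry in asyncio.run(process_serially_per_key(demo)):
        print(entry)
```

The per-key lock is the whole idea: concurrency is kept between keys and removed within a key.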

Related

Correct Approach For Airflow DAG Project

I am trying to see if Airflow is the right tool for some functionality I need for my project. We are trying to use it as a scheduler for running a sequence of jobs that start at a particular time (or possibly on demand).
1. The first "task" is to query the database for the list of job IDs to sequence through.
2. For each job in the sequence, send a REST request to start the job.
3. Wait until the job completes or fails (via REST call or DB query).
4. Go to the next job in the sequence.
I am looking for recommendations on how to break down the functionality above into an Airflow DAG. So far my approach would be to:
create a Hook for the database and another for the REST server,
create a custom operator that handles the start and monitoring of the "job" (steps 2 and 3),
use a sensor to poll while waiting for the job to complete.
Thanks
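The sequence described in this question can be sketched in plain Python, independent of Airflow. This is only an illustration of the control flow: start_job and get_status are hypothetical callables standing in for the REST request and the REST/DB status check, and the status strings are assumptions.

```python
import time


def run_jobs_in_sequence(job_ids, start_job, get_status,
                         poll_seconds=1.0, sleep=time.sleep):
    """Start each job, wait until it completes or fails, then go to the next."""
    results = {}
    for job_id in job_ids:
        start_job(job_id)                # hypothetical: REST request to start the job
        status = get_status(job_id)      # hypothetical: REST call or DB query
        while status not in ("success", "failed"):
            sleep(poll_seconds)          # wait between status checks
            status = get_status(job_id)
        results[job_id] = status         # a failed job does not stop the sequence
    return results


if __name__ == "__main__":
    polls_left = {}

    def fake_start(job_id):
        polls_left[job_id] = 2           # pretend each job needs two status checks

    def fake_status(job_id):
        polls_left[job_id] -= 1
        return "running" if polls_left[job_id] > 0 else "success"

    print(run_jobs_in_sequence(["a", "b"], fake_start, fake_status,
                               sleep=lambda seconds: None))
```

In a DAG, the loop body would roughly correspond to the custom operator plus the sensor mentioned in the question.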

How to launch a Dataflow job with Apache Airflow and not block other tasks?

Problem
Airflow tasks of the type DataflowTemplateOperator take a long time to complete. This means other tasks can be blocked by it (correct?).
When we run more of these tasks, that means we would need a bigger Cloud Composer cluster (in our case) to execute tasks that are essentially blocking while they shouldn't be (they should be async operations).
Options
Option 1: just launch the job and mark the Airflow task as successful
Option 2: write a wrapper as explained here and use a reschedule mode as explained here
Option 1 does not seem feasible as the DataflowTemplateOperator only has an option to specify the wait time between completion checks called poll_sleep (source).
For the DataflowCreateJavaJobOperator there is an option check_if_running to wait for completion of a previous job with the same name (see this code)
It seems that after launching a job, the wait_for_finish is executed (see this line), which boils down to an "incomplete" job (see this line).
For Option 2, I need Option 1.
Questions
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
Answers
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
A: Partly yes. Airflow has a parallelism option in its configuration that defines the number of tasks that may execute at a time across the system. A task that blocks one of these slots may slow down execution, but this is bound to happen as you increase the number of tasks and DAGs. You can increase the value in the configuration depending on your needs.
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
A: Yes. You can use PythonOperator and in the python_callable you can use the dataflow hook to launch the job in async mode (launch and don't wait).
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
A: When you say reschedule, I'm assuming you mean retrying the task that checks whether the job finished correctly. If so, you can put that task in retry mode and set the delay between retries.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
A: I think I answered this in the second question.
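The difference between a task that blocks a slot while waiting and a check that gives its slot back between attempts can be shown with a small simulation. This is a simplified model written for illustration, not Airflow's scheduler; run_with_reschedule and job_done_after are hypothetical names.

```python
from collections import deque


def job_done_after(n_checks):
    """Return a check function that reports completion on the n-th call."""
    state = {"left": n_checks}

    def check():
        state["left"] -= 1
        return state["left"] <= 0

    return check


def run_with_reschedule(checks, slots=1):
    """Simulate completion checks that free their slot between attempts.

    `checks` maps a task name to a callable returning True once the external
    job is finished. A task holds a slot only while its check runs; if the
    job is not done, the task goes to the back of the queue.
    Returns task names in the order they completed.
    """
    pending = deque(checks.items())
    finished = []
    while pending:
        # Take at most `slots` tasks for this scheduling round.
        batch = [pending.popleft() for _ in range(min(slots, len(pending)))]
        for name, is_done in batch:
            if is_done():
                finished.append(name)
            else:
                pending.append((name, is_done))  # release the slot, retry later
    return finished


if __name__ == "__main__":
    order = run_with_reschedule(
        {"long_job": job_done_after(3), "short_job": job_done_after(1)}, slots=1
    )
    print(order)
```

With a single slot, the short job still finishes before the long one, because the long job's check does not hold the slot while the external job is running.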

Airflow Dagrun for each datum instead of scheduled

The current problem that I am facing is that I have documents in a MongoDB collection which each need to be processed and updated by tasks which need to run in an acyclic dependency graph. If a task upstream fails to process a document, then none of the dependent tasks may process that document, as that document has not been updated with the prerequisite information.
If I were to use Airflow, this leaves me with two solutions:
Trigger a DAG for each document, and pass in the document ID with --conf. The problem with this is that this is not the intended way for Airflow to be used; I would never be running a scheduled process, and based on how documents appear in the collection, I would be making 1440 Dagruns per day.
Run a DAG every period for processing all documents created in the collection for that period. This follows how Airflow is expected to work, but the problem is that if a task fails to process a single document, none of the dependent tasks may process any of the other documents. Also, if a document takes longer than other documents do to be processed by a task, those other documents are waiting on that single document to continue down the DAG.
Is there a better method than Airflow? Or is there a better way to handle this in Airflow than the two methods I currently see?
From the knowledge I gained in my attempt to answer this question, I've come to the conclusion that Airflow is just not the tool for the job.
Airflow is designed for scheduled, idempotent DAGs. A DagRun must also have a unique execution_date; this means running the same DAG at the exact same start time (in the case that we receive two documents at the same time) is quite literally impossible. Of course, we can schedule the next DagRun immediately in succession, but this limitation should demonstrate that any attempt to use Airflow in this fashion will always be, to an extent, a hack.
The most viable solution I've found is to instead use Prefect, which was developed with the intention of overcoming some of the limitations of Airflow:
"Prefect assumes that flows can be run at any time, for any reason."
Prefect's equivalent of a DAG is a Flow; one key advantage of a Flow that we can exploit is the ease of parametrization. Then, with some threads, we're able to have a Flow run for each element in a stream. Here is an example streaming ETL pipeline:
import time
from prefect import task, Flow, Parameter
from threading import Thread


def stream():
    # Simulated stream: yields one element per second
    for x in range(10):
        yield x
        time.sleep(1)


@task
def extract(x):
    # If 'x' referenced a document, in this step we could load that document
    return x


@task
def transform(x):
    return x * 2


@task
def load(y):
    print("Received y: {}".format(y))


with Flow("ETL") as flow:
    x_param = Parameter('x')
    e = extract(x_param)
    t = transform(e)
    l = load(t)

# Start one Flow run per stream element, each in its own thread
for x in stream():
    thread = Thread(target=flow.run, kwargs={"x": x})
    thread.start()
You could change trigger_rule from "all_success" to "all_done":
https://github.com/apache/airflow/blob/62b21d747582d9d2b7cdcc34a326a8a060e2a8dd/airflow/example_dags/example_latest_only_with_trigger.py#L40
You could also create a branch that handles failed documents, with trigger_rule set to "one_failed", to process those documents differently (e.g. move them to a "failed" folder and send a notification).
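The three trigger rules named in this answer can be compared with a small model. This is a simplified sketch of their meaning, not Airflow's implementation; should_run is a hypothetical helper, and the set of states is reduced to four for illustration.

```python
def should_run(trigger_rule, upstream_states):
    """Decide whether a task may run, given the states of its upstream tasks.

    Simplified model: each state is "success", "failed", "skipped" or "running".
    """
    done_states = {"success", "failed", "skipped"}
    if trigger_rule == "all_success":
        # Default rule: every upstream task must have succeeded.
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_done":
        # Every upstream task has finished, whatever the outcome.
        return all(state in done_states for state in upstream_states)
    if trigger_rule == "one_failed":
        # At least one upstream task has failed.
        return any(state == "failed" for state in upstream_states)
    raise ValueError("unsupported trigger rule: {}".format(trigger_rule))


if __name__ == "__main__":
    states = ["success", "failed"]
    for rule in ("all_success", "all_done", "one_failed"):
        print(rule, should_run(rule, states))
```

With one failed upstream task, the default rule holds the downstream task back, while "all_done" lets it proceed and "one_failed" activates the failure branch.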
I would be making 1440 Dagruns per day.
With a good Airflow architecture, this is quite possible.
Choke points might be:
executor - use the Celery Executor instead of the Local Executor, for example
backend database - monitor and tune as necessary (indexes, proper storage, etc.)
webserver - for thousands of DagRuns and tasks, perhaps use the webserver only in dev/QA environments and not in production, where the rate of task/DagRun submissions is higher. You could use the CLI instead.
Another approach is scaling out by running multiple Airflow instances: partition the documents into, say, ten buckets, and assign each partition's documents to just one Airflow instance.
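One way to partition documents into a fixed number of buckets is to hash the document ID. The sketch below assumes each document has a stable ID that can be converted to a string; bucket_for and partition are hypothetical helper names. A cryptographic hash is used only because Python's built-in hash() of strings is not stable across processes.

```python
import hashlib
from collections import defaultdict


def bucket_for(document_id, n_buckets=10):
    """Map a document ID to a bucket number in [0, n_buckets)."""
    digest = hashlib.sha256(str(document_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def partition(document_ids, n_buckets=10):
    """Group document IDs by bucket; each bucket would go to one instance."""
    buckets = defaultdict(list)
    for document_id in document_ids:
        buckets[bucket_for(document_id, n_buckets)].append(document_id)
    return dict(buckets)


if __name__ == "__main__":
    for bucket, ids in sorted(partition(range(20)).items()):
        print(bucket, ids)
```

Because the mapping depends only on the ID, a given document always lands on the same instance, even across restarts.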
I'd process the heavier tasks in parallel and feed successful operations downstream. As far as I know, you can't feed successes asynchronously to downstream tasks, so you would still need to wait for every thread to finish before moving downstream, but this would still be far more acceptable than spawning one DAG for each record. Something along these lines:
Task 1: read from Mongo, filtering by some timestamp (remember idempotence), and feed the tasks (e.g. via XCom);
Task 2: do stuff in parallel via PythonOperator, or even better via K8sPod, e.g.:
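import threading
from queue import Empty, Queue

# stuff_done, xcom_pull, xcom_push, upstream_task and appropriate_thread_nr
# are placeholders from the description above; they are not defined here.
# appropriate_thread_nr is assumed to be an integer.

def thread_fun(ret):
    # Drain the queue. get_nowait avoids blocking forever if another
    # worker takes the last job between an empty() check and get().
    while True:
        try:
            job = job_queue.get_nowait()
        except Empty:
            return ret
        try:
            ret.append(stuff_done(job))
        except Exception:
            pass  # failed jobs are dropped, so only successes go downstream
        finally:
            job_queue.task_done()

# Create workers and queue
threads = []
ret = []  # a mutable object shared by all workers
job_queue = Queue(maxsize=0)
for thr_nr in range(appropriate_thread_nr):
    worker = threading.Thread(target=thread_fun, args=(ret,), daemon=True)
    threads.append(worker)
# Populate queue with jobs
for row in xcom_pull(task_ids=upstream_task):
    job_queue.put(row)
# Start threads
for thr in threads:
    thr.start()
# Wait for them to finish their jobs
for thr in threads:
    thr.join()
xcom_push(ret)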
Task 3: Do more stuff coming from previous task, and so on
We have built a system that queries MongoDB for a list and generates one Python file per item, each containing one DAG (note: giving each DAG its own Python file helps Airflow scheduler efficiency, with its current design). The generator DAG runs hourly, right before the scheduled hourly run of all the generated DAGs.

Mule - Batch Job sync or async

I have two batch jobs in different flows. The first does an upsert in Salesforce and, when it finishes, calls the second flow, which has another batch job.
This image represents the flows:
But when I look at the console log, sometimes the log of the second batch is mixed with the log of the first.
I get the feeling that the batch processes are asynchronous and the second batch is called even though the first batch is still being processed.
Am I wrong? Should I pay attention to the order of the logs?
If I wanted it to be totally synchronous, what would be the best way?
Mule Batch is asynchronous; it is fire and forget. If you want to call the second batch after the first batch has completed, invoke the second batch in the 'On Complete' phase of the first batch, as shown in the picture below.
If you want to do some processing before invoking the second batch, you need to use a request-reply scope to make the batch component synchronous.
Yes, the batch job is asynchronous. As soon as the batch execution is triggered, the flow moves on to the next event processor.
If batch job 2 must run only after batch job 1, you can use the on-complete phase of the first batch job to trigger an event indicating that the first has finished, and use that event to trigger the second batch job.
Alternatively, if the batch jobs are that closely related, you might be able to combine them into one job using multiple batch steps.
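The on-complete idea from these answers can be illustrated outside Mule with a small Python analogy. This is not Mule configuration or API; start_batch and run_two_batches are hypothetical names. Starting a batch returns immediately (fire and forget), and the second batch is started only from the first batch's completion callback.

```python
import threading


def start_batch(name, records, log, on_complete):
    """Start a batch in a background thread and return immediately."""
    def work():
        for record in records:
            log.append((name, record))
        on_complete()  # called only after every record has been processed

    thread = threading.Thread(target=work)
    thread.start()
    return thread


def run_two_batches(first_records, second_records):
    """Run the second batch strictly after the first one has completed."""
    log = []
    all_done = threading.Event()
    start_batch(
        "batch1", first_records, log,
        on_complete=lambda: start_batch(
            "batch2", second_records, log, on_complete=all_done.set
        ),
    )
    all_done.wait()  # block the caller until the second batch has finished
    return log


if __name__ == "__main__":
    for entry in run_two_batches([1, 2, 3], ["a", "b"]):
        print(entry)
```

Since the second batch is started from the callback rather than right after the first start call, its log entries cannot be interleaved with those of the first batch.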

Running tasks asynchronously that will never run simultaneously

I was wondering if there's a way to run tasks asynchronously in the background (using Celery, for example) such that certain tasks never run simultaneously.
That is, each task can run simultaneously with other instances of itself, but not with other tasks that interfere with its actions.
For example,
Task A: Reads from a file (can run simultaneously with itself, i.e. with other tasks that read from files)
Task B: Writes to a file (should not run simultaneously with the read tasks, i.e. with task A)
Essentially, what I need is a way for tasks A and B to find out whether the other task is running and, if it is, to delay and wait until it's done (probably by blocking the task queue).
Does defining a queue for the tasks solve the problem? Or is it just a queue for the execution of tasks (so it will execute the second task in the queue without waiting for the result of the first one)?
Is using a lock my only solution here?
If the lock solution is the only one, what's the correct way of implementing it?
I have found this:
Ensuring a task is only executed one at a time
But it uses Django's cache as a lock, and I'm not running my programs in a Django environment, so it doesn't work for me.
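One lock-based option that needs no Django is an advisory file lock. The sketch below assumes a Unix-like system (the fcntl module is not available on Windows) and that all workers run on the same host and can see the same lock file; workers spread over several machines would need a shared lock service instead. The names file_lock, read_task and write_task are hypothetical, and the functions are shown without any Celery decorators.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path, exclusive):
    """Hold a shared (read) or exclusive (write) advisory lock on lock_path."""
    with open(lock_path, "a") as lock_file:
        # Blocks until the lock is granted: many shared holders may coexist,
        # an exclusive holder excludes everyone else.
        fcntl.flock(lock_file, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def read_task(data_path, lock_path):
    # Task A: readers share the lock, so they can run alongside each other.
    with file_lock(lock_path, exclusive=False):
        with open(data_path) as data_file:
            return data_file.read()


def write_task(data_path, lock_path, text):
    # Task B: the writer waits for all readers and blocks new ones.
    with file_lock(lock_path, exclusive=True):
        with open(data_path, "w") as data_file:
            data_file.write(text)
```

A separate lock file is used so that the lock does not depend on how the data file itself is opened or replaced.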

Resources