Delay between tasks in Airflow or any other option? - airflow

We are using Airflow 2.0. I am trying to implement a DAG that does two things:
1. Trigger reports via an API.
2. Download the reports from source to destination.
There needs to be a gap of at least 2-3 hours between tasks 1 and 2. From my research I see two options:
1. Two DAGs for the two tasks, with the 2nd DAG scheduled two hours after the 1st.
2. A delay between the two tasks, as mentioned here.
Is there a preference between the two options? Is there a 3rd option with Airflow 2.0? Please advise.

The other option would be to have a sensor wait for the report to be present. You can use the sensor's reschedule mode to free up worker slots between checks.
generate_report = GenerateOperator(...)
wait_for_report = WaitForReportSensor(mode='reschedule', poke_interval=5 * 60, ...)
download_report = DownloadReportOperator(...)
generate_report >> wait_for_report >> download_report

A third option would be to put a sensor between the two tasks that waits for the report to become ready. Use an off-the-shelf one if one exists for your source, or write a custom one that subclasses the base sensor.
The first two options are different implementations of a fixed waiting time. There are two problems with that: 1. What if the report is still not ready after the predefined time? 2. You wait unnecessarily if the report is ready earlier.
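To make the difference from a fixed delay concrete, here is a minimal stand-alone sketch of what a sensor does conceptually: poll a condition at a fixed interval and give up after a timeout. This is plain Python, not Airflow's implementation; `wait_until_ready`, `FakeClock`, and the readiness sequence are all hypothetical names invented for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    poke_interval: float,
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Poll is_ready() every poke_interval seconds and return the seconds waited.

    Raises TimeoutError if the next check would fall after `timeout` seconds.
    """
    started = clock()
    while True:
        if is_ready():
            return clock() - started
        if clock() - started + poke_interval > timeout:
            raise TimeoutError("report not ready within timeout")
        sleep(poke_interval)


class FakeClock:
    """Simulated clock so the demo does not actually sleep for minutes."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Hypothetical report that becomes ready on the third check.
fake = FakeClock()
checks = iter([False, False, True])
waited = wait_until_ready(
    lambda: next(checks),
    poke_interval=5 * 60,
    timeout=3 * 60 * 60,
    sleep=fake.sleep,
    clock=fake,
)
```

Here the download could start after 10 simulated minutes instead of a fixed 2-3 hours, and a report that never arrives ends in an explicit timeout rather than a failed download.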

Related

Proper way to let airflow sensor continuous triggering?

Is it possible to have an Airflow sensor trigger continuously? By continuous triggering I mean that, for example, the sensor listens to a Kafka topic and triggers different DAGs depending on the received message, and keeps running, possibly forever.
The sensor doesn't trigger the DAG run; it is part of the run. It can block the run by staying in the running state (or up for reschedule) while waiting for a certain condition, and all the downstream tasks then stay waiting (None state).
To achieve what you want, you can subclass TriggerDagRunOperator to read the Kafka topic and then trigger runs in other DAGs as needed.
Or you can create a stream application outside Airflow, and use the Airflow API to trigger the runs.
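As a rough sketch of the external approach, the snippet below builds the HTTP request that would create a DAG run for a received message. It assumes the Airflow 2.x stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns` with a `conf` payload). The routing table, DAG ids, message shape, and base URL are hypothetical, and authentication and the Kafka consumer loop are left out.

```python
import json
import urllib.request

# Hypothetical routing table: message "type" -> id of the DAG to trigger.
ROUTES = {"order_created": "process_order", "user_signup": "onboard_user"}


def build_trigger_request(base_url: str, message: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that creates a DAG run for this message."""
    dag_id = ROUTES[message["type"]]
    body = json.dumps({"conf": message}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One message as it might arrive from the stream (made-up payload).
request = build_trigger_request(
    "http://localhost:8080", {"type": "order_created", "order_id": 42}
)
# urllib.request.urlopen(request) would send it once credentials are added.
```

The stream application would call something like this for every consumed message, so the routing decision lives outside Airflow and each DAG only sees its own `conf`.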
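If you're using deferrable operators, you should be able to re-defer them after re-entering. Untested pseudo code would be something like:
def execute(self, context) -> Any:
    # Suspend the task until the trigger reports a message.
    self.defer(trigger=AwaitMessageTrigger(), method_name="execute_complete")
def execute_complete(self, context, event):
    # Pseudo code: trigger the target DAG with the received event as conf.
    TriggerDagRunOperator(conf=event)
    # Defer again to keep listening for the next message.
    self.defer(trigger=AwaitMessageTrigger(), method_name="execute_complete")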
This pattern has been implemented as part of the provider package here.
https://github.com/astronomer/airflow-provider-kafka/blob/main/airflow_provider_kafka/operators/event_triggers_function.py
Example here
https://github.com/astronomer/airflow-provider-kafka/blob/main/example_dags/listener_dag_function.py

Airflow: Concurrency Depth first, rather than breadth first?

In airflow, the default configuration seems to be to queue up tasks, in parallel, across days--from one day to the next.
However, if I spin this process up across, say, two years, then the airflow dag will churn through preliminary processes first, across all days, rather than taking, say, 4 days forward from start to finish concurrently.
How do I toggle airflow to execute tasks according to a depth first paradigm rather than a breadth first paradigm?
I have come across a similar situation. I used the following trick to achieve that depth-first behaviour.
Assign all tasks of your DAG to a single pool (with a limited number of slots, say 20-30).
Set weight_rule='upstream' on all of those tasks.
Explanation
The UPSTREAM weight_rule reverses the prioritization of tasks based on their position across the breadth of the workflow, so that all downstream tasks have a higher priority than upstream tasks.
This ensures that whatever branches are launched run to completion before the next branch is picked, thereby achieving depth-first behaviour.
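To see why this flips the order, here is a small stand-alone calculation of effective priority weights under both rules. It is not Airflow code; it assumes every task keeps the default priority_weight of 1 and that the effective weight is the task's own weight plus that of all its downstream (or upstream) relatives. The three-task DAG is made up.

```python
from typing import Dict, List, Set

# Toy DAG: extract >> transform >> load, mapping each task to its direct children.
DOWNSTREAM: Dict[str, List[str]] = {
    "extract": ["transform"],
    "transform": ["load"],
    "load": [],
}


def reachable(graph: Dict[str, List[str]], task: str) -> Set[str]:
    """All tasks reachable from `task` by following edges in `graph`."""
    seen: Set[str] = set()
    stack = list(graph[task])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


def invert(graph: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn a child map into a parent map."""
    parents: Dict[str, List[str]] = {task: [] for task in graph}
    for task, children in graph.items():
        for child in children:
            parents[child].append(task)
    return parents


def effective_weights(graph: Dict[str, List[str]], rule: str) -> Dict[str, int]:
    """Own weight (1) plus one per downstream or upstream relative."""
    related = graph if rule == "downstream" else invert(graph)
    return {task: 1 + len(reachable(related, task)) for task in graph}


downstream_weights = effective_weights(DOWNSTREAM, "downstream")
upstream_weights = effective_weights(DOWNSTREAM, "upstream")
```

With the default rule, extract gets the highest weight, so the pool fills with extract tasks from many days. With the upstream rule, load gets the highest weight, so a day that has already started is finished before new days are picked up.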
Try adjusting the parallelism and max_active_runs parameters in your airflow.cfg and the concurrency parameter on your DAGs.

How to configure Apache Airflow with Celery to run concurrent tasks?

I am interested in this use case for my proof of concept: I read a huge list of ids from a file and want to process them concurrently, calling func(id) for each.
Is it possible to configure Airflow with the CeleryExecutor to achieve this?
I saw this link:
Running more than 32 concurrent tasks in Apache Airflow
But what if the number of ids is unknown, anywhere from 10,000 to 100,000, and I want to process them around 500-1000 at a time?
Airflow can execute tasks in parallel, and it can use Celery to achieve this. Everything else is up to you to implement however you see fit, there are no specifics related to Airflow/Celery regarding your intended use.
In the end, if all you care about is parallelizing your work and you don't care much about other Airflow features, you could be better off using Celery alone.
There are many different ways to go about this, but here is some food for thought to get you started:
Airflow tasks should be as "dumb" as possible, i.e. take an input, process it and store the output. Don't put your file-splitting logic here. You can have a dedicated DAG for that if needed. For example, you can have a DAG which reads the input file and chunks it up via some logic, then store it somewhere for tasks to pick up (convenient file structure, message queue, db, etc.)
Decide on a place for your input data such that tasks can easily pick up a limited amount of input. For example, if you're using a file structure where one chunk to be processed is a single file, a task can read a single file and remove it. Repeat until no chunks/files are left. The same goes for any other mechanism, e.g. with a message queue you can consume the chunks. Make sure you have that original DAG ready to split up the input file into chunks again if needed. You are free to make this as simple or as complex as you want.
Watch out for idempotency, i.e. make sure your process can be repeated without side effects. That way, if you lose data in some step, you can just restart everything without issues.
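The chunking step described above could look roughly like this. It is only a sketch: the id format, the chunk size of 500, and the helper name `chunked` are assumptions. Where each chunk is stored (one file per chunk, a queue message, a db row) is left open, as in the points above.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` ids without loading them all."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical input: 1,200 ids, as if read lazily line by line from the file.
ids = (f"id-{n}" for n in range(1200))
chunk_sizes = [len(batch) for batch in chunked(ids, 500)]
```

Because the total number of ids never needs to be known up front, the same code works for 10,000 or 100,000 ids; only the number of chunks changes.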

How to run Airflow DAG for specific number of times?

How can I run an Airflow DAG a specified number of times?
I tried using TriggerDagRunOperator, and this operator works for me.
In the callable function we can check states and decide whether to continue.
However, the current count and states need to be maintained.
Using the above approach I am able to repeat the DAG run.
I need an expert opinion: is there any other, better way to run an Airflow DAG X number of times?
Thanks.
I'm afraid that Airflow is ENTIRELY about time based scheduling.
You can set a schedule to None and then use the API to trigger runs, but you'd be doing that externally, and thus maintaining the counts and states that determine when and why to trigger externally.
When you say that your DAG may have 5 tasks which you want to run 10 times and a run takes 2 hours and you cannot schedule it based on time, this is confusing. We have no idea what the significance of 2 hours is to you, or why it must be 10 runs, nor why you cannot schedule it to run those 5 tasks once a day. With a simple daily schedule it would run once a day at approximately the same time, and it won't matter that it takes a little longer than 2 hours on any given day. Right?
You could set the start_date to 11 days ago (a fixed date though, don't set it dynamically), and the end_date to today (also fixed) and then add a daily schedule_interval and a max_active_runs of 1 and you'll get exactly 10 runs and it'll run them back to back without overlapping while changing the execution_date accordingly, then stop. Or you could just use airflow backfill with a None scheduled DAG and a range of execution datetimes.
Do you mean that you want it to run every 2 hours continuously, but sometimes it will be running longer and you don't want it to overlap runs? Well, you definitely can schedule it to run every 2 hours (0 0/2 * * *) and set the max_active_runs to 1, so that if the prior run hasn't finished the next run will wait then kick off when the prior one has completed. See the last bullet in https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled.
If you want your DAG to run exactly every 2 hours on the dot [give or take some scheduler lag, yes that's a thing] and to leave the prior run going, that's mostly the default behavior, but you could add depends_on_past to some of the important tasks that themselves shouldn't be run concurrently (like creating, inserting to, or dropping a temp table), or use a pool with a single slot.
There isn't any feature to kill the prior run if your next schedule is ready to start. It might be possible to skip the current run if the prior one hasn't completed yet, but I forget how that's done exactly.
That's basically most of your options there. Also you could create manual dag_runs for an unscheduled DAG; creating 10 at a time when you feel like (using the UI or CLI instead of the API, but the API might be easier).
Do any of these suggestions address your concerns? Because it's not clear why you want a fixed number of runs, how frequently, or with what schedule and conditions, it's difficult to provide specific recommendations.
This functionality isn't natively supported by Airflow, but by exploiting the meta-db we can cook it up ourselves:
We can write a custom operator or a Python operator.
Before running the actual computation, check whether 'n' runs of the task already exist in the meta-db (TaskInstance table). Refer to task_command.py for help.
If they do, just skip the task (raise AirflowSkipException, reference).
This excellent article can be used for inspiration: Use apache airflow to run task exactly once
Note
The downside of this approach is that it assumes historical runs of the task (TaskInstances) are preserved forever (and correctly).
In practice, though, I've often found task_instances to be missing (we have catchup set to False).
Furthermore, on large Airflow deployments, one might need to set up routine cleanup of the meta-db, which would make this approach impossible.
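The guard logic and its weakness can be sketched without Airflow. Everything here is a stand-in: `SkipTask` plays the role of AirflowSkipException, and the `history` list plays the role of the TaskInstance records that a real implementation would query from the meta-db.

```python
from typing import Callable, List


class SkipTask(Exception):
    """Stand-in for airflow.exceptions.AirflowSkipException in this sketch."""


def run_at_most(n: int, history: List[str], work: Callable[[], str]) -> str:
    """Run work() unless `history` already records n completed runs."""
    if len(history) >= n:
        raise SkipTask(f"already ran {len(history)} times (limit {n})")
    result = work()
    history.append(result)  # in Airflow, the meta-db records this for you
    return result


history: List[str] = []  # hypothetical stand-in for the TaskInstance table
outcomes: List[str] = []
for _ in range(5):
    try:
        outcomes.append(run_at_most(3, history, lambda: "success"))
    except SkipTask:
        outcomes.append("skipped")

# Clearing the history, as a meta-db cleanup would, resets the count.
history.clear()
after_cleanup = run_at_most(3, history, lambda: "success")
```

The last two lines show the downside noted above: once the records are gone, the guard has no memory and the task runs again.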

openCL : No. of iterations in profiling API

I am trying to use clGetEventProfilingInfo to time my kernels.
Is there any facility to specify a number of iterations over which the start and end times are reported?
If the kernel is run only once then, of course, it has a lot of overhead associated with it. So to get the best timing we should run the kernel several times and take the average time.
Does the profiling API have such a parameter? (We do have such parameters when we use third-party software tools for profiling.)
The clGetEventProfilingInfo function will return profiling information for a single event, which corresponds to a single enqueued command. There is no built-in mechanism to automatically report information across a number of calls; you'll have to code that yourself.
It's pretty straightforward to do - just query the start and end times for each event you care about and add them up. If you are only running a single kernel (in a loop), then you could just use a wall-clock timer (with clFinish before you start and stop timing), or take the difference between the time the first event started and the last event finished.
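The bookkeeping itself is only a few lines. The sketch below is host-side arithmetic in Python rather than OpenCL code: it assumes you have already read a (start, end) pair in nanoseconds for each event via CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, and the timestamps shown are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical (start_ns, end_ns) pairs, one per kernel launch, in enqueue order.
events: List[Tuple[int, int]] = [(1_000, 1_900), (2_000, 2_700), (3_000, 3_800)]


def summarize(events: List[Tuple[int, int]]) -> Dict[str, float]:
    """Sum and average per-event durations, plus the first-start to last-end span."""
    durations = [end - start for start, end in events]
    total = sum(durations)
    return {
        "total_ns": total,
        "mean_ns": total / len(durations),
        # Includes any gaps between kernels; assumes events are in execution order.
        "span_ns": events[-1][1] - events[0][0],
    }


summary = summarize(events)
```

The mean uses only time spent inside the kernels, while the span also counts idle gaps between launches, so the two can differ noticeably for short kernels.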

Resources