Running DAG in Loop - airflow

I want to run an Airflow DAG in a continuous loop. Below is the dependency chain of my DAG:
create_dummy_start >> task1 >> task2 >> task3 >> create_dummy_end >> task_email_notify
The requirement is that as soon as the flow reaches create_dummy_end, it should loop back to the first task, i.e. create_dummy_start.
I have tried re-triggering the DAG using the code below:
create_dummy_end = TriggerDagRunOperator(
    task_id='End_Task',
    trigger_dag_id=dag.dag_id,
    dag=dag
)
This does re-trigger the DAG, but the previous DAG run keeps running as well, so multiple runs end up executing in parallel, which does not meet the requirement.
I am new to Airflow; any inputs would be helpful.

By definition a DAG is acyclic (Directed Acyclic Graph) - there are no cycles.
Airflow generally works on a schedule rather than continuously, and while you can (as you did) try to trigger a new DAG run manually, that will always be another DAG run. There is no way to get Airflow into a continuous loop like that within a single DAG run.
You can use other tools for such a purpose (which is much closer to streaming than to Airflow's batch processing). For example, Apache Beam seems to better fit your needs.

Related

Is it a good practice to use airflow metadatabase to control pipelines?

Recently I have been developing an Airflow pipeline that will run for multiple tenants. This DAG is triggered via the API and separated into batches, which are controlled by a metadatabase in SQL following some business rules.
Each batch has a batch_id in order to control the batches, and it is passed to the DAG's conf via the API. The batch id is the creation timestamp combined with the tenant and filetype, for example: tenant1_20221120123323 ... tenant2_20221120123323. These batches can contain two filetypes (for example purposes), and for each filetype a DAG is triggered (DAG1 for filetype1 and DAG2 for filetype2). From the file perspective, the batch id is then combined with the filetype in some stages: tenant1_20221120123323_filetype1, tenant1_20221120123323_filetype2 ...
To illustrate this, imagine that the first DAG has the following pipeline: process_data_on_spark >> check_new_files_on_statingstorage >> [filetype2_exists, write_new_data_to_warehouse]; filetype2_exists >> read_data_from_filetype2 >> merge_filetype2_filetype2 >> write_new_data_to_warehouse. Here filetype2_exists is a BranchPythonOperator that verifies whether DAG_2 was triggered, and if it was, it merges the data resulting from DAG2 with DAG1's before executing write_new_data_to_warehouse.
Based on this DAG model, there will be one DAG run for each tenant. So the DAG can have multiple DAG runs executing in parallel if we trigger more than one DAG run (one per tenant). Here is my first question:
Is it good practice to work with multiple DAG runs of the same DAG instead of working with dynamic DAGs? In this case, I would end up with process_data_on_spark_tenant1, process_data_on_spark_tenant2, ... process_data_on_spark_tenantN. It is worth mentioning that the number of tenants can reach hundreds.
Now, consider that filetype2 may or may not be present in the batch, and that I would use the model mentioned above (one single DAG with multiple DAG runs running in parallel, one for each tenant). The only idea I have for checking whether DAG2 was triggered for the current batch (i.e., filetype2 was present in the batch) was to modify the dag_run_id to include the batch_id combined with the filetype:
The default dag_run_id: manual__2022-11-19T00:00:00+00:00
The new dag_run_id: manual__tenant1_20221120123323_filetype2__2022-11-19T00:00:00+00:00
From there, I would be able to query the Airflow metadatabase and check whether there is a dag_run_id containing the current batch_id and filetype2 and, with a sensor, wait for the DAG run's status to become success. Then I could run the read_data_from_filetype2 task. Otherwise, if there is no dag_run_id with the batch_id and filetype2 registered in the Airflow metadatabase, I can go directly to write_new_data_to_warehouse.
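To make the idea concrete, this is roughly what the filetype2_exists check could look like (just a sketch with hypothetical ids, assuming DAG_2 is triggered with a deterministic run_id made of the batch_id plus the filetype, and checking once in a BranchPythonOperator instead of waiting with a sensor):

from airflow.models import DagRun
from airflow.operators.python import BranchPythonOperator

def _check_filetype2(**context):
    # batch_id is passed in through the DAG run conf by the triggering API call
    batch_id = context["dag_run"].conf["batch_id"]  # e.g. "tenant1_20221120123323"
    # exact lookup in the metadatabase for the run_id DAG_2 would have been given
    runs = DagRun.find(dag_id="DAG_2", run_id=f"{batch_id}_filetype2")
    if runs and runs[0].state == "success":
        return "read_data_from_filetype2"
    return "write_new_data_to_warehouse"

filetype2_exists = BranchPythonOperator(
    task_id="filetype2_exists",
    python_callable=_check_filetype2,
)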
Here's the other question:
Is it good practice to modify the dag_run_id and use it combined with the Airflow metadatabase to control pipelines?
Considering this scenario, would it be better to create dynamic DAGs, even if that results in hundreds of DAGs, or to work with the dag_run_id and the Airflow metadatabase and keep parallel DAG runs in one single DAG?
Or is there a better approach to this problem?
Thank you.

A DAG is preventing other smaller DAGs' tasks from starting

I have a big DAG with around 400 tasks that starts at 8:00 and runs for about 2.5 hours.
There are some smaller DAGs that need to start at 9:00; they are scheduled but are not able to start until the first DAG finishes.
I reduced concurrency to 6, so the big DAG runs only 6 tasks in parallel, but this does not solve the issue that tasks in the other DAGs don't start.
There is no other global configuration limiting the number of running tasks; the other smaller DAGs usually run in parallel.
What can be the issue here?
Airflow version: 2.1 with LocalExecutor and a Postgres backend, running on a 20-core server.
Tasks of active DAGs not starting
I don't think it's related to concurrency. It could be related to Airflow's use of the mini-scheduler.
When a task finishes, the task supervisor process performs a "mini scheduler" pass, attempting to schedule more tasks of the same DAG. This means the DAG finishes quicker, since downstream tasks are set to scheduled state directly; however, one of its side effects is that it can cause starvation for other DAGs in some circumstances. A case like the one you present, where one very big DAG takes a very long time to complete and starts before the smaller DAGs, may be exactly the situation where starvation can happen.
Try setting schedule_after_task_execution = False in airflow.cfg; that should solve your issue.
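For reference, a minimal airflow.cfg snippet (the option lives in the scheduler section in Airflow 2.x):

[scheduler]
schedule_after_task_execution = False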
Why don't you use the option to invoke the next DAG after the previous one has finished?
In the first DAG, insert the call to the next one as follows:
trigger_new_dag = TriggerDagRunOperator(
    task_id=[task name],
    trigger_dag_id=[triggered dag],
    dag=dag
)
This operator will start a new DAG run after the previous one has finished executing.
Documentation: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html

Is there a way to set up airflow dags such that dag-a won't run if dag-b is still running and vice versa?

I have multiple dags that run on different cadences: some weekly, some daily, etc. I want to set things up so that while dag-a is running, dag-b waits until it completes; likewise, if dag-b is running, dag-a should wait until dag-b completes, and so on. Is there a way to do this in Airflow out of the box?
What you are looking for is probably the ExternalTaskSensor
Airflow's Cross-DAG Dependencies description is also pretty useful.
If you are using this, there is also the Airflow DAG dependencies plugin, which can be pretty useful for visualizing those dependencies.
You could use a sensor operator to sense DAG runs or a task in a DAG run; the ExternalTaskSensor is the best bet. Be careful how you set the execution_delta passed to it. In general, the idea is to specify when the sensor should be able to find the DAG run.
E.g.:
If the main DAG is scheduled at 4:00 UTC, and the sensor is a task in that DAG like below:
from datetime import timedelta
from airflow.sensors.external_task import ExternalTaskSensor

ExternalTaskSensor(
    dag=dag,
    task_id='dag_sensor_{}'.format(key),   # key = the dag_id of the DAG being sensed
    external_dag_id=key,
    external_task_id=None,                 # wait for the whole external DAG run
    execution_delta=timedelta(days=1),     # look for the run one day before this one
    mode='reschedule',
    check_existence=True
)
Then the other DAG that should get sensed must also trigger a run at 4:00 UTC. The one-day difference is set to offset the difference between the sensed run's execution date and the current date.

Airflow DAG Multiple Runs

I have a DAG that I want to run multiple times, each run starting after the previous one completes successfully. For example, I want to run it 10 times and stop. Is there a way to accomplish this? I looked into scheduling with CRON but it doesn't seem clean, and triggering the DAG via the UI multiple times doesn't work (the runs execute in parallel).
I found a solution for my use case. It combines depends_on_past=True (mentioned by @Hitesh Gupta) with the following setting in your airflow.cfg file:
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 1
This allowed us to have only one active DAG run at a time and also not to continue with the next DAG run if there was a failure in the previous one. This was tested on Airflow version 1.10.1.
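A per-DAG sketch of the same idea, in case you prefer not to change the global airflow.cfg (the DAG id and schedule below are hypothetical): max_active_runs can be set on the DAG object itself, with depends_on_past in its default_args.

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="run_limited_dag",                # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,                       # only one active run of this DAG at a time
    default_args={"depends_on_past": True},  # a task waits for its previous-run counterpart to succeed
)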
You can, in addition to supplying a start_date, provide your DAG an end_date
Quoting the docstring
:param start_date: The timestamp from which the scheduler will attempt
to backfill
:type start_date: datetime.datetime
:param end_date: A date beyond which your DAG won't run, leave to None for open ended scheduling
:type end_date: datetime.datetime
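A minimal sketch of what the docstring describes, with hypothetical dates:

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="bounded_dag",              # hypothetical name
    start_date=datetime(2021, 1, 1),   # the scheduler backfills from here
    end_date=datetime(2021, 1, 10),    # no runs are scheduled after this date
    schedule_interval="@daily",
)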
While unrelated, also have a look at the following scheduler settings in airflow.cfg, as mentioned in this article:
run_duration
num_runs
UPDATE-1
In his article Use apache airflow to run task exactly once, @Andreas P has described a clever technique which I believe can be adapted to your use case. While even that won't be a very tidy solution, it would at least allow you to specify beforehand the number of runs (an integer) for the DAG instead of an end_date.
Alternatively (assuming you implement the above approach), rather than baking this skip-dag-after-max-runs functionality into each DAG, you can create a separate orchestrator DAG that disables a given DAG after its max runs have passed.
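A rough sketch of such an orchestrator (the DAG ids, schedule and run limit are all hypothetical; it pauses the target DAG via the CLI once enough successful runs are found in the metadatabase):

from datetime import datetime
from airflow import DAG
from airflow.models import DagRun
from airflow.operators.bash import BashOperator
from airflow.operators.python import ShortCircuitOperator

TARGET_DAG_ID = "my_repeating_dag"  # hypothetical DAG to disable
MAX_RUNS = 10                       # hypothetical run limit

def _limit_reached():
    # count successful runs of the target DAG in the metadatabase
    runs = DagRun.find(dag_id=TARGET_DAG_ID, state="success")
    return len(runs) >= MAX_RUNS

with DAG(
    dag_id="orchestrator_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # skips the pause task until the limit is reached
    limit_reached = ShortCircuitOperator(
        task_id="limit_reached",
        python_callable=_limit_reached,
    )
    pause_target = BashOperator(
        task_id="pause_target",
        bash_command=f"airflow dags pause {TARGET_DAG_ID}",
    )
    limit_reached >> pause_target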
You have to set the depends_on_past property. It is set in the DAG's default arguments section and refers to the previous DAG run instance. This should fix your problem.

Schedule a continual airflow DAG run

Is there a way to run an airflow DAG in a loop?
When trying to create a cycle (connecting the last component back upstream of the first one) I got "Cycle detected in DAG. Faulty task: ..."
Generally, I have a short flow of 3 BashOperator components which I want to run continually (without passing any input/output from the last component back to the first).
Thanks!
You should be able to use the TriggerDagRunOperator to rerun the DAG after the last task is finished. Just put it after the last operator and make it trigger the same DAG.
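A minimal sketch of that setup (the DAG id and the bash commands are placeholders; max_active_runs=1 is added so re-triggers cannot pile up in parallel):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="continual_dag",           # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,           # start it once manually; it then re-triggers itself
    max_active_runs=1,                # keep only one run alive at a time
    catchup=False,
) as dag:
    step_1 = BashOperator(task_id="step_1", bash_command="echo step 1")
    step_2 = BashOperator(task_id="step_2", bash_command="echo step 2")
    step_3 = BashOperator(task_id="step_3", bash_command="echo step 3")
    restart = TriggerDagRunOperator(
        task_id="restart_dag",
        trigger_dag_id="continual_dag",  # trigger this same DAG again
    )
    step_1 >> step_2 >> step_3 >> restart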
