I want to structure my DAGs into reusable pieces, but I have a hard time understanding how the start date and schedule work for subdags, and I can't find the information anywhere in the docs either.
Let's say I have a parent DAG that started today and a subdag that starts in a week: will the subdag be run when the parent DAG runs, or will it wait until the start date condition is met? A similar question goes for the schedule: if a subdag should run only on Mondays but the parent DAG runs every day, will the subdag be triggered every day or only on Mondays?
From the documentation:
SubDAGs must have a schedule and be enabled. If the SubDAG's schedule is set to None or @once, the SubDAG will succeed without having done anything.
What I understood from this is that SubDagOperator is implemented as a BackfillJob and hence we need to provide schedule_interval.
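In practice, that means the subdag is backfilled for the parent's execution_date, so the usual pattern is to give the subdag the same schedule_interval and start_date as the parent. A minimal sketch of that pattern (all dag/task names and the daily schedule are placeholders, assuming the Airflow 2 import paths):
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.subdag import SubDagOperator

SCHEDULE = "@daily"
START = datetime(2021, 1, 1)

def make_subdag(parent_dag_id, child_task_id):
    # By convention the subdag's dag_id is "<parent>.<task_id>", and it reuses
    # the parent's schedule_interval and start_date, because the SubDagOperator
    # backfills it for the parent's execution_date.
    with DAG(
        dag_id=f"{parent_dag_id}.{child_task_id}",
        schedule_interval=SCHEDULE,
        start_date=START,
    ) as subdag:
        DummyOperator(task_id="do_something")
    return subdag

with DAG(dag_id="parent_dag", schedule_interval=SCHEDULE, start_date=START) as dag:
    SubDagOperator(
        task_id="section_1",
        subdag=make_subdag("parent_dag", "section_1"),
    )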
Related
Recently, we have been getting some errors on airflow where certain dags will not run any tasks but are being marked as complete.
We had set the start_date using days_ago from Airflow:
from airflow.utils.dates import days_ago
From: https://forum.astronomer.io/t/dag-run-marked-as-success-but-no-tasks-even-started/1423
If you see dag runs that are marked as success but don’t have any task runs, this means the dag runs’ execution_date was earlier than the dag’s start_date.
This is most commonly seen when the start_date is set to some dynamic value, e.g. airflow.utils.dates.days_ago(0). This creates the opportunity for the execution date of a delayed dag execution to be before what the dag now thinks is its start_date. This can even happen in a cyclic pattern, where a few dagruns will work, and then at the beginning of every day a dagrun will experience this problem.
The simplest way to avoid this problem is to never use a dynamic start_date. It is always better to specify a static start_date. If you are concerned about accidentally triggering multiple runs of the same dag, just set catchup=False.
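As a minimal sketch of that advice (the dag_id, date, and schedule here are placeholders, not from the original post):
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="example_static_start",
    start_date=datetime(2021, 1, 1),   # static start_date, never days_ago(...)
    schedule_interval="@daily",
    catchup=False,                     # don't backfill old runs when the DAG is enabled late
) as dag:
    ...                                # tasks go here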
There is an open ticket in the Airflow project for this issue: https://github.com/apache/airflow/issues/17977
I have a big DAG with around 400 tasks that starts at 8:00 and runs for about 2.5 hours.
There are some smaller DAGs that need to start at 9:00; they are scheduled but are not able to start until the first DAG finishes.
I reduced concurrency to 6, so the big DAG runs only 6 parallel tasks; however, this does not solve the issue that tasks in the other DAGs don't start.
There is no other global configuration limiting the number of running tasks, and the other smaller DAGs usually run in parallel.
What can be the issue here?
Airflow version: 2.1 with LocalExecutor and a Postgres backend, running on a 20-core server.
Tasks of active DAGs not starting
I don't think it's related to concurrency. This could be related to Airflow using the mini-scheduler.
When a task finishes, the task supervisor process performs a "mini scheduler" run, attempting to schedule more tasks of the same DAG. This means the DAG finishes quicker, as its downstream tasks are set to the scheduled state directly; however, one of its side effects is that it can cause starvation for other DAGs in some circumstances. A case like the one you describe, where one very big DAG takes a very long time to complete and starts before the smaller DAGs, may be exactly the situation where starvation happens.
Try to set schedule_after_task_execution = False in airflow.cfg and it should solve your issue.
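If it helps, in Airflow 2.x the setting lives under the [scheduler] section of airflow.cfg (it can also be set via the AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION environment variable):
[scheduler]
# Disable the per-task "mini scheduler" so one long-running DAG
# cannot starve other DAGs of scheduling attention.
schedule_after_task_execution = False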
Why don't you use the option to trigger the next DAG after the previous one is finished?
In the first DAG, insert the call to the next one as follows:
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_new_dag = TriggerDagRunOperator(
    task_id="<task name>",
    trigger_dag_id="<dag id to trigger>",
    dag=dag,
)
This operator will trigger a run of the other DAG after the previous one has executed.
Documentation: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html
I have a DAG that has run tasks for over a decade of execution dates. Now I need to add another year to the beginning. I googled a little bit, and the recommendation was to do this under a new dag_id. Because the old DAG has already run for that execution date range, I want to mark those runs in the new DAG as a success. How can I achieve this in a convenient way?
Thanks in advance. Have a nice start to this week.
Airflow's backfill feature is designed to do exactly what you're trying to do.
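For reference, kicking off a backfill over a date range from the Airflow 2 CLI looks roughly like this (the dag id and dates are placeholders):
airflow dags backfill --start-date 2009-01-01 --end-date 2009-12-31 my_dag_id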
That said, not everyone likes using the feature. For example, if your dag is ordinarily an hourly job, backfilling several years of data in hourly batches might be really inefficient.
So for various reasons, creating a temporary "backfill" dag is not a bad way to go.
And to be clear, by "backfill" dag I mean a dag you are using for the purpose of backfilling, while not using the airflow backfill feature.
For your "backfill" dag, use the DAG parameters start_date and end_date to control the range of execution_date a dag will create dag runs for.
Then after your "backfill" dag is done with all its runs, you can delete it. Airflow won't know the old task instances are now backfilled, but you may not care about that. If you do, you can update the dag_id manually in the metastore database. And otherwise, your "old" dag has correct metadata for more recent periods.
We are currently evaluating airflow for a project. I was wondering if there is a way to stop/start individual dagruns while running a DAG multiple times in parallel. Pause/unpause on dag_id seems to pause/unpause all the dagruns under a dag. Instead we want to pause individual dagruns (or tasks within them). Let me know if this is achievable in airflow.
If it's not possible, here are the other alternatives I am thinking of; let me know your opinion on these:
Change task state – change all tasks under a dagrun to marked failed or marked success. That way that particular dagrun is stopped in its tracks without affecting other dagruns.
Airflow sensor – pull this information from S3 or HTTP or SQL or somewhere to pause the current dagrun, i.e. have a task that checks S3 each time to see whether this dagrun (and not the others) needs to be stopped.
Subdags – can we pause/unpause subdags? That way, for each parallel user request, we issue a subdag, and we can pause user A's subdag without impacting other users' subdags.
There's nothing "baked" into Airflow to support this but you could (ab)use the state of the DagRun by changing it to "failed" to pause and then back to "running" to resume; you won't be able to blanket unpause but for testing it should be workable.
I have multiple dags that run on different cadences: some weekly, some daily, etc. I want to set it up such that while dag-a is running, dag-b should wait until it is completed. Also, if dag-b is running, dag-a should wait until dag-b completes, etc. Is there a way to do this in airflow out of the box?
What you are looking for is probably the ExternalTaskSensor
Airflow's Cross-DAG Dependencies description is also pretty useful.
If you are using this, there is also the Airflow DAG dependencies plugin, which can be pretty useful for visualizing those dependencies.
You could use a sensor operator to sense dag runs or a task in a dag run. ExternalTaskSensor is the best bet. Be careful how you set the timedelta passed as execution_delta; in general, the idea is to specify when the sensor should be able to find the dag run.
E.g., if the main dag is scheduled at 4:00 UTC, and the sensor is a task in that dag like below:
from datetime import timedelta

from airflow.sensors.external_task import ExternalTaskSensor

ExternalTaskSensor(
    dag=dag,
    task_id='dag_sensor_{}'.format(key),   # `key` is the dag_id of the DAG being sensed
    external_dag_id=key,
    external_task_id=None,                 # None means wait for the whole external dag run
    execution_delta=timedelta(days=1),     # look for the run one day before this dag's execution_date
    mode='reschedule',
    check_existence=True,
)
Then the other dag that should get sensed must be triggering a run at 4:00 UTC. The one-day difference (execution_delta) is there to offset the difference between the execution date and the current date: the sensor looks for the external dag run whose execution_date is exactly one day before the current dag's execution_date.