We have a process that runs every day and kicks off several DAGs and sub-DAGs, something like:
(1) Master controller --> (11) DAGs --> (115) Child DAGs --> (115*4) Tasks
If something failed on any particular day, we want to retry it the next day. Similarly, we want to retry all failed DAGs over the last 10 days (to complete them successfully and automatically).
Is there a way to automate this retry process?
As of now, Airflow doesn't natively support rerunning failed DAGs (failed tasks within a DAG can, of course, be retried).
The premise could've been that tasks are retried anyway; so if the DAG still fails even then, the workflow might require human intervention.
But as always, you can build it yourself (a custom operator, or simply a PythonOperator):
Determine failed DagRuns in your specified time-period (last 10 days or whatever)
by either using the DagRun SQLAlchemy model (you can check views.py for reference)
or by directly querying the dag_run table in Airflow's backend meta-db
Trigger those failed DAGs using TriggerDagRunOperator
Then create and schedule this retry-orchestrator DAG (running daily, or at whatever frequency you need) to re-trigger the failed DAGs of the past 10 days. A sketch follows.
Related
We have an ETL DAG which is executed daily. DAG and tasks have the following parameters:
catchup=False
max_active_runs=1
depends_on_past=True
When we add a new task, then due to the depends_on_past property, no new DAG runs get scheduled, as all previous states for the new task are missing.
We would like to avoid having to run a manual backfill or manually mark previous runs in the UI, as that is easily forgotten; we also have some dynamic DAGs where tasks get added automatically, which halts future DAG executions.
Is there a way to automatically set past executions for new tasks as skipped by default, or some other approach that will allow future DAG runs to execute without human intervention?
We also considered creating a maintenance DAG that would insert missing task executions with skipped state, but would rather not go this route.
Are we missing something? This flow seems like a common thing to do.
As defined in the Airflow documentation on BaseOperator:
depends_on_past (bool) – when set to true, task instances will run sequentially and only if the previous instance has succeeded or has been skipped. The task instance for the start_date is allowed to run.
As long as there exists a previous instance of the task, the current instance cannot run unless that previous instance is in the success (or skipped) state.
When you add a task to a DAG with existing DagRuns, Airflow creates the missing task instances in the None state for all DagRuns. Unfortunately, it is not possible to set the default state of task instances.
I do not believe there is a way to let future task instances of a DAG with existing DagRuns run without human intervention. Personally, for depends_on_past-enabled tasks, I mark the previous task instance as success through either the CLI or the Airflow UI. A programmatic equivalent is sketched below.
Looks like there is a GitHub issue describing exactly what you are experiencing! Feel free to bump the PR or take a stab at it yourself if you would like.
A hacky solution is to set depends_on_past to False, since max_active_runs=1 implicitly guarantees the same behavior: as of current Airflow versions, the scheduler orders both DAG runs and task instances by execution date before running them (checked on 1.10.x and also 2.0).
One difference is that the next execution will still be scheduled even if the previous one fails. We solved this by retrying effectively unlimited times (setting a very large retry count) and alerting when the retry number exceeds some threshold. A sketch of this setup follows.
We are currently evaluating Airflow for a project. I was wondering if there is a way to stop/start individual DagRuns while running a DAG multiple times in parallel. Pause/unpause on dag_id seems to pause/unpause all the DagRuns under a DAG; instead, we want to pause individual DagRuns (or tasks within them). Let me know if this is achievable in Airflow.
If it's not possible, here are the alternatives I am thinking of; let me know your opinion on these:
Change task state – change all tasks under a DagRun to failed or success. That way the particular DagRun is stopped in its tracks without affecting other DagRuns.
Airflow sensor – pull this information from S3, HTTP, SQL, or somewhere else to pause the current DagRun, with a task that checks S3 each time to decide whether this DagRun (and not the others) needs to be stopped.
SubDAGs – can we pause/unpause SubDAGs? That way, for each parallel user request, we issue a SubDAG and can pause user A's SubDAG without impacting other users' SubDAGs.
There's nothing "baked" into Airflow to support this, but you could (ab)use the state of the DagRun by changing it to "failed" to pause and then back to "running" to resume; you won't be able to blanket-unpause, but for testing it should be workable. A sketch of the state flip follows.
I have a DAG that runs on a schedule, and the tasks inside it find a file in a path, process it, and move that file to an archive folder.
Instead of waiting for the schedule, I manually triggered the DAG. The manually triggered run executed its first task and "found a new file to process", but before it could start the second task to load that file, the scheduled run automatically kicked in and started processing the same file.
When the scheduled run started, it paused execution of the manually triggered run.
After the scheduled run finished, Airflow went back to running the tasks of the manually triggered run, which ended in a failed state: the DAG moves files from a source directory to an archive, and the manually triggered run was processing a file that "it believed was there" based on the success and information from its first task.
So:
DAG manually triggered
Manually triggered DAG: task 1 executed
Scheduled DAG invoked
Scheduled DAG: task 1 executed
Scheduled DAG: task 2 executed
Scheduled DAG: task 3 executed
Scheduled DAG completed as success
Manually triggered DAG: task 2 failed (because scheduled task 2 moved the file detected in task 1)
Manually triggered DAG: remaining tasks skipped due to failed task 2
Manually triggered DAG completed as failed
So, my question is:
How do I configure Airflow such that invocations of the same DAG are executed FIFO, regardless if the DAG was invoked by a schedule, manual, or trigger?
Regarding the question:
Why do Apache Airflow scheduled DAGs prioritize over manually triggered DAGs?
I can't say for certain; it's probably a good idea to take a look at the Airflow code.
Regarding the task you want to accomplish:
When working with files in a mounted folder (or similar), I suggest that the first thing you do is copy them into another folder, say a "process" folder, to be sure these files are untouchable by "others".
In my project we work with groups of videos, and we generate a per-trigger folder using the {{ ts_nodash }} default template variable. A sketch of that staging step follows.
I have an Airflow DAG that runs once daily at a specific time. The DAG runs a bunch of SQL scripts to create and load tables in a database, and the very last task updates permissions so that users can access the tables. Currently the permissions task requires that all previous SQL tasks have completed, so this means that none of the tables' permissions are updated if any of the table tasks fail.
To fix this I'd like to create another permissions task (i.e., a backup task) that runs at a preset time regardless of the status of any of the previous tasks (it doesn't hurt to update permissions multiple times). If I don't specify a time different from the DAG's, then because the new task has no dependencies, it will try updating permissions before any of the tables have been updated. Is there a setting that lets me pass a cron string to a specific task? Or an option to add a timedelta on top of the task's DAG time? I need to run the task some amount of time after the DAG time.
If your permissions task can run no matter what the result of the upstream tasks is, I think the best option is simply to change the trigger_rule of your permissions task to all_done (the default is all_success). For example:
If you need to do something specific when there is a failure, you could consider creating a secondary DAG whose first step is a sensor that waits for the main DAG to complete with State.FAILED, and then runs your permissions task.
Have a look at ExternalTaskSensor when you want to establish a dependency between DAGs.
I haven't checked, but you might also need soft_fail on the sensor to prevent the secondary DAG from showing up as failed when the main DAG completes successfully. A sketch of that pattern follows.
We have a long DAG (~60 tasks), and quite frequently we see a DagRun for this DAG in a failed state. Looking at the tasks in the DAG, they are all in a state of either success or null (i.e., not even queued yet). It appears the DAG has been put into a failed state prematurely.
Under what circumstances can this happen, and what should people do to protect against it?
If it's helpful for context, we're running Airflow with the Celery executor, currently on version 1.9.0. If we set the state of the DAG in question back to running, then all the tasks (and the DAG as a whole) complete successfully.
The only way a DAG can fail without a task failing is through something not connected to any of the tasks. Besides manual intervention (check that nobody on the team is manually failing the DAGs!), the only thing that fails DAGs without regard to task states is the timeout checker.
This runs inside the scheduler while it considers whether it needs to schedule a new dag_run. If it finds another active run that has been running longer than the DAG's dagrun_timeout argument, that run gets killed. As far as I can see this isn't logged anywhere, so the best way to diagnose it is to compare the time the DAG started with the time the last task finished and see whether the gap is roughly the length of dagrun_timeout.
You can see the code in action here: https://github.com/apache/incubator-airflow/blob/e9f3fdc52cb53f3ac3e9721e5128d17d1c5c418c/airflow/jobs.py#L800