I have a DAG that runs on a schedule, and the tasks inside it find a file in a path, process it, and move that file to an archive folder.
Instead of waiting for the schedule, I manually triggered the DAG. The manually triggered run executed its first task and "found a new file to process", but before it started the second task to load that file, the scheduled run kicked in automatically and started processing the same file.
When the scheduled run started, execution of the manually triggered run was paused.
After the scheduled run finished, Airflow went back to the tasks of the manually triggered run, which then failed: the DAG moves files from the source directory to an archive, so the manually triggered run tried to process a file that "it believed was there" based on the result of its own first task, but the scheduled run had already moved it.
So:
DAG manually triggered
Manually triggered run: task 1 executed
Scheduled run invoked
Scheduled run: task 1 executed
Scheduled run: task 2 executed
Scheduled run: task 3 executed
Scheduled run completed as Success
Manually triggered run: task 2 failed (because scheduled task 2 moved the file detected in task 1)
Manually triggered run: remaining tasks skipped due to the failed task 2
Manually triggered run completed as Failed
So, my question is:
How do I configure Airflow so that runs of the same DAG are executed FIFO, regardless of whether the run was started by the schedule, manually, or by a trigger?
Regarding the question:
Why do Apache Airflow scheduled DAGs prioritize over manually triggered DAGs?
I can't say for certain, but it is probably a good idea to take a look at the Airflow source code.
Regarding the task you want to accomplish:
When working with files in a mounted folder (or similar), I suggest that the first thing each run does is copy (or move) the files into a dedicated folder, let's say a "process folder", so you can be sure those files are untouchable by "others".
In my project we work with groups of videos, and we generate a per-run folder using the {{ ts_nodash }} template variable.
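To make that concrete, here is a minimal sketch of the idea, using hypothetical paths and Airflow 2-style imports (claim_files, /data/incoming and /data/processing are illustrative names, not anything from your DAG): the first task "claims" the files by moving them into a per-run folder derived from ts_nodash, so a concurrent run of the same DAG cannot pick up the same files.

    import os
    import shutil

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.dates import days_ago

    SOURCE_DIR = "/data/incoming"     # hypothetical shared source folder
    PROCESS_DIR = "/data/processing"  # hypothetical root for per-run folders

    def claim_files(ts_nodash, **_):
        """Move every file from the shared source folder into a folder owned by this run."""
        run_dir = os.path.join(PROCESS_DIR, ts_nodash)
        os.makedirs(run_dir, exist_ok=True)
        for name in os.listdir(SOURCE_DIR):
            shutil.move(os.path.join(SOURCE_DIR, name), run_dir)
        # The return value is pushed to XCom, so downstream tasks know which folder to read.
        return run_dir

    with DAG(
        dag_id="claim_then_process",  # hypothetical dag_id
        schedule_interval="@hourly",
        start_date=days_ago(1),
        catchup=False,
    ) as dag:
        claim = PythonOperator(task_id="claim_files", python_callable=claim_files)

Downstream tasks then work only on the folder returned by the first task, so even if another run starts in parallel it cannot touch the files this run has already claimed.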
Related
I have two DAGs in my Airflow scheduler which were working in the past. After needing to rebuild the Docker containers running Airflow, they are now stuck in the queued state. In my case the DAGs are triggered via the REST API, so no actual scheduling is involved.
Since there are quite a few similar posts, I ran through the checklist of this answer from a similar question:
Do you have the airflow scheduler running?
Yes!
Do you have the airflow webserver running?
Yes!
Have you checked that all DAGs you want to run are set to On in the web ui?
Yes, both DAGs are shown in the web UI and no errors are displayed.
Do all the DAGs you want to run have a start date which is in the past?
Yes, the constructor of both DAGs looks as follows:
    dag = DAG(
        dag_id='image_object_detection_dag',
        default_args=args,
        schedule_interval=None,
        start_date=days_ago(2),
        tags=['helloworld'],
    )
Do all the DAGs you want to run have a proper schedule which is shown in the web ui?
No, I trigger my DAGs manually via the REST API.
If nothing else works, you can use the web UI: click on the DAG, then on Graph View, select the first task, and click on Task Instance. In the Task Instance Details section you will see why a DAG is waiting or not running.
Here is the output that this section shows me:
What is the best way to find out why the tasks won't leave the queued state and run?
EDIT:
Out of curiosity I tried triggering the DAG from within the web UI, and now both runs executed (the one triggered from the web UI failed, but that was expected, since no config was set).
I have the directory for my DAG in the airflow/dags directory, and when calling airflow dags list while logged into the webserver, the DAG's ID shows up in the list. However, calling airflow dags list while logged into the scheduler returns the following error:
Killed
command terminated with exit code 137
The DAG also does not show up in the list in the webserver UI. When I manually enter the dag_id in the URL, the DAG shows up with every task in the right place, but triggering a manual run via the Trigger DAG button results in a pop-up stating Cannot find dag <dag_id>. Has anyone run into this issue before? Is this a memory problem?
My DAG code is written in Python, and the resulting DAG object has a large number of tasks (>80).
Running on Airflow 1.10.15 with the Kubernetes executor.
What would happen if a DAG changes while that DAG is running (particularly if it is a dynamic DAG)?
What would happen if the code of a custom-made operator is changed during some DAG run?
So after some investigation, I came to the conclusion that DAG changes will be visible in the current DAG run and that operators (and all other plugins) are not reloaded by default.
DAG changes during DAG run
If you remove a task while the DAG is running, the scheduler notices this, sets the status of that task to "Removed", and logs the following line:
State of this instance has been externally set to removed. Taking the poison pill.
If you add a new task, it will also be visible and executed. It will be executed even if it is an upstream task of a task that has already finished.
If a task is changed, the changes will be incorporated into the current DAG run only if the task has not already started executing or finished.
Plugins (including operators) are not refreshed automatically by default. You can restart Airflow, or you can set the reload_on_plugin_change property in the [webserver] section of the airflow.cfg file to True.
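For reference, a minimal airflow.cfg excerpt with that option enabled would look like this:

    [webserver]
    reload_on_plugin_change = True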
We have a process that runs every day and kicks off several DAGs and sub-DAGs, something like:
(1) Master controller --> (11) DAGs --> (115) Child DAGs --> (115*4) Tasks
If something fails on any particular day, we want to retry it the next day. Similarly, we want to retry all failed DAGs over the last 10 days (to complete them successfully in an automated way).
Is there a way to automate this retry process?
(As of now) Airflow doesn't natively support rerunning failed DAGs (failed tasks within a DAG can of course be retried).
The premise could be that:
tasks are retried anyway;
so if the DAG still fails even then, the workflow probably requires human intervention.
But as always, you can build it yourself (as a custom operator, or simply with a PythonOperator):
Determine the failed DagRuns in your specified time period (the last 10 days, or whatever you need),
either by using the DagRun SQLAlchemy model (you can check views.py for reference)
or by directly querying the dag_run table in Airflow's backend meta-db.
Trigger those failed DAGs using TriggerDagRunOperator.
Then create and schedule this retry-orchestrator DAG (running daily, or at whatever frequency you need) to re-trigger the failed DAGs of the past 10 days.
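Here is a rough sketch of such a retry-orchestrator, assuming Airflow 2-style import paths; the dag_id, the 10-day lookback, and especially the parse-time metadata query are illustrative simplifications (a production version would rather do the lookup inside a task):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.models import DagRun
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator
    from airflow.utils.session import provide_session
    from airflow.utils.state import State

    @provide_session
    def failed_dag_ids(lookback_days=10, session=None):
        """Return the dag_ids that had a failed DagRun within the lookback window."""
        cutoff = datetime.utcnow() - timedelta(days=lookback_days)
        rows = (
            session.query(DagRun.dag_id)
            .filter(DagRun.state == State.FAILED, DagRun.execution_date >= cutoff)
            .distinct()
            .all()
        )
        return [row.dag_id for row in rows]

    with DAG(
        dag_id="retry_orchestrator",  # hypothetical dag_id
        schedule_interval="@daily",
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        # One re-trigger task per DAG that failed recently.
        for failed_id in failed_dag_ids():
            TriggerDagRunOperator(
                task_id=f"retrigger_{failed_id}",
                trigger_dag_id=failed_id,
            )

Note that TriggerDagRunOperator starts a fresh run of each re-triggered DAG; it does not resume the failed run from its failed task.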
I have an Airflow DAG that runs once daily at a specific time. The DAG runs a bunch of SQL scripts to create and load tables in a database, and the very last task updates permissions so that users can access the tables. Currently the permissions task requires that all previous SQL tasks have completed, so this means that none of the tables' permissions are updated if any of the table tasks fail.
To fix this I'd like to create another permissions task (i.e., a backup task) that runs at a preset time regardless of the status of any of the previous tasks (doesn't hurt to update permissions multiple times). If I don't specify a time different from the DAG's time, then because the new task has no dependencies, the task will try updating permissions before any of the tables have been updated. Is there a setting for me to pass a cron string to a specific task? Or is there an option to pass a timedelta on top of the task's DAG time? I need to run the task some amount of time after the DAG time.
If your permissions task can run no matter what the result of the upstream tasks is, I think the best option is simply to change the trigger_rule of your permissions task to all_done (the default is all_success).
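As a minimal, self-contained illustration with Airflow 2-style imports (the dag_id and task ids are hypothetical, with echo commands standing in for the real SQL and grant steps):

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.utils.dates import days_ago
    from airflow.utils.trigger_rule import TriggerRule

    with DAG(
        dag_id="load_then_grant",  # hypothetical dag_id
        schedule_interval="@daily",
        start_date=days_ago(1),
        catchup=False,
    ) as dag:
        # Placeholder "table load" tasks; some of these may fail.
        load_tables = [
            BashOperator(task_id=f"load_table_{i}", bash_command="echo load")
            for i in range(3)
        ]
        # Runs once every upstream task has finished, whether it succeeded or failed.
        update_permissions = BashOperator(
            task_id="update_permissions",
            bash_command="echo grant",
            trigger_rule=TriggerRule.ALL_DONE,
        )
        load_tables >> update_permissions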
If you need to do something specific when there is a failure, you could consider creating a secondary DAG whose first step is a sensor that waits for the main DAG to complete with State.FAILED, and then runs your permissions task.
Have a look at ExternalTaskSensor when you want to establish a dependency between DAGs.
I haven't checked, but you might also need to use soft_fail on the sensor to prevent the secondary DAG from showing up as failed when the main DAG completes successfully.
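Putting those last points together, a sketch of such a secondary DAG might look like this (Airflow 2-style imports; the dag and task ids are hypothetical, and it assumes the secondary DAG shares the main DAG's schedule so the sensor matches on the same logical date):

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.sensors.external_task import ExternalTaskSensor
    from airflow.utils.dates import days_ago
    from airflow.utils.state import State

    with DAG(
        dag_id="permissions_fallback",  # hypothetical dag_id
        schedule_interval="@daily",     # must line up with the main DAG's schedule
        start_date=days_ago(1),
        catchup=False,
    ) as dag:
        wait_for_failed_main = ExternalTaskSensor(
            task_id="wait_for_failed_main",
            external_dag_id="main_sql_dag",  # hypothetical id of the main DAG
            external_task_id=None,           # watch the whole DAG run, not one task
            allowed_states=[State.FAILED],   # succeed only if the main DAG run failed
            soft_fail=True,                  # time out as skipped, not failed
            timeout=6 * 60 * 60,
            mode="reschedule",
        )
        update_permissions = BashOperator(
            task_id="update_permissions",
            bash_command="echo grant",  # placeholder for the real permission updates
        )
        wait_for_failed_main >> update_permissions

With soft_fail=True the sensor should end up skipped rather than failed when the main DAG never fails and the sensor times out, which addresses exactly the concern above.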