Airflow DAG Run triggered, but never executed? - airflow

I've found myself in a situation where I manually trigger a DAG Run (via airflow trigger_dag datablocks_dag), and the DAG Run shows up in the interface, but it then stays "Running" forever without actually doing anything.
When I inspect this DAG Run in the UI, I see the following:
I've got start_date set to datetime(2016, 1, 1), and schedule_interval set to @once. My understanding from reading the docs is that since start_date < now, the DAG will be triggered. The @once makes sure it only happens a single time.
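For context, here is a minimal sketch of the DAG in question, with the layout corrected per the deprecation warnings in the log below (the task ids foo and foo2 come from that log; the bash commands and overall structure are assumptions): start_date and schedule_interval go on the DAG itself, not on the individual tasks.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# schedule_interval and start_date belong on the DAG, not on individual tasks
dag = DAG(
    dag_id='datablocks_dag',
    start_date=datetime(2016, 1, 1),
    schedule_interval='@once',
)

foo = BashOperator(task_id='foo', bash_command='echo foo', dag=dag)
foo2 = BashOperator(task_id='foo2', bash_command='echo foo2', dag=dag)
foo >> foo2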
My log file says:
[2017-07-11 21:32:05,359] {jobs.py:343} DagFileProcessor0 INFO - Started process (PID=21217) to work on /home/alex/Desktop/datablocks/tests/.airflow/dags/datablocks_dag.py
[2017-07-11 21:32:05,359] {jobs.py:534} DagFileProcessor0 ERROR - Cannot use more than 1 thread when using sqlite. Setting max_threads to 1
[2017-07-11 21:32:05,365] {jobs.py:1525} DagFileProcessor0 INFO - Processing file /home/alex/Desktop/datablocks/tests/.airflow/dags/datablocks_dag.py for tasks to queue
[2017-07-11 21:32:05,365] {models.py:176} DagFileProcessor0 INFO - Filling up the DagBag from /home/alex/Desktop/datablocks/tests/.airflow/dags/datablocks_dag.py
[2017-07-11 21:32:05,703] {models.py:2048} DagFileProcessor0 WARNING - schedule_interval is used for <Task(BashOperator): foo>, though it has been deprecated as a task parameter, you need to specify it as a DAG parameter instead
[2017-07-11 21:32:05,703] {models.py:2048} DagFileProcessor0 WARNING - schedule_interval is used for <Task(BashOperator): foo2>, though it has been deprecated as a task parameter, you need to specify it as a DAG parameter instead
[2017-07-11 21:32:05,704] {jobs.py:1539} DagFileProcessor0 INFO - DAG(s) dict_keys(['example_branch_dop_operator_v3', 'latest_only', 'tutorial', 'example_http_operator', 'example_python_operator', 'example_bash_operator', 'example_branch_operator', 'example_trigger_target_dag', 'example_short_circuit_operator', 'example_passing_params_via_test_command', 'test_utils', 'example_subdag_operator', 'example_subdag_operator.section-1', 'example_subdag_operator.section-2', 'example_skip_dag', 'example_xcom', 'example_trigger_controller_dag', 'latest_only_with_trigger', 'datablocks_dag']) retrieved from /home/alex/Desktop/datablocks/tests/.airflow/dags/datablocks_dag.py
[2017-07-11 21:32:07,083] {models.py:3529} DagFileProcessor0 INFO - Creating ORM DAG for datablocks_dag
[2017-07-11 21:32:07,234] {models.py:331} DagFileProcessor0 INFO - Finding 'running' jobs without a recent heartbeat
[2017-07-11 21:32:07,234] {models.py:337} DagFileProcessor0 INFO - Failing jobs without heartbeat after 2017-07-11 21:27:07.234388
[2017-07-11 21:32:07,240] {jobs.py:351} DagFileProcessor0 INFO - Processing /home/alex/Desktop/datablocks/tests/.airflow/dags/datablocks_dag.py took 1.881 seconds
What could be causing the issue?
Am I misunderstanding how start_date operates?
Or could the worrisome-looking schedule_interval WARNING lines in the log file be the source of the problem?

The problem is that the DAG is paused.
In the screenshot you have provided, in the top left corner, flip the toggle to On and that should do it.
This is a common "gotcha" when starting out with Airflow.

The accepted answer is correct; this issue can be handled through the UI.
Another way of handling it is through configuration.
By default, all DAGs are paused on creation.
You can check the default setting in airflow.cfg:
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
Setting this flag to False (or unpausing the DAG) will let the scheduler pick your DAG up after the next heartbeat.
Relevant git issue
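Note that dags_are_paused_at_creation only affects DAGs when they are first created; a DAG that is already paused can also be unpaused from the command line (Airflow 1.x CLI syntax, using the DAG id from the question above):

airflow unpause datablocks_dag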

I had the same issue, but in my case it had to do with depends_on_past or wait_for_downstream.
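For anyone hitting that variant, here is a minimal sketch of the kind of default_args that can cause it (the DAG id, dates, and values are illustrative, not taken from the question above):

from datetime import datetime

from airflow import DAG

default_args = {
    'start_date': datetime(2016, 1, 1),
    # with depends_on_past=True, a task instance will not run until the same
    # task succeeded in the previous DAG run, so a single old failure can
    # stall every later run of that task
    'depends_on_past': True,
    # wait_for_downstream additionally waits for the tasks immediately
    # downstream of the previous instance to finish before running
    'wait_for_downstream': True,
}

dag = DAG('example_depends_on_past', default_args=default_args,
          schedule_interval='@daily')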

Related

Airflow (MWAA) tasks receiving SIGTERM but task is externally set to success

Many of our Airflow (MWAA) tasks are receiving SIGTERM:
[2022-10-06 06:23:48,347] {{logging_mixin.py:104}} INFO - [2022-10-06 06:23:48,347] {{local_task_job.py:188}} WARNING - State of this instance has been externally set to success. Terminating instance.
[2022-10-06 06:23:48,348] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 2740
[2022-10-06 06:23:55,113] {{taskinstance.py:1265}} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-10-06 06:23:55,164] {{process_utils.py:66}} INFO - Process psutil.Process(pid=2740, status='terminated', exitcode=1, started='06:23:42') (2740) terminated with exit code 1
It happens to a few of our tasks, and it would not have been a big deal if the tasks were not being set to SUCCESS:
State of this instance has been externally set to success. Terminating instance
We understand that this can happen because of a lack of memory within the worker. We tried to increase the number of workers without any success. What are our options for avoiding tasks being externally marked and killed?
When tasks are killed, they are normally marked as failed. Here it seems to be the other way around: the task seems to get marked by something or someone as a success, after which the job is stopped/killed.
I am not aware of how MWAA is deployed, but I would have a look at the action logging to see what/who is marking these tasks as success.

Airflow task succeeds but returns SIGTERM

I have a task in Airflow 2.1.2 which finishes with success status, but after that the log shows a SIGTERM:
[2021-12-07 06:11:45,031] {python.py:151} INFO - Done. Returned value was: None
[2021-12-07 06:11:45,224] {taskinstance.py:1204} INFO - Marking task as SUCCESS. dag_id=DAG_ID, task_id=TASK_ID, execution_date=20211207T050000, start_date=20211207T061119, end_date=20211207T061145
[2021-12-07 06:11:45,308] {local_task_job.py:197} WARNING - State of this instance has been externally set to success. Terminating instance.
[2021-12-07 06:11:45,309] {taskinstance.py:1265} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-12-07 06:11:45,310] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 6666
[2021-12-07 06:11:45,310] {taskinstance.py:1284} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-12-07 06:11:45,362] {process_utils.py:66} INFO - Process psutil.Process(pid=6666, status='terminated', exitcode=1, started='06:11:19') (6666) terminated with exit code 1
As you can see, the first line shows Done, and the earlier lines of this log show that the whole script worked fine and the data was inserted into the data warehouse.
The log then shows a SIGTERM because some external trigger marked the task as success, but I am sure that nobody used the API, the CLI, or the UI to mark it as success.
Any idea how to avoid it and why could this be happening?
I don't know whether increasing AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME could fix it, but I would like to understand what is happening.
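For reference, assuming the standard Airflow config naming (the exact section placement is an assumption on my part), that option is killed_task_cleanup_time in airflow.cfg, or the matching environment variable:

# airflow.cfg
[core]
# seconds a forcefully killed task gets to clean up after SIGTERM before SIGKILL
killed_task_cleanup_time = 60

# or as an environment variable
AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME=60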

Airflow scheduler not picking up tasks, task waiting forever

I ran into an issue where a task in a DAG never gets picked up by workers, for some reason.
When I look at the task details:
All dependencies are met but the task instance is not running. In most
cases this just means that the task will probably be scheduled soon
unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow
administrator for assistance.
I checked the scheduler: there are no errors in the log, and I also restarted it a few times.
I also checked the Airflow webserver log and only noticed this:
22/11/2018 12:10:39 [2018-11-22 01:10:39,747] {{cli.py:644}} DEBUG - [5 / 5] killing 1 workers
22/11/2018 12:10:39 [2018-11-22 01:10:39 +0000] [43] [INFO] Handling signal: ttou
22/11/2018 12:10:39 [2018-11-22 01:10:39 +0000] [348] [INFO] Worker exiting (pid: 348)
Not sure what is happening; it worked fine before.
Airflow version is 1.9.0. I never changed the version, only played around with some of the config: min_file_process_interval and dag_dir_list_interval (but I put them back to the defaults when I encountered this issue).
I did notice that this happened after I played around with some of the Airflow config and rebuilt our Docker Airflow image; when I reverted back to the original image, which used to work, the problem was solved.
I also noticed an error that occurred (but was not always captured) in my Celery workers when I used the newly built image:
Unrecoverable error: AttributeError("'float' object has no attribute 'items'",)
So I found that it is related to the latest redis release (Celery uses redis); you can find more details there.
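If it is the same celery/redis-py incompatibility that was widely reported around that time (an assumption on my part), the common workaround was to pin the redis Python client below 3.0, for example:

pip install 'redis<3.0'   # redis==2.10.6 was the version people typically pinned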

airflow - all tasks being queued and not moved to execution

airflow 1.8.1
The scheduler, worker, and webserver are running in separate Docker containers on AWS.
The system was operational, and now for some reason all tasks stay in the queued state...
No errors in scheduler logs.
In the worker I see this error (not sure if it's related, since the scheduler should move tasks out of the queued state):
[2018-01-23 20:46:00,428] {base_task_runner.py:95} INFO - Subtask: [2018-01-23 20:46:00,428] {models.py:1122} INFO - Dependencies not met for , dependency 'Task Instance State' FAILED: Task is in the 'success' state which is not a valid state for execution. The task must be cleared in order to be run.
I tried reboots, and the airflow clear and then airflow resetdb commands, but it did not help.
Any idea what else can be done to fix that problem?
Thanks

Example DAG gets stuck in "running" state indefinitely

In my first foray into Airflow, I am trying to run one of the example DAGs that come with the installation. This is v1.8.0. Here are my steps:
$ airflow trigger_dag example_bash_operator
[2017-04-19 15:32:38,391] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:32:38,676] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
[2017-04-19 15:32:38,947] {cli.py:185} INFO - Created <DagRun example_bash_operator @ 2017-04-19 15:32:38: manual__2017-04-19T15:32:38, externally triggered: True>
$ airflow dag_state example_bash_operator '2017-04-19 15:32:38'
[2017-04-19 15:33:12,918] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:33:13,229] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
running
The dag state remains "running" for a long time (at least 20 minutes by now), although from a quick inspection of this task it should take a matter of seconds. How can I troubleshoot this? How can I see which step it is stuck on?
To run any DAGs, you need to make sure two processes are running:
airflow webserver
airflow scheduler
If you only have airflow webserver running, the UI will show DAGs as running, but if you click on the DAG, none of its tasks are actually running or scheduled; they are in a Null state instead.
What this means is that they are waiting to be picked up by airflow scheduler. If airflow scheduler is not running, you'll be stuck in this state forever, as the tasks are never picked up for execution.
Additionally, make sure that the toggle button in the DAGs view is switched to 'ON' for the particular DAG. Otherwise it will not get picked up by the scheduler if you trigger it manually.
I too recently started using Airflow, and my DAGs kept running endlessly. Your DAG may be set to 'paused' without you realizing it; the scheduler will then not schedule new task instances, and when you trigger the DAG it just looks like it is running forever.
There are a few solutions:
1) In the Airflow UI, toggle the button to the left of the DAG from 'Off' to 'On'. Off means that the DAG is paused, so On will allow the scheduler to pick it up and complete the DAG. (This fixed my initial issue.)
2) In your airflow.cfg file, dags_are_paused_at_creation = True is the default, so all new DAGs you create are paused from the start. Change this to False, and future DAGs you create will be good to go right away (I had to restart the webserver and scheduler for changes to airflow.cfg to be recognized).
3) Use the command line: $ airflow unpause [dag_id]
documentation: https://airflow.apache.org/cli.html#unpause
The below worked for me.
Make sure AIRFLOW_HOME is set.
In AIRFLOW_HOME, create the dags and plugins folders; the folders need read, write, and execute permissions for the airflow user (see the sketch after this list).
Make sure you have at least one DAG in the dags/ folder.
pip install celery[redis]==4.1.1
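A rough sketch of that setup (the paths and user name are assumptions):

export AIRFLOW_HOME=/home/airflow/airflow
mkdir -p "$AIRFLOW_HOME/dags" "$AIRFLOW_HOME/plugins"
chown -R airflow:airflow "$AIRFLOW_HOME"
chmod -R u+rwx "$AIRFLOW_HOME/dags" "$AIRFLOW_HOME/plugins"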
I have verified the above solution on Airflow version 1.9.0.
I tried the same steps with Airflow 1.10 and it worked there as well.
