Airflow Execution Timeout not working well - airflow

I've set the 'execution_timeout': timedelta(seconds=300) parameter on many tasks. When the execution timeout is set on a task downloading data from Google Analytics, it works properly: after ~300 seconds the task is set to failed. That task downloads some data from an API (python), then does some transformations (python) and loads the data into PostgreSQL.
Then I have a task which executes only one PostgreSQL function. Execution sometimes takes more than 300 seconds, but I get this (the task is marked as finished successfully):
*** Reading local file: /home/airflow/airflow/logs/bulk_replication_p2p_realtime/t1/2020-07-20T00:05:00+00:00/1.log
[2020-07-20 05:05:35,040] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1353} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,051] {__init__.py:1354} INFO - Starting attempt 1 of 1
[2020-07-20 05:05:35,051] {__init__.py:1355} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,098] {__init__.py:1374} INFO - Executing <Task(PostgresOperator): t1> on 2020-07-20T00:05:00+00:00
[2020-07-20 05:05:35,099] {base_task_runner.py:119} INFO - Running: ['airflow', 'run', 'bulk_replication_p2p_realtime', 't1', '2020-07-20T00:05:00+00:00', '--job_id', '958216', '--raw', '-sd', 'DAGS_FOLDER/bulk_replication_p2p_realtime.py', '--cfg_path', '/tmp/tmph11tn6fe']
[2020-07-20 05:05:37,348] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:37,347] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=10, pool_recycle=1800, pid=26244
[2020-07-20 05:05:39,503] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,501] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-07-20 05:05:39,857] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,856] {__init__.py:305} INFO - Filling up the DagBag from /home/airflow/airflow/dags/bulk_replication_p2p_realtime.py
[2020-07-20 05:05:39,894] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,894] {cli.py:517} INFO - Running <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [running]> on host dwh2-airflow-dev
[2020-07-20 05:05:39,938] {postgres_operator.py:62} INFO - Executing: CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:05:39,960] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,953] {base_hook.py:83} INFO - Using connection to: id: postgres_warehouse. Host: XXX Port: 5432, Schema: XXXX Login: XXX Password: XXXXXXXX, extra: {}
[2020-07-20 05:05:39,973] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,972] {dbapi_hook.py:171} INFO - CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:23:21,450] {logging_mixin.py:95} INFO - [2020-07-20 05:23:21,449] {timeout.py:42} ERROR - Process timed out, PID: 26244
[2020-07-20 05:23:36,453] {logging_mixin.py:95} INFO - [2020-07-20 05:23:36,452] {jobs.py:2562} INFO - Task exited with return code 0
Does anyone know how to enforce the execution timeout for such long-running functions? It seems that the execution timeout is only evaluated once the PG function finishes.

Airflow uses the signal module from the standard library to effect a timeout. It hooks into these system signals and requests that the calling process be notified in N seconds; should the process still be inside the context (see the __enter__ and __exit__ methods on the class) when the signal arrives, it raises an AirflowTaskTimeout exception.
Unfortunately for this situation, there are certain classes of system operations that cannot be interrupted. This is actually called out in the signal documentation:
A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes.
To which we say "But I'm not doing a long-running calculation in C!" -- yeah, for Airflow this is almost always due to uninterruptible I/O operations.
The last sentence of that quote nicely explains why the handler is still triggered even after the task is allowed to (frustratingly!) finish, well beyond your requested timeout.
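For illustration, here is a minimal sketch of that signal-based pattern. It is not Airflow's actual code: the names TaskTimeout, timeout and run_with_timeout are made up for this example, and it assumes a Unix system with the code running in the main thread (where SIGALRM handlers can be installed).

```python
import signal
import time


class TaskTimeout(Exception):
    """Hypothetical stand-in for Airflow's AirflowTaskTimeout."""


class timeout:
    """Context manager that raises TaskTimeout if the block runs too long."""

    def __init__(self, seconds):
        self.seconds = seconds

    def _handle(self, signum, frame):
        # Runs only when the interpreter gets a chance to execute Python code.
        raise TaskTimeout(f"timed out after {self.seconds}s")

    def __enter__(self):
        # Install our handler and ask the OS for a SIGALRM in `seconds`.
        self._previous = signal.signal(signal.SIGALRM, self._handle)
        signal.setitimer(signal.ITIMER_REAL, self.seconds)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Cancel the pending alarm and restore the earlier handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, self._previous)
        return False


def run_with_timeout(seconds, func):
    """Return 'finished' or 'timed out' depending on how func() ended."""
    try:
        with timeout(seconds):
            func()
        return "finished"
    except TaskTimeout:
        return "timed out"


if __name__ == "__main__":
    # time.sleep is interruptible, so the handler fires on schedule here.
    print(run_with_timeout(0.2, lambda: time.sleep(2)))
```

The sleep in this example can be interrupted, so the timeout fires promptly. If func() were instead blocked inside a C-level call that does not return to the interpreter (such as a driver waiting on a long query), the same handler would only run once that call returned, which matches the log in the question.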

Related

Airflow 2: GoogleSheetsToGCSOperator gives Negsignal.SIGKILL

We're running airflow in google composer, and we're running into difficulties with the GoogleSheetsToGCSOperator. We're using composer 2, and therefore I understand that we have to make sure to use a connection with the correct scopes. So that's fine, I've set up a connection with those scopes, and we now no longer get permission errors. However, the dag still doesn't work, it now fails in a couple of different ways.
Most of the time, any dag that tries to upload a google sheet to GCS fails with error Negsignal.SIGKILL. For example:
--------------------------------------------------------------------------------
[2022-10-03, 15:50:55 UTC] {taskinstance.py:1251} INFO - Starting attempt 1 of 1
[2022-10-03, 15:50:55 UTC] {taskinstance.py:1252} INFO -
--------------------------------------------------------------------------------
[2022-10-03, 15:50:55 UTC] {taskinstance.py:1271} INFO - Executing <Task(GoogleSheetsToGCSOperator): upload_sheet_to_gcs_airflow_permission_test_sheet> on 2022-10-03 15:50:38.412899+00:00
[2022-10-03, 15:50:55 UTC] {standard_task_runner.py:52} INFO - Started process 529848 to run task
[2022-10-03, 15:50:55 UTC] {standard_task_runner.py:79} INFO - Running: ['airflow', 'tasks', 'run', 'test_brunel_core_2', 'upload_sheet_to_gcs_airflow_permission_test_sheet', 'manual__2022-10-03T15:50:38.412899+00:00', '--job-id', '7342', '--raw', '--subdir', 'DAGS_FOLDER/DAGs/z_airflow_testing_dags/test_brunel_2_functions.py', '--cfg-path', '/tmp/tmpyuhkixqc', '--error-file', '/tmp/tmp7p2delaz']
[2022-10-03, 15:50:55 UTC] {standard_task_runner.py:80} INFO - Job 7342: Subtask upload_sheet_to_gcs_airflow_permission_test_sheet
/opt/python3.8/lib/python3.8/site-packages/airflow/utils/log/file_task_handler.py:110: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/airflow/gcs/logs/test_brunel_core_2/upload_sheet_to_gcs_airflow_permission_test_sheet/2022-10-03T15:50:38.412899+00:00/1.log' mode='a' encoding='utf-8'>
self.handler = NonCachingFileHandler(local_loc, encoding='utf-8')
[2022-10-03, 15:50:56 UTC] {task_command.py:298} INFO - Running <TaskInstance: test_brunel_core_2.upload_sheet_to_gcs_airflow_permission_test_sheet manual__2022-10-03T15:50:38.412899+00:00 [running]> on host airflow-worker-j28mn
[2022-10-03, 15:50:56 UTC] {taskinstance.py:1448} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=process_dev_joe_m
AIRFLOW_CTX_DAG_ID=test_brunel_core_2
AIRFLOW_CTX_TASK_ID=upload_sheet_to_gcs_airflow_permission_test_sheet
AIRFLOW_CTX_EXECUTION_DATE=2022-10-03T15:50:38.412899+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-10-03T15:50:38.412899+00:00
[2022-10-03, 15:51:02 UTC] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL
[2022-10-03, 15:51:02 UTC] {taskinstance.py:1279} INFO - Marking task as FAILED. dag_id=test_brunel_core_2, task_id=upload_sheet_to_gcs_airflow_permission_test_sheet, execution_date=20221003T155038, start_date=20221003T155055, end_date=20221003T155102
The rest of the time, some random task in the dag fails (not necessarily the step with the GoogleSheetsToGCSOperator). Sometimes a step fails with absolutely no log being generated at all, and sometimes a log is generated but it contains no errors. Instead, the only clue is a warning:
/opt/python3.8/lib/python3.8/site-packages/airflow/utils/log/file_task_handler.py:110: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/airflow/gcs/logs/test_flakiness/create_table_JM_test_table.create/2022-10-04T09:11:58.425115+00:00/1.log' mode='a' encoding='utf-8'>
self.handler = NonCachingFileHandler(local_loc, encoding='utf-8')
The weird thing about that warning is that it's warning about the log file itself. As in, that message is written into log file gs://europe-west1-process-dev-ai-fd1dc540-bucket/logs/test_flakiness/create_table_JM_test_table.create/2022-10-04T09:11:58.425115+00:00/1.log. So of course the file is open, you're writing to it, so why are you warning about it being open?
Some other facts that may or may not be relevant:
composer-2.0.25 airflow-2.2.5
When monitoring the environment, all resources (cpu, memory, etc) seem to be fine, nothing is hitting its limits.
Our environment is configured to use between 1 and 4 workers.
Only ever one worker is used, so I don't think it can be a problem with multiple workers all trying to write to the same file at once.
This is all happening in our test environment. The same dag will work absolutely fine in our prod environment. Our prod environment is running composer-1.19.3-airflow-2.2.5, and therefore is set up differently when it comes to things like Google drive authentication scopes. So that's already 2 potential reasons why things are different in the prod environment.

Airflow task randomly exited with return code 1 [Local Executor / PythonOperator]

To give some context, I am using Airflow 2.3.0 on Kubernetes with the Local Executor (which may sound weird, but it works for us for now) with one pod for the webserver and two for the scheduler.
I have a DAG consisting of a single task (PythonOperator) that makes many API calls (200K) using requests.
Every 15 calls, the data is loaded in a DataFrame and stored on AWS S3 (using boto3) to reduce the RAM usage.
The problem is that I can't get to the end of this task because it goes into error randomly (after 1, 10 or 120 minutes).
I have made more than 50 tries, no success and the only logs on the task are:
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: INGESTION-DAILY-dag.extract_task scheduled__2022-08-30T00:00:00+00:00 [queued]>
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: INGESTION-DAILY-dag.extract_task scheduled__2022-08-30T00:00:00+00:00 [queued]>
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1356} INFO -
--------------------------------------------------------------------------------
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1357} INFO - Starting attempt 23 of 24
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1358} INFO -
--------------------------------------------------------------------------------
[2022-09-01, 14:45:44 UTC] {taskinstance.py:1377} INFO - Executing <Task(_PythonDecoratedOperator): extract_task> on 2022-08-30 00:00:00+00:00
[2022-09-01, 14:45:44 UTC] {standard_task_runner.py:52} INFO - Started process 942 to run task
[2022-09-01, 14:45:44 UTC] {standard_task_runner.py:79} INFO - Running: ['airflow', 'tasks', 'run', 'INGESTION-DAILY-dag', 'extract_task', 'scheduled__2022-08-30T00:00:00+00:00', '--job-id', '4390', '--raw', '--subdir', 'DAGS_FOLDER/dags/ingestion/daily_dag/dag.py', '--cfg-path', '/tmp/tmpwxasaq93', '--error-file', '/tmp/tmpl7t_gd8e']
[2022-09-01, 14:45:44 UTC] {standard_task_runner.py:80} INFO - Job 4390: Subtask extract_task
[2022-09-01, 14:45:45 UTC] {task_command.py:369} INFO - Running <TaskInstance: INGESTION-DAILY-dag.extract_task scheduled__2022-08-30T00:00:00+00:00 [running]> on host 10.XX.XXX.XXX
[2022-09-01, 14:48:17 UTC] {local_task_job.py:156} INFO - Task exited with return code 1
[2022-09-01, 14:48:17 UTC] {taskinstance.py:1395} INFO - Marking task as UP_FOR_RETRY. dag_id=INGESTION-DAILY-dag, task_id=extract_task, execution_date=20220830T000000, start_date=20220901T144544, end_date=20220901T144817
[2022-09-01, 14:48:17 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
But when I go to the pod logs, I get the following message:
[2022-09-01 14:06:31,624] {local_executor.py:128} ERROR - Failed to execute task an integer is required (got type ChunkedEncodingError).
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/local_executor.py", line 124, in _execute_work_in_fork
    args.func(args)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 51, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 377, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 183, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 241, in _run_task_by_local_task_job
    run_job.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 244, in run
    self._execute()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 105, in _execute
    self.task_runner.start()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 41, in start
    self.process = self._start_by_fork()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 125, in _start_by_fork
    os._exit(return_code)
TypeError: an integer is required (got type ChunkedEncodingError)
What I find strange is that I never had this error on other DAGs (where tasks are smaller and faster). I checked, during an attempt, CPU and RAM usages are stable and low.
I have the same error locally, I also tried to upgrade to 2.3.4 but nothing works.
Do you have any idea how to fix this?
Thanks a lot!
Nicolas
As #EDG956 said, this is not an error from Airflow but from the code.
I solved it using a context manager (which was not enough) and recreating a session:
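import requests

# base_url is defined elsewhere in the task.
s = requests.Session()
while True:
    try:
        with s.get(base_url) as r:
            response = r
    except requests.exceptions.ChunkedEncodingError:
        # The connection broke mid-response: discard the session and retry once on a fresh one.
        s.close()
        s = requests.Session()
        response = s.get(base_url)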
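The same idea can be written as a small helper with a bounded number of retries. This is only a sketch: fetch_with_fresh_session and its parameters are hypothetical names for this example, and the session factory and fetch function are passed in so the helper itself has no dependency on requests.

```python
def fetch_with_fresh_session(make_session, fetch, retry_on, max_attempts=3):
    """Call fetch(session), rebuilding the session after each retryable error.

    make_session: zero-argument callable returning an object with close().
    fetch:        callable taking the session and returning the result.
    retry_on:     exception class (or tuple of classes) worth retrying.
    """
    for attempt in range(1, max_attempts + 1):
        session = make_session()
        try:
            return fetch(session)
        except retry_on:
            # Give up once the attempt budget is spent; otherwise loop again.
            if attempt == max_attempts:
                raise
        finally:
            # Always release the session, whether the call succeeded or failed.
            session.close()
```

With requests this could be called as fetch_with_fresh_session(requests.Session, lambda s: s.get(base_url), requests.exceptions.ChunkedEncodingError). Re-raising after the last attempt lets the task fail with the real exception.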

Why are my Airflow tasks being "externally set to failed"?

I'm using Airflow 2.0.0, and my tasks are sporadically being killed "externally" after running for a few seconds or minutes. The tasks usually run successfully (both for manual tasks initiated via airflow tasks test ... and for scheduled DAG runs), so I believe this is not related to my DAG code.
When tasks fail, this seems to be the key error from the task logs:
{local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 11:26:11,448] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:826} INFO - Dependencies all met for <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]>
[2020-12-20 11:26:11,473] {taskinstance.py:1017} INFO -
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,473] {taskinstance.py:1018} INFO - Starting attempt 3 of 3
[2020-12-20 11:26:11,473] {taskinstance.py:1019} INFO -
--------------------------------------------------------------------------------
[2020-12-20 11:26:11,506] {taskinstance.py:1038} INFO - Executing <Task(PythonOperator): run_backupper> on 2020-12-19T02:00:00+00:00
[2020-12-20 11:26:11,509] {standard_task_runner.py:51} INFO - Started process 12059 to run task
[2020-12-20 11:26:11,515] {standard_task_runner.py:75} INFO - Running: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--job-id', '22', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/backupper/daily_backups.py', '--cfg-path', '/tmp/tmpnfmqtorg']
[2020-12-20 11:26:11,517] {standard_task_runner.py:76} INFO - Job 22: Subtask run_backupper
[2020-12-20 11:26:11,609] {logging_mixin.py:103} INFO - Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [running]> on host localhost
[2020-12-20 11:26:11,742] {taskinstance.py:1232} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=<user>
AIRFLOW_CTX_DAG_ID=daily_backups
AIRFLOW_CTX_TASK_ID=run_backupper
AIRFLOW_CTX_EXECUTION_DATE=2020-12-19T02:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2020-12-19T02:00:00+00:00
...
... my job's logs, indicating that the job is running healthily ...
...
[2020-12-20 11:26:16,587] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 11:26:16,593] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 12059
[2020-12-20 11:27:16,609] {process_utils.py:108} WARNING - process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='sleeping', started='11:26:11') did not respond to SIGTERM. Trying SIGKILL
[2020-12-20 11:27:16,618] {process_utils.py:61} INFO - Process psutil.Process(pid=12059, name='airflow task runner: daily_backups run_backupper 2020-12-19T02:00:00+00:00 22', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='11:26:11') (12059) terminated with exit code Negsignal.SIGKILL
[2020-12-20 11:27:16,618] {local_task_job.py:118} INFO - Task exited with return code Negsignal.SIGKILL
The final few lines in the logs are not consistent. Here is a different version, for the same task that failed in an earlier attempt:
... same stuff as before ...
[2020-12-20 02:01:12,689] {local_task_job.py:170} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2020-12-20 02:01:12,695] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 24442
[2020-12-20 02:02:00,462] {taskinstance.py:1214} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-12-20 02:02:00,498] {process_utils.py:61} INFO - Process psutil.Process(pid=24442, status='terminated', exitcode=0, started='02:00:10') (24442) terminated with exit code 0
[2020-12-20 02:02:00,499] {local_task_job.py:118} INFO - Task exited with return code 0
I suspect in this case the script was able to respond to the SIGTERM in time, whereas in the previous case it was blocked on a long-running query and was not able to terminate cleanly.
I believe the problem was that the scheduler health check threshold was set to be smaller than the scheduler heartbeat interval.
In my config I had set scheduler_health_check_threshold to 30 seconds and scheduler_heartbeat_sec to 60 seconds. During the check for orphaned tasks (itself governed by a different parameter, orphaned_tasks_check_interval), the scheduler heartbeat was determined to be older than 30 seconds, which makes sense, because it was only heartbeating every 60 seconds. Thus the scheduler was inferred to be unhealthy and was therefore terminated.
Around the time of the failure, I could see messages like these in /var/log/syslog
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,368] {scheduler_job.py:1751} INFO - Resetting orphaned tasks for active dag runs
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,373] {scheduler_job.py:1764} INFO - Marked 1 SchedulerJob instances as failed
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,381] {scheduler_job.py:1805} INFO - Reset the following 1 orphaned TaskInstances:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [running]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,571] {scheduler_job.py:938} INFO - 1 tasks up for execution:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,574] {scheduler_job.py:972} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:999} INFO - DAG daily_backups has 0/16 running and queued tasks
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,575] {scheduler_job.py:1060} INFO - Setting the following tasks to queued state:
Dec 20 11:26:14 localhost bash[11545]: #011<TaskInstance: daily_backups.run_backupper 2020-12-19 02:00:00+00:00 [scheduled]>
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {scheduler_job.py:1102} INFO - Sending TaskInstanceKey(dag_id='daily_backups', task_id='run_backupper', execution_date=datetime.datetime(2020, 12, 19, 2, 0, tzinfo=Timezone('UTC')), try_number=4) to executor with priority 2 and queue default
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,578] {base_executor.py:79} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,581] {local_executor.py:81} INFO - QueuedLocalWorker running ['airflow', 'tasks', 'run', 'daily_backups', 'run_backupper', '2020-12-19T02:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/storage/airflow/dags/backupper/daily_backups.py']
Dec 20 11:26:14 localhost bash[11545]: [2020-12-20 11:26:14,707] {dagbag.py:440} INFO - Filling up the DagBag from /storage/airflow/dags/backupper/daily_backups.py
Dec 20 11:26:15 localhost bash[11545]: Running <TaskInstance: daily_backups.run_backupper 2020-12-19T02:00:00+00:00 [queued]> on host localhost
and the timestamps coincide closely with the SIGTERM received by my task. I guess that since the SchedulerJob was marked as failed, then the TaskInstance running my actual task was considered an orphan, and thus marked for termination. At the same time it scheduled a new attempt (try_number=4).
Increasing the scheduler_health_check_threshold to 120 seconds and restarting the scheduler/webserver services appears to have resolved my issue.
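A quick sanity check for this kind of misconfiguration could look like the sketch below. The function name is made up for this example, and it assumes both options are set explicitly in the [scheduler] section of the config text it is given.

```python
import configparser


def heartbeat_settings_consistent(cfg_text):
    """Return True if the health check threshold exceeds the heartbeat interval."""
    # Interpolation is disabled so literal '%' characters elsewhere in the
    # config do not trip the parser.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    scheduler = parser["scheduler"]
    heartbeat = scheduler.getfloat("scheduler_heartbeat_sec")
    threshold = scheduler.getfloat("scheduler_health_check_threshold")
    return threshold > heartbeat


if __name__ == "__main__":
    sample = """
[scheduler]
scheduler_heartbeat_sec = 60
scheduler_health_check_threshold = 30
"""
    # The values from my original config fail the check.
    print(heartbeat_settings_consistent(sample))
```

With a threshold of 120 and a heartbeat of 60, the check passes.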
I had the same issue.
From logs:
2021-05-07 13:04:19,960 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:09:20,060 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:14:20,186 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:19:20,263 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:24:20,399 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:29:20,729 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:34:20,892 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:39:21,070 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:44:21,328 INFO - Resetting orphaned tasks for active dag runs
2021-05-07 13:49:21,423 INFO - Resetting orphaned tasks for active dag runs
And my scheduler config was as follows:
# Task instances listen for external kill signal (when you clear tasks
# from the CLI or the UI), this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 5
I had the same issue in AKS (Azure Kubernetes).
I resolved it with setting AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION to False.
https://github.com/apache/airflow/issues/14672
I had the same problem, running on a MacBook. Seems to be the MacBook going to sleep, solved it by ticking "Prevent your Mac from automatically sleeping when the display is off" in the "Power Adapter" section of preferences -> battery. (When on a charger)

Airflow task exited with return code 1 without any warning/error message

Apache Airflow version: 1.10.10
Kubernetes version (if you are using kubernetes) (use kubectl version): Not using Kubernetes or docker
Environment: CentOS Linux release 7.7.1908 (Core) Linux 3.10.0-1062.el7.x86_64
Python Version: 3.7.6
Executor: LocalExecutor
What happened:
I wrote a simple dag to clean airflow logs. Everything is OK when I use the 'airflow test' command to test it. I also triggered it manually in the WebUI, which uses the 'airflow run' command to start my task, and it was still OK.
But after I rebooted my server and restarted my webserver & scheduler services (in daemon mode), every time I trigger the exact same dag, it still gets scheduled as usual but exits with code 1 immediately after starting a new process to run the task.
I also used the 'airflow test' command again to check whether something is wrong with my code now. Everything seems OK when using 'airflow test', but the task exits silently when using 'airflow run', which is really weird.
Here's the task log when it's manually triggered in WebUI ( I've changed the log level to DEBUG, but still can't find anything useful), or you can read the attached log file: task error log.txt
Reading local file: /root/airflow/logs/airflow_log_cleanup/log_cleanup_worker_num_1/2020-04-29T13:51:44.071744+00:00/1.log
[2020-04-29 21:51:53,744] {base_task_runner.py:61} DEBUG - Planning to run as the user
[2020-04-29 21:51:53,750] {taskinstance.py:686} DEBUG - dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
[2020-04-29 21:51:53,754] {taskinstance.py:686} DEBUG - dependency 'Not In Retry Period' PASSED: True, The task instance was not marked for retrying.
[2020-04-29 21:51:53,754] {taskinstance.py:686} DEBUG - dependency 'Task Instance State' PASSED: True, Task state queued was valid.
[2020-04-29 21:51:53,754] {taskinstance.py:669} INFO - Dependencies all met for
[2020-04-29 21:51:53,757] {taskinstance.py:686} DEBUG - dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
[2020-04-29 21:51:53,760] {taskinstance.py:686} DEBUG - dependency 'Pool Slots Available' PASSED: True, ('There are enough open slots in %s to execute the task', 'default_pool')
[2020-04-29 21:51:53,766] {taskinstance.py:686} DEBUG - dependency 'Not In Retry Period' PASSED: True, The task instance was not marked for retrying.
[2020-04-29 21:51:53,768] {taskinstance.py:686} DEBUG - dependency 'Task Concurrency' PASSED: True, Task concurrency is not set.
[2020-04-29 21:51:53,768] {taskinstance.py:669} INFO - Dependencies all met for
[2020-04-29 21:51:53,768] {taskinstance.py:879} INFO -
[2020-04-29 21:51:53,768] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2020-04-29 21:51:53,768] {taskinstance.py:881} INFO -
[2020-04-29 21:51:53,779] {taskinstance.py:900} INFO - Executing on 2020-04-29T13:51:44.071744+00:00
[2020-04-29 21:51:53,781] {standard_task_runner.py:53} INFO - Started process 29718 to run task
[2020-04-29 21:51:53,805] {logging_mixin.py:112} INFO - [2020-04-29 21:51:53,805] {cli_action_loggers.py:68} DEBUG - Calling callbacks: []
[2020-04-29 21:51:53,818] {logging_mixin.py:112} INFO - [2020-04-29 21:51:53,817] {cli_action_loggers.py:86} DEBUG - Calling callbacks: []
[2020-04-29 21:51:58,759] {logging_mixin.py:112} INFO - [2020-04-29 21:51:58,759] {base_job.py:200} DEBUG - [heartbeat]
[2020-04-29 21:51:58,759] {logging_mixin.py:112} INFO - [2020-04-29 21:51:58,759] {local_task_job.py:124} DEBUG - Time since last heartbeat(0.01 s) < heartrate(5.0 s), sleeping for 4.98824 s
[2020-04-29 21:52:03,753] {logging_mixin.py:112} INFO - [2020-04-29 21:52:03,753] {local_task_job.py:103} INFO - Task exited with return code 1
How to reproduce it:
I really don't know how to reproduce it, because it started happening suddenly and now seems to be permanent.
Anything else we need to know:
I tried to figure out the difference between 'airflow test' and 'airflow run'; I guess it might have something to do with the process fork?
What I've tried to solve this problem but all failed:
clear all dag/dag run/task instance info, remove all files under /root/airflow except for the config file, and restart my service
reboot my server again
uninstall airflow and install it again
I finally figured out how to reproduce this bug.
When you configure email in airflow.cfg and your dag contains an email operator or uses the smtp service, and your smtp password contains a character like "^", the first task of your dag will always exit with return code 1 without any error information. In my case the first task is merely a python operator.
Although I think it's my fault for messing up the smtp service, there should be some reasonable hints. It actually took me a whole week to debug this: I had to reset everything in my airflow environment and slowly change the configuration to see when the bug happens.
Hope this information is helpful.

Airflow execution_timeout settings not respected

I have execution_timeout=timedelta(minutes=1) set on my tasks and 'dagrun_timeout': timedelta(minutes=2) set for my DAG, and this is correctly reflected in the web GUI's Task Instance Details. However, none of my task instances are actually set to failed or retry when breaching the one minute threshold. Rather, they time out after roughly 10 minutes, as the log below shows...
[2017-11-02 18:00:05,376] {base_task_runner.py:95} INFO - Subtask: [2017-11-02 18:00:05,370] {base_hook.py:67} INFO - Using connection to: [REDACTED]
[2017-11-02 18:10:06,505] {base_task_runner.py:95} INFO - Subtask: [2017-11-02 18:10:06,504] {timeout.py:37} ERROR - Process timed out
Do I have a problem with my configuration, or is there something buggy happening with how Airflow interprets time out settings?

Resources