Airflow killing tasks right after triggering them

I'm using Airflow 2.0 with the LocalExecutor and a MySQL backend.
My problem is that every time Airflow correctly schedules one of my tasks, it gets killed shortly afterwards. Here is the log:
[2021-03-07 21:01:05,855] {bash.py:158} INFO - Running command: sudo /d/Airflow/pdi-ce-8.3.0.0-371/data-integration/pan.sh -rep:"penta_repo" -user:dummy -pass:dummy -dir:BDN_DCC/ALIMENTADOR -trans:"ALIMENTADOR_1" -level:Basic
[2021-03-07 21:01:05,942] {bash.py:169} INFO - Output:
[2021-03-07 21:01:10,897] {local_task_job.py:169} WARNING - State of this instance has been externally set to None. Terminating instance.
[2021-03-07 21:01:10,922] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 19956
[2021-03-07 21:01:10,923] {taskinstance.py:1214} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-03-07 21:01:10,924] {bash.py:185} INFO - Sending SIGTERM signal to bash process group
[2021-03-07 21:01:10,965] {taskinstance.py:1396} ERROR - [Errno 1] Operation not permitted
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/airflow/models/taskinstance.py", line 1086, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.8/dist-packages/airflow/models/taskinstance.py", line 1260, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.8/dist-packages/airflow/models/taskinstance.py", line 1300, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.8/dist-packages/airflow/operators/bash.py", line 171, in execute
for raw_line in iter(self.sub_process.stdout.readline, b''):
File "/usr/local/lib/python3.8/dist-packages/airflow/models/taskinstance.py", line 1215, in signal_handler
task_copy.on_kill()
File "/usr/local/lib/python3.8/dist-packages/airflow/operators/bash.py", line 187, in on_kill
os.killpg(os.getpgid(self.sub_process.pid), signal.SIGTERM)
PermissionError: [Errno 1] Operation not permitted
[2021-03-07 21:01:10,968] {taskinstance.py:1433} INFO - Marking task as UP_FOR_RETRY. dag_id=ALIMENTADOR_JOB, task_id=Alimentador, execution_date=20210307T230100, start_date=20210308T000105, end_date=20210308T000110
[2021-03-07 21:01:11,337] {process_utils.py:61} INFO - Process psutil.Process(pid=19956, status='terminated', exitcode=1, started='21:01:05') (19956) terminated with exit code 1
[2021-03-07 21:01:49,679] {process_utils.py:61} INFO - Process psutil.Process(pid=20807, status='terminated', started='21:01:06') (20807) terminated with exit code None
[2021-03-07 21:01:49,680] {process_utils.py:61} INFO - Process psutil.Process(pid=20792, status='terminated', started='21:01:06') (20792) terminated with exit code None
[2021-03-07 21:01:49,681] {process_utils.py:61} INFO - Process psutil.Process(pid=20815, status='terminated', started='21:01:06') (20815) terminated with exit code None
[2021-03-07 21:01:49,682] {process_utils.py:61} INFO - Process psutil.Process(pid=19972, status='terminated', started='21:01:05') (19972) terminated with exit code None
[2021-03-07 21:01:49,682] {process_utils.py:61} INFO - Process psutil.Process(pid=19994, status='terminated', started='21:01:05') (19994) terminated with exit code None
[2021-03-07 21:01:49,683] {process_utils.py:61} INFO - Process psutil.Process(pid=20778, status='terminated', started='21:01:06') (20778) terminated with exit code None
[2021-03-07 21:01:49,684] {process_utils.py:61} INFO - Process psutil.Process(pid=20817, status='terminated', started='21:01:06') (20817) terminated with exit code None
[2021-03-07 21:01:49,684] {process_utils.py:61} INFO - Process psutil.Process(pid=20033, status='terminated', started='21:01:06') (20033) terminated with exit code None
[2021-03-07 21:01:49,685] {process_utils.py:61} INFO - Process psutil.Process(pid=20804, status='terminated', started='21:01:06') (20804) terminated with exit code None
[2021-03-07 21:01:49,685] {local_task_job.py:118} INFO - Task exited with return code 1
I've looked at the Airflow config file several times and tried increasing the timeouts, but nothing worked.
Running the bash command directly from the command prompt works fine.
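For reference, the task is a plain BashOperator; here is a minimal sketch of how it is defined, reconstructed from the dag_id, task_id and command in the log (the schedule, start date and retry settings are assumptions, since the DAG file itself isn't shown):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ALIMENTADOR_JOB",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@hourly",   # assumption
    catchup=False,
) as dag:
    alimentador = BashOperator(
        task_id="Alimentador",
        bash_command=(
            "sudo /d/Airflow/pdi-ce-8.3.0.0-371/data-integration/pan.sh "
            '-rep:"penta_repo" -user:dummy -pass:dummy '
            '-dir:BDN_DCC/ALIMENTADOR -trans:"ALIMENTADOR_1" -level:Basic'
        ),
        retries=1,                          # the log shows UP_FOR_RETRY
        retry_delay=timedelta(minutes=5),   # assumption
    )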
Any ideas about what is happening here?
Thanks!

[2021-03-07 21:01:10,897] {local_task_job.py:169} WARNING - State of this instance has been externally set to None. Terminating instance.
[2021-03-07 21:01:10,922] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 19956
[2021-03-07 21:01:10,923] {taskinstance.py:1214} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-03-07 21:01:10,924] {bash.py:185} INFO - Sending SIGTERM signal to bash process group
[2021-03-07 21:01:10,965] {taskinstance.py:1396} ERROR - [Errno 1] Operation not permitted
Those lines say you have OOM issues; check your CPU and your RAM.
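If you want to check that on the worker host, a small sketch using psutil (which Airflow already depends on, as the process_utils lines above show) can sample CPU and memory headroom while the task runs; the number of samples here is arbitrary:

import psutil

# Sample CPU and memory usage; sustained memory use near 100% would support
# the OOM theory above. cpu_percent(interval=1) blocks for one second per sample.
for _ in range(30):
    mem = psutil.virtual_memory()
    cpu = psutil.cpu_percent(interval=1)
    print(f"cpu={cpu:.0f}%  mem_used={mem.percent:.0f}%  "
          f"available={mem.available / 1024 ** 2:.0f} MiB")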

Related

Airflow 2 Error sending Celery task: Timeout

I am in the process of migrating our Airflow environment from version 1.10.15 to 2.3.3. I have migrated one DAG over to the new environment, and intermittently I get an email with this error: Executor reports task instance finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
Looking at the scheduler logs, this is what I find:
[2022-08-09 07:00:08,621] {dag.py:2968} INFO - Setting next_dagrun for DAGRP-Get_Overrides to 2022-08-09T11:00:00+00:00, run_after=2022-08-09T16:00:00+00:00
[2022-08-09 07:00:08,652] {scheduler_job.py:353} INFO - 1 tasks up for execution:
<TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [scheduled]>
[2022-08-09 07:00:08,652] {scheduler_job.py:418} INFO - DAG DAGRP-Get_Overrides has 0/3 running and queued tasks
[2022-08-09 07:00:08,652] {scheduler_job.py:504} INFO - Setting the following tasks to queued state:
<TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [scheduled]>
[2022-08-09 07:00:08,654] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1) to executor with priority 1 and queue default
[2022-08-09 07:00:08,654] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'DAGRP-Get_Overrides', 'Get_override', 'scheduled__2022-08-08T16:00:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/da_group/get_override.py']
[2022-08-09 07:00:12,665] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:12,667] {celery_executor.py:283} INFO - [Try 1 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:16,701] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:16,702] {celery_executor.py:283} INFO - [Try 2 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:21,704] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:21,705] {celery_executor.py:283} INFO - [Try 3 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:26,627] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:26,627] {celery_executor.py:294} ERROR - Error sending Celery task: Timeout, PID: 1
Celery Task ID: TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)
Traceback (most recent call last):
File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 30, in __call__
return self.__value__
AttributeError: 'ChannelPromise' object has no attribute '__value__'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/airflow/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 177, in send_task_to_executor
result = task_to_run.apply_async(args=[command], queue=queue)
File "/opt/airflow/lib/python3.8/site-packages/celery/app/task.py", line 575, in apply_async
return app.send_task(
File "/opt/airflow/lib/python3.8/site-packages/celery/app/base.py", line 788, in send_task
amqp.send_task_message(P, name, message, **options)
File "/opt/airflow/lib/python3.8/site-packages/celery/app/amqp.py", line 510, in send_task_message
ret = producer.publish(
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 177, in publish
return _publish(
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 523, in _ensured
return fun(*args, **kwargs)
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 186, in _publish
channel = self.channel
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 209, in _get_channel
channel = self._channel = channel()
File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 32, in __call__
value = self.__value__ = self.__contract__()
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 225, in <lambda>
channel = ChannelPromise(lambda: connection.default_channel)
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 895, in default_channel
self._ensure_connection(**conn_opts)
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 433, in _ensure_connection
return retry_over_time(
File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 312, in retry_over_time
return fun(*args, **kwargs)
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 877, in _connection_factory
self._connection = self._establish_connection()
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 812, in _establish_connection
conn = self.transport.establish_connection()
File "/opt/airflow/lib/python3.8/site-packages/kombu/transport/pyamqp.py", line 201, in establish_connection
conn.connect()
File "/opt/airflow/lib/python3.8/site-packages/amqp/connection.py", line 323, in connect
self.transport.connect()
File "/opt/airflow/lib/python3.8/site-packages/amqp/transport.py", line 129, in connect
self._connect(self.host, self.port, self.connect_timeout)
File "/opt/airflow/lib/python3.8/site-packages/amqp/transport.py", line 184, in _connect
self.sock.connect(sa)
File "/opt/airflow/lib/python3.8/site-packages/airflow/utils/timeout.py", line 68, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 1
[2022-08-09 07:00:26,627] {scheduler_job.py:599} INFO - Executor reports execution of DAGRP-Get_Overrides.Get_override run_id=scheduled__2022-08-08T16:00:00+00:00 exited with status failed for try_number 1
[2022-08-09 07:00:26,633] {scheduler_job.py:642} INFO - TaskInstance Finished: dag_id=DAGRP-Get_Overrides, task_id=Get_override, run_id=scheduled__2022-08-08T16:00:00+00:00, map_index=-1, run_start_date=None, run_end_date=None, run_duration=None, state=queued, executor_state=failed, try_number=1, max_tries=0, job_id=None, pool=default_pool, queue=default, priority_weight=1, operator=PythonOperator, queued_dttm=2022-08-09 11:00:08.652767+00:00, queued_by_job_id=56, pid=None
[2022-08-09 07:00:26,633] {scheduler_job.py:684} ERROR - Executor reports task instance <TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-08-09 07:01:16,687] {processor.py:233} WARNING - Killing DAGFileProcessorProcess (PID=1811)
[2022-08-09 07:04:00,640] {scheduler_job.py:1233} INFO - Resetting orphaned tasks for active dag runs
I am running Airflow on 2 servers with 2 of each service (2 schedulers, 2 workers, 2 webservers), all in Docker containers. They are configured to use the Celery executor, and I'm using RabbitMQ version 3.10.6 (also 2 instances in Docker containers behind a load balancer). I am using Postgres 13.7 for the database (one instance in a Docker container on the 1st server). The environment runs on Python 3.8.12.
From my understanding, the timeout is between the scheduler and RabbitMQ? From what I can tell, we are hitting this timeout: AIRFLOW__CELERY__OPERATION_TIMEOUT (it's currently set to 4).
I would like to track down what is causing the issue before I just increase timeout settings. What can I do to find out what's going on? Has anyone else run into this issue? Am I correct in assuming the timeout is between the scheduler and RabbitMQ? Or is it between the scheduler and the database? Why am I seeing this with Airflow 2 when the same setup on Airflow 1 works with no problems? Any help is greatly appreciated!
Update:
I was able to reproduce the error by shutting down one of the RabbitMQ nodes. Even though RabbitMQ is behind a LB with a health probe, whenever a job was picked up by scheduler 1 it would fail with this error, but if scheduler 2 picked up the job it would finish successfully. The odd thing is that I shut down RabbitMQ node 2.
So I think I've been able to solve this issue. Here is what I did:
I added a custom celery_config.py to the scheduler and worker Docker containers and set this environment variable: AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=celery_config.CELERY_CONFIG. As part of that celery config, I specified both of my RabbitMQ brokers under broker_url. This is the full config:
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
import os

RABBITMQ_PW = os.environ["RABBITMQ_PW"]
CLUSTER_NODE = os.environ["RABBITMQ_CLUSTER_NODE"]
LOCAL_NODE = os.environ["RABBITMQ_NODE"]

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    "worker_send_task_events": True,
    "task_send_sent_event": True,
    "result_extended": True,
    # Both RabbitMQ nodes, so Celery can fail over if one is unreachable.
    "broker_url": [
        f"amqp://rabbitmq:{RABBITMQ_PW}@{LOCAL_NODE}:5672",
        f"amqp://rabbitmq:{RABBITMQ_PW}@{CLUSTER_NODE}:5672",
    ],
}
What happens now is that if the worker loses its connection to the 1st broker, it will attempt to connect to the 2nd broker.
[2022-08-11 12:00:52,876: ERROR/MainProcess] consumer: Cannot connect to amqp://rabbitmq:**@<LOCAL_NODE>:5672//: [Errno 111] Connection refused.
[2022-08-11 12:00:52,875: INFO/MainProcess] Connected to amqp://rabbitmq:**@<CLUSTER_NODE>:5672//
Also an interesting note: I still have the Airflow environment variable AIRFLOW__CELERY__BROKER_URL set to the load balancer URL. That's because Airflow (1) won't allow the worker to start without it, and (2) won't let you specify multiple brokers there the way the celery config does. So when the worker starts, it shows:
- ** ---------- .> transport: amqp://rabbitmq:**@<LOCAL_NODE>:5672//
[2022-08-26 11:37:17,952: INFO/MainProcess] Connected to amqp://rabbitmq:**@<LOCAL_NODE>:5672//
This is even though I have the LB configured as the AIRFLOW__CELERY__BROKER_URL.
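As a standalone illustration of the failover idea (separate from the Airflow setup above, with placeholder hosts and credentials): kombu, which Celery uses underneath, accepts a list of broker URLs and can switch to the next one when the current broker is unreachable. A minimal sketch:

from kombu import Connection

brokers = [
    "amqp://rabbitmq:secret@rabbit-node-1:5672//",
    "amqp://rabbitmq:secret@rabbit-node-2:5672//",
]

conn = Connection(brokers, failover_strategy="round-robin")
print(conn.as_uri())      # currently targeted broker (node 1), password masked
conn.maybe_switch_next()  # what kombu does when the current broker is unreachable
print(conn.as_uri())      # now pointing at node 2
conn.release()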

Airflow task succeeds but returns SIGTERM

I have a task in Airflow 2.1.2 which finishes with success status, but after that the log shows a SIGTERM:
[2021-12-07 06:11:45,031] {python.py:151} INFO - Done. Returned value was: None
[2021-12-07 06:11:45,224] {taskinstance.py:1204} INFO - Marking task as SUCCESS. dag_id=DAG_ID, task_id=TASK_ID, execution_date=20211207T050000, start_date=20211207T061119, end_date=20211207T061145
[2021-12-07 06:11:45,308] {local_task_job.py:197} WARNING - State of this instance has been externally set to success. Terminating instance.
[2021-12-07 06:11:45,309] {taskinstance.py:1265} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-12-07 06:11:45,310] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 6666
[2021-12-07 06:11:45,310] {taskinstance.py:1284} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-12-07 06:11:45,362] {process_utils.py:66} INFO - Process psutil.Process(pid=6666, status='terminated', exitcode=1, started='06:11:19') (6666) terminated with exit code 1
As you can see, the first row shows Done, and the earlier rows of this log (not shown here) confirmed that the whole script worked fine and the data was inserted into the data warehouse.
The warning line shows that the SIGTERM was sent because the task state was externally set to success, but I am sure that nobody used the API, the CLI, or the UI to mark it as success.
Any idea how to avoid this, and why could it be happening?
I don't know if increasing AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME could fix it, but I would like to understand what is happening.

Great Expectations validation failed and job aborted

I am working on a data monitoring task where I am using the Great Expectations framework to monitor the quality of the data. I am using Airflow, BigQuery, and Great Expectations together to achieve this.
I have set the param is_blocking: False for the expectation, but the job is aborted with an exception and the downstream tasks cannot execute because of this. Is there a way for the notifications to be sent without stopping the execution?
The detailed exception is as follows:
[2021-11-29 15:19:45,925] {taskinstance.py:1252} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=data-science
AIRFLOW_CTX_DAG_ID=abcd-data-ds-1
AIRFLOW_CTX_TASK_ID=ge-notify-_data_monitoring-expect_-5ff9677f
AIRFLOW_CTX_EXECUTION_DATE=2021-11-29T11:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-11-29T11:00:00+00:00
[2021-11-29 15:19:45,926] {great_expectations_notification_operator.py:42} INFO - Retrieving key data-ds-v4__promo_roi_input_features_monitoring_expect_column_values_to_be_between47deadf091f092857156a30495953f3c_20211129T110000
[2021-11-29 15:19:45,986] {alerts.py:109} INFO - Sending slack notification
[2021-11-29 15:19:46,411] {great_expectations_notification_operator.py:73} ERROR - Validation failed in datawarehouse for abcd.xyz.is_outlier
[2021-11-29 15:19:46,430] {taskinstance.py:1463} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1165, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1283, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1308, in _execute_task
result = task_copy.execute(context=context)
File "/opt/airflow/src/datahub/operators/expectations/great_expectations_notification_operator.py", line 79, in execute
raise AirflowException(message)
airflow.exceptions.AirflowException: Validation failed in datawarehouse for abcd.xyz.is_outlier
[2021-11-29 15:19:46,432] {taskinstance.py:1506} INFO - Marking task as FAILED. dag_id=curated-data-ds-v4, task_id=ge-notify-data_monitoring-expect_-5ff9677f, execution_date=20211129T110000, start_date=20211129T151945, end_date=20211129T151946
[2021-11-29 15:19:46,505] {local_task_job.py:151} INFO - Task exited with return code 1
[2021-11-29 15:19:46,557] {alerts.py:109} INFO - Sending slack notification
[2021-11-29 15:19:47,564] {local_task_job.py:261} INFO - 0 downstream tasks scheduled from follow-on schedule check

Airflow Execution Timeout not working well

I've set the 'execution_timeout': timedelta(seconds=300) parameter on many tasks. When the execution timeout is set on a task downloading data from Google Analytics it works properly: after ~300 seconds the task is set to failed. That task downloads some data from an API (Python), does some transformations (Python), and loads the data into PostgreSQL.
Then I have a task which executes only one PostgreSQL function. Its execution sometimes takes more than 300 seconds, but I get this (the task is marked as finished successfully):
*** Reading local file: /home/airflow/airflow/logs/bulk_replication_p2p_realtime/t1/2020-07-20T00:05:00+00:00/1.log
[2020-07-20 05:05:35,040] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1139} INFO - Dependencies all met for <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [queued]>
[2020-07-20 05:05:35,051] {__init__.py:1353} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,051] {__init__.py:1354} INFO - Starting attempt 1 of 1
[2020-07-20 05:05:35,051] {__init__.py:1355} INFO -
--------------------------------------------------------------------------------
[2020-07-20 05:05:35,098] {__init__.py:1374} INFO - Executing <Task(PostgresOperator): t1> on 2020-07-20T00:05:00+00:00
[2020-07-20 05:05:35,099] {base_task_runner.py:119} INFO - Running: ['airflow', 'run', 'bulk_replication_p2p_realtime', 't1', '2020-07-20T00:05:00+00:00', '--job_id', '958216', '--raw', '-sd', 'DAGS_FOLDER/bulk_replication_p2p_realtime.py', '--cfg_path', '/tmp/tmph11tn6fe']
[2020-07-20 05:05:37,348] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:37,347] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=10, pool_recycle=1800, pid=26244
[2020-07-20 05:05:39,503] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,501] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-07-20 05:05:39,857] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,856] {__init__.py:305} INFO - Filling up the DagBag from /home/airflow/airflow/dags/bulk_replication_p2p_realtime.py
[2020-07-20 05:05:39,894] {base_task_runner.py:101} INFO - Job 958216: Subtask t1 [2020-07-20 05:05:39,894] {cli.py:517} INFO - Running <TaskInstance: bulk_replication_p2p_realtime.t1 2020-07-20T00:05:00+00:00 [running]> on host dwh2-airflow-dev
[2020-07-20 05:05:39,938] {postgres_operator.py:62} INFO - Executing: CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:05:39,960] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,953] {base_hook.py:83} INFO - Using connection to: id: postgres_warehouse. Host: XXX Port: 5432, Schema: XXXX Login: XXX Password: XXXXXXXX, extra: {}
[2020-07-20 05:05:39,973] {logging_mixin.py:95} INFO - [2020-07-20 05:05:39,972] {dbapi_hook.py:171} INFO - CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime',p_group_size=>4 , p_group=>1, p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')
[2020-07-20 05:23:21,450] {logging_mixin.py:95} INFO - [2020-07-20 05:23:21,449] {timeout.py:42} ERROR - Process timed out, PID: 26244
[2020-07-20 05:23:36,453] {logging_mixin.py:95} INFO - [2020-07-20 05:23:36,452] {jobs.py:2562} INFO - Task exited with return code 0
Does anyone know how to enforce the execution timeout for such long-running functions? It seems that the execution timeout is only evaluated once the PG function finishes.
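For reference, the timeout on the PostgreSQL task is attached roughly like this; this is a sketch reconstructed from the log above, and everything other than execution_timeout and the SQL call is an assumption:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

with DAG(
    dag_id="bulk_replication_p2p_realtime",
    start_date=datetime(2020, 7, 1),
    schedule_interval="*/5 * * * *",   # assumption
    catchup=False,
) as dag:
    t1 = PostgresOperator(
        task_id="t1",
        postgres_conn_id="postgres_warehouse",
        sql=(
            "CALL dw_system.bulk_replicate(p_graph_name=>'replication_p2p_realtime', "
            "p_group_size=>4, p_group=>1, "
            "p_dag_id=>'bulk_replication_p2p_realtime', p_task_id=>'t1')"
        ),
        # The timeout that is expected to fail the task after ~300 seconds:
        execution_timeout=timedelta(seconds=300),
    )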
Airflow uses the signal module from the standard library to effect a timeout. Airflow uses it to hook into the operating system's signal handling and request that the running process be notified in N seconds; if the process is still inside the timeout context at that point (see the __enter__ and __exit__ methods on the class), it raises an AirflowTaskTimeout exception.
Unfortunately for this situation, there are certain classes of system operations that cannot be interrupted. This is actually called out in the signal documentation:
A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes.
To which we say "But I'm not doing a long-running calculation in C!" -- yeah, for Airflow this is almost always due to uninterruptible I/O operations.
The last sentence of that quote nicely explains why the handler is still triggered even after the task is (frustratingly!) allowed to finish, well beyond your requested timeout.
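To make that mechanism concrete, here is a minimal sketch of a SIGALRM-based timeout context manager; it illustrates the idea, but it is not Airflow's actual timeout class:

import signal
import time


class TaskTimeout(Exception):
    """Stand-in for Airflow's AirflowTaskTimeout."""


class timeout:
    """SIGALRM-based timeout (Unix only), illustrating the context manager described above."""

    def __init__(self, seconds, error_message="Timeout"):
        self.seconds = seconds
        self.error_message = error_message

    def handle_timeout(self, signum, frame):
        # Runs when SIGALRM is delivered, but only once the interpreter regains
        # control (e.g. after a blocking C-level call returns).
        raise TaskTimeout(self.error_message)

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.alarm(self.seconds)   # ask the kernel for SIGALRM in N seconds
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        signal.alarm(0)              # cancel the pending alarm


if __name__ == "__main__":
    try:
        with timeout(2):
            time.sleep(10)           # interruptible, so TaskTimeout fires after ~2s
    except TaskTimeout as exc:
        print(f"caught: {exc}")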

How to fix the error "AirflowException("Hostname of job runner does not match")"?

I'm running Airflow on my computer (a MacBook Air, 1.6 GHz Intel Core i5 and 8 GB 2133 MHz LPDDR3). A DAG with several tasks failed with the error below. I checked several articles online, but with little to no help. There is nothing wrong with the task itself (double checked).
Any help is much appreciated.
[2019-08-27 13:01:55,372] {sequential_executor.py:45} INFO - Executing command: ['airflow', 'run', 'Makefile_DAG', 'normalize_companies', '2019-08-27T15:38:20.914820+00:00', '--local', '--pool', 'default_pool', '-sd', '/home/airflow/dags/makefileDAG.py']
[2019-08-27 13:01:56,937] {settings.py:213} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=40647
[2019-08-27 13:01:57,285] {__init__.py:51} INFO - Using executor SequentialExecutor
[2019-08-27 13:01:59,423] {dagbag.py:90} INFO - Filling up the DagBag from /home/airflow/dags/makefileDAG.py
[2019-08-27 13:02:01,736] {cli.py:516} INFO - Running <TaskInstance: Makefile_DAG.normalize_companies 2019-08-27T15:38:20.914820+00:00 [queued]> on host ajays-macbook-air.local
Traceback (most recent call last):
File "/anaconda3/envs/airflow/bin/airflow", line 32, in <module>
args.func(args)
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/utils/cli.py", line 74, in wrapper
return f(*args, **kwargs)
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/bin/cli.py", line 522, in run
_run(args, dag, ti)
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/bin/cli.py", line 435, in _run
run_job.run()
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/jobs/base_job.py", line 213, in run
self._execute()
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/jobs/local_task_job.py", line 111, in _execute
self.heartbeat()
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/jobs/base_job.py", line 196, in heartbeat
self.heartbeat_callback(session=session)
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/utils/db.py", line 70, in wrapper
return func(*args, **kwargs)
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/jobs/local_task_job.py", line 159, in heartbeat_callback
raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match
[2019-08-27 13:05:05,904] {sequential_executor.py:52} ERROR - Failed to execute task Command '['airflow', 'run', 'Makefile_DAG', 'normalize_companies', '2019-08-27T15:38:20.914820+00:00', '--local', '--pool', 'default_pool', '-sd', '/home/airflow/dags/makefileDAG.py']' returned non-zero exit status 1..
[2019-08-27 13:05:05,905] {scheduler_job.py:1256} INFO - Executor reports execution of Makefile_DAG.normalize_companies execution_date=2019-08-27 15:38:20.914820+00:00 exited with status failed for try_number 2
Logs from the task:
[2019-08-27 13:02:13,616] {bash_operator.py:115} INFO - Running command: python /home/Makefile_Redo/normalize_companies.py
[2019-08-27 13:02:13,628] {bash_operator.py:124} INFO - Output:
[2019-08-27 13:05:02,849] {logging_mixin.py:95} INFO - [2019-08-27 13:05:02,848] {local_task_job.py:158} WARNING - The recorded hostname ajays-macbook-air.local does not match this instance's hostname AJAYs-MacBook-Air.local
[2019-08-27 13:05:02,860] {helpers.py:319} INFO - Sending Signals.SIGTERM to GPID 40649
[2019-08-27 13:05:02,861] {taskinstance.py:897} ERROR - Received SIGTERM. Terminating subprocesses.
[2019-08-27 13:05:02,862] {bash_operator.py:142} INFO - Sending SIGTERM signal to bash process group
[2019-08-27 13:05:03,539] {taskinstance.py:1047} ERROR - Task received SIGTERM signal
Traceback (most recent call last):
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 922, in _run_raw_task
result = task_copy.execute(context=context)
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/operators/bash_operator.py", line 126, in execute
for line in iter(sp.stdout.readline, b''):
File "/anaconda3/envs/airflow/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 899, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2019-08-27 13:05:03,550] {taskinstance.py:1076} INFO - All retries failed; marking task as FAILED
A weird thing I noticed in the above log is:
The recorded hostname ajays-macbook-air.local does not match this instance's hostname AJAYs-MacBook-Air.local
How is this possible, and is there any way to fix it?
I had the same problem on my Mac. The solution that worked for me was updating airflow.cfg with hostname_callable = socket:gethostname. The original getfqdn returns different hostnames from time to time.
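A quick way to see the mismatch on the affected machine is to compare what the two callables return; just a small check script:

import socket

# A case or domain difference between these two values explains the
# "Hostname of job runner does not match" error above.
print("socket.getfqdn():    ", socket.getfqdn())
print("socket.gethostname():", socket.gethostname())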
