airflow kubernetes pod operator retry no working

airflow kubernetes pod operator retry no working - airflow

I am running airflow 1.10.12 in CeleryExecutor mode on Kubernetes cluster. This airflow DAG will create a Pod on the same Kubernetes cluster but in a different namespace.
On a good run, pod will run spark job successfully. But if it fails, airflow should retry. By retrying, the failed pod should receive a new label, metadata.labels.already_checked: 'True' on the first retry attempt and on the second retry attempt a new pod should be launched. However, a new label is not being marked as intended.
A snapshot of error message
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 282, in execute
final_state, result = self.handle_pod_overlap(labels, try_numbers_match, launcher, pod_list)
File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 312, in handle_pod_overlap
final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 432, in monitor_launched_pod
'Pod returned a failure: {state}'.format(state=final_state)
airflow.exceptions.AirflowException: Pod returned a failure: failed

Related

failure to install airflow helm chart on GKE due to failing migration

I have been trying to set up airflow using the official airflow helm chart from artifacthub on a GKE cluster but coming up with a couple of issues.
First I get these errors from the pods:
Failed to load logs: container "scheduler" in pod "airflow-scheduler-6b6cc9db4-qmvbw" is waiting to start: PodInitializing
Reason: BadRequest (400)
Then from the init containers, I get the following error:
Traceback (most recent call last):
File "/home/airflow/.local/bin/airflow", line 8, in <module>
sys.exit(main())
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/__main__.py", line 39, in main
args.func(args)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 52, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/db_command.py", line 138, in check_migrations
db.check_migrations(timeout=args.migration_wait_timeout)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/db.py", line 739, in check_migrations
f"There are still unapplied migrations after {timeout} seconds. Migration"
TimeoutError: There are still unapplied migrations after 60 seconds. MigrationHead(s) in DB: set() | Migration Head(s) in Source Code: {'ecb43d2a1842'}
The postgres pods look okay.
The second issue is that I get the error that the pods are not scheduled because no nodes are available, and the cluster doesn't autoscale because even after that, the nodes will not contain the new pods.
The challenge with the above is that I upgraded the nodes to have 32 processors and 64GB memory, but the error persists. So I assume it is something else.

Airflow 2 Error sending Celery task: Timeout

I am in the process of migrating our Airflow environment from version 1.10.15 to 2.3.3. I have migrated 1 DAG over to the new environment and intermittently I get an email with this error: Executor reports task instance finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
When looking at the logs, this is what I find in the scheduler logs:
[2022-08-09 07:00:08,621] {dag.py:2968} INFO - Setting next_dagrun for DAGRP-Get_Overrides to 2022-08-09T11:00:00+00:00, run_after=2022-08-09T16:00:00+00:00
[2022-08-09 07:00:08,652] {scheduler_job.py:353} INFO - 1 tasks up for execution:
<TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [scheduled]>
[2022-08-09 07:00:08,652] {scheduler_job.py:418} INFO - DAG DAGRP-Get_Overrides has 0/3 running and queued tasks
[2022-08-09 07:00:08,652] {scheduler_job.py:504} INFO - Setting the following tasks to queued state:
<TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [scheduled]>
[2022-08-09 07:00:08,654] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1) to executor with priority 1 and queue default
[2022-08-09 07:00:08,654] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'DAGRP-Get_Overrides', 'Get_override', 'scheduled__2022-08-08T16:00:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/da_group/get_override.py']
[2022-08-09 07:00:12,665] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:12,667] {celery_executor.py:283} INFO - [Try 1 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:16,701] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:16,702] {celery_executor.py:283} INFO - [Try 2 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:21,704] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:21,705] {celery_executor.py:283} INFO - [Try 3 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:26,627] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:26,627] {celery_executor.py:294} ERROR - Error sending Celery task: Timeout, PID: 1
Celery Task ID: TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)
Traceback (most recent call last):
File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 30, in __call__
return self.__value__
AttributeError: 'ChannelPromise' object has no attribute '__value__'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/airflow/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 177, in send_task_to_executor
result = task_to_run.apply_async(args=[command], queue=queue)
File "/opt/airflow/lib/python3.8/site-packages/celery/app/task.py", line 575, in apply_async
return app.send_task(
File "/opt/airflow/lib/python3.8/site-packages/celery/app/base.py", line 788, in send_task
amqp.send_task_message(P, name, message, **options)
File "/opt/airflow/lib/python3.8/site-packages/celery/app/amqp.py", line 510, in send_task_message
ret = producer.publish(
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 177, in publish
return _publish(
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 523, in _ensured
return fun(*args, **kwargs)
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 186, in _publish
channel = self.channel
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 209, in _get_channel
channel = self._channel = channel()
File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 32, in __call__
value = self.__value__ = self.__contract__()
File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 225, in <lambda>
channel = ChannelPromise(lambda: connection.default_channel)
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 895, in default_channel
self._ensure_connection(**conn_opts)
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 433, in _ensure_connection
return retry_over_time(
File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 312, in retry_over_time
return fun(*args, **kwargs)
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 877, in _connection_factory
self._connection = self._establish_connection()
File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 812, in _establish_connection
conn = self.transport.establish_connection()
File "/opt/airflow/lib/python3.8/site-packages/kombu/transport/pyamqp.py", line 201, in establish_connection
conn.connect()
File "/opt/airflow/lib/python3.8/site-packages/amqp/connection.py", line 323, in connect
self.transport.connect()
File "/opt/airflow/lib/python3.8/site-packages/amqp/transport.py", line 129, in connect
self._connect(self.host, self.port, self.connect_timeout)
File "/opt/airflow/lib/python3.8/site-packages/amqp/transport.py", line 184, in _connect
self.sock.connect(sa)
File "/opt/airflow/lib/python3.8/site-packages/airflow/utils/timeout.py", line 68, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 1
[2022-08-09 07:00:26,627] {scheduler_job.py:599} INFO - Executor reports execution of DAGRP-Get_Overrides.Get_override run_id=scheduled__2022-08-08T16:00:00+00:00 exited with status failed for try_number 1
[2022-08-09 07:00:26,633] {scheduler_job.py:642} INFO - TaskInstance Finished: dag_id=DAGRP-Get_Overrides, task_id=Get_override, run_id=scheduled__2022-08-08T16:00:00+00:00, map_index=-1, run_start_date=None, run_end_date=None, run_duration=None, state=queued, executor_state=failed, try_number=1, max_tries=0, job_id=None, pool=default_pool, queue=default, priority_weight=1, operator=PythonOperator, queued_dttm=2022-08-09 11:00:08.652767+00:00, queued_by_job_id=56, pid=None
[2022-08-09 07:00:26,633] {scheduler_job.py:684} ERROR - Executor reports task instance <TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-08-09 07:01:16,687] {processor.py:233} WARNING - Killing DAGFileProcessorProcess (PID=1811)
[2022-08-09 07:04:00,640] {scheduler_job.py:1233} INFO - Resetting orphaned tasks for active dag runs
I am running Airflow on 2 servers with 2 of each service (2 schedulers, 2 workers, 2 webservers). They are running in docker containers. They are configured to use celery executor and I'm using RabbitMQ version 3.10.6 (also 2 instances in docker containers behind a LB). I am using Postgres 13.7 for our database (running one instance in a docker container on the 1st server). Our environment is running on Python 3.8.12.
From my understanding, the timeout is between the scheduler and rabbitmq? From what I can tell we are hitting this timeout: AIRFLOW__CELERY__OPERATION_TIMEOUT (it's currently set to 4).
I would like to track down what is causing the issue before I just increase timeout settings. What can I do to find out what's going on? Anyone else run into this issue? Am I correct in assuming the timeout is between the scheduler and rabbitmq? Is it between the scheduler and database? Why am I seeing this with Airflow 2 when I have the same setup with Airflow 1 and it works with no problems? Any help is greatly appreciated!
Update:
I was able to reproduce the error by shutting down 1 of the rabbitmq nodes. Even though rabbitmq is behind a LB with a health probe, whenever a job was picked up by scheduler 1, it would fail with this error... But if scheduler 2 picked up the job, it would finish successfully. The odd thing is that I shut down rabbitmq 2..

So I think I've been able to solve this issue. Here is what I did:
I added a custom celery_config.py to the scheduler and worker docker containers, adding this environment variable: AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=celery_config.CELERY_CONFIG. As part of that celery config, I specified both my rabbitmq brokers under broker_url. This is the full config:
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
import os
RABBITMQ_PW = os.environ["RABBITMQ_PW"]
CLUSTER_NODE = os.environ["RABBITMQ_CLUSTER_NODE"]
LOCAL_NODE = os.environ["RABBITMQ_NODE"]
CELERY_CONFIG = {
**DEFAULT_CELERY_CONFIG,
"worker_send_task_events": True,
"task_send_sent_event": True,
"result_extended": True,
"broker_url": [
f'amqp://rabbitmq:{RABBITMQ_PW}#{LOCAL_NODE}:5672',
f'amqp://rabbitmq:{RABBITMQ_PW}#{CLUSTER_NODE}:5672'
]
}
What happens now in the worker, if it looses connection to the 1st broker, it will attempt to connect to the 2nd broker.
[2022-08-11 12:00:52,876: ERROR/MainProcess] consumer: Cannot connect to amqp://rabbitmq:**#<LOCAL_NODE>:5672//: [Errno 111] Connection refused.
[2022-08-11 12:00:52,875: INFO/MainProcess] Connected to amqp://rabbitmq:**#<CLUSTER_NODE>:5672//
Also an interesting note, I still have the Airflow environment variable AIRFLOW__CELERY__BROKER_URL set to the load balancer URL. That's because Airflow 1 won't allow the worker to start without it, and 2 won't allow you to specify multiple brokers like the celery config does. So when the worker starts, it shows:
- ** ---------- .> transport: amqp://rabbitmq:**#<LOCAL_NODE>:5672//
[2022-08-26 11:37:17,952: INFO/MainProcess] Connected to amqp://rabbitmq:**#<LOCAL_NODE>:5672//
Even though I have the LB configured for the AIRFLOW__CELERY__BROKER_URL

Airflow Exception - Task received SIGTERM signal

I am running airflow tasks using SSH operator. I am pretty sure that the python program has no error and runs successfully when i run it. But when run from airflow towards the end of program execution I end up with SIGTERM error.
I tried to figure out by looking into various solutions but nothing worked. I tried increasing
killed_task_cleanup_time = 1200 from 60 in airflow.cfg file. Also tried changing hostname_callable to socket:gethostname in airflow.cfg as I received the following warning before this error
Warning: The recorded hostname xxx does not match this instance's hostname
Error:
[2020-10-15 10:45:34,937] {taskinstance.py:954} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-10-15 10:45:34,959] {taskinstance.py:1145} ERROR - SSH operator error: Task received SIGTERM signal
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/site-packages/airflow/contrib/operators/ssh_operator.py", line 137, in execute
readq, _, _ = select([channel], [], [], self.timeout)
File "/opt/anaconda3/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 956, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
Any ideas and suggestions are teally helpful. Stuck with this for a day now

This problem is triggered by the fact that the RECORDED hostname XXX maps an IP address that is different from the IP address mapped by instance's hostname, throwing a SIGTERM error. So you need to specify the IP mapping for the recorded Hostname XXX

Possibly this thread might help? https://issues.apache.org/jira/browse/AIRFLOW-966.
Which version of airflow are you using, and did you check your celery broker settings?
The solution seems to be setting visibility timeout higher than the celery default, which is 1 hour, to prevent celery from re-submitting the job. I believe this only affects tasks created via manual run / CLI (not normally scheduled tasks.)

apache airflow 1.10.9 statsd enabled making scheduler crashed

my airflow running in CeleryExecutor mode + progresql 12, all things go well except when turning statsd on:
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
The schedulers can render jobs but jobs are not running, the scheduler log having below error:
[SQL: SELECT count(*) AS count_1
FROM task_instance
WHERE task_instance.pool = %(pool_1)s AND task_instance.state IN (%(state_1)s, %(state_2)s)]
[parameters: {'pool_1': 'default_pool', 'state_1': 'running', 'state_2': 'queued'}]
(Background on this error at: http://sqlalche.me/e/4xp6)[0m
[31mTraceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1246, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 588, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.ProtocolViolation: invalid frontend message type 97
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1495, in _validate_and_run_task_instances
self._process_and_execute_tasks(simple_dag_bag)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 588, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (psycopg2.errors.ProtocolViolation) invalid frontend message type 97
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
If disable statsd, everything resume. Is it a bug for airflow? any advise to resolve it?

I faced the same error, and after a few tests, i can get statsd metrics working. Typically, you will see the error if the following conditions are met.
Statsd enabled set to True
SqlAlchemy connection pool set to True
Scheduler syserr log enabled (by redirect the err log to a file where you can see this error)
In my case, even though the scheduler kept throwing the error logs, statsd metrics were still delivered, and tasks were also scheduled as they should. I dont know how to measure the impact, i also dont want to sacrifice sql_alchemy connection pool, so I leave statsd turned off.
(I guess other people not seeing the error because they are missing the 3rd one above)

AirflowException: Celery command failed - The recorded hostname does not match this instance's hostname

I'm running Airflow on a clustered environment running on two AWS EC2-Instances. One for master and one for the worker. The worker node though periodically throws this error when running "$airflow worker":
[2018-08-09 16:15:43,553] {jobs.py:2574} WARNING - The recorded hostname ip-1.2.3.4 does not match this instance's hostname ip-1.2.3.4.eco.tanonprod.comanyname.io
Traceback (most recent call last):
File "/usr/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 387, in run
run_job.run()
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 198, in run
self._execute()
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2527, in _execute
self.heartbeat()
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 182, in heartbeat
self.heartbeat_callback(session=session)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2575, in heartbeat_callback
raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match
[2018-08-09 16:15:43,671] {celery_executor.py:54} ERROR - Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.
[2018-08-09 16:15:43,681: ERROR/ForkPoolWorker-30] Task airflow.executors.celery_executor.execute_command[875a4da9-582e-4c10-92aa-5407f3b46d5f] raised unexpected: AirflowException('Celery command failed',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 52, in execute_command
subprocess.check_call(command, shell=True)
File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 55, in execute_command
raise AirflowException('Celery command failed')
airflow.exceptions.AirflowException: Celery command failed
When this error occurs the task is marked as failed on Airflow and thus fails my DAG when nothing actually went wrong in the task.
I'm using Redis as my queue and postgreSQL as my meta-database. Both are external as AWS services. I'm running all of this on my company environment which is why the full name of the server is ip-1.2.3.4.eco.tanonprod.comanyname.io. It looks like it wants this full name somewhere but I have no idea where I need to fix this value so that it's getting ip-1.2.3.4.eco.tanonprod.comanyname.io instead of just ip-1.2.3.4.
The really weird thing about this issue is that it doesn't always happen. It seems to just randomly happen every once in a while when I run the DAG. It's also occurring on all of my DAGs sporadically so it's not just one DAG. I find it strange though how it's sporadic because that means other task runs are handling the IP address for whatever this is just fine.
Note: I've changed the real IP address to 1.2.3.4 for privacy reasons.
Answer:
https://github.com/apache/incubator-airflow/pull/2484
This is exactly the problem I am having and other Airflow users on AWS EC2-Instances are experiencing it as well.

The hostname is set when the task instance runs, and is set to self.hostname = socket.getfqdn(), where socket is the python package import socket.
The comparison that triggers this error is:
fqdn = socket.getfqdn()
if fqdn != ti.hostname:
logging.warning("The recorded hostname {ti.hostname} "
"does not match this instance's hostname "
"{fqdn}".format(**locals()))
raise AirflowException("Hostname of job runner does not match")
It seems like the hostname on the ec2 instance is changing on you while the worker is running. Perhaps try manually setting the hostname as described here https://forums.aws.amazon.com/thread.jspa?threadID=246906 and see if that sticks.

I had a similar problem on my Mac. It fixed it setting hostname_callable = socket:gethostname in airflow.cfg.

Personally when running on my Mac, I found that I got similar errors to this when the Mac would sleep while I was running a long job. The solution was to go into System Preferences -> Energy Saver and then check "Prevent computer from sleeping automatically when the display is off."

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

airflow kubernetes pod operator retry no working - airflow

Related

failure to install airflow helm chart on GKE due to failing migration

Airflow 2 Error sending Celery task: Timeout

Airflow Exception - Task received SIGTERM signal

apache airflow 1.10.9 statsd enabled making scheduler crashed

AirflowException: Celery command failed - The recorded hostname does not match this instance's hostname

Categories

Resources