Airflow task instance stuck in retry mode

Airflow task instance stuck in retry mode - airflow

We are running airflow version 1.9 in celery executor mode. Our task instances are stuck in retry mode. When the job fails, the task instance retries. After that it tries to run the task and then fall back to new retry time.
First state:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-08-28T03:46:53.101483 and task will be retried at 2018-08-28T03:47:25.463271.
After some time:
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow administrator for assistance
After some time: it again went into retry mode.
Task is not ready for retry yet but will be retried automatically. Current date is 2018-08-28T03:51:48.322424 and task will be retried at 2018-08-28T03:52:57.893430.
This is happening for all dags. We created a test dag and tried to get logs for both scheduler and worker logs.
from datetime import *
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
default_args = {
'owner': 'Pramiti',
'depends_on_past': False,
'retries': 3,
'retry_delay': timedelta(minutes=1)
}
dag = DAG('airflow-examples.test_failed_dag_v2', description='Failed DAG',
schedule_interval='*/10 * * * *',
start_date=datetime(2018, 9, 7), default_args=default_args)
b = BashOperator(
task_id="ls_command",
bash_command="mdr",
dag=dag
)
Task Logs
Scheduler Logs
Worker Logs

Related

How to force a Airflow Task to restart at the new scheduling date?

I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
) as dag:
task_a = BashOperator(
task_id="ToRepeat",
bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
retries =1,
)
The task takes a variable amount of time between one run and the other, and I don't have any guarantee that it will be finished within the 5 A.M of the next day.
If the task is still running when a new task is scheduled to start, I need to kill the old one before it starts running.
How can I design Airflow DAG to automatically kill the old task if it's still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG should be killed only when the new DAG is starting. If, for any reason, the new DAG does not start for one week, then old DAG should be able to run for an entire week. That's why using a timeout is sub-optimal

You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
dagrun_timeout=timedelta(hours=24)
) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))

If you really are looking for a dynamic solution; you can take help of Airflow DAGRun APIs and Xcoms; you can push your current dag run_id to Xcom and for subsequent runs you can pull this Xcom to consume with airflow API to check and kill the dag run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task should be something like
...
kill_previous_dag_run = BashOperator(
task_id="kill_previous_dag_run",
bash_command="curl -X 'DELETE' \
'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
-H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
dag=dag
)
...

Apache Airflow does not enforce dagrun_timeout

I am using Apache Airflow version 1.10.3 with the sequential executor, and I would like the DAG to fail after a certain amount of time if it has not finished. I tried setting dagrun_timeout in the example code
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2019, 6, 1),
'retries': 0,
}
dag = DAG('min_timeout', default_args=default_args, schedule_interval=timedelta(minutes=5), dagrun_timeout = timedelta(seconds=30), max_active_runs=1)
t1 = BashOperator(
task_id='fast_task',
bash_command='date',
dag=dag)
t2 = BashOperator(
task_id='slow_task',
bash_command='sleep 45',
dag=dag)
t2.set_upstream(t1)
slow_task alone takes more than the time limit set by dagrun_timeout, so my understanding is that airflow should stop DAG execution. However, this does not happen, and slow_task is allowed to run for its entire duration. After this occurs, the run is marked as failed, but this does not kill the task or DAG as desired. Using execution_timeout for slow_task does cause the task to be killed at the specified time limit, but I would prefer to use an overall time limit for the DAG rather than specifying execution_timeout for each task.
Is there anything else I should try to achieve this behavior, or any mistakes I can fix?

The Airflow scheduler runs a loop at least every SCHEDULER_HEARTBEAT_SEC (the default is 5 seconds).
Bear in mind at least here, because the scheduler performs some actions that may delay the next cycle of its loop.
These actions include:
parsing the dags
filling up the DagBag
checking the DagRun and updating their state
scheduling next DagRun
In your example, the delayed task isn't terminated at the dagrun_timeout because the scheduler performs its next cycle after the task completes.

According to Airflow documentation:
dagrun_timeout (datetime.timedelta) – specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
So dagrun_timeout wouldn't work for non-scheduled DagRuns (e.g. manually triggered) and if the number of active DagRuns < max_active_runs parameter.

How fix DAG seems to be missing?

I want to run a simple Dag "test_update_bq", but when I go to localhost I see this: DAG "test_update_bq" seems to be missing.
There are no errors when I run "airflow initdb", also when I run test airflow test test_update_bq update_table_sql 2015-06-01, It was successfully done and the table was updated in BQ. Dag:
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'Anna',
'depends_on_past': True,
'start_date': datetime(2017, 6, 2),
'email': ['airflow#airflow.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 5,
'retry_delay': timedelta(minutes=5),
}
schedule_interval = "00 21 * * *"
# Define DAG: Set ID and assign default args and schedule interval
dag = DAG('test_update_bq', default_args=default_args, schedule_interval=schedule_interval, template_searchpath = ['/home/ubuntu/airflow/dags/sql_bq'])
update_task = BigQueryOperator(
dag = dag,
allow_large_results=True,
task_id = 'update_table_sql',
sql = 'update_bq.sql',
use_legacy_sql = False,
bigquery_conn_id = 'test'
)
update_task
I would be grateful for any help.
/logs/scheduler
[2019-10-10 11:28:53,308] {logging_mixin.py:95} INFO - [2019-10-10 11:28:53,308] {dagbag.py:90} INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:53,333] {scheduler_job.py:1532} INFO - DAG(s) dict_keys(['test_update_bq']) retrieved from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:53,383] {scheduler_job.py:152} INFO - Processing /home/ubuntu/airflow/dags/update_bq.py took 0.082 seconds
[2019-10-10 11:28:56,315] {logging_mixin.py:95} INFO - [2019-10-10 11:28:56,315] {settings.py:213} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=3600, pid=11761
[2019-10-10 11:28:56,318] {scheduler_job.py:146} INFO - Started process (PID=11761) to work on /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:56,324] {scheduler_job.py:1520} INFO - Processing file /home/ubuntu/airflow/dags/update_bq.py for tasks to queue
[2019-10-10 11:28:56,325] {logging_mixin.py:95} INFO - [2019-10-10 11:28:56,325] {dagbag.py:90} INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:56,350] {scheduler_job.py:1532} INFO - DAG(s) dict_keys(['test_update_bq']) retrieved from /home/ubuntu/airflow/dags/update_bq.py
[2019-10-10 11:28:56,399] {scheduler_job.py:152} INFO - Processing /home/ubuntu/airflow/dags/update_bq.py took 0.081 seconds

Restarting the airflow webserver helped.
So I kill gunicorn process on ubuntu and then restart airflow webserver

This error is usually due to an exception happening when Airflow tries to parse a DAG. So the DAG gets registered in metastore(thus visible UI), but it wasn't parsed by Airflow. Can you take a look at Airflow logs, you might see an exception causing this error.

None of the responses helped me solving this issue.
However after spending some time I found out how to see the exact problem.
In my case I ran airflow (v2.4.0) using helm chart (v1.6.0) inside kubernetes. It created multiple docker containers. I got into the running container using ssh and executed two commands using airflow's cli and this helped me a lot to debug and understand the problem
airflow dags report
airflow dags reserialize
In my case the problem was that database schema didn't match the airflow version.

Airflow scheduler not scheduling simple DAG task immediately

I have scheduled a DAG with a simple bash task to run every 5th minute:
# bash_dag.py
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'start_date' : datetime(2019, 5, 30)
}
dag = DAG(
'bash_count',
default_args=default_args,
schedule_interval='*/5 * * * *',
catchup = False
)
t1 = BashOperator(
task_id='print_date',
bash_command='date',
dag=dag
)
Scheduling works fine, DAG is executing every 5th minute threshold. However, I have noticed that there is a significant delay between the 5th minute threshold and task queueing time. For the examples shown in the image, task queueing takes in between 3 to 50 seconds. For example, last DAG execution in the image was supposed to be triggered after 20:05:00 but task instance was queued 28 seconds later (20:05:28).
I'm surprised this is the case, since the DAG being scheduled has a single very simple task. Is this a normal airflow delay? Should I expect further delays when dealing with more complex DAGs?
I'm running a local airflow server with Postgres as db on a 16 GB Mac with OS Mojave. Machine is not resource constrained.

SSHOperator showing "No Status" after manual trigger

I am new to airflow and I have written a simple SSHOperator to learn how it works.
default_args = {
'start_date': datetime(2018,6,20)
}
dag = DAG(dag_id='ssh_test', schedule_interval = '#hourly',default_args=default_args)
sshHook = SSHHook(ssh_conn_id='testing')
t1 = SSHOperator(
task_id='task1',
command='echo Hello World',
ssh_hook=sshHook,
dag=dag)
When I manually trigger it on the UI, the dag shows a status of running but the operator stays white, no status.
I'm wondering why my task isn't queuing. Does anyone have any ideas? My airflow.config is the default if that is useful information.
Even this isn't running
dag=DAG(dag_id='test',start_date = datetime(2018,6,21), schedule_interval='0 0 * * *')
runMe = DummyOperator(task_id = 'testest', dag = dag)

Make sure you've started the Airflow Scheduler in addition to the Airflow Web Server:
airflow scheduler

check if airflow scheduler is running
check if airflow webserver is running
check if all DAGs are set to On in the web UI
check if the DAGs have a start date which is in the past
check if the DAGs have a proper schedule (before the schedule date) which is shown in the web UI
check if the dag has the proper pool and queue.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Airflow task instance stuck in retry mode - airflow

Related

How to force a Airflow Task to restart at the new scheduling date?

Apache Airflow does not enforce dagrun_timeout

How fix DAG seems to be missing?

Airflow scheduler not scheduling simple DAG task immediately

SSHOperator showing "No Status" after manual trigger

Categories

Resources