Tasks retrying more than specified retry in Airflow - airflow

I recently upgraded Airflow to 1.10.2. Some tasks in the DAG run fine, while others retry more than the specified number of retries.
One of the task logs shows "Starting attempt 26 of 2". Why does the scheduler keep scheduling the task even after two failures?
Is anyone else facing a similar issue?
Example DAG:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 3, 10, 0, 0, 0),
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
    'email': ['my@myorg.com'],
    'email_on_failure': True,
    'email_on_retry': True
}
dag = DAG(dag_id='dag1',
          default_args=args,
          schedule_interval='0 12 * * *',
          max_active_runs=1)
data_processor1 = BashOperator(
    task_id='data_processor1',
    bash_command="sh processor1.sh {{ ds }} ",
    dag=dag)
data_processor2 = BashOperator(
    task_id='data_processor2',
    bash_command="ssh processor2.sh {{ ds }} ",
    dag=dag)
# data_processor2 runs after data_processor1
data_processor1.set_downstream(data_processor2)

This may be useful.
I tried to reproduce the error you are facing, but I couldn't.
In my Airflow GUI, the task is retried only once and then the task and the DAG are marked as failed, which is the normal Airflow behavior. I don't know why or how you are hitting this issue.
(Screenshot of my Airflow GUI for your DAG.)
Can you please add more details about the problem, such as the logs?
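For reference, with 'retries': 1 a task is expected to make at most two attempts (the first try plus one retry), which is why the log line ends in "of 2". A minimal plain-Python sketch of that bookkeeping follows; the helper names are hypothetical and are not part of Airflow's API.

```python
def total_attempts(retries):
    """Maximum number of attempts: the first try plus `retries` retries."""
    return retries + 1


def may_run_again(attempts_so_far, retries):
    """True while the task still has an attempt left (simplified model)."""
    return attempts_so_far < total_attempts(retries)


print(total_attempts(1))
print(may_run_again(1, 1))
print(may_run_again(2, 1))
```

Under this model an attempt number of 26 should never be reached with retries set to 1, so something other than the retry setting is re-queuing the task.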

Related

Execution of next task in case of failure- airflow

1.4 with composer 2.0.
I have a DAG that runs multiple tasks. The problem is that when one task fails, the next one runs anyway.
According to the Airflow documentation this should not happen; the DAG run should stop after a task fails.
The tasks are dependent, so if one fails, the next will fail too.
I want the DAG run to terminate as soon as a task fails.
default_args = {
    'owner': owner,
    'start_date': datetime.datetime(2021, 12, 28, 15, 0, 0),  # 2021-12-28 15:00:00
    'email': email,
    'email_on_failure': True,
    'retries': 0,  # Do not retry; the task fails on the first failure.
    # 'on_failure_callback': incident_pg,  # runs a function if the task fails
}
with DAG(dag_id=inst_dag_id,
         default_args=default_args,
         catchup=True,
         max_active_runs=5,
         # schedule_interval=None) as dag:  # manual execution
         schedule_interval="0 15 * * *") as dag:
It looks like the issue is in the task definition, so it would help to include the task code in your question. From what you have posted, no trigger_rule parameter is defined. In apache-airflow's BaseOperator, trigger_rule defaults to all_success, which means all upstream tasks must succeed before a downstream task can run.
Check whether the task delete_bq_table has its trigger rule set to all_done. If so, remove it or change it to all_success.
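The difference between the two rules can be illustrated with a small plain-Python model. This is only a sketch of the idea, not Airflow's implementation: the function name should_run and the set of "finished" states are assumptions made for the example.

```python
def should_run(trigger_rule, upstream_states):
    """Decide whether a downstream task may run (simplified model)."""
    finished = {"success", "failed", "upstream_failed", "skipped"}
    if trigger_rule == "all_success":
        # Every upstream task must have succeeded.
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_done":
        # Every upstream task must have finished, whatever the outcome.
        return all(state in finished for state in upstream_states)
    raise ValueError("rule not covered by this sketch: " + trigger_rule)


upstream = ["success", "failed"]
print(should_run("all_success", upstream))
print(should_run("all_done", upstream))
```

With one failed upstream task, the all_success rule blocks the downstream task while all_done lets it run, which matches the behavior described in the question.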

SLA miss is not saved in the database and does not trigger any email in Airflow

I have successfully set up an SMTP server, and it works fine when a job fails.
But I tried to set up an SLA miss following the link below.
https://blog.clairvoyantsoft.com/airflow-service-level-agreement-sla-2f3c91cd84cc
mid = BashOperator(
    task_id='mid',
    sla=timedelta(seconds=5),
    bash_command='sleep 10',
    retries=0,
    dag=dag,
)
No SLA miss event is saved. I also checked under:
Browse -> SLA Misses
I have tried several other things but cannot find the issue.
The DAG is defined as:
args = {
    'owner': 'airflow',
    'start_date': datetime(2020, 11, 18),
    'catchup': False,
    'retries': 0,
    'provide_context': True,
    'email': "XXXXXXXX@gmail.com",
    'start_date': airflow.utils.dates.days_ago(n=0, minute=1),
    'priority_weight': 1,
    'email_on_failure': True,
    'default_args': {
        'on_failure_callback': on_failure_callback,
    }
}
d = datetime(2020, 10, 30)
dag = DAG('MyApplication', start_date=d, on_failure_callback=on_failure_callback, schedule_interval='@daily', default_args=args)
The issue seems to be in the arguments, more specifically 'start_date': airflow.utils.dates.days_ago(n=0, minute=1). This means that start_date is re-evaluated every time the scheduler parses the DAG file. You should specify a "static" start date like datetime(2020, 11, 18).
See also the Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
Also, specifying default_args inside of args seems odd to me.
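To see why a moving start_date is a problem, here is a minimal plain-Python sketch. It assumes a simplified rule (the first run fires once start_date plus one interval has passed) and uses a hypothetical helper, today_at_0001, as a stand-in for a start_date that is recomputed on every parse; neither is Airflow's actual code.

```python
from datetime import datetime, timedelta


def first_run_is_due(start_date, interval, now):
    """Simplified rule: the first run fires once its interval has closed."""
    return start_date + interval <= now


def today_at_0001(now):
    """Hypothetical stand-in for a start_date recomputed on every parse."""
    return now.replace(hour=0, minute=1, second=0, microsecond=0)


interval = timedelta(days=1)
static_start = datetime(2020, 11, 18)

# Pretend the scheduler parses the DAG file at these moments.
parse_times = [
    datetime(2020, 11, 20, 9),
    datetime(2020, 11, 21, 9),
    datetime(2020, 11, 22, 9),
]

static_due = [first_run_is_due(static_start, interval, now) for now in parse_times]
dynamic_due = [first_run_is_due(today_at_0001(now), interval, now) for now in parse_times]

print(static_due)
print(dynamic_due)
```

With the static date the first interval has closed at every parse, while with the recomputed date the end of the interval always stays in the future, so in this model no run (and therefore no SLA check) ever becomes due.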

Skip run if DAG is already running

I have a DAG that I need to run only one instance at the same time. To solve this I am using max_active_runs=1 which works fine:
dag_args = {
    'owner': 'Owner',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1, 12, 0),
    'email_on_failure': False
}
sched = timedelta(hours=1)
dag = DAG(job_id, default_args=dag_args, schedule_interval=sched, max_active_runs=1)
The problem is:
When the DAG is due to be triggered and an instance is already running, Airflow waits for that run to finish and then triggers the DAG again.
My question is:
Is there any way to skip the pending run in this case, so that the DAG does not run again right after the current execution finishes?
Thanks!
This is just from checking the docs, but it looks like you only need to add another parameter:
catchup=False
catchup (bool) – Perform scheduler catchup (or only run latest)?
Defaults to True
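The effect of the parameter can be sketched in plain Python. This is a simplified model under stated assumptions, not Airflow's scheduler: the function runs_to_create is hypothetical, and it ignores max_active_runs and runs that already exist.

```python
from datetime import datetime, timedelta


def runs_to_create(start_date, interval, now, catchup):
    """Return the interval start times for which a run would be created."""
    closed = []
    current = start_date
    # An interval becomes schedulable only once it has fully elapsed.
    while current + interval <= now:
        closed.append(current)
        current += interval
    if catchup:
        return closed       # backfill every missed interval
    return closed[-1:]      # only the most recent closed interval


start = datetime(2018, 1, 1, 12, 0)
now = datetime(2018, 1, 1, 15, 30)
hourly = timedelta(hours=1)

print(len(runs_to_create(start, hourly, now, catchup=True)))
print(runs_to_create(start, hourly, now, catchup=False))
```

In this model, three hourly intervals have closed by 15:30, so catchup=True yields three runs and catchup=False yields only the one starting at 14:00.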

Airflow worker stuck : Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run

Airflow tasks run without any issues and then suddenly, halfway through, get stuck, and the task instance details show the message above.
I cleared my entire database, but I still get the same error.
I only get this issue for some DAGs, mostly the ones with long-running jobs.
I am getting the error below:
[2019-07-03 12:14:56,337] {{models.py:1353}} INFO - Dependencies not met for <TaskInstance: XXXXXX.index_to_es 2019-07-01T13:30:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2019-07-03 12:14:56,341] {{models.py:1353}} INFO - Dependencies not met for <TaskInstance: XXXXXX.index_to_es 2019-07-01T13:30:00+00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2019-07-03 05:58:51.601552+00:00.
[2019-07-03 12:14:56,342] {{logging_mixin.py:95}} INFO - [2019-07-03 12:14:56,342] {{jobs.py:2514}} INFO - Task is not able to be run
My DAG looks like this:
default_args = {
    'owner': 'datascience',
    'depends_on_past': True,
    'start_date': datetime(2019, 6, 12),
    'email': ['datascience@mycompany.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'nill',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}


def get_index_date(**kwargs):
    tomorrow = kwargs.get('templates_dict').get('tomorrow')
    return str(tomorrow).replace('-', '.')


# Create the DAG and specify its features
dag = DAG(
    DAG_NAME,
    schedule_interval="0 9 * * *",
    catchup=True,
    default_args=default_args,
    template_searchpath='/efs/sql')
create_table = BigQueryOperator(
    dag=dag,
    task_id='create_temp_table_from_query',
    sql='daily_demand.sql',
    use_legacy_sql=False,
    destination_dataset_table=TEMP_TABLE,
    bigquery_conn_id=CONNECTION_ID,
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE'
)
# Task to zip and export to GCS
export_to_storage = BigQueryToCloudStorageOperator(
    task_id='export_to_GCS',
    source_project_dataset_table=TEMP_TABLE,
    destination_cloud_storage_uris=[CLOUD_STORAGE_URI],
    export_format='NEWLINE_DELIMITED_JSON',
    compression='GZIP',
    bigquery_conn_id=CONNECTION_ID,
    dag=dag)
# Task to get tomorrow's execution date formatted for indexing
get_index_date = PythonOperator(
    task_id='get_index_date',
    python_callable=get_index_date,
    templates_dict={'tomorrow': "{{ tomorrow_ds }}"},
    provide_context=True,
    dag=dag
)
# Task to download zipped files and bulk-index them into Elasticsearch
es_indexing = EsDownloadAndIndexOperator(
    task_id="index_to_es",
    object=OBJECT,
    es_url=ES_URI,
    local_path=LOCAL_FILE,
    gcs_conn_id=CONNECTION_ID,
    bucket=GCS_BUCKET_ID,
    es_index_type='demand_shopper',
    es_bulk_batch=5000,
    es_index_name=INDEX,
    es_request_timeout=300,
    dag=dag)
# Define the order of tasks in the DAG
create_table >> export_to_storage >> get_index_date >> es_indexing
Thanks for your help
I figured out the issue: it was an underlying infrastructure problem. I was using AWS EFS, and burst mode was blocking the worker once the throughput limit was reached. After switching to provisioned mode, the workers are no longer stuck.
I got the idea from
ecs-airflow-1-10-2-performance-issues-operators-and-tasks-take-10x-longer
I noticed the same issue. This message was logged:
Dependencies not met for <TaskInstance:xxxxx]>, dependency 'Task Instance State' FAILED:
Task is in the 'running' state which is not a valid state for execution.
The task must be cleared in order to be run.
In my case, I was running an hourly "delete_dags_and_then_refresh" job to update the DAGs (instead of using a file share). Long-running jobs kept failing until I disabled this DAG-update job, which solved the problem. I will try a different way to refresh the DAGs.

Airflow DAG not getting scheduled

I am new to Airflow and have created my first DAG. Here is my DAG code. I want the DAG to start now and then run once a day.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['aaaa@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG(
    'alamode', default_args=default_args, schedule_interval=timedelta(1))
create_command = "/home/ubuntu/scripts/makedir.sh "
# t1 is the task which will invoke the directory creation shell script
t1 = BashOperator(
    task_id='create_directory',
    bash_command=create_command,
    dag=dag)
run_spiders = "/home/ubuntu/scripts/crawl_spiders.sh "
# t2 is the task which will invoke the spiders
t2 = BashOperator(
    task_id='web_scrawl',
    bash_command=run_spiders,
    dag=dag)
# To set dependency between tasks. 't1' should run before t2
t2.set_upstream(t1)
The DAG is not getting picked up by Airflow. I checked the log, and here is what it says:
[2017-09-12 18:08:20,220] {jobs.py:343} DagFileProcessor398 INFO - Started process (PID=7001) to work on /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,223] {jobs.py:1521} DagFileProcessor398 INFO - Processing file /home/ubuntu/airflow/dags/alamode.py for tasks to queue
[2017-09-12 18:08:20,223] {models.py:167} DagFileProcessor398 INFO - Filling up the DagBag from /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,262] {jobs.py:1535} DagFileProcessor398 INFO - DAG(s) ['alamode'] retrieved from /home/ubuntu/airflow/dags/alamode.py
[2017-09-12 18:08:20,291] {jobs.py:1169} DagFileProcessor398 INFO - Processing alamode
/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/default_comparator.py:161: SAWarning: The IN-predicate on "dag_run.dag_id" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate. Consider alternative strategies for improved performance.
'strategies for improved performance.' % expr)
[2017-09-12 18:08:20,317] {models.py:322} DagFileProcessor398 INFO - Finding 'running' jobs without a recent heartbeat
[2017-09-12 18:08:20,318] {models.py:328} DagFileProcessor398 INFO - Failing jobs without heartbeat after 2017-09-12 18:03:20.318105
[2017-09-12 18:08:20,320] {jobs.py:351} DagFileProcessor398 INFO - Processing /home/ubuntu/airflow/dags/alamode.py took 0.100 seconds
What exactly am I doing wrong? I tried changing schedule_interval to timedelta(minutes=1) to see whether the DAG starts immediately, but that did not help. I can see the tasks under the DAG in the Airflow UI as expected, but the schedule status is 'no status'. Please help.
This issue was resolved by following the steps below:
1) Used a much older date for start_date and schedule_interval=timedelta(minutes=10). Also used a real date instead of datetime.now().
2) Added catchup = True in the DAG arguments.
3) Set the environment variable: export AIRFLOW_HOME=`pwd`/airflow_home
4) Deleted airflow.db.
5) Moved the new code to the DAGs folder.
6) Ran the command 'airflow initdb' to create the DB again.
7) Turned on the 'ON' switch of my DAG in the UI.
8) Ran the command 'airflow scheduler'.
Here is the code which works now:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 9, 12),
    'email': ['anjana@gapro.tech'],
    'retries': 0,
    'retry_delay': timedelta(minutes=15)
}
dag = DAG(
    'alamode', catchup=False, default_args=default_args, schedule_interval="@daily")
# t1 is the task which will invoke the directory creation shell script
t1 = BashOperator(
    task_id='create_directory',
    bash_command='/home/ubuntu/scripts/makedir.sh ',
    dag=dag)
# t2 is the task which will invoke the spiders
t2 = BashOperator(
    task_id='web_crawl',
    bash_command='/home/ubuntu/scripts/crawl_spiders.sh ',
    dag=dag)
# To set dependency between tasks. 't1' should run before t2
t2.set_upstream(t1)
