Execution of next task in case of failure - Airflow

1.4 with composer 2.0.
I have a DAG that runs multiple tasks. The problem is that when one task fails, the next one runs anyway.
According to the Airflow documentation this should not happen; the DAG run should stop after a task fails.
The tasks are dependent, so if one fails, the next will fail too.
I want the DAG run to terminate as soon as a task fails.
default_args = {
    'owner': owner,
    'start_date': datetime.datetime(2021, 12, 28, 15, 0, 0),  # 2021-12-28 15:00:00 UTC
    'email': email,
    'email_on_failure': True,
    'retries': 0,  # No retries: the task fails on its first failed attempt.
    # 'on_failure_callback': incident_pg,  # runs a function if the task fails
}
with DAG(dag_id=inst_dag_id,
         default_args=default_args,
         catchup=True,
         max_active_runs=5,
         # schedule_interval=None) as dag:  # manual execution
         schedule_interval="0 15 * * *") as dag:

It looks like the issue is in the task definitions. It would help to include the task code in your question. In what you posted, no trigger_rule parameter is set, and according to the definition of BaseOperator in apache-airflow the default trigger_rule is all_success, which means all upstream tasks must succeed before a downstream task can execute.
Check whether the task delete_bq_table has its trigger rule set to all_done. If so, remove it or change it to all_success.
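The difference between the two trigger rules can be illustrated with a small stand-alone model. This is a simplified sketch, not Airflow's implementation: the function name downstream_may_run and the plain-string task states are assumptions made for illustration, and only the all_success and all_done rules mentioned above are covered.

```python
# Simplified model of two Airflow trigger rules (not Airflow's real code).
FINISHED_STATES = {"success", "failed", "upstream_failed", "skipped"}


def downstream_may_run(trigger_rule, upstream_states):
    """Return True if a downstream task would be allowed to start."""
    if trigger_rule == "all_success":
        # Default rule: every upstream task must have succeeded.
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_done":
        # Runs once every upstream task has finished, whatever the outcome.
        return all(state in FINISHED_STATES for state in upstream_states)
    raise ValueError(f"rule not covered by this sketch: {trigger_rule}")


upstream = ["success", "failed"]
print(downstream_may_run("all_success", upstream))  # False
print(downstream_may_run("all_done", upstream))     # True
```

With one failed upstream task, only the all_done rule lets the next task start, which would match the behavior described in the question.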

Related

Airflow skips one scheduled run

I have various DAGs scheduled, but one DAG in particular is not being triggered for a certain run.
I am aware that Airflow runs a job at the end of the period, but surely I'm missing something.
I have a schedule defined as:
10 2,5,8,11,14,17,20,23 * * *, meaning my job should run every day at 02:10, 05:10, 08:10, 11:10, 14:10, 17:10, 20:10 and 23:10 UTC.
For some reason, 23:10 UTC is always skipped, and I don't understand why.
Airflow runs my 20:10 run, skips 23:10, and then continues with 02:10.
So my question is why this run is always skipped.
My default DAG arguments are as follows:
default_args = {
    "owner": "whir",
    "depends_on_past": False,
    "start_date": days_ago(0, hour=0, minute=0, second=0, microsecond=0),
    "email": [""],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 4,
    "retry_delay": timedelta(minutes=30),
}
with DAG(
    'transfer-data',
    default_args=default_args,
    description="Transfer data",
    schedule_interval='10 2,5,8,11,14,17,20,23 * * *',
    catchup=True
) as dag:
    ...
OK, my guess at what's wrong here is that your start_date parameter should be in the DAG definition, not in default_args. Move it out of your default args and add it to your DAG definition like this:
with DAG(
    'transfer-data',
    default_args=default_args,
    description="Transfer data",
    start_date=(your start date),
    schedule_interval='10 2,5,8,11,14,17,20,23 * * *',
    catchup=True
) as dag:
Airflow is very particular about DAG definitions, since they can sometimes cause unexpected behavior in the metadata database on the backend. start_date is a parameter set at the DAG level: you're stating when the DAG should begin. You're not passing it to each individual task, which is what default_args is for.
It's hard to tell just from what you've given us, but my guess is that the start date gets reset around midnight, because days_ago(0) is re-evaluated every time the DAG file is parsed, and that's why every run other than the 23:10 one works.
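The guess about the moving start date can be checked with a small stand-alone model. This is a sketch under assumptions, not the scheduler's code: it assumes the run for a given schedule time is created only once the following schedule time has passed, and that days_ago(0) yields midnight of the day on which it is evaluated. The helper names (midnight_of, run_is_created) and the dates are hypothetical.

```python
from datetime import datetime


def midnight_of(moment):
    """Stand-in for days_ago(0): midnight of the day it is evaluated on."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def run_is_created(execution_date, evaluated_at, start_date):
    """A run is created only if its execution date is not before start_date.

    start_date may be a fixed datetime, or a callable taking the evaluation
    time (modelling a start date recomputed on every DAG file parse).
    """
    if callable(start_date):
        start_date = start_date(evaluated_at)
    return execution_date >= start_date


# The 20:10 run is evaluated at 23:10 on the same day: same-day midnight.
same_day = run_is_created(datetime(2022, 1, 1, 20, 10),
                          datetime(2022, 1, 1, 23, 10), midnight_of)
# The 23:10 run is evaluated at 02:10 the next day: start_date has moved on.
next_day = run_is_created(datetime(2022, 1, 1, 23, 10),
                          datetime(2022, 1, 2, 2, 10), midnight_of)
# With a fixed start date the 23:10 run is not excluded.
fixed = run_is_created(datetime(2022, 1, 1, 23, 10),
                       datetime(2022, 1, 2, 2, 10), datetime(2021, 12, 1))
print(same_day, next_day, fixed)  # True False True
```

If this model matches what is happening, a fixed start_date (for example a literal datetime in the past) would keep the 23:10 run from being excluded.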

Skip run if DAG is already running

I have a DAG that must run only one instance at a time. To achieve this I am using max_active_runs=1, which works fine:
dag_args = {
    'owner': 'Owner',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1, 12, 0),  # no leading zeros: 01 is a syntax error in Python 3
    'email_on_failure': False
}
sched = timedelta(hours=1)
dag = DAG(job_id, default_args=dag_args, schedule_interval=sched, max_active_runs=1)
The problem is:
When the DAG is due to be triggered and an instance is already running, Airflow waits for that run to finish and then triggers the DAG again.
My question is:
Is there any way to skip that run, so the DAG does not run again after the current execution finishes?
Thanks!
This is just from checking the docs, but it looks like you only need to add another parameter:
catchup=False
catchup (bool) – Perform scheduler catchup (or only run latest)?
Defaults to True

Airflow schedule getting skipped if previous task execution takes more time

I have two tasks in my Airflow DAG. One triggers an API call (HTTP operator) and the other keeps checking its status using another API (HTTP sensor). This DAG is scheduled to run every hour at 10 minutes past the hour. But sometimes one execution can take a long time to finish, for example 20 hours. In such cases, none of the runs scheduled while the previous one is still running are executed.
For example, say the job at 01:10 takes 10 hours to finish. The runs at 02:10, 03:10, 04:10, ... 11:10 that are supposed to run get skipped, and only the one at 12:10 is executed.
I am using the local executor. I am running the Airflow webserver and scheduler using the scripts below.
start_server.sh
export AIRFLOW_HOME=./airflow_home;
export AIRFLOW_GPL_UNIDECODE=yes;
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
airflow initdb;
airflow webserver -p 7200;
start_scheduler.sh
export AIRFLOW_HOME=./airflow_home;
# Connection string for connecting to REST interface server
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
#export AIRFLOW__SMTP__SMTP_PASSWORD=**********;
airflow scheduler;
my_dag_file.py
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': admin_email_ids,
    'email_on_failure': False,
    'email_on_retry': False
}
DAG_ID = 'reconciliation_job_pipeline'
MANAGEMENT_RES_API_CONNECTION_CONFIG = 'management_api'
DA_REST_API_CONNECTION_CONFIG = 'rest_api'
recon_schedule = Variable.get('recon_cron_expression', "10 * * * *")
dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=False)
dag.doc_md = __doc__
spark_job_end_point = conf['sip_da']['spark_job_end_point']
fetch_index_record_count_config_key = conf['reconciliation'][
    'fetch_index_record_count']
fetch_index_record_count = SparkJobOperator(
    job_id_key='fetch_index_record_count_job',
    config_key=fetch_index_record_count_config_key,
    exec_id_req=False,
    dag=dag,
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_job',
    data={},
    method='POST',
    endpoint=spark_job_end_point,
    headers={
        "Content-Type": "application/json"}
)
job_endpoint = conf['sip_da']['job_resource_endpoint']
fetch_index_record_count_status_job = JobStatusSensor(
    job_id_key='fetch_index_record_count_job',
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_status_job',
    endpoint=job_endpoint,
    method='GET',
    request_params={'required': 'status'},
    headers={"Content-Type": "application/json"},
    dag=dag,
    poke_interval=15
)
fetch_index_record_count >> fetch_index_record_count_status_job
SparkJobOperator and JobStatusSensor are my custom classes extending SimpleHttpOperator and HttpSensor.
If I set depends_on_past to True, will it work as expected? Another problem with this option is that sometimes the status check job will fail, but the next scheduled run should still be triggered. How can I achieve this behavior?
I think the main point here is that you set catchup=False. Because of that, the Airflow scheduler skips those runs, and you see the behavior you described.
It sounds like you need catchup when the previous run takes longer than expected. You can try changing it to catchup=True.
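The effect of catchup on the example from the question (a 01:10 run that takes 10 hours) can be sketched with a stand-alone model. This is a simplification, not the scheduler's code: it works with plain schedule times rather than Airflow's execution dates, and it assumes that with catchup=False only the most recent missed schedule time gets a run. The function runs_to_create and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def runs_to_create(last_run, now, interval, catchup):
    """Return the schedule times missed since last_run, up to and including now."""
    missed = []
    tick = last_run + interval
    while tick <= now:
        missed.append(tick)
        tick += interval
    # catchup=False: only the latest missed schedule time gets a run.
    return missed if catchup else missed[-1:]


last_run = datetime(2022, 1, 1, 1, 10)  # the run that took 10 hours
now = datetime(2022, 1, 1, 11, 15)      # shortly after it finished
hourly = timedelta(hours=1)

print(len(runs_to_create(last_run, now, hourly, catchup=True)))   # 10
print(len(runs_to_create(last_run, now, hourly, catchup=False)))  # 1
```

In this model catchup=True queues every missed hourly run, which with max_active_runs=1 would then execute one after another, while catchup=False keeps only the latest one.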

Airflow worker stuck : Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run

Airflow tasks run without any issues, and then suddenly, partway through, a task gets stuck and the task instance details show the message above.
I cleared my entire database, but I still get the same error.
I only get this issue for some DAGs, mostly those with long-running jobs.
I am getting the error below:
[2019-07-03 12:14:56,337] {{models.py:1353}} INFO - Dependencies not met for <TaskInstance: XXXXXX.index_to_es 2019-07-01T13:30:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2019-07-03 12:14:56,341] {{models.py:1353}} INFO - Dependencies not met for <TaskInstance: XXXXXX.index_to_es 2019-07-01T13:30:00+00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2019-07-03 05:58:51.601552+00:00.
[2019-07-03 12:14:56,342] {{logging_mixin.py:95}} INFO - [2019-07-03 12:14:56,342] {{jobs.py:2514}} INFO - Task is not able to be run
My DAG looks like this:
default_args = {
    'owner': 'datascience',
    'depends_on_past': True,
    'start_date': datetime(2019, 6, 12),
    'email': ['datascience@mycompany.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'nill',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
def get_index_date(**kwargs):
    tomorrow = kwargs.get('templates_dict').get('tomorrow')
    return str(tomorrow).replace('-', '.')
"""
Create Dags specify its features
"""
dag = DAG(
    DAG_NAME,
    schedule_interval="0 9 * * *",
    catchup=True,
    default_args=default_args,
    template_searchpath='/efs/sql')
create_table = BigQueryOperator(
    dag=dag,
    task_id='create_temp_table_from_query',
    sql='daily_demand.sql',
    use_legacy_sql=False,
    destination_dataset_table=TEMP_TABLE,
    bigquery_conn_id=CONNECTION_ID,
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE'
)
"""Task to zip and export to GCS"""
export_to_storage = BigQueryToCloudStorageOperator(
    task_id='export_to_GCS',
    source_project_dataset_table=TEMP_TABLE,
    destination_cloud_storage_uris=[CLOUD_STORAGE_URI],
    export_format='NEWLINE_DELIMITED_JSON',
    compression='GZIP',
    bigquery_conn_id=CONNECTION_ID,
    dag=dag)
"""Task to get the tomorrow execution date formatted for indexing"""
get_index_date = PythonOperator(
    task_id='get_index_date',
    python_callable=get_index_date,
    templates_dict={'tomorrow': "{{ tomorrow_ds }}"},
    provide_context=True,
    dag=dag
)
"""Task to download zipped files and bulkindex to elasticsearch"""
es_indexing = EsDownloadAndIndexOperator(
    task_id="index_to_es",
    object=OBJECT,
    es_url=ES_URI,
    local_path=LOCAL_FILE,
    gcs_conn_id=CONNECTION_ID,
    bucket=GCS_BUCKET_ID,
    es_index_type='demand_shopper',
    es_bulk_batch=5000,
    es_index_name=INDEX,
    es_request_timeout=300,
    dag=dag)
"""Define the chronology of tasks in DAG"""
create_table >> export_to_storage >> get_index_date >> es_indexing
Thanks for your help
I figured out the issue: it was an underlying infrastructure problem. I was using AWS EFS, and burst mode was blocking the worker once the throughput limit was reached. After I changed to provisioned mode, the workers no longer get stuck.
I got the idea from
ecs-airflow-1-10-2-performance-issues-operators-and-tasks-take-10x-longer
I noticed the same issue. This message was logged:
Dependencies not met for <TaskInstance:xxxxx]>, dependency 'Task Instance State' FAILED:
Task is in the 'running' state which is not a valid state for execution.
The task must be cleared in order to be run.
Similarly, I ran an hourly DAG for a "delete_dags_and_then_refresh" job (instead of a file share). Long-running jobs would fail until I disabled this DAG-update job. Problem solved. I will try a different way to refresh DAGs.

Tasks retrying more than specified retry in Airflow

I recently upgraded Airflow to 1.10.2. Some tasks in the DAG run fine, while others are retried more than the specified number of retries.
One of the task logs shows "Starting attempt 26 of 2". Why does the scheduler keep scheduling it even after two failures?
Is anyone facing a similar issue?
Example DAG:
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 3, 10, 0, 0, 0),  # no leading zeros: 03 is a syntax error in Python 3
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
    'email': ['my@myorg.com'],
    'email_on_failure': True,
    'email_on_retry': True
}
dag = DAG(dag_id='dag1',
          default_args=args,
          schedule_interval='0 12 * * *',
          max_active_runs=1)
data_processor1 = BashOperator(
    task_id='data_processor1',
    bash_command="sh processor1.sh {{ ds }} ",
    dag=dag)
data_processor2 = BashOperator(
    task_id='data_processor2',
    bash_command="ssh processor2.sh {{ ds }} ",
    dag=dag)
data_processor1.set_downstream(data_processor2)
This may be useful.
I tried to reproduce the error you are facing in Airflow, but I couldn't.
In my Airflow GUI, it shows only a single retry and then marks the task and the DAG as failed, which is normal Airflow behavior. I don't know why or how you're hitting this issue.
(Screenshot of my Airflow GUI for your DAG.)
Can you please add more details about the problem (logs and so on)?
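For reference, the attempt count one would expect from retries=1 can be sketched as below. This is a stand-alone model, not Airflow's code: run_with_retries is a hypothetical helper, and the log text only imitates the message quoted in the question. With retries=1, a task that always fails should stop after attempt 2 of 2, which is why a line such as "attempt 26 of 2" is unexpected.

```python
def run_with_retries(task, retries):
    """Call task until it succeeds or retries + 1 attempts have been made."""
    max_attempts = retries + 1  # the first try plus the allowed retries
    log = []
    for attempt in range(1, max_attempts + 1):
        log.append(f"Starting attempt {attempt} of {max_attempts}")
        try:
            task()
            return True, log
        except Exception:
            continue  # failed attempt; retry if any attempts remain
    return False, log


def always_fails():
    raise RuntimeError("simulated task failure")


succeeded, log = run_with_retries(always_fails, retries=1)
print(succeeded)  # False
print(log[-1])    # Starting attempt 2 of 2
```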
