Skip run if DAG is already running - airflow

I have a DAG of which only one instance should run at a time. To solve this I am using max_active_runs=1, which works fine:
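from datetime import datetime, timedelta
from airflow import DAG

dag_args = {
    'owner': 'Owner',
    'depends_on_past': False,
    # Integer literals with leading zeros (01, 00) are a syntax error in Python 3
    'start_date': datetime(2018, 1, 1, 12, 0),
    'email_on_failure': False
}
sched = timedelta(hours=1)
# job_id is defined elsewhere
dag = DAG(job_id, default_args=dag_args, schedule_interval=sched, max_active_runs=1)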
The problem is:
When the DAG is about to be triggered while an instance is still running, Airflow waits for that run to finish and then triggers the DAG again.
My question is:
Is there any way to skip this run, so that the DAG does not run again after the current execution in this case?
Thanks!

This is just from checking the docs, but it looks like you only need to add another parameter:
catchup=False
catchup (bool) – Perform scheduler catchup (or only run latest)?
Defaults to True


SLA miss is not saved in the database and does not trigger any mail in Airflow

I have successfully set up the SMTP server, and it works fine when a job fails.
But I tried to set up an SLA miss as described in the link below.
https://blog.clairvoyantsoft.com/airflow-service-level-agreement-sla-2f3c91cd84cc
mid = BashOperator(
    task_id='mid',
    sla=timedelta(seconds=5),
    bash_command='sleep 10',
    retries=0,
    dag=dag,
)
No SLA miss event is saved. I have also checked under
Browse->SLA misses
I have tried several other things but am unable to find the issue.
The DAG is defined as:
args = {
    'owner': 'airflow',
    'start_date': datetime(2020, 11, 18),
    'catchup': False,
    'retries': 0,
    'provide_context': True,
    'email': "XXXXXXXX@gmail.com",
    'start_date': airflow.utils.dates.days_ago(n=0, minute=1),
    'priority_weight': 1,
    'email_on_failure': True,
    'default_args': {
        'on_failure_callback': on_failure_callback,
    }
}
d = datetime(2020, 10, 30)
dag = DAG('MyApplication', start_date=d, on_failure_callback=on_failure_callback, schedule_interval='@daily', default_args=args)
The issue seems to be in the arguments, more specifically 'start_date': airflow.utils.dates.days_ago(n=0, minute=1). This means that start_date is re-evaluated every time the scheduler parses the DAG file. You should specify a "static" start date like datetime(2020, 11, 18).
See also Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
Also, nesting default_args inside args seems weird to me.
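To illustrate the FAQ's point, here is a minimal sketch in plain Python (no Airflow import). It assumes the simplified rule from the quote above, that a run is triggered once its period has closed, i.e. once start_date + interval <= now. The function first_run_is_due is made up for this illustration, and the "moving" start date is only a stand-in for days_ago(n=0, minute=1), assumed here to mean today at 00:01.

```python
from datetime import datetime, timedelta


def first_run_is_due(start_date, interval, now):
    """Simplified rule: the first run fires once its period has closed."""
    return start_date + interval <= now


interval = timedelta(days=1)
clock_readings = [
    datetime(2020, 11, 18, 12, 0),
    datetime(2020, 11, 19, 0, 1),
    datetime(2020, 11, 19, 12, 0),
]

# Static start_date: as the clock moves on, the first period eventually closes.
static_start = datetime(2020, 11, 18)
static_results = [
    first_run_is_due(static_start, interval, now) for now in clock_readings
]

# Dynamic start_date: recomputed on every parse of the DAG file, so the end
# of the first period keeps moving ahead of the clock.
dynamic_results = []
for now in clock_readings:
    # Hypothetical stand-in for days_ago(n=0, minute=1): today at 00:01.
    moving_start = now.replace(hour=0, minute=1, second=0, microsecond=0)
    dynamic_results.append(first_run_is_due(moving_start, interval, now))

print(static_results)   # [False, True, True]
print(dynamic_results)  # [False, False, False]
```

With the static date the first daily period closes at midnight; with the moving date it never does, which would be consistent with no run (and therefore no SLA miss) ever being recorded.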

Airflow ExternalTaskSensor does not trigger task

I'm trying to add a cross dag dependency using ExternalTaskSensor but haven't been able to get it to work. Dag A has schedule_interval=None as it doesn't have a fixed schedule and is triggered externally by a file creation event. Dag B should execute once Dag A has completed. Here is code for dag_a and dag_b.
DAG A
default_args = {
    'depends_on_past': False,
    'start_date': datetime.today()-timedelta(1),
    'email_on_failure': True,
    'email_on_retry': False,
    'queue': 'default'
}
dag = DAG(
    'dag_a', default_args=default_args, schedule_interval=None)
dag_a = AWSBatchOperator(
    task_id='dag_a',
    job_name='dag_a',
    job_definition='dag_a',
    job_queue='MyAWSJobQueue',
    max_retries=10,
    aws_conn_id='aws_conn',
    region_name='us-east-1',
    dag=dag,
    parameters={},
    overrides={})
DAG B
default_args = {
    'depends_on_past': False,
    'start_date': datetime.today()-timedelta(1),
    'email_on_failure': True,
    'email_on_retry': False,
    'queue': 'default'
}
dag = DAG(
    'dag_b', default_args=default_args, schedule_interval=None)
dag_b = AWSBatchOperator(
    task_id='dag_b',
    job_name='dag_b',
    job_definition='dag_b',
    job_queue='MyAWSJobQueue',
    max_retries=10,
    aws_conn_id='aws_conn',
    region_name='us-east-1',
    dag=dag,
    parameters={},
    overrides={})
wait_for_dag_a = ExternalTaskSensor(
    task_id='wait_for_irr',
    external_dag_id='dag_a',
    external_task_id=None,
    execution_delta=timedelta(hours=1),
    dag=dag,
    timeout=300)
dag_b.set_upstream(wait_for_dag_a)
I set both dags with schedule_interval=None and same start_date. I even added execution_delta = timedelta(hours=1) for dag_b, but dag_b hasn't triggered so far, though dag_a is complete. Any help is appreciated.
I have tried using TriggerDagRunOperator which works, but is not suitable for my use case since dag_b will eventually be dependent on multiple parent dags.
I've met a similar problem before, so there are two things to check. First, I cannot see any time difference between DAG A and DAG B, as both use the same default args, so you should not give the waiting task an execution_delta. Second, the sensor somehow cannot detect that the DAG has finished if there are multiple parent DAGs, so I tried giving external_task_id a value such as 'dag_a-done' instead of the default None, and that works. One more thing to mention: I try to avoid underscores in the task_id, although Airflow itself accepts them.
The link is the source code of external sensor:
https://airflow.apache.org/docs/stable/_modules/airflow/sensors/external_task_sensor.html
There is also an article that describes how ExternalTaskSensor works:
https://medium.com/@fninsiima/sensing-the-completion-of-external-airflow-tasks-827344d03142
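As far as I understand the sensor source linked above, ExternalTaskSensor looks for a run of the external DAG whose execution_date equals the sensor's own execution_date minus execution_delta. The plain-Python sketch below mimics only that matching rule; the helper sensor_would_succeed and all timestamps are hypothetical, and it assumes that an externally triggered run gets its trigger time as execution_date.

```python
from datetime import datetime, timedelta


def sensor_would_succeed(own_execution_date, external_execution_dates,
                         execution_delta=timedelta(0)):
    """Return True if some external run has exactly the execution_date
    the sensor is looking for (own date minus execution_delta)."""
    wanted = own_execution_date - execution_delta
    return wanted in external_execution_dates


# dag_a was triggered externally, so its execution_date is the trigger time.
dag_a_run = datetime(2020, 1, 1, 9, 0, 7)
dag_a_runs = {dag_a_run}

# dag_b was triggered a little later, again with its own trigger time.
dag_b_run = datetime(2020, 1, 1, 9, 5, 42)

# Neither a zero delta nor a one-hour delta lands exactly on dag_a's date.
no_delta = sensor_would_succeed(dag_b_run, dag_a_runs)
one_hour = sensor_would_succeed(dag_b_run, dag_a_runs, timedelta(hours=1))

# Only an exactly matching offset (or identical execution dates) lines up.
exact = sensor_would_succeed(dag_b_run, dag_a_runs, dag_b_run - dag_a_run)

print(no_delta, one_hour, exact)  # False False True
```

If this reading is right, two DAGs with schedule_interval=None will rarely line up by accident, which may be why the sensor in the question keeps waiting until it times out.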

Airflow schedule getting skipped if previous task execution takes more time

I have two tasks in my Airflow DAG. One triggers an API call (HTTP operator) and the other keeps checking its status using another API (HTTP sensor). This DAG is scheduled to run every hour at 10 minutes past the hour. But sometimes one execution can take a long time to finish, for example 20 hours. In such cases none of the runs scheduled while the previous one is still running are executed.
For example, if the job at 01:10 takes 10 hours to finish, the runs at 02:10, 03:10, 04:10, ... 11:10, which are supposed to run, get skipped and only the one at 12:10 is executed.
I am using local executor. I am running airflow server & scheduler using below script.
start_server.sh
export AIRFLOW_HOME=./airflow_home;
export AIRFLOW_GPL_UNIDECODE=yes;
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
airflow initdb;
airflow webserver -p 7200;
start_scheduler.sh
export AIRFLOW_HOME=./airflow_home;
# Connection string for connecting to REST interface server
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
#export AIRFLOW__SMTP__SMTP_PASSWORD=**********;
airflow scheduler;
my_dag_file.py
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': admin_email_ids,
    'email_on_failure': False,
    'email_on_retry': False
}
DAG_ID = 'reconciliation_job_pipeline'
MANAGEMENT_RES_API_CONNECTION_CONFIG = 'management_api'
DA_REST_API_CONNECTION_CONFIG = 'rest_api'
recon_schedule = Variable.get('recon_cron_expression', "10 * * * *")
dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=False)
dag.doc_md = __doc__
spark_job_end_point = conf['sip_da']['spark_job_end_point']
fetch_index_record_count_config_key = conf['reconciliation'][
    'fetch_index_record_count']
fetch_index_record_count = SparkJobOperator(
    job_id_key='fetch_index_record_count_job',
    config_key=fetch_index_record_count_config_key,
    exec_id_req=False,
    dag=dag,
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_job',
    data={},
    method='POST',
    endpoint=spark_job_end_point,
    headers={
        "Content-Type": "application/json"}
)
job_endpoint = conf['sip_da']['job_resource_endpoint']
fetch_index_record_count_status_job = JobStatusSensor(
    job_id_key='fetch_index_record_count_job',
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_status_job',
    endpoint=job_endpoint,
    method='GET',
    request_params={'required': 'status'},
    headers={"Content-Type": "application/json"},
    dag=dag,
    poke_interval=15
)
fetch_index_record_count >> fetch_index_record_count_status_job
SparkJobOperator and JobStatusSensor are my custom classes extending SimpleHttpOperator and HttpSensor.
If I set depends_on_past to True, will it work as expected? Another problem I have with this option is that the status check job will sometimes fail, but the next scheduled run should still be triggered. How can I achieve this behavior?
I think the main point here is that you set catchup=False. With that setting the Airflow scheduler skips those task executions, which is the behavior you describe.
It sounds like you need to perform catchup whenever the previous run took longer than expected. You can try changing it to catchup=True.
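The difference between the two settings can be sketched with a toy scheduler in plain Python. This is a simplified model under stated assumptions, not Airflow's scheduler: it allows one active run at a time (like max_active_runs=1), starts a run at its scheduled time rather than at the end of the period, and assumes every run except the first takes 30 minutes. The function simulate and the sample dates are made up for this illustration.

```python
from datetime import datetime, timedelta


def simulate(slots, durations, catchup):
    """Toy scheduler with at most one active run at a time.

    slots: sorted list of scheduled times.
    durations: dict mapping a slot to its run time (default 30 minutes).
    Returns the slots that actually got a run.
    """
    executed = []
    clock = slots[0]
    pending = list(slots)
    while pending:
        due = [s for s in pending if s <= clock]
        if not due:
            clock = pending[0]  # idle until the next scheduled time
            continue
        # catchup=True works through the backlog oldest-first;
        # catchup=False jumps to the newest due slot and drops older ones.
        chosen = due[0] if catchup else due[-1]
        pending = [s for s in pending if s > chosen]
        executed.append(chosen)
        clock += durations.get(chosen, timedelta(minutes=30))
    return executed


# Hourly slots at 01:10 ... 14:10; the first run takes 10 hours.
slots = [datetime(2021, 1, 1, hour, 10) for hour in range(1, 15)]
durations = {slots[0]: timedelta(hours=10)}

without_catchup = [s.strftime('%H:%M') for s in simulate(slots, durations, False)]
with_catchup = simulate(slots, durations, True)

print(without_catchup)    # ['01:10', '11:10', '12:10', '13:10', '14:10']
print(len(with_catchup))  # 14
```

In this model the slots that fall inside the long run are dropped when catchup is off and are worked through one after another when it is on. The exact slot that resumes first differs from the question because the model ignores Airflow's execution_date semantics.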

Tasks retrying more than specified retry in Airflow

I have recently upgraded my Airflow to 1.10.2. Some tasks in the DAG run fine, while others are retried more often than the specified number of retries.
One of the task logs shows "Starting attempt 26 of 2". Why does the scheduler keep scheduling it even after two failures?
Is anyone facing a similar issue?
Example Dag -
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 3, 10, 0, 0, 0),
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
    'email': ['my@myorg.com'],
    'email_on_failure': True,
    'email_on_retry': True
}
dag = DAG(dag_id='dag1',
          default_args=args,
          schedule_interval='0 12 * * *',
          max_active_runs=1)
data_processor1 = BashOperator(
    task_id='data_processor1',
    bash_command="sh processor1.sh {{ ds }} ",
    dag=dag)
data_processor2 = BashOperator(
    task_id='data_processor2',
    bash_command="ssh processor2.sh {{ ds }} ",
    dag=dag)
data_processor1.set_downstream(data_processor2)
This may be useful:
I tried to reproduce the error you are facing in Airflow, but I couldn't.
In my Airflow GUI, it shows only a single retry and then marks the task and the DAG as failed, which is the usual Airflow behavior. I don't know why or how you are running into this issue.
Can you please add more details about the problem (such as logs)?

How to limit Airflow to run only one instance of a DAG run at a time?

I want the tasks in the DAG to all finish before the 1st task of the next run gets executed.
I have max_active_runs = 1, but this still happens.
default_args = {
    'depends_on_past': True,
    'wait_for_downstream': True,
    'max_active_runs': 1,
    'start_date': datetime(2018, 3, 4),
    'owner': 't.n',
    'email': ['t.n@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=4)
}
dag = DAG('example', default_args=default_args, schedule_interval=schedule_interval)
(All of my tasks are dependent on the previous task. Airflow version is 1.8.0)
Thank you
I changed to put max_active_runs as an argument of DAG() instead of in default_arguments, and it worked.
Thanks SimonD for giving me the idea, though not directly pointing to it in your answer.
You've put the 'max_active_runs': 1 into the default_args parameter and not into the correct spot.
max_active_runs is a constructor argument for a DAG and should not be put into the default_args dictionary.
Here is an example DAG that shows where you need to move it to:
dag_args = {
    'owner': 'Owner',
    # 'max_active_runs': 1, # <--- Here is where you had it.
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1, 12, 0),
    'email_on_failure': False
}
sched = timedelta(hours=1)
dag = DAG(
    job_id,
    default_args=dag_args,
    schedule_interval=sched,
    max_active_runs=1  # <---- Here is where it is supposed to be
)
If the tasks that your dag is running are actually sub-dags then you may need to pass max_active_runs into the subdags too but not 100% sure on this.
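Why the placement matters can be shown with a toy model in plain Python. The idea is that default_args only supplies default keyword arguments to the tasks, so a run-level setting placed there never reaches the DAG object. ToyDag and ToyTask are made-up stand-ins, not Airflow classes, and the fallback value of 16 is an assumption for illustration.

```python
class ToyDag:
    """Stand-in for a DAG: run-level settings are constructor arguments."""

    def __init__(self, dag_id, default_args=None, max_active_runs=16):
        self.dag_id = dag_id
        self.default_args = default_args or {}
        # Only the constructor argument is used; default_args is not consulted.
        self.max_active_runs = max_active_runs


class ToyTask:
    """Stand-in for an operator: default_args only fill in task settings."""

    def __init__(self, task_id, dag, **kwargs):
        merged = {**dag.default_args, **kwargs}  # explicit kwargs win
        self.task_id = task_id
        self.retries = merged.get('retries', 0)
        self.owner = merged.get('owner', 'nobody')


# The setting is buried in default_args, so the DAG keeps its fallback value.
wrong = ToyDag('example', default_args={'max_active_runs': 1, 'retries': 3})
# The setting is passed to the constructor, so it takes effect.
right = ToyDag('example', default_args={'retries': 3}, max_active_runs=1)
task = ToyTask('t1', wrong)

print(wrong.max_active_runs)  # 16
print(right.max_active_runs)  # 1
print(task.retries)           # 3
```

Task-level keys such as retries still arrive at the task in both cases, which is why the misplaced key goes unnoticed.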
You can use XComs to do it. First add two PythonOperators, 'start' and 'end', to the DAG. Set the flow as:
start ---> ALL TASKS ----> end
'end' will always push a variable
last_success = context['execution_date'] to XCom (xcom_push). (This requires provide_context=True in the PythonOperators.)
And 'start' will always check XCom (xcom_pull) to see whether there is a last_success variable whose value equals the previous DagRun's execution_date or the DAG's start_date (to let the process start).
Followed this answer
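The gating logic of this approach can be sketched in plain Python, with a dictionary standing in for XCom. This is only one reading of the description above: the first run (execution_date equal to the DAG's start_date) is let through, and every later run passes only if the previous run's 'end' task recorded its execution_date. The names xcom_store, start_task and end_task are hypothetical.

```python
from datetime import datetime, timedelta

xcom_store = {}  # stand-in for XCom: key -> value


def end_task(execution_date):
    """'end' records the execution_date of the run that just finished."""
    xcom_store['last_success'] = execution_date


def start_task(execution_date, previous_execution_date, dag_start_date):
    """'start' passes for the first run, or if the previous run reached 'end'."""
    if execution_date == dag_start_date:
        return True
    return xcom_store.get('last_success') == previous_execution_date


dag_start = datetime(2018, 3, 4)
hour = timedelta(hours=1)

first_ok = start_task(dag_start, None, dag_start)
# The second run arrives before the first run has reached 'end'.
second_blocked = start_task(dag_start + hour, dag_start, dag_start)
end_task(dag_start)
second_ok = start_task(dag_start + hour, dag_start, dag_start)

print(first_ok, second_blocked, second_ok)  # True False True
```

In a real DAG the 'start' callable would need to fail or retry when the check returns False, so that the rest of the run waits.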
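Actually, you should set dag_concurrency to 1 through an environment variable (in Airflow's naming scheme for the [core] section that would be AIRFLOW__CORE__DAG_CONCURRENCY=1). Worked for me.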
