Airflow job is started before scheduled time

I am confused by Airflow's scheduling behavior. I have a daily job that should run at 11:59 CET. My understanding is that the job should run at the end of the data interval, which would be Dec 31st at 11:59, but instead it runs on Dec 31st at 11:53.
Below is the DAG code. I am using Airflow 2.2.0.
import datetime as dt

import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {
    'owner': 'airflow',
    'start_date': dt.datetime(2018, 9, 24, 10, 0, 0,
                              tzinfo=pendulum.timezone('UTC')),
    'concurrency': 0,
    'retries': 0,
    'catchup': False,
}

with DAG('and_another_dag5',
         default_args=default_args,
         schedule_interval='59 11 * * *',
         # catchup=True,
         ) as dag:
    dummy = DummyOperator(task_id='run_ths')

It looks like this behavior was caused by using MS SQL as the backend in Airflow v2.2.0. Switching back to Postgres solves the issue for us.
See also https://giters.com/apache/airflow/issues/19651.
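Independent of the backend bug, note that the start_date above is pinned to UTC, so the '59 11 * * *' cron fires at 11:59 UTC, not CET. Airflow evaluates cron schedules in the DAG's timezone, which it takes from start_date. A minimal sketch of a timezone-aware daily DAG (the DAG id is illustrative):

```python
import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Anchor the schedule to a CET/CEST zone so the cron expression
# is evaluated in local time rather than UTC.
local_tz = pendulum.timezone("Europe/Berlin")

with DAG(
    "cet_daily_example",              # illustrative name
    start_date=pendulum.datetime(2018, 9, 24, tz=local_tz),
    schedule_interval="59 11 * * *",  # 11:59 in Europe/Berlin
    catchup=False,
) as dag:
    DummyOperator(task_id="run_this")
```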

Related

How to configure Apache Airflow start_date and schedule_interval to run daily at 7am?

I'm using Airflow 2.3.3 (through GCP Composer).
I pass this yaml configuration when deploying my DAG:
dag_args:
  dag_id: FTP_DAILY
  default_args:
    owner: 'Dev team'
    start_date: "00:00:00"
    max_active_runs: 1
    retries: 2
  schedule_interval: "0 7 * * *"
  ftp_conn_id: 'ftp_dev'
I want this DAG to run at 7am UTC every morning, but it's not running. In the UI it says next run: 2022-11-22, 07:00:00 (as of Nov 22nd) and it never runs. How should I configure my start_date and schedule_interval so that the DAG runs at 7am UTC every day, starting from the nearest 7am after the deployment?
You can pass default args directly in the Python DAG code and compute yesterday's date, for example:
from datetime import timedelta

from airflow.utils.dates import days_ago

dag_default_args = {
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
    'start_date': days_ago(1)
}
Then in the DAG:
with airflow.DAG(
        "dag_name",
        default_args=dag_default_args,
        schedule_interval="0 7 * * *") as dag:
    ...
In this case the schedule_interval cron will work correctly; Airflow bases the cron schedule on the start date.
The main concept of Airflow is that a DAG run starts only after the required interval has passed. If you schedule a DAG with the above setup, Airflow will compute
data_interval_start as 2022-11-22 07:00:00
and data_interval_end as 2022-11-23 07:00:00
Since you are asking Airflow to process the data from this interval, it waits for the interval to pass, thus starting execution on November 23rd at 7am.
If you want it to trigger immediately after you deploy the DAG, you need to move the start date back by one day. You may also need to set the catchup flag to True.
import pendulum
from airflow import DAG

with DAG(
    dag_id='new_workflow4',
    schedule_interval="0 7 * * *",
    start_date=pendulum.datetime(2022, 11, 21, hour=0, tz="UTC"),
    catchup=True,
) as dag:
    ...
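To make the interval logic concrete, here is a plain-Python sketch (not the Airflow API) of how a daily "0 7 * * *" schedule maps a logical date to the data interval whose end actually triggers the run:

```python
from datetime import datetime, timedelta

def daily_interval(logical_date: datetime):
    """Data interval for a daily '0 7 * * *' schedule: the run stamped
    with logical_date covers [07:00, next day 07:00) and only fires
    once that interval has closed."""
    start = logical_date.replace(hour=7, minute=0, second=0, microsecond=0)
    return start, start + timedelta(days=1)

start, end = daily_interval(datetime(2022, 11, 22))
print(start)  # 2022-11-22 07:00:00
print(end)    # 2022-11-23 07:00:00  <- the run actually executes at this point
```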

SLA miss is not saved in the database and does not trigger any mail in Airflow

I have successfully set up an SMTP server, and it works fine when a job fails. But when I try to configure an SLA miss as described in the link below, nothing happens:
https://blog.clairvoyantsoft.com/airflow-service-level-agreement-sla-2f3c91cd84cc
mid = BashOperator(
    task_id='mid',
    sla=timedelta(seconds=5),
    bash_command='sleep 10',
    retries=0,
    dag=dag,
)
No SLA miss event is saved. I have also checked under Browse -> SLA Misses in the UI. I have tried several things but cannot find the issue.
The DAG is defined as:
args = {
    'owner': 'airflow',
    'start_date': datetime(2020, 11, 18),
    'catchup': False,
    'retries': 0,
    'provide_context': True,
    'email': "XXXXXXXX@gmail.com",
    'start_date': airflow.utils.dates.days_ago(n=0, minute=1),
    'priority_weight': 1,
    'email_on_failure': True,
    'default_args': {
        'on_failure_callback': on_failure_callback,
    },
}

d = datetime(2020, 10, 30)
dag = DAG('MyApplication', start_date=d, on_failure_callback=on_failure_callback,
          schedule_interval='@daily', default_args=args)
The issue seems to be in the arguments, more specifically 'start_date': airflow.utils.dates.days_ago(n=0, minute=1): this means that start_date is newly interpreted every time the scheduler parses the DAG file. You should specify a static start date like datetime(2020, 11, 18).
See also Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
Also, specifying 'default_args' inside args looks wrong to me, since args is already passed as the DAG's default_args.
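A hedged sketch of the fix, with a static start_date and an explicit sla_miss_callback (the callback name and body are illustrative, and the import paths assume Airflow 2.x):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

def sla_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Illustrative callback: Airflow invokes this when an SLA is missed.
    print(f"SLA missed for tasks: {task_list}")

dag = DAG(
    'MyApplication',
    start_date=datetime(2020, 11, 18),   # static, not days_ago(...)
    schedule_interval='@daily',
    sla_miss_callback=sla_callback,
    default_args={'email': 'XXXXXXXX@gmail.com', 'email_on_failure': True},
)

mid = BashOperator(
    task_id='mid',
    sla=timedelta(seconds=5),   # task exceeds this, so an SLA miss is recorded
    bash_command='sleep 10',
    dag=dag,
)
```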

Skip run if DAG is already running

I have a DAG for which only one instance may run at a time. To achieve this I am using max_active_runs=1, which works fine:
dag_args = {
    'owner': 'Owner',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1, 12, 0),
    'email_on_failure': False
}

sched = timedelta(hours=1)
dag = DAG(job_id, default_args=dag_args, schedule_interval=sched, max_active_runs=1)
The problem is:
When the DAG is due to be triggered and an instance is still running, Airflow waits for that run to finish and then triggers the DAG again.
My question is:
Is there any way to skip that queued run, so the DAG does not run again right after the current execution finishes?
Thanks!
This is just from checking the docs, but it looks like you only need to add another parameter:
catchup=False
catchup (bool) – Perform scheduler catchup (or only run latest)?
Defaults to True
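A minimal sketch of the suggestion, combining the existing max_active_runs=1 with catchup=False (the dag_id is illustrative): with catchup disabled, the scheduler only creates a run for the latest interval, so intervals missed while an instance was running are skipped rather than queued up behind it.

```python
from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    'single_instance_example',   # illustrative dag_id
    default_args={
        'owner': 'Owner',
        'depends_on_past': False,
        'start_date': datetime(2018, 1, 1, 12, 0),
    },
    schedule_interval=timedelta(hours=1),
    max_active_runs=1,   # never more than one concurrent run
    catchup=False,       # skip missed intervals instead of backfilling them
)
```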

Tasks retrying more than specified retry in Airflow

I have recently upgraded Airflow to 1.10.2. Some tasks in the DAG run fine, while others retry more than the specified number of retries.
One of the task logs shows "Starting attempt 26 of 2". Why is the scheduler rescheduling it even after two failures?
Is anyone facing a similar issue?
Example Dag -
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 3, 10, 0, 0, 0),
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
    'email': ['my@myorg.com'],
    'email_on_failure': True,
    'email_on_retry': True
}
dag = DAG(dag_id='dag1',
          default_args=args,
          schedule_interval='0 12 * * *',
          max_active_runs=1)

data_processor1 = BashOperator(
    task_id='data_processor1',
    bash_command="sh processor1.sh {{ ds }} ",
    dag=dag)

data_processor2 = BashOperator(
    task_id='data_processor2',
    bash_command="ssh processor2.sh {{ ds }} ",
    dag=dag)

data_processor1.set_downstream(data_processor2)
This may be useful: I tried to reproduce the error you are facing, but I couldn't.
In my Airflow UI the task shows only a single retry and is then marked failed, along with the DAG, which is the normal Airflow behavior. I don't know why or how you are hitting this issue.
Can you please add more details about the problem (such as logs)?

airflow backfill dag not running until given end date

I have a backfill DAG scheduled to run yearly from 01-01-2012 to 01-01-2018, but it only runs from 01-01-2012 until 01-01-2017. Why does it not run until 01-01-2018, and how can I make it run through 2018?
Here is the code that I have used in the DAG:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2012, 1, 1),
    'end_date': datetime(2018, 1, 1),
    'email': ['sef12@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='SAMPLE_LOAD',
    schedule_interval='@yearly',
    default_args=default_args,
    catchup=True,
    max_active_runs=1,
    concurrency=1)
This is due to how Airflow handles scheduling. From the docs:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's repeat that: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Your run for 2018 will start once 2018 is over, since that's the end of the interval.
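A plain-Python sketch (not the Airflow API) of that interval arithmetic, assuming the @yearly schedule and the start_date/end_date from the DAG above:

```python
from datetime import datetime

start_date = datetime(2012, 1, 1)
end_date = datetime(2018, 1, 1)

# Each @yearly run is stamped with its interval start but only fires
# once that interval has ended, one year later.
for year in range(start_date.year, end_date.year):
    stamped = datetime(year, 1, 1)
    fires_at = datetime(year + 1, 1, 1)
    print(f"run stamped {stamped:%Y-%m-%d} fires at {fires_at:%Y-%m-%d}")

# A run stamped 2018-01-01 would cover the whole of 2018 and fire at
# 2019-01-01, past end_date, so it is never scheduled.
```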
