Problem with start date and scheduled date in Apache Airflow - airflow

I am working with Apache Airflow and I have a problem with the scheduled day and the starting day.
I want a DAG to run every day at 8:00 AM UTC. So, I did:
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2020, 12, 7, 10, 0,0),
'email': ['example#emaiil.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(hours=5)
}
# Never run
dag = DAG(dag_id='id', default_args=default_args, schedule_interval='0 8 * * *',catchup=True)
I uploaded the DAG on 2020-12-07 and wanted its first run to be on 2020-12-08 at 08:00:00.
I set the start_date to 2020-12-07 at 10:00:00 to skip the 08:00:00 run on 2020-12-07 and trigger only from the next day, but it didn't work.
Then I modified the starting day:
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2020, 12, 7, 7, 59,0),
'email': ['example#emaiil.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(hours=5)
}
# Never run
dag = DAG(dag_id='etl-ca-cpke-spark_dev_databricks', default_args=default_args, schedule_interval='0 8 * * *',catchup=True)
Now the start date is 1 minute before the DAG should run and, because catchup is set to True, the DAG has been triggered for 2020-12-07 at 08:00:00, but it has not been triggered for 2020-12-08 at 08:00:00.
Why?

Airflow schedules a run at the end of its interval, not at the start (see the documentation reference).
This means that with:
start_date: datetime(2020, 12, 7, 8, 0, 0)
schedule_interval: '0 8 * * *'
the first run will kick in on 2020-12-08 at around 08:00 (the exact time depends on available resources).
This run's execution_date will be 2020-12-07 08:00.
The next run will kick in on 2020-12-09 at around 08:00.
That run's execution_date will be 2020-12-08 08:00.
Since today is 2020-12-08, the run for 2020-12-08 has not kicked in yet: its interval does not end until 2020-12-09 at 08:00.
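The end-of-interval rule can be sketched without Airflow at all. The following is a minimal illustration for a daily schedule only; `daily_runs` is a hypothetical helper (not an Airflow API), and the value of `now` is an assumed point in time on 2020-12-08. It also assumes catchup is enabled and the scheduler is otherwise healthy.

```python
from datetime import datetime, timedelta


def daily_runs(start_date, hour, now):
    """Return (execution_date, trigger_time) pairs for a daily schedule.

    Hypothetical helper, not part of Airflow: it models only a
    '0 <hour> * * *' schedule with catchup enabled.
    """
    one_day = timedelta(days=1)
    # The first schedule tick at or after start_date opens the first interval.
    tick = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    if tick < start_date:
        tick += one_day
    runs = []
    # A run is triggered only once its interval [tick, tick + 1 day) has ended.
    while tick + one_day <= now:
        runs.append((tick, tick + one_day))
        tick += one_day
    return runs


now = datetime(2020, 12, 8, 12, 0)  # assumed "current time" on 2020-12-08

# start_date 2020-12-07 10:00: the first interval opens on 2020-12-08 08:00
# and has not ended yet, so nothing has run.
first_attempt = daily_runs(datetime(2020, 12, 7, 10, 0), 8, now)

# start_date 2020-12-07 07:59: one run so far, stamped 2020-12-07 08:00
# and triggered on 2020-12-08 08:00.
second_attempt = daily_runs(datetime(2020, 12, 7, 7, 59), 8, now)
```

Under these assumptions the first start_date yields no runs by midday on 2020-12-08, while the second yields exactly one, which matches what the question observed.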

Related

Airflow job not running as scheduled

I have a job that I had set to run at 9:00 UTC on Wednesday. It didn't run as planned by the end of the delay interval, which I thought was curious because I believe I have everything defined properly.
default_args = {
'start_date': airflow.utils.dates.days_ago(0),
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
'noncomp_trial',
default_args=default_args,
description='test of dag',
schedule_interval='0 9 * * 3',
dagrun_timeout=timedelta(minutes=20))
If anyone has any advice here that would be greatly appreciated!
The Airflow scheduler triggers a DAG run once start_date plus one schedule_interval has passed. In your example, the DAG won't run until 9:00 AM on Wednesday of the following week.
See more information about the relationship between the start_date and schedule_interval here.
You could try setting your start_date to a static date in the past by a week or two to see if that works? And to make sure the Scheduler doesn't try to execute every start_date + schedule_interval occurrence between that new start_date and now, you can set catchup=False on the DAG. For example:
from datetime import datetime, timedelta
from airflow import DAG
dag = DAG(
    'noncomp_trial',
    default_args={
        # Static start date in the past, instead of days_ago(0)
        'start_date': datetime(2021, 7, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5)
    },
    description='test of dag',
    schedule_interval='0 9 * * 3',
    dagrun_timeout=timedelta(minutes=20),
    # Skip the runs between start_date and now
    catchup=False,
)

Apache airflow scheduler not triggering once in a month task as expected

Here is the dag, which I want to execute on fixed date of every month, as of now I kept it on 18th of every month.
But the task is getting triggered daily by the scheduler. catchup_by_default = False is set in the airflow.cfg file.
default_args = {
'owner': 'muddassir',
'depends_on_past': True,
'start_date': datetime(2021, 3, 18),
'retries': 1,
'schedule_interval': '0 0 18 * *',
}
You have put schedule_interval in your default_args, which is not the way to schedule a DAG. default_args are applied to the tasks: they are passed to the operators, not to the DAG itself.
You can fix this by removing schedule_interval from default_args and passing it to the DAG instance instead. You can also set the catchup flag to False to avoid any backfills:
# assuming this is how you initialize your DAG
dag = DAG('your DAG UI name', default_args=default_args, schedule_interval = '0 0 18 * *', catchup=False)
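A quick way to catch this kind of mistake is to check default_args for keys that belong on the DAG. The sketch below is only an illustration: `misplaced_dag_args` and `DAG_LEVEL_KEYS` are hypothetical names, and the key set is an assumed, non-exhaustive list of DAG constructor arguments.

```python
# DAG constructor arguments that the DAG does not pick up from default_args.
# Assumption: this set is illustrative, not exhaustive.
DAG_LEVEL_KEYS = {"schedule_interval", "catchup", "max_active_runs", "dagrun_timeout"}


def misplaced_dag_args(default_args):
    """Return the keys in default_args that belong on the DAG itself."""
    return sorted(key for key in default_args if key in DAG_LEVEL_KEYS)


default_args = {
    "owner": "muddassir",
    "depends_on_past": True,
    "retries": 1,
    "schedule_interval": "0 0 18 * *",  # misplaced: should be a DAG argument
}

misplaced = misplaced_dag_args(default_args)
print(misplaced)
```

Any key this reports should be moved out of default_args and passed to DAG(...) directly.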

Airflow Scheduling Monthly Jobs

I want to schedule a monthly job that runs every month on the same day of the month as today, and I want the first run to be today. For example, today is 11/2 and the time is 10 am. How can I schedule a monthly job that runs on the 2nd of every month at 11 am, with 11/2 as the first run?
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2020, 11, 1,22,00),
'email': "myemail#abc.com",
'email_on_failure': True,
'email_on_success': True,
'retries': 0
}
def print_hello():
    today = date.today()
    print("Today's date:", today)
    return 'Hello world! Monthly Run'
dag = DAG('dummy_monthly', description='Simple tutorial DAG',
          schedule_interval='11 00 2 * *',
          start_date=datetime(2020, 11, 2), catchup=False)
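Change the schedule_interval to 00 11 2 * *. The first cron field is the minute and the second is the hour, so 11 00 2 * * means 00:11 on the 2nd, not 11:00.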
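To see how the two expressions differ, the five cron fields can be labelled explicitly. This is a minimal sketch using only the standard library; `CronFields` and `split_cron` are hypothetical names, and the function only splits the expression, it does not validate or evaluate it.

```python
from typing import NamedTuple


class CronFields(NamedTuple):
    minute: str
    hour: str
    day_of_month: str
    month: str
    day_of_week: str


def split_cron(expr: str) -> CronFields:
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expr.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(parts)}: {expr!r}")
    return CronFields(*parts)


original = split_cron("11 00 2 * *")   # minute=11, hour=00: runs at 00:11
corrected = split_cron("00 11 2 * *")  # minute=00, hour=11: runs at 11:00
print(original)
print(corrected)
```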

Airflow dag not running when scheduled (while others scheduled at same time, do)?

I have 2 DAGs in Airflow, both of which are scheduled to run at 22:00 UTC (12PM my time (HST)). I find that only one of these DAGs is running at this time and am not sure why this would be happening. I can manually start the other DAG while the one that works is running, but it just does not start on its own.
Here are the DAG configs for the DAG that is running on schedule
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2019, 10, 13),
'email': [
'me#co.org'
],
'email_on_failure': True,
'retries': 0,
'retry_delay': timedelta(minutes=5),
'max_active_runs': 1,
}
dag = DAG('my_dag_1', default_args=default_args, catchup=False, schedule_interval="0 22 * * *")
Here are the DAG configs for the DAG that is failing to run
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2019, 10, 13),
'email': [
'me#co.org',
],
'email_on_failure': True,
'retries': 0,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('my_dag_2', default_args=default_args,
max_active_runs=1,
catchup=False, schedule_interval=f"0 19,22,1 * * *")
# run setup dag and trigger at 9AM, 12PM,and 3PM (need to convert from UTC time (-2 HST))
From the airflow.cfg file, some of the settings that I think are relevant are set as...
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
#parallelism = 32
parallelism = 8
# The number of task instances allowed to run concurrently by the scheduler
#dag_concurrency = 16
dag_concurrency = 3
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# The maximum number of active DAG runs per DAG
#max_active_runs_per_dag = 16
max_active_runs_per_dag = 1
Not sure what could be going on here. Is there some setting that I am mistakenly switching on that stops multiple different DAGs from running at the same time? Is there any more debugging info I should add to this question?

How to prevent catch up of a DAG?

I have the following DAG:
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 7, 19, 11, 0, 0),
    'email': ['me#me.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
    'catchup': False,
}
with DAG('some_dag', schedule_interval=timedelta(minutes=30), max_active_runs=1, default_args=default_args) as dag:
This DAG runs every 30 minutes. It rewrites the data in the table (deletes everything and writes it again). So if Airflow was down for 2 days, there is no point in running all the DAG runs that were missed during that time.
However, the above definition does not work. After Airflow had been down for 2 days, it still tried to run all the missed tasks.
How can I solve this?
OK. I solved it.
Apparently 'catchup': False has no effect in default_args. It is a DAG-level argument, so there it does nothing.
I changed the code to
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    # 07 is not a valid integer literal in Python 3, so use 7
    'start_date': datetime(2018, 7, 19, 11, 0, 0),
    'email': ['me#me.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
}
with DAG('some_dag', catchup=False, schedule_interval=timedelta(minutes=30), max_active_runs=1, default_args=default_args) as dag:
Now it works.
According to the docs (https://airflow.apache.org/scheduler.html#backfill-and-catchup), there are two ways to turn catchup off:
Pass catchup=False in the DAG arguments.
Set catchup_by_default = False in airflow.cfg.
Depending on your use case, a good practice could be to set catchup_by_default = False and then use catchup=True only for the DAGs that require catchup.
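The effect of the catchup flag after downtime can be illustrated with a small simulation. This is a sketch, not Airflow code: `pending_execution_dates` is a hypothetical helper, the timestamps are assumed, and it takes as given that with catchup disabled the scheduler creates a run only for the most recent completed interval.

```python
from datetime import datetime, timedelta


def pending_execution_dates(last_execution_date, interval, now, catchup):
    """Return the execution dates the scheduler would create runs for.

    Hypothetical model: an interval is runnable once it has fully ended,
    and catchup=False keeps only the most recent runnable interval.
    """
    pending = []
    candidate = last_execution_date + interval
    while candidate + interval <= now:
        pending.append(candidate)
        candidate += interval
    return pending if catchup else pending[-1:]


last_run = datetime(2018, 7, 19, 11, 0)   # last run before the outage (assumed)
back_up = last_run + timedelta(days=2)    # Airflow comes back two days later
every_30_min = timedelta(minutes=30)

with_catchup = pending_execution_dates(last_run, every_30_min, back_up, catchup=True)
without_catchup = pending_execution_dates(last_run, every_30_min, back_up, catchup=False)
print(len(with_catchup), len(without_catchup))
```

For a job that deletes and rewrites the whole table, only the single most recent run is useful, which is the case catchup=False covers.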

Resources