Airflow: Confusion in execution_date & start_date

My scheduled Airflow runs were working perfectly until they misbehaved on 31-12-2022. I'm unable to figure out exactly what happened, but I'm sure it is related to execution_date and start_date. The start_date for my DAG is determined by the function below (which I call):
import datetime
import pendulum

def derive_start_date(timezone='Australia/Melbourne') -> datetime.datetime:
    def current_year():
        return datetime.datetime.now().year
    local_tz = pendulum.timezone(timezone)
    _dt = datetime.datetime(current_year(), 1, 1, tzinfo=local_tz)
    return _dt
The above function should set the start date to 01-01-2022, but for some reason the DAGs in the Airflow UI had a start_date of 31-12-2022 (I can't find out why).
The DAG runs were fine for months until 31-12-2022. Below are the details of a sample run on each day.
30/12/22:
Schedule interval: (30 19 * * *)
Execution Date: 2022-12-30 08:30:00+00:00
DAG Start date: 2022-12-31T13:00:00+00:00
Task start date: 2022-12-31, 8:30
31/12/2022:
Schedule interval: (30 19 * * *)
Execution Date: 2022-12-31T13:00:00+00:00
DAG Start date: 2022-12-31T13:00:00+00:00
Task start date: 2023-01-01, 12:13
01/01/2023:
Schedule interval: (30 19 * * *)
Execution Date: 2023-01-01 08:30:00+00:00
DAG Start date: 2022-12-31T13:00:00+00:00
Task start date: 2023-01-02, 8:30
As you can see, on 31 December the execution dates changed automatically, and I'm unable to figure out what could have happened. Can anyone point out which part of the puzzle I'm missing? They are back to normal from 1 January onwards.
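Two details may explain the 31-12-2022 shown in the UI. First, the UI renders dates in UTC: midnight on 1 January in Melbourne (AEDT, UTC+11) is 13:00 UTC on 31 December of the previous year, which matches the DAG start date of 2022-12-31T13:00:00+00:00 above. Second, current_year() is re-evaluated every time the DAG file is parsed, so the computed start_date silently jumps forward at the new year. A minimal standard-library sketch of the timezone conversion (zoneinfo used here instead of pendulum only for self-containment):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Midnight 2023-01-01 in Melbourne (AEDT, UTC+11) expressed in UTC
local_start = datetime(2023, 1, 1, tzinfo=ZoneInfo("Australia/Melbourne"))
utc_start = local_start.astimezone(ZoneInfo("UTC"))
print(utc_start.isoformat())  # 2022-12-31T13:00:00+00:00
```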

Related

How to refer to a task's success status from one DAG in another DAG

I have a scenario like below:
dag A :
task 1,
task 2
Dag B :
Task 3,
Task 4
Now I want to trigger/run task 3 (DAG B) only after the success of task 1 (DAG A). Both DAGs are scheduled on the same day but at different times.
For example: DAG A runs on 14 July at 8 AM, and DAG B runs on 14 July at 2 PM.
Is that doable? How?
Thanks
In DagB you should create a BranchPythonOperator that returns "task3" only if the appropriate condition holds.
In the code example below, I return "task3" only if a "DagA" run finished with state=success on the same day.
from datetime import datetime

from airflow.models import DagRun, TaskInstance
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.utils import timezone
from airflow.utils.trigger_rule import TriggerRule

def check_success_dag_a(**context):
    ti: TaskInstance = context['ti']
    dag_run: DagRun = context['dag_run']
    date: datetime = ti.execution_date
    # Midnight of the execution date, made timezone-aware
    ts = timezone.make_aware(datetime(date.year, date.month, date.day, 0, 0, 0))
    # Find successful DagA runs between midnight and the current execution date
    dag_a = dag_run.find(
        dag_id='DagA',
        state="success",
        execution_start_date=ts,
        execution_end_date=ti.execution_date)
    if dag_a:
        return "task3"

check_success = BranchPythonOperator(
    task_id="check_success_dag_a",
    python_callable=check_success_dag_a,
)

def run(**context):
    ti = context['ti']
    print(ti.task_id)

task3 = PythonOperator(
    task_id="task3",
    python_callable=run,
    trigger_rule=TriggerRule.ONE_SUCCESS,
)

task4 = PythonOperator(
    task_id="task4",
    python_callable=run,
    trigger_rule=TriggerRule.ONE_SUCCESS,
)

check_success >> [task3] >> task4

DAG run as per timezone

I want to run my DAG per the New York time zone. The data arrives according to New York time, so the DAG fails on its initial runs and also skips the last runs' data, since UTC is 4 hours ahead and the day completes 4 hours earlier in UTC than in New York.
I have done:
import pendulum
from airflow import DAG

timezone = pendulum.timezone("America/New_York")

default_args = {
    'depends_on_past': False,
    # pendulum.datetime takes tz=, not tzinfo=
    'start_date': pendulum.datetime(2022, 5, 13, tz=timezone),
    ...
}

test_dag = DAG(
    'test_data_import_tz', default_args=default_args,
    schedule_interval="0 * * * *", catchup=False)
The DAG code works with:
yesterday_report_date = context.get('yesterday_ds_nodash')
report_date = context.get('ds_nodash')
But the DAG still runs per UTC, and the initial run fails because the data is not present yet.
The DAG run times are:
2022-05-13T00:00:00+00:00
2022-05-13T01:00:00+00:00
and so on
whereas I want this to start from
2022-05-13T04:00:00+00:00
2022-05-13T05:00:00+00:00
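The desired 04:00 UTC first run is simply New York midnight expressed in UTC (EDT is UTC-4 in May). A minimal standard-library sketch of the conversion (zoneinfo used here instead of pendulum only for self-containment):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Midnight 2022-05-13 in New York (EDT, UTC-4) expressed in UTC
ny_midnight = datetime(2022, 5, 13, tzinfo=ZoneInfo("America/New_York"))
print(ny_midnight.astimezone(ZoneInfo("UTC")).isoformat())  # 2022-05-13T04:00:00+00:00
```

If the run times show 00:00 UTC instead, the timezone attached to start_date is likely not reaching the scheduler; with a properly timezone-aware start_date, Airflow stores dates in UTC but aligns cron schedules to the given timezone.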

How to run an Airflow DAG every 10 minutes between 9 AM and 4 PM on working days

I have a DAG that needs to be scheduled to run every 10 minutes on working days (Mon to Fri) between 9 AM and 4 PM. How do I do this in Airflow?
Set your DAG with the cron expression */10 9-16 * * 1-5, which means "at every 10th minute past every hour from 9 through 16 on every day-of-week from Monday through Friday". See crontab.guru.
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='my_dag',
    schedule_interval='*/10 9-16 * * 1-5',
    start_date=datetime(2021, 1, 1),
)
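To make the boundaries concrete, here is a plain-Python predicate mirroring that expression (an illustrative sketch, not how Airflow parses cron). Note that the last firing of each day is 16:50, since */10 keeps firing throughout hour 16:

```python
from datetime import datetime

def matches(dt: datetime) -> bool:
    """Mirror '*/10 9-16 * * 1-5': every 10th minute, hours 9-16, Mon-Fri."""
    return dt.minute % 10 == 0 and 9 <= dt.hour <= 16 and dt.weekday() < 5

print(matches(datetime(2021, 1, 4, 9, 0)))    # Monday 09:00 -> True
print(matches(datetime(2021, 1, 4, 16, 50)))  # Monday 16:50 -> True
print(matches(datetime(2021, 1, 2, 10, 0)))   # Saturday -> False
print(matches(datetime(2021, 1, 4, 17, 0)))   # 17:00 -> False
```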

Execute a task in Airflow on the 12th day of the month and 2 days before the last day of the month

I need to execute the same Airflow task on the 12th day of the month and 2 days before the last day of the month.
I have tried macros and execution_date as well, but I'm not sure how to proceed further. Could you please help with this?
def check_trigger(execution_date, day_offset, **kwargs):
    target_date = execution_date - timedelta(days=day_offset)
    return target_date
I would approach it as below. twelfth_or_two_before is a Python function that simply checks the date and returns the task_id of the appropriate downstream task. (That way, if the business needs ever change and you need to run the actual tasks on different or additional days, you just modify that function.)
with DAG( ... ) as dag:
    right_days = BranchPythonOperator(
        task_id="start",
        python_callable=twelfth_or_two_before,
    )
    do_nothing = DummyOperator(task_id="do_nothing")
    actual_task = ____Operator( ... )  # This is the Operator that does the actual work

    right_days >> [do_nothing, actual_task]
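The answer leaves twelfth_or_two_before unspecified; here is a minimal sketch of what it might look like (the helper name and the task_ids are assumptions matching the operators above), using calendar.monthrange to find the month's length:

```python
import calendar
from datetime import date

def twelfth_or_two_before(execution_date):
    """Return the downstream task_id: the real task runs on the 12th
    and two days before the last day of the month; otherwise skip."""
    last_day = calendar.monthrange(execution_date.year, execution_date.month)[1]
    if execution_date.day in (12, last_day - 2):
        return "actual_task"
    return "do_nothing"

print(twelfth_or_two_before(date(2023, 1, 12)))  # actual_task
print(twelfth_or_two_before(date(2023, 1, 29)))  # 31 - 2 -> actual_task
print(twelfth_or_two_before(date(2023, 2, 26)))  # 28 - 2 -> actual_task
print(twelfth_or_two_before(date(2023, 1, 15)))  # do_nothing
```

The DAG itself would need a daily schedule_interval so the branch is evaluated every day and only the qualifying dates reach actual_task.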

Airflow DAG Scheduling for end of month

I want to run a schedule on Airflow (v1.9.0).
My DAG needs to run at the end of every month, but I don't know how to write the settings.
my_dag = DAG(dag_id=DAG_ID,
             catchup=False,
             default_args=default_args,
             schedule_interval='30 0 31 * *',
             start_date=datetime(2019, 7, 1))
But this won't work in a month where there's no 31st, right?
How can I write a schedule_interval to run at every end of the month?
You can do this by putting L in the day-of-month position of your schedule_interval cron expression:
schedule_interval='59 23 L * *'  # 23:59 on the last day of the month
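Whether L is accepted depends on the croniter version bundled with your Airflow; if yours rejects it, a common fallback is to schedule on days 28-31 (e.g. 30 0 28-31 * *) and guard the tasks with a ShortCircuitOperator. A minimal sketch of the guard's date check:

```python
import calendar
from datetime import date

def is_last_day_of_month(d: date) -> bool:
    """True only on the final calendar day of d's month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]

print(is_last_day_of_month(date(2019, 7, 31)))  # True
print(is_last_day_of_month(date(2019, 6, 30)))  # True
print(is_last_day_of_month(date(2019, 7, 30)))  # False
```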
