I am trying to run a pipeline on, say, 'first Monday of the month'. My schedule interval is '30 17 * * 1#1' and my start date is '2022-01-01', but it doesn't run. Any pointers on why it doesn't run at the scheduled time? Thank you.
I want my DAG's first run to be at 2:00 AM on the 25th, and from then on to run daily at 2:00 AM, Tuesday to Saturday.
The following is what my scheduling looks like:
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id='in__xxx__agnt_brk_com',
    schedule_interval='0 2 * * 2-6',
    start_date=datetime(2022, 10, 24),
    catchup=False,
) as dag:
The Airflow UI also shows that my first run should be on the 25th at 2:00 AM. But unfortunately, the DAG didn't execute on time.
What am I missing here?
Airflow is scheduling at 2am in UTC, not in your local time; 2am in your local time is 6am in UTC.
Take a look at this link on how to specify the time zone in your DAG: https://airflow.apache.org/docs/apache-airflow/stable/timezone.html
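If the intent is 2:00 AM in a local zone, a minimal sketch (assuming America/New_York is the zone you want; the approach is the one the linked docs describe) is to make the start_date timezone-aware with pendulum:

import pendulum
from airflow import DAG

# Assumption: America/New_York is the intended local zone; swap in your own.
local_tz = pendulum.timezone("America/New_York")

with DAG(
    dag_id='in__xxx__agnt_brk_com',
    schedule_interval='0 2 * * 2-6',
    # With a timezone-aware start_date, Airflow evaluates the cron in that zone.
    start_date=pendulum.datetime(2022, 10, 24, tz=local_tz),
    catchup=False,
) as dag:
    ...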
You should consult the documentation on dag intervals.
Your dag did not run on the 25th:
Your start date is 2022-10-24
This creates a dag interval of 2022-10-24 - 2022-10-25.
You set catchup=False
You created the dag after midnight on the 25th. The dag interval for the 24th has passed and you've denied catchup.
The next dag run is scheduled for 2022-10-25
This creates a dag interval of 2022-10-25 - 2022-10-26
Your dag will run at 2am UTC on the 26th.
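As a quick sanity check, you can ask Airflow for the next scheduled run from the CLI (substituting your dag_id):

airflow dags next-execution in__xxx__agnt_brk_com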
I have the following dag config:
import datetime

from airflow import DAG

with DAG(
    dag_id='dag_example',
    catchup=False,
    start_date=datetime.datetime(2022, 5, 26),
    schedule_interval='0 6,7,9,11,15,19,23 * * *',
    max_active_runs=1,
    default_args=default_args,  # default_args is defined elsewhere in my file
)
I would like to know why my dag that is scheduled to run at 7 AM is running at 9 AM (the next scheduled date...). I'm using Airflow 2.1.2. When I was using Airflow v1 the dag ran correctly.
This is how Airflow works.
DAGs are scheduled at the end of the interval.
So in your case the run_id of 2022-05-27 07:00 will start running at 2022-05-27 09:00, because the interval between those two schedule points is 2 hours and Airflow schedules at the end of the interval.
Note: this is consistent with batch processing practices.
If you run a daily job, then today you are processing yesterday's data.
If you run an hourly job, then at 10:00 you are processing the interval between 09:00 and 10:00; in other words, the run_id of 09:00 will actually run at the end of the hourly interval, which is 10:00.
You can read Problem with start date and scheduled date in Apache Airflow for more information.
Should you want to reference a specific interval from your DAG, it is just a question of which macro to use. See Templates reference.
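For illustration, a minimal sketch (the task_id is a placeholder) that echoes the interval bounds via template macros; note that data_interval_start / data_interval_end are the Airflow 2.2+ names, while older 2.x releases expose execution_date and next_execution_date instead:

from airflow.operators.bash import BashOperator

# data_interval_start / data_interval_end are the bounds of the run's
# data interval; ds is the logical date as YYYY-MM-DD.
print_interval = BashOperator(
    task_id='print_interval',
    bash_command=(
        'echo "processing {{ data_interval_start }} '
        'to {{ data_interval_end }} (logical date {{ ds }})"'
    ),
)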
We use airflow in a hybrid ETL system. By this I mean that some of our DAGs are not scheduled but externally triggered using the Airflow API.
We are trying to do the following: Have a sensor in a scheduled DAG (DAG1) that senses that a task inside an externally triggered DAG (DAG2) has run.
For example, the DAG1 runs at 11 am, and we want to be sure that DAG2 has run (due to an external trigger) at least once since 00:00. I have tried to set execution_delta = timedelta(hours=11) but the sensor is sensing nothing. I think the problem is that the sensor tries to look for a task that has been scheduled exactly at 00:00. This won't be the case, as DAG2 can be triggered at any time from 00:00 to 11:00.
Is there any solution that can serve the purpose we need? I think we might need to create a custom Sensor, but it feels strange to me that the native Airflow Sensor does not solve this issue.
This is the sensor I'm defining:
from datetime import timedelta

from airflow.sensors import external_task

sensor = external_task.ExternalTaskSensor(
    task_id='sensor',
    dag=dag,
    external_dag_id='DAG2',
    external_task_id='sensed_task',
    mode='reschedule',
    check_existence=True,
    execution_delta=timedelta(hours=int(execution_type)),
    poke_interval=10 * 60,  # Check every 10 minutes
    timeout=1 * 60 * 60,  # Allow for 1 hour of delay in execution
)
I had the same problem & used the execution_date_fn parameter:
ExternalTaskSensor(
    task_id="sensor",
    external_dag_id="dag_id",
    execution_date_fn=get_most_recent_dag_run,
    mode="reschedule",
)
where the get_most_recent_dag_run function looks like this:
from airflow.models import DagRun

def get_most_recent_dag_run(dt):
    # Find all runs of the target DAG and return the latest execution_date.
    dag_runs = DagRun.find(dag_id="dag_id")
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    if dag_runs:
        return dag_runs[0].execution_date
The ExternalTaskSensor needs to know both the dag_id and the exact execution date of the run it should wait on for cross-DAG dependencies, and execution_date_fn supplies the latter dynamically.
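Putting the two pieces together for the question above, a sketch under the assumption that DAG2 and sensed_task are the ids from the original snippet:

from airflow.models import DagRun
from airflow.sensors.external_task import ExternalTaskSensor

def get_most_recent_dag_run(dt):
    # Ignore the scheduled date passed in; look up DAG2's latest actual run.
    dag_runs = DagRun.find(dag_id="DAG2")
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    if dag_runs:
        return dag_runs[0].execution_date

sensor = ExternalTaskSensor(
    task_id='sensor',
    dag=dag,
    external_dag_id='DAG2',
    external_task_id='sensed_task',
    execution_date_fn=get_most_recent_dag_run,
    mode='reschedule',
    poke_interval=10 * 60,  # Check every 10 minutes
)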
I have an Oozie job running on a CDH cluster. I have the following coordinator:
<coordinator-app name="name" frequency="0 */5 * * *" start="2020-03-05T16:00Z" end="2020-03-07T16:00Z" timezone="America/New_York" xmlns="uri:oozie:coordinator:0.4">
I submitted this job at 15:15 New York time; Oozie started the first job right away, and it was marked at 15:00 (New York time), with the next one scheduled for 19:00. I don't understand the time zone handling in Oozie. Why does it not pick up the time zone I have specified?
You can override the time zone when submitting the Oozie job on the terminal:

oozie job -timezone EST -config coordinator.properties -run
I have a DAG in Airflow and for now it is running each hour (@hourly).
Is it possible to have it run every 5 minutes?
Yes, here's an example of a DAG that I have running every 5 min:
from datetime import timedelta

from airflow import DAG

dag = DAG(dag_id='eth_rates',
          default_args=args,  # assumes args is defined elsewhere
          schedule_interval='*/5 * * * *',
          dagrun_timeout=timedelta(seconds=5))
schedule_interval accepts a CRON expression: https://en.wikipedia.org/wiki/Cron#CRON_expression
The documentation states:
Each DAG may or may not have a schedule, which informs how DAG Runs are created. schedule_interval is defined as a DAG argument, and receives preferably a cron expression as a str, or a datetime.timedelta object.
Following the provided link for CRON expressions, it appears you can specify */5 * * * * to run it every 5 minutes.
I'm not familiar with the matter, but this is what the documentation states.
Airflow 2 (I'm using 2.4.2) supports timedelta for scheduling DAGs on a particular cadence (hourly, every 5 minutes, etc.), so you can put:
schedule_interval = timedelta(minutes=5)
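For completeness, a minimal sketch of a DAG on a 5-minute timedelta cadence (dag_id and start_date are placeholders; from Airflow 2.4 the schedule argument accepts a timedelta as well):

from datetime import timedelta

import pendulum
from airflow import DAG

with DAG(
    dag_id='every_five_minutes',  # placeholder dag_id
    start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),  # placeholder start
    schedule=timedelta(minutes=5),  # Airflow 2.4+; schedule_interval also accepts it
    catchup=False,
) as dag:
    ...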