Airflow Scheduler Misunderstanding

I'm new to Airflow.
My goal is to run a dag, on a daily basis, starting 1 hour from now.
I'm truly misunderstanding the airflow schedule "end-of-interval invoke" rules.
From the docs [(Airflow Docs)][1]
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
I set schedule_interval as followed:
schedule_interval="00 15 * * *"
and start_date as followed:
start_date=datetime(year=2019, month=8, day=7)
My assumption was that if it's now 14:00:00 (UTC) and today's date is 07-08-2019, then my dag will be executed in exactly one hour.
However, my dag is not starting at all.

There is a whole FAQ page about Airflow jobs not being scheduled: https://airflow.apache.org/faq.html
The key thing to notice here is:
The Airflow scheduler triggers the task soon after the start_date + schedule_interval is passed.
To my understanding, with start_date=datetime(year=2019, month=8, day=7) and schedule_interval="00 15 * * *", you want to trigger the task at 15:00 UTC daily. According to the docs, the scheduler triggers your task after start_date + schedule_interval has passed, so Airflow won't trigger it until the next day, i.e. August 8th 2019 15:00:00 UTC. Alternatively, you can change the start day to the 6th. It might be easier to understand this from an ETL point of view: you can only process the data for a given period after that period has passed. August 7th 2019 15:00:00 UTC is your start point, so you need to wait until August 8th 2019 15:00:00 UTC for the run covering that period to be triggered.
Also, note that Airflow distinguishes between execution_date and start_date; you can find more here

schedule_interval="00 15 * * *"
start_date=07-08-2019
The 1st run will be on 08-08-2019 at 15:00,
provided you created this dag before 15:00 on 07-08-2019.
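The rule the answers above describe (first interval start = first cron match on or after start_date; the run triggers one full interval later, at the end of the period) can be sketched in plain Python. The helper below is purely illustrative, not Airflow's actual implementation, and only handles a daily "minute hour * * *" schedule:

```python
from datetime import datetime, timedelta

# Hypothetical helper illustrating Airflow's end-of-interval rule for a
# daily cron schedule such as "00 15 * * *" (illustration only, not
# Airflow's real scheduling code).
def first_trigger_time(start_date, hour, minute):
    # First cron match on or after start_date = start of the first interval.
    first_match = start_date.replace(hour=hour, minute=minute,
                                     second=0, microsecond=0)
    if first_match < start_date:
        first_match += timedelta(days=1)
    # The run is triggered one full interval later, when the period ends.
    return first_match + timedelta(days=1)

print(first_trigger_time(datetime(2019, 8, 7), 15, 0))
# 2019-08-08 15:00:00
```

For the question's start_date of 2019-08-07 and a 15:00 UTC schedule, this gives the first actual trigger at 2019-08-08 15:00, which is why the dag appeared "not to start at all" on the first day.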

Related

Run DAG at specific time each day

I've read multiple examples about schedule_interval and start_date, and the Airflow docs multiple times as well, and I still can't wrap my head around:
How do I get my DAG to execute at a specific time each day? E.g. say it's now 9:30 AM; I deploy my DAG and I want it to get executed at 10:30.
I have tried
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=days_ago(0),
    tags=["goodie"],
) as dag:
but for some reason it wasn't run today. I have tried different start_dates as well, e.g. start_date = datetime.datetime(2021, 6, 23), but it does not get executed.
If I replace days_ago(0) with days_ago(1), it is behind 1 day all the time, i.e. it does not get run today but did run yesterday.
Isn't there an easy way to say "I deploy my DAG now, and I want to get it executed with this cron-syntax" (which I assume is what most people want) instead of calculating an execution time, based on start_date, schedule_interval and figuring out, how to interpret it?
If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time
It's not behind. You are simply confusing Airflow's scheduling mechanism with cron jobs. With cron you just provide a cron expression and it schedules accordingly - this is not how it works in Airflow.
In Airflow the schedule is calculated as start_date + schedule_interval, and Airflow executes the job at the END of the interval. This is consistent with how data pipelines usually work: today you are processing yesterday's data, so at the end of this day you want to start a process that goes over yesterday's records.
As a rule - NEVER use dynamic start date.
Setting:
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=datetime(2021, 6, 23, 10, 0),  # 2021-06-23 10:00
    tags=["goodie"],
) as dag:
means that the first run will start on 2021-06-24 10:00; this run's execution_date will be 2021-06-23 10:00. The second run will start on 2021-06-25 10:00; this run's execution_date will be 2021-06-24 10:00.
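The run pairing just described is simple interval arithmetic; this illustrative snippet (not Airflow internals) reproduces the two runs above:

```python
from datetime import datetime, timedelta

# Illustrative arithmetic for the pairing described above: each run
# starts one schedule interval after its execution_date.
interval = timedelta(days=1)
start_date = datetime(2021, 6, 23, 10, 0)

pairs = [(start_date + n * interval, start_date + (n + 1) * interval)
         for n in range(2)]
for execution_date, run_starts_at in pairs:
    print("execution_date:", execution_date, "-> run starts:", run_starts_at)
# execution_date: 2021-06-23 10:00:00 -> run starts: 2021-06-24 10:00:00
# execution_date: 2021-06-24 10:00:00 -> run starts: 2021-06-25 10:00:00
```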
Since this is a source of confusion for many new users, there is an architecture change in progress, AIP-39 Richer scheduler_interval, which will decouple WHEN a dag runs from WHAT interval the run covers. It will be available in Airflow 2.3.0.
UPDATE for Airflow>=2.3.0:
AIP-39 Richer scheduler_interval has been completed and released.
It added Timetable support, so you can customize DAG scheduling with Timetables.

Airflow schedule_interval and the active dags run

# define the instance for processing the training data
dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 14),
    description='Reading training logs from the corresponding location',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
)
I have code like this, so in my opinion this dag should execute once every hour.
But in the Airflow web UI I see many dag runs in the Schedule part; the dag seems to be executing all the time.
In particular, in the Tree View I can see that all the blocks were filled within one hour!
I am confused about schedule_interval. Any ideas on how to fix this?
The FIRST DAG run starts on the date you define in start_date. From that point on, the scheduler creates new DagRuns based on your schedule_interval, and the corresponding task instances run as your dependencies are met.
You can read more about it here.
I see - the problem comes from the inconsistency between the real (current) time and start_date. If start_date is behind the real time, the system will backfill the runs for the past period.
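The backfill behaviour just mentioned is easy to sketch: with a past start_date and an hourly schedule_interval, the scheduler creates one run per fully elapsed interval (assuming catchup is enabled, which is the default in the Airflow versions discussed here). The times below are hypothetical:

```python
from datetime import datetime, timedelta

# Sketch of catchup/backfill: one run is created for every fully
# elapsed schedule interval between start_date and "now".
start_date = datetime(2019, 11, 14, 0, 0)
now = datetime(2019, 11, 14, 5, 0)  # hypothetical current time
interval = timedelta(hours=1)

backfilled = []
t = start_date
while t + interval <= now:
    backfilled.append(t)  # each entry is one run's execution_date
    t += interval
print(len(backfilled))  # 5 runs created within the first five hours
```

This is why the Tree View filled up with runs within one hour: the scheduler was catching up on every past interval since start_date. Setting catchup=False on the DAG avoids this.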

How is the execution_date of a DagRun set?

Given a DAG with a start_date, which is run at a specific date, how is the execution_date of the corresponding DagRun defined?
I have read the documentation but one example is confusing me:
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@hourly',
}
dag = DAG('tutorial', catchup=False, default_args=default_args)
Assuming that the DAG is run on 2016-01-02 at 6 AM, the first DAGRun will have an execution_date of 2016-01-01 and, as said in the documentation
the next one will be created just after midnight on the morning of
2016-01-03 with an execution date of 2016-01-02
Here is how I would have set the execution_date:
the DAG having its schedule_interval set to every hour and being run on 2016-01-02 at 6 AM, the execution_date of the first DagRun would have been set to 2016-01-02 at 7 AM, the second to 2016-01-02 at 8 AM, etc.
This is just how scheduling works in Airflow. I think it makes sense to do it the way that Airflow does when you think about how normal ETL batch processes run and how you use the execution_date to pick up delta records that have changed.
Let's say that we want to schedule a batch job to run every night to extract new records from some source database. We want all records that were changed from 1/1/2018 onwards (including records changed on the 1st). To do this you would set the start_date of the DAG to 1/1/2018; the scheduler will run a bunch of times, but when it gets to 2/1/2018 (or very shortly after) it will run our DAG with an execution_date of 1/1/2018.
Now we can send an SQL statement to the source database which uses the execution_date as part of the SQL using JINJA templating. The SQL would look something like:
SELECT row1, row2, row3
FROM table_name
WHERE timestamp_col >= '{{ execution_date }}' AND timestamp_col < '{{ next_execution_date }}'
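To make the templating concrete, here is what such a query looks like once rendered. Plain string substitution stands in for Airflow's Jinja rendering here, and the two dates are hypothetical execution_date / next_execution_date values:

```python
# Plain string substitution standing in for Airflow's Jinja templating,
# just to show the shape of the rendered query.
sql = ("SELECT row1, row2, row3 FROM table_name "
       "WHERE timestamp_col >= '{execution_date}' "
       "AND timestamp_col < '{next_execution_date}'")
rendered = sql.format(execution_date="2018-01-01",
                      next_execution_date="2018-01-02")
print(rendered)
```

So the run with execution_date 1/1/2018 selects exactly one day of delta records, the day that has just finished.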
I think when you look at it this way it makes more sense although I admit I had trouble trying to understand this at the beginning.
Here is a quote from the documentation https://airflow.apache.org/scheduler.html:
The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Also it's worth noting that the example you're looking at from the documentation is describing the behaviour of the schedule when backfilling is disabled. If backfilling was enabled there would be a DAG run created for every 1 hour interval between 1/12/2015 and the current date if the DAG had never been run before.
We get this question a lot from analysts writing airflow dags.
Each dag run covers a period of time with a start & end.
The start = execution_date
The end = when the dag run is created and executed (next_execution_date)
An example that should help:
Schedule interval: '0 0 * * *' (run daily at 00:00:00 UTC)
Start date: 2019-10-01 00:00:00
10/1 00:00                      10/2 00:00
*<----------------------------->*
<        your 1st dag run       >
^ execution_date
                                ^ next_execution_date
                                ^ when this 1st dag run is actually created by the scheduler
As @simond pointed out in a comment, "execution_date" is a poor name for this variable: it is neither a date nor does it represent when the run was executed. Alas, we're stuck with what the creators of airflow gave us... I find it helpful to just use next_execution_date if I want the datetime at which the dag run will execute my code.

How to schedule an Oracle dbms_scheduler Job timezone and DST safely

I am trying to set up a DBMS_SCHEDULER job to run exactly at 1 AM on the 1st of January every year on Oracle 11g. How do I set up its attributes to be absolutely sure it won't get executed at the wrong hour because of timezone differences or Daylight Saving Time?
I have spent plenty of time going through the Oracle documentation, but I have still not reached the level of certainty I need.
Just btw, here are the rules which I found and consider relevant to the subject:
Job attributes
start_date This attribute specifies the first date on which this job is scheduled to start. If start_date and repeat_interval are left null, then the job is scheduled to run as soon as the job is enabled.
For repeating jobs that use a calendaring expression to specify the repeat interval, start_date is used as a reference date. The first time the job will be scheduled to run is the first match of the calendaring expression that is on or after the current date.
The Scheduler cannot guarantee that a job will execute on an exact time because the system may be overloaded and thus resources unavailable.
repeat_interval This attribute specifies how often the job should repeat. You can specify the repeat interval by using calendaring or PL/SQL expressions.
The expression specified is evaluated to determine the next time the job should run. If repeat_interval is not specified, the job will run only once at the specified start date. See "Calendaring Syntax" for further information.
Rules in Calendaring syntax
The calendaring syntax does not allow you to specify a time zone.
Instead the Scheduler retrieves the time zone from the start_date
argument. If jobs must follow daylight savings adjustments you must
make sure that you specify a region name for the time zone of the
start_date. For example specifying the start_date time zone as
'US/Eastern' in New York will make sure that daylight saving
adjustments are automatically applied. If instead the time zone of
the start_date is set to an absolute offset, such as '-5:00',
daylight savings adjustments are not followed and your job execution
will be off by an hour half of the year.
When start_date is NULL, the Scheduler will determine the time zone for the repeat interval as follows:
It will check whether the session time zone is a region name. The session time zone can be set by either:
- Issuing an ALTER SESSION statement, for example: ALTER SESSION SET time_zone = 'Asia/Shanghai';
- Setting the ORA_SDTZ environment variable.
If the session time zone is an absolute offset instead of a region name, the Scheduler will use the value of the DEFAULT_TIMEZONE Scheduler attribute. For more information, see the SET_SCHEDULER_ATTRIBUTE Procedure.
If the DEFAULT_TIMEZONE attribute is NULL, the Scheduler will use the time zone of systimestamp when the job or window is enabled.
You may use this to make sure you pass a timestamp with time zone and that the start date will have a timezone name (US/Eastern) instead of an offset (ex: +5:00). This way, as the above fragments from the oracle docs mention, the Scheduler will keep track of DST.
-- Create a SCHEDULE
declare
  v_start_date timestamp with time zone;
begin
  select localtimestamp at time zone 'US/Eastern'
    into v_start_date
    from dual;
  DBMS_SCHEDULER.CREATE_SCHEDULE(
    schedule_name   => 'SAMPLE_SCHEDULE',
    start_date      => v_start_date,
    repeat_interval => 'FREQ=DAILY; BYHOUR=10; BYMINUTE=15',
    comments        => 'Runs daily at the specified hour.');
end;
/
To make sure you have set it properly you can run this:
ALTER SESSION SET nls_timestamp_tz_format = 'MM-DD-YYYY HH24:MI:SS tzr tzd';
Now, create two schedules, one as above and one using sysdate as the start_date parameter and execute the query below.
-- Check the TIMEZONE
select * from USER_SCHEDULER_SCHEDULES;
v1:
27-MAR-14 11.44.24.929282 AM **US/EASTERN**
v2:
27-MAR-14 05.44.54.000000 PM **+05:00**
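As a quick cross-check of the region-name vs. fixed-offset behaviour described above, here is a short Python sketch. It uses Python's zoneinfo rather than Oracle, but both rely on the same IANA region rules: a region name like 'US/Eastern' yields different UTC offsets in winter and summer, while a literal '-05:00' never would.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

# Same 10:15 wall-clock time in US/Eastern, winter vs. summer:
# the region name tracks the DST transition automatically.
winter = datetime(2024, 1, 15, 10, 15, tzinfo=ZoneInfo("US/Eastern"))
summer = datetime(2024, 7, 15, 10, 15, tzinfo=ZoneInfo("US/Eastern"))
print(winter.utcoffset() == timedelta(hours=-5))  # True (EST)
print(summer.utcoffset() == timedelta(hours=-4))  # True (EDT)
```

A job anchored to a fixed '-05:00' offset would run an hour off for the EDT half of the year, which is exactly the failure mode the Oracle docs warn about.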
I am unsure if this answer truly passes the rules of an answer on this site, but after spending a lot of time googling I came up with the following solution:
start_date => CAST(trunc(sysdate, 'YEAR')+2/24 AS TIMESTAMP) at time zone 'Europe/Berlin'
I believe this is closest to safest solution because:
It uses a timestamp instead of a date - I believe this forces the job to be truly executed at the given time in the given timezone, ignoring the DBMS_SCHEDULER default_timezone. I also found suggestions that passing a timestamp directly is unsafe, and that only this cast is safe.
I selected the timezone I need manually, in the hope that it would not conflict with local settings. Although it is unclear to me whether it is now truly unrelated to SESSIONTIMEZONE or DBTIMEZONE, and whether they affect the actual run time.
I used a little hack: even though the request is that the job should start after midnight, I set it to 2 AM, in the hope that even with a bad time zone and bad daylight saving the run would be moved by at most +-2 hours.
I would be happier with the solution, if I would be absolutely clear on when the job actually gets executed with the respect of local time of a server, SESSIONTIMEZONE, DBTIMEZONE, start_date Time Zone and a DBMS_SCHEDULER time zone.
I am also unhappy with the time zone specification, since it has 4 abbreviations linked with it - LMT, CET, CEST, CEMT - where CEST seems to me completely wrong. My target is to use CET with daylight saving (winter != summer).
Actually, I have never tried it with the .CREATE_SCHEDULE method, but .CREATE_JOB also has a start_date parameter, and what works for me is just plain:
start_date => TO_TIMESTAMP_TZ('00:10 Europe/Rome', 'hh24:mi tzr')
It retains 'Europe/Rome' when I query dba_scheduler_jobs:
SELECT job_name, TO_CHAR(start_date) start_date,
TO_CHAR(next_run_date) next_run_date
FROM dba_scheduler_jobs;
To add a bit more info, when you want to check whether the modification was successful:
Run the query: select * from all_scheduler_jobs where owner = 'schema_name'.
There you can see that the field start_date, which has type timestamp with time zone, contains data like: 2017-12-05 01:55:00,000000000 EUROPE/YOUR_CITY
Having time zone info at the end confirms that it is properly saved for the job.
Then, also, next_run_date is aligned with start_date and it should also show time zone details.
SELECT DBMS_SCHEDULER.STIME FROM DUAL;

SQLite Query Sorting

I'm a bit of a noob when it comes to SQL queries, and I couldn't find any help, so I figured I'd ask.
I'm currently working on an event tracker/calendar style thing with two types of events. One is a standard starts-at-X, ends-at-Y event, while the other is "all day" (i.e., starts at 12:01 AM, ends at 11:59 PM). My problem is the database query to sort them properly. What I'm trying to do is get the results such that the all-day events are at the very end of that day's list.
For example, if I have 4 events, one at 1 PM, one at 2 PM, one all day, and one tomorrow at 11 AM, it would look like:
1:00 PM Event
2:00 PM Event
All Day Event
Tomorrow 11:00 AM Event
I've got UNIX timestamps (in seconds for whatever reason) for start and end dates, and my current attempt is
SELECT * FROM table ORDER BY all_day_flag, startTime;
This won't work, because it would always put the all-day events at the end, so any tips on where to refine it would be much appreciated.
You need to extract the date and time separately from your Unix timestamp, then use the date as the first sort key, followed by the all-day flag and then the time.
Try this:
SELECT date(startTime, 'unixepoch') AS startDate,
       time(startTime, 'unixepoch') AS startHour,
       time(endTime, 'unixepoch') AS endHour,
       all_day_flag
FROM table
ORDER BY startDate, all_day_flag, startHour;
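A runnable sketch of that sort, using Python's sqlite3 with an in-memory table (the 'events' name and sample rows are hypothetical stand-ins for the asker's data, with second-based Unix timestamps):

```python
import sqlite3

# In-memory demo of sorting day-first, then all-day flag, then time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events "
             "(name TEXT, startTime INTEGER, all_day_flag INTEGER)")
DAY = 86400
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("2:00 PM event", 14 * 3600, 0),
    ("All day event", 0, 1),
    ("1:00 PM event", 13 * 3600, 0),
    ("Tomorrow 11:00 AM event", DAY + 11 * 3600, 0),
])
names = [r[0] for r in conn.execute("""
    SELECT name FROM events
    ORDER BY date(startTime, 'unixepoch'),
             all_day_flag,
             time(startTime, 'unixepoch')
""")]
print(names)
# ['1:00 PM event', '2:00 PM event', 'All day event',
#  'Tomorrow 11:00 AM event']
```

Because the date is the outermost sort key, the all-day flag only reorders events within a single day, pushing all-day events to the end of that day rather than to the end of the whole list.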
