airflow dag_run not setting end_date - airflow

None of the DAGs in our Airflow deployment are setting the end_date on successful completion.
The start_date is correctly set to the datetime at which the run was started, but the end_date remains empty.
This ad-hoc query clearly shows the issue:
select * from dag_run where dag_id = 'my_dag' and state = 'success'
order by start_date desc
Am I doing something wrong?
Thanks

It is a bug in Airflow <=1.10. There is already a fix for it in a testing branch.

Related

Get delta of updated records using sqlSensor in AirFlow based on last updated time

I'm trying to use the Airflow SqlSensor to pick up updated/inserted records in a PostgreSQL database. There is a column in this table representing the timestamp of the last update.
I want to use the SqlSensor to fetch the newly updated/inserted records, but I am stuck on which timestamp value I should put into the sensor's SQL query.
Here is my code:
with DAG(
    dag_id="dag_process_supervisor",
    start_date=datetime(2022, 12, 1),
    catchup=False,
    schedule_interval="@hourly"
) as dag:
    wait_for_table_update = SqlSensor(
        task_id='forecasting_jobs_sensor',
        conn_id='postgres',
        sql='''
            SELECT *
            FROM ForecastingJob
            WHERE last_modified_at > ????;
        ''',
        success=_success_criteria,
        pass_value=True,
        timeout=5 * 60,     # the maximum amount of time in seconds that the sensor checks the condition
        poke_interval=60,   # the time in seconds that the sensor waits before checking the condition again
        mode='reschedule'   # if the criteria is not met, the sensor releases its worker slot and is rescheduled
    )
I am not really sure what value to replace the ???? with, and how it is going to be updated at each poll.
What about looking back over the last poke_interval seconds? In your case, 60 seconds:
sql='''SELECT *
       FROM ForecastingJob
       WHERE last_modified_at > CURRENT_TIMESTAMP - interval '60 seconds';
    '''
Alternatively:
add a column (flag) with a timestamp that tracks the last update time in the destination (it could be a data warehouse or a file that keeps a record of the updates)
write a task before the SqlSensor one that retrieves the last_update_time and pushes it to XCom
use XCom to pull the last_update_time and substitute it for the "????" (see the sketch below)
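A minimal sketch of that XCom-based approach (my own illustration, assuming Airflow 2.x; the bookmark lookup inside _fetch_last_update_time is a hypothetical stand-in for however you persist the last processed timestamp):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.sql import SqlSensor


def _fetch_last_update_time(**_):
    # Hypothetical bookmark lookup: replace this with a read of the flag
    # column (or file) that records the last processed timestamp.
    # The return value is pushed to XCom automatically.
    return "2022-12-01 10:00:00"


with DAG(
    dag_id="dag_process_supervisor",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    fetch_bookmark = PythonOperator(
        task_id="fetch_last_update_time",
        python_callable=_fetch_last_update_time,
    )

    wait_for_table_update = SqlSensor(
        task_id="forecasting_jobs_sensor",
        conn_id="postgres",
        # `sql` is a templated field, so the XCom value is rendered at runtime.
        sql="""
            SELECT *
            FROM ForecastingJob
            WHERE last_modified_at >
                  '{{ ti.xcom_pull(task_ids="fetch_last_update_time") }}';
        """,
        mode="reschedule",
        poke_interval=60,
        timeout=5 * 60,
    )

    fetch_bookmark >> wait_for_table_update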

Run DAG at specific time each day

I've read multiple examples about schedule_interval and start_date, and the Airflow docs multiple times as well, and I still can't wrap my head around this:
How do I get my DAG to execute at a specific time each day? E.g. say it's now 9:30 AM, I deploy my DAG and I want it to get executed at 10:30.
I have tried
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=days_ago(0),
    tags=["goodie"]) as dag:
but for some reason that wasn't run today. I have tried different start_dates, also start_date = datetime.datetime(2021, 6, 23), but it does not get executed.
If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time, i.e. it does not get run today but did run yesterday.
Isn't there an easy way to say "I deploy my DAG now, and I want to get it executed with this cron syntax" (which I assume is what most people want), instead of calculating an execution time based on start_date and schedule_interval and figuring out how to interpret it?
If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time
It's not behind. You are simply confusing the Airflow scheduling mechanism with cron jobs. In cron jobs you just provide a cron expression and it schedules accordingly - this is not how it works in Airflow.
In Airflow the schedule is calculated as start_date + schedule_interval, and Airflow executes the job at the END of the interval. This is consistent with how data pipelines usually work: today you are processing yesterday's data, so at the end of this day you want to start a process that will go over yesterday's records.
As a rule - NEVER use a dynamic start_date.
Setting:
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=datetime(2021, 6, 23, 10, 0),  # 2021-06-23 10:00
    tags=["goodie"]) as dag:
This means that the first run will start on 2021-06-24 10:00, and this run's execution_date will be 2021-06-23 10:00. The second run will start on 2021-06-25 10:00, and its execution_date will be 2021-06-24 10:00.
Since this is a source of confusion to many new users, there is an architecture change in progress, AIP-39 Richer scheduler_interval, which will decouple WHEN to run from WHAT interval to consider with this run. It will be available in Airflow 2.3.0.
UPDATE for Airflow>=2.3.0:
AIP-39 Richer scheduler_interval has been completed and released
It added Timetable support, so you can customize DAG scheduling with Timetables (see "Customizing DAG Scheduling with Timetables" in the docs).
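For example, a minimal sketch of the daily-at-a-fixed-time setup (my own illustration, assuming Airflow 2.3+ and a 10:30 UTC run time): a static start_date in the past plus a plain cron expression; with catchup=False the scheduler creates the run covering the latest complete interval and then keeps running daily at 10:30.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="run_at_specific_time",       # hypothetical dag_id
    schedule_interval="30 10 * * *",     # every day at 10:30 UTC
    start_date=datetime(2021, 6, 23),    # static date in the past, never dynamic
    catchup=False,                       # do not backfill the missed past intervals
    tags=["goodie"],
) as dag:
    do_work = EmptyOperator(task_id="do_work")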

Airflow Task has no start date , end date and duration

I'm a newbie in Airflow.
Does anybody know why the start date and end date of a task could be blank?
Task Log Screenshot:

Airflow schedule_interval and the active dags run

# define the instance for processing the training data
dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 14),
    description='Reading training logs from the corresponding location',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
)
I have code like this, so in my opinion this DAG should execute once every hour.
But in the Airflow web UI I see many runs in the Schedule part; the DAG keeps executing all the time.
In particular, in the Tree View I can see that all the blocks were filled within one hour.
I am confused about schedule_interval. Any ideas on how to fix this?
On the FIRST DAG run, it will start on the date you define in start_date. From that point on, the scheduler creates new DagRuns based on your schedule_interval, and the corresponding task instances run as your dependencies are met.
You can read more about it here.
I see now: the problem comes from the inconsistency between the real (current) time and start_date. If the start_date is behind the current time, the scheduler will backfill the past intervals.
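A small sketch of that point (my assumption of the intent): keep the hourly interval, but disable catchup so the scheduler does not create a run for every past hour between the old start_date and now.

from datetime import datetime, timedelta

from airflow import DAG

default_args = {"owner": "airflow"}  # placeholder for the question's default_args

dag = DAG(
    "training_logs_reader",                  # hypothetical dag_id
    start_date=datetime(2019, 11, 14),
    description="Reading training logs from the corresponding location",
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
    catchup=False,                           # only the most recent interval is run
)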

How is the execution_date of a DagRun set?

Given a DAG with a start_date, which is run at a specific date, how is the execution_date of the corresponding DagRun defined?
I have read the documentation but one example is confusing me:
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2015, 12, 1),
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'schedule_interval': '#hourly',
}
dag = DAG('tutorial', catchup=False, default_args=default_args)
Assuming that the DAG is run on 2016-01-02 at 6 AM, the first DAGRun will have an execution_date of 2016-01-01 and, as said in the documentation
the next one will be created just after midnight on the morning of
2016-01-03 with an execution date of 2016-01-02
Here is how I would have set the execution_date:
the DAG having its schedule_interval set to every hour and being run on 2016-01-02 at 6 AM, the execution_date of the first DAGRun would have been set to 2016-01-02 at 7 AM, the second to 2016-01-02 at 8 AM, etc.
This is just how scheduling works in Airflow. I think it makes sense to do it the way that Airflow does when you think about how normal ETL batch processes run and how you use the execution_date to pick up delta records that have changed.
Let's say that we want to schedule a batch job to run every night to extract new records from some source database. We want all records that were changed from 1/1/2018 onwards (we want all records changed on the 1st too). To do this you would set the start_date of the DAG to 1/1/2018; the scheduler will run a bunch of times, but when it gets to 2/1/2018 (or very shortly after) it will run our DAG with an execution_date of 1/1/2018.
Now we can send an SQL statement to the source database which uses the execution_date as part of the SQL using JINJA templating. The SQL would look something like:
SELECT row1, row2, row3
FROM table_name
WHERE timestamp_col >= '{{ execution_date }}' and timestamp_col < '{{ next_execution_date }}'
I think when you look at it this way it makes more sense although I admit I had trouble trying to understand this at the beginning.
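A minimal sketch of wiring such a templated query into a task (my own illustration; the operator, connection id, and dag_id are assumptions, and SQLExecuteQueryOperator needs the common SQL provider installed):

from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="nightly_delta_extract",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_delta = SQLExecuteQueryOperator(
        task_id="extract_delta_records",
        conn_id="source_db",
        # Jinja fills in the boundaries of the interval this run covers.
        sql="""
            SELECT row1, row2, row3
            FROM table_name
            WHERE timestamp_col >= '{{ execution_date }}'
              AND timestamp_col <  '{{ next_execution_date }}'
        """,
    )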
Here is a quote from the documentation https://airflow.apache.org/scheduler.html:
The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Also it's worth noting that the example you're looking at from the documentation is describing the behaviour of the schedule when backfilling is disabled. If backfilling was enabled there would be a DAG run created for every 1 hour interval between 1/12/2015 and the current date if the DAG had never been run before.
We get this question a lot from analysts writing airflow dags.
Each dag run covers a period of time with a start & end.
The start = execution_date
The end = when the dag run is created and executed (next_execution_date)
An example that should help:
Schedule interval: '0 0 * * *' (run daily at 00:00:00 UTC)
Start date: 2019-10-01 00:00:00
10/1 00:00                        10/2 00:00
*<------------------------------->*
<        your 1st dag run         >
^ execution_date
                                  ^ next_execution_date
                                  ^ when this 1st dag run is actually created by the scheduler
As @simond pointed out in a comment, "execution_date" is a poor name for this variable. It is neither a date nor does it represent when the run was executed. Alas, we're stuck with what the creators of Airflow gave us... I find it helpful to just use next_execution_date if I want the datetime at which the dag run will execute my code.
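A small illustration (my own, assuming Airflow 2.x, where both names are still available in the task context): a task that prints the two values so you can see which end of the interval each refers to.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _show_interval(execution_date, next_execution_date, **_):
    # execution_date      -> start of the interval this run covers
    # next_execution_date -> end of the interval, when the run is actually created
    print(f"interval start: {execution_date}")
    print(f"interval end:   {next_execution_date}")


with DAG(
    dag_id="show_interval",                 # hypothetical dag_id
    start_date=datetime(2019, 10, 1),
    schedule_interval="0 0 * * *",
    catchup=False,
) as dag:
    PythonOperator(task_id="show_interval", python_callable=_show_interval)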
