Airflow task has no start date, end date, or duration

I'm a newbie in Airflow.
Does anybody know why the start date and end date of a task could be blank?
Task Log Screenshot:

Related

Run DAG at specific time each day

I've read multiple examples about schedule_interval and start_date, and the Airflow docs, multiple times as well, and I still can't wrap my head around it:
How do I get my DAG to execute at a specific time each day? E.g. say it's now 9:30 AM, I deploy my DAG, and I want it to get executed at 10:30.
I have tried
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=days_ago(0),
    tags=["goodie"]) as dag:
but for some reason that wasn't run today. I have tried different start_dates, also start_date = datetime.datetime(2021, 6, 23), but it does not get executed.
If I replace days_ago(0) with days_ago(1) it is behind one day all the time, i.e. it does not get run today but did run yesterday.
Isn't there an easy way to say "I deploy my DAG now, and I want to get it executed with this cron syntax" (which I assume is what most people want), instead of calculating an execution time based on start_date and schedule_interval and figuring out how to interpret it?
If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time
It's not behind. You are simply confusing Airflow's scheduling mechanism with cron jobs. In cron you just provide a cron expression and it schedules accordingly; this is not how it works in Airflow.
In Airflow the schedule is calculated as start_date + schedule_interval, and Airflow executes the job at the END of the interval. This is consistent with how data pipelines usually work: today you are processing yesterday's data, so at the end of the day you want to start a process that will go over yesterday's records.
As a rule: NEVER use a dynamic start_date.
Setting:
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=datetime(2021, 6, 23, 10, 0),  # 2021-06-23 10:00
    tags=["goodie"]) as dag:
Means that the first run will start on 2021-06-24 10:00; this run's execution_date will be 2021-06-23 10:00. The second run will start on 2021-06-25 10:00, and this run's execution_date will be 2021-06-24 10:00.
Since this is a source of confusion to many new users, there is an architecture change in progress, AIP-39 Richer scheduler_interval, which will decouple WHEN to run from WHAT interval to consider with this run. It will be available in Airflow 2.3.0.
UPDATE for Airflow >= 2.3.0:
AIP-39 Richer scheduler_interval has been completed and released.
It added Timetable support, so you can customize DAG scheduling with Timetables (see "Customizing DAG Scheduling with Timetables" in the docs).
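If what you really want is plain cron behaviour (fire AT 10:00 rather than at the end of a data interval), here is a minimal sketch using CronTriggerTimetable; note that this class and the schedule parameter only exist in newer 2.x releases (2.4+, if I remember correctly), so check the docs for your version:
import pendulum
from airflow import DAG
from airflow.timetables.trigger import CronTriggerTimetable

with DAG(
    dag_id="test",
    start_date=pendulum.datetime(2021, 6, 23, tz="UTC"),
    # fire at 10:00 UTC every day, cron-style, not at the end of an interval
    schedule=CronTriggerTimetable("0 10 * * *", timezone="UTC"),
    catchup=False,
    tags=["goodie"],
) as dag:
    ...
Here catchup=False just keeps the scheduler from creating runs for dates in the past after you deploy.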

Airflow schedule_interval and the active DAG runs

# define the DAG instance for processing the training data
dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 14),
    description='Reading training logs from the corresponding location',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
)
I have the code like this, so in my opinion this DAG should execute every hour.
But in the Airflow web UI, I see many DAG runs scheduled; the DAG is executing all the time.
In particular, in the Tree View I could see that all the blocks were filled within one hour!
I am confused about schedule_interval. Any ideas on how to fix that?
On the FIRST DAG run, it will start on the date you defined in start_date. From that point on, the scheduler creates new DagRuns based on your schedule_interval, and the corresponding task instances run as your dependencies are met.
You can read more about it here.
I see, the problem comes from the inconsistency between the real (current) time and start_date. If the start_date is behind the current time, the system will backfill the past intervals.
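If that backfilling is not what you want, here is a minimal sketch of the same DAG with catchup disabled (the catchup flag is my addition; dag_id and default_args are assumed to be defined as in your snippet):
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 14),
    description='Reading training logs from the corresponding location',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
    catchup=False,  # do not create runs for intervals between start_date and now
)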

How to write/read time stamp from a variable in airflow?

I'm working with: EXEC_DATE = '{{ macros.ds_add(ds, 1) }}' This gives me the execution date but not the hour.
I want to be able to save this value as YYYY-MM-DD HH:MM into a variable called process_last_run.
Basically, read the variable at the beginning of the run and write to it at the end of the DAG. This variable indicates what the running time of the last DAG run was.
How can I do that?
You can do this with the macro execution_date. However, be advised that this is a poorly named concept in Airflow: it represents the beginning of the scheduled interval period. It will not change within the same DAG run, even if the task is re-run manually. It's there to support idempotent data updates, which frankly is the optimal way to approach data pipelines. In your case, though, you said elsewhere that your data-fetching API takes a start date and provides all the data up to the current time, which isn't conducive to being processed idempotently, though you could throw away data after a specified cut-off.
So instead you might just take the date after your processing of data has completed, and store that for later. You can store it in an Airflow Variable. Note, though, that the time you get out of a date command as shown below is going to be later than the last timestamp of the data you got from your process_data API call for all data from a start date. So it might be better if your processing step outputs the actual last date and time of the data processed as the last line of stdout (which is captured by the BashOperator for XCom).
E.g.:
from airflow.models import Variable, DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def pyop_fun(**context):
    # You could have used execution_date here and in the next operator
    # to make the operator rerun-safe:
    # date_string = context['execution_date'].strftime('%Y-%m-%d %H:%M')
    # But elsewhere you said your API is always giving you up-to-the-minute data,
    # so getting the date from the prior task would work better for you.
    Variable.set(
        'process_last_run',
        context['task_instance'].xcom_pull(task_ids='process_data'))

with DAG(…) as dag:
    pyop = PythonOperator(
        task_id='set_process_last_run',
        python_callable=pyop_fun,
        provide_context=True, …)
    shop = BashOperator(
        task_id='process_data',
        bash_command='''
            process_data "{{var.value.process_last_run}}";
            date -u +%Y-%m-%d\ %H:%M''',
        xcom_push=True, …)
    shop >> pyop

# Because the last output line of a BashOperator is pushed into XCom for that
# task id with the default key, it can be pulled by the PythonOperator and
# stored in a Variable.
There's an {{ execution_date }} variable in Jinja you can use to get the execution date for the current DAG run.
More info: Airflow - Macros
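For example, since execution_date in the template context is a full datetime, something like this sketch should give you the hour and minute as well:
EXEC_DATE = '{{ execution_date.strftime("%Y-%m-%d %H:%M") }}'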
If you're looking to track something like start time or end time of execution or duration (in seconds) of a particular task instance, that info is stored in the TaskInstance model.
class TaskInstance(Base, LoggingMixin):
    ...
    start_date = Column(UtcDateTime)
    end_date = Column(UtcDateTime)
    duration = Column(Float)
https://github.com/apache/incubator-airflow/blob/4c30d402c4cd57dc56a5d9dcfe642eadc77ec3ba/airflow/models.py#L877-L879
Also, if you want to compute the running time of an entire DAG run, you can get that by querying the Airflow metadata database around these fields for a particular DAG run.
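A rough sketch of such a query through Airflow's own ORM models (the provide_session helper lives in airflow.utils.db in 1.x and airflow.utils.session in 2.x, so adjust the import for your version; the function name is mine):
from airflow.models import DagRun
from airflow.utils.db import provide_session  # airflow.utils.session in Airflow 2.x

@provide_session
def dag_run_durations(dag_id, session=None):
    # Wall-clock runtime of each finished run of the given DAG.
    runs = session.query(DagRun).filter(DagRun.dag_id == dag_id).all()
    return {run.run_id: run.end_date - run.start_date
            for run in runs if run.end_date is not None}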
If you're doing this in your Python code already, you can also access the execution_date field on the task instance itself, instead of going through the template layer.
Variables
You can write to and read from Airflow Variables like so:
Variable.set('my_key', 'some value')
my_val = Variable.get('my_key')
You can also perform CRUD operations on variables with the CLI.
Stats
Another thing to keep in mind, if you find yourself working with stats like task duration a lot, is Airflow's StatsD integration, which gathers metrics on Airflow itself at execution time. You can have these metrics go into a push-based system like StatsD itself, or into a pull-based system like Prometheus/Grafana by using statsd_exporter.

How to schedule an Oracle dbms_scheduler Job timezone and DST safely

I am trying to set up a DBMS_SCHEDULER job to run at exactly 1 AM on the 1st of January every year on Oracle 11g. How do I set up its attributes to be absolutely sure it won't get executed at the wrong hour because of time zone differences or Daylight Saving Time?
I have spent plenty of time going through the Oracle documentation, but I have still not reached the level of certainty I need.
Just btw, here are the rules which I found and consider relevant to the subject:
Job attributes
start_date This attribute specifies the first date on which this job is scheduled to start. If start_date and repeat_interval are left null, then the job is scheduled to run as soon as the job is enabled.
For repeating jobs that use a calendaring expression to specify the repeat interval, start_date is used as a reference date. The first time the job will be scheduled to run is the first match of the calendaring expression that is on or after the current date.
The Scheduler cannot guarantee that a job will execute on an exact time because the system may be overloaded and thus resources unavailable.
repeat_interval This attribute specifies how often the job should repeat. You can specify the repeat interval by using calendaring or PL/SQL expressions.
The expression specified is evaluated to determine the next time the job should run. If repeat_interval is not specified, the job will run only once at the specified start date. See "Calendaring Syntax" for further information.
Rules in Calendaring syntax
The calendaring syntax does not allow you to specify a time zone. Instead the Scheduler retrieves the time zone from the start_date argument. If jobs must follow daylight savings adjustments you must make sure that you specify a region name for the time zone of the start_date. For example, specifying the start_date time zone as 'US/Eastern' in New York will make sure that daylight saving adjustments are automatically applied. If instead the time zone of the start_date is set to an absolute offset, such as '-5:00', daylight savings adjustments are not followed and your job execution will be off by an hour half of the year.
When start_date is NULL, the Scheduler will determine the time zone for the repeat interval as follows:
It will check whether the session time zone is a region name. The session time zone can be set by either issuing an ALTER SESSION statement (for example, ALTER SESSION SET time_zone = 'Asia/Shanghai';) or by setting the ORA_SDTZ environment variable.
If the session time zone is an absolute offset instead of a region name, the Scheduler will use the value of the DEFAULT_TIMEZONE Scheduler attribute. For more information, see the SET_SCHEDULER_ATTRIBUTE Procedure.
If the DEFAULT_TIMEZONE attribute is NULL, the Scheduler will use the time zone of systimestamp when the job or window is enabled.
You may use the following to make sure you pass a timestamp with time zone and that the start date has a time zone region name (e.g. US/Eastern) instead of an offset (e.g. +5:00). This way, as the fragments from the Oracle docs above mention, the Scheduler will keep track of DST.
-- Create a SCHEDULE
declare
  v_start_date timestamp with time zone;
BEGIN
  select localtimestamp at time zone 'US/Eastern' into v_start_date from dual; -- US/Eastern
  DBMS_SCHEDULER.CREATE_SCHEDULE(
    schedule_name   => 'SAMPLE_SCHEDULE',
    start_date      => v_start_date,
    repeat_interval => 'FREQ=DAILY; BYHOUR=10; BYMINUTE=15',
    comments        => 'Runs daily at the specified hour.');
END;
To make sure you have set it properly you can run this:
ALTER SESSION SET nls_timestamp_tz_format = 'MM-DD-YYYY HH24:MI:SS tzr tzd';
Now, create two schedules, one as above and one using sysdate as the start_date parameter and execute the query below.
-- Check the TIMEZONE
select * from USER_SCHEDULER_SCHEDULES;
v1:
27-MAR-14 11.44.24.929282 AM **US/EASTERN**
v2:
27-MAR-14 05.44.54.000000 PM **+05:00**
I am unsure if this answer truly passes the rules of an answer on this site, but after spending a lot of time googling I came up with the following solution:
start_date => CAST(trunc(sysdate, 'YEAR')+2/24 AS TIMESTAMP) at time zone 'Europe/Berlin'
I believe this is the closest to a safe solution because:
It uses a timestamp instead of a date. I believe this forces the job to be truly executed at the given time in the given time zone, while ignoring the DBMS_SCHEDULER default_timezone. I also found some suggestions saying that it is unsafe to use a timestamp directly, and that only this cast is safe.
I manually selected the time zone I need, in the hope that it will not conflict with local settings. Although it is unclear to me whether it is now truly unrelated to SESSIONTIMEZONE or DBTIMEZONE, and whether that affects the actual run time.
I have used a little hack: even though the request is that the job should start just after midnight, I have set it to 2 AM, in the hope that even with a bad time zone and bad daylight saving handling it would be shifted by at most +-2 hours.
I would be happier with the solution if I were absolutely clear on when the job actually gets executed with respect to the local time of the server, SESSIONTIMEZONE, DBTIMEZONE, the start_date time zone, and the DBMS_SCHEDULER time zone.
I am also unhappy with the time zone specification, since it has four abbreviations linked with it (LMT, CET, CEST, CEMT), where CEST seems to me to be completely wrong. My target is to use CET with daylight saving (winter != summer).
Actually I have never tried it with the CREATE_SCHEDULE method, but CREATE_JOB also has a start_date parameter, and what works for me is just plain
start_date => TO_TIMESTAMP_TZ('00:10 Europe/Rome','hh24:mi tzr')
It retains 'Europe/Rome' when I query dba_scheduler_jobs:
SELECT job_name, TO_CHAR(start_date) start_date,
TO_CHAR(next_run_date) next_run_date
FROM dba_scheduler_jobs;
To add a bit more info to this, for when you want to check whether the modification was successful:
Run the query: select * from all_scheduler_jobs where owner='schema_name'.
There you can see that the field start_date, which has type timestamp with time zone, contains data like: 2017-12-05 01:55:00,000000000 EUROPE/YOUR_CITY
Having the time zone info at the end confirms that it is properly saved for the job.
Then next_run_date is aligned with start_date and should also show time zone details.
SELECT DBMS_SCHEDULER.STIME FROM DUAL;
Reference

SQLite Query Sorting

I'm a bit of a nub when it comes to SQL queries and I couldn't find any help, so I figured I'd ask.
I'm currently working on an event tracker/calendar style thing, with two types of events. One is a standard event that starts at X and ends at Y, while the other is "all day" (i.e., starts at 12:01 AM, ends at 11:59 PM). My problem is the database query to sort them properly. What I'm trying to do is get the results such that the all-day events are at the very end of that day's list.
For example, if I have 4 events, one at 1 PM, one at 2 PM, one all day, and one tomorrow at 11 AM, it would look like:
1:00 PM Event
2:00 PM Event
All Day Event
Tomorrow 11:00 AM Event
I've got UNIX timestamps (in seconds, for whatever reason) for the start and end dates, and my current attempt is
SELECT * FROM table ORDER BY all_day_flag, startTime;
This won't work, because it always puts the all-day events at the end, so any tips on how to refine it would be much appreciated.
You need to extract the date and time separately from your Unix timestamp, and then use the date as the first sort key, followed by the all-day flag and then the time.
Try this:
SELECT date(startTime, 'unixepoch') AS startDate,
       time(startTime, 'unixepoch') AS startHour,
       time(endTime, 'unixepoch') AS endHour,
       all_day_flag
FROM table
ORDER BY startDate, all_day_flag, startHour;
