I have set the timezone to "Canada/Central", however the DAG info page still shows it as UTC.
Code
from datetime import datetime

import pendulum
from airflow.models import Variable

# local_tz is presumably created with pendulum, e.g.:
local_tz = pendulum.timezone("Canada/Central")

support_email = Variable.get("prod_support_email")
args = {
    'owner': 'Gaurang Shah',
    'email': [support_email],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 0,
    'start_date': datetime(2020, 9, 17, tzinfo=local_tz),
}
Info Page
I think the key to answering this question is to clarify what is actually being asked:
How do you know whether a DAG is timezone-aware?
Following the official docs, you can set up a timezone-aware DAG and check it by printing the DAG's timezone.
That is the programmatic way; you can also check it in the DAG's tree view.
There you will see the local timezone and the DAG timezone; if your DAG is set up correctly, you will see three timezones (UTC, local, and DAG).
Does the task info page showing UTC mean the DAG is a UTC-timezone DAG?
My answer is no. No matter how your DAG changes, the default timezone shown on the info page will be UTC. It is not the DAG's timezone info.
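For reference, here is a minimal sketch of a timezone-aware DAG following the official docs (the DAG name is illustrative):
from datetime import datetime

import pendulum
from airflow import DAG

local_tz = pendulum.timezone("Canada/Central")

dag = DAG(
    'tz_aware_example',  # illustrative name
    start_date=datetime(2020, 9, 17, tzinfo=local_tz),
    schedule_interval='@daily',
)

# Programmatic check: this prints the DAG's timezone, even though the
# info page will still display UTC as the default timezone.
print(dag.timezone)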
Related
Do you know how to implement an SLA at the DAG level? I mean an SLA within which the whole DAG run needs to be marked as success. Adding the sla parameter to default_args does not help here, because despite being a global parameter, it is evaluated at the task level.
from datetime import timedelta

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': 'noreply@astronomer.io',
    'email_on_retry': False,
    'sla': timedelta(seconds=30)
}
Let's take an example where we have a DAG consisting of 3 tasks and an SLA set to 2 hours. Suppose each task takes 1 hour: the SLA is met even though the whole DAG takes 3 hours to complete.
Do you know of any solution for this? As far as I know, there is no built-in setting for it.
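One workaround (not from this thread; the DAG and task names below are hypothetical) exploits the fact that a task's sla is measured relative to the start of the DAG run, not to when the task itself starts. Putting the SLA on a final sentinel task therefore bounds the whole run:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # airflow.operators.dummy_operator in Airflow 1.x

with DAG('dag_level_sla_sketch', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    t1 = DummyOperator(task_id='t1')
    t2 = DummyOperator(task_id='t2')
    t3 = DummyOperator(task_id='t3')

    # sla is measured from the DAG run's scheduled start, not from when this
    # task starts, so an SLA on the final task bounds the whole run.
    sla_check = DummyOperator(task_id='sla_check', sla=timedelta(hours=2))

    t1 >> t2 >> t3 >> sla_check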
A simple question for airflow DAG development
args = {
    'owner': 'Airflow',
    'start_date': dates.days_ago(1),
    'email': ['email1@gmail.com', 'email2@gmail.com'],
    'email_on_failure': True,
    'email_on_success': True,
    'schedule_interval': '0 * * * *',  # note: schedule_interval is a DAG argument, not a task-level default
}
The above configuration states that the DAG should run every hour, at the top of the hour.
How do I make the job skip an hour if the previous run is still in progress?
Thanks!
As mentioned in the comments, you can achieve what you want by setting max_active_runs=1. However, this will depend on the wider context of the expected behaviour.
If you need a more complex schedule, consider implementing your own Timetable.
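A minimal sketch of that suggestion (the DAG name is illustrative). Note that with max_active_runs=1 a new run is queued rather than skipped, so combine it with catchup=False if you do not want a backlog of queued runs:
from airflow import DAG
from airflow.utils.dates import days_ago

dag = DAG(
    'hourly_no_overlap',            # illustrative name
    start_date=days_ago(1),
    schedule_interval='0 * * * *',  # top of every hour
    max_active_runs=1,              # never run two DAG runs concurrently
    catchup=False,
)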
When I schedule DAGs to run at a specific time every day, the DAG execution does not take place at all.
However, when I restart the Airflow webserver and scheduler, the DAGs execute once at the scheduled time for that particular day and then do not execute from the next day onwards.
I am using Airflow version v1.7.1.3 with Python 2.7.6.
Here goes the DAG code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
import time

# start_date is derived from "now" at parse time; the answers below explain
# why a dynamic start_date like this prevents the DAG from being scheduled.
n = time.strftime("%Y,%m,%d")
v = datetime.strptime(n, "%Y,%m,%d")

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': v,
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=10),
}

dag = DAG('dag_user_answer_attempts', default_args=default_args, schedule_interval='03 02 * * *')

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='user_answer_attempts',
    bash_command='python /home/ubuntu/bigcrons/appengine-flask-skeleton-master/useranswerattemptsgen.py',
    dag=dag)
Am I doing something wrong?
Your issue is the start_date being set to the current time. Airflow runs jobs at the end of an interval, not the beginning. This means that the first run of your job is going to be after the first interval.
Example:
You make a dag and put it live in Airflow at midnight. Today (20XX-01-01 00:00:00) is also the start_date, but it is hard-coded ("start_date":datetime(20XX,1,1)). The schedule interval is daily, like yours (3 2 * * *).
The first time this dag will be queued for execution is 20XX-01-02 02:03:00, because that is when the interval period ends. If you look at your dag being run at that time, it should have a started datetime of roughly one day after its scheduled execution date.
You can solve this by hard-coding your start_date to a fixed date, or by making sure that the dynamic date is further in the past than the interval between executions (in your case, 2 days would be plenty). Airflow recommends static start_dates in case you need to re-run jobs or backfill (or end a DAG).
For more information on backfilling (the opposite side of this common stackoverflow question), check the docs or this question:
Airflow not scheduling Correctly Python
Check the following:
1. start_date is a fixed time in the past (don't use datetime.now()).
2. If you don't want to run the historical data, use catchup=False.
3. To set a specific time for the DAG to run (e.g. hourly, daily, monthly, at a specific time), use https://crontab.guru/#40_21_*_*_* to write the cron expression you need.
If you think steps 1, 2, and 3 are all correct but the DAG is still not running, or the DAG can run every xx minutes but fails to trigger even once on a daily interval, try creating a new Python file, copying your DAG code there, and renaming it so that the file name is unique, then test again. It could be that the Airflow scheduler got confused by an inconsistency between previous DAG runs' metadata and the current schedule.
Hope this helped!
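A minimal sketch putting those three checks together (the DAG name is illustrative; the cron expression is the 21:40 one from the crontab.guru link above):
from datetime import datetime

from airflow import DAG

dag = DAG(
    'daily_2140_example',              # illustrative name
    start_date=datetime(2016, 11, 1),  # 1. fixed time in the past, not datetime.now()
    catchup=False,                     # 2. skip historical runs
    schedule_interval='40 21 * * *',   # 3. every day at 21:40
)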
From the schedule, your DAG should run every day at 02:03 AM. My suspicion is that the start_date might be impacting it. Can you hardcode it to something like 'start_date': datetime.datetime(2016, 11, 1) and try?
Great answer, apathyman.
It helped me a lot to understand. I was using days_ago(0), and once I changed it to days_ago(1), the scheduler started triggering.
This is my code:
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2018, 9, 9),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('hello', catchup=False, default_args=default_args, schedule_interval=timedelta(minutes=1))
And the task instances list is like this:
You can see that I started at 08:36:24, and I expected it to execute the task at 08:35:20 since I set the schedule_interval to 1 minute. But why did it execute the task at 08:34:20?
The rightmost column shows when the corresponding task was actually executed, but it does not tell us when you actually enabled the DAG. I suspect that you enabled the DAG at 08:35; the scheduler picked it up and scheduled the first DAG run for 08:34. By the time the scheduler finished all the setup work and executed the first DAG run, it was already 08:36.
The one-minute interval was simply too short (or you were too slow ;)). Try a 10-minute interval and enable the DAG when it is, for example, 8:33 (not at a scheduling-interval boundary such as 8:30 or 8:40), and you will see that everything works as you expect.
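A simplified sketch of that experiment, based on the question's DAG with only the interval widened:
from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    'hello',
    catchup=False,
    start_date=datetime(2018, 9, 9),
    schedule_interval=timedelta(minutes=10),  # widened from 1 minute
)

# Enable the DAG at, say, 8:33. The first run is created for the most recently
# completed interval, so its execution_date will be roughly one interval behind
# the wall clock; with 10-minute spacing the lag is easy to see and reason about.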
I use Airflow 1.8.0 and I have a DAG like this one:
import datetime

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['technical@me.com'],
    'start_date': datetime.datetime(2018, 5, 21),
    'email_on_retry': False,
    'retries': 0
}

dag = DAG('my_dag',
          schedule_interval='40 20 * * *',
          catchup=True,
          default_args=default_args)
Every day the DAG is correctly scheduled, but a day late.
Given that today's date is 2018-07-02, the web interface shows (screenshot) instead of 2018-07-01.
But if I do a manual trigger, the current date is correctly passed (screenshot).
Is there a way to force the scheduler to run with the current date?
This is correct and is part of the design of Airflow. If you look here you'll see the explanation:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Remember that schedule_interval is in CRON format: (minute, hour, day-of-month, month, day-of-week). A reversed expression such as schedule_interval='20 40 * * *' would be invalid, since the scheduler cannot run at a 40th hour. Your schedule_interval='40 20 * * *' runs at the 40th minute of the 20th hour, i.e. at 20:40 every day.
Also, set catchup=False if you only want it to run for the most recent day. With these fixes, it should work. Refer to a site like crontab.guru for more CRON help.
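A sketch of the question's DAG with that advice applied (the start_date is the one from the question; the default_args are trimmed for brevity):
import datetime

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime(2018, 5, 21),
}

dag = DAG('my_dag',
          schedule_interval='40 20 * * *',  # minute=40, hour=20 -> runs at 20:40 daily
          catchup=False,                    # only schedule the most recent interval
          default_args=default_args)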