Is there a way to implement an SLA at the DAG level? I mean an SLA within which the whole DAG run needs to be marked as success. Adding the sla parameter to default_args does not help here: although default_args is global, the SLA is still evaluated at the task level.
from datetime import timedelta

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': 'noreply@astronomer.io',
    'email_on_retry': False,
    'sla': timedelta(seconds=30),
}
Take an example where the DAG consists of 3 tasks and the SLA is set to 2 hours. Suppose each task takes 1 hour: the SLA is met for every task, even though the whole DAG takes 3 hours to run.
Is there any solution for this? As far as I know, there is no built-in setting for it.
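One point worth checking before building anything custom: as far as I understand the Airflow 2.x documentation, a task's sla is measured from the scheduled start of the DAG run (the point where the schedule period closes), not from the moment the task itself starts. If that holds for your version, an SLA on the last task already behaves like a DAG-level deadline. Below is a plain-Python sketch of that arithmetic for the three-task example, with no Airflow imports; the function name sla_misses and the task names are made up for illustration.

```python
from datetime import timedelta


def sla_misses(durations, sla):
    """Return names of tasks that finish later than `sla` after the run start.

    Assumes tasks run one after another and that the SLA clock starts when
    the DAG run starts (an assumption about Airflow's behaviour, not a fact
    taken from the question).
    """
    elapsed = timedelta(0)
    missed = []
    for name, duration in durations:
        elapsed += duration  # finish time of this task, relative to run start
        if elapsed > sla:
            missed.append(name)
    return missed


# Three sequential one-hour tasks, SLA of two hours.
tasks = [
    ("task_1", timedelta(hours=1)),
    ("task_2", timedelta(hours=1)),
    ("task_3", timedelta(hours=1)),
]
print(sla_misses(tasks, timedelta(hours=2)))  # ['task_3']
```

Under that assumption only the last task misses the SLA, which is the DAG-level signal the question asks for.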
I have the TaskGroup below, which creates 2 dynamic tasks per dictionary entry: one loads to S3, and after that the Snowflake task begins. I want the first 2 tasks to run, regardless of whether they fail or not, before the next 2 tasks in this TaskGroup start:
default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False,
    'task_concurrency': 1}
with TaskGroup(group_id='ID') as ID:
    [
        (
            GooglexOperator(
                dag=dag,
                task_id=f'{sheet_name}_s3',
                trigger_rule="all_done",
                url=True,
            )
            >> SnowflakeLoadOperator(
                dag=dag,
                task_id=f'{sheet_name}_snowflake',
                table=CoreTable(),
            )
        )
        for sheet_name, config in id_sheets.items()
    ]
I tried to add chain(*[GooglexOperator( .. etc)]) but still all the s3 tasks in this TaskGroup ran at the same time.
With task_concurrency set the way it is, are you getting the behavior you want? If not, you can also use Airflow pools to throttle the number of tasks that run at a time.
Also, Airflow might not be reading your dependencies the way you intend. Try setting them explicitly between the two groups, like tg1 >> tg2 (a plain list on both sides, [tg1] >> [tg2], is not supported by Python's >> operator).
A simple question for airflow DAG development
args = {
    'owner': 'Airflow',
    'start_date': dates.days_ago(1),
    'email': ['email1@gmail.com', 'email2@gmail.com'],
    'email_on_failure': True,
    'email_on_success': True,
    'schedule_interval': '0 * * * *',
}
The above configuration states that the DAG should run every hour on the top of the hour.
How do I make the job skip one hour if the previous run is still in progress?
Thanks!
As mentioned in the comments, you can achieve what you want by setting max_active_runs=1 on the DAG. Whether that is enough depends on the wider context of the behaviour you expect.
If you need a more complex schedule, consider implementing your own Timetable.
I have set the timezone to "Canada/Central"; however, the DAG info page shows it as UTC only.
Code
support_email = Variable.get("prod_support_email")
args = {
    'owner': 'Gaurang Shah',
    'email': [support_email],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 0,
    'start_date': datetime(2020, 9, 17, tzinfo=local_tz),
}
Info Page
I think the key to answering this is to clarify the question.
How do you know whether a DAG is a time zone aware DAG?
Following the official docs, you can set up a time zone aware DAG and check it by printing the DAG's timezone.
That is the programmatic way. You can also check it in the DAG's tree view.
It shows the local timezone and the DAG timezone; if your DAG is set up correctly, you will see 3 timezones (UTC, local and DAG).
If the task info page shows UTC, does that mean the DAG is a UTC DAG?
My answer is no. No matter how you change your DAG, the info page displays UTC by default. It is not the DAG's timezone info.
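To make the point concrete: a timezone-aware start_date and its UTC rendering are the same instant, so a UTC display does not mean the DAG lost its timezone. A small standard-library sketch, assuming a fixed UTC-5 offset as a stand-in for "Canada/Central" in September (daylight saving time). In a real DAG file local_tz would more likely come from something like pendulum.timezone("Canada/Central"), which is how the Airflow docs build time zone aware DAGs.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for Canada/Central during daylight saving time (UTC-5).
# This fixed offset is an assumption made to keep the sketch stdlib-only.
local_tz = timezone(timedelta(hours=-5), name="CDT")

start_date = datetime(2020, 9, 17, tzinfo=local_tz)
as_utc = start_date.astimezone(timezone.utc)

print(start_date.isoformat())  # 2020-09-17T00:00:00-05:00
print(as_utc.isoformat())      # 2020-09-17T05:00:00+00:00
print(start_date == as_utc)    # True: same instant, different rendering
```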
I want one DAG to start after another DAG completes. One solution is to use the ExternalTaskSensor; you can find my attempt below. The problem I encounter is that the dependent DAG is stuck at poking. I checked this answer and made sure that both DAGs run on the same schedule. My simplified code is as follows:
Any help would be appreciated.
leader dag:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

schedule = '* * * * *'

dag = DAG('leader_dag', default_args=default_args, catchup=False,
          schedule_interval=schedule)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
the dependent dag:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.operators.sensors import ExternalTaskSensor

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 10, 8),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

schedule = '* * * * *'

dag = DAG('dependent_dag', default_args=default_args, catchup=False,
          schedule_interval=schedule)

wait_for_task = ExternalTaskSensor(
    task_id='wait_for_task',
    external_dag_id='leader_dag',
    external_task_id='t1',
    dag=dag)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t1.set_upstream(wait_for_task)
the log for leader_dag:
the log for dependent dag:
First, the task_id in leader_dag is print_date, but your dependent_dag has a task wait_for_task that waits on a leader_dag task named t1. There is no task named t1. The Python variable you assigned the operator to in the .py file is not relevant: it is not stored in the Airflow db and therefore not used by the sensor. The sensor should be waiting on the task named print_date.
Second, your logs do not line up: the leader_dag run you show is not the one the dependent_dag is waiting for.
Finally, I can't recommend using Airflow to schedule tasks every minute, and certainly not two dependent tasks together.
Consider writing streaming jobs in a different system like Spark, or rolling your own Celery or Dask environment for this.
You could also avoid the ExternalTaskSensor by adding a TriggerDagRunOperator to the end of your leader_dag to trigger the dependent_dag, and removing the schedule from the dependent_dag by setting its schedule_interval to None.
What I see in your logs is a log for the leader from 2018-10-13T19:08:11. At best this would be the dagrun for execution_date 2018-10-13 19:07:00, because the minute period starting at 19:07 ends at 19:08, which is the earliest it can be scheduled. If so, there is a delay of about 11 seconds between scheduling and execution. However, there can be multiple minutes of scheduling lag in Airflow.
I also see a log from the dependent_dag which runs from 19:14:04 to 19:14:34 and is looking for the completion of the corresponding 19:13:00 dagrun. There's no indication that your scheduler is lag-free enough to have started the 19:13:00 dagrun of leader_dag by 19:14:34. You would have convinced me better if you had shown it poking for 5 minutes or so. Of course, it is never going to sense leader_dag.t1, because that isn't what you named the task shown.
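The timing argument above can be checked with a few lines of plain Python. This is only a sketch of the rule that a run for execution_date D cannot start before D plus one schedule interval; the helper names are hypothetical, and the timestamps are the ones quoted from the logs.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=1)  # schedule '* * * * *'


def earliest_start(execution_date):
    """A run covers [execution_date, execution_date + INTERVAL) and can
    only be scheduled once that period has ended."""
    return execution_date + INTERVAL


def latest_possible_execution_date(log_time):
    """Newest execution_date (per-minute schedule) whose run could already
    be executing at log_time."""
    minute_start = log_time.replace(second=0, microsecond=0)
    return minute_start - INTERVAL


leader_log = datetime(2018, 10, 13, 19, 8, 11)
print(latest_possible_execution_date(leader_log))  # 2018-10-13 19:07:00

# The dependent run poking from 19:14:04 to 19:14:34 belongs to 19:13:00
# and waits for the leader run with the same execution_date.
target = datetime(2018, 10, 13, 19, 13)
poke_end = datetime(2018, 10, 13, 19, 14, 34)
slack = poke_end - earliest_start(target)
print(slack)  # 0:00:34
```

So the leader run for 19:13:00 had at most 34 seconds to be scheduled and to finish before the logged poking ended, which is little margin given the scheduling lag described above.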
So, Airflow has scheduling delay. If you had a few thousand DAGs in the system, the delay might be higher than 1 minute. With catchup=False you can then get some runs following each other (e.g. 19:08, 19:09) and some runs that skip a minute or six (e.g. 19:10 followed by 19:16). Since the delay is somewhat random on a DAG-by-DAG basis, you might get unaligned runs with the sensor waiting forever, EVEN if you have the correct task id to wait for:
 wait_for_task = ExternalTaskSensor(
     task_id='wait_for_task',
     external_dag_id='leader_dag',
-    external_task_id='t1',
+    external_task_id='print_date',
     dag=dag)
While using ExternalTaskSensor you have to give both DAGs the same starting date. If that does not work for your use case then you need to use execution_delta or execution_date_fn in your ExternalTaskSensor.
A simple solution is to do the following:
1- Make sure all the arguments you pass to the ExternalTaskSensor are right, i.e. they exactly match the task_id and dag_id of the DAG you want your master_dag to keep an eye on.
2- Make both master_dag and slave_dag (the DAG you want to wait for) have the same start_date, otherwise it won't work. If your slave starts at 22:00 and your master starts at 22:30, you have a 30-minute difference that you should specify with execution_delta.
If your error isn't solved by the steps above, the problem is either something basic or the DAG is programmed in a fundamentally wrong way.
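For the 30-minute example, the arithmetic behind execution_delta looks like this. As I understand the ExternalTaskSensor documentation, the sensor looks for the external run whose execution date equals its own execution date minus execution_delta. The sketch below only reproduces that subtraction in plain Python; the function name and the calendar date are made up.

```python
from datetime import datetime, timedelta


def external_execution_date(own_execution_date, execution_delta):
    """Execution date the sensor would look for in the other DAG,
    assuming target = own execution date - execution_delta."""
    return own_execution_date - execution_delta


# The waiting DAG runs at 22:30, the DAG it waits for runs at 22:00.
waiting_run = datetime(2018, 10, 13, 22, 30)  # arbitrary date for illustration
delta = timedelta(minutes=30)

print(external_execution_date(waiting_run, delta))  # 2018-10-13 22:00:00
```

In the sensor itself this would be passed as execution_delta=timedelta(minutes=30). If the offset is not constant, execution_date_fn takes a callable that receives the current execution date and returns the one to look for.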
This is my code:
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2018, 9, 9),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('hello', catchup=False, default_args=default_args, schedule_interval=timedelta(minutes=1))
And the task instances list is like this:
You can see that I started it at 08:36:24, and I know it will execute the task at 08:35:20 since I set schedule_interval to 1 minute. But why did it execute the task at 08:34:20?
The rightmost column shows when the corresponding task was actually executed, but it does not tell us when you actually enabled the DAG. I suspect that you enabled the DAG at 08:35; the scheduler picked up the DAG and scheduled the first DAG run for 08:34. By the time the scheduler had finished all the setup work and executed the first DAG run, it was already 08:36.
The one-minute interval was simply too short (or you were too slow ;)). Try a 10-minute interval and enable the DAG when it is, for example, 08:33 (so not at a scheduling interval boundary, such as 08:30 or 08:40) and you will see that everything works as you expect.
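A rough way to see which run gets picked first when a DAG is enabled with catchup=False: it is the most recent interval that has already fully elapsed. The sketch below assumes intervals aligned to clock boundaries (like a cron schedule), which is a simplification compared to the timedelta schedule in the question; the helper name and the enable times are hypothetical.

```python
from datetime import datetime, timedelta


def first_execution_date(enabled_at, interval):
    """Start of the newest interval that has completely elapsed at enabled_at.

    Assumes intervals are aligned to midnight of the same day.
    """
    midnight = enabled_at.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed_intervals = (enabled_at - midnight) // interval
    current_interval_start = midnight + elapsed_intervals * interval
    return current_interval_start - interval


# Enabled at 08:35:40 with a 1-minute interval: the 08:34 run is picked up.
print(first_execution_date(datetime(2018, 9, 9, 8, 35, 40), timedelta(minutes=1)))

# Enabled at 08:33 with a 10-minute interval: the 08:20 run is picked up.
print(first_execution_date(datetime(2018, 9, 9, 8, 33), timedelta(minutes=10)))
```

With a longer interval there is much more time between the enable moment and the next boundary, so setup lag no longer pushes the first execution past an interval edge.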